Saturday, 27 February 2016

Subtleties and magic with CUDA callbacks

Callbacks, such as cuFFT callbacks, are a relatively new feature of CUDA. They were made possible by the introduction of separate compilation (-rdc=true). Before that, nvcc would compile and link a CUDA unit (CUmodule) produced from just a single compilation unit, so the compiler could know everything about the way a kernel is used. With whole-program compilation, most probably all your function calls are inlined, and even if your functions take generic pointers to memory (you can't even specify __shared__/__device__/__constant__ on a pointer argument), their address space can probably be inferred, and specialized opcodes for that specific address space can be emitted.

With separate compilation, things get much harder. It is entirely possible that while compiling a kernel nvcc encounters a call to a function of which it knows nothing but the signature, so it actually has to emit a real call sequence. This has a number of consequences.
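A minimal sketch of the situation (file and function names are mine, not from any real code base): the kernel sees only the signature of a __device__ function defined in another compilation unit, so with -rdc=true the compiler has no choice but to emit an actual call:

```cuda
// callee.cu -- compiled separately, e.g.: nvcc -rdc=true -c callee.cu
__device__ float scale(float x) { return 2.0f * x; }

// kernel.cu -- compiled separately, e.g.: nvcc -rdc=true -c kernel.cu
// Only the signature is visible here, so the compiler cannot inline
// the call: it must emit a real call sequence following the ABI.
extern __device__ float scale(float x);

__global__ void apply(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = scale(data[i]);  // real call, not inlined
}
```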

  1. An ABI for argument passing and register usage must be defined: such an ABI must guarantee that the function does not use more registers than the smallest number available in any kernel that might call it. Of course, this number tends to be low (~20, it appears), so you will very likely have unavoidable spilling. I reported this to NVIDIA with a bug report; they acknowledged the issue but said that they have no plans for a better solution.
  2. If a function takes a pointer, it is no longer possible to infer its address space statically, leading to the emission of less specialized memory access instructions. This is worsened by the fact that there is no way to make the compiler believe that a pointer belongs to some address space: you can't cast to (__shared__ mytype*) even if you really know you are handling __shared__ memory, because the __shared__ attribute will be ignored (and nvcc warns about it). However, see below for a silly workaround.
  3. Freestanding functions can declare static __shared__ storage, but then this extra __shared__ memory must be allocated even if the function is not used. If you have two callbacks with two different static __shared__ allocations and a kernel that calls only one of them, both the __shared__ buffers will be statically allocated. I have found however that nvcc is smart enough to avoid this if a kernel does not use callbacks at all.
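Point 3 can be sketched as follows (the names are mine): two callbacks each declare their own static __shared__ buffer, and a kernel calls only one of them, yet under separate compilation both buffers end up statically allocated for that kernel:

```cuda
// Two freestanding __device__ functions, each with its own
// static __shared__ allocation.
__device__ float callback_a(int i)
{
    __shared__ float buf_a[256];
    buf_a[threadIdx.x] = (float)i;
    __syncthreads();
    return buf_a[threadIdx.x];
}

__device__ float callback_b(int i)
{
    __shared__ double buf_b[256];  // a second, distinct allocation
    buf_b[threadIdx.x] = (double)i;
    __syncthreads();
    return (float)buf_b[threadIdx.x];
}

// This kernel calls only callback_a, but with separate compilation
// both buf_a and buf_b count against its static __shared__ usage.
__global__ void uses_only_a(float* out)
{
    out[threadIdx.x] = callback_a(threadIdx.x);
}
```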
Again, this is all due to separate compilation. In my experiments, I found that with whole-program compilation, if one uses callbacks, the compiler goes to the extent of trying to resolve exactly which callback you pass to your kernels: if it manages to do so, it will create different clones of your kernels with the callbacks actually inlined, as normal function calls. The kernel clones show up as hidden symbols with complicated mangled names in cuobjdump. Interestingly, this is the first evidence I know of something that requires the tight integration of kernels and host code. In fact, I generally frowned upon the CUDA way, as opposed to the OpenCL way, of writing GPU code, because the difference between host and device code was somewhat hidden (which in general causes a lot of confusion, especially for beginners trying to take the address of a __device__ variable), and it looked like this wasn't really needed.
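The pattern the compiler tries to see through looks roughly like this (a hedged sketch; the symbol names are mine): host code fetches the address of a known __device__ function and passes it to the kernel as a function pointer. Under whole-program compilation, the compiler may trace the pointer back to my_callback and emit a kernel clone with the call inlined:

```cuda
typedef float (*callback_t)(float);

__device__ float my_callback(float x) { return x + 1.0f; }

// Device-side variable holding the callback's device address.
__device__ callback_t d_callback = my_callback;

__global__ void transform(float* data, int n, callback_t cb)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = cb(data[i]);
}

void launch(float* d_data, int n)
{
    callback_t h_cb;
    // Fetch the device-side function pointer to pass to the kernel;
    // taking &my_callback in host code would yield a host address.
    cudaMemcpyFromSymbol(&h_cb, d_callback, sizeof(h_cb));
    transform<<<(n + 255) / 256, 256>>>(d_data, n, h_cb);
}
```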

To conclude, here is the "magic" function that serves as a cast to __shared__ memory.

    /**
     * This function takes a pointer and forces the compiler to believe
     * that it resides in __shared__ space.
     */
    template<typename P>
    __device__ P* force_shared_ptr(P* assumed_shmem)
    {
        extern __shared__ char shmem[];
        auto ptrdiff = (char*)assumed_shmem - shmem;
        return (P*)(shmem + ptrdiff);
    }

It doesn't need explanation I suppose. Of course, behavior is undefined if your generic pointer is not actually a pointer to shared memory.
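A hedged usage sketch (the kernel and function names are mine): a __device__ function that receives a generic pointer it knows points into dynamic __shared__ memory can recover the specialized access instructions like so:

```cuda
// In a separately compiled unit, reduce_step's pointer argument would
// otherwise be accessed through generic memory instructions.
__device__ void reduce_step(float* vals)
{
    // We happen to know vals points into __shared__ memory, so force
    // the compiler to emit shared-memory loads/stores.
    float* svals = force_shared_ptr(vals);
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            svals[threadIdx.x] += svals[threadIdx.x + s];
        __syncthreads();
    }
}

__global__ void sum_kernel(const float* in, float* out)
{
    extern __shared__ float shm[];
    shm[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    reduce_step(shm);  // passes a generic pointer into shared memory
    if (threadIdx.x == 0)
        out[blockIdx.x] = shm[0];
}
```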