Writing practical CUDA kernels

using CUDA, BenchmarkTools

Thread unique identifier

We calculate each thread's unique identifier using a simple formula:

$$ \text{id} = x_{\text{thread}} + N \left( x_{\text{block}} - 1 \right) $$

where $N$ is the number of threads per block.
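
For example, with $N = 4$ threads per block, thread 3 of block 2 receives $\text{id} = 3 + 4 \left( 2 - 1 \right) = 7$.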

function kernel(a)
    i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
    
    a[i] = i
    
    return
end
kernel (generic function with 1 method)

Launching this kernel

a = CUDA.zeros(8)

@cuda threads=4 blocks=2 kernel(a)
a
8-element CuArray{Float32, 1}:
 1.0
 2.0
 3.0
 4.0
 5.0
 6.0
 7.0
 8.0
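
If the launch configuration overshoots the array (say threads=4 and blocks=3 for our 8-element a), the indexing above would run past the end of a and the kernel would fail its bounds check. A common guard, sketched here under the hypothetical name guarded_kernel, is to let the out-of-range threads do nothing:

function guarded_kernel(a)
    i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
    
    # threads whose index falls past the end of the array simply return
    if i ≤ length(a)
        a[i] = i
    end
    
    return
end

@cuda threads=4 blocks=3 guarded_kernel(a)  # 12 threads for 8 elements: the extra 4 are idle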

Kernel occupancy

Finding the balance of threads and blocks that fully utilizes the GPU can be tricky. Fortunately, CUDA.jl provides tools to help find a good launch configuration, by compiling our kernel first and inspecting it.

function kernel(c, a, b)
    i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
    
    c[i] = a[i] + 2 * b[i]
    
    return
end
kernel (generic function with 2 methods)

To analyse our launch, we require some sample data

a = CuArray(1:4096)
b = reverse(copy(a))
c = similar(a) 
; # no output

Launching with a poor configuration results in poor performance:

threads = 32
blocks = cld(length(a), 32)  # ceil division

@benchmark CUDA.@sync @cuda threads=threads blocks=blocks kernel($c, $a, $b)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.535 μs … 1.007 ms  ┊ GC (min … max): 0.00% … 99.01%
 Time  (median):     7.790 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.988 μs ± 10.003 μs  ┊ GC (mean ± σ):  1.25% ±  0.99%

    ▄█                                                       
  ▃▇███▇▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▁▂▁▁▁▁▁▂▁▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  7.53 μs        Histogram: frequency by time        11.2 μs <

 Memory estimate: 1.45 KiB, allocs estimate: 38.

Now we compile with launch=false and analyse the optimal launch configuration:

kern = @cuda launch=false kernel(c, a, b)

conf = CUDA.launch_configuration(kern.fun)
@show conf

threads = min(length(a), conf.threads)
blocks  = cld(length(a), threads)

print("calc = (blocks = $blocks, threads = $threads)")
conf = (blocks = 32, threads = 1024)
calc = (blocks = 4, threads = 1024)

We can then launch our compiled kernel through a function invocation:

@benchmark kern($c, $a, $b; threads, blocks)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  1.996 μs … 552.159 μs  ┊ GC (min … max): 0.00% … 99.50%
 Time  (median):     2.122 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.273 μs ±   7.781 μs  ┊ GC (mean ± σ):  4.83% ±  1.41%

   ▁▆█▂▁                                                      
  ▃█████▆▅▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▂▁▁▁▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂ ▃
  2 μs            Histogram: frequency by time        3.77 μs <

 Memory estimate: 1.03 KiB, allocs estimate: 23.

In general, a good launch configuration will perform well on any CUDA device, though the optimum may well be device dependent. As always, profiling and benchmarking are required.
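
The compile-then-query pattern above can be bundled into a small helper; launch_with_occupancy below is only an illustrative sketch, not part of CUDA.jl:

function launch_with_occupancy(c, a, b)
    kern = @cuda launch=false kernel(c, a, b)   # compile without launching
    conf = CUDA.launch_configuration(kern.fun)  # occupancy-based suggestion
    
    threads = min(length(a), conf.threads)
    blocks  = cld(length(a), threads)
    
    kern(c, a, b; threads, blocks)              # launch the compiled kernel
end

launch_with_occupancy(c, a, b)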

A reduce implementation

We can demonstrate some GPU idioms by examining how we might write a reduce function on the GPU.

The GPU is best utilized by launching independent (trivially parallelizable) threads. There are a few ways we can stick to this practice and still implement reduce effectively.

a = CUDA.rand(1024, 1024)
b = CuArray{Float32, 1}([0]) # result store, same element type as a
1-element CuArray{Float32, 1}:
 0.0

For comparison, we benchmark the CPU sum of the same 1024×1024 elements:

cpu_a = Array(a)
@benchmark sum($cpu_a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  73.902 μs … 246.050 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     74.465 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   74.902 μs ±   3.705 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█ ▁▁▁▁                                                    ▂
  ██▅████▅▅▅▃▄▃▃▁▁▃▃▁▃▁▁▃▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▅▇▆▇▇▇▅▅▆▆ █
  73.9 μs       Histogram: log(frequency) by time      91.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Avoiding write conflicts with CUDA.@atomic

A basic implementation leveraging atomics to avoid write conflicts may look like:

function reduce_kernel(op, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    
    # every thread atomically folds its element into the single output slot
    CUDA.@atomic b[] = op(b[], a[i])
    
    return
end

@benchmark CUDA.@sync @cuda threads=1024 blocks=1024 reduce_kernel(+, $a, $b)
BenchmarkTools.Trial: 1734 samples with 1 evaluation.
 Range (min … max):  2.743 ms … 5.306 ms  ┊ GC (min … max): 0.00% … 44.34%
 Time  (median):     2.984 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.882 ms ± 134.470 μs  ┊ GC (mean ± σ):  0.12% ±  2.25%

  █                                                      ▁    
  █▁▄▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▄▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▃▁▁▁▁▁▁▁▃▁▁▁▁▅▄▆█▇█ █
  2.74 ms      Histogram: log(frequency) by time         3 ms <

 Memory estimate: 32.56 KiB, allocs estimate: 2070.
a = CUDA.rand(1024, 1024)
b = CuArray{Float32, 1}([0]) # reset the result store, which the benchmark mutated
1-element CuArray{Float32, 1}:
 0.0

Note that these atomics result in serialization of access, and thus slow the kernel dramatically.
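
As a quick sanity check, we can launch the kernel once into a fresh accumulator (b_check is just an illustrative name) and compare against sum:

b_check = CUDA.zeros(Float32, 1)  # fresh accumulator, so earlier launches do not pollute the result
CUDA.@sync @cuda threads=1024 blocks=1024 reduce_kernel(+, a, b_check)

CUDA.@allowscalar b_check[] ≈ sum(a)  # true, up to floating-point rounding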

Thread reduction

The idea is to implement a tree reduction, within a single block, along the lines of the following (for a = 1:16):

iteration 1:
thread 1: a[1]  + a[2]  = 1  + 2   = 3
thread 2: a[3]  + a[4]  = 3  + 4   = 7
thread 3: a[5]  + a[6]  = 5  + 6   = 11
thread 4: a[7]  + a[8]  = 7  + 8   = 15
thread 5: a[9]  + a[10] = 9  + 10  = 19
thread 6: a[11] + a[12] = 11 + 12  = 23
thread 7: a[13] + a[14] = 13 + 14  = 27
thread 8: a[15] + a[16] = 15 + 16  = 31

iteration 2:
thread 1: a[1]  + a[3]  = 3  + 7   = 10
thread 2: a[5]  + a[7]  = 11 + 15  = 26
thread 3: a[9]  + a[11] = 19 + 23  = 42
thread 4: a[13] + a[15] = 27 + 31  = 58

iteration 3:
thread 1: a[1]  + a[5]  = 10 + 26  = 36
thread 2: a[9]  + a[13] = 42 + 58  = 100

iteration 4:
thread 1: a[1]  + a[9]  = 36 + 100 = 136

This limits our kernel to arrays of length 2 x 1024 = 2048, but lays the groundwork for scaling the pattern over many blocks.
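
For intuition, here is a CPU-side sketch of the same indexing scheme (tree_reduce! is only an illustrative helper and is not used later):

function tree_reduce!(op, a)
    elements = length(a)
    
    d = 1
    while d < elements
        # each pass of this loop plays the role of one synchronised GPU iteration
        for thread in 1:cld(elements, 2)
            i = 2 * d * (thread - 1) + 1
            if i ≤ elements && i + d ≤ length(a)
                a[i] = op(a[i], a[i + d])
            end
        end
        d *= 2
    end
    
    return a[1]
end

tree_reduce!(+, collect(1:16)) == sum(1:16)  # true

The GPU kernel below applies the same indexing, with each doubling of d separated by sync_threads().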

function reduce_thread_kernel(op, a, b)
    elements = blockDim().x * 2
    thread = threadIdx().x
    
    d = 1
    while d < elements
        
        sync_threads()
        i = 2 * d * (thread - 1) + 1
        
        @inbounds if i ≤ elements && i + d ≤ length(a)
            a[i] = op(a[i], a[i+d])
        end
        d *= 2
    end
    
    if thread == 1
        b[] = a[1]
    end
    
    return
end
reduce_thread_kernel (generic function with 1 method)

Trying out the function on a small sample

a1 = CuArray(1:1024)
b1 = CuArray([0])

# limit threads to half of data
@cuda threads=cld(length(a1), 2) reduce_thread_kernel(+, a1, b1)

CUDA.@allowscalar b1[1] == sum(1:1024)
true

Single block reduction

We can generalise the above implementation to arbitrary array sizes by first folding the elements beyond the block's reach back into the front of the array, and then running the algorithm as above.

function reduce_block_kernel(op, a, b)
    elements = blockDim().x * 2 # as each thread consumes 2 elements
    thread = threadIdx().x
    
    # fold the tail of the array (beyond the first `elements` entries) into
    # the front, each thread striding by blockDim().x so no index is missed
    i = thread + elements
    while i ≤ length(a)
        
        a[thread] = op(a[thread], a[i])
        i += blockDim().x
        
    end
    
    d = 1
    while d < elements
        sync_threads()
        i = 2 * d * (thread - 1) + 1
        
        @inbounds if i ≤ elements && i + d ≤ length(a)
            a[i] = op(a[i], a[i+d])
        end
        d *= 2
        
    end
    
    if thread == 1
        b[] = a[1]
    end
    
    return
end
reduce_block_kernel (generic function with 1 method)

The first while loop folds everything beyond the first 2 × blockDim().x entries into the front of the array, after which the parallel reduction proceeds exactly as before. We can now benchmark this kernel on the original test data, limiting the launch to a single block:

@benchmark CUDA.@sync @cuda threads=1024 reduce_block_kernel(+, $a, $b)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  235.713 μs … 2.278 ms  ┊ GC (min … max): 0.00% … 91.75%
 Time  (median):     244.456 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   275.479 μs ± 68.080 μs  ┊ GC (mean ± σ):  0.15% ±  1.32%

    █                                                          
  ▂██▆▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▂▂▂▂▂▂▇▄▂▂▂▂ ▂
  236 μs          Histogram: frequency by time          409 μs <

 Memory estimate: 4.23 KiB, allocs estimate: 257.
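
Since the kernel folds a in place, a correctness check needs fresh inputs; a_check, b_check and expected below are just illustrative names:

a_check  = CUDA.rand(1024, 1024)
expected = sum(a_check)           # reference value, computed before the kernel mutates a_check
b_check  = CUDA.zeros(Float32, 1)

CUDA.@sync @cuda threads=1024 reduce_block_kernel(+, a_check, b_check)
CUDA.@allowscalar b_check[] ≈ expected  # true, up to floating-point rounding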

Grid reduction with atomics

To fully utilize the GPU, we want to run many blocks so that the hardware can switch between them to hide latency. To perform a multi-block (grid) reduction we have to take into account that each block is independent, and may not necessarily be running at the same time as any other block. Consequently, thread synchronisation does not extend beyond a single block.

Instead, we can use an atomic operation to add each block's partial result to the output once that block has finished its local reduction.

Consider

function reduce_grid_atomic_kernel(op, a, b)
    
    elements = blockDim().x * 2     # number of elements each block will consume
    thread = threadIdx().x
    block = blockIdx().x
    
    offset = (block - 1) * elements
    
    # parallel reduction within this block's slice of a
    d = 1
    while d < elements
        sync_threads()
        
        i = 2 * d * (thread - 1) + 1
        @inbounds if i ≤ elements && offset + i + d ≤ length(a)
            a[offset + i] = op(a[offset + i], a[offset + i + d])
        end
        
        d *= 2
    end
    
    # atomically fold this block's partial result into the output
    if thread == 1
        CUDA.@atomic b[] = op(b[], a[offset + 1])
    end
    
    return
end
reduce_grid_atomic_kernel (generic function with 1 method)

Benchmarking this kernel:

@benchmark CUDA.@sync @cuda threads=1024 blocks=512 reduce_grid_atomic_kernel(+, $a, $b)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  150.457 μs … 2.231 ms  ┊ GC (min … max): 0.00% … 99.18%
 Time  (median):     153.051 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   153.348 μs ± 20.818 μs  ┊ GC (mean ± σ):  0.14% ±  0.99%

                     ▂▂▄▆▅▅▆▅▆▇▇█▅▄▃▅▄▃▁                      
  ▁▁▁▁▁▁▂▂▃▃▃▄▄▅▆▆█▇█████████████████████▇█▇▆▆▆▅▄▃▃▃▃▂▂▂▁▁▁▁ ▅
  150 μs          Histogram: frequency by time          156 μs <

 Memory estimate: 992 bytes, allocs estimate: 48.

Although faster, we are still relying on serialized operations through the use of atomics.

Shared memory

Each block has its own shared memory, visible to all of its threads and stored on-chip in the streaming multiprocessor. Although this memory is rather small, we can use it both for communication between threads and for caching loads.

We utilise both of these ideas here:

Each thread copies its element of a into a shared buffer, along with the element it will later reduce against, along the lines of (ignoring the block offset):

shared[thread] = a[thread]
shared[thread + blockDim().x] = a[thread + blockDim().x]

When blockDim().x == 1 (a single thread per block), the two values a thread loads are adjacent; in general they are blockDim().x elements apart.

The shared buffer thus holds two values per thread, and since we are running at most 1024 threads, the size of the buffer is fixed to 2048.

function reduce_grid_atomic_shared_kernel(op, a::AbstractArray{T}, b) where {T}
    
    elements = blockDim().x * 2
    thread = threadIdx().x
    block = blockIdx().x
    
    offset = (block-1) * elements

    # copy into shared memory buffer
    shared = @cuStaticSharedMem(T, (2048,))
    @inbounds shared[thread] = a[offset+thread]
    @inbounds shared[thread+blockDim().x] = a[offset+thread+blockDim().x]

    # parallel reduction
    d = 1
    while d < elements
        sync_threads()
        
        i = 2 * d * (thread-1) + 1
        
        @inbounds if i ≤ elements && offset + i + d ≤ length(a)
            shared[i] = op(shared[i], shared[i + d])
        end
        
        d *= 2
    end
    
    # atomic end-of-block reduction
    if thread == 1
        CUDA.@atomic b[] = op(b[], shared[1])
    end
    
    return
end
reduce_grid_atomic_shared_kernel (generic function with 2 methods)

Benchmarking this as well:

@benchmark CUDA.@sync @cuda threads=1024 blocks=512 reduce_grid_atomic_shared_kernel(+, $a, $b)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  137.328 μs … 2.558 ms  ┊ GC (min … max): 0.00% … 95.25%
 Time  (median):     139.799 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   139.935 μs ± 24.383 μs  ┊ GC (mean ± σ):  0.17% ±  0.95%

            ▁▂▂▁▁▁             ▁▃▅▇██▆▇▅▄▂                    
  ▁▁▁▁▂▂▄▅▆███████▇▅▅▅▄▄▃▃▄▅▅▇█████████████▇▆▅▄▃▃▂▂▂▁▁▁▁▁▁▁▁ ▄
  137 μs          Histogram: frequency by time          142 μs <

 Memory estimate: 2.77 KiB, allocs estimate: 163.

This gives us a small performance improvement.

If we compare our efforts to the sum implementation in CUDA.jl, we see there is still some way to go.

@benchmark CUDA.@sync sum($a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  96.457 μs … 21.465 ms  ┊ GC (min … max): 0.00% … 40.07%
 Time  (median):     100.062 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   118.998 μs ± 216.389 μs  ┊ GC (mean ± σ):  0.72% ±  0.40%

    █                                                            
  ▂██▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▂▁▂▂▂▂▂▁▁▂▁▂▁▂▂▂▂▂▄▇▄ ▂
  96.5 μs          Histogram: frequency by time          173 μs <

 Memory estimate: 4.84 KiB, allocs estimate: 161.