Writing practical CUDA kernels¶
using CUDA, BenchmarkTools
Thread unique identifier¶
We calculate the thread unique identifier using a simple formula
$$ \text{id} = x_{\text{thread}} + N \left( x_{\text{block}} - 1 \right) $$
where $N$ is the number of threads per block.
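For example, with $N = 4$ threads per block, the third thread of the second block computes $\text{id} = 3 + 4 \left( 2 - 1 \right) = 7$.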
function kernel(a)
    i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
    a[i] = i
    return
end
kernel (generic function with 1 method)
Launching this kernel
a = CUDA.zeros(8)
@cuda threads=4 blocks=2 kernel(a)
a
8-element CuArray{Float32, 1}:
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
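Copying the result back to the host, we can check that each element holds its own linear index, matching the formula above:
Array(a) == 1:8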
Kernel occupancy¶
Finding the right balance between threads and blocks to fully utilize the GPU can be tricky. Fortunately, there are tools to help us find a good launch configuration: we can compile our kernel without launching it and inspect the result
function kernel(c, a, b)
    i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
    c[i] = a[i] + 2 * b[i]
    return
end
kernel (generic function with 2 methods)
To analyse our launch, we require some sample data
a = CuArray(1:4096)
b = reverse(copy(a))
c = similar(a)
; # no output
A poor launch configuration results in poor performance:
threads = 32
blocks = cld(length(a), 32) # ceil division
@benchmark CUDA.@sync @cuda threads=threads blocks=blocks kernel($c, $a, $b)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
Range (min … max): 7.535 μs … 1.007 ms ┊ GC (min … max): 0.00% … 99.01%
Time (median): 7.790 μs ┊ GC (median): 0.00%
Time (mean ± σ): 7.988 μs ± 10.003 μs ┊ GC (mean ± σ): 1.25% ± 0.99%
▄█▃
▃▇████▇▆▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▁▂▁▁▁▁▁▂▁▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂ ▃
7.53 μs Histogram: frequency by time 11.2 μs <
Memory estimate: 1.45 KiB, allocs estimate: 38.
Now we compile with launch=false and analyse the optimal launch configuration:
kern = @cuda launch=false kernel(c, a, b)
conf = CUDA.launch_configuration(kern.fun)
@show conf
threads = min(length(a), conf.threads)
blocks = cld(length(a), threads)
print("calc = (blocks = $blocks, threads = $threads)")
conf = (blocks = 32, threads = 1024)
calc = (blocks = 4, threads = 1024)
We can then launch our compiled kernel through a function invocation:
@benchmark kern($c, $a, $b; threads, blocks)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 1.996 μs … 552.159 μs ┊ GC (min … max): 0.00% … 99.50%
Time (median): 2.122 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.273 μs ± 7.781 μs ┊ GC (mean ± σ): 4.83% ± 1.41%
▁▆█▇▂▁
▃██████▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▂▁▁▁▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂ ▃
2 μs Histogram: frequency by time 3.77 μs <
Memory estimate: 1.03 KiB, allocs estimate: 23.
In general a good launch configuration will perform well on any CUDA GPU device, though the optimum may well be device dependent. As always, profiling and benchmarking are required.
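If we find ourselves repeating this pattern, we could wrap it in a small helper. The sketch below is our own convenience function (the name launch_add! is not part of CUDA.jl); it compiles the kernel, queries the occupancy API, and launches with the derived configuration:
function launch_add!(c, a, b)
    # compile (and cache) the kernel without launching it
    kern = @cuda launch=false kernel(c, a, b)
    config = CUDA.launch_configuration(kern.fun)
    # size the launch to cover every element of the output
    threads = min(length(c), config.threads)
    blocks = cld(length(c), threads)
    kern(c, a, b; threads, blocks)
    return c
end
Calling launch_add!(c, a, b) then performs the same c[i] = a[i] + 2 * b[i] computation without us having to pick the configuration by hand.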
A reduce implementation¶
We can demonstrate some GPU idioms by examining how we might write a reduce function on the GPU.
The GPU is best utilized by launching independent (trivially parallelizable) threads. There are a few ways we can keep to this practice and still implement reduce effectively.
a = CUDA.rand(1024, 1024)
b = CuArray{Float32, 1}([0]) # result store, same type as a
1-element CuArray{Float32, 1}:
0.0
For comparison, we benchmark the CPU with 1024x1024 elements:
cpu_a = Array(a)
@benchmark sum($cpu_a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 73.902 μs … 246.050 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 74.465 μs ┊ GC (median): 0.00%
Time (mean ± σ): 74.902 μs ± 3.705 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆██▂ ▁▁▁▁ ▂
████▅████▅▅▅▃▄▃▃▁▁▃▃▁▃▁▁▃▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▅▇▆▇▇▇▅▅▆▆ █
73.9 μs Histogram: log(frequency) by time 91.4 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Avoiding write conflicts with CUDA.@atomic¶
A basic implementation leveraging atomics to avoid write conflicts may look like:
function reduce_kernel(op, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    # the atomic update serialises concurrent writes to b[], avoiding the conflict
    CUDA.@atomic b[] = op(b[], a[i])
    return
end
@benchmark CUDA.@sync @cuda threads=1024 blocks=1024 reduce_kernel(+, $a, $b)
BenchmarkTools.Trial: 1734 samples with 1 evaluation.
Range (min … max): 2.743 ms … 5.306 ms ┊ GC (min … max): 0.00% … 44.34%
Time (median): 2.984 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.882 ms ± 134.470 μs ┊ GC (mean ± σ): 0.12% ± 2.25%
█ █▁
█▁▄▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▄▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▃▁▁▁▁▁▁▁▃▁▁▁▁▅▄▆██▇█ █
2.74 ms Histogram: log(frequency) by time 3 ms <
Memory estimate: 32.56 KiB, allocs estimate: 2070.
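As a quick sanity check (just a sketch: the benchmark accumulated into b on every sample, so we zero it first, and we compare with a loose tolerance since Float32 results depend on summation order):
fill!(b, 0)
CUDA.@sync @cuda threads=1024 blocks=1024 reduce_kernel(+, a, b)
CUDA.@allowscalar isapprox(b[1], sum(Array(a)); rtol = 1e-3)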
a = CUDA.rand(1024, 1024)
b = CuArray{Float32, 1}([0]) # result store
1-element CuArray{Float32, 1}:
0.0
Note that these atomics result in serialization of access, and thus slow the kernel dramatically.
Thread reduction¶
The idea is to implement a reduction, within a single block, along the following lines (for length(a) = 16):
itt 1:
thread 1: a[1] + a[2] = 1 + 2 = 3
thread 2: a[3] + a[4] = 3 + 4 = 7
thread 3: a[5] + a[6] = 5 + 6 = 11
thread 4: a[7] + a[8] = 7 + 8 = 15
thread 5: a[9] + a[10] = 9 + 10 = 19
thread 6: a[11] + a[12] = 11 + 12 = 23
thread 7: a[13] + a[14] = 13 + 14 = 27
thread 8: a[15] + a[16] = 15 + 16 = 31
itt 2:
thread 1: a[1] + a[3] = 3 + 7 = 10
thread 2: a[5] + a[7] = 11 + 15 = 26
thread 3: a[9] + a[11] = 19 + 23 = 42
thread 4: a[13] + a[15] = 27 + 31 = 58
itt 3:
thread 1: a[1] + a[5] = 10 + 26 = 36
thread 2: a[9] + a[13] = 42 + 58 = 100
itt 4:
thread 1: a[1] + a[9] = 36 + 100 = 136
This does limit our kernel to arrays of length 2 × 1024 = 2048, but it provides the groundwork for how we scale the pattern over many blocks.
function reduce_thread_kernel(op, a, b)
    elements = blockDim().x * 2 # each thread consumes 2 elements
    thread = threadIdx().x
    # tree reduction: at step d, partial sums separated by d are combined
    d = 1
    while d < elements
        sync_threads()
        i = 2 * d * (thread - 1) + 1
        @inbounds if i ≤ elements && i + d ≤ length(a)
            a[i] = op(a[i], a[i+d])
        end
        d *= 2
    end
    # the final result accumulates in a[1]
    if thread == 1
        b[] = a[1]
    end
    return
end
reduce_thread_kernel (generic function with 1 method)
Trying out the function on a small sample
a1 = CuArray(1:1024)
b1 = CuArray([0])
# limit threads to half of data
@cuda threads=cld(length(a1), 2) reduce_thread_kernel(+, a1, b1)
CUDA.@allowscalar b1[1] == sum(1:1024)
true
Single block reduction¶
We can generalise the above implementation to arrays of arbitrary size by first folding the part of the array that lies beyond the block's reach back into the start of the array, and then running the tree reduction as above.
function reduce_block_kernel(op, a, b)
    elements = blockDim().x * 2 # as each thread consumes 2 elements
    thread = threadIdx().x
    # serially fold every element beyond the first `elements` entries back
    # into the start of the array, striding by the number of threads
    i = thread + elements
    while i ≤ length(a)
        a[thread] = op(a[thread], a[i])
        i += blockDim().x
    end
    # parallel tree reduction over the remaining `elements` values
    d = 1
    while d < elements
        sync_threads()
        i = 2 * d * (thread - 1) + 1
        @inbounds if i ≤ elements && i + d ≤ length(a)
            a[i] = op(a[i], a[i+d])
        end
        d *= 2
    end
    if thread == 1
        b[] = a[1]
    end
    return
end
reduce_block_kernel (generic function with 1 method)
The first while loop effectively shrinks the array, folding every element beyond the first 2 * blockDim().x entries back into the start, after which the tree reduction proceeds exactly as before. We can now benchmark this function with the original test data, limiting the launch to a single block:
@benchmark CUDA.@sync @cuda threads=1024 reduce_block_kernel(+, $a, $b)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 235.713 μs … 2.278 ms ┊ GC (min … max): 0.00% … 91.75%
Time (median): 244.456 μs ┊ GC (median): 0.00%
Time (mean ± σ): 275.479 μs ± 68.080 μs ┊ GC (mean ± σ): 0.15% ± 1.32%
█▁
▂███▆▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▂▂▂▂▂▂▇▄▂▂▂▂ ▂
236 μs Histogram: frequency by time 409 μs <
Memory estimate: 4.23 KiB, allocs estimate: 257.
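A correctness check needs fresh data, since the kernel reduces its input in place; the names a2 and b2 below are introduced only for this sketch:
a2 = CUDA.rand(Float32, 4096)
b2 = CUDA.zeros(Float32, 1)
CUDA.@sync @cuda threads=1024 reduce_block_kernel(+, copy(a2), b2)
CUDA.@allowscalar isapprox(b2[1], sum(Array(a2)))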
Grid reduction with atomics¶
To fully utilize the GPU, we want to run many blocks so that the GPU can hide latency by switching between them. To perform a multi-block (grid) reduction we have to take into account that each block is independent, and may not necessarily be running at the same time as another block. Consequently, our ability to synchronise threads doesn't extend beyond a single block.
Instead, each block can use an atomic operation to fold its partial result into the output once its block-local reduction has finished.
Consider:
function reduce_grid_atomic_kernel(op, a, b)
    elements = blockDim().x * 2 # number of elements each block will consume
    thread = threadIdx().x
    block = blockIdx().x
    offset = (block - 1) * elements
    # parallel tree reduction within this block's chunk
    d = 1
    while d < elements
        sync_threads()
        i = 2 * d * (thread - 1) + 1
        @inbounds if i ≤ elements && offset + i + d ≤ length(a)
            a[offset + i] = op(a[offset + i], a[offset + i + d])
        end
        d *= 2
    end
    # atomically fold this block's result into the output
    if thread == 1
        CUDA.@atomic b[] = op(b[], a[offset + 1])
    end
    return
end
reduce_grid_atomic_kernel (generic function with 1 method)
Benchmarking this kernel:
@benchmark CUDA.@sync @cuda threads=1024 blocks=512 reduce_grid_atomic_kernel(+, $a, $b)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 150.457 μs … 2.231 ms ┊ GC (min … max): 0.00% … 99.18%
Time (median): 153.051 μs ┊ GC (median): 0.00%
Time (mean ± σ): 153.348 μs ± 20.818 μs ┊ GC (mean ± σ): 0.14% ± 0.99%
▂▂▄▆▅▅▆▅▆▇▇▇█▅▆▄▃▅▄▃▁
▁▁▁▁▁▁▂▂▃▃▃▄▄▅▆▆█▇███████████████████████▇█▇▆▆▆▅▄▃▃▃▃▂▂▂▁▁▁▁ ▅
150 μs Histogram: frequency by time 156 μs <
Memory estimate: 992 bytes, allocs estimate: 48.
Although faster, we are still relying on serialized access through the use of atomics.
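As a final sanity check (again a sketch on fresh data, with a3 and b3 introduced only here, since the kernel mutates its input and the benchmark accumulated into b):
a3 = CUDA.rand(Float32, 1024, 1024)
b3 = CUDA.zeros(Float32, 1)
CUDA.@sync @cuda threads=1024 blocks=512 reduce_grid_atomic_kernel(+, copy(a3), b3)
CUDA.@allowscalar isapprox(b3[1], sum(Array(a3)); rtol = 1e-3)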