A deep-dive into CUDA.jl¶
Based on a workshop from JuliaCon 2021; see the accompanying GitHub repository for the workshop notes.
A quick rule-of-thumb note on libraries:
There are many other libraries available that wrap or leverage more specific features, but these may be considered the main three. It is useful to bear in mind their differences when trying to write portable code.
Another useful tool is Tullio.jl, which provides einsum macros that automatically use GPU or threading features to speed up your code.
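As a minimal sketch of the macro syntax (on plain CPU arrays here, with illustrative names A, B and C), the same einsum notation can drive multithreaded CPU loops:
using Tullio
A, B = rand(Float32, 64, 64), rand(Float32, 64, 64)
# einsum-style contraction: C[i, j] = Σ_k A[i, k] * B[k, j]
@tullio C[i, j] := A[i, k] * B[k, j]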
using CUDA, BenchmarkTools, Tullio
CUDA.allowscalar(false)  # error on slow scalar indexing of GPU arrays instead of silently allowing it
Abstractions¶
Although the Julia GPU libraries provide alternate implementations of many abstractions, they are not exhaustive. These implementations generate GPU kernels behind the scenes via multiple dispatch, and are often simple drop-in replacements when using standard algorithms or abstractions.
Many are supported:
a = CUDA.ones(5)
broadcast(a) do x
    x += 1
end
5-element CuArray{Float32, 1}:
2.0
2.0
2.0
2.0
2.0
a .+ 1
5-element CuArray{Float32, 1}:
2.0
2.0
2.0
2.0
2.0
map(a) do x
    x + 1
end
5-element CuArray{Float32, 1}:
2.0
2.0
2.0
2.0
2.0
reduce(+, a)
5.0f0
accumulate(+, a)
5-element CuArray{Float32, 1}:
1.0
2.0
3.0
4.0
5.0
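Other reduction-style abstractions such as mapreduce are supported too; a small sketch reusing a from above:
mapreduce(x -> x^2, +, a)   # squares each element on the GPU, then sums: 5.0f0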
However, CUDA.jl implements these abstractions directly on its type hierarchy when generating kernels. Their composability is therefore not guaranteed, and composing these functions over GPU arrays will not always result in GPU kernels being generated:
b = CUDA.ones(10, 10)
broadcast(eachcol(b)) do x
    sum(x)
end
10-element Vector{Float32}:
10.0
10.0
10.0
10.0
10.0
10.0
10.0
10.0
10.0
10.0
The above results in a separate kernel being launched for every column. We can investigate this by examining the generated code using @code_warntype or @code_typed, and indeed we see that the outer broadcast runs on the CPU, with each sum(x) call spawning its own kernel.
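For instance, a quick sketch of such an inspection (replacing the do-block with sum directly; the printed typed code is elided):
@code_warntype broadcast(sum, eachcol(b))
# the result is a Vector{Float32}, i.e. a CPU array: the outer broadcast
# itself is not turned into a GPU kernel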
We can see this more intuitively in a benchmark comparison. We use CUDA.@sync to ensure the function call is synchronised with execution on the device; otherwise the call returns immediately after dispatching the kernel.
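To illustrate the effect of the synchronisation itself (a rough sketch; exact numbers depend on the device and on compilation caching):
@time sum(b; dims=2)              # returns almost immediately: only the launch is timed
@time CUDA.@sync sum(b; dims=2)   # waits for the device, so kernel execution is included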
@benchmark CUDA.@sync broadcast(eachcol($b)) do x
    sum(x)
end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 199.408 μs … 38.064 ms ┊ GC (min … max): 0.00% … 30.59%
Time (median): 211.100 μs ┊ GC (median): 0.00%
Time (mean ± σ): 299.489 μs ± 875.649 μs ┊ GC (mean ± σ): 2.66% ± 0.94%
█▇▃▂▁ ▄▃ ▁
██████▇█▆▅▆▆▆▆▆▃▄▄▅▇▆▅▄▅▄▄▁▄▃▄▄▃▅▄▅▃▁▄▅▄▄▁▄▅▆▄▃▄▄▅▄▁▇██▇▆▄▃▄█ █
199 μs Histogram: log(frequency) by time 970 μs <
Memory estimate: 26.23 KiB, allocs estimate: 673.
@benchmark CUDA.@sync sum(b; dims=2)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.833 μs … 7.825 ms ┊ GC (min … max): 0.00% … 65.00%
Time (median): 15.171 μs ┊ GC (median): 0.00%
Time (mean ± σ): 344.530 μs ± 510.946 μs ┊ GC (mean ± σ): 0.29% ± 0.96%
█ ▄ ▁ ▆ ▃ ▁
█▆▄▇██▅▄█▆▃▃▄▆▄▃▄▁▁▁▃▄▆▄▃▄▄▃▁▃▁▁▁▃▃▃▃▄▁▁▃▄▁▁▁▁▁▃▄▁▁▁▃▃▃▁▃▅█▇█ █
13.8 μs Histogram: log(frequency) by time 1.18 ms <
Memory estimate: 2.06 KiB, allocs estimate: 49.
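The single fused reduction is considerably faster, as it avoids a kernel launch per column. Note that the exact column-sum equivalent of the broadcast above would use dims=1; it likewise launches a single reduction:
vec(sum(b; dims=1))   # 10-element CuArray of column sums, without a kernel launch per column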
More complex and involved kernels can also be generated by other tools, such as Tullio.jl. For example, the expression
$$ x_i = \sum_{j} \sum_{k} b_{ji} + b_{ik} $$
can be executed as a GPU kernel with
b = reshape(CUDA.CuArray(1:100), 10, 10)
CUDA.@allowscalar @tullio x[i] := b[j, i] + b[i, k]
10-element CuArray{Int64, 1}:
5150
6250
7350
8450
9550
10650
11750
12850
13950
15050
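As a quick sanity check (not part of the original workshop), the double sum separates as
$$ x_i = K \sum_{j} b_{ji} + J \sum_{k} b_{ik} $$
with J = K = 10 here, so the same values can be recovered from ordinary reductions:
x_check = 10 .* vec(sum(b; dims=1)) .+ 10 .* vec(sum(b; dims=2))
x_check == x   # true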