A deep-dive into CUDA.jl

Based on a workshop from JuliaCon 2021; the accompanying GitHub repository contains the workshop notes.

A quick rule-of-thumb note on libraries:

There are many other libraries available that wrap or leverage more specific features, but these may be considered the main three. It is useful to bear in mind their differences when trying to write portable code.
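
As a hedged illustration of what portability means in practice (a minimal sketch, not from the workshop; normalise is a hypothetical helper), code written against AbstractArray and built on broadcasting lets the array type decide where it runs:

using CUDA

# Back-end agnostic: works on any AbstractArray, so the array type picks the device.
normalise(x::AbstractArray) = (x .- minimum(x)) ./ (maximum(x) - minimum(x))

normalise(rand(Float32, 5))   # runs on the CPU, returns a Vector
normalise(CUDA.rand(5))       # runs on the GPU, returns a CuArray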

Another useful tool is Tullio.jl, which provides an einsum-style macro, @tullio, that can automatically use GPU or threading features to speed up your code.
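
For instance, a matrix product can be written in einsum notation as follows (a minimal sketch on CPU arrays, not from the workshop; a GPU example appears at the end of this section):

using Tullio

A = rand(3, 4); B = rand(4, 5)

# C[i, j] = Σₖ A[i, k] * B[k, j] -- the repeated index k on the right-hand side is summed over
@tullio C[i, j] := A[i, k] * B[k, j]

C ≈ A * B    # true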

using CUDA, BenchmarkTools, Tullio
CUDA.allowscalar(false)  # error on (slow) scalar indexing of GPU arrays rather than silently falling back to it

Abstractions

Although the Julia GPU libraries provide alternative implementations of many standard abstractions, they are not exhaustive. These implementations compile the scalar Julia code you supply into GPU kernels, hidden behind multiple dispatch, and are often simple drop-in replacements when using standard algorithms or abstractions.

Many are supported:

a = CUDA.ones(5)
broadcast(a) do x
    x += 1
end
5-element CuArray{Float32, 1}:
 2.0
 2.0
 2.0
 2.0
 2.0
a .+ 1
5-element CuArray{Float32, 1}:
 2.0
 2.0
 2.0
 2.0
 2.0
map(a) do x
    x + 1
end
5-element CuArray{Float32, 1}:
 2.0
 2.0
 2.0
 2.0
 2.0
reduce(+, a)
5.0f0
accumulate(+, a)
5-element CuArray{Float32, 1}:
 1.0
 2.0
 3.0
 4.0
 5.0

However, CUDA.jl implements these abstractions directly on its type hierarchy and generates kernels for them. Their composability is therefore not guaranteed, and nesting these functions when operating on arrays does not always keep the whole computation on the GPU:

b = CUDA.ones(10, 10)
broadcast(eachcol(b)) do x
    sum(x)
end
10-element Vector{Float32}:
 10.0
 10.0
 10.0
 10.0
 10.0
 10.0
 10.0
 10.0
 10.0
 10.0

The above results in a separate kernel launch for every column. We can investigate this by examining the generated code using @code_warntype or @code_typed, and indeed we see that the outer broadcast runs on the CPU, with each sum(x) call spawning its own kernel.
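
As a quick illustration of how one might invoke these (the exact output depends on your Julia and CUDA.jl versions):

# Equivalent to the do-block form above; the inferred Vector{Float32} return
# type already hints that the outer broadcast is assembled on the CPU.
@code_warntype broadcast(sum, eachcol(b))
@code_typed broadcast(sum, eachcol(b))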

We can see this more intuitively in a benchmark comparison. We use CUDA.@sync to ensure the function call is synchronised with execution on the device, as otherwise the call returns immediately after dispatching the kernels and we would only be timing the launches.

@benchmark CUDA.@sync broadcast(eachcol($b)) do x
    sum(x)
end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  199.408 μs … 38.064 ms  ┊ GC (min … max): 0.00% … 30.59%
 Time  (median):     211.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   299.489 μs ± 875.649 μs  ┊ GC (mean ± σ):  2.66% ±  0.94%

  █▃▂▁                                                ▄▃       ▁
  █████▇█▅▆▆▆▆▆▃▄▄▅▇▆▅▄▅▄▄▁▄▃▄▄▃▅▄▅▃▁▄▅▄▄▁▄▅▆▄▃▄▄▅▄▁▇██▇▆▄▃▄█ █
  199 μs        Histogram: log(frequency) by time        970 μs <

 Memory estimate: 26.23 KiB, allocs estimate: 673.
@benchmark CUDA.@sync sum(b; dims=2)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   13.833 μs …  7.825 ms  ┊ GC (min … max): 0.00% … 65.00%
 Time  (median):      15.171 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   344.530 μs ± 510.946 μs  ┊ GC (mean ± σ):  0.29% ±  0.96%

      ▄  ▁                                                 ▆ ▃ ▁
  ▆▄▇██▅▄█▆▃▃▄▆▄▃▄▁▁▃▄▆▄▃▄▄▃▁▃▁▁▁▃▃▃▃▄▁▁▃▄▁▁▁▁▁▃▄▁▁▁▃▃▃▁▃▅█▇█ █
  13.8 μs       Histogram: log(frequency) by time       1.18 ms <

 Memory estimate: 2.06 KiB, allocs estimate: 49.
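
As a small cross-check (not part of the original benchmarks), note that the broadcast over eachcol computes per-column sums, whose exact fused equivalent is sum(b; dims=1); for the all-ones matrix used here, sum(b; dims=2) yields the same values and essentially the same timing. We can confirm the two approaches agree by copying the fused result back to the host:

# Hypothetical cross-check: both approaches give the per-column sums (all 10.0 here).
col_sums_broadcast = broadcast(sum, eachcol(b))   # one kernel launch per column, CPU Vector
col_sums_fused     = Array(vec(sum(b; dims=1)))   # single fused reduction, copied to the host
col_sums_broadcast ≈ col_sums_fused               # true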

More complex and involved kernels can also be generated by other tools, such as Tullio.jl. For example, the expression

$$ x_i = \sum_{j} \sum_{k} \left( b_{ji} + b_{ik} \right) $$

can be executed as a GPU kernel with

b = reshape(CUDA.CuArray(1:100), 10, 10)

CUDA.@allowscalar @tullio x[i] := b[j, i] + b[i, k]
10-element CuArray{Int64, 1}:
  5150
  6250
  7350
  8450
  9550
 10650
 11750
 12850
 13950
 15050
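
To double-check the einsum expression (a hypothetical verification, not from the workshop), the same values can be computed on the host with an explicit sum over both indices:

# Hypothetical cross-check on the CPU: sum (b[j,i] + b[i,k]) over j and k for each i.
bc = Array(b)
x_check = [sum(bc[j, i] + bc[i, k] for j in 1:10, k in 1:10) for i in 1:10]
x_check == Array(x)    # true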