Automatic differentiation

Automatic Differentiation (see Automatic Differentiation in Machine Learning: a Survey) is a method of algorithmic differentiation that computes derivatives on the fly by repeated application of the chain rule

\[\frac{\text{d}}{\text{d}x}f \left( g \left( x \right) \right) = \frac{\text{d}f}{\text{d}g} \frac{\text{d}g}{\text{d}x}.\]

For any given expression, we can draw a computational graph of its operations and thus calculate the total chain-rule derivative by hand using the above formula. However, auto-diff can compute this for us using the concept of Dual Numbers, together with corollaries of the chain rule (sum rule, product rule, quotient rule, sin, log, etc.).
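For example, for $f(g(x)) = \sin \left( x^2 \right)$ the graph has just two nodes, $g(x) = x^2$ feeding into $\sin$, and the chain rule gives

\[\frac{\text{d}}{\text{d}x} \sin \left( x^2 \right) = \cos \left( x^2 \right) \cdot 2x.\]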

A Dual Number is the one-dimensional Grassmann algebra over $\mathbb{R}$. It is simply defined, for some $a, b \in \mathbb{R}$, as

\[a + \varepsilon b,\]

such that $\varepsilon^2 = 0$.

Heuristically, one can consider this as a value $a$, along with its differential component $b$.
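Multiplying two dual numbers shows why this representation works: since $\varepsilon^2 = 0$, the $\varepsilon$ component automatically carries the product rule,

\[\left( a_1 + \varepsilon b_1 \right) \left( a_2 + \varepsilon b_2 \right) = a_1 a_2 + \varepsilon \left( a_1 b_2 + b_1 a_2 \right) + \varepsilon^2 b_1 b_2 = a_1 a_2 + \varepsilon \left( a_1 b_2 + b_1 a_2 \right).\]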

A simple Julia implementation

We define our dual number as a struct; here, the field x holds the value $a$, and the field ϵ holds its derivative component $b$:

struct D <: Number
    x::Float64  # value a
    ϵ::Float64  # derivative component b
end

We then define corollaries of the chain rule so that our Dual Number can be used:

import Base: +, -, *, /, sin, log, convert, promote_rule

# sum & difference rules
a::D + b::D = D(a.x + b.x, a.ϵ + b.ϵ)
a::D - b::D = D(a.x - b.x, a.ϵ - b.ϵ)

# product rule
a::D * b::D = D(a.x * b.x, (a.x * b.ϵ) + (a.ϵ * b.x) )

# quotient rule
a::D / b::D = D(a.x / b.x, ( (b.x * a.ϵ) - (a.x * b.ϵ) ) / (b.x^2))

# sin & log
sin(a::D) = D(sin(a.x), cos(a.x) * a.ϵ)
log(a::D) = D(log(a.x), 1/a.x * a.ϵ)

# conversion
Base.convert(::Type{D}, x::Real) = D(x, zero(x))

# always promote to Dual
Base.promote_rule(::Type{D}, ::Type{<:Number}) = D
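
As a quick sanity check of these rules (a hypothetical snippet, not part of the original listing): squaring a dual number seeded at $x = 2$ should put the derivative $2x = 4$ in the ϵ slot:

D(2.0, 1.0) * D(2.0, 1.0)
D(4.0, 4.0)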

We can try it out on any function of choice: for some

\[f(a, b) = \log \left( ab + \sin(a) \right),\]

we want to compute

\[\frac{\text{d}f}{\text{d}a}.\]

In Julia with our auto-diff:

f(a, b) = log(a * b + sin(a))

a = 3.1
b = 2.4

f(D(a, 1), b)
D(2.0124440881688996, 0.18724182935843758)

Note, we set $\epsilon = 1$ as it represents $\text{d}a / \text{d}a = 1$, the seed for differentiating with respect to $a$ in the chain rule expansion.
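
Seeding $b$ instead gives the other partial derivative, for example (a hypothetical extra line, reusing the values above):

f(a, D(b, 1)).ϵ  # ∂f/∂b = a / (a*b + sin(a)) ≈ 0.414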

Comparing to a symbolic result:

f_derivative(a,b) = 1/(a*b + sin(a)) * (b + cos(a)) # symbolic derivative

f_derivative(a, b)
0.18724182935843758
f(D(a,1), b).ϵ ≈ f_derivative(a, b)
true

We can generalize the above, with some impressive results, by adding a simple convenience function:

derivative(f::Function, x::Number) = f(D(x, one(x))).ϵ
derivative (generic function with 1 method)
derivative(x -> 3*x^2, 5) # -> 6 * x eval at x = 5
30.0
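
The convenience function works for any composition of the rules we defined; for instance, the derivative of $\log(x)$ at $x = 2$ should be $1/2$:

derivative(log, 2.0)
0.5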

Forward and reverse mode

The concept of forward mode and reverse mode is often discussed in auto-diff literature. A nice overview of the differences is given by Mike Innes.

We’ll use packages for this later. In brief (see the sketch after this list):

  • Forward mode: use for $\mathbb{R} \rightarrow \mathbb{R}^n$

  • Reverse mode: use for $\mathbb{R}^n \rightarrow \mathbb{R}$
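
To see why this split matters, consider computing a full gradient with our forward-mode dual numbers: as in the ∂f/∂b example earlier, every input needs its own seeded pass, so a function $\mathbb{R}^n \rightarrow \mathbb{R}$ costs $n$ forward evaluations, whereas reverse mode recovers all $n$ partials in a single backward sweep. A minimal sketch using the f and D defined above (grad_forward is a hypothetical helper, not from any package):

grad_forward(f, a, b) = (f(D(a, 1), b).ϵ, f(a, D(b, 1)).ϵ)  # one seeded pass per argument
grad_forward(f, 3.1, 2.4)  # ≈ (0.187, 0.414)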

Example: Babylonian square root

The Babylonian square root is an iterative algorithm for calculating the square root of a number to a desired precision.

In short, it repeatedly applies

\[t \mapsto \frac{1}{2} \left( t + \frac{x}{t} \right)\]

until $t$ converges on $\sqrt{x}$. In Julia:

@inline function Babylonian(x; N=10)
    t = (1 + x) / 2
    for i = 2:N
        t = (t + x/t) / 2
    end
    t
end
Babylonian (generic function with 1 method)

Comparing the accuracy of the algorithm:

Babylonian(2), √2
(1.414213562373095, 1.4142135623730951)

With this implementation, we obtain the derivative for free:

Babylonian(D(5, 1))
D(2.23606797749979, 0.22360679774997896)
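
Since $\frac{\text{d}}{\text{d}x} \sqrt{x} = \frac{1}{2\sqrt{x}}$, we can check the dual result against the analytic derivative:

Babylonian(D(5, 1)).ϵ ≈ 1 / (2 * √5)
true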

Note that the LLVM optimizer emits parallelized (SIMD) instructions for the above (check with @code_native).

Forward mode with ForwardDiff.jl

Our implementation above is available in a much more extensive and complete form in the ForwardDiff.jl package (and the associated DiffRules.jl).

ForwardDiff provides the concept of a dual number for us, as well as all the requisite chain rule applications. It is extremely simple to use:

using ForwardDiff

ForwardDiff.derivative(Babylonian, 5)
0.223606797749979
?ForwardDiff.derivative
ForwardDiff.derivative(f, x::Real)

Return df/dx evaluated at x, assuming f is called as f(x).

This method assumes that isa(f(x), Union{Real,AbstractArray}).


ForwardDiff.derivative(f!, y::AbstractArray, x::Real, cfg::DerivativeConfig = DerivativeConfig(f!, y, x), check=Val{true}())

Return df!/dx evaluated at x, assuming f! is called as f!(y, x) where the result is stored in y.

Set check to Val{false}() to disable tag checking. This can lead to perturbation confusion, so should be used with care.
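
ForwardDiff also provides gradients of functions of a vector argument; as a quick sketch, applied to the $f$ from earlier (approximate values in the comment):

ForwardDiff.gradient(v -> f(v[1], v[2]), [3.1, 2.4])  # ≈ [0.187, 0.414]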

Reverse mode with Zygote.jl

An implementation of reverse-mode auto-diff is provided by Zygote.jl, which is the basis of much of the Julia ML ecosystem.

Similarly to ForwardDiff, Zygote is very straightforward to use:

using Zygote

gradient(Babylonian, 5)
(0.22360679774997896,)
?gradient
search: gradient
gradient(f, args...)

Returns a tuple containing ∂f/∂x for each argument x, the derivative (for scalar x) or the gradient.

f(args...) must be a real number, see jacobian for array output.
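
Because gradient accepts multiple arguments, we can recover both partials of the earlier $f$ in a single reverse pass (approximate values in the comment):

gradient(f, 3.1, 2.4)  # ≈ (0.187, 0.414)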