© 2019, Amazon Web Services, Inc. or its Affiliates.
AWS AI Engines
2019/11
The essence of deep learning: automatic differentiation
pllarroy@
© 2019, Amazon Web Services, Inc. or its Affiliates.
Table of contents
• Backward pass overview
• Autodiff
• How gradients are calculated
• Higher order gradients
• Future directions
© 2019, Amazon Web Services, Inc. or its Affiliates.
Backward pass recap
© 2019, Amazon Web Services, Inc. or its Affiliates.
Weight updates need gradients of the Loss function
Example, ordinary least squares

$$\mathrm{Loss} := L = \frac{1}{2n}\sum_{i}^{n}(y_i - y_i')^2$$

SGD weight updates

$$w_{j+1} = w_j - \eta \, \frac{\partial}{\partial w_j} L(x_i, w_j)$$
For SGD we need the derivative of the
Loss function w.r.t. the weights evaluated
at the point given by the model inputs and
the weights.
Network:
$$y = g(x, w)$$
Loss:
$$L = f(y, y') = f(g(x, w), y')$$
© 2019, Amazon Web Services, Inc. or its Affiliates.
How do we calculate gradients? Autodiff.
• It’s called automatic differentiation “autodiff”, not autograd, which is an
implementation of autodiff.
• What is autodiff? A method to compute derivatives for a program /
function.
• Autodiff is different from symbolic differentiation. We don’t explicitly
calculate symbolic gradients. This is not only unnecessary, it is often
very difficult to find a closed-form expression. In my view this is one of
the keys that has made the latest advances in DL possible.
© 2019, Amazon Web Services, Inc. or its Affiliates.
Autodiff, two main approaches: forward and backward mode
For a set of composed functions

$$y = (f \circ g \circ h)(x) = f(g(h(x)))$$

$$f'(x) = \frac{\partial y}{\partial x} = \frac{\partial y}{\partial g}\frac{\partial g}{\partial h}\frac{\partial h}{\partial x}$$

Forward accumulation of gradients (products of large tensors, ∝ #params):

$$\frac{\partial y}{\partial g}\left(\frac{\partial g}{\partial h}\left(\frac{\partial h}{\partial x}\right)\right)$$

Back accumulation of gradients (products ∝ #outputs):

$$\left(\left(\frac{\partial y}{\partial g}\right)\frac{\partial g}{\partial h}\right)\frac{\partial h}{\partial x}$$
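A small NumPy sketch (my own, with made-up shapes) of why the accumulation order matters: with a scalar output, back accumulation keeps the intermediate products vector-sized, while forward accumulation carries matrices as wide as the parameter dimension.

import numpy as np

rng = np.random.default_rng(0)
# Chain y = f(g(h(x))) with a scalar output, written as Jacobians of made-up sizes:
J_h = rng.normal(size=(100, 1000))   # dh/dx, 1000 "parameters" coming in
J_g = rng.normal(size=(100, 100))    # dg/dh
J_f = rng.normal(size=(1, 100))      # dy/dg, single (scalar) output

# Forward accumulation: start at the input side; the intermediate product
# J_g @ J_h is a (100, 1000) matrix, i.e. proportional to #params.
dy_dx_fwd = J_f @ (J_g @ J_h)

# Back accumulation: start at the output side; the intermediate product
# J_f @ J_g is a (1, 100) row vector, i.e. proportional to #outputs.
dy_dx_bwd = (J_f @ J_g) @ J_h

assert np.allclose(dy_dx_fwd, dy_dx_bwd)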
© 2019, Amazon Web Services, Inc. or its Affiliates.
Multivariate chain rule and Jacobian vector products
For a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$ the gradient of $f$, $\nabla f$, is a vector
field $\nabla f: \mathbb{R}^n \to \mathbb{R}^n$. Let $y$ be the output of our network (of dimension $n$, with $m$
parameters). Applying the chain rule we get Jacobian vector products.

$$\nabla_y L^{T} = \left(\frac{\partial L}{\partial y_1}, \dots, \frac{\partial L}{\partial y_n}\right)$$

$$\nabla_w L = \begin{pmatrix} \dfrac{\partial y_1}{\partial w_1} & \cdots & \dfrac{\partial y_1}{\partial w_m} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_n}{\partial w_1} & \cdots & \dfrac{\partial y_n}{\partial w_m} \end{pmatrix}^{T} \begin{pmatrix} \dfrac{\partial L}{\partial y_1} \\ \vdots \\ \dfrac{\partial L}{\partial y_n} \end{pmatrix} = \left(\sum_{i=1}^{n}\frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial w_1}, \;\dots,\; \sum_{i=1}^{n}\frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial w_m}\right)^{T}$$
https://math.boisestate.edu/~jaimos/classes/m275-fall2017/notes/chain-rule.html
Multivariate chain rule
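A minimal NumPy sketch (my own illustration, made-up sizes) of the vector-Jacobian product above, checking the matrix form against the component-wise chain-rule expansion.

import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 6                        # n outputs, m parameters (made up)

J = rng.normal(size=(n, m))        # Jacobian dy/dw, shape (n, m)
v = rng.normal(size=n)             # head gradient dL/dy, shape (n,)

grad_w = J.T @ v                   # vector-Jacobian product, shape (m,)

# Same thing written out component-wise, as in the chain-rule expansion.
grad_w_explicit = np.array([sum(v[i] * J[i, j] for i in range(n)) for j in range(m)])
assert np.allclose(grad_w, grad_w_explicit)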
© 2019, Amazon Web Services, Inc. or its Affiliates.
Back accumulation of gradients: Jacobian vector products
Recall that the loss is a scalar-valued differentiable function of multiple variables,
$L: \mathbb{R}^n \to \mathbb{R}$.
The loss is a scalar function of the network output, $L = f(y)$, where $n$ is the number of outputs.
Hence, starting right to left, we have a vector of partial derivatives multiplying a matrix.

$$\mathbf{Jacobian} \cdot \begin{pmatrix} \dfrac{\partial L}{\partial y_1} \\ \vdots \\ \dfrac{\partial L}{\partial y_n} \end{pmatrix}$$

The resulting shapes are: $(m, n) \cdot (n) = (m)$

Recall that the Jacobian of a function $F: \mathbb{R}^n \to \mathbb{R}^m$ has shape $(m, n)$; the network output is a
vector-valued function, not a scalar.
© 2019, Amazon Web Services, Inc. or its Affiliates.
Back accumulation of gradients: Jacobian vector products
Formally, the relationship between the chain rule and the JVP:
For a network with $m$ parameters and output dimension $n$:

$$\frac{\partial L}{\partial w}^{T} = \frac{\partial L}{\partial y}^{T}\frac{\partial y}{\partial w} \qquad (1, m) = (1, n) \cdot (n, m)$$

$$\frac{\partial L}{\partial w} = \left(\frac{\partial y}{\partial w}\right)^{T}\frac{\partial L}{\partial y} \qquad (m, 1) = (m, n) \cdot (n, 1)$$

$\frac{\partial y}{\partial w}$ is the $n \times m$ Jacobian matrix of the outputs w.r.t. the weights (recall $y: \mathbb{R}^m \to \mathbb{R}^n$).
© 2019, Amazon Web Services, Inc. or its Affiliates.
Back accumulation of gradients: Jacobian vector products
What we really want is the gradient of the Loss w.r.t. the weights at the
current value of the weights (and the inputs).

$$\left.\frac{\partial L}{\partial w}\right|_{w = w_0}$$

We have a head gradient that comes from the previous (rightmost) element in
the chain rule.

$$\frac{\partial L}{\partial w}^{T} = \underbrace{\frac{\partial L}{\partial y}^{T}}_{\text{head gradient}} \; \underbrace{\frac{\partial y}{\partial w}}_{\text{layer output}}$$
© 2019, Amazon Web Services, Inc. or its Affiliates.
Back accumulation of gradients: head gradient
We have a head gradient that comes from the previous (rightmost) element in
the chain rule. Ex. Unary function: y = f(x)
[Diagram] Forward: input x → f → output y.
Backward: bf takes the head gradient dL/dy (together with the inputs and outputs of f) and
returns dL/dx; backward returns gradients of each input.
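A minimal Python sketch (my own illustration, not framework code) of the pattern in the diagram: a forward function paired with a backward function that maps the head gradient dL/dy to the input gradient dL/dx. The choice of exp is arbitrary.

import math

def f(x):
    # Forward: y = f(x); exp is chosen only for illustration.
    return math.exp(x)

def bf(x, y, head_grad):
    # Backward: dL/dx = dL/dy * dy/dx. For y = exp(x), dy/dx = exp(x) = y.
    return head_grad * y

x = 0.5
y = f(x)
dL_dy = 1.0              # head gradient coming from the next element in the chain
dL_dx = bf(x, y, dL_dy)  # gradient returned for the (single) input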
© 2019, Amazon Web Services, Inc. or its Affiliates.
Operator gradients example: fully connected
Fully connected op performs: $y = x w^{T} + b$
[Diagram] Inputs x, w, b → FC → output y.
Shapes:
y = x * T(w) + b
(batch, hidden) = (batch, …) * (…, hidden) + (hidden)
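A tiny NumPy shape check (my own sketch, with made-up sizes) of the forward expression above.

import numpy as np

batch, in_features, hidden = 2, 5, 3      # made-up sizes

x = np.ones((batch, in_features))         # (batch, ...)
w = np.ones((hidden, in_features))        # (hidden, ...), stored so that w.T is (..., hidden)
b = np.ones(hidden)                       # (hidden,)

y = x @ w.T + b                           # y = x * T(w) + b
assert y.shape == (batch, hidden)         # (batch, hidden) = (batch, ...) * (..., hidden) + (hidden)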
© 2019, Amazon Web Services, Inc. or its Affiliates.
Operator gradients example: first order gradient derivation for FC

$$y = x w^{T} + b$$

$$\frac{\partial L}{\partial w}^{T} = \frac{\partial L}{\partial y}^{T}\frac{\partial y}{\partial w} = h_y^{T}\, x$$

where $h_y$ denotes the head gradient $\frac{\partial L}{\partial y}$.

linalg_gemm(y_grad, x, w_grad, true, false, stream, req[fullc::kWeight]);

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial x} = h_y\, w$$

linalg_gemm(y_grad, wmat, x_grad, false, false, stream, req[fullc::kData]);
./src/operator/nn/fully_connected-inl.h FCBackward
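A NumPy sketch (my own, made-up sizes) mirroring the two gemm calls above, plus the bias gradient for completeness.

import numpy as np

rng = np.random.default_rng(2)
batch, in_features, hidden = 4, 5, 3

x = rng.normal(size=(batch, in_features))
w = rng.normal(size=(hidden, in_features))
b = rng.normal(size=hidden)

y = x @ w.T + b                             # FC forward
y_grad = rng.normal(size=(batch, hidden))   # head gradient dL/dy from the next layer

# Mirrors linalg_gemm(y_grad, x, w_grad, true, false, ...): w_grad = y_grad^T * x
w_grad = y_grad.T @ x                       # same shape as w

# Mirrors linalg_gemm(y_grad, wmat, x_grad, false, false, ...): x_grad = y_grad * w
x_grad = y_grad @ w                         # same shape as x

# Bias gradient: sum the head gradient over the batch dimension.
b_grad = y_grad.sum(axis=0)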
© 2019, Amazon Web Services, Inc. or its Affiliates.
Operator gradients example: fully connected $y = x w^{T} + b$
Backward of fully connected returns gradients of the inputs.
[Diagram] Forward: x, w, b → FC → y.
Backward: x, w, b, y and the head gradient dL/dy → bFC → dL/dx, dL/dw, dL/db.
© 2019, Amazon Web Services, Inc. or its Affiliates.
Second order gradient for fully connected: Leibniz notation starts to break.
[Diagram] Second-order backward of bFC: incoming head gradients h_y, h_x', h_w', h_b'; outputs x'', w'', b'' and h_y'.
The derivation is too complex to fit in a slide; it is also better to use other types of
notation, like $\bar{x} = \frac{\partial y}{\partial x}$, since we care most about what we are calculating the gradient w.r.t.
© 2019, Amazon Web Services, Inc. or its Affiliates.
Operator gradients example: second order gradient
derivation
The derivation of the second order gradient is too heavy to fit in a slide.
Leibniz notation for derivatives also doesn’t help in this case. The take-home
idea is that head gradients can depend on the input variables of the
computation graph, and this can’t be ignored.
Implementing higher order gradients for every operator has big problems:
• It can be very complex or practically impossible to find a closed form
• It doesn’t scale
© 2019, Amazon Web Services, Inc. or its Affiliates.
Solution
Operator code composed of differentiable primitives is already differentiable.
Ex. If we have an operator to calculate the gradients for x+y and x*y, any
composition of + and * can have n-order gradients calculated (see the sketch below).
Problem: you want to do operator fusion and optimization passes on such a
graph to avoid fragmenting operators into smaller ops and the increased
overhead: this needs a JIT and graph passes.
For the interim in MXNet we could use Xingjian’s idea: swap the backward for
a differentiable backward when a higher order gradient is requested and Fgradient is
not registered (FgradientSymbolic).
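To make the composability point concrete, here is a minimal toy sketch (my own, not MXNet code): because the gradients of + and * are themselves expressed with + and *, the gradient graph can be differentiated again to get higher-order gradients.

class Node:
    # Tiny expression-graph autodiff: the backward pass is built from the same
    # differentiable primitives (+, *), so it can be differentiated repeatedly.
    def __add__(self, other): return Add(self, wrap(other))
    def __mul__(self, other): return Mul(self, wrap(other))
    __radd__ = __add__
    __rmul__ = __mul__

class Const(Node):
    def __init__(self, value): self.value = value
    def eval(self, env): return self.value
    def grad(self, wrt): return Const(0.0)

class Var(Node):
    def __init__(self, name): self.name = name
    def eval(self, env): return env[self.name]
    def grad(self, wrt): return Const(1.0 if self is wrt else 0.0)

class Add(Node):
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, env): return self.a.eval(env) + self.b.eval(env)
    def grad(self, wrt): return self.a.grad(wrt) + self.b.grad(wrt)

class Mul(Node):
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, env): return self.a.eval(env) * self.b.eval(env)
    def grad(self, wrt):  # product rule, expressed again with + and * nodes
        return self.a.grad(wrt) * self.b + self.a * self.b.grad(wrt)

def wrap(x): return x if isinstance(x, Node) else Const(x)

x = Var("x")
f = x * x * x + 2.0 * x       # f(x) = x^3 + 2x
df = f.grad(x)                # gradient is again a graph of +, *: 3x^2 + 2
d2f = df.grad(x)              # differentiate the gradient graph once more: 6x
print(df.eval({"x": 2.0}))    # 14.0
print(d2f.eval({"x": 2.0}))   # 12.0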
© 2019, Amazon Web Services, Inc. or its Affiliates.
Operator gradients example: second order gradient derivation
For the second order gradient we could naively differentiate the first
gradient.

$$\frac{\partial}{\partial w}\left(\frac{\partial L}{\partial w}^{T}\right) = \frac{\partial}{\partial w}\left(h_y^{T}\, x\right) = 1\,?$$

But this is not correct; the reason is that the head gradient can itself depend
on the variables.
Solution: use the multivariate chain rule to account for this:

$$f(x(t)) \qquad \frac{\partial f}{\partial t} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial t}$$
© 2019, Amazon Web Services, Inc. or its Affiliates.
Two general approaches for automatic differentiation
1. Operator tracing:
Register calls to every operator to build a call-graph. This effectively
unrolls computations and loops. DL frameworks do this: Pytorch & MXNet
autograd, TF gradient tape.
2. Source-to-source transformation:
Generate code to calculate derivatives. Low overhead, doesn’t unroll
loops, supports user-defined types.
3. Dual numbers, similar to imaginary numbers: $x + x'\varepsilon$, with $\varepsilon^2 = 0$
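A minimal dual-number sketch in Python (my own illustration): carrying the derivative in the ε coefficient gives forward-mode autodiff.

from dataclasses import dataclass

@dataclass
class Dual:
    # Dual number val + eps*ε with ε^2 = 0; eps carries the derivative.
    val: float
    eps: float

    def __add__(self, other):
        other = to_dual(other)
        return Dual(self.val + other.val, self.eps + other.eps)
    __radd__ = __add__

    def __mul__(self, other):
        other = to_dual(other)
        # (a + bε)(c + dε) = ac + (ad + bc)ε, because ε^2 = 0
        return Dual(self.val * other.val, self.val * other.eps + self.eps * other.val)
    __rmul__ = __mul__

def to_dual(x):
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)

def f(x):
    return x * x * x + 2 * x     # f(x) = x^3 + 2x

out = f(Dual(2.0, 1.0))          # seed ε = 1 on the input to get df/dx
print(out.val)                   # 12.0 = f(2)
print(out.eps)                   # 14.0 = f'(2)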
© 2019, Amazon Web Services, Inc. or its Affiliates.
Operator tracing in autograd
A gradient tensor is allocated and attached to variables
Operations are traced into a call graph (mxnet.autograd.record /
torch.autograd.set_grad_enabled)
By default, initial head gradients are initialized to one(s) (in torch the output needs to
be a scalar, otherwise an explicit gradient must be passed); MXNet allocates ones of the right size.
The graph is traversed from the outputs and gradients are (back)
accumulated until the input variables get gradients.
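A short PyTorch sketch of the tracing workflow described above (my own example; MXNet's mxnet.autograd.record() follows the same attach-record-backward pattern).

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)   # gradient storage attached to the variable

with torch.autograd.set_grad_enabled(True):              # operations below are traced
    y = (x * x).sum()                                     # scalar output

y.backward()        # head gradient defaults to 1.0 for a scalar output
print(x.grad)       # tensor([2., 4., 6.]) = dy/dx, accumulated back to the input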
© 2019, Amazon Web Services, Inc. or its Affiliates.
General differentiable programming
For a function f we can attach a function similar to a backward operator which
returns gradients of the inputs given a head gradient $h$:

$$f: (x, y) \to x + y, \qquad h \to (h, h)$$
$$f: (x, y) \to x * y, \qquad h \to (h * y,\; h * x)$$
$$f: (x, y) \to x - y, \qquad h \to (h, -h)$$

You can add autodiff for any function: you need a calculation for each input,
which is a function of the head gradient, the inputs and the outputs. You can annotate
functions or transform the source to calculate gradients very efficiently. For
deep learning, tracing is usually a small overhead compared to the tensor
computations; not so for scientific computing. (A sketch of these rules driving a
reverse pass follows the reference below.)
∂P: A Differentiable Programming System to Bridge Machine Learning and Scientific Computing. https://arxiv.org/pdf/1907.07587.pdf
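A minimal sketch (my own; the tape and variable names are made up) of how the three backward rules above drive a reverse pass over a recorded trace.

# Backward rules from the slide: each maps the head gradient h (plus the saved
# inputs) to one gradient per input.
backward_rules = {
    "add": lambda h, x, y: (h, h),
    "mul": lambda h, x, y: (h * y, h * x),
    "sub": lambda h, x, y: (h, -h),
}

# A recorded trace for L = (a * b) - a, stored as (op, inputs, output) triples.
a, b = 3.0, 4.0
t1 = a * b
L = t1 - a
values = {"a": a, "b": b, "t1": t1, "L": L}
tape = [("mul", ("a", "b"), "t1"), ("sub", ("t1", "a"), "L")]

# Reverse pass: seed the head gradient of the final output with 1 and walk the
# tape backwards, accumulating gradients into every variable.
grads = {"L": 1.0}
for op, (in1, in2), out in reversed(tape):
    h = grads.get(out, 0.0)
    g1, g2 = backward_rules[op](h, values[in1], values[in2])
    grads[in1] = grads.get(in1, 0.0) + g1
    grads[in2] = grads.get(in2, 0.0) + g2

print(grads["a"], grads["b"])   # dL/da = b - 1 = 3.0, dL/db = a = 3.0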
© 2019, Amazon Web Services, Inc. or its Affiliates.
MXNet
Each operator defines a backward function; it is only further differentiable if implemented in terms of NNVM ops.
Usually implemented in C++, but it can be in Python as well.
Ex. Fully connected:
// backprop
CHECK_NE(req[fullc::kWeight], kWriteInplace) << "cannot write weight inplace";
// gradient of weight
Tensor<xpu, 2, DType> w_grad = in_grad[fullc::kWeight].get<xpu, 2, DType>(stream);
linalg_gemm(y_grad, x, w_grad, true, false, stream, req[fullc::kWeight]);
// gradient of bias
if (!param.no_bias) {
  AddBiasGrad(in_grad[fullc::kBias], y_grad, req[fullc::kBias], param.num_hidden, ctx);
}
// gradient of data
// Legacy approach shown here for comparison:
// Assign(x_grad, req[fullc::kData], dot(y_grad, wmat));
linalg_gemm(y_grad, wmat, x_grad, false, false, stream, req[fullc::kData]);
https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/fully_connected-inl.h#L365
© 2019, Amazon Web Services, Inc. or its Affiliates.
TF
Implemented in Python
@ops.RegisterGradient("Xdivy")
def _XDivyGrad(op, grad):
  """Returns gradient of xdivy(x, y) with respect to x and y."""
  x = op.inputs[0]
  y = op.inputs[1]
  sx = array_ops.shape(x)
  sy = array_ops.shape(y)
  rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)
  with ops.control_dependencies([grad]):
    not_zero_x = math_ops.cast(
        math_ops.not_equal(x, math_ops.cast(0., dtype=x.dtype)), dtype=x.dtype)
    partial_x = gen_math_ops.xdivy(not_zero_x, y)
    partial_y = gen_math_ops.xdivy(math_ops.negative(x), y**2)
    return (array_ops.reshape(math_ops.reduce_sum(partial_x * grad, rx), sx),
            array_ops.reshape(math_ops.reduce_sum(partial_y * grad, ry), sy))
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/math_grad.py#L697-L714
© 2019, Amazon Web Services, Inc. or its Affiliates.
Gradient definition in PyTorch
- name: div.Tensor(Tensor self, Tensor other) -> Tensor
  self: grad / other
  other: -grad * self / (other * other)

- name: mul.Tensor(Tensor self, Tensor other) -> Tensor
  self: grad * other
  other: grad * self
https://github.com/pytorch/pytorch/blob/master/tools/autograd/derivatives.yaml
© 2019, Amazon Web Services, Inc. or its Affiliates.
FIN