© 2019, Amazon Web Services, Inc. or its Affiliates.
AWS AI Engines
2019/11
The essence of deep learning: automatic differentiation
pllarroy@
© 2019, Amazon Web Services, Inc. or its Affiliates.
Table of contents
• Backward pass overview
• Autodiff
• How gradients are calculated
• Higher order gradients
• Future directions
© 2019, Amazon Web Services, Inc. or its Affiliates.
Backward pass recap
© 2019, Amazon Web Services, Inc. or its Affiliates.
Weight updates need gradients of the Loss function
Example, ordinary least squares

$$\mathrm{Loss} := L = \frac{1}{2n}\sum_{i}^{n}(y_i - y_i')^2$$

SGD weight updates

$$w_{j+1} = w_j - \eta \, \frac{\partial}{\partial w_j} L(x_i, w_j)$$
For SGD we need the derivative of the
Loss function w.r.t. the weights evaluated
at the point given by the model inputs and
the weights.
Network:
$$y = g(x, w)$$
Loss:
$$L = f(y, y') = f(g(x, w), y')$$
© 2019, Amazon Web Services, Inc. or its Affiliates.
How do we calculate gradients? Autodiff.
• It’s called automatic differentiation “autodiff”, not autograd, which is an
implementation of autodiff.
• What is autodiff? A method to compute derivatives for a program /
function.
• Autodiff is different from symbolic differentiation. We don’t explicitly
calculate symbolic gradients. This is not only unnecessary, it is often
very difficult to find a closed-form expression. In my view this is one of
the keys that has made the latest advances in DL possible.
© 2019, Amazon Web Services, Inc. or its Affiliates.
Autodiff, two main approaches: forward and backward mode
For a set of composed functions

$$y = (f \circ g \circ h)(x) = f(g(h(x)))$$

$$f'(x) = \frac{\partial y}{\partial x} = \frac{\partial y}{\partial g}\frac{\partial g}{\partial h}\frac{\partial h}{\partial x}$$

Forward accumulation of gradients (products of large tensors, ∝ #params):

$$\frac{\partial y}{\partial g}\left(\frac{\partial g}{\partial h}\left(\frac{\partial h}{\partial x}\right)\right)$$

Back accumulation of gradients (products ∝ #outputs):

$$\left(\left(\frac{\partial y}{\partial g}\right)\frac{\partial g}{\partial h}\right)\frac{\partial h}{\partial x}$$
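A small NumPy sketch (my own, with made-up shapes) of why the accumulation order matters: with a scalar output, back accumulation keeps the intermediate products vector-sized, while forward accumulation carries matrices as wide as the parameter dimension.

import numpy as np

rng = np.random.default_rng(0)
# Chain y = f(g(h(x))) with a scalar output, written as Jacobians of made-up sizes:
J_h = rng.normal(size=(100, 1000))   # dh/dx, 1000 "parameters" coming in
J_g = rng.normal(size=(100, 100))    # dg/dh
J_f = rng.normal(size=(1, 100))      # dy/dg, single (scalar) output

# Forward accumulation: start at the input side; the intermediate product
# J_g @ J_h is a (100, 1000) matrix, i.e. proportional to #params.
dy_dx_fwd = J_f @ (J_g @ J_h)

# Back accumulation: start at the output side; the intermediate product
# J_f @ J_g is a (1, 100) row vector, i.e. proportional to #outputs.
dy_dx_bwd = (J_f @ J_g) @ J_h

assert np.allclose(dy_dx_fwd, dy_dx_bwd)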
© 2019, Amazon Web Services, Inc. or its Affiliates.
Multivariate chain rule and Jacobian vector products
For a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$ the gradient of $f$, $\nabla f$, is a vector
field $\nabla f: \mathbb{R}^n \to \mathbb{R}^n$. Let $y$ be the output of our network (of dimension $n$, with $m$
parameters). Applying the chain rule we get Jacobian vector products.

$$\nabla_y L^{T} = \left(\frac{\partial L}{\partial y_1}, \dots, \frac{\partial L}{\partial y_n}\right)$$

$$\nabla_w L = \begin{pmatrix} \dfrac{\partial y_1}{\partial w_1} & \cdots & \dfrac{\partial y_1}{\partial w_m} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_n}{\partial w_1} & \cdots & \dfrac{\partial y_n}{\partial w_m} \end{pmatrix}^{T} \begin{pmatrix} \dfrac{\partial L}{\partial y_1} \\ \vdots \\ \dfrac{\partial L}{\partial y_n} \end{pmatrix} = \left(\sum_{i=1}^{n}\frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial w_1}, \;\dots,\; \sum_{i=1}^{n}\frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial w_m}\right)^{T}$$
https://math.boisestate.edu/~jaimos/classes/m275-fall2017/notes/chain-rule.html
Multivariate chain rule
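A minimal NumPy sketch (my own illustration, made-up sizes) of the vector-Jacobian product above, checking the matrix form against the component-wise chain-rule expansion.

import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 6                        # n outputs, m parameters (made up)

J = rng.normal(size=(n, m))        # Jacobian dy/dw, shape (n, m)
v = rng.normal(size=n)             # head gradient dL/dy, shape (n,)

grad_w = J.T @ v                   # vector-Jacobian product, shape (m,)

# Same thing written out component-wise, as in the chain-rule expansion.
grad_w_explicit = np.array([sum(v[i] * J[i, j] for i in range(n)) for j in range(m)])
assert np.allclose(grad_w, grad_w_explicit)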
© 2019, Amazon Web Services, Inc. or its Affiliates.
Back accumulation of gradients: Jacobian vector products
Recall that the loss is a scalar-valued differentiable function of multiple variables,
$L: \mathbb{R}^n \to \mathbb{R}$.
The loss is a scalar function of the network output, $L = f(y)$, where $n$ is the number of outputs.
Hence, starting right to left, we have a vector of partial derivatives multiplying a matrix.

$$\mathbf{Jacobian} \cdot \begin{pmatrix} \dfrac{\partial L}{\partial y_1} \\ \vdots \\ \dfrac{\partial L}{\partial y_n} \end{pmatrix}$$

The resulting shapes are: $(m, n) \cdot (n) = (m)$

Recall that the Jacobian of a function $F: \mathbb{R}^n \to \mathbb{R}^m$ has shape $(m, n)$; the network output is a
vector-valued function, not a scalar.
© 2019, Amazon Web Services, Inc. or its Affiliates.
Back accumulation of gradients: Jacobian vector products
Formally, the relationship between the chain rule and the JVP:
For a network with $m$ parameters and output dimension $n$:

$$\frac{\partial L}{\partial w}^{T} = \frac{\partial L}{\partial y}^{T}\frac{\partial y}{\partial w} \qquad (1, m) = (1, n) \cdot (n, m)$$

$$\frac{\partial L}{\partial w} = \left(\frac{\partial y}{\partial w}\right)^{T}\frac{\partial L}{\partial y} \qquad (m, 1) = (m, n) \cdot (n, 1)$$

$\frac{\partial y}{\partial w}$ is the $n \times m$ Jacobian matrix of the outputs w.r.t. the weights (recall $y: \mathbb{R}^m \to \mathbb{R}^n$).
© 2019, Amazon Web Services, Inc. or its Affiliates.
Back accumulation of gradients: Jacobian vector products
What we really want is the gradient of the Loss w.r.t. the weights at the
current value of the weights (and the inputs).

$$\left.\frac{\partial L}{\partial w}\right|_{w = w_0}$$

We have a head gradient that comes from the previous (rightmost) element in
the chain rule.

$$\frac{\partial L}{\partial w}^{T} = \underbrace{\frac{\partial L}{\partial y}^{T}}_{\text{head gradient}} \; \underbrace{\frac{\partial y}{\partial w}}_{\text{layer output}}$$
© 2019, Amazon Web Services, Inc. or its Affiliates.
Back accumulation of gradients: head gradient
We have a head gradient that comes from the previous (rightmost) element in
the chain rule. Ex. Unary function: y = f(x)
[Diagram] Forward: input x → f → output y.
Backward: bf takes the head gradient dL/dy (together with the inputs and outputs of f) and
returns dL/dx; backward returns gradients of each input.
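A minimal Python sketch (my own illustration, not framework code) of the pattern in the diagram: a forward function paired with a backward function that maps the head gradient dL/dy to the input gradient dL/dx. The choice of exp is arbitrary.

import math

def f(x):
    # Forward: y = f(x); exp is chosen only for illustration.
    return math.exp(x)

def bf(x, y, head_grad):
    # Backward: dL/dx = dL/dy * dy/dx. For y = exp(x), dy/dx = exp(x) = y.
    return head_grad * y

x = 0.5
y = f(x)
dL_dy = 1.0              # head gradient coming from the next element in the chain
dL_dx = bf(x, y, dL_dy)  # gradient returned for the (single) input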
© 2019, Amazon Web Services, Inc. or its Affiliates.
Operator gradients example: fully connected
Fully connected op performs: $y = x w^{T} + b$
[Diagram] Inputs x, w, b → FC → output y.
Shapes:
y = x * T(w) + b
(batch, hidden) = (batch, …) * (…, hidden) + (hidden)
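A tiny NumPy shape check (my own sketch, with made-up sizes) of the forward expression above.

import numpy as np

batch, in_features, hidden = 2, 5, 3      # made-up sizes

x = np.ones((batch, in_features))         # (batch, ...)
w = np.ones((hidden, in_features))        # (hidden, ...), stored so that w.T is (..., hidden)
b = np.ones(hidden)                       # (hidden,)

y = x @ w.T + b                           # y = x * T(w) + b
assert y.shape == (batch, hidden)         # (batch, hidden) = (batch, ...) * (..., hidden) + (hidden)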
© 2019, Amazon Web Services, Inc. or its Affiliates.
Operator gradients example: first order gradient derivation for FC

$$y = x w^{T} + b$$

$$\frac{\partial L}{\partial w}^{T} = \frac{\partial L}{\partial y}^{T}\frac{\partial y}{\partial w} = h_y^{T}\, x$$

where $h_y$ denotes the head gradient $\frac{\partial L}{\partial y}$.

linalg_gemm(y_grad, x, w_grad, true, false, stream, req[fullc::kWeight]);

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial x} = h_y\, w$$

linalg_gemm(y_grad, wmat, x_grad, false, false, stream, req[fullc::kData]);
./src/operator/nn/fully_connected-inl.h FCBackward
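A NumPy sketch (my own, made-up sizes) mirroring the two gemm calls above, plus the bias gradient for completeness.

import numpy as np

rng = np.random.default_rng(2)
batch, in_features, hidden = 4, 5, 3

x = rng.normal(size=(batch, in_features))
w = rng.normal(size=(hidden, in_features))
b = rng.normal(size=hidden)

y = x @ w.T + b                             # FC forward
y_grad = rng.normal(size=(batch, hidden))   # head gradient dL/dy from the next layer

# Mirrors linalg_gemm(y_grad, x, w_grad, true, false, ...): w_grad = y_grad^T * x
w_grad = y_grad.T @ x                       # same shape as w

# Mirrors linalg_gemm(y_grad, wmat, x_grad, false, false, ...): x_grad = y_grad * w
x_grad = y_grad @ w                         # same shape as x

# Bias gradient: sum the head gradient over the batch dimension.
b_grad = y_grad.sum(axis=0)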
© 2019, Amazon Web Services, Inc. or its Affiliates.
Operator gradients example: fully connected $y = x w^{T} + b$
Backward of fully connected returns gradients of the inputs.
[Diagram] Forward: x, w, b → FC → y.
Backward: x, w, b, y and the head gradient dL/dy → bFC → dL/dx, dL/dw, dL/db.
© 2019, Amazon Web Services, Inc. or its Affiliates.
Second order gradient for fully connected: Leibniz notation starts to break.
[Diagram] Second-order backward of bFC: incoming head gradients h_y, h_x', h_w', h_b'; outputs x'', w'', b'' and h_y'.
The derivation is too complex to fit in a slide; it is also better to use other types of
notation, like $\bar{x} = \frac{\partial y}{\partial x}$, since we care most about what we are calculating the gradient w.r.t.
© 2019, Amazon Web Services, Inc. or its Affiliates.
Operator gradients example: second order gradient
derivation
The derivation of the second order gradient is too heavy to fit in a slide.
Leibniz notation for derivatives also doesn’t help in this case. The take-home
idea is that head gradients can depend on the input variables of the
computation graph, and this can’t be ignored.
Implementing higher order gradients for every operator has big problems:
• It can be very complex or practically impossible to find a closed form
• It doesn’t scale
© 2019, Amazon Web Services, Inc. or its Affiliates.
Solution
Operator code composed of differentiable primitives is already differentiable.
Ex. If we have an operator to calculate the gradients for x+y and x*y, any
composition of + and * can have n-order gradients calculated (see the sketch below).
Problem: you want to do operator fusion and optimization passes on such a
graph to avoid fragmenting operators into smaller ops and the increased
overhead: this needs a JIT and graph passes.
For the interim in MXNet we could use Xingjian’s idea: swap the backward for
a differentiable backward when a higher order gradient is requested and Fgradient is
not registered (FgradientSymbolic).
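To make the composability point concrete, here is a minimal toy sketch (my own, not MXNet code): because the gradients of + and * are themselves expressed with + and *, the gradient graph can be differentiated again to get higher-order gradients.

class Node:
    # Tiny expression-graph autodiff: the backward pass is built from the same
    # differentiable primitives (+, *), so it can be differentiated repeatedly.
    def __add__(self, other): return Add(self, wrap(other))
    def __mul__(self, other): return Mul(self, wrap(other))
    __radd__ = __add__
    __rmul__ = __mul__

class Const(Node):
    def __init__(self, value): self.value = value
    def eval(self, env): return self.value
    def grad(self, wrt): return Const(0.0)

class Var(Node):
    def __init__(self, name): self.name = name
    def eval(self, env): return env[self.name]
    def grad(self, wrt): return Const(1.0 if self is wrt else 0.0)

class Add(Node):
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, env): return self.a.eval(env) + self.b.eval(env)
    def grad(self, wrt): return self.a.grad(wrt) + self.b.grad(wrt)

class Mul(Node):
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, env): return self.a.eval(env) * self.b.eval(env)
    def grad(self, wrt):  # product rule, expressed again with + and * nodes
        return self.a.grad(wrt) * self.b + self.a * self.b.grad(wrt)

def wrap(x): return x if isinstance(x, Node) else Const(x)

x = Var("x")
f = x * x * x + 2.0 * x       # f(x) = x^3 + 2x
df = f.grad(x)                # gradient is again a graph of +, *: 3x^2 + 2
d2f = df.grad(x)              # differentiate the gradient graph once more: 6x
print(df.eval({"x": 2.0}))    # 14.0
print(d2f.eval({"x": 2.0}))   # 12.0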
© 2019, Amazon Web Services, Inc. or its Affiliates.
Operator gradients example: second order gradient derivation
For the second order gradient we could naively differentiate the first
gradient.

$$\frac{\partial}{\partial w}\left(\frac{\partial L}{\partial w}^{T}\right) = \frac{\partial}{\partial w}\left(h_y^{T}\, x\right) = 1\,?$$

But this is not correct; the reason is that the head gradient can itself depend
on the variables.
Solution: use the multivariate chain rule to account for this:

$$f(x(t)) \qquad \frac{\partial f}{\partial t} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial t}$$
© 2019, Amazon Web Services, Inc. or its Affiliates.
Two general approaches for automatic differentiation
1. Operator tracing:
Register calls to every operator to build a call-graph. This effectively
unrolls computations and loops. DL frameworks do this: Pytorch & MXNet
autograd, TF gradient tape.
2. Source-to-source transformation:
Generate code to calculate derivatives. Low overhead, doesn’t unroll
loops, supports user-defined types.
3. Dual numbers, similar to imaginary numbers: $x + x'\varepsilon$, with $\varepsilon^2 = 0$
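A minimal dual-number sketch in Python (my own illustration): carrying the derivative in the ε coefficient gives forward-mode autodiff.

from dataclasses import dataclass

@dataclass
class Dual:
    # Dual number val + eps*ε with ε^2 = 0; eps carries the derivative.
    val: float
    eps: float

    def __add__(self, other):
        other = to_dual(other)
        return Dual(self.val + other.val, self.eps + other.eps)
    __radd__ = __add__

    def __mul__(self, other):
        other = to_dual(other)
        # (a + bε)(c + dε) = ac + (ad + bc)ε, because ε^2 = 0
        return Dual(self.val * other.val, self.val * other.eps + self.eps * other.val)
    __rmul__ = __mul__

def to_dual(x):
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)

def f(x):
    return x * x * x + 2 * x     # f(x) = x^3 + 2x

out = f(Dual(2.0, 1.0))          # seed ε = 1 on the input to get df/dx
print(out.val)                   # 12.0 = f(2)
print(out.eps)                   # 14.0 = f'(2)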
© 2019, Amazon Web Services, Inc. or its Affiliates.
Operator tracing in autograd
A gradient tensor is allocated and attached to variables
Operations are traced into a call graph (mxnet.autograd.record /
torch.autograd.set_grad_enabled)
By default, initial head gradients are initialized to one(s) (in torch the output needs to
be a scalar, otherwise an explicit gradient must be passed); MXNet allocates ones of the right size.
The graph is traversed from the outputs and gradients are (back)
accumulated until the input variables get gradients.
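A short PyTorch sketch of the tracing workflow described above (my own example; MXNet's mxnet.autograd.record() follows the same attach-record-backward pattern).

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)   # gradient storage attached to the variable

with torch.autograd.set_grad_enabled(True):              # operations below are traced
    y = (x * x).sum()                                     # scalar output

y.backward()        # head gradient defaults to 1.0 for a scalar output
print(x.grad)       # tensor([2., 4., 6.]) = dy/dx, accumulated back to the input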
© 2019, Amazon Web Services, Inc. or its Affiliates.
General differentiable programming
For a function f we can attach a function similar to a backward operator which
returns gradients of the inputs given a head gradient $h$:

$$f: (x, y) \to x + y, \qquad h \to (h, h)$$
$$f: (x, y) \to x * y, \qquad h \to (h * y,\; h * x)$$
$$f: (x, y) \to x - y, \qquad h \to (h, -h)$$

You can add autodiff for any function: you need a calculation for each input,
which is a function of the head gradient, the inputs and the outputs. You can annotate
functions or transform the source to calculate gradients very efficiently. For
deep learning, tracing is usually a small overhead compared to the tensor
computations; not so for scientific computing. (A sketch of these rules driving a
reverse pass follows the reference below.)
∂P: A Differentiable Programming System to Bridge Machine Learning and Scientific Computing. https://arxiv.org/pdf/1907.07587.pdf
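A minimal sketch (my own; the tape and variable names are made up) of how the three backward rules above drive a reverse pass over a recorded trace.

# Backward rules from the slide: each maps the head gradient h (plus the saved
# inputs) to one gradient per input.
backward_rules = {
    "add": lambda h, x, y: (h, h),
    "mul": lambda h, x, y: (h * y, h * x),
    "sub": lambda h, x, y: (h, -h),
}

# A recorded trace for L = (a * b) - a, stored as (op, inputs, output) triples.
a, b = 3.0, 4.0
t1 = a * b
L = t1 - a
values = {"a": a, "b": b, "t1": t1, "L": L}
tape = [("mul", ("a", "b"), "t1"), ("sub", ("t1", "a"), "L")]

# Reverse pass: seed the head gradient of the final output with 1 and walk the
# tape backwards, accumulating gradients into every variable.
grads = {"L": 1.0}
for op, (in1, in2), out in reversed(tape):
    h = grads.get(out, 0.0)
    g1, g2 = backward_rules[op](h, values[in1], values[in2])
    grads[in1] = grads.get(in1, 0.0) + g1
    grads[in2] = grads.get(in2, 0.0) + g2

print(grads["a"], grads["b"])   # dL/da = b - 1 = 3.0, dL/db = a = 3.0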
© 2019, Amazon Web Services, Inc. or its Affiliates.
MXNet
Each operator defines a backward function; it is only further differentiable if implemented in terms of NNVM ops.
Usually implemented in C++, but it can be in Python as well.
Ex. Fully connected:
// backprop
CHECK_NE(req[fullc::kWeight], kWriteInplace) << "cannot write weight inplace";
// gradient of weight
Tensor<xpu, 2, DType> w_grad = in_grad[fullc::kWeight].get<xpu, 2, DType>(stream);
linalg_gemm(y_grad, x, w_grad, true, false, stream, req[fullc::kWeight]);
// gradient of bias
if (!param.no_bias) {
  AddBiasGrad(in_grad[fullc::kBias], y_grad, req[fullc::kBias], param.num_hidden, ctx);
}
// gradient of data
// Legacy approach shown here for comparison:
// Assign(x_grad, req[fullc::kData], dot(y_grad, wmat));
linalg_gemm(y_grad, wmat, x_grad, false, false, stream, req[fullc::kData]);
https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/fully_connected-inl.h#L365
© 2019, Amazon Web Services, Inc. or its Affiliates.
TF
Implemented in Python
@ops.RegisterGradient("Xdivy")
def _XDivyGrad(op, grad):
  """Returns gradient of xdivy(x, y) with respect to x and y."""
  x = op.inputs[0]
  y = op.inputs[1]
  sx = array_ops.shape(x)
  sy = array_ops.shape(y)
  rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)
  with ops.control_dependencies([grad]):
    not_zero_x = math_ops.cast(
        math_ops.not_equal(x, math_ops.cast(0., dtype=x.dtype)), dtype=x.dtype)
    partial_x = gen_math_ops.xdivy(not_zero_x, y)
    partial_y = gen_math_ops.xdivy(math_ops.negative(x), y**2)
    return (array_ops.reshape(math_ops.reduce_sum(partial_x * grad, rx), sx),
            array_ops.reshape(math_ops.reduce_sum(partial_y * grad, ry), sy))
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/math_grad.py#L697-L714
© 2019, Amazon Web Services, Inc. or its Affiliates.
Gradient definition in PyTorch
- name: div.Tensor(Tensor self, Tensor other) -> Tensor
  self: grad / other
  other: -grad * self / (other * other)

- name: mul.Tensor(Tensor self, Tensor other) -> Tensor
  self: grad * other
  other: grad * self
https://github.com/pytorch/pytorch/blob/master/tools/autograd/derivatives.yaml
© 2019, Amazon Web Services, Inc. or its Affiliates.
FIN