Introduction to Neural Networks and Deep Learning
Backpropagation and Automatic Differentiation
Andres Mendez-Vazquez
November 4, 2019
Outline
1 Backpropagation
Introduction
Derivatives of Network Functions
Function Composition, Weights and Addition
The Backpropagation Algorithm Works
Moving everything to Tensors
2 Automatic Differentiation
Introduction
Advantages of Automatic Differentiation
Avoiding Truncation Errors
Differences with Symbolic Differentiation
Difference Quotients May be Useful
RNN Example
A Simple Example
The Forward and Reverse Mode
Forward propagation of Tangents
Forward Mode of a ML Perceptron
Complexity of the Forward Procedure
The Reverse Mode
Dual Process in Reverse Process
Incremental Adjoint Recursion
Example
What Method to Use Forward or Reverse Mode?
3 Basic Implementation of Automatic Differentiation
Source Transformation and Overloading
Building the Computational Graph
Memory Structures
Way More...
4 Conclusions
The Problem of Backpropagation
A Remarkable Revenant
This algorithm has been used by many communities
Discovered and rediscovered until, in 1985, it reached the AI community [1]
Basically
The basis of modern neural networks
One Big Problem, a Lot of Local Minima
A Lot of Them!!!
[Figure: error surface with many local minima]
This is due to the fact that
Yes, we have a convex function
$$\frac{1}{2}\left(z_i - t_i\right)^2$$
With an intermediate non-linear activation function
$$z_i = f\left(\sum_{j=1}^{d} w_{ij} y_j\right)$$
Making the surface to be searched for the optimum
A non-linear function map from $\mathbb{R}^d$ to $\mathbb{R}^m$
Recall The Learning Problem
Neural Networks
You can see the network as a computational graph...
Transmitting information from node to node...
Therefore, the network
It is a particular implementation of a composite function from input
space to output space.
Extended Network
The computation of the error by the network [2]
[Figure: extended network computing the error E]
Thus
The network can calculate the total error
$$E = \sum_{i=1}^{N} E_i$$
Therefore, the network can be updated using
$$\nabla E = \left(\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_l}\right)$$
$$\Delta w_i = -\gamma \frac{\partial E}{\partial w_i} \quad \text{for } i = 1, \ldots, l$$
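A minimal NumPy sketch of this update rule (illustrative only; the quadratic loss and step size are my own assumptions):

```python
import numpy as np

def gradient_descent_step(w, grad_E, gamma=0.1):
    # Delta w_i = -gamma * dE/dw_i, applied to all weights at once
    return w - gamma * grad_E

# Toy example: E(w) = ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0, 0.5])
for _ in range(50):
    w = gradient_descent_step(w, grad_E=2.0 * w)
print(w)  # approaches the minimum at the origin
```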
Now, if we forget everything about learning
Given that the network is a complex composition of functions
$$E = f_1 \circ f_2 \circ \cdots \circ f_K$$
Now, each node has a left and right side
Furthermore
Separation of integration and activation function
Then, we can use this notation to build the forward/backward steps
Actually the basis for automatic differentiation
First, we have
The sequence of derivatives
Then, we can do the forward step getting the function compositions
Function Composition
Now, Backpropagation
Here the interesting part, you can collect such information
[Figure: backpropagation through the composition, seeded with the constant 1]
Now, what else?
The aggregation of functions toward the activation functions!!!
We add an extra caveat to the graph representation
A weight into the graph
Feed-Forward
We have the backward process
[Figure: backward pass through a weighted edge]
Function Addition
We have the forward step
[Figure: forward step of the addition node]
Then
At the Backward Step, we have
[Figure: backward step of the addition node, seeded with 1]
Backpropagation Algorithm
Consider a network with a single input and a network function F
The derivative F′(x) is computed in two phases.
1 Feed-forward:
The input x is fed into the network.
The primitive functions at the nodes and their derivatives are evaluated
at each node.
The derivatives are stored at the left side of the node.
2 Backpropagation:
The constant 1 is fed into the output unit and the network is run
backwards.
Incoming information to a node is added and the result is multiplied by
the value stored in the left part of the unit.
The result is transmitted to the left of the unit.
The result collected at the input unit is the derivative of the network
function with respect to x.
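As a sketch of the two phases on a simple chain of primitives (my own minimal example, with F(x) = sin(exp(x)); not the general graph version):

```python
import math

# Each node: (primitive function, its derivative)
nodes = [(math.exp, math.exp), (math.sin, math.cos)]

def F_prime(x):
    # Feed-forward: evaluate each node and store its derivative "on the left"
    stored = []
    v = x
    for f, df in nodes:
        stored.append(df(v))
        v = f(v)
    # Backpropagation: feed the constant 1 backwards, multiplying by the
    # derivative stored at each node
    adjoint = 1.0
    for d in reversed(stored):
        adjoint *= d
    return adjoint

print(F_prime(0.5))  # equals cos(exp(0.5)) * exp(0.5) by the chain rule
```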
Proof of Correctness about the derivatives
Proposition
The Backpropagation algorithm computes the derivative of the
network function F with respect to the input x correctly.
Proof
By induction: assume that the algorithm works for any network with n or fewer nodes
Consider
The following network with n + 1 nodes
[Figure: network with n + 1 nodes; the output node applies φ to a weighted sum of m subnetworks]
Thus
We have that
$$F(x) = \varphi\left(w_1 F_1(x) + w_2 F_2(x) + \cdots + w_m F_m(x)\right)$$
We have that the derivative
$$F'(x) = \varphi'(s)\left[w_1 F_1'(x) + w_2 F_2'(x) + \cdots + w_m F_m'(x)\right]$$
With $s = w_1 F_1(x) + w_2 F_2(x) + \cdots + w_m F_m(x)$
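A quick numerical sanity check of this formula (the subnetworks F1, F2 and the tanh activation are illustrative assumptions, not from the slides):

```python
import math

w = [0.7, -1.3]
F, dF = [math.sin, math.exp], [math.cos, math.exp]

def net(x):
    return math.tanh(sum(wi * fi(x) for wi, fi in zip(w, F)))

def net_prime(x):
    s = sum(wi * fi(x) for wi, fi in zip(w, F))
    return (1.0 - math.tanh(s) ** 2) * sum(
        wi * dfi(x) for wi, dfi in zip(w, dF))  # phi'(s) [sum w_j F_j'(x)]

x = 0.3
central = (net(x + 1e-6) - net(x - 1e-6)) / 2e-6
print(net_prime(x), central)  # the two values agree closely
```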
Now, we use induction
Take the subgraph of the main graph which contains all the nodes up to $F_1(x)$
Thus, by induction, we can calculate the derivative of $F_1(x)$ by introducing a 1 into its last unit and doing backpropagation
The same happens for all the other units
Now, if instead of multiplying by 1 we introduce $\varphi'(s)$ and multiply by $w_j$, we get
$$w_j F_j'(x)\,\varphi'(s)$$
This can be accomplished by
Introducing a 1 into the output unit, multiplying by the stored value $\varphi'(s)$, and distributing the result to the m units through the edge weights.
Basically, we get the derivative
We get then
$$\varphi'(s)\left[w_1 F_1'(x) + w_2 F_2'(x) + \cdots + w_m F_m'(x)\right]$$
Basically, the network is run backward
$$F'(x) = \varphi'(s)\left[w_1 F_1'(x) + w_2 F_2'(x) + \cdots + w_m F_m'(x)\right]$$
Thus the algorithm works for n + 1 nodes
QED
Why not use matrices to process all the individual parts?
Imagine the following, a simple idea
$$X = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{pmatrix}$$
We know the fields are created from input to hidden as
$$g(X) = XW = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{pmatrix} \begin{pmatrix} \mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_d \end{pmatrix}$$
Where
We have the constructs $g_{ij}\left(\mathbf{x}_i^T\right) = \mathbf{x}_i^T \mathbf{w}_j$
$$g(X) = \begin{pmatrix} g_{11}\left(\mathbf{x}_1^T\right) & g_{12}\left(\mathbf{x}_1^T\right) & \cdots & g_{1d}\left(\mathbf{x}_1^T\right) \\ g_{21}\left(\mathbf{x}_2^T\right) & g_{22}\left(\mathbf{x}_2^T\right) & \cdots & g_{2d}\left(\mathbf{x}_2^T\right) \\ \vdots & \vdots & \ddots & \vdots \\ g_{N1}\left(\mathbf{x}_N^T\right) & g_{N2}\left(\mathbf{x}_N^T\right) & \cdots & g_{Nd}\left(\mathbf{x}_N^T\right) \end{pmatrix}$$
Then
We have that $f_{ij}(x) = \frac{1}{1 + \exp\{-x\}}$
$$f(g(X)) = \begin{pmatrix} f\left(g_{11}\left(\mathbf{x}_1^T\right)\right) & f\left(g_{12}\left(\mathbf{x}_1^T\right)\right) & \cdots & f\left(g_{1d}\left(\mathbf{x}_1^T\right)\right) \\ f\left(g_{21}\left(\mathbf{x}_2^T\right)\right) & f\left(g_{22}\left(\mathbf{x}_2^T\right)\right) & \cdots & f\left(g_{2d}\left(\mathbf{x}_2^T\right)\right) \\ \vdots & \vdots & \ddots & \vdots \\ f\left(g_{N1}\left(\mathbf{x}_N^T\right)\right) & f\left(g_{N2}\left(\mathbf{x}_N^T\right)\right) & \cdots & f\left(g_{Nd}\left(\mathbf{x}_N^T\right)\right) \end{pmatrix}$$
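A minimal NumPy sketch of this batched forward pass (shapes are my assumption: X is N×m with rows $\mathbf{x}_i^T$, W is m×d with columns $\mathbf{w}_j$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # N = 5 rows x_i^T
W = rng.normal(size=(3, 4))   # d = 4 columns w_j

G = X @ W                     # g(X) = XW: entry ij is x_i^T w_j
H = sigmoid(G)                # f(g(X)) applied elementwise
print(H.shape)                # (5, 4)
```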
Finally, we can do the following modification for the forward pass
Then the matrix can be extended to carry derivatives alongside values
$$g'(X) \,\big|\, g(X) = \begin{pmatrix} \frac{dg_{11}(\mathbf{x}_1^T)}{d\mathbf{w}_1} \,\big|\, \mathbf{x}_1^T\mathbf{w}_1 & \frac{dg_{12}(\mathbf{x}_1^T)}{d\mathbf{w}_2} \,\big|\, \mathbf{x}_1^T\mathbf{w}_2 & \cdots & \frac{dg_{1d}(\mathbf{x}_1^T)}{d\mathbf{w}_d} \,\big|\, \mathbf{x}_1^T\mathbf{w}_d \\ \frac{dg_{21}(\mathbf{x}_2^T)}{d\mathbf{w}_1} \,\big|\, \mathbf{x}_2^T\mathbf{w}_1 & \frac{dg_{22}(\mathbf{x}_2^T)}{d\mathbf{w}_2} \,\big|\, \mathbf{x}_2^T\mathbf{w}_2 & \cdots & \frac{dg_{2d}(\mathbf{x}_2^T)}{d\mathbf{w}_d} \,\big|\, \mathbf{x}_2^T\mathbf{w}_d \\ \vdots & \vdots & \ddots & \vdots \\ \frac{dg_{N1}(\mathbf{x}_N^T)}{d\mathbf{w}_1} \,\big|\, \mathbf{x}_N^T\mathbf{w}_1 & \frac{dg_{N2}(\mathbf{x}_N^T)}{d\mathbf{w}_2} \,\big|\, \mathbf{x}_N^T\mathbf{w}_2 & \cdots & \frac{dg_{Nd}(\mathbf{x}_N^T)}{d\mathbf{w}_d} \,\big|\, \mathbf{x}_N^T\mathbf{w}_d \end{pmatrix}$$
Finally, we have
The next function:
$$f'(g(X)) \,\big|\, f(g(X)) = \begin{pmatrix} \frac{df_{11}(x)}{dx}\left(g_{11}(\mathbf{x}_1^T)\right) \,\big|\, f\left(g_{11}(\mathbf{x}_1^T)\right) & \cdots & \frac{df_{1d}(x)}{dx}\left(g_{1d}(\mathbf{x}_1^T)\right) \,\big|\, f\left(g_{1d}(\mathbf{x}_1^T)\right) \\ \frac{df_{21}(x)}{dx}\left(g_{21}(\mathbf{x}_2^T)\right) \,\big|\, f\left(g_{21}(\mathbf{x}_2^T)\right) & \cdots & \frac{df_{2d}(x)}{dx}\left(g_{2d}(\mathbf{x}_2^T)\right) \,\big|\, f\left(g_{2d}(\mathbf{x}_2^T)\right) \\ \vdots & \ddots & \vdots \\ \frac{df_{N1}(x)}{dx}\left(g_{N1}(\mathbf{x}_N^T)\right) \,\big|\, f\left(g_{N1}(\mathbf{x}_N^T)\right) & \cdots & \frac{df_{Nd}(x)}{dx}\left(g_{Nd}(\mathbf{x}_N^T)\right) \,\big|\, f\left(g_{Nd}(\mathbf{x}_N^T)\right) \end{pmatrix}$$
Using the Hadamard Product
We have for the backpropagation
$$f'(g(X)) \circ g'(X)$$
In particular, for a position $ij$:
$$\frac{dg_{ij}\left(\mathbf{x}_i^T\right)}{d\mathbf{w}_j} \times \frac{df_{ij}(x)}{dx}\left(g_{ij}\left(\mathbf{x}_i^T\right)\right) = \frac{df_{ij}(x)}{dx}\left(g_{ij}\left(\mathbf{x}_i^T\right)\right) \times \begin{pmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{di} \end{pmatrix}$$
Then using a vertical sum
We get the change imposed on each possible weight vector $\mathbf{w}_j$
$$\operatorname{sum}\left(f'(g(X)) \circ g'(X),\ \text{axis}=0\right) = \left[\sum_{i=1}^{N} \frac{dg_{ij}\left(\mathbf{x}_i^T\right)}{d\mathbf{w}_j} \times \frac{df_{ij}(x)}{dx}\left(g_{ij}\left(\mathbf{x}_i^T\right)\right)\right]_{j=1}^{h}$$
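A hedged NumPy sketch of this accumulation for one sigmoid layer (the upstream error term `delta` and all names are my own assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))      # N x m inputs
W = rng.normal(size=(3, 4))      # m x d weights
delta = rng.normal(size=(5, 4))  # upstream error per sample/unit

H = sigmoid(X @ W)
local = H * (1.0 - H) * delta    # Hadamard product with f'(g(X))
grad_W = X.T @ local             # vertical sum over i (axis = 0)
print(grad_W.shape)              # (3, 4): one entry per weight
```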
Now a Historical Perspective
The idea of a graph structure was proposed by Raul Rojas
“Neural Networks - A Systematic Introduction” by Raul Rojas in 1996...
TensorFlow was initially released on November 9, 2015
Originally an inception of the project “Google Brain” (circa 2011)
So TensorFlow started around 2012-2013 with internal development and DNNResearch’s code (Hinton’s company)
However, the graph idea was introduced in 2002 in Torch, the basis of PyTorch (circa 2016)
One of the creators, Samy Bengio, is the brother of Yoshua Bengio [3]
Backpropagation, a Little Brother of Automatic Differentiation (AD)
We have a crude way to obtain derivatives [4, 5, 6, 7]
$$D_{+h}f(x) \approx \frac{f(x+h) - f(x)}{h} \quad \text{or} \quad D_{\pm h}f(x) \approx \frac{f(x+h) - f(x-h)}{2h}$$
Huge Problems
If h is small, then cancellation error reduces the number of significant figures in $D_{+h}f(x)$.
If h is not small, then truncation errors (terms such as $h^2 f'''(x)$) become significant.
Even if h is optimally chosen, the values of $D_{+h}f(x)$ and $D_{\pm h}f(x)$ will be accurate to only about $\frac{1}{2}$ or $\frac{2}{3}$ of the significant digits of f.
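This trade-off is easy to see numerically; a small sketch with f(x) = sin(x) at x = 1 (my own example):

```python
import math

f, x, exact = math.sin, 1.0, math.cos(1.0)
for h in (1e-2, 1e-5, 1e-8, 1e-11):
    fwd = (f(x + h) - f(x)) / h            # D_{+h} f(x)
    ctr = (f(x + h) - f(x - h)) / (2 * h)  # D_{+-h} f(x)
    print(f"h={h:.0e}  forward err={abs(fwd - exact):.1e}  "
          f"central err={abs(ctr - exact):.1e}")
# Errors first shrink (truncation) and then grow again (cancellation).
```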
Avoiding Truncation Errors
We have that
Algorithmic differentiation does not incur truncation errors.
For example
$$f(x) = \sum_{i=1}^{n} x_i^2 \quad \text{at } x_i = i \text{ for } i = 1, \ldots, n$$
Then, for $e_1 \in \mathbb{R}^n$:
$$\frac{f(x + he_1) - f(x)}{h} = \frac{\partial f(x)}{\partial x_1} + h = 2x_1 + h = 2 + h$$
Floating Points
Given that the quantity needs floating point representation at a machine accuracy of 64 bits
$$\text{Roundoff error of } f(x + he_1) \approx \frac{n^3}{3}\epsilon \quad \text{with } \epsilon = 2^{-54} \approx 10^{-16}$$
For $h = \sqrt{\epsilon}$, as is often recommended
The difference quotient has a rounding error of size
$$\frac{1}{3}n^3\sqrt{\epsilon} \approx \frac{1}{3}n^3 \times 10^{-8}$$
Now, Imagine n = 1000
Then the rounding error is
$$\frac{1}{3} \times 1000^3 \times \sqrt{\epsilon} \approx \frac{1}{3} \times 10^9 \times 10^{-8} = \frac{10}{3} \approx 3.333\ldots$$
Ouch
Since the true derivative is 2, we cannot even get the sign correctly!!!
$$\frac{f(x + he_1) - f(x)}{h}$$
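This failure is reproducible in float64 (assuming h = 10^-8, the usual square-root-of-epsilon choice):

```python
import numpy as np

n = 1000
x = np.arange(1, n + 1, dtype=np.float64)   # x_i = i
f = lambda v: np.sum(v ** 2)                # f(x) ~ n^3 / 3 ~ 3.3e8

h = 1e-8
e1 = np.zeros(n)
e1[0] = 1.0
print((f(x + h * e1) - f(x)) / h)  # garbage: the true derivative is 2*x_1 = 2
```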
In contrast Automatic Differentiation
It yields
$2x_i$ in both its forward and reverse modes
You could assume that the derivatives are generated symbolically
This is true in some sense, but the expression $2x_i$ will never be generated as in Symbolic Differentiation
Instead, in Automatic Differentiation
The numerical value of $x_i$ is multiplied by 2 and then returned as the gradient value.
Example using Forward Differentiation
We will see the forward procedure later on
$$f(x) = \sum_{i=1}^{n} x_i^2 \quad \text{with } x_i = i \text{ for } i = 1, \ldots, n$$
AD initializes (do not worry, we will see this in more detail)
$$\dot{v}_{i-n} = i \text{ for } i = 1, \ldots, n \quad\quad \dot{v}_{i-n} = 0 \text{, except } \dot{v}_{1-n} = 1$$
Then, we have that
Apply the compositions
φ Functions | Derivatives
$v_1 = 1^2$ | $\dot{v}_1 = \frac{\partial v_1}{\partial v_{1-n}}\dot{v}_{1-n} = 2 \times (1) \times 1 = 2$
... | ...
$v_n = n^2$ | $0$
Therefore, we have at the end the tangents $(2, 0, \ldots, 0)$, giving
$$\frac{\partial f}{\partial x_1}(x) = 2$$
Quite different from
Using a numerical difference, we have
$$\frac{f(x + e_1 h) - f(x)}{h} - 2 < 0$$
Then, for $n = 10^j$ and $h = 10^{-k}$:
$$10^k\left[(h + 1)^2 - 1\right] - 1 < 2$$
Finally, we have
$$k > -\log_{10} 3$$
Therefore
It is possible to get into underflow
by getting $k > -\log_{10} 3$
Therefore, we have that
Automatic Differentiation allows us to obtain the correct answer!!!
For example
You have the following equation
$$f(x) = \prod_{i=1}^{n} x_i$$
Then, the gradient
$$\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right) = \left(\prod_{j \neq i} x_j\right)_{i=1,\ldots,n}$$
$$= \left(x_2 \times x_3 \times \cdots \times x_{n-1} \times x_n,\ \ldots,\ x_1 \times \cdots \times x_{i-1} \times x_{i+1} \times \cdots \times x_n,\ \ldots,\ x_1 \times x_2 \times \cdots \times x_{n-2} \times x_{n-1}\right)$$
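Reverse-mode AD obtains this entire gradient in O(n) work without ever building the n symbolic products; a minimal hand-rolled sketch (illustrative, not a library API):

```python
import numpy as np

def prod_and_grad(x):
    n = len(x)
    prefix = np.ones(n + 1)            # prefix[i] = x_1 * ... * x_i
    for i in range(n):
        prefix[i + 1] = prefix[i] * x[i]
    grad, suffix = np.empty(n), 1.0    # suffix accumulates x_{i+1} ... x_n
    for i in range(n - 1, -1, -1):
        grad[i] = prefix[i] * suffix   # product over all j != i
        suffix *= x[i]
    return prefix[n], grad

print(prod_and_grad(np.array([2.0, 3.0, 5.0])))  # (30.0, [15. 10.  6.])
```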
Actually
Symbolic Differentiation will consume a lot of memory
Instead, AD will reuse the common expressions to improve performance and memory.
However, both Symbolic and Automatic Differentiation
make use of the chain rule to achieve their results
The difference is that the chain rule in AD
is applied not to symbolic expressions but to the actual numerical values.
The User Insight
Difference quotients may sometimes be useful too
$$\frac{f(x + he_1) - f(x)}{h}$$
Computer Algebra packages
They have really neat ways to simplify expressions.
In contrast, current AD packages assume
that the given program calculates the underlying function efficiently
Therefore
AD can automate the gradient generation
The best results will be obtained when AD takes advantage of
the user’s insight into the structure underlying the program
RNN Example
When you look at the Elman recurrent neural network [8]
$$h_t = \sigma_h\left(W_{sd} x_t + U_{sh} h_{t-1} + b_h\right)$$
$$y_t = \sigma_y\left(V_{os} h_t\right)$$
$$L = \frac{1}{2}\left(y_t - z_t\right)^2$$
Here, if you do blind AD, sooner or later you have
$$\frac{\partial h_t}{\partial h_{t-1}} \times \frac{\partial h_{t-1}}{\partial h_{t-2}} \times \frac{\partial h_{t-2}}{\partial h_{t-3}} \times \cdots \times \frac{\partial h_{k+1}}{\partial h_k}$$
This is known as Back Propagation Through Time (BPTT)
This is a problem given
the Vanishing Gradient or Exploding Gradient
Here, you can modify the architecture
Using an intermediate layer with the Hadamard product ◦, we have
$$L = \frac{1}{2}\left(y_t - z_t\right)^2$$
$$y_t = \sigma_y\left(W_{od} x_t + U_{oh} h_{t-1} + b_o\right)$$
$$s_t = \sigma_s\left(V_{ho} y_t + D_{hd} x_t + b_h\right)$$
$$h_t = (1 - y_t) \circ h_{t-1} + y_t \circ s_t$$
Therefore
You have multiple paths of derivatives
[Figure: computational graph with multiple derivative paths through $h_{t-1}$]
One of them
It can be seen
That one of the paths can take you to BPTT
The Other One
The other gets you into a more Markovian property
This allows us to get a backpropagation that does not require BPTT
How? For example, the derivative of L with respect to $D_{hd}$:
$$\frac{\partial L}{\partial D_{hd}} = \frac{\partial L}{\partial y_t} \times \frac{\partial y_t}{\partial net_y} \times \frac{\partial net_y}{\partial h_{t-1}} \times \frac{\partial h_{t-1}}{\partial s_{t-2}} \times \frac{\partial s_{t-2}}{\partial net_s} \times \frac{\partial net_s}{\partial D_{hd}}$$
Therefore
You do not have
Backpropagation Through Time... you can avoid it altogether!!!
Because Backpropagation Through Time
makes the process of obtaining the gradients unstable...
Thus
A great simplifying step
Here the phrase rings true:
“AD taking advantage of the user’s insight”
A Simple Example
Here, we have the following ideas
Some of the floating point values generated by the AD will be stored in variables of the program,
Others will be held temporarily until overwritten or discarded.
Thus, we will introduce the concept
Evaluation Trace, which is basically a record of a particular run of a given program.
This Evaluation Trace stores
Input variables,
The sequence of floating point values generated by the CPU,
The operations that are used for it
Example
A simple example
$$y = f(x_1, x_2) = \left[\sin\left(\frac{x_1}{x_2}\right) + \frac{x_1}{x_2} - \exp(x_2)\right] \times \left[\frac{x_1}{x_2} - \exp(x_2)\right]$$
We wish to calculate $y = f(x_1, x_2)$
with $x_1 = 1.5$, $x_2 = 0.5$
Evaluation Trace/Forward Procedure
We have the table for the evaluation of the function
$v_{-1} = x_1 = 1.5$
$v_0 = x_2 = 0.5$
$v_1 = v_{-1}/v_0 = 1.5/0.5 = 3.0$
$v_2 = \sin(v_1) = \sin(3.0) = 0.1411$
$v_3 = \exp(v_0) = \exp(0.5) = 1.6487$
$v_4 = v_1 - v_3 = 3.0 - 1.6487 = 1.3513$
$v_5 = v_2 + v_4 = 0.1411 + 1.3513 = 1.4924$
$v_6 = v_5 \times v_4 = 1.4924 \times 1.3513 = 2.0167$
$y = v_6 = 2.0167$
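The same evaluation trace written as straight-line Python, one statement per intermediate $v_i$:

```python
import math

x1, x2 = 1.5, 0.5

v1 = x1 / x2         # v_{-1} = x1, v_0 = x2; v1 = 3.0
v2 = math.sin(v1)    # 0.1411
v3 = math.exp(x2)    # 1.6487
v4 = v1 - v3         # 1.3513
v5 = v2 + v4         # 1.4924
v6 = v5 * v4         # 2.0167
print(v6)            # y = f(1.5, 0.5)
```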
A Cautionary Note
Normally
Programmers will try to rearrange this execution trace to improve
performance through parallelism.
Thus
Subexpressions will be algorithmically exploited by the AD to improve
performance.
It is usually more convenient to use
The so called “computational graph”
Computational Graph
A Simpler Version
[Figure: computational graph of the evaluation trace]
Please take a look at the section in Chapter 2, “A Framework for Evaluating Functions”
In the book [7]:
Andreas Griewank and Andrea Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, vol. 105 (SIAM, 2008).
A Little Bit of Notation
In general, we assume quantities $v_i$ such that
$$\underbrace{v_{1-n}, \ldots, v_0}_{x}, \quad v_1, \ldots, v_{l-m}, \quad \underbrace{v_{l-m+1}, \ldots, v_l}_{y}$$
Then, we have
1. $v_{1-n}, \ldots, v_0$ are the initial input variables
2. $v_{l-m+1}, \ldots, v_l$ are the output variables
3. $v_1, \ldots, v_{l-m}$ are the intermediate values
Additionally
Where each value $v_i$ with $i > 0$ is obtained by applying an elemental function $\varphi_i$
$$v_i = \varphi_i\left(v_j\right)_{j \prec i}$$
Here $j \prec i$ means that $v_i$ depends directly on $v_j$
Then, for the application of the chain rule
It is useful to associate with each elemental function $\varphi_i$ the state transformation
$$v_i = \Phi_i\left(v_{i-1}\right) \quad \text{with } \Phi_i : \mathbb{R}^{n+l} \to \mathbb{R}^{n+l}$$
where
$$v_i = \left(v_{1-n}, \ldots, v_i, 0, \ldots, 0\right)^T$$
In other words
$\Phi_i$ sets $v_i$ to $\varphi_i(v_j)_{j \prec i}$ and keeps all other components $v_j$ for $j \neq i$ unchanged.
Basically the Computational Graph
A Simpler Version
[Figure: computational graph annotated with the $v_i$]
Example of the Forward Mode
Suppose we want to differentiate $y = f(x_1, x_2)$ with respect to $x_1$
We consider $x_1$ as an independent variable and $y$ as a dependent variable.
We can work out the numerical value of $y = f(x_1, x_2)$
by getting the numerical derivative of each of its components
Something like
$$\dot{v}_i = \frac{\partial v_i}{\partial x_1}$$
Therefore, we get
We have the procedure
$v_{-1} = x_1 = 1.5$  |  $\dot{v}_{-1} = 1.0$
$v_0 = x_2 = 0.5$  |  $\dot{v}_0 = 0.0$
$v_1 = v_{-1}/v_0 = 1.5/0.5 = 3.0$  |  $\dot{v}_1 = \frac{\partial v_1}{\partial v_{-1}}\dot{v}_{-1} + \frac{\partial v_1}{\partial v_0}\dot{v}_0 = 2.0$
$v_2 = \sin(v_1) = \sin(3.0) = 0.1411$  |  $\dot{v}_2 = \cos(v_1)\,\dot{v}_1 = -1.98$
$v_3 = \exp(v_0) = \exp(0.5) = 1.6487$  |  $\dot{v}_3 = v_3 \times \dot{v}_0 = 0.0$
$v_4 = v_1 - v_3 = 3.0 - 1.6487 = 1.3513$  |  $\dot{v}_4 = \dot{v}_1 - \dot{v}_3 = 2.0$
$v_5 = v_2 + v_4 = 0.1411 + 1.3513 = 1.4924$  |  $\dot{v}_5 = \dot{v}_2 + \dot{v}_4 = 0.02$
$v_6 = v_5 \times v_4 = 1.4924 \times 1.3513 = 2.0167$  |  $\dot{v}_6 = \dot{v}_5 \times v_4 + v_5 \times \dot{v}_4 = 3.0118$
$y = v_6 = 2.0167$  |  $\dot{y} = \dot{v}_6 = 3.0118$
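The full forward-mode trace in straight-line Python, seeding the tangent of $x_1$ with 1.0 (each pair of assignments mirrors one row of the table):

```python
import math

x1, x2 = 1.5, 0.5
vm1, dvm1 = x1, 1.0   # differentiate with respect to x1
v0,  dv0  = x2, 0.0

v1, dv1 = vm1 / v0, dvm1 / v0 - vm1 * dv0 / v0**2
v2, dv2 = math.sin(v1), math.cos(v1) * dv1
v3, dv3 = math.exp(v0), math.exp(v0) * dv0
v4, dv4 = v1 - v3, dv1 - dv3
v5, dv5 = v2 + v4, dv2 + dv4
v6, dv6 = v5 * v4, dv5 * v4 + v5 * dv4
print(v6, dv6)        # 2.0167..., 3.0118...
```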
The first Column of this process
It can be seen as an automatic procedure
$$v_{i-n} = x_i, \quad i = 1, \ldots, n$$
$$v_i = \varphi_i\left(v_j\right)_{j \prec i}, \quad i = 1, \ldots, l$$
$$y_{m-i} = v_{l-i}, \quad i = m-1, \ldots, 0$$
In a similar way
We can obtain $\frac{\partial f(x_1, x_2)}{\partial x_2}$
However, it can be more efficient to redefine the $\dot{v}_i$ as vectors!!!
Forward propagation of Tangents
Remarks
As you can see, the second column of the evaluation procedure is filled in a mechanical way
This increases the size
Basically, twice the size of the original simple evaluation.
We have the following
We have the chain rule
$$\dot{y}(t) = \frac{\partial F(x(t))}{\partial t} = F'(x(t))\,\dot{x}(t)$$
Where
$F'(x) \in \mathbb{R}^{m \times n}$ is the Jacobian matrix
Here, we would be tempted to calculate $\dot{y}(t)$
by evaluating the full Jacobian $F'(x)$ and then multiplying by $\dot{x}(t)$
However
Such an approach is quite uneconomical
Unless many tangents need to be calculated, as in a Newton step.
A simpler version: differentiate the first column of the table

v_{i−n} = x_i,  i = 1, ..., n
v_i = φ_i (v_j)_{j≺i},  i = 1, ..., l
y_{m−i} = v_{l−i},  i = m − 1, ..., 0

Here j ≺ i means that v_i depends directly on v_j (the propagation of dependencies through the graph)
Which can be seen as Forward Propagation of Tangents
Basically, we can think of the forward mode as a propagation of tangents
The Automatic Procedure
Therefore, we have the following automatic procedure, where j ≺ i means that v_i depends directly on v_j and u_i = (v_j)_{j≺i} ∈ R^{n_i}

[v_{i−n} ≡ x_i;  ˙v_{i−n} ≡ ˙x_i]  i = 1, ..., n
[v_i ≡ φ_i (v_j)_{j≺i};  ˙v_i ≡ Σ_{j≺i} (∂φ_i (u_i)/∂v_j) ˙v_j]  i = 1, ..., l
[y_{m−i} ≡ v_{l−i};  ˙y_{m−i} ≡ ˙v_{l−i}]  i = m − 1, ..., 0
Therefore
Each elemental assignment v_i = φ_i (u_i)
Has a corresponding tangent

˙v_i = Σ_{j≺i} (∂φ_i (u_i)/∂v_j) × ˙v_j = Σ_{j≺i} c_{ij} × ˙v_j

Abbreviating ˙u_i = ( ˙v_j)_{j≺i}

˙v_i = ˙φ_i (u_i, ˙u_i) ≡ φ_i′ (u_i) ˙u_i

Where ˙φ_i : R^{2n_i} → R
It is called the tangent function associated with the elemental φ_i.
Now
Question
What is the correct order of evaluation?
Why the question?
Until now, we have always placed the tangent statement yielding ˙v_i after the underlying value v_i
This order of calculation seems natural and certainly yields correct results as long as there is no overwriting.
Then the order of the 2l statements in the middle part of the table does not matter

[v_{i−n} ≡ x_i;  ˙v_{i−n} ≡ ˙x_i]  i = 1, ..., n
[v_i ≡ φ_i (v_j)_{j≺i};  ˙v_i ≡ Σ_{j≺i} (∂φ_i (u_i)/∂v_j) ˙v_j]  i = 1, ..., l
[y_{m−i} ≡ v_{l−i};  ˙y_{m−i} ≡ ˙v_{l−i}]  i = m − 1, ..., 0
Here, we have a big problem in Cache
Imagine that we have a single block of memory
Where v_i and its arguments v_j live in the same memory cell in the cache

[Figure: memory hierarchy — CPU registers, cache, main memory]
This is known as Cache Aliasing
Definition
Cache aliasing occurs when multiple mappings to a physical page of memory have conflicting caching states, such as cached and uncached.
The same physical address can be mapped to multiple virtual addresses.
On ARMv4 and ARMv5 processors, the cache is organized as virtually-indexed, virtually-tagged (VIVT)
Cache lookups are faster because the translation look-aside buffer (TLB) is not involved in matching cache lines for a virtual address.
However
This caching method does require more frequent cache flushing because of cache aliasing.
Then
The value of ˙v_i = ˙φ_i (u_i, ˙u_i) will be incorrect
Once we update v_i = φ_i (u_i)
ADIFOR and Tapenade [9, 5]
They put the derivative statement ahead of the original assignment, updating ˙v_i before the original value is overwritten.
On the other hand
For most univariate functions v = φ (u), it is better to obtain the undifferentiated value first
And then to use it inside the tangent function ˙φ
In this presentation
We will list φ and ˙φ
Side by side in a common bracket to indicate that they should be evaluated simultaneously
Then
Sharing results is immediate.
Classic Tangent Operations
We have a series of improvements on the tangent equations; where the tangent statement comes first, it must be evaluated before its argument is overwritten

φ              (φ, ˙φ)
v = c          v = c;  ˙v = 0
v = u ± w      v = u ± w;  ˙v = ˙u ± ˙w
v = u × w      ˙v = ˙u × w + u × ˙w;  v = u × w
v = 1/u        v = 1/u;  ˙v = −v × (v × ˙u)
v = u^c        ˙v = c × ˙u/u;  v = u^c;  ˙v = v × ˙v
v = √u         v = √u;  ˙v = 0.5 × ˙u/v
v = exp (u)    v = exp (u);  ˙v = v × ˙u
v = log (u)    ˙v = ˙u/u;  v = log (u)
v = sin (u)    ˙v = cos (u) × ˙u;  v = sin (u)
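As a concrete illustration, here is a minimal dual-number sketch in Python (our own code, not from the slides; the class name Dual and the helper functions are invented) that overloads a few of the operations in this table so the tangent column is carried along automatically:

```python
import math

class Dual:
    """A value paired with its tangent: (v, dv)."""
    def __init__(self, v, dv=0.0):
        self.v, self.dv = v, dv

    def __add__(self, o):
        return Dual(self.v + o.v, self.dv + o.dv)

    def __sub__(self, o):
        return Dual(self.v - o.v, self.dv - o.dv)

    def __mul__(self, o):
        # product rule: (uw)' = u'w + uw'
        return Dual(self.v * o.v, self.dv * o.v + self.v * o.dv)

    def __truediv__(self, o):
        # quotient rule
        return Dual(self.v / o.v, (self.dv * o.v - self.v * o.dv) / o.v**2)

def sin(x):
    return Dual(math.sin(x.v), math.cos(x.v) * x.dv)

def exp(x):
    v = math.exp(x.v)
    return Dual(v, v * x.dv)

# Differentiate the earlier example with respect to x1 (seed dv = 1 on x1).
x1, x2 = Dual(1.5, 1.0), Dual(0.5, 0.0)
y = (sin(x1 / x2) + x1 / x2 - exp(x2)) * (x1 / x2 - exp(x2))
print(y.v, y.dv)   # ~2.0167, ~3.0118
```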
Now, imagine the following network
Something simple for our sake
Forward mode to get the gradient with respect to w_11

v_{−11} = w_11, ..., v_{−9} = w_13, v_{−8} = w_21, ..., v_{−6} = w_23, v_{−5} = w_31, v_{−4} = w_32, v_{−3} = w_41, v_{−2} = w_42, v_{−1} = w_51, v_0 = w_52
˙v_{−11} = 1, ˙v_{−10} = 0, ..., ˙v_0 = 0

v_1 = Σ_{i=1}^{3} w_{1i} x_i,  ˙v_1 = x_1
v_2 = Σ_{i=1}^{3} w_{2i} x_i,  ˙v_2 = 0
v_3 = 1/(1 + exp (−v_1)),  ˙v_3 = v_3 [1 − v_3] x_1
v_4 = 1/(1 + exp (−v_2)),  ˙v_4 = 0
v_5 = w_31 v_3 + w_32 v_4,  ˙v_5 = w_31 × ˙v_3
v_6 = w_41 v_3 + w_42 v_4,  ˙v_6 = w_41 × ˙v_3
v_7 = 1/(1 + exp (−v_5)),  ˙v_7 = v_7 [1 − v_7] × ˙v_5
v_8 = 1/(1 + exp (−v_6)),  ˙v_8 = v_8 [1 − v_8] × ˙v_6
v_9 = w_51 v_7 + w_52 v_8,  ˙v_9 = w_51 × ˙v_7 + w_52 × ˙v_8
v_10 = 1/(1 + exp (−v_9)),  ˙v_10 = v_10 [1 − v_10] × ˙v_9
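A sketch of this sweep in plain Python (our own reading of the network implied by the trace: three inputs, two sigmoid hidden layers of two units each, one sigmoid output; the variable names are ours, and the tangent seed sits on W1[0, 0], i.e. w_11):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_tangent_w11(x, W1, W2, w5):
    """One forward sweep carrying (value, tangent) pairs, seeded on W1[0, 0]."""
    v1 = W1[0] @ x;   dv1 = x[0]           # d(sum_i w1i xi)/dw11 = x1
    v2 = W1[1] @ x;   dv2 = 0.0
    v3 = sigmoid(v1); dv3 = v3 * (1 - v3) * dv1
    v4 = sigmoid(v2); dv4 = 0.0
    h  = np.array([v3, v4]); dh = np.array([dv3, dv4])
    v5 = W2[0] @ h;   dv5 = W2[0] @ dh     # = w31 * dv3, since dv4 = 0
    v6 = W2[1] @ h;   dv6 = W2[1] @ dh     # = w41 * dv3
    v7 = sigmoid(v5); dv7 = v7 * (1 - v7) * dv5
    v8 = sigmoid(v6); dv8 = v8 * (1 - v8) * dv6
    g  = np.array([v7, v8]); dg = np.array([dv7, dv8])
    v9 = w5 @ g;      dv9 = w5 @ dg        # = w51 * dv7 + w52 * dv8
    y  = sigmoid(v9); dy  = y * (1 - y) * dv9
    return y, dy                            # dy = dy/dw11

x  = np.array([0.3, -0.1, 0.8])
W1 = np.random.randn(2, 3)
W2 = np.random.randn(2, 2)
w5 = np.random.randn(2)
print(forward_tangent_w11(x, W1, W2, w5))
```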
Complexity of the Procedure
Time Complexity

TIME {F (x) , F′ (x) ˙x} ≤ ω_tan TIME {F (x)}

Where ω_tan ∈ [2, 5/2]

Space Complexity

SPACE {F (x) , F′ (x) ˙x} ≤ 2 SPACE {F (x)}
Here, an essential observation
The cost of evaluating derivatives by propagating them forward
It increases linearly with the number of directions ˙x along which we want to differentiate.
It looks inevitable
But it is possible to avoid this complexity
By observing that the gradient of a single dependent variable can be obtained for a fixed multiple of the cost of evaluating the underlying scalar-valued function.
We choose instead an output variable
We use the term “reverse mode” for this technique
Because the label “backward differentiation” is already well established [10, 11].
Therefore, for an output y = f (x_1, x_2)
We have for each variable v_i the adjoint

v̄_i = ∂y/∂v_i  (Adjoint Variable)
Actually
This is an abuse of notation
We mean differentiation with respect to a new independent variable δ_i

v̄_i = ∂y/∂δ_i  (Adjoint Variable)

Which can be thought of as adding a small numerical value δ_i to v_i

v_i + δ_i → f (x_1, x_2) + v̄_i δ_i

As a perturbation in variational calculus
Actually, you propagate the Normal vectors
Indeed, ȳ and the v̄_i are normals, or cotangents
Then, we have
The following sought mapping

x̄^T = ȳ^T F′ (x)

Observation
Here, ȳ is a fixed vector that plays a dual role to the domain direction ˙x.
In the Forward Procedure, you compute

˙y = F′ (x) ˙x ≡ ˙F (x, ˙x)
Instead
In the Reverse Procedure, you compute

x̄^T = ȳ^T F′ (x) ≡ F̄ (x, ȳ)

Where F and F̄ are evaluated together
Thus, we have a dual process
Dual Process
Here, we have that the hyperplane {y | ȳ^T y = c} in the range of F has inverse image {x | ȳ^T F (x) = c}
The implicit function theorem
Theorem
Let F : R^{n+m} → R^m be a continuously differentiable function, and let (x_1^0, x_2^0, ..., x_{m+n}^0) be a point with F (x_1^0, x_2^0, ..., x_{m+n}^0) = c. If ∂F (x_1^0, x_2^0, ..., x_{m+n}^0)/∂x_{m+n} ≠ 0, then there exists a neighborhood of (x_1^0, x_2^0, ..., x_{m+n}^0) such that whenever (x_1, ..., x_{n+m−1}) is close enough to (x_1^0, ..., x_{m+n−1}^0), there is a unique z so that F (x_1, ..., x_{n+m−1}, z) = c. Furthermore, z = g (x_1, ..., x_{n+m−1}) is a continuous function of (x_1, ..., x_{n+m−1}).
Therefore
The set {x | ȳ^T F (x) = c}
It is a smooth hypersurface with normal

x̄^T = ȳ^T F′ (x)

at x, provided that x̄ does not vanish.
The Process
Here, we have that the hyperplane {y | ȳ^T y = c} in the range of F has inverse image {x | ȳ^T F (x) = c}

[Figure: the Forward Procedure propagates the tangent ˙x to ˙y; the Reverse Procedure propagates the normal ȳ back to x̄]
Therefore
When m = 1, then F = f is scalar-valued
Taking ȳ = 1 ∈ R, we obtain the familiar gradient ∇f (x) = ȳ^T F′ (x).
Something Notable
We will look only at the main procedure of Incremental Adjoint Recursion
Please take a look at the section on Derivation by Matrix-Product Reversal
In the book [7]
Andreas Griewank and Andrea Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, vol. 105, (SIAM, 2008).
The derivation of the reverse mode
For this, we will use

[v_{i−n} ≡ x_i;  ˙v_{i−n} ≡ ˙x_i]  i = 1, ..., n
[v_i ≡ φ_i (v_j)_{j≺i};  ˙v_i ≡ Σ_{j≺i} (∂φ_i (u_i)/∂v_j) ˙v_j]  i = 1, ..., l
[y_{m−i} ≡ v_{l−i};  ˙y_{m−i} ≡ ˙v_{l−i}]  i = m − 1, ..., 0

And the identity

ȳ^T ˙y = x̄^T ˙x
Now, using the state transformations Φ_i
We map from x to y = F (x) as the composition

y = Q_m [Φ_l ◦ Φ_{l−1} ◦ · · · ◦ Φ_2 ◦ Φ_1] (P_n^T x)

Where P_n ≡ [I, 0, ..., 0] ∈ R^{n×(n+l)} and Q_m ≡ [0, 0, ..., I] ∈ R^{m×(n+l)}
They are matrices that project an arbitrary (n + l)-vector
Onto its first n and its last m components, respectively.
Where
The c_{ij}’s represent the elemental partials

c_{ij} ≡ c_{ij} (u_i) ≡ ∂φ_i/∂v_j  for 1 − n ≤ i, j ≤ l
Labeling the elemental partials as c_{ij}
We get the state Jacobian

A_i ≡ Φ_i′ =
⎡ 1          0          · · ·  0          · · ·  · · ·  0 ⎤
⎢ 0          1          · · ·  0          · · ·  · · ·  0 ⎥
⎢ ⋮          ⋮                 ⋮                          ⎥
⎢ 0          0          · · ·  1          · · ·  · · ·  0 ⎥
⎢ c_{i,1−n}  c_{i,2−n}  · · ·  c_{i,i−n}  · · ·  · · ·  0 ⎥
⎢ 0          0          · · ·  0          1      · · ·  0 ⎥
⎢ ⋮          ⋮                 ⋮          ⋮      ⋱       ⎥
⎣ 0          0          · · ·  · · ·      · · ·  · · ·  1 ⎦
∈ R^{(n+l)×(n+l)}

where the c_{ij} occur in the (n + i)th row of A_i.
Remarks
The square matrices A_i are lower triangular
They may also be written as rank-one perturbations of the identity

A_i = I + e_{n+i} [∇φ_i (u_i) − e_{n+i}]^T

Where e_j denotes the jth Cartesian basis vector in R^{n+l}
Then, differentiating the composition of functions

˙y = Q_m A_l A_{l−1} · · · A_2 A_1 P_n^T ˙x
Embeddings
The multiplication by P_n^T ∈ R^{(n+l)×n}
It embeds ˙x into R^{n+l}
Meaning
It corresponds to the first part of the tangent recursion
The subsequent multiplications by the A_i
They generate one component ˙v_i at a time, according to the middle part
Finally
Q_m extracts the last m components as ˙y, corresponding to the third part of the table

[v_{i−n} ≡ x_i;  ˙v_{i−n} ≡ ˙x_i]  i = 1, ..., n
[v_i ≡ φ_i (v_j)_{j≺i};  ˙v_i ≡ Σ_{j≺i} (∂φ_i (u_i)/∂v_j) ˙v_j]  i = 1, ..., l
[y_{m−i} ≡ v_{l−i};  ˙y_{m−i} ≡ ˙v_{l−i}]  i = m − 1, ..., 0
Now
By comparison with

˙y (t) = ∂F (x (t))/∂t = F′ (x (t)) ˙x (t)

We have in fact a product representation of the full Jacobian

F′ (x) = Q_m A_l A_{l−1} · · · A_2 A_1 P_n^T ∈ R^{m×n}
Then
By transposing the product we obtain the adjoint relation

x̄ = P_n A_1^T A_2^T · · · A_{l−1}^T A_l^T Q_m^T ȳ

Given that

A_i^T = I + [∇φ_i (u_i) − e_{n+i}] e_{n+i}^T
Therefore
The transformation of any vector ( v̄_j)_{1−n≤j≤l}
By multiplication with A_i^T represents an incremental operation.
In detail, one obtains for i = l, ..., 1 the operations
For all j with j ⊀ i
v̄_j is left unchanged
For all j with j ≺ i
v̄_j is augmented by v̄_i c_{ij}, with

c_{ij} ≡ c_{ij} (u_i) ≡ ∂φ_i/∂v_j  for 1 − n ≤ i, j ≤ l

Subsequently
v̄_i is set to zero.
Some Remarks
Using the C-style abbreviation

a +≡ b  for  a ≡ a + b

We may rewrite the matrix-vector product as the adjoint evaluation procedure in the following table
Incremental Adjoint Recursion
We have the following procedure (u_i = (v_j)_{j≺i} ∈ R^{n_i})

v̄_i ≡ 0,  i = 1 − n, ..., l
v_{i−n} ≡ x_i,  i = 1, ..., n
v_i ≡ φ_i (v_j)_{j≺i},  i = 1, ..., l
y_{m−i} ≡ v_{l−i},  i = m − 1, ..., 0
v̄_{l−i} ≡ ȳ_{m−i},  i = 0, ..., m − 1
v̄_j +≡ v̄_i ∂φ_i (u_i)/∂v_j  for j ≺ i,  i = l, ..., 1
x̄_i ≡ v̄_{i−n},  i = n, ..., 1
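A minimal tape-based sketch of this recursion in Python (our own illustration, not the book's code; names such as Tape are invented). The forward sweep records each elemental operation with its partials c_{ij}; the return sweep then runs i = l, ..., 1 executing v̄_j +≡ v̄_i c_{ij}:

```python
import math

class Tape:
    """Records, per elemental op, (output index, {input index: partial c_ij})."""
    def __init__(self):
        self.vals, self.ops = [], []

    def var(self, x):                      # independent variable v_{i-n} = x_i
        self.vals.append(x)
        return len(self.vals) - 1

    def op(self, value, partials):         # v_i = phi_i(u_i)
        self.vals.append(value)
        i = len(self.vals) - 1
        self.ops.append((i, partials))
        return i

    def add(self, a, b): return self.op(self.vals[a] + self.vals[b], {a: 1.0, b: 1.0})
    def sub(self, a, b): return self.op(self.vals[a] - self.vals[b], {a: 1.0, b: -1.0})
    def mul(self, a, b): return self.op(self.vals[a] * self.vals[b],
                                        {a: self.vals[b], b: self.vals[a]})
    def div(self, a, b): return self.op(self.vals[a] / self.vals[b],
                                        {a: 1.0 / self.vals[b],
                                         b: -self.vals[a] / self.vals[b] ** 2})
    def sin(self, a):    return self.op(math.sin(self.vals[a]), {a: math.cos(self.vals[a])})
    def exp(self, a):    v = math.exp(self.vals[a]); return self.op(v, {a: v})

    def gradient(self, y):
        adj = [0.0] * len(self.vals)           # all adjoints initialized to zero
        adj[y] = 1.0                           # seed the output: bar y = 1
        for i, partials in reversed(self.ops): # return sweep, i = l, ..., 1
            for j, c_ij in partials.items():
                adj[j] += adj[i] * c_ij        # bar v_j += bar v_i * c_ij
        return adj

# Usage on the running example: the whole gradient in one reverse sweep.
t = Tape()
x1, x2 = t.var(1.5), t.var(0.5)
v1 = t.div(x1, x2)
v4 = t.sub(v1, t.exp(x2))
y  = t.mul(t.add(t.sin(v1), v4), v4)
adj = t.gradient(y)
print(t.vals[y], adj[x1], adj[x2])   # ~2.0167, ~3.0118, ...
```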
Explanation
It is assumed as a precondition that the adjoint quantities
v̄_i for 1 ≤ i ≤ l have been initialized to zero
As indicated by the range specification i = l, ..., 1
We think of the incremental assignments as being executed in reverse order, i.e., for i = l, l − 1, l − 2, ..., 1.
Only then is it guaranteed
That each v̄_i will reach its full value before it occurs on the right-hand side.
Furthermore
We can combine the incremental operations
Effected by the adjoint of φ_i into

ū_i +≡ v̄_i · ∇φ_i (u_i)  where ū_i ≡ ( v̄_j)_{j≺i} ∈ R^{n_i}

Something Remarkable
We can do something different
One can directly compute the value of the adjoint quantity v̄_j by collecting all contributions to it as a sum ranging over all successors i ≻ j.
This non-incremental version
Requires global information that is not easy to come by.
Complexity
Something Notable

TIME {F (x) , ȳ^T F′ (x)} ≤ ω_grad TIME {F (x)}

Where ω_grad ∈ [3, 4] (The cheap gradient principle)
Remember
Time Complexity

TIME {F (x) , F′ (x) ˙x} ≤ ω_tan TIME {F (x)}

Where ω_tan ∈ [2, 5/2]
Example: a single layer perceptron
We have

y = σ (Σ_{i=1}^{3} w_i x_i)
First Phase
Forward Step

v_{−2} = w_1
v_{−1} = w_2
v_0 = w_3
v_1 = x_1 v_{−2}
v_2 = x_2 v_{−1}
v_3 = x_3 v_0
v_4 = v_1 + v_2 + v_3
v_5 = σ (v_4)
y_1 = v_5
Second Phase
Incremental Return

Forward Step
v_{−2} = w_1,  v_{−1} = w_2,  v_0 = w_3
v_1 = x_1 v_{−2},  v_2 = x_2 v_{−1},  v_3 = x_3 v_0
v_4 = v_1 + v_2 + v_3
v_5 = σ (v_4)
y_1 = v_5

Incremental Return
v̄_5 = ȳ_1 = 1
v̄_4 = (∂v_5/∂v_4) ȳ_1 = σ′ (v_4)
v̄_3 +≡ (∂v_4/∂v_3) v̄_4 = 1 × σ′ (v_4)
v̄_0 = (∂v_3/∂v_0) v̄_3 = x_3 × σ′ (v_4)
v̄_2 +≡ (∂v_4/∂v_2) v̄_4 = 1 × σ′ (v_4)
v̄_{−1} = (∂v_2/∂v_{−1}) v̄_2 = x_2 × σ′ (v_4)
v̄_1 +≡ (∂v_4/∂v_1) v̄_4 = 1 × σ′ (v_4)
v̄_{−2} = (∂v_1/∂v_{−2}) v̄_1 = x_1 × σ′ (v_4)

w̄_3 = x_3 × σ′ (v_4)
w̄_2 = x_2 × σ′ (v_4)
w̄_1 = x_1 × σ′ (v_4)
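A quick standalone check of these formulas (a sketch with made-up numbers; we take σ to be the logistic function, so σ′(v) = σ(v)(1 − σ(v))), comparing the adjoint result w̄_i = x_i σ′(v_4) against the difference quotients discussed earlier:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def y(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

w, x = [0.2, -0.4, 0.7], [1.0, 2.0, -1.5]
v4 = sum(wi * xi for wi, xi in zip(w, x))
s = sigmoid(v4)
adjoint = [xi * s * (1 - s) for xi in x]   # w_bar_i = x_i * sigma'(v4)

# Finite-difference check of each gradient component.
h = 1e-6
for i in range(3):
    wp = list(w); wp[i] += h
    fd = (y(wp, x) - y(w, x)) / h
    print(adjoint[i], fd)                   # both columns agree to ~1e-5
```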
How does it compare with the Forward Mode?
We notice that you have to do the following for each gradient variable (here, one full sweep with the seed on w_3)

Forward Step; Tangent of the Forward Step
v_{−2} = w_1;  ˙v_{−2} = ˙w_1 = 0
v_{−1} = w_2;  ˙v_{−1} = ˙w_2 = 0
v_0 = w_3;  ˙v_0 = ˙w_3 = 1
v_1 = x_1 v_{−2};  ˙v_1 = x_1 ˙v_{−2} = 0
v_2 = x_2 v_{−1};  ˙v_2 = x_2 ˙v_{−1} = 0
v_3 = x_3 v_0;  ˙v_3 = x_3 ˙v_0 = x_3
v_4 = v_1 + v_2 + v_3;  ˙v_4 = ˙v_1 + ˙v_2 + ˙v_3 = x_3
v_5 = σ (v_4);  ˙v_5 = σ′ (v_4) × ˙v_4 = x_3 × σ′ (v_4)
y_1 = v_5;  ˙y_1 = ˙v_5

So the forward mode needs one sweep per weight, whereas the single incremental return above delivered all three components at once.
Let us look at the following example
We have the following system of equations

y_1 = σ (w_1 x)
y_2 = σ (w_2 x)
y_3 = σ (w_3 x)
With the following graph
Notice the difference with a neural network
The Forward mode looks like
We have that (one sweep, with the seed on the single input x)

v_0 = x;  ˙v_0 = ˙x = 1
v_1 = w_1 v_0;  ˙v_1 = w_1 ˙v_0 = w_1
v_2 = w_2 v_0;  ˙v_2 = w_2 ˙v_0 = w_2
v_3 = w_3 v_0;  ˙v_3 = w_3 ˙v_0 = w_3
v_4 = σ (v_1);  ˙v_4 = σ′ (v_1) × ˙v_1 = w_1 × σ′ (v_1)
v_5 = σ (v_2);  ˙v_5 = σ′ (v_2) × ˙v_2 = w_2 × σ′ (v_2)
v_6 = σ (v_3);  ˙v_6 = σ′ (v_3) × ˙v_3 = w_3 × σ′ (v_3)
y_1 = v_4;  ˙y_1 = ˙v_4
y_2 = v_5;  ˙y_2 = ˙v_5
y_3 = v_6;  ˙y_3 = ˙v_6
Now you can see it
Forward and Reverse Mode
They depend on the input and output size!!!
A More Formal Statement
For a function f : R^n → R^m, suppose we wish to compute all the elements of the m × n Jacobian matrix
Ignoring the overhead of building the expression graph
Forward mode requires n sweeps and reverse mode requires m sweeps, so reverse mode performs better when n > m.
Consequences for Deep Learning
With a relatively small overhead
The performance of reverse-mode AD is superior when n ≫ m, that is, when we have many inputs and few outputs.
As we saw in the previous examples
If n ≤ m, forward mode performs better
Special Cases
Nevertheless, when we have a comparable number of outputs and inputs
Forward mode can be more efficient,
Given the lower overhead associated with storing the expression graph in memory in forward mode.
For Example
If you have f : R^n → R, when n = 1 forward mode is more efficient, but the result flips as n increases.
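In practice, this trade-off is exposed directly by AD libraries. For instance, JAX (one concrete choice among many, not mentioned in the slides) provides both modes for Jacobians, and the rule of thumb above tells you which to pick:

```python
import jax
import jax.numpy as jnp

def f(x):
    # f : R^n -> R^m with n = 3, m = 2
    return jnp.array([jnp.sin(x[0]) * x[1], x[1] * x[2] ** 2])

x = jnp.array([1.0, 2.0, 3.0])

J_fwd = jax.jacfwd(f)(x)   # forward mode: one JVP sweep per input (n sweeps)
J_rev = jax.jacrev(f)(x)   # reverse mode: one VJP sweep per output (m sweeps)

print(jnp.allclose(J_fwd, J_rev))  # same 2x3 Jacobian, different cost profile
```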
Be Aware
Be Careful
A computationally naive implementation of AD can result in slow code and excessive use of memory.
Additionally
There exists no standard set of problems spanning the diversity of AD applications.
Source Transformation in Fortran and C
First
We start with the source code of a computer program that implements our target function.
Second
A preprocessor then applies differentiation rules to the code, generating new source code which calculates derivatives.
Remember our basic tables
Example of how source code transformation could work
We could have the following pipeline

function.c → [AD Tool] → diff_function.c → [C compiler] → diff_function.o
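To make this concrete, here is a hand-written sketch of the kind of code a source transformation step might emit (in Python for consistency with the other examples, although real tools such as ADIFOR and Tapenade operate on Fortran and C; the generated function f_d and its shape are purely illustrative, not actual tool output):

```python
import math

# Original source: the target function.
def f(x1, x2):
    return math.sin(x1 / x2) * x2

# What a (hypothetical) preprocessor could generate: every statement
# is paired with its tangent statement, exactly as in the tables above.
def f_d(x1, dx1, x2, dx2):
    v1  = x1 / x2
    dv1 = dx1 / x2 - x1 * dx2 / x2**2
    v2  = math.sin(v1)
    dv2 = math.cos(v1) * dv1
    y   = v2 * x2
    dy  = dv2 * x2 + v2 * dx2
    return y, dy

print(f_d(1.5, 1.0, 0.5, 0.0))   # value of f and dy/dx1
```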
Limitations of Source Transformation
Severe limitations with source transformation
It can only use information available at compile time
It cannot handle more sophisticated programming statements, such as while loops, C++ templates, and other object-oriented features
For this, operator overloading is the appropriate technology.
Operator overloading
The key idea is to introduce a new class of objects
Containing the value of a variable on the expression graph and a differential component.
Not all variables on the expression graph will belong to this class
But the root variables, which require sensitivities, and all the intermediate variables, will.
In a forward mode framework
The differential component is the derivative with respect to one input.
Furthermore
In a reverse mode framework
It is the adjoint with respect to one output.
Something Notable
Operators and math functions are overloaded to handle these new types.
Basically
The operators are overloaded
So it is possible to handle the dual numbers under their new
arithmetic (Forward Mode)
Building a Computational Graph
Parse the equations

y_1 = σ (w_1 x)
y_2 = σ (w_2 x)
y_3 = σ (w_3 x)

Generate variables for intermediate values
If v_0 = x, then V = V ∪ {v_0}
Generate edges between the intermediate values
If v_1 = w_1 v_0, then E = E ∪ {(v_0, v_1)}
Topological sort for Evaluation
Run the topological sort, then you get the order of evaluation
Using the graph built in the previous step
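A compact sketch of both steps (our own illustration; the graph is stored as node and edge sets exactly as above, and the evaluation order comes from Python's standard-library topological sorter):

```python
from graphlib import TopologicalSorter

# Computational graph for y_k = sigma(w_k * x), k = 1, 2, 3.
# Edges (u, v) mean "v depends on u".
V = {"x", "v1", "v2", "v3", "y1", "y2", "y3"}
E = {("x", "v1"), ("x", "v2"), ("x", "v3"),     # v_k = w_k * x
     ("v1", "y1"), ("v2", "y2"), ("v3", "y3")}  # y_k = sigma(v_k)

# graphlib wants predecessor sets: node -> {nodes it depends on}.
deps = {v: set() for v in V}
for u, v in E:
    deps[v].add(u)

order = list(TopologicalSorter(deps).static_order())
print(order)   # e.g. ['x', 'v1', 'v2', 'v3', 'y1', 'y2', 'y3']
```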
For the Reverse Mode
We could use a stack
When assignments occur during the forward sweep of the process

v̄_i ≡ 0,  i = 1 − n, ..., l
v_{i−n} ≡ x_i,  i = 1, ..., n
v_i ≡ φ_i (v_j)_{j≺i},  i = 1, ..., l
y_{m−i} ≡ v_{l−i},  i = m − 1, ..., 0

Then we pop the necessary elements
During the return sweep
Nevertheless
There are many techniques to improve the efficiency and avoid aliasing problems of these modes [12]
1 Taping for the adjoint recursion
2 Caching
3 Checkpointing
4 Expression Templates
5 etc.
You are invited to read more about them
Given that these techniques are already being implemented in languages such as Swift...
“First-Class Automatic Differentiation in Swift: A Manifesto”
https://gist.github.com/rxwei/30ba75ce092ab3b0dce4bde1fc2c9f1d
Between Two Extremes
Something Notable
Forward and reverse accumulation are just two (extreme) ways of traversing the chain rule.
The problem of computing a full Jacobian of f : R^n → R^m with a minimum number of arithmetic operations
It is known as the Optimal Jacobian Accumulation (OJA) problem, which is NP-complete [13].
Finally
Using all the previous ideas
The Graph Structure Proposed in [2]
The Computational Graph of AD
The Forward and Reverse Methods
It has been possible to develop the Deep Learning Frameworks
TensorFlow
Torch
PyTorch
Keras
etc...
References
[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” tech. rep., California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[2] R. Rojas, Neural Networks: A Systematic Introduction. Springer Science & Business Media, 1996.
[3] R. Collobert, S. Bengio, and J. Mariéthoz, “Torch: a modular machine learning software library,” Idiap-RR Idiap-RR-46-2002, IDIAP, 2002.
[4] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind, “Automatic differentiation in machine learning: a survey,” Journal of Machine Learning Research, vol. 18, no. 153, 2018.
[5] C. H. Bischof, A. Carle, P. Khademi, and A. Mauer, “ADIFOR 2.0: Automatic differentiation of Fortran 77 programs,” IEEE Computational Science & Engineering, vol. 3, no. 3, pp. 18–32, 1996.
[6] C. Elliott, “The simple essence of automatic differentiation,” Proceedings of the ACM on Programming Languages, vol. 2, no. ICFP, p. 70, 2018.
[7] A. Griewank and A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, vol. 105. SIAM, 2008.
[8] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[9] L. Hascoët and V. Pascual, “The Tapenade automatic differentiation tool: Principles, model, and specification,” ACM Transactions on Mathematical Software, vol. 39, no. 3, pp. 20:1–20:43, 2013.
[10] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural Networks: Tricks of the Trade, pp. 9–48, Springer, 2012.
[11] R. Alexander, “Solving ordinary differential equations I: Nonstiff problems (E. Hairer, S. P. Nørsett, and G. Wanner),” SIAM Review, vol. 32, no. 3, p. 485, 1990.
[12] C. C. Margossian, “A review of automatic differentiation and its efficient implementation,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 9, no. 4, p. e1305, 2019.
[13] U. Naumann, “Optimal Jacobian accumulation is NP-complete,” Mathematical Programming, vol. 112, no. 2, pp. 427–441, 2008.
    Images/cinvestav Backpropagation Algorithm Consider anetwork with a single input and a network function F The Derivative F (x) is computed in two phases. 1 Feed-forward: The input x is fed into the network. The primitive functions at the nodes and their derivatives are evaluated at each node. The derivatives are stored at the left side of the node. 2 Backpropagation: The constant 1 is fed into the output unit and the network is run backwards. Incoming information to a node is added and the result is multiplied by the value stored in the left part of the unit. The result is transmitted to the left of the unit. The result collected at the input unit is the derivative of the network function with respect to x. 20 / 158
  • 37.
    Images/cinvestav Backpropagation Algorithm Consider anetwork with a single input and a network function F The Derivative F (x) is computed in two phases. 1 Feed-forward: The input x is fed into the network. The primitive functions at the nodes and their derivatives are evaluated at each node. The derivatives are stored at the left side of the node. 2 Backpropagation: The constant 1 is fed into the output unit and the network is run backwards. Incoming information to a node is added and the result is multiplied by the value stored in the left part of the unit. The result is transmitted to the left of the unit. The result collected at the input unit is the derivative of the network function with respect to x. 20 / 158
  • 38.
    Images/cinvestav Backpropagation Algorithm Consider anetwork with a single input and a network function F The Derivative F (x) is computed in two phases. 1 Feed-forward: The input x is fed into the network. The primitive functions at the nodes and their derivatives are evaluated at each node. The derivatives are stored at the left side of the node. 2 Backpropagation: The constant 1 is fed into the output unit and the network is run backwards. Incoming information to a node is added and the result is multiplied by the value stored in the left part of the unit. The result is transmitted to the left of the unit. The result collected at the input unit is the derivative of the network function with respect to x. 20 / 158
  • 39.
    Images/cinvestav Backpropagation Algorithm Consider anetwork with a single input and a network function F The Derivative F (x) is computed in two phases. 1 Feed-forward: The input x is fed into the network. The primitive functions at the nodes and their derivatives are evaluated at each node. The derivatives are stored at the left side of the node. 2 Backpropagation: The constant 1 is fed into the output unit and the network is run backwards. Incoming information to a node is added and the result is multiplied by the value stored in the left part of the unit. The result is transmitted to the left of the unit. The result collected at the input unit is the derivative of the network function with respect to x. 20 / 158
  • 40.
    Images/cinvestav Backpropagation Algorithm Consider anetwork with a single input and a network function F The Derivative F (x) is computed in two phases. 1 Feed-forward: The input x is fed into the network. The primitive functions at the nodes and their derivatives are evaluated at each node. The derivatives are stored at the left side of the node. 2 Backpropagation: The constant 1 is fed into the output unit and the network is run backwards. Incoming information to a node is added and the result is multiplied by the value stored in the left part of the unit. The result is transmitted to the left of the unit. The result collected at the input unit is the derivative of the network function with respect to x. 20 / 158
  • 41.
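As a concrete illustration of the two phases, here is a minimal sketch on a chain of unary nodes (the chain and its primitives are my own illustrative choices, not the slides' network): the stored list plays the role of the "left side of the node", and the backward loop is the constant 1 being multiplied through.

```python
import math

# A minimal sketch of the two-phase procedure on a chain of unary nodes,
# F(x) = f_k(...f_2(f_1(x))...). The primitives are illustrative choices.
nodes = [
    (math.sin, math.cos),                  # f(v) = sin v,  f'(v) = cos v
    (math.exp, math.exp),                  # f(v) = e^v,    f'(v) = e^v
    (lambda v: v * v, lambda v: 2.0 * v),  # f(v) = v^2,    f'(v) = 2v
]

def network_derivative(x):
    # Feed-forward: evaluate each primitive and store its derivative.
    stored = []
    v = x
    for f, df in nodes:
        stored.append(df(v))
        v = f(v)
    # Backpropagation: feed 1 into the output, multiply by the stored values.
    bar = 1.0
    for d in reversed(stored):
        bar *= d
    return v, bar  # F(x) and F'(x)

print(network_derivative(0.5))
```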
Proof of Correctness about the derivatives
Proposition: the backpropagation algorithm computes the derivative of the network function F with respect to the input x correctly.
Proof: by induction, assume that the algorithm works for networks with n or fewer nodes.
21 / 158
Consider
The following network with n + 1 nodes:
[Figure: the network]
22 / 158
Thus
We have that
\[ F(x) = \varphi\big(w_1 F_1(x) + w_2 F_2(x) + \cdots + w_m F_m(x)\big) \]
and the derivative
\[ F'(x) = \varphi'(s)\big[w_1 F_1'(x) + w_2 F_2'(x) + \cdots + w_m F_m'(x)\big] \]
with
\[ s = w_1 F_1(x) + w_2 F_2(x) + \cdots + w_m F_m(x) \]
23 / 158
Now, we use induction
Consider the subgraph of the main graph which contains all the nodes up to F_1(x). By induction, we can calculate the derivative F_1'(x) by introducing a 1 into its last unit and doing backpropagation. The same happens for all the other units.
Now, if instead of multiplying by 1 we introduce \varphi'(s) and multiply by w_j, we get w_j F_j'(x) \varphi'(s).
This can be accomplished by introducing a 1 into the output unit, multiplying by the stored value \varphi'(s), and distributing the result to the m units through the edge weight nodes.
24 / 158
Basically, we get the derivative
We then get
\[ \varphi'(s)\big[w_1 F_1'(x) + w_2 F_2'(x) + \cdots + w_m F_m'(x)\big] \]
Basically, the network is run backwards:
\[ F'(x) = \varphi'(s)\big[w_1 F_1'(x) + w_2 F_2'(x) + \cdots + w_m F_m'(x)\big] \]
The algorithm works for n + 1 nodes. QED
25 / 158
Outline: Backpropagation > Moving everything to Tensors
26 / 158
Why not use matrices to process all the individual parts?
Imagine the following, a simple idea:
\[ X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} \]
We know the fields from input to hidden are created as
\[ g(X) = XW = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} \begin{pmatrix} w_1 & w_2 & \cdots & w_d \end{pmatrix} \]
27 / 158
Where
We have these constructs, g_{ij}(x_i^T) = x_i^T w_j:
\[ g(X) = \begin{pmatrix} g_{11}(x_1^T) & g_{12}(x_1^T) & \cdots & g_{1d}(x_1^T) \\ g_{21}(x_2^T) & g_{22}(x_2^T) & \cdots & g_{2d}(x_2^T) \\ \vdots & \vdots & \ddots & \vdots \\ g_{N1}(x_N^T) & g_{N2}(x_N^T) & \cdots & g_{Nd}(x_N^T) \end{pmatrix} \]
28 / 158
Then
We have that f_{ij}(x) = \frac{1}{1+\exp\{-x\}}, so
\[ f(g(X)) = \begin{pmatrix} f\big(g_{11}(x_1^T)\big) & f\big(g_{12}(x_1^T)\big) & \cdots & f\big(g_{1d}(x_1^T)\big) \\ f\big(g_{21}(x_2^T)\big) & f\big(g_{22}(x_2^T)\big) & \cdots & f\big(g_{2d}(x_2^T)\big) \\ \vdots & \vdots & \ddots & \vdots \\ f\big(g_{N1}(x_N^T)\big) & f\big(g_{N2}(x_N^T)\big) & \cdots & f\big(g_{Nd}(x_N^T)\big) \end{pmatrix} \]
29 / 158
Finally, we can do the following modification when going forward
The matrix can be extended so that each entry stores the derivative together with the value it was evaluated at:
\[ g'(X)\,\big|\,g(X) = \left( \frac{dg_{ij}(x_i^T)}{dw_j} \,\Big|\; x_i^T w_j \right)_{\substack{i=1,\dots,N \\ j=1,\dots,d}} \]
30 / 158
Finally, we have
The next function:
\[ f'(g(X))\,\big|\,f(g(X)) = \left( \frac{df_{ij}(x)}{dx}\big(g_{ij}(x_i^T)\big) \,\Big|\; f\big(g_{ij}(x_i^T)\big) \right)_{\substack{i=1,\dots,N \\ j=1,\dots,d}} \]
31 / 158
Using the Hadamard Product
We have for the backpropagation
\[ f'(g(X)) \circ g'(X) \]
In particular, for a position ij:
\[ \frac{dg_{ij}(x_i^T)}{dw_j} \times \frac{df_{ij}(x)}{dx}\big(g_{ij}(x_i^T)\big) = \frac{df_{ij}(x)}{dx}\big(g_{ij}(x_i^T)\big) \times \begin{pmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{di} \end{pmatrix} \]
32 / 158
Then, using a vertical sum
We get the change that is imposed on each possible vector w_j:
\[ \mathrm{sum}\big(f'(g(X)) \circ g'(X),\ \mathrm{axis}=0\big) = \left( \sum_{i=1}^{N} \frac{dg_{ij}(x_i^T)}{dw_j} \times \frac{df_{ij}(x)}{dx}\big(g_{ij}(x_i^T)\big) \right)_{j=1}^{d} \]
33 / 158
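To make the tensorized forward/backward passes concrete, here is a minimal NumPy sketch (sizes, names, and the dummy upstream signal are my own illustration, not from the slides): the forward pass stores the sigmoid derivatives, and the backward pass combines them with a Hadamard product and an axis-0 sum.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, dim_in, dim_hid = 5, 3, 4          # illustrative sizes
X = rng.normal(size=(N, dim_in))      # rows are the samples x_i^T
W = rng.normal(size=(dim_in, dim_hid))

# Forward: fields g(X) = XW and activations f(g(X)), storing f' alongside.
G = X @ W                              # g_ij(x_i^T) = x_i^T w_j
F = sigmoid(G)
dF = F * (1.0 - F)                     # df_ij/dx evaluated at g_ij(x_i^T)

# Backward: suppose dL/dF arrives from the layers above (dummy signal here).
dL_dF = np.ones_like(F)
delta = dL_dF * dF                     # Hadamard product with f'(g(X))
# Vertical sum over the N samples: since dg_ij/dw_j = x_i, the per-column
# sum(..., axis=0) is exactly X^T applied to the error signal.
grad_W = X.T @ delta
print(grad_W.shape)                    # (dim_in, dim_hid)
```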
Now, a Historical Perspective
The idea of a graph structure was proposed by Raul Rojas in "Neural Networks - A Systematic Introduction" (1996).
TensorFlow was initially released on November 9, 2015. It was originally an inception of the "Google Brain" project (circa 2011), so TensorFlow started around 2012-2013 with internal development and DNNResearch's code (Hinton's company).
However, the graph idea was introduced in 2002 in Torch, the basis of PyTorch (circa 2016). One of its creators, Samy Bengio, is the brother of Yoshua Bengio [3].
34 / 158
Outline: Automatic Differentiation > Introduction
35 / 158
Backpropagation, a little brother of Automatic Differentiation (AD)
We have a crude way to obtain derivatives [4, 5, 6][7]:
\[ D_{+h}f(x) \approx \frac{f(x+h) - f(x)}{h} \quad \text{or} \quad D_{\pm h}f(x) \approx \frac{f(x+h) - f(x-h)}{2h} \]
Huge problems:
If h is small, then cancellation error reduces the number of significant figures in D_{+h}f(x).
If h is not small, then truncation errors (terms such as h^2 f'''(x)) become significant.
Even if h is optimally chosen, the values of D_{+h}f(x) and D_{\pm h}f(x) will be accurate to only about 1/2 or 2/3 of the significant digits of f.
36 / 158
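A quick numerical check of that digit-loss claim (my own sketch; f = exp is chosen so the exact derivative is known): sweeping h shows the one-sided quotient achieving at best about half of the 16 significant digits, and the central quotient about two thirds.

```python
import math

f, dexact, x = math.exp, math.exp(1.0), 1.0

for k in range(1, 13):
    h = 10.0 ** (-k)
    fwd = (f(x + h) - f(x)) / h                # one-sided quotient D_{+h}
    ctr = (f(x + h) - f(x - h)) / (2.0 * h)    # central quotient
    print(f"h=1e-{k:02d}  fwd err={abs(fwd - dexact):.1e}  "
          f"ctr err={abs(ctr - dexact):.1e}")
```

The forward error bottoms out around h = 1e-8 at roughly 1e-8 (half the digits), the central error around h = 1e-5 at roughly 1e-11 (two thirds); smaller h only makes cancellation worse.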
Outline: Automatic Differentiation > Advantages of Automatic Differentiation
37 / 158
Outline: Automatic Differentiation > Avoiding Truncation Errors
38 / 158
Avoiding Truncation Errors
We have that algorithmic differentiation does not incur truncation errors.
For example, take
\[ f(x) = \sum_{i=1}^{n} x_i^2 \quad \text{at } x_i = i \text{ for } i = 1,\dots,n \]
Then, for e_1 \in R^n (in exact arithmetic),
\[ \frac{f(x + he_1) - f(x)}{h} = \frac{\partial f(x)}{\partial x_1} + h = 2x_1 + h = 2 + h \]
39 / 158
Floating Points
Given that the quantity f(x + he_1) \approx n^3/3 (since \sum_{i=1}^{n} i^2 \approx n^3/3) needs a 64-bit floating point representation, its roundoff error is
\[ \epsilon\, f(x + he_1) \approx \epsilon\, \frac{n^3}{3}, \quad \text{with } \epsilon = 2^{-54} \approx 10^{-16} \]
For h = \sqrt{\epsilon}, as is often recommended, the difference quotient has a rounding error of size
\[ \frac{1}{3} n^3 \sqrt{\epsilon} \approx \frac{1}{3} n^3\, 10^{-8} \]
40 / 158
Now, imagine n = 1000
Then the rounding error is
\[ \frac{1}{3} n^3 \sqrt{\epsilon} \approx \frac{1}{3} \times 10^9 \times 10^{-8} = \frac{10}{3} \approx 3.333\dots \]
Ouch: against a true derivative of 2, we cannot even get the sign of
\[ \frac{f(x + he_1) - f(x)}{h} \]
correctly!!!
41 / 158
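A sketch of that failure (my own illustration, not from the slides): in float64, the perturbation 2h that the quotient depends on sits below the rounding granularity of the accumulated sum, so the quotient comes out wildly wrong.

```python
# For f(x) = sum(x_i^2) with x_i = i and n = 1000, the forward quotient
# at h = 1e-8 is swamped by rounding. Names here are illustrative.
n, h = 1000, 1e-8

def f(x1):
    # x = (x1, 2, 3, ..., n); only the first coordinate is perturbed.
    s = x1 * x1
    for i in range(2, n + 1):
        s += float(i) * float(i)
    return s

quotient = (f(1.0 + h) - f(1.0)) / h
print("difference quotient:", quotient)  # swamped by rounding
print("true derivative    :", 2.0)
```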
In contrast, Automatic Differentiation
It yields 2x_i in both its forward and reverse modes.
You could assume that the derivatives are generated symbolically. This is true in some sense, but the symbolic expression 2x_i is never actually generated.
Instead, the numerical value of x_i is multiplied by 2 and then returned as the gradient value.
42 / 158
Example using Forward Differentiation
We will see the forward procedure later on.
\[ f(x) = \sum_{i=1}^{n} x_i^2 \quad \text{with } x_i = i \text{ for } i = 1,\dots,n \]
AD initializes (do not worry, we will see this in more detail):
\[ v_{i-n} = i \text{ for } i = 1,\dots,n, \qquad \dot v_{i-n} = 0, \text{ but } \dot v_{1-n} = 1 \]
43 / 158
Then, we have that
Apply the compositions \varphi (functions and derivatives side by side):
v_1 = 1^2, \quad \dot v_1 = \frac{\partial v_1}{\partial v_{1-n}} \dot v_{1-n} = 2 \times (1) \times 1 = 2
\vdots
v_n = n^2, \quad \dot v_n = 0
Therefore, at the end the tangents are (\dot v_1, \dots, \dot v_n) = (2, 0, \dots, 0), i.e. \frac{\partial f}{\partial x_1}(x) = 2.
44 / 158
Quite different from
Using a numerical difference, we have
\[ \frac{f(x + e_1 h) - f(x)}{h} - 2 < 0 \]
Then, for n = 10^j and h = 10^{-k},
\[ 10^k \big((h + 1)^2 - 1\big) < 2 \]
Finally, we have k > -\log_{10} 3.
45 / 158
Therefore
It is possible to get into underflow by taking k > -\log_{10} 3.
Therefore, we have that Automatic Differentiation allows us to obtain the correct answer!!!
46 / 158
Outline: Automatic Differentiation > Differences with Symbolic Differentiation
47 / 158
For example
You have the following equation:
\[ f(x) = \prod_{i=1}^{n} x_i \]
Then, the gradient:
\[ f'(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right) = \left( \prod_{j \neq i} x_j \right)_{i=1,\dots,n} \]
\[ = \big( x_2 \times x_3 \times \cdots \times x_n,\ \dots,\ x_1 \times \cdots \times x_{i-1} \times x_{i+1} \times \cdots \times x_n,\ \dots,\ x_1 \times x_2 \times \cdots \times x_{n-1} \big) \]
48 / 158
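A sketch of the reuse of common subexpressions that the next slide alludes to (my own illustration): one prefix-product pass and one suffix-product pass deliver the whole gradient in O(n), instead of n separate symbolic products of n - 1 factors each.

```python
def product_gradient(x):
    """Gradient of f(x) = x_1 * x_2 * ... * x_n via shared partial products."""
    n = len(x)
    prefix = [1.0] * (n + 1)          # prefix[i] = x_1 * ... * x_i
    for i in range(n):
        prefix[i + 1] = prefix[i] * x[i]
    grad = [0.0] * n
    suffix = 1.0                      # running product x_{i+1} * ... * x_n
    for i in range(n - 1, -1, -1):
        grad[i] = prefix[i] * suffix  # product of all x_j with j != i
        suffix *= x[i]
    return grad

print(product_gradient([1.0, 2.0, 3.0, 4.0]))  # [24.0, 12.0, 8.0, 6.0]
```

This is exactly the sharing that reverse-mode AD performs automatically on the product.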
Actually
Symbolic differentiation will consume a lot of memory here. Instead, AD will reuse the common expressions to improve performance and memory.
However, both symbolic and automatic differentiation make use of the chain rule to achieve their results.
The difference is that in AD the chain rule is applied not to symbolic expressions but to the actual numerical values.
49 / 158
Outline: Automatic Differentiation > Difference Quotients May be Useful
50 / 158
The User Insight
Difference quotients may sometimes be useful too:
\[ \frac{f(x + he_1) - f(x)}{h} \]
Computer algebra packages have really neat ways to simplify expressions.
In contrast, current AD packages assume that the given program calculates the underlying function efficiently.
51 / 158
There, AD can automate the gradient generation
The best results will be obtained when AD takes advantage of the user's insight into the structure underlying the program.
52 / 158
Outline: Automatic Differentiation > RNN Example
53 / 158
RNN Example
When you look at the Elman recurrent neural network [8]:
\[ h_t = \sigma_h(W_{sd} x_t + U_{sh} h_{t-1} + b_h) \]
\[ y_t = \sigma_y(V_{os} h_t) \]
\[ L = \frac{1}{2}(y_t - z_t)^2 \]
Here, if you do blind AD, sooner or later you have
\[ \frac{\partial h_t}{\partial h_{t-1}} \times \frac{\partial h_{t-1}}{\partial h_{t-2}} \times \frac{\partial h_{t-2}}{\partial h_{t-3}} \times \cdots \times \frac{\partial h_{k+1}}{\partial h_k} \]
This is known as Back Propagation Through Time (BPTT). This is a problem, given the vanishing gradient or exploding gradient.
54 / 158
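A tiny numerical sketch of why that chain is dangerous (my own illustration; random matrices stand in for the Jacobians \partial h_t / \partial h_{t-1}, which is an assumption, not the slides' network): the accumulated product shrinks or blows up geometrically with the number of time steps.

```python
import numpy as np

rng = np.random.default_rng(1)
d, steps = 8, 50

for scale in (0.5, 1.5):                 # contracting vs expanding Jacobians
    prod = np.eye(d)
    for _ in range(steps):
        J = scale * rng.normal(size=(d, d)) / np.sqrt(d)  # stand-in Jacobian
        prod = J @ prod
    print(f"scale={scale}: ||product|| = {np.linalg.norm(prod):.3e}")
```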
Here, you can modify the architecture
Using an intermediate layer and the Hadamard product \circ, we have
\[ L = \frac{1}{2}(y_t - z_t)^2 \]
\[ y_t = \sigma_y(W_{od} x_t + U_{oh} h_{t-1} + b_o) \]
\[ s_t = \sigma_s(V_{ho} y_t + D_{hd} x_t + b_h) \]
\[ h_t = (1 - y_t) \circ h_{t-1} + y_t \circ s_t \]
55 / 158
Therefore
You have multiple paths of derivatives.
[Figure: the multiple derivative paths through the unrolled network]
56 / 158
One of them
It can be seen that one of the paths can take you to BPTT.
57 / 158
The Other One
The other one gets you into a more Markovian property: this allows you to get a backpropagation that does not require BPTT.
How? For example, the derivative of L with respect to D_{hd}:
\[ \frac{\partial L}{\partial D_{hd}} = \frac{\partial L}{\partial y_t} \times \frac{\partial y_t}{\partial net_y} \times \frac{\partial net_y}{\partial h_{t-1}} \times \frac{\partial h_{t-1}}{\partial s_{t-2}} \times \frac{\partial s_{t-2}}{\partial net_s} \times \frac{\partial net_s}{\partial D_{hd}} \]
58 / 158
Therefore
You do not have the backpropagation through time... you can avoid it altogether!!!
Because Backpropagation Through Time makes the process of obtaining the gradients unstable...
59 / 158
Thus
A great simplifying step. Here the phrase "AD taking advantage of the user's insight" rings true.
60 / 158
Outline: Automatic Differentiation > A Simple Example
61 / 158
A Simple Example
Here, we have the following ideas:
Some of the floating point values generated by the AD will be stored in variables of the program; others will only be held until overwritten or discarded.
Thus, we will introduce the concept of the Evaluation Trace, which is basically a record of a particular run of a given program.
This Evaluation Trace stores the input variables, the sequence of floating point values generated by the CPU, and the operations used to produce them.
62 / 158
Example
A simple example:
\[ y = f(x_1, x_2) = \left[ \sin\left(\frac{x_1}{x_2}\right) + \frac{x_1}{x_2} - \exp(x_2) \right] \times \left[ \frac{x_1}{x_2} - \exp(x_2) \right] \]
We wish to calculate y = f(x_1, x_2) with x_1 = 1.5, x_2 = 0.5.
63 / 158
Evaluation Trace/Forward Procedure
We have the table for the evaluation of the function:
v_{-1} = x_1 = 1.5
v_0 = x_2 = 0.5
v_1 = v_{-1}/v_0 = 1.5/0.5 = 3.0
v_2 = \sin(v_1) = \sin(3.0) = 0.1411
v_3 = \exp(v_0) = \exp(0.5) = 1.6487
v_4 = v_1 - v_3 = 3.0 - 1.6487 = 1.3513
v_5 = v_2 + v_4 = 0.1411 + 1.3513 = 1.4924
v_6 = v_5 \times v_4 = 1.4924 \times 1.3513 = 2.0167
y = v_6 = 2.0167
64 / 158
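As straight-line code, the trace reads as follows (my own transcription of the table):

```python
import math

x1, x2 = 1.5, 0.5
v1 = x1 / x2            # 3.0
v2 = math.sin(v1)       # 0.1411
v3 = math.exp(x2)       # 1.6487
v4 = v1 - v3            # 1.3513
v5 = v2 + v4            # 1.4924
v6 = v5 * v4            # 2.0167
print(round(v6, 4))     # y = 2.0167
```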
A Cautionary Note
Normally, programmers will try to rearrange this execution trace to improve performance through parallelism.
Thus, subexpressions will be algorithmically exploited by the AD to improve performance.
It is usually more convenient to use the so-called "computational graph".
65 / 158
Please take a look at the section "A Framework for Evaluating Functions" in Chapter 2 of the book [7]: Andreas Griewank and Andrea Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, vol. 105 (SIAM, 2008).
67 / 158
Outline: Automatic Differentiation > The Forward and Reverse Mode
67 / 158
A Little Bit of Notation
In general, we assume quantities v_i such that
\[ \underbrace{(v_{1-n}, \dots, v_0)}_{x}, \quad v_1, \dots, v_{l-m}, \quad \underbrace{(v_{l-m+1}, \dots, v_l)}_{y} \]
Then, we have:
1 v_{1-n}, \dots, v_0 are the initial input variables,
2 v_{l-m+1}, \dots, v_l are the output variables,
3 v_1, \dots, v_{l-m} are the intermediate variables.
68 / 158
Additionally
Each value v_i with i > 0 is obtained by applying an elemental function \varphi_i:
\[ v_i = \varphi_i(v_j)_{j \prec i} \]
where j \prec i means that v_i depends directly on v_j.
69 / 158
Then, for the application of the chain rule
It is useful to associate with each elemental function \varphi_i the state transformation
\[ v_i = \Phi_i(v_{i-1}), \quad \text{with } \Phi_i : R^{n+l} \to R^{n+l} \]
where v_i = (v_{1-n}, \dots, v_i, 0, \dots, 0)^T.
In other words, \Phi_i sets v_i to \varphi_i(v_j)_{j \prec i} and keeps all other components v_j, j \neq i, unchanged.
70 / 158
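A small sketch of this state-transformation view (my own code; the slot layout is an illustrative assumption): the trace becomes a composition of maps \Phi_i, each writing a single component of the state and leaving the rest unchanged.

```python
import math

# State layout for the running example: slots 0..1 hold v_{-1}, v_0 (the
# inputs); slots 2..7 hold v_1..v_6. Each Phi_i writes exactly one slot.
def make_phi(i, elemental):
    def phi(state):
        state = list(state)            # keep all other components unchanged
        state[i] = elemental(state)
        return state
    return phi

phis = [
    make_phi(2, lambda s: s[0] / s[1]),      # v1 = v_{-1} / v_0
    make_phi(3, lambda s: math.sin(s[2])),   # v2 = sin(v1)
    make_phi(4, lambda s: math.exp(s[1])),   # v3 = exp(v0)
    make_phi(5, lambda s: s[2] - s[4]),      # v4 = v1 - v3
    make_phi(6, lambda s: s[3] + s[5]),      # v5 = v2 + v4
    make_phi(7, lambda s: s[6] * s[5]),      # v6 = v5 * v4
]

state = [1.5, 0.5, 0, 0, 0, 0, 0, 0]
for phi in phis:
    state = phi(state)
print(round(state[-1], 4))                   # y = v6 = 2.0167
```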
Basically, the Computational Graph
[Figure: a simpler version of the computational graph]
71 / 158
Example of the Forward Mode
Suppose we want to differentiate y = f(x_1, x_2) with respect to x_1: we consider x_1 as an independent variable and y as a dependent variable.
We can work out the numerical value of y = f(x_1, x_2) while getting the numerical derivative of each of its components, something like
\[ \dot v_i = \frac{\partial v_i}{\partial x_1} \]
72 / 158
Therefore, we get
We have the procedure:
v_{-1} = x_1 = 1.5, \quad \dot v_{-1} = 1.0
v_0 = x_2 = 0.5, \quad \dot v_0 = 0.0
v_1 = v_{-1}/v_0 = 3.0, \quad \dot v_1 = \frac{\partial v_1}{\partial v_{-1}}\dot v_{-1} + \frac{\partial v_1}{\partial v_0}\dot v_0 = 2.0
v_2 = \sin(v_1) = 0.1411, \quad \dot v_2 = \cos(v_1) \times \dot v_1 = -1.98
v_3 = \exp(v_0) = 1.6487, \quad \dot v_3 = v_3 \times \dot v_0 = 0.0
v_4 = v_1 - v_3 = 1.3513, \quad \dot v_4 = \dot v_1 - \dot v_3 = 2.0
v_5 = v_2 + v_4 = 1.4924, \quad \dot v_5 = \dot v_2 + \dot v_4 = 0.02
v_6 = v_5 \times v_4 = 2.0167, \quad \dot v_6 = \dot v_5 \times v_4 + v_5 \times \dot v_4 = 3.0118
y = v_6 = 2.0167, \quad \dot y = 3.0118
73 / 158
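A minimal dual-number sketch of this tangent propagation (my own code, implementing only the operations the example needs):

```python
import math

class Dual:
    """Value v paired with tangent dv; each op also applies its tangent rule."""
    def __init__(self, v, dv=0.0):
        self.v, self.dv = v, dv
    def __truediv__(self, o):
        return Dual(self.v / o.v, (self.dv * o.v - self.v * o.dv) / (o.v * o.v))
    def __add__(self, o):
        return Dual(self.v + o.v, self.dv + o.dv)
    def __sub__(self, o):
        return Dual(self.v - o.v, self.dv - o.dv)
    def __mul__(self, o):
        return Dual(self.v * o.v, self.dv * o.v + self.v * o.dv)

def sin(u):
    return Dual(math.sin(u.v), math.cos(u.v) * u.dv)

def exp(u):
    w = math.exp(u.v)
    return Dual(w, w * u.dv)

x1 = Dual(1.5, 1.0)   # seed: differentiate with respect to x1
x2 = Dual(0.5, 0.0)
v1 = x1 / x2
y = (sin(v1) + v1 - exp(x2)) * (v1 - exp(x2))
print(round(y.v, 4), round(y.dv, 4))   # 2.0167 3.0118
```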
The first column of this process
It can be seen as an automatic procedure:
\[ v_{i-n} = x_i, \quad i = 1,\dots,n \]
\[ v_i = \varphi_i(v_j)_{j \prec i}, \quad i = 1,\dots,l \]
\[ y_{m-i} = v_{l-i}, \quad i = m-1,\dots,0 \]
74 / 158
In a similar way
We can obtain \partial f(x_1, x_2)/\partial x_2. However, it can be more efficient to redefine the \dot v_i as vectors!!!
75 / 158
Outline: Automatic Differentiation > Forward propagation of Tangents
76 / 158
Forward Propagation of Tangents
Remarks: as you can see, the second column of the evaluation procedure is generated in a purely mechanical way.
This increases the size: basically, twice the size of the original simple evaluation.
77 / 158
We have the following
We have the chain rule
\[ \dot y(t) = \frac{\partial F(x(t))}{\partial t} = F'(x(t))\, \dot x(t) \]
where F'(x) \in R^{m \times n} is the Jacobian matrix.
Here, we would be tempted to calculate \dot y(t) by evaluating the full Jacobian F'(x) and then multiplying by \dot x(t).
78 / 158
However
Such an approach is quite uneconomical, unless many tangents need to be calculated, as in the Newton step.
A simpler version: differentiate the first column of the table
\[ v_{i-n} = x_i, \quad i = 1,\dots,n \]
\[ v_i = \varphi_i(v_j)_{j \prec i}, \quad i = 1,\dots,l \]
\[ y_{m-i} = v_{l-i}, \quad i = m-1,\dots,0 \]
where j \prec i means v_i depends directly on v_j (the graph propagation of the dependencies).
79 / 158
Which can be seen as
Forward propagation of tangents: basically, we can think of the forward mode as a propagation of tangents.
80 / 158
The Automatic Procedure
Therefore, we have the following automatic procedure, where j \prec i means v_i depends directly on v_j and u_i = (v_j)_{j \prec i} \in R^{n_i}:
\[ v_{i-n} \equiv x_i, \quad \dot v_{i-n} \equiv \dot x_i, \quad i = 1,\dots,n \]
\[ v_i \equiv \varphi_i(v_j)_{j \prec i}, \quad \dot v_i \equiv \sum_{j \prec i} \frac{\partial \varphi_i(u_i)}{\partial v_j} \dot v_j, \quad i = 1,\dots,l \]
\[ y_{m-i} \equiv v_{l-i}, \quad \dot y_{m-i} \equiv \dot v_{l-i}, \quad i = m-1,\dots,0 \]
81 / 158
Therefore
Each elemental assignment v_i = \varphi_i(u_i) has the corresponding
\[ \dot v_i = \sum_{j \prec i} \frac{\partial \varphi_i(u_i)}{\partial v_j} \times \dot v_j = \sum_{j \prec i} c_{ij} \times \dot v_j \]
Abbreviating \dot u_i = (\dot v_j)_{j \prec i}:
\[ \dot v_i = \dot\varphi_i(u_i, \dot u_i) \equiv \varphi_i'(u_i)\, \dot u_i \]
where \dot\varphi_i : R^{2n_i} \to R is called the tangent function associated with the elemental \varphi_i.
82 / 158
Now
Question: What is the correct order of evaluation?
83 / 158
Why the question?
Until now, we have always placed the tangent statement yielding \dot v_i after the underlying value v_i. This order of calculation seems natural and certainly yields correct results as long as there is no overwriting.
Then the order of the 2l statements in the middle part of the table (the pairs v_i \equiv \varphi_i(v_j)_{j \prec i} and \dot v_i \equiv \sum_{j \prec i} \frac{\partial \varphi_i(u_i)}{\partial v_j} \dot v_j) does not matter.
84 / 158
Here, we have a big problem in cache
Imagine that we have a single block of memory, so that v_i and its arguments v_j live in the same memory cell in the cache.
[Figure: CPU registers, cache, and main memory]
85 / 158
This is known as Cache Aliasing
Definition: cache aliasing occurs when multiple mappings to a physical page of memory have conflicting caching states, such as cached and uncached; the same physical address can be mapped to multiple virtual addresses.
On ARMv4 and ARMv5 processors, the cache is organized as virtually-indexed, virtually-tagged (VIVT): cache lookups are faster because the translation look-aside buffer (TLB) is not involved in matching cache lines for a virtual address.
However, this caching method does require more frequent cache flushing because of cache aliasing.
86 / 158
Then
The value \dot v_i = \dot\varphi_i(u_i, \dot u_i) will be incorrect once we update v_i = \varphi_i(u_i).
ADIFOR and Tapenade [9, 5] therefore put the derivative statement ahead of the original assignment, updating \dot v_i before the original value is overwritten.
On the other hand, for most univariate functions v = \varphi(u) it is better to obtain the undifferentiated value first and then to use it in the tangent function \dot\varphi.
87 / 158
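A tiny sketch of the overwriting hazard (my own illustration): for v = sin(v) the tangent needs the old argument, so the derivative statement must come first; for v = exp(u), by contrast, the freshly computed value is exactly what the tangent reuses.

```python
import math

# In-place update v = sin(v): the tangent needs cos(v_old), so the
# derivative statement must run before v is overwritten.
v, dv = 0.5, 1.0

# Correct (derivative statement ahead of the assignment, as in Tapenade):
dv_ok = math.cos(v) * dv
v = math.sin(v)

# Incorrect (value overwritten first, tangent uses the new value):
w, dw = 0.5, 1.0
w = math.sin(w)
dw_bad = math.cos(w) * dw

print(dv_ok, dw_bad)   # 0.8775... vs 0.8873...: they differ
```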
In this presentation
We will list \varphi and \dot\varphi side by side in a common bracket to indicate that they should be evaluated simultaneously; then sharing results is immediate.
88 / 158
Classic Tangent Operations
We have a series of improvements on the tangent equations; the order inside each bracket indicates which part is evaluated first:
v = c: [v = c; \dot v = 0]
v = u \pm w: [v = u \pm w; \dot v = \dot u \pm \dot w]
v = u \times w: [\dot v = \dot u \times w + u \times \dot w; v = u \times w]
v = 1/u: [v = 1/u; \dot v = -v \times (v \times \dot u)]
v = u^c: [t = \dot u / u; v = u^c; \dot v = c \times v \times t]
v = \sqrt{u}: [v = \sqrt{u}; \dot v = 0.5 \times \dot u / v]
v = \exp(u): [v = \exp(u); \dot v = v \times \dot u]
v = \log(u): [\dot v = \dot u / u; v = \log(u)]
v = \sin(u): [\dot v = \cos(u) \times \dot u; v = \sin(u)]
89 / 158
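A small sketch of the bracket idea (my own function names; each bracket becomes a function returning the (v, v-dot) pair, so shared results are immediate):

```python
import math

def t_exp(u, ud):
    v = math.exp(u)          # value first: the tangent reuses it
    return v, v * ud

def t_sin(u, ud):
    return math.sin(u), math.cos(u) * ud   # derivative needs the old u

def t_pow(u, ud, c):
    t = ud / u               # t = u-dot / u, as in the table
    v = u ** c
    return v, c * v * t

print(t_exp(0.5, 1.0), t_sin(0.5, 1.0), t_pow(2.0, 1.0, 3))
```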
Outline: Automatic Differentiation > Forward Mode of a ML Perceptron
90 / 158
Now
Imagine the following network, something simple for our sake:
[Figure: a small multilayer perceptron]
91 / 158
Forward mode to get the gradient with respect to w11
Load the twelve weights into the independent variables and seed the tangent of w11:
v−11 = w11, v−10 = w12, ..., v−1 = w51, v0 = w52
˙v−11 = 1, ˙v−10 = 0, ..., ˙v0 = 0

v1 = Σ_{i=1..3} w1i xi,  ˙v1 = x1
v2 = Σ_{i=1..3} w2i xi,  ˙v2 = 0
v3 = 1/(1 + exp(−v1)),  ˙v3 = v3[1 − v3] × ˙v1
v4 = 1/(1 + exp(−v2)),  ˙v4 = 0
v5 = w31 v3 + w32 v4,  ˙v5 = w31 × ˙v3
v6 = w41 v3 + w42 v4,  ˙v6 = w41 × ˙v3
v7 = 1/(1 + exp(−v5)),  ˙v7 = v7[1 − v7] × ˙v5
v8 = 1/(1 + exp(−v6)),  ˙v8 = v8[1 − v8] × ˙v6
v9 = w51 v7 + w52 v8,  ˙v9 = w51 × ˙v7 + w52 × ˙v8
v10 = 1/(1 + exp(−v9)),  ˙v10 = v10[1 − v10] × ˙v9
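As a sketch, the same tangent sweep can be written out directly in Python; the input and weight values below are illustrative, not taken from the slides:

import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

x = [0.5, -1.0, 2.0]                       # inputs (illustrative)
w1, w2 = [0.1, 0.2, 0.3], [0.4, 0.5, 0.6]  # first hidden layer
w3, w4 = [0.7, 0.8], [0.9, 1.0]            # second hidden layer
w5 = [1.1, 1.2]                            # output neuron

# Tangent sweep with seed dw11 = 1: only v1 carries a nonzero tangent at first.
v1 = sum(w * xi for w, xi in zip(w1, x));  v1_dot = x[0]
v2 = sum(w * xi for w, xi in zip(w2, x));  v2_dot = 0.0
v3 = sigmoid(v1);                          v3_dot = v3 * (1 - v3) * v1_dot
v4 = sigmoid(v2);                          v4_dot = 0.0
v5 = w3[0] * v3 + w3[1] * v4;              v5_dot = w3[0] * v3_dot
v6 = w4[0] * v3 + w4[1] * v4;              v6_dot = w4[0] * v3_dot
v7 = sigmoid(v5);                          v7_dot = v7 * (1 - v7) * v5_dot
v8 = sigmoid(v6);                          v8_dot = v8 * (1 - v8) * v6_dot
v9 = w5[0] * v7 + w5[1] * v8;              v9_dot = w5[0] * v7_dot + w5[1] * v8_dot
v10 = sigmoid(v9);                         v10_dot = v10 * (1 - v10) * v9_dot
# v10 is the network output y; v10_dot is dy/dw11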
Complexity of the Forward Procedure
Complexity of the Procedure
Time Complexity
TIME{F(x), F′(x)˙x} ≤ ω_tan TIME{F(x)},  where ω_tan ∈ [2, 5/2]

Space Complexity
SPACE{F(x), F′(x)˙x} ≤ 2 SPACE{F(x)}
The Reverse Mode
Here, an essential observation
The cost of evaluating derivatives by propagating them forward increases linearly with the number of directions ˙x along which we want to differentiate.

It looks inevitable
But it is possible to avoid this complexity by observing that the gradient of a single dependent variable can be obtained for a fixed multiple of the cost of evaluating the underlying scalar-valued function.
We choose instead an output variable
We use the term “reverse mode” for this technique, because the label “backward differentiation” is well established [10, 11].

Therefore, for an output y = f(x1, x2)
We have for each variable v_i the adjoint variable
v̄_i = ∂y/∂v_i
Actually, this is an abuse of notation
We mean a new independent variable δ_i:
v̄_i = ∂y/∂δ_i (adjoint variable)
which can be thought of as adding a small numerical value δ_i to v_i:
v_i + δ_i → f(x1, x2) + v̄_i δ_i
as a perturbation in variational calculus.
Actually, you propagate the normal vectors
That is, ȳ and the v̄_i are normals, or cotangents.
Then, we have the following sought mapping
x̄ᵀ = ȳᵀ F′(x)

Observation
Here, ȳ is a fixed vector that plays a dual role to the domain direction ˙x. In the Forward Procedure, you compute
˙y = F′(x) ˙x ≡ Ḟ(x, ˙x)
Instead
In the Reverse Procedure, you compute
x̄ᵀ = ȳᵀ F′(x) ≡ F̄(x, ȳ)
where F and F̄ are evaluated together. Thus, we have a dual process.
Dual Process in Reverse Process
Dual Process
Here, we have that the hyperplane ȳᵀ y = c in the range of F has inverse image
{x | ȳᵀ F(x) = c}
The Implicit Function Theorem
Theorem. Let F : R^{n+m} → R^m be a continuously differentiable function, and let (x^0_1, x^0_2, ..., x^0_{m+n}) be a point with F(x^0_1, x^0_2, ..., x^0_{m+n}) = c. If ∂F(x^0_1, ..., x^0_{m+n})/∂x_{m+n} ≠ 0, then there exists a neighborhood of (x^0_1, ..., x^0_{m+n}) such that whenever (x_1, ..., x_{n+m−1}) is close enough to (x^0_1, ..., x^0_{m+n−1}), there is a unique z such that F(x_1, ..., x_{n+m−1}, z) = c. Furthermore, z = g(x_1, ..., x_{n+m−1}) is a continuous function of (x_1, ..., x_{n+m−1}).
Therefore
The set {x | ȳᵀ F(x) = c} is a smooth hypersurface with normal x̄ᵀ = ȳᵀ F′(x) at x, provided that x̄ does not vanish.
The Process
The hyperplane ȳᵀ y = c in the range of F has inverse image {x | ȳᵀ F(x) = c} (figure contrasting the Forward Procedure and the Reverse Procedure).
Therefore
When m = 1, F = f is scalar-valued, and with ȳ = 1 ∈ R we obtain the familiar gradient ∇f(x) = ȳᵀ F′(x).

Something Notable
We will look only at the main procedure of Incremental Adjoint Recursion. For the full derivation by matrix-product reversal, see the book [7]: Andreas Griewank and Andrea Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, vol. 105, SIAM, 2008.
The derivation of the reverse mode
For this, we will use the tangent recursion

v_{i−n} ≡ x_i,  ˙v_{i−n} ≡ ˙x_i,  i = 1, ..., n
v_i ≡ φ_i(v_j)_{j≺i},  ˙v_i ≡ Σ_{j≺i} (∂φ_i(u_i)/∂v_j) ˙v_j,  i = 1, ..., l
y_{m−i} ≡ v_{l−i},  ˙y_{m−i} ≡ ˙v_{l−i},  i = m − 1, ..., 0

and the identity ȳᵀ ˙y = x̄ᵀ ˙x.
Now, using the state transformations Φ_i
We map from x to y = F(x) as the composition
y = Q_m Φ_l ∘ Φ_{l−1} ∘ ··· ∘ Φ_2 ∘ Φ_1 (P_nᵀ x)

Where P_n ≡ [I, 0, ..., 0] ∈ R^{n×(n+l)} and Q_m ≡ [0, 0, ..., I] ∈ R^{m×(n+l)}
They are matrices that project an arbitrary (n + l)-vector onto its first n and last m components, respectively.
Where
The c_ij represent the elemental partials
c_ij ≡ c_ij(u_i) ≡ ∂φ_i/∂v_j  for 1 − n ≤ j, i ≤ l
Labeling the elemental partials as c_ij
We get the state Jacobian
A_i ≡ Φ′_i ∈ R^{(n+l)×(n+l)}
which is the identity matrix of order n + l, except that its (n + i)th row is replaced by
(c_{i,1−n}, c_{i,2−n}, ..., c_{i,i−1}, 0, ..., 0)
That is, the c_ij occur in the (n + i)th row of A_i.
Remarks
The square matrices A_i are lower triangular. They may also be written as rank-one perturbations of the identity:
A_i = I + e_{n+i} [∇φ_i(u_i) − e_{n+i}]ᵀ
where e_j denotes the jth Cartesian basis vector in R^{n+l}.

Differentiating the composition of functions gives
˙y = Q_m A_l A_{l−1} ··· A_2 A_1 P_nᵀ ˙x
Embeddings
The multiplication by P_nᵀ ∈ R^{(n+l)×n} embeds ˙x into R^{n+l}, corresponding to the first part of the tangent recursion.

The subsequent multiplications by the A_i generate one component ˙v_i at a time, according to the middle part.
Finally
Q_m extracts the last m components as ˙y, corresponding to the third part of the tangent recursion table:

v_{i−n} ≡ x_i,  ˙v_{i−n} ≡ ˙x_i,  i = 1, ..., n
v_i ≡ φ_i(v_j)_{j≺i},  ˙v_i ≡ Σ_{j≺i} (∂φ_i(u_i)/∂v_j) ˙v_j,  i = 1, ..., l
y_{m−i} ≡ v_{l−i},  ˙y_{m−i} ≡ ˙v_{l−i},  i = m − 1, ..., 0
Now
By comparison with
˙y(t) = ∂F(x(t))/∂t = F′(x(t)) ˙x(t)
we have in fact a product representation of the full Jacobian
F′(x) = Q_m A_l A_{l−1} ··· A_2 A_1 P_nᵀ ∈ R^{m×n}
Then
By transposing the product we obtain the adjoint relation
x̄ = P_n A_1ᵀ A_2ᵀ ··· A_{l−1}ᵀ A_lᵀ ȳ
given that
A_iᵀ = I + [∇φ_i(u_i) − e_{n+i}] e_{n+i}ᵀ
Therefore
The transformation of any vector (v̄_j)_{1−n≤j≤l} by multiplication with A_iᵀ represents an incremental operation.
In detail, one obtains for i = l, ..., 1 the operations
For all j with j ⊀ i: v̄_j is left unchanged.
For all j with j ≺ i: v̄_j is augmented by v̄_i c_ij, where
c_ij ≡ c_ij(u_i) ≡ ∂φ_i/∂v_j
Subsequently, v̄_i is set to zero.
Incremental Adjoint Recursion
Some Remarks
Using the C-style abbreviation a += b for a = a + b, we may rewrite the matrix-vector product as the adjoint evaluation procedure in the following table.
Incremental Adjoint Recursion
We have the following procedure (with u_i = (v_j)_{j≺i} ∈ R^{n_i}):

v̄_i ≡ 0,  i = 1 − n, ..., l
v_{i−n} ≡ x_i,  i = 1, ..., n
v_i ≡ φ_i(v_j)_{j≺i},  i = 1, ..., l
y_{m−i} ≡ v_{l−i},  i = m − 1, ..., 0
v̄_{l−i} ≡ ȳ_{m−i},  i = 0, ..., m − 1
v̄_j += v̄_i ∂φ_i(u_i)/∂v_j  for j ≺ i,  i = l, ..., 1
x̄_i ≡ v̄_{i−n},  i = n, ..., 1
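A minimal Python sketch of this recursion (the tape format is an assumption made for illustration): the forward sweep records, for each elemental, its index, the indices of its arguments, and the partials c_ij; the return sweep then executes the incremental assignments in reverse order.

import math

def record(x1, x2):
    # Forward sweep for y = sin(x1) * x2, with v_{-1} = x1, v_0 = x2.
    tape = []
    v1 = math.sin(x1);  tape.append((1, [-1], [math.cos(x1)]))  # c_{1,-1}
    v2 = v1 * x2;       tape.append((2, [1, 0], [x2, v1]))      # c_{2,1}, c_{2,0}
    return v2, tape

def reverse_sweep(tape, n):
    vbar = {len(tape): 1.0}           # seed the single output: y_bar = 1
    for i, js, cs in reversed(tape):  # i = l, ..., 1
        for j, c in zip(js, cs):
            vbar[j] = vbar.get(j, 0.0) + vbar.get(i, 0.0) * c   # vbar_j += vbar_i * c_ij
    return [vbar.get(i - n, 0.0) for i in range(1, n + 1)]      # x_bar

y, tape = record(1.0, 2.0)
grad = reverse_sweep(tape, 2)   # [x2 * cos(x1), sin(x1)]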
Explanation
It is assumed as a precondition that the adjoint quantities v̄_i for 1 ≤ i ≤ l have been initialized to zero.

As indicated by the range specification i = l, ..., 1, we think of the incremental assignments as being executed in reverse order, i.e., for i = l, l − 1, l − 2, ..., 1.

Only then is it guaranteed that each v̄_i will reach its full value before it occurs on the right-hand side.
Furthermore
We can combine the incremental operations effected by the adjoint of φ_i into
ū_i += v̄_i · ∇φ_i(u_i),  where u_i ≡ (v_j)_{j≺i} ∈ R^{n_i}

Something Remarkable
One can instead directly compute the value of the adjoint quantity v̄_j by collecting all contributions to it as a sum ranging over all successors i ≻ j.

This non-incremental form
Requires global information that is not easy to come by.
Complexity
Something Notable
TIME{F(x), ȳᵀ F′(x)} ≤ ω_grad TIME{F(x)}
where ω_grad ∈ [3, 4] (the cheap gradient principle).
Remember
Time Complexity
TIME{F(x), F′(x)˙x} ≤ ω_tan TIME{F(x)}
where ω_tan ∈ [2, 5/2].
Example
Example: a single-layer perceptron
We have
y = σ(Σ_{i=1..3} w_i x_i)
First Phase: Forward Step
v−2 = w1
v−1 = w2
v0 = w3
v1 = x1 v−2
v2 = x2 v−1
v3 = x3 v0
v4 = v1 + v2 + v3
v5 = σ(v4)
y1 = v5
Second Phase: Incremental Return
Alongside the forward step above, the return sweep runs (seeded with ȳ1 = 1):

v̄5 = ȳ1 = 1
v̄4 = (∂v5/∂v4) v̄5 = σ′(v4)
v̄3 += (∂v4/∂v3) v̄4 = 1 × σ′(v4)
v̄0 = (∂v3/∂v0) v̄3 = x3 × σ′(v4)
v̄2 += (∂v4/∂v2) v̄4 = 1 × σ′(v4)
v̄−1 = (∂v2/∂v−1) v̄2 = x2 × σ′(v4)
v̄1 += (∂v4/∂v1) v̄4 = 1 × σ′(v4)
v̄−2 = (∂v1/∂v−2) v̄1 = x1 × σ′(v4)

so that
w̄3 = x3 × σ′(v4),  w̄2 = x2 × σ′(v4),  w̄1 = x1 × σ′(v4)
How does it compare with the Forward Mode?
Notice that forward mode needs one full sweep per gradient variable; for example, the sweep for w3, seeded with ˙w3 = 1 (see the sketch after the table):

Forward Step; Tangent of the Forward Step
v−2 = w1;  ˙v−2 = ˙w1 = 0
v−1 = w2;  ˙v−1 = ˙w2 = 0
v0 = w3;  ˙v0 = ˙w3 = 1
v1 = x1 v−2;  ˙v1 = x1 ˙v−2 = 0
v2 = x2 v−1;  ˙v2 = x2 ˙v−1 = 0
v3 = x3 v0;  ˙v3 = x3 ˙v0 = x3
v4 = v1 + v2 + v3;  ˙v4 = ˙v1 + ˙v2 + ˙v3 = x3
v5 = σ(v4);  ˙v5 = σ′(v4) ˙v4 = x3 × σ′(v4)
y1 = v5;  ˙y1 = ˙v5
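The contrast is easy to check numerically: one reverse sweep yields the whole gradient of the perceptron, while forward mode needs one tangent sweep per weight. A minimal sketch with illustrative values (sigma_prime stands for σ′):

import math

def sigma(t):        return 1.0 / (1.0 + math.exp(-t))
def sigma_prime(t):  return sigma(t) * (1.0 - sigma(t))

x, w = [0.5, -1.0, 2.0], [0.1, 0.2, 0.3]   # illustrative values
v4 = sum(wi * xi for wi, xi in zip(w, x))

# Reverse mode: a single return sweep gives all three derivatives.
grad_reverse = [xi * sigma_prime(v4) for xi in x]

# Forward mode: one tangent sweep per weight, seeded with w_k_dot = 1.
def forward_sweep(k):
    v4_dot = x[k]                  # only v_k = x_k * w_k has a nonzero tangent
    return sigma_prime(v4) * v4_dot

grad_forward = [forward_sweep(k) for k in range(3)]   # three sweeps
assert all(abs(a - b) < 1e-12 for a, b in zip(grad_reverse, grad_forward))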
What Method to Use: Forward or Reverse Mode?
Let us look at the following example
We have the following system of equations:
y1 = σ(w1 x)
y2 = σ(w2 x)
y3 = σ(w3 x)
With the following graph
Notice the difference with a neural network.
The Forward mode looks like
We have that (seeding ˙x = 1):

v0 = x;  ˙v0 = ˙x = 1
v1 = w1 v0;  ˙v1 = w1 ˙v0 = w1
v2 = w2 v0;  ˙v2 = w2 ˙v0 = w2
v3 = w3 v0;  ˙v3 = w3 ˙v0 = w3
v4 = σ(v1);  ˙v4 = σ′(v1) × ˙v1 = w1 × σ′(v1)
v5 = σ(v2);  ˙v5 = σ′(v2) × ˙v2 = w2 × σ′(v2)
v6 = σ(v3);  ˙v6 = σ′(v3) × ˙v3 = w3 × σ′(v3)
y1 = v4;  ˙y1 = ˙v4
y2 = v5;  ˙y2 = ˙v5
y3 = v6;  ˙y3 = ˙v6

Since there is a single input, one forward sweep yields all three derivatives dy_i/dx.
Now you can see it
Forward and Reverse Mode: their costs depend on the input and output sizes!!!

A More Formal Statement
For a function f : R^n → R^m, suppose we wish to compute all the elements of the m × n Jacobian matrix, ignoring the overhead of building the expression graph.

Under this situation
Reverse mode requires m sweeps and performs better when n > m.
Consequences for Deep Learning
With a relatively small overhead, the performance of reverse-mode AD is superior when n ≫ m, that is, when we have many inputs and few outputs.

As we saw in the previous examples
If n ≤ m, forward mode performs better.
Special Cases
Nevertheless, when we have a comparable number of outputs and inputs, forward mode can be more efficient, since there is less overhead associated with storing the expression graph in memory in forward mode.

For Example
If you have f : R^n → R, when n = 1 forward mode is more efficient, but the result flips as n increases.
Basic Implementation of Automatic Differentiation: Source Transformation and Overloading
Be Aware
Be careful: a computationally naive implementation of AD can result in slow code and excessive use of memory.

Additionally
There exists no standard set of problems spanning the diversity of AD applications.
Source Transformation in Fortran and C
First
We start with the source code of a computer program that implements our target function.

Second
A preprocessor then applies differentiation rules to the code, generating new source code which calculates the derivatives.

Remember our basic tables.
Example of how source code transformation could work
We could have the following pipeline (see the sketch below):
function.c → AD Tool → diff_function.c → C compiler → diff_function.o
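As an illustration of what such a preprocessor might emit (hypothetical output, written in Python for brevity rather than actual ADIFOR or Tapenade code), note how each derivative statement is placed ahead of the assignment that overwrites v:

import math

# Original source
def f(x):
    v = x * x
    v = math.sin(v)        # v is overwritten
    return v

# Generated tangent source: every derivative statement precedes the
# assignment that overwrites its operand, so it still sees the old v.
def f_tangent(x, x_dot):
    v_dot = 2.0 * x * x_dot
    v = x * x
    v_dot = math.cos(v) * v_dot   # uses v before it is overwritten
    v = math.sin(v)
    return v, v_dot               # f(x) and f'(x) * x_dot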
Limitations of Source Transformation
Source transformation has severe limitations: it can only use information available at compile time, so it cannot handle more sophisticated programming constructs, such as while loops, C++ templates, and other object-oriented features.

For this, operator overloading is the appropriate technology.
Operator Overloading
The key idea is to introduce a new class of objects containing the value of a variable on the expression graph and a differential component.

Not all variables on the expression graph will belong to this class, only the root variables, which require sensitivities, and all the intermediate variables.

In a forward-mode framework, the differential component is the derivative with respect to one input.
Furthermore
In a reverse-mode framework, it is the adjoint with respect to one output.

Something Notable
Operators and math functions are overloaded to handle these new types.
Basically
The operators are overloaded so that it is possible to handle dual numbers under their new arithmetic (Forward Mode), as in the sketch below.
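A minimal dual-number class in Python (illustrative only; production AD tools are far more complete):

import math

class Dual:
    # A value together with one tangent component (forward mode).
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def dsin(u):
    # sin lifted to dual numbers
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

x = Dual(1.5, 1.0)          # seed the tangent of x with 1
y = x * dsin(x) + 3 * x     # d/dx [x sin(x) + 3x] = sin(x) + x cos(x) + 3
print(y.val, y.dot)         # value and derivative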
Building the Computational Graph
Building a Computational Graph
Parse the equations
y1 = σ(w1 x), y2 = σ(w2 x), y3 = σ(w3 x)

Generate variables for the intermediate values
If v0 = x, then V = V ∪ {v0}

Generate edges between the intermediate values
If v1 = w1 v0, then E = E ∪ {(v0, v1)}
Topological Sort for Evaluation
Run a topological sort on the graph built in the previous step; the resulting order is a valid order of evaluation (a sketch follows).
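A sketch of both steps under the same illustrative conventions: build the graph for y_i = σ(w_i x) and then order its nodes with Kahn's algorithm.

from collections import defaultdict, deque

# Nodes: v0 = x; v1..v3 = w_i * v0; v4..v6 = sigma(v_i)
V = {"v0"}
E = set()
for i in (1, 2, 3):
    V |= {f"v{i}", f"v{i + 3}"}
    E.add(("v0", f"v{i}"))          # edge for v_i = w_i * v0
    E.add((f"v{i}", f"v{i + 3}"))   # edge for v_{i+3} = sigma(v_i)

def topological_sort(V, E):
    indeg = {v: 0 for v in V}
    succ = defaultdict(list)
    for u, v in E:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in V if indeg[v] == 0)  # sources first
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order   # a valid forward-evaluation order

print(topological_sort(V, E))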
Memory Structures
For the Reverse Mode
We could use a stack: as assignments occur during the forward sweep of the process,

v̄_i ≡ 0,  i = 1 − n, ..., l
v_{i−n} ≡ x_i,  i = 1, ..., n
v_i ≡ φ_i(v_j)_{j≺i},  i = 1, ..., l
y_{m−i} ≡ v_{l−i},  i = m − 1, ..., 0

we push the intermediate values onto the stack, and then pop the necessary elements during the return sweep (see the sketch below).
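A minimal sketch of the idea for y = sin(x^2), assuming values about to be overwritten are pushed during the forward sweep and popped during the return sweep:

import math

stack = []

def forward(x):
    v = x * x
    stack.append(v)      # save the value before it is overwritten
    v = math.sin(v)
    return v

def return_sweep(x, y_bar=1.0):
    v_old = stack.pop()                # restore x*x for the local partial
    v_bar = math.cos(v_old) * y_bar    # adjoint of v = sin(v_old)
    return 2.0 * x * v_bar             # adjoint of x, i.e. 2x cos(x^2)

y = forward(2.0)
x_bar = return_sweep(2.0)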
Way More...
Nevertheless
There are many techniques to improve the efficiency and avoid the aliasing problems of these modes [12]:
1 Taping for adjoint recursion
2 Caching
3 Checkpointing
4 Expression templates
5 etc.

You are invited to read more about them
These techniques are already being implemented in languages such as Swift: “First-Class Automatic Differentiation in Swift: A Manifesto”, https://gist.github.com/rxwei/30ba75ce092ab3b0dce4bde1fc2c9f1d
Conclusions: The Problem of Backpropagation
Between Two Extremes
Something Notable
Forward and reverse accumulation are just two (extreme) ways of traversing the chain rule.

The problem of computing a full Jacobian of f : R^n → R^m with a minimum number of arithmetic operations is known as the Optimal Jacobian Accumulation (OJA) problem, and it is NP-complete [13].
Finally
Using all the previous ideas:
The graph structure proposed in [2]
The computational graph of AD
The forward and reverse methods

It has been possible to develop the Deep Learning frameworks:
TensorFlow
Torch
PyTorch
Keras
etc.
References
[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” tech. rep., California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[2] R. Rojas, Neural Networks: A Systematic Introduction. Springer Science & Business Media, 1996.
[3] R. Collobert, S. Bengio, and J. Mariéthoz, “Torch: a modular machine learning software library,” Idiap-RR Idiap-RR-46-2002, IDIAP, 2002.
[4] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind, “Automatic differentiation in machine learning: a survey,” Journal of Machine Learning Research, vol. 18, no. 153, 2018.
[5] C. H. Bischof, A. Carle, P. Khademi, and A. Mauer, “ADIFOR 2.0: Automatic differentiation of Fortran 77 programs,” IEEE Computational Science & Engineering, vol. 3, no. 3, pp. 18–32, 1996.
[6] C. Elliott, “The simple essence of automatic differentiation,” Proceedings of the ACM on Programming Languages, vol. 2, no. ICFP, p. 70, 2018.
[7] A. Griewank and A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, vol. 105. SIAM, 2008.
[8] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[9] L. Hascoët and V. Pascual, “The Tapenade automatic differentiation tool: Principles, model, and specification,” ACM Transactions on Mathematical Software, vol. 39, no. 3, pp. 20:1–20:43, 2013.
[10] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural Networks: Tricks of the Trade, pp. 9–48, Springer, 2012.
[11] R. Alexander, “Solving ordinary differential equations I: Nonstiff problems (E. Hairer, S. P. Nørsett, and G. Wanner),” SIAM Review, vol. 32, no. 3, p. 485, 1990.
[12] C. C. Margossian, “A review of automatic differentiation and its efficient implementation,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 9, no. 4, p. e1305, 2019.
[13] U. Naumann, “Optimal Jacobian accumulation is NP-complete,” Mathematical Programming, vol. 112, no. 2, pp. 427–441, 2008.