Backpropagation - Elisa Sayrol - UPC Barcelona 2018

[course site]
Day 2 Lecture 2
Backpropagation
Elisa Sayrol

Acknowledgements
Kevin McGuinness
kevin.mcguinness@dcu.ie
Research Fellow
Insight Centre for Data Analytics
Dublin City University
Xavier Giro-i-Nieto
xavier.giro@upc.edu

Multilayer perceptrons
When each node in each layer computes a linear
combination of all inputs from the previous
layer, followed by a non-linearity, the network is called a multilayer
perceptron (MLP).
Weights can be organized into matrices.
Forward pass computes the network output for a given input x.

Training MLPs
With multilayer perceptrons we need to find the gradient of the loss function with respect to all the
parameters of the model (W(t), b(t)).
These can be found using the chain rule of differentiation.
The calculations reveal that the gradient with respect to the parameters in layer k depends only on the error from the
layer above and the output from the layer below.
This means that the gradients for each layer can be computed iteratively, starting at the last layer and
propagating the error back through the network. This is known as the backpropagation algorithm.

Backpropagation algorithm
• Computational graphs
• Examples applying the chain rule in simple graphs
• Backpropagation applied to the multilayer perceptron
• Another perspective: modular backprop

Computational graphs
[Figure: four example computational graphs]
(a) $z = xy$
(b) $\hat{y} = \sigma(x^\top w + b)$
(c) $H = \max\{0, XW + b\}$ (matmul, then +, then relu)
(d) $\hat{y} = x^\top w$, with the weight decay penalty $\lambda \sum_i w_i^2$
From Deep Learning Book

Applying the Chain Rule to Computational Graphs
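$y = g(x)$, $z = f(g(x)) = f(y)$
$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$
With two inputs $x_1$ and $x_2$ (graph: $x_1, x_2 \to g \to y \to f \to z$):
$\frac{\partial z}{\partial x_1} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x_1}$
$\frac{\partial z}{\partial x_2} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x_2}$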
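As a minimal sketch of the chain rule on a graph where two inputs x1 and x2 feed y = g(x1, x2) and then z = f(y), the snippet below picks two hypothetical functions, g(x1, x2) = x1^2 + 3*x2 and f(y) = sin(y) (neither is taken from the slides), and compares the chain-rule gradients with a central finite-difference estimate.

```python
import math

def g(x1, x2):
    # Hypothetical inner function: y = x1^2 + 3*x2
    return x1 ** 2 + 3.0 * x2

def f(y):
    # Hypothetical outer function: z = sin(y)
    return math.sin(y)

def chain_rule_grads(x1, x2):
    y = g(x1, x2)
    dz_dy = math.cos(y)   # local gradient of f
    dy_dx1 = 2.0 * x1     # local gradients of g
    dy_dx2 = 3.0
    # dz/dx_i = dz/dy * dy/dx_i
    return dz_dy * dy_dx1, dz_dy * dy_dx2

def numeric_grads(x1, x2, eps=1e-6):
    # Central finite differences, used only as a sanity check
    d1 = (f(g(x1 + eps, x2)) - f(g(x1 - eps, x2))) / (2 * eps)
    d2 = (f(g(x1, x2 + eps)) - f(g(x1, x2 - eps))) / (2 * eps)
    return d1, d2

analytic = chain_rule_grads(0.5, -1.0)
numeric = numeric_grads(0.5, -1.0)
print(analytic, numeric)
```

The two pairs should agree to several decimal places; a mismatch usually points to a wrong local gradient.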

From computational graphs to MLP
Numerical Examples
From Stanford Course: Convolutional Neural Networks for Visual Recognition 2017
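$f(x, y, z) = (x + y)z$
[Figure: computational graph with inputs x = −2, y = 5, z = −4; the + node outputs q = 3 and the × node outputs f = −12]
$q = x + y$: $\frac{\partial q}{\partial x} = 1$, $\frac{\partial q}{\partial y} = 1$
$f = qz$: $\frac{\partial f}{\partial q} = z$, $\frac{\partial f}{\partial z} = q$
We want to compute: $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$, $\frac{\partial f}{\partial z}$
Example: $x = -2$, $y = 5$, $z = -4$
$\frac{\partial f}{\partial f} = 1$
$\frac{\partial f}{\partial q} = z = -4$
$\frac{\partial f}{\partial z} = q = 3$
$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} = -4 \cdot 1 = -4$
$\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial y} = -4 \cdot 1 = -4$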
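The same example, f(x, y, z) = (x + y)z with x = −2, y = 5, z = −4, can be sketched in a few lines of Python. The function name `forward_backward` is only illustrative; the backward part walks the graph in reverse, multiplying each local gradient by the gradient arriving from above.

```python
def forward_backward(x, y, z):
    # Forward pass
    q = x + y
    f = q * z
    # Backward pass, starting from df/df = 1
    df_df = 1.0
    df_dq = z * df_df      # product gate: gradient is the other input
    df_dz = q * df_df
    df_dx = 1.0 * df_dq    # sum gate: local gradient is 1
    df_dy = 1.0 * df_dq
    return f, (df_dx, df_dy, df_dz)

f, grads = forward_backward(-2.0, 5.0, -4.0)
print(f, grads)  # -12.0 (-4.0, -4.0, 3.0)
```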

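From computational graphs to MLP
Numerical Examples
$f(w, x, b) = \sigma(w_0 x_0 + w_1 x_1 + b)$
$\sigma(x) = \frac{1}{1 + e^{-x}}$
$\frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \left(\frac{1 + e^{-x} - 1}{1 + e^{-x}}\right)\left(\frac{1}{1 + e^{-x}}\right) = (1 - \sigma(x))\,\sigma(x)$
[Figure: computational graph of the neuron, with two × nodes, two + nodes and a sigmoid node s]
Forward pass: $w_0 = 2$, $x_0 = -1$, $w_1 = -3$, $x_1 = -2$, $b = -3$, so $w_0 x_0 = -2$, $w_1 x_1 = 6$, their sum is $4$, adding $b$ gives $1$, and $\sigma(1) = 0.73$.
Backward pass: $\frac{\partial f}{\partial f} = 1$; the gradient at the input of the sigmoid is $(1 - 0.73)(0.73) \approx 0.2$, and the sum gates pass it on unchanged.
$\frac{\partial f}{\partial w_0} = -0.2$, $\frac{\partial f}{\partial x_0} = 0.4$, $\frac{\partial f}{\partial w_1} = -0.4$, $\frac{\partial f}{\partial x_1} = -0.6$, $\frac{\partial f}{\partial b} = 0.2$
From Stanford Course: Convolutional Neural Networks for Visual Recognition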
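A short sketch of the sigmoid neuron example, f = σ(w0·x0 + w1·x1 + b) with w0 = 2, x0 = −1, w1 = −3, x1 = −2, b = −3. The function and dictionary key names are illustrative only. The sigmoid is treated as a single gate whose local gradient is (1 − σ)σ.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron_forward_backward(w0, x0, w1, x1, b):
    # Forward pass
    a = w0 * x0 + w1 * x1 + b
    f = sigmoid(a)
    # Backward pass: gradient at the input of the sigmoid gate
    df_da = (1.0 - f) * f
    # Sum gates pass df_da on; product gates multiply by the other input
    grads = {
        "w0": x0 * df_da,
        "x0": w0 * df_da,
        "w1": x1 * df_da,
        "x1": w1 * df_da,
        "b": df_da,
    }
    return f, grads

f, grads = neuron_forward_backward(2.0, -1.0, -3.0, -2.0, -3.0)
print(round(f, 2), {k: round(v, 1) for k, v in grads.items()})
```

Rounded to one decimal, the gradients match the values read off the graph: −0.2, 0.4, −0.4, −0.6 and 0.2.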

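Gates. Backward Pass
$\sigma(x) = \frac{1}{1 + e^{-x}}$, $\frac{d\sigma(x)}{dx} = (1 - \sigma(x))\,\sigma(x)$
$q = x + y$, $\frac{\partial q}{\partial x} = 1$, $\frac{\partial q}{\partial y} = 1$
$f = qz$, $\frac{\partial f}{\partial q} = z$, $\frac{\partial f}{\partial z} = q$
In general: a gate multiplies the incoming gradient by the derivative of its function.
Sum: distributes the gradient unchanged to both branches.
Product: switches the input values, so each input receives the incoming gradient multiplied by the other input.
Max: routes the gradient only to the branch with the higher input (the output is not sensitive to the lower branch).
Add branches: branches that split in the forward pass merge in the backward pass, where their gradients are added.
[Figure: numerical examples of gradients flowing through max and + gates]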
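The gate rules for sum, product and max can be sketched as three small backward functions. The names are illustrative, and sending the gradient to the first input when the two max inputs are equal is an arbitrary choice made here, not something the slides specify.

```python
def add_backward(upstream):
    # Sum gate: both inputs receive the upstream gradient unchanged
    return upstream, upstream

def mul_backward(a, b, upstream):
    # Product gate: each input receives upstream times the other input
    return b * upstream, a * upstream

def max_backward(a, b, upstream):
    # Max gate: only the larger input receives the gradient
    # (ties go to the first input, an arbitrary convention)
    if a >= b:
        return upstream, 0.0
    return 0.0, upstream

print(add_backward(0.2))            # (0.2, 0.2)
print(mul_backward(3.0, -4.0, 1.0)) # (-4.0, 3.0)
print(max_backward(2.0, 1.0, 0.2))  # (0.2, 0.0)
```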

Numerical Examples
From Stanford Course: Convolutional Neural Networks for Visual Recognition 2017
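$f(x, W) = \lVert W \cdot x \rVert^2 = \sum_{i=1}^{n} (W \cdot x)_i^2 = \sum_{i=1}^{n} q_i^2$, with $q = W \cdot x$
[Figure: computational graph with a × node computing q = W·x followed by an L2 node computing f]
$W = \begin{bmatrix} 0.1 & 0.5 \\ -0.3 & 0.8 \end{bmatrix}$, $x = \begin{bmatrix} 0.2 \\ 0.4 \end{bmatrix}$, $q = \begin{bmatrix} 0.22 \\ 0.26 \end{bmatrix}$, $f = 0.116$
$\frac{\partial f}{\partial q_i} = 2 q_i$, so $\nabla_q f = 2q = \begin{bmatrix} 0.44 \\ 0.52 \end{bmatrix}$
$\nabla_W f = 2q \cdot x^\top = \begin{bmatrix} 0.088 & 0.176 \\ 0.104 & 0.208 \end{bmatrix}$
$\nabla_x f = 2 W^\top \cdot q = \begin{bmatrix} -0.112 \\ 0.636 \end{bmatrix}$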
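The matrix example, f(x, W) = ||W·x||² with W = [[0.1, 0.5], [−0.3, 0.8]] and x = [0.2, 0.4], can be checked with a short sketch. Plain Python lists are used so that the block has no dependencies; `matvec` and `forward_backward` are illustrative helper names.

```python
def matvec(W, v):
    # Matrix-vector product for a matrix stored as a list of rows
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def forward_backward(W, x):
    # Forward pass: q = W x, f = sum_i q_i^2
    q = matvec(W, x)
    f = sum(q_i ** 2 for q_i in q)
    # Backward pass
    grad_q = [2.0 * q_i for q_i in q]                      # 2q
    grad_W = [[g_i * x_j for x_j in x] for g_i in grad_q]  # 2q x^T
    W_T = [list(col) for col in zip(*W)]
    grad_x = matvec(W_T, grad_q)                           # 2 W^T q
    return f, grad_q, grad_W, grad_x

W = [[0.1, 0.5], [-0.3, 0.8]]
x = [0.2, 0.4]
f, grad_q, grad_W, grad_x = forward_backward(W, x)
print(round(f, 3), [round(g, 3) for g in grad_x])
```

Each gradient has the same shape as the variable it refers to, which is a useful check when deriving vectorized gradients by hand.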

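Backpropagation applied to an element of the MLP
For a single neuron with its linear and non-linear part
[Figure: a neuron taking $h_k$ as input, computing the linear part $a_{k+1}$ and applying the non-linearity $g(\cdot)$ to produce $h_{k+1}$]
$h_{k+1} = g(W_k h_k + b_k) = g(a_{k+1})$
$\frac{\partial a_{k+1}}{\partial h_k} = W_k^\top$
$\frac{\partial h_{k+1}}{\partial a_{k+1}} = g'(a_{k+1})$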
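A minimal sketch of these two local derivatives for one layer h_{k+1} = g(W_k h_k + b_k), assuming g is the sigmoid (the slide leaves g generic). `layer_forward` and `layer_backward` are illustrative names; the backward function takes the gradient of the loss with respect to h_{k+1} and returns the gradients with respect to a_{k+1} and h_k.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer_forward(W, b, h):
    # a_{k+1} = W_k h_k + b_k, h_{k+1} = g(a_{k+1})
    a = [sum(w * h_j for w, h_j in zip(row, h)) + b_i for row, b_i in zip(W, b)]
    return a, [sigmoid(a_i) for a_i in a]

def layer_backward(W, a, grad_h_next):
    # dL/da_{k+1} = dL/dh_{k+1} * g'(a_{k+1}), elementwise
    grad_a = [g * sigmoid(a_i) * (1.0 - sigmoid(a_i))
              for g, a_i in zip(grad_h_next, a)]
    # dL/dh_k = W_k^T dL/da_{k+1}
    grad_h = [sum(W[i][j] * grad_a[i] for i in range(len(W)))
              for j in range(len(W[0]))]
    return grad_a, grad_h

W, b, h = [[2.0, -3.0]], [-3.0], [-1.0, -2.0]
a, h_next = layer_forward(W, b, h)
grad_a, grad_h = layer_backward(W, a, [1.0])
print(a, h_next, grad_a, grad_h)
```

With these values the layer reduces to the sigmoid neuron from the earlier numerical example.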

Backpropagation applied to Multilayer Perceptron

[Figure: MLP with two hidden layers and an output layer. Forward pass: Input x → W1 → a2 → h2 → W2 → a3 → h3 → W3 → a4 → h4 → Loss. Backward pass: the loss L is propagated through the same network in the reverse direction.]
Backpropagation is applied during the backward pass.

Forward Pass
• Output: probability of each class given an input (softmax)
• Loss function, e.g. negative log-likelihood (good for classification)
• Regularization term (L2 norm), also known as weight decay
• Minimize the loss (plus some regularization term) w.r.t. the parameters over the whole training set
[Figure: MLP diagram, Input x → W1 → a2 → h2 → W2 → a3 → h3 → W3 → a4 → h4 → Loss]
Figure credit: Kevin McGuinness
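The three ingredients of the forward pass (softmax output, negative log-likelihood loss, L2 regularization) can be sketched as follows. The function names are illustrative, and the penalty is written as λ times the sum of squared weights; some formulations use λ/2 instead, so the scaling is an assumption.

```python
import math

def softmax(a):
    # Subtract the maximum for numerical stability; the result is unchanged
    m = max(a)
    exps = [math.exp(a_i - m) for a_i in a]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(probs, target):
    # Negative log-likelihood of the correct class
    return -math.log(probs[target])

def l2_penalty(weights, lam):
    # weights: list of matrices, each stored as a list of rows
    return lam * sum(w ** 2 for W in weights for row in W for w in row)

probs = softmax([1.0, 2.0, 3.0])
W1 = [[0.1, 0.5], [-0.3, 0.8]]
total_loss = nll_loss(probs, target=2) + l2_penalty([W1], lam=0.01)
print(probs, total_loss)
```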

Backward Pass
1. Find the error in the top layer
2. Compute weight updates
3. Backpropagate error to layer below
To simplify, we don't consider the biases.
[Figure: MLP diagram with the loss L propagated backwards from h4 towards the input]
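The three backward-pass steps can be sketched for a small MLP without biases. Several choices here are assumptions made for illustration and are not stated on the slides: hidden layers use the sigmoid, the output uses softmax with negative log-likelihood (so the top-layer error is h − onehot(target)), and all function names are hypothetical.

```python
import math

def matvec(W, v):
    return [sum(w * v_j for w, v_j in zip(row, v)) for row in W]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def softmax(a):
    m = max(a)
    exps = [math.exp(a_i - m) for a_i in a]
    s = sum(exps)
    return [e / s for e in exps]

def forward(weights, x):
    # hs[k] is the input to weights[k]; the last entry is the softmax output
    hs = [x]
    for k, W in enumerate(weights):
        a = matvec(W, hs[-1])
        is_last = (k == len(weights) - 1)
        hs.append(softmax(a) if is_last else [sigmoid(a_i) for a_i in a])
    return hs

def loss(weights, x, target):
    return -math.log(forward(weights, x)[-1][target])

def backward(weights, x, target):
    hs = forward(weights, x)
    # 1. Error in the top layer (softmax + NLL): dL/da = h - onehot(target)
    delta = [h_i - (1.0 if i == target else 0.0) for i, h_i in enumerate(hs[-1])]
    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        h_below = hs[k]
        # 2. Weight gradient: error times the output of the layer below
        grads[k] = [[d_i * h_j for h_j in h_below] for d_i in delta]
        if k > 0:
            # 3. Backpropagate: W^T delta, times the sigmoid derivative h(1 - h)
            W = weights[k]
            back = [sum(W[i][j] * delta[i] for i in range(len(W)))
                    for j in range(len(h_below))]
            delta = [b_j * h_j * (1.0 - h_j) for b_j, h_j in zip(back, h_below)]
    return grads

weights = [[[0.1, 0.5], [-0.3, 0.8]], [[0.2, -0.4], [0.7, 0.1]]]
grads = backward(weights, [0.2, 0.4], target=1)
print(grads)
```

The gradient for each layer uses only the error coming from above and the output of the layer below, which is why a single backward sweep is enough.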

Another perspective: Modular backprop
You could use the chain rule on all the individual neurons to compute the
gradients with respect to the parameters and backpropagate the error signal.
It is more useful to work with the layer abstraction,
and then define the backpropagation algorithm in terms of three operations that layers
need to be able to perform.
This is called modular backpropagation.

The layer abstraction
* See change of notation

Modular backprop
Using this idea, it is possible to create
many types of layers
● Linear (fully connected layers)
● Activation functions (sigmoid, ReLU)
● Convolutions
● Pooling
● Dropout
Once layers support the backward and
forward operations, they can be plugged
together to create more complex functions
[Figure: a stack of Convolution, ReLU and Linear layers. Forward: Input → Convolution → ReLU → Linear → Output. Backward: Error (L+1) → Gradients → Error (L).]
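A minimal sketch of modular backprop, assuming the three layer operations are forward, backward and a parameter update (the slide text does not name them, so this is an assumption). The class names `Linear`, `ReLU` and `Sequential` are illustrative and do not refer to any particular library.

```python
class Linear:
    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return [sum(w * x_j for w, x_j in zip(row, x)) + b_i
                for row, b_i in zip(self.W, self.b)]

    def backward(self, grad_out):
        # Gradients w.r.t. the parameters, then the error for the layer below
        self.grad_W = [[g * x_j for x_j in self.x] for g in grad_out]
        self.grad_b = list(grad_out)
        return [sum(self.W[i][j] * grad_out[i] for i in range(len(self.W)))
                for j in range(len(self.x))]

    def update(self, lr):
        self.W = [[w - lr * g for w, g in zip(row, g_row)]
                  for row, g_row in zip(self.W, self.grad_W)]
        self.b = [b_i - lr * g for b_i, g in zip(self.b, self.grad_b)]

class ReLU:
    def forward(self, x):
        self.x = x
        return [max(0.0, v) for v in x]

    def backward(self, grad_out):
        return [g if v > 0 else 0.0 for g, v in zip(grad_out, self.x)]

    def update(self, lr):
        pass  # no parameters

class Sequential:
    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, grad):
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
        return grad

    def update(self, lr):
        for layer in self.layers:
            layer.update(lr)

net = Sequential([Linear([[0.1, 0.5], [-0.3, 0.8]], [0.0, 0.0]), ReLU()])
out = net.forward([0.2, 0.4])
grad_in = net.backward([2.0 * o for o in out])  # gradient of sum(out_i^2)
print(out, grad_in)
```

Because every layer exposes the same interface, new layer types can be added without touching the code that chains them together.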

Implementation notes
Caffe and Torch
Libraries like Caffe and Torch implement
backpropagation this way.
To define a new layer, you need to create
a class and define the forward and
backward operations.
Theano and TensorFlow
Libraries like Theano and TensorFlow
operate on a computational graph.
To define a new layer, you only need to
specify the forward operation. Autodiff is
used to automatically infer backward.
You also don't need to implement
backprop manually in Theano or
TensorFlow. They use computational graph
optimizations to automatically factor out
common computations.
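To illustrate the idea of inferring the backward pass from a computational graph, here is a toy reverse-mode autodiff sketch in plain Python. It is not how Theano or TensorFlow are implemented; the `Var` class and `backward` function are hypothetical and support only addition and multiplication.

```python
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (parent Var, local gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def backward(output):
    # Build a topological order so each node is processed after its consumers
    order, seen = [], set()

    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            order.append(v)

    visit(output)
    output.grad = 1.0
    for v in reversed(order):
        for parent, local in v.parents:
            # Chain rule; gradients from different branches are added
            parent.grad += local * v.grad

x, y, z = Var(-2.0), Var(5.0), Var(-4.0)
f = (x + y) * z
backward(f)
print(f.value, x.grad, y.grad, z.grad)  # -12.0 -4.0 -4.0 3.0
```

Only the forward expression `(x + y) * z` is written by the user; the gradients follow from the recorded graph.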

Issues on Backpropagation and Training
Gradient Descent: Move each parameter $\theta_j$ in small steps in the direction opposite to the sign of the
derivative of the loss with respect to $\theta_j$.
$\theta^{(t)} = \theta^{(t-1)} - \alpha^{(t-1)} \cdot \nabla_{\theta} \mathcal{L}(y, f(x)) - \lambda\, \theta^{(t-1)}$
Stochastic gradient descent (SGD): estimate the gradient with one sample or, better, with a
minibatch of examples.
Weight Decay: regularization term that penalizes large weights and distributes values among all the
parameters.
Momentum: the direction in which the parameters move averages the current gradient estimate with
previous ones.
Several strategies have been proposed to update the weights: these are the optimizers.
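A minimal sketch of the update with weight decay, new θ = θ − α·∇L − λ·θ, plus one common momentum variant. The momentum form (an exponential moving average of gradients with coefficient `beta`) is an assumption, since the slides describe momentum only in words; the function names and the toy loss are illustrative.

```python
def sgd_step(theta, grad, lr, weight_decay):
    # theta(t) = theta(t-1) - lr * grad - weight_decay * theta(t-1)
    return [t - lr * g - weight_decay * t for t, g in zip(theta, grad)]

def momentum_step(theta, grad, velocity, lr, beta=0.9):
    # Assumed variant: the step direction is a running average of gradients
    velocity = [beta * v + (1.0 - beta) * g for v, g in zip(velocity, grad)]
    theta = [t - lr * v for t, v in zip(theta, velocity)]
    return theta, velocity

theta = [1.0, -2.0]
grad = list(theta)  # gradient of the toy loss 0.5 * ||theta||^2
print(sgd_step(theta, grad, lr=0.1, weight_decay=0.01))
print(momentum_step(theta, grad, [0.0, 0.0], lr=0.1))
```

In practice the gradient would come from a minibatch rather than from a closed-form toy loss.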

Note on hyperparameters
So far we have lots of hyperparameters to choose:
1. Learning rate (α)
2. Weight decay (λ)
3. Number of epochs
4. Number of hidden layers
5. Nodes in each hidden layer
6. Weight initialization strategy
7. Loss function
8. Activation functions
9. ...
More in the next class.

Summary
• Backpropagation is applied during the backward pass while training
• Computational graphs help to understand the chain rule of differentiation
• The gradient with respect to the parameters in layer k depends only on the error from the layer above and the output from
the layer below. This means that the gradients for each layer can be computed iteratively,
starting at the last layer and propagating the error back through the network.
• Hyperparameters have to be chosen, and the right choice is not obvious
• For a “deeper” study: http://www.deeplearningbook.org/
