CS7015 (Deep Learning) : Lecture 9
Greedy Layerwise Pre-training, Better activation functions, Better weight
initialization methods, Batch Normalization
Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Module 9.1 : A quick recap of training deep neural
networks
[Figure: a single sigmoid neuron with input x, weight w and output y, and a wider neuron with inputs x1, x2, x3 and weights w1, w2, w3]

We already saw how to train this network:
w = w − η∇w, where
∇w = ∂L(w)/∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x

What about a wider network with more inputs:
w1 = w1 − η∇w1
w2 = w2 − η∇w2
w3 = w3 − η∇w3
where ∇wi = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ xi
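As a concrete illustration of this update rule, here is a minimal sketch (not from the lecture) for a single sigmoid neuron trained on one example with a squared-error loss; the values of x, y, w and η below are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one training example and initial settings (illustrative assumptions)
x, y = 0.5, 1.0
w, eta = 0.1, 1.0

for step in range(100):
    f = sigmoid(w * x)                    # forward pass: f(x) = sigma(w * x)
    grad_w = (f - y) * f * (1 - f) * x    # gradient of 0.5 * (f(x) - y)^2 w.r.t. w
    w = w - eta * grad_w                  # the update rule w = w - eta * grad_w
```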
[Figure: a chain of sigmoid layers x = h0 → a1, h1 → a2, h2 → a3 → y with weights w1, w2, w3]

ai = wihi−1; hi = σ(ai)
a1 = w1 ∗ x = w1 ∗ h0

What if we have a deeper network?
We can now calculate ∇w1 using the chain rule:
∂L(w)/∂w1 = ∂L(w)/∂y · ∂y/∂a3 · ∂a3/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂w1
          = ∂L(w)/∂y ∗ ............... ∗ h0
In general,
∇wi = ∂L(w)/∂y ∗ ............... ∗ hi−1
Notice that ∇wi is proportional to the corresponding input hi−1 (we will use this fact later)
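The chain-rule product above can be evaluated term by term. A small sketch, assuming scalar layers as in the figure, a sigmoid at the output, and the squared-error loss L = 0.5(ŷ − y)²; the weights and data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.0, 0.0
w1, w2, w3 = 0.5, -0.3, 0.8          # illustrative weights

# forward pass: ai = wi * h_{i-1}, hi = sigma(ai)
h0 = x
a1 = w1 * h0; h1 = sigmoid(a1)
a2 = w2 * h1; h2 = sigmoid(a2)
a3 = w3 * h2; y_hat = sigmoid(a3)

# backward pass: multiply the local derivatives along the single path
dL_dy   = (y_hat - y)                 # dL/dy for L = 0.5 * (y_hat - y)^2
dy_da3  = y_hat * (1 - y_hat)
da3_dh2 = w3
dh2_da2 = h2 * (1 - h2)
da2_dh1 = w2
dh1_da1 = h1 * (1 - h1)
da1_dw1 = h0                          # this is why grad_w1 is proportional to h0

grad_w1 = dL_dy * dy_da3 * da3_dh2 * dh2_da2 * da2_dh1 * dh1_da1 * da1_dw1
```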
[Figure: a deep and wide network with inputs x1, x2, x3, two hidden layers of sigmoid neurons, an output y, and weights w1, w2, w3 marked on some of the edges]

What happens if we have a network which is deep and wide?
How do you calculate ∇w2?
It will be given by the chain rule applied across multiple paths (we saw this in detail when we studied backpropagation)
Things to remember
Training Neural Networks is a Game of Gradients (played using any of the existing gradient based approaches that we discussed)
The gradient tells us the responsibility of a parameter towards the loss
The gradient w.r.t. a parameter is proportional to the input to that parameter (recall the “..... ∗ x” term or the “.... ∗ hi” term in the formula for ∇wi)
[Figure: the same deep and wide network of sigmoid neurons]

Backpropagation was made popular by Rumelhart et al. in 1986
However, when used for really deep networks it was not very successful
In fact, till 2006 it was very hard to train very deep networks
Typically, even after a large number of epochs the training did not converge
Module 9.2 : Unsupervised pre-training
What has changed now? How did Deep Learning become so popular despite this problem with training large networks?
Well, until 2006 it wasn’t so popular
The field got revived after the seminal work of Hinton and Salakhutdinov in 2006¹
¹ G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
Let’s look at the idea of unsupervised pre-training introduced in this paper ...
(note that in this paper they introduced the idea in the context of RBMs but we will discuss it in the context of Autoencoders)
[Figure: a deep network in which the first two layers (x and h1) together with a reconstruction layer x̂ form an autoencoder that reconstructs x]

min (1/m) Σ_{i=1..m} Σ_{j=1..n} (x̂_ij − x_ij)²

Consider the deep neural network shown in this figure
Let us focus on the first two layers of the network (x and h1)
We will first train the weights between these two layers using an unsupervised objective
Note that we are trying to reconstruct the input (x) from the hidden representation (h1)
We refer to this as an unsupervised objective because it does not involve the output label (y) and only uses the input data (x)
At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x
We now fix the weights in layer 1 and repeat the same process with layer 2, reconstructing h1 from h2:

min (1/m) Σ_{i=1..m} Σ_{j=1..n} (ĥ1_ij − h1_ij)²

At the end of this step, the weights in layer 2 are trained such that h2 captures an abstract representation of h1
We continue this process till the last hidden layer (i.e., the layer before the output layer) so that each successive layer captures an abstract representation of the previous layer
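A rough sketch of this greedy layerwise procedure using autoencoders (the original paper used RBMs); the layer sizes, learning rate and number of epochs below are illustrative assumptions, and the decoders are used only during pre-training:

```python
import torch
import torch.nn as nn

layer_sizes = [500, 256, 128, 64]        # input dimension followed by hidden sizes (assumed)
X = torch.randn(1000, layer_sizes[0])    # stand-in for the unlabeled training inputs

encoders = []
H = X
for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
    dec = nn.Linear(d_out, d_in)         # reconstructs the previous layer's output
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for epoch in range(10):
        H_hat = dec(enc(H))
        loss = ((H_hat - H) ** 2).mean() # unsupervised reconstruction objective
        opt.zero_grad(); loss.backward(); opt.step()
    encoders.append(enc)
    H = enc(H).detach()                  # fix this layer and move on to the next one

# after pre-training: add the output layer and fine-tune the whole stack
# with the supervised, task-specific objective
model = nn.Sequential(*encoders, nn.Linear(layer_sizes[-1], 1))
```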
[Figure: the full network with inputs x1, x2, x3 and the output layer added on top of the pre-trained layers]

min_θ (1/m) Σ_{i=1..m} (yi − f(xi))²

After this layerwise pre-training, we add the output layer and train the whole network using the task specific objective
Note that, in effect, we have initialized the weights of the network using the greedy unsupervised objective and are now fine tuning these weights using the supervised objective
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?
Let’s see what these two questions mean and try to answer them based on some (among many) existing studies¹,²
¹ The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al, 2009
² Exploring Strategies for Training Deep Neural Networks, Larochelle et al, 2009
What is the optimization problem that we are trying to solve?

minimize L(θ) = (1/m) Σ_{i=1..m} (yi − f(xi))²

Is it the case that in the absence of unsupervised pre-training we are not able to drive L(θ) to 0 even for the training data (hence poor optimization)?
Let us see this in more detail ...
The error surface of the supervised objective of a Deep Neural Network is highly non-convex
With many hills and plateaus and valleys
Given the large capacity of DNNs it is still easy to land in one of these 0 error regions
Indeed Larochelle et al.¹ show that if the last layer has large capacity then L(θ) goes to 0 even without pre-training
However, if the capacity of the network is small, unsupervised pre-training helps
¹ Exploring Strategies for Training Deep Neural Networks, Larochelle et al, 2009
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?
What does regularization do? It constrains the weights to certain regions of the parameter space
L-1 regularization: constrains most weights to be 0
L-2 regularization: prevents most weights from taking large values
¹ Image Source: The Elements of Statistical Learning - T. Hastie, R. Tibshirani, and J. Friedman, Pg 71
Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space
Specifically, it constrains the weights to lie in regions where the characteristics of the data are captured well (as governed by the unsupervised objective)
We can think of this unsupervised objective as an additional constraint on the optimization problem

Unsupervised objective:
Ω(θ) = (1/m) Σ_{i=1..m} Σ_{j=1..n} (x_ij − x̂_ij)²

Supervised objective:
L(θ) = (1/m) Σ_{i=1..m} (yi − f(xi))²

This unsupervised objective ensures that the learning is not greedy w.r.t. the supervised objective (and also satisfies the unsupervised objective)
Some other experiments have also shown that pre-training is more robust to random initializations
One accepted hypothesis is that pre-training leads to better weight initializations (so that the layers capture the internal characteristics of the data)
¹ The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al, 2009
So what has happened since 2006-2009?
Deep Learning has evolved
Better optimization algorithms
Better regularization methods
Better activation functions
Better weight initialization strategies
Module 9.3 : Better activation functions
Before we look at activation functions, let’s try to answer the following question: “What makes Deep Neural Networks powerful?”
[Figure: a deep network h0 = x → a1, h1 → a2, h2 → a3 → y with weights w1, w2, w3, and a dataset whose classes are separated by a non-linear decision boundary]

Consider this deep neural network
Imagine if we replace the sigmoid in each layer by a simple linear transformation:
y = (w4 ∗ (w3 ∗ (w2 ∗ (w1x))))
Then we will just learn y as a linear transformation of x
In other words we will be constrained to learning linear decision boundaries
We cannot learn arbitrary decision boundaries
In particular, a deep linear neural network cannot learn such boundaries
But a deep non-linear neural network can indeed learn such boundaries (recall the Universal Approximation Theorem)
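A quick numeric check of this point (matrix sizes are illustrative): a stack of purely linear layers collapses into a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3, W4 = (rng.standard_normal((4, 4)) for _ in range(4))
x = rng.standard_normal(4)

y_deep = W4 @ (W3 @ (W2 @ (W1 @ x)))      # "deep" network with linear layers only
W_single = W4 @ W3 @ W2 @ W1              # one equivalent linear transformation
assert np.allclose(y_deep, W_single @ x)  # the deep linear network adds no expressive power
```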
Now let’s look at some non-linear activation functions that are typically used in deep neural networks (Much of this material is taken from Andrej Karpathy’s lecture notes¹)
¹ http://cs231n.github.io
Sigmoid

σ(x) = 1 / (1 + e^(−x))

As is obvious, the sigmoid function compresses all its inputs to the range [0,1]
Since we are always interested in gradients, let us find the gradient of this function:
∂σ(x)/∂x = σ(x)(1 − σ(x))
(you can easily derive it)
Let us see what happens if we use sigmoid in a deep network
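A small sketch of the sigmoid and the gradient formula above (the sample points are arbitrary), showing how quickly the gradient shrinks away from 0:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # dsigma/dx = sigma(x) * (1 - sigma(x))

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid(x), sigmoid_grad(x))
# the gradient peaks at 0.25 at x = 0 and is already ~4.5e-5 at x = 10
```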
[Figure: a chain of sigmoid layers h0 = x with pre-activations a1, ..., a4 and activations h1, ..., h4]

a3 = w3h2
h3 = σ(a3)

While calculating ∇w2, at some point in the chain rule we will encounter
∂h3/∂a3 = ∂σ(a3)/∂a3 = σ(a3)(1 − σ(a3))
What is the consequence of this?
To answer this question let us first understand the concept of a saturated neuron
[Figure: the sigmoid curve]

A sigmoid neuron is said to have saturated when σ(x) = 1 or σ(x) = 0
What would the gradient be at saturation?
Well, it would be 0 (you can see it from the plot or from the formula that we derived)
Saturated neurons thus cause the gradient to vanish.
[Figure: a sigmoid neuron computing σ(Σ_{i=1..4} wixi) with weights w1, w2, w3, w4, alongside the sigmoid curve]

But why would the neurons saturate?
Consider what would happen if we use sigmoid neurons and initialize the weights to very high values
The neurons will saturate very quickly
The gradients will vanish and the training will stall (more on this later)
Saturated neurons cause the gradient to vanish
Sigmoids are not zero centered
Why is this a problem?

[Figure: a network in which two sigmoid outputs h21 and h22 feed the output neuron through a3 = w1 ∗ h21 + w2 ∗ h22, with h0 = x, h1, h2 the earlier layers]

Consider the gradient w.r.t. w1 and w2:
∇w1 = ∂L(w)/∂y · ∂y/∂h3 · ∂h3/∂a3 · h21
∇w2 = ∂L(w)/∂y · ∂y/∂h3 · ∂h3/∂a3 · h22
Note that h21 and h22 are between [0, 1] (i.e., they are both positive)
So if the first common term (the shared factor ∂L(w)/∂y · ∂y/∂h3 · ∂h3/∂a3, shown in red on the slide) is positive (negative) then both ∇w1 and ∇w2 are positive (negative)
Essentially, either all the gradients at a layer are positive or all the gradients at a layer are negative
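A tiny check of this claim (all values are illustrative): with strictly positive inputs h21, h22, the gradients w.r.t. w1 and w2 always share the sign of the common upstream factor.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h21, h22 = 0.3, 0.8                      # sigmoid outputs of the previous layer, both in (0, 1)
w1, w2, y_true = 0.5, -0.2, 1.0          # illustrative weights and target

a3 = w1 * h21 + w2 * h22
y = sigmoid(a3)
common = (y - y_true) * y * (1 - y)      # the shared factor dL/dy * dy/da3

grad_w1 = common * h21
grad_w2 = common * h22
assert np.sign(grad_w1) == np.sign(grad_w2)   # both positive or both negative
```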
This restricts the possible update directions:

[Figure: the (∇w1, ∇w2) plane; only the quadrant in which all gradients are +ve and the quadrant in which all gradients are -ve are allowed, the other two quadrants are not possible. Now imagine that the optimal w lies in one of the disallowed directions.]
[Figure: starting from an initial position in the (w1, w2) plane, the only way to reach the optimal w is by taking a zigzag path, since each update can only move in an all-positive or all-negative direction]

And lastly, sigmoids are computationally expensive (because of exp(x))
tanh(x)

[Figure: the tanh curve]

f(x) = tanh(x)
Compresses all its inputs to the range [-1,1]
Zero centered
What is the derivative of this function?
∂tanh(x)/∂x = (1 − tanh²(x))
The gradient still vanishes at saturation
Also computationally expensive
ReLU

f(x) = max(0, x)

Is this a non-linear function? Indeed it is!
In fact we can combine two ReLU units to recover a piecewise linear approximation of the sigmoid function:
f(x) = max(0, x + 1) − max(0, x − 1)
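A quick sketch of that combination: the difference of two shifted ReLUs is a flat-linear-flat ramp, i.e., a piecewise linear, sigmoid-like curve (the sample grid is arbitrary).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-3, 3, 13)
f = relu(x + 1) - relu(x - 1)
# f is 0 for x <= -1, rises linearly as (x + 1) for -1 <= x <= 1, and is 2 for x >= 1
print(np.round(f, 2))
```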
ReLU

f(x) = max(0, x)

Advantages of ReLU
Does not saturate in the positive region
Computationally efficient
In practice converges much faster than sigmoid/tanh¹
¹ ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012
[Figure: a small network with inputs x1, x2 and a constant input 1, weights w1, w2 and bias b feeding a ReLU unit (pre-activation a1, output h1), followed by weight w3, pre-activation a2 and output y]

In practice there is a caveat
Let’s see what is the derivative of ReLU(x):
∂ReLU(x)/∂x = 0 if x < 0
            = 1 if x > 0
Now consider the given network
What would happen if at some point a large gradient causes the bias b to be updated to a large negative value?
w1x1 + w2x2 + b < 0 [if b << 0]
The neuron would output 0 [dead neuron]
Not only would the output be 0, but during backpropagation even the gradient ∂h1/∂a1 would be zero
The weights w1, w2 and b will not get updated [there will be a zero term in the chain rule]:
∇w1 = ∂L(θ)/∂y · ∂y/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂w1
The neuron will now stay dead forever!!
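A small sketch of this dead-neuron situation (the weights, bias and data are illustrative): once the bias is very negative, the pre-activation is negative for every input, so both the output and the local gradient are zero.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

w1, w2, b = 0.5, 0.5, -100.0            # b has been pushed to a large negative value
X = np.random.rand(1000, 2)             # inputs in [0, 1)

a1 = X[:, 0] * w1 + X[:, 1] * w2 + b    # always negative here
h1 = relu(a1)                           # the neuron outputs 0 for every example
dh1_da1 = (a1 > 0).astype(float)        # local gradient is 0 everywhere
print(h1.max(), dh1_da1.max())          # 0.0 0.0 -> no gradient ever reaches w1, w2, b
```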
In practice a large fraction of ReLU units can die if the learning rate is set too high
It is advised to initialize the bias to a positive value (0.01)
Use other variants of ReLU (as we will soon see)
Leaky ReLU

f(x) = max(0.01x, x)

No saturation
Will not die (0.01x ensures that at least a small gradient will flow through)
Computationally efficient
Close to zero centered outputs

Parametric ReLU

f(x) = max(αx, x)

α is a parameter of the model
α will get updated during backpropagation
Exponential Linear Unit

f(x) = x if x > 0
     = a(e^x − 1) if x ≤ 0

All benefits of ReLU
a(e^x − 1) ensures that at least a small gradient will flow through
Close to zero centered outputs
Expensive (requires computation of exp(x))
Maxout Neuron

max(w1ᵀx + b1, w2ᵀx + b2)

Generalizes ReLU and Leaky ReLU
No saturation! No death!
Doubles the number of parameters
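Minimal sketches of the ReLU variants above (the default slope and α values and the two-piece maxout are illustrative choices, not prescriptions):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)

def parametric_relu(x, alpha):           # alpha is learned along with the other weights
    return np.maximum(alpha * x, x)

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def maxout(x, W1, b1, W2, b2):           # max of two learned linear functions of x
    return np.maximum(W1 @ x + b1, W2 @ x + b2)
```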
Things to Remember
Sigmoids are bad
ReLU is more or less the standard unit for Convolutional Neural Networks
Can explore Leaky ReLU/Maxout/ELU
tanh and sigmoid are still used in LSTMs/RNNs (we will see more on this later)
Module 9.4 : Better initialization strategies
[Figure: a network with inputs x1, x2, a hidden layer of sigmoid neurons h11, h12, h13 (pre-activations a11, a12, a13) and an output sigmoid neuron h21 (pre-activation a21)]

What happens if we initialize all weights to 0?
a11 = w11x1 + w12x2
a12 = w21x1 + w22x2
∴ a11 = a12 = 0
∴ h11 = h12
All neurons in layer 1 will get the same activation
Now what will happen during backpropagation?
∇w11 = ∂L(w)/∂y · ∂y/∂h11 · ∂h11/∂a11 · x1
∇w21 = ∂L(w)/∂y · ∂y/∂h12 · ∂h12/∂a12 · x1
but h11 = h12 and a11 = a12
∴ ∇w11 = ∇w21
Hence both the weights will get the same update and remain equal
In fact this symmetry will never break during training
The same is true for w12 and w22
And for all weights in layer 2 (in fact, work out the math and convince yourself that all the weights in this layer will remain equal)
This is known as the symmetry breaking problem
This will happen if all the weights in a network are initialized to the same value
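A quick check of this symmetry problem (the layer sizes and the constant 0.5 are illustrative): when every weight starts at the same value, every hidden neuron gets the same activation and the rows of the first layer's gradient come out identical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 3), nn.Sigmoid(), nn.Linear(3, 1), nn.Sigmoid())
for p in net.parameters():
    nn.init.constant_(p, 0.5)            # initialize every weight and bias to the same value

x = torch.randn(8, 2)
y = torch.rand(8, 1)
loss = ((net(x) - y) ** 2).mean()
loss.backward()

print(net[0].weight.grad)                # all rows are identical: the symmetry never breaks
```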
We will now consider a feedforward network with:
input: 1000 points, each ∈ R^500
input data drawn from a unit Gaussian
the network has 5 layers
each layer has 500 neurons
we will run forward propagation on this network with different weight initializations
Let’s try to initialize the weights to small random numbers
We will see what happens to the activation across different layers

[Figure: distribution of activations in each layer with small random weights, shown for tanh and for sigmoid activation functions]
What will happen during backpropagation?
Recall that ∇w1 is proportional to the activation passing through it
If all the activations in a layer are very close to 0, what will happen to the gradient of the weights connecting this layer to the next layer?
They will all be close to 0 (vanishing gradient problem)
Let us try to initialize the weights to large random numbers

[Figure: distribution of activations in each layer with large random weights, shown for tanh and for sigmoid activation functions]

Most activations have saturated
What happens to the gradients at saturation?
They will all be close to 0 (vanishing gradient problem)
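A sketch of the experiment just described (the scale factors 0.01 and 1.0 standing in for "small" and "large" weights are illustrative): 1000 unit-Gaussian points in R^500 pushed through 5 tanh layers of 500 neurons each.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 500))          # 1000 points, each in R^500

for scale in [0.01, 1.0]:                     # small vs. large random weights
    h = X
    print(f"weight scale = {scale}")
    for layer in range(5):
        W = scale * rng.standard_normal((500, 500))
        h = np.tanh(h @ W)
        print(f"  layer {layer + 1}: mean = {h.mean():+.3f}, std = {h.std():.3f}")
# with small weights the activations shrink towards 0 layer by layer;
# with large weights they saturate near -1 and +1
```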
[Figure: a layer with inputs x1, x2, x3, ..., xn and pre-activations s11, ..., s1n]

Let us try to arrive at a more principled way of initializing weights

s11 = Σ_{i=1..n} w1i xi

Var(s11) = Var(Σ_{i=1..n} w1i xi) = Σ_{i=1..n} Var(w1i xi)
         = Σ_{i=1..n} [ (E[w1i])² Var(xi) + (E[xi])² Var(w1i) + Var(xi) Var(w1i) ]
         = Σ_{i=1..n} Var(xi) Var(w1i)          [assuming zero mean inputs and weights]
         = (n Var(w)) (Var(x))                  [assuming Var(xi) = Var(x) ∀i and Var(w1i) = Var(w) ∀i]
In general,
Var(s1i) = (n Var(w)) (Var(x))
What would happen if n Var(w) ≫ 1? The variance of s1i will be large
What would happen if n Var(w) → 0? The variance of s1i will be small
54/67
[Figure: the same layer with an additional unit s21 stacked on top of s11, ..., s1n]
Let us see what happens if we add one more layer
Using the same procedure as above we will arrive at

Var(s_{21}) = \sum_{i=1}^{n} Var(s_{1i}) Var(w_{2i}) = n Var(s_{1i}) Var(w_2)

Since Var(s_{1i}) = n Var(w_1) Var(x),

Var(s_{21}) ∝ [n Var(w_2)] [n Var(w_1)] Var(x)
           ∝ [n Var(w)]^2 Var(x)    [assuming weights across all layers have the same variance]
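A short simulation of this compounding effect (widths, depth and weight scales are assumptions chosen for illustration; linear layers are used so the prediction is exact): when n Var(w) is below, equal to, or above 1, the activation variance after k layers shrinks, stays put, or blows up roughly as [n Var(w)]^k predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 10
x = rng.standard_normal((10000, n))            # Var(x) ≈ 1

for scale in (0.05, 1.0 / np.sqrt(n), 0.2):    # gives n*Var(w) < 1, = 1, > 1
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, scale, size=(n, n))
        h = h @ W                               # linear layers only, to isolate the variance effect
    n_var_w = n * scale ** 2
    print(f"n*Var(w) = {n_var_w:5.2f}: Var after {depth} layers = {np.var(h):.3e} "
          f"([n Var(w)]^k prediction: {n_var_w ** depth:.3e})")
```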
55/67
In general,
Var(s_{ki}) = [n Var(w)]^k Var(x)
To ensure that the variance in the output of any layer does not blow up or shrink we want:
n Var(w) = 1
If we draw the weights from a unit Gaussian and scale them by 1/\sqrt{n}, then we have:
n Var(w) = n Var(z/\sqrt{n}) = n \cdot \frac{1}{n} Var(z) = 1    [z is a unit Gaussian, and Var(az) = a^2 Var(z)]
56/67
Let's see what happens if we use this initialization
[Figures: histograms of tanh and sigmoid activations with 1/\sqrt{n}-scaled weights]
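As a rough stand-in for those plots, the sketch below (layer size, depth and batch size are arbitrary assumptions) passes data through tanh layers whose weights are drawn from a unit Gaussian and scaled by 1/sqrt(n); the spread of the activations stays roughly constant across layers instead of collapsing towards 0 or saturating.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 8
h = rng.standard_normal((2000, n))

for layer in range(depth):
    W = rng.standard_normal((n, n)) / np.sqrt(n)   # unit Gaussian scaled by 1/sqrt(n)
    h = np.tanh(h @ W)
    print(f"layer {layer + 1}: std of tanh activations = {np.std(h):.3f}")
```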
57/67
However this does not work for ReLU neurons
Why?
Intuition: He et al. argue that a factor of 2 is needed when dealing with ReLU neurons
Intuitively this happens because the range of ReLU neurons is restricted only to the positive half of the space
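The halving effect can be seen numerically. In the sketch below (widths, depth and batch size are assumptions made for illustration), a ReLU network initialized with 1/sqrt(n) sees its activation scale shrink layer after layer, while scaling the weights by sqrt(2/n) keeps it stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 10
x = rng.standard_normal((2000, n))

for scale, name in ((1.0 / np.sqrt(n), "1/sqrt(n)"), (np.sqrt(2.0 / n), "sqrt(2/n)")):
    h = x
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale
        h = np.maximum(0.0, h @ W)             # ReLU keeps only the positive half of the space
    print(f"{name:>9}: std of activations after {depth} layers = {np.std(h):.4f}")
```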
58/67
Indeed, when we account for this factor of 2 we see better performance
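For reference, deep learning libraries expose both schemes directly; the snippet below shows the PyTorch initializers (the layer size here is an arbitrary assumption), with the Kaiming/He variant carrying the extra factor of 2 for ReLU layers.

```python
import torch.nn as nn

linear_tanh = nn.Linear(256, 256)
linear_relu = nn.Linear(256, 256)

# Glorot/Xavier scheme: scales weights using the layer's fan (suited to tanh/sigmoid layers)
nn.init.xavier_normal_(linear_tanh.weight)

# He/Kaiming scheme: the same idea with the extra factor of 2 for ReLU layers
nn.init.kaiming_normal_(linear_relu.weight, nonlinearity='relu')
```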
59/67
Module 9.5 : Batch Normalization
60/67
We will now see a method called batch normalization which allows us to be less careful about initialization
61/67
[Figure: a deep network with inputs x1, x2, x3 and hidden layers h0, h1, h2, h3, h4]
To understand the intuition behind Batch Normalization let us consider a deep network
Let us focus on the learning process for the weights between the last two hidden layers (h3 and h4)
Typically we use mini-batch algorithms
What would happen if there is a constant change in the distribution of h3?
In other words, what would happen if across mini-batches the distribution of h3 keeps changing?
Would the learning process be easy or hard?
62/67
It would help if the pre-activations at each layer were unit Gaussians
Why not explicitly ensure this by standardizing the pre-activation?

\hat{s}_{ik} = \frac{s_{ik} - E[s_{ik}]}{\sqrt{Var[s_{ik}]}}

But how do we compute E[s_{ik}] and Var[s_{ik}]?
We compute them from a mini-batch
Thus we are explicitly ensuring that the distribution of the inputs at different layers does not change across batches
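A minimal NumPy version of this standardization step (the shapes and the eps constant are assumptions, not taken from the lecture): statistics are computed per unit over the mini-batch dimension, and the standardized pre-activations come out with mean close to 0 and standard deviation close to 1.

```python
import numpy as np

def standardize_batch(s, eps=1e-5):
    """Standardize pre-activations s of shape (batch_size, num_units)
    using statistics computed over the mini-batch."""
    mean = s.mean(axis=0)                     # per-unit E[s_k], estimated from the batch
    var = s.var(axis=0)                       # per-unit Var[s_k], estimated from the batch
    return (s - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
s = 3.0 + 2.0 * rng.standard_normal((32, 8))  # a mini-batch of 32 pre-activation vectors
s_hat = standardize_batch(s)
print(s_hat.mean(axis=0).round(3))            # ≈ 0 for every unit
print(s_hat.std(axis=0).round(3))             # ≈ 1 for every unit
```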
63/67
This is what the deep network will look like with Batch Normalization
Is this legal?
Yes, it is, because just as the tanh layer is differentiable, the Batch Normalization layer is also differentiable
Hence we can backpropagate through this layer
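A quick way to convince ourselves of this is to let an autograd framework differentiate through the layer. The PyTorch sketch below (layer size and batch size are arbitrary assumptions) places a BatchNorm layer before a tanh and calls backward(); gradients reach the inputs as well as the layer's own gamma (weight) and beta (bias) parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(num_features=4)        # has learnable gamma (bn.weight) and beta (bn.bias)
x = torch.randn(8, 4, requires_grad=True)  # a mini-batch of 8 examples

y = torch.tanh(bn(x)).sum()                # batch normalization feeding into tanh
y.backward()                               # backpropagate through the BN layer

print(x.grad.shape)                        # torch.Size([8, 4]): the inputs receive gradients
print(bn.weight.grad, bn.bias.grad)        # gradients w.r.t. gamma and beta
```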
64/67
Catch: Do we necessarily want to force a unit Gaussian input to the tanh layer?
Why not let the network learn what is best for it?
After the Batch Normalization step add the following step:

y^{(k)} = \gamma^{(k)} \hat{s}_{ik} + \beta^{(k)}

\gamma^{(k)} and \beta^{(k)} are additional parameters of the network.
What happens if the network learns:
\gamma^{(k)} = \sqrt{Var[x^{(k)}]} and \beta^{(k)} = E[x^{(k)}]?
We will recover s_{ik}
In other words, by adjusting these additional parameters the network can learn to recover s_{ik} if that is more favourable
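A small numerical check of this recovery (the array shapes are assumptions; eps is dropped here only so the identity check is exact): if gamma is set to the batch standard deviation and beta to the batch mean, the scale-and-shift step undoes the standardization.

```python
import numpy as np

rng = np.random.default_rng(0)
s = 3.0 + 2.0 * rng.standard_normal((32, 8))   # pre-activations for one mini-batch
mean, var = s.mean(axis=0), s.var(axis=0)
s_hat = (s - mean) / np.sqrt(var)              # batch-normalized pre-activations

gamma = np.sqrt(var)                           # suppose the network learns gamma = sqrt(Var[s])
beta = mean                                    # ... and beta = E[s]
y = gamma * s_hat + beta                       # the scale-and-shift step

print(np.allclose(y, s))                       # True: we recover the original s
```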
65/67
We will now compare the performance with and without batch normalization on MNIST data using 2 layers....
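The exact experimental setup is not reproduced here, but a hypothetical pair of 2-hidden-layer MNIST classifiers of the kind being compared could look as follows in PyTorch (the hidden sizes, activation choice and omitted training loop are assumptions, not the lecture's configuration); the only difference is the BatchNorm1d layer inserted between each linear layer and its non-linearity.

```python
import torch.nn as nn

# Two hypothetical 2-hidden-layer MNIST classifiers; the only difference is the
# BatchNorm1d layer between each linear layer and its non-linearity.
plain = nn.Sequential(
    nn.Linear(784, 256), nn.Sigmoid(),
    nn.Linear(256, 256), nn.Sigmoid(),
    nn.Linear(256, 10),
)
with_bn = nn.Sequential(
    nn.Linear(784, 256), nn.BatchNorm1d(256), nn.Sigmoid(),
    nn.Linear(256, 256), nn.BatchNorm1d(256), nn.Sigmoid(),
    nn.Linear(256, 10),
)
```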
66/67
[Figures: training curves on MNIST comparing the network with and without batch normalization]
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
67/67
2016-17: Still exciting times
Even better optimization methods
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
67/67
2016-17: Still exciting times
Even better optimization methods
Data driven initialization methods
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
67/67
2016-17: Still exciting times
Even better optimization methods
Data driven initialization methods
Beyond batch normalization
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9

  • 1. 1/67 CS7015 (Deep Learning) : Lecture 9 Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization Mitesh M. Khapra Department of Computer Science and Engineering Indian Institute of Technology Madras Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 2. 2/67 Module 9.1 : A quick recap of training deep neural networks Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 3. 3/67 x σ w y Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 4. 3/67 x σ w y We already saw how to train this network Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 5. 3/67 x σ w y We already saw how to train this network w = w − η w where, Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 6. 3/67 x σ w y We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 7. 3/67 x σ w y We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 8. 3/67 x σ w y x1x2 x3 σ y w1 w2 w3 We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x What about a wider network with more inputs: Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 9. 3/67 x σ w y x1x2 x3 σ y w1 w2 w3 We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x What about a wider network with more inputs: w1 = w1 − η w1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 10. 3/67 x σ w y x1x2 x3 σ y w1 w2 w3 We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x What about a wider network with more inputs: w1 = w1 − η w1 w2 = w2 − η w2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 11. 3/67 x σ w y x1x2 x3 σ y w1 w2 w3 We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x What about a wider network with more inputs: w1 = w1 − η w1 w2 = w2 − η w2 w3 = w3 − η w3 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 12. 3/67 x σ w y x1x2 x3 σ y w1 w2 w3 We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x What about a wider network with more inputs: w1 = w1 − η w1 w2 = w2 − η w2 w3 = w3 − η w3 where, wi = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ xi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 13. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 14. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? We can now calculate w1 using chain rule: Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 15. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? We can now calculate w1 using chain rule: ∂L (w) ∂w1 = ∂L (w) ∂y . ∂y ∂a3 . ∂a3 ∂h2 . ∂h2 ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 16. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? We can now calculate w1 using chain rule: ∂L (w) ∂w1 = ∂L (w) ∂y . ∂y ∂a3 . ∂a3 ∂h2 . ∂h2 ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 = ∂L (w) ∂y ∗ ............... ∗ h0 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 17. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? We can now calculate w1 using chain rule: ∂L (w) ∂w1 = ∂L (w) ∂y . ∂y ∂a3 . ∂a3 ∂h2 . ∂h2 ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 = ∂L (w) ∂y ∗ ............... ∗ h0 In general, wi = ∂L (w) ∂y ∗ ............... ∗ hi−1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 18. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? We can now calculate w1 using chain rule: ∂L (w) ∂w1 = ∂L (w) ∂y . ∂y ∂a3 . ∂a3 ∂h2 . ∂h2 ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 = ∂L (w) ∂y ∗ ............... ∗ h0 In general, wi = ∂L (w) ∂y ∗ ............... ∗ hi−1 Notice that wi is proportional to the correspond- ing input hi−1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 19. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? We can now calculate w1 using chain rule: ∂L (w) ∂w1 = ∂L (w) ∂y . ∂y ∂a3 . ∂a3 ∂h2 . ∂h2 ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 = ∂L (w) ∂y ∗ ............... ∗ h0 In general, wi = ∂L (w) ∂y ∗ ............... ∗ hi−1 Notice that wi is proportional to the correspond- ing input hi−1 (we will use this fact later) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 20. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 21. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? How do you calculate w2 =? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 22. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? How do you calculate w2 =? It will be given by chain rule applied across mul- tiple paths Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 23. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? How do you calculate w2 =? It will be given by chain rule applied across mul- tiple paths Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 24. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? How do you calculate w2 =? It will be given by chain rule applied across mul- tiple paths Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 25. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? How do you calculate w2 =? It will be given by chain rule applied across mul- tiple paths Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 26. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? How do you calculate w2 =? It will be given by chain rule applied across mul- tiple paths (We saw this in detail when we studied back propagation) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 27. 6/67 Things to remember Training Neural Networks is a Game of Gradients (played using any of the existing gradient based approaches that we discussed) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 28. 6/67 Things to remember Training Neural Networks is a Game of Gradients (played using any of the existing gradient based approaches that we discussed) The gradient tells us the responsibility of a parameter towards the loss Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 29. 6/67 Things to remember Training Neural Networks is a Game of Gradients (played using any of the existing gradient based approaches that we discussed) The gradient tells us the responsibility of a parameter towards the loss The gradient w.r.t. a parameter is proportional to the input to the parameters (recall the “..... ∗ x” term or the “.... ∗ hi” term in the formula for wi) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 30. 7/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 31. 7/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 Backpropagation was made popular by Rumelhart et.al in 1986 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 32. 7/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 Backpropagation was made popular by Rumelhart et.al in 1986 However when used for really deep networks it was not very successful Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 33. 7/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 Backpropagation was made popular by Rumelhart et.al in 1986 However when used for really deep networks it was not very successful In fact, till 2006 it was very hard to train very deep networks Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 34. 7/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 Backpropagation was made popular by Rumelhart et.al in 1986 However when used for really deep networks it was not very successful In fact, till 2006 it was very hard to train very deep networks Typically, even after a large number of epochs the training did not con- verge Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 35. 8/67 Module 9.2 : Unsupervised pre-training Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 36. 9/67 What has changed now? How did Deep Learning become so popular despite this problem with training large networks? 1 G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006. Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 37. 9/67 What has changed now? How did Deep Learning become so popular despite this problem with training large networks? Well, until 2006 it wasn’t so popular 1 G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006. Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 38. 9/67 What has changed now? How did Deep Learning become so popular despite this problem with training large networks? Well, until 2006 it wasn’t so popular The field got revived after the seminal work of Hinton and Salakhutdinov in 2006 1 G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006. Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 39. 10/67 Let’s look at the idea of unsupervised pre-training introduced in this paper ... Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 40. 10/67 Let’s look at the idea of unsupervised pre-training introduced in this paper ... (note that in this paper they introduced the idea in the context of RBMs but we will discuss it in the context of Autoencoders) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 41. 11/67 Consider the deep neural network shown in this figure Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 42. 11/67 Consider the deep neural network shown in this figure Let us focus on the first two layers of the network (x and h1) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 43. 11/67 Consider the deep neural network shown in this figure Let us focus on the first two layers of the network (x and h1) We will first train the weights between these two layers using an un- supervised objective Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 44. 11/67 ˆx h1 x reconstruct x min 1 m m i=1 n j=1 (ˆxij − xij)2 Consider the deep neural network shown in this figure Let us focus on the first two layers of the network (x and h1) We will first train the weights between these two layers using an un- supervised objective Note that we are trying to reconstruct the input (x) from the hidden repres- entation (h1) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 45. 11/67 ˆx h1 x reconstruct x min 1 m m i=1 n j=1 (ˆxij − xij)2 Consider the deep neural network shown in this figure Let us focus on the first two layers of the network (x and h1) We will first train the weights between these two layers using an un- supervised objective Note that we are trying to reconstruct the input (x) from the hidden repres- entation (h1) We refer to this as an unsupervised objective because it does not involve the output label (y) and only uses the input data (x) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 46. 11/67 ˆx h1 x reconstruct x min 1 m m i=1 n j=1 (ˆxij − xij)2 Consider the deep neural network shown in this figure Let us focus on the first two layers of the network (x and h1) We will first train the weights between these two layers using an un- supervised objective Note that we are trying to reconstruct the input (x) from the hidden repres- entation (h1) We refer to this as an unsupervised objective because it does not involve the output label (y) and only uses the input data (x) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 47. 11/67 ˆx h1 x reconstruct x min 1 m m i=1 n j=1 (ˆxij − xij)2 At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 48. 11/67 ˆh1 h2 h1 x min 1 m m i=1 n j=1 (ˆh1ij − h1ij )2 At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x We now fix the weights in layer 1 and repeat the same process with layer 2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 49. 11/67 ˆh1 h2 h1 x min 1 m m i=1 n j=1 (ˆh1ij − h1ij )2 At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x We now fix the weights in layer 1 and repeat the same process with layer 2 At the end of this step, the weights in layer 2 are trained such that h2 cap- tures an abstract representation of h1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 50. 11/67 ˆh1 h2 h1 x min 1 m m i=1 n j=1 (ˆh1ij − h1ij )2 At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x We now fix the weights in layer 1 and repeat the same process with layer 2 At the end of this step, the weights in layer 2 are trained such that h2 cap- tures an abstract representation of h1 We continue this process till the last hidden layer (i.e., the layer before the output layer) so that each successive layer captures an abstract represent- ation of the previous layer Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 51. 12/67 x1 x2 x3 min θ 1 m m i=1 (yi − f(xi))2 After this layerwise pre-training, we add the output layer and train the whole network using the task specific objective Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 52. 12/67 x1 x2 x3 min θ 1 m m i=1 (yi − f(xi))2 After this layerwise pre-training, we add the output layer and train the whole network using the task specific objective Note that, in effect we have initial- ized the weights of the network us- ing the greedy unsupervised objective and are now fine tuning these weights using the supervised objective Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 53. 13/67 Why does this work better? 1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al,2009 2 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 54. 13/67 Why does this work better? Is it because of better optimization? 1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al,2009 2 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 55. 13/67 Why does this work better? Is it because of better optimization? Is it because of better regularization? 1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al,2009 2 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 56. 13/67 Why does this work better? Is it because of better optimization? Is it because of better regularization? Let’s see what these two questions mean and try to answer them based on some (among many) existing studies1,2 1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al,2009 2 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 57. 14/67 Why does this work better? Is it because of better optimization? Is it because of better regularization? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 58. 15/67 What is the optimization problem that we are trying to solve? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 59. 15/67 What is the optimization problem that we are trying to solve? minimize L (θ) = 1 m m i=1 (yi − f(xi))2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 60. 15/67 What is the optimization problem that we are trying to solve? minimize L (θ) = 1 m m i=1 (yi − f(xi))2 Is it the case that in the absence of unsupervised pre-training we are not able to drive L (θ) to 0 even for the training data (hence poor optimization) ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 61. 15/67 What is the optimization problem that we are trying to solve? minimize L (θ) = 1 m m i=1 (yi − f(xi))2 Is it the case that in the absence of unsupervised pre-training we are not able to drive L (θ) to 0 even for the training data (hence poor optimization) ? Let us see this in more detail ... Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 62. 16/67 The error surface of the supervised objective of a Deep Neural Network is highly non-convex 1 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 63. 16/67 The error surface of the supervised objective of a Deep Neural Network is highly non-convex With many hills and plateaus and val- leys 1 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 64. 16/67 The error surface of the supervised objective of a Deep Neural Network is highly non-convex With many hills and plateaus and val- leys Given that large capacity of DNNs it is still easy to land in one of these 0 error regions 1 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 65. 16/67 The error surface of the supervised objective of a Deep Neural Network is highly non-convex With many hills and plateaus and val- leys Given that large capacity of DNNs it is still easy to land in one of these 0 error regions Indeed Larochelle et.al.1 show that if the last layer has large capacity then L (θ) goes to 0 even without pre- training 1 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 66. 16/67 The error surface of the supervised objective of a Deep Neural Network is highly non-convex With many hills and plateaus and val- leys Given that large capacity of DNNs it is still easy to land in one of these 0 error regions Indeed Larochelle et.al.1 show that if the last layer has large capacity then L (θ) goes to 0 even without pre- training However, if the capacity of the net- work is small, unsupervised pre- training helps 1 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 67. 17/67 Why does this work better? Is it because of better optimization? Is it because of better regularization? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 68. 18/67 What does regularization do? 1 Image Source:The Elements of Statistical Learning-T. Hastie, R. Tibshirani, and J. Friedman, Pg 71 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 69. 18/67 What does regularization do? It con- strains the weights to certain regions of the parameter space 1 Image Source:The Elements of Statistical Learning-T. Hastie, R. Tibshirani, and J. Friedman, Pg 71 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 70. 18/67 What does regularization do? It con- strains the weights to certain regions of the parameter space L-1 regularization: constrains most weights to be 0 1 Image Source:The Elements of Statistical Learning-T. Hastie, R. Tibshirani, and J. Friedman, Pg 71 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 71. 18/67 What does regularization do? It con- strains the weights to certain regions of the parameter space L-1 regularization: constrains most weights to be 0 L-2 regularization: prevents most weights from taking large values 1 Image Source:The Elements of Statistical Learning-T. Hastie, R. Tibshirani, and J. Friedman, Pg 71 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 72. 19/67 Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 73. 19/67 Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space Specifically, it constrains the weights to lie in regions where the character- istics of the data are captured well (as governed by the unsupervised object- ive) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 74. 19/67 Unsupervised objective: Ω(θ) = 1 m m i=1 n j=1 (xij − ˆxij)2 Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space Specifically, it constrains the weights to lie in regions where the character- istics of the data are captured well (as governed by the unsupervised object- ive) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 75. 19/67 Unsupervised objective: Ω(θ) = 1 m m i=1 n j=1 (xij − ˆxij)2 We can think of this unsupervised ob- jective as an additional constraint on the optimization problem Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space Specifically, it constrains the weights to lie in regions where the character- istics of the data are captured well (as governed by the unsupervised object- ive) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 76. 19/67 Unsupervised objective: Ω(θ) = 1 m m i=1 n j=1 (xij − ˆxij)2 We can think of this unsupervised objective as an additional constraint on the optimization problem Supervised objective: L (θ) = 1 m m i=1 (yi − f(xi))2 Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space Specifically, it constrains the weights to lie in regions where the characteristics of the data are captured well (as governed by the unsupervised objective) This unsupervised objective ensures that the learning is not greedy w.r.t. the supervised objective (and also satisfies the unsupervised objective) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 77. 20/67 Some other experiments have also shown that pre-training is more ro- bust to random initializations 1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 78. 20/67 Some other experiments have also shown that pre-training is more ro- bust to random initializations One accepted hypothesis is that pre- training leads to better weight ini- tializations (so that the layers cap- ture the internal characteristics of the data) 1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 79. 21/67 So what has happened since 2006-2009? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 80. 22/67 Deep Learning has evolved Better optimization algorithms Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 81. 22/67 Deep Learning has evolved Better optimization algorithms Better regularization methods Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 82. 22/67 Deep Learning has evolved Better optimization algorithms Better regularization methods Better activation functions Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 83. 22/67 Deep Learning has evolved Better optimization algorithms Better regularization methods Better activation functions Better weight initialization strategies Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 84. 23/67 Module 9.3 : Better activation functions Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 85. 24/67 Deep Learning has evolved Better optimization algorithms Better regularization methods Better activation functions Better weight initialization strategies Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 86. 25/67 Before we look at activation functions, let’s try to answer the following question: “What makes Deep Neural Networks powerful ?” Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 87. 26/67 h0 = x y σ σ σ a1 h1 a2 h2 a3 w1 w2 w3 Consider this deep neural network Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 88. 26/67 h0 = x y σ σ σ a1 h1 a2 h2 a3 w1 w2 w3 Consider this deep neural network Imagine if we replace the sigmoid in each layer by a simple linear trans- formation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 89. 26/67 h0 = x y σ σ σ a1 h1 a2 h2 a3 w1 w2 w3 Consider this deep neural network Imagine if we replace the sigmoid in each layer by a simple linear trans- formation y = (w4 ∗ (w3 ∗ (w2 ∗ (w1x)))) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 90. 26/67 h0 = x y σ σ σ a1 h1 a2 h2 a3 w1 w2 w3 Consider this deep neural network Imagine if we replace the sigmoid in each layer by a simple linear trans- formation y = (w4 ∗ (w3 ∗ (w2 ∗ (w1x)))) Then we will just learn y as a linear transformation of x Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 91. 26/67 h0 = x y σ σ σ a1 h1 a2 h2 a3 w1 w2 w3 Consider this deep neural network Imagine if we replace the sigmoid in each layer by a simple linear trans- formation y = (w4 ∗ (w3 ∗ (w2 ∗ (w1x)))) Then we will just learn y as a linear transformation of x In other words we will be constrained to learning linear decision boundaries Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 92. 26/67 h0 = x y σ σ σ a1 h1 a2 h2 a3 w1 w2 w3 Consider this deep neural network Imagine if we replace the sigmoid in each layer by a simple linear trans- formation y = (w4 ∗ (w3 ∗ (w2 ∗ (w1x)))) Then we will just learn y as a linear transformation of x In other words we will be constrained to learning linear decision boundaries We cannot learn arbitrary decision boundaries Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 93. 26/67 In particular, a deep linear neural network cannot learn such boundar- ies Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 94. 26/67 In particular, a deep linear neural network cannot learn such boundar- ies But a deep non linear neural net- work can indeed learn such bound- aries (recall Universal Approximation Theorem) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 95. 27/67 Now let’s look at some non-linear activation functions that are typically used in deep neural networks (Much of this material is taken from Andrej Karpathy’s lecture notes 1) 1 http://cs231n.github.io Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 96. 28/67 σ(x) = 1 1+e−x Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 97. 28/67 Sigmoid σ(x) = 1 1+e−x As is obvious, the sigmoid function compresses all its inputs to the range [0,1] Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 98. 28/67 Sigmoid σ(x) = 1 1+e−x As is obvious, the sigmoid function compresses all its inputs to the range [0,1] Since we are always interested in gradients, let us find the gradient of this function Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 99. 28/67 Sigmoid σ(x) = 1 1+e−x As is obvious, the sigmoid function compresses all its inputs to the range [0,1] Since we are always interested in gradients, let us find the gradient of this function ∂σ(x) ∂x = σ(x)(1 − σ(x)) (you can easily derive it) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 100. 28/67 Sigmoid σ(x) = 1 1+e−x As is obvious, the sigmoid function compresses all its inputs to the range [0,1] Since we are always interested in gradients, let us find the gradient of this function ∂σ(x) ∂x = σ(x)(1 − σ(x)) (you can easily derive it) Let us see what happens if we use sig- moid in a deep network Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
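As a quick illustration of this derivative, here is a minimal NumPy sketch (illustrative, not part of the lecture) that evaluates σ(x) and σ(x)(1 − σ(x)) at a few points; the gradient never exceeds 0.25 and decays towards 0 as |x| grows:

```python
import numpy as np

def sigmoid(x):
    # logistic function: squashes any input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative from the slide: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(xs))       # values approach 0 or 1 at the extremes
print(sigmoid_grad(xs))  # peaks at 0.25 (x = 0) and is ~0 at the extremes
```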
  • 101. 29/67 h0 = x σ σ σ σ a1 h1 a2 h2 a3 h3 a4 h4 a3 = w2h2 h3 = σ(a3) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 102. 29/67 h0 = x σ σ σ σ a1 h1 a2 h2 a3 h3 a4 h4 a3 = w2h2 h3 = σ(a3) While calculating w2 at some point in the chain rule we will encounter ∂h3 ∂a3 = ∂σ(a3) ∂a3 = σ(a3)(1 − σ(a3)) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 103. 29/67 h0 = x σ σ σ σ a1 h1 a2 h2 a3 h3 a4 h4 a3 = w2h2 h3 = σ(a3) While calculating w2 at some point in the chain rule we will encounter ∂h3 ∂a3 = ∂σ(a3) ∂a3 = σ(a3)(1 − σ(a3)) What is the consequence of this ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 104. 29/67 h0 = x σ σ σ σ a1 h1 a2 h2 a3 h3 a4 h4 a3 = w2h2 h3 = σ(a3) While calculating w2 at some point in the chain rule we will encounter ∂h3 ∂a3 = ∂σ(a3) ∂a3 = σ(a3)(1 − σ(a3)) What is the consequence of this ? To answer this question let us first understand the concept of saturated neuron ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 105. 30/67 −2 −1 1 2 0.2 0.4 0.6 0.8 1 x y A sigmoid neuron is said to have sat- urated when σ(x) = 1 or σ(x) = 0 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 106. 30/67 −2 −1 1 2 0.2 0.4 0.6 0.8 1 x y A sigmoid neuron is said to have sat- urated when σ(x) = 1 or σ(x) = 0 What would the gradient be at satur- ation? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 107. 30/67 −2 −1 1 2 0.2 0.4 0.6 0.8 1 x y A sigmoid neuron is said to have sat- urated when σ(x) = 1 or σ(x) = 0 What would the gradient be at satur- ation? Well it would be 0 (you can see it from the plot or from the formula that we derived) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 108. 30/67 −2 −1 1 2 0.2 0.4 0.6 0.8 1 x y Saturated neurons thus cause the gradient to vanish. A sigmoid neuron is said to have sat- urated when σ(x) = 1 or σ(x) = 0 What would the gradient be at satur- ation? Well it would be 0 (you can see it from the plot or from the formula that we derived) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 109. 31/67 Saturated neurons thus cause the gradient to vanish. w1 w2 w3 w4 σ( 4 i=1 wixi) −2 −1 1 2 0.2 0.4 0.6 0.8 1 4 i=1 wixi y But why would the neurons saturate ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 110. 31/67 Saturated neurons thus cause the gradient to vanish. w1 w2 w3 w4 σ( 4 i=1 wixi) −2 −1 1 2 0.2 0.4 0.6 0.8 1 4 i=1 wixi y But why would the neurons saturate ? Consider what would happen if we use sigmoid neurons and initialize the weights to very high values ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 111. 31/67 Saturated neurons thus cause the gradient to vanish. w1 w2 w3 w4 σ( 4 i=1 wixi) −2 −1 1 2 0.2 0.4 0.6 0.8 1 4 i=1 wixi y But why would the neurons saturate ? Consider what would happen if we use sigmoid neurons and initialize the weights to very high values ? The neurons will saturate very quickly Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 112. 31/67 Saturated neurons thus cause the gradient to vanish. w1 w2 w3 w4 σ( 4 i=1 wixi) −2 −1 1 2 0.2 0.4 0.6 0.8 1 4 i=1 wixi y But why would the neurons saturate ? Consider what would happen if we use sigmoid neurons and initialize the weights to very high values ? The neurons will saturate very quickly The gradients will vanish and the training will stall (more on this later) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
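This saturation is easy to see numerically. The sketch below is illustrative (the weight scales and the assumption of roughly unit-scale inputs are arbitrary choices, not from the lecture):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                  # four inputs of roughly unit scale
for scale in (0.5, 50.0):                   # modest vs. very large weight initialization
    w = rng.standard_normal(4) * scale
    a = w @ x                               # pre-activation sum_i w_i x_i
    h = sigmoid(a)
    print(scale, a, h, h * (1 - h))         # with large weights |a| is huge, h saturates and h*(1-h) collapses to ~0
```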
  • 113. 32/67 Saturated neurons cause the gradient to vanish Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 114. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 115. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Why is this a problem?? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 116. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Why is this a problem?? w1 w2 a3 = w1 ∗ h21 + w2 ∗ h22 y h0 = x h1 h2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 117. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Consider the gradient w.r.t. w1 and w2 w1 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 ∂a3 ∂w1 w2 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 ∂a3 ∂w2 Why is this a problem?? w1 w2 a3 = w1 ∗ h21 + w2 ∗ h22 y h0 = x h1 h2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 118. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Consider the gradient w.r.t. w1 and w2 w1 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 h21 w2 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 h22 Note that h21 and h22 are between [0, 1] (i.e., they are both positive) Why is this a problem?? w1 w2 a3 = w1 ∗ h21 + w2 ∗ h22 y h0 = x h1 h2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 119. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Consider the gradient w.r.t. w1 and w2 w1 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 h21 w2 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 h22 Note that h21 and h22 are between [0, 1] (i.e., they are both positive) So if the first common term (in red) is positive (negative) then both w1 and w2 are positive (negative) Why is this a problem?? w1 w2 a3 = w1 ∗ h21 + w2 ∗ h22 y h0 = x h1 h2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 120. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Consider the gradient w.r.t. w1 and w2 w1 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 h21 w2 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 h22 Note that h21 and h22 are between [0, 1] (i.e., they are both positive) So if the first common term (in red) is positive (negative) then both w1 and w2 are positive (negative) Why is this a problem?? w1 w2 a3 = w1 ∗ h21 + w2 ∗ h22 y h0 = x h1 h2 Essentially, either all the gradients at a layer are positive or all the gradients at a layer are negative Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
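This sign argument can be checked numerically. In the sketch below (illustrative: random numbers stand in for the shared common term and for the positive sigmoid outputs h21, h22), the per-example gradients w.r.t. w1 and w2 always agree in sign:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.uniform(0.0, 1.0, size=(1000, 2))   # stand-ins for sigmoid outputs h21, h22: always positive
common = rng.standard_normal((1000, 1))     # stand-in for the shared factor dL/dy * dy/dh3 * dh3/da3
grads = common * h                          # per-example gradients w.r.t. (w1, w2)
print(np.mean(np.sign(grads[:, 0]) == np.sign(grads[:, 1])))   # 1.0: the two gradients never disagree in sign
```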
  • 121. 33/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered This restricts the possible update dir- ections w2 w1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 122. 33/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered This restricts the possible update dir- ections w2 w1 (Not possible) Quadrant in which all gradients are +ve (Allowed) Quadrant in which all gradients are -ve (Allowed) (Not possible) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 123. 33/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered This restricts the possible update dir- ections w2 w1 (Not possible) Quadrant in which all gradients are +ve (Allowed) Quadrant in which all gradients are -ve (Allowed) (Not possible) Now imagine: this is the optimal w Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 124. 34/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered w2 w1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 125. 34/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered w2 w1 starting from this initial position only way to reach it is by taking a zigzag path Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 131. 34/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered And lastly, sigmoids are compu- tationally expensive (because of exp (x)) w2 w1 starting from this initial position only way to reach it is by taking a zigzag path Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 132. 35/67 tanh(x) 0−4 −2 2 4 −1 −0.5 0.5 1 x y f(x) = tanh(x) Compresses all its inputs to the range [-1,1] Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 133. 35/67 tanh(x) 0−4 −2 2 4 −1 −0.5 0.5 1 x y f(x) = tanh(x) Compresses all its inputs to the range [-1,1] Zero centered Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 134. 35/67 tanh(x) 0−4 −2 2 4 −1 −0.5 0.5 1 x y f(x) = tanh(x) Compresses all its inputs to the range [-1,1] Zero centered What is the derivative of this func- tion? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 135. 35/67 tanh(x) 0−4 −2 2 4 −1 −0.5 0.5 1 x y f(x) = tanh(x) Compresses all its inputs to the range [-1,1] Zero centered What is the derivative of this func- tion? ∂tanh(x) ∂x = (1 − tanh2 (x)) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 136. 35/67 tanh(x) 0−4 −2 2 4 −1 −0.5 0.5 1 x y f(x) = tanh(x) Compresses all its inputs to the range [-1,1] Zero centered What is the derivative of this func- tion? ∂tanh(x) ∂x = (1 − tanh2 (x)) The gradient still vanishes at satura- tion Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 137. 35/67 tanh(x) 0−4 −2 2 4 −1 −0.5 0.5 1 x y f(x) = tanh(x) Compresses all its inputs to the range [-1,1] Zero centered What is the derivative of this func- tion? ∂tanh(x) ∂x = (1 − tanh2 (x)) The gradient still vanishes at satura- tion Also computationally expensive Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 138. 36/67 ReLU f(x) = max(0, x) Is this a non-linear function? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 139. 36/67 ReLU f(x) = max(0, x) Is this a non-linear function? Indeed it is! Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 140. 36/67 ReLU f(x) = max(0, x + 1) − max(0, x − 1) Is this a non-linear function? Indeed it is! In fact we can combine two ReLU units to recover a piecewise linear ap- proximation of the sigmoid function Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
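A quick check of this construction (a sketch, not lecture code): evaluating max(0, x + 1) − max(0, x − 1) on a grid shows the flat–linear–flat, sigmoid-like shape.

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)

def ramp(x):
    # difference of two shifted ReLUs: 0 for x < -1, linear on [-1, 1], constant 2 for x > 1
    return relu(x + 1.0) - relu(x - 1.0)

print(ramp(np.linspace(-3, 3, 7)))   # [0. 0. 0. 1. 2. 2. 2.] -- a piecewise-linear, sigmoid-like ramp
```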
  • 141. 37/67 ReLU f(x) = max(0, x) Advantages of ReLU Does not saturate in the positive re- gion 1 ImageNet Classification with Deep Convolutional Neural Networks- Alex Krizhevsky Ilya Sutskever, Geoffrey E. Hinton, 2012 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 142. 37/67 ReLU f(x) = max(0, x) Advantages of ReLU Does not saturate in the positive re- gion Computationally efficient 1 ImageNet Classification with Deep Convolutional Neural Networks- Alex Krizhevsky Ilya Sutskever, Geoffrey E. Hinton, 2012 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 143. 37/67 ReLU f(x) = max(0, x) Advantages of ReLU Does not saturate in the positive re- gion Computationally efficient In practice converges much faster than sigmoid/tanh1 1 ImageNet Classification with Deep Convolutional Neural Networks- Alex Krizhevsky Ilya Sutskever, Geoffrey E. Hinton, 2012 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 144. 38/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice there is a caveat Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 145. 38/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice there is a caveat Let’s see what is the derivative of ReLU(x) ∂ReLU(x) ∂x = 0 if x < 0 = 1 if x > 0 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 146. 38/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice there is a caveat Let’s see what is the derivative of ReLU(x) ∂ReLU(x) ∂x = 0 if x < 0 = 1 if x > 0 Now consider the given network Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 147. 38/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice there is a caveat Let’s see what is the derivative of ReLU(x) ∂ReLU(x) ∂x = 0 if x < 0 = 1 if x > 0 Now consider the given network What would happen if at some point a large gradient causes the bias b to be updated to a large negative value? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 148. 39/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 w1x1 + w2x2 + b < 0 [if b << 0] The neuron would output 0 [dead neuron] Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 149. 39/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 w1x1 + w2x2 + b < 0 [if b << 0] The neuron would output 0 [dead neuron] Not only would the output be 0 but during backpropagation even the gradient ∂h1 ∂a1 would be zero Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 150. 39/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 w1x1 + w2x2 + b < 0 [if b << 0] The neuron would output 0 [dead neuron] Not only would the output be 0 but during backpropagation even the gradient ∂h1 ∂a1 would be zero The weights w1, w2 and b will not get updated [ there will be a zero term in the chain rule] w1 = ∂L (θ) ∂y . ∂y ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 151. 39/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 w1x1 + w2x2 + b < 0 [if b << 0] The neuron would output 0 [dead neuron] Not only would the output be 0 but during backpropagation even the gradient ∂h1 ∂a1 would be zero The weights w1, w2 and b will not get updated [ there will be a zero term in the chain rule] w1 = ∂L (θ) ∂y . ∂y ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 The neuron will now stay dead forever!! Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
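The dead-neuron effect is easy to reproduce. The sketch below uses made-up numbers (the data and the large negative bias are arbitrary choices for illustration):

```python
import numpy as np

relu = lambda a: np.maximum(0.0, a)
relu_grad = lambda a: (a > 0).astype(float)   # 0 where a < 0, 1 where a > 0

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))             # 100 inputs (x1, x2)
w = np.array([0.5, -0.3])
b = -50.0                                     # bias pushed to a large negative value by some earlier update

a = X @ w + b                                 # pre-activations: all strongly negative
print(relu(a).max())                          # 0.0 -> the unit outputs 0 for every input
print(relu_grad(a).sum())                     # 0.0 -> dh1/da1 = 0 everywhere, so w1, w2 and b never get updated again
```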
  • 152. 40/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice a large fraction of ReLU units can die if the learning rate is set too high Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 153. 40/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice a large fraction of ReLU units can die if the learning rate is set too high It is advised to initialize the bias to a positive value (0.01) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 154. 40/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice a large fraction of ReLU units can die if the learning rate is set too high It is advised to initialize the bias to a positive value (0.01) Use other variants of ReLU (as we will soon see) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 155. 41/67 Leaky ReLU x y f(x) = max(0.01x,x) No saturation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 156. 41/67 Leaky ReLU x y f(x) = max(0.01x,x) No saturation Will not die (0.01x ensures that at least a small gradient will flow through) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 157. 41/67 Leaky ReLU x y f(x) = max(0.01x,x) No saturation Will not die (0.01x ensures that at least a small gradient will flow through) Computationally efficient Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 158. 41/67 Leaky ReLU x y f(x) = max(0.01x,x) No saturation Will not die (0.01x ensures that at least a small gradient will flow through) Computationally efficient Close to zero centered outputs Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 159. 41/67 Leaky ReLU x y f(x) = max(0.01x,x) No saturation Will not die (0.01x ensures that at least a small gradient will flow through) Computationally efficient Close to zero centered outputs Parametric ReLU f(x) = max(αx, x) α is a parameter of the model α will get updated during backpropagation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
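A minimal sketch of both variants (the 0.01 slope is the fixed leak from the slide; treating it as a learnable α gives Parametric ReLU):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(alpha*x, x): a small negative slope keeps a nonzero gradient for x < 0
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

xs = np.array([-5.0, -0.5, 0.5, 5.0])
print(leaky_relu(xs))
print(leaky_relu_grad(xs))   # never exactly 0, so the unit cannot die
# Parametric ReLU uses the same form, but alpha is a model parameter that is
# updated by backpropagation along with the weights.
```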
  • 160. 42/67 Exponential Linear Unit x y f(x) = x if x > 0 = a(exp(x) − 1) if x ≤ 0 All benefits of ReLU Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 161. 42/67 Exponential Linear Unit x y f(x) = x if x > 0 = a(exp(x) − 1) if x ≤ 0 All benefits of ReLU a(exp(x) − 1) ensures that at least a small gradient will flow through Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 162. 42/67 Exponential Linear Unit x y f(x) = x if x > 0 = a(exp(x) − 1) if x ≤ 0 All benefits of ReLU a(exp(x) − 1) ensures that at least a small gradient will flow through Close to zero centered outputs Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 163. 42/67 Exponential Linear Unit x y f(x) = x if x > 0 = a(exp(x) − 1) if x ≤ 0 All benefits of ReLU a(exp(x) − 1) ensures that at least a small gradient will flow through Close to zero centered outputs Expensive (requires computation of exp(x)) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
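A sketch of ELU and its gradient (a = 1 below is a common default, not something fixed by the slide):

```python
import numpy as np

def elu(x, a=1.0):
    # x for x > 0, a*(exp(x) - 1) for x <= 0 (saturates smoothly towards -a)
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def elu_grad(x, a=1.0):
    # 1 on the positive side, a*exp(x) on the negative side: small but never exactly 0
    return np.where(x > 0, 1.0, a * np.exp(x))

xs = np.array([-5.0, -1.0, 0.5, 3.0])
print(elu(xs))
print(elu_grad(xs))
```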
  • 164. 43/67 Maxout Neuron max(wT 1 x + b1, wT 2 x + b2) Generalizes ReLU and Leaky ReLU Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 165. 43/67 Maxout Neuron max(wT 1 x + b1, wT 2 x + b2) Generalizes ReLU and Leaky ReLU No saturation! No death! Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 166. 43/67 Maxout Neuron max(wT 1 x + b1, wT 2 x + b2) Generalizes ReLU and Leaky ReLU No saturation! No death! Doubles the number of parameters Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
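To see how maxout generalizes ReLU, here is a small sketch (the weights are arbitrary illustrations; choosing one of the two affine pieces to be the zero map recovers ReLU):

```python
import numpy as np

def maxout(x, W, b):
    # W: (k, d), b: (k,) -- the unit outputs the maximum over k affine pieces w_j^T x + b_j
    return np.max(W @ x + b)

x = np.array([1.0, -2.0])
W = np.array([[0.0, 0.0],    # piece 1: the zero map
              [1.0, 1.0]])   # piece 2: x1 + x2
b = np.array([0.0, 0.0])
print(maxout(x, W, b))       # 0.0 == ReLU(x1 + x2); with k pieces the unit has k times the parameters
```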
  • 167. 44/67 Things to Remember Sigmoids are bad Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 168. 44/67 Things to Remember Sigmoids are bad ReLU is more or less the standard unit for Convolutional Neural Networks Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 169. 44/67 Things to Remember Sigmoids are bad ReLU is more or less the standard unit for Convolutional Neural Networks Can explore Leaky ReLU/Maxout/ELU Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 170. 44/67 Things to Remember Sigmoids are bad ReLU is more or less the standard unit for Convolutional Neural Networks Can explore Leaky ReLU/Maxout/ELU tanh sigmoids are still used in LSTMs/RNNs (we will see more on this later) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 171. 45/67 Module 9.4 : Better initialization strategies Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 172. 46/67 Deep Learning has evolved Better optimization algorithms Better regularization methods Better activation functions Better weight initialization strategies Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 173. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 What happens if we initialize all weights to 0? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 174. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 What happens if we initialize all weights to 0? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 175. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 What happens if we initialize all weights to 0? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 176. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 What happens if we initialize all weights to 0? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 177. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 What happens if we initialize all weights to 0? All neurons in layer 1 will get the same activation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 178. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Now what will happen during back propagation? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 179. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Now what will happen during back propagation? w11 = ∂L (w) ∂y . ∂y ∂h11 . ∂h11 ∂a11 .x1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 180. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Now what will happen during back propagation? w11 = ∂L (w) ∂y . ∂y ∂h11 . ∂h11 ∂a11 .x1 w21 = ∂L (w) ∂y . ∂y ∂h12 . ∂h12 ∂a12 .x1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 181. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Now what will happen during back propagation? w11 = ∂L (w) ∂y . ∂y ∂h11 . ∂h11 ∂a11 .x1 w21 = ∂L (w) ∂y . ∂y ∂h12 . ∂h12 ∂a12 .x1 but h11 = h12 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 182. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Now what will happen during back propagation? w11 = ∂L (w) ∂y . ∂y ∂h11 . ∂h11 ∂a11 .x1 w21 = ∂L (w) ∂y . ∂y ∂h12 . ∂h12 ∂a12 .x1 but h11 = h12 and a11 = a12 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 183. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Now what will happen during back propagation? w11 = ∂L (w) ∂y . ∂y ∂h11 . ∂h11 ∂a11 .x1 w21 = ∂L (w) ∂y . ∂y ∂h12 . ∂h12 ∂a12 .x1 but h11 = h12 and a11 = a12 ∴ w11 = w21 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 184. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Hence both the weights will get the same update and remain equal Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 185. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Hence both the weights will get the same update and remain equal In fact this symmetry will never break during training Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 186. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Hence both the weights will get the same update and remain equal In fact this symmetry will never break during training The same is true for w12 and w22 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 187. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Hence both the weights will get the same update and remain equal In fact this symmetry will never break during training The same is true for w12 and w22 And for all weights in layer 2 (in fact, work out the math and convince yourself that all the weights in this layer will remain equal) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 188. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Hence both the weights will get the same update and remain equal In fact this symmetry will never break during training The same is true for w12 and w22 And for all weights in layer 2 (in fact, work out the math and convince yourself that all the weights in this layer will remain equal) This is known as the symmetry breaking problem Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 189. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Hence both the weights will get the same update and remain equal In fact this symmetry will never break during training The same is true for w12 and w22 And for all weights in layer 2 (in fact, work out the math and convince yourself that all the weights in this layer will remain equal) This is known as the symmetry breaking problem This will happen if all the weights in a network are initialized to the same value Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
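The symmetry argument can be verified directly. Below is a small NumPy sketch (illustrative; the data, layer sizes and learning rate are arbitrary) that trains a one-hidden-layer sigmoid network from an all-zero initialization and then prints the first-layer weights: all hidden units end up with identical incoming weights, exactly as argued above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 2))              # 8 examples, 2 features
y = rng.standard_normal((8, 1))

W1 = np.zeros((2, 3)); b1 = np.zeros(3)      # all-zero initialization
W2 = np.zeros((3, 1)); b2 = np.zeros(1)

lr = 0.1
for _ in range(100):
    a1 = X @ W1 + b1; h1 = sigmoid(a1)       # hidden layer
    yhat = h1 @ W2 + b2                      # linear output
    d_yhat = (yhat - y) / len(X)             # gradient of 0.5 * mean squared error
    dW2 = h1.T @ d_yhat; db2 = d_yhat.sum(0)
    d_h1 = d_yhat @ W2.T
    d_a1 = d_h1 * h1 * (1 - h1)              # sigmoid derivative
    dW1 = X.T @ d_a1; db1 = d_a1.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(W1)   # all three columns (hidden units) are identical: the symmetry never breaks
print(W2)   # the second-layer weights are identical as well
```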
  • 190. 48/67 We will now consider a feedforward network with: Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 191. 48/67 We will now consider a feedforward network with: input: 1000 points, each ∈ R500 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 192. 48/67 We will now consider a feedforward network with: input: 1000 points, each ∈ R500 input data is drawn from unit Gaus- sian −3 −2 −1 0 1 2 3 0.1 0.2 0.3 0.4 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 193. 48/67 We will now consider a feedforward network with: input: 1000 points, each ∈ R500 input data is drawn from unit Gaus- sian −3 −2 −1 0 1 2 3 0.1 0.2 0.3 0.4 the network has 5 layers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 194. 48/67 We will now consider a feedforward network with: input: 1000 points, each ∈ R500 input data is drawn from unit Gaus- sian −3 −2 −1 0 1 2 3 0.1 0.2 0.3 0.4 the network has 5 layers each layer has 500 neurons Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 195. 48/67 We will now consider a feedforward network with: input: 1000 points, each ∈ R500 input data is drawn from unit Gaus- sian −3 −2 −1 0 1 2 3 0.1 0.2 0.3 0.4 the network has 5 layers each layer has 500 neurons we will run forward propagation on this network with different weight ini- tializations Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 196. 49/67 Let’s try to initialize the weights to small random numbers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 197. 49/67 tanh activation functions Let’s try to initialize the weights to small random numbers We will see what happens to the ac- tivation across different layers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 198. 49/67 sigmoid activation functions Let’s try to initialize the weights to small random numbers We will see what happens to the ac- tivation across different layers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
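The experiment behind these plots can be reproduced in a few lines of NumPy (a sketch: the 0.01 scale is one reasonable reading of "small random numbers", and tanh is used as on the slide; swapping in sigmoid or a larger scale reproduces the other plots):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((1000, 500))             # 1000 input points, each in R^500, drawn from a unit Gaussian

for layer in range(5):                           # 5 layers of 500 neurons each
    W = rng.standard_normal((500, 500)) * 0.01   # small random initialization
    H = np.tanh(H @ W)
    print(layer, round(H.std(), 4))              # the spread of the activations collapses towards 0 layer after layer
```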
  • 199. 50/67 What will happen during back propagation? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 200. 50/67 What will happen during back propagation? Recall that w1 is proportional to the activation passing through it Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 201. 50/67 What will happen during back propagation? Recall that w1 is proportional to the activation passing through it If all the activations in a layer are very close to 0, what will happen to the gradient of the weights connect- ing this layer to the next layer? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 203. 50/67 What will happen during back propagation? Recall that w1 is proportional to the activation passing through it If all the activations in a layer are very close to 0, what will happen to the gradient of the weights connect- ing this layer to the next layer? They will all be close to 0 (vanishing gradient problem) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 204. 51/67 Let us try to initialize the weights to large random numbers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 205. 51/67 tanh activation with large weights Let us try to initialize the weights to large random numbers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 206. 51/67 sigmoid activations with large weights Let us try to initialize the weights to large random numbers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 207. 51/67 sigmoid activations with large weights Let us try to initialize the weights to large random numbers Most activations have saturated Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 208. 51/67 sigmoid activations with large weights Let us try to initialize the weights to large random numbers Most activations have saturated What happens to the gradients at sat- uration? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 209. 51/67 sigmoid activations with large weights Let us try to initialize the weights to large random numbers Most activations have saturated What happens to the gradients at sat- uration? They will all be close to 0 (vanishing gradient problem) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 210. 52/67 Let us try to arrive at a more principled way of initializing weights Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 211. 52/67 x1 x2 x3 s1ns11 xn Let us try to arrive at a more principled way of initializing weights s11 = n i=1 w1ixi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 212. 52/67 x1 x2 x3 s1ns11 xn Let us try to arrive at a more principled way of initializing weights s11 = n i=1 w1ixi V ar(s11) = V ar( n i=1 w1ixi) = n i=1 V ar(w1ixi) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 213. 52/67 x1 x2 x3 s1ns11 xn Let us try to arrive at a more principled way of initializing weights s11 = n i=1 w1ixi V ar(s11) = V ar( n i=1 w1ixi) = n i=1 V ar(w1ixi) = n i=1 (E[w1i])2 V ar(xi) + (E[xi])2 V ar(w1i) + V ar(xi)V ar(w1i) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 214. 52/67 x1 x2 x3 s1ns11 xn [Assuming 0 Mean inputs and weights] Let us try to arrive at a more principled way of initializing weights s11 = n i=1 w1ixi V ar(s11) = V ar( n i=1 w1ixi) = n i=1 V ar(w1ixi) = n i=1 (E[w1i])2 V ar(xi) + (E[xi])2 V ar(w1i) + V ar(xi)V ar(w1i) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 215. 52/67 x1 x2 x3 s1ns11 xn [Assuming 0 Mean inputs and weights] Let us try to arrive at a more principled way of initializing weights s11 = n i=1 w1ixi V ar(s11) = V ar( n i=1 w1ixi) = n i=1 V ar(w1ixi) = n i=1 (E[w1i])2 V ar(xi) + (E[xi])2 V ar(w1i) + V ar(xi)V ar(w1i) = n i=1 V ar(xi)V ar(w1i) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 216. 52/67 x1 x2 x3 s1ns11 xn [Assuming 0 Mean inputs and weights] [Assuming V ar(xi) = V ar(x)∀i ] [Assuming V ar(w1i) = V ar(w)∀i] Let us try to arrive at a more principled way of initializing weights s11 = n i=1 w1ixi V ar(s11) = V ar( n i=1 w1ixi) = n i=1 V ar(w1ixi) = n i=1 (E[w1i])2 V ar(xi) + (E[xi])2 V ar(w1i) + V ar(xi)V ar(w1i) = n i=1 V ar(xi)V ar(w1i) = (nV ar(w))(V ar(x)) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 217. 53/67 x1 x2 x3 s1ns11 xn In general, V ar(S1i) = (nV ar(w))(V ar(x)) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 218. 53/67 x1 x2 x3 s1ns11 xn In general, V ar(S1i) = (nV ar(w))(V ar(x)) What would happen if nV ar(w) ≫ 1 ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 219. 53/67 x1 x2 x3 s1ns11 xn In general, V ar(S1i) = (nV ar(w))(V ar(x)) What would happen if nV ar(w) ≫ 1 ? The variance of S1i will be large Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 220. 53/67 x1 x2 x3 s1ns11 xn In general, V ar(S1i) = (nV ar(w))(V ar(x)) What would happen if nV ar(w) ≫ 1 ? The variance of S1i will be large What would happen if nV ar(w) → 0? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 221. 53/67 x1 x2 x3 s1ns11 xn In general, V ar(S1i) = (nV ar(w))(V ar(x)) What would happen if nV ar(w) ≫ 1 ? The variance of S1i will be large What would happen if nV ar(w) → 0? The variance of S1i will be small Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
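A quick numerical check of Var(s1i) = n · Var(w) · Var(x) (a sketch with arbitrary sizes and scales, not lecture code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.standard_normal((10000, n))        # zero-mean inputs with Var(x) = 1
w = rng.standard_normal(n) * 0.1           # zero-mean weights with Var(w) = 0.01
s = X @ w                                  # s11 = sum_i w_1i * x_i, computed for many input samples
print(s.var(), n * w.var() * X.var())      # both are close to n * Var(w) * Var(x) = 5
```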
  • 222. 54/67 Let us see what happens if we add one more layer Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 223. 54/67 x1 x2 x3 s1ns11 s21 xn Let us see what happens if we add one more layer Using the same procedure as above we will arrive at Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 224. 54/67 x1 x2 x3 s1ns11 s21 xn Let us see what happens if we add one more layer Using the same procedure as above we will arrive at V ar(s21) = n i=1 V ar(s1i)V ar(w2i) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 225. 54/67 x1 x2 x3 s1ns11 s21 xn Let us see what happens if we add one more layer Using the same procedure as above we will arrive at V ar(s21) = n i=1 V ar(s1i)V ar(w2i) = nV ar(s1i)V ar(w2) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 226. 54/67 x1 x2 x3 s1ns11 s21 xn V ar(Si1) = nV ar(w1)V ar(x) Let us see what happens if we add one more layer Using the same procedure as above we will arrive at V ar(s21) = n i=1 V ar(s1i)V ar(w2i) = nV ar(s1i)V ar(w2) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 227. 54/67 x1 x2 x3 s1ns11 s21 xn V ar(Si1) = nV ar(w1)V ar(x) Let us see what happens if we add one more layer Using the same procedure as above we will arrive at V ar(s21) = n i=1 V ar(s1i)V ar(w2i) = nV ar(s1i)V ar(w2) V ar(s21) ∝ [nV ar(w2)][nV ar(w1)]V ar(x) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 228. 54/67 x1 x2 x3 s1ns11 s21 xn V ar(Si1) = nV ar(w1)V ar(x) Let us see what happens if we add one more layer Using the same procedure as above we will arrive at V ar(s21) = n i=1 V ar(s1i)V ar(w2i) = nV ar(s1i)V ar(w2) V ar(s21) ∝ [nV ar(w2)][nV ar(w1)]V ar(x) ∝ [nV ar(w)]2 V ar(x) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 229. 54/67 x1 x2 x3 s1ns11 s21 xn V ar(Si1) = nV ar(w1)V ar(x) Let us see what happens if we add one more layer Using the same procedure as above we will arrive at V ar(s21) = n i=1 V ar(s1i)V ar(w2i) = nV ar(s1i)V ar(w2) V ar(s21) ∝ [nV ar(w2)][nV ar(w1)]V ar(x) ∝ [nV ar(w)]2 V ar(x) Assuming weights across all layers have the same variance Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 230. 55/67 In general, Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 231. 55/67 In general, V ar(ski) = [nV ar(w)]k V ar(x) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 232. 55/67 In general, V ar(ski) = [nV ar(w)]k V ar(x) To ensure that variance in the output of any layer does not blow up or shrink we want: nV ar(w) = 1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 233. 55/67 In general, V ar(ski) = [nV ar(w)]k V ar(x) To ensure that variance in the output of any layer does not blow up or shrink we want: nV ar(w) = 1 If we draw the weights from a unit Gaussian and scale them by 1√ n then, we have : Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 234. 55/67 In general, V ar(ski) = [nV ar(w)]k V ar(x) To ensure that variance in the output of any layer does not blow up or shrink we want: nV ar(w) = 1 If we draw the weights from a unit Gaussian and scale them by 1√ n then, we have : nV ar(w) = nV ar( z √ n ) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 235. 55/67 V ar(az) = a2 (V ar(z)) In general, V ar(ski) = [nV ar(w)]k V ar(x) To ensure that variance in the output of any layer does not blow up or shrink we want: nV ar(w) = 1 If we draw the weights from a unit Gaussian and scale them by 1√ n then, we have : nV ar(w) = nV ar( z √ n ) = n ∗ 1 n V ar(z) = 1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 236. 55/67 V ar(az) = a2 (V ar(z)) In general, V ar(ski) = [nV ar(w)]k V ar(x) To ensure that variance in the output of any layer does not blow up or shrink we want: nV ar(w) = 1 If we draw the weights from a unit Gaussian and scale them by 1√ n then, we have : nV ar(w) = nV ar( z √ n ) = n ∗ 1 n V ar(z) = 1 ← (UnitGaussian) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 237. 56/67 Let’s see what happens if we use this initialization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 238. 56/67 tanh activation Let’s see what happens if we use this initialization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 239. 56/67 sigmoid activations Let’s see what happens if we use this initialization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
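These plots can be reproduced with the same forward-propagation sketch as before, now with the 1/√n scaling (same illustrative assumptions about sizes): the spread of the activations no longer collapses to 0 or saturates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
H = rng.standard_normal((1000, n))                 # unit-Gaussian inputs
for layer in range(5):
    W = rng.standard_normal((n, n)) / np.sqrt(n)   # unit Gaussian scaled by 1/sqrt(n), so n*Var(w) = 1
    H = np.tanh(H @ W)
    print(layer, round(H.std(), 3))                # shrinks only mildly instead of collapsing or saturating
```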
  • 240. 57/67 However this does not work for ReLU neurons Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 241. 57/67 However this does not work for ReLU neurons Why ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 242. 57/67 However this does not work for ReLU neurons Why ? Intuition: He et.al. argue that a factor of 2 is needed when dealing with ReLU Neurons Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 243. 57/67 However this does not work for ReLU neurons Why ? Intuition: He et.al. argue that a factor of 2 is needed when dealing with ReLU Neurons Intuitively this happens because the range of ReLU neurons is restricted only to the positive half of the space Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 244. 58/67 Indeed when we account for this factor of 2 we see better performance Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
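The effect of this extra factor of 2 can be seen with the same kind of experiment (a sketch; the 10 layers and sizes below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
relu = lambda z: np.maximum(0.0, z)

for name, scale in [("1/sqrt(n)", 1.0 / np.sqrt(n)), ("sqrt(2/n) (He)", np.sqrt(2.0 / n))]:
    H = rng.standard_normal((1000, n))
    for _ in range(10):                           # 10 ReLU layers
        W = rng.standard_normal((n, n)) * scale
        H = relu(H @ W)
    print(name, round(H.std(), 4))                # activations shrink without the factor of 2, stay stable with it
```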
  • 246. 59/67 Module 9.5 : Batch Normalization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 247. 60/67 We will now see a method called batch normalization which allows us to be less careful about initialization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 248. 61/67 To understand the intuition behind Batch Nor- malization let us consider a deep network Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 249. 61/67 x1 x2 x3 h0 h1 h2 h3 h4 To understand the intuition behind Batch Nor- malization let us consider a deep network Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 250. 61/67 x1 x2 x3 h0 h1 h2 h3 h4 To understand the intuition behind Batch Nor- malization let us consider a deep network Let us focus on the learning process for the weights between these two layers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 251. 61/67 x1 x2 x3 h0 h1 h2 h3 h4 To understand the intuition behind Batch Nor- malization let us consider a deep network Let us focus on the learning process for the weights between these two layers Typically we use mini-batch algorithms Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 252. 61/67 x1 x2 x3 h0 h1 h2 h3 h4 To understand the intuition behind Batch Nor- malization let us consider a deep network Let us focus on the learning process for the weights between these two layers Typically we use mini-batch algorithms What would happen if there is a constant change in the distribution of h3 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 253. 61/67 x1 x2 x3 h0 h1 h2 h3 h4 To understand the intuition behind Batch Nor- malization let us consider a deep network Let us focus on the learning process for the weights between these two layers Typically we use mini-batch algorithms What would happen if there is a constant change in the distribution of h3 In other words what would happen if across mini- batches the distribution of h3 keeps changing Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 254. 61/67 x1 x2 x3 h0 h1 h2 h3 h4 To understand the intuition behind Batch Nor- malization let us consider a deep network Let us focus on the learning process for the weights between these two layers Typically we use mini-batch algorithms What would happen if there is a constant change in the distribution of h3 In other words what would happen if across mini- batches the distribution of h3 keeps changing Would the learning process be easy or hard? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 255. 62/67 It would help if the pre-activations at each layer were unit gaussians Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 259. 62/67 It would help if the pre-activations at each layer were unit gaussians Why not explicitly ensure this by standardizing the pre-activation ? ˆsik = (sik − E[sik]) / √var(sik) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 260. 62/67 It would help if the pre-activations at each layer were unit gaussians Why not explicitly ensure this by standardizing the pre-activation ? ˆsik = (sik − E[sik]) / √var(sik) But how do we compute E[sik] and Var[sik]? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 261. 62/67 It would help if the pre-activations at each layer were unit gaussians Why not explicitly ensure this by standardizing the pre-activation ? ˆsik = (sik − E[sik]) / √var(sik) But how do we compute E[sik] and Var[sik]? We compute it from a mini-batch Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 262. 62/67 It would help if the pre-activations at each layer were unit gaussians Why not explicitly ensure this by standardizing the pre-activation ? ˆsik = (sik − E[sik]) / √var(sik) But how do we compute E[sik] and Var[sik]? We compute it from a mini-batch Thus we are explicitly ensuring that the distribution of the inputs at different layers does not change across batches Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 263. 63/67 This is what the deep network will look like with Batch Normalization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 264. 63/67 This is what the deep network will look like with Batch Normalization Is this legal ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 265. 63/67 This is what the deep network will look like with Batch Normalization Is this legal ? Yes, it is because just as the tanh layer is dif- ferentiable, the Batch Normalization layer is also differentiable Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 266. 63/67 This is what the deep network will look like with Batch Normalization Is this legal ? Yes, it is because just as the tanh layer is dif- ferentiable, the Batch Normalization layer is also differentiable Hence we can backpropagate through this layer Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 267. 64/67 Catch: Do we necessarily want to force a unit gaussian input to the tanh layer? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 268. 64/67 Catch: Do we necessarily want to force a unit gaussian input to the tanh layer? Why not let the network learn what is best for it? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 269. 64/67 Catch: Do we necessarily want to force a unit gaussian input to the tanh layer? Why not let the network learn what is best for it? After the Batch Normalization step add the following step: y(k) = γk ˆsik + β(k) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 270. 64/67 γk and βk are additional parameters of the network. Catch: Do we necessarily want to force a unit gaussian input to the tanh layer? Why not let the network learn what is best for it? After the Batch Normalization step add the following step: y(k) = γk ˆsik + β(k) What happens if the network learns: γk = √var(xk) βk = E[xk] Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 271. 64/67 γk and βk are additional parameters of the network. Catch: Do we necessarily want to force a unit gaussian input to the tanh layer? Why not let the network learn what is best for it? After the Batch Normalization step add the following step: y(k) = γk ˆsik + β(k) What happens if the network learns: γk = √var(xk) βk = E[xk] We will recover sik Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 272. 64/67 γk and βk are additional parameters of the network. Catch: Do we necessarily want to force a unit gaussian input to the tanh layer? Why not let the network learn what is best for it? After the Batch Normalization step add the following step: y(k) = γk ˆsik + β(k) What happens if the network learns: γk = √var(xk) βk = E[xk] We will recover sik In other words by adjusting these additional parameters the network can learn to recover sik if that is more favourable Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
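A minimal sketch of this layer's forward pass for one mini-batch (illustrative; eps is the usual small constant added for numerical stability, which the slide omits):

```python
import numpy as np

def batchnorm_forward(S, gamma, beta, eps=1e-5):
    # S: (batch, features) pre-activations s_ik for one mini-batch
    mu = S.mean(axis=0)                       # E[s_k], estimated from the mini-batch
    var = S.var(axis=0)                       # var(s_k), estimated from the mini-batch
    S_hat = (S - mu) / np.sqrt(var + eps)     # standardize: zero mean, unit variance per feature
    return gamma * S_hat + beta               # learnable scale and shift: y_k = gamma_k * s_hat_k + beta_k

rng = np.random.default_rng(0)
S = rng.standard_normal((32, 4)) * 10.0 + 3.0             # a badly scaled mini-batch
Y = batchnorm_forward(S, gamma=np.ones(4), beta=np.zeros(4))
print(Y.mean(axis=0).round(3), Y.std(axis=0).round(3))    # ~0 and ~1 per feature
# If the network instead learned gamma = sqrt(var(s_k)) and beta = E[s_k], Y would recover S.
```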
  • 273. 65/67 We will now compare the performance with and without batch normalization on MNIST data using 2 layers.... Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 274. 66/67 [Plots comparing training on MNIST with and without batch normalization] Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 294. 67/67 2016-17: Still exciting times Even better optimization methods Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 295. 67/67 2016-17: Still exciting times Even better optimization methods Data driven initialization methods Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 296. 67/67 2016-17: Still exciting times Even better optimization methods Data driven initialization methods Beyond batch normalization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9