CS7015 (Deep Learning) : Lecture 9
Greedy Layerwise Pre-training, Better activation functions, Better weight
initialization methods, Batch Normalization
Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Module 9.1 : A quick recap of training deep neural
networks
[Figure: a single sigmoid neuron with input x, weight w and output y, and a wider neuron with inputs x1, x2, x3 and weights w1, w2, w3]

We already saw how to train this network:
w = w − η∇w, where
∇w = ∂L(w)/∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x

What about a wider network with more inputs:
w1 = w1 − η∇w1
w2 = w2 − η∇w2
w3 = w3 − η∇w3
where ∇wi = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ xi
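As a concrete illustration of this update rule, here is a minimal sketch (not from the lecture) for a single sigmoid neuron trained on one example with a squared-error loss; the values of x, y, w and η below are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one training example and initial settings (illustrative assumptions)
x, y = 0.5, 1.0
w, eta = 0.1, 1.0

for step in range(100):
    f = sigmoid(w * x)                    # forward pass: f(x) = sigma(w * x)
    grad_w = (f - y) * f * (1 - f) * x    # gradient of 0.5 * (f(x) - y)^2 w.r.t. w
    w = w - eta * grad_w                  # the update rule w = w - eta * grad_w
```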
[Figure: a chain of sigmoid layers x = h0 → a1, h1 → a2, h2 → a3 → y with weights w1, w2, w3]

ai = wihi−1; hi = σ(ai)
a1 = w1 ∗ x = w1 ∗ h0

What if we have a deeper network?
We can now calculate ∇w1 using the chain rule:
∂L(w)/∂w1 = ∂L(w)/∂y · ∂y/∂a3 · ∂a3/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂w1
          = ∂L(w)/∂y ∗ ............... ∗ h0
In general,
∇wi = ∂L(w)/∂y ∗ ............... ∗ hi−1
Notice that ∇wi is proportional to the corresponding input hi−1 (we will use this fact later)
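The chain-rule product above can be evaluated term by term. A small sketch, assuming scalar layers as in the figure, a sigmoid at the output, and the squared-error loss L = 0.5(ŷ − y)²; the weights and data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.0, 0.0
w1, w2, w3 = 0.5, -0.3, 0.8          # illustrative weights

# forward pass: ai = wi * h_{i-1}, hi = sigma(ai)
h0 = x
a1 = w1 * h0; h1 = sigmoid(a1)
a2 = w2 * h1; h2 = sigmoid(a2)
a3 = w3 * h2; y_hat = sigmoid(a3)

# backward pass: multiply the local derivatives along the single path
dL_dy   = (y_hat - y)                 # dL/dy for L = 0.5 * (y_hat - y)^2
dy_da3  = y_hat * (1 - y_hat)
da3_dh2 = w3
dh2_da2 = h2 * (1 - h2)
da2_dh1 = w2
dh1_da1 = h1 * (1 - h1)
da1_dw1 = h0                          # this is why grad_w1 is proportional to h0

grad_w1 = dL_dy * dy_da3 * da3_dh2 * dh2_da2 * da2_dh1 * dh1_da1 * da1_dw1
```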
[Figure: a deep and wide network with inputs x1, x2, x3, two hidden layers of sigmoid neurons, an output y, and weights w1, w2, w3 marked on some of the edges]

What happens if we have a network which is deep and wide?
How do you calculate ∇w2?
It will be given by the chain rule applied across multiple paths (we saw this in detail when we studied backpropagation)
Things to remember
Training Neural Networks is a Game of Gradients (played using any of the existing gradient based approaches that we discussed)
The gradient tells us the responsibility of a parameter towards the loss
The gradient w.r.t. a parameter is proportional to the input to that parameter (recall the “..... ∗ x” term or the “.... ∗ hi” term in the formula for ∇wi)
[Figure: the same deep and wide network of sigmoid neurons]

Backpropagation was made popular by Rumelhart et al. in 1986
However, when used for really deep networks it was not very successful
In fact, till 2006 it was very hard to train very deep networks
Typically, even after a large number of epochs the training did not converge
Module 9.2 : Unsupervised pre-training
What has changed now? How did Deep Learning become so popular despite this problem with training large networks?
Well, until 2006 it wasn’t so popular
The field got revived after the seminal work of Hinton and Salakhutdinov in 2006¹
¹ G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
Let’s look at the idea of unsupervised pre-training introduced in this paper ...
(note that in this paper they introduced the idea in the context of RBMs but we will discuss it in the context of Autoencoders)
[Figure: a deep network in which the first two layers (x and h1) together with a reconstruction layer x̂ form an autoencoder that reconstructs x]

min (1/m) Σ_{i=1..m} Σ_{j=1..n} (x̂_ij − x_ij)²

Consider the deep neural network shown in this figure
Let us focus on the first two layers of the network (x and h1)
We will first train the weights between these two layers using an unsupervised objective
Note that we are trying to reconstruct the input (x) from the hidden representation (h1)
We refer to this as an unsupervised objective because it does not involve the output label (y) and only uses the input data (x)
At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x
We now fix the weights in layer 1 and repeat the same process with layer 2, reconstructing h1 from h2:

min (1/m) Σ_{i=1..m} Σ_{j=1..n} (ĥ1_ij − h1_ij)²

At the end of this step, the weights in layer 2 are trained such that h2 captures an abstract representation of h1
We continue this process till the last hidden layer (i.e., the layer before the output layer) so that each successive layer captures an abstract representation of the previous layer
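A rough sketch of this greedy layerwise procedure using autoencoders (the original paper used RBMs); the layer sizes, learning rate and number of epochs below are illustrative assumptions, and the decoders are used only during pre-training:

```python
import torch
import torch.nn as nn

layer_sizes = [500, 256, 128, 64]        # input dimension followed by hidden sizes (assumed)
X = torch.randn(1000, layer_sizes[0])    # stand-in for the unlabeled training inputs

encoders = []
H = X
for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
    dec = nn.Linear(d_out, d_in)         # reconstructs the previous layer's output
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for epoch in range(10):
        H_hat = dec(enc(H))
        loss = ((H_hat - H) ** 2).mean() # unsupervised reconstruction objective
        opt.zero_grad(); loss.backward(); opt.step()
    encoders.append(enc)
    H = enc(H).detach()                  # fix this layer and move on to the next one

# after pre-training: add the output layer and fine-tune the whole stack
# with the supervised, task-specific objective
model = nn.Sequential(*encoders, nn.Linear(layer_sizes[-1], 1))
```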
[Figure: the full network with inputs x1, x2, x3 and the output layer added on top of the pre-trained layers]

min_θ (1/m) Σ_{i=1..m} (yi − f(xi))²

After this layerwise pre-training, we add the output layer and train the whole network using the task specific objective
Note that, in effect, we have initialized the weights of the network using the greedy unsupervised objective and are now fine tuning these weights using the supervised objective
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?
Let’s see what these two questions mean and try to answer them based on some (among many) existing studies¹,²
¹ The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al, 2009
² Exploring Strategies for Training Deep Neural Networks, Larochelle et al, 2009
What is the optimization problem that we are trying to solve?

minimize L(θ) = (1/m) Σ_{i=1..m} (yi − f(xi))²

Is it the case that in the absence of unsupervised pre-training we are not able to drive L(θ) to 0 even for the training data (hence poor optimization)?
Let us see this in more detail ...
The error surface of the supervised objective of a Deep Neural Network is highly non-convex
With many hills and plateaus and valleys
Given the large capacity of DNNs it is still easy to land in one of these 0 error regions
Indeed Larochelle et al.¹ show that if the last layer has large capacity then L(θ) goes to 0 even without pre-training
However, if the capacity of the network is small, unsupervised pre-training helps
¹ Exploring Strategies for Training Deep Neural Networks, Larochelle et al, 2009
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?
What does regularization do? It constrains the weights to certain regions of the parameter space
L-1 regularization: constrains most weights to be 0
L-2 regularization: prevents most weights from taking large values
¹ Image Source: The Elements of Statistical Learning - T. Hastie, R. Tibshirani, and J. Friedman, Pg 71
Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space
Specifically, it constrains the weights to lie in regions where the characteristics of the data are captured well (as governed by the unsupervised objective)
We can think of this unsupervised objective as an additional constraint on the optimization problem

Unsupervised objective:
Ω(θ) = (1/m) Σ_{i=1..m} Σ_{j=1..n} (x_ij − x̂_ij)²

Supervised objective:
L(θ) = (1/m) Σ_{i=1..m} (yi − f(xi))²

This unsupervised objective ensures that the learning is not greedy w.r.t. the supervised objective (and also satisfies the unsupervised objective)
Some other experiments have also shown that pre-training is more robust to random initializations
One accepted hypothesis is that pre-training leads to better weight initializations (so that the layers capture the internal characteristics of the data)
¹ The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al, 2009
So what has happened since 2006-2009?
Deep Learning has evolved
Better optimization algorithms
Better regularization methods
Better activation functions
Better weight initialization strategies
Module 9.3 : Better activation functions
Before we look at activation functions, let’s try to answer the following question: “What makes Deep Neural Networks powerful?”
[Figure: a deep network h0 = x → a1, h1 → a2, h2 → a3 → y with weights w1, w2, w3, and a dataset whose classes are separated by a non-linear decision boundary]

Consider this deep neural network
Imagine if we replace the sigmoid in each layer by a simple linear transformation:
y = (w4 ∗ (w3 ∗ (w2 ∗ (w1x))))
Then we will just learn y as a linear transformation of x
In other words we will be constrained to learning linear decision boundaries
We cannot learn arbitrary decision boundaries
In particular, a deep linear neural network cannot learn such boundaries
But a deep non-linear neural network can indeed learn such boundaries (recall the Universal Approximation Theorem)
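A quick numeric check of this point (matrix sizes are illustrative): a stack of purely linear layers collapses into a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3, W4 = (rng.standard_normal((4, 4)) for _ in range(4))
x = rng.standard_normal(4)

y_deep = W4 @ (W3 @ (W2 @ (W1 @ x)))      # "deep" network with linear layers only
W_single = W4 @ W3 @ W2 @ W1              # one equivalent linear transformation
assert np.allclose(y_deep, W_single @ x)  # the deep linear network adds no expressive power
```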
Now let’s look at some non-linear activation functions that are typically used in deep neural networks (Much of this material is taken from Andrej Karpathy’s lecture notes¹)
¹ http://cs231n.github.io
Sigmoid

σ(x) = 1 / (1 + e^(−x))

As is obvious, the sigmoid function compresses all its inputs to the range [0,1]
Since we are always interested in gradients, let us find the gradient of this function:
∂σ(x)/∂x = σ(x)(1 − σ(x))
(you can easily derive it)
Let us see what happens if we use sigmoid in a deep network
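A small sketch of the sigmoid and the gradient formula above (the sample points are arbitrary), showing how quickly the gradient shrinks away from 0:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # dsigma/dx = sigma(x) * (1 - sigma(x))

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid(x), sigmoid_grad(x))
# the gradient peaks at 0.25 at x = 0 and is already ~4.5e-5 at x = 10
```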
[Figure: a chain of sigmoid layers h0 = x with pre-activations a1, ..., a4 and activations h1, ..., h4]

a3 = w3h2
h3 = σ(a3)

While calculating ∇w2, at some point in the chain rule we will encounter
∂h3/∂a3 = ∂σ(a3)/∂a3 = σ(a3)(1 − σ(a3))
What is the consequence of this?
To answer this question let us first understand the concept of a saturated neuron
[Figure: the sigmoid curve]

A sigmoid neuron is said to have saturated when σ(x) = 1 or σ(x) = 0
What would the gradient be at saturation?
Well, it would be 0 (you can see it from the plot or from the formula that we derived)
Saturated neurons thus cause the gradient to vanish.
[Figure: a sigmoid neuron computing σ(Σ_{i=1..4} wixi) with weights w1, w2, w3, w4, alongside the sigmoid curve]

But why would the neurons saturate?
Consider what would happen if we use sigmoid neurons and initialize the weights to very high values
The neurons will saturate very quickly
The gradients will vanish and the training will stall (more on this later)
Saturated neurons cause the gradient to vanish
Sigmoids are not zero centered
Why is this a problem?

[Figure: a network in which two sigmoid outputs h21 and h22 feed the output neuron through a3 = w1 ∗ h21 + w2 ∗ h22, with h0 = x, h1, h2 the earlier layers]

Consider the gradient w.r.t. w1 and w2:
∇w1 = ∂L(w)/∂y · ∂y/∂h3 · ∂h3/∂a3 · h21
∇w2 = ∂L(w)/∂y · ∂y/∂h3 · ∂h3/∂a3 · h22
Note that h21 and h22 are between [0, 1] (i.e., they are both positive)
So if the first common term (the shared factor ∂L(w)/∂y · ∂y/∂h3 · ∂h3/∂a3, shown in red on the slide) is positive (negative) then both ∇w1 and ∇w2 are positive (negative)
Essentially, either all the gradients at a layer are positive or all the gradients at a layer are negative
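A tiny check of this claim (all values are illustrative): with strictly positive inputs h21, h22, the gradients w.r.t. w1 and w2 always share the sign of the common upstream factor.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h21, h22 = 0.3, 0.8                      # sigmoid outputs of the previous layer, both in (0, 1)
w1, w2, y_true = 0.5, -0.2, 1.0          # illustrative weights and target

a3 = w1 * h21 + w2 * h22
y = sigmoid(a3)
common = (y - y_true) * y * (1 - y)      # the shared factor dL/dy * dy/da3

grad_w1 = common * h21
grad_w2 = common * h22
assert np.sign(grad_w1) == np.sign(grad_w2)   # both positive or both negative
```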
This restricts the possible update directions:

[Figure: the (∇w1, ∇w2) plane; only the quadrant in which all gradients are +ve and the quadrant in which all gradients are -ve are allowed, the other two quadrants are not possible. Now imagine that the optimal w lies in one of the disallowed directions.]
[Figure: starting from an initial position in the (w1, w2) plane, the only way to reach the optimal w is by taking a zigzag path, since each update can only move in an all-positive or all-negative direction]

And lastly, sigmoids are computationally expensive (because of exp(x))
tanh(x)

[Figure: the tanh curve]

f(x) = tanh(x)
Compresses all its inputs to the range [-1,1]
Zero centered
What is the derivative of this function?
∂tanh(x)/∂x = (1 − tanh²(x))
The gradient still vanishes at saturation
Also computationally expensive
ReLU

f(x) = max(0, x)

Is this a non-linear function? Indeed it is!
In fact we can combine two ReLU units to recover a piecewise linear approximation of the sigmoid function:
f(x) = max(0, x + 1) − max(0, x − 1)
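A quick sketch of that combination: the difference of two shifted ReLUs is a flat-linear-flat ramp, i.e., a piecewise linear, sigmoid-like curve (the sample grid is arbitrary).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-3, 3, 13)
f = relu(x + 1) - relu(x - 1)
# f is 0 for x <= -1, rises linearly as (x + 1) for -1 <= x <= 1, and is 2 for x >= 1
print(np.round(f, 2))
```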
ReLU

f(x) = max(0, x)

Advantages of ReLU
Does not saturate in the positive region
Computationally efficient
In practice converges much faster than sigmoid/tanh¹
¹ ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012
[Figure: a small network with inputs x1, x2 and a constant input 1, weights w1, w2 and bias b feeding a ReLU unit (pre-activation a1, output h1), followed by weight w3, pre-activation a2 and output y]

In practice there is a caveat
Let’s see what is the derivative of ReLU(x):
∂ReLU(x)/∂x = 0 if x < 0
            = 1 if x > 0
Now consider the given network
What would happen if at some point a large gradient causes the bias b to be updated to a large negative value?
w1x1 + w2x2 + b < 0 [if b << 0]
The neuron would output 0 [dead neuron]
Not only would the output be 0, but during backpropagation even the gradient ∂h1/∂a1 would be zero
The weights w1, w2 and b will not get updated [there will be a zero term in the chain rule]:
∇w1 = ∂L(θ)/∂y · ∂y/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂w1
The neuron will now stay dead forever!!
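A small sketch of this dead-neuron situation (the weights, bias and data are illustrative): once the bias is very negative, the pre-activation is negative for every input, so both the output and the local gradient are zero.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

w1, w2, b = 0.5, 0.5, -100.0            # b has been pushed to a large negative value
X = np.random.rand(1000, 2)             # inputs in [0, 1)

a1 = X[:, 0] * w1 + X[:, 1] * w2 + b    # always negative here
h1 = relu(a1)                           # the neuron outputs 0 for every example
dh1_da1 = (a1 > 0).astype(float)        # local gradient is 0 everywhere
print(h1.max(), dh1_da1.max())          # 0.0 0.0 -> no gradient ever reaches w1, w2, b
```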
In practice a large fraction of ReLU units can die if the learning rate is set too high
It is advised to initialize the bias to a positive value (0.01)
Use other variants of ReLU (as we will soon see)
Leaky ReLU

f(x) = max(0.01x, x)

No saturation
Will not die (0.01x ensures that at least a small gradient will flow through)
Computationally efficient
Close to zero centered outputs

Parametric ReLU

f(x) = max(αx, x)

α is a parameter of the model
α will get updated during backpropagation
Exponential Linear Unit

f(x) = x if x > 0
     = a(e^x − 1) if x ≤ 0

All benefits of ReLU
a(e^x − 1) ensures that at least a small gradient will flow through
Close to zero centered outputs
Expensive (requires computation of exp(x))
Maxout Neuron

max(w1ᵀx + b1, w2ᵀx + b2)

Generalizes ReLU and Leaky ReLU
No saturation! No death!
Doubles the number of parameters
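Minimal sketches of the ReLU variants above (the default slope and α values and the two-piece maxout are illustrative choices, not prescriptions):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)

def parametric_relu(x, alpha):           # alpha is learned along with the other weights
    return np.maximum(alpha * x, x)

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def maxout(x, W1, b1, W2, b2):           # max of two learned linear functions of x
    return np.maximum(W1 @ x + b1, W2 @ x + b2)
```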
Things to Remember
Sigmoids are bad
ReLU is more or less the standard unit for Convolutional Neural Networks
Can explore Leaky ReLU/Maxout/ELU
tanh and sigmoid are still used in LSTMs/RNNs (we will see more on this later)
Module 9.4 : Better initialization strategies
[Figure: a network with inputs x1, x2, a hidden layer of sigmoid neurons h11, h12, h13 (pre-activations a11, a12, a13) and an output sigmoid neuron h21 (pre-activation a21)]

What happens if we initialize all weights to 0?
a11 = w11x1 + w12x2
a12 = w21x1 + w22x2
∴ a11 = a12 = 0
∴ h11 = h12
All neurons in layer 1 will get the same activation
Now what will happen during backpropagation?
∇w11 = ∂L(w)/∂y · ∂y/∂h11 · ∂h11/∂a11 · x1
∇w21 = ∂L(w)/∂y · ∂y/∂h12 · ∂h12/∂a12 · x1
but h11 = h12 and a11 = a12
∴ ∇w11 = ∇w21
Hence both the weights will get the same update and remain equal
In fact this symmetry will never break during training
The same is true for w12 and w22
And for all weights in layer 2 (in fact, work out the math and convince yourself that all the weights in this layer will remain equal)
This is known as the symmetry breaking problem
This will happen if all the weights in a network are initialized to the same value
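A quick check of this symmetry problem (the layer sizes and the constant 0.5 are illustrative): when every weight starts at the same value, every hidden neuron gets the same activation and the rows of the first layer's gradient come out identical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 3), nn.Sigmoid(), nn.Linear(3, 1), nn.Sigmoid())
for p in net.parameters():
    nn.init.constant_(p, 0.5)            # initialize every weight and bias to the same value

x = torch.randn(8, 2)
y = torch.rand(8, 1)
loss = ((net(x) - y) ** 2).mean()
loss.backward()

print(net[0].weight.grad)                # all rows are identical: the symmetry never breaks
```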
We will now consider a feedforward network with:
input: 1000 points, each ∈ R^500
input data drawn from a unit Gaussian
the network has 5 layers
each layer has 500 neurons
we will run forward propagation on this network with different weight initializations
Let’s try to initialize the weights to small random numbers
We will see what happens to the activation across different layers

[Figure: distribution of activations in each layer with small random weights, shown for tanh and for sigmoid activation functions]
What will happen during backpropagation?
Recall that ∇w1 is proportional to the activation passing through it
If all the activations in a layer are very close to 0, what will happen to the gradient of the weights connecting this layer to the next layer?
They will all be close to 0 (vanishing gradient problem)
Let us try to initialize the weights to large random numbers

[Figure: distribution of activations in each layer with large random weights, shown for tanh and for sigmoid activation functions]

Most activations have saturated
What happens to the gradients at saturation?
They will all be close to 0 (vanishing gradient problem)
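A sketch of the experiment just described (the scale factors 0.01 and 1.0 standing in for "small" and "large" weights are illustrative): 1000 unit-Gaussian points in R^500 pushed through 5 tanh layers of 500 neurons each.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 500))          # 1000 points, each in R^500

for scale in [0.01, 1.0]:                     # small vs. large random weights
    h = X
    print(f"weight scale = {scale}")
    for layer in range(5):
        W = scale * rng.standard_normal((500, 500))
        h = np.tanh(h @ W)
        print(f"  layer {layer + 1}: mean = {h.mean():+.3f}, std = {h.std():.3f}")
# with small weights the activations shrink towards 0 layer by layer;
# with large weights they saturate near -1 and +1
```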
[Figure: a layer with inputs x1, x2, x3, ..., xn and pre-activations s11, ..., s1n]

Let us try to arrive at a more principled way of initializing weights

s11 = Σ_{i=1..n} w1i xi

Var(s11) = Var(Σ_{i=1..n} w1i xi) = Σ_{i=1..n} Var(w1i xi)
         = Σ_{i=1..n} [ (E[w1i])² Var(xi) + (E[xi])² Var(w1i) + Var(xi) Var(w1i) ]
         = Σ_{i=1..n} Var(xi) Var(w1i)          [assuming zero mean inputs and weights]
         = (n Var(w)) (Var(x))                  [assuming Var(xi) = Var(x) ∀i and Var(w1i) = Var(w) ∀i]
In general,
Var(s1i) = (n Var(w)) (Var(x))
What would happen if n Var(w) ≫ 1? The variance of s1i will be large
What would happen if n Var(w) → 0? The variance of s1i will be small
54/67
[Figure: the same layer with an additional unit s21 stacked on top of s11, ..., s1n]
Let us see what happens if we add one more layer
Using the same procedure as above we will arrive at

Var(s_{21}) = \sum_{i=1}^{n} Var(s_{1i}) Var(w_{2i}) = n Var(s_{1i}) Var(w_2)

Since Var(s_{1i}) = n Var(w_1) Var(x),

Var(s_{21}) ∝ [n Var(w_2)] [n Var(w_1)] Var(x)
           ∝ [n Var(w)]^2 Var(x)    [assuming weights across all layers have the same variance]
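A short simulation of this compounding effect (widths, depth and weight scales are assumptions chosen for illustration; linear layers are used so the prediction is exact): when n Var(w) is below, equal to, or above 1, the activation variance after k layers shrinks, stays put, or blows up roughly as [n Var(w)]^k predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 10
x = rng.standard_normal((10000, n))            # Var(x) ≈ 1

for scale in (0.05, 1.0 / np.sqrt(n), 0.2):    # gives n*Var(w) < 1, = 1, > 1
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, scale, size=(n, n))
        h = h @ W                               # linear layers only, to isolate the variance effect
    n_var_w = n * scale ** 2
    print(f"n*Var(w) = {n_var_w:5.2f}: Var after {depth} layers = {np.var(h):.3e} "
          f"([n Var(w)]^k prediction: {n_var_w ** depth:.3e})")
```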
55/67
In general,
Var(s_{ki}) = [n Var(w)]^k Var(x)
To ensure that the variance in the output of any layer does not blow up or shrink we want:
n Var(w) = 1
If we draw the weights from a unit Gaussian and scale them by 1/\sqrt{n}, then we have:
n Var(w) = n Var(z/\sqrt{n}) = n \cdot \frac{1}{n} Var(z) = 1    [z is a unit Gaussian, and Var(az) = a^2 Var(z)]
56/67
Let's see what happens if we use this initialization
[Figures: histograms of tanh and sigmoid activations with 1/\sqrt{n}-scaled weights]
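As a rough stand-in for those plots, the sketch below (layer size, depth and batch size are arbitrary assumptions) passes data through tanh layers whose weights are drawn from a unit Gaussian and scaled by 1/sqrt(n); the spread of the activations stays roughly constant across layers instead of collapsing towards 0 or saturating.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 8
h = rng.standard_normal((2000, n))

for layer in range(depth):
    W = rng.standard_normal((n, n)) / np.sqrt(n)   # unit Gaussian scaled by 1/sqrt(n)
    h = np.tanh(h @ W)
    print(f"layer {layer + 1}: std of tanh activations = {np.std(h):.3f}")
```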
57/67
However this does not work for ReLU neurons
Why?
Intuition: He et al. argue that a factor of 2 is needed when dealing with ReLU neurons
Intuitively this happens because the range of ReLU neurons is restricted only to the positive half of the space
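The halving effect can be seen numerically. In the sketch below (widths, depth and batch size are assumptions made for illustration), a ReLU network initialized with 1/sqrt(n) sees its activation scale shrink layer after layer, while scaling the weights by sqrt(2/n) keeps it stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 10
x = rng.standard_normal((2000, n))

for scale, name in ((1.0 / np.sqrt(n), "1/sqrt(n)"), (np.sqrt(2.0 / n), "sqrt(2/n)")):
    h = x
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale
        h = np.maximum(0.0, h @ W)             # ReLU keeps only the positive half of the space
    print(f"{name:>9}: std of activations after {depth} layers = {np.std(h):.4f}")
```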
58/67
Indeed, when we account for this factor of 2 we see better performance
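For reference, deep learning libraries expose both schemes directly; the snippet below shows the PyTorch initializers (the layer size here is an arbitrary assumption), with the Kaiming/He variant carrying the extra factor of 2 for ReLU layers.

```python
import torch.nn as nn

linear_tanh = nn.Linear(256, 256)
linear_relu = nn.Linear(256, 256)

# Glorot/Xavier scheme: scales weights using the layer's fan (suited to tanh/sigmoid layers)
nn.init.xavier_normal_(linear_tanh.weight)

# He/Kaiming scheme: the same idea with the extra factor of 2 for ReLU layers
nn.init.kaiming_normal_(linear_relu.weight, nonlinearity='relu')
```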
59/67
Module 9.5 : Batch Normalization
60/67
We will now see a method called batch normalization which allows us to be less careful about initialization
61/67
[Figure: a deep network with inputs x1, x2, x3 and hidden layers h0, h1, h2, h3, h4]
To understand the intuition behind Batch Normalization let us consider a deep network
Let us focus on the learning process for the weights between the last two hidden layers (h3 and h4)
Typically we use mini-batch algorithms
What would happen if there is a constant change in the distribution of h3?
In other words, what would happen if across mini-batches the distribution of h3 keeps changing?
Would the learning process be easy or hard?
62/67
It would help if the pre-activations at each layer were unit Gaussians
Why not explicitly ensure this by standardizing the pre-activation?

\hat{s}_{ik} = \frac{s_{ik} - E[s_{ik}]}{\sqrt{Var[s_{ik}]}}

But how do we compute E[s_{ik}] and Var[s_{ik}]?
We compute them from a mini-batch
Thus we are explicitly ensuring that the distribution of the inputs at different layers does not change across batches
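A minimal NumPy version of this standardization step (the shapes and the eps constant are assumptions, not taken from the lecture): statistics are computed per unit over the mini-batch dimension, and the standardized pre-activations come out with mean close to 0 and standard deviation close to 1.

```python
import numpy as np

def standardize_batch(s, eps=1e-5):
    """Standardize pre-activations s of shape (batch_size, num_units)
    using statistics computed over the mini-batch."""
    mean = s.mean(axis=0)                     # per-unit E[s_k], estimated from the batch
    var = s.var(axis=0)                       # per-unit Var[s_k], estimated from the batch
    return (s - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
s = 3.0 + 2.0 * rng.standard_normal((32, 8))  # a mini-batch of 32 pre-activation vectors
s_hat = standardize_batch(s)
print(s_hat.mean(axis=0).round(3))            # ≈ 0 for every unit
print(s_hat.std(axis=0).round(3))             # ≈ 1 for every unit
```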
63/67
This is what the deep network will look like with Batch Normalization
Is this legal?
Yes, it is, because just as the tanh layer is differentiable, the Batch Normalization layer is also differentiable
Hence we can backpropagate through this layer
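A quick way to convince ourselves of this is to let an autograd framework differentiate through the layer. The PyTorch sketch below (layer size and batch size are arbitrary assumptions) places a BatchNorm layer before a tanh and calls backward(); gradients reach the inputs as well as the layer's own gamma (weight) and beta (bias) parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(num_features=4)        # has learnable gamma (bn.weight) and beta (bn.bias)
x = torch.randn(8, 4, requires_grad=True)  # a mini-batch of 8 examples

y = torch.tanh(bn(x)).sum()                # batch normalization feeding into tanh
y.backward()                               # backpropagate through the BN layer

print(x.grad.shape)                        # torch.Size([8, 4]): the inputs receive gradients
print(bn.weight.grad, bn.bias.grad)        # gradients w.r.t. gamma and beta
```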
64/67
Catch: Do we necessarily want to force a unit Gaussian input to the tanh layer?
Why not let the network learn what is best for it?
After the Batch Normalization step add the following step:

y^{(k)} = \gamma^{(k)} \hat{s}_{ik} + \beta^{(k)}

\gamma^{(k)} and \beta^{(k)} are additional parameters of the network.
What happens if the network learns:
\gamma^{(k)} = \sqrt{Var[x^{(k)}]} and \beta^{(k)} = E[x^{(k)}]?
We will recover s_{ik}
In other words, by adjusting these additional parameters the network can learn to recover s_{ik} if that is more favourable
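A small numerical check of this recovery (the array shapes are assumptions; eps is dropped here only so the identity check is exact): if gamma is set to the batch standard deviation and beta to the batch mean, the scale-and-shift step undoes the standardization.

```python
import numpy as np

rng = np.random.default_rng(0)
s = 3.0 + 2.0 * rng.standard_normal((32, 8))   # pre-activations for one mini-batch
mean, var = s.mean(axis=0), s.var(axis=0)
s_hat = (s - mean) / np.sqrt(var)              # batch-normalized pre-activations

gamma = np.sqrt(var)                           # suppose the network learns gamma = sqrt(Var[s])
beta = mean                                    # ... and beta = E[s]
y = gamma * s_hat + beta                       # the scale-and-shift step

print(np.allclose(y, s))                       # True: we recover the original s
```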
65/67
We will now compare the performance with and without batch normalization on MNIST data using 2 layers....
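The exact experimental setup is not reproduced here, but a hypothetical pair of 2-hidden-layer MNIST classifiers of the kind being compared could look as follows in PyTorch (the hidden sizes, activation choice and omitted training loop are assumptions, not the lecture's configuration); the only difference is the BatchNorm1d layer inserted between each linear layer and its non-linearity.

```python
import torch.nn as nn

# Two hypothetical 2-hidden-layer MNIST classifiers; the only difference is the
# BatchNorm1d layer between each linear layer and its non-linearity.
plain = nn.Sequential(
    nn.Linear(784, 256), nn.Sigmoid(),
    nn.Linear(256, 256), nn.Sigmoid(),
    nn.Linear(256, 10),
)
with_bn = nn.Sequential(
    nn.Linear(784, 256), nn.BatchNorm1d(256), nn.Sigmoid(),
    nn.Linear(256, 256), nn.BatchNorm1d(256), nn.Sigmoid(),
    nn.Linear(256, 10),
)
```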
66/67
[Figures: training curves on MNIST comparing the network with and without batch normalization]
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
67/67
2016-17: Still exciting times
Even better optimization methods
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
67/67
2016-17: Still exciting times
Even better optimization methods
Data driven initialization methods
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
67/67
2016-17: Still exciting times
Even better optimization methods
Data driven initialization methods
Beyond batch normalization
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9

  • 1. 1/67 CS7015 (Deep Learning) : Lecture 9 Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization Mitesh M. Khapra Department of Computer Science and Engineering Indian Institute of Technology Madras Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 2. 2/67 Module 9.1 : A quick recap of training deep neural networks Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 3. 3/67 x σ w y Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 4. 3/67 x σ w y We already saw how to train this network Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 5. 3/67 x σ w y We already saw how to train this network w = w − η w where, Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 6. 3/67 x σ w y We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 7. 3/67 x σ w y We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 8. 3/67 x σ w y x1x2 x3 σ y w1 w2 w3 We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x What about a wider network with more inputs: Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 9. 3/67 x σ w y x1x2 x3 σ y w1 w2 w3 We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x What about a wider network with more inputs: w1 = w1 − η w1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 10. 3/67 x σ w y x1x2 x3 σ y w1 w2 w3 We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x What about a wider network with more inputs: w1 = w1 − η w1 w2 = w2 − η w2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 11. 3/67 x σ w y x1x2 x3 σ y w1 w2 w3 We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x What about a wider network with more inputs: w1 = w1 − η w1 w2 = w2 − η w2 w3 = w3 − η w3 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 12. 3/67 x σ w y x1x2 x3 σ y w1 w2 w3 We already saw how to train this network w = w − η w where, w = ∂L (w) ∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x What about a wider network with more inputs: w1 = w1 − η w1 w2 = w2 − η w2 w3 = w3 − η w3 where, wi = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ xi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 13. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 14. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? We can now calculate w1 using chain rule: Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 15. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? We can now calculate w1 using chain rule: ∂L (w) ∂w1 = ∂L (w) ∂y . ∂y ∂a3 . ∂a3 ∂h2 . ∂h2 ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 16. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? We can now calculate w1 using chain rule: ∂L (w) ∂w1 = ∂L (w) ∂y . ∂y ∂a3 . ∂a3 ∂h2 . ∂h2 ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 = ∂L (w) ∂y ∗ ............... ∗ h0 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 17. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? We can now calculate w1 using chain rule: ∂L (w) ∂w1 = ∂L (w) ∂y . ∂y ∂a3 . ∂a3 ∂h2 . ∂h2 ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 = ∂L (w) ∂y ∗ ............... ∗ h0 In general, wi = ∂L (w) ∂y ∗ ............... ∗ hi−1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 18. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? We can now calculate w1 using chain rule: ∂L (w) ∂w1 = ∂L (w) ∂y . ∂y ∂a3 . ∂a3 ∂h2 . ∂h2 ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 = ∂L (w) ∂y ∗ ............... ∗ h0 In general, wi = ∂L (w) ∂y ∗ ............... ∗ hi−1 Notice that wi is proportional to the correspond- ing input hi−1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 19. 4/67 σ x = h0 σ σ y w1 w2 w3 a1 h1 a2 h2 a3 ai = wihi−1; hi = σ(ai) a1 = w1 ∗ x = w1 ∗ h0 What if we have a deeper network ? We can now calculate w1 using chain rule: ∂L (w) ∂w1 = ∂L (w) ∂y . ∂y ∂a3 . ∂a3 ∂h2 . ∂h2 ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 = ∂L (w) ∂y ∗ ............... ∗ h0 In general, wi = ∂L (w) ∂y ∗ ............... ∗ hi−1 Notice that wi is proportional to the correspond- ing input hi−1 (we will use this fact later) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 20. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 21. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? How do you calculate w2 =? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 22. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? How do you calculate w2 =? It will be given by chain rule applied across mul- tiple paths Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 23. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? How do you calculate w2 =? It will be given by chain rule applied across mul- tiple paths Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 24. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? How do you calculate w2 =? It will be given by chain rule applied across mul- tiple paths Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 25. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? How do you calculate w2 =? It will be given by chain rule applied across mul- tiple paths Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 26. 5/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 What happens if we have a network which is deep and wide? How do you calculate w2 =? It will be given by chain rule applied across mul- tiple paths (We saw this in detail when we studied back propagation) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 27. 6/67 Things to remember Training Neural Networks is a Game of Gradients (played using any of the existing gradient based approaches that we discussed) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 28. 6/67 Things to remember Training Neural Networks is a Game of Gradients (played using any of the existing gradient based approaches that we discussed) The gradient tells us the responsibility of a parameter towards the loss Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 29. 6/67 Things to remember Training Neural Networks is a Game of Gradients (played using any of the existing gradient based approaches that we discussed) The gradient tells us the responsibility of a parameter towards the loss The gradient w.r.t. a parameter is proportional to the input to the parameters (recall the “..... ∗ x” term or the “.... ∗ hi” term in the formula for wi) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 30. 7/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 31. 7/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 Backpropagation was made popular by Rumelhart et.al in 1986 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 32. 7/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 Backpropagation was made popular by Rumelhart et.al in 1986 However when used for really deep networks it was not very successful Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 33. 7/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 Backpropagation was made popular by Rumelhart et.al in 1986 However when used for really deep networks it was not very successful In fact, till 2006 it was very hard to train very deep networks Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 34. 7/67 σ σ x1 σ σ x2 x3 σ σ σ σ y w1 w2 w3 Backpropagation was made popular by Rumelhart et.al in 1986 However when used for really deep networks it was not very successful In fact, till 2006 it was very hard to train very deep networks Typically, even after a large number of epochs the training did not con- verge Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 35. 8/67 Module 9.2 : Unsupervised pre-training Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 36. 9/67 What has changed now? How did Deep Learning become so popular despite this problem with training large networks? 1 G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006. Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 37. 9/67 What has changed now? How did Deep Learning become so popular despite this problem with training large networks? Well, until 2006 it wasn’t so popular 1 G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006. Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 38. 9/67 What has changed now? How did Deep Learning become so popular despite this problem with training large networks? Well, until 2006 it wasn’t so popular The field got revived after the seminal work of Hinton and Salakhutdinov in 2006 1 G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006. Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 39. 10/67 Let’s look at the idea of unsupervised pre-training introduced in this paper ... Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 40. 10/67 Let’s look at the idea of unsupervised pre-training introduced in this paper ... (note that in this paper they introduced the idea in the context of RBMs but we will discuss it in the context of Autoencoders) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 41. 11/67 Consider the deep neural network shown in this figure Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 42. 11/67 Consider the deep neural network shown in this figure Let us focus on the first two layers of the network (x and h1) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 43. 11/67 Consider the deep neural network shown in this figure Let us focus on the first two layers of the network (x and h1) We will first train the weights between these two layers using an un- supervised objective Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 44. 11/67 ˆx h1 x reconstruct x min 1 m m i=1 n j=1 (ˆxij − xij)2 Consider the deep neural network shown in this figure Let us focus on the first two layers of the network (x and h1) We will first train the weights between these two layers using an un- supervised objective Note that we are trying to reconstruct the input (x) from the hidden repres- entation (h1) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 45. 11/67 ˆx h1 x reconstruct x min 1 m m i=1 n j=1 (ˆxij − xij)2 Consider the deep neural network shown in this figure Let us focus on the first two layers of the network (x and h1) We will first train the weights between these two layers using an un- supervised objective Note that we are trying to reconstruct the input (x) from the hidden repres- entation (h1) We refer to this as an unsupervised objective because it does not involve the output label (y) and only uses the input data (x) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 46. 11/67 ˆx h1 x reconstruct x min 1 m m i=1 n j=1 (ˆxij − xij)2 Consider the deep neural network shown in this figure Let us focus on the first two layers of the network (x and h1) We will first train the weights between these two layers using an un- supervised objective Note that we are trying to reconstruct the input (x) from the hidden repres- entation (h1) We refer to this as an unsupervised objective because it does not involve the output label (y) and only uses the input data (x) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 47. 11/67 ˆx h1 x reconstruct x min 1 m m i=1 n j=1 (ˆxij − xij)2 At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 48. 11/67 ˆh1 h2 h1 x min 1 m m i=1 n j=1 (ˆh1ij − h1ij )2 At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x We now fix the weights in layer 1 and repeat the same process with layer 2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 49. 11/67 ˆh1 h2 h1 x min 1 m m i=1 n j=1 (ˆh1ij − h1ij )2 At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x We now fix the weights in layer 1 and repeat the same process with layer 2 At the end of this step, the weights in layer 2 are trained such that h2 cap- tures an abstract representation of h1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 50. 11/67 ˆh1 h2 h1 x min 1 m m i=1 n j=1 (ˆh1ij − h1ij )2 At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x We now fix the weights in layer 1 and repeat the same process with layer 2 At the end of this step, the weights in layer 2 are trained such that h2 cap- tures an abstract representation of h1 We continue this process till the last hidden layer (i.e., the layer before the output layer) so that each successive layer captures an abstract represent- ation of the previous layer Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 51. 12/67 x1 x2 x3 min θ 1 m m i=1 (yi − f(xi))2 After this layerwise pre-training, we add the output layer and train the whole network using the task specific objective Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 52. 12/67 x1 x2 x3 min θ 1 m m i=1 (yi − f(xi))2 After this layerwise pre-training, we add the output layer and train the whole network using the task specific objective Note that, in effect we have initial- ized the weights of the network us- ing the greedy unsupervised objective and are now fine tuning these weights using the supervised objective Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 53. 13/67 Why does this work better? 1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al,2009 2 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 54. 13/67 Why does this work better? Is it because of better optimization? 1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al,2009 2 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 55. 13/67 Why does this work better? Is it because of better optimization? Is it because of better regularization? 1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al,2009 2 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 56. 13/67 Why does this work better? Is it because of better optimization? Is it because of better regularization? Let’s see what these two questions mean and try to answer them based on some (among many) existing studies1,2 1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al,2009 2 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 57. 14/67 Why does this work better? Is it because of better optimization? Is it because of better regularization? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 58. 15/67 What is the optimization problem that we are trying to solve? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 59. 15/67 What is the optimization problem that we are trying to solve? minimize L (θ) = 1 m m i=1 (yi − f(xi))2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 60. 15/67 What is the optimization problem that we are trying to solve? minimize L (θ) = 1 m m i=1 (yi − f(xi))2 Is it the case that in the absence of unsupervised pre-training we are not able to drive L (θ) to 0 even for the training data (hence poor optimization) ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 61. 15/67 What is the optimization problem that we are trying to solve? minimize L (θ) = 1 m m i=1 (yi − f(xi))2 Is it the case that in the absence of unsupervised pre-training we are not able to drive L (θ) to 0 even for the training data (hence poor optimization) ? Let us see this in more detail ... Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 62. 16/67 The error surface of the supervised objective of a Deep Neural Network is highly non-convex 1 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 63. 16/67 The error surface of the supervised objective of a Deep Neural Network is highly non-convex With many hills and plateaus and val- leys 1 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 64. 16/67 The error surface of the supervised objective of a Deep Neural Network is highly non-convex With many hills and plateaus and val- leys Given that large capacity of DNNs it is still easy to land in one of these 0 error regions 1 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 65. 16/67 The error surface of the supervised objective of a Deep Neural Network is highly non-convex With many hills and plateaus and val- leys Given that large capacity of DNNs it is still easy to land in one of these 0 error regions Indeed Larochelle et.al.1 show that if the last layer has large capacity then L (θ) goes to 0 even without pre- training 1 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 66. 16/67 The error surface of the supervised objective of a Deep Neural Network is highly non-convex With many hills and plateaus and val- leys Given that large capacity of DNNs it is still easy to land in one of these 0 error regions Indeed Larochelle et.al.1 show that if the last layer has large capacity then L (θ) goes to 0 even without pre- training However, if the capacity of the net- work is small, unsupervised pre- training helps 1 Exploring Strategies for Training Deep Neural Networks, Larocelle et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 67. 17/67 Why does this work better? Is it because of better optimization? Is it because of better regularization? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 68. 18/67 What does regularization do? 1 Image Source:The Elements of Statistical Learning-T. Hastie, R. Tibshirani, and J. Friedman, Pg 71 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 69. 18/67 What does regularization do? It con- strains the weights to certain regions of the parameter space 1 Image Source:The Elements of Statistical Learning-T. Hastie, R. Tibshirani, and J. Friedman, Pg 71 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 70. 18/67 What does regularization do? It con- strains the weights to certain regions of the parameter space L-1 regularization: constrains most weights to be 0 1 Image Source:The Elements of Statistical Learning-T. Hastie, R. Tibshirani, and J. Friedman, Pg 71 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 71. 18/67 What does regularization do? It con- strains the weights to certain regions of the parameter space L-1 regularization: constrains most weights to be 0 L-2 regularization: prevents most weights from taking large values 1 Image Source:The Elements of Statistical Learning-T. Hastie, R. Tibshirani, and J. Friedman, Pg 71 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 72. 19/67 Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 73. 19/67 Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space Specifically, it constrains the weights to lie in regions where the character- istics of the data are captured well (as governed by the unsupervised object- ive) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 74. 19/67 Unsupervised objective: Ω(θ) = 1 m m i=1 n j=1 (xij − ˆxij)2 Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space Specifically, it constrains the weights to lie in regions where the character- istics of the data are captured well (as governed by the unsupervised object- ive) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 75. 19/67 Unsupervised objective: Ω(θ) = 1 m m i=1 n j=1 (xij − ˆxij)2 We can think of this unsupervised ob- jective as an additional constraint on the optimization problem Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space Specifically, it constrains the weights to lie in regions where the character- istics of the data are captured well (as governed by the unsupervised object- ive) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 76. 19/67 Unsupervised objective: Ω(θ) = 1 m m i=1 n j=1 (xij − ˆxij)2 We can think of this unsupervised objective as an additional constraint on the optimization problem Supervised objective: L (θ) = 1 m m i=1 (yi − f(xi))2 Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space Specifically, it constrains the weights to lie in regions where the characteristics of the data are captured well (as governed by the unsupervised objective) This unsupervised objective ensures that the learning is not greedy w.r.t. the supervised objective (and also satisfies the unsupervised objective) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 77. 20/67 Some other experiments have also shown that pre-training is more ro- bust to random initializations 1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 78. 20/67 Some other experiments have also shown that pre-training is more ro- bust to random initializations One accepted hypothesis is that pre- training leads to better weight ini- tializations (so that the layers cap- ture the internal characteristics of the data) 1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al,2009 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 79. 21/67 So what has happened since 2006-2009? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 80. 22/67 Deep Learning has evolved Better optimization algorithms Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 81. 22/67 Deep Learning has evolved Better optimization algorithms Better regularization methods Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 82. 22/67 Deep Learning has evolved Better optimization algorithms Better regularization methods Better activation functions Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 83. 22/67 Deep Learning has evolved Better optimization algorithms Better regularization methods Better activation functions Better weight initialization strategies Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 84. 23/67 Module 9.3 : Better activation functions Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 85. 24/67 Deep Learning has evolved Better optimization algorithms Better regularization methods Better activation functions Better weight initialization strategies Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 86. 25/67 Before we look at activation functions, let’s try to answer the following question: “What makes Deep Neural Networks powerful ?” Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 87. 26/67 h0 = x y σ σ σ a1 h1 a2 h2 a3 w1 w2 w3 Consider this deep neural network Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 88. 26/67 h0 = x y σ σ σ a1 h1 a2 h2 a3 w1 w2 w3 Consider this deep neural network Imagine if we replace the sigmoid in each layer by a simple linear trans- formation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 89. 26/67 h0 = x y σ σ σ a1 h1 a2 h2 a3 w1 w2 w3 Consider this deep neural network Imagine if we replace the sigmoid in each layer by a simple linear trans- formation y = (w4 ∗ (w3 ∗ (w2 ∗ (w1x)))) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 90. 26/67 h0 = x y σ σ σ a1 h1 a2 h2 a3 w1 w2 w3 Consider this deep neural network Imagine if we replace the sigmoid in each layer by a simple linear trans- formation y = (w4 ∗ (w3 ∗ (w2 ∗ (w1x)))) Then we will just learn y as a linear transformation of x Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 91. 26/67 h0 = x y σ σ σ a1 h1 a2 h2 a3 w1 w2 w3 Consider this deep neural network Imagine if we replace the sigmoid in each layer by a simple linear trans- formation y = (w4 ∗ (w3 ∗ (w2 ∗ (w1x)))) Then we will just learn y as a linear transformation of x In other words we will be constrained to learning linear decision boundaries Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 92. 26/67 h0 = x y σ σ σ a1 h1 a2 h2 a3 w1 w2 w3 Consider this deep neural network Imagine if we replace the sigmoid in each layer by a simple linear trans- formation y = (w4 ∗ (w3 ∗ (w2 ∗ (w1x)))) Then we will just learn y as a linear transformation of x In other words we will be constrained to learning linear decision boundaries We cannot learn arbitrary decision boundaries Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 93. 26/67 In particular, a deep linear neural network cannot learn such boundar- ies Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 94. 26/67 In particular, a deep linear neural network cannot learn such boundar- ies But a deep non linear neural net- work can indeed learn such bound- aries (recall Universal Approximation Theorem) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 95. 27/67 Now let’s look at some non-linear activation functions that are typically used in deep neural networks (Much of this material is taken from Andrej Karpathy’s lecture notes 1) 1 http://cs231n.github.io Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 96. 28/67 σ(x) = 1 1+e−x Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 97. 28/67 Sigmoid σ(x) = 1 1+e−x As is obvious, the sigmoid function compresses all its inputs to the range [0,1] Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 98. 28/67 Sigmoid σ(x) = 1 1+e−x As is obvious, the sigmoid function compresses all its inputs to the range [0,1] Since we are always interested in gradients, let us find the gradient of this function Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 99. 28/67 Sigmoid σ(x) = 1 1+e−x As is obvious, the sigmoid function compresses all its inputs to the range [0,1] Since we are always interested in gradients, let us find the gradient of this function ∂σ(x) ∂x = σ(x)(1 − σ(x)) (you can easily derive it) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 100. 28/67 Sigmoid σ(x) = 1 1+e−x As is obvious, the sigmoid function compresses all its inputs to the range [0,1] Since we are always interested in gradients, let us find the gradient of this function ∂σ(x) ∂x = σ(x)(1 − σ(x)) (you can easily derive it) Let us see what happens if we use sig- moid in a deep network Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
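As a quick illustration of this derivative, here is a minimal NumPy sketch (illustrative, not part of the lecture) that evaluates σ(x) and σ(x)(1 − σ(x)) at a few points; the gradient never exceeds 0.25 and decays towards 0 as |x| grows:

```python
import numpy as np

def sigmoid(x):
    # logistic function: squashes any input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative from the slide: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(xs))       # values approach 0 or 1 at the extremes
print(sigmoid_grad(xs))  # peaks at 0.25 (x = 0) and is ~0 at the extremes
```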
  • 101. 29/67 h0 = x σ σ σ σ a1 h1 a2 h2 a3 h3 a4 h4 a3 = w2h2 h3 = σ(a3) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 102. 29/67 h0 = x σ σ σ σ a1 h1 a2 h2 a3 h3 a4 h4 a3 = w2h2 h3 = σ(a3) While calculating w2 at some point in the chain rule we will encounter ∂h3 ∂a3 = ∂σ(a3) ∂a3 = σ(a3)(1 − σ(a3)) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 103. 29/67 h0 = x σ σ σ σ a1 h1 a2 h2 a3 h3 a4 h4 a3 = w2h2 h3 = σ(a3) While calculating w2 at some point in the chain rule we will encounter ∂h3 ∂a3 = ∂σ(a3) ∂a3 = σ(a3)(1 − σ(a3)) What is the consequence of this ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 104. 29/67 h0 = x σ σ σ σ a1 h1 a2 h2 a3 h3 a4 h4 a3 = w2h2 h3 = σ(a3) While calculating w2 at some point in the chain rule we will encounter ∂h3 ∂a3 = ∂σ(a3) ∂a3 = σ(a3)(1 − σ(a3)) What is the consequence of this ? To answer this question let us first understand the concept of saturated neuron ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 105. 30/67 −2 −1 1 2 0.2 0.4 0.6 0.8 1 x y A sigmoid neuron is said to have sat- urated when σ(x) = 1 or σ(x) = 0 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 106. 30/67 −2 −1 1 2 0.2 0.4 0.6 0.8 1 x y A sigmoid neuron is said to have sat- urated when σ(x) = 1 or σ(x) = 0 What would the gradient be at satur- ation? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 107. 30/67 −2 −1 1 2 0.2 0.4 0.6 0.8 1 x y A sigmoid neuron is said to have sat- urated when σ(x) = 1 or σ(x) = 0 What would the gradient be at satur- ation? Well it would be 0 (you can see it from the plot or from the formula that we derived) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 108. 30/67 −2 −1 1 2 0.2 0.4 0.6 0.8 1 x y Saturated neurons thus cause the gradient to vanish. A sigmoid neuron is said to have sat- urated when σ(x) = 1 or σ(x) = 0 What would the gradient be at satur- ation? Well it would be 0 (you can see it from the plot or from the formula that we derived) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 109. 31/67 Saturated neurons thus cause the gradient to vanish. w1 w2 w3 w4 σ( 4 i=1 wixi) −2 −1 1 2 0.2 0.4 0.6 0.8 1 4 i=1 wixi y But why would the neurons saturate ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 110. 31/67 Saturated neurons thus cause the gradient to vanish. w1 w2 w3 w4 σ( 4 i=1 wixi) −2 −1 1 2 0.2 0.4 0.6 0.8 1 4 i=1 wixi y But why would the neurons saturate ? Consider what would happen if we use sigmoid neurons and initialize the weights to very high values ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 111. 31/67 Saturated neurons thus cause the gradient to vanish. w1 w2 w3 w4 σ( 4 i=1 wixi) −2 −1 1 2 0.2 0.4 0.6 0.8 1 4 i=1 wixi y But why would the neurons saturate ? Consider what would happen if we use sigmoid neurons and initialize the weights to very high values ? The neurons will saturate very quickly Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 112. 31/67 Saturated neurons thus cause the gradient to vanish. w1 w2 w3 w4 σ( 4 i=1 wixi) −2 −1 1 2 0.2 0.4 0.6 0.8 1 4 i=1 wixi y But why would the neurons saturate ? Consider what would happen if we use sigmoid neurons and initialize the weights to very high values ? The neurons will saturate very quickly The gradients will vanish and the training will stall (more on this later) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
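This saturation is easy to see numerically. The sketch below is illustrative (the weight scales and the assumption of roughly unit-scale inputs are arbitrary choices, not from the lecture):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                  # four inputs of roughly unit scale
for scale in (0.5, 50.0):                   # modest vs. very large weight initialization
    w = rng.standard_normal(4) * scale
    a = w @ x                               # pre-activation sum_i w_i x_i
    h = sigmoid(a)
    print(scale, a, h, h * (1 - h))         # with large weights |a| is huge, h saturates and h*(1-h) collapses to ~0
```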
  • 113. 32/67 Saturated neurons cause the gradient to vanish Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 114. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 115. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Why is this a problem?? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 116. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Why is this a problem?? w1 w2 a3 = w1 ∗ h21 + w2 ∗ h22 y h0 = x h1 h2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 117. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Consider the gradient w.r.t. w1 and w2 w1 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 ∂a3 ∂w1 w2 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 ∂a3 ∂w2 Why is this a problem?? w1 w2 a3 = w1 ∗ h21 + w2 ∗ h22 y h0 = x h1 h2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 118. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Consider the gradient w.r.t. w1 and w2 w1 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 h21 w2 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 h22 Note that h21 and h22 are between [0, 1] (i.e., they are both positive) Why is this a problem?? w1 w2 a3 = w1 ∗ h21 + w2 ∗ h22 y h0 = x h1 h2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 119. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Consider the gradient w.r.t. w1 and w2 w1 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 h21 w2 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 h22 Note that h21 and h22 are between [0, 1] (i.e., they are both positive) So if the first common term (in red) is positive (negative) then both w1 and w2 are positive (negative) Why is this a problem?? w1 w2 a3 = w1 ∗ h21 + w2 ∗ h22 y h0 = x h1 h2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 120. 32/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered Consider the gradient w.r.t. w1 and w2 w1 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 h21 w2 = ∂L (w) ∂y ∂y ∂h3 ∂h3 ∂a3 h22 Note that h21 and h22 are between [0, 1] (i.e., they are both positive) So if the first common term (in red) is positive (negative) then both w1 and w2 are positive (negative) Why is this a problem?? w1 w2 a3 = w1 ∗ h21 + w2 ∗ h22 y h0 = x h1 h2 Essentially, either all the gradients at a layer are positive or all the gradients at a layer are negative Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
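This sign argument can be checked numerically. In the sketch below (illustrative: random numbers stand in for the shared common term and for the positive sigmoid outputs h21, h22), the per-example gradients w.r.t. w1 and w2 always agree in sign:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.uniform(0.0, 1.0, size=(1000, 2))   # stand-ins for sigmoid outputs h21, h22: always positive
common = rng.standard_normal((1000, 1))     # stand-in for the shared factor dL/dy * dy/dh3 * dh3/da3
grads = common * h                          # per-example gradients w.r.t. (w1, w2)
print(np.mean(np.sign(grads[:, 0]) == np.sign(grads[:, 1])))   # 1.0: the two gradients never disagree in sign
```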
  • 121. 33/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered This restricts the possible update dir- ections w2 w1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 122. 33/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered This restricts the possible update dir- ections w2 w1 (Not possible) Quadrant in which all gradients are +ve (Allowed) Quadrant in which all gradients are -ve (Allowed) (Not possible) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 123. 33/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered This restricts the possible update dir- ections w2 w1 (Not possible) Quadrant in which all gradients are +ve (Allowed) Quadrant in which all gradients are -ve (Allowed) (Not possible) Now imagine: this is the optimal w Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 124. 34/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered w2 w1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 125. 34/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered w2 w1 starting from this initial position only way to reach it is by taking a zigzag path Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 131. 34/67 Saturated neurons cause the gradient to vanish Sigmoids are not zero centered And lastly, sigmoids are compu- tationally expensive (because of exp (x)) w2 w1 starting from this initial position only way to reach it is by taking a zigzag path Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 132. 35/67 tanh(x) 0−4 −2 2 4 −1 −0.5 0.5 1 x y f(x) = tanh(x) Compresses all its inputs to the range [-1,1] Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 133. 35/67 tanh(x) 0−4 −2 2 4 −1 −0.5 0.5 1 x y f(x) = tanh(x) Compresses all its inputs to the range [-1,1] Zero centered Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 134. 35/67 tanh(x) 0−4 −2 2 4 −1 −0.5 0.5 1 x y f(x) = tanh(x) Compresses all its inputs to the range [-1,1] Zero centered What is the derivative of this func- tion? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 135. 35/67 tanh(x) 0−4 −2 2 4 −1 −0.5 0.5 1 x y f(x) = tanh(x) Compresses all its inputs to the range [-1,1] Zero centered What is the derivative of this func- tion? ∂tanh(x) ∂x = (1 − tanh2 (x)) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 136. 35/67 tanh(x) 0−4 −2 2 4 −1 −0.5 0.5 1 x y f(x) = tanh(x) Compresses all its inputs to the range [-1,1] Zero centered What is the derivative of this func- tion? ∂tanh(x) ∂x = (1 − tanh2 (x)) The gradient still vanishes at satura- tion Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 137. 35/67 tanh(x) 0−4 −2 2 4 −1 −0.5 0.5 1 x y f(x) = tanh(x) Compresses all its inputs to the range [-1,1] Zero centered What is the derivative of this func- tion? ∂tanh(x) ∂x = (1 − tanh2 (x)) The gradient still vanishes at satura- tion Also computationally expensive Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 138. 36/67 ReLU f(x) = max(0, x) Is this a non-linear function? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 139. 36/67 ReLU f(x) = max(0, x) Is this a non-linear function? Indeed it is! Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 140. 36/67 ReLU f(x) = max(0, x + 1) − max(0, x − 1) Is this a non-linear function? Indeed it is! In fact we can combine two ReLU units to recover a piecewise linear ap- proximation of the sigmoid function Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
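A quick check of this construction (a sketch, not lecture code): evaluating max(0, x + 1) − max(0, x − 1) on a grid shows the flat–linear–flat, sigmoid-like shape.

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)

def ramp(x):
    # difference of two shifted ReLUs: 0 for x < -1, linear on [-1, 1], constant 2 for x > 1
    return relu(x + 1.0) - relu(x - 1.0)

print(ramp(np.linspace(-3, 3, 7)))   # [0. 0. 0. 1. 2. 2. 2.] -- a piecewise-linear, sigmoid-like ramp
```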
  • 141. 37/67 ReLU f(x) = max(0, x) Advantages of ReLU Does not saturate in the positive re- gion 1 ImageNet Classification with Deep Convolutional Neural Networks- Alex Krizhevsky Ilya Sutskever, Geoffrey E. Hinton, 2012 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 142. 37/67 ReLU f(x) = max(0, x) Advantages of ReLU Does not saturate in the positive re- gion Computationally efficient 1 ImageNet Classification with Deep Convolutional Neural Networks- Alex Krizhevsky Ilya Sutskever, Geoffrey E. Hinton, 2012 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 143. 37/67 ReLU f(x) = max(0, x) Advantages of ReLU Does not saturate in the positive re- gion Computationally efficient In practice converges much faster than sigmoid/tanh1 1 ImageNet Classification with Deep Convolutional Neural Networks- Alex Krizhevsky Ilya Sutskever, Geoffrey E. Hinton, 2012 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 144. 38/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice there is a caveat Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 145. 38/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice there is a caveat Let’s see what is the derivative of ReLU(x) ∂ReLU(x) ∂x = 0 if x < 0 = 1 if x > 0 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 146. 38/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice there is a caveat Let’s see what is the derivative of ReLU(x) ∂ReLU(x) ∂x = 0 if x < 0 = 1 if x > 0 Now consider the given network Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 147. 38/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice there is a caveat Let’s see what is the derivative of ReLU(x) ∂ReLU(x) ∂x = 0 if x < 0 = 1 if x > 0 Now consider the given network What would happen if at some point a large gradient causes the bias b to be updated to a large negative value? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 148. 39/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 w1x1 + w2x2 + b < 0 [if b << 0] The neuron would output 0 [dead neuron] Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 149. 39/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 w1x1 + w2x2 + b < 0 [if b << 0] The neuron would output 0 [dead neuron] Not only would the output be 0 but during backpropagation even the gradient ∂h1 ∂a1 would be zero Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 150. 39/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 w1x1 + w2x2 + b < 0 [if b << 0] The neuron would output 0 [dead neuron] Not only would the output be 0 but during backpropagation even the gradient ∂h1 ∂a1 would be zero The weights w1, w2 and b will not get updated [ there will be a zero term in the chain rule] w1 = ∂L (θ) ∂y . ∂y ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 151. 39/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 w1x1 + w2x2 + b < 0 [if b << 0] The neuron would output 0 [dead neuron] Not only would the output be 0 but during backpropagation even the gradient ∂h1 ∂a1 would be zero The weights w1, w2 and b will not get updated [ there will be a zero term in the chain rule] w1 = ∂L (θ) ∂y . ∂y ∂a2 . ∂a2 ∂h1 . ∂h1 ∂a1 . ∂a1 ∂w1 The neuron will now stay dead forever!! Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
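The dead-neuron effect is easy to reproduce. The sketch below uses made-up numbers (the data and the large negative bias are arbitrary choices for illustration):

```python
import numpy as np

relu = lambda a: np.maximum(0.0, a)
relu_grad = lambda a: (a > 0).astype(float)   # 0 where a < 0, 1 where a > 0

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))             # 100 inputs (x1, x2)
w = np.array([0.5, -0.3])
b = -50.0                                     # bias pushed to a large negative value by some earlier update

a = X @ w + b                                 # pre-activations: all strongly negative
print(relu(a).max())                          # 0.0 -> the unit outputs 0 for every input
print(relu_grad(a).sum())                     # 0.0 -> dh1/da1 = 0 everywhere, so w1, w2 and b never get updated again
```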
  • 152. 40/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice a large fraction of ReLU units can die if the learning rate is set too high Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 153. 40/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice a large fraction of ReLU units can die if the learning rate is set too high It is advised to initialize the bias to a positive value (0.01) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 154. 40/67 x1 x2 1 y w1 w2 b h1 a1 a2 w3 In practice a large fraction of ReLU units can die if the learning rate is set too high It is advised to initialize the bias to a positive value (0.01) Use other variants of ReLU (as we will soon see) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 155. 41/67 Leaky ReLU x y f(x) = max(0.01x,x) No saturation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 156. 41/67 Leaky ReLU x y f(x) = max(0.01x,x) No saturation Will not die (0.01x ensures that at least a small gradient will flow through) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 157. 41/67 Leaky ReLU x y f(x) = max(0.01x,x) No saturation Will not die (0.01x ensures that at least a small gradient will flow through) Computationally efficient Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 158. 41/67 Leaky ReLU x y f(x) = max(0.01x,x) No saturation Will not die (0.01x ensures that at least a small gradient will flow through) Computationally efficient Close to zero centered outputs Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 159. 41/67 Leaky ReLU x y f(x) = max(0.01x,x) No saturation Will not die (0.01x ensures that at least a small gradient will flow through) Computationally efficient Close to zero centered outputs Parametric ReLU f(x) = max(αx, x) α is a parameter of the model α will get updated during backpropagation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
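A minimal sketch of both variants (the 0.01 slope is the fixed leak from the slide; treating it as a learnable α gives Parametric ReLU):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(alpha*x, x): a small negative slope keeps a nonzero gradient for x < 0
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

xs = np.array([-5.0, -0.5, 0.5, 5.0])
print(leaky_relu(xs))
print(leaky_relu_grad(xs))   # never exactly 0, so the unit cannot die
# Parametric ReLU uses the same form, but alpha is a model parameter that is
# updated by backpropagation along with the weights.
```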
  • 160. 42/67 Exponential Linear Unit x y f(x) = x if x > 0 = a(exp(x) − 1) if x ≤ 0 All benefits of ReLU Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 161. 42/67 Exponential Linear Unit x y f(x) = x if x > 0 = a(exp(x) − 1) if x ≤ 0 All benefits of ReLU a(exp(x) − 1) ensures that at least a small gradient will flow through Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 162. 42/67 Exponential Linear Unit x y f(x) = x if x > 0 = a(exp(x) − 1) if x ≤ 0 All benefits of ReLU a(exp(x) − 1) ensures that at least a small gradient will flow through Close to zero centered outputs Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 163. 42/67 Exponential Linear Unit x y f(x) = x if x > 0 = a(exp(x) − 1) if x ≤ 0 All benefits of ReLU a(exp(x) − 1) ensures that at least a small gradient will flow through Close to zero centered outputs Expensive (requires computation of exp(x)) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
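A sketch of ELU and its gradient (a = 1 below is a common default, not something fixed by the slide):

```python
import numpy as np

def elu(x, a=1.0):
    # x for x > 0, a*(exp(x) - 1) for x <= 0 (saturates smoothly towards -a)
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def elu_grad(x, a=1.0):
    # 1 on the positive side, a*exp(x) on the negative side: small but never exactly 0
    return np.where(x > 0, 1.0, a * np.exp(x))

xs = np.array([-5.0, -1.0, 0.5, 3.0])
print(elu(xs))
print(elu_grad(xs))
```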
  • 164. 43/67 Maxout Neuron max(wT 1 x + b1, wT 2 x + b2) Generalizes ReLU and Leaky ReLU Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 165. 43/67 Maxout Neuron max(wT 1 x + b1, wT 2 x + b2) Generalizes ReLU and Leaky ReLU No saturation! No death! Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 166. 43/67 Maxout Neuron max(wT 1 x + b1, wT 2 x + b2) Generalizes ReLU and Leaky ReLU No saturation! No death! Doubles the number of parameters Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
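To see how maxout generalizes ReLU, here is a small sketch (the weights are arbitrary illustrations; choosing one of the two affine pieces to be the zero map recovers ReLU):

```python
import numpy as np

def maxout(x, W, b):
    # W: (k, d), b: (k,) -- the unit outputs the maximum over k affine pieces w_j^T x + b_j
    return np.max(W @ x + b)

x = np.array([1.0, -2.0])
W = np.array([[0.0, 0.0],    # piece 1: the zero map
              [1.0, 1.0]])   # piece 2: x1 + x2
b = np.array([0.0, 0.0])
print(maxout(x, W, b))       # 0.0 == ReLU(x1 + x2); with k pieces the unit has k times the parameters
```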
  • 167. 44/67 Things to Remember Sigmoids are bad Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 168. 44/67 Things to Remember Sigmoids are bad ReLU is more or less the standard unit for Convolutional Neural Networks Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 169. 44/67 Things to Remember Sigmoids are bad ReLU is more or less the standard unit for Convolutional Neural Networks Can explore Leaky ReLU/Maxout/ELU Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 170. 44/67 Things to Remember Sigmoids are bad ReLU is more or less the standard unit for Convolutional Neural Networks Can explore Leaky ReLU/Maxout/ELU tanh sigmoids are still used in LSTMs/RNNs (we will see more on this later) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 171. 45/67 Module 9.4 : Better initialization strategies Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 172. 46/67 Deep Learning has evolved Better optimization algorithms Better regularization methods Better activation functions Better weight initialization strategies Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 173. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 What happens if we initialize all weights to 0? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 174. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 What happens if we initialize all weights to 0? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 175. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 What happens if we initialize all weights to 0? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 176. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 What happens if we initialize all weights to 0? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 177. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 What happens if we initialize all weights to 0? All neurons in layer 1 will get the same activation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 178. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Now what will happen during back propagation? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 179. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Now what will happen during back propagation? w11 = ∂L (w) ∂y . ∂y ∂h11 . ∂h11 ∂a11 .x1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 180. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Now what will happen during back propagation? w11 = ∂L (w) ∂y . ∂y ∂h11 . ∂h11 ∂a11 .x1 w21 = ∂L (w) ∂y . ∂y ∂h12 . ∂h12 ∂a12 .x1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 181. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Now what will happen during back propagation? w11 = ∂L (w) ∂y . ∂y ∂h11 . ∂h11 ∂a11 .x1 w21 = ∂L (w) ∂y . ∂y ∂h12 . ∂h12 ∂a12 .x1 but h11 = h12 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 182. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Now what will happen during back propagation? w11 = ∂L (w) ∂y . ∂y ∂h11 . ∂h11 ∂a11 .x1 w21 = ∂L (w) ∂y . ∂y ∂h12 . ∂h12 ∂a12 .x1 but h11 = h12 and a11 = a12 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 183. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Now what will happen during back propagation? w11 = ∂L (w) ∂y . ∂y ∂h11 . ∂h11 ∂a11 .x1 w21 = ∂L (w) ∂y . ∂y ∂h12 . ∂h12 ∂a12 .x1 but h11 = h12 and a11 = a12 ∴ w11 = w21 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 184. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Hence both the weights will get the same update and remain equal Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 185. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Hence both the weights will get the same update and remain equal In fact this symmetry will never break during training Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 186. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Hence both the weights will get the same update and remain equal In fact this symmetry will never break during training The same is true for w12 and w22 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 187. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Hence both the weights will get the same update and remain equal In fact this symmetry will never break during training The same is true for w12 and w22 And for all weights in layer 2 (in fact, work out the math and convince yourself that all the weights in this layer will remain equal) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 188. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Hence both the weights will get the same update and remain equal In fact this symmetry will never break during training The same is true for w12 and w22 And for all weights in layer 2 (in fact, work out the math and convince yourself that all the weights in this layer will remain equal) This is known as the symmetry breaking problem Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 189. 47/67 y σ σ σ σ x1 x2 h21 a21 h11 h12 h13 a11 a12 a13 a11 = w11x1 + w12x2 a12 = w21x1 + w22x2 ∴ a11 = a12 = 0 ∴ h11 = h12 Hence both the weights will get the same update and remain equal In fact this symmetry will never break during training The same is true for w12 and w22 And for all weights in layer 2 (in fact, work out the math and convince yourself that all the weights in this layer will remain equal) This is known as the symmetry breaking problem This will happen if all the weights in a network are initialized to the same value Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
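The symmetry argument can be verified directly. Below is a small NumPy sketch (illustrative; the data, layer sizes and learning rate are arbitrary) that trains a one-hidden-layer sigmoid network from an all-zero initialization and then prints the first-layer weights: all hidden units end up with identical incoming weights, exactly as argued above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 2))              # 8 examples, 2 features
y = rng.standard_normal((8, 1))

W1 = np.zeros((2, 3)); b1 = np.zeros(3)      # all-zero initialization
W2 = np.zeros((3, 1)); b2 = np.zeros(1)

lr = 0.1
for _ in range(100):
    a1 = X @ W1 + b1; h1 = sigmoid(a1)       # hidden layer
    yhat = h1 @ W2 + b2                      # linear output
    d_yhat = (yhat - y) / len(X)             # gradient of 0.5 * mean squared error
    dW2 = h1.T @ d_yhat; db2 = d_yhat.sum(0)
    d_h1 = d_yhat @ W2.T
    d_a1 = d_h1 * h1 * (1 - h1)              # sigmoid derivative
    dW1 = X.T @ d_a1; db1 = d_a1.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(W1)   # all three columns (hidden units) are identical: the symmetry never breaks
print(W2)   # the second-layer weights are identical as well
```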
  • 190. 48/67 We will now consider a feedforward network with: Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 191. 48/67 We will now consider a feedforward network with: input: 1000 points, each ∈ R500 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 192. 48/67 We will now consider a feedforward network with: input: 1000 points, each ∈ R500 input data is drawn from unit Gaus- sian −3 −2 −1 0 1 2 3 0.1 0.2 0.3 0.4 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 193. 48/67 We will now consider a feedforward network with: input: 1000 points, each ∈ R500 input data is drawn from unit Gaus- sian −3 −2 −1 0 1 2 3 0.1 0.2 0.3 0.4 the network has 5 layers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 194. 48/67 We will now consider a feedforward network with: input: 1000 points, each ∈ R500 input data is drawn from unit Gaus- sian −3 −2 −1 0 1 2 3 0.1 0.2 0.3 0.4 the network has 5 layers each layer has 500 neurons Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 195. 48/67 We will now consider a feedforward network with: input: 1000 points, each ∈ R500 input data is drawn from unit Gaus- sian −3 −2 −1 0 1 2 3 0.1 0.2 0.3 0.4 the network has 5 layers each layer has 500 neurons we will run forward propagation on this network with different weight ini- tializations Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 196. 49/67 Let’s try to initialize the weights to small random numbers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 197. 49/67 tanh activation functions Let’s try to initialize the weights to small random numbers We will see what happens to the ac- tivation across different layers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 198. 49/67 sigmoid activation functions Let’s try to initialize the weights to small random numbers We will see what happens to the ac- tivation across different layers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
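The experiment behind these plots can be reproduced in a few lines of NumPy (a sketch: the 0.01 scale is one reasonable reading of "small random numbers", and tanh is used as on the slide; swapping in sigmoid or a larger scale reproduces the other plots):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((1000, 500))             # 1000 input points, each in R^500, drawn from a unit Gaussian

for layer in range(5):                           # 5 layers of 500 neurons each
    W = rng.standard_normal((500, 500)) * 0.01   # small random initialization
    H = np.tanh(H @ W)
    print(layer, round(H.std(), 4))              # the spread of the activations collapses towards 0 layer after layer
```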
  • 199. 50/67 What will happen during back propagation? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 200. 50/67 What will happen during back propagation? Recall that w1 is proportional to the activation passing through it Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 201. 50/67 What will happen during back propagation? Recall that w1 is proportional to the activation passing through it If all the activations in a layer are very close to 0, what will happen to the gradient of the weights connect- ing this layer to the next layer? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 203. 50/67 What will happen during back propagation? Recall that w1 is proportional to the activation passing through it If all the activations in a layer are very close to 0, what will happen to the gradient of the weights connect- ing this layer to the next layer? They will all be close to 0 (vanishing gradient problem) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 204. 51/67 Let us try to initialize the weights to large random numbers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 205. 51/67 tanh activation with large weights Let us try to initialize the weights to large random numbers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 206. 51/67 sigmoid activations with large weights Let us try to initialize the weights to large random numbers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 207. 51/67 sigmoid activations with large weights Let us try to initialize the weights to large random numbers Most activations have saturated Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 208. 51/67 sigmoid activations with large weights Let us try to initialize the weights to large random numbers Most activations have saturated What happens to the gradients at sat- uration? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 209. 51/67 sigmoid activations with large weights Let us try to initialize the weights to large random numbers Most activations have saturated What happens to the gradients at sat- uration? They will all be close to 0 (vanishing gradient problem) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 210. 52/67 Let us try to arrive at a more principled way of initializing weights Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 211. 52/67 x1 x2 x3 s1ns11 xn Let us try to arrive at a more principled way of initializing weights s11 = n i=1 w1ixi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 212. 52/67 x1 x2 x3 s1ns11 xn Let us try to arrive at a more principled way of initializing weights s11 = n i=1 w1ixi V ar(s11) = V ar( n i=1 w1ixi) = n i=1 V ar(w1ixi) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 213. 52/67 x1 x2 x3 s1ns11 xn Let us try to arrive at a more principled way of initializing weights s11 = n i=1 w1ixi V ar(s11) = V ar( n i=1 w1ixi) = n i=1 V ar(w1ixi) = n i=1 (E[w1i])2 V ar(xi) + (E[xi])2 V ar(w1i) + V ar(xi)V ar(w1i) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 214. 52/67 x1 x2 x3 s1ns11 xn [Assuming 0 Mean inputs and weights] Let us try to arrive at a more principled way of initializing weights s11 = n i=1 w1ixi V ar(s11) = V ar( n i=1 w1ixi) = n i=1 V ar(w1ixi) = n i=1 (E[w1i])2 V ar(xi) + (E[xi])2 V ar(w1i) + V ar(xi)V ar(w1i) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 215. 52/67 x1 x2 x3 s1ns11 xn [Assuming 0 Mean inputs and weights] Let us try to arrive at a more principled way of initializing weights s11 = n i=1 w1ixi V ar(s11) = V ar( n i=1 w1ixi) = n i=1 V ar(w1ixi) = n i=1 (E[w1i])2 V ar(xi) + (E[xi])2 V ar(w1i) + V ar(xi)V ar(w1i) = n i=1 V ar(xi)V ar(w1i) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 216. 52/67 x1 x2 x3 s1ns11 xn [Assuming 0 Mean inputs and weights] [Assuming V ar(xi) = V ar(x)∀i ] [Assuming V ar(w1i) = V ar(w)∀i] Let us try to arrive at a more principled way of initializing weights s11 = n i=1 w1ixi V ar(s11) = V ar( n i=1 w1ixi) = n i=1 V ar(w1ixi) = n i=1 (E[w1i])2 V ar(xi) + (E[xi])2 V ar(w1i) + V ar(xi)V ar(w1i) = n i=1 V ar(xi)V ar(w1i) = (nV ar(w))(V ar(x)) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 217. 53/67 x1 x2 x3 s1ns11 xn In general, V ar(S1i) = (nV ar(w))(V ar(x)) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 218. 53/67 x1 x2 x3 s1ns11 xn In general, V ar(S1i) = (nV ar(w))(V ar(x)) What would happen if nV ar(w) ≫ 1 ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 219. 53/67 x1 x2 x3 s1ns11 xn In general, V ar(S1i) = (nV ar(w))(V ar(x)) What would happen if nV ar(w) ≫ 1 ? The variance of S1i will be large Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 220. 53/67 x1 x2 x3 s1ns11 xn In general, V ar(S1i) = (nV ar(w))(V ar(x)) What would happen if nV ar(w) ≫ 1 ? The variance of S1i will be large What would happen if nV ar(w) → 0? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 221. 53/67 x1 x2 x3 s1ns11 xn In general, V ar(S1i) = (nV ar(w))(V ar(x)) What would happen if nV ar(w) ≫ 1 ? The variance of S1i will be large What would happen if nV ar(w) → 0? The variance of S1i will be small Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
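A quick numerical check of Var(s1i) = n · Var(w) · Var(x) (a sketch with arbitrary sizes and scales, not lecture code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.standard_normal((10000, n))        # zero-mean inputs with Var(x) = 1
w = rng.standard_normal(n) * 0.1           # zero-mean weights with Var(w) = 0.01
s = X @ w                                  # s11 = sum_i w_1i * x_i, computed for many input samples
print(s.var(), n * w.var() * X.var())      # both are close to n * Var(w) * Var(x) = 5
```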
  • 222. 54/67 Let us see what happens if we add one more layer Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 223. 54/67 x1 x2 x3 s1ns11 s21 xn Let us see what happens if we add one more layer Using the same procedure as above we will arrive at Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 224. 54/67 x1 x2 x3 s1ns11 s21 xn Let us see what happens if we add one more layer Using the same procedure as above we will arrive at V ar(s21) = n i=1 V ar(s1i)V ar(w2i) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 225. 54/67 x1 x2 x3 s1ns11 s21 xn Let us see what happens if we add one more layer Using the same procedure as above we will arrive at V ar(s21) = n i=1 V ar(s1i)V ar(w2i) = nV ar(s1i)V ar(w2) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 226. 54/67 x1 x2 x3 s1ns11 s21 xn V ar(Si1) = nV ar(w1)V ar(x) Let us see what happens if we add one more layer Using the same procedure as above we will arrive at V ar(s21) = n i=1 V ar(s1i)V ar(w2i) = nV ar(s1i)V ar(w2) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 227. 54/67 x1 x2 x3 s1ns11 s21 xn V ar(Si1) = nV ar(w1)V ar(x) Let us see what happens if we add one more layer Using the same procedure as above we will arrive at V ar(s21) = n i=1 V ar(s1i)V ar(w2i) = nV ar(s1i)V ar(w2) V ar(s21) ∝ [nV ar(w2)][nV ar(w1)]V ar(x) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 228. 54/67 x1 x2 x3 s1ns11 s21 xn V ar(Si1) = nV ar(w1)V ar(x) Let us see what happens if we add one more layer Using the same procedure as above we will arrive at V ar(s21) = n i=1 V ar(s1i)V ar(w2i) = nV ar(s1i)V ar(w2) V ar(s21) ∝ [nV ar(w2)][nV ar(w1)]V ar(x) ∝ [nV ar(w)]2 V ar(x) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 229. 54/67 x1 x2 x3 s1ns11 s21 xn V ar(Si1) = nV ar(w1)V ar(x) Let us see what happens if we add one more layer Using the same procedure as above we will arrive at V ar(s21) = n i=1 V ar(s1i)V ar(w2i) = nV ar(s1i)V ar(w2) V ar(s21) ∝ [nV ar(w2)][nV ar(w1)]V ar(x) ∝ [nV ar(w)]2 V ar(x) Assuming weights across all layers have the same variance Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 230. 55/67 In general, Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 231. 55/67 In general, V ar(ski) = [nV ar(w)]k V ar(x) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 232. 55/67 In general, V ar(ski) = [nV ar(w)]k V ar(x) To ensure that variance in the output of any layer does not blow up or shrink we want: nV ar(w) = 1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 233. 55/67 In general, V ar(ski) = [nV ar(w)]k V ar(x) To ensure that variance in the output of any layer does not blow up or shrink we want: nV ar(w) = 1 If we draw the weights from a unit Gaussian and scale them by 1√ n then, we have : Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 234. 55/67 In general, V ar(ski) = [nV ar(w)]k V ar(x) To ensure that variance in the output of any layer does not blow up or shrink we want: nV ar(w) = 1 If we draw the weights from a unit Gaussian and scale them by 1√ n then, we have : nV ar(w) = nV ar( z √ n ) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 235. 55/67 V ar(az) = a2 (V ar(z)) In general, V ar(ski) = [nV ar(w)]k V ar(x) To ensure that variance in the output of any layer does not blow up or shrink we want: nV ar(w) = 1 If we draw the weights from a unit Gaussian and scale them by 1√ n then, we have : nV ar(w) = nV ar( z √ n ) = n ∗ 1 n V ar(z) = 1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 236. 55/67 V ar(az) = a2 (V ar(z)) In general, V ar(ski) = [nV ar(w)]k V ar(x) To ensure that variance in the output of any layer does not blow up or shrink we want: nV ar(w) = 1 If we draw the weights from a unit Gaussian and scale them by 1√ n then, we have : nV ar(w) = nV ar( z √ n ) = n ∗ 1 n V ar(z) = 1 ← (UnitGaussian) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 237. 56/67 Let’s see what happens if we use this initialization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 238. 56/67 tanh activation Let’s see what happens if we use this initialization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 239. 56/67 sigmoid activations Let’s see what happens if we use this initialization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
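These plots can be reproduced with the same forward-propagation sketch as before, now with the 1/√n scaling (same illustrative assumptions about sizes): the spread of the activations no longer collapses to 0 or saturates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
H = rng.standard_normal((1000, n))                 # unit-Gaussian inputs
for layer in range(5):
    W = rng.standard_normal((n, n)) / np.sqrt(n)   # unit Gaussian scaled by 1/sqrt(n), so n*Var(w) = 1
    H = np.tanh(H @ W)
    print(layer, round(H.std(), 3))                # shrinks only mildly instead of collapsing or saturating
```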
  • 240. 57/67 However this does not work for ReLU neurons Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 241. 57/67 However this does not work for ReLU neurons Why ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 242. 57/67 However this does not work for ReLU neurons Why ? Intuition: He et.al. argue that a factor of 2 is needed when dealing with ReLU Neurons Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 243. 57/67 However this does not work for ReLU neurons Why ? Intuition: He et.al. argue that a factor of 2 is needed when dealing with ReLU Neurons Intuitively this happens because the range of ReLU neurons is restricted only to the positive half of the space Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 244. 58/67 Indeed when we account for this factor of 2 we see better performance Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
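The effect of this extra factor of 2 can be seen with the same kind of experiment (a sketch; the 10 layers and sizes below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
relu = lambda z: np.maximum(0.0, z)

for name, scale in [("1/sqrt(n)", 1.0 / np.sqrt(n)), ("sqrt(2/n) (He)", np.sqrt(2.0 / n))]:
    H = rng.standard_normal((1000, n))
    for _ in range(10):                           # 10 ReLU layers
        W = rng.standard_normal((n, n)) * scale
        H = relu(H @ W)
    print(name, round(H.std(), 4))                # activations shrink without the factor of 2, stay stable with it
```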
  • 246. 59/67 Module 9.5 : Batch Normalization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 247. 60/67 We will now see a method called batch normalization which allows us to be less careful about initialization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 248. 61/67 To understand the intuition behind Batch Nor- malization let us consider a deep network Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 249. 61/67 x1 x2 x3 h0 h1 h2 h3 h4 To understand the intuition behind Batch Nor- malization let us consider a deep network Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 250. 61/67 x1 x2 x3 h0 h1 h2 h3 h4 To understand the intuition behind Batch Nor- malization let us consider a deep network Let us focus on the learning process for the weights between these two layers Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 251. 61/67 x1 x2 x3 h0 h1 h2 h3 h4 To understand the intuition behind Batch Nor- malization let us consider a deep network Let us focus on the learning process for the weights between these two layers Typically we use mini-batch algorithms Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 252. 61/67 x1 x2 x3 h0 h1 h2 h3 h4 To understand the intuition behind Batch Nor- malization let us consider a deep network Let us focus on the learning process for the weights between these two layers Typically we use mini-batch algorithms What would happen if there is a constant change in the distribution of h3 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 253. 61/67 x1 x2 x3 h0 h1 h2 h3 h4 To understand the intuition behind Batch Nor- malization let us consider a deep network Let us focus on the learning process for the weights between these two layers Typically we use mini-batch algorithms What would happen if there is a constant change in the distribution of h3 In other words what would happen if across mini- batches the distribution of h3 keeps changing Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 254. 61/67 x1 x2 x3 h0 h1 h2 h3 h4 To understand the intuition behind Batch Nor- malization let us consider a deep network Let us focus on the learning process for the weights between these two layers Typically we use mini-batch algorithms What would happen if there is a constant change in the distribution of h3 In other words what would happen if across mini- batches the distribution of h3 keeps changing Would the learning process be easy or hard? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 255. 62/67 It would help if the pre-activations at each layer were unit gaussians Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 259. 62/67 It would help if the pre-activations at each layer were unit gaussians Why not explicitly ensure this by standardizing the pre-activation ? ˆsik = (sik − E[sik]) / √var(sik) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 260. 62/67 It would help if the pre-activations at each layer were unit gaussians Why not explicitly ensure this by standardizing the pre-activation ? ˆsik = (sik − E[sik]) / √var(sik) But how do we compute E[sik] and Var[sik]? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 261. 62/67 It would help if the pre-activations at each layer were unit gaussians Why not explicitly ensure this by standardizing the pre-activation ? ˆsik = (sik − E[sik]) / √var(sik) But how do we compute E[sik] and Var[sik]? We compute it from a mini-batch Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 262. 62/67 It would help if the pre-activations at each layer were unit gaussians Why not explicitly ensure this by standardizing the pre-activation ? ˆsik = (sik − E[sik]) / √var(sik) But how do we compute E[sik] and Var[sik]? We compute it from a mini-batch Thus we are explicitly ensuring that the distribution of the inputs at different layers does not change across batches Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 263. 63/67 This is what the deep network will look like with Batch Normalization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 264. 63/67 This is what the deep network will look like with Batch Normalization Is this legal ? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 265. 63/67 This is what the deep network will look like with Batch Normalization Is this legal ? Yes, it is because just as the tanh layer is dif- ferentiable, the Batch Normalization layer is also differentiable Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 266. 63/67 This is what the deep network will look like with Batch Normalization Is this legal ? Yes, it is because just as the tanh layer is dif- ferentiable, the Batch Normalization layer is also differentiable Hence we can backpropagate through this layer Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 267. 64/67 Catch: Do we necessarily want to force a unit gaussian input to the tanh layer? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 268. 64/67 Catch: Do we necessarily want to force a unit gaussian input to the tanh layer? Why not let the network learn what is best for it? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 269. 64/67 Catch: Do we necessarily want to force a unit gaussian input to the tanh layer? Why not let the network learn what is best for it? After the Batch Normalization step add the following step: y(k) = γk ˆsik + β(k) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 270. 64/67 γk and βk are additional parameters of the network. Catch: Do we necessarily want to force a unit gaussian input to the tanh layer? Why not let the network learn what is best for it? After the Batch Normalization step add the following step: y(k) = γk ˆsik + β(k) What happens if the network learns: γk = √var(xk) βk = E[xk] Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 271. 64/67 γk and βk are additional parameters of the network. Catch: Do we necessarily want to force a unit gaussian input to the tanh layer? Why not let the network learn what is best for it? After the Batch Normalization step add the following step: y(k) = γk ˆsik + β(k) What happens if the network learns: γk = √var(xk) βk = E[xk] We will recover sik Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 272. 64/67 γk and βk are additional parameters of the network. Catch: Do we necessarily want to force a unit gaussian input to the tanh layer? Why not let the network learn what is best for it? After the Batch Normalization step add the following step: y(k) = γk ˆsik + β(k) What happens if the network learns: γk = √var(xk) βk = E[xk] We will recover sik In other words by adjusting these additional parameters the network can learn to recover sik if that is more favourable Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
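A minimal sketch of this layer's forward pass for one mini-batch (illustrative; eps is the usual small constant added for numerical stability, which the slide omits):

```python
import numpy as np

def batchnorm_forward(S, gamma, beta, eps=1e-5):
    # S: (batch, features) pre-activations s_ik for one mini-batch
    mu = S.mean(axis=0)                       # E[s_k], estimated from the mini-batch
    var = S.var(axis=0)                       # var(s_k), estimated from the mini-batch
    S_hat = (S - mu) / np.sqrt(var + eps)     # standardize: zero mean, unit variance per feature
    return gamma * S_hat + beta               # learnable scale and shift: y_k = gamma_k * s_hat_k + beta_k

rng = np.random.default_rng(0)
S = rng.standard_normal((32, 4)) * 10.0 + 3.0             # a badly scaled mini-batch
Y = batchnorm_forward(S, gamma=np.ones(4), beta=np.zeros(4))
print(Y.mean(axis=0).round(3), Y.std(axis=0).round(3))    # ~0 and ~1 per feature
# If the network instead learned gamma = sqrt(var(s_k)) and beta = E[s_k], Y would recover S.
```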
  • 273. 65/67 We will now compare the performance with and without batch normalization on MNIST data using 2 layers.... Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 274. 66/67 [Plots comparing training on MNIST with and without batch normalization] Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 294. 67/67 2016-17: Still exciting times Even better optimization methods Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 295. 67/67 2016-17: Still exciting times Even better optimization methods Data driven initialization methods Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
  • 296. 67/67 2016-17: Still exciting times Even better optimization methods Data driven initialization methods Beyond batch normalization Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9