This document summarizes a lecture on autoencoders, covering the following key points:
- Autoencoders are neural networks that encode an input into a hidden representation and then decode that representation to reconstruct the input.
- Undercomplete autoencoders have fewer hidden units than inputs, forcing the hidden representation to capture the most important characteristics of the input. This is analogous to PCA.
- Overcomplete autoencoders can learn trivial (identity) encodings unless regularized.
- For binary inputs, the logistic function is the most appropriate decoder activation, since it restricts outputs to the range [0, 1].
Slide 1/55
CS7015 (Deep Learning): Lecture 7
Autoencoders and relation to PCA, Regularization in autoencoders, Denoising autoencoders, Sparse autoencoders, Contractive autoencoders
Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Slide 3/55
[Figure: an autoencoder with input $x_i$, encoder weights $W$, hidden layer $h$, decoder weights $W^*$, and reconstruction $\hat{x}_i$]
An autoencoder is a special type of feed-forward neural network which does the following:
- Encodes its input $x_i$ into a hidden representation $h$:
  $h = g(Wx_i + b)$
- Decodes the input again from this hidden representation:
  $\hat{x}_i = f(W^*h + c)$
The model is trained to minimize a certain loss function which will ensure that $\hat{x}_i$ is close to $x_i$ (we will see some such loss functions soon).
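To make the encoder/decoder pair concrete, here is a minimal NumPy sketch of one forward pass. The dimensions, the random initialization, and the use of the sigmoid for both $g$ and $f$ are illustrative assumptions, not prescriptions from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 8, 3                                  # input dim n, hidden dim d (assumed)
W = rng.normal(scale=0.1, size=(d, n))       # encoder weights
b = np.zeros(d)                              # encoder bias
W_star = rng.normal(scale=0.1, size=(n, d))  # decoder weights
c = np.zeros(n)                              # decoder bias

x = rng.random(n)                 # one input x_i
h = sigmoid(W @ x + b)            # encode: h = g(W x_i + b)
x_hat = sigmoid(W_star @ h + c)   # decode: x̂_i = f(W* h + c)
```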
Slide 4/55
$h = g(Wx_i + b)$, $\hat{x}_i = f(W^*h + c)$
- Let us consider the case where $\dim(h) < \dim(x_i)$.
- If we are still able to reconstruct $\hat{x}_i$ perfectly from $h$, then what does it say about $h$?
- $h$ is a loss-free encoding of $x_i$. It captures all the important characteristics of $x_i$.
- Do you see an analogy with PCA?
An autoencoder where $\dim(h) < \dim(x_i)$ is called an undercomplete autoencoder.
Slide 5/55
$h = g(Wx_i + b)$, $\hat{x}_i = f(W^*h + c)$
- Let us consider the case when $\dim(h) \geq \dim(x_i)$.
- In such a case the autoencoder could learn a trivial encoding by simply copying $x_i$ into $h$ and then copying $h$ into $\hat{x}_i$.
- Such an identity encoding is useless in practice, as it does not really tell us anything about the important characteristics of the data.
An autoencoder where $\dim(h) \geq \dim(x_i)$ is called an overcomplete autoencoder.
Slides 6-7/55: The Road Ahead
- Choice of $f(x_i)$ and $g(x_i)$
- Choice of loss function
Slide 8/55 (binary inputs)
$x_i = [0\ 1\ 1\ 0\ 1]$, $h = g(Wx_i + b)$, $\hat{x}_i = f(W^*h + c)$
- Suppose all our inputs are binary (each $x_{ij} \in \{0, 1\}$).
- Which of the following functions would be most apt for the decoder?
  - $\hat{x}_i = \tanh(W^*h + c)$
  - $\hat{x}_i = W^*h + c$
  - $\hat{x}_i = \text{logistic}(W^*h + c)$
- Logistic, as it naturally restricts all outputs to be between 0 and 1.
- $g$ is typically chosen as the sigmoid function.
Slide 9/55 (real-valued inputs)
$x_i = [0.25\ 0.5\ 1.25\ 3.5\ 4.5]$, $h = g(Wx_i + b)$, $\hat{x}_i = f(W^*h + c)$
- Suppose all our inputs are real (each $x_{ij} \in \mathbb{R}$).
- Which of the following functions would be most apt for the decoder?
  - $\hat{x}_i = \tanh(W^*h + c)$
  - $\hat{x}_i = W^*h + c$
  - $\hat{x}_i = \text{logistic}(W^*h + c)$
- What will logistic and tanh do? They will restrict the reconstructed $\hat{x}_i$ to lie in $[0, 1]$ or $[-1, 1]$, whereas we want $\hat{x}_i \in \mathbb{R}^n$. Hence a linear decoder, $\hat{x}_i = W^*h + c$, is the apt choice here.
- Again, $g$ is typically chosen as the sigmoid function.
Slide 10/55: The Road Ahead
- Choice of $f(x_i)$ and $g(x_i)$
- Choice of loss function
Slide 11/55
$h = g(Wx_i + b)$, $\hat{x}_i = f(W^*h + c)$
- Consider the case when the inputs are real-valued.
- The objective of the autoencoder is to reconstruct $\hat{x}_i$ to be as close to $x_i$ as possible.
- This can be formalized using the following objective function:
  $\min_{W, W^*, c, b} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2$
  i.e., $\min_{W, W^*, c, b} \frac{1}{m} \sum_{i=1}^{m} (\hat{x}_i - x_i)^T (\hat{x}_i - x_i)$
- We can then train the autoencoder just like a regular feedforward network using backpropagation.
- All we need is a formula for $\frac{\partial \mathcal{L}(\theta)}{\partial W^*}$ and $\frac{\partial \mathcal{L}(\theta)}{\partial W}$, which we will see now.
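As a direct translation of this objective, here is a short NumPy sketch of the reconstruction loss over a batch; the array shapes (rows are examples) are an assumption of the sketch:

```python
import numpy as np

def reconstruction_loss(X, X_hat):
    # (1/m) Σ_i Σ_j (x̂_ij − x_ij)^2, equivalently (1/m) Σ_i (x̂_i − x_i)^T (x̂_i − x_i)
    m = X.shape[0]                     # number of training examples
    return np.sum((X_hat - X) ** 2) / m
```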
Slide 12/55
$\mathcal{L}(\theta) = (\hat{x}_i - x_i)^T (\hat{x}_i - x_i)$
[Figure: two-layer network $h_0 = x_i \xrightarrow{W} a_1 \to h_1 \xrightarrow{W^*} a_2 \to h_2 = \hat{x}_i$]
Note that the loss function is shown for only one training example.
$\frac{\partial \mathcal{L}(\theta)}{\partial W^*} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial W^*}$
$\frac{\partial \mathcal{L}(\theta)}{\partial W} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1} \frac{\partial a_1}{\partial W}$
We have already seen how to calculate the expressions in the boxes when we learnt backpropagation.
$\frac{\partial \mathcal{L}(\theta)}{\partial h_2} = \frac{\partial \mathcal{L}(\theta)}{\partial \hat{x}_i} = \nabla_{\hat{x}_i} \left\{ (\hat{x}_i - x_i)^T (\hat{x}_i - x_i) \right\} = 2(\hat{x}_i - x_i)$
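As a sanity check on these chain-rule expressions, here is a self-contained NumPy sketch that computes $\partial \mathcal{L}/\partial W^*$ and $\partial \mathcal{L}/\partial W$ for one example, assuming a sigmoid encoder and (for simplicity) a linear decoder so that $h_2 = a_2$; all shapes and values are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 8, 3
W, b = rng.normal(scale=0.1, size=(d, n)), np.zeros(d)
W_star, c = rng.normal(scale=0.1, size=(n, d)), np.zeros(n)
x = rng.random(n)

a1 = W @ x + b
h1 = sigmoid(a1)                     # encoder output
a2 = W_star @ h1 + c
x_hat = a2                           # linear decoder: h2 = a2

dL_dh2 = 2.0 * (x_hat - x)           # ∂L/∂h2 = 2(x̂_i − x_i)
dL_dW_star = np.outer(dL_dh2, h1)    # ∂L/∂W* (∂h2/∂a2 = I for a linear decoder)
dL_dh1 = W_star.T @ dL_dh2           # ∂L/∂h1
dL_da1 = dL_dh1 * h1 * (1.0 - h1)    # σ'(a1) = σ(a1)(1 − σ(a1))
dL_dW = np.outer(dL_da1, x)          # ∂L/∂W
```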
Slide 13/55 (binary inputs)
$x_i = [0\ 1\ 1\ 0\ 1]$, $h = g(Wx_i + b)$, $\hat{x}_i = f(W^*h + c)$
- Consider the case when the inputs are binary.
- We use a sigmoid decoder, which will produce outputs between 0 and 1 that can be interpreted as probabilities.
- For a single $n$-dimensional $i$-th input we can use the following (cross-entropy) loss function:
  $\min \left\{ -\sum_{j=1}^{n} \left( x_{ij} \log \hat{x}_{ij} + (1 - x_{ij}) \log(1 - \hat{x}_{ij}) \right) \right\}$
- What value of $\hat{x}_{ij}$ will minimize this function? If $x_{ij} = 1$? If $x_{ij} = 0$?
- Indeed, the above function will be minimized when $\hat{x}_{ij} = x_{ij}$!
- Again, all we need is a formula for $\frac{\partial \mathcal{L}(\theta)}{\partial W^*}$ and $\frac{\partial \mathcal{L}(\theta)}{\partial W}$ to use backpropagation.
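A direct NumPy translation of this loss, under the assumption that $\hat{x}_i$ comes from a sigmoid decoder (so every $\hat{x}_{ij}$ is strictly between 0 and 1); the clipping is an added numerical precaution, and the example values are made up:

```python
import numpy as np

def binary_cross_entropy(x, x_hat, eps=1e-12):
    # L(θ) = -Σ_j [ x_ij log x̂_ij + (1 − x_ij) log(1 − x̂_ij) ]
    x_hat = np.clip(x_hat, eps, 1.0 - eps)   # guard against log(0)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

x = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
print(binary_cross_entropy(x, np.array([0.1, 0.9, 0.8, 0.2, 0.7])))  # small loss
print(binary_cross_entropy(x, np.array([0.9, 0.1, 0.2, 0.8, 0.3])))  # large loss
```

As the slide notes, the loss shrinks as each $\hat{x}_{ij}$ approaches $x_{ij}$.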
Slide 14/55
$\mathcal{L}(\theta) = -\sum_{j=1}^{n} \left( x_{ij} \log \hat{x}_{ij} + (1 - x_{ij}) \log(1 - \hat{x}_{ij}) \right)$
$\frac{\partial \mathcal{L}(\theta)}{\partial W^*} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial W^*}$
$\frac{\partial \mathcal{L}(\theta)}{\partial W} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1} \frac{\partial a_1}{\partial W}$
Here $\frac{\partial \mathcal{L}(\theta)}{\partial h_2} = \left[ \frac{\partial \mathcal{L}(\theta)}{\partial h_{21}}, \frac{\partial \mathcal{L}(\theta)}{\partial h_{22}}, \ldots, \frac{\partial \mathcal{L}(\theta)}{\partial h_{2n}} \right]^T$
We have already seen how to calculate the expressions in the square boxes when we learnt backpropagation. The first two terms on the RHS can be computed as:
$\frac{\partial \mathcal{L}(\theta)}{\partial h_{2j}} = -\frac{x_{ij}}{\hat{x}_{ij}} + \frac{1 - x_{ij}}{1 - \hat{x}_{ij}}$
$\frac{\partial h_{2j}}{\partial a_{2j}} = \sigma(a_{2j})(1 - \sigma(a_{2j}))$
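These two per-component terms multiply out to the familiar simplification $\frac{\partial \mathcal{L}(\theta)}{\partial a_{2j}} = \hat{x}_{ij} - x_{ij}$; a quick numerical check with assumed values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a2j, xij = 0.3, 1.0                  # assumed pre-activation and target
xhat = sigmoid(a2j)                  # h_2j = σ(a_2j)
dL_dh2j = -xij / xhat + (1 - xij) / (1 - xhat)
dh2j_da2j = xhat * (1 - xhat)
print(dL_dh2j * dh2j_da2j)           # ≈ -0.42556
print(xhat - xij)                    # same value: x̂_ij − x_ij
```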
Slide 15/55
Module 7.2: Link between PCA and Autoencoders
Slide 16/55
[Figure: autoencoder $x_i \to h \to \hat{x}_i$ shown as equivalent ($\equiv$) to PCA, $P^T X^T X P = D$, with principal directions $u_1, u_2$ in the $x$-$y$ plane]
We will now see that the encoder part of an autoencoder is equivalent to PCA if we
- use a linear encoder,
- use a linear decoder,
- use the squared error loss function, and
- normalize the inputs to
  $\hat{x}_{ij} = \frac{1}{\sqrt{m}} \left( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \right)$
Slide 17/55
First let us consider the implication of normalizing the inputs to
$\hat{x}_{ij} = \frac{1}{\sqrt{m}} \left( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \right)$
- The operation in the bracket ensures that the data now has zero mean along each dimension $j$ (we are subtracting the mean).
- Let $X'$ be this zero-mean data matrix; then what the above normalization gives us is $X = \frac{1}{\sqrt{m}} X'$.
- Now $X^T X = \frac{1}{m} (X')^T X'$ is the covariance matrix (recall that the covariance matrix plays an important role in PCA).
Slide 18/55
First we will show that if we use a linear decoder and a squared error loss function, then the optimal solution to the following objective function
$\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \hat{x}_{ij})^2$
is obtained when we use a linear encoder.
Slide 19/55
$\min_{\theta} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \hat{x}_{ij})^2 \quad (1)$
This is equivalent to
$\min_{W^*, H} \| X - HW^* \|_F^2$, where $\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}$
(just writing expression (1) in matrix form and using the definition of $\|A\|_F$; we are ignoring the biases)
From SVD we know that the optimal solution to the above problem is given by
$HW^* = U_{\cdot, \leq k} \, \Sigma_{k,k} \, V^T_{\cdot, \leq k}$
By matching variables, one possible solution is
$H = U_{\cdot, \leq k} \, \Sigma_{k,k}$
$W^* = V^T_{\cdot, \leq k}$
Slide 20/55
We will now show that $H$ is a linear encoding and find an expression for the encoder weights $W$:

$H = U_{\cdot, \leq k} \, \Sigma_{k,k}$
$\;\; = (XX^T)(XX^T)^{-1} U_{\cdot, \leq k} \Sigma_{k,k}$  (pre-multiplying by $(XX^T)(XX^T)^{-1} = I$)
$\;\; = (XV\Sigma^T U^T)(U\Sigma V^T V\Sigma^T U^T)^{-1} U_{\cdot, \leq k} \Sigma_{k,k}$  (using $X = U\Sigma V^T$)
$\;\; = XV\Sigma^T U^T (U\Sigma\Sigma^T U^T)^{-1} U_{\cdot, \leq k} \Sigma_{k,k}$  ($V^T V = I$)
$\;\; = XV\Sigma^T U^T U(\Sigma\Sigma^T)^{-1} U^T U_{\cdot, \leq k} \Sigma_{k,k}$  ($(ABC)^{-1} = C^{-1}B^{-1}A^{-1}$)
$\;\; = XV\Sigma^T (\Sigma\Sigma^T)^{-1} U^T U_{\cdot, \leq k} \Sigma_{k,k}$  ($U^T U = I$)
$\;\; = XV\Sigma^T (\Sigma^T)^{-1} \Sigma^{-1} U^T U_{\cdot, \leq k} \Sigma_{k,k}$  ($(AB)^{-1} = B^{-1}A^{-1}$)
$\;\; = XV\Sigma^{-1} I_{\cdot, \leq k} \Sigma_{k,k}$  ($U^T U_{\cdot, \leq k} = I_{\cdot, \leq k}$)
$\;\; = XV I_{\cdot, \leq k}$  ($\Sigma^{-1} I_{\cdot, \leq k} = \Sigma^{-1}_{k,k}$, which cancels $\Sigma_{k,k}$)
$H = XV_{\cdot, \leq k}$

Thus $H$ is a linear transformation of $X$, and $W = V_{\cdot, \leq k}$.
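The end result $H = XV_{\cdot, \leq k} = U_{\cdot, \leq k} \Sigma_{k,k}$ can be verified numerically in a few lines; the matrix sizes and $k$ below are arbitrary assumptions:

```python
import numpy as np

m, n, k = 10, 6, 3
X = np.random.default_rng(1).normal(size=(m, n))
U, S, Vt = np.linalg.svd(X, full_matrices=False)

H_svd = U[:, :k] * S[:k]           # U_{.,≤k} Σ_{k,k}
H_lin = X @ Vt.T[:, :k]            # X V_{.,≤k}
print(np.allclose(H_svd, H_lin))   # True: H is a linear function of X
```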
Slide 21/55
- We have the encoder $W = V_{\cdot, \leq k}$.
- From SVD, we know that $V$ is the matrix of eigenvectors of $X^T X$.
- From PCA, we know that $P$ is the matrix of eigenvectors of the covariance matrix.
- We saw earlier that, if the entries of $X$ are normalized by
  $\hat{x}_{ij} = \frac{1}{\sqrt{m}} \left( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \right)$
  then $X^T X$ is indeed the covariance matrix.
- Thus, the encoder matrix for the linear autoencoder ($W$) and the projection matrix ($P$) for PCA could indeed be the same. Hence proved.
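One way to see this numerically: after the normalization above, the right singular vectors of $X$ agree (up to sign and ordering) with the eigenvectors of the covariance matrix. A sketch with assumed random data:

```python
import numpy as np

rng = np.random.default_rng(2)
X0 = rng.normal(size=(100, 5))
X = (X0 - X0.mean(axis=0)) / np.sqrt(X0.shape[0])  # the normalization above

eigvals, P = np.linalg.eigh(X.T @ X)   # eigenvectors of the covariance matrix
_, _, Vt = np.linalg.svd(X)            # right singular vectors of X

# eigh sorts eigenvalues ascending, SVD sorts singular values descending,
# and each eigenvector is only defined up to sign; compare accordingly.
print(np.allclose(np.abs(P[:, ::-1]), np.abs(Vt.T)))   # True
```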
Slide 22/55: Remember
The encoder of a linear autoencoder is equivalent to PCA if we
- use a linear encoder,
- use a linear decoder,
- use a squared error loss function, and
- normalize the inputs to
  $\hat{x}_{ij} = \frac{1}{\sqrt{m}} \left( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \right)$
Slide 24/55
- While poor generalization can happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders.
- Here (as stated earlier) the model can simply learn to copy $x_i$ to $h$ and then $h$ to $\hat{x}_i$.
- To avoid poor generalization, we need to introduce regularization.
Slide 25/55
- The simplest solution is to add an L2-regularization term to the objective function:
  $\min_{\theta = \{W, W^*, b, c\}} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2 + \lambda \|\theta\|^2$
- This is very easy to implement and just adds a term $\lambda W$ to the gradient $\frac{\partial \mathcal{L}(\theta)}{\partial W}$ (and similarly for the other parameters).
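In code this is a one-line change per parameter; `lam` is an assumed hyperparameter, and the update follows the slide's convention of writing the added term as $\lambda W$:

```python
import numpy as np

lam = 1e-3                                   # assumed regularization strength
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 8))
dL_dW = rng.normal(size=W.shape)             # stand-in for the backprop gradient

dL_dW_reg = dL_dW + lam * W                  # gradient of the regularized objective
```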
Slide 26/55
- Another trick is to tie the weights of the encoder and decoder, i.e., $W^* = W^T$.
- This effectively reduces the capacity of the autoencoder and acts as a regularizer.
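With tied weights the decoder reuses the encoder matrix instead of learning its own; a minimal sketch under the same assumed setup as the earlier forward pass:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 8, 3
W, b, c = rng.normal(scale=0.1, size=(d, n)), np.zeros(d), np.zeros(n)

x = rng.random(n)
h = sigmoid(W @ x + b)
x_hat = sigmoid(W.T @ h + c)   # decoder uses W* = W^T, no separate weight matrix
# During training, gradients w.r.t. W accumulate from both encoder and decoder paths.
```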
Slide 28/55
[Figure: $x_i$ is corrupted into $\tilde{x}_i$, which is encoded into $h$ and decoded into $\hat{x}_i$]
- A denoising autoencoder simply corrupts the input data using a probabilistic process $P(\tilde{x}_{ij} \mid x_{ij})$ before feeding it to the network.
- A simple $P(\tilde{x}_{ij} \mid x_{ij})$ used in practice is the following:
  $P(\tilde{x}_{ij} = 0 \mid x_{ij}) = q$
  $P(\tilde{x}_{ij} = x_{ij} \mid x_{ij}) = 1 - q$
- In other words, with probability $q$ an input component is flipped to 0, and with probability $(1 - q)$ it is retained as it is.
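This corruption process is easy to simulate; $q$ below is an assumed corruption probability:

```python
import numpy as np

def corrupt(x, q, rng):
    # With probability q each component is set to 0; otherwise it is kept as is.
    keep = rng.random(x.shape) >= q
    return x * keep

rng = np.random.default_rng(3)
x = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
x_tilde = corrupt(x, q=0.3, rng=rng)   # fed to the network in place of x
```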
Slide 29/55
How does this help?
- It helps because the objective is still to reconstruct the original (uncorrupted) $x_i$:
  $\arg\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2$
- It no longer makes sense for the model to copy the corrupted $\tilde{x}_i$ into $h(\tilde{x}_i)$ and then into $\hat{x}_i$ (the objective function will not be minimized by doing so).
- Instead the model will now have to capture the characteristics of the data correctly.
- For example, it will have to learn to reconstruct a corrupted $\tilde{x}_{ij}$ correctly by relying on its interactions with the other elements of $x_i$.
Slide 30/55
We will now see a practical application in which AEs are used, and then compare denoising autoencoders with regular autoencoders.
Slides 31-33/55
Task: Hand-written digit recognition
[Figure: MNIST data, digits 0, 1, 2, 3, ..., 9; $|x_i| = 784 = 28 \times 28$]
[Figure: Basic approach (we use raw data as input features)]
[Figure: AE approach (first learn important characteristics of the data): $\hat{x}_i \in \mathbb{R}^{784}$, $h \in \mathbb{R}^d$]
[Figure: AE approach (and then train a classifier on top of this hidden representation $h \in \mathbb{R}^d$)]
Slide 34/55
We will now see a way of visualizing AEs, and use this visualization to compare different AEs.
Slide 35/55
- We can think of each neuron as a filter which will fire (or get maximally activated) for a certain input configuration $x_i$.
- For example, $h_1 = \sigma(W_1^T x_i)$ [ignoring bias $b$], where $W_1$ is the trained vector of weights connecting the input to the first hidden neuron.
- What values of $x_i$ will cause $h_1$ to be maximum (or maximally activated)?
- Suppose we assume that our inputs are normalized so that $\|x_i\| = 1$. Then we want to solve
  $\max_{x_i} \{ W_1^T x_i \} \quad \text{s.t.} \quad \|x_i\|^2 = x_i^T x_i = 1$
  Solution: $x_i = \frac{W_1}{\sqrt{W_1^T W_1}}$
161. 36/55
xi
h
ˆxi
max
xi
{WT
1 xi}
s.t. ||xi||2
= xT
i xi = 1
Solution: xi =
W1
WT
1 W1
Thus the inputs
xi =
W1
WT
1 W1
,
W2
WT
2 W2
, . . .
Wn
WT
n Wn
will respectively cause hidden neurons 1 to n
to maximally fire
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
162. 36/55
xi
h
ˆxi
max
xi
{WT
1 xi}
s.t. ||xi||2
= xT
i xi = 1
Solution: xi =
W1
WT
1 W1
Thus the inputs
xi =
W1
WT
1 W1
,
W2
WT
2 W2
, . . .
Wn
WT
n Wn
will respectively cause hidden neurons 1 to n
to maximally fire
Let us plot these images (xi’s) which maxim-
ally activate the first k neurons of the hidden
representations learned by a vanilla autoen-
coder and different denoising autoencoders
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
163. 36/55
xi
h
ˆxi
max
xi
{WT
1 xi}
s.t. ||xi||2
= xT
i xi = 1
Solution: xi =
W1
WT
1 W1
Thus the inputs
xi =
W1
WT
1 W1
,
W2
WT
2 W2
, . . .
Wn
WT
n Wn
will respectively cause hidden neurons 1 to n
to maximally fire
Let us plot these images (xi’s) which maxim-
ally activate the first k neurons of the hidden
representations learned by a vanilla autoen-
coder and different denoising autoencoders
These xi’s are computed by the above formula
using the weights (W1, W2 . . . Wk) learned by
the respective autoencoders
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
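Computing these maximally-activating inputs from the learned weights is a one-liner; the sketch below uses random weights as a stand-in for the trained $W$ (illustrative only):

```python
import numpy as np

def max_activating_inputs(W):
    """Column l of the result is x = W_l / sqrt(W_l^T W_l), the unit-norm
    input that maximally activates hidden neuron l (formula above).
    W has shape (n_inputs, k_hidden)."""
    return W / np.sqrt((W ** 2).sum(axis=0, keepdims=True))

W = np.random.randn(784, 100)          # stand-in for learned AE weights
filters = max_activating_inputs(W)
img = filters[:, 0].reshape(28, 28)    # reshape for plotting as an image
```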
Figure: Vanilla AE (no noise) | Figure: 25% Denoising AE ($q = 0.25$) | Figure: 50% Denoising AE ($q = 0.5$)

The vanilla AE does not learn many meaningful patterns. The hidden neurons of the denoising AEs seem to act like pen-stroke detectors (for example, in the highlighted neuron the black region is a stroke that you would expect in a '0', '2', '3', '8' or '9'). As the noise increases, the filters become wider because each neuron has to rely on more adjacent pixels to feel confident about a stroke.
We saw one form of $P(\tilde{x}_{ij} \mid x_{ij})$ which flips a fraction $q$ of the inputs to zero. Another way of corrupting the inputs is to add Gaussian noise to the input:
$$\tilde{x}_{ij} = x_{ij} + \mathcal{N}(0, 1)$$
We will now use such a denoising AE on a different dataset and see its performance.
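Mirroring the masking-noise sketch earlier, the additive Gaussian corruption is (illustrative NumPy):

```python
import numpy as np

def corrupt_gaussian(x, rng=np.random.default_rng(0)):
    # x~_ij = x_ij + noise, noise ~ N(0, 1), as in the equation above
    return x + rng.standard_normal(x.shape)
```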
Figure: Data | Figure: AE filters | Figure: Weight decay filters

The hidden neurons essentially behave like edge detectors; PCA does not give such edge detectors.
A hidden neuron with a sigmoid activation will have values between 0 and 1. We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0. A sparse autoencoder tries to ensure that each neuron is inactive most of the time.
The average value of the activation of neuron $l$ is given by
$$\hat{\rho}_l = \frac{1}{m} \sum_{i=1}^{m} h(x_i)_l$$
If neuron $l$ is sparse (i.e., mostly inactive) then $\hat{\rho}_l \to 0$. A sparse autoencoder uses a sparsity parameter $\rho$ (typically very close to 0, say 0.005) and tries to enforce the constraint $\hat{\rho}_l = \rho$. One way of ensuring this is to add the following term to the objective function:
$$\Omega(\theta) = \sum_{l=1}^{k} \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l}$$
When will this term reach its minimum value, and what is the minimum value? Let us plot it and check.
Figure: plot of $\Omega(\theta)$ as a function of $\hat{\rho}_l$ for $\rho = 0.2$.

The function reaches its minimum value, 0, when $\hat{\rho}_l = \rho$.
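Each summand is the KL divergence between a Bernoulli($\rho$) and a Bernoulli($\hat{\rho}_l$) distribution, which is why the minimum is 0 exactly at $\hat{\rho}_l = \rho$. A small sketch (the function name is illustrative):

```python
import numpy as np

def sparsity_penalty(rho_hat, rho=0.2):
    """Omega(theta): sum over neurons of KL(Bernoulli(rho) || Bernoulli(rho_hat_l))."""
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

rho_hat = np.full(5, 0.2)                           # every neuron exactly at rho
assert np.isclose(sparsity_penalty(rho_hat), 0.0)   # minimum value is 0
```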
Now,
$$\hat{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \Omega(\theta)$$
where $\mathcal{L}(\theta)$ is the squared error loss or cross-entropy loss and $\Omega(\theta)$ is the sparsity constraint. We already know how to calculate $\frac{\partial \mathcal{L}(\theta)}{\partial W}$; let us see how to calculate $\frac{\partial \Omega(\theta)}{\partial W}$.

$$\Omega(\theta) = \sum_{l=1}^{k} \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l}$$
can be re-written as
$$\Omega(\theta) = \sum_{l=1}^{k} \rho \log \rho - \rho \log \hat{\rho}_l + (1 - \rho) \log(1 - \rho) - (1 - \rho) \log(1 - \hat{\rho}_l)$$

By the chain rule:
$$\frac{\partial \Omega(\theta)}{\partial W} = \frac{\partial \Omega(\theta)}{\partial \hat{\rho}} \cdot \frac{\partial \hat{\rho}}{\partial W}, \qquad \frac{\partial \Omega(\theta)}{\partial \hat{\rho}} = \left[ \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_1}, \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_2}, \ldots, \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_k} \right]^T$$

For each neuron $l \in 1 \ldots k$ in the hidden layer, we have
$$\frac{\partial \Omega(\theta)}{\partial \hat{\rho}_l} = -\frac{\rho}{\hat{\rho}_l} + \frac{1 - \rho}{1 - \hat{\rho}_l}$$
and
$$\frac{\partial \hat{\rho}_l}{\partial W} = x_i \left( g'(W^T x_i + b) \right)^T \quad \text{(see the derivation below)}$$

Finally,
$$\frac{\partial \hat{\mathcal{L}}(\theta)}{\partial W} = \frac{\partial \mathcal{L}(\theta)}{\partial W} + \frac{\partial \Omega(\theta)}{\partial W}$$
and we know how to calculate both terms on the R.H.S.
Derivation:
$$\frac{\partial \hat{\rho}}{\partial W} = \left[ \frac{\partial \hat{\rho}_1}{\partial W} \;\; \frac{\partial \hat{\rho}_2}{\partial W} \;\; \ldots \;\; \frac{\partial \hat{\rho}_k}{\partial W} \right]$$
For each element in the above equation we can calculate $\frac{\partial \hat{\rho}_l}{\partial W}$ (the partial derivative of a scalar w.r.t. a matrix, which is itself a matrix). For a single element $W_{jl}$ of the matrix:
$$\frac{\partial \hat{\rho}_l}{\partial W_{jl}} = \frac{\partial}{\partial W_{jl}} \left[ \frac{1}{m} \sum_{i=1}^{m} g\!\left(W_{:,l}^T x_i + b_l\right) \right] = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial\, g\!\left(W_{:,l}^T x_i + b_l\right)}{\partial W_{jl}} = \frac{1}{m} \sum_{i=1}^{m} g'\!\left(W_{:,l}^T x_i + b_l\right) x_{ij}$$
So in matrix notation (for a single example $x_i$, with the average over examples left implicit) we can write:
$$\frac{\partial \hat{\rho}}{\partial W} = x_i \left( g'(W^T x_i + b) \right)^T$$
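A quick finite-difference sanity check of the per-neuron derivative $\frac{\partial \Omega(\theta)}{\partial \hat{\rho}_l}$ derived above; the test values and tolerance are illustrative:

```python
import numpy as np

def omega(rho_hat, rho=0.2):
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def domega(rho_hat, rho=0.2):
    # analytical derivative: -rho/rho_hat + (1 - rho)/(1 - rho_hat)
    return -rho / rho_hat + (1 - rho) / (1 - rho_hat)

rho_hat, eps = np.array([0.1, 0.3, 0.5]), 1e-6
for l in range(len(rho_hat)):
    r = rho_hat.copy(); r[l] += eps
    numeric = (omega(r) - omega(rho_hat)) / eps
    assert np.isclose(numeric, domega(rho_hat)[l], atol=1e-4)
```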
A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function. It does so by adding the following regularization term to the loss function:
$$\Omega(\theta) = \| J_x(h) \|_F^2$$
where $J_x(h)$ is the Jacobian of the encoder. Let us see what it looks like.
If the input has $n$ dimensions and the hidden layer has $k$ dimensions, then
$$J_x(h) = \begin{bmatrix} \frac{\partial h_1}{\partial x_1} & \cdots & \frac{\partial h_1}{\partial x_n} \\ \frac{\partial h_2}{\partial x_1} & \cdots & \frac{\partial h_2}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial h_k}{\partial x_1} & \cdots & \frac{\partial h_k}{\partial x_n} \end{bmatrix}$$
In other words, the $(l, j)$ entry of the Jacobian captures the variation in the output of the $l$th neuron with a small variation in the $j$th input, and
$$\| J_x(h) \|_F^2 = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2$$
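For a single-layer sigmoid encoder $h = \sigma(W^T x + b)$, each Jacobian entry has the closed form $\frac{\partial h_l}{\partial x_j} = h_l (1 - h_l) W_{jl}$, so the penalty is cheap to compute. A minimal sketch, with illustrative shapes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(W, b, x):
    """||J_x(h)||_F^2 for a sigmoid encoder h = sigmoid(W^T x + b).
    Since dh_l/dx_j = h_l (1 - h_l) W_jl, the penalty factorizes as
    sum_l (h_l (1 - h_l))^2 * sum_j W_jl^2."""
    h = sigmoid(W.T @ x + b)                       # shape (k,)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=0))

W = np.random.randn(784, 100) * 0.01               # n = 784 inputs, k = 100 hidden
b = np.zeros(100)
x = np.random.rand(784)
print(contractive_penalty(W, b, x))
```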
What is the intuition behind this? Consider $\frac{\partial h_1}{\partial x_1}$: what does it mean if $\frac{\partial h_1}{\partial x_1} = 0$? It means that this neuron is not very sensitive to variations in the input $x_1$. But doesn't this contradict our other goal of minimizing $\mathcal{L}(\theta)$, which requires $h$ to capture variations in the input?
Indeed it does, and that is the idea. By pitting these two contradicting objectives against each other, we ensure that $h$ is sensitive only to the very important variations observed in the training data.

$\mathcal{L}(\theta)$: capture important variations in the data
$\Omega(\theta)$: do not capture variations in the data
Tradeoff: capture only the very important variations in the data

Let us try to understand this with the help of an illustration.
Figure: data in the $x$–$y$ plane with directions of variation $u_1$ and $u_2$.

Consider the variations in the data along the directions $u_1$ and $u_2$. It makes sense to encourage a neuron to be sensitive to variations along $u_1$. At the same time, it makes sense to inhibit a neuron from being sensitive to variations along $u_2$ (these variations seem to be small noise, unimportant for reconstruction). By doing so we can balance the contradicting goals of good reconstruction and low sensitivity. What does this remind you of?