This document summarizes a lecture on autoencoders, covering the following key points:
- Autoencoders are neural networks that encode an input into a hidden representation and then decode that representation to reconstruct the input.
- Undercomplete autoencoders have fewer hidden units than inputs, forcing the hidden representation to capture the most important characteristics of the input. This is analogous to PCA.
- Overcomplete autoencoders can learn trivial (identity) encodings unless regularized.
- For binary inputs, the logistic function is the most appropriate decoder activation, since it restricts outputs to the range [0, 1].
Slide 1/55
CS7015 (Deep Learning): Lecture 7
Autoencoders and relation to PCA, Regularization in autoencoders, Denoising autoencoders, Sparse autoencoders, Contractive autoencoders
Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Slide 3/55
[Figure: an autoencoder with input $x_i$, encoder weights $W$, hidden layer $h$, decoder weights $W^*$, and reconstruction $\hat{x}_i$]
An autoencoder is a special type of feed-forward neural network which does the following:
- Encodes its input $x_i$ into a hidden representation $h$:
  $h = g(Wx_i + b)$
- Decodes the input again from this hidden representation:
  $\hat{x}_i = f(W^*h + c)$
The model is trained to minimize a certain loss function which will ensure that $\hat{x}_i$ is close to $x_i$ (we will see some such loss functions soon).
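To make the encoder/decoder pair concrete, here is a minimal NumPy sketch of one forward pass. The dimensions, the random initialization, and the use of the sigmoid for both $g$ and $f$ are illustrative assumptions, not prescriptions from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 8, 3                                  # input dim n, hidden dim d (assumed)
W = rng.normal(scale=0.1, size=(d, n))       # encoder weights
b = np.zeros(d)                              # encoder bias
W_star = rng.normal(scale=0.1, size=(n, d))  # decoder weights
c = np.zeros(n)                              # decoder bias

x = rng.random(n)                 # one input x_i
h = sigmoid(W @ x + b)            # encode: h = g(W x_i + b)
x_hat = sigmoid(W_star @ h + c)   # decode: x̂_i = f(W* h + c)
```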
Slide 4/55
$h = g(Wx_i + b)$, $\hat{x}_i = f(W^*h + c)$
- Let us consider the case where $\dim(h) < \dim(x_i)$.
- If we are still able to reconstruct $\hat{x}_i$ perfectly from $h$, then what does it say about $h$?
- $h$ is a loss-free encoding of $x_i$. It captures all the important characteristics of $x_i$.
- Do you see an analogy with PCA?
An autoencoder where $\dim(h) < \dim(x_i)$ is called an undercomplete autoencoder.
Slide 5/55
$h = g(Wx_i + b)$, $\hat{x}_i = f(W^*h + c)$
- Let us consider the case when $\dim(h) \geq \dim(x_i)$.
- In such a case the autoencoder could learn a trivial encoding by simply copying $x_i$ into $h$ and then copying $h$ into $\hat{x}_i$.
- Such an identity encoding is useless in practice, as it does not really tell us anything about the important characteristics of the data.
An autoencoder where $\dim(h) \geq \dim(x_i)$ is called an overcomplete autoencoder.
Slides 6-7/55: The Road Ahead
- Choice of $f(x_i)$ and $g(x_i)$
- Choice of loss function
Slide 8/55 (binary inputs)
$x_i = [0\ 1\ 1\ 0\ 1]$, $h = g(Wx_i + b)$, $\hat{x}_i = f(W^*h + c)$
- Suppose all our inputs are binary (each $x_{ij} \in \{0, 1\}$).
- Which of the following functions would be most apt for the decoder?
  - $\hat{x}_i = \tanh(W^*h + c)$
  - $\hat{x}_i = W^*h + c$
  - $\hat{x}_i = \text{logistic}(W^*h + c)$
- Logistic, as it naturally restricts all outputs to be between 0 and 1.
- $g$ is typically chosen as the sigmoid function.
Slide 9/55 (real-valued inputs)
$x_i = [0.25\ 0.5\ 1.25\ 3.5\ 4.5]$, $h = g(Wx_i + b)$, $\hat{x}_i = f(W^*h + c)$
- Suppose all our inputs are real (each $x_{ij} \in \mathbb{R}$).
- Which of the following functions would be most apt for the decoder?
  - $\hat{x}_i = \tanh(W^*h + c)$
  - $\hat{x}_i = W^*h + c$
  - $\hat{x}_i = \text{logistic}(W^*h + c)$
- What will logistic and tanh do? They will restrict the reconstructed $\hat{x}_i$ to lie in $[0, 1]$ or $[-1, 1]$, whereas we want $\hat{x}_i \in \mathbb{R}^n$. Hence a linear decoder, $\hat{x}_i = W^*h + c$, is the apt choice here.
- Again, $g$ is typically chosen as the sigmoid function.
Slide 10/55: The Road Ahead
- Choice of $f(x_i)$ and $g(x_i)$
- Choice of loss function
Slide 11/55
$h = g(Wx_i + b)$, $\hat{x}_i = f(W^*h + c)$
- Consider the case when the inputs are real-valued.
- The objective of the autoencoder is to reconstruct $\hat{x}_i$ to be as close to $x_i$ as possible.
- This can be formalized using the following objective function:
  $\min_{W, W^*, c, b} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2$
  i.e., $\min_{W, W^*, c, b} \frac{1}{m} \sum_{i=1}^{m} (\hat{x}_i - x_i)^T (\hat{x}_i - x_i)$
- We can then train the autoencoder just like a regular feedforward network using backpropagation.
- All we need is a formula for $\frac{\partial \mathcal{L}(\theta)}{\partial W^*}$ and $\frac{\partial \mathcal{L}(\theta)}{\partial W}$, which we will see now.
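As a direct translation of this objective, here is a short NumPy sketch of the reconstruction loss over a batch; the array shapes (rows are examples) are an assumption of the sketch:

```python
import numpy as np

def reconstruction_loss(X, X_hat):
    # (1/m) Σ_i Σ_j (x̂_ij − x_ij)^2, equivalently (1/m) Σ_i (x̂_i − x_i)^T (x̂_i − x_i)
    m = X.shape[0]                     # number of training examples
    return np.sum((X_hat - X) ** 2) / m
```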
Slide 12/55
$\mathcal{L}(\theta) = (\hat{x}_i - x_i)^T (\hat{x}_i - x_i)$
[Figure: two-layer network $h_0 = x_i \xrightarrow{W} a_1 \to h_1 \xrightarrow{W^*} a_2 \to h_2 = \hat{x}_i$]
Note that the loss function is shown for only one training example.
$\frac{\partial \mathcal{L}(\theta)}{\partial W^*} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial W^*}$
$\frac{\partial \mathcal{L}(\theta)}{\partial W} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1} \frac{\partial a_1}{\partial W}$
We have already seen how to calculate the expressions in the boxes when we learnt backpropagation.
$\frac{\partial \mathcal{L}(\theta)}{\partial h_2} = \frac{\partial \mathcal{L}(\theta)}{\partial \hat{x}_i} = \nabla_{\hat{x}_i} \left\{ (\hat{x}_i - x_i)^T (\hat{x}_i - x_i) \right\} = 2(\hat{x}_i - x_i)$
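As a sanity check on these chain-rule expressions, here is a self-contained NumPy sketch that computes $\partial \mathcal{L}/\partial W^*$ and $\partial \mathcal{L}/\partial W$ for one example, assuming a sigmoid encoder and (for simplicity) a linear decoder so that $h_2 = a_2$; all shapes and values are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 8, 3
W, b = rng.normal(scale=0.1, size=(d, n)), np.zeros(d)
W_star, c = rng.normal(scale=0.1, size=(n, d)), np.zeros(n)
x = rng.random(n)

a1 = W @ x + b
h1 = sigmoid(a1)                     # encoder output
a2 = W_star @ h1 + c
x_hat = a2                           # linear decoder: h2 = a2

dL_dh2 = 2.0 * (x_hat - x)           # ∂L/∂h2 = 2(x̂_i − x_i)
dL_dW_star = np.outer(dL_dh2, h1)    # ∂L/∂W* (∂h2/∂a2 = I for a linear decoder)
dL_dh1 = W_star.T @ dL_dh2           # ∂L/∂h1
dL_da1 = dL_dh1 * h1 * (1.0 - h1)    # σ'(a1) = σ(a1)(1 − σ(a1))
dL_dW = np.outer(dL_da1, x)          # ∂L/∂W
```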
Slide 13/55 (binary inputs)
$x_i = [0\ 1\ 1\ 0\ 1]$, $h = g(Wx_i + b)$, $\hat{x}_i = f(W^*h + c)$
- Consider the case when the inputs are binary.
- We use a sigmoid decoder, which will produce outputs between 0 and 1 that can be interpreted as probabilities.
- For a single $n$-dimensional $i$-th input we can use the following (cross-entropy) loss function:
  $\min \left\{ -\sum_{j=1}^{n} \left( x_{ij} \log \hat{x}_{ij} + (1 - x_{ij}) \log(1 - \hat{x}_{ij}) \right) \right\}$
- What value of $\hat{x}_{ij}$ will minimize this function? If $x_{ij} = 1$? If $x_{ij} = 0$?
- Indeed, the above function will be minimized when $\hat{x}_{ij} = x_{ij}$!
- Again, all we need is a formula for $\frac{\partial \mathcal{L}(\theta)}{\partial W^*}$ and $\frac{\partial \mathcal{L}(\theta)}{\partial W}$ to use backpropagation.
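A direct NumPy translation of this loss, under the assumption that $\hat{x}_i$ comes from a sigmoid decoder (so every $\hat{x}_{ij}$ is strictly between 0 and 1); the clipping is an added numerical precaution, and the example values are made up:

```python
import numpy as np

def binary_cross_entropy(x, x_hat, eps=1e-12):
    # L(θ) = -Σ_j [ x_ij log x̂_ij + (1 − x_ij) log(1 − x̂_ij) ]
    x_hat = np.clip(x_hat, eps, 1.0 - eps)   # guard against log(0)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

x = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
print(binary_cross_entropy(x, np.array([0.1, 0.9, 0.8, 0.2, 0.7])))  # small loss
print(binary_cross_entropy(x, np.array([0.9, 0.1, 0.2, 0.8, 0.3])))  # large loss
```

As the slide notes, the loss shrinks as each $\hat{x}_{ij}$ approaches $x_{ij}$.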
Slide 14/55
$\mathcal{L}(\theta) = -\sum_{j=1}^{n} \left( x_{ij} \log \hat{x}_{ij} + (1 - x_{ij}) \log(1 - \hat{x}_{ij}) \right)$
$\frac{\partial \mathcal{L}(\theta)}{\partial W^*} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial W^*}$
$\frac{\partial \mathcal{L}(\theta)}{\partial W} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1} \frac{\partial a_1}{\partial W}$
Here $\frac{\partial \mathcal{L}(\theta)}{\partial h_2} = \left[ \frac{\partial \mathcal{L}(\theta)}{\partial h_{21}}, \frac{\partial \mathcal{L}(\theta)}{\partial h_{22}}, \ldots, \frac{\partial \mathcal{L}(\theta)}{\partial h_{2n}} \right]^T$
We have already seen how to calculate the expressions in the square boxes when we learnt backpropagation. The first two terms on the RHS can be computed as:
$\frac{\partial \mathcal{L}(\theta)}{\partial h_{2j}} = -\frac{x_{ij}}{\hat{x}_{ij}} + \frac{1 - x_{ij}}{1 - \hat{x}_{ij}}$
$\frac{\partial h_{2j}}{\partial a_{2j}} = \sigma(a_{2j})(1 - \sigma(a_{2j}))$
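These two per-component terms multiply out to the familiar simplification $\frac{\partial \mathcal{L}(\theta)}{\partial a_{2j}} = \hat{x}_{ij} - x_{ij}$; a quick numerical check with assumed values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a2j, xij = 0.3, 1.0                  # assumed pre-activation and target
xhat = sigmoid(a2j)                  # h_2j = σ(a_2j)
dL_dh2j = -xij / xhat + (1 - xij) / (1 - xhat)
dh2j_da2j = xhat * (1 - xhat)
print(dL_dh2j * dh2j_da2j)           # ≈ -0.42556
print(xhat - xij)                    # same value: x̂_ij − x_ij
```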
Slide 15/55
Module 7.2: Link between PCA and Autoencoders
Slide 16/55
[Figure: autoencoder $x_i \to h \to \hat{x}_i$ shown as equivalent ($\equiv$) to PCA, $P^T X^T X P = D$, with principal directions $u_1, u_2$ in the $x$-$y$ plane]
We will now see that the encoder part of an autoencoder is equivalent to PCA if we
- use a linear encoder,
- use a linear decoder,
- use the squared error loss function, and
- normalize the inputs to
  $\hat{x}_{ij} = \frac{1}{\sqrt{m}} \left( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \right)$
Slide 17/55
First let us consider the implication of normalizing the inputs to
$\hat{x}_{ij} = \frac{1}{\sqrt{m}} \left( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \right)$
- The operation in the bracket ensures that the data now has zero mean along each dimension $j$ (we are subtracting the mean).
- Let $X'$ be this zero-mean data matrix; then what the above normalization gives us is $X = \frac{1}{\sqrt{m}} X'$.
- Now $X^T X = \frac{1}{m} (X')^T X'$ is the covariance matrix (recall that the covariance matrix plays an important role in PCA).
Slide 18/55
First we will show that if we use a linear decoder and a squared error loss function, then the optimal solution to the following objective function
$\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \hat{x}_{ij})^2$
is obtained when we use a linear encoder.
Slide 19/55
$\min_{\theta} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \hat{x}_{ij})^2 \quad (1)$
This is equivalent to
$\min_{W^*, H} \| X - HW^* \|_F^2$, where $\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}$
(just writing expression (1) in matrix form and using the definition of $\|A\|_F$; we are ignoring the biases)
From SVD we know that the optimal solution to the above problem is given by
$HW^* = U_{\cdot, \leq k} \, \Sigma_{k,k} \, V^T_{\cdot, \leq k}$
By matching variables, one possible solution is
$H = U_{\cdot, \leq k} \, \Sigma_{k,k}$
$W^* = V^T_{\cdot, \leq k}$
Slide 20/55
We will now show that $H$ is a linear encoding and find an expression for the encoder weights $W$:

$H = U_{\cdot, \leq k} \, \Sigma_{k,k}$
$\;\; = (XX^T)(XX^T)^{-1} U_{\cdot, \leq k} \Sigma_{k,k}$  (pre-multiplying by $(XX^T)(XX^T)^{-1} = I$)
$\;\; = (XV\Sigma^T U^T)(U\Sigma V^T V\Sigma^T U^T)^{-1} U_{\cdot, \leq k} \Sigma_{k,k}$  (using $X = U\Sigma V^T$)
$\;\; = XV\Sigma^T U^T (U\Sigma\Sigma^T U^T)^{-1} U_{\cdot, \leq k} \Sigma_{k,k}$  ($V^T V = I$)
$\;\; = XV\Sigma^T U^T U(\Sigma\Sigma^T)^{-1} U^T U_{\cdot, \leq k} \Sigma_{k,k}$  ($(ABC)^{-1} = C^{-1}B^{-1}A^{-1}$)
$\;\; = XV\Sigma^T (\Sigma\Sigma^T)^{-1} U^T U_{\cdot, \leq k} \Sigma_{k,k}$  ($U^T U = I$)
$\;\; = XV\Sigma^T (\Sigma^T)^{-1} \Sigma^{-1} U^T U_{\cdot, \leq k} \Sigma_{k,k}$  ($(AB)^{-1} = B^{-1}A^{-1}$)
$\;\; = XV\Sigma^{-1} I_{\cdot, \leq k} \Sigma_{k,k}$  ($U^T U_{\cdot, \leq k} = I_{\cdot, \leq k}$)
$\;\; = XV I_{\cdot, \leq k}$  ($\Sigma^{-1} I_{\cdot, \leq k} = \Sigma^{-1}_{k,k}$, which cancels $\Sigma_{k,k}$)
$H = XV_{\cdot, \leq k}$

Thus $H$ is a linear transformation of $X$, and $W = V_{\cdot, \leq k}$.
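The end result $H = XV_{\cdot, \leq k} = U_{\cdot, \leq k} \Sigma_{k,k}$ can be verified numerically in a few lines; the matrix sizes and $k$ below are arbitrary assumptions:

```python
import numpy as np

m, n, k = 10, 6, 3
X = np.random.default_rng(1).normal(size=(m, n))
U, S, Vt = np.linalg.svd(X, full_matrices=False)

H_svd = U[:, :k] * S[:k]           # U_{.,≤k} Σ_{k,k}
H_lin = X @ Vt.T[:, :k]            # X V_{.,≤k}
print(np.allclose(H_svd, H_lin))   # True: H is a linear function of X
```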
Slide 21/55
- We have the encoder $W = V_{\cdot, \leq k}$.
- From SVD, we know that $V$ is the matrix of eigenvectors of $X^T X$.
- From PCA, we know that $P$ is the matrix of eigenvectors of the covariance matrix.
- We saw earlier that, if the entries of $X$ are normalized by
  $\hat{x}_{ij} = \frac{1}{\sqrt{m}} \left( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \right)$
  then $X^T X$ is indeed the covariance matrix.
- Thus, the encoder matrix for the linear autoencoder ($W$) and the projection matrix ($P$) for PCA could indeed be the same. Hence proved.
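One way to see this numerically: after the normalization above, the right singular vectors of $X$ agree (up to sign and ordering) with the eigenvectors of the covariance matrix. A sketch with assumed random data:

```python
import numpy as np

rng = np.random.default_rng(2)
X0 = rng.normal(size=(100, 5))
X = (X0 - X0.mean(axis=0)) / np.sqrt(X0.shape[0])  # the normalization above

eigvals, P = np.linalg.eigh(X.T @ X)   # eigenvectors of the covariance matrix
_, _, Vt = np.linalg.svd(X)            # right singular vectors of X

# eigh sorts eigenvalues ascending, SVD sorts singular values descending,
# and each eigenvector is only defined up to sign; compare accordingly.
print(np.allclose(np.abs(P[:, ::-1]), np.abs(Vt.T)))   # True
```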
Slide 22/55: Remember
The encoder of a linear autoencoder is equivalent to PCA if we
- use a linear encoder,
- use a linear decoder,
- use a squared error loss function, and
- normalize the inputs to
  $\hat{x}_{ij} = \frac{1}{\sqrt{m}} \left( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \right)$
Slide 24/55
- While poor generalization can happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders.
- Here (as stated earlier) the model can simply learn to copy $x_i$ to $h$ and then $h$ to $\hat{x}_i$.
- To avoid poor generalization, we need to introduce regularization.
Slide 25/55
- The simplest solution is to add an L2-regularization term to the objective function:
  $\min_{\theta = \{W, W^*, b, c\}} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2 + \lambda \|\theta\|^2$
- This is very easy to implement and just adds a term $\lambda W$ to the gradient $\frac{\partial \mathcal{L}(\theta)}{\partial W}$ (and similarly for the other parameters).
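In code this is a one-line change per parameter; `lam` is an assumed hyperparameter, and the update follows the slide's convention of writing the added term as $\lambda W$:

```python
import numpy as np

lam = 1e-3                                   # assumed regularization strength
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 8))
dL_dW = rng.normal(size=W.shape)             # stand-in for the backprop gradient

dL_dW_reg = dL_dW + lam * W                  # gradient of the regularized objective
```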
Slide 26/55
- Another trick is to tie the weights of the encoder and decoder, i.e., $W^* = W^T$.
- This effectively reduces the capacity of the autoencoder and acts as a regularizer.
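With tied weights the decoder reuses the encoder matrix instead of learning its own; a minimal sketch under the same assumed setup as the earlier forward pass:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 8, 3
W, b, c = rng.normal(scale=0.1, size=(d, n)), np.zeros(d), np.zeros(n)

x = rng.random(n)
h = sigmoid(W @ x + b)
x_hat = sigmoid(W.T @ h + c)   # decoder uses W* = W^T, no separate weight matrix
# During training, gradients w.r.t. W accumulate from both encoder and decoder paths.
```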
Slide 28/55
[Figure: $x_i$ is corrupted into $\tilde{x}_i$, which is encoded into $h$ and decoded into $\hat{x}_i$]
- A denoising autoencoder simply corrupts the input data using a probabilistic process $P(\tilde{x}_{ij} \mid x_{ij})$ before feeding it to the network.
- A simple $P(\tilde{x}_{ij} \mid x_{ij})$ used in practice is the following:
  $P(\tilde{x}_{ij} = 0 \mid x_{ij}) = q$
  $P(\tilde{x}_{ij} = x_{ij} \mid x_{ij}) = 1 - q$
- In other words, with probability $q$ an input component is flipped to 0, and with probability $(1 - q)$ it is retained as it is.
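This corruption process is easy to simulate; $q$ below is an assumed corruption probability:

```python
import numpy as np

def corrupt(x, q, rng):
    # With probability q each component is set to 0; otherwise it is kept as is.
    keep = rng.random(x.shape) >= q
    return x * keep

rng = np.random.default_rng(3)
x = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
x_tilde = corrupt(x, q=0.3, rng=rng)   # fed to the network in place of x
```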
Slide 29/55
How does this help?
- It helps because the objective is still to reconstruct the original (uncorrupted) $x_i$:
  $\arg\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2$
- It no longer makes sense for the model to copy the corrupted $\tilde{x}_i$ into $h(\tilde{x}_i)$ and then into $\hat{x}_i$ (the objective function will not be minimized by doing so).
- Instead the model will now have to capture the characteristics of the data correctly.
- For example, it will have to learn to reconstruct a corrupted $\tilde{x}_{ij}$ correctly by relying on its interactions with the other elements of $x_i$.
Slide 30/55
We will now see a practical application in which AEs are used, and then compare denoising autoencoders with regular autoencoders.
Slides 31-33/55
Task: Hand-written digit recognition
[Figure: MNIST data, digits 0, 1, 2, 3, ..., 9; $|x_i| = 784 = 28 \times 28$]
[Figure: Basic approach (we use raw data as input features)]
[Figure: AE approach (first learn important characteristics of the data): $\hat{x}_i \in \mathbb{R}^{784}$, $h \in \mathbb{R}^d$]
[Figure: AE approach (and then train a classifier on top of this hidden representation $h \in \mathbb{R}^d$)]
Slide 34/55
We will now see a way of visualizing AEs, and use this visualization to compare different AEs.
Slide 35/55
- We can think of each neuron as a filter which will fire (or get maximally activated) for a certain input configuration $x_i$.
- For example, $h_1 = \sigma(W_1^T x_i)$ [ignoring bias $b$], where $W_1$ is the trained vector of weights connecting the input to the first hidden neuron.
- What values of $x_i$ will cause $h_1$ to be maximum (or maximally activated)?
- Suppose we assume that our inputs are normalized so that $\|x_i\| = 1$. Then we want to solve
  $\max_{x_i} \{ W_1^T x_i \} \quad \text{s.t.} \quad \|x_i\|^2 = x_i^T x_i = 1$
  Solution: $x_i = \frac{W_1}{\sqrt{W_1^T W_1}}$
161. 36/55
xi
h
ˆxi
max
xi
{WT
1 xi}
s.t. ||xi||2
= xT
i xi = 1
Solution: xi =
W1
WT
1 W1
Thus the inputs
xi =
W1
WT
1 W1
,
W2
WT
2 W2
, . . .
Wn
WT
n Wn
will respectively cause hidden neurons 1 to n
to maximally fire
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
162. 36/55
xi
h
ˆxi
max
xi
{WT
1 xi}
s.t. ||xi||2
= xT
i xi = 1
Solution: xi =
W1
WT
1 W1
Thus the inputs
xi =
W1
WT
1 W1
,
W2
WT
2 W2
, . . .
Wn
WT
n Wn
will respectively cause hidden neurons 1 to n
to maximally fire
Let us plot these images (xi’s) which maxim-
ally activate the first k neurons of the hidden
representations learned by a vanilla autoen-
coder and different denoising autoencoders
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
163. 36/55
xi
h
ˆxi
max
xi
{WT
1 xi}
s.t. ||xi||2
= xT
i xi = 1
Solution: xi =
W1
WT
1 W1
Thus the inputs
xi =
W1
WT
1 W1
,
W2
WT
2 W2
, . . .
Wn
WT
n Wn
will respectively cause hidden neurons 1 to n
to maximally fire
Let us plot these images (xi’s) which maxim-
ally activate the first k neurons of the hidden
representations learned by a vanilla autoen-
coder and different denoising autoencoders
These xi’s are computed by the above formula
using the weights (W1, W2 . . . Wk) learned by
the respective autoencoders
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
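Computing these maximally-activating inputs from the learned weights is a one-liner; the sketch below uses random weights as a stand-in for the trained $W$ (illustrative only):

```python
import numpy as np

def max_activating_inputs(W):
    """Column l of the result is x = W_l / sqrt(W_l^T W_l), the unit-norm
    input that maximally activates hidden neuron l (formula above).
    W has shape (n_inputs, k_hidden)."""
    return W / np.sqrt((W ** 2).sum(axis=0, keepdims=True))

W = np.random.randn(784, 100)          # stand-in for learned AE weights
filters = max_activating_inputs(W)
img = filters[:, 0].reshape(28, 28)    # reshape for plotting as an image
```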
Figure: Vanilla AE (no noise) | Figure: 25% Denoising AE ($q = 0.25$) | Figure: 50% Denoising AE ($q = 0.5$)

The vanilla AE does not learn many meaningful patterns. The hidden neurons of the denoising AEs seem to act like pen-stroke detectors (for example, in the highlighted neuron the black region is a stroke that you would expect in a '0', '2', '3', '8' or '9'). As the noise increases, the filters become wider because each neuron has to rely on more adjacent pixels to feel confident about a stroke.
We saw one form of $P(\tilde{x}_{ij} \mid x_{ij})$ which flips a fraction $q$ of the inputs to zero. Another way of corrupting the inputs is to add Gaussian noise to the input:
$$\tilde{x}_{ij} = x_{ij} + \mathcal{N}(0, 1)$$
We will now use such a denoising AE on a different dataset and see its performance.
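Mirroring the masking-noise sketch earlier, the additive Gaussian corruption is (illustrative NumPy):

```python
import numpy as np

def corrupt_gaussian(x, rng=np.random.default_rng(0)):
    # x~_ij = x_ij + noise, noise ~ N(0, 1), as in the equation above
    return x + rng.standard_normal(x.shape)
```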
Figure: Data | Figure: AE filters | Figure: Weight decay filters

The hidden neurons essentially behave like edge detectors; PCA does not give such edge detectors.
A hidden neuron with a sigmoid activation will have values between 0 and 1. We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0. A sparse autoencoder tries to ensure that each neuron is inactive most of the time.
The average value of the activation of neuron $l$ is given by
$$\hat{\rho}_l = \frac{1}{m} \sum_{i=1}^{m} h(x_i)_l$$
If neuron $l$ is sparse (i.e., mostly inactive) then $\hat{\rho}_l \to 0$. A sparse autoencoder uses a sparsity parameter $\rho$ (typically very close to 0, say 0.005) and tries to enforce the constraint $\hat{\rho}_l = \rho$. One way of ensuring this is to add the following term to the objective function:
$$\Omega(\theta) = \sum_{l=1}^{k} \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l}$$
When will this term reach its minimum value, and what is the minimum value? Let us plot it and check.
Figure: plot of $\Omega(\theta)$ as a function of $\hat{\rho}_l$ for $\rho = 0.2$.

The function reaches its minimum value, 0, when $\hat{\rho}_l = \rho$.
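Each summand is the KL divergence between a Bernoulli($\rho$) and a Bernoulli($\hat{\rho}_l$) distribution, which is why the minimum is 0 exactly at $\hat{\rho}_l = \rho$. A small sketch (the function name is illustrative):

```python
import numpy as np

def sparsity_penalty(rho_hat, rho=0.2):
    """Omega(theta): sum over neurons of KL(Bernoulli(rho) || Bernoulli(rho_hat_l))."""
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

rho_hat = np.full(5, 0.2)                           # every neuron exactly at rho
assert np.isclose(sparsity_penalty(rho_hat), 0.0)   # minimum value is 0
```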
Now,
$$\hat{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \Omega(\theta)$$
where $\mathcal{L}(\theta)$ is the squared error loss or cross-entropy loss and $\Omega(\theta)$ is the sparsity constraint. We already know how to calculate $\frac{\partial \mathcal{L}(\theta)}{\partial W}$; let us see how to calculate $\frac{\partial \Omega(\theta)}{\partial W}$.

$$\Omega(\theta) = \sum_{l=1}^{k} \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l}$$
can be re-written as
$$\Omega(\theta) = \sum_{l=1}^{k} \rho \log \rho - \rho \log \hat{\rho}_l + (1 - \rho) \log(1 - \rho) - (1 - \rho) \log(1 - \hat{\rho}_l)$$

By the chain rule:
$$\frac{\partial \Omega(\theta)}{\partial W} = \frac{\partial \Omega(\theta)}{\partial \hat{\rho}} \cdot \frac{\partial \hat{\rho}}{\partial W}, \qquad \frac{\partial \Omega(\theta)}{\partial \hat{\rho}} = \left[ \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_1}, \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_2}, \ldots, \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_k} \right]^T$$

For each neuron $l \in 1 \ldots k$ in the hidden layer, we have
$$\frac{\partial \Omega(\theta)}{\partial \hat{\rho}_l} = -\frac{\rho}{\hat{\rho}_l} + \frac{1 - \rho}{1 - \hat{\rho}_l}$$
and
$$\frac{\partial \hat{\rho}_l}{\partial W} = x_i \left( g'(W^T x_i + b) \right)^T \quad \text{(see the derivation below)}$$

Finally,
$$\frac{\partial \hat{\mathcal{L}}(\theta)}{\partial W} = \frac{\partial \mathcal{L}(\theta)}{\partial W} + \frac{\partial \Omega(\theta)}{\partial W}$$
and we know how to calculate both terms on the R.H.S.
Derivation:
$$\frac{\partial \hat{\rho}}{\partial W} = \left[ \frac{\partial \hat{\rho}_1}{\partial W} \;\; \frac{\partial \hat{\rho}_2}{\partial W} \;\; \ldots \;\; \frac{\partial \hat{\rho}_k}{\partial W} \right]$$
For each element in the above equation we can calculate $\frac{\partial \hat{\rho}_l}{\partial W}$ (the partial derivative of a scalar w.r.t. a matrix, which is itself a matrix). For a single element $W_{jl}$ of the matrix:
$$\frac{\partial \hat{\rho}_l}{\partial W_{jl}} = \frac{\partial}{\partial W_{jl}} \left[ \frac{1}{m} \sum_{i=1}^{m} g\!\left(W_{:,l}^T x_i + b_l\right) \right] = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial\, g\!\left(W_{:,l}^T x_i + b_l\right)}{\partial W_{jl}} = \frac{1}{m} \sum_{i=1}^{m} g'\!\left(W_{:,l}^T x_i + b_l\right) x_{ij}$$
So in matrix notation (for a single example $x_i$, with the average over examples left implicit) we can write:
$$\frac{\partial \hat{\rho}}{\partial W} = x_i \left( g'(W^T x_i + b) \right)^T$$
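A quick finite-difference sanity check of the per-neuron derivative $\frac{\partial \Omega(\theta)}{\partial \hat{\rho}_l}$ derived above; the test values and tolerance are illustrative:

```python
import numpy as np

def omega(rho_hat, rho=0.2):
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def domega(rho_hat, rho=0.2):
    # analytical derivative: -rho/rho_hat + (1 - rho)/(1 - rho_hat)
    return -rho / rho_hat + (1 - rho) / (1 - rho_hat)

rho_hat, eps = np.array([0.1, 0.3, 0.5]), 1e-6
for l in range(len(rho_hat)):
    r = rho_hat.copy(); r[l] += eps
    numeric = (omega(r) - omega(rho_hat)) / eps
    assert np.isclose(numeric, domega(rho_hat)[l], atol=1e-4)
```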
A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function. It does so by adding the following regularization term to the loss function:
$$\Omega(\theta) = \| J_x(h) \|_F^2$$
where $J_x(h)$ is the Jacobian of the encoder. Let us see what it looks like.
If the input has $n$ dimensions and the hidden layer has $k$ dimensions, then
$$J_x(h) = \begin{bmatrix} \frac{\partial h_1}{\partial x_1} & \cdots & \frac{\partial h_1}{\partial x_n} \\ \frac{\partial h_2}{\partial x_1} & \cdots & \frac{\partial h_2}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial h_k}{\partial x_1} & \cdots & \frac{\partial h_k}{\partial x_n} \end{bmatrix}$$
In other words, the $(l, j)$ entry of the Jacobian captures the variation in the output of the $l$th neuron with a small variation in the $j$th input, and
$$\| J_x(h) \|_F^2 = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2$$
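For a single-layer sigmoid encoder $h = \sigma(W^T x + b)$, each Jacobian entry has the closed form $\frac{\partial h_l}{\partial x_j} = h_l (1 - h_l) W_{jl}$, so the penalty is cheap to compute. A minimal sketch, with illustrative shapes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(W, b, x):
    """||J_x(h)||_F^2 for a sigmoid encoder h = sigmoid(W^T x + b).
    Since dh_l/dx_j = h_l (1 - h_l) W_jl, the penalty factorizes as
    sum_l (h_l (1 - h_l))^2 * sum_j W_jl^2."""
    h = sigmoid(W.T @ x + b)                       # shape (k,)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=0))

W = np.random.randn(784, 100) * 0.01               # n = 784 inputs, k = 100 hidden
b = np.zeros(100)
x = np.random.rand(784)
print(contractive_penalty(W, b, x))
```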
What is the intuition behind this? Consider $\frac{\partial h_1}{\partial x_1}$: what does it mean if $\frac{\partial h_1}{\partial x_1} = 0$? It means that this neuron is not very sensitive to variations in the input $x_1$. But doesn't this contradict our other goal of minimizing $\mathcal{L}(\theta)$, which requires $h$ to capture variations in the input?
Indeed it does, and that is the idea. By pitting these two contradicting objectives against each other, we ensure that $h$ is sensitive only to the very important variations observed in the training data.

$\mathcal{L}(\theta)$: capture important variations in the data
$\Omega(\theta)$: do not capture variations in the data
Tradeoff: capture only the very important variations in the data

Let us try to understand this with the help of an illustration.
Figure: data in the $x$–$y$ plane with directions of variation $u_1$ and $u_2$.

Consider the variations in the data along the directions $u_1$ and $u_2$. It makes sense to encourage a neuron to be sensitive to variations along $u_1$. At the same time, it makes sense to inhibit a neuron from being sensitive to variations along $u_2$ (these variations seem to be small noise, unimportant for reconstruction). By doing so we can balance the contradicting goals of good reconstruction and low sensitivity. What does this remind you of?