Neural Networks for Machine Learning
Lecture 15a: From Principal Components Analysis to Autoencoders
Geoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman, Abdel-rahman Mohamed
Principal Components Analysis
• This takes N-dimensional data and finds the M orthogonal directions
in which the data have the most variance.
– These M principal directions form a lower-dimensional subspace.
– We can represent an N-dimensional datapoint by its projections
onto the M principal directions.
– This loses all information about where the datapoint is located in
the remaining orthogonal directions.
• We reconstruct by using the mean value (over all the data) on the N-M
directions that are not represented.
– The reconstruction error is the sum over all these unrepresented
directions of the squared differences of the datapoint from the mean.
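A minimal numpy sketch of this procedure (illustrative only; the variable names and the use of numpy's SVD are my own assumptions, not part of the lecture):

```python
import numpy as np

def pca_reconstruct(X, M):
    """Project N-dimensional data onto its M principal directions and reconstruct."""
    mean = X.mean(axis=0)
    Xc = X - mean                                 # centre the data
    # Rows of Vt are the principal directions, ordered by variance.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:M].T                                  # N x M matrix of the top-M directions
    codes = Xc @ W                                # M-dimensional representation
    X_hat = codes @ W.T + mean                    # unrepresented directions get the mean
    return X_hat, codes

# Reconstruction error: squared distance summed over the N-M unrepresented directions.
X = np.random.randn(500, 10) @ np.random.randn(10, 10)
X_hat, _ = pca_reconstruct(X, M=3)
print(((X - X_hat) ** 2).sum(axis=1).mean())
```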
A picture of PCA with N=2 and M=1
(Figure: the direction of the first principal component, i.e. the direction of
greatest variance. The red point is represented by the green point; the
“reconstruction” of the red point has an error equal to the squared distance
between the red and green points.)
Using backpropagation to implement PCA inefficiently
• Try to make the output be the
same as the input in a network
with a central bottleneck.
• The activities of the hidden
units in the bottleneck form an
efficient code.
• If the hidden and output layers are
linear, it will learn hidden units
that are a linear function of the
data and minimize the squared
reconstruction error.
– This is exactly what PCA does.
• The M hidden units will span the
same space as the first M
components found by PCA.
– Their weight vectors may not be
orthogonal.
– They will tend to have equal
variances.
(Diagram: input vector → code (bottleneck) → output vector)
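As an illustration of this "inefficient" route, here is a sketch of a linear autoencoder with an M-unit bottleneck trained by backpropagation (a hypothetical PyTorch rendering, not code from the course):

```python
import torch
import torch.nn as nn

N, M = 10, 3
model = nn.Sequential(
    nn.Linear(N, M),   # encoder: N-dimensional input -> M-dimensional code
    nn.Linear(M, N),   # decoder: code -> reconstruction
)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(500, N)
for step in range(2000):
    opt.zero_grad()
    loss = ((model(X) - X) ** 2).mean()   # squared reconstruction error
    loss.backward()
    opt.step()
# The M code units end up spanning the same subspace as the first M principal
# components, but their weight vectors need not be orthogonal.
```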
Using backpropagation to generalize PCA
• With non-linear layers before
and after the code, it should be
possible to efficiently represent
data that lies on or near a non-linear manifold.
– The encoder converts
coordinates in the input
space to coordinates on
the manifold.
– The decoder does the
inverse mapping.
(Diagram: input vector → encoding weights → code → decoding weights → output vector)
Neural Networks for Machine Learning
Lecture 15b: Deep Autoencoders
Geoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman, Abdel-rahman Mohamed
Deep Autoencoders
• They always looked like a
really nice way to do non-linear
dimensionality reduction:
– They provide flexible
mappings both ways.
– The learning time is linear
(or better) in the number of
training cases.
– The final encoding model
is fairly compact and fast.
• But it turned out to be very difficult
to optimize deep autoencoders
using backpropagation.
– With small initial weights the
backpropagated gradient dies.
• We now have much better ways
to optimize them.
– Use unsupervised layer-by-layer
pre-training.
– Or just initialize the weights
carefully as in Echo-State Nets.
The first really successful deep autoencoders
(Hinton & Salakhutdinov, Science, 2006)
We train a stack of 4 RBM’s and then “unroll” them.
Then we fine-tune with gentle backprop.
(Architecture: encoder 784 → 1000 → 500 → 250 → 30 linear code units,
using weights W1, W2, W3, W4; decoder 30 → 250 → 500 → 1000 → 784,
initialized with the transposed weights W4ᵀ, W3ᵀ, W2ᵀ, W1ᵀ.)
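A sketch of the unrolled architecture in PyTorch (this only shows the shape of the network after unrolling; the RBM pre-training that supplies the initial weights is omitted, and the choice of PyTorch is my own):

```python
import torch.nn as nn

# Encoder: 784 -> 1000 -> 500 -> 250 -> 30 linear code units (weights W1..W4).
encoder = nn.Sequential(
    nn.Linear(784, 1000), nn.Sigmoid(),
    nn.Linear(1000, 500), nn.Sigmoid(),
    nn.Linear(500, 250),  nn.Sigmoid(),
    nn.Linear(250, 30),                 # linear code layer
)
# Decoder mirrors the encoder; after unrolling it starts from the transposed weights.
decoder = nn.Sequential(
    nn.Linear(30, 250),   nn.Sigmoid(),
    nn.Linear(250, 500),  nn.Sigmoid(),
    nn.Linear(500, 1000), nn.Sigmoid(),
    nn.Linear(1000, 784), nn.Sigmoid(),
)
autoencoder = nn.Sequential(encoder, decoder)
# "Gentle backprop" fine-tuning then minimises the reconstruction error of the
# 784 pixel intensities, starting from the pre-trained weights.
```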
A comparison of methods for compressing digit
images to 30 real numbers
(Figure: rows show real data, the 30-D deep autoencoder reconstructions, and the 30-D PCA reconstructions.)
Neural Networks for Machine Learning
Lecture 15c: Deep autoencoders for document retrieval and visualization
Geoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman, Abdel-rahman Mohamed
How to find documents that are similar to a query
document
• Convert each document into a “bag of words”.
– This is a vector of word counts ignoring order.
– Ignore stop words (like “the” or “over”)
• We could compare the word counts of the query
document and millions of other documents but this
is too slow.
– So we reduce each query vector to a much
smaller vector that still contains most of the
information about the content of the document.
(Example bag-of-words count vector:
fish 0, cheese 0, vector 2, count 2, school 0, query 2,
reduce 1, bag 1, pulpit 0, iraq 0, word 2)
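A minimal sketch of turning a document into such a count vector (the stop-word list and tiny vocabulary below are placeholders for illustration, not the 2000-word vocabulary used in the lecture):

```python
from collections import Counter

VOCAB = ["fish", "cheese", "vector", "count", "school", "query",
         "reduce", "bag", "pulpit", "iraq", "word"]           # 2000 words in practice
STOP_WORDS = {"the", "over", "a", "of"}                        # illustrative only

def bag_of_words(document):
    tokens = [w for w in document.lower().split() if w not in STOP_WORDS]
    counts = Counter(tokens)
    return [counts.get(w, 0) for w in VOCAB]    # vector of word counts, order ignored

print(bag_of_words("the word vector count of the query word vector"))
```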
How to compress the count vector
• We train the neural network to
reproduce its input vector as its
output.
• This forces it to compress as
much information as possible
into the 10 numbers in the
central bottleneck.
• These 10 numbers are then a
good way to compare
documents.
(Architecture: 2000 word counts (input vector) → 500 neurons → 250 neurons →
10 (code) → 250 neurons → 500 neurons → 2000 reconstructed counts (output vector).)
The non-linearity used for reconstructing bags of words
• Divide the counts in a bag of words
vector by N, where N is the total number
of non-stop words in the document.
– The resulting probability vector gives
the probability of getting a particular
word if we pick a non-stop word at
random from the document.
• At the output of the autoencoder, we use
a softmax.
– The probability vector defines the
desired outputs of the softmax.
• When we train the first
RBM in the stack we use
the same trick.
– We treat the word
counts as probabilities,
but we make the visible
to hidden weights N
times bigger than the
hidden to visible
because we have N
observations from the
probability distribution.
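A sketch of this output non-linearity and its target (my own PyTorch rendering of the idea; the function name and the use of PyTorch are assumptions, not course code):

```python
import torch.nn.functional as F

def bag_of_words_loss(logits, counts):
    """Cross-entropy between the softmax output and the word-probability target.

    counts: float tensor of shape (batch, 2000) holding non-stop word counts.
    """
    N = counts.sum(dim=1, keepdim=True)          # total non-stop words per document
    target = counts / N                          # probability of picking each word
    log_probs = F.log_softmax(logits, dim=1)     # softmax at the autoencoder output
    # Multiplying by N treats the document as N observations from the distribution,
    # the same trick used when training the first RBM in the stack.
    return -(N * target * log_probs).sum(dim=1).mean()
```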
Performance of the autoencoder at document
retrieval
• Train on bags of 2000 words for 400,000 training cases of business
documents.
– First train a stack of RBM’s. Then fine-tune with backprop.
• Test on a separate 400,000 documents.
– Pick one test document as a query. Rank order all the other test
documents by using the cosine of the angle between codes.
– Repeat this using each of the 400,000 test documents as the
query (requires 0.16 trillion comparisons).
• Plot the number of retrieved documents against the proportion that
are in the same hand-labeled class as the query document.
Compare with LSA (a version of PCA).
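The ranking step is just a cosine similarity between code vectors; a minimal numpy sketch (the function and variable names are mine):

```python
import numpy as np

def rank_by_cosine(query_code, all_codes):
    """Rank documents by the cosine of the angle between their codes and the query's."""
    q = query_code / np.linalg.norm(query_code)
    C = all_codes / np.linalg.norm(all_codes, axis=1, keepdims=True)
    sims = C @ q
    return np.argsort(-sims)            # document indices, most similar first

codes = np.random.randn(400_000, 10)    # 10-number codes for the test documents
order = rank_by_cosine(codes[0], codes)
```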
Retrieval performance on 400,000 Reuters business news stories
First compress all documents to 2 numbers using PCA on
log(1+count). Then use different colors for different categories.
First compress all documents to 2 numbers using deep auto.
Then use different colors for different document categories.
Neural Networks for Machine Learning
Lecture 15d: Semantic hashing
Geoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman, Abdel-rahman Mohamed
Finding binary codes for documents
• Train an auto-encoder using 30 logistic
units for the code layer.
• During the fine-tuning stage, add noise
to the inputs to the code units.
– The noise forces their activities to
become bimodal in order to resist
the effects of the noise.
– Then we simply threshold the
activities of the 30 code units to get
a binary code.
• Krizhevsky later discovered that it’s
easier to just use binary stochastic
units in the code layer during training.
(Architecture: 2000 word counts → 500 neurons → 250 neurons → 30-unit code →
250 neurons → 500 neurons → 2000 reconstructed counts.)
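A sketch of the code layer described here: noise is added to the inputs of the logistic code units during fine-tuning, and at test time the activities are thresholded to give bits. This is my own PyTorch rendering, and the noise strength is an assumed value:

```python
import torch
import torch.nn as nn

class NoisyBinaryCode(nn.Module):
    def __init__(self, in_features, code_bits=30, noise_std=4.0):
        super().__init__()
        self.linear = nn.Linear(in_features, code_bits)
        self.noise_std = noise_std            # strength of the added noise (assumed value)

    def forward(self, h):
        pre = self.linear(h)
        if self.training:                     # noise pushes the activities to be bimodal
            pre = pre + self.noise_std * torch.randn_like(pre)
        return torch.sigmoid(pre)

    def binary_code(self, h):
        return torch.sigmoid(self.linear(h)) > 0.5   # threshold to get the 30 bits
```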
Using a deep autoencoder as a hash-function for
finding approximate matches
(Diagram: the deep autoencoder serves as the hash function; nearby memory addresses
are searched for approximate matches, a “supermarket search”.)
Another view of semantic hashing
• Fast retrieval methods typically work by intersecting stored lists that
are associated with cues extracted from the query.
• Computers have special hardware that can intersect 32 very long
lists in one instruction.
– Each bit in a 32-bit binary code specifies a list of half the
addresses in the memory.
• Semantic hashing uses machine learning to map the retrieval
problem onto the type of list intersection the computer is good at.
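A sketch of why binary codes make this cheap: comparing two machine-word codes is one XOR plus a population count, and probing nearby addresses is just flipping bits. This is a Python illustration of the general idea, not code from the lecture:

```python
def hamming(code_a, code_b):
    """Number of differing bits between two binary codes (XOR + popcount)."""
    return bin(code_a ^ code_b).count("1")

def nearby_addresses(code, n_bits=32):
    """Addresses within Hamming distance 1: the code itself plus every single-bit flip."""
    yield code
    for i in range(n_bits):
        yield code ^ (1 << i)

# Each bit of the code effectively selects one of two halves of memory;
# matching on all 32 bits intersects 32 such lists at once.
query = 0b1011_0110_0101_0011_1100_1010_0001_1111
print([hamming(query, a) for a in list(nearby_addresses(query))[:5]])
```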
Neural Networks for Machine Learning
Lecture 15e: Learning binary codes for image retrieval
Geoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman, Abdel-rahman Mohamed
Binary codes for image retrieval
• Image retrieval is typically done by using the captions. Why not use
the images too?
– Pixels are not like words: individual pixels do not tell us much
about the content.
– Extracting object classes from images is hard (this is out of date!)
• Maybe we should extract a real-valued vector that has information
about the content?
– Matching real-valued vectors in a big database is slow and
requires a lot of storage.
• Short binary codes are very easy to store and match.
A two-stage method
• First, use semantic hashing with 28-bit binary codes to get a long
“shortlist” of promising images.
• Then use 256-bit binary codes to do a serial search for good
matches.
– This only requires a few words of storage per image and the
serial search can be done using fast bit-operations.
• But how good are the 256-bit binary codes?
– Do they find images that we think are similar?
Krizhevsky’s deep autoencoder
(Architecture: 32×32 colour image input, shown as 1024 + 1024 + 1024 values →
8192 → 4096 → 2048 → 1024 → 512 → 256-bit binary code.)
• The encoder has about 67,000,000 parameters.
• There is no theory to justify this architecture.
• It takes a few days on a GTX 285 GPU to train on two million images.
Reconstructions of 32x32 color images from 256-bit codes
(Two figures comparing retrieval results: images retrieved using 256-bit codes
versus images retrieved using Euclidean distance in pixel intensity space.)
How to make image retrieval more sensitive to
objects and less sensitive to pixels
• First train a big net to recognize
lots of different types of object in
real images.
– We saw how to do that in lecture 5.
• Then use the activity vector in the
last hidden layer as the
representation of the image.
– This should be a much better
representation to match than the
pixel intensities.
• To see if this approach is likely
to work, we can use the net
described in lecture 5 that won
the ImageNet competition.
• So far we have only tried using
the Euclidean distance between
the activity vectors in the last
hidden layer.
– It works really well!
– Will it work with binary codes?
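A sketch of this kind of retrieval using a pretrained object-recognition net. The choice of torchvision's AlexNet and the exact layer split are my assumptions; the lecture only says to use the last hidden layer of the ImageNet-winning net:

```python
import torch
import torchvision.models as models

net = models.alexnet(weights="IMAGENET1K_V1")   # pretrained object-recognition net
net.eval()
# Everything up to (but not including) the final classification layer,
# i.e. the 4096-unit last hidden layer.
feature_extractor = torch.nn.Sequential(
    net.features, net.avgpool, torch.nn.Flatten(),
    *list(net.classifier.children())[:-1],
)

@torch.no_grad()
def features(images):                 # images: (B, 3, 224, 224), already normalised
    return feature_extractor(images)

def nearest(query_feat, database_feats, k=5):
    d = ((database_feats - query_feat) ** 2).sum(dim=1)   # Euclidean distance
    return torch.topk(-d, k).indices                       # k most similar images
```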
Leftmost column
is the search
image.
Other columns
are the images
that have the
most similar
feature activities
in the last hidden
layer.
Neural Networks for Machine Learning
Lecture 15f: Shallow autoencoders for pre-training
Geoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman, Abdel-rahman Mohamed
RBM’s as autoencoders
• When we train an RBM with one-step
contrastive divergence, it tries
to make the reconstructions look like
data.
– It’s like an autoencoder, but it’s
strongly regularized by using
binary activities in the hidden
layer.
• When trained with maximum
likelihood, RBMs are not like
autoencoders.
• Maybe we can replace the
stack of RBM’s used for
pre-training by a stack of
shallow autoencoders?
– Pre-training is not as
effective (for subsequent
discrimination) if the
shallow autoencoders
are regularized by
penalizing the squared
weights.
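For reference, a minimal numpy sketch of one-step contrastive divergence for a binary RBM (my own rendering, with arbitrary learning rate and shapes):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_h, b_v, lr=0.01, rng=np.random.default_rng()):
    """One CD-1 update: the reconstruction is pushed towards the data."""
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # binary hidden activities
    p_v1 = sigmoid(h0 @ W.T + b_v)                        # the "reconstruction"
    p_h1 = sigmoid(p_v1 @ W + b_h)
    W   += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
    b_v += lr * (v0 - p_v1).mean(axis=0)
    return p_v1
```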
Denoising autoencoders (Vincent et al., 2008)
• Denoising autoencoders add
noise to the input vector by
setting many of its components
to zero (like dropout, but for
inputs).
– They are still required to
reconstruct these
components so they must
extract features that capture
correlations between inputs.
• Pre-training is very effective if we use
a stack of denoising autoencoders.
– It’s as good as or better than
pre-training with RBMs.
– It’s also simpler to evaluate the
pre-training because we can
easily compute the value of the
objective function.
– It lacks the nice variational bound
we get with RBMs, but this is only
of theoretical interest.
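A sketch of the denoising step (masking input components to zero, like dropout on the inputs); this is a hypothetical PyTorch version with assumed layer sizes and corruption level:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_in=784, n_hidden=500, drop_prob=0.3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())
        self.drop_prob = drop_prob          # fraction of inputs set to zero (assumed value)

    def forward(self, x):
        mask = (torch.rand_like(x) > self.drop_prob).float()
        h = self.encoder(x * mask)          # corrupt the input...
        return self.decoder(h)              # ...but reconstruct the clean input

# Training minimises the reconstruction error against the *uncorrupted* x,
# so the features must capture correlations between inputs.
```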
Contractive autoencoders (Rifai et al., 2011)
• Another way to regularize an
autoencoder is to try to make
the activities of the hidden
units as insensitive as possible
to the inputs.
– But they cannot just ignore
the inputs because they
must reconstruct them.
• We achieve this by penalizing
the squared gradient of each
hidden activity w.r.t. the inputs.
• Contractive autoencoders work very
well for pre-training.
– The codes tend to have the
property that only a small
subset of the hidden units
are sensitive to changes in
the input.
– But for different parts of the input
space, it’s a different subset. The
active set is sparse.
– RBMs behave similarly.
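A sketch of the contractive penalty for a single logistic hidden layer, where the squared gradient of each hidden activity with respect to the inputs has the closed form (h(1-h))² times the squared weight norms. The class name, sizes, and penalty strength are my own assumptions:

```python
import torch
import torch.nn as nn

class ContractiveAutoencoder(nn.Module):
    def __init__(self, n_in=784, n_hidden=500, lam=0.1):
        super().__init__()
        self.W = nn.Linear(n_in, n_hidden)
        self.decoder = nn.Linear(n_hidden, n_in)
        self.lam = lam                          # penalty strength (assumed value)

    def forward(self, x):
        h = torch.sigmoid(self.W(x))
        return self.decoder(h), h

    def loss(self, x):
        x_hat, h = self.forward(x)
        recon = ((x_hat - x) ** 2).sum(dim=1).mean()
        # Penalise the squared gradient of each hidden activity w.r.t. the inputs.
        dh = (h * (1 - h)) ** 2                        # (batch, n_hidden)
        w_norms = (self.W.weight ** 2).sum(dim=1)      # (n_hidden,)
        contractive = (dh * w_norms).sum(dim=1).mean()
        return recon + self.lam * contractive
```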
Conclusions about pre-training
• There are now many different
ways to do layer-by-layer
pre-training of features.
– For datasets that do not
have huge numbers of
labeled cases, pre-training
helps subsequent
discriminative learning.
• Especially if there is extra
data that is unlabeled but
can be used for pretraining.
• For very large, labeled datasets,
initializing the weights used in
supervised learning by using
unsupervised pre-training is not
necessary, even for deep nets.
– Pre-training was the first good
way to initialize the weights for
deep nets, but now there are
other ways.
• But if we make the nets much larger
we will need pre-training again!