Review of auto-encoders
Piotr Mirowski, Microsoft Bing London (Dirk Gorissen)
Computational Intelligence Unconference, 26 July 2014
[Figure: sparse auto-encoder architecture, with input X, code Z, code prediction, code energy, decoding energy, input decoding and sparsity constraint]
Outline
• Deep learning concepts covered
o Hierarchical representations
o Sparse and/or distributed representations
o Supervised vs. unsupervised learning
• Auto-encoder
o Architecture
o Inference and learning
o Sparse coding
o Sparse auto-encoders
• Illustration: handwritten digits
o Stacking auto-encoders
o Learning representations of digits
o Impact on classification
• Applications to text
o Semantic hashing
o Semi-supervised learning
o Moving away from auto-encoders
• Topics not covered in this talk
Outline
• Deep learning concepts covered
o Hierarchical representations
o Sparse and/or distributed representations
o Supervised vs. unsupervised learning
• Auto-encoder
o Architecture
o Inference and learning
o Sparse coding
o Sparse auto-encoders
• Illustration: handwritten digits
o Stacking auto-encoders
o Learning representations of digits
o Impact on classification
• Applications to text
o Semantic hashing
o Semi-supervised learning
o Moving away from auto-encoders
• Topics not covered in this talk
Hierarchical
representations
“Deep learning methods aim at
learning feature hierarchies
with features from higher levels
of the hierarchy formed by the
composition of lower level
features.
Automatically learning features
at multiple levels of abstraction
allows a system
to learn complex functions
mapping the input to the output
directly from data,
without depending completely
on human-crafted features.”
— Yoshua Bengio
[Bengio, “On the expressive power of deep architectures”, Talk at ALT, 2011]
[Bengio, Learning Deep Architectures for AI, 2009]
Sparse and/or distributed
representations
Example on MNIST handwritten digits
An image of size 28x28 pixels can be represented
using a small combination of codes from a basis set.
[Ranzato, Poultney, Chopra & LeCun, “Efficient Learning of Sparse Representations with an Energy-Based Model ”, NIPS, 2006;
Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007]
Biological motivation: V1 visual cortex
Sparse and/or distributed
representations
Example on MNIST handwritten digits
An image of size 28x28 pixels can be represented
using a small combination of codes from a basis set.
At the end of this talk, you should know how to learn that basis set
and how to infer the codes, in a 2-layer auto-encoder architecture.
Matlab/Octave code and the MNIST dataset will be provided.
[Ranzato, Poultney, Chopra & LeCun, “Efficient Learning of Sparse Representations with an Energy-Based Model ”, NIPS, 2006;
Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007]
Biological motivation: V1 visual cortex (backup slides)
Supervised learning
Input X, target Y
Prediction: $\hat{Y} = f(X; W, b)$
Error: $E_f(Y, \hat{Y})$
Supervised learning
Input X, target Y
Linear prediction: $\hat{Y} = W^T X + b$
Error: $\| Y - \hat{Y} \|_2^2$
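As a minimal Octave sketch of this slide (the variable names are illustrative, not part of the tutorial code), the linear prediction and its squared error can be computed as:
% Illustrative example: linear prediction and squared-error loss
W = randn(3, 2); b = zeros(2, 1);  % weights (3 inputs, 2 outputs) and bias
x = randn(3, 1); y = randn(2, 1);  % one input sample and its target
y_hat = W' * x + b;                % prediction Y_hat = W' X + b
err = 0.5 * sum((y - y_hat).^2);   % squared-error E(Y, Y_hat)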
Why not exploit
unlabeled data?
Unsupervised learning
Input X, no target…
Prediction: $\hat{Y} = f(X; W, b)$, but no error to train on…
Unsupervised learning
Input X, code Z: a “latent/hidden” representation, with prediction(s) and error(s).
Unsupervised learning
Input X, code Z, prediction(s) and error(s).
We want the codes to represent the inputs in the dataset.
The code should be a compact representation of the inputs: low-dimensional and/or sparse.
Examples of
unsupervised learning
• Linear decomposition of the inputs:
o Principal Component Analysis and Singular Value Decomposition
o Independent Component Analysis [Bell & Sejnowski, 1995]
o Sparse coding [Olshausen & Field, 1997]
o …
• Fitting a distribution to the inputs:
o Mixtures of Gaussians
o Use of Expectation-Maximization algorithm [Dempster et al, 1977]
o …
• For text or discrete data:
o Latent Semantic Indexing [Deerwester et al, 1990]
o Probabilistic Latent Semantic Indexing [Hofmann et al, 1999]
o Latent Dirichlet Allocation [Blei et al, 2003]
o Semantic Hashing
o …
Objective of this tutorial
Study a fundamental building block
for deep learning,
the auto-encoder
Outline
• Deep learning concepts covered
o Hierarchical representations
o Sparse and/or distributed representations
o Supervised vs. unsupervised learning
• Auto-encoder
o Architecture
o Inference and learning
o Sparse coding
o Sparse auto-encoders
• Illustration: handwritten digits
o Stacking auto-encoders
o Learning representations of digits
o Impact on classification
• Applications to text
o Semantic hashing
o Semi-supervised learning
o Moving away from auto-encoders
• Topics not covered in this talk
Auto-encoder
Input X, code Z, target = input (Y = X).
“Bottleneck” code, i.e., low-dimensional, typically dense, distributed representation.
“Overcomplete” code, i.e., high-dimensional, always sparse, distributed representation.
Auto-encoder
Input X, code Z.
Code prediction: $\hat{Z} = g(X; C, b_C)$, with encoding “energy” $E_g(Z, \hat{Z})$.
Input decoding: $\hat{X} = h(Z; D, b_D)$, with decoding “energy” $E_h(X, \hat{X})$.
Auto-encoder
Input X, code Z.
Code prediction: $\hat{Z} = g(X; C, b_C)$, with encoding energy $\| Z - \hat{Z} \|_2^2$.
Input decoding: $\hat{X} = h(Z; D, b_D)$, with decoding energy $\| X - \hat{X} \|_2^2$.
Auto-encoder
loss function
For one sample t:
$L(X(t), Z(t); W) = \alpha \, \| Z(t) - g(X(t); C, b_C) \|_2^2 + \| X(t) - h(Z(t); D, b_D) \|_2^2$
For all T samples:
$L(X, Z; W) = \alpha \sum_{t=1}^{T} \| Z(t) - g(X(t); C, b_C) \|_2^2 + \sum_{t=1}^{T} \| X(t) - h(Z(t); D, b_D) \|_2^2$
The first term is the encoding energy, the second the decoding energy;
α is the coefficient of the encoder error, and we note W = {C, bC, D, bD}.
How do we get the codes Z?
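As a sketch in the Octave notation of the tutorial code shown later (assuming a model struct, an input x, a current code z and an encoder-error coefficient params.alpha_c; Loss_Gaussian includes a factor 0.5), the per-sample loss is:
% Per-sample auto-encoder loss, reusing the FProp modules defined below
z_hat = Module_Encode_FProp(model, x, params);       % code prediction g(X)
[x_hat, a_hat] = Module_Decode_FProp(model, z);      % input decoding h(Z)
loss = params.alpha_c * Loss_Gaussian(z, z_hat) ...  % encoding energy (weighted by alpha)
       + Loss_Gaussian(x, x_hat);                    % decoding energy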
Learning and inference in
auto-encoders
Infer the codes Z given the current model parameters W:
$Z^* = \arg\min_Z L(X, Z; W)$
Learn the parameters (weights) W of the encoder and decoder given the current codes Z:
$W^* = \arg\min_W L(X, Z; W)$
Relationship to Expectation-Maximization
in graphical models (backup slides)
Learning and inference:
stochastic gradient descent
Iterated gradient descent (?) on the code Z(t), given the current model parameters W:
$Z^*(t) = \arg\min_{Z(t)} L(X(t), Z(t); W)$
Take a gradient descent step on the parameters (weights) W of the encoder and decoder, given the current codes Z:
$L(X(t), Z(t); W^*) < L(X(t), Z(t); W)$
Relationship to Generalized EM
in graphical models (backup slides)
Auto-encoder
Input X, code Z.
Code prediction: $\hat{Z} = C^T X + b_C$, with encoding energy $\| Z - \hat{Z} \|_2^2$.
Code activation: $A = \sigma(Z) = \frac{1}{1 + e^{-Z}}$.
Input decoding: $\hat{X} = D^T A + b_D$, with decoding energy $\| X - \hat{X} \|_2^2$.
Auto-encoder: fprop
Code
Input X
Z
function [x_hat, a_hat] = …
Module_Decode_FProp(model, z)
% Apply the logistic to the code
a_hat = 1 ./ (1 + exp(-z));
% Linear decoding
x_hat = model.D * a_hat + model.bias_D;
function z_hat = …
Module_Encode_FProp(model, x, params)
% Compute the linear encoding activation
z_hat = model.C * x + model.bias_C;
function e = Loss_Gaussian(z, z_hat)
zDiff = z_hat - z;
e = 0.5 * sum(zDiff.^2);
function e = Loss_Gaussian(x, x_hat)
xDiff = x_hat - x;
e = 0.5 * sum(xDiff.^2);
Auto-encoder
backprop w.r.t. codes
Input X, code Z; code prediction and encoding energy as before.
$\frac{\partial L}{\partial \hat{X}} = \frac{\partial}{\partial \hat{X}} \| X - \hat{X} \|_2^2$
$\frac{\partial \hat{X}}{\partial A} = \frac{\partial}{\partial A} \left( D^T A + b_D \right)$
$\frac{\partial A}{\partial Z} = \frac{\partial \sigma(Z)}{\partial Z}$
$\frac{\partial L_{dec}}{\partial Z} = \frac{\partial L}{\partial \hat{X}} \, \frac{\partial \hat{X}}{\partial A} \, \frac{\partial A}{\partial Z}$
$\frac{\partial L_{enc}}{\partial Z} = \frac{\partial}{\partial Z} \| Z - \hat{Z} \|_2^2$
[Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007]
Auto-encoder:
backprop w.r.t. codes
Code
Z
function dL_dz = ...
Module_Decode_BackProp_Codes(model, dL_dx, a_hat, params)
% Gradient of the loss w.r.t. decoder activations
dL_da = model.D' * dL_dx;
% Gradient of the loss w.r.t. latent codes,
% through the logistic a_hat = 1 ./ (1 + exp(-z_hat))
dL_dz = dL_da .* a_hat .* (1 - a_hat);
Slide call-outs, used in the Layer_Infer loop on the next slide:
% Gradient of the loss w.r.t. the decoder prediction
dL_dx_star = x_star - x;
% Add the gradient w.r.t. the encoder's outputs
dL_dz = dL_dz + params.alpha_c * (z_star - z_hat);
Input X
[Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007]
Code inference
in the auto-encoder
function [z_star, z_hat, loss_star, loss_hat] = Layer_Infer(model, x, params)
% Encode the current input and initialize the latent code
z_hat = Module_Encode_FProp(model, x, params);
% Decode the current latent code
[x_hat, a_hat] = Module_Decode_FProp(model, z_hat);
% Compute the current loss term due to decoding (encoding loss is 0)
loss_hat = Loss_Gaussian(x, x_hat);
% Relaxation on the latent code: loop until convergence
x_star = x_hat; a_star = a_hat; z_star = z_hat; loss_star = loss_hat;
while (true)
% Gradient of the loss function w.r.t. decoder prediction
dL_dx_star = x_star - x;
% Back-propagate the gradient of the loss onto the codes
dL_dz = Module_Decode_BackProp_Codes(model, dL_dx_star, a_star, params);
% Add the gradient w.r.t. the encoder's outputs
dL_dz = dL_dz + params.alpha_c * (z_star - z_hat);
% Perform one step of gradient descent on the codes
z_star = z_star - params.eta_z * dL_dz;
% Decode the current latent code
[x_star, a_star] = Module_Decode_FProp(model, z_star);
% Compute the current loss and convergence criteria
loss_star = Loss_Gaussian(x, x_star) + ...
params.alpha_c * Loss_Gaussian(z_star, z_hat);
% Stopping criteria
[...]
end
(Figure: same auto-encoder architecture and code-gradient equations as on the previous slide.)
[Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007]
Auto-encoder:
backprop w.r.t. weights
Input X, optimal code Z*, with $A^* = \sigma(Z^*)$.
$\frac{\partial L}{\partial X^*} = \frac{\partial}{\partial X^*} \| X - X^* \|_2^2$
$\frac{\partial X^*}{\partial D} = \frac{\partial}{\partial D} \left( D^T A^* + b_D \right)$ and $\frac{\partial L}{\partial D} = \frac{\partial L}{\partial X^*} \, \frac{\partial X^*}{\partial D}$
$\frac{\partial L}{\partial \hat{Z}} = \frac{\partial}{\partial \hat{Z}} \| \hat{Z} - Z^* \|_2^2$
$\frac{\partial \hat{Z}}{\partial C} = \frac{\partial}{\partial C} \left( C^T X + b_C \right)$ and $\frac{\partial L}{\partial C} = \frac{\partial L}{\partial \hat{Z}} \, \frac{\partial \hat{Z}}{\partial C}$
[Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks”, NIPS, 2007]
Auto-encoder:
backprop w.r.t. weights
Code
Z
function model = ...
Module_Decode_BackProp_Weights(model, ...
dL_dx_star, a_star, params)
% Jacobian of the loss w.r.t. decoder matrix
model.dL_dD = dL_dx_star * a_star';
% Gradient of the loss w.r.t. decoder bias
model.dL_dbias_D = dL_dx_star;
% Gradient of the loss w.r.t. the code prediction
dL_dz = z_hat - z_star;
% Gradient of the loss w.r.t. the reconstruction
dL_dx_star = x_star - x;
Input X
[Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007]
function model = ...
Module_Encode_BackProp_Weights(model, ...
dL_dz, x, params)
% Jacobian of the loss w.r.t. encoder matrix
model.dL_dC = dL_dz * x';
% Gradient of the loss w.r.t. encoding bias
model.dL_dbias_C = dL_dz;
Usual tricks about
classical SGD
• Regularization (L1-norm or L2-norm)
of the parameters?
• Learning rate?
• Learning rate decay?
• Momentum term on the parameters?
• Choice of the learning hyperparameters
o Cross-validation?
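As an illustration of those choices (a sketch only; the field names params.lambda_l2, params.momentum, params.eta_w, params.eta_decay and model.mom_C are assumptions, not part of the tutorial code), one SGD step on the encoder matrix C with L2 regularization, momentum and learning-rate decay could look like:
% Hypothetical SGD update for the encoder matrix C (Octave sketch)
% model.dL_dC comes from Module_Encode_BackProp_Weights; model.mom_C starts at zeros
grad = model.dL_dC + params.lambda_l2 * model.C;            % add L2 regularization
model.mom_C = params.momentum * model.mom_C - params.eta_w * grad;
model.C = model.C + model.mom_C;                            % descent step with momentum
params.eta_w = params.eta_w * params.eta_decay;             % learning-rate decay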
Sparse coding
Input X, overcomplete code Z.
Input decoding: $\hat{X} = h(Z; W, b)$, with decoding error $E_h(X, \hat{X})$ and sparsity constraint $L_s(Z)$.
$L(X, Z; W) = \| X - h(Z; W, b) \|_2^2 + \lambda L_S(Z)$
[Olshausen & Field, “Sparse coding with an overcomplete basis set: a strategy employed by V1?”, Vision Research, 1997]
Sparse coding
Input X, overcomplete code Z.
Input decoding: $\hat{X} = W^T Z + b$, with decoding error $\| X - \hat{X} \|_2^2$.
Sparsity constraint: $L_S(Z) = \sum_{d=1}^{M} \log \left( 1 + z_d^2 \right)$
$L(X, Z; W) = \| X - (W^T Z + b) \|_2^2 + \lambda L_S(Z)$
[Olshausen & Field, “Sparse coding with an overcomplete basis set: a strategy employed by V1?”, Vision Research, 1997]
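A small Octave sketch of this sparsity constraint and of the gradient used during code inference (the code size 192 is only illustrative):
% Sparsity penalty L_S(Z) = sum_d log(1 + z_d^2) and its gradient w.r.t. the code
z = randn(192, 1);               % an overcomplete code
Ls = sum(log(1 + z.^2));         % sparsity penalty
dLs_dz = 2 * z ./ (1 + z.^2);    % gradient, to be added to the decoding-error gradient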
Limitations of
sparse coding
• At runtime, assuming a trained model W,
inferring the code Z given an input sample X
is expensive
• Need a tweak on the model weights W:
normalize the columns of W to unit length
after each learning step
• Otherwise, in $\hat{X} = W^T Z + b$, the code is pulled to 0 by the sparsity constraint and the weights go to infinity to compensate
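A minimal Octave sketch of that normalization step, assuming the 192 bases of 784 pixels are stored as the columns of W (the orientation depends on the convention used):
% Normalize each column of the basis matrix to unit L2 norm after a weight update
W = randn(784, 192);                                  % decoder bases (illustrative sizes)
W = W ./ (ones(size(W, 1), 1) * sqrt(sum(W.^2, 1)));  % unit-length columns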
Sparse
auto-encoder
Input X, code Z.
Code prediction: $\hat{Z} = g(X; W_g, b_g)$, with code error $E_g(Z, \hat{Z})$.
Input decoding: $\hat{X} = h(Z; W_h, b_h)$, with decoding error $E_h(X, \hat{X})$.
Sparsity constraint: $L_s(Z)$.
[Ranzato, Poultney, Chopra & LeCun, “Efficient Learning of Sparse Representations with an Energy-Based Model ”, NIPS, 2006;
Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007]
Symmetric sparse
auto-encoder
Input X, code Z.
Code prediction: $\hat{Z} = W X + b_C$, with code error $\| Z - \hat{Z} \|_2^2$.
Input decoding: $\hat{X} = W^T \sigma(Z) + b_D$, with decoding error $\| X - \hat{X} \|_2^2$.
Sparsity constraint: $\sum_{d=1}^{M} \log \left( 1 + \sigma(z_d)^2 \right)$
The encoder matrix W is symmetric to the decoder matrix $W^T$.
[Ranzato, Poultney, Chopra & LeCun, “Efficient Learning of Sparse Representations with an Energy-Based Model”, NIPS, 2006;
Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks”, NIPS, 2007]
Predictive Sparse
Decomposition
Input X, code Z.
Code prediction: $\hat{Z} = g(X; C, b_C)$.
Once the encoder g is properly trained, the code Z can be directly predicted from the input X.
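In the Octave code of this tutorial, that amounts to skipping the relaxation loop of Layer_Infer and using the encoder prediction directly:
% Predictive Sparse Decomposition at runtime: no iterative code inference,
% a single forward pass through the trained encoder predicts the code
z = Module_Encode_FProp(model, x, params);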
Outline
• Deep learning concepts covered
o Hierarchical representations
o Sparse and/or distributed representations
o Supervised vs. unsupervised learning
• Auto-encoder
o Architecture
o Inference and learning
o Sparse coding
o Sparse auto-encoders
• Illustration: handwritten digits
o Stacking auto-encoders
o Learning representations of digits
o Impact on classification
• Applications to text
o Semantic hashing
o Semi-supervised learning
o Moving away from auto-encoders
• Topics not covered in this talk
Stacking auto-encoders
[Figure: two stacked sparse auto-encoders; each layer has its own input X, code Z, code prediction, code energy, decoding energy, input decoding and sparsity constraint, and the codes of layer 1 serve as the input of layer 2.]
[Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks”, NIPS, 2007]
MNIST handwritten
digits
• Database of 70k
handwritten digits
o Training set: 60k
o Test set: 10k
• 28 x 28 pixels
• Best performing
classifiers:
o Linear classifier: 12% error
o Gaussian SVM: 1.4% error
o ConvNets: <1% error
[http://yann.lecun.com/exdb/mnist/]
Stacked auto-encoders
[Figure: the two stacked sparse auto-encoders from the previous slide.]
Layer 1: Matrix W1 of size 192 x 784
192 sparse bases of 28 x 28 pixels
Layer 2: Matrix W2 of size 10 x 192
10 sparse bases of 192 units
[Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks”, NIPS, 2007]
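A rough Octave sketch of how the two layers chain together at these sizes (model1, model2 and the explicit logistic step are illustrative shorthand, not the tutorial's training script):
% Feeding layer-1 codes (192 units) as the input of layer 2 (10 units)
z1_hat = Module_Encode_FProp(model1, x, params);    % x: 784 pixels -> 192-dim code
a1 = 1 ./ (1 + exp(-z1_hat));                       % logistic activation of layer 1
z2_hat = Module_Encode_FProp(model2, a1, params);   % 192 units -> 10-dim code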
Our results:
bases learned on layer 1
Our results:
back-projecting layer 2
Sparse representations (layer 1, layer 2)
Training “converges” in one pass over the data (layer 1, layer 2)
Outline
• Deep learning concepts covered
o Hierarchical representations
o Sparse and/or distributed representations
o Supervised vs. unsupervised learning
• Auto-encoder
o Architecture
o Inference and learning
o Sparse coding
o Sparse auto-encoders
• Illustration: handwritten digits
o Stacking auto-encoders
o Learning representations of digits
o Impact on classification
• Applications to text
o Semantic hashing
o Semi-supervised learning
o Moving away from auto-encoders
• Topics not covered in this talk
Semantic Hashing
[Hinton & Salakhutdinov, “Reducing the dimensionality of data with neural networks”, Science, 2006;
Salakhutdinov & Hinton, “Semantic Hashing”, Int J Approx Reason, 2007]
[Figure: deep auto-encoder with layer sizes 2000-500-250-125-2-125-250-500-2000.]
Semi-supervised learning
of auto-encoders
• Add classifier
module to the
codes
• When an input X(t)
has a label Y(t),
back-propagate
the prediction error
on Y(t)
to the code Z(t)
• Stack the encoders
• Train layer-wise
[Ranzato & Szummer, “Semi-supervised learning of compact document representations with deep networks”, ICML, 2008;
Mirowski, Ranzato & LeCun, “Dynamic auto-encoders for semantic indexing”, NIPS Deep Learning Workshop, 2010]
[Figure: stack of three auto-encoders (g1, h1), (g2, h2), (g3, h3) on word-histogram inputs x(t), x(t+1); the codes z(1), z(2), z(3) at each layer feed document classifiers f1, f2, f3 that predict the labels y(t), y(t+1), with a random-walk dynamic linking consecutive codes.]
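A hedged Octave sketch of the bullet "back-propagate the prediction error on Y(t) to the code Z(t)" above, assuming a hypothetical linear classifier (model.F, model.bias_F) on the code, a one-hot label vector y and a weighting coefficient params.alpha_y (none of these names come from the tutorial code):
% During code inference, add the classifier's gradient to dL_dz when a label exists
a = model.F * z_star + model.bias_F;      % linear classifier on the inferred code
p = exp(a - max(a)); p = p / sum(p);      % softmax class probabilities
dL_dz = dL_dz + params.alpha_y * (model.F' * (p - y));  % cross-entropy gradient w.r.t. the code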
Semi-supervised learning
of auto-encoders
[Ranzato & Szummer, “Semi-supervised learning of compact document representations with deep networks”, ICML, 2008;
Mirowski, Ranzato & LeCun, “Dynamic auto-encoders for semantic indexing”, NIPS Deep Learning Workshop, 2010]
Baselines: 2000-word TF-IDF + logistic regression, F1 = 0.83; 2000-word TF-IDF + SVM, F1 = 0.84
K     LSA+ICA   [Mirowski et al, 2010]   [Mirowski et al, 2010]   [Ranzato et al, 2008]
                no dynamics              L1 dynamics
100   0.81      0.86                     0.85                     0.85
30    0.70      0.86                     0.85                     0.85
10    0.40      0.86                     0.85                     0.84
2     0.09      0.54                     0.51                     0.19
Performance on document retrieval task:
Reuters-21k dataset (9.6k training, 4k test),
vocabulary 2k words, 10-class classification
Comparison with:
• unsupervised techniques
(DBN: Semantic Hashing, LSA) + SVM
• traditional technique: word TF-IDF + SVM
Beyond auto-encoders
for web search (MSR)
[Huang, He, Gao, Deng et al, “Learning Deep Structured Semantic Models for Web Search using Clickthrough Data”, CIKM, 2013]
[Figure: Deep Structured Semantic Model. The input word/phrase (query s: “racing car”, documents t1: “formula one”, t2: “ford model t”) is a bag-of-words vector of dim = 5M, mapped by a fixed letter-tri-gram coefficient matrix to a letter-tri-gram vector of dim = 50K, then through an embedding matrix and learned layers W1 to W4 (d = 500, 500, 300) to a semantic vector; relevance is the cosine similarity cos(s, t1), cos(s, t2) between semantic vectors.]
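A one-line Octave sketch of the relevance score between two such 300-dimensional semantic vectors:
% Cosine similarity between a query semantic vector s and a document vector t
s = randn(300, 1); t = randn(300, 1);
cos_st = (s' * t) / (norm(s) * norm(t));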
Beyond auto-encoders
for web search (MSR)
[Huang, He, Gao, Deng et al, “Learning Deep Structured Semantic Models for Web Search using Clickthrough Data”, CIKM, 2013]
[Figure: normalized discounted cumulative gains on a web ranking task (16k queries), comparing Semantic Hashing [Salakhutdinov & Hinton, 2007] with the Deep Structured Semantic Model [Huang, He, Gao et al, 2013].]
Outline
• Deep learning concepts covered
o Hierarchical representations
o Sparse and/or distributed representations
o Supervised vs. unsupervised learning
• Auto-encoder
o Architecture
o Inference and learning
o Sparse coding
o Sparse auto-encoders
• Illustration: handwritten digits
o Stacking auto-encoders
o Learning representations of digits
o Impact on classification
• Applications to text
o Semantic hashing
o Semi-supervised learning
o Moving away from auto-encoders
• Topics not covered in this talk
Topics not covered
in this talk
• Other variations of
auto-encoders
o Restricted Boltzmann Machines
(work in Geoff Hinton’s lab)
o Denoising Auto-Encoders
(work in Yoshua Bengio’s lab)
• Invariance to shifts in
input and feature
space
o Convolutional kernels
o Sliding windows over input
o Max-pooling over codes
[LeCun, Bottou, Bengio & Haffner, “Gradient-based learning applied to document recognition”, Proceedings of IEEE,1998;
Le, Ranzato et al. "Building high-level features using large-scale unsupervised learning" ICML 2012;
Sermanet et al, “OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014]
Thank you!
• Tutorial code:
https://github.com/piotrmirowski
http://piotrmirowski.wordpress.com
• Contact:
piotr.mirowski@computer.org
• Acknowledgements:
Marc’Aurelio Ranzato (FB)
Yann LeCun (FB/NYU)
Auto-encoders and
Expectation-Maximization
Energy of inputs and codes:
$E(X, Z; W) = \alpha \sum_{t=1}^{T} \| Z_t - g(X_t; C, b_C) \|_2^2 + \sum_{t=1}^{T} \| X_t - h(Z_t; D, b_D) \|_2^2$
Input data likelihood:
$P(X \mid W) = \int_Z P(X, Z \mid W) = \frac{\int_Z e^{-\beta E(X, Z; W)}}{\int_{X, Z} e^{-\beta E(X, Z; W)}}$
$-\log P(X \mid W) = -\frac{1}{\beta} \log \int_Z e^{-\beta E(X, Z; W)} + \frac{1}{\beta} \log \int_{Y, Z} e^{-\beta E(X, Z; W)}$
Maximum A Posteriori: take the minimal-energy code Z*, i.e., do not marginalize over Z but take the maximum-likelihood latent code instead:
$-\log P(X \mid W) = E(X, Z^*; W) + \frac{1}{\beta} \log \int_Y e^{-\beta E(X, Z^*; W)}$
Enforcing sparsity on Z constrains Z and avoids computing the partition function.
Stochastic
gradient descent
[LeCun et al, "Efficient BackProp", Neural Networks: Tricks of the Trade, 1998;
Bottou, "Stochastic Learning", Slides from a talk in Tubingen, 2003]
Stochastic
gradient descent
[LeCun et al, "Efficient BackProp", Neural Networks: Tricks of the Trade, 1998;
Bottou, "Stochastic Learning", Slides from a talk in Tubingen, 2003]
Dimensionality reduction
and invariant mapping
Figure 2. Figure showing the spring system. The solid circles represent points that are similar to the point in the center. The hollow circles represent dissimilar points. The springs are shown as […] global loss function L over all springs, one would ultimately drive the system to its equilibrium state.
2.3. The Algorithm
The algorithm first generates the training set, then trains the machine.
Step 1: For each input sample $X_i$, do the following:
(a) Using prior knowledge, find the set of samples $S_{X_i} = \{ X_j \}_{j=1}^{p}$, such that $X_j$ is deemed similar to $X_i$.
(b) Pair the sample $X_i$ with all the other training samples and label the pairs so that: $Y_{ij} = 0$ if $X_j \in S_{X_i}$, and $Y_{ij} = 1$ otherwise.
Combine all the pairs to form the labeled training set.
Step 2: Repeat until convergence:
(a) For each pair $(X_i, X_j)$ in the training set, do
i. If $Y_{ij} = 0$, then update W to decrease $D_W = \| G_W(X_i) - G_W(X_j) \|_2$
ii. If $Y_{ij} = 1$, then update W to increase $D_W = \| G_W(X_i) - G_W(X_j) \|_2$
[Hadsell, Chopra & LeCun, “Dimensionality Reduction by Learning an Invariant Mapping”, CVPR, 2006]
Inputs X1, X2 with codes $Z_1 = f(X_1; W, b)$ and $Z_2 = f(X_2; W, b)$, compared by the energy $E_f(Z_1, Z_2)$: similarly labelled samples, dissimilar codes.
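A short Octave sketch of the corresponding pairwise loss (the margin m and the hinge on dissimilar pairs follow the DrLIM paper; this is an illustration, not the talk's code):
% Contrastive pairwise loss: pull similar pairs together,
% push dissimilar pairs at least a margin m apart
z1 = randn(10, 1); z2 = randn(10, 1);   % codes of the two inputs
y = 1; m = 1.0;                         % y = 1: dissimilar pair; m: margin
d = norm(z1 - z2);                      % distance D_W between the codes
if y == 0
  loss = 0.5 * d^2;                     % similar pair: decrease the distance
else
  loss = 0.5 * max(0, m - d)^2;         % dissimilar pair: increase it up to the margin
end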
More Related Content

What's hot

Foundations: Artificial Neural Networks
Foundations: Artificial Neural NetworksFoundations: Artificial Neural Networks
Foundations: Artificial Neural Networks
ananth
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Turi, Inc.
 
Deep learning summary
Deep learning summaryDeep learning summary
Deep learning summary
ankit_ppt
 
Machine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis IntroductionMachine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis Introduction
Te-Yen Liu
 
NeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximateProgramsNeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximateProgramsMohid Nabil
 
Deep learning
Deep learningDeep learning
Deep learning
Pratap Dangeti
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networks
Akash Goel
 
Visualization of Deep Learning
Visualization of Deep LearningVisualization of Deep Learning
Visualization of Deep Learning
YaminiAlapati1
 
Scene understanding
Scene understandingScene understanding
Scene understanding
Mohammed Shoaib
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
Databricks
 
Deep learning in Computer Vision
Deep learning in Computer VisionDeep learning in Computer Vision
Deep learning in Computer Vision
David Dao
 
Principles of soft computing-Associative memory networks
Principles of soft computing-Associative memory networksPrinciples of soft computing-Associative memory networks
Principles of soft computing-Associative memory networksSivagowry Shathesh
 
Machine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksFrancesco Collova'
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
ananth
 
Neural Network as a function
Neural Network as a functionNeural Network as a function
Neural Network as a function
Taisuke Oe
 
Learning in Networks: were Pavlov and Hebb right?
Learning in Networks: were Pavlov and Hebb right?Learning in Networks: were Pavlov and Hebb right?
Learning in Networks: were Pavlov and Hebb right?Victor Miagkikh
 
Lecture 29 Convolutional Neural Networks - Computer Vision Spring2015
Lecture 29 Convolutional Neural Networks -  Computer Vision Spring2015Lecture 29 Convolutional Neural Networks -  Computer Vision Spring2015
Lecture 29 Convolutional Neural Networks - Computer Vision Spring2015
Jia-Bin Huang
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
Sungjoon Choi
 
Deep learning
Deep learningDeep learning
Deep learning
Rouyun Pan
 

What's hot (20)

Foundations: Artificial Neural Networks
Foundations: Artificial Neural NetworksFoundations: Artificial Neural Networks
Foundations: Artificial Neural Networks
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
 
Deep learning summary
Deep learning summaryDeep learning summary
Deep learning summary
 
Machine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis IntroductionMachine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis Introduction
 
Lesson 33
Lesson 33Lesson 33
Lesson 33
 
NeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximateProgramsNeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximatePrograms
 
Deep learning
Deep learningDeep learning
Deep learning
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networks
 
Visualization of Deep Learning
Visualization of Deep LearningVisualization of Deep Learning
Visualization of Deep Learning
 
Scene understanding
Scene understandingScene understanding
Scene understanding
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
 
Deep learning in Computer Vision
Deep learning in Computer VisionDeep learning in Computer Vision
Deep learning in Computer Vision
 
Principles of soft computing-Associative memory networks
Principles of soft computing-Associative memory networksPrinciples of soft computing-Associative memory networks
Principles of soft computing-Associative memory networks
 
Machine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural Networks
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
Neural Network as a function
Neural Network as a functionNeural Network as a function
Neural Network as a function
 
Learning in Networks: were Pavlov and Hebb right?
Learning in Networks: were Pavlov and Hebb right?Learning in Networks: were Pavlov and Hebb right?
Learning in Networks: were Pavlov and Hebb right?
 
Lecture 29 Convolutional Neural Networks - Computer Vision Spring2015
Lecture 29 Convolutional Neural Networks -  Computer Vision Spring2015Lecture 29 Convolutional Neural Networks -  Computer Vision Spring2015
Lecture 29 Convolutional Neural Networks - Computer Vision Spring2015
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
 
Deep learning
Deep learningDeep learning
Deep learning
 

Similar to Piotr Mirowski - Review Autoencoders (Deep Learning) - CIUUK14

Introduction to Autoencoders
Introduction to AutoencodersIntroduction to Autoencoders
Introduction to Autoencoders
Yan Xu
 
Software tookits for machine learning and graphical models
Software tookits for machine learning and graphical modelsSoftware tookits for machine learning and graphical models
Software tookits for machine learning and graphical modelsbutest
 
AI 로봇 아티스트의 비밀(창원대학교 정보통신공학과 특강)
AI 로봇 아티스트의 비밀(창원대학교 정보통신공학과 특강)AI 로봇 아티스트의 비밀(창원대학교 정보통신공학과 특강)
AI 로봇 아티스트의 비밀(창원대학교 정보통신공학과 특강)
Changwon National University
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and Tensorflow
Oswald Campesato
 
Neural Networks in the Wild: Handwriting Recognition
Neural Networks in the Wild: Handwriting RecognitionNeural Networks in the Wild: Handwriting Recognition
Neural Networks in the Wild: Handwriting Recognition
John Liu
 
supervised.pptx
supervised.pptxsupervised.pptx
supervised.pptx
MohamedSaied316569
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
Sri Ambati
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Universitat Politècnica de Catalunya
 
NSC #2 - D2 06 - Richard Johnson - SAGEly Advice
NSC #2 - D2 06 - Richard Johnson - SAGEly AdviceNSC #2 - D2 06 - Richard Johnson - SAGEly Advice
NSC #2 - D2 06 - Richard Johnson - SAGEly Advice
NoSuchCon
 
InfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial NetworksInfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial Networks
Zak Jost
 
Variational Autoencoders For Image Generation
Variational Autoencoders For Image GenerationVariational Autoencoders For Image Generation
Variational Autoencoders For Image Generation
Jason Anderson
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptx
ssuserf07225
 
Unsupervised program synthesis
Unsupervised program synthesisUnsupervised program synthesis
Unsupervised program synthesis
Amrith Krishna
 
Auto-encoding variational bayes
Auto-encoding variational bayesAuto-encoding variational bayes
Auto-encoding variational bayes
Kyuri Kim
 
Elliptic curvecryptography Shane Almeida Saqib Awan Dan Palacio
Elliptic curvecryptography Shane Almeida Saqib Awan Dan PalacioElliptic curvecryptography Shane Almeida Saqib Awan Dan Palacio
Elliptic curvecryptography Shane Almeida Saqib Awan Dan Palacio
Information Security Awareness Group
 
Number Crunching in Python
Number Crunching in PythonNumber Crunching in Python
Number Crunching in Python
Valerio Maggio
 
Verilog Final Probe'22.pptx
Verilog Final Probe'22.pptxVerilog Final Probe'22.pptx
Verilog Final Probe'22.pptx
SyedAzim6
 
Pointcuts and Analysis
Pointcuts and AnalysisPointcuts and Analysis
Pointcuts and AnalysisWiwat Ruengmee
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Gabriel Moreira
 
Applying Compiler Techniques to Iterate At Blazing Speed
Applying Compiler Techniques to Iterate At Blazing SpeedApplying Compiler Techniques to Iterate At Blazing Speed
Applying Compiler Techniques to Iterate At Blazing Speed
Pascal-Louis Perez
 

Similar to Piotr Mirowski - Review Autoencoders (Deep Learning) - CIUUK14 (20)

Introduction to Autoencoders
Introduction to AutoencodersIntroduction to Autoencoders
Introduction to Autoencoders
 
Software tookits for machine learning and graphical models
Software tookits for machine learning and graphical modelsSoftware tookits for machine learning and graphical models
Software tookits for machine learning and graphical models
 
AI 로봇 아티스트의 비밀(창원대학교 정보통신공학과 특강)
AI 로봇 아티스트의 비밀(창원대학교 정보통신공학과 특강)AI 로봇 아티스트의 비밀(창원대학교 정보통신공학과 특강)
AI 로봇 아티스트의 비밀(창원대학교 정보통신공학과 특강)
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and Tensorflow
 
Neural Networks in the Wild: Handwriting Recognition
Neural Networks in the Wild: Handwriting RecognitionNeural Networks in the Wild: Handwriting Recognition
Neural Networks in the Wild: Handwriting Recognition
 
supervised.pptx
supervised.pptxsupervised.pptx
supervised.pptx
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
 
NSC #2 - D2 06 - Richard Johnson - SAGEly Advice
NSC #2 - D2 06 - Richard Johnson - SAGEly AdviceNSC #2 - D2 06 - Richard Johnson - SAGEly Advice
NSC #2 - D2 06 - Richard Johnson - SAGEly Advice
 
InfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial NetworksInfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial Networks
 
Variational Autoencoders For Image Generation
Variational Autoencoders For Image GenerationVariational Autoencoders For Image Generation
Variational Autoencoders For Image Generation
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptx
 
Unsupervised program synthesis
Unsupervised program synthesisUnsupervised program synthesis
Unsupervised program synthesis
 
Auto-encoding variational bayes
Auto-encoding variational bayesAuto-encoding variational bayes
Auto-encoding variational bayes
 
Elliptic curvecryptography Shane Almeida Saqib Awan Dan Palacio
Elliptic curvecryptography Shane Almeida Saqib Awan Dan PalacioElliptic curvecryptography Shane Almeida Saqib Awan Dan Palacio
Elliptic curvecryptography Shane Almeida Saqib Awan Dan Palacio
 
Number Crunching in Python
Number Crunching in PythonNumber Crunching in Python
Number Crunching in Python
 
Verilog Final Probe'22.pptx
Verilog Final Probe'22.pptxVerilog Final Probe'22.pptx
Verilog Final Probe'22.pptx
 
Pointcuts and Analysis
Pointcuts and AnalysisPointcuts and Analysis
Pointcuts and Analysis
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
Applying Compiler Techniques to Iterate At Blazing Speed
Applying Compiler Techniques to Iterate At Blazing SpeedApplying Compiler Techniques to Iterate At Blazing Speed
Applying Compiler Techniques to Iterate At Blazing Speed
 

More from Daniel Lewis

Innovation Council - UK AI Awards - CIUUK14
Innovation Council - UK AI Awards - CIUUK14Innovation Council - UK AI Awards - CIUUK14
Innovation Council - UK AI Awards - CIUUK14
Daniel Lewis
 
Damien George - Micro Python - CIUUK14
Damien George - Micro Python - CIUUK14Damien George - Micro Python - CIUUK14
Damien George - Micro Python - CIUUK14
Daniel Lewis
 
Marcelo Funes-Gallanzi - Simplish - Computational intelligence unconference
Marcelo Funes-Gallanzi - Simplish - Computational intelligence unconferenceMarcelo Funes-Gallanzi - Simplish - Computational intelligence unconference
Marcelo Funes-Gallanzi - Simplish - Computational intelligence unconference
Daniel Lewis
 
Andrew Lewis - Project BEESWAX - CIUUK14
Andrew Lewis - Project BEESWAX - CIUUK14Andrew Lewis - Project BEESWAX - CIUUK14
Andrew Lewis - Project BEESWAX - CIUUK14
Daniel Lewis
 
Rohit Talwar - love in a cold climate - CIUUK14
Rohit Talwar -  love in a cold climate - CIUUK14Rohit Talwar -  love in a cold climate - CIUUK14
Rohit Talwar - love in a cold climate - CIUUK14
Daniel Lewis
 
CIUUK14 Closing Slides
CIUUK14 Closing SlidesCIUUK14 Closing Slides
CIUUK14 Closing Slides
Daniel Lewis
 
CIUUK14 Opening Slides
CIUUK14 Opening SlidesCIUUK14 Opening Slides
CIUUK14 Opening Slides
Daniel Lewis
 
On the fringes daniel lewis
On the fringes   daniel lewisOn the fringes   daniel lewis
On the fringes daniel lewis
Daniel Lewis
 

More from Daniel Lewis (8)

Innovation Council - UK AI Awards - CIUUK14
Innovation Council - UK AI Awards - CIUUK14Innovation Council - UK AI Awards - CIUUK14
Innovation Council - UK AI Awards - CIUUK14
 
Damien George - Micro Python - CIUUK14
Damien George - Micro Python - CIUUK14Damien George - Micro Python - CIUUK14
Damien George - Micro Python - CIUUK14
 
Marcelo Funes-Gallanzi - Simplish - Computational intelligence unconference
Marcelo Funes-Gallanzi - Simplish - Computational intelligence unconferenceMarcelo Funes-Gallanzi - Simplish - Computational intelligence unconference
Marcelo Funes-Gallanzi - Simplish - Computational intelligence unconference
 
Andrew Lewis - Project BEESWAX - CIUUK14
Andrew Lewis - Project BEESWAX - CIUUK14Andrew Lewis - Project BEESWAX - CIUUK14
Andrew Lewis - Project BEESWAX - CIUUK14
 
Rohit Talwar - love in a cold climate - CIUUK14
Rohit Talwar -  love in a cold climate - CIUUK14Rohit Talwar -  love in a cold climate - CIUUK14
Rohit Talwar - love in a cold climate - CIUUK14
 
CIUUK14 Closing Slides
CIUUK14 Closing SlidesCIUUK14 Closing Slides
CIUUK14 Closing Slides
 
CIUUK14 Opening Slides
CIUUK14 Opening SlidesCIUUK14 Opening Slides
CIUUK14 Opening Slides
 
On the fringes daniel lewis
On the fringes   daniel lewisOn the fringes   daniel lewis
On the fringes daniel lewis
 

Recently uploaded

PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 

Recently uploaded (20)

PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 

Piotr Mirowski - Review Autoencoders (Deep Learning) - CIUUK14

  • 1. Review of auto-encoders Piotr Mirowski, Microsoft Bing London (Dirk Gorissen) Computational Intelligence Unconference, 26 July 2014 Code Input Code prediction Code energy Decoding energy Input decoding Sparsity constraint X Z
  • 2. Outline • Deep learning concepts covered o Hierarchical representations o Sparse and/or distributed representations o Supervised vs. unsupervised learning • Auto-encoder o Architecture o Inference and learning o Sparse coding o Sparse auto-encoders • Illustration: handwritten digits o Stacking auto-encoders o Learning representations of digits o Impact on classification • Applications to text o Semantic hashing o Semi-supervised learning o Moving away from auto-encoders • Topics not covered in this talk
  • 3. Outline • Deep learning concepts covered o Hierarchical representations o Sparse and/or distributed representations o Supervised vs. unsupervised learning • Auto-encoder o Architecture o Inference and learning o Sparse coding o Sparse auto-encoders • Illustration: handwritten digits o Stacking auto-encoders o Learning representations of digits o Impact on classification • Applications to text o Semantic hashing o Semi-supervised learning o Moving away from auto-encoders • Topics not covered in this talk
  • 4. Hierarchical representations “Deep learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features. Automatically learning features at multiple levels of abstraction allows a system to learn complex functions mapping the input to the output directly from data, without depending completely on human-crafted features.” — Yoshua Bengio [Bengio, “On the expressive power of deep architectures”, Talk at ALT, 2011] [Bengio, Learning Deep Architectures for AI, 2009]
  • 5. Sparse and/or distributed representations Example on MNIST handwritten digits An image of size 28x28 pixels can be represented using a small combination of codes from a basis set. [Ranzato, Poultney, Chopra & LeCun, “Efficient Learning of Sparse Representations with an Energy-Based Model ”, NIPS, 2006; Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007] Biological motivation: V1 visual cortex
  • 6. Sparse and/or distributed representations Example on MNIST handwritten digits An image of size 28x28 pixels can be represented using a small combination of codes from a basis set. At the end of this talk, you should know how to learn that basis set and how to infer the codes, in a 2-layer auto-encoder architecture. Matlab/Octave code and the MNIST dataset will be provided. [Ranzato, Poultney, Chopra & LeCun, “Efficient Learning of Sparse Representations with an Energy-Based Model ”, NIPS, 2006; Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007] Biological motivation: V1 visual cortex (backup slides)
  • 7. Supervised learning X Y Ef Y, ˆY( ) Target Input Prediction Error ˆY = f X;W,b( )
  • 8. Supervised learning Y - ˆY 2 2 Target Input Prediction Error ˆY = WT X +b X Y
  • 12. Unsupervised learning Code Input Prediction(s) Error(s) We want the codes to represent the inputs in the dataset. The code should be a compact representation of the inputs: low-dimensional and/or sparse. X Z
  • 13. Examples of unsupervised learning • Linear decomposition of the inputs: o Principal Component Analysis and Singular Value Decomposition o Independent Component Analysis [Bell & Sejnowski, 1995] o Sparse coding [Olshausen & Field, 1997] o … • Fitting a distribution to the inputs: o Mixtures of Gaussians o Use of Expectation-Maximization algorithm [Dempster et al, 1977] o … • For text or discrete data: o Latent Semantic Indexing [Deerwester et al, 1990] o Probabilistic Latent Semantic Indexing [Hofmann et al, 1999] o Latent Dirichlet Allocation [Blei et al, 2003] o Semantic Hashing o …
  • 14. Objective of this tutorial Study a fundamental building block for deep learning, the auto-encoder
  • 15. Outline • Deep learning concepts covered o Hierarchical representations o Sparse and/or distributed representations o Supervised vs. unsupervised learning • Auto-encoder o Architecture o Inference and learning o Sparse coding o Sparse auto-encoders • Illustration: handwritten digits o Stacking auto-encoders o Learning representations of digits o Impact on classification • Applications to text o Semantic hashing o Semi-supervised learning o Moving away from auto-encoders • Topics not covered in this talk
  • 16. Auto-encoder Code Input X Z Target = input Y = X Code Input “Bottleneck” code i.e., low-dimensional, typically dense, distributed representation “Overcomplete” code i.e., high-dimensional, always sparse, distributed representation Target = input
  • 17. Auto-encoder Eg Z, ˆZ( ) Code Input Code prediction Encoding “energy” ˆZ = g X;C,bC( ) Eh X, ˆX( ) ˆX = h Z;D,bD( ) Decoding “energy” Input decoding X Z
  • 18. Auto-encoder Z - ˆZ 2 2 Code Input Code prediction Encoding energy X - ˆX 2 2 Decoding energy Input decoding X Z ˆZ = g X;C,bC( ) ˆX = h Z;D,bD( )
  • 19. Auto-encoder loss function L X(t),Z(t);W( )=a Z(t)-g X(t);C,bC( ) 2 2 + X(t)-h Z(t);D,bD( ) 2 2 L X,Z;W( )= a Z(t)- g X(t);C,bC( ) 2 2 t=1 T å + X(t)-h Z(t);D,bD( ) 2 2 t=1 T å Encoding energy Decoding energy Encoding energy Decoding energy For one sample t For all T samples How do we get the codes Z? coefficient of the encoder error We note W={C, bC, D, bD}
  • 20. Learning and inference in auto-encoders Z*= argmin Z L X,Z;W( ) W*= argmin W L X,Z;W( ) Learn the parameters (weights) W of the encoder and decoder given the current codes Z Infer the codes Z given the current model parameters W Relationship to Expectation-Maximization in graphical models (backup slides)
  • 21. Learning and inference: stochastic gradient descent Z* (t) = argmin Z(t) L X(t),Z(t);W( ) L X(t),Z(t);W* ( )< L X(t),Z(t);W( ) Take a gradient descent step on the parameters (weights) W of the encoder and decoder given the current codes Z Iterated gradient descent (?) on the code Z(t) given the current model parameters W Relationship to Generalized EM in graphical models (backup slides)
  • 22. Auto-encoder Z - ˆZ 2 2 Code Input Code prediction Encoding energy ˆZ = CT X +bC X - ˆX 2 2 A =s Z( ) = 1 1+e-Z ˆX = DT A+bD Decoding energy Input decoding X Z
  • 23. Auto-encoder: fprop Code Input X Z function [x_hat, a_hat] = … Module_Decode_FProp(model, z) % Apply the logistic to the code a_hat = 1 ./ (1 + exp(-z)); % Linear decoding x_hat = model.D * a_hat + model.bias_D; function z_hat = … Module_Encode_FProp(model, x, params) % Compute the linear encoding activation z_hat = model.C * x + model.bias_C; function e = Loss_Gaussian(z, z_hat) zDiff = z_hat - z; e = 0.5 * sum(zDiff.^2); function e = Loss_Gaussian(x, x_hat) xDiff = x_hat - x; e = 0.5 * sum(xDiff.^2);
  • 24. Auto-encoder backprop w.r.t. codes Code Input Code prediction Encoding energy X Z ¶L ¶ ˆX = ¶ ¶ ˆX X - ˆX 2 2 ( ) ¶A ¶Z = ¶s Z( ) ¶Z ¶ ˆX ¶A = ¶ ¶A DT A + bD( ) ¶Ldec ¶Z = ¶L ¶ ˆX ¶ ˆX ¶ ˆA ¶ ˆA ¶Z ¶Lenc ¶Z = ¶ ¶Z Z - ˆZ 2 2 ( ) [Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007]
  • 25. Auto-encoder: backprop w.r.t. codes Code Z function dL_dz = ... Module_Decode_BackProp_Codes(model, dL_dx, a_hat) % Gradient of the loss w.r.t. activations dL_da = model.D' * dL_dx; % Gradient of the loss w.r.t. latent codes % a_hat = 1 ./ (1 + exp(-z_hat)) dL_dz = dL_da .* a_hat .* (1 - a_hat); % Add the gradient w.r.t. % the encoder's outputs dL_dz = z_star - z_hat; % Gradient of the loss w.r.t. % the decoder prediction dL_dx_star = x_star - x; Input X [Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007]
  • 26. Code inference in the auto-encoder function [z_star, z_hat, loss_star, loss_hat] = Layer_Infer(model, x, params) % Encode the current input and initialize the latent code z_hat = Module_Encode_FProp(model, x, params); % Decode the current latent code [x_hat, a_hat] = Module_Decode_FProp(model, z_hat); % Compute the current loss term due to decoding (encoding loss is 0) loss_hat = Loss_Gaussian(x, x_hat); % Relaxation on the latent code: loop until convergence x_star = x_hat; a_star = a_hat; z_star = z_hat; loss_star = loss_hat; while (true) % Gradient of the loss function w.r.t. decoder prediction dL_dx_star = x_star - x; % Back-propagate the gradient of the loss onto the codes dL_dz = Module_Decode_BackProp_Codes(model, dL_dx_star, a_star, params); % Add the gradient w.r.t. the encoder's outputs dL_dz = dL_dz + params.alpha_c * (z_star - z_hat); % Perform one step of gradient descent on the codes z_star = z_star - params.eta_z * dL_dz; % Decode the current latent code [x_star, a_star] = Module_Decode_FProp(model, z_star); % Compute the current loss and convergence criteria loss_star = Loss_Gaussian(x, x_star) + ... params.alpha_c * Loss_Gaussian(z_star, z_hat); % Stopping criteria [...] end Code Input Code prediction Encoding energy X Z ¶L ¶ ˆX = ¶ ¶ ˆX X - ˆX 2 2 ( ) ¶A ¶Z = ¶s Z( ) ¶Z ¶ ˆX ¶A = ¶ ¶A DT A + bD( ) ¶Lenc ¶Z = ¶ ¶Z Z - ˆZ 2 2 ( ) [Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007]
  • 27. Auto-encoder: backprop w.r.t. weights (input X, optimal code Z*)
Decoder: $\frac{\partial L}{\partial X^*} = \frac{\partial}{\partial X^*} \|X - X^*\|_2^2$, $A^* = \sigma(Z^*)$, $\frac{\partial X^*}{\partial D} = \frac{\partial}{\partial D} (D^T A^* + b_D)$, so $\frac{\partial L}{\partial D} = \frac{\partial L}{\partial X^*} \frac{\partial X^*}{\partial D}$
Encoder: $\frac{\partial L}{\partial \hat{Z}} = \frac{\partial}{\partial \hat{Z}} \|\hat{Z} - Z^*\|_2^2$, $\frac{\partial \hat{Z}}{\partial C} = \frac{\partial}{\partial C} (C^T X + b_C)$, so $\frac{\partial L}{\partial C} = \frac{\partial L}{\partial \hat{Z}} \frac{\partial \hat{Z}}{\partial C}$
[Ranzato, Boureau & LeCun, "Sparse Feature Learning for Deep Belief Networks", NIPS, 2007]
  • 28. Auto-encoder: backprop w.r.t. weights (code Z, input X)

% Gradient of the loss w.r.t. the reconstruction
dL_dx_star = x_star - x;

function model = Module_Decode_BackProp_Weights(model, dL_dx_star, a_star, params)
% Jacobian of the loss w.r.t. decoder matrix
model.dL_dD = dL_dx_star * a_star';
% Gradient of the loss w.r.t. decoder bias
model.dL_dbias_D = dL_dx_star;

% Gradient of the loss w.r.t. the codes
dL_dz = z_hat - z_star;

function model = Module_Encode_BackProp_Weights(model, dL_dz, x, params)
% Jacobian of the loss w.r.t. encoder matrix
model.dL_dC = dL_dz * x';
% Gradient of the loss w.r.t. encoding bias
model.dL_dbias_C = dL_dz;

[Ranzato, Boureau & LeCun, "Sparse Feature Learning for Deep Belief Networks", NIPS, 2007]
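Putting these modules together, one SGD training step on a single sample x might look as follows. This is a minimal sketch; the weight learning rate params.eta_w is a hypothetical hyperparameter not defined on the slides.

% One training step on a sample x, combining the inference and backprop
% modules above (params.eta_w is a hypothetical weight learning rate)
[z_star, z_hat] = Layer_Infer(model, x, params);        % infer the optimal code
[x_star, a_star] = Module_Decode_FProp(model, z_star);  % decode it
dL_dx_star = x_star - x;                                % reconstruction gradient
dL_dz_hat = z_hat - z_star;                             % code prediction gradient
model = Module_Decode_BackProp_Weights(model, dL_dx_star, a_star, params);
model = Module_Encode_BackProp_Weights(model, dL_dz_hat, x, params);
% Gradient descent step on the weights
model.D = model.D - params.eta_w * model.dL_dD;
model.bias_D = model.bias_D - params.eta_w * model.dL_dbias_D;
model.C = model.C - params.eta_w * model.dL_dC;
model.bias_C = model.bias_C - params.eta_w * model.dL_dbias_C;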
  • 29. Usual tricks about classical SGD (see the sketch below)
• Regularization (L1-norm or L2-norm) of the parameters?
• Learning rate?
• Learning rate decay?
• Momentum term on the parameters?
• Choice of the learning hyperparameters
o Cross-validation?
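A minimal sketch of one such SGD update on the decoder matrix D, combining L2 weight decay, a momentum term and learning-rate decay. The hyperparameter names (params.eta_0, params.decay, params.lambda, params.momentum), the iteration counter t and the velocity buffer model.velocity_D (initialized to zeros) are all assumptions; the same update would apply to C and the biases.

% SGD step on D with L2 regularization, momentum and learning-rate decay
eta_t = params.eta_0 / (1 + params.decay * t);      % decayed learning rate
grad_D = model.dL_dD + params.lambda * model.D;     % add L2 weight decay
model.velocity_D = params.momentum * model.velocity_D - eta_t * grad_D;
model.D = model.D + model.velocity_D;               % momentum update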
  • 30. Sparse coding (input X, overcomplete code Z)
Input decoding: $\hat{X} = h(Z; W, b)$, with decoding error $E_h(X, \hat{X})$ and sparsity constraint $L_s(Z)$
$L(X, Z; W) = \|X - h(Z; W, b)\|_2^2 + \lambda L_s(Z)$
[Olshausen & Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?", Vision Research, 1997]
  • 31. Sparse coding (input X, overcomplete code Z)
Input decoding: $\hat{X} = W^T Z + b$, with decoding error $\|X - \hat{X}\|_2^2$ and sparsity constraint $\sum_{d=1}^{M} \log(1 + z_d^2)$
$L(X, Z; W) = \|X - (W^T Z + b)\|_2^2 + \lambda L_s(Z)$
[Olshausen & Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?", Vision Research, 1997]
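A minimal sketch of this loss and of the code gradient used for inference by relaxation, for a single sample x. The sparsity weight lambda is a hypothetical hyperparameter, and the 0.5 factor follows the Loss_Gaussian convention used earlier.

% Sparse coding loss for one sample x with overcomplete code z (M x 1),
% decoder matrix W, bias b and hypothetical sparsity weight lambda
x_hat = W' * z + b;                                  % linear input decoding
L = 0.5 * sum((x - x_hat).^2) + lambda * sum(log(1 + z.^2));
% Gradient of the loss w.r.t. the code (for inference by gradient descent)
dL_dz = W * (x_hat - x) + lambda * (2 * z) ./ (1 + z.^2);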
  • 32. Limitations of sparse coding
• At runtime, assuming a trained model W, inferring the code Z given an input sample X is expensive
• Need a tweak on the model weights W: normalize the columns of W to unit length after each learning step (see the sketch below)
• Otherwise, with $\hat{X} = W^T Z + b$, the code is pulled to 0 by the sparsity constraint while the weights go to infinity to compensate
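A minimal sketch of that normalization step, assuming the basis vectors are stored as the columns of W as stated on the slide:

% After each learning step, rescale every column of W to unit Euclidean
% length, so the sparsity penalty cannot be defeated by shrinking the
% codes while growing the weights
column_norms = sqrt(sum(W.^2, 1));
W = bsxfun(@rdivide, W, max(column_norms, eps));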
  • 33. Sparse auto-encoder (input X, code Z)
Code prediction: $\hat{Z} = g(X; W_g, b_g)$, with code error $E_g(Z, \hat{Z})$
Input decoding: $\hat{X} = h(Z; W_h, b_h)$, with decoding error $E_h(X, \hat{X})$ and sparsity constraint $L_s(Z)$
[Ranzato, Poultney, Chopra & LeCun, "Efficient Learning of Sparse Representations with an Energy-Based Model", NIPS, 2006; Ranzato, Boureau & LeCun, "Sparse Feature Learning for Deep Belief Networks", NIPS, 2007]
  • 34. Symmetric sparse auto-encoder (input X, code Z)
Code prediction: $\hat{Z} = W X + b_C$, with code error $\|Z - \hat{Z}\|_2^2$
Input decoding: $\hat{X} = W^T \sigma(Z) + b_D$, with decoding error $\|X - \hat{X}\|_2^2$ and sparsity constraint $\sum_{d=1}^{M} \log(1 + \sigma(z_d)^2)$
The encoder matrix W is symmetric (tied) to the decoder matrix $W^T$
[Ranzato, Poultney, Chopra & LeCun, "Efficient Learning of Sparse Representations with an Energy-Based Model", NIPS, 2006; Ranzato, Boureau & LeCun, "Sparse Feature Learning for Deep Belief Networks", NIPS, 2007]
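A minimal sketch of the corresponding energy for one (x, z) pair with tied weights W; the coefficient names alpha_c (code error) and lambda (sparsity), and the local bias variables b_C and b_D, are assumptions.

% Symmetric sparse auto-encoder energy of an (x, z) pair with tied weights W
z_hat = W * x + b_C;                            % code prediction
a = 1 ./ (1 + exp(-z));                         % logistic non-linearity
x_hat = W' * a + b_D;                           % input decoding
E_code = alpha_c * 0.5 * sum((z - z_hat).^2);   % code error
E_dec  = 0.5 * sum((x - x_hat).^2);             % decoding error
E_sp   = lambda * sum(log(1 + a.^2));           % sparsity constraint
E = E_code + E_dec + E_sp;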
  • 35. Predictive Sparse Decomposition (input X, code Z)
Code prediction: $\hat{Z} = g(X; C, b_C)$
Once the encoder g is properly trained, the code Z can be directly predicted from the input X, without running the iterative relaxation.
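At runtime, inference then reduces to a single forward pass; a minimal sketch reusing the encoder module from the earlier slides:

% Predictive Sparse Decomposition at runtime: the trained feed-forward
% encoder replaces the iterative relaxation on the code
z = Module_Encode_FProp(model, x, params);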
  • 36. Outline • Deep learning concepts covered o Hierarchical representations o Sparse and/or distributed representations o Supervised vs. unsupervised learning • Auto-encoder o Architecture o Inference and learning o Sparse coding o Sparse auto-encoders • Illustration: handwritten digits o Stacking auto-encoders o Learning representations of digits o Impact on classification • Applications to text o Semantic hashing o Semi-supervised learning o Moving away from auto-encoders • Topics not covered in this talk
  • 37. Stacking auto-encoders: two sparse auto-encoder layers are stacked, the codes Z of the first layer becoming the inputs X of the second layer; each layer has its own code prediction, code energy, input decoding, decoding energy and sparsity constraint.
[Ranzato, Boureau & LeCun, "Sparse Feature Learning for Deep Belief Networks", NIPS, 2007]
  • 38. MNIST handwritten digits
• Database of 70k handwritten digits
o Training set: 60k
o Test set: 10k
• 28 x 28 pixels
• Best performing classifiers:
o Linear classifier: 12% error
o Gaussian SVM: 1.4% error
o ConvNets: <1% error
[http://yann.lecun.com/exdb/mnist/]
  • 39. Stacked auto-encoders
Layer 1: matrix W1 of size 192 x 784, i.e. 192 sparse bases of 28 x 28 pixels
Layer 2: matrix W2 of size 10 x 192, i.e. 10 sparse bases of 192 units
[Ranzato, Boureau & LeCun, "Sparse Feature Learning for Deep Belief Networks", NIPS, 2007]
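A minimal sketch of the greedy layer-wise training implied here; Layer_Train (running the code inference and weight updates from the earlier slides over the whole dataset) and Layer_Encode (a single feed-forward encoding pass) are hypothetical helper functions.

% Greedy layer-wise training of the two-layer stack on MNIST
% (X is 784 x 60000; Layer_Train and Layer_Encode are hypothetical helpers)
model1 = Layer_Train(X, 192, params);   % layer 1: 784 pixels -> 192 codes
Z1 = Layer_Encode(model1, X);           % encode all training digits
model2 = Layer_Train(Z1, 10, params);   % layer 2: 192 units -> 10 codes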
  • 43. Training "converges" in one pass over the data (loss curves for Layer 1 and Layer 2)
  • 44. Outline • Deep learning concepts covered o Hierarchical representations o Sparse and/or distributed representations o Supervised vs. unsupervised learning • Auto-encoder o Architecture o Inference and learning o Sparse coding o Sparse auto-encoders • Illustration: handwritten digits o Stacking auto-encoders o Learning representations of digits o Impact on classification • Applications to text o Semantic hashing o Semi-supervised learning o Moving away from auto-encoders • Topics not covered in this talk
  • 45. Semantic Hashing: a deep auto-encoder with an encoder of layer sizes 2000-500-250-125-2 and a mirrored decoder of sizes 2-125-250-500-2000.
[Hinton & Salakhutdinov, "Reducing the dimensionality of data with neural networks", Science, 2006; Salakhutdinov & Hinton, "Semantic Hashing", Int J Approx Reason, 2007]
  • 46. Semi-supervised learning of auto-encoders (see the sketch below)
• Add a classifier module to the codes
• When an input X(t) has a label Y(t), back-propagate the prediction error on Y(t) to the code Z(t)
• Stack the encoders
• Train layer-wise
(Diagram: random-walk word histograms as inputs x(t), x(t+1); stacked auto-encoders g1/h1, g2/h2, g3/h3 with codes z(1)(t), z(2)(t), z(3)(t); document classifiers f1, f2, f3 predicting the labels y(t), y(t+1).)
[Ranzato & Szummer, "Semi-supervised learning of compact document representations with deep networks", ICML, 2008; Mirowski, Ranzato & LeCun, "Dynamic auto-encoders for semantic indexing", NIPS Deep Learning Workshop, 2010]
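A minimal sketch of how the supervised term could be added to the code gradient during inference, assuming a hypothetical softmax classifier (model.F, model.bias_F) on the code and a hypothetical weight params.alpha_y on the label loss:

% When document x(t) has a one-hot label y (K x 1), add the classifier's
% gradient to the code gradient dL_dz used during inference
scores = model.F * z_star + model.bias_F;
p = exp(scores - max(scores));
p = p / sum(p);                                          % softmax class posterior
dL_dz = dL_dz + params.alpha_y * (model.F' * (p - y));   % supervised term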
  • 47. Semi-supervised learning of auto-encoders
Performance on a document retrieval task: Reuters-21k dataset (9.6k training, 4k test documents), vocabulary of 2k words, 10-class classification.
Baselines: 2000-word TF-IDF + logistic regression, F1 = 0.83; 2000-word TF-IDF + SVM, F1 = 0.84.
Comparison with:
• unsupervised techniques (DBN: Semantic Hashing, LSA) + SVM
• traditional technique: word TF-IDF + SVM

K   | LSA + ICA | [Mirowski et al, 2010] no dynamics | [Mirowski et al, 2010] L1 dynamics | [Ranzato et al, 2008]
100 | 0.81      | 0.86                               | 0.85                               | 0.85
30  | 0.70      | 0.86                               | 0.85                               | 0.85
10  | 0.40      | 0.86                               | 0.85                               | 0.84
2   | 0.09      | 0.54                               | 0.51                               | 0.19

[Ranzato & Szummer, "Semi-supervised learning of compact document representations with deep networks", ICML, 2008; Mirowski, Ranzato & LeCun, "Dynamic auto-encoders for semantic indexing", NIPS Deep Learning Workshop, 2010]
  • 48. Beyond auto-encoders for web search (MSR): Deep Structured Semantic Model
(Diagram: a query s: "racing car" and candidate documents t1: "formula one", t2: "ford model t" are each mapped from an input word/phrase, represented as a bag-of-words vector (dim = 5M), through a fixed letter-tri-gram coefficient matrix to a letter-tri-gram embedding (dim = 50K), then through learned layers (d = 500, 500) to a semantic vector (d = 300), using weight matrices W1-W4. Relevance is the cosine similarity cos(s, t1), cos(s, t2) between the query and document semantic vectors.)
[Huang, He, Gao, Deng et al, "Learning Deep Structured Semantic Models for Web Search using Clickthrough Data", CIKM, 2013]
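The relevance score itself is just a cosine similarity between the two semantic vectors; a minimal sketch, where y_s and y_t stand for the 300-dimensional network outputs for the query and a document:

% Cosine similarity between query and document semantic vectors
cosine_sim = (y_s' * y_t) / (norm(y_s) * norm(y_t) + eps);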
  • 49. Beyond auto-encoders for web search (MSR)
Results on a web ranking task (16k queries), measured by normalized discounted cumulative gain: the Deep Structured Semantic Model [Huang, He, Gao et al, 2013] is compared with Semantic Hashing [Salakhutdinov & Hinton, 2007].
[Huang, He, Gao, Deng et al, "Learning Deep Structured Semantic Models for Web Search using Clickthrough Data", CIKM, 2013]
  • 50. Outline • Deep learning concepts covered o Hierarchical representations o Sparse and/or distributed representations o Supervised vs. unsupervised learning • Auto-encoder o Architecture o Inference and learning o Sparse coding o Sparse auto-encoders • Illustration: handwritten digits o Stacking auto-encoders o Learning representations of digits o Impact on classification • Applications to text o Semantic hashing o Semi-supervised learning o Moving away from auto-encoders • Topics not covered in this talk
  • 51. Topics not covered in this talk
• Other variations of auto-encoders
o Restricted Boltzmann Machines (work in Geoff Hinton's lab)
o Denoising Auto-Encoders (work in Yoshua Bengio's lab)
• Invariance to shifts in input and feature space
o Convolutional kernels
o Sliding windows over input
o Max-pooling over codes
[LeCun, Bottou, Bengio & Haffner, "Gradient-based learning applied to document recognition", Proceedings of the IEEE, 1998; Le, Ranzato et al., "Building high-level features using large-scale unsupervised learning", ICML, 2012; Sermanet et al., "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks", ICLR, 2014]
  • 52. Thank you! • Tutorial code: https://github.com/piotrmirowski http://piotrmirowski.wordpress.com • Contact: piotr.mirowski@computer.org • Acknowledgements: Marc’Aurelio Ranzato (FB) Yann LeCun (FB/NYU)
  • 53. Auto-encoders and Expectation-Maximization
Energy of inputs and codes:
$E(X, Z; W) = \alpha \sum_{t=1}^{T} \|Z_t - g(X_t; C, b_C)\|_2^2 + \sum_{t=1}^{T} \|X_t - h(Z_t; D, b_D)\|_2^2$
Input data likelihood:
$P(X \mid W) = \int_Z P(X, Z \mid W) = \frac{\int_Z e^{-\beta E(X, Z; W)}}{\int_{Y, Z} e^{-\beta E(Y, Z; W)}}$
$-\log P(X \mid W) = -\frac{1}{\beta} \log \int_Z e^{-\beta E(X, Z; W)} + \frac{1}{\beta} \log \int_{Y, Z} e^{-\beta E(Y, Z; W)}$
Maximum A Posteriori: do not marginalize over the codes; take the minimal-energy (maximum-likelihood) latent code Z* instead:
$-\log P(X \mid W) = E(X, Z^*; W) + \frac{1}{\beta} \log \int_Y e^{-\beta E(Y, Z^*; W)}$
Enforce sparsity on Z to constrain Z and avoid computing the partition function.
  • 54. Stochastic gradient descent [LeCun et al, "Efficient BackProp", Neural Networks: Tricks of the Trade, 1998; Bottou, "Stochastic Learning", Slides from a talk in Tubingen, 2003]
  • 56. Dimensionality reduction and invariant mapping
Two inputs X1 and X2 are mapped by the same network to codes $Z_1 = f(X_1; W, b)$ and $Z_2 = f(X_2; W, b)$, compared by an energy $E_f(Z_1, Z_2)$: similarly labelled samples should map to nearby codes, dissimilar samples to distant codes.
(Figure from the paper: a spring system in which points similar to the point in the center attract it and dissimilar points repel it; minimizing the global loss over all springs drives the system to its equilibrium state.)
The algorithm first generates the training set, then trains the machine.
Step 1: for each input sample X_i:
(a) using prior knowledge, find the set S_{X_i} = {X_j} of samples deemed similar to X_i;
(b) pair X_i with all the other training samples and label the pairs: Y_ij = 0 if X_j ∈ S_{X_i}, Y_ij = 1 otherwise. Combine all the pairs to form the labelled training set.
Step 2: repeat until convergence; for each pair (X_i, X_j) in the training set:
(i) if Y_ij = 0, update W to decrease $D_W = \|G_W(X_i) - G_W(X_j)\|_2$;
(ii) if Y_ij = 1, update W to increase $D_W = \|G_W(X_i) - G_W(X_j)\|_2$.
[Hadsell, Chopra & LeCun, "Dimensionality Reduction by Learning an Invariant Mapping", CVPR, 2006]
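A minimal sketch of a margin-based contrastive loss of this kind on a single pair; the margin m and the variable names are assumptions, following the decrease/increase recipe above rather than the paper's exact formulation.

% Contrastive loss on a pair of codes z1, z2 with pair label y
% (y = 0: similar, y = 1: dissimilar) and margin m: similar pairs are
% pulled together, dissimilar pairs pushed apart up to the margin
d = norm(z1 - z2);
loss = (1 - y) * 0.5 * d^2 + y * 0.5 * max(0, m - d)^2;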