Review_Cibe Sridharan

Deep Learning: The Dream of Artificial Intelligence
Cibe Sridharan Kumaran (11pt08)
cibesridharan94@gmail.com
Indian Institute of Technology, Madras
Cibe Sridharan (PSG Tech), August 8, 2014

Thanks to
Dr R.Nadarajan
Professor and head
Department of Applied Mathematics and Computational Sciences
PSG College of Technology
Dr R.Anitha
Programme Coordinator
Associate Professor

Thanks to
Mr N.Mohan Raj
Tutor
Associate Professor
Ms B.Malar
Internal Guide
Assistant Professor (Senior Grade)

Thanks to
Dr B.Ravindran
External Guide
Associate Professor
Department of Computer Science and Engineering
Indian Institute of Technology Madras

Agenda
Motivation
Learning Methods
BackPropagation
Autoencoders
Restricted Boltzmann Machines
Deep learning Recipe
Problem
Implementation
Future Works
References

AI’s Dream
Startling gains in fields as diverse as computer vision, speech
recognition and the identification of promising new molecules for
designing drugs.
The Deep learning movement seeks to meld computer science
with Neuroscience — something that never quite happened in the
world of Artificial Intelligence.
This remarkable machine is capable of what amounts to thought.
Instead of doing AI, we ended up spending our lives doing curve
fitting.

Motivation
Human intelligence has long dominated machine learning performance.
Machine learning is finally catching up to the dream of AI.

Learning Methods
Supervised Learning:
Supervised learning is when the data you feed your algorithm is
“tagged” (labeled) to help your logic make decisions.
Example: training a neural network is very often
supervised learning: you tell the network which class
the feature vector you are feeding it belongs to.
Unsupervised Learning:
The algorithm decides how to group samples into classes that
share common properties.
Example: K-means Clustering.
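As a rough illustration of the unsupervised case, here is a minimal K-means sketch in NumPy. The function name `kmeans`, the caller-supplied initial centers, and the fixed iteration count are assumptions made for this example, not part of the original slides.

```python
import numpy as np

def kmeans(X, init_centers, n_iter=20):
    """Plain Lloyd iterations: assign points to the nearest center, then recompute centers."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:          # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Two well-separated blobs; no labels are given to the algorithm.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (20, 2)), rng.normal(10.0, 0.2, (20, 2))])
labels, centers = kmeans(X, init_centers=[X[0], X[20]])
```

The algorithm only sees the samples `X`; the grouping into two classes emerges from the distances alone.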

BackPropagation
Backpropagation (BPN) is based on gradient descent learning: it
minimizes the error E with respect to the weights, through the
activation function. $\Delta W_{ij} = -\eta \, \partial E / \partial W_{ij}$
It is usually considered a supervised learning method,
although it is also used in some unsupervised networks such as
autoencoders.
It requires the activation function to be differentiable.

Idea-BPN
Compute the error term for the output units using the observed
error.
Starting from the output layer, repeatedly propagate the error term back to the
previous layer and update the weights between the two layers,
until the earliest hidden layer is reached.

Gradient Computation
Initialize weights (typically randomly).
Repeat for each epoch:
  For each example e in the training set:
    • Forward pass:
      – O = neural-net-output(network, e)
      – miss = (T − O) at each output unit
    • Backward pass to calculate the weight deltas.
    • Update all weights.
  end
until the tuning-set error stops improving.
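The loop above can be sketched in NumPy for a network with one sigmoid hidden layer and squared error. This is a minimal sketch under stated assumptions: the XOR data, the layer sizes, the learning rate, and the helper names (`forward`, `loss`, `backprop`, `train`) are all illustrative, and a fixed number of epochs stands in for the tuning-set stopping rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    h = sigmoid(x @ W1 + b1)          # hidden activations
    o = sigmoid(h @ W2 + b2)          # network output O
    return h, o

def loss(params, x, t):
    _, o = forward(params, x)
    return 0.5 * np.sum((t - o) ** 2)

def backprop(params, x, t):
    """Return dE/dparam for one example, for E = 0.5 * sum((T - O)^2)."""
    W1, b1, W2, b2 = params
    h, o = forward(params, x)
    miss = t - o                                  # (T - O) at each output unit
    delta_o = miss * o * (1.0 - o)                # output error term
    delta_h = (W2 @ delta_o) * h * (1.0 - h)      # error term propagated back
    return [-np.outer(x, delta_h), -delta_h, -np.outer(h, delta_o), -delta_o]

def train(X, T, hidden=4, eta=0.5, epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    params = [rng.normal(0.0, 1.0, (X.shape[1], hidden)), np.zeros(hidden),
              rng.normal(0.0, 1.0, (hidden, T.shape[1])), np.zeros(T.shape[1])]
    history = []
    for _ in range(epochs):                       # fixed epoch count (assumption)
        for x, t in zip(X, T):
            for p, g in zip(params, backprop(params, x, t)):
                p -= eta * g                      # gradient descent step, in place
        history.append(sum(loss(params, x, t) for x, t in zip(X, T)))
    return params, history

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
params, history = train(X, T)
```

`history` holds the total training error after each epoch; in a real setting a held-out tuning set would decide when to stop.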

Flaws in BackPropagation Algorithm
Requires labeled data, and most data is unlabeled.
Vanishing Gradient Problem.
Overfitting (high variance, low bias).
Easy to get stuck in poor local optima.
Gets worse as we add more hidden layers.

Autoencoders
An autoencoder is another technique for dimensionality reduction [3]: the
output of the encoder is the reduced representation, and
the decoder is tuned to reconstruct the initial input from the
encoder’s representation by minimizing a cost
function.

Autoencoders (contd.)
Encoder
The encoder is a function f that maps an input $x \in \mathbb{R}^{d_x}$ to a hidden
representation $h(x) \in \mathbb{R}^{d_h}$: $h = f(x) = s_f(W x + b_h)$
Decoder
The decoder function g maps the hidden representation h back to a
reconstruction y: $y = g(h) = s_g(W' h + b_y)$, where $W'$ is the decoder weight matrix (with tied weights, $W' = W^\top$).
Objective Function
The cross-entropy loss, used when $s_g$ is the sigmoid and the inputs are in [0, 1]:
$L(x, y) = -\sum_{i=1}^{d_x} \left[ x_i \log(y_i) + (1 - x_i) \log(1 - y_i) \right]$
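The encoder, decoder and cross-entropy loss can be written out directly in NumPy. This is a sketch only: the dimensions, the random initialization, the tied decoder weights, and the function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b_h):
    """h = s_f(W x + b_h), with s_f the sigmoid."""
    return sigmoid(W @ x + b_h)

def decode(h, W_dec, b_y):
    """y = s_g(W_dec h + b_y), with s_g the sigmoid."""
    return sigmoid(W_dec @ h + b_y)

def cross_entropy(x, y, eps=1e-12):
    """L(x, y) = -sum_i [x_i log y_i + (1 - x_i) log(1 - y_i)]."""
    y = np.clip(y, eps, 1.0 - eps)    # avoid log(0)
    return -np.sum(x * np.log(y) + (1.0 - x) * np.log(1.0 - y))

rng = np.random.default_rng(0)
d_x, d_h = 6, 3                        # undercomplete: d_h < d_x
W = rng.normal(0.0, 0.1, (d_h, d_x))
b_h, b_y = np.zeros(d_h), np.zeros(d_x)

x = rng.random(d_x)                    # inputs in [0, 1)
h = encode(x, W, b_h)
y = decode(h, W.T, b_y)                # tied weights: decoder uses W transposed
loss_value = cross_entropy(x, y)
```

Training would adjust `W`, `b_h` and `b_y` to lower `loss_value` over the whole data set.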

More on Hidden Layers
1 The hidden layer is said to be undercomplete if it is smaller than the
input layer. The hidden layer must then compress the input information into its units.
2 The hidden layer is said to be overcomplete if it is larger than the
input layer. No compression is needed in the hidden layer: each hidden unit
can simply copy a different input component.

Denoising Autoencoders
One simply corrupts the input x before sending it through the
autoencoder, which is trained to reconstruct the clean version [5].
Use additive Gaussian noise.
The loss function compares the reconstructed output with the
noiseless input, not the corrupted input.
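A short sketch of the denoising criterion follows. The names `corrupt` and `denoising_loss`, the noise level `sigma`, and the use of squared error (chosen here because Gaussian noise can push inputs outside [0, 1]) are assumptions for illustration; `reconstruct` stands for any autoencoder mapping.

```python
import numpy as np

def corrupt(x, sigma, rng):
    """Additive Gaussian noise with standard deviation sigma."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def denoising_loss(x_clean, reconstruct, sigma, rng):
    """Feed the corrupted input through the model, compare against the clean input."""
    x_noisy = corrupt(x_clean, sigma, rng)
    y = reconstruct(x_noisy)
    return np.sum((x_clean - y) ** 2)   # target is x_clean, not x_noisy
```

For example, with `reconstruct = lambda z: z` (an identity "autoencoder") the loss equals the squared norm of the noise, so copying the input is no longer a perfect solution.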

Contractive Autoencoders
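The contractive auto-encoder (CAE) is obtained by adding a
regularization term to the reconstruction loss, yielding the objective function:
$J(\theta) = \sum_{x \in D_n} \left( L(x, g(f(x))) + \lambda \, \| J_f(x) \|_F^2 \right)$
The explicit extra term penalizes the sensitivity of the hidden representation to the input.
We wish to extract only features that reflect variations in the
training set.
The penalty is the squared Frobenius norm of the encoder's Jacobian:
$\| J_f(x) \|_F^2 = \sum_{i,j} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2$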
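For a sigmoid encoder $h = \mathrm{sigmoid}(W x + b_h)$ the Jacobian entries are $\partial h_j / \partial x_i = h_j (1 - h_j) W_{ji}$, so the penalty has a closed form. The sketch below assumes that encoder; the function names are illustrative, and the finite-difference version is included only as a cross-check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b_h):
    """||J_f(x)||_F^2 for h = sigmoid(W x + b_h), in closed form."""
    h = sigmoid(W @ x + b_h)
    # sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

def numerical_penalty(x, W, b_h, eps=1e-6):
    """Same quantity from a central-difference Jacobian (for checking only)."""
    J = np.zeros((W.shape[0], x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (sigmoid(W @ (x + dx) + b_h) - sigmoid(W @ (x - dx) + b_h)) / (2 * eps)
    return np.sum(J ** 2)
```

The full CAE objective would add `lam * contractive_penalty(x, W, b_h)` to the reconstruction loss of each training example, with `lam` playing the role of λ.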

Restricted Boltzmann Machines
An RBM is an energy-based and undirected graphical model for
unsupervised learning.
It consists of two layers of binary units: the visible layer, which represents
the data, and the hidden layer, which increases learning capacity [4].
$E(v, h) = -\sum_{i,j} v_i h_j w_{ij} - \sum_i v_i b_i - \sum_j h_j b_j$
Inputs are binary, as with the autoencoders.
There are no lateral connections within a layer, so the graph is bipartite.

Formulation
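1 The probability distribution: $P(v, h) = e^{-E(v,h)} / Z$
Z: normalizing constant (partition function), $Z = \sum_{v,h} e^{-E(v,h)}$
2 The objective function: $\phi = \log P(v) = \phi^+ - \phi^-$
3 Positive term: $\phi^+ = \log \sum_h e^{-E(v,h)}$
4 Negative term: $\phi^- = \log \sum_{v,h} e^{-E(v,h)} = \log Z$
$\partial \phi^+ / \partial w_{ij} = v_i P(h_j = 1 \mid v)$
$\partial \phi^- / \partial w_{ij} = \partial \log Z / \partial w_{ij} = P(v_i = 1, h_j = 1)$
The negative gradient is intractable.
5 The joint distribution is intractable because of the normalizing
constant Z, so we work with the conditional distributions instead, which factorize.
$p(h \mid x) = \prod_j p(h_j \mid x)$, where all $h_j$ are binary.
$p(h_j = 1 \mid x) = \mathrm{sigmoid}(b_j + W_j x)$, where x is the visible vector v and $W_j$ collects the weights $w_{ij}$ into hidden unit j.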
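On a very small RBM the factorized conditional can be checked against brute-force enumeration of the energy. The sketch below assumes a weight matrix `W` of shape (hidden, visible) and separate bias vectors `b_vis` and `b_hid`; these names and the tiny sizes are illustrative choices, and enumeration is only feasible because the model is tiny.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def energy(v, h, W, b_vis, b_hid):
    """E(v, h) = -sum_ij v_i h_j w_ij - sum_i v_i b_i - sum_j h_j b_j."""
    return -(h @ W @ v) - b_vis @ v - b_hid @ h

def all_binary(n):
    return [np.array(bits, dtype=float) for bits in itertools.product([0, 1], repeat=n)]

def partition_function(W, b_vis, b_hid):
    """Z by summing over all 2^(n_vis + n_hid) states; intractable for real sizes."""
    return sum(np.exp(-energy(v, h, W, b_vis, b_hid))
               for v in all_binary(W.shape[1]) for h in all_binary(W.shape[0]))

def hidden_conditional(v, W, b_hid):
    """p(h_j = 1 | v) = sigmoid(b_j + W_j v), computed without Z."""
    return sigmoid(b_hid + W @ v)

def conditional_by_enumeration(v, W, b_vis, b_hid):
    """The same probabilities, obtained by normalizing e^{-E(v, h)} over all h."""
    hs = all_binary(W.shape[0])
    weights = np.array([np.exp(-energy(v, h, W, b_vis, b_hid)) for h in hs])
    probs = weights / weights.sum()
    return sum(p * h for p, h in zip(probs, hs))
```

Both routes give the same `p(h_j = 1 | v)`, but only `hidden_conditional` avoids any sum over states, which is why the conditionals are used in practice.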

Contrastive Divergence Algorithm
Idea: the model is undirected, so every unit influences all of its neighbors.
Replace the expectation by a point estimate at a sample $\tilde{x}$.
Obtain the sample $\tilde{x}$ by Gibbs sampling [2].
Start the sampling chain at the training example $x^{(t)}$.
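A single contrastive divergence step with one round of Gibbs sampling (often called CD-1) might look like the sketch below. The shapes, the learning rate, and the use of hidden probabilities rather than hidden samples in the update are assumptions for illustration; this is not presented as the exact procedure used in the project.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def cd1_update(v0, W, b_vis, b_hid, rng, lr=0.1):
    """One CD-1 step for a binary RBM; W has shape (n_hidden, n_visible)."""
    # Positive phase: start the chain at the training example v0.
    ph0 = sigmoid(b_hid + W @ v0)
    h0 = sample_bernoulli(ph0, rng)
    # One Gibbs step gives the point estimate v1 that replaces the expectation.
    pv1 = sigmoid(b_vis + W.T @ h0)
    v1 = sample_bernoulli(pv1, rng)
    ph1 = sigmoid(b_hid + W @ v1)
    # Positive statistics minus negative statistics.
    W_new = W + lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
    b_vis_new = b_vis + lr * (v0 - v1)
    b_hid_new = b_hid + lr * (ph0 - ph1)
    return W_new, b_vis_new, b_hid_new
```

Running more Gibbs steps before taking the negative statistics gives CD-k, at a higher cost per update.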

Relaxing the Constraints
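If the two constraints (binary inputs, no lateral connections) are relaxed, then:
The inputs x (the visible vector v) are unbounded reals.
Add a quadratic term to the energy function.
$E(v, h) = -\sum_{i,j} v_i h_j w_{ij} - \sum_i v_i b_i - \sum_j h_j b_j + \frac{1}{2} v^\top v$
The conditional distribution p(x | h) is then Gaussian with mean $\mu = c + W^\top h$ (c being the visible bias vector) and identity
covariance matrix.
If the layers also have lateral connections, then:
$E(v, h) = -\sum_{i,j} v_i h_j w_{ij} - \sum_i v_i b_i - \sum_j h_j b_j - \frac{1}{2} v^\top V v - \frac{1}{2} h^\top U h$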

Deep Learning era
Figure: Visual Cortex
Learning models with multilayer representations.
Each layer corresponds to a distributed representation.
The units in a layer are not mutually exclusive.

Recipe
The models in these neural networks are formed of two parts.
The first part is a trained feature extraction section, consisting of
successive layers of units that process the data set inputs.
Each layer can be pre-trained without supervision, as an
autoencoder or an RBM.
The second part is a classifier, which is trained in a supervised
way.

Greedy layerwise procedure
Geoff Hinton: “If you want to do computer vision, first learn computer
graphics.”
Unsupervised pretraining
Train one layer at a time using the unsupervised criterion. Fix the
parameters of the previous hidden layers. The previous layers are
viewed as feature extractors [1].
Supervised fine-tuning
Add an output layer. Train the neural network using supervised
learning (backpropagation). All parameters are tuned together until
training stops.

Greedy Layerwise Algorithm
Pseudocode
1 for l = 1, ..., L (steps 2 and 3 are inside the loop):
  build the unsupervised training set (with $h^{(0)}(x) = x$):
  $D \leftarrow \{ h^{(l-1)}(x^{(t)}) \}_{t=1}^{T}$
2 train the greedy module (autoencoder, RBM) on D
3 use the hidden layer weights and biases of the greedy module to initialize
the deep network parameters $W^{(l)}, b^{(l)}$
4 Initialize the output layer parameters $W^{(L+1)}, b^{(L+1)}$ randomly.
5 Train the whole network by supervised stochastic gradient
descent (backprop).
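Steps 1 to 4 of the pseudocode can be sketched with tied-weight sigmoid autoencoders as the greedy module. Everything specific here is an assumption for illustration: the helper names, the batch gradient descent settings, the layer sizes, and the random data. Step 5 (supervised fine-tuning by backprop) is left out of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(D, n_hidden, rng, lr=0.5, epochs=50):
    """Tied-weight sigmoid autoencoder, batch gradient descent on mean cross-entropy."""
    n, d = D.shape
    W = rng.normal(0.0, 0.1, (n_hidden, d))
    b_h, b_y = np.zeros(n_hidden), np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(D @ W.T + b_h)            # encoder output, shape (n, n_hidden)
        Y = sigmoid(H @ W + b_y)              # reconstruction, shape (n, d)
        dY = (Y - D) / n                      # gradient at decoder pre-activation
        dH = (dY @ W.T) * H * (1.0 - H)       # gradient at encoder pre-activation
        W -= lr * (H.T @ dY + dH.T @ D)       # decoder and encoder share W
        b_y -= lr * dY.sum(axis=0)
        b_h -= lr * dH.sum(axis=0)
    return W, b_h

def greedy_pretrain(X, layer_sizes, rng):
    """Train one layer at a time; earlier layers stay fixed and act as feature extractors."""
    params, D = [], X                         # h^(0)(x) = x
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(D, n_hidden, rng)
        params.append((W, b))                 # initializes W^(l), b^(l)
        D = sigmoid(D @ W.T + b)              # training set for the next layer
    return params, D

def init_output_layer(n_in, n_out, rng):
    """Random initialization of W^(L+1), b^(L+1)."""
    return rng.normal(0.0, 0.1, (n_out, n_in)), np.zeros(n_out)

rng = np.random.default_rng(0)
X = rng.random((20, 8))                       # 20 unlabeled examples in [0, 1)
params, top = greedy_pretrain(X, [5, 3], rng)
W_out, b_out = init_output_layer(3, 2, rng)
```

After this, `params` plus `(W_out, b_out)` would serve as the starting point for supervised training of the whole network.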

Dropout Approach
Overfitting is greatly reduced by randomly omitting half of
the feature detectors on each training case.
For each training case, each hidden unit is randomly omitted from
the network with a probability of 0.5, so a hidden unit cannot rely
on other hidden units being present.
Instead, each neuron learns to detect a feature that is generally
helpful for producing the correct answer given the combinatorially
large variety of internal contexts in which it must operate.

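$h^{(k)}(x) \leftarrow g^{(k)}(x) \odot m^{(k)}$, where $m^{(k)}$ is a random binary mask for layer k (each entry is 0 with probability 0.5) and $\odot$ is the elementwise product.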
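The masking step is short enough to write out. In this sketch the function name `dropout_forward` and the argument `p_drop` are illustrative; the mask is resampled for every training case.

```python
import numpy as np

def dropout_forward(activations, rng, p_drop=0.5):
    """Multiply a layer's activations by a fresh random binary mask."""
    # Each unit is kept with probability 1 - p_drop and zeroed otherwise.
    mask = (rng.random(activations.shape) >= p_drop).astype(activations.dtype)
    return activations * mask, mask

rng = np.random.default_rng(0)
layer_output = np.ones(10000)
dropped, mask = dropout_forward(layer_output, rng)
```

At test time no units are dropped; the usual compensation is to scale the outgoing weights (or activations) by the keep probability.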

Pros and Cons
Adds a pretraining phase that learns the structure of the input data.
The pretraining phase requires no labeled data.
Less likely to get stuck in poor local optima (pretraining acts as a regularizer).
A greedy layer-wise algorithm makes this efficient and fast.
Most of the learning in deep architectures is just some form of
gradient descent.
The convergence rate of contrastive divergence is not clear.
Deep learning methods are often treated as a black box, with
most confirmations done empirically rather than theoretically.

Implementation
Implementation in Python with the Theano deep learning library
Matlab
MNIST dataset.

MNIST Recognition
Reconstruction is good when the hidden layers are large, but not
when they are small.

Feature Separation
Given two independent feature vectors, can we learn a
shared representation from them?
Given one feature vector, the model should be able to retrieve the other.
This problem can be modelled as a cocktail party problem.

Independent Component Analysis
It depends:
Cotraining
Nongaussian is independent
Kurtosis
Preprocessing in ICA.
Steps
Drawbacks

Future Works
KLA Tencor dataset (defect classification).
Using deep nets for feature selection.
Finding the classes without knowledge of the labels.
Building a neural network for speech recognition (deep ICA).

Timeline
1 May 14th - June 14th: Deep learning intro, autoencoders
2 June 14th - July 14th: Restricted Boltzmann Machines, deep belief
nets
3 June 15th - July 27th: Python-Theano
4 July 28th - August 8th: Independent Component Analysis

References
[1] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine
Manzagol, Pascal Vincent, and Samy Bengio.
Why does unsupervised pre-training help deep learning?
The Journal of Machine Learning Research, 11:625–660, 2010.
[2] Geoffrey E. Hinton.
Training products of experts by minimizing contrastive divergence.
Neural Computation, 14(8):1771–1800, 2002.
[3] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and
Yoshua Bengio.
Contractive auto-encoders: Explicit invariance during feature
extraction.
pages 833–840, 2011.
[4] Tijmen Tieleman.
Training restricted Boltzmann machines using approximations to
the likelihood gradient.
pages 1064–1071, 2008.
