1. Deep Learning: The Dream of Artificial Intelligence
Cibe Sridharan Kumaran (11pt08)
cibesridharan94@gmail.com
Indian Institute of Technology, Madras
Cibe Sridharan (PSG Tech) August 8,2014 1 / 34
2. Thanks to
Dr R.Nadarajan
Professor and head
Department of Applied Mathematics and Computational Sciences
PSG College of Technology
Dr R.Anitha
Programme Coordinator
Associate Professor
Department of Applied Mathematics and Computational Sciences
PSG College of Technology
3. Thanks to
Mr N.Mohan Raj
Tutor
Associate Professor
Department of Applied Mathematics and Computational Sciences
PSG College of Technology
Ms B.Malar
Internal Guide
Assistant Professor (Senior Grade)
Department of Applied Mathematics and Computational Sciences
PSG College of Technology
4. Thanks to
Dr B.Ravindran
External Guide
Associate Professor
Department of Computer Science and Engineering
Indian Institute of Technology Madras
6. AI’s Dream
Deep learning has produced startling gains in fields as diverse as
computer vision, speech recognition, and the identification of
promising new molecules for drug design.
The deep learning movement seeks to meld computer science
with neuroscience, something that never quite happened in the
world of artificial intelligence.
This remarkable machine is capable of what amounts to thought.
Instead of doing AI, we ended up spending our lives doing curve
fitting.
7. Motivation
Human intelligence has long outperformed machine learning.
Finally Machine Learning is catching up to the dream of AI.
8. Learning Methods
Supervised Learning:
In supervised learning, the data you feed your algorithm is
"tagged" (labeled) to help it make decisions.
Example: training a neural network is very often supervised
learning: you tell the network which class corresponds to each
feature vector you feed it.
Unsupervised Learning:
The algorithm decides how to group samples into classes that
share common properties.
Example: K-means Clustering.
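As a small illustration of the K-means example, here is a minimal sketch of Lloyd's algorithm in plain Python. The function name `kmeans`, its parameters, and the toy points are assumptions made for this example, not part of the original slides.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, clusters

# Two well-separated groups of unlabeled points (made-up data).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, clusters = kmeans(points, k=2)
```

No labels are supplied anywhere: the grouping comes only from distances between samples.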
9. BackPropagation
BPN is based on gradient descent learning, that is, the
minimization of the error E with respect to the weights, for a
given activation function: $\Delta W_{ij} = -\eta \, \partial E / \partial W_{ij}$
It is usually considered to be a supervised learning method,
although it is also used in some unsupervised networks such as
autoencoders.
It requires the activation function to be differentiable.
10. Idea-BPN
Compute the error term for the output units using the observed
error.
Starting from the output layer, repeatedly propagate the error term
back to the previous layer and update the weights between the two
layers, until the earliest hidden layer is reached.
11. Gradient Computation
Initialize weights (typically random)
repeat (one epoch per iteration)
    for each example e in training set do
        forward pass to compute
            O = neural-net-output(network, e)
            miss = (T - O) at each output unit
        backward pass to calculate deltas to weights
        update all weights
    end
until tuning set error stops improving
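The loop above can be sketched in plain Python for a network with one hidden layer and a single sigmoid output. This is an illustrative sketch under assumptions: the names `train`, `forward`, and `sq_error`, the layer size, the learning rate, and the OR toy data are all invented here, and a fixed number of epochs stands in for the "until tuning set error stops improving" test.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(net, x):
    """Return the hidden activations and the single output for input x."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(net["W1"], net["b1"])]
    o = sigmoid(sum(w * hj for w, hj in zip(net["W2"], h)) + net["b2"])
    return h, o

def train(data, n_hidden=3, eta=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    net = {  # initialize weights (small random values)
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }
    for _ in range(epochs):  # assumption: fixed epoch count instead of early stopping
        for x, t in data:
            h, o = forward(net, x)                       # forward pass
            delta_o = (t - o) * o * (1.0 - o)            # error term at the output unit
            delta_h = [delta_o * net["W2"][j] * h[j] * (1.0 - h[j])
                       for j in range(n_hidden)]         # error propagated backward
            for j in range(n_hidden):                    # update all weights
                net["W2"][j] += eta * delta_o * h[j]
                net["b1"][j] += eta * delta_h[j]
                for i in range(n_in):
                    net["W1"][j][i] += eta * delta_h[j] * x[i]
            net["b2"] += eta * delta_o
    return net

def sq_error(net, data):
    return sum((t - forward(net, x)[1]) ** 2 for x, t in data)

# Toy data: the logical OR function.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
net = train(data)
```

The hidden error terms are computed before any weight is changed, so each update uses the weights that produced the forward pass.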
12. Flaws in BackPropagation Algorithm
Requires labeled data, and most data is unlabeled.
Vanishing Gradient Problem.
Overfitting (high variance, low bias).
Easy to get stuck in poor local optima.
Gets worse as we add more hidden layers.
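To make the vanishing gradient problem concrete, the sketch below multiplies the per-layer factor that backpropagation applies to the error signal in a sigmoid network. It assumes, for illustration only, one unit per layer, the same weight `w` in every layer, and the same pre-activation `z`; `gradient_scale` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25, reached at z = 0

def gradient_scale(n_layers, z=0.0, w=1.0):
    """Size of the factor the error signal picks up after n_layers sigmoid layers."""
    return abs(w * sigmoid_prime(z)) ** n_layers

for depth in (1, 2, 5, 10):
    print(depth, gradient_scale(depth))
```

Even in the best case for the sigmoid (z = 0) with unit weights, the factor shrinks by 4x per layer, which is one way to see why the problem gets worse as hidden layers are added.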
13. Autoencoders
An autoencoder is another technique for dimensionality reduction [3]:
the output of the encoder is the reduced representation, and the
decoder is tuned to reconstruct the initial input from the
encoder's representation by minimizing a cost function.
14. Autoencoders (contd.)
Encoder
The encoder is a function $f$ that maps an input $x \in \mathbb{R}^{d_x}$ to a hidden
representation $h(x) \in \mathbb{R}^{d_h}$:
$h = f(x) = s_f(W x + b_h)$
Decoder
The decoder function $g$ maps the hidden representation $h$ back to a
reconstruction $y$, where $W'$ is the decoder weight matrix:
$y = g(h) = s_g(W' h + b_y)$
Objective Function
The cross-entropy loss, when $s_g$ is the sigmoid and the inputs are in $[0, 1]$:
$L(x, y) = -\sum_{i=1}^{d_x} \left[ x_i \log(y_i) + (1 - x_i) \log(1 - y_i) \right]$
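The encoder, decoder, and cross-entropy loss can be written out directly. The sketch below is a plain-Python illustration that assumes sigmoid activations for both the encoder and the decoder and tied weights (the decoder uses the transpose of W); the function names and the small example values are made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(x, W, b_h):
    """h = s_f(W x + b_h), with W given as d_h rows of length d_x."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W, b_h)]

def decode(h, W, b_y):
    """y = s_g(W^T h + b_y); tied weights are an assumption of this sketch."""
    d_x = len(W[0])
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(h))) + b_y[i])
            for i in range(d_x)]

def cross_entropy(x, y):
    """L(x, y) = -sum_i [x_i log y_i + (1 - x_i) log(1 - y_i)]."""
    return -sum(xi * math.log(yi) + (1 - xi) * math.log(1 - yi)
                for xi, yi in zip(x, y))

x = [1.0, 0.0, 1.0]                        # input with d_x = 3
W = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]   # d_h = 2 hidden units
h = encode(x, W, b_h=[0.0, 0.0])
y = decode(h, W, b_y=[0.0, 0.0, 0.0])
loss = cross_entropy(x, y)
```

With all weights and biases at zero, every reconstruction value is 0.5 and the loss equals $d_x \log 2$, which is a convenient sanity check.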
15. More on Hidden Layers
1 The hidden layer is said to be undercomplete if it is smaller than the
input layer. The hidden layer must then compress the input's
information into its units.
2 The hidden layer is said to be overcomplete if it is larger than the
input layer. No compression is needed in the hidden layer: each hidden
unit can copy a different input component.
16. Denoising Autoencoders
One simply corrupts the input x before sending it through the
autoencoder, which is trained to reconstruct the clean version [5].
Use additive Gaussian noise.
The loss function compares the reconstructed output with the
noiseless input, not the corrupted input.
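The key detail, that the loss is measured against the clean input, can be sketched as follows. The `reconstruct` argument is a hypothetical stand-in for any trained autoencoder (a function from a vector to a vector), and squared error is used here only to keep the example short.

```python
import random

def corrupt(x, sigma, rng):
    """Additive Gaussian noise with standard deviation sigma."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def denoising_loss(x_clean, reconstruct, sigma, rng):
    """Corrupt the input, reconstruct it, then compare with the CLEAN input."""
    x_noisy = corrupt(x_clean, sigma, rng)
    y = reconstruct(x_noisy)  # the autoencoder only ever sees the noisy version
    return sum((yi - xi) ** 2 for yi, xi in zip(y, x_clean))

rng = random.Random(0)
identity = lambda v: list(v)  # hypothetical "autoencoder" that copies its input
x = [0.2, 0.8, 0.5]
loss_noisy = denoising_loss(x, identity, sigma=0.3, rng=rng)
loss_clean = denoising_loss(x, identity, sigma=0.0, rng=rng)
```

A model that merely copies its input is penalized as soon as noise is present, which is what pushes a denoising autoencoder toward learning structure rather than the identity map.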
17. Contractive Autoencoders
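The contractive auto-encoder (CAE) is obtained by adding a
regularization term to the reconstruction loss, yielding the objective function:
$J(\theta) = \sum_{x \in D_n} \left( L(x, g(f(x))) + \lambda \lVert J_f(x) \rVert_F^2 \right)$
The explicit term penalizes solutions whose hidden representation is
sensitive to small changes in the input.
We wish to extract only features that reflect variations observed in the
training set.
$\lVert J_f(x) \rVert_F^2 = \sum_{ij} \left( \partial h_j(x) / \partial x_i \right)^2$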
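For a sigmoid encoder $h = s(Wx + b_h)$ the Jacobian entries are $\partial h_j / \partial x_i = h_j (1 - h_j) W_{ji}$, so the penalty has a simple closed form. The sketch below assumes a sigmoid encoder; the function names and the numeric check by finite differences are additions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(x, W, b_h):
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W, b_h)]

def contractive_penalty(x, W, b_h):
    """||J_f(x)||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2 for a sigmoid encoder."""
    h = encode(x, W, b_h)
    return sum((hj * (1.0 - hj)) ** 2 * sum(w * w for w in row)
               for hj, row in zip(h, W))

def numeric_penalty(x, W, b_h, eps=1e-5):
    """Same quantity from central finite differences, as a sanity check."""
    total = 0.0
    for i in range(len(x)):
        xp = list(x); xp[i] += eps
        xm = list(x); xm[i] -= eps
        hp, hm = encode(xp, W, b_h), encode(xm, W, b_h)
        total += sum(((a - b) / (2 * eps)) ** 2 for a, b in zip(hp, hm))
    return total

x = [0.3, 0.7]
W = [[0.5, -1.0], [2.0, 0.1]]
b_h = [0.0, -0.5]
penalty = contractive_penalty(x, W, b_h)
```

Saturated hidden units (h near 0 or 1) contribute almost nothing to the penalty, which is how the regularizer encourages representations that are locally insensitive to the input.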
18. Restricted Boltzmann Machines
An RBM is an energy-based, undirected graphical model for
unsupervised learning.
It consists of two layers of binary units: the visible layer, which
represents the data, and the hidden layer, which increases learning capacity [4].
$E(v, h) = -\sum_{i,j} v_i h_j w_{ij} - \sum_i v_i b_i - \sum_j h_j b_j$
Inputs are binary, as with the autoencoders.
There are no lateral connections within a layer. Which graph? A bipartite graph.
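The energy function can be evaluated directly for a given configuration. In this sketch `W[i][j]` follows the $w_{ij}$ indexing of the formula (i over visible units, j over hidden units), and `b_v` and `b_h` are names chosen here for the visible and hidden biases; the numbers are made up.

```python
def energy(v, h, W, b_v, b_h):
    """E(v, h) = -sum_ij v_i h_j w_ij - sum_i v_i b_i - sum_j h_j b_j."""
    pair = sum(v[i] * h[j] * W[i][j]
               for i in range(len(v)) for j in range(len(h)))
    visible = sum(vi * bi for vi, bi in zip(v, b_v))
    hidden = sum(hj * bj for hj, bj in zip(h, b_h))
    return -pair - visible - hidden

# Two visible units, one hidden unit.
W = [[2.0], [3.0]]
b_v = [0.5, 0.5]
b_h = [1.0]
e = energy([1, 0], [1], W, b_v, b_h)  # -(2.0) - 0.5 - 1.0
```

Only visible-hidden pairs appear in the interaction term, which reflects the absence of lateral connections.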
19. Formulation
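1 The probability distribution: $P(v, h) = e^{-E(v, h)} / Z$
$Z$: normalizing constant
2 The objective function: $\phi = \log P(v) = \phi^+ - \phi^-$
3 Positive phase term: $\phi^+ = \log \sum_h e^{-E(v, h)}$
4 Negative phase term: $\phi^- = \log \sum_{v, h} e^{-E(v, h)} = \log Z$
$\partial \phi^+ / \partial w_{ij} = v_i P(h_j = 1 \mid v)$
$\partial \phi^- / \partial w_{ij} = P(v_i = 1, h_j = 1)$
The negative gradient is intractable.
5 The joint distribution is intractable because of the normalizing
constant $Z$, so we work with the conditional distribution, which factorizes
(here $x$ denotes the visible vector $v$):
$p(h \mid x) = \prod_j p(h_j \mid x)$, where all $h_j$ are binary.
$p(h_j = 1 \mid x) = \mathrm{sigmoid}(b_j + W_j x)$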
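For a very small RBM the normalizing constant can still be computed by brute force, which makes it possible to check the sigmoid form of the conditional numerically. This is an illustrative sketch: the enumeration is exponential in the number of units (exactly why Z is intractable in practice), and all names and parameter values are assumptions of the example.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def energy(v, h, W, b_v, b_h):
    pair = sum(v[i] * h[j] * W[i][j]
               for i in range(len(v)) for j in range(len(h)))
    return (-pair - sum(a * b for a, b in zip(v, b_v))
            - sum(a * b for a, b in zip(h, b_h)))

def partition(W, b_v, b_h):
    """Z = sum over all binary (v, h) of exp(-E(v, h)); feasible only for tiny models."""
    return sum(math.exp(-energy(v, h, W, b_v, b_h))
               for v in itertools.product([0, 1], repeat=len(b_v))
               for h in itertools.product([0, 1], repeat=len(b_h)))

def p_hidden_on(j, v, W, b_h):
    """Closed form: p(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i)."""
    return sigmoid(b_h[j] + sum(W[i][j] * v[i] for i in range(len(v))))

def p_hidden_on_bruteforce(j, v, W, b_v, b_h):
    """Same conditional obtained by summing exp(-E) over all hidden states."""
    states = list(itertools.product([0, 1], repeat=len(b_h)))
    num = sum(math.exp(-energy(v, h, W, b_v, b_h)) for h in states if h[j] == 1)
    den = sum(math.exp(-energy(v, h, W, b_v, b_h)) for h in states)
    return num / den

W = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.2]]  # 3 visible units, 2 hidden units
b_v = [0.1, -0.2, 0.0]
b_h = [0.4, -0.6]
v = (1, 0, 1)
Z = partition(W, b_v, b_h)
```

The conditional needs no knowledge of Z, because Z cancels in the ratio; that cancellation is what makes Gibbs sampling in an RBM cheap.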
20. Contrastive Divergence Algorithm
Idea: every unit influences all of its neighbors, since the model is undirected.
Replace the expectation by a point estimate at $\tilde{x}$.
Obtain the point $\tilde{x}$ by Gibbs sampling [2].
Start the sampling chain at the training example $x^{(t)}$.
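One step of contrastive divergence (CD-1) for a binary RBM might look like the sketch below: the chain starts at the training example, a single Gibbs step produces the point estimate, and the weights move toward the data statistics and away from the sampled ones. The function `cd1_update`, the learning rate `eta`, and the use of hidden probabilities (rather than samples) in the update are assumptions of this sketch.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample(p, rng):
    return 1 if rng.random() < p else 0

def hidden_probs(v, W, b_h):
    return [sigmoid(b_h[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(b_h))]

def visible_probs(h, W, b_v):
    return [sigmoid(b_v[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b_v))]

def cd1_update(v0, W, b_v, b_h, eta, rng):
    """Update W, b_v, b_h in place from one training example; return the Gibbs sample."""
    ph0 = hidden_probs(v0, W, b_h)                            # positive phase at the data
    h0 = [sample(p, rng) for p in ph0]
    v1 = [sample(p, rng) for p in visible_probs(h0, W, b_v)]  # one Gibbs step
    ph1 = hidden_probs(v1, W, b_h)                            # negative phase at the sample
    for i in range(len(b_v)):
        for j in range(len(b_h)):
            W[i][j] += eta * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b_v[i] += eta * (v0[i] - v1[i])
    for j in range(len(b_h)):
        b_h[j] += eta * (ph0[j] - ph1[j])
    return v1

rng = random.Random(0)
W = [[0.0, 0.0] for _ in range(3)]  # 3 visible units, 2 hidden units
b_v, b_h = [0.0] * 3, [0.0] * 2
v1 = cd1_update([1, 0, 1], W, b_v, b_h, eta=0.1, rng=rng)
```

Running more Gibbs steps before the update gives CD-k; a single step is the cheapest variant.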
21. Relaxing the Constraints
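If the two constraints (binary inputs, no lateral connections) are relaxed, then:
Inputs $x$ are unbounded reals ($x$ denotes the visible vector $v$).
Add a quadratic term to the energy function:
$E(v, h) = -\sum_{i,j} v_i h_j w_{ij} - \sum_i v_i b_i - \sum_j h_j b_j + \frac{1}{2} x^T x$
The conditional distribution $p(x \mid h)$ is then Gaussian with mean
$\mu = c + W^T h$ (where $c$ is the visible bias vector) and identity covariance matrix.
If the layers have lateral connections, then:
$E(v, h) = -\sum_{i,j} v_i h_j w_{ij} - \sum_i v_i b_i - \sum_j h_j b_j - \frac{1}{2} x^T V x - \frac{1}{2} h^T U h$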
22. Deep Learning era
Figure: Visual Cortex
A learning model with multilayer representations.
Each layer corresponds to a distributed representation.
The units in a layer are not mutually exclusive.
23. Recipe
The models in these neural networks are formed of two parts.
The first part is a trained feature extraction section, consisting of
successive layers of units that process the data set inputs.
Each layer can be pre-trained without supervision, as an
auto-encoder or an RBM.
The second part is a classifier, which is trained in a supervised
way.
24. Greedy layerwise procedure
Geoff Hinton : “If you want to do computer vision, first learn computer
graphics.”
Unsupervised pretraining
Train one layer at a time using the unsupervised criterion. Fix the
parameters of the previous hidden layers. The previous layers are
viewed as feature extraction [1].
Supervised finetuning
Add an output layer. Train the neural network using supervised
learning (backpropagation). All parameters are fine-tuned together;
then stop.
25. Greedy Layerwise Algorithm
PseudoCode
1 for l = 1 to L:
    build the unsupervised training set (with $h^{(0)}(x) = x$):
    $D \leftarrow \{ h^{(l-1)}(x^{(t)}) \}_{t=1}^{T}$
2   train a greedy module (autoencoder, RBM) on $D$
3   use the hidden layer weights and biases of the greedy module to
    initialize the deep network parameters $W^{(l)}$, $b^{(l)}$
4 Initialize $W^{(L+1)}$, $b^{(L+1)}$ randomly.
5 Train the whole network by supervised stochastic gradient
descent (backprop).
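Steps 1 to 3 of the pseudocode can be sketched in plain Python, using a small tied-weight sigmoid autoencoder as the greedy module. This is an illustration under assumptions: the helper names, layer sizes, learning rate, epoch count, and toy data are invented, and steps 4 and 5 (random output layer, supervised fine-tuning) are left out to keep the example short.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(x, W, b):
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def train_autoencoder(D, n_hidden, eta=0.5, epochs=50, rng=random):
    """Greedy module: tied-weight autoencoder trained by SGD on cross-entropy."""
    n_in = len(D[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b_h, b_y = [0.0] * n_hidden, [0.0] * n_in
    for _ in range(epochs):
        for x in D:
            h = encode(x, W, b_h)
            y = [sigmoid(sum(W[j][i] * h[j] for j in range(n_hidden)) + b_y[i])
                 for i in range(n_in)]
            d_y = [yi - xi for yi, xi in zip(y, x)]  # gradient at decoder pre-activation
            d_h = [sum(W[j][i] * d_y[i] for i in range(n_in)) * h[j] * (1.0 - h[j])
                   for j in range(n_hidden)]         # gradient at encoder pre-activation
            for j in range(n_hidden):
                for i in range(n_in):
                    # Tied weights: decoder and encoder contributions add up.
                    W[j][i] -= eta * (d_y[i] * h[j] + d_h[j] * x[i])
                b_h[j] -= eta * d_h[j]
            for i in range(n_in):
                b_y[i] -= eta * d_y[i]
    return W, b_h

def greedy_pretrain(X, layer_sizes, rng=random):
    params, D = [], X                        # h^(0)(x) = x
    for n_hidden in layer_sizes:             # for l = 1 to L
        W, b = train_autoencoder(D, n_hidden, rng=rng)
        params.append((W, b))                # initializes W^(l), b^(l)
        D = [encode(x, W, b) for x in D]     # inputs for the next layer
    return params

X = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
params = greedy_pretrain(X, layer_sizes=[3, 2], rng=random.Random(0))
```

Each layer is trained only on the output of the already-trained layers below it, whose parameters stay fixed, which is what makes the procedure greedy.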
26. Dropout Approach
Overfitting is greatly reduced by randomly omitting half of
the feature detectors on each training case.
For each training case, each hidden unit is randomly omitted from
the network with a probability of 0.5, so a hidden unit cannot rely
on other hidden units being present.
Instead, each neuron learns to detect a feature that is generally
helpful for producing the correct answer given the combinatorially
large variety of internal contexts in which it must operate.
$h^{(k)}(x) \leftarrow g^{(k)}(x) \odot m^{(k)}$, where $m^{(k)}$ is a random binary mask for layer $k$.
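The masking step can be sketched in a few lines. The function name `dropout` and the parameter `p_drop` are chosen for this example; the default of 0.5 matches the omission probability stated above.

```python
import random

def dropout(h, p_drop=0.5, rng=random):
    """Multiply the activations h elementwise by a fresh random binary mask."""
    mask = [0 if rng.random() < p_drop else 1 for _ in h]
    return [hj * mj for hj, mj in zip(h, mask)], mask

rng = random.Random(0)
activations = [0.9, 0.1, 0.5, 0.7]
dropped, mask = dropout(activations, p_drop=0.5, rng=rng)
```

A new mask is drawn for every training case, so no hidden unit can count on any particular other unit being present.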
28. Pros and Cons
Add a pretraining phase to learn the structure of the input data.
Requires no labeled data.
Less likely to get stuck in bad local optima (pre-training acts as a regularizer).
A greedy layer-wise algorithm makes this efficient and fast.
Most of the learning in deep architectures is just some form of
gradient descent.
The Convergence rate of Contrastive Divergence is not clear.
Deep learning methods are often looked at as a black box, with
most confirmations done empirically, rather than theoretically.
30. MNIST Recognition
When the hidden layers are large, the reconstruction is good; when
they are small, it is not.
31. Feature Separation
Given two independent feature vectors, can we learn a
shared representation from them?
Given one feature vector, the model should be able to retrieve the other one.
This problem can be modelled as a cocktail party problem.
32. Independent Component Analysis
It depends:
Cotraining
Nongaussian is independent
Kurtosis
Preprocessing in ICA.
Steps
Drawbacks
33. Future Works
KLA Tencor dataset (defect classification).
Using deep nets for feature selection.
To find the classes without knowledge of the labels.
To build a neural network for speech recognition (Deep ICA).
34. Timeline
1 May 14th - June 14th: Deep learning intro, Autoencoders
2 June 14th - July 14th: Restricted Boltzmann Machines, Deep belief nets
3 June 15th - July 27th: Python-Theano
4 July 28th - August 8th: Independent Component Analysis
35. References
[1] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine
Manzagol, Pascal Vincent, and Samy Bengio.
Why does unsupervised pre-training help deep learning?
The Journal of Machine Learning Research, 11:625–660, 2010.
[2] Geoffrey E. Hinton.
Training products of experts by minimizing contrastive divergence.
Neural Computation, 14(8):1771–1800, 2002.
[3] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and
Yoshua Bengio.
Contractive auto-encoders: Explicit invariance during feature
extraction.
In Proceedings of the 28th International Conference on Machine
Learning (ICML), pages 833–840, 2011.
[4] Tijmen Tieleman.
Training restricted Boltzmann machines using approximations to
the likelihood gradient.
In Proceedings of the 25th International Conference on Machine
Learning (ICML), pages 1064–1071, 2008.