This document discusses deep learning and convolutional neural networks. It provides an example of using a CNN for face detection and recognition. The CNN architecture includes convolution and subsampling layers to extract features from images. Backpropagation is used to minimize error and adjust weights. The example detects faces in images with approximately 80% accuracy for faces and 57.33% for non-faces. Iterative search with a CNN is also used for object recognition in full images.
2. Shallow Learning
• SVM
• Linear & Kernel Regression
• Hidden Markov Models (HMM)
• Gaussian Mixture Models (GMM)
• Single hidden layer MLP
Limitations
• Cannot make use of unlabeled data
3. Supervised vs Unsupervised Learning
• Supervised Learning
1. Output has to be produced according to a target vector.
2. Input + target vector = training pair.
3. Labelled data.
• Unsupervised Learning (self-organising)
1. Network receives input patterns and forms clusters.
2. When a new input pattern is applied, the output gives the class the input pattern belongs to.
3. Unlabelled data.
4. Neural Networks
• Machine Learning
• Knowledge from high dimensional data
• Classification
• Input: features of data
• supervised vs unsupervised
• labeled data
• Neurons
5. What is it used for?
• Classification
• Regression
---- Prediction
---- Fitting Curve
7. Back Propagation
• Minimize error of calculated output
• Adjust weights
• Gradient Descent
• Procedure
• Forward phase
• Backpropagation of errors
• For each sample, multiple epochs
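The procedure above — forward phase, error on the calculated output, weight adjustment by gradient descent over multiple epochs — can be sketched for a single sigmoid neuron. This is a minimal illustration, not the slides' actual network; the data, learning rate, and logistic-loss update rule are assumptions.

```python
import numpy as np

# Minimal backpropagation sketch: one sigmoid neuron trained by
# gradient descent (illustrative stand-in for the slides' procedure).
rng = np.random.default_rng(0)

X = np.array([[0.0], [1.0]])   # inputs
t = np.array([0.0, 1.0])       # target vector (labelled data)
w = rng.normal(size=1)         # start weights from a random position
b = 0.0
lr = 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(500):       # multiple epochs over the samples
    y = sigmoid(X @ w + b)     # forward phase
    e = y - t                  # error of the calculated output
    w -= lr * X.T @ e          # backpropagate error: adjust weights
    b -= lr * e.sum()          # ...and bias by gradient descent

mse = float(np.mean((sigmoid(X @ w + b) - t) ** 2))
```

After training, the mean squared error is small, showing the error-minimization loop at work.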
8. Problems with Backpropagation
• Multiple hidden layers
• Gets stuck in local optima
• Weights start from random positions
• Uses only labeled data
• Most data is unlabeled
9. Deep Learning Means Feature Learning
• Deep Learning is about Learning Hierarchical Features.
11. CNN contd.
• Detects the same feature at different positions in the
input image in the C layer.
12. CNN Contd.
Shared weights: all neurons in a feature map share the
same weights (but not the biases).
In this way, all neurons detect the same feature at
different positions in the input image.
This reduces the number of free parameters.
If a neuron in the feature map fires, this corresponds to a match with
the template.
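The parameter saving from weight sharing can be made concrete with the sizes used later in these slides (72x72 input, 3x3 kernels). This is a sketch under those assumptions; the "valid" sliding shown here is for brevity, while the slides' layers keep the 72x72 size, which implies padding.

```python
import numpy as np

# Weight sharing vs. a fully connected layer, using the slides' sizes.
H = W = 72
k = 3

# Fully connected: every output neuron carries its own 72x72 weight set.
fc_params = (H * W) * (H * W)    # 5184 * 5184 weights

# Shared-weight feature map: one 3x3 template reused at every position.
shared_params = k * k            # 9 weights per feature map

# The same kernel applied at different positions produces one feature
# map ("valid" mode here, so the map shrinks by k-1 in each dimension).
img = np.zeros((H, W))
kernel = np.ones((k, k)) / (k * k)
fmap = np.array([[np.sum(img[i:i+k, j:j+k] * kernel)
                  for j in range(W - k + 1)] for i in range(H - k + 1)])
```

A high response anywhere in `fmap` would signal a match with the shared template at that position.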
13. CNN Contd.
S Layer
The subsampling layers reduce the spatial resolution of each feature
map.
By reducing the spatial resolution of the feature map, a certain degree
of shift and distortion invariance is achieved.
14. Contd. S layer
Weight sharing is also applied in the subsampling layers.
This reduces the effect of noise, shift, and distortion.
16. Few Insights Gathered From Papers
• Used a CBIR method for feature extraction in the convolutional layer.
• Applied filters for feature extraction.
• Used a fixed patch size to work on.
• The CNN method was used throughout.
• 3D convolution – time added as the third dimension.
• Feature extraction observed so far:
1. Gradient filters in the X and Y directions.
17. Object Detection
Architecture:
Dataset: MIT face dataset, 1104 faces.
Training – 200 images.
Test – 200 images.
The Convolutional Neural Network consists of two parts:
1) the convolution layers and max-pooling layers;
2) the fully connected layers and the output layer.
The input layer consists of 72x72 histogram-equalized images, and the
output is a set of different face images, each of size 18x18.
The networks used for face detection and face recognition contain two
convolutional layers and two sub-sampling layers.
19. Convolutional Layer
A total of 5 kernels of size 3x3 is used for the convolution operation,
giving 5 different feature maps:
• gray
• gradient-x
• gradient-y
• the last two kernels capture information below the eye area.
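The slides do not give the actual 3x3 kernel values, so the sketch below uses simple Prewitt-style difference kernels as plausible stand-ins for the gradient-x and gradient-y maps, with a zero-padded "same" cross-correlation that preserves the 72x72 size as described.

```python
import numpy as np

# Assumed Prewitt-style kernels; the slides' exact kernels are not given.
kx = np.array([[-1.0, 0.0, 1.0]] * 3)   # gradient in x
ky = kx.T                               # gradient in y

def conv_same(img, k):
    """3x3 'same' cross-correlation with zero padding (size preserved)."""
    p = np.pad(img, 1)
    return np.array([[np.sum(p[i:i+3, j:j+3] * k)
                      for j in range(img.shape[1])]
                     for i in range(img.shape[0])])

img = np.tile(np.arange(8, dtype=float), (8, 1))  # intensity ramp along x
gx = conv_same(img, kx)   # responds to the horizontal ramp
gy = conv_same(img, ky)   # flat along y, so interior response is zero
```

On the ramp image, the gradient-x map responds strongly in the interior while the gradient-y map stays at zero, which is the intended division of labour between the two kernels.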
20. Sampling Layer
• A 2x2 mean filter is applied to the image.
• Alternate rows and alternate columns of the image are then sampled out.
[Architecture diagram: Input Layer (72x72) → Conv Layer-1 (3x3 kernels, 5 feature maps, 72x72) → Samp Layer-1 (5 feature maps, 36x36) → Conv Layer-2 (36x36) → Samp Layer-2 (18x18) → fully connected layer → 20 faces]
22. Error Propagation.
• The error matrix e is obtained as the difference between the values of
the neurons in the output layer and the fully connected layer.
• As there are 5 kernels in the convolutional layers, each face has 5
different feature maps. So, in the fully connected layer of the object
recognition CNN, the total number of neurons is 18 x 18 x 5 (feature maps).
• The mean error M(i=1:5) of each map is calculated.
• EM, the mean of M(i=1:5), is calculated.
• The errors {M(i=1:5)} are used for back-propagation.
• {EM} is used as a threshold: any value below the threshold is
considered a success, i.e. a face match.
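The error terms above can be sketched directly: per-map mean errors M(i=1:5) over the 18x18x5 fully connected neurons, and their mean EM used as the match threshold. The random error values here are stand-ins for the real output-minus-fully-connected differences.

```python
import numpy as np

# Stand-in error matrix e for 5 feature maps of 18x18 neurons each.
rng = np.random.default_rng(0)
e = rng.normal(size=(5, 18, 18))

M = e.reshape(5, -1).mean(axis=1)   # mean error M(i=1:5) of each map
EM = float(M.mean())                # EM: mean of the five map errors

def is_face_match(err_value, threshold=EM):
    """Test-time rule from the slides: below the threshold = face match."""
    return err_value < threshold
```

Since all five maps have the same size, EM equals the overall mean of e; the per-map values M are what feed back-propagation.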
23. Implementation
1) Input of 200 images, each of size 72x72, is presented to the network
one by one for training.
2) In the first C layer, the convolution operation is performed using the
aforementioned kernels of size 3x3. The resultant output is of size
72 x 72 x 5.
3) In the first S layer, the image is sampled using a mean filter of size
2x2 and by sampling alternate rows and columns. The output is of size
36 x 36 x 5.
24. 4) In the second C layer, after the convolution operation we get output
of size 36 x 36 x 5.
5) In the second S layer, after the sampling operation we get output of
size 200 x 18 x 18 x 5 (18 x 18 x 5 per image over the 200 training images).
6) The fully connected layer is obtained after the S layer; it is of
size 18 x 18 x 5, and each neuron is connected to the output layer.
7) The error is propagated using the method described in the error
propagation section above.
8) Error propagation takes place for a fixed number of epochs during
training.
9) For testing, {EM} obtained above is used as the threshold to find a
face match.
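Steps 1–6 above can be traced at the shape level. The mean kernels below are stand-ins (the slides' actual kernels are the gray/gradient set), but the sizes follow the pipeline exactly: 72x72 → 72x72x5 → 36x36x5 → 36x36x5 → 18x18x5 → fully connected.

```python
import numpy as np

def conv_same(img, k):
    """3x3 'same' cross-correlation with zero padding."""
    p = np.pad(img, 1)
    return np.array([[np.sum(p[i:i+3, j:j+3] * k)
                      for j in range(img.shape[1])]
                     for i in range(img.shape[0])])

def subsample(fmap):
    """2x2 mean filter followed by keeping alternate rows and columns."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

img = np.zeros((72, 72))                 # one 72x72 input image
kernels = [np.ones((3, 3)) / 9.0] * 5    # stand-ins for the 5 kernels

c1 = np.stack([conv_same(img, k) for k in kernels])          # 72x72 x5
s1 = np.stack([subsample(m) for m in c1])                    # 36x36 x5
c2 = np.stack([conv_same(m, kernels[i])
               for i, m in enumerate(s1)])                   # 36x36 x5
s2 = np.stack([subsample(m) for m in c2])                    # 18x18 x5
fc = s2.reshape(-1)                      # fully connected: 18*18*5 neurons
```

The fully connected layer ends up with 1620 neurons per image, matching the 18 x 18 x 5 size stated in step 6.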
27. Object Detection.
• Input – Face images of size 72 x 72
Non Face images of size 72 x 72
• Output – 1 or 0
28. Error Propagation
• The error matrix e is obtained as the difference between the values of
the neurons in the output layer and the fully connected layer.
• The error of each neuron is propagated backwards, and the weights are
updated accordingly.
• Backpropagation halts when the error < 0.0003 or after 64 training
epochs.
29. Object Detection ( Face/Non face)
1) Input of 50 images (30 faces + 20 non-faces), each of size 72x72, is
presented to the network one by one for training.
2) In the first C layer, the convolution operation is performed using the
aforementioned kernels of size 3x3. The resultant output is of size
72 x 72 x 5.
3) In the first S layer, the image is sampled using a mean filter of size
2x2 and by sampling alternate rows and columns. The output is of size
36 x 36 x 5.
30. 4) In the second C layer, after the convolution operation we get output
of size 36 x 36 x 5.
5) In the second S layer, after the sampling operation we get output of
size 18 x 18 x 5.
6) The fully connected layer is obtained after the S layer; it is of
size 18 x 18 x 5, and each neuron is connected to the output layer.
7) The error is propagated using the method described in the error
propagation section above.
8) Error propagation takes place for a fixed number of epochs during
training.
9) For testing, 200 images were used: 125 faces and 75 non-faces.
31. Result:
Confusion matrix     Face    Non-face
Face (test)          100     25
Non-face (test)      32      43
Accuracy: 80% approx. for faces.
Accuracy: 57.33% approx. for non-faces.
Time to detect a face: 25.35 secs approx.
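The reported accuracies follow directly from the confusion matrix: row-wise, each true class has 125 face and 75 non-face test images.

```python
# Per-class accuracy from the confusion matrix above.
face_correct, face_missed = 100, 25        # 125 face test images
nonface_wrong, nonface_correct = 32, 43    # 75 non-face test images

face_acc = face_correct / (face_correct + face_missed)            # 100/125
nonface_acc = nonface_correct / (nonface_wrong + nonface_correct)  # 43/75
```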
33. Implementation: iterative search
1) An input image is presented to the network.
2) A 72x72 patch is created and presented to the network.
3) In the first C layer, the convolution operation is performed using the
aforementioned kernels of size 3x3. The resultant output is of size
72 x 72 x 5.
4) In the first S layer, the image is sampled using a mean filter of size
2x2 and by sampling alternate rows and columns. The output is of size
36 x 36 x 5.
34. 5) In the second C layer, after the convolution operation we get output
of size 36 x 36 x 5.
6) In the second S layer, after the sampling operation we get output of
size 18 x 18 x 5.
7) The fully connected layer is obtained after the S layer; it is of
size 18 x 18 x 5, and each neuron is connected to the output layer.
8) The error is propagated using the method described in the error
propagation section above.
9) Error propagation takes place for a fixed number of epochs during
training.
10) When a face is found, Count = Count + 1.
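The patch-and-count loop above can be sketched as a sliding window over the full image. The stride and the classifier are assumptions (the slides specify neither); `classify` stands in for the trained face/non-face network.

```python
import numpy as np

def iterative_search(image, classify, patch=72, stride=36):
    """Slide a 72x72 patch over the image; count patches the
    (hypothetical) network accepts as faces. Stride is an assumption."""
    count = 0
    h, w = image.shape
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            if classify(image[i:i+patch, j:j+patch]):  # network says "face"
                count += 1                             # Count = Count + 1
    return count

# Toy run: a 144x144 image with a stand-in classifier that accepts
# every dark patch, so all 3x3 window positions are counted.
img = np.zeros((144, 144))
n = iterative_search(img, classify=lambda p: p.mean() < 0.5)
```

In the slides' setup, `classify` would be the face/non-face CNN applied to each patch with the {EM} threshold.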