Lazy Deep Learning for Images
Recognition in ZZ Photo app
Artem Chernodub, George Paschenko
IMMSP NASU
AI&Big Data Lab, 23 April, 2015, Odessa.
ZZ Photo
𝑝 𝑥 𝑦 =
𝑝 𝑦 𝑥 𝑝(𝑥)
𝑝(𝑦)
Biological-inspired models
Neuroscience
Machine Learning
2 / 55
Biological Neural Networks
3 / 55
Artificial Neural Networks
Traditional (Shallow) Neural
Networks
Deep Neural Networks
Deep Feedforward Neural
Networks
Recurrent Neural Networks
4 / 55
Conventional Methods vs Deep Learning
5 / 55
Deep Learning = Learning of
Representations (Features)
The traditional model of pattern recognition (since the late 50's):
fixed/engineered features + trainable classifier
Hand-crafted
Feature
Extractor
Trainable
Classifier
Trainable
Feature
Extractor
Trainable
Classifier
End-to-end learning / Feature learning / Deep learning:
trainable features + trainable classifier
6 / 55
ImageNet
Le et al. “Building high-level features using large-scale unsupervised learning” ICML
2012.
Model # of parameters Accuracy, %
Deep Net 10M 15.8
best state-of-the-art N/A 9.3
Training data: 16M images, 20K categories
7 / 55
Deep Face (Facebook)
Y. Taigman, M. Yang, M.A. Ranzato, L. Wolf. DeepFace: Closing the Gap to Human-
Level Performance in Face Verification // CVPR 2014.
Model # of parameters Accuracy, %
Deep Face Net 128M 97.35
Human level N/A 97.5
Training data: 4M facial images
8 / 55
TIMIT Phoneme Recognition
Graves, A., Mohamed, A.-R., and Hinton, G. E. (2013). Speech recognition with deep
recurrent neural networks // IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pages 6645–6649. IEEE.
Mohamed, A. and Hinton, G. E. (2010). Phone recognition using restricted Boltzmann
machines // IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 4354–4357.
Model # of parameters Accuracy
Hidden Markov Model, HMM N / A 27,3%
Deep Belief Network, DBN ~ 4M 26,7%
Deep RNN 4,3M 17.7%
Training data: 462 speakers train / 24 speakers test, 3.16 / 0.14 hrs.
9 / 55
Google Large Vocabulary Speech
Recognition
H. Sak, A. Senior, F. Beaufays. Long Short-Term Memory Recurrent Neural Network
Architectures for Large Scale Acoustic Modeling // INTERSPEECH’2014.
K. Vesely, A. Ghoshal, L. Burget, D. Povey. Sequence-discriminative training of deep
neural networks // INTERSPEECH’2014.
Model # of parameters Cross-entropy
ReLU DNN 85M 11.3
Deep Projection LSTM RNN 13M 10.7
Training data: 3M utterances (1900 hrs).
10 / 55
Classic Feedforward Neural Networks
(before 2006).
• Single hidden layer (Kolmogorov-Cybenko Universal
Approximation Theorem as the main hope).
• Vanishing gradients effect prevents using more layers.
• Less than 10K free parameters.
• Feature preprocessing stage is often critical.
11 / 55
Training the traditional (shallow)
Neural Network: derivative + optimization
12 / 55
1) forward propagation pass
),( )1(

i
ijij xwfz
),()1(~ )2(

j
jj zwgky
where zj is the postsynaptic value for the j-th hidden neuron, w(1) are the hidden layer’s
weights, f() are the hidden layer’s activation functions, w(2) are the output layer’s weights,
and g() are the output layer’s activation functions.
13 / 55
2) backpropagation pass
Local gradients calculation:
),1(~)1(  kyktOUT

.)(' )2( OUT
jj
HID
j wzf  
,
)(
)2( j
OUT
j
z
w
kE



.
)(
)1( i
IN
j
ji
x
w
kE



Derivatives calculation:
14 / 55
Bad effect of vanishing (exploding)
gradients: a problem
,
)( )1()(
)(



 m
i
m
jm
ji
z
w
kE

,' )1()()1()( 
 m
i
i
m
ij
m
j
m
j wf  0
)(
)(



m
jiw
kE
=> 1mfor
15 / 55
Bad effect of vanishing (exploding)
gradients: two hypotheses
1) increased frequency and
severity of bad local
minima
2) pathological curvature, like
the type seen in the well-known
Rosenbrock function:
222
)(100)1(),( xyxyxf 
16 / 55
Deep Feedforward Neural Networks
• 2-stage training process: i) unsupervised pre-training; ii) fine tuning
(vanishing gradients problem is beaten!).
• Number of hidden layers > 1 (usually 6-9).
• 100K – 100M free parameters.
• No (or less) feature preprocessing stage.
17 / 55
Sparse Autoencoders
18 / 55
Dimensionality reduction
• Use a stacked RBM as deep auto-
encoder
1. Train RBM with images as input &
output
2. Limit one layer to few dimensions
 Information has to pass through middle
layer
G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural
Networks // Science 313 (2006), p. 504 – 507.
19 / 55
Original
Deep
RBN
PCA
Dimensionality reduction
Olivetti face data, 25x25 pixel images reconstructed from 30 dimensions
(625  30)
G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural
Networks // Science 313 (2006), p. 504 – 507.
20 / 55
How to use unsupervised pre-training
stage / 1
21 / 55
How to use unsupervised pre-training
stage / 2
22 / 55
How to use unsupervised pre-training
stage / 3
23 / 55
How to use unsupervised pre-training
stage / 4
24 / 55
Unlabeled data
Unlabeled data is readily available
Example: Images from the web
1. Download 10’000’000 images
2. Train a 9-layer DNN
3. Concepts are formed by DNN
G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural
Networks // Science 313 (2006), p. 504 – 507.
25 / 55
Dimensionality reduction
PCA Deep RBN
804’414 Reuters news stories, reduction to 2 dimensions
G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural
Networks // Science 313 (2006), p. 504 – 507.
26 / 55
Hierarchy of trained representations
Low-level
feature
Middle-level
feature
Top-level
feature
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus
2013]
27 / 55
Hessian-Free optimization: Deep
Learning with no pre-training stage
J. Martens. Deep Learning via Hessian-free Optimization // Proceedings of the 27th
International Conference on Machine Learning (ICML), 2010.
28 / 55
FLOPS comparison
https://ru.wikipedia.org/wiki/FLOPS
Type Name Flops Cost
Mobile Raspberry Pi 1st Gen, 700
Mhz
0,04 Gflops $35
Mobile Apple A8 1,4 Gflops $700 (in iPhone 6)
CPU Intel Core i7-4930K (Ivy
Bridge), 3.7 GHz
140 Gflops $700
CPU Intel Core i7-5960X
(Haswell), 3.0 GHz
350 Gflops $1300
GPU NVidia GTX 980 4612 Gflops (single
precision), 144 Gflops
(double precision)
$600 + cost of PC
(~$1000)
GPU NVidia Tesla K80 8740 Gflops (single
precision), 2910
Gflops (double
precision)
$4500 + cost of
PC (~1500)
29 / 55
Deep Networks Training time using GPU
• Pretraining – from 2-3 weeks to 2-3 months.
• Fine-tuning (final supervised training) – from
1 day to 1 week.
30 / 55
Tools for training Deep Neural
Networks
D. Kruchinin, E. Dolotov, K. Kornyakov, V. Kustikova, P. Druzhkov. The Comparison of
Deep Learning Libraries on the Problem of Handwritten Digit Classication // Analysis
of Images, Social Networks and Texts (AIST), 2015, April, 9-11th, Yekaterinburg.
31 / 55
Lazy Deep Learning: motivation
A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson. CNN Features off-the-shelf: an
Astounding Baseline for Recognition //2014 IEEE Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), 23-28 June 2014, Columbus, USA, p. 512
– 519.
32 / 55
Lazy Deep Learning: bechmark results
A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson. CNN Features off-the-shelf: an
Astounding Baseline for Recognition //2014 IEEE Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), 23-28 June 2014, Columbus, USA, p. 512
– 519.
33 / 55
Convolutional Neural Networks: Return
of Jedi
Andrej Karpathy and Fei-Fei. CS231n: Convolutional Neural Networks for Visual
Recognition http://cs231n.github.io/convolutional-networks
Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press
book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook
34 / 55
AlexNet, 2012 — MeGa HiT
A. Kryzhevsky, I. Sutskever, G.E. Hinton. ImageNet Classification with Deep
Convolutional Neural Networks // Advances in Neural Information Processing
Systems 25 (NIPS 2012).
35 / 55
AlexNet, results on ISVRC-2012
A. Kryzhevsky, I. Sutskever, G.E. Hinton. ImageNet Classification with Deep
Convolutional Neural Networks // Advances in Neural Information Processing
Systems 25 (NIPS 2012).
36 / 55
Convolution Layer
Andrej Karpathy and Fei-Fei. CS231n: Convolutional Neural Networks for Visual
Recognition http://cs231n.github.io/convolutional-networks
Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press
book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook
37 / 55
Pooling layer
Andrej Karpathy and Fei-Fei. CS231n: Convolutional Neural Networks for Visual
Recognition http://cs231n.github.io/convolutional-networks
Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press
book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook
38 / 55
Implementation tricks: im2col
K. Chellapilla, S. Puri, P. Simard. High Performance Convolutional Neural Networks
for Document Processing // International Workshop on Frontiers in Handwriting
Recognition, 2006.
39 / 55
Implementation tricks: im2col for
convolution
K. Chellapilla, S. Puri, P. Simard. High Performance Convolutional Neural Networks
for Document Processing // International Workshop on Frontiers in Handwriting
Recognition, 2006.
40 / 55
Activation functions
Andrej Karpathy and Fei-Fei. CS231n: Convolutional Neural Networks for Visual
Recognition http://cs231n.github.io/convolutional-networks
Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press
book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook
𝑓(𝑥) = max 0, 𝑥
𝑓′ 𝑥 =
1, 𝑥 ≥ 0
0, 𝑥 < 0
ReLU activation function
41 / 55
Development of AleksNet on OpenCV
VGG MatConvNet: CNNs for MATLAB http://www.vlfeat.org/matconvnet/
mexopencv:MATLAB-OpenCV interface http://kyamagu.github.io/mexopencv/matlab
MatConvNet,
MATLAB + CUDA
OpenCV app, C++
YAML
YAML
42 / 55
ZZ Photo – photo organizer
Free beta version is available on http://zzphoto.me
43 / 55
MIT-8 toy problem: formulation
• 8 classes
• 2688 images in total
• TRAIN: 2000 images,
250 per class
• TEST: 688 images,
~86 per class
S. Banerji, A. Verma, C. Liu. Novel Color LBP Descriptors for Scene and Image
Texture Classification // Cross Disciplinary Biometric Systems, 2012, 15th
International Conference on Image Processing, Computer Vision, and Pattern
Recognition, Las Vegas, Nevada, pp. 205-225.
44 / 55
MIT-8 toy problem: results
Acc.
TRAIN
Acc.
TEST
1 LBP + SVM with RBF Kernel 27,2% 19,0%
2 LPQ + SVM with RBF kernel 38,4% 30,5%
3 LBP + SVM with χ2 kernel 94,2% 74,0%
4 LPQ + SVM with χ2 kernel 99,1% 82,2%
5 Deep CNN (AlexNet) + SVM RBF kernel (LAZY DL) 95,1% 91,8%
6 Deep CNN (AlexNet) + SVM with χ2 Kernel (LAZY DL) 100,0% 93,2%
7 Deep CNN (AlexNet) + MLP (LAZY DL) 100,0% 92,3%
Original results, to be published.
45 / 55
Pets detection problem (Kaggle
Dataset + random Other images)
• Kaggle Dataset +
random “other” images;
• 2 classes (cats & dogs
VS other);
• TRAIN: 1,000 samples;
• TEST: 12,000 samples.
46 / 55
Viola-Jones Object Detector
• Very popular for Human Face Detection.
• May be trained for Cat and Dog Face detection.
• Available free in OpenCV library (http://opencv.org).
O. Parkhi, A. Vedaldi, C. V. Jawahar, and A. Zisserman. The Truth about Cats and
Dogs // Proceedings of the International Conference on Computer Vision (ICCV),
2011. J.
Liu, A. Kanazawa, D. Jacobs, P. Belhumeur. Dog Breed Classification Using Part
Localization // Lecture Notes in Computer Science Volume 7572, 2012, pp 172-
185.
47 / 55
Images pyramid for Viola-Jones
48 / 55
Viola-Jones Object Detector Classifier
Structure
49 / 55
P. Viola, M. Jones. Rapid object detection using a boosted cascade of simple
features // Proceedings of the 2001 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, CVPR 2001.
Pets detection results: FAR vs FRR graphs
Original results, to be published.
50 / 55
Pets detection results: FAR = 0.5%
Error, %
1 Viola-Jones Face Detector for Cats & Dogs + LBP + SVM 79,73%
2 AlexNet, argmax (STANDARD DL, ImageNet-2012, 1000) 26,11%
3 AlexNet, sum (STANDARD DL, ImageNet-2012, 1000) 26,11%
4 AlexNet + SVM linear (LAZY DL) 4,35%
Original results, to be published.
51 / 55
Pet detection results : ROC curve
Original results, to be published.
52 / 55
Labeled Faces in the Wild (LFW) Dataset
G. B. Huang, M. Ramesh, T. Berg, E. Learned-Miller. Labeled Faces in the Wild: A
Database for Studying Face Recognition in Unconstrained Environments // University
of Massachusetts, Amherst, Technical Report 07-49, October, 2007
• more than 13,000 images
of faces collected from the
web.
• Pairs comparison,
restricted mode.
• test: 10-fold cross-
validation, 6000 face
pairs.
53 / 55
Face Recognition on LWF, results
54 / 55
Y. Taigman, M. Yang, M. Ranzato, L. Wolf. DeepFace: Closing the Gap to Human-
Level Performance in Face Verification, 2014, CVPR.
Error, %
1 Principal Component Analysis (EigenFaces) 60,2%
2 Local Binary Pattern Histograms (LBP) 72,4%
3 Deep CNN (AlexNet) + Euclid (LAZY DL) 71,0%
4 DeepFace by Facebook (STANDARD DL) 97,25%
contact: a.chernodub@gmail.com
Thanks!

AI&BigData Lab. Артем Чернодуб "Распознавание изображений методом Lazy Deep Learning в фото-органайзере ZZ Photo"

  • 1.
    Lazy Deep Learningfor Images Recognition in ZZ Photo app Artem Chernodub, George Paschenko IMMSP NASU AI&Big Data Lab, 23 April, 2015, Odessa. ZZ Photo
  • 2.
    𝑝 𝑥 𝑦= 𝑝 𝑦 𝑥 𝑝(𝑥) 𝑝(𝑦) Biological-inspired models Neuroscience Machine Learning 2 / 55
  • 3.
  • 4.
    Artificial Neural Networks Traditional(Shallow) Neural Networks Deep Neural Networks Deep Feedforward Neural Networks Recurrent Neural Networks 4 / 55
  • 5.
    Conventional Methods vsDeep Learning 5 / 55
  • 6.
    Deep Learning =Learning of Representations (Features) The traditional model of pattern recognition (since the late 50's): fixed/engineered features + trainable classifier Hand-crafted Feature Extractor Trainable Classifier Trainable Feature Extractor Trainable Classifier End-to-end learning / Feature learning / Deep learning: trainable features + trainable classifier 6 / 55
  • 7.
    ImageNet Le et al.“Building high-level features using large-scale unsupervised learning” ICML 2012. Model # of parameters Accuracy, % Deep Net 10M 15.8 best state-of-the-art N/A 9.3 Training data: 16M images, 20K categories 7 / 55
  • 8.
    Deep Face (Facebook) Y.Taigman, M. Yang, M.A. Ranzato, L. Wolf. DeepFace: Closing the Gap to Human- Level Performance in Face Verification // CVPR 2014. Model # of parameters Accuracy, % Deep Face Net 128M 97.35 Human level N/A 97.5 Training data: 4M facial images 8 / 55
  • 9.
    TIMIT Phoneme Recognition Graves,A., Mohamed, A.-R., and Hinton, G. E. (2013). Speech recognition with deep recurrent neural networks // IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE. Mohamed, A. and Hinton, G. E. (2010). Phone recognition using restricted Boltzmann machines // IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4354–4357. Model # of parameters Accuracy Hidden Markov Model, HMM N / A 27,3% Deep Belief Network, DBN ~ 4M 26,7% Deep RNN 4,3M 17.7% Training data: 462 speakers train / 24 speakers test, 3.16 / 0.14 hrs. 9 / 55
  • 10.
    Google Large VocabularySpeech Recognition H. Sak, A. Senior, F. Beaufays. Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling // INTERSPEECH’2014. K. Vesely, A. Ghoshal, L. Burget, D. Povey. Sequence-discriminative training of deep neural networks // INTERSPEECH’2014. Model # of parameters Cross-entropy ReLU DNN 85M 11.3 Deep Projection LSTM RNN 13M 10.7 Training data: 3M utterances (1900 hrs). 10 / 55
  • 11.
    Classic Feedforward NeuralNetworks (before 2006). • Single hidden layer (Kolmogorov-Cybenko Universal Approximation Theorem as the main hope). • Vanishing gradients effect prevents using more layers. • Less than 10K free parameters. • Feature preprocessing stage is often critical. 11 / 55
  • 12.
    Training the traditional(shallow) Neural Network: derivative + optimization 12 / 55
  • 13.
    1) forward propagationpass ),( )1(  i ijij xwfz ),()1(~ )2(  j jj zwgky where zj is the postsynaptic value for the j-th hidden neuron, w(1) are the hidden layer’s weights, f() are the hidden layer’s activation functions, w(2) are the output layer’s weights, and g() are the output layer’s activation functions. 13 / 55
  • 14.
    2) backpropagation pass Localgradients calculation: ),1(~)1(  kyktOUT  .)(' )2( OUT jj HID j wzf   , )( )2( j OUT j z w kE    . )( )1( i IN j ji x w kE    Derivatives calculation: 14 / 55
  • 15.
    Bad effect ofvanishing (exploding) gradients: a problem , )( )1()( )(     m i m jm ji z w kE  ,' )1()()1()(   m i i m ij m j m j wf  0 )( )(    m jiw kE => 1mfor 15 / 55
  • 16.
    Bad effect ofvanishing (exploding) gradients: two hypotheses 1) increased frequency and severity of bad local minima 2) pathological curvature, like the type seen in the well-known Rosenbrock function: 222 )(100)1(),( xyxyxf  16 / 55
  • 17.
    Deep Feedforward NeuralNetworks • 2-stage training process: i) unsupervised pre-training; ii) fine tuning (vanishing gradients problem is beaten!). • Number of hidden layers > 1 (usually 6-9). • 100K – 100M free parameters. • No (or less) feature preprocessing stage. 17 / 55
  • 18.
  • 19.
    Dimensionality reduction • Usea stacked RBM as deep auto- encoder 1. Train RBM with images as input & output 2. Limit one layer to few dimensions  Information has to pass through middle layer G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks // Science 313 (2006), p. 504 – 507. 19 / 55
  • 20.
    Original Deep RBN PCA Dimensionality reduction Olivetti facedata, 25x25 pixel images reconstructed from 30 dimensions (625  30) G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks // Science 313 (2006), p. 504 – 507. 20 / 55
  • 21.
    How to useunsupervised pre-training stage / 1 21 / 55
  • 22.
    How to useunsupervised pre-training stage / 2 22 / 55
  • 23.
    How to useunsupervised pre-training stage / 3 23 / 55
  • 24.
    How to useunsupervised pre-training stage / 4 24 / 55
  • 25.
    Unlabeled data Unlabeled datais readily available Example: Images from the web 1. Download 10’000’000 images 2. Train a 9-layer DNN 3. Concepts are formed by DNN G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks // Science 313 (2006), p. 504 – 507. 25 / 55
  • 26.
    Dimensionality reduction PCA DeepRBN 804’414 Reuters news stories, reduction to 2 dimensions G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks // Science 313 (2006), p. 504 – 507. 26 / 55
  • 27.
    Hierarchy of trainedrepresentations Low-level feature Middle-level feature Top-level feature Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013] 27 / 55
  • 28.
    Hessian-Free optimization: Deep Learningwith no pre-training stage J. Martens. Deep Learning via Hessian-free Optimization // Proceedings of the 27th International Conference on Machine Learning (ICML), 2010. 28 / 55
  • 29.
    FLOPS comparison https://ru.wikipedia.org/wiki/FLOPS Type NameFlops Cost Mobile Raspberry Pi 1st Gen, 700 Mhz 0,04 Gflops $35 Mobile Apple A8 1,4 Gflops $700 (in iPhone 6) CPU Intel Core i7-4930K (Ivy Bridge), 3.7 GHz 140 Gflops $700 CPU Intel Core i7-5960X (Haswell), 3.0 GHz 350 Gflops $1300 GPU NVidia GTX 980 4612 Gflops (single precision), 144 Gflops (double precision) $600 + cost of PC (~$1000) GPU NVidia Tesla K80 8740 Gflops (single precision), 2910 Gflops (double precision) $4500 + cost of PC (~1500) 29 / 55
  • 30.
    Deep Networks Trainingtime using GPU • Pretraining – from 2-3 weeks to 2-3 months. • Fine-tuning (final supervised training) – from 1 day to 1 week. 30 / 55
  • 31.
    Tools for trainingDeep Neural Networks D. Kruchinin, E. Dolotov, K. Kornyakov, V. Kustikova, P. Druzhkov. The Comparison of Deep Learning Libraries on the Problem of Handwritten Digit Classication // Analysis of Images, Social Networks and Texts (AIST), 2015, April, 9-11th, Yekaterinburg. 31 / 55
  • 32.
    Lazy Deep Learning:motivation A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson. CNN Features off-the-shelf: an Astounding Baseline for Recognition //2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 23-28 June 2014, Columbus, USA, p. 512 – 519. 32 / 55
  • 33.
    Lazy Deep Learning:bechmark results A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson. CNN Features off-the-shelf: an Astounding Baseline for Recognition //2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 23-28 June 2014, Columbus, USA, p. 512 – 519. 33 / 55
  • 34.
    Convolutional Neural Networks:Return of Jedi Andrej Karpathy and Fei-Fei. CS231n: Convolutional Neural Networks for Visual Recognition http://cs231n.github.io/convolutional-networks Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook 34 / 55
  • 35.
    AlexNet, 2012 —MeGa HiT A. Kryzhevsky, I. Sutskever, G.E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks // Advances in Neural Information Processing Systems 25 (NIPS 2012). 35 / 55
  • 36.
    AlexNet, results onISVRC-2012 A. Kryzhevsky, I. Sutskever, G.E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks // Advances in Neural Information Processing Systems 25 (NIPS 2012). 36 / 55
  • 37.
    Convolution Layer Andrej Karpathyand Fei-Fei. CS231n: Convolutional Neural Networks for Visual Recognition http://cs231n.github.io/convolutional-networks Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook 37 / 55
  • 38.
    Pooling layer Andrej Karpathyand Fei-Fei. CS231n: Convolutional Neural Networks for Visual Recognition http://cs231n.github.io/convolutional-networks Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook 38 / 55
  • 39.
    Implementation tricks: im2col K.Chellapilla, S. Puri, P. Simard. High Performance Convolutional Neural Networks for Document Processing // International Workshop on Frontiers in Handwriting Recognition, 2006. 39 / 55
  • 40.
    Implementation tricks: im2colfor convolution K. Chellapilla, S. Puri, P. Simard. High Performance Convolutional Neural Networks for Document Processing // International Workshop on Frontiers in Handwriting Recognition, 2006. 40 / 55
  • 41.
    Activation functions Andrej Karpathyand Fei-Fei. CS231n: Convolutional Neural Networks for Visual Recognition http://cs231n.github.io/convolutional-networks Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook 𝑓(𝑥) = max 0, 𝑥 𝑓′ 𝑥 = 1, 𝑥 ≥ 0 0, 𝑥 < 0 ReLU activation function 41 / 55
  • 42.
    Development of AleksNeton OpenCV VGG MatConvNet: CNNs for MATLAB http://www.vlfeat.org/matconvnet/ mexopencv:MATLAB-OpenCV interface http://kyamagu.github.io/mexopencv/matlab MatConvNet, MATLAB + CUDA OpenCV app, C++ YAML YAML 42 / 55
  • 43.
    ZZ Photo –photo organizer Free beta version is available on http://zzphoto.me 43 / 55
  • 44.
    MIT-8 toy problem:formulation • 8 classes • 2688 images in total • TRAIN: 2000 images, 250 per class • TEST: 688 images, ~86 per class S. Banerji, A. Verma, C. Liu. Novel Color LBP Descriptors for Scene and Image Texture Classification // Cross Disciplinary Biometric Systems, 2012, 15th International Conference on Image Processing, Computer Vision, and Pattern Recognition, Las Vegas, Nevada, pp. 205-225. 44 / 55
  • 45.
    MIT-8 toy problem:results Acc. TRAIN Acc. TEST 1 LBP + SVM with RBF Kernel 27,2% 19,0% 2 LPQ + SVM with RBF kernel 38,4% 30,5% 3 LBP + SVM with χ2 kernel 94,2% 74,0% 4 LPQ + SVM with χ2 kernel 99,1% 82,2% 5 Deep CNN (AlexNet) + SVM RBF kernel (LAZY DL) 95,1% 91,8% 6 Deep CNN (AlexNet) + SVM with χ2 Kernel (LAZY DL) 100,0% 93,2% 7 Deep CNN (AlexNet) + MLP (LAZY DL) 100,0% 92,3% Original results, to be published. 45 / 55
  • 46.
    Pets detection problem(Kaggle Dataset + random Other images) • Kaggle Dataset + random “other” images; • 2 classes (cats & dogs VS other); • TRAIN: 1,000 samples; • TEST: 12,000 samples. 46 / 55
  • 47.
    Viola-Jones Object Detector •Very popular for Human Face Detection. • May be trained for Cat and Dog Face detection. • Available free in OpenCV library (http://opencv.org). O. Parkhi, A. Vedaldi, C. V. Jawahar, and A. Zisserman. The Truth about Cats and Dogs // Proceedings of the International Conference on Computer Vision (ICCV), 2011. J. Liu, A. Kanazawa, D. Jacobs, P. Belhumeur. Dog Breed Classification Using Part Localization // Lecture Notes in Computer Science Volume 7572, 2012, pp 172- 185. 47 / 55
  • 48.
    Images pyramid forViola-Jones 48 / 55
  • 49.
    Viola-Jones Object DetectorClassifier Structure 49 / 55 P. Viola, M. Jones. Rapid object detection using a boosted cascade of simple features // Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001.
  • 50.
    Pets detection results:FAR vs FRR graphs Original results, to be published. 50 / 55
  • 51.
    Pets detection results:FAR = 0.5% Error, % 1 Viola-Jones Face Detector for Cats & Dogs + LBP + SVM 79,73% 2 AlexNet, argmax (STANDARD DL, ImageNet-2012, 1000) 26,11% 3 AlexNet, sum (STANDARD DL, ImageNet-2012, 1000) 26,11% 4 AlexNet + SVM linear (LAZY DL) 4,35% Original results, to be published. 51 / 55
  • 52.
    Pet detection results: ROC curve Original results, to be published. 52 / 55
  • 53.
    Labeled Faces inthe Wild (LFW) Dataset G. B. Huang, M. Ramesh, T. Berg, E. Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments // University of Massachusetts, Amherst, Technical Report 07-49, October, 2007 • more than 13,000 images of faces collected from the web. • Pairs comparison, restricted mode. • test: 10-fold cross- validation, 6000 face pairs. 53 / 55
  • 54.
    Face Recognition onLWF, results 54 / 55 Y. Taigman, M. Yang, M. Ranzato, L. Wolf. DeepFace: Closing the Gap to Human- Level Performance in Face Verification, 2014, CVPR. Error, % 1 Principal Component Analysis (EigenFaces) 60,2% 2 Local Binary Pattern Histograms (LBP) 72,4% 3 Deep CNN (AlexNet) + Euclid (LAZY DL) 71,0% 4 DeepFace by Facebook (STANDARD DL) 97,25%
  • 55.