AI&BigData Lab. Artem Chernodub, "Image Recognition with Lazy Deep Learning in the ZZ Photo Photo Organizer"

Lazy Deep Learning for Images
Recognition in ZZ Photo app
Artem Chernodub, George Paschenko
IMMSP NASU
AI&Big Data Lab, 23 April, 2015, Odessa.
ZZ Photo

$p(x \mid y) = \dfrac{p(y \mid x)\, p(x)}{p(y)}$
Biologically inspired models
Neuroscience
Machine Learning
2 / 55

Biological Neural Networks
3 / 55

Artificial Neural Networks
Traditional (Shallow) Neural
Networks
Deep Neural Networks
Deep Feedforward Neural
Networks
Recurrent Neural Networks
4 / 55

Conventional Methods vs Deep Learning
5 / 55

Deep Learning = Learning of
Representations (Features)
The traditional model of pattern recognition (since the late 50's):
fixed/engineered features + trainable classifier
Hand-crafted Feature Extractor → Trainable Classifier
End-to-end learning / Feature learning / Deep learning:
trainable features + trainable classifier
Trainable Feature Extractor → Trainable Classifier
6 / 55

ImageNet
Le et al. “Building high-level features using large-scale unsupervised learning” ICML
2012.
Model # of parameters Accuracy, %
Deep Net 10M 15.8
best state-of-the-art N/A 9.3
Training data: 16M images, 20K categories
7 / 55

Deep Face (Facebook)
Y. Taigman, M. Yang, M.A. Ranzato, L. Wolf. DeepFace: Closing the Gap to Human-
Level Performance in Face Verification // CVPR 2014.
Model # of parameters Accuracy, %
Deep Face Net 128M 97.35
Human level N/A 97.5
Training data: 4M facial images
8 / 55

TIMIT Phoneme Recognition
Graves, A., Mohamed, A.-R., and Hinton, G. E. (2013). Speech recognition with deep
recurrent neural networks // IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pages 6645–6649. IEEE.
Mohamed, A. and Hinton, G. E. (2010). Phone recognition using restricted Boltzmann
machines // IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 4354–4357.
Model # of parameters Phoneme error rate
Hidden Markov Model, HMM N/A 27.3%
Deep Belief Network, DBN ~4M 26.7%
Deep RNN 4.3M 17.7%
Training data: 462 speakers train / 24 speakers test, 3.16 / 0.14 hrs.
9 / 55

Google Large Vocabulary Speech
Recognition
H. Sak, A. Senior, F. Beaufays. Long Short-Term Memory Recurrent Neural Network
Architectures for Large Scale Acoustic Modeling // INTERSPEECH’2014.
K. Vesely, A. Ghoshal, L. Burget, D. Povey. Sequence-discriminative training of deep
neural networks // INTERSPEECH’2014.
Model # of parameters Cross-entropy
ReLU DNN 85M 11.3
Deep Projection LSTM RNN 13M 10.7
Training data: 3M utterances (1900 hrs).
10 / 55

Classic Feedforward Neural Networks
(before 2006).
• Single hidden layer (Kolmogorov-Cybenko Universal
Approximation Theorem as the main hope).
• Vanishing gradients effect prevents using more layers.
• Less than 10K free parameters.
• Feature preprocessing stage is often critical.
11 / 55

Training the traditional (shallow)
Neural Network: derivative + optimization
12 / 55

1) forward propagation pass
$$z_j = f\Big(\sum_i w^{(1)}_{ji} x_i\Big),$$
$$\tilde{y}(k+1) = g\Big(\sum_j w^{(2)}_j z_j\Big),$$
where $z_j$ is the postsynaptic value for the j-th hidden neuron, $w^{(1)}$ are the hidden layer's
weights, $f()$ are the hidden layer's activation functions, $w^{(2)}$ are the output layer's weights,
and $g()$ are the output layer's activation functions.
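The forward pass above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the tanh hidden activation, the identity output activation, the layer sizes, and the names `forward`, `w1`, `w2` are all assumptions made for the example.

```python
import numpy as np

def forward(x, w1, w2, f=np.tanh, g=lambda a: a):
    """Forward pass of a network with one hidden layer and one output.

    x  : input vector, shape (n_in,)
    w1 : hidden-layer weights w(1), shape (n_hidden, n_in)
    w2 : output-layer weights w(2), shape (n_hidden,)
    f, g : activation functions (tanh and identity are assumed here)
    """
    z = f(w1 @ x)   # z_j = f(sum_i w1[j, i] * x[i])
    y = g(w2 @ z)   # y~ = g(sum_j w2[j] * z[j])
    return z, y

# Hypothetical sizes: 3 inputs, 4 hidden neurons.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
w1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=4)
z, y = forward(x, w1, w2)
```

Writing each layer as a matrix-vector product keeps the code line-for-line comparable with the two sums in the formulas.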
13 / 55

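2) backpropagation pass
Local gradients calculation:
$$\delta^{OUT} = t(k+1) - \tilde{y}(k+1),$$
$$\delta^{HID}_j = f'(z_j)\, w^{(2)}_j\, \delta^{OUT}.$$
Derivatives calculation:
$$\frac{\partial E(k)}{\partial w^{(2)}_j} = \delta^{OUT} z_j,$$
$$\frac{\partial E(k)}{\partial w^{(1)}_{ji}} = \delta^{HID}_j x_i.$$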
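A minimal NumPy sketch of the backpropagation pass, under assumptions that are not on the slide: tanh hidden units, a linear output, a squared error E = (t - y~)^2 / 2, and a step size `eta`. The sketch forms the two local gradients, builds the per-weight products (delta^OUT * z_j and delta^HID_j * x_i), and applies them as a delta-rule update; all numeric values are made up for illustration.

```python
import numpy as np

def backprop_step(x, t, w1, w2, eta=0.1):
    """One delta-rule update built from the two local gradients."""
    z = np.tanh(w1 @ x)                           # hidden outputs z_j
    y = w2 @ z                                    # linear output (assumed g)
    delta_out = t - y                             # delta^OUT
    delta_hid = (1.0 - z ** 2) * w2 * delta_out   # f'(z_j) * w2_j * delta^OUT (tanh)
    step_w2 = delta_out * z                       # delta^OUT * z_j
    step_w1 = np.outer(delta_hid, x)              # delta^HID_j * x_i
    error = 0.5 * delta_out ** 2                  # assumed E(k), measured before the update
    return w1 + eta * step_w1, w2 + eta * step_w2, error

# Hypothetical training sample and initial weights.
x = np.array([0.5, -0.2, 0.1])
t = 0.7
w1 = np.array([[0.1, 0.2, -0.1],
               [0.0, 0.3, 0.2],
               [-0.2, 0.1, 0.1],
               [0.3, -0.1, 0.0]])
w2 = np.array([0.1, -0.1, 0.2, 0.05])

errors = []
for _ in range(50):
    w1, w2, e = backprop_step(x, t, w1, w2)
    errors.append(e)
```

With a small step size the recorded error shrinks from one update to the next, which is a quick sanity check that the local gradients point in a useful direction.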
14 / 55

Bad effect of vanishing (exploding)
gradients: a problem
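$$\frac{\partial E(k)}{\partial w^{(m)}_{ji}} = \delta^{(m)}_j z^{(m-1)}_i,$$
$$\delta^{(m)}_j = {f'_j}^{(m)} \sum_i w^{(m+1)}_{ij}\, \delta^{(m+1)}_i$$
$$\Rightarrow \frac{\partial E(k)}{\partial w^{(m)}_{ji}} \to 0 \quad \text{for } m \to 1.$$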
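The shrinking of the local gradients toward the first layer can be observed numerically. The sketch below is only an illustration under assumed settings (sigmoid units, 8 layers of 50 neurons, Gaussian weights scaled by 1/sqrt(width), and a unit error signal at the top layer); none of these values come from the talk.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def local_gradient_norms(n_layers=8, width=50, seed=0):
    """Return {layer m: ||delta^(m)||} for a random sigmoid network."""
    rng = np.random.default_rng(seed)
    # weights[m - 1] plays the role of W^(m), mapping layer m-1 to layer m.
    weights = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
               for _ in range(n_layers)]
    outputs = [rng.normal(size=width)]            # outputs[0] is the input
    for w in weights:
        outputs.append(sigmoid(w @ outputs[-1]))  # outputs[m] is layer m

    def fprime(m):
        z = outputs[m]
        return z * (1.0 - z)                      # sigmoid derivative via its output

    delta = fprime(n_layers) * np.ones(width)     # top layer, unit error signal (assumed)
    norms = {n_layers: np.linalg.norm(delta)}
    for m in range(n_layers - 1, 0, -1):
        # weights[m] is W^(m+1); the transpose sums over the upper-layer index i.
        delta = fprime(m) * (weights[m].T @ delta)
        norms[m] = np.linalg.norm(delta)
    return norms

norms = local_gradient_norms()
```

Since the sigmoid derivative never exceeds 0.25, each step down the recursion multiplies the local gradient by a small factor, so `norms[1]` ends up far below `norms[8]`.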
15 / 55

Bad effect of vanishing (exploding)
gradients: two hypotheses
1) increased frequency and severity of bad local minima
2) pathological curvature, like the type seen in the well-known Rosenbrock function:
$$f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2$$
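To see why this curvature is troublesome, plain gradient descent can be run on the Rosenbrock function. The starting point (-1.2, 1), the learning rate, and the step count below are illustrative choices, not values from the talk.

```python
def rosenbrock(x, y):
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

def rosenbrock_grad(x, y):
    # Partial derivatives of f(x, y) derived by hand.
    dx = -2.0 * (1.0 - x) - 400.0 * x * (y - x ** 2)
    dy = 200.0 * (y - x ** 2)
    return dx, dy

def gradient_descent(x, y, lr=1e-3, steps=1000):
    for _ in range(steps):
        dx, dy = rosenbrock_grad(x, y)
        x, y = x - lr * dx, y - lr * dy
    return x, y

f_start = rosenbrock(-1.2, 1.0)
x_end, y_end = gradient_descent(-1.2, 1.0)
f_end = rosenbrock(x_end, y_end)
```

The value drops quickly while the iterate falls into the curved valley, then progress along the valley floor toward the minimum at (1, 1) becomes very slow, so `f_end` is still well above zero after 1000 steps.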
16 / 55

Deep Feedforward Neural Networks
• 2-stage training process: i) unsupervised pre-training; ii) fine tuning
(vanishing gradients problem is beaten!).
• Number of hidden layers > 1 (usually 6-9).
• 100K – 100M free parameters.
• No (or less) feature preprocessing stage.
17 / 55

Dimensionality reduction
• Use a stacked RBM as deep auto-encoder
1. Train RBM with images as input & output
2. Limit one layer to few dimensions
→ Information has to pass through middle layer
G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural
Networks // Science 313 (2006), p. 504 – 507.
19 / 55

Figure rows: Original, Deep RBM, PCA
Olivetti face data, 25x25 pixel images reconstructed from 30 dimensions
(625 → 30)
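For the PCA baseline in this comparison, the 625 → 30 reduction and the reconstruction can be sketched with an SVD. The data below is random noise standing in for the flattened 25x25 face images (the Olivetti set is not loaded here), and the function name `pca_reconstruct` is hypothetical.

```python
import numpy as np

def pca_reconstruct(X, k):
    """Project rows of X onto the top-k principal components and map back."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    basis = vt[:k]                    # shape (k, n_features)
    codes = Xc @ basis.T              # low-dimensional codes, shape (n_samples, k)
    recon = codes @ basis + mean      # back to the original space
    return codes, recon

rng = np.random.default_rng(0)
X = rng.random((100, 625))            # placeholder for 100 flattened 25x25 images

codes30, recon30 = pca_reconstruct(X, 30)
_, recon5 = pca_reconstruct(X, 5)
err30 = float(np.mean((X - recon30) ** 2))
err5 = float(np.mean((X - recon5) ** 2))
```

PCA is a linear map in both directions; the deep auto-encoder in the figure replaces both maps with stacked nonlinear layers while keeping the same 30-dimensional bottleneck.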
20 / 55

How to use unsupervised pre-training
stage / 1
21 / 55

stage / 2
22 / 55

stage / 3
23 / 55

stage / 4
24 / 55

Unlabeled data
Unlabeled data is readily available
Example: Images from the web
1. Download 10’000’000 images
2. Train a 9-layer DNN
3. Concepts are formed by DNN
25 / 55

Figure panels: PCA, Deep RBM
804’414 Reuters news stories, reduction to 2 dimensions
26 / 55

Hierarchy of trained representations
Low-level
feature
Middle-level
feature
Top-level
feature
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus
2013]
27 / 55

Hessian-Free optimization: Deep
Learning with no pre-training stage
J. Martens. Deep Learning via Hessian-free Optimization // Proceedings of the 27th
International Conference on Machine Learning (ICML), 2010.
28 / 55

FLOPS comparison
https://ru.wikipedia.org/wiki/FLOPS
Type | Name | Flops | Cost
Mobile | Raspberry Pi 1st Gen, 700 MHz | 0.04 Gflops | $35
Mobile | Apple A8 | 1.4 Gflops | $700 (in iPhone 6)
CPU | Intel Core i7-4930K (Ivy Bridge), 3.7 GHz | 140 Gflops | $700
CPU | Intel Core i7-5960X (Haswell), 3.0 GHz | 350 Gflops | $1300
GPU | NVidia GTX 980 | 4612 Gflops (single precision), 144 Gflops (double precision) | $600 + cost of PC (~$1000)
GPU | NVidia Tesla K80 | 8740 Gflops (single precision), 2910 Gflops (double precision) | $4500 + cost of PC (~$1500)
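The table can be turned into a rough price/performance figure. The sketch below uses the single-precision numbers for the GPUs and adds the quoted PC cost to the GPU price; treating the bare "~1500" as dollars is an assumption, and the result is only as reliable as the list prices above.

```python
# (Gflops, total cost in USD) taken from the table above.
devices = {
    "Raspberry Pi 1st Gen": (0.04, 35),
    "Apple A8": (1.4, 700),
    "Intel Core i7-4930K": (140, 700),
    "Intel Core i7-5960X": (350, 1300),
    "NVidia GTX 980": (4612, 600 + 1000),      # card + PC
    "NVidia Tesla K80": (8740, 4500 + 1500),   # card + PC
}

gflops_per_dollar = {name: gflops / cost for name, (gflops, cost) in devices.items()}
best = max(gflops_per_dollar, key=gflops_per_dollar.get)

for name, value in sorted(gflops_per_dollar.items(), key=lambda kv: kv[1]):
    print(f"{name:24s} {value:8.4f} Gflops per dollar")
```

Under these assumptions the consumer GPU gives the most single-precision throughput per dollar, which matches the talk's focus on GPU training.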
29 / 55

Deep Networks Training time using GPU
• Pretraining – from 2-3 weeks to 2-3 months.
• Fine-tuning (final supervised training) – from
1 day to 1 week.
30 / 55

Tools for training Deep Neural
Networks
D. Kruchinin, E. Dolotov, K. Kornyakov, V. Kustikova, P. Druzhkov. The Comparison of
Deep Learning Libraries on the Problem of Handwritten Digit Classification // Analysis
of Images, Social Networks and Texts (AIST), 2015, April, 9-11th, Yekaterinburg.
31 / 55

Lazy Deep Learning: motivation
A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson. CNN Features off-the-shelf: an
Astounding Baseline for Recognition //2014 IEEE Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), 23-28 June 2014, Columbus, USA, p. 512
– 519.
32 / 55

Lazy Deep Learning: benchmark results
A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson. CNN Features off-the-shelf: an
Astounding Baseline for Recognition //2014 IEEE Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), 23-28 June 2014, Columbus, USA, p. 512
– 519.
33 / 55

Convolutional Neural Networks: Return
of Jedi
Andrej Karpathy and Fei-Fei Li. CS231n: Convolutional Neural Networks for Visual
Recognition http://cs231n.github.io/convolutional-networks
Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press
book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook
34 / 55

AlexNet, 2012 — MeGa HiT
A. Krizhevsky, I. Sutskever, G.E. Hinton. ImageNet Classification with Deep
Convolutional Neural Networks // Advances in Neural Information Processing
Systems 25 (NIPS 2012).
35 / 55

AlexNet, results on ILSVRC-2012
A. Krizhevsky, I. Sutskever, G.E. Hinton. ImageNet Classification with Deep
Convolutional Neural Networks // Advances in Neural Information Processing
Systems 25 (NIPS 2012).
36 / 55

Convolution Layer
37 / 55

Pooling layer
38 / 55

Implementation tricks: im2col
K. Chellapilla, S. Puri, P. Simard. High Performance Convolutional Neural Networks
for Document Processing // International Workshop on Frontiers in Handwriting
Recognition, 2006.
39 / 55

Implementation tricks: im2col for
convolution
K. Chellapilla, S. Puri, P. Simard. High Performance Convolutional Neural Networks
for Document Processing // International Workshop on Frontiers in Handwriting
Recognition, 2006.
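The im2col trick unrolls every receptive field into a column so that convolution becomes a single matrix product. The sketch below is a simplified illustration (single-channel image, stride 1, no padding, no kernel flip, so strictly a cross-correlation as is common in CNN code); the function names are hypothetical and it is not taken from the cited paper or from the talk's implementation.

```python
import numpy as np

def im2col(image, kh, kw):
    """Unroll all kh x kw patches of a 2-D image into columns."""
    h, w = image.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, out_h * out_w))
    idx = 0
    for r in range(out_h):
        for c in range(out_w):
            cols[:, idx] = image[r:r + kh, c:c + kw].ravel()
            idx += 1
    return cols

def conv2d_im2col(image, kernels):
    """Apply a bank of kernels, shape (n_kernels, kh, kw), via one matrix product."""
    n_k, kh, kw = kernels.shape
    h, w = image.shape
    cols = im2col(image, kh, kw)                  # (kh*kw, n_patches)
    out = kernels.reshape(n_k, -1) @ cols         # (n_kernels, n_patches)
    return out.reshape(n_k, h - kh + 1, w - kw + 1)

image = np.arange(16.0).reshape(4, 4)
kernels = np.ones((1, 2, 2))                      # one 2x2 summing kernel
out = conv2d_im2col(image, kernels)
```

The unrolled matrix duplicates overlapping pixels and so costs extra memory, but the product itself can be handed to a highly optimized matrix-multiplication routine.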
40 / 55

Activation functions
$$f(x) = \max(0, x), \qquad f'(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$$
ReLU activation function
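In NumPy the ReLU and its derivative take one line each. The sketch follows the slide's convention that the derivative at x = 0 is 1; the function names are chosen for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 where x >= 0, 0 elsewhere (the convention used on the slide).
    return (np.asarray(x) >= 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
values = relu(x)
grads = relu_grad(x)
```

For positive inputs the derivative is exactly 1, so the local gradient is not scaled down as it is with a saturating activation.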
41 / 55

Development of AlexNet on OpenCV
VGG MatConvNet: CNNs for MATLAB http://www.vlfeat.org/matconvnet/
mexopencv: MATLAB-OpenCV interface http://kyamagu.github.io/mexopencv/matlab
Diagram: MatConvNet (MATLAB + CUDA) and OpenCV app (C++), linked by YAML files
42 / 55

ZZ Photo – photo organizer
Free beta version is available on http://zzphoto.me
43 / 55

MIT-8 toy problem: formulation
• 8 classes
• 2688 images in total
• TRAIN: 2000 images,
250 per class
• TEST: 688 images,
~86 per class
S. Banerji, A. Verma, C. Liu. Novel Color LBP Descriptors for Scene and Image
Texture Classification // Cross Disciplinary Biometric Systems, 2012, 15th
International Conference on Image Processing, Computer Vision, and Pattern
Recognition, Las Vegas, Nevada, pp. 205-225.
44 / 55

MIT-8 toy problem: results
# | Method | Acc. TRAIN | Acc. TEST
1 | LBP + SVM with RBF kernel | 27.2% | 19.0%
2 | LPQ + SVM with RBF kernel | 38.4% | 30.5%
3 | LBP + SVM with χ2 kernel | 94.2% | 74.0%
4 | LPQ + SVM with χ2 kernel | 99.1% | 82.2%
5 | Deep CNN (AlexNet) + SVM with RBF kernel (LAZY DL) | 95.1% | 91.8%
6 | Deep CNN (AlexNet) + SVM with χ2 kernel (LAZY DL) | 100.0% | 93.2%
7 | Deep CNN (AlexNet) + MLP (LAZY DL) | 100.0% | 92.3%
Original results, to be published.
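The slides do not give the exact form of the χ2 kernel used with the SVM. One common variant is the exponential chi-squared kernel, sketched below under that assumption; `gamma` and the small `eps` guard are hypothetical parameters, and the inputs must be non-negative (histograms such as LBP, or ReLU-style CNN activations).

```python
import numpy as np

def chi2_kernel(A, B, gamma=1.0, eps=1e-12):
    """Exponential chi-squared kernel between rows of A and rows of B.

    k(a, b) = exp(-gamma * sum_i (a_i - b_i)^2 / (a_i + b_i))
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    diff = A[:, None, :] - B[None, :, :]
    summ = A[:, None, :] + B[None, :, :]
    dist = np.sum(diff ** 2 / (summ + eps), axis=2)   # eps avoids division by zero
    return np.exp(-gamma * dist)

# Hypothetical non-negative feature vectors.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
K = chi2_kernel(H, H)
```

A precomputed matrix of this kind can be passed to an SVM implementation that accepts custom kernels; the fixed CNN features stay untouched and only the classifier is trained, which is the "lazy" part of the approach.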
45 / 55

Pets detection problem (Kaggle
Dataset + random Other images)
• Kaggle Dataset +
random “other” images;
• 2 classes (cats & dogs
VS other);
• TRAIN: 1,000 samples;
• TEST: 12,000 samples.
46 / 55

Viola-Jones Object Detector
• Very popular for Human Face Detection.
• May be trained for Cat and Dog Face detection.
• Available free in OpenCV library (http://opencv.org).
O. Parkhi, A. Vedaldi, C. V. Jawahar, and A. Zisserman. The Truth about Cats and
Dogs // Proceedings of the International Conference on Computer Vision (ICCV), 2011.
J. Liu, A. Kanazawa, D. Jacobs, P. Belhumeur. Dog Breed Classification Using Part
Localization // Lecture Notes in Computer Science Volume 7572, 2012, pp 172-185.
47 / 55

Images pyramid for Viola-Jones
48 / 55

Viola-Jones Object Detector Classifier
Structure
49 / 55
P. Viola, M. Jones. Rapid object detection using a boosted cascade of simple
features // Proceedings of the 2001 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, CVPR 2001.

Pets detection results: FAR vs FRR graphs
50 / 55

Pets detection results: FAR = 0.5%
Error, %
1 Viola-Jones Face Detector for Cats & Dogs + LBP + SVM 79.73%
2 AlexNet, argmax (STANDARD DL, ImageNet-2012, 1000) 26.11%
3 AlexNet, sum (STANDARD DL, ImageNet-2012, 1000) 26.11%
4 AlexNet + SVM linear (LAZY DL) 4.35%
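Reporting an error at a fixed FAR means choosing the decision threshold from the negative ("other") scores and then measuring how many true pets are rejected. The sketch below shows one way to do this, reading FAR as the false acceptance rate and FRR as the false rejection rate; the scores are synthetic and the helper `frr_at_far` is hypothetical, not the evaluation code behind the table.

```python
import numpy as np

def frr_at_far(pos_scores, neg_scores, target_far=0.005):
    """Pick the threshold that accepts at most target_far of negatives.

    A sample is accepted as a pet when its score is strictly above the threshold.
    """
    pos_scores = np.asarray(pos_scores, dtype=float)
    neg_scores = np.asarray(neg_scores, dtype=float)
    neg_desc = np.sort(neg_scores)[::-1]
    # Number of negatives that may be accepted (tiny epsilon guards float rounding).
    allowed = int(np.floor(target_far * len(neg_desc) + 1e-9))
    threshold = neg_desc[allowed]
    far = float(np.mean(neg_scores > threshold))    # false acceptance rate
    frr = float(np.mean(pos_scores <= threshold))   # false rejection rate
    return threshold, far, frr

# Synthetic classifier scores: negatives in [0, 1), positives in [0.5, 1.5).
neg = np.linspace(0.0, 1.0, 1000, endpoint=False)
pos = np.linspace(0.5, 1.5, 1000, endpoint=False)
threshold, far, frr = frr_at_far(pos, neg)
```

Sweeping `target_far` over a range of values and plotting the resulting pairs gives FAR vs FRR and ROC-style curves.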
51 / 55

Pet detection results : ROC curve
52 / 55

Labeled Faces in the Wild (LFW) Dataset
G. B. Huang, M. Ramesh, T. Berg, E. Learned-Miller. Labeled Faces in the Wild: A
Database for Studying Face Recognition in Unconstrained Environments // University
of Massachusetts, Amherst, Technical Report 07-49, October, 2007
• more than 13,000 images
of faces collected from the
web.
• Pairs comparison,
restricted mode.
• test: 10-fold cross-
validation, 6000 face
pairs.
53 / 55

Face Recognition on LFW, results
54 / 55
Y. Taigman, M. Yang, M. Ranzato, L. Wolf. DeepFace: Closing the Gap to Human-
Level Performance in Face Verification, 2014, CVPR.
Accuracy, %
1 Principal Component Analysis (EigenFaces) 60.2%
2 Local Binary Pattern Histograms (LBP) 72.4%
3 Deep CNN (AlexNet) + Euclid (LAZY DL) 71.0%
4 DeepFace by Facebook (STANDARD DL) 97.25%
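The "Deep CNN + Euclid" row compares two face images by the Euclidean distance between their feature vectors. A minimal sketch of that pair-comparison step is given below; the 2-D feature vectors and the threshold are made up for illustration, whereas in the talk's setting the features would come from the CNN and the threshold would be chosen on the training folds.

```python
import numpy as np

def same_person(feat_a, feat_b, threshold):
    """Declare a pair 'same' when the Euclidean distance is below the threshold."""
    dist = np.linalg.norm(feat_a - feat_b, axis=1)
    return dist < threshold

def pair_accuracy(feat_a, feat_b, is_same, threshold):
    """Fraction of pairs whose prediction matches the ground-truth label."""
    return float(np.mean(same_person(feat_a, feat_b, threshold) == is_same))

# Hypothetical features for four image pairs (row i of feat_a vs row i of feat_b).
feat_a = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 2.0], [3.0, 0.0]])
feat_b = np.array([[0.1, 0.0], [1.0, 1.2], [2.0, 2.0], [3.0, 0.4]])
is_same = np.array([True, True, False, True])

acc = pair_accuracy(feat_a, feat_b, is_same, threshold=0.3)
```

Here the last pair lies at distance 0.4 and is wrongly rejected, so three of the four pairs are classified correctly.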

contact: a.chernodub@gmail.com
Thanks!
