convolutional neural networks for image classification
Evidence from Kaggle National Data Science Bowl
Dmytro Mishkin, ducha.aiki at gmail com
March 25, 2015
Czech Technical University in Prague
kaggle national data science bowl overview
The image classification problem
• 130,400 test images
• 30,336 train images
• 1 channel (grayscale)
• 121 (biased) classes
• 90% of images ≤ 100x100 px
• logloss score = $-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij} \log p_{ij}$ (see the sketch below)
• No external data
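A minimal NumPy sketch of this logloss metric, assuming one-hot labels and predicted probabilities; the clipping epsilon is an illustrative choice to avoid log(0), not something specified by the competition:

import numpy as np

def multiclass_logloss(y_true, p_pred, eps=1e-15):
    # y_true: (N, M) one-hot labels, p_pred: (N, M) predicted probabilities
    p = np.clip(p_pred, eps, 1 - eps)        # avoid log(0)
    p = p / p.sum(axis=1, keepdims=True)     # renormalize rows after clipping
    return -np.mean(np.sum(y_true * np.log(p), axis=1))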
classes diagram
1 url: http://npow.github.io/plankton/viewer/index.html.
final leaderboard
Which approach to use?
lunch time chat at kth’s computer vision group
• a computer vision scientist: How long does it take to train these generic features on ImageNet?
• Hossein: 2 weeks
• Ali: almost 3 weeks depending on the hardware
• the computer vision scientist: hmmmm...
• Stefan: Well, you have to compare the three weeks to the last 40 years of computer vision2
2 http://www.csc.kth.se/cvap/cvg/DL/ots/
convolutional networks
CNNs are the state of the art in image recognition fields such as:3
– Object Image Classification
– Scene Image Classification
– Action Image Classification
– Object Detection
– Semantic Segmentation
– Fine-grained Recognition
– Attribute Detection
– Metric Learning
– Instance Retrieval (almost).
3 beat classic computer vision methods in 19 datasets out of 20: http://www.csc.kth.se/cvap/cvg/DL/ots/
contents
1. Basics of convolutional networks
2. Image preprocessing
3. Network architectures
4. Ensembling
5. What does and does not (seem to) work
6. Winner's solution highlights
basics of convolutional networks
what is convolution
[Figure: convolution illustration4]
4 https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html
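For reference, a small NumPy sketch of the 2D convolution (correlation-style, no padding, stride 1) that a conv layer applies to a single-channel image; the function name and the example kernel are illustrative assumptions, not code from the talk:

import numpy as np

def conv2d_valid(image, kernel):
    # Slide a square k x k kernel over a 2D image, no padding, stride 1.
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# Example: a 3x3 edge-detection kernel on a random grayscale patch
patch = np.random.rand(8, 8)
edge = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
print(conv2d_valid(patch, edge).shape)  # (6, 6)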
softmax classifier
Softmax (cross-entropy) loss:
$L = -\log \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}$
SVM (hinge) loss:
$L = \sum_{j \neq y_i} \max(0, f(x_i, W)_j - f(x_i, W)_{y_i} + \Delta)$
5 http://vision.stanford.edu/teaching/cs231n/linear-classify-demo/
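A NumPy sketch of both losses for a single example, following the formulas above; the variable names and the numeric example are illustrative:

import numpy as np

def softmax_loss(f, y):
    # Cross-entropy loss: L = -log( e^{f_y} / sum_j e^{f_j} )
    f = f - np.max(f)                        # shift scores for numerical stability
    return -f[y] + np.log(np.sum(np.exp(f)))

def hinge_loss(f, y, delta=1.0):
    # Multiclass SVM loss: L = sum_{j != y} max(0, f_j - f_y + delta)
    margins = np.maximum(0.0, f - f[y] + delta)
    margins[y] = 0.0                         # do not count the correct class
    return np.sum(margins)

scores = np.array([2.0, -1.0, 0.5])          # f(x_i, W) for 3 classes
print(softmax_loss(scores, 0), hinge_loss(scores, 0))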
lenet-5. no other layers are necessary
[Figure: LeNet-5 architecture6]
The idea was first proposed by LeCun in 19897 and recently revived by Springenberg et al. in the "All Convolutional Net" paper8.
6 http://eblearn.sourceforge.net/beginner_tutorial2_train.html
7 url: https://www.facebook.com/yann.lecun/posts/10152766574417143.
8 J. T. Springenberg et al. “Striving for Simplicity: The All Convolutional Net”. In: ArXiv e-prints (2014). arXiv: 1412.6806 [cs.LG].
non-linearities
[Plot: activation functions over inputs in [−3, 3] — TanH, Sigmoid, ReLU, maxout (sort of), LeakyReLU]
regularization - dropout, weight decay
[Figure: dropout illustration9]
9 Nitish Srivastava et al. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. In: Journal of Machine Learning Research 15 (2014), pp. 1929–1958. url: http://jmlr.org/papers/v15/srivastava14a.html.
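An illustrative NumPy sketch of the two regularizers named here: inverted dropout at training time and an L2 weight-decay term added to the loss. The drop probability and decay coefficient are common defaults, not values from the slides:

import numpy as np

def dropout_forward(x, p_drop=0.5, train=True):
    # Inverted dropout: zero each activation with probability p_drop and
    # rescale the survivors so the expected activation stays the same;
    # identity at test time.
    if not train:
        return x
    mask = (np.random.rand(*x.shape) >= p_drop) / (1.0 - p_drop)
    return x * mask

def l2_weight_decay(weights, lam=5e-4):
    # Weight decay term added to the data loss: (lam / 2) * sum ||W||^2
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)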
deep learning libraries
Table 1: Popular deep learning GPU libraries
Name          | url                              | languages     | Notes
caffe         | github.com/BVLC/caffe            | C++/Python/no | largest community
cxxnet        | github.com/dmlc/cxxnet           | C++/no        | good memory management
Theano        | github.com/Theano/Theano         | Python        | huge flexibility
Torch         | facebook/fbcunn                  | lua           | LeCun/Facebook library
cuda-convnet2 | code.google.com/p/cuda-convnet2/ | C++/python    |
SparseConvNet | http://tinyurl.com/pu65cfp       | C++/CUDA      | differs from others
image preprocessing
basic network architecture
72x72x1 → crop to 64x64 → 20C5 → MP2 → 50C5 → MP2 → 500IP → clf
(nCk = n convolution filters of size k×k, MPk = k×k max pooling, 500IP = 500-unit inner-product / fully-connected layer)
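The models in this talk were trained with Caffe-era tooling; purely as an illustration, here is how the same 20C5-MP2-50C5-MP2-500IP baseline could be sketched in PyTorch. Layer sizes follow the line above; the placement of the LeakyReLU activations and all names are assumptions:

import torch
import torch.nn as nn

class BaselineNet(nn.Module):
    # 64x64x1 crop -> 20C5 -> MP2 -> 50C5 -> MP2 -> 500IP -> 121-way classifier
    def __init__(self, num_classes=121):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5),   # 20C5: 64 -> 60
            nn.MaxPool2d(2),                   # MP2:  60 -> 30
            nn.LeakyReLU(0.01),
            nn.Conv2d(20, 50, kernel_size=5),  # 50C5: 30 -> 26
            nn.MaxPool2d(2),                   # MP2:  26 -> 13
            nn.LeakyReLU(0.01),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(50 * 13 * 13, 500),      # 500IP
            nn.LeakyReLU(0.01),
            nn.Linear(500, num_classes),       # clf (softmax applied inside the loss)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = BaselineNet()(torch.randn(8, 1, 64, 64))   # -> shape (8, 121)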
basic data preprocessing
Table 2: 5-layer network experiments, 48x48 input image, no non-linearities,
mean pixel extraction
Name, augmentation Val logloss Train logloss
No mean extraction, no scaling – –
mirror 1.67 0.64
histeq, mirror 1.74 0.64
mirror + ReLU 1.61 0.44
mirror + scale 1.42 0.937
mirror + scale LeakyReLU 1.34 0.83
mirror + rand rot 1.53 1.31
basic data preprocessing
Table 3: 5-layer network experiments, 48x48 input image, LeakyReLU
non-linearities, mean pixel extraction
Name, augmentation Val logloss Train logloss
mirror + scale 1.34 0.83
invert, mirror + scale 1.27 0.80
invert, norm, mirror + scale 1.24 0.505
invert, norm, mirror + scale, salt-pepper 1.15 n/a
more geometric transformations
Table 4: 5-layer network experiments, 64x64 input image, LeakyReLU
Name, augmentation Val logloss
mirror 1.30
mirror + scale (resize modes) 1.12
h+v mirror, scale 1.10
h+v mirror, scale + rot 1.08
mirror, lower base lr 1.04 :)
h+v mirror, scale + rot, depolar imgs 1.28
regularization methods
Table 5: 5-layer network experiments, 64x64 input image, LeakyReLU
Name, augmentation Val logloss
h+v mirror, scale + rot, vanilla 1.08
h+v mirror, scale + rot, PReLU (but slows down a lot)10 1.03
h+v mirror, scale + rot, BatchNorm11 1.10
h+v mirror, scale + rot, StochPool12 0.98
10 K. He et al. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. In: ArXiv e-prints (2015). arXiv: 1502.01852 [cs.CV].
11 S. Ioffe and C. Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. In: ArXiv e-prints (2015). arXiv: 1502.03167 [cs.LG].
12 M. D. Zeiler and R. Fergus. “Stochastic Pooling for Regularization of Deep Convolutional Neural Networks”. In: ArXiv e-prints (2013). arXiv: 1301.3557 [cs.LG].
data augmentation - don't forget about it during test time
for rotation in {0, 90, 180, 270} degrees:
    for crop in 9 crops (N, NE, E, ...):
        get predictions for mirrored and non-mirrored versions
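A hedged NumPy sketch of this test-time augmentation loop; the predict callable and the 3x3 crop grid stand in for the actual model and cropping code, which are not shown in the talk:

import numpy as np

def predict_with_tta(predict, image, crop_size):
    # Average class probabilities over 4 rotations x 9 crops x 2 mirror states.
    # predict: maps a crop_size x crop_size image to a probability vector.
    preds = []
    for k in range(4):                               # 0, 90, 180, 270 degrees
        rot = np.rot90(image, k)
        H, W = rot.shape
        ys = [0, (H - crop_size) // 2, H - crop_size]
        xs = [0, (W - crop_size) // 2, W - crop_size]
        for y in ys:
            for x in xs:                             # 9 crops (corners, edges, center)
                crop = rot[y:y + crop_size, x:x + crop_size]
                preds.append(predict(crop))
                preds.append(predict(np.fliplr(crop)))   # mirrored version
    return np.mean(preds, axis=0)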
network architectures
cifar/lenet for testing
Pros
+ Training time ~20 min
+ Can be done in parallel
+ Therefore, lots of experiments
Cons
- Not complex enough to check some things (e.g. BatchNorm)
- So it might lead to wrong conclusions about "bad" things (e.g. random rotations hurt CifarNets, but help VGGNets)
- Or about "good" things (e.g. stochastic pooling helps CifarNets, but does nothing for VGGNets)
We need to go deeper
googlenet
GoogLeNet architecture13
13 C. Szegedy et al. “Going Deeper with Convolutions”. In: ArXiv e-prints (2014). arXiv: 1409.4842 [cs.CV].
googlenet
22 layers, but a simple base brick – the "Inception" module
internal ensemble
Take the mean of all auxiliary classifiers instead of just throwing them away (a sketch follows after the tables).
Table 6: GoogLeNet, validation loss
Name Public LB
clf on inc3 0.722
clf on inc4a 0.754
clf on inc4b 0.757
clf on inc5b 0.855
average 0.693
Table 7: VGGNet, validation loss
Name Public LB
clf on pool4 0.762
clf on pool5 0.657
clf on fc7 0.707
average 0.630
14 J. Xie, B. Xu, and Z. Chuang. “Horizontal and Vertical Ensemble with Deep Representation for Classification”. In: ArXiv e-prints (2013). arXiv: 1306.2759 [cs.LG].
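A minimal sketch of this idea: at test time, average the probability outputs of every auxiliary head instead of keeping only the last one. The head names are placeholders:

import numpy as np

def internal_ensemble(head_probs):
    # head_probs: dict of {head_name: (N, 121) probability array}, e.g. the
    # predictions of clf-on-inc3 ... clf-on-inc5b from the table above.
    stacked = np.stack(list(head_probs.values()), axis=0)
    return stacked.mean(axis=0)                  # arithmetic mean over heads

# Hypothetical usage with four GoogLeNet heads:
# probs = internal_ensemble({"inc3": p3, "inc4a": p4a, "inc4b": p4b, "inc5b": p5b})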
googlenet-results
Table 8: GoogLeNet, 64x64 input image, Leaky ReLU (if not stated otherwise), AlexNet-oversample
Name Public LB
No inv, scale, ReLU, last-clf 0.910
No inv, scale, ReLU 0.859
No inv, scale 0.816
No inv scale, maxout-clf 0.785
Inv, scale, maxout-clf, retrain 0.703
96x96, inv, scale, maxout-clf, retrained, no-aug-ft15 0.684
112x112, inv, scale, maxout-clf, retrained, no-aug-ft. 0.716
48x48, inv, scale, maxout-clf, retrained, no-aug-ft. + test rot 0.749
96x96, inv, scale, maxout-clf, retrained, no-aug-ft. + test rot 0.679
48x48+96x96+112x112, inv, scale, maxout-clf, retrained, no-aug-ft 0.677
15 Ben Graham's trick: finetune the converged model for 1-5 epochs without data augmentation, with a small lr. http://blog.kaggle.com/2015/01/02/cifar-10-competition-winners-interviews-with-dr-ben-graham-phil-culliton-zygmu
vggnet
VGGNet architectures16
Differences: Dropout in conv-layers (0.3), SPP-pooling for pool5, LeakyReLU,
aux. clf.
16 K. Simonyan and A. Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: ArXiv e-prints (Sept. 2014). arXiv: 1409.1556 [cs.CV].
spatial pyramid pooling
[Figure: spatial pyramid pooling17]
17 K. He et al. “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition”. In: ArXiv e-prints (2014). arXiv: 1406.4729 [cs.CV].
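As an illustration of the SPP idea (not the code used here), a PyTorch sketch that max-pools a conv feature map into a fixed pyramid of bins and concatenates them, so a fully-connected layer can follow regardless of input size; the bin sizes are an assumption:

import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    # Pool an (N, C, H, W) feature map into fixed bins (e.g. 4x4, 2x2, 1x1)
    # and concatenate, giving an (N, C * sum(b*b)) vector for any H, W.
    def __init__(self, bins=(4, 2, 1)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveMaxPool2d(b) for b in bins])

    def forward(self, x):
        n = x.size(0)
        return torch.cat([pool(x).view(n, -1) for pool in self.pools], dim=1)

# 512-channel pool5 features of any spatial size -> 512 * (16 + 4 + 1) = 10752 dims
feats = SpatialPyramidPooling()(torch.randn(2, 512, 7, 9))
print(feats.shape)   # torch.Size([2, 10752])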
vggnet-results
Table 9: VGGNet, 64x64 input image, Leaky ReLU (if not stated otherwise), AlexNet-oversample, no-SPP
Name Public LB
No inv, scale, ReLU, fc-maxout 0.752
Inv, scale, single random crop 0.773
Inv, scale, 50 random crops 0.751
Inv, scale, 0.729
Inv, scale, retrained 0.720
Inv, scale, fc-maxout 0.662
Inv, scale, fc-maxout, SPP 0.654
All VGGNets Mix 0.650
sparseconvnet
– 0.79 LB Score
– Unusual library
– C2 instead of C3 convolution
– Padding only for the input image
– Kaggle CIFAR-10 winning architecture
320C2 - 320C2 - MP2 -
640C2 - 10% dropout - 640C2 - 10% dropout - MP2 -
960C2 - 20% dropout - 960C2 - 20% dropout - MP2 -
1280C2 - 30% dropout - 1280C2 - 30% dropout - MP2 -
1600C2 - 40% dropout - 1600C2 - 40% dropout - MP2 -
1920C2 - 50% dropout - 1920C1 - 50% dropout - 121C1 - Softmax output
ensemble-results
Table 10: Different mixes of all models (3 GoogLeNets, 4 VGGNets, 1 SparseConvNet)
Name Public LB Private LB
4 VGG 0.650 0.651
3 VGG, 1 GLN 0.625 0.629
4 VGG, 3 GLN 0.617 0.618
4 VGG, 3 GLN, 1 Sparse 0.611 0.616
4 VGG, 3 GLN, 1 Sparse, figure-skating 0.609 0.613
misc
batchnorm
Works for CIFAR.
But no big difference for VGGNet in KNDB for me. However, it works for other people, e.g. Jae Hyun Lim18 (22nd place).
18 https://github.com/lim0606/ndsb
what else seems to work here
– Retrain top layers with a different non-linearity (cheat diversity)
– Figure-skating average – throw away the max and min predictions (0.003 LB score); see the sketch below
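A small NumPy sketch of this figure-skating average over the per-model predictions: for each sample and class, drop the highest and the lowest prediction and average the rest. The renormalization step is an added assumption so rows still sum to one:

import numpy as np

def figure_skating_average(model_probs):
    # model_probs: (n_models, N, 121) array of per-model class probabilities.
    sorted_probs = np.sort(model_probs, axis=0)    # sort across models
    trimmed = sorted_probs[1:-1]                   # drop the min and the max
    avg = trimmed.mean(axis=0)
    return avg / avg.sum(axis=1, keepdims=True)    # renormalize rows (assumption)

# Hypothetical usage with the 8 models from Table 10:
# final = figure_skating_average(np.stack(per_model_probs))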
what does not seem to work here
– Dense SIFT + BOW / Fisher Vector: ~60% accuracy
– Random forest on CNN features: ~65% accuracy
– Mix of hinge and cross-entropy losses
– Averaging with a mean other than the arithmetic mean
– Image enhancement or preprocessing (histogram equalization, etc.)
winner's solution highlights
team work
– Roll-pool
– Hand-engineered features
– RMS-Pool
– Knowledge distillation19
19 http://benanne.github.io/2015/03/17/plankton.html
Questions?
thanks
This nice presentation theme is taken from
github.com/matze/mtheme
The theme itself is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License.
