Devil in the Details: Analysing the Performance of ConvNet Features
Ken Chatfield - University of Oxford
May 2015
The Devil is still in the Details
• This work is about comparing the latest ConvNet-based feature representations on common ground
• We compare both different pre-trained network architectures and different learning heuristics
Comparing Apples to Apples
[Figure: a fixed input dataset is fed to several feature extractors (CNN Arch 1, CNN Arch 2, IFV, …), all under fixed learning and a fixed evaluation protocol.]
Performance Evolution over VOC2007
[Bar chart: mAP on VOC2007 by year (2008, 2010, 2013, 2014, 2015); methods from DeCAF onwards are CNN-based.]

Method        Dim.   Aug.   mAP
BOW           32K    –      54.48
IFV-BL        327K   –      61.69
IFV           84K    –      64.36
IFV           84K    f s    68.02
DeCAF         4K     t t    73.41
CNN-F         4K     f s    77.15
CNN-M 2K      2K     f s    80.13
CNN-S (TN)    4K     f s    82.42
VGG-D+E       4K     S s    89.70
Evaluation Setup
[Figure: pipeline. A net pre-trained on 1,000 ImageNet classes acts as a CNN feature extractor (4096-D feature vector out); an SVM classifier is trained on the training set and applied to the test set, and the classifier output is evaluated using mAP, accuracy, etc.]
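The "evaluate using mAP" step can be made concrete. Below is a minimal sketch of non-interpolated average precision over ranked classifier scores (note the actual VOC protocol uses 11-point interpolated AP; all function names here are illustrative, not from the released code):

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision@k taken at each positive, ranked by score."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # rank by descending score
    ranked = np.asarray(labels)[order]
    hits = np.cumsum(ranked)                              # positives seen so far at each rank
    precision_at_pos = hits[ranked == 1] / (np.flatnonzero(ranked == 1) + 1)
    return float(precision_at_pos.mean())

def mean_average_precision(score_matrix, label_matrix):
    """mAP: average AP over classes (one score/label column per class)."""
    return float(np.mean([average_precision(s, l)
                          for s, l in zip(score_matrix.T, label_matrix.T)]))
```

For example, scores [0.9, 0.8, 0.7, 0.6] with labels [1, 0, 1, 1] rank the one negative second, giving precisions 1/1, 2/3, 3/4 at the three positives.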
Outline
1. Different pre-trained networks
2. Data augmentation (for both CNN and IFV)
3. Dataset fine-tuning
Network Architectures
• CNN-F Network
• CNN-M Network
• CNN-S Network
• VGG Very Deep Network
Network Architectures
CNN-F Network
Similar to Krizhevsky et al. (ILSVRC-2012 winner)

conv1: 64×11×11, stride 4
conv2: 256×5×5, stride 1
conv3: 256×3×3, stride 1
conv4: 512×3×3
conv5: 512×3×3
fc6: 4096-D (dropout)
fc7: 4096-D (dropout)
(pooling: ×2, ×2)
Network Architectures
CNN-M Network
Similar to Zeiler & Fergus (ILSVRC-2013 winner)

conv1: 96×7×7, stride 2
conv2: 256×5×5, stride 2
conv3: 512×3×3, stride 1
conv4: 512×3×3
conv5: 512×3×3
fc6: 4096-D (dropout)
fc7: 4096-D (dropout)
(pooling: ×2, ×2)

Smaller receptive window size + stride in conv1
Network Architectures
CNN-S Network
Similar to Overfeat ‘accurate’ network (ICLR 2014)

conv1: 96×7×7, stride 2
conv2: 256×5×5, stride 1
conv3: 512×3×3, stride 1
conv4: 512×3×3
conv5: 512×3×3
fc6: 4096-D (dropout)
fc7: 4096-D (dropout)
(pooling: ×3, ×2)

Smaller stride in conv2
Network Architectures
VGG Very Deep Network
Simonyan & Zisserman (ICLR 2015)
Smaller receptive window size + stride, and deeper

conv1a: 64×3×3, stride 1
conv1b: 64×3×3, stride 1
conv1c: 64×3×3, stride 1
(×2 pool)
conv2a: 128×3×3, stride 1
conv2b: 128×3×3, stride 1
conv2c: 128×3×3, stride 1
fc6: 4096-D (dropout)
fc7: 4096-D (dropout)

A stack of three 3×3 layers uses 3(3²C²) = 27C² weights, versus 7²C² = 49C² for a single 7×7 layer with the same effective receptive field.
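The weight-count comparison above can be checked in a couple of lines (C is the channel count, assumed equal for input and output maps, biases ignored):

```python
def conv_params(kernel, channels):
    """Weights in a conv layer mapping `channels` -> `channels` feature maps."""
    return kernel * kernel * channels * channels

C = 64                                # example channel count
stacked = 3 * conv_params(3, C)       # three 3x3 layers: 3 * (3^2 C^2) = 27 C^2
single = conv_params(7, C)            # one 7x7 layer:    7^2 C^2       = 49 C^2
assert stacked == 27 * C * C and single == 49 * C * C
```

So the deeper stack covers the same 7×7 window with roughly half the parameters, plus two extra non-linearities.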
Pre-trained networks

mAP(VOC07):
DeCAF    73.41
CNN-F    77.38
CNN-M    79.89
CNN-S    79.74
VGG-VD   89.3
Data Augmentation
Given a pre-trained ConvNet, augmentation is applied at test time:
a. extract crops from the input image
b. pool the per-crop CNN features (average, max)
Data Augmentation
a. No augmentation (= 1 image)
b. Flip augmentation (= 2 images)
c. Crop+Flip augmentation (= 10 images: 224×224 crops + their flips)
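A minimal numpy sketch of the 10-image crop+flip scheme (4 corner + centre 224×224 crops and their horizontal flips), with pooling of per-crop features; function names are illustrative, not from the released code:

```python
import numpy as np

def ten_crops(img, size=224):
    """4 corner crops + centre crop of size x size, plus their horizontal flips."""
    h, w = img.shape[:2]
    tops = [0, 0, h - size, h - size, (h - size) // 2]
    lefts = [0, w - size, 0, w - size, (w - size) // 2]
    crops = [img[t:t + size, l:l + size] for t, l in zip(tops, lefts)]
    crops += [c[:, ::-1] for c in crops]          # horizontal flips of the 5 crops
    return np.stack(crops)                        # shape (10, size, size, channels)

def pooled_feature(crops, extractor, pool="avg"):
    """Extract one feature per crop, then pool (average or max) across crops."""
    feats = np.stack([extractor(c) for c in crops])
    return feats.mean(axis=0) if pool == "avg" else feats.max(axis=0)
```

With a 256×256 input this yields a (10, 224, 224, 3) stack; at test time the pooled feature replaces the single-image feature.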
Data Augmentation

mAP(VOC07)                                            IFV     CNN-M
None                                                  64.36   76.97
Flip                                                  64.35   76.99
Crop+Flip (train pooling: sum, test pooling: sum)     66.68   79.44
Crop+Flip (train pooling: none, test pooling: sum)    67.17   79.89
Scale Augmentation
• Training: rescale the image so its shorter side equals S, with S drawn from [Smin, Smax] = [256, 512]; take 224×224 crops + flips
• Testing: evaluate at the scales Q = {Smin, 0.5(Smin + Smax), Smax}
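The scale jittering above can be sketched as follows (nearest-neighbour resize for brevity; the slide does not specify the resampling details, so treat this as an assumption):

```python
import numpy as np

S_MIN, S_MAX = 256, 512

def rescale_shorter_side(img, S):
    """Nearest-neighbour rescale so the shorter image side equals S."""
    h, w = img.shape[:2]
    scale = S / min(h, w)
    nh, nw = round(h * scale), round(w * scale)
    rows = np.minimum((np.arange(nh) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(nw) / scale).astype(int), w - 1)
    return img[rows][:, cols]

def train_scale(rng):
    """Sample a training scale S uniformly from [S_MIN, S_MAX]."""
    return int(rng.integers(S_MIN, S_MAX + 1))

# Test-time scales: Q = {Smin, 0.5(Smin + Smax), Smax}
Q = (S_MIN, (S_MIN + S_MAX) // 2, S_MAX)
```

Each training image is thus seen at a random scale before cropping, while test features are pooled over the fixed scale set Q.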
Fully Convolutional Net
Sermanet et al. 2014 (Overfeat)
• Convert the final fc layers to convolutional layers
• The output is then an activation map, which can be pooled
• 8.8% → 7.5% top-5 val. error on ILSVRC-2014
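The fc→conv conversion can be illustrated in numpy: reshape fc-style weights (toy sizes here, not the real 4096-D layer) into a conv kernel and slide it over a larger-than-training activation map; the result is a spatial score map that can be pooled:

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, F = 8, 5, 7                            # C input channels, K outputs, F x F window
fc_w = rng.standard_normal((K, C * F * F))   # fc layer trained on F x F feature maps
conv_w = fc_w.reshape(K, C, F, F)            # the same weights, viewed as conv filters

feat = rng.standard_normal((C, 10, 10))      # larger input -> larger feature map
H = feat.shape[1] - F + 1                    # output map is H x H (here 4 x 4)
score_map = np.empty((K, H, H))
for i in range(H):
    for j in range(H):
        window = feat[:, i:i + F, j:j + F]
        score_map[:, i, j] = (conv_w * window).sum(axis=(1, 2, 3))

# At the top-left position the conv output equals the original fc layer:
assert np.allclose(score_map[:, 0, 0], fc_w @ feat[:, :F, :F].ravel())
pooled = score_map.mean(axis=(1, 2))         # pool the activation map to K scores
```

No weights change in this conversion; the network simply evaluates the old fc layer densely at every spatial position.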
Fine Tuning
[Figure: the pre-trained network (conv1: 96×7×7, conv2: 256×5×5, conv3–conv5: 512×3×3, fc6/fc7: 4096-D with dropout) ends in an ILSVRC softmax layer. For fine-tuning, the softmax is replaced with an SVM loss over the VOC07 classes and the network is trained on the VOC 2007 train images.]
Fine Tuning

mAP(VOC07):
No TN    79.7
TN-CLS   82.2
TN-RNK   82.4

• TN-CLS – classification loss max{0, 1 − y·wᵀφ(I)}
• TN-RNK – ranking loss max{0, 1 − wᵀ(φ(I_POS) − φ(I_NEG))}
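Both fine-tuning losses are one-line hinge losses; here φ(I) stands for the fc7 feature of image I and w for a class weight vector (the toy vectors below are purely illustrative):

```python
import numpy as np

def cls_hinge(w, phi, y):
    """TN-CLS classification loss: max{0, 1 - y * w^T phi(I)}, with y in {-1, +1}."""
    return max(0.0, 1.0 - y * float(w @ phi))

def rank_hinge(w, phi_pos, phi_neg):
    """TN-RNK ranking loss: max{0, 1 - w^T (phi(I_pos) - phi(I_neg))}."""
    return max(0.0, 1.0 - float(w @ (phi_pos - phi_neg)))
```

The ranking loss only asks that positives score higher than negatives by a margin, which matches ranking-based metrics such as mAP more closely than per-image classification.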
Comparison with State of the Art

Method                   ILSVRC-2012     VOC2007        VOC2012
                         (top-5 err.)    (mAP)          (mAP)
CNN-M 2048               13.5            80.1           82.4
CNN-S                    13.1            79.7           82.9
CNN-S TUNE-RNK           13.1            82.4           83.2
Zeiler & Fergus          16.1            –              79.0
Oquab et al.             18.0            77.7           78.7 (82.8*)
Wei et al.               –               81.5 (85.2*)   81.7 (90.3*)
Clarifai (1 net)         12.5            –              –
GoogLeNet (1 net)        7.9             –              –
VGG Very Deep (1 net)    7.0             89.3           89.0
Take-home Messages
If you get the details right, a relatively simple ConvNet-based pipeline can outperform much more complex architectures.
• Data augmentation helps a lot, both for deep and shallow features
• Fine-tuning makes a difference, and should use a ranking loss where appropriate
• Smaller filters and deeper networks help, although feature computation is slower
There’s more…
• Presented here was just a subset of the full results from the paper
• Check out the paper for full results on:
  • VOC 2007
  • VOC 2012
  • Caltech-101
  • Caltech-256
  • ILSVRC-2012
Source Code
• Caffe-compatible CNN models can be downloaded from the Caffe Model Zoo: https://github.com/BVLC/caffe/wiki/Model-Zoo
• Matlab feature computation code is also available from the project website: http://www.robots.ox.ac.uk/~vgg/software/deep_eval
Related Publications
“Return of the Devil in the Details: Delving Deep into Convolutional Nets”
Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, BMVC 2014 (Best Paper Prize)

“The devil is in the details: an evaluation of recent feature encoding methods”
Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Victor Lempitsky, Andrew Zisserman, BMVC 2011 (Best Poster Prize Honourable Mention, 300+ citations)

http://www.robots.ox.ac.uk/~ken