Journal Club
Visualizing Object Detection Features
Vondrick, C. et al.
In Proc. International Conference on Computer Vision 2013

2015/03/26 shouno@uec.ac.jp
Overview
•  A typical object detection pipeline combines
   –  feature extraction (HOG, SIFT, etc.)
   –  a classifier (SVM, log. regression, etc.)
•  This paper visualizes what the feature space itself "sees"
   by inverting the features back into natural images
•  Target feature: HOG
Running example: "Car" detection with HOG + SVM
HOG (Histogram of Oriented Gradients)
•  Dalal & Triggs (2005)
   –  Gradient-orientation histogram features, originally used with a
      linear SVM for pedestrian detection
•  For each cell, accumulate a histogram v of gradient orientations,
   weighted by gradient magnitude
•  1 block = 3 cells × 3 cells; each block vector is L2-normalized,
   and the cell/block histograms are concatenated into the descriptor
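The per-cell histogram plus block normalization above can be sketched in a few lines of NumPy (a toy illustration only, not the Dalal-Triggs reference implementation; `hog_toy` and its parameter defaults are names chosen here for the sketch):

```python
import numpy as np

def hog_toy(img, cell=8, bins=9, block=3):
    """Minimal HOG sketch: per-cell orientation histograms,
    then L2 normalization over overlapping block x block cell groups."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # unsigned orientation in [0, pi)
    H, W = img.shape
    ch, cw = H // cell, W // cell
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            idx = np.minimum((a / np.pi * bins).astype(int), bins - 1)
            for b in range(bins):
                hist[i, j, b] = m[idx == b].sum()
    feats = []
    for i in range(ch - block + 1):
        for j in range(cw - block + 1):
            v = hist[i:i+block, j:j+block].ravel()
            feats.append(v / (np.linalg.norm(v) + 1e-6))  # L2 block norm
    return np.concatenate(feats)
```

On a 32×32 input with these defaults this yields a 2×2 grid of 3×3-cell blocks, i.e. 4 × 81 = 324 features.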
Example: "Car" templates and their HOG features visualized via the HOG glyph
Method: Paired Dictionary

Write the HOG feature x as a sparse combination over a dictionary
U = [u_1, ..., u_M]:

    x = α_1 u_1 + α_2 u_2 + ... + α_M u_M

and the image y over a paired dictionary V = [v_1, ..., v_M] that
shares the same coefficients:

    y = α_1 v_1 + α_2 v_2 + ... + α_M v_M

The inversion is then

    ŷ = f(x) = V α̂,  where  α̂ = arg min_α ||x − U α||²₂  s.t.  ||α||₁ ≤ λ
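The inversion ŷ = V α̂ can be sketched with a simple proximal-gradient (ISTA) solver for the penalized form of the sparse coding problem (a minimal stand-in for the paper's actual solver; `sparse_code` and `invert_hog` are illustrative names, and the penalty weight `lam` plays the role of the constraint bound λ):

```python
import numpy as np

def sparse_code(x, U, lam=0.1, steps=200):
    """Estimate alpha = argmin 0.5*||x - U alpha||_2^2 + lam*||alpha||_1
    via ISTA (proximal gradient descent)."""
    L = np.linalg.norm(U, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(U.shape[1])
    for _ in range(steps):
        g = U.T @ (U @ a - x)              # gradient of the quadratic term
        z = a - g / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return a

def invert_hog(x, U, V, lam=0.1):
    """Paired-dictionary inversion: code x over U, decode with V."""
    return V @ sparse_code(x, U, lam)
```

With paired dictionaries in hand, inverting a HOG vector is just one sparse-coding solve followed by a matrix product.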
The dictionaries U and V are learned jointly by paired dictionary learning.
Comparison with the HOG glyph:
Figure 3: We visualize some high scoring detections from the deformable parts model. Can you guess which are false alarms? Take a minute to study this figure.
Figure 4: In this paper, we present algorithms to visualize HOG features.
(a) Human Vision (b) HOG Vision
Figure 13: HOG inversion reveals the world that object detectors see. The left shows a man standing in a dark room.
Invert HOG demo code: http://web.mit.edu/vondrick/ihog/
Effect of HOG template (cell) size
40 × 40   20 × 20   10 × 10   5 × 5
Figure 12: Our inversion algorithms are sensitive to the HOG template size. We show how performance degrades as the template becomes smaller.
Original  ELDA  Ridge  Direct  PairDict
Figure 8: We show results for all four of our inversion algorithms on held out image patches on similar dimensions common for object detection.
The four inversion algorithms compared:
•  ELDA: train an exemplar LDA detector on the HOG feature and average its top K retrieved patches
•  Ridge: ridge regression from the HOG feature to the image
•  Direct: directly learn a single basis U of Gabor-like filters mapping HOG to images
•  PairedDict: the paired dictionary learning method above
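The Ridge baseline admits a closed form. A minimal sketch, assuming HOG features X and images Y are stored column-wise (the function name and regularizer value are assumptions of this sketch):

```python
import numpy as np

def fit_ridge(X, Y, lam=1.0):
    """Ridge-regression baseline: learn a linear map W with y ~ W x,
    W = Y X^T (X X^T + lam I)^{-1}.
    X: (D, N) HOG features, Y: (d, N) image patches, column per sample."""
    D = X.shape[0]
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(D))
```

A new HOG vector x is then inverted simply as `W @ x`.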
Category ELDA Ridge Direct PairDict
bicycle 0.452 0.577 0.513 0.561
bottle 0.697 0.683 0.660 0.671
car 0.668 0.677 0.652 0.639
cat 0.749 0.712 0.687 0.705
chair 0.660 0.621 0.604 0.617
table 0.656 0.617 0.582 0.614
motorbike 0.573 0.617 0.549 0.592
person 0.696 0.667 0.646 0.646
Mean 0.671 0.656 0.620 0.637
Table 1: We evaluate the performance of our inversion algorithm by comparing the inverse to the ground truth image using the mean normalized cross correlation. Higher is better; a score of 1 is perfect. See supplemental for full table.
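The normalized cross correlation used as the score above can be sketched as follows (one plausible definition; the paper's exact normalization and patch averaging may differ):

```python
import numpy as np

def mean_ncc(a, b):
    """Normalized cross-correlation between two arrays:
    standardize both, then take the mean elementwise product.
    Identical inputs score 1.0."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float(np.mean(a * b))
```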
(´・ω・`)
Evaluated on 8 categories of PASCAL VOC 2011
•  Evaluation by Mechanical Turks (or Expert)
Machine pipeline: Image → HOG inversion → Classifier
Human pipeline:   Image → HOG inversion → Human
Figure 6: Inverting HOG using paired dictionary learning.
We first project the HOG vector on to a HOG basis. By
jointly learning a coupled basis of HOG features and natural
images, we then transfer the coefficients to the image basis
to recover the natural image.
Figure 7: Some pairs of dictionaries for U and V. The left of every pair is the gray scale dictionary element and the right is the positive components elements in the HOG dictionary. Notice the correlation between dictionaries.
Suppose we write x and y in terms of bases U ∈ R^{D×K} and V ∈ R^{d×K} respectively, but with shared coefficients α ∈ R^K:

    x = U α  and  y = V α        (5)
The key observation is that inversion can be obtained by first projecting x onto the HOG basis U, then transferring the coefficients α to the image basis V.
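Learning U and V with shared coefficients can be sketched by running dictionary learning on the concatenated vectors z = [x; y] and splitting the learned dictionary (a rough alternating-least-squares stand-in for the paper's sparse dictionary learner; the function name and the ridge-style shrinkage in the coding step are assumptions of this sketch):

```python
import numpy as np

def learn_paired_dicts(X, Y, K=8, iters=20, lam=0.1, seed=0):
    """Sketch of paired dictionary learning: fit one dictionary W to the
    stacked data z = [x; y] so that U and V share coefficients alpha,
    then split W into its HOG part U and image part V."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])                       # (D + d, N)
    W = rng.standard_normal((Z.shape[0], K))
    W /= np.linalg.norm(W, axis=0)
    for _ in range(iters):
        # coding step: ridge shrinkage as a cheap stand-in for the L1 solve
        A = np.linalg.solve(W.T @ W + lam * np.eye(K), W.T @ Z)
        # dictionary update step (least squares), then renormalize atoms
        W = Z @ A.T @ np.linalg.inv(A @ A.T + 1e-8 * np.eye(K))
        W /= np.linalg.norm(W, axis=0) + 1e-12
    D = X.shape[0]
    return W[:D], W[D:]                         # U, V
```

The split works because each atom of W is a vertically stacked pair (u_k; v_k) tied to the same coefficient.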
It's 'Bird'
Let's try Mechanical Turk
Figure 15: We visualize a few deformable parts models trained with our visualization. First row: car, person, bottle, bicycle, motorbike, potted plant. In the right most visualizations, we also included the HOG glyph.
HOGgles: Visualizing Object Detection Features
Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, Antonio Torralba
Massachusetts Institute of Technology
Figure 1: An image from PASCAL and a high scoring car
detection from DPM [8]. Why did the detector fail?
Figure 2: We show the crop for the false car detection from
Figure 1. On the right, we show our visualization of the
HOG features for the same patch. Our visualization reveals
that this false alarm actually looks like a car in HOG space.
Figure 2 shows the output from our visualization on the features for the false car detection. This visualization reveals that, while there are clearly no cars in the original image, there is a car hiding in the HOG descriptor. HOG features see a slightly different visual world than what we see, and by visualizing this space, we can gain a more intuitive understanding of our object detectors.
Figure 3 inverts more top detections on PASCAL for a few categories. Can you guess which are false alarms? Take a minute to study the figure since the next sentence might ruin the surprise. Although every visualization looks like a true positive, many of these detections are actually false alarms.
MTurk workers classified HOG inversions of PASCAL VOC 2011 images.
Category ELDA Ridge Direct PairDict Glyph Expert
bicycle 0.327 0.127 0.362 0.307 0.405 0.438
bottle 0.269 0.282 0.283 0.446 0.312 0.222
car 0.397 0.457 0.617 0.585 0.359 0.389
cat 0.219 0.178 0.381 0.199 0.139 0.286
chair 0.099 0.239 0.223 0.386 0.119 0.167
table 0.152 0.064 0.162 0.237 0.071 0.125
motorbike 0.221 0.232 0.396 0.224 0.298 0.350
person 0.458 0.546 0.502 0.676 0.301 0.375
Mean 0.282 0.258 0.355 0.383 0.191 0.233
Table 2: We evaluate visualization performance across twenty PASCAL VOC categories by asking MTurk workers to classify our inversions. Numbers are percent classified correctly; higher is better. Chance is 0.05. Glyph refers to the standard black-and-white HOG diagram popularized in earlier work.
Figure 14 (precision-recall curves, AP values read off the plots):
  Chair:  HOG+Human AP = 0.63, RGB+Human AP = 0.96, HOG+DPM AP = 0.51
  Cat:    HOG+Human AP = 0.78, HOG+DPM AP = 0.58
  Car:    HOG+Human AP = 0.83, HOG+DPM AP = 0.87
  Person: HOG+Human AP = 0.69, HOG+DPM AP = 0.79
Figure 14: By instructing multiple human subjects to classify HOG inversions, human performance on the HOG feature space can be compared against DPM detectors.
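The AP values above summarize a precision-recall curve. A minimal PASCAL-style sketch (interpolation details vary between VOC toolkits; this version simply averages precision at each positive):

```python
import numpy as np

def average_precision(scores, labels):
    """Average precision: rank by score, then average the precision
    measured at each positive item."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                          # true positives so far
    precision = tp / np.arange(1, len(labels) + 1)  # precision at each rank
    npos = max(labels.sum(), 1)
    return float(np.sum(precision * labels) / npos)
```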
Summary
•  Inverting HOG features reveals what object detectors actually "see"
•  The inversions are far more informative than the standard HOG glyph
•  Related work on visualizing learned features:
   –  deconvnet [Zeiler & Fergus 13]
To invert this, the deconvnet uses transposed versions of
the same filters, but applied to the rectified maps, not
the output of the layer beneath. In practice this means
flipping each filter vertically and horizontally.
Projecting down from higher layers uses the switch
settings generated by the max pooling in the convnet
on the way up. As these switch settings are peculiar
to a given input image, the reconstruction obtained
from a single activation thus resembles a small piece
of the original input image, with structures weighted
according to their contribution toward the feature
activation. Since the model is trained discriminatively,
they implicitly show which parts of the input image
are discriminative. Note that these projections are not
samples from the model, since there is no generative
process involved.
[Figure 1 schematic]
Convnet (bottom-up):  Layer Below Pooled Maps → Convolutional Filtering {F} → Feature Maps → Rectified Linear Function → Rectified Feature Maps → Max Pooling (recording max locations as "Switches") → Pooled Maps
Deconvnet (top-down): Layer Above Reconstruction → Max Unpooling (using the Switches) → Unpooled Maps → Rectified Linear Function → Rectified Unpooled Maps → Convolutional Filtering {F^T} → Reconstruction
Figure 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.
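The switch-recording pooling and unpooling described in the caption can be sketched as follows (a toy single-channel version; real deconvnet code operates on batched multi-channel feature maps):

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """k x k max pooling that records the argmax location ('switch')
    inside each pooling region."""
    H, W = x.shape
    pooled = np.zeros((H // k, W // k))
    switches = np.zeros((H // k, W // k), dtype=int)
    for i in range(H // k):
        for j in range(W // k):
            patch = x[i*k:(i+1)*k, j*k:(j+1)*k]
            switches[i, j] = int(np.argmax(patch))
            pooled[i, j] = patch.ravel()[switches[i, j]]
    return pooled, switches

def max_unpool(pooled, switches, k=2):
    """Unpooling: place each pooled value back at its recorded max
    location; every other position stays zero."""
    H, W = pooled.shape
    out = np.zeros((H * k, W * k))
    for i in range(H):
        for j in range(W):
            di, dj = divmod(switches[i, j], k)
            out[i*k + di, j*k + dj] = pooled[i, j]
    return out
```

Pooling an unpooled map with the same switches returns the original pooled map, which is what makes the reconstruction approximately invertible.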
Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1.
The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10^-2, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10^-2 and biases are set to 0.
Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10^-1 to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).
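The filter renormalization step described above (scale any filter whose RMS value exceeds the fixed radius 10^-1 back down to that radius) can be sketched as (a sketch with filters flattened into rows; the layout is an assumption):

```python
import numpy as np

def renorm_filters(W, radius=0.1):
    """Rescale any filter whose RMS value exceeds `radius` so its RMS
    equals `radius`; filters already within the radius are untouched.
    W: (num_filters, filter_size) weight matrix, one filter per row."""
    rms = np.sqrt(np.mean(W ** 2, axis=1, keepdims=True))
    scale = np.where(rms > radius, radius / rms, 1.0)
    return W * scale
```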
4. Convnet Visualization
Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.
Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map.

More Related Content

Viewers also liked

20140705.西野研セミナー
20140705.西野研セミナー20140705.西野研セミナー
20140705.西野研セミナー
Hayaru SHOUNO
 
20150803.山口大学集中講義
20150803.山口大学集中講義20150803.山口大学集中講義
20150803.山口大学集中講義
Hayaru SHOUNO
 
20160825 IEICE SIP研究会 講演
20160825 IEICE SIP研究会 講演20160825 IEICE SIP研究会 講演
20160825 IEICE SIP研究会 講演
Hayaru SHOUNO
 
20160329.dnn講演
20160329.dnn講演20160329.dnn講演
20160329.dnn講演
Hayaru SHOUNO
 
20141003.journal club
20141003.journal club20141003.journal club
20141003.journal clubHayaru SHOUNO
 
20140530.journal club
20140530.journal club20140530.journal club
20140530.journal club
Hayaru SHOUNO
 
ベイズ Chow-Liu アルゴリズム
ベイズ Chow-Liu アルゴリズムベイズ Chow-Liu アルゴリズム
ベイズ Chow-Liu アルゴリズム
Joe Suzuki
 
20141204.journal club
20141204.journal club20141204.journal club
20141204.journal club
Hayaru SHOUNO
 
20130925.deeplearning
20130925.deeplearning20130925.deeplearning
20130925.deeplearningHayaru SHOUNO
 
量子アニーリングを用いたクラスタ分析
量子アニーリングを用いたクラスタ分析量子アニーリングを用いたクラスタ分析
量子アニーリングを用いたクラスタ分析
Shu Tanaka
 
ものまね鳥を愛でる 結合子論理と計算
ものまね鳥を愛でる 結合子論理と計算ものまね鳥を愛でる 結合子論理と計算
ものまね鳥を愛でる 結合子論理と計算
Hiromi Ishii
 

Viewers also liked (12)

20140705.西野研セミナー
20140705.西野研セミナー20140705.西野研セミナー
20140705.西野研セミナー
 
20150803.山口大学集中講義
20150803.山口大学集中講義20150803.山口大学集中講義
20150803.山口大学集中講義
 
20160825 IEICE SIP研究会 講演
20160825 IEICE SIP研究会 講演20160825 IEICE SIP研究会 講演
20160825 IEICE SIP研究会 講演
 
20160329.dnn講演
20160329.dnn講演20160329.dnn講演
20160329.dnn講演
 
20130722
2013072220130722
20130722
 
20141003.journal club
20141003.journal club20141003.journal club
20141003.journal club
 
20140530.journal club
20140530.journal club20140530.journal club
20140530.journal club
 
ベイズ Chow-Liu アルゴリズム
ベイズ Chow-Liu アルゴリズムベイズ Chow-Liu アルゴリズム
ベイズ Chow-Liu アルゴリズム
 
20141204.journal club
20141204.journal club20141204.journal club
20141204.journal club
 
20130925.deeplearning
20130925.deeplearning20130925.deeplearning
20130925.deeplearning
 
量子アニーリングを用いたクラスタ分析
量子アニーリングを用いたクラスタ分析量子アニーリングを用いたクラスタ分析
量子アニーリングを用いたクラスタ分析
 
ものまね鳥を愛でる 結合子論理と計算
ものまね鳥を愛でる 結合子論理と計算ものまね鳥を愛でる 結合子論理と計算
ものまね鳥を愛でる 結合子論理と計算
 

Similar to 20150326.journal club

Class[6][5th aug] [three js-loaders]
Class[6][5th aug] [three js-loaders]Class[6][5th aug] [three js-loaders]
Class[6][5th aug] [three js-loaders]
Saajid Akram
 
CS 354 Object Viewing and Representation
CS 354 Object Viewing and RepresentationCS 354 Object Viewing and Representation
CS 354 Object Viewing and Representation
Mark Kilgard
 
Capture and Rendering
Capture and RenderingCapture and Rendering
Capture and Rendering
Matt Hirsch - MIT Media Lab
 
iOS Visual F/X Using GLSL
iOS Visual F/X Using GLSLiOS Visual F/X Using GLSL
iOS Visual F/X Using GLSL
Douglass Turner
 
Gradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsGradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation Graphs
Yoonho Lee
 
lec10.pdf
lec10.pdflec10.pdf
lec10.pdf
ssuserc02bd5
 
OpenGL Fixed Function to Shaders - Porting a fixed function application to “m...
OpenGL Fixed Function to Shaders - Porting a fixed function application to “m...OpenGL Fixed Function to Shaders - Porting a fixed function application to “m...
OpenGL Fixed Function to Shaders - Porting a fixed function application to “m...
ICS
 
CariGANs : Unpaired Photo-to-Caricature Translation
CariGANs : Unpaired Photo-to-Caricature TranslationCariGANs : Unpaired Photo-to-Caricature Translation
CariGANs : Unpaired Photo-to-Caricature Translation
Razorthink
 
Unsupervised learning represenation with DCGAN
Unsupervised learning represenation with DCGANUnsupervised learning represenation with DCGAN
Unsupervised learning represenation with DCGAN
Shyam Krishna Khadka
 
Getting more out of Matplotlib with GR
Getting more out of Matplotlib with GRGetting more out of Matplotlib with GR
Getting more out of Matplotlib with GR
Josef Heinen
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representations
Devansh16
 
From Hello World to the Interactive Web with Three.js: Workshop at FutureJS 2014
From Hello World to the Interactive Web with Three.js: Workshop at FutureJS 2014From Hello World to the Interactive Web with Three.js: Workshop at FutureJS 2014
From Hello World to the Interactive Web with Three.js: Workshop at FutureJS 2014
Verold
 
Curvefitting
CurvefittingCurvefitting
Curvefitting
Philberto Saroni
 
Greedy subtourcrossover.arob98
Greedy subtourcrossover.arob98Greedy subtourcrossover.arob98
Greedy subtourcrossover.arob98
Kaal Nath
 
Lab Practices and Works Documentation / Report on Computer Graphics
Lab Practices and Works Documentation / Report on Computer GraphicsLab Practices and Works Documentation / Report on Computer Graphics
Lab Practices and Works Documentation / Report on Computer Graphics
Rup Chowdhury
 
CS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and CullingCS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and Culling
Mark Kilgard
 
One-Sample Face Recognition Using HMM Model of Fiducial Areas
One-Sample Face Recognition Using HMM Model of Fiducial AreasOne-Sample Face Recognition Using HMM Model of Fiducial Areas
One-Sample Face Recognition Using HMM Model of Fiducial Areas
CSCJournals
 

Similar to 20150326.journal club (20)

Class[6][5th aug] [three js-loaders]
Class[6][5th aug] [three js-loaders]Class[6][5th aug] [three js-loaders]
Class[6][5th aug] [three js-loaders]
 
CS 354 Object Viewing and Representation
CS 354 Object Viewing and RepresentationCS 354 Object Viewing and Representation
CS 354 Object Viewing and Representation
 
Fpga human detection
Fpga human detectionFpga human detection
Fpga human detection
 
Bai 1
Bai 1Bai 1
Bai 1
 
Capture and Rendering
Capture and RenderingCapture and Rendering
Capture and Rendering
 
iOS Visual F/X Using GLSL
iOS Visual F/X Using GLSLiOS Visual F/X Using GLSL
iOS Visual F/X Using GLSL
 
Gradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsGradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation Graphs
 
lec10.pdf
lec10.pdflec10.pdf
lec10.pdf
 
OpenGL Fixed Function to Shaders - Porting a fixed function application to “m...
OpenGL Fixed Function to Shaders - Porting a fixed function application to “m...OpenGL Fixed Function to Shaders - Porting a fixed function application to “m...
OpenGL Fixed Function to Shaders - Porting a fixed function application to “m...
 
CariGANs : Unpaired Photo-to-Caricature Translation
CariGANs : Unpaired Photo-to-Caricature TranslationCariGANs : Unpaired Photo-to-Caricature Translation
CariGANs : Unpaired Photo-to-Caricature Translation
 
Unsupervised learning represenation with DCGAN
Unsupervised learning represenation with DCGANUnsupervised learning represenation with DCGAN
Unsupervised learning represenation with DCGAN
 
Getting more out of Matplotlib with GR
Getting more out of Matplotlib with GRGetting more out of Matplotlib with GR
Getting more out of Matplotlib with GR
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representations
 
From Hello World to the Interactive Web with Three.js: Workshop at FutureJS 2014
From Hello World to the Interactive Web with Three.js: Workshop at FutureJS 2014From Hello World to the Interactive Web with Three.js: Workshop at FutureJS 2014
From Hello World to the Interactive Web with Three.js: Workshop at FutureJS 2014
 
Curvefitting
CurvefittingCurvefitting
Curvefitting
 
Greedy subtourcrossover.arob98
Greedy subtourcrossover.arob98Greedy subtourcrossover.arob98
Greedy subtourcrossover.arob98
 
Lab Practices and Works Documentation / Report on Computer Graphics
Lab Practices and Works Documentation / Report on Computer GraphicsLab Practices and Works Documentation / Report on Computer Graphics
Lab Practices and Works Documentation / Report on Computer Graphics
 
CS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and CullingCS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and Culling
 
One-Sample Face Recognition Using HMM Model of Fiducial Areas
One-Sample Face Recognition Using HMM Model of Fiducial AreasOne-Sample Face Recognition Using HMM Model of Fiducial Areas
One-Sample Face Recognition Using HMM Model of Fiducial Areas
 
Final Project
Final ProjectFinal Project
Final Project
 

Recently uploaded

road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
Intella Parts
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
SyedAbiiAzazi1
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
ssuser7dcef0
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
ydteq
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
The Role of Electrical and Electronics Engineers in IOT Technology.pdf
The Role of Electrical and Electronics Engineers in IOT Technology.pdfThe Role of Electrical and Electronics Engineers in IOT Technology.pdf
The Role of Electrical and Electronics Engineers in IOT Technology.pdf
Nettur Technical Training Foundation
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
zwunae
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 

Recently uploaded (20)

road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
The Role of Electrical and Electronics Engineers in IOT Technology.pdf
The Role of Electrical and Electronics Engineers in IOT Technology.pdfThe Role of Electrical and Electronics Engineers in IOT Technology.pdf
The Role of Electrical and Electronics Engineers in IOT Technology.pdf
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 

20150326.journal club

  • 2. Overview •  ( –  (HOG,(SIFT,(etc)( –  ((SVM,(Log.(regression,(etc)( ( •  “ ”( ( ( •  HOG
  • 8. :(Paired(Dict.LearningMethod: Paired Dictionary + = = +...+ +...++ ↵1 ↵2 ↵k ↵1 ↵2 ↵k ˆy = f(x) = V ˆ↵ 2 Method: Paired Dictionary + = = +...+ +...++ ↵1 ↵2 ↵k ↵1 ↵2 ↵k ˆy = f(x) = V ˆ↵ where ˆ↵ = arg min ↵ ||x U↵||2 2 s.t. ||↵||1  Method: Paired Dictionary + = = +...+ +...++ ↵1 ↵2 ↵k ↵1 ↵2 ↵k ˆy = f(x) = V ˆ↵ where ˆ↵ = arg min ↵ ||x U↵||2 2 s.t. ||↵||1  Method: Paired Dictionary + = = +...+ +...++ ↵1 ↵2 ↵k ↵1 ↵2 ↵k ˆy = f(x) = V ˆ↵ where ˆ↵ = arg min ↵ ||x U↵||2 2 s.t. ||↵||1  = α1 + α2 + …+ αM u1 u2 uM x Method: Paired Dictionary + = = +...+ +...++ ↵1 ↵2 ↵k ↵1 ↵2 ↵k ˆy = f(x) = V ˆ↵ where ˆ↵ = arg min ↵ ||x U↵||2 2 s.t. ||↵||1  Method: Paired Dictionary + = = +...+ +...++ ↵1 ↵2 ↵k ↵1 ↵2 ↵k ˆy = f(x) = V ˆ↵ where ˆ↵ = arg min ↵ ||x U↵||2 2 s.t. ||↵||1  Method: Paired Dictionary + = = +...+ +...++ ↵1 ↵2 ↵k ↵1 ↵2 ↵k ˆy = f(x) = V ˆ↵ where ˆ↵ = arg min ↵ ||x U↵||2 2 s.t. ||↵||1  Method: Paired Dictionary + = = +...+ +...++ ↵1 ↵2 ↵k ↵1 ↵2 ↵k = f(x) = V ˆ↵ where ˆ↵ = arg min ||x U↵||2 2 s.t. ||↵||1  α1 + α2 + …+ αM = v1 v2 vM y
  • 10.
  • 11. HOG glyph
  Figure 3: We visualize some high scoring detections from the deformable parts model. Can you guess which are false alarms? Take a minute to study this figure.
  Figure 4: In this paper, we present algorithms to visualize […]
  • 12. HOG glyph vs. dictionary inversion
  (a) Human Vision  (b) HOG Vision
  Figure 13: HOG inversion reveals the world that object detectors see. The left shows a man standing in a dark room.
  • 14. HOG template (cell) size: 40×40, 20×20, 10×10, 5×5
  Figure 12: Our inversion algorithms are sensitive to the HOG template size. We show how performance degrades as the template becomes smaller.
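For reference, the per-cell orientation histograms that HOG templates are built from can be sketched as follows (simplified: a single cell, 9 unsigned orientation bins, hard binning, and no block normalization, unlike the full Dalal & Triggs pipeline):

```python
import numpy as np

def cell_orientation_histogram(cell, n_bins=9):
    """Gradient-orientation histogram for one cell (unsigned, magnitude-weighted)."""
    gy, gx = np.gradient(cell.astype(float))       # image gradients
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation in [0, 180)
    bins = (ang / (180.0 / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())     # magnitude-weighted voting
    return hist

# A vertical step edge: the gradient is purely horizontal (orientation 0),
# so all votes land in the first bin.
cell = np.tile(np.concatenate([np.zeros(4), np.ones(4)]), (8, 1))
h = cell_orientation_histogram(cell)
```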
  • 15. Original / ELDA / Ridge / Direct / PairDict
  Figure 8: We show results for all four of our inversion algorithms on held out image patches on similar dimensions common for object detection.
  • 16. The four inversion algorithms
  • ELDA: run an exemplar-LDA detector on the HOG feature and average the top-K retrieved patches
  • Ridge: ridge regression mapping HOG features X to image patches Y
  • Direct: directly optimize the reconstruction over an image basis U (Gabor-like elements)
  • PairedDict: the proposed paired dictionary learning
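If the Ridge baseline above is read as a closed-form ridge regression from HOG features X to image patches Y (my reading of the slide; the paper's exact Gaussian formulation may differ), a toy numpy sketch looks like this:

```python
import numpy as np

def fit_ridge(X, Y, lam=1.0):
    """W minimizing ||X W - Y||^2 + lam*||W||^2, mapping HOG rows to image rows."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Toy training data: 500 patches, 20-D "HOG" features, 50-D "image" patches,
# with an exact linear relation so the fit is easy to check.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 20))
W_true = rng.standard_normal((20, 50))
Y = X @ W_true
W = fit_ridge(X, Y, lam=1e-6)
y_hat = X[0] @ W              # "invert" the first HOG feature back to an image
```

The closed form makes Ridge fast but tends to produce blurry inversions, since it returns a single conditional mean rather than a sparse combination of atoms.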
  • 17. Inversion accuracy (PASCAL VOC 2011, 8 categories)

  Category    ELDA   Ridge  Direct  PairDict
  bicycle     0.452  0.577  0.513   0.561
  bottle      0.697  0.683  0.660   0.671
  car         0.668  0.677  0.652   0.639
  cat         0.749  0.712  0.687   0.705
  chair       0.660  0.621  0.604   0.617
  table       0.656  0.617  0.582   0.614
  motorbike   0.573  0.617  0.549   0.592
  person      0.696  0.667  0.646   0.646
  Mean        0.671  0.656  0.620   0.637

  Table 1: We evaluate the performance of our inversion algorithm by comparing the inverse to the ground truth image using the mean normalized cross correlation. Higher is better; a score of 1 is perfect. See supplemental for full table. (´・ω・`)
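The Table 1 metric, mean normalized cross correlation, can be sketched as follows (a straightforward reading of the metric, not the authors' evaluation code):

```python
import numpy as np

def normalized_cross_correlation(a, b):
    """Zero-mean, unit-norm correlation between two images.

    Returns 1.0 when the images are identical up to brightness/contrast
    (an affine intensity change), and -1.0 when one is the negative of the other.
    """
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

img = np.arange(16.0).reshape(4, 4)
same = normalized_cross_correlation(img, 2 * img + 3)   # affine-equivalent image
```

Averaging this score over all test patches of a category gives one cell of Table 1.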
  • 18. Evaluation: Mechanical Turk (or Expert)
  Pipeline: Image → HOG inversion → Classifier / Human → "It's 'Bird'"
  Figure 6: Inverting HOG using paired dictionary learning. We first project the HOG vector on to a HOG basis. By jointly learning a coupled basis of HOG features and natural images, we then transfer the coefficients to the image basis to recover the natural image.
  Figure 7: Some pairs of dictionaries for U and V. The left of every pair is the gray scale dictionary element and the right is the positive components elements in the HOG dictionary. Notice the correlation between dictionaries.
  Suppose we write x and y in terms of bases U ∈ R^(D×K) and V ∈ R^(d×K) respectively, but with shared coefficients α ∈ R^K:  x = Uα and y = Vα  (5)
  Figure 8: We show results for all four of our inversion algorithms on held out image patches on similar dimensions common for object detection.
  • 19. HOGgles: let's try Mechanical Turk
  Figure 3: We visualize some high scoring detections from the deformable parts model. Can you guess which are false alarms? Take a minute to study this figure.
  Figure 15: We visualize a few deformable parts models trained with […]. First row: car, person, bottle, bicycle, motorbike, potted plant. In the right most visualizations, we also included the HOG glyph.
  • 20. HOGgles: Visualizing Object Detection Features (Vondrick et al., Massachusetts Institute of Technology)
  Figure 1: An image from PASCAL and a high scoring car detection from DPM [8]. Why did the detector fail?
  Figure 2: We show the crop for the false car detection from Figure 1. On the right, we show our visualization of the HOG features for the same patch. Our visualization reveals that this false alarm actually looks like a car in HOG space.
  Figure 2 shows the output from our visualization on the features for the false car detection. This visualization reveals that, while there are clearly no cars in the original image, there is a car hiding in the HOG descriptor. HOG features see a slightly different visual world than what we see, and by visualizing this space, we can gain a more intuitive understanding of our object detectors. Figure 3 inverts more top detections on PASCAL for a few categories. Can you guess which are false alarms? Take a minute to study the figure since the next sentence might ruin the surprise. Although every visualization looks […]
  • 21. MTurk classification results (PASCAL VOC 2011)

  Category    ELDA   Ridge  Direct  PairDict  Glyph  Expert
  bicycle     0.327  0.127  0.362   0.307     0.405  0.438
  bottle      0.269  0.282  0.283   0.446     0.312  0.222
  car         0.397  0.457  0.617   0.585     0.359  0.389
  cat         0.219  0.178  0.381   0.199     0.139  0.286
  chair       0.099  0.239  0.223   0.386     0.119  0.167
  table       0.152  0.064  0.162   0.237     0.071  0.125
  motorbike   0.221  0.232  0.396   0.224     0.298  0.350
  person      0.458  0.546  0.502   0.676     0.301  0.375
  Mean        0.282  0.258  0.355   0.383     0.191  0.233

  Table 2: We evaluate visualization performance across twenty PASCAL VOC categories by asking MTurk workers to classify our inversions. Numbers are percent classified correctly; higher is better. Chance is 0.05. Glyph refers to the standard black-and-white HOG diagram popularized […]
  • 22. Human vs. DPM detection accuracy (precision-recall curves)
    Chair:  HOG+Human AP = 0.63,  RGB+Human AP = 0.96,  HOG+DPM AP = 0.51
    Cat:    HOG+Human AP = 0.78,  HOG+DPM AP = 0.58
    Car:    HOG+Human AP = 0.83,  HOG+DPM AP = 0.87
    Person: HOG+Human AP = 0.69,  HOG+DPM AP = 0.79
  Figure 14: By instructing multiple human subjects to classify […]
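The AP numbers above summarize each precision-recall curve; a generic numpy sketch of computing AP from ranked detection scores (the unweighted mean of precision at each positive, not PASCAL's interpolated variant):

```python
import numpy as np

def average_precision(scores, labels):
    """AP = mean of the precision values at the rank of each positive detection."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # sort by descending score
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)                              # true positives so far
    ranks = np.flatnonzero(labels == 1) + 1               # 1-based ranks of positives
    return float((hits[labels == 1] / ranks).mean())

# Two positives: one ranked first (precision 1), one ranked third (precision 2/3).
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```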
  • 24. Related work: [Zeiler & Fergus 13]
  To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally. Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved.
  Figure 1: Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.
  Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1. The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes).
  Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10^-2, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10^-2 and biases are set to 0.
  Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10^-1 to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128, 128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).
  4. Convnet Visualization
  Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.
  Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel […]
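The switch-based unpooling in Figure 1 can be sketched in plain numpy: pooling records the argmax location ("switch") of each region, and unpooling places each value back at its recorded location with zeros elsewhere. This is a single-channel 2x2 toy illustration of the mechanism, not Zeiler & Fergus's implementation:

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """Return the pooled map and the flat index ('switch') of each regional max."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    switches = np.zeros((h // k, w // k), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            patch = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            idx = int(patch.argmax())        # flat index of the max within the region
            switches[i, j] = idx
            pooled[i, j] = patch.ravel()[idx]
    return pooled, switches

def unpool(pooled, switches, k=2):
    """Place each pooled value at its recorded max location; zeros elsewhere."""
    h, w = pooled.shape
    out = np.zeros((h * k, w * k))
    for i in range(h):
        for j in range(w):
            r, c = divmod(int(switches[i, j]), k)
            out[i * k + r, j * k + c] = pooled[i, j]
    return out

x = np.array([[1., 5., 0., 2.],
              [3., 2., 4., 1.],
              [0., 1., 2., 0.],
              [6., 2., 1., 3.]])
p, s = max_pool_with_switches(x)
r = unpool(p, s)
```

Because the switches are specific to one input image, the reconstruction is sparse and image-specific, which is exactly why deconvnet projections resemble pieces of the original input.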