Paper review
1. Learning Deep Features for
Discriminative Localization
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba
Computer Science and Artificial Intelligence Laboratory, MIT
2017.06.06 (Tue)
Junya Tanaka (M1)
2. Introduction
•The various layers of a CNN behave as object
detectors
oThis ability is lost when fully-connected layers are used for
classification
•Past work
oAvoid the use of fully-connected layers to minimize the
number of parameters while maintaining high performance
oNetwork in Network (NIN) [M. Lin et al., 2014]
Global average pooling which acts as a structural regularizer,
preventing over-fitting during training
oGoogLeNet [Szegedy et al., 2015]
Also largely avoids fully-connected layers, using global average
pooling before the final classifier
3. Introduction
•Classify the image and localize class-specific image
regions in a single forward-pass
oThe discriminative regions localize the objects that the humans
are interacting with rather than the humans themselves
This approach can be easily transferred to other recognition
datasets for generic classification, localization, and concept
discovery.
4. Related Work
•Weakly-supervised object localization:
oBergamo et al.
Self-taught object localization involving masking out image regions
oCinbis et al. and Pinheiro et al.
Combining multiple-instance learning with CNN features
oOquab et al.
Transferring mid-level image representations and evaluating the
output on multiple overlapping patches
•Visualizing CNNs:
oZeiler et al.
Using deconvolutional networks to visualize what patterns activate
each unit
5. Class Activation Mapping
•Identify the importance of the image regions by
projecting back the weights of the output layer onto
the convolutional feature maps
•The CAMs of two classes from the dataset
6. Class Activation Mapping
•Highlight the differences in the CAMs for a single
image when using different classes to generate the
maps
•Observe that the discriminative regions for different
categories are different even for a given image
7. Class Activation Mapping
•The procedure for generating class activation maps
oGAP outputs the spatial average of the feature map of
each unit at the last conv layer
oA weighted sum of these values is used to generate the
final output
oComputing the same weighted sum over the feature maps at each
spatial location yields the class activation map (see the sketch below)
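A minimal NumPy sketch of the mapping described above: the class activation map is the weighted sum of the last conv layer's feature maps, using that class's output-layer weights. The array names and shapes are illustrative, not from the paper's code.

```python
import numpy as np

def compute_cam(features, class_weights):
    """Class activation map M_c(x, y) = sum_k w_k^c * f_k(x, y).

    features:      (K, H, W) activations of the last conv layer
    class_weights: (K,) output-layer weights w_k^c for class c
    """
    cam = np.tensordot(class_weights, features, axes=([0], [0]))  # (H, W)
    cam -= cam.min()                 # normalize to [0, 1] for visualization
    cam /= cam.max() + 1e-8
    return cam
```

Upsampling the resulting map to the input image size gives heatmaps like the ones shown on these slides.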
8. Class Activation Mapping
•It is important to highlight the intuitive difference
between GAP and GMP
•Global average pooling vs. global max pooling:
oGAP (identifies the extent of the object)
The value can be maximized by finding all discriminative parts of an
object, as all low activations reduce the output of the particular map
oGMP (identifies just one discriminative part)
Low scores for all image regions except the most discriminative one
do not impact the score, as you just perform a max
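A toy NumPy illustration of this difference, using two hypothetical feature maps that share the same peak activation:

```python
import numpy as np

full = np.array([[0.9, 0.9],      # unit fires over the whole object
                 [0.9, 0.9]])
part = np.array([[0.9, 0.0],      # unit fires at one discriminative point
                 [0.0, 0.0]])

# GAP rewards covering the full extent of the object...
print(full.mean(), part.mean())   # 0.9 vs 0.225
# ...while GMP scores both maps identically, so there is no incentive
# to activate beyond the single most discriminative part.
print(full.max(), part.max())     # 0.9 vs 0.9
```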
9. Weakly-supervised Object Localization
•Evaluate the localization ability of CAM when trained
on the ILSVRC
•Verify that this technique does not adversely impact
the classification performance when learning to
localize and provide detailed results on weakly-
supervised object localization
10. Setup
•The effect of using CAM on the following popular
CNNs: AlexNet, VGGnet, and GoogLeNet
oRemove the fully-connected layers before the final output
oReplace them with GAP followed by a fully-connected
softmax layer
•Classification
oCompare this approach against the original AlexNet,
VGGnet, GoogLeNet and NIN
•Localization
oCompare against the original GoogLeNet, NIN and
GoogLeNet-GMP(For comparing with GMP and GAP)
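A minimal PyTorch-style sketch of the GAP modification described above, assuming a generic conv backbone; the class and variable names are illustrative, not the exact network surgery from the paper:

```python
import torch.nn as nn

class GAPClassifier(nn.Module):
    """Conv features -> global average pooling -> fully-connected softmax layer."""
    def __init__(self, conv_backbone, num_channels, num_classes):
        super().__init__()
        self.features = conv_backbone          # conv layers only; FC stack removed
        self.gap = nn.AdaptiveAvgPool2d(1)     # global average pooling
        self.classifier = nn.Linear(num_channels, num_classes)

    def forward(self, x):
        f = self.features(x)                   # (N, K, H, W)
        pooled = self.gap(f).flatten(1)        # (N, K)
        return self.classifier(pooled)         # logits; softmax applied in the loss
```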
12. Results
•Localization
oNeed to generate a bounding box and its associated object
category
Simple thresholding technique to segment the heatmap
Segment the regions whose value is above 20% of the max
value of the CAM
Take the bounding box that covers the largest connected component
in the segmentation map
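A sketch of this thresholding step using SciPy's connected-component tools; the function name and defaults are illustrative:

```python
import numpy as np
from scipy import ndimage

def cam_to_bbox(cam, threshold=0.2):
    """Bounding box covering the largest connected component of the
    regions whose value exceeds 20% of the CAM's max value."""
    mask = cam > threshold * cam.max()
    labels, n = ndimage.label(mask)             # connected components
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    largest = int(np.argmax(sizes)) + 1         # label of the largest component
    ys, xs = np.nonzero(labels == largest)
    return xs.min(), ys.min(), xs.max(), ys.max()   # (x0, y0, x1, y1)
```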
16. Deep Features for Generic Localization
•The responses from the higher-level layers of a CNN
have been shown to be effective generic features
•Show that GAP CNNs also identify the
discriminative image regions used for categorization
oTrain a linear SVM on the output of the GAP layer
oCompare the performance of features from our best
network, GoogLeNet-GAP, with the fc7 features from
AlexNet, and ave pool from GoogLeNet
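A minimal scikit-learn sketch of this evaluation protocol; it assumes the GAP-layer features have already been extracted into arrays (the variable names are hypothetical):

```python
from sklearn.svm import LinearSVC

# gap_train / gap_test: (N, K) arrays of GAP-layer outputs from GoogLeNet-GAP
# y_train / y_test:     the target dataset's class labels
clf = LinearSVC()
clf.fit(gap_train, y_train)
print("accuracy:", clf.score(gap_test, y_test))
```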
17. Fine-grained Recognition
•Explore fine-grained recognition of birds (the CUB-200-2011 dataset)
•Evaluate the generic localization ability and use it to
further improve performance
oGoogLeNet-GAP can successfully localize important image
crops, boosting classification performance
18. Pattern Discovery
•The technique can identify common elements or
patterns in images beyond objects, such as text or
high-level concepts
oGiven a set of images containing a common concept, identify
which regions the network recognizes as important and whether
they correspond to the input pattern
•Several pattern discovery experiments using deep features
oTrain a linear SVM on the GAP layer of the GoogLeNet-
GAP network
oApply the CAM technique (see the sketch below)
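A sketch of these two steps, reusing the compute_cam helper from the earlier slide; the learned SVM weight vector simply replaces the class-specific softmax weights (the variable names are hypothetical). The same recipe applies to the scene and text experiments on the next slides.

```python
from sklearn.svm import LinearSVC

# gap_features: (N, K) GAP-layer outputs; binary_labels: 1 if the image
# contains the common concept, 0 otherwise (one-vs-all setup)
svm = LinearSVC()
svm.fit(gap_features, binary_labels)

# conv_features: (K, H, W) last-conv activations of one query image;
# the SVM weights play the role of the softmax weights w_k^c
cam = compute_cam(conv_features, svm.coef_[0])
```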
19. Pattern Discovery
•Discovering informative objects in the scenes:
o10 scene categories from the SUN dataset, each containing
at least 200 fully annotated images, for a total of 4,675
fully annotated images
oTrain a one-vs-all linear SVM for each scene category
oCompute the CAMs using the weights of the linear SVM
oThe high activation regions frequently correspond to
objects indicative of the particular scene category
20. Pattern Discovery
•Weakly supervised text detector:
oTrain a weakly supervised text detector using 350 Google
StreetView images containing text
oThe approach highlights the text accurately without using
bounding box annotations
21. Visualizing Class-Specific Units
•The class-specific units of a CNN
oEstimating the receptive field
oSegmenting the top activation images of each unit in the
final convolutional layer
oUsing GAP and the ranked softmax weights
Identifying the parts of the object that are most discriminative
for classification and exactly which units detect these parts
(see the sketch below)
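Ranking the final-conv units by their softmax weight is a one-liner; a sketch assuming W is the (num_classes, K) GAP-to-softmax weight matrix (a hypothetical name):

```python
import numpy as np

# The highest-weighted units for a class are its most class-specific
# "part detectors"; their top activation images show what they respond to.
top_units = np.argsort(W[class_idx])[::-1][:5]
```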
22. Conclusion
•Propose a general technique called CAM for CNNs
with global average pooling
•Classification-trained CNNs can learn to perform
object localization, without using any bounding box
annotations
•Class activation maps allow us to visualize the
predicted class scores on any given image, highlighting
the discriminative regions detected by the CNN
•Demonstrated that global average pooling CNNs
can perform accurate object localization
Editor's Notes
This suggests that our approach works as expected.
Layers for easy handling
There is a small performance drop of 1–2% when removing the additional layers from the various networks.
The classification performance is largely preserved for our GAP networks.
GoogLeNet-GAP and GoogLeNet-GMP have similar performance on classification, as expected.
The network must be good at classification in order to achieve a high performance on localization, as it involves identifying both the object category and the bounding box location accurately.
Fig. 6(a) shows some example bounding boxes generated using this technique.
We observe that our GAP networks outperform all the baseline approaches, with GoogLeNet-GAP achieving the lowest top-5 localization error of 43%.
Further, we observe that GoogLeNet-GAP significantly outperforms GoogLeNet on localization, despite this being reversed for classification. We believe that the low mapping resolution of GoogLeNet (7 × 7) prevents it from obtaining accurate localizations. Last, we observe that GoogLeNet-GAP outperforms GoogLeNet-GMP by a reasonable margin, illustrating the importance of average pooling over max pooling for identifying the extent of objects.
GoogLeNet-GAP performs comparably to existing approaches, achieving an accuracy of 63.0% when using the full image without any bounding box annotations for both train and test. When using bounding box annotations, this accuracy increases to 70.5%.
We then use GoogLeNet-GAP to extract features again from the crops inside the bounding box, for training and testing. We find that this improves the performance considerably to 67.8%.
Figure 9. Informative objects for two scene categories. For the dining room and bathroom categories, we show examples of original images (top), and a list of the 6 most frequent objects in that scene category with the corresponding frequency of appearance. At the bottom: the CAMs and a list of the 6 objects that most frequently overlap with the high activation regions.
The text is accurately detected on the image even though our network is not trained with text or any bounding box annotations.