Learning Deep Features for
Discriminative Localization
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio
Torralba Computer Science and Artificial Intelligence Laboratory, MIT
2017.06.06(Tue)
Junya Tanaka(M1)
Introduction
•The convolutional units of various layers of a CNN behave as object detectors
oThis ability is lost when fully-connected layers are used for
classification
•Past work
oAvoid the use of fully-connected layers to minimize the
number of parameters while maintaining high performance
oNetwork in Network (NIN) [M. Lin et al., 2014]
Global average pooling which acts as a structural regularizer,
preventing over-fitting during training
oGoogLeNet [Szegedy et al., 2015]
Introduction
•Classify the image and localize class-specific image
regions in a single forward-pass
oThe localized discriminative regions correspond to the objects the humans are interacting with, rather than the humans themselves
This approach can be easily transferred to other recognition datasets for generic classification, localization, and concept discovery.
Related Work
•Weakly-supervised object localization:
oBergamo et al
Self-taught object localization involving masking out image regions
oCinbis et al and Pinheiro et al
Combining multiple-instance learning with CNN features
oOquab et al
Transferring mid-level image representations and evaluating the
output on multiple overlapping patches
•Visualizing CNNs:
oZeiler et al
Using deconvolutional networks to visualize what patterns activate
each unit
Class Activation Mapping
•Identify the importance of the image regions by projecting back the weights of the output layer onto the convolutional feature maps
•The CAMs of two classes from the dataset
Class Activation Mapping
•Highlight the differences in the CAMs for a single
image when using different classes to generate the
maps
•Observe that the discriminative regions for different
categories are different even for a given image
Class Activation Mapping
•The procedure for generating class activation maps
oGAP outputs the spatial average of the feature map of
each unit at the last conv layer
oA weighted sum of these values is used to generate the final class score; applying the same weights directly to the feature maps gives the class activation map (see the formula and sketch below)
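In the paper's notation, f_k(x, y) is the activation of unit k of the last convolutional layer at spatial location (x, y), and w_k^c is the weight connecting unit k to class c. The GAP output, class score, and class activation map are then:

$$F_k = \sum_{x,y} f_k(x, y), \qquad S_c = \sum_k w_k^c F_k, \qquad M_c(x, y) = \sum_k w_k^c f_k(x, y), \qquad \text{so } S_c = \sum_{x,y} M_c(x, y).$$

A minimal NumPy sketch of this weighted sum; the array shapes and the normalization step are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def compute_cam(conv_feats, class_weights, class_idx):
    """Class activation map M_c as a weighted sum of the last conv feature maps.

    conv_feats:    (K, H, W) activations f_k(x, y) of the last conv layer
    class_weights: (C, K) weights w_k^c of the layer trained on top of GAP
    class_idx:     class c to visualize
    """
    w_c = class_weights[class_idx]               # (K,)
    cam = np.tensordot(w_c, conv_feats, axes=1)  # (H, W): sum_k w_k^c * f_k(x, y)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()                         # scale to [0, 1] for visualization
    return cam
```

In practice the conv features come from a single forward pass, and the resulting map is upsampled (e.g. bilinearly) to the input image size before overlaying it as a heatmap.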
Class Activation Mapping
•It is important to highlight the intuitive difference
between GAP and GMP
•Global average pooling vs Global max pooling:
oGAP (identifies the full extent of the object)
The value of the map can be maximized by finding all discriminative parts of an object, as all low activations reduce the output of the particular map
oGMP (identifies just one discriminative part)
Low scores for all image regions except the most discriminative one do not impact the score, as you just perform a max (see the toy example below)
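A toy NumPy illustration of this intuition; the 4x4 feature maps and their values are made up purely for demonstration:

```python
import numpy as np

# Two toy "feature maps": activation spread over the object's extent vs. one strong peak.
spread = np.array([[0, 0, 0, 0],
                   [0, 5, 5, 0],
                   [0, 5, 5, 0],
                   [0, 0, 0, 0]], dtype=float)
peak = np.array([[0, 0, 0, 0],
                 [0, 9, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]], dtype=float)

for name, fmap in [("spread", spread), ("peak", peak)]:
    print(name, "GAP =", fmap.mean(), "GMP =", fmap.max())
# GAP favors the spread map (1.25 vs 0.5625): every low cell drags the average down,
# so the score grows by covering the whole object. GMP is already maximized by a
# single strong location (9 vs 5) and ignores the rest of the map.
```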
Weakly-supervised Object Localization
•Evaluate the localization ability of CAM when trained
on the ILSVRC
•Verify that this technique does not adversely impact
the classification performance when learning to
localize and provide detailed results on weakly-
supervised object localization
Setup
•The effect of using CAM on the following popular CNNs: AlexNet, VGGnet, and GoogLeNet
oRemove the fully-connected layers before the final output
oReplace them with GAP followed by a fully-connected softmax layer (see the sketch after this list)
•Classification
oCompare this approach against the original AlexNet,
VGGnet, GoogLeNet and NIN
•Localization
oCompare against the original GoogLeNet, NIN and GoogLeNet-GMP (to compare GMP against GAP)
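A minimal PyTorch sketch of the modification described above, assuming a backbone truncated at its last convolutional layer; the class name, channel count, and layer choices are illustrative, not the authors' exact implementation:

```python
import torch.nn as nn

class GAPClassifier(nn.Module):
    """Conv backbone with FC layers removed, followed by GAP and one linear layer."""
    def __init__(self, backbone: nn.Module, num_channels: int, num_classes: int):
        super().__init__()
        self.backbone = backbone            # outputs (N, num_channels, H, W)
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling over each feature map
        self.fc = nn.Linear(num_channels, num_classes)

    def forward(self, x):
        feats = self.backbone(x)             # last conv feature maps
        pooled = self.gap(feats).flatten(1)  # (N, num_channels)
        return self.fc(pooled)               # class scores (softmax is applied in the loss)
```

For CAM, the weights for class c are simply `fc.weight[c]`; applying them to `feats` as in the earlier sketch gives the class activation map.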
Results
•Classification
oThis approach does not significantly hurt classification
performance
Results
•Localization
oNeed to generate a bounding box and its associated object
category
Simple thresholding technique to segment the heatmap
Segment the regions whose value is above 20% of the max value of the CAM
Take the bounding box that covers the largest connected component in the segmentation map (sketched below)
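A sketch of this thresholding heuristic; using scipy for the connected-component step is my choice here, not necessarily the authors' implementation:

```python
import numpy as np
from scipy import ndimage

def cam_to_bbox(cam, rel_threshold=0.2):
    """Bounding box from a CAM: keep values above 20% of the max and box the
    largest connected component of the resulting binary mask."""
    mask = cam >= rel_threshold * cam.max()
    labels, num = ndimage.label(mask)            # label connected components
    if num == 0:
        return None
    sizes = np.bincount(labels.ravel())[1:]      # component sizes, skipping background
    largest = int(np.argmax(sizes)) + 1
    ys, xs = np.where(labels == largest)
    return xs.min(), ys.min(), xs.max(), ys.max()  # (x1, y1, x2, y2) in map coordinates
```

The CAM is usually upsampled to the input image size first, so the box coordinates map directly onto the image.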
Results
•Localization
oThis approach is effective at weakly-supervised object
localization
Results
•Localization
Results
•Localization
•The CAM approach significantly outperforms the backpropagation approach (see the figure for a comparison of the outputs)
Deep Features for Generic Localization
•The responses from the higher-level layers of a CNN have been shown to be effective generic features
•Show that GAP CNNs also identify the
discriminative image regions used for categorization
oTrain a linear SVM on the output of the GAP layer
oCompare the performance of features from our best network, GoogLeNet-GAP, with the fc7 features from AlexNet and the ave pool features from GoogLeNet
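A hedged sketch of this evaluation with scikit-learn; the feature files, label arrays, and the C value are placeholders for whatever feature-extraction pipeline is actually used:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical GAP features (num_images, num_channels) extracted with GoogLeNet-GAP,
# plus integer class labels, saved beforehand by a separate feature-extraction step.
X_train = np.load("gap_features_train.npy")
y_train = np.load("labels_train.npy")
X_test = np.load("gap_features_test.npy")
y_test = np.load("labels_test.npy")

clf = LinearSVC(C=1.0)      # one-vs-rest linear SVM on the pooled features
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```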
Fine-grained Recognition
•Explore fine-grained recognition of birds
•Evaluate the generic localization ability and use it to
further improve performance
oGoogLeNet-GAP can successfully localize important image
crops, boosting classification performance
Pattern Discovery
•The technique can identify common elements or patterns in images beyond objects, such as text or high-level concepts
oGiven a set of images containing a common concept
oIdentify which regions the network recognizes as being important and whether they correspond to the input pattern
•Several pattern discovery experiments using deep features
oTrain a linear SVM on the GAP layer of the GoogLeNet-
GAP network
oApply the CAM technique
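A short sketch of how the SVM weights replace the softmax weights in the CAM computation; `clf` is the hypothetical fitted LinearSVC from the previous sketch and `conv_feats` holds the last-conv activations for one image:

```python
import numpy as np

def svm_cam(conv_feats, svm_coef, class_idx):
    """conv_feats: (K, H, W) last-conv activations; svm_coef: (C, K) LinearSVC coef_."""
    # Same weighted sum as before, but with the SVM weight vector for the concept of interest.
    return np.tensordot(svm_coef[class_idx], conv_feats, axes=1)

# Usage (hypothetical): cam = svm_cam(conv_feats, clf.coef_, class_idx=0)
```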
Pattern Discovery
•Discovering informative objects in the scenes:
o10 scene categories from the SUN dataset, each containing at least 200 fully annotated images, resulting in a total of 4675 fully annotated images
oTrain a one-vs-all linear SVM for each scene category
oCompute the CAMs using the weights of the linear SVM
oThe high activation regions frequently correspond to
objects indicative of the particular scene category
Pattern Discovery
•Weakly supervised text detector:
oTrain a weakly supervised text detector using 350 Google
StreetView images containing text
oApproach highlights the text accurately without using
bounding box annotations
Visualizing Class-Specific Units
•The class-specific units of a CNN
oEstimating the receptive field
oSegmenting the top activation images of each unit in the
final convolutional layer
oUsing GAP and the ranked softmax weight
Identifying the parts of the object that are most discriminative for classification and exactly which units detect these parts
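A tiny sketch of the "ranked softmax weight" step: ranking units of the final conv layer by their weight for a given class to pick the most class-specific ones. Variable names and the choice of top-k are illustrative:

```python
import numpy as np

def top_class_specific_units(class_weights, class_idx, k=5):
    """class_weights: (C, K) weights of the layer on top of GAP; returns the k unit
    indices with the largest weight for class_idx (the most class-specific units)."""
    return np.argsort(-class_weights[class_idx])[:k]
```

The top activation images of these units can then be segmented via their estimated receptive fields to see which object parts each unit detects.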
Conclusion
•Propose a general technique called CAM for CNNs
with global average pooling
•Classification-trained CNNs can learn to perform
object localization, without using any bounding box
annotations
•Class activation maps allow us to visualize the predicted class scores on any given image, highlighting the discriminative regions detected by the CNN
•Demonstrated that global average pooling CNNs
can perform accurate object localization
Editor's Notes
  1. This suggests that our approach works as expected.
  2. Layers for easy handling
  3. There is a small performance drop of 1-2% when removing the additional layers from the various networks; the classification performance is largely preserved for our GAP networks. GoogLeNet-GAP and GoogLeNet-GMP have similar performance on classification, as expected. It is important for the networks to perform well on classification in order to achieve high performance on localization, as it involves identifying both the object category and the bounding box location accurately.
  4. Fig. 6(a) shows some example bounding boxes generated using this technique.
  5. We observe that our GAP networks outperform all the baseline approaches, with GoogLeNet-GAP achieving the lowest localization error of 43% on top-5.
  6. Further, we observe that GoogLeNet-GAP significantly outperforms GoogLeNet on localization, despite this being reversed for classification. We believe that the low mapping resolution of GoogLeNet (7 × 7) prevents it from obtaining accurate localizations. Last, we observe that GoogLeNet-GAP outperforms GoogLeNet-GMP by a reasonable margin, illustrating the importance of average pooling over max pooling for identifying the extent of objects.
  7. GoogLeNet-GAP performs comparably to existing approaches, achieving an accuracy of 63.0% when using the full image without any bounding box annotations for both train and test. When using bounding box annotations, this accuracy increases to 70.5%. We then use GoogLeNet-GAP to extract features again from the crops inside the bounding box, for training and testing. We find that this improves the performance considerably to 67.8%.
  8. Figure 9. Informative objects for two scene categories. For the dining room and bathroom categories, we show examples of original images (top), and a list of the 6 most frequent objects in that scene category with the corresponding frequency of appearance. At the bottom: the CAMs and a list of the 6 objects that most frequently overlap with the high activation regions.
  9. The text is accurately detected on the image even though our network is not trained with text or any bounding box annotations.