Class Weighted Convolutional Features for Image Retrieval

Universitat Politècnica de Catalunya
Universitat Politècnica de CatalunyaAssociate Professor at Universitat Politècnica de Catalunya
Class-Weighted Convolutional
Features for Image Retrieval
Xavier Giró-i-NietoAlbert Jiménez Jose Alvarez [Code][arXiv]
Image Retrieval
Google Search By Image
2
Image Retrieval
Given an image query, generate a ranked list of similar images
Task Definition
3
Image Retrieval
Given an image query, generate a ranked list of similar images
Task Definition
4
What are the properties that we want for these systems?
Image Retrieval
Desirable System Properties
5
Invariant to scale, illumination, translation
Be able to provide a fast searchMemory Efficient
Image Retrieval
Desirable System Properties
6
Invariant to scale, illumination, translation
Be able to provide a fast searchMemory Efficient
How to achieve them?
Image Retrieval
General Pipeline
● Encode images into representations that are
○ Invariant
○ Compact
7
8
Image Retrieval
Image Representations
● Hand-Crafted Features (Before 2012)
○ SIFT
○ SIFT + BoW
○ VLAD
○ Fisher Vectors
Josef Sivic, Andrew Zisserman, et al. Video google: A text retrieval approach to object matching in videos. In ICCV, 2003.
9
Image Retrieval
Image Representations
● Learned Features (After 2012)
○ Convolutional Neural Networks (CNNs)
.
.
.
.
.
.
Convolutional Layer Max-Pooling Layer Fully-Connected Layer
Flattening
Class Predictions
Input Image
Softmax Activation
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556. (2014)
VGG-16
10
Image Retrieval
Image Representations
● Learned Features (After 2012)
○ Convolutional Neural Networks (CNNs)
.
.
.
.
.
.
Flattening
Large Datasets
Labels
Computational Power (GPUs)
Training is expensive!
11
Retrieval
System
➔ Collecting, annotating and cleaning a large dataset to train models
requires great effort.
➔ Training or fine-tuning a model every time new data is added is neither
efficient nor scalable.
Dynamic datasets of unlabeled images
Image Representations
Challenges for Learning Features in Retrieval
12
➔ Transferring the knowledge of a model trained in a large-scale dataset.
(e.g. ImageNet)
One solution that does not involve training is Transfer Learning
Image Representations
Challenges for Learning Features
Classification → Retrieval
How?
Outline
▷ Related Work
▷ Proposal
▷ Experiments
▷ Conclusions
13
Related Work
14
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In
DeepVision CVPRW 2014
Fully-connected features as global representation
15
Image Representations
Fully-Connected Features
.
.
.
.
.
.
Convolutional Layer Max-Pooling Layer Fully-Connected Layer
Flattening
Class Predictions
Input Image
Softmax Activation
Neural Codes
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv
preprint arXiv:1512.04065.
16
Image Representations
Convolutional Features
.
.
.
.
.
.
Convolutional Layer Max-Pooling Layer Fully-Connected Layer
Flattening
Class Predictions
Input Image
Softmax Activation
Convolutional
Features
Convolutional features as global representation
➔ They convey the spatial information of the image
17
Image Representations
From Feature Maps to Compact Representations
(Fh
x Fw
x 3) x K Filters
Stride: S, Padding: P
Ih
x Iw
x 3 (H x W) x K Feature Maps
H = (Ih
− Fh
+ 2P) / S +1
W = (Iw
− Fw
+ 2P) / S +1
Image Convolutional layer Feature Maps
18
Image Representations
From Feature Maps to Compact Representations
1 0 1
0 1 0
9 0 4
1 2 1
0 0 0
0 0 0
0 0 1
0 0 0
5 1 1
Max Pooling
Max Pooling
Max Pooling
9
2
5
3 Feature Maps (3x3) Compact vector of (1x3)Pooling Operation
19
Image Representations
3 Feature Maps (3x3)
1 0 1
1 1 0
9 0 4
1 2 1
0 0 0
0 0 0
0 0 1
0 0 0
5 1 1
Sum Pooling
Sum Pooling
Sum Pooling
Compact vector of (1x3)
17
4
8
From Feature Maps to Compact Representations
Pooling Operation
Image Representations
Related Works
SPoC R-MAC BoW CroW
Combine and aggregate convolutional features:
● SPoC → Gaussian prior + Sum Pooling
● R-MAC → Rigid grid of regions + Max Pooling
● BoW → Bag of Visual Words
● CroW → Spatial weighting based on normalized sum of feature maps (boost
salient content) + channel weighting based on sparsity + Sum Pooling
20
Image Representations
Related Works
SPoC R-MAC BoW CroW
21
All except CroW:
● Apply fixed weights
● Based on fixed regions
Not based on image contents
Proposal
22
Dog
Tennis Ball
Sand bar
Seashore
23
Motivation
Image Semantics
Objective
.
.
.
.
.
.
Flattening
Class Predictions
Input Image Convolutional
Features
Combine semantic content of the images with convolutional features
Explicit
Semantic
Information
Statue
GAP .
.
.
.
.
.
w1
w2
w3
wN
Class 1
(Tennis Ball)
VGG-16
CAMi
= w1,i
∗ conv1
+ w2,i
∗ conv2
+ … + wN,i
∗ convN
Class N
CAM
(Tennis Ball)
B. Zhou, A. Khosla, Lapedriza. A., A. Oliva, and A. Torralba. 2016. Learning Deep Features for Discriminative Localization. CVPR
(2016).
25
CAM layer
Class Activation Maps
Obtaining Discriminative Regions
Tennis Ball Weimaraner
Simple Classes
26
Class Activation Maps
Examples
Sand Bar Seashore
Complex Classes
27
Class Activation Maps
Examples
Encode images dynamically combining their semantics with convolutional features using
only the knowledge contained inside a trained network
Compact
Descriptor
Class-Weighted Convolutional Features
28
GAP .
.
.
.
.
.
w1
w2
w3
wN
Class 1
(Tennis Ball)
CAM layer
Class N
Proposal
29
CNN
CW1
CW2
CWK
Class 1
(Weimaraner) F1
= [ f1,1
, f1,2
, … , f1,K
]
CW1
CW2
CWK F2
= [ f2,1
, f2,2
, … , f2,K
]
CW1
CW2
CWK F3
= [ f3,1
, f3,2
, … , f3,K
]
Class 2
(Tennis Ball)
Class 3
(Seashore)
Sum-Pooling
l2
norm
PCA + l2
PCA+ l2
PCA + l2
l2
Sum-Pooling
l2
norm
Sum-Pooling
l2
norm
DI
= [d1
, … , dK
]
Image: I
2. Feature Weighting and Pooling 3. Descriptor Aggregation
CAM layer
Feat. Ext.
1. Features & CAMs Extraction
Image Encoding Pipeline
GAP .
.
.
.
.
.
w1
w2
w3
wN
Class 1
(Tennis Ball)
CAM layer
Class N
Feature Extraction
CAMs Computation
Depth = K
In a single forward pass we extract convolutional features and image CAMs
Normalized
CAMs
[0,1]
30
Image Encoding Pipeline
1. Features & CAMs Extraction
CW1
CW2
CWK
F1
= [f1,1
, f1,2
, … , f1,K
]
CW1
CW2
CWK
F2
= [f2,1
, f2,2
, … , f2,K
]
CW1
CW2
CWK
FN
= [fN,1
, fN,2
, … , fN,K
]
Sum-Pooling
l2
norm
Sum-Pooling
l2
norm
Sum-Pooling
l2
norm
.
.
.
.
.
.
One vector per class:
31
Tennis Ball
(Class 1)
Weimaraner
(Class 2)
Seashore
(Class N)
Spatial Weighting using Class Activation Maps
Channel Weighting based on feature maps sparsity (as in CroW)
Image Encoding Pipeline
2. Feature Weighting and Pooling
32
F1
= [ f1,1
, f1,2
, … , f1,K
]
F2
= [ f2,1
, f2,2
, … , f2,K
]
F N
= [ fN,1
, fN,2
, … , fN,K
]
PCA + l2
PCA+ l2
PCA + l2
l2
DI
= [d1
, … , dK
]
➔ Which classes do we aggregate?
➔ What is the optimal number of classes (NC
) ?
➔ What is the optimal number of classes (NPCA
) to compute the PCA ?
.
.
.
Image Encoding Pipeline
3. Descriptor Aggregation
Image
Encoding
Pipeline
Store all the
class vectors (Fc
)
Image Dataset
Query
Image
Encoding
Pipeline
CNN Most Probable Classes Prediction
from query (Retriever, Seashore...)
Query Descriptor
Dataset Descriptor
Aggregation
Scores
33
Descriptor Aggregation Strategies
Online Aggregation (OnA)
Image
Encoding
Pipeline
Store final descriptor
Image Dataset
Query
Image
Encoding
Pipeline
CNN Most Probable Classes Prediction
Query Descriptor
Dataset Descriptor
Aggregation
Scores
Pre-Computed Descriptors
34
Descriptor Aggregation Strategies
Offline Aggregation (OfA)
Experiments
35
● Framework: Keras with Theano as backend / PyTorch
● Images resized to 1024x720 (keeping aspect ratio)
● We explored using different CNNs as feature extractors
36
Experimental Setup
Experimental Setup
Examples of CAMs from different networks
37
Experimental Setup
Examples of CAMs from different networks
38
● VGG-16 CAM model as feature extractor (pre-trained on ImageNet)
● Features from Conv5_1 Layer
GAP .
.
.
.
.
.
w1
w2
w3
wN
CAM layer
39
Conv5_1
Experimental Setup
● Datasets
○ Oxford 5k
○ Paris 6k
○ 100k Distractors (Flickr) → Oxford105k, Paris106k
● Using the features that fall inside the cropped query
● PCA computed with Oxf5k when testing in Par6k and vice versa
Philbin, J. , Chum, O. , Isard, M. , Sivic, J. and Zisserman, A. Object retrieval with large vocabularies and fast spatial matching, CVPR 2007
Philbin, J. , Chum, O. , Isard, M. , Sivic, J. and Zisserman, A. Lost in Quantization: Improving Particular Object Retrieval in Large Scale Image
Databases. CVPR 2008
40
Experimental Setup
● Scores computed with cosine similarity
41
● Evaluation metric: mean Average Precision (mAP)
Experimental Setup
42
Finding the optimal parameters
Ablation Studies
● We carried out ablation studies to validate our method
● Computational Burden
● Number of top-classes to add: NC
● Number of top-classes to compute the PCA matrix: NPCA
43
Ablation Studies
Baseline Results & Computational Burden
44
Ablation Studies
Networks Comparison
45
Nc
corresponds to the number of classes aggregated.
Npca
corresponds to the classes used to compute the PCA (computed with data of
Oxford when testing in Paris and vice versa).
Ablation Studies
Sensitivity with respect to NC
and Npca
46
The first 16 classes correspond to: vault, bell cote, bannister, analog clock, movie
theater, coil, pier, dome, pedestal, flagpole, church, chime, suspension bridge, birdhouse,
sundial, triumphal arch. All of them are related to buildings.
Ablation Studies
Using predefined classes
47
Comparison with State-of-the-Art
Nc
= 64 , Npca
= 1
48
Search Post-Processing
● We establish a set of thresholds based on the max value of the
normalized CAM
○ 1%, 10%, 20%, 30%, 40%
● Generate a bounding box covering the largest connected element
● Compare the descriptor of each new region of the target image with the
query descriptor and generate new scores.
● Results a bit better if we average over 2 most probable CAMs
● For the top-R target images
Region Proposals for Re-Ranking
Query Expansion
● Generate a new query by l2
normalizing the sum of the top-QE target descriptors
49
Green - Ground Truth
Orange → Red - Different Thresholds
Region Proposals using CAMs
50
Green - Ground Truth
Orange → Red - Different Thresholds
Region Proposals using CAMs
51
Comparison with State-of-the-Art
After Re-Ranking and Query Expansion
Nc
= 64 , Npca
= 1, Nre-ranking
= 6
Conclusions
52
▷ In this work we proposed a technique to build compact image
representations focusing on their semantic content.
▷ We employed an image encoding pipeline that makes use of a
pre-trained CNN and Class Activation Maps to extract discriminative
regions from the image and weight its convolutional features
accordingly
▷ Our experiments demonstrated that selecting the relevant content of
an image to build the image descriptor is beneficial, and contributes to
increase the retrieval performance. The proposed approach establishes
a new state-of-the-art compared to methods that build image
representations combining off-the-shelf features using random or fixed
grid regions.
53
Conclusions
5. Conclusions
54
1 of 54

More Related Content

What's hot(20)

Region-oriented Convolutional Networks for Object RetrievalRegion-oriented Convolutional Networks for Object Retrieval
Region-oriented Convolutional Networks for Object Retrieval
Universitat Politècnica de Catalunya2.8K views
Content-based Image Retrieval - Eva Mohedano - UPC Barcelona 2018Content-based Image Retrieval - Eva Mohedano - UPC Barcelona 2018
Content-based Image Retrieval - Eva Mohedano - UPC Barcelona 2018
Universitat Politècnica de Catalunya1K views
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
Universitat Politècnica de Catalunya1.4K views
Deep and Young Vision Learning at UPC BarcelonaTech (NIPS 2016)Deep and Young Vision Learning at UPC BarcelonaTech (NIPS 2016)
Deep and Young Vision Learning at UPC BarcelonaTech (NIPS 2016)
Universitat Politècnica de Catalunya852 views
Mask R-CNNMask R-CNN
Mask R-CNN
Chanuk Lim22.8K views
Object detection - RCNNs vs RetinanetObject detection - RCNNs vs Retinanet
Object detection - RCNNs vs Retinanet
Rishabh Indoria1.7K views
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Universitat Politècnica de Catalunya6.4K views
Deep Learning for Computer Vision: Object Detection (UPC 2016)Deep Learning for Computer Vision: Object Detection (UPC 2016)
Deep Learning for Computer Vision: Object Detection (UPC 2016)
Universitat Politècnica de Catalunya6.4K views
Visual Object Analysis using Regions and Local FeaturesVisual Object Analysis using Regions and Local Features
Visual Object Analysis using Regions and Local Features
Universitat Politècnica de Catalunya828 views
Lec11 object-re-idLec11 object-re-id
Lec11 object-re-id
United States Air Force Academy384 views
Deep 3D Analysis - Javier Ruiz-Hidalgo - UPC Barcelona 2018Deep 3D Analysis - Javier Ruiz-Hidalgo - UPC Barcelona 2018
Deep 3D Analysis - Javier Ruiz-Hidalgo - UPC Barcelona 2018
Universitat Politècnica de Catalunya477 views

Similar to Class Weighted Convolutional Features for Image Retrieval (20)

D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
Universitat Politècnica de Catalunya549 views
ECCV WS 2012 (Frank)ECCV WS 2012 (Frank)
ECCV WS 2012 (Frank)
Chun-Hao Huang279 views
Lec07 aggregation-and-retrieval-systemLec07 aggregation-and-retrieval-system
Lec07 aggregation-and-retrieval-system
United States Air Force Academy742 views
物件偵測與辨識技術物件偵測與辨識技術
物件偵測與辨識技術
CHENHuiMei368 views
object detection paper reviewobject detection paper review
object detection paper review
Yoonho Na53 views
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
Sungjoon Choi10.4K views
D3L4-objects.pdfD3L4-objects.pdf
D3L4-objects.pdf
ssusere945ae1 view
Poster_FinalPoster_Final
Poster_Final
Nicholas Chehade64 views

More from Universitat Politècnica de Catalunya(20)

Deep Generative Learning for AllDeep Generative Learning for All
Deep Generative Learning for All
Universitat Politècnica de Catalunya297 views
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Universitat Politècnica de Catalunya289 views
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
Universitat Politècnica de Catalunya258 views
Open challenges in sign language translation and productionOpen challenges in sign language translation and production
Open challenges in sign language translation and production
Universitat Politècnica de Catalunya187 views
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Universitat Politècnica de Catalunya193 views
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
Universitat Politècnica de Catalunya663 views
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Universitat Politècnica de Catalunya334 views
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Universitat Politècnica de Catalunya403 views
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
Universitat Politècnica de Catalunya662 views
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Universitat Politècnica de Catalunya864 views

Recently uploaded(20)

Data structure and algorithm. Data structure and algorithm.
Data structure and algorithm.
Abdul salam 12 views
MOSORE_BRESCIAMOSORE_BRESCIA
MOSORE_BRESCIA
Federico Karagulian5 views
ColonyOSColonyOS
ColonyOS
JohanKristiansson69 views
3196 The Case of The East River3196 The Case of The East River
3196 The Case of The East River
ErickANDRADE9011 views
Introduction to Microsoft Fabric.pdfIntroduction to Microsoft Fabric.pdf
Introduction to Microsoft Fabric.pdf
ishaniuudeshika21 views
Survey on Factuality in LLM's.pptxSurvey on Factuality in LLM's.pptx
Survey on Factuality in LLM's.pptx
NeethaSherra15 views
PROGRAMME.pdfPROGRAMME.pdf
PROGRAMME.pdf
HiNedHaJar14 views
Building Real-Time Travel AlertsBuilding Real-Time Travel Alerts
Building Real-Time Travel Alerts
Timothy Spann102 views
RIO GRANDE SUPPLY COMPANY INC, JAYSON.docxRIO GRANDE SUPPLY COMPANY INC, JAYSON.docx
RIO GRANDE SUPPLY COMPANY INC, JAYSON.docx
JaysonGarabilesEspej6 views
PTicketInput.pdfPTicketInput.pdf
PTicketInput.pdf
stuartmcphersonflipm314 views
How Leaders See Data? (Level 1)How Leaders See Data? (Level 1)
How Leaders See Data? (Level 1)
Narendra Narendra10 views

Class Weighted Convolutional Features for Image Retrieval

  • 1. Class-Weighted Convolutional Features for Image Retrieval Xavier Giró-i-NietoAlbert Jiménez Jose Alvarez [Code][arXiv]
  • 3. Image Retrieval Given an image query, generate a ranked list of similar images Task Definition 3
  • 4. Image Retrieval Given an image query, generate a ranked list of similar images Task Definition 4 What are the properties that we want for these systems?
  • 5. Image Retrieval Desirable System Properties 5 Invariant to scale, illumination, translation Be able to provide a fast searchMemory Efficient
  • 6. Image Retrieval Desirable System Properties 6 Invariant to scale, illumination, translation Be able to provide a fast searchMemory Efficient How to achieve them?
  • 7. Image Retrieval General Pipeline ● Encode images into representations that are ○ Invariant ○ Compact 7
  • 8. 8 Image Retrieval Image Representations ● Hand-Crafted Features (Before 2012) ○ SIFT ○ SIFT + BoW ○ VLAD ○ Fisher Vectors Josef Sivic, Andrew Zisserman, et al. Video google: A text retrieval approach to object matching in videos. In ICCV, 2003.
  • 9. 9 Image Retrieval Image Representations ● Learned Features (After 2012) ○ Convolutional Neural Networks (CNNs) . . . . . . Convolutional Layer Max-Pooling Layer Fully-Connected Layer Flattening Class Predictions Input Image Softmax Activation Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. (2014) VGG-16
  • 10. 10 Image Retrieval Image Representations ● Learned Features (After 2012) ○ Convolutional Neural Networks (CNNs) . . . . . . Flattening Large Datasets Labels Computational Power (GPUs) Training is expensive!
  • 11. 11 Retrieval System ➔ Collecting, annotating and cleaning a large dataset to train models requires great effort. ➔ Training or fine-tuning a model every time new data is added is neither efficient nor scalable. Dynamic datasets of unlabeled images Image Representations Challenges for Learning Features in Retrieval
  • 12. 12 ➔ Transferring the knowledge of a model trained in a large-scale dataset. (e.g. ImageNet) One solution that does not involve training is Transfer Learning Image Representations Challenges for Learning Features Classification → Retrieval How?
  • 13. Outline ▷ Related Work ▷ Proposal ▷ Experiments ▷ Conclusions 13
  • 15. Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014 Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014 Fully-connected features as global representation 15 Image Representations Fully-Connected Features . . . . . . Convolutional Layer Max-Pooling Layer Fully-Connected Layer Flattening Class Predictions Input Image Softmax Activation Neural Codes
  • 16. Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015 Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016 Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065. 16 Image Representations Convolutional Features . . . . . . Convolutional Layer Max-Pooling Layer Fully-Connected Layer Flattening Class Predictions Input Image Softmax Activation Convolutional Features Convolutional features as global representation ➔ They convey the spatial information of the image
  • 17. 17 Image Representations From Feature Maps to Compact Representations (Fh x Fw x 3) x K Filters Stride: S, Padding: P Ih x Iw x 3 (H x W) x K Feature Maps H = (Ih − Fh + 2P) / S +1 W = (Iw − Fw + 2P) / S +1 Image Convolutional layer Feature Maps
  • 18. 18 Image Representations From Feature Maps to Compact Representations 1 0 1 0 1 0 9 0 4 1 2 1 0 0 0 0 0 0 0 0 1 0 0 0 5 1 1 Max Pooling Max Pooling Max Pooling 9 2 5 3 Feature Maps (3x3) Compact vector of (1x3)Pooling Operation
  • 19. 19 Image Representations 3 Feature Maps (3x3) 1 0 1 1 1 0 9 0 4 1 2 1 0 0 0 0 0 0 0 0 1 0 0 0 5 1 1 Sum Pooling Sum Pooling Sum Pooling Compact vector of (1x3) 17 4 8 From Feature Maps to Compact Representations Pooling Operation
  • 20. Image Representations Related Works SPoC R-MAC BoW CroW Combine and aggregate convolutional features: ● SPoC → Gaussian prior + Sum Pooling ● R-MAC → Rigid grid of regions + Max Pooling ● BoW → Bag of Visual Words ● CroW → Spatial weighting based on normalized sum of feature maps (boost salient content) + channel weighting based on sparsity + Sum Pooling 20
  • 21. Image Representations Related Works SPoC R-MAC BoW CroW 21 All except CroW: ● Apply fixed weights ● Based on fixed regions Not based on image contents
  • 24. Objective . . . . . . Flattening Class Predictions Input Image Convolutional Features Combine semantic content of the images with convolutional features Explicit Semantic Information Statue
  • 25. GAP . . . . . . w1 w2 w3 wN Class 1 (Tennis Ball) VGG-16 CAMi = w1,i ∗ conv1 + w2,i ∗ conv2 + … + wN,i ∗ convN Class N CAM (Tennis Ball) B. Zhou, A. Khosla, Lapedriza. A., A. Oliva, and A. Torralba. 2016. Learning Deep Features for Discriminative Localization. CVPR (2016). 25 CAM layer Class Activation Maps Obtaining Discriminative Regions
  • 26. Tennis Ball Weimaraner Simple Classes 26 Class Activation Maps Examples
  • 27. Sand Bar Seashore Complex Classes 27 Class Activation Maps Examples
  • 28. Encode images dynamically combining their semantics with convolutional features using only the knowledge contained inside a trained network Compact Descriptor Class-Weighted Convolutional Features 28 GAP . . . . . . w1 w2 w3 wN Class 1 (Tennis Ball) CAM layer Class N Proposal
  • 29. 29 CNN CW1 CW2 CWK Class 1 (Weimaraner) F1 = [ f1,1 , f1,2 , … , f1,K ] CW1 CW2 CWK F2 = [ f2,1 , f2,2 , … , f2,K ] CW1 CW2 CWK F3 = [ f3,1 , f3,2 , … , f3,K ] Class 2 (Tennis Ball) Class 3 (Seashore) Sum-Pooling l2 norm PCA + l2 PCA+ l2 PCA + l2 l2 Sum-Pooling l2 norm Sum-Pooling l2 norm DI = [d1 , … , dK ] Image: I 2. Feature Weighting and Pooling 3. Descriptor Aggregation CAM layer Feat. Ext. 1. Features & CAMs Extraction Image Encoding Pipeline
  • 30. GAP . . . . . . w1 w2 w3 wN Class 1 (Tennis Ball) CAM layer Class N Feature Extraction CAMs Computation Depth = K In a single forward pass we extract convolutional features and image CAMs Normalized CAMs [0,1] 30 Image Encoding Pipeline 1. Features & CAMs Extraction
  • 31. CW1 CW2 CWK F1 = [f1,1 , f1,2 , … , f1,K ] CW1 CW2 CWK F2 = [f2,1 , f2,2 , … , f2,K ] CW1 CW2 CWK FN = [fN,1 , fN,2 , … , fN,K ] Sum-Pooling l2 norm Sum-Pooling l2 norm Sum-Pooling l2 norm . . . . . . One vector per class: 31 Tennis Ball (Class 1) Weimaraner (Class 2) Seashore (Class N) Spatial Weighting using Class Activation Maps Channel Weighting based on feature maps sparsity (as in CroW) Image Encoding Pipeline 2. Feature Weighting and Pooling
  • 32. 32 F1 = [ f1,1 , f1,2 , … , f1,K ] F2 = [ f2,1 , f2,2 , … , f2,K ] F N = [ fN,1 , fN,2 , … , fN,K ] PCA + l2 PCA+ l2 PCA + l2 l2 DI = [d1 , … , dK ] ➔ Which classes do we aggregate? ➔ What is the optimal number of classes (NC ) ? ➔ What is the optimal number of classes (NPCA ) to compute the PCA ? . . . Image Encoding Pipeline 3. Descriptor Aggregation
  • 33. Image Encoding Pipeline Store all the class vectors (Fc ) Image Dataset Query Image Encoding Pipeline CNN Most Probable Classes Prediction from query (Retriever, Seashore...) Query Descriptor Dataset Descriptor Aggregation Scores 33 Descriptor Aggregation Strategies Online Aggregation (OnA)
  • 34. Image Encoding Pipeline Store final descriptor Image Dataset Query Image Encoding Pipeline CNN Most Probable Classes Prediction Query Descriptor Dataset Descriptor Aggregation Scores Pre-Computed Descriptors 34 Descriptor Aggregation Strategies Offline Aggregation (OfA)
  • 36. ● Framework: Keras with Theano as backend / PyTorch ● Images resized to 1024x720 (keeping aspect ratio) ● We explored using different CNNs as feature extractors 36 Experimental Setup
  • 37. Experimental Setup Examples of CAMs from different networks 37
  • 38. Experimental Setup Examples of CAMs from different networks 38
  • 39. ● VGG-16 CAM model as feature extractor (pre-trained on ImageNet) ● Features from Conv5_1 Layer GAP . . . . . . w1 w2 w3 wN CAM layer 39 Conv5_1 Experimental Setup
  • 40. ● Datasets ○ Oxford 5k ○ Paris 6k ○ 100k Distractors (Flickr) → Oxford105k, Paris106k ● Using the features that fall inside the cropped query ● PCA computed with Oxf5k when testing in Par6k and vice versa Philbin, J. , Chum, O. , Isard, M. , Sivic, J. and Zisserman, A. Object retrieval with large vocabularies and fast spatial matching, CVPR 2007 Philbin, J. , Chum, O. , Isard, M. , Sivic, J. and Zisserman, A. Lost in Quantization: Improving Particular Object Retrieval in Large Scale Image Databases. CVPR 2008 40 Experimental Setup
  • 41. ● Scores computed with cosine similarity 41 ● Evaluation metric: mean Average Precision (mAP) Experimental Setup
  • 42. 42 Finding the optimal parameters Ablation Studies ● We carried out ablation studies to validate our method ● Computational Burden ● Number of top-classes to add: NC ● Number of top-classes to compute the PCA matrix: NPCA
  • 43. 43 Ablation Studies Baseline Results & Computational Burden
  • 45. 45 Nc corresponds to the number of classes aggregated. Npca corresponds to the classes used to compute the PCA (computed with data of Oxford when testing in Paris and vice versa). Ablation Studies Sensitivity with respect to NC and Npca
  • 46. 46 The first 16 classes correspond to: vault, bell cote, bannister, analog clock, movie theater, coil, pier, dome, pedestal, flagpole, church, chime, suspension bridge, birdhouse, sundial, triumphal arch. All of them are related to buildings. Ablation Studies Using predefined classes
  • 48. 48 Search Post-Processing ● We establish a set of thresholds based on the max value of the normalized CAM ○ 1%, 10%, 20%, 30%, 40% ● Generate a bounding box covering the largest connected element ● Compare the descriptor of each new region of the target image with the query descriptor and generate new scores. ● Results a bit better if we average over 2 most probable CAMs ● For the top-R target images Region Proposals for Re-Ranking Query Expansion ● Generate a new query by l2 normalizing the sum of the top-QE target descriptors
  • 49. 49 Green - Ground Truth Orange → Red - Different Thresholds Region Proposals using CAMs
  • 50. 50 Green - Ground Truth Orange → Red - Different Thresholds Region Proposals using CAMs
  • 51. 51 Comparison with State-of-the-Art After Re-Ranking and Query Expansion Nc = 64 , Npca = 1, Nre-ranking = 6
  • 53. ▷ In this work we proposed a technique to build compact image representations focusing on their semantic content. ▷ We employed an image encoding pipeline that makes use of a pre-trained CNN and Class Activation Maps to extract discriminative regions from the image and weight its convolutional features accordingly ▷ Our experiments demonstrated that selecting the relevant content of an image to build the image descriptor is beneficial, and contributes to increase the retrieval performance. The proposed approach establishes a new state-of-the-art compared to methods that build image representations combining off-the-shelf features using random or fixed grid regions. 53 Conclusions