
Class Weighted Convolutional Features for Image Retrieval


http://imatge-upc.github.io/retrieval-2017-cam/

Image retrieval in realistic scenarios targets large dynamic datasets of unlabeled images. In these cases, training or fine-tuning a model every time new images are added to the database is neither efficient nor scalable.
Convolutional neural networks trained for image classification over large datasets have proven to be effective feature extractors when transferred to the task of image retrieval. The most successful approaches are based on encoding the activations of convolutional layers, as they convey the image's spatial information. Our proposal goes beyond this and aims at a local-aware encoding of these features depending on the predicted image semantics, with the advantage of using only the knowledge contained inside the network.
In particular, we employ Class Activation Maps (CAMs) to obtain the most discriminative regions from a semantic perspective. Additionally, CAMs are also used to generate object proposals during an unsupervised re-ranking stage after a first fast search.
Our experiments on two publicly available datasets for instance retrieval, Oxford5k and Paris6k, demonstrate that our system is competitive and even outperforms the current state-of-the-art when using off-the-shelf models trained on the object classes of ImageNet.



  1. 1. Class-Weighted Convolutional Features for Image Retrieval. Xavier Giró-i-Nieto, Albert Jiménez, Jose Alvarez. [Code] [arXiv]
  2. 2. Image Retrieval Google Search By Image 2
  3. 3. Image Retrieval Given an image query, generate a ranked list of similar images Task Definition 3
  4. 4. Image Retrieval Given an image query, generate a ranked list of similar images Task Definition 4 What are the properties that we want for these systems?
  5. 5. Image Retrieval Desirable System Properties 5 ● Invariant to scale, illumination, translation ● Able to provide a fast search ● Memory efficient
  6. 6. Image Retrieval Desirable System Properties 6 ● Invariant to scale, illumination, translation ● Able to provide a fast search ● Memory efficient How to achieve them?
  7. 7. Image Retrieval General Pipeline ● Encode images into representations that are ○ Invariant ○ Compact 7
  8. 8. 8 Image Retrieval Image Representations ● Hand-Crafted Features (Before 2012) ○ SIFT ○ SIFT + BoW ○ VLAD ○ Fisher Vectors Josef Sivic, Andrew Zisserman, et al. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
  9. 9. 9 Image Retrieval Image Representations ● Learned Features (After 2012) ○ Convolutional Neural Networks (CNNs) . . . . . . Convolutional Layer Max-Pooling Layer Fully-Connected Layer Flattening Class Predictions Input Image Softmax Activation Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. (2014) VGG-16
  10. 10. 10 Image Retrieval Image Representations ● Learned Features (After 2012) ○ Convolutional Neural Networks (CNNs) . . . . . . Flattening Large Datasets Labels Computational Power (GPUs) Training is expensive!
  11. 11. 11 Retrieval System ➔ Collecting, annotating and cleaning a large dataset to train models requires great effort. ➔ Training or fine-tuning a model every time new data is added is neither efficient nor scalable. Dynamic datasets of unlabeled images Image Representations Challenges for Learning Features in Retrieval
  12. 12. 12 ➔ Transferring the knowledge of a model trained on a large-scale dataset (e.g. ImageNet). One solution that does not involve training is Transfer Learning Image Representations Challenges for Learning Features Classification → Retrieval How?
  13. 13. Outline ▷ Related Work ▷ Proposal ▷ Experiments ▷ Conclusions 13
  14. 14. Related Work 14
  15. 15. Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014 Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014 Fully-connected features as global representation 15 Image Representations Fully-Connected Features . . . . . . Convolutional Layer Max-Pooling Layer Fully-Connected Layer Flattening Class Predictions Input Image Softmax Activation Neural Codes
  16. 16. Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015 Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016 Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065. 16 Image Representations Convolutional Features . . . . . . Convolutional Layer Max-Pooling Layer Fully-Connected Layer Flattening Class Predictions Input Image Softmax Activation Convolutional Features Convolutional features as global representation ➔ They convey the spatial information of the image
  17. 17. 17 Image Representations From Feature Maps to Compact Representations: an Ih x Iw x 3 image passed through a convolutional layer of K filters of size (Fh x Fw x 3), with stride S and padding P, yields (H x W) x K feature maps, where H = (Ih − Fh + 2P)/S + 1 and W = (Iw − Fw + 2P)/S + 1.
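As a sanity check of the output-size formula on this slide, here is a minimal Python sketch; the function name is illustrative and not from the original deck.

```python
def conv_output_size(i, f, p, s):
    """Output size along one spatial dimension: (I - F + 2P) / S + 1."""
    return (i - f + 2 * p) // s + 1

# Example: a 224x224 input with 3x3 filters, padding 1 and stride 1
# keeps the spatial resolution, producing K feature maps of 224x224.
print(conv_output_size(224, 3, 1, 1))  # -> 224
```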
  18. 18. 18 Image Representations From Feature Maps to Compact Representations: Pooling Operation. Max pooling each of the 3 feature maps (3x3), e.g. [[1,0,1],[0,1,0],[9,0,4]], [[1,2,1],[0,0,0],[0,0,0]] and [[0,0,1],[0,0,0],[5,1,1]], yields a compact vector of (1x3): [9, 2, 5].
  19. 19. 19 Image Representations From Feature Maps to Compact Representations: Pooling Operation. Sum pooling 3 feature maps (3x3), e.g. [[1,0,1],[1,1,0],[9,0,4]], [[1,2,1],[0,0,0],[0,0,0]] and [[0,0,1],[0,0,0],[5,1,1]], yields a compact vector of (1x3): [17, 4, 8].
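A minimal NumPy sketch of the two pooling operations from these slides, using the matrices as they appear on the sum-pooling slide; variable names are illustrative.

```python
import numpy as np

# K = 3 feature maps of 3x3, as drawn on the slide.
feature_maps = np.array([
    [[1, 0, 1], [1, 1, 0], [9, 0, 4]],
    [[1, 2, 1], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 1], [0, 0, 0], [5, 1, 1]],
])

# Global pooling collapses each (H x W) map to a single value,
# turning the K maps into a compact vector of (1 x K).
max_pooled = feature_maps.max(axis=(1, 2))  # -> [9, 2, 5]
sum_pooled = feature_maps.sum(axis=(1, 2))  # -> [17, 4, 8]
```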
  20. 20. Image Representations Related Works SPoC R-MAC BoW CroW Combine and aggregate convolutional features: ● SPoC → Gaussian prior + Sum Pooling ● R-MAC → Rigid grid of regions + Max Pooling ● BoW → Bag of Visual Words ● CroW → Spatial weighting based on normalized sum of feature maps (boost salient content) + channel weighting based on sparsity + Sum Pooling 20
  21. 21. Image Representations Related Works SPoC R-MAC BoW CroW 21 All except CroW: ● Apply fixed weights ● Based on fixed regions → Not based on image content
  22. 22. Proposal 22
  23. 23. Dog Tennis Ball Sand bar Seashore 23 Motivation Image Semantics
  24. 24. Objective . . . . . . Flattening Class Predictions Input Image Convolutional Features Combine semantic content of the images with convolutional features Explicit Semantic Information Statue
  25. 25. GAP . . . . . . w1 w2 w3 wN Class 1 (Tennis Ball) VGG-16 CAMi = w1,i ∗ conv1 + w2,i ∗ conv2 + … + wN,i ∗ convN Class N CAM (Tennis Ball) B. Zhou, A. Khosla, Lapedriza. A., A. Oliva, and A. Torralba. 2016. Learning Deep Features for Discriminative Localization. CVPR (2016). 25 CAM layer Class Activation Maps Obtaining Discriminative Regions
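A minimal NumPy sketch of the CAM formula on this slide, assuming the (K, H, W) feature maps of the last convolutional layer and the GAP-to-class weights w_{k,i} are already extracted; function and variable names are illustrative.

```python
import numpy as np

def class_activation_map(conv_maps, class_weights):
    """conv_maps: (K, H, W) last-layer feature maps;
    class_weights: (K,) weights w_{k,i} connecting GAP outputs to class i."""
    cam = np.tensordot(class_weights, conv_maps, axes=1)  # sum_k w_{k,i} * conv_k
    cam -= cam.min()          # normalize to [0, 1], as used later
    if cam.max() > 0:         # in the encoding pipeline
        cam /= cam.max()
    return cam
```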
  26. 26. Tennis Ball Weimaraner Simple Classes 26 Class Activation Maps Examples
  27. 27. Sand Bar Seashore Complex Classes 27 Class Activation Maps Examples
  28. 28. Encode images dynamically combining their semantics with convolutional features using only the knowledge contained inside a trained network Compact Descriptor Class-Weighted Convolutional Features 28 GAP . . . . . . w1 w2 w3 wN Class 1 (Tennis Ball) CAM layer Class N Proposal
  29. 29. 29 CNN CW1 CW2 CWK Class 1 (Weimaraner) F1 = [ f1,1 , f1,2 , … , f1,K ] CW1 CW2 CWK F2 = [ f2,1 , f2,2 , … , f2,K ] CW1 CW2 CWK F3 = [ f3,1 , f3,2 , … , f3,K ] Class 2 (Tennis Ball) Class 3 (Seashore) Sum-Pooling l2 norm PCA + l2 PCA+ l2 PCA + l2 l2 Sum-Pooling l2 norm Sum-Pooling l2 norm DI = [d1 , … , dK ] Image: I 2. Feature Weighting and Pooling 3. Descriptor Aggregation CAM layer Feat. Ext. 1. Features & CAMs Extraction Image Encoding Pipeline
  30. 30. GAP . . . . . . w1 w2 w3 wN Class 1 (Tennis Ball) CAM layer Class N Feature Extraction CAMs Computation Depth = K In a single forward pass we extract convolutional features and image CAMs Normalized CAMs [0,1] 30 Image Encoding Pipeline 1. Features & CAMs Extraction
  31. 31. CW1 CW2 CWK F1 = [f1,1 , f1,2 , … , f1,K ] CW1 CW2 CWK F2 = [f2,1 , f2,2 , … , f2,K ] CW1 CW2 CWK FN = [fN,1 , fN,2 , … , fN,K ] Sum-Pooling l2 norm Sum-Pooling l2 norm Sum-Pooling l2 norm . . . . . . One vector per class: 31 Tennis Ball (Class 1) Weimaraner (Class 2) Seashore (Class N) Spatial Weighting using Class Activation Maps Channel Weighting based on feature maps sparsity (as in CroW) Image Encoding Pipeline 2. Feature Weighting and Pooling
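A minimal NumPy sketch of this weighting-and-pooling step, assuming one normalized CAM per predicted class. The channel weighting here is a simplified reading of the CroW formulation, not the paper's exact code.

```python
import numpy as np

def crow_channel_weights(conv_maps, eps=1e-12):
    """Sparsity-based channel weighting, simplified from CroW:
    channels that respond in few locations are boosted."""
    q = (conv_maps > 0).mean(axis=(1, 2))   # fraction of non-zero responses
    return np.log(q.sum() / (q + eps) + eps)

def class_vector(conv_maps, cam, channel_weights, eps=1e-12):
    """One class vector F_c from (K, H, W) features and an (H, W) CAM."""
    weighted = conv_maps * cam[None, :, :]  # spatial weighting by the CAM
    f = weighted.sum(axis=(1, 2))           # sum pooling -> (K,)
    f = f * channel_weights                 # CroW-style channel weighting
    return f / (np.linalg.norm(f) + eps)    # l2 normalization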
  32. 32. 32 F1 = [ f1,1 , f1,2 , … , f1,K ] F2 = [ f2,1 , f2,2 , … , f2,K ] F N = [ fN,1 , fN,2 , … , fN,K ] PCA + l2 PCA+ l2 PCA + l2 l2 DI = [d1 , … , dK ] ➔ Which classes do we aggregate? ➔ What is the optimal number of classes (NC ) ? ➔ What is the optimal number of classes (NPCA ) to compute the PCA ? . . . Image Encoding Pipeline 3. Descriptor Aggregation
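A minimal sketch of the aggregation step, using scikit-learn's PCA with whitening as an illustrative stand-in for the PCA matrix computed offline on the class vectors of the other dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

def aggregate_descriptor(class_vectors, pca):
    """class_vectors: (Nc, K) l2-normalized class vectors F_c;
    pca: a fitted PCA(whiten=True) model."""
    transformed = pca.transform(class_vectors)          # PCA whitening
    norms = np.linalg.norm(transformed, axis=1, keepdims=True)
    transformed /= norms + 1e-12                        # l2 after PCA
    d = transformed.sum(axis=0)                         # aggregate over classes
    return d / (np.linalg.norm(d) + 1e-12)              # final l2 normalization
```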
  33. 33. Image Encoding Pipeline Store all the class vectors (Fc ) Image Dataset Query Image Encoding Pipeline CNN Most Probable Classes Prediction from query (Retriever, Seashore...) Query Descriptor Dataset Descriptor Aggregation Scores 33 Descriptor Aggregation Strategies Online Aggregation (OnA)
  34. 34. Image Encoding Pipeline Store final descriptor Image Dataset Query Image Encoding Pipeline CNN Most Probable Classes Prediction Query Descriptor Dataset Descriptor Aggregation Scores Pre-Computed Descriptors 34 Descriptor Aggregation Strategies Offline Aggregation (OfA)
  35. 35. Experiments 35
  36. 36. ● Framework: Keras with Theano as backend / PyTorch ● Images resized to 1024x720 (keeping aspect ratio) ● We explored using different CNNs as feature extractors 36 Experimental Setup
  37. 37. Experimental Setup Examples of CAMs from different networks 37
  38. 38. Experimental Setup Examples of CAMs from different networks 38
  39. 39. ● VGG-16 CAM model as feature extractor (pre-trained on ImageNet) ● Features from Conv5_1 Layer GAP . . . . . . w1 w2 w3 wN CAM layer 39 Conv5_1 Experimental Setup
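A rough PyTorch sketch of grabbing Conv5_1 activations. Stock torchvision VGG-16 stands in for the paper's VGG-16 CAM model (which replaces the fully-connected layers with a convolutional + GAP head), so this is illustrative rather than a reproduction.

```python
import torch
from torchvision import models, transforms

# In vgg.features, index 24 is conv5_1 and index 25 its ReLU;
# hooking the ReLU gives the post-activation feature maps.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

features = {}
def save_conv5_1(module, inputs, output):
    features["conv5_1"] = output.detach().clone()  # clone: the ReLU is in-place

vgg.features[25].register_forward_hook(save_conv5_1)

preprocess = transforms.Compose([
    transforms.Resize((720, 1024)),  # slides: 1024x720 (aspect-ratio caveat)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# img = preprocess(PIL.Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    vgg(torch.randn(1, 3, 720, 1024))   # stand-in for a preprocessed image
conv_maps = features["conv5_1"][0]      # (512, 45, 64) feature maps
```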
  40. 40. ● Datasets ○ Oxford 5k ○ Paris 6k ○ 100k Distractors (Flickr) → Oxford105k, Paris106k ● Using the features that fall inside the cropped query ● PCA computed with Oxf5k when testing in Par6k and vice versa Philbin, J. , Chum, O. , Isard, M. , Sivic, J. and Zisserman, A. Object retrieval with large vocabularies and fast spatial matching, CVPR 2007 Philbin, J. , Chum, O. , Isard, M. , Sivic, J. and Zisserman, A. Lost in Quantization: Improving Particular Object Retrieval in Large Scale Image Databases. CVPR 2008 40 Experimental Setup
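For the cropped queries, a common way to keep only the features that fall inside the crop is to map the query bounding box from pixel coordinates to feature-map coordinates by the network's cumulative downsampling factor (16 at Conv5_1 of VGG-16). A minimal sketch, with illustrative names:

```python
def crop_conv_features(conv_maps, box, downsample=16):
    """Keep only activations inside the cropped query.
    conv_maps: (K, H, W); box: (x1, y1, x2, y2) in image pixels."""
    x1, y1, x2, y2 = (int(round(c / downsample)) for c in box)
    return conv_maps[:, y1:y2 + 1, x1:x2 + 1]
```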
  41. 41. ● Scores computed with cosine similarity 41 ● Evaluation metric: mean Average Precision (mAP) Experimental Setup
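Since all descriptors are l2-normalized, cosine similarity reduces to a dot product, so the whole dataset can be ranked with one matrix-vector product. A minimal NumPy sketch of the search stage:

```python
import numpy as np

def rank_dataset(query_descriptor, dataset_descriptors):
    """query_descriptor: (D,); dataset_descriptors: (N, D),
    all l2-normalized. Returns indices from most to least similar."""
    scores = dataset_descriptors @ query_descriptor  # cosine similarity
    return np.argsort(-scores), scores
```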
  42. 42. 42 Finding the optimal parameters Ablation Studies ● We carried out ablation studies to validate our method ● Computational Burden ● Number of top-classes to add: NC ● Number of top-classes to compute the PCA matrix: NPCA
  43. 43. 43 Ablation Studies Baseline Results & Computational Burden
  44. 44. 44 Ablation Studies Networks Comparison
  45. 45. 45 Nc corresponds to the number of classes aggregated. Npca corresponds to the classes used to compute the PCA (computed with data of Oxford when testing in Paris and vice versa). Ablation Studies Sensitivity with respect to NC and Npca
  46. 46. 46 The first 16 classes correspond to: vault, bell cote, bannister, analog clock, movie theater, coil, pier, dome, pedestal, flagpole, church, chime, suspension bridge, birdhouse, sundial, triumphal arch. All of them are related to buildings. Ablation Studies Using predefined classes
  47. 47. 47 Comparison with State-of-the-Art Nc = 64 , Npca = 1
  48. 48. 48 Search Post-Processing Region Proposals for Re-Ranking (applied to the top-R target images) ● We establish a set of thresholds based on the max value of the normalized CAM ○ 1%, 10%, 20%, 30%, 40% ● Generate a bounding box covering the largest connected component ● Compare the descriptor of each new region of the target image with the query descriptor and generate new scores ● Results are slightly better if we average over the 2 most probable CAMs Query Expansion ● Generate a new query by l2-normalizing the sum of the top-QE target descriptors
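A minimal sketch of both post-processing steps, using SciPy's connected-component labelling for the largest connected component and following the slide's description of query expansion; names are illustrative.

```python
import numpy as np
from scipy import ndimage

def cam_bounding_box(cam, threshold):
    """cam: (H, W) CAM normalized to [0, 1]; threshold: fraction of the
    max (e.g. 0.01, 0.1, ..., 0.4). Returns (x1, y1, x2, y2) or None."""
    mask = cam >= threshold * cam.max()
    labels, n = ndimage.label(mask)                    # connected components
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    ys, xs = np.where(labels == 1 + np.argmax(sizes))  # largest component
    return xs.min(), ys.min(), xs.max(), ys.max()

def expand_query(top_qe_descriptors):
    """Query expansion as on the slide: l2-normalize the sum of the
    top-QE target descriptors to form the new query."""
    q = np.sum(top_qe_descriptors, axis=0)
    return q / np.linalg.norm(q)
```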
  49. 49. 49 Green - Ground Truth Orange → Red - Different Thresholds Region Proposals using CAMs
  50. 50. 50 Green - Ground Truth Orange → Red - Different Thresholds Region Proposals using CAMs
  51. 51. 51 Comparison with State-of-the-Art After Re-Ranking and Query Expansion Nc = 64 , Npca = 1, Nre-ranking = 6
  52. 52. Conclusions 52
  53. 53. ▷ In this work we proposed a technique to build compact image representations focused on their semantic content. ▷ We employed an image encoding pipeline that makes use of a pre-trained CNN and Class Activation Maps to extract discriminative regions from the image and weight its convolutional features accordingly. ▷ Our experiments demonstrated that selecting the relevant content of an image to build the image descriptor is beneficial and increases retrieval performance. The proposed approach establishes a new state-of-the-art compared to methods that build image representations by combining off-the-shelf features over random or fixed grid regions. 53 Conclusions
  54. 54. 5. Conclusions 54
