Content-based Image Retrieval - Eva Mohedano - UPC Barcelona 2018

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.

Published in: Data & Analytics

  1. Content-based Image Retrieval. Day 1 Lecture 5. Eva Mohedano (eva.mohedano@insight-centre.org), Postdoctoral Researcher, Insight Centre for Data Analytics, Dublin City University. #DLUPC http://bit.ly/dlcv2018
  2. Overview ● What is content-based image retrieval? ● The classic SIFT retrieval pipeline ● Using off-the-shelf CNN features for retrieval ● Learning representations for retrieval
  3. Overview ● What is content-based image retrieval? ● The classic SIFT retrieval pipeline ● Using off-the-shelf CNN features for retrieval ● Learning representations for retrieval
  4. The problem: query by example. Given: ● An example query image that illustrates the user's information need ● A very large dataset of images. Task: ● Rank all images in the dataset according to how likely they are to fulfil the user's information need
  5. Demo by Paula Gomez Duran: Dockerized visualization tool
  6. Overview ● What is content-based image retrieval? ● The classic SIFT retrieval pipeline ● Using off-the-shelf CNN features for retrieval ● Learning representations for retrieval
  7. The retrieval pipeline: query image and image dataset → image representations (query v = (v_1, …, v_n); dataset images v_1 = (v_11, …, v_1n), ..., v_k = (v_k1, …, v_kn)) → image matching with a similarity metric (Euclidean distance, cosine similarity) → ranked list of images with similarity scores (e.g. 0.98, 0.97, ..., 0.10, 0.01)
  8. The classic SIFT retrieval pipeline: initial search. Each image yields a variable number of feature vectors, v_1 = (v_11, …, v_1n), ..., v_k = (v_k1, …, v_kn). Bag of Visual Words: the N-dimensional feature space is quantized into M visual words (M clusters). INVERTED FILE (word → image IDs): 1 → 1, 12; 2 → 1, 30, 102; 3 → 10, 12; 4 → 2, 3; 6 → 10; ... Large vocabularies (50k-1M). Very fast! Typically used with SIFT features
  9. The classic SIFT retrieval pipeline: spatial re-ranking. Re-ranking the top-ranked results using spatial constraints with RAndom SAmple Consensus (RANSAC) ● Estimates a homography between the query and a dataset image ● Re-ranks based on the number of inlier local features ● Improves the quality of the initial search ● Expensive to compute. Philbin, J., Chum, O., Isard, M., Sivic, J. and Zisserman, A., Object retrieval with large vocabularies and fast spatial matching. CVPR 2007. Kalantidis, Y., Tolias, G., Spyrou, E., Mylonas, P. and Avrithis, Y., Visual image retrieval and localization. 7th International Workshop on Content-Based Multimedia Indexing, 2009.
  10. Overview ● What is content-based image retrieval? ● The classic SIFT retrieval pipeline ● Using off-the-shelf CNN features for retrieval ● Learning representations for retrieval
  11. Off-the-shelf CNN representations: FC layers as global feature representation. Razavian et al. CNN features off-the-shelf: an astounding baseline for recognition. CVPR 2014. Babenko et al. Neural codes for image retrieval. ECCV 2014
  12. Off-the-shelf CNN representations. Neural codes for retrieval [1] ● FC7 layer (4096D) ● L2 norm + PCA whitening + L2 norm ● Euclidean distance ● Only better than the traditional SIFT approach after fine-tuning on an image dataset from a similar domain. CNN features off-the-shelf: an astounding baseline for recognition [2] ● Extends Babenko's approach with spatial search ● Several features extracted per image (sliding window approach) ● Really good results, but too computationally expensive for practical situations. [1] Babenko et al., Neural codes for image retrieval, ECCV 2014. [2] Razavian et al., CNN features off-the-shelf: an astounding baseline for recognition, CVPR 2014
  13. Off-the-shelf CNN representations: FC layers as global feature representation. Is there any other way to pool the local information from a convolutional layer?
  14. Pooling method #1: Spatial sum/max pooling. Sum/max pool conv features across filters. Babenko and Lempitsky, Aggregating local deep features for image retrieval. ICCV 2015. Tolias et al. Particular object retrieval with integral max-pooling of CNN activations. arXiv 2015. Kalantidis et al. Cross-dimensional Weighting for Aggregated Deep Convolutional Features. ECCV 2016 Workshops.
  15. Pooling method #1: Spatial sum/max pooling
  16. Pooling method #1: Spatial sum/max pooling. Sum/max pooling over specific regions
  17. Pooling method #1: Spatial sum/max pooling. Spatial weighting on convolutional activations: apply spatial weighting on the features before pooling them, using an H x W weight map with entries w_11, w_12, ..., w_1W; w_21, w_22, ..., w_2W; ...; w_H1, w_H2, ..., w_HW. Babenko and Lempitsky, Aggregating local deep features for image retrieval. ICCV 2015. Kalantidis et al. Cross-dimensional Weighting for Aggregated Deep Convolutional Features. ECCV 2016 Workshops. Mohedano et al. Saliency Weighted Convolutional Features for Instance Search, arXiv 2017
  18. Pooling method #2: Generalized-mean pooling. Max-pooling (MAC), average pooling (SPoC) and generalized-mean pooling (GeM) each produce one value f_k per feature map (f_1, f_2, ..., f_K). p_k can be manually set or learnt. Radenović, F., Tolias, G. and Chum, O., Fine-tuning CNN image retrieval with no human annotation. TPAMI 2018
  19. Pooling method #2: Generalized-mean pooling. Radenović, F., Tolias, G. and Chum, O., Fine-tuning CNN image retrieval with no human annotation. TPAMI 2018
  20. Pooling method #3: Region pooling. Regional Maximum Activation of Convolutions (R-MAC). Settings: - Fully convolutional off-the-shelf VGG16 - Pool5 - Spatial max pooling - High-resolution images - Global descriptor based on aggregating region vectors - Sliding window approach. Tolias et al. Particular object retrieval with integral max-pooling of CNN activations. arXiv 2015.
  21. R-MAC: Region1
  22. R-MAC: Region1, Region2
  23. R-MAC: Region1, Region2, Region3
  24. R-MAC: Region1, Region2, Region3, Region4, ..., RegionN
  25. R-MAC: Region1, Region2, Region3, Region4, ..., RegionN → region vectors (512D), each post-processed with l2-PCAw-l2 → global vector (512D). Spatial re-ranking ● Only on the top 100 ranked results, compare the query with dataset regions ● Re-rank images based on the best region score per image
  26. R-MAC
  27. Pooling method #4: Traditional pooling techniques. BoW, VLAD encoding of conv features. Ng et al. Exploiting local features from deep networks for image retrieval. CVPR Workshops 2015. Mohedano et al. Bags of Local Convolutional Features for Scalable Instance Search. ICMR 2016
  28. Pooling method #4: Bags of Local Convolutional Features. (336x256) resolution image → conv5_1 from VGG16 (42x32) → 25K centroids → 25K-D vector. Mohedano et al. Bags of Local Convolutional Features for Scalable Instance Search. ICMR 2016
  29. Pooling method #4: Bags of Local Convolutional Features. (336x256) resolution image → conv5_1 from VGG16 (42x32) → 25K centroids → 25K-D vector
  30. Pooling method #4: Bags of Local Convolutional Features. Datasets: Paris Buildings 6k, Oxford Buildings 5k, TRECVID Instance Search 2013 (subset of 23k frames). [7] Kalantidis et al. Cross-dimensional Weighting for Aggregated Deep Convolutional Features. ECCV 2016 Workshops. Mohedano et al. Bags of Local Convolutional Features for Scalable Instance Search. ICMR 2016
  31. Pooling method #4: Bags of Local Convolutional Features. Mohedano et al. Saliency Weighted Convolutional Features for Instance Search, arXiv 2017
  32. Overview ● What is content-based image retrieval? ● The classic SIFT retrieval pipeline ● Using off-the-shelf CNN features for retrieval ● Learning representations for retrieval
  33. Classification. Query: This chair. Results from dataset classified as “chair”
  34. Retrieval. Query: This chair. Results from dataset ranked by similarity to the query
  35. What loss function should we use?
  36. Learning representations for retrieval. Siamese network: a network that learns a function mapping input patterns into a target space such that the L2 norm in the target space approximates the semantic distance in the input space. Song et al. Deep metric learning via lifted structured feature embedding. CVPR 2016. Chopra et al. Learning a similarity metric discriminatively, with application to face verification. CVPR 2005. Simo-Serra et al. Discriminative Learning of Deep Convolutional Feature Point Descriptors. ICCV 2015
  37. Hinge loss with margin m. Negative pairs: if nearer than the margin, pay a linear penalty. The margin is a hyperparameter
  38. Learning representations for retrieval. Siamese network with triplet loss: the loss function minimizes the distance between query and positive and maximizes the distance between query and negative. Three CNN branches with shared weights (w) map the anchor (a), positive (p) and negative (n) into an L2 embedding space, where the triplet loss is computed. Schroff et al. FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR 2015
  39. How do we get the training data?
  40. Exploring image datasets with GPS coordinates (Google Street View Time Machine). Relja Arandjelović et al., NetVLAD: CNN architecture for weakly supervised place recognition, CVPR 2016
  41. Automatic data cleaning: strong baseline (SIFT) + geometric verification + 3D camera estimation [1]. Further manual inspection. Radenović et al., CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, ECCV 2016. Gordo et al., Deep Image Retrieval: Learning global representations for image search, ECCV 2016. [1] Schönberger et al. From single image query to detailed 3D reconstruction, CVPR 2015
  42. Generating training pairs. Once we have collected a training dataset (with annotations), the way we select the pairs/triplets is crucial. Select the hardest negative: - Select the triplet that generates the larger loss - If a 3D geometric verification model is available, select images based on the number of inliers. Radenović et al., CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, ECCV 2016. Gordo et al., Deep Image Retrieval: Learning global representations for image search, ECCV 2016
  43. End-to-end learning for retrieval #1: NetVLAD. Relja Arandjelović et al., NetVLAD: CNN architecture for weakly supervised place recognition, CVPR 2016
  44. VLAD. Local features: vectors of dimension D. Visual vocabulary: K cluster centers (K vectors with dimension D), learnt from data with K-means. For each cluster (e.g. cluster 1 with assigned descriptors d1, d2, d3, d4), a D-dimensional VLAD vector is the sum of all residuals of the assignments to the cluster. The full VLAD descriptor has dimension K x D
  45. VLAD. N local features {X_i} of dimension D; K cluster features {C_k} of dimension D; V: VLAD matrix of residuals, indexed by feature dimension j and cluster k. Hard assignment (e.g. 1, 1, 0, 0, 0 for cluster k=1) is not differentiable! Soft assignment (e.g. 0.99, 0.98, 0.05, 0.01, 0.00 for cluster k=1) uses a softmax function :) Relja Arandjelović et al., NetVLAD: CNN architecture for weakly supervised place recognition, CVPR 2016
  46. NetVLAD: training based on GPS-annotated data. Consider the positive example closest to the query. Consider all negative candidates with distance below a certain margin. Relja Arandjelović et al., NetVLAD: CNN architecture for weakly supervised place recognition, CVPR 2016
  47. End-to-end learning for retrieval #2: Fine-tuned R-MAC. Gordo et al., Deep Image Retrieval: Learning global representations for image search. ECCV 2016
  48. R-MAC: Region1, Region2, Region3, Region4, ..., RegionN → region vectors (512D), each post-processed with l2-PCAw-l2 → global vector (512D)
  49. Learning representations for retrieval. Gordo et al., Deep Image Retrieval: Learning global representations for image search. ECCV 2016
  50. Learning representations for retrieval. Comparison between R-MAC from an off-the-shelf network and R-MAC retrained for retrieval. Gordo et al., Deep Image Retrieval: Learning global representations for image search. ECCV 2016
  51. Extra: Ranking refinement
  52. Ranking refinement: Query Expansion. Query image → new query: average the descriptors of the top N retrieved images ● Runs at query time ● Boosts performance if the initial ranking is good ○ Traditionally applied on spatially verified images (RANSAC) ● Only considers the top nearest neighbours (NN) of the image
  53. Ranking refinement: Diffusion. Diffusion mechanism to propagate similarities through a pairwise affinity matrix. [Figure: example graph with 5 nodes and its affinity matrix.] On random walks and diffusion on graphs: Leonid E. Zhukov, Structural Analysis and Visualization of Networks
  54. Ranking refinement: Diffusion. Diffusion mechanism to propagate similarities through a pairwise affinity matrix ● Most retrieval approaches use pairwise measures (e.g. Euclidean distance between the query and each dataset image) to generate the final ranking ● This has the limitation of ignoring the structure of the underlying data manifold
  55. Ranking refinement: Diffusion. Zhou et al. Ranking on data manifolds. In Advances in Neural Information Processing Systems (pp. 169-176), 2004
  56. Iscen, A. et al. Efficient diffusion on region manifolds: Recovering small objects with compact CNN representations. CVPR 2017
  57. Overview ● What is content-based image retrieval? ● The classic SIFT retrieval pipeline ● Using off-the-shelf CNN features for retrieval ● Learning representations for retrieval
  58. Questions?
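
Sketch for slide 7 (the retrieval pipeline): a minimal illustration of ranking a dataset by cosine similarity to a query descriptor. The image IDs and the three-dimensional toy descriptors are hypothetical; real descriptors would come from SIFT or a CNN as discussed in the lecture.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_dataset(query, dataset):
    """Return (image_id, score) pairs sorted from most to least similar."""
    scores = [(image_id, cosine_similarity(query, vec))
              for image_id, vec in dataset.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy descriptors, one per dataset image.
dataset = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}
ranking = rank_dataset([1.0, 0.0, 0.0], dataset)
```

If descriptors are L2-normalised beforehand, ranking by cosine similarity and ranking by Euclidean distance give the same order.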
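
Sketch for slide 8 (inverted file): a simplified illustration of why the initial bag-of-visual-words search is fast. Only images that share at least one visual word with the query are touched. The function names and the plain vote counting are assumptions made for illustration; practical systems typically weight words (for example with tf-idf) instead of counting raw votes.

```python
from collections import Counter, defaultdict

def build_inverted_file(image_words):
    """Map each visual word ID to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def initial_search(index, query_words):
    """Count, per image, how many distinct query words it shares."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return votes

# Hypothetical toy data: image ID -> visual words found in that image.
image_words = {1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 6]}
index = build_inverted_file(image_words)
votes = initial_search(index, [1, 3])
best_image, best_votes = votes.most_common(1)[0]
```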
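
Sketch for slide 18 (generalized-mean pooling): the slide text does not include the formula, so this assumes the usual GeM definition, f_k = ((1/|X_k|) * sum over x in X_k of x^p_k)^(1/p_k), applied to the non-negative activations X_k of one feature map. With p = 1 it reduces to average pooling (SPoC), and as p grows it approaches max pooling (MAC). The function name and toy activations are hypothetical.

```python
def gem_pool(activations, p):
    """Generalized mean of the non-negative activations of one feature map."""
    n = len(activations)
    return (sum(x ** p for x in activations) / n) ** (1.0 / p)

# Hypothetical activations of a single channel (e.g. after ReLU).
activations = [0.0, 1.0, 2.0, 5.0]

average_like = gem_pool(activations, 1)   # equals the plain mean
intermediate = gem_pool(activations, 3)   # between mean and max
max_like = gem_pool(activations, 50)      # close to the maximum
```

In the learnt setting the exponent would be a trainable parameter; here it is simply passed in by hand.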
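
Sketch for slides 37 and 38 (hinge and triplet losses): a minimal version of both losses on plain Python lists. The negative-pair term follows the slide (a linear penalty when the pair is nearer than the margin). The positive-pair term (the distance itself) and the use of squared distances in the triplet loss are assumptions; the slides' text does not spell them out.

```python
import math

def euclidean(u, v):
    """Euclidean (L2) distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pair_hinge_loss(u, v, is_positive, margin):
    """Siamese pair loss: pull positives together, push negatives past the margin."""
    d = euclidean(u, v)
    if is_positive:
        return d                      # assumed positive term
    return max(0.0, margin - d)       # linear penalty inside the margin

def triplet_loss(anchor, positive, negative, margin):
    """Zero once the negative is farther than the positive by at least the margin."""
    d_pos = euclidean(anchor, positive) ** 2
    d_neg = euclidean(anchor, negative) ** 2
    return max(0.0, d_pos - d_neg + margin)

close_negative = pair_hinge_loss([0.0, 0.0], [0.5, 0.0], False, margin=1.0)
far_negative = pair_hinge_loss([0.0, 0.0], [2.0, 0.0], False, margin=1.0)
hard_triplet = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], margin=0.5)
easy_triplet = triplet_loss([0.0, 0.0], [1.0, 0.0], [3.0, 0.0], margin=0.5)
```

Easy triplets contribute zero loss, which is why the selection of hard negatives matters during training.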
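
Sketch for slides 44 and 45 (VLAD with hard versus soft assignment): a small illustration of accumulating residuals per cluster, with a softmax over negative squared distances as the soft assignment. The scaling parameter `alpha` and the function names are assumptions; in this sketch a large `alpha` behaves like hard assignment and `alpha = 0` spreads each feature evenly over all clusters. In NetVLAD the assignment parameters are learnt, which this sketch does not attempt.

```python
import math

def sq_dist(x, c):
    """Squared Euclidean distance between a feature and a cluster center."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def soft_assign(x, centers, alpha):
    """Softmax over -alpha * squared distance: differentiable cluster weights."""
    logits = [-alpha * sq_dist(x, c) for c in centers]
    peak = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def vlad(features, centers, alpha):
    """K x D matrix: weighted sum of residuals (x - c_k) for every cluster k."""
    dim = len(centers[0])
    V = [[0.0] * dim for _ in centers]
    for x in features:
        weights = soft_assign(x, centers, alpha)
        for k, (w, c) in enumerate(zip(weights, centers)):
            for d in range(dim):
                V[k][d] += w * (x[d] - c[d])
    return V

# Hypothetical toy data: two 2-D cluster centers and two local features.
centers = [[0.0, 0.0], [10.0, 0.0]]
features = [[1.0, 0.0], [9.0, 1.0]]
V = vlad(features, centers, alpha=10.0)
```

Flattening `V` row by row gives the K x D dimensional descriptor mentioned on slide 44.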
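
Sketch for slide 52 (query expansion): a minimal version of averaging the descriptors of the top N results to form a new query. Including the original query in the average and L2-normalising the result are assumptions commonly made in practice; the slide only states that the top N descriptors are averaged.

```python
import math

def average_query_expansion(query, ranked_descriptors, top_n):
    """Average the query with its top_n retrieved descriptors, then L2-normalise."""
    pool = [query] + ranked_descriptors[:top_n]
    dim = len(query)
    mean = [sum(vec[d] for vec in pool) / len(pool) for d in range(dim)]
    norm = math.sqrt(sum(m * m for m in mean))
    return [m / norm for m in mean]

# Hypothetical descriptors, already sorted by similarity to the query.
ranked = [[0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
new_query = average_query_expansion([1.0, 0.0], ranked, top_n=1)
```

The expanded query is then used to search the dataset again, so a poor initial ranking would pull the new query in the wrong direction.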
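
Sketch for slides 53 to 55 (diffusion): a small illustration of propagating a query's score through an affinity matrix, assuming the iterative update from Zhou et al., f ← α S f + (1 − α) y, with S = D^(-1/2) W D^(-1/2). The four-node chain graph, α = 0.5 and the iteration count are hypothetical choices for the example.

```python
import math

def normalized_affinity(W):
    """Symmetric normalisation S = D^(-1/2) W D^(-1/2), with D the degree matrix."""
    degrees = [sum(row) for row in W]
    n = len(W)
    return [[W[i][j] / math.sqrt(degrees[i] * degrees[j]) if W[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def diffuse(W, query_index, alpha=0.5, iterations=50):
    """Iterate f <- alpha * S f + (1 - alpha) * y, starting from the query indicator y."""
    S = normalized_affinity(W)
    n = len(W)
    y = [1.0 if i == query_index else 0.0 for i in range(n)]
    f = list(y)
    for _ in range(iterations):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f

# Hypothetical chain graph 0 - 1 - 2 - 3; node 0 plays the role of the query.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = diffuse(W, query_index=0)
```

Node 3 has no direct edge to the query, yet it still receives a positive score through its neighbours, which is the manifold structure that purely pairwise ranking ignores.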