https://imatge-upc.github.io/salbow/
This work explores attention models to weight the contribution of local convolutional representations for the instance search task. We present a retrieval framework based on bags of local convolutional features (BLCF) that benefits from saliency weighting to build an efficient image representation. The use of human visual attention models (saliency) allows significant improvements in retrieval performance without the need to conduct region analysis or spatial verification, and without requiring any feature fine tuning. We investigate the impact of different saliency models, finding that higher performance on saliency benchmarks does not necessarily equate to improved performance when used in instance search tasks. The proposed approach outperforms the state-of-the-art on the challenging INSTRE benchmark by a large margin, and provides similar performance on the Oxford and Paris benchmarks compared to more complex methods that use off-the-shelf representations.
6. The Classic Retrieval Pipeline
v_1 = (v_{11}, …, v_{1n}), …, v_k = (v_{k1}, …, v_{kn})
Variable number of feature vectors per image
Bag of Visual Words: N-dimensional feature space, M visual words (M clusters)
INVERTED FILE
word    Image ID
1       1, 12,
2       1, 30, 102
3       10, 12
4       2, 3
6       10
...
Large vocabularies (50k-1M)
Very fast!
Typically used with SIFT features
Initial Search
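The inverted file lookup above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function names and toy data are hypothetical, and images are scored by the number of shared visual words, whereas real systems usually apply tf-idf weighting.

```python
from collections import Counter, defaultdict

def build_inverted_file(image_words):
    """Map each visual word ID to the sorted list of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}

def initial_search(index, query_words):
    """Rank images by the number of distinct visual words shared with the query."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    # Highest vote count first; ties broken by image ID for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy data (hypothetical): image ID -> visual words found in that image.
image_words = {1: [1, 2], 10: [3, 6], 12: [1, 3], 30: [2]}
index = build_inverted_file(image_words)
ranking = initial_search(index, query_words=[1, 3])
```

Only the posting lists of the query's words are visited, which is why the initial search stays fast even with very large vocabularies.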
7. The Classic Retrieval Pipeline
Re-ranking the top-ranked results using spatial constraints
RAndom SAmple Consensus (RANSAC)
● Estimates a homography between the query and a dataset image
● Re-ranks based on the number of inlier local features
● Improves the quality of the initial search
Philbin, James, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. "Object retrieval with large vocabularies and fast
spatial matching." In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pp. 1-8. IEEE, 2007.
Expensive to compute
Spatial re-ranking
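A rough sketch of RANSAC-based re-ranking, assuming putative point matches between the query and each candidate image are already available. It uses an unnormalized direct linear transform in NumPy rather than the fast spatial matching of the cited paper; all names, the tolerance, and the iteration count are illustrative assumptions.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: homography (up to scale) from 4 or more point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 3)  # right singular vector of the smallest singular value

def project(H, pts):
    """Apply homography H to an (N, 2) array of points."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    with np.errstate(divide="ignore", invalid="ignore"):
        return homog[:, :2] / homog[:, 2:3]

def ransac_inliers(src, dst, iters=200, tol=3.0, seed=0):
    """Best inlier count over random 4-point homography hypotheses."""
    rng = np.random.default_rng(seed)
    best = 0
    for _ in range(iters):
        sample = rng.choice(len(src), size=4, replace=False)
        H = fit_homography(src[sample], dst[sample])
        err = np.linalg.norm(project(H, src) - dst, axis=1)
        best = max(best, int(np.sum(err < tol)))
    return best

def rerank(candidates):
    """candidates: list of (image_id, src_pts, dst_pts); most inliers first."""
    scored = [(image_id, ransac_inliers(src, dst)) for image_id, src, dst in candidates]
    return sorted(scored, key=lambda item: -item[1])

# Synthetic matches (hypothetical): 20 consistent with one homography, 10 outliers.
rng = np.random.default_rng(1)
src = rng.uniform(0, 100, size=(30, 2))
H_true = np.array([[1.0, 0.1, 5.0], [-0.1, 1.0, 3.0], [0.0, 0.0, 1.0]])
dst_good = project(H_true, src)
dst_good[20:] = rng.uniform(0, 100, size=(10, 2))
dst_bad = rng.uniform(0, 100, size=(30, 2))  # matches to an unrelated image
ranking = rerank([("bad", src, dst_bad), ("good", src, dst_good)])
```

Every candidate needs hundreds of model fits, which illustrates why this step is expensive and is applied only to the top-ranked results.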
9. Deep Learning Approaches in CBMI
Zheng, Liang, Yi Yang, and Qi Tian. "SIFT meets CNN: A decade survey of instance retrieval." TPAMI 2018.
10. Features from pre-trained CNN networks
- Providing more importance to the center region (Content-independent)
Diagram: convolutional features are Gaussian-weighted, then sum-pooled.
Babenko, Artem, and Victor Lempitsky. "Aggregating local deep features for image retrieval." CVPR 2015.
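A minimal sketch of content-independent centre weighting followed by sum pooling, assuming a (C, H, W) feature map. The sigma setting is an assumption chosen for illustration, and the PCA whitening step of the cited method is omitted.

```python
import numpy as np

def gaussian_center_prior(height, width, sigma_frac=0.33):
    """Return an (H, W) map that peaks at the centre and decays towards the borders.

    sigma_frac is a hypothetical parameter: sigma as a fraction of each dimension.
    """
    ys = np.arange(height) - (height - 1) / 2.0
    xs = np.arange(width) - (width - 1) / 2.0
    sigma_y, sigma_x = sigma_frac * height, sigma_frac * width
    return np.exp(-(ys[:, None] ** 2) / (2 * sigma_y ** 2)
                  - (xs[None, :] ** 2) / (2 * sigma_x ** 2))

def weighted_sum_pool(features, weights):
    """features: (C, H, W), weights: (H, W) -> L2-normalized (C,) descriptor."""
    pooled = (features * weights[None]).sum(axis=(1, 2))
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

# Random stand-in for a 512-channel, 16x21 convolutional feature map.
rng = np.random.default_rng(0)
features = rng.random((512, 16, 21))
prior = gaussian_center_prior(16, 21)
descriptor = weighted_sum_pool(features, prior)
```

The weights depend only on position, so the same map is reused for every image.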
11. Features from pre-trained CNN networks
- Providing more importance to the most active regions in a convolution layer
(Content-dependent)
Diagram: convolutional features are weighted by the sum across conv channels, then sum-pooled.
Kalantidis, Yannis, Clayton Mellina, and Simon Osindero. "Cross-dimensional weighting for aggregated deep convolutional features." ECCV 2016.
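A simplified sketch of content-dependent spatial weighting, assuming non-negative (post-ReLU) activations in a (C, H, W) array. The L2 normalization and square root used here are assumptions for illustration, and the channel weighting of the cited method is left out.

```python
import numpy as np

def channel_sum_weights(features):
    """features: (C, H, W), non-negative -> (H, W) spatial weights.

    Locations with larger total activation across channels get larger weights.
    """
    s = features.sum(axis=0)
    norm = np.linalg.norm(s)
    return np.sqrt(s / norm) if norm > 0 else np.zeros_like(s)

# Toy feature map (hypothetical): one strongly active and one weakly active location.
features = np.zeros((4, 3, 5))
features[:, 1, 2] = 2.0
features[:, 0, 0] = 0.5
weights = channel_sum_weights(features)
descriptor = (features * weights[None]).sum(axis=(1, 2))  # weighted sum pooling
```

Unlike a fixed centre prior, this weight map is recomputed from each image's own activations.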
12. Features from pre-trained CNN networks
- Region Maximum Activation of Convolution (R-MAC)
Diagram: Region 1, Region 2, …, Region N are each max-pooled, then normalized.
Tolias, Giorgos, Ronan Sicre, and Hervé Jégou. "Particular object retrieval with integral max-pooling of CNN activations." ICLR 2016.
13. Features from pre-trained CNN networks
- Region Maximum Activation of Convolution (R-MAC) (Content-independent)
R-MAC spatial weight: fixed set of locations and window scales
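A rough sketch of regional max pooling over a fixed multi-scale grid, assuming a (C, H, W) feature map. The region layout below is a simplified stand-in for the sampling scheme of the cited paper, and PCA whitening is omitted; function names are hypothetical.

```python
import numpy as np

def grid_regions(height, width, levels=3):
    """Square windows (y, x, size) at several scales, evenly spaced over the map."""
    regions = []
    short = min(height, width)
    for level in range(1, levels + 1):
        size = max(1, int(2 * short / (level + 1)))
        n_y = int(np.ceil(height / size)) + level - 1
        n_x = int(np.ceil(width / size)) + level - 1
        for y in np.linspace(0, height - size, num=n_y).astype(int):
            for x in np.linspace(0, width - size, num=n_x).astype(int):
                regions.append((int(y), int(x), size))
    return regions

def l2_normalize(v):
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def regional_max_descriptor(features, regions):
    """Max-pool each region, L2-normalize, sum over regions, L2-normalize again."""
    total = np.zeros(features.shape[0])
    for y, x, size in regions:
        region_vec = features[:, y:y + size, x:x + size].max(axis=(1, 2))
        total += l2_normalize(region_vec)
    return l2_normalize(total)

# Random stand-in for a 512-channel, 16x21 convolutional feature map.
rng = np.random.default_rng(0)
features = rng.random((512, 16, 21))
regions = grid_regions(16, 21)
descriptor = regional_max_descriptor(features, regions)
```

Because the windows are fixed in advance, locations covered by more windows implicitly count more, regardless of image content.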
15. Saliency weighting for retrieval
[1] Awad, Dounia, Vincent Courboulay, and Arnaud Revel. "Saliency filtering of sift detectors: Application to cbir." ACIVS, 2012
[2] de Carvalho Soares, Robson, Ilmerio Reis da Silva, and Denise Guliato. "Spatial locality weighting of features using saliency map
with a bag-of-visual-words approach." ICTAI, 2012
- Traditionally explored with SIFT-based BoW approaches to:
- Prune the number of local descriptors [1]
- Weight the contribution of the background [2]
We investigate traditional and data-driven saliency models to weight the
contribution of visual words assigned to local convolutional features for
the Visual Instance Search task.
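The idea can be sketched as a saliency-weighted histogram, assuming each local feature has already been assigned a visual word ID (a 2-D assignment map) and that a saliency map at image resolution is given. Average pooling is used here as one possible way to bring the saliency map down to the assignment-map size; names and toy data are hypothetical.

```python
import numpy as np

def downsample_saliency(saliency, out_h, out_w):
    """Average-pool an (H, W) saliency map to (out_h, out_w); sizes must divide evenly."""
    h, w = saliency.shape
    return saliency.reshape(out_h, h // out_h, out_w, w // out_w).mean(axis=(1, 3))

def saliency_weighted_bow(assignments, saliency, vocab_size):
    """Each visual word contributes the saliency at its location instead of a count of 1."""
    weights = downsample_saliency(saliency, *assignments.shape)
    hist = np.bincount(assignments.ravel(), weights=weights.ravel(), minlength=vocab_size)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# Toy example (hypothetical): word 0 on the left, word 1 on the right,
# and saliency concentrated entirely on the right part of the image.
assignments = np.zeros((16, 21), dtype=int)
assignments[:, 10:] = 1
saliency = np.zeros((256, 336))
saliency[:, 160:] = 1.0
bow = saliency_weighted_bow(assignments, saliency, vocab_size=2)
```

Words falling on non-salient background end up with little or no weight in the final representation.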
19. Bag of Local Convolutional Features
Diagram: image at 336x256 resolution; conv5_1 from VGG16 (21x16); 25K centroids (Visual Vocabulary); Bag of Words (25K-D vector).
Sparse feature representation
Mohedano, Eva, Kevin McGuinness, Noel E. O'Connor, Amaia Salvador, Ferran Marqués, and Xavier Giro-i-Nieto. "Bags of local convolutional
features for scalable instance search." ICMR 2016.
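A minimal sketch of the encoding step, assuming a (C, H, W) feature map and a vocabulary of K centroids: each local feature is assigned to its nearest centroid and the assignments are histogrammed. A toy vocabulary of 8 words stands in for the 25K centroids, and any feature normalization or dimensionality reduction is omitted.

```python
import numpy as np

def assign_visual_words(features, centroids):
    """features: (C, H, W), centroids: (K, C) -> (H, W) map of nearest-centroid IDs."""
    c, h, w = features.shape
    local = features.reshape(c, h * w).T  # one C-dimensional feature per location
    # Squared Euclidean distances via ||a||^2 - 2 a.b + ||b||^2.
    d2 = ((local ** 2).sum(axis=1)[:, None]
          - 2 * local @ centroids.T
          + (centroids ** 2).sum(axis=1)[None, :])
    return d2.argmin(axis=1).reshape(h, w)

def bag_of_words(assignments, vocab_size):
    """Histogram of visual word occurrences (mostly zeros for large vocabularies)."""
    return np.bincount(assignments.ravel(), minlength=vocab_size)

# Toy data (hypothetical): every location holds an exact copy of some centroid.
rng = np.random.default_rng(0)
centroids = rng.normal(size=(8, 4))
word_map = rng.integers(0, 8, size=(16, 21))
features = centroids[word_map].transpose(2, 0, 1)  # (C, H, W)
assignments = assign_visual_words(features, centroids)
bow = bag_of_words(assignments, vocab_size=8)
```

A 21x16 map yields at most 336 non-zero entries, so with a 25K vocabulary the vector is necessarily sparse.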
20. Masking the relevant region
(Encoding the query)
Diagram: image at 336x256 resolution; conv5_1 from VGG16 (21x16); 25K centroids (Visual Vocabulary); Assignment Maps; Bag of Words (25K-D vector).
Mohedano, Eva, Kevin McGuinness, Noel E. O'Connor, Amaia Salvador, Ferran Marqués, and Xavier Giro-i-Nieto. "Bags of local convolutional
features for scalable instance search." ICMR 2016.
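Query masking can be sketched as follows, assuming the query comes with a pixel bounding box and an assignment map is already computed. Only the visual words whose cell centre falls inside the box are counted; the cell-centre rule and function names are illustrative assumptions.

```python
import numpy as np

def box_mask(box, image_size, map_size):
    """box = (x0, y0, x1, y1) in pixels -> boolean (map_h, map_w) mask.

    A cell is kept when its centre, projected to pixel coordinates, lies in the box.
    """
    x0, y0, x1, y1 = box
    img_h, img_w = image_size
    map_h, map_w = map_size
    ys = (np.arange(map_h) + 0.5) * img_h / map_h
    xs = (np.arange(map_w) + 0.5) * img_w / map_w
    rows = (ys >= y0) & (ys < y1)
    cols = (xs >= x0) & (xs < x1)
    return rows[:, None] & cols[None, :]

def masked_bow(assignments, mask, vocab_size):
    """Histogram of the visual words that fall inside the mask."""
    return np.bincount(assignments[mask], minlength=vocab_size)

# Toy example (hypothetical): word 0 on the left, word 1 on the right,
# with the query box covering the left 160 pixels of a 336x256 image.
assignments = np.zeros((16, 21), dtype=int)
assignments[:, 10:] = 1
mask = box_mask((0, 0, 160, 256), image_size=(256, 336), map_size=(16, 21))
query_bow = masked_bow(assignments, mask, vocab_size=2)
```

Since the mask is applied to the assignment map, the query region can change without recomputing convolutional features.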
21. General Framework
Pan, Junting, Cristian Canton Ferrer, Kevin McGuinness, Noel E. O'Connor, Jordi Torres, Elisa Sayrol, and Xavier Giro-i-Nieto. "Salgan: Visual
saliency prediction with generative adversarial networks." arXiv preprint arXiv:1701.01081 (2017).
28. Comparison Sum-pooling vs BLCF
● BLCF is a better baseline (vocabulary learning can be seen as unsupervised domain adaptation)
● Saliency is effective in both the sum-pooling and BLCF approaches on the instance search dataset INSTRE
29. Comparison with the State-of-the-art
High-dimensional 25,000-D representations with an average of ~200 non-zero entries
31. Dockerized visualization tool
Gomez P, Mohedano E, McGuinness K, Giró-i-Nieto X, O'Connor N, “Demonstration of an Open Source Framework for Qualitative Evaluation of CBIR Systems”, ACM Multimedia 2018
32. Conclusions
● Demonstrated the application of modern saliency models to the instance search task
● Achieved state-of-the-art performance on the instance search benchmark (INSTRE) with an off-the-shelf CNN model
Future Work
● Investigate better post-processing for ranking refinement
● Scale the method to large-scale datasets
33. Thanks for your attention!
Questions?
Software available @ https://github.com/imatge-upc/salbow