Content-based image retrieval (CBIR) applies computer vision techniques to search for and retrieve images from large databases based on visual similarity. CBIR systems typically extract features from both the query image and the database images and measure the similarity between them to return matching results. Popular applications include Google Images, eBay, and Pinterest visual search. Evaluating CBIR systems relies on precision and recall together, since either metric alone is insufficient. Training Siamese networks for CBIR requires loss functions, such as the triplet and contrastive losses, that pull similar images closer together and push dissimilar images farther apart.
2. What is CBIR?
Content-based image retrieval, also known as query by image content (QBIC) and content-based
visual information retrieval (CBVIR), is the application of computer vision techniques to the image
retrieval problem, that is, the problem of searching for digital images in large databases.
https://en.wikipedia.org/wiki/Content-based_image_retrieval
[Figure: CBIR pipeline. Features are extracted from the query image and from each database image; a similarity measurement between the two feature sets determines the retrieved images.]
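To make the pipeline concrete, here is a minimal sketch in Python; the histogram feature extractor and cosine similarity are illustrative assumptions (real systems use CNN activations, as the later slides describe).

```python
import numpy as np

def extract_features(image):
    """Toy feature extractor: a normalized 64-bin intensity histogram.
    (Assumption for illustration; production systems use CNN activations.)"""
    hist, _ = np.histogram(image, bins=64, range=(0, 256))
    return hist / (hist.sum() + 1e-9)

def retrieve(query_image, database_images, k=5):
    """Rank database images by cosine similarity to the query features."""
    q = extract_features(query_image)
    feats = np.stack([extract_features(img) for img in database_images])
    sims = feats @ q / (np.linalg.norm(feats, axis=1) * np.linalg.norm(q) + 1e-9)
    order = np.argsort(-sims)          # best match first
    return order[:k], sims[order[:k]]  # indices and scores of retrieved images
```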
3. Technologies
● Query by example (QBE)
● Semantic retrieval
● Relevance feedback (human interaction)
● Iterative/machine learning
● Other query methods
https://en.wikipedia.org/wiki/Content-based_image_retrieval
5. Application in popular search systems
● Google Images
○ Constructing a mathematical model
○ Metadata
● eBay
○ ResNet-50 for category recognition
● SK Planet
○ Inception-v3 as vision encoder
○ RNN multi-class classification
● Alibaba
○ GoogLeNet V1 for category prediction and feature learning
● Pinterest
○ Two-step object detection
https://en.wikipedia.org/wiki/Reverse_image_search
7. Image Representation and Features
● Extract local and deep features
● Studied AlexNet and VGG
○ Extract feature representations from fc6 and fc8 layers
○ Binarized
○ Hamming distance
● Extract salient color signatures
○ Detect salient regions
○ K-means clustering
○ Store cluster centroids and weights as image signature
[Jing, Yushi, et al. "Visual search at Pinterest." Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2015]
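A minimal sketch of the binarization and Hamming-distance comparison described above; thresholding at zero is an assumption, since the exact binarization scheme is not spelled out on this slide.

```python
import numpy as np

def binarize(activations):
    """Turn real-valued fc6/fc8 activations into a binary code
    (thresholding at zero is an assumed, common choice)."""
    return (activations > 0).astype(np.uint8)

def hamming_distance(code_a, code_b):
    """Number of bit positions where the two binary codes differ;
    smaller distance means more similar images."""
    return int(np.count_nonzero(code_a != code_b))

# Hypothetical 4096-d fc6 activations for two images
x1, x2 = np.random.randn(4096), np.random.randn(4096)
print(hamming_distance(binarize(x1), binarize(x2)))
```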
8. Two-step Object Detection and Localization
1. Category classification
2. Object detection
[Jing, Yushi, et al. "Visual search at Pinterest." Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2015]
9. [Figure: two-step pipeline. The input image first passes through category classification, yielding per-category confidence scores (c1 ... cn for Car, f1 ... fm for Flower, p1 ... pk for Person); object detection then runs only for the likely categories, which reduces computational cost.]
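A sketch of how this gating might look in code; the classifier/detector interfaces and the 0.5 threshold are hypothetical, chosen only to show why the first step reduces computational cost.

```python
def two_step_detect(image, classify, detectors, threshold=0.5):
    """classify(image) -> {category: confidence};
    detectors[category](image) -> list of bounding boxes.
    Expensive per-category detectors run only for likely categories."""
    boxes = {}
    for category, confidence in classify(image).items():
        if confidence >= threshold:  # gate: skip unlikely categories
            boxes[category] = detectors[category](image)
    return boxes
```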
12. Static Evaluation of Search Relevance
● The evaluation dataset contains 1.6M unique images
○ Two images are assumed to be relevant to each other if they share a label
● Computed precision@k for several feature types:
○ The fc6 layer activations of a generic AlexNet (pre-trained for ILSVRC)
○ The fc6 activations of an AlexNet model fine-tuned to recognize over 3,000 Pinterest product categories
○ The loss3/classifier activations of a generic GoogLeNet
○ The fc6 activations of a generic VGG 16-layer model
[Jing, Yushi, et al. "Visual search at Pinterest." Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2015]
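A small sketch of precision@k under the slide's relevance criterion (two images are relevant to each other if they share a label); the labels below are made up.

```python
def precision_at_k(retrieved_labels, query_label, k):
    """Fraction of the top-k retrieved images sharing the query's label."""
    return sum(label == query_label for label in retrieved_labels[:k]) / k

# Hypothetical ranked results for a query labeled "shoe"
ranked = ["shoe", "bag", "shoe", "shoe", "hat"]
print(precision_at_k(ranked, "shoe", 5))  # 0.6
```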
15. Precision vs. Recall
[Müller, Henning, et al. "Performance evaluation in content-based image retrieval: overview and proposals." Pattern Recognition Letters 22.5 (2001): 593-601]
Either value alone carries insufficient information:
● We can always make recall 1, simply by retrieving all images
● Similarly, precision can be kept high by retrieving only a few images
Precision and recall should therefore be used together; commonly reported measures include:
● P(10), P(30), P(NR): precision after the first 10, 30, or NR documents are retrieved (NR = number of relevant documents)
● Mean Average Precision (MAP): the mean of the non-interpolated average precision over queries
● Recall at 0.5 precision: recall at the rank where precision drops below 0.5
● R(1000): recall after 1000 documents are retrieved
● Rank of first relevant: the rank of the highest-ranked relevant document
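For concreteness, here is how the listed measures can be computed from a binary relevance vector in rank order (a sketch; `rel` and NR below are toy values).

```python
def precision_at(rel, k):
    """P(k): fraction of the first k retrieved documents that are relevant."""
    return sum(rel[:k]) / k

def recall_at(rel, k, num_relevant):
    """R(k): fraction of all relevant documents found in the first k."""
    return sum(rel[:k]) / num_relevant

def average_precision(rel, num_relevant):
    """Non-interpolated AP: average of P(rank) over ranks of relevant hits."""
    hits, ap = 0, 0.0
    for rank, is_rel in enumerate(rel, start=1):
        if is_rel:
            hits += 1
            ap += hits / rank
    return ap / num_relevant

def rank_first_relevant(rel):
    """1-based rank of the highest-ranked relevant document."""
    return next(rank for rank, is_rel in enumerate(rel, start=1) if is_rel)

rel = [1, 0, 1, 0, 0, 1]          # toy ranked relevance judgments, NR = 3
print(precision_at(rel, 3))       # P(3) = 2/3
print(average_precision(rel, 3))  # (1/1 + 2/3 + 3/6) / 3 ≈ 0.72
```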
16. Relevance of visual search
Table 1 shows p@5 and p@10 performance of these models, along with the average CPU-based
latency of our visual search service, which includes feature extraction for the query image as well as
retrieval.
[Jing, Yushi, et al. "Visual search at Pinterest." Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2015]
17. Siamese networks
[Das, Arpita, et al. "Together we stand: Siamese networks for similar question retrieval." Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers). 2016]
18. Siamese networks
● Let F_W(X) be a family of functions with parameter set W, assumed to be differentiable with respect to W. The Siamese network seeks a value of the parameters W such that the symmetric similarity metric is small if X1 and X2 belong to the same category, and large if they belong to different categories.
[Das, Arpita, et al. "Together we stand: Siamese networks for similar question retrieval." Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers). 2016]
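The slide leaves the form of the metric open; a common instantiation (an assumption here, not fixed by the slide) is the Euclidean distance between the twin embeddings:

```latex
% Symmetric similarity metric between the twin embeddings
% (Euclidean form assumed; the slide leaves the metric unspecified).
\[
  E_W(X_1, X_2) = \lVert F_W(X_1) - F_W(X_2) \rVert_2
\]
% Training seeks W so that E_W is small for same-category pairs
% and large for different-category pairs.
```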
19. Different loss functions for training a Siamese network
Two commonly used ones are
● Triplet loss
● Contrastive loss
The main idea of these loss functions is to
pull the samples of every class toward one
another and push the samples of different
classes away from each other
[Ghojogh, Benyamin, et al. "Fisher discriminant triplet and contrastive losses for training siamese networks." 2020 International Joint Conference on Neural
Networks (IJCNN). IEEE, 2020]
20. Different loss functions - Triplet loss
The triplet loss uses an anchor, a neighbor, and a distant sample. Let f(x) be the output (i.e., embedding) of the network for the input x. The triplet loss reduces the distance between anchor and neighbor embeddings and increases the distance between anchor and distant embeddings. The desired embedding is obtained once the anchor-distant distances exceed the anchor-neighbor distances by a margin α ≥ 0.
[Ghojogh, Benyamin, et al. "Fisher discriminant triplet and contrastive losses for training siamese networks." 2020 International Joint Conference on Neural
Networks (IJCNN). IEEE, 2020]
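A minimal sketch of the triplet loss as described above, written for a single triplet with NumPy; squared Euclidean distances are assumed.

```python
import numpy as np

def triplet_loss(f_anchor, f_neighbor, f_distant, alpha=0.2):
    """Zero once the anchor-distant distance exceeds the anchor-neighbor
    distance by at least the margin alpha, as the slide describes."""
    d_neighbor = np.sum((f_anchor - f_neighbor) ** 2)  # pull this down
    d_distant = np.sum((f_anchor - f_distant) ** 2)    # push this up
    return max(d_neighbor - d_distant + alpha, 0.0)
```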
21. Different loss functions - Contrastive loss
The contrastive loss uses pairs of samples, which can be anchor and neighbor or anchor and distant. If the samples are anchor and neighbor, they are pulled towards each other; otherwise, their distance is increased. In other words, the contrastive loss behaves like the triplet loss, but handles one pair at a time rather than both simultaneously. The desired embedding is obtained when the anchor-distant distances exceed the anchor-neighbor distances by a margin of α.
[Ghojogh, Benyamin, et al. "Fisher discriminant triplet and contrastive losses for training siamese networks." 2020 International Joint Conference on Neural
Networks (IJCNN). IEEE, 2020]
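And a matching sketch of the contrastive loss for one pair; the margin form for the anchor-distant case is the standard one, consistent with the description above.

```python
import numpy as np

def contrastive_loss(f1, f2, is_neighbor, alpha=1.0):
    """Pull anchor-neighbor pairs together; push anchor-distant pairs
    apart until their distance exceeds the margin alpha."""
    d = np.linalg.norm(f1 - f2)
    if is_neighbor:
        return d ** 2                # any distance is penalized
    return max(alpha - d, 0.0) ** 2  # penalized only inside the margin
```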