The document summarizes techniques for finding correspondences between images, including local feature detection, descriptors, and matching. It discusses how local features can be used for large scale image retrieval applications. Local feature detectors aim to find repeatable keypoints that are robust to changes in viewpoint and occlusion. Descriptors are used to represent image patches around keypoints to enable matching. Techniques like bag-of-words models and geometric verification allow matching features across large databases to perform image search and retrieval.
The document discusses content-based image retrieval. It begins with an overview of the problem of using a query image to retrieve similar images from a large dataset. Common techniques discussed include using SIFT features with bag-of-words models or convolutional neural network (CNN) features. The document outlines the classic SIFT retrieval pipeline and techniques for using features from pre-trained CNNs, such as max-pooling features from convolutional layers or encoding them with VLAD. It also discusses learning image representations specifically for retrieval using methods like the triplet loss to learn an embedding space that clusters similar images. The state-of-the-art methods achieve the best performance by learning global or regional image representations from CNNs trained on large, automatically generated datasets.
https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
To keep up with recent research trends – focusing on Deep Learning – Hiroshi Fukui
This document summarizes key developments in deep learning for object detection from 2012 onwards. It begins with a timeline showing that 2012 was a turning point, as deep learning achieved record-breaking results in image classification. The document then provides overviews of 250+ contributions relating to object detection frameworks, fundamental problems addressed, evaluation benchmarks and metrics, and state-of-the-art performance. Promising future research directions are also identified.
ClearGrasp is a method for estimating the 3D geometry of transparent objects from a single RGB-D image using a CNN architecture. It creates both synthetic and real datasets of transparent objects with surface normals, segmentation masks and depth information. The CNN takes an RGB image as input and outputs the surface normals, segmentation masks and occlusion boundaries. A global optimization method is then used to estimate depth from these outputs. The method achieves accurate 3D shape estimation and enables improved robot grasping of transparent objects compared to without using ClearGrasp.
June 13, 2019, SSII2019 Organized Session: Multimodal 4D sensing. The current state of SLAM technology for end users. Speaker: Tomoyuki Mukasa (Research Scientist, Rakuten Institute of Technology)
https://confit.atlas.jp/guide/event/ssii2019/static/organized#OS2
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis – taeseon ryu
This paper presents a 3D-aware model. With StyleGAN, when you want to edit a single feature, you can find the latent vector corresponding to the input and modify that latent vector to change, for example, the feature corresponding to the mouth. Adopting this concept directly, the GANSpace paper attempted to edit even spatial information given an input. Looking at the results, rotation appears reasonably well learned, but the output is sometimes perceived as a different person. This problem is described as a lack of disentanglement: instead of changing only the desired feature, other features change as well. This paper was created to make the model understand 3D more effectively and efficiently.
Transformer Architectures in Vision
[2018 ICML] Image Transformer
[2019 CVPR] Video Action Transformer Network
[2020 ECCV] End-to-End Object Detection with Transformers
[2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
PCA-SIFT: A More Distinctive Representation for Local Image Descriptors – wolf
PCA-SIFT is a modification of SIFT that uses principal component analysis (PCA) to build more distinctive local image descriptors. It constructs a projection matrix from a large set of image patches, then projects each keypoint descriptor through this matrix to a compact vector of the top n principal components. This provides a more discriminative representation than SIFT while reducing descriptor dimensionality, leading to improved matching accuracy and efficiency. Evaluation on controlled transformation and graffiti datasets shows PCA-SIFT achieves higher recall rates at equivalent or lower false positive rates compared to SIFT.
This document discusses one-shot learning techniques for object recognition from few examples. It introduces the concepts of embedding spaces and similarity metrics for measuring distances between objects. Specific deep learning models are described, including Siamese networks, triplet networks, DeepFace, and FaceNet. Siamese networks aim to learn a similarity function using a contrastive loss over input pairs, while triplet networks employ a triplet loss to optimize relative distances between anchor, positive, and negative examples. DeepFace and FaceNet are state-of-the-art face recognition systems that use deep convolutional networks trained with triplet losses to learn embeddings that achieve human-level accuracy on benchmark face datasets.
1) The document discusses using data in deep learning models, including understanding the limitations of data and how it is acquired.
2) It describes techniques for image matching using multi-view geometry, including finding corresponding points across images and triangulating them to determine camera pose.
3) Recent works aim to improve localization of objects in images using multiple instance learning approaches that can learn without full supervision or through more stable optimization methods like linearizing sampling operations.
Recent Progress on Object Detection_20170331 – Jihong Kang
This slide provides a brief summary of recent progress on object detection using deep learning.
The concepts behind selected previous works (R-CNN series/YOLO/SSD) and 6 recent papers (uploaded to arXiv between Dec 2016 and Mar 2017) are introduced in this slide.
Most of these papers focus on improving small-object detection performance.
This document proposes and evaluates several deep learning models for unsupervised monocular depth estimation. It begins with background on depth estimation methods and a literature review of recent work. Four depth estimation architectures are then described: EfficientNet-B7, EfficientNet-B3, DenseNet121, and DenseNet161. These models use an encoder-decoder structure with skip connections. An unsupervised loss function is adopted that combines appearance matching, disparity smoothness, and left-right consistency losses. The models are trained on the KITTI dataset and evaluated using standard KITTI metrics, showing improved performance over baseline methods using less training data and lower input resolution.
https://telecombcn-dl.github.io/2017-dlcv/
3-d interpretation from single 2-d image IV – Yu Huang
This document summarizes several methods for monocular 3D object detection from a single 2D image for autonomous driving applications. It outlines methods that use pseudo-LiDAR representations, monocular camera space cubification with an auto-encoder, utilizing ground plane priors, predicting categorical depth distributions, dynamic message propagation conditioned on depth, and utilizing geometric constraints. The methods aim to overcome challenges of monocular 3D detection by leveraging techniques such as depth estimation, 3D feature representation learning, and integrating contextual and depth cues.
3-d interpretation from single 2-d image III – Yu Huang
This document summarizes several papers related to monocular 3D object detection for autonomous driving. The first paper proposes MoVi-3D, a single-stage architecture that leverages virtual views to reduce visual appearance variability from objects at different distances, enabling detection across depths. The second paper describes RTM3D, which predicts object keypoints and uses geometric constraints to recover 3D bounding boxes in real-time. The third paper decouples detection into structured polygon estimation and height-guided depth estimation. It predicts 2D object surfaces and uses object height to estimate depth.
This paper aims at accurate 3D hand pose estimation from a single depth map. 3D hand pose estimation is a key technology for realizing applications such as HCI and AR. Many researchers have proposed methods to improve accuracy, but accuracy has remained limited by the similar appearance of fingers, occlusions, and the complexity of diverse finger motions. To overcome the limitations of existing methods, this paper changes both the input and output representations they use. Unlike most prior methods, which take a 2D depth image as input and directly regress the 3D coordinates of hand joints, the proposed model takes a 3D voxelized depth map as input and outputs 3D heatmaps. An encoder-decoder 3D CNN is used for this, and thanks to the changed input and output representations, the proposed model achieves the best performance on three widely used 3D hand pose estimation datasets and one 3D human pose estimation dataset. It also won the HANDS 2017 challenge held at ICCV 2017.
Video Stitching using Improved RANSAC and SIFT – IRJET Journal
1. The document discusses techniques for stitching multiple video frames into a panoramic video using Scale-Invariant Feature Transform (SIFT) and an improved RANSAC algorithm.
2. Key points and feature descriptors are extracted from frames using SIFT to find correspondences between frames. The improved RANSAC algorithm is used to estimate homography matrices between frames and filter outlier matches.
3. Frames are blended together to compensate for exposure differences and misalignments before being mapped to a reference plane to create the panoramic video mosaic. The algorithm aims to produce a high quality panoramic video in real-time.
Semi-supervised concept detection by learning the structure of similarity graphs – Symeon Papadopoulos
This document proposes a semi-supervised concept detection approach based on graph structure features (GSF) extracted from image similarity graphs. GSF represents images as vectors based on eigenvectors of the graph Laplacian. Two incremental learning schemes are developed to address computational issues. Experiments on synthetic and MIR-Flickr datasets show the approach achieves performance comparable or better than state-of-the-art methods, and benefits from adding unlabeled data. The approach provides an efficient and scalable solution for concept detection in large multimedia collections.
This document summarizes deep learning techniques for 3D point clouds. It discusses methods for 3D shape classification, object detection and tracking, and segmentation. For classification, projection-based and point-based networks are examined. Point-based networks include MLP, graph-based, and convolution networks. Object detection methods include region proposal-based and single shot detection. Segmentation explores semantic, instance, and part segmentation using point-based networks.
An Assessment of Image Matching Algorithms in Depth Estimation – CSCJournals
Computer vision is often used with mobile robots for feature tracking, landmark sensing, and obstacle detection. Almost all high-end robotics systems are now equipped with pairs of cameras arranged to provide depth perception. In stereo vision applications, the disparity between the stereo images allows depth estimation within a scene. Detecting conjugate pairs in stereo images is a challenging problem known as the correspondence problem. The goal of this research is to assess the performance of SIFT, MSER, and SURF, the well-known matching algorithms, in solving the correspondence problem and then in estimating the depth within the scene. The results of each algorithm are evaluated and presented. The conclusions and recommendations for future work lead towards the improvement of these powerful algorithms to achieve a higher level of efficiency within the scope of their performance.
3-d interpretation from single 2-d image V – Yu Huang
The document outlines several approaches for monocular 3D object detection from a single 2D image for autonomous driving applications. It summarizes MonoRUn, which uses self-supervised dense correspondences and geometry along with uncertainty propagation. It also summarizes M3DSSD, which uses feature alignment and asymmetric non-local attention in a single-stage detector. Additionally, it discusses analyzing and addressing localization errors, integrating differentiable NMS into training, and a flexible framework that decouples and adapts approaches for truncated vs normal objects.
Real-time large scale dense RGB-D SLAM with volumetric fusion extends KinectFusion to larger scales. It represents the volumetric reconstruction as a rolling buffer that translates as the camera moves. It estimates camera pose through combined geometric and photometric constraints. It closes loops by non-rigidly deforming the map with constraints from loop closures and jointly optimizes the camera poses and map. Evaluation shows it produces large, globally consistent, real-time dense reconstructions.
The document discusses content-based image retrieval and various techniques used for it. It begins by defining content-based image retrieval as taking a query image and ranking images in a large dataset based on how similar they are to the query. It then covers classic pipelines using SIFT features, using off-the-shelf CNN features, and learning representations specifically for retrieval. Methods discussed include spatial pooling of CNN activations, region pooling like R-MAC, and learning embeddings or features through triplet loss or diffusion-based ranking refinement. The goal is to learn representations from data that effectively capture semantic similarity for retrieval tasks.
This document describes a project to implement real-time facial recognition using OpenCV and Python. The project uses a laptop's webcam to capture video frames and detect and recognize faces in each frame. It trains an image dataset with face images and IDs then detects faces in each new video frame. It predicts faces by comparing features to the training data and labels matches based on a confidence level threshold. The document outlines the use of Haar cascade classifiers, LBPH algorithms, and OpenCV functions to complete the facial recognition process in real-time on new video frames from the webcam.
Large Scale Image Retrieval 2022.pdf
1. Large Scale Image Retrieval and Specific Object Search
Ondra Chum
Center for Machine Perception
Czech Technical University in Prague
2. Outline
• The correspondence problem
– Local features
– Descriptors
– Matching
– Geometry
• Retrieval with local features
– Bag of Words
– Geometry in image retrieval
– Beyond visual nearest neighbour search
• Image retrieval with CNNs
– Efficient network training
– Day / Night retrieval
4. The Problem
Given a pair of images, find corresponding pixels.
Semantic correspondence: NOT in this lecture.
Applications: image stitching, 3D reconstruction, augmented reality, localization / camera position.
5. Finding correspondences is not easy
due to large viewpoint change (including scale) => the wide-baseline stereo problem.
Applications: pose estimation, 3D reconstruction, location recognition.
6. Finding correspondences is not easy
due to large viewpoint change (including scale) => the wide-baseline stereo problem.
7. Finding correspondences is not easy
due to large viewpoint change (including scale) => the wide-baseline stereo problem.
Applications: location recognition, summarization of image collections.
8. Finding correspondences is not easy
due to large time difference => the temporal-baseline stereo problem.
Applications: historical reconstruction, location recognition, photographer recognition, camera type recognition.
11. Local Features
aka feature points, key points, anchor points, distinguished regions, …
• Repeatable features
• Feature descriptor: patch to a vector
• Similar features have similar descriptors – nearest neighbour search
• Retrieval – matching millions of images at the same time
• Detect features in images independently, local = robust to occlusions
13. Local (Handcrafted) Features
1. Enumerate all regions / level sets
2. Compute responses / stability
3. Local Non-Maxima Suppression
Corners: Harris [Harris'88], Susan [Smith'97], FAST/ORB [Rosten'06][Rublee'11]
Saddle points: Hessian [Lindeberg'91], SADDLE [Aldana'16]
Blobs: Hessian, DoG [Lowe'04], MSER [Matas'02], Tuytelaars regions
Simple idea – a distinguished feature should be different (at least) from all its immediate neighbourhoods; the same enumerate-score-suppress pipeline is commonly used for deep features.
14. Deep Local Features
DELF – classification loss, landmark labelled images
[Noh, Araujo, Sim, Weyand, Han: Large-scale image retrieval with attentive deep local features. CVPR’17]
HOW – contrastive loss, image-level supervision from retrieval-SfM 3D reconstruction
[Tolias, Jenicek, Chum: Learning and aggregating deep local descriptors for instance-level recognition ECCV’20]
D2-Net – point correspondence supervision from 3D
[Dusmanu et al.: D2-net: A trainable CNN for joint detection and description of local features. CVPR’19]
R2D2 – point correspondence supervision from optical flow
[Revaud et.al., R2D2: Reliable and Repeatable Detector and Descriptor, NeurIPS 2019]
SuperPoint – synthetic images, augmentations
[DeTone, Malisiewicz, Rabinovich: SuperPoint: Self-supervised interest point detection and description, CVPRW’18]
R2D2 – Revaud 2019
DELF – Noh 2017
15. Local Features from CNN Activations
Simeoni, Avrithis, Chum: Local Features and Visual Words Emerge in Activations, CVPR 2019
Convolutional layers → activation tensor → activation channel (output of a detector)
• Treat the activation channel as an input to a handcrafted feature detector (MSER)
• Use the channel id as a descriptor (visual word)
18. Affine Shape with CNNs
Mishkin, Radenović, Matas:
Repeatability Is Not Enough: Learning Affine Regions via Discriminability, ECCV 2018
AffNet
19. Descriptors of Local Features
Direct description of a measurement region: e.g. moments
(figure: a local feature and its measurement region)
20. Descriptors of Local Features
Normalize the region (local feature + measurement region) to a canonical form first, then compute a histogram of gradients: (root)SIFT
21. Descriptors of Local Features
Yurun Tian, Bin Fan, and Fuchao Wu: L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. CVPR 2017.
Anastasiya Mishchuk, Dmytro Mishkin, Filip Radenovic, Jiri Matas: Working hard to know your neighbor's margins: Local descriptor learning loss, NIPS 2017
23. Toy example for illustration: matching with OpenCV SIFT
Try yourself: https://github.com/ducha-aiki/matching-strategies-comparison
24. Toy example for illustration: matching with OpenCV SIFT
Shown: recovered 1st-to-2nd image projection, ground-truth 1st-to-2nd image projection, and inlier correspondences.
25. Nearest neighbor (NN) strategy
Features from img1 are matched to features from img2.
Note that this is asymmetric and allows “many-to-one” matches.
26. Nearest neighbor (NN) strategy
OpenCV RANSAC failed to find a good model with NN matching.
27. Mutual nearest neighbor (MNN) strategy
Features from img1 are matched to features from img2.
Only cross-consistent (mutual NN) matches are retained.
28. Mutual nearest neighbor (MNN) strategy
OpenCV RANSAC failed to find a good model with MNN matching.
No one-to-many connections, but still bad.
29. Feature space outlier rejection
• How can we tell which putative matches are more reliable?
• Heuristic: compare the distance of the nearest neighbor to that of the second nearest neighbor
– The ratio will be high for features that are not distinctive
– A threshold of 0.8 provides good separation
David Lowe. "Distinctive image features from scale-invariant keypoints.” IJCV 60 (2), pp. 91-110, 2004.
30. Second nearest neighbor ratio (SNN) strategy
Features from img1 are matched to features from img2.
– We look for the 2 nearest neighbors of each descriptor.
– If both are too similar (1stNN/2ndNN ratio > 0.8) → discard.
– If the 1st NN is much closer (1stNN/2ndNN ratio ≤ 0.8) → keep.
31. Second nearest neighbor ratio (SNN) strategy
With SNN matching (keep if 1stNN/2ndNN < 0.8), OpenCV RANSAC found a roughly correct model.
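To make the toy example concrete: the NN → ratio-test → RANSAC pipeline above maps directly onto the OpenCV API. A minimal sketch, assuming two placeholder image files and the 0.8 threshold from the slides:

```python
import cv2
import numpy as np

# Load the two views as grayscale images (file names are placeholders).
img1 = cv2.imread("img1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("img2.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute SIFT descriptors.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# For each descriptor in img1, find its two nearest neighbours in img2.
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)

# Lowe's ratio test: keep a match only if the best neighbour is
# clearly closer than the second best.
good = [m for m, n in knn if m.distance < 0.8 * n.distance]

# Geometric verification: robustly fit a homography with RANSAC.
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print(f"{int(inlier_mask.sum())} inliers out of {len(good)} tentative matches")
```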
32. 1st geometrically inconsistent nearest neighbor ratio (FGINN) strategy
The SNN ratio is good, but what about symmetric or closely co-detected features? The ratio test will kill them.
Solution: take as the 2nd nearest neighbor the nearest one that is spatially far enough from the 1st.
Mishkin et al., “MODS: Fast and Robust Method for Two-View Matching”, CVIU 2015
33. SNN vs FGINN
SNN: roughly correct. FGINN: more correspondences, better geometry found.
Mishkin et al., “MODS: Fast and Robust Method for Two-View Matching”, CVIU 2015
34. Local Geometric Constraints
Idea: verify a tentative match by comparing neighboring features – do the features around the match in image 1 also match around it in image 2?
[Schmid and Mohr: Local Greyvalue Invariants for Image Retrieval. PAMI 1997]
35. Cosegmentation / Seed Growing
Start from a seed – a single strong match – and try to locally “grow” the match, at pixel or feature level.
[Ferrari, Tuytelaars, Van Gool, ECCV 2004]
[Cech, Matas, Perdoch CVPR 08]
[Cavalli, Larsson, Oswald, Sattler, Pollefeys: AdaLAM, ECCV'20]
Seeds – semantic objects:
Benbihi, Pradalier and Chum: Object-Guided Day-Night Visual Localization in Urban Scenes, ICPR'22
38. Robust Estimation: Hough vs. RANSAC
Voting:
• discretized parameter space
• votes for parameters consistent with the measurements
• more votes → higher support
+ multiple models
+ can be very fast
- memory demanding
- distances measured in the parameter space
RANSAC:
• hypothesize and verify loop
- randomized (unless you try it all)
- typically slower than voting
+ no extra memory required
+ measures distances in pixels!
42.–47. RANSAC (built up step by step over six slides)
• Select a sample of m points at random
• Calculate the model parameters that fit the data in the sample
• Calculate the error function for each data point
• Select the data that support the current hypothesis
• Repeat sampling
48. RANSAC – number of samples
k … number of samples drawn
m … minimal sample size
N … number of data points
I … number of inliers
p … confidence in the solution (.95)

k = log(1 − p) / log(1 − (I/N)^m)
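As a quick sanity check of the formula, a small helper that evaluates it (assuming the inlier ratio I/N is passed in directly):

```python
import math

def ransac_samples(p: float, inlier_ratio: float, m: int) -> int:
    """Number of samples k needed to draw an all-inlier sample with
    confidence p, given the inlier ratio I/N and minimal sample size m."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - inlier_ratio ** m))

# e.g. a homography (m = 4) with 50% inliers and 95% confidence
print(ransac_samples(0.95, 0.5, 4))  # -> 47
```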
50. RANSAC [Fischler, Bolles '81]
In: U = {x_i}, set of data points, |U| = N
function f computes model parameters p given a sample S from U
a cost function evaluating a model on a single data point x
Out: p*, parameters of the model maximizing the cost function
k := 0
Repeat until P{better solution exists} < η (a function of C* and the number of steps k):
k := k + 1
I. Hypothesis
(1) select a random sample S_k ⊂ U of size m
(2) compute parameters p_k = f(S_k)
II. Verification
(3) compute the cost C_k by summing the per-point cost over U
(4) if C* < C_k then C* := C_k, p* := p_k
end
51. Advanced RANSAC
The same hypothesize-and-verify loop as in slide 50, extended with:
Non-uniform sampling
Error scale estimation
Potential degeneracy tests
Randomized verification
Preemptive scoring
Improving precision
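A compact NumPy sketch of the plain loop from slide 50, using the inlier count as the cost C_k and the adaptive stopping rule from slide 48; the fit and error callbacks are assumptions supplied by the caller, not part of the slides:

```python
import numpy as np

def ransac(points, fit, error, m, thresh, p=0.95, max_iter=1000, seed=0):
    """Hypothesize-and-verify loop. fit(sample) -> model parameters;
    error(model, points) -> per-point residuals. The cost is the inlier
    count; n_iter shrinks adaptively via k = log(1-p)/log(1-(I/N)^m)."""
    rng = np.random.default_rng(seed)
    n = len(points)
    best_model, best_inliers = None, np.zeros(n, dtype=bool)
    n_iter, k = max_iter, 0
    while k < n_iter:
        k += 1
        sample = points[rng.choice(n, size=m, replace=False)]  # (1) sample
        model = fit(sample)                                    # (2) hypothesis
        inliers = error(model, points) < thresh                # (3) support
        if inliers.sum() > best_inliers.sum():                 # (4) keep best
            best_model, best_inliers = model, inliers
            eps = inliers.sum() / n
            if eps < 1.0:
                n_iter = min(n_iter, int(np.ceil(
                    np.log(1 - p) / np.log(1 - eps ** m))))
            else:
                break  # all points are inliers, nothing better exists
    return best_model, best_inliers
```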
54. Image Retrieval
Find this … in a large (millions+) collection of images
• Find images of the same object
• What is this? Nearest neighbor classifier
• Where is this? Visual localization
• How did this look in the past?
• Is there anything interesting here?
56. Feature Based Retrieval
• Affine invariant features
• Efficient descriptors
• Corresponding regions in images have similar descriptors – measured by some distance in the feature space
• Images of the same object have many correspondences in common
57. Video Google
• Feature detection and description
• Vector quantization
• Bag of Words representation
• Scoring
• Verification
Sivic & Zisserman – ICCV 2003
Video Google: A Text Retrieval Approach to Object Matching in Videos
59. Feature Distance Approximation
Partition the feature space (k-means clustering).
Feature distance: 0 if the features fall in the same cell, ∞ if in different cells.
+ most of the features are not considered (infinitely distant)
+ near-by descriptors accessible instantly – storing a list of features for each cell
60. Feature Distance Approximation
Feature distance: 0 if the features fall in the same cell, ∞ if in different cells.
- quantization effects
- large (even unbounded) cells
61. Vector Quantization via k-Means
Initialize cluster centres; find the nearest cluster to each datapoint (slow, O(N k)); re-compute the cluster centres as centroids; iterate.
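A minimal NumPy sketch of these Lloyd iterations for building a small visual vocabulary; the brute-force assignment is exactly the slow O(N k) step that the later slides speed up:

```python
import numpy as np

def kmeans_vocabulary(descs, k, n_iter=20, seed=0):
    """Plain Lloyd iterations on a float32 N x D descriptor array:
    assign every descriptor to its nearest centre (the O(N k) step),
    then recompute each centre as the centroid of its cell."""
    rng = np.random.default_rng(seed)
    centres = descs[rng.choice(len(descs), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # squared distances of all descriptors to all centres: N x k
        d2 = ((descs[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = descs[assign == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return centres, assign
```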
62. Bags of Words Image Representation
Images are represented by a vector / histogram of the visual words present in them, e.g. for a vocabulary {A, B, C, D}: one image → (1, 0, 0, 2), another → (0, 3, 0, 1).
Term-frequency (tf) – visual word D occurs twice in the first image.
The vectors are sparse.
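Assuming the vocabulary from the k-means sketch above, a tf histogram for one image could be built like this (L2-normalized so that scoring reduces to dot products):

```python
import numpy as np

def bow_histogram(descs, centres):
    """Quantize an image's descriptors to their nearest visual word and
    count term frequencies; the resulting histogram is sparse."""
    d2 = ((descs[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    tf = np.bincount(words, minlength=len(centres)).astype(np.float32)
    return tf / max(np.linalg.norm(tf), 1e-12)  # L2-norm for dot-product scoring
```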
64. Efficient Scoring
Bag-of-words representation (up to 1,000,000-D, sparse).
Database vectors, e.g. α1 = (1 0 0 2), α2 = (0 2 0 1), α3 = (1 0 0 0), …; query vector αq.
Score of database image i: s_i = α_i · αq.
65. BoW and Inverted File
For each visual word of the vocabulary {A, B, C, D, …}, the inverted file stores the list of ids of the database images containing it (e.g. lists such as 1 3 6 …, 2 4 10 …, 5 6 8 …, 6 7 7 … over database images 1–10).
66. BoW and Inverted File
At query time, only the inverted lists of the query's visual words are traversed; database images accumulate score from every shared word.
67. BoW and Inverted File
Efficient (fast)
Linear complexity (in # documents)
Can be interpreted as voting
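A toy sketch of the voting interpretation; the sparse per-image histograms are an assumed input format (dicts mapping visual word id to tf weight), and a real system would also apply idf weighting:

```python
from collections import defaultdict

def build_inverted_file(database_histograms):
    """database_histograms: list of sparse histograms, one per image,
    each a dict {visual word id: tf weight} (an assumed input format)."""
    inverted_file = defaultdict(list)
    for img_id, hist in enumerate(database_histograms):
        for word, weight in hist.items():
            inverted_file[word].append((img_id, weight))
    return inverted_file

def score(query_hist, inverted_file):
    """Visit only the inverted lists of the query's visual words; every
    posting casts a weighted vote, accumulating the dot-product score."""
    scores = defaultdict(float)
    for word, q_weight in query_hist.items():
        for img_id, weight in inverted_file[word]:
            scores[img_id] += q_weight * weight
    return sorted(scores.items(), key=lambda item: -item[1])
```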
68. Geometric Verification and Re-ranking
The top-ranked results for a query are spatially verified: verified images are kept (and the object localized), the rest are rejected.
Philbin, Chum, Isard, Sivic, Zisserman: Object retrieval with large vocabularies and fast spatial matching, CVPR'07
71. Hierarchical k-means
+ fast O(N log k)
+ incremental construction
- not so good quantization
- often imbalanced
Nistér & Stewénius: Scalable recognition with a vocabulary tree. CVPR 2006
72. Approximate k-means
+ fast O(N log k)
+ reasonable quantization
- can be inconsistent when ANN fails
Initialize cluster centres; find the approximate nearest cluster to each datapoint; re-compute the cluster centres as centroids; iterate.
Philbin, Chum, Isard, Sivic, and Zisserman – CVPR 2007: Object retrieval with large vocabularies and fast spatial matching
73. Hamming Embedding
+ good quantization
+ elegant idea
- huge memory footprint
Within each quantization cell, descriptors are binarized by random projections and compared by Hamming distance.
Jegou, Douze, and Schmid – ECCV 2008: Hamming embedding and weak geometric consistency for large scale image search
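A minimal sketch of the binary-signature idea: random projections of the descriptor, binarized against per-cell thresholds (learned as medians of training descriptors in the paper; assumed given here):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((64, 128))  # 64 random projections of a 128-D descriptor

def he_signature(desc, thresholds):
    """Project the descriptor and binarize each coordinate against the
    per-cell thresholds (assumed precomputed), giving a 64-bit signature."""
    return (P @ desc > thresholds).astype(np.uint8)

def hamming_distance(a, b):
    return int(np.count_nonzero(a != b))
```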
74. Soft Assignment
(Approximate) k-means: on the database side / on the query side
Hierarchical k-means
Philbin, Chum, Isard, Sivic, and Zisserman – CVPR 2008: Lost in Quantization
Nistér & Stewénius – CVPR 2006: Scalable recognition with a vocabulary tree
75. Learning Fine Vocabularies
Fine vocabulary (16 million visual words).
Using wide-baseline stereo matches on 6 million images to learn what is similar.
Mikulik, Perdoch, Chum, and Matas: Learning a Fine Vocabulary, ECCV 2010
76. Appearance Variance of a Single Feature
Mikulik, Perdoch, Chum, Matas: Learning Vocabularies over a Fine Quantization, IJCV 2012
• over 5 million images
• almost 20k clusters of 750k images (visual word based)
• 733k successfully matched in WBS matching (raw descriptor based)
• over 111 M feature tracks established (12.3 M with 6+ features)
• 564 M features in the tracks (319.5 M in tracks of 6+ features)
http://cmp.felk.cvut.cz/~qqmikula/publications/ijcv2012/index.html
77. Short Codes – (Joint) Dimensionality Reduction
Jegou & Chum: Negative evidences and co-occurrences in image retrieval: the benefit of PCA and whitening, ECCV 2012
Radenovic, Jegou & Chum: Multiple Measurements and Joint Dimensionality Reduction for Large Scale Image Search with Short Vectors, ICMR 2015
78. Aggregating Local Descriptors
• High discriminability needed
• BoW increases the number of visual words; only assignments are recorded
Idea: use higher order statistics instead –
• small vocabulary (fast assignment)
• dense vectors (ANN search)
• high discriminability
VLAD descriptor [Jégou, Douze, Schmid and Pérez, CVPR'10]
Fisher Kernel approach [Perronnin and Dance, CVPR'07]
Often combined with dimensionality reduction by PCA – short codes.
79. Aggregating Local Descriptors
VLAD descriptor [Jégou, Douze, Schmid and Pérez, CVPR'10]
1. compute assignments to visual words
2. compute differences to the means (cluster centres)
3. sum the differences per visual word
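The three steps translate almost line for line into NumPy; a sketch that omits the PCA and normalization refinements used in practice:

```python
import numpy as np

def vlad(descs, centres):
    """VLAD aggregation: (1) assign descriptors to visual words,
    (2) compute residuals to the assigned centre, (3) sum residuals
    per word; finally flatten and L2-normalize."""
    d2 = ((descs[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    K, D = centres.shape
    v = np.zeros((K, D), dtype=np.float32)
    for j in range(K):
        members = descs[assign == j]
        if len(members):
            v[j] = (members - centres[j]).sum(axis=0)
    v = v.ravel()
    return v / max(np.linalg.norm(v), 1e-12)
```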
80. Aggregating Local Descriptors
Fisher Kernel approach [Perronnin and Dance, CVPR'07]
• Fit a GMM to training data (SIFT): diagonal covariance matrices, whitened data
• An image is represented as the sum, over its features, of the gradients of the log-likelihood
• fixed size representation (#parameters)
Intuition: the direction in which the parameters λ of the general model should be modified to better fit the specific sample (the current image data).
87. Context expansion
• the model of the object is grown beyond the boundaries of the
initial query,
• a feature added into the model that is not inside the context is
inactive until confirmed by feature(s) from another image with
the same visual word and similar geometry.
• Once a feature is confirmed, it adds the neighbourhood around
its center to the context.
Chum, Mikulik, Perdoch, Matas: Total Recall II: Query Expansion Revisited, CVPR 2011
90. How Much Do We Need to See?
Oxford landmarks – 3 queries
100%, 50%, and 10% of the query bounding box
Context learned from the full bounding box
Context learned from 50% of the bounding box
Context learned from 10% of the bounding box
91. Effects of decreasing the query bounding-box size
Baseline: spatial verification + full bounding box.
Context QE reaches the baseline performance with only:
• 20% of the BB on the Paris dataset
• 40% of the BB on the Oxford dataset
94. Retrieval for Browsing
Query 1
Query 2
Mikulik, Chum, Matas: Image Retrieval for Online Browsing in Large Image Collections, SISAP 2013.
95. New Problem Formulation
Retrieve relevant images subject to a constraint
• Geometric
– Maximize number of relevant pixels
– Maximize scale change
– Change of viewpoint
• Other
– High photometric change (day / night)
96. New Problem Formulation
Results
• Low rank in standard similarity measure
– Geometry for verification and constraint enforcement
– Geometry in the inverted file (DAAT)
• Standard similarity measure can be 0
– Matching through a path of images (query expansion)
99. Highest Resolution Transform
Given a query and a dataset, for every pixel in the query image: find the database image with the maximum resolution depicting that pixel.
(example magnifications: 37.3×, 27.0×, 22.8×, 21.9×, 21.6×)
101. Level of Interest Transform
Given a query and a dataset, for every pixel in the query image: find the frequency with which it is photographed in detail.
(legend: detail size 0–1%, 1–3%, 3–10%)
104. Tight Coupling of Retrieval and SfM
Schoenberger, Radenovic, Chum, and Frahm: From Single Image Query to Detailed 3D Reconstruction, CVPR'15
105. Beyond Nearest Neighbour
Looking around the corner
• Zoom out – getting a context of the image
• All details – getting transition to the object details
• Sidewise crawl
108. Efficient Search with Global Descriptors
Find this … in a large collection of images.
Each image is mapped into a high-dimensional descriptor space R^k, k ~ 512 … 2048; image similarity = distance in the descriptor space.
110. CNN Descriptors for Image Retrieval
Image (w × h × 3) → convolutional layers → activation tensor (W × H × K) → MAC layer: max pooling + L2-norm → K × 1 MAC vector
w × h – image width and height
W × H – number of activations for feature map k ∈ {1 … K}
K – number of feature maps in the last convolutional layer
MAC – Maximum Activations of Convolutions
111. CNN Descriptors for Image Retrieval
Image (w × h × 3) → convolutional layers → activation tensor (W × H × K) → SPoC layer: sum pooling + L2-norm → K × 1 SPoC vector
SPoC – Sum-Pooled Convolutional features
112. CNN Descriptors for Image Retrieval
Image (w × h × 3) → convolutional layers → activation tensor (W × H × K) → GeM layer: generalized-mean pooling + L2-norm → K × 1 GeM vector
GeM – Generalized Mean: p = 1 gives average pooling, p → ∞ gives max pooling
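A minimal PyTorch sketch of a GeM layer over the activation tensor; p is fixed here for simplicity, although it can also be treated as a learnable parameter:

```python
import torch
import torch.nn.functional as F

def gem(x: torch.Tensor, p: float = 3.0, eps: float = 1e-6) -> torch.Tensor:
    """Generalized-mean pooling of an activation tensor x (B x K x H x W)
    into a B x K descriptor. p = 1 recovers average pooling (SPoC);
    large p approaches max pooling (MAC)."""
    pooled = x.clamp(min=eps).pow(p).mean(dim=(-2, -1)).pow(1.0 / p)
    return F.normalize(pooled, dim=-1)  # L2-normalization, as in the slides
```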
119. “Lots of Training Examples”
Large Internet photo collection + image annotations → training a Convolutional Neural Network (CNN).
120. “Lots of Training Examples”
• Image annotations: not accurate, expensive $$
• Manual cleaning of the training data done by researchers: very expensive $$$$
• Automated extraction of training data: very accurate, free $
121. • Image representation created from CNN activations
of a network pre-trained for classification task
[Gong et al. ECCV’14, Razavian et al. arXiv’14, Babenko et al.
ICCV’15, Kalantidis et al. arXiv’15, Tolias et al. ICLR’16]
+ Retrieval accuracy suggests generalization of CNNs
- Trained for image classification, NOT retrieval task
CNN Image Retrieval
122. (figure: visually dissimilar images that share the same class label; image from ImageNet.org)
123. CNN Image Retrieval
• CNN network re-trained using a dataset that contains
landmarks and buildings as object classes.
[Babenko et al. ECCV’14]
+ Training dataset closer to the target task
- Final metric different to the one actually optimized
- Constructing training datasets requires manual effort
124. (figure: same-class landmark images; image from [Babenko et al. ECCV'14])
125. CNN Image Retrieval
• NetVLAD: end-to-end fine-tuning for image retrieval.
Geo-tagged dataset for weakly supervised fine-tuning.
[Arandjelovic et al. CVPR’16]
+ Training dataset corresponds to the target task
+ Final metric corresponds to the one actually optimized
- Training dataset requires geo-tags
126. (figure: a NetVLAD training query from the geo-tagged dataset; camera orientation unknown)
127. CNN learns from BoW – Training Data
Input: Large unannotated dataset
1. Initial clusters created by grouping of spatially related images [Chum & Matas PAMI'10]
2. Clustered images used as queries for a retrieval-SfM pipeline [Schonberger et al. CVPR'15]
Output: Non-overlapping 3D models – 551 (134k images) for training / 162 (30k images) for validation.
Camera orientation known; number of inliers known.
128. CNN learns from BoW – Positives
1. Descriptor distance: the image with the lowest global descriptor distance is chosen (NetVLAD uses this)
2. Maximum inliers: the image with the highest number of co-observed 3D points with the query image is chosen
3. Relaxed inliers: a random image close to the query, with enough inliers and no extreme scale change, is chosen
129. CNN learns from BoW – Negatives
K-nearest neighbors of the query image are selected from all non-matching clusters, using different methods:
1. No constraint: the chosen images are often near identical.
2. At most one image per cluster: higher variability.
133. Day – Night Retrieval
Day–night training image pairs: sequences of images at day – evening – night.
Photometric normalization.
134. Contrast Limited Adaptive Histogram Equalization (CLAHE)
• Semi-local (windows)
• Linear interpolation
• Only values more frequent than the clipping limit are redistributed
(figure: original vs. global histogram equalization vs. CLAHE)
[Jenicek, Chum: No Fear of the Dark: Image Retrieval under Varying Illumination Conditions, ICCV 2019]
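CLAHE is available directly in OpenCV; a minimal usage sketch with a placeholder file name and commonly used (not paper-specific) parameter values:

```python
import cv2

# Photometric normalization of a night-time query image.
img = cv2.imread("night_query.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # clipping limit, windows
normalized = clahe.apply(img)
cv2.imwrite("night_query_clahe.jpg", normalized)
```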