Large Scale Image Retrieval
and Specific Object Search
Ondra Chum
Center for Machine Perception
Czech Technical University in Prague
Outline
• The correspondence problem
– Local features
– Descriptors
– Matching
– Geometry
• Retrieval with local features
– Bag of Words
– Geometry in image retrieval
– Beyond visual nearest neighbour search
• Image retrieval with CNNs
– Efficient network training
– Day / Night retrieval
The Correspondence Problem
4
The Problem
Given a pair of images, find corresponding pixels
YES !
Semantic correspondence
NOT in this lecture
Image stitching
3D reconstruction
Augmented reality
Localization / camera position
5
due to large viewpoint
change (including scale)
=>
the wide-baseline
stereo problem
Applications:
- pose estimation
- 3D reconstruction
- location recognition
Finding correspondences is not easy
6
Finding correspondences is not easy
due to large viewpoint
change (including scale)
=>
the wide-baseline
stereo problem
7
Applications:
- location recognition
- summarization of image
collections
Finding correspondences is not easy
due to large viewpoint
change (including scale)
=>
the wide-baseline
stereo problem
8
Applications:
- historical reconstruction
- location recognition
- photographer
recognition
- camera type recognition
MPV course 2022, CTU in Prague
Finding correspondences is not easy
due to large
time difference
=>
the temporal-baseline
stereo problem
Finding Correspondences is not easy
9
due to occlusion
Applications:
- pose estimation
- inpainting
Local Features
11
Local Features
aka feature points, key points, anchor points, distinguished regions, …
• Repeatable features
• Feature descriptor: patch to a vector
• Similar features have similar descriptors – nearest neighbour search
• Retrieval – matching millions of images at the same time
• Detect features in images independently, local = robust to occlusions
12
Local (Handcrafted) Features
Simple idea – a distinguished feature should be different (at least)
from all its immediate neighbourhoods
13
Corners Saddle points Blobs
Local (Handcrafted) Features
1. Enumerate all regions / level sets
2. Compute responses / stability
3. Local Non-Maxima Suppression
Regions
Harris [Harris’88]
Susan [Smith’97]
FAST/ ORB
[Rosten’06][Rublee’11]
Hessian [Lindeberg’91]
SADDLE [Aldana’16]
Hessian
DoG [Lowe’04]
MSER [Matas’02]
Tuytelaars
Simple idea – a distinguished feature should be different (at least)
from all its immediate neighbourhoods
Commonly
used for
deep features
14
Deep Local Features
DELF – classification loss, landmark labelled images
[Noh, Araujo, Sim, Weyand, Han: Large-scale image retrieval with attentive deep local features. CVPR’17]
HOW – contrastive loss, SfM Retrieval – 3D reconstruction, image level
[Tolias, Jenicek, Chum: Learning and aggregating deep local descriptors for instance-level recognition ECCV’20]
D2 net – point correspondence supervision from 3D
[Dusmanu et al.: D2-net: A trainable CNN for joint detection and description of local features. CVPR’19]
R2D2 – point correspondence supervision from optical flow
[Revaud et al., R2D2: Reliable and Repeatable Detector and Descriptor, NeurIPS 2019]
SuperPoint – synthetic images, augmentations
[DeTone, Malisiewicz, Rabinovich: SuperPoint: Self-supervised interest point detection and description, CVPRW’18]
R2D2 – Revaud 2019
DELF – Noh 2017
15
Local Features from CNN Activations
Simeoni, Avrithis, Chum: Local Features and Visual Words Emerge in Activations, CVPR 2019
Convolutional layers Activation tensor Activation channel
(output of a detector)
• Treat the activation channel as an input to handcrafted feature detector (MSER)
• Use channel id as a descriptor (visual word)
…
…
16
Transformation Co-variant Local Features
Scale invariance
Responses over different scales
17
Transformation Co-variant Local Features
Scale invariance
Responses over different scales
Affine invariance
18
Affine Shape with CNNs
Mishkin, Radenović, Matas:
Repeatability Is Not Enough: Learning Affine Regions via Discriminability, ECCV 2018
AffNet
19
Descriptors of Local Features
Direct description of a measurement region: e.g. moments
Local
feature
Measurement
region
20
Descriptors of Local Features
Local
feature
Measurement
region
Normalize region to a canonical form first
Histogram of gradients
(root) SIFT
21
Descriptors of Local Features
Yurun Tian, Bin Fan and Fuchao Wu. L2-Net: Deep learning of discriminative patch
descriptor in Euclidean space. CVPR 2017.
Anastasiya Mishchuk, Dmytro Mishkin, Filip Radenovic, Jiri Matas: Working hard
to know your neighbor's margins: Local descriptor learning loss, NIPS 2017
Matching Local Features
Toy example for illustration: matching with OpenCV SIFT
Try yourself: https://github.com/ducha-aiki/matching-strategies-comparison
Toy example for illustration: matching with OpenCV SIFT
Recovered 1st to 2nd image projection,
ground truth 1st to 2nd image project,
inlier correspondences
Nearest neighbor (NN) strategy
Features from img1 are
matched to features from img2
Note that it is asymmetric and allows “many-to-one” matches
Nearest neighbor (NN) strategy
OpenCV RANSAC failed to find a good model
with NN matching
Features from img1 are
matched to features from img2
Mutual nearest neighbor (MNN) strategy
Features from img1 are
matched to features from img2
Only cross-consistent
(mutual NNs) matches are retained.
Mutual nearest neighbor (MNN) strategy
OpenCV RANSAC failed to find a good
model with MNN matching
No one-to-many connections, but still bad
Features from img1 are
matched to features from img2
Feature space outlier rejection
• How can we tell which putative matches are more reliable?
• Heuristic: compare distance of the nearest neighbor to that of the
second nearest neighbor
– Ratio will be high for features that are not distinctive
– Threshold of 0.8 provides good separation
David Lowe. "Distinctive image features from scale-invariant keypoints.” IJCV 60 (2), pp. 91-110, 2004.
Second nearest neighbor ratio (SNN) strategy
- we look for the 2 nearest neighbors of each feature
- if both are too similar (1stNN / 2ndNN ratio > 0.8) → discard
- if the 1st NN is much closer (1stNN / 2ndNN ratio ≤ 0.8) → keep
Features from img1 are matched to features from img2
Second nearest neighbor ratio (SNN) strategy
1stNN / 2ndNN < 0.8, keep
OpenCV RANSAC found a roughly correct model
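A minimal sketch of the three strategies above (plain NN, mutual NN, and the SNN ratio test) with OpenCV SIFT, assuming two grayscale images img1 and img2 are already loaded; thresholds follow the slides and the final geometric verification uses OpenCV RANSAC:

```python
import cv2
import numpy as np

# Illustrative sketch: NN / mutual-NN / ratio-test matching with OpenCV SIFT.
# img1, img2 are assumed to be grayscale uint8 numpy arrays.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=False)

# 1) Nearest neighbour: every feature in img1 gets its closest match in img2
nn_matches = bf.match(des1, des2)

# 2) Mutual NN: crossCheck=True keeps a match only if it is the NN in both directions
mnn_matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)

# 3) Second-nearest-neighbour (Lowe) ratio test, threshold 0.8
snn_matches = []
for m, n in bf.knnMatch(des1, des2, k=2):
    if m.distance / n.distance < 0.8:   # 1stNN much closer than 2ndNN -> keep
        snn_matches.append(m)

# Tentative matches are then verified geometrically, e.g. with RANSAC:
src = np.float32([kp1[m.queryIdx].pt for m in snn_matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in snn_matches]).reshape(-1, 1, 2)
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
```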
1st geometrically inconsistent nearest neighbor ratio (FGINN)
strategy
32
MPV course 2022, CTU in Prague
The SNN ratio works well, but what about symmetric structures or features detected too close to each other? The ratio test will discard them.
Solution: look for the 2nd nearest neighbor that is spatially far enough from the 1st nearest one.
Mishkin et al.,“MODS: Fast and Robust Method for Two-View Matching”, CVIU 2015
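A rough numpy sketch of the FGINN idea under simplified assumptions: the ratio is computed against the closest neighbour that lies at least min_dist pixels away from the 1st NN in the second image. The ratio and distance thresholds here are illustrative, not the values from the MODS paper.

```python
import numpy as np

def fginn_matches(des1, des2, kp2_xy, ratio=0.8, min_dist=10.0):
    """First Geometrically Inconsistent NN ratio test (illustrative sketch).

    des1, des2 : (N1, D), (N2, D) descriptor arrays
    kp2_xy     : (N2, 2) keypoint coordinates in image 2
    """
    matches = []
    for i, d in enumerate(des1):
        dist = np.linalg.norm(des2 - d, axis=1)
        order = np.argsort(dist)
        first = order[0]
        # first NN that is spatially far enough from the 1st NN in image 2
        far = [j for j in order[1:]
               if np.linalg.norm(kp2_xy[j] - kp2_xy[first]) > min_dist]
        if not far:
            continue
        if dist[first] / dist[far[0]] < ratio:
            matches.append((i, first))
    return matches
```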
SNN vs FGINN
Mishkin et al., “MODS: Fast and Robust Method for Two-View Matching”, CVIU 2015
SNN: roughly
correct
FGINN: more
correspondences,
better geometry
found
34
Idea: verify a tentative match “+“ by comparing neighboring features
[Schmid and Mohr: Local Greyvalue Invariants for Image Retrieval. PAMI 1997]
Local Geometric Constraints
(figure: tentative matches marked “+” and their neighbouring matching features in image 1 and image 2)
35
Cosegmentation / Seed Growing
Start from a seed – a single strong match – and try to locally “grow” the match
- at pixel or feature level
[Ferrari, Tuytelaars,Van Gool, ECCV 2004]
[Cech, Matas, Perdoch CVPR 08]
[Cavalli, Larsson, Oswald, Sattler, Pollefeys: AdaLAM, ECCV’20]
Seeds – semantic objects
Benbihi, Pradalier and Chum: Object-Guided Day-Night Visual Localization in Urban Scenes, ICPR’22
36
Learned Matching
[Sarlin, DeTone, Malisiewicz, Rabinovich. SuperGlue: Learning feature matching with graph neural networks. CVPR’20]
figures from Sarlin CVPR’20
Global Geometry
• Voting in the parameter space
• RANSAC
38
Robust Estimation: Hough vs. RANSAC
Voting:
• discretized parameter space
• votes for parameters consistent
with the measurements
• more votes → higher support
+ multiple models
+ can be very fast
- memory demanding
- distances measured in the
parameter space
RANSAC:
• hypothesize and verify loop
- randomized (unless you try it all)
- typically slower than voting
+ no extra memory required
+ measures distances in pixels!
RANSAC
40
Fitting a Line
Least squares fit
41
RANSAC
• Select sample of m points
at random
42
RANSAC
• Select sample of m points at
random
• Calculate model
parameters that fit the data
in the sample
43
RANSAC
• Select sample of m points at
random
• Calculate model parameters
that fit the data in the sample
• Calculate error function
for each data point
44
RANSAC
• Select sample of m points at
random
• Calculate model parameters
that fit the data in the sample
• Calculate error function for
each data point
• Select data that support
current hypothesis
45
RANSAC
• Select sample of m points at
random
• Calculate model parameters
that fit the data in the sample
• Calculate error function for
each data point
• Select data that support
current hypothesis
• Repeat sampling
46
RANSAC
• Select sample of m points at
random
• Calculate model parameters
that fit the data in the sample
• Calculate error function for
each data point
• Select data that support
current hypothesis
• Repeat sampling
47
RANSAC
• Select sample of m points at
random
• Calculate model parameters
that fit the data in the sample
• Calculate error function for
each data point
• Select data that support
current hypothesis
• Repeat sampling
48
RANSAC
k … number of samples
drawn
m … minimal sample size
N … number of data points
I … number of inliers
p … confidence in the
solution (.95)
k = log(1 − p) / log(1 − (I/N)^m)
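With the quantities above, the required number of samples follows directly; a small helper assuming the standard i.i.d. inlier model:

```python
import math

def ransac_samples(inlier_ratio, m, p=0.95):
    """Number of samples k so that, with confidence p, at least one
    all-inlier minimal sample of size m is drawn."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - inlier_ratio ** m))

# e.g. 50% inliers, homography (m = 4): ransac_samples(0.5, 4) -> 47
```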
49
How Many Samples
(table: required number of samples k as a function of the inlier fraction I/N [%] and the sample size m)
50
RANSAC [Fischler, Bolles ’81]
In: U = {xi} set of data points, |U| = N
    function f computes model parameters p given a sample S from U
    ρ(p, x) … the cost function for a single data point x
Out: p*, parameters of the model maximizing the cost function
k := 0
Repeat until P{better solution exists} < η (a function of C* and no. of steps k)
    k := k + 1
    I. Hypothesis
    (1) select randomly a sample Sk ⊆ U of size |Sk| = m
    (2) compute parameters pk = f(Sk)
    II. Verification
    (3) compute cost Ck = Σx∈U ρ(pk, x)
    (4) if C* < Ck then C* := Ck, p* := pk
end
51
Advanced RANSAC
In: U = {xi} set of data points, |U| = N
    function f computes model parameters p given a sample S from U
    ρ(p, x) … the cost function for a single data point x
Out: p*, parameters of the model maximizing the cost function
k := 0
Repeat until P{better solution exists} < η (a function of C* and no. of steps k)
    k := k + 1
    I. Hypothesis
    (1) select randomly a sample Sk ⊆ U of size |Sk| = m
    (2) compute parameters pk = f(Sk)
    II. Verification
    (3) compute cost Ck = Σx∈U ρ(pk, x)
    (4) if C* < Ck then C* := Ck, p* := pk
end
Non-uniform sampling
Error scale estimation
Potential degeneracy tests
Randomized verification
Preemptive scoring
Improving precision
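A minimal line-fitting RANSAC in the spirit of the basic pseudocode above; a fixed number of iterations replaces the adaptive η-based stopping, and the inlier threshold is illustrative:

```python
import numpy as np

def ransac_line(points, iters=100, thresh=1.0):
    """Fit a line ax + by + c = 0 (with a^2 + b^2 = 1) to 2-D points with RANSAC."""
    best_inliers, best_line = np.zeros(len(points), bool), None
    for _ in range(iters):
        # I. Hypothesis: minimal sample of m = 2 points
        i, j = np.random.choice(len(points), 2, replace=False)
        p, q = points[i], points[j]
        n = np.array([q[1] - p[1], p[0] - q[0]])      # normal of the line
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue
        a, b = n / norm
        c = -(a * p[0] + b * p[1])
        # II. Verification: error measured in pixels (point-to-line distance)
        err = np.abs(points @ np.array([a, b]) + c)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_line = inliers, (a, b, c)
    return best_line, best_inliers
```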
52
*SAC
RANSAC [Fischler’81], MLESAC [Torr’00], R-RANSAC [Chum’02],
NAPSAC [Myatt’02], Guided MLESAC [Tordoff’02], LO-RANSAC
[Chum’03], Preemptive RANSAC [Nister’03], PROSAC [Chum’05],
RANSAC with bail-out [Capel’05], DegenSAC [Chum’05], WaldSAC
[Matas’05], QDEGSAC [Frahm‘06], GASAC [Rodehorst’06], ARRSAC
[Raguram’08] GroupSAC [Ni’09], Cov-RANSAC [Raguram’09], …
Lebeda, Matas, and Chum: Fixing the Locally Optimized RANSAC, BMVC 2012
images, data, executables:
http://cmp.felk.cvut.cz/software/LO-RANSAC/index.xhtml
Raguram, Chum, Pollefeys, Matas, Frahm:
“USAC: A Universal Framework for Random Sample Consensus”, PAMI 2013
code, data:
http://cs.unc.edu/~rraguram/usac/
Barath, Matas / Barath, Matas, Noskova :
“Graph-Cut RANSAC”, CVPR 2017 / “MAGSAC”, CVPR 2019
code, data:
http://github.com/danini/graph-cut-ransac
http://github.com/danini/magsac
FEATURE-BASED IMAGE RETRIEVAL
54
Image Retrieval
Find this …
… in a large (millions+) collection of images
?
• Find images of the same object
• What is this? Nearest neighbor classifier
• Where is this? Visual localization
• How did this look in the past?
• Is there anything interesting here?
55
area under the curve
Average Precision (AP)
Retrieval Quality
(figure: precision–recall curve – precision on the vertical axis, recall on the horizontal axis, both from 0 to 1)
Query
Database size: 10 images
Relevant (total): 5 images
Results (ordered):
precision = #relevant / #returned
recall = #relevant / #total relevant
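Following the definitions above, average precision can be computed from the ordered list of relevance flags; a short sketch for a 10-image toy database with 5 relevant images (the ranking below is made up for illustration):

```python
import numpy as np

def average_precision(relevant, n_relevant_total):
    """relevant: ordered 0/1 flags of the returned results."""
    relevant = np.asarray(relevant, dtype=float)
    hits = np.cumsum(relevant)                           # #relevant so far
    precision = hits / np.arange(1, len(relevant) + 1)   # #relevant / #returned
    # AP = mean of precision at the ranks of the relevant results
    return (precision * relevant).sum() / n_relevant_total

# toy database of 10 images, 5 relevant, hypothetical ranking:
print(average_precision([1, 1, 0, 1, 0, 0, 1, 0, 0, 1], 5))
```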
56
Feature Based Retrieval
• Affine invariant features
• Efficient descriptors
• Corresponding regions in images have similar descriptors – measured by some distance in the feature space
• Images of the same object have many
correspondences in common
57
Video Google
• Feature detection and description
• Vector quantization
• Bag of Words representation
• Scoring
• Verification
Sivic & Zisserman – ICCV 2003
Video Google: A Text Retrieval Approach to Object Matching in Videos
58
Bag-of-Words (BoW): Off-line Stage
59
Feature Distance Approximation
Partition the feature space
(k – means clustering)
Feature distance
0 : features in the same cell
∞ : features in different cells
+ most of the features are not
considered (infinitely distant)
+ near-by descriptors accessible
instantly – storing a list of
features for each cell
60
Feature Distance Approximation
- quantization effects
- large (even unbounded) cells
Feature distance
0 : features in the same cell
∞ : features in different cells
61
Vector Quantization via k-Means
Initialize cluster
centres
Find nearest cluster to each
datapoint (slow) O(N k)
Re-compute cluster
centres as centroids
Iterate
62
Bags of Words Image Representation
(figure: visual vocabulary {A, B, C, D}; each image is mapped to a histogram of visual-word occurrences, e.g. (1, 0, 0, 2) and (0, 3, 0, 1))
Images are represented by vector / histogram of
visual words present in them
Term-frequency (tf) – visual word D is twice in the image
sparse
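A sketch of the quantization and term-frequency histogram step, assuming the visual vocabulary (k-means centroids) has already been trained:

```python
import numpy as np
from scipy.cluster.vq import vq

def bow_histogram(descriptors, centroids):
    """Quantize local descriptors to visual words and build a tf histogram.

    descriptors : (N, D) array of local descriptors of one image
    centroids   : (K, D) visual vocabulary (k-means centres), assumed pre-trained
    """
    words, _ = vq(descriptors, centroids)          # nearest centroid per feature
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist / max(hist.sum(), 1.0)             # L1-normalised term frequencies
```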
63
Bag-of-Words : On-line Stage
64
Efficient Scoring
bag of words representation
(up to 1,000,000 D)
(example: the query histogram (0, 3, 0, 1) over vocabulary {A, B, C, D} is compared with normalised database histograms α1·(1, 0, 0, 2), α2·(0, 2, 0, 1), α3·(1, 0, 0, 0), …; the dot products give the scores s1, s2, s3, …)
65
BoW and Inverted File
(figure: inverted file over a database of 10 images – for each visual word A, B, C, D, … of the vocabulary, a list of the database images that contain it)
66
BoW and Inverted File
(figure: at query time only the posting lists of the query’s visual words are traversed; each listed image accumulates a vote)
67
BoW and Inverted File
Efficient (fast)
Linear complexity (in # documents)
Can be interpreted as voting
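A toy inverted-file scorer, assuming images are already represented by lists of visual-word ids; scoring only touches the posting lists of the query’s words:

```python
from collections import defaultdict

class InvertedFile:
    """Toy inverted file: for each visual word, the list of images containing it."""
    def __init__(self):
        self.posting = defaultdict(list)   # word id -> [image ids]

    def add(self, image_id, words):
        for w in set(words):
            self.posting[w].append(image_id)

    def query(self, words):
        # voting: each shared visual word adds one vote to the image's score
        scores = defaultdict(int)
        for w in set(words):
            for image_id in self.posting[w]:
                scores[image_id] += 1
        return sorted(scores.items(), key=lambda s: -s[1])
```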
69
Geometric Verification and Re-ranking
Query
Results
reject
verify
localize
Philbin, Chum, Isard, Sivic, Zisserman: Object retrieval with large
vocabularies and fast spatial matching, CVPR’07
Visual Words and
Vector Quantization
71
Vector Quantization
• k-means
• Fixed quantization [Tuytelaars and Schmid ICCV 2007]
• Agglomerative [Leibe, Mikolajczyk and Schiele BMVC 2006]
• Hierarchical k-means
• Approximate k-means
• Hamming embedding
• Learning fine vocabularies
72
Hierarchical k-means
+ fast O(N log k)
+ incremental construction
- not so good quantization
- often imbalanced
Nistér & Stewénius: Scalable recognition with a vocabulary tree. CVPR 2006
73
Approximate k-means
+ fast O(N log k)
+ reasonable quantization
- Can be inconsistent when ANN fails
Philbin, Chum, Isard, Sivic, and Zisserman – CVPR 2007
Object retrieval with large vocabularies and fast spatial matching
Initialize cluster
centres
Find approximate nearest
cluster to each datapoint
Re-compute cluster
centres as centroids
Iterate
74
Hamming Embedding
+ good quantization
+ elegant idea
- huge memory footprint
(figure: descriptors within a cell are encoded as short binary strings; their similarity is measured by the Hamming distance)
Jegou, Douze, and Schmid – ECCV 2008
Hamming embedding and weak geometric consistency for large scale image search
random projections
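A rough sketch of the Hamming-embedding idea: inside each Voronoi cell, descriptors are projected by a random matrix and binarized against per-dimension thresholds learned on training data. The projection matrix, thresholds and number of bits b are assumed given here.

```python
import numpy as np

def hamming_embed(desc, P, medians):
    """Binarize a descriptor inside its Voronoi cell (sketch of Hamming embedding).

    desc    : (D,) local descriptor already assigned to a visual word
    P       : (b, D) random projection matrix for that word (assumed given)
    medians : (b,) per-dimension thresholds learned from training descriptors
    """
    return (P @ desc > medians).astype(np.uint8)

def hamming_distance(b1, b2):
    return int(np.count_nonzero(b1 != b2))
```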
75
Soft Assignment
(Approximate) k-means
- database side
- query side
Hierarchical k-means
Philbin, Chum, Isard, Sivic, and Zisserman – CVPR 2008
Lost in Quantization
Nistér & Stewénius – CVPR 2006 Scalable
recognition with a vocabulary tree
76
Learning Fine Vocabularies
Fine vocabulary (16 million visual words)
Using wide-baseline stereo matches on 6 million images to learn what is similar
Mikulik, Perdoch, Chum, and Matas: Learning a Fine Vocabulary, ECCV 2010
77
Appearance Variance of a Single Feature
Mikulik, Perdoch, Chum, Matas: Learning Vocabularies over a Fine Quantization, IJCV 2012
• over 5 million images
• almost 20k clusters of 750k images (visual word based)
• 733k successfully matched in WBS matching (raw descriptor based)
• over 111 M feature tracks established (12.3 M with 6+ features)
• 564 M features in the tracks (319.5 M in tracks of 6+ features)
http://cmp.felk.cvut.cz/~qqmikula/publications/ijcv2012/index.html
78
Short Codes – (Joint) Dimensionality Reduction
Jegou & Chum: Negative evidences and co-occurrences in image retrieval: the benefit of PCA and
whitening, ECCV 2012
Radenovic, Jegou & Chum: Multiple Measurements and Joint Dimensionality Reduction for
Large Scale Image Search with Short Vectors ICMR 2015
79
Aggregating Local Descriptors
A
C
D
B
VLAD descriptor
[Jégou, Douze, Schmid and Pérez, CVPR’10]
Fisher Kernel approach
[Perronnin and Dance, CVPR’07]
often combined with dimensionality
reduction by PCA – short codes
• High discriminability needed
• BOW increases the number of visual words
• only assignments are recorded
Idea: using higher order statistics
• small vocabulary (fast assignment)
• dense vectors (ANN search)
• high discriminability
80
Aggregating Local Descriptors
A
C
D
B
VLAD descriptor
[Jégou, Douze, Schmid and Pérez, CVPR’10]
1. compute assignments
2. compute difference to means
3. sum differences per visual word
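A numpy sketch of the three VLAD steps above; the vocabulary (centroids) is assumed to be trained beforehand, and intra-normalisation / power-law variants are omitted:

```python
import numpy as np

def vlad(descriptors, centroids):
    """Aggregate local descriptors into a VLAD vector (basic variant, sketch)."""
    K, D = centroids.shape
    # 1. assign each descriptor to its nearest visual word
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # 2.-3. sum the residuals (descriptor - centre) per visual word
    v = np.zeros((K, D))
    for k in range(K):
        sel = descriptors[assign == k]
        if len(sel):
            v[k] = (sel - centroids[k]).sum(axis=0)
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)   # L2 normalisation
```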
81
• Fit a GMM to training data (SIFT)
• diagonal covariance matrix
• whitened data
• Image represented as a sum (over image
features) of gradients of log-likelihood
• fixed size representation (#parameters)
Aggregating Local Descriptors
A
C
D
B
Fisher Kernel approach
[Perronnin and Dance, CVPR’07]
Intuition: the direction in which the parameters λ of the general model should be modified to better fit the specific sample (the current image data).
Query Expansion
84
Query Expansion
…
Query image
Results
New query
Spatial verification
New results
Chum, Philbin, Sivic, Isard, Zisserman: Total Recall…, ICCV 2007
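A minimal average-query-expansion sketch in BoW space; spatial verification is abstracted into a per-result boolean flag, and the number of considered results is illustrative:

```python
import numpy as np

def average_query_expansion(query_bow, result_bows, verified, top_k=50):
    """Build a new query from the original BoW vector and the spatially
    verified top-ranked results (average query expansion, sketch)."""
    good = [bow for bow, ok in zip(result_bows[:top_k], verified[:top_k]) if ok]
    if good:
        expanded = query_bow + np.sum(good, axis=0)
    else:
        expanded = query_bow
    return expanded / (np.linalg.norm(expanded) + 1e-12)
```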
85
Query Expansion Step by Step
Query Image Retrieved image Originally not retrieved
86
Query Expansion Step by Step
87
Query Expansion Step by Step
88
Query Expansion Results
Query
image
Expanded results (better)
Original results (good)
89
Context expansion
• the model of the object is grown beyond the boundaries of the
initial query,
• a feature added into the model that is not inside the context is
inactive until confirmed by feature(s) from another image with
the same visual word and similar geometry.
• Once a feature is confirmed, it adds the neighbourhood around
its center to the context.
Chum, Mikulik, Perdoch, Matas: Total Recall II: Query Expansion Revisited, CVPR 2011
90
• the model of the object is grown beyond the boundaries of the
initial query,
• a feature added into the model that is not inside the context is
inactive until confirmed by feature(s) from another image with
the same visual word and similar geometry.
• Once a feature is confirmed, it adds the neighbourhood around
its center to the context.
Context expansion
Chum, Mikulik, Perdoch, Matas: Total Recall II: Query Expansion Revisited, CVPR 2011
91
Learning the Context
Feature patches back-projected into the context from spatially
verified images.
(figure: the query and the context learned after 2, 5, 10, and 20 spatially verified images)
92
How Much Do We Need to See?
Oxford landmarks – 3 queries
100%, 50%, and 10% of the query bounding box
Context learned from the full bounding box
Context learned from 50% of the bounding box
Context learned from 10% of the bounding box
93
Effects of decreasing the
query bounding-box size
Baseline:
spatial verification +
full bounding box
To match the baseline performance, context QE needs only:
• 20% of the BB on the Paris dataset
• 40% of the BB on the Oxford dataset
Beyond Similarity Search
Retrieval with (geometric) constraints
95
Retrieval for Browsing
What is this? … and what is that?
Let’s query!
96
Retrieval for Browsing
Query 1
Query 2
Mikulik, Chum, Matas: Image Retrieval for Online Browsing in Large Image Collections, SISAP 2013.
97
New Problem Formulation
Retrieve relevant images subject to a constraint
• Geometric
– Maximize number of relevant pixels
– Maximize scale change
– Change of viewpoint
• Other
– High photometric change (day / night)
98
New Problem Formulation
Results
• Low rank in standard similarity measure
– Geometry for verification and constraint enforcement
– Geometry in the inverted file (DAAT)
• Standard similarity measure can be 0
– Matching through a path of images (query expansion)
100
Query Image
What is interesting here?
101
All Details on the Landmark
102
Highest Resolution Transform
Given a query and a dataset, for every pixel in the query image:
Find the database image with the maximum resolution depicting the pixel
37.3x 27.0x 22.8x 21.9x 21.6x
103
Highest Details
104
Level of Interest Transform
Given a query and a dataset, for every pixel in the query image:
Find the frequency with which it is photographed in detail
(figure legend: detail size 0–1 %, 1–3 %, 3–10 %)
FROM SINGLE IMAGE QUERY TO
DETAILED 3D RECONSTRUCTION
Retrieval and SfM
k-NN search often finds small connected components
107
Tight Coupling of Retrieval and SfM
Schoenberger, Radenovic, Chum, and Frahm:
From Single Image Query to Detailed 3D Reconstruction , CVPR’15
Beyond Nearest Neighbour
Looking around the corner
• Zoom out – getting a context of the image
• All details – getting transition to the object details
• Sidewise crawl
Some Results …
NEURAL NETWORKS
Retrieval with global descriptors
111
Efficient Search with Global Descriptors
Find this … … in a large collection of images
?
Mapping into a high-dimensional descriptor space R^k, k ~ 512 … 2048
Image similarity – distance in the descriptor space
112
Efficient Search with Global Descriptors
Find this … … in a large collection of images
descriptor space R^k
113
CNN Descriptors for Image Retrieval
Image (w × h × 3) → convolutional layers → activation tensor (W × H × K) → MAC layer: max pooling + L2-norm → MAC vector (K × 1)
w × h – image width and height
W × H – number of activations for feature map k ∈ {1 … K}
K – number of feature maps in the last convolutional layer
MAC – Maximum Activations of Convolutions
114
CNN Descriptors for Image Retrieval
Image (w × h × 3) → convolutional layers → activation tensor (W × H × K) → SPoC layer: sum pooling + L2-norm → SPoC vector (K × 1)
w × h – image width and height
W × H – number of activations for feature map k ∈ {1 … K}
K – number of feature maps in the last convolutional layer
SPoC – Sum-Pooled Convolutional features
115
CNN Descriptors for Image Retrieval
Image (w × h × 3) → convolutional layers → activation tensor (W × H × K) → GeM layer: GeM pooling + L2-norm → GeM vector (K × 1)
w × h – image width and height
W × H – number of activations for feature map k ∈ {1 … K}
K – number of feature maps in the last convolutional layer
GeM – Generalized Mean pooling
p = 1 → average pooling, p → ∞ → max pooling
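A PyTorch sketch of these pooling layers over a batch of activation tensors; p = 1 gives average pooling (SPoC up to the L2 normalisation) and p → ∞ approaches max pooling (MAC):

```python
import torch

def gem_pool(x, p=3.0, eps=1e-6):
    """Generalized-mean pooling of a (B, K, H, W) activation tensor to (B, K)."""
    return x.clamp(min=eps).pow(p).mean(dim=(-2, -1)).pow(1.0 / p)

def mac_pool(x):      # maximum activations of convolutions
    return x.amax(dim=(-2, -1))

def spoc_pool(x):     # sum/average pooling of convolutions
    return x.mean(dim=(-2, -1))

def l2n(v, eps=1e-6):
    # descriptors are L2-normalised before comparison
    return v / (v.norm(dim=-1, keepdim=True) + eps)
```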
116
Loss Functions
Contrastive loss:
L(i, j) = ½ [ Y(i, j) ‖f̄(i) − f̄(j)‖² + (1 − Y(i, j)) max(0, τ − ‖f̄(i) − f̄(j)‖)² ]
Y(i, j) = 1 for a POSITIVE (matching) pair, 0 for a NEGATIVE pair; f̄ denotes the L2-normalised descriptor.
(plot: loss as a function of the descriptor distance ‖f̄(i) − f̄(j)‖ for positive and negative pairs)
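A PyTorch sketch of the contrastive loss above, with f̄ denoting L2-normalised descriptors; the margin value is a typical choice, illustrative here:

```python
import torch

def contrastive_loss(fi, fj, y, tau=0.7):
    """fi, fj : (B, K) L2-normalised descriptors of a batch of pairs
    y        : (B,) 1 for positive (matching) pairs, 0 for negative pairs
    tau      : margin for negative pairs (illustrative value)
    """
    d = (fi - fj).norm(dim=1)                            # Euclidean distance per pair
    pos = y * d.pow(2)                                   # pull positives together
    neg = (1 - y) * torch.clamp(tau - d, min=0).pow(2)   # push negatives beyond the margin
    return 0.5 * (pos + neg).mean()
```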
117
Loss Functions
Triplet loss
Retrieval Challenges
Significant viewpoint and/or scale change
Significant illumination change
Severe occlusions
Visually similar but different objects
Retrieval Challenges
Significant viewpoint and/or scale change
Significant illumination change
Severe occlusions
Visually similar but different objects
Retrieval Challenges
Significant viewpoint and/or scale change
Significant illumination change
Severe occlusions
Visually similar but different objects
Retrieval Challenges
Significant viewpoint and/or scale change
Significant illumination change
Severe occlusions
Visually similar but different objects
“Lots of Training Examples”
Large Internet
photo collection
…
Convolutional Neural
Network (CNN)
Image annotations
Training
“Lots of Training Examples”
Large Internet
photo collection
…
Convolutional Neural
Network (CNN)
Image annotations – not accurate, expensive $$
Manual cleaning of the training data done by researchers – very expensive $$$$
Automated extraction of training data – very accurate, free $
• Image representation created from CNN activations
of a network pre-trained for classification task
[Gong et al. ECCV’14, Razavian et al. arXiv’14, Babenko et al.
ICCV’15, Kalantidis et al. arXiv’15, Tolias et al. ICLR’16]
+ Retrieval accuracy suggests generalization of CNNs
- Trained for image classification, NOT retrieval task
CNN Image Retrieval
• Image representation created from CNN activations
of a network pre-trained for classification task
[Gong et al. ECCV’14, Razavian et al. arXiv’14, Babenko et al.
ICCV’15, Kalantidis et al. arXiv’15, Tolias et al. ICLR’16]
+ Retrieval accuracy suggests generalization of CNNs
- Trained for image classification, NOT retrieval task
CNN Image Retrieval
Same Class
Image from ImageNet.org
CNN Image Retrieval
• CNN network re-trained using a dataset that contains
landmarks and buildings as object classes.
[Babenko et al. ECCV’14]
+ Training dataset closer to the target task
- Final metric different to the one actually optimized
- Constructing training datasets requires manual effort
CNN Image Retrieval
• CNN network re-trained using a dataset that contains
landmarks and buildings as object classes.
[Babenko et al. ECCV’14]
+ Training dataset closer to the target task
- Final metric different to the one actually optimized
- Constructing training datasets requires manual effort
Same Class
Image from [Babenko et al. ECCV’14]
CNN Image Retrieval
• NetVLAD: end-to-end fine-tuning for image retrieval.
Geo-tagged dataset for weakly supervised fine-tuning.
[Arandjelovic et al. CVPR’16]
+ Training dataset corresponds to the target task
+ Final metric corresponds to the one actually optimized
- Training dataset requires geo-tags
CNN Image Retrieval
• NetVLAD: end-to-end fine-tuning for image retrieval.
Geo-tagged dataset for weakly supervised fine-tuning.
[Arandjelovic et al. CVPR’16]
+ Training dataset corresponds to the target task
+ Final metric corresponds to the one actually optimized
- Training dataset requires geo-tags
query
Camera Orientation Unknown
unknown
CNN learns from BoW – Training Data
Input: Large unannotated dataset
1. Initial clusters created by grouping of spatially
related images [Chum & Matas PAMI’10]
2. Clustered images used as queries for a retrieval-
SfM pipeline [Schonberger et al. CVPR’15]
Output: Non-overlapping 3D models
551 3D models (134k images) for training / 162 (30k images) for validation
Camera Orientation Known
Number of Inliers Known
CNN learns from BoW – Positives
1. Descriptor distance: Image with the lowest global
descriptor distance is chosen (NetVLAD use this)
2. Maximum inliers: Image with the highest number of
co-observed 3D points with the query image is chosen
3. Relaxed inliers: Random image close to the query, with
enough inliers and not an extreme scale change is chosen
(figure: query image and positive candidates m1, m2, m3)
CNN learns from BoW – Negatives
K-nearest neighbors of the query image are selected from
all non-matching clusters, using different methods:
1. No constraint: chosen images often near identical.
2. At most one image per cluster: higher variability.
(figure: query image, its hardest negative, and further negatives N1, N2)
https://github.com/filipradenovic/cnnimageretrieval
Retrieval Challenges
Significant viewpoint and/or scale change
Significant illumination change
Severe occlusions
Visually similar but different objects
135
Night Query Image
136
Day – Night Retrieval
Day – Night training image pairs – sequences of images day – evening - night
Photometric normalization
137
Contrast Limited Adaptive Histogram Equalization
• Semi local (windows)
• Linear interpolation
• Only values more frequent than the clipping limit are redistributed
(figure: original image, global histogram equalization, CLAHE)
[Jenicek, Chum: No Fear of the Dark: Image Retrieval under Varying Illumination Conditions, ICCV 2019]
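A minimal OpenCV sketch of CLAHE-based photometric normalisation applied to the luminance channel; the clip limit and tile grid are illustrative defaults, not the settings from the paper:

```python
import cv2

def clahe_normalize(bgr, clip_limit=2.0, grid=(8, 8)):
    """Apply CLAHE to the luminance channel of a BGR image (sketch)."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=grid)
    l_eq = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
```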
138
Generating Synthetic Night Data (GAN)
139
Generating Synthetic Night Data (GAN)
140
Training with Synthetic Night Data
(plot: contrastive loss L(i, j) as a function of the descriptor distance ‖f̄(i) − f̄(j)‖ for positive and negative pairs)
141
Synthetic Night Data (GAN)
QUESTIONS
More Related Content

Similar to Large Scale Image Retrieval 2022.pdf

PCA-SIFT: A More Distinctive Representation for Local Image Descriptors
PCA-SIFT: A More Distinctive Representation for Local Image DescriptorsPCA-SIFT: A More Distinctive Representation for Local Image Descriptors
PCA-SIFT: A More Distinctive Representation for Local Image Descriptorswolf
 
3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving IIYu Huang
 
one shot15729752 Deep Learning for AI and DS
one shot15729752 Deep Learning for AI and DSone shot15729752 Deep Learning for AI and DS
one shot15729752 Deep Learning for AI and DSManiMaran230751
 
Visual geometry with deep learning
Visual geometry with deep learningVisual geometry with deep learning
Visual geometry with deep learningNAVER Engineering
 
Recent Progress on Object Detection_20170331
Recent Progress on Object Detection_20170331Recent Progress on Object Detection_20170331
Recent Progress on Object Detection_20170331Jihong Kang
 
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdfmokamojah
 
3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IVYu Huang
 
3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image IIIYu Huang
 
Video Stitching using Improved RANSAC and SIFT
Video Stitching using Improved RANSAC and SIFTVideo Stitching using Improved RANSAC and SIFT
Video Stitching using Improved RANSAC and SIFTIRJET Journal
 
Semi-supervised concept detection by learning the structure of similarity graphs
Semi-supervised concept detection by learning the structure of similarity graphsSemi-supervised concept detection by learning the structure of similarity graphs
Semi-supervised concept detection by learning the structure of similarity graphsSymeon Papadopoulos
 
Deep learning for 3 d point clouds presentation
Deep learning for 3 d point clouds presentationDeep learning for 3 d point clouds presentation
Deep learning for 3 d point clouds presentationVijaylaxmiNagurkar
 
An Assessment of Image Matching Algorithms in Depth Estimation
An Assessment of Image Matching Algorithms in Depth EstimationAn Assessment of Image Matching Algorithms in Depth Estimation
An Assessment of Image Matching Algorithms in Depth EstimationCSCJournals
 
3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image VYu Huang
 
cvpresentation-190812154654 (1).pptx
cvpresentation-190812154654 (1).pptxcvpresentation-190812154654 (1).pptx
cvpresentation-190812154654 (1).pptxPyariMohanJena
 
ppt 20BET1024.pptx
ppt 20BET1024.pptxppt 20BET1024.pptx
ppt 20BET1024.pptxManeetBali
 
fusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving Ifusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving IYu Huang
 

Similar to Large Scale Image Retrieval 2022.pdf (20)

PCA-SIFT: A More Distinctive Representation for Local Image Descriptors
PCA-SIFT: A More Distinctive Representation for Local Image DescriptorsPCA-SIFT: A More Distinctive Representation for Local Image Descriptors
PCA-SIFT: A More Distinctive Representation for Local Image Descriptors
 
3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II
 
one shot15729752 Deep Learning for AI and DS
one shot15729752 Deep Learning for AI and DSone shot15729752 Deep Learning for AI and DS
one shot15729752 Deep Learning for AI and DS
 
Visual geometry with deep learning
Visual geometry with deep learningVisual geometry with deep learning
Visual geometry with deep learning
 
Recent Progress on Object Detection_20170331
Recent Progress on Object Detection_20170331Recent Progress on Object Detection_20170331
Recent Progress on Object Detection_20170331
 
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
 
Image Retrieval (D4L5 2017 UPC Deep Learning for Computer Vision)
Image Retrieval (D4L5 2017 UPC Deep Learning for Computer Vision)Image Retrieval (D4L5 2017 UPC Deep Learning for Computer Vision)
Image Retrieval (D4L5 2017 UPC Deep Learning for Computer Vision)
 
3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV
 
3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III
 
V2 v posenet
V2 v posenetV2 v posenet
V2 v posenet
 
Video Stitching using Improved RANSAC and SIFT
Video Stitching using Improved RANSAC and SIFTVideo Stitching using Improved RANSAC and SIFT
Video Stitching using Improved RANSAC and SIFT
 
Semi-supervised concept detection by learning the structure of similarity graphs
Semi-supervised concept detection by learning the structure of similarity graphsSemi-supervised concept detection by learning the structure of similarity graphs
Semi-supervised concept detection by learning the structure of similarity graphs
 
Deep learning for 3 d point clouds presentation
Deep learning for 3 d point clouds presentationDeep learning for 3 d point clouds presentation
Deep learning for 3 d point clouds presentation
 
An Assessment of Image Matching Algorithms in Depth Estimation
An Assessment of Image Matching Algorithms in Depth EstimationAn Assessment of Image Matching Algorithms in Depth Estimation
An Assessment of Image Matching Algorithms in Depth Estimation
 
3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V
 
Kintinuous review
Kintinuous reviewKintinuous review
Kintinuous review
 
Content-based Image Retrieval - Eva Mohedano - UPC Barcelona 2018
Content-based Image Retrieval - Eva Mohedano - UPC Barcelona 2018Content-based Image Retrieval - Eva Mohedano - UPC Barcelona 2018
Content-based Image Retrieval - Eva Mohedano - UPC Barcelona 2018
 
cvpresentation-190812154654 (1).pptx
cvpresentation-190812154654 (1).pptxcvpresentation-190812154654 (1).pptx
cvpresentation-190812154654 (1).pptx
 
ppt 20BET1024.pptx
ppt 20BET1024.pptxppt 20BET1024.pptx
ppt 20BET1024.pptx
 
fusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving Ifusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving I
 

Recently uploaded

Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacingjaychoudhary37
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 

Recently uploaded (20)

Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacing
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 

Large Scale Image Retrieval 2022.pdf

  • 1. Large Scale Image Retrieval and Specific Object Search Ondra Chum Center for Machine Perception Czech Technical University in Prague
  • 2. Outline • The correspondence problem – Local features – Descriptors – Matching – Geometry • Retrieval with local features – Bag of Words – Geometry in image retrieval – Beyond visual nearest neighbour search • Image retrieval with CNNs – Efficient network training – Day / Night retrieval
  • 4. 4 The Problem Given a pair of images, find corresponding pixels YES ! Semantic correspondence NOT in this lecture Image stitching 3D reconstruction Augmented reality Localization / camera position
  • 5. 5 due to large viewpoint change (including scale) => the wide-baseline stereo problem Applications: - pose estimation - 3D reconstruction - location recognition Finding correspondences is not easy
  • 6. 6 Finding correspondences is not easy due to large viewpoint change (including scale) => the wide-baseline stereo problem
  • 7. 7 Applications: - location recognition - summarization of image collections Finding correspondences is not easy due to large viewpoint change (including scale) => the wide-baseline stereo problem
  • 8. 8 Applications: - historical reconstruction - location recognition - photographer recognition - camera type recognition MPV course 2022, CTU in Prague Finding correspondences is not easy due to large time difference => the temporal-baseline stereo problem
  • 9. Finding Correspondences is not easy 9 due to occlusion Applications: - pose estimation - inpainting
  • 11. 11 Local Features aka feature points, key points, anchor points, distinguished regions, … • Repeatable features • Feature descriptor: patch to a vector • Similar features have similar descriptors – nearest neighbour search • Retrieval – matching millions of images at the same time • Detect features in images independently, local = robust to occlusions
  • 12. 12 Local (Handcrafted) Features Simple idea – a distinguished feature should be different (at least) from all its immediate neighbourhoods
  • 13. 13 Corners Saddle points Blobs Local (Handcrafted) Features 1. Enumerate all regions / level sets 2. Compute responses / stability 3. Local Non-Maxima Suppression Regions Harris [Harris’88] Susan [Smith’97] FAST/ ORB [Rosten’06][Rublee’11] Hessian [Lindeberg’91] SADDLE [Aldana’16] Hessian DoG [Lowe’04] MSER [Matas’02] Tuytelaars Simple idea – a distinguished feature should be different (at least) from all its immediate neighbourhoods Commonly used for deep features
  • 14. 14 Deep Local Features DELF – classification loss, landmark labelled images [Noh, Araujo, Sim, Weyand, Han: Large-scale image retrieval with attentive deep local features. CVPR’17] HOW – contrastive loss, SfM Retrieval – 3D reconstruction, image level [Tolias, Jenicek, Chum: Learning and aggregating deep local descriptors for instance-level recognition ECCV’20] D2 net – point correspondence supervision from 3D [Dusmanu et al.: D2-net: A trainable CNN for joint detection and description of local features. CVPR’19] R2D2 – point correspondence supervision from optical flow [Revaud et.al., R2D2: Reliable and Repeatable Detector and Descriptor, NeurIPS 2019] SuperPoint – synthetic images, augmentations [DeTone, Malisiewicz, Rabinovich: SuperPoint: Self-supervised interest point detection and description, CVPRW’18] R2D2 – Revaud 2019 DELF – Noh 2017
  • 15. 15 Local Features from CNN Activations Simeoni, Avrithis, Chum: Local Features and Visual Words Emerge in Activations, CVPR 2019 Convolutional layers Activation tensor Activation channel (output of a detector) • Treat the activation channel as an input to handcrafted feature detector (MSER) • Use channel id as a descriptor (visual word) … …
  • 16. 16 Transformation Co-variant Local Features Scale invariance Responses over different scales
  • 17. 17 Transformation Co-variant Local Features Scale invariance Responses over different scales Affine invariance
  • 18. 18 Affine Shape with CNNs Mishkin, Radenović, Matas: Repeatability Is Not Enough: Learning Affine Regions via Discriminability, ECCV 2018 AffNet
  • 19. 19 Descriptors of Local Features Direct description of a measurement region: e.g. moments Local feature Measurement region
  • 20. 20 Descriptors of Local Features Local feature Measurement region Normalize region to a canonical form first Histogram of gradients (root) SIFT
  • 21. 21 Descriptors of Local Features Bin Fan Yurun Tian and Fuchao Wu. L2-Net: Deep learning of discriminative patch descriptor in euclidean space. CVPR 2017. Anastasiya Mishchuk, Dmytro Mishkin, Filip Radenovic, Jiri Matas: Working hard to know your neighbor's margins: Local descriptor learning loss, NIPS 2017
  • 23. Toy example for illustration: matching with OpenCV SIFT Try yourself: https://github.com/ducha-aiki/matching-strategies-comparison
  • 24. Toy example for illustration: matching with OpenCV SIFT Recovered 1st to 2nd image projection, ground truth 1st to 2nd image project, inlier correspondences
  • 25. Nearest neighbor (NN) strategy Features from img1 are matched to features from img2 You can see, that it is asymmetric and allowing “many-to-one” matches
  • 26. Nearest neighbor (NN) strategy OpenCV RANSAC failed to find a good model with NN matching Features from img1 are matched to features from img2
  • 27. Mutual nearest neighbor (MNN) strategy Features from img1 are matched to features from img2 Only cross-consistent (mutual NNs) matches are retained.
  • 28. Mutual nearest neighbor (MNN) strategy OpenCV RANSAC failed to find a good model with MNN matching No one-to-many connections, but still bad Features from img1 are matched to features from img2
  • 29. Feature space outlier rejection • How can we tell which putative matches are more reliable? • Heuristic: compare distance of the nearest neighbor to that of the second nearest neighbor – Ratio will be high for features that are not distinctive – Threshold of 0.8 provides good separation David Lowe. "Distinctive image features from scale-invariant keypoints.” IJCV 60 (2), pp. 91-110, 2004.
  • 30. Second nearest neighbor ratio (SNN) strategy 1stNN 2ndNN 2ndNN 1stNN 2ndNN 1stNN 1stNN / 2ndNN > 0.8, drop 1stNN / 2ndNN < 0.8, keep - we look for 2 nearest neighbors - If both are too similar (1stNN/2ndNN ratio > 0.8) → discard - If 1st NN is much closer (1stNN/2ndNN ratio ≤ 0.8) → keep Features from img1 are matched to features from img2
  • 31. Second nearest neighbor ratio (SNN) strategy 1stNN 2ndNN 2ndNN 1stNN 1stNN / 2ndNN < 0.8, keep OpenCV RANSAC found a model roughly correct
  • 32. 1st geometrically inconsistent nearest neighbor ratio (FGINN) strategy 32 MPV course 2022, CTU in Prague SNN ratio is good, but what about symmetrical, or too closely detected features? Ratio test will kill them. Solution: look for 2nd nearest neighbor, which is spatially far enough from 1st nearest. Mishkin et al.,“MODS: Fast and Robust Method for Two-View Matching”, CVIU 2015
  • 33. SNN vs FGINN Mishkin et al., “MODS: Fast and Robust Method for Two-View Matching”, CVIU 2015 SNN: roughly correct FGINN: more correspondences, better geometry found
  • 34. 34 Idea: verify a tentative match “+“ by comparing neighboring features [Schmid and Mohr: Local Greyvalue Invariants for Image Retrieval. PAMI 1997] + + + + + + + + + + + + matching features Local Geometric Constraints image 1 image 2
  • 35. 35 Cosegmentation / Seed Growing Start from a seed – a signle strong match and try to locally “grow” the match - at pixel or feature level [Ferrari, Tuytelaars,Van Gool, ECCV 2004] [Cech, Matas, Perdoch CVPR 08] [Cavalli, Larsson, Oswald, Sattler, Pollefeys: AdaLAM, ECCV’20] Seeds – semantic objects Benbihi, Pradalier and Chum: Object-Guided Day-Night Visual Localization in Urban Scenes, ICPR’22
  • 36. 36 Learned Matching [Sarlin, DeTone, Malisiewicz, Rabinovich. SuperGlue: Learning feature matching with graph neural networks. CVPR’20] figures from Sarlin CVPR’20
  • 37. Global Geometry • Voting in the parameter space • RANSAC
  • 38. 38 Robust Estimation: Hough vs. RANSAC Voting: • discretized parameter space • votes for parameters consistent with the measurements • more votes higher support + multiple models + can be very fast - memory demanding - distances measured in the parameter space RANSAC: • hypothesize and verify loop - randomized (unless you try it all) - typically slower than voting + no extra memory required + measures distances in pixels!
  • 40. 40 Fitting a Line Least squares fit
  • 41. 41 RANSAC • Select sample of m points at random
  • 42. 42 RANSAC • Select sample of m points at random • Calculate model parameters that fit the data in the sample
  • 43. 43 RANSAC • Select sample of m points at random • Calculate model parameters that fit the data in the sample • Calculate error function for each data point
  • 44. 44 RANSAC • Select sample of m points at random • Calculate model parameters that fit the data in the sample • Calculate error function for each data point • Select data that support current hypothesis
  • 45. 45 RANSAC • Select sample of m points at random • Calculate model parameters that fit the data in the sample • Calculate error function for each data point • Select data that support current hypothesis • Repeat sampling
  • 46. 46 RANSAC • Select sample of m points at random • Calculate model parameters that fit the data in the sample • Calculate error function for each data point • Select data that support current hypothesis • Repeat sampling
  • 47. 47 RANSAC • Select sample of m points at random • Calculate model parameters that fit the data in the sample • Calculate error function for each data point • Select data that support current hypothesis • Repeat sampling
  • 48. 48 RANSAC k … number of samples drawn m … minimal sample size N … number of data points I … time to compute a single model p … confidence in the solution (.95) log (1- ) log(1 – p) I m Nm k =
  • 49. 49 How Many Samples I / N [%] Size of the sample m
  • 50. 50 RANSAC [Fischler, Bolles ’81] In: U = {xi} set of data points, |U| = N function f computes model parameters p given a sample S from U the cost function for a single data point x Out: p* p*, parameters of the model maximizing the cost function k := 0 Repeat until P{better solution exists} < η (a function of C* and no. of steps k) k := k + 1 I. Hypothesis (1) select randomly set , sample size (2) compute parameters II. Verification (3) compute cost (4) if C* < Ck then C* := Ck, p* := pk end
  • 51. 51 Advanced RANSAC In: U = {xi} set of data points, |U| = N function f computes model parameters p given a sample S from U the cost function for a single data point x Out: p* p*, parameters of the model maximizing the cost function k := 0 Repeat until P{better solution exists} < η (a function of C* and no. of steps k) k := k + 1 I. Hypothesis (1) select randomly set , sample size (2) compute parameters II. Verification (3) compute cost (4) if C* < Ck then C* := Ck, p* := pk end Non-uniform sampling Error scale estimation Potential degeneracy tests Randomized verification Preemptive scoring Improving precision
  • 52. 52 *SAC RANSAC [Fischler’81], MLESAC [Torr’00], R-RANSAC [Chum’02], NAPSAC [Myatt’02], Guided MLESAC [Tordoff’02], LO-RANSAC [Chum’03], Preemtive RANSAC [Nister’03], PROSAC [Chum’05], RANSAC with bail-out [Capel’05], DegenSAC [Chum’05], WaldSAC [Matas’05], QDEGSAC [Frahm‘06], GASAC [Rodehorst’06], ARRSAC [Raguram’08] GroupSAC [Ni’09], Cov-RANSAC [Raguram’09], … Lebeda, Matas, and Chum: Fixing the Locally Optimized RANSAC, BMVC 2012 images, data, executables: http://cmp.felk.cvut.cz/software/LO-RANSAC/index.xhtml Raguram, Chum, Pollefeys, Matas, Frahm: “USAC: A Universal Framework for Random Sample Consensus”, PAMI 2013 code, data: http://cs.unc.edu/~rraguram/usac/ Barath, Matas / Barath, Matas, Noskova : “Graph-Cut RANSAC”, CVPR 2017 / “MAGSAC”, CVPR 2019 code, data: http://github.com/danini/graph-cut-ransac http://github.com/danini/magsac
  • 54. 54 Image Retrieval Find this … … in a large (millions+) collection of images ? • Find images of the same object • What is this? Nearest neighbor classifier • Where is this? Visual localization • How did this look in the past? • Is there anything interesting here?
  • 55. 55 Retrieval Quality – Average Precision (AP): the area under the precision–recall curve. precision = #relevant returned / #returned, recall = #relevant returned / #total relevant. Example: query against a database of 10 images with 5 relevant images in total, results returned in ranked order
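As an illustrative sketch (not the exact protocol of any particular benchmark), AP can be computed from the ranked list of relevance flags as the mean of the precision values at the ranks of the relevant results:

    def average_precision(ranked_relevant, total_relevant):
        # ranked_relevant: list of 0/1 flags in ranking order
        hits, ap = 0, 0.0
        for rank, rel in enumerate(ranked_relevant, start=1):
            if rel:
                hits += 1
                ap += hits / rank      # precision at this recall level
        return ap / total_relevant

    # 10 ranked results, 5 relevant images in the whole database
    print(average_precision([1, 1, 0, 1, 0, 0, 1, 0, 0, 1], total_relevant=5))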
  • 56. 56 Feature Based Retrieval • Affine invariant features • Efficient descriptors • Corresponding regions in images have similar descriptors – measured by some distance in the feature space • Images of the same object have many correspondences in common
  • 57. 57 Video Google • Feature detection and description • Vector quantization • Bag of Words representation • Scoring • Verification Sivic & Zisserman – ICCV 2003 Video Google: A Text Retrieval Approach to Object Matching in Videos
  • 59. 59 Feature Distance Approximation Partition the feature space (k – means clustering) Feature distance 0 : features in the same cell ∞ : features in different cells + most of the features are not considered (infinitely distant) + near-by descriptors accessible instantly – storing a list of features for each cell
  • 60. 60 Feature Distance Approximation - quantization effects - large (even unbounded) cells Feature distance 0 : features in the same cell ∞ : features in different cells
  • 61. 61 Vector Quantization via k-Means Initialize cluster centres Find nearest cluster to each datapoint (slow) O(N k) Re-compute cluster centres as centroids Iterate
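A minimal numpy sketch of the Lloyd iteration described above (exact nearest-centre assignment, hence the O(N k) cost per iteration):

    import numpy as np

    def kmeans(X, k, iters=20, seed=0):
        # X: (N, D) descriptors; returns (k, D) cluster centres
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), size=k, replace=False)]      # initialize
        for _ in range(iters):
            # find the nearest centre for each data point (the slow O(N k) step)
            d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
            assign = d.argmin(axis=1)
            # re-compute cluster centres as centroids of their assigned points
            for c in range(k):
                if (assign == c).any():
                    centres[c] = X[assign == c].mean(axis=0)
        return centres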
  • 62. 62 Bags of Words Image Representation A C D B A C D B 1 0 0 2 0 3 0 1 Images … Visual vocabulary Images are represented by a sparse vector / histogram of the visual words present in them Term-frequency (tf) – visual word D appears twice in the image
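A tiny sketch of the sparse tf histogram, assuming the local descriptors have already been quantized to visual-word ids:

    from collections import Counter

    def bow_histogram(word_ids):
        # word_ids: visual-word id of every local feature in one image
        # returns a sparse {word_id: term frequency (tf)} mapping
        return dict(Counter(word_ids))

    # an image containing words A=0, C=2 and D=3, with D appearing twice
    print(bow_histogram([3, 0, 2, 3]))      # {3: 2, 0: 1, 2: 1}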
  • 64. 64 Efficient Scoring – bag of words representation (up to 1,000,000 D); the query BoW vector over the vocabulary (A, C, D, B, …) is compared against the database BoW vectors, and the score s_i of database image i is the dot product of the weighted query vector (α_q) and the weighted database vector (α_i)
  • 65. 65 BoW and Inverted File – the inverted file stores, for each visual word of the vocabulary (A, B, C, D, …), the list of database images (1 … 10) in which that word occurs
  • 66. 66 BoW and Inverted File – at query time only the lists of the query's visual words are traversed; every database image appearing in those lists collects votes
  • 67. 67 1 2 3 4 5 6 7 8 9 10 BoW and Inverted File Efficient (fast) Linear complexity (in # documents) Can be interpreted as voting
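A sketch of inverted-file scoring as voting; tf-idf weighting and vector normalization are omitted here to keep the example short:

    from collections import defaultdict

    def build_inverted_file(db_histograms):
        # db_histograms: one {word_id: tf} dict per database image
        inv = defaultdict(list)
        for img_id, hist in enumerate(db_histograms):
            for word, tf in hist.items():
                inv[word].append((img_id, tf))
        return inv

    def score(query_hist, inv):
        # only images sharing at least one visual word with the query get votes
        scores = defaultdict(float)
        for word, q_tf in query_hist.items():
            for img_id, tf in inv.get(word, []):
                scores[img_id] += q_tf * tf      # dot product of sparse BoW vectors
        return sorted(scores.items(), key=lambda s: -s[1])

    inv = build_inverted_file([{0: 1, 3: 2}, {1: 2, 3: 1}, {2: 1}])
    print(score({3: 1, 2: 1}, inv))              # [(0, 2.0), (1, 1.0), (2, 1.0)]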
  • 68. 69 Geometric Verification and Re-ranking Query Results reject verify localize Philbin, Chum, Isard, Sivic, Zisserman: Object retrieval with large vocabularies and fast spatial matching, CVPR’07
  • 69. Visual Words and Vector Quantization
  • 70. 71 Vector Quantization • k-means • Fixed quantization [Tuytelaars and Schmid ICCV 2007] • Agglomerative [Leibe, Mikolajczyk and Schiele BMVC 2006] • Hierarchical k-means • Approximate k-means • Hamming embedding • Learning fine vocabularies
  • 71. 72 Hierarchical k-means + fast O(N log k) + incremental construction - not so good quantization - often imbalanced Nistér & Stewénius: Scalable recognition with a vocabulary tree. CVPR 2006
  • 72. 73 Approximate k-means + fast O(N log k) + reasonable quantization - Can be inconsistent when ANN fails Philbin, Chum, Isard, Sivic, and Zisserman – CVPR 2007 Object retrieval with large vocabularies and fast spatial matching Initialize cluster centres Find approximate nearest cluster to each datapoint Re-compute cluster centres as centroids Iterate
  • 73. 74 Hamming Embedding + good quantization + elegant idea - huge memory footprint Each descriptor within a cell is encoded by a short binary signature obtained by random projections; within-cell distances are measured as the Hamming distance between signatures Jegou, Douze, and Schmid – ECCV 2008 Hamming embedding and weak geometric consistency for large scale image search
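A rough numpy sketch of the binary-signature idea: project the descriptor with a random matrix and threshold each component, then compare signatures by Hamming distance. The dimensions, projection matrix and per-cell thresholds below are illustrative placeholders, not the learned quantities from the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    D, B = 128, 64                           # descriptor dimension, signature bits
    P = rng.standard_normal((B, D))          # random projection matrix

    def hamming_signature(desc, cell_thresholds):
        # binarize the projected descriptor against per-cell thresholds (e.g. medians)
        return (P @ desc > cell_thresholds).astype(np.uint8)

    def hamming_distance(sig_a, sig_b):
        return int(np.count_nonzero(sig_a != sig_b))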
  • 74. 75 Soft Assignment (Approximate) k-means - database side - query side Hierarchical k-means Philbin, Chum, Isard, Sivic, and Zisserman – CVPR 2008 Lost in Quantization Nistér & Stewénius – CVPR 2006 Scalable recognition with a vocabulary tree
  • 75. 76 Learning Fine Vocabularies Fine vocabulary (16 million visual words) Using wide-baseline stereo matches on 6 million images to learn what is similar Mikulik, Perdoch, Chum, and Matas: Learning a Fine Vocabulary, ECCV 2010
  • 76. 77 Appearance Variance of a Single Feature Mikulik, Perdoch, Chum, Matas: Learning Vocabularies over a Fine Quantization, IJCV 2012 • over 5 million images • almost 20k clusters of 750k images (visual word based) • 733k successfully matched in WBS matching (raw descriptor based) • over 111 M feature tracks established (12.3 M with 6+ features) • 564 M features in the tracks (319.5 M in tracks of 6+ features) http://cmp.felk.cvut.cz/~qqmikula/publications/ijcv2012/index.html
  • 77. 78 Short Codes – (Joint) Dimensionality Reduction Jegou & Chum: Negative evidences and co-occurrences in image retrieval: the benefit of PCA and whitening, ECCV 2012 Radenovic, Jegou & Chum: Multiple Measurements and Joint Dimensionality Reduction for Large Scale Image Search with Short Vectors ICMR 2015
  • 78. 79 Aggregating Local Descriptors A C D B VLAD descriptor [Jégou, Douze, Schmid and Pérez, CVPR’10] Fisher Kernel approach [Perronnin and Dance, CVPR’07] often combined with dimensionality reduction by PCA – short codes • High discriminability needed • BoW achieves it by increasing the number of visual words • only the assignments are recorded Idea: use higher order statistics • small vocabulary (fast assignment) • dense vectors (ANN search) • high discriminability
  • 79. 80 Aggregating Local Descriptors A C D B VLAD descriptor [Jégou, Douze, Schmid and Pérez, CVPR’10] 1. compute assignments 2. compute difference to means 3. sum differences per visual word
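A numpy sketch of those three steps (hard assignment to the nearest centre, residuals to the centre, per-word sum), with the usual L2 normalization of the final vector added:

    import numpy as np

    def vlad(descriptors, centres):
        # descriptors: (N, D), centres: (k, D); returns a (k*D,) VLAD vector
        d = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)                        # 1. compute assignments
        v = np.zeros_like(centres, dtype=float)
        for i, c in enumerate(assign):
            v[c] += descriptors[i] - centres[c]          # 2. + 3. sum residuals per word
        v = v.reshape(-1)
        return v / (np.linalg.norm(v) + 1e-12)           # L2 normalization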
  • 80. 81 • Fit a GMM to training data (SIFT) • diagonal covariance matrix • whitened data • Image represented as a sum (over image features) of gradients of log-likelihood • fixed size representation (#parameters) Aggregating Local Descriptors A C D B Fisher Kernel approach [Perronnin and Dance, CVPR’07] Intuition: the direction in which the parameters λ of the general model should be modified to better fit the specific sample (the current image data).
  • 82. 84 Query Expansion … Query image Results New query Spatial verification New results Chum, Philbin, Sivic, Isard, Zisserman: Total Recall…, ICCV 2007
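A sketch of the average query expansion idea: average the query with the spatially verified results and issue the new query. The original work expands BoW vectors of the verified regions; the version below uses plain global-style vectors for brevity, and the spatial verification step is assumed to be done elsewhere:

    import numpy as np

    def average_query_expansion(query_vec, db_vecs, verified_ids):
        # query_vec: (D,), db_vecs: (M, D), both L2-normalized
        # verified_ids: indices of spatially verified results from the first round
        expanded = query_vec + db_vecs[list(verified_ids)].sum(axis=0)
        expanded = expanded / (np.linalg.norm(expanded) + 1e-12)    # new query
        scores = db_vecs @ expanded                                 # re-query
        return np.argsort(-scores)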
  • 83. 85 Query Expansion Step by Step Query Image Retrieved image Originally not retrieved
  • 86. 88 Query Expansion Results Query image Expanded results (better) Original results (good)
  • 87. 89 Context expansion • the model of the object is grown beyond the boundaries of the initial query, • a feature added into the model that is not inside the context is inactive until confirmed by feature(s) from another image with the same visual word and similar geometry. • Once a feature is confirmed, it adds the neighbourhood around its center to the context. Chum, Mikulik, Perdoch, Matas: Total Recall II: Query Expansion Revisited, CVPR 2011
  • 89. 91 Learning the Context Feature patches back-projected into the context from spatially verified images (figure: the query, then the context after 2, 5, 10 and 20 images)
  • 90. 92 How Much Do We Need to See? Oxford landmarks – 3 queries 100%, 50%, and 10% of the query bounding box Context learned from the full bounding box Context learned from 50% of the bounding box Context learned from 10% of the bounding box
  • 91. 93 Effects of decreasing the query bounding-box size Baseline: spatial verification + full bounding box Context QE at the baseline performance needs only: • 20% of the BB on the Paris dataset • 40% of the BB on the Oxford dataset
  • 92. Beyond Similarity Search Retrieval with (geometric) constraints
  • 93. 95 Retrieval for Browsing What is this? … and what is that? Let’s query!
  • 94. 96 Retrieval for Browsing Query 1 Query 2 Mikulik, Chum, Matas: Image Retrieval for Online Browsing in Large Image Collections, SISAP 2013.
  • 95. 97 New Problem Formulation Retrieve relevant images subject to a constraint • Geometric – Maximize number of relevant pixels – Maximize scale change – Change of viewpoint • Other – High photometric change (day / night)
  • 96. 98 New Problem Formulation Results • Low rank in standard similarity measure – Geometry for verification and constraint enforcement – Geometry in the inverted file (DAAT – document-at-a-time) • Standard similarity measure can be 0 – Matching through a path of images (query expansion)
  • 97. 100 Query Image What is interesting here?
  • 98. 101 All Details on the Landmark
  • 99. 102 Highest Resolution Transform Given a query and a dataset, for every pixel in the query image: Find the database image with the maximum resolution depicting the pixel (example resolution gains: 37.3x, 27.0x, 22.8x, 21.9x, 21.6x)
  • 101. 104 Level of Interest Transform Given a query and a dataset, for every pixel in the query image: Find the frequency with which it is photographed in detail (legend: detail size 0–1 %, 1–3 %, 3–10 %)
  • 102. FROM SINGLE IMAGE QUERY TO DETAILED 3D RECONSTRUCTION
  • 103. Retrieval and SfM k-NN search often finds small connected components
  • 104. 107 Tight Coupling of Retrieval and SfM Schoenberger, Radenovic, Chum, and Frahm: From Single Image Query to Detailed 3D Reconstruction , CVPR’15
  • 105. Beyond Nearest Neighbour Looking around the corner • Zoom out – getting a context of the image • All details – getting transition to the object details • Sidewise crawl
  • 107. NEURAL NETWORKS Retrieval with global descriptors
  • 108. 111 Efficient Search with Global Descriptors Find this … … in a large collection of images ? Mapping into a high-dimensional space, k ~ 512 … 2048 Image similarity – distance in the descriptor space R^k
  • 109. 112 Efficient Search with Global Descriptors Find this … … in a large collection of images descriptor space Rk
  • 110. 113 CNN Descriptors for Image Retrieval Image (w × h × 3) → Convolutional Layers → activation tensor (W × H × K) → MAC Layer: max pooling + L2-norm → MAC descriptor (K × 1) w × h – image width and height W × H – number of activations for feature map k ∈ {1 … K} K – number of feature maps in the last convolutional layer MAC – Maximum Activations of Convolutions
  • 111. 114 CNN Descriptors for Image Retrieval Image (w × h × 3) → Convolutional Layers → activation tensor (W × H × K) → SPoC Layer: sum pooling + L2-norm → SPoC descriptor (K × 1) w × h – image width and height W × H – number of activations for feature map k ∈ {1 … K} K – number of feature maps in the last convolutional layer SPoC – Sum-Pooled Convolutional features
  • 112. 115 CNN Descriptors for Image Retrieval Image (w × h × 3) → Convolutional Layers → activation tensor (W × H × K) → GeM Layer: GeM pooling + L2-norm → GeM descriptor (K × 1) w × h – image width and height W × H – number of activations for feature map k ∈ {1 … K} K – number of feature maps in the last convolutional layer GeM – Generalized Mean: p = 1 gives average pooling, p → ∞ gives max pooling
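A small numpy sketch of the three pooling variants over a W × H × K activation tensor; in practice they are applied to the last convolutional feature map of the network:

    import numpy as np

    def l2n(v, eps=1e-12):
        return v / (np.linalg.norm(v) + eps)

    def mac(acts):                        # acts: (W, H, K), post-ReLU activations
        return l2n(acts.max(axis=(0, 1)))             # max pooling per feature map

    def spoc(acts):
        return l2n(acts.sum(axis=(0, 1)))             # sum pooling per feature map

    def gem(acts, p=3.0, eps=1e-6):
        # generalized mean: p = 1 gives average pooling, p -> inf approaches max pooling
        return l2n((np.clip(acts, eps, None) ** p).mean(axis=(0, 1)) ** (1.0 / p))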
  • 113. 116 Loss Functions Contrastive loss: L(i, j) = ½ Y(i, j) ‖f̄(i) − f̄(j)‖² + ½ (1 − Y(i, j)) max(0, τ − ‖f̄(i) − f̄(j)‖)², where Y(i, j) = 1 for a positive pair and 0 for a negative pair, f̄ is the L2-normalized descriptor and τ is the margin
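A direct transcription of the loss into numpy (the descriptors are assumed to be L2-normalized; the margin value below is only an example):

    import numpy as np

    def contrastive_loss(f_i, f_j, y, tau=0.7):
        # f_i, f_j: L2-normalized descriptors; y = 1 for a positive pair, 0 for negative
        d = np.linalg.norm(f_i - f_j)
        return 0.5 * (y * d ** 2 + (1 - y) * max(0.0, tau - d) ** 2)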
  • 115. Retrieval Challenges Significant viewpoint and/or scale change Significant illumination change Severe occlusions Visually similar but different objects
  • 119. “Lots of Training Examples” Large Internet photo collection … Convolutional Neural Network (CNN) Image annotations Training
  • 120. “Lots of Training Examples” Large Internet photo collection … Convolutional Neural Network (CNN) Image annotations: not accurate, expensive $$ Manual cleaning of the training data done by researchers: very expensive $$$$ Automated extraction of training data: very accurate, free $
  • 121. • Image representation created from CNN activations of a network pre-trained for classification task [Gong et al. ECCV’14, Razavian et al. arXiv’14, Babenko et al. ICCV’15, Kalantidis et al. arXiv’15, Tolias et al. ICLR’16] + Retrieval accuracy suggests generalization of CNNs - Trained for image classification, NOT retrieval task CNN Image Retrieval
  • 122. • Image representation created from CNN activations of a network pre-trained for classification task [Gong et al. ECCV’14, Razavian et al. arXiv’14, Babenko et al. ICCV’15, Kalantidis et al. arXiv’15, Tolias et al. ICLR’16] + Retrieval accuracy suggests generalization of CNNs - Trained for image classification, NOT retrieval task CNN Image Retrieval Same Class Image from ImageNet.org
  • 123. CNN Image Retrieval • CNN network re-trained using a dataset that contains landmarks and buildings as object classes. [Babenko et al. ECCV’14] + Training dataset closer to the target task - Final metric different to the one actually optimized - Constructing training datasets requires manual effort
  • 124. CNN Image Retrieval • CNN network re-trained using a dataset that contains landmarks and buildings as object classes. [Babenko et al. ECCV’14] + Training dataset closer to the target task - Final metric different to the one actually optimized - Constructing training datasets requires manual effort Same Class Image from [Babenko et al. ECCV’14]
  • 125. CNN Image Retrieval • NetVLAD: end-to-end fine-tuning for image retrieval. Geo-tagged dataset for weakly supervised fine-tuning. [Arandjelovic et al. CVPR’16] + Training dataset corresponds to the target task + Final metric corresponds to the one actually optimized - Training dataset requires geo-tags
  • 126. CNN Image Retrieval • NetVLAD: end-to-end fine-tuning for image retrieval. Geo-tagged dataset for weakly supervised fine-tuning. [Arandjelovic et al. CVPR’16] + Training dataset corresponds to the target task + Final metric corresponds to the one actually optimized - Training dataset requires geo-tags; camera orientation of the query is unknown
  • 127. CNN learns from BoW – Training Data Input: Large unannotated dataset 1. Initial clusters created by grouping of spatially related images [Chum & Matas PAMI’10] 2. Clustered images used as queries for a retrieval–SfM pipeline [Schonberger et al. CVPR’15] Output: Non-overlapping 3D models, 551 training / 162 validation models (134k / 30k images) Camera orientation known, number of inliers known
  • 128. CNN learns from BoW – Positives 1. Descriptor distance: Image with the lowest global descriptor distance is chosen (NetVLAD uses this) 2. Maximum inliers: Image with the highest number of co-observed 3D points with the query image is chosen 3. Relaxed inliers: Random image close to the query, with enough inliers and not an extreme scale change, is chosen (figure: query and positive candidates m1, m2, m3)
  • 129. CNN learns from BoW – Negatives K-nearest neighbors of the query image are selected from all non-matching clusters, using different methods: 1. No constraint: chosen images often near identical. 2. At most one image per cluster: higher variability. (figure: query, hardest negative, N1, N2)
  • 131. Retrieval Challenges Significant viewpoint and/or scale change Significant illumination change Severe occlusions Visually similar but different objects
  • 133. 136 Day – Night Retrieval Day–night training image pairs from sequences of images (day – evening – night) Photometric normalization
  • 134. 137 Contrast Limited Adaptive Histogram Equalization (CLAHE) • Semi-local (windows) • Linear interpolation • Only values more frequent than the clipping limit are redistributed (figure panels: Original, Histogram Equalization (global), CLAHE) [Jenicek, Chum: No Fear of the Dark: Image Retrieval under Varying Illumination Conditions, ICCV 2019]
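OpenCV provides a CLAHE implementation; a minimal usage sketch (the file name, clip limit and tile grid are example values only):

    import cv2

    img = cv2.imread('night.jpg', cv2.IMREAD_GRAYSCALE)           # example input
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    normalized = clahe.apply(img)                                  # photometric normalization
    cv2.imwrite('night_clahe.jpg', normalized)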
  • 137. 140 Training with Synthetic Night Data – the same contrastive loss on positive and negative pairs as above (the figure repeats the loss diagram)