Lec-07: Feature Aggregation and Image Retrieval System [notes]
Image retrieval system performance metrics, precision, recall, true positive rate, false positive rate; Bag of Words (BoW) and VLAD aggregation.
3. Scale Space Theory - Lindeberg
Scale Space Response via Laplacian of Gaussian
The scale is controlled by 𝜎
Characteristic Scale:
Image Analysis & Retrieval, 2016 p.3
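$\nabla^2 g = \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2}$
$g = e^{-\frac{x^2 + y^2}{2\sigma^2}}$
[Figure: LoG response on an image structure of radius r for 𝜎 = 0.8r, 𝜎 = 1.2r, 𝜎 = 2r, …, with the characteristic scale marked]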
4. SIFT
Use DoG to approximate LoG
Separable Gaussian filter
Difference of image instead of difference of Gaussian kernel
[Figure: scale space construction by Gaussian filtering and image difference, approximating LoG]
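Taking the difference of two blurred images instead of filtering with a difference-of-Gaussians kernel works because convolution is linear. A minimal 1-D sketch in plain Python; the test signal, the two sigma values, the kernel radius and all helper names are illustrative assumptions, not values from the lecture:

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel sampled on [-radius, radius]."""
    w = [math.exp(-(i * i) / (2.0 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    s = sum(w)
    return [v / s for v in w]

def convolve(signal, kernel):
    """Same-size 1-D convolution with zero padding (odd kernel length)."""
    r = len(kernel) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for j, kv in enumerate(kernel):
            idx = n + r - j
            if 0 <= idx < len(signal):
                acc += signal[idx] * kv
        out.append(acc)
    return out

# A step-like "blob" as the test signal (assumed example data).
signal = [0.0] * 10 + [1.0] * 5 + [0.0] * 10
k1 = gaussian_kernel(1.0, 6)
k2 = gaussian_kernel(1.6, 6)

# Route 1: blur twice, then subtract the two blurred signals.
blur1 = convolve(signal, k1)
blur2 = convolve(signal, k2)
dog_from_images = [b2 - b1 for b1, b2 in zip(blur1, blur2)]

# Route 2: build the DoG kernel first, then filter once.
dog_kernel = [b - a for a, b in zip(k1, k2)]
dog_from_kernel = convolve(signal, dog_kernel)

max_diff = max(abs(a - b) for a, b in zip(dog_from_images, dog_from_kernel))
print("max difference between the two routes:", max_diff)
```

The two routes agree up to floating-point rounding. In a scale-space pyramid the blurred images are needed anyway, so the subtraction comes almost for free.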
5. Peak Strength & Edge Removal
Peak Strength:
Interpolate true DoG response and pixel location by Taylor
expansion
Edge Removal:
Re-run a Harris-type detection on the much-reduced candidate pixel set to remove edge responses
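The notes do not write out the expansion. In the standard SIFT formulation (Lowe, 2004), which is assumed here, the DoG function $D$ is expanded around a candidate extremum with offset $\mathbf{x} = (x, y, \sigma)^T$, and edges are rejected by a ratio test on the 2x2 spatial Hessian $H$ of $D$, with $r$ a chosen curvature-ratio threshold:

```latex
D(\mathbf{x}) \approx D + \frac{\partial D}{\partial \mathbf{x}}^{T} \mathbf{x}
  + \frac{1}{2}\, \mathbf{x}^{T} \frac{\partial^2 D}{\partial \mathbf{x}^2}\, \mathbf{x},
\qquad
\hat{\mathbf{x}} = -\left(\frac{\partial^2 D}{\partial \mathbf{x}^2}\right)^{-1}
  \frac{\partial D}{\partial \mathbf{x}},
\qquad
D(\hat{\mathbf{x}}) = D + \frac{1}{2}\, \frac{\partial D}{\partial \mathbf{x}}^{T} \hat{\mathbf{x}}

\frac{\operatorname{Tr}(H)^2}{\operatorname{Det}(H)} < \frac{(r+1)^2}{r}
```

Here $\hat{\mathbf{x}}$ is the interpolated location and $D(\hat{\mathbf{x}})$ the interpolated peak strength; a candidate that fails the ratio test is treated as an edge response and dropped.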
6. Rotation Invariance through Dominant Orientation Coding
Voting for the dominant orientation
Weighted by a Gaussian window to give more emphasis to the
gradients closer to the center
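The voting scheme can be sketched as a gradient-orientation histogram in which each pixel votes with its gradient magnitude times a Gaussian weight centered on the patch. This is only an illustration in plain Python: the 36-bin histogram, the central-difference gradient and the function name `dominant_orientation` are assumptions, not details given in these notes.

```python
import math

def dominant_orientation(patch, sigma, n_bins=36):
    """Return (center angle in degrees of the strongest bin, histogram)."""
    h, w = len(patch), len(patch[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    hist = [0.0] * n_bins
    bin_width = 360.0 / n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central-difference gradient.
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(dx, dy)
            ang = math.degrees(math.atan2(dy, dx)) % 360.0
            # Gaussian window: gradients near the center count more.
            weight = math.exp(-((x - cx) ** 2 + (y - cy) ** 2)
                              / (2.0 * sigma ** 2))
            hist[int(ang // bin_width) % n_bins] += weight * mag
    best = max(range(n_bins), key=lambda b: hist[b])
    return (best + 0.5) * bin_width, hist

# Horizontal intensity ramp: every gradient points along +x (angle 0).
ramp = [[float(x) for x in range(9)] for _ in range(9)]
angle, hist = dominant_orientation(ramp, sigma=2.0)
print("dominant orientation bin center (degrees):", angle)
```

Rotating the descriptor frame to this dominant angle is what makes the descriptor comparable across rotated views.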
7. SIFT Matching and Repeatability Prediction
SIFT Distance
Not all SIFT are created equal…
Peak strength (DoG response at interpolated position)
Combined scale/peak strength pmf
$\frac{d(s^1_1, s^2_{k^*})}{d(s^1_1, s^2_k)} \le \theta$
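A minimal sketch of this distance-ratio criterion in plain Python, assuming that $k^*$ indexes the nearest descriptor in the second image and $k$ the second-nearest (the notes do not state this explicitly). The threshold value 0.8 and the function names are illustrative assumptions.

```python
import math

def l2(a, b):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ratio_test_match(query, candidates, theta=0.8):
    """Index of the nearest candidate if it passes the ratio test, else None.

    Requires at least two candidates.
    """
    dists = sorted((l2(query, c), i) for i, c in enumerate(candidates))
    (d_best, i_best), (d_second, _) = dists[0], dists[1]
    if d_second == 0.0:
        return None  # two exact duplicates: the match is ambiguous
    return i_best if d_best / d_second <= theta else None

# Distinctive match: the nearest neighbour is much closer than the runner-up.
clear = ratio_test_match([0.0, 0.0], [[1.0, 0.0], [5.0, 5.0], [6.0, 6.0]])
# Ambiguous match: two candidates at almost the same distance.
ambiguous = ratio_test_match([0.0, 0.0], [[1.0, 0.0], [0.0, 1.1]])
print(clear, ambiguous)
```

The ratio rejects descriptors that have several near-equal neighbours, which is one way of acting on "not all SIFT are created equal".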
8. Box Filter – CABOX work
Basic Idea:
Approximate DoG with linear combination of box filters
$\min_{\mathbf{h}} \|\mathbf{g} - B \cdot \mathbf{h}\|_{L_2}^2 + \|\mathbf{h}\|_{L_1}$
Solution by LASSO
[Figure: DoG kernel = h1 * (box filter) + h2 * (box filter) + …]
10. Image Matching/Retrieval System
SIFT is a sub-image level feature; what we actually care about is how SIFT matches translate into image-level matching/retrieval accuracy.
Say we can compute a single distance from a collection of features:
$d(I_1, I_2) = \sum_k \alpha_k\, d(F_k^1, F_k^2)$
Then for a database of n images, we can compute an n x n distance matrix:
$D_{j,k} = d(I_j, I_k)$
This gives us full information on the performance of this feature/distance system.
How do we characterize the performance of such an image matching and retrieval system?
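As a sketch of how the n x n matrix could be filled, the plain-Python example below combines per-feature distances with weights and exploits symmetry. The weights, the Euclidean per-feature distance and the toy data are assumptions made for illustration only.

```python
import math

def euclid(a, b):
    """Per-feature distance d(F^1, F^2); Euclidean is an assumed choice."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def image_distance(feats_a, feats_b, alphas, feature_dist):
    """Weighted sum of per-feature distances between two images."""
    return sum(a * feature_dist(fa, fb)
               for a, fa, fb in zip(alphas, feats_a, feats_b))

def distance_matrix(images, alphas, feature_dist):
    """n x n matrix of image distances, assuming a symmetric distance."""
    n = len(images)
    D = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(j + 1, n):
            D[j][k] = D[k][j] = image_distance(images[j], images[k],
                                               alphas, feature_dist)
    return D

# Three toy "images", each described by two feature vectors.
images = [
    [[0.0, 0.0], [1.0]],
    [[3.0, 4.0], [1.0]],
    [[0.0, 0.0], [3.0]],
]
alphas = [1.0, 0.5]
D = distance_matrix(images, alphas, euclid)
for row in D:
    print(row)
```

Only the upper triangle is computed, so the cost is n(n-1)/2 image comparisons rather than n squared.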
11. Thresholding for Matching
Basically, for any pair of images (documents, in IR jargon), we declare:
$I_j, I_k$ are a match if $d(I_j, I_k) < t$
$I_j, I_k$ are not a match otherwise
Then for each possible image pair, or each pair we care about, for a given threshold t there are 4 possible outcomes:
TP pair: {Ij, Ik} is a true matching pair and is declared matching, d(Ij, Ik) < t;
FP pair: {Ij, Ik} is a true non-matching pair but is declared matching, d(Ij, Ik) < t;
TN pair: {Ij, Ik} is a true non-matching pair and is declared non-matching, d(Ij, Ik) >= t;
FN pair: {Ij, Ik} is a true matching pair but is declared non-matching, d(Ij, Ik) >= t;
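A small plain-Python sketch of counting the four outcomes at one threshold. It assumes the pair distances have already been split by ground truth into distances of true matching pairs and distances of true non-matching pairs; the names and numbers are illustrative.

```python
def confusion_counts(match_dists, nonmatch_dists, t):
    """Count TP, FP, TN, FN when pairs with distance < t are declared matches."""
    tp = sum(1 for d in match_dists if d < t)      # true pairs, declared match
    fn = len(match_dists) - tp                     # true pairs, missed
    fp = sum(1 for d in nonmatch_dists if d < t)   # false pairs, declared match
    tn = len(nonmatch_dists) - fp                  # false pairs, rejected
    return tp, fp, tn, fn

match_dists = [0.2, 0.4, 0.9]          # ground truth: matching pairs
nonmatch_dists = [0.3, 1.1, 1.5, 2.0]  # ground truth: non-matching pairs
tp, fp, tn, fn = confusion_counts(match_dists, nonmatch_dists, t=0.5)
print(tp, fp, tn, fn)
```

Raising t moves pairs from FN to TP but also from TN to FP, which is the trade-off the later ROC slides visualize.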
12. Matching System Performance
True Positive Rate (Recall):
Out of all true matching pairs, how many are retrieved, i.e. have distance < t
$TPR = \frac{tp}{tp + fn}$
False Positive Rate:
Out of all true non-matching pairs, how many are retrieved as (false) matches
$FPR = \frac{fp}{fp + tn}$
13. TPR-FPR
Definition:
TP rate = TP/(TP+FN)
FP rate = FP/(FP+TN)
Both rates are defined from the actual value (ground truth) point of view.
15. ROC curve(2)
Which method (A or B) is better?
Compute the ROC area: the area under the ROC curve.
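One way to compare two methods by ROC area is sketched below in plain Python: sweep the threshold over all observed distances, collect (FPR, TPR) points, and integrate with the trapezoid rule. The distance lists for "method A" and "method B" are made-up toy data, and the function names are illustrative.

```python
def roc_points(match_dists, nonmatch_dists):
    """(FPR, TPR) operating points, declaring a match when distance <= t."""
    thresholds = sorted(set(match_dists) | set(nonmatch_dists))
    pts = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for d in match_dists if d <= t)
        fp = sum(1 for d in nonmatch_dists if d <= t)
        pts.append((fp / len(nonmatch_dists), tp / len(match_dists)))
    return pts

def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Method A separates matching from non-matching pairs perfectly.
auc_a = auc(roc_points([0.1, 0.2, 0.3], [0.4, 0.5, 0.6]))
# Method B confuses some pairs.
auc_b = auc(roc_points([0.1, 0.5, 0.3], [0.2, 0.4, 0.6]))
print("AUC A:", auc_a, "AUC B:", auc_b)
```

A larger area means that, over all thresholds, true matching pairs tend to receive smaller distances than non-matching pairs.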
16. Precision, Recall, F-measure
Precision = TP/(TP + FP),
Recall = TP/(TP + FN)
F-measure = 2*(precision*recall)/(precision + recall)
Precision: the probability that a retrieved document is relevant.
Recall: the probability that a relevant document is retrieved in a search.
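The three formulas translate directly into code. A minimal plain-Python sketch; returning 0.0 when a denominator is zero is an assumed convention, not something the notes specify.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f = 2 * precision * recall / (precision + recall)
    else:
        f = 0.0
    return precision, recall, f

# 10 items retrieved, 8 of them relevant; 16 relevant items exist in total.
p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
print(p, r, f)
```

The F-measure is the harmonic mean of the two, so it stays low unless both precision and recall are reasonably high.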
17. Matlab Implementation
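We will compute all image pair distances D(j,k).
How do we compute the TPR-FPR plot?
Understand that TPR and FPR are actually functions of the threshold t.
We just need to parameterize TPR(t) and FPR(t) and obtain operating points at meaningful thresholds to generate the plot.
Matlab Implementation:
function [tp, fp, tn, fn] = getPrecisionRecall(d0, d1, npt, dbg)
% d0: distances of true matching pairs, d1: distances of true
% non-matching pairs (inferred from how tp and fp are counted below)
% npt: number of thresholds, dbg: plot the TPR-FPR curve if nonzero
d_min = min([d0(:); d1(:)]);
d_max = max([d0(:); d1(:)]);
% sweep thresholds over the full range, including both end points
thres = linspace(d_min, d_max, npt);
tp = zeros(1, npt); fp = tp; tn = tp; fn = tp;
for k = 1:npt
    tp(k) = sum(d0(:) <= thres(k));   % matching pairs retrieved
    fp(k) = sum(d1(:) <= thres(k));   % non-matching pairs retrieved
    tn(k) = sum(d1(:) >  thres(k));   % non-matching pairs rejected
    fn(k) = sum(d0(:) >  thres(k));   % matching pairs missed
end
if dbg
    figure(22); grid on; hold on;
    plot(fp./(tn+fp), tp./(tp+fn), '.-r', 'DisplayName', 'tpr-fpr');
    legend();
end
end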
18. TPR-FPR
Image matching performance is characterized by the function TPR(FPR).
Retrieval set: we want high precision. Short list: we want high recall.
20. Why Aggregation ?
What do local interest point features bring us?
Scale and rotation invariance, in the form of an nk x d set of features per image:
$S_k \mid [x_k, y_k, \theta_k, \sigma_k, h_1, h_2, \ldots, h_{128}],\ k = 1..n$
Uncertainty in the number of detected features nk at query time
Any permutation of the feature rows is the same representation.
Problems:
The feature has state; we are not able to draw decision boundaries
Not directly indexable/hashable
Typically very high dimensionality
21. Decision Boundary in Matching
Can we have a decision boundary function for an interest-point-based representation?
22. Curse of Dimensionality in Retrieval
What feature dimensionality does to retrieval efficiency: consider a retrieval that covers 99% of the range in each dimension, and plot the total volume covered as the dimension grows.
Matlab: showDimensionCurse.m
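The contents of showDimensionCurse.m are not reproduced in these notes. Under one reading of the slide, where a query covers a fixed fraction of the range in every dimension, the covered share of the total volume is that fraction raised to the number of dimensions. A plain-Python sketch of that assumed interpretation:

```python
def covered_volume(per_dim_fraction, dims):
    """Fraction of a unit hypercube covered by a sub-cube with the given side."""
    return per_dim_fraction ** dims

# Even 99% coverage per dimension shrinks quickly as dimensions are added.
volumes = {d: covered_volume(0.99, d) for d in (1, 10, 128, 1000)}
for d, v in volumes.items():
    print(f"d = {d:4d}: covered volume fraction = {v:.6f}")
```

At the 128 dimensions of a SIFT descriptor, a box spanning 99% of every axis already covers well under a third of the space, which is why range-style indexing degrades in high dimensions.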
23. Aggregation – 30,000ft view
Bag of Words
Compute k centroids in feature space, called visual words
Compute histogram
k x1 feature, hard assignment
VLAD
Compute centroids in feature space
Compute aggregated difference w.r.t. the centroids
k x d feature, soft assignment
Fisher Vector
Compute a Gaussian Mixture Model (GMM) with 2nd order info
Compute the aggregated feature w.r.t the mean and covariance of
GMM
2 x k x d feature
AKULA
Adaptive centroids and feature count
Improved with covariance ?
24. Visual Key Words: main idea
Extract some local features from a number of
images …
e.g., SIFT descriptor space: each point is 128-dimensional
Slide credit: D. Nister
28. Visual Key Words
Each point is a local descriptor, e.g. a SIFT vector.
Slide credit: D. Nister
30. Visual words
Example: each group of patches belongs to the
same visual word
Figure from Sivic & Zisserman, ICCV 2003
31. Visual words
• More recently used for describing scenes and objects for the sake of indexing or classification.
Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others.
Source credit: K. Grauman, B. Leibe
32. Bag of Words
[Figure: an object and its bag of ‘words’]
ICCV 2005 short course, L. Fei-Fei
34. Bags of visual words
Summarize entire image based on its distribution
(histogram) of word occurrences.
Analogous to bag of words representation
commonly used for documents.
Image credit: Fei-Fei Li
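Summarizing an image as a histogram of word occurrences can be sketched as hard assignment of each local descriptor to its nearest centroid. The plain-Python example below uses toy 2-D descriptors and hand-picked centroids as assumptions; in practice the centroids would come from clustering (e.g. k-means) over 128-dimensional SIFT descriptors.

```python
def nearest_word(desc, centroids):
    """Index of the centroid (visual word) closest to the descriptor."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(desc, centroids[i])))

def bow_histogram(descriptors, centroids):
    """k-bin count histogram: one vote per descriptor (hard assignment)."""
    hist = [0] * len(centroids)
    for d in descriptors:
        hist[nearest_word(d, centroids)] += 1
    return hist

centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]   # k = 3 visual words
descriptors = [[1.0, 1.0], [9.0, 1.0], [0.0, 8.0], [1.0, 0.0], [0.0, 9.0]]
hist = bow_histogram(descriptors, centroids)
print(hist)
```

The histogram has a fixed length k no matter how many descriptors the image produced, and it does not depend on the order of the descriptors.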
36. BoW Distance Metrics
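Rank images by the normalized scalar product between their (possibly weighted) occurrence counts, i.e. nearest neighbor search for similar images.
$sim(d_j, q) = \frac{\langle d_j, q \rangle}{\|d_j\|\, \|q\|}$
[Figure: example occurrence-count vectors [5 1 1 0] and [1 8 1 4] for a database image dj and a query q]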
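The normalized scalar product can be computed as below, using the two count vectors shown on the slide. This is a plain-Python sketch; returning 0.0 for an all-zero vector is an assumed convention.

```python
import math

def normalized_scalar_product(a, b):
    """Cosine similarity between two occurrence-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

v1 = [5, 1, 1, 0]
v2 = [1, 8, 1, 4]
sim = normalized_scalar_product(v1, v2)
print(round(sim, 4))
```

Because of the normalization, an image with many features does not score higher merely because its counts are larger.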
37. Inverted List
Image Retrieval via Inverted List
[Figure: inverted list mapping each visual word number to a list of image numbers]
Image credit: A. Zisserman
When will this give us a significant gain in efficiency?
38. Indexing local features: inverted file index
For text documents, an efficient way to find all pages on which a word occurs is to use an index…
We want to find all images in which a feature occurs.
We need to index each feature by the images it appears in, and we also keep the number of occurrences.
Source credit : K. Grauman, B. Leibe
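A minimal sketch of such an inverted file in plain Python: each visual word maps to the images containing it together with the occurrence count, and a query only touches the lists of the words it contains. The image ids, histograms and the unnormalized scoring are illustrative assumptions.

```python
from collections import defaultdict

def build_inverted_index(bow_by_image):
    """Map word id -> {image id: occurrence count} for non-zero counts."""
    index = defaultdict(dict)
    for image_id, hist in bow_by_image.items():
        for word, count in enumerate(hist):
            if count > 0:
                index[word][image_id] = count
    return index

def score_candidates(index, query_hist):
    """Accumulate scalar-product scores, visiting only the query's words."""
    scores = defaultdict(int)
    for word, q_count in enumerate(query_hist):
        if q_count > 0:
            for image_id, count in index.get(word, {}).items():
                scores[image_id] += q_count * count
    return dict(scores)

database = {"a": [5, 1, 1, 0], "b": [0, 0, 0, 3], "c": [1, 8, 1, 4]}
index = build_inverted_index(database)
scores = score_candidates(index, [0, 2, 0, 0])
print(scores)
```

Image "b" is never visited because it shares no word with the query. The gain is largest when histograms are sparse, i.e. when the vocabulary is large relative to the number of features per image.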
39. TF-IDF Weighting
Term Frequency – Inverse Document Frequency
Describe an image by the frequency of each visual word within it, and down-weight words that appear often in the database
(Standard weighting for text retrieval)
$t_i = \frac{n_{id}}{n_d} \log \frac{N}{n_i}$
$n_{id}$: number of occurrences of word i in document d
$n_d$: number of words in document d
$n_i$: number of occurrences of word i in whole database
$N$: total number of words in database
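A plain-Python sketch of the weighting using the four quantities labeled on the slide: term frequency is the word's share of the document, and the inverse term is the log of total database words over database occurrences of that word. Treating the labels this way, the natural logarithm, and the zero weight for absent words are assumptions.

```python
import math

def tf_idf(doc_counts, db_word_counts):
    """TF-IDF weight per visual word for one document (image).

    doc_counts[i]     : occurrences of word i in this document
    db_word_counts[i] : occurrences of word i in the whole database
    """
    n_d = sum(doc_counts)          # number of words in the document
    n_total = sum(db_word_counts)  # total number of words in the database
    weights = []
    for n_id, n_i in zip(doc_counts, db_word_counts):
        if n_d and n_id and n_i:
            weights.append((n_id / n_d) * math.log(n_total / n_i))
        else:
            weights.append(0.0)
    return weights

# Word 0 is common in the database, word 2 is rare.
weights = tf_idf(doc_counts=[2, 0, 2], db_word_counts=[50, 30, 20])
print([round(w, 4) for w in weights])
```

Both present words occur equally often in the document, but the rarer word 2 receives the larger weight.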
40. BoW Use Case with Spatial Localization
Collecting words within a query region
Query region: pull out only the SIFT descriptors whose positions are within the polygon
51. Vocabulary Tree: Performance
Evaluated on large databases
Indexing with up to 1M images
Online recognition for database of 50,000 CD covers
Retrieval in ~1s
Find experimentally that large vocabularies can be beneficial for recognition
[Nister & Stewenius, CVPR’06]
52. Performance w.r.t vocabulary size
Larger vocabularies can be advantageous… But what happens if it is too large?
[Figure: performance vs. visual word vocabulary size]
53. Bags of words: pros and cons
Good:
+ flexible to geometry / deformations / viewpoint
+ compact summary of image content
+ provides vector representation for sets
+ Inverted List implementation offers practical solution
against large repository
Bad:
- Loss of information at quantization and histogram generation
- basic model ignores geometry – must verify afterwards,
or encode via features
- background and foreground mixed when bag covers
whole image
- interest points or sampling: no guarantee to capture
object-level parts
Source credit: K. Grauman, B. Leibe
54. Can we improve BoW ?
• E.g. Why isn’t our Bag of Words classifier at 90%
instead of 70%?
• Training Data
– Huge issue, but not necessarily a variable you can manipulate.
• Learning method
– BoW is on top of any feature scheme
• Representation
– Are we losing too much info in the process ?
55. Standard Kmeans Bag of Words
BoW revisited
http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf
56. Motivation
Bag of Visual Words is only about counting the number
of local descriptors assigned to each Voronoi region
Why not include other statistics/information?
http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf
57. We already looked at the Spatial Pyramid/Pooling
Spatial Pooling
[Figure: spatial pyramid, level 0: 1x1, level 1: 2x2, level 2: 4x4]
Key takeaway: multiple assignment? Soft assignment?
59. Motivation
Bag of Visual Words is only about counting the number
of local descriptors assigned to each Voronoi region
Why not include other statistics? For instance:
• mean of local descriptors
• (co)variance of local descriptors
http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf
61. Simple case: Soft Assignment
Called “Kernel codebook encoding” by Chatfield et al. 2011. Cast a weighted vote into the most similar clusters.
This is fast and easy to implement (try it for Project 3!), but it does have a downside for image retrieval: the inverted file index becomes less sparse.
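A sketch of the weighted vote in plain Python: each descriptor spreads one unit of vote over the centroids with Gaussian kernel weights. The kernel width `sigma`, the normalization to unit sum, the hard-assignment fallback and the toy 1-D data are all assumptions; the notes do not fix these details.

```python
import math

def soft_assignment(desc, centroids, sigma):
    """Kernel weights of one descriptor over all centroids, summing to 1."""
    sq = [sum((a - b) ** 2 for a, b in zip(desc, c)) for c in centroids]
    k = [math.exp(-d / (2.0 * sigma ** 2)) for d in sq]
    s = sum(k)
    if s == 0.0:
        # All weights underflowed: fall back to a hard vote for the nearest.
        nearest = min(range(len(sq)), key=lambda i: sq[i])
        return [1.0 if i == nearest else 0.0 for i in range(len(sq))]
    return [v / s for v in k]

def soft_bow(descriptors, centroids, sigma):
    """Histogram of accumulated soft votes."""
    hist = [0.0] * len(centroids)
    for d in descriptors:
        for i, w in enumerate(soft_assignment(d, centroids, sigma)):
            hist[i] += w
    return hist

centroids = [[0.0], [2.0]]
halfway = soft_assignment([1.0], centroids, sigma=1.0)
hist = soft_bow([[1.0], [0.0]], centroids, sigma=1.0)
print(halfway, hist)
```

Every descriptor now contributes to several bins, which is exactly why the inverted file becomes less sparse. Keeping only the few largest weights per descriptor is a common way to limit that.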
62. A first example: the VLAD
Given a codebook $\{\mu_i,\ i = 1..N\}$, e.g. learned with K-means, and a set of local descriptors $X = \{x_t,\ t = 1..T\}$:
• assign: $NN(x_t) = \arg\min_i \|x_t - \mu_i\|$
• compute: $v_i = \sum_{x_t : NN(x_t) = i} (x_t - \mu_i)$
• concatenate the $v_i$'s + normalize
Jégou, Douze, Schmid and Pérez, “Aggregating local descriptors into a compact image representation”, CVPR’10.
[Figure: five Voronoi cells with residual vectors v1 to v5: ① assign descriptors, ② compute x − μi, ③ vi = sum of x − μi for cell i]
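The three steps (assign, accumulate residuals, concatenate and normalize) can be sketched in plain Python as follows. Hard nearest-centroid assignment and a global L2 normalization are assumed here since the slide only says "normalize"; the toy 2-D data and function names are illustrative.

```python
import math

def nearest(desc, centroids):
    """Index of the closest centroid (hard assignment)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(desc, centroids[i])))

def vlad(descriptors, centroids):
    """Concatenated per-cell residual sums, L2-normalized (length k * d)."""
    k, d = len(centroids), len(centroids[0])
    v = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        i = nearest(x, centroids)                 # step 1: assign
        for j in range(d):
            v[i][j] += x[j] - centroids[i][j]     # steps 2-3: sum residuals
    flat = [c for row in v for c in row]          # concatenate the v_i
    norm = math.sqrt(sum(c * c for c in flat))
    return [c / norm for c in flat] if norm else flat

centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
code = vlad(descriptors, centroids)
print([round(c, 4) for c in code])
```

The code length is k x d regardless of the number of descriptors, so two images can be compared with a plain vector distance.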
63. A first example: the VLAD
A graphical representation of
64. VL_FEAT Implementation
Matlab:
function [vc] = vladSiftEncoding(sift, codebook)
% sift: n x kd descriptors, one per row; codebook: m x kd centroids
dbg = 1;
if dbg
    if (0) % init VL_FEAT, only need to do once
        run('../../tools/vlfeat-0.9.20/toolbox/vl_setup.m');
    end
    % debug mode overrides the inputs with a test image
    im = imread('../pics/flarsheim-2.jpg');
    [f, sift] = vl_sift(single(rgb2gray(im)));
    sift = single(sift');
    [indx, codebook] = kmeans(sift, 16);
    % make sift # smaller
    sift = sift(1:min(800, size(sift, 1)), :);
end
[n, kd] = size(sift);
[m, kd] = size(codebook);
% compute assignment: m x n distances from codebook entries to descriptors
dist = pdist2(codebook, sift);
mdist = mean(mean(dist));
% normalize the heat kernel s.t. mean dist is mapped to 0.5
a = -log(0.5)/mdist;
indx = exp(-a*dist);   % m x n soft assignment weights
vc = vl_vlad(sift', codebook', indx);
if dbg
    figure(41); colormap(gray);
    subplot(2,2,1); imshow(im); title('image');
    subplot(2,2,2); imagesc(dist); title('m x n distance');
    subplot(2,2,3); imagesc(indx); title('m x n assignment');
    % vc stores kd values per cluster, so reshape to kd x m, then transpose
    subplot(2,2,4); imagesc(reshape(vc, [kd, m])'); title('vlad code');
end
end
65. VLAD Code
What are the tweaks ?
Code book design
Soft Assignment options
66. References
Vocabulary Tree:
David Nistér, Henrik Stewénius: Scalable Recognition with a Vocabulary Tree. CVPR (2) 2006: 2161-2168
VLAD:
Hervé Jégou, Matthijs Douze, Cordelia Schmid: Improving Bag-of-Features for Large Scale Image Search. International Journal of Computer Vision 87(3): 316-336 (2010)
Fisher Vector:
Florent Perronnin, Jorge Sánchez, Thomas Mensink: Improving the Fisher Kernel for Large-Scale Image Classification. ECCV (4) 2010: 143-156
AKULA:
Abhishek Nagar, Zhu Li, Gaurav Srivastava, Kyungmo Park: AKULA - Adaptive Cluster Aggregation for Visual Search. DCC 2014: 13-22
67. Lec 07 Summary
Image Retrieval System Metric
What is true positive, false positive, true negative, false
negative ?
What is precision, recall, F-score ?
Why Aggregation ?
Decision boundary
Indexing/Hashing
Bag of Words
A histogram whose bins are visual words
Variations: hierarchical assignment with vocabulary tree
Implementation: Inverted List
VLAD
Richer encoding of aggregated info
Soft assignment of features to codebook bins
Vectorized representation – no need for inverted list