4. Patch descriptors
Over a 4x4 grid of cells, find the local gradient directions in the patch.
Count the directions per cell into 8 orientation bins: the 128-D SIFT histogram (4x4 cells x 8 bins).
Lowe IJCV 2004
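The counting scheme can be sketched in a few lines (a pure-Python illustration, not Lowe's implementation; the 16x16 patch layout and the function name are assumptions):

```python
import math

def sift_descriptor(gx, gy):
    """128-D SIFT-style histogram: 4x4 spatial cells x 8 orientation
    bins over a 16x16 patch. gx, gy: 16x16 lists of gradient components."""
    hist = [0.0] * 128
    for y in range(16):
        for x in range(16):
            mag = math.hypot(gx[y][x], gy[y][x])        # gradient magnitude
            ang = math.atan2(gy[y][x], gx[y][x]) % (2 * math.pi)
            ob = int(ang / (2 * math.pi) * 8) % 8       # orientation bin 0..7
            cell = (y // 4) * 4 + (x // 4)              # spatial cell 0..15
            hist[cell * 8 + ob] += mag
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0   # L2-normalise
    return [v / norm for v in hist]
```

(Lowe's descriptor additionally uses Gaussian weighting of the patch, trilinear interpolation between bins, and clipping at 0.2 before renormalising.)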
5. Affine patch descriptor
Compute the prominent direction.
Start with centrally Gaussian-distributed weights W.
Compute the 2nd-order moment matrix M_k over all directions.
Adapt the weights to the elliptic shape.
M_k = \begin{pmatrix} \sum_{x,y} w_k(x,y)\, f_x f_x & \sum_{x,y} w_k(x,y)\, f_x f_y \\ \sum_{x,y} w_k(x,y)\, f_x f_y & \sum_{x,y} w_k(x,y)\, f_y f_y \end{pmatrix}
Iterate W_{k+1} = M_k W_k until the weights no longer change.
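The matrix above can be sketched directly from weighted gradients (pure Python over flattened patch arrays; names are illustrative):

```python
def second_moment_matrix(fx, fy, w):
    """Weighted second-order moment matrix over a patch:
    M = [[sum w fx fx, sum w fx fy],
         [sum w fx fy, sum w fy fy]]."""
    mxx = sum(wi * a * a for wi, a in zip(w, fx))
    mxy = sum(wi * a * b for wi, a, b in zip(w, fx, fy))
    myy = sum(wi * b * b for wi, b in zip(w, fy))
    return [[mxx, mxy], [mxy, myy]]
```

In the adaptation loop, M_k reshapes the weights until they fit the local elliptic structure.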
6. Color Patch Descriptors
Invariance properties per descriptor

                 Light       Light       Light intensity   Light     Light color
                 intensity   intensity   change and        color     change and
                 change      shift       shift             change    shift
SIFT                +           +             +               -          -
OpponentSIFT        +           +             +               -          -
C-SIFT              +           -             -               -          -
RGB-SIFT            +           +             +               +          +
van de Sande PAMI 2010
8. Results per object category
[Bar chart: average precision per PASCAL VOC object category (bottle, pottedplant, cow, dog, diningtable, sheep, bird, sofa, tvmonitor, cat, chair, bicycle, motorbike, bus, boat, train, car, horse, aeroplane, person), comparing OpponentSIFT (L2 norm) with the two-channel I+C (L2 norm) combination; MAP shown per method; x-axis: Average Precision, 0.0 to 0.9.]
9. Corner selector
The change energy at x over a small shift (u, v):

E_{xy}(u, v) \approx [u\ v]\, M \begin{pmatrix} u \\ v \end{pmatrix}, \qquad M = \begin{pmatrix} f_x f_x & f_x f_y \\ f_x f_y & f_y f_y \end{pmatrix}

Since M is symmetric, diagonalising it gives the direction of the fastest change (the axes of the local ellipse scale as (\lambda_{max})^{-1/2} and (\lambda_{min})^{-1/2}):

M = R^{-1} \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} R

For a corner both eigenvalues should be large. The corner response is

\det M - k\, (\operatorname{trace} M)^2
\det M = \lambda_1 \lambda_2 = I_x^2 I_y^2 - (I_x I_y)^2
\operatorname{trace} M = \lambda_1 + \lambda_2 = I_x^2 + I_y^2
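The corner score can be evaluated without an eigendecomposition (a small sketch; k = 0.04 is the usual Harris constant):

```python
def harris_response(M, k=0.04):
    """Harris corner score det(M) - k * trace(M)^2 for a 2x2
    second-moment matrix M. Large and positive when both
    eigenvalues are large (a corner); negative on edges."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    trace = M[0][0] + M[1][1]
    return det - k * trace * trace
```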
12. Blob detector
2D Laplacian: L = \sigma^2 \left( G_{xx}(x, y, \sigma) + G_{yy}(x, y, \sigma) \right)
DoG: \mathrm{DoG} = G(x, y, k\sigma) - G(x, y, \sigma)
The Laplacian, multiplied by \sigma^2 to normalise across scales, has a single maximum at the size of the blob; the DoG approximates it.
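A quick 1-D numeric check of the approximation DoG ≈ (k − 1) σ² G_xx (an illustrative sketch, not a detector):

```python
import math

def gauss(x, sigma):
    """1-D Gaussian with standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def dog(x, sigma, k=1.1):
    """Difference of Gaussians: G(x, k*sigma) - G(x, sigma)."""
    return gauss(x, k * sigma) - gauss(x, sigma)

def scaled_laplacian(x, sigma):
    """sigma^2 * d^2G/dx^2, the scale-normalised second derivative."""
    s2 = sigma * sigma
    return gauss(x, sigma) * (x * x - s2) / s2
```

For k close to 1 the two agree closely, which is why the DoG pyramid serves as a cheap stand-in for the Laplacian.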
17. System 3: patch detection
System 3 is an app: Stitching
http://www.cloudburstresearch.com/
18. 4. Conclusion
Patch descriptors bring local orderless information.
Best combined with color invariance for illumination.
Scene-pose-illumination invariance brings meaning.
Lee Comm. ACM 2011
21. Capture the pattern in patch
Measure the pattern in a patch with abundant features.
More is better. Different is better. Normalized is better.
22. Sample many patches
Sample the patches in the image.
Dense 256 K words, salient 1 K words. Salience is good.
Dense is better. Combined even better. Salient is memory
efficient. Dense is compute efficient.
23. Sample many images
Sample the images in the world: the learning set.
Learn all relevant distinctions, and all irrelevant variations not covered by the invariance of the features.
24. Form a dictionary of words
Form regions in feature space.
Size 4,000 (general) to 400,000 (buildings). Random forest is
good and fast, 4 runs 10 deep is OK.
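The slide's choice is random forests; as an illustration of the same region-forming step, here is a toy k-means codebook (deterministic initialisation from the first k points; real vocabularies use thousands of words and approximate methods):

```python
def kmeans_codebook(points, k, iters=20):
    """Toy k-means: cluster descriptor points (lists of floats)
    into k visual words and return the cluster centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assign to nearest center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for c, members in enumerate(clusters):  # move centers to means
            if members:
                centers[c] = [sum(v) / len(members) for v in zip(*members)]
    return centers
```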
25. Count words per image
Retain the word boundaries.
Fill the histogram of words per training image.
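The counting step, sketched with hard assignment to the nearest codeword (names are illustrative):

```python
def word_histogram(descriptors, codebook):
    """Bag-of-words histogram: hard-assign each patch descriptor
    to its nearest codeword and count occurrences per word."""
    hist = [0] * len(codebook)
    for d in descriptors:
        j = min(range(len(codebook)),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(d, codebook[c])))
        hist[j] += 1
    return hist
```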
26. Map histogram in similarity space
In 4096 D word count space, 1 point is 1 image.
Hard assignment: one patch one word.
27. Learn histogram similarity
Learn the histogram distinction between the image histograms, sorted per class of images in the learning set.
The histogram is 𝑉_𝑑 = (𝑡_1, 𝑡_2, …, 𝑡_𝑖, …, 𝑡_𝑛)^𝑇, where 𝑡_𝑖 is the total of occurrences of the visual word i.
The number of words in common is the intersection between query and image: 𝑆_𝑞 = 𝑉_𝑞 ∩ 𝑉_𝑗
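On count vectors, the intersection reduces to a one-liner (sketch):

```python
def intersection_similarity(vq, vj):
    """Histogram intersection: total number of word occurrences
    shared by the query histogram vq and image histogram vj."""
    return sum(min(a, b) for a, b in zip(vq, vj))
```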
28. Classify unknown image
Retain the word-count discrimination and the support vectors.
Go from patches to words, to counts, to discrimination.
30. Note 1: Soft assignment is better
Soft assignment: assign to multiple clusters, weighted by
distance to center. Pooled single sigma for all codebook
elements.
van Gemert, PAMI 2010
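The weighting can be sketched as a Gaussian of the distance to each center, with one pooled sigma, normalised to sum to one (the function name is illustrative):

```python
import math

def soft_assign(descriptor, codebook, sigma):
    """Soft assignment: weight every codeword by a Gaussian of the
    descriptor-to-center distance with one pooled sigma, then
    normalise the weights to sum to 1."""
    w = []
    for c in codebook:
        d2 = sum((a - b) ** 2 for a, b in zip(descriptor, c))
        w.append(math.exp(-d2 / (2 * sigma * sigma)))
    total = sum(w) or 1.0
    return [x / total for x in w]
```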
31. Notes 2: SVM similarity is better
SVM can reconstruct a complex geometry at the boundary
including disjoint subspaces. The distance metric in the kernel
is important.
32. Note 2: nonlinear SVMs
Vapnik, 1995
How to transform the data such that the samples from
the two classes are separable by a linear function
(preferably with margin). Or, equivalently, define a
kernel that does this for you straight away.
33. Note 2: χ² kernels
Zhang, IJCV 2007
Because χ² is meant to discriminate histograms!
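One common form is the additive χ² kernel on histograms (a sketch; the exponentiated variant exp(−γ · χ²-distance) is also widely used):

```python
def chi2_kernel(h1, h2):
    """Additive chi-square kernel for histograms:
    k(x, y) = sum_i 2 * x_i * y_i / (x_i + y_i),
    skipping bins where both histograms are zero."""
    return sum(2.0 * a * b / (a + b) for a, b in zip(h1, h2) if a + b > 0)
```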
34. Note 2: … or multiple kernels
Let multiple kernel learning determine the weight of all features
Descriptors                      Norm = L2    #    Norm ∈ L      #
SIFT                                0.4902    1      0.5169      4
OpponentSIFT (baseline)             0.4975    1      0.5203      4
SIFT and OpponentSIFT               0.5187    2      0.5357      8
One channel from C                  0.5351   49      0.5405    196
Two channel: I and one from C       0.5463   49      0.5507    196
35. Note 3: Speed
For the Intersection Kernel, h_i is piecewise linear and quite smooth (blue plot). We can approximate it with fewer, uniformly spaced segments (red plot).
Saves a factor of 75 in time!
Maji CVPR 2008
36. Note 4: What is in a word?
This is what a word looks like
Gavves 2011 Chum ICCV 2007
Turcot ICCV 2009
37. Note 4: Where are the synonyms?
But not all views of the same detail
are close! Gavves 2011
38. Note 4: Forming selective dictionary
Build the vocabulary by selecting the minimal set that maximizes the cross entropy:
99% vocabulary reduction
6% improved recognition
Needs 100 words per concept.
Gavves 2011 CVPR
40. Note 5: Deconstruct words
Fisher vectors capture the internal structure of words.
Train a Gaussian Mixture Model, where each codebook
element has its own sigma – one per dimension. Store
differences in all descriptor dimensions. The feature vector has length
#codewords × #descriptor dimensions.
Perronnin ECCV 2010
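A heavily simplified sketch of the mean-gradient part of a Fisher vector (posterior-weighted, sigma-normalised differences to each GMM mean; the full formulation adds the Fisher-information normalisation and the sigma gradients):

```python
import math

def fisher_mean_vector(descriptors, means, sigmas, weights):
    """Mean-gradient Fisher-vector sketch for a diagonal GMM.
    Output length = #components x #descriptor dimensions."""
    K, D = len(means), len(means[0])
    fv = [0.0] * (K * D)
    for x in descriptors:
        lik = []                              # per-component likelihoods
        for k in range(K):
            e = sum(((x[d] - means[k][d]) / sigmas[k][d]) ** 2 for d in range(D))
            norm = 1.0
            for d in range(D):
                norm *= sigmas[k][d] * math.sqrt(2 * math.pi)
            lik.append(weights[k] * math.exp(-0.5 * e) / norm)
        tot = sum(lik) or 1.0
        for k in range(K):
            g = lik[k] / tot                  # posterior (soft assignment)
            for d in range(D):
                fv[k * D + d] += g * (x[d] - means[k][d]) / sigmas[k][d]
    n = len(descriptors) or 1
    return [v / n for v in fv]
```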
42. 5. Conclusion
Words are the essential step forward.
More words is better, though costly.
Soft assignment works better than hard assignment, at the cost of less orthogonal methods.
Approximate algorithms are sufficient, mostly.