Andrew Zisserman Talk - Part 1a
1. Visual search and recognition
Part I – large scale instance search
Andrew Zisserman
Visual Geometry Group
University of Oxford
http://www.robots.ox.ac.uk/~vgg
Slides with Josef Sivic
2. Overview
Part I: Instance level recognition
• e.g. find this car in a dataset of images
and the dataset may contain 1M images …
3. Overview
Part II: Category level recognition
• e.g. find images that contain cars
• or cows …
and localize the occurrences in each image …
4. Problem specification: particular object retrieval
Example: visual search in feature films
Visually defined query: “Groundhog Day” [Ramis, 1993]
“Find this clock”
“Find this place”
6. The need for visual search
Flickr: has more than 5 billion photographs, more than
1 million added daily
Company collections
Personal collections: 10,000s of digital camera photos
and video clips
The vast majority will have minimal, if any, textual annotation.
7. Why is it difficult?
Problem: find particular occurrences of an object in a very
large dataset of images
Want to find the object despite possibly large changes in
scale, viewpoint, lighting and partial occlusion
[Figure: image pairs illustrating scale, viewpoint, lighting, and occlusion changes]
8. Outline
1. Object recognition cast as nearest neighbour matching
• Covariant feature detection
• Feature descriptors (SIFT) and matching
2. Object recognition cast as text retrieval
3. Large scale search and improving performance
4. Applications
5. The future and challenges
9. Visual problem
• Retrieve image/key frames containing the same object
Approach
Determine regions (detection) and vector descriptors in each
frame which are invariant to camera viewpoint changes
Match descriptors between frames using invariant vectors
10. Example of visual fragments
Image content is transformed into local fragments that are invariant to
translation, rotation, scale, and other imaging parameters
• Fragments generalize over viewpoint and lighting
Lowe ICCV 1999
11. Detection Requirements
Detected image regions must cover the same scene region in different views:
• detection must commute with the viewpoint transformation, i.e. detection is viewpoint covariant
• NB: detection is computed in each image independently
[Diagram: detection followed by viewpoint transformation equals viewpoint transformation followed by detection]
12. Scale-invariant feature detection
Goal: independently detect corresponding regions in scaled versions
of the same image
Need a scale selection mechanism for finding the characteristic region size
that is covariant with the image transformation. The Laplacian provides
such a mechanism.
14. Recall: Edge detection
Signal $f$ containing an edge.
Convolve with the derivative of a Gaussian, $\frac{d}{dx}g$:
$$f * \frac{d}{dx}g$$
Edge = maximum of the derivative response.
Source: S. Seitz
15. Edge detection, take 2
Signal $f$ containing an edge.
Convolve with the second derivative of a Gaussian (the Laplacian), $\frac{d^2}{dx^2}g$:
$$f * \frac{d^2}{dx^2}g$$
Edge = zero crossing of the second-derivative response.
Source: S. Seitz
16. From edges to `top hat’ (blobs)
Blob = top-hat = superposition of two step edges.
Spatial selection: the magnitude of the Laplacian response achieves a
maximum at the center of the blob, provided the scale of the Laplacian
is “matched” to the scale of the blob.
17. Scale selection
We want to find the characteristic scale of the blob by
convolving it with Laplacians at several scales and looking
for the maximum response
However, the Laplacian response decays as scale increases.
[Plot: original signal (radius = 8) and Laplacian responses for increasing σ]
Why does this happen?
19. Scale normalization
The response of a derivative of Gaussian filter to a perfect
step edge decreases as σ increases
To keep the response the same (scale-invariant), the Gaussian derivative
must be multiplied by σ. The Laplacian is the second Gaussian derivative,
so it must be multiplied by σ².
20. Effect of scale normalization
[Plots: original signal; unnormalized Laplacian response decaying with increasing scale σ; scale-normalized Laplacian response with a clear maximum]
21. Blob detection in 2D
Laplacian of Gaussian: circularly symmetric operator for blob detection in 2D:
$$\nabla^2 g = \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2}$$
22. Blob detection in 2D
Laplacian of Gaussian: circularly symmetric operator for blob detection in 2D.
Scale-normalized:
$$\nabla^2_{\text{norm}} g = \sigma^2 \left( \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2} \right)$$
23. Characteristic scale
We define the characteristic scale as the scale that produces the peak of
the (scale-normalized) Laplacian response.
T. Lindeberg (1998). “Feature detection with automatic scale selection.”
International Journal of Computer Vision 30(2): 77–116.
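To make scale selection concrete, here is a minimal sketch in Python (an illustration added here, assuming numpy and scipy; the sigma range and the centre-pixel readout are arbitrary choices, not from the talk):

import numpy as np
from scipy import ndimage

def characteristic_scale(patch, sigmas=np.linspace(1.0, 16.0, 31)):
    # Return the sigma at which the scale-normalized Laplacian-of-Gaussian
    # response at the patch centre peaks (the characteristic scale).
    cy, cx = patch.shape[0] // 2, patch.shape[1] // 2
    responses = []
    for s in sigmas:
        # the s**2 factor is the scale normalization from slide 22
        log = (s ** 2) * ndimage.gaussian_laplace(patch.astype(float), sigma=s)
        responses.append(abs(log[cy, cx]))
    return sigmas[int(np.argmax(responses))]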
24. Scale selection
Scale invariance of the characteristic scale:
[Plots: scale-normalized Laplacian response vs scale for an image and a rescaled copy, peaking at characteristic scales $s_1^*$ and $s_2^*$]
• Relation between characteristic scales: if the two images are related by a scaling factor $s$, then $s \cdot s_1^* = s_2^*$.
25. Scale invariance Mikolajczyk and Schmid ICCV 2001
• Multi-scale extraction of Harris interest points
• Selection of characteristic scale in Laplacian scale space
Characteristic scale: the maximum of the Laplacian in scale space; scale invariant.
26. What class of transformations are required?
2D transformation models:
• Similarity (translation, scale, rotation)
• Affine
• Projective (homography)
28. Local invariance requirements
Geometric: 2D affine transformation
$$\mathbf{x}' = A\mathbf{x} + \mathbf{t}$$
where $A$ is a 2×2 non-singular matrix.
• Objective: compute image descriptors invariant to this class of transformations
29. Viewpoint covariant detection
• Characteristic scales (size of region)
• Lindeberg and Garding ECCV 1994
• Lowe ICCV 1999
• Mikolajczyk and Schmid ICCV 2001
• Affine covariance (shape of region)
• Baumberg CVPR 2000
• Matas et al BMVC 2002 (Maximally stable regions)
• Mikolajczyk and Schmid ECCV 2002 (Shape adapted regions)
• Schaffalitzky and Zisserman ECCV 2002 “Harris affine”
• Tuytelaars and Van Gool BMVC 2000
• Mikolajczyk et al., IJCV 2005
30. Example affine covariant region
Maximally Stable Regions (MSR)
[Figure: MSRs detected in first and second images]
1. Segment using the watershed algorithm, and track connected components as the threshold value varies.
2. An MSR is detected when the area of the component is stationary.
See Matas et al BMVC 2002
31. Maximally stable regions
[Figure: sub-image binarized at varying thresholds; plots of area vs threshold and change of area vs threshold for the first and second images]
33. Example of affine covariant regions
1000+ regions per image (shape adapted regions and maximally stable regions)
• a region’s size and shape are not fixed, but
• automatically adapt to the image intensity to cover the same physical surface
• i.e. the pre-image is the same surface region
34. Viewpoint invariant description
• Elliptical viewpoint covariant regions
• Shape Adapted regions
• Maximally Stable Regions
• Map ellipse to circle and orientate by dominant direction
• Represent each region by SIFT descriptor (128-vector) [Lowe 1999]
• see Mikolajczyk and Schmid CVPR 2003 for a comparison of descriptors
35. Local descriptors - rotation invariance
Estimation of the dominant orientation
• extract gradient orientations
• histogram over gradient orientations (0 to 2π)
• the peak in this histogram gives the dominant orientation
Rotate the patch to the dominant direction
36. Descriptors – SIFT [Lowe’99]
Distribution of the gradient over an image patch:
image patch → gradients → 3D histogram over (x, y, orientation)
4×4 location grid and 8 orientations (128 dimensions)
Very good performance in image matching [Mikolajczyk and Schmid ’03]
37. Summary – detection and description
Extract affine regions → Normalize regions → Eliminate rotation → Compute appearance descriptors: SIFT (Lowe ’04)
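As an illustration of this pipeline, here is a sketch using OpenCV’s standard SIFT; note this uses the DoG (scale-covariant) detector as a stand-in for the affine-covariant detectors above, and the file name is hypothetical:

import cv2

img = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical frame
sift = cv2.SIFT_create()
# keypoints carry position, scale and orientation; descriptors are N x 128
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)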
38. Visual problem
• Retrieve image/key frames containing the same object
Approach
Determine regions (detection) and vector descriptors in each
frame which are invariant to camera viewpoint changes
Match descriptors between frames using invariant vectors
39. Outline of an object retrieval strategy
[Diagram: frames → regions → invariant descriptor vectors]
1. Compute regions in each image independently
2. “Label” each region by a descriptor vector from its local intensity neighbourhood
3. Find corresponding regions by matching to closest descriptor vector
4. Score each frame in the database by the number of matches
Finding corresponding regions is thus transformed into finding nearest neighbour vectors.
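A sketch of steps 3–4: match query descriptors to one frame’s descriptors by nearest neighbour and score the frame by the match count. The ratio test is Lowe’s standard ambiguity filter, an assumption here rather than something the slide specifies:

import cv2

def score_frame(query_desc, frame_desc, ratio=0.8):
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # two nearest neighbours per query descriptor
    pairs = matcher.knnMatch(query_desc, frame_desc, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good)

# rank frames by descending match count, e.g.:
# ranking = sorted(range(len(frames)), key=lambda i: -score_frame(q, frames[i]))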
40. Example
In each frame independently:
• determine elliptical regions (detection covariant with camera viewpoint)
• compute a SIFT descriptor for each region [Lowe ‘99]
1000+ descriptors per frame
[Figure: Harris-affine and maximally stable regions on example frames]
41. Object recognition
Establish correspondences between object model image and target image by
nearest neighbour matching on SIFT vectors
[Figure: model image and target image linked by nearest-neighbour matches in the 128D descriptor space]
42. Match regions between frames using SIFT descriptors
• Multiple fragments overcome the problem of partial occlusion
• Transfer the query box to localize the object
[Figure: Harris-affine and maximally stable region matches]
Now, convert this approach to a text retrieval representation.
43. Outline
1. Object recognition cast as nearest neighbour matching
2. Object recognition cast as text retrieval
• bag of words model
• visual words
3. Large scale search and improving performance
4. Applications
5. The future and challenges
44. Success of text retrieval
• efficient
• high precision
• scalable
Can we use retrieval mechanisms from text for visual retrieval?
• ‘Visual Google’ project, 2003+
For a million+ images:
• scalability
• high precision
• high recall: can we retrieve all occurrences in the corpus?
45. Text retrieval lightning tour
Stemming: represent words by their stems, e.g. “walking”, “walks” → “walk”
Stop-list: reject the very common words, e.g. “the”, “a”, “of”
Inverted file: the ideal book index, mapping each term to its list of hits (occurrences in documents):
People → [d1: hit hit hit], [d4: hit hit] …
Common → [d1: hit hit], [d3: hit], [d4: hit hit hit] …
Sculpture → [d2: hit], [d3: hit hit hit] …
• word matches are pre-computed
46. Ranking
• frequency of words in document (tf-idf)
• proximity weighting (Google)
• PageRank (Google)
Need to map feature descriptors to “visual words”.
47. Build a visual vocabulary for a movie
Vector quantize descriptors by k-means clustering.
[Figure: 128D SIFT descriptor space partitioned into clusters, centres marked +]
Implementation
• compute SIFT features on frames from 48 shots of the film
• 6K clusters for Shape Adapted regions
• 10K clusters for Maximally Stable regions
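A sketch of this vocabulary-building step using scikit-learn’s mini-batch k-means (an assumption: the talk’s implementation is not mini-batch; the random descriptor array is a stand-in for real SIFT features, and 10K clusters matches the Maximally Stable setting above):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

descriptors = np.random.rand(200_000, 128).astype(np.float32)  # stand-in for SIFT
kmeans = MiniBatchKMeans(n_clusters=10_000, batch_size=10_000, n_init=3)
kmeans.fit(descriptors)
vocabulary = kmeans.cluster_centers_  # one 128D "visual word" per row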
48. Samples of visual words (clusters on SIFT descriptors):
Shape adapted regions Maximally stable regions
generic examples – cf textons
49. Samples of visual words (clusters on SIFT descriptors):
More specific example
50. Visual words: quantize descriptor space
Sivic and Zisserman, ICCV 2003
Nearest neighbour matching is expensive to do for all frames.
[Figure: descriptors from Image 1 and Image 2 in the 128D descriptor space]
51. Visual words: quantize descriptor space
Sivic and Zisserman, ICCV 2003
Vector quantize descriptors: each descriptor is assigned to its nearest cluster centre, e.g. visual words 5 and 42.
[Figure: descriptor space partitioned into cells; corresponding descriptors in Image 1 and Image 2 fall into the same cells]
52-53. Visual words: quantize descriptor space
Sivic and Zisserman, ICCV 2003
Descriptors from a new image are likewise assigned to the nearest existing visual words (e.g. 5 and 42), so matching reduces to comparing word labels instead of nearest neighbour search over all frames.
[Figure: new image descriptors quantized into the same cells as those of Image 1 and Image 2]
55. Representation: bag of (visual) words
Visual words are ‘iconic’ image patches or fragments
• represent the frequency of word occurrence
• but not their position
Image → collection of visual words
56. Offline: assign visual words and compute histograms
for each key frame in the video
Detect patches → Normalize patch → Compute SIFT descriptor → Find nearest cluster centre
Represent the frame by a sparse histogram of visual word occurrences, e.g. [2, 0, 0, 1, 0, 1, …]
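A sketch of this per-frame histogram step (brute-force assignment for clarity; at scale the approximate nearest-neighbour search described later would be used):

import numpy as np

def bow_histogram(descriptors, vocabulary):
    # squared distance from every descriptor to every visual word
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)  # nearest word per descriptor
    return np.bincount(words, minlength=len(vocabulary))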
57. Offline: create an index
For fast search, store a “posting list” for the dataset
This maps word occurrences to the documents they occur in.
Posting list:
word 1 → frames 5, 10, …
word 2 → frames 10, …
…
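A minimal sketch of the posting-list index as a plain dictionary (a hypothetical helper, not the talk’s implementation):

from collections import defaultdict

def build_inverted_index(frame_histograms):
    index = defaultdict(list)  # visual word id -> frames containing it
    for frame_id, hist in enumerate(frame_histograms):
        for word_id, count in enumerate(hist):
            if count > 0:
                index[word_id].append(frame_id)
    return index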
58. At run time …
• User specifies a query region
• Generate a short list of frames using visual words in region
1. Accumulate all visual words within the query region
2. Use “book index” to find other frames with these words
3. Compute similarity for frames which share at least one word
Generates a tf-idf ranked list of all the frames in the dataset.
59. Image ranking using the bag-of-words model
For a vocabulary of size K, each image is represented by a K-vector
$v = (t_1, \ldots, t_K)^\top$, where $t_i$ is the (weighted) number of
occurrences of visual word $i$.
Images are ranked by the normalized scalar product between the query
vector $v_q$ and all vectors in the database $v_d$:
$$\text{sim}(v_q, v_d) = \frac{v_q \cdot v_d}{\|v_q\| \, \|v_d\|}$$
The scalar product can be computed efficiently using the inverted file.
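A sketch of the weighting and ranking; the standard tf-idf and cosine formulation is assumed here (the exact weighting in the talk follows Sivic and Zisserman 2003):

import numpy as np

def tfidf_vectors(histograms):
    h = np.asarray(histograms, dtype=float)
    df = (h > 0).sum(axis=0)                    # document frequency per word
    idf = np.log(len(h) / np.maximum(df, 1))
    v = h * idf
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    return v / np.maximum(norms, 1e-12)         # L2-normalized tf-idf vectors

def rank_frames(query_vec, db_vecs):
    # normalized scalar product = cosine similarity, best first
    return np.argsort(-(db_vecs @ query_vec))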
60. Summary: Match histograms of visual words
[Diagram: frames → regions → invariant descriptor vectors → quantize → single vector (histogram)]
1. Compute affine covariant regions in each frame independently (offline)
2. “Label” each region by a vector of descriptors based on its intensity (offline)
3. Build histograms of visual words by descriptor quantization (offline)
4. Rank retrieved frames by matching visual word histograms using inverted files (online).
61. Films = common dataset
“Pretty Woman” “Casablanca”
“Groundhog Day” “Charade”
63. Example
Select a region and search in the film “Groundhog Day”; the retrieved shots are returned.
64. Visual words - advantages
Design of descriptors makes these words invariant to:
• illumination
• affine transformations (viewpoint)
Multiple local regions give immunity to partial occlusion.
Overlap encodes some structural information.
NB: no attempt is made to carry out a ‘semantic’ segmentation.
65. Example application – product placement
Sony logo from Google image
search on `Sony’
Retrieve shots from Groundhog Day
67. Outline
1. Object recognition cast as nearest neighbour matching
2. Object recognition cast as text retrieval
3. Large scale search and improving performance
• large vocabularies and approximate k-means
• query expansion
• soft assignment
4. Applications
5. The future and challenges
70. Oxford buildings dataset
Automatically crawled from Flickr
Dataset (i) consists of 5062 images, crawled by searching
for Oxford landmarks, e.g.
“Oxford Christ Church”
“Oxford Radcliffe camera”
“Oxford”
“Medium” resolution images (1024 x 768)
71. Oxford buildings dataset
Automatically crawled from Flickr
Consists of:
Dataset (i) crawled by searching for Oxford landmarks
Datasets (ii) and (iii) from other popular Flickr tags; these act as additional distractors.
72. Oxford buildings dataset
Landmarks plus queries used for evaluation
All Souls, Ashmolean, Balliol, Bodleian, Bridge of Sighs, Cornmarket, Keble, Magdalen, Radcliffe Camera, Thom Tower, University Museum
Ground truth obtained for 11 landmarks over 5062 images
Evaluate performance by mean Average Precision
73. Precision and Recall
• Precision: % of returned images that are relevant
• Recall: % of relevant images that are returned
[Diagram: relevant vs returned images within all images; plot: a precision-recall curve]
74. Average Precision
[Plot: precision-recall curve; AP summarizes the area under it]
• A good AP score requires both high recall and high precision
• Application-independent
Performance is measured by mean Average Precision (mAP) over 55 queries on the 100K or 1.1M image datasets.
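A sketch of Average Precision for a single query, assuming every relevant image appears somewhere in the ranked list:

def average_precision(ranked_relevance):
    # ranked_relevance: booleans, one per returned image, best rank first
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# mAP = mean of average_precision over all 55 queries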
75. Quantization / Clustering
K-means usually seen as a quick + cheap method
But far too slow for our needs – D~128, N~20M+, K~1M
Use approximate k-means: nearest neighbour search by
multiple, randomized k-d trees
76. K-means overview
K-means overview: initialize cluster centres, then iterate:
1. Find the nearest cluster centre to each datapoint (slow: O(NK))
2. Re-compute cluster centres as the centroids of their assigned points
K-means provably locally minimizes the sum of squared
errors (SSE) between a cluster centre and its points
Idea: nearest neighbour search is the bottleneck – use
approximate nearest neighbour search
77. Approximate K-means
Use multiple, randomized k-d trees for search
A k-d tree hierarchically decomposes the
descriptor space
Points nearby in the space can (hopefully) be found by backtracking
around the tree for a small number of steps.
A single tree works OK in low dimensions, but not so well in high dimensions.
78. Approximate K-means
Multiple randomized trees increase the chances of finding nearby points.
[Figure: query point and true nearest neighbour; the first two randomized trees miss the true neighbour (No, No), the third finds it (Yes)]
79. Approximate K-means
Use the best-bin-first strategy to determine which branch of the tree to
examine next; the priority queue is shared between the multiple trees, so
searching multiple trees is only slightly more expensive than searching one.
Original K-means complexity = O(N K)
Approximate K-means complexity = O(N log K)
This means we can scale to very large K
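A sketch of one approximate k-means iteration with the assignment step done through a tree. A single exact scipy k-d tree stands in for the multiple randomized trees with best-bin-first search used in the talk, and (as slide 77 warns) a single tree degrades in 128 dimensions:

import numpy as np
from scipy.spatial import cKDTree

def akm_iteration(points, centres):
    tree = cKDTree(centres)          # built once per iteration
    _, assign = tree.query(points)   # nearest-centre search, ~O(N log K)
    new_centres = centres.copy()
    for k in range(len(centres)):
        members = points[assign == k]
        if len(members) > 0:
            new_centres[k] = members.mean(axis=0)
    return new_centres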
81. Oxford buildings dataset
Landmarks plus queries used for evaluation: the same 11 landmarks as on slide 72.
Ground truth obtained for 11 landmarks over 5062 images.
Evaluate performance by mean Average Precision over 55 queries.
82. Approximate K-means
How accurate is the approximate search?
Performance on 5K image dataset for a random forest of 8 trees
Allows much larger clusterings than would be feasible with
standard K-means: N~17M points, K~1M
AKM: 8.3 CPU hours per iteration
Standard K-means: an estimated 2650 CPU hours per iteration
83. Performance against vocabulary size
Using large vocabularies gives a big boost in performance
(peak @ 1M words)
More discriminative vocabularies give:
• Better retrieval quality
• Increased search speed: documents share fewer words, so fewer documents need to be scored
84. Beyond Bag of Words
Use the position and shape of the underlying features
to improve retrieval quality
Both images have many matches – which is correct?
85. Beyond Bag of Words
We can measure spatial consistency between the
query and each result to improve retrieval quality
Many spatially consistent matches: correct result. Few spatially consistent matches: incorrect result.
86. Compute 2D affine transformation
• between the query region and target image:
$$\mathbf{x}' = A\mathbf{x} + \mathbf{t}$$
where $A$ is a 2×2 non-singular matrix
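A sketch of this verification step using OpenCV’s RANSAC affine estimation (RANSAC is an assumption; the slide only states that an affine transformation is computed). pts_query and pts_target are corresponding region centres from the putative visual-word matches:

import cv2
import numpy as np

def spatial_score(pts_query, pts_target):
    # pts_*: N x 2 float32 arrays of matched keypoint locations
    A, inlier_mask = cv2.estimateAffine2D(
        pts_query, pts_target, method=cv2.RANSAC, ransacReprojThreshold=3.0)
    n_inliers = 0 if inlier_mask is None else int(inlier_mask.sum())
    return n_inliers, A  # A is the 2x3 matrix [A | t]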
90. Beyond Bag of Words
Extra bonus – gives localization of the object
91. Example Results
[Figure: query region and example results]
Rank the short list of retrieved images by the number of correspondences.
92. Mean Average Precision variation with vocabulary size
vocab size | bag of words | spatial
50K   | 0.473 | 0.599
100K  | 0.535 | 0.597
250K  | 0.598 | 0.633
500K  | 0.606 | 0.642
750K  | 0.609 | 0.630
1M    | 0.618 | 0.645
1.25M | 0.602 | 0.625
93. Bag of visual words particular object retrieval
query image → Hessian-Affine regions + SIFT descriptors → set of SIFT descriptors → quantize against centroids (visual words) → sparse frequency vector with tf-idf weighting → query the inverted file [Chum et al 2007] → ranked image short-list → geometric verification [Lowe 04, Chum et al 2007]