Andrew Zisserman Talk - Part 1a
1. Visual search and recognition
Part I – large scale instance search
Andrew Zisserman
Visual Geometry Group
University of Oxford
http://www.robots.ox.ac.uk/~vgg
Slides with Josef Sivic
2. Overview
Part I: Instance level recognition
• e.g. find this car in a dataset of images
and the dataset may contain 1M images …
3. Overview
Part II: Category level recognition
• e.g. find images that contain cars
• or cows …
and localize the occurrences in each image …
4. Problem specification: particular object retrieval
Example: visual search in feature films
Visually defined query: “Groundhog Day” [Ramis, 1993]
“Find this clock”
“Find this place”
6. The need for visual search
Flickr: has more than 5 billion photographs, more than
1 million added daily
Company collections
Personal collections: 10,000s of digital camera photos
and video clips
The vast majority will have minimal, if any, textual annotation.
7. Why is it difficult?
Problem: find particular occurrences of an object in a very
large dataset of images
Want to find the object despite possibly large changes in
scale, viewpoint, lighting and partial occlusion
[Figure: image pairs illustrating scale, viewpoint, lighting, and occlusion changes]
8. Outline
1. Object recognition cast as nearest neighbour matching
• Covariant feature detection
• Feature descriptors (SIFT) and matching
2. Object recognition cast as text retrieval
3. Large scale search and improving performance
4. Applications
5. The future and challenges
9. Visual problem
• Retrieve image/key frames containing the same object
Approach
Determine regions (detection) and vector descriptors in each
frame which are invariant to camera viewpoint changes
Match descriptors between frames using invariant vectors
10. Example of visual fragments
Image content is transformed into local fragments that are invariant to
translation, rotation, scale, and other imaging parameters
• Fragments generalize over viewpoint and lighting
Lowe ICCV 1999
11. Detection Requirements
Detected image regions must cover the same scene region in different views:
• detection must commute with the viewpoint transformation, i.e. detection is viewpoint covariant
• NB: detection is computed in each image independently
[Diagram: detection followed by viewpoint transformation equals viewpoint transformation followed by detection]
12. Scale-invariant feature detection
Goal: independently detect corresponding regions in scaled versions
of the same image
Need a scale selection mechanism for finding the characteristic region size
that is covariant with the image transformation. The Laplacian provides
such a mechanism.
14. Recall: Edge detection
Signal $f$ containing an edge.
Convolve with the derivative of a Gaussian, $\frac{d}{dx}g$:
$$f * \frac{d}{dx}g$$
Edge = maximum of the derivative response.
Source: S. Seitz
15. Edge detection, take 2
Signal $f$ containing an edge.
Convolve with the second derivative of a Gaussian (the Laplacian), $\frac{d^2}{dx^2}g$:
$$f * \frac{d^2}{dx^2}g$$
Edge = zero crossing of the second-derivative response.
Source: S. Seitz
16. From edges to `top hat’ (blobs)
Blob = top-hat = superposition of two step edges.
Spatial selection: the magnitude of the Laplacian response achieves a
maximum at the center of the blob, provided the scale of the Laplacian
is “matched” to the scale of the blob.
17. Scale selection
We want to find the characteristic scale of the blob by
convolving it with Laplacians at several scales and looking
for the maximum response
However, the Laplacian response decays as scale increases.
[Plot: original signal (radius = 8) and Laplacian responses for increasing σ]
Why does this happen?
19. Scale normalization
The response of a derivative of Gaussian filter to a perfect
step edge decreases as σ increases
To keep the response the same (scale-invariant), the Gaussian derivative
must be multiplied by σ. The Laplacian is the second Gaussian derivative,
so it must be multiplied by σ².
20. Effect of scale normalization
[Plots: original signal; unnormalized Laplacian response decaying with increasing scale σ; scale-normalized Laplacian response with a clear maximum]
21. Blob detection in 2D
Laplacian of Gaussian: circularly symmetric operator for blob detection in 2D:
$$\nabla^2 g = \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2}$$
22. Blob detection in 2D
Laplacian of Gaussian: circularly symmetric operator for blob detection in 2D.
Scale-normalized:
$$\nabla^2_{\text{norm}} g = \sigma^2 \left( \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2} \right)$$
23. Characteristic scale
We define the characteristic scale as the scale that produces the peak of
the (scale-normalized) Laplacian response.
T. Lindeberg (1998). “Feature detection with automatic scale selection.”
International Journal of Computer Vision 30(2): 77–116.
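To make scale selection concrete, here is a minimal sketch in Python (an illustration added here, assuming numpy and scipy; the sigma range and the centre-pixel readout are arbitrary choices, not from the talk):

import numpy as np
from scipy import ndimage

def characteristic_scale(patch, sigmas=np.linspace(1.0, 16.0, 31)):
    # Return the sigma at which the scale-normalized Laplacian-of-Gaussian
    # response at the patch centre peaks (the characteristic scale).
    cy, cx = patch.shape[0] // 2, patch.shape[1] // 2
    responses = []
    for s in sigmas:
        # the s**2 factor is the scale normalization from slide 22
        log = (s ** 2) * ndimage.gaussian_laplace(patch.astype(float), sigma=s)
        responses.append(abs(log[cy, cx]))
    return sigmas[int(np.argmax(responses))]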
24. Scale selection
Scale invariance of the characteristic scale:
[Plots: scale-normalized Laplacian response vs scale for an image and a rescaled copy, peaking at characteristic scales $s_1^*$ and $s_2^*$]
• Relation between characteristic scales: if the two images are related by a scaling factor $s$, then $s \cdot s_1^* = s_2^*$.
25. Scale invariance Mikolajczyk and Schmid ICCV 2001
• Multi-scale extraction of Harris interest points
• Selection of characteristic scale in Laplacian scale space
Characteristic scale: the maximum of the Laplacian in scale space; scale invariant.
26. What class of transformations are required?
2D transformation models:
• Similarity (translation, scale, rotation)
• Affine
• Projective (homography)
28. Local invariance requirements
Geometric: 2D affine transformation
$$\mathbf{x}' = A\mathbf{x} + \mathbf{t}$$
where $A$ is a 2×2 non-singular matrix.
• Objective: compute image descriptors invariant to this class of transformations
29. Viewpoint covariant detection
• Characteristic scales (size of region)
• Lindeberg and Garding ECCV 1994
• Lowe ICCV 1999
• Mikolajczyk and Schmid ICCV 2001
• Affine covariance (shape of region)
• Baumberg CVPR 2000
• Matas et al BMVC 2002 (Maximally stable regions)
• Mikolajczyk and Schmid ECCV 2002 (Shape adapted regions)
• Schaffalitzky and Zisserman ECCV 2002 “Harris affine”
• Tuytelaars and Van Gool BMVC 2000
• Mikolajczyk et al., IJCV 2005
30. Example affine covariant region
Maximally Stable Regions (MSR)
[Figure: MSRs detected in first and second images]
1. Segment using the watershed algorithm, and track connected components as the threshold value varies.
2. An MSR is detected when the area of the component is stationary.
See Matas et al BMVC 2002
31. Maximally stable regions
[Figure: sub-image binarized at varying thresholds; plots of area vs threshold and change of area vs threshold for the first and second images]
33. Example of affine covariant regions
1000+ regions per image (shape adapted regions and maximally stable regions)
• a region’s size and shape are not fixed, but
• automatically adapt to the image intensity to cover the same physical surface
• i.e. the pre-image is the same surface region
34. Viewpoint invariant description
• Elliptical viewpoint covariant regions
• Shape Adapted regions
• Maximally Stable Regions
• Map ellipse to circle and orientate by dominant direction
• Represent each region by SIFT descriptor (128-vector) [Lowe 1999]
• see Mikolajczyk and Schmid CVPR 2003 for a comparison of descriptors
35. Local descriptors - rotation invariance
Estimation of the dominant orientation
• extract gradient orientations
• histogram over gradient orientations (0 to 2π)
• the peak in this histogram gives the dominant orientation
Rotate the patch to the dominant direction
36. Descriptors – SIFT [Lowe’99]
Distribution of the gradient over an image patch:
image patch → gradients → 3D histogram over (x, y, orientation)
4×4 location grid and 8 orientations (128 dimensions)
Very good performance in image matching [Mikolajczyk and Schmid ’03]
37. Summary – detection and description
Extract affine regions → Normalize regions → Eliminate rotation → Compute appearance descriptors: SIFT (Lowe ’04)
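As an illustration of this pipeline, here is a sketch using OpenCV’s standard SIFT; note this uses the DoG (scale-covariant) detector as a stand-in for the affine-covariant detectors above, and the file name is hypothetical:

import cv2

img = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical frame
sift = cv2.SIFT_create()
# keypoints carry position, scale and orientation; descriptors are N x 128
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)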
38. Visual problem
• Retrieve image/key frames containing the same object
Approach
Determine regions (detection) and vector descriptors in each
frame which are invariant to camera viewpoint changes
Match descriptors between frames using invariant vectors
39. Outline of an object retrieval strategy
[Diagram: frames → regions → invariant descriptor vectors]
1. Compute regions in each image independently
2. “Label” each region by a descriptor vector from its local intensity neighbourhood
3. Find corresponding regions by matching to closest descriptor vector
4. Score each frame in the database by the number of matches
Finding corresponding regions is thus transformed into finding nearest neighbour vectors.
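A sketch of steps 3–4: match query descriptors to one frame’s descriptors by nearest neighbour and score the frame by the match count. The ratio test is Lowe’s standard ambiguity filter, an assumption here rather than something the slide specifies:

import cv2

def score_frame(query_desc, frame_desc, ratio=0.8):
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # two nearest neighbours per query descriptor
    pairs = matcher.knnMatch(query_desc, frame_desc, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good)

# rank frames by descending match count, e.g.:
# ranking = sorted(range(len(frames)), key=lambda i: -score_frame(q, frames[i]))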
40. Example
In each frame independently:
• determine elliptical regions (detection covariant with camera viewpoint)
• compute a SIFT descriptor for each region [Lowe ‘99]
1000+ descriptors per frame
[Figure: Harris-affine and maximally stable regions on example frames]
41. Object recognition
Establish correspondences between object model image and target image by
nearest neighbour matching on SIFT vectors
[Figure: model image and target image linked by nearest-neighbour matches in the 128D descriptor space]
42. Match regions between frames using SIFT descriptors
• Multiple fragments overcome the problem of partial occlusion
• Transfer the query box to localize the object
[Figure: Harris-affine and maximally stable region matches]
Now, convert this approach to a text retrieval representation.
43. Outline
1. Object recognition cast as nearest neighbour matching
2. Object recognition cast as text retrieval
• bag of words model
• visual words
3. Large scale search and improving performance
4. Applications
5. The future and challenges
44. Success of text retrieval
• efficient
• high precision
• scalable
Can we use retrieval mechanisms from text for visual retrieval?
• ‘Visual Google’ project, 2003+
For a million+ images:
• scalability
• high precision
• high recall: can we retrieve all occurrences in the corpus?
45. Text retrieval lightning tour
Stemming: represent words by their stems, e.g. “walking”, “walks” → “walk”
Stop-list: reject the very common words, e.g. “the”, “a”, “of”
Inverted file: the ideal book index, mapping each term to its list of hits (occurrences in documents):
People → [d1: hit hit hit], [d4: hit hit] …
Common → [d1: hit hit], [d3: hit], [d4: hit hit hit] …
Sculpture → [d2: hit], [d3: hit hit hit] …
• word matches are pre-computed
46. Ranking
• frequency of words in document (tf-idf)
• proximity weighting (Google)
• PageRank (Google)
Need to map feature descriptors to “visual words”.
47. Build a visual vocabulary for a movie
Vector quantize descriptors by k-means clustering.
[Figure: 128D SIFT descriptor space partitioned into clusters, centres marked +]
Implementation
• compute SIFT features on frames from 48 shots of the film
• 6K clusters for Shape Adapted regions
• 10K clusters for Maximally Stable regions
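A sketch of this vocabulary-building step using scikit-learn’s mini-batch k-means (an assumption: the talk’s implementation is not mini-batch; the random descriptor array is a stand-in for real SIFT features, and 10K clusters matches the Maximally Stable setting above):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

descriptors = np.random.rand(200_000, 128).astype(np.float32)  # stand-in for SIFT
kmeans = MiniBatchKMeans(n_clusters=10_000, batch_size=10_000, n_init=3)
kmeans.fit(descriptors)
vocabulary = kmeans.cluster_centers_  # one 128D "visual word" per row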
48. Samples of visual words (clusters on SIFT descriptors):
Shape adapted regions Maximally stable regions
generic examples – cf textons
49. Samples of visual words (clusters on SIFT descriptors):
More specific example
50. Visual words: quantize descriptor space
Sivic and Zisserman, ICCV 2003
Nearest neighbour matching is expensive to do for all frames.
[Figure: descriptors from Image 1 and Image 2 in the 128D descriptor space]
51. Visual words: quantize descriptor space
Sivic and Zisserman, ICCV 2003
Vector quantize descriptors: each descriptor is assigned to its nearest cluster centre, e.g. visual words 5 and 42.
[Figure: descriptor space partitioned into cells; corresponding descriptors in Image 1 and Image 2 fall into the same cells]
52-53. Visual words: quantize descriptor space
Sivic and Zisserman, ICCV 2003
Descriptors from a new image are likewise assigned to the nearest existing visual words (e.g. 5 and 42), so matching reduces to comparing word labels instead of nearest neighbour search over all frames.
[Figure: new image descriptors quantized into the same cells as those of Image 1 and Image 2]
55. Representation: bag of (visual) words
Visual words are ‘iconic’ image patches or fragments
• represent the frequency of word occurrence
• but not their position
Image → collection of visual words
56. Offline: assign visual words and compute histograms
for each key frame in the video
Detect patches → Normalize patch → Compute SIFT descriptor → Find nearest cluster centre
Represent the frame by a sparse histogram of visual word occurrences, e.g. [2, 0, 0, 1, 0, 1, …]
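A sketch of this per-frame histogram step (brute-force assignment for clarity; at scale the approximate nearest-neighbour search described later would be used):

import numpy as np

def bow_histogram(descriptors, vocabulary):
    # squared distance from every descriptor to every visual word
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)  # nearest word per descriptor
    return np.bincount(words, minlength=len(vocabulary))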
57. Offline: create an index
For fast search, store a “posting list” for the dataset
This maps word occurrences to the documents they occur in.
Posting list:
word 1 → frames 5, 10, …
word 2 → frames 10, …
…
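A minimal sketch of the posting-list index as a plain dictionary (a hypothetical helper, not the talk’s implementation):

from collections import defaultdict

def build_inverted_index(frame_histograms):
    index = defaultdict(list)  # visual word id -> frames containing it
    for frame_id, hist in enumerate(frame_histograms):
        for word_id, count in enumerate(hist):
            if count > 0:
                index[word_id].append(frame_id)
    return index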
58. At run time …
• User specifies a query region
• Generate a short list of frames using visual words in region
1. Accumulate all visual words within the query region
2. Use “book index” to find other frames with these words
3. Compute similarity for frames which share at least one word
Generates a tf-idf ranked list of all the frames in the dataset.
59. Image ranking using the bag-of-words model
For a vocabulary of size K, each image is represented by a K-vector
$v = (t_1, \ldots, t_K)^\top$, where $t_i$ is the (weighted) number of
occurrences of visual word $i$.
Images are ranked by the normalized scalar product between the query
vector $v_q$ and all vectors in the database $v_d$:
$$\text{sim}(v_q, v_d) = \frac{v_q \cdot v_d}{\|v_q\| \, \|v_d\|}$$
The scalar product can be computed efficiently using the inverted file.
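A sketch of the weighting and ranking; the standard tf-idf and cosine formulation is assumed here (the exact weighting in the talk follows Sivic and Zisserman 2003):

import numpy as np

def tfidf_vectors(histograms):
    h = np.asarray(histograms, dtype=float)
    df = (h > 0).sum(axis=0)                    # document frequency per word
    idf = np.log(len(h) / np.maximum(df, 1))
    v = h * idf
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    return v / np.maximum(norms, 1e-12)         # L2-normalized tf-idf vectors

def rank_frames(query_vec, db_vecs):
    # normalized scalar product = cosine similarity, best first
    return np.argsort(-(db_vecs @ query_vec))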
60. Summary: Match histograms of visual words
[Diagram: frames → regions → invariant descriptor vectors → quantize → single vector (histogram)]
1. Compute affine covariant regions in each frame independently (offline)
2. “Label” each region by a vector of descriptors based on its intensity (offline)
3. Build histograms of visual words by descriptor quantization (offline)
4. Rank retrieved frames by matching visual word histograms using inverted files (online).
61. Films = common dataset
“Pretty Woman” “Casablanca”
“Groundhog Day” “Charade”
63. Example
Select a region and search in the film “Groundhog Day”; the retrieved shots are returned.
64. Visual words - advantages
Design of descriptors makes these words invariant to:
• illumination
• affine transformations (viewpoint)
Multiple local regions give immunity to partial occlusion.
Overlap encodes some structural information.
NB: no attempt is made to carry out a ‘semantic’ segmentation.
65. Example application – product placement
Sony logo from Google image
search on `Sony’
Retrieve shots from Groundhog Day
67. Outline
1. Object recognition cast as nearest neighbour matching
2. Object recognition cast as text retrieval
3. Large scale search and improving performance
• large vocabularies and approximate k-means
• query expansion
• soft assignment
4. Applications
5. The future and challenges
70. Oxford buildings dataset
Automatically crawled from Flickr
Dataset (i) consists of 5062 images, crawled by searching
for Oxford landmarks, e.g.
“Oxford Christ Church”
“Oxford Radcliffe camera”
“Oxford”
“Medium” resolution images (1024 x 768)
71. Oxford buildings dataset
Automatically crawled from Flickr
Consists of:
Dataset (i) crawled by searching for Oxford landmarks
Datasets (ii) and (iii) from other popular Flickr tags; these act as additional distractors.
72. Oxford buildings dataset
Landmarks plus queries used for evaluation
All Souls, Ashmolean, Balliol, Bodleian, Bridge of Sighs, Cornmarket, Keble, Magdalen, Radcliffe Camera, Thom Tower, University Museum
Ground truth obtained for 11 landmarks over 5062 images
Evaluate performance by mean Average Precision
73. Precision and Recall
• Precision: % of returned images that are relevant
• Recall: % of relevant images that are returned
[Diagram: relevant vs returned images within all images; plot: a precision-recall curve]
74. Average Precision
[Plot: precision-recall curve; AP summarizes the area under it]
• A good AP score requires both high recall and high precision
• Application-independent
Performance is measured by mean Average Precision (mAP) over 55 queries on the 100K or 1.1M image datasets.
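A sketch of Average Precision for a single query, assuming every relevant image appears somewhere in the ranked list:

def average_precision(ranked_relevance):
    # ranked_relevance: booleans, one per returned image, best rank first
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# mAP = mean of average_precision over all 55 queries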
75. Quantization / Clustering
K-means usually seen as a quick + cheap method
But far too slow for our needs – D~128, N~20M+, K~1M
Use approximate k-means: nearest neighbour search by
multiple, randomized k-d trees
76. K-means overview
K-means overview: initialize cluster centres, then iterate:
1. Find the nearest cluster centre to each datapoint (slow: O(NK))
2. Re-compute cluster centres as the centroids of their assigned points
K-means provably locally minimizes the sum of squared
errors (SSE) between a cluster centre and its points
Idea: nearest neighbour search is the bottleneck – use
approximate nearest neighbour search
77. Approximate K-means
Use multiple, randomized k-d trees for search
A k-d tree hierarchically decomposes the
descriptor space
Points nearby in the space can (hopefully) be found by backtracking
around the tree for a small number of steps.
A single tree works OK in low dimensions, but not so well in high dimensions.
78. Approximate K-means
Multiple randomized trees increase the chances of finding nearby points.
[Figure: query point and true nearest neighbour; the first two randomized trees miss the true neighbour (No, No), the third finds it (Yes)]
79. Approximate K-means
Use the best-bin-first strategy to determine which branch of the tree to
examine next; the priority queue is shared between the multiple trees, so
searching multiple trees is only slightly more expensive than searching one.
Original K-means complexity = O(N K)
Approximate K-means complexity = O(N log K)
This means we can scale to very large K
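A sketch of one approximate k-means iteration with the assignment step done through a tree. A single exact scipy k-d tree stands in for the multiple randomized trees with best-bin-first search used in the talk, and (as slide 77 warns) a single tree degrades in 128 dimensions:

import numpy as np
from scipy.spatial import cKDTree

def akm_iteration(points, centres):
    tree = cKDTree(centres)          # built once per iteration
    _, assign = tree.query(points)   # nearest-centre search, ~O(N log K)
    new_centres = centres.copy()
    for k in range(len(centres)):
        members = points[assign == k]
        if len(members) > 0:
            new_centres[k] = members.mean(axis=0)
    return new_centres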
81. Oxford buildings dataset
Landmarks plus queries used for evaluation: the same 11 landmarks as on slide 72.
Ground truth obtained for 11 landmarks over 5062 images.
Evaluate performance by mean Average Precision over 55 queries.
82. Approximate K-means
How accurate is the approximate search?
Performance on 5K image dataset for a random forest of 8 trees
Allows much larger clusterings than would be feasible with
standard K-means: N~17M points, K~1M
AKM: 8.3 CPU hours per iteration
Standard K-means: an estimated 2650 CPU hours per iteration
83. Performance against vocabulary size
Using large vocabularies gives a big boost in performance
(peak @ 1M words)
More discriminative vocabularies give:
• Better retrieval quality
• Increased search speed: documents share fewer words, so fewer documents need to be scored
84. Beyond Bag of Words
Use the position and shape of the underlying features
to improve retrieval quality
Both images have many matches – which is correct?
85. Beyond Bag of Words
We can measure spatial consistency between the
query and each result to improve retrieval quality
Many spatially consistent matches: correct result. Few spatially consistent matches: incorrect result.
86. Compute 2D affine transformation
• between the query region and target image:
$$\mathbf{x}' = A\mathbf{x} + \mathbf{t}$$
where $A$ is a 2×2 non-singular matrix
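A sketch of this verification step using OpenCV’s RANSAC affine estimation (RANSAC is an assumption; the slide only states that an affine transformation is computed). pts_query and pts_target are corresponding region centres from the putative visual-word matches:

import cv2
import numpy as np

def spatial_score(pts_query, pts_target):
    # pts_*: N x 2 float32 arrays of matched keypoint locations
    A, inlier_mask = cv2.estimateAffine2D(
        pts_query, pts_target, method=cv2.RANSAC, ransacReprojThreshold=3.0)
    n_inliers = 0 if inlier_mask is None else int(inlier_mask.sum())
    return n_inliers, A  # A is the 2x3 matrix [A | t]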
90. Beyond Bag of Words
Extra bonus – gives localization of the object
91. Example Results
[Figure: query region and example results]
Rank the short list of retrieved images by the number of correspondences.
92. Mean Average Precision variation with vocabulary size
vocab size | bag of words | spatial
50K   | 0.473 | 0.599
100K  | 0.535 | 0.597
250K  | 0.598 | 0.633
500K  | 0.606 | 0.642
750K  | 0.609 | 0.630
1M    | 0.618 | 0.645
1.25M | 0.602 | 0.625
93. Bag of visual words particular object retrieval
query image → Hessian-Affine regions + SIFT descriptors → set of SIFT descriptors → quantize against centroids (visual words) → sparse frequency vector with tf-idf weighting → query the inverted file [Chum et al 2007] → ranked image short-list → geometric verification [Lowe 04, Chum et al 2007]