2. Motivation and Objectives
• Search large unannotated datasets of 1M+ images for object categories
• Do so in real-time and without any prior knowledge
7. Proposed Solution
• Bootstrap training using images from the web
• Use highly compact ConvNet features + compression as the basis of an on-the-fly (OTF) system
• Plus: a novel GPU architecture for iterative on-the-fly learning
8. Architecture Outline
[Diagram: for a text query (e.g. "Car"), training images sourced from Google Image Search pass through an image encoder to give positive features φ(I+); together with a fixed pool of precomputed negative features φ(I−), these train a linear SVM with weights w. The target dataset (Flickr, Pinterest, etc.), with precomputed features φ(It), is then ranked by the score wᵀφ(It).]
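The pipeline in the diagram can be sketched in a few lines. This is a minimal illustration only, with scikit-learn's `LinearSVC` standing in for the paper's SVM solver and all names (`train_and_rank`, the synthetic features) being assumptions rather than the authors' code:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Sketch of the on-the-fly pipeline: web-sourced positives vs a fixed
# precomputed negative pool train a linear SVM, whose weight vector w
# then ranks precomputed target-dataset features by w^T phi(I_t).
def train_and_rank(pos_feats, neg_pool, target_feats, top_k=10):
    X = np.vstack([pos_feats, neg_pool])
    y = np.concatenate([np.ones(len(pos_feats)), -np.ones(len(neg_pool))])
    w = LinearSVC(C=1.0).fit(X, y).coef_.ravel()   # the model 'w'
    scores = target_feats @ w                      # w^T phi(I_t) per image
    return np.argsort(-scores)[:top_k]             # best-scoring images

# Toy example with well-separated synthetic "features":
rng = np.random.default_rng(0)
pos = rng.standard_normal((20, 32)) + 1.0          # query-class features
neg = rng.standard_normal((200, 32)) - 1.0         # fixed negative pool
targets = np.vstack([rng.standard_normal((10, 32)) + 1.0,
                     rng.standard_normal((90, 32)) - 1.0])
top = train_and_rank(pos, neg, targets, top_k=10)
```

Because the negative pool and target features are precomputed, only the positives need encoding at query time, which is what makes the pipeline fast.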
9. Need for Speed
[Diagram: the same pipeline as the Architecture Outline, highlighting that ranking the target dataset by wᵀφ(It) is the most critical stage.]
10. Fast Ranking = Compact Representation
Must compute wᵀx for all image features in the dataset, giving complexity O(ND) (N = number of images in the test set, D = dimensionality of the image representation), so it is important to reduce the representation dimensionality:
• Obtain a 128-D representation from the CNN (488 MB / 1M images)
• Then compress further using binarization (122 MB / 1M images)
• Or using product quantization (30.5 MB / 1M images)
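The O(ND) ranking stage is a single matrix–vector product. A minimal sketch (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def rank_dataset(w, features, top_k=100):
    """Score every image by w . x and return the indices and scores of
    the top_k images. features is an N x D matrix, w is a D-vector."""
    scores = features @ w            # N dot products: the O(N*D) step
    order = np.argsort(-scores)      # sort descending by score
    return order[:top_k], scores[order[:top_k]]

# With D = 128 and N = 1e6 this is one 1M x 128 mat-vec, which is why
# a compact representation keeps ranking interactive.
rng = np.random.default_rng(0)
feats = rng.standard_normal((10_000, 128)).astype(np.float32)
w = rng.standard_normal(128).astype(np.float32)
top_idx, top_scores = rank_dataset(w, feats, top_k=5)
```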
15. Compression
• Binarization by embedding into Hamming space:

  e : ℝᴰ → 𝔹ᴹ,   bᵢ = sgn(U xᵢ)

  where M > D and U is obtained by taking the first D columns of the QR-decomposition of a random M × M matrix
• Product Quantization: each D-dimensional vector is split into S subvectors of dimension d = D/S, and each subvector is quantized separately with its own codebook Q
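The binarization step above can be sketched directly from its definition. A minimal illustration (assuming M = 1024 bits for a 128-D feature, matching the 122 MB / 1M images figure; names are illustrative):

```python
import numpy as np

def make_projection(D, M, seed=0):
    """U = first D columns of the Q factor from the QR-decomposition of a
    random M x M matrix, so U @ x maps R^D into R^M."""
    assert M > D
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((M, M))
    Q, _ = np.linalg.qr(A)           # Q is M x M with orthonormal columns
    return Q[:, :D]                  # U has shape M x D

def binarize(U, x):
    """b = sgn(U x), with sgn mapped to {0, 1} bits."""
    return (U @ x > 0).astype(np.uint8)

U = make_projection(D=128, M=1024)
x = np.random.default_rng(1).standard_normal(128)
bits = binarize(U, x)                # a 1024-bit Hamming code (128 bytes packed)
```

Hamming distance between two codes then approximates the angle between the original features, so ranking can run on the compact codes.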
16. Evaluation Datasets
• PASCAL VOC 2007: 10,000 annotated images
• MIRFLICKR-1M: 1M unannotated images
• Want to evaluate CNN features for real-world photo retrieval
• Disjoint from ImageNet (as the CNN is trained on that) + with less focus on fine-grained retrieval
18. Evaluation Protocol
• Use the MIRFLICKR-1M dataset as distractors
• Remove false negatives and evaluate Precision @ K, where K = 100
• Or evaluate Precision @ K over MIRFLICKR-1M directly
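Precision @ K is simple to state in code. A minimal sketch, assuming a ranked list of image ids and a set of ground-truth positives (names are illustrative):

```python
def precision_at_k(ranked_ids, positive_ids, k=100):
    """Fraction of the top-k ranked images that are true positives."""
    top_k = ranked_ids[:k]
    hits = sum(1 for i in top_k if i in positive_ids)
    return hits / k

# Toy example: ids 0..199 ranked in order, every 4th id a true positive.
ranked = list(range(200))
positives = set(range(0, 200, 4))
p = precision_at_k(ranked, positives, k=100)   # → 0.25
```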
20. Retrieval Results
Results for two sample classes over VOC + distractor data
(retrieve ~500 images from within 1M images – true positives are 0.05% of the dataset)
• CNN 128 (Prec. 0.32 @ 100)  • CNN 128 (Prec. 0.77 @ 100)
24. VOC vs Google Training
• ‘Chair’ – CNN 128: Prec. 0.92 @ 100 (VOC training) vs 0.86 @ 100 (Google training)
• ‘Train’ – CNN 128: Prec. 1.0 @ 100 (VOC training) vs 1.0 @ 100 (Google training)
25. Instances & Faces too
[Diagram, instances: a RootSIFT extractor computes ψ(I) → xi for the N training images; descriptors are matched against the target dataset ψ(It) via a VQ encoder, a Hamming encoder and spatial verification, and the dataset is ranked by taking the max match score over the N training images.]
[Diagram, faces: a face extractor ψ(I) → If feeds a pre-trained face CNN, giving positive features φ(If+); together with a negative pool φ(I−), these train a linear SVM with weights w, which ranks face tracks φ(It) in the target dataset.]
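The "take max" ranking for instance search can be sketched as follows. This is an illustration only: plain cosine similarity stands in for the actual VQ/Hamming matching plus spatial verification, and all names are assumptions:

```python
import numpy as np

def rank_by_max_match(query_feats, target_feats):
    """query_feats: N x D (the N training images), target_feats: T x D.
    Score each target image by its best match over the N query images
    and return target indices sorted by that score."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sims = t @ q.T                   # T x N cosine-similarity matrix
    best = sims.max(axis=1)          # max over the N training images
    return np.argsort(-best), best

rng = np.random.default_rng(2)
queries = rng.standard_normal((5, 64))      # N = 5 training images
targets = rng.standard_normal((100, 64))    # T = 100 target images
order, scores = rank_by_max_match(queries, targets)
```

Taking the max (rather than the mean) lets a target image score highly if it matches any one of the query instances well.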
26. Live Demo
1. Landing page: the user enters a text query term and selects a search modality (e.g. ‘forest’ using object category search)
2. Querying: a live view of images downloaded from Google Image Search as they are used to construct a visual appearance model on-the-fly
3. Ranked results: a ranked list of visually matching images is displayed within 1–30 secs of entering the cold query
Try the system live over a dataset of 5M+ images sourced from BBC News footage at:
http://varro3.robots.ox.ac.uk:9090
27. ConvNet-based Architecture
Question: how can we adapt the standard GPU ConvNet pipeline for on-the-fly search?
We want:
• simultaneous feature computation / model training
• highly parallel operation by using a GPU-bound architecture
• Libraries such as Caffe allow for fast computation of ConvNet features entirely on the GPU
29. ConvNet-based Architecture
[Diagram: a CPU frontend samples batches of size B – B/2 positives as RGB images from the Google Image Search training images for the query (e.g. "Sheep"), and B/2 negatives from a fixed pool of precomputed CNN features. On the GPU backend, the conv + fc stacks compute CNN features for the positives, and an SVM loss layer computes the hinge-loss subgradient

  ∇ℓ = −(1/B) Σᵢ₌₁..B 𝟙[yᵢ wᵀxᵢ < 1] yᵢ xᵢ ]
30. ConvNet-based Architecture
[Diagram: as on the previous slide, with an image buffer added between the training-image source and the batch sampler on the CPU frontend, so newly downloaded images can be injected while training runs.]
31. ConvNet-based Architecture
[Diagram: as on the previous slide, with the target dataset (MIRFLICKR, precomputed CNN features) attached to the GPU backend via an inner-product layer, so that every τ secs the current model w re-ranks the dataset.]
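The SVM loss layer amounts to one (sub)gradient step on the hinge loss per mini-batch. A minimal CPU sketch in numpy (the GPU version applies the same update; function names and hyperparameters here are illustrative):

```python
import numpy as np

def svm_sgd_step(w, X, y, lr=0.1, reg=1e-4):
    """One step on a batch X (B x D) with labels y in {-1, +1}, using the
    hinge-loss subgradient  grad = -(1/B) sum_i 1[y_i w.x_i < 1] y_i x_i."""
    margins = y * (X @ w)               # y_i * w . x_i
    active = margins < 1                # indicator 1[y_i w.x_i < 1]
    grad = -(active * y) @ X / len(y)   # average over the batch
    grad += reg * w                     # small L2 regularisation
    return w - lr * grad

# Toy batch: B/2 positives, B/2 negatives, made roughly separable.
rng = np.random.default_rng(3)
D, B = 128, 64
X = rng.standard_normal((B, D))
y = np.where(rng.random(B) < 0.5, -1.0, 1.0)
X += y[:, None] * 0.5                   # shift classes apart
w = np.zeros(D)
for _ in range(100):
    w = svm_sgd_step(w, X, y)
```

Because the update touches only the current batch, new positives arriving from the web can be folded in between steps, which is what makes iterative on-the-fly training possible.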
32. Retrieval Results
• Images are fed into the network at a rate of 12 per second
• Dataset is ranked with the current model every ~0.2 seconds
• Most rankings stabilise in under 1 second
[Plot: Precision@100 vs. seconds since query, with curves for 10, 20 and 30 training images; example ranking snapshots for the queries sofa, sheep, bus and horse at 0.15 s, 0.36 s, 0.54 s and 0.73 s.]
43. Continued Work
Currently working on the following extensions:
• How to select negative training images more intelligently (e.g. selecting the most discriminative negative images per query from a larger 1M+ pool of non-class images)
• How to establish a confidence measure for images in the output ranking, so we know when a query works well and can source training images more intelligently
• Query attribute refinement (e.g. ‘sporty’ + ‘car’)
44. Related Publications
• “On-the-fly Learning for Visual Search of Large-scale Image and Video Datasets”, Ken Chatfield, Relja Arandjelovic, Omkar Parkhi, Andrew Zisserman, IJMIR 2015
• “Efficient On-the-fly Category Retrieval using ConvNets and GPUs”, Ken Chatfield, Karen Simonyan, Andrew Zisserman, ACCV 2014
• “Return of the Devil in the Details: Delving Deep into Convolutional Nets”, Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, BMVC 2014 (Best Paper Prize)
http://www.robots.ox.ac.uk/~ken