2. Motivation and Objectives
• Search large unannotated datasets of 1M+ images for object categories
• Do so in real-time and without any prior knowledge
7. Proposed Solution
• Bootstrap training using images from the web
• Use highly compact ConvNet features + compression as the basis of an on-the-fly (OTF) system
• Plus: a novel GPU architecture for iterative on-the-fly learning
8. Architecture Outline
[Diagram: for a text query (e.g. "Car"), training images sourced from Google Image Search pass through an image encoder to give positive features φ(I+); together with a fixed pool of precomputed negative features φ(I−), these train a linear SVM with weights w. The target dataset (Flickr, Pinterest, etc.), with precomputed features φ(It), is then ranked by the score wᵀφ(It).]
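The pipeline in the diagram can be sketched in a few lines. This is a minimal illustration only, with scikit-learn's `LinearSVC` standing in for the paper's SVM solver and all names (`train_and_rank`, the synthetic features) being assumptions rather than the authors' code:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Sketch of the on-the-fly pipeline: web-sourced positives vs a fixed
# precomputed negative pool train a linear SVM, whose weight vector w
# then ranks precomputed target-dataset features by w^T phi(I_t).
def train_and_rank(pos_feats, neg_pool, target_feats, top_k=10):
    X = np.vstack([pos_feats, neg_pool])
    y = np.concatenate([np.ones(len(pos_feats)), -np.ones(len(neg_pool))])
    w = LinearSVC(C=1.0).fit(X, y).coef_.ravel()   # the model 'w'
    scores = target_feats @ w                      # w^T phi(I_t) per image
    return np.argsort(-scores)[:top_k]             # best-scoring images

# Toy example with well-separated synthetic "features":
rng = np.random.default_rng(0)
pos = rng.standard_normal((20, 32)) + 1.0          # query-class features
neg = rng.standard_normal((200, 32)) - 1.0         # fixed negative pool
targets = np.vstack([rng.standard_normal((10, 32)) + 1.0,
                     rng.standard_normal((90, 32)) - 1.0])
top = train_and_rank(pos, neg, targets, top_k=10)
```

Because the negative pool and target features are precomputed, only the positives need encoding at query time, which is what makes the pipeline fast.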
9. Need for Speed
[Diagram: the same pipeline as the Architecture Outline, highlighting that ranking the target dataset by wᵀφ(It) is the most critical stage.]
10. Fast Ranking = Compact Representation
Must compute wᵀx for all image features in the dataset, giving complexity O(ND) (N = number of images in the test set, D = dimensionality of the image representation), so it is important to reduce the representation dimensionality:
• Obtain a 128-D representation from the CNN (488 MB / 1M images)
• Then compress further using binarization (122 MB / 1M images)
• Or using product quantization (30.5 MB / 1M images)
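The O(ND) ranking stage is a single matrix–vector product. A minimal sketch (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def rank_dataset(w, features, top_k=100):
    """Score every image by w . x and return the indices and scores of
    the top_k images. features is an N x D matrix, w is a D-vector."""
    scores = features @ w            # N dot products: the O(N*D) step
    order = np.argsort(-scores)      # sort descending by score
    return order[:top_k], scores[order[:top_k]]

# With D = 128 and N = 1e6 this is one 1M x 128 mat-vec, which is why
# a compact representation keeps ranking interactive.
rng = np.random.default_rng(0)
feats = rng.standard_normal((10_000, 128)).astype(np.float32)
w = rng.standard_normal(128).astype(np.float32)
top_idx, top_scores = rank_dataset(w, feats, top_k=5)
```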
15. Compression
• Binarization by embedding into Hamming space:

  e : ℝᴰ → 𝔹ᴹ,   bᵢ = sgn(U xᵢ)

  where M > D and U is obtained by taking the first D columns of the QR-decomposition of a random M × M matrix
• Product Quantization: each D-dimensional vector is split into S subvectors of dimension d = D/S, and each subvector is quantized separately with its own codebook Q
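The binarization step above can be sketched directly from its definition. A minimal illustration (assuming M = 1024 bits for a 128-D feature, matching the 122 MB / 1M images figure; names are illustrative):

```python
import numpy as np

def make_projection(D, M, seed=0):
    """U = first D columns of the Q factor from the QR-decomposition of a
    random M x M matrix, so U @ x maps R^D into R^M."""
    assert M > D
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((M, M))
    Q, _ = np.linalg.qr(A)           # Q is M x M with orthonormal columns
    return Q[:, :D]                  # U has shape M x D

def binarize(U, x):
    """b = sgn(U x), with sgn mapped to {0, 1} bits."""
    return (U @ x > 0).astype(np.uint8)

U = make_projection(D=128, M=1024)
x = np.random.default_rng(1).standard_normal(128)
bits = binarize(U, x)                # a 1024-bit Hamming code (128 bytes packed)
```

Hamming distance between two codes then approximates the angle between the original features, so ranking can run on the compact codes.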
16. Evaluation Datasets
• PASCAL VOC 2007: 10,000 annotated images
• MIRFLICKR-1M: 1M unannotated images
• Want to evaluate CNN features for real-world photo retrieval
• Disjoint from ImageNet (as the CNN is trained on that) + with less focus on fine-grained retrieval
18. Evaluation Protocol
• Use the MIRFLICKR-1M dataset as distractors
• Remove false negatives and evaluate Precision @ K, where K = 100
• Or evaluate Precision @ K over MIRFLICKR-1M directly
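Precision @ K is simple to state in code. A minimal sketch, assuming a ranked list of image ids and a set of ground-truth positives (names are illustrative):

```python
def precision_at_k(ranked_ids, positive_ids, k=100):
    """Fraction of the top-k ranked images that are true positives."""
    top_k = ranked_ids[:k]
    hits = sum(1 for i in top_k if i in positive_ids)
    return hits / k

# Toy example: ids 0..199 ranked in order, every 4th id a true positive.
ranked = list(range(200))
positives = set(range(0, 200, 4))
p = precision_at_k(ranked, positives, k=100)   # → 0.25
```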
20. Retrieval Results
Results for two sample classes over VOC + distractor data
(retrieve ~500 images from within 1M images – true positives are 0.05% of the dataset)
• CNN 128 (Prec. 0.32 @ 100)  • CNN 128 (Prec. 0.77 @ 100)
24. VOC vs Google Training
• ‘Chair’ – CNN 128: Prec. 0.92 @ 100 (VOC training) vs 0.86 @ 100 (Google training)
• ‘Train’ – CNN 128: Prec. 1.0 @ 100 (VOC training) vs 1.0 @ 100 (Google training)
25. Instances & Faces too
[Diagram, instances: a RootSIFT extractor computes ψ(I) → xi for the N training images; descriptors are matched against the target dataset ψ(It) via a VQ encoder, a Hamming encoder and spatial verification, and the dataset is ranked by taking the max match score over the N training images.]
[Diagram, faces: a face extractor ψ(I) → If feeds a pre-trained face CNN, giving positive features φ(If+); together with a negative pool φ(I−), these train a linear SVM with weights w, which ranks face tracks φ(It) in the target dataset.]
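The "take max" ranking for instance search can be sketched as follows. This is an illustration only: plain cosine similarity stands in for the actual VQ/Hamming matching plus spatial verification, and all names are assumptions:

```python
import numpy as np

def rank_by_max_match(query_feats, target_feats):
    """query_feats: N x D (the N training images), target_feats: T x D.
    Score each target image by its best match over the N query images
    and return target indices sorted by that score."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sims = t @ q.T                   # T x N cosine-similarity matrix
    best = sims.max(axis=1)          # max over the N training images
    return np.argsort(-best), best

rng = np.random.default_rng(2)
queries = rng.standard_normal((5, 64))      # N = 5 training images
targets = rng.standard_normal((100, 64))    # T = 100 target images
order, scores = rank_by_max_match(queries, targets)
```

Taking the max (rather than the mean) lets a target image score highly if it matches any one of the query instances well.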
26. Live Demo
1. Landing page: the user enters a text query term and selects a search modality (e.g. ‘forest’ using object category search)
2. Querying: a live view of images downloaded from Google Image Search as they are used to construct a visual appearance model on-the-fly
3. Ranked results: a ranked list of visually matching images is displayed within 1–30 secs of entering the cold query
Try the system live over a dataset of 5M+ images sourced from BBC News footage at:
http://varro3.robots.ox.ac.uk:9090
27. ConvNet-based Architecture
Question: how can we adapt the standard GPU ConvNet pipeline for on-the-fly search?
We want:
• simultaneous feature computation / model training
• highly parallel operation by using a GPU-bound architecture
• Libraries such as Caffe allow for fast computation of ConvNet features entirely on the GPU
29. ConvNet-based Architecture
[Diagram: a CPU frontend samples batches of size B – B/2 positives as RGB images from the Google Image Search training images for the query (e.g. "Sheep"), and B/2 negatives from a fixed pool of precomputed CNN features. On the GPU backend, the conv + fc stacks compute CNN features for the positives, and an SVM loss layer computes the hinge-loss subgradient

  ∇ℓ = −(1/B) Σᵢ₌₁..B 𝟙[yᵢ wᵀxᵢ < 1] yᵢ xᵢ ]
30. ConvNet-based Architecture
[Diagram: as on the previous slide, with an image buffer added between the training-image source and the batch sampler on the CPU frontend, so newly downloaded images can be injected while training runs.]
31. ConvNet-based Architecture
[Diagram: as on the previous slide, with the target dataset (MIRFLICKR, precomputed CNN features) attached to the GPU backend via an inner-product layer, so that every τ secs the current model w re-ranks the dataset.]
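The SVM loss layer amounts to one (sub)gradient step on the hinge loss per mini-batch. A minimal CPU sketch in numpy (the GPU version applies the same update; function names and hyperparameters here are illustrative):

```python
import numpy as np

def svm_sgd_step(w, X, y, lr=0.1, reg=1e-4):
    """One step on a batch X (B x D) with labels y in {-1, +1}, using the
    hinge-loss subgradient  grad = -(1/B) sum_i 1[y_i w.x_i < 1] y_i x_i."""
    margins = y * (X @ w)               # y_i * w . x_i
    active = margins < 1                # indicator 1[y_i w.x_i < 1]
    grad = -(active * y) @ X / len(y)   # average over the batch
    grad += reg * w                     # small L2 regularisation
    return w - lr * grad

# Toy batch: B/2 positives, B/2 negatives, made roughly separable.
rng = np.random.default_rng(3)
D, B = 128, 64
X = rng.standard_normal((B, D))
y = np.where(rng.random(B) < 0.5, -1.0, 1.0)
X += y[:, None] * 0.5                   # shift classes apart
w = np.zeros(D)
for _ in range(100):
    w = svm_sgd_step(w, X, y)
```

Because the update touches only the current batch, new positives arriving from the web can be folded in between steps, which is what makes iterative on-the-fly training possible.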
32. Retrieval Results
• Images are fed into the network at a rate of 12 per second
• Dataset is ranked with the current model every ~0.2 seconds
• Most rankings stabilise in under 1 second
[Plot: Precision@100 vs. seconds since query, with curves for 10, 20 and 30 training images; example ranking snapshots for the queries sofa, sheep, bus and horse at 0.15 s, 0.36 s, 0.54 s and 0.73 s.]
43. Continued Work
Currently working on the following extensions:
• How to select negative training images more intelligently (e.g. selecting the most discriminative negative images per query from a larger 1M+ pool of non-class images)
• How to establish a confidence measure for images in the output ranking, so we know when a query works well and can source training images more intelligently
• Query attribute refinement (e.g. ‘sporty’ + ‘car’)
44. Related Publications
• “On-the-fly Learning for Visual Search of Large-scale Image and Video Datasets”, Ken Chatfield, Relja Arandjelovic, Omkar Parkhi, Andrew Zisserman, IJMIR 2015
• “Efficient On-the-fly Category Retrieval using ConvNets and GPUs”, Ken Chatfield, Karen Simonyan, Andrew Zisserman, ACCV 2014
• “Return of the Devil in the Details: Delving Deep into Convolutional Nets”, Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, BMVC 2014 (Best Paper Prize)
http://www.robots.ox.ac.uk/~ken