Andrew Zisserman Talk - Part 1a


  1. 1. Visual search and recognition, Part I – large scale instance search. Andrew Zisserman, Visual Geometry Group, University of Oxford, http://www.robots.ox.ac.uk/~vgg. Slides with Josef Sivic
  2. 2. Overview. Part I: Instance level recognition • e.g. find this car in a dataset of images, and the dataset may contain 1M images …
  3. 3. Overview. Part II: Category level recognition • e.g. find images that contain cars • or cows … and localize the occurrences in each image …
  4. 4. Problem specification: particular object retrieval. Example: visual search in feature films. Visually defined query: “Groundhog Day” [Ramis, 1993]: “Find this clock”, “Find this place”
  5. 5. Particular Object Search. Find these objects ... in these images and 1M more. Search the web with a visual query …
  6. 6. The need for visual search. Flickr has more than 5 billion photographs, with more than 1 million added daily. Company collections. Personal collections: 10000s of digital camera photos and video clips. The vast majority will have minimal, if any, textual annotation.
  7. 7. Why is it difficult? Problem: find particular occurrences of an object in a very large dataset of images. Want to find the object despite possibly large changes in scale, viewpoint, lighting and partial occlusion.
  8. 8. Outline: 1. Object recognition cast as nearest neighbour matching • Covariant feature detection • Feature descriptors (SIFT) and matching; 2. Object recognition cast as text retrieval; 3. Large scale search and improving performance; 4. Applications; 5. The future and challenges
  9. 9. Visual problem • Retrieve image/key frames containing the same object as the query. Approach: Determine regions (detection) and vector descriptors in each frame which are invariant to camera viewpoint changes. Match descriptors between frames using invariant vectors.
  10. 10. Example of visual fragments Image content is transformed into local fragments that are invariant to translation, rotation, scale, and other imaging parameters • Fragments generalize over viewpoint and lighting Lowe ICCV 1999
  11. 11. Detection Requirements. Detected image regions must cover the same scene region in different views • detection must commute with viewpoint transformation • i.e. detection is viewpoint covariant • NB detection is computed in each image independently
  12. 12. Scale-invariant feature detection. Goal: independently detect corresponding regions in scaled versions of the same image. Need a scale selection mechanism for finding a characteristic region size that is covariant with the image transformation (Laplacian).
  13. 13. Scale-invariant features: Blobs. Slides from Svetlana Lazebnik
  14. 14. Recall: Edge detection. Convolve the signal f with the derivative of a Gaussian, (d/dx) g; an edge is a maximum of the derivative f ∗ (d/dx) g. Source: S. Seitz
  15. 15. Edge detection, take 2. Convolve f with the second derivative of a Gaussian, (d²/dx²) g (the Laplacian); an edge is a zero crossing of the second derivative f ∗ (d²/dx²) g. Source: S. Seitz
  16. 16. From edges to ‘top hat’ (blobs). Blob = top hat = superposition of two step edges. Spatial selection: the magnitude of the Laplacian response will achieve a maximum at the center of the blob, provided the scale of the Laplacian is “matched” to the scale of the blob.
  17. 17. Scale selection. We want to find the characteristic scale of the blob by convolving it with Laplacians at several scales and looking for the maximum response. However, the Laplacian response decays as scale increases (original signal, increasing σ, blob radius = 8). Why does this happen?
  18. 18. Scale normalization. The response of a derivative of Gaussian filter to a perfect step edge decreases as σ increases; the peak response is 1/(σ√(2π)).
  19. 19. Scale normalization. The response of a derivative of Gaussian filter to a perfect step edge decreases as σ increases. To keep the response the same (scale-invariant), we must multiply the Gaussian derivative by σ. The Laplacian is the second Gaussian derivative, so it must be multiplied by σ².
  20. 20. Effect of scale normalization: original signal; unnormalized Laplacian response (decays as σ increases); scale-normalized Laplacian response (maximum at the characteristic scale).
  21. 21. Blob detection in 2D. Laplacian of Gaussian: circularly symmetric operator for blob detection in 2D: ∇²g = ∂²g/∂x² + ∂²g/∂y²
  22. 22. Blob detection in 2D. Laplacian of Gaussian: circularly symmetric operator for blob detection in 2D. Scale-normalized: ∇²_norm g = σ² (∂²g/∂x² + ∂²g/∂y²)
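The scale-normalized LoG can be computed directly from the formula above. A minimal sketch (not from the talk), assuming SciPy's `gaussian_laplace`; the scale range and the placeholder image are illustrative only:

```python
import numpy as np
from scipy import ndimage

def scale_normalized_log(img, sigmas):
    """Return a stack of scale-normalized LoG responses, one per sigma."""
    responses = []
    for sigma in sigmas:
        # gaussian_laplace computes LoG; multiply by sigma^2 for scale normalization
        log = ndimage.gaussian_laplace(img, sigma=sigma) * sigma ** 2
        responses.append(log)
    return np.stack(responses, axis=0)

sigmas = np.geomspace(1.0, 16.0, num=10)       # illustrative scale range
img = np.random.rand(128, 128)                  # placeholder image
stack = scale_normalized_log(img, sigmas)
# Characteristic scale at a pixel = sigma with the strongest normalized response
char_scale_idx = np.argmax(np.abs(stack), axis=0)
```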
  23. 23. Characteristic scale. We define the characteristic scale as the scale that produces the peak of the Laplacian response. T. Lindeberg (1998). “Feature detection with automatic scale selection.” International Journal of Computer Vision 30(2): 77–116.
  24. 24. Scale selection. Scale invariance of the characteristic scale (normalized Laplacian vs scale for two image scales, with peaks s₁∗ and s₂∗) • Relation between characteristic scales: s · s₁∗ = s₂∗
  25. 25. Scale invariance. Mikolajczyk and Schmid ICCV 2001 • Multi-scale extraction of Harris interest points • Selection of characteristic scale in Laplacian scale space. Characteristic scale: maximum in scale space; scale invariant.
  26. 26. What class of transformations are required? 2D transformation models: Similarity (translation, scale, rotation); Affine; Projective (homography)
  27. 27. Motivation for affine transformations: View 1 vs View 2: not the same region
  28. 28. Local invariance requirements. Geometric: 2D affine transformation x′ = A x + t, where A is a 2x2 non-singular matrix • Objective: compute image descriptors invariant to this class of transformations
  29. 29. Viewpoint covariant detection • Characteristic scales (size of region): Lindeberg and Garding ECCV 1994; Lowe ICCV 1999; Mikolajczyk and Schmid ICCV 2001 • Affine covariance (shape of region): Baumberg CVPR 2000; Matas et al BMVC 2002 (Maximally stable regions); Mikolajczyk and Schmid ECCV 2002 (Shape adapted regions); Schaffalitzky and Zisserman ECCV 2002 (“Harris affine”); Tuytelaars and Van Gool BMVC 2000; Mikolajczyk et al., IJCV 2005
  30. 30. Example affine covariant region: Maximally Stable Regions (MSR). 1. Segment using the watershed algorithm, and track connected components as the threshold value varies. 2. An MSR is detected when the area of the component is stationary. See Matas et al BMVC 2002
  31. 31. Maximally stable regions: varying threshold (sub-image; first image; second image; area vs threshold; change of area vs threshold)
  32. 32. Example: Maximally stable regions
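For reference, OpenCV ships an MSER implementation. A minimal sketch, not from the talk, assuming default stability parameters and a synthetic test image:

```python
import cv2
import numpy as np

# Synthetic test image: a dark blob on a white background
img = np.full((200, 200), 255, dtype=np.uint8)
cv2.circle(img, (100, 100), 30, 0, -1)

mser = cv2.MSER_create()                     # default stability settings (assumption)
regions, bboxes = mser.detectRegions(img)    # pixel lists and bounding boxes
print(f"detected {len(regions)} maximally stable regions")
```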
  33. 33. Example of affine covariant regions: 1000+ regions per image (Shape adapted regions; Maximally stable regions) • a region’s size and shape are not fixed, but • automatically adapt to the image intensity to cover the same physical surface • i.e. the pre-image is the same surface region
  34. 34. Viewpoint invariant description • Elliptical viewpoint covariant regions: Shape Adapted regions; Maximally Stable Regions • Map ellipse to circle and orientate by dominant direction • Represent each region by a SIFT descriptor (128-vector) [Lowe 1999] • see Mikolajczyk and Schmid CVPR 2003 for a comparison of descriptors
  35. 35. Local descriptors – rotation invariance. Estimation of the dominant orientation: • extract gradient orientation • histogram over gradient orientation (0 to 2π) • peak in this histogram. Rotate the patch to the dominant direction.
  36. 36. Descriptors – SIFT [Lowe ’99]: distribution of the gradient over an image patch (image patch → gradient → 3D histogram over x, y and orientation). 4x4 location grid and 8 orientations (128 dimensions). Very good performance in image matching [Mikolajczyk and Schmid ’03]
  37. 37. Summary – detection and description: Extract affine regions → Normalize regions → Eliminate rotation → Compute appearance descriptors: SIFT (Lowe ’04)
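A minimal sketch of the detect-and-describe stage using OpenCV's SIFT. Note that this uses the DoG scale-space detector rather than the affine-adapted or maximally stable regions on the slides, and the synthetic image is a placeholder:

```python
import cv2
import numpy as np

# Placeholder image: blurred noise stands in for a video frame
img = (np.random.rand(480, 640) * 255).astype(np.uint8)
img = cv2.GaussianBlur(img, (0, 0), 3)

sift = cv2.SIFT_create()                            # DoG detector + 128-D descriptor
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)            # N keypoints, N x 128 descriptors
```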
  38. 38. Visual problem • Retrieve image/key frames containing the same object as the query. Approach: Determine regions (detection) and vector descriptors in each frame which are invariant to camera viewpoint changes. Match descriptors between frames using invariant vectors.
  39. 39. Outline of an object retrieval strategy: frames → regions → invariant descriptor vectors. 1. Compute regions in each image independently. 2. “Label” each region by a descriptor vector from its local intensity neighbourhood. 3. Find corresponding regions by matching to the closest descriptor vector. 4. Score each frame in the database by the number of matches. Finding corresponding regions is transformed to finding nearest neighbour vectors.
  40. 40. Example: In each frame independently, determine elliptical regions (detection covariant with camera viewpoint) and compute a SIFT descriptor for each region [Lowe ‘99]. 1000+ descriptors per frame (Harris-affine; Maximally stable regions)
  41. 41. Object recognition. Establish correspondences between the object model image and the target image by nearest neighbour matching on SIFT vectors (model image and target image in the 128D descriptor space).
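A minimal sketch of nearest-neighbour matching of SIFT descriptors between a model image and a target image, with Lowe's ratio test; the synthetic images and the 0.8 ratio threshold are assumptions, not values from the talk:

```python
import cv2
import numpy as np

# Placeholder "model" and "target": the target is a shifted copy of the model
model = (np.random.rand(300, 300) * 255).astype(np.uint8)
model = cv2.GaussianBlur(model, (0, 0), 3)
target = np.roll(model, (20, 35), axis=(0, 1))

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(model, None)
kp2, des2 = sift.detectAndCompute(target, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)          # two nearest neighbours per descriptor
good = [p[0] for p in knn
        if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]   # ratio test
print(f"{len(good)} putative correspondences")
```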
  42. 42. Match regions between frames using SIFT descriptors • Multiple fragments overcome the problem of partial occlusion • Transfer the query box to localize the object (Harris-affine; Maximally stable regions). Now, convert this approach to a text retrieval representation.
  43. 43. Outline: 1. Object recognition cast as nearest neighbour matching; 2. Object recognition cast as text retrieval • bag of words model • visual words; 3. Large scale search and improving performance; 4. Applications; 5. The future and challenges
  44. 44. Success of text retrieval • efficient • high precision • scalable. Can we use retrieval mechanisms from text for visual retrieval? • ‘Visual Google’ project, 2003+. For a million+ images: • scalability • high precision • high recall: can we retrieve all occurrences in the corpus?
  45. 45. Text retrieval lightning tour. Stemming: represent words by stems, e.g. “walking”, “walks” → “walk”. Stop-list: reject the very common words, e.g. “the”, “a”, “of”. Inverted file (ideal book index): term → list of hits (occurrences in documents), e.g. People → [d1: hit hit hit], [d4: hit hit] …; Common → [d1: hit hit], [d3: hit], [d4: hit hit hit] …; Sculpture → [d2: hit], [d3: hit hit hit] … • word matches are pre-computed
  46. 46. Ranking • frequency of words in document (tf-idf) • proximity weighting (Google) • PageRank (Google). Need to map feature descriptors to “visual words”.
  47. 47. Build a visual vocabulary for a movie. Vector quantize descriptors • k-means clustering (SIFT 128D). Implementation: • compute SIFT features on frames from 48 shots of the film • 6K clusters for Shape Adapted regions • 10K clusters for Maximally Stable regions
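A minimal sketch of vocabulary building by k-means over pooled SIFT descriptors; the descriptor pool, the vocabulary size, and the use of scikit-learn's MiniBatchKMeans (rather than whatever clustering the authors used) are assumptions:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Placeholder descriptor pool; in practice, stack SIFT descriptors from many frames
all_descriptors = np.random.rand(50000, 128).astype(np.float32)

K = 1000   # illustrative; the slide uses 6K-10K clusters per detector type
vocab_model = MiniBatchKMeans(n_clusters=K, batch_size=5000, n_init=3, random_state=0)
vocab_model.fit(all_descriptors)
vocabulary = vocab_model.cluster_centers_     # K x 128 visual word centres
```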
  48. 48. Samples of visual words (clusters on SIFT descriptors): Shape adapted regions Maximally stable regions generic examples – cf textons
  49. 49. Samples of visual words (clusters on SIFT descriptors): More specific example
  50. 50. Visual words: quantize descriptor space (Sivic and Zisserman, ICCV 2003). Nearest neighbour matching • expensive to do for all frames (Image 1 and Image 2 in the 128D descriptor space)
  51. 51. Visual words: quantize descriptor space (Sivic and Zisserman, ICCV 2003). Nearest neighbour matching is expensive to do for all frames. Vector quantize descriptors: each descriptor is labelled by its cluster, e.g. visual words 5 and 42
  52. 52. Visual words: quantize descriptor space (Sivic and Zisserman, ICCV 2003). Vector quantize descriptors: a new image is added to the descriptor space
  53. 53. Visual words: quantize descriptor space (Sivic and Zisserman, ICCV 2003). Vector quantize descriptors: the new image’s descriptors take the labels of their nearest cluster centres (e.g. visual words 5 and 42)
  54. 54. Vector quantize the descriptor space (SIFT) The same visual word
  55. 55. Representation: bag of (visual) words. Visual words are ‘iconic’ image patches or fragments • represent the frequency of word occurrence • but not their position. Image → collection of visual words
  56. 56. Offline: assign visual words and compute histograms for each key frame in the video. Detect patches → normalize patch → compute SIFT descriptor → find nearest cluster centre. Represent the frame by a sparse histogram of visual word occurrences, e.g. [2 0 0 1 0 1 …]
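A minimal sketch of the assignment step: each descriptor is mapped to its nearest cluster centre and the frame is summarized by a word-count histogram. The vocabulary and descriptor arrays are placeholders:

```python
import numpy as np
from scipy.spatial import cKDTree

K = 1000
vocabulary = np.random.rand(K, 128).astype(np.float32)            # placeholder centres
frame_descriptors = np.random.rand(1200, 128).astype(np.float32)  # one frame's SIFTs

tree = cKDTree(vocabulary)                        # index the cluster centres
_, word_ids = tree.query(frame_descriptors, k=1)  # nearest centre = visual word id
histogram = np.bincount(word_ids, minlength=K)    # mostly zeros, so store sparsely
```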
  57. 57. Offline: create an index. For fast search, store a “posting list” for the dataset. This maps word occurrences to the documents they occur in, e.g. word #1 → frames 5, 10, ...; word #2 → frame 10, ...
  58. 58. At run time … • User specifies a query region • Generate a short list of frames using visual words in the region: 1. Accumulate all visual words within the query region 2. Use the “book index” (posting list) to find other frames with these words 3. Compute similarity for frames which share at least one word. Generates a tf-idf ranked list of all the frames in the dataset
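A minimal sketch of an inverted file (posting list) and its use at query time; the tiny word-id lists are made-up illustrative data:

```python
from collections import defaultdict

# Toy data: frame id -> visual word ids appearing in that frame (illustrative only)
frames = {5: [1, 1, 2, 7], 10: [1, 2, 9], 11: [3, 4]}

# Offline: inverted file mapping each visual word to the frames containing it
inverted = defaultdict(set)
for frame_id, words in frames.items():
    for w in words:
        inverted[w].add(frame_id)

# Run time: only frames sharing at least one word with the query need scoring
query_words = [2, 9]
candidates = set().union(*(inverted[w] for w in query_words))
print(sorted(candidates))   # -> [5, 10]
```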
  59. 59. Image ranking using the bag-of-words model. For a vocabulary of size K, each image is represented by a K-vector v_d = (t_1, …, t_K)ᵀ, where t_i is the (weighted) number of occurrences of visual word i. Images are ranked by the normalized scalar product between the query vector v_q and all vectors in the database v_d: sim(v_q, v_d) = (v_q · v_d) / (‖v_q‖ ‖v_d‖). The scalar product can be computed efficiently using the inverted file.
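A minimal sketch of tf-idf weighting and ranking by the normalized scalar product, assuming word-count histograms for the query and the database images (toy data, not from the talk):

```python
import numpy as np

# Toy term-count data: rows are database images, columns are visual words
counts = np.random.randint(0, 4, size=(200, 50)).astype(float)
query_counts = np.random.randint(0, 4, size=50).astype(float)

# idf: down-weight visual words that appear in many images
idf = np.log(counts.shape[0] / (1.0 + (counts > 0).sum(axis=0)))

def tfidf_unit(v):
    tf = v / max(v.sum(), 1.0)              # term frequency
    w = tf * idf                            # tf-idf weighting
    return w / (np.linalg.norm(w) + 1e-12)  # unit-normalize for the scalar product

db = np.apply_along_axis(tfidf_unit, 1, counts)
scores = db @ tfidf_unit(query_counts)      # normalized scalar product per image
ranking = np.argsort(-scores)               # most similar database images first
```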
  60. 60. Summary: Match histograms of visual words. Frames → regions → invariant descriptor vectors → quantize → single vector (histogram). 1. Compute affine covariant regions in each frame independently (offline). 2. “Label” each region by a vector of descriptors based on its intensity (offline). 3. Build histograms of visual words by descriptor quantization (offline). 4. Rank retrieved frames by matching visual word histograms using inverted files.
  61. 61. Films = common dataset “Pretty Woman” “Casablanca” “Groundhog Day” “Charade”
  62. 62. Video Google Demo
  63. 63. Example: select a region and search in the film “Groundhog Day”; retrieved shots
  64. 64. Visual words – advantages. Design of descriptors makes these words invariant to: • illumination • affine transformations (viewpoint). Multiple local regions give immunity to partial occlusion. Overlap encodes some structural information. NB: no attempt to carry out a ‘semantic’ segmentation
  65. 65. Example application – product placement. Sony logo from Google image search on ‘Sony’. Retrieve shots from Groundhog Day
  66. 66. Retrieved shots in Groundhog Day for search on Sony logo
  67. 67. Outline: 1. Object recognition cast as nearest neighbour matching; 2. Object recognition cast as text retrieval; 3. Large scale search and improving performance • large vocabularies and approximate k-means • query expansion • soft assignment; 4. Applications; 5. The future and challenges
  68. 68. Particular object search: find these landmarks ... in these images
  69. 69. Investigate … Vocabulary size: number of visual words in the range 10K to 1M. Use of spatial information to re-rank.
  70. 70. Oxford buildings dataset, automatically crawled from Flickr. Dataset (i) consists of 5062 images, crawled by searching for Oxford landmarks, e.g. “Oxford Christ Church”, “Oxford Radcliffe camera”, “Oxford”. “Medium” resolution images (1024 x 768)
  71. 71. Oxford buildings dataset, automatically crawled from Flickr. Consists of: Dataset (i), crawled by searching for Oxford landmarks; Datasets (ii) and (iii), from other popular Flickr tags, which act as additional distractors
  72. 72. Oxford buildings dataset: landmarks plus queries used for evaluation: All Souls, Ashmolean, Balliol, Bodleian, Thom Tower, Cornmarket, Bridge of Sighs, Keble, Magdalen, University Museum, Radcliffe Camera. Ground truth obtained for 11 landmarks over 5062 images. Evaluate performance by mean Average Precision.
  73. 73. Precision – Recall • Precision: % of returned images that are relevant • Recall: % of relevant images that are returned (relevant vs returned images within all images; precision–recall curve)
  74. 74. Average Precision (AP: the area under the precision–recall curve) • A good AP score requires both high recall and high precision • Application-independent. Performance measured by mean Average Precision (mAP) over 55 queries on 100K or 1.1M image datasets
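A minimal sketch of Average Precision for a single ranked result list; the relevance labels are illustrative and the function assumes the full ranking is available:

```python
import numpy as np

def average_precision(relevance, num_relevant):
    """AP for one query: `relevance` is 1/0 per returned image in ranked order,
    `num_relevant` is the total number of relevant images in the dataset."""
    rel = np.asarray(relevance, dtype=float)
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / max(num_relevant, 1))

# Toy ranking: relevant images at ranks 1, 2 and 4, out of 3 relevant in total
print(average_precision([1, 1, 0, 1, 0, 0], num_relevant=3))   # -> ~0.92
# mAP = mean of AP over all queries (55 queries on the Oxford buildings dataset)
```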
  75. 75. Quantization / Clustering. K-means is usually seen as a quick + cheap method, but it is far too slow for our needs: D~128, N~20M+, K~1M. Use approximate k-means: nearest neighbour search by multiple, randomized k-d trees
  76. 76. K-means overview: initialize cluster centres, then iterate: find the nearest cluster centre for each datapoint (slow, O(N K)); re-compute cluster centres as centroids. K-means provably locally minimizes the sum of squared errors (SSE) between a cluster centre and its points. Idea: nearest neighbour search is the bottleneck – use approximate nearest neighbour search
  77. 77. Approximate K-means. Use multiple, randomized k-d trees for search. A k-d tree hierarchically decomposes the descriptor space. Points nearby in the space can be found (hopefully) by backtracking around the tree some small number of steps. A single tree works OK in low dimensions – not so well in high dimensions
  78. 78. Approximate K-means. Multiple randomized trees increase the chances of finding nearby points (query point vs true nearest neighbour; true nearest neighbour found? No / No / Yes)
  79. 79. Approximate K-means. Use the best-bin-first strategy to determine which branch of the tree to examine next; share this priority queue between multiple trees – searching multiple trees is only slightly more expensive than searching one. Original K-means complexity = O(N K); approximate K-means complexity = O(N log K). This means we can scale to very large K
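A minimal sketch of the approximate nearest-centre assignment step using FLANN's randomized k-d trees through OpenCV, followed by one centroid update; the data sizes and the `trees=8` / `checks=64` settings are assumptions, not the talk's exact configuration:

```python
import numpy as np
import cv2

# Toy data: descriptors to cluster and the current cluster centres
points = np.random.rand(20000, 128).astype(np.float32)
centres = np.random.rand(1000, 128).astype(np.float32)

# FLANN with 8 randomized k-d trees and a bounded number of backtracking checks
FLANN_INDEX_KDTREE = 1
flann = cv2.FlannBasedMatcher(dict(algorithm=FLANN_INDEX_KDTREE, trees=8),
                              dict(checks=64))
matches = flann.knnMatch(points, centres, k=1)              # approximate 1-NN search
assignments = np.array([m[0].trainIdx for m in matches])

# One k-means update: each centre becomes the mean of the points assigned to it
for c in range(len(centres)):
    members = points[assignments == c]
    if len(members):
        centres[c] = members.mean(axis=0)
```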
  80. 80. Experimental evaluation for SIFT matching: http://www.cs.ubc.ca/~lowe/papers/09muja.pdf
  81. 81. Oxford buildings dataset: landmarks plus queries used for evaluation: All Souls, Ashmolean, Balliol, Bodleian, Thom Tower, Cornmarket, Bridge of Sighs, Keble, Magdalen, University Museum, Radcliffe Camera. Ground truth obtained for 11 landmarks over 5062 images. Evaluate performance by mean Average Precision over 55 queries.
  82. 82. Approximate K-means. How accurate is the approximate search? Performance on the 5K image dataset for a random forest of 8 trees. Allows much larger clusterings than would be feasible with standard K-means: N~17M points, K~1M. AKM – 8.3 cpu hours per iteration; standard K-means – estimated 2650 cpu hours per iteration
  83. 83. Performance against vocabulary size. Using large vocabularies gives a big boost in performance (peak @ 1M words). More discriminative vocabularies give: better retrieval quality; increased search speed – documents share fewer words, so fewer documents need to be scored
  84. 84. Beyond Bag of Words. Use the position and shape of the underlying features to improve retrieval quality. Both images have many matches – which is correct?
  85. 85. Beyond Bag of Words. We can measure spatial consistency between the query and each result to improve retrieval quality. Many spatially consistent matches – correct result; few spatially consistent matches – incorrect result
  86. 86. Compute 2D affine transformation x′ = A x + t • between query region and target image, where A is a 2x2 non-singular matrix
  87. 87. Estimating spatial correspondences: 1. Test each correspondence
  88. 88. Estimating spatial correspondences: 2. Compute a (restricted) affine transformation (5 dof)
  89. 89. Estimating spatial correspondences: 3. Score by the number of consistent matches. Use RANSAC on the full affine transformation (6 dof)
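A minimal sketch of spatial verification: estimate an affine transformation between matched point positions with RANSAC and score by the number of inliers. The synthetic correspondences and the reprojection threshold are assumptions:

```python
import numpy as np
import cv2

# Toy correspondences: target points are an affine transform of the query points,
# plus a handful of gross outliers standing in for wrong visual-word matches
rng = np.random.default_rng(0)
src = (rng.random((60, 2)) * 200).astype(np.float32)
A_true = np.array([[1.1, 0.1, 20.0], [-0.05, 0.95, 10.0]], dtype=np.float32)
dst = (src @ A_true[:, :2].T) + A_true[:, 2]
dst[:10] = (rng.random((10, 2)) * 200).astype(np.float32)     # outliers

A, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                  ransacReprojThreshold=3.0)
num_consistent = int(inliers.sum())   # spatial-consistency score used for re-ranking
print(num_consistent, "\n", A)        # A also localizes the query object
```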
  90. 90. Beyond Bag of Words. Extra bonus – gives localization of the object
  91. 91. Example results: query and example results. Rank the short list of retrieved images on the number of correspondences
  92. 92. Mean Average Precision variation with vocabulary size:
      vocab size   bag of words   spatial
      50K          0.473          0.599
      100K         0.535          0.597
      250K         0.598          0.633
      500K         0.606          0.642
      750K         0.609          0.630
      1M           0.618          0.645
      1.25M        0.602          0.625
  93. 93. Bag of visual words particular object retrieval pipeline: query image → Hessian-Affine regions + SIFT descriptors [Lowe 04, Chum & al 2007] → set of SIFT descriptors → visual words (nearest centroids) → sparse frequency vector with tf-idf weighting → inverted file querying → ranked image short-list → geometric verification [Chum & al 2007]
