Visual search and recognition
         Part I – large scale instance search

                Andrew Zisserman
                   Visual Geometry Group
                     University of Oxford
              http://www.robots.ox.ac.uk/~vgg


Slides with Josef Sivic
Overview

Part I: Instance level recognition
   •   e.g. find this car in a dataset of images




and the dataset may contain 1M images …
Overview

Part II: Category level recognition
    •   e.g. find images that contain cars




    •   or cows …




and localize the occurrences in each image …
Problem specification: particular object retrieval


 Example: visual search in feature films


     Visually defined query        “Groundhog Day” [Ramis, 1993]


“Find this
  clock”




“Find this
  place”
Particular Object Search




Find these objects        ...in these images and 1M more

Search the web with a visual query …
The need for visual search


              Flickr: more than 5 billion photographs, with more than
              1 million added daily

              Company collections

              Personal collections: 10000s of digital camera photos
              and video clips


Vast majority will have minimal, if any, textual annotation.
Why is it difficult?
   Problem: find particular occurrences of an object in a very
large dataset of images

  Want to find the object despite possibly large changes in
scale, viewpoint, lighting and partial occlusion




            Scale                          Viewpoint




           Lighting                        Occlusion
Outline


1. Object recognition cast as nearest neighbour matching
   •   Covariant feature detection
   •   Feature descriptors (SIFT) and matching

2. Object recognition cast as text retrieval
3. Large scale search and improving performance
4. Applications
5. The future and challenges
Visual problem
• Retrieve image/key frames containing the same object


      query
                      ?




Approach
 Determine regions (detection) and vector descriptors in each
frame which are invariant to camera viewpoint changes

Match descriptors between frames using invariant vectors
Example of visual fragments
 Image content is transformed into local fragments that are invariant to
   translation, rotation, scale, and other imaging parameters




 • Fragments generalize over viewpoint and lighting
                                                        Lowe ICCV 1999
Detection Requirements
Detected image regions must cover the same scene region in different views



• detection must commute with                   viewpoint
                                             transformation
viewpoint transformation


• i.e. detection is viewpoint
covariant
                                         detection                 detection

• NB detection computed in
each image independently
                                                 viewpoint
                                              transformation
Scale-invariant feature detection
  Goal: independently detect corresponding regions in scaled versions
     of the same image

  Need scale selection mechanism for finding characteristic region size
     that is covariant with the image transformation




Laplacian
Scale-invariant features: Blobs




Slides from Svetlana Lazebnik
Recall: Edge detection



  f              :  signal containing an edge

  (d/dx) g       :  derivative of Gaussian

  f ∗ (d/dx) g   :  edge = maximum of the derivative

                                 Source: S. Seitz
Edge detection, take 2



  f                :  signal containing an edge

  (d²/dx²) g       :  second derivative of Gaussian (Laplacian)

  f ∗ (d²/dx²) g   :  edge = zero crossing of the second derivative

                                 Source: S. Seitz
From edges to `top hat’ (blobs)

Blob = top-hat = superposition of two step edges




                                                    maximum


Spatial selection: the magnitude of the Laplacian response
will achieve a maximum at the center of the blob, provided the
scale of the Laplacian is “matched” to the scale of the blob
Scale selection
     We want to find the characteristic scale of the blob by
    convolving it with Laplacians at several scales and looking
    for the maximum response

     However, Laplacian response decays as scale increases:




original signal     increasing σ
  (radius=8)

                     Why does this happen?
Scale normalization

The response of a derivative of Gaussian filter to a perfect
step edge decreases as σ increases



                   (peak response of the filter:  1 / (σ √(2π)) )
Scale normalization

 The response of a derivative of Gaussian filter to a perfect
step edge decreases as σ increases

To keep response the same (scale-invariant), must
multiply Gaussian derivative by σ

Laplacian is the second Gaussian derivative, so it must be
multiplied by σ2
Effect of scale normalization
Original signal          Unnormalized Laplacian response




                        Scale-normalized Laplacian response




              scale σ increasing
                                                   maximum
Blob detection in 2D

Laplacian of Gaussian: Circularly symmetric operator for
blob detection in 2D




                  ∇²g = ∂²g/∂x² + ∂²g/∂y²
Blob detection in 2D

Laplacian of Gaussian: Circularly symmetric operator for
blob detection in 2D




 Scale-normalized:     ∇²_norm g = σ² ( ∂²g/∂x² + ∂²g/∂y² )
Characteristic scale

We define the characteristic scale as the scale that
produces a peak of the Laplacian response




                       characteristic scale
T. Lindeberg (1998). "Feature detection with automatic scale selection."
International Journal of Computer Vision 30 (2): pp 77--116.
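To make the scale-selection idea concrete, here is a minimal sketch (my own illustration, not code from the talk) using numpy and scipy: convolve the image with Laplacian-of-Gaussian filters over a range of σ, multiply each response by σ² for scale normalization, and take the σ with the peak absolute response at the point of interest as the characteristic scale.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def characteristic_scale(image, point, sigmas):
    """Sigma giving the peak scale-normalized LoG response at `point`."""
    y, x = point
    responses = []
    for sigma in sigmas:
        log = gaussian_laplace(image.astype(float), sigma)  # un-normalized LoG
        responses.append(sigma ** 2 * abs(log[y, x]))       # scale normalization: multiply by sigma^2
    responses = np.asarray(responses)
    return sigmas[int(np.argmax(responses))], responses

# toy example: a bright disc of radius 8 on a dark background
yy, xx = np.mgrid[:64, :64]
img = ((yy - 32) ** 2 + (xx - 32) ** 2 <= 8 ** 2).astype(float)

sigmas = np.linspace(1.0, 16.0, 31)
s_star, _ = characteristic_scale(img, (32, 32), sigmas)
print(s_star)   # expect roughly radius / sqrt(2) ≈ 5.7
```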
Scale selection

Scale invariance of the characteristic scale

[Plots: scale-normalized Laplacian response vs. scale for an image and for the
same image scaled by a factor s; the responses peak at characteristic scales
s₁∗ and s₂∗ respectively]

• Relation between characteristic scales:  s · s₁∗ = s₂∗
Scale invariance          Mikolajczyk and Schmid ICCV 2001


     • Multi-scale extraction of Harris interest points

     • Selection of characteristic scale in Laplacian scale space




                                              Characteristic scale:
                                              - maximum in scale space
Laplacian                                     - scale invariant
What class of transformations are required?


2D transformation models
 Similarity
   (translation,
   scale, rotation)


 Affine


 Projective
   (homography)
Motivation for affine transformations

     View 1           View 2            View 2




                  Not same region
Local invariance requirements

 Geometric: 2D affine transformation

        x  ↦  A x + t

   where A is a 2x2 non-singular matrix and t is a translation


• Objective: compute image descriptors invariant to this
class of transformations
Viewpoint covariant detection
• Characteristic scales (size of region)
   • Lindeberg and Garding ECCV 1994
   • Lowe ICCV 1999
   • Mikolajczyk and Schmid ICCV 2001

• Affine covariance (shape of region)
   • Baumberg CVPR 2000
   • Matas et al BMVC 2002                   Maximally stable regions
   • Mikolajczyk and Schmid ECCV 2002
                                             Shape adapted regions
   • Schaffalitzky and Zisserman ECCV 2002       “Harris affine”
   • Tuytelaars and Van Gool BMVC 2000
   • Mikolajczyk et al., IJCV 2005
Example affine covariant region
  Maximally Stable regions (MSR)

                   first image           second image




1. Segment using watershed algorithm, and track connected components as
threshold value varies.
2. An MSR is detected when the area of the component is stationary
See Matas et al BMVC 2002
Maximally stable regions




 [Figure: a sub-image thresholded at a varying threshold value; plots of area vs.
 threshold and change of area vs. threshold for the tracked connected component
 in the first and second images]
Example: Maximally stable regions
Example of affine covariant regions




 1000+ regions per image                          Shape adapted regions

                                                  Maximally stable regions



• a region’s size and shape are not fixed, but
• automatically adapt to the image intensity to cover the same physical surface
• i.e. the pre-image is the same surface region
Viewpoint invariant description


• Elliptical viewpoint covariant regions
    • Shape Adapted regions
    • Maximally Stable Regions


• Map ellipse to circle and orientate by dominant direction


• Represent each region by SIFT descriptor (128-vector) [Lowe 1999]
    • see Mikolajczyk and Schmid CVPR 2003 for a comparison of descriptors
Local descriptors - rotation invariance
Estimation of the dominant orientation
   • extract gradient orientation
   • histogram over gradient orientation
   • peak in this histogram (orientations range over 0 to 2π)


Rotate patch in dominant direction
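A minimal sketch of the dominant-orientation step (illustrative, using numpy): histogram the gradient orientations of the patch, weighted by gradient magnitude, and return the centre of the peak bin.

```python
import numpy as np

def dominant_orientation(patch, n_bins=36):
    """Peak of the gradient-orientation histogram of a grey-level patch (radians)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx) % (2 * np.pi)          # orientations in [0, 2*pi)
    hist, edges = np.histogram(ori, bins=n_bins, range=(0, 2 * np.pi), weights=mag)
    peak = int(np.argmax(hist))
    return 0.5 * (edges[peak] + edges[peak + 1])    # centre of the peak bin
```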
Descriptors – SIFT [Lowe’99]
distribution of the gradient over an image patch

  [image patch → gradients → 3D histogram over (x, y, gradient orientation)]

4x4 location grid and 8 orientations (128 dimensions)

very good performance in image matching [Mikolajczyk and Schmid ’03]
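And a simplified sketch of the descriptor itself, assuming a patch that has already been normalized and rotated to its dominant orientation: a 4x4 grid of cells, each with an 8-bin orientation histogram, giving a 128-vector (the full SIFT descriptor also applies Gaussian weighting and trilinear interpolation, omitted here).

```python
import numpy as np

def sift_like_descriptor(patch):
    """Simplified 4x4 x 8-orientation descriptor (128-D) of a square grey patch."""
    patch = patch.astype(float)
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx) % (2 * np.pi)

    n = patch.shape[0]
    cell = n // 4
    desc = np.zeros((4, 4, 8))
    for i in range(4):
        for j in range(4):
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            o = ori[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            desc[i, j], _ = np.histogram(o, bins=8, range=(0, 2 * np.pi), weights=m)
    desc = desc.ravel()
    desc /= np.linalg.norm(desc) + 1e-12        # normalize for illumination invariance
    desc = np.minimum(desc, 0.2)                # clamp large values, as in Lowe '04
    desc /= np.linalg.norm(desc) + 1e-12
    return desc                                  # 128-dimensional vector
```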
Summary – detection and description


Extract affine regions   Normalize regions   Eliminate rotation   Compute appearance
                                                                     descriptors




                                                                   SIFT (Lowe ’04)
Visual problem
• Retrieve image/key frames containing the same object


      query
                      ?




Approach
 Determine regions (detection) and vector descriptors in each
frame which are invariant to camera viewpoint changes

Match descriptors between frames using invariant vectors
Outline of an object retrieval strategy
                                      regions

                                                                invariant
                                                               descriptor
                                                                 vectors
     frames

                                                                invariant
                                                               descriptor
                                                                 vectors

1.   Compute regions in each image independently
2.   “Label” each region by a descriptor vector from its local intensity neighbourhood
3.   Find corresponding regions by matching to closest descriptor vector
4.   Score each frame in the database by the number of matches

Finding corresponding regions transformed to finding nearest neighbour vectors
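Steps 3 and 4 above as a sketch (the function names and the ratio-test threshold are my choices; the slides only specify nearest-neighbour matching): match each query descriptor to its nearest neighbour in a frame, accept matches that pass Lowe's ratio test, and rank frames by the number of accepted matches.

```python
import numpy as np

def match_descriptors(query_desc, frame_desc, ratio=0.8):
    """Nearest-neighbour matching with Lowe's ratio test.

    query_desc : (Nq, 128) array, frame_desc : (Nf, 128) array, Nf >= 2.
    Returns a list of (query_index, frame_index) matches.
    """
    matches = []
    for qi, q in enumerate(query_desc):
        d = np.linalg.norm(frame_desc - q, axis=1)
        nn1, nn2 = np.argsort(d)[:2]
        if d[nn1] < ratio * d[nn2]:      # best match clearly better than second best
            matches.append((qi, int(nn1)))
    return matches

def score_frames(query_desc, frames):
    """Rank frames (a list of descriptor arrays) by their number of matches."""
    scores = [len(match_descriptors(query_desc, f)) for f in frames]
    return np.argsort(scores)[::-1], scores
```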
Example
 In each frame independently
  determine elliptical regions (detection covariant with camera viewpoint)
  compute SIFT descriptor for each region [Lowe ‘99]




1000+ descriptors per frame

      Harris-affine

      Maximally stable regions
Object recognition

Establish correspondences between object model image and target image by
nearest neighbour matching on SIFT vectors




                   Model image          128D descriptor      Target image
                                            space
Match regions between frames using SIFT descriptors




• Multiple fragments overcomes problem of partial occlusion
• Transfer query box to localize object

    Harris-affine
                               Now, convert this approach to a text
    Maximally stable regions   retrieval representation
Outline


1. Object recognition cast as nearest neighbour matching
2. Object recognition cast as text retrieval
   •   bag of words model
   •   visual words

3. Large scale search and improving performance
4. Applications
5. The future and challenges
Success of text retrieval


• efficient
• high precision
• scalable




Can we use retrieval mechanisms from text for visual retrieval?
  • ‘Visual Google’ project, 2003+

For a million+ images:
   • scalability
   • high precision
   • high recall: can we retrieve all occurrences in the corpus?
Text retrieval lightning tour

 Stemming        Represent words by their stems, e.g. “walking”, “walks”  →  “walk”



 Stop-list       Reject the very common words, e.g. “the”, “a”, “of”



 Inverted file




                       Ideal book index:   Term        List of hits (occurrences in documents)
                                           People      [d1:hit hit hit], [d4:hit hit] …
                                           Common      [d1:hit hit], [d3: hit], [d4: hit hit hit] …
                                           Sculpture   [d2:hit], [d3: hit hit hit] …

                 • word matches are pre-computed
Ranking     • frequency of words in document   (tf-idf)
            • proximity weighting (google)
            • PageRank (google)




Need to map feature descriptors to “visual words”.
Build a visual vocabulary for a movie

Vector quantize descriptors
   • k-means clustering



                                                     +


                                                 +

                          SIFT 128D                        SIFT 128D



Implementation
   • compute SIFT features on frames from 48 shots of the film
   • 6K clusters for Shape Adapted regions
   • 10K clusters for Maximally Stable regions
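A sketch of the vocabulary-building step; the library and parameters here are illustrative (the talk only specifies k-means, with 6K clusters for Shape Adapted and 10K for Maximally Stable regions).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# descriptors: (N, 128) array of SIFT vectors pooled over the training frames
descriptors = np.random.rand(50000, 128).astype(np.float32)   # stand-in data

kmeans = MiniBatchKMeans(n_clusters=10000, batch_size=10000, random_state=0)
kmeans.fit(descriptors)

visual_words = kmeans.cluster_centers_      # the vocabulary: 10K x 128 centres
word_ids = kmeans.predict(descriptors)      # quantize: nearest centre per descriptor
```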
Samples of visual words (clusters on SIFT descriptors):




        Shape adapted regions       Maximally stable regions


 generic examples – cf textons
Samples of visual words (clusters on SIFT descriptors):




     More specific example
Visual words: quantize descriptor space
                         Sivic and Zisserman, ICCV 2003




Nearest neighbour matching
  • expensive to do for all frames


                      Image 1            128D descriptor   Image 2
                                             space


Vector quantize descriptors
                                                      5
                                               42



   42               42          5                          42        5
    New image         Image 1            128D descriptor   Image 2
                                             space
Vector quantize the descriptor space (SIFT)




                        The same visual word
Representation: bag of (visual) words
Visual words are ‘iconic’ image patches or fragments
   • represent the frequency of word occurrence
   • but not their position




             Image
                                        Collection of visual words
Offline: assign visual words and compute histograms
 for each key frame in the video


  Pipeline:  Detect patches → Normalize patch → Compute SIFT descriptor →
             Find nearest cluster centre


            Example histogram: (2, 0, 0, 1, 0, 1, …)

Represent frame by sparse histogram
of visual word occurrences
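A minimal sketch of this histogram computation, assuming a vocabulary of K cluster centres: assign each descriptor of the frame to its nearest visual word and count word occurrences.

```python
import numpy as np

def bow_histogram(frame_desc, centres):
    """Length-K count vector of visual-word occurrences for one frame.

    frame_desc : (N, 128) descriptors, centres : (K, 128) visual words.
    """
    # squared distances via the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b,
    # avoiding an (N, K, 128) intermediate array
    d2 = ((frame_desc ** 2).sum(1)[:, None]
          + (centres ** 2).sum(1)[None, :]
          - 2.0 * frame_desc @ centres.T)
    words = d2.argmin(axis=1)                     # nearest visual word per descriptor
    return np.bincount(words, minlength=len(centres))
```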
Offline: create an index

For fast search, store a “posting list” for the dataset

This maps word occurrences to the documents they occur in


                                   Posting list
                                   word 1  →  frames 5, 10, ...
                                   word 2  →  frames 10, ...
                                   ...

      (visual words #1 and #2 shown occurring in frame #5 and frame #10)
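A sketch of the posting-list structure using plain Python dictionaries (my choice of data structure): for every visual word, store the list of frames it occurs in.

```python
from collections import defaultdict

def build_inverted_index(histograms):
    """histograms: dict frame_id -> length-K count vector.

    Returns dict visual_word -> list of frame ids containing that word.
    """
    index = defaultdict(list)
    for frame_id, hist in histograms.items():
        for word, count in enumerate(hist):
            if count > 0:
                index[word].append(frame_id)
    return index
```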
At run time …
•   User specifies a query region
•   Generate a short list of frames using visual words in region
     1. Accumulate all visual words within the query region
     2. Use “book index” to find other frames with these words
     3. Compute similarity for frames which share at least one word

                                                       Posting list
                                                       word 1  →  frames 5, 10, ...
                                                       word 2  →  frames 10, ...
                                                       ...

          frame #5                frame #10



    Generates a tf-idf ranked list of all the frames in dataset
Image ranking using the bag-of-words model

For a vocabulary of size K, each image is represented by a K-vector

        v_d = (t_1, ..., t_K)ᵀ

where t_i is the (weighted) number of occurrences of visual word i.


Images are ranked by the normalized scalar product (cosine similarity) between
the query vector v_q and all vectors in the database v_d:

        sim(v_q, v_d) = (v_q · v_d) / ( ||v_q|| ||v_d|| )


The scalar product can be computed efficiently using the inverted file.
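A sketch of the ranking computation under one standard tf-idf weighting (the exact weighting is not spelled out in the slides): weight and L2-normalize the histograms, then score only the database images that share at least one visual word with the query, visiting them through the inverted index.

```python
import numpy as np

def tfidf_vectors(counts):
    """counts: (n_images, K) matrix of raw visual-word counts.

    Returns L2-normalized tf-idf vectors (one common weighting; others exist).
    """
    n_images = counts.shape[0]
    df = (counts > 0).sum(axis=0)                        # document frequency per word
    idf = np.log(n_images / np.maximum(df, 1))
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    v = tf * idf
    return v / np.maximum(np.linalg.norm(v, axis=1, keepdims=True), 1e-12)

def rank_images(query_vec, db_vecs, inverted_index, query_words):
    """Score only the images sharing at least one visual word with the query.

    query_words : ids of visual words with non-zero count in the query region.
    """
    candidates = set()
    for w in query_words:
        candidates.update(inverted_index.get(w, []))
    scores = {i: float(query_vec @ db_vecs[i]) for i in candidates}  # normalized scalar product
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```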
Summary: Match histograms of visual words
    frames → regions → invariant descriptor vectors → quantize → single vector (histogram)

1.   Compute affine covariant regions in each frame independently (offline)
2.   “Label” each region by a vector of descriptors based on its intensity (offline)
3.   Build histograms of visual words by descriptor quantization (offline)
4.   Rank retrieved frames by matching vis. word histograms using inverted files.
Films = common dataset




       “Pretty Woman”    “Casablanca”




     “Groundhog Day”      “Charade”
Video Google Demo
Example: select a region, search in the film “Groundhog Day”, and retrieve the
shots containing the object.
Visual words - advantages

Design of descriptors makes these words invariant to:
   • illumination
   • affine transformations (viewpoint)


Multiple local regions give immunity to partial occlusion

Overlap encodes some structural information




                                   NB: no attempt to carry out a
                                   ‘semantic’ segmentation
Example application – product placement
Sony logo from Google image
search on `Sony’




 Retrieve shots from Groundhog Day
Retrieved shots in Groundhog Day for search on Sony logo
Outline


1. Object recognition cast as nearest neighbour matching
2. Object recognition cast as text retrieval
3. Large scale search and improving performance
   •   large vocabularies and approximate k-means
   •   query expansion
   •   soft assignment

4. Applications
5. The future and challenges
Particular object search




Find these landmarks   ...in these images
Investigate …


Vocabulary size: number of visual words in range 10K to 1M

Use of spatial information to re-rank
Oxford buildings dataset
  Automatically crawled from Flickr

   Dataset (i) consists of 5062 images, crawled by searching
for Oxford landmarks, e.g.
    “Oxford Christ Church”
    “Oxford Radcliffe camera”
    “Oxford”

  “Medium” resolution images (1024 x 768)
Oxford buildings dataset
 Automatically crawled from Flickr

 Consists of:




 Dataset (i) crawled by searching for Oxford landmarks

 Datasets (ii) and (iii) were crawled from other popular Flickr tags and act as
additional distractors
Oxford buildings dataset
          Landmarks plus queries used for evaluation

All Souls                             Bridge of
                                      Sighs
Ashmolean
                                      Keble

Balliol
                                      Magdalen

Bodleian                              University
                                      Museum
Thom
Tower
                                      Radcliffe
                                      Camera
Cornmarket


   Ground truth obtained for 11 landmarks over 5062 images

   Evaluate performance by mean Average Precision
Precision - Recall

• Precision: % of returned images that are relevant

• Recall: % of relevant images that are returned

[Venn diagram: relevant images and returned images within the set of all images]

[Plot: precision vs. recall curve]
Average Precision
[Plot: precision-recall curve; AP = area under the curve]

• A good AP score requires both high recall and high precision

• Application-independent
            Performance measured by mean Average Precision (mAP)
            over 55 queries on 100K or 1.1M image datasets
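For reference, a sketch of the evaluation measure (non-interpolated AP; interpolation conventions differ slightly between benchmarks): precision averaged at the rank of each relevant image, then the mean over queries.

```python
import numpy as np

def average_precision(ranked_ids, relevant_ids):
    """AP of one ranked result list against a set of relevant image ids."""
    relevant_ids = set(relevant_ids)
    hits, precisions = 0, []
    for rank, img in enumerate(ranked_ids, start=1):
        if img in relevant_ids:
            hits += 1
            precisions.append(hits / rank)      # precision at this recall point
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(all_rankings, all_relevant):
    """mAP over a list of queries (e.g. the 55 Oxford buildings queries)."""
    return float(np.mean([average_precision(r, g)
                          for r, g in zip(all_rankings, all_relevant)]))
```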
Quantization / Clustering


  K-means usually seen as a quick + cheap method


  But far too slow for our needs – D~128, N~20M+, K~1M

  Use approximate k-means: nearest neighbour search by
multiple, randomized k-d trees
K-means overview

  K-means overview:  Initialize cluster centres → find the nearest cluster to each
                     datapoint (slow, O(N K)) → re-compute cluster centres as the
                     centroids of their points → iterate


   K-means provably locally minimizes the sum of squared
errors (SSE) between cluster centres and their assigned points

  Idea: nearest neighbour search is the bottleneck – use
approximate nearest neighbour search
Approximate K-means

  Use multiple, randomized k-d trees for search


  A k-d tree hierarchically decomposes the
descriptor space


   Points nearby in the space can be found
(hopefully) by backtracking around the tree
some small number of steps


  Single tree works OK in low dimensions – not
so well in high dimensions
Approximate K-means


  Multiple randomized trees increase the chances of finding
nearby points


[Figure: a query point and its true nearest neighbour; a single tree often misses
the true neighbour (No, No), while additional randomized trees find it (Yes)]
Approximate K-means
   Use the best-bin first strategy to determine which branch of
the tree to examine next


  Share a single priority queue between the multiple trees – searching
multiple trees is only slightly more expensive than searching one


  Original K-means complexity       = O(N K)

   Approximate K-means complexity = O(N log K)

  This means we can scale to very large K
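A minimal sketch of approximate k-means. For brevity, the multiple randomized trees with a shared best-bin-first queue are replaced here by a single exact k-d tree (scipy's cKDTree), so only the assignment step changes relative to standard k-means.

```python
import numpy as np
from scipy.spatial import cKDTree

def approximate_kmeans(points, k, n_iter=10, seed=0):
    """K-means where the nearest-centre search is done with a k-d tree.

    The method in the talk uses several randomized k-d trees with a shared
    best-bin-first priority queue; a single exact tree is used in this sketch.
    """
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    assign = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        tree = cKDTree(centres)            # rebuild the tree over the current centres
        _, assign = tree.query(points)     # nearest centre per point: ~ O(N log K)
        for j in range(k):                 # re-compute centres as centroids
            members = points[assign == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return centres, assign
```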
Experimental evaluation for SIFT matching
http://www.cs.ubc.ca/~lowe/papers/09muja.pdf
Oxford buildings dataset
          Landmarks plus queries used for evaluation

All Souls                             Bridge of
                                      Sighs
Ashmolean
                                      Keble

Balliol
                                      Magdalen

Bodleian                              University
                                      Museum
Thom
Tower
                                      Radcliffe
                                      Camera
Cornmarket



  Ground truth obtained for 11 landmarks over 5062 images

  Evaluate performance by mean Average Precision over 55 queries
Approximate K-means
How accurate is the approximate search?

Performance on the 5K image dataset using a forest of 8 randomized k-d trees




   Allows much larger clusterings than would be feasible with
standard K-means: N~17M points, K~1M
      AKM – 8.3 cpu hours per iteration
      Standard K-means - estimated 2650 cpu hours per iteration
Performance against vocabulary size
   Using large vocabularies gives a big boost in performance
(peak @ 1M words)




  More discriminative vocabularies give:
      Better retrieval quality
      Increased search speed – documents share fewer words, so fewer
     documents need to be scored
Beyond Bag of Words
   Use the position and shape of the underlying features
to improve retrieval quality




  Both images have many matches – which is correct?
Beyond Bag of Words
  We can measure spatial consistency between the
query and each result to improve retrieval quality




Many spatially consistent      Few spatially consistent
matches – correct result      matches – incorrect result
Compute 2D affine transformation
  • between query region and target image

        x  ↦  A x + t

 where A is a 2x2 non-singular matrix and t is a translation
Estimating spatial correspondences
1. Test each correspondence
Estimating spatial correspondences
2. Compute a (restricted) affine transformation (5 dof)
Estimating spatial correspondences
  3. Score by number of consistent matches




Use RANSAC on full affine transformation (6 dof)
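A sketch of this verification step with a full 6-dof affine model estimated from 3 correspondences inside a RANSAC loop (the inlier threshold and iteration count are illustrative):

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform mapping src (N,2) to dst (N,2); needs N >= 3."""
    A = np.hstack([src, np.ones((len(src), 1))])      # rows [x y 1]
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)       # 3x2: stacked [A; t] for row vectors
    return M

def ransac_affine(src, dst, n_iter=500, thresh=5.0, seed=0):
    """Count spatially consistent matches between query and target regions."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    ones = np.ones((len(src), 1))
    for _ in range(n_iter):
        idx = rng.choice(len(src), size=3, replace=False)
        M = fit_affine(src[idx], dst[idx])            # hypothesis from a minimal sample
        pred = np.hstack([src, ones]) @ M
        inliers = np.linalg.norm(pred - dst, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # re-fit on all inliers of the best hypothesis
    M = fit_affine(src[best_inliers], dst[best_inliers]) if best_inliers.sum() >= 3 else None
    return int(best_inliers.sum()), M
```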
Beyond Bag of Words
Extra bonus – gives localization of the object
Example Results




     Query   Example
             Results


Rank short list of retrieved images on number of correspondences
Mean Average Precision variation with vocabulary size



  vocab size   bag of words   spatial
  50K          0.473          0.599
  100K         0.535          0.597
  250K         0.598          0.633
  500K         0.606          0.642
  750K         0.609          0.630
  1M           0.618          0.645
  1.25M        0.602          0.625
Bag of visual words particular object retrieval
  query image → Hessian-Affine regions + SIFT descriptors → set of SIFT descriptors
              → assign to nearest centroids (visual words) + tf-idf weighting
              → sparse frequency vector → query the inverted file
              → ranked image short-list → geometric verification

                                      [Chum et al. 2007]    [Lowe 04, Chum et al. 2007]

More Related Content

Viewers also liked

CV2011-2. Lecture 03. Photomontage, part 2.
CV2011-2. Lecture 03.  Photomontage, part 2.CV2011-2. Lecture 03.  Photomontage, part 2.
CV2011-2. Lecture 03. Photomontage, part 2.
Anton Konushin
 
CV2011-2. Lecture 08. Multi-view stereo.
CV2011-2. Lecture 08. Multi-view stereo.CV2011-2. Lecture 08. Multi-view stereo.
CV2011-2. Lecture 08. Multi-view stereo.
Anton Konushin
 
CV2011-2. Lecture 10. Pose estimation.
CV2011-2. Lecture 10.  Pose estimation.CV2011-2. Lecture 10.  Pose estimation.
CV2011-2. Lecture 10. Pose estimation.
Anton Konushin
 
Writing a computer vision paper
Writing a computer vision paperWriting a computer vision paper
Writing a computer vision paper
Anton Konushin
 
CV2011-2. Lecture 05. Video segmentation.
CV2011-2. Lecture 05.  Video segmentation.CV2011-2. Lecture 05.  Video segmentation.
CV2011-2. Lecture 05. Video segmentation.
Anton Konushin
 
CV2011-2. Lecture 06. Structure from motion.
CV2011-2. Lecture 06.  Structure from motion.CV2011-2. Lecture 06.  Structure from motion.
CV2011-2. Lecture 06. Structure from motion.
Anton Konushin
 
Classifier evaluation and comparison
Classifier evaluation and comparisonClassifier evaluation and comparison
Classifier evaluation and comparison
Anton Konushin
 
Computer vision infrastracture
Computer vision infrastractureComputer vision infrastracture
Computer vision infrastracture
Anton Konushin
 
CV2011-2. Lecture 02. Photomontage and graphical models.
CV2011-2. Lecture 02.  Photomontage and graphical models.CV2011-2. Lecture 02.  Photomontage and graphical models.
CV2011-2. Lecture 02. Photomontage and graphical models.
Anton Konushin
 
CV2011-2. Lecture 07. Binocular stereo.
CV2011-2. Lecture 07.  Binocular stereo.CV2011-2. Lecture 07.  Binocular stereo.
CV2011-2. Lecture 07. Binocular stereo.
Anton Konushin
 
CV2011-2. Lecture 09. Single view reconstructin.
CV2011-2. Lecture 09.  Single view reconstructin.CV2011-2. Lecture 09.  Single view reconstructin.
CV2011-2. Lecture 09. Single view reconstructin.
Anton Konushin
 
CV2011-2. Lecture 11. Face analysis.
CV2011-2. Lecture 11. Face analysis.CV2011-2. Lecture 11. Face analysis.
CV2011-2. Lecture 11. Face analysis.
Anton Konushin
 
CV2011-2. Lecture 12. Face models.
CV2011-2. Lecture 12.  Face models.CV2011-2. Lecture 12.  Face models.
CV2011-2. Lecture 12. Face models.
Anton Konushin
 
CV2011-2. Lecture 04. Semantic image segmentation
CV2011-2. Lecture 04.  Semantic image segmentationCV2011-2. Lecture 04.  Semantic image segmentation
CV2011-2. Lecture 04. Semantic image segmentation
Anton Konushin
 
Статистическое сравнение классификаторов
Статистическое сравнение классификаторовСтатистическое сравнение классификаторов
Статистическое сравнение классификаторов
Anton Konushin
 

Viewers also liked (18)

CV2011-2. Lecture 03. Photomontage, part 2.
CV2011-2. Lecture 03.  Photomontage, part 2.CV2011-2. Lecture 03.  Photomontage, part 2.
CV2011-2. Lecture 03. Photomontage, part 2.
 
CV2011-2. Lecture 08. Multi-view stereo.
CV2011-2. Lecture 08. Multi-view stereo.CV2011-2. Lecture 08. Multi-view stereo.
CV2011-2. Lecture 08. Multi-view stereo.
 
CV2011-2. Lecture 10. Pose estimation.
CV2011-2. Lecture 10.  Pose estimation.CV2011-2. Lecture 10.  Pose estimation.
CV2011-2. Lecture 10. Pose estimation.
 
CV2015. Лекция 8. Распознавание лиц людей.
CV2015. Лекция 8. Распознавание лиц людей.CV2015. Лекция 8. Распознавание лиц людей.
CV2015. Лекция 8. Распознавание лиц людей.
 
CV2015. Лекция 6. Нейросетевые алгоритмы.
CV2015. Лекция 6. Нейросетевые алгоритмы.CV2015. Лекция 6. Нейросетевые алгоритмы.
CV2015. Лекция 6. Нейросетевые алгоритмы.
 
Writing a computer vision paper
Writing a computer vision paperWriting a computer vision paper
Writing a computer vision paper
 
CV2011-2. Lecture 05. Video segmentation.
CV2011-2. Lecture 05.  Video segmentation.CV2011-2. Lecture 05.  Video segmentation.
CV2011-2. Lecture 05. Video segmentation.
 
CV2011-2. Lecture 06. Structure from motion.
CV2011-2. Lecture 06.  Structure from motion.CV2011-2. Lecture 06.  Structure from motion.
CV2011-2. Lecture 06. Structure from motion.
 
Classifier evaluation and comparison
Classifier evaluation and comparisonClassifier evaluation and comparison
Classifier evaluation and comparison
 
Computer vision infrastracture
Computer vision infrastractureComputer vision infrastracture
Computer vision infrastracture
 
CV2011-2. Lecture 02. Photomontage and graphical models.
CV2011-2. Lecture 02.  Photomontage and graphical models.CV2011-2. Lecture 02.  Photomontage and graphical models.
CV2011-2. Lecture 02. Photomontage and graphical models.
 
CV2011-2. Lecture 07. Binocular stereo.
CV2011-2. Lecture 07.  Binocular stereo.CV2011-2. Lecture 07.  Binocular stereo.
CV2011-2. Lecture 07. Binocular stereo.
 
CV2011-2. Lecture 09. Single view reconstructin.
CV2011-2. Lecture 09.  Single view reconstructin.CV2011-2. Lecture 09.  Single view reconstructin.
CV2011-2. Lecture 09. Single view reconstructin.
 
CV2011-2. Lecture 11. Face analysis.
CV2011-2. Lecture 11. Face analysis.CV2011-2. Lecture 11. Face analysis.
CV2011-2. Lecture 11. Face analysis.
 
CV2011-2. Lecture 12. Face models.
CV2011-2. Lecture 12.  Face models.CV2011-2. Lecture 12.  Face models.
CV2011-2. Lecture 12. Face models.
 
Anton Konushin - TEDxRU 2009
Anton Konushin - TEDxRU 2009Anton Konushin - TEDxRU 2009
Anton Konushin - TEDxRU 2009
 
CV2011-2. Lecture 04. Semantic image segmentation
CV2011-2. Lecture 04.  Semantic image segmentationCV2011-2. Lecture 04.  Semantic image segmentation
CV2011-2. Lecture 04. Semantic image segmentation
 
Статистическое сравнение классификаторов
Статистическое сравнение классификаторовСтатистическое сравнение классификаторов
Статистическое сравнение классификаторов
 

Similar to Andrew Zisserman Talk - Part 1a

Matching with Invariant Features
Matching with Invariant FeaturesMatching with Invariant Features
Matching with Invariant Features
zukun
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
zukun
 
One day short course on Green Building Assessment Methods - Daylight Simulation
One day short course on Green Building Assessment Methods - Daylight SimulationOne day short course on Green Building Assessment Methods - Daylight Simulation
One day short course on Green Building Assessment Methods - Daylight Simulation
ekwtsang
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
zukun
 

Similar to Andrew Zisserman Talk - Part 1a (20)

Lec07 corner blob
Lec07 corner blobLec07 corner blob
Lec07 corner blob
 
Computer Vision invariance
Computer Vision invarianceComputer Vision invariance
Computer Vision invariance
 
Matching with Invariant Features
Matching with Invariant FeaturesMatching with Invariant Features
Matching with Invariant Features
 
image segmentation image segmentation.pptx
image segmentation image segmentation.pptximage segmentation image segmentation.pptx
image segmentation image segmentation.pptx
 
Chapter10 image segmentation
Chapter10 image segmentationChapter10 image segmentation
Chapter10 image segmentation
 
Part2
Part2Part2
Part2
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
 
PPT s11-machine vision-s2
PPT s11-machine vision-s2PPT s11-machine vision-s2
PPT s11-machine vision-s2
 
Michal Erel's SIFT presentation
Michal Erel's SIFT presentationMichal Erel's SIFT presentation
Michal Erel's SIFT presentation
 
Image segmentation
Image segmentationImage segmentation
Image segmentation
 
Computer vision - edge detection
Computer vision - edge detectionComputer vision - edge detection
Computer vision - edge detection
 
Fuzzy Logic Based Edge Detection
Fuzzy Logic Based Edge DetectionFuzzy Logic Based Edge Detection
Fuzzy Logic Based Edge Detection
 
Spatial Filtering in intro image processingr
Spatial Filtering in intro image processingrSpatial Filtering in intro image processingr
Spatial Filtering in intro image processingr
 
One day short course on Green Building Assessment Methods - Daylight Simulation
One day short course on Green Building Assessment Methods - Daylight SimulationOne day short course on Green Building Assessment Methods - Daylight Simulation
One day short course on Green Building Assessment Methods - Daylight Simulation
 
06 image features
06 image features06 image features
06 image features
 
november6.ppt
november6.pptnovember6.ppt
november6.ppt
 
smallpt: Global Illumination in 99 lines of C++
smallpt:  Global Illumination in 99 lines of C++smallpt:  Global Illumination in 99 lines of C++
smallpt: Global Illumination in 99 lines of C++
 
Simulations of Strong Lensing
Simulations of Strong LensingSimulations of Strong Lensing
Simulations of Strong Lensing
 
SPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSSPATIAL POINT PATTERNS
SPATIAL POINT PATTERNS
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
 

More from Anton Konushin

CV2011-2. Lecture 01. Segmentation.
CV2011-2. Lecture 01. Segmentation.CV2011-2. Lecture 01. Segmentation.
CV2011-2. Lecture 01. Segmentation.
Anton Konushin
 
CV2011 Lecture 13. Real-time vision
CV2011 Lecture 13. Real-time visionCV2011 Lecture 13. Real-time vision
CV2011 Lecture 13. Real-time vision
Anton Konushin
 
CV2011 Lecture 12. Action recognition
CV2011 Lecture 12. Action recognitionCV2011 Lecture 12. Action recognition
CV2011 Lecture 12. Action recognition
Anton Konushin
 
CV2011 Lecture 11. Basic video
CV2011 Lecture 11. Basic videoCV2011 Lecture 11. Basic video
CV2011 Lecture 11. Basic video
Anton Konushin
 
CV2011 Lecture 10. Image retrieval
CV2011 Lecture 10.  Image retrievalCV2011 Lecture 10.  Image retrieval
CV2011 Lecture 10. Image retrieval
Anton Konushin
 

More from Anton Konushin (12)

CV2015. Лекция 7. Поиск изображений по содержанию.
CV2015. Лекция 7. Поиск изображений по содержанию.CV2015. Лекция 7. Поиск изображений по содержанию.
CV2015. Лекция 7. Поиск изображений по содержанию.
 
CV2015. Лекция 5. Выделение объектов.
CV2015. Лекция 5. Выделение объектов.CV2015. Лекция 5. Выделение объектов.
CV2015. Лекция 5. Выделение объектов.
 
CV2015. Лекция 4. Классификация изображений и введение в машинное обучение.
CV2015. Лекция 4. Классификация изображений и введение в машинное обучение.CV2015. Лекция 4. Классификация изображений и введение в машинное обучение.
CV2015. Лекция 4. Классификация изображений и введение в машинное обучение.
 
CV2015. Лекция 2. Основы обработки изображений.
CV2015. Лекция 2. Основы обработки изображений.CV2015. Лекция 2. Основы обработки изображений.
CV2015. Лекция 2. Основы обработки изображений.
 
CV2015. Лекция 1. Понятия и история компьютерного зрения. Свет и цвет.
CV2015. Лекция 1. Понятия и история компьютерного зрения. Свет и цвет.CV2015. Лекция 1. Понятия и история компьютерного зрения. Свет и цвет.
CV2015. Лекция 1. Понятия и история компьютерного зрения. Свет и цвет.
 
CV2015. Лекция 2. Простые методы распознавания изображений.
CV2015. Лекция 2. Простые методы распознавания изображений.CV2015. Лекция 2. Простые методы распознавания изображений.
CV2015. Лекция 2. Простые методы распознавания изображений.
 
Технологии разработки ПО
Технологии разработки ПОТехнологии разработки ПО
Технологии разработки ПО
 
CV2011-2. Lecture 01. Segmentation.
CV2011-2. Lecture 01. Segmentation.CV2011-2. Lecture 01. Segmentation.
CV2011-2. Lecture 01. Segmentation.
 
CV2011 Lecture 13. Real-time vision
CV2011 Lecture 13. Real-time visionCV2011 Lecture 13. Real-time vision
CV2011 Lecture 13. Real-time vision
 
CV2011 Lecture 12. Action recognition
CV2011 Lecture 12. Action recognitionCV2011 Lecture 12. Action recognition
CV2011 Lecture 12. Action recognition
 
CV2011 Lecture 11. Basic video
CV2011 Lecture 11. Basic videoCV2011 Lecture 11. Basic video
CV2011 Lecture 11. Basic video
 
CV2011 Lecture 10. Image retrieval
CV2011 Lecture 10.  Image retrievalCV2011 Lecture 10.  Image retrieval
CV2011 Lecture 10. Image retrieval
 

Recently uploaded

Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdfAdversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Po-Chuan Chen
 

Recently uploaded (20)

Application of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matricesApplication of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matrices
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
 
Sectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdfSectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdf
 
slides CapTechTalks Webinar May 2024 Alexander Perry.pptx
slides CapTechTalks Webinar May 2024 Alexander Perry.pptxslides CapTechTalks Webinar May 2024 Alexander Perry.pptx
slides CapTechTalks Webinar May 2024 Alexander Perry.pptx
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
 
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdfAdversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
 
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdfDanh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 
Mattingly "AI & Prompt Design: Limitations and Solutions with LLMs"
Mattingly "AI & Prompt Design: Limitations and Solutions with LLMs"Mattingly "AI & Prompt Design: Limitations and Solutions with LLMs"
Mattingly "AI & Prompt Design: Limitations and Solutions with LLMs"
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
2024_Student Session 2_ Set Plan Preparation.pptx
2024_Student Session 2_ Set Plan Preparation.pptx2024_Student Session 2_ Set Plan Preparation.pptx
2024_Student Session 2_ Set Plan Preparation.pptx
 
B.ed spl. HI pdusu exam paper-2023-24.pdf
B.ed spl. HI pdusu exam paper-2023-24.pdfB.ed spl. HI pdusu exam paper-2023-24.pdf
B.ed spl. HI pdusu exam paper-2023-24.pdf
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
INU_CAPSTONEDESIGN_비밀번호486_업로드용 발표자료.pdf
INU_CAPSTONEDESIGN_비밀번호486_업로드용 발표자료.pdfINU_CAPSTONEDESIGN_비밀번호486_업로드용 발표자료.pdf
INU_CAPSTONEDESIGN_비밀번호486_업로드용 발표자료.pdf
 
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptxJose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
 
NCERT Solutions Power Sharing Class 10 Notes pdf
NCERT Solutions Power Sharing Class 10 Notes pdfNCERT Solutions Power Sharing Class 10 Notes pdf
NCERT Solutions Power Sharing Class 10 Notes pdf
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
UNIT – IV_PCI Complaints: Complaints and evaluation of complaints, Handling o...
UNIT – IV_PCI Complaints: Complaints and evaluation of complaints, Handling o...UNIT – IV_PCI Complaints: Complaints and evaluation of complaints, Handling o...
UNIT – IV_PCI Complaints: Complaints and evaluation of complaints, Handling o...
 

Andrew Zisserman Talk - Part 1a

  • 1. Visual search and recognition Part I – large scale instance search Andrew Zisserman Visual Geometry Group University of Oxford http://www.robots.ox.ac.uk/~vgg Slides with Josef Sivic
  • 2. Overview Part I: Instance level recognition • e.g. find this car in a dataset of images and the dataset may contain 1M images …
  • 3. Overview Part II: Category level recognition • e.g. find images that contain cars • or cows … and localize the occurrences in each image …
  • 4. Problem specification: particular object retrieval Example: visual search in feature films Visually defined query “Groundhog Day” [Rammis, 1993] “Find this clock” “Find this place”
  • 5. Particular Object Search Find these objects ...in these images and 1M more Search the web with a visual query …
  • 6. The need for visual search Flickr: has more than 5 billion photographs, more than 1 million added daily Company collections Personal collections: 10000s of digital camera photos and video clips Vast majority will have minimal, if any, textual annotation.
  • 7. Why is it difficult? Problem: find particular occurrences of an object in a very large dataset of images Want to find the object despite possibly large changes in scale, viewpoint, lighting and partial occlusion Scale Viewpoint Lighting Occlusion
  • 8. Outline 1. Object recognition cast as nearest neighbour matching • Covariant feature detection • Feature descriptors (SIFT) and matching 2. Object recognition cast as text retrieval 3. Large scale search and improving performance 4. Applications 5. The future and challenges
  • 9. Visual problem • Retrieve image/key frames containing the same object query ? Approach Determine regions (detection) and vector descriptors in each frame which are invariant to camera viewpoint changes Match descriptors between frames using invariant vectors
  • 10. Example of visual fragments Image content is transformed into local fragments that are invariant to translation, rotation, scale, and other imaging parameters • Fragments generalize over viewpoint and lighting Lowe ICCV 1999
  • 11. Detection Requirements Detected image regions must cover the same scene region in different views • detection must commute with viewpoint transformation viewpoint transformation • i.e. detection is viewpoint covariant detection detection • NB detection computed in each image independently viewpoint transformation
  • 12. Scale-invariant feature detection Goal: independently detect corresponding regions in scaled versions of the same image Need scale selection mechanism for finding characteristic region size that is covariant with the image transformation Laplacian
  • 13. Scale-invariant features: Blobs Slides from Svetlana Lazebnik
  • 14. Recall: Edge detection Edge f d Derivative g of Gaussian dx d Edge = maximum f∗ g of derivative dx Source: S. Seitz
  • 15. Edge detection, take 2 Edge f 2 Second derivative d of Gaussian 2 g dx (Laplacian) d2 Edge = zero crossing f∗ 2g of second derivative dx Source: S. Seitz
  • 16. From edges to `top hat’ (blobs) Blob = top-hat = superposition of two step edges maximum Spatial selection: the magnitude of the Laplacian response will achieve a maximum at the center of the blob, provided the scale of the Laplacian is “matched” to the scale of the blob
  • 17. Scale selection We want to find the characteristic scale of the blob by convolving it with Laplacians at several scales and looking for the maximum response However, Laplacian response decays as scale increases: original signal increasing σ (radius=8) Why does this happen?
  • 18. Scale normalization The response of a derivative of Gaussian filter to a perfect step edge decreases as σ increases 1 σ 2π
  • 19. Scale normalization The response of a derivative of Gaussian filter to a perfect step edge decreases as σ increases To keep response the same (scale-invariant), must multiply Gaussian derivative by σ Laplacian is the second Gaussian derivative, so it must be multiplied by σ2
  • 20. Effect of scale normalization Original signal Unnormalized Laplacian response Scale-normalized Laplacian response scale σ increasing maximum
  • 21. Blob detection in 2D Laplacian of Gaussian: Circularly symmetric operator for blob detection in 2D ∂ g ∂ g 2 2 ∇ g= 2 + 2 2 ∂x ∂y
  • 22. Blob detection in 2D Laplacian of Gaussian: Circularly symmetric operator for blob detection in 2D ⎛∂ g ∂ g ⎞ 2 2 Scale-normalized: ∇ 2 norm g =σ ⎜ 2 + 2 ⎟ 2 ⎜ ∂x ⎟ ⎝ ∂y ⎠
  • 23. Characteristic scale We define the characteristic scale as the scale that produces peak of Laplacian response characteristic scale T. Lindeberg (1998). "Feature detection with automatic scale selection." International Journal of Computer Vision 30 (2): pp 77--116.
  • 24. Scale selection Scale invariance of the characteristic scale s norm. Lap. norm. Lap. scale scale s∗ 1 s∗ 2 ∗ ∗ • Relation between characteristic scales s ⋅ s1 = s2
  • 25. Scale invariance Mikolajczyk and Schmid ICCV 2001 • Multi-scale extraction of Harris interest points • Selection of characteristic scale in Laplacian scale space Chacteristic scale : - maximum in scale space Laplacian - scale invariant
  • 26. What class of transformations are required? 2D transformation models Similarity (translation, scale, rotation) Affine Projective (homography)
  • 27. Motivation for affine transformations View 1 View 2 View 2 Not same region
  • 28. Local invariance requirements Geometric: 2D affine transformation where A is a 2x2 non-singular matrix • Objective: compute image descriptors invariant to this class of transformations
  • 29. Viewpoint covariant detection • Characteristic scales (size of region) • Lindeberg and Garding ECCV 1994 • Lowe ICCV 1999 • Mikolajczyk and Schmid ICCV 2001 • Affine covariance (shape of region) • Baumberg CVPR 2000 • Matas et al BMVC 2002 Maximally stable regions • Mikolajczyk and Schmid ECCV 2002 Shape adapted regions • Schaffalitzky and Zisserman ECCV 2002 “Harris affine” • Tuytelaars and Van Gool BMVC 2000 • Mikolajczyk et al., IJCV 2005
  • 30. Example affine covariant region Maximally Stable regions (MSR) first image second image 1. Segment using watershed algorithm, and track connected components as threshold value varies. 2. An MSR is detected when the area of the component is stationary See Matas et al BMVC 2002
  • 31. Maximally stable regions varying threshold sub-image first second area vs change of area image image threshold vs threshold
  • 33. Example of affine covariant regions 1000+ regions per image Shape adapted regions Maximally stable regions • a region’s size and shape are not fixed, but • automatically adapts to the image intensity to cover the same physical surface • i.e. pre-image is the same surface region
  • 34. Viewpoint invariant description • Elliptical viewpoint covariant regions • Shape Adapted regions • Maximally Stable Regions • Map ellipse to circle and orientate by dominant direction • Represent each region by SIFT descriptor (128-vector) [Lowe 1999] • see Mikolajczyk and Schmid CVPR 2003 for a comparison of descriptors
  • 35. Local descriptors - rotation invariance Estimation of the dominant orientation • extract gradient orientation • histogram over gradient orientation • peak in this histogram 0 2π Rotate patch in dominant direction
  • 36. Descriptors – SIFT [Lowe’99] distribution of the gradient over an image patch image patch gradient 3D histogram x → → y 4x4 location grid and 8 orientations (128 dimensions) very good performance in image matching [Mikolaczyk and Schmid’03]
  • 37. Summary – detection and description Extract affine regions Normalize regions Eliminate rotation Compute appearance descriptors SIFT (Lowe ’04)
  • 38. Visual problem • Retrieve image/key frames containing the same object query ? Approach Determine regions (detection) and vector descriptors in each frame which are invariant to camera viewpoint changes Match descriptors between frames using invariant vectors
  • 39. Outline of an object retrieval strategy regions invariant descriptor vectors frames invariant descriptor vectors 1. Compute regions in each image independently 2. “Label” each region by a descriptor vector from its local intensity neighbourhood 3. Find corresponding regions by matching to closest descriptor vector 4. Score each frame in the database by the number of matches Finding corresponding regions transformed to finding nearest neighbour vectors
  • 40. Example In each frame independently determine elliptical regions (detection covariant with camera viewpoint) compute SIFT descriptor for each region [Lowe ‘99] 1000+ descriptors per frame Harris-affine Maximally stable regions
  • 41. Object recognition Establish correspondences between object model image and target image by nearest neighbour matching on SIFT vectors Model image 128D descriptor Target image space
  • 42. Match regions between frames using SIFT descriptors • Multiple fragments overcomes problem of partial occlusion • Transfer query box to localize object Harris-affine Now, convert this approach to a text Maximally stable regions retrieval representation
  • 43. Outline 1. Object recognition cast as nearest neighbour matching 2. Object recognition cast as text retrieval • bag of words model • visual words 3. Large scale search and improving performance 4. Applications 5. The future and challenges
  • 44. Success of text retrieval • efficient • high precision • scalable Can we use retrieval mechanisms from text for visual retrieval? • ‘Visual Google’ project, 2003+ For a million+ images: • scalability • high precision • high recall: can we retrieve all occurrences in the corpus?
  • 45. Text retrieval lightning tour Stemming Represent words by stems, e.g. “walking”, “walks” “walk” Stop-list Reject the very common words, e.g. “the”, “a”, “of” Inverted file Ideal book index: Term List of hits (occurrences in documents) People [d1:hit hit hit], [d4:hit hit] … Common [d1:hit hit], [d3: hit], [d4: hit hit hit] … Sculpture [d2:hit], [d3: hit hit hit] … • word matches are pre-computed
  • 46. Ranking • frequency of words in document (tf-idf) • proximity weighting (google) • PageRank (google) Need to map feature descriptors to “visual words”.
  • 47. Build a visual vocabulary for a movie Vector quantize descriptors • k-means clustering + + SIFT 128D SIFT 128D Implementation • compute SIFT features on frames from 48 shots of the film • 6K clusters for Shape Adapted regions • 10K clusters for Maximally Stable regions
  • 48. Samples of visual words (clusters on SIFT descriptors): Shape adapted regions Maximally stable regions generic examples – cf textons
  • 49. Samples of visual words (clusters on SIFT descriptors): More specific example
• 50.–53. Visual words: quantize descriptor space (Sivic and Zisserman, ICCV 2003). Nearest neighbour matching is expensive to do for all frames, so instead vector quantize the descriptors: each cluster in the 128-D descriptor space becomes a visual word (labelled e.g. 5 or 42 in the build-up figures), and a descriptor from a new image is assigned to its nearest cluster centre
• 54. Vector quantize the descriptor space (SIFT): image patches assigned to the same visual word
• 55. Representation: bag of (visual) words. Visual words are ‘iconic’ image patches or fragments • represent the frequency of word occurrence • but not their position (figure: image → collection of visual words)
• 56. Offline: assign visual words and compute histograms for each key frame in the video. Pipeline: detect patches → normalize → compute SIFT patch descriptor → find nearest cluster centre. Represent the frame by a sparse histogram of visual word occurrences, e.g. [2 0 0 1 0 1 …]
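A minimal sketch of this quantization step, assuming `centres` is the K x 128 vocabulary from the clustering sketch above and `frame_descriptors` is an M x 128 array of SIFT descriptors for one frame:

```python
import numpy as np
from scipy.cluster.vq import vq

def bag_of_words(frame_descriptors, centres):
    """Assign each SIFT descriptor to its nearest vocabulary centre and
    return a K-bin histogram of visual word occurrences for the frame."""
    words, _ = vq(frame_descriptors.astype(np.float32), centres.astype(np.float32))
    return np.bincount(words, minlength=len(centres)).astype(np.float32)
```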
• 57. Offline: create an index. For fast search, store a “posting list” for the dataset; this maps word occurrences to the documents they occur in, e.g. word #1 → frames 5, 10, …; word #2 → frame 10, …
• 58. At run time … • User specifies a query region • Generate a short list of frames using visual words in the region: 1. Accumulate all visual words within the query region 2. Use the “book index” (posting lists) to find other frames with these words 3. Compute similarity for frames which share at least one word. Generates a tf-idf ranked list of all the frames in the dataset
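Putting the last three slides together, here is a toy sketch of the offline posting lists and the run-time tf-idf query; the frame contents and word ids are invented for illustration, and this is not the authors' code:

```python
import math
from collections import Counter, defaultdict

frames = {5: [1, 1, 2, 7], 10: [1, 2, 2, 9], 11: [3, 4]}   # frame id -> visual word ids (toy data)

postings = defaultdict(set)                 # offline: word -> frames containing it
for fid, words in frames.items():
    for w in words:
        postings[w].add(fid)

N = len(frames)

def tfidf(words):
    # term-frequency x inverse-document-frequency weights for one bag of words
    tf = Counter(words)
    return {w: (c / len(words)) * math.log(N / len(postings[w]))
            for w, c in tf.items() if w in postings}

def query(query_words):
    q = tfidf(query_words)
    # only frames sharing at least one word with the query are scored
    candidates = set().union(*(postings[w] for w in q))
    scores = {}
    for fid in candidates:
        d = tfidf(frames[fid])
        num = sum(q[w] * d.get(w, 0.0) for w in q)
        norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
        scores[fid] = num / norm if norm else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])    # tf-idf ranked list of frames

print(query([1, 2, 2]))
```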
• 59. Image ranking using the bag-of-words model. For a vocabulary of size K, each image is represented by a K-vector where ti is the (weighted) number of occurrences of visual word i. Images are ranked by the normalized scalar product between the query vector vq and all vectors in the database vd (written out below). The scalar product can be computed efficiently using the inverted file.
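The slide's equation is not reproduced in this transcript; in the standard tf-idf / cosine form used in this line of work it reads (with n_id = occurrences of word i in document d, n_d = total number of words in d, n_i = number of documents containing word i, N = number of documents in the corpus):

```latex
t_i = \frac{n_{id}}{n_d}\,\log\frac{N}{n_i},
\qquad
\mathrm{sim}(v_q, v_d) = \frac{v_q \cdot v_d}{\lVert v_q \rVert\,\lVert v_d \rVert}
```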
• 60. Summary: match histograms of visual words 1. Compute affine covariant regions in each frame independently (offline) 2. “Label” each region by a descriptor vector based on its intensity (offline) 3. Build histograms of visual words by descriptor quantization, giving a single (histogram) vector per frame (offline) 4. Rank retrieved frames by matching visual-word histograms using inverted files
  • 61. Films = common dataset “Pretty Woman” “Casablanca” “Groundhog Day” “Charade”
• 63. Example: select a region, search in the film “Groundhog Day”, retrieved shots shown
• 64. Visual words – advantages. Design of descriptors makes these words invariant to: • illumination • affine transformations (viewpoint) Multiple local regions give immunity to partial occlusion Overlap encodes some structural information NB: no attempt to carry out a ‘semantic’ segmentation
  • 65. Example application – product placement Sony logo from Google image search on `Sony’ Retrieve shots from Groundhog Day
  • 66. Retrieved shots in Groundhog Day for search on Sony logo
  • 67. Outline 1. Object recognition cast as nearest neighbour matching 2. Object recognition cast as text retrieval 3. Large scale search and improving performance • large vocabularies and approximate k-means • query expansion • soft assignment 4. Applications 5. The future and challenges
  • 68. Particular object search Find these landmarks ...in these images
• 69. Investigate … • vocabulary size: number of visual words in the range 10K to 1M • use of spatial information to re-rank
• 70. Oxford buildings dataset, automatically crawled from Flickr. Dataset (i) consists of 5062 images, crawled by searching for Oxford landmarks, e.g. “Oxford Christ Church”, “Oxford Radcliffe Camera”, “Oxford”. “Medium” resolution images (1024 x 768)
• 71. Oxford buildings dataset, automatically crawled from Flickr. Consists of: Dataset (i), crawled by searching for Oxford landmarks; Datasets (ii) and (iii), from other popular Flickr tags, which act as additional distractors
• 72. Oxford buildings dataset: landmarks plus queries used for evaluation – All Souls, Ashmolean, Balliol, Bodleian, Thom Tower, Cornmarket, Bridge of Sighs, Keble, Magdalen, University Museum, Radcliffe Camera. Ground truth obtained for 11 landmarks over 5062 images. Evaluate performance by mean Average Precision
• 73. Precision – Recall • Precision: % of returned images that are relevant • Recall: % of relevant images that are returned (figure: relevant vs. returned images within all images, and a precision–recall curve)
• 74. Average Precision • A good AP score requires both high recall and high precision • Application-independent. Performance measured by mean Average Precision (mAP) over 55 queries on 100K or 1.1M image datasets (figure: AP as the area under the precision–recall curve)
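For reference, the standard definitions behind these two slides (the slides themselves only show the plots):

```latex
\text{precision} = \frac{|\,\text{relevant} \cap \text{returned}\,|}{|\,\text{returned}\,|},
\qquad
\text{recall} = \frac{|\,\text{relevant} \cap \text{returned}\,|}{|\,\text{relevant}\,|}

\mathrm{AP} = \int_0^1 p(r)\,\mathrm{d}r,
\qquad
\mathrm{mAP} = \frac{1}{Q}\sum_{q=1}^{Q}\mathrm{AP}(q)
```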
• 75. Quantization / Clustering. K-means is usually seen as a quick + cheap method, but it is far too slow for our needs – D~128, N~20M+, K~1M. Use approximate k-means: nearest neighbour search by multiple, randomized k-d trees
• 76. K-means overview: initialize the cluster centres, then iterate: (1) find the nearest cluster centre to each datapoint (slow, O(NK)); (2) re-compute each cluster centre as the centroid of its points. K-means provably locally minimizes the sum of squared errors (SSE) between a cluster centre and its points. Idea: nearest neighbour search is the bottleneck – use approximate nearest neighbour search
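A bare-bones numpy sketch of this two-step iteration, just to make the O(NK) assignment bottleneck explicit (toy code, not the lecture's implementation; it holds an N x K distance matrix in memory, so it only suits small examples):

```python
import numpy as np

def kmeans(X, K, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), K, replace=False)]      # initialize cluster centres
    for _ in range(n_iter):
        # 1. assignment: nearest centre for each datapoint -- the slow O(NK) step
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # 2. update: re-compute each centre as the centroid of its assigned points
        for k in range(K):
            if np.any(assign == k):
                centres[k] = X[assign == k].mean(axis=0)
    return centres, assign
```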
  • 77. Approximate K-means Use multiple, randomized k-d trees for search A k-d tree hierarchically decomposes the descriptor space Points nearby in the space can be found (hopefully) by backtracking around the tree some small number of steps Single tree works OK in low dimensions – not so well in high dimensions
• 78. Approximate K-means. Multiple randomized trees increase the chances of finding nearby points (figure: a query point whose true nearest neighbour is found in only one of the trees)
• 79. Approximate K-means. Use the best-bin-first strategy to determine which branch of the tree to examine next; share this priority queue between the multiple trees, so searching multiple trees is only slightly more expensive than searching one. Original K-means complexity = O(NK); approximate K-means complexity = O(N log K). This means we can scale to very large K
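A minimal sketch of approximate nearest-neighbour assignment with several randomized k-d trees, using OpenCV's FLANN wrapper as a stand-in for the authors' implementation; the tree count and `checks` budget here are illustrative choices:

```python
import numpy as np
import cv2

centres = np.random.rand(100_000, 128).astype(np.float32)      # stand-in for ~1M cluster centres
descriptors = np.random.rand(5_000, 128).astype(np.float32)    # descriptors to assign to visual words

FLANN_INDEX_KDTREE = 1
index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=8)     # 8 randomized k-d trees
search_params = dict(checks=64)     # best-bin-first budget: leaves to visit before stopping

flann = cv2.FlannBasedMatcher(index_params, search_params)
matches = flann.knnMatch(descriptors, centres, k=1)            # approximate nearest centre per descriptor
words = [m[0].trainIdx for m in matches if m]                  # visual word id for each descriptor
```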
  • 80. Experimental evaluation for SIFT matching http://www.cs.ubc.ca/~lowe/papers/09muja.pdf
• 81. Oxford buildings dataset: landmarks plus queries used for evaluation – All Souls, Ashmolean, Balliol, Bodleian, Thom Tower, Cornmarket, Bridge of Sighs, Keble, Magdalen, University Museum, Radcliffe Camera. Ground truth obtained for 11 landmarks over 5062 images. Evaluate performance by mean Average Precision over 55 queries
• 82. Approximate K-means. How accurate is the approximate search? Performance on the 5K image dataset for a forest of 8 randomized trees. Allows much larger clusterings than would be feasible with standard K-means: N~17M points, K~1M. AKM – 8.3 cpu hours per iteration; standard K-means – estimated 2650 cpu hours per iteration
• 83. Performance against vocabulary size. Using large vocabularies gives a big boost in performance (peak at 1M words). More discriminative vocabularies give: better retrieval quality; increased search speed – documents share fewer words, so fewer documents need to be scored
  • 84. Beyond Bag of Words Use the position and shape of the underlying features to improve retrieval quality Both images have many matches – which is correct?
• 85. Beyond Bag of Words. We can measure spatial consistency between the query and each result to improve retrieval quality: many spatially consistent matches – correct result; few spatially consistent matches – incorrect result
• 86. Compute a 2D affine transformation between the query region and the target image, x′ = A x + t, where A is a 2x2 non-singular matrix and t a translation
  • 87. Estimating spatial correspondences 1. Test each correspondence
  • 88. Estimating spatial correspondences 2. Compute a (restricted) affine transformation (5 dof)
  • 89. Estimating spatial correspondences 3. Score by number of consistent matches Use RANSAC on full affine transformation (6 dof)
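A minimal sketch of this spatial verification step, using OpenCV's RANSAC affine estimator in place of the authors' own 5-dof/6-dof pipeline; the reprojection threshold is an illustrative choice:

```python
import numpy as np
import cv2

def spatial_consistency(query_pts, target_pts):
    """query_pts, target_pts: (N, 2) arrays of matched keypoint locations.
    Returns the number of correspondences consistent with one 2D affine map."""
    if len(query_pts) < 3:                      # an affine map needs at least 3 correspondences
        return 0
    A, inliers = cv2.estimateAffine2D(
        np.asarray(query_pts, dtype=np.float32),
        np.asarray(target_pts, dtype=np.float32),
        method=cv2.RANSAC, ransacReprojThreshold=5.0)   # 6-dof affine, fitted with RANSAC
    return 0 if inliers is None else int(inliers.sum())

# Re-rank the tf-idf short-list by this inlier count to obtain the final ranking.
```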
  • 90. Beyond Bag of Words Extra bonus – gives localization of the object
• 91. Example results: a query and the short list of retrieved images, re-ranked on the number of spatially consistent correspondences
• 92. Mean Average Precision variation with vocabulary size

  vocab size | bag of words | spatial
  50K        | 0.473        | 0.599
  100K       | 0.535        | 0.597
  250K       | 0.598        | 0.633
  500K       | 0.606        | 0.642
  750K       | 0.609        | 0.630
  1M         | 0.618        | 0.645
  1.25M      | 0.602        | 0.625
• 93. Bag-of-visual-words particular object retrieval pipeline: query image → Hessian-Affine regions + SIFT descriptors [Lowe 04] → set of SIFT descriptors → quantize against cluster centroids (visual words) → sparse frequency vector with tf-idf weighting → inverted file querying → ranked image short-list → geometric verification [Chum et al. 2007]