Visual Object Recognition
Visual Object Recognition Tutorial Computing
Perceptual and Sensory Augmented




                                               Bastian Leibe                &   Kristen Grauman
                                               Computer Vision Laboratory       Department of Computer Sciences
                                               ETH Zurich                       University of Texas at Austin

                                               Chicago, 14.07.2008
K. Grauman, B. Leibe
Identification vs. Categorization
Object Categorization
                                               • How to recognize ANY car
                                               • How to recognize ANY cow



What could be done with recognition algorithms?
                                               There is a wide range of applications, including…
                                                 Autonomous robots     Navigation, driver safety   Situated search




                                                 Content-based retrieval and analysis for images and videos     Medical image analysis
Object Categorization
                                               • Task Description
                                                    “Given a small number of training images of a category,
                                                    recognize a-priori unknown instances of that category and assign
                                                    the correct category label.”


                                               • Which categories are feasible visually?
                                                    Extensively studied in Cognitive Psychology,
                                                    e.g. [Brown’58]




                                                      “Fido”      German shepherd      dog      animal      living being

Visual Object Categories

                                               • Basic Level Categories in human categorization
                                                 [Rosch 76, Lakoff 87]
                                                    The highest level at which category members have similar
                                                    perceived shape
                                                    The highest level at which a single mental image reflects the
                                                    entire category
                                                    The level at which human subjects are usually fastest at
                                                    identifying category members
                                                    The first level named and understood by children
                                                    The highest level at which a person uses similar motor actions
                                                    for interaction with category members




Visual Object Categories
                                               • Basic-level categories in humans seem to be defined
                                                 predominantly visually.
                                               • There is evidence that humans (usually) start with
                                                 basic-level categorization before doing identification.
                                                  ⇒ Basic-level categorization is easier and faster
                                                    for humans than object identification!
                                                  ⇒ Most promising starting point for visual classification

                                               Figure: category hierarchy, from abstract to individual
                                                    Abstract levels:     animal, …; quadruped, …
                                                    Basic level:         dog, cat, cow
                                                                         German shepherd, Doberman
                                                    Individual level:    …, “Fido”, …
Other Types of Categories
                                               • Functional Categories
                                                   e.g. chairs = “something you can sit on”
Other Types of Categories
                                               • Ad-hoc categories
                                                   e.g. “something you can find in an office environment”
Levels of Object Categorization

                                               Example images: “cow”, “car”, “motorbike”



                                               • Different levels of recognition
                                                    Which object class is in the image?          ⇒ Object/image classification
                                                    Where is it in the image?                    ⇒ Detection/localization
                                                    Where exactly (which pixels)?                ⇒ Figure/ground segmentation

Challenges: robustness
                                               Illumination     Object pose                     Clutter




                                                   Occlusions    Intra-class appearance    Viewpoint
Challenges: robustness
                                               • Detection in Crowded Scenes
                                                   Learn object variability
                                                    – Changes in appearance, scale, and articulation
                                                   Compensate for clutter, overlap, and occlusion

Challenges: context and human experience
                                                           Context cues                        Dynamics




                                                                      Image credit: D. Hoiem   Video credit: J. Davis
Challenges: scale, efficiency
                                               • Thousands to millions of pixels in an image
                                               • Estimated 30 Gigapixels of image/video content
                                                   generated per second
                                               •   About half of the cerebral cortex in primates is devoted
                                                   to processing visual information [Felleman and van
                                                   Essen 1991]
                                               •   3,000-30,000 human recognizable object categories
                                               •   30+ degrees of freedom in the pose of articulated
                                                   objects (humans)
                                               •   Billions of images indexed by Google Image Search
                                               •   18 billion+ prints produced from digital camera images
                                                   in 2004
                                               •   295.5 million camera phones sold in 2005



Challenges: learning with minimal supervision
                                               Figure axis: Less supervision                    More supervision
Rough evolution of focus in recognition research
                                                  1980s    1990s to early 2000s             Currently

This tutorial
                                               • Intended for broad AAAI audience
                                                   Assuming basic familiarity with machine learning, linear algebra,
                                                   probability
                                                   Not assuming significant vision background
                                               • Our goals
                                                   Describe main approaches to recognition
                                                   Highlight past successes and future challenges
                                                   Provide the pointers (to literature and tools) that would allow
                                                   you to take advantage of existing techniques in your research


                                               • Questions welcome

Outline

                                               1. Detection with Global Appearance & Sliding Windows
                                               2. Local Invariant Features: Detection & Description
                                               3. Specific Object Recognition with Local Features
                                               ― Coffee Break ―
                                               4. Visual Words: Indexing, Bags of Words Categorization
                                               5. Matching Local Features
                                               6. Part-Based Models for Categorization
                                               7. Current Challenges and Research Directions

Detection via classification: Main idea

                                                 Basic component: a binary classifier
                                                                                               Car/non-car
                                                                                                Classifier



                                                                                               No, not a car.
                                                                                                 Yes, a car.




Detection via classification: Main idea

                                                 If object may be in a cluttered scene, slide a window
                                                 around looking for it.
                                                                                               Car/non-car
                                                                                                Classifier




Detection via classification: Main idea
                                                Fleshing out this
                                                pipeline a bit more,
                                                we need to:
                                                1. Obtain training data
                                                2. Define features
                                                3. Define classifier
                                                Figure: Training examples      Feature extraction      Car/non-car classifier
Detection via classification: Main idea
                                               • Consider all subwindows in an image
                                                    Sample at multiple scales and positions
                                               • Make a decision per window:
                                                    “Does this contain object category X or not?”
                                               • In this section, we’ll focus specifically on methods
                                                 using a global representation (i.e., not part-based,
                                                 not local features).




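The window-scanning loop described above can be sketched as follows. This is a minimal illustration, not the tutorial's implementation: the function names, the stride, the scale set, and the representation of an image as a list of pixel rows are all assumptions made for the example, and `classifier` stands in for any binary car/non-car decision function.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride=8, scales=(1.0, 1.5, 2.0)):
    """Yield (x, y, w, h) for every candidate subwindow at every scale."""
    for s in scales:
        w, h = int(round(win_w * s)), int(round(win_h * s))
        if w > img_w or h > img_h:
            continue  # window at this scale does not fit in the image
        step = max(1, int(round(stride * s)))
        for y in range(0, img_h - h + 1, step):
            for x in range(0, img_w - w + 1, step):
                yield (x, y, w, h)


def detect(image, classifier, win_w, win_h, **kwargs):
    """Apply a binary classifier to each subwindow and keep the positives."""
    img_h, img_w = len(image), len(image[0])
    hits = []
    for (x, y, w, h) in sliding_windows(img_w, img_h, win_w, win_h, **kwargs):
        patch = [row[x:x + w] for row in image[y:y + h]]
        if classifier(patch):  # "Does this contain object category X or not?"
            hits.append((x, y, w, h))
    return hits
```

The number of windows grows with image size, the number of scales, and the inverse of the stride, which is why the per-window decision has to be cheap.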
Feature extraction: global appearance
                                                 Simple holistic descriptions of image content
                                                      grayscale / color histogram
                                                      vector of pixel intensities




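Both holistic descriptions listed above fit in a few lines. The sketch below assumes a grayscale patch stored as a list of rows of integers in 0..255; the function names and the bin count are illustrative choices, not part of the tutorial.

```python
def gray_histogram(patch, bins=8, max_value=256):
    """Normalized intensity histogram of a 2-D patch (list of rows of ints)."""
    hist = [0] * bins
    n = 0
    for row in patch:
        for v in row:
            # Map the intensity to a bin index, clamping the top value.
            hist[min(bins - 1, v * bins // max_value)] += 1
            n += 1
    return [c / n for c in hist] if n else hist


def pixel_vector(patch):
    """Flatten the patch row by row into a single vector of intensities."""
    return [v for row in patch for v in row]
```

The histogram discards spatial layout entirely, while the pixel vector keeps all of it; either one can be passed to the window classifier as the feature for a subwindow.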
Eigenfaces: global appearance description
                                                An early appearance-based approach to face recognition

                                                Figure: training images, their mean, and eigenvectors computed
                                                from the covariance matrix

                                                Generate low-dimensional representation of appearance with a
                                                linear subspace.
                                                Project new images to “face space”:
                                                    image ≈ mean + weighted sum of eigenvectors
                                                Recognition via nearest neighbors in face space
                                               Turk & Pentland, 1991
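The steps on this slide can be written out as a standard PCA formulation. The notation is an assumption introduced here for illustration: x_i are the N vectorized training images, u_j the top k eigenvectors, and w the coordinates of an image in face space.

```latex
\begin{align}
  \mu &= \frac{1}{N}\sum_{i=1}^{N} x_i,
  &
  C &= \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^{\top} \\
  C\,u_j &= \lambda_j u_j,
  &
  \lambda_1 &\ge \lambda_2 \ge \dots \ge \lambda_k \\
  w &= U^{\top}(x - \mu),
  &
  U &= [\,u_1, \dots, u_k\,] \\
  x &\approx \mu + \sum_{j=1}^{k} w_j u_j,
  &
  \hat{\imath} &= \arg\min_{i} \lVert w - w_i \rVert_2
\end{align}
```

The last line is the nearest-neighbor rule: a new image is assigned the identity of the training image whose face-space coordinates w_i are closest to its own.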
Feature extraction: global appearance
                                               • Pixel-based representations sensitive to small shifts
                                               • Color or grayscale-based appearance description can be
                                                 sensitive to illumination and intra-class appearance
                                                 variation
                                                                                              Cartoon example:
                                                                                              an albino koala




Gradient-based representations
                                               • Consider edges, contours, and (oriented) intensity
                                                 gradients
Gradient-based representations: Matching edge templates
                                               • Example: Chamfer matching
                                                 Input image    Edges detected    Distance transform    Template shape    Best match


                                               At each window position,
                                               compute average min
                                               distance between points on
                                               template (T) and input (I).

                                               Gavrila & Philomin ICCV 1999
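The chamfer score described above (average minimum distance from template points to input edge points) can be sketched directly. This is a brute-force illustration under assumed names and data layout: points are (x, y) tuples and `offsets` is a hypothetical list of candidate window positions. A practical system would instead look distances up in the precomputed distance transform shown on the slide.

```python
import math


def chamfer_distance(template_pts, edge_pts, offset=(0, 0)):
    """Average distance from each shifted template point to its nearest edge point."""
    ox, oy = offset
    total = 0.0
    for tx, ty in template_pts:
        total += min(math.hypot(tx + ox - ex, ty + oy - ey) for ex, ey in edge_pts)
    return total / len(template_pts)


def best_match(template_pts, edge_pts, offsets):
    """Return the window offset with the lowest chamfer distance."""
    return min(offsets, key=lambda o: chamfer_distance(template_pts, edge_pts, o))
```

With a distance transform, the inner `min` becomes a single array lookup per template point, so the cost per window depends only on the number of template points.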
Gradient-based representations: Matching edge templates
                                               • Chamfer matching
                                                 Hierarchy of templates




                                               Gavrila & Philomin ICCV 1999
Gradient-based representations
                                               • Consider edges, contours, and (oriented) intensity
                                                 gradients
                                               • Summarize local distribution of gradients with histogram
                                                    Locally orderless: offers invariance to small shifts and rotations
                                                    Contrast-normalization: try to correct for variable illumination



Gradient-based representations: Histograms of oriented gradients (HoG)
                                                                              Map each grid cell in the input
                                                                              window to a histogram counting
                                                                              the gradients per orientation.

                                                                              Code available:
                                                                              http://pascal.inrialpes.fr/soft/olt/
                                               Dalal & Triggs, CVPR 2005
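A simplified sketch of the per-cell histogram follows. It is not the Dalal & Triggs code linked above: the 9 unsigned orientation bins, the central-difference gradients, the hard bin assignment, and the function names are assumptions chosen to keep the example short, and the overlapping block normalization of the full descriptor is reduced to a single L2 normalization helper.

```python
import math


def cell_histogram(cell, bins=9):
    """Orientation histogram of one grid cell (2-D list of intensities).

    Each interior pixel votes for its gradient orientation (0..180 degrees,
    unsigned), weighted by gradient magnitude.
    """
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # central differences
            gy = cell[y + 1][x] - cell[y - 1][x]
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / (180.0 / bins)) % bins] += mag
    return hist


def l2_normalize(vec, eps=1e-6):
    """Contrast-normalize a histogram vector to roughly unit length."""
    norm = math.sqrt(sum(v * v for v in vec) + eps * eps)
    return [v / norm for v in vec]
```

Because the histogram ignores where inside the cell each gradient occurred, small shifts of the pattern within a cell leave the descriptor nearly unchanged, and the normalization reduces the effect of overall contrast changes.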
Gradient-based representations: SIFT descriptor
                                                                                                Local patch descriptor
                                                                                                (more on this later)

                                               Code: http://vision.ucla.edu/~vedaldi/code/sift/sift.html
                                               Binary: http://www.cs.ubc.ca/~lowe/keypoints/
                                               Lowe, ICCV 1999
Gradient-based representations: Biologically inspired features
                                               Convolve with Gabor filters at
                                               multiple orientations
                                               Pool nearby units (max)
                                               Intermediate layers compare input
                                               to prototype patches

                                               Serre, Wolf, Poggio, CVPR 2005
                                               Mutch & Lowe, CVPR 2006
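The first two stages listed above (oriented Gabor filtering, then max pooling over nearby units) can be sketched as below. This is only an illustration of those two operations, not the cited models: the kernel parameters (`wavelength`, `sigma`, `gamma`), the function names, and the non-overlapping pooling grid are assumptions, and the later prototype-comparison layers are not shown.

```python
import math


def gabor_kernel(size, theta, wavelength=4.0, sigma=2.0, gamma=0.5):
    """Real (cosine-phase) Gabor kernel with odd side length `size`."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid varies along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel


def filter_response(img, kernel):
    """Absolute correlation response wherever the kernel fits inside img."""
    k = len(kernel)
    return [
        [abs(sum(kernel[j][i] * img[y + j][x + i]
                 for j in range(k) for i in range(k)))
         for x in range(len(img[0]) - k + 1)]
        for y in range(len(img) - k + 1)
    ]


def max_pool(resp, size):
    """Keep the maximum response in each non-overlapping size x size block."""
    return [
        [max(resp[y + j][x + i] for j in range(size) for i in range(size))
         for x in range(0, len(resp[0]) - size + 1, size)]
        for y in range(0, len(resp) - size + 1, size)
    ]
```

Running `filter_response` once per orientation and pooling each result gives one response map per orientation that tolerates small shifts of the underlying edges.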
Gradient-based representations: Rectangular features
                                                Compute differences between sums of pixels in rectangles
                                                Captures contrast in adjacent spatial regions
                                                Similar to Haar wavelets, efficient to compute

                                               Viola & Jones, CVPR 2001
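One common way to make these rectangle sums cheap is an integral image (summed-area table), the device used in the Viola & Jones detector. The sketch below uses assumed function names and shows only one feature type, a horizontal two-rectangle difference.

```python
def integral_image(img):
    """ii[y][x] = sum of img over all rows < y and columns < x (zero-padded)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w x h rectangle at (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Left rectangle minus the adjacent right rectangle of the same size."""
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)
```

After one pass to build the table, every rectangle sum costs four lookups regardless of rectangle size, which is what makes evaluating many such features per window affordable.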
Gradient-based representations: Shape context descriptor

                                               Count the number of points inside each bin, e.g.:
                                                    Count = 4, ..., Count = 10

                                               Log-polar binning: more precision for nearby points,
                                               more flexibility for farther points.

                                               Local descriptor (more on this later)

                                               Belongie, Malik & Puzicha, ICCV 2001
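The log-polar counting can be sketched as follows. The bin counts, the radial range, the normalization by mean distance from the reference point, and the clamping of out-of-range radii to the nearest bin are all simplifying assumptions for this example rather than details taken from the slide.

```python
import math


def shape_context(points, index, r_bins=5, theta_bins=12, r_min=0.125, r_max=2.0):
    """Log-polar histogram of the other points relative to points[index]."""
    px, py = points[index]
    others = [(x - px, y - py) for i, (x, y) in enumerate(points) if i != index]
    dists = [math.hypot(dx, dy) for dx, dy in others]
    mean_d = sum(dists) / len(dists)  # normalizing by this gives scale invariance
    log_lo, log_hi = math.log(r_min), math.log(r_max)
    hist = [[0] * theta_bins for _ in range(r_bins)]
    for (dx, dy), d in zip(others, dists):
        if d == 0:
            continue  # skip points coincident with the reference point
        # Radial bin: uniform in log(r), so nearby bins are narrower.
        rb = int((math.log(d / mean_d) - log_lo) / (log_hi - log_lo) * r_bins)
        rb = min(max(rb, 0), r_bins - 1)
        # Angular bin: uniform in angle over the full circle.
        theta = math.atan2(dy, dx) % (2 * math.pi)
        tb = int(theta / (2 * math.pi) * theta_bins) % theta_bins
        hist[rb][tb] += 1
    return hist
```

Each entry of the returned table is one of the "Count = ..." values from the slide; flattening the table gives the descriptor for that reference point.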
Classifier construction

                                                • How to compute a decision for each subwindow?
                                                                  Image feature




Discriminative vs. generative models

                                               Generative: separately model class-conditional and prior densities
                                                    [Plot: Pr(image, car) and Pr(image, ¬car) versus image feature]

                                               Discriminative: directly model posterior
                                                    [Plot: Pr(car | image) and Pr(¬car | image) versus image feature; x = data]




                                               Plots from Antonio Torralba 2007          K. Grauman, B. Leibe
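    To make the contrast concrete, here is a minimal sketch assuming 1-D Gaussian class-conditional densities over a scalar image feature. The means, standard deviations, and prior are illustrative values, not read from the plots. A generative model stores these densities and the prior and obtains the posterior through Bayes' rule; a discriminative model would fit the posterior curve directly.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_car(x, prior_car=0.5, car=(30.0, 5.0), not_car=(45.0, 5.0)):
    """Pr(car | x) via Bayes' rule from class-conditionals and a prior.

    `car` and `not_car` are (mean, std) pairs; all values are assumptions.
    """
    joint_car = gaussian_pdf(x, *car) * prior_car
    joint_not = gaussian_pdf(x, *not_car) * (1.0 - prior_car)
    return joint_car / (joint_car + joint_not)

if __name__ == "__main__":
    for x in (20.0, 37.5, 55.0):
        print(x, posterior_car(x))
```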
Discriminative vs. generative models
    • Generative:
        + possibly interpretable
        + can draw samples
        - models variability unimportant to classification task
        - often hard to build good model with few parameters
    • Discriminative:
        + appealing when infeasible to model data itself
        + excel in practice
        - often can’t provide uncertainty in predictions
        - non-interpretable
Discriminative methods
    Nearest neighbor (10^6 examples)
        Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; ...
    Neural networks
        LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; …
    Support Vector Machines
        Guyon, Vapnik; Heisele, Serre, Poggio 2001; …
    Boosting
        Viola, Jones 2001; Torralba et al. 2004; Opelt et al. 2006; …
    Conditional Random Fields
        McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003; …

    Slide adapted from Antonio Torralba
Boosting
    • Build a strong classifier by combining a number of “weak classifiers”, which need only be better than chance
    • Sequential learning process: at each iteration, add a weak classifier
    • Flexible to choice of weak learner
        including fast simple classifiers that alone may be inaccurate
    • We’ll look at Freund & Schapire’s AdaBoost algorithm
        Easy to implement
        Base learning algorithm for Viola-Jones face detector
AdaBoost: Intuition

    Consider a 2-d feature space with positive and negative examples.

    Each weak classifier splits the training examples with at least 50% accuracy.

    Examples misclassified by a previous weak learner are given more emphasis at future rounds.

    Figure adapted from Freund and Schapire
AdaBoost: Intuition
    Final classifier is combination of the weak classifiers
AdaBoost Algorithm
    Start with uniform weights on training examples {x1,…,xn}

    Evaluate weighted error for each feature, pick best.

    Incorrectly classified -> more weight
    Correctly classified -> less weight

    Final classifier is combination of the weak ones, weighted according to error they had.

    Freund & Schapire 1995
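    The steps above can be sketched in a few lines. This is a minimal illustration of discrete AdaBoost, assuming labels in {-1, +1} and axis-aligned decision stumps as weak classifiers. The function names and the exhaustive threshold search are choices made for this sketch, and the weight formula alpha = 0.5 * ln((1 - err) / err) is the common textbook form, which the slide does not spell out.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thresh in sorted({x[j] for x in X}):
            for polarity in (1, -1):
                preds = [polarity if x[j] >= thresh else -polarity for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thresh, polarity)
    return best

def stump_predict(stump, x):
    """Apply a stump: +polarity at or above the threshold, -polarity below."""
    _, j, thresh, polarity = stump
    return polarity if x[j] >= thresh else -polarity

def adaboost(X, y, rounds=3):
    """Return a list of (alpha, stump) pairs learned over `rounds` iterations."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)       # pick the best weak classifier
        err = min(max(stump[0], 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified examples gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

if __name__ == "__main__":
    X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
    y = [1, 1, -1, -1, 1, 1]               # not separable by a single stump
    model = adaboost(X, y, rounds=3)
    print([predict(model, x) for x in X])
```

    The toy data is chosen so that no single threshold separates the classes, while a weighted vote of three stumps can.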
Cascading classifiers for detection
    For efficiency, apply less accurate but faster classifiers first to immediately discard windows that clearly appear to be negative; e.g.,
        Filter for promising regions with an initial inexpensive classifier
        Build a chain of classifiers, choosing cheap ones with low false negative rates early in the chain

    Fleuret & Geman, IJCV 2001
    Rowley et al., PAMI 1998
    Viola & Jones, CVPR 2001

    Figure from Viola & Jones CVPR 2001
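    The early-rejection idea can be sketched as a loop over stages. This is an illustrative sketch, not the Viola-Jones implementation: the two scoring functions and their thresholds are hypothetical stand-ins for trained stage classifiers.

```python
def cascade_classify(window, stages):
    """Return True only if every stage accepts the window.

    `stages` is a list of (score_fn, threshold) pairs ordered from cheapest
    to most expensive. A window is rejected at the first stage whose score
    falls below its threshold, so later stages never run for it.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False
    return True

def mean_intensity(window):
    """Cheap hypothetical first stage: average pixel value."""
    return sum(window) / len(window)

def intensity_range(window):
    """Hypothetical second stage: spread between brightest and darkest pixel."""
    return max(window) - min(window)

# Hypothetical two-stage cascade; thresholds are made up for illustration.
STAGES = [(mean_intensity, 50.0), (intensity_range, 100.0)]

if __name__ == "__main__":
    windows = [[0, 5, 10, 3], [60, 200, 40, 90], [120, 125, 130, 122]]
    print([cascade_classify(w, STAGES) for w in windows])
```

    Because most windows in an image are negatives, most of them exit at the first stage, which is where the speedup comes from.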
Example: Face detection
    • Frontal faces are a good example of a class where global appearance models + a sliding window detection approach fit well:
        Regular 2D structure
        Center of face almost shaped like a “patch”/window
    • Now we’ll take AdaBoost and see how the Viola-Jones face detector works
Feature extraction
    “Rectangular” filters
        Feature output is difference between adjacent regions

    Efficiently computable with integral image: any sum can be computed in constant time
        Integral image: value at (x,y) is sum of pixels above and to the left of (x,y)

    Avoid scaling images: scale features directly for same cost

    Viola & Jones, CVPR 2001
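    A minimal sketch of the integral image and a constant-time rectangle sum follows. It assumes a zero-padded table (one extra row and column), so entry [y][x] holds the sum of pixels strictly above and to the left; this padding is a convenience of the sketch. The two-rectangle feature and all function names are illustrative.

```python
def integral_image(img):
    """Build a zero-padded integral image from a 2-D list of numbers.

    ii[y][x] is the sum of img over rows < y and columns < x.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def two_rect_feature(ii, top, left, height, width):
    """Left rectangle minus the adjacent right rectangle of equal size."""
    return (rect_sum(ii, top, left, height, width)
            - rect_sum(ii, top, left + width, height, width))

if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    table = integral_image(image)
    print(rect_sum(table, 1, 1, 2, 2), two_rect_feature(table, 0, 0, 3, 1))
```

    Scaling a feature only changes the rectangle coordinates passed to `rect_sum`, so the cost per feature stays at a fixed number of lookups regardless of size.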
Large library of filters
    Considering all possible filter parameters (position, scale, and type): 180,000+ possible features associated with each 24 x 24 window

    Use AdaBoost both to select the informative features and to form the classifier

    Viola & Jones, CVPR 2001
AdaBoost for feature+classifier selection
    • Want to select the single rectangle feature and threshold that best separates positive (faces) and negative (non-faces) training examples, in terms of weighted error.

    [Figure: outputs of a possible rectangle feature on faces and non-faces.]

    Resulting weak classifier: [equation shown on slide]

    For next round, reweight the examples according to errors, choose another filter/threshold combo.

    Viola & Jones, CVPR 2001
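    For reference, Viola & Jones (CVPR 2001) write the weak classifier for a single rectangle feature f_j, threshold θ_j, and parity p_j (which sets the direction of the inequality) as below. The notation is theirs and may differ from the slide's.

```latex
h_j(x) =
\begin{cases}
  1 & \text{if } p_j \, f_j(x) < p_j \, \theta_j \\
  0 & \text{otherwise}
\end{cases}
```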
Viola-Jones Face Detector: Summary

    [Figure: Faces and Non-faces -> Train cascade of classifiers with AdaBoost -> Selected features, thresholds, and weights -> Apply to each subwindow of new image]

    • Train with 5K positives, 350M negatives
    • Real-time detector using 38 layer cascade
    • 6061 features in final layer
    • [Implementation available in OpenCV:
      http://www.intel.com/technology/computing/opencv/]
Viola-Jones Face Detector: Results
    First two features selected
Profile Features
    Detecting profile faces requires training a separate detector with profile examples.
Viola-Jones Face Detector: Results
    Paul Viola, ICCV tutorial
Example application


    Frontal faces detected and then tracked, character names inferred with alignment of script and subtitles.

    Everingham, M., Sivic, J. and Zisserman, A.
    "Hello! My name is... Buffy" - Automatic naming of characters in TV video,
    BMVC 2006.
    http://www.robots.ox.ac.uk/~vgg/research/nface/index.html
Pedestrian detection
    • Detecting upright, walking humans also possible using sliding window’s appearance/texture; e.g.,
        SVM with Haar wavelets [Papageorgiou & Poggio, IJCV 2000]
        Space-time rectangle features [Viola, Jones & Snow, ICCV 2003]
        SVM with HoGs [Dalal & Triggs, CVPR 2005]
Highlights
    • Sliding window detection and global appearance descriptors:
        Simple detection protocol to implement
        Good feature choices critical
        Past successes for certain classes
Limitations
    • High computational complexity
        For example: 250,000 locations x 30 orientations x 4 scales = 30,000,000 evaluations!
        If binary detectors are trained independently, cost increases linearly with number of classes
    • With so many windows, false positive rate better be low
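    The arithmetic behind these two points can be checked with a short sketch. The window counts come from the slide; the per-window false positive rate of one in a million and the number of classes are hypothetical values picked only to show the scale of the problem.

```python
def sliding_window_cost(locations, orientations, scales,
                        false_positive_rate, num_classes=1):
    """Return (classifier evaluations, expected false positives) per image.

    Assumes one independent binary detector per class, so the number of
    evaluations grows linearly with `num_classes`.
    """
    evaluations = locations * orientations * scales * num_classes
    expected_false_positives = evaluations * false_positive_rate
    return evaluations, expected_false_positives

if __name__ == "__main__":
    # Slide's example counts; the false positive rate is an assumption.
    print(sliding_window_cost(250_000, 30, 4, false_positive_rate=1e-6))
    # Same image with 20 independently trained detectors (hypothetical).
    print(sliding_window_cost(250_000, 30, 4, 1e-6, num_classes=20))
```

    Even at that assumed rate, a single detector would still be expected to fire on roughly 30 background windows per image.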
Limitations (continued)
    • Not all objects are “box” shaped
Limitations (continued)
    • Non-rigid, deformable objects not captured well with representations assuming a fixed 2d structure; or must assume fixed viewpoint
    • Objects with less-regular textures not captured well with holistic appearance-based descriptions
Limitations (continued)
    • If considering windows in isolation, context is lost

    [Figure: sliding window vs. detector’s view]
    Figure credit: Derek Hoiem
Limitations (continued)
    • In practice, often entails large, cropped training set (expensive)
    • Requiring good match to a global appearance description can lead to sensitivity to partial occlusions

    Image credit: Adam, Rivlin, & Shimshoni
Outline

    1. Detection with Global Appearance & Sliding Windows
    2. Local Invariant Features: Detection & Description
    3. Specific Object Recognition with Local Features
    ― Coffee Break ―
    4. Visual Words: Indexing, Bags of Words Categorization
    5. Matching Local Features
    6. Part-Based Models for Categorization
    7. Current Challenges and Research Directions
Motivation
    • Global representations have major limitations
    • Instead, describe and match only local regions
    • Increased robustness to
        Occlusions
        Articulation
        Intra-category variations
Approach
    1. Find a set of distinctive keypoints
    2. Define a region around each keypoint
    3. Extract and normalize the region content
    4. Compute a local descriptor from the normalized region
    5. Match local descriptors

    [Figure: regions A1, A2, A3 normalized to N x N pixels; descriptors fA and fB (e.g. color) compared with a similarity measure, d(fA, fB) < T]
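    The matching step, d(fA, fB) < T, can be sketched as a nearest-neighbor search with a distance threshold. This is a minimal illustration assuming descriptors are plain lists of numbers compared with Euclidean distance. The slide does not fix the distance measure, and the function names are made up for this sketch.

```python
import math

def euclidean(f_a, f_b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_a, f_b)))

def match_descriptors(desc_a, desc_b, threshold):
    """Match each descriptor in A to its nearest neighbor in B.

    Returns (index_a, index_b, distance) triples, keeping a pair only if the
    distance is below `threshold` (the T in d(fA, fB) < T).
    Assumes `desc_b` is non-empty.
    """
    matches = []
    for i, f_a in enumerate(desc_a):
        j, dist = min(((k, euclidean(f_a, f_b)) for k, f_b in enumerate(desc_b)),
                      key=lambda pair: pair[1])
        if dist < threshold:
            matches.append((i, j, dist))
    return matches

if __name__ == "__main__":
    image_a = [[0.0, 0.0], [10.0, 10.0]]
    image_b = [[9.0, 10.0], [0.0, 1.0], [50.0, 50.0]]
    print(match_descriptors(image_a, image_b, threshold=2.0))
```

    The brute-force search costs one distance computation per descriptor pair, which is fine for a sketch but is usually replaced by an index structure for large sets.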
Requirements
                                               • Region extraction needs to be repeatable and precise
                                                    Translation, rotation, scale changes
Visual Object Recognition Tutorial Computing




                                                    (Limited out-of-plane (≈affine) transformations)
                                                                            ≈
                                                    Lighting variations
Perceptual and Sensory Augmented




                                               • We need a sufficient number of regions to cover the
                                                 object

                                               • The regions should contain “interesting” structure




Many Existing Detectors Available
                                               •   Hessian & Harris                    [Beaudet ‘78], [Harris ‘88]
                                               •   Laplacian, DoG                      [Lindeberg ‘98], [Lowe ‘99]
                                               •   Harris-/Hessian-Laplace             [Mikolajczyk & Schmid ‘01]
                                               •   Harris-/Hessian-Affine              [Mikolajczyk & Schmid ‘04]
                                               •   EBR and IBR                         [Tuytelaars & Van Gool ‘04]
                                               •   MSER                                [Matas ‘02]
                                               •   Salient Regions                     [Kadir & Brady ‘01]
                                               •   Others…
Keypoint Localization
                                               • Goals:
                                                   Repeatable detection
                                                   Precise localization
                                                   Interesting content
                                                 ⇒ Look for two-dimensional signal changes
Hessian Detector [Beaudet78]
                                               • Hessian determinant
                                                   Hessian(I) = \begin{bmatrix} I_{xx} & I_{xy} \\ I_{xy} & I_{yy} \end{bmatrix}

                                                   [Figure: second-derivative images I_xx, I_xy, I_yy]




                                               Intuition: Search for strong
                                               derivatives in two
                                               orthogonal directions
Hessian Detector [Beaudet78]
                                               • Hessian determinant
                                                   Hessian(I) = \begin{bmatrix} I_{xx} & I_{xy} \\ I_{xy} & I_{yy} \end{bmatrix}

                                                   \det(Hessian(I)) = I_{xx} I_{yy} - I_{xy}^2

                                               In Matlab:
                                                       Ixx .* Iyy - Ixy.^2
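The determinant response can also be sketched in Python. This is a minimal illustration, assuming NumPy only, Gaussian pre-smoothing, and central differences for the derivatives; the helper `gaussian_smooth`, the parameter `sigma`, and the synthetic blob image are assumptions made for this example.

```python
import numpy as np

def gaussian_smooth(img, sigma):
    """Separable Gaussian blur with edge padding (NumPy only)."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def hessian_determinant(img, sigma=1.0):
    """det(Hessian(I)) = Ixx * Iyy - Ixy^2, evaluated at every pixel."""
    smoothed = gaussian_smooth(img.astype(float), sigma)
    Iy, Ix = np.gradient(smoothed)   # axis 0 is y (rows), axis 1 is x (columns)
    Ixy, Ixx = np.gradient(Ix)       # derivatives of Ix along y and x
    Iyy, _ = np.gradient(Iy)         # derivative of Iy along y
    return Ixx * Iyy - Ixy ** 2

# Synthetic test image (assumed): a single bright Gaussian blob centred at (20, 20)
yy, xx = np.mgrid[0:41, 0:41]
blob = np.exp(-((xx - 20) ** 2 + (yy - 20) ** 2) / (2 * 4.0 ** 2))
response = hessian_determinant(blob, sigma=1.0)
peak = np.unravel_index(response.argmax(), response.shape)
```

On this blob the strongest response sits at the blob centre, where the signal curves in both directions.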
Hessian Detector – Responses [Beaudet78]
                                               Effect: Responses mainly
                                               on corners and strongly
                                               textured areas.


Harris Detector [Harris88]
                                               • Second moment matrix
                                                 (autocorrelation matrix)

                                                   \mu(\sigma_I, \sigma_D) = g(\sigma_I) * \begin{bmatrix} I_x^2(\sigma_D) & I_x I_y(\sigma_D) \\ I_x I_y(\sigma_D) & I_y^2(\sigma_D) \end{bmatrix}

                                               Intuition: Search for local
                                               neighborhoods where the
                                               image content has two main
                                               directions (eigenvectors).

Harris Detector [Harris88]
                                               • Second moment matrix
                                                  (autocorrelation matrix)

                                                   \mu(\sigma_I, \sigma_D) = g(\sigma_I) * \begin{bmatrix} I_x^2(\sigma_D) & I_x I_y(\sigma_D) \\ I_x I_y(\sigma_D) & I_y^2(\sigma_D) \end{bmatrix}

                                               1. Image derivatives g_x(σ_D), g_y(σ_D)  ⇒  I_x, I_y
                                               2. Square of derivatives  ⇒  I_x^2, I_y^2, I_x I_y
                                               3. Gaussian filter g(σ_I)  ⇒  g(I_x^2), g(I_y^2), g(I_x I_y)
                                               4. Cornerness function: both eigenvalues are strong

                                                   har = \det[\mu(\sigma_I, \sigma_D)] - \alpha [\operatorname{trace}(\mu(\sigma_I, \sigma_D))]^2
                                                       = g(I_x^2) g(I_y^2) - [g(I_x I_y)]^2 - \alpha [g(I_x^2) + g(I_y^2)]^2

                                               5. Non-maxima suppression  ⇒  har
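The five steps can be sketched as follows. This is a minimal illustration, assuming NumPy only, central differences for the derivatives, a 3×3 neighbourhood for non-maxima suppression, and α = 0.06; the helper names, default parameters, and the synthetic square image are assumptions made for this example, not values taken from the slides.

```python
import numpy as np

def gaussian_smooth(img, sigma):
    """Separable Gaussian blur with edge padding (NumPy only)."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def harris_response(img, sigma_d=1.0, sigma_i=2.0, alpha=0.06):
    # 1. Image derivatives at the differentiation scale sigma_D
    smoothed = gaussian_smooth(img.astype(float), sigma_d)
    Iy, Ix = np.gradient(smoothed)
    # 2. Squares/products of derivatives, 3. Gaussian filter g(sigma_I)
    gxx = gaussian_smooth(Ix * Ix, sigma_i)
    gyy = gaussian_smooth(Iy * Iy, sigma_i)
    gxy = gaussian_smooth(Ix * Iy, sigma_i)
    # 4. Cornerness: det(mu) - alpha * trace(mu)^2
    return gxx * gyy - gxy ** 2 - alpha * (gxx + gyy) ** 2

def non_max_suppression(har, threshold):
    # 5. Keep pixels that are the maximum of their 3x3 neighbourhood
    #    and exceed the threshold; returns an array of (row, col) positions.
    h, w = har.shape
    padded = np.pad(har, 1, mode="constant", constant_values=-np.inf)
    neighbours = np.stack([padded[dy:dy + h, dx:dx + w]
                           for dy in range(3) for dx in range(3)])
    is_max = har >= neighbours.max(axis=0)
    return np.argwhere(is_max & (har > threshold))

# Synthetic test image (assumed): a bright square with four corners
img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0
har = harris_response(img)
corners = non_max_suppression(har, threshold=0.5 * har.max())
```

On the square, the response is positive near the four corners and negative along the straight edges, where only one eigenvalue of the second moment matrix is large.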
Harris Detector – Responses [Harris88]
                                               Effect: A very precise
                                               corner detector.




Automatic Scale Selection

                                                   f(I_{i_1 \ldots i_m}(x, \sigma)) = f(I_{i_1 \ldots i_m}(x', \sigma'))

                                               Same operator responses if the patch contains the same image up
                                               to a scale factor.
                                               How to find corresponding patch sizes?

Automatic Scale Selection
                                               • Function responses for increasing scale (scale signature)

                                               [Figure: scale signatures f(I_{i_1 \ldots i_m}(x, \sigma)) and f(I_{i_1 \ldots i_m}(x', \sigma)) for the two images]

Automatic Scale Selection
                                               • Function responses for increasing scale (scale signature)

                                               [Figure: scale signatures f(I_{i_1 \ldots i_m}(x, \sigma)) and f(I_{i_1 \ldots i_m}(x', \sigma')) with the selected scales σ and σ′]

What Is A Useful Signature Function?
                                               • Laplacian-of-Gaussian = “blob” detector
Laplacian-of-Gaussian (LoG)
                                               • Local maxima in scale space of Laplacian-of-Gaussian

                                                   L_{xx}(\sigma) + L_{yy}(\sigma)

                                               [Figure: scale-space stack at σ, σ^2, σ^3, σ^4, σ^5]  ⇒  List of (x, y, s)
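Scale selection with the Laplacian can be sketched by evaluating the response at one position over a range of scales and taking the scale with the strongest response. This is a minimal illustration, assuming NumPy only and the scale-normalized form σ²(L_xx + L_yy); the σ² factor is an assumption added here so that responses are comparable across scales, and the helper names and the synthetic blob are hypothetical.

```python
import numpy as np

def gaussian_smooth(img, sigma):
    """Separable Gaussian blur with edge padding (NumPy only)."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def scale_normalized_log(img, sigma):
    """sigma^2 * (Lxx(sigma) + Lyy(sigma)) at every pixel."""
    L = gaussian_smooth(img, sigma)
    Ly, Lx = np.gradient(L)
    Lyy = np.gradient(Ly, axis=0)
    Lxx = np.gradient(Lx, axis=1)
    return sigma ** 2 * (Lxx + Lyy)

def characteristic_scale(img, y, x, sigmas):
    """Scale signature at (y, x) and the scale with the strongest response."""
    signature = np.array([abs(scale_normalized_log(img, s)[y, x]) for s in sigmas])
    return sigmas[signature.argmax()], signature

# Synthetic test image (assumed): Gaussian blob with standard deviation 4
yy, xx = np.mgrid[0:81, 0:81]
blob = np.exp(-((xx - 40) ** 2 + (yy - 40) ** 2) / (2 * 4.0 ** 2))
sigmas = 2.0 ** (np.arange(7) / 2)          # 1, 1.41, 2, 2.83, 4, 5.66, 8
best_sigma, signature = characteristic_scale(blob, 40, 40, sigmas)
```

For a Gaussian blob, the normalized response peaks when σ matches the blob's own standard deviation, which is what makes the selected scale follow the structure size.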
Results: Laplacian-of-Gaussian
Difference-of-Gaussian (DoG)
                                               • Difference of Gaussians as approximation of the
                                                 Laplacian-of-Gaussian

                                               [Figure: Gaussian − Gaussian = Difference-of-Gaussian]

DoG – Efficient Computation
                                               • Computation in Gaussian scale pyramid
                                                    Scale levels separated by a factor σ = 2^{1/4}, starting from the original image
                                                    Sampling with step σ^4 = 2
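The pyramid computation can be sketched as follows. This is a minimal illustration, assuming NumPy only, four levels per octave (σ = 2^{1/4}), incremental blurring between levels, and subsampling by taking every second pixel once the blur has doubled; the function names and default parameters are assumptions made for this example and do not reproduce Lowe's implementation.

```python
import numpy as np

def gaussian_smooth(img, sigma):
    """Separable Gaussian blur with edge padding (NumPy only)."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def dog_pyramid(img, n_octaves=3, levels=4, sigma0=1.0):
    """Return a list of octaves; each octave is a list of DoG images."""
    k = 2.0 ** (1.0 / levels)                    # scale step, k^levels = 2
    base = gaussian_smooth(img.astype(float), sigma0)
    pyramid = []
    for _ in range(n_octaves):
        blurred = [base]
        for i in range(1, levels + 1):
            prev, target = sigma0 * k ** (i - 1), sigma0 * k ** i
            # Blurs add in variance, so only the missing amount is applied
            extra = np.sqrt(target ** 2 - prev ** 2)
            blurred.append(gaussian_smooth(blurred[-1], extra))
        # Difference of adjacent Gaussian levels
        pyramid.append([b1 - b0 for b0, b1 in zip(blurred[:-1], blurred[1:])])
        # The last level has twice the base blur: sample with step 2
        base = blurred[-1][::2, ::2]
    return pyramid

rng = np.random.default_rng(0)
image = rng.random((64, 64))                     # stand-in for a real image
pyramid = dog_pyramid(image)
flat = dog_pyramid(np.full((32, 32), 0.5), n_octaves=1)
```

Reusing the blurred images within each octave and halving the resolution between octaves is what keeps the cost low compared with filtering the full-resolution image at every scale.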
Results: Lowe’s DoG
Harris-Laplace [Mikolajczyk ‘01]
                                               1. Initialization: Multiscale Harris corner detection

                                               [Figure: computing the Harris function at scales σ, σ^2, σ^3, σ^4 and detecting local maxima at each scale]

Harris-Laplace [Mikolajczyk ‘01]
                                               1. Initialization: Multiscale Harris corner detection
                                               2. Scale selection based on Laplacian
                                                  (same procedure with Hessian ⇒ Hessian-Laplace)

                                               [Figure: Harris points vs. Harris-Laplace points]
Maximally Stable Extremal Regions [Matas ‘02]
                                               • Based on Watershed segmentation algorithm
                                               • Select regions that stay stable over a large parameter
                                                 range
Example Results: MSER
You Can Try It At Home…
                                               • For most local feature detectors, executables are
                                                 available online:
                                               • http://robots.ox.ac.uk/~vgg/research/affine
                                               • http://www.cs.ubc.ca/~lowe/keypoints/
                                               • http://www.vision.ee.ethz.ch/~surf
Orientation Normalization
                                               • Compute orientation histogram                      [Lowe, SIFT, 1999]
                                               • Select dominant orientation
                                               • Normalize: rotate to fixed orientation

                                               [Figure: orientation histogram over the range 0 to 2π]
                                                                      T. Tuytelaars, B. Leibe
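Finding the dominant orientation can be sketched as a magnitude-weighted histogram of gradient directions. This is a minimal illustration, assuming NumPy only, 36 bins over 0 to 2π, and no Gaussian weighting or peak interpolation; the function name and the ramp patch are assumptions made for this example. The final normalization step (rotating the patch by the negative of this angle) needs image resampling and is left out.

```python
import numpy as np

def dominant_orientation(patch, n_bins=36):
    """Return the centre of the strongest bin of the orientation histogram."""
    dy, dx = np.gradient(patch.astype(float))      # axis 0 is y, axis 1 is x
    magnitude = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx) % (2 * np.pi)       # gradient direction in [0, 2*pi]
    hist, edges = np.histogram(angle, bins=n_bins, range=(0, 2 * np.pi),
                               weights=magnitude)
    best = hist.argmax()
    return 0.5 * (edges[best] + edges[best + 1])

# Synthetic patch (assumed): intensity ramp increasing along the diagonal
yy, xx = np.mgrid[0:16, 0:16]
patch = (xx + yy).astype(float)
theta = dominant_orientation(patch)
```

With 36 bins the result is quantized to 10° steps, so the returned angle is only accurate to the bin width.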
Local Descriptors
                                               • The ideal descriptor should be
                                                    Repeatable
                                                    Distinctive
                                                    Compact
                                                    Efficient
                                               • Most available descriptors focus on edge/gradient
                                                 information
                                                    Capture texture information
                                                    Color is still relatively seldom used
                                                    (more suitable for homogeneous regions)
Local Descriptors: SIFT Descriptor
                                               Histogram of oriented gradients
                                               • Captures important texture information
                                               • Robust to small translations / affine deformations
                                               [Lowe, ICCV 1999]
Local Descriptors: SURF
                                               • Fast approximation of SIFT idea
                                                    Efficient computation by 2D box filters & integral images
                                                    ⇒ 6 times faster than SIFT
                                                    Equivalent quality for object identification
                                               • GPU implementation available
                                                    Feature extraction @ 100Hz
                                                    (detector + descriptor, 640×480 img)
                                                    http://www.vision.ee.ethz.ch/~surf
                                               [Bay, ECCV’06], [Cornelis, CVGPU’08]
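The reason box filters are cheap with integral images can be sketched in a few lines: once the integral image is built, the sum over any axis-aligned rectangle takes four lookups, independent of the rectangle's size. This is a generic illustration of the integral-image idea, assuming NumPy only; it is not the SURF implementation, and the function names are hypothetical.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] holds the sum of img[:y, :x]; an extra zero row/column is added."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] from four integral-image lookups."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(20, dtype=float).reshape(4, 5)
ii = integral_image(img)
total = box_sum(ii, 1, 1, 3, 4)     # same region as img[1:3, 1:4]
```

A box-filter response is then a small weighted combination of such rectangle sums.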
Local Descriptors: Shape Context

                                               Count the number of points inside each bin, e.g.:
                                                    Count = 4
                                                    ...
                                                    Count = 10
                                               Log-polar binning: more precision for nearby points,
                                               more flexibility for farther points.

                                               Belongie & Malik, ICCV 2001
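The log-polar point counting can be sketched as a 2-D histogram over log-spaced radii and uniformly spaced angles. This is a minimal illustration, assuming NumPy only, 5 radial and 12 angular bins, and distances normalized by their mean; the bin counts, radius limits, and function name are assumptions made for this example, not the exact settings of Belongie & Malik.

```python
import numpy as np

def shape_context(points, index, n_r=5, n_theta=12, r_min=0.125, r_max=2.0):
    """Histogram of the other points' positions relative to points[index]."""
    diff = np.delete(points, index, axis=0) - points[index]
    r = np.hypot(diff[:, 0], diff[:, 1])
    r = r / r.mean()                                  # normalize for scale
    theta = np.arctan2(diff[:, 1], diff[:, 0]) % (2 * np.pi)
    # Log-spaced radial edges: fine bins nearby, coarse bins farther away
    r_edges = np.logspace(np.log10(r_min), np.log10(r_max), n_r + 1)
    t_edges = np.linspace(0, 2 * np.pi, n_theta + 1)
    hist, _, _ = np.histogram2d(r, theta, bins=[r_edges, t_edges])
    return hist                                       # points outside the radius range are not counted

# Toy shape (assumed): a centre point plus 16 points on a unit circle
angles = 2 * np.pi * np.arange(16) / 16
circle = np.column_stack([np.cos(angles), np.sin(angles)])
points = np.vstack([[0.0, 0.0], circle])
hist = shape_context(points, index=0)
```

For the centre point, all 16 neighbours lie at the same normalized distance, so they fall into a single radial ring and spread over the angular bins.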
Local Descriptors: Geometric Blur

                                               Compute edges at four orientations
                                               Extract a patch in each channel
                                               Apply spatially varying blur and sub-sample

                                               [Figure: example descriptor (idealized signal)]

                                               Berg & Malik, CVPR 2001
So, What Local Features Should I Use?
                                               • There have been extensive evaluations/comparisons
                                                    [Mikolajczyk et al., IJCV’05, PAMI’05]
                                                    All detectors/descriptors shown here work well
                                               • Best choice often application dependent
                                                    MSER works well for buildings and printed things
                                                    Harris-/Hessian-Laplace/DoG work well for many natural
                                                    categories
                                               • More features are better
                                                    Combining several detectors often helps
Outline

                                               1. Detection with Global Appearance & Sliding Windows
                                               2. Local Invariant Features: Detection & Description
                                               3. Specific Object Recognition with Local Features
                                               ― Coffee Break ―
                                               4. Visual Words: Indexing, Bags of Words Categorization
                                               5. Matching Local Features
                                               6. Part-Based Models for Categorization
                                               7. Current Challenges and Research Directions

                                                                      K. Grauman, B. Leibe
Recognition with Local Features
                                               • Image content is transformed into local features that
                                                 are invariant to translation, rotation, and scale
                                               • Goal: Verify if they belong to a consistent configuration

                                               [Figure: local features, e.g. SIFT]

                                                                       K. Grauman, B. Leibe   Slide credit: David Lowe
Finding Consistent Configurations
                                               • Global spatial models
                                                    Generalized Hough Transform [Lowe99]
                                                    RANSAC [Obdrzalek02, Chum05, Nister06]
                                                    Basic assumption: object is planar

                                               • Assumption is often justified in practice
                                                    Valid for many structures on buildings
                                                    Sufficient for small viewpoint variations on 3D objects

                                                                          K. Grauman, B. Leibe
Hough Transform
                                               • Origin: Detection of straight lines in clutter
                                                       Basic idea: each candidate point votes
                                                       for all lines that it is consistent with.
                                                       Votes are accumulated in quantized array
                                                       Local maxima correspond to candidate lines

                                               • Representation of a line
                                                       Usual form y = a x + b has a singularity around 90° (vertical lines).
                                                       Better parameterization: x cos(θ) + y sin(θ) = ρ

                                               [Figure: a line in the (x, y) plane, parameterized by ρ and θ]

                                                                              K. Grauman, B. Leibe
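
The voting scheme on this slide can be sketched in a few lines. This is a minimal illustration, not the tutorial's implementation: the function names, the 1° angular resolution, and the `rho_step` bin width are assumptions chosen for the example. Each point votes for every quantized (θ, ρ) cell it is consistent with, and the cell with the most votes is taken as the candidate line.

```python
import math
from collections import Counter

def hough_lines(points, n_theta=180, rho_step=1.0):
    """Accumulate votes in a quantized (theta, rho) array for 2D points."""
    votes = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta               # theta in [0, pi)
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(t, round(rho / rho_step))] += 1      # quantize rho into bins
    return votes

def strongest_line(points, n_theta=180, rho_step=1.0):
    """Return (theta, rho, support) of the accumulator cell with most votes."""
    votes = hough_lines(points, n_theta, rho_step)
    (t, r), support = votes.most_common(1)[0]
    return math.pi * t / n_theta, r * rho_step, support

# Ten points on the vertical line x = 5 (where y = a x + b breaks down),
# plus two clutter points that do not lie on it.
points = [(5.0, float(y)) for y in range(0, 100, 10)] + [(1.0, 7.0), (8.0, 2.0)]
theta, rho, support = strongest_line(points)
print(theta, rho, support)
```

The vertical line is recovered without any special casing, which is the reason for preferring the (θ, ρ) form. The bin sizes trade localization accuracy against robustness to noise.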
Hough Transform: Noisy Line


                                               [Figure: tokens and their votes in (θ, ρ) space]

                                               • Problem: Finding the true maximum

                                                                     K. Grauman, B. Leibe     Slide credit: David Lowe
Hough Transform: Noisy Input


                                               [Figure: tokens and their votes in (θ, ρ) space]

                                               • Problem: Lots of spurious maxima

                                                                     K. Grauman, B. Leibe     Slide credit: David Lowe
Generalized Hough Transform [Ballard81]
                                               • Generalization for an arbitrary contour or shape
                                                    Choose reference point for the contour (e.g. center)
                                                    For each point on the contour remember where it is located
                                                    w.r.t. the reference point
                                                    Remember radius r and angle φ
                                                    relative to the contour tangent
                                                    Recognition: whenever you find
                                                    a contour point, calculate the
                                                    tangent angle and ‘vote’ for all
                                                    possible reference points

                                                   Instead of reference point, can also vote for transformation
                                                  ⇒ The same idea can be used with local features!

                                                                          K. Grauman, B. Leibe   Slide credit: Bernt Schiele
Gen. Hough Transform with Local Features
                                               • For every feature, store possible “occurrences”
                                                    – Object identity
                                                    – Pose
                                                    – Relative position
                                               • For new image, let the matched features vote for
                                                 possible object positions

                                                                       K. Grauman, B. Leibe
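
A rough sketch of this voting step follows, under simplifying assumptions that are not from the slides: features are identified by hypothetical string ids, each stored occurrence is just an object id plus an offset to the object's reference point, and only translation is handled (no scale or rotation). The names `vote_for_centers`, `occurrences`, and `bin_size` are illustrative.

```python
from collections import Counter

def vote_for_centers(matches, occurrences, bin_size=10):
    """Let matched features vote for (object, quantized reference position).

    matches:     list of (feature_id, (x, y)) found in the new image
    occurrences: feature_id -> list of (object_id, (dx, dy)), the offsets
                 from the feature to the object reference point (training)
    """
    votes = Counter()
    for feature_id, (x, y) in matches:
        for object_id, (dx, dy) in occurrences.get(feature_id, []):
            cx, cy = x + dx, y + dy                      # predicted reference point
            votes[(object_id, int(cx // bin_size), int(cy // bin_size))] += 1
    return votes

# Hypothetical occurrences stored at training time.
occurrences = {
    "wheel":  [("car", (30, -20))],
    "lamp":   [("car", (-40, -10))],
    "corner": [("car", (0, 15)), ("house", (5, 5))],
}
# Matches in a new image; the last one is a spurious match in the clutter.
matches = [("wheel", (70, 120)), ("lamp", (142, 113)),
           ("corner", (101, 88)), ("wheel", (300, 40))]

votes = vote_for_centers(matches, occurrences)
best_cell, best_votes = votes.most_common(1)[0]
print(best_cell, best_votes)
```

The three consistent matches agree on one cell while the spurious match and the ambiguous "house" occurrence each produce an isolated vote. A hard grid like this can split a true peak across neighboring cells when votes land near a bin boundary.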
3D Object Recognition
                                               • Gen. HT for Recognition       [Lowe99]

                                                   Typically only 3 feature matches
                                                   needed for recognition
                                                   Extra matches provide robustness
                                                   Affine model can be used for
                                                   planar objects

                                                                       K. Grauman, B. Leibe   Slide credit: David Lowe
View Interpolation
                                               • Training
                                                    Training views from similar
                                                    viewpoints are clustered
                                                    based on feature matches.
                                                    Matching features between
                                                    adjacent views are linked.

                                               • Recognition
                                                    Feature matches may be
                                                    spread over several
                                                    training viewpoints.
                                                  ⇒ Use the known links to “transfer votes” to other viewpoints.

                                                                                                                    [Lowe01]
                                                                          K. Grauman, B. Leibe   Slide credit: David Lowe
Recognition Using View Interpolation

                                                                                                          [Lowe01]
                                                                K. Grauman, B. Leibe   Slide credit: David Lowe
Location Recognition

                                                    Training

                                                                                                  [Lowe04]
                                                               K. Grauman, B. Leibe   Slide credit: David Lowe
Applications
                                               • Sony Aibo
                                                 (Evolution Robotics)

                                               • SIFT usage
                                                   Recognize docking station
                                                   Communicate with visual cards

                                               • Other uses
                                                   Place recognition
                                                   Loop closure in SLAM

                                                                          K. Grauman, B. Leibe   Slide credit: David Lowe
RANSAC (RANdom SAmple Consensus) [Fischler81]
                                               • Randomly choose a minimal subset of data points
                                                 necessary to fit a model (a sample)
                                               • Points within some distance threshold t of model are a
                                                 consensus set. Size of consensus set is model’s support.
                                               • Repeat for N samples; model with biggest support is
                                                 most robust fit
                                                    Points within distance t of best model are inliers
                                                    Fit final model to all inliers

                                                                          K. Grauman, B. Leibe     Slide credit: David Lowe
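
The loop described above can be sketched for the simplest case, fitting a 2D line, where the minimal subset is two points. This is an illustrative sketch rather than a reference implementation: `ransac_line`, the fixed random seed, and the default threshold are assumptions made for the example.

```python
import math
import random

def line_through(p, q):
    """Line through two points as (a, b, c), a*x + b*y + c = 0, a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = math.hypot(a, b)
    if norm == 0.0:
        return None                                   # degenerate sample
    return a / norm, b / norm, -(a * x1 + b * y1) / norm

def ransac_line(points, t=1.0, n_samples=100, seed=0):
    """Return the model with the largest consensus set, and that set."""
    rng = random.Random(seed)
    best_model, best_consensus = None, []
    for _ in range(n_samples):
        model = line_through(*rng.sample(points, 2))  # minimal subset for a line
        if model is None:
            continue
        a, b, c = model
        # Consensus set: points within distance t of the hypothesized line.
        consensus = [(x, y) for x, y in points if abs(a * x + b * y + c) <= t]
        if len(consensus) > len(best_consensus):      # support = size of set
            best_model, best_consensus = model, consensus
    return best_model, best_consensus

line_points = [(float(x), 2.0 * x + 1.0) for x in range(20)]   # y = 2x + 1
outliers = [(0.0, 30.0), (3.0, -10.0), (5.0, 40.0), (8.0, -5.0),
            (12.0, 0.0), (15.0, 60.0), (18.0, 5.0), (10.0, 50.0)]
model, inliers = ransac_line(line_points + outliers)
print(len(inliers))
```

The final least-squares fit to all inliers is left out here to keep the sketch short.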
RANSAC: How many samples?
                                               • How many samples are needed?
                                                    Suppose w is fraction of inliers (points from line).
                                                    n points needed to define hypothesis (2 for lines)
                                                    k samples chosen.

                                               • Prob. that a single sample of n points is correct: w^n
                                               • Prob. that all samples fail is: (1 − w^n)^k

                                               ⇒ Choose k high enough to keep this below desired failure
                                                rate.

                                                                           K. Grauman, B. Leibe    Slide credit: David Lowe
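
Solving (1 − w^n)^k ≤ p_fail for k gives k ≥ log(p_fail) / log(1 − w^n). A small helper makes the numbers concrete; the function name and the 1% default failure rate are assumptions for illustration.

```python
import math

def num_samples(w, n, failure_rate=0.01):
    """Smallest k such that (1 - w**n)**k <= failure_rate."""
    if not 0.0 < w <= 1.0:
        raise ValueError("inlier fraction w must be in (0, 1]")
    p_good = w ** n                      # prob. one sample is all inliers
    if p_good >= 1.0:
        return 1                         # every sample is correct
    return math.ceil(math.log(failure_rate) / math.log(1.0 - p_good))

k_line = num_samples(w=0.5, n=2)         # line fitting with 50% inliers
k_four = num_samples(w=0.5, n=4)         # a model needing 4 points per sample
print(k_line, k_four)
```

With half the points being inliers, a two-point model needs 17 samples for a 1% failure rate, while a four-point model already needs 72. The count grows quickly as w drops or n rises.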
After RANSAC
                                               • RANSAC divides data into inliers and outliers and yields
                                                 estimate computed from minimal set of inliers
                                               • Improve this initial estimate with estimation over all
                                                 inliers (e.g. with standard least-squares minimization)
                                               • But this may change inliers, so alternate fitting with
                                                 re-classification as inlier/outlier

                                                                       K. Grauman, B. Leibe   Slide credit: David Lowe
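
One possible way to alternate the two steps is sketched below for a line y = m x + b. The sketch assumes a non-vertical line and uses the vertical residual as the inlier test; both choices, and the names `least_squares_line` and `refine`, are simplifications made for this example.

```python
def least_squares_line(points):
    """Ordinary least-squares fit of y = m*x + b (assumes x values differ)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    m = sxy / sxx
    return m, mean_y - m * mean_x

def refine(points, m, b, t=1.0, max_iters=10):
    """Alternate inlier classification and refitting until the set is stable."""
    inliers = []
    for _ in range(max_iters):
        new_inliers = [(x, y) for x, y in points if abs(y - (m * x + b)) <= t]
        if new_inliers == inliers or len(new_inliers) < 2:
            break                                    # converged, or too few points
        inliers = new_inliers
        m, b = least_squares_line(inliers)           # refit on all current inliers
    return m, b, inliers

points = [(float(x), 2.0 * x + 1.0) for x in range(10)]   # y = 2x + 1
points += [(2.0, 30.0), (7.0, -20.0)]                     # two outliers
# A rough initial estimate, e.g. from a minimal sample, that misses some inliers.
m, b, inliers = refine(points, m=2.3, b=0.1)
print(m, b, len(inliers))
```

Here the initial estimate captures only seven of the ten points on the line; after one refit the remaining three are reclassified as inliers and the set stops changing.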
Example: Finding Feature Matches
                                               • Find best stereo match within a square search window
                                                 (here 300 pixels²)
                                               • Global transformation model: epipolar geometry

                                               [Figure: matches before RANSAC and after RANSAC, from Hartley & Zisserman]

                                                                       K. Grauman, B. Leibe       Slide credit: David Lowe
Comparison
                                               Gen. Hough Transform
                                               • Advantages
                                                     Very effective for recognizing arbitrary shapes or objects
                                                     Can handle high percentage of outliers (>95%)
                                                     Extracts groupings from clutter in linear time
                                               • Disadvantages
                                                     Quantization issues
                                                     Only practical for small number of dimensions (up to 4)
                                               • Improvements available
                                                     Probabilistic Extensions
                                                     Continuous Voting Space
                                                     [Leibe08]

                                               RANSAC
                                               • Advantages
                                                     General method suited to large range of problems
                                                     Easy to implement
                                                     Independent of number of dimensions
                                               • Disadvantages
                                                     Only handles moderate number of outliers (<50%)
                                               • Many variants available, e.g.
                                                     PROSAC: Progressive RANSAC [Chum05]
                                                     Preemptive RANSAC [Nister05]

                                                                                K. Grauman, B. Leibe
Example Applications

                                                  Mobile tourist guide
                                                  • Self-localization
                                                  • Object/building recognition
                                                  • Photo/video augmentation

                                                                           B. Leibe   [Quack, Leibe, Van Gool, CIVR’08]
Web Demo: Movie Poster Recognition

                                               50’000 movie posters indexed

                                               Query-by-image from mobile phone available in Switzerland

                                               http://www.kooaba.com/en/products_engine.html#
                                                                                 K. Grauman, B. Leibe
Application: Large-Scale Retrieval

                                                Query   Results from 5k Flickr images (demo available for 100k set)
                                                                           K. Grauman, B. Leibe           [Philbin CVPR’07]
Application: Image Auto-Annotation
                                               [Figure panels: Moulin Rouge; Old Town Square (Prague); Tour Montparnasse; Colosseum; Viktualienmarkt Maypole]

                                               Left: Wikipedia image
                                               Right: closest match from Flickr

                                               [Quack CIVR’08]               K. Grauman, B. Leibe
Outline

                                               1. Detection with Global Appearance & Sliding Windows
                                               2. Local Invariant Features: Detection & Description
                                               3. Specific Object Recognition with Local Features
                                               ― Coffee Break ―
                                               4. Visual Words: Indexing, Bags of Words Categorization
                                               5. Matching Local Feature Sets
                                               6. Part-Based Models for Categorization
                                               7. Current Challenges and Research Directions

                                                                      K. Grauman, B. Leibe
Global representations: limitations
                                               • Success may rely on alignment
                                                 -> sensitive to viewpoint
                                               • All parts of the image or window impact the description
                                                 -> sensitive to occlusion, clutter

                                                                      K. Grauman, B. Leibe
Local representations
                                               • Describe component regions or patches separately.
                                               • Many options for detection & description…

                                                    SIFT [Lowe 99]
                                                    Shape context [Belongie 02]
                                                    Superpixels [Ren et al.]
                                                    Maximally Stable Extremal Regions [Matas 02]
                                                    Salient regions [Kadir 01]
                                                    Harris-Affine [Mikolajczyk 04]
                                                    Spin images [Johnson 99]
                                                    Geometric Blur [Berg 05]

                                                                                K. Grauman, B. Leibe
Recall: Invariant local features
                                               Subset of local feature types
                                               designed to be invariant to
                                                     Scale
                                                     Translation
                                                     Rotation
                                                     Affine transformations
                                                     Illumination

                                               1) Detect interest points
                                               2) Extract descriptors

                                               [Figure: descriptor vectors (x1, x2, …, xd) and (y1, y2, …, yd)]

                                               [Mikolajczyk01, Matas02, Tuytelaars04, Lowe99, Kadir01,… ]


                                                                             K. Grauman, B. Leibe
Recognition with local feature sets
                                               • Previously, we saw how to use
                                                 local invariant features + a global
                                                 spatial model to recognize
                                                 specific objects, using a planar
                                                 object assumption.

                                               • Now, we’ll use local features for
                                                    Indexing-based recognition
                                                    Bags of words representations
                                                    Correspondence / matching kernels

                                                                         K. Grauman, B. Leibe
Basic flow




                                               Detect or sample features: list of positions, scales, orientations
                                               Describe features: associated list of d-dimensional descriptors
                                               Index each one into pool of descriptors from previously seen images

                                                                               K. Grauman, B. Leibe
Indexing local features
• Each patch / region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT)

Indexing local features
• When we see close points in feature space, we have similar descriptors, which indicates similar local content.

Figure credit: A. Zisserman

Indexing local features
• We saw in the previous section how to use voting and pose clustering to identify objects using local features

Figure credit: David Lowe

Indexing local features
• With potentially thousands of features per image, and hundreds to millions of images to search, how can we efficiently find those that are relevant to a new image?
    – Low-dimensional descriptors: can use standard efficient data structures for nearest neighbor search
    – High-dimensional descriptors: approximate nearest neighbor search methods are more practical
    – Inverted file indexing schemes

Indexing local features: approximate nearest neighbor search
• Best-Bin First (BBF), a variant of k-d trees that uses a priority queue to examine the most promising branches first [Beis & Lowe, CVPR 1997]
• Locality-Sensitive Hashing (LSH), a randomized hashing technique using hash functions that map similar points to the same bin, with high probability [Indyk & Motwani, 1998]

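The hashing idea can be illustrated with a minimal sketch. It assumes a random-hyperplane hash family (the sign of random projections), which is only one possible LSH scheme and is not specified on the slide. The function names and the single-table setup are hypothetical; practical systems typically use several tables.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw random hyperplane normals; each one contributes one hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_key(vec, planes):
    """Hash = tuple of signs of the projections onto each hyperplane normal."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def build_table(descriptors, planes):
    """Map each hash bucket to the indices of the descriptors that fall in it."""
    table = defaultdict(list)
    for idx, vec in enumerate(descriptors):
        table[hash_key(vec, planes)].append(idx)
    return table

def query(vec, descriptors, table, planes):
    """Return indices from the query's bucket, sorted by exact squared distance."""
    candidates = table.get(hash_key(vec, planes), [])
    return sorted(
        candidates,
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, descriptors[i])),
    )
```

Only the descriptors in the query's bucket are compared exactly, which is where the speed-up over a linear scan comes from. The price is that a true neighbor falling in a different bucket is missed.
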
Indexing local features: inverted file index
• For text documents, an efficient way to find all pages on which a word occurs is to use an index…
• We want to find all images in which a feature occurs.
• To use this idea, we’ll need to map our features to “visual words”.

Visual words: main idea
• Extract some local features from a number of images …

e.g., SIFT descriptor space: each point is 128-dimensional

Slide credit: D. Nister

Visual words: main idea
Map high-dimensional descriptors to tokens/words by quantizing the feature space
• Quantize via clustering, let cluster centers be the prototype “words”

[Figure: descriptor space]

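As a rough illustration of quantizing via clustering, the sketch below runs plain k-means (Lloyd's algorithm) over a list of descriptors and returns the cluster centers as the vocabulary. The function names, the random initialization, and the fixed iteration count are assumptions made for this example, not details from the tutorial.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(descriptors, k, iters=20, seed=0):
    """Return k cluster centers (the prototype 'words') for the descriptors."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(descriptors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each descriptor joins its closest center.
        for d in descriptors:
            j = min(range(k), key=lambda c: sq_dist(d, centers[c]))
            clusters[j].append(d)
        # Update step: move each center to the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers
```

For 128-dimensional SIFT descriptors and large vocabularies, a pure-Python loop like this would be far too slow; it is meant only to make the two alternating steps explicit.
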
Visual words: main idea
Map high-dimensional descriptors to tokens/words by quantizing the feature space
• Determine which word to assign to each new image region by finding the closest cluster center.

[Figure: descriptor space]

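Word assignment is a nearest-center search. The sketch below assumes the vocabulary is a plain list of center vectors and uses a brute-force scan; `assign_word` and `quantize_image` are hypothetical helper names.

```python
def assign_word(descriptor, centers):
    """Return the index of the closest cluster center, i.e. the visual word id."""
    best_word, best_dist = -1, float("inf")
    for word, center in enumerate(centers):
        dist = sum((x - y) ** 2 for x, y in zip(descriptor, center))
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word

def quantize_image(descriptors, centers):
    """Replace every descriptor of an image by its visual word id."""
    return [assign_word(d, centers) for d in descriptors]
```

The scan costs one distance computation per word for every descriptor, which is what motivates approximate search and hierarchical vocabularies for large vocabularies.
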
Visual words
• Example: each group of patches belongs to the same visual word

Figure from Sivic & Zisserman, ICCV 2003

Visual words
• First explored for texture and material representations
• Texton = cluster center of filter responses over collection of images
• Describe textures and materials based on distribution of prototypical texture elements.

Leung & Malik 1999; Varma & Zisserman, 2002; Lazebnik, Schmid & Ponce, 2003;

Visual words
• More recently used for describing scenes and objects for the sake of indexing or classification.

Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others.

Inverted file index for images comprised of visual words
[Figure: table mapping each word number to the list of image numbers in which it occurs]

Image credit: A. Zisserman

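A minimal sketch of such an index, assuming each image has already been reduced to a list of visual word ids. The dictionary-of-sets layout and the function names are illustrative choices, not the tutorial's implementation.

```python
from collections import defaultdict

def build_inverted_index(image_words):
    """image_words: dict image_id -> iterable of visual word ids.
    Returns a dict word id -> set of image ids containing that word."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for w in words:
            index[w].add(image_id)
    return index

def candidate_images(index, query_words):
    """Images sharing at least one word with the query, with the number of
    distinct shared words, most shared first."""
    votes = defaultdict(int)
    for w in set(query_words):
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

A query touches only the lists of the words it contains, so images with no word in common are never visited.
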
Bags of visual words
• Summarize entire image based on its distribution (histogram) of word occurrences.
• Analogous to bag of words representation commonly used for documents.

Image credit: Fei-Fei Li

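The histogram can be sketched in a few lines, assuming the image's descriptors have already been mapped to word ids in the range 0 to vocab_size - 1. The L1 normalization is an optional choice added here so that images with different numbers of features remain comparable.

```python
def bag_of_words(word_ids, vocab_size, normalize=True):
    """Count how often each visual word occurs in one image."""
    hist = [0.0] * vocab_size
    for w in word_ids:
        hist[w] += 1.0
    total = sum(hist)
    if normalize and total > 0:
        hist = [h / total for h in hist]  # L1-normalize to a distribution
    return hist
```

Whatever the number of features in the image, the result always has vocab_size entries, and the order in which the features were detected plays no role.
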
Video Google System
1. Collect all words within query region
2. Inverted file index to find relevant frames
3. Compare word counts
4. Spatial verification

[Figure: query region and retrieved frames]

Sivic & Zisserman, ICCV 2003

• Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html

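Step 3 can be sketched as follows. The example assumes tf-idf weighting with cosine similarity, a standard choice borrowed from text retrieval; the exact weighting, the function names, and the data layout are assumptions for illustration. Steps 2 and 4 (inverted file lookup and spatial verification) are not covered here.

```python
import math
from collections import Counter

def weight(words, idf):
    """tf-idf vector of one frame or query, stored sparsely as a dict."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: (c / total) * idf.get(w, 0.0) for w, c in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_frames(query_words, frame_words):
    """frame_words: dict frame_id -> list of word ids.
    Returns frame ids sorted from most to least similar to the query."""
    n = len(frame_words)
    df = Counter()
    for words in frame_words.values():
        df.update(set(words))  # document frequency of each word
    idf = {w: math.log(n / c) for w, c in df.items()}
    q = weight(query_words, idf)
    scores = {f: cosine(q, weight(ws, idf)) for f, ws in frame_words.items()}
    return sorted(scores, key=lambda f: (-scores[f], f))
```

Words that occur in nearly every frame get an idf weight close to zero, so they contribute little to the ranking.
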
Basic flow
[Figure: pipeline. Detect or sample features (list of positions, scales, orientations) -> Describe features (associated list of d-dimensional descriptors) -> either index each one into pool of descriptors from previously seen images, or quantize to form bag of words vector for the image]

Visual vocabulary formation
Issues:
• Sampling strategy
• Clustering / quantization algorithm
• Unsupervised vs. supervised
• What corpus provides features (universal vocabulary?)
• Vocabulary size, number of words

Sampling strategies
[Figure panels: Sparse, at interest points · Dense, uniformly · Randomly · Multiple interest operators]
• To find specific, textured objects, sparse sampling from interest points often more reliable.
• Multiple complementary interest operators offer more image coverage.
• For object categorization, dense sampling offers better coverage.

[See Nowak, Jurie & Triggs, ECCV 2006]

Image credits: F-F. Li, E. Nowak, J. Sivic

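The dense and random strategies amount to different ways of choosing patch centers. The sketch below is a hypothetical illustration with made-up parameter names (`step`, `patch`). Sparse sampling would instead take its positions from an interest point detector, which is not shown.

```python
import random

def dense_grid(width, height, step, patch):
    """Patch centers on a regular grid, keeping whole patches inside the image."""
    half = patch // 2
    return [
        (x, y)
        for y in range(half, height - half, step)
        for x in range(half, width - half, step)
    ]

def random_positions(width, height, n, patch, seed=0):
    """n patch centers drawn uniformly at random inside the valid region."""
    rng = random.Random(seed)
    half = patch // 2
    return [
        (rng.randint(half, width - half - 1), rng.randint(half, height - half - 1))
        for _ in range(n)
    ]
```
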
Clustering / quantization methods
• k-means (typical choice), agglomerative clustering, mean-shift, …
• Hierarchical clustering: allows faster insertion / word assignment while still allowing large vocabularies
    – Vocabulary tree [Nister & Stewenius, CVPR 2006]

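A vocabulary tree can be sketched as recursive (hierarchical) k-means. The code below is an illustration under simplifying assumptions (random initialization, fixed iteration count, nested-list tree layout, hypothetical function names) and not the implementation from the cited paper.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters, rng):
    """Plain k-means; returns the centers and the last point-to-center grouping."""
    centers = [list(p) for p in rng.sample(points, k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:  # an empty group keeps its previous center
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers, groups

def build_tree(points, branch, depth, rng=None, iters=10):
    """Each node is a list of (center, child) pairs; a leaf has child None."""
    rng = rng or random.Random(0)
    if depth == 0 or len(points) < branch:
        return None
    centers, groups = kmeans(points, branch, iters, rng)
    return [
        (c, build_tree(g, branch, depth - 1, rng, iters))
        for c, g in zip(centers, groups)
    ]

def lookup(tree, descriptor):
    """Descend the tree greedily; the path of branch indices is the word."""
    path, node = [], tree
    while node is not None:
        j = min(range(len(node)), key=lambda i: sq_dist(descriptor, node[i][0]))
        path.append(j)
        node = node[j][1]
    return tuple(path)
```

With branching factor b and depth L, a lookup compares the descriptor against about b·L centers, while the tree can distinguish up to b^L leaf words. This is the sense in which hierarchical clustering allows fast word assignment with a large vocabulary.
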
Example: Recognition with Vocabulary Tree
• Tree construction:

[Nister & Stewenius, CVPR’06]
Slide credit: David Nister

Vocabulary Tree
• Training: Filling the tree

[Nister & Stewenius, CVPR’06]
Slide credit: David Nister

Vocabulary Tree
• Recognition

[Figure: recognition pipeline with RANSAC verification]

[Nister & Stewenius, CVPR’06]
Slide credit: David Nister

Vocabulary Tree: Performance
• Evaluated on large databases
    – Indexing with up to 1M images
• Online recognition for database of 50,000 CD covers
    – Retrieval in ~1s
• Find experimentally that large vocabularies can be beneficial for recognition

[Nister & Stewenius, CVPR’06]

Vocabulary formation
• Ensembles of trees provide additional robustness

Moosmann, Jurie, & Triggs 2006; Yeh, Lee, & Darrell 2007; Bosch, Zisserman, & Munoz 2007; …

Figure credit: F. Jurie

Supervised vocabulary formation
• Recent work considers how to leverage labeled images when constructing the vocabulary

Perronnin, Dance, Csurka, & Bressan, Adapted Vocabularies for Generic Visual Categorization, ECCV 2006.

Supervised vocabulary formation
• Merge words that don’t aid in discriminability

Winn, Criminisi, & Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV 2005

Supervised vocabulary formation
• Consider vocabulary and classifier construction jointly.

Yang, Jin, Sukthankar, & Jurie, Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008.

Learning and recognition with bag of words histograms
• Bag of words representation makes it possible to describe the unordered point set with a single vector (of fixed dimension across image examples)
• Provides easy way to use distribution of feature types with various learning algorithms requiring vector input.

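As one example of a learning algorithm that takes vector input, the sketch below classifies a bag of words histogram by nearest neighbor under the chi-squared distance. Both the classifier and the distance are illustrative assumptions; any method that accepts fixed-length vectors could be substituted.

```python
def chi2(h1, h2, eps=1e-12):
    """Chi-squared distance between two histograms of equal length."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))

def nearest_neighbor_label(query, train_hists, train_labels):
    """Return the label of the training histogram closest to the query."""
    best = min(range(len(train_hists)), key=lambda i: chi2(query, train_hists[i]))
    return train_labels[best]
```

The same function works for images with ten features or ten thousand, because the histogram length depends only on the vocabulary size.
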
Learning and recognition with bag of words histograms
• …including unsupervised topic models designed for documents.
• Hierarchical Bayesian text models (pLSA and LDA)
    – Hofmann 2001, Blei, Ng & Jordan, 2003
    – For object and scene categorization: Sivic et al. 2005, Sudderth et al. 2005, Quelhas et al. 2005, Fei-Fei et al. 2005

Figure credit: Fei-Fei Li

Learning and recognition with bag of words histograms
• …including unsupervised topic models designed for documents.

[Figure: Probabilistic Latent Semantic Analysis (pLSA) graphical model, d -> z -> w, with plates N and D; example topic “face”]

Sivic et al. ICCV 2005
[pLSA code available at: http://www.robots.ox.ac.uk/~vgg/software/]

Figure credit: Fei-Fei Li

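For reference, the d, z, w model can be written out as the standard pLSA factorization with its EM updates. Here d indexes images (documents), w visual words, and z latent topics; the count notation n(d, w) is introduced for this sketch and does not appear on the slide.

```latex
% Each word w in image d is generated through a latent topic z.
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)

% E-step: posterior over topics for each (image, word) pair.
P(z \mid d, w) = \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)}

% M-step: re-estimate the distributions from the expected counts,
% where n(d, w) is the number of times word w occurs in image d.
P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w), \qquad
P(z \mid d) \propto \sum_{w} n(d, w)\, P(z \mid d, w)
```

The counts n(d, w) are exactly the bag of words histograms, which is why a model designed for text documents can be applied to images unchanged.
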
Bags of words: pros and cons
                                               +   flexible to geometry / deformations / viewpoint
                                               +   compact summary of image content
Visual Object Recognition Tutorial Computing




                                               +   provides vector representation for sets
                                               +   has yielded good recognition results in practice

                                               - basic model ignores geometry – must verify afterwards,
                                                 or encode via features
                                               - background and foreground mixed when bag covers
                                                 whole image
                                               - interest points or sampling: no guarantee to capture
                                                 object-level parts
                                               - optimal vocabulary formation remains unclear
                                                                                                          47
                                                                        K. Grauman, B. Leibe
Outline

                                               1. Detection with Global Appearance & Sliding Windows
                                               2. Local Invariant Features: Detection & Description
                                               3. Specific Object Recognition with Local Features
                                               ― Coffee Break ―
                                               4. Visual Words: Indexing, Bags of Words Categorization
                                               5. Matching Local Feature Sets
                                               6. Part-Based Models for Categorization
                                               7. Current Challenges and Research Directions

                                                                                                         48
                                                                      K. Grauman, B. Leibe
Visual Object Recognition
                                               Bastian Leibe                &   Kristen Grauman
                                               Computer Vision Laboratory       Department of Computer Sciences
                                               ETH Zurich                       University of Texas in Austin

                                               Chicago, 14.07.2008
Basic flow
                                               1. Detect or sample features
                                                  (list of positions, scales, orientations)
                                               2. Describe features
                                                  (associated list of d-dimensional descriptors)
                                               3. Then one of:
                                                    – Index each one into pool of descriptors from
                                                      previously seen images
                                                    – Quantize to form bag of words vector for the image
                                                    – Compute match with another image
                                                                                                                                 3
                                                                               K. Grauman, B. Leibe
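
The quantization branch of this flow can be sketched as follows. This is a minimal illustration that assumes a visual vocabulary is already available as a list of cluster centers; the names `quantize` and `bag_of_words`, the brute-force nearest-neighbor search, and the 2-dimensional toy descriptors are assumptions for the example (real descriptors would be, for instance, 128-dimensional).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(descriptor, vocabulary):
    """Index of the nearest visual word (brute-force search)."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(descriptor, vocabulary[k]))


def bag_of_words(descriptors, vocabulary):
    """Histogram of visual-word counts for one image."""
    histogram = [0] * len(vocabulary)
    for descriptor in descriptors:
        histogram[quantize(descriptor, vocabulary)] += 1
    return histogram


# Toy example: three visual words and four descriptors from one image.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.1, 1.2)]
print(bag_of_words(descriptors, vocabulary))  # [1, 2, 1]
```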
Local feature correspondences
                                               • The matching between sets of local features helps to
                                                 establish overall similarity between objects or shapes.
                                               • Assigned matches are also useful for localization

                                               Examples: shape context [Belongie & Malik 2001], low-distortion matching
                                               [Berg & Malik 2005], match kernel [Wallraven, Caputo & Graf 2003]

                                                                                                                           4
                                                                          K. Grauman, B. Leibe
Local feature correspondences
                                               • Least cost match: minimize total cost between matched
                                                 points

                                               $$\min_{\pi : X \to Y} \sum_{x_i \in X} \lVert x_i - \pi(x_i) \rVert$$

                                               • Least cost partial match: match all of smaller set to
                                                 some portion of larger set.
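
To make the least cost partial match concrete, the sketch below enumerates every one-to-one assignment of the smaller set into the larger one and keeps the cheapest. It is a brute-force illustration only, usable for a handful of points; practical optimal matchers (for example the Hungarian algorithm) run in cubic time. The function name and the toy points are hypothetical.

```python
from itertools import permutations
from math import dist


def least_cost_partial_match(xs, ys):
    """Match every point of the smaller set to a distinct point of the
    larger set so that the summed Euclidean distance is minimal.

    Returns (cost, assignment), where assignment[i] is the index in the
    larger set matched to point i of the smaller set.
    """
    small, large = (xs, ys) if len(xs) <= len(ys) else (ys, xs)
    best_cost, best_assignment = float("inf"), None
    # Try every injective mapping from the smaller set into the larger set.
    for targets in permutations(range(len(large)), len(small)):
        cost = sum(dist(p, large[j]) for p, j in zip(small, targets))
        if cost < best_cost:
            best_cost, best_assignment = cost, targets
    return best_cost, best_assignment


# Two points matched into a set of three; the far-away point stays unmatched.
cost, assignment = least_cost_partial_match(
    [(0, 0), (4, 0)], [(0, 1), (4, 3), (10, 10)])
print(cost, assignment)  # 4.0 (0, 1)
```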
Pyramid match kernel (PMK)
                                               • Optimal matching is expensive relative to the number of
                                                 features per image (m).
                                               • PMK is an approximate partial match for efficient
                                                 discriminative learning from sets of local features.

                                                                                          Optimal match: $O(m^3)$
                                                                                          Greedy match: $O(m^2 \log m)$
                                                                                          Pyramid match: $O(m)$




                                                                                       [Grauman & Darrell, ICCV 2005]

                                                                                                                        6
                                                                       K. Grauman, B. Leibe
Pyramid match kernel: pyramid extraction
                                               Histogram pyramid: level i has bins of size $2^i$




                                                                                                 7
                                                                 K. Grauman
Pyramid match kernel: counting matches
                                               Histogram intersection:
                                               $$I\big(H(X), H(Y)\big) = \sum_{j} \min\big(H(X)_j, H(Y)_j\big)$$
                                                                                                  8
                                                                  K. Grauman
Pyramid match kernel: counting new matches
                                               Histogram intersection:
                                               $$N_i = \underbrace{I\big(H_i(X), H_i(Y)\big)}_{\text{matches at this level}} - \underbrace{I\big(H_{i-1}(X), H_{i-1}(Y)\big)}_{\text{matches at previous level}}$$

                                               Difference in histogram intersections across
                                               levels counts number of new pairs matched



                                                                                                                            9
                                                                                   K. Grauman
Pyramid match kernel

                                               $$K_\Delta\big(\Psi(X), \Psi(Y)\big) = \sum_{i} w_i N_i$$
                                                   •   $\Psi(X)$, $\Psi(Y)$: histogram pyramids
                                                   •   $N_i$: number of newly matched pairs at level i
                                                   •   $w_i$: measure of difficulty of a match at level i
                                                   •   For similarity, weights inversely proportional to bin size (or may
                                                       be learned discriminatively)
                                                   •   Normalize kernel values to avoid favoring large sets
                                                                                                                            10
                                                                                   K. Grauman
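
The steps above (build histograms at doubling bin sizes, count new matches by differencing histogram intersections, weight them inversely to bin size) can be sketched in a few lines. This is a simplified illustration under assumptions, not the libpmk implementation: it uses fixed grids with bin side 2^i, weights 1/2^i, non-negative coordinates below `max_range`, and omits refinements such as random grid shifts. All function names are hypothetical.

```python
from collections import Counter
from math import ceil, log2, sqrt


def grid_histogram(points, side):
    """Count points per grid cell of the given side length."""
    return Counter(tuple(int(c // side) for c in p) for p in points)


def intersection(h1, h2):
    """Histogram intersection: sum of bin-wise minima."""
    return sum(min(n, h2[cell]) for cell, n in h1.items())


def pyramid_match(xs, ys, max_range):
    """Unnormalized pyramid match score for point sets with coordinates
    in [0, max_range)."""
    levels = ceil(log2(max_range)) + 1  # coarsest level has a single bin
    score, previous = 0.0, 0
    for i in range(levels):
        side = 2 ** i
        current = intersection(grid_histogram(xs, side),
                               grid_histogram(ys, side))
        # New matches at this level, weighted inversely to the bin size.
        score += (current - previous) / side
        previous = current
    return score


def normalized_pyramid_match(xs, ys, max_range):
    """Divide by the self-similarities so large sets are not favored."""
    return pyramid_match(xs, ys, max_range) / sqrt(
        pyramid_match(xs, xs, max_range) * pyramid_match(ys, ys, max_range))


# 1-D toy sets: one pair matches at bin size 1, one more at bin size 2.
xs = [(0,), (5,)]
ys = [(1,), (5,), (7,)]
print(pyramid_match(xs, ys, max_range=8))  # 1.5
```

Each level costs time linear in the number of features, which is where the linear-time behavior quoted for the pyramid match comes from.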
Example pyramid match
                                               [Figure sequence: example pyramid match; the final slide compares the
                                               pyramid match with the optimal match]
                                                                  K. Grauman
Pyramid match kernel
                                               • Forms a Mercer kernel -> allows classification with SVMs,
                                                 use of other kernel methods
                                               • Bounded error relative to optimal partial match
                                               • Linear time -> efficient learning with large feature sets
                                               • Use data-dependent pyramid partitions for high-d
                                                 feature spaces

                                               [Plots, ETH-80 data set: accuracy and time (s) vs. mean number of features,
                                               for the match kernel [Wallraven et al.], $O(m^2)$, and the pyramid match, $O(m)$]
                                               [Figure: uniform pyramid bins vs. vocabulary-guided pyramid bins]

                                                     Code for PMK: http://people.csail.mit.edu/jjl/libpmk/
Matching smoothness & local geometry
                                               • Solving for linear assignment means (non-overlapping)
                                                 features can be matched independently, ignoring
                                                 relative geometry.
                                               • One alternative: simply expand feature vectors to
                                                 include spatial information before matching.

                                                 $[\, f_1, \ldots, f_{128}, x_a, y_a \,]$
                                                 [Figure: image with feature position $(x_a, y_a)$]
                                                                                                                     18
                                                                      K. Grauman, B. Leibe
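
A minimal sketch of this expansion is shown below. The normalization by image size and the `spatial_weight` parameter, which trades appearance against position, are assumptions made for the example; the slide only specifies that the coordinates are appended.

```python
def augment_with_position(descriptor, x, y, image_width, image_height,
                          spatial_weight=1.0):
    """Append normalized image coordinates to an appearance descriptor."""
    return list(descriptor) + [spatial_weight * x / image_width,
                               spatial_weight * y / image_height]


# A (shortened) 2-D appearance descriptor located at the image center.
print(augment_with_position([0.5, 0.25], x=32, y=48,
                            image_width=64, image_height=96))
# [0.5, 0.25, 0.5, 0.5]
```

A larger `spatial_weight` makes matches between distant image positions more costly, at the price of reduced tolerance to translation.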
Spatial pyramid match kernel
                                               • First quantize descriptors into words, then do one
                                                 pyramid match per word in image coordinate space.
                                                Lazebnik, Schmid & Ponce, CVPR 2006
                                                                         K. Grauman, B. Leibe
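
The idea of one pyramid match per word in image coordinates can be sketched as follows. This is a simplified illustration, assuming features are given as `(word, x, y)` triples with `x`, `y` normalized to [0, 1), and using the level weighting of Lazebnik et al. (1/2^L for the coarsest level, 1/2^(L-l+1) for level l ≥ 1); the function names and toy data are hypothetical.

```python
from collections import Counter


def cell_histogram(features, level):
    """Count features per (word, grid cell) on a 2^level x 2^level grid."""
    cells = 2 ** level
    histogram = Counter()
    for word, x, y in features:
        cx = min(int(x * cells), cells - 1)
        cy = min(int(y * cells), cells - 1)
        histogram[(word, cx, cy)] += 1
    return histogram


def intersection(h1, h2):
    """Histogram intersection: sum of bin-wise minima."""
    return sum(min(n, h2[key]) for key, n in h1.items())


def spatial_pyramid_match(feats_a, feats_b, num_levels=2):
    """Weighted sum of per-level intersections; finer levels count more."""
    top = num_levels
    inter = [intersection(cell_histogram(feats_a, level),
                          cell_histogram(feats_b, level))
             for level in range(top + 1)]
    score = inter[0] / 2 ** top
    for level in range(1, top + 1):
        score += inter[level] / 2 ** (top - level + 1)
    return score


# Word 0 appears in the same corner of both images; word 1 does not.
image_a = [(0, 0.1, 0.1), (1, 0.9, 0.9)]
image_b = [(0, 0.2, 0.2), (1, 0.1, 0.9)]
print(spatial_pyramid_match(image_a, image_b))  # 1.25
```

Because only features with the same word index can fall into the same bin, matches are restricted to similar appearance, and the grid levels measure how well their image positions agree.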
Matching smoothness & local geometry
                                               • Use correspondence to estimate parameterized
                                                  transformation, regularize to enforce smoothness
                                                      Shape context matching [Belongie, Malik, & Puzicha 2001]

                                               Code: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/sc_digits.html
                                                                                 K. Grauman, B. Leibe
Matching smoothness & local geometry
                                               • Let matching cost include term to penalize distortion
                                                  between pairs of matched features.

                                               [Figure: template with features i, j and query with features i', j';
                                               Rij and Si'j' relate the feature pair in the template and in the query]



                                                   Approximate for efficient solutions: Berg & Malik, CVPR 2005;
                                                   Leordeanu & Hebert, ICCV 2005


                                               Figure credit: Alex Berg                 K. Grauman, B. Leibe
Matching smoothness & local geometry
                                               • Compare “semi-local” features: consider configurations
                                                 or neighborhoods and co-occurrence relationships

                                                 Correlograms of visual words [Savarese, Winn, & Criminisi, CVPR 2006]
                                                 Proximity distribution kernel [Ling & Soatto, ICCV 2007]
                                                 Hyperfeatures [Agarwal & Triggs, ECCV 2006]
                                                 Feature neighborhoods [Sivic & Zisserman, CVPR 2004]
                                                 Tiled neighborhood [Quack, Ferrari, Leibe, van Gool ICCV 2007]
                                                                                 K. Grauman, B. Leibe
Matching smoothness & local geometry
                                               • Learn or provide explicit object-specific shape model
                                                 [Next in the tutorial: part-based models]

                                                 [Figure: shape model with parts x1, …, x6]
Summary
                                               • Local features are a useful, flexible representation
                                                    – Invariance properties: typically built into the descriptor
                                                    – Distinctive, especially helpful for identifying specific textured
                                                      objects
                                                    – Breaking image into regions/parts gives tolerance to occlusions
                                                      and clutter
                                                    – Mapping to visual words forms discrete tokens from image
                                                      regions

                                               • Efficient methods available for
                                                    – Indexing patches or regions
                                                    – Comparing distributions of visual words
                                                    – Matching features


                                                                                                                        24
                                                                          K. Grauman, B. Leibe
Outline

                                               1. Detection with Global Appearance & Sliding Windows
                                               2. Local Invariant Features: Detection & Description
                                               3. Specific Object Recognition with Local Features
                                               ― Coffee Break ―
                                               4. Visual Words: Indexing, Bags of Words Categorization
                                               5. Matching Local Feature Sets
                                               6. Part-Based Models for Categorization
                                               7. Current Challenges and Research Directions

                                                                                                         25
                                                                      K. Grauman, B. Leibe
Visual Object Recognition
                                               Bastian Leibe                &   Kristen Grauman
                                               Computer Vision Laboratory       Department of Computer Sciences
                                               ETH Zurich                       University of Texas in Austin

                                               Chicago, 14.07.2008
Recognition of Object Categories
                                               • We no longer have exact correspondences…
                                               • On a local level, we can still detect similar parts.
                                               • Represent objects by their parts
                                                  ⇒ Bag-of-features
                                               • How can we improve on this?
                                                    – Encode structure


                                                                                                                             3
                                                                        T. Tuytelaars, B. Leibe   Slide credit: Rob Fergus
Part-Based Models

                                               • Fischler & Elschlager 1973
                                               • Model has two components
                                                    – parts (2D image fragments)
                                                    – structure (configuration of parts)




                                                                                                 4
                                                                          K. Grauman, B. Leibe
Different Connectivity Structures
                                               [Figure: connectivity structures of part-based models, with recognition
                                               complexity where given]
                                               • $O(N^6)$: Fergus et al. ’03; Fei-Fei et al. ’03
                                               • $O(N^2)$: Leibe et al. ’04, ’08; Crandall et al. ’05; Fergus et al. ’05
                                               • $O(N^3)$: Crandall et al. ’05
                                               • $O(N^2)$: Felzenszwalb & Huttenlocher ’05
                                               • Csurka ’04; Vasconcelos ’00
                                               • Bouchard & Triggs ’05
                                               • Carneiro & Lowe ’06


                                                                                                                                                  5
                                                                                  K. Grauman, B. Leibe     from [Carneiro & Lowe, ECCV’06]
Spatial Models Considered Here

                                               Fully connected shape model (parts x1, …, x6), e.g. Constellation Model
                                                    – Parts fully connected
                                                    – Recognition complexity: $O(N^P)$
                                                    – Method: Exhaustive search

                                               “Star” shape model (parts x1, …, x6), e.g. ISM
                                                    – Parts mutually independent
                                                    – Recognition complexity: $O(NP)$
                                                    – Method: Generalized Hough Transform


                                                                                                                                   6
                                                                        K. Grauman, B. Leibe            Slide credit: Rob Fergus
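
To illustrate why the star model is cheap to evaluate, the sketch below lets each detected part vote independently for an object center and then reads off the strongest accumulator cell. It is a heavily simplified Hough-voting illustration, not the ISM procedure: each part type has a single fixed offset to the center, votes are unweighted, and the part names, offsets, and cell size are hypothetical.

```python
from collections import Counter


def hough_votes(detections, center_offsets, cell_size=10):
    """Accumulate one vote per detected part for the implied object center.

    detections: list of (part_name, x, y).
    center_offsets: part_name -> (dx, dy) from the part to the object center.
    """
    accumulator = Counter()
    for part, x, y in detections:
        dx, dy = center_offsets[part]
        cell = (int((x + dx) // cell_size), int((y + dy) // cell_size))
        accumulator[cell] += 1
    return accumulator


# Hypothetical part offsets for a motorbike-like object.
center_offsets = {"front_wheel": (30, -10),
                  "rear_wheel": (-30, -10),
                  "seat": (0, 15)}
detections = [
    ("front_wheel", 72, 63),   # votes for center (102, 53)
    ("rear_wheel", 134, 61),   # votes for center (104, 51)
    ("seat", 101, 40),         # votes for center (101, 55)
    ("seat", 20, 20),          # clutter: votes for center (20, 35)
]
votes = hough_votes(detections, center_offsets)
print(votes.most_common(1))  # [((10, 5), 3)]
```

Since every detection votes on its own, the work grows with the number of detections rather than with the number of joint part configurations, which is what an exhaustive search over a fully connected model has to enumerate.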
Constellation Model
                                               • Joint model for appearance and shape
                                                    – Gaussian shape pdf
                                                    – Gaussian part appearance pdf
                                                    – Gaussian relative scale pdf (over log(scale))
                                                    – Prob. of detection (values shown in the figure: 0.8, 0.75, 0.9)
                                               • Clutter model
                                                    – Uniform shape pdf
                                                    – Gaussian appearance pdf
                                                    – Uniform relative scale pdf (over log(scale))
                                                    – Poisson pdf on # detections

                                                                                                                                    8
                                                                          K. Grauman, B. Leibe
Constellation Model: Learning Procedure
                                               • Goal: Find regions & their location, scale & appearance
                                               • Initialize model parameters
                                               • Use EM and iterate to convergence
                                                    – E-step: Compute assignments for which regions are
                                                      foreground/background
                                                    – M-step: Update model parameters
                                               • Trying to maximize likelihood – consistency in shape & appearance




Example: Motorbikes
Example: Motorbikes (2)
Example: Spotted Cats
Discussion: Constellation Model
                                               • Advantages
                                                   Works well for many different object categories
                                                   Can adapt well to categories where
                                                    – Shape is more important
                                                    – Appearance is more important
                                                   Everything is learned from training data
                                                   Weakly-supervised training possible

                                               • Disadvantages
                                                  Model contains many parameters that need to be estimated
                                                  Cost increases exponentially with increasing number of
                                                  parameters
                                                 ⇒ Fully connected model restricted to small number of parts.
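
To make the cost argument concrete, the following sketch counts how many part-to-region assignments an exhaustive search would have to score. It assumes each of the P parts is assigned a distinct one of N candidate regions and ignores occlusion; the function name and the value N = 30 are illustrative only.

```python
from math import perm

def num_hypotheses(num_regions, num_parts):
    """Ordered assignments of distinct candidate regions to parts: N! / (N - P)!."""
    return perm(num_regions, num_parts)

# The count grows roughly like N**P, so each extra part multiplies the work by about N.
for parts in (3, 4, 5, 6):
    print(parts, num_hypotheses(30, parts))
```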



Implicit Shape Model (ISM)
                                               • Basic ideas
                                                    Learn an appearance codebook
                                                    Learn a star-topology structural model
                                                     – Features are considered independent given obj. center
                                                    [Figure: star topology with features x1 … x6 around the object center]

                                               • Algorithm: probabilistic Gen. Hough Transform
                                                    Exact correspondences          →             Prob. match to object part
                                                    NN matching                    →             Soft matching
                                                    Feature location on obj.       →             Part location distribution
                                                    Uniform votes                  →             Probabilistic vote weighting
                                                    Quantized Hough array          →             Continuous Hough space




Codebook Representation
                                               • Extraction of local object features
                                                    Interest Points (e.g. Harris detector)
                                                    Sparse representation of the object appearance
                                               • Collect features from whole training set
                                               • Example:




Gen. Hough Transform with Local Features
                                               • For every feature, store possible “occurrences”
                                               • For new image, let the matched features vote for possible object positions
                                                    Figure annotation:
                                                     – Object identity
                                                     – Pose
                                                     – Relative position
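
The two steps on this slide can be sketched as follows. This is a minimal illustration, assuming features are already quantized to string ids, that an "occurrence" stores only the object identity and the offset from the feature to the object center (pose is omitted), and that votes are collected in a coarse binned accumulator. All names and coordinates are hypothetical.

```python
from collections import defaultdict

def store_occurrences(training_features):
    """training_features: iterable of (feature_id, object_id, feature_xy, center_xy)."""
    occurrences = defaultdict(list)
    for feature_id, obj, (fx, fy), (cx, cy) in training_features:
        # Remember where the object center lies relative to this feature.
        occurrences[feature_id].append((obj, cx - fx, cy - fy))
    return occurrences

def vote(occurrences, matches, bin_size=10):
    """matches: list of (feature_id, (x, y)) found in a new image."""
    accumulator = defaultdict(int)
    for feature_id, (x, y) in matches:
        for obj, dx, dy in occurrences.get(feature_id, []):
            key = (obj, (x + dx) // bin_size, (y + dy) // bin_size)
            accumulator[key] += 1
    return accumulator

training = [
    ("wheel", "car", (20, 50), (50, 30)),
    ("wheel", "car", (80, 50), (50, 30)),
    ("window", "car", (50, 20), (50, 30)),
]
occurrences = store_occurrences(training)

# Same object, shifted by (100, 100) in the new image.
matches = [("wheel", (120, 150)), ("wheel", (180, 150)), ("window", (150, 120))]
accumulator = vote(occurrences, matches)
best = max(accumulator, key=accumulator.get)
print(best, accumulator[best])
```

Each wheel casts one correct and one incorrect vote, but only the true center bin collects votes from all three features.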


Implicit Shape Model - Representation
                                               • Learn appearance codebook
                                                    Extract local features at interest points
                                                    Agglomerative clustering ⇒ codebook
                                               • Learn spatial distributions
                                                    Match codebook to training images
                                                    Record matching positions on object
                                               [Figure: Training images (+reference segmentation) → Appearance codebook →
                                                Spatial occurrence distributions over (x, y, s) + local figure-ground labels]
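
The two learning stages above can be sketched in a few lines. This is a simplified stand-in, not the tutorial's implementation: descriptors are short tuples, clusters are merged by Euclidean distance between centroids until no pair is closer than `max_dist`, and every codebook entry within `max_dist` of a training feature records that feature's position. The function names and the threshold are assumptions.

```python
import math

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def centroid(vectors):
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))

def agglomerative_codebook(descriptors, max_dist):
    """Repeatedly merge the two closest clusters until none is closer than max_dist."""
    clusters = [[d] for d in descriptors]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_dist:
            break
        _, i, j = best
        merged = clusters.pop(j)       # j > i, so index i stays valid
        clusters[i].extend(merged)
    return [centroid(c) for c in clusters]

def record_occurrences(codebook, features, max_dist):
    """features: list of (descriptor, (x, y, s)) with positions relative to the object center."""
    occurrences = {k: [] for k in range(len(codebook))}
    for descriptor, position in features:
        for k, entry in enumerate(codebook):
            if dist(descriptor, entry) <= max_dist:
                occurrences[k].append(position)
    return occurrences

descriptors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
codebook = agglomerative_codebook(descriptors, max_dist=0.5)
features = [((0.05, 0.0), (-10, 5, 1.0)), ((1.0, 1.0), (12, -3, 1.0))]
occurrences = record_occurrences(codebook, features, max_dist=0.5)
print(len(codebook), occurrences)
```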
Implicit Shape Model - Recognition
                                               [Figure: recognition pipeline: Interest Points → Matched Codebook Entries →
                                                Probabilistic Voting → 3D Voting Space (continuous; axes x, y, s)]

                                               Image Feature f → Interpretation (Codebook match) C_i → Object Position o, x
                                               with match probability $p(C_i \mid f)$ and vote probability $p(o_n, x \mid C_i, l)$:

                                               $$p(o_n, x \mid f, l) = \sum_i p(C_i \mid f)\, p(o_n, x \mid C_i, l)$$

                                               [Leibe04, Leibe08]
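
The marginalization over codebook matches can be sketched as below. This is an illustration under stated assumptions: p(C_i | f) is taken as uniform over the matched entries, p(o_n, x | C_i, l) is taken as uniform over the stored occurrences of entry C_i, and occurrences are plain (dx, dy) offsets to the object center. The function and variable names are hypothetical.

```python
def cast_votes(feature_pos, matched_entries, occurrences):
    """Return weighted votes [(x, y, weight)] for one image feature f at location l."""
    votes = []
    p_match = 1.0 / len(matched_entries)        # p(C_i | f), assumed uniform
    for ci in matched_entries:
        offsets = occurrences[ci]
        p_occ = 1.0 / len(offsets)              # p(o_n, x | C_i, l), assumed uniform
        for dx, dy in offsets:
            votes.append((feature_pos[0] + dx, feature_pos[1] + dy, p_match * p_occ))
    return votes

occurrences = {0: [(10, 0), (-10, 0)], 1: [(10, 0)]}
votes = cast_votes((100, 50), matched_entries=[0, 1], occurrences=occurrences)

# Summing the weights that land on the same position gives p(o_n, x | f, l) there.
support = {}
for x, y, w in votes:
    support[(x, y)] = support.get((x, y), 0.0) + w
print(support)
```

With these uniform choices, the votes of a single feature sum to one, so a feature with many ambiguous matches spreads its influence instead of dominating the voting space.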
Implicit Shape Model - Recognition
                                               [Figure: Interest Points → Matched Codebook Entries → Probabilistic Voting →
                                                3D Voting Space (continuous; axes x, y, s) → Backprojection of Maxima →
                                                Backprojected Hypotheses]
                                               [Leibe04, Leibe08]
Example: Results on Cows
                                               [Figure sequence: Original image → Interest points → Matched patches →
                                                Prob. votes → 1st hypothesis → 2nd hypothesis → 3rd hypothesis]
Scale Invariant Voting
                                               • Scale-invariant feature selection
                                                    Scale-invariant interest points
                                                    Rescale extracted patches
                                                    Match to constant-size codebook
                                               • Generate scale votes
                                                    Scale as 3rd dimension in voting space


                                                    Search for maxima in 3D voting space
                                                    [Figure: search window in the (x, y, s) voting space]
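
A scale vote can be sketched as follows. The convention used here is an assumption for illustration: each occurrence stores the offset (dx, dy) from feature to object center together with the feature scale `s_occ` seen in training, and a test feature of scale `s_img` rescales that offset by `s_img / s_occ` and votes for that ratio as the object scale.

```python
def scale_vote(x, y, s_img, occurrence):
    """occurrence = (dx, dy, s_occ): center offset and feature scale observed in training."""
    dx, dy, s_occ = occurrence
    ratio = s_img / s_occ            # object appears `ratio` times larger than in training
    return (x + dx * ratio, y + dy * ratio, ratio)

# Two features on an object that appears at twice the training size.
vote_a = scale_vote(100, 100, 8.0, (30, -20, 4.0))
vote_b = scale_vote(200, 90, 6.0, (-20, -15, 3.0))
print(vote_a, vote_b)
```

Both features agree on the same point in the 3D voting space, which is what the maximum search then looks for.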
Scale Voting: Efficient Computation


                                               [Figure: four panels in the (x, y, s) voting space: Scale votes →
                                                Binned accum. array → Candidate maxima → Refinement (MSME)]

                                               • Mean-Shift formulation for refinement
                                                      Scale-adaptive balloon density estimator
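
The refinement step can be sketched as a weighted mean-shift iteration started at a candidate maximum. This is a simplified illustration, not the tutorial's estimator: it uses a Gaussian kernel, and the "scale-adaptive" behaviour is modelled by the assumption that the kernel bandwidth grows linearly with the scale coordinate of the current estimate. The function name and the `base_bw` values are hypothetical.

```python
import math

def mean_shift_refine(votes, start, base_bw=(10.0, 10.0, 0.2), iters=30, tol=1e-4):
    """votes: list of (x, y, s, weight); start: (x, y, s) candidate maximum."""
    x, y, s = start
    for _ in range(iters):
        # Bandwidth adapts to the scale of the current estimate (assumption).
        bx, by, bs = (b * s for b in base_bw)
        sum_x = sum_y = sum_s = norm = 0.0
        for vx, vy, vs, w in votes:
            d2 = ((vx - x) / bx) ** 2 + ((vy - y) / by) ** 2 + ((vs - s) / bs) ** 2
            k = w * math.exp(-0.5 * d2)          # kernel-weighted vote
            sum_x += k * vx
            sum_y += k * vy
            sum_s += k * vs
            norm += k
        if norm == 0.0:
            break
        nx, ny, ns = sum_x / norm, sum_y / norm, sum_s / norm
        shift = abs(nx - x) + abs(ny - y) + abs(ns - s)
        x, y, s = nx, ny, ns
        if shift < tol:
            break
    return x, y, s

votes = [
    (98, 50, 1.0, 1.0), (102, 50, 1.0, 1.0),
    (100, 48, 0.9, 1.0), (100, 52, 1.1, 1.0),
    (300, 200, 2.0, 1.0),                        # unrelated vote far away
]
refined = mean_shift_refine(votes, start=(95.0, 55.0, 1.1))
print(refined)
```

Starting from the coarse bin maximum, the estimate moves to the local mode of the nearby votes and is barely influenced by the distant one.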




Detection Results
                                               • Qualitative Performance
                                                   Recognizes different kinds of objects
                                                   Robust to clutter, occlusion, noise, low contrast
Figure-Ground Segregation
                                               • Problem extensively studied in
                                                 Psychophysics
                                               • Experiments with ambiguous
                                                 figure-ground stimuli
                                               • Results:
                                                     Evidence that object recognition can
                                                     and does operate before figure-ground
                                                     organization
                                                     Interpreted as Gestalt cue familiarity.




                                                M.A. Peterson, “Object Recognition Processes Can and Do Operate Before Figure-Ground Organization”, Cur. Dir. in Psych. Sc., 3:105-111, 1994.

ISM – Top-Down Segmentation
                                               [Figure: segmentation pipeline: Interest Points → Matched Codebook Entries →
                                                Probabilistic Voting → 3D Voting Space (continuous; axes x, y, s) →
                                                Backprojection of Maxima → Backprojected Hypotheses →
                                                p(figure) Probabilities → Segmentation]
                                               [Leibe04, Leibe08]
Segmentation: Probabilistic Formulation
                                               • Influence of patch on object hypothesis (vote weight)

                                                 $$p(f, l \mid o_n, x) = \frac{\sum_i p(o_n, x \mid C_i)\, p(C_i \mid f)\, p(f, l)}{p(o_n, x)}$$

                                               • Backprojection to features f and pixels p:

                                                 $$p(\mathbf{p} = \text{figure} \mid o_n, x) = \sum_{\mathbf{p} \in (f, l)} \underbrace{p(\mathbf{p} = \text{figure} \mid f, l, o_n, x)}_{\text{Segmentation information}}\; \underbrace{p(f, l \mid o_n, x)}_{\text{Influence on object hypothesis}}$$

                                               [Leibe04, Leibe08]
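
The backprojection sum can be sketched per pixel as follows. This is an illustration under stated assumptions: each contributing patch comes with a small stored figure-ground mask whose entries play the role of p(p = figure | f, l, o_n, x), and with a scalar `influence` standing in for p(f, l | o_n, x). The ground map is accumulated analogously from (1 - mask), which is an assumption of this sketch. All names are hypothetical.

```python
def backproject(shape, patches):
    """shape: (height, width); patches: list of (top, left, mask, influence)."""
    height, width = shape
    p_figure = [[0.0] * width for _ in range(height)]
    p_ground = [[0.0] * width for _ in range(height)]
    for top, left, mask, influence in patches:
        for r, row in enumerate(mask):
            for c, m in enumerate(row):
                rr, cc = top + r, left + c
                if 0 <= rr < height and 0 <= cc < width:
                    # Sum over all patches (f, l) that contain this pixel.
                    p_figure[rr][cc] += m * influence
                    p_ground[rr][cc] += (1.0 - m) * influence
    return p_figure, p_ground

patches = [
    (0, 0, [[1.0, 0.0]], 0.5),     # patch covering pixels (0,0) and (0,1)
    (0, 1, [[1.0, 1.0]], 0.25),    # overlapping patch covering (0,1) and (0,2)
]
p_figure, p_ground = backproject((2, 3), patches)
print(p_figure[0], p_ground[0])
```

A pixel can then be labelled figure wherever its accumulated figure value exceeds its ground value.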
Segmentation
                                               [Figure: original image and resulting segmentation, with p(figure) and p(ground) maps]

                                               • Interpretation of p(figure) map
                                                    per-pixel confidence in object hypothesis
                                                    Use for hypothesis verification
                                               [Leibe04, Leibe08]
Example Results: Motorbikes
Example Results: Cows
                                               • Training
                                                   112 hand-segmented images
                                               • Results on novel sequences:
                                                        Single-frame recognition - No temporal continuity used!
                                               [Leibe04, Leibe08]
Example Results: Chairs
                                               [Figure panels: Office chairs; Dining room chairs]
Inferring Other Information: Part Labels
                                               [Figure rows: Training; Test; Output]
                                               [Thomas07]
Inferring Other Information: Part Labels (2)
                                               [Thomas07]
Inferring Other Information: Depth Maps
                                                  “Depth from a single image”



                                               [Thomas07]
Application for Pedestrian Detection
                                               • Estimating Articulation
                                                               [Leibe, Seemann, Schiele, CVPR’05]

                                               • Rotation-Invariant Detection

                                                    [Figure: feature geometry with distances d, d_q and angles φ, θ, θ_q]
                                                             [Mikolajczyk, Leibe, Schiele, CVPR’06]
Outline

                                               1. Detection with Global Appearance & Sliding Windows
                                               2. Local Invariant Features: Detection & Description
                                               3. Specific Object Recognition with Local Features
                                               ― Coffee Break ―
                                               4. Visual Words: Indexing, Bags of Words Categorization
                                               5. Matching Local Features
                                               6. Part-Based Models for Categorization
                                               7. Current Challenges and Research Directions

Outline

                                               1. Detection with Global Appearance & Sliding Windows
                                               2. Local Invariant Features: Detection & Description
                                               3. Specific Object Recognition with Local Features
                                               ― Coffee Break ―
                                               4. Visual Words: Indexing, Bags of Words Categorization
                                               5. Matching Local Features
                                               6. Part-Based Models for Categorization
                                               7. Current Challenges and Research Directions
                                                  Highlight of some research topics not covered in the main tutorial
Benchmark Data
                                               • What degree of difficulty do current datasets have?
Example: Caltech-101

                                                  A dataset that has been about mastered…

                                                            Images from the Caltech-101:
                                                       101-way multi-class classification problem
Example: Caltech-256

                                                          Images from the Caltech-256:
                                                       256-way multi-class recognition problem
Example: Pascal Visual Object Classes Challenge
                                                                   Pascal VOC 2007:
                                                               Binary detection problems

                                                       http://pascallin.ecs.soton.ac.uk/challenges/VOC/

Example: LabelMe

                                                       http://labelme.csail.mit.edu/
Current challenges & ongoing research
                                               •   Multi-cue integration
                                               •   Finer level categorization
                                               •   View invariant recognition
                                               •   Unsupervised category discovery
                                               •   Learning from noisily labeled images
                                               •   Integration of segmentation and recognition
                                               •   Learning with text and images/video
                                               •   Use of video
                                               •   Context and scene layout
Multi-cue integration
                                               • Single cues often not sufficient.
                                               • Integrate multiple local and global cues.
Multi-Category Discrimination
                                               • Distinguish similar categories.
                                               • Need to look at specific details!
Multi-Aspect Recognition
                                               • Detectors for different viewpoints ⇒ How can this be
                                                 improved?
Multi-Aspect Recognition
                                                  [Hoiem, Rother, Winn, CVPR’07]                     [Thomas et al., CVPR’06]




Multi-Aspect Recognition
                                                   [Rothganger et al., CVPR’03]



                                                                                                   [Savarese & Fei-Fei, ICCV’07]




Unsupervised, semi-supervised category discovery
                                               Topic models for images
                                               • Probabilistic Latent Semantic Analysis (pLSA)
                                               • Latent Dirichlet Allocation (LDA)
                                               [Figure: example topics “face” and “beach”; graphical model with nodes c, π, z, w and plates D, N]
                                               Sivic et al. ICCV 2005, Fei-Fei et al. ICCV 2005          Figure credit: Fei-Fei Li
Unsupervised, semi-supervised category discovery
                                                Clustering cluttered images
                                                Learning from noisy keyword-based image search results
                                                   Grauman & Darrell, CVPR 2006




                                                                                  Fergus et al. ECCV 2004, ICCV 2005


                                                       Li & Fei-Fei, CVPR 2007
Learning with text and images/video
                                                                        Barnard et al. JMLR 2003




                                               Berg, Berg, Edwards, & Forsyth, NIPS 2006
                                                                      Gupta et al. ECML 2008
Integrating segmentation + recognition
                                                                                  Borenstein & Ullman, ECCV 2002
                                                 Kumar et al. CVPR 2005




                                                                                  Kannan, Winn, & Rother, NIPS 2006
                                               Tu, Chen, Yuille, Zhu, ICCV 2003
Role of context, understanding scene layout
                                               Antonio Torralba, IJCV 2003
Role of context, understanding scene layout
                                                     Image                        World

                                                             Hoiem, Efros, & Hebert, CVPR 2006
Integration with Scene Geometry
                                               • Goal: Find the ground plane
                                                   Restrict object location
                                                   Assume Gaussian size prior
                                                  ⇒ Significantly reduced search space
                                               [Figure: Hough volume with axes x, y, s and the search corridor; scene geometry from Structure-from-Motion and dense stereo]
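As a rough illustration of why a ground plane and a size prior shrink the search space, the sketch below computes the range of plausible image heights (the scale axis s of the Hough volume) for a given foot position. It assumes a pinhole camera whose optical axis is parallel to a flat ground plane, a known horizon row and camera height, and a Gaussian prior on real-world object height. The function name, its parameters, and the k-sigma cutoff are illustrative assumptions, not the formulation of [Leibe et al. CVPR’07].

```python
def scale_corridor(foot_row, horizon_row, camera_height,
                   mean_height, std_height, k=2.0):
    """Plausible image heights (pixels) of an object standing on the ground.

    Assumed geometry: an object whose foot projects to `foot_row` lies at
    depth Z = f * camera_height / (foot_row - horizon_row), so an object of
    real height H appears with image height H * (foot_row - horizon_row) /
    camera_height. The focal length f cancels out.
    Returns (s_min, s_max), or None if the foot is not below the horizon.
    """
    rows_below_horizon = foot_row - horizon_row
    if rows_below_horizon <= 0:
        return None  # no ground-plane point projects to this row
    pixels_per_meter = rows_below_horizon / camera_height
    # Keep only real-world heights within k standard deviations of the mean.
    low = max(0.0, mean_height - k * std_height)
    high = mean_height + k * std_height
    return (low * pixels_per_meter, high * pixels_per_meter)
```

Only detection hypotheses whose scale falls inside the returned interval would need to be evaluated for that image row, which is one way to read the "search corridor" in the figure.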
Extensions
                                               • Combination with 3D Geometry
                                                           [Leibe, Cornelis, Cornelis, Van Gool, CVPR’07]

                                               • Mobile Pedestrian Detection




                                                                          [Ess, Leibe, Van Gool, ICCV’07]
Detections Using Ground Plane Constraints
                                                                                           left camera
                                                                                           1175 frames

                                                                                     [Leibe et al. CVPR’07]
Extensions: Tracking-by-Detection
                                               • Spacetime trajectory analysis
                                                    Link up detections to form physically plausible ST trajectories
                                                    Select set of ST trajectories that best explain the data
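
The two steps above can be sketched with a deliberately simple greedy linker. This is an assumption-laden simplification: detections are taken to be (x, y) ground-plane positions, "physically plausible" is reduced to a maximum displacement per frame, and "best explain the data" is approximated by keeping trajectories with enough supporting detections. The names `link_detections`, `select_trajectories`, `max_step`, and `min_length` are hypothetical, and the selection procedure in [Leibe et al. CVPR’07] is more elaborate than this.

```python
import math


def link_detections(frames, max_step):
    """Greedily link per-frame detections into space-time trajectories.

    frames: list with one entry per time step, each a list of (x, y) points.
    max_step: largest displacement allowed between consecutive frames.
    Returns a list of trajectories, each a list of (t, x, y) tuples.
    """
    trajectories = []
    for t, detections in enumerate(frames):
        unused = list(detections)
        # Try to extend every trajectory that ended in the previous frame.
        for traj in trajectories:
            last_t, last_x, last_y = traj[-1]
            if last_t != t - 1 or not unused:
                continue
            nearest = min(
                unused,
                key=lambda d: math.hypot(d[0] - last_x, d[1] - last_y))
            step = math.hypot(nearest[0] - last_x, nearest[1] - last_y)
            if step <= max_step:
                traj.append((t, nearest[0], nearest[1]))
                unused.remove(nearest)
        # Detections that no trajectory could explain start new ones.
        for x, y in unused:
            trajectories.append([(t, x, y)])
    return trajectories


def select_trajectories(trajectories, min_length):
    """Keep trajectories supported by at least `min_length` detections."""
    return [traj for traj in trajectories if len(traj) >= min_length]
```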


                                                                                                   [Leibe et al. CVPR’07]
Dynamic Scene Analysis Results
                                                                                                   [Leibe et al. CVPR’07]
Extensions (2)
                                               • Combination with 3D Reconstruction
                                                          [Cornelis, Leibe, Cornelis, Van Gool, 3DPVT’06]




Textured 3D Model
                                                        Original                        3D Reconstruction
                                               • Run-times
                                                   SfM + Bundle adjustment: 27-30 fps on CPU
                                                   Dense reconstruction:       36 fps on GPU
                                                                                     [Cornelis, Cornelis, Van Gool, CVPR’06]
Improved 3D City Model
                                               Enhancing your driving experience…
                                                        Original                      3D Reconstruction



                                                                        [Cornelis, Leibe, Cornelis, Van Gool, 3DPVT’06]
Putting It All Together…
                                               [Figure: graphical model combining the components above; labels recoverable from the extraction include the Hough volume axes x, y, s and the variables π, πd, I, D, di, oi, Hi, ti]
Mobile Pedestrian Tracking
                                               [Ess, Leibe, Schindler, Van Gool, CVPR’08]
Mobile Tracking Through Crowds
                                               [Ess, Leibe, Schindler, Van Gool, CVPR’08]
Extension: Recovering Articulations
                                               • Idea: Only perform articulated tracking where it’s easy!
                                               • Multi-person tracking
                                                    Solves hard data association problem
                                               • Articulated tracking
                                                    Only on individual “tracklets” between occlusions

                                                                     [Gammeter, Ess, Jaeggli, Schindler, Leibe, Van Gool, ECCV’08]
Articulated Multi-Person Tracking
                                               • Multi-Person tracking
                                                    Recovers trajectories and solves data association
                                                    Estimates 3D walking direction and speed
                                                    Detects occlusion events

                                                                      [Gammeter, Ess, Jaeggli, Schindler, Leibe, Van Gool, ECCV’08]
Articulated Tracking under Egomotion
                                                             [Gammeter, Ess, Jaeggli, Schindler, Leibe, Van Gool, ECCV’08]
Summary
                                               • Visual recognition is a challenging and very active
                                                 research area.
                                               • We’ve covered some basic models and representations
                                                 that have been shown to be effective, and highlighted
                                                 some ongoing issues.
                                               • See tutorial website for slides, links, references.
                                                    http://www.vision.ee.ethz.ch/~bleibe/teaching/tutorial-aaai08/



                                                                           Thank you!



                                                                           K. Grauman, B. Leibe

AAAI08 tutorial: visual object recognition

  • 1.
    Visual Object Recognition. Bastian Leibe (Computer Vision Laboratory, ETH Zurich) & Kristen Grauman (Department of Computer Sciences, University of Texas in Austin). Chicago, 14.07.2008
  • 2.
    Identification vs. Categorization
  • 3.
    Object Categorization • How to recognize ANY car • How to recognize ANY cow
  • 4.
    What could be done with recognition algorithms? There is a wide range of applications, including… Autonomous robots; Navigation, driver safety; Situated search; Content-based retrieval and analysis for images and videos; Medical image analysis
  • 5.
    Object Categorization • Task Description: “Given a small number of training images of a category, recognize a-priori unknown instances of that category and assign the correct category label.” • Which categories are feasible visually? Extensively studied in Cognitive Psychology, e.g. [Brown’58]. Example hierarchy: “Fido”, German shepherd, dog, animal, living being
  • 6.
    Visual Object Categories • Basic Level Categories in human categorization [Rosch 76, Lakoff 87]: The highest level at which category members have similar perceived shape. The highest level at which a single mental image reflects the entire category. The level at which human subjects are usually fastest at identifying category members. The first level named and understood by children. The highest level at which a person uses similar motor actions for interaction with category members.
  • 7.
    Visual Object Categories • Basic-level categories in humans seem to be defined predominantly visually. • There is evidence that humans (usually) start with basic-level categorization before doing identification. ⇒ Basic-level categorization is easier and faster for humans than object identification! ⇒ Most promising starting point for visual classification. [Figure: hierarchy animal → quadruped → dog, cat, cow → German shepherd, Doberman → “Fido”, with levels labeled abstract, basic, and individual]
  • 8.
    Other Types of Categories • Functional Categories, e.g. chairs = “something you can sit on”
  • 9.
    Other Types of Categories • Ad-hoc categories, e.g. “something you can find in an office environment”
  • 10.
    Levels of Object Categorization (examples: “cow”, “car”, “motorbike”) • Different levels of recognition: Which object class is in the image? ⇒ Obj/Img classification. Where is it in the image? ⇒ Detection/Localization. Where exactly, which pixels? ⇒ Figure/Ground segmentation
  • 11.
    Challenges: robustness. Illumination; Object pose; Clutter; Occlusions; Intra-class appearance; Viewpoint
  • 12.
    Challenges: robustness • Detection in Crowded Scenes: Learn object variability (changes in appearance, scale, and articulation). Compensate for clutter, overlap, and occlusion
  • 13.
    Challenges: context and human experience
  • 14.
    Challenges: context and human experience. Context cues; Dynamics. Image credit: D. Hoiem. Video credit: J. Davis
  • 15.
    Challenges: scale, efficiency • Thousands to millions of pixels in an image • Estimated 30 Gigapixels of image/video content generated per second • About half of the cerebral cortex in primates is devoted to processing visual information [Felleman and van Essen 1991] • 3,000-30,000 human recognizable object categories • 30+ degrees of freedom in the pose of articulated objects (humans) • Billions of images indexed by Google Image Search • 18 billion+ prints produced from digital camera images in 2004 • 295.5 million camera phones sold in 2005
  • 16.
    Challenges: learning with minimal supervision (figure labels: Less, More)
  • 17.
    Rough evolution of focus in recognition research: 1980s; 1990s to early 2000s; Currently
  • 18.
    This tutorial • Intended for broad AAAI audience: Assuming basic familiarity with machine learning, linear algebra, probability. Not assuming significant vision background. • Our goals: Describe main approaches to recognition. Highlight past successes and future challenges. Provide the pointers (to literature and tools) that would allow you to take advantage of existing techniques in your research. • Questions welcome
  • 19.
    Outline 1. Detection with Global Appearance & Sliding Windows 2. Local Invariant Features: Detection & Description 3. Specific Object Recognition with Local Features (Coffee Break) 4. Visual Words: Indexing, Bags of Words Categorization 5. Matching Local Features 6. Part-Based Models for Categorization 7. Current Challenges and Research Directions
  • 20.
    Visual Object Recognition. Bastian Leibe (Computer Vision Laboratory, ETH Zurich) & Kristen Grauman (Department of Computer Sciences, University of Texas in Austin). Chicago, 14.07.2008
  • 21.
    Outline 1. Detection with Global Appearance & Sliding Windows 2. Local Invariant Features: Detection & Description 3. Specific Object Recognition with Local Features (Coffee Break) 4. Visual Words: Indexing, Bags of Words Categorization 5. Matching Local Features 6. Part-Based Models for Categorization 7. Current Challenges and Research Directions
  • 22.
    Detection via classification: Main idea. Basic component: a binary classifier. Car/non-car Classifier: “Yes, a car.” / “No, not car.”
  • 23.
    Detection via classification: Main idea. If object may be in a cluttered scene, slide a window around looking for it. Car/non-car Classifier
  • 24.
    Detection via classification: Main idea. Fleshing out this pipeline a bit more, we need to: 1. Obtain training data 2. Define features 3. Define classifier. Pipeline: Training examples; Feature extraction; Car/non-car Classifier
  • 25.
    Detection via classification: Main idea • Consider all subwindows in an image: Sample at multiple scales and positions • Make a decision per window: “Does this contain object category X or not?” • In this section, we’ll focus specifically on methods using a global representation (i.e., not part-based, not local features).
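
The subwindow enumeration described on this slide can be sketched as a generator over positions and scales. The square window shape, the stride rule, and the names `sliding_windows` and `detect` are illustrative assumptions; `classify_window` stands in for whatever binary classifier is used.

```python
def sliding_windows(image_width, image_height, base_size, scales,
                    stride_fraction=0.25):
    """Yield (left, top, size) for square windows at several scales.

    The stride grows with the window size so that larger windows are
    sampled more coarsely (an assumed, common choice).
    """
    for scale in scales:
        size = int(round(base_size * scale))
        if size > image_width or size > image_height:
            continue  # window does not fit into the image at this scale
        stride = max(1, int(size * stride_fraction))
        for top in range(0, image_height - size + 1, stride):
            for left in range(0, image_width - size + 1, stride):
                yield (left, top, size)


def detect(image_width, image_height, classify_window,
           base_size=24, scales=(1.0, 1.5, 2.0)):
    """Return every window for which the binary classifier says 'object'."""
    return [window
            for window in sliding_windows(image_width, image_height,
                                          base_size, scales)
            if classify_window(window)]
```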
  • 26.
    Feature extraction: global appearance. Simple holistic descriptions of image content: grayscale / color histogram; vector of pixel intensities
  • 27.
    Eigenfaces: global appearance description. An early appearance-based approach to face recognition. Generate low-dimensional representation of appearance with a linear subspace: Training images; Mean; Eigenvectors computed from covariance matrix. Project new images to “face space”. Recognition via nearest neighbors in face space. Turk & Pentland, 1991
  • 28.
    Feature extraction: global appearance • Pixel-based representations sensitive to small shifts • Color or grayscale-based appearance description can be sensitive to illumination and intra-class appearance variation. Cartoon example: an albino koala
  • 29.
    Gradient-based representations • Consider edges, contours, and (oriented) intensity gradients
  • 30.
    Gradient-based representations: Matching edge templates • Example: Chamfer matching. Figure panels: Input image; Edges detected; Distance transform; Template shape; Best match. At each window position, compute average min distance between points on template (T) and input (I). Gavrila & Philomin ICCV 1999
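
The matching cost stated on this slide can be written down directly. The sketch below is a brute-force version under stated assumptions: edge pixels and template points are given as lists of (x, y) tuples, and a candidate window position is an integer offset. In practice the distance transform shown in the figure would be precomputed so that each lookup is constant time; the function names here are hypothetical.

```python
import math


def chamfer_distance(template_points, edge_points, offset=(0, 0)):
    """Average distance from each shifted template point to its nearest edge."""
    if not template_points or not edge_points:
        raise ValueError("both point sets must be non-empty")
    ox, oy = offset
    total = 0.0
    for tx, ty in template_points:
        # Brute-force nearest edge pixel; a distance transform replaces this.
        total += min(math.hypot(tx + ox - ex, ty + oy - ey)
                     for ex, ey in edge_points)
    return total / len(template_points)


def best_match(template_points, edge_points, offsets):
    """Return the candidate offset with the lowest chamfer distance."""
    return min(offsets,
               key=lambda o: chamfer_distance(template_points, edge_points, o))
```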
  • 31.
    Gradient-based representations: Matching edge templates • Chamfer matching: Hierarchy of templates. Gavrila & Philomin ICCV 1999
  • 32.
    Gradient-based representations • Consider edges, contours, and (oriented) intensity gradients • Summarize local distribution of gradients with histogram. Locally orderless: offers invariance to small shifts and rotations. Contrast-normalization: try to correct for variable illumination
  • 33.
    Gradient-based representations: Histograms of oriented gradients (HoG). Map each grid cell in the input window to a histogram counting the gradients per orientation. Code available: http://pascal.inrialpes.fr/soft/olt/ Dalal & Triggs, CVPR 2005
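
A minimal sketch of the per-cell histogram step follows, assuming a grayscale image given as a list of rows of floats. It uses central-difference gradients, unsigned orientations in [0, 180) degrees, and magnitude-weighted votes. Block grouping and contrast normalization are omitted, so this is only the first stage of a HoG-style descriptor and not the Dalal & Triggs implementation; the function name and defaults are assumptions.

```python
import math


def cell_histograms(image, cell_size=8, num_bins=9):
    """Return hist[row][col][bin]: orientation histograms per grid cell."""
    height, width = len(image), len(image[0])
    rows, cols = height // cell_size, width // cell_size
    hist = [[[0.0] * num_bins for _ in range(cols)] for _ in range(rows)]
    bin_width = 180.0 / num_bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue  # flat region, no gradient to vote with
            r, c = y // cell_size, x // cell_size
            if r >= rows or c >= cols:
                continue  # pixel lies outside the complete cells
            # Unsigned orientation: opposite directions share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(angle // bin_width), num_bins - 1)
            hist[r][c][b] += magnitude
    return hist
```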
  • 34.
    Gradient-based representations: SIFT descriptor. Local patch descriptor (more on this later). Code: http://vision.ucla.edu/~vedaldi/code/sift/sift.html Binary: http://www.cs.ubc.ca/~lowe/keypoints/ Lowe, ICCV 1999
  • 35.
    Gradient-based representations: Biologically inspired features. Convolve with Gabor filters at multiple orientations. Pool nearby units (max). Intermediate layers compare input to prototype patches. Serre, Wolf, Poggio, CVPR 2005; Mutch & Lowe, CVPR 2006
  • 36.
    Gradient-based representations: Rectangular features. Compute differences between sums of pixels in rectangles. Captures contrast in adjacent spatial regions. Similar to Haar wavelets, efficient to compute. Viola & Jones, CVPR 2001
  • 37.
    Gradient-based representations: Shape context descriptor. Count the number of points inside each bin, e.g.: Count = 4, ..., Count = 10. Log-polar binning: more precision for nearby points, more flexibility for farther points. Local descriptor (more on this later). Belongie, Malik & Puzicha, ICCV 2001
  • 38.
    Classifier construction • How to compute a decision for each subwindow? (Input: image feature)
  • 39.
    Discriminative vs. generative models. Generative: separately model class-conditional and prior densities [plot: Pr(image, car) and Pr(image, ¬car) against the image feature]. Discriminative: directly model posterior [plot: Pr(car | image) and Pr(¬car | image) against the image feature; x = data]. Plots from Antonio Torralba 2007
  • 40.
    Discriminative vs. generative models • Generative: + possibly interpretable + can draw samples - models variability unimportant to classification task - often hard to build good model with few parameters • Discriminative: + appealing when infeasible to model data itself + excel in practice - often can’t provide uncertainty in predictions - non-interpretable
  • 41.
    Discriminative methods. Nearest neighbor (10^6 examples): Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; ... Neural networks: LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; … Support Vector Machines: Guyon, Vapnik; Heisele, Serre, Poggio, 2001; … Boosting: Viola, Jones 2001; Torralba et al. 2004; Opelt et al. 2006; … Conditional Random Fields: McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003; … Slide adapted from Antonio Torralba
  • 42.
    Boosting • Build a strong classifier by combining a number of “weak classifiers”, which need only be better than chance • Sequential learning process: at each iteration, add a weak classifier • Flexible to choice of weak learner, including fast simple classifiers that alone may be inaccurate • We’ll look at Freund & Schapire’s AdaBoost algorithm: Easy to implement. Base learning algorithm for Viola-Jones face detector
  • 43.
    AdaBoost: Intuition. Consider a 2-d feature space with positive and negative examples. Each weak classifier splits the training examples with at least 50% accuracy. Examples misclassified by a previous weak learner are given more emphasis at future rounds. Figure adapted from Freund and Schapire
  • 44.
    AdaBoost: Intuition
  • 45.
    AdaBoost: Intuition. Final classifier is combination of the weak classifiers
  • 46.
    AdaBoost Algorithm. Start with uniform weights on training examples {x1,…xn}. Evaluate weighted error for each feature, pick best. Incorrectly classified -> more weight. Correctly classified -> less weight. Final classifier is combination of the weak ones, weighted according to error they had. Freund & Schapire 1995
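
The steps on this slide can be sketched as discrete AdaBoost with threshold "stumps" as weak classifiers. The assumptions are: labels are +1 or -1, each sample is a tuple of numeric features, and a stump is a hypothetical (feature index, threshold, polarity) triple. The exhaustive threshold search is written for clarity rather than speed.

```python
import math


def stump_predict(stump, x):
    """Weak classifier: `polarity` if x[feature] < threshold, else -polarity."""
    feature, threshold, polarity = stump
    return polarity if x[feature] < threshold else -polarity


def best_stump(samples, labels, weights):
    """Pick the stump with the lowest weighted training error."""
    best, best_error = None, float("inf")
    for feature in range(len(samples[0])):
        values = sorted({x[feature] for x in samples})
        thresholds = [values[0] - 1.0] + [
            (a + b) / 2.0 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                error = sum(w for x, y, w in zip(samples, labels, weights)
                            if stump_predict(stump, x) != y)
                if error < best_error:
                    best, best_error = stump, error
    return best, best_error


def adaboost(samples, labels, rounds):
    """Return a list of (alpha, stump) pairs forming the strong classifier."""
    n = len(samples)
    weights = [1.0 / n] * n  # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        stump, error = best_stump(samples, labels, weights)
        error = min(max(error, 1e-10), 1.0 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, stump))
        # Misclassified examples gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(samples, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

On one-dimensional data whose positives lie on both sides of the negatives, no single stump is correct everywhere, but a few boosting rounds can combine stumps into a classifier that is.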
  • 47.
    Cascading classifiers for detection. For efficiency, apply less accurate but faster classifiers first to immediately discard windows that clearly appear to be negative; e.g., Filter for promising regions with an initial inexpensive classifier. Build a chain of classifiers, choosing cheap ones with low false negative rates early in the chain. Fleuret & Geman, IJCV 2001; Rowley et al., PAMI 1998; Viola & Jones, CVPR 2001. Figure from Viola & Jones CVPR 2001
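
The early-rejection idea can be sketched in a few lines. Here each stage is assumed to be a (score function, threshold) pair ordered from cheapest to most expensive, and the evaluation counter is included only to make the savings visible; `run_cascade` is a hypothetical name, not an API from any of the cited systems.

```python
def run_cascade(windows, stages):
    """Classify windows with a chain of stages, rejecting as early as possible.

    stages: list of (score_function, threshold), cheapest stage first.
    Returns (accepted_windows, number_of_stage_evaluations).
    """
    accepted = []
    evaluations = 0
    for window in windows:
        passed = True
        for score, threshold in stages:
            evaluations += 1
            if score(window) < threshold:
                passed = False  # rejected: later stages never run
                break
        if passed:
            accepted.append(window)
    return accepted, evaluations
```

Because most windows in a typical image are clear negatives, the bulk of them would be discarded by the first stage and never reach the costlier ones.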
  • 48.
    Example: Face detection • Frontal faces are a good example of a class where global appearance models + a sliding window detection approach fit well: Regular 2D structure. Center of face almost shaped like a “patch”/window • Now we’ll take AdaBoost and see how the Viola-Jones face detector works
  • 49.
    Feature extraction. “Rectangular” filters: Feature output is difference between adjacent regions. Efficiently computable with integral image: any sum can be computed in constant time. Integral image: Value at (x,y) is sum of pixels above and to the left of (x,y). Avoid scaling images: scale features directly for same cost. Viola & Jones, CVPR 2001
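
A small sketch of the integral image and a two-rectangle feature follows, assuming the image is a list of rows of numbers. The table is padded with one extra zero row and column (an implementation convenience, so that entry [y][x] holds the sum of pixels strictly above and to the left), and the function names are illustrative.

```python
def integral_image(image):
    """Return a (height+1) x (width+1) table of cumulative pixel sums."""
    height, width = len(image), len(image[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, left, top, width, height):
    """Sum of pixels in a rectangle using four table lookups."""
    right, bottom = left + width, top + height
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, left, top, width, height):
    """Left half minus right half of a window (width assumed even)."""
    half = width // 2
    return (rect_sum(ii, left, top, half, height)
            - rect_sum(ii, left + half, top, half, height))
```

Evaluating a feature at a larger scale only changes the rectangle coordinates, so the cost stays at a handful of lookups regardless of window size.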
  • 50.
    Large library offilters Considering all possible filter parameters: Visual Object Recognition Tutorial Computing position, scale, and type: Perceptual and Sensory Augmented 180,000+ possible features associated with each 24 x 24 window Use AdaBoost both to select the informative features and to form the classifier Viola & Jones, CVPR 2001
  • 51.
    AdaBoost for feature+classifier selection
    • Want to select the single rectangle feature and threshold that best separates positive (faces) and negative (non-faces) training examples, in terms of weighted error.
    • Resulting weak classifier: a threshold on the output of the selected rectangle feature.
    • For next round, reweight the examples according to errors, choose another filter/threshold combo.
    [Figure: outputs of a possible rectangle feature on faces and non-faces]
    Viola & Jones, CVPR 2001
  • 52.
    Viola-Jones Face Detector: Summary
    • Training: faces and non-faces -> train cascade of classifiers with AdaBoost -> selected features, thresholds, and weights
    • Detection: apply to each subwindow of a new image
    • Train with 5K positives, 350M negatives
    • Real-time detector using 38 layer cascade
    • 6061 features in total across the cascade
    • [Implementation available in OpenCV: http://www.intel.com/technology/computing/opencv/]
  • 53.
    Viola-Jones Face Detector: Results
    [Figure: first two features selected]
  • 54.
    Viola-Jones Face Detector: Results
  • 55.
    Viola-Jones Face Detector: Results
  • 56.
    Viola-Jones Face Detector: Results
  • 57.
    Profile Features
    • Detecting profile faces requires training a separate detector with profile examples.
  • 58.
    Viola-Jones Face Detector: Results
    Paul Viola, ICCV tutorial
  • 59.
    Example application
    • Frontal faces detected and then tracked; character names inferred with alignment of script and subtitles.
    Everingham, M., Sivic, J. and Zisserman, A. "Hello! My name is... Buffy" - Automatic naming of characters in TV video, BMVC 2006.
    http://www.robots.ox.ac.uk/~vgg/research/nface/index.html
  • 60.
    Pedestrian detection
    • Detecting upright, walking humans is also possible using a sliding window’s appearance/texture; e.g.,
      - SVM with Haar wavelets [Papageorgiou & Poggio, IJCV 2000]
      - Space-time rectangle features [Viola, Jones & Snow, ICCV 2003]
      - SVM with HoGs [Dalal & Triggs, CVPR 2005]
  • 61.
    Highlights
    • Sliding window detection and global appearance descriptors:
      - Simple detection protocol to implement
      - Good feature choices critical
      - Past successes for certain classes
  • 62.
    Limitations
    • High computational complexity
      - For example: 250,000 locations x 30 orientations x 4 scales = 30,000,000 evaluations!
      - If binary detectors are trained independently, cost increases linearly with the number of classes
    • With so many windows, the false positive rate had better be low
  • 63.
    Limitations (continued)
    • Not all objects are “box” shaped
  • 64.
    Limitations (continued)
    • Non-rigid, deformable objects are not captured well with representations assuming a fixed 2D structure; otherwise a fixed viewpoint must be assumed
    • Objects with less-regular textures are not captured well with holistic appearance-based descriptions
  • 65.
    Limitations (continued)
    • If considering windows in isolation, context is lost
    [Figure: sliding window vs. detector’s view. Figure credit: Derek Hoiem]
  • 66.
    Limitations (continued)
    • In practice, often entails large, cropped training set (expensive)
    • Requiring good match to a global appearance description can lead to sensitivity to partial occlusions
    Image credit: Adam, Rivlin, & Shimshoni
  • 67.
    Outline
    1. Detection with Global Appearance & Sliding Windows
    2. Local Invariant Features: Detection & Description
    3. Specific Object Recognition with Local Features
    Coffee Break
    4. Visual Words: Indexing, Bags of Words Categorization
    5. Matching Local Features
    6. Part-Based Models for Categorization
    7. Current Challenges and Research Directions
  • 68.
    Visual Object Recognition Tutorial
    Bastian Leibe (Computer Vision Laboratory, ETH Zurich) & Kristen Grauman (Department of Computer Sciences, University of Texas in Austin)
    Chicago, 14.07.2008
  • 69.
    Outline
    1. Detection with Global Appearance & Sliding Windows
    2. Local Invariant Features: Detection & Description
    3. Specific Object Recognition with Local Features
    Coffee Break
    4. Visual Words: Indexing, Bags of Words Categorization
    5. Matching Local Features
    6. Part-Based Models for Categorization
    7. Current Challenges and Research Directions
  • 70.
    Motivation
    • Global representations have major limitations
    • Instead, describe and match only local regions
    • Increased robustness to
      - Occlusions
      - Articulation
      - Intra-category variations
  • 71.
    Approach
    1. Find a set of distinctive keypoints
    2. Define a region around each keypoint
    3. Extract and normalize the region content
    4. Compute a local descriptor from the normalized region (e.g. color, N x N pixels)
    5. Match local descriptors: regions A and B match if d(f_A, f_B) < T for a similarity measure d
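    Step 5 can be illustrated with a nearest-neighbor search under a distance threshold T. This sketch assumes descriptors are equal-length numeric tuples and uses Euclidean distance for d; the slide does not fix a particular measure. The function name is hypothetical and `desc_b` is assumed to be non-empty.

```python
import math

def match_descriptors(desc_a, desc_b, threshold):
    """For each descriptor in desc_a, keep its nearest neighbor in desc_b
    if the Euclidean distance d(f_A, f_B) is below the threshold T."""
    matches = []
    for i, fa in enumerate(desc_a):
        j, dist = min(((j, math.dist(fa, fb)) for j, fb in enumerate(desc_b)),
                      key=lambda pair: pair[1])
        if dist < threshold:
            matches.append((i, j, dist))
    return matches
```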
  • 72.
    Requirements
    • Region extraction needs to be repeatable and precise under
      - Translation, rotation, scale changes
      - (Limited out-of-plane (≈ affine) transformations)
      - Lighting variations
    • We need a sufficient number of regions to cover the object
    • The regions should contain “interesting” structure
  • 73.
    Many Existing Detectors Available
    • Hessian & Harris [Beaudet ‘78], [Harris ‘88]
    • Laplacian, DoG [Lindeberg ‘98], [Lowe 1999]
    • Harris-/Hessian-Laplace [Mikolajczyk & Schmid ‘01]
    • Harris-/Hessian-Affine [Mikolajczyk & Schmid ‘04]
    • EBR and IBR [Tuytelaars & Van Gool ‘04]
    • MSER [Matas ‘02]
    • Salient Regions [Kadir & Brady ‘01]
    • Others…
  • 74.
    Keypoint Localization
    • Goals:
      - Repeatable detection
      - Precise localization
      - Interesting content
    ⇒ Look for two-dimensional signal changes
  • 75.
    Hessian Detector [Beaudet78]
    • Hessian determinant
      \[ \mathrm{Hessian}(I) = \begin{bmatrix} I_{xx} & I_{xy} \\ I_{xy} & I_{yy} \end{bmatrix} \]
    Intuition: Search for strong derivatives in two orthogonal directions
  • 76.
    Hessian Detector [Beaudet78]
    • Hessian determinant
      \[ \mathrm{Hessian}(I) = \begin{bmatrix} I_{xx} & I_{xy} \\ I_{xy} & I_{yy} \end{bmatrix} \]
      \[ \det(\mathrm{Hessian}(I)) = I_{xx} I_{yy} - I_{xy}^2 \]
    In Matlab (element-wise): Ixx .* Iyy - (Ixy).^2
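    As a rough illustration of the determinant response, the sketch below approximates the second derivatives with central finite differences on a small grayscale image stored as a list of rows. A practical detector would use Gaussian derivative filters instead; the finite differences and the function name are assumptions made for brevity.

```python
def hessian_determinant(img):
    """Per-pixel det(Hessian) = Ixx * Iyy - Ixy^2 (border pixels left at 0)."""
    h, w = len(img), len(img[0])
    det = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ixx = img[y][x + 1] - 2 * img[y][x] + img[y][x - 1]
            iyy = img[y + 1][x] - 2 * img[y][x] + img[y - 1][x]
            ixy = (img[y + 1][x + 1] - img[y + 1][x - 1]
                   - img[y - 1][x + 1] + img[y - 1][x - 1]) / 4.0
            det[y][x] = ixx * iyy - ixy ** 2
    return det
```

    A single bright pixel (a tiny blob) gives a strong positive response, while a straight edge gives none, since one of the two second derivatives vanishes there.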
  • 77.
    Hessian Detector - Responses [Beaudet78]
    Effect: Responses mainly on corners and strongly textured areas.
  • 78.
    Hessian Detector - Responses [Beaudet78]
  • 79.
    Harris Detector [Harris88]
    • Second moment matrix (autocorrelation matrix)
      \[ \mu(\sigma_I, \sigma_D) = g(\sigma_I) * \begin{bmatrix} I_x^2(\sigma_D) & I_x I_y(\sigma_D) \\ I_x I_y(\sigma_D) & I_y^2(\sigma_D) \end{bmatrix} \]
    Intuition: Search for local neighborhoods where the image content has two main directions (eigenvectors).
  • 80.
    Harris Detector [Harris88]
    • Second moment matrix (autocorrelation matrix)
      \[ \mu(\sigma_I, \sigma_D) = g(\sigma_I) * \begin{bmatrix} I_x^2(\sigma_D) & I_x I_y(\sigma_D) \\ I_x I_y(\sigma_D) & I_y^2(\sigma_D) \end{bmatrix} \]
    1. Image derivatives I_x, I_y, computed with g_x(σ_D), g_y(σ_D)
  • 81.
    Harris Detector [Harris88]
    • Second moment matrix (autocorrelation matrix)
      \[ \mu(\sigma_I, \sigma_D) = g(\sigma_I) * \begin{bmatrix} I_x^2(\sigma_D) & I_x I_y(\sigma_D) \\ I_x I_y(\sigma_D) & I_y^2(\sigma_D) \end{bmatrix} \]
    1. Image derivatives I_x, I_y, computed with g_x(σ_D), g_y(σ_D)
    2. Square of derivatives: I_x^2, I_y^2, I_x I_y
  • 82.
    Harris Detector [Harris88]
    • Second moment matrix (autocorrelation matrix)
      \[ \mu(\sigma_I, \sigma_D) = g(\sigma_I) * \begin{bmatrix} I_x^2(\sigma_D) & I_x I_y(\sigma_D) \\ I_x I_y(\sigma_D) & I_y^2(\sigma_D) \end{bmatrix} \]
    1. Image derivatives I_x, I_y, computed with g_x(σ_D), g_y(σ_D)
    2. Square of derivatives: I_x^2, I_y^2, I_x I_y
    3. Gaussian filter g(σ_I): g(I_x^2), g(I_y^2), g(I_x I_y)
  • 83.
    Harris Detector [Harris88]
    • Second moment matrix (autocorrelation matrix)
      \[ \mu(\sigma_I, \sigma_D) = g(\sigma_I) * \begin{bmatrix} I_x^2(\sigma_D) & I_x I_y(\sigma_D) \\ I_x I_y(\sigma_D) & I_y^2(\sigma_D) \end{bmatrix} \]
    1. Image derivatives I_x, I_y
    2. Square of derivatives: I_x^2, I_y^2, I_x I_y
    3. Gaussian filter g(σ_I): g(I_x^2), g(I_y^2), g(I_x I_y)
    4. Cornerness function: both eigenvalues are strong
      \[ har = \det[\mu(\sigma_I, \sigma_D)] - \alpha\,[\operatorname{trace}(\mu(\sigma_I, \sigma_D))]^2 = g(I_x^2)\,g(I_y^2) - [g(I_x I_y)]^2 - \alpha\,[g(I_x^2) + g(I_y^2)]^2 \]
    5. Non-maxima suppression
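    Steps 1 to 4 can be sketched as below. To stay dependency-free, this illustration replaces the Gaussian derivative and integration filters with central differences and a uniform 3 x 3 window; that is a simplification assumed here, not the filters named on the slide. The value alpha = 0.06 and the function name are also assumptions.

```python
def harris_response(img, alpha=0.06):
    """Cornerness det(mu) - alpha * trace(mu)^2 on a list-of-rows image."""
    h, w = len(img), len(img[0])
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    # 1. Image derivatives (central differences stand in for g_x, g_y).
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    resp = [[0.0] * w for _ in range(h)]
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            sxx = syy = sxy = 0.0
            # 2.+3. Products of derivatives, summed over a 3x3 window
            # (a uniform window stands in for the Gaussian g(sigma_I)).
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx, gy = ix[y + dy][x + dx], iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            # 4. Cornerness function.
            resp[y][x] = sxx * syy - sxy ** 2 - alpha * (sxx + syy) ** 2
    return resp
```

    On an image containing one bright quadrant, the response peaks at the corner of the quadrant, is negative along its straight edges, and is zero in flat regions. Non-maxima suppression (step 5) would then keep only local peaks.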
  • 84.
    Harris Detector - Responses [Harris88]
    Effect: A very precise corner detector.
  • 85.
    Harris Detector - Responses [Harris88]
  • 86.
    Automatic Scale Selection
      \[ f(I_{i_1 \ldots i_m}(x, \sigma)) = f(I_{i_1 \ldots i_m}(x', \sigma')) \]
    • Same operator responses if the patch contains the same image up to scale factor
    • How to find corresponding patch sizes?
  • 87.
    Automatic Scale Selection
    • Function responses for increasing scale (scale signature)
    [Figure: scale signatures \( f(I_{i_1 \ldots i_m}(x, \sigma)) \) and \( f(I_{i_1 \ldots i_m}(x', \sigma)) \)]
  • 88.
    Automatic Scale Selection
    • Function responses for increasing scale (scale signature)
    [Figure: scale signatures \( f(I_{i_1 \ldots i_m}(x, \sigma)) \) and \( f(I_{i_1 \ldots i_m}(x', \sigma)) \)]
  • 89.
    Automatic Scale Selection
    • Function responses for increasing scale (scale signature)
    [Figure: scale signatures \( f(I_{i_1 \ldots i_m}(x, \sigma)) \) and \( f(I_{i_1 \ldots i_m}(x', \sigma)) \)]
  • 90.
    Automatic Scale Selection
    • Function responses for increasing scale (scale signature)
    [Figure: scale signatures \( f(I_{i_1 \ldots i_m}(x, \sigma)) \) and \( f(I_{i_1 \ldots i_m}(x', \sigma)) \)]
  • 91.
    Automatic Scale Selection
    • Function responses for increasing scale (scale signature)
    [Figure: scale signatures \( f(I_{i_1 \ldots i_m}(x, \sigma)) \) and \( f(I_{i_1 \ldots i_m}(x', \sigma)) \)]
  • 92.
    Automatic Scale Selection
    • Function responses for increasing scale (scale signature)
    [Figure: scale signatures \( f(I_{i_1 \ldots i_m}(x, \sigma)) \) and \( f(I_{i_1 \ldots i_m}(x', \sigma')) \)]
  • 93.
    What Is A Useful Signature Function?
    • Laplacian-of-Gaussian = “blob” detector
  • 94.
    Laplacian-of-Gaussian (LoG)
    • Local maxima in scale space of Laplacian-of-Gaussian
      \[ L_{xx}(\sigma) + L_{yy}(\sigma) \]
    [Figure: responses at scales σ, σ2, σ3, σ4, σ5]
    ⇒ List of (x, y, s)
  • 95.
    Results: Laplacian-of-Gaussian
  • 96.
    Difference-of-Gaussian (DoG)
    • Difference of Gaussians as approximation of the Laplacian-of-Gaussian
    [Figure: difference of two Gaussians]
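    The approximation can be made explicit with a short, standard derivation. The symbols G (Gaussian kernel) and k (ratio between neighboring scales) are introduced here for illustration and do not appear on the slide. Since the Gaussian satisfies the diffusion equation, a finite difference in scale gives:

```latex
\[
\frac{\partial G}{\partial \sigma} = \sigma \nabla^2 G
\quad\Rightarrow\quad
\sigma \nabla^2 G \approx \frac{G(x, y, k\sigma) - G(x, y, \sigma)}{k\sigma - \sigma}
\]
\[
G(x, y, k\sigma) - G(x, y, \sigma) \approx (k - 1)\,\sigma^2 \nabla^2 G
\]
```

    So the DoG response equals the scale-normalized LoG up to the constant factor (k - 1), which does not change the location of extrema.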
  • 97.
    DoG - Efficient Computation
    • Computation in Gaussian scale pyramid
    [Figure: pyramid built from the original image, sampling with step σ^4 = 2, i.e. σ = 2^(1/4)]
  • 98.
    Results: Lowe’s DoG
  • 99.
    Harris-Laplace [Mikolajczyk ‘01]
    1. Initialization: Multiscale Harris corner detection
       - Computing Harris function at scales σ, σ2, σ3, σ4
       - Detecting local maxima
  • 100.
    Harris-Laplace [Mikolajczyk ‘01]
    1. Initialization: Multiscale Harris corner detection
    2. Scale selection based on Laplacian (same procedure with Hessian ⇒ Hessian-Laplace)
    [Figure: Harris points vs. Harris-Laplace points]
  • 101.
    Maximally Stable Extremal Regions [Matas ‘02]
    • Based on Watershed segmentation algorithm
    • Select regions that stay stable over a large parameter range
  • 102.
    Example Results: MSER
  • 103.
    You Can Try It At Home…
    • For most local feature detectors, executables are available online:
      - http://robots.ox.ac.uk/~vgg/research/affine
      - http://www.cs.ubc.ca/~lowe/keypoints/
      - http://www.vision.ee.ethz.ch/~surf
  • 104.
    Orientation Normalization
    • Compute orientation histogram over [0, 2π] [Lowe, SIFT, 1999]
    • Select dominant orientation
    • Normalize: rotate to fixed orientation
    T. Tuytelaars, B. Leibe
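    The first two steps can be sketched as a magnitude-weighted histogram of gradient directions. This is a simplified illustration: it assumes a patch given as a list of rows, 36 bins, central-difference gradients, and no Gaussian weighting or peak interpolation. Angles are measured in image coordinates (y pointing down). The function name is hypothetical.

```python
import math

def dominant_orientation(patch, num_bins=36):
    """Return (angle at the center of the strongest bin, histogram)."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * num_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            angle = math.atan2(gy, gx) % (2 * math.pi)   # map to [0, 2*pi)
            b = int(angle / (2 * math.pi) * num_bins) % num_bins
            hist[b] += mag                                # weight by magnitude
    best = max(range(num_bins), key=hist.__getitem__)
    return (best + 0.5) * 2 * math.pi / num_bins, hist
```

    Rotating the patch by the negative of the returned angle would bring it to a fixed reference orientation, which is the normalization step on the slide.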
  • 105.
    Local Descriptors
    • The ideal descriptor should be
      - Repeatable
      - Distinctive
      - Compact
      - Efficient
    • Most available descriptors focus on edge/gradient information
      - Capture texture information
      - Color still relatively seldom used (more suitable for homogeneous regions)
  • 106.
    Local Descriptors: SIFT Descriptor
    • Histogram of oriented gradients
      - Captures important texture information
      - Robust to small translations / affine deformations
    [Lowe, ICCV 1999]
  • 107.
    Local Descriptors: SURF
    • Fast approximation of SIFT idea
      - Efficient computation by 2D box filters & integral images ⇒ 6 times faster than SIFT
      - Equivalent quality for object identification
    • GPU implementation available
      - Feature extraction @ 100Hz (detector + descriptor, 640×480 img)
      - http://www.vision.ee.ethz.ch/~surf
    [Bay, ECCV’06], [Cornelis, CVGPU’08]
  • 108.
    Local Descriptors: Shape Context
    • Count the number of points inside each bin, e.g.: Count = 4, ..., Count = 10
    • Log-polar binning: more precision for nearby points, more flexibility for farther points.
    Belongie & Malik, ICCV 2001
  • 109.
    Local Descriptors: Geometric Blur
    • Compute edges at four orientations
    • Extract a patch in each channel
    • Apply spatially varying blur and sub-sample
    [Figure: example descriptor (idealized signal)]
    Berg & Malik, CVPR 2001
  • 110.
    So, What Local Features Should I Use?
    • There have been extensive evaluations/comparisons [Mikolajczyk et al., IJCV’05, PAMI’05]
      - All detectors/descriptors shown here work well
    • Best choice often application dependent
      - MSER works well for buildings and printed things
      - Harris-/Hessian-Laplace/DoG work well for many natural categories
    • More features are better
      - Combining several detectors often helps
  • 111.
    Outline
    1. Detection with Global Appearance & Sliding Windows
    2. Local Invariant Features: Detection & Description
    3. Specific Object Recognition with Local Features
    Coffee Break
    4. Visual Words: Indexing, Bags of Words Categorization
    5. Matching Local Features
    6. Part-Based Models for Categorization
    7. Current Challenges and Research Directions
  • 112.
    Visual Object Recognition Tutorial
    Bastian Leibe (Computer Vision Laboratory, ETH Zurich) & Kristen Grauman (Department of Computer Sciences, University of Texas in Austin)
    Chicago, 14.07.2008
  • 113.
    Outline
    1. Detection with Global Appearance & Sliding Windows
    2. Local Invariant Features: Detection & Description
    3. Specific Object Recognition with Local Features
    Coffee Break
    4. Visual Words: Indexing, Bags of Words Categorization
    5. Matching Local Features
    6. Part-Based Models for Categorization
    7. Current Challenges and Research Directions
  • 114.
    Recognition with Local Features
    • Image content is transformed into local features (e.g. SIFT) that are invariant to translation, rotation, and scale
    • Goal: Verify if they belong to a consistent configuration
    Slide credit: David Lowe
  • 115.
    Finding Consistent Configurations
    • Global spatial models
      - Generalized Hough Transform [Lowe99]
      - RANSAC [Obdrzalek02, Chum05, Nister06]
      - Basic assumption: object is planar
    • Assumption is often justified in practice
      - Valid for many structures on buildings
      - Sufficient for small viewpoint variations on 3D objects
  • 116.
    Hough Transform
    • Origin: Detection of straight lines in clutter
      - Basic idea: each candidate point votes for all lines that it is consistent with.
      - Votes are accumulated in quantized array
      - Local maxima correspond to candidate lines
    • Representation of a line
      - Usual form y = a x + b has a singularity around 90º.
      - Better parameterization: x cos(θ) + y sin(θ) = ρ
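    A minimal voting sketch in the (θ, ρ) parameterization, assuming points are (x, y) tuples, θ is sampled in 1-degree steps over [0, π), and ρ is quantized to unit-width bins. These discretization choices and the function name are assumptions for illustration.

```python
import math
from collections import Counter

def hough_lines(points, num_theta=180, rho_step=1.0):
    """Accumulate votes: each point votes for every (theta, rho) bin whose
    line x*cos(theta) + y*sin(theta) = rho passes through it."""
    votes = Counter()
    for x, y in points:
        for t in range(num_theta):
            theta = math.pi * t / num_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(t, round(rho / rho_step))] += 1
    return votes
```

    Ten points on the horizontal line y = 3 all vote for the bin (θ = 90°, ρ = 3), so `votes.most_common(1)` recovers that line even when a few clutter points are added.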
  • 117.
    Hough Transform: Noisy Line
    [Figure: tokens and votes in (θ, ρ) space]
    • Problem: Finding the true maximum
    Slide credit: David Lowe
  • 118.
    Hough Transform: Noisy Input
    [Figure: tokens and votes in (θ, ρ) space]
    • Problem: Lots of spurious maxima
    Slide credit: David Lowe
  • 119.
    Generalized Hough Transform [Ballard81]
    • Generalization for an arbitrary contour or shape
      - Choose reference point for the contour (e.g. center)
      - For each point on the contour remember where it is located w.r.t. the reference point
      - Remember radius r and angle φ relative to the contour tangent
      - Recognition: whenever you find a contour point, calculate the tangent angle and ‘vote’ for all possible reference points
      - Instead of reference point, can also vote for transformation
    ⇒ The same idea can be used with local features!
    Slide credit: Bernt Schiele
  • 120.
    Gen. Hough Transform with Local Features
    • For every feature, store possible “occurrences”
      - Object identity
      - Pose
      - Relative position
    • For new image, let the matched features vote for possible object positions
  • 121.
    3D Object Recognition
    • Gen. HT for Recognition [Lowe99]
      - Typically only 3 feature matches needed for recognition
      - Extra matches provide robustness
      - Affine model can be used for planar objects
    Slide credit: David Lowe
  • 122.
    View Interpolation
    • Training
      - Training views from similar viewpoints are clustered based on feature matches.
      - Matching features between adjacent views are linked.
    • Recognition
      - Feature matches may be spread over several training viewpoints.
      ⇒ Use the known links to “transfer votes” to other viewpoints.
    [Lowe01]
    Slide credit: David Lowe
  • 123.
    Recognition Using View Interpolation
    [Lowe01]
    Slide credit: David Lowe
  • 124.
    Location Recognition
    [Figure: training images] [Lowe04]
    Slide credit: David Lowe
  • 125.
    Applications
    • Sony Aibo (Evolution Robotics)
    • SIFT usage
      - Recognize docking station
      - Communicate with visual cards
    • Other uses
      - Place recognition
      - Loop closure in SLAM
    Slide credit: David Lowe
  • 126.
    RANSAC (RANdom SAmple Consensus) [Fischler81]
    • Randomly choose a minimal subset of data points necessary to fit a model (a sample)
    • Points within some distance threshold t of model are a consensus set. Size of consensus set is model’s support.
    • Repeat for N samples; model with biggest support is most robust fit
      - Points within distance t of best model are inliers
      - Fit final model to all inliers
    Slide credit: David Lowe
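    The procedure can be sketched for 2D line fitting, where the minimal sample is two points. This is an illustrative version under simple assumptions: points are (x, y) tuples, the caller supplies a `random.Random` instance, and the final refit is an ordinary least-squares fit of y = m x + k (so it assumes a non-vertical line). All function names are hypothetical.

```python
import random

def line_through(p, q):
    """Line a*x + b*y + c = 0 through p and q, with a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    if norm == 0:
        return None                      # degenerate sample: identical points
    a, b = a / norm, b / norm
    return a, b, -(a * x1 + b * y1)

def ransac_line(points, num_samples, t, rng):
    """Return (best model, its consensus set) over num_samples random samples."""
    best_model, best_inliers = None, []
    for _ in range(num_samples):
        model = line_through(*rng.sample(points, 2))   # minimal subset
        if model is None:
            continue
        a, b, c = model
        inliers = [(x, y) for x, y in points if abs(a * x + b * y + c) <= t]
        if len(inliers) > len(best_inliers):           # biggest support wins
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

def refit_least_squares(inliers):
    """Final model y = m*x + k fitted to all inliers."""
    n = len(inliers)
    mx = sum(x for x, _ in inliers) / n
    my = sum(y for _, y in inliers) / n
    sxx = sum((x - mx) ** 2 for x, _ in inliers)
    sxy = sum((x - mx) * (y - my) for x, y in inliers)
    m = sxy / sxx
    return m, my - m * mx
```

    A typical call would be `ransac_line(points, 50, 0.1, random.Random(0))` followed by `refit_least_squares` on the returned inliers.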
  • 127.
    RANSAC: How many samples?
    • How many samples are needed?
      - Suppose w is fraction of inliers (points from line).
      - n points needed to define hypothesis (2 for lines)
      - k samples chosen.
    • Prob. that a single sample of n points is correct: \( w^n \)
    • Prob. that all samples fail is: \( (1 - w^n)^k \)
    ⇒ Choose k high enough to keep this below desired failure rate.
    Slide credit: David Lowe
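    Solving (1 - w^n)^k <= failure rate for k gives k >= log(failure rate) / log(1 - w^n). A small helper, with a hypothetical name, makes the numbers concrete:

```python
import math

def ransac_num_samples(w, n, failure_rate):
    """Smallest k with (1 - w**n)**k <= failure_rate.

    w: inlier fraction, n: points per minimal sample,
    failure_rate: acceptable probability that every sample is contaminated.
    """
    p_good = w ** n                      # probability one sample is all inliers
    if p_good <= 0.0:
        raise ValueError("no inliers: no number of samples can succeed")
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(failure_rate) / math.log(1.0 - p_good))
```

    For line fitting (n = 2) with half of the points being inliers (w = 0.5) and a 1% failure rate, this gives 17 samples; the count grows quickly as w drops or n rises.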
  • 128.
    After RANSAC
    • RANSAC divides data into inliers and outliers and yields estimate computed from minimal set of inliers
    • Improve this initial estimate with estimation over all inliers (e.g. with standard least-squares minimization)
    • But this may change inliers, so alternate fitting with re-classification as inlier/outlier
    Slide credit: David Lowe
  • 129.
    Example: Finding Feature Matches
    • Find best stereo match within a square search window (here 300 pixels²)
    • Global transformation model: epipolar geometry
    from Hartley & Zisserman
    Slide credit: David Lowe
  • 130.
    Example: Finding Feature Matches
    • Find best stereo match within a square search window (here 300 pixels²)
    • Global transformation model: epipolar geometry
    [Figure: matches before RANSAC and after RANSAC]
    from Hartley & Zisserman
    Slide credit: David Lowe
  • 131.
    Comparison
    Gen. Hough Transform
    • Advantages
      - Very effective for recognizing arbitrary shapes or objects
      - Can handle high percentage of outliers (>95%)
      - Extracts groupings from clutter in linear time
    • Disadvantages
      - Quantization issues
      - Only practical for small number of dimensions (up to 4)
    • Improvements available
      - Probabilistic Extensions [Leibe08]
      - Continuous Voting Space
    RANSAC
    • Advantages
      - General method suited to large range of problems
      - Easy to implement
      - Independent of number of dimensions
    • Disadvantages
      - Only handles moderate number of outliers (<50%)
    • Many variants available, e.g.
      - PROSAC: Progressive RANSAC [Chum05]
      - Preemptive RANSAC [Nister05]
  • 132.
    Example Applications
    • Mobile tourist guide
      - Self-localization
      - Object/building recognition
      - Photo/video augmentation
    [Quack, Leibe, Van Gool, CIVR’08]
  • 133.
    Web Demo: Movie Poster Recognition
    • 50’000 movie posters indexed
    • Query-by-image from mobile phone available in Switzerland
    http://www.kooaba.com/en/products_engine.html#
  • 134.
    Application: Large-Scale Retrieval
    [Figure: query and results from 5k Flickr images (demo available for 100k set)]
    [Philbin CVPR’07]
  • 135.
    Application: Image Auto-Annotation
    [Figure: Moulin Rouge, Old Town Square (Prague), Tour Montparnasse, Colosseum, Viktualienmarkt, Maypole. Left: Wikipedia image. Right: closest match from Flickr]
    [Quack CIVR’08]
  • 136.
    Outline
    1. Detection with Global Appearance & Sliding Windows
    2. Local Invariant Features: Detection & Description
    3. Specific Object Recognition with Local Features
    Coffee Break
    4. Visual Words: Indexing, Bags of Words Categorization
    5. Matching Local Features
    6. Part-Based Models for Categorization
    7. Current Challenges and Research Directions
  • 137.
    Visual Object Recognition Tutorial
    Bastian Leibe (Computer Vision Laboratory, ETH Zurich) & Kristen Grauman (Department of Computer Sciences, University of Texas in Austin)
    Chicago, 14.07.2008
  • 138.
    Outline
    1. Detection with Global Appearance & Sliding Windows
    2. Local Invariant Features: Detection & Description
    3. Specific Object Recognition with Local Features
    Coffee Break
    4. Visual Words: Indexing, Bags of Words Categorization
    5. Matching Local Feature Sets
    6. Part-Based Models for Categorization
    7. Current Challenges and Research Directions
  • 139.
    Global representations: limitations
    • Success may rely on alignment -> sensitive to viewpoint
    • All parts of the image or window impact the description -> sensitive to occlusion, clutter
  • 140.
    Local representations
    • Describe component regions or patches separately.
    • Many options for detection & description…
      - Maximally Stable Extremal Regions [Matas 02]
      - SIFT [Lowe 99]
      - Shape context [Belongie 02]
      - Superpixels [Ren et al.]
      - Salient regions [Kadir 01]
      - Harris-Affine [Mikolajczyk 04]
      - Spin images [Johnson 99]
      - Geometric Blur [Berg 05]
  • 141.
    Recall: Invariant local features
    • Subset of local feature types designed to be invariant to
      - Scale
      - Translation
      - Rotation
      - Affine transformations
      - Illumination
    1) Detect interest points
    2) Extract descriptors (x1, x2, …, xd), (y1, y2, …, yd)
    [Mikolajczyk01, Matas02, Tuytelaars04, Lowe99, Kadir01,… ]
  • 142.
    Recognition with local feature sets
    • Previously, we saw how to use local invariant features + a global spatial model to recognize specific objects, using a planar object assumption.
    • Now, we’ll use local features for
      - Indexing-based recognition
      - Bags of words representations
      - Correspondence / matching kernels
  • 143.
    Basic flow
    1. Detect or sample features: list of positions, scales, orientations
    2. Describe features: associated list of d-dimensional descriptors
    3. Index each one into pool of descriptors from previously seen images
  • 144.
    Indexing local features
    • Each patch / region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT)
  • 145.
    Indexing local features
    • When we see close points in feature space, we have similar descriptors, which indicates similar local content.
    Figure credit: A. Zisserman
  • 146.
    Indexing local features
    • We saw in the previous section how to use voting and pose clustering to identify objects using local features
    Figure credit: David Lowe
  • 147.
    Indexing local features
    • With potentially thousands of features per image, and hundreds to millions of images to search, how to efficiently find those that are relevant to a new image?
      - Low-dimensional descriptors: can use standard efficient data structures for nearest neighbor search
      - High-dimensional descriptors: approximate nearest neighbor search methods more practical
      - Inverted file indexing schemes
  • 148.
    Indexing local features: approximate nearest neighbor search
    • Best-Bin First (BBF), a variant of k-d trees that uses priority queue to examine most promising branches first [Beis & Lowe, CVPR 1997]
    • Locality-Sensitive Hashing (LSH), a randomized hashing technique using hash functions that map similar points to the same bin, with high probability [Indyk & Motwani, 1998]
  • 149.
    Indexing local features: inverted file index • For text documents, an efficient way to find all pages on which a word occurs is to use an index… • We want to find all images in which a feature occurs. • To use this idea, we’ll need to map our features to “visual words”.
  • 150.
    Visual words: main idea • Extract some local features from a number of images … e.g., SIFT descriptor space: each point is 128-dimensional. Slide credit: D. Nister
  • 151.
    Visual words: main idea [figure]. Slide credit: D. Nister
  • 152.
    Visual words: main idea [figure]. Slide credit: D. Nister
  • 153.
    Visual words: main idea [figure]. Slide credit: D. Nister
  • 154.
    [Figure only] Slide credit: D. Nister
  • 155.
    [Figure only] Slide credit: D. Nister
  • 156.
    Visual words: main idea • Map high-dimensional descriptors to tokens/words by quantizing the feature space • Quantize via clustering, let cluster centers be the prototype “words” [Figure: descriptor space]
  • 157.
    Visual words: main idea • Map high-dimensional descriptors to tokens/words by quantizing the feature space • Determine which word to assign to each new image region by finding the closest cluster center. [Figure: descriptor space]
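The two steps above (cluster to form words, then assign each new descriptor to its closest center) can be sketched as follows. This is a minimal plain-Python k-means under simplifying assumptions: random initialization from the data, a fixed iteration count, and toy 2-D descriptors in place of 128-D SIFT. The function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, centers):
    """Index of the closest cluster center, i.e. the visual word id."""
    return min(range(len(centers)), key=lambda k: sq_dist(descriptor, centers[k]))


def build_vocabulary(descriptors, num_words, iterations=20, seed=0):
    """Plain k-means: the returned cluster centers are the prototype words."""
    rng = random.Random(seed)
    centers = [list(d) for d in rng.sample(descriptors, num_words)]
    for _ in range(iterations):
        clusters = [[] for _ in range(num_words)]
        for d in descriptors:
            clusters[nearest_word(d, centers)].append(d)
        for k, members in enumerate(clusters):
            if members:  # keep the old center if a cluster went empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Toy data: two well-separated groups of 2-D "descriptors".
training = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
vocabulary = build_vocabulary(training, num_words=2)
words = [nearest_word(d, vocabulary) for d in training]
```

With real data the vocabulary is learned once from a training corpus, and `nearest_word` is then applied to every descriptor of every new image.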
  • 158.
    Visual words • Example: each group of patches belongs to the same visual word. Figure from Sivic & Zisserman, ICCV 2003
  • 159.
    Visual words • First explored for texture and material representations • Texton = cluster center of filter responses over collection of images • Describe textures and materials based on distribution of prototypical texture elements. Leung & Malik 1999; Varma & Zisserman, 2002; Lazebnik, Schmid & Ponce, 2003
  • 160.
    Visual words • More recently used for describing scenes and objects for the sake of indexing or classification. Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others.
  • 161.
    Inverted file index for images comprised of visual words • Table: word number → list of image numbers. Image credit: A. Zisserman
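A minimal sketch of such an index, assuming each image has already been reduced to a list of visual word ids (the image ids and function names below are illustrative only). Each word maps to the set of images containing it, so a query touches only images that share at least one word with it.

```python
from collections import defaultdict


def build_inverted_index(image_words):
    """Map each visual word id to the set of image ids in which it occurs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def candidate_images(index, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


database = {"a": [1, 2, 3], "b": [3, 4], "c": [5]}
index = build_inverted_index(database)
ranked = candidate_images(index, [3, 4, 9])  # [("b", 2), ("a", 1)]
```

Image "c" is never visited, which is the point of the structure: cost scales with the length of the posting lists for the query words, not with the database size.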
  • 162.
    Bags of visual words • Summarize entire image based on its distribution (histogram) of word occurrences. • Analogous to bag of words representation commonly used for documents. Image credit: Fei-Fei Li
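As a small illustration, the histogram can be built and compared as below. The L1 normalization and the cosine similarity are assumptions chosen for the sketch; the slides do not prescribe a particular normalization or comparison measure.

```python
import math


def bag_of_words(word_ids, vocabulary_size):
    """L1-normalized histogram of visual word occurrences for one image."""
    hist = [0.0] * vocabulary_size
    for word in word_ids:
        hist[word] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def cosine_similarity(a, b):
    """Cosine of the angle between two histograms (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


image_1 = bag_of_words([0, 0, 2, 3], vocabulary_size=4)  # [0.5, 0.0, 0.25, 0.25]
image_2 = bag_of_words([1, 1, 1], vocabulary_size=4)     # [0.0, 1.0, 0.0, 0.0]
similarity = cosine_similarity(image_1, image_2)          # no shared words: 0.0
```

Every image maps to a vector of the same length regardless of how many features it contains, which is what lets standard vector-based learners be applied.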
  • 163.
    Video Google System • 1. Collect all words within query region • 2. Inverted file index to find relevant frames • 3. Compare word counts • 4. Spatial verification [Figure: query region and retrieved frames] Sivic & Zisserman, ICCV 2003 • Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html
  • 164.
    Basic flow • Detect or sample features (list of positions, scales, orientations) → Describe features (associated list of d-dimensional descriptors) → Index each one into pool of descriptors from previously seen images, or quantize to form bag of words vector for the image
  • 165.
    Visual vocabulary formation. Issues: • Sampling strategy • Clustering / quantization algorithm • Unsupervised vs. supervised • What corpus provides features (universal vocabulary?) • Vocabulary size, number of words
  • 166.
    Sampling strategies [Figure panels: sparse, at interest points; dense, uniformly; randomly; multiple interest operators] • To find specific, textured objects, sparse sampling from interest points often more reliable. • Multiple complementary interest operators offer more image coverage. • For object categorization, dense sampling offers better coverage. [See Nowak, Jurie & Triggs, ECCV 2006] Image credits: F-F. Li, E. Nowak, J. Sivic
  • 167.
    Clustering / quantization methods • k-means (typical choice), agglomerative clustering, mean-shift, … • Hierarchical clustering: allows faster insertion / word assignment while still allowing large vocabularies • Vocabulary tree [Nister & Stewenius, CVPR 2006]
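Why a hierarchy makes word assignment faster can be seen in a small sketch. It assumes the tree has already been built by hierarchical clustering; the dictionary layout and the helpers `leaf`, `inner`, and `assign_word` are hypothetical. A descriptor is compared only against the children of the current node at each level, so with branching factor k and depth L it costs about k·L comparisons to reach one of up to k^L leaf words.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def leaf(center, word):
    """Leaf node: a cluster center that carries a visual word id."""
    return {"center": center, "children": [], "word": word}


def inner(center, children):
    """Inner node: a coarser cluster center with finer child clusters."""
    return {"center": center, "children": children, "word": None}


def assign_word(node, descriptor):
    """Descend from the root, following the closest child at every level."""
    while node["children"]:
        node = min(node["children"], key=lambda c: sq_dist(descriptor, c["center"]))
    return node["word"]


# Toy two-level tree over 2-D descriptors (branching factor 2, four words).
EXAMPLE_TREE = inner(None, [
    inner((0, 0), [leaf((0, 1), 0), leaf((1, 0), 1)]),
    inner((10, 10), [leaf((9, 10), 2), leaf((11, 10), 3)]),
])
word = assign_word(EXAMPLE_TREE, (10.8, 9.5))  # 3
```

The greedy descent is not guaranteed to find the globally nearest leaf, which is the usual trade made for speed.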
  • 168.
    Example: Recognition with Vocabulary Tree • Tree construction [figure] [Nister & Stewenius, CVPR’06] Slide credit: David Nister
  • 169.
    Vocabulary Tree • Training: Filling the tree [figure] [Nister & Stewenius, CVPR’06] Slide credit: David Nister
  • 170.
    Vocabulary Tree • Training: Filling the tree [figure] [Nister & Stewenius, CVPR’06] Slide credit: David Nister
  • 171.
    Vocabulary Tree • Training: Filling the tree [figure] [Nister & Stewenius, CVPR’06] Slide credit: David Nister
  • 172.
    Vocabulary Tree • Training: Filling the tree [figure] [Nister & Stewenius, CVPR’06] Slide credit: David Nister
  • 173.
    Vocabulary Tree • Training: Filling the tree [figure] [Nister & Stewenius, CVPR’06] Slide credit: David Nister
  • 174.
    Vocabulary Tree • Recognition • RANSAC verification [Nister & Stewenius, CVPR’06] Slide credit: David Nister
  • 175.
    Vocabulary Tree: Performance • Evaluated on large databases: indexing with up to 1M images • Online recognition for database of 50,000 CD covers: retrieval in ~1s • Find experimentally that large vocabularies can be beneficial for recognition [Nister & Stewenius, CVPR’06]
  • 176.
    Vocabulary formation • Ensembles of trees provide additional robustness. Moosmann, Jurie, & Triggs 2006; Yeh, Lee, & Darrell 2007; Bosch, Zisserman, & Munoz 2007; … Figure credit: F. Jurie
  • 177.
    Supervised vocabulary formation • Recent work considers how to leverage labeled images when constructing the vocabulary. Perronnin, Dance, Csurka, & Bressan, Adapted Vocabularies for Generic Visual Categorization, ECCV 2006.
  • 178.
    Supervised vocabulary formation • Merge words that don’t aid in discriminability. Winn, Criminisi, & Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV 2005.
  • 179.
    Supervised vocabulary formation • Consider vocabulary and classifier construction jointly. Yang, Jin, Sukthankar, & Jurie, Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008.
  • 180.
    Learning and recognition with bag of words histograms • Bag of words representation makes it possible to describe the unordered point set with a single vector (of fixed dimension across image examples) • Provides easy way to use distribution of feature types with various learning algorithms requiring vector input.
  • 181.
    Learning and recognition with bag of words histograms • …including unsupervised topic models designed for documents. • Hierarchical Bayesian text models (pLSA and LDA): Hofmann 2001; Blei, Ng & Jordan, 2003 • For object and scene categorization: Sivic et al. 2005, Sudderth et al. 2005, Quelhas et al. 2005, Fei-Fei et al. 2005. Figure credit: Fei-Fei Li
  • 182.
    Learning and recognition with bag of words histograms • …including unsupervised topic models designed for documents. • Probabilistic Latent Semantic Analysis (pLSA) [Figure: graphical model over d, z, w with plates N and D; example topic “face”] Sivic et al. ICCV 2005 [pLSA code available at: http://www.robots.ox.ac.uk/~vgg/software/] Figure credit: Fei-Fei Li
  • 183.
    Bags of words: pros and cons • + flexible to geometry / deformations / viewpoint • + compact summary of image content • + provides vector representation for sets • + has yielded good recognition results in practice • - basic model ignores geometry: must verify afterwards, or encode via features • - background and foreground mixed when bag covers whole image • - interest points or sampling: no guarantee to capture object-level parts • - optimal vocabulary formation remains unclear
  • 184.
    Outline • 1. Detection with Global Appearance & Sliding Windows • 2. Local Invariant Features: Detection & Description • 3. Specific Object Recognition with Local Features • (Coffee Break) • 4. Visual Words: Indexing, Bags of Words Categorization • 5. Matching Local Feature Sets • 6. Part-Based Models for Categorization • 7. Current Challenges and Research Directions
  • 185.
    Visual Object Recognition Tutorial • Bastian Leibe (Computer Vision Laboratory, ETH Zurich) & Kristen Grauman (Department of Computer Sciences, University of Texas in Austin) • Chicago, 14.07.2008
  • 186.
    Outline • 1. Detection with Global Appearance & Sliding Windows • 2. Local Invariant Features: Detection & Description • 3. Specific Object Recognition with Local Features • (Coffee Break) • 4. Visual Words: Indexing, Bags of Words Categorization • 5. Matching Local Feature Sets • 6. Part-Based Models for Categorization • 7. Current Challenges and Research Directions
  • 187.
    Basic flow • Detect or sample features (list of positions, scales, orientations) → Describe features (associated list of d-dimensional descriptors) → Index each one into pool of descriptors from previously seen images, or quantize to form bag of words vector for the image, or compute match with another image
  • 188.
    Local feature correspondences • The matching between sets of local features helps to establish overall similarity between objects or shapes. • Assigned matches also useful for localization • Examples: shape context [Belongie & Malik 2001]; low-distortion matching [Berg & Malik 2005]; match kernel [Wallraven, Caputo & Graf 2003]
  • 189.
    Local feature correspondences • Least cost match: minimize total cost between matched points: $\min_{\pi: X \rightarrow Y} \sum_{x_i \in X} \| x_i - \pi(x_i) \|$ • Least cost partial match: match all of smaller set to some portion of larger set.
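To make the partial-match objective concrete, here is a brute-force sketch that tries every injective mapping from the smaller set into the larger one and keeps the cheapest. It is only practical for a handful of points; a real system would use an assignment solver. The Euclidean cost and the function name are assumptions of the sketch.

```python
import math
from itertools import permutations


def least_cost_partial_match(X, Y):
    """Match every point of X to a distinct point of Y, minimizing total distance.

    Assumes len(X) <= len(Y). Returns (total_cost, assignment), where
    assignment[i] is the index in Y matched to X[i].
    """
    best_cost, best_assignment = math.inf, None
    for targets in permutations(range(len(Y)), len(X)):
        cost = sum(math.dist(x, Y[j]) for x, j in zip(X, targets))
        if cost < best_cost:
            best_cost, best_assignment = cost, targets
    return best_cost, best_assignment


X = [(0, 0), (5, 5)]
Y = [(5, 6), (0, 1), (9, 9)]
cost, assignment = least_cost_partial_match(X, Y)  # cost 2.0, assignment (1, 0)
```

The point (9, 9) in the larger set stays unmatched, which is what "partial" means here.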
  • 190.
    Pyramid match kernel (PMK) • Optimal matching expensive relative to number of features per image (m). • PMK is approximate partial match for efficient discriminative learning from sets of local features. • Optimal match: O(m^3) • Greedy match: O(m^2 log m) • Pyramid match: O(m) [Grauman & Darrell, ICCV 2005]
  • 191.
    Pyramid match kernel: pyramid extraction • Histogram pyramid: level i has bins of size [formula not recovered]. K. Grauman
  • 192.
    Pyramid match kernel: counting matches • Histogram intersection [figure]. K. Grauman
  • 193.
    Pyramid match kernel: counting new matches • Histogram intersection: matches at this level minus matches at previous level • Difference in histogram intersections across levels counts number of new pairs matched. K. Grauman
  • 194.
    Pyramid match kernel • Computed over the histogram pyramids as a sum over levels of (number of newly matched pairs at level i) × (measure of difficulty of a match at level i) • For similarity, weights inversely proportional to bin size (or may be learned discriminatively) • Normalize kernel values to avoid favoring large sets. K. Grauman
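The counting scheme on the preceding slides can be sketched in a few lines. The sketch assumes bin side lengths that double per level (1, 2, 4, …) and weights 1/2^i, which is one reading of "inversely proportional to bin size"; it also uses a single fixed grid and toy 1-D points. The function names are hypothetical and this is not the libpmk implementation.

```python
import math
from collections import Counter


def histogram(points, bin_size):
    """Count points per grid cell for cells of the given side length."""
    return Counter(tuple(int(c // bin_size) for c in p) for p in points)


def intersection(h1, h2):
    """Histogram intersection: number of points that can be paired within bins."""
    return sum(min(count, h2[cell]) for cell, count in h1.items())


def pyramid_match(X, Y, num_levels):
    """Weighted sum over levels of the newly matched pairs at each level."""
    score, previous = 0.0, 0
    for level in range(num_levels):
        size = 2 ** level
        current = intersection(histogram(X, size), histogram(Y, size))
        score += (current - previous) / size  # new matches times weight 1/size
        previous = current
    return score


def normalized_pyramid_match(X, Y, num_levels):
    """Divide by self-similarities so that large sets are not favored."""
    self_sim = pyramid_match(X, X, num_levels) * pyramid_match(Y, Y, num_levels)
    return pyramid_match(X, Y, num_levels) / math.sqrt(self_sim)


X = [(0.2,), (5.1,)]
Y = [(0.7,), (6.4,)]
raw = pyramid_match(X, Y, num_levels=3)                # 1.25
scaled = normalized_pyramid_match(X, Y, num_levels=3)  # 0.625
```

In this toy case one pair already matches at the finest level (weight 1) and the second pair first shares a bin at side length 4 (weight 1/4), giving 1.25. Each level needs only one pass over the points, which is where the linear-time behaviour comes from.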
  • 195.
    Example pyramid match [figure]. K. Grauman
  • 196.
    Example pyramid match [figure]. K. Grauman
  • 197.
    Example pyramid match [figure]. K. Grauman
  • 198.
    Example pyramid match [Figure: pyramid match vs. optimal match]. K. Grauman
  • 199.
    Pyramid match kernel • Forms a Mercer kernel -> allows classification with SVMs, use of other kernel methods • Bounded error relative to optimal partial match • Linear time -> efficient learning with large feature sets
  • 200.
    Pyramid match kernel • Forms a Mercer kernel -> allows classification with SVMs, use of other kernel methods • Bounded error relative to optimal partial match • Linear time -> efficient learning with large feature sets [Plots on the ETH-80 data set: accuracy and time (s) vs. mean number of features, for the match kernel of Wallraven et al., O(m^2), and the pyramid match, O(m)]
  • 201.
    Pyramid match kernel • Forms a Mercer kernel -> allows classification with SVMs, use of other kernel methods • Bounded error relative to optimal partial match • Linear time -> efficient learning with large feature sets • Use data-dependent pyramid partitions for high-d feature spaces [Figure: uniform pyramid bins vs. vocabulary-guided pyramid bins] Code for PMK: http://people.csail.mit.edu/jjl/libpmk/
  • 202.
    Matching smoothness & local geometry • Solving for linear assignment means (non-overlapping) features can be matched independently, ignoring relative geometry. • One alternative: simply expand feature vectors to include spatial information before matching: [ f1,…,f128, xa, ya ]
  • 203.
    Spatial pyramid match kernel • First quantize descriptors into words, then do one pyramid match per word in image coordinate space. Lazebnik, Schmid & Ponce, CVPR 2006
  • 204.
    Matching smoothness & local geometry • Use correspondence to estimate parameterized transformation, regularize to enforce smoothness • Shape context matching [Belongie, Malik, & Puzicha 2001] Code: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/sc_digits.html
  • 205.
    Matching smoothness & local geometry • Let matching cost include term to penalize distortion between pairs of matched features. [Figure: template features i, j with offset Rij and query features i', j' with offset Si'j'] • Approximate for efficient solutions: Berg & Malik, CVPR 2005; Leordeanu & Hebert, ICCV 2005. Figure credit: Alex Berg
  • 206.
    Matching smoothness & local geometry • Compare “semi-local” features: consider configurations or neighborhoods and co-occurrence relationships • Correlograms of visual words [Savarese, Winn, & Criminisi, CVPR 2006] • Proximity distribution kernel [Ling & Soatto, ICCV 2007] • Hyperfeatures [Agarwal & Triggs, ECCV 2006] • Feature neighborhoods [Sivic & Zisserman, CVPR 2004] • Tiled neighborhood [Quack, Ferrari, Leibe, van Gool, ICCV 2007]
  • 207.
    Matching smoothness & local geometry • Learn or provide explicit object-specific shape model [Next in the tutorial: part-based models] [Figure: model with parts x1 … x6]
  • 208.
    Summary • Local features are a useful, flexible representation: invariance properties, typically built into the descriptor; distinctive, especially helpful for identifying specific textured objects; breaking image into regions/parts gives tolerance to occlusions and clutter; mapping to visual words forms discrete tokens from image regions • Efficient methods available for: indexing patches or regions; comparing distributions of visual words; matching features
  • 209.
    Outline • 1. Detection with Global Appearance & Sliding Windows • 2. Local Invariant Features: Detection & Description • 3. Specific Object Recognition with Local Features • (Coffee Break) • 4. Visual Words: Indexing, Bags of Words Categorization • 5. Matching Local Feature Sets • 6. Part-Based Models for Categorization • 7. Current Challenges and Research Directions
  • 210.
    Visual Object Recognition Tutorial • Bastian Leibe (Computer Vision Laboratory, ETH Zurich) & Kristen Grauman (Department of Computer Sciences, University of Texas in Austin) • Chicago, 14.07.2008
  • 211.
    Outline • 1. Detection with Global Appearance & Sliding Windows • 2. Local Invariant Features: Detection & Description • 3. Specific Object Recognition with Local Features • (Coffee Break) • 4. Visual Words: Indexing, Bags of Words Categorization • 5. Matching Local Features • 6. Part-Based Models for Categorization • 7. Current Challenges and Research Directions
  • 212.
    Recognition of Object Categories • We no longer have exact correspondences… • On a local level, we can still detect similar parts. • Represent objects by their parts ⇒ Bag-of-features • How can we improve on this? Encode structure. Slide credit: Rob Fergus
  • 213.
    Part-Based Models • Fischler & Elschlager 1973 • Model has two components: parts (2D image fragments) and structure (configuration of parts)
  • 214.
    Different Connectivity Structures [Figure from Carneiro & Lowe, ECCV’06; panels listed with recognition complexity where given] • O(N^6): Fergus et al. ’03, Fei-Fei et al. ’03 • O(N^2): Leibe et al. ’04, ’08, Crandall et al. ’05, Fergus et al. ’05 • O(N^3): Crandall et al. ’05 • O(N^2): Felzenszwalb & Huttenlocher ’05 • Further panels: Csurka ’04, Vasconcelos ’00; Bouchard & Triggs ’05; Carneiro & Lowe ’06
  • 215.
    Spatial Models Considered Here • Fully connected shape model, e.g. Constellation Model: parts fully connected; recognition complexity O(N^P); method: exhaustive search • “Star” shape model, e.g. ISM: parts mutually independent; recognition complexity O(NP); method: Gen. Hough Transform. Slide credit: Rob Fergus
  • 216.
    Constellation Model • Joint model for appearance and shape • Gaussian shape pdf • Gaussian part appearance pdf • Gaussian relative scale pdf (over log(scale)) • Prob. of detection (e.g. 0.8, 0.75, 0.9 per part)
  • 217.
    Constellation Model • Object model: Gaussian shape pdf; Gaussian part appearance pdf; Gaussian relative scale pdf (over log(scale)); prob. of detection (e.g. 0.8, 0.75, 0.9 per part) • Clutter model: uniform shape pdf; Gaussian appearance pdf; uniform relative scale pdf (over log(scale)); Poisson pdf on # detections
  • 218.
    Constellation Model: Learning Procedure • Goal: Find regions & their location, scale & appearance • Initialize model parameters • Use EM and iterate to convergence: E-step: compute assignments for which regions are foreground/background; M-step: update model parameters • Trying to maximize likelihood: consistency in shape & appearance
  • 219.
    Example: Motorbikes [figure]
  • 220.
    Example: Motorbikes (2) [figure]
  • 221.
    Example: Spotted Cats [figure]
  • 222.
    Discussion: Constellation Model • Advantages: works well for many different object categories; can adapt well to categories where shape is more important or appearance is more important; everything is learned from training data; weakly-supervised training possible • Disadvantages: model contains many parameters that need to be estimated; cost increases exponentially with increasing number of parameters ⇒ fully connected model restricted to small number of parts.
  • 223.
    Implicit Shape Model (ISM) • Basic ideas: learn an appearance codebook; learn a star-topology structural model (features are considered independent given obj. center) • Algorithm: probabilistic Gen. Hough Transform: exact correspondences → prob. match to object part; NN matching → soft matching; feature location on obj. → part location distribution; uniform votes → probabilistic vote weighting; quantized Hough array → continuous Hough space
  • 224.
    Codebook Representation • Extraction of local object features: interest points (e.g. Harris detector); sparse representation of the object appearance • Collect features from whole training set • Example: [figure]
  • 225.
    Gen. Hough Transform with Local Features • For every feature, store possible “occurrences”: object identity, pose, relative position • For new image, let the matched features vote for possible object positions
  • 226.
    Implicit Shape Model - Representation • Pipeline: training images (+ reference segmentation) → appearance codebook → spatial occurrence distributions (x, y, s) + local figure-ground labels • Learn appearance codebook: extract local features at interest points; agglomerative clustering ⇒ codebook • Learn spatial distributions: match codebook to training images; record matching positions on object. B. Leibe
  • 227.
    Implicit Shape Model - Recognition • Pipeline: interest points → matched codebook entries → probabilistic voting in a continuous 3D voting space (x, y, s) • Image feature f → interpretation (codebook match) C_i with probability $p(C_i \mid f)$ → object position (o, x) with probability $p(o_n, x \mid C_i, \ell)$ • $p(o_n, x \mid f, \ell) = \sum_i p(C_i \mid f)\, p(o_n, x \mid C_i, \ell)$ [Leibe04, Leibe08]
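The voting sum above can be sketched as follows. Each feature contributes, for every codebook entry it matches, the product of the match probability and the stored occurrence probability at the implied object center. For brevity the sketch accumulates votes in a binned 2-D array and ignores scale, whereas the slides describe a continuous 3D voting space; the data layout, the probabilities, and the name `hough_votes` are illustrative assumptions.

```python
from collections import defaultdict


def hough_votes(features, occurrences, bin_size=10.0):
    """Accumulate p(C_i | f) * p(o, x | C_i, l) at each voted object center.

    features:    list of ((fx, fy), [(codebook_id, p_match), ...])
    occurrences: codebook_id -> [((dx, dy), p_occurrence), ...], where
                 (dx, dy) is the offset from the feature to the object center.
    """
    accumulator = defaultdict(float)
    for (fx, fy), matches in features:
        for codebook_id, p_match in matches:
            for (dx, dy), p_occurrence in occurrences[codebook_id]:
                center = (fx + dx, fy + dy)
                cell = (int(center[0] // bin_size), int(center[1] // bin_size))
                accumulator[cell] += p_match * p_occurrence
    return accumulator


# Toy codebook: a "wheel" may sit left or right of the center, a "head" above it.
occurrences = {
    "wheel": [((20, -10), 0.5), ((-20, -10), 0.5)],
    "head": [((0, 30), 1.0)],
}
features = [
    ((30, 60), [("wheel", 1.0)]),
    ((70, 60), [("wheel", 1.0)]),
    ((50, 20), [("head", 0.8)]),
]
votes = hough_votes(features, occurrences)
best_cell = max(votes, key=votes.get)  # (5, 5): all three features agree there
```

The two wheels each also cast a stray vote to one side, but only the cell around (50, 50) collects support from all three features.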
  • 228.
    Implicit Shape Model - Recognition • Pipeline: interest points → matched codebook entries → probabilistic voting in a continuous 3D voting space (x, y, s) → backprojection of maxima → backprojected hypotheses [Leibe04, Leibe08]
  • 229.
    Example: Results on Cows [Figure: original image]
  • 230.
    Example: Results on Cows [Figure: original image, interest points]
  • 231.
    Example: Results on Cows [Figure: original image, interest points, matched patches]
  • 232.
    Example: Results on Cows [Figure: original image, interest points, matched patches, prob. votes]
  • 233.
    Example: Results on Cows [Figure: 1st hypothesis]
  • 234.
    Example: Results on Cows [Figure: 2nd hypothesis]
  • 235.
    Example: Results on Cows [Figure: 3rd hypothesis]
  • 236.
    Scale Invariant Voting • Scale-invariant feature selection: scale-invariant interest points; rescale extracted patches; match to constant-size codebook • Generate scale votes: scale as 3rd dimension in voting space; search for maxima in 3D voting space (x, y, s) with a search window
  • 237.
    Scale Voting: Efficient Computation • Steps in the (x, y, s) voting space: scale votes → binned accum. array → candidate maxima → refinement (MSME) • Mean-Shift formulation for refinement: scale-adaptive balloon density estimator
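The refinement step can be illustrated with a plain mean-shift iteration over weighted votes in (x, y, s). This sketch uses a fixed-radius uniform kernel, which is a simplification: the slide names a scale-adaptive balloon density estimator, whose window grows with scale. The function name and the toy votes are hypothetical.

```python
import math


def mean_shift(votes, start, radius, max_iterations=50, tolerance=1e-6):
    """Move a candidate maximum to the weighted mean of the votes around it.

    votes: list of (point, weight), with each point a tuple such as (x, y, s).
    """
    position = tuple(float(c) for c in start)
    for _ in range(max_iterations):
        nearby = [(p, w) for p, w in votes if math.dist(p, position) <= radius]
        total = sum(w for _, w in nearby)
        if total == 0:
            break  # no support inside the window; keep the current position
        shifted = tuple(
            sum(w * p[k] for p, w in nearby) / total for k in range(len(position))
        )
        converged = math.dist(shifted, position) < tolerance
        position = shifted
        if converged:
            break
    return position


votes = [
    ((10, 10, 1.0), 1.0),
    ((12, 10, 1.0), 1.0),
    ((11, 13, 1.0), 1.0),
    ((50, 50, 2.0), 1.0),  # an unrelated vote far from the candidate
]
refined = mean_shift(votes, start=(9, 9, 1.0), radius=6.0)  # (11.0, 11.0, 1.0)
```

Starting from a coarse candidate taken from the binned array, the iteration settles on the local center of mass of the continuous votes.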
  • 238.
    Detection Results • Qualitative performance: recognizes different kinds of objects; robust to clutter, occlusion, noise, low contrast
  • 239.
    Figure-Ground Segregation • Problem extensively studied in Psychophysics • Experiments with ambiguous figure-ground stimuli • Results: evidence that object recognition can and does operate before figure-ground organization; interpreted as Gestalt cue familiarity. M.A. Peterson, “Object Recognition Processes Can and Do Operate Before Figure-Ground Organization”, Cur. Dir. in Psych. Sc., 3:105-111, 1994.
  • 240.
    ISM – Top-Down Segmentation • Pipeline: interest points → matched codebook entries → probabilistic voting in a continuous 3D voting space (x, y, s) → backprojection of maxima → backprojected hypotheses → segmentation with p(figure) probabilities [Leibe04, Leibe08]
  • 241.
    Segmentation: Probabilistic Formulation • Influence of patch on object hypothesis (vote weight): $p(f, \ell \mid o_n, x) = \frac{\sum_i p(o_n, x \mid C_i, \ell)\, p(C_i \mid f)\, p(f, \ell)}{p(o_n, x)}$ • Backprojection to features f and pixels p: $p(\mathbf{p} = \text{figure} \mid o_n, x) = \sum_{\mathbf{p} \in (f, \ell)} p(\mathbf{p} = \text{figure} \mid f, \ell, o_n, x)\, p(f, \ell \mid o_n, x)$, where the first factor is the segmentation information and the second is the influence on the object hypothesis [Leibe04, Leibe08]
  • 242.
    Segmentation [Figure: original image, segmentation, p(figure), p(ground)] • Interpretation of p(figure) map: per-pixel confidence in object hypothesis; use for hypothesis verification [Leibe04, Leibe08]
  • 243.
    Example Results: Motorbikes [figure]
  • 244.
    Example Results: Cows • Training: 112 hand-segmented images • Results on novel sequences: single-frame recognition, no temporal continuity used! [Leibe04, Leibe08]
  • 245.
    Example Results: Chairs [Figure: office chairs, dining room chairs]. B. Leibe
  • 246.
    Inferring Other Information: Part Labels [Figure: training, test, output] [Thomas07]
  • 247.
    Inferring Other Information: Part Labels (2) [figure] [Thomas07]
  • 248.
    Inferring Other Information: Depth Maps • “Depth from a single image” [Thomas07]
  • 249.
    Application for Pedestrian Detection • Estimating Articulation [Leibe, Seemann, Schiele, CVPR’05] • Rotation-Invariant Detection [Mikolajczyk, Leibe, Schiele, CVPR’06]. B. Leibe
  • 250.
    Outline • 1. Detection with Global Appearance & Sliding Windows • 2. Local Invariant Features: Detection & Description • 3. Specific Object Recognition with Local Features • (Coffee Break) • 4. Visual Words: Indexing, Bags of Words Categorization • 5. Matching Local Features • 6. Part-Based Models for Categorization • 7. Current Challenges and Research Directions
  • 251.
    Visual Object Recognition Tutorial • Bastian Leibe (Computer Vision Laboratory, ETH Zurich) & Kristen Grauman (Department of Computer Sciences, University of Texas in Austin) • Chicago, 14.07.2008
  • 252.
    Outline 1. Detection with Global Appearance & Sliding Windows 2. Local Invariant Features: Detection & Description 3. Specific Object Recognition with Local Features - Coffee Break - 4. Visual Words: Indexing, Bags of Words Categorization 5. Matching Local Features 6. Part-Based Models for Categorization 7. Current Challenges and Research Directions (highlight of some research topics not covered in the main tutorial)
  • 253.
    Benchmark Data • What degree of difficulty do current datasets have?
  • 254.
    Example: Caltech-101 • A dataset that has been just about mastered… • Images from Caltech-101: 101-way multi-class classification problem
  • 255.
    Example: Caltech-256 • Images from Caltech-256: 256-way multi-class recognition problem
  • 256.
    Example: Pascal Visual Object Classes Challenge • Pascal VOC 2007: Binary detection problems • http://pascallin.ecs.soton.ac.uk/challenges/VOC/
  • 257.
    Example: LabelMe • http://labelme.csail.mit.edu/
  • 258.
    Current challenges & ongoing research • Multi-cue integration • Finer level categorization • View invariant recognition • Unsupervised category discovery • Learning from noisily labeled images • Integration of segmentation and recognition • Learning with text and images/video • Use of video • Context and scene layout
  • 259.
    Multi-cue integration • Single cues often not sufficient. • Integrate multiple local and global cues.
  • 260.
    Multi-Category Discrimination • Distinguish similar categories. • Need to look at specific details!
  • 261.
    Multi-Aspect Recognition • Detectors for different viewpoints ⇒ How can this be improved?
  • 262.
    Multi-Aspect Recognition [Hoiem, Rother, Winn, CVPR’07] [Thomas et al., CVPR’06]
  • 263.
    Multi-Aspect Recognition [Rothganger et al., CVPR’03] [Savarese & Fei-Fei, ICCV’07]
  • 264.
    Unsupervised, semi-supervised category discovery • Topic models for images: Probabilistic Latent Semantic Analysis (pLSA), Latent Dirichlet Allocation (LDA) • Example topics: “face”, “beach” (graphical-model figure with labels c, π, z, w, D, N) • Sivic et al. ICCV 2005, Fei-Fei et al. ICCV 2005 • Figure credit: Fei-Fei Li
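For orientation, the standard pLSA decomposition can be written as below. It is not shown on the slide; the symbols are assumptions following the usual bag-of-visual-words reading, with d an image, w a visual word and z a latent topic:

```latex
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)
```

Each image is thus modelled as a mixture of topics, and each topic as a distribution over visual words; discovered topics such as “face” or “beach” correspond to values of z.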
  • 265.
    Unsupervised, semi-supervised category discovery • Clustering cluttered images • Learning from noisy keyword-based image search results • Grauman & Darrell, CVPR 2006 • Fergus et al. ECCV 2004, ICCV 2005 • Li & Fei-Fei, CVPR 2007
  • 266.
    Learning with text and images/video • Barnard et al. JMLR 2003 • Berg, Berg, Edwards, & Forsyth, NIPS 2006 • Gupta et al. ECML 2008
  • 267.
    Integrating segmentation + recognition • Borenstein & Ullman, ECCV 2002 • Kumar et al. CVPR 2005 • Kannan, Winn, & Rother, NIPS 2006 • Tu, Chen, Yuille, Zhu, ICCV 2003
  • 268.
    Role of context, understanding scene layout • Antonio Torralba, IJCV 2003
  • 269.
    Role of context, understanding scene layout (figure panels: Image, World) • Hoiem, Efros, & Hebert, CVPR 2006
  • 270.
    Integration with Scene Geometry • Goal: Find the ground plane • Restrict object location • Assume Gaussian size prior ⇒ Significantly reduced search space (figure: search corridor in the Hough volume, axes x, y, s; inputs: Structure-from-Motion, Dense stereo)
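The ground-plane constraint and Gaussian size prior above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a pinhole camera with a horizontal optical axis at a known height above a flat ground plane, and the function names, the horizon row, and the pedestrian height statistics (mean 1.7 m, std 0.085 m) are hypothetical choices for illustration only.

```python
import math


def implied_height(foot_row, pixel_height, horizon_row, cam_height):
    """Real-world height (m) implied by a box whose base rests on the ground.

    Assumption: pinhole camera with horizontal optical axis, so the ground
    plane's horizon maps to image row `horizon_row` (rows grow downward).
    The focal length cancels out of the ratio.
    """
    rows_below_horizon = foot_row - horizon_row
    if rows_below_horizon <= 0:
        return None  # base at or above the horizon: not on the ground plane
    return pixel_height * cam_height / rows_below_horizon


def size_log_prior(height, mean=1.7, std=0.085):
    """Log-density of a Gaussian prior over real-world object height."""
    z = (height - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def is_plausible(foot_row, pixel_height, horizon_row, cam_height,
                 mean=1.7, std=0.085, max_sigmas=3.0):
    """Keep only hypotheses whose implied height lies near the prior mean."""
    height = implied_height(foot_row, pixel_height, horizon_row, cam_height)
    if height is None:
        return False
    return abs(height - mean) <= max_sigmas * std
```

Under these assumptions, each image row admits only a narrow band of object scales, which is one way to read the "search corridor" in the (x, y, s) volume: hypotheses outside the band need not be evaluated at all.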
  • 271.
    Extensions • Combination with 3D Geometry [Leibe, Cornelis, Cornelis, Van Gool, CVPR’07] • Mobile Pedestrian Detection [Ess, Leibe, Van Gool, ICCV’07]
  • 272.
    Detections Using Ground Plane Constraints • left camera, 1175 frames [Leibe et al. CVPR’07]
  • 273.
    Extensions: Tracking-by-Detection • Spacetime trajectory analysis • Link up detections to form physically plausible ST trajectories • Select set of ST trajectories that best explain the data [Leibe et al. CVPR’07]
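The linking step can be sketched as follows. This is a deliberately simplified stand-in: it uses greedy nearest-neighbour association with a maximum-displacement gate as the "physically plausible" test, and it does not implement the trajectory selection step of the cited work. The function name, the `max_step` parameter and the ground-plane (x, y) representation are assumptions made for illustration.

```python
import math


def link_detections(frames, max_step):
    """Greedily link per-frame detections into space-time trajectories.

    frames:   list with one entry per frame, each a list of (x, y) positions.
    max_step: largest displacement between consecutive frames that is still
              considered physically plausible (hypothetical gate).
    Returns a list of trajectories, each a list of (frame_index, (x, y)).
    """
    trajectories = []
    for t, detections in enumerate(frames):
        unused = list(detections)
        # Try to extend every trajectory that ended in the previous frame.
        for traj in trajectories:
            last_t, last_pos = traj[-1]
            if last_t != t - 1 or not unused:
                continue
            nearest = min(unused, key=lambda p: math.dist(p, last_pos))
            if math.dist(nearest, last_pos) <= max_step:
                traj.append((t, nearest))
                unused.remove(nearest)
        # Every detection that no trajectory explains starts a new one.
        for pos in unused:
            trajectories.append([(t, pos)])
    return trajectories
```

A greedy pass like this commits to the first plausible match and cannot recover from a wrong link; selecting among competing trajectory hypotheses, as the slide describes, is what addresses that weakness.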
  • 274.
    Dynamic Scene Analysis Results [Leibe et al. CVPR’07]
  • 275.
    Extensions (2) • Combination with 3D Reconstruction [Cornelis, Leibe, Cornelis, Van Gool, 3DPVT’06]
  • 276.
    Textured 3D Model (figure panels: Original, 3D Reconstruction) • Run-times: SfM + Bundle adjustment: 27-30 fps on CPU; Dense reconstruction: 36 fps on GPU [Cornelis, Cornelis, Van Gool, CVPR’06]
  • 277.
    Improved 3D City Model • Enhancing your driving experience… (figure panels: Original, 3D Reconstruction) [Cornelis, Leibe, Cornelis, Van Gool, 3DPVT’06]
  • 278.
    Putting It All Together… (figure)
  • 279.
    Mobile Pedestrian Tracking [Ess, Leibe, Schindler, Van Gool, CVPR’08]
  • 280.
    Mobile Tracking Through Crowds [Ess, Leibe, Schindler, Van Gool, CVPR’08]
  • 281.
    Extension: Recovering Articulations • Idea: Only perform articulated tracking where it’s easy! • Multi-person tracking: Solves hard data association problem • Articulated tracking: Only on individual “tracklets” between occlusions [Gammeter, Ess, Jaeggli, Schindler, Leibe, Van Gool, ECCV’08]
  • 282.
    Articulated Multi-Person Tracking • Multi-person tracking: Recovers trajectories and solves data association; Estimates 3D walking direction and speed; Detects occlusion events [Gammeter, Ess, Jaeggli, Schindler, Leibe, Van Gool, ECCV’08]
  • 283.
    Articulated Tracking under Egomotion [Gammeter, Ess, Jaeggli, Schindler, Leibe, Van Gool, ECCV’08]
  • 285.
    Summary • Visual recognition is a challenging and very active research area. • We’ve covered some basic models and representations that have been shown to be effective, and highlighted some ongoing issues. • See tutorial website for slides, links, references: http://www.vision.ee.ethz.ch/~bleibe/teaching/tutorial-aaai08/ • Thank you!