AAAI08 tutorial: visual object recognition
 



Presentation Transcript

    • Visual Object Recognition Tutorial. Bastian Leibe (Computer Vision Laboratory, ETH Zurich) and Kristen Grauman (Department of Computer Sciences, University of Texas at Austin). Chicago, 14.07.2008.
    • Identification vs. Categorization.
    • Object Categorization: how to recognize ANY car, how to recognize ANY cow.
    • What could be done with recognition algorithms? There is a wide range of applications, including autonomous robots, navigation and driver safety, situated search, content-based retrieval and analysis for images and videos, and medical image analysis.
    • Object Categorization. Task description: "Given a small number of training images of a category, recognize a-priori unknown instances of that category and assign the correct category label." Which categories are feasible visually? This has been extensively studied in cognitive psychology, e.g. [Brown '58]; an example category hierarchy is living being → animal → dog → German shepherd → "Fido".
    • Visual Object Categories. Basic-level categories in human categorization [Rosch '76, Lakoff '87]: the highest level at which category members have similar perceived shape; the highest level at which a single mental image reflects the entire category; the level at which human subjects are usually fastest at identifying category members; the first level named and understood by children; the highest level at which a person uses similar motor actions for interacting with category members.
    • Visual Object Categories. Basic-level categories in humans seem to be defined predominantly visually, and there is evidence that humans (usually) start with basic-level categorization before doing identification; basic-level categorization is easier and faster for humans than identification. ⇒ It is the most promising starting point for visual classification. Hierarchy: abstract levels (animal, quadruped, …) → basic level (dog, cat, cow) → individual level (German shepherd, Doberman, "Fido").
    • Other Types of Categories. Functional categories, e.g. chairs = "something you can sit on".
    • Other Types of Categories. Ad-hoc categories, e.g. "something you can find in an office environment".
    • Levels of Object Categorization ("cow", "car", "motorbike"). Different levels of recognition: Which object class is in the image? ⇒ object/image classification. Where is it in the image? ⇒ detection/localization. Where exactly, i.e. which pixels? ⇒ figure/ground segmentation.
    • Challenges: robustness. Illumination, object pose, clutter, occlusions, intra-class appearance variation, viewpoint.
    • Challenges: robustness. Detection in crowded scenes: learn object variability (changes in appearance, scale, and articulation) and compensate for clutter, overlap, and occlusion.
    • Challenges: context and human experience.
    • Challenges: context and human experience. Context cues; dynamics. (Image credit: D. Hoiem; video credit: J. Davis)
    • Challenges: scale, efficiency. Thousands to millions of pixels in an image; an estimated 30 gigapixels of image/video content generated per second; about half of the cerebral cortex in primates is devoted to processing visual information [Felleman and Van Essen 1991]; 3,000-30,000 human-recognizable object categories; 30+ degrees of freedom in the pose of articulated objects (humans); billions of images indexed by Google Image Search; 18 billion+ prints produced from digital camera images in 2004; 295.5 million camera phones sold in 2005.
    • Challenges: learning with minimal supervision (ranging from more to less supervision).
    • Rough evolution of focus in recognition research: 1980s → 1990s to early 2000s → currently.
    • This tutorial. Intended for a broad AAAI audience: basic familiarity with machine learning, linear algebra, and probability is assumed, but no significant vision background. Our goals: describe the main approaches to recognition, highlight past successes and future challenges, and provide pointers (to literature and tools) that would allow you to take advantage of existing techniques in your research. Questions welcome.
    • Outline: 1. Detection with Global Appearance & Sliding Windows; 2. Local Invariant Features: Detection & Description; 3. Specific Object Recognition with Local Features; ― Coffee Break ―; 4. Visual Words: Indexing, Bags of Words Categorization; 5. Matching Local Features; 6. Part-Based Models for Categorization; 7. Current Challenges and Research Directions.
    • ― Part 1: Detection with Global Appearance & Sliding Windows (title and outline slides repeated) ―
    • Detection via classification: main idea. The basic component is a binary classifier (car / non-car): given a window, the classifier outputs "yes, a car" or "no, not a car".
    • Detection via classification: main idea. If the object may be in a cluttered scene, slide a window around the image looking for it, applying the car/non-car classifier at each location.
    • Detection via classification: main idea. Fleshing out this pipeline a bit more, we need to: 1. obtain training data; 2. define features; 3. define a classifier. Training examples → feature extraction → car/non-car classifier.
    • Detection via classification: main idea. Consider all subwindows in an image, sampled at multiple scales and positions, and make a decision per window: "Does this contain object category X or not?" In this section, we focus specifically on methods using a global representation (i.e., not part-based, not local features).
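To make the window-scanning protocol above concrete, here is a minimal NumPy sketch. It is not code from the tutorial: `score_window` is a hypothetical placeholder for any trained binary classifier over the features discussed next, and in practice one usually rescales the image rather than the window so that a fixed-size classifier can be reused.

```python
import numpy as np

def sliding_window_detect(image, score_window, win_sizes=((64, 64), (96, 96)),
                          step=8, thresh=0.0):
    """Scan an image with windows of several sizes and positions.

    score_window(patch) is assumed to return a real-valued confidence that the
    patch contains the target category; hits above `thresh` are kept.
    """
    H, W = image.shape[:2]
    detections = []
    for wh, ww in win_sizes:
        for y in range(0, H - wh + 1, step):
            for x in range(0, W - ww + 1, step):
                score = score_window(image[y:y + wh, x:x + ww])
                if score > thresh:
                    detections.append((x, y, ww, wh, score))
    return detections
```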
    • Feature extraction: global appearance. Simple holistic descriptions of image content, e.g. a grayscale/color histogram or a vector of pixel intensities.
    • Eigenfaces: global appearance description. An early appearance-based approach to face recognition: generate a low-dimensional representation of appearance with a linear subspace, using eigenvectors computed from the covariance matrix of the training images. Project new images into the "face space" (mean image plus a weighted combination of eigenvectors); recognition is via nearest neighbors in face space. Turk & Pentland, 1991.
    • Feature extraction: global appearance. Pixel-based representations are sensitive to small shifts, and color- or grayscale-based appearance descriptions can be sensitive to illumination and intra-class appearance variation (cartoon example: an albino koala).
    • Gradient-based representations. Consider edges, contours, and (oriented) intensity gradients.
    • Gradient-based representations: matching edge templates. Example: Chamfer matching. Pipeline: input image → edges → distance transform → template shape → best detected match. At each window position, compute the average minimum distance between points on the template (T) and the input (I). Gavrila & Philomin, ICCV 1999.
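A rough sketch of the chamfer-matching idea (not Gavrila & Philomin's implementation): precompute a distance transform of the image edge map, then slide the template and average the distances under its edge points. SciPy's exact Euclidean distance transform stands in for the chamfer approximation; the function and argument names are illustrative.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_match(edge_image, template_points, step=4):
    """Slide an edge template over a binary edge image; return (row, col, score)
    of the best window, where score is the average distance from each template
    edge point to the nearest image edge (lower is better).
    template_points: (N, 2) array of (row, col) offsets within the template."""
    # Distance transform: for every pixel, distance to the nearest edge pixel.
    dist = distance_transform_edt(edge_image == 0)
    h = int(template_points[:, 0].max()) + 1
    w = int(template_points[:, 1].max()) + 1
    best = None
    for r in range(0, edge_image.shape[0] - h + 1, step):
        for c in range(0, edge_image.shape[1] - w + 1, step):
            score = dist[template_points[:, 0] + r, template_points[:, 1] + c].mean()
            if best is None or score < best[2]:
                best = (r, c, score)
    return best
```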
    • Gradient-based representations: matching edge templates. Chamfer matching with a hierarchy of templates. Gavrila & Philomin, ICCV 1999.
    • Gradient-based representations. Consider edges, contours, and (oriented) intensity gradients, and summarize the local distribution of gradients with a histogram. This is locally orderless, offering invariance to small shifts and rotations; contrast normalization attempts to correct for variable illumination.
    • Gradient-based representations: Histograms of Oriented Gradients (HoG). Map each grid cell in the input window to a histogram counting the gradients per orientation. Code available: http://pascal.inrialpes.fr/soft/olt/ Dalal & Triggs, CVPR 2005.
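As an illustration of the idea only (a simplification of the Dalal & Triggs descriptor that omits their block normalization and other refinements), the sketch below bins gradient orientations per cell, weighted by gradient magnitude.

```python
import numpy as np

def hog_descriptor(window, cell=8, n_bins=9):
    """Simplified HoG: per-cell histograms of gradient orientation, weighted by
    gradient magnitude. `window` is a 2-D grayscale array whose sides are
    multiples of `cell`."""
    gy, gx = np.gradient(window.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0           # unsigned orientation
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)

    H, W = window.shape
    feats = []
    for r in range(0, H, cell):
        for c in range(0, W, cell):
            b = bins[r:r + cell, c:c + cell].ravel()
            m = mag[r:r + cell, c:c + cell].ravel()
            feats.append(np.bincount(b, weights=m, minlength=n_bins))
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-9)                  # crude contrast normalization
```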
    • Gradient-based representations: SIFT descriptor. A local patch descriptor (more on this later). Code: http://vision.ucla.edu/~vedaldi/code/sift/sift.html Binary: http://www.cs.ubc.ca/~lowe/keypoints/ Lowe, ICCV 1999.
    • Gradient-based representations: biologically inspired features. Convolve with Gabor filters at multiple orientations and pool nearby units (max); intermediate layers compare the input to prototype patches. Serre, Wolf & Poggio, CVPR 2005; Mutch & Lowe, CVPR 2006.
    • Gradient-based representations: rectangular features. Compute differences between sums of pixels in rectangles; this captures contrast in adjacent spatial regions, is similar to Haar wavelets, and is efficient to compute. Viola & Jones, CVPR 2001.
    • Gradient-based representations: shape context descriptor. Count the number of points inside each bin of a log-polar grid (e.g. count = 4, count = 10). Log-polar binning gives more precision for nearby points and more flexibility for farther points. A local descriptor (more on this later). Belongie, Malik & Puzicha, ICCV 2001.
    • Classifier construction. How do we compute a decision for each subwindow, given its image feature?
    • Discriminative vs. generative models. Generative: separately model the class-conditional and prior densities (Pr(image, car) and Pr(image, ¬car) over the image feature). Discriminative: directly model the posterior (Pr(car | image) and Pr(¬car | image)). Plots from Antonio Torralba, 2007.
    • Discriminative vs. generative models. Generative: + possibly interpretable, + can draw samples; - models variability that is unimportant to the classification task, - often hard to build a good model with few parameters. Discriminative: + appealing when it is infeasible to model the data itself, + excels in practice; - often cannot provide uncertainty in predictions, - not interpretable.
    • Discriminative methods. Nearest neighbor (e.g. with 10^6 examples; Shakhnarovich, Viola & Darrell 2003; Berg, Berg & Malik 2005; …), neural networks (LeCun, Bottou, Bengio & Haffner 1998; Rowley, Baluja & Kanade 1998; …), support vector machines (Guyon & Vapnik; Heisele, Serre & Poggio 2001; …), boosting (Viola & Jones 2001; Torralba et al. 2004; Opelt et al. 2006; …), and conditional random fields (McCallum, Freitag & Pereira 2000; Kumar & Hebert 2003; …). Slide adapted from Antonio Torralba.
    • Boosting. Build a strong classifier by combining a number of "weak classifiers", which need only be better than chance. Sequential learning process: at each iteration, add a weak classifier. Flexible in the choice of weak learner, including fast, simple classifiers that alone may be inaccurate. We will look at Freund & Schapire's AdaBoost algorithm: it is easy to implement and is the base learning algorithm for the Viola-Jones face detector.
    • AdaBoost: intuition. Consider a 2-D feature space with positive and negative examples. Each weak classifier splits the training examples with at least 50% accuracy. Examples misclassified by a previous weak learner are given more emphasis in future rounds. Figure adapted from Freund & Schapire.
    • AdaBoost: intuition (continued).
    • AdaBoost: intuition. The final classifier is a combination of the weak classifiers.
    • AdaBoost algorithm. Start with uniform weights on the training examples {x1, …, xn}. At each round, evaluate the weighted error for each feature and pick the best; incorrectly classified examples then receive more weight, correctly classified examples less. The final classifier is a combination of the weak classifiers, weighted according to the errors they had. Freund & Schapire, 1995.
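A minimal AdaBoost sketch following the scheme on this slide, using brute-force threshold "stumps" as weak learners. It is illustrative only (far slower than the Viola-Jones training procedure), and all function names are ours, not the authors'.

```python
import numpy as np

def adaboost_train(X, y, n_rounds=50):
    """Discrete AdaBoost with threshold ('stump') weak learners.
    X: (n, d) features; y: labels in {-1, +1}.
    Returns a list of (feature_index, threshold, polarity, alpha) weak classifiers."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                      # start with uniform example weights
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):                       # brute-force search for the best stump
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                    err = w[pred != y].sum()     # weighted error of this stump
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this weak classifier
        w *= np.exp(-alpha * y * pred)           # mistakes up-weighted, correct down-weighted
        w /= w.sum()
        ensemble.append((j, thr, pol, alpha))
    return ensemble

def adaboost_predict(ensemble, X):
    """Final classifier: sign of the alpha-weighted sum of weak classifiers."""
    score = np.zeros(len(X))
    for j, thr, pol, alpha in ensemble:
        score += alpha * np.where(pol * (X[:, j] - thr) > 0, 1, -1)
    return np.sign(score)
```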
    • Cascading classifiers for detection. For efficiency, apply less accurate but faster classifiers first to immediately discard windows that clearly appear to be negative: filter for promising regions with an initial inexpensive classifier, and build a chain of classifiers, choosing cheap ones with low false negative rates early in the chain. Fleuret & Geman, IJCV 2001; Rowley et al., PAMI 1998; Viola & Jones, CVPR 2001. Figure from Viola & Jones, CVPR 2001.
    • Example: face detection. Frontal faces are a good example of a class where global appearance models plus a sliding-window detection approach fit well: regular 2-D structure, with the center of the face almost shaped like a "patch"/window. Now we will take AdaBoost and see how the Viola-Jones face detector works.
    • Feature extraction. "Rectangular" filters: the feature output is the difference between sums of pixels in adjacent regions. Efficiently computable with the integral image, whose value at (x, y) is the sum of all pixels above and to the left of (x, y): any rectangle sum can then be computed in constant time. Avoid scaling images by scaling the features directly, for the same cost. Viola & Jones, CVPR 2001.
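The integral-image trick can be sketched in a few lines: after one cumulative-sum pass, any rectangle sum (and hence any rectangular feature) costs four array lookups. This is a generic sketch, not OpenCV's or the original implementation, and the two-rectangle feature at the end is just one illustrative Haar-like filter.

```python
import numpy as np

def integral_image(img):
    """Zero-padded cumulative sums so that ii[r, c] = sum of img[:r, :c]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) using four lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def two_rect_feature(ii, r, c, h, w):
    """Example Haar-like feature: left half minus right half of an (h, 2w) box."""
    left = box_sum(ii, r, c, r + h, c + w)
    right = box_sum(ii, r, c + w, r + h, c + 2 * w)
    return left - right
```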
    • Large library of filters. Considering all possible filter parameters (position, scale, and type), there are 180,000+ possible features associated with each 24 x 24 window. Use AdaBoost both to select the informative features and to form the classifier. Viola & Jones, CVPR 2001.
    • AdaBoost for feature + classifier selection. Select the single rectangle feature and threshold that best separate the positive (face) and negative (non-face) training examples in terms of weighted error; that feature/threshold pair is the resulting weak classifier (illustrated by the outputs of a possible rectangle feature on faces and non-faces). For the next round, reweight the examples according to their errors and choose another filter/threshold combination. Viola & Jones, CVPR 2001.
    • Viola-Jones face detector: summary. Train a cascade of classifiers with AdaBoost; apply the selected features, thresholds, and weights to each subwindow of a new image to separate faces from non-faces. Trained with 5K positives and 350M negatives; real-time detection using a 38-layer cascade with 6061 features in the final layer. [Implementation available in OpenCV: http://www.intel.com/technology/computing/opencv/]
    • Viola-Jones face detector: results. The first two features selected.
    • Viola-Jones face detector: results (further example detections).
    • Profile features. Detecting profile faces requires training a separate detector with profile examples.
    • Viola-Jones face detector: results. (Slide credit: Paul Viola, ICCV tutorial)
    • Example application. Frontal faces are detected and then tracked, and character names are inferred by aligning the script and subtitles. Everingham, M., Sivic, J. and Zisserman, A., "Hello! My name is... Buffy" - Automatic naming of characters in TV video, BMVC 2006. http://www.robots.ox.ac.uk/~vgg/research/nface/index.html
    • Pedestrian detection. Detecting upright, walking humans is also possible using a sliding window over appearance/texture, e.g. an SVM with Haar wavelets [Papageorgiou & Poggio, IJCV 2000], space-time rectangle features [Viola, Jones & Snow, ICCV 2003], or an SVM with HoGs [Dalal & Triggs, CVPR 2005].
    • Highlights. Sliding-window detection with global appearance descriptors: a simple detection protocol to implement; good feature choices are critical; past successes for certain classes.
    • Limitations. High computational complexity: for example, 250,000 locations x 30 orientations x 4 scales = 30,000,000 evaluations! If binary detectors are trained independently, the cost increases linearly with the number of classes. And with so many windows, the false positive rate had better be low.
    • Limitations (continued). Not all objects are "box" shaped.
    • Limitations (continued). Non-rigid, deformable objects are not captured well by representations assuming a fixed 2-D structure, or else a fixed viewpoint must be assumed. Objects with less regular textures are not captured well by holistic appearance-based descriptions.
    • Limitations (continued). If windows are considered in isolation, context is lost (the sliding window vs. the detector's view). Figure credit: Derek Hoiem.
    • Limitations (continued). In practice, this often entails a large, cropped training set (expensive), and requiring a good match to a global appearance description can lead to sensitivity to partial occlusions. Image credit: Adam, Rivlin & Shimshoni.
    • ― Part 2: Local Invariant Features: Detection & Description (title and outline slides repeated) ―
    • Motivation. Global representations have major limitations; instead, describe and match only local regions. This gives increased robustness to occlusions, articulation, and intra-category variations.
    • Approach. 1. Find a set of distinctive keypoints. 2. Define a region around each keypoint. 3. Extract and normalize the region content. 4. Compute a local descriptor from the normalized region (e.g. based on color). 5. Match local descriptors with a similarity measure, accepting a match when d(fA, fB) < T.
    • Requirements. Region extraction needs to be repeatable and precise under translation, rotation, and scale changes, limited out-of-plane (≈ affine) transformations, and lighting variations. We need a sufficient number of regions to cover the object, and the regions should contain "interesting" structure.
    • Many existing detectors available. Hessian & Harris [Beaudet '78], [Harris '88]; Laplacian, DoG [Lindeberg '98], [Lowe 1999]; Harris-/Hessian-Laplace [Mikolajczyk & Schmid '01]; Harris-/Hessian-Affine [Mikolajczyk & Schmid '04]; EBR and IBR [Tuytelaars & Van Gool '04]; MSER [Matas '02]; salient regions [Kadir & Brady '01]; and others.
    • Keypoint localization. Goals: repeatable detection, precise localization, interesting content. ⇒ Look for two-dimensional signal changes.
    • Hessian detector [Beaudet '78]. Based on the determinant of the Hessian matrix, Hessian(I) = [Ixx Ixy; Ixy Iyy]. Intuition: search for strong second derivatives in two orthogonal directions.
    • Hessian detector [Beaudet '78]. det(Hessian(I)) = Ixx Iyy - Ixy^2. In Matlab: Ixx .* Iyy - (Ixy).^2.
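The Python/SciPy equivalent of the Matlab one-liner above, with the second derivatives taken as Gaussian derivatives at a chosen scale (the `sigma` parameter is our assumption, not taken from the slide); keypoints would be local maxima of the returned response map.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_response(img, sigma=2.0):
    """Determinant-of-Hessian response map (the slide's Ixx.*Iyy - Ixy.^2)."""
    img = img.astype(float)
    Ixx = gaussian_filter(img, sigma, order=(0, 2))   # d^2/dx^2 (x = columns)
    Iyy = gaussian_filter(img, sigma, order=(2, 0))   # d^2/dy^2 (y = rows)
    Ixy = gaussian_filter(img, sigma, order=(1, 1))
    return Ixx * Iyy - Ixy ** 2
```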
    • Hessian detector - responses [Beaudet '78]. Effect: responses mainly on corners and strongly textured areas.
    • Hessian detector - responses [Beaudet '78] (example image).
    • Harris detector [Harris '88]. Based on the second moment matrix (autocorrelation matrix) μ(σI, σD) = g(σI) ∗ [Ix^2(σD) IxIy(σD); IxIy(σD) Iy^2(σD)]. Intuition: search for local neighborhoods where the image content has two main directions (eigenvectors).
    • Harris detector [Harris '88]. Step 1: image derivatives Ix, Iy (computed with Gaussian derivative filters gx(σD), gy(σD)).
    • Harris detector [Harris '88]. Step 2: squares of derivatives, Ix^2, Iy^2, IxIy.
    • Harris detector [Harris '88]. Step 3: smooth the squared derivatives with a Gaussian filter g(σI), giving g(Ix^2), g(Iy^2), g(IxIy).
    • Harris detector [Harris '88]. Step 4: cornerness function (large only when both eigenvalues are strong): har = det[μ(σI, σD)] - α [trace(μ(σI, σD))]^2 = g(Ix^2) g(Iy^2) - [g(IxIy)]^2 - α [g(Ix^2) + g(Iy^2)]^2. Step 5: non-maxima suppression of har.
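The same five steps in a short SciPy sketch; the derivative scale `sigma_d`, integration scale `sigma_i`, and alpha (commonly around 0.04 to 0.06) are our parameter choices, and non-maximum suppression (step 5) is left out.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(img, sigma_d=1.0, sigma_i=2.0, alpha=0.06):
    """Harris cornerness map: det(mu) - alpha * trace(mu)^2, where mu is the
    second moment matrix built from Gaussian derivatives (sigma_d) and smoothed
    with the integration Gaussian (sigma_i). Local maxima are corner candidates."""
    img = img.astype(float)
    Ix = gaussian_filter(img, sigma_d, order=(0, 1))      # step 1: derivatives
    Iy = gaussian_filter(img, sigma_d, order=(1, 0))
    Sxx = gaussian_filter(Ix * Ix, sigma_i)               # steps 2-3: squared and smoothed
    Syy = gaussian_filter(Iy * Iy, sigma_i)
    Sxy = gaussian_filter(Ix * Iy, sigma_i)
    det = Sxx * Syy - Sxy ** 2                            # step 4: cornerness
    trace = Sxx + Syy
    return det - alpha * trace ** 2
```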
    • Harris detector - responses [Harris '88]. Effect: a very precise corner detector.
    • Harris detector - responses [Harris '88] (example image).
    • Automatic scale selection. We want the same operator response if the patch contains the same image content up to a scale factor: f(I(x, σ)) = f(I(x′, σ′)), where f is a function of the local image derivatives. How do we find corresponding patch sizes?
    • Automatic scale selection. Function responses for increasing scale (the "scale signature"): plot f(I(x, σ)) and f(I(x′, σ)) over σ at corresponding points in the two images; the scales at which the two signatures reach their maxima identify corresponding patch sizes. (Sequence of animation slides.)
    • What is a useful signature function? The Laplacian-of-Gaussian, a "blob" detector.
    • Laplacian-of-Gaussian (LoG). Detect local maxima in the scale space of the Laplacian-of-Gaussian, Lxx(σ) + Lyy(σ), over scales σ1 … σ5. ⇒ List of keypoints (x, y, s).
    • Results: Laplacian-of-Gaussian (example detections).
    • Difference-of-Gaussian (DoG). The difference of two Gaussians approximates the Laplacian-of-Gaussian.
    • DoG - efficient computation. Computation in a Gaussian scale pyramid: within an octave, the image is repeatedly smoothed with increasing σ; once σ has doubled, the image is subsampled by a factor of 2 and the process repeats on the smaller image.
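One octave of the DoG computation can be sketched as repeated Gaussian smoothing with subtraction of adjacent levels. The parameter values (sigma0 = 1.6, k = sqrt(2)) follow common SIFT practice and are assumptions, not taken from the slide.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_stack(img, sigma0=1.6, levels=5, k=2 ** 0.5):
    """Difference-of-Gaussian stack for one octave: successive Gaussian blurs at
    sigma0 * k^i, with adjacent levels subtracted. Scale-space extrema of this
    stack (over x, y, and scale) are the DoG keypoint candidates."""
    img = img.astype(float)
    blurred = [gaussian_filter(img, sigma0 * (k ** i)) for i in range(levels)]
    return [blurred[i + 1] - blurred[i] for i in range(levels - 1)]
```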
    • Results: Lowe's DoG (example detections).
    • Harris-Laplace [Mikolajczyk '01]. 1. Initialization: multiscale Harris corner detection (compute the Harris function at scales σ, σ2, σ3, σ4 and detect local maxima).
    • Harris-Laplace [Mikolajczyk '01]. 1. Initialization: multiscale Harris corner detection. 2. Scale selection based on the Laplacian (the same procedure with the Hessian gives Hessian-Laplace). Comparison: Harris points vs. Harris-Laplace points.
    • Maximally Stable Extremal Regions (MSER) [Matas '02]. Based on the watershed segmentation algorithm: select regions that stay stable over a large parameter range.
    • Example results: MSER.
    • You can try it at home. For most local feature detectors, executables are available online: http://robots.ox.ac.uk/~vgg/research/affine, http://www.cs.ubc.ca/~lowe/keypoints/, http://www.vision.ee.ethz.ch/~surf
    • Orientation normalization. Compute an orientation histogram over [0, 2π), select the dominant orientation, and normalize by rotating the region to a fixed orientation [Lowe, SIFT, 1999]. (Slide credit: T. Tuytelaars, B. Leibe)
    • Local descriptors. The ideal descriptor should be repeatable, distinctive, compact, and efficient. Most available descriptors focus on edge/gradient information and capture texture information; color is still relatively seldom used (it is more suitable for homogeneous regions).
    • Local descriptors: SIFT descriptor. A histogram of oriented gradients over the region; it captures important texture information and is robust to small translations / affine deformations. [Lowe, ICCV 1999]
    • Local descriptors: SURF. A fast approximation of the SIFT idea: efficient computation with 2-D box filters and integral images makes it about 6 times faster than SIFT, with equivalent quality for object identification. A GPU implementation is available, with feature extraction at 100 Hz (detector + descriptor, 640x480 image). http://www.vision.ee.ethz.ch/~surf [Bay, ECCV '06], [Cornelis, CVGPU '08]
    • Local descriptors: shape context. Count the number of points inside each bin of a log-polar grid (e.g. count = 4, count = 10): more precision for nearby points, more flexibility for farther points. Belongie & Malik, ICCV 2001.
    • Local descriptors: geometric blur. Compute edges at four orientations, extract a patch in each channel, apply a spatially varying blur, and sub-sample (example descriptor shown on an idealized signal). Berg & Malik, CVPR 2001.
    • So, what local features should I use? There have been extensive evaluations and comparisons [Mikolajczyk et al., IJCV '05, PAMI '05]; all the detectors/descriptors shown here work well. The best choice is often application dependent: MSER works well for buildings and printed things, while Harris-/Hessian-Laplace/DoG work well for many natural categories. More features are better, and combining several detectors often helps.
    • ― Part 3: Specific Object Recognition with Local Features (title and outline slides repeated) ―
    • Recognition with local features. Image content is transformed into local features (e.g. SIFT) that are invariant to translation, rotation, and scale. Goal: verify whether they belong to a consistent configuration. (Slide credit: David Lowe)
    • Finding consistent configurations. Global spatial models: the Generalized Hough Transform [Lowe99] and RANSAC [Obdrzalek02, Chum05, Nister06]. The basic assumption is that the object is planar, which is often justified in practice: it is valid for many structures on buildings and sufficient for small viewpoint variations on 3D objects.
    • Hough transform. Origin: detection of straight lines in clutter. Basic idea: each candidate point votes for all lines that it is consistent with; votes are accumulated in a quantized array; local maxima correspond to candidate lines. Representation of a line: the usual form y = ax + b has a singularity around 90°, so a better parameterization is x cos(θ) + y sin(θ) = ρ.
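A compact sketch of the voting scheme using the x cos(θ) + y sin(θ) = ρ parameterization described above; the accumulator resolution and the point format are our assumptions.

```python
import numpy as np

def hough_lines(points, img_shape, n_theta=180, n_rho=200):
    """Accumulate votes in (theta, rho) space for x*cos(theta) + y*sin(theta) = rho.
    points: iterable of (x, y) edge/token coordinates.
    Local maxima of the returned accumulator correspond to candidate lines."""
    h, w = img_shape
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rho_max = np.hypot(h, w)
    acc = np.zeros((n_theta, n_rho), dtype=int)
    for x, y in points:
        rhos = x * np.cos(thetas) + y * np.sin(thetas)
        # Map rho in [-rho_max, rho_max] to an integer bin and vote.
        bins = np.round((rhos + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
        acc[np.arange(n_theta), bins] += 1
    return acc, thetas
```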
    • Hough transform: noisy line. Tokens and their votes in (θ, ρ) space. Problem: finding the true maximum. (Slide credit: David Lowe)
    • Hough transform: noisy input. Tokens and votes. Problem: lots of spurious maxima. (Slide credit: David Lowe)
    • Generalized Hough Transform [Ballard81]. Generalization to an arbitrary contour or shape: choose a reference point for the contour (e.g. its center); for each point on the contour, remember where it is located with respect to the reference point (radius r and angle φ relative to the contour tangent). Recognition: whenever a contour point is found, calculate the tangent angle and vote for all possible reference points. Instead of a reference point, one can also vote for a transformation. ⇒ The same idea can be used with local features! (Slide credit: Bernt Schiele)
    • Generalized Hough Transform with local features. For every feature, store its possible "occurrences". For a new image, let the matched features vote for possible object positions: object identity, pose, and relative position.
    • 3D object recognition. Generalized Hough Transform for recognition [Lowe99]: typically only 3 feature matches are needed for recognition, extra matches provide robustness, and an affine model can be used for planar objects. (Slide credit: David Lowe)
    • View interpolation. Training: training views from similar viewpoints are clustered based on feature matches, and matching features between adjacent views are linked. Recognition: feature matches may be spread over several training viewpoints, so the known links are used to "transfer votes" to other viewpoints. [Lowe01] (Slide credit: David Lowe)
    • Recognition using view interpolation [Lowe01]. (Slide credit: David Lowe)
    • Location recognition: training images [Lowe04]. (Slide credit: David Lowe)
    • Applications. Sony Aibo (Evolution Robotics). SIFT usage: recognize the docking station and communicate via visual cards. Other uses: place recognition, loop closure in SLAM. (Slide credit: David Lowe)
    • RANSAC (RANdom SAmple Consensus) [Fischler81]. Randomly choose a minimal subset of data points necessary to fit a model (a sample). Points within some distance threshold t of the model form its consensus set, and the size of the consensus set is the model's support. Repeat for N samples; the model with the biggest support is the most robust fit. Points within distance t of the best model are inliers; fit the final model to all inliers. (Slide credit: David Lowe)
    • RANSAC: how many samples? Suppose w is the fraction of inliers (points on the line), n points are needed to define a hypothesis (2 for lines), and k samples are chosen. The probability that a single sample of n points is correct is w^n, so the probability that all k samples fail is (1 - w^n)^k. ⇒ Choose k high enough to keep this below the desired failure rate. (Slide credit: David Lowe)
    • After RANSAC. RANSAC divides the data into inliers and outliers and yields an estimate computed from a minimal set of inliers. Improve this initial estimate by re-estimating over all inliers (e.g. with standard least-squares minimization). But this may change the set of inliers, so alternate fitting with re-classification as inlier/outlier. (Slide credit: David Lowe)
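Putting the last three slides together, here is a small RANSAC sketch for line fitting: the number of samples k is chosen from the (1 - w^n)^k failure bound, the largest consensus set is kept, and a final least-squares fit is done over all inliers. Parameter defaults and function names are illustrative, not from the tutorial.

```python
import numpy as np

def ransac_line(points, t=2.0, p_fail=0.01, w_guess=0.5, rng=None):
    """RANSAC line fit to an (N, 2) array of (x, y) points.

    k is chosen so that the chance every sample contains an outlier,
    (1 - w^n)^k, stays below p_fail, with n = 2 points per line hypothesis
    and w the assumed inlier fraction."""
    rng = np.random.default_rng() if rng is None else rng
    n = 2
    k = int(np.ceil(np.log(p_fail) / np.log(1.0 - w_guess ** n)))
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(k):
        p1, p2 = points[rng.choice(len(points), size=2, replace=False)]
        d = p2 - p1
        norm = np.hypot(d[0], d[1])
        if norm == 0:
            continue
        # Perpendicular distance of every point to the line through p1 and p2.
        dist = np.abs(d[0] * (points[:, 1] - p1[1])
                      - d[1] * (points[:, 0] - p1[0])) / norm
        inliers = dist < t
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 2:
        return None, best_inliers
    # "After RANSAC" step: least-squares fit over all inliers (non-vertical lines).
    coeffs = np.polyfit(points[best_inliers, 0], points[best_inliers, 1], 1)
    return coeffs, best_inliers
```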
    • Example: finding feature matches. Find the best stereo match within a square search window (here 300 pixels²); the global transformation model is epipolar geometry. (From Hartley & Zisserman; slide credit: David Lowe)
    • Example: finding feature matches. Matches before RANSAC vs. after RANSAC. (From Hartley & Zisserman; slide credit: David Lowe)
    • Comparison. Generalized Hough Transform: advantages are that it is very effective for recognizing arbitrary shapes or objects, can handle a high percentage of outliers (>95%), and extracts groupings from clutter in linear time; disadvantages are quantization issues and that it is only practical for a small number of dimensions (up to 4); improvements are available, e.g. probabilistic extensions [Leibe08] and continuous voting spaces. RANSAC: advantages are that it is a general method suited to a large range of problems, easy to implement, and independent of the number of dimensions; the disadvantage is that it only handles a moderate number of outliers (<50%); many variants are available, e.g. PROSAC (progressive RANSAC) [Chum05] and preemptive RANSAC [Nister05].
    • Example applications. Mobile tourist guide: self-localization, object/building recognition, photo/video augmentation. [Quack, Leibe, Van Gool, CIVR'08]
    • Web demo: movie poster recognition. 50,000 movie posters indexed; query-by-image from a mobile phone, available in Switzerland. http://www.kooaba.com/en/products_engine.html#
    • Application: large-scale retrieval. Query and results from 5k Flickr images (a demo is available for a 100k set). [Philbin CVPR'07]
    • Application: image auto-annotation. Moulin Rouge, Old Town Square (Prague), Tour Montparnasse, Colosseum, Viktualienmarkt, Maypole; left: Wikipedia image, right: closest match from Flickr. [Quack CIVR'08]
    • ― Part 4: Visual Words: Indexing, Bags of Words Categorization (title and outline slides repeated) ―
    • Global representations: limitations. Success may rely on alignment, making them sensitive to viewpoint, and all parts of the image or window impact the description, making them sensitive to occlusion and clutter.
    • Local representations. Describe component regions or patches separately. There are many options for detection & description: Maximally Stable Extremal Regions [Matas 02], SIFT [Lowe 99], shape context [Belongie 02], superpixels [Ren et al.], salient regions [Kadir 01], Harris-Affine [Mikolajczyk 04], spin images [Johnson 99], geometric blur [Berg 05].
    • Recall: invariant local features. A subset of local feature types is designed to be invariant to scale, translation, rotation, affine transformations, and illumination. 1) Detect interest points; 2) extract descriptors (d-dimensional vectors x = (x1, …, xd), y = (y1, …, yd)). [Mikolajczyk01, Matas02, Tuytelaars04, Lowe99, Kadir01, …]
    • Recognition with local feature sets. Previously, we saw how to use local invariant features plus a global spatial model to recognize specific objects, using a planar object assumption. Now we will use local features for indexing-based recognition, bag-of-words representations, and correspondence/matching kernels.
    • Basic flow. Detect or sample features (a list of positions, scales, and orientations) → describe features (an associated list of d-dimensional descriptors) → index each one into a pool of descriptors from previously seen images.
    • Indexing local features. Each patch/region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT).
    • Indexing local features. When we see close points in feature space, we have similar descriptors, which indicates similar local content. (Figure credit: A. Zisserman)
    • Indexing local features. We saw in the previous section how to use voting and pose clustering to identify objects using local features. (Figure credit: David Lowe)
    • Indexing local features. With potentially thousands of features per image, and hundreds to millions of images to search, how do we efficiently find those that are relevant to a new image? For low-dimensional descriptors, standard efficient data structures for nearest neighbor search can be used; for high-dimensional descriptors, approximate nearest neighbor search methods are more practical; inverted file indexing schemes are another option.
    • Indexing local features: approximate nearest neighbor search. Best-Bin First (BBF), a variant of k-d trees that uses a priority queue to examine the most promising branches first [Beis & Lowe, CVPR 1997]. Locality-Sensitive Hashing (LSH), a randomized hashing technique using hash functions that map similar points to the same bin with high probability [Indyk & Motwani, 1998].
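For moderate pool sizes, an exact k-d tree already illustrates the indexing step; BBF or LSH would replace it when descriptors are high-dimensional or the database is very large. The distance-ratio acceptance test is the common heuristic from Lowe's SIFT matching, used here only as an example criterion.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(query_desc, db_desc, ratio=0.8):
    """Nearest-neighbor descriptor matching with a Lowe-style ratio test.
    Returns (indices into query_desc, indices of their matched db descriptors)."""
    tree = cKDTree(db_desc)
    dist, idx = tree.query(query_desc, k=2)      # two nearest neighbors per query
    good = dist[:, 0] < ratio * dist[:, 1]       # keep only clearly-best matches
    return np.flatnonzero(good), idx[good, 0]
```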
    • Indexing local features: inverted file index. For text documents, an efficient way to find all pages on which a word occurs is to use an index. We want to find all images in which a feature occurs; to use this idea, we need to map our features to "visual words".
    • Visual words: main idea. Extract some local features from a number of images (e.g., in SIFT descriptor space, each point is 128-dimensional). (Slide credit: D. Nister)
    • Visual words: main idea (animation continued: descriptors extracted from many images populate the descriptor space). (Slide credit: D. Nister)
    • Visual words: main idea. Map high-dimensional descriptors to tokens/words by quantizing the feature space: quantize via clustering, and let the cluster centers be the prototype "words".
    • Visual words: main idea. Determine which word to assign to each new image region by finding the closest cluster center in descriptor space.
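A minimal sketch of both steps (building the vocabulary by clustering, then assigning each new descriptor to its nearest word) using SciPy's k-means; the vocabulary size and function names are placeholders, not part of any particular published system.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_vocabulary(descriptors, k=1000):
    """Quantize descriptor space with k-means; the cluster centers are the
    visual words. `descriptors` is an (N, d) array pooled over many images."""
    centers, _ = kmeans2(descriptors.astype(float), k, minit='++')
    return centers

def assign_words(descriptors, vocabulary):
    """Map each descriptor to the index of its closest visual word."""
    words, _ = vq(descriptors.astype(float), vocabulary)
    return words
```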
    • Visual words. Example: each group of patches belongs to the same visual word. (Figure from Sivic & Zisserman, ICCV 2003)
    • Visual words. First explored for texture and material representations: a texton is a cluster center of filter responses over a collection of images, and textures and materials are described by their distribution of prototypical texture elements. Leung & Malik 1999; Varma & Zisserman, 2002; Lazebnik, Schmid & Ponce, 2003.
    • Visual words. More recently used for describing scenes and objects for the sake of indexing or classification. Sivic & Zisserman 2003; Csurka, Bray, Dance & Fan 2004; many others.
    • Inverted file index for images comprised of visual words. For each word number, store the list of image numbers in which it occurs. (Image credit: A. Zisserman)
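The inverted file itself is just a word-to-image-id map; a sketch with hypothetical function names is below. Candidate images returned this way would then be ranked by comparing their bag-of-words vectors and spatially verified.

```python
from collections import defaultdict

def build_inverted_index(image_word_lists):
    """Map each visual word to the set of image ids containing it.
    image_word_lists[i] is the list of word ids occurring in image i."""
    index = defaultdict(set)
    for img_id, words in enumerate(image_word_lists):
        for w in words:
            index[w].add(img_id)
    return index

def candidate_images(index, query_words):
    """Images sharing at least one visual word with the query."""
    hits = set()
    for w in query_words:
        hits |= index.get(w, set())
    return hits
```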
    • Bags of visual words. Summarize an entire image based on its distribution (histogram) of word occurrences; this is analogous to the bag-of-words representation commonly used for documents. (Image credit: Fei-Fei Li)
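Given the word assignments from the quantization sketch above, the bag-of-words vector is simply a normalized count histogram:

```python
import numpy as np

def bow_histogram(word_ids, vocab_size):
    """Summarize an image by the normalized histogram of its visual-word ids."""
    hist = np.bincount(np.asarray(word_ids), minlength=vocab_size).astype(float)
    return hist / (hist.sum() + 1e-9)
```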
    • Video Google system. 1. Collect all words within the query region; 2. use the inverted file index to find relevant frames; 3. compare word counts; 4. spatial verification. Sivic & Zisserman, ICCV 2003. Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html
    • Basic flow. Detect or sample features → describe features → index each one into a pool of descriptors from previously seen images, or quantize to form a bag-of-words vector for the image.
    • Visual vocabulary formation. Issues: sampling strategy; clustering / quantization algorithm; unsupervised vs. supervised; what corpus provides the features (a universal vocabulary?); vocabulary size / number of words.
    • Sampling strategies. Dense (uniform), sparse (at interest points), random, or multiple interest operators. To find specific, textured objects, sparse sampling from interest points is often more reliable; multiple complementary interest operators offer more image coverage; for object categorization, dense sampling offers better coverage. [See Nowak, Jurie & Triggs, ECCV 2006] (Image credits: F-F. Li, E. Nowak, J. Sivic)
    • Clustering / quantization methods. k-means (the typical choice), agglomerative clustering, mean-shift, …. Hierarchical clustering allows faster insertion / word assignment while still allowing large vocabularies, e.g. the vocabulary tree [Nister & Stewenius, CVPR 2006].
    • Example: recognition with a vocabulary tree. Tree construction. [Nister & Stewenius, CVPR'06] (Slide credit: David Nister)
    • Vocabulary tree. Training: filling the tree (sequence of animation slides). [Nister & Stewenius, CVPR'06] (Slide credit: David Nister)
    • Vocabulary tree. Recognition, with RANSAC verification. [Nister & Stewenius, CVPR'06] (Slide credit: David Nister)
    • Vocabulary tree: performance. Evaluated on large databases, indexing up to 1M images; online recognition for a database of 50,000 CD covers with retrieval in ~1 s. Experimentally, large vocabularies can be beneficial for recognition. [Nister & Stewenius, CVPR'06]
    • Vocabulary formation. Ensembles of trees provide additional robustness. Moosmann, Jurie & Triggs 2006; Yeh, Lee & Darrell 2007; Bosch, Zisserman & Munoz 2007; …. (Figure credit: F. Jurie)
    • Supervised vocabulary formation. Recent work considers how to leverage labeled images when constructing the vocabulary. Perronnin, Dance, Csurka & Bressan, Adapted Vocabularies for Generic Visual Categorization, ECCV 2006.
    • Supervised vocabulary formation. Merge words that do not aid discriminability. Winn, Criminisi & Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV 2005.
    • Supervised vocabulary formation. Consider vocabulary and classifier construction jointly. Yang, Jin, Sukthankar & Jurie, Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008.
    • Learning and recognition with bag of words histograms • Bag of words representation makes it possible to describe the unordered point set with a single vectorVisual Object Recognition Tutorial Computing (of fixed dimension across image examples)Perceptual and Sensory Augmented • Provides easy way to use distribution of feature types with various learning algorithms requiring vector input. 44 K. Grauman, B. Leibe
    • Learning and recognition with bag-of-words histograms
      • …including unsupervised topic models designed for documents (see the sketch below)
      • Hierarchical Bayesian text models: pLSA (Hofmann 2001) and LDA (Blei, Ng & Jordan 2003)
      • For object and scene categorization: Sivic et al. 2005, Sudderth et al. 2005, Quelhas et al. 2005, Fei-Fei et al. 2005
      • Probabilistic Latent Semantic Analysis (pLSA) applied to discover a “face” topic: Sivic et al. ICCV 2005
      [pLSA code available at: http://www.robots.ox.ac.uk/~vgg/software/]  Figure credit: Fei-Fei Li
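    For a concrete feel, here is a hedged sketch that fits the closely related LDA topic model on bag-of-words count vectors with scikit-learn; the random placeholder counts and the choice of four topics are purely illustrative and not tied to the experiments cited above:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# counts: (n_images, n_visual_words) matrix of raw word counts per image
counts = np.random.randint(0, 5, size=(100, 200))      # placeholder data

lda = LatentDirichletAllocation(n_components=4, random_state=0)  # 4 latent "topics"
doc_topics = lda.fit_transform(counts)   # per-image topic mixture, shape (100, 4)
topic_words = lda.components_            # per-topic word weights (unnormalized)

# Each image can then be described (or clustered) by its topic mixture,
# e.g. assigning it to np.argmax(doc_topics, axis=1).
```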
    • Bags of words: pros and cons
      + flexible to geometry / deformations / viewpoint
      + compact summary of image content
      + provides a vector representation for sets
      + has yielded good recognition results in practice
      - basic model ignores geometry (must verify afterwards, or encode it via features)
      - background and foreground are mixed when the bag covers the whole image
      - interest points or sampling: no guarantee to capture object-level parts
      - optimal vocabulary formation remains unclear
    • Outline
      1. Detection with Global Appearance & Sliding Windows
      2. Local Invariant Features: Detection & Description
      3. Specific Object Recognition with Local Features
      ― Coffee Break ―
      4. Visual Words: Indexing, Bags of Words Categorization
      5. Matching Local Feature Sets
      6. Part-Based Models for Categorization
      7. Current Challenges and Research Directions
    • Basic flow
      • Detect or sample features → list of positions, scales, orientations
      • Describe features → associated list of d-dimensional descriptors
      • Then either quantize to form a bag-of-words vector, index each descriptor into a pool of descriptors from previously seen images, or compute a match with another image
    • Local feature correspondences
      • Matching between sets of local features helps to establish the overall similarity between objects or shapes
      • The assigned matches are also useful for localization
      Shape context [Belongie & Malik 2001]; Match kernel [Wallraven, Caputo & Graf 2003]; Low-distortion matching [Berg & Malik 2005]
    • Local feature correspondences
      • Least cost match: minimize the total cost between matched points,
        min_{π: X→Y} Σ_{x_i ∈ X} || x_i − π(x_i) ||
      • Least cost partial match: match all of the smaller set to some portion of the larger set (see the sketch below)
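    A small sketch of the least-cost (partial) match using the Hungarian algorithm from SciPy; the Euclidean cost and the convention of matching all of the smaller set follow the slide, while the function name and data layout are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def least_cost_match(X, Y):
    """Match every feature of the smaller set to a distinct feature of the larger set."""
    if len(X) > len(Y):
        X, Y = Y, X                       # ensure X is the smaller set
    cost = cdist(X, Y)                    # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols)), cost[rows, cols].sum()

# X, Y: (m, d) and (n, d) arrays of local descriptors from two images
# matches, total_cost = least_cost_match(X, Y)
```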
    • Pyramid match kernel (PMK)
      • Optimal matching is expensive relative to the number of features per image (m): optimal match O(m^3), greedy match O(m^2 log m), pyramid match O(m)
      • PMK is an approximate partial match for efficient discriminative learning from sets of local features
      [Grauman & Darrell, ICCV 2005]
    • Pyramid match kernel: pyramid extraction
      • Histogram pyramid: level i has bins of size 2^i
    • Pyramid match kernel: counting matches
      • Histogram intersection at each level counts the matches implied by that bin size
    • Pyramid match kernel: counting new matches
      • Difference in histogram intersections across levels (matches at this level minus matches at the previous level) counts the number of newly matched pairs
    • Pyramid match kernel
      • Kernel value: weighted sum over the histogram pyramids of the number of newly matched pairs at level i, weighted by a measure of the difficulty of a match at level i (see the sketch below)
      • For similarity, weights are inversely proportional to bin size (or may be learned discriminatively)
      • Normalize kernel values to avoid favoring large sets
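    The following toy sketch computes an (unnormalized) pyramid match score for 1-D feature values; the fixed feature range, the number of levels, and the simple grid without random shifts are simplifications of the actual kernel:

```python
import numpy as np

def histogram(points, level, d_min, d_max):
    """Histogram with bins of side 2**level over the feature range."""
    points = np.asarray(points, dtype=float)
    nbins = int(np.ceil((d_max - d_min) / 2 ** level)) or 1
    idx = np.clip(((points - d_min) // 2 ** level).astype(int), 0, nbins - 1)
    return np.bincount(idx, minlength=nbins)

def pyramid_match(X, Y, levels=5, d_min=0.0, d_max=32.0):
    """Weighted sum of new histogram-intersection matches across levels."""
    score, prev = 0.0, 0.0
    for i in range(levels):
        hx = histogram(X, i, d_min, d_max)
        hy = histogram(Y, i, d_min, d_max)
        inter = np.minimum(hx, hy).sum()          # matches implied at this level
        score += (1.0 / 2 ** i) * (inter - prev)  # weight new matches by 1 / bin size
        prev = inter
    return score

# X, Y: 1-D arrays of feature values from two images
# k = pyramid_match(np.array([1.0, 5.0, 9.0]), np.array([1.2, 8.5, 20.0]))
```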
    • Example pyramid match
      (figures comparing the pyramid match to the optimal match on example feature sets)
    • Pyramid match kernel
      • Forms a Mercer kernel → allows classification with SVMs and use of other kernel methods
      • Bounded error relative to the optimal partial match
      • Linear time → efficient learning with large feature sets
      • ETH-80 data set: accuracy and matching time vs. mean number of features; the match kernel [Wallraven et al.] costs O(m^2) per comparison, the pyramid match O(m)
      • Use data-dependent pyramid partitions (vocabulary-guided rather than uniform bins) for high-dimensional feature spaces
      Code for PMK: http://people.csail.mit.edu/jjl/libpmk/
    • Matching smoothness & local geometry
      • Solving for a linear assignment means (non-overlapping) features are matched independently, ignoring relative geometry
      • One alternative: simply expand the feature vectors to include spatial information before matching, e.g. [f_1, …, f_128, x_a, y_a] (see the snippet below)
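    For example (a tiny sketch; the placeholder data and the spatial weight that balances appearance against location are free, illustrative choices):

```python
import numpy as np

descriptors = np.random.rand(50, 128)      # placeholder SIFT-like descriptors
positions = np.random.rand(50, 2) * 300    # placeholder (x, y) keypoint locations

spatial_weight = 0.5                       # free parameter: appearance vs. location
augmented = np.hstack([descriptors, spatial_weight * positions])   # (50, 130) vectors
```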
    • Spatial pyramid match kernel
      • First quantize descriptors into words, then do one pyramid match per word in image coordinate space (see the sketch below)
      Lazebnik, Schmid & Ponce, CVPR 2006
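    A rough sketch of the spatial pyramid representation for one image; the three grid levels and the usual 1/2^(L-l) weighting follow the standard formulation, but normalization and other details are simplified:

```python
import numpy as np

def spatial_pyramid(words, positions, img_w, img_h, n_words, levels=3):
    """Concatenate weighted per-cell visual-word histograms over a coarse-to-fine grid."""
    feats = []
    for l in range(levels):
        cells = 2 ** l
        # level 0 and level 1 share the coarsest weight; the finest level gets weight 1
        weight = 1.0 / 2 ** (levels - 1 - l) if l > 0 else 1.0 / 2 ** (levels - 1)
        cx = np.minimum((positions[:, 0] / img_w * cells).astype(int), cells - 1)
        cy = np.minimum((positions[:, 1] / img_h * cells).astype(int), cells - 1)
        for i in range(cells):
            for j in range(cells):
                mask = (cx == i) & (cy == j)
                hist = np.bincount(words[mask], minlength=n_words)
                feats.append(weight * hist)
    return np.concatenate(feats)

# words: (m,) visual-word index per keypoint; positions: (m, 2) (x, y) locations
# rep = spatial_pyramid(words, positions, img_w=640, img_h=480, n_words=200)
```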
    • Matching smoothness & local geometry
      • Use correspondences to estimate a parameterized transformation, regularized to enforce smoothness
      Shape context matching [Belongie, Malik, & Puzicha 2001]
      Code: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/sc_digits.html
      • Let the matching cost include a term that penalizes distortion between pairs of matched features (template vs. query)
      Approximate for efficient solutions: Berg & Malik, CVPR 2005; Leordeanu & Hebert, ICCV 2005  Figure credit: Alex Berg
      • Compare “semi-local” features: consider configurations or neighborhoods and co-occurrence relationships
      Correlograms of visual words [Savarese, Winn, & Criminisi, CVPR 2006]; Proximity distribution kernel [Ling & Soatto, ICCV 2007]; Hyperfeatures [Agarwal & Triggs, ECCV 2006]; Feature neighborhoods [Sivic & Zisserman, CVPR 2004]; Tiled neighborhoods [Quack, Ferrari, Leibe, van Gool, ICCV 2007]
      • Learn or provide an explicit object-specific shape model [next in the tutorial: part-based models]
    • Summary
      • Local features are a useful, flexible representation
        • Invariance properties are typically built into the descriptor
        • Distinctive; especially helpful for identifying specific textured objects
        • Breaking the image into regions/parts gives tolerance to occlusions and clutter
        • Mapping to visual words forms discrete tokens from image regions
      • Efficient methods are available for
        • Indexing patches or regions
        • Comparing distributions of visual words
        • Matching features
    • Recognition of Object Categories
      • We no longer have exact correspondences…
      • On a local level, we can still detect similar parts
      • Represent objects by their parts ⇒ bag-of-features
      • How can we improve on this? Encode structure
      Slide credit: Rob Fergus
    • Part-Based Models
      • Fischler & Elschlager 1973
      • The model has two components: parts (2D image fragments) and structure (configuration of parts)
    • Different Connectivity Structures
      • Connectivity ranges from fully connected models with recognition complexity O(N^6) down to star, tree, and bag-of-features style models with complexity O(N^2)–O(N^3)
      • Fergus et al. ’03; Fei-Fei et al. ’03; Leibe et al. ’04, ’08; Crandall et al. ’05; Felzenszwalb & Huttenlocher ’05; Fergus et al. ’05; Csurka ’04; Bouchard & Triggs ’05; Carneiro & Lowe ’06; Vasconcelos ’00
      Figure from [Carneiro & Lowe, ECCV’06]
    • Spatial Models Considered Here
      • Fully connected shape model (e.g. Constellation Model): parts fully connected; recognition complexity O(N^P); method: exhaustive search
      • “Star” shape model (e.g. ISM): parts mutually independent given the object center; recognition complexity O(NP); method: Generalized Hough Transform
      Slide credit: Rob Fergus
    • Constellation Model
      • Joint model for appearance and shape
      • Gaussian shape pdf, Gaussian part appearance pdf, Gaussian relative scale pdf (in log scale), and a probability of detection per part
      • Clutter model: uniform shape pdf, Gaussian appearance pdf, uniform relative scale pdf, Poisson pdf on the number of detections
    • Constellation Model: Learning Procedure
      • Goal: find regions and their location, scale & appearance
      • Initialize model parameters
      • Use EM and iterate to convergence
        • E-step: compute assignments of regions to foreground/background
        • M-step: update model parameters
      • Maximizes the likelihood, i.e. consistency in shape & appearance
    • Examples: learned constellation models and detections for motorbikes and spotted cats (figures)
    • Discussion: Constellation Model
      • Advantages
        • Works well for many different object categories
        • Can adapt to categories where either shape or appearance is more important
        • Everything is learned from training data
        • Weakly-supervised training is possible
      • Disadvantages
        • The model contains many parameters that need to be estimated
        • Cost increases exponentially with the number of parts
        ⇒ The fully connected model is restricted to a small number of parts
    • Implicit Shape Model (ISM)
      • Basic ideas
        • Learn an appearance codebook
        • Learn a star-topology structural model: features are considered independent given the object center
      • Algorithm: probabilistic Generalized Hough Transform
        • Exact correspondences → probabilistic match to object part
        • NN matching → soft matching
        • Feature location on the object → part location distribution
        • Uniform votes → probabilistic vote weighting
        • Quantized Hough array → continuous Hough space
    • Codebook Representation
      • Extraction of local object features at interest points (e.g. Harris detector): sparse representation of the object appearance
      • Collect features from the whole training set
    • Gen. Hough Transform with Local Features
      • For every feature, store possible “occurrences”: object identity, pose, relative position
      • For a new image, let the matched features vote for possible object positions
    • Implicit Shape Model - Representation
      • Learn an appearance codebook: extract local features at interest points from training images (with reference segmentations), then agglomerative clustering ⇒ codebook
      • Learn spatial distributions: match the codebook to the training images and record matching positions on the object ⇒ spatial occurrence distributions plus local figure-ground labels
    • Implicit Shape Model - Recognition
      • Interest points → matched codebook entries → probabilistic voting into a continuous 3D voting space (x, y, s); see the sketch below
      • Each image feature f at location l, with codebook interpretations Ci, votes for object on at position x with weight
        p(on, x | f, l) = Σ_i p(on, x | Ci, l) · p(Ci | f)
      [Leibe04, Leibe08]
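    A much-simplified sketch of this voting step in 2-D with a coarse accumulator grid; the data structures, the grid cell size, and the omission of scale and of mean-shift refinement are all simplifications relative to the actual ISM:

```python
import numpy as np

def hough_vote(features, codebook_matches, occurrence_offsets, img_shape, cell=8):
    """Accumulate weighted votes for the object center in a coarse 2-D grid.

    features:            list of (x, y) keypoint locations
    codebook_matches:    for each feature, list of (codebook_id, p(Ci | f))
    occurrence_offsets:  dict codebook_id -> list of ((dx, dy), occurrence weight)
    """
    H, W = img_shape
    acc = np.zeros((H // cell + 1, W // cell + 1))
    for (fx, fy), matches in zip(features, codebook_matches):
        for ci, p_match in matches:
            for (dx, dy), p_occ in occurrence_offsets.get(ci, []):
                ox, oy = fx + dx, fy + dy              # voted object center
                if 0 <= ox < W and 0 <= oy < H:
                    acc[int(oy) // cell, int(ox) // cell] += p_match * p_occ
    return acc   # maxima of this accumulator are object hypotheses

# strongest_cell = np.unravel_index(np.argmax(acc), acc.shape)
```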
    • Implicit Shape Model - Recognition
      • Search for maxima in the 3D voting space, then backproject the contributing votes to obtain backprojected hypotheses
      [Leibe04, Leibe08]
    • Example: Results on Cows
      (figure sequence: original image → interest points → matched patches → probabilistic votes → 1st, 2nd, and 3rd hypotheses)
    • Scale Invariant Voting
      • Scale-invariant feature selection: scale-invariant interest points, rescale the extracted patches, match to a constant-size codebook
      • Generate scale votes: scale becomes the 3rd dimension in the voting space; search for maxima in the 3D voting space within a search window
    • Scale Voting: Efficient Computation
      • Scale votes → binned accumulator array → candidate maxima → refinement (MSME)
      • Mean-Shift formulation for the refinement, with a scale-adaptive balloon density estimator (see the sketch below)
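    A bare-bones sketch of mean-shift refinement of one candidate maximum over the (x, y, s) votes, using a single fixed-bandwidth Gaussian kernel instead of the scale-adaptive balloon estimator mentioned on the slide:

```python
import numpy as np

def mean_shift_mode(votes, weights, start, bandwidth=16.0, iters=20):
    """Iteratively move a candidate toward the weighted local mean of nearby votes."""
    mode = np.asarray(start, dtype=float)
    for _ in range(iters):
        d2 = np.sum((votes - mode) ** 2, axis=1)
        k = weights * np.exp(-d2 / (2 * bandwidth ** 2))   # Gaussian kernel weights
        if k.sum() < 1e-12:
            break
        new_mode = (k[:, None] * votes).sum(axis=0) / k.sum()
        if np.linalg.norm(new_mode - mode) < 1e-3:
            break
        mode = new_mode
    return mode

# votes: (n, 3) array of (x, y, scale) vote locations; weights: (n,) vote weights
# refined = mean_shift_mode(votes, weights, start=votes[np.argmax(weights)])
```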
    • Detection Results
      • Qualitative performance: recognizes different kinds of objects; robust to clutter, occlusion, noise, and low contrast
    • Figure-Ground Segregation
      • The problem is extensively studied in psychophysics, with experiments on ambiguous figure-ground stimuli
      • Results: evidence that object recognition can and does operate before figure-ground organization, interpreted as the Gestalt cue of familiarity
      M.A. Peterson, “Object Recognition Processes Can and Do Operate Before Figure-Ground Organization”, Cur. Dir. in Psych. Sc., 3:105-111, 1994.
    • ISM - Top-Down Segmentation
      • After voting and backprojection of the maxima, per-pixel p(figure) probabilities yield a top-down segmentation for each hypothesis
      [Leibe04, Leibe08]
    • Segmentation: Probabilistic Formulation
      • Influence of a patch (f, l) on the object hypothesis (vote weight):
        p(f, l | on, x) = [ Σ_i p(on, x | Ci, l) · p(Ci | f) ] · p(f, l) / p(on, x)
      • Backprojection to features f and pixels p:
        p(p = figure | on, x) = Σ_{(f,l) ∋ p} p(p = figure | f, l, on, x) · p(f, l | on, x)
      • The first factor carries the segmentation information, the second the influence on the object hypothesis
      [Leibe04, Leibe08]
    • Segmentation
      • From the original image, obtain per-pixel p(figure) and p(ground) maps and the resulting segmentation
      • Interpretation of the p(figure) map: per-pixel confidence in the object hypothesis; used for hypothesis verification
      [Leibe04, Leibe08]
    • Example Results: Motorbikes (figures)
    • Example Results: Cows
      • Training: 112 hand-segmented images
      • Results on novel sequences: single-frame recognition, no temporal continuity used
      [Leibe04, Leibe08]
    • Example Results: Chairs
      • Office chairs and dining room chairs
    • Inferring Other Information: Part Labels
      • Part labels annotated on training images can be transferred to the test output [Thomas07]
    • Inferring Other Information: Depth Maps
      • “Depth from a single image” [Thomas07]
    • Application for Pedestrian Detection
      • Estimating articulation [Leibe, Seemann, Schiele, CVPR’05]
      • Rotation-invariant detection [Mikolajczyk, Leibe, Schiele, CVPR’06]
    • Outline
      1. Detection with Global Appearance & Sliding Windows
      2. Local Invariant Features: Detection & Description
      3. Specific Object Recognition with Local Features
      ― Coffee Break ―
      4. Visual Words: Indexing, Bags of Words Categorization
      5. Matching Local Features
      6. Part-Based Models for Categorization
      7. Current Challenges and Research Directions
      • Highlight of some research topics not covered in the main tutorial
    • Benchmark Data
      • What degree of difficulty do current datasets have?
    • Example: Caltech-101
      • A 101-way multi-class classification problem; a dataset that has by now been largely mastered
    • Example: Caltech-256
      • A 256-way multi-class recognition problem
    • Example: Pascal Visual Object Classes Challenge
      • Pascal VOC 2007: binary detection problems
      http://pascallin.ecs.soton.ac.uk/challenges/VOC/
    • Example: LabelMe
      http://labelme.csail.mit.edu/
    • Current challenges & ongoing research
      • Multi-cue integration
      • Finer-level categorization
      • View-invariant recognition
      • Unsupervised category discovery
      • Learning from noisily labeled images
      • Integration of segmentation and recognition
      • Learning with text and images/video
      • Use of video
      • Context and scene layout
    • Multi-cue integration
      • Single cues are often not sufficient; integrate multiple local and global cues
    • Multi-Category Discrimination
      • Distinguish similar categories; need to look at specific details!
    • Multi-Aspect Recognition
      • Detectors for different viewpoints ⇒ how can this be improved?
      [Thomas et al., CVPR’06]; [Hoiem, Rother, Winn, CVPR’07]; [Rothganger et al., CVPR’03]; [Savarese & Fei-Fei, ICCV’07]
    • Unsupervised, semi-supervised category discovery
      • Topic models for images: Probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA), e.g. discovering “face” and “beach” topics
      Sivic et al. ICCV 2005, Fei-Fei et al. ICCV 2005  Figure credit: Fei-Fei Li
      • Clustering cluttered images: Grauman & Darrell, CVPR 2006
      • Learning from noisy keyword-based image search results: Fergus et al. ECCV 2004, ICCV 2005; Li & Fei-Fei, CVPR 2007
    • Learning with text and images/video
      Barnard et al. JMLR 2003; Berg, Berg, Edwards, & Forsyth, NIPS 2006; Gupta et al. ECML 2008
    • Integrating segmentation + recognition
      Borenstein & Ullman, ECCV 2002; Tu, Chen, Yuille, Zhu, ICCV 2003; Kumar et al. CVPR 2005; Kannan, Winn, & Rother, NIPS 2006
    • Role of context, understanding scene layout
      Antonio Torralba, IJCV 2003
      • Relating the image to the 3D world: Hoiem, Efros, & Hebert, CVPR 2006
    • Integration with Scene Geometry
      • Goal: find the ground plane to restrict object locations; assume a Gaussian size prior ⇒ significantly reduced search space
      • Cues: structure-from-motion and dense stereo define a search corridor in the (x, y, s) Hough volume
    • Extensions
      • Combination with 3D geometry [Leibe, Cornelis, Cornelis, Van Gool, CVPR’07]
      • Mobile pedestrian detection [Ess, Leibe, Van Gool, ICCV’07]
    • Detections Using Ground Plane Constraints
      • Left camera, 1175 frames [Leibe et al. CVPR’07]
    • Extensions: Tracking-by-Detection
      • Spacetime trajectory analysis: link up detections to form physically plausible spacetime trajectories, then select the set of trajectories that best explains the data
      [Leibe et al. CVPR’07]
    • Dynamic Scene Analysis Results [Leibe et al. CVPR’07]
    • Extensions (2)
      • Combination with 3D reconstruction [Cornelis, Leibe, Cornelis, Van Gool, 3DPVT’06]
    • Textured 3D Model
      • Original footage vs. 3D reconstruction
      • Run-times: SfM + bundle adjustment at 27-30 fps on the CPU; dense reconstruction at 36 fps on the GPU
      [Cornelis, Cornelis, Van Gool, CVPR’06]
    • Improved 3D City Model
      • Enhancing your driving experience… original vs. 3D reconstruction
      [Cornelis, Leibe, Cornelis, Van Gool, 3DPVT’06]
    • Putting It All Together…
      (graphical model figure relating object detections and trajectory hypotheses to scene geometry in the (x, y, s) voting space)
    • Mobile Pedestrian Tracking [Ess, Leibe, Schindler, Van Gool, CVPR’08]
    • Mobile Tracking Through Crowds [Ess, Leibe, Schindler, Van Gool, CVPR’08]
    • Extension: Recovering Articulations
      • Idea: only perform articulated tracking where it’s easy!
      • Multi-person tracking solves the hard data association problem
      • Articulated tracking then runs only on individual “tracklets” between occlusions
      [Gammeter, Ess, Jaeggli, Schindler, Leibe, Van Gool, ECCV’08]
    • Articulated Multi-Person Tracking
      • Multi-person tracking recovers trajectories and solves data association, estimates 3D walking direction and speed, and detects occlusion events
      [Gammeter, Ess, Jaeggli, Schindler, Leibe, Van Gool, ECCV’08]
    • Articulated Tracking under Egomotion
      [Gammeter, Ess, Jaeggli, Schindler, Leibe, Van Gool, ECCV’08]
    • Summary
      • Visual recognition is a challenging and very active research area
      • We’ve covered some basic models and representations that have been shown to be effective, and highlighted some ongoing issues
      • See the tutorial website for slides, links, and references:
      http://www.vision.ee.ethz.ch/~bleibe/teaching/tutorial-aaai08/
      Thank you!