Part 2: part-based models by Rob Fergus (MIT)
Problem with bag-of-words <ul><li>All have equal probability for bag-of-words methods </li></ul><ul><li>Location informati...
Overview of section <ul><li>Representation </li></ul><ul><ul><li>Computational complexity </li></ul></ul><ul><ul><li>Locat...
Representation
Model: Parts and Structure
Representation <ul><li>Object as set of parts </li></ul><ul><ul><li>Generative representation </li></ul></ul><ul><li>Model...
History of Parts and Structure approaches <ul><li>Fischler & Elschlager 1973 </li></ul><ul><li>Yuille ‘91 </li></ul><ul><l...
Sparse representation <ul><ul><li>+ Computationally tractable (10 5  pixels    10 1  -- 10 2  parts) </li></ul></ul><ul><...
Region operators <ul><ul><li>Local maxima of interest operator function </li></ul></ul><ul><ul><li>Can give scale/orientat...
The correspondence problem <ul><li>Model with P parts </li></ul><ul><li>Image with N possible assignments for each part </...
<ul><li>1 – 1 mapping </li></ul><ul><ul><li>Each part assigned to unique feature </li></ul></ul><ul><li>As opposed to: </l...
Location
Connectivity of parts <ul><li>Complexity is given by size of maximal clique in graph </li></ul><ul><li>Consider a 3 part m...
from Sparse Flexible Models of Local Features Gustavo Carneiro and David Lowe, ECCV 2006 Different connectivity structures...
How much does shape help? <ul><li>Crandall, Felzenszwalb, Huttenlocher CVPR’05 </li></ul><ul><li>Shape variance increases ...
Hierarchical representations  <ul><li>Pixels    Pixel groupings    Parts    Object </li></ul>Images from [Amit98,Boucha...
Some class-specific graphs <ul><li>Articulated motion </li></ul><ul><ul><li>People </li></ul></ul><ul><ul><li>Animals </li...
Dense layout of parts <ul><li>Layout CRF: Winn & Shotton, CVPR ‘06 </li></ul>Part labels (color-coded)
How to model location? <ul><li>Explicit: Probability density functions  </li></ul><ul><li>Implicit: Voting scheme </li></u...
<ul><li>Cartesian  </li></ul><ul><ul><li>E.g. Gaussian distribution </li></ul></ul><ul><ul><li>Parameters of model,    an...
Implicit shape model <ul><li>Learn appearance codebook </li></ul><ul><ul><li>Cluster over interest points on training imag...
Deformable Template Matching Query Template Berg, Berg and Malik CVPR 2005 <ul><li>Formulate problem as Integer Quadratic ...
Other invariance methods <ul><li>Search over transformations </li></ul><ul><ul><li>Large space (# pixels x # scales ….) </...
Multiple views Component 1 Component 2 <ul><li>Mixture of 2-D models </li></ul><ul><ul><li>Weber, Welling and Perona CVPR ...
Multiple view points Thomas, Ferrari, Leibe, Tuytelaars, Schiele, and L. Van Gool. Towards Multi-View Object Class Detecti...
Appearance
Representation of appearance <ul><li>Dependency structure </li></ul><ul><ul><li>Often assume each part’s appearance is ind...
Representation of appearance <ul><li>Invariance needs to match that of shape model </li></ul><ul><li>Insensitive to small ...
Appearance representation <ul><li>Decision trees </li></ul>Figure from Winn & Shotton, CVPR ‘06 <ul><li>SIFT </li></ul><ul...
Occlusion <ul><li>Explicit </li></ul><ul><ul><li>Additional match of each part to missing state </li></ul></ul><ul><li>Imp...
Background clutter <ul><li>Explicit model </li></ul><ul><ul><li>Generative model for clutter as well as foreground object ...
Recognition
What task? <ul><li>Classification </li></ul><ul><ul><li>Object present/absent in image </li></ul></ul><ul><ul><li>Backgrou...
Efficient search methods <ul><li>Interpretation tree (Grimson ’87) </li></ul><ul><ul><li>Condition on assigned parts to gi...
Distance transforms <ul><li>Distance transforms </li></ul><ul><ul><li>O(N 2 P)    O(NP) for tree structured models </li><...
Distance transforms 2 <ul><li>For each position of landmark part, find best position for part 2 </li></ul><ul><ul><li>Find...
Parts and Structure demo <ul><li>Gaussian location model – star configuration </li></ul><ul><li>Translation invariant only...
Demo images <ul><li>Sub-set of Caltech face dataset </li></ul><ul><li>Caltech background images </li></ul>
Demo Web Page
Demo (2)
Demo (3)
Demo (4)
Demo: efficient methods
Stochastic Grammar of Images S.C. Zhu et al. and D. Mumford
animal head instantiated by tiger head e.g. discontinuities, gradient  e.g. linelets, curvelets, T-junctions e.g. contours...
Parts and Structure models Summary <ul><li>Correspondence problem </li></ul><ul><li>Efficient methods for large # parts an...
 
References 2. Parts and Structure
[Agarwal02] S. Agarwal and D. Roth. Learning a sparse representation for object detection. In Proceedings of the 7th Europ...
[Blei03] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, Jan...
[FeiFei03] L. Fei-Fei, R. Fergus, and P. Perona. A Bayesian approach to unsupervised one-shot learning of object categorie...
[Grimson87] W. E. L. Grimson and T. Lozano-Perez. Localizing overlapping parts by searching the interpretation tree. IEEE ...
[Leung98] T.K. Leung, M.C. Burl, and P. Perona. Probabilistic ane invariants for recognition. In Proceedings of the IEEE C...
[Viola01] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the ...
Quest for A Stochastic Grammar of Images Song-Chun Zhu and David Mumford
 
 
 
 
Example scheme <ul><li>Model shape using Gaussian distribution on location between parts </li></ul><ul><li>Model appearanc...
Connectivity of parts <ul><li>To find best match in image, we want most probable state of L,  </li></ul><ul><li>Run max-pr...
Different graph structures 1 3 4 5 6 2 1 3 4 5 6 2 Fully  connected Star structure 1 3 4 5 6 2 Tree structure O(N 6 ) O(N ...
Euclidean & Affine Shape <ul><li>Translation, rotation and scaling  Euclidean Shape </li></ul><ul><li>Removal of camera fo...
Translation-invariant shape <ul><li>Figure space density: </li></ul><ul><li>Translation-invariant form </li></ul><ul><li>S...
Affine Shape Density <ul><li>Affine Shape density (Dryden-Mardia): </li></ul><ul><li>Euclidean Shape density is of similar...
Shape <ul><li>Shape is “ what remains after differences due to translation, rotation, and scale have been factored out ” ....
Learning
Learning situations <ul><li>Varying levels of supervision </li></ul><ul><ul><li>Unsupervised </li></ul></ul><ul><ul><li>Im...
 
<ul><li>Task: Estimation of model parameters </li></ul>Learning using EM <ul><li>Let the assignments be a hidden variable ...
Learning procedure E-step: Compute assignments for which regions belong to which part M-step: Update model parameters  <ul...
Example scheme, using EM for maximum likelihood learning 1. Current estimate of   ... Image 1 Image 2 Image  i 2. Assign ...
Priors <ul><li>Implicit </li></ul><ul><ul><li>Structure of dependencies in model </li></ul></ul><ul><ul><li>Parameterisati...
Learning Shape & Appearance simultaneously  Fergus et al. ‘03
Learn appearance then shape Parameter Estimation Choice 1 Choice 2 Parameter Estimation Model 1 Model 2 Predict / measure ...
Discriminative training <ul><li>Sparse so parts need to be distinctive of class </li></ul><ul><li>Boosted parts and struct...
Number of training images <ul><li>More supervision, fewer images needed </li></ul><ul><ul><li>Few unknown parameters </li>...
Number of training examples 6 part Motorbike model Priors
Upcoming SlideShare
Loading in …5
×

Cvpr2007 object category recognition p2 - part based models

1,637 views

Published on

0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,637
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
32
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide
  • Motivation slide. Clearly location has to help, but bag-of-words doesn’t have it…
  • Globally different, but have portions in common
  • Key problem with approach &amp; motivates design choices. Assume for the moment that we have a set of regions.
  • Each pixel is labelled – regions of pixels have the same part label.
  • % The idea of deformable templates is not new. In % the early 1970&apos;s at least three groups were % working on similar ideas: % Grenander in statistical pattern theory % von der Malsburg in neural networks % and Fischler and Eschlager in computer vision On the left we have the model template of a helicopter. We characterize the template by a set of feature points sampled from edges in the image. These are shown as the approximately 40 colored dots in the left image. On the right we show a query image. Here feature points are extracted along edges in the entire image. Many of these will be on background or clutter, but some are on the object of interest. Our problem is to find a correspondence between the model points at left and the correct subset of feature points on the right.
  • Cvpr2007 object category recognition p2 - part based models

    1. 1. Part 2: part-based models by Rob Fergus (MIT)
    2. 2. Problem with bag-of-words <ul><li>All have equal probability for bag-of-words methods </li></ul><ul><li>Location information is important </li></ul>
    3. 3. Overview of section <ul><li>Representation </li></ul><ul><ul><li>Computational complexity </li></ul></ul><ul><ul><li>Location </li></ul></ul><ul><ul><li>Appearance </li></ul></ul><ul><ul><li>Occlusion, Background clutter </li></ul></ul><ul><li>Recognition </li></ul><ul><ul><li>Demos </li></ul></ul>
    4. 4. Representation
    5. 5. Model: Parts and Structure
    6. 6. Representation <ul><li>Object as set of parts </li></ul><ul><ul><li>Generative representation </li></ul></ul><ul><li>Model: </li></ul><ul><ul><li>Relative locations between parts </li></ul></ul><ul><ul><li>Appearance of part </li></ul></ul><ul><li>Issues: </li></ul><ul><ul><li>How to model location </li></ul></ul><ul><ul><li>How to represent appearance </li></ul></ul><ul><ul><li>Sparse or dense (pixels or regions) </li></ul></ul><ul><ul><li>How to handle occlusion/clutter </li></ul></ul>Figure from [Fischler & Elschlager 73]
    7. 7. History of Parts and Structure approaches <ul><li>Fischler & Elschlager 1973 </li></ul><ul><li>Yuille ‘91 </li></ul><ul><li>Brunelli & Poggio ‘93 </li></ul><ul><li>Lades, v.d. Malsburg et al. ‘93 </li></ul><ul><li>Cootes, Lanitis, Taylor et al. ‘95 </li></ul><ul><li>Amit & Geman ‘95, ‘99 </li></ul><ul><li>Perona et al. ‘95, ‘96, ’98, ’00, ’03, ‘04, ‘05 </li></ul><ul><li>Felzenszwalb & Huttenlocher ’00, ’04 </li></ul><ul><li>Crandall & Huttenlocher ’05, ’06 </li></ul><ul><li>Leibe & Schiele ’03, ’04 </li></ul><ul><li>Many papers since 2000 </li></ul>
    8. 8. Sparse representation <ul><ul><li>+ Computationally tractable (10 5 pixels  10 1 -- 10 2 parts) </li></ul></ul><ul><ul><li>+ Generative representation of class </li></ul></ul><ul><ul><li>+ Avoid modeling global variability </li></ul></ul><ul><ul><li>+ Success in specific object recognition </li></ul></ul><ul><ul><li>- Throw away most image information </li></ul></ul><ul><ul><li>- Parts need to be distinctive to separate from other classes </li></ul></ul>
    9. 9. Region operators <ul><ul><li>Local maxima of interest operator function </li></ul></ul><ul><ul><li>Can give scale/orientation invariance </li></ul></ul>Figures from [Kadir, Zisserman and Brady 04]
    10. 10. The correspondence problem <ul><li>Model with P parts </li></ul><ul><li>Image with N possible assignments for each part </li></ul><ul><li>Consider mapping to be 1-1 </li></ul><ul><li>N P combinations!!! </li></ul>
    11. 11. <ul><li>1 – 1 mapping </li></ul><ul><ul><li>Each part assigned to unique feature </li></ul></ul><ul><li>As opposed to: </li></ul><ul><li>1 – Many </li></ul><ul><ul><li>Bag of words approaches </li></ul></ul><ul><ul><li>Sudderth, Torralba, Freeman ’05 </li></ul></ul><ul><ul><li>Loeff, Sorokin, Arora and Forsyth ‘05 </li></ul></ul>The correspondence problem <ul><li>Many – 1 </li></ul><ul><ul><li>- Quattoni, Collins and Darrell, 04 </li></ul></ul>
    12. 12. Location
    13. 13. Connectivity of parts <ul><li>Complexity is given by size of maximal clique in graph </li></ul><ul><li>Consider a 3 part model </li></ul><ul><ul><li>Each part has set of N possible locations in image </li></ul></ul><ul><ul><li>Location of parts 2 & 3 is independent, given location of L </li></ul></ul><ul><ul><li>Each part has an appearance term, independent between parts. </li></ul></ul>L 3 2 Shape Model S(L,2) S(L,3) A(L) A(2) A(3) L 3 2 Variables Factors Shape Appearance Factor graph S(L)
    14. 14. from Sparse Flexible Models of Local Features Gustavo Carneiro and David Lowe, ECCV 2006 Different connectivity structures O(N 6 ) O(N 2 ) O(N 3 ) O(N 2 ) Fergus et al. ’03 Fei-Fei et al. ‘03 Crandall et al. ‘05 Fergus et al. ’05 Crandall et al. ‘05 Felzenszwalb & Huttenlocher ‘00 Bouchard & Triggs ‘05 Carneiro & Lowe ‘06 Csurka ’04 Vasconcelos ‘00
    15. 15. How much does shape help? <ul><li>Crandall, Felzenszwalb, Huttenlocher CVPR’05 </li></ul><ul><li>Shape variance increases with increasing model complexity </li></ul><ul><li>Do get some benefit from shape </li></ul>
    16. 16. Hierarchical representations <ul><li>Pixels  Pixel groupings  Parts  Object </li></ul>Images from [Amit98,Bouchard05] <ul><li>Multi-scale approach increases number of low-level features </li></ul><ul><li>Amit and Geman ‘98 </li></ul><ul><li>Bouchard & Triggs ‘05 </li></ul>
    17. 17. Some class-specific graphs <ul><li>Articulated motion </li></ul><ul><ul><li>People </li></ul></ul><ul><ul><li>Animals </li></ul></ul><ul><li>Special parameterisations </li></ul><ul><ul><li>Limb angles </li></ul></ul>Images from [Kumar, Torr and Zisserman 05, Felzenszwalb & Huttenlocher 05]
    18. 18. Dense layout of parts <ul><li>Layout CRF: Winn & Shotton, CVPR ‘06 </li></ul>Part labels (color-coded)
    19. 19. How to model location? <ul><li>Explicit: Probability density functions </li></ul><ul><li>Implicit: Voting scheme </li></ul><ul><li>Invariance </li></ul><ul><ul><li>Translation </li></ul></ul><ul><ul><li>Scaling </li></ul></ul><ul><ul><li>Similarity/affine </li></ul></ul><ul><ul><li>Viewpoint </li></ul></ul>Similarity transformation Translation and Scaling Translation Affine transformation
    20. 20. <ul><li>Cartesian </li></ul><ul><ul><li>E.g. Gaussian distribution </li></ul></ul><ul><ul><li>Parameters of model,  and  </li></ul></ul><ul><ul><li>Independence corresponds to zeros in  </li></ul></ul><ul><ul><li>Burl et al. ’96, Weber et al. ‘00, Fergus et al. ’03 </li></ul></ul><ul><li>Polar </li></ul><ul><ul><li>Convenient for invariance to rotation </li></ul></ul>Explicit shape model Mikolajczyk et al., CVPR ‘06
    21. 21. Implicit shape model <ul><li>Learn appearance codebook </li></ul><ul><ul><li>Cluster over interest points on training images </li></ul></ul><ul><li>Learn spatial distributions </li></ul><ul><ul><li>Match codebook to training images </li></ul></ul><ul><ul><li>Record matching positions on object </li></ul></ul><ul><ul><li>Centroid is given </li></ul></ul>Interest Points Recognition Learning <ul><li>Use Hough space voting to find object </li></ul><ul><li>Leibe and Schiele ’03,’05 </li></ul>Spatial occurrence distributions x y s x y s x y s x y s Probabilistic Voting Matched Codebook Entries
    22. 22. Deformable Template Matching Query Template Berg, Berg and Malik CVPR 2005 <ul><li>Formulate problem as Integer Quadratic Programming </li></ul><ul><li>O(N P ) in general </li></ul><ul><li>Use approximations that allow P=50 and N=2550 in <2 secs </li></ul>
    23. 23. Other invariance methods <ul><li>Search over transformations </li></ul><ul><ul><li>Large space (# pixels x # scales ….) </li></ul></ul><ul><ul><li>Closed form solution for translation and scale (Helmer and Lowe ’04) </li></ul></ul><ul><li>Features give information </li></ul><ul><ul><li>Characteristic scale </li></ul></ul><ul><ul><li>Characteristic orientation (noisy) </li></ul></ul>Figures from Mikolajczyk & Schmid invariance of the characteristic scale
    24. 24. Multiple views Component 1 Component 2 <ul><li>Mixture of 2-D models </li></ul><ul><ul><li>Weber, Welling and Perona CVPR ‘00 </li></ul></ul>Frontal Profile 0 20 40 60 80 100 50 55 60 65 70 75 80 85 90 95 100 Orientation Tuning angle in degrees % Correct % Correct
    25. 25. Multiple view points Thomas, Ferrari, Leibe, Tuytelaars, Schiele, and L. Van Gool. Towards Multi-View Object Class Detection, CVPR 06 Hoiem, Rother, Winn, 3D LayoutCRF for Multi-View Object Class Recognition and Segmentation, CVPR ‘07
    26. 26. Appearance
    27. 27. Representation of appearance <ul><li>Dependency structure </li></ul><ul><ul><li>Often assume each part’s appearance is independent </li></ul></ul><ul><ul><li>Common to assume independence with location </li></ul></ul><ul><li>Needs to handle intra-class variation </li></ul><ul><ul><li>Task is no longer matching of descriptors </li></ul></ul><ul><ul><li>Implicit variation (VQ to get discrete appearance) </li></ul></ul><ul><ul><li>Explicit model of appearance (e.g. Gaussians in SIFT space) </li></ul></ul>
    28. 28. Representation of appearance <ul><li>Invariance needs to match that of shape model </li></ul><ul><li>Insensitive to small shifts in translation/scale </li></ul><ul><ul><li>Compensate for jitter of features </li></ul></ul><ul><ul><li>e.g. SIFT </li></ul></ul><ul><li>Illumination invariance </li></ul><ul><ul><li>Normalize out </li></ul></ul>
    29. 29. Appearance representation <ul><li>Decision trees </li></ul>Figure from Winn & Shotton, CVPR ‘06 <ul><li>SIFT </li></ul><ul><li>PCA </li></ul>[Lepetit and Fua CVPR 2005]
    30. 30. Occlusion <ul><li>Explicit </li></ul><ul><ul><li>Additional match of each part to missing state </li></ul></ul><ul><li>Implicit </li></ul><ul><ul><li>Truncated minimum probability of appearance </li></ul></ul>Log probability Appearance space µ part
    31. 31. Background clutter <ul><li>Explicit model </li></ul><ul><ul><li>Generative model for clutter as well as foreground object </li></ul></ul><ul><li>Use a sub-window </li></ul><ul><ul><li>At correct position, no clutter is present </li></ul></ul>
    32. 32. Recognition
    33. 33. What task? <ul><li>Classification </li></ul><ul><ul><li>Object present/absent in image </li></ul></ul><ul><ul><li>Background may be correlated with object </li></ul></ul><ul><li>Localization / Detection </li></ul><ul><ul><li>Localize object within the frame </li></ul></ul><ul><ul><li>Bounding box or pixel-level segmentation </li></ul></ul>
    34. 34. Efficient search methods <ul><li>Interpretation tree (Grimson ’87) </li></ul><ul><ul><li>Condition on assigned parts to give search regions for remaining ones </li></ul></ul><ul><ul><li>Branch & bound, A* </li></ul></ul>
    35. 35. Distance transforms <ul><li>Distance transforms </li></ul><ul><ul><li>O(N 2 P)  O(NP) for tree structured models </li></ul></ul><ul><li>How it works </li></ul><ul><ul><li>Assume location model is Gaussian (i.e. e -d 2 ) </li></ul></ul><ul><ul><li>Consider a two part model with µ =0, σ =1 on a 1-D image </li></ul></ul>f(d) = -d 2 Appearance log probability at x i for part 2 = A 2 (x i ) x i Image pixel Log probability <ul><li>Felzenszwalb and Huttenlocher ’00 & ’05 </li></ul>L 2 Model
    36. 36. Distance transforms 2 <ul><li>For each position of landmark part, find best position for part 2 </li></ul><ul><ul><li>Finding most probable x i is equivalent finding maximum over set of offset parabolas </li></ul></ul><ul><ul><li>Upper envelope computed in O(N) rather than obvious O(N 2 ) via distance transform (see Felzenszwalb and Huttenlocher ’05). </li></ul></ul><ul><li>Add A L (x) to upper envelope (offset by µ) to get overall probability map </li></ul>x i x g x j x l x h x k Log probability Image pixel A 2 (x i ) A 2 (x l ) A 2 (x j ) A 2 (x g ) A 2 (x h ) A 2 (x k )
    37. 37. Parts and Structure demo <ul><li>Gaussian location model – star configuration </li></ul><ul><li>Translation invariant only </li></ul><ul><ul><li>Use 1 st part as landmark </li></ul></ul><ul><li>Appearance model is template matching </li></ul><ul><li>Manual training </li></ul><ul><ul><li>User identifies correspondence on training images </li></ul></ul><ul><li>Recognition </li></ul><ul><ul><li>Run template for each part over image </li></ul></ul><ul><ul><li>Get local maxima  set of possible locations for each part </li></ul></ul><ul><ul><li>Impose shape model - O(N 2 P) cost </li></ul></ul><ul><ul><li>Score of each match is combination of shape model and template responses. </li></ul></ul>
    38. 38. Demo images <ul><li>Sub-set of Caltech face dataset </li></ul><ul><li>Caltech background images </li></ul>
    39. 39. Demo Web Page
    40. 40. Demo (2)
    41. 41. Demo (3)
    42. 42. Demo (4)
    43. 43. Demo: efficient methods
    44. 44. Stochastic Grammar of Images S.C. Zhu et al. and D. Mumford
    45. 45. animal head instantiated by tiger head e.g. discontinuities, gradient e.g. linelets, curvelets, T-junctions e.g. contours, intermediate objects e.g. animals, trees, rocks Context and Hierarchy in a Probabilistic Image Model Jin & Geman (2006) animal head instantiated by bear head
    46. 46. Parts and Structure models Summary <ul><li>Correspondence problem </li></ul><ul><li>Efficient methods for large # parts and # positions in image </li></ul><ul><li>Challenge to get representation with desired invariance </li></ul><ul><li>Future directions: </li></ul><ul><li>Multiple views </li></ul><ul><li>Approaches to learning </li></ul><ul><li>Multiple category training </li></ul>
    47. 48. References 2. Parts and Structure
    48. 49. [Agarwal02] S. Agarwal and D. Roth. Learning a sparse representation for object detection. In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, pages 113-130, 2002. [Agarwal_Dataset] Agarwal, S. and Awan, A. and Roth, D. UIUC Car dataset. http://l2r.cs.uiuc.edu/ ~cogcomp/Data/Car, 2002. [Amit98] Y. Amit and D. Geman. A computational model for visual selection. Neural Computation, 11(7):1691-1715, 1998. [Amit97] Y. Amit, D. Geman, and K. Wilder. Joint induction of shape features and tree classi- ers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(11):1300-1305, 1997. [Amores05] J. Amores, N. Sebe, and P. Radeva. Fast spatial pattern discovery integrating boosting with constellations of contextual discriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, volume 2, pages 769-774, 2005. [Bar-Hillel05] A. Bar-Hillel, T. Hertz, and D. Weinshall. Object class recognition by boosting a part based model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, volume 1, pages 702-709, 2005. [Barnard03] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. Matching words and pictures. JMLR, 3:1107-1135, February 2003. [Berg05] A. Berg, T. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, volume 1, pages 26-33, June 2005. [Biederman87] I. Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115-147, 1987. [Biederman95] I. Biederman. An Invitation to Cognitive Science, Vol. 2: Visual Cognition, volume 2, chapter Visual Object Recognition, pages 121-165. MIT Press, 1995.
    49. 50. [Blei03] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003. [Borenstein02] E. Borenstein. and S. Ullman. Class-specic, top-down segmentation. In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, pages 109-124, 2002. [Burl96] M. Burl and P. Perona. Recognition of planar object classes. In Proc. Computer Vision and Pattern Recognition, pages 223-230, 1996. [Burl96a] M. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In Proc. European Conference on Computer Vision, pages 628-641, 1996. [Burl98] M. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In Proceedings of the European Conference on Computer Vision, pages 628-641, 1998. [Burl95] M.C. Burl, T.K. Leung, and P. Perona. Face localization via shape statistics. In Int. Workshop on Automatic Face and Gesture Recognition, 1995. [Canny86] J. F. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986. [Crandall05] D. Crandall, P. Felzenszwalb, and D. Huttenlocher. Spatial priors for part-based recognition using statistical models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, volume 1, pages 10-17, 2005. [Csurka04] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1-22, 2004. [Dalal05] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, pages 886--893, 2005. [Dempster76] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. JRSS B, 39:1-38, 1976. [Dorko04] G. Dorko and C. Schmid. Object class recognition using discriminative local features. IEEE Transactions on Pattern Analysis and Machine Intelligence, Review(Submitted), 2004.
    50. 51. [FeiFei03] L. Fei-Fei, R. Fergus, and P. Perona. A Bayesian approach to unsupervised one-shot learning of object categories. In Proceedings of the 9th International Conference on Computer Vision, Nice, France, pages 1134-1141, October 2003. [FeiFei04] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In Workshop on Generative-Model Based Vision, 2004. [FeiFei05] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, volume 2, pages 524-531, June 2005. [Felzenszwalb00] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2066-2073, 2000. [Felzenszwalb05] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 61:55-79, January 2005. [Fergus_Datasets] R. Fergus and P. Perona. Caltech Object Category datasets. http://www.vision. caltech.edu/html-files/archive.html, 2003. [Fergus03] R. Fergus, P. Perona, and P. Zisserman. Object class recognition by unsupervised scaleinvariant learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 264-271, 2003. [Fergus04] R. Fergus, P. Perona, and A. Zisserman. A visual category lter for google images. In Proceedings of the 8th European Conference on Computer Vision, Prague, Czech Republic, pages 242-256. Springer-Verlag, May 2004. [Fergus05 R. Fergus, P. Perona, and A. Zisserman. A sparse object category model for ecient learning and exhaustive recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, volume 1, pages 380-387, 2005. [Fergus_Technote] R. Fergus, M. Weber, and P. Perona. Ecient methods for object recognition using the constellation model. Technical report, California Institute of Technology, 2001. [Fischler73] M.A. Fischler and R.A. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computer, c-22(1):67-92, Jan. 1973.
    51. 52. [Grimson87] W. E. L. Grimson and T. Lozano-Perez. Localizing overlapping parts by searching the interpretation tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(4):469-482, 1987. [Harris98] C. J. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference, Manchester, pages 147-151, 1988. [Hart68] P.E. Hart, N.J. Nilsson, and B. Raphael. A formal basis for the determination of minimum cost paths. IEEE Transactions on SSC, 4:100-107, 1968. [Helmer04] S. Helmer and D. Lowe. Object recognition with many local features. In Workshop on Generative Model Based Vision 2004 (GMBV), Washington, D.C., July 2004. [Hofmann99] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 15-19, 1999, Berkeley, CA, USA, pages 50-57. ACM, 1999. [Holub05] A. Holub and P. Perona. A discriminative framework for modeling object classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, volume 1, pages 664-671, 2005. [Kadir01] T. Kadir and M. Brady. Scale, saliency and image description. International Journal of Computer Vision, 45(2):83-105, 2001. [Kadir_Code] T. Kadir and M. Brady. Scale Scaliency Operator. http://www.robots.ox.ac.uk/ ~timork/salscale.html, 2003. [Kumar05] M. P. Kumar, P. H. S. Torr, and A. Zisserman. Obj cut. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Diego, pages 18-25, 2005. [Leibe04] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004. [Leung98] T. Leung and J. Malik. Contour continuity and region based image segmentation. In Proceedings of the 5th European Conference on Computer Vision, Freiburg, Germany, LNCS 1406, pages 544-559. Springer-Verlag, 1998. [Leung95] T.K. Leung, M.C. Burl, and P. Perona. Finding faces in cluttered scenes using random labeled graph matching. Proceedings of the 5th International Conference on Computer Vision, Boston, pages 637-644, June 1995.
    52. 53. [Leung98] T.K. Leung, M.C. Burl, and P. Perona. Probabilistic ane invariants for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 678-684, 1998. [Lindeberg98] T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2):77-116, 1998. [Lowe99] D. Lowe. Object recognition from local scale-invariant features. In Proceedings of the 7th International Conference on Computer Vision, Kerkyra, Greece, pages 1150-1157, September 1999. [Lowe01] D. Lowe. Local feature view clustering for 3D object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, pages 682-688. Springer, December 2001. [Lowe04] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004. [Mardia89] K.V. Mardia and I.L. Dryden. Shape Distributions for Landmark Data&quot;. Advances in Applied Probability, 21:742-755, 1989. [Sivic05] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering object categories in image collections. Technical Report A. I. Memo 2005-005, Massachusetts Institute of Technology, 2005. [Sivic03] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the International Conference on Computer Vision, pages 1470-1477, October 2003. [Sudderth05] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. In Proceedings of the IEEE International Conference on Computer Vision, Beijing, page To appear, 2005. [Torralba04] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: ecient boosting procedures for multiclass object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, pages 762-769, 2004.
    53. 54. [Viola01] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 511{518, 2001. [Weber00] M.Weber. Unsupervised Learning of Models for Object Recognition. PhD thesis, California Institute of Technology, Pasadena, CA, 2000. [Weber00a] M. Weber, W. Einhauser, M. Welling, and P. Perona. Viewpoint-invariant learning and detection of human heads. In Proc. 4th IEEE Int. Conf. Autom. Face and Gesture Recog., FG2000, pages 20{27, March 2000. [Weber00b] M. Weber, M. Welling, and P. Perona. Towards automatic discovery of object categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2101{2108, June 2000. [Weber00c] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. In Proc. 6th Europ. Conf. Comp. Vis., ECCV2000, volume 1, pages 18{32, June 2000. [Welling05] M. Welling. An expectation maximization algorithm for inferring oset-normal shape distributions. In Tenth International Workshop on Articial Intelligence and Statistics, 2005. [Winn05] J. Winn and N. Joijic. Locus: Learning object classes with unsupervised segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Beijing, page To appear, 2005
    54. 55. Quest for A Stochastic Grammar of Images Song-Chun Zhu and David Mumford
    55. 60. Example scheme <ul><li>Model shape using Gaussian distribution on location between parts </li></ul><ul><li>Model appearance as pixel templates </li></ul><ul><li>Represent image as collection of regions </li></ul><ul><ul><li>Extracted by template matching: normalized-cross correlation </li></ul></ul><ul><li>Manually trained model </li></ul><ul><ul><li>Click on training images </li></ul></ul>
    56. 61. Connectivity of parts <ul><li>To find best match in image, we want most probable state of L, </li></ul><ul><li>Run max-product message passing </li></ul>S(L,2) S(L,3) A(L) A(2) A(3) L 3 2 m b m c m d Take O(N 2 ) to compute: For each of the N values of L, need to find max over N states S(L) m a
    57. 62. Different graph structures 1 3 4 5 6 2 1 3 4 5 6 2 Fully connected Star structure 1 3 4 5 6 2 Tree structure O(N 6 ) O(N 2 ) O(N 2 ) <ul><li>Sparser graphs cannot capture all interactions between parts </li></ul>
    58. 63. Euclidean & Affine Shape <ul><li>Translation, rotation and scaling Euclidean Shape </li></ul><ul><li>Removal of camera foreshortenings Affine Shape </li></ul><ul><li>Assume Gaussian density in figure space </li></ul><ul><li>What is the probability density for the shape variables in each of the different spaces? </li></ul>     Figures from [Leung98] Feature space Euclidean shape Affine shape Translation Invariant shape
    59. 64. Translation-invariant shape <ul><li>Figure space density: </li></ul><ul><li>Translation-invariant form </li></ul><ul><li>Shape space density is still Gaussian </li></ul>e.g. P=3, move 1 st part to origin
    60. 65. Affine Shape Density <ul><li>Affine Shape density (Dryden-Mardia): </li></ul><ul><li>Euclidean Shape density is of similar form </li></ul>[Leung98],[Welling05] <ul><li>Can learnt parameters of DM density with EM! </li></ul>
    61. 66. Shape <ul><li>Shape is “ what remains after differences due to translation, rotation, and scale have been factored out ” . [Kendall84] </li></ul><ul><li>Statistical theory of shape [Kendall, Bookstein, Mardia & Dryden] </li></ul>X Y Figure Space U V Shape Space  Figures from [Leung98]
    62. 67. Learning
    63. 68. Learning situations <ul><li>Varying levels of supervision </li></ul><ul><ul><li>Unsupervised </li></ul></ul><ul><ul><li>Image labels </li></ul></ul><ul><ul><li>Object centroid/bounding box </li></ul></ul><ul><ul><li>Segmented object </li></ul></ul><ul><ul><li>Manual correspondence (typically sub-optimal) </li></ul></ul><ul><li>Generative models naturally incorporate labelling information (or lack of it) </li></ul><ul><li>Discriminative schemes require labels for all data points </li></ul>Contains a motorbike
    64. 70. <ul><li>Task: Estimation of model parameters </li></ul>Learning using EM <ul><li>Let the assignments be a hidden variable and use EM algorithm to learn them and the model parameters </li></ul><ul><li>Chicken and Egg type problem, since we initially know neither: </li></ul><ul><ul><li>Model parameters </li></ul></ul><ul><ul><li>- Assignment of regions to parts </li></ul></ul>
    65. 71. Learning procedure E-step: Compute assignments for which regions belong to which part M-step: Update model parameters <ul><li>Find regions & their location & appearance </li></ul><ul><li>Initialize model parameters </li></ul><ul><li>Use EM and iterate to convergence: </li></ul><ul><li>Trying to maximize likelihood – consistency in shape & appearance </li></ul>
    66. 72. Example scheme, using EM for maximum likelihood learning 1. Current estimate of  ... Image 1 Image 2 Image i 2. Assign probabilities to constellations Large P Small P 3. Use probabilities as weights to re-estimate parameters. Example:  Large P x + Small P x pdf new estimate of  + … =
    67. 73. Priors <ul><li>Implicit </li></ul><ul><ul><li>Structure of dependencies in model </li></ul></ul><ul><ul><li>Parameterisation of model </li></ul></ul><ul><ul><li>Feature detectors </li></ul></ul><ul><li>Explicit </li></ul><ul><ul><li>p(  ) </li></ul></ul><ul><ul><li>MAP / Bayesian learning </li></ul></ul><ul><ul><li>Fei-Fei ‘03 </li></ul></ul>model (  ) space  1  2  n p(  n ) p(  2 ) p(  1 )
    68. 74. Learning Shape & Appearance simultaneously Fergus et al. ‘03
    69. 75. Learn appearance then shape Parameter Estimation Choice 1 Choice 2 Parameter Estimation Model 1 Model 2 Predict / measure model performance (validation set or directly from model) Preselected Parts (  100) Weber et al. ‘00
    70. 76. Discriminative training <ul><li>Sparse so parts need to be distinctive of class </li></ul><ul><li>Boosted parts and structure models </li></ul><ul><ul><li>Amores et al. CVPR 2005 </li></ul></ul><ul><ul><li>Bar Hillel et al. CVPR 2005 </li></ul></ul><ul><li>Discriminative features </li></ul><ul><ul><li>Weber et al. 2000 </li></ul></ul><ul><ul><li>Ullman et al. </li></ul></ul><ul><li>Train discriminatively on parameters of generative model </li></ul><ul><ul><li>Holub, Welling, Perona ICCV 2005 </li></ul></ul>
    71. 77. Number of training images <ul><li>More supervision, fewer images needed </li></ul><ul><ul><li>Few unknown parameters </li></ul></ul><ul><li>Less supervision, more images. </li></ul><ul><ul><li>Lots of unknown parameters </li></ul></ul><ul><li>Over-fitting problems </li></ul>
    72. 78. Number of training examples 6 part Motorbike model Priors

    ×