Machine Learning in Computer Vision

  1. 1. Machine Learning in Computer Vision Fei-Fei Li
  2. 2. What is (computer) vision? • When we “see” something, what does it involve? • Take a picture with a camera: it is just a bunch of colored dots (pixels) • We want to make computers understand images • Looks easy, but not really… [Diagram: image (or video) → sensing device → interpreting device → interpretations, e.g. corn / mature corn in a cornfield / plant / blue sky in the background, etc.]
  3. 3. What is it related to? [Diagram: computer vision surrounded by related fields] • Biology • Psychology • Neuroscience • Cognitive sciences • Information engineering • Computer science • Robotics • Speech • Information retrieval • Machine learning • Physics • Maths
  4. 4. Quiz?
  5. 5. What about this?
  6. 6. A picture is worth a thousand words. --- Confucius or Printers’ Ink Ad (1921)
  7. 7. A picture is worth a thousand words. --- Confucius or Printers’ Ink Ad (1921) horizontal lines textured blue on the top vertical white shadow to the left porous oblique large green patches
  8. 8. A picture is worth a thousand words. --- Confucius or Printers’ Ink Ad (1921) clear sky autumn leaves building court yard bicycles trees campus day time talking outdoor people
  9. 9. Today: machine learning methods for object recognition
  10. 10. outline • Intro to object categorization • Brief overview – Generative – Discriminative • Generative models • Discriminative models
  11. 11. How many object categories are there? Biederman 1987
  12. 12. Challenges 1: view point variation Michelangelo 1475-1564
  13. 13. Challenges 2: illumination slide credit: S. Ullman
  14. 14. Challenges 3: occlusion Magritte, 1957
  15. 15. Challenges 4: scale
  16. 16. Challenges 5: deformation Xu, Beihong 1943
  17. 17. Challenges 6: background clutter Klimt, 1913
  18. 18. History: single object recognition
  19. 19. History: single object recognition • Lowe, et al. 1999, 2003 • Mahamud and Hebert, 2000 • Ferrari, Tuytelaars, and Van Gool, 2004 • Rothganger, Lazebnik, and Ponce, 2004 • Moreels and Perona, 2005 •…
  20. 20. Challenges 7: intra-class variation
  21. 21. Object categorization: the statistical viewpoint p ( zebra | image) vs. p (no zebra|image) • Bayes rule: p ( zebra | image) p (image | zebra ) p ( zebra) = ⋅ p (no zebra | image) p (image | no zebra) p (no zebra ) posterior ratio likelihood ratio prior ratio
  22. 22. Object categorization: the statistical viewpoint p ( zebra | image) p (image | zebra ) p ( zebra) = ⋅ p (no zebra | image) p (image | no zebra) p (no zebra ) posterior ratio likelihood ratio prior ratio • Discriminative methods model posterior • Generative methods model likelihood and prior
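A minimal sketch of how the posterior ratio on this slide drives a decision. The function name and all numbers are hypothetical, chosen only to show a large likelihood ratio being outweighed by a small prior.

```python
def posterior_ratio(lik_zebra, lik_no_zebra, prior_zebra):
    """Return p(zebra | image) / p(no zebra | image) via Bayes rule."""
    likelihood_ratio = lik_zebra / lik_no_zebra
    prior_ratio = prior_zebra / (1.0 - prior_zebra)
    return likelihood_ratio * prior_ratio


# Hypothetical numbers: the image is 20x more likely under the zebra model,
# but zebras are assumed to appear in only 1% of images.
ratio = posterior_ratio(lik_zebra=0.02, lik_no_zebra=0.001, prior_zebra=0.01)
decision = "zebra" if ratio > 1.0 else "no zebra"
```

With these assumed values the prior ratio (about 0.01) outweighs the likelihood ratio (20), so the posterior ratio stays below 1 and the decision is "no zebra".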
  23. 23. Discriminative p( zebra | image) • Direct modeling of p (no zebra | image) Decision Zebra boundary Non-zebra
  24. 24. Generative • Model p(image | zebra) and p (image | no zebra) p(image | zebra) p (image | no zebra) Low Middle High Middle Low
  25. 25. Three main issues • Representation – How to represent an object category • Learning – How to form the classifier, given training data • Recognition – How the classifier is to be used on novel data
  26. 26. Representation – Generative / discriminative / hybrid
  27. 27. Representation – Generative / discriminative / hybrid – Appearance only or location and appearance
  28. 28. Representation – Generative / discriminative / hybrid – Appearance only or location and appearance – Invariances • View point • Illumination • Occlusion • Scale • Deformation • Clutter • etc.
  29. 29. Representation – Generative / discriminative / hybrid – Appearance only or location and appearance – invariances – Part-based or global w/sub-window
  30. 30. Representation – Generative / discriminative / hybrid – Appearance only or location and appearance – invariances – Parts or global w/sub- window – Use set of features or each pixel in image
  31. 31. Learning – Unclear how to model categories, so we learn what distinguishes them rather than manually specify the difference -- hence current interest in machine learning
  32. 32. Learning – Unclear how to model categories, so we learn what distinguishes them rather than manually specify the difference -- hence current interest in machine learning – Methods of training: generative vs. discriminative [Figure: two plots against x: class densities p(x|C1), p(x|C2) and posterior probabilities p(C1|x), p(C2|x)]
  33. 33. Learning – Unclear how to model categories, so we learn what distinguishes them rather than manually specify the difference -- hence current interest in machine learning – What are you maximizing? Likelihood (Gen.) or performance on train/validation set (Disc.) – Level of supervision • Manual segmentation; bounding box; image labels; noisy labels [Image label example: “Contains a motorbike”]
  34. 34. Learning – Unclear how to model categories, so we learn what distinguishes them rather than manually specify the difference -- hence current interest in machine learning – What are you maximizing? Likelihood (Gen.) or performance on train/validation set (Disc.) – Level of supervision • Manual segmentation; bounding box; image labels; noisy labels – Batch/incremental (on category and image level; user feedback)
  35. 35. Learning – Unclear how to model categories, so we learn what distinguishes them rather than manually specify the difference -- hence current interest in machine learning – What are you maximizing? Likelihood (Gen.) or performance on train/validation set (Disc.) – Level of supervision • Manual segmentation; bounding box; image labels; noisy labels – Batch/incremental (on category and image level; user feedback) – Training images: • Issue of overfitting • Negative images for discriminative methods
  36. 36. Learning – Unclear how to model categories, so we learn what distinguishes them rather than manually specify the difference -- hence current interest in machine learning – What are you maximizing? Likelihood (Gen.) or performance on train/validation set (Disc.) – Level of supervision • Manual segmentation; bounding box; image labels; noisy labels – Batch/incremental (on category and image level; user feedback) – Training images: • Issue of overfitting • Negative images for discriminative methods – Priors
  37. 37. Recognition – Scale / orientation range to search over – Speed
  38. 38. Bag-of-words models
  39. 39. Object Bag of ‘words’
  40. 40. Analogy to documents • Document 1: “Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.” Keywords: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel • Document 2: “China is forecasting a trade surplus of $90bn (£51bn) to $100bn this year, a threefold increase on 2004's $32bn. The Commerce Ministry said the surplus would be created by a predicted 30% jump in exports to $750bn, compared with a 18% rise in imports to $660bn. The figures are likely to further annoy the US, which has long argued that China's exports are unfairly helped by a deliberately undervalued yuan. Beijing agrees the surplus is too high, but says the yuan is only one factor. Bank of China governor Zhou Xiaochuan said the country also needed to do more to boost domestic demand so more goods stayed within the country. China increased the value of the yuan against the dollar by 2.1% in July and permitted it to trade within a narrow band, but the US wants the yuan to be allowed to trade freely. However, Beijing has made it clear that it will take its time and tread carefully before allowing the yuan to rise further in value.” Keywords: China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, trade, value
  41. 41. learning recognition codewords dictionary feature detection & representation image representation category models category (and/or) classifiers decision
  42. 42. Representation • 1. feature detection & representation • 2. codewords dictionary • 3. image representation
  43. 43. 1.Feature detection and representation
  44. 44. 1.Feature detection and representation • Detect patches [Mikolajczyk and Schmid ’02] [Matas et al. ’02] [Sivic et al. ’03] • Normalize patch • Compute SIFT descriptor [Lowe’99] Slide credit: Josef Sivic
  45. 45. 1.Feature detection and representation …
  46. 46. 2. Codewords dictionary formation …
  47. 47. 2. Codewords dictionary formation … Vector quantization Slide credit: Josef Sivic
  48. 48. 2. Codewords dictionary formation Fei-Fei et al. 2005
  49. 49. 3. Image representation frequency ….. codewords
  50. 50. Representation • 1. feature detection & representation • 2. codewords dictionary • 3. image representation
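The three representation steps can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the cited papers: random vectors stand in for real SIFT descriptors, plain k-means is assumed as the vector quantizer, and the function names (`build_codebook`, `assign_codewords`, `bow_histogram`) are hypothetical.

```python
import numpy as np


def assign_codewords(descriptors, centers):
    """Index of the nearest codeword (Euclidean distance) for each descriptor."""
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def build_codebook(descriptors, k, n_iter=20, seed=0):
    """Vector-quantize descriptors into k codewords with plain k-means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct descriptors chosen at random.
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign_codewords(descriptors, centers)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous center
                centers[j] = members.mean(axis=0)
    return centers


def bow_histogram(descriptors, centers):
    """Codeword frequency histogram that represents one image."""
    labels = assign_codewords(descriptors, centers)
    return np.bincount(labels, minlength=len(centers))


# Random vectors stand in for real 128-D descriptors (an assumption).
rng = np.random.default_rng(1)
training_descriptors = rng.normal(size=(500, 128))
codebook = build_codebook(training_descriptors, k=10)
image_descriptors = rng.normal(size=(40, 128))
histogram = bow_histogram(image_descriptors, codebook)
```

Each image is reduced to a fixed-length vector of codeword counts, regardless of how many patches were detected in it.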
  51. 51. Learning and Recognition codewords dictionary category models category (and/or) classifiers decision
  52. 52. 2 case studies 1. Naïve Bayes classifier – Csurka et al. 2004 2. Hierarchical Bayesian text models (pLSA and LDA) – Background: Hofmann 2001, Blei et al. 2004 – Object categorization: Sivic et al. 2005, Sudderth et al. 2005 – Natural scene categorization: Fei-Fei et al. 2005
  53. 53. First, some notations • $w_n$: each patch in an image – $w_n = [0, 0, \ldots, 1, \ldots, 0, 0]^T$ • $\mathbf{w}$: a collection of all N patches in an image – $\mathbf{w} = [w_1, w_2, \ldots, w_N]$ • $d_j$: the jth image in an image collection • $c$: category of the image • $z$: theme or topic of the patch
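  54. 54. Case #1: the Naïve Bayes model • Graphical model: c → w (plate N) • $c^* = \arg\max_c p(c \mid \mathbf{w}) \propto p(c)\, p(\mathbf{w} \mid c) = p(c) \prod_{n=1}^{N} p(w_n \mid c)$ • $c^*$: object class decision • $p(c)$: prior prob. of the object classes • $p(\mathbf{w} \mid c)$: image likelihood given the class • Csurka et al. 2004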
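A minimal sketch of the Naïve Bayes decision rule above, working in log space on codeword histograms. Laplace smoothing (`alpha`) is an assumption added here to avoid zero probabilities, and the tiny 3-codeword dataset is made up for illustration.

```python
import numpy as np


def train_naive_bayes(histograms, labels, n_classes, alpha=1.0):
    """Estimate log p(c) and log p(w | c) from per-image codeword histograms."""
    histograms = np.asarray(histograms, dtype=float)
    labels = np.asarray(labels)
    log_prior = np.zeros(n_classes)
    log_word = np.zeros((n_classes, histograms.shape[1]))
    for c in range(n_classes):
        in_class = labels == c
        log_prior[c] = np.log(in_class.mean())
        counts = histograms[in_class].sum(axis=0) + alpha  # Laplace smoothing (assumed)
        log_word[c] = np.log(counts / counts.sum())
    return log_prior, log_word


def classify(histogram, log_prior, log_word):
    """c* = argmax_c [ log p(c) + sum_n log p(w_n | c) ]."""
    scores = log_prior + log_word @ np.asarray(histogram, dtype=float)
    return int(np.argmax(scores))


# Made-up training set: class 0 favors codeword 0, class 1 favors codeword 2.
train_histograms = [[8, 1, 1], [7, 2, 1], [1, 1, 8], [2, 1, 7]]
train_labels = [0, 0, 1, 1]
log_prior, log_word = train_naive_bayes(train_histograms, train_labels, n_classes=2)
predicted = classify([6, 2, 2], log_prior, log_word)
```

Summing log-probabilities weighted by the histogram is equivalent to the product over patches, because patches assigned to the same codeword contribute the same factor.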
  55. 55. Case #2: Hierarchical Bayesian text models • Probabilistic Latent Semantic Analysis (pLSA): d → z → w (plates N, D). Hofmann, 2001 • Latent Dirichlet Allocation (LDA): c → π → z → w (plates N, D). Blei et al., 2001
  56. 56. Case #2: Hierarchical Bayesian text models Probabilistic Latent Semantic Analysis (pLSA) d z w N D “face” Sivic et al. ICCV 2005
  57. 57. Case #2: Hierarchical Bayesian text models “beach” Latent Dirichlet Allocation (LDA) c π z w N D Fei-Fei et al. ICCV 2005
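For the pLSA model (d → z → w), a compact EM sketch is given below. It assumes the standard mixture form $p(w \mid d) = \sum_z p(w \mid z)\, p(z \mid d)$; the function name `plsa`, the small constant added to denominators, and the toy count matrix are all assumptions for illustration. LDA adds priors over these distributions and is not covered by this sketch.

```python
import numpy as np


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM on a (documents x codewords) count matrix.

    Returns p(z | d) with shape (D, K) and p(w | z) with shape (K, W).
    """
    counts = np.asarray(counts, dtype=float)
    n_docs, n_words = counts.shape
    rng = np.random.default_rng(seed)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities p(z | d, w), shape (D, K, W).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        resp = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both distributions from expected counts.
        expected = counts[:, None, :] * resp
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z


# Toy corpus: four "images" over four codewords, with two visible groups.
toy_counts = [[9, 8, 0, 0], [7, 9, 1, 0], [0, 1, 8, 9], [0, 0, 9, 7]]
topic_given_doc, word_given_topic = plsa(toy_counts, n_topics=2)
```

Each row of `topic_given_doc` is the theme mixture of one image, which is the quantity the slides use to describe an image in terms of themes z.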
  58. 58. Another application • Human action classification
  59. 59. Invariance issues • Scale and rotation – Implicit – Detectors and descriptors Kadir and Brady. 2003
  60. 60. Invariance issues • Scale and rotation • Occlusion – Implicit in the models – Codeword distribution: small variations – (In theory) Theme (z) distribution: different occlusion patterns
  61. 61. Invariance issues • Scale and rotation • Occlusion • Translation – Encode (relative) location information Sudderth et al. 2005
  62. 62. Invariance issues • Scale and rotation • Occlusion • Translation • View point (in theory) – Codewords: detector and descriptor – Theme distributions: different view points Fergus et al. 2005
  63. 63. Model properties • Intuitive – Analogy to documents [Figure: the vision passage (“Of all the sensory impressions proceeding to the brain…”) overlaid with its keywords: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel]
  64. 64. Model properties • Intuitive • (Could use) generative models – Convenient for weakly- or un-supervised training – Prior information – Hierarchical Bayesian framework Sivic et al., 2005, Sudderth et al., 2005
  65. 65. Model properties • Intuitive • (Could use) generative models • Learning and recognition relatively fast – Compare to other methods
  66. 66. Weakness of the model • No rigorous geometric information of the object components • It’s intuitive to most of us that objects are made of parts – no such information • Not extensively tested yet for – View point invariance – Scale invariance • Segmentation and localization unclear
  67. 67. part-based models • Slides courtesy of Rob Fergus for “part-based models”
  68. 68. One-shot learning of object categories. Fei-Fei et al. ‘03, ‘04, ‘06
  69. 69. One-shot learning of object categories. Fei-Fei et al. ‘03, ‘04, ‘06 [Image: P. Bruegel, 1562]
  70. 70. model representation • One-shot learning of object categories. Fei-Fei et al. ‘03, ‘04, ‘06
  71. 71. X (location): (x,y) coords. of region center • A (appearance): 11x11 patch → normalize → projection onto PCA basis → c1, c2, …, c10
  72. 72. The Generative Model • X (location): (x,y) coords. of region center • A (appearance): 11x11 patch → normalize → projection onto PCA basis → c1, c2, …, c10 [Graphical model: μX, ΓX, μA, ΓA, h, X, A, I]
  73. 73. The Generative Model μX ΓX μA ΓA parameters hidden variable h X A observed variables I
  74. 74. The Generative Model ML/MAP μX ΓX μA ΓA h θ1 X A I θ2 θn where θ = {µX, ΓX, µA, ΓA} Weber et al. ’98 ’00, Fergus et al. ’03
  75. 75. The Generative Model μX ΓX ML/MAP shape model θ1 μA ΓA appearance model θ2 θn where θ = {µX, ΓX, µA, ΓA}
  76. 76. The Generative Model ML/MAP μX ΓX μA ΓA θ1 h θ2 I X A θn
  77. 77. The Generative Model Bayesian m0X a0X m0A a0A β0X B0X β0A B0A μX ΓX μA ΓA P θ1 h θ2 I X A θn Parameters to estimate: {mX, βX, aX, BX, mA, βA, aA, BA} Fei-Fei et al. ‘03, ‘04, ‘06 i.e. parameters of Normal-Wishart distribution
  78. 78. The Generative Model m0X a0X m0A a0A priors β0X B0X β0A B0A μX ΓX μA ΓA parameters P h I X A Fei-Fei et al. ‘03, ‘04, ‘06
  79. 79. The Generative Model m0X a0X m0A a0A priors β0X B0X β0A B0A μX ΓX μA ΓA P h I X A Prior distribution Fei-Fei et al. ‘03, ‘04, ‘06
  80. 80. One-shot learning of object categories • 1. human vision • 2. model representation • 3. learning & inferences • 4. evaluation & dataset & application
  81. 81. learning & inferences • No labeling • No segmentation • No alignment • One-shot learning of object categories. Fei-Fei et al. 2003, 2004, 2006
  82. 82. learning & inferences • Bayesian: θ1, θ2, …, θn • One-shot learning of object categories. Fei-Fei et al. 2003, 2004, 2006
  83. 83. Variational EM • Random initialization • E-Step • M-Step • new estimate of p(θ|train) • prior knowledge of p(θ) • Attias, Jordan, Hinton etc.
  84. 84. evaluation & dataset • One-shot learning of object categories. Fei-Fei et al. 2004, 2006a, 2006b
  85. 85. evaluation & dataset -- Caltech 101 Dataset • One-shot learning of object categories. Fei-Fei et al. 2004, 2006a, 2006b
  86. 86. evaluation & dataset -- Caltech 101 Dataset • One-shot learning of object categories. Fei-Fei et al. 2004, 2006a, 2006b
  87. 87. Part 3: discriminative methods
  88. 88. Discriminative methods • Object detection and recognition is formulated as a classification problem. • The image is partitioned into a set of overlapping windows, and a decision is taken at each window about whether it contains a target object or not. [Figure: “Where are the screens?” Bag of image patches in some feature space, with a decision boundary separating computer screen from background]
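The window-by-window decision can be sketched as below. The window size, stride, and the stand-in classifier `covers_center` are hypothetical; a real detector would score image content inside each window instead.

```python
def sliding_windows(width, height, window, stride):
    """Yield (x, y, w, h) boxes over the image; boxes overlap when stride < window."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, window, window)


def detect(image_size, window, stride, classify_window):
    """Return every window that the classifier labels as containing the object."""
    width, height = image_size
    return [box for box in sliding_windows(width, height, window, stride)
            if classify_window(box)]


# Hypothetical stand-in classifier: "object" if the window covers pixel (50, 50).
def covers_center(box):
    x, y, w, h = box
    return x <= 50 < x + w and y <= 50 < y + h


all_boxes = list(sliding_windows(100, 100, window=40, stride=20))
hits = detect((100, 100), window=40, stride=20, classify_window=covers_center)
```

Because neighboring windows overlap, a single object usually triggers several detections, which is why practical detectors add a step to merge nearby hits.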
  89. 89. Discriminative vs. generative • Generative model (The artist) • Discriminative model (The lousy painter) • Classification function [Plots: each of the three shown as a function of x = data]
  90. 90. Discriminative methods • Nearest neighbor ($10^6$ examples): Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; … • Neural networks: LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; … • Support Vector Machines and Kernels: Guyon, Vapnik; Heisele, Serre, Poggio, 2001; … • Conditional Random Fields: McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003; …
  91. 91. Formulation • Formulation: binary classification • Features: $x = x_1, x_2, x_3, \ldots, x_N$ (training data) and $x_{N+1}, x_{N+2}, \ldots, x_{N+M}$ (test data) • Labels: $y = -1, +1, -1, \ldots, -1$ for training data; unknown (?) for test data • Training data: each image patch is labeled as containing the object or background • Classification function: maps features to labels and belongs to some family of functions • Minimize misclassification error (Not that simple: we need some guarantees that there will be generalization)
  92. 92. Overview of section • Object detection with classifiers • Boosting – Gentle boosting – Weak detectors – Object model – Object detection • Multiclass object detection
  93. 93. Why boosting? • A simple algorithm for learning robust classifiers – Freund & Schapire, 1995 – Friedman, Hastie, Tibshirani, 1998 • Provides an efficient algorithm for sparse visual feature selection – Tieu & Viola, 2000 – Viola & Jones, 2003 • Easy to implement, requires no external optimization tools.
  94. 94. Boosting • Defines a classifier using an additive model [Equation labels: strong classifier; weak classifier; weight; features vector]
  95. 95. Boosting • Defines a classifier using an additive model [Equation labels: strong classifier; weak classifier; weight; features vector] • We need to define a family of weak classifiers, from which each weak classifier is chosen
  96. 96. Boosting • It is a sequential procedure • Each data point $x_t$ has a class label $y_t \in \{+1, -1\}$ and a weight $w_t = 1$ [Figure: data points $x_{t=1}$, $x_{t=2}$, …]
  97. 97. Toy example • Weak learners from the family of lines • Each data point has a class label $y_t \in \{+1, -1\}$ and a weight $w_t = 1$ • A line $h$ with $p(\text{error}) = 0.5$ is at chance
  98. 98. Toy example • Each data point has a class label $y_t \in \{+1, -1\}$ and a weight $w_t = 1$ • This one seems to be the best • This is a ‘weak classifier’: it performs slightly better than chance.
  99. 99. Toy example • Each data point has a class label $y_t \in \{+1, -1\}$ • We update the weights: $w_t \leftarrow w_t \exp\{-y_t H_t\}$ • We set a new problem for which the previous weak classifier performs at chance again
  100. 100. Toy example • Each data point has a class label $y_t \in \{+1, -1\}$ • We update the weights: $w_t \leftarrow w_t \exp\{-y_t H_t\}$ • We set a new problem for which the previous weak classifier performs at chance again
  101. 101. Toy example • Each data point has a class label $y_t \in \{+1, -1\}$ • We update the weights: $w_t \leftarrow w_t \exp\{-y_t H_t\}$ • We set a new problem for which the previous weak classifier performs at chance again
  102. 102. Toy example • Each data point has a class label $y_t \in \{+1, -1\}$ • We update the weights: $w_t \leftarrow w_t \exp\{-y_t H_t\}$ • We set a new problem for which the previous weak classifier performs at chance again
  103. 103. Toy example • Weak classifiers $f_1$, $f_2$, $f_3$, $f_4$ • The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers.
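The toy example can be reproduced with a short sketch. Two choices here are assumptions rather than the slides' exact setup: the weak learners are axis-aligned threshold classifiers (a subset of the family of lines), and the procedure is discrete AdaBoost rather than the gentle boosting named in the section overview. The weight update has the same form as on the slides, with the latest weighted weak classifier `alpha * pred` in place of $H_t$.

```python
import numpy as np


def stump_predict(X, feature, threshold, sign):
    """Axis-aligned weak classifier: sign * (+1 if x[feature] >= threshold else -1)."""
    return sign * np.where(X[:, feature] >= threshold, 1, -1)


def fit_stump(X, y, w):
    """Pick the stump with the lowest weighted error on the current weights."""
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for sign in (1, -1):
                pred = stump_predict(X, feature, threshold, sign)
                error = w[pred != y].sum()
                if best is None or error < best[0]:
                    best = (error, feature, threshold, sign)
    return best


def adaboost(X, y, n_rounds):
    """Discrete AdaBoost; returns a list of (alpha, feature, threshold, sign)."""
    w = np.full(len(y), 1.0 / len(y))  # every point starts with the same weight
    model = []
    for _ in range(n_rounds):
        error, feature, threshold, sign = fit_stump(X, y, w)
        error = np.clip(error, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - error) / error)
        pred = stump_predict(X, feature, threshold, sign)
        # Misclassified points gain weight, so this stump is at chance on the new problem.
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        model.append((alpha, feature, threshold, sign))
    return model


def strong_classify(X, model):
    """Sign of the weighted sum of weak classifiers."""
    total = sum(a * stump_predict(X, f, t, s) for a, f, t, s in model)
    return np.where(total >= 0, 1, -1)


# 6x6 grid of points; the positive class is a corner no single stump can isolate.
grid = np.array([(i, j) for i in range(6) for j in range(6)], dtype=float)
labels = np.where((grid[:, 0] >= 3) & (grid[:, 1] >= 3), 1, -1)
one_round = adaboost(grid, labels, n_rounds=1)
three_rounds = adaboost(grid, labels, n_rounds=3)
accuracy_1 = (strong_classify(grid, one_round) == labels).mean()
accuracy_3 = (strong_classify(grid, three_rounds) == labels).mean()
```

On this grid a single stump gets three quarters of the points right, while the three-stump combination separates the corner exactly, which mirrors the slide's point that a non-linear strong classifier emerges from linear weak ones.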
  104. 104. From images to features: Weak detectors • We will now define a family of visual features that can be used as weak classifiers (“weak detectors”) • Each takes an image as input and outputs a binary response; that binary output is the weak detector.
  105. 105. Weak detectors • Textures of textures. Tieu and Viola, CVPR 2000 • Every combination of three filters generates a different feature • This gives thousands of features. Boosting selects a sparse subset, so computations at test time are very efficient. Boosting also avoids overfitting to some extent.
  106. 106. Weak detectors Haar filters and integral image Viola and Jones, ICCV 2001 The average intensity in the block is computed with four sums independently of the block size.
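A minimal sketch of the integral image idea, assuming a zero-padded cumulative-sum table; the function names are hypothetical. Any block sum needs four table lookups, so the cost does not depend on the block size.

```python
import numpy as np


def integral_image(image):
    """Cumulative sum over rows and columns, padded with a leading row and column of zeros."""
    ii = np.zeros((image.shape[0] + 1, image.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)
    return ii


def block_sum(ii, top, left, height, width):
    """Sum of image[top:top+height, left:left+width] from four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]


def block_mean(ii, top, left, height, width):
    """Average intensity in the block, independent of block size in cost."""
    return block_sum(ii, top, left, height, width) / (height * width)


image = np.arange(20, dtype=np.float64).reshape(4, 5)
ii = integral_image(image)
mean = block_mean(ii, top=1, left=2, height=2, width=3)
```

A Haar-like filter response is then a difference of two or more such block sums, each obtained from the same precomputed table.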
  107. 107. Weak detectors Other weak detectors: • Carmichael, Hebert 2004 • Yuille, Snow, Nitzberg, 1998 • Amit, Geman 1998 • Papageorgiou, Poggio, 2000 • Heisele, Serre, Poggio, 2001 • Agarwal, Awan, Roth, 2004 • Schneiderman, Kanade 2004 • …
  108. 108. Weak detectors Part based: similar to part-based generative models. We create weak detectors by using parts and voting for the object center location Car model Screen model These features are used for the detector on the course web site.
  109. 109. Weak detectors First we collect a set of part templates from a set of training objects. Vidal-Naquet, Ullman (2003) …
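  110. 110. Weak detectors • We now define a family of “weak detectors” as: [Figure: equation composed of image panels joined by = and *] • Better than chance
  111. 111. Weak detectors • We can do a better job using filtered images [Figure: equation composed of image panels joined by = and *] • Still a weak detector but better than before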
  112. 112. Training • First we evaluate all the N features on all the training images. • Then, we sample the feature outputs at the object center and at random locations in the background.
  113. 113. Representation and object model Selected features for the screen detector … … 1 2 3 4 10 100 Lousy painter
  114. 114. Representation and object model Selected features for the car detector … … 1 2 3 4 10 100
  115. 115. Overview of section • Object detection with classifiers • Boosting – Gentle boosting – Weak detectors – Object model – Object detection • Multiclass object detection
  116. 116. Example: screen detection Feature output
  117. 117. Example: screen detection • Feature output • Thresholded output • Weak ‘detector’: produces many false alarms.
  118. 118. Example: screen detection • Feature output • Thresholded output • Strong classifier at iteration 1
  119. 119. Example: screen detection • Feature output • Thresholded output • Strong classifier • Second weak ‘detector’: produces a different set of false alarms.
  120. 120. Example: screen detection • Feature output • Thresholded output • Strong classifier • + Strong classifier at iteration 2
  121. 121. Example: screen detection • Feature output • Thresholded output • Strong classifier • + … Strong classifier at iteration 10
  122. 122. Example: screen detection • Feature output • Thresholded output • Strong classifier • + … Adding features • Final classification • Strong classifier at iteration 200
  123. 123. applications
  124. 124. Document Analysis Digit recognition, AT&T labs http://www.research.att.com/~yann/
  125. 125. Medical Imaging
  126. 126. Robotics
  127. 127. Toys and robots
  128. 128. Finger prints
  129. 129. Surveillance
  130. 130. Security
  131. 131. Searching the web
