  1. Object Detection with Discriminatively Trained Part Based Models
     Pedro F. Felzenszwalb, Department of Computer Science, University of Chicago
     Joint work with David McAllester, Deva Ramanan, Ross Girshick
  2. PASCAL Challenge
     • ~10,000 images, with ~25,000 target objects
       - Objects from 20 categories (person, car, bicycle, cow, table...)
       - Objects are annotated with labeled bounding boxes
  3. Why is it hard?
     • Objects in rich categories exhibit significant variability
       - Photometric variation
       - Viewpoint variation
       - Intra-class variability
         - Cars come in a variety of shapes (sedan, minivan, etc.)
         - People wear different clothes and take different poses
     We need rich object models, but this leads to difficult matching and
     training problems.
  4. Starting point: sliding window classifiers
     Feature vector x = [..., ..., ..., ...]
     • Detect objects by testing each subwindow
       - Reduces object detection to binary classification
       - Dalal & Triggs: HOG features + linear SVM classifier
       - Previous state of the art for detecting people
  5. Histogram of Gradient (HOG) features
     • Image is partitioned into 8x8 pixel blocks
     • In each block we compute a histogram of gradient orientations
       - Invariant to changes in lighting, small deformations, etc.
     • Compute features at different resolutions (pyramid)
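The per-block histogram computation above can be sketched as follows. This is a simplified illustration only (no block normalization or soft binning, which practical HOG implementations add), and the function name is hypothetical:

```python
import numpy as np

def hog_block_histograms(image, cell=8, bins=9):
    """Sketch of HOG-style features: per-cell histograms of gradient
    orientations (unsigned, 0..180 degrees), weighted by magnitude.
    image: 2D grayscale array."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    h, w = image.shape
    H, W = h // cell, w // cell
    # which orientation bin each pixel falls into
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    hist = np.zeros((H, W, bins))
    for i in range(H):
        for j in range(W):
            ys = slice(i * cell, (i + 1) * cell)
            xs = slice(j * cell, (j + 1) * cell)
            for b in range(bins):
                # accumulate gradient magnitude per orientation bin
                hist[i, j, b] = mag[ys, xs][bin_idx[ys, xs] == b].sum()
    return hist
```

A 16x16 image with 8x8 cells yields a 2x2 grid of 9-bin histograms; a feature pyramid repeats this at multiple image scales.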
  6. HOG Filters
     • Array of weights for features in subwindow of HOG pyramid
     • Score is dot product of filter and feature vector
     Score of filter F at position p is F · φ(p, H), where φ(p, H) is the
     concatenation of HOG features from the subwindow of HOG pyramid H
     specified by p.
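The filter score is literally this dot product. A minimal sketch, assuming feature maps stored as numpy arrays indexed [row, column, orientation bin] (a layout chosen here for illustration):

```python
import numpy as np

def filter_score(F, feats, p):
    """Score of HOG filter F at position p = (y, x): dot product of the
    filter weights with phi(p, H), the concatenated HOG features of the
    subwindow F covers."""
    y, x = p
    fh, fw, _ = F.shape
    phi = feats[y:y + fh, x:x + fw, :].ravel()   # phi(p, H)
    return float(np.dot(F.ravel(), phi))
```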
  7. Dalal & Triggs: HOG + linear SVMs
     [figure: filter responses φ(p, H), φ(q, H); typical form of a model]
     There is much more background than objects.
     Start with random negatives and repeat:
       1) Train a model
       2) Harvest false positives to define “hard negatives”
  8. Overview of our models
     • Mixture of deformable part models
     • Each component has global template + deformable parts
     • Fully trained from bounding boxes alone
  9. 2 component bicycle model
     [figure: root filters (coarse resolution), part filters (finer
     resolution), deformation models]
     Each component has a root filter F0 and n part models (Fi, vi, di)
  10. Object hypothesis
      z = (p0, ..., pn)
      p0: location of root
      p1, ..., pn: locations of parts
      Score is sum of filter scores minus deformation costs.
      [figure: image pyramid and HOG feature pyramid]
      Multiscale model captures features at two resolutions.
  11. Score of a hypothesis
      “data term” + “spatial prior”:
        score(p0, ..., pn) = Σ_{i=0}^{n} Fi · φ(H, pi) − Σ_{i=1}^{n} di · (dxi², dyi²)
      where the Fi are filters, the di are deformation parameters, and
      (dxi, dyi) are part displacements.
      Equivalently, score(z) = β · Ψ(H, z), where β is the concatenation of
      the filters and deformation parameters, and Ψ(H, z) is the
      concatenation of HOG features and part displacement features.
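The score decomposes into per-part terms, so given the filter responses and part displacements it is a one-line sum. A sketch, using the slide's quadratic deformation cost di · (dxi², dyi²):

```python
import numpy as np

def hypothesis_score(filter_scores, displacements, deform_params):
    """Score of hypothesis z = (p0, ..., pn).

    filter_scores: [F_i · φ(H, p_i) for i = 0..n]  (data term)
    displacements: [(dx_i, dy_i) for i = 1..n]
    deform_params: [d_i = (d_ix, d_iy) for i = 1..n]  (spatial prior)
    """
    data = float(np.sum(filter_scores))
    prior = sum(dix * dx ** 2 + diy * dy ** 2
                for (dx, dy), (dix, diy) in zip(displacements, deform_params))
    return data - prior
```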
  12. Matching
      • Define an overall score for each root location
        - Based on best placement of parts:
          score(p0) = max over p1, ..., pn of score(p0, ..., pn)
      • High scoring root locations define detections
        - “sliding window approach”
      • Efficient computation: dynamic programming + generalized distance
        transforms (max-convolution)
  13. [figure: input image, head filter]
      Response of filter F in the l-th pyramid level (cross-correlation):
        Rl(x, y) = F · φ(H, (x, y, l))
      Transformed response:
        Dl(x, y) = max over (dx, dy) of [ Rl(x + dx, y + dy) − di · (dx², dy²) ]
      Max-convolution, computed in linear time (spreading, local max, etc.)
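The transformed response can be made concrete with a brute-force search over displacements. The slide's linear-time generalized distance transform replaces exactly this search; the sketch below is only the definition, with a small displacement window and [row, column] indexing (the slide writes Rl(x, y); axis order here is an implementation choice):

```python
import numpy as np

def transformed_response(R, d, max_disp=2):
    """D(y, x) = max_{dx,dy} [ R(y+dy, x+dx) - d_y*dy^2 - d_x*dx^2 ].
    Brute force over a (2*max_disp+1)^2 window; the real system uses a
    linear-time generalized distance transform instead."""
    h, w = R.shape
    dy_cost, dx_cost = d
    D = np.full((h, w), -np.inf)
    for y in range(h):
        for x in range(w):
            for dy in range(-max_disp, max_disp + 1):
                for dx in range(-max_disp, max_disp + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        val = R[yy, xx] - dy_cost * dy ** 2 - dx_cost * dx ** 2
                        D[y, x] = max(D[y, x], val)
    return D
```

The effect is to "spread" each strong filter response to nearby locations, discounted by the quadratic deformation cost.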
  14. [figure: model, feature map, and feature map at twice the resolution;
      responses of root and part filters; transformed part responses
      combined with the root response into a score for each root location;
      color encoding of filter response values]
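The combination step in the figure, adding shifted transformed part responses to the root response, can be illustrated at a single resolution (the real model evaluates parts at twice the root's resolution; the anchor offsets v_i below are hypothetical):

```python
import numpy as np

def root_location_scores(root_response, part_transformed, anchors):
    """score(p0) = R_0(p0) + sum_i D_i(p0 + v_i), at one resolution.

    part_transformed: list of transformed responses D_i
    anchors: list of anchor offsets v_i = (vy, vx) for each part
    """
    total = root_response.copy()
    h, w = total.shape
    for D, (vy, vx) in zip(part_transformed, anchors):
        # shift D_i by the anchor offset; out-of-range locations get -inf
        shifted = np.full((h, w), -np.inf)
        for y in range(h):
            for x in range(w):
                yy, xx = y + vy, x + vx
                if 0 <= yy < h and 0 <= xx < w:
                    shifted[y, x] = D[yy, xx]
        total += shifted
    return total
```

High-scoring entries of the result define detections, in sliding-window fashion.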
  15. Matching results (after non-maximum suppression)
      ~1 second to search all scales
  16. Training
      • Training data consists of images with labeled bounding boxes.
      • Need to learn the model structure, filters and deformation costs.
  17. Latent SVM (MI-SVM)
      Classifiers that score an example x using
        fβ(x) = max over z ∈ Z(x) of β · Φ(x, z)
      β are model parameters, z are latent values.
      Training data D = (⟨x1, y1⟩, ..., ⟨xn, yn⟩), yi ∈ {−1, 1}.
      We would like to find β such that yi fβ(xi) > 0.
      Minimize
        LD(β) = ½ ||β||² + C Σ_{i=1}^{n} max(0, 1 − yi fβ(xi))
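If each example's latent values are enumerated as rows of a feature matrix, fβ and the objective LD are short to write down; a sketch under that (assumed) representation:

```python
import numpy as np

def f_beta(beta, Phi_z):
    """f_beta(x) = max over latent z in Z(x) of beta · Phi(x, z).
    Phi_z: array with one feature vector per latent value z."""
    return float(np.max(Phi_z @ beta))

def latent_svm_objective(beta, data, C=1.0):
    """L_D(beta) = 0.5 ||beta||^2 + C * sum_i max(0, 1 - y_i f_beta(x_i)).
    data: list of (Phi_z, y) pairs with y in {-1, +1}."""
    hinge = sum(max(0.0, 1.0 - y * f_beta(beta, Phi_z)) for Phi_z, y in data)
    return 0.5 * float(np.dot(beta, beta)) + C * hinge
```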
  18. Semi-convexity
      • Maximum of convex functions is convex
      • fβ(x) = max over z ∈ Z(x) of β · Φ(x, z) is convex in β
      • max(0, 1 − yi fβ(xi)) is convex for negative examples
        LD(β) = ½ ||β||² + C Σ_{i=1}^{n} max(0, 1 − yi fβ(xi))
      Convex if latent values for positive examples are fixed.
  19. Latent SVM training
        LD(β) = ½ ||β||² + C Σ_{i=1}^{n} max(0, 1 − yi fβ(xi))
      • Convex if we fix z for positive examples
      • Optimization:
        - Initialize β and iterate:
          - Pick best z for each positive example
          - Optimize β via gradient descent with data-mining
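The alternation can be sketched as a small coordinate-descent loop: fix the best z for each positive under the current β, then take subgradient steps on the resulting convex objective. This is a toy version without the data-mining over negatives (covered on a later slide); hyperparameters are illustrative only:

```python
import numpy as np

def latent_svm_train(data, dim, C=1.0, outer=5, lr=0.01, epochs=50):
    """Alternating optimization for the latent SVM objective.
    data: list of (Phi_z, y) pairs, Phi_z holding one row per latent z."""
    beta = np.zeros(dim)
    for _ in range(outer):
        # step 1: pick the best latent z for each positive example
        fixed = [(Phi_z[int(np.argmax(Phi_z @ beta))], y)
                 for Phi_z, y in data if y > 0]
        for _ in range(epochs):
            # step 2: subgradient of 0.5||beta||^2 + C * sum of hinges
            g = beta.copy()
            for phi, y in fixed:                   # positives, z held fixed
                if y * np.dot(beta, phi) < 1:
                    g -= C * y * phi
            for Phi_z, y in data:                  # negatives, max over z
                if y < 0:
                    z = int(np.argmax(Phi_z @ beta))
                    if y * np.dot(beta, Phi_z[z]) < 1:
                        g -= C * y * Phi_z[z]
            beta -= lr * g
    return beta
```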
  20. Training Models
      • Reduce to Latent SVM training problem
      • Positive example specifies some z should have high score
      • Bounding box defines range of root locations
        - Parts can be anywhere
        - This defines Z(x)
  21. Background
      • Negative example specifies no z should have high score
      • One negative example per root location in a background image
        - Huge number of negative examples
        - Consistent with requiring low false-positive rate
  22. Training algorithm, nested iterations
      Fix “best” latent values for positives, then iterate:
        - Harvest high scoring (x, z) pairs from background images
        - Update model using gradient descent
        - Throw away (x, z) pairs with low score
      • Sequence of training rounds
        - Train root filters
        - Initialize parts from root
        - Train final model
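The inner loop over background images amounts to hard-negative mining with a cache. A sketch, where train_svm and score_all are hypothetical callables standing in for the actual trainer and detector:

```python
import numpy as np

def train_with_hard_negatives(train_svm, score_all, pos_feats, bg_images,
                              rounds=3, cache_limit=10000):
    """Data-mining loop: harvest high-scoring negatives from background
    images, retrain, and drop easy (low-scoring) negatives.

    train_svm(pos_feats, neg_feats) -> weight vector beta
    score_all(beta, image)          -> iterable of (feature, score) pairs
    """
    neg_cache = []
    beta = None
    for _ in range(rounds):
        beta = train_svm(pos_feats, neg_cache)
        # harvest margin-violating negatives ("hard negatives")
        for img in bg_images:
            for feat, s in score_all(beta, img):
                if s >= -1.0:
                    neg_cache.append(feat)
        # throw away pairs whose score is now safely below the margin
        neg_cache = [f for f in neg_cache if float(np.dot(beta, f)) >= -1.0]
        neg_cache = neg_cache[:cache_limit]
    return beta
```

The cache keeps the negative set tractable even though there is one potential negative per root location in every background image.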
  23. Car model
      [figure: root filters (coarse resolution), part filters (finer
      resolution), deformation models]
  24. Person model
      [figure: root filters, part filters, deformation models]
  25. Cat model
      [figure: root filters, part filters, deformation models]
  26. Bottle model
      [figure: root filters, part filters, deformation models]
  27. Car detections
      [figure: high scoring true positives vs. high scoring false positives]
  28. Person detections
      [figure: high scoring true positives vs. high scoring false positives
      (not enough overlap)]
  29. Horse detections
      [figure: high scoring true positives vs. high scoring false positives]
  30. Cat detections
      [figure: high scoring true positives vs. high scoring false positives
      (not enough overlap)]
  31. Quantitative results
      • 7 systems competed in the 2008 challenge
      • Out of 20 classes we got:
        - First place in 7 classes
        - Second place in 8 classes
      • Some statistics:
        - It takes ~2 seconds to evaluate a model on one image
        - It takes ~4 hours to train a model
        - MUCH faster than most systems
  32. Precision/Recall results on Bicycles 2008
  33. Precision/Recall results on Person 2008
  34. Precision/Recall results on Bird 2008
  35. Comparison of Car models on 2006 data
      [figure: precision/recall curves, class car, year 2006; average
      precision: 1 Root (0.48), 2 Root (0.58), 1 Root+Parts (0.55),
      2 Root+Parts (0.62), 2 Root+Parts+BB (0.64)]
  36. Summary
      • Deformable models for object detection
        - Fast matching algorithms
        - Learning from weakly-labeled data
        - Leads to state-of-the-art results in PASCAL challenge
      • Future work:
        - Hierarchical models
        - Visual grammars
        - AO* search (coarse-to-fine)