Andrew Zisserman Talk - Part 2


  1. 1. Visual search and recognition Part II – category recognition. Andrew Zisserman, Visual Geometry Group, University of Oxford, http://www.robots.ox.ac.uk/~vgg. Includes slides from: Mark Everingham, Pedro Felzenszwalb, Rob Fergus, Kristen Grauman, Bastian Leibe, Fei-Fei Li, Marcin Marszalek, Pietro Perona, Deva Ramanan, Josef Sivic and Andrea Vedaldi
  2. 2. What we would like to be able to do …• Visual recognition and scene understanding• What is in the image and where• scene type: outdoor, city …• object classes• material properties• actions
  3. 3. Recognition Tasks• Image Classification – Does the image contain an aeroplane?• Object Class Detection/Localization – Where are the aeroplanes (if any)?• Object Class Segmentation – Which pixels are part of an aeroplane (if any)?
  4. 4. Things vs. Stuff Ted Adelson, Forsyth et al. 1996. Thing (n): An object with a specific size and shape. Stuff (n): Material defined by a homogeneous or repetitive pattern of fine-scale properties, but has no specific or distinctive spatial extent or shape. Slide: Geremy Heitz
  5. 5. Challenges: Clutter
  6. 6. Challenges: Occlusion and truncation
  7. 7. Challenges: Intra-class variation
  8. 8. Object Category Recognition by Learning• Difficult to define model of a category. Instead, learn from example images
  9. 9. Level of Supervision for Learning Image-level label Bounding box Pixel-level segmentation “Parts”
  10. 10. Outline 1. Image Classification • Bag of visual words method • Features and adding spatial information • Encoding • PASCAL VOC and other datasets 2. Object Category Detection 3. The future and challenges
  11. 11. Recognition Task• Image Classification – Does the image contain an aeroplane?• Challenges – Imaging factors e.g. lighting, pose, occlusion, clutter – Intra-class variation – Position can vary within image – Training data may not specify position
  12. 12. Image classification• Supervised approach – Training data with labels indicating presence/absence of the class Positive training images containing an object class, here motorbike Negative training images that don’t – Learn classifier
  13. 13. Image classification• Test image – Image without label – Determine whether test image contains the object class or not• Classify – Run classifier on the test image ?
  14. 14. Bag of visual words• Images yield varying number of local features• Features are high-dimensional e.g. 128-D for SIFT• How to summarize image content in a fixed-length vector for classification?1. Map descriptors onto a common vocabulary of visual words2. Represent image as a histogram over visual words – a bag of words
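A minimal sketch of the two steps above, assuming SIFT descriptors have already been extracted per image; the variable names, vocabulary size and use of scikit-learn k-means are illustrative, not the exact pipeline from the talk:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_descriptors, n_words=1000, seed=0):
    """Cluster all training descriptors (list of n_i x 128 arrays) into visual words."""
    all_desc = np.vstack(train_descriptors)
    return KMeans(n_clusters=n_words, random_state=seed, n_init=4).fit(all_desc)

def bow_histogram(descriptors, vocabulary):
    """Hard-assign each descriptor to its nearest word; return a normalised histogram."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)   # fixed-length vector, whatever the number of features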
  15. 15. Examples for visual words: Airplanes, Motorbikes, Faces, Wild Cats, Leaves, People, Bikes
  16. 16. Intuition Visual Vocabulary• Visual words represent “iconic” image fragments• Discarding spatial information gives lots of invariance
  17. 17. Training data: vectors are histograms, one from each training image (positive and negative). Train classifier, e.g. SVM
  18. 18. Example Image collection: four object classes +background Faces 435 Motorbikes 800 Airplanes 800 Cars (rear) 1155 Background 900 Total: 4090 The “Caltech 5”
  19. 19. Example: weak supervision. Classes: Motorbikes, Airplanes, Frontal Faces, Cars (Rear), Background. Training • 50% of images • No identification of object within image. Testing • 50% of images • Simple object present/absent test. Learning • SVM classifier • Gaussian kernel using K(x, y) = exp(−γ χ²(x, y)) as similarity between histograms. Result • Between 98.3 – 100% correct, depending on class. Csurka et al 2004, Zhang et al 2005
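A hedged sketch of the classifier in this experiment: an SVM over BoW histograms with the Gaussian-χ² kernel K(x, y) = exp(−γ χ²(x, y)). scikit-learn's chi2_kernel implements exactly this exponentiated form; the values of γ and C below are illustrative, not those used in the cited papers:

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_chi2_svm(X_train, y_train, gamma=0.5, C=10.0):
    # chi2_kernel gives exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)) for each pair of histograms
    K_train = chi2_kernel(X_train, gamma=gamma)
    return SVC(kernel="precomputed", C=C).fit(K_train, y_train)

def predict_chi2_svm(clf, X_test, X_train, gamma=0.5):
    K_test = chi2_kernel(X_test, X_train, gamma=gamma)   # rows: test images, cols: training images
    return clf.predict(K_test)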
  20. 20. Localization according to visual word probability [figure: sparse segmentation maps showing where the foreground word is more probable vs. where the background word is more probable]
  21. 21. Why does SVM learning work? Linear SVM, f(x) = wᵀx + b • Learns foreground and background visual words: in w, foreground words get positive weight, background words negative weight
  22. 22. Bag of visual words summary• Advantages: – largely unaffected by position and orientation of object in image – fixed length vector irrespective of number of detections – Very successful in classifying images according to the objects they contain• Disadvantages: – No explicit use of configuration of visual word positions – Poor at localizing objects within an image
  23. 23. Adding Spatial Information
  24. 24. Beyond BOW II: Grids and spatial pyramids. Start from BoW for image • no spatial information recorded. Bag of Words Feature Vector
  25. 25. Adding Spatial Information to Bag of Words Bag of Words Concatenate Feature Vector [Fergus et al, 2005]Keeps fixed length feature vector for a window
  26. 26. Tiling defines (records) the spatial correspondence of the words • parameter: number of tilesIf codebook has V visual words, then representation has dimension 4V Fergus et al ICCV 05
  27. 27. Spatial Pyramid – represent correspondence: 1 BoW (whole image), 4 BoW (2×2 grid), 16 BoW (4×4 grid) [Grauman & Darrell, 2005] [Lazebnik et al, 2006]
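A minimal spatial-pyramid sketch, assuming each local feature has already been quantised to a word index with a known (x, y) position; the grid levels and normalisation are illustrative choices:

```python
import numpy as np

def spatial_pyramid(words, xs, ys, width, height, n_words, grids=(1, 2, 4)):
    """words, xs, ys: numpy arrays of word indices and their pixel coordinates."""
    parts = []
    for g in grids:                              # 1x1, 2x2, 4x4 tiles -> 1, 4, 16 BoW histograms
        tx = np.minimum((xs * g // width).astype(int), g - 1)
        ty = np.minimum((ys * g // height).astype(int), g - 1)
        for i in range(g):
            for j in range(g):
                sel = (tx == i) & (ty == j)
                parts.append(np.bincount(words[sel], minlength=n_words))
    hist = np.concatenate(parts).astype(float)   # length n_words * (1 + 4 + 16)
    return hist / max(hist.sum(), 1.0)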
  28. 28. Dense Visual Words• Why extract only sparse visual words?• Good where lots of invariance is needed (e.g. to rotation or scale), but not relevant if it isn’t• Also, interest points do not necessarily capture “all” features• Instead, extract dense visual words of fixed scales on an overlapping grid (patch / SIFT quantized to a word) [Leung & Malik, 1999] [Varma & Zisserman, 2003] [Vogel & Schiele, 2004] [Jurie & Triggs, 2005] [Fei-Fei & Perona, 2005] [Bosch et al, 2006]• More “detail” at the expense of invariance• Improves performance for most categories• Pyramid histogram of visual words (PHOW)
  29. 29. Image categorization: summary [pipeline diagram: dense features → vector quantization (VQ) → tiled histograms → linear SVM, e.g. for the class “dogs”]
  30. 30. Example – retrieving videos of scene categories • TrecVid 2010 test data – 8383 videos – 144,988 shots – 232,276 key-frames – 200G mpeg2
  31. 31. Training: feature extraction (dense SIFT + VQ + spatial tiling) on positive (airplane) and negative (background) images → SVM learning → model. Testing: feature extraction on test images → scoring against the model → ranking by score (e.g. 0.91, 0.89, 0.65, 0.12) → ranked list
  32. 32. Example – retrieving videos of scene categories: aircraft
  33. 33. Example – retrieving videos of scene categories: cityscape
  34. 34. Example – retrieving videos of scene categories: demonstration
  35. 35. Example – retrieving videos of scene categories: singing
  36. 36. Encodings
  37. 37. Beyond hard assignments …1. Distance based methods – Represent descriptor by function of distance to set of nearest neighbour cluster centres (soft assignment) B: 1.0 Hard Assignment A: 0.1 B: 0.5 Soft Assignment C: 0.4[Philbin et al. CVPR 2008, Van Gemert et al. ECCV 2008]
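A small sketch of soft assignment, weighting the k nearest centres by a Gaussian of their distance instead of voting only for the single nearest one; k and sigma are illustrative:

```python
import numpy as np

def soft_assign(desc, centres, k=3, sigma=100.0):
    d2 = ((centres - desc) ** 2).sum(axis=1)     # squared distance to every cluster centre
    nn = np.argsort(d2)[:k]                      # k nearest centres
    w = np.exp(-d2[nn] / (2 * sigma ** 2))
    code = np.zeros(len(centres))
    code[nn] = w / w.sum()                       # e.g. A: 0.1, B: 0.5, C: 0.4 as on the slide
    return code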
  38. 38. Beyond hard assignments …2. Reconstruction based methods – Sparse-coding: approximate descriptor using nearest neighbour centres as a basis – Locality-constrained Linear Coding (LLC): min_α ||x − Bα||² + λ||α||², where B is a matrix formed from the nearest centres as column vectors [Wang et al. CVPR 2010]
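A sketch of the reconstruction idea using the simplified objective shown on the slide: solve a ridge problem over the k nearest centres (the full LLC formulation adds a locality weighting, omitted here; names and parameters are illustrative):

```python
import numpy as np

def local_coding(x, centres, k=5, lam=1e-4):
    d2 = ((centres - x) ** 2).sum(axis=1)
    nn = np.argsort(d2)[:k]                      # k nearest cluster centres
    B = centres[nn].T                            # columns of B are the nearest centres
    # minimise ||x - B a||^2 + lam ||a||^2  ->  a = (B'B + lam I)^{-1} B'x
    alpha = np.linalg.solve(B.T @ B + lam * np.eye(k), B.T @ x)
    code = np.zeros(len(centres))
    code[nn] = alpha                             # sparse code over the full vocabulary
    return code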
  39. 39. Beyond hard assignments …3. Represent the residual (and more) – Measure (and quantize) the difference vector from the cluster centre x ci[Jegou et al. CVPR 2010, Perronin et al., Fisher kernels ECCV 2010 ]
  40. 40. Spatial tiling & binning 1. Aggregate by sum-pooling 2. Aggregate by max-pooling [Wang et al. CVPR 2010]
  41. 41. Datasets
  42. 42. The PASCAL Visual Object Classes (VOC) Dataset and Challenge Mark Everingham Luc Van Gool Chris Williams John Winn Andrew Zisserman
  43. 43. The PASCAL VOC Challenge• Challenge in visual object recognition funded by PASCAL network of excellence• Publicly available dataset of annotated images• Main competitions in classification (is there an X in this image), detection (where are the X’s), and segmentation (which pixels belong to X)• “Taster competitions” in 2-D human “pose estimation” (2007- present) and static action classes (2010-present)• Standard evaluation protocol (software supplied)
  44. 44. Dataset Content• 20 classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, TV/monitor• Real images downloaded from flickr, not filtered for “quality”• Complex scenes, scale, pose, lighting, occlusion, ...
  45. 45. Annotation• Complete annotation of all objects• Annotated in one session with written guidelines. Labels: Occluded – object is significantly occluded within the bounding box; Truncated – object extends beyond the bounding box; Difficult – not scored in evaluation; Pose – e.g. facing left
  46. 46. Examples Aeroplane Bicycle Bird Boat Bottle Bus Car Cat Chair Cow
  47. 47. ExamplesDining Table Dog Horse Motorbike PersonPotted Plant Sheep Sofa Train TV/Monitor
  48. 48. Dataset Collection• Images downloaded from flickr – 500,000 images downloaded and random subset selected for annotation – Queries • Keyword e.g. “car”, “vehicle”, “street”, “downtown” • Date of capture e.g. “taken 21-July” – Removes “recency” bias in flickr results • Images selected from random page of results – Reduces bias toward particular flickr users• 2008/9 datasets retained as subset of 2010 – Assignments to training/test sets maintained
  49. 49. Dataset Statistics• Around 40% increase in size over VOC2009 Training Testing Images 10,103 (7,054) 9,637 (6,650) Objects 23,374 (17,218) 22,992 (16,829) VOC2009 counts shown in brackets• Minimum ~500 training objects per category – ~1700 cars, 1500 dogs, 7000 people• Approximately equal distribution across training and test sets
  50. 50. Classification Challenge • Predict whether at least one object of a given class is present in an image • Evaluation: average precision per class • Average Precision (AP) measures area under precision/recall curve • Application independent • A good score requires both high recall and high precision [figure: precision/recall curve with AP as the area under it]
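A hedged sketch of the evaluation measure: non-interpolated average precision from ranked scores (PASCAL VOC used its own interpolation over recall, which differs slightly in the details):

```python
import numpy as np

def average_precision(scores, labels):
    """scores: classifier scores per image; labels: 1 if the class is present, else 0."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)                  # rank images by decreasing score
    labels = labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    n_pos = max(labels.sum(), 1)
    return float(np.sum(precision * labels) / n_pos)   # mean precision at each positive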
  51. 51. Precision/Recall: Aeroplane (All) [plot: precision/recall curves for all submitted methods; AP ranges from 93.3% (NEC_V1_HOGLBP_NONLIN_SVM, NEC_V1_HOGLBP_NONLIN_SVMDET) down to 60.7% (HIT_PROTOLEARN_2)]
  52. 52. Precision/Recall: Aeroplane (Top 10 by AP) [plot: precision/recall curves for the ten highest-AP methods, from 93.3% down to 90.6%]
  53. 53. AP by Class [bar chart: max, median and chance AP (%) per class, sorted from aeroplane down to potted plant]• Max AP: 93.3% (aeroplane) ... 53.3% (potted plant)
  54. 54. Progress 2008-2010 [bar chart: max AP (%) per class for the best 2008, 2009 and 2010 methods]• Results on 2008 data improve for best 2009 and 2010 methods for all categories, by over 100% for some categories – Caveat: Better methods or more training data?
  55. 55. The Indoor Scene Dataset• 67 indoor categories• 15620 images• At least 100 images per category• Training 67 x 80 images• Testing 67 x 20 images• A. Quattoni, and A.Torralba. Recognizing Indoor Scenes. IEEEConference on Computer Vision and Pattern Recognition (CVPR), 2009.
  56. 56. The Oxford Flowers Dataset• Explore fine grained visual categorization• 102 different species
  57. 57. Dataset statistics• 102 categories• Training set – 10 images per category• Validation set – 10 images per category• Test set – >20 images per category. – Total 6129 images.
  58. 58. Fine grained visual classification – flowers. Y. Chai, M.E. Nilsback, V. Lempitsky, A. Zisserman, ICVGIP’08, ICCV’11
  59. 59. Outline 1. Image Classification 2. Object Category Detection • Sliding window methods • Histogram of Oriented Gradients (HOG) • Learning an object detector • PASCAL VOC (again) and two state of the art algorithms 3. The future and challenges
  60. 60. Recognition Task• Object Class Detection/Localization – Where are the aeroplanes (if any)?• Challenges – Imaging factors e.g. lighting, pose, occlusion, clutter – Intra-class variation• Compared to Classification – Detailed prediction e.g. bounding box – Location usually provided for training
  61. 61. Preview of typical results aeroplane bicycle car cow horse motorbike
  62. 62. Problem of background clutter• Use a sub-window – At correct position, no clutter is present – Slide window to detect object – Change size of window to search over scale
  63. 63. Detection by Classification• Basic component of sliding window classifier: binary classifier Car/non-car Classifier Yes, No, a car not a car
  64. 64. Detection by Classification• Detect objects in clutter by search Car/non-car Classifier• Sliding window: exhaustive search over position and scale
  65. 65. Detection by Classification• Detect objects in clutter by search Car/non-car Classifier• Sliding window: exhaustive search over position and scale
  66. 66. Detection by Classification• Detect objects in clutter by search Car/non-car Classifier• Sliding window: exhaustive search over position and scale (can use same size window over a spatial pyramid of images)
  67. 67. Window (Image) Classification Training Data Feature Classifier Extraction• Features usually engineered Car/Non-car• Classifier learnt from data
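A sketch of the sliding-window search described above, with `score_window` standing in for feature extraction plus the trained classifier (an assumption, as are the window size, stride and scales); in practice detections would then be merged by non-maximum suppression:

```python
import numpy as np
from scipy.ndimage import zoom

def sliding_window_detect(img, score_window, win_h=128, win_w=64,
                          stride=8, scales=(1.0, 0.8, 0.64), threshold=0.0):
    detections = []
    for s in scales:                                     # search over scale via an image pyramid
        scaled = zoom(img, s, order=1)
        for y in range(0, scaled.shape[0] - win_h + 1, stride):
            for x in range(0, scaled.shape[1] - win_w + 1, stride):
                score = score_window(scaled[y:y + win_h, x:x + win_w])
                if score > threshold:                    # map the box back to original coordinates
                    detections.append((x / s, y / s, win_w / s, win_h / s, score))
    return detections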
  68. 68. Problems with sliding windows …• aspect ratio• granularity (finite grid)• partial occlusion• multiple responses. See work by• Christoph Lampert et al CVPR 08, ECCV 08
  69. 69. Bag of (visual) Words representation• Detect affine invariant local features (e.g. affine-Harris)• Represent by high-dimensional descriptors, e.g. 128-D for SIFT1. Map descriptors onto a common vocabulary of visual words2. Represent sliding window as a histogram over visual words – a bag of words• Summarizes sliding window content in a fixed-length vector suitable for classification
  70. 70. Sliding window detector • Classifier: SVM with linear kernel • BOW representation for ROI Example detections for dogLampert et al CVPR 08: Efficient branch and bound search over all windows
  71. 71. Discussion: ROI as a Bag of Visual Words • Advantages – No explicit modelling of spatial information ⇒ high level of invariance to position and orientation in image – Fixed length vector ⇒ standard machine learning methods applicable • Disadvantages – No explicit modelling of spatial information ⇒ less discriminative power – Inferior to state of the art performance – Add dense features
  72. 72. Dalal & Triggs CVPR 2005 Pedestrian detection• Objective: detect (localize) standing humans in an image• Sliding window classifier• Train a binary classifier on whether a window contains a standing person or not• Histogram of Oriented Gradients (HOG) feature• Although HOG + SVM was originally introduced for pedestrian detection, it has been used very successfully for many other object categories
  73. 73. Feature: Histogram of Oriented Gradients (HOG)• tile 64 x 128 pixel window into 8 x 8 pixel cells• each cell represented by a histogram over 8 orientation bins (i.e. angles in range 0-180 degrees) [figure: dominant image direction and HOG frequency vs. orientation]
  74. 74. Histogram of Oriented Gradients (HOG) continued• Adds a second level of overlapping spatial bins re- normalizing orientation histograms over a larger spatial area• Feature vector dimension (approx) = 16 x 8 (for tiling) x 8 (orientations) x 4 (for blocks) = 4096
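A very small HOG-style sketch, assuming a grayscale numpy image: per-cell histograms of gradient orientation weighted by gradient magnitude (the overlapping block normalisation described above is omitted):

```python
import numpy as np

def hog_cells(img, cell=8, n_bins=8):
    """Returns an (H/cell, W/cell, n_bins) array of orientation histograms."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0         # unsigned orientation in [0, 180)
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    H, W = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((H, W, n_bins))
    for i in range(H):
        for j in range(W):
            sl = (slice(i * cell, (i + 1) * cell), slice(j * cell, (j + 1) * cell))
            hist[i, j] = np.bincount(bins[sl].ravel(), weights=mag[sl].ravel(),
                                     minlength=n_bins)
    return hist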
  75. 75. Window (Image) Classification Training Data Feature Classifier Extraction• HOG Features pedestrian/Non-pedestrian• Linear SVM classifier
  76. 76. Averaged examples
  77. 77. Classifier: linear SVM, f(x) = wᵀx + b. Advantages of linear SVM: • Training (Learning): very efficient packages for the linear case, e.g. LIBLINEAR for batch training and Pegasos for on-line training; complexity O(N) for N training points (cf O(N³) for a general SVM) • Testing (Detection): non-linear f(x) = Σ_{i=1..S} αᵢ k(xᵢ, x) + b, with cost growing with S = number of support vectors (worst case S = N, the size of the training data); linear f(x) = Σᵢ αᵢ xᵢᵀx + b = wᵀx + b, independent of the size of the training data
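A sketch of why linear testing is cheap: the support-vector sum collapses into a single weight vector, so scoring a window is one dot product (the dual coefficients here mean αᵢyᵢ, as e.g. scikit-learn stores them):

```python
import numpy as np

def collapse_linear_svm(support_vectors, dual_coefs, b):
    w = (dual_coefs[:, None] * support_vectors).sum(axis=0)   # w = sum_i alpha_i y_i x_i
    def score(x):
        return float(w @ x + b)        # cost independent of the number of support vectors
    return w, score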
  78. 78. Dalal and Triggs, CVPR 2005
  79. 79. Learned model f(x) = wᵀx + b [figure: learned weights w alongside the average over positive training data]
  80. 80. Slide from Deva Ramanan
  81. 81. Why does HOG + SVM work so well?• Similar to SIFT, records spatial arrangement of histogram orientations• Compare to learning only edges: – Complex junctions can be represented – Avoids problem of early thresholding – Represents also soft internal gradients• Older methods based on edges have become largely obsolete • HOG gives fixed length vector for window, suitable for feature vector for SVM
  82. 82. Training a sliding window detector• Object detection is inherently asymmetric: much more “non-object” than “object” data• Classifier needs to have very low false positive rate• Non-object category is very complex – need lots of data
  83. 83. Bootstrapping 1. Pick negative training set at random 2. Train classifier 3. Run on training data 4. Add false positives to training set 5. Repeat from 2• Collect a finite but diverse set of non-object windows• Force classifier to concentrate on hard negative examples• For some classifiers can ensure equivalence to training on entire data set
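A hedged sketch of the bootstrapping loop above; `extract_random_windows`, `train_svm` and `run_detector` are placeholders for the pipeline-specific steps (feature extraction, SVM training, sliding-window detection on object-free images):

```python
import numpy as np

def bootstrap(pos_feats, neg_images, extract_random_windows, train_svm, run_detector,
              rounds=2, n_random_neg=10000):
    neg_feats = extract_random_windows(neg_images, n_random_neg)   # 1. random negatives
    clf = train_svm(pos_feats, neg_feats)                          # 2. train classifier
    for _ in range(rounds):
        hard = run_detector(clf, neg_images)                       # 3-4. false positives = hard negatives
        if len(hard) == 0:
            break
        neg_feats = np.vstack([neg_feats, hard])
        clf = train_svm(pos_feats, neg_feats)                      # 5. retrain and repeat
    return clf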
  84. 84. Example: train an upper body detector– Training data – used for training and validation sets • 33 Hollywood2 training movies • 1122 frames with upper bodies marked– First stage training (bootstrapping) • 1607 upper body annotations jittered to 32k positive samples • 55k negatives sampled from the same set of frames– Second stage training (retraining) • 150k hard negatives found in the training data
  85. 85. Training data – positive annotations
  86. 86. Positive windows. Note: common size and alignment
  87. 87. Jittered positives
  88. 88. Jittered positives
  89. 89. Random negatives
  90. 90. Random negatives
  91. 91. Window (Image) first stage classification: HOG feature extraction on jittered positives and random negatives, linear SVM classifier f(x) = wᵀx + b• find high scoring false positive detections• these are the hard negatives for the next round of training• cost = # training images x inference on each image
  92. 92. Hard negatives
  93. 93. Hard negatives
  94. 94. First stage performance on validation set
  95. 95. Performance after retraining
  96. 96. Effects of retraining
  97. 97. Side by side before retraining after retraining
  98. 98. Side by side before retraining after retraining
  99. 99. Side by side before retraining after retraining
  100. 100. Tracked upper body detections
  101. 101. Accelerating Sliding Window Search• Sliding window search is slow because so many windows are needed e.g. x × y × scale ≈ 100,000 for a 320×240 image• Most windows are clearly not the object class of interest• Can we speed up the search?
  102. 102. Cascaded Classification• Build a sequence of classifiers with increasing complexity: a window passes through classifiers 1, 2, …, N (each more complex, slower, with a lower false positive rate); a window rejected as non-face at any stage is discarded, one that survives every stage is accepted as a face• Reject easy non-objects using simpler and faster classifiers
  103. 103. Cascaded Classification• Slow expensive classifiers only applied to a few windows ⇒ significant speed-up• Controlling classifier complexity/speed: – Number of support vectors [Romdhani et al, 2001] – Number of features [Viola & Jones, 2001] – Type of SVM kernel [Vedaldi et al, 2009]
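A minimal cascade sketch: cheap stages reject most windows so the expensive classifier only runs on the survivors (the stage scoring functions and thresholds are placeholders):

```python
def cascade_score(window, stages, thresholds):
    """stages: scoring functions ordered cheap -> expensive, one threshold per stage."""
    score = 0.0
    for stage, threshold in zip(stages, thresholds):
        score = stage(window)
        if score < threshold:        # rejected early: later, slower stages never run
            return None
    return score                     # survived every stage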
  104. 104. Summary: Sliding Window Detection• Can convert any image classifier into an object detector by sliding window. Efficient search methods available.• Requirements for invariance are reduced by searching over e.g. translation and scale• Spatial correspondence can be “engineered in” by spatial tiling
  105. 105. Outline 1. Image Classification 2. Object Category Detection • Sliding window methods • Histogram of Oriented Gradients (HOG) • Learning an object detector • PASCAL VOC (again) and two state of the art algorithms 3. The future and challenges
  106. 106. The PASCAL Visual Object Classes (VOC) Dataset and Challenge Mark Everingham Luc Van Gool Chris Williams John Winn Andrew Zisserman
  107. 107. Detection: Evaluation of Bounding Boxes• Area of Overlap (AO) Measure: overlap between ground truth Bgt and predicted Bp, area(Bgt ∩ Bp) / area(Bgt ∪ Bp); counts as a detection if > 50% threshold• Evaluation: Average precision per class on predictions
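A sketch of the overlap criterion, with boxes given as (xmin, ymin, xmax, ymax); a prediction counts as a detection when the ratio exceeds 0.5:

```python
def box_overlap(bp, bgt):
    """area(Bp ∩ Bgt) / area(Bp ∪ Bgt) for axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(bp[2], bgt[2]) - max(bp[0], bgt[0]))
    ih = max(0.0, min(bp[3], bgt[3]) - max(bp[1], bgt[1]))
    inter = iw * ih
    union = ((bp[2] - bp[0]) * (bp[3] - bp[1]) +
             (bgt[2] - bgt[0]) * (bgt[3] - bgt[1]) - inter)
    return inter / union if union > 0 else 0.0

# e.g. box_overlap((0, 0, 10, 10), (5, 0, 15, 10)) = 50 / 150 ≈ 0.33 -> not counted as a detection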
  108. 108. Precision/Recall - Bicycle [plot: precision/recall curves for all submitted detectors; AP ranges from 55.3% (NLPR_HOGLBP_MC_LCEGCHLC) and 54.3% (UOCTTI_LSVM_MDPM) down to 1.6% (TIT_SIFT_GMM_MKL)]
  109. 109. Precision/Recall - Car [plot: precision/recall curves for all submitted detectors; AP ranges from 49.1% (UOCTTI_LSVM_MDPM) and 46.7% (NLPR_HOGLBP_MC_LCEGCHLC) down to 1.6% (TIT_SIFT_GMM_MKL)]
  110. 110. Precision/Recall – Potted plant [plot: precision/recall curves for all submitted detectors; best AP only 13.0% (UVA_GROUPLOC) and 12.1% (UVA_DETMONKEY)]
  111. 111. True Positives - Motorbike MITUCLA_HIERARCHY NLPR_HOGLBP_MC_LCEGCHLC NUS_HOGLBP_CTX_CLS_RESCORE_V2
  112. 112. False Positives - Motorbike MITUCLA_HIERARCHY NLPR_HOGLBP_MC_LCEGCHLC NUS_HOGLBP_CTX_CLS_RESCORE_V2
  113. 113. True Positives - Cat UVA_DETMONKEY UVA_GROUPLOC MITUCLA_HIERARCHY
  114. 114. False Positives - Cat UVA_DETMONKEY UVA_GROUPLOC MITUCLA_HIERARCHY
  115. 115. Progress 2008-2010 [bar chart: max AP (%) per class for the best 2008, 2009 and 2010 detection methods] Results on 2008 data improve for best 2009 and 2010 methods for all categories, by over 100% for some categories
  116. 116. Multiple Kernels for Object Detection Andrea Vedaldi, Varun Gulshan, Manik Varma, Andrew Zisserman ICCV 2009
  117. 117. Approach• Three stage cascade (jumping window + fast linear SVM → quasi-linear SVM → non-linear SVM over the feature vector) – Each stage uses a more powerful and more expensive classifier• Multiple kernel learning for the classifiers over multiple features• Jumping window first stage
  118. 118. Multiple Kernel Classification: combine one kernel per histogram (PHOW Gray, PHOW Color, PHOG, PHOG Sym, Visual Words, SSIM) in an MK SVM over the feature vector [Varma & Rai, 2007] [Gehler & Nowozin, 2009]
  119. 119. Jumping window. Training: learn the position/scale/aspect ratio of the ROI with respect to each visual word. Detection: a detected visual word proposes the ROI hypothesis; handles change of aspect ratio
  120. 120. SVMs overview• First stage – linear SVM – (or jumping window) – time: #windows• Second stage – quasi-linear SVM – χ² kernel – time: #windows × #dimensions• Third stage – non-linear SVM – χ²-RBF kernel – time: #windows × #dimensions × #SVs
  121. 121. Results
  122. 122. Results
  123. 123. Results
  124. 124. Single Kernel vs. Multiple Kernels• Multiple Kernels gives substantial boost• Multiple Kernel Learning: – small improvement over averaging – sparse feature selection
  125. 125. Object Detection with Discriminatively Trained Part Based Models Pedro F. Felzenszwalb, David Mcallester, Deva Ramanan, Ross Girshick PAMI 2010
  126. 126. Approach• Mixture of deformable part-based models – One component per “aspect” e.g. front/side view• Each component has global template + deformable parts• Discriminative training from bounding boxes alone
  127. 127. Example Model• One component of person model [figure: root filter (coarse resolution), part filters x1…x6 (finer resolution), deformation models]
  128. 128. Object Hypothesis• Position of root + each part• Each part: HOG filter (at higher resolution) z = (p0,..., pn) p0 : location of root p1,..., pn : location of parts Score is sum of filter scores minus deformation costs
  129. 129. Score of a Hypothesis = appearance term + spatial prior: filter scores on the HOG features minus deformation costs on the part displacements; equivalently a dot product between a concatenation of filters and deformation parameters and a concatenation of HOG features and part displacement features• Linear classifier applied to feature subset defined by hypothesis
  130. 130. Training• Training data = images + bounding boxes• Need to learn: model structure, filters, deformation costs Training
  131. 131. Latent SVM (MI-SVM). Classifiers that score an example x using f_β(x) = max_z β·Φ(x, z), where β are model parameters and z are latent values (Which component? Where are the parts?). We would like to find β such that the training data is scored correctly (positives high, negatives low): minimize the SVM objective
  132. 132. Latent SVM Training• Convex if we fix z for positive examples• Optimization: – Initialize β and iterate: Alternation • Pick best z for each positive example strategy • Optimize β with z fixed• Local minimum: needs good initialization – Parts initialized heuristically from root
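A hedged sketch of the alternation described above (not the authors' code); `best_latent` picks the highest-scoring component and part placements for a positive example given the current β, and `train_convex_svm` stands in for standard SVM training with those latent values fixed:

```python
def latent_svm_train(pos_examples, neg_examples, init_beta,
                     best_latent, train_convex_svm, iterations=5):
    beta = init_beta
    for _ in range(iterations):
        # fix z for each positive: choose the best component and part locations under beta
        pos_feats = [best_latent(beta, x) for x in pos_examples]
        # with z fixed for positives the objective is convex; optimise beta as a standard SVM
        beta = train_convex_svm(pos_feats, neg_examples)
    return beta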
  133. 133. Person Model root filters part filters deformation coarse resolution finer resolution models Handles partial occlusion/truncation
  134. 134. Car Model root filters part filters deformation coarse resolution finer resolution models
  135. 135. Car Detections high scoring true positives high scoring false positives
  136. 136. Person Detections high scoring false positives high scoring true positives (not enough overlap)
  137. 137. Comparison of Models
  138. 138. Summary• Multiple features and multiple kernels boost performance• Discriminative learning of model with latent variables for single feature (HOG): – Latent variables can learn best alignment in the ROI training annotation – Parts can be thought of as local SIFT vectors – Some similarities to Implicit Shape Model/Constellation models but with discriminative/careful training throughout. NB: Code available for the latent model!
  139. 139. Outline1. Image Classification2. Object Category Detection3. The future and challenges
  140. 140. Current Research Challenges• Context – from scene properties: GIST, BoW, stuff – from other objects, e.g. Felzenszwalb et al, PAMI 10 – from geometry of scene, e.g. Hoiem et al CVPR 06• Occlusion/truncation – Winn & Shotton, Layout Consistent Random Field, CVPR 06 – Vedaldi & Zisserman, NIPS 09 – Yang et al, Layered Object Detection, CVPR 10• 3D• Scaling up – thousands of classes – Torralba et al, Feature sharing – ImageNet• Weak and noisy supervision
