MIT 6.870
Object Recognition and Scene Understanding
student presentation

Template matching and histograms
Nicolas Pinto
Introduction
Hosts
   a guy...               Antonio T...                      a frog...

(who has big arms)   (who knows a lot about vision)   (who has big eyes)

   and thus should know a lot about vision...
3 papers

    yey!!
Object Recognition from Local Scale-Invariant Features

                                                                  ...
Scale-Invariant Feature Transform
              (SIFT)

                      adapted from Kucuktunc
                      adapted from Brown, ICCV 2003
SIFT local features are
invariant...




                  adapted from David Lee
like me they are robust...

... to changes in illumination,
noise, viewpoint, occlusion, etc.
I am sure you want to know
how to build them

1. find interest points or “keypoints”
2. find their dominant orientation
3. compute their descriptor
4. match them on other images
1. find interest points or “keypoints”
keypoints are taken as maxima/minima
of a DoG pyramid

                  in this setting, ext...
a DoG (Difference of Gaussians) pyramid
is simple to compute...   even he can do it!

    before            after

...
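A minimal sketch of one octave of a DoG stack, to make "simple to compute" concrete. It assumes a grayscale image stored as a 2D NumPy array; the function names, the base scale `sigma0 = 1.6`, the scale step `k = sqrt(2)` and the number of levels are illustrative choices, not values taken from the slides.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian kernel, truncated at 3 sigma and normalized to sum to 1."""
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur; borders are handled by edge replication."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(np.asarray(image, dtype=float), r, mode="edge")
    # filter along rows first, then along columns
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def dog_stack(image, sigma0=1.6, k=2 ** 0.5, n_levels=5):
    """Blur at sigma0 * k**i, then subtract adjacent levels (assumed parameters)."""
    blurred = [gaussian_blur(image, sigma0 * k ** i) for i in range(n_levels)]
    return np.stack([b1 - b0 for b0, b1 in zip(blurred, blurred[1:])])
```

A full pyramid would repeat this after downsampling the image for each new octave; that part is left out here.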
then we just have to find
neighborhood extrema
in this 3D DoG space

                           if a pixel is an extremum...
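The extremum test can be sketched as a brute-force scan. The sketch assumes the DoG responses are stacked in a 3D array indexed as (level, row, column) and that a sample counts as a keypoint candidate when it is strictly above or strictly below all 26 neighbours; `local_extrema` and its `threshold` argument are hypothetical names.

```python
import numpy as np

def local_extrema(dog, threshold=0.0):
    """Return (level, row, col) of strict extrema in each 3x3x3 neighbourhood."""
    found = []
    levels, height, width = dog.shape
    for s in range(1, levels - 1):
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                centre = cube[1, 1, 1]
                others = np.delete(cube.ravel(), 13)  # index 13 is the centre sample
                is_max = centre > others.max()
                is_min = centre < others.min()
                if abs(centre) > threshold and (is_max or is_min):
                    found.append((s, y, x))
    return found
```

The triple loop is slow but easy to read; a vectorized version would compare shifted copies of the stack instead.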
too many
keypoints?

1. remove
low contrast

2. remove
edges

               adapted from wikipedia
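A hedged sketch of the two pruning steps. The contrast threshold (0.03, for intensities scaled to [0, 1]) and the curvature ratio (10) are commonly quoted defaults and are assumptions here, as is the function name; the slides only say that low-contrast and edge-like keypoints are removed. Edges are detected by comparing the trace and determinant of the 2x2 Hessian of the DoG level.

```python
import numpy as np

def keep_keypoint(dog_level, y, x, contrast_thresh=0.03, edge_ratio=10.0):
    """True if the DoG sample at (y, x) has enough contrast and is not edge-like."""
    d = np.asarray(dog_level, dtype=float)
    if abs(d[y, x]) < contrast_thresh:
        return False  # 1. remove low contrast
    # 2x2 Hessian from finite differences
    dxx = d[y, x + 1] - 2 * d[y, x] + d[y, x - 1]
    dyy = d[y + 1, x] - 2 * d[y, x] + d[y - 1, x]
    dxy = (d[y + 1, x + 1] - d[y + 1, x - 1]
           - d[y - 1, x + 1] + d[y - 1, x - 1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy ** 2
    if det <= 0:
        return False  # flat or saddle-like along one direction
    # 2. remove edges: one principal curvature much larger than the other
    return bool(trace ** 2 / det < (edge_ratio + 1) ** 2 / edge_ratio)
```

A round blob has two similar curvatures and passes; a straight ridge has one curvature near zero and is rejected.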
2. find their dominant orientation
each selected keypoint is
assigned to one or more
“dominant” orientations...



... this step is important to
achieve rota...
How?
using the DoG pyramid to achieve
scale invariance:

a. compute image gradient
magnitude and orientation
b. build an orientation histogram




                                    adapted from Ofir Pele
c. keypoint’s orientation(s) = peak(s)


                            *




                                   * the peak ;...
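Steps a to c can be sketched in a few lines. This is an illustration under assumptions: the patch is a small grayscale array around the keypoint, the histogram has 36 bins of 10 degrees, and every circular local maximum within 80% of the highest bin is reported. The function names and both constants are hypothetical, and the Gaussian weighting of votes is left out.

```python
import numpy as np

def orientation_histogram(patch, n_bins=36):
    """a + b: gradient magnitude and orientation, accumulated into a histogram."""
    gy, gx = np.gradient(np.asarray(patch, dtype=float))
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 360.0
    width = 360.0 / n_bins
    bins = np.round(angle / width).astype(int) % n_bins  # bins centred on i * width
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)

def dominant_orientations(hist, rel_thresh=0.8):
    """c: bin centres (degrees) of circular local maxima close to the top peak."""
    n = len(hist)
    width = 360.0 / n
    top = hist.max()
    peaks = []
    for i in range(n):
        left, right = hist[i - 1], hist[(i + 1) % n]
        if hist[i] >= rel_thresh * top and hist[i] > left and hist[i] > right:
            peaks.append(i * width)
    return peaks
```

For a patch whose intensity grows with the row index, the only peak is at 90 degrees.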
3. compute their descriptor
SIFT descriptor
= a set of orientation histograms

   16x16 neighborhood of pixel gradients
   4x4 array x 8 bins =...
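A simplified sketch of the layout described above: a 16x16 patch is split into a 4x4 grid of cells and each cell gets an 8-bin orientation histogram. The function name is hypothetical, and the Gaussian weighting, the interpolation between bins and the rotation of the patch to the keypoint's dominant orientation are all omitted.

```python
import numpy as np

def sift_like_descriptor(patch, n_cells=4, n_bins=8):
    """Concatenate n_cells x n_cells orientation histograms and L2-normalize."""
    patch = np.asarray(patch, dtype=float)
    if patch.shape != (16, 16):
        raise ValueError("this sketch expects a 16x16 patch")
    gy, gx = np.gradient(patch)
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    bins = np.floor(angle / (2 * np.pi / n_bins)).astype(int) % n_bins
    cell = 16 // n_cells
    desc = np.zeros((n_cells, n_cells, n_bins))
    for y in range(16):
        for x in range(16):
            # each pixel votes in its cell's histogram, weighted by magnitude
            desc[y // cell, x // cell, bins[y, x]] += magnitude[y, x]
    desc = desc.ravel()
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```

The returned vector has one entry per (cell row, cell column, orientation bin) triple.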
4. match them on other images
How to match?

nearest neighbor
Hough transform voting
least-squares fit
etc.
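The first item, nearest-neighbor matching, can be sketched with a brute-force distance table. It assumes descriptors are stored as rows of two NumPy arrays and uses Euclidean distance; `match_nearest` is a hypothetical name. Hough voting and the least-squares fit are not covered by this sketch.

```python
import numpy as np

def match_nearest(desc_a, desc_b):
    """For each row of desc_a, return the index of and distance to its closest row in desc_b."""
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    diff = desc_a[:, None, :] - desc_b[None, :, :]   # all pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=2))          # shape (len(a), len(b))
    nearest = dist.argmin(axis=1)
    return nearest, dist[np.arange(len(desc_a)), nearest]
```

The table costs memory proportional to the product of the two set sizes, so a real system with many keypoints would probably use an index structure instead.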
SIFT is great!

 invariant to scale and rotation, partially invariant to affine transformations

 easy to understand

 fast to compute
Extension example:
  Beyond Bags of Features: Spatial Pyramid Matching
  for Recognizing Natural...
Histograms of Oriented Gradients for Human Detection
                                      Navneet Dalal and Bill Triggs
 ...
Approach

• robust feature set (HOG)

• simple classifier (linear SVM)

• fast detection (sliding window)
adapted from Bill Triggs
• Gamma normalization
• Space: RGB, LAB or Gray
• Method: SQRT or LOG
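The two compression methods can be sketched as follows. The function name is hypothetical, and using `log(1 + x)` for the LOG variant is an assumption made so that zero intensities stay at zero; the slide only names the methods.

```python
import numpy as np

def gamma_normalize(image, method="sqrt"):
    """Compress non-negative intensities, applied element-wise to every channel."""
    image = np.asarray(image, dtype=float)
    if method == "sqrt":
        return np.sqrt(image)
    if method == "log":
        return np.log1p(image)  # log(1 + x), assumed variant
    raise ValueError("method must be 'sqrt' or 'log'")
```

Because the operation is element-wise, the same call works for Gray images and for RGB or LAB arrays with a trailing channel axis.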
• Filtering with simple masks

  centered            centered *
                                      ...
remember SIFT ?

• Filtering with simple masks
    centered
    uncentered
    cubic-co...
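A sketch of the centered mask only, i.e. `[-1, 0, 1]` applied along x and along y. The function name, the zero-filled borders and the choice of unsigned orientations in the 0 to 180 degree range are assumptions for illustration.

```python
import numpy as np

def centered_gradient(image):
    """Apply the centered [-1, 0, 1] mask in x and y; border pixels stay at zero."""
    img = np.asarray(image, dtype=float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned (assumed)
    return magnitude, orientation
```

The uncentered mask would be the two-tap difference `[-1, 1]`, which shifts the gradient estimate by half a pixel.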
...after filtering, each “pixel” represents
an oriented gradient...
...pixels are regrouped in “cells”,
they cast a weighted vote for an
orientation histogram...




           HOG (Histogra...
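The voting step can be sketched like this, assuming per-pixel gradient magnitudes and unsigned orientations (in degrees, 0 to 180) are already available as 2D arrays. The 8x8 pixel cells, the 9 bins and the function name are illustrative assumptions, and votes go to a single bin without interpolation.

```python
import numpy as np

def cell_histograms(magnitude, orientation, cell_size=8, n_bins=9):
    """Per-cell orientation histograms; each pixel votes with its gradient magnitude."""
    magnitude = np.asarray(magnitude, dtype=float)
    orientation = np.asarray(orientation, dtype=float)
    n_cy = magnitude.shape[0] // cell_size
    n_cx = magnitude.shape[1] // cell_size
    bin_width = 180.0 / n_bins
    bins = np.floor(orientation / bin_width).astype(int) % n_bins
    hist = np.zeros((n_cy, n_cx, n_bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            ys = slice(cy * cell_size, (cy + 1) * cell_size)
            xs = slice(cx * cell_size, (cx + 1) * cell_size)
            hist[cy, cx] = np.bincount(bins[ys, xs].ravel(),
                                       weights=magnitude[ys, xs].ravel(),
                                       minlength=n_bins)
    return hist
```

The result is a small "image" of histograms, one per cell, which is what the block normalization stage consumes.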
a window can be
represented like
that
then, cells are locally normalized
using overlapping “blocks”
they used two types of blocks

•   rectangular
•   similar to SIFT (but dense)

•   circular
•   si...
and four different types of block
normalization
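The slide does not name the four schemes; the sketch below uses the ones described in the Dalal and Triggs paper (L2-norm, L2-Hys, L1-norm, L1-sqrt). The function names, the `eps` value and the 2x2-cell rectangular block with stride 1 are assumptions for illustration.

```python
import numpy as np

def normalize_block(v, scheme="L2-Hys", eps=1e-5):
    """Normalize one block vector of non-negative histogram entries."""
    v = np.asarray(v, dtype=float)
    if scheme == "L2":
        return v / np.sqrt((v ** 2).sum() + eps ** 2)
    if scheme == "L2-Hys":
        v = v / np.sqrt((v ** 2).sum() + eps ** 2)
        v = np.minimum(v, 0.2)  # clip large values, then renormalize
        return v / np.sqrt((v ** 2).sum() + eps ** 2)
    if scheme == "L1":
        return v / (np.abs(v).sum() + eps)
    if scheme == "L1-sqrt":
        return np.sqrt(v / (np.abs(v).sum() + eps))
    raise ValueError("unknown scheme: " + scheme)

def block_descriptor(cell_hist, block_size=2, scheme="L2-Hys"):
    """Slide an overlapping block over the cell grid and concatenate the results."""
    n_cy, n_cx, _ = cell_hist.shape
    blocks = []
    for by in range(n_cy - block_size + 1):
        for bx in range(n_cx - block_size + 1):
            block = cell_hist[by:by + block_size, bx:bx + block_size].ravel()
            blocks.append(normalize_block(block, scheme))
    return np.concatenate(blocks)
```

Because blocks overlap, each cell contributes to several blocks and appears several times in the final vector, each time normalized against different neighbours.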
like SIFT, they gain invariance...

...to illumination, small
deformations, etc.
finally, a sliding window is
classified by a simple linear SVM
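A sketch of the scoring loop, assuming the image has already been turned into a dense feature map of shape (H, W, D) and that a linear SVM has been trained elsewhere, so only its weight vector and bias are needed here. `sliding_window_scores` and its parameters are hypothetical names; scanning over multiple scales and merging overlapping detections are not shown.

```python
import numpy as np

def sliding_window_scores(feature_map, weights, bias, win_h, win_w, stride=1):
    """Score every window position with the linear decision function w.x + b."""
    feature_map = np.asarray(feature_map, dtype=float)
    height, width, _ = feature_map.shape
    scores = {}
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            window = feature_map[y:y + win_h, x:x + win_w].ravel()
            scores[(y, x)] = float(window @ weights + bias)
    return scores
```

Windows with a positive score would be reported as detections; since the classifier is linear, each window costs a single dot product.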
during the learning phase, the
algorithm “looked” for hard examples

      Training




                 adapted from Mart...
average gradients




positive weights                       negative weights
Example




          adapted from Bill Triggs
Example




          adapted from Martial Hebert
Results
   90% @ 1e-5 FPPW
   ...
Experiments
   [DET plots: effect of gradient scale σ; effect of n...]
   ...ive scale significantly increases the performance. (‘c-cor’ is the 1D cubic-corrected ...
Further
Development

 • Detection on Pascal VOC (2006)
 • Human Detection in Movies (ECCV 2006)
 • US Patent by MERL (2006)
 ...
Extension example:
Pyramid HoG++
A simple demo...

               VIDEO HERE
so, it doesn’t work ?!?

          no no, it works...

    ...it just doesn’t work well...
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 Object Recognition and Scene Understanding (Fall 2008)

http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm

This class will review and discuss current approaches to object recognition and scene understanding in computer vision. The course will cover bag of words models, part based models, classifier based models, multiclass object recognition and transfer learning, concurrent recognition and segmentation, context models for object recognition, grammars for scene understanding and large datasets for semi supervised and unsupervised discovery of object and scene categories. We will be reading a mixture of papers from computer vision and influential works from cognitive psychology on object and scene recognition.

  10. 3 papers, first pages shown on the slide: Lowe (1999)

      Object Recognition from Local Scale-Invariant Features
      David G. Lowe, Computer Science Department, University of British Columbia, Vancouver, B.C., V6T 1Z4, Canada, lowe@cs.ubc.ca
      Proc. of the International Conference on Computer Vision, Corfu (Sept. 1999)

      Abstract. An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images with a computation time of under 2 seconds.

      1. Introduction. Object recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion. The features must be at least partially invariant to illumination, 3D projective transforms, and common object variations. On the other hand, the features must also be sufficiently distinctive to identify specific objects among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in finding such image features. However, recent research on the use of dense local features (e.g., Schmid & Mohr [19]) has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.

      This paper presents a new method for image feature generation called the Scale Invariant Feature Transform (SIFT). This approach transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal (IT) cortex in primate vision. This paper also describes improved approaches to indexing and model verification.

      The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations. This approach is based on a model of the behavior of complex cells in the cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT keys. In the current implementation, each image generates on the order of 1000 SIFT keys, a process that requires less than 1 second of computation time.

      The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final estimate of model parameters. When at least 3 keys agree on the model parameters with low residual, there is strong evidence for the presence of the object. Since there may be dozens of SIFT keys in the image of a typical object, it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.

      The current object models are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60 degree rotation away from the camera or to allow up to a 20 degree rotation of a 3D object.

      yey!!

  11. 3 papers, second page overlaid on the first: Dalal and Triggs (2005)

      Histograms of Oriented Gradients for Human Detection
      Navneet Dalal and Bill Triggs, INRIA Rhône-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
      {Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

      Abstract. We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

      1 Introduction. Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17,22]. The proposed descriptors are reminiscent of edge orientation histograms [4,5], SIFT descriptors [12] and shape contexts [1], but they are computed on a dense grid of uniformly spaced cells and they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects of various implementation choices on detector performance, taking “pedestrian detection” (the detection of mostly visible ...

      We briefly discuss previous work on human detection in §2, give an overview of our method §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.

      2 Previous Work. There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth [9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholded gradient magnitudes to build a parts based method containing detectors for faces, heads, and front and side profiles of upper and lower body parts. In contrast, our detector uses a simpler architecture with a single detection window, but appears to give significantly higher performance on pedestrian images.

      3 Overview of the Method

  12. 3 papers, third page overlaid on the first two: Felzenszwalb et al. (2008)

      A Discriminatively Trained, Multiscale, Deformable Part Model
      Pedro Felzenszwalb (University of Chicago, pff@cs.uchicago.edu), David McAllester (Toyota Technological Institute at Chicago, mcallester@tti-c.org), Deva Ramanan (UC Irvine, dramanan@ics.uci.edu)

      Abstract. This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive

      Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.
  13. 13. Object Recognition from Local Scale-Invariant Features
      David G. Lowe (1999)
      Computer Science Department, University of British Columbia, Vancouver, B.C., V6T 1Z4, Canada, lowe@cs.ubc.ca
      Proc. of the International Conference on Computer Vision, Corfu (Sept. 1999)
      Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images with a computation time of under 2 seconds.
      1. Introduction: Object recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion. The features must be at least partially invariant to illumination, 3D projective transforms, and common object variations. On the other hand, the features must also be sufficiently distinctive to identify specific objects among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in finding such image features. However, recent research on the use of dense local features (e.g., Schmid & Mohr [19]) has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.
      This paper presents a new method for image feature generation called the Scale Invariant Feature Transform (SIFT). This approach transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal (IT) cortex in primate vision. This paper also describes improved approaches to indexing and model verification.
      The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations. This approach is based on a model of the behavior of complex cells in the cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT keys. In the current implementation, each image generates on the order of 1000 SIFT keys, a process that requires less than 1 second of computation time.
      The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final estimate of model parameters. When at least 3 keys agree on the model parameters with low residual, there is strong evidence for the presence of the object. Since there may be dozens of SIFT keys in the image of a typical object, it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.
      The current object models are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60 degree rotation away from the camera or to allow up to a 20 degree rotation of a 3D object.
      Histograms of Oriented Gradients for Human Detection
      Navneet Dalal and Bill Triggs, Dalal and Triggs (2005)
      INRIA Rhône-Alps, 655 avenue de l’Europe, Montbonnot 38334, France, {Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr
      Abstract: We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation ...
      A Discriminatively Trained, Multiscale, Deformable Part Model
      Pedro Felzenszwalb (University of Chicago, pff@cs.uchicago.edu), David McAllester (Toyota Technological Institute at Chicago, mcallester@tti-c.org), Deva Ramanan (UC Irvine, dramanan@ics.uci.edu), Felzenszwalb et al. (2008)
      Abstract: This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL ...
      Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution ...
  14. 14. Scale-Invariant Feature Transform (SIFT) adapted from Kucuktunc
  15. 15. Scale-Invariant Feature Transform (SIFT) adapted from Brown, ICCV 2003
  16. 16. SIFT local features are invariant... adapted from David Lee
  17. 17. like me they are robust...
  18. 18. like me they are robust... ... to changes in illumination, noise, viewpoint, occlusion, etc.
  19. 19. I am sure you want to know how to build them
  20. 20. I am sure you want to know how to build them 1. find interest points or “keypoints”
  21. 21. I am sure you want to know how to build them 1. find interest points or “keypoints” 2. find their dominant orientation
  22. 22. I am sure you want to know how to build them 1. find interest points or “keypoints” 2. find their dominant orientation 3. compute their descriptor
  23. 23. I am sure you want to know how to build them 1. find interest points or “keypoints” 2. find their dominant orientation 3. compute their descriptor 4. match them on other images
  24. 24. 1. find interest points or “keypoints”
  25. 25. keypoints are taken as maxima/minima of a DoG pyramid; in this setting, extrema are invariant to scale...
  26. 26. a DoG (Difference of Gaussians) pyramid is simple to compute... even he can do it! before after adapted from Pallus and Fleishman
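To make the "simple to compute" claim concrete, here is a minimal pure-Python sketch of a Difference-of-Gaussians stack for a single octave. The sigma values, the 3-sigma kernel truncation and the border clamping are assumptions chosen for illustration, and a full pyramid would also downsample the image between octaves, which is not shown.

```python
import math

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, truncated at 3 sigma and normalized to sum to 1."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the border."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [sum(k * row[min(max(x + i - radius, 0), width - 1)] for i, k in enumerate(kernel))
         for x in range(width)]
        for row in image
    ]

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, sigma):
    """Separable 2-D Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))

def dog_stack(image, sigmas):
    """Difference between each pair of consecutive blur levels."""
    blurred = [gaussian_blur(image, s) for s in sigmas]
    return [
        [[b - a for a, b in zip(row_lo, row_hi)] for row_lo, row_hi in zip(lo, hi)]
        for lo, hi in zip(blurred, blurred[1:])
    ]

# Toy 9x9 image with a single bright pixel in the centre.
image = [[0.0] * 9 for _ in range(9)]
image[4][4] = 1.0
dog = dog_stack(image, sigmas=[1.0, 1.4, 2.0, 2.8])  # 4 blur levels give 3 DoG layers
```

Each DoG layer has the same size as the input, so the layers can be stacked into the 3D (scale, row, column) space in which extrema are searched.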
  27. 27. then we just have to find neighborhood extrema in this 3D DoG space
  28. 28. then we just have to find neighborhood extrema in this 3D DoG space; if a pixel is an extremum in its neighboring region, it becomes a candidate keypoint
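The neighborhood test can be sketched as a comparison against the 26 neighbours of a sample: 8 in its own DoG layer and 9 in each adjacent layer. The function names and the `[scale][row][col]` nested-list layout below are illustrative assumptions.

```python
def is_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly above or strictly below all 26 neighbours."""
    centre = dog[s][y][x]
    neighbours = [
        dog[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)
    ]
    return centre > max(neighbours) or centre < min(neighbours)

def find_candidates(dog):
    """Scan the interior of a [scale][row][col] stack for candidate keypoints."""
    candidates = []
    for s in range(1, len(dog) - 1):
        for y in range(1, len(dog[s]) - 1):
            for x in range(1, len(dog[s][y]) - 1):
                if is_extremum(dog, s, y, x):
                    candidates.append((s, y, x))
    return candidates

# Toy stack: three 5x5 layers of zeros with one peak in the middle layer.
dog = [[[0.0] * 5 for _ in range(5)] for _ in range(3)]
dog[1][2][2] = 1.0
print(find_candidates(dog))  # [(1, 2, 2)]
```

Only interior samples are tested, so the first and last layers and the image border never produce candidates in this sketch.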
  29. 29. too many keypoints? adapted from wikipedia
  30. 30. too many keypoints? 1. remove low contrast adapted from wikipedia
  31. 31. too many keypoints? 1. remove low contrast adapted from wikipedia
  32. 32. too many keypoints? 1. remove low contrast 2. remove edges adapted from wikipedia
  33. 33. too many keypoints? 1. remove low contrast 2. remove edges adapted from wikipedia
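The two pruning steps can be sketched as follows. The slides do not give thresholds, so the contrast threshold of 0.03 and the curvature ratio of 10 used here are assumptions borrowed from Lowe's later (2004) description of SIFT, and the Hessian is estimated with plain finite differences on one DoG layer.

```python
def keep_keypoint(layer, y, x, contrast_threshold=0.03, edge_ratio=10.0):
    """Reject low-contrast candidates and candidates that lie along an edge."""
    value = layer[y][x]
    if abs(value) < contrast_threshold:
        return False  # 1. remove low contrast
    # 2x2 Hessian of the DoG layer from finite differences.
    dxx = layer[y][x + 1] - 2 * value + layer[y][x - 1]
    dyy = layer[y + 1][x] - 2 * value + layer[y - 1][x]
    dxy = (layer[y + 1][x + 1] - layer[y + 1][x - 1]
           - layer[y - 1][x + 1] + layer[y - 1][x - 1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        return False  # flat in one direction or saddle-shaped: not a blob
    # 2. remove edges: one principal curvature much larger than the other
    return trace * trace / det < (edge_ratio + 1) ** 2 / edge_ratio

blob = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
ridge = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
print(keep_keypoint(blob, 1, 1), keep_keypoint(ridge, 1, 1))  # True False
```

The ratio test avoids computing eigenvalues explicitly: it only needs the trace and determinant of the Hessian.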
  34. 34. 2. find their dominant orientation
  35. 35. each selected keypoint is assigned to one or more “dominant” orientations...
  36. 36. each selected keypoint is assigned to one or more “dominant” orientations... ... this step is important to achieve rotation invariance
  37. 37. How?
  38. 38. How? using the DoG pyramid to achieve scale invariance:
  39. 39. How? using the DoG pyramid to achieve scale invariance: a. compute image gradient magnitude and orientation
  40. 40. How? using the DoG pyramid to achieve scale invariance: a. compute image gradient magnitude and orientation b. build an orientation histogram
  41. 41. How? using the DoG pyramid to achieve scale invariance: a. compute image gradient magnitude and orientation b. build an orientation histogram c. keypoint’s orientation(s) = peak(s)
  42. 42. a. compute image gradient magnitude and orientation
  43. 43. a. compute image gradient magnitude and orientation
  44. 44. b. build an orientation histogram adapted from Ofir Pele
  45. 45. c. keypoint’s orientation(s) = peak(s) * * the peak ;-)
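Steps a to c can be sketched in a few lines. This is a simplified illustration, not Lowe's exact procedure: it uses 36 bins over 360 degrees, leaves out the Gaussian weighting around the keypoint, and reports every bin within 80% of the highest one (the 0.8 ratio is an assumption taken from later descriptions of SIFT).

```python
import math

def gradient(image, y, x):
    """a. gradient magnitude and orientation (degrees in [0, 360)) from pixel differences."""
    dx = image[y][x + 1] - image[y][x - 1]
    dy = image[y + 1][x] - image[y - 1][x]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx)) % 360.0

def orientation_histogram(image, num_bins=36):
    """b. magnitude-weighted histogram of orientations over the interior pixels."""
    hist = [0.0] * num_bins
    bin_width = 360.0 / num_bins
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            magnitude, orientation = gradient(image, y, x)
            hist[int(orientation // bin_width) % num_bins] += magnitude
    return hist

def dominant_orientations(hist, peak_ratio=0.8):
    """c. bin centres (degrees) of all bins within peak_ratio of the highest bin."""
    top = max(hist)
    bin_width = 360.0 / len(hist)
    return [(i + 0.5) * bin_width for i, v in enumerate(hist) if top > 0 and v >= peak_ratio * top]

# Toy patch: intensity increases from left to right, so every gradient points along +x.
patch = [[float(x) for x in range(6)] for _ in range(6)]
print(dominant_orientations(orientation_histogram(patch)))  # [5.0]
```

Returning several orientations is what lets one keypoint be assigned "one or more" dominant orientations; each one would then get its own descriptor.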
  46. 46. 3. compute their descriptor
  47. 47. SIFT descriptor = a set of orientation histograms: 16x16 neighborhood of pixel gradients, 4x4 array x 8 bins = 128 dimensions (normalized)
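The arithmetic 4 x 4 x 8 = 128 can be checked with a small sketch. It assumes the gradient magnitudes and orientations of the 16x16 neighborhood are already available as nested lists, with orientations measured relative to the keypoint's dominant orientation. Interpolation between bins and Gaussian weighting are left out, so this is a simplification rather than the full descriptor.

```python
import math

def sift_like_descriptor(magnitudes, orientations, grid=4, num_bins=8):
    """Build a grid x grid x num_bins descriptor from a square patch of gradients."""
    size = len(magnitudes)
    cell = size // grid                      # 16 // 4 = 4 pixels per cell side
    bin_width = 360.0 / num_bins             # 45 degrees per orientation bin
    descriptor = [0.0] * (grid * grid * num_bins)
    for y in range(size):
        for x in range(size):
            cell_index = (y // cell) * grid + (x // cell)
            bin_index = int((orientations[y][x] % 360.0) // bin_width) % num_bins
            descriptor[cell_index * num_bins + bin_index] += magnitudes[y][x]
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0 else descriptor

# Toy input: unit magnitudes and orientations cycling through the 8 bins.
magnitudes = [[1.0] * 16 for _ in range(16)]
orientations = [[45.0 * ((x + y) % 8) for x in range(16)] for y in range(16)]
descriptor = sift_like_descriptor(magnitudes, orientations)
print(len(descriptor))  # 128
```

The final division by the vector's length is the "(normalized)" step, which is what gives robustness to changes in image contrast.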
  48. 48. 4. match them on other images
  49. 49. How to match?
  50. 50. How to match? nearest neighbor
  51. 51. How to match? nearest neighbor hough transform voting
  52. 52. How to match? nearest neighbor hough transform voting least-squares fit
  53. 53. How to match? nearest neighbor hough transform voting least-squares fit etc.
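The first of these options, nearest-neighbor matching, can be sketched with Euclidean distance between descriptors. The ratio test against the second-best match and its 0.8 value are assumptions added for illustration. Hough transform voting and the least-squares fit, which check that matches agree on a pose, are not shown.

```python
import math

def match_descriptors(query, database, max_ratio=0.8):
    """Return (query_index, database_index) pairs whose best match is unambiguous.

    Assumes the database holds at least two descriptors.
    """
    matches = []
    for qi, q in enumerate(query):
        distances = sorted((math.dist(q, d), di) for di, d in enumerate(database))
        best, second = distances[0], distances[1]
        if best[0] < max_ratio * second[0]:
            matches.append((qi, best[1]))
    return matches

# Toy 2-D "descriptors": the first query is close to database[0],
# the second sits about halfway between both entries and is rejected.
query = [[1.0, 0.0], [0.5, 0.6]]
database = [[0.9, 0.1], [0.0, 1.0]]
print(match_descriptors(query, database))  # [(0, 0)]
```

This brute-force search costs one distance per query and database pair; an approximate index would be needed for large databases.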
  54. 54. SIFT is great!
  55. 55. SIFT is great! (partially) invariant to affine transformations
  56. 56. SIFT is great! (partially) invariant to affine transformations easy to understand
  57. 57. SIFT is great! (partially) invariant to affine transformations easy to understand fast to compute
  58. 58. Extension example: Spatial Pyramid Matching using SIFT
      Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories
      Svetlana Lazebnik (Beckman Institute, University of Illinois, slazebni@uiuc.edu), Cordelia Schmid (INRIA Rhône-Alpes, Montbonnot, France, Cordelia.Schmid@inrialpes.fr), Jean Ponce (Beckman Institute, University of Illinois, and Ecole Normale Supérieure, Paris, France, ponce@cs.uiuc.edu)
      CVPR 2006
  59. 59. Object Recognition from Local Scale-Invariant Features, David G. Lowe, Lowe (1999)
      Histograms of Oriented Gradients for Human Detection, Navneet Dalal and Bill Triggs, Dalal and Triggs (2005)
      A Discriminatively Trained, Multiscale, Deformable Part Model, Pedro Felzenszwalb, David McAllester, Deva Ramanan, Felzenszwalb et al. (2008)
  60. 60. Histograms of Oriented Gradients for Human Detection
      Navneet Dalal and Bill Triggs
      INRIA Rhône-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
      {Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr
      Abstract: We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
      1 Introduction: Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) de...
      We briefly discuss previous work on human detection in §2, give an overview of our method §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.
      2 Previous Work: There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth ...
      first of all, let me put this paper in context
  61. 61. Histograms of Oriented Gradients for Human Detection, Navneet Dalal and Bill Triggs
      • Swain & Ballard 1991 - Color Histograms
      • Schiele & Crowley 1996 - Receptive Fields Histograms
      • Lowe 1999 - SIFT
      • Schneiderman & Kanade 2000 - Localized Histograms of Wavelets
      • Leung & Malik 2001 - Texton Histograms
      • Belongie et al. 2002 - Shape Context
      • Dalal & Triggs 2005 - Dense Orientation Histograms
      • ...
      histograms of local image measurements have been quite successful
  62. 62. Histograms of Oriented Gradients for Human Detection, Navneet Dalal and Bill Triggs: features
      • Gavrila & Philomen 1999 - Edge Templates + Nearest Neighbor
      • Papageorgiou & Poggio 2000, Mohan et al. 2001, DePoortere et al. 2002 - Haar Wavelets + SVM
      • Viola & Jones 2001 - Rectangular Differential Features + AdaBoost
      • Mikolajczyk et al. 2004 - Parts Based Histograms + AdaBoost
      • Ke & Sukthankar 2004 - PCA-SIFT
      • ...
      tons of “feature sets” have been proposed
  63. 63. Histograms of Oriented Gradients for Human Detection, Navneet Dalal and Bill Triggs: difficult!
      • Wide variety of articulated poses
      • Variable appearance/clothing
      • Complex backgrounds
      • Unconstrained illuminations
      • Occlusions
      • Different scales
      • ...
      localizing humans in images is a challenging task...
  64. 64. Approach
  65. 65. Approach • robust feature set (HOG)
  66. 66. Approach • robust feature set (HOG)
  67. 67. Approach • robust feature set (HOG) • simple classifier (linear SVM)
  68. 68. Approach • robust feature set (HOG) • simple classifier (linear SVM) • fast detection (sliding window)
  69. 69. adapted from Bill Triggs
  70. 70. • Gamma normalization • Space: RGB, LAB or Gray • Method: SQRT or LOG
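As a rough illustration of the two compression methods named on the slide, the sketch below applies them per pixel to non-negative intensities. Using `log(1 + v)` rather than `log(v)` is an assumption made here so that zero-valued pixels stay defined; the function name is hypothetical.

```python
import math

def compress(image, method="sqrt"):
    """Apply square-root or log compression to non-negative pixel values."""
    if method == "sqrt":
        return [[math.sqrt(v) for v in row] for row in image]
    if method == "log":
        return [[math.log1p(v) for v in row] for row in image]  # log(1 + v)
    raise ValueError(f"unknown method: {method}")

print(compress([[0.0, 4.0, 9.0]], "sqrt"))  # [[0.0, 2.0, 3.0]]
```

Both methods reduce the influence of very bright regions before gradients are computed; for a colour image the same function would be applied to each channel.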
  71. 71. • Filtering with simple masks: centered*, uncentered, cubic-corrected, diagonal, Sobel (* centered performs the best)
  72. 72. remember SIFT? • Filtering with simple masks: centered, uncentered, cubic-corrected
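As a rough illustration of the filtering step, here is a minimal sketch that applies the centered [-1, 0, 1] mask along x and y and turns the result into a magnitude and an unsigned orientation per pixel. The nested-list image format, the function name `gradients`, and the border handling (reusing the nearest valid pixel) are assumptions made for the example.

```python
import math

def gradients(image):
    """Apply the centered [-1, 0, 1] mask along x and y.

    Returns (magnitude, orientation) per pixel, with orientation in
    degrees folded into the unsigned range [0, 180).
    Border pixels reuse the nearest valid neighbour (an assumption).
    """
    h, w = len(image), len(image[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang

# A vertical edge: the gradient points along x, so the orientation is 0 degrees.
edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
mag, ang = gradients(edge)
```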
  73. 73. ...after filtering, each “pixel” represents an oriented gradient...
  74. 74. ...pixels are regrouped in “cells”, they cast a weighted vote for an orientation histogram... HOG (Histogram of Oriented Gradients)
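The voting step can be sketched as follows. This is a simplified illustration, assuming per-pixel magnitude and orientation grids are already available as nested lists, 8×8 pixel cells, and 9 unsigned bins over 0 to 180 degrees; each pixel votes into a single bin with its magnitude as the weight, whereas a full implementation would typically interpolate votes between neighbouring bins. The name `cell_histograms` is hypothetical.

```python
def cell_histograms(mag, ang, cell=8, bins=9):
    """Accumulate magnitude-weighted orientation votes per cell.

    mag, ang: per-pixel gradient magnitude and orientation (degrees, 0-180).
    Returns hist[cell_row][cell_col][bin]. Pixels that do not fill a whole
    cell at the right/bottom border are ignored (an assumption).
    """
    rows, cols = len(mag) // cell, len(mag[0]) // cell
    bin_width = 180.0 / bins
    hist = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            # Hard assignment to one bin; the modulo folds 180 back onto bin 0.
            b = int(ang[y][x] // bin_width) % bins
            hist[y // cell][x // cell][b] += mag[y][x]
    return hist

# One 8x8 cell where every pixel has magnitude 1 and orientation 45 degrees.
mag = [[1.0] * 8 for _ in range(8)]
ang = [[45.0] * 8 for _ in range(8)]
hist = cell_histograms(mag, ang)
```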
  75. 75. a window can be represented like that
  76. 76. then, cells are locally normalized using overlapping “blocks”
  77. 77. they used two types of blocks
  78. 78. they used two types of blocks • rectangular: similar to SIFT (but dense)
  79. 79. they used two types of blocks • rectangular: similar to SIFT (but dense) • circular: similar to Shape Context
  80. 80. and four different types of block normalization
  81. 81. and four different types of block normalization
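The four schemes are the ones that appear later in the normalization plot (L2-Hys, L2-norm, L1-Sqrt, L1-norm). Here is a minimal sketch of them applied to one block vector, i.e. the concatenated cell histograms of a block. The function name, the epsilon value, and the exact placement of epsilon are assumptions; the 0.2 clipping threshold for L2-Hys follows the Dalal and Triggs paper.

```python
import math

EPS = 1e-6  # small constant to avoid division by zero (value is an assumption)

def normalize_block(v, scheme="L2-Hys"):
    """Normalize one block vector of non-negative histogram entries."""
    l1 = sum(abs(x) for x in v)
    l2 = math.sqrt(sum(x * x for x in v) + EPS ** 2)
    if scheme == "L1-norm":
        return [x / (l1 + EPS) for x in v]
    if scheme == "L1-sqrt":
        # L1-norm followed by a square root; needs non-negative entries.
        return [math.sqrt(x / (l1 + EPS)) for x in v]
    if scheme == "L2-norm":
        return [x / l2 for x in v]
    if scheme == "L2-Hys":
        # L2-norm, clip large values at 0.2, then renormalize.
        clipped = [min(x / l2, 0.2) for x in v]
        l2_clipped = math.sqrt(sum(x * x for x in clipped) + EPS ** 2)
        return [x / l2_clipped for x in clipped]
    raise ValueError(f"unknown scheme: {scheme}")

block = [3.0, 4.0, 0.0, 0.0]
for scheme in ("L1-norm", "L1-sqrt", "L2-norm", "L2-Hys"):
    print(scheme, normalize_block(block, scheme))
```

The clipping in L2-Hys limits the influence of a single very strong gradient, which is one way to read the "invariance to illumination" claim on the next slide.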
  82. 82. like SIFT, they gain invariance... ...to illuminations, small deformations, etc.
  83. 83. finally, a sliding window is classified by a simple linear SVM
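A sliding-window detector with a linear classifier reduces to a dot product per window. The sketch below is a toy illustration under stated assumptions: `extract` stands in for whatever feature extractor is used (HOG in the talk), `weights` and `bias` are an already-trained linear model, and the window stride of 8 pixels and the zero threshold are arbitrary choices; only the 64×128 window size is taken from the slides. Scanning over multiple scales is left out.

```python
def score_window(features, weights, bias):
    """Linear SVM decision value: w . x + b."""
    return sum(w * f for w, f in zip(weights, features)) + bias

def sliding_window_detect(image, extract, weights, bias,
                          win_h=128, win_w=64, stride=8, threshold=0.0):
    """Scan a single-scale image and return (top, left, score) for accepted windows.

    extract: hypothetical callable mapping a window (nested lists) to a
    flat feature list of the same length as weights.
    """
    detections = []
    h, w = len(image), len(image[0])
    for top in range(0, h - win_h + 1, stride):
        for left in range(0, w - win_w + 1, stride):
            window = [row[left:left + win_w] for row in image[top:top + win_h]]
            score = score_window(extract(window), weights, bias)
            if score > threshold:
                detections.append((top, left, score))
    return detections

# Toy usage: the "feature" is just the mean intensity of a 2x2 window.
mean_feature = lambda win: [sum(map(sum, win)) / (len(win) * len(win[0]))]
toy = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
hits = sliding_window_detect(toy, mean_feature, [1.0], -0.5,
                             win_h=2, win_w=2, stride=2)
```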
  84. 84. during the learning phase, the algorithm “looked” for hard examples. Training (adapted from Martial Hebert)
  85. 85. average gradients, positive weights, negative weights
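The "looking for hard examples" step can be sketched as one round of retraining: train once, collect the negatives that the detector wrongly accepts, add them to the negative set, and train again. Everything here is a hypothetical stand-in: `train` is any callable that takes positives and negatives and returns a scoring function, and the toy 1-D trainer in the usage example exists only to make the sketch runnable.

```python
def mine_hard_negatives(score, candidates, threshold=0.0):
    """Return the negative samples that the current detector wrongly accepts."""
    return [x for x in candidates if score(x) > threshold]

def train_with_hard_negatives(train, positives, negatives, candidates):
    """Train, collect false positives among candidates, then retrain once.

    train: hypothetical callable (positives, negatives) -> scoring function.
    Returns the retrained scoring function and the mined hard examples.
    """
    score = train(positives, negatives)
    hard = mine_hard_negatives(score, candidates)
    return train(positives, negatives + hard), hard

# Toy 1-D "trainer": put the decision boundary halfway between the classes.
def toy_train(positives, negatives):
    midpoint = (min(positives) + max(negatives)) / 2.0
    return lambda x: x - midpoint

final_score, hard = train_with_hard_negatives(
    toy_train, positives=[5.0, 6.0], negatives=[0.0, 1.0],
    candidates=[2.0, 4.0, 1.0])
```

In the toy run, the sample 4.0 is accepted by the first model, gets added to the negatives, and is rejected after retraining.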
  86. 86. Example
  87. 87. Example adapted from Bill Triggs
  88. 88. Example adapted from Martial Hebert
  89. 89. Results: 90% @ 1e-5 FPPW. [Figure: DET curves, miss rate vs. false positives per window (FPPW), for different descriptors on the MIT database (left) and on the INRIA database (right). Curves on MIT: Lin. R-HOG, Lin. C-HOG, Lin. EC-HOG, Wavelet, PCA-SIFT, Lin. G-ShapeC, Lin. E-ShapeC, MIT best (part), MIT baseline. Curves on INRIA: Ker. R-HOG, Lin. R2-HOG, Lin. R-HOG, Lin. C-HOG, Lin. EC-HOG, Wavelet, PCA-SIFT, Lin. G-ShapeC, Lin. E-ShapeC. Slide annotations: “not good”, “good”.] Figure 3. The performance of selected detectors on (left) MIT and (right) INRIA data sets. See the text for details. Paper text visible around the figure: “... detector performance. Throughout this section we refer results to our default detector which has the following properties, described below: RGB colour space with no gamma cor...” and “...tive masks. Several smoothing scales were tested including σ=0 (none). Masks tested included various 1-D point derivatives (uncentred [−1, 1], centred [−1, 0, 1] and cubic-...”
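To make the DET axes concrete, here is a small sketch of how one point on such a curve could be read off: pick the threshold that lets through at most a given fraction of negative windows (the FPPW), then count the positives that fall below it. The function name and the tie-handling rule are assumptions; the scores in the example are made up for illustration and are not results from the paper.

```python
def miss_rate_at_fppw(pos_scores, neg_scores, target_fppw):
    """Miss rate at the threshold allowing at most target_fppw false positives per window."""
    allowed = int(target_fppw * len(neg_scores))  # negatives allowed through
    ranked = sorted(neg_scores, reverse=True)
    # A window is accepted when its score is strictly above the threshold,
    # so at most `allowed` negatives can pass.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    missed = sum(1 for s in pos_scores if s <= threshold)
    return missed / len(pos_scores)

# Made-up scores, for illustration only.
neg = [0.1, 0.2, 0.3, 0.9]
pos = [0.5, 0.95, 0.25, 1.0]
print(miss_rate_at_fppw(pos, neg, 0.25))
```

Lower is better on both axes, so a good detector sits in the bottom-left corner of the plot.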
  90. 90. Experiments. [Figure: six DET plots, miss rate vs. false positives per window (FPPW). (a) Effect of gradient scale σ: σ=0, σ=0.5, σ=1, σ=2, σ=3, σ=0 with c-cor. (b) Effect of number of orientation bins β: 9, 6, 4, 3 bins over 0–180; 18, 12, 8, 6 bins over 0–360. (c) Effect of normalization methods: L2-Hys, L2-norm, L1-Sqrt, L1-norm, No norm, Window norm. (d) Effect of window size: 64x128, 56x120, 48x112. (e) Effect of kernel width γ on kernel SVM: Linear, γ=8e-3, γ=3e-2, γ=7e-2. (f) Effect of overlap (cell size=8, num cell = 2x2, wt=0): overlap = 3/4, stride = 4; overlap = 1/2, stride = 8; overlap = 0, stride = 16.] Figure 4. For details see the text. (a) Using fine derivative scale significantly increases the performance. (‘c-cor’ is the 1D cubic-corrected point derivative). (b) Increasing the number of orientation bins increases performance significantly up to about 9 bins spaced over 0°–180°. (c) The effect of different block normalization schemes (see §6.4). (d) Using overlapping descriptor blocks decreases the miss rate by around 5%. (e) Reducing the 16 pixel margin around the 64×128 detection window decreases the performance by about 3%. (f) Using a Gaussian kernel SVM, exp(−γ‖x1 − x2‖²), improves the performance by about 3%. Paper text visible below the figure: “... magnitude itself gives the best results. Taking the square root ...”
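The overlap settings in panel (f) change how many blocks fit in a window, and therefore the length of the feature vector. A small sketch of that arithmetic, assuming blocks are placed on a regular pixel grid and each block contributes (cells per block) × (bins) values; the function name is hypothetical and the default parameters are taken from the plot labels (64×128 window, cell size 8, 2×2 cells per block, stride 8, 9 bins).

```python
def hog_descriptor_length(win_w=64, win_h=128, cell=8, block=2, stride=8, bins=9):
    """Number of values in the descriptor of one detection window.

    block: block side in cells; stride: block step in pixels.
    """
    block_px = block * cell
    blocks_x = (win_w - block_px) // stride + 1
    blocks_y = (win_h - block_px) // stride + 1
    return blocks_x * blocks_y * block * block * bins

print(hog_descriptor_length())           # overlap = 1/2, stride = 8
print(hog_descriptor_length(stride=16))  # overlap = 0, stride = 16
```

With the defaults this gives 7 × 15 blocks of 36 values each, so 3780 values per window; without overlap the window has only 4 × 8 blocks, so 1152 values.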
  91. 91. Experiments. [Figure: miss rate (%) as a function of cell size (pixels, from 4x4 to 12x12) and block size (cells, from 1x1 to 4x4).] Figure 5. The miss rate at 10^−4 FPPW as the cell and block sizes change. The stride (block overlap) is fixed at half of the block size. 3×3 blocks of 6×6 pixel cells perform best, with 10.4% miss rate.
  92. 92. Further Development
  93. 93. Further Development • Detection on Pascal VOC (2006)
  94. 94. Further Development • Detection on Pascal VOC (2006) • Human Detection in Movies (ECCV 2006)
  95. 95. Further Development • Detection on Pascal VOC (2006) • Human Detection in Movies (ECCV 2006) • US Patent by MERL (2006)
  96. 96. Further Development • Detection on Pascal VOC (2006) • Human Detection in Movies (ECCV 2006) • US Patent by MERL (2006) • Stereo Vision HoG (ICVES 2008)
  97. 97. Extension example: Pyramid HoG++
  98. 98. Extension example: Pyramid HoG++
  99. 99. Extension example: Pyramid HoG++
  100. 100. A simple demo...
  101. 101. A simple demo...
  102. 102. A simple demo... VIDEO HERE
  103. 103. A simple demo... VIDEO HERE
  104. 104. so, it doesn’t work ?!?
  105. 105. so, it doesn’t work ?!? no no, it works...
  106. 106. so, it doesn’t work ?!? no no, it works... ...it just doesn’t work well...