CVPR2010: modeling mutual context of object and human pose in human-object interaction activities


  1. Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities. Bangpeng Yao and Li Fei-Fei, Computer Science Department, Stanford University. {bangpeng,feifeili}@cs.stanford.edu
  2. Human-Object Interaction. Applications: robots interacting with objects; automatic sports commentary ("Kobe is dunking the ball."); medical care.
  3. Human-Object Interaction. Holistic image-based classification (previous talk: Grouplet; example labels: playing bassoon, playing saxophone) vs. detailed understanding and reasoning (HOI activity: tennis forehand). Grouplet is a generic feature for structured objects, or for interactions of groups of objects. Caltech101 results: Berg & Malik, 2005: 48%; Grauman & Darrell, 2005: 59%; Gehler & Nowozin, 2009: 77%; ours: 62%.
  4. Human-Object Interaction. Holistic image-based classification. Detailed understanding and reasoning: • Human pose estimation (head, torso).
  5. Human-Object Interaction. Holistic image-based classification. Detailed understanding and reasoning: • Human pose estimation • Object detection (tennis racket).
  6. Human-Object Interaction. Holistic image-based classification. Detailed understanding and reasoning: • Human pose estimation (head, torso) • Object detection (tennis racket). HOI activity: tennis forehand.
  7. Outline: • Background and Intuition • Mutual Context of Object and Human Pose (Model Representation; Model Learning; Model Inference) • Experiments • Conclusion
  8. Outline: • Background and Intuition • Mutual Context of Object and Human Pose (Model Representation; Model Learning; Model Inference) • Experiments • Conclusion
  9. Human pose estimation & Object detection. Human pose estimation is challenging: difficult part appearance; self-occlusion; image regions that look like a body part. References: Felzenszwalb & Huttenlocher, 2005; Ren et al., 2005; Ramanan, 2006; Ferrari et al., 2008; Yang & Mori, 2008; Andriluka et al., 2009; Eichner & Ferrari, 2009.
  10. Human pose estimation & Object detection. Human pose estimation is challenging. References: Felzenszwalb & Huttenlocher, 2005; Ren et al., 2005; Ramanan, 2006; Ferrari et al., 2008; Yang & Mori, 2008; Andriluka et al., 2009; Eichner & Ferrari, 2009.
  11. Human pose estimation & Object detection. Pose estimation is facilitated given that the object is detected.
  12. Human pose estimation & Object detection. Object detection is challenging: objects are small, low-resolution, and partially occluded; some image regions look similar to the detection target. References: Viola & Jones, 2001; Lampert et al., 2008; Divvala et al., 2009; Vedaldi et al., 2009.
  13. Human pose estimation & Object detection. Object detection is challenging. References: Viola & Jones, 2001; Lampert et al., 2008; Divvala et al., 2009; Vedaldi et al., 2009.
  14. Human pose estimation & Object detection. Object detection is facilitated given that the pose is estimated.
  15. Human pose estimation & Object detection: mutual context.
  16. Context in Computer Vision. Previous work uses context cues to facilitate object detection. This is helpful, but detection with context outperforms detection without context only moderately (~3-4% better). References: Hoiem et al., 2006; Rabinovich et al., 2007; Oliva & Torralba, 2007; Heitz & Koller, 2008; Desai et al., 2009; Divvala et al., 2009; Murphy et al., 2003; Shotton et al., 2006; Harzallah et al., 2009; Li, Socher & Fei-Fei, 2009; Marszalek et al., 2009; Bao & Savarese, 2010; Viola & Jones, 2001; Lampert et al., 2008.
  17. Context in Computer Vision. Previous work uses context cues to facilitate object detection: helpful, but only a moderate improvement (~3-4% better with context than without). Our approach: two challenging tasks serve as mutual context for each other (example results shown with mutual context and without context). References: Hoiem et al., 2006; Rabinovich et al., 2007; Oliva & Torralba, 2007; Heitz & Koller, 2008; Desai et al., 2009; Divvala et al., 2009; Murphy et al., 2003; Shotton et al., 2006; Harzallah et al., 2009; Li, Socher & Fei-Fei, 2009; Marszalek et al., 2009; Bao & Savarese, 2010.
  18. Outline: • Background and Intuition • Mutual Context of Object and Human Pose (Model Representation; Model Learning; Model Inference) • Experiments • Conclusion
  19. Mutual Context Model Representation. Variables: $A$: activity (e.g. tennis forehand, croquet shot, volleyball smash); $O$: object (e.g. tennis racket, croquet mallet, volleyball); $H$: human pose, capturing intra-class variations (more than one $H$ for each $A$; unobserved during training); $P_1, P_2, \dots, P_N$: body parts, each with location $l_P$, orientation $\theta_P$, and scale $s_P$; $f_O, f_1, f_2, \dots, f_N$: image evidence, using shape context [Belongie et al., 2002].
  20. Mutual Context Model Representation. Markov random field: $\Psi = \sum_{e \in E} w_e \psi_e$, where $w_e$ is the clique weight and $\psi_e$ is the clique potential. • $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: frequency of co-occurrence between $A$, $O$, and $H$.
  21. Mutual Context Model Representation. Adds to slide 20: • $\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$: spatial relationship among the object and body parts, computed as $\mathrm{bin}(l_O - l_{P_n}) \cdot \mathrm{bin}(\theta_O - \theta_{P_n}) \cdot \mathcal{N}(s_O / s_{P_n})$ (location, orientation, and size terms, respectively).
  22. Mutual Context Model Representation. Adds to slide 21: • Learn the structural connectivity among the body parts and the object (obtained by structure learning).
  23. Mutual Context Model Representation. Adds to slide 22: • $\psi_e(O, f_O)$ and $\psi_e(P_n, f_{P_n})$: discriminative part detection scores, using shape context + AdaBoost [Andriluka et al., 2009] [Belongie et al., 2002] [Viola & Jones, 2001].
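The model score described in these slides is a weighted sum of clique potentials, one term per edge of the Markov random field. The following is a minimal sketch of that sum only, assuming the potential values have already been computed for one image; the edge names, numbers, and the function `model_score` are hypothetical and are not taken from the authors' implementation.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # an edge is named by the two nodes it connects


def model_score(potentials: Dict[Edge, float], weights: Dict[Edge, float]) -> float:
    """Return the weighted sum of clique potentials, sum over e of w_e * psi_e."""
    return sum(weights[edge] * psi for edge, psi in potentials.items())


# Hypothetical potential values and weights for three edges of the model.
potentials = {("A", "O"): 0.8, ("A", "H"): 0.5, ("O", "P1"): 0.25}
weights = {("A", "O"): 1.0, ("A", "H"): 2.0, ("O", "P1"): 4.0}
score = model_score(potentials, weights)
```

Keeping potentials and weights in separate dictionaries mirrors the split in the slides between potential parameters and potential weights, which are learned by different procedures.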
  24. Outline: • Background and Intuition • Mutual Context of Object and Human Pose (Model Representation; Model Learning; Model Inference) • Experiments • Conclusion
  25. Model Learning. Model: $\Psi = \sum_{e \in E} w_e \psi_e$. Input: training images (examples shown: cricket shot, cricket bowling). Goals: hidden human poses.
  26. Model Learning. Input: training images (cricket shot, cricket bowling). Goals: hidden human poses; structural connectivity.
  27. Model Learning. Input: training images (cricket shot, cricket bowling). Goals: hidden human poses; structural connectivity; potential parameters; potential weights.
  28. Model Learning. Input: training images (cricket shot, cricket bowling). Goals and the corresponding learning problems: hidden human poses (hidden variables); structural connectivity (structure learning); potential parameters and potential weights (parameter estimation).
  29. Model Learning. Approach, illustrated on croquet shot. Goals: hidden human poses; structural connectivity; potential parameters; potential weights.
  30. Model Learning. Approach: hill-climbing over the edge set $E$, solving $\max_{E} \left[ \sum_{e \in E} w_e \psi_e - \frac{|E|^2}{2\sigma^2} \right]$, where the first term is the joint density of the model and the second is a Gaussian prior on the number of edges. Goals: hidden human poses; structural connectivity; potential parameters; potential weights.
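The hill-climbing step trades the fit of the chosen edges against a Gaussian penalty on how many edges are used. Below is a minimal sketch of that idea, under the simplifying assumption that each candidate edge contributes a fixed, precomputed score (in the actual model the fit depends on the learned potentials). The names `objective`, `hill_climb`, `edge_score`, and `sigma` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str]


def objective(edges: FrozenSet[Edge], edge_score: Dict[Edge, float], sigma: float) -> float:
    """Fit of the selected edges minus the penalty |E|^2 / (2 * sigma^2)."""
    fit = sum(edge_score[e] for e in edges)
    return fit - len(edges) ** 2 / (2.0 * sigma ** 2)


def hill_climb(candidates: List[Edge], edge_score: Dict[Edge, float], sigma: float):
    """Greedy search: toggle one edge at a time while the objective improves."""
    current: FrozenSet[Edge] = frozenset()
    best = objective(current, edge_score, sigma)
    while candidates:
        # Neighbours differ from the current structure by exactly one edge.
        neighbours = [current ^ {e} for e in candidates]
        top = max(neighbours, key=lambda s: objective(s, edge_score, sigma))
        top_value = objective(top, edge_score, sigma)
        if top_value <= best:
            break  # local optimum reached
        current, best = top, top_value
    return current, best


# Hypothetical per-edge scores between the object O and two body parts.
edge_score = {("O", "P1"): 3.0, ("P1", "P2"): 1.0, ("O", "P2"): 0.2}
structure, value = hill_climb(list(edge_score), edge_score, sigma=1.0)
```

With `sigma=1.0` the penalty grows quickly, so only the strongest edge survives; a larger `sigma` weakens the prior and admits more edges.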
  31. Model Learning. Approach for the potential parameters: • Maximum likelihood for $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$, $\psi_e(H, P_n)$, $\psi_e(O, P_n)$, $\psi_e(P_m, P_n)$. • Standard AdaBoost for $\psi_e(O, f_O)$ and $\psi_e(P_n, f_{P_n})$. Goals: hidden human poses; structural connectivity; potential parameters; potential weights.
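For the co-occurrence potentials, a maximum-likelihood estimate reduces to counting how often each pair of labels appears together in the training set. The sketch below assumes a joint normalization over all observed pairs; the slides do not say whether the authors normalize jointly or conditionally, and the function name and training pairs are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Pair = Tuple[Hashable, Hashable]


def cooccurrence_frequency(pairs: Iterable[Pair]) -> Dict[Pair, float]:
    """Relative frequency of each (label, label) pair among the training pairs."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical (activity, object) annotations from four training images.
training_pairs = [
    ("tennis forehand", "tennis racket"),
    ("tennis forehand", "tennis racket"),
    ("tennis forehand", "tennis racket"),
    ("croquet shot", "croquet mallet"),
]
freq = cooccurrence_frequency(training_pairs)
```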
  32. Model Learning. Approach for the potential weights: max-margin learning, $\min_{w, \xi} \; \frac{1}{2} \sum_r \|w_r\|_2^2 + \beta \sum_i \xi_i$, subject to $\forall i, r$ where $y(r) \neq y(c_i)$: $w_{c_i} \cdot x_i - w_r \cdot x_i \geq 1 - \xi_i$, and $\forall i$: $\xi_i \geq 0$. Notations: • $x_i$: potential values of the $i$-th image. • $w_r$: potential weights of the $r$-th pose. • $y(r)$: activity of the $r$-th pose. • $\xi_i$: a slack variable for the $i$-th image. Goals: hidden human poses; structural connectivity; potential parameters; potential weights.
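To make the max-margin objective concrete, the sketch below evaluates it for a given set of weight vectors, choosing each slack variable as the smallest value that satisfies the margin constraints. It assumes that `c[i]` is the index of the pose assigned to image `i` and that `beta` is the trade-off constant in front of the slack sum; both readings, and all names and numbers, are assumptions for illustration. A real implementation would minimize this objective with a solver rather than just evaluate it.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def max_margin_objective(w: List[List[float]], x: List[List[float]],
                         c: List[int], y: List[str], beta: float) -> float:
    """0.5 * sum_r ||w_r||^2 + beta * sum_i xi_i, with the smallest feasible xi_i."""
    regulariser = 0.5 * sum(dot(w_r, w_r) for w_r in w)
    slack_total = 0.0
    for x_i, c_i in zip(x, c):
        # One constraint per pose r whose activity differs from that of pose c_i.
        violations = [
            1.0 - (dot(w[c_i], x_i) - dot(w[r], x_i))
            for r in range(len(w))
            if y[r] != y[c_i]
        ]
        slack_total += max([0.0] + violations)
    return regulariser + beta * slack_total


# Two hypothetical poses, one per activity, and two training images.
w = [[1.0, 0.0], [0.0, 1.0]]        # weight vector per pose
y = ["tennis", "croquet"]           # activity of each pose
x = [[2.0, 0.0], [0.5, 1.0]]        # potential values per image
c = [0, 1]                          # assumed pose index per image
value = max_margin_objective(w, x, c, y, beta=1.0)
```

In this example the first image satisfies its margin with no slack, while the second needs a slack of 0.5, so the objective is 1.0 + 0.5.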
  33. Learning Results: cricket defensive shot; cricket bowling; croquet shot.
  34. Learning Results: tennis forehand; tennis serve; volleyball smash.
  35. Outline: • Background and Intuition • Mutual Context of Object and Human Pose (Model Representation; Model Learning; Model Inference) • Experiments • Conclusion
  36. Model Inference. Input: an image $I$ and the learned models.
  37. Model Inference. Input: an image $I$ and the learned models. Head detection, torso detection, and tennis racket detection are combined by compositional inference [Chen et al., 2007] to obtain the layout of the object and body parts. Output for the first model: $(A_1, H_1, O_1^*, \{P_{1,n}^*\}_n)$.
  38. Model Inference. Input: an image $I$ and the learned models. Output: $(A_1, H_1, O_1^*, \{P_{1,n}^*\}_n), \dots, (A_K, H_K, O_K^*, \{P_{K,n}^*\}_n)$.
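Inference produces one candidate interpretation per learned model, each consisting of an activity, a pose, an object location, and part locations. The sketch below assumes that each candidate also carries a model score and that the final answer is simply the highest-scoring candidate; that selection rule, the `Candidate` type, and all values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # hypothetical (x, y, width, height)


@dataclass
class Candidate:
    activity: str
    pose_id: int
    object_box: Box
    part_boxes: List[Box]
    score: float  # model score of this interpretation (assumed available)


def best_interpretation(candidates: List[Candidate]) -> Candidate:
    """Pick the candidate with the highest model score."""
    return max(candidates, key=lambda cand: cand.score)


# Two hypothetical candidates produced by two learned models for one image.
candidates = [
    Candidate("tennis forehand", 0, (40, 30, 20, 20), [(10, 10, 15, 30)], 2.8),
    Candidate("croquet shot", 3, (42, 80, 10, 25), [(12, 12, 15, 30)], 1.9),
]
winner = best_interpretation(candidates)
```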
  39. Outline: • Background and Intuition • Mutual Context of Object and Human Pose (Model Representation; Model Learning; Model Inference) • Experiments • Conclusion
  40. Dataset and Experiment Setup. Sport data set [Gupta et al., 2009]: 6 classes (cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash); 180 training images (supervised with object and part locations) and 120 testing images. Tasks: • Object detection; • Pose estimation; • Activity classification.
  41. Dataset and Experiment Setup. Sport data set [Gupta et al., 2009]: 6 classes (cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash); 180 training images (supervised with object and part locations) and 120 testing images. Tasks: • Object detection; • Pose estimation; • Activity classification.
  42. Object Detection Results. [Figure: precision-recall curves (precision vs. recall, both from 0 to 1) for cricket bat, cricket ball, croquet mallet, tennis racket, and volleyball. Methods compared: sliding window [Andriluka et al., 2009]; pedestrian context [Dalal & Triggs, 2006]; our method. The valid region for the object is marked on the example image.]
  43. Object Detection Results. [Figure: precision-recall curves comparing our method, pedestrian as context, and the scanning window detector. Highlighted cases: cricket ball (small object) and volleyball (background clutter).]
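The detection results are reported as precision-recall curves. As a reminder of how such a curve is typically computed from scored detections, here is a minimal sketch; the matching rule that decides whether a detection is correct is not given in the slides, so each detection is assumed to arrive already labeled as correct or not, and all names and values are hypothetical.

```python
from typing import List, Tuple


def precision_recall(detections: List[Tuple[float, bool]],
                     num_positives: int) -> List[Tuple[float, float]]:
    """Return (precision, recall) after each detection, in decreasing score order."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    curve = []
    for k, (_, is_correct) in enumerate(ranked, start=1):
        true_positives += int(is_correct)
        curve.append((true_positives / k, true_positives / num_positives))
    return curve


# Three hypothetical detections (score, is_correct) against two ground-truth objects.
curve = precision_recall([(0.9, True), (0.7, True), (0.8, False)], num_positives=2)
```

Each point on the curve corresponds to one score threshold: lowering the threshold admits more detections, which can only raise recall but may lower precision.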
  44. Dataset and Experiment Setup. Sport data set [Gupta et al., 2009]: 6 classes (cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash); 180 training and 120 testing images. Tasks: • Object detection; • Pose estimation; • Activity classification.
  45. Human Pose Estimation Results (two values are listed for each leg and arm part):
      Method                   Torso   Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
      Ramanan, 2006            .52     .22  .22    .21  .28    .24  .28    .17  .14    .42
      Andriluka et al., 2009   .50     .31  .30    .31  .27    .18  .19    .11  .11    .45
      Our full model           .66     .43  .39    .44  .34    .44  .40    .27  .29    .58
  46. Human Pose Estimation Results. Same table as slide 45, with example images: for tennis serve and for volleyball smash, the slide shows our model, our estimation result, and the result of Andriluka et al., 2009.
  47. Human Pose Estimation Results. Same table as slide 45, with one additional row and four example estimation results:
      Method                   Torso   Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
      One pose per class       .63     .40  .36    .41  .31    .38  .35    .21  .23    .52
  48. Dataset and Experiment Setup. Sport data set [Gupta et al., 2009]: 6 classes (cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash); 180 training and 120 testing images. Tasks: • Object detection; • Pose estimation; • Activity classification.
  49. Activity Classification Results. Classification accuracy: our model: 83.3%; Gupta et al., 2009: 78.9%; bag-of-words (SIFT + SVM): 52.5%. Slide annotations: "No scene information"; "Scene is critical!!". Example images: cricket shot, tennis forehand.
  50. Conclusion. Human-Object Interaction: Grouplet representation vs. mutual context model. Next steps: • Pose estimation and object detection on PPMI images. • Modeling multiple objects and humans.
  51. Acknowledgment. • Stanford Vision Lab reviewers: Barry Chai (1985-2010), Juan Carlos Niebles, Hao Su • Silvio Savarese, U. Michigan • Anonymous reviewers
