ICCV 2011: Learning Spatiotemporal Graphs of Human Activities


  1. Learning Spatiotemporal Graphs of Human Activities. William Brendel, Sinisa Todorovic
  2. Our Goal (example activities: Long Jump, Triple Jump)
     • Recognize all occurrences of activities
     • Identify the start and end frames
     • Parse the video and find all subactivities
     • Localize the actors and objects involved
  3. Weakly Supervised Setting (e.g., Weight Lifting vs. Large-Box Lifting)
     In training:
     • ONLY class labels are given
     • Domain knowledge of the temporal structure is NOT AVAILABLE
  4. Learning What and How
     Under weak supervision, we need to learn from training videos:
     • WHAT activity parts are relevant
     • HOW relevant they are for recognition
  5. Prior Work vs. Our Approach
     Typically, prior work focuses only on HOW: a model maps raw-video features directly to the semantic level, leaving a gap between the two.
  6. Prior Work vs. Our Approach
     Our approach inserts a mid-level representation between the raw-video features and the semantic-level model, bridging that gap.
  7. Prior Work: Video Representation
     • Space-time points: Laptev & Schmid 08, Niebles & Fei-Fei 08, ...
     • Still human postures: Soatto 07, Ning & Huang 08, ...
     • Action templates: Yao & Zhu 09, ...
     • Point tracks: Sukthankar & Hebert 10, ...
  8. Our Features: 2D+t Tubes
     • Tubes allow simpler modeling, learning (from few examples), and inference
       (Sukthankar & Hebert 07, Gorelick & Irani 08, Pritch & Peleg 08, ...)
     • We are the first to use 2D+t tubes for building a statistical model of activities
  9. Our Features: 2D+t Tubes (continued)
     • We use 2D+t tubes for building a statistical generative model of activities
  10. Prior Work: Activity Representation
      • Graphical models, grammars: Ivanov & Bobick 00; Xiang & Gong 06; Ryoo & Aggarwal 09; Gupta & Davis 09; Liu & Zhu 09; Niebles & Fei-Fei 10; Lan et al. 11
      • Probabilistic first-order logic: Tran & Davis 08; Albanese et al. 10; Morariu & Davis 11; Brendel et al. 11; ...
  11. Approach: Input Video → Spatiotemporal Graph → Activity Model → Recognition and Localization
  12. Blocky Video Segmentation
  13. Activity as a Spatiotemporal Graph
      Descriptors of nodes and edges:
      • Node descriptors F: motion, object shape
      • Adjacency matrices {Ai}: Allen temporal relations, spatial relations, compositional relations
  14. Activity as a Segmentation Graph
      G = (V, E, descriptors) = (F, {A1, ..., An}), where F is the matrix of node descriptors and {A1, ..., An} are the adjacency matrices, one per distinct relation between the tubes.
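The representation above can be sketched as a small data structure. This is an illustrative reconstruction, not the authors' code; the class and field names are assumptions:

```python
class ActivityGraph:
    """A video as a graph of 2D+t tubes: a node-descriptor matrix F plus
    one adjacency matrix per relation type (temporal, spatial, ...).
    Illustrative sketch; names are not from the paper."""

    def __init__(self, F, adjacency):
        self.F = F          # list of per-tube descriptor vectors (motion, shape)
        self.A = adjacency  # {relation name: n x n adjacency matrix}

    @property
    def num_tubes(self):
        return len(self.F)

# Toy graph: 3 tubes with 2-D descriptors and two relation types
g = ActivityGraph(
    F=[[0.1, 0.9], [0.4, 0.4], [0.8, 0.2]],
    adjacency={
        "temporal": [[0, 1, 0], [0, 0, 1], [0, 0, 0]],  # e.g. Allen "before"
        "spatial":  [[0, 0, 1], [0, 0, 0], [1, 0, 0]],  # e.g. "near"
    },
)
print(g.num_tubes)  # 3
```

Keeping every relation type in its own adjacency matrix is what lets the model later weight temporal, spatial, and compositional evidence separately.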
  15. Activity Graph Model
      A probabilistic graph mixture model: model node descriptors, plus model adjacency matrices formed as a mixture-weighted sum of the compositional, spatial, and temporal relation matrices.
  16. Activity Model
      An activity instance: G = (F, {A1, ..., An}), with model adjacency matrices for each edge type i = 1, 2, ..., n.
  17. Activity Model (continued)
      The model matrix of node descriptors accompanies the model adjacency matrices.
  18. Inference: Input Video → Spatiotemporal Graph → Activity Model → Recognition and Localization
  19. Inference = Robust Least Squares
      Goal: for every activity model, estimate the permutation matrix, subject to permutation constraints, that best matches the video graph to the model.
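The least-squares matching objective can be illustrated with a brute-force stand-in: enumerate node permutations and score the squared mismatch of descriptors and adjacency matrices. This is an exponential toy solver for intuition only, not the paper's robust least-squares algorithm:

```python
from itertools import permutations

def match_graphs(F_model, A_model, F_video, A_video):
    """Find the node correspondence minimizing
    ||F_model - P F_video||^2 + sum_i ||A_model_i - P A_video_i P^T||^2
    by exhaustive search (small graphs only; illustrative sketch)."""
    n = len(F_model)
    best, best_cost = None, float("inf")
    for perm in permutations(range(n)):  # perm[j] = video node mapped to model slot j
        cost = 0.0
        for j in range(n):  # node-descriptor mismatch
            cost += sum((fm - fv) ** 2
                        for fm, fv in zip(F_model[j], F_video[perm[j]]))
        for rel, Am in A_model.items():  # per-relation adjacency mismatch
            Av = A_video[rel]
            cost += sum((Am[j][k] - Av[perm[j]][perm[k]]) ** 2
                        for j in range(n) for k in range(n))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost

# Toy check: the video graph is the model graph with nodes 0 and 1 swapped
F_m = [[1.0, 0.0], [0.0, 1.0]]
A_m = {"temporal": [[0, 1], [0, 0]]}
F_v = [[0.0, 1.0], [1.0, 0.0]]
A_v = {"temporal": [[0, 0], [1, 0]]}
perm, cost = match_graphs(F_m, A_m, F_v, A_v)
print(perm, cost)  # (1, 0) 0.0
```

A real solver replaces the enumeration with a relaxed, robust least-squares optimization over the permutation matrix, but the objective being minimized is the same.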
  20. Learning the Activity Graph Model
      Training videos → training graphs → graph model
  21. Learning
      Given K training graphs, each with adjacency matrices and node descriptors for edge types i = 1, 2, ..., n.
  22. Learning (continued)
      ESTIMATE the model parameters.
  23. Learning (continued)
      ESTIMATE the permutation matrices.
  24. Learning = Robust Least Squares
      Given K training graphs, estimate both the model parameters and the permutation matrices.
  25. Learning = Structural EM
      • E-step: expected matching of the training graphs and the model; estimation of the permutation matrices
      • M-step: estimation of the model structure and model parameters
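The alternation on this slide can be sketched as a tiny EM-style loop. This is a hypothetical simplification in which each "graph" is just a list of scalar node descriptors, alignment is brute force, and the M-step is an average; it only illustrates the E/M alternation, not the structural-EM algorithm itself:

```python
from itertools import permutations

def align(model, graph):
    """E-step sketch: permutation of the graph's nodes minimizing the
    squared distance to the model's node descriptors (brute force)."""
    n = len(model)
    best = min(permutations(range(n)),
               key=lambda p: sum((model[j] - graph[p[j]]) ** 2
                                 for j in range(n)))
    return [graph[j] for j in best]

def learn_model(training_graphs, num_iters=5):
    """Alternate matching (E-step) and parameter re-estimation (M-step),
    as on the slide. Illustrative code, not the authors' implementation."""
    model = list(training_graphs[0])  # initialize from one training example
    n = len(model)
    for _ in range(num_iters):
        aligned = [align(model, g) for g in training_graphs]    # E-step
        model = [sum(g[j] for g in aligned) / len(aligned)      # M-step: mean
                 for j in range(n)]
    return model

# Three noisy instances of the same activity, with nodes shuffled
videos = [[0.0, 1.0], [1.1, 0.1], [0.9, -0.1]]
print(learn_model(videos))  # converges near [0.0, 1.0]
```

The full method additionally re-estimates the model structure (which tubes and relations to keep) in the M-step, which is what makes it *structural* EM.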
  26. Learning Results
      Correctly learned activity-characteristic tubes.
  27. Recognition and Segmentation
      Activity "handshaking": detected and segmented characteristic tube.
  28. Recognition and Segmentation
      Activity "kicking": detected and segmented characteristic tube.
  29. Classification on the UTexas Dataset
      Human-interaction activities; comparison with [18] Ryoo et al. '10.
  30. Conclusion
      • Fast spatiotemporal segmentation
      • New activity representation: a graph model
      • Unified learning and inference via least squares
      • Learning under weak supervision: WHAT activity parts are relevant and HOW relevant they are for recognition
