ECCV2010 tutorial: statistical and structural recognition of human actions, part II

    Presentation Transcript

    • Structural Methods
    • G. Johansson, Moving Light Displays, 1973
      – Pose seems sufficient for certain action categories
      – Remove effects of clothing, lighting variation from representation
    • Two broad classes of approaches
      – Match templates (exemplar-based)
      – Fit a human body model
    • Database of exemplars
      Mori & Malik PAMI 2005; Shakhnarovich, Viola & Darrell ICCV 2003; Bourdev & Malik ICCV 2009
    • Pictorial Structures model
      Felzenszwalb & Huttenlocher CVPR 2000; Ramanan NIPS 2006; Ferrari, Marin & Zisserman CVPR 2008
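A tree-structured pictorial structures model can be fit exactly by dynamic programming. The following is a minimal sketch, not any of the cited implementations: it assumes a simple chain of parts, a handful of candidate image locations per part, a unary appearance cost per candidate, and a quadratic deformation cost between adjacent parts. All candidate locations and costs below are illustrative.

```python
# Minimal pictorial-structures sketch: parts form a chain, and dynamic
# programming finds the globally optimal placement of all parts.

def fit_chain(candidates, unary, ideal_offset, weight=1.0):
    """candidates[i]: list of (x, y) for part i.
    unary[i][k]: appearance cost of candidate k for part i.
    ideal_offset: preferred (dx, dy) between consecutive parts."""
    n = len(candidates)
    # best[i][k]: min cost of placing parts 0..i with part i at candidate k
    best = [list(unary[0])]
    back = []
    for i in range(1, n):
        row, ptr = [], []
        for k, (x, y) in enumerate(candidates[i]):
            costs = []
            for j, (px, py) in enumerate(candidates[i - 1]):
                dx = x - px - ideal_offset[0]
                dy = y - py - ideal_offset[1]
                costs.append(best[i - 1][j] + weight * (dx * dx + dy * dy))
            jbest = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[jbest] + unary[i][k])
            ptr.append(jbest)
        best.append(row)
        back.append(ptr)
    # backtrack the optimal configuration
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [k]
    for i in range(n - 2, -1, -1):
        k = back[i][k]
        path.append(k)
    path.reverse()
    return path, min(best[-1])

# Toy example: 3 parts (head, torso, legs), 2 candidates each,
# ideal vertical offset of 10 pixels between consecutive parts.
cands = [[(5, 0), (50, 0)], [(5, 10), (50, 40)], [(5, 20), (50, 80)]]
unary = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
config, cost = fit_chain(cands, unary, ideal_offset=(0, 10))
```

The same message-passing idea extends from chains to arbitrary trees, which is what makes pictorial structures tractable.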
    • Detect corners in images/video
    • Assess likelihood under action-specific pose model
    • Discriminate between walking directions, bicycle riding
      Song, Goncalves & Perona NIPS 2001, PAMI 2003
    • Key frame matching to test sequence to find similar poses
      – Shape matching on edge maps using order structure
      Sullivan & Carlsson ECCV 2002
    • M is quantized 3d pose; T is root orientation
    • Automatic person detection-tracking
    • Compare quantized pose to labeled training poses
      – Smooth over time
      Ramanan & Forsyth NIPS 2003
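The compare-and-smooth step above can be sketched in a few lines: quantize each frame's pose against a codebook via nearest neighbour, look up the action label of the matched training pose, then smooth the label sequence with a sliding-window majority vote. The codebook, labels, and window size here are made-up illustrations, not the paper's values.

```python
# Sketch of quantized-pose action labeling with temporal smoothing.
# The pose "codebook" and training labels below are illustrative.
from collections import Counter

def quantize(pose, codebook):
    """Map a pose vector to the index of its nearest codebook entry."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(pose, codebook[i]))

def label_sequence(poses, codebook, labels, window=3):
    """Label each frame by its quantized pose, then smooth over time
    with a sliding-window majority vote."""
    raw = [labels[quantize(p, codebook)] for p in poses]
    smoothed = []
    half = window // 2
    for t in range(len(raw)):
        seg = raw[max(0, t - half): t + half + 1]
        smoothed.append(Counter(seg).most_common(1)[0][0])
    return smoothed

codebook = [(0.0, 0.0), (1.0, 1.0)]   # two quantized poses
labels = ["stand", "walk"]            # action label per codebook pose
poses = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.0),
         (0.2, 0.1), (1.0, 0.9), (1.1, 1.0)]
result = label_sequence(poses, codebook, labels)
```

Note how the isolated mislabeled frame in the middle of the sequence is corrected by the majority vote, which is the point of smoothing over time.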
    • Video shot retrieval from pose
      – Either query-by-example or classification
      – Focus on upper body pose
    • Pictorial structures model
      Ferrari, Marin & Zisserman CVPR 2009
    • SVM on descriptors of absolute & relative part locations, segmentations
      – Include short tracks for robustness
      (code available online)
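As a sketch of what such a descriptor might look like (the part names and the torso-centred normalization are assumptions, not the published feature), one can stack absolute part locations, normalized to a reference part, with all pairwise relative offsets; a linear SVM is then trained on the resulting vectors.

```python
# Sketch of a pose descriptor from part locations: absolute positions
# (normalized relative to the torso) plus all pairwise relative offsets.
# Part names and the torso-centred normalization are assumptions here.

def pose_descriptor(parts):
    """parts: dict part_name -> (x, y). Returns a flat feature vector."""
    names = sorted(parts)
    cx, cy = parts["torso"]
    feat = []
    # absolute locations, torso-centred
    for n in names:
        x, y = parts[n]
        feat += [x - cx, y - cy]
    # relative offsets between every ordered pair of parts
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            feat += [parts[a][0] - parts[b][0], parts[a][1] - parts[b][1]]
    return feat

parts = {"head": (10, 0), "torso": (10, 20), "legs": (10, 40)}
vec = pose_descriptor(parts)
# 3 parts -> 3 absolute pairs + 3 pairwise offsets = 12 numbers
```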
    • Golfing? Walking?
    • Focus on discriminative elements of pose for action classification
    • Use exemplar-based "poselet" representation
      Yang, Wang & Mori CVPR 2010
    • Successful and unsuccessful classification examples
    • Mutual context model of human pose and objects (tennis, croquet, volleyball)
      – A: activity class, e.g. tennis forehand, croquet shot, volleyball smash
      – O: object class, e.g. tennis racket, croquet mallet, volleyball
      – H: human pose; more than one H for each A; unobserved during training
      – P: body parts P1 … PN, each with location lP, orientation θP, scale sP
      – f: image evidence; shape context descriptors [Belongie et al, 2002]
      Yao & Fei-Fei CVPR 2010
    • Example classes: cricket defensive shot, cricket bowling, croquet shot
    • Build action models from web search results
      Ikizler-Cinbis, Cinbis, Sclaroff ICCV 2009
    • Find repeated poses in a dataset
      Wang, Jiang, Drew, Li, Mori CVPR 2006
    • Person location given
    • Classify into one of 9 categories: riding horse, reading book, taking photo, riding bike, playing instrument, running, phoning, using computer, walking
    • Pose as representation for action recognition
      – Captures much information about action
      – Invariance to clothing / lighting effects
      – Model and exemplar based representations
    • New direction: action recognition from still images
      – Image retrieval and analysis
      – An important cue for video-based action recognition
      – Pose seems essential
    • Describe low-level components
      – Actions of individual people
      – Movement of pixels
    • Identify key objects or locations in scene
      – Buildings, roads, etc.
    • Model interactions between people, objects, and locations
    • Detect and track moving objects
    • Manually identify key regions in scene
      – E.g. road, checkpoint
    • Scenarios describe relative arrangements of objects in scene
      – E.g. proximity of car to checkpoint
      – Notions of scene context
      Medioni, Cohen, Bremond, Hongeng, Nevatia PAMI 2001
    • Detect and track players, ball
    • Low-level action detectors for individual players
    • Hand-constructed Bayes net for each activity
      – Spatial and temporal relations between low-level actions
      Intille & Bobick CVPR 1999
    • Global, frame-level feature
      – Bag-of-words representation
    • Detect unusual events by clustering
      – Isolated, varied clusters are unusual
      Zhong, Shi & Visontai CVPR 2004
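The clustering idea above can be sketched as follows, assuming a tiny k-means over frame-level word histograms (the histograms, cluster count, and "small cluster" threshold are illustrative choices, not the paper's):

```python
# Sketch of unusual-event detection by clustering frame-level
# bag-of-words histograms: frames in small clusters are flagged.
# The tiny k-means below uses the first k points as initial centers.

def kmeans(points, k, iters=20):
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def unusual_frames(histograms, k=2, min_size=2):
    """Flag frames whose cluster has fewer than min_size members."""
    assign = kmeans(histograms, k)
    sizes = [assign.count(c) for c in range(k)]
    return [i for i, c in enumerate(assign) if sizes[c] < min_size]

# Five frames: four similar word histograms and one outlier.
hists = [[9, 1, 0], [8, 2, 0], [9, 0, 1], [8, 1, 1], [0, 0, 10]]
flags = unusual_frames(hists)
```

The outlier frame ends up alone in its cluster and is flagged; no model of "normal" behaviour is ever specified explicitly.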
    • Cheating detection in simulated card game
    • Real-world highway dataset
      – Cars pulling off road, backing up, U-turns
    • Describe moving pixels by location and motion direction
      – No object detection
    • Use as visual words in Latent Dirichlet Allocation (LDA) type model
      – Infer low-level actions from words
      Wang, Ma, Grimson PAMI 2009; Blei, Ng, Jordan JMLR 2003
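The word construction step can be sketched directly: each moving pixel is coded by its spatial grid cell plus a quantized motion direction, and the resulting word counts feed the topic model. The frame size, grid, and number of direction bins below are illustrative assumptions.

```python
# Sketch of the visual-word representation for moving pixels: each
# moving pixel is coded by its grid cell and quantized motion
# direction; the grid and direction bins below are illustrative.
import math

def pixel_word(x, y, dx, dy, frame_w=320, frame_h=240,
               grid=(8, 6), n_dirs=4):
    """Return a single word index for a moving pixel at (x, y)
    with motion vector (dx, dy)."""
    gx = min(int(x * grid[0] / frame_w), grid[0] - 1)
    gy = min(int(y * grid[1] / frame_h), grid[1] - 1)
    angle = math.atan2(dy, dx) % (2 * math.pi)
    d = int(angle / (2 * math.pi / n_dirs)) % n_dirs
    return (gy * grid[0] + gx) * n_dirs + d

# Two pixels in the same cell moving the same way share a word;
# opposite motion in the same cell yields a different word.
w1 = pixel_word(10, 10, dx=1, dy=0)
w2 = pixel_word(15, 12, dx=1, dy=0)
w3 = pixel_word(10, 10, dx=-1, dy=0)
```

Because the vocabulary is fixed and small, documents (video clips) become word-count vectors and any bag-of-words topic model applies unchanged.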
    • Higher-level activity analysis
      – Distribution of low-level actions over entire scene
    • Applications
      – Temporal segmentation by activity
      – Abnormality detection
    • Hierarchical Dirichlet Process model
      – Learn number of activities automatically
      Kuettel, Breitenstein, van Gool & Ferrari CVPR 2010
    • Traffic-light controlled scene: continuous video, annotated with states and history, shown at 3x speed
    • Loy, Xiang & Gong CVPR, ICCV 2009
    • Consider time-delayed correlations between regions
      – Applications to irregularity detection
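The idea can be sketched as follows, assuming two per-region activity time series (e.g. amount of motion per frame); we slide one over the other and keep the lag with the highest normalized correlation. The signals and lag range are illustrative.

```python
# Sketch of time-delayed correlation between two regions' activity
# signals: slide one signal over the other and report the lag with
# the highest normalized correlation. Signals are illustrative.

def best_lag(a, b, max_lag=5):
    """Return the lag d maximizing correlation between a[t] and b[t+d]."""
    def corr(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        vx = sum((xi - mx) ** 2 for xi in x)
        vy = sum((yi - my) ** 2 for yi in y)
        return num / (vx * vy) ** 0.5 if vx and vy else 0.0
    scores = {}
    for d in range(max_lag + 1):
        x, y = a[:len(a) - d], b[d:]
        if len(x) > 1:
            scores[d] = corr(x, y)
    return max(scores, key=scores.get)

# Region B's activity repeats region A's with a 2-frame delay
# (e.g. traffic entering one zone, then the next downstream).
a = [0, 1, 3, 1, 0, 0, 2, 4, 2, 0, 0, 0]
b = [0, 0, 0, 1, 3, 1, 0, 0, 2, 4, 2, 0]
lag = best_lag(a, b)
```

A frame where the observed delay deviates strongly from the learned lag is a candidate irregularity.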
    • Graphical models relating image evidence x, individual action classes, and a group activity class y through latent variables h
      Choi, Shahid, & Savarese VS 2009; Lan, Wang, Yang, & Mori SGA 2010, NIPS 2010
    • Captioned baseball videos in training
    • Build AND-OR graph representation of activities
      – AND specifies elements of an activity that must occur
      – OR allows variation in how an element appears
    • Describe low-level tracks using STIPs
    • Match tracks to actions in AND-OR graph
      Gupta, Srinivasan, Shi, Davis CVPR 2009
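The AND/OR semantics can be sketched as a small recursive check over detected action elements; the example graph for a baseball "hit" below is entirely illustrative, not the paper's learned structure.

```python
# Sketch of matching detected action elements against an AND-OR
# graph: AND nodes require all children, OR nodes require any child.
# The graph for a baseball "hit" below is entirely illustrative.

def satisfied(node, detections):
    """node: leaf string, or an ('AND', [...]) / ('OR', [...]) tuple."""
    if isinstance(node, str):
        return node in detections
    op, children = node
    results = [satisfied(c, detections) for c in children]
    return all(results) if op == "AND" else any(results)

hit = ("AND", [
    "pitch",
    "swing",
    ("OR", ["run_to_first", "run_to_second"]),  # base-running variants
])

ok = satisfied(hit, {"pitch", "swing", "run_to_first"})
miss = satisfied(hit, {"pitch", "swing"})
```

In the full system the leaves are not binary detections but scored track matches, so the recursion aggregates scores rather than booleans; the control structure is the same.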
    • Scene modeling to look at the big picture
    • Feature representations
      – Holistic: describe entire scene, irrespective of individuals
      – Local: describe actions of individuals
    • Structure of activities
      – Model free: clustering-type approaches
      – Strong models: grammars, probabilistic models
    • Scene elements and constraints between them:
      – Objects: cars, glasses, people, etc.
      – Actions: drinking, running, door exit, car enter
      – Scene categories: indoors, outdoors, street scene, etc.
      – Geometry: street, wall, field, stair, etc.
    • [Recap figures: still-image action categories (riding horse, reading book, riding bike, playing instrument) and scene activity model]
    • Standardization of datasets for field
      – Allow comparison of algorithms
        • E.g. KTH for low-level features, atomic actions
      – Fair tuning of model parameters
    • New algorithms compare to baselines
      – Bag-of-words on densely sampled STIPs
      – Pose estimation (Ferrari et al. code)
      – HOG SVM (Dalal & Triggs code, Ramanan code)
    • Standardization of datasets for field
      – Don't feel constrained by the existing problem definitions
      – Do make your new dataset available
        • Should clearly specify separate training and test sets
    • New algorithms compare to baselines
      – Do use reasonable variant of standard baselines for your new problem
    • Even atomic low-level actions are very difficult to detect reliably
      – Far more work needed on representations for the action of a single person
      – Features
      – Temporal representation, smoothing
      – Tracking
      – …
    • Two observations:
      1. Cameras and bandwidth are cheap
      2. Lots of training data is potentially available
      Potential for huge progress … if we can get the data
    • Sources of weak supervision for action recognition:

      Annotation                        Aligned with video  Describes visual content  Source
      Subtitles                         Yes                 No                        DVD, Internet
      Scripts (TV series, movies,       No                  Yes                       Internet, e.g. www.dailyscript.com
        sport games)
      Plot summaries and synopses       No                  Yes, sparsely             Internet
        (e.g. IMDB)
      Instruction videos                No                  Yes                       Internet, e.g. www.videojug.com
      Descriptive Video Service (DVS)   Yes                 Yes                       DVD, rare
      Word tags                         No                  Yes, sparsely             Internet (e.g. YouTube)
      Manual labelling /                ??                  ??                        Mechanical Turk, ESP Game,
        Human Computation                                                             grad/undergrad students
    • Open questions:
      – How to benefit from the structure of the human body in complex situations, e.g. heavy occlusions, uniformly colored clothing?
      – Will action classification generalize over different video domains: movies, TV, YouTube, surveillance video?
      – What is the useful action vocabulary? Are we trying to solve the right problem? How can we visualize/display the results?
    • Interesting novel directions:
      – Use actions for recognizing functional and physical object properties, e.g. "sittable", "eatable", "heavy", "solid" objects…
      – Action prediction, i.e. what can happen in the given situation: e.g. is it dangerous to cross this road?
      – Explore more sources of strong and weak supervision: manual surveillance, Descriptive Video Service (DVS), YouTube tags; transcripts of sports games; instruction videos.
    • P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proc. 9th Int. Conf. Computer Vision, pages 734–741, 2003.
    • N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2005.
    • Bo Wu and Ram Nevatia. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In Proc. 10th Int. Conf. Computer Vision, 2005.
    • Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
    • Chris Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time tracking. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 1999.
    • Kentaro Toyama, John Krumm, Barry Brumitt, and Brian Meyers. Wallflower: Principles and practice of background maintenance. In Proc. 7th Int. Conf. Computer Vision, 1999.
    • J.L. Barron, D.J. Fleet, and S.S. Beauchemin. Performance of optical flow techniques. Int. Journal of Computer Vision, 12(1):43–77, 1994.
    • T. Brox, C. Bregler, and J. Malik. Large displacement optical flow. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2009.
    • M. Isard and A. Blake. Condensation – conditional density propagation for visual tracking. Int. Journal of Computer Vision, 29(1):5–28, 1998.
    • Yuan Li, Chang Huang, and Ram Nevatia. Learning to associate: HybridBoosted multi-target tracker for crowded scene. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2009.
    • W.T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Computer vision for computer games. In IEEE 2nd Intl. Conf. on Automatic Face and Gesture Recognition, 1996.
    • J. Sullivan and S. Carlsson. Recognizing and tracking human action. In ECCV 2002.
    • A.A. Efros, A.C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In ICCV 2003.
    • A. Bobick and J. Davis. The recognition of human movement using temporal templates. IEEE Trans. PAMI, 23(3):257–267, 2001.
    • L. Zelnik-Manor and M. Irani. Event-based video analysis. In CVPR 2001.
    • E. Shechtman and M. Irani. Space-time behavior based correlation. In CVPR 2005.
    • O. Boiman and M. Irani. Detecting irregularities in images and in video. In Proc. ICCV, 2005.
    • M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In Proc. ICCV, 2005.
    • Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In Proc. ICCV 2005.
    • Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. In Proc. ICCV, 2007.
    • I. Laptev and P. Pérez. Retrieving actions in movies. In Proc. ICCV 2007.
    • D. Weinland and E. Boyer. Action recognition using exemplar-based embedding. In Proc. CVPR, 2008.
    • Z. Lin, Z. Jiang, and L.S. Davis. Recognizing actions by shape-motion prototype trees. In Proc. ICCV, 2009.
    • I. Laptev and T. Lindeberg. Space-time interest points. In Proc. ICCV 2003.
    • C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In Proc. ICPR, 2004.
    • P. Dollar, V. Rabaud, G. Cottrell, and Serge Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005.
    • H. Jhuang, T. Serre, L. Wolf and T. Poggio. A biologically inspired system for action recognition. In Proc. ICCV 2007.
    • P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. ACM MM 2007.
    • J.C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. In IJCV 2008.
    • I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proc. CVPR 2008.
    • A. Klaeser, M. Marszałek and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In Proc. BMVC 2008.
    • G. Willems, T. Tuytelaars and L. Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In Proc. ECCV 2008.
    • H. Wang, M.M. Ullah, A. Kläser, I. Laptev and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In Proc. BMVC 2009.
    • L. Yeffet and L. Wolf. Local trinary patterns for human action recognition. In Proc. ICCV 2009.
    • A. Gilbert, J. Illingworth, R. Bowden. Fast realistic multi-action recognition using mined dense spatio-temporal features. In Proc. ICCV 2009.
    • P. Matikainen, M. Hebert, R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. ICCV workshop on Video-oriented Object and Event Classification, 2009.
    • M.M. Ullah, S.N. Parizi, I. Laptev. Improving bag-of-features action recognition with non-local cues. In Proc. BMVC 2010.
    • Y. Song, L. Goncalves, and P. Perona. Unsupervised learning of human motion. IEEE Trans. PAMI, 25(7):814–827, 2003.
    • D. Ramanan and D.A. Forsyth. Automatic annotation of everyday movements. In Advances in Neural Information Processing Systems 16, 2003.
    • V. Ferrari, M. Marin, and A. Zisserman. Pose search: retrieving people using their pose. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2009.
    • Yang Wang, Hao Jiang, Mark S. Drew, Ze-Nian Li, and Greg Mori. Unsupervised discovery of action classes. In CVPR, 2006.
    • Nazli Ikizler-Cinbis, R. Gokberk Cinbis, and Stan Sclaroff. Learning actions from the web. In IEEE International Conference on Computer Vision, 2009.
    • Weilong Yang, Yang Wang, and Greg Mori. Recognizing human actions from still images with latent poses. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2010.
    • Bangpeng Yao and Li Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2010.
    • R. Polana and R.C. Nelson. Detection and recognition of periodic, nonrigid motion. In IJCV 1997.
    • S.M. Seitz and C.R. Dyer. View invariant analysis of cyclic motion. In IJCV 1997.
    • A. Thangali and S. Sclaroff. Periodic motion detection and estimation via space-time sampling. In IEEE Workshop on Motion and Video Computing, 2005.
    • I. Laptev, S.J. Belongie, P. Pérez and J. Wills. Periodic motion detection and segmentation via approximate sequence alignment. In Proc. ICCV 2005.
    • P. Wang, G.D. Abowd and J.M. Rehg. Quasi-periodic event analysis for social game retrieval. In Proc. ICCV 2009.
    • D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars. In Proc. ICCV 2007.
    • A. Farhadi and M. Tabrizi. Learning to recognize activities from the wrong view point. In Proc. ECCV 2008.
    • I. Junejo, E. Dexter, I. Laptev and Patrick Pérez. Cross-view action recognition from temporal self-similarities. In Proc. ECCV 2008.
    • A. Farhadi, M. Kamali, I. Endres, D. Forsyth. A latent model of discriminative aspect. In Proc. ICCV 2009.
    • X. Wang, X. Ma, and E. Grimson. Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. IEEE Trans. PAMI, 31(3):539–555, 2009.
    • Abhinav Gupta, Praveen Srinivasan, Jianbo Shi, and Larry S. Davis. Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos. In CVPR, 2009.
    • T. Xiang and S. Gong. Beyond tracking: Modelling activity and understanding behaviour. Int. Journal of Computer Vision, 67(1):21–51, 2006.
    • G. Medioni, I. Cohen, F. Brémond, S. Hongeng, and R. Nevatia. Event detection and analysis from video streams. IEEE Trans. PAMI, 23(8):873–889, 2001.
    • Y.A. Ivanov and A.F. Bobick. Recognition of visual activities and interactions by stochastic parsing. IEEE Trans. PAMI, 22(8):852–872, 2000.
    • D. Moore and I. Essa. Recognizing multitasked activities using stochastic context-free grammar using video. In AAAI, 2002.
    • Chen Change Loy, Tao Xiang, and Shaogang Gong. Modelling activity global temporal dependencies using time delayed probabilistic graphical model. In ICCV, 2009.
    • Xiaogang Wang, Keng Teck Ma, Gee Wah Ng, and W. Eric L. Grimson. Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2008.
    • W. Choi, K. Shahid, and S. Savarese. "What are they doing?: Collective activity classification using spatio-temporal relationship among people". In 9th International Workshop on Visual Surveillance, 2009.
    • Ramin Mehran, Alexis Oyama, and Mubarak Shah. Abnormal crowd behavior detection using social force model. In CVPR, 2009.
    • Workshop materials available: https://sites.google.com/site/humanactionstutorialeccv10/