Fcv scene schmid



  1. Action recognition in videos
     Cordelia Schmid, INRIA Grenoble
  2. Action recognition – problem
     - Short actions, e.g. drinking, sitting down
     Coffee & Cigarettes dataset; Hollywood dataset
  3. Action recognition – problem
     - Short actions, e.g. drinking, sitting down
     - Activities/events, e.g. making a sandwich, depositing a suspicious object
     TRECVID Multimedia Event Detection
  4. TRECVID – Multimedia Event Detection
     Attempting a board trick; feeding an animal; wedding ceremony; getting a vehicle unstuck
  5. Action recognition
     - Action recognition is person-centric
     - Vision is person-centric: we mostly care about things which are important
     Source: I. Laptev (Movies, TV, YouTube)
  6. Action recognition
     - Action recognition is person-centric
     - Vision is person-centric: we mostly care about things which are important
     Source: I. Laptev (Movies, TV, YouTube: 40%, 35%, 34%)
  7. Action recognition from still images
     - Description of the human pose
       - Silhouette description [Sullivan & Carlsson, 2002]
       - Histograms of oriented gradients (HOG) [Dalal & Triggs, 2005]
       - Human body part layout [Felzenszwalb & Huttenlocher, 2000]
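The pose descriptors above all reduce an image region to a statistic over local gradients. A minimal numpy sketch of the core HOG idea, for a single cell (the function name, bin count, and test patch are illustrative choices, not from the talk):

```python
import numpy as np

def hog_cell(patch, n_bins=9):
    """Gradient-orientation histogram for one cell, in the spirit of
    Dalal & Triggs' HOG: unsigned orientations are binned over
    [0, 180) degrees, each pixel voting with its gradient magnitude."""
    patch = patch.astype(float)
    gy, gx = np.gradient(patch)                 # finite-difference gradients
    mag = np.hypot(gx, gy)                      # gradient magnitude
    ang = np.degrees(np.arctan2(gy, gx)) % 180  # unsigned orientation
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 180), weights=mag)
    # L2-normalise so the descriptor is robust to contrast changes
    return hist / (np.linalg.norm(hist) + 1e-9)

# A vertical edge: all gradient energy falls into the 0-degree bin
patch = np.tile(np.r_[np.zeros(4), np.ones(4)], (8, 1))
h = hog_cell(patch)
```

A full HOG descriptor tiles the detection window into such cells and concatenates block-normalised histograms; this sketch shows only the per-cell computation.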
  8. Action recognition from still images
     - Supervised modeling of the interaction between human & object [Gupta et al. 2009, Yao & Fei-Fei 2009]
     - Weakly supervised learning of objects [Prest, Schmid & Ferrari 2011]
     Results on the PASCAL VOC 2010 human action classification dataset
  9. Importance of action objects
     - The human pose alone is often not sufficient
     - Objects define the action
  10. Importance of temporal information
      - Video/temporal information is necessary to disambiguate actions
      - Temporal context describes the action/activity
      - Key frames provide significantly less information
  11. Action recognition in videos
      - Temporal information makes it possible to stabilize human and object detection by tracking – J. Malik: is tracking by detection difficult?
      - Large amount of data, growing very fast – H. Sawhney: large amounts of data, not often well explored
      - Often comes with some form of supervision (scripts, subtitles) – similar in spirit to M. Hebert's comment on the large amounts of data collected by a robot
  12. Action recognition in videos
      - Motion history image [Bobick & Davis, 2001]
      - Spatial motion descriptor [Efros et al., ICCV 2003]
      - Learning a dynamic prior [Blake et al., 1998]
      - Sign language recognition [Zisserman et al., 2009]
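Of the motion representations listed above, the motion history image is simple enough to sketch in a few lines: moving pixels are stamped with a maximal value and the stamp decays over time, so recent motion is bright and older motion fades. A toy numpy version in the spirit of Bobick & Davis (the `tau` and `thresh` values and the sliding-square example are illustrative, not from the talk):

```python
import numpy as np

def motion_history(frames, tau=5, thresh=0.1):
    """Motion history image: pixels that move in the current frame are
    set to tau; everywhere else the history decays by 1 per frame."""
    h = np.zeros_like(frames[0], dtype=float)
    prev = frames[0]
    for frame in frames[1:]:
        moving = np.abs(frame - prev) > thresh   # frame-difference mask
        h = np.where(moving, float(tau), np.maximum(h - 1.0, 0.0))
        prev = frame
    return h

# A bright square sliding right leaves a decaying trail behind it
frames = []
for t in range(4):
    f = np.zeros((8, 8))
    f[2:5, t:t+2] = 1.0   # square occupies columns t..t+1
    frames.append(f)
mhi = motion_history(frames, tau=5)
```

The resulting 2-D image summarises where and how recently motion occurred, which is what makes it usable as a template for recognising short actions.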
  13. Action recognition in videos
      - Bag of space-time features [Laptev'03, Schuldt'04, Niebles'06, Zhang'07]
        Pipeline: collection of space-time patches → HOG & HOF patch descriptors → histogram of visual words → SVM classifier
  14. Action recognition in videos
      - Bag of space-time features
        - Many recent extensions: new features/tracklets, temporal structuring, etc.
        - Advantages
          - Very useful as a baseline
          - Captures spatial and temporal context – see Efros's comment on image classification
        - Disadvantages
          - No interpretation of the action
          - Not sufficient for localization & description
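The quantization step at the heart of the bag-of-features pipeline can be sketched in a few lines of numpy: each local descriptor is assigned to its nearest visual word and the video is summarised by a normalised word histogram, which would then be fed to the SVM classifier (omitted here). The 2-D toy descriptors and the 3-word vocabulary below are illustrative stand-ins for quantized HOG/HOF space-time descriptors:

```python
import numpy as np

def quantize(descriptors, vocabulary):
    """Assign each local descriptor to its nearest visual word and
    return an L1-normalised histogram (the bag-of-features vector)."""
    # Squared Euclidean distance from every descriptor to every word
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

# Toy vocabulary of 3 visual words (in practice: k-means centers
# learned from thousands of HOG/HOF descriptors)
vocab = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Local descriptors extracted from one video; most fall near word 1
video = np.array([[0.9, 0.1], [1.1, -0.1], [0.05, 0.02], [0.95, 0.0]])
h = quantize(video, vocab)
```

Because the histogram discards where and when each patch occurred, the representation is orderless, which is exactly the "no interpretation / no localization" limitation noted above.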
  15. Action recognition in videos
      - Localization by 3D HOG/HOF; interaction with objects
      Tracking by detection → space-time description → interaction with objects
  16. Action recognition in videos
      - HOG 3D tracks + description
        - Detection & tracking of humans with part-based models works well
          - move towards more flexible models, similar to P. Felzenszwalb's
          - integrate motion information
          - very good baseline
        - Towards more flexible descriptions based on body parts; a lot of recent work on finding human body parts
        - Interaction with objects is important, but hard
  17. Discussion
      - Need for more challenging datasets
        - Need for realistic datasets
        - Scale up the number of classes (today ~10 actions per dataset)
        - Increase the number of examples per class, possibly with weakly supervised learning (the number of examples per video is low)
        - Define a taxonomy; use redundancy between action classes to improve training
        - Exhaustive manual labeling of all actions is impossible
      KTH dataset; Hollywood dataset
  18. Discussion
      - Make better use of the large amount of information inherent in videos
        - automatic collection of additional examples
        - incremental improvement of models
        - use of weak labels from associated data (text, sound, subtitles)
      - Many existing techniques are straightforward extensions of methods for images
        - almost no use of 3D information
        - learn better interaction and temporal models
        - design activity models by decomposition into simple actions