CVML2011: human action recognition (Ivan Laptev)


  1. 1. ENS/INRIA Visual Recognition and Machine Learning Summer School, 25-29 July, Paris, France. Human Action Recognition. Ivan Laptev, ivan.laptev@inria.fr, INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548, Laboratoire d’Informatique, Ecole Normale Supérieure, Paris. Includes slides from: Alyosha Efros, Mark Everingham and Andrew Zisserman.
  2. 2. Lecture overview: Motivation (historic review; applications and challenges). Human pose estimation (pictorial structures; recent advances). Appearance-based methods (motion history images; active shape models and motion priors). Motion-based methods (generic and parametric optical flow; motion templates). Space-time methods (space-time features; training with weak supervision).
  3. 3. Motivation I: Artistic representation. Early studies were motivated by human representations in the arts. Da Vinci: "It is indispensable for a painter to become totally familiar with the anatomy of nerves, bones, muscles, and sinews, such that he understands, for their various motions and stresses, which sinews or which muscle causes a particular motion." "I ask for the weight [pressure] of this man for every segment of motion when climbing those stairs, and for the weight he places on b and on c. Note the vertical line below the center of mass of this man." Leonardo da Vinci (1452–1519): A man going upstairs, or up a ladder.
  4. 4. Motivation II: Biomechanics. The emergence of biomechanics. Borelli applied to biology the analytical and geometrical methods developed by Galileo Galilei. He was the first to understand that bones serve as levers and muscles function according to mathematical principles. His physiological studies included muscle analysis and a mathematical discussion of movements such as running or jumping. Giovanni Alfonso Borelli (1608–1679).
  5. 5. Motivation III: Motion perception. Etienne-Jules Marey (1830–1904) made chronophotographic experiments influential for the emerging field of cinematography. Eadweard Muybridge (1830–1904) invented a machine for displaying the recorded series of images; he pioneered motion pictures and applied his technique to movement studies.
  6. 6. Motivation III: Motion perception. Gunnar Johansson [1973] pioneered studies on the use of image sequences for programmed human motion analysis. "Moving Light Displays" enable identification of familiar people and their gender, and inspired many works in computer vision. Gunnar Johansson, Perception and Psychophysics, 1973.
  7. 7. Human actions: Historic overview. 15th century: studies of anatomy. 17th century: emergence of biomechanics. 19th century: emergence of cinematography. 1973: studies of human motion perception. Modern computer vision.
  8. 8. Modern applications: Motion capture and animation Avatar (2009)
  9. 9. Modern applications: Motion capture and animation. Leonardo da Vinci (1452–1519); Avatar (2009).
  10. 10. Modern applications: Video editing. Space-Time Video Completion. Y. Wexler, E. Shechtman and M. Irani, CVPR 2004.
  11. 11. Modern applications: Video editing. Recognizing Action at a Distance. Alexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik, ICCV 2003.
  12. 12. Modern applications: Video editing. Recognizing Action at a Distance. Alexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik, ICCV 2003.
  13. 13. Why Action Recognition? Video indexing and search is useful in TV production, entertainment, education, social studies, security, ... Home videos: e.g. "My daughter climbing". TV & Web: e.g. "Fight in a parliament". Sociology research: manually analyzed smoking actions in 900 movies. Surveillance: 260K views in 7 days on YouTube.
  14. 14. How is action recognition related to computer vision? (Street scene labeled with: sky, street sign, cars, road.)
  15. 15. We can recognize cars and roads; what's next? 12,184,113 images, 17,624 synsets.
  16. 16. Airplane. A plane has crashed, the cabin is broken, somebody is likely to be injured or dead.
  17. 17. (Scene labeled with: cat, woman, trash bin.)
  18. 18. Vision is person-centric: we mostly care about things which are important to us, people. Actions of people reveal the function of objects. Future challenges: Function: what can I do with this and how? Prediction: what can happen if someone does that? Recognizing goals: what is this person trying to do?
  19. 19. How many person-pixels are there? Movies, TV, YouTube.
  20. 20. How many person-pixels are there? Movies, TV, YouTube.
  21. 21. How many person-pixels are there? Movies: 35%; TV: 34%; YouTube: 40%.
  22. 22. How much data do we have? A huge amount of video is available and growing. TV channels recorded since the 60s. >34K hours of video uploaded every day. ~30M surveillance cameras in the US => ~700K video hours/day. If we want to interpret this data, we should better understand what person-pixels are telling us!
  23. 23. Lecture overview: Motivation (historic review; applications and challenges). Human pose estimation (pictorial structures; recent advances). Appearance-based methods (motion history images; active shape models and motion priors). Motion-based methods (generic and parametric optical flow; motion templates). Space-time methods (space-time features; training with weak supervision).
  24. 24. Lecture overview: Motivation (historic review; applications and challenges). Human pose estimation (pictorial structures; recent advances). Appearance-based methods (motion history images; active shape models and motion priors). Motion-based methods (generic and parametric optical flow; motion templates). Space-time methods (space-time features; training with weak supervision).
  25. 25. Objective and motivation. Determine human body pose (layout). Why? To recognize poses, gestures, actions.
  26. 26. Activities characterized by a pose.
  27. 27. Activities characterized by a pose.
  28. 28. Activities characterized by a pose.
  29. 29. Challenges: articulations and deformations.
  30. 30. Challenges of (almost) unconstrained images: varying illumination and low contrast; moving camera and background; multiple people; scale changes; extensive clutter; any clothing.
  31. 31. Pictorial Structures. Intuitive model of an object. The model has two components: 1. parts (2D image fragments); 2. structure (configuration of parts). Dates back to Fischler & Elschlager 1973.
  32. 32. Long tradition of using pictorial structures for humans. Finding People by Sampling, Ioffe & Forsyth, ICCV 1999. Pictorial Structure Models for Object Recognition, Felzenszwalb & Huttenlocher, 2000. Learning to Parse Pictures of People, Ronfard, Schmid & Triggs, ECCV 2002.
  33. 33. Felzenszwalb & Huttenlocher. NB: requires background subtraction.
  34. 34. Variety of poses.
  35. 35. Variety of poses.
  36. 36. Objective: detect humans and determine upper-body pose (layout).
  37. 37. Pictorial structure model as a CRF.
  38. 38. Complexity.
  39. 39. Are trees the answer? (Tree model over parts: head, torso, left/right upper arm, lower arm, hand.) With n parts and h possible discrete locations per part, brute-force inference is O(h^n). For a tree, dynamic programming reduces this to O(nh^2). If the model is a tree and has certain edge costs, the complexity reduces to O(nh) using a distance transform [Felzenszwalb & Huttenlocher, 2000, 2005].
  40. 40. Are trees the answer? (Tree model over parts: head, torso, left/right upper arm, lower arm, hand.) With n parts and h possible discrete locations per part, brute-force inference is O(h^n). For a tree, dynamic programming reduces this to O(nh^2). If the model is a tree and has certain edge costs, the complexity reduces to O(nh) using a distance transform [Felzenszwalb & Huttenlocher, 2000, 2005].
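For reference, a hedged sketch of the pictorial-structures objective that these complexity statements refer to (notation is mine, not copied from the slides):

```latex
% Pictorial-structures energy over part locations l_1..l_n on a graph (V, E):
%   m_i  : unary appearance cost of placing part i at location l_i
%   d_ij : pairwise deformation cost between connected parts
\[
L^{\ast} = \arg\min_{l_1,\dots,l_n}\; \sum_{i \in V} m_i(l_i) \;+\; \sum_{(i,j)\in E} d_{ij}(l_i, l_j)
\]
% On a tree, dynamic programming solves this in O(n h^2); if d_ij is a squared
% distance between transformed locations, the inner minimization becomes a
% distance transform and the total cost drops to O(n h).
```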
  41. 41. Kinematic structure vs. graphical (independence) structure. Graph G = (V, E) over head, torso, left/right upper arms, lower arms, hands. Requires more connections than a tree.
  42. 42. More recent work on human pose estimation. D. Ramanan. Learning to parse images of articulated bodies. NIPS 2007: learns image- and person-specific unary terms; the initial iteration uses edges, following iterations use edges and colour. V. Ferrari, M. Marin-Jimenez and A. Zisserman. Progressive search space reduction for human pose estimation. In Proc. CVPR 2008/2009: (almost) unconstrained images; person detector and foreground highlighting. P. Buehler, M. Everingham and A. Zisserman. Learning sign language by watching TV. In Proc. CVPR 2009: learns with weak textual annotation; multiple instance learning.
  43. 43. Pose estimation is a very active research area. Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In Proc. CVPR 2011: extension of the LSVM model of Felzenszwalb et al. Y. Wang, D. Tran and Z. Liao. Learning Hierarchical Poselets for Human Parsing. In Proc. CVPR 2011: builds on the Poselets idea of Bourdev et al. S. Johnson and M. Everingham. Learning Effective Human Pose Estimation from Inaccurate Annotation. In Proc. CVPR 2011: learns from lots of noisy annotations. B. Sapp, D. Weiss and B. Taskar. Parsing Human Motion with Stretchable Models. In Proc. CVPR 2011: exploits temporal continuity.
  44. 44. Pose estimation is a very active research area. J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman and A. Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. Best paper award at CVPR 2011: exploits lots of synthesized depth images for training.
  45. 45. Pose Search. V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Proc. CVPR 2009.
  46. 46. Pose Search. V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Proc. CVPR 2009.
  47. 47. Application: Learning sign language by watching TV (using weakly aligned subtitles). Patrick Buehler, Mark Everingham, Andrew Zisserman. CVPR 2009.
  48. 48. Objective: learn signs in British Sign Language (BSL) corresponding to text words. Training data from TV broadcasts with simultaneous signing; supervision solely from subtitles. Output: automatically learned signs (4x slow motion). Input: video + subtitle. Examples: "office", "government". Use subtitles to find video sequences containing a word; these are the positive training sequences. Use other sequences as negative training sequences.
  49. 49. Given an English word, e.g. "tree", what is the corresponding British Sign Language sign? Positive sequences vs. a negative set.
  50. 50. Use a sliding window to choose a sub-sequence of poses in one positive sequence and determine whether the same sub-sequence of poses occurs somewhere in the other positive sequences but does not occur in the negative set (1st sliding window). Positive sequences; negative set.
  51. 51. Use a sliding window to choose a sub-sequence of poses in one positive sequence and determine whether the same sub-sequence of poses occurs somewhere in the other positive sequences but does not occur in the negative set (5th sliding window). Positive sequences; negative set.
  52. 52. Multiple instance learning. Negative bag; positive bags; sign of interest.
  53. 53. Example: learn signs in British Sign Language (BSL) corresponding to text words.
  54. 54. Evaluation. Good results for a variety of signs: signs where hand movement is important; signs where hand shape is important; signs where both hands are together; signs which are finger-spelled; signs which are performed in front of the face. Examples: Navy, Lung, Fungi, Kew, Whale, Prince, Garden, Golf, Bob, Rose.
  55. 55. What is missed? Truncation is not modelled.
  56. 56. What is missed? Occlusion is not modelled.
  57. 57. Modelling person-object-pose interactions. W. Yang, Y. Wang and Greg Mori. Recognizing Human Actions from Still Images with Latent Poses. In Proc. CVPR 2010: some limbs may not be important for recognizing a particular action (e.g. sitting). B. Yao and L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities. In Proc. CVPR 2010: pose estimation helps object detection and vice versa.
  58. 58. Towards functional object understanding. A. Gupta, S. Satkin, A.A. Efros and M. Hebert. From 3D Scene Geometry to Human Workspace. In Proc. CVPR 2011: predicts the "workspace" of a human. H. Grabner, J. Gall and L. Van Gool. What Makes a Chair a Chair? In Proc. CVPR 2011.
  59. 59. Conclusions: Human poses. Exciting progress in pose estimation in realistic still images and video. Industry-strength pose estimation from depth sensors. Pose estimation from RGB is still very challenging. Human poses ≠ human actions!
  60. 60. Lecture overview: Motivation (historic review; applications and challenges). Human pose estimation (pictorial structures; recent advances). Appearance-based methods (motion history images; active shape models and motion priors). Motion-based methods (generic and parametric optical flow; motion templates). Space-time methods (space-time features; training with weak supervision).
  61. 61. Lecture overview: Motivation (historic review; applications and challenges). Human pose estimation (pictorial structures; recent advances). Appearance-based methods (motion history images; active shape models and motion priors). Motion-based methods (generic and parametric optical flow; motion templates). Space-time methods (space-time features; training with weak supervision).
  62. 62. Foreground segmentation. Image differencing is a simple way to measure motion/change: threshold the absolute difference between consecutive frames against a constant. Better background/foreground separation methods exist: modeling the color variation at each pixel with a Gaussian mixture; dominant motion compensation for sequences with a moving camera; motion layer separation for scenes with non-static backgrounds.
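As an illustration of the two approaches mentioned above, here is a minimal sketch (not from the slides) of frame differencing and per-pixel Gaussian-mixture background modeling, using OpenCV's MOG2 subtractor and a hypothetical input file name:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("input.mp4")            # hypothetical input video
mog2 = cv2.createBackgroundSubtractorMOG2()    # per-pixel Gaussian mixture model

ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # 1) Simple image differencing: |I_t - I_{t-1}| > const
    diff_mask = (cv2.absdiff(gray, prev_gray) > 25).astype(np.uint8) * 255

    # 2) Gaussian-mixture background model (handles gradual appearance changes)
    mog_mask = mog2.apply(frame)

    prev_gray = gray
```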
  63. 63. Temporal Templates. Idea: summarize motion in video in a Motion History Image (MHI). Descriptor: Hu moments of different orders. [A.F. Bobick and J.W. Davis, PAMI 2001]
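A minimal sketch of the MHI-plus-Hu-moments idea (illustrative code, assuming a per-frame binary motion mask has already been computed, e.g. by the differencing above):

```python
import cv2
import numpy as np

def update_mhi(mhi, motion_mask, timestamp, duration=1.0):
    """Motion History Image update in the spirit of Bobick & Davis:
    moving pixels receive the current timestamp, stale entries decay to zero.
    mhi: float32 array (H, W); motion_mask: binary array (H, W)."""
    mhi[motion_mask > 0] = timestamp
    mhi[mhi < timestamp - duration] = 0.0
    return mhi

def mhi_descriptor(mhi):
    # Hu moments of the MHI give a compact, translation/scale/rotation
    # invariant "shape of motion" descriptor, as used on this slide.
    return cv2.HuMoments(cv2.moments(mhi)).ravel()
```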
  64. 64. Aerobics dataset. Nearest-neighbor classifier: 66% accuracy.
  65. 65. Temporal Templates: Summary. Pros: simple and fast; works in controlled settings. Cons: prone to errors of background subtraction (variations in light, shadows, clothing: what is the background here?); does not capture interior motion and shape (a silhouette tells little about actions). Also, not all shapes are valid: one must restrict the space of admissible silhouettes.
  66. 66. Active Shape Models [Cootes et al.]. Constrain shape deformation in a PCA-projected space. Example: face alignment; illustration of the face shape space. Active Shape Models: Their Training and Application. T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, CVIU 1995.
  67. 67. Person Tracking. Learning flexible models from image sequences. A. Baumberg and D. Hogg, ECCV 1994.
  68. 68. Learning a dynamic prior. Dynamic model: a 2nd-order auto-regressive process (state, update rule, model parameters and learning scheme given as formulas on the slide).
  69. 69. Learning a dynamic prior. Learning point sequence; random simulation of the learned dynamical model. Statistical models of visual shape and motion. A. Blake, B. Bascle, M. Isard and J. MacCormick, Phil. Trans. R. Soc. 1998.
  70. 70. Learning a dynamic prior. Random simulation of the learned gait dynamics.
  71. 71. Motion priors. Constrain the temporal evolution of shape: help accurate tracking; recognize actions. Goal: formulate motion models for different types of actions and use such models for action recognition. Example: drawing with 3 action modes: line drawing, scribbling, idle. [M. Isard and A. Blake, ICCV 1998]
  72. 72. Lecture overview: Motivation (historic review; applications and challenges). Human pose estimation (pictorial structures; recent advances). Appearance-based methods (motion history images; active shape models and motion priors). Motion-based methods (generic and parametric optical flow; motion templates). Space-time methods (space-time features; training with weak supervision).
  73. 73. Lecture overview: Motivation (historic review; applications and challenges). Human pose estimation (pictorial structures; recent advances). Appearance-based methods (motion history images; active shape models and motion priors). Motion-based methods (generic and parametric optical flow; motion templates). Space-time methods (space-time features; training with weak supervision).
  74. 74. Shape and Appearance vs. Motion. Shape and appearance in images depend on many factors: clothing, illumination, contrast, image resolution, etc. [Efros et al. 2003]. The motion field (in theory) is invariant to shape and can be used directly to describe human actions.
  75. 75. Shape and Appearance vs. Motion. Moving Light Displays. Gunnar Johansson, Perception and Psychophysics, 1973.
  76. 76. Motion estimation: Optical Flow. A classic problem of computer vision [Gibson 1955]. Goal: estimate the motion field. How? We only have access to image pixels: estimate pixel-wise correspondence between frames, i.e. optical flow. Brightness change assumption: corresponding pixels preserve their intensity (color). A useful assumption in many cases, but it breaks at occlusions and illumination changes, and physical and visual motion may differ.
  77. 77. Generic Optical Flow. The Brightness Change Constraint Equation (BCCE) relates the optical flow to the image gradient. One equation, two unknowns => cannot be solved directly. Integrate several measurements over a local neighborhood and obtain a least-squares solution [Lucas & Kanade 1981]. The resulting second-moment matrix is the same one used to compute Harris interest points! (The averages denote integration over a spatial, or spatio-temporal, neighborhood of a point.)
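The BCCE and the Lucas-Kanade solution appear as images on the slide; the standard form from the literature, reconstructed here for reference, is:

```latex
% BCCE: I_x u + I_y v + I_t = 0   (one equation, two unknowns u, v per pixel).
% Summing the squared residual over a local neighborhood and minimizing gives
\[
\begin{pmatrix} u \\ v \end{pmatrix}
= -\begin{pmatrix} \langle I_x^2 \rangle & \langle I_x I_y \rangle \\
                   \langle I_x I_y \rangle & \langle I_y^2 \rangle \end{pmatrix}^{-1}
   \begin{pmatrix} \langle I_x I_t \rangle \\ \langle I_y I_t \rangle \end{pmatrix},
\]
% where the 2x2 matrix is the second-moment matrix also used by the Harris
% detector and <.> denotes summation over the (spatio-temporal) neighborhood.
```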
  78. 78. Parameterized Optical Flow. Another extension of the constant motion model is to compute PCA basis flow fields from training examples: 1. compute standard optical flow for many examples; 2. put the velocity components into one vector; 3. do PCA and obtain the most informative PCA flow basis vectors. Training samples; PCA flow bases. Learning Parameterized Models of Image Motion. M.J. Black, Y. Yacoob, A.D. Jepson and D.J. Fleet, CVPR 1997.
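A minimal sketch of the PCA-flow idea (my own illustrative code, assuming flow fields have already been estimated by some optical-flow method):

```python
import numpy as np

def pca_flow_basis(flow_fields, n_bases=8):
    """Stack per-example flow fields (each H x W x 2, with u and v components)
    into vectors and compute a PCA basis, as in Black et al., CVPR 1997.
    flow_fields: array of shape (N, H, W, 2)."""
    N = flow_fields.shape[0]
    X = flow_fields.reshape(N, -1)          # one flow vector per training example
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_bases]               # mean flow and top PCA flow bases

def flow_coefficients(flow, mean, bases):
    # Projection coefficients of a new flow field onto the basis; these
    # coefficients serve as a compact motion/action descriptor.
    return bases @ (flow.ravel() - mean)
```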
  79. 79. Parameterized Optical Flow. The estimated coefficients of the PCA flow bases can be used as action descriptors (plotted over frame numbers). Optical flow seems to be an interesting descriptor for motion/action recognition.
  80. 80. Spatial Motion Descriptor. Image frame -> optical flow F_{x,y} -> separate channels F_x, F_y -> half-wave rectified channels F_x+, F_x-, F_y+, F_y- -> blurred channels. A. A. Efros, A.C. Berg, G. Mori and J. Malik. Recognizing Action at a Distance. In Proc. ICCV 2003.
  81. 81. Spatio-Temporal Motion Descriptor. Sequences A and B; temporal extent E. The frame-to-frame similarity matrix is blurred over the temporal extent E to obtain the motion-to-motion similarity matrix.
  82. 82. Football Actions: matching. Input sequence; matched frames.
  83. 83. Football Actions: classification. 10 actions; 4500 total frames; 13-frame motion descriptor.
  84. 84. Classifying Ballet Actions. 16 actions; 24,800 total frames; 51-frame motion descriptor. Men used to classify women and vice versa.
  85. 85. Classifying Tennis Actions. 6 actions; 4600 frames; 7-frame motion descriptor. Woman player used for training, man for testing.
  86. 86. Where are we so far?
  87. 87. Lecture overview: Motivation (historic review; modern applications). Human pose estimation (pictorial structures; recent advances). Appearance-based methods (motion history images; active shape models and motion priors). Motion-based methods (generic and parametric optical flow; motion templates). Space-time methods (space-time features; training with weak supervision).
  88. 88. Goal: interpret complex dynamic scenes. Common methods: segmentation? tracking? Common problems: complex and changing background; changing appearance. => No global assumptions about the scene.
  89. 89. Space-time. No global assumptions: consider local spatio-temporal neighborhoods. Examples: hand waving, boxing.
  90. 90. Actions == space-time objects?
  91. 91. Local approach: Bag of Visual Words. Example object categories: airplanes, motorbikes, faces, wild cats, leaves, people, bikes.
  92. 92. Space-time local features.
  93. 93. Space-Time Interest Points: Detection. What neighborhoods to consider? Distinctive neighborhoods have high image variation in space and time: look at the distribution of the gradient. Definitions: original image sequence; space-time Gaussian with covariance; Gaussian derivative; space-time gradient; second-moment matrix (formulas given on the slide).
  94. 94. Space-Time Interest Points: Detection. Properties of the second-moment matrix: it defines a second-order approximation of the local distribution of the gradient within a neighborhood. 1D space-time variation, e.g. a moving bar; 2D space-time variation, e.g. a moving ball; 3D space-time variation, e.g. a jumping ball. Large eigenvalues can be detected by the local maxima of H over (x, y, t), similar to the Harris operator [Harris and Stephens, 1988].
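The detector quantities appear as images on the slide; a hedged reconstruction based on Laptev's published space-time interest point formulation (constant k and notation are assumptions on my part):

```latex
% Space-time second-moment matrix and interest operator:
\[
\mu = g(\cdot;\,\sigma^2,\tau^2) \ast
      \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\
                      L_x L_y & L_y^2 & L_y L_t \\
                      L_x L_t & L_y L_t & L_t^2 \end{pmatrix},
\qquad
H = \det(\mu) \;-\; k\,\operatorname{trace}^{3}(\mu),
\]
% where L is the sequence smoothed with a space-time Gaussian of spatial
% variance sigma^2 and temporal variance tau^2; interest points are local
% maxima of H over (x, y, t).
```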
  95. 95. Space-Time interest points: velocity changes; appearance/disappearance; split/merge.
  96. 96. Space-Time Interest Points: Examples. Motion event detection.
  97. 97. Spatio-temporal scale selection. Stability to size changes, e.g. camera zoom.
  98. 98. Spatio-temporal scale selection. Selection of temporal scales captures the frequency of events.
  99. 99. Local features for human actions
  100. 100. Local features for human actions: boxing, walking, hand waving.
  101. 101. Local space-time descriptor: HOG/HOF. Multi-scale space-time patches. Histogram of oriented spatial gradients (HOG); histogram of optical flow (HOF). Descriptor: 3x3x2x4-bin HOG and 3x3x2x5-bin HOF. Public code available at www.irisa.fr/vista/actions.
  102. 102. Visual Vocabulary: K-means clustering. Group similar points in the space of image descriptors using K-means clustering. Select significant clusters. Clustering: c1, c2, c3, c4. Classification.
  103. 103. Visual Vocabulary: K-means clustering. Group similar points in the space of image descriptors using K-means clustering. Select significant clusters. Clustering: c1, c2, c3, c4. Classification.
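As an illustration of this vocabulary-building step, a minimal sketch using scikit-learn's KMeans (illustrative code, not the authors' implementation; k=4000 follows the evaluation framework later in the lecture):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_descriptors, k=4000, seed=0):
    """Cluster local space-time descriptors (e.g. HOG/HOF) into k visual words.
    train_descriptors: array of shape (num_features, descriptor_dim)."""
    return KMeans(n_clusters=k, random_state=seed, n_init=1).fit(train_descriptors)

def bag_of_features(video_descriptors, vocabulary):
    # Assign each local descriptor to its nearest visual word and accumulate
    # a normalized occurrence histogram for the whole video clip.
    words = vocabulary.predict(video_descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```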
  104. 104. Local space-time features: Matching. Find similar events in pairs of video sequences.
  105. 105. Action Classification: Overview. Bag of space-time features + multi-channel SVM [Laptev'03, Schuldt'04, Niebles'06, Zhang'07]. Collection of space-time patches -> HOG & HOF patch descriptors -> histogram of visual words -> multi-channel SVM classifier.
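A hedged sketch of the multi-channel chi-square kernel SVM named above (illustrative code; the exact channel weighting in the cited papers may differ):

```python
import numpy as np
from sklearn.svm import SVC

def chi2_distance_matrix(A, B, eps=1e-10):
    """Pairwise chi-square distances between L1-normalized histograms:
    D(x, y) = 0.5 * sum_i (x_i - y_i)^2 / (x_i + y_i)."""
    return np.array([[0.5 * np.sum((a - b) ** 2 / (a + b + eps)) for b in B]
                     for a in A])

def multichannel_chi2_kernel(channels_A, channels_B):
    # One distance matrix per channel (e.g. HOG histograms, HOF histograms),
    # each normalized by its mean distance, combined in the exponent:
    # K = exp(-sum_c D_c / mean(D_c)), a common form of the multi-channel
    # chi-square kernel used with bag-of-features SVMs.
    total = sum(D / (D.mean() + 1e-10)
                for D in (chi2_distance_matrix(a, b)
                          for a, b in zip(channels_A, channels_B)))
    return np.exp(-total)

# Usage sketch with a precomputed-kernel SVM (data shapes are hypothetical):
# K_train = multichannel_chi2_kernel(train_channels, train_channels)
# clf = SVC(kernel="precomputed").fit(K_train, labels)
```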
  106. 106. Action recognition in the KTH dataset. Sample frames from the KTH action sequences; all six classes (columns) and scenarios (rows) are presented.
  107. 107. Classification results on the KTH dataset. Confusion matrix for KTH actions.
  108. 108. What about 3D? Local motion and appearance features are not invariant to view changes.
  109. 109. Multi-view action recognition. Difficult to apply standard multi-view methods: we do not want to search for multi-view point correspondence (non-rigid motion, clothing changes, ... it's hard!); we do not want to identify body parts (current methods are not reliable enough); yet we want to learn actions from one view and recognize actions in very different views.
  110. 110. Temporal self-similarities. Idea: cross-view matching is hard, but cross-time matching (tracking) is relatively easy. Measure self-(dis)similarities across time. Example: distance matrix / self-similarity matrix (SSM) for points P1, P2.
  111. 111. Temporal self-similarities: multiple views. Side view and top view: the matrices appear very similar despite the view change! Intuition: 1. the distance between similar poses is low in any view; 2. the distance among different poses is likely to be large in most views.
  112. 112. Temporal self-similarities: MoCap. Self-similarities can be measured from Motion Capture (MoCap) data (person 1, person 2).
  113. 113. Temporal self-similarities: Video. Self-similarities can be measured directly from video: HOG or optical flow descriptors in image frames.
  114. 114. Self-similarity descriptor. Goal: define a quantitative measure to compare self-similarity matrices. Define a local SIFT-like histogram descriptor h_i for each point on the diagonal of the SSM. Sequence alignment: dynamic programming for two sequences of descriptors {h_i}, {h_j}. Action recognition: visual vocabulary for {h_i}; BoF representation; SVM.
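A minimal sketch of computing a temporal self-similarity matrix from per-frame descriptors (illustrative code; the local histogram descriptor computed along the SSM diagonal is not shown):

```python
import numpy as np

def self_similarity_matrix(frame_descriptors):
    """Temporal self-similarity matrix: pairwise distances between per-frame
    descriptors (e.g. HOG or optical-flow histograms, or joint positions).
    frame_descriptors: array of shape (num_frames, descriptor_dim)."""
    X = np.asarray(frame_descriptors, dtype=float)
    # Euclidean distance between every pair of frames; entry (i, j) is small
    # when the poses/motions at times i and j are similar, largely independent
    # of the camera view, which is the key observation on these slides.
    sq = np.sum(X ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
```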
  115. 115. Multi-view alignment.
  116. 116. Multi-view action recognition: Video. SSM-based recognition vs. an alternative view-dependent method (STIP).
  117. 117. What are Human Actions? Actions in recent datasets: is it just about kinematics? Should actions be defined by their purpose? Kinematics + objects.
  118. 118. What are Human Actions? Actions in recent datasets: is it just about kinematics? Should actions be defined by their purpose? Kinematics + objects + scenes.
  119. 119. Action recognition in realistic settings. Standard action datasets vs. actions "in the wild".
  120. 120. Action Dataset and Annotation. Manual annotation of drinking actions in movies: "Coffee and Cigarettes", "Sea of Love". "Drinking": 159 annotated samples; "Smoking": 149 annotated samples. Temporal annotation: first frame, keyframe, last frame. Spatial annotation: head rectangle, torso rectangle.
  121. 121. "Drinking" action samples.
  122. 122. Action representation: histogram of gradient; histogram of optic flow.
  123. 123. Action learning: selected features, boosting, weak classifiers. AdaBoost: efficient discriminative classifier [Freund & Schapire'97]; good performance for face detection [Viola & Jones'01]. Pre-aligned samples; Haar features; weak classifiers: optimal threshold, Fisher discriminant; histogram features.
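A minimal sketch of the boosted-classifier idea, using scikit-learn's AdaBoost with threshold-style weak learners (illustrative only; it stands in for, rather than reproduces, the Haar/Fisher-discriminant weak classifiers on the slide):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_action_classifier(pos_features, neg_features, n_rounds=200):
    """AdaBoost over depth-1 trees (i.e. single-feature thresholds) applied to
    pre-aligned histogram features such as HOG/HOF blocks.
    pos_features, neg_features: arrays of shape (num_samples, num_features)."""
    X = np.vstack([pos_features, neg_features])
    y = np.hstack([np.ones(len(pos_features)), np.zeros(len(neg_features))])
    clf = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),  # threshold weak learner
        n_estimators=n_rounds,                          # (base_estimator= on older scikit-learn)
    )
    return clf.fit(X, y)
```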
  124. 124. Key-frame action classifier: the same boosting setup with 2D HOG features (selected features, boosting, weak classifiers). Efficient discriminative classifier [Freund & Schapire'97]; good performance for face detection [Viola & Jones'01]. Pre-aligned samples; Haar features; weak classifiers: optimal threshold, Fisher discriminant; histogram features. See [Laptev BMVC'06] for more details. [Laptev, Pérez 2007]
  125. 125. Keyframe priming. Training: false positives of the static HOG action detector; positive training sample; negative training samples. Test.
  126. 126. Action detection. Test set: 25 min from "Coffee and Cigarettes" with 38 ground-truth drinking actions; no overlap with the training set in subjects or scenes. Detection: search over all space-time locations and spatio-temporal extents. Keyframe priming vs. no keyframe priming.
  127. 127. Action Detection (ICCV 2007). Test episodes from the movie "Coffee and Cigarettes". Video available at http://www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html
  128. 128. 20 most confident detections
  129. 129. Learning Actions from Movies. Realistic variation of human actions; many classes and many examples per class. Problems: typically only a few class samples per movie; manual annotation is very time consuming.
  130. 130. Automatic video annotation with scripts. Scripts are available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, ... Subtitles (with time info) are available for most movies. Time can be transferred to scripts by text alignment. Example: subtitles "1172: 01:20:17,240 --> 01:20:20,437 Why weren't you honest with me? Why'd you keep your marriage a secret? 1173: 01:20:20,640 --> 01:20:23,598 It wasn't my secret, Richard. Victor wanted it that way. 1174: 01:20:23,800 --> 01:20:26,189 Not even our closest friends knew about our marriage." aligned with the movie script "RICK: Why weren't you honest with me? Why did you keep your marriage a secret? Rick sits down with Ilsa. ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage."
  131. 131. Script-based action annotation. On the good side: realistic variation of actions (subjects, views, etc.); many examples per class, many classes; no extra overhead for new classes; actions, objects, scenes and their combinations; character names may be used to resolve "who is doing what?". Problems: no spatial localization; temporal localization may be poor; missing actions, e.g. scripts do not always follow the movie; annotation is incomplete, not suitable as ground truth for testing action detection; large within-class variability of action classes in text.
  132. 132. Script alignment: Evaluation. Annotate action samples in text; do automatic script-to-video alignment; check the correspondence of actions in scripts and movies. Example of a "visual false positive": "A black car pulls up, two army officers get out." a: quality of subtitle-script matching.
  133. 133. Text-based action retrieval. Large variation of action expressions in text. GetOutCar action: "... Will gets out of the Chevrolet. ...", "... Erin exits her new truck ...". Potential false positives: "... About to sit down, he freezes ...". => Supervised text classification approach.
  134. 134. Automatically annotated action samples [Laptev, Marszałek, Schmid, Rozenfeld 2008]
  135. 135. Hollywood-2 actions dataset. Training and test samples are obtained from 33 and 36 distinct movies, respectively. The Hollywood-2 dataset is online: http://www.irisa.fr/vista/actions/hollywood2 [Laptev, Marszałek, Schmid, Rozenfeld 2008]
  136. 136. Action Classification: Overview. Bag of space-time features + multi-channel SVM [Laptev'03, Schuldt'04, Niebles'06, Zhang'07]. Collection of space-time patches -> HOG & HOF patch descriptors -> histogram of visual words -> multi-channel SVM classifier.
  137. 137. Action classification (CVPR 2008). Test episodes from the movies "The Graduate", "It's a Wonderful Life", "Indiana Jones and the Last Crusade".
  138. 138. Evaluation of local features for action recognition. Local features provide a popular approach to video description for action recognition: ~50% of recent action recognition methods (CVPR'09, ICCV'09, BMVC'09) are based on local features; a large variety of feature detectors and descriptors is available; there is very limited and inconsistent comparison of different features. Goal: systematic evaluation of local feature-descriptor combinations; compare performance on common datasets; propose improvements.
  139. 139. Evaluation of local features for action recognition. Evaluation study [Wang et al. BMVC'09]: common recognition framework (same datasets of varying difficulty: KTH, UCF Sports, Hollywood2; same train/test data; same classification method); alternative local feature detectors and descriptors from recent literature; comparison of different detector-descriptor combinations.
  140. 140. Action recognition framework. Bag of space-time features + SVM [Schuldt'04, Niebles'06, Zhang'07]. Extraction of local features -> feature description -> feature quantization (K-means clustering, k=4000) -> occurrence histogram of visual words -> non-linear SVM with χ² kernel.
  141. 141. Local feature detectors/descriptors. Four types of detectors: Harris3D [Laptev'05]; Cuboids [Dollar'05]; Hessian [Willems'08]; regular dense sampling. Four different types of descriptors: HOG/HOF [Laptev'08]; Cuboids [Dollar'05]; HOG3D [Kläser'08]; extended SURF [Willems'08].
  142. 142. Illustration of ST detectors: Harris3D, Hessian, Cuboid, Dense.
  143. 143. Dataset: KTH-Actions. 6 action classes by 25 persons in 4 different scenarios; 2391 video samples in total. Performance measure: average accuracy over all classes.
  144. 144. UCF-Sports: samples. 10 different action classes; 150 video samples in total (we extend the dataset by flipping videos). Evaluation method: leave-one-out. Performance measure: average accuracy over all classes. Examples: diving, kicking, walking, skateboarding, high-bar swinging, golf swinging.
  145. 145. Dataset: Hollywood2. 12 different action classes from 69 Hollywood movies; 1707 video sequences in total; separate movies for training/testing. Performance measure: mean average precision (mAP) over all classes. Examples: GetOutCar, AnswerPhone, Kiss, HandShake, StandUp, DriveCar.
  146. 146. KTH-Actions: results (accuracy; descriptors x detectors).
        Detector:  Harris3D  Cuboids  Hessian  Dense
        HOG3D       89.0%     90.0%    84.6%   85.3%
        HOG/HOF     91.8%     88.7%    88.7%   86.1%
        HOG         80.9%     82.3%    77.7%   79.0%
        HOF         92.1%     88.2%    88.6%   88.0%
        Cuboids       -       89.1%      -       -
        E-SURF        -         -      81.4%     -
        Best results for sparse Harris3D + HOF. Good results for Harris3D and Cuboid detectors with HOG/HOF and HOG3D descriptors. Dense features perform relatively poorly compared to sparse features.
  147. 147. UCF-Sports: results (accuracy; descriptors x detectors).
        Detector:  Harris3D  Cuboids  Hessian  Dense
        HOG3D       79.7%     82.9%    79.0%   85.6%
        HOG/HOF     78.1%     77.7%    79.3%   81.6%
        HOG         71.4%     72.7%    66.0%   77.4%
        HOF         75.4%     76.7%    75.3%   82.6%
        Cuboids       -       76.6%      -       -
        E-SURF        -         -      77.3%     -
        Best results for Dense + HOG3D. Good results for Dense and HOG/HOF. Cuboids: good performance with HOG3D.
  148. 148. Hollywood2: results (mAP; descriptors x detectors).
        Detector:  Harris3D  Cuboids  Hessian  Dense
        HOG3D       43.7%     45.7%    41.3%   45.3%
        HOG/HOF     45.2%     46.2%    46.0%   47.4%
        HOG         32.8%     39.4%    36.2%   39.4%
        HOF         43.3%     42.9%    43.0%   45.5%
        Cuboids       -       45.0%      -       -
        E-SURF        -         -      38.2%     -
        Best results for Dense + HOG/HOF. Good results for HOG/HOF.
  149. 149. Evaluation summary. Dense sampling consistently outperforms all the tested sparse features in realistic settings (UCF + Hollywood2): importance of realistic video data; limitations of current feature detectors; note the large number of features (15-20 times more). Sparse features provide more or less similar results (sparse features are better than Dense on KTH). Descriptors' performance: a combination of gradients and optical flow seems a good choice (HOG/HOF & HOG3D).
  150. 150. How to improve BoF classification? Actions are about people. Why not try to combine BoF with person detection? Detect and track people; compute BoF on person-centered grids: 2x2, 3x2, 3x3, ... Surprise!
  151. 151. How to improve BoF classification? 2nd attempt: do not remove the background; improve local descriptors with region-level information. Local features: ambiguous features vs. features disambiguated with region labels. Visual vocabulary; regions R1, R2, R3; histogram representation; SVM classification.
  152. 152. Video Segmentation. Spatio-temporal grids (R1, R2, R3). Static action detectors [Felzenszwalb'08], trained from ~100 web images per class. Object and person detectors (upper body) [Felzenszwalb'08].
  153. 153. Video Segmentation.
  154. 154. Hollywood-2 action classification (attributed feature: performance, mean AP).
        BoF: 48.55
        Spatiotemporal grid, 24 channels: 51.83
        Motion segmentation: 50.39
        Upper body: 49.26
        Object detectors: 49.89
        Action detectors: 52.77
        Spatiotemporal grid + motion segmentation: 53.20
        Spatiotemporal grid + upper body: 53.18
        Spatiotemporal grid + object detectors: 52.97
        Spatiotemporal grid + action detectors: 55.72
        Spatiotemporal grid + motion segmentation + upper body + object detectors + action detectors: 55.33
  155. 155. Hollywood-2 action classification.
  156. 156. Actions in Context (CVPR 2009). Human actions are frequently correlated with particular scene classes. Reasons: physical properties and particular purposes of scenes. Examples: eating -- kitchen; eating -- cafe; running -- road; running -- street.
  157. 157. Mining scene captions. Script excerpt with subtitle times: [01:22:00 - 01:22:03] ILSA: "I wish I didn't love you so much." She snuggles closer to Rick. CUT TO: EXT. RICK'S CAFE - NIGHT. Laszlo and Carl make their way through the darkness toward a side entrance of Rick's. They run inside the entryway. The headlights of a speeding police car sweep toward them. They flatten themselves against a wall to avoid detection. The lights move past them. [01:22:15 - 01:22:17] CARL: "I think we lost them." ...
  158. 158. Mining scene captions. Scene headers such as: INT. TRENDY RESTAURANT - NIGHT; INT. MARSELLUS WALLACE'S DINING ROOM - MORNING; EXT. STREETS BY DORA'S HOUSE - DAY; INT. MELVIN'S APARTMENT, BATHROOM - NIGHT; EXT. NEW YORK CITY STREET NEAR CAROL'S RESTAURANT - DAY; INT. CRAIG AND LOTTE'S BATHROOM - DAY. Maximize word frequency: street, living room, bedroom, car, ... Merge words with similar senses using WordNet: taxi -> car, cafe -> restaurant. Measure the correlation of words with actions (in scripts) and re-sort words by the entropy of P = p(action | word).
  159. 159. Co-occurrence of actions and scenes in scripts.
  160. 160. Co-occurrence of actions and scenes in scripts.
  161. 161. Co-occurrence of actions and scenes in scripts.
  162. 162. Co-occurrence of actions and scenes in text vs. video.
  163. 163. Automatic gathering of relevant scene classes and visual samples. Source: 69 movies aligned with their scripts. The Hollywood-2 dataset is online: http://www.irisa.fr/vista/actions/hollywood2
  164. 164. Results: actions and scenes (separately)
  165. 165. Classification with the help of context. New action score = action classification score + weight (estimated from text) x scene classification score.
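A hedged sketch of this score combination (notation mine; the exact form used in the CVPR 2009 work may differ, e.g. in how scene classes are summed or weights normalized):

```latex
% New score for action class j, combining the action classifier output a_j(x)
% with scene classifier outputs s_k(x) weighted by w_{jk}, where w_{jk} is
% estimated from action-scene co-occurrence in scripts:
\[
a_j^{\mathrm{new}}(x) \;=\; a_j(x) \;+\; \sum_k w_{jk}\, s_k(x)
\]
```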
  166. 166. Results: actions and scenes (jointly). Actions in the context of scenes; scenes in the context of actions.
  167. 167. Weakly-Supervised Temporal Action Annotation. Answer the questions: WHAT actions and WHEN did they happen? Examples: knock on the door; fight; kiss. Train visual action detectors and annotate actions with minimal manual supervision.
  168. 168. WHAT actions? Automatic discovery of action classes in text (movie scripts). Text processing: part-of-speech (POS) tagging; named entity recognition (NER); WordNet pruning; visual noun filtering. Search action patterns:
        Person+Verb: 3725 /PERSON .* is; 2644 /PERSON .* looks; 1300 /PERSON .* turns; 916 /PERSON .* takes; 840 /PERSON .* sits; 829 /PERSON .* has; 807 /PERSON .* walks; 701 /PERSON .* stands; 622 /PERSON .* goes; 591 /PERSON .* starts; 585 /PERSON .* does; 569 /PERSON .* gets; 552 /PERSON .* pulls; 503 /PERSON .* comes; 493 /PERSON .* sees; 462 /PERSON .* are/VBP.
        Person+Verb+Prep.: 989 /PERSON .* looks .* at; 384 /PERSON .* is .* in; 363 /PERSON .* looks .* up; 234 /PERSON .* is .* on; 215 /PERSON .* picks .* up; 196 /PERSON .* is .* at; 139 /PERSON .* sits .* in; 138 /PERSON .* is .* with; 134 /PERSON .* stares .* at; 129 /PERSON .* is .* by; 126 /PERSON .* looks .* down; 124 /PERSON .* sits .* on; 122 /PERSON .* is .* of; 114 /PERSON .* gets .* up; 109 /PERSON .* sits .* at; 107 /PERSON .* sits .* down.
        Person+Verb+Prep+Vis.Noun: 41 /PERSON .* sits .* in .* chair; 37 /PERSON .* sits .* at .* table; 31 /PERSON .* sits .* on .* bed; 29 /PERSON .* sits .* at .* desk; 26 /PERSON .* picks .* up .* phone; 23 /PERSON .* gets .* out .* car; 23 /PERSON .* looks .* out .* window; 21 /PERSON .* looks .* around .* room; 18 /PERSON .* is .* at .* desk; 17 /PERSON .* hangs .* up .* phone; 17 /PERSON .* is .* on .* phone; 17 /PERSON .* looks .* at .* watch; 16 /PERSON .* sits .* on .* couch; 15 /PERSON .* opens .* of .* door; 15 /PERSON .* walks .* into .* room; 14 /PERSON .* goes .* into .* room.
  169. 169. WHEN: Video Data and Annotation. We want to target realistic video data and to avoid manual video annotation for training. Use movies + scripts for automatic annotation of training samples (example clip between 24:25 and 24:51 -- temporal uncertainty!).
  170. 170. Overview. Input: action type, e.g. "person opens door"; videos + aligned scripts. Automatic collection of training clips. Clustering of positive segments. Training a classifier. Output: sliding-window-style temporal action localization.
  171. 171. Action clustering [Lihi Zelnik-Manor and Michal Irani, CVPR 2001]. Spectral clustering: descriptor space; clustering results; ground truth.
  172. 172. Action clustering. Complex data: standard clustering methods do not work on this data.
  173. 173. Action clustering. Our view of the problem: feature space vs. video space. Negative samples! The nearest-neighbor solution is wrong. Random video samples: lots of them, with a very low chance of being positives.
  174. 174. Action clustering. Formulation [Xu et al. NIPS'04], [Bach & Harchaoui NIPS'07]: a discriminative cost combining a loss on parameterized positive samples and a loss on negative samples. Optimization: an SVM solution for the classifier; coordinate descent on the positive-segment parameters.
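A hedged sketch of a discriminative-clustering cost of this kind (my notation; the exact objective in the cited works may differ):

```latex
% Jointly choose the temporal positions t_i of the positive segments and a
% linear classifier (w, b), minimizing a loss l on the parameterized positives
% x(t_i) and on random negative samples z_j, plus a regularizer:
\[
\min_{w,\,b,\,\{t_i\}} \;\; \lambda\,\|w\|^2
\;+\; \sum_i \ell\!\big(w^{\top} x(t_i) + b,\; +1\big)
\;+\; \sum_j \ell\!\big(w^{\top} z_j + b,\; -1\big)
\]
% Optimization alternates an SVM solve for (w, b) with coordinate descent on t_i.
```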
  175. 175. Clustering results. Drinking actions in "Coffee and Cigarettes".
  176. 176. Detection results. Drinking actions in "Coffee and Cigarettes". Training a bag-of-features classifier; temporal sliding-window classification; non-maximum suppression. Detection trained on simulated clusters. Test set: 25 min from "Coffee and Cigarettes" with 38 ground-truth drinking actions.
  177. 177. Detection results. Drinking actions in "Coffee and Cigarettes". Training a bag-of-features classifier; temporal sliding-window classification; non-maximum suppression. Detection trained on automatic clusters. Test set: 25 min from "Coffee and Cigarettes" with 38 ground-truth drinking actions.
  178. 178. Detection results. "Sit Down" and "Open Door" actions in ~5 hours of movies.
  179. 179. Temporal detection of “Sit Down” and “Open Door” actions in movies: The Graduate, The Crying Game, Living in Oblivion
  180. 180. Conclusions. Bag-of-words models are currently dominant; the structure (human poses, etc.) should be integrated. The vocabulary of actions is not well-defined: it depends on the goal and the task. Actions should be used for the functional interpretation of the visual world.
