Microsoft Computer Vision School, 28 July - 3 August 2011, Moscow, Russia
Human Action Recognition
Ivan Laptev, ivan.laptev@inria.fr
INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548
Laboratoire d'Informatique, Ecole Normale Supérieure, Paris
Includes slides from: Alyosha Efros, Mark Everingham and Andrew Zisserman

Lecture overview
Motivation: historic review; applications and challenges
Human pose estimation: pictorial structures; recent advances
Appearance-based methods: motion history images; active shape models & motion priors
Motion-based methods: generic and parametric optical flow; motion templates

Computer vision grand challenge: video understanding
Objects: cars, glasses, people, houses, doors, etc.
Actions: running, drinking, person enters car, exit through door, car crash, kidnapping, etc.
Scene categories: indoors, outdoors, street scene, countryside, field, etc.
Geometry: street, wall, field, stairs, etc.

Motivation I: Artistic representation
Early studies were motivated by human representations in the arts.
Da Vinci: "it is indispensable for a painter, to become totally familiar with the anatomy of nerves, bones, muscles, and sinews, such that he understands, for their various motions and stresses, which sinews or which muscle causes a particular motion."
"I ask for the weight [pressure] of this man for every segment of motion when climbing those stairs, and for the weight he places on b and on c. Note the vertical line below the center of mass of this man."
Leonardo da Vinci (1452-1519): A man going upstairs, or up a ladder.

Motivation II: Biomechanics
The emergence of biomechanics: Borelli applied the analytical and geometrical methods developed by Galileo Galilei to biology.
He was the first to understand that bones serve as levers and muscles function according to mathematical principles.
His physiological studies included muscle analysis and a mathematical discussion of movements, such as running or jumping.
Giovanni Alfonso Borelli (1608-1679)

Motivation III: Motion perception
Etienne-Jules Marey (1830-1904) made chronophotographic experiments influential for the emerging field of cinematography.
Eadweard Muybridge (1830-1904) invented a machine for displaying recorded series of images. He pioneered motion pictures and applied his technique to movement studies.

Motivation III: Motion perception
Gunnar Johansson [1973] pioneered studies on the use of image sequences for programmed human motion analysis.
"Moving Light Displays" (LED) enable identification of familiar people and their gender, and inspired many works in computer vision.
Gunnar Johansson, Perception and Psychophysics, 1973

Human actions: Historic overview
• 15th century: studies of anatomy
• 17th century: emergence of biomechanics
• 19th century: emergence of cinematography
• 1973: studies of human motion perception
• Modern computer vision

Modern applications: Motion capture and animation
Avatar (2009)

Modern applications: Motion capture and animation
Leonardo da Vinci (1452-1519); Avatar (2009)

Modern applications: Video editing
Space-Time Video Completion. Y. Wexler, E. Shechtman and M. Irani, CVPR 2004

Modern applications: Video editing
Space-Time Video Completion. Y. Wexler, E. Shechtman and M. Irani, CVPR 2004

Modern applications: Video editing
Recognizing Action at a Distance. Alexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik, ICCV 2003

Modern applications: Video editing
Recognizing Action at a Distance. Alexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik, ICCV 2003

Why action recognition?
Video indexing and search is useful in TV production, entertainment, education, social studies, security, ...
• Home videos, e.g. "My daughter climbing"
• TV & Web, e.g. "Fight in a parliament" (260K views in 7 days on YouTube)
• Sociology research: manually analyzed smoking actions in 900 movies
• Surveillance

How is action recognition related to computer vision?
(Street scene labeled: sky, street sign, cars, road)

We can recognize cars and roads. What's next?
12,184,113 images, 17624 synsets

Airplane
A plane has crashed, the cabin is broken, somebody is likely to be injured or dead.

cat, woman, trash bin

Vision is person-centric: we mostly care about things which are important to us, people.
Actions of people reveal the function of objects.
Future challenges:
• Function: what can I do with this and how?
• Prediction: what can happen if someone does that?
• Recognizing goals: what is this person trying to do?
How many person-pixels are there?
Movies; TV; YouTube

How many person-pixels are there?
Movies; TV; YouTube

How many person-pixels are there?
35% (movies), 34% (TV), 40% (YouTube)

How much data do we have?
A huge amount of video is available and growing:
• TV channels recorded since the 60's
• >34K hours of video uploaded every day
• ~30M surveillance cameras in the US => ~700K video hours/day
If we want to interpret this data, we should better understand what person-pixels are telling us!

What is this course about?

Goal
Get familiar with:
• Problem formulations
• Mainstream approaches
• Particular existing techniques
• Current benchmarks
• Available baseline methods
• Promising future directions
• Hands-on experience (practical session)

What is action recognition?
Terminology: what is an "action"?
Output representation: what do we want to say about an image/video?
Neither question has a satisfactory answer yet.

Terminology
The terms "action recognition", "activity recognition" and "event recognition" are used inconsistently.
• Finding a common language for describing videos is an open problem.

Terminology example
"Action" is a low-level primitive with semantic meaning, e.g. walking, pointing, placing an object.
"Activity" is a higher-level combination with some temporal relations, e.g. taking money out from an ATM, waiting for a bus.
"Event" is a combination of activities, often involving multiple individuals, e.g. a soccer game, a traffic accident.
This is contentious: no standard, rigorous definition exists.

Output representation
Given this image, what is the desired output?
• "This image contains a man walking" (action classification / recognition)
• "The man walking is here" (action detection)

Output representation
Given this image, what is the desired output?
• "This image contains 5 men walking, 4 jogging, 2 running"
• "The 5 men walking are here"
• "This is a soccer game"

Output representation
Given this image, what is the desired output?
• "Frames 1-20 the man ran to the left, then frames 21-25 he ran away from the camera"
• Is this an accurate description?
• Are labels and video frames in 1-1 correspondence?

DATASETS

Dataset: KTH-Actions
• 6 action classes by 25 persons in 4 different scenarios
• Total of 2391 video samples
• Specified train, validation, test sets
• Performance measure: average accuracy over all classes
[Schuldt, Laptev, Caputo ICPR 2004]

UCF-Sports
• 10 different action classes
• 150 video samples in total
• Evaluation method: leave-one-out
• Performance measure: average accuracy over all classes
Diving, kicking, walking, skateboarding, high-bar swinging, golf swinging
[Rodriguez, Ahmed, and Shah CVPR 2008]

UCF YouTube Action Dataset
11 categories, 1168 videos
• Evaluation method: leave-one-out
• Performance measure: average accuracy over all classes
[Liu, Luo and Shah CVPR 2009]

Semantic Description of Human Activities (ICPR 2010)
3 challenges: interaction, aerial view, wide-area
Interaction:
• 6 classes, 120 instances over ~20 min. video
• Classification and detection tasks (+/- bounding boxes)
• Evaluation method: leave-one-out
[Ryoo et al. ICPR 2010 challenge]

Hollywood2
• 12 action classes from 69 Hollywood movies
• 1707 video sequences in total
• Separate movies for training / testing
• Performance measure: mean average precision (mAP) over all classes
GetOutCar, AnswerPhone, Kiss, HandShake, StandUp, DriveCar
[Marszałek, Laptev, Schmid CVPR 2009]

TRECVid Surveillance Event Detection
10 actions: person runs, take picture, cell to ear, ...
5 cameras, ~100h video from LGW airport
Detection (in time, not space); multiple detections count as false positives
Evaluation method: specified training / test videos, evaluation at NIST
Performance measure: statistics on DET curves
[Smeaton, Over, Kraaij, TRECVid]

Dataset desiderata
Clutter
Not choreographed by dataset collectors
• Real-world variation
Scale
• Large amount of video
Rarity of actions
• Detection harder than classification
• Chance performance should be very low
Clear definition of training/test split
• Validation set for parameter tuning?
• Reproducing / comparing to other methods?
Lecture overview
Motivation: historic review; applications and challenges
Human pose estimation: pictorial structures; recent advances
Appearance-based methods: motion history images; active shape models & motion priors
Motion-based methods: generic and parametric optical flow; motion templates

Lecture overview
Motivation: historic review; applications and challenges
Human pose estimation: pictorial structures; recent advances
Appearance-based methods: motion history images; active shape models & motion priors
Motion-based methods: generic and parametric optical flow; motion templates

Objective and motivation
Determine human body pose (layout). Why? To recognize poses, gestures, actions.
Slide credit: Andrew Zisserman

Activities characterized by a pose
Slide credit: Andrew Zisserman

Activities characterized by a pose
Slide credit: Andrew Zisserman

Activities characterized by a pose

Challenges: articulations and deformations

Challenges of (almost) unconstrained images:
varying illumination and low contrast; moving camera and background; multiple people; scale changes; extensive clutter; any clothing

Pictorial structures
• Intuitive model of an object
• Model has two components:
  1. parts (2D image fragments)
  2. structure (configuration of parts)
• Dates back to Fischler & Elschlager 1973

Long tradition of using pictorial structures for humans:
Finding People by Sampling. Ioffe & Forsyth, ICCV 1999
Pictorial Structure Models for Object Recognition. Felzenszwalb & Huttenlocher, 2000
Learning to Parse Pictures of People. Ronfard, Schmid & Triggs, ECCV 2002

Felzenszwalb & Huttenlocher
NB: requires background subtraction

Variety of poses

Variety of poses

Objective: detect human and determine upper body pose (layout)
(Figure: graphical model with variables si, f, a1, a2)

Pictorial structure model - CRF
(Figure: graphical model with variables si, f, a1, a2)

Complexity
(Figure: graphical model with variables si, f, a1, a2)

Are trees the answer?
(Tree over parts: head, torso, left/right upper arms, lower arms, hands)
• With n parts and h possible discrete locations per part: O(h^n)
• For a tree, using dynamic programming this reduces to O(nh^2)
• If the model is a tree and has certain edge costs, then complexity reduces to O(nh) using a distance transform [Felzenszwalb & Huttenlocher, 2000, 2005]

Are trees the answer?
(Tree over parts: head, torso, left/right upper arms, lower arms, hands)
• With n parts and h possible discrete locations per part: O(h^n)
• For a tree, using dynamic programming this reduces to O(nh^2)
• If the model is a tree and has certain edge costs, then complexity reduces to O(nh) using a distance transform [Felzenszwalb & Huttenlocher, 2000, 2005]

Kinematic structure vs. graphical (independence) structure
Graph G = (V,E): modelling the kinematic dependencies between body parts requires more connections than a tree.

More recent work on human pose estimation
D. Ramanan. Learning to parse images of articulated bodies. NIPS 2007.
• Learns image- and person-specific unary terms: the initial iteration uses edges; following iterations use edges & colour.
V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Proc. CVPR 2008/2009.
• (Almost) unconstrained images; person detector & foreground highlighting.
P. Buehler, M. Everingham and A. Zisserman. Learning sign language by watching TV. In Proc. CVPR 2009.
• Learns with weak textual annotation; multiple instance learning.
Pose estimation is a very active research area
Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In Proc. CVPR 2011. (Extension of the LSVM model of Felzenszwalb et al.)
Y. Wang, D. Tran and Z. Liao. Learning Hierarchical Poselets for Human Parsing. In Proc. CVPR 2011. (Builds on the Poselets idea of Bourdev et al.)
S. Johnson and M. Everingham. Learning Effective Human Pose Estimation from Inaccurate Annotation. In Proc. CVPR 2011. (Learns from lots of noisy annotations)
B. Sapp, D. Weiss and B. Taskar. Parsing Human Motion with Stretchable Models. In Proc. CVPR 2011. (Explores temporal continuity)

Pose estimation is a very active research area
J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman and A. Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. Best paper award at CVPR 2011. (Exploits lots of synthesized depth images for training)

Pose search
V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Proc. CVPR 2009

Pose search
V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Proc. CVPR 2009

Application
Learning sign language by watching TV (using weakly aligned subtitles)
Patrick Buehler, Mark Everingham, Andrew Zisserman. CVPR 2009

Objective
Learn signs in British Sign Language (BSL) corresponding to text words:
• Training data from TV broadcasts with simultaneous signing
• Supervision solely from subtitles
Output: automatically learned signs (4x slow motion). Input: video + subtitle ("Office", "Government").
Use subtitles to find video sequences containing the word; these are the positive training sequences. Use other sequences as negative training sequences.

Given an English word, e.g. "tree", what is the corresponding British Sign Language sign?
(positive sequences; negative set)

Use a sliding window to choose a sub-sequence of poses in one positive sequence and determine if the same sub-sequence of poses occurs somewhere in the other positive sequences, but does not occur in the negative set. (1st sliding window)
(positive sequences; negative set)

Use a sliding window to choose a sub-sequence of poses in one positive sequence and determine if the same sub-sequence of poses occurs somewhere in the other positive sequences, but does not occur in the negative set. (5th sliding window)
(positive sequences; negative set)

Multiple instance learning
Negative bag; positive bags; sign of interest

Example
Learn signs in British Sign Language (BSL) corresponding to text words.

Evaluation
Good results for a variety of signs:
• Signs where hand movement is important
• Signs where hand shape is important
• Signs where both hands are together
• Signs which are finger-spelled
• Signs which are performed in front of the face
Navy, Lung, Fungi, Kew, Whale, Prince, Garden, Golf, Bob, Rose

What is missed? Truncation is not modelled.

What is missed? Occlusion is not modelled.

Modelling person-object-pose interactions
W. Yang, Y. Wang and Greg Mori. Recognizing Human Actions from Still Images with Latent Poses. In Proc. CVPR 2010. (Some limbs may not be important for recognizing a particular action, e.g. sitting)
B. Yao and L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities. In Proc. CVPR 2010. (Pose estimation helps object detection and vice versa)

Towards functional object understanding
A. Gupta, S. Satkin, A.A. Efros and M. Hebert. From 3D Scene Geometry to Human Workspace. In Proc. CVPR 2011. (Predicts the "workspace" of a human)
H. Grabner, J. Gall and L. Van Gool. What Makes a Chair a Chair? In Proc. CVPR 2011

Conclusions: human poses
• Exciting progress in pose estimation in realistic still images and video
• Industry-strength pose estimation from depth sensors
• Pose estimation from RGB is still very challenging
• Human poses ≠ human actions!

Lecture overview
Motivation: historic review; applications and challenges
Human pose estimation: pictorial structures; recent advances
Appearance-based methods: motion history images; active shape models & motion priors
Motion-based methods: generic and parametric optical flow; motion templates

Lecture overview
Motivation: historic review; applications and challenges
Human pose estimation: pictorial structures; recent advances
Appearance-based methods: motion history images; active shape models & motion priors
Motion-based methods: generic and parametric optical flow; motion templates

How to recognize actions?

Action understanding: key components
Image measurements: foreground segmentation; image gradients; optical flow; local space-time features; background models.
Prior knowledge: deformable contour models; 2D/3D body models; motion priors; learning associations.
Action labels obtained by automatic inference, from strong / weak supervision.

Foreground segmentation
Image differencing: a simple way to measure motion/change (|I_t - I_{t-1}| > const).
Better background / foreground separation methods exist:
• Modeling of color variation at each pixel with a Gaussian mixture
• Dominant motion compensation for sequences with a moving camera
• Motion layer separation for scenes with non-static backgrounds
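A minimal sketch of the two ideas above in Python with OpenCV ("video.avi" is a hypothetical input file): plain image differencing against a constant threshold, next to a per-pixel Gaussian-mixture background model (cv2.createBackgroundSubtractorMOG2, available in OpenCV 3+).

    import cv2

    cap = cv2.VideoCapture("video.avi")              # hypothetical input file
    mog = cv2.createBackgroundSubtractorMOG2()       # Gaussian mixture per pixel

    ok, prev = cap.read()
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        diff_mask = cv2.absdiff(gray, prev) > 25     # |I_t - I_{t-1}| > const
        mog_mask = mog.apply(frame)                  # adapts to gradual changes
        prev = gray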
Temporal templates
Idea: summarize motion in video in a Motion History Image (MHI).
Descriptor: Hu moments of different orders.
[A.F. Bobick and J.W. Davis, PAMI 2001]
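A sketch of the MHI recursion and its Hu-moment descriptor (NumPy/OpenCV; the decay constant tau is an assumed parameter, not prescribed by the slide): pixels on the current motion silhouette get value tau, older motion fades linearly, and the seven Hu moments of the resulting image give a translation-, scale- and rotation-invariant descriptor.

    import cv2
    import numpy as np

    def update_mhi(mhi, silhouette, tau=30.0):
        # Bobick & Davis: moving pixels get value tau, older motion fades by 1
        return np.where(silhouette > 0, tau, np.maximum(mhi - 1.0, 0.0))

    def mhi_descriptor(mhi):
        # Hu moments of the MHI, used as the action descriptor
        return cv2.HuMoments(cv2.moments(mhi.astype(np.float32))).ravel()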
Aerobics dataset
Nearest-neighbor classifier: 66% accuracy

Temporal templates: summary
Pros:
+ Simple and fast
+ Works in controlled settings
Cons:
- Prone to errors of background subtraction (variations in light, shadows, clothing... what is the background here?)
- Does not capture interior motion and shape (the silhouette tells little about actions)
Not all shapes are valid: restrict the space of admissible silhouettes.

Active shape models
Point distribution model: represent the shape of samples by a set of corresponding points or landmarks.
Assume each shape can be represented by a linear combination of basis shapes, x = x̄ + Φb, for the mean shape x̄ and some parameter vector b.
[Cootes et al. 1995]
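A sketch of the point distribution model in plain NumPy (landmark shapes assumed given as flattened rows): PCA yields the mean shape and basis shapes, and x = mean + Phi b reconstructs a shape from the parameter vector b.

    import numpy as np

    def fit_pdm(shapes, n_modes=3):
        """shapes: (N, 2K) array, each row = K landmarks (x, y) flattened."""
        mean = shapes.mean(axis=0)
        U, s, Vt = np.linalg.svd(shapes - mean, full_matrices=False)
        phi = Vt[:n_modes].T                      # basis shapes (eigenvectors)
        eigvals = (s ** 2) / (len(shapes) - 1)    # variance along each mode
        return mean, phi, eigvals[:n_modes]

    def reconstruct(mean, phi, b):
        return mean + phi @ b                     # x = mean + Phi b

    def project(mean, phi, x):
        # projection onto the shape space (serves as regularization, see below)
        return phi.T @ (x - mean)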
Active shape models
Distribution of eigenvalues: a small fraction of basis shapes (eigenvectors) accounts for most of the shape variation (=> landmarks are redundant).
Three main modes of lips-shape variation.

Active shape models: effect of regularization
Projection onto the shape space serves as a regularization.

Active shape models [Cootes et al.]
Constrain shape deformation in PCA-projected space. Example: face alignment. Illustration of face shape space.
Active Shape Models: Their Training and Application [T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, CVIU 1995]

Person tracking
Learning flexible models from image sequences [A. Baumberg and D. Hogg, ECCV 1994]

Learning dynamic prior
Dynamic model: second-order auto-regressive process over the state x_t, with update rule x_t = A_1 x_{t-1} + A_2 x_{t-2} + B w_t (w_t: white noise).
Model parameters A_1, A_2, B; learning scheme: fit the parameters to training sequences.
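A sketch of learning and sampling such a prior (NumPy; states assumed already projected to the PCA shape space, and the least-squares fit is one standard way to estimate an AR(2) model, not necessarily the scheme of the original paper):

    import numpy as np

    def fit_ar2(X):
        """Fit x_t = A1 x_{t-1} + A2 x_{t-2} + B w_t; X: (T, d) state sequence."""
        d = X.shape[1]
        past = np.hstack([X[1:-1], X[:-2]])                   # (T-2, 2d)
        coef, *_ = np.linalg.lstsq(past, X[2:], rcond=None)   # (2d, d)
        A1, A2 = coef[:d].T, coef[d:].T
        resid = X[2:] - past @ coef
        B = np.linalg.cholesky(np.cov(resid.T) + 1e-9 * np.eye(d))
        return A1, A2, B

    def simulate(A1, A2, B, x0, x1, T, rng=np.random.default_rng(0)):
        xs = [x0, x1]                  # random simulation of learned dynamics
        for _ in range(T - 2):
            xs.append(A1 @ xs[-1] + A2 @ xs[-2]
                      + B @ rng.standard_normal(len(x0)))
        return np.array(xs)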
Learning dynamic prior
Learning point sequence; random simulation of the learned dynamical model.
[A. Blake, B. Bascle, M. Isard and J. MacCormick, Phil. Trans. R. Soc. 1998]

Learning dynamic prior
Random simulation of the learned gait dynamics.

Motion priors
Constrain the temporal evolution of shape:
• Help accurate tracking
• Recognize actions
Goal: formulate motion models for different types of actions and use such models for action recognition.
Example: drawing with 3 action modes: line drawing, scribbling, idle.
[M. Isard and A. Blake, ICCV 1998]

Lecture overview
Motivation: historic review; applications and challenges
Human pose estimation: pictorial structures; recent advances
Appearance-based methods: motion history images; active shape models & motion priors
Motion-based methods: generic and parametric optical flow; motion templates

Lecture overview
Motivation: historic review; applications and challenges
Human pose estimation: pictorial structures; recent advances
Appearance-based methods: motion history images; active shape models & motion priors
Motion-based methods: generic and parametric optical flow; motion templates

Shape and appearance vs. motion
Shape and appearance in images depend on many factors: clothing, illumination, contrast, image resolution, etc. [Efros et al. 2003]
The motion field (in theory) is invariant to shape and can be used directly to describe human actions.

Shape and appearance vs. motion
Moving Light Displays. Gunnar Johansson, Perception and Psychophysics, 1973

Motion estimation: optical flow
Classic problem of computer vision [Gibson 1955]. Goal: estimate the motion field.
How? We only have access to image pixels: estimate pixel-wise correspondence between frames = optical flow.
Brightness change assumption: corresponding pixels preserve their intensity (color).
• Useful assumption in many cases
• Breaks at occlusions and illumination changes
• Physical and visual motion may be different

Generic optical flow
Brightness Change Constraint Equation (BCCE): I_x u + I_y v + I_t = 0, linking the optical flow (u, v) to the image gradient (I_x, I_y) and the temporal derivative I_t.
One equation, two unknowns => cannot be solved directly.
Integrate several measurements in the local neighborhood and obtain a least-squares solution [Lucas & Kanade 1981]:
(u, v)^T = -M^{-1} (⟨I_x I_t⟩, ⟨I_y I_t⟩)^T, with M = [⟨I_x²⟩ ⟨I_x I_y⟩; ⟨I_x I_y⟩ ⟨I_y²⟩]
M is the second-moment matrix, the same one used to compute Harris interest points! ⟨·⟩ denotes integration over a spatial (or spatio-temporal) neighborhood of a point.
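A dense Lucas-Kanade sketch in NumPy/SciPy (float grayscale frames assumed): the Gaussian-weighted sums below are the entries of the second-moment matrix M, and the flow is the closed-form solution of the per-pixel 2x2 system.

    import numpy as np
    from scipy.ndimage import gaussian_filter, sobel

    def lucas_kanade(frame1, frame2, sigma=4.0):
        Ix = sobel(frame1, axis=1)                 # spatial gradients
        Iy = sobel(frame1, axis=0)
        It = frame2 - frame1                       # temporal derivative
        g = lambda a: gaussian_filter(a, sigma)    # <.> neighborhood integration
        Ixx, Ixy, Iyy = g(Ix * Ix), g(Ix * Iy), g(Iy * Iy)   # entries of M
        bx, by = g(Ix * It), g(Iy * It)
        det = Ixx * Iyy - Ixy ** 2 + 1e-9          # near-zero where M is singular
        u = (-Iyy * bx + Ixy * by) / det           # (u, v) = -M^{-1} b
        v = (Ixy * bx - Ixx * by) / det
        return u, v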
Generic optical flow
The least-squares solution assumes:
1. The brightness change constraint holds in the neighborhood
2. Sufficient variation of the image gradient in the neighborhood
3. Approximately constant motion in the neighborhood
Motion estimation becomes inaccurate if any of assumptions 1-3 is violated.
Solutions:
(2) Insufficient gradient variation is known as the aperture problem: increase the integration neighborhood.
(3) Non-constant motion: use more sophisticated motion models.

Recent methods for optical flow estimation
T. Brox, A. Bruhn, N. Papenberg, J. Weickert, ECCV 2004. High accuracy optical flow estimation based on a theory for warping. Code online: http://lmb.informatik.uni-freiburg.de/resources/binaries
C. Liu. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis. PhD thesis, Massachusetts Institute of Technology, May 2009. Code online: http://people.csail.mit.edu/celiu/OpticalFlow

Generic optical flow
The least-squares solution assumes:
1. The brightness change constraint holds in the neighborhood
2. Sufficient variation of the image gradient in the neighborhood
3. Approximately constant motion in the neighborhood
Motion estimation becomes inaccurate if any of assumptions 1-3 is violated.
Solutions:
(2) Insufficient gradient variation is known as the aperture problem: increase the integration neighborhood.
(3) Non-constant motion: use more sophisticated motion models.

Parameterized optical flow
Another extension of the constant motion model is to compute PCA basis flow fields from training examples:
1. Compute standard optical flow for many examples
2. Put velocity components into one vector
3. Do PCA and obtain the most informative PCA flow basis vectors
Training samples; PCA flow bases.
[M.J. Black, Y. Yacoob, A.D. Jepson and D.J. Fleet, CVPR 1997]

Parameterized optical flow
Estimated coefficients of PCA flow bases can be used as action descriptors.
Optical flow seems to be an interesting descriptor for motion/action recognition.

Spatial motion descriptor
Image frame => optical flow F_{x,y} => split into horizontal and vertical components F_x, F_y => half-wave rectify into four non-negative channels F_x+, F_x-, F_y+, F_y- => blur each channel.
A. A. Efros, A.C. Berg, G. Mori and J. Malik. Recognizing Action at a Distance. In Proc. ICCV 2003
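A sketch of this descriptor (OpenCV's Farneback flow stands in here for the flow estimator used in the paper): the flow is split into half-wave rectified channels Fx+, Fx-, Fy+, Fy-, each blurred to tolerate small misalignments.

    import cv2
    import numpy as np

    def spatial_motion_descriptor(prev_gray, gray, sigma=3.0):
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        fx, fy = flow[..., 0], flow[..., 1]
        channels = [np.maximum(fx, 0), np.maximum(-fx, 0),   # Fx+, Fx-
                    np.maximum(fy, 0), np.maximum(-fy, 0)]   # Fy+, Fy-
        return [cv2.GaussianBlur(c, (0, 0), sigma) for c in channels]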
Spatio-temporal motion descriptor
Temporal extent E. Compare sequences A and B: compute a frame-to-frame similarity matrix between their descriptors, then blur it over the temporal extent E to obtain a motion-to-motion similarity matrix.

Football actions: matching
Input sequence; matched frames

Football actions: classification
10 actions; 4500 total frames; 13-frame motion descriptor

Classifying ballet actions
16 actions; 24800 total frames; 51-frame motion descriptor. Men used to classify women and vice versa.

Classifying tennis actions
6 actions; 4600 frames; 7-frame motion descriptor. Woman player used as training, man as testing.

Classifying tennis actions
Red bars illustrate classification confidence for each action. [A. A. Efros, A. C. Berg, G. Mori, J. Malik, ICCV 2003]

Motion recognition without motion estimation
• Motion estimation from video is often noisy/unreliable
• Measure motion consistency between a template and test video
[Shechtman and Irani, PAMI 2007]

Motion recognition without motion estimation
• Motion estimation from video is often noisy/unreliable
• Measure motion consistency between a template and test video
Test video; template video; correlation result.
[Shechtman and Irani, PAMI 2007]

Motion recognition without motion estimation
• Motion estimation from video is often noisy/unreliable
• Measure motion consistency between a template and test video
Test video; template video; correlation result.
[Shechtman and Irani, PAMI 2007]

Motion-based template matching
Pros:
+ Depends less on variations in appearance
Cons:
- Can be slow
- Does not model negatives
Improvements possible using discriminatively-trained template-based action classifiers.

Where are we so far?
Body-pose estimation; shape models with(out) motion priors; optical-flow-based methods, motion templates.

What's coming next?
Local space-time features; discriminatively-trained action templates; bag-of-features action classification; weakly supervised training using video scripts; action classification in realistic videos.

Goal: interpret complex dynamic scenes
Common methods: segmentation? tracking?
Common problems: complex & changing background; changing appearance.
=> No global assumptions about the scene

Space-time
No global assumptions: consider local spatio-temporal neighborhoods (hand waving, boxing).

Actions == space-time objects?

Local approach: bag of visual words
Airplanes, Motorbikes, Faces, Wild Cats, Leaves, People, Bikes

Space-time local features

Space-time interest points: detection
What neighborhoods to consider? Distinctive neighborhoods with high image variation in space and time: look at the distribution of the gradient.
Definitions: original image sequence f(x,y,t); space-time Gaussian g with covariance Σ; Gaussian derivatives; space-time gradient ∇f; second-moment matrix μ, the Gaussian-weighted average of ∇f ∇f^T.

Space-time interest points: detection
Properties of μ: μ defines a second-order approximation of the local distribution of ∇f within a neighborhood.
• 1D space-time variation of f, e.g. moving bar
• 2D space-time variation of f, e.g. moving ball
• 3D space-time variation of f, e.g. jumping ball
Large eigenvalues of μ can be detected by local maxima of H over (x,y,t): H = det(μ) - k·trace³(μ) (similar to the Harris operator [Harris and Stephens, 1988]).
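A sketch of the response function on a raw video volume (NumPy/SciPy; the constant k and the integration-scale factor s are assumed parameters, not values prescribed by the slide):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def space_time_harris(video, sigma=2.0, tau=1.5, s=2.0, k=0.0005):
        """video: (T, H, W) float array; returns H = det(mu) - k*trace(mu)^3."""
        L = gaussian_filter(video.astype(np.float64), (tau, sigma, sigma))
        Lt, Ly, Lx = np.gradient(L)                      # space-time gradient
        g = lambda a: gaussian_filter(a, (s * tau, s * sigma, s * sigma))
        # entries of the 3x3 second-moment matrix mu (integrated products)
        xx, yy, tt = g(Lx * Lx), g(Ly * Ly), g(Lt * Lt)
        xy, xt, yt = g(Lx * Ly), g(Lx * Lt), g(Ly * Lt)
        det = xx * (yy * tt - yt * yt) - xy * (xy * tt - yt * xt) \
              + xt * (xy * yt - yy * xt)
        trace = xx + yy + tt
        return det - k * trace ** 3   # interest points = local maxima over (x,y,t)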
Space-time interest points
Velocity changes; appearance/disappearance; split/merge

Space-time interest points: examples
Motion event detection

Spatio-temporal scale selection
Stability to size changes, e.g. camera zoom

Spatio-temporal scale selection
Selection of temporal scales captures the frequency of events

Local features for human actions

Local features for human actions
boxing; walking; hand waving

Local space-time descriptor: HOG/HOF
Multi-scale space-time patches. Histogram of oriented spatial gradients (HOG); histogram of optical flow (HOF).
Descriptor layout: 3x3x2 cells with 4 bins for HOG and 5 bins for HOF.
Public code available at www.irisa.fr/vista/actions

Visual vocabulary: K-means clustering
• Group similar points in the space of image descriptors using K-means clustering
• Select significant clusters
Clustering (c1, c2, c3, c4); classification

Visual vocabulary: K-means clustering
• Group similar points in the space of image descriptors using K-means clustering
• Select significant clusters
Clustering (c1, c2, c3, c4); classification
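A sketch with scikit-learn (the vocabulary size k=4000 is a typical choice from the literature, not prescribed by the slide): cluster the pooled local descriptors, then represent each video as a normalized histogram of visual-word occurrences.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(all_descriptors, k=4000):
        # all_descriptors: (N, D) HOG/HOF descriptors pooled over training videos
        return KMeans(n_clusters=k, n_init=1, random_state=0).fit(all_descriptors)

    def bof_histogram(vocab, video_descriptors):
        words = vocab.predict(video_descriptors)        # assign visual words
        h = np.bincount(words, minlength=vocab.n_clusters).astype(float)
        return h / max(h.sum(), 1.0)                    # L1-normalized histogram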
Local space-time features: matching
Find similar events in pairs of video sequences.

Action classification: overview
Bag of space-time features + multi-channel SVM [Laptev'03, Schuldt'04, Niebles'06, Zhang'07]
Collection of space-time patches => HOG & HOF patch descriptors => histogram of visual words => multi-channel SVM classifier
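A single-channel sketch of the classifier stage (scikit-learn; the exponentiated chi-square kernel below is one common variant, with the bandwidth set to the mean chi-square distance as often done in the multi-channel literature):

    import numpy as np
    from sklearn.svm import SVC

    def chi2_kernel(A, B, eps=1e-10):
        d = ((A[:, None, :] - B[None, :, :]) ** 2 /
             (A[:, None, :] + B[None, :, :] + eps)).sum(-1)   # chi^2 distances
        return np.exp(-d / (d.mean() + eps))

    def train_bof_svm(train_hists, labels):
        clf = SVC(kernel="precomputed", C=100)
        clf.fit(chi2_kernel(train_hists, train_hists), labels)
        return clf
        # predict with: clf.predict(chi2_kernel(test_hists, train_hists))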
Action recognition in the KTH dataset
Sample frames from the KTH action sequences; all six classes (columns) and scenarios (rows) are presented.

Classification results on the KTH dataset
Confusion matrix for KTH actions.

What about 3D?
Local motion and appearance features are not invariant to view changes.

Multi-view action recognition
Difficult to apply standard multi-view methods:
• Do not want to search for multi-view point correspondence: non-rigid motion, clothing changes, ... it's hard!
• Do not want to identify body parts: current methods are not reliable enough.
• Yet, we want to learn actions from one view and recognize actions in very different views.

Temporal self-similarities
Idea:
• Cross-view matching is hard, but cross-time matching (tracking) is relatively easy.
• Measure self-(dis)similarities across time.
Example: distance matrix / self-similarity matrix (SSM) for poses P1, P2.
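A sketch of the SSM computation (NumPy; per-frame descriptors are assumed given, e.g. joint positions, HOG or flow):

    import numpy as np

    def self_similarity_matrix(frame_descriptors):
        """SSM[i, j] = Euclidean distance between frames i and j."""
        X = np.asarray(frame_descriptors, dtype=float)
        sq = (X ** 2).sum(axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        return np.sqrt(np.maximum(d2, 0.0))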
Temporal self-similarities: multiple views
Side view vs. top view: the matrices appear very similar despite the view change!
Intuition:
1. The distance between similar poses is low in any view
2. The distance among different poses is likely to be large in most views

Temporal self-similarities: MoCap
Self-similarities can be measured from Motion Capture (MoCap) data (person 1, person 2).

Temporal self-similarities: video
Self-similarities can be measured directly from video: HOG or optical flow descriptors in image frames.

Self-similarity descriptor
Goal: define a quantitative measure to compare self-similarity matrices.
Define a local SIFT-like histogram descriptor h_i for each point on the diagonal of the SSM.
Sequence alignment: dynamic programming for two sequences of descriptors {h_i}, {h_j}.
Action recognition: visual vocabulary for h; BoF representation of {h_i}; SVM.

Multi-view alignment

Multi-view action recognition: video
SSM-based recognition; alternative view-dependent method (STIP)

What are human actions?
Actions in recent datasets: is it just about kinematics?
Should actions be defined by their purpose? Kinematics + objects

What are human actions?
Actions in recent datasets: is it just about kinematics?
Should actions be defined by their purpose? Kinematics + objects + scenes

Action recognition in realistic settings
Standard action datasets vs. actions "in the wild"

Action dataset and annotation
Manual annotation of drinking actions in movies: "Coffee and Cigarettes"; "Sea of Love".
"Drinking": 159 annotated samples; "smoking": 149 annotated samples.
Temporal annotation: first frame, keyframe, last frame. Spatial annotation: head rectangle, torso rectangle.

"Drinking" action samples

Action representation
Histogram of gradient; histogram of optic flow

Action learning
Boosting: selected features combined from weak classifiers.
AdaBoost:
• Efficient discriminative classifier [Freund & Schapire'97]
• Good performance for face detection [Viola & Jones'01]
Pre-aligned samples; Haar features; optimal threshold; Fisher discriminant; histogram features.
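A sketch of the learning stage with scikit-learn (feature extraction assumed done; AdaBoost's default weak learner is a depth-1 decision stump, i.e. a single optimal-threshold test as on the slide):

    from sklearn.ensemble import AdaBoostClassifier

    # X: one row of histogram/gradient/flow features per pre-aligned training
    # cuboid; y: 1 for the action (e.g. drinking), 0 for background samples.
    def train_action_classifier(X, y, n_rounds=300):
        return AdaBoostClassifier(n_estimators=n_rounds).fit(X, y)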
Key-frame action classifier
Boosting of weak classifiers over selected 2D HOG features on the keyframe.
• Efficient discriminative classifier [Freund & Schapire'97]
• Good performance for face detection [Viola & Jones'01]
Pre-aligned samples; Haar features; optimal threshold; Fisher discriminant; histogram features.
See [Laptev BMVC'06] for more details. [Laptev, Pérez 2007]

Keyframe priming
Training: positive training sample; false positives of a static HOG action detector used as negative training samples.
Test.

Action detection
Test set:
• 25 min from "Coffee and Cigarettes" with ground-truth 38 drinking actions
• No overlap with the training set in subjects or scenes
Detection:
• Search over all space-time locations and spatio-temporal extents
Keyframe priming vs. no keyframe priming.

Action detection (ICCV 2007)
Test episodes from the movie "Coffee and Cigarettes".
Video available at http://www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html

20 most confident detections

Learning actions from movies
• Realistic variation of human actions
• Many classes and many examples per class
Problems:
• Typically only a few class samples per movie
• Manual annotation is very time consuming

Automatic video annotation with scripts
• Scripts available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, ...
• Subtitles (with time info) are available for most movies
• Can transfer time to scripts by text alignment
Example, subtitles vs. movie script:
Subtitle 1172, 01:20:17,240 --> 01:20:20,437: "Why weren't you honest with me? Why'd you keep your marriage a secret?"
Subtitle 1173, 01:20:20,640 --> 01:20:23,598: "It wasn't my secret, Richard. Victor wanted it that way."
Subtitle 1174, 01:20:23,800 --> 01:20:26,189: "Not even our closest friends knew about our marriage."
Script (INT. ...): RICK: "Why weren't you honest with me? Why did you keep your marriage a secret?" Rick sits down with Ilsa. ILSA: "Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage."
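A sketch of the time transfer (Python's difflib stands in here for the alignment used in the actual system): collect time-stamped subtitle words, then give each script sentence the time span of its longest matching word run.

    import difflib
    import re

    def align_script_to_subtitles(script_sentences, subtitles):
        """subtitles: list of (start, end, text); returns (start, end, sentence)."""
        sub_words, word_times = [], []
        for start, end, text in subtitles:
            for w in re.findall(r"\w+", text.lower()):
                sub_words.append(w)
                word_times.append((start, end))
        timed = []
        for sent in script_sentences:
            words = re.findall(r"\w+", sent.lower())
            m = difflib.SequenceMatcher(None, sub_words, words) \
                       .find_longest_match(0, len(sub_words), 0, len(words))
            if m.size > 0:       # inherit the times of the matched subtitle span
                timed.append((word_times[m.a][0],
                              word_times[m.a + m.size - 1][1], sent))
        return timed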
Script-based action annotation
On the good side:
• Realistic variation of actions: subjects ...

Script alignment: evaluation
• Annotate action samples in text
• Do automatic script-to- ...

Text-based action retrieval
• Large variation of action expressions in text: GetOutCar: "... Will gets ..."

Automatically annotated action samples
[Laptev, Marszałek, Schmid, Rozenfeld 2008]

Hollywood-2 actions dataset
Training and test samples are ob...

Action classification: overview
Bag of space-time features + multi-channel SVM [Laptev'03, Schuldt'04, Niebles'06, Zhang'07 ...

Action classification (CVPR 2008)
Test episodes from the movies "The Graduate", "It's a Wonderful Life", "Indiana Jones ..."

Evaluation of local features for action recognition
• Local features provide a popular approach to video description ...

Evaluation of local features for action recognition
• Evaluation study [Wang et al. BMVC'09]
  - Common recognition ...

Action recognition framework
Bag of space-time features + SVM [Schuldt'04, Niebles'06, Zhang'07] ...

Local feature detectors/descriptors
• Four types of detectors:
  - Harris3D [Laptev'05] ...

Illustration of space-time detectors
Harris3D; Hessian; Cuboid; Dense

Dataset: KTH-Actions
• 6 action classes by 25 persons in 4 different scenarios
• Total of 2391 video samples ...

UCF-Sports: samples
• 10 different action classes
• 150 video samples ...

Dataset: Hollywood2
• 12 different action classes from 69 Hollywood movies
• 1707 video sequences ...

KTH-Actions: results
Detectors: ...

UCF-Sports: results
Detectors: Harris3D, ...

Hollywood2: results
Detectors: Harris3D, Cuboids, Hessian, ...

Evaluation summary
• Dense sampling consistently outperforms all the tested sparse feature detectors ...

How to improve BoF classification?
Actions are about people. Why not try to combine BoF with person detection ...

How to improve BoF classification?
2nd attempt:
• Do not remove background ...

Video segmentation
• Spatio-temporal grids ...

Video segmentation

Hollywood-2 action classification
Attributed feature; performance ...

Hollywood-2 action classification

Actions in Context (CVPR 2009)
Human actions are frequently correlated with particular scene classes ...

Mining scene captions
ILSA: "I wish I didn't love you so much." (01:22:00 --> 01:22:03) She ...

Mining scene captions
INT. TRENDY RESTAURANT - NIGHT; INT. MARSELLUS WALLACE'S DINING ROOM - MORNING. WALLACE ...

Co-occurrence of actions and scenes in scripts

Co-occurrence of actions and scenes in scripts

Co-occurrence of actions and scenes in scripts

Co-occurrence of actions and scenes in text vs. video

Automatic gathering of relevant scene classes and visual samples
Source: ...

Results: actions and scenes (separately)

Classification with the help of context
Action classification score; scene classification score ...

Results: actions and scenes (jointly)
Actions in the context of scenes; scenes in the context of actions

Weakly-supervised temporal action annotation
Answer question ...

WHAT actions?
Automatic discovery of action classes in text (movie scripts).
• Text processing: ...

WHEN: video data and annotation
Want to target realistic video data; want to avoid manual video annotation for training ...

Overview
Input: automatic collection of training clips ...

Action clustering
[Lihi Zelnik-Manor and Michal Irani, CVPR 2001]
Spectral clustering ...
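A minimal stand-in with scikit-learn (the cited work defines its own affinities over space-time features; here an RBF affinity over per-clip BoF histograms is assumed):

    from sklearn.cluster import SpectralClustering

    def cluster_actions(clip_histograms, n_actions=5):
        # clip_histograms: (N, D) bag-of-features histogram per clip
        sc = SpectralClustering(n_clusters=n_actions, affinity="rbf")
        return sc.fit_predict(clip_histograms)   # cluster label per clip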
Action clustering
Complex data: standard clustering methods do not work on this data.

Action clustering
Our view of the problem: feature space vs. video space ...

Action clustering: formulation
[Xu et al. NIPS'04] ...

Clustering results
Drinking actions in "Coffee and Cigarettes"

Detection results
Drinking actions in "Coffee and Cigarettes". Training a bag-of-features classifier ...

Detection results
Drinking actions in "Coffee and Cigarettes". Training a bag-of-features classifier ...

Detection results
"Sit Down" and "Open Door" actions in ~5 hours of movies

Temporal detection of "Sit Down" and "Open Door" actions in movies: The Graduate, The Crying Game, Living in Oblivion

Conclusions
Bag-of-words models are currently dominant, th...
  1. 1. Microsoft Computer Vision School 28 July - 3 August 2011, Moscow, Russia 2011 Moscow Human Action Recognition Ivan Laptev ivan.laptev@inria.fr i l t @i i f INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548 Laboratoire d’Informatique, Ecole Normale Supérieure, Paris d Informatique,Includes slides from: Alyosha Efros, Mark Everingham and Andrew Zisserman
  2. 2. Lecture overview Motivation Historic review Applications and challenges A li i d h ll Human Pose Estimation Pictorial structures Recent advances Appearance-based methods Motion history images Active h A ti shape models & M ti priors d l Motion i Motion-based methods Generic and parametric Optical Flow Motion templates
  3. 3. Computer vision grand challenge: Video understandingObjects: Actions: outdoors outdoors countryside indoors y outdoors cars, glasses cars glasses, carexit running drinking, person drinking running, person through people, etc… house enter car door exit, building a etc… person enter, door kidnapping drinking d i ki car car car constraints crash personScene categories: roadcarpeople glass indoors, outdoors, Geometry: field street scene, car Street, wall, field, street candle car car stair, etc… etc… street
  4. 4. Motivation I: Artistic RepresentationEarly studies were motivated by human representations in ArtsDa Vinci: “it is indispensable for a painter, to become totally familiar with the anatomy of nerves, bones, muscles, and sinews, such that he understands y , , , , for their various motions and stresses, which sinews or which muscle causes a particular motion” “I ask for the weight [pressure] of this man for every segment of motion I when climbing those stairs, and for the weight he places on b and on c. Note the vertical line below the center of mass of this man.” Leonardo da Vinci (1452–1519): A man going upstairs, or up a ladder.
  5. 5. Motivation II: Biomechanics  The emergence of biomechanics  Borelli applied to biology the analytical and geometrical methods, developed by Galileo Galilei  He was the first to understand that bones serve as levers and muscles function according to mathematical p principles p  His physiological studies included muscle analysis and a mathematical discussion of movements, such as running or jumpingGiovanni Alfonso Borelli (1608–1679)
  6. 6. Motivation III: Motion perception Etienne-Jules Etienne Jules Marey: (1830–1904) made Chronophotographic experiments influential for the emerging field of cinematography Eadweard Muybridge (1830–1904) invented a machine for displaying the recorded series of images. He pioneered motion pictures and applied his technique to movement studies
  7. 7. Motivation III: Motion perception Gunnar Johansson [1973] pioneered studies on the use of image [ ]p g sequences for a programmed human motion analysis “Moving Light Displays (LED) enable identification of familiar people Moving Displays” and the gender and inspired many works in computer vision. Gunnar Johansson, Perception and Psychophysics, 1973
  8. 8. Human actions: Historic overview 15th century  studies of anatomy t  17th century emergence of biomechanics 19th century  emergence of cinematography  1973 studies of human motion perception Modern computer vision M d t i i
  9. 9. Modern applications: Motion capture and animation Avatar (2009)
  10. 10. Modern applications: Motion capture and animationLeonardo da Vinci (1452–1519) Avatar (2009)
  11. 11. Modern applications: Video editing Space-Time Video Completion Y. Wexler, E. Shechtman and M. Irani, CVPR 2004
  12. 12. Modern applications: Video editing Space-Time Video Completion Y. Wexler, E. Shechtman and M. Irani, CVPR 2004
  13. 13. Modern applications: Video editing Recognizing Action at a DistanceAlexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik, ICCV 2003
  14. 14. Modern applications: Video editing Recognizing Action at a DistanceAlexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik, ICCV 2003
  15. 15. Why Action Recognition? Video indexing and search is useful in TV production, entertainment, education, social studies, security,… Home videos: e.g. TV & Web: “My e.g. daughter “Fight in a climbing” parlament” Sociology research: Manually Surveillance: analyzed smoking 260K views actions in in 7 days on 900 movies i YouTube
  16. 16. How action recognition is related to computer vision? Sky Sk Street sign Car Car Car Car Car Car Road
  17. 17. We can recognize cars and roads, g , What’s next? 12,184,113 images, 17624 synsets
  18. 18. Airplane A plain has crashed, the cabin is broken, somebody is likely to be injured or dead dead.
  19. 19. cat woman trash bi t h bin
  20. 20.  Vision is person-centric: We mostly care about things which are important to us, people Actions of people reveal the function of objects p p j Future challenges: - Function: What can I do with this and how? - Prediction: What can happen if someone does that? - Recognizing goals: What this person is trying to do?
  21. 21. How many person-pixels are there? person- Movies TV YouTube Y T b
  22. 22. How many person-pixels are there? person- Movies TV YouTube
  23. 23. How many person-pixels are there? person- 35% 34% Movies TV 40% YouTube
  24. 24. How much data do we have? Huge amount of video is available and growing TV-channels recorded since 60’s 60 s >34K hours of video upload every day ~30M surveillance cameras in US => ~700K video hours/day 700K If we want to interpret this data, we should better understand what person-pixels are telling us!
  25. 25. What this course is about?
  26. 26. GoalGet familiar with: • Problem formulations P bl f l ti • Mainstream approaches • Particular existing techniques • Current benchmarks • Available b A il bl baseline methods li h d • Promising future directions • Hands- Hands-on experience (practical session)
  27. 27. What is Action Recognition?Terminology • What is an “action”?Output representation • What do we want to say about an image/video?Neither question has satisfactory answer yet
  28. 28. TerminologyThe terms “action recognition”, “activity recognition”, “event recognition”, are used i “ t iti ” d inconsistently i t tl • Finding a common language for describing videos is an open problem
  29. 29. Terminology example“Action” is a low-level primitive with semantic meaning • E.g. walking, pointing, placing an object“Activity” is a higher-level combination with some temporal relations • E.g. taking money out from ATM, waiting for a bus“Event” is a combination of activities, often involving multiple individuals • E.g. a soccer game, a traffic accidentThis is contentious • No standard, rigorous definition exists
  30. 30. Output R O t t Representation t ti• Given this image what is the desired output? This image contains a man walking • Action classification / recognition g The man walking is here • Action detection
  31. 31. Output R O t t Representation t ti• Given this image what is the desired output? This image contains 5 men walking, 4 jogging, 2 running The 5 men walking are here This is a soccer game
  32. 32. Output R O t t Representation t ti• Given this image what is the desired output? • Frames 1-20 the man ran to the left, then frames 21-25 he ran away from the camera 21 25 • Is this an accurate description? • Are labels and video frames in 1 1 1-1 correspondence?
  33. 33. DATASETS
  34. 34. Dataset: KTH-Actions KTH-• 6 action classes by 25 persons in 4 different scenarios• Total of 2391 video samples • Specified train validation test sets train, validation,• Performance measure: average accuracy over all classes [Schuldt, Laptev, Caputo ICPR 2004]
  35. 35. UCF-UCF-Sports• 10 different action classes• 150 video samples in total• Evaluation method: leave-one-out leave one out• Performance measure: average accuracy over all classes Diving Kicking Walking Skateboarding High-Bar-Swinging Golf-Swinging Rodriguez, Ahmed, and Shah CVPR 2008
  36. 36. UCF - YouTube Action Dataset11 categories 1168 videos categories,• Evaluation method: leave-one-out• Performance measure: average accuracy over all classes [sLiu, Luo and Shah CVPR 2009]
  37. 37. Semantic Description of Human Activities(ICPR 2010) 3 challenges: interaction aerial view wide area interaction, view, wide-area Interaction • 6 classes 120 instances over ~20 min video classes, min. • Classification and detection tasks (+/- bounding boxes) • Evaluation method: leave one out leave-one-out [Ryoo et al. ICPR 2010 challenge]
  38. 38. Hollywood2 y• 12 action classes from 69 Hollywood movies• 1707 video sequences in total• Separate movies for training / testing• Performance measure: mean average precision (mAP) over all classes GetOutCar AnswerPhone Kiss HandShake StandUp DriveCar [Marszałek, Laptev, Schmid CVPR 2009]
  39. 39. TRECVid Surveillance Event Detection10 actions: person runs, take picture, cell to ear, …5 cameras, ~100h video from LGW airportDetection (in time, not space); multiple detections count as false positivesEvaluation method: specified training / test videos, evaluation at NISTPerformance measure: statistics on DET curvesP f t ti ti [Smeaton, Over, Kraaij, TRECVid]
  40. 40. Dataset DesiderataClutterNot choreographed by dataset collectors • Real-world variationScaleS l • Large amount of videoRarity of actions • Detection harder than c ass ca o e ec o a de a classification • Chance performance should be very lowClear definition of training/test split • Validation set for parameter tuning? • Reproducing / comparing to other methods?
  41. 41. Lecture overview Motivation Historic review Applications and challenges A li i d h ll Human Pose Estimation Pictorial structures Recent advances Appearance-based methods Motion history images Active h A ti shape models & M ti priors d l Motion i Motion-based methods Generic and parametric Optical Flow Motion templates
  42. 42. Lecture overview Motivation Historic review Applications and challenges A li i d h ll Human Pose Estimation Pictorial structures Recent advances Appearance-based methods Motion history images Active h A ti shape models & M ti priors d l Motion i Motion-based methods Generic and parametric Optical Flow Motion templates
  43. 43. Objective and motivationDetermine human body pose (layout)Why? To recognize poses, gestures, actions Slide credit: Andrew Zisserman
  44. 44. Activities characterized by a pose y p Slide credit: Andrew Zisserman
  45. 45. Activities characterized by a pose y p Slide credit: Andrew Zisserman
  46. 46. Activities characterized by a pose y p
  47. 47. Challenges: articulations and deformations g
  48. 48. Challenges: of ( g (almost) unconstrained images ) gvarying illumination and low contrast; moving camera and background; i ill i ti dl t t i db k dmultiple people; scale changes; extensive clutter; any clothing
  49. 49. Pictorial Structures• Intuitive model of an object• Model has two components 1. parts ( pa s (2D image fragments) age ag e s) 2. structure (configuration of parts)• Dates back to Fischler & Elschlager 1973
  50. 50. Long tradition of using p g g pictorial structures for humans Finding People by Sampling Ioffe & Forsyth, ICCV 1999 y , Pictorial Structure Models for Object Recognition Felzenszwalb & Huttenlocher, 2000 Learning to Parse Pictures of People Ronfard, Schmid & Triggs, ECCV 2002
  51. 51. Felzenszwalb & Huttenlocher NB: requires background subtraction
  52. 52. Variety of Poses y
  53. 53. Variety of Poses y
  54. 54. Objective: detect human and determine upper body pose (layout) si f a2 a1
  55. 55. Pictorial structure model – CRFsi f a2 a1
  56. 56. Complexity p ysi f a2 a1
  57. 57. Are trees the answer? He T left right UA UA LA LA Ha Ha• With n parts and h possible discrete locations per part, O(hn)• For a tree, using dynamic programming this reduces to O(nh2)• If model is a tree and has certain edge costs, then complexity g p y reduces to O(nh) using a distance transform [Felzenszwalb & Huttenlocher, 2000, 2005]
  58. 58. Are trees the answer? He T left right UA UA LA LA Ha Ha• With n parts and h possible discrete locations per part, O(hn)• For a tree, using dynamic programming this reduces to O(nh2)• If model is a tree and has certain edge costs, then complexity g p y reduces to O(nh) using a distance transform [Felzenszwalb & Huttenlocher, 2000, 2005]
  59. 59. Kinematic structure vs graphical (independence) structure Graph G = (V,E) He T He Tleft right left right UA UA UA UA LA LA LA LA Requires more Ha Ha Ha Ha connections than a tree
  60. 60. More recent work on human pose estimation p D. Ramanan. Learning to parse images of articulated bodies. NIPS, bodies NIPS 2007 Learn image and person-specific unary terms • g initial iteration  edges • following iterations  edges & colour V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. Proc. CVPR, estimation In Proc CVPR 2008/2009 (Almost) unconstrained images • Person detector & foreground highlighting g g g g VP. Buehler, M. Everingham and A. Zisserman. , g Learning sign language by watching TV. In Proc. CVPR 2009 Learns with weak t t l annotation L ith k textual t ti • Multiple instance learning
  61. 61. Pose estimation is a very active research area y Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In Proc. CVPR 2011 Extension of LSVM model of Felzenszwalb et al. Y. Wang, D. Tran and Z. Liao. Learning Hierarchical Poselets for Human Parsing. In Proc. CVPR 2011. Builds on Poslets idea of Bourdev et al. S. Johnson and M. Everingham. Learning Effective Human Pose Estimation from Inaccurate Annotation. In Proc. CVPR 2011. Learns from lots of noisy annotations B. Sapp, D.Weiss and B. Taskar. Parsing Human Motion with Stretchable Models. In Proc. CVPR 2011. Explores temporal continuity
  62. 62. Pose estimation is a very active research area yJ. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A.Kipman and A. Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. Best paper award at CVPR 2011 Exploits lots of synthesized depth images for training
  63. 63. Pose Search QV. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Proc. CVPR2009
  64. 64. Pose Search Q V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Zisserman Progressive search spacereduction for human pose estimation. In Proc. CVPR2009
  65. 65. ApplicationLearning sign language by watching TV g g g g y g (using weakly aligned subtitles) Patrick Buehler Mark Everingham Andrew Zi A d Zisserman CVPR 2009
  66. 66. ObjectiveLearn signs in British Sign Language (BSL) corresponding to text words: • Training data from TV broadcasts with simultaneous signing • Supervision solely from sub-titles Output: automatically learned signs (4x slow motion) Input: video + subtitle Office GovernmentUse subtitles to find video sequences containing word. These are the positivetraining sequences. Use other sequences as negative training sequences.
  67. 67. Given an English worde.g. “tree” what is thecorresponding BritishSign Language sign? positive sequences negative set
  68. 68. Use sliding window to choose sub-sequence of p q poses in one ppositivesequence and determine if 1st sliding window same sub-sequence of posesoccurs in other positive sequencessomewhere, but does not occur in the negative set positive sequences negative set
  69. 69. Use sliding window to choose sub-sequence of p q poses in one ppositivesequence and determine if 5th sliding window same sub-sequence of posesoccurs in other positive sequencessomewhere, but does not occur in the negative set positive sequences negative set
  70. 70. Multiple instance learning p g Negative Positive bag bags sign of interest i t t
  71. 71. Example pLearn signs in British Sign Language ( g g g g (BSL) corresponding to ) p gtext words.
  72. 72. EvaluationGood results for a variety of signs: Signs where g Signs where g Signs where Signs which Signs whichhand movement hand shape both hands are finger-- are perfomed in is important is important are together spelled front of the face Navy Lung Fungi Kew Whale Prince Garden Golf Bob Rose
  73. 73. What is missed? truncation is not modelled
  74. 74. What is missed? occlusion is not modelled
  75. 75. Modelling p g person-object-pose interactions j p W. Yang Y. W Yang, Y Wang and Greg Mori. Recognizing Mori Human Actions from Still Images with Latent Poses. In Proc. CVPR 2010. Some limbs may not be important for recognizing a particular action (e.g. sitting) B. Y B Yao and L F i F i M d li M t l d L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human- Object Interaction Activities. In Proc. CVPR 2010. Pose estimation helps object detection and vice versa
  76. 76. Towards functional object understanding j g A. Gupta, S. Satkin, A.A. Efros and M. Hebert, From 3D Scene Geometry to HumanWorkspace. In Proc. CVPR 2011 p Predicts the “workspace” of a humanH. Grabner, J. Gall and L. Van Gool. What Makes a Chair a Chair? In Proc. CVPR 2011
  77. 77. Conclusions: Human poses p Exciting progress in pose estimation in realistic still images and video. g Industry strength Industry-strength pose estimation from depth sensors Pose estimation from RGB is still very challenging Human Poses ≠ Human Actions!
  78. 78. Lecture overview Motivation Historic review Applications and challenges A li i d h ll Human Pose Estimation Pictorial structures Recent advances Appearance-based methods Motion history images Active h A ti shape models & M ti priors d l Motion i Motion-based methods Generic and parametric Optical Flow Motion templates
  79. 79. Lecture overview Motivation Historic review Applications and challenges A li i d h ll Human Pose Estimation Pictorial structures Recent advances Appearance-based methods Motion history images Active h A ti shape models & M ti priors d l Motion i Motion-based methods Generic and parametric Optical Flow Motion templates
  80. 80. How to recognize actions?
  81. 81. Action understanding: Key componentsImage measurements Prior knowledge Foreground Deformable contour segmentation models Image gradients Association 2D/3D body models y Optical flow Local space-time features Motion priors Background models Learning associations Automatic Action labels from strong / weak inference  supervision i i 
  82. 82. Foreground segmentationImage differencing: a simple way to measure motion/change - > ConstBetter B kB Background / F d Foreground separation methods exist: d i h d i  Modeling of color variation at each p g pixel with Gaussian Mixture  Dominant motion compensation for sequences with moving camera  Motion layer separation for scenes with non-static backgrounds
  83. 83. Temporal TemplatesIdea: summarize motion in video in a Motion History Image (MHI):Descriptor: Hu moments of different orders p [A.F. Bobick and J.W. Davis, PAMI 2001]
  84. 84. Aerobics datasetNearest Neighbor classifier: 66% accuracy
  85. 85. Temporal Templates: SummaryPros: + Si l and f t Simple d fast Not all shapes are valid + Works in controlled settings Restrict the space of admissible silhouettesCons: - Prone to errors of background subtraction g Variations in light, shadows, clothing… What is the background here? - Does not capture interior motion and shape Silhouette tells little about actions
  86. 86. Active Shape ModelsPoint Distribution Model Represent the shape of samples by a set of corresponding points or landmarks Assume each shape can be represented by the linear combination of basis shapes such that for the mean shape and some parameter vector [Cootes et al. 1995]
  87. 87. Active Shape Models Distribution of eigenvalues of g A small fraction of basis shapes (eigenvecors) accounts for the most of shape variation (=> landmarks are redundant) Three main modes of lips shape variation: lips-shape
  88. 88. Active Shape Models: effect of regularization P j ti onto the shape-space serves as a regularization Projection t th h l i ti
  89. 89. Active Shape Models [Cootes et al.] [Cootes al.] Constrains shape deformation in PCA-projected space Example: face alignment Illustration of face shape space Active Shape Models: Their Training and Application[T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, CVIU 1995]
  90. 90. Person TrackingLearning flexible models from image sequences [A. Baumberg and D. Hogg, ECCV 1994]
  91. 91. Learning dynamic prior Dynamic model: 2nd order Auto-Regressive Process y g State Update rule: Model parameters: Learning scheme:
  92. 92. Learning dynamic prior Random simulation of the Learning point sequence learned dynamical model[A. Blake, B. Bascle, M. Isard and J. MacCormick, Phil.Trans.R.Soc. 1998]
  93. 93. Learning dynamic priorRandom simulation of th lR d i l ti f the learned gate d d t dynamics i
  94. 94. Motion priors C Constrain t t i temporal evolution of shape l l ti f h  Help accurate tracking  Recognize actions G l formulate motion models f different types of actions Goal: f l t ti d l for diff tt f ti and use such models for action recognition Example: Drawing with 3 action g modes line drawing scribbling idle [M. Isard and A. Blake, ICCV 1998]
  95. 95. Lecture overview Motivation Historic review Applications and challenges A li i d h ll Human Pose Estimation Pictorial structures Recent advances Appearance-based methods Motion history images Active h A ti shape models & M ti priors d l Motion i Motion-based methods Generic and parametric Optical Flow Motion templates
  96. 96. Lecture overview Motivation Historic review Applications and challenges A li i d h ll Human Pose Estimation Pictorial structures Recent advances Appearance-based methods Motion history images Active h A ti shape models & M ti priors d l Motion i Motion-based methods Generic and parametric Optical Flow Motion templates
  97. 97. Shape and Appearance vs. Motion vs Shape and appearance in images depends on many factors: clothing, illumination contrast, image resolution, etc… [Efros et al 2003] al. Motion field (in theory) is invariant to shape and can be used directly to describe human actions
  98. 98. Shape and Appearance vs. Motion vs Moving Light DisplaysGunnar Johansson, Perception and Psychophysics, 1973
  99. 99. Motion estimation: Optical Flow Classic problem of computer vision [Gibson 1955] Goal: estimate motion field How? We only have access to image pixels Estimate pixel-wise correspondence between frames = Optical Flow Brightness Change assumption: corresponding pixels preserve their intensity (color)  Useful assumption in many cases  Breaks at occlusions and illumination changes  Physical and visual motion may be different
  100. 100. Generic Optical Flow  Brightness Change Constraint Equation (BCCE) Optical flow Image gradient One O equation, two unknowns => cannot be solved directly ti t k > tb l d di tl Integrate several measurements in the local neighborhood and obtain a L d bt i Least Squares Solution [L tS S l ti [Lucas & K Kanade 1981] dSecond-momentmatrix, the sameone used toco pute a scompute Harris Denotes integration over a spatial (or spatio-temporal)interest points! neighborhood of a point
  101. 101. Generic Optical Flow  Brightness Change Constraint Equation (BCCE) Optical flow Image gradient One O equation, two unknowns => cannot be solved directly ti t k > tb l d di tl Integrate several measurements in the local neighborhood and obtain a L d bt i Least Squares Solution [L tS S l ti [Lucas & K Kanade 1981] dSecond-momentmatrix, the sameone used toco pute a scompute Harris Denotes integration over a spatial (or spatio-temporal)interest points! neighborhood of a point
  102. 102. Generic Optical Flow The solution of assumes 1. Brightness change constraint holds in 2. Sufficient variation of image gradient in 3. Approximately constant motion in pp y Motion estimation becomes inaccurate if any of assumptions 1-3 is violated violated. Solutions: (2) Insufficient gradient variation known as aperture problem Increase integration neighborhood I i t ti i hb h d (3) Non-constant motion in Use more sophisticated motion models
  103. 103. Recent methods for OF estimation T. Brox, A. Bruhn, N. Papenberg, J. Weickert, ECCV 2004 High accuracy optical flow estimation based on a theory for warping Code on-line: http://lmb.informatik.uni-freiburg.de/resources/binaries C. Liu. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis. PhD Thesis. Massachusetts Institute of Technology. M 2009 fT h l May 2009. Code on-line: http://people.csail.mit.edu/celiu/OpticalFlow
  104. 104. Generic Optical Flow The solution of assumes 1. Brightness change constraint holds in 2. Sufficient variation of image gradient in 3. Approximately constant motion in pp y Motion estimation becomes inaccurate if any of assumptions 1-3 is violated violated. Solutions: (2) Insufficient gradient variation known as aperture problem Increase integration neighborhood I i t ti i hb h d (3) Non-constant motion in Use more sophisticated motion models
  105. 105. Parameterized Optical Flow  A th extension of the constant motion model is to compute Another t i f th t t ti d li t t PCA basis flow fields from training examples 1. Compute standard Optical Flow for many examples 2. Put velocity components into one vector 3. Do PCA on and obtain most informative PCA flow basis vectorsTraining samples PCA flow bases [M.J. Black, Y. Yacoob, A.D. Jepson and D.J. Fleet, CVPR 1997]
  106. 106. Parameterized Optical Flow  Estimated coefficients of PCA flow bases can be used as action descriptorsFrame numbers Optical flow seems to be an interesting descriptor for motion/action recognition
  107. 107. Spatial Motion Descriptor Image frame g Optical flow p Fx , y Fx , Fy Fx , Fx , Fy , Fy blurred Fx , Fx , Fy , FyA. A. Efros, A.C. Berg, G. Mori and J. Malik. Recognizing Action at a Distance. In Proc. ICCV 2003
  108. 108. Spatio- Spatio-Temporal Motion Descriptor Temporal extent E … … Sequence A  … … Sequence B t A E A E I matrix EB E B frame-to-frame motion-to-motion similarity matrix blurry I similarity matrix
  109. 109. Football Actions: matching InputSequence qMatchedFrames input matched
  110. 110. Football Actions: classification10 actions; 4500 total frames; 13-frame motion descriptor
  111. 111. Classifying Ballet Actions y g16 Actions; 24800 total frames; 51-frame motion descriptor. Men used to classify women and vice versa.
  112. 112. Classifying Tennis Actions6 actions; 4600 frames; 7-frame motion descriptor Woman player used as training, man as testing.
  113. 113. Classifying Tennis ActionsRed bars illustrate classification confidence for each action [A. A. Efros, A. C. Berg, G. Mori, J. Malik, ICCV 2003]
  114. 114. Motion recognition g without motion estimations• Motion estimation from video is a often noisy/unreliable• Measure motion consistency between a template and test video [Schechtman and Irani, PAMI 2007]
  115. 115. Motion recognition g without motion estimations• Motion estimation from video is a often noisy/unreliable• Measure motion consistency between a template and test video Test video Template video Correlation result [Schechtman and Irani, PAMI 2007]
  116. 116. Motion recognition g without motion estimations• Motion estimation from video is a often noisy/unreliable• Measure motion consistency between a template and test video Test video Template video Correlation result [Schechtman and Irani, PAMI 2007]
  117. 117. Motion-Motion-based template matchingPros: + Depends less on variations in appearanceCons: - Can be slow - Does not model negatives Improvements possible using discriminatively-trained template-based action classifiers
  118. 118. Where are we so far?Body-pose estimation Shape models with(out) motion priors Optical Fl O ti l Flow based methods, motion templates b d th d ti t l t
  119. 119. What s coming next? What’sLocal space-time features Discriminatively-trained action templates Bag-of-Features action f classification Weakly supervised training using video scripts g pAction lA ti classification in realistic videos ifi ti i li ti id
  120. 120. Goal:Interpret complexdynamic scenesd i Common methods: Common problems: • Segmentation ? • Complex & changing BG • Changing appearance • Tracking ?  No global assumptions about the scene
  121. 121. Space- Space-timeNo global assumptions  Consider local spatio temporal neighborhoods spatio-temporal hand waving boxing
  122. 122. Actions == Space-time objects? Space-
  123. 123. Local approach: Bag of Visual WordsAirplanesMotorbikes FacesWild Cats Leaves People Bikes
  124. 124. Space-Space-time local features
  125. 125. Space-Time Interest Points: DetectionSpace- p What neighborhoods to consider? High image Look at the Distinctive  variation in space  distribution of the neighborhoods g gradient and timeDefinitions: Original image sequence Oi i li Space-time Gaussian with covariance Gaussian derivative of Space-time gradient Second-moment matrix
  126. 126. Space-Time Interest Points: DetectionSpace- pProperties of p : defines second order approximation for the local distribution of within neighborhood  1D space-time variation of , e.g. moving bar  2D space-time variation of , e.g. moving ball g g  3D space-time variation of , e.g. jumping ballLarge eigenvalues of  can be detected by thelocal maxima of H over (x,y,t): (similar to Harris operator [Harris and Stephens, 1988])
  127. 127. Space- Space-Time interest pointsVelocity appearance// split/mergechanges disappearance
  128. 128. Space-Time Interest Points: ExamplesSpace- p p Motion event detection
  129. 129. Spatio-temporal scale selectionSpatio- p p Stability to size changes, e.g. camera zoom
  130. 130. Spatio-Spatio-temporal scale selection Selection of temporal scales t l l captures the frequency of events
  131. 131. Local features for human actions
  132. 132. Local features for human actions boxing walking hand waving
  133. 133. Local space-time descriptor: HOG/HOF space- p p Multi scale space time Multi-scale space-time patches Histogram of Histogram oriented spatial of optical  grad. (HOG) d flow (HOF) Public code available at www.irisa.fr/vista/actions 3x3x2x4bins HOG 3x3x2x5bins HOF descriptor descriptor
  134. 134. Visual Vocabulary: K-means clustering y K- g  Group similar p p points in the space of image descriptors using p g p g K-means clustering  Select significant clusters Clustering c1 c2 c3 c4 Classification
  135. 135. Visual Vocabulary: K-means clustering y K- g  Group similar p p points in the space of image descriptors using p g p g K-means clustering  Select significant clusters Clustering c1 c2 c3 c4 Classification
  136. 136. Local Space-time features: Matching Space- p g Find similar events in pairs of video sequences
  137. 137. Action Classification: OverviewBag of space-time features + multi-channel SVM [Laptev’03, Schuldt’04, Niebles’06, Zhang’07] Collection of space-time patches C f Histogram of visual words HOG & HOF Multi-channel patch SVM descriptors Classifier
  138. 138. Action recognition in KTH dataset gSample frames from the KTH actions sequences, all six classes(columns) and scenarios (rows) are presented
  139. 139. Classification results on KTH dataset Confusion matrix for KTH actions
  140. 140. What about 3D?Local motion and appearance features are not invariant to view changes
  141. 141. Multi- Multi-view action recognitionDifficult to apply standard multi-view methods:  Do not want to search for multi- view point correspondence --- Non-rigid motion, clothing changes, … --> It’s Hard!  Do not want to identify body parts. Current methods are not reliable enough.  Yet, want to learn actions from one view and recognize actions in very different views
  142. 142. Temporal self-similarities self-Idea:  Cross-view matching is hard but cross-time matching (tracking) is relatively easy.  Measure self-(dis)similarities across time: self (dis)similaritiesExample: Distance matrix / self-similarity matrix (SSM): P1 P2
  143. 143. Temporal self-similarities: Multi-views self- Multi- Side view Top view Appear very similar despite the view change! h !Intuition: 1. Distance between similar poses is low in any view 2. Distance among different poses is likely to be large in most views
  144. 144. Temporal self-similarities: MoCap self- person 1Self-similaritiescan be measuredfrom MotionCapture (MoCap) person 2data person 1 p person 2
  145. 145. Temporal self-similarities: Video self-Self-similaritiescan bemeasureddirectly fromvideo:HOG orOptical Flowdescriptors id i t inimage frames
146. Self-similarity descriptor
Goal: define a quantitative measure to compare self-similarity matrices.
Define a local SIFT-like histogram descriptor h_i for each point on the SSM diagonal.
• Sequence alignment: dynamic programming over two sequences of descriptors {h_i}, {h_j}
• Action recognition: visual vocabulary for h, BoF representation of {h_i}, SVM
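The sequence-alignment branch is classic dynamic programming; here is a generic dynamic-time-warping sketch over two descriptor sequences, with a plain Euclidean local cost as an assumption (the paper's exact cost may differ).

```python
import numpy as np

def dtw_align(A, B):
    """A: (n, d), B: (m, d). Returns the accumulated alignment cost."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])   # local match cost
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return D[n, m]
```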
147. Multi-view alignment
148. Multi-view action recognition: Video. SSM-based recognition vs. an alternative view-dependent method (STIP).
149. What are Human Actions?
Actions in recent datasets: is it just about kinematics? Should actions be defined by their purpose? Kinematics + Objects
150. What are Human Actions?
Actions in recent datasets: is it just about kinematics? Should actions be defined by their purpose? Kinematics + Objects + Scenes
151. Action recognition in realistic settings. Standard action datasets vs. actions "in the wild".
152. Action Dataset and Annotation
Manual annotation of drinking actions in movies: "Coffee and Cigarettes", "Sea of Love".
"Drinking": 159 annotated samples; "Smoking": 149 annotated samples.
Temporal annotation: first frame, keyframe, last frame. Spatial annotation: head rectangle, torso rectangle.
153. “Drinking” action samples
154. Action representation: histogram of gradient, histogram of optic flow.
155. Action learning
AdaBoost: selected features, boosting, weak classifiers.
• Efficient discriminative classifier [Freund & Schapire'97]
• Good performance for face detection [Viola & Jones'01]
Pre-aligned samples; Haar features; weak learners: optimal threshold, Fisher discriminant, histogram features.
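For readers who want to see the boosting loop itself, here is a minimal AdaBoost over threshold stumps in the spirit of [Freund & Schapire'97]; the brute-force stump search and all names are illustrative, not the detector's actual feature pool. Labels are in {-1, +1}.

```python
import numpy as np

def adaboost_train(X, y, n_rounds=50):
    """X: (n, d) feature matrix, y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                      # sample weights
    stumps = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):                       # brute-force stump search
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)           # re-weight samples
        w /= w.sum()
        stumps.append((alpha, j, thr, sign))
    return stumps

def adaboost_predict(stumps, X):
    score = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in stumps)
    return np.sign(score)
```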
156. Key-frame action classifier
2D HOG features; AdaBoost: selected features, boosting, weak classifiers.
• Efficient discriminative classifier [Freund & Schapire'97]
• Good performance for face detection [Viola & Jones'01]
Pre-aligned samples; Haar features; optimal threshold; Fisher discriminant; histogram features. See [Laptev BMVC'06] for more details. [Laptev, Pérez 2007]
157. Keyframe priming
Training: positive training samples, plus negative training samples drawn from false positives of the static HOG action detector. Test.
158. Action detection
Test set:
• 25 min from "Coffee and Cigarettes" with ground truth: 38 drinking actions
• No overlap with the training set in subjects or scenes
Detection:
• search over all space-time locations and spatio-temporal extents
Keyframe priming vs. no keyframe priming.
159. Action Detection (ICCV 2007). Test episodes from the movie "Coffee and Cigarettes". Video available at http://www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html
160. 20 most confident detections
161. Learning Actions from Movies
• Realistic variation of human actions
• Many classes and many examples per class
Problems:
• Typically only a few class samples per movie
• Manual annotation is very time consuming
162. Automatic video annotation with scripts
• Scripts available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com …
• Subtitles (with time info) are available for most movies
• Can transfer time to scripts by text alignment
Example (subtitles vs. movie script):
subtitles:
1172  01:20:17,240 --> 01:20:20,437  Why weren't you honest with me? Why'd you keep your marriage a secret?
1173  01:20:20,640 --> 01:20:23,598  It wasn't my secret, Richard. Victor wanted it that way.
1174  01:20:23,800 --> 01:20:26,189  Not even our closest friends knew about our marriage.
movie script:
RICK: Why weren't you honest with me? Why did you keep your marriage a secret?
Rick sits down with Ilsa.
ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.
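The time-transfer step can be prototyped with a standard sequence matcher: align the subtitle word stream against the script word stream and copy timestamps across matched blocks. The data layout below is an assumption for illustration.

```python
from difflib import SequenceMatcher

def align_script(subtitle_words, script_words):
    """subtitle_words: list of (start_time, word); script_words: list of words.
    Returns {script_word_index: time} for every matched script word."""
    sub_tokens = [w for _, w in subtitle_words]
    matcher = SequenceMatcher(a=sub_tokens, b=script_words, autojunk=False)
    times = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            times[block.b + k] = subtitle_words[block.a + k][0]
    return times  # interpolate between matched words for unmatched ones
```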
163. Script-based action annotation
– On the good side:
• Realistic variation of actions: subjects, views, etc.
• Many examples per class, many classes
• No extra overhead for new classes
• Actions, objects, scenes and their combinations
• Character names may be used to resolve "who is doing what?"
– Problems:
• No spatial localization
• Temporal localization may be poor
• Missing actions: e.g. scripts do not always follow the movie
• Annotation is incomplete, not suitable as ground truth for testing action detection
• Large within-class variability of action classes in text
164. Script alignment: Evaluation
• Annotate action samples in text
• Do automatic script-to-video alignment
• Check the correspondence of actions in scripts and movies
Example of a "visual false positive": "A black car pulls up, two army officers get out."
(a: quality of subtitle-script matching)
165. Text-based action retrieval
• Large variation of action expressions in text. GetOutCar action: "… Will gets out of the Chevrolet. …", "… Erin exits her new truck …". Potential false positives: "… About to sit down, he freezes …"
• => Supervised text classification approach
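Such a supervised text classifier can be as simple as bag-of-words plus a linear SVM; a toy sketch with made-up training sentences (the real system's features and training data are not shown in the slides).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data: script sentences labeled by action class (or "Negative").
sentences = ["Will gets out of the Chevrolet.",
             "Erin exits her new truck.",
             "About to sit down, he freezes."]
labels = ["GetOutCar", "GetOutCar", "Negative"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, labels)
print(clf.predict(["She steps out of the cab."]))
```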
166. Automatically annotated action samples [Laptev, Marszałek, Schmid, Rozenfeld 2008]
167. Hollywood-2 actions dataset
Training and test samples are obtained from 33 and 36 distinct movies, respectively. The Hollywood-2 dataset is on-line: http://www.irisa.fr/vista/actions/hollywood2 [Laptev, Marszałek, Schmid, Rozenfeld 2008]
168. Action Classification: Overview
Bag of space-time features + multi-channel SVM [Laptev'03, Schuldt'04, Niebles'06, Zhang'07]
Pipeline: collection of space-time patches -> HOG & HOF patch descriptors -> histogram of visual words -> multi-channel SVM classifier
169. Action classification (CVPR 2008). Test episodes from the movies "The Graduate", "It's a Wonderful Life", "Indiana Jones and the Last Crusade".
170. Evaluation of local features for action recognition
• Local features provide a popular approach to video description for action recognition:
– ~50% of recent action recognition methods (CVPR09, ICCV09, BMVC09) are based on local features
– A large variety of feature detectors and descriptors is available
– Very limited and inconsistent comparison of different features
Goal:
• Systematic evaluation of local feature-descriptor combinations
• Compare performance on common datasets
• Propose improvements
171. Evaluation of local features for action recognition
• Evaluation study [Wang et al. BMVC'09]
– Common recognition framework: same datasets (varying difficulty: KTH, UCF Sports, Hollywood2), same train/test data, same classification method
– Alternative local feature detectors and descriptors from recent literature
– Comparison of different detector-descriptor combinations
172. Action recognition framework
Bag of space-time features + SVM [Schuldt'04, Niebles'06, Zhang'07]
Pipeline: extraction of local space-time patches -> feature description -> feature quantization (K-means clustering, k = 4000) -> occurrence histogram of visual words -> non-linear SVM with a χ² kernel
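A hedged sketch of the χ² kernel SVM stage via a precomputed kernel in scikit-learn; normalizing by the mean χ² distance A is one common convention and an assumption here, not necessarily the authors' exact choice.

```python
import numpy as np
from sklearn.svm import SVC

def chi2_distance(X, Y):
    """Pairwise chi-square distance: 0.5 * sum_i (x_i - y_i)^2 / (x_i + y_i)."""
    d = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        num = (x - Y) ** 2
        den = x + Y + 1e-10
        d[i] = 0.5 * (num / den).sum(axis=1)
    return d

def train_chi2_svm(X_train, y_train):
    D = chi2_distance(X_train, X_train)
    A = D.mean()                          # kernel bandwidth
    K = np.exp(-D / A)
    clf = SVC(kernel="precomputed").fit(K, y_train)
    return clf, A
    # at test time: K_test = np.exp(-chi2_distance(X_test, X_train) / A)
```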
173. Local feature detectors/descriptors
• Four types of detectors:
– Harris3D [Laptev'05]
– Cuboids [Dollar'05]
– Hessian [Willems'08]
– Regular dense sampling
• Four different types of descriptors:
– HOG/HOF [Laptev'08]
– Cuboids [Dollar'05]
– HOG3D [Kläser'08]
– Extended SURF [Willems'08]
174. Illustration of ST detectors: Harris3D, Hessian, Cuboid, Dense
175. Dataset: KTH-Actions
• 6 action classes performed by 25 persons in 4 different scenarios
• Total of 2391 video samples
• Performance measure: average accuracy over all classes
176. UCF-Sports: samples
• 10 different action classes
• 150 video samples in total; we extend the dataset by flipping videos
• Evaluation method: leave-one-out
• Performance measure: average accuracy over all classes
Classes include: Diving, Kicking, Walking, Skateboarding, High-Bar-Swinging, Golf-Swinging
177. Dataset: Hollywood2
• 12 different action classes from 69 Hollywood movies
• 1707 video sequences in total
• Separate movies for training / testing
• Performance measure: mean average precision (mAP) over all classes
Classes include: GetOutCar, AnswerPhone, Kiss, HandShake, StandUp, DriveCar
178. KTH-Actions: results
Descriptors \ Detectors   Harris3D   Cuboids   Hessian   Dense
HOG3D                     89.0%      90.0%     84.6%     85.3%
HOG/HOF                   91.8%      88.7%     88.7%     86.1%
HOG                       80.9%      82.3%     77.7%     79.0%
HOF                       92.1%      88.2%     88.6%     88.0%
Cuboids                   -          89.1%     -         -
E-SURF                    -          -         81.4%     -
• Best results for sparse Harris3D + HOF
• Good results for Harris3D and Cuboid detectors with HOG/HOF and HOG3D descriptors
• Dense features perform relatively poorly compared to sparse features
179. UCF-Sports: results
Descriptors \ Detectors   Harris3D   Cuboids   Hessian   Dense
HOG3D                     79.7%      82.9%     79.0%     85.6%
HOG/HOF                   78.1%      77.7%     79.3%     81.6%
HOG                       71.4%      72.7%     66.0%     77.4%
HOF                       75.4%      76.7%     75.3%     82.6%
Cuboids                   -          76.6%     -         -
E-SURF                    -          -         77.3%     -
• Best results for Dense + HOG3D
• Good results for Dense and HOG/HOF
• Cuboids: good performance with HOG3D
180. Hollywood2: results
Descriptors \ Detectors   Harris3D   Cuboids   Hessian   Dense
HOG3D                     43.7%      45.7%     41.3%     45.3%
HOG/HOF                   45.2%      46.2%     46.0%     47.4%
HOG                       32.8%      39.4%     36.2%     39.4%
HOF                       43.3%      42.9%     43.0%     45.5%
Cuboids                   -          45.0%     -         -
E-SURF                    -          -         38.2%     -
• Best results for Dense + HOG/HOF
• Good results for HOG/HOF
181. Evaluation summary
• Dense sampling consistently outperforms all the tested sparse features in realistic settings (UCF + Hollywood2)
– Importance of realistic video data
– Limitations of current feature detectors
– Note: dense sampling yields a large number of features (15-20 times more)
• Sparse features provide more or less similar results (sparse features beat Dense on KTH)
• Descriptors' performance
– A combination of gradients + optical flow seems a good choice (HOG/HOF & HOG3D)
182. How to improve BoF classification?
Actions are about people. Why not try to combine BoF with person detection?
• Detect and track people
• Compute BoF on person-centered grids: 2x2, 3x2, 3x3 …
Surprise!
183. How to improve BoF classification?
2nd attempt:
• Do not remove the background
• Improve local descriptors with region-level information
Ambiguous local features are disambiguated with region labels: visual vocabulary; regions R1, R2, R3; histogram representation; SVM classification
184. Video Segmentation
• Spatio-temporal grids (R1, R2, R3)
• Static action detectors [Felzenszwalb'08], trained from ~100 web images per class
• Object and person detectors (upper body) [Felzenszwalb'08]
185. Video Segmentation
186. Hollywood-2 action classification
Attributed feature                                              Performance (mean AP)
BoF                                                             48.55
Spatio-temporal grid, 24 channels                               51.83
Motion segmentation                                             50.39
Upper body                                                      49.26
Object detectors                                                49.89
Action detectors                                                52.77
Spatio-temporal grid + Motion segmentation                      53.20
Spatio-temporal grid + Upper body                               53.18
Spatio-temporal grid + Object detectors                         52.97
Spatio-temporal grid + Action detectors                         55.72
Spatio-temporal grid + Motion segmentation + Upper body
  + Object detectors + Action detectors                         55.33
187. Hollywood-2 action classification
188. Actions in Context (CVPR 2009)
Human actions are frequently correlated with particular scene classes. Reasons: physical properties and particular purposes of scenes. Examples: Eating -- kitchen; Eating -- cafe; Running -- road; Running -- street.
189. Mining scene captions
Script excerpt:
ILSA: I wish I didn't love you so much.
01:22:00-01:22:03  She snuggles closer to Rick.
CUT TO: EXT. RICK'S CAFE - NIGHT
Laszlo and Carl make their way through the darkness toward a side entrance of Rick's. They run inside the entryway. The headlights of a speeding police car sweep toward them. They flatten themselves against a wall to avoid detection. The lights move past them.
CARL: I think we lost them.
01:22:15, 01:22:17 …
190. Mining scene captions
Caption examples:
INT. TRENDY RESTAURANT - NIGHT
INT. MARSELLUS WALLACE'S DINING ROOM - MORNING
EXT. STREETS BY DORA'S HOUSE - DAY
INT. MELVIN'S APARTMENT, BATHROOM - NIGHT
EXT. NEW YORK CITY STREET NEAR CAROL'S RESTAURANT - DAY
INT. CRAIG AND LOTTE'S BATHROOM - DAY
• Maximize word frequency: street, living room, bedroom, car …
• Merge words with similar senses using WordNet: taxi -> car, cafe -> restaurant
• Measure correlation of words with actions (in scripts) and
• Re-sort words by the entropy of P = p(action | word)
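The entropy re-sorting step, sketched: estimate p(action | word) from script co-occurrence counts and rank words so that low-entropy (action-specific) words come first. Counts and names below are illustrative.

```python
import numpy as np
from collections import Counter, defaultdict

def rank_scene_words(cooccurrences):
    """cooccurrences: {(word, action): count}.
    Returns [(entropy, word)] sorted so informative words come first."""
    per_word = defaultdict(Counter)
    for (word, action), c in cooccurrences.items():
        per_word[word][action] += c
    ranked = []
    for word, counts in per_word.items():
        p = np.array(list(counts.values()), dtype=float)
        p /= p.sum()                                    # p(action | word)
        ranked.append((-(p * np.log2(p)).sum(), word))  # Shannon entropy
    return sorted(ranked)

# Example: "kitchen" co-occurs mostly with Eating -> low entropy, high rank.
print(rank_scene_words({("kitchen", "Eating"): 9, ("kitchen", "Running"): 1,
                        ("street", "Eating"): 5, ("street", "Running"): 5}))
```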
191. Co-occurrence of actions and scenes in scripts
192. Co-occurrence of actions and scenes in scripts (continued)
193. Co-occurrence of actions and scenes in scripts (continued)
194. Co-occurrence of actions and scenes in text vs. video
195. Automatic gathering of relevant scene classes and visual samples
Source: 69 movies aligned with the scripts. The Hollywood-2 dataset is on-line: http://www.irisa.fr/vista/actions/hollywood2
196. Results: actions and scenes (separately)
197. Classification with the help of context
Combine the action classification score with scene classification scores, using weights estimated from text: roughly, new action score = action score + Σ over scenes of (text-estimated weight) × scene score.
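In code, this contextual combination is a single weighted sum; a sketch assuming per-video action and scene score matrices and a text-estimated weight matrix W (all shapes and names are assumptions).

```python
import numpy as np

def contextual_action_scores(action_scores, scene_scores, W):
    """action_scores: (N, A), scene_scores: (N, S), W: (A, S) weights
    estimated from script co-occurrence. Returns (N, A) combined scores."""
    return action_scores + scene_scores @ W.T
```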
198. Results: actions and scenes (jointly)
Actions in the context of scenes. Scenes in the context of actions.
199. Weakly-Supervised Temporal Action Annotation
Answer the questions: WHAT actions happened, and WHEN? Examples: knock on the door, fight, kiss.
Train visual action detectors and annotate actions with minimal manual supervision.
200. WHAT actions?
Automatic discovery of action classes in text (movie scripts)
– Text processing: Part-of-Speech (POS) tagging; Named Entity Recognition (NER); WordNet pruning; visual noun filtering
– Search action patterns (counts per pattern; see the sketch below):
Person+Verb               Person+Verb+Prep.            Person+Verb+Prep.+Vis.Noun
3725 /PERSON .* is        989 /PERSON .* looks .* at   41 /PERSON .* sits .* in .* chair
2644 /PERSON .* looks     384 /PERSON .* is .* in      37 /PERSON .* sits .* at .* table
1300 /PERSON .* turns     363 /PERSON .* looks .* up   31 /PERSON .* sits .* on .* bed
916 /PERSON .* takes      234 /PERSON .* is .* on      29 /PERSON .* sits .* at .* desk
840 /PERSON .* sits       215 /PERSON .* picks .* up   26 /PERSON .* picks .* up .* phone
829 /PERSON .* has        196 /PERSON .* is .* at      23 /PERSON .* gets .* out .* car
807 /PERSON .* walks      139 /PERSON .* sits .* in    23 /PERSON .* looks .* out .* window
701 /PERSON .* stands     138 /PERSON .* is .* with    21 /PERSON .* looks .* around .* room
622 /PERSON .* goes       134 /PERSON .* stares .* at  18 /PERSON .* is .* at .* desk
591 /PERSON .* starts     129 /PERSON .* is .* by      17 /PERSON .* hangs .* up .* phone
585 /PERSON .* does       126 /PERSON .* looks .* down 17 /PERSON .* is .* on .* phone
569 /PERSON .* gets       124 /PERSON .* sits .* on    17 /PERSON .* looks .* at .* watch
552 /PERSON .* pulls      122 /PERSON .* is .* of      16 /PERSON .* sits .* on .* couch
503 /PERSON .* comes      114 /PERSON .* gets .* up    15 /PERSON .* opens .* of .* door
493 /PERSON .* sees       109 /PERSON .* sits .* at    15 /PERSON .* walks .* into .* room
462 /PERSON .* are/VBP    107 /PERSON .* sits .* down  14 /PERSON .* goes .* into .* room
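The pattern search itself is plain regular-expression matching over script sentences; a toy sketch with invented sentences, mirroring the pattern grammar above.

```python
import re
from collections import Counter

# Patterns of increasing arity: Person+Verb, +Prep., +Visual Noun.
patterns = [r"/PERSON .* sits",
            r"/PERSON .* sits .* at",
            r"/PERSON .* sits .* at .* table"]

def count_patterns(sentences, patterns):
    counts = Counter()
    for s in sentences:
        for p in patterns:
            if re.search(p, s):
                counts[p] += 1
    return counts

sents = ["/PERSON Rick sits down at the table .",
         "/PERSON Ilsa sits on the bed ."]
print(count_patterns(sents, patterns))
```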
201. WHEN: Video Data and Annotation
• Want to target realistic video data
• Want to avoid manual video annotation for training
• Use movies + scripts for automatic annotation of training samples
Script alignment gives only a coarse temporal window (e.g. 24:25 to 24:51): uncertainty!
202. Overview
Input:
• Action type, e.g. "person opens door"
• Videos + aligned scripts
Automatic collection of training clips.
Output: clustering of positive segments; training a classifier; sliding-window-style temporal action localization.
203. Action clustering [Lihi Zelnik-Manor and Michal Irani, CVPR 2001]
Spectral clustering: descriptor space, clustering results, ground truth.
204. Action clustering
Complex data: standard clustering methods do not work on this data.
205. Action clustering
Our view of the problem (feature space vs. video space):
• Negative samples! Random video samples: there are lots of them, and they have a very low chance of being positives.
• The nearest-neighbor solution is wrong!
206. Action clustering
Formulation [Xu et al. NIPS'04], [Bach & Harchaoui NIPS'07]: a discriminative cost in feature space, combining a loss on (parameterized) positive samples with a loss on negative samples.
Optimization: coordinate descent, alternating an SVM solution for fixed positives with an update of the positive-sample parameters (see the sketch below).
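A toy version of this coordinate descent, assuming scikit-learn: alternate an SVM fit on the currently selected positive windows against random negatives, then re-select the best-scoring window inside each weakly labeled clip. This is a simplification of the cited formulations; all names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def discriminative_clustering(clip_windows, negatives, n_iters=5):
    """clip_windows: list of (n_i, D) arrays, candidate windows per clip;
    negatives: (M, D) windows from random videos. Returns chosen window ids."""
    chosen = [0] * len(clip_windows)          # arbitrary initialization
    for _ in range(n_iters):
        # SVM step: fit on current positives vs. negatives.
        X_pos = np.stack([w[i] for w, i in zip(clip_windows, chosen)])
        X = np.vstack([X_pos, negatives])
        y = np.r_[np.ones(len(X_pos)), -np.ones(len(negatives))]
        clf = LinearSVC().fit(X, y)
        # Positive-selection step: best-scoring window per clip.
        chosen = [int(np.argmax(clf.decision_function(w)))
                  for w in clip_windows]
    return chosen
```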
207. Clustering results: drinking actions in "Coffee and Cigarettes"
208. Detection results
Drinking actions in "Coffee and Cigarettes": training a bag-of-features classifier, temporal sliding-window classification, non-maximum suppression. Detection trained on simulated clusters.
Test set:
• 25 min from "Coffee and Cigarettes" with ground truth: 38 drinking actions
209. Detection results
Drinking actions in "Coffee and Cigarettes": training a bag-of-features classifier, temporal sliding-window classification, non-maximum suppression (see the sketch below). Detection trained on automatic clusters.
Test set:
• 25 min from "Coffee and Cigarettes" with ground truth: 38 drinking actions
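The detection stage of both experiments can be sketched as exhaustive temporal window scoring followed by greedy non-maximum suppression; the window lengths, stride, and 50%-overlap criterion below are assumptions for illustration.

```python
import numpy as np

def detect(frame_features, score_fn, lengths=(30, 60, 90), stride=10):
    """Score every temporal window with `score_fn` (e.g. the BoF
    classifier), then greedily suppress overlapping detections."""
    T = len(frame_features)
    cands = []
    for L in lengths:
        for s in range(0, max(T - L, 0) + 1, stride):
            cands.append((score_fn(frame_features[s:s + L]), s, s + L))
    cands.sort(reverse=True)                     # best score first
    kept = []
    for score, s, e in cands:
        overlap = any(min(e, e2) - max(s, s2) > 0.5 * min(e - s, e2 - s2)
                      for _, s2, e2 in kept)
        if not overlap:                          # non-maximum suppression
            kept.append((score, s, e))
    return kept
```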
210. Detection results: "Sit Down" and "Open Door" actions in ~5 hours of movies
211. Temporal detection of "Sit Down" and "Open Door" actions in the movies The Graduate, The Crying Game, Living in Oblivion
212. Conclusions
• Bag-of-words models are currently dominant; structure (human poses, etc.) should be integrated.
• The vocabulary of actions is not well defined: it depends on the goal and the task.
• Actions should be used for the functional interpretation of the visual world.
