ENS/INRIA Visual Recognition and Machine Learning
                       Summer School, 25-29 July, Paris, France




         Human Action Recognition
                            Ivan Laptev
                           ivan.laptev@inria.fr
             INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548
       Laboratoire d'Informatique, École Normale Supérieure, Paris




Includes slides from: Alyosha Efros, Mark Everingham and Andrew Zisserman
Lecture overview
    Motivation
      Historic review
      Applications and challenges

    Human Pose Estimation
      Pictorial structures
      Recent advances

    Appearance-based methods
      Motion history images
      Active shape models & Motion priors

   Motion-based methods
      Generic and parametric Optical Flow
      Motion templates

   Space-time methods
      Space-time features
      Training with weak supervision
Motivation I: Artistic Representation
Early studies were motivated by human representations in Arts
Da Vinci:   “it is indispensable for a painter, to become totally familiar with the
            anatomy of nerves, bones, muscles, and sinews, such that he understands
            for their various motions and stresses, which sinews or which muscle
            causes a particular motion”

            “I ask for the weight [pressure] of this man for every segment of motion
            when climbing those stairs, and for the weight he places on b and on c.
            Note the vertical line below the center of mass of this man.”




            Leonardo da Vinci (1452–1519): A man going upstairs, or up a ladder.
Motivation II: Biomechanics

                                          The emergence of biomechanics

                                          Borelli applied to biology the
                                           analytical and geometrical methods
                                           developed by Galileo Galilei

                                          He was the first to understand that
                                           bones serve as levers and muscles
                                           function according to mathematical
                                           principles

                                          His physiological studies included
                                           muscle analysis and a mathematical
                                           discussion of movements, such as
                                           running or jumping

Giovanni Alfonso Borelli (1608–1679)
Motivation III: Motion perception
 Étienne-Jules Marey
 (1830–1904) made
 Chronophotographic
 experiments influential
 for the emerging field of
 cinematography




 Eadweard Muybridge
 (1830–1904) invented a
 machine for displaying
 the recorded series of
 images. He pioneered
 motion pictures and
 applied his technique to
 movement studies
Motivation III: Motion perception
  Gunnar Johansson [1973] pioneered studies on the use of image
 sequences for programmed human motion analysis

  “Moving Light Displays” (MLD) enable identification of familiar people
 and their gender, and inspired many works in computer vision.




      Gunnar Johansson, Perception and Psychophysics, 1973
Human actions: Historic overview

           15th century   
             studies of
              anatomy
                             17th century
                              emergence of
                              biomechanics


           19th century   
          emergence of
       cinematography
                           1973
                              studies of human
                              motion perception


              Modern computer vision
Modern applications: Motion capture
          and animation




              Avatar (2009)
Modern applications: Motion capture
          and animation




Leonardo da Vinci (1452–1519)   Avatar (2009)
Modern applications: Video editing




               Space-Time Video Completion
    Y. Wexler, E. Shechtman and M. Irani, CVPR 2004
Modern applications: Video editing




                     Recognizing Action at a Distance
Alexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik, ICCV 2003
Why Action Recognition?
 Video indexing and search is useful in TV production, entertainment,
  education, social studies, security, ...

   TV & Web: e.g. “Fight in a parliament”
   Home videos: e.g. “My daughter climbing”
   Sociology research: manually analyzed smoking actions in 900 movies
   Surveillance: 260K views in 7 days on YouTube
How is action recognition related to computer vision?

   [Scene-labeling example: Sky, Street sign, Cars, Road]

We can recognize cars and roads. What's next?

      12,184,113 images, 17624 synsets
Airplane




  A plane has crashed, the
  cabin is broken, somebody is
  likely to be injured or dead.
  [Image labels: cat, woman, trash bin]
   Vision is person-centric: we mostly care about
    things which are important to us, people

   Actions of people reveal the function of objects

   Future challenges:

     - Function: What can I do with this and how?
     - Prediction: What can happen if someone does that?
     - Recognizing goals: What is this person trying to do?
How many person-pixels are there?

    Movies: 35%      TV: 34%      YouTube: 40%
How much data do we have?
 Huge amount of video is available and growing




                       TV channels recorded
                            since the 60's


                           >34K hours of video
                            uploaded every day


                                 ~30M surveillance cameras in the US
                                    => ~700K video hours/day



  If we want to interpret this data, we had better understand what
  person-pixels are telling us!
Lecture overview
    Motivation
      Historic review
      Applications and challenges

    Human Pose Estimation
      Pictorial structures
      Recent advances

    Appearance-based methods
      Motion history images
      Active shape models & Motion priors

   Motion-based methods
      Generic and parametric Optical Flow
      Motion templates

   Space-time methods
      Space-time features
      Training with weak supervision
Objective and motivation
 Determine human body pose (layout)




Why? To recognize poses, gestures, actions
Activities characterized by a pose
Challenges: articulations and deformations

Challenges: (almost) unconstrained images




varying illumination and low contrast; moving camera and background;
multiple people; scale changes; extensive clutter; any clothing
Pictorial Structures



•   Intuitive model of an object
•   Model has two components:
    1. parts (2D image fragments)
    2. structure (configuration of parts)
•   Dates back to Fischler & Elschlager 1973
Long tradition of using pictorial structures for humans


                    Finding People by Sampling
                    Ioffe & Forsyth, ICCV 1999




                    Pictorial Structure Models for Object Recognition
                    Felzenszwalb & Huttenlocher, 2000




                    Learning to Parse Pictures of People
                    Ronfard, Schmid & Triggs, ECCV 2002
Felzenszwalb & Huttenlocher




                 NB: requires background subtraction
Variety of Poses
Objective: detect human and determine upper body pose (layout)




Pictorial structure model – CRF



Complexity



Are trees the answer?

                               [Tree model: Head–Torso, with left/right
                                upper arm (UA), lower arm (LA), hand (Ha)]

• With n parts and h possible discrete locations per part: O(h^n)

• For a tree, using dynamic programming this reduces to O(nh^2)

• If the model is a tree and has certain edge costs, then the complexity
  reduces to O(nh) using a distance transform [Felzenszwalb &
  Huttenlocher, 2000, 2005]
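To make the O(nh) claim concrete, below is a minimal Python sketch (assuming a quadratic deformation cost; function and variable names are illustrative, not from the original papers) of the 1D generalized distance transform of Felzenszwalb & Huttenlocher. The 2D transform applies it along rows and then columns, and tree DP calls it once per edge.

```python
import numpy as np

def dt1d(g, w=1.0):
    """Generalized distance transform: f[p] = min_q ( w*(p-q)^2 + g[q] ),
    computed in O(n) as the lower envelope of parabolas
    (Felzenszwalb & Huttenlocher)."""
    n = len(g)
    v = np.zeros(n, dtype=int)      # parabola roots in the lower envelope
    z = np.full(n + 1, np.inf)      # boundaries between parabolas
    z[0] = -np.inf
    k = 0
    for q in range(1, n):
        while True:
            p = v[k]
            s = ((g[q] + w*q*q) - (g[p] + w*p*p)) / (2.0*w*(q - p))
            if s <= z[k]:
                k -= 1              # parabola p is hidden; discard it
            else:
                break
        k += 1
        v[k] = q
        z[k] = s
        z[k + 1] = np.inf
    f = np.empty(n)
    arg = np.empty(n, dtype=int)
    k = 0
    for p in range(n):
        while z[k + 1] < p:
            k += 1
        f[p] = w*(p - v[k])**2 + g[v[k]]
        arg[p] = v[k]               # best child location for parent location p
    return f, arg
```

For pictorial structures, g holds a child part's (negated) appearance scores over locations; f then gives, for every parent location, the best child placement cost, and one such pass per edge yields the O(nh) total.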
Kinematic structure vs graphical (independence) structure

             Graph G = (V, E)

 [Kinematic tree: Head–Torso with left/right UA, LA, Ha]

 [Graphical model capturing additional limb dependencies:
  requires more connections than a tree]
More recent work on human pose estimation

           D. Ramanan. Learning to parse images of articulated
           bodies. NIPS, 2007
                 Learns image- and person-specific unary terms
                 •   initial iteration: edges
                 •   following iterations: edges & colour


           V. Ferrari, M. Marin-Jimenez, and A. Zisserman.
           Progressive search space reduction for human pose
           estimation. In Proc. CVPR, 2008/2009
                 (Almost) unconstrained images
                 •   Person detector & foreground highlighting


                   P. Buehler, M. Everingham and A. Zisserman.
                   Learning sign language by watching TV. In Proc.
                   CVPR 2009
                       Learns with weak textual annotation
                       •   Multiple instance learning
Pose estimation is a very active research area

            Y. Yang and D. Ramanan. Articulated pose estimation
            with flexible mixtures-of-parts. In Proc. CVPR 2011
                Extension of the LSVM model of Felzenszwalb et al.


                        Y. Wang, D. Tran and Z. Liao. Learning
                        Hierarchical Poselets for Human
                        Parsing. In Proc. CVPR 2011.
                          Builds on the Poselets idea of Bourdev et al.

                        S. Johnson and M. Everingham. Learning
                        Effective Human Pose Estimation from
                        Inaccurate Annotation. In Proc. CVPR 2011.
                          Learns from lots of noisy annotations

                              B. Sapp, D. Weiss and B. Taskar. Parsing
                              Human Motion with Stretchable Models.
                              In Proc. CVPR 2011.
                                Explores temporal continuity
Pose estimation is a very active research area




J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A.
Kipman and A. Blake. Real-Time Human Pose Recognition in Parts from
        Single Depth Images. Best paper award at CVPR 2011

        Exploits lots of synthesized depth images for training
Pose Search

                      [Query pose Q and retrieved video frames]

V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction
                   for human pose estimation. In Proc. CVPR 2009
Application



Learning sign language by watching TV
   (using weakly aligned subtitles)


                           Patrick Buehler
                          Mark Everingham
                          Andrew Zisserman


                           CVPR 2009
Objective
Learn signs in British Sign Language (BSL) corresponding to text words:
    • Training data from TV broadcasts with simultaneous signing
    • Supervision solely from subtitles

    Input: video + subtitle                    Output: automatically
                                          learned signs (4x slow motion)

                                                                               Office




                                                                            Government



Use subtitles to find video sequences containing a word. These are the positive
training sequences. Use other sequences as negative training sequences.
Given an English word,
e.g. “tree”, what is the
corresponding British
Sign Language sign?

             positive
            sequences




                negative
                  set
Use a sliding window to choose a sub-
sequence of poses in one positive
sequence and determine if
                                    1st sliding window
 the same sub-sequence of poses
occurs in other positive sequences
somewhere, but
 does not occur in the negative set



                  positive
                 sequences




                     negative
                       set
Use a sliding window to choose a sub-
sequence of poses in one positive
sequence and determine if
                                      5th sliding window
 the same sub-sequence of poses
occurs in other positive sequences
somewhere, but
 does not occur in the negative set



                 positive
                sequences




                     negative
                       set
Multiple instance learning



                                Negative
       Positive                   bag
        bags




                    sign of
                    interest
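A rough sketch of the scoring idea behind this multiple-instance setup (hypothetical helper names; `dist` stands for any sub-sequence distance, e.g. DTW over per-frame pose descriptors):

```python
import numpy as np

def score_candidates(candidates, positive_bags, negative_bag, dist):
    """A good candidate sub-sequence matches some window in every positive
    bag but no window in the negative bag. `dist` is any sub-sequence
    distance (e.g. DTW over per-frame pose descriptors)."""
    scores = []
    for c in candidates:
        pos = np.mean([min(dist(c, w) for w in bag) for bag in positive_bags])
        neg = min(dist(c, w) for w in negative_bag)
        scores.append(neg - pos)   # high: near positives, far from negatives
    return np.array(scores)
```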
Example

Learn signs in British Sign Language (BSL) corresponding to
text words.
Evaluation

Good results for a variety of signs:

  Signs where hand movement is important:          Navy, Prince
  Signs where hand shape is important:             Lung, Garden
  Signs where both hands are together:             Fungi, Golf
  Signs which are finger-spelled:                  Kew, Bob
  Signs which are performed in front of the face:  Whale, Rose
What is missed?




  truncation is not modelled
What is missed?




  occlusion is not modelled
Modelling person-object-pose interactions

                W. Yang, Y. Wang and Greg Mori. Recognizing
                Human Actions from Still Images with Latent
                Poses. In Proc. CVPR 2010.

                Some limbs may
                not be important
                for recognizing a
                particular action
                (e.g. sitting)


                B. Yao and L. Fei-Fei. Modeling Mutual
                Context of Object and Human Pose in Human-
                Object Interaction Activities. In Proc. CVPR
                2010.

                 Pose estimation helps object detection and
                 vice versa
Towards functional object understanding




    A. Gupta, S. Satkin, A.A. Efros and M. Hebert.
    From 3D Scene Geometry to
    Human Workspace. In Proc. CVPR 2011

    Predicts the “workspace” of a human




H. Grabner, J. Gall and L. Van Gool. What Makes a Chair a Chair? In Proc. CVPR 2011
Conclusions: Human poses

   Exciting progress in pose estimation in realistic
    still images and video

   Industry-strength pose estimation from depth sensors

   Pose estimation from RGB is still very challenging


   Human Poses ≠ Human Actions!
Lecture overview
    Motivation
      Historic review
      Applications and challenges

    Human Pose Estimation
      Pictorial structures
      Recent advances

    Appearance-based methods
      Motion history images
      Active shape models & Motion priors

   Motion-based methods
      Generic and parametric Optical Flow
      Motion templates

   Space-time methods
      Space-time features
      Training with weak supervision
Foreground segmentation
Image differencing: a simple way to measure motion/change


                 | I(x, y, t) - I(x, y, t-1) | > const


Better Background / Foreground separation methods exist:

   Modeling of color variation at each pixel with a Gaussian Mixture
   Dominant motion compensation for sequences with a moving camera
   Motion layer separation for scenes with non-static backgrounds
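A minimal numpy sketch of the image-differencing test above (the threshold value is illustrative):

```python
import numpy as np

def change_mask(prev_frame, cur_frame, thresh=25):
    """Image differencing: flag pixels whose intensity changed by more
    than a constant between consecutive frames."""
    diff = np.abs(cur_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > thresh   # boolean foreground mask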
Temporal Templates




Idea: summarize motion in video in a
      Motion History Image (MHI)




Descriptor: Hu moments of different orders




                                        [A.F. Bobick and J.W. Davis, PAMI 2001]
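A small sketch of the MHI recurrence, assuming grayscale frames and an illustrative decay constant tau; the Hu-moment descriptor can then be taken from the MHI (e.g. with OpenCV, as noted in the trailing comment):

```python
import numpy as np

def motion_history_image(frames, tau=30, diff_thresh=25):
    """Motion History Image: pixels where motion is detected are set to
    tau; elsewhere the history decays by 1 per frame (Bobick & Davis)."""
    mhi = np.zeros(frames[0].shape, np.float32)
    prev = frames[0].astype(np.int16)
    for frame in frames[1:]:
        cur = frame.astype(np.int16)
        moving = np.abs(cur - prev) > diff_thresh   # simple image differencing
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
        prev = cur
    return mhi / tau    # normalized to [0, 1]

# Hu-moment descriptor of the MHI, e.g. with OpenCV:
#   import cv2; descriptor = cv2.HuMoments(cv2.moments(mhi)).flatten()
```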
Aerobics dataset




Nearest Neighbor classifier: 66% accuracy
Temporal Templates: Summary
Pros:
    + Simple and fast
    + Works in controlled settings

Cons:
    - Prone to errors of background subtraction

        Variations in light, shadows, clothing... What is the background here?

    - Does not capture interior motion and shape

        A silhouette tells little about actions

Not all shapes are valid: restrict the space of admissible silhouettes
Active Shape Models [Cootes et al.]

 Constrain shape deformation in a PCA-projected space

 Example: face alignment             Illustration of face shape space




        Active Shape Models: Their Training and Application
 T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, CVIU 1995
Person Tracking




Learning flexible models from image sequences
   A. Baumberg and D. Hogg, ECCV 1994
Learning dynamic prior
 Dynamic model: 2nd-order Auto-Regressive Process

   State: shape parameters x_t

   Update rule (schematically): x_t = A_1 x_{t-1} + A_2 x_{t-2} + B w_t, with white noise w_t

   Model parameters: A_1, A_2, B

   Learning scheme: least-squares / maximum-likelihood fit to training sequences
Learning dynamic prior

                                  Random simulation of the
     Learning point sequence       learned dynamical model




              Statistical models of visual shape and motion
A. Blake, B. Bascle, M. Isard and J. MacCormick, Phil. Trans. R. Soc. 1998
Learning dynamic prior


Random simulation of the learned gait dynamics
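A compact sketch of learning and simulating such a 2nd-order AR prior by least squares; this is a minimal illustration, not the authors' exact estimator (which also models the mean offset):

```python
import numpy as np

def fit_ar2(X):
    """Least-squares fit of x_t = A1 x_{t-1} + A2 x_{t-2} + noise.
    X: (T, d) sequence of shape parameters, d > 1."""
    past = np.hstack([X[1:-1], X[:-2]])            # (T-2, 2d)
    target = X[2:]                                 # (T-2, d)
    W, *_ = np.linalg.lstsq(past, target, rcond=None)
    d = X.shape[1]
    A1, A2 = W[:d].T, W[d:].T
    B = np.atleast_2d(np.cov((target - past @ W).T))   # process-noise covariance
    return A1, A2, B

def simulate_ar2(A1, A2, B, x0, x1, T):
    """Random simulation of the learned dynamics."""
    L = np.linalg.cholesky(B + 1e-9 * np.eye(len(x0)))
    xs = [np.asarray(x0, float), np.asarray(x1, float)]
    for _ in range(T - 2):
        xs.append(A1 @ xs[-1] + A2 @ xs[-2] + L @ np.random.randn(len(x0)))
    return np.array(xs)
```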
Motion priors
 Constrain the temporal evolution of shape:
        Help accurate tracking
        Recognize actions

 Goal: formulate motion models for different types of actions
       and use such models for action recognition

 Example:
    Drawing with 3 action modes:
                 line drawing
                 scribbling
                 idle

                                   [M. Isard and A. Blake, ICCV 1998]
Lecture overview
    Motivation
      Historic review
      Applications and challenges

    Human Pose Estimation
      Pictorial structures
      Recent advances

    Appearance-based methods
      Motion history images
      Active shape models & Motion priors

   Motion-based methods
      Generic and parametric Optical Flow
      Motion templates

   Space-time methods
      Space-time features
      Training with weak supervision
Shape and Appearance vs. Motion
 Shape and appearance in images depend on many factors:
  clothing, illumination, contrast, image resolution, etc.




                                                      [Efros et al. 2003]

 The motion field (in theory) is invariant to shape and can be used
  directly to describe human actions
Shape and Appearance vs. Motion
            Moving Light Displays




Gunnar Johansson, Perception and Psychophysics, 1973
Motion estimation: Optical Flow
 Classic problem of computer vision [Gibson 1955]

 Goal: estimate the motion field
  How? We only have access to image pixels:
        estimate pixel-wise correspondence
        between frames = Optical Flow

 Brightness Change assumption: corresponding pixels
  preserve their intensity (color)

    Useful assumption in many cases

    Breaks at occlusions and
     illumination changes

    Physical and visual
     motion may be different
Generic Optical Flow
      Brightness Change Constraint Equation (BCCE):

          I_x u + I_y v + I_t = 0

          with optical flow (u, v) and image gradient (I_x, I_y, I_t)

         One equation, two unknowns => cannot be solved directly.
               Integrate several measurements in the local neighborhood
               and obtain a Least Squares Solution [Lucas & Kanade 1981]

The resulting second-moment matrix is the same one used to compute Harris
interest points! Integration is over a spatial (or spatio-temporal)
neighborhood of a point.
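A minimal Lucas-Kanade sketch over one window, assuming the point lies far enough from the image border and the second-moment matrix is well conditioned (names are illustrative):

```python
import numpy as np

def lucas_kanade(I0, I1, x, y, r=7):
    """Least-squares flow at (x, y) from a (2r+1)^2 window; assumes the
    window lies inside the image and the system is well conditioned."""
    Iy, Ix = np.gradient(I0.astype(float))          # spatial gradients
    It = I1.astype(float) - I0.astype(float)        # temporal derivative
    win = np.s_[y - r:y + r + 1, x - r:x + r + 1]
    ix, iy, it = Ix[win].ravel(), Iy[win].ravel(), It[win].ravel()
    M = np.array([[ix @ ix, ix @ iy],               # second-moment matrix,
                  [ix @ iy, iy @ iy]])              # as in the Harris detector
    b = -np.array([ix @ it, iy @ it])
    return np.linalg.solve(M, b)                    # optical flow (u, v)
```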
Parameterized Optical Flow
    Another extension of the constant motion model is to compute
    PCA basis flow fields from training examples:

       1. Compute standard Optical Flow for many examples
       2. Put the velocity components into one vector per example
       3. Do PCA and obtain the most informative PCA flow basis vectors

Training samples                  PCA flow bases




               Learning Parameterized Models of Image Motion
        M.J. Black, Y. Yacoob, A.D. Jepson and D.J. Fleet, CVPR 1997
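A sketch of steps 2-3, stacking (u, v) fields into vectors and taking the top principal components via SVD (k must not exceed the number of training examples; names are illustrative):

```python
import numpy as np

def pca_flow_basis(flows, k=8):
    """flows: (N, H, W, 2) training flow fields; returns the mean flow
    vector and the k most informative basis flow fields (k <= N)."""
    F = flows.reshape(flows.shape[0], -1)          # one (u, v) vector per example
    mean = F.mean(axis=0)
    _, _, Vt = np.linalg.svd(F - mean, full_matrices=False)
    return mean, Vt[:k]

def flow_coefficients(flow, mean, basis):
    """Project a new flow field onto the basis; the coefficients serve
    as a compact motion descriptor."""
    return basis @ (flow.reshape(-1) - mean)
```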
Parameterized Optical Flow
      Estimated coefficients of the PCA flow bases, plotted over frame
       numbers, can be used as action descriptors

                Optical flow seems to be an interesting descriptor for
                motion/action recognition
Spatial Motion Descriptor

   Image frame -> optical flow F_{x,y}
   Split F into components F_x, F_y, half-wave rectify them into
   F_x+, F_x-, F_y+, F_y-, and blur each channel

A. A. Efros, A.C. Berg, G. Mori and J. Malik. Recognizing Action at a Distance.
                             In Proc. ICCV 2003
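A sketch of the channel construction (half-wave rectified, blurred flow components as described in the paper; the blur width is illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def motion_channels(u, v, sigma=2.0):
    """Half-wave rectify the flow into four non-negative channels
    (Fx+, Fx-, Fy+, Fy-) and blur each one."""
    channels = [np.maximum(u, 0), np.maximum(-u, 0),
                np.maximum(v, 0), np.maximum(-v, 0)]
    return [gaussian_filter(c, sigma) for c in channels]
```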
Spatio-Temporal Motion Descriptor

   For two sequences A and B, compare the blurred motion channels frame
   by frame to obtain a frame-to-frame similarity matrix; aggregating it
   over a temporal extent E yields a motion-to-motion similarity matrix.
Football Actions: matching

  Input
Sequence



Matched
Frames




                   input   matched
Football Actions: classification




10 actions; 4500 total frames; 13-frame motion descriptor
Classifying Ballet Actions

16 actions; 24800 total frames; 51-frame motion descriptor. Men
             used to classify women and vice versa.
Classifying Tennis Actions

6 actions; 4600 frames; 7-frame motion descriptor
 Woman player used for training, man for testing.
Where are we so far?
Lecture overview
    Motivation
       Historic review
       Modern applications

    Human Pose Estimation
      Pictorial structures
      Recent advances

    Appearance-based methods
       Motion history images
       Active shape models & Motion priors

   Motion-based methods
      Generic and parametric Optical Flow
      Motion templates

   Space-time methods
      Space-time features
      Training with weak supervision
Goal:
Interpret complex
dynamic scenes



   Common methods:                  Common problems:
       • Segmentation ?               • Complex & changing BG

       • Tracking ?                   • Changing appearance

            No global assumptions about the scene
Space-time
No global assumptions =>
       Consider local spatio-temporal neighborhoods



      hand waving
                                         boxing
Actions == Space-time objects?
Local approach: Bag of Visual Words

Airplanes

Motorbikes

  Faces

Wild Cats

 Leaves

 People

  Bikes
Space-time local features

Space-Time Interest Points: Detection

      What neighborhoods to consider?

      Distinctive neighborhoods: high image variation in space and time
            => look at the distribution of the gradient

Definitions:
    f(x, y, t):        original image sequence

    g(x, y, t; Σ):     space-time Gaussian with covariance Σ

    L = g * f,         with Gaussian derivatives L_x, L_y, L_t

    ∇L = (L_x, L_y, L_t)^T:   space-time gradient




    μ = g * (∇L ∇L^T):        second-moment matrix
Space-Time Interest Points: Detection

Properties of μ:

       μ defines a second-order approximation for the local
       distribution of ∇L within a neighborhood

           one large eigenvalue:    1D space-time variation, e.g. a moving bar
           two large eigenvalues:   2D space-time variation, e.g. a moving ball
           three large eigenvalues: 3D space-time variation, e.g. a jumping ball

Large eigenvalues of μ can be detected by the
local maxima of H over (x, y, t):

       H = det(μ) - k·trace³(μ)

             (similar to the Harris operator [Harris and Stephens, 1988])
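A rough numpy sketch of the response computation (scales and the constant k are illustrative; interest points are then the thresholded local maxima of H):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_response(video, sigma=2.0, tau=1.5, s=2.0, k=0.005):
    """Space-time Harris response H = det(mu) - k * trace(mu)^3.
    video: (T, H, W) float array; sigma/tau: spatial/temporal scales."""
    L = gaussian_filter(video.astype(float), (tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)                     # space-time gradient
    m = {}
    for name, d in (("xx", Lx*Lx), ("yy", Ly*Ly), ("tt", Lt*Lt),
                    ("xy", Lx*Ly), ("xt", Lx*Lt), ("yt", Ly*Lt)):
        m[name] = gaussian_filter(d, (s*tau, s*sigma, s*sigma))  # integrate mu
    det = (m["xx"] * (m["yy"]*m["tt"] - m["yt"]**2)
           - m["xy"] * (m["xy"]*m["tt"] - m["xt"]*m["yt"])
           + m["xt"] * (m["xy"]*m["yt"] - m["yy"]*m["xt"]))
    trace = m["xx"] + m["yy"] + m["tt"]
    return det - k * trace**3   # interest points: thresholded local maxima
```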
Space-Time interest points

Velocity changes; appearance/disappearance; split/merge

Space-Time Interest Points: Examples
           Motion event detection
Spatio-temporal scale selection



              Stability to size changes,
              e.g. camera zoom

Spatio-temporal scale selection


                      Selection of
                      temporal scales
                      captures the
                      frequency of events
Local features for human actions
                   boxing




                   walking




                 hand waving
Local space-time descriptor: HOG/HOF

 Multi-scale space-time patches




                                   Histogram of       Histogram
                                  oriented spatial    of optical
                                  gradients (HOG)     flow (HOF)

  Public code available at
  www.irisa.fr/vista/actions

                                   3x3x2x4-bin HOG      3x3x2x5-bin HOF
                                       descriptor          descriptor
Visual Vocabulary: K-means clustering
    Group similar points in the space of image descriptors using
     K-means clustering
    Select significant clusters


                        Clustering
                                        c1


                                        c2

                                        c3

                                        c4

                       Classification
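A minimal vocabulary-plus-histogram sketch with SciPy's k-means (the vocabulary size and descriptor layout are placeholders):

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_vocabulary(descriptors, k=4000, seed=0):
    """Cluster local space-time descriptors (N, d) into k visual words."""
    centers, _ = kmeans2(descriptors.astype(float), k, minit="++", seed=seed)
    return centers

def bof_histogram(descriptors, centers):
    """Assign each descriptor to its nearest word and build a
    normalized occurrence histogram for one video."""
    words, _ = vq(descriptors.astype(float), centers)
    h = np.bincount(words, minlength=len(centers)).astype(float)
    return h / max(h.sum(), 1.0)
```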
Local Space-time features: Matching
   Find similar events in pairs of video sequences
Action Classification: Overview
Bag of space-time features + multi-channel SVM
 [Laptev’03, Schuldt’04, Niebles’06, Zhang’07]


                                                 Collection of space-time patches




                                        Histogram of visual words

                     HOG & HOF                                         Multi-channel
                       patch                                               SVM
                     descriptors                                        Classifier
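A sketch of the classifier stage for one channel, using scikit-learn's chi-square kernel with a precomputed-kernel SVM; the multi-channel variant of Laptev et al. combines per-channel chi-square distances inside one exponential kernel, which is not reproduced here:

```python
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_chi2_svm(H_train, y_train, gamma=1.0, C=1.0):
    """SVM over BoF histograms (non-negative) with a chi-square kernel."""
    K = chi2_kernel(H_train, gamma=gamma)
    return SVC(kernel="precomputed", C=C).fit(K, y_train)

def predict_chi2_svm(clf, H_train, H_test, gamma=1.0):
    # Test kernel rows are computed against the training histograms.
    return clf.predict(chi2_kernel(H_test, H_train, gamma=gamma))
```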
Action recognition in the KTH dataset




Sample frames from the KTH action sequences; all six classes
(columns) and scenarios (rows) are presented
Classification results on KTH dataset




        Confusion matrix for KTH actions
What about 3D?
Local motion and appearance features are not invariant to view changes

Multi-view action recognition
Difficult to apply standard multi-view methods:

    Do not want to search for multi-
     view point correspondences ---
     non-rigid motion, clothing
     changes, ... --> It's Hard!



    Do not want to identify body
     parts. Current methods are
     not reliable enough.


    Yet, want to learn actions
     from one view
     and recognize actions in very
     different views
Temporal self-similarities
Idea:
   Cross-view matching is hard but cross-time matching (tracking) is
    relatively easy.
   Measure self-(dis)similarities across time.

Example:
                              Distance matrix / self-similarity matrix (SSM)
                              between poses P1 and P2
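Computing an SSM is just a pairwise-distance matrix over per-frame features; a minimal sketch:

```python
import numpy as np

def self_similarity_matrix(feats):
    """SSM: pairwise Euclidean distances between per-frame features
    (joint positions, HOG or optical-flow descriptors). feats: (T, d)."""
    sq = (feats ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    return np.sqrt(np.maximum(d2, 0.0))
```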
Temporal self-similarities: Multi-view
   Side view                                Top view




                                         The SSMs appear
                                         very similar
                                         despite the view
                                         change!


Intuition: 1. Distance between similar poses is low in any view
           2. Distance among different poses is likely to be large in most views
Temporal self-similarities: MoCap

                    person 1
Self-similarities
can be measured
from Motion
Capture (MoCap)     person 2
data



                    person 1




                    person 2
Temporal self-similarities: Video




Self-similarities
can be
measured
directly from
video:
HOG or
Optical Flow
descriptors in
image frames
Self-similarity descriptor

 Goal:
 define a quantitative
 measure to compare self-
 similarity matrices


 Define a local histogram                             ~SIFT descriptor
  descriptor hi for each point                         computed on the SSM
  i on the diagonal.

 Sequence alignment:             Action recognition:
  Dynamic Programming for          • Visual vocabulary for h
  two sequences of
                                   • BoF representation of {hi}
  descriptors {hi}, {hj}
                                   • SVM
Multi-view alignment

Multi-view action recognition: Video




  SSM-based recognition   Alternative view-dependent method (STIP)
What are Human Actions?

Actions in recent
datasets:

                                      Is it just about kinematics?


Should actions be defined by the purpose?




                    Kinematics + Objects + Scenes
Action recognition in realistic settings

            Standard
            action
            datasets
                       Actions “In the Wild”:
Action Dataset and Annotation
                     Manual annotation of drinking actions in movies:
                     “Coffee and Cigarettes”; “Sea of Love”

                           “Drinking”: 159 annotated samples
                           “Smoking”: 149 annotated samples

                                                 Temporal annotation:
                                         First frame    Keyframe        Last frame

Spatial annotation:
                head rectangle




              torso rectangle
“Drinking” action samples

Action representation:
                   Histogram of Gradient
                   Histogram of Optic Flow
Action learning
                                                        selected features

                                        boosting

                                                        weak classifier


                 • Efficient discriminative classifier [Freund & Schapire'97]
     AdaBoost:
                 • Good performance for face detection [Viola & Jones'01]

pre-aligned                  Haar
samples                                                        optimal threshold
                             features




                                                             Fisher
                                                             discriminant
                Histogram
                features
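For concreteness, here is a small discrete-AdaBoost sketch with threshold ("stump") weak classifiers, i.e. the "optimal threshold" box above; the brute-force stump search is for clarity, and the Fisher-discriminant weak learner of the actual system is not reproduced:

```python
import numpy as np

def adaboost_stumps(X, y, T=100):
    """Discrete AdaBoost with threshold ('stump') weak classifiers.
    X: (N, d) features, y in {-1, +1}; brute-force stump search for clarity."""
    N, d = X.shape
    w = np.full(N, 1.0 / N)
    model = []
    for _ in range(T):
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for s in (1.0, -1.0):
                    pred = s * np.sign(X[:, j] - thr + 1e-12)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        err, j, thr, s = best
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        pred = s * np.sign(X[:, j] - thr + 1e-12)
        w *= np.exp(-alpha * y * pred)     # re-weight: focus on mistakes
        w /= w.sum()
        model.append((alpha, j, thr, s))
    return model

def adaboost_predict(model, X):
    score = sum(a * s * np.sign(X[:, j] - thr + 1e-12) for a, j, thr, s in model)
    return np.sign(score)
```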
Key-frame action classifier
                                                        selected features

                                         boosting

                                                        weak classifier
                            2D HOG features


                 • Efficient discriminative classifier [Freund & Schapire'97]
     AdaBoost:
                 • Good performance for face detection [Viola & Jones'01]

pre-aligned                 Haar
samples                                                        optimal threshold
                            features




                                                             Fisher
                                                             discriminant
                Histogram                                    see [Laptev BMVC'06]
                features                                     for more details
                                                               [Laptev, Pérez 2007]
Keyframe priming
Training           False positives of a static HOG action detector




  Positive   Negative
  training   training
  samples    samples

Test
Action detection
Test set:
    • 25 min from “Coffee and Cigarettes” with ground truth: 38 drinking actions
    • No overlap with the training set in subjects or scenes
Detection:
   • search over all space-time locations and spatio-temporal extents




 Keyframe
  priming




     No
 Keyframe
  priming
Action Detection (ICCV 2007)




        Test episodes from the movie “Coffee and Cigarettes”

Video available at http://www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html

20 most confident detections
Learning Actions from Movies
    •   Realistic variation of human actions
    •   Many classes and many examples per class




 Problems:
   •    Typically only a few class-samples per movie
   •    Manual annotation is very time consuming
Automatic video annotation with scripts
      •     Scripts available for >500 movies (no time synchronization)
             www.dailyscript.com, www.movie-page.com, www.weeklyscript.com ...
      •     Subtitles (with time info) are available for most movies
      •     Can transfer time to scripts by text alignment

...         subtitles                                               movie script
1172                                                   ...
01:20:17,240 --> 01:20:20,437                                        RICK
Why weren't you honest with me?                            Why weren't you honest with me? Why
Why'd you keep your marriage a secret?                     did you keep your marriage a secret?

1173                                     01:20:17
01:20:20,640 --> 01:20:23,598                       Rick sits down with Ilsa.
                                         01:20:23
It wasn't my secret, Richard.                                        ILSA
Victor wanted it that way.
                                                           Oh, it wasn't my secret, Richard.
1174                                                       Victor wanted it that way. Not even
01:20:23,800 --> 01:20:26,189                              our closest friends knew about our
                                                           marriage.
Not even our closest friends                           ...
knew about our marriage.
...
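A minimal sketch of the time-transfer idea using word-level alignment (difflib's matching blocks stand in for the actual alignment method; inputs are tokenized subtitle/script words and per-subtitle-word timestamps):

```python
import difflib

def transfer_subtitle_times(script_words, sub_words, sub_times):
    """Word-level alignment of a script against subtitles; matched script
    words inherit the timestamp of the aligned subtitle word."""
    sm = difflib.SequenceMatcher(a=[w.lower() for w in sub_words],
                                 b=[w.lower() for w in script_words])
    times = [None] * len(script_words)
    for i, j, n in sm.get_matching_blocks():
        for k in range(n):
            times[j + k] = sub_times[i + k]
    return times
```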
Script-based action annotation
–    On the good side:
    • Realistic variation of actions: subjects, views, etc.
    • Many examples per class, many classes
    • No extra overhead for new classes
    • Actions, objects, scenes and their combinations
    • Character names may be used to resolve “who is doing what?”

–    Problems:
    • No spatial localization
    • Temporal localization may be poor
    • Missing actions: e.g. scripts do not always follow the movie
    • Annotation is incomplete, not suitable as ground truth for
        testing action detection
    • Large within-class variability of action classes in text
Script alignment: Evaluation
 •    Annotate action samples in text
 •    Do automatic script-to-video alignment
 •    Check the correspondence of actions in scripts and movies


                                         Example of a “visual false positive”:




                                            A black car pulls up; two army
                                                   officers get out.

a: quality of subtitle-script matching
Text-based action retrieval
•   Large variation of action expressions in text:

     GetOutCar        “… Will gets out of the Chevrolet. …”
      action:         “… Erin exits her new truck…”

    Potential false
                      “…About to sit down, he freezes…”
      positives:

•   => Supervised text classification approach
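An illustrative stand-in for such a supervised text classifier (a scikit-learn bag-of-words pipeline; the example sentences come from the slide, and the paper used its own classifier over text features):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Train on sentences labeled as containing the action (1) or not (0).
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(["Will gets out of the Chevrolet", "Erin exits her new truck",
         "About to sit down, he freezes"], [1, 1, 0])
print(clf.predict(["He climbs out of the car"]))
```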
Automatically annotated action samples




                  [Laptev, Marszałek, Schmid, Rozenfeld 2008]
Hollywood-2 actions dataset

                           Training and test
                           samples are obtained
                           from 33 and 36 distinct
                           movies respectively.




                            Hollywood-2
                            dataset is on-line:
                            http://www.irisa.fr/vista
                            /actions/hollywood2




             [Laptev, Marszałek, Schmid, Rozenfeld 2008]
Action classification (CVPR08)




Test episodes from movies “The Graduate”, “It’s a Wonderful Life”,
              “Indiana Jones and the Last Crusade”
Evaluation of local features for action recognition
•   Local features provide a popular approach to video description for
    action recognition:
     – ~50% of recent action recognition methods (CVPR09, ICCV09,
        BMVC09) are based on local features
     – A large variety of feature detectors and descriptors is available
     – Very limited and inconsistent comparison of different features

Goal:
•   Systematic evaluation of local feature-descriptor combinations
•   Compare performance on common datasets
•   Propose improvements
Evaluation of local features for action recognition
•   Evaluation study [Wang et al. BMVC’09]
     – Common recognition framework
         • Same datasets (varying difficulty):
           KTH, UCF sports, Hollywood2
         • Same train/test data
         • Same classification method
     – Alternative local feature detectors and descriptors from
       recent literature
     – Comparison of different detector-descriptor combinations
Action recognition framework
   Bag of space-time features + SVM [Schuldt'04, Niebles'06, Zhang'07]

                              Evaluation
                                                  space-time patches
                          Extraction of
                          local features




                                            K-means                    Evaluation
                Occurrence histogram        clustering
                   of visual words          (k=4000)
 Non-linear                                                       Feature
SVM with χ2                                                      description
  kernel
                                             Feature
                                           quantization
Local feature detectors/descriptors

•   Four types of detectors:
     – Harris3D [Laptev'05]
     – Cuboids [Dollar'05]
     – Hessian [Willems'08]
     – Regular dense sampling

•   Four different types of descriptors:
     – HoG/HoF [Laptev'08]
     – Cuboids [Dollar'05]
     – HoG3D [Kläser'08]
     – Extended SURF [Willems'08]

Illustration of ST detectors

Harris3D              Hessian




Cuboid                Dense
Dataset: KTH-Actions

• 6 action classes by 25 persons in 4 different scenarios
• Total of 2391 video samples
• Performance measure: average accuracy over all classes
UCF-Sports -- samples
•   10 different action classes
•   150 video samples in total
                - We extend the dataset by flipping videos
•   Evaluation method: leave-one-out
•   Performance measure: average accuracy over all classes

          Diving             Kicking          Walking




        Skateboarding   High-Bar-Swinging   Golf-Swinging
Dataset: Hollywood2
• 12 different action classes from 69 Hollywood movies
• 1707 video sequences in total
• Separate movies for training / testing
• Performance measure: mean average precision (mAP) over all classes


       GetOutCar          AnswerPhone            Kiss




        HandShake            StandUp            DriveCar
KTH-Actions -- results

                                     Detectors
                          Harris3D   Cuboids   Hessian   Dense
Descriptors:
              HOG3D        89.0%      90.0%     84.6%    85.3%
              HOG/HOF      91.8%      88.7%     88.7%    86.1%
              HOG          80.9%      82.3%     77.7%    79.0%
              HOF          92.1%      88.2%     88.6%    88.0%
              Cuboids        -        89.1%       -        -
              E-SURF         -          -       81.4%      -


       •       Best results for sparse Harris3D + HOF
       •       Good results for Harris3D and Cuboid detectors with
               HOG/HOF and HOG3D descriptors
       •       Dense features perform relatively poorly compared to
               sparse features
UCF-Sports -- results

                                    Detectors
                          Harris3D   Cuboids   Hessian   Dense
Descriptors:
              HOG3D        79.7%      82.9%     79.0%    85.6%
              HOG/HOF      78.1%      77.7%     79.3%    81.6%
              HOG          71.4%      72.7%     66.0%    77.4%
              HOF          75.4%      76.7%     75.3%    82.6%
              Cuboids        -        76.6%       -        -
              E-SURF         -          -       77.3%      -


       •       Best results for Dense + HOG3D
       •       Good results for Dense and HOG/HOF
       •       Cuboids: good performance with HOG3D
Hollywood2 -- results

                                     Detectors
                          Harris3D   Cuboids   Hessian   Dense
Descriptors:
              HOG3D        43.7%      45.7%     41.3%    45.3%
              HOG/HOF      45.2%      46.2%     46.0%    47.4%
              HOG          32.8%      39.4%     36.2%    39.4%
              HOF          43.3%      42.9%     43.0%    45.5%
              Cuboids        -        45.0%       -        -
              E-SURF         -          -       38.2%      -



              • Best results for Dense + HOG/HOF
              • Good results for HOG/HOF
Evaluation summary

•   Dense sampling consistently outperforms all the tested sparse
    features in realistic settings (UCF + Hollywood2)
        - Importance of realistic video data
        - Limitations of current feature detectors
        - Note: large number of features (15-20 times more)
•   Sparse features provide more or less similar results (sparse
    features better than Dense on KTH)
•   Descriptors' performance:
        - A combination of gradients + optical flow seems a good choice
        (HOG/HOF & HOG3D)
How to improve BoF classification?
Actions are about people:
       why not try to combine BoF with person detection?

                           Detect and track people
                                                       Compute BoF on
                                                       person-centered
                                                       grids:
                                                       2x2, 3x2, 3x3...


                                                           Surprise!
How to improve BoF classification?
2nd attempt:
    •     Do not remove background
    •     Improve local descriptors with region-level information

  Local features
                       ambiguous          Features with
                        features          disambiguated       Visual
                                              labels        Vocabulary




    Regions
          R2                                                 Histogram
                                                           representation

     R1                                      R1      R1
                                                                SVM
                                                 R2         Classification
                        R3
Video Segmentation

• Spatio-temporal grids                        R3

                                               R2
                                               R1

• Static action detectors [Felzenszwalb'08]
   – Trained from ~100 web-images per class




• Object and Person detectors (upper body)
  [Felzenszwalb'08]
Video Segmentation
        g
Hollywood-2 action classification

 Attributed feature                                        Performance (meanAP)
 BoF                                                              48.55
 Spatiotemporal grid, 24 channels                                 51.83
 Motion segmentation                                              50.39
 Upper body                                                       49.26
 Object detectors                                                 49.89
 Action detectors                                                 52.77
 Spatiotemporal grid + Motion segmentation                        53.20
 Spatiotemporal grid + Upper body                                 53.18
 Spatiotemporal grid + Object detectors                           52.97
 Spatiotemporal grid + Action detectors                           55.72
 Spatiotemporal grid + Motion segmentation + Upper body           55.33
   + Object detectors + Action detectors
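The table combines several feature channels; a common way to do this in the
BoF literature is a multi-channel Gaussian-chi-square kernel. The exact
combination behind these numbers is not preserved in this extraction, so the
following is a hedged reconstruction:

    K(x_i, x_j) = \exp\left( -\sum_{c \in C} \frac{1}{A_c} \, D_c(x_i^c, x_j^c) \right)

where D_c is the chi-square distance between the channel-c histograms of the
two videos and A_c normalizes by the mean distance of channel c over the
training set.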
Actions in Context                   (CVPR 2009)

 Human actions are frequently correlated with particular scene classes
  Reasons: physical properties and particular purposes of scenes

            Eating -- kitchen                  Eating -- cafe

             Running -- road                Running -- street
Mining scene captions

                             ILSA
              I wish I didn't love you so much.
01:22:00
01:22:03   She snuggles closer to Rick.

                                                             CUT TO:

           EXT. RICK'S CAFE - NIGHT

           Laszlo and Carl make their way through the darkness toward a
           side entrance of Rick's. They run inside the entryway.

           The headlights of a speeding police car sweep toward them.

           They flatten themselves against a wall to avoid detection.

           The lights move past them.

                              CARL
01:22:15     I think we lost them.
01:22:17     …
Mining scene captions
INT. TRENDY RESTAURANT - NIGHT
INT. MARSELLUS WALLACE’S DINING ROOM - MORNING
EXT. STREETS BY DORA’S HOUSE - DAY
INT. MELVIN'S APARTMENT, BATHROOM - NIGHT
EXT. NEW YORK CITY STREET NEAR CAROL'S RESTAURANT - DAY
INT. CRAIG AND LOTTE'S BATHROOM - DAY



• Maximize word frequency:        street, living room, bedroom, car, ...

• Merge words with similar senses using WordNet:
                                  taxi -> car, cafe -> restaurant

• Measure correlation of words with actions (in scripts) and
  re-sort words by the entropy of P = p(action | word)
  (a minimal sketch follows below)
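A minimal sketch of the entropy-based re-sorting (names and the counting
convention are illustrative assumptions): from script co-occurrence counts,
form P = p(action | word) for each word and rank words by increasing entropy,
so that the most action-specific words come first.

    import numpy as np

    def rank_words_by_entropy(counts, words):
        # counts: (n_words, n_actions) co-occurrence counts mined from scripts.
        p = counts / counts.sum(axis=1, keepdims=True)    # P = p(action | word)
        entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
        order = np.argsort(entropy)                       # most peaked words first
        return [words[i] for i in order]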
Co-occurrence of actions and scenes in scripts

Co-occurrence of actions and scenes in text vs. video
Automatic gathering of relevant scene classes
            and visual samples

                                 Source:
                                 69 movies
                                 aligned with
                                 the scripts

                                 Hollywood-2
                                 dataset is on-line:
                                 http://www.irisa.fr/vista/actions/hollywood2
Results: actions and scenes (separately)
Classification with the help of context

           Action classification score
           Scene classification score
           Weight, estimated from text
           New action score
           (the slide’s equations are reconstructed below)
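The slide's formulas did not survive extraction; a plausible, hedged
reconstruction of the linear context model is:

    s'_a(x) = s_a(x) + \sum_{c} w_{a,c} \, s_c(x)

where s_a(x) is the action classification score on video x, s_c(x) the scene
classification score for scene class c, and w_{a,c} the action-scene weight
estimated from script text.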
Results: actions and scenes (jointly)

Actions in the context of Scenes

Scenes in the context of Actions
Weakly-Supervised Temporal Action Annotation

     Answer two questions: WHAT actions occur, and WHEN do they happen?

     Knock on the door                Fight                       Kiss

     Goal: train visual action detectors and annotate actions with
      minimal manual supervision
WHAT actions?
    Automatic discovery of action classes in text (movie scripts)
       -- Text processing:
                                          Part of Speech (POS) tagging;
                                          Named Entity Recognition (NER);
                                          WordNet pruning; Visual Noun filtering

       -- Search for action patterns (a matching sketch follows the table)
Person+Verb                Person+Verb+Prep.               Person+Verb+Prep.+Vis.Noun
3725 /PERSON .* is         989 /PERSON .* looks .* at      41 /PERSON .* sits .* in .* chair
2644 /PERSON .* looks      384 /PERSON .* is .* in         37 /PERSON .* sits .* at .* table
1300 /PERSON .* turns      363 /PERSON .* looks .* up      31 /PERSON .* sits .* on .* bed
 916 /PERSON .* takes      234 /PERSON .* is .* on         29 /PERSON .* sits .* at .* desk
 840 /PERSON .* sits       215 /PERSON .* picks .* up      26 /PERSON .* picks .* up .* phone
 829 /PERSON .* has        196 /PERSON .* is .* at         23 /PERSON .* gets .* out .* car
 807 /PERSON .* walks      139 /PERSON .* sits .* in       23 /PERSON .* looks .* out .* window
 701 /PERSON .* stands     138 /PERSON .* is .* with       21 /PERSON .* looks .* around .* room
 622 /PERSON .* goes       134 /PERSON .* stares .* at     18 /PERSON .* is .* at .* desk
 591 /PERSON .* starts     129 /PERSON .* is .* by         17 /PERSON .* hangs .* up .* phone
 585 /PERSON .* does       126 /PERSON .* looks .* down    17 /PERSON .* is .* on .* phone
 569 /PERSON .* gets       124 /PERSON .* sits .* on       17 /PERSON .* looks .* at .* watch
 552 /PERSON .* pulls      122 /PERSON .* is .* of         16 /PERSON .* sits .* on .* couch
 503 /PERSON .* comes      114 /PERSON .* gets .* up       15 /PERSON .* opens .* of .* door
 493 /PERSON .* sees       109 /PERSON .* sits .* at       15 /PERSON .* walks .* into .* room
 462 /PERSON .* are/VBP    107 /PERSON .* sits .* down     14 /PERSON .* goes .* into .* room
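A minimal sketch of the pattern search over processed script sentences,
assuming person names have already been replaced by a /PERSON token. The
vocabulary of verbs, prepositions and visual nouns below is a small
illustrative subset mirroring the table, not the original pipeline.

    import re
    from collections import Counter

    # /PERSON followed by a verb, optionally a preposition and a visual noun,
    # with arbitrary words (.*) in between, as in the patterns above.
    PATTERN = re.compile(r"/PERSON .* (sits|looks|walks|gets|picks)"
                         r"(?: .* (at|in|on|up|out))?"
                         r"(?: .* (chair|table|bed|desk|phone|car))?")

    def mine_action_patterns(sentences):
        counts = Counter()
        for s in sentences:
            m = PATTERN.search(s)
            if m:
                counts[" ".join(g for g in m.groups() if g)] += 1
        return counts.most_common()

    # e.g. mine_action_patterns(["/PERSON slowly sits down at the table ."])
    # -> [("sits at table", 1)]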
WHEN: Video Data and Annotation

 Want to target realistic video data
 Want to avoid manual video annotation for training

       Use movies + scripts for automatic annotation of training samples

       Scripts only bound an action between neighboring timestamps
       (e.g. 24:25 ... 24:51): temporal uncertainty!
Overview

   Input:
     • Action type, e.g. Person Opens Door
     • Videos + aligned scripts

   Pipeline: automatic collection of training clips ->
   clustering of positive segments -> training a classifier

   Output: sliding-window-style temporal action localization
Action clustering
       [Lihi Zelnik-Manor and Michal Irani, CVPR 2001]

Descriptor space -> spectral clustering -> clustering results
                                           vs. ground truth
(a minimal clustering sketch follows)
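For reference, a minimal sketch of spectral clustering on action descriptors,
using a generic Gaussian affinity (the cited paper's self-tuning affinities
are more sophisticated; sigma and the function names here are assumptions):

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def cluster_actions(descriptors, n_clusters, sigma=1.0):
        # Pairwise squared distances between descriptors -> Gaussian affinity.
        d2 = ((descriptors[:, None, :] - descriptors[None, :, :]) ** 2).sum(-1)
        affinity = np.exp(-d2 / (2 * sigma ** 2))
        model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
        return model.fit_predict(affinity)       # cluster label per sample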
Action clustering

    Complex data: standard clustering methods do not work on this data
Action clustering
                Our view of the problem

Feature space                             Video space

      Nearest-neighbor                Negative samples: random video
      solution: wrong!                samples - lots of them, with a
                                      very low chance of being positives
Action clustering
                Formulation
                                          [Xu et al. NIPS’04]
                                          [Bach & Harchaoui NIPS’07]

      Discriminative cost in feature space =
            loss on positive samples + loss on negative samples,
      with fixed negative samples and parameterized positive samples

      Optimization: alternate between the SVM solution for the classifier
      and coordinate descent on the positive-sample parameters
      (a hedged reconstruction of the cost follows)
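The cost function on the slide is an image that did not survive extraction;
a hedged reconstruction, in the spirit of the cited discriminative clustering
formulations, is:

    \min_{w, b, \{t_i\}} \; \frac{1}{2}\|w\|^2
      + C_{+} \sum_i \max\!\left(0,\; 1 - w^\top \Phi(x_i(t_i)) - b\right)
      + C_{-} \sum_j \max\!\left(0,\; 1 + w^\top \Phi(z_j) + b\right)

where x_i(t_i) is the window at (optimized) temporal position t_i inside
script-selected clip i, and z_j are random negative samples. One alternates
the standard SVM solution for (w, b) with coordinate descent on {t_i}.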
Clustering results
Drinking actions in Coffee and Cigarettes
Detection results
           Drinking actions in Coffee and Cigarettes

  Training a Bag-of-Features classifier
  Temporal sliding-window classification
  Non-maximum suppression

    Detection trained on simulated clusters

                                              Test set:
                                                   • 25 min from “Coffee and
                                                     Cigarettes” with 38
                                                     ground-truth drinking actions
Detection results
           Drinking actions in Coffee and Cigarettes

  Training a Bag-of-Features classifier
  Temporal sliding-window classification
  Non-maximum suppression (a detection sketch follows below)

    Detection trained on automatic clusters

                                              Test set:
                                                   • 25 min from “Coffee and
                                                     Cigarettes” with 38
                                                     ground-truth drinking actions
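A minimal sketch of temporal sliding-window classification with non-maximum
suppression; the window lengths, stride and overlap threshold are illustrative
assumptions, not the values used in the experiments:

    import numpy as np

    def temporal_detection(score_fn, n_frames, lengths=(40, 80, 120),
                           stride=10, overlap=0.5):
        # Score every temporal window with a trained classifier (score_fn),
        # then greedily keep the strongest windows whose temporal overlap
        # with already-kept windows stays below the threshold (NMS).
        windows = [(t, t + L) for L in lengths
                   for t in range(0, n_frames - L + 1, stride)]
        scored = sorted(((score_fn(b, e), b, e) for b, e in windows),
                        reverse=True)
        kept = []
        for s, b, e in scored:
            ok = True
            for _, kb, ke in kept:
                inter = max(0, min(e, ke) - max(b, kb))
                union = max(e, ke) - min(b, kb)
                if inter / union > overlap:
                    ok = False
                    break
            if ok:
                kept.append((s, b, e))
        return kept          # (score, begin frame, end frame) detections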
Detection results
“Sit Down” and “Open Door” actions in ~5 hours of movies
Temporal detection of “Sit Down” and “Open Door” actions in movies:
        The Graduate, The Crying Game, Living in Oblivion
Conclusions

   Bag-of-words models are currently dominant; structure
    (human poses, etc.) should be integrated.

   The vocabulary of actions is not well-defined – it depends
    on the goal and the task.

   Actions should be used for the functional interpretation
    of the visual world.

CVML2011: human action recognition (Ivan Laptev)

  • 1. ENS/INRIA Visual Recognition and Machine Learning Summer School 25-29 July, Paris France School, 25 29 July Paris, Human Action Recognition Ivan Laptev ivan.laptev@inria.fr i l t @i i f INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548 Laboratoire d’Informatique, Ecole Normale Supérieure, Paris d Informatique, Includes slides from: Alyosha Efros, Mark Everingham and Andrew Zisserman
  • 2. Lecture overview Motivation Historic review Applications and challenges Human Pose Estimation Pictorial structures Recent advances Appearance-based methods pp Motion history images Active shape models & Motion priors Motion-based methods Generic and parametric Optical Flow Motion templates Space-time methods p Space-time features Training with weak supervision
  • 3. Motivation I: Artistic Representation Early studies were motivated by human representations in Arts Da Vinci: “it is indispensable for a painter, to become totally familiar with the anatomy of nerves, bones, muscles, and sinews, such that he understands y , , , , for their various motions and stresses, which sinews or which muscle causes a particular motion” “I ask for the weight [pressure] of this man for every segment of motion I when climbing those stairs, and for the weight he places on b and on c. Note the vertical line below the center of mass of this man.” Leonardo da Vinci (1452–1519): A man going upstairs, or up a ladder.
  • 4. Motivation II: Biomechanics  The emergence of biomechanics  Borelli applied to biology the analytical and geometrical methods, developed by Galileo Galilei  He was the first to understand that bones serve as levers and muscles function according to mathematical p principles p  His physiological studies included muscle analysis and a mathematical discussion of movements, such as running or jumping Giovanni Alfonso Borelli (1608–1679)
  • 5. Motivation III: Motion perception Etienne-Jules Etienne Jules Marey: (1830–1904) made Chronophotographic experiments influential for the emerging field of cinematography Eadweard Muybridge (1830–1904) invented a machine for displaying the recorded series of images. He pioneered motion pictures and applied his technique to movement studies
  • 6. Motivation III: Motion perception Gunnar Johansson [1973] pioneered studies on the use of image [ ]p g  sequences for a programmed human motion analysis “Moving Light Displays (LED) enable identification of familiar people Moving Displays”  and the gender and inspired many works in computer vision. Gunnar Johansson, Perception and Psychophysics, 1973
  • 7. Human actions: Historic overview 15th century  studies of anatomy t  17th century emergence of biomechanics 19th century  emergence of cinematography  1973 studies of human motion perception Modern computer vision M d t i i
  • 8. Modern applications: Motion capture and animation Avatar (2009)
  • 9. Modern applications: Motion capture and animation Leonardo da Vinci (1452–1519) Avatar (2009)
  • 10. Modern applications: Video editing Space-Time Video Completion Y. Wexler, E. Shechtman and M. Irani, CVPR 2004
  • 11. Modern applications: Video editing Recognizing Action at a Distance Alexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik, ICCV 2003
  • 12. Modern applications: Video editing Recognizing Action at a Distance Alexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik, ICCV 2003
  • 13. Why Action Recognition?  Video indexing and search is useful in TV production, entertainment, education, social studies, security,… Home videos: e.g. TV & Web: “My e.g. daughter “Fight in a climbing” parlament” Sociology research: Manually Surveillance: analyzed smoking 260K views actions in in 7 days on 900 movies i YouTube
  • 14. How action recognition is related to computer vision? Sky Sk Street sign Car Car Car Car Car Car Road
  • 15. We can recognize cars and roads, g , What’s next? 12,184,113 images, 17624 synsets
  • 16.
  • 17. Airplane A plain has crashed, the cabin is broken, somebody is likely to be injured or dead dead.
  • 18. cat woman trash bi t h bin
  • 19.
  • 20. Vision is person-centric: We mostly care about things which are important to us, people  Actions of people reveal the function of objects p p j  Future challenges: - Function: What can I do with this and how? - Prediction: What can happen if someone does that? - Recognizing goals: What this person is trying to do?
  • 21. How many person-pixels are there? person- Movies TV YouTube Y T b
  • 22. How many person-pixels are there? person- Movies TV YouTube
  • 23. How many person-pixels are there? person- 35% 34% Movies TV 40% YouTube
  • 24. How much data do we have?  Huge amount of video is available and growing TV-channels recorded since 60’s 60 s >34K hours of video upload every day ~30M surveillance cameras in US => ~700K video hours/day 700K If we want to interpret this data, we should better understand what person-pixels are telling us!
  • 25. Lecture overview Motivation Historic review Applications and challenges Human Pose Estimation Pictorial structures Recent advances Appearance-based methods pp Motion history images Active shape models & Motion priors Motion-based methods Generic and parametric Optical Flow Motion templates Space-time methods p Space-time features Training with weak supervision
  • 26. Lecture overview Motivation Historic review Applications and challenges Human Pose Estimation Pictorial structures Recent advances Appearance-based methods pp Motion history images Active shape models & Motion priors Motion-based methods Generic and parametric Optical Flow Motion templates Space-time methods p Space-time features Training with weak supervision
  • 27. Objective and motivation Determine human body pose (layout) Why? To recognize poses, gestures, actions
  • 32. Challenges: of ( g (almost) unconstrained images ) g varying illumination and low contrast; moving camera and background; i ill i ti dl t t i db k d multiple people; scale changes; extensive clutter; any clothing
  • 33. Pictorial Structures • Intuitive model of an object • Model has two components 1. parts ( pa s (2D image fragments) age ag e s) 2. structure (configuration of parts) • Dates back to Fischler & Elschlager 1973
  • 34. Long tradition of using p g g pictorial structures for humans Finding People by Sampling Ioffe & Forsyth, ICCV 1999 y , Pictorial Structure Models for Object Recognition Felzenszwalb & Huttenlocher, 2000 Learning to Parse Pictures of People Ronfard, Schmid & Triggs, ECCV 2002
  • 35. Felzenszwalb & Huttenlocher NB: requires background subtraction
  • 38. Objective: detect human and determine upper body pose (layout) si f a2 a1
  • 39. Pictorial structure model – CRF si f a2 a1
  • 40. Complexity p y si f a2 a1
  • 41. Are trees the answer? He T left right UA UA LA LA Ha Ha • With n parts and h possible discrete locations per part, O(hn) • For a tree, using dynamic programming this reduces to O(nh2) • If model is a tree and has certain edge costs, then complexity g p y reduces to O(nh) using a distance transform [Felzenszwalb & Huttenlocher, 2000, 2005]
  • 42. Are trees the answer? He T left right UA UA LA LA Ha Ha • With n parts and h possible discrete locations per part, O(hn) • For a tree, using dynamic programming this reduces to O(nh2) • If model is a tree and has certain edge costs, then complexity g p y reduces to O(nh) using a distance transform [Felzenszwalb & Huttenlocher, 2000, 2005]
  • 43. Kinematic structure vs graphical (independence) structure Graph G = (V,E) He T He T left right left right UA UA UA UA LA LA LA LA Requires more Ha Ha Ha Ha connections than a tree
  • 44. More recent work on human pose estimation p D. Ramanan. Learning to parse images of articulated bodies. NIPS, bodies NIPS 2007 Learn image and person-specific unary terms • g initial iteration  edges • following iterations  edges & colour V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. Proc. CVPR, estimation In Proc CVPR 2008/2009 (Almost) unconstrained images • Person detector & foreground highlighting g g g g VP. Buehler, M. Everingham and A. Zisserman. , g Learning sign language by watching TV. In Proc. CVPR 2009 Learns with weak t t l annotation L ith k textual t ti • Multiple instance learning
  • 45. Pose estimation is a very active research area y Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In Proc. CVPR 2011 Extension of LSVM model of Felzenszwalb et al. Y. Wang, D. Tran and Z. Liao. Learning Hierarchical Poselets for Human Parsing. In Proc. CVPR 2011. Builds on Poslets idea of Bourdev et al. S. Johnson and M. Everingham. Learning Effective Human Pose Estimation from Inaccurate Annotation. In Proc. CVPR 2011. Learns from lots of noisy annotations B. Sapp, D.Weiss and B. Taskar. Parsing Human Motion with Stretchable Models. In Proc. CVPR 2011. Explores temporal continuity
  • 46. Pose estimation is a very active research area y J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman and A. Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. Best paper award at CVPR 2011 Exploits lots of synthesized depth images for training
  • 47. Pose Search Q V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Proc. CVPR2009
  • 48. Pose Search Q V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Zisserman Progressive search space reduction for human pose estimation. In Proc. CVPR2009
  • 49. Application Learning sign language by watching TV g g g g y g (using weakly aligned subtitles) Patrick Buehler Mark Everingham Andrew Zi A d Zisserman CVPR 2009
  • 50. Objective Learn signs in British Sign Language (BSL) corresponding to text words: • Training data from TV broadcasts with simultaneous signing • Supervision solely from sub-titles Output: automatically learned signs (4x slow motion) Input: video + subtitle Office Government Use subtitles to find video sequences containing word. These are the positive training sequences. Use other sequences as negative training sequences.
  • 51. Given an English word e.g. “tree” what is the corresponding British Sign Language sign? positive sequences negative set
  • 52. Use sliding window to choose sub- sequence of p q poses in one ppositive sequence and determine if 1st sliding window same sub-sequence of poses occurs in other positive sequences somewhere, but does not occur in the negative set positive sequences negative set
  • 53. Use sliding window to choose sub- sequence of p q poses in one ppositive sequence and determine if 5th sliding window same sub-sequence of poses occurs in other positive sequences somewhere, but does not occur in the negative set positive sequences negative set
  • 54. Multiple instance learning p g Negative Positive bag bags sign of interest i t t
  • 55. Example p Learn signs in British Sign Language ( g g g g (BSL) corresponding to ) p g text words.
  • 56. Evaluation Good results for a variety of signs: Signs where g Signs where g Signs where Signs which Signs which hand movement hand shape both hands are finger-- are perfomed in is important is important are together spelled front of the face Navy Lung Fungi Kew Whale Prince Garden Golf Bob Rose
  • 57. What is missed? truncation is not modelled
  • 58. What is missed? occlusion is not modelled
  • 59. Modelling p g person-object-pose interactions j p W. Yang Y. W Yang, Y Wang and Greg Mori. Recognizing Mori Human Actions from Still Images with Latent Poses. In Proc. CVPR 2010. Some limbs may not be important for recognizing a particular action (e.g. sitting) B. Y B Yao and L F i F i M d li M t l d L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human- Object Interaction Activities. In Proc. CVPR 2010. Pose estimation helps object detection and vice versa
  • 60. Towards functional object understanding j g A. Gupta, S. Satkin, A.A. Efros and M. Hebert, From 3D Scene Geometry to HumanWorkspace. In Proc. CVPR 2011 p Predicts the “workspace” of a human H. Grabner, J. Gall and L. Van Gool. What Makes a Chair a Chair? In Proc. CVPR 2011
  • 61. Conclusions: Human poses p  Exciting progress in pose estimation in realistic still images and video. g  Industry strength Industry-strength pose estimation from depth sensors  Pose estimation from RGB is still very challenging  Human Poses ≠ Human Actions!
  • 62. Lecture overview Motivation Historic review Applications and challenges Human Pose Estimation Pictorial structures Recent advances Appearance-based methods pp Motion history images Active shape models & Motion priors Motion-based methods Generic and parametric Optical Flow Motion templates Space-time methods p Space-time features Training with weak supervision
  • 63. Lecture overview Motivation Historic review Applications and challenges Human Pose Estimation Pictorial structures Recent advances Appearance-based methods pp Motion history images Active shape models & Motion priors Motion-based methods Generic and parametric Optical Flow Motion templates Space-time methods p Space-time features Training with weak supervision
  • 64. Foreground segmentation Image differencing: a simple way to measure motion/change - > Const Better B k B Background / F d Foreground separation methods exist: d i h d i  Modeling of color variation at each p g pixel with Gaussian Mixture  Dominant motion compensation for sequences with moving camera  Motion layer separation for scenes with non-static backgrounds
  • 65. Temporal Templates Idea: summarize motion in video in a Motion History Image (MHI): Descriptor: Hu moments of different orders p [A.F. Bobick and J.W. Davis, PAMI 2001]
  • 66. Aerobics dataset Nearest Neighbor classifier: 66% accuracy
  • 67. Temporal Templates: Summary Pros: + Si l and f t Simple d fast Not all shapes are valid + Works in controlled settings Restrict the space of admissible silhouettes Cons: - Prone to errors of background subtraction g Variations in light, shadows, clothing… What is the background here? - Does not capture interior motion and shape Silhouette tells little about actions
  • 68. Active Shape Models [Cootes et al.] [Cootes al.]  Constrains shape deformation in PCA-projected space Example: face alignment Illustration of face shape space Active Shape Models: Their Training and Application T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, CVIU 1995 , y , p , ,
  • 69. Person Tracking Learning flexible models from image sequences A. Baumberg and D. Hogg, ECCV 1994
  • 70. Learning dynamic prior  Dynamic model: 2nd order Auto-Regressive Process y g State Update rule: Model parameters: Learning scheme:
  • 71. Learning dynamic prior Random simulation of the Learning point sequence learned dynamical model Statistical models of visual shape and motion A. Blake, B. Bascle, M. Isard and J. MacCormick, Phil.Trans.R.Soc. 1998
  • 72. Learning dynamic prior Random simulation of th l R d i l ti f the learned gate d d t dynamics i
  • 73. Motion priors  C Constrain t t i temporal evolution of shape l l ti f h  Help accurate tracking  Recognize actions  G l formulate motion models f different types of actions Goal: f l t ti d l for diff tt f ti and use such models for action recognition Example: Drawing with 3 action g modes line drawing scribbling idle [M. Isard and A. Blake, ICCV 1998]
• 74. Lecture overview Motivation Historic review Applications and challenges Human Pose Estimation Pictorial structures Recent advances Appearance-based methods Motion history images Active shape models & Motion priors Motion-based methods Generic and parametric Optical Flow Motion templates Space-time methods Space-time features Training with weak supervision
• 76. Shape and Appearance vs. Motion  Shape and appearance in images depend on many factors: clothing, illumination, contrast, image resolution, etc. [Efros et al. 2003]  The motion field (in theory) is invariant to shape and can be used directly to describe human actions
• 77. Shape and Appearance vs. Motion. Moving Light Displays. Gunnar Johansson, Perception and Psychophysics, 1973
• 78. Motion estimation: Optical Flow  Classic problem of computer vision [Gibson 1955]  Goal: estimate the motion field. How? We only have access to image pixels: estimate pixel-wise correspondence between frames = Optical Flow  Brightness constancy assumption: corresponding pixels preserve their intensity (color)  Useful assumption in many cases  Breaks at occlusions and illumination changes  Physical and visual motion may be different
• 79. Generic Optical Flow  Brightness Change Constraint Equation (BCCE), relating the optical flow to the image gradient: one equation, two unknowns => cannot be solved directly  Integrate several measurements in a local neighborhood and obtain a Least Squares Solution [Lucas & Kanade 1981]. The second-moment matrix that appears here is the same one used to compute Harris interest points! (Integration is over a spatial, or spatio-temporal, neighborhood of a point.)
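The slide's formulas were lost in extraction; written out in the standard form they are:

```latex
% Brightness Change Constraint Equation (BCCE):
\nabla I^{\top} \mathbf{v} + I_t = 0, \qquad \mathbf{v} = (v_x, v_y)^{\top}
% One equation, two unknowns. Lucas & Kanade (1981): integrate over a
% neighborhood w and solve in the least-squares sense:
\mathbf{v} = M^{-1}\mathbf{b}, \qquad
M = \sum_{w} \nabla I \,\nabla I^{\top}, \qquad
\mathbf{b} = -\sum_{w} I_t \,\nabla I
% M is the second-moment matrix, the same one used by the Harris detector.
```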
• 80. Parameterized Optical Flow  Another extension of the constant motion model is to compute a PCA basis of flow fields from training examples: 1. Compute standard Optical Flow for many examples 2. Put the velocity components into one vector per example 3. Do PCA and obtain the most informative PCA flow basis vectors. Learning Parameterized Models of Image Motion. M.J. Black, Y. Yacoob, A.D. Jepson and D.J. Fleet, CVPR 1997
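A sketch of steps 1-3, assuming `flows` is a list of dense (H, W, 2) flow fields from aligned training clips (e.g. from cv2.calcOpticalFlowFarneback); the function names are illustrative.

```python
import numpy as np

def pca_flow_basis(flows, n_bases=8):
    X = np.stack([f.reshape(-1) for f in flows])  # one flow vector per example
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_bases]                     # most informative flow bases

def flow_coefficients(flow, mean, bases):
    # Projection coefficients: the per-frame action descriptor of slide 81.
    return bases @ (flow.reshape(-1) - mean)
```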
• 81. Parameterized Optical Flow  Estimated coefficients of the PCA flow bases, plotted over frame numbers, can be used as action descriptors  Optical flow seems to be an interesting descriptor for motion/action recognition
• 82. Spatial Motion Descriptor. Pipeline: image frame -> optical flow F(x, y) -> components Fx, Fy -> half-wave rectified channels Fx+, Fx-, Fy+, Fy- -> blurred channels. A.A. Efros, A.C. Berg, G. Mori and J. Malik. Recognizing Action at a Distance. In Proc. ICCV 2003
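A sketch of these channels, assuming `flow` is a person-centered (H, W, 2) optical-flow field; the function name and blur sigma are mine.

```python
import cv2
import numpy as np

def spatial_motion_descriptor(flow, sigma=3.0):
    fx, fy = flow[..., 0], flow[..., 1]
    channels = [np.maximum(fx, 0), np.maximum(-fx, 0),   # Fx+, Fx-
                np.maximum(fy, 0), np.maximum(-fy, 0)]   # Fy+, Fy-
    # Blurring makes the channels robust to small misalignments.
    blurred = [cv2.GaussianBlur(c, (0, 0), sigma) for c in channels]
    return np.stack(blurred, axis=-1)
```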
• 83. Spatio-Temporal Motion Descriptor. Two sequences A and B are compared over a temporal extent E: blurring the frame-to-frame similarity matrix with a blurry identity matrix I yields the motion-to-motion similarity matrix.
• 84. Football Actions: matching. Input sequence vs. matched frames.
  • 85. Football Actions: classification 10 actions; 4500 total frames; 13-frame motion descriptor
• 86. Classifying Ballet Actions. 16 actions; 24800 total frames; 51-frame motion descriptor. Men used to classify women and vice versa.
• 87. Classifying Tennis Actions. 6 actions; 4600 frames; 7-frame motion descriptor. The woman player is used for training, the man for testing.
  • 88. Where are we so far?
• 89. Lecture overview Motivation Historic review Modern applications Human Pose Estimation Pictorial structures Recent advances Appearance-based methods Motion history images Active shape models & Motion priors Motion-based methods Generic and parametric Optical Flow Motion templates Space-time methods Space-time features Training with weak supervision
• 90. Goal: interpret complex dynamic scenes. Common methods: • segmentation? • tracking? Common problems: • complex & changing background • changing appearance  Make no global assumptions about the scene
• 91. Space-time. No global assumptions  Consider local spatio-temporal neighborhoods (e.g. hand waving, boxing)
• 92. Actions == space-time objects?
  • 93. Local approach: Bag of Visual Words Airplanes Motorbikes Faces Wild Cats Leaves People Bikes
• 95. Space-Time Interest Points: Detection. What neighborhoods to consider? Distinctive neighborhoods have high image variation in space and time, so look at the distribution of the gradient. Definitions: the original image sequence, a space-time Gaussian with covariance, Gaussian derivatives, the space-time gradient, and the second-moment matrix.
• 96. Space-Time Interest Points: Detection. Properties of the second-moment matrix μ: it defines a second-order approximation of the local distribution of the gradient within a neighborhood.  1D space-time variation of the gradient, e.g. a moving bar  2D space-time variation, e.g. a moving ball  3D space-time variation, e.g. a jumping ball. Large eigenvalues of μ can be detected by the local maxima of H over (x, y, t), similar to the Harris operator [Harris and Stephens, 1988]:
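The operator itself was an image on the slide; the standard space-time Harris criterion from this line of work reads:

```latex
% Space-time second-moment matrix: Gaussian-weighted average of gradient
% outer products, at spatial scale \sigma_i and temporal scale \tau_i:
\mu = g(\cdot;\sigma_i^2,\tau_i^2) * \big(\nabla L \,\nabla L^{\top}\big),
\qquad \nabla L = (L_x, L_y, L_t)^{\top}
% Interest points = local maxima over (x, y, t) of the extended Harris function
H = \det(\mu) - k \operatorname{trace}^{3}(\mu), \qquad k \text{ a small constant}
```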
• 97. Space-Time interest points. Detected events include velocity changes, appearance/disappearance, and split/merge.
• 98. Space-Time Interest Points: Examples. Motion event detection.
• 99. Spatio-temporal scale selection. Stability to size changes, e.g. camera zoom.
• 100. Spatio-temporal scale selection. Selection of temporal scales captures the frequency of events.
  • 101. Local features for human actions
  • 102. Local features for human actions boxing walking hand waving
• 103. Local space-time descriptor: HOG/HOF. Multi-scale space-time patches are described by histograms of oriented spatial gradients (HOG) and histograms of optical flow (HOF): a 3x3x2-cell grid with 4 bins for HOG (72 dimensions) and 5 bins for HOF (90 dimensions). Public code available at www.irisa.fr/vista/actions
• 104. Visual Vocabulary: K-means clustering  Group similar points in the space of image descriptors using K-means clustering  Select significant clusters. Clustering yields words c1, c2, c3, c4, which are then used for classification.
• 106. Local space-time features: Matching  Find similar events in pairs of video sequences
• 107. Action Classification: Overview. Bag of space-time features + multi-channel SVM [Laptev'03, Schuldt'04, Niebles'06, Zhang'07]: collection of space-time patches -> HOG & HOF patch descriptors -> histogram of visual words -> multi-channel SVM classifier.
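A compact sketch of this pipeline with scikit-learn; the helper names and the input format (one array of local HOG/HOF descriptors per video) are assumptions for illustration, and the multi-channel combination is reduced here to a single channel.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def bof_histogram(descriptors, vocab):
    words = vocab.predict(descriptors)                 # assign visual words
    h = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return h / max(h.sum(), 1.0)                       # normalized histogram

def train_bof_svm(train_desc, train_labels, k=4000):
    # train_desc: list of (n_i, d) descriptor arrays, one per video.
    vocab = KMeans(n_clusters=k).fit(np.vstack(train_desc))
    H = np.array([bof_histogram(d, vocab) for d in train_desc])
    svm = SVC(kernel=chi2_kernel).fit(H, train_labels)  # chi^2-kernel SVM
    return vocab, svm
```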
• 108. Action recognition in the KTH dataset. Sample frames from the KTH action sequences; all six classes (columns) and scenarios (rows) are presented.
  • 109. Classification results on KTH dataset Confusion matrix for KTH actions
  • 110. What about 3D? Local motion and appearance features are not invariant to view changes
• 111. Multi-view action recognition. Difficult to apply standard multi-view methods:  Do not want to search for multi-view point correspondence - non-rigid motion, clothing changes, … it's hard!  Do not want to identify body parts - current methods are not reliable enough  Yet, we want to learn actions from one view and recognize actions in very different views
• 112. Temporal self-similarities. Idea:  Cross-view matching is hard, but cross-time matching (tracking) is relatively easy  Measure self-(dis)similarities across time. Example: distance matrix / self-similarity matrix (SSM) for tracked points P1, P2.
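A minimal sketch of the SSM computation, assuming one descriptor per frame (joint positions, HOG, or flow); the function name is mine.

```python
import numpy as np

def self_similarity_matrix(feats):
    # feats: (T, d) array, one descriptor per frame.
    # SSM(i, j) = Euclidean distance between frames i and j; the matrix is
    # symmetric with a zero diagonal, and its pattern is stable across views.
    diff = feats[:, None, :] - feats[None, :, :]
    return np.linalg.norm(diff, axis=-1)
```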
• 113. Temporal self-similarities: multiple views. Side view vs. top view: the SSMs appear very similar despite the view change! Intuition: 1. The distance between similar poses is low in any view 2. The distance between different poses is likely to be large in most views
• 114. Temporal self-similarities: MoCap. Self-similarities can be measured from Motion Capture (MoCap) data (shown for person 1 and person 2).
• 115. Temporal self-similarities: Video. Self-similarities can be measured directly from video: HOG or Optical Flow descriptors in image frames.
• 116. Self-similarity descriptor. Goal: define a quantitative measure to compare self-similarity matrices  Define a local SIFT-like histogram descriptor hi for each point on the SSM diagonal  Sequence alignment: Dynamic Programming over two sequences of descriptors {hi}, {hj}  Action recognition: a visual vocabulary for {hi}, a BoF representation, and an SVM
• 118. Multi-view action recognition: Video. SSM-based recognition vs. an alternative view-dependent method (STIP).
  • 119. What are Human Actions? Actions in recent datasets: Is it just about kinematics? Should actions be defined by the purpose? Kinematics + Objects
  • 120. What are Human Actions? Actions in recent datasets: Is it just about kinematics? Should actions be defined by the purpose? Kinematics + Objects + Scenes
• 122. Action recognition in realistic settings. Standard action datasets vs. actions "In the Wild".
• 123. Action Dataset and Annotation. Manual annotation of drinking actions in movies: "Coffee and Cigarettes", "Sea of Love". "Drinking": 159 annotated samples; "Smoking": 149 annotated samples. Temporal annotation: first frame, keyframe, last frame. Spatial annotation: head rectangle, torso rectangle.
• 125. Action representation: histograms of gradient and histograms of optic flow.
• 126. Action learning: boosting selects features and combines weak classifiers. AdaBoost: • an efficient discriminative classifier [Freund & Schapire'97] • good performance for face detection [Viola & Jones'01]. Trained on pre-aligned samples; Haar features with optimal thresholds, histogram features with a Fisher discriminant.
• 127. Key-frame action classifier: boosting over 2D HOG features selects features and combines weak classifiers. AdaBoost: • an efficient discriminative classifier [Freund & Schapire'97] • good performance for face detection [Viola & Jones'01]. Trained on pre-aligned samples; Haar features with optimal thresholds, histogram features with a Fisher discriminant. See [Laptev BMVC'06] for more details. [Laptev, Pérez 2007]
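A minimal scikit-learn sketch of boosting with thresholded single-feature stumps as weak classifiers, in the spirit of the slide; the feature matrix and labels are assumed given (scikit-learn >= 1.2 API).

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_keyframe_classifier(X, y, n_rounds=200):
    # X: (n, d) per-sample 2D HOG features; y: 1 for action keyframes, else 0.
    stump = DecisionTreeClassifier(max_depth=1)  # one optimal-threshold feature
    return AdaBoostClassifier(estimator=stump, n_estimators=n_rounds).fit(X, y)
```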
• 128. Keyframe priming. Training: a positive training sample plus negative training samples taken from false positives of the static HOG action detector; then test.
• 129. Action detection. Test set: • 25 min from "Coffee and Cigarettes" with ground truth: 38 drinking actions • no overlap with the training set in subjects or scenes. Detection: • search over all space-time locations and spatio-temporal extents. Results compared with and without keyframe priming.
• 130. Action Detection (ICCV 2007). Test episodes from the movie "Coffee and Cigarettes". Video available at http://www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html
  • 131. 20 most confident detections
• 132. Learning Actions from Movies • Realistic variation of human actions • Many classes and many examples per class. Problems: • Typically only a few class samples per movie • Manual annotation is very time consuming
• 133. Automatic video annotation with scripts • Scripts available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com … • Subtitles (with time info) are available for most movies • Can transfer time to scripts by text alignment. Example: subtitle 1172 (01:20:17,240 --> 01:20:20,437) "Why weren't you honest with me? Why did you keep your marriage a secret?" matches the script line RICK: "Why weren't you honest with me? Why'd you keep your marriage a secret?"; subtitles 1173-1174 (01:20:20,640 --> 01:20:26,189) "It wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage." match "Rick sits down with Ilsa. ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage."
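A toy sketch of the time-transfer idea. The actual system aligns subtitles and script monotonically (dynamic programming on text similarity); this greedy nearest-match version only illustrates the principle, and all names are mine.

```python
import difflib

def align_script_to_subtitles(script_lines, subs, min_ratio=0.6):
    # subs: list of (start_time, end_time, text); script_lines: dialogue lines.
    aligned = []
    for start, end, sub_text in subs:
        scored = [(difflib.SequenceMatcher(None, line.lower(),
                                           sub_text.lower()).ratio(), line)
                  for line in script_lines]
        ratio, best_line = max(scored)
        if ratio >= min_ratio:            # keep confident matches only
            aligned.append((start, end, best_line, ratio))
    return aligned
```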
• 134. Script-based action annotation - On the good side: • Realistic variation of actions: subjects, views, etc. • Many examples per class, many classes • No extra overhead for new classes • Actions, objects, scenes and their combinations • Character names may be used to resolve "who is doing what?" - Problems: • No spatial localization • Temporal localization may be poor • Missing actions: e.g. scripts do not always follow the movie • Annotation is incomplete, not suitable as ground truth for testing action detection • Large within-class variability of action classes in text
• 135. Script alignment: Evaluation • Annotate action samples in text • Do automatic script-to-video alignment • Check the correspondence of actions in scripts and movies. Example of a "visual false positive": "A black car pulls up, two army officers get out." (a: quality of subtitle-script matching)
• 136. Text-based action retrieval • Large variation of action expressions in text. GetOutCar action: "… Will gets out of the Chevrolet. …", "… Erin exits her new truck…". Potential false positives: "…About to sit down, he freezes…" • => Supervised text classification approach
  • 137. Automatically annotated action samples [Laptev, Marszałek, Schmid, Rozenfeld 2008]
• 138. Hollywood-2 actions dataset. Training and test samples are obtained from 33 and 36 distinct movies, respectively. The Hollywood-2 dataset is on-line: http://www.irisa.fr/vista/actions/hollywood2 [Laptev, Marszałek, Schmid, Rozenfeld 2008]
• 139. Action Classification: Overview. Bag of space-time features + multi-channel SVM [Laptev'03, Schuldt'04, Niebles'06, Zhang'07]: collection of space-time patches -> HOG & HOF patch descriptors -> histogram of visual words -> multi-channel SVM classifier.
  • 140. Action classification (CVPR08) Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”
  • 141. Evaluation of local features for action recognition • Local features provide a popular approach to video description for action recognition: – ~50% of recent action recognition methods (cvpr09, iccv09, bmvc09) are based on local features – Large variety of feature detectors and descriptors is available – Very limited and inconsistent comparison of different features Goal: • Systematic evaluation of local feature-descriptor combinations • Compare performance on common datasets • Propose improvements
  • 142. Evaluation of local features for action recognition • Evaluation study [Wang et al. BMVC’09] – Common recognition framework • Same datasets (varying difficulty): KTH, UCF sports, Hollywood2 • Same train/test data • Same classification method – Alternative local feature detectors and descriptors from recent literature – Comparison of different detector-descriptor combinations
• 143. Action recognition framework. Bag of space-time features + SVM [Schuldt'04, Niebles'06, Zhang'07]: extraction of local features from space-time patches -> feature description -> feature quantization by K-means clustering of visual words (k=4000) -> occurrence histogram -> non-linear SVM with a χ2 kernel.
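The kernel is named but not written out on the slide; the form commonly used with these word histograms h, h' is:

```latex
% \chi^2 kernel between visual-word histograms h and h' (V = vocabulary size):
K(h, h') = \exp\!\left(-\frac{1}{2A}\sum_{i=1}^{V}\frac{(h_i - h'_i)^2}{h_i + h'_i}\right)
% A is a normalization constant, e.g. the mean \chi^2 distance between
% training histograms; multi-channel variants average distances per channel.
```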
• 144. Local feature detectors/descriptors • Four types of detectors: - Harris3D [Laptev'05] - Cuboids [Dollar'05] - Hessian [Willems'08] - Regular dense sampling • Four different types of descriptors: - HoG/HoF [Laptev'08] - Cuboids [Dollar'05] - HoG3D [Kläser'08] - Extended SURF [Willems'08]
• 145. Illustration of ST detectors: Harris3D, Hessian, Cuboid, Dense.
• 146. Dataset: KTH-Actions • 6 action classes by 25 persons in 4 different scenarios • Total of 2391 video samples • Performance measure: average accuracy over all classes
• 147. UCF-Sports -- samples • 10 different action classes • 150 video samples in total - we extend the dataset by flipping videos • Evaluation method: leave-one-out • Performance measure: average accuracy over all classes. Examples: Diving, Kicking, Walking, Skateboarding, High-Bar-Swinging, Golf-Swinging
• 148. Dataset: Hollywood2 • 12 different action classes from 69 Hollywood movies • 1707 video sequences in total • Separate movies for training / testing • Performance measure: mean average precision (mAP) over all classes. Examples: GetOutCar, AnswerPhone, Kiss, HandShake, StandUp, DriveCar
• 149. KTH-Actions -- results

Descriptor \ Detector   Harris3D   Cuboids   Hessian   Dense
HOG3D                    89.0%      90.0%     84.6%    85.3%
HOG/HOF                  91.8%      88.7%     88.7%    86.1%
HOG                      80.9%      82.3%     77.7%    79.0%
HOF                      92.1%      88.2%     88.6%    88.0%
Cuboids                    -        89.1%       -        -
E-SURF                     -          -       81.4%      -

• Best results for sparse Harris3D + HOF • Good results for Harris3D and Cuboid detectors with HOG/HOF and HOG3D descriptors • Dense features perform relatively poorly compared to sparse features
• 150. UCF-Sports -- results

Descriptor \ Detector   Harris3D   Cuboids   Hessian   Dense
HOG3D                    79.7%      82.9%     79.0%    85.6%
HOG/HOF                  78.1%      77.7%     79.3%    81.6%
HOG                      71.4%      72.7%     66.0%    77.4%
HOF                      75.4%      76.7%     75.3%    82.6%
Cuboids                    -        76.6%       -        -
E-SURF                     -          -       77.3%      -

• Best results for Dense + HOG3D • Good results for Dense and HOG/HOF • Cuboids: good performance with HOG3D
• 151. Hollywood2 -- results

Descriptor \ Detector   Harris3D   Cuboids   Hessian   Dense
HOG3D                    43.7%      45.7%     41.3%    45.3%
HOG/HOF                  45.2%      46.2%     46.0%    47.4%
HOG                      32.8%      39.4%     36.2%    39.4%
HOF                      43.3%      42.9%     43.0%    45.5%
Cuboids                    -        45.0%       -        -
E-SURF                     -          -       38.2%      -

• Best results for Dense + HOG/HOF • Good results for HOG/HOF
• 152. Evaluation summary • Dense sampling consistently outperforms all the tested sparse features in realistic settings (UCF + Hollywood2) - importance of realistic video data - limitations of current feature detectors - note: a much larger number of features (15-20 times more) • Sparse features provide more or less similar results (sparse features beat Dense on KTH) • Descriptors' performance: the combination of gradients + optical flow seems a good choice (HOG/HOF & HOG3D)
• 153. How to improve BoF classification? Actions are about people, so why not combine BoF with person detection? Detect and track people, then compute BoF on person-centered grids: 2x2, 3x2, 3x3… Surprise!
• 154. How to improve BoF classification? 2nd attempt: • Do not remove the background • Improve local descriptors with region-level information: ambiguous local features are disambiguated by region labels (R1, R2, R3) -> visual vocabulary -> histogram representation -> SVM classification
• 155. Video Segmentation • Spatio-temporal grids (R1, R2, R3) • Static action detectors [Felzenszwalb'08] - trained from ~100 web images per class • Object and person detectors (upper body) [Felzenszwalb'08]
• 157. Hollywood-2 action classification

Attributed feature                                        Performance (mean AP)
BoF                                                               48.55
Spatiotemporal grid, 24 channels                                  51.83
Motion segmentation                                               50.39
Upper body                                                        49.26
Object detectors                                                  49.89
Action detectors                                                  52.77
Spatiotemporal grid + Motion segmentation                         53.20
Spatiotemporal grid + Upper body                                  53.18
Spatiotemporal grid + Object detectors                            52.97
Spatiotemporal grid + Action detectors                            55.72
Spatiotemporal grid + Motion segmentation + Upper body
  + Object detectors + Action detectors                           55.33
• 159. Actions in Context (CVPR 2009)  Human actions are frequently correlated with particular scene classes. Reasons: physical properties and particular purposes of scenes. Examples: Eating - kitchen; Eating - cafe; Running - road; Running - street
  • 160. Mining scene captions ILSA I wish I didn't love you so much. 01:22:00 01:22:03 She snuggles closer to Rick. CUT TO: EXT. RICK'S CAFE - NIGHT Laszlo and Carl make their way through the darkness toward a side entrance of Rick's. They run inside the entryway. The headlights of a speeding police car sweep toward them. They flatten themselves against a wall to avoid detection. The lights move past them. CARL 01:22:15 I think we lost them. 01:22:17 …
• 161. Mining scene captions. Examples: INT. TRENDY RESTAURANT - NIGHT; INT. MARSELLUS WALLACE'S DINING ROOM - MORNING; EXT. STREETS BY DORA'S HOUSE - DAY; INT. MELVIN'S APARTMENT, BATHROOM - NIGHT; EXT. NEW YORK CITY STREET NEAR CAROL'S RESTAURANT - DAY; INT. CRAIG AND LOTTE'S BATHROOM - DAY • Maximize word frequency: street, living room, bedroom, car… • Merge words with similar senses using WordNet: taxi -> car, cafe -> restaurant • Measure correlation of words with actions (in scripts) • Re-sort words by the entropy of P = p(action | word)
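A small sketch of the re-sorting step, assuming word-action co-occurrence counts have already been mined from scripts; `counts` and both helper names are mine.

```python
import math

def word_entropy(action_counts):
    # Entropy of p(action | word) from raw co-occurrence counts.
    total = sum(action_counts.values())
    probs = [c / total for c in action_counts.values() if c > 0]
    return -sum(p * math.log(p) for p in probs)

def rank_scene_words(counts):
    # counts: {word: {action: co-occurrence count in scripts}}.
    # Low entropy = the word is predictive of specific actions.
    return sorted(counts, key=lambda w: word_entropy(counts[w]))
```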
• 162. Co-occurrence of actions and scenes in scripts
• 163. Co-occurrence of actions and scenes in scripts
• 164. Co-occurrence of actions and scenes in scripts
• 165. Co-occurrence of actions and scenes in text vs. video
• 166. Automatic gathering of relevant scene classes and visual samples. Source: 69 movies aligned with the scripts. The Hollywood-2 dataset is on-line: http://www.irisa.fr/vista/actions/hollywood2
  • 167. Results: actions and scenes (separately)
• 168. Classification with the help of context. The new action score combines the action classification score with the scene classification score, weighted by a factor estimated from text:
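Spelled out, the combination the slide's labels describe is (a hedged reconstruction; the exact form in the paper may differ):

```latex
% Context-aided score for action a on video x:
s_a^{\text{new}}(x) = s_a^{\text{action}}(x)
 + \sum_{c \in \text{scenes}} w_{a,c}\, s_c^{\text{scene}}(x)
% with weights w_{a,c} estimated from action-scene co-occurrence in scripts.
```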
• 169. Results: actions and scenes (jointly). Actions in the context of scenes; scenes in the context of actions.
• 170. Weakly-Supervised Temporal Action Annotation  Answer the questions: WHAT actions happened and WHEN? (e.g. knock on the door, fight, kiss)  Train visual action detectors and annotate actions with minimal manual supervision
• 171. WHAT actions?  Automatic discovery of action classes in text (movie scripts) - Text processing: Part-of-Speech (POS) tagging; Named Entity Recognition (NER); WordNet pruning; visual noun filtering - Search action patterns:

Person+Verb                 Person+Verb+Prep.               Person+Verb+Prep.+Vis.Noun
3725 /PERSON .* is          989 /PERSON .* looks .* at      41 /PERSON .* sits .* in .* chair
2644 /PERSON .* looks       384 /PERSON .* is .* in         37 /PERSON .* sits .* at .* table
1300 /PERSON .* turns       363 /PERSON .* looks .* up      31 /PERSON .* sits .* on .* bed
916 /PERSON .* takes        234 /PERSON .* is .* on         29 /PERSON .* sits .* at .* desk
840 /PERSON .* sits         215 /PERSON .* picks .* up      26 /PERSON .* picks .* up .* phone
829 /PERSON .* has          196 /PERSON .* is .* at         23 /PERSON .* gets .* out .* car
807 /PERSON .* walks        139 /PERSON .* sits .* in       23 /PERSON .* looks .* out .* window
701 /PERSON .* stands       138 /PERSON .* is .* with       21 /PERSON .* looks .* around .* room
622 /PERSON .* goes         134 /PERSON .* stares .* at     18 /PERSON .* is .* at .* desk
591 /PERSON .* starts       129 /PERSON .* is .* by         17 /PERSON .* hangs .* up .* phone
585 /PERSON .* does         126 /PERSON .* looks .* down    17 /PERSON .* is .* on .* phone
569 /PERSON .* gets         124 /PERSON .* sits .* on       17 /PERSON .* looks .* at .* watch
552 /PERSON .* pulls        122 /PERSON .* is .* of         16 /PERSON .* sits .* on .* couch
503 /PERSON .* comes        114 /PERSON .* gets .* up       15 /PERSON .* opens .* of .* door
493 /PERSON .* sees         109 /PERSON .* sits .* at       15 /PERSON .* walks .* into .* room
462 /PERSON .* are/VBP      107 /PERSON .* sits .* down     14 /PERSON .* goes .* into .* room
• 172. WHEN: Video Data and Annotation  Want to target realistic video data  Want to avoid manual video annotation for training  Use movies + scripts for automatic annotation of training samples. Uncertainty! (e.g. the action lies somewhere between 24:25 and 24:51)
• 173. Overview. Input: • action type, e.g. "Person Opens Door" • videos + aligned scripts. Pipeline: automatic collection of training clips -> clustering of positive segments -> training a classifier -> sliding-window-style temporal action localization.
• 174. Action clustering [Lihi Zelnik-Manor and Michal Irani, CVPR 2001]: spectral clustering in descriptor space; clustering results vs. ground truth.
  • 175. Action clustering Complex data: Standard clustering methods do not work on this data
• 176. Action clustering. Our view of the problem: in feature space, the nearest-neighbor solution is wrong! Use negative samples instead: random video samples - lots of them, with a very low chance of being positives.
• 177. Action clustering. Formulation [Xu et al. NIPS'04; Bach & Harchaoui NIPS'07]: a discriminative cost in feature space, combining a loss on the parameterized positive samples with a loss on the negative samples. Optimization: an SVM solution for the classifier and coordinate descent on the positive-sample parameters (sketched below).
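The cost function was an image on the slide; a hedged sketch of the discriminative-clustering objective in the spirit of the cited formulations:

```latex
% Classifier f and latent temporal windows \{t_i\} of the positive clips:
\min_{f,\,\{t_i\}} \; \sum_{i \in \mathrm{pos}} \ell\big(+1,\, f(x_i(t_i))\big)
 + \sum_{j \in \mathrm{neg}} \ell\big(-1,\, f(x_j)\big) + \lambda \lVert f \rVert^2
% Optimization alternates: fix \{t_i\} and solve for f (an SVM);
% fix f and update \{t_i\} by coordinate descent on the window parameters.
```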
  • 178. Clustering results Drinking actions in Coffee and Cigarettes
• 179. Detection results. Drinking actions in "Coffee and Cigarettes"  Training a Bag-of-Features classifier  Temporal sliding-window classification  Non-maximum suppression. Detector trained on simulated clusters. Test set: • 25 min from "Coffee and Cigarettes" with ground truth: 38 drinking actions
• 180. Detection results. Drinking actions in "Coffee and Cigarettes"  Training a Bag-of-Features classifier  Temporal sliding-window classification  Non-maximum suppression. Detector trained on automatic clusters. Test set: • 25 min from "Coffee and Cigarettes" with ground truth: 38 drinking actions
  • 181. Detection results “Sit Down” and “Open Door” actions in ~5 hours of movies
  • 182. Temporal detection of “Sit Down” and “Open Door” actions in movies: The Graduate, The Crying Game, Living in Oblivion
• 183. Conclusions  Bag-of-words models are currently dominant; structure (human poses, etc.) should be integrated  The vocabulary of actions is not well-defined: it depends on the goal and the task  Actions should be used for the functional interpretation of the visual world