Frontiers of
Human Activity Analysis

                J. K. Aggarwal
               Michael S. Ryoo
                 Kris M. Kitani
2000s




    Description-based
       approaches


                        2
Approach paradigm
 Description-based approach
    We represent the structure of the activities,
     and recognize activities using semantic matching.
        Hand shake = “two persons do shake-action (stretches, stays
         stretched, withdraw) simultaneously, while touching”.
        Recognition by finding observations satisfying the definition.

Humans conceptual
                                      Human activity
  knowledge on
                                      representation
  human activity
                                                               High-level
                                       Atomic action            activity
 Input sequence                                               recognition
                                     (e.g. arm stretch)
                      Low-level         recognition
                     processing

                                                                            3
Comparisons
                               Complex         Complex      Recognition of      Handle
                Levels of
Approaches                     temporal      logical con-     recursive      imperfect low-
                hierarchy
                               relations      catenations     activities         levels
                  limited
 Statistical   (depends on                                                        √
               data amount)

 Syntactic      unlimited                                        √                √

                              a sub-event
Siskind 2001    unlimited     participates        √
                               only once
Hongeng et        limited
                                  √               √
 al. 2004       (3-levels)
  Vu et al.                                  conjunctions
                unlimited         √
   2003                                          only
 Ryoo and
 Aggarwal       unlimited         √               √              √                √
   2009
Gupta et al.      limited                    network form
                                  √                                               √
  2009          (2-levels)                       only

                                                                                              4
2000s




        Description-based
           approaches
              Human interactions


                                   5
Recognition of human interactions
Semantic layer
   Interaction:
                                                     in time interval
              Person        “pushed” by Person
                                                     <4, 20>             Interaction
Gesture layer
                                                                         Gesture
             <1,20> : Facing right        <1,20> : Facing left
             <1,20> : Arm staying         <4,20> : Arm stretching           Elementary
             <1,20> : Leg staying         <1,20> : Leg staying               movement of a
Pose layer
                                                                             person
             1 : Arm withdrawn            1 : Arm withdrawn
                                                                         Pose
             8 : Arm withdrawn            8 : Arm somewhat stretched        Abstract status
             13 : Arm withdrawn           13 : Arm fully stretched           of a body part.
Body-part layer                                                          Body-part
                                                                          feature.
                                                                            Numerical status
                                                                             of a body part.
Input sequences

                                                                              Ryoo and Aggarwal,
                                                                              CVPR 2006
                                                                                                6
Atomic actions
  Operation triplets <agent, motion, target>
        Gesture together with subject and object information.
        Unit human activity.
        Computed based on gestures.
        Ex> person1 stretches his/her arm
           <p1’s arm, stretch, null>
  Time intervals
      Ex> Time intervals detected for Pointing action
                                                               Time

P1:Head :222222222222222222        <p1‟s arm, stretch, p2>
P1:ArmV :322021221111122333      <p1‟s arm, stay stret., p2>
P1:ArmH :100022222222221100     <p1‟s arm, withdraw, null>
   Sequences of poses                  Time intervals of operation triplets
                                                                              7
Semantic layer recognition

                            Machine-understandable
                           Punching is a sequence of hand
                           representation of Punching
                           stretch and withdrawal.



                                               Similar?

                                 Time
     <p2‟s arm, stretch, p1>
   <p2‟s arm, stay stret., p1>
…<p2‟s arm, withdraw, null>                                 …
<p2‟s arm, stay withd., null>
  <p2‟s head, face left, null>
 <p1‟s arm, withdraw, nulll>
                             Time intervals of gestures
                                Video observations
                                                                8
Human activity representation
 Semantics                           Syntax
  Knowledge on the structure              Rules to construct formal
   of an activity.                          representation.
     Punching is a sequence of            Organizes a set of
      hand stretch and withdrawal.          vocabularies to describe
  Time intervals                           the activities‟ structure.
  Allen‟s temporal predicates             Context-free grammar
       this=Punching_action (p1)           Punching_action(i) = (
                                    CFG
                                             list( def(„x‟, Arm_Stretch(i)),
                                                   def(„y‟, Arm_Withdraw(i)) ),
       x=arm_        y=arm_
                                             and( meets(„x‟, „y‟),
       stretch(p1)   withdraw(p1)
                                                   and( starts(„x‟, „this‟),
                                                        finishes(„y‟, „this‟)) ) );
   Conceptual/verbal description          Machine-understandable language
                                                                                  9
Description-based

 Hierarchical activity representation
    Representation of the „shake-hands‟ interaction
       Shake hands           ShakeHandsInteractions(i, j) = (
        interaction             list( def(„x‟, ShakeHandsAction(i)),
                                      list( def(„y‟, ShakeHandsAction(j)),
                                            def(„z‟, TouchingInteraction(i, j))) ),
  CFG
                                and( and( during(„z‟, „x‟), during(„z‟, „y‟)),
  Syntax
                            Hand shake = “two„this‟), finishes(„z‟, „this‟)))
                                      and( starts(„z‟,
                                                        persons do
                             );
                            shake-action (stretches, stays stretched,
                             ShakeHandsActions(i) = (
                            withdraw)
                                list( def(„x‟, Arm_Stretch(i)),
                            simultaneously, while touching”.
                                      list( def(„y‟, Arm_Stay_Stretched(i)),
                                            def(„z‟, Arm_withdraw(i))) ),
                                and( and( meets(„x‟, „y‟), meets(„y‟, „z‟)),
                                      and( starts(„x‟, „this‟), finishes(„z‟, „this‟)))
                             );
                             TouchingInteraction(i, j)=(null, touch(i, j, 0));

                                                                                          10
Hierarchical recognition algorithm
 Recognition process of the „Shake-hands‟ interaction.




                                                          11
Continued and recursive activities
 Interaction „fighting‟
    Composed of multiple negative interactions
       Punching + kicking + pushing + punching + …
    Iterative approach is taken.

                                                 Fighting
                                             Interaction(i, j)


                                  Negative             Fighting
                               Interaction(i, j)   Interaction(i, j)


                                       Negative                Fighting
                                    Interaction(i, j)      Interaction(i, j)
                                                                         Base Case

                                          Negative                Fighting
                                       Interaction(i, j)      Interaction(i, j)
                                                                               12
Experiments - Simple interactions
 Recognized 8 types of simple interactions, which were
  recognized in Park and Aggarwal, 2004
    (approach, depart, point, shake-hands, hug, punch, kick, and
     push)
    A videos of a sequence of interactions are taken. (continuous
     executions)
    Interactions are described in more detailed and formal way,
     resulting better recognition accuracy.




                                                                     13
Example Experiment - Fighting
Input video:        Processed video:




                     Gestures
Poses:               and
                     activities:
                                       14
Past-Now-Future networks
 Pinhanez and Bobick 1998
   PNF networks to represent temporal structure
    of an activity.
   Kitchen activities:




 [Pinhanez, C. S. and Bobick, A. F., Human action detection using PNF propagation
 of temporal constraints. CVPR 1998]
                                                                                    15
Event logic
 Siskind 2001
   Logical concatenations of predicates
   Time intervals?




 [Siskind, J. M., Grounding the lexical semantics of verbs in visual perception using force
 dynamics and event logic. Journal of Artificial Intelligence Research (JAIR) 15, 2001]
                                                                                              16
Representation languages
 Nevatia, Zhao,
  and Hongeng 2003
   VERL - language


 Vu, Bremond,
  Thonnat 2003
   Similar to Nevatia
    et al. 2003

 Recursive? Uncertainties?
                               17
Stochastic approaches
 Limitations of the conventional description-
  based approaches
   Uncertainties? – stochastic recognition


 Probabilistic framework needed
   [Ryoo and Aggarwal, IJCV 2009]
   [Tran and Davis, ECCV 2008] - MLN



                                                 18
Hierarchical matching algorithm
 Recognition process tree of „steal(p1, s1, p2)‟
                                         Parameter Objects:
                                             (person p1, suitcase s1, person p2)
                                                                                 P(S | O1, M1, …) = 0.8

                                                  Steal(p1, s1, p2)

   P Obj: (person p1, suitcase s1)            P(A3 | O3, M3) = 0.5              P Obj: (person p2, suitcase s1)
                                      meets                                 meets
            Carry(p1, s1)                              Stay(s1)                          Carry(p2, s1)


P(A1 | O1, M1) = 0.8           P(A2 | O2, M2) = 0.7          P(A4 | O4, M4) = 0.9      P(A5 | O5, M5) = 0.9
                             equals                                                    equals
       MoveR(p1)                      MoveR(s1)                      MoveL(p2)                   MoveL(s1)



                       Recognition results from the object and motion layers

                                                                                                                  19
Description-based

             Probabilistic recognition
    Probability of the activity given observation

                     .

                            .
                            .
                                 .

                         Structural    Gesture detection
                         similarity      confidence
                                                           20
Experiments
      Recognized following six types of interactions.
            Each activity was tested with at least 10 sequences.
                  Carrying a box, leaving a box, placing a box into a trash bin.
                  Carrying a suitcase, leaving a suitcase, stealing the suitcase.
            Object and Motion layer trained with 5 sequences.




                                                                            Time
          Carry(Person1, SuitCase1) :
                    Stay(SuitCase1) :
          Carry(Person2, SuitCase1) :
Steal(Person1, SuitCase1, Person2) :
                                                                                     21
Experiments
   Example
         a person placing a box into a trash bin




                                                    Time
            Move(Person1, right) :
               Move(Box1, right) :
             Move(Person1, left) :
             Move(Box1, down) :
           Carry(Person1, Box1) :
Trash(Person1, Box1, TrashBin1) :
                                                           22
Experiments - Performance
 Recognition accuracy (true positives):
    Compared with a multi-object version of previous works.
P 1
 0.9
 0.8
 0.7
 0.6
 0.5
 0.4
 0.3
 0.2
 0.1
  0
         Carry    Leave      Steal                Carry          Leave          Trash          Total
       -Suitcase -Suitcase -Suitcase              -Box           -Box           -Box
           2 atomics    3 atomics    5 atomics      2 atomics      3 atomics      3 atomics
           1 spatials   1 spatials   2 spatials     1 spatials     1 spatials     3 spatials


 False positives rates are almost zero for all activities.
                                                                                                       23
Advantages
 Ability to represent and recognize an activity composed
  of concurrent sub-events.
    Ex> “touching occurred during pushing”

                     this==Pushing_interactions(p1,p2)

               i=Arm_Stretch(p1)   j=Arm_Stay_Stretched(p1)

                                k =Touch(p1, p2)         l =Depart(p2, p1)



 Ability to represent and recognize „recursive activities‟
    Ex> Fighting = Fighting + another negative interaction.
 Less data required for training.
    „Structure of activities‟ are encoded based on human knowledge.
 High recognition accuracy?
                                                                             24
2008, 2009




      Description-based
         approaches
               Group activities


                                  25
Group activity
  Events performed by
   groups
      Various types of
       complex activities
          Group-person interaction
          Group-group interaction
  Uncertain nature
      Varying # of
       participants                   Group Stealing (grp vs. per)
                                      Group Assault (grp vs. grp)
      Dynamic spatial                     A person with red shirts is
       relation                        taking the laptop on the table
                                         while the others are talking
Ryoo and Aggarwal,
CVPR-SIG 09, IJCV 11                                                     26
                                                                         26
Representation
     Group stealing                                    Formal representation:
                                                        1. Member variables b in Owners,
                                                           ∃ a in Thieves, ∀
                                                           ∃ c in Thieves
               ∀                ∀                       2. Time intervals
                                                           def(t1, TakeObject(a)),
                    b               b                      def(t2, Distract(c, b))
                            ∃
                                        c   ∃           3. Predicates this),
                                                           equals (t1,
                ∃       c    c
                                            Distract!      during (t1, t2)
                Distract!Distract!
                     Owners                                                 Time
∃                                                       Distract (r1, g1)
a              Thieves                                  Distract (r2, g1)
                                                        Distract (r3, g2)    t2
Take object!                                                                         during
                                                        TakeObject (r4)      t1
                                    Object
                                                                                     equals
                                                         Stealing(R, G) this

                                                            Time intervals of activities of
                                                                individual members
                                                                                              27
Recognition overview
 3 key components Generates a pool of member candidates
                                     Distract   Distract

                                                Distract

                     Take          Distract     Distract
                            Distract
                            Stand                Take

                                                 Stand

     Who?                   What?                        When?
 Recognition: NP-hard.             Approximation required.
    Obtain a pool of group member candidates
     with non-zero probability.
       Not many persons perform sub-events.
    M *  arg max M P(Gt (M ) | O M )
                                                                 28
Temporal constraints
    Hierarchical temporal constraint matching.
Member variables:
 ∃a in Thieves, ∀b in Owners, ∃c in Thieves


        Steal(T, O)




                      before                    during
   Approach(c, b)              Distract(c, b)            TakeObject(a)



                               Video inputs

                                                                         29
Group candidates
 Among possible groupings,
   Find a set of group members:
           M  {m1 , m2 ,..., m|M |}
   which maximizes the overall
   probability.
 Bayesian formulation
    P(Gt | O)  max M P(Gt (M ) | O M )
                              G (M )
               max M
                       G ( M )   G ( M )
 where     G (M )  P(OM | Gt (M ))  P(Gt (M ))
                                                    30
Bayesian formulation
 Ci: persons performing ith sub-event.
P(OM | G t ( M ))            P(OM | M , Q, S1t1 ,..., S n )  P( S1t1 ,..., Sn | G t ( M ))
                                                          tn                   tn

                           S11 ,...,S nRepresented
                            t         tn
                                                       relations Represented sub-events
                                 P(rel | S a , Sb )   P( Siti | G t ( M ))
                                 rel                                                      
          G (M )                                              i
                                                                                           
                    t1       tn
                   S1 ,...,S n
                                  d  e (| Ki Ci |/|Ki ||Li Ci |/|Ki |)  G t ( M ) 
                                 i                                                        
                                        Structural similarity
 Essential and anti-essential relations: Ki, Li
                  E[S
                                                  Case1:                   Case2:
| Ki  Ci |                i
                             ti
                                   (k ) | O ] i
                                                  ∃∃        1
                                                            2
                                                                   4
                                                                   5
                                                                          ∀∃         1
                                                                                     2
                                                                                               4
                                                                                               5
               kK i Ci                                    3      6                 3         6

                                                  Case3:                   Case4:

| Li  Ci |      E[S (l ) | O ]
               lLi Ci
                             i
                              ti          i       ∃∀        1
                                                            2
                                                            3
                                                                   4
                                                                   5
                                                                   6
                                                                          ∀∀         1
                                                                                     2
                                                                                     3
                                                                                               4
                                                                                               5
                                                                                               6
                                                                                                   31
Markov chain Monte Carlo
 MCMC-based probability estimation.
   Provides a set of samples from the distribution.
   Models the probability distribution.
 Metropolis-Hastings algorithm
   P(M t 1 , M ' )  min(1, a)
       ( M ' )  q( M ' , M )
   a G
       G ( M )  q( M , M ' )             Add     Add
                                                         Add
 Actions:                                Add
                                          Remove

   Add: M '  M t 1 {m} where m  Ci
   Remove: M '  M t 1  {m}
                                                               32
Experimental setting
 We have tested 45 sequences of 8 activities.
   320*240 with 10 fps
 CCTV videos download from YouTube.
   Group stealing in Malaysia and group arresting in UK.
 Videos that we have taken with 10 participants in
  various environments.
   A group of people carrying a large object.
   A group of people assaulting a person.




  Videos of real human activities
                                                            33
Experiments - stealing
 Group stealing
   One of thieves
    steals a laptop,
    while the other
    thieves are
    distracting the
    shop owner.
         Thieves

         Owners

         Laptop


                                 34
Experiments - arresting
 Group arresting
   A group of
    policemen
    arresting a group
    of suspicious
    persons.
   Color histogram
        Policemen

        Criminal
        candidates
        Pedestrians


                                35
Experiments – group assault
 Highly stochastic
   There may be (and may not be) attackers whose
    guarding the area, or just watching.
   10 videos.




                                                    36
Experiments – group assault
 Highly stochastic
   There may be (and may not be) attackers whose
    guarding the area, or just watching.
   10 videos.




                                                    37
Experimental results
 Recognition accuracy
       False positive rates are almost 0 because of the detailed
        representations: previous, deterministic, stochastic.
 1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
 0
       Move   Carry CarryCmd Fight   Fight   Steal   Arrest   Assault Total
        G      G       GP     GG      IG      GG      GG        GG
         ∀      ∃∀    ∃∀      ∀∃      ∃∃      ∃∀∃     ∀∃        ∀∃∃
                                                                              38
2009




        Spatio-Temporal
       Relationship Match


                            39
Description-based vs. Space-time
 Description-based           Space-time
   High-level activities       Reliable under noise
      Hierarchical             Difficult to model
   Semantic structures          complex activities
   Difficult to cope with      Miss semantic
    noise                        structures
                                t




                                                        40
Space-time approaches
 Video classification
   Each video is represented as a histogram
                                             Hugging
                            Shaking hands

    t




                             Punching       Pushing




 Spatio-temporal features

 Limitation:
   Unable to model complex activities           Laptev 04,
                                                 Dollar et al. 05   41
Spatio-temporal relations (STRs)
t                           t
                y               35         12   overlaps(12, 35)
                                                       ...
                                                before(17, 12)
                    5                           overlaps(5, 17)
                                 17
                                           x
t                           t
                y
                                35              overlaps(12, 35)
                                          12
                                                       ...
                                                before(17, 12)
                        5
                                                 equals(5, 17)
                                     17
                                           x      Ryoo and Aggarwal,
      Videos    Feature relations                 ICCV 2009      42
Histogram of STRs
            t
y               35         12   overlaps(12, 35)
                                      ...
                                before(17, 12)
    5                           overlaps(5, 17)
                 17
                           x

            t
y
                35              overlaps(12, 35)
                          12
                                      ...
                                before(17, 12)
        5
                                 equals(5, 17)
                     17
                           x                       43
STR-match learning
 Supervised learning
   Videos with activity labels are provided.
                          Pushing       Hugging
                                    ?




                         Punching Shaking hands

    Histogram of
    Relationships


                           Feature space          44
STR equations
 STR match considers distributions of pair-
  wise relationships among features.
 Histogram construction



 STR distance:




                                               45
STR-match activity detection
 Must detect starting time and ending time
      Models starting XYT location of an activity.
      Each feature pair in a matching training video
       makes a vote.
             t         35   12            t     35
 y                               y                        12

                                      5
       5
                       17                           17
            vtrstart                          expected_starting(vtest)
                     x                           x
     Original starting               Expected starting
                                                                     46
Hierarchical recognition
 Atomic action detections as new features
   Localization ability enables hierarchical
    recognition




                                                47
Experiments
 KTH dataset
   Public dataset composed of simple actions
     Walking, jogging, running, waving, …




                                                48
Experiments: high-level activities
 High-level human activity detection results
   Changing backgrounds, lighting conditions, …




                                                   49
STR-match summary
 Detection from continuous videos
   Localization using voting-based method
 Noisy observations
   Different backgrounds/lightings
   Uncertainties
 Human-human interactions
   Hierarchical recognition
 Future work
   Hierarchy learning algorithm
                                             50
Description-based: References
 Allen, J. F. and Ferguson, G., Actions and events in interval temporal logic. Journal of Logic and
  Computation 1994.
 Pinhanez, C. S. and Bobick, A. F., Human action detection using PNF propagation of temporal
  constraints. CVPR 1998.
 Siskind, J. M., Grounding the lexical semantics of verbs in visual perception using force dynamics
  and event logic. Journal of Artificial Intelligence Research (JAIR) 15, 2001.
 Nevatia, R., Zhao, T., and Hongeng, S., Hierarchical language-based representation of events in
  video streams. In IEEE Workshop on Event Mining 2003.
 Vu, V.-T., Bremond, F., and Thonnat, M., Automatic video interpretation: A novel algorithm for
  temporal scenario recognition. IJCAI 2003.
 Ryoo, M. S. and Aggarwal, J. K., Recognition of composite human activities through context-free
  grammar based representation. CVPR 2006.
 Ryoo, M. S. and Aggarwal, J. K., Hierarchical recognition of human activities interacting with objects.
  CVPR-SLAM 2007.
 Tran, S. D. and Davis, L. S., Event modeling and recognition using markov logic networks. ECCV
  2008.
 Ryoo, M. S. and Aggarwal, J. K., Semantic representation and recognition of continued and
  recursive human activities. IJCV, 2009.
 Ryoo, M. S. and Aggarwal, J. K., Spatio-temporal relationship match: Video structure comparison
  for recognition of complex human activities. ICCV 2009.
 Ryoo, M. S. and Aggarwal, J. K., Stochastic representation and recognition of high-level group
  activities. IJCV, 2010.
                                                                                                            51

cvpr2011: human activity recognition - part 5: description based

  • 1.
    Frontiers of Human ActivityAnalysis J. K. Aggarwal Michael S. Ryoo Kris M. Kitani
  • 2.
    2000s Description-based approaches 2
  • 3.
    Approach paradigm  Description-basedapproach  We represent the structure of the activities, and recognize activities using semantic matching.  Hand shake = “two persons do shake-action (stretches, stays stretched, withdraw) simultaneously, while touching”.  Recognition by finding observations satisfying the definition. Humans conceptual Human activity knowledge on representation human activity High-level Atomic action activity Input sequence recognition (e.g. arm stretch) Low-level recognition processing 3
  • 4.
    Comparisons Complex Complex Recognition of Handle Levels of Approaches temporal logical con- recursive imperfect low- hierarchy relations catenations activities levels limited Statistical (depends on √ data amount) Syntactic unlimited √ √ a sub-event Siskind 2001 unlimited participates √ only once Hongeng et limited √ √ al. 2004 (3-levels) Vu et al. conjunctions unlimited √ 2003 only Ryoo and Aggarwal unlimited √ √ √ √ 2009 Gupta et al. limited network form √ √ 2009 (2-levels) only 4
  • 5.
    2000s Description-based approaches Human interactions 5
  • 6.
    Recognition of humaninteractions Semantic layer Interaction: in time interval Person “pushed” by Person <4, 20>  Interaction Gesture layer  Gesture <1,20> : Facing right <1,20> : Facing left <1,20> : Arm staying <4,20> : Arm stretching  Elementary <1,20> : Leg staying <1,20> : Leg staying movement of a Pose layer person 1 : Arm withdrawn 1 : Arm withdrawn  Pose 8 : Arm withdrawn 8 : Arm somewhat stretched  Abstract status 13 : Arm withdrawn 13 : Arm fully stretched of a body part. Body-part layer  Body-part feature.  Numerical status of a body part. Input sequences Ryoo and Aggarwal, CVPR 2006 6
  • 7.
    Atomic actions Operation triplets <agent, motion, target>  Gesture together with subject and object information.  Unit human activity.  Computed based on gestures.  Ex> person1 stretches his/her arm  <p1’s arm, stretch, null>  Time intervals  Ex> Time intervals detected for Pointing action Time P1:Head :222222222222222222 <p1‟s arm, stretch, p2> P1:ArmV :322021221111122333 <p1‟s arm, stay stret., p2> P1:ArmH :100022222222221100 <p1‟s arm, withdraw, null> Sequences of poses Time intervals of operation triplets 7
  • 8.
    Semantic layer recognition Machine-understandable Punching is a sequence of hand representation of Punching stretch and withdrawal. Similar? Time <p2‟s arm, stretch, p1> <p2‟s arm, stay stret., p1> …<p2‟s arm, withdraw, null> … <p2‟s arm, stay withd., null> <p2‟s head, face left, null> <p1‟s arm, withdraw, nulll> Time intervals of gestures Video observations 8
  • 9.
    Human activity representation Semantics  Syntax Knowledge on the structure  Rules to construct formal of an activity. representation.  Punching is a sequence of  Organizes a set of hand stretch and withdrawal. vocabularies to describe Time intervals the activities‟ structure. Allen‟s temporal predicates  Context-free grammar this=Punching_action (p1) Punching_action(i) = ( CFG list( def(„x‟, Arm_Stretch(i)), def(„y‟, Arm_Withdraw(i)) ), x=arm_ y=arm_ and( meets(„x‟, „y‟), stretch(p1) withdraw(p1) and( starts(„x‟, „this‟), finishes(„y‟, „this‟)) ) ); Conceptual/verbal description Machine-understandable language 9
  • 10.
    Description-based Hierarchical activityrepresentation  Representation of the „shake-hands‟ interaction Shake hands ShakeHandsInteractions(i, j) = ( interaction list( def(„x‟, ShakeHandsAction(i)), list( def(„y‟, ShakeHandsAction(j)), def(„z‟, TouchingInteraction(i, j))) ), CFG and( and( during(„z‟, „x‟), during(„z‟, „y‟)), Syntax Hand shake = “two„this‟), finishes(„z‟, „this‟))) and( starts(„z‟, persons do ); shake-action (stretches, stays stretched, ShakeHandsActions(i) = ( withdraw) list( def(„x‟, Arm_Stretch(i)), simultaneously, while touching”. list( def(„y‟, Arm_Stay_Stretched(i)), def(„z‟, Arm_withdraw(i))) ), and( and( meets(„x‟, „y‟), meets(„y‟, „z‟)), and( starts(„x‟, „this‟), finishes(„z‟, „this‟))) ); TouchingInteraction(i, j)=(null, touch(i, j, 0)); 10
  • 11.
    Hierarchical recognition algorithm Recognition process of the „Shake-hands‟ interaction. 11
  • 12.
    Continued and recursiveactivities  Interaction „fighting‟  Composed of multiple negative interactions  Punching + kicking + pushing + punching + …  Iterative approach is taken. Fighting Interaction(i, j) Negative Fighting Interaction(i, j) Interaction(i, j) Negative Fighting Interaction(i, j) Interaction(i, j) Base Case Negative Fighting Interaction(i, j) Interaction(i, j) 12
  • 13.
    Experiments - Simpleinteractions  Recognized 8 types of simple interactions, which were recognized in Park and Aggarwal, 2004  (approach, depart, point, shake-hands, hug, punch, kick, and push)  A videos of a sequence of interactions are taken. (continuous executions)  Interactions are described in more detailed and formal way, resulting better recognition accuracy. 13
  • 14.
    Example Experiment -Fighting Input video: Processed video: Gestures Poses: and activities: 14
  • 15.
    Past-Now-Future networks  Pinhanezand Bobick 1998  PNF networks to represent temporal structure of an activity.  Kitchen activities: [Pinhanez, C. S. and Bobick, A. F., Human action detection using PNF propagation of temporal constraints. CVPR 1998] 15
  • 16.
    Event logic  Siskind2001  Logical concatenations of predicates  Time intervals? [Siskind, J. M., Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. Journal of Artificial Intelligence Research (JAIR) 15, 2001] 16
  • 17.
    Representation languages  Nevatia,Zhao, and Hongeng 2003  VERL - language  Vu, Bremond, Thonnat 2003  Similar to Nevatia et al. 2003  Recursive? Uncertainties? 17
  • 18.
    Stochastic approaches  Limitationsof the conventional description- based approaches  Uncertainties? – stochastic recognition  Probabilistic framework needed  [Ryoo and Aggarwal, IJCV 2009]  [Tran and Davis, ECCV 2008] - MLN 18
  • 19.
    Hierarchical matching algorithm Recognition process tree of „steal(p1, s1, p2)‟ Parameter Objects: (person p1, suitcase s1, person p2) P(S | O1, M1, …) = 0.8 Steal(p1, s1, p2) P Obj: (person p1, suitcase s1) P(A3 | O3, M3) = 0.5 P Obj: (person p2, suitcase s1) meets meets Carry(p1, s1) Stay(s1) Carry(p2, s1) P(A1 | O1, M1) = 0.8 P(A2 | O2, M2) = 0.7 P(A4 | O4, M4) = 0.9 P(A5 | O5, M5) = 0.9 equals equals MoveR(p1) MoveR(s1) MoveL(p2) MoveL(s1) Recognition results from the object and motion layers 19
  • 20.
    Description-based Probabilistic recognition  Probability of the activity given observation . . . . Structural Gesture detection similarity confidence 20
  • 21.
    Experiments  Recognized following six types of interactions.  Each activity was tested with at least 10 sequences.  Carrying a box, leaving a box, placing a box into a trash bin.  Carrying a suitcase, leaving a suitcase, stealing the suitcase.  Object and Motion layer trained with 5 sequences. Time Carry(Person1, SuitCase1) : Stay(SuitCase1) : Carry(Person2, SuitCase1) : Steal(Person1, SuitCase1, Person2) : 21
  • 22.
    Experiments Example  a person placing a box into a trash bin Time Move(Person1, right) : Move(Box1, right) : Move(Person1, left) : Move(Box1, down) : Carry(Person1, Box1) : Trash(Person1, Box1, TrashBin1) : 22
  • 23.
    Experiments - Performance Recognition accuracy (true positives):  Compared with a multi-object version of previous works. P 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Carry Leave Steal Carry Leave Trash Total -Suitcase -Suitcase -Suitcase -Box -Box -Box 2 atomics 3 atomics 5 atomics 2 atomics 3 atomics 3 atomics 1 spatials 1 spatials 2 spatials 1 spatials 1 spatials 3 spatials  False positives rates are almost zero for all activities. 23
  • 24.
    Advantages  Ability torepresent and recognize an activity composed of concurrent sub-events.  Ex> “touching occurred during pushing” this==Pushing_interactions(p1,p2) i=Arm_Stretch(p1) j=Arm_Stay_Stretched(p1) k =Touch(p1, p2) l =Depart(p2, p1)  Ability to represent and recognize „recursive activities‟  Ex> Fighting = Fighting + another negative interaction.  Less data required for training.  „Structure of activities‟ are encoded based on human knowledge.  High recognition accuracy? 24
  • 25.
    2008, 2009 Description-based approaches Group activities 25
  • 26.
    Group activity Events performed by groups  Various types of complex activities  Group-person interaction  Group-group interaction  Uncertain nature  Varying # of participants Group Stealing (grp vs. per) Group Assault (grp vs. grp)  Dynamic spatial A person with red shirts is relation taking the laptop on the table while the others are talking Ryoo and Aggarwal, CVPR-SIG 09, IJCV 11 26 26
  • 27.
    Representation  Group stealing Formal representation: 1. Member variables b in Owners, ∃ a in Thieves, ∀ ∃ c in Thieves ∀ ∀ 2. Time intervals def(t1, TakeObject(a)), b b def(t2, Distract(c, b)) ∃ c ∃ 3. Predicates this), equals (t1, ∃ c c Distract! during (t1, t2) Distract!Distract! Owners Time ∃ Distract (r1, g1) a Thieves Distract (r2, g1) Distract (r3, g2) t2 Take object! during TakeObject (r4) t1 Object equals Stealing(R, G) this Time intervals of activities of individual members 27
  • 28.
    Recognition overview  3key components Generates a pool of member candidates Distract Distract Distract Take Distract Distract Distract Stand Take Stand Who? What? When?  Recognition: NP-hard. Approximation required.  Obtain a pool of group member candidates with non-zero probability.  Not many persons perform sub-events.  M *  arg max M P(Gt (M ) | O M ) 28
  • 29.
    Temporal constraints  Hierarchical temporal constraint matching. Member variables: ∃a in Thieves, ∀b in Owners, ∃c in Thieves Steal(T, O) before during Approach(c, b) Distract(c, b) TakeObject(a) Video inputs 29
  • 30.
    Group candidates  Amongpossible groupings,  Find a set of group members: M  {m1 , m2 ,..., m|M |} which maximizes the overall probability.  Bayesian formulation P(Gt | O)  max M P(Gt (M ) | O M )  G (M )  max M  G ( M )   G ( M ) where  G (M )  P(OM | Gt (M ))  P(Gt (M )) 30
  • 31.
    Bayesian formulation  Ci:persons performing ith sub-event. P(OM | G t ( M ))   P(OM | M , Q, S1t1 ,..., S n )  P( S1t1 ,..., Sn | G t ( M )) tn tn S11 ,...,S nRepresented t tn relations Represented sub-events  P(rel | S a , Sb )   P( Siti | G t ( M ))  rel   G (M )    i  t1 tn S1 ,...,S n   d  e (| Ki Ci |/|Ki ||Li Ci |/|Ki |)  G t ( M )   i  Structural similarity  Essential and anti-essential relations: Ki, Li  E[S Case1: Case2: | Ki  Ci | i ti (k ) | O ] i ∃∃ 1 2 4 5 ∀∃ 1 2 4 5 kK i Ci 3 6 3 6 Case3: Case4: | Li  Ci |  E[S (l ) | O ] lLi Ci i ti i ∃∀ 1 2 3 4 5 6 ∀∀ 1 2 3 4 5 6 31
  • 32.
    Markov chain MonteCarlo  MCMC-based probability estimation.  Provides a set of samples from the distribution.  Models the probability distribution.  Metropolis-Hastings algorithm  P(M t 1 , M ' )  min(1, a)  ( M ' )  q( M ' , M )  a G  G ( M )  q( M , M ' ) Add Add Add  Actions: Add Remove  Add: M '  M t 1 {m} where m  Ci  Remove: M '  M t 1  {m} 32
  • 33.
    Experimental setting  Wehave tested 45 sequences of 8 activities.  320*240 with 10 fps  CCTV videos download from YouTube.  Group stealing in Malaysia and group arresting in UK.  Videos that we have taken with 10 participants in various environments.  A group of people carrying a large object.  A group of people assaulting a person. Videos of real human activities 33
  • 34.
    Experiments - stealing Group stealing  One of thieves steals a laptop, while the other thieves are distracting the shop owner. Thieves Owners Laptop 34
  • 35.
    Experiments - arresting Group arresting  A group of policemen arresting a group of suspicious persons.  Color histogram Policemen Criminal candidates Pedestrians 35
  • 36.
    Experiments – groupassault  Highly stochastic  There may be (and may not be) attackers whose guarding the area, or just watching.  10 videos. 36
  • 37.
    Experiments – groupassault  Highly stochastic  There may be (and may not be) attackers whose guarding the area, or just watching.  10 videos. 37
  • 38.
    Experimental results  Recognitionaccuracy  False positive rates are almost 0 because of the detailed representations: previous, deterministic, stochastic. 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Move Carry CarryCmd Fight Fight Steal Arrest Assault Total G G GP GG IG GG GG GG ∀ ∃∀ ∃∀ ∀∃ ∃∃ ∃∀∃ ∀∃ ∀∃∃ 38
  • 39.
    2009 Spatio-Temporal Relationship Match 39
  • 40.
    Description-based vs. Space-time Description-based  Space-time  High-level activities  Reliable under noise  Hierarchical  Difficult to model  Semantic structures complex activities  Difficult to cope with  Miss semantic noise structures t 40
  • 41.
    Space-time approaches  Videoclassification  Each video is represented as a histogram Hugging Shaking hands t Punching Pushing Spatio-temporal features  Limitation:  Unable to model complex activities Laptev 04, Dollar et al. 05 41
  • 42.
    Spatio-temporal relations (STRs) t t y 35 12 overlaps(12, 35) ... before(17, 12) 5 overlaps(5, 17) 17 x t t y 35 overlaps(12, 35) 12 ... before(17, 12) 5 equals(5, 17) 17 x Ryoo and Aggarwal, Videos Feature relations ICCV 2009 42
  • 43.
    Histogram of STRs t y 35 12 overlaps(12, 35) ... before(17, 12) 5 overlaps(5, 17) 17 x t y 35 overlaps(12, 35) 12 ... before(17, 12) 5 equals(5, 17) 17 x 43
  • 44.
    STR-match learning  Supervisedlearning  Videos with activity labels are provided. Pushing Hugging ? Punching Shaking hands Histogram of Relationships Feature space 44
  • 45.
    STR equations  STRmatch considers distributions of pair- wise relationships among features.  Histogram construction  STR distance: 45
  • 46.
    STR-match activity detection Must detect starting time and ending time  Models starting XYT location of an activity.  Each feature pair in a matching training video makes a vote. t 35 12 t 35 y y 12 5 5 17 17 vtrstart expected_starting(vtest) x x Original starting Expected starting 46
  • 47.
    Hierarchical recognition  Atomicaction detections as new features  Localization ability enables hierarchical recognition 47
  • 48.
    Experiments  KTH dataset  Public dataset composed of simple actions  Walking, jogging, running, waving, … 48
  • 49.
    Experiments: high-level activities High-level human activity detection results  Changing backgrounds, lighting conditions, … 49
  • 50.
    STR-match summary  Detectionfrom continuous videos  Localization using voting-based method  Noisy observations  Different backgrounds/lightings  Uncertainties  Human-human interactions  Hierarchical recognition  Future work  Hierarchy learning algorithm 50
  • 51.
    Description-based: References  Allen,J. F. and Ferguson, G., Actions and events in interval temporal logic. Journal of Logic and Computation 1994.  Pinhanez, C. S. and Bobick, A. F., Human action detection using PNF propagation of temporal constraints. CVPR 1998.  Siskind, J. M., Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. Journal of Artificial Intelligence Research (JAIR) 15, 2001.  Nevatia, R., Zhao, T., and Hongeng, S., Hierarchical language-based representation of events in video streams. In IEEE Workshop on Event Mining 2003.  Vu, V.-T., Bremond, F., and Thonnat, M., Automatic video interpretation: A novel algorithm for temporal scenario recognition. IJCAI 2003.  Ryoo, M. S. and Aggarwal, J. K., Recognition of composite human activities through context-free grammar based representation. CVPR 2006.  Ryoo, M. S. and Aggarwal, J. K., Hierarchical recognition of human activities interacting with objects. CVPR-SLAM 2007.  Tran, S. D. and Davis, L. S., Event modeling and recognition using markov logic networks. ECCV 2008.  Ryoo, M. S. and Aggarwal, J. K., Semantic representation and recognition of continued and recursive human activities. IJCV, 2009.  Ryoo, M. S. and Aggarwal, J. K., Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. ICCV 2009.  Ryoo, M. S. and Aggarwal, J. K., Stochastic representation and recognition of high-level group activities. IJCV, 2010. 51