cvpr2011: human activity recognition - part 5: description based

Frontiers of
Human Activity Analysis

J. K. Aggarwal
Michael S. Ryoo
Kris M. Kitani

2000s

Description-based
approaches

2

Approach paradigm
 Description-based approach
 We represent the structure of the activities,
and recognize activities using semantic matching.
 Hand shake = “two persons do shake-action (stretches, stays
stretched, withdraw) simultaneously, while touching”.
 Recognition by finding observations satisfying the definition.

Humans conceptual
Human activity
knowledge on
representation
human activity
High-level
Atomic action activity
Input sequence recognition
(e.g. arm stretch)
Low-level recognition
processing

3

Comparisons
Complex Complex Recognition of Handle
Levels of
Approaches temporal logical con- recursive imperfect low-
hierarchy
relations catenations activities levels
limited
Statistical (depends on √
data amount)

Syntactic unlimited √ √

a sub-event
Siskind 2001 unlimited participates √
only once
Hongeng et limited
√ √
al. 2004 (3-levels)
Vu et al. conjunctions
unlimited √
2003 only
Ryoo and
Aggarwal unlimited √ √ √ √
2009
Gupta et al. limited network form
√ √
2009 (2-levels) only

4

2000s

Description-based
approaches
Human interactions

5

Recognition of human interactions
Semantic layer
Interaction:
in time interval
Person “pushed” by Person
<4, 20>  Interaction
Gesture layer
 Gesture
<1,20> : Facing right <1,20> : Facing left
<1,20> : Arm staying <4,20> : Arm stretching  Elementary
<1,20> : Leg staying <1,20> : Leg staying movement of a
Pose layer
person
1 : Arm withdrawn 1 : Arm withdrawn
 Pose
8 : Arm withdrawn 8 : Arm somewhat stretched  Abstract status
13 : Arm withdrawn 13 : Arm fully stretched of a body part.
Body-part layer  Body-part
feature.
 Numerical status
of a body part.
Input sequences

Ryoo and Aggarwal,
CVPR 2006
6

Atomic actions
 Operation triplets <agent, motion, target>
 Gesture together with subject and object information.
 Unit human activity.
 Computed based on gestures.
 Ex> person1 stretches his/her arm
 <p1’s arm, stretch, null>
 Time intervals
 Ex> Time intervals detected for Pointing action
Time

P1:Head :222222222222222222 <p1‟s arm, stretch, p2>
P1:ArmV :322021221111122333 <p1‟s arm, stay stret., p2>
P1:ArmH :100022222222221100 <p1‟s arm, withdraw, null>
Sequences of poses Time intervals of operation triplets
7

Semantic layer recognition

Machine-understandable
Punching is a sequence of hand
representation of Punching
stretch and withdrawal.

Similar?

Time
<p2‟s arm, stretch, p1>
<p2‟s arm, stay stret., p1>
…<p2‟s arm, withdraw, null> …
<p2‟s arm, stay withd., null>
<p2‟s head, face left, null>
<p1‟s arm, withdraw, nulll>
Time intervals of gestures
Video observations
8

Human activity representation
 Semantics  Syntax
Knowledge on the structure  Rules to construct formal
of an activity. representation.
 Punching is a sequence of  Organizes a set of
hand stretch and withdrawal. vocabularies to describe
Time intervals the activities‟ structure.
Allen‟s temporal predicates  Context-free grammar
this=Punching_action (p1) Punching_action(i) = (
CFG
list( def(„x‟, Arm_Stretch(i)),
def(„y‟, Arm_Withdraw(i)) ),
x=arm_ y=arm_
and( meets(„x‟, „y‟),
stretch(p1) withdraw(p1)
and( starts(„x‟, „this‟),
finishes(„y‟, „this‟)) ) );
Conceptual/verbal description Machine-understandable language
9

Description-based

Hierarchical activity representation
 Representation of the „shake-hands‟ interaction
Shake hands ShakeHandsInteractions(i, j) = (
interaction list( def(„x‟, ShakeHandsAction(i)),
list( def(„y‟, ShakeHandsAction(j)),
def(„z‟, TouchingInteraction(i, j))) ),
CFG
and( and( during(„z‟, „x‟), during(„z‟, „y‟)),
Syntax
Hand shake = “two„this‟), finishes(„z‟, „this‟)))
and( starts(„z‟,
persons do
);
shake-action (stretches, stays stretched,
ShakeHandsActions(i) = (
withdraw)
list( def(„x‟, Arm_Stretch(i)),
simultaneously, while touching”.
list( def(„y‟, Arm_Stay_Stretched(i)),
def(„z‟, Arm_withdraw(i))) ),
and( and( meets(„x‟, „y‟), meets(„y‟, „z‟)),
and( starts(„x‟, „this‟), finishes(„z‟, „this‟)))
);
TouchingInteraction(i, j)=(null, touch(i, j, 0));

10

Hierarchical recognition algorithm
 Recognition process of the „Shake-hands‟ interaction.

11

Continued and recursive activities
 Interaction „fighting‟
 Composed of multiple negative interactions
 Punching + kicking + pushing + punching + …
 Iterative approach is taken.

Fighting
Interaction(i, j)

Negative Fighting
Interaction(i, j) Interaction(i, j)

Negative Fighting
Base Case

Negative Fighting
12

Experiments - Simple interactions
 Recognized 8 types of simple interactions, which were
recognized in Park and Aggarwal, 2004
 (approach, depart, point, shake-hands, hug, punch, kick, and
push)
 A videos of a sequence of interactions are taken. (continuous
executions)
 Interactions are described in more detailed and formal way,
resulting better recognition accuracy.

13

Example Experiment - Fighting
Input video: Processed video:

Gestures
Poses: and
activities:
14

Past-Now-Future networks
 Pinhanez and Bobick 1998
 PNF networks to represent temporal structure
of an activity.
 Kitchen activities:

[Pinhanez, C. S. and Bobick, A. F., Human action detection using PNF propagation
of temporal constraints. CVPR 1998]
15

Event logic
 Siskind 2001
 Logical concatenations of predicates
 Time intervals?

[Siskind, J. M., Grounding the lexical semantics of verbs in visual perception using force
dynamics and event logic. Journal of Artificial Intelligence Research (JAIR) 15, 2001]
16

Representation languages
 Nevatia, Zhao,
and Hongeng 2003
 VERL - language

 Vu, Bremond,
Thonnat 2003
 Similar to Nevatia
et al. 2003

 Recursive? Uncertainties?
17

Stochastic approaches
 Limitations of the conventional description-
based approaches
 Uncertainties? – stochastic recognition

 Probabilistic framework needed
 [Ryoo and Aggarwal, IJCV 2009]
 [Tran and Davis, ECCV 2008] - MLN

18

Hierarchical matching algorithm
 Recognition process tree of „steal(p1, s1, p2)‟
Parameter Objects:
(person p1, suitcase s1, person p2)
P(S | O1, M1, …) = 0.8

Steal(p1, s1, p2)

P Obj: (person p1, suitcase s1) P(A3 | O3, M3) = 0.5 P Obj: (person p2, suitcase s1)
meets meets
Carry(p1, s1) Stay(s1) Carry(p2, s1)

P(A1 | O1, M1) = 0.8 P(A2 | O2, M2) = 0.7 P(A4 | O4, M4) = 0.9 P(A5 | O5, M5) = 0.9
equals equals
MoveR(p1) MoveR(s1) MoveL(p2) MoveL(s1)

Recognition results from the object and motion layers

19

Description-based

Probabilistic recognition
 Probability of the activity given observation

.

.
.
.

Structural Gesture detection
similarity confidence
20

Experiments
 Recognized following six types of interactions.
 Each activity was tested with at least 10 sequences.
 Carrying a box, leaving a box, placing a box into a trash bin.
 Carrying a suitcase, leaving a suitcase, stealing the suitcase.
 Object and Motion layer trained with 5 sequences.

Time
Carry(Person1, SuitCase1) :
Stay(SuitCase1) :
Carry(Person2, SuitCase1) :
Steal(Person1, SuitCase1, Person2) :
21

Experiments
 Example
 a person placing a box into a trash bin

Time
Move(Person1, right) :
Move(Box1, right) :
Move(Person1, left) :
Move(Box1, down) :
Carry(Person1, Box1) :
Trash(Person1, Box1, TrashBin1) :
22

Experiments - Performance
 Recognition accuracy (true positives):
 Compared with a multi-object version of previous works.
P 1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
Carry Leave Steal Carry Leave Trash Total
-Suitcase -Suitcase -Suitcase -Box -Box -Box
2 atomics 3 atomics 5 atomics 2 atomics 3 atomics 3 atomics
1 spatials 1 spatials 2 spatials 1 spatials 1 spatials 3 spatials

 False positives rates are almost zero for all activities.
23

Advantages
 Ability to represent and recognize an activity composed
of concurrent sub-events.
 Ex> “touching occurred during pushing”

this==Pushing_interactions(p1,p2)

i=Arm_Stretch(p1) j=Arm_Stay_Stretched(p1)

k =Touch(p1, p2) l =Depart(p2, p1)

 Ability to represent and recognize „recursive activities‟
 Ex> Fighting = Fighting + another negative interaction.
 Less data required for training.
 „Structure of activities‟ are encoded based on human knowledge.
 High recognition accuracy?
24

2008, 2009

Description-based
approaches
Group activities

25

Group activity
 Events performed by
groups
 Various types of
complex activities
 Group-person interaction
 Group-group interaction
 Uncertain nature
 Varying # of
participants Group Stealing (grp vs. per)
Group Assault (grp vs. grp)
 Dynamic spatial A person with red shirts is
relation taking the laptop on the table
while the others are talking
Ryoo and Aggarwal,
CVPR-SIG 09, IJCV 11 26
26

Representation
 Group stealing Formal representation:
1. Member variables b in Owners,
∃ a in Thieves, ∀
∃ c in Thieves
∀ ∀ 2. Time intervals
def(t1, TakeObject(a)),
b b def(t2, Distract(c, b))
∃
c ∃ 3. Predicates this),
equals (t1,
∃ c c
Distract! during (t1, t2)
Distract!Distract!
Owners Time
∃ Distract (r1, g1)
a Thieves Distract (r2, g1)
Distract (r3, g2) t2
Take object! during
TakeObject (r4) t1
Object
equals
Stealing(R, G) this

Time intervals of activities of
individual members
27

Recognition overview
 3 key components Generates a pool of member candidates
Distract Distract

Distract

Take Distract Distract
Distract
Stand Take

Stand

Who? What? When?
 Recognition: NP-hard. Approximation required.
 Obtain a pool of group member candidates
with non-zero probability.
 Not many persons perform sub-events.
 M *  arg max M P(Gt (M ) | O M )
28

Temporal constraints
 Hierarchical temporal constraint matching.
Member variables:
∃a in Thieves, ∀b in Owners, ∃c in Thieves

Steal(T, O)

before during
Approach(c, b) Distract(c, b) TakeObject(a)

Video inputs

29

Group candidates
 Among possible groupings,
 Find a set of group members:
M  {m1 , m2 ,..., m|M |}
which maximizes the overall
probability.
 Bayesian formulation
P(Gt | O)  max M P(Gt (M ) | O M )
 G (M )
 max M
 G ( M )   G ( M )
where  G (M )  P(OM | Gt (M ))  P(Gt (M ))
30

Markov chain Monte Carlo
 MCMC-based probability estimation.
 Provides a set of samples from the distribution.
 Models the probability distribution.
 Metropolis-Hastings algorithm
 P(M t 1 , M ' )  min(1, a)
 ( M ' )  q( M ' , M )
 a G
 G ( M )  q( M , M ' ) Add Add
Add
 Actions: Add
Remove

 Add: M '  M t 1 {m} where m  Ci
 Remove: M '  M t 1  {m}
32

Experimental setting
 We have tested 45 sequences of 8 activities.
 320*240 with 10 fps
 CCTV videos download from YouTube.
 Group stealing in Malaysia and group arresting in UK.
 Videos that we have taken with 10 participants in
various environments.
 A group of people carrying a large object.
 A group of people assaulting a person.

Videos of real human activities
33

Experiments - stealing
 Group stealing
 One of thieves
steals a laptop,
while the other
thieves are
distracting the
shop owner.
Thieves

Owners

Laptop

34

Experiments - arresting
 Group arresting
 A group of
policemen
arresting a group
of suspicious
persons.
 Color histogram
Policemen

Criminal
candidates
Pedestrians

35

Experiments – group assault
 Highly stochastic
 There may be (and may not be) attackers whose
guarding the area, or just watching.
 10 videos.

36

Experiments – group assault
 Highly stochastic
 There may be (and may not be) attackers whose
guarding the area, or just watching.
 10 videos.

37

Experimental results
 Recognition accuracy
 False positive rates are almost 0 because of the detailed
representations: previous, deterministic, stochastic.
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
Move Carry CarryCmd Fight Fight Steal Arrest Assault Total
G G GP GG IG GG GG GG
∀ ∃∀ ∃∀ ∀∃ ∃∃ ∃∀∃ ∀∃ ∀∃∃
38

2009

Spatio-Temporal
Relationship Match

39

Description-based vs. Space-time
 Description-based  Space-time
 High-level activities  Reliable under noise
 Hierarchical  Difficult to model
 Semantic structures complex activities
 Difficult to cope with  Miss semantic
noise structures
t

40

Space-time approaches
 Video classification
 Each video is represented as a histogram
Hugging
Shaking hands

t

Punching Pushing

Spatio-temporal features

 Limitation:
 Unable to model complex activities Laptev 04,
Dollar et al. 05 41

Spatio-temporal relations (STRs)
t t
y 35 12 overlaps(12, 35)
...
before(17, 12)
5 overlaps(5, 17)
17
x
t t
y
35 overlaps(12, 35)
12
...
before(17, 12)
5
equals(5, 17)
17
x Ryoo and Aggarwal,
Videos Feature relations ICCV 2009 42

Histogram of STRs
t
y 35 12 overlaps(12, 35)
...
before(17, 12)
5 overlaps(5, 17)
17
x

t
y
35 overlaps(12, 35)
12
...
before(17, 12)
5
equals(5, 17)
17
x 43

STR-match learning
 Supervised learning
 Videos with activity labels are provided.
Pushing Hugging
?

Punching Shaking hands

Histogram of
Relationships

Feature space 44

STR equations
 STR match considers distributions of pair-
wise relationships among features.
 Histogram construction

 STR distance:

45

STR-match activity detection
 Must detect starting time and ending time
 Models starting XYT location of an activity.
 Each feature pair in a matching training video
makes a vote.
t 35 12 t 35
y y 12

5
5
17 17
vtrstart expected_starting(vtest)
x x
Original starting Expected starting
46

Hierarchical recognition
 Atomic action detections as new features
 Localization ability enables hierarchical
recognition

47

Experiments
 KTH dataset
 Public dataset composed of simple actions
 Walking, jogging, running, waving, …

48

Experiments: high-level activities
 High-level human activity detection results
 Changing backgrounds, lighting conditions, …

49

STR-match summary
 Detection from continuous videos
 Localization using voting-based method
 Noisy observations
 Different backgrounds/lightings
 Uncertainties
 Human-human interactions
 Hierarchical recognition
 Future work
 Hierarchy learning algorithm
50

Description-based: References
 Allen, J. F. and Ferguson, G., Actions and events in interval temporal logic. Journal of Logic and
Computation 1994.
 Pinhanez, C. S. and Bobick, A. F., Human action detection using PNF propagation of temporal
constraints. CVPR 1998.
 Siskind, J. M., Grounding the lexical semantics of verbs in visual perception using force dynamics
and event logic. Journal of Artificial Intelligence Research (JAIR) 15, 2001.
 Nevatia, R., Zhao, T., and Hongeng, S., Hierarchical language-based representation of events in
video streams. In IEEE Workshop on Event Mining 2003.
 Vu, V.-T., Bremond, F., and Thonnat, M., Automatic video interpretation: A novel algorithm for
temporal scenario recognition. IJCAI 2003.
 Ryoo, M. S. and Aggarwal, J. K., Recognition of composite human activities through context-free
grammar based representation. CVPR 2006.
 Ryoo, M. S. and Aggarwal, J. K., Hierarchical recognition of human activities interacting with objects.
CVPR-SLAM 2007.
 Tran, S. D. and Davis, L. S., Event modeling and recognition using markov logic networks. ECCV
2008.
 Ryoo, M. S. and Aggarwal, J. K., Semantic representation and recognition of continued and
recursive human activities. IJCV, 2009.
 Ryoo, M. S. and Aggarwal, J. K., Spatio-temporal relationship match: Video structure comparison
for recognition of complex human activities. ICCV 2009.
 Ryoo, M. S. and Aggarwal, J. K., Stochastic representation and recognition of high-level group
activities. IJCV, 2010.
51

cvpr2011: human activity recognition - part 5: description based

More Related Content

Similar to cvpr2011: human activity recognition - part 5: description based

More from zukun

Recently uploaded

cvpr2011: human activity recognition - part 5: description based