1. Sparse Representation-based Human Action Recognition
using an Action Region-aware Dictionary
ISM 2013
December 11, 2013
Hyun-seok Min, Wesley De Neve, and Yong Man Ro
Image and Video Systems Lab
Department of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST)
e-mail: hsmin@kaist.ac.kr
web: http://ivylab.kaist.ac.kr
IEEE International Symposium on Multimedia 2013
3. Outline
• Introduction
– human action recognition
– problems
– contributions
• Sparse representation-based human action recognition
• Experiments
• Conclusions and future research
4. Conventional approach for
human action recognition
[Figure: human action recognition framework — an input video sequence goes through preprocessing (segmentation, object detection, object tracking), feature extraction (keypoint detection, e.g. 2D-Harris; descriptors, e.g. Cuboid, LMP, LBP-TOP), and classification (e.g. SR, SVM, Random Forest), producing an output label such as "Skating"]
5. Action detection vs. action recognition
• A video clip consists of a context region and an action region [1]
– human action video clip = context region + action region
– action detection (segmentation) is required for effective action recognition [2]
• Shortcomings of action detection
– despite the great emphasis on action recognition, there is comparatively little
work available on action detection [2]
– there is currently no general action detection method available that shows a
high level of effectiveness for every action
[1] K. K. Reddy and M. Shah, "Recognizing 50 Human Action Categories of Web Videos," Machine Vision and Applications Journal, vol. 24, no. 5, pp. 971-981, 2012.
[2] S. Sadanand and J. J. Corso, "Action Bank: A High-level Representation of Activity in Video," IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1234-1241, 2012.
6. Context information
for human action recognition
• Usefulness of context depends on the action class
[Figure: three example frames, (a), (b), and (c)]
– e.g., context is
• helpful for making a distinction between (a) and (b) [3]
• not helpful for making a distinction between (b) and (c)
[3] T. Lan, Y. Wang, and G. Mori, "Discriminative Figure-Centric Models for Joint Action Localization and Recognition," IEEE International Conference on Computer Vision (ICCV), 2011.
7. Research challenges & contributions
• Challenges
– lack of a general method for effective and efficient action detection
– the usefulness of context information depends on the type of action
• Contributions
– we propose a novel human action recognition method
• that does not require complex action detection during testing
• that uses context information in an adaptive way
8. Outline
• Introduction
• Sparse representation-based human action recognition
– conventional method
– proposed method
• construction of an action region-aware dictionary
• use of an action region-aware dictionary
• adaptive classification using split sparse coefficients
• Experiments
• Conclusions and future research
9. Conventional SR-based method:
dictionary construction
[Figure: training video clips for action classes 1, ..., i, ..., K; feature extraction turns each clip into a feature vector, and these feature vectors form the columns of the dictionary]

D = [z_1^1, ..., z_{N_1}^1, ..., z_1^i, ..., z_{N_i}^i, ..., z_1^K, ..., z_{N_K}^K] ∈ ℝ^{d×N}
10. Conventional SR-based method:
classification
• Given a dictionary D, the feature vector y of a test video clip V can be represented as follows:

y ≈ Dx ∈ ℝ^d

– y: feature vector of V
– D: dictionary
– x: sparse coefficient vector
• Given the sparse solution x, we can calculate the residual error for each human action as follows:

r_i(y) = ‖y − D δ_i(x)‖_1

– r_i(y): residual for the ith action
– δ_i(x): a new vector whose only nonzero entries are the entries in x that are associated with class i
[Figure: input video clip depicting 'lifting' (true action); the sparse coefficient values are concentrated in the true class. Human action classes: 1: diving, 2: golf swing, 3: kicking, 4: lifting, 5: riding, 6: running, 7: skating, 8: swing1, 9: swing2, 10: walking]
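The classification rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it substitutes a simple greedy orthogonal matching pursuit for the actual sparse solver, and the dictionary, labels, and sparsity level are toy assumptions.

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: find a k-sparse x with y ≈ Dx."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # atom most correlated with residual
        if j not in support:
            support.append(j)
        # least-squares refit on the selected atoms
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x = np.zeros(D.shape[1])
        x[support] = coef
        residual = y - D @ x
    return x

def src_classify(D, labels, y, k=5):
    """Assign y to the class i that minimizes the residual ||y - D delta_i(x)||_1."""
    x = omp(D, y, k)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - D @ np.where(labels == c, x, 0.0), 1)
                 for c in classes]
    return classes[int(np.argmin(residuals))]
```

On a toy dictionary of unit-norm random atoms, a test vector equal to one class-0 atom is assigned to class 0, since zeroing out the other classes' coefficients barely changes the reconstruction.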
11. Conventional SR-based method:
dictionary shortcomings
• The dictionary only contains class information
– we do not know the location and size of the action region of a test video clip during classification
– however, we do know the location and size of the action regions in the training video clips
• Research question
– how about putting the action region information of the training video clips into the dictionary?
[Figure: input video clip depicting 'golf swing' (true action); sparse coefficient values per class, with the coefficients belonging to the true class marked. Human action classes: 1: diving, 2: golf swing, 3: kicking, 4: lifting, 5: riding, 6: running, 7: skating, 8: swing1, 9: swing2, 10: walking]
12. Proposed SR-based method:
construction of an action region-aware dictionary
• We propose to construct a dictionary that consists of two split dictionaries:
– context region dictionary D_C
– action region dictionary D_A
[Figure: the training video clips are segmented during training into action regions and context regions; feature extraction over both region types yields the action region-aware dictionary]

D = [D_C | D_A] ∈ ℝ^{d×N}
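The construction of D = [D_C | D_A] amounts to column-stacking the two sets of per-clip feature vectors. A minimal numpy sketch, assuming feature extraction and region segmentation happen upstream and that atoms are normalized to unit length (a common convention in sparse representation, not stated on this slide):

```python
import numpy as np

def build_region_aware_dictionary(context_feats, action_feats):
    """Column-stack per-clip context and action feature vectors into D = [D_C | D_A].

    context_feats / action_feats: lists of d-dimensional feature vectors,
    in the same training-clip order, grouped by action class upstream.
    """
    D_C = np.column_stack(context_feats)              # d x N context region dictionary
    D_A = np.column_stack(action_feats)               # d x N action region dictionary
    D = np.hstack([D_C, D_A])                         # d x 2N region-aware dictionary
    D = D / np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm atoms
    return D
```

With N training clips, the resulting dictionary has 2N columns: the first N correspond to context regions, the last N to action regions, which is what lets the coefficient vector be split later.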
13. Proposed SR-based method:
use of an action region-aware dictionary (1/3)
• Given an action region-aware dictionary D and the feature vector y of a test video clip V, we can compute the sparse representation of y as follows:

y ≈ Dx = [D_C | D_A] [x_C; x_A] = D_C x_C + D_A x_A

x_C = [x_{1,C}^1, ..., x_{N_1,C}^1, ..., x_{1,C}^i, ..., x_{N_i,C}^i, ..., x_{1,C}^K, ..., x_{N_K,C}^K]
x_A = [x_{1,A}^1, ..., x_{N_1,A}^1, ..., x_{1,A}^i, ..., x_{N_i,A}^i, ..., x_{1,A}^K, ..., x_{N_K,A}^K]

– x_{j,C}^i and x_{j,A}^i: the sparse coefficient values that are associated with the context and the action region of the jth training video clip of the ith human action

During testing, the proposed method for human action recognition is able to automatically make a distinction between information originating from the context region and information originating from the action region in a test video clip.
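The split of the sparse solution into x_C and x_A is just an index split along the dictionary's two column blocks. A toy check with random stand-in data (purely illustrative dimensions) confirms the decomposition y ≈ D_C x_C + D_A x_A:

```python
import numpy as np

# Toy check: with D = [D_C | D_A], the reconstruction decomposes into
# a context term and an action term, D x = D_C x_C + D_A x_A.
rng = np.random.default_rng(1)
d, N = 6, 4
D_C = rng.normal(size=(d, N))      # context region dictionary
D_A = rng.normal(size=(d, N))      # action region dictionary
D = np.hstack([D_C, D_A])          # action region-aware dictionary
x = rng.normal(size=2 * N)         # stand-in for a sparse solution
x_C, x_A = x[:N], x[N:]            # split along the column blocks
assert np.allclose(D @ x, D_C @ x_C + D_A @ x_A)
```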
14. Proposed SR-based method:
use of an action region-aware dictionary (2/3)
[Figure: input video clip depicting 'golf swing' (true action); sparse coefficient values, split into the part belonging to the context region dictionary D_C and the part belonging to the action region dictionary D_A, per class. Human action classes: 1: diving, 2: golf swing, 3: kicking, 4: lifting, 5: riding, 6: running, 7: skating, 8: swing1, 9: swing2, 10: walking]

The sparse coefficients belonging to the context region of the 'golf swing' test video clip are dispersed over the different classes. This can be attributed to the fact that the background of 'golf swing' is visually similar to the backgrounds of 'kicking', 'riding', and 'walking'.
15. Proposed SR-based method:
use of an action region-aware dictionary (3/3)
[Figure: input video clip depicting 'diving' (true action); sparse coefficient values, split into the part belonging to the context region dictionary D_C and the part belonging to the action region dictionary D_A, per class. Human action classes: 1: diving, 2: golf swing, 3: kicking, 4: lifting, 5: riding, 6: running, 7: skating, 8: swing1, 9: swing2, 10: walking]

The sparse coefficients belonging to the context region of the 'diving' test video clip are concentrated in the true class. This means that the context region of 'diving' is different from the context regions of the other human actions.
16. Adaptive classification using
split sparse coefficients
• Given the above observations, we can hypothesize that
– information originating from context regions can help in successfully classifying human actions, on the condition that the sparse coefficients associated with the context regions are concentrated in the true class
• Measurement of the concentration of sparse coefficients
– Maximum Sparse Coefficient Concentration (MSCC)

MSCC(x) = max_k ‖δ_k(x)‖_1 / ‖x‖_1

• We can then use the following criterion to determine whether information of context regions can help in successfully classifying human actions:

MSCC(x_C) / MSCC(x_A) > ξ_ratio
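The two formulas above translate directly into code. A minimal sketch with hypothetical helper names; `labels` is assumed to map each dictionary atom to its action class:

```python
import numpy as np

def mscc(x, labels):
    """Maximum Sparse Coefficient Concentration: max_k ||delta_k(x)||_1 / ||x||_1."""
    total = np.abs(x).sum()
    if total == 0.0:
        return 0.0
    # for each class k, sum |coefficients| of that class's atoms; take the max share
    return max(np.abs(x[labels == c]).sum() for c in np.unique(labels)) / total

def context_is_useful(x_C, x_A, labels, xi_ratio=1.0):
    """Adaptive criterion: use context information only when MSCC(x_C)/MSCC(x_A) > xi_ratio."""
    denom = mscc(x_A, labels)
    return denom > 0.0 and mscc(x_C, labels) / denom > xi_ratio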
17. Outline
• Introduction
• Sparse representation-based human action recognition
• Experiments
– experimental setup
– experimental results
• Conclusions and future research
18. Experimental setup (1/2)
• Use of the UCF Sports Action data set
– contains 150 action video clips with a resolution of 720×480, collected
for various sports that are typically featured on broadcast television
channels such as BBC and ESPN
– for each frame, a bounding box is available around the person
performing the action of interest
– available action classes: diving, golf swinging, kicking, lifting, riding,
running, skating, swinging, and walking
[Figure: example frames from the UCF Sports Action data set: diving, running, golf swinging, kicking, lifting, skating, swinging, walking, riding]
19. Experimental setup (2/2)
• Comparison with
– SR with action region
• only makes use of action regions in the test video clips considered, thus
taking advantage of segmentation information
– SR with whole region
• uses whole video frames, thus not exploiting segmentation information
[Figure: illustration of "SR with whole region" (whole video frames) vs. "SR with action region" (segmented action regions)]
20. Experimental results (1/2)
• The accuracy of the proposed SR-based method for human action recognition is more stable across the different human action classes
• The accuracy of the proposed method is largely independent of the type of human action
– thanks to the use of a context-adaptive classification strategy
21. Experimental results (2/2)
• We can observe that which method is most accurate depends on the human action class considered
– "SR with action region" is usually more accurate when the concentration of the sparse coefficients associated with the action region is higher than the concentration of the sparse coefficients associated with the context region
– otherwise, "SR with whole region" or the proposed method is more effective
23. Conclusions
• We proposed a novel SR-based method for human action
recognition, having the following two major characteristics
– first, classification does not have to apply explicit segmentation to a
given test video clip
– second, classification is context adaptive in nature, only leveraging
information about the context in which the action took place when
the concentration of the corresponding sparse coefficients is high
24. Future research directions
• Use of dictionary learning techniques that allow for more
effective and efficient construction of an overcomplete
dictionary
• Perform experiments with actions that have a lower variation in background
• Study how to leverage SRC by means of an action region-aware dictionary in other application scenarios
25. Thank you!
Any questions?
e-mail: hsmin@kaist.ac.kr
web: http://ivylab.kaist.ac.kr