1. Sparse Representation-based Human Action Recognition
using an Action Region-aware Dictionary
ISM 2013
December 11, 2013
Hyun-seok Min, Wesley De Neve, and Yong Man Ro
Image and Video Systems Lab
Department of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST)
e-mail: hsmin@kaist.ac.kr
web: http://ivylab.kaist.ac.kr
IEEE International Symposium on Multimedia 2013
3. Outline
• Introduction
– human action recognition
– problems
– contributions
• Sparse representation-based human action recognition
• Experiments
• Conclusions and future research
4. Conventional approach for
human action recognition
[Figure: human action recognition framework — an input video sequence goes through preprocessing (segmentation, object detection, object tracking), feature extraction (keypoint detection, e.g. 2D-Harris; descriptors, e.g. Cuboid, LMP, LBP-TOP), and classification (e.g. SR, SVM, Random Forest), producing an output label such as "Skating"]
5. Action detection vs. action recognition
• A video clip consists of a context region and an action region [1]
– human action video clip = context region + action region
– action detection (segmentation) is required for effective action recognition [2]
• Shortcomings of action detection
– despite the great emphasis on action recognition, there is comparatively little
work available on action detection [2]
– there is currently no general action detection method available that shows a
high level of effectiveness for every action
[1] K. K. Reddy and M. Shah, "Recognizing 50 Human Action Categories of Web Videos," Machine Vision and Applications Journal, vol. 24, no. 5, pp. 971-981, 2012.
[2] S. Sadanand and J. J. Corso, "Action Bank: A High-level Representation of Activity in Video," IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1234-1241, 2012.
6. Context information
for human action recognition
• Usefulness of context depends on the action class
[Figure: three example frames, (a), (b), and (c)]
– e.g., context is
• helpful for making a distinction between (a) and (b) [3]
• not helpful for making a distinction between (b) and (c)
[3] T. Lan, Y. Wang, and G. Mori, "Discriminative Figure-Centric Models for Joint Action Localization and Recognition," IEEE International Conference on Computer Vision (ICCV), 2011.
7. Research challenges & contributions
• Challenges
– lack of a general method for effective and efficient action detection
– the usefulness of context information depends on the type of action
• Contributions
– we propose a novel human action recognition method
• that does not require complex action detection during testing
• that uses context information in an adaptive way
8. Outline
• Introduction
• Sparse representation-based human action recognition
– conventional method
– proposed method
• construction of an action region-aware dictionary
• use of an action region-aware dictionary
• adaptive classification using split sparse coefficients
• Experiments
• Conclusions and future research
9. Conventional SR-based method:
dictionary construction
[Figure: training video clips for action classes 1, ..., i, ..., K; feature extraction turns each clip into a feature vector, and these feature vectors form the columns of the dictionary]

D = [z_1^1, ..., z_{N_1}^1, ..., z_1^i, ..., z_{N_i}^i, ..., z_1^K, ..., z_{N_K}^K] ∈ ℝ^{d×N}
10. Conventional SR-based method:
classification
• Given a dictionary D, the feature vector y of a test video clip V can be represented as follows:

y ≈ Dx ∈ ℝ^d

– y: feature vector of V
– D: dictionary
– x: sparse coefficient vector
• Given the sparse solution x, we can calculate the residual error for each human action as follows:

r_i(y) = ‖y − D δ_i(x)‖_1

– r_i(y): residual for the ith action
– δ_i(x): a new vector whose only nonzero entries are the entries in x that are associated with class i
[Figure: input video clip depicting 'lifting' (true action); the sparse coefficient values are concentrated in the true class. Human action classes: 1: diving, 2: golf swing, 3: kicking, 4: lifting, 5: riding, 6: running, 7: skating, 8: swing1, 9: swing2, 10: walking]
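The classification rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it substitutes a simple greedy orthogonal matching pursuit for the actual sparse solver, and the dictionary, labels, and sparsity level are toy assumptions.

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: find a k-sparse x with y ≈ Dx."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # atom most correlated with residual
        if j not in support:
            support.append(j)
        # least-squares refit on the selected atoms
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x = np.zeros(D.shape[1])
        x[support] = coef
        residual = y - D @ x
    return x

def src_classify(D, labels, y, k=5):
    """Assign y to the class i that minimizes the residual ||y - D delta_i(x)||_1."""
    x = omp(D, y, k)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - D @ np.where(labels == c, x, 0.0), 1)
                 for c in classes]
    return classes[int(np.argmin(residuals))]
```

On a toy dictionary of unit-norm random atoms, a test vector equal to one class-0 atom is assigned to class 0, since zeroing out the other classes' coefficients barely changes the reconstruction.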
11. Conventional SR-based method:
dictionary shortcomings
• The dictionary only contains class information
– we do not know the location and size of the action region of a test video clip during classification
– however, we do know the location and size of the action regions in the training video clips
• Research question
– how about putting the action region information of the training video clips into the dictionary?
[Figure: input video clip depicting 'golf swing' (true action); sparse coefficient values per class, with the coefficients belonging to the true class marked. Human action classes: 1: diving, 2: golf swing, 3: kicking, 4: lifting, 5: riding, 6: running, 7: skating, 8: swing1, 9: swing2, 10: walking]
12. Proposed SR-based method:
construction of an action region-aware dictionary
• We propose to construct a dictionary that consists of two split dictionaries:
– context region dictionary D_C
– action region dictionary D_A
[Figure: the training video clips are segmented during training into action regions and context regions; feature extraction over both region types yields the action region-aware dictionary]

D = [D_C | D_A] ∈ ℝ^{d×N}
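The construction of D = [D_C | D_A] amounts to column-stacking the two sets of per-clip feature vectors. A minimal numpy sketch, assuming feature extraction and region segmentation happen upstream and that atoms are normalized to unit length (a common convention in sparse representation, not stated on this slide):

```python
import numpy as np

def build_region_aware_dictionary(context_feats, action_feats):
    """Column-stack per-clip context and action feature vectors into D = [D_C | D_A].

    context_feats / action_feats: lists of d-dimensional feature vectors,
    in the same training-clip order, grouped by action class upstream.
    """
    D_C = np.column_stack(context_feats)              # d x N context region dictionary
    D_A = np.column_stack(action_feats)               # d x N action region dictionary
    D = np.hstack([D_C, D_A])                         # d x 2N region-aware dictionary
    D = D / np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm atoms
    return D
```

With N training clips, the resulting dictionary has 2N columns: the first N correspond to context regions, the last N to action regions, which is what lets the coefficient vector be split later.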
13. Proposed SR-based method:
use of an action region-aware dictionary (1/3)
• Given an action region-aware dictionary D and the feature vector y of a test video clip V, we can compute the sparse representation of y as follows:

y ≈ Dx = [D_C | D_A] [x_C; x_A] = D_C x_C + D_A x_A

x_C = [x_{1,C}^1, ..., x_{N_1,C}^1, ..., x_{1,C}^i, ..., x_{N_i,C}^i, ..., x_{1,C}^K, ..., x_{N_K,C}^K]
x_A = [x_{1,A}^1, ..., x_{N_1,A}^1, ..., x_{1,A}^i, ..., x_{N_i,A}^i, ..., x_{1,A}^K, ..., x_{N_K,A}^K]

– x_{j,C}^i and x_{j,A}^i: the sparse coefficient values that are associated with the context and the action region of the jth training video clip of the ith human action

During testing, the proposed method for human action recognition is able to automatically make a distinction between information originating from the context region and information originating from the action region in a test video clip.
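The split of the sparse solution into x_C and x_A is just an index split along the dictionary's two column blocks. A toy check with random stand-in data (purely illustrative dimensions) confirms the decomposition y ≈ D_C x_C + D_A x_A:

```python
import numpy as np

# Toy check: with D = [D_C | D_A], the reconstruction decomposes into
# a context term and an action term, D x = D_C x_C + D_A x_A.
rng = np.random.default_rng(1)
d, N = 6, 4
D_C = rng.normal(size=(d, N))      # context region dictionary
D_A = rng.normal(size=(d, N))      # action region dictionary
D = np.hstack([D_C, D_A])          # action region-aware dictionary
x = rng.normal(size=2 * N)         # stand-in for a sparse solution
x_C, x_A = x[:N], x[N:]            # split along the column blocks
assert np.allclose(D @ x, D_C @ x_C + D_A @ x_A)
```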
14. Proposed SR-based method:
use of an action region-aware dictionary (2/3)
[Figure: input video clip depicting 'golf swing' (true action); sparse coefficient values, split into the part belonging to the context region dictionary D_C and the part belonging to the action region dictionary D_A, per class. Human action classes: 1: diving, 2: golf swing, 3: kicking, 4: lifting, 5: riding, 6: running, 7: skating, 8: swing1, 9: swing2, 10: walking]

The sparse coefficients belonging to the context region of the 'golf swing' test video clip are dispersed over the different classes. This can be attributed to the fact that the background of 'golf swing' is visually similar to the backgrounds of 'kicking', 'riding', and 'walking'.
15. Proposed SR-based method:
use of an action region-aware dictionary (3/3)
[Figure: input video clip depicting 'diving' (true action); sparse coefficient values, split into the part belonging to the context region dictionary D_C and the part belonging to the action region dictionary D_A, per class. Human action classes: 1: diving, 2: golf swing, 3: kicking, 4: lifting, 5: riding, 6: running, 7: skating, 8: swing1, 9: swing2, 10: walking]

The sparse coefficients belonging to the context region of the 'diving' test video clip are concentrated in the true class. This means that the context region of 'diving' is different from the context regions of the other human actions.
16. Adaptive classification using
split sparse coefficients
• Given the above observations, we can hypothesize that
– information originating from context regions can help in successfully classifying human actions, on the condition that the sparse coefficients associated with the context regions are concentrated in the true class
• Measurement of the concentration of sparse coefficients
– Maximum Sparse Coefficient Concentration (MSCC)

MSCC(x) = max_k ‖δ_k(x)‖_1 / ‖x‖_1

• We can then use the following criterion to determine whether information of context regions can help in successfully classifying human actions:

MSCC(x_C) / MSCC(x_A) > ξ_ratio
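The two formulas above translate directly into code. A minimal sketch with hypothetical helper names; `labels` is assumed to map each dictionary atom to its action class:

```python
import numpy as np

def mscc(x, labels):
    """Maximum Sparse Coefficient Concentration: max_k ||delta_k(x)||_1 / ||x||_1."""
    total = np.abs(x).sum()
    if total == 0.0:
        return 0.0
    # for each class k, sum |coefficients| of that class's atoms; take the max share
    return max(np.abs(x[labels == c]).sum() for c in np.unique(labels)) / total

def context_is_useful(x_C, x_A, labels, xi_ratio=1.0):
    """Adaptive criterion: use context information only when MSCC(x_C)/MSCC(x_A) > xi_ratio."""
    denom = mscc(x_A, labels)
    return denom > 0.0 and mscc(x_C, labels) / denom > xi_ratio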
17. Outline
• Introduction
• Sparse representation-based human action recognition
• Experiments
– experimental setup
– experimental results
• Conclusions and future research
18. Experimental setup (1/2)
• Use of the UCF Sports Action data set
– contains 150 action video clips with a resolution of 720×480, collected
for various sports that are typically featured on broadcast television
channels such as BBC and ESPN
– for each frame, a bounding box is available around the person
performing the action of interest
– available action classes: diving, golf swinging, kicking, lifting, riding,
running, skating, swinging, and walking
[Figure: example frames from the UCF Sports Action data set: diving, running, golf swinging, kicking, lifting, skating, swinging, walking, riding]
19. Experimental setup (2/2)
• Comparison with
– SR with action region
• only makes use of action regions in the test video clips considered, thus
taking advantage of segmentation information
– SR with whole region
• uses whole video frames, thus not exploiting segmentation information
[Figure: illustration of "SR with whole region" (whole video frames) vs. "SR with action region" (segmented action regions)]
20. Experimental results (1/2)
• The accuracy of the proposed SR-based method for human action recognition is more stable across the different human action classes
• The accuracy of the proposed method is largely independent of the type of human action
– thanks to the use of a context-adaptive classification strategy
21. Experimental results (2/2)
• We can observe that which method is most accurate depends on the human action class considered
– "SR with action region" is usually more accurate when the concentration of the sparse coefficients associated with the action region is higher than the concentration of the sparse coefficients associated with the context region
– otherwise, "SR with whole region" or the proposed method is more effective
23. Conclusions
• We proposed a novel SR-based method for human action
recognition, having the following two major characteristics
– first, classification does not have to apply explicit segmentation to a
given test video clip
– second, classification is context adaptive in nature, only leveraging
information about the context in which the action took place when
the concentration of the corresponding sparse coefficients is high
24. Future research directions
• Use of dictionary learning techniques that allow for more
effective and efficient construction of an overcomplete
dictionary
• Perform experiments with actions that have a lower variation in background
• Study how to leverage SRC by means of an action region-aware dictionary in other application scenarios
25. Thank you!
Any questions?
e-mail: hsmin@kaist.ac.kr
web: http://ivylab.kaist.ac.kr