Sparse representation based human action recognition using an action region-aware dictionary
Presentation Transcript

  • Sparse Representation-based Human Action Recognition using an Action Region-aware Dictionary
    IEEE International Symposium on Multimedia (ISM 2013), December 11, 2013
    Hyun-seok Min, Wesley De Neve, and Yong Man Ro
    Image and Video Systems Lab, Department of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)
    e-mail: hsmin@kaist.ac.kr, web: http://ivylab.kaist.ac.kr
  • Outline
    • Introduction
    • Sparse representation-based human action recognition
    • Experiments
    • Conclusions and future research
  • Outline
    • Introduction
      – human action recognition
      – problems
      – contributions
    • Sparse representation-based human action recognition
    • Experiments
    • Conclusions and future research
  • Conventional approach for human action recognition
    [Framework diagram] input video sequence → preprocessing (segmentation, object detection, object tracking, keypoint detection) → feature extraction (descriptors such as Cuboid, 2D-Harris, LBP-TOP, LMP) → classification (SR, SVM, Random Forest) → output label (e.g., "Skating")
  • Action detection vs. action recognition
    • A video clip consists of a context region and an action region [1]
      – action detection (segmentation) is required for effective action recognition [2]
      [Figure: human action video clip = context region + action region]
    • Shortcomings of action detection
      – despite the great emphasis on action recognition, there is comparatively little work available on action detection [2]
      – there is currently no general action detection method available that shows a high level of effectiveness for every action
    [1] K. K. Reddy and M. Shah, "Recognizing 50 Human Action Categories of Web Videos," Machine Vision and Applications Journal, vol. 24, no. 5, pp. 971-981, 2012.
    [2] S. Sadanand and J. J. Corso, "Action Bank: A High-Level Representation of Activity in Video," IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1234-1241, 2012.
  • Context information for human action recognition
    • Usefulness of context depends on the action class
      [Figure: three example clips (a), (b), and (c)]
      – e.g., context is
        • helpful for making a distinction between (a) and (b) [3]
        • not helpful for making a distinction between (b) and (c) [3]
    [3] Tian Lan, Yang Wang, and Greg Mori, "Discriminative Figure-Centric Models for Joint Action Localization and Recognition," IEEE International Conference on Computer Vision (ICCV), 2011.
  • Research challenges & contributions
    • Challenges
      – lack of a general method for effective and efficient action detection
      – the usefulness of context information depends on the type of action
    • Contributions
      – we propose a novel human action recognition method
        • that does not require complex action detection during testing
        • that uses context information in an adaptive way
  • Outline
    • Introduction
    • Sparse representation-based human action recognition
      – conventional method
      – proposed method
        • construction of an action region-aware dictionary
        • use of an action region-aware dictionary
        • adaptive classification using split sparse coefficients
    • Experiments
    • Conclusions and future research
  • Conventional SR-based method: dictionary construction
    [Figure: features are extracted from the training video clips of action classes 1, ..., i, ..., K and stacked column-wise into the dictionary]
    $D = [\mathbf{z}^1_1, \ldots, \mathbf{z}^1_{N_1}, \ldots, \mathbf{z}^i_1, \ldots, \mathbf{z}^i_{N_i}, \ldots, \mathbf{z}^K_1, \ldots, \mathbf{z}^K_{N_K}] \in \mathbb{R}^{d \times N}$
    (a minimal code sketch follows)
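The column stacking above can be made concrete with a minimal sketch, assuming a hypothetical extract_feature(clip) helper that returns a d-dimensional descriptor (e.g., an LBP-TOP or cuboid-based feature); neither the helper nor the unit-norm normalization is prescribed by the slides.

```python
import numpy as np

def build_dictionary(training_clips, labels, extract_feature):
    """Stack l2-normalized training features column-wise into D (d x N)."""
    columns, column_labels = [], []
    for clip, label in zip(training_clips, labels):
        z = extract_feature(clip)                # d-dimensional feature vector
        z = z / (np.linalg.norm(z) + 1e-12)      # unit-norm atoms, common in SRC
        columns.append(z)
        column_labels.append(label)
    D = np.stack(columns, axis=1)                # D in R^{d x N}
    return D, np.asarray(column_labels)
```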
  • Conventional SR-based method: classification
    • Given a dictionary D, the feature vector y of a test video clip V can be represented as
      $\mathbf{y} \approx D\mathbf{x} \in \mathbb{R}^d$
      – y: feature vector of V, D: dictionary, x: sparse coefficient vector
    • Given the sparse solution x, we can calculate the residual error for each human action as
      $r_i(\mathbf{y}) = \|\mathbf{y} - D\delta_i(\mathbf{x})\|_2$
      – r_i(y): residual for the ith action
      – δ_i(x): a new vector whose only nonzero entries are the entries in x that are associated with class i
    [Figure: sparse coefficient values for an input video clip depicting 'lifting' (true action); the dominant coefficients belong to the true class. Human action classes: 1: diving, 2: golf swing, 3: kicking, 4: lifting, 5: riding, 6: running, 7: skating, 8: swing1, 9: swing2, 10: walking]
    (a minimal code sketch follows)
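A minimal classification sketch, assuming the dictionary and per-atom labels from the previous snippet; the slides only require some sparse solver, so the use of Orthogonal Matching Pursuit from scikit-learn and the sparsity level are assumptions.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def src_classify(y, D, atom_labels, n_nonzero_coefs=30):
    """Assign y to the class whose atoms yield the smallest residual r_i(y)."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero_coefs, fit_intercept=False)
    omp.fit(D, y)                                       # solve y ~= D x with a sparse x
    x = omp.coef_
    residuals = {}
    for c in np.unique(atom_labels):
        delta_c = np.where(atom_labels == c, x, 0.0)    # delta_i(x): keep only class-c entries
        residuals[c] = np.linalg.norm(y - D @ delta_c)  # r_i(y) = ||y - D delta_i(x)||_2
    predicted = min(residuals, key=residuals.get)
    return predicted, x, residuals
```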
  • Conventional SR-based method: dictionary shortcomings
    [Figure: sparse coefficient values for an input video clip depicting 'golf swing' (true action), with the coefficients belonging to the true class highlighted]
    • The dictionary only contains class information
      – we do not know the location and size of the action region of a test video clip during classification
      – however, we do know the location and size of the action regions in the training video clips
    • Research question
      – how about putting the action region information of the training video clips into the dictionary?
  • Proposed SR-based method: construction of an action region-aware dictionary
    • We propose to construct a dictionary that consists of two split dictionaries:
      – context region dictionary D_C
      – action region dictionary D_A
    [Figure: the training video clips are segmented during training into action regions and context regions; features are extracted from both and stacked into the action region-aware dictionary]
    $D = [D_C \mid D_A] \in \mathbb{R}^{d \times N}$
    (a minimal code sketch follows)
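A sketch of how the two split dictionaries might be assembled, assuming a hypothetical segment(clip) helper that returns the context region and the action region of a training clip (the slides obtain these via segmentation during training) and the same hypothetical extract_feature as before.

```python
import numpy as np

def build_region_aware_dictionary(training_clips, labels, segment, extract_feature):
    """Build the action region-aware dictionary D = [D_C | D_A] from segmented training clips."""
    context_cols, action_cols, column_labels = [], [], []
    for clip, label in zip(training_clips, labels):
        context_region, action_region = segment(clip)      # segmentation is needed during training only
        context_cols.append(extract_feature(context_region))
        action_cols.append(extract_feature(action_region))
        column_labels.append(label)
    D_C = np.stack(context_cols, axis=1)   # context region dictionary
    D_A = np.stack(action_cols, axis=1)    # action region dictionary
    D = np.hstack([D_C, D_A])              # action region-aware dictionary [D_C | D_A]
    return D, np.asarray(column_labels)
```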
  • Proposed SR-based method: use of an action region-aware dictionary (1/3)
    • Given an action region-aware dictionary D and the feature vector y of a test video clip V, we can compute the sparse representation of y as follows:
      $\mathbf{y} \approx D\mathbf{x} \cong [D_C \mid D_A]\begin{bmatrix}\mathbf{x}_C \\ \mathbf{x}_A\end{bmatrix} = D_C\mathbf{x}_C + D_A\mathbf{x}_A$
      $\mathbf{x}_C = [x^1_{1,C}, \ldots, x^1_{N_1,C}, \ldots, x^i_{1,C}, \ldots, x^i_{N_i,C}, \ldots, x^K_{1,C}, \ldots, x^K_{N_K,C}]$
      $\mathbf{x}_A = [x^1_{1,A}, \ldots, x^1_{N_1,A}, \ldots, x^i_{1,A}, \ldots, x^i_{N_i,A}, \ldots, x^K_{1,A}, \ldots, x^K_{N_K,A}]$
      – $x^i_{j,C}$ and $x^i_{j,A}$: the sparse coefficient values that are associated with the context region and the action region of the jth training video clip of the ith human action
    • During testing, the proposed method for human action recognition is thus able to automatically make a distinction between information originating from the context region and information originating from the action region of a test video clip.
    (a minimal code sketch follows)
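A sketch of how the split coefficients x_C and x_A can be recovered for a test clip, assuming the dictionary was built as above with the context atoms stored before the action atoms; the OMP solver and sparsity level are again assumptions.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def split_sparse_code(y, D, n_context_atoms, n_nonzero_coefs=30):
    """Sparse-code y against [D_C | D_A] and split x into (x_C, x_A)."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero_coefs, fit_intercept=False)
    omp.fit(D, y)                    # y ~= D_C x_C + D_A x_A
    x = omp.coef_
    x_C = x[:n_context_atoms]        # coefficients tied to context-region atoms
    x_A = x[n_context_atoms:]        # coefficients tied to action-region atoms
    return x_C, x_A
```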
  • Proposed SR-based method: use of an action region-aware dictionary (2/3)
    • The sparse coefficients belonging to the context region of the 'golf swing' test video clip are dispersed over the different classes.
      – this can be attributed to the fact that the background of 'golf swing' is visually similar to the background of 'kicking', 'riding', and 'walking'
    [Figure: sparse coefficient values for an input video clip depicting 'golf swing' (true action), split into coefficients belonging to the context region (D_C) and coefficients belonging to the action region (D_A), over the ten human action classes]
  • Proposed SR-based method: use of an action region-aware dictionary (3/3)
    • The sparse coefficients belonging to the context region of the 'diving' test video clip are concentrated in the true class.
      – this means that the context region of 'diving' is different from the context regions of the other human actions
    [Figure: sparse coefficient values for an input video clip depicting 'diving' (true action), split into coefficients belonging to the context region (D_C) and coefficients belonging to the action region (D_A), over the ten human action classes]
  • Adaptive classification using split sparse coefficients
    • Given the above observations, we can hypothesize that
      – information originating from context regions can help in successfully classifying human actions, on the condition that the sparse coefficients associated with the context regions are concentrated in the true class
    • Measurement of the concentration of sparse coefficients
      – Maximum Sparse Coefficient Concentration (MSCC)
        $\mathrm{MSCC}(\mathbf{x}) = \max_k \frac{\|\delta_k(\mathbf{x})\|_1}{\|\mathbf{x}\|_1}$
    • We can then use the following criterion to determine whether information of context regions can help in successfully classifying human actions:
      $\mathrm{MSCC}(\mathbf{x}_C) > \xi_{\mathrm{ratio}} \cdot \mathrm{MSCC}(\mathbf{x}_A)$
    (a minimal code sketch follows)
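The MSCC measure and the criterion translate into a few lines, assuming x_C and x_A from the previous snippet and per-atom class labels shared by the context and action halves of the dictionary; the threshold xi_ratio is treated as a tunable parameter, and the default value of 1.0 below is an assumption rather than the setting used in the experiments.

```python
import numpy as np

def mscc(x, atom_labels):
    """Maximum Sparse Coefficient Concentration: largest per-class share of the l1 mass of x."""
    total = np.sum(np.abs(x)) + 1e-12
    per_class = [np.sum(np.abs(x[atom_labels == c])) for c in np.unique(atom_labels)]
    return max(per_class) / total

def context_is_useful(x_C, x_A, atom_labels, xi_ratio=1.0):
    """Use context information only when its coefficients are sufficiently concentrated."""
    return mscc(x_C, atom_labels) > xi_ratio * mscc(x_A, atom_labels)
```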
  • Outline
    • Introduction
    • Sparse representation-based human action recognition
    • Experiments
      – experimental setup
      – experimental results
    • Conclusions and future research
  • Experimental setup (1/2)
    • Use of the UCF Sports Action data set
      – contains 150 action video clips with a resolution of 720×480, collected for various sports that are typically featured on broadcast television channels such as BBC and ESPN
      – for each frame, a bounding box is available around the person performing the action of interest
      – available action classes: diving, golf swinging, kicking, lifting, riding, running, skating, swinging, and walking
    [Figure: example frames for diving, golf swinging, kicking, lifting, riding, running, skating, swinging, and walking]
  • Experimental setup (2/2)
    • Comparison with
      – SR with action region
        • only makes use of the action regions of the test video clips considered, thus taking advantage of segmentation information
      – SR with whole region
        • uses whole video frames, thus not exploiting segmentation information
    [Figure: example frames illustrating "SR with whole region" vs. "SR with action region"]
  • Experimental results (1/2)
    • The accuracy of the proposed SR-based method for human action recognition is more stable over the different human action classes
    • The accuracy of the proposed method is largely independent of the type of human action
      – thanks to the use of a context-adaptive classification strategy
  • Experimental results (2/2)
    • We can observe that which method is most accurate depends on the human action class considered
      – "SR with action region" is usually more accurate when the concentration of the sparse coefficients associated with the action region is higher than the concentration of the sparse coefficients associated with the context region
      – otherwise, "SR with whole region" or the proposed method is more effective
  • Outline
    • Introduction
    • Sparse representation-based human action recognition
    • Experiments
    • Conclusions and future research
      – conclusions
      – future research directions
  • Conclusions
    • We proposed a novel SR-based method for human action recognition with the following two major characteristics:
      – first, classification does not have to apply explicit segmentation to a given test video clip
      – second, classification is context-adaptive in nature, only leveraging information about the context in which the action took place when the concentration of the corresponding sparse coefficients is high
  • Future research directions
    • Use dictionary learning techniques that allow for more effective and efficient construction of an overcomplete dictionary
    • Perform experiments with actions that have a lower variation in background
    • Study how to leverage SRC by means of an action region-aware dictionary in other application scenarios
  • Thank you! Any questions?
    e-mail: hsmin@kaist.ac.kr
    web: http://ivylab.kaist.ac.kr