Human action recognition using spatio-temporal features Nikhil Sawant (2007MCS2899) Guide : Dr. K.K. Biswas
Human activity recognition. (Figure: a spectrum of problems from higher resolution to longer time scale: pose estimation, action recognition, action classification, tracking, activity recognition.) Courtesy: Y. Ke; Fathi and Mori; Bobick and Davis; Schuldt et al.; Leibe et al.; Vaswani et al.
Uses of action recognition: video surveillance, interactive environments, video classification & indexing, movie search, assisted care, sports annotation.
Goals: action recognition against a stable background; action classification; event detection; scale-invariant action recognition; resistance to changes in view up to a certain degree.
Goals: action recognition against a stable background; action classification; event detection; scale-invariant action recognition; resistance to changes in view up to a certain degree; action recognition in a cluttered background; speed-invariant action detection.
Existing approaches: tracking interest points; flow-based approaches; shape-based approaches.
Tracking interest points. Use of moving light displays (MLDs) by Johansson in 1973; not feasible once additional constraints are added. Use of silhouette and geodesic distance by P. Correa: five crucial points (head, two hands, two feet) are tracked, mostly found at the local maxima on the plot of geodesic distance. Images courtesy: P. Correa.
Tracking interest points (limitations). It is difficult to track all the crucial points all the time; occlusion creates problems in tracking; complex actions involving occlusion of body parts are difficult to track; results depend on the quality of the silhouette.
Flow-based approaches. Action recognition is done using the flow generated by motion: optical flow, spatio-temporal features, spatio-temporal regularity-based features.
Shape-based approaches. Blank et al. showed that an action can be described as a space-time shape: use of the Poisson equation for features; local space-time saliency; action dynamics; shape structure and orientation. Images courtesy: M. Blank.
Our approach: flow-based features + shape-based features, spatio-temporal features, Viola-Jones-type rectangular features, AdaBoost. Steps: target localization (background subtraction); local oriented histogram; formation of the descriptor; use of AdaBoost for learning.
Optical flow and motion features
Target localization. The possible search space is the full xyt cube; the action needs to be localized in space and time, and target localization reduces the search space. Background subtraction yields the silhouette, from which the ROI is marked (a sketch follows). (Figure: original video; silhouette; original video with ROI marked.)
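A minimal sketch of this step, assuming a clean reference background frame is available; the function name and threshold value are illustrative, not from the thesis:

```python
import cv2
import numpy as np

def localize_actor(frame, bg, thresh=30):
    """Background subtraction against a reference frame, then ROI from the silhouette."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, cv2.cvtColor(bg, cv2.COLOR_BGR2GRAY))
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return mask, None                              # no actor in this frame
    x, y = xs.min(), ys.min()
    return mask, (x, y, xs.max() - x, ys.max() - y)    # silhouette + ROI (x, y, w, h)
```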
Motion estimation. We make use of optical flow for motion estimation; optical flow is the pattern of relative motion between the object (or object feature points) and the viewer (camera). It underlies several methods: motion-compensated encoding, object segmentation, etc. We use the Lucas-Kanade two-frame differential method, in its OpenCV implementation (sketched below).
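A sketch of the two-frame flow computation with OpenCV's pyramidal Lucas-Kanade tracker; sampling flow on a regular grid inside the ROI is an assumption, since the slides do not say where flow is sampled:

```python
import cv2
import numpy as np

def lk_flow(prev_gray, curr_gray, roi, step=4):
    """Track a grid of points inside the ROI between two consecutive frames."""
    x, y, w, h = roi
    gx, gy = np.meshgrid(np.arange(x, x + w, step), np.arange(y, y + h, step))
    pts = np.float32(np.dstack([gx, gy]).reshape(-1, 1, 2))
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None,
                                              winSize=(15, 15), maxLevel=2)
    ok = status.ravel() == 1
    # Returns the tracked positions and the per-point flow vectors.
    return pts[ok].reshape(-1, 2), (nxt[ok] - pts[ok]).reshape(-1, 2)
```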
Noise removal. Noisy optical flows are present; noise is removed by averaging: optical flows with magnitude > C * O_mean are ignored, where C is a constant in [1.5, 2] and O_mean is the mean optical-flow magnitude within the ROI. (Figure: noisy optical flows; after noise removal.)
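The thresholding rule as a short sketch (C = 1.75 is one value picked from the stated [1.5, 2] range):

```python
import numpy as np

def remove_noisy_flows(flows, C=1.75):
    """Discard flows whose magnitude exceeds C times the mean magnitude in the ROI."""
    mag = np.linalg.norm(flows, axis=1)
    keep = mag <= C * mag.mean()
    return flows[keep], keep
```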
Organizing optical flow Local oriented Histogram Weighted averaging
Organizing optical flow (local oriented histogram). We fix an X_DIV x Y_DIV grid around the ROI. A flow O_n(u, v) is considered in bin b_ij if x_i < u < x_i+1 and y_j < v < y_j+1. The bin value is the average O_bij = Σ O_n(u, v) / Σ 1, the sums running over all flows with x_i < u < x_i+1 and y_j < v < y_j+1, for all i < X_DIV and j < Y_DIV.
Organizing optical flow (local oriented histogram). The membership of an optical flow should be inversely proportional to its distance from the cell centre. (Figure: flows O_1 and O_2 at distances d_1 and d_2 from the centre C(0,0) contribute to the effective flow O_e.)
Organizing optical flow (weighted averaging). The flows in a bin, O_1, O_2, …, O_m, are combined into an effective flow O_e with weights inversely proportional to the distance d_i of each flow from the cell centre: O_e = Σ_i (O_i / d_i) / Σ_i (1 / d_i), for all i Є {1, …, m}.
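A sketch covering both steps, binning and weighted averaging; the inverse-distance weight follows the reconstruction above and is an assumption, as the exact weighting function is not given in the slides:

```python
import numpy as np

def effective_flows(points, flows, roi, xdiv=5, ydiv=5, eps=1e-6):
    """Distance-weighted average flow per cell of an xdiv x ydiv grid over the ROI."""
    x, y, w, h = roi
    O = np.zeros((ydiv, xdiv, 2))
    for j in range(ydiv):
        for i in range(xdiv):
            cx = x + (i + 0.5) * w / xdiv            # cell centre
            cy = y + (j + 0.5) * h / ydiv
            inx = (points[:, 0] >= x + i * w / xdiv) & (points[:, 0] < x + (i + 1) * w / xdiv)
            iny = (points[:, 1] >= y + j * h / ydiv) & (points[:, 1] < y + (j + 1) * h / ydiv)
            sel = inx & iny
            if sel.any():
                d = np.linalg.norm(points[sel] - [cx, cy], axis=1) + eps
                wgt = 1.0 / d                        # inverse-distance membership
                O[j, i] = (flows[sel] * wgt[:, None]).sum(axis=0) / wgt.sum()
    return O
```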
Organizing optical flows
Formation of the motion descriptor. Optical flow is represented in x-y component form; the effective optical flow from each box is written in a single row as the vector [O_ex00, O_ey00, O_ex10, O_ey10, …]. Vectors for each action are stored for every training subject, and AdaBoost is used to learn the patterns.
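The flattening step as a one-line sketch (the cell ordering is an assumption):

```python
def motion_descriptor(O):
    """Row vector of (x, y) flow components, one pair per grid cell."""
    return O.reshape(-1)   # [O_ex00, O_ey00, O_ex10, O_ey10, ...]
```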
Learning with AdaBoost. The strong classifier is a weighted combination of weak classifiers applied to the feature vector: H(x) = sign(Σ_t α_t h_t(x)), where h_t is a weak classifier, α_t its weight, and x the feature vector.
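The slides train AdaBoost on these descriptors; a stand-in sketch using scikit-learn's AdaBoostClassifier, whose default weak learners are depth-1 decision stumps (not necessarily the exact weak learner used in the thesis):

```python
from sklearn.ensemble import AdaBoostClassifier

def train_action_classifier(X, y, rounds=100):
    """X: one motion descriptor row per frame; y: +1 for the target action, -1 otherwise."""
    clf = AdaBoostClassifier(n_estimators=rounds)   # depth-1 stumps by default
    return clf.fit(X, y)

# For new frames, clf.decision_function(X_test) gives a signed confidence;
# its sign is the predicted label.
```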
Classification example (taken from Antonio Torralba @ MIT). Weak learners are drawn from the family of lines; a line h with p(error) = 0.5 is at chance. Each data point has a class label y_t = +1 or -1 and an initial weight w_t = 1.
Classification example. This line seems to be the best: it is a 'weak classifier', performing slightly better than chance. Each data point keeps its class label y_t = +1 or -1 and its weight.
Classification example. We update the weights, w_t <- w_t exp{-y_t H_t}, which sets a new problem for which the previous weak classifier performs at chance again; a new weak classifier is then chosen. (This step is repeated over several boosting rounds.)
Classification example. The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f_1, f_2, f_3, f_4.
Our dataset: video resolution 320 x 240, stable background.
ACTION        SUBJECTS   VIDEOS
Walking       8          34
Running       8          20
Flying        5          25
Waving        5          25
Pick up       6          24
Stand up      6          48
Sitting down  6          24
Our dataset (tennis actions): a small tennis dataset.
ACTION    SUBJECTS   VIDEOS
Forehand  3          11
Backhand  3          10
Service   2          9
Training and testing dataset. Training and testing data are mutually exclusive, and training and testing subjects are mutually exclusive. Frames used for training and testing:
ACTION        TRAINING   TESTING
Walking       1184       1710
Running       183        335
Flying        182        373
Waving        198        317
Pick up       111        160
Stand up      128        187
Sitting down  230        282
Classification result (frame-wise). Overall error: 12.21%. Confusion-matrix rows (classes: Walking, Running, Flying, Waving, Pick up, Sit down, Stand up; blank cells in the source are omitted):
Walking: 1644, 46, 0, 17, 1, 2 (error 3.86%)
Running: 35, 295, 3, 2 (error 11.94%)
Flying: 1, 2, 349, 11, 9, 1 (error 6.43%)
Waving: 11, 8, 269, 29 (error 15.14%)
Pick up: 8, 7, 1, 120, 23, 1 (error 25%)
Sit down: 1, 1, 26, 179 (error 14.97%)
Stand up: 23, 282 (error 8.15%)
Classification results (clip-wise). Overall error: 6.94%. Confusion-matrix rows (classes: Walking, Running, Waving1, Waving2, Bending, Sit-down, Stand-up; diagonal entries are correct classifications, blank cells omitted):
Walking: 10 (error 0.0%)
Running: 10 (error 0.0%)
Waving1: 9, 1 (error 10.0%)
Waving2: 10 (error 0.0%)
Bending: 9, 1 (error 10.0%)
Sit-down: 10 (error 0.0%)
Stand-up: 1, 9 (error 10.0%)
Action classification
Classification results (tennis events). Overall error: 19.17% (per frame). Confusion-matrix rows (classes: Forehand, Backhand, Service; blank cells omitted):
Forehand: 54, 7, 11 (error 21.95%)
Backhand: 11, 53 (error 10.75%)
Service: 8, 49 (error 14.04%)
Event detection. Confusion arises at the junction of two actions, so prediction logic is used (see the sketch below): the label for the current frame f is smoothed using a window of the previous n frames (f-1, f-2, … f-n) and the next n frames (f+1, f+2, … f+n).
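The prediction logic is not spelled out beyond the window itself; one plausible reading, sketched here, is a majority vote over the 2n+1 frame labels around f:

```python
import numpy as np

def smooth_labels(labels, n=5):
    """Replace each frame label with the majority vote over frames f-n .. f+n."""
    labels = np.asarray(labels)
    out = labels.copy()
    for f in range(len(labels)):
        window = labels[max(0, f - n): f + n + 1]
        vals, counts = np.unique(window, return_counts=True)
        out[f] = vals[counts.argmax()]
    return out
```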
Event detection. (Figure: results without prediction logic vs. with prediction logic.)
Weizmann dataset
ACTION  SUBJECTS  VIDEOS
Bend    9         9
Jack    9         9
Jump    9         9
Pjump   9         9
Run     9         10
Side    9         9
Skip    9         10
Walk    9         10
Wave1   9         9
Wave2   9         9
Standard dataset (Weizmann dataset). (Figure: sample frames for Walk, Side, Skip, Wave1, Wave2, Bend, Run, Jack, Jump, Pjump.)
Confusion matrix (frame-wise). Overall error: 29.17% (per frame). Rows (classes: Bend, Jack, Jump, Pjump, Run, Side, Skip, Walk, Wave1, Wave2; blank cells omitted):
Bend: 271, 1, 1, 20, 3, 30, 11
Jack: 18, 368, 8, 48, 3, 2, 3, 9, 16
Jump: 9, 3, 157, 8, 2, 26, 19, 7
Pjump: 36, 26, 237, 22, 6
Run: 4, 2, 5, 158, 3, 50, 6, 1, 2
Side: 11, 9, 77, 1, 1, 84, 3, 58, 2, 1
Skip: 3, 9, 76, 43, 5, 109, 24, 1, 7
Walk: 2, 5, 16, 2, 13, 5, 395
Wave1: 47, 2, 12, 238, 27
Wave2: 30, 6, 1, 4, 1, 55, 269
Weizmann dataset: smaller resolution (180 x 144) than before (320 x 240), hence weaker motion vectors compared to the previous experiments (Weizmann: magnitudes 0-1.75 px; earlier experiments: 0-5.5 px). Background frames are not available, so the already-provided, poor-quality silhouettes were used.
Use of MV + shape info (SI). Motion vectors (MV) alone are not enough; the shape of the person also gives information about the action. SI = the number of foreground pixels in each box. Error: 23.45%.
Use of MV + differential SI. We calculate differential shape information and make use of Viola-Jones rectangular features, applied at grid level rather than pixel level (see the sketch below). Error: 19.69%.
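A sketch of one plausible reading: differential SI as the frame-to-frame difference of per-cell foreground counts, with a two-rectangle Viola-Jones-style feature evaluated on grid cells rather than pixels (the exact feature set used in the thesis is not specified):

```python
import numpy as np

def cell_foreground_counts(mask, roi, xdiv=5, ydiv=5):
    """Number of foreground pixels in each grid cell over the ROI."""
    x, y, w, h = roi
    cells = np.zeros((ydiv, xdiv))
    for j in range(ydiv):
        for i in range(xdiv):
            patch = mask[y + j * h // ydiv: y + (j + 1) * h // ydiv,
                         x + i * w // xdiv: x + (i + 1) * w // xdiv]
            cells[j, i] = np.count_nonzero(patch)
    return cells

def differential_si(prev_cells, curr_cells):
    """Differential shape info: change in per-cell foreground counts (assumption)."""
    return curr_cells - prev_cells

def two_rect_feature(cells, j, i, hgt, wid):
    """Two-rectangle Viola-Jones-style feature on grid cells: left half minus right half."""
    left = cells[j: j + hgt, i: i + wid].sum()
    right = cells[j: j + hgt, i + wid: i + 2 * wid].sum()
    return left - right
```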
Confusion matrix (frame-wise), MV + differential SI. Rows (classes: Bend, Jack, Jump, Pjump, Run, Side, Skip, Walk, Wave1, Wave2; blank cells omitted):
Bend: 326, 7, 2, 2
Jack: 6, 418, 39, 1, 3, 8
Jump: 18, 1, 189, 1, 5, 4, 13
Pjump: 11, 55, 243, 6, 1, 11
Run: 2, 2, 173, 2, 45, 7
Side: 8, 30, 11, 1, 152, 12, 33
Skip: 1, 20, 32, 83, 4, 121, 13, 1, 2
Walk: 1, 1, 2, 1, 1, 432
Wave1: 43, 1, 10, 10, 232, 30
Wave2: 13, 25, 328
Spatio-temporal features. (Figure: a volume of frames characterized by the parameters TSPAN and TLEN.)
Spatio-temporal descriptor. The volume descriptor is written in row form, [Frame1 | Frame2 | Frame3 | Frame4 | Frame5 | …], holding the motion and differential shape information for the whole volume (a sketch follows). Error: 8.472% (per frame).
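The volume descriptor as a sketch; frame_descriptors, tlen and tspan are illustrative names for the per-frame vectors and the two volume parameters:

```python
import numpy as np

def volume_descriptor(frame_descriptors, start, tlen, tspan):
    """Concatenate per-frame descriptors [Frame1 | Frame2 | ...] into one row."""
    idx = range(start, start + tlen * tspan, tspan)
    return np.concatenate([frame_descriptors[i] for i in idx])
```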
Event classification (clip-wise). Error: 2.15%, better than the 12.7% error rate reported by T. Goodhart et al., "Action recognition using spatio-temporal regularity based features", 2008. Confusion-matrix rows (classes: Bend, Jack, Jump, Pjump, Run, Side, Skip, Walk, Wave1, Wave2; blank cells omitted):
Bend: 9 (error 0.0%)
Jack: 9 (error 0.0%)
Jump: 9 (error 0.0%)
Pjump: 9 (error 0.0%)
Run: 9, 1 (error 10.0%)
Side: 9 (error 0.0%)
Skip: 10 (error 0.0%)
Walk: 10 (error 0.0%)
Wave1: 8, 1 (error 11.1%)
Wave2: 9 (error 0.0%)
Action recognition in cluttered background
Cluttered environment: the background is not stable; the actor might be occluded; slight changes in camera location (panning); scale variation; speed variation.
Training. Training is done without background subtraction: the start and end of each action are marked manually in the training videos, along with a bounding box around the actor. No shape information is added to the training data, and training is done against a noisy background. Currently the bending and drinking actions are supported.
Training data drinking
Training data bending
Template length. Bending: on average 55 frames per action (variation 40-110); TLEN = 45 frames. Drinking: on average 50 frames per action (variation 35-70); TLEN = 40 frames.
Single template formation. The length of the template is kept constant by eliminating some of the frames: one action, one template. This adds robustness to the training and tackles speed variation during training (a sketch follows). (Figure: a 15-frame sequence reduced to a 12-frame template by dropping frames 2, 7 and 12 and renumbering 1-12.)
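The frame-dropping scheme as a sketch, using near-uniform subsampling; this is an assumption, since the slides only show which frames were dropped in one example:

```python
import numpy as np

def fixed_length_template(frames, tlen):
    """Keep tlen frames at (nearly) uniform spacing; drops frames from long clips."""
    idx = np.round(np.linspace(0, len(frames) - 1, tlen)).astype(int)
    return [frames[i] for i in idx]
```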
Optical flow and AdaBoost. We now have constant-length sequences, for which optical flows are calculated. No shape information is used, since the background is cluttered and background subtraction is not possible. A spatio-temporal template is formed with TSPAN = 1 and TLEN = sequence length (constant). The templates are learned with AdaBoost.
Testing. An action cuboid is formed with a specific height, width and length, and moved over each and every valid starting location in the video. (Figure: the cuboid slides through the video's x, y, t volume.)
Testing (continued). A spatio-temporal template is formed for each cuboid location and tested with AdaBoost, and the corresponding entry is made in the confidence matrix. The height, width and length of the cuboid are updated for scale and speed invariance.
Confidence matrix. The confidence matrix is a 3D matrix with an entry for each valid location of the cuboid in the video; each entry holds the confidence value given by AdaBoost over its iterations. We expect true positives to be surrounded by a dense fog of large confidence values, so averaging is done to reduce the effect of isolated false positives (see the sketch below).
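A sketch of filling and smoothing the confidence volume; scipy's uniform_filter stands in for the averaging, and the window size is an assumption:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def confidence_volume(cuboid_templates, clf, shape):
    """Fill the 3D confidence matrix, then average to suppress isolated false positives."""
    conf = np.zeros(shape)                      # one cell per valid (t, y, x) cuboid origin
    for (t, y, x), descriptor in cuboid_templates:
        conf[t, y, x] = clf.decision_function([descriptor])[0]
    return uniform_filter(conf, size=5)         # local mean over the t, y, x neighbourhood
```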
Confidence matrix
Results
Key References
1. Y. Ke, R. Sukthankar, M. Hebert, "Spatio-temporal Shape and Flow Correlation for Action Recognition", in Proc. Visual Surveillance Workshop, 2007.
2. P. Viola and M. Jones, "Robust real-time face detection", in ICCV, volume 20(11), pages 1254-1259, 2001.
3. M. Lucena, J.M. Fuertes and N. P. de la Blanca, "Using Optical Flow for Tracking", in Progress in Pattern Recognition, Speech and Image Analysis, volume 2905/2003.
4. Y. Ke, R. Sukthankar, and M. Hebert, "Event detection in crowded videos", in ICCV, 2007.
5. F. Niu and M. Abdel-Mottaleb, "View-Invariant Human Activity Recognition Based on Shape and Motion Features", in Proc. IEEE Sixth International Symposium on Multimedia Software Engineering, pp. 546-556, 2004.
6. D.M. Gavrila, "The visual analysis of human movement: A survey", Computer Vision and Image Understanding, 73:82-98, 1999.
7. D.M. Gavrila, "A Bayesian, exemplar-based approach to hierarchical shape matching", IEEE Trans. Pattern Anal. Mach. Intell., 29(8):1408-1421, 2007.
8. K. Gaitanis, P. Correa, and B. Macq, "Human Action Recognition using silhouette based feature extraction and Dynamic Bayesian Networks".
9. M. Ahmad, S. Lee, "Human action recognition using shape and CLG-motion flow from multi-view image sequences", in Proc. 7th IEEE International Conference on Automatic Face and Gesture Recognition, April 2006.
10. I. Haritaoglu, D. Harwood, and L. Davis, "W4: real-time surveillance of people and their activities", IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, pp. 809-830, Aug 2000.
11. I. Haritaoglu, D. Harwood, and L.S. Davis, "W4: Who? When? Where? What? A Real-time System for Detecting and Tracking People", in Proc. Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 1998, pp. 222-227.
12. P. Correa, J. Czyz, T. Umeda, F. Marqués, X. Marichal, B. Macq, "Silhouette-based probabilistic 2D human motion estimation for real time application", in ICIP, 2005.
13. Y. Ke, R. Sukthankar, and M. Hebert, "Efficient visual event detection using volumetric features", in ICCV, 2005.
