This document discusses research on human action recognition using skeleton data. It introduces issues with skeleton-based action recognition, such as variable scales, view orientations, noise and rate/intra-action variations. It then reviews previous work on skeleton-based action recognition using hand-crafted features and deep learning models. The document proposes two ensemble deep learning models called Ensemble TS-LSTM v1 and v2 that use temporal sliding LSTMs to capture short, medium and long-term dependencies from skeleton sequences for action recognition. Experimental results on standard datasets demonstrate the models outperform previous methods.
Human Action Recognition
1. HUMAN ACTION RECOGNITION
- INTRODUCTION OF HUMAN ACTION RECOGNITION
- ISSUES OF SKELETON-BASED ACTION RECOGNITION
- RESEARCH RELATED TO SKELETON-BASED ACTION RECOGNITION
Inwoong Lee, Ph.D. student, Yonsei University
3. Introduction of Human Action Recognition
RGB based Human Action Recognition
■ Two-stream convolutional networks for action recognition in videos, in NIPS 2014
UCF-101 (88.0%), HMDB-51 (59.4%)
■ State of the art as of CVPR 2017: UCF-101 (94.9%), HMDB-51 (72.2%)
■ Focuses on mapping video to an action label, not on human pose
4. Introduction of Human Action Recognition
RGB based 2D Human Pose Estimation
■ Hand Face and Body Keypoint Detection (CVPR 2017)
More context information than just RGB video
5. Introduction of Human Action Recognition
RGB based 3D Human Pose Estimation
■ Recurrent 3D Pose Sequence Machine (CVPR 2017)
More information (3D) than 2D human pose
6. Introduction of Human Action Recognition
RGB-D based 3D Human Pose Estimation
■ Microsoft Kinect version 2.0
More accurate than RGB based 3D skeleton because of depth
[Figure: RGB with 2D skeleton; 3D skeleton]
8. Issues of Skeleton-based Action Recognition
Attributes of Skeleton extracted from Camera
[Figure: skeletons seen from Views 1 to 3 with x, y, z axes, illustrating variable scale (small vs. large), variable view orientation, and very noisy joints]
9. Issues of Skeleton-based Action Recognition
Attributes of Human Action
■ Rate variation
The same action can span 5 frames or 3 frames, depending on whether it is performed slowly or quickly
■ Intra-action variation
Straight punch vs. curved punch
10. HUMAN ACTION RECOGNITION
- RESEARCH RELATED TO SKELETON-BASED ACTION RECOGNITION
: ENSEMBLE DEEP LEARNING USING TS-LSTM NETWORKS
11. Ensemble Deep Learning using TS-LSTM networks
Overview of the proposed deep learning network [1]
[Figure: pipeline stages: coordinate transformation (translation, rotation, scale), salient motion extraction (Motion 1 to Motion N), discriminative multi-term LSTMs, and ensemble of deep learning; the LSTM grid has axes labeled "multi-term diversity" and "softmax"]
[1] Inwoong Lee, Doyoung Kim, Seoungyoon Kang, and Sanghoon Lee, "Ensemble Deep Learning for
Skeleton-based Action Recognition using Temporal Sliding LSTM networks," in ICCV 2017.
12. Feature Representation (1/3)
■ Absolute skeleton position
Joint coordinates of the raw skeleton
High orientation and location variation within the same action
■ Relative skeleton position
Joint coordinates relative to a reference joint in each frame
Low orientation and location variation within the same action
Simplifies away temporal skeleton movement (e.g., a jump)
■ Absolute + relative skeleton position
Joint coordinates relative to a reference joint in the initial frame
Low orientation and location variation within the same action
Preserves temporal skeleton movement
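The three representations can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a sequence is stored as an array of shape (frames, joints, 3) and that joint index `ref` is the reference joint; neither the array layout nor the choice of reference joint is given in the slides.

```python
import numpy as np

def absolute_position(seq):
    """Raw joint coordinates, shape (T, J, 3)."""
    return seq

def relative_position(seq, ref=0):
    """Subtract each frame's own reference joint from all joints in that frame."""
    return seq - seq[:, ref:ref + 1, :]

def absolute_relative_position(seq, ref=0):
    """Subtract the reference joint of the initial frame from every frame."""
    return seq - seq[0, ref, :]

# Hypothetical toy "jump": a fixed pose that rises by 1 unit along y per frame.
rng = np.random.default_rng(0)
pose = rng.normal(size=(1, 4, 3))
lift = np.zeros((3, 1, 3))
lift[:, 0, 1] = [0.0, 1.0, 2.0]
seq = pose + lift  # shape (3 frames, 4 joints, 3 coordinates)

rel = relative_position(seq)               # the jump disappears
abs_rel = absolute_relative_position(seq)  # the jump is kept
```

In this toy case every frame of `rel` is identical, because the per-frame reference cancels the whole-body motion, while `abs_rel` still shows the reference joint rising frame by frame.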
13. Ensemble Deep Learning using TS-LSTM networks
Feature Representation (2/3)
■ Pose orientation alignment
Trial: We set the y axis to the vector perpendicular to the ground, the x axis to the left direction of the initial skeleton, and the z axis to the cross product of x and y.
Effect: This achieves view/orientation invariance and retains the temporal relation between skeletons.
[Figure: initial skeleton of each sequence (Sequence 1, Sequence 2) aligned to the x, y, z axes]
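The alignment can be sketched as a change of basis computed once from the first frame. The sketch below assumes the ground normal `up` is known and that the left direction is taken from two hypothetical joints, `left_idx` and `right_idx` (for example, the two shoulders); the slides do not say which joints are used or how the ground normal is obtained.

```python
import numpy as np

def hcc_basis(first_frame, up, left_idx, right_idx):
    """Build an orthonormal basis (rows: x, y, z) from the initial skeleton."""
    y = up / np.linalg.norm(up)
    left = first_frame[left_idx] - first_frame[right_idx]
    left = left - np.dot(left, y) * y  # drop the vertical component
    x = left / np.linalg.norm(left)
    z = np.cross(x, y)
    return np.stack([x, y, z])

def to_hcc(seq, up, left_idx, right_idx):
    """Express every frame in the basis defined by the initial frame only."""
    basis = hcc_basis(seq[0], up, left_idx, right_idx)
    return seq @ basis.T

# Hypothetical data: the same sequence seen from two camera yaw angles.
rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 6, 3))
up = np.array([0.0, 1.0, 0.0])
theta = 0.7
rot_y = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
rotated = seq @ rot_y.T

a = to_hcc(seq, up, left_idx=1, right_idx=2)
b = to_hcc(rotated, up, left_idx=1, right_idx=2)  # matches `a`
```

Because a single basis from the initial frame is applied to all frames, the frame-to-frame movement is left intact, which is the "retains the temporal relation" property stated above.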
14. Ensemble Deep Learning using TS-LSTM networks
Feature Representation (3/3)
■ Motion feature extraction
Trial: We take the difference between two skeleton frames.
Effect: The motion feature captures the actual movement of the skeleton joints.
[Figure: difference between two skeletons]
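A frame difference is a one-line operation. The sketch below assumes a (frames, joints, 3) array and adds a hypothetical `gap` parameter for the distance between the two frames; the slides only say that two frames are differenced, so the use of several gaps to obtain several motion features is an assumption.

```python
import numpy as np

def motion_feature(seq, gap=1):
    """Difference between frames `gap` apart: seq[t + gap] - seq[t]."""
    return seq[gap:] - seq[:-gap]

# Hypothetical sequence: 4 frames, 2 joints, values increasing linearly.
seq = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)
m1 = motion_feature(seq, gap=1)  # shape (3, 2, 3)
m2 = motion_feature(seq, gap=2)  # shape (2, 2, 3)
```

Note that the output is `gap` frames shorter than the input, which matters when motion features with different gaps are fed to networks with fixed window sizes.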
15. Modeling of Human Action (1/8)
■ Traditional work (1/4)
Mining Actionlet Ensemble for Action Recognition with Depth Cameras, in CVPR 2012
Fourier temporal pyramids: in addition to the global Fourier coefficients, they recursively partition the action into a pyramid
Mining actionlet ensemble
Feature pooling
Support vector machine
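The pyramid idea can be sketched for a single 1-D joint trajectory. This is an illustrative sketch, not the cited paper's implementation: the number of levels, the number of retained coefficients, and the use of FFT magnitudes are all assumptions.

```python
import numpy as np

def fourier_temporal_pyramid(signal, levels=3, n_coeffs=4):
    """Concatenate low-frequency FFT magnitudes of recursively halved segments."""
    features = []
    for level in range(levels):
        # Level 0 is the whole signal; each further level doubles the segments.
        for segment in np.array_split(signal, 2 ** level):
            mags = np.abs(np.fft.rfft(segment))[:n_coeffs]
            # Zero-pad so every segment contributes exactly n_coeffs values.
            features.append(np.pad(mags, (0, n_coeffs - len(mags))))
    return np.concatenate(features)

# Hypothetical trajectory of one coordinate over 32 frames.
signal = np.sin(np.linspace(0.0, 4.0 * np.pi, 32))
feat = fourier_temporal_pyramid(signal)  # (1 + 2 + 4) segments * 4 coefficients
```

The level-0 coefficients describe the whole action, while deeper levels describe where in time the motion happens, which global Fourier coefficients alone cannot express.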
16. Modeling of Human Action (2/8)
■ Traditional work (2/4)
Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group, in CVPR 2014
Feature representation on a manifold, temporal alignment through dynamic time warping, and SVM classification on FTP features
SVM: Support Vector Machine
FTP: Fourier Temporal Pyramid
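Dynamic time warping, mentioned above for temporal alignment, can be sketched as the textbook dynamic program below. It is a generic version using Euclidean frame distance, not the cited paper's exact procedure.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping cost between two sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Hypothetical data: the same motion performed in 5 frames and in 3 frames.
slow = np.array([[0.0], [0.0], [1.0], [1.0], [2.0]])
fast = np.array([[0.0], [1.0], [2.0]])
same_action = dtw_distance(slow, fast)        # 0.0: only the rate differs
reversed_action = dtw_distance(fast, fast[::-1])
```

The zero cost between `slow` and `fast` shows why warping helps with rate variation: repeated frames are absorbed by the alignment instead of counting as differences.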
17. Modeling of Human Action (3/8)
■ Traditional work (3/4)
Hierarchical Recurrent Neural Network for Skeleton Based Action
Recognition, in CVPR 2015
Spatial: body part based features, temporal: recurrent networks
18. Modeling of Human Action (4/8)
■ Traditional work (4/4)
Spatio-Temporal LSTM with Trust Gates for 3D Human Action
Recognition, in ECCV 2016
Combination of spatial and temporal features using LSTM
LSTM: Long Short-Term Memory
19. Ensemble Deep Learning using TS-LSTM networks
Modeling of Human Action (5/8)
■ Temporal Sliding LSTM (TS-LSTM)
LSTM captures only long-term dependency.
TS-LSTM can capture short-term, medium-term, and long-term
dependencies.
TS-LSTM can be adapted to different dependency ranges by controlling
the temporal stride and the internal LSTM time-step size.
[Figure: traditional LSTM vs. proposed TS-LSTM over an input sequence; TS-LSTM applies LSTM windows shifted by a temporal stride; legend: LSTM input, LSTM output]
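The role of the two controls, temporal stride and time-step size, can be sketched with plain windowing. The LSTM itself is omitted here, and the window sizes are hypothetical values chosen for illustration; the slides do not give the actual settings.

```python
import numpy as np

def sliding_windows(seq, time_steps, stride):
    """Split a (T, D) sequence into overlapping windows of shape (time_steps, D)."""
    starts = range(0, len(seq) - time_steps + 1, stride)
    return np.stack([seq[s:s + time_steps] for s in starts])

# Hypothetical input: 10 frames with 2 features each.
seq = np.arange(20, dtype=float).reshape(10, 2)

# Each window would be fed to its own LSTM pass in a TS-LSTM-style model.
short = sliding_windows(seq, time_steps=2, stride=2)  # many small windows
long = sliding_windows(seq, time_steps=8, stride=2)   # few large windows
```

A small `time_steps` restricts each LSTM pass to local context, while a large one approaches a traditional LSTM run over the whole sequence.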
20. Ensemble Deep Learning using TS-LSTM networks
Modeling of Human Action (6/8)
■ Conceptual diagram of Ensemble TS-LSTM v1
1 Short-term TS-LSTM with 1 motion feature
2 Medium-term TS-LSTM with 2 motion features
3 Long-term TS-LSTM with 3 motion features
[Figure: short-, medium-, and long-term TS-LSTM branches between input and output; legend: ensemble feature, motion feature]
21. Ensemble Deep Learning using TS-LSTM networks
Modeling of Human Action (7/8)
■ Conceptual diagram of Ensemble TS-LSTM v2
1 Short-term TS-LSTM with 1 motion feature
2 Medium-term TS-LSTM with 2 motion features
3 Long-term TS-LSTM with 3 motion features
1 Medium-term TS-LSTM with 1 pose feature
[Figure: short-, medium-, and long-term TS-LSTM branches between input and output, plus an extra medium-term branch; legend: ensemble feature, motion feature, pose feature]
22. Modeling of Human Action (8/8)
■ Actual Ensemble TS-LSTM v1 & Ensemble TS-LSTM v2
23. Ensemble Deep Learning using TS-LSTM networks
Used Datasets
■ MSR Action3D dataset (widely used)
20 actions, each performed 2 or 3 times by 10 subjects
■ Northwestern-UCLA dataset (3 views)
10 actions, each performed 1 to 6 times by 10 subjects
Abbreviation Definitions
■ Human Cognitive Coordinate (HCC)
The y axis is the vector perpendicular to the ground, the x axis is the left
direction of the initial skeleton, and the z axis is the cross product of x and y
■ Salient Motion Feature (SMF)
Difference features between two skeleton frames
24. Ensemble Deep Learning using TS-LSTM networks
Results and Comparisons (1/2)
■ Bag of 3D points: Projection of the sampled 3D points
■ Lie group: Manifold feature based SVM model
■ HBRNN: Body-part based LSTM model
■ ST-LSTM + Trust Gate: Spatio-Temporal LSTM model with Trust Gate
[Table: Experimental result comparison on the MSR Action3D dataset; best performance highlighted]
25. Ensemble Deep Learning using TS-LSTM networks
Results and Comparisons (2/2)
■ Lie group: Manifold feature based SVM model
■ Actionlet ensemble: Temporal Pyramid features + SVM model
■ HBRNN-L: Body-part based LSTM model
■ Enhanced skeleton visualization: CNN based model
[Table: Experimental results on the Northwestern-UCLA dataset; best performance highlighted]
CNN: Convolutional Neural Network
26. Ensemble Deep Learning using TS-LSTM networks
Result Analysis (1/3)
■ Misclassified actions (ground truth → prediction)
Forward punch → Tennis serve
High throw → Hammer, Tennis serve
Pickup & throw → Bend
■ Ensemble TS-LSTM v1 distinguishes these similar actions to some degree by using multiple TS-LSTM networks
[Figure: Confusion matrix of AS1 (Ensemble TS-LSTM v1); axes: ground truth, prediction]
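For readers who want to reproduce this kind of analysis, a confusion matrix can be computed as sketched below. The sketch assumes integer class labels and that rows are ground truth and columns are predictions; the orientation of the slide's figure is not recoverable from the text, and the labels used here are made up.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def row_normalize(cm):
    """Convert counts to per-class rates; rows with no samples stay zero."""
    totals = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, totals, out=np.zeros(cm.shape, dtype=float), where=totals > 0)

# Hypothetical labels for 5 test samples over 3 classes.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
rates = row_normalize(cm)
```

Off-diagonal entries of `rates` are the per-class misclassification rates, which is the quantity the list of misclassified actions above summarizes.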
27. Ensemble Deep Learning using TS-LSTM networks
Result Analysis (2/3)
■ Softmax feature analysis of Ensemble TS-LSTM v1 (1/2)
Overall, the diagonal probabilities of Softmax2 (long-term LSTMs) are
higher than those of Softmax0 (short-term LSTMs) and Softmax1
(medium-term LSTMs)
This suggests that global temporal features influence performance
more than local temporal features
[Figure: Softmax0 (short), Softmax1 (medium), Softmax2 (long)]
28. Ensemble Deep Learning using TS-LSTM networks
Result Analysis (3/3)
■ Softmax feature analysis of Ensemble TS-LSTM v1 (2/2)
Softmax0 and Softmax1 sometimes produce lower misclassification
rates than Softmax2
For example, Softmax0 and Softmax1 assign lower probability than
Softmax2 to misclassifying “Pickup & throw” as “Bend”
Combining them compensates for this weakness and makes the model
less prone to overfitting to certain actions
[Figure: Softmax0 (short), Softmax1 (medium), Softmax2 (long); weakness compensation]
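Weakness compensation can be illustrated with a toy ensemble. The sketch below assumes the branches are combined by averaging their class probabilities, which is an assumption since the slides do not state the combination rule, and the logits are invented for illustration rather than taken from the experiments.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def average_ensemble(prob_list):
    """Average class probabilities from several branches (assumed rule)."""
    return np.mean(prob_list, axis=0)

# Hypothetical logits for one "Pickup & throw" sample.
# Class order: [Bend, Pickup & throw].
short_p = softmax(np.array([0.0, 2.0]))
medium_p = softmax(np.array([0.5, 1.5]))
long_p = softmax(np.array([1.0, 0.5]))  # this branch alone would pick "Bend"
ensemble_p = average_ensemble([short_p, medium_p, long_p])
```

Here the long-term branch alone misclassifies the sample, but the averaged prediction is correct because the short- and medium-term branches are confident in the right class.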
29. Ensemble Deep Learning using TS-LSTM networks
Action Sequence Prediction (Yonsei Dataset)
Action label order: 1. Fall Down 2. Sit Down 3. Stand Up 4. Wave Hands
5. Hands On The Head 6. Hunker Down 7. Punch 8. Kick 9. Wield Knife
10. Aim Handgun 11. Aim Rifle 12. Throw 13. Kick Object
30. Ensemble Deep Learning using TS-LSTM networks
Remaining Issues
■ Network design for skeleton-based action
Advanced ensemble TS-LSTM networks
■ Untrimmed skeleton-based action recognition
Action classification + action localization
Real-time action detection system
■ Human pose estimation
2D/3D skeleton estimation from RGB images
Skeleton tracking in video
■ Other types of human action recognition
Human-object interaction analysis
Action video question answering
Multi-person action recognition