Human Action Recognition

HUMAN ACTION RECOGNITION
- INTRODUCTION OF HUMAN ACTION RECOGNITION
- ISSUES OF SKELETON-BASED ACTION RECOGNITION
- RESEARCH RELATED TO SKELETON-BASED ACTION RECOGNITION
Inwoong Lee, Ph.D. student, Yonsei University

- INTRODUCTION OF HUMAN ACTION RECOGNITION

Introduction of Human Action Recognition
 RGB based Human Action Recognition
■ Two-stream convolutional networks for action recognition in
videos, in NIPS 2014
 UCF-101 (88.0 %), HMDB-51 (59.4 %)
■ As of CVPR 2017, the best reported accuracies are UCF-101 (94.9 %) and HMDB-51 (72.2 %)
■ These methods map video directly to an action label and do not model human pose

 RGB based 2D Human Pose Estimation
■ Hand Face and Body Keypoint Detection (CVPR 2017)
 More context information than just RGB video

 RGB based 3D Human Pose Estimation
■ Recurrent 3D Pose Sequence Machine (CVPR 2017)
 More information (3D) than 2D human pose

 RGB-D based 3D Human Pose Estimation
■ Microsoft Kinect version 2.0
 More accurate than RGB based 3D skeleton because of depth
[Figure: RGB with 2D skeleton; 3D skeleton]

- ISSUES OF SKELETON-BASED ACTION RECOGNITION

Issues of Skeleton-based Action Recognition
 Attributes of Skeleton extracted from Camera
■ Variable scale (small vs. large)
■ Variable view orientation (View 1, View 2, View 3, each with its own x, y, z axes)
■ Very noisy

 Attributes of Human Action
■ Rate variation (fast vs. slow): e.g. 5 frames per action vs. 3 frames per action
■ Intra-action variation: e.g. straight punch vs. curved punch

- RESEARCH RELATED TO SKELETON-BASED ACTION RECOGNITION
: ENSEMBLE DEEP LEARNING USING TS-LSTM NETWORKS

Ensemble Deep Learning using TS-LSTM networks
 Overview of the proposed deep learning network [1]
[Figure: pipeline of the proposed network. Coordinate transformation (translation, rotation, scale) → salient motion extraction (Motion 1, Motion 2, …, Motion N) → discriminative multi-term LSTMs → ensemble of deep learning. The LSTM grid is arranged along a multi-term diversity axis and a softmax axis.]
[1] Inwoong Lee, Doyoung Kim, Seoungyoon Kang, and Sanghoon Lee, "Ensemble Deep Learning for
Skeleton-based Action Recognition using Temporal Sliding LSTM networks," in ICCV 2017.

 Feature Representation (1/3)
■ Absolute skeleton position
 Joint coordinates of the raw skeleton
 High orientation and location variations within the same action
■ Relative skeleton position
 Joint coordinates expressed as differences from a reference joint of each frame
 Low orientation and location variations within the same action
 Oversimplifies temporal skeleton movements (e.g. a jump)
■ Absolute + Relative skeleton position
 Joint coordinates expressed as differences from a reference joint of the initial frame
 Low orientation and location variations within the same action
 Preserves temporal skeleton movements
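The difference between the last two representations can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the choice of joint 0 as the reference joint, and the two-joint toy "jump" are all assumptions made for the example.

```python
# A skeleton sequence is a list of frames; each frame is a list of (x, y, z) joints.

def subtract(a, b):
    """Component-wise a - b for two 3D points."""
    return tuple(ai - bi for ai, bi in zip(a, b))

def relative_per_frame(seq, ref=0):
    """Joint differences from the reference joint of each frame."""
    return [[subtract(joint, frame[ref]) for joint in frame] for frame in seq]

def relative_to_initial(seq, ref=0):
    """Joint differences from the reference joint of the initial frame."""
    origin = seq[0][ref]
    return [[subtract(joint, origin) for joint in frame] for frame in seq]

# Hypothetical toy "jump": a two-joint skeleton moves up by 1 unit between frames.
jump = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 1.0, 0.0), (0.0, 2.0, 0.0)],
]
per_frame = relative_per_frame(jump)    # both frames become identical
to_initial = relative_to_initial(jump)  # the upward movement is kept
```

With a per-frame reference the two frames collapse to the same pose, which is the loss of temporal movement the slide attributes to the purely relative representation; the initial-frame reference keeps the displacement.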

■ Pose orientation alignment
Approach: we set the y axis to the vector vertical to the ground, the x axis to the left direction of the initial skeleton, and the z axis to the cross product of x and y.
Effect: this achieves view/orientation invariance and retains the temporal relation between skeletons.
[Figure: initial skeleton of each sequence (Sequence 1, Sequence 2) with the aligned x, y, z axes]
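A rough sketch of this alignment is given below. It is written under assumptions the slides do not state: the ground-normal vector `up` is taken as known, the "left direction" is estimated from a hypothetical pair of right/left joints (for example the shoulders) in the initial frame, and that direction is made orthogonal to y before taking the cross product.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return tuple(vi / norm for vi in v)

def build_axes(initial_frame, up, left_joint, right_joint):
    """Return (x, y, z) unit axes derived from the initial skeleton."""
    y = normalize(up)
    raw = tuple(l - r for l, r in zip(initial_frame[left_joint],
                                      initial_frame[right_joint]))
    k = dot(raw, y)  # remove the vertical component so that x is orthogonal to y
    x = normalize(tuple(ri - k * yi for ri, yi in zip(raw, y)))
    z = cross(x, y)  # z axis as the cross product of x and y
    return x, y, z

def align(seq, axes):
    """Express every joint of every frame in the new axes."""
    return [[tuple(dot(joint, axis) for axis in axes) for joint in frame]
            for frame in seq]

# Hypothetical one-frame sequence: joint 0 = right shoulder, joint 1 = left shoulder.
seq = [[(0.0, 0.0, -1.0), (0.0, 0.0, 1.0)]]
axes = build_axes(seq[0], up=(0.0, 1.0, 0.0), left_joint=1, right_joint=0)
aligned = align(seq, axes)  # the left shoulder now lies on the positive x axis
```

Because the axes are computed once from the initial frame and reused for the whole sequence, later frames keep their movement relative to the starting pose.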

■ Motion feature extraction
Approach: we take the difference between two skeleton frames.
Effect: the motion feature captures the actual movements of the skeleton joints.
[Figure: difference between two skeletons]
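The frame difference can be sketched in a few lines. The `gap` parameter (how many frames apart the two skeletons are) is an assumption added for illustration; the slides only state that two skeleton frames are subtracted.

```python
def motion_features(seq, gap=1):
    """Per-joint differences seq[t + gap] - seq[t] for every valid t."""
    return [
        [tuple(b - a for a, b in zip(j0, j1))
         for j0, j1 in zip(seq[t], seq[t + gap])]
        for t in range(len(seq) - gap)
    ]

# Hypothetical single-joint sequence moving along x.
seq = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)]]
step1 = motion_features(seq, gap=1)  # differences between adjacent frames
step2 = motion_features(seq, gap=2)  # differences two frames apart
```

Using several values of `gap` would yield several motion features from the same sequence, which is one plausible reading of "Motion 1 … Motion N" in the overview.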

 Modeling of Human Action (1/8)
■ Traditional work (1/4)
 Mining Actionlet Ensemble for Action Recognition with Depth Cameras, in CVPR 2012
 Fourier temporal pyramids
 In addition to the global Fourier coefficients, the action is recursively partitioned into a temporal pyramid
 Mining actionlet ensemble
 Feature pooling
 Support vector machine
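The pyramid idea can be illustrated on a 1D signal (for instance one joint coordinate over time). This is a simplified sketch, not the cited paper's implementation: the number of levels, the number of low-frequency coefficients kept, and the use of a naive DFT are all assumptions.

```python
import cmath

def dft_magnitudes(signal, n_coeffs):
    """Magnitudes of the first n_coeffs DFT coefficients (naive O(n^2) DFT)."""
    n = len(signal)
    mags = []
    for k in range(min(n_coeffs, n)):
        c = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))
        mags.append(abs(c))
    return mags

def fourier_temporal_pyramid(signal, levels=2, n_coeffs=2):
    """Concatenate low-frequency magnitudes of every segment at every level."""
    features = []
    for level in range(levels):
        parts = 2 ** level           # level 0: whole signal, level 1: halves, ...
        size = len(signal) // parts  # trailing frames are dropped if not divisible
        for p in range(parts):
            segment = signal[p * size:(p + 1) * size]
            features.extend(dft_magnitudes(segment, n_coeffs))
    return features

# A constant signal has energy only in the k = 0 coefficient of each segment.
ftp = fourier_temporal_pyramid([1.0] * 8, levels=2, n_coeffs=2)
```

Level 0 contributes the global coefficients and each deeper level adds coefficients of shorter segments, so the feature describes both the whole action and its temporal parts.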

 Human Action Recognition by Representing 3D Skeletons as
Points in A Lie Group, in CVPR 2014
 Feature representation using manifold, temporal alignment
through dynamic time warping, and SVM classification using FTP
SVM: Support Vector Machine
FTP: Fourier Temporal Pyramid
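Dynamic time warping, used above for temporal alignment, can be sketched on scalar sequences. The cited work aligns richer features; the scalar inputs and the absolute-difference cost here are simplifying assumptions chosen to keep the example short.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two scalar sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] and b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# The same hypothetical action performed slowly (6 frames) and quickly (3 frames).
slow = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
fast = [0.0, 1.0, 2.0]
same_action = dtw_distance(slow, fast)
```

The slow and fast versions align with zero cost, which is how DTW absorbs the rate variation listed among the issues of skeleton-based action recognition.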

 Hierarchical Recurrent Neural Network for Skeleton Based Action
Recognition, in CVPR 2015
 Spatial: body part based features, temporal: recurrent networks

 Spatio-Temporal LSTM with Trust Gates for 3D Human Action
Recognition, in ECCV 2016
 Combination of spatial and temporal features using LSTM
LSTM: Long Short-Term Memory

■ Temporal Sliding LSTM (TS-LSTM)
 A conventional LSTM run over the whole sequence mainly captures long-term dependency.
 TS-LSTM can capture short-term, medium-term, and long-term dependencies.
 TS-LSTM can be adapted to each dependency range by controlling the temporal stride and the internal LSTM time-step size.
[Figure: traditional LSTM vs. proposed TS-LSTM over an input sequence, showing the temporal stride and the LSTM inputs and outputs]
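The roles of the temporal stride and the LSTM time-step size can be illustrated by how the input sequence is cut into windows. This sketch covers only the windowing and assumes each window would then be fed to an LSTM; the names `window` and `stride` are illustrative, and how the actual network connects or shares its LSTMs is not specified here.

```python
def sliding_windows(seq, window, stride):
    """Cut seq into windows of `window` frames, starting every `stride` frames."""
    return [seq[s:s + window]
            for s in range(0, len(seq) - window + 1, stride)]

frames = list(range(10))  # stand-in for 10 skeleton frames

short_term = sliding_windows(frames, window=2, stride=2)    # many short windows
medium_term = sliding_windows(frames, window=4, stride=2)   # overlapping windows
long_term = sliding_windows(frames, window=10, stride=10)   # one full-length window
```

A small window with a small stride exposes local movements, while a window as long as the sequence behaves like a conventional LSTM over the whole input.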

■ Conceptual diagram of Ensemble TS-LSTM v1
 1 Short-term TS-LSTM with 1 motion feature
 2 Medium-term TS-LSTM with 2 motion features
 3 Long-term TS-LSTM with 3 motion features
[Figure: conceptual diagram of Ensemble TS-LSTM v1. Motion features from the input feed short-term, medium-term, and long-term TS-LSTM branches, whose outputs form the ensemble feature.]
■ Conceptual diagram of Ensemble TS-LSTM v2
 1 Short-term TS-LSTM with 1 motion feature
 2 Medium-term TS-LSTM with 2 motion features
 3 Long-term TS-LSTM with 3 motion features
 1 Medium-term TS-LSTM with 1 pose feature
[Figure: conceptual diagram of Ensemble TS-LSTM v2. Motion features from the input feed short-term, medium-term, and long-term TS-LSTM branches, and a pose feature feeds an additional medium-term branch; the outputs form the ensemble feature.]

■ Actual Ensemble TS-LSTM v1 & Ensemble TS-LSTM v2

 Used Datasets
■ MSR Action3D dataset (widely used)
 20 actions, each performed 2 or 3 times by 10 subjects
■ Northwestern-UCLA dataset (3 views)
 10 actions, each performed 1 to 6 times by 10 subjects
 Abbreviations
■ Human Cognitive Coordinate (HCC)
 y axis → the vector vertical to the ground, x axis → the left direction of the initial skeleton, z axis → the cross product of x and y
■ Salient Motion Feature (SMF)
 Difference features between two skeleton frames

 Results and Comparisons (1/2)
■ Bag of 3d points: Projection of the sampled 3D points
■ Lie group: Manifold feature based SVM model
■ HBRNN: Body-part based LSTM model
■ ST-LSTM + Trust Gate: Spatio-Temporal LSTM model with Trust
Gate
[Table: experimental result comparison on the MSR Action3D dataset, with the best performance marked]

 Results and Comparisons (2/2)
■ Lie group: Manifold feature based SVM model
■ Actionlet ensemble: Temporal Pyramid features + SVM model
■ HBRNN-L: Body-part based LSTM model
■ Enhanced skeleton visualization: CNN based model
[Table: experimental results on the Northwestern-UCLA dataset, with the best performance marked]
CNN: Convolutional Neural Network

 Result Analysis (1/3)
■ Misclassified actions (ground truth → prediction)
 Forward punch → Tennis serve
 High throw → Hammer, Tennis serve
 Pickup & throw → Bend
■ Ensemble TS-LSTM v1 distinguishes these similar actions to some degree by using multiple TS-LSTM networks
[Figure: confusion matrix of AS1 for Ensemble TS-LSTM v1 (ground truth vs. prediction)]

■ Softmax feature analysis of Ensemble TS-LSTM v1 (1/2)
 Overall, the diagonal probabilities of Softmax2 with long-term
LSTMs are higher than those of Softmax0 with short-term LSTMs
and Softmax1 with medium-term LSTMs
 The global temporal features have relatively more influence on the
performance than the local temporal features
[Figure: Softmax0 (short), Softmax1 (medium), Softmax2 (long)]

■ Softmax feature analysis of Ensemble TS-LSTM v1 (2/2)
 For some actions, Softmax0 and Softmax1 produce lower misclassification rates than Softmax2
 Combining them makes the model less prone to overfitting to certain actions (weakness compensation)
 Example: Softmax0 and Softmax1 assign lower probabilities to misclassifying “Pickup & throw” as “Bend” than Softmax2 does
[Figure: Softmax0 (short), Softmax1 (medium), Softmax2 (long)]
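The weakness-compensation effect can be illustrated with a toy calculation. Everything here is hypothetical: the probabilities are invented for the example and are not results from the experiments, and plain averaging is assumed as the combination rule, which may differ from how the network actually fuses its softmax outputs.

```python
def ensemble_average(prob_lists):
    """Average class probabilities over several softmax outputs."""
    n = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n for c in range(n_classes)]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

labels = ["Pickup & throw", "Bend"]

# Invented probabilities for one "Pickup & throw" sample.
softmax0 = [0.70, 0.30]  # short-term branch
softmax1 = [0.65, 0.35]  # medium-term branch
softmax2 = [0.45, 0.55]  # long-term branch, wrong on its own

combined = ensemble_average([softmax0, softmax1, softmax2])
long_only = labels[argmax(softmax2)]
ensemble = labels[argmax(combined)]
```

In this toy case the long-term branch alone picks "Bend", while the averaged output recovers "Pickup & throw" because the short-term and medium-term branches outvote it.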

 Action Sequence Prediction (Yonsei Dataset)
Action label order:
1. Fall Down
2. Sit Down
3. Stand Up
4. Wave Hands
5. Hands On The Head
6. Hunker Down
7. Punch
8. Kick
9. Wield Knife
10. Aim Handgun
11. Aim Rifle
12. Throw
13. Kick Object

 Remaining Issues
■ Network design for skeleton-based action recognition
 Advanced ensemble TS-LSTM networks
■ Untrimmed skeleton-based action recognition
 Action classification + action localization
 Real-time action detection system
■ Human pose estimation
 2D/3D skeleton estimation from RGB images
 Skeleton tracking in video
■ Other types of human action recognition
 Human-object interaction analysis
 Action video question answering
 Multi-person action recognition
