Make Skeleton Models Smaller, Faster and Better

Make Skeleton-based Action Recognition
Model Smaller, Faster and Better
Fan Yang1,2, Sakriani Sakti1,2, Yang Wu3 , Satoshi Nakamura1,2
1Nara Institute of Science and Technology, Japan
2RIKEN Center for Advanced Intelligence Project, Japan
3Kyoto University, Japan
0

Skeleton-based action recognition
12019©Fan Yang et al. Make Skeleton-based Action Recognition Model Smaller, Faster and Better
RGB-based action
recognition
Skeleton-based
action recognition
Using JHMDB data as an example
With a comparable performance, one of the fastest RGB-based action recognition work is ECO
(Zolfaghari, et al. 2018) : 230 clips per second.
With a comparable performance, one of the reported fastest skeleton-based action recognition work
is MFA-Net (Chen et al. 2019) : 361 clips per second, about 3M parameters.
Compared with the RGB data, the skeleton data is more sparse, can we make it smaller
faster, and better?

Skeleton-based action recognition by neural network models
Joints {j0, j1, …, jn}
Frames{f0,f0,…,fk}
Inputs
Actions
Neural Network
The raw skeleton sequence feature
Representation (e.g., 3D skeleton)

Inputs
Selecting features
≠
joints
xyz
=
Joints 1:N-1
Joints2:N
Geometry feature
Joints 1:N-1
Joints2:N
Has been solely used in:
“Learning a 3d human pose distance metric from geo-metric pose descriptor”,
“Fusing geometric features for skeleton-based action recognition using multilayer lstm networks”,
“Enhanced skeleton visualization for view invariant human action recognition”, etc.
Cartesian coordinate feature
xyz
joints
Has been solely used in:
“Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition”
“Spatial-temporal attention res-tcn for skeleton-based dynamic hand gesture recognition”,
“Mfa-net: Motion feature augmented network for dynamic hand gesture recognition from skeletal data”, etc.
( lose global spatio-temporal information)
(contains global spatio-temporal information)
Let’s use both of them
=
Rotate and shift
Action

Obtaining global motions from Cartesian coordinate feature
Actions related to global trajectories,
e.g., swipe V.
Actions unrelated to global trajectories,
e.g., pinch.
Therefore, we use Cartesian coordinates features, other than geometry features, to
calculate the motions.
Motion = -
frame k+1 frame k
It is variant to global view points but invariant to global locations.

Double-scale motions
Slow Motion
Double-scale
motions
Fast motion = -
frame k+1 frame k
Slow motion = -
frame k+2 frame k
Fast Motion
Previous works only utilize a single-scale motion.
To be robust to the motion scale variance, we use double-scale motions.

Implementation in our neural network architecture
Concatenate
CNN(1,2*filters)
CNN(1,filters), /2
Temporal difference
(stride=2)
CNN(1,2*filters)
CNN(1,filters), /2
CNN(1,2*filters)
CNN(1,filters)
Temporal difference
(stride=1)
Cartesian
Coordinates
Joints Collection
Distances
2×CNN(3,2*filters), /2
2×CNN(3,8*filters)
CNN(3,filters) CNN(3,filters) CNN(3,filters)
Fast
motion
Slow
motionInputs
GAP
FC(num_classes)
FC(128)

Skeleton-based action recognition by neural network models
Actions
To make the model smaller faster, and better
Inputs Neural Network
Fewer parameters and FLOPs
FLOPs: floating point operations

Neural network models used in skeleton-based action recognition
Reference:
[1] “Convolutional Neural Networks for Multivariate Time Series Classification using both Inter- and Intra Channel Parallel Convolutions”(1d cnn)
[2] “Spatial-temporal attention res-tcn for skeleton-based dynamic hand gesture recognition”(1d cnn)
[3] “Simple yet efficient real-time pose-based action recognition” (2d)
[4] “PoTion: Pose MoTion Representation for Action Recognition” (2d)
[5]“Mfa-net: Motion feature augmented network for dynamic hand gesture recognition from skeletal data” (lstm)
[6]“Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition” (2d cnn+lstm)
[7] “Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition” (GCNN)
2D CNN + LSTM
[6]
1D CNN
[1,2]
LSTM
[5]
2D CNN
[3,4]
Could be relatively
smaller and faster
GCNN
[7]

Skeleton is ordered point data, but the correlation is complicated
1
3
4 5
6 7
8 9
10 11
12 13
14 15
2
Point Data
Ordered Point Data
GCNN, CNN, RNN and FCN could be used
Unordered Point Data
generally GCNN is used
e.g., point cloud e.g., skeletons
However,
Adjacent indices but not
locally correlated joints:
[9, 10],
[11, 12],
[13, 14], etc.
Locally correlated but not
adjacent indices joints:
[4, 8, 12],
[5, 9, 13],
[2, 6, 7], etc.
The relationship between joints are dynamically changed.

Dynamically learning the implicit relationship of joints
Raw Features (relationship of joints is fixed)
Embedding features (implicit relationship of joints are learned)
Actions
Previous works [1,2]
Dynamically learning the
implicit relationship of joints
by a neural network
Reference:
[1] “Convolutional Neural Networks for Multivariate Time Series Classification using both Inter- and Intra Channel Parallel Convolutions”(1d cnn)
[2] “Spatial-temporal attention res-tcn for skeleton-based dynamic hand gesture recognition”(1d cnn)

Implementation in our neural network architecture
Concatenate
CNN(1,2*filters)
CNN(1,filters), /2
Temporal difference
(stride=2)
CNN(1,2*filters)
CNN(1,filters), /2
CNN(1,2*filters)
CNN(1,filters)
Temporal difference
(stride=1)
Cartesian
Coordinates
Joints Collection
Distances
2×CNN(3,8*filters)
CNN(3,filters) CNN(3,filters) CNN(3,filters)
Fast
motion
Slow
motion
Embedding
GAP
FC(num_classes)
FC(128)

Experimental Datasets

Results on SHREC data (Hand Gesture Recognition)
e.g., swipe left gestures
Global trajectory matters
Double motions are more effective

Results on JHMDB data
Actions are weakly related to
global trajectories
Double motions are more effective

Our code is released

Thanks!

Appendix: cartesian coordinate features
Limitation of our method:
For some cases (e.g., NTU data), Cartesian coordinates matters

Appendix: parameters and FLOPs for the neural network
2D CNN:
Parameters: (Ci x K2 + 1) x Co
FLOPs: (2 x Ci x K2 – 1) x T x J x Co
Ci=input channel, k=kernel size, T=frames, J=joint number, Co=output channel.
Dense layer:
Parameters: (I +1) x O
FLOPs: (2 x I – 1) x O
I = input neuron numbers, O = output neuron numbers.
1D CNN:
Parameters: (Ci x K +1) x Co
FLOPs: (2 x Ci x K – 1) x T x Co
Ci=input channel, k=kernel size, T=frames, Co=output channel.
LSTM:
Parameters: 4 x ((I+1) x O + O^2)
FLOPs: 16 x T x H x (H – 1)
I = input size, O = output size, H = hidden size.

Appendix: motion feature
Simonyan et al. “Two-Stream Convolutional Networks for Action Recognition in Videos”
In RGB-based action recognition, motion features (i.e., optical flow) are used to improve the action
recognition performance.
In skeleton-based action recognition, cross-frame skeleton difference can be used as motion
features.
is a motion feature vector

Make Skeleton Models Smaller, Faster and Better

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Make Skeleton Models Smaller, Faster and Better

Similar to Make Skeleton Models Smaller, Faster and Better (20)

Recently uploaded

Recently uploaded (20)

Make Skeleton Models Smaller, Faster and Better

Editor's Notes