Signal & Image Processing: An International Journal (SIPIJ) Vol.11, No.6, December 2020
DOI: 10.5121/sipij.2020.11605
A NOVEL GRAPH REPRESENTATION FOR
SKELETON-BASED ACTION RECOGNITION
Tingwei Li¹, Ruiwen Zhang² and Qing Li³

¹,³ Department of Automation, Tsinghua University, Beijing, China
² Department of Computer Science, Tsinghua University, Beijing, China
ABSTRACT
Graph convolutional networks (GCNs) have been proven effective for processing structured data: they capture the features of related nodes and improve model performance. Increasing attention has therefore been paid to employing GCNs in skeleton-based action recognition. However, existing GCN-based methods face two challenges. First, the consistency of temporal and spatial features is ignored because features are extracted node by node and frame by frame. We design a generic representation of skeleton sequences for action recognition and propose a novel model called Temporal Graph Networks (TGN), which obtains spatiotemporal features simultaneously. Second, the adjacency matrix of the graph describing the relations between joints mostly depends on their physical connections. We propose a multi-scale graph strategy to more appropriately describe these relations, adopting a full-scale graph, a part-scale graph and a core-scale graph to capture the local features of each joint and the contour features of important joints. Extensive experiments are conducted on two large datasets, NTU RGB+D and Kinetics Skeleton, and the results show that TGN with our graph strategy outperforms other state-of-the-art methods.
KEYWORDS
Skeleton-based action recognition, Graph convolutional network, Multi-scale graphs
1. INTRODUCTION
Human action recognition is a meaningful and challenging task with widespread potential applications, including health care, human-computer interaction and autonomous driving. At present, skeleton data are increasingly used for action recognition because, compared to video data, they are robust to background noise and changes in viewpoint. Skeleton-based action recognition is mainly based on deep learning methods such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs) and GCNs [3, 8, 10, 12, 13, 15, 17, 18]. RNNs and CNNs generally process the skeleton data into vector sequences and images respectively; these representations cannot fully express the dependencies between correlated joints. With growing research on GCNs, [12] first employed a GCN for skeleton-based action recognition and inspired many follow-up works [7, 10, 15, 17, 18].
The key to GCN-based action recognition is to obtain the temporal and spatial features of an action sequence through a graph [7, 10, 18]. In the skeleton graph, skeleton joints become nodes and the relations between joints are represented by edges. As shown in Fig. 1, most previous work uses more than one graph, and a node contains only spatial features. In this case, a GCN extracts spatial features frame by frame, and a Temporal Convolutional Network (TCN) then extracts temporal features node by node. However, the features of a joint in an action are related not only to other joints within a frame but also to joints across frames, so existing methods break this consistency of spatiotemporal features. To solve this problem, we propose TGN to capture spatiotemporal features simultaneously: as shown in Fig. 2, each node comprises a joint across all frames and contains both spatial and temporal features in the graph, so TGN obtains spatiotemporal features by processing all frames of each joint at once.
Besides, the edges of the skeleton graph mainly depend on the adjacency matrix A, which reflects the physical connections between joints [12, 17, 18]. GCNs still lack an effective adaptive graph mechanism to establish global connections beyond the physical relations of nodes, such as a relation between head and toes; they can only obtain local features, such as a relation between head and neck. In this paper, a multi-scale graph strategy is proposed that adopts graphs of different scales in different network branches. As a result, physically unconnected information is added to help the network capture the local features of each joint and the contour features of important joints.
The main contributions of this paper are summarized in three aspects:
(1) Temporal Graph Network (TGN), proposed in this paper, is a feature extractor that obtains spatiotemporal features simultaneously. Moreover, this extractor can be adapted to most GCN-based skeleton action recognition models and helps improve their performance.
(2) A multi-scale graph strategy is proposed to optimize the graph, extracting spatial features at different scales. This strategy captures not only local features but also global features.
(3) Multi-scale Temporal Graph Network (MS-TGN) is proposed based on TGN and the multi-scale graph strategy. Extensive experiments are conducted on two datasets, NTU RGB+D and Kinetics Skeleton, and our MS-TGN outperforms state-of-the-art methods on both datasets for skeleton-based action recognition.
2. RELATED WORKS
2.1. Skeleton-Based Action Recognition
Conventional methods usually use handcrafted features to model human skeleton data for action recognition, but their performance is barely satisfactory. With the development of deep learning, neural networks, including RNNs and CNNs, have become the main methods. RNN-based methods, such as LSTM and GRU, usually model the skeleton data as a sequence of coordinate vectors, each representing a human body joint [3, 9, 11, 13]; the 3D coordinates of all joints in a frame are concatenated in some order to form the input vector. CNN-based methods model the skeleton data as an image in which a node can be regarded as a pixel [2, 7, 8]; some works transform a skeleton sequence into an image by treating the joint coordinates (x, y, z) as the R, G and B channels of a pixel. Because skeleton data are naturally embedded in the form of graphs rather than sequences or grids, neither RNNs nor CNNs can fully represent their structure. Since ST-GCN [12] was proposed in 2018, a series of methods for skeleton-based action recognition have been based on GCNs, which have been proven effective for processing structured data.
2.2. Graph Convolutional Networks
ST-GCN [12] introduces GCNs into skeleton-based action recognition, constructing a spatiotemporal graph from the natural connections of human joints; it designs three different adjacency matrices to extract spatial and temporal features and achieves better performance than previous methods. 2s-AGCN [18] proposes a two-stream method and designs an adaptive adjacency matrix that not only represents the structure of the human body but can also be learned using an attention mechanism. DGNN [28] designs a directed graph structure that learns the information between joints and bones in a GCN. SGN [29] explicitly introduces high-level semantics of joints, including joint type and frame index, into the network to enhance its feature representation capability. These approaches treat each joint as a node of the graph, and the edges denoting joint relationships are pre-defined by humans based on prior knowledge. Moreover, the spatial and temporal features of actions are learned by GCN and TCN respectively, which makes the network less efficient.
3. METHODOLOGY
3.1. Temporal Graph Networks
The skeleton data of an action can be described as $X = \{x_{c,v,t}\} \in \mathbb{R}^{C \times V \times T}$, where $T$ is the number of frames, $V$ is the number of joints in a frame, $C$ is the number of channels of a joint, and $x_{c,v,t}$ represents the skeleton data of joint $v$ in frame $t$ in channel $c$; $X_i$ denotes sequence $i$. Previous methods construct $T$ graphs, each with $V$ nodes, as shown in Fig. 1, where the node set $N = \{x_{v,t} \mid v = 1, 2, \ldots, V,\ t = 1, 2, \ldots, T\}$, $x_{v,t} \in \mathbb{R}^{C}$, contains joint $v$ in the $t$-th frame. That is, each frame is represented as one graph, there are $T$ graphs in total, and the feature of a node has size $C$. A GCN is used to obtain spatial features from each graph, and the outputs of the GCN are then fed into a TCN to extract temporal features.
Figure 1. A node in any graph represents data of a joint in a certain frame. GCN extracts spatial features
frame by frame, and TCN extracts temporal features node by node.
We redefine the graph and propose TGN. Instead of $T$ graphs, we have only one graph with $V$ nodes, as seen in Fig. 2. In this graph, the node set $N = \{x_v \mid v = 1, 2, \ldots, V\}$, $x_v \in \mathbb{R}^{C \times T}$, contains joint $v$ in all frames, so the feature of a node has size $C \times T$. Compared with methods that use GCN and TCN alternately, we use only one GCN block to extract spatiotemporal features.

Figure 2. Each node represents the data of a joint in all frames, so temporal information is also contained. Compared with Fig. 1, TGN extracts temporal and spatial features simultaneously.
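To make this representation concrete, here is a minimal sketch (ours, not the authors' released code) that reshapes a skeleton tensor of shape (C, T, V) into the single TGN graph, one node per joint with a feature of size C × T:

```python
import torch

def to_tgn_nodes(x: torch.Tensor) -> torch.Tensor:
    """Reshape a skeleton sequence into TGN node features.

    x: (C, T, V) -- C channels, T frames, V joints.
    Returns (V, C * T): one node per joint, carrying that joint's
    data across all frames, as in Fig. 2.
    """
    C, T, V = x.shape
    return x.permute(2, 0, 1).reshape(V, C * T)

# Example: NTU RGB+D skeletons have C = 3 coordinates and V = 25 joints,
# padded to T = 300 frames (values taken from the paper).
x = torch.randn(3, 300, 25)
print(to_tgn_nodes(x).shape)  # torch.Size([25, 900])
```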
In a basic TGN block, each node already carries both temporal and spatial information; therefore, during graph convolution, temporal and spatial features can be computed simultaneously. The output value for a single channel at node $n_v$ can be written as Eq. 2.
$$S_v = \{\, n_j \mid n_j \in S(n_v, h) \,\} \qquad (1)$$

$$F_o(n_v) = \sum_{j=0}^{k} F_i(S_v(j)) \times w(j) \qquad (2)$$

where $n_v$ is node $v$, $S(n_v, h)$ is a sampling function used to find the set of nodes $n_j$ adjacent to $n_v$, $F_i$ maps nodes to feature vectors, $w(j)$ is the weight of a CNN whose kernel size is $1 \times t$, and $F_o(n_v)$ is the output at $n_v$. Eq. 2 is a general formula shared by most GCN-based action recognition models, as it is used to extract spatial features in a graph; our method can therefore be adapted to existing methods by changing their graph structure.
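The sketch below shows one way a TGN block in the spirit of Eq. 2 could look; it is our illustration under stated assumptions, not the released implementation. A fixed normalized adjacency matrix stands in for the sampling function $S$ of Eq. 1, and a $1 \times t$ convolution supplies the weights $w(j)$; because every node's feature spans all $T$ frames, a single block produces spatiotemporal features.

```python
import torch
import torch.nn as nn

class TGNBlock(nn.Module):
    """Illustrative TGN layer. Input x: (N, C, T, V) -- batch, channels,
    frames, joints. Each node carries all T frames, so one block mixes
    spatial information (over V, via A) and temporal information
    (over T, via the 1xt kernel) at the same time."""
    def __init__(self, in_c: int, out_c: int, A: torch.Tensor, t: int = 3):
        super().__init__()
        self.register_buffer("A", A)  # (V, V) normalized adjacency
        self.conv = nn.Conv2d(in_c, out_c, kernel_size=(t, 1),
                              padding=((t - 1) // 2, 0))
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # neighbor sum of Eq. 2
        return self.relu(self.conv(x))  # 1xt kernel over the frame axis
```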
3.2. Multi-Scale Graph Strategy
Figure 3. (a) Full-scale graph, (b) part-scale graph, (c) core-scale graph.
Dilated convolution can capture otherwise ignored features, such as relations between unconnected points in an image, by skipping positions during convolution. Inspired by this, we select joints of different expressiveness to form graphs of different scales for convolution. Temporal features of joints with a larger motion space are more expressive, and joints in a body generally have different relative motion spaces; for example, the elbow and knee can move in a larger space than surrounding joints such as the shoulder and hip. In a small-scale graph, there are fewer but more expressive nodes, so the receptive field is larger; in a large-scale graph, there are more but less expressive nodes, so the receptive field is smaller.
We design three graphs of different scales based on the NTU RGB+D dataset, as shown in Fig. 3. The full-scale graph in Fig. 3(a) has all 25 nodes and captures the local features of each joint thanks to its small receptive field. The part-scale graph in Fig. 3(b) is represented by only 11 nodes; its larger receptive field tends to capture contour information. Fig. 3(c) is the core-scale graph with only seven nodes. It has the largest receptive field: although it ignores the internal state of the limbs, it connects the left and right limbs directly, so global information can be obtained. Through graphs of different scales, the GCN can capture local, contour and global features respectively.
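One plausible way to build such reduced graphs is sketched below (our construction, with hypothetical joint indices; the actual 25-, 11- and 7-joint selections are those of Fig. 3): two retained joints are connected whenever the full skeleton links them by a path passing only through dropped joints, so, for instance, a sparse core-scale graph can connect the left and right limbs directly.

```python
import numpy as np
from collections import defaultdict, deque

def scale_adjacency(edges, keep):
    """Adjacency for a reduced-scale graph (illustrative sketch).

    edges: (i, j) bone pairs of the full skeleton.
    keep:  list of joints retained at this scale. Two retained joints
    become neighbors if a path of dropped joints connects them.
    """
    nbrs = defaultdict(list)
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    idx = {j: k for k, j in enumerate(keep)}
    A = np.eye(len(keep))
    for j in keep:
        queue, seen = deque([j]), {j}
        while queue:
            u = queue.popleft()
            for v in nbrs[u]:
                if v in seen:
                    continue
                seen.add(v)
                if v in idx:  # reached another kept joint: add an edge
                    A[idx[j], idx[v]] = A[idx[v], idx[j]] = 1.0
                else:         # dropped joint: keep walking through it
                    queue.append(v)
    return A / A.sum(axis=1, keepdims=True)  # simple row normalization

# Hypothetical core-scale selection of 7 joints on a 25-joint skeleton:
# A3 = scale_adjacency(ntu_bone_list, keep=[0, 4, 8, 12, 16, 20, 24])
```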
Based on the above, we combine TGN with the multi-scale graph strategy to obtain the final model, MS-TGN, as shown in Fig. 4. The model consists of 10 TGN layers, each using a 3 × 1 convolutional kernel to extract the temporal features of each node, followed by a fully-connected layer that classifies based on the extracted features.

Figure 4. The network architecture of MS-TGN. Multi-scale graphs are represented by different adjacency matrices $A_1$, $A_2$ and $A_3$.
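Reusing the TGNBlock sketch from Section 3.1, the multi-branch model might be assembled as follows; the summation fusion and global pooling are our assumptions, and Fig. 4 remains the authoritative layout:

```python
import torch
import torch.nn as nn

class MSTGN(nn.Module):
    """Illustrative MS-TGN: one stack of ten TGN layers per scale
    (full / part / core adjacency), pooled and fused by summation,
    then a fully-connected classifier."""
    def __init__(self, adjs, num_classes: int, in_c: int = 3):
        super().__init__()
        widths = [64] * 4 + [128] * 3 + [256] * 3  # channel plan of Sec. 4.1
        self.branches = nn.ModuleList()
        for A in adjs:  # A1 (full), A2 (part), A3 (core)
            layers, c = [], in_c
            for w in widths:
                layers.append(TGNBlock(c, w, A))  # 3x1 temporal kernel inside
                c = w
            self.branches.append(nn.Sequential(*layers))
        self.fc = nn.Linear(widths[-1], num_classes)

    def forward(self, xs):
        # xs: one (N, C, T, V_s) tensor per scale, joints subsampled to match
        feats = [b(x).mean(dim=(2, 3)) for b, x in zip(self.branches, xs)]
        return self.fc(torch.stack(feats).sum(dim=0))
```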
4. EXPERIMENTS
4.1. Datasets and Implementation Details
NTU RGB+D. This large, widely used dataset contains 3D skeleton data collected with the Microsoft Kinect v2 [16]. It covers 60 action classes and about 56,000 action sequences, performed by 40 subjects and captured by three cameras fixed at 0°, 45° and −45°. Each sequence has several frames and each frame is composed of 25 joints. We follow [16] to carry out the cross-view (CV) and cross-subject (CS) experiments. In the cross-view setting, the training data comprises the 37,920 action sequences from two of the camera views and the test data the 18,960 sequences from the remaining view. In the cross-subject setting, the training data consists of the action sequences performed by 20 of the subjects, and the test data contains the 16,560 sequences performed by the others. We use top-1 accuracy for evaluation.
Kinetics Skeleton. Kinetics [6] is a video action recognition dataset built from YouTube videos. Kinetics Skeleton applies the OpenPose [23] estimation toolbox to detect the 18 joints of each skeleton. It contains 400 kinds of actions and 260,232 sequences; the training set consists of 240,436 action sequences and the test set of the remaining 19,796. We use top-1 and top-5 accuracies for evaluation.
Unless otherwise stated, all proposed models use the following settings. The number of channels is 64 in the first four layers, 128 in the middle three layers and 256 in the last three layers. The SGD optimizer with Nesterov accelerated gradient is used, and different learning rate schedules are designed for the different datasets. The mini-batch size is 32 and the momentum is set to 0.9. All skeleton sequences are padded to T = 300 frames by replaying the actions, and inputs are normalized and translated as in [18].
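These settings translate directly into code; a sketch follows (the initial learning rate and the stand-in model are placeholders, since the paper only states that the schedule differs per dataset):

```python
import torch
import torch.nn as nn

def pad_by_replay(seq: torch.Tensor, target_len: int = 300) -> torch.Tensor:
    """Pad a (C, T, V) skeleton sequence to target_len frames by replaying it."""
    reps = -(-target_len // seq.shape[1])  # ceiling division
    return seq.repeat(1, reps, 1)[:, :target_len, :]

model = nn.Linear(8, 2)  # stand-in for MS-TGN
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,  # placeholder initial LR
                            momentum=0.9, nesterov=True)

clip = torch.randn(3, 120, 25)    # a 120-frame, 25-joint sequence
print(pad_by_replay(clip).shape)  # torch.Size([3, 300, 25])
```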
4.2. Ablation Experiments
4.2.1. Feature Extractor: TGN
ST-GCN [12] and 2s-AGCN [18], representative models that use GCN and TCN alternately, were chosen as baselines. We replace their GCN & TCN modules with TGN while keeping their adjacency matrix construction strategies unchanged. The experimental results on the two datasets are listed in Table 1 and Table 2. From Table 1, the performance of ST-GCN increases to 82.3% and 90.8% on X-sub and X-view respectively, and the accuracy of 2s-AGCN increases by 0.55% on average. From Table 2, the accuracies of both models are improved. In conclusion, TGN is sufficiently flexible to be used as a drop-in feature extractor and performs better than GCN & TCN-based extraction.
Table 1. Effectiveness of our TGN module on the NTU RGB+D dataset in terms of accuracy.

Model          TGN   X-sub(%)   X-view(%)
ST-GCN [12]          81.6       88.8
ST-GCN [12]     √    82.3       90.8
2s-AGCN [18]         88.5       95.1
2s-AGCN [18]    √    89.0       95.4
Js-Ours              86.0       93.7
Js-Ours         √    86.6       94.1
Bs-Ours              86.9       93.2
Bs-Ours         √    87.5       93.9
Table 2. Effectiveness of our TGN module on the Kinetics dataset in terms of accuracy.

Model          TGN   Top-1(%)   Top-5(%)
ST-GCN [12]          30.7       52.8
ST-GCN [12]     √    31.5       54.0
2s-AGCN [18]         36.1       58.7
2s-AGCN [18]    √    36.7       59.5
Js-Ours              35.0       93.7
Js-Ours         √    35.2       94.1
Bs-Ours              33.0       55.7
Bs-Ours         √    33.3       56.2
4.2.2. Multi-Scale Graph Strategy
Building on Section 4.2.1, we construct the full-scale graph containing all joints of the original data and a part-scale graph containing 11 joints for the NTU RGB+D dataset; for the Kinetics Skeleton dataset, the full-scale graph likewise contains all nodes and the part-scale graph contains 11 joints. We evaluate each scale, and Tables 3 and 4 show the experimental results on the two datasets. In detail, adding the part-scale graph increases accuracy by 1.5% and 0.45% respectively, and adding the core-scale graph increases accuracy by a further 0.3% and 0.2% respectively. The core-scale graph provides the global features of the whole body, the part-scale graph provides the contour features of body parts, and the full-scale graph provides the local features of each joint. Through feature fusion, the model obtains richer information and performs better.
Table 3. Effectiveness of the multi-scale graph on the NTU RGB+D dataset in terms of accuracy.

Full-Scale   Part-Scale   Core-Scale   X-sub(%)   X-view(%)
√                                      89.0       95.2
             √                         86.0       94.0
                          √            85.6       93.3
√            √                         89.2       95.7
√            √            √            89.5       95.9
Table 4. Effectiveness of the multi-scale graph on the Kinetics dataset in terms of accuracy.

Full-Scale   Part-Scale   Core-Scale   Top-1(%)   Top-5(%)
√                                      36.6       59.5
             √                         35.0       55.9
                          √            33.8       54.6
√            √                         36.9       59.9
√            √            √            37.3       60.2
4.3. Comparison With State-of-the-Art Methods
We compare MS-TGN with state-of-the-art skeleton-based action recognition methods on both the NTU RGB+D and Kinetics Skeleton datasets. The results on NTU RGB+D are shown in Table 5: our model performs best in the cross-view experiment. It also achieves the highest top-1 accuracy on Kinetics Skeleton, as listed in Table 6.
Table 5. Performance comparisons on the NTU RGB+D dataset with the CS and CV settings.
Method Year X-sub(%) X-view(%)
HBRNN-L[3] 2015 59.1 64.0
PA LSTM[16] 2016 62.9 70.3
STA-LSTM[21] 2017 73.4 81.2
GCA-LSTM[22] 2017 74.4 82.8
ST-GCN[12] 2018 81.5 88.3
DPRL+GCNN[25] 2018 83.5 89.8
SR-TSL[19] 2018 84.8 92.4
AS-GCN[10] 2019 86.8 94.2
2s-AGCN[18] 2019 88.5 95.1
VA-CNN[26] 2019 88.7 94.3
SGN[27] 2020 89.0 94.5
MS-TGN(ours) - 89.5 95.9
Table 6. Performance comparisons on Kinetics dataset with SOTA methods.
Method Year Top-1(%) Top-5(%)
PA LSTM[16] 2016 16.4 35.3
TCN[8] 2017 20.3 40.0
ST-GCN[12] 2018 30.7 52.8
AS-GCN[10] 2019 34.8 56.6
2s-AGCN[18] 2019 36.1 58.7
NAS[15] 2020 37.1 60.1
MS-TGN(ours) - 37.3 60.2
Moreover, our model reduces computation and parameters, as listed in Table 7, meaning it is simpler yet better at modelling spatial and temporal features. Why does our model have fewer parameters and less computation but perform better? In our graph, each node holds all frame data of a joint, which brings two advantages: (1) the TGN extractor contains only a GCN without a TCN, which reduces parameters and computation; (2) instead of extracting alternately, TGN extracts spatial and temporal features at the same time, strengthening the consistency of the spatiotemporal features.
Table 7. Comparison of computational cost with state-of-the-art methods. #Params and FLOPs are calculated with the THOP (PyTorch-OpCounter) tool [24].
Method Year X-sub(%) #Params(M) #FLOPs(G)
ST-GCN[12] 2018 81.6 3.1 15.2
AS-GCN[10] 2019 86.8 4.3 17.1
2s-AGCN[18] 2019 88.5 3.5 17.4
NAS[15] 2020 89.4 6.6 36.6
MS-TGN(ours) - 89.5 3.0 15.0
5. CONCLUSIONS
The MS-TGN model proposed in this paper has two main innovations: the TGN model, which extracts temporal and spatial features at the same time, and the multi-scale graph strategy, which obtains local and contour features simultaneously. TGN introduces a novel representation of skeleton sequences for action recognition that captures spatiotemporal features in a single pass, while the multi-scale graph strategy additionally extracts global spatial features. On two published datasets, the proposed MS-TGN achieves state-of-the-art accuracy with the fewest parameters and the least computation.
ACKNOWLEDGEMENT
This work is sponsored by the National Natural Science Foundation of China (No. 61771281), the "New Generation Artificial Intelligence" major project of China (No. 2018AAA0101605), the 2018 Industrial Internet Innovation and Development Project, and the Tsinghua University Initiative Scientific Research Program.
REFERENCES
[1] Bruna, Joan, et al. "Spectral Networks and Locally Connected Networks on Graphs." ICLR 2014: International Conference on Learning Representations (ICLR), 2014.
[2] Du, Yong, et al. "Skeleton Based Action Recognition with Convolutional Neural Network." 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 2015, pp. 579–583.
[3] Du, Yong, et al. "Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition." 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1110–1118.
[4] Feichtenhofer, Christoph, et al. "Convolutional Two-Stream Network Fusion for Video Action Recognition." 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1933–1941.
[5] Huang, Zhiwu, et al. "Deep Learning on Lie Groups for Skeleton-Based Action Recognition." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1243–1252.
[6] Kay, Will, et al. "The Kinetics Human Action Video Dataset." arXiv preprint arXiv:1705.06950, 2017.
[7] Ke, Qiuhong, et al. "A New Representation of Skeleton Sequences for 3D Action Recognition." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4570–4579.
[8] Kim, Tae Soo, and Austin Reiter. "Interpretable 3D Human Action Analysis with Temporal Convolutional Networks." 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1623–1631.
[9] Li, Chao, et al. "Co-Occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation." IJCAI'18 Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 786–792.
[10] Li, Maosen, et al. "Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition." 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3595–3603.
[11] Li, Shuai, et al. "Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN." 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 5457–5466.
[12] Yan, Sijie, et al. "Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition." AAAI, 2018, pp. 7444–7452.
[13] Martinez, Julieta, et al. "On Human Motion Prediction Using Recurrent Neural Networks." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4674–4683.
[14] Ofli, Ferda, et al. "Sequence of the Most Informative Joints (SMIJ): A New Representation for Human Skeletal Action Recognition." 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2012, pp. 8–13.
[15] Peng, Wei, et al. "Learning Graph Convolutional Network for Skeleton-Based Human Action Recognition by Neural Searching." Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 3, 2020, pp. 2669–2676.
[16] Shahroudy, Amir, et al. "NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis." 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1010–1019.
[17] Shi, Lei, et al. "Skeleton-Based Action Recognition With Directed Graph Neural Networks." 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7912–7921.
[18] Shi, Lei, et al. "Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition." 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 12026–12035.
[19] Si, Chenyang, et al. "Skeleton-Based Action Recognition with Spatial Reasoning and Temporal Stack Learning." Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 106–121.
[20] Si, Chenyang, et al. "An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition." 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1227–1236.
[21] Liu, Jun, et al. "Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition." European Conference on Computer Vision, 2016, pp. 816–833.
[22] Song, Sijie, et al. "An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data." AAAI'17 Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017, pp. 4263–4270.
[23] Cao, Zhe, et al. "OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields." arXiv preprint arXiv:1812.08008, 2018.
[24] Zhu, Ligeng. THOP: PyTorch-OpCounter. https://github.com/Lyken17/pytorch-OpCounter.
[25] Tang, Yansong, et al. "Deep Progressive Reinforcement Learning for Skeleton-Based Action Recognition." 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 5323–5332.
[26] Zhang, Pengfei, et al. "View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, 2019, pp. 1963–1978.
[27] Zhang, Pengfei, et al. "Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition." 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1112–1121.
[28] Shi, Lei, et al. "Skeleton-Based Action Recognition with Directed Graph Neural Networks." 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7912–7921.
[29] Zhang, Pengfei, et al. "Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition." 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1112–1121.
More Related Content

What's hot

Mixed Reality: Pose Aware Object Replacement for Alternate Realities
Mixed Reality: Pose Aware Object Replacement for Alternate RealitiesMixed Reality: Pose Aware Object Replacement for Alternate Realities
Mixed Reality: Pose Aware Object Replacement for Alternate RealitiesAlejandro Franceschi
ย 
ICVG: PRACTICAL CONSTRUCTIVE VOLUME GEOMETRY FOR INDIRECT VISUALIZATION
ICVG: PRACTICAL CONSTRUCTIVE VOLUME GEOMETRY FOR INDIRECT VISUALIZATIONICVG: PRACTICAL CONSTRUCTIVE VOLUME GEOMETRY FOR INDIRECT VISUALIZATION
ICVG: PRACTICAL CONSTRUCTIVE VOLUME GEOMETRY FOR INDIRECT VISUALIZATIONijcga
ย 
ICVG : Practical Constructive Volume Geometry for Indirect Visualization
ICVG : Practical Constructive Volume Geometry for Indirect Visualization  ICVG : Practical Constructive Volume Geometry for Indirect Visualization
ICVG : Practical Constructive Volume Geometry for Indirect Visualization ijcga
ย 
COLOR IMAGE ENCRYPTION BASED ON MULTIPLE CHAOTIC SYSTEMS
COLOR IMAGE ENCRYPTION BASED ON MULTIPLE CHAOTIC SYSTEMSCOLOR IMAGE ENCRYPTION BASED ON MULTIPLE CHAOTIC SYSTEMS
COLOR IMAGE ENCRYPTION BASED ON MULTIPLE CHAOTIC SYSTEMSIJNSA Journal
ย 
css_final_v4
css_final_v4css_final_v4
css_final_v4Rui Li
ย 
PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...
PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...
PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...VLSICS Design
ย 
Intel ILS: Enhancing Photorealism Enhancement
Intel ILS: Enhancing Photorealism EnhancementIntel ILS: Enhancing Photorealism Enhancement
Intel ILS: Enhancing Photorealism EnhancementAlejandro Franceschi
ย 
Tissue Segmentation Methods Using 2D Histogram Matching in a Sequence of MR B...
Tissue Segmentation Methods Using 2D Histogram Matching in a Sequence of MR B...Tissue Segmentation Methods Using 2D Histogram Matching in a Sequence of MR B...
Tissue Segmentation Methods Using 2D Histogram Matching in a Sequence of MR B...Vladimir Kanchev
ย 
Background Estimation Using Principal Component Analysis Based on Limited Mem...
Background Estimation Using Principal Component Analysis Based on Limited Mem...Background Estimation Using Principal Component Analysis Based on Limited Mem...
Background Estimation Using Principal Component Analysis Based on Limited Mem...IJECEIAES
ย 
Enhanced target tracking based on mean shift algorithm for satellite imagery
Enhanced target tracking based on mean shift algorithm for satellite imageryEnhanced target tracking based on mean shift algorithm for satellite imagery
Enhanced target tracking based on mean shift algorithm for satellite imageryeSAT Journals
ย 
NMS and Thresholding Architecture used for FPGA based Canny Edge Detector for...
NMS and Thresholding Architecture used for FPGA based Canny Edge Detector for...NMS and Thresholding Architecture used for FPGA based Canny Edge Detector for...
NMS and Thresholding Architecture used for FPGA based Canny Edge Detector for...idescitation
ย 
Google Research Siggraph Whitepaper | Total Relighting: Learning to Relight P...
Google Research Siggraph Whitepaper | Total Relighting: Learning to Relight P...Google Research Siggraph Whitepaper | Total Relighting: Learning to Relight P...
Google Research Siggraph Whitepaper | Total Relighting: Learning to Relight P...Alejandro Franceschi
ย 
Enhanced target tracking based on mean shift
Enhanced target tracking based on mean shiftEnhanced target tracking based on mean shift
Enhanced target tracking based on mean shifteSAT Publishing House
ย 
Fast Full Search for Block Matching Algorithms
Fast Full Search for Block Matching AlgorithmsFast Full Search for Block Matching Algorithms
Fast Full Search for Block Matching Algorithmsijsrd.com
ย 
EFFICIENT IMAGE COMPRESSION USING LAPLACIAN PYRAMIDAL FILTERS FOR EDGE IMAGES
EFFICIENT IMAGE COMPRESSION USING LAPLACIAN PYRAMIDAL FILTERS FOR EDGE IMAGESEFFICIENT IMAGE COMPRESSION USING LAPLACIAN PYRAMIDAL FILTERS FOR EDGE IMAGES
EFFICIENT IMAGE COMPRESSION USING LAPLACIAN PYRAMIDAL FILTERS FOR EDGE IMAGESijcnac
ย 
Be36338341
Be36338341Be36338341
Be36338341IJERA Editor
ย 
Oc2423022305
Oc2423022305Oc2423022305
Oc2423022305IJERA Editor
ย 

What's hot (17)

Mixed Reality: Pose Aware Object Replacement for Alternate Realities
Mixed Reality: Pose Aware Object Replacement for Alternate RealitiesMixed Reality: Pose Aware Object Replacement for Alternate Realities
Mixed Reality: Pose Aware Object Replacement for Alternate Realities
ย 
ICVG: PRACTICAL CONSTRUCTIVE VOLUME GEOMETRY FOR INDIRECT VISUALIZATION
ICVG: PRACTICAL CONSTRUCTIVE VOLUME GEOMETRY FOR INDIRECT VISUALIZATIONICVG: PRACTICAL CONSTRUCTIVE VOLUME GEOMETRY FOR INDIRECT VISUALIZATION
ICVG: PRACTICAL CONSTRUCTIVE VOLUME GEOMETRY FOR INDIRECT VISUALIZATION
ย 
ICVG : Practical Constructive Volume Geometry for Indirect Visualization
ICVG : Practical Constructive Volume Geometry for Indirect Visualization  ICVG : Practical Constructive Volume Geometry for Indirect Visualization
ICVG : Practical Constructive Volume Geometry for Indirect Visualization
ย 
COLOR IMAGE ENCRYPTION BASED ON MULTIPLE CHAOTIC SYSTEMS
COLOR IMAGE ENCRYPTION BASED ON MULTIPLE CHAOTIC SYSTEMSCOLOR IMAGE ENCRYPTION BASED ON MULTIPLE CHAOTIC SYSTEMS
COLOR IMAGE ENCRYPTION BASED ON MULTIPLE CHAOTIC SYSTEMS
ย 
css_final_v4
css_final_v4css_final_v4
css_final_v4
ย 
PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...
PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...
PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...
ย 
Intel ILS: Enhancing Photorealism Enhancement
Intel ILS: Enhancing Photorealism EnhancementIntel ILS: Enhancing Photorealism Enhancement
Intel ILS: Enhancing Photorealism Enhancement
ย 
Tissue Segmentation Methods Using 2D Histogram Matching in a Sequence of MR B...
Tissue Segmentation Methods Using 2D Histogram Matching in a Sequence of MR B...Tissue Segmentation Methods Using 2D Histogram Matching in a Sequence of MR B...
Tissue Segmentation Methods Using 2D Histogram Matching in a Sequence of MR B...
ย 
Background Estimation Using Principal Component Analysis Based on Limited Mem...
Background Estimation Using Principal Component Analysis Based on Limited Mem...Background Estimation Using Principal Component Analysis Based on Limited Mem...
Background Estimation Using Principal Component Analysis Based on Limited Mem...
ย 
Enhanced target tracking based on mean shift algorithm for satellite imagery
Enhanced target tracking based on mean shift algorithm for satellite imageryEnhanced target tracking based on mean shift algorithm for satellite imagery
Enhanced target tracking based on mean shift algorithm for satellite imagery
ย 
NMS and Thresholding Architecture used for FPGA based Canny Edge Detector for...
NMS and Thresholding Architecture used for FPGA based Canny Edge Detector for...NMS and Thresholding Architecture used for FPGA based Canny Edge Detector for...
NMS and Thresholding Architecture used for FPGA based Canny Edge Detector for...
ย 
Google Research Siggraph Whitepaper | Total Relighting: Learning to Relight P...
Google Research Siggraph Whitepaper | Total Relighting: Learning to Relight P...Google Research Siggraph Whitepaper | Total Relighting: Learning to Relight P...
Google Research Siggraph Whitepaper | Total Relighting: Learning to Relight P...
ย 
Enhanced target tracking based on mean shift
Enhanced target tracking based on mean shiftEnhanced target tracking based on mean shift
Enhanced target tracking based on mean shift
ย 
Fast Full Search for Block Matching Algorithms
Fast Full Search for Block Matching AlgorithmsFast Full Search for Block Matching Algorithms
Fast Full Search for Block Matching Algorithms
ย 
EFFICIENT IMAGE COMPRESSION USING LAPLACIAN PYRAMIDAL FILTERS FOR EDGE IMAGES
EFFICIENT IMAGE COMPRESSION USING LAPLACIAN PYRAMIDAL FILTERS FOR EDGE IMAGESEFFICIENT IMAGE COMPRESSION USING LAPLACIAN PYRAMIDAL FILTERS FOR EDGE IMAGES
EFFICIENT IMAGE COMPRESSION USING LAPLACIAN PYRAMIDAL FILTERS FOR EDGE IMAGES
ย 
Be36338341
Be36338341Be36338341
Be36338341
ย 
Oc2423022305
Oc2423022305Oc2423022305
Oc2423022305
ย 

Similar to A NOVEL GRAPH REPRESENTATION FOR SKELETON-BASED ACTION RECOGNITION

A Novel Graph Representation for Skeleton-based Action Recognition
A Novel Graph Representation for Skeleton-based Action RecognitionA Novel Graph Representation for Skeleton-based Action Recognition
A Novel Graph Representation for Skeleton-based Action Recognitionsipij
ย 
Learning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RLLearning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RLlauratoni4
ย 
5438-Article Text-8663-1-10-20200511.pdf
5438-Article Text-8663-1-10-20200511.pdf5438-Article Text-8663-1-10-20200511.pdf
5438-Article Text-8663-1-10-20200511.pdfTadiyosHailemichael
ย 
Laplacian-regularized Graph Bandits
Laplacian-regularized Graph BanditsLaplacian-regularized Graph Bandits
Laplacian-regularized Graph Banditslauratoni4
ย 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Networkgerogepatton
ย 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Networkgerogepatton
ย 
Object gripping algorithm for robotic assistance by means of deep leaning
Object gripping algorithm for robotic assistance by means of deep leaning Object gripping algorithm for robotic assistance by means of deep leaning
Object gripping algorithm for robotic assistance by means of deep leaning IJECEIAES
ย 
Low complexity features for jpeg steganalysis using undecimated dct
Low complexity features for jpeg steganalysis using undecimated dctLow complexity features for jpeg steganalysis using undecimated dct
Low complexity features for jpeg steganalysis using undecimated dctPvrtechnologies Nellore
ย 
A Blind Steganalysis on JPEG Gray Level Image Based on Statistical Features a...
A Blind Steganalysis on JPEG Gray Level Image Based on Statistical Features a...A Blind Steganalysis on JPEG Gray Level Image Based on Statistical Features a...
A Blind Steganalysis on JPEG Gray Level Image Based on Statistical Features a...IJERD Editor
ย 
A Comparative Case Study on Compression Algorithm for Remote Sensing Images
A Comparative Case Study on Compression Algorithm for Remote Sensing ImagesA Comparative Case Study on Compression Algorithm for Remote Sensing Images
A Comparative Case Study on Compression Algorithm for Remote Sensing ImagesDR.P.S.JAGADEESH KUMAR
ย 
A Review Paper on Real-Time Hand Motion Capture
A Review Paper on Real-Time Hand Motion CaptureA Review Paper on Real-Time Hand Motion Capture
A Review Paper on Real-Time Hand Motion CaptureIRJET Journal
ย 
FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020RohanLekhwani
ย 
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...acijjournal
ย 
Intra Block and Inter Block Neighboring Joint Density Based Approach for Jpeg...
Intra Block and Inter Block Neighboring Joint Density Based Approach for Jpeg...Intra Block and Inter Block Neighboring Joint Density Based Approach for Jpeg...
Intra Block and Inter Block Neighboring Joint Density Based Approach for Jpeg...ijsc
ย 
INTRA BLOCK AND INTER BLOCK NEIGHBORING JOINT DENSITY BASED APPROACH FOR JPEG...
INTRA BLOCK AND INTER BLOCK NEIGHBORING JOINT DENSITY BASED APPROACH FOR JPEG...INTRA BLOCK AND INTER BLOCK NEIGHBORING JOINT DENSITY BASED APPROACH FOR JPEG...
INTRA BLOCK AND INTER BLOCK NEIGHBORING JOINT DENSITY BASED APPROACH FOR JPEG...ijsc
ย 
Clustering Algorithms for Data Stream
Clustering Algorithms for Data StreamClustering Algorithms for Data Stream
Clustering Algorithms for Data StreamIRJET Journal
ย 
Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detec...
Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detec...Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detec...
Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detec...IRJET Journal
ย 
An Effect of Compressive Sensing on Image Steganalysis
An Effect of Compressive Sensing on Image SteganalysisAn Effect of Compressive Sensing on Image Steganalysis
An Effect of Compressive Sensing on Image SteganalysisIRJET Journal
ย 
Node classification with graph neural network based centrality measures and f...
Node classification with graph neural network based centrality measures and f...Node classification with graph neural network based centrality measures and f...
Node classification with graph neural network based centrality measures and f...IJECEIAES
ย 
Colloquium.pptx
Colloquium.pptxColloquium.pptx
Colloquium.pptxMythili680896
ย 

Similar to A NOVEL GRAPH REPRESENTATION FOR SKELETON-BASED ACTION RECOGNITION (20)

A Novel Graph Representation for Skeleton-based Action Recognition
A Novel Graph Representation for Skeleton-based Action RecognitionA Novel Graph Representation for Skeleton-based Action Recognition
A Novel Graph Representation for Skeleton-based Action Recognition
ย 
Learning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RLLearning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RL
ย 
5438-Article Text-8663-1-10-20200511.pdf
5438-Article Text-8663-1-10-20200511.pdf5438-Article Text-8663-1-10-20200511.pdf
5438-Article Text-8663-1-10-20200511.pdf
ย 
Laplacian-regularized Graph Bandits
Laplacian-regularized Graph BanditsLaplacian-regularized Graph Bandits
Laplacian-regularized Graph Bandits
ย 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
ย 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
ย 
Object gripping algorithm for robotic assistance by means of deep leaning
Object gripping algorithm for robotic assistance by means of deep leaning Object gripping algorithm for robotic assistance by means of deep leaning
Object gripping algorithm for robotic assistance by means of deep leaning
ย 
Low complexity features for jpeg steganalysis using undecimated dct
Low complexity features for jpeg steganalysis using undecimated dctLow complexity features for jpeg steganalysis using undecimated dct
Low complexity features for jpeg steganalysis using undecimated dct
ย 
A Blind Steganalysis on JPEG Gray Level Image Based on Statistical Features a...
A Blind Steganalysis on JPEG Gray Level Image Based on Statistical Features a...A Blind Steganalysis on JPEG Gray Level Image Based on Statistical Features a...
A Blind Steganalysis on JPEG Gray Level Image Based on Statistical Features a...
ย 
A Comparative Case Study on Compression Algorithm for Remote Sensing Images
A Comparative Case Study on Compression Algorithm for Remote Sensing ImagesA Comparative Case Study on Compression Algorithm for Remote Sensing Images
A Comparative Case Study on Compression Algorithm for Remote Sensing Images
ย 
A Review Paper on Real-Time Hand Motion Capture
A Review Paper on Real-Time Hand Motion CaptureA Review Paper on Real-Time Hand Motion Capture
A Review Paper on Real-Time Hand Motion Capture
ย 
FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020
ย 
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
ย 
Intra Block and Inter Block Neighboring Joint Density Based Approach for Jpeg...
Intra Block and Inter Block Neighboring Joint Density Based Approach for Jpeg...Intra Block and Inter Block Neighboring Joint Density Based Approach for Jpeg...
Intra Block and Inter Block Neighboring Joint Density Based Approach for Jpeg...
ย 
INTRA BLOCK AND INTER BLOCK NEIGHBORING JOINT DENSITY BASED APPROACH FOR JPEG...
INTRA BLOCK AND INTER BLOCK NEIGHBORING JOINT DENSITY BASED APPROACH FOR JPEG...INTRA BLOCK AND INTER BLOCK NEIGHBORING JOINT DENSITY BASED APPROACH FOR JPEG...
INTRA BLOCK AND INTER BLOCK NEIGHBORING JOINT DENSITY BASED APPROACH FOR JPEG...
ย 
Clustering Algorithms for Data Stream
Clustering Algorithms for Data StreamClustering Algorithms for Data Stream
Clustering Algorithms for Data Stream
ย 
Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detec...
Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detec...Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detec...
Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detec...
ย 
An Effect of Compressive Sensing on Image Steganalysis
An Effect of Compressive Sensing on Image SteganalysisAn Effect of Compressive Sensing on Image Steganalysis
An Effect of Compressive Sensing on Image Steganalysis
ย 
Node classification with graph neural network based centrality measures and f...
Node classification with graph neural network based centrality measures and f...Node classification with graph neural network based centrality measures and f...
Node classification with graph neural network based centrality measures and f...
ย 
Colloquium.pptx
Colloquium.pptxColloquium.pptx
Colloquium.pptx
ย 

Recently uploaded

VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
ย 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
ย 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
ย 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
ย 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
ย 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
ย 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
ย 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
ย 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
ย 
Model Call Girl in Narela Delhi reach out to us at ๐Ÿ”8264348440๐Ÿ”
Model Call Girl in Narela Delhi reach out to us at ๐Ÿ”8264348440๐Ÿ”Model Call Girl in Narela Delhi reach out to us at ๐Ÿ”8264348440๐Ÿ”
Model Call Girl in Narela Delhi reach out to us at ๐Ÿ”8264348440๐Ÿ”soniya singh
ย 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
ย 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
ย 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
ย 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
ย 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
ย 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
ย 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
ย 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
ย 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
ย 

Recently uploaded (20)

VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
ย 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
ย 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
ย 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
ย 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
ย 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
ย 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
ย 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
ย 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
ย 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
ย 
Model Call Girl in Narela Delhi reach out to us at ๐Ÿ”8264348440๐Ÿ”
Model Call Girl in Narela Delhi reach out to us at ๐Ÿ”8264348440๐Ÿ”Model Call Girl in Narela Delhi reach out to us at ๐Ÿ”8264348440๐Ÿ”
Model Call Girl in Narela Delhi reach out to us at ๐Ÿ”8264348440๐Ÿ”
ย 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
ย 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
ย 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
ย 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
ย 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
ย 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
ย 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
ย 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
ย 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
ย 

A NOVEL GRAPH REPRESENTATION FOR SKELETON-BASED ACTION RECOGNITION

  • 1. Signal & Image Processing: An International Journal (SIPIJ) Vol.11, No.6, December 2020 DOI: 10.5121/sipij.2020.11605 65 A NOVEL GRAPH REPRESENTATION FOR SKELETON-BASED ACTION RECOGNITION Tingwei Li1 , Ruiwen Zhang2 and Qing Li3 1,3 Department of Automation Tsinghua University, Beijing, China 2 Department of Computer Science Tsinghua University, Beijing, China ABSTRACT Graph convolutional networks (GCNs) have been proven to be effective for processing structured data, so that it can effectively capture the features of related nodes and improve the performance of model. More attention is paid to employing GCN in Skeleton-Based action recognition. But there are some challenges with the existing methods based on GCNs. First, the consistency of temporal and spatial features is ignored due to extracting features node by node and frame by frame. We design a generic representation of skeleton sequences for action recognition and propose a novel model called Temporal Graph Networks (TGN), which can obtain spatiotemporal features simultaneously. Secondly, the adjacency matrix of graph describing the relation of joints are mostly depended on the physical connection between joints. We propose a multi-scale graph strategy to appropriately describe the relations between joints in skeleton graph, which adopts a full-scale graph, part-scale graph and core-scale graph to capture the local features of each joint and the contour features of important joints. Extensive experiments are conducted on two large datasets including NTU RGB+D and Kinetics Skeleton. And the experiments results show that TGN with our graph strategy outperforms other state-of-the-art methods. KEYWORDS Skeleton-based action recognition, Graph convolutional network, Multi-scale graphs 1. INTRODUCTION Human action recognition is a meaningful and challenging task. It has widespread potential applications, including health care, human-computer interaction and autonomous driving. At present, skeleton data is more often used for action recognition because skeleton data is robust to the noise of background and different view points compared to video data. Skeleton-based action recognition are mainly based on deep learning methods like Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs) and GCNs [3, 8, 10,12, 13, 15, 17, 18]. RNNs and CNNs generally process the skeleton data into vector sequence and image respectively. These representing methods cannot fully express the dependencies between correlated joints. With more researches on GCN, [12] first employs GCN in skeleton-based action recognition and inspires a lot of new researches [7, 10, 15, 17, 18]. The key of action recognition based on GCN is to obtain the temporal and spatial features of an action sequence through graph [7, 10, 18]. In the skeleton graph, skeleton joints transfer into node and the relations between joints are represented by edges. As shown in Fig. 1, in most previous work, there are more than one graphs, a node only contains spatial features. In this case, GCN extracts spatial features frame by frame, then Temporal Convolutional Network (TCN) extractstemporal features node by node. But, features of a joint in an action is not only related to other joints intra frames but also joints inter frames. As a result, existing methods split this consistency of spatiotemporal features. To solve this problem, we propose TGN to capture
spatiotemporal features simultaneously. As shown in Fig. 2, each node is composed of one joint across all frames and therefore contains both spatial and temporal features, so TGN obtains spatiotemporal features by processing all frames of each joint at once.

Besides, the edges of the skeleton graph mainly depend on the adjacency matrix A, which is determined by the physical connections between joints [12, 17, 18]. Existing GCNs lack an effective adaptive graph mechanism for establishing global connections beyond the physical relations of nodes, such as the relation between head and toes; they can only capture local relations, such as that between head and neck. In this paper, a multi-scale graph strategy is proposed that adopts graphs of different scales in different network branches. As a result, information about physically unconnected joints is added, helping the network capture the local features of each joint and the contour features of important joints.

The main contributions of this paper are summarized in three aspects:

(1) The Temporal Graph Network (TGN) proposed in this paper is a feature extractor that obtains spatiotemporal features simultaneously. Moreover, this extractor can be plugged into most GCN-based skeleton action recognition models and helps improve their performance.

(2) A multi-scale graph strategy is proposed to optimize the graph, extracting spatial features at different scales. This strategy captures not only local features but also global features.

(3) The Multi-scale Temporal Graph Network (MS-TGN) is proposed based on TGN and the multi-scale graph strategy. Extensive experiments are conducted on two datasets, NTU RGB+D and Kinetics Skeleton, and MS-TGN outperforms state-of-the-art methods on both for skeleton-based action recognition.

2. RELATED WORKS

2.1. Skeleton-Based Action Recognition

Conventional methods usually model human skeleton data for action recognition with handcrafted features, but their performance is barely satisfactory. With the development of deep learning, neural networks, including RNNs and CNNs, have become the dominant approach. RNN-based methods, such as LSTM and GRU, typically model the skeleton data as a sequence of coordinate vectors, each representing one human body joint [3, 9, 11, 13]; the 3D coordinates of all joints in a frame are concatenated in a fixed order to form the input vector. CNN-based methods model the skeleton data as an image in which each joint is treated as a pixel [2, 7, 8]; some works transform a skeleton sequence into an image by treating the joint coordinates (x, y, z) as the R, G and B channels of a pixel. Because skeleton data are naturally embedded in graphs rather than in sequences or grids, neither RNNs nor CNNs can fully represent their structure. Since ST-GCN [12] was proposed in 2018, a series of skeleton-based action recognition methods have been built on GCN, which has been proven effective for processing structured data.
2.2. Graph Convolutional Networks

ST-GCN [12] introduces GCN into skeleton-based action recognition and constructs a spatiotemporal graph from the natural connections of human joints; it designs three different adjacency matrices to extract spatial and temporal features and achieves better performance than previous methods. 2s-AGCN [18] proposes a two-stream method with an adaptive adjacency matrix that not only encodes the structure of the human body but can also be learned through an attention mechanism. DGNN [28] designs a directed graph structure that learns the information between joints and bones in a GCN. SGN [29] explicitly introduces the high-level semantics of joints, including joint type and frame index, into the network to enhance its feature representation capability. These approaches treat each joint as a node of the graph, with the edges denoting joint relationships pre-defined from prior knowledge. However, the spatial and temporal features of actions are learned separately by GCN and TCN, which makes the networks less efficient.

3. METHODOLOGY

3.1. Temporal Graph Networks

The skeleton data of an action can be described as $X = \{x_{c,v,t}\} \in \mathbb{R}^{C \times V \times T}$, where $T$ is the number of frames, $V$ is the number of joints in a frame, $C$ is the number of channels per joint, and $x_{c,v,t}$ is the skeleton data of joint $v$ in frame $t$; $X_i$ denotes sequence $i$. Previous methods construct $T$ graphs, each with $V$ nodes, as shown in Fig. 1, where the node set $N = \{x_{v,t} \mid v = 1, 2, \ldots, V;\ t = 1, 2, \ldots, T\}$ contains joint $v$ in the $t$-th frame. One frame is represented as one graph, so there are $T$ graphs in total, and the feature size of a node is $C$. GCN is used to obtain spatial features from each graph, and the outputs of GCN are then fed into TCN to extract temporal features.

Figure 1. A node in any graph represents the data of one joint in one frame. GCN extracts spatial features frame by frame, and TCN extracts temporal features node by node.

We redefine the graph and propose TGN. Instead of $T$ graphs, we construct a single graph with $V$ nodes, as shown in Fig. 2. In this graph, the node set $N = \{x_v \mid v = 1, 2, \ldots, V\}$ contains joint $v$ across all frames, so the feature size of a node is $C \times T$. Compared with methods that apply GCN and TCN alternately, we use only one GCN block to extract spatiotemporal features.

Figure 2. Each node represents the data of one joint across all frames, so temporal information is also contained. Compared with Fig. 1, TGN extracts temporal and spatial features simultaneously.
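To make the representation concrete, the following PyTorch-style sketch (our own illustration, not code from the paper; all tensor names are hypothetical) contrasts the two constructions: the conventional one keeps T per-frame graphs with C-dimensional node features, while the TGN view folds time into the node feature so that a single graph with V nodes carries C × T features per node.

```python
import torch

# A toy skeleton sequence: C channels, T frames, V joints
# (batch dimension omitted; shapes are illustrative only).
C, T, V = 3, 300, 25
x = torch.randn(C, T, V)  # x[c, t, v]: channel c of joint v in frame t

# Conventional GCN+TCN view: T graphs, each with V nodes of feature size C.
# Spatial convolution then runs frame by frame over these T graphs.
frames = x.permute(1, 2, 0)  # (T, V, C)

# TGN view: one graph with V nodes; each node holds the whole trajectory
# of one joint, i.e. a feature of size C * T.
tgn_nodes = x.permute(2, 0, 1).reshape(V, C * T)  # (V, C*T)

print(frames.shape, tgn_nodes.shape)
# torch.Size([300, 25, 3]) torch.Size([25, 900])
```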
In a basic TGN block, each node already carries both temporal and spatial information, so temporal and spatial features can be computed simultaneously during graph convolution. The output value for a single channel at node $p_v$ can be written as Eq. (2):

$S_v = \{\, n_j \mid n_j \in S(p_v, h) \,\}$  (1)

$F_o(p_v) = \sum_{j=0}^{k} F_i(S_v(j)) \cdot w(j)$  (2)

where $p_v$ is node $v$, $S(p_v, h)$ is a sampling function that finds the set of nodes $n_j$ adjacent to $p_v$, $F_i$ maps nodes to feature vectors, $w(j)$ denotes the weights of a CNN whose kernel size is $1 \times t$, and $F_o(p_v)$ is the output at $p_v$. Eq. (2) is a general formula shared by most GCN-based action recognition models, where it is used to extract spatial features from a graph, so our method can be adapted to existing methods by changing their graph structure.

3.2. Multi-Scale Graph Strategy

Figure 3. (a) Full-scale graph, (b) part-scale graph, (c) core-scale graph.

Dilated convolution can recover otherwise ignored features, such as relations between unconnected points in an image, by spacing out the kernel's sampling points. Inspired by this, we select joints of different expressiveness to form graphs of different scales for convolution. The temporal features of joints with a larger motion space are more expressive, and the joints of a body generally differ in their relative motion space; for example, the elbow and knee can move in a larger space than surrounding joints such as the shoulder and hip. A small-scale graph has fewer but more expressive nodes and thus a larger receptive field; a large-scale graph has more but less expressive nodes and a smaller receptive field.

We design three graphs of different scales based on the NTU RGB+D dataset, as shown in Fig. 3. The full-scale graph in Fig. 3(a) has all 25 nodes and, with its small receptive field, captures the local features of each joint. The part-scale graph in Fig. 3(b) keeps only 11 nodes; its larger receptive field tends to capture contour information. Fig. 3(c) shows the core-scale graph with only seven nodes. It has the largest receptive field: although it ignores the internal state of the limbs, it connects the left and right limbs directly, so global information can be obtained. Through these graphs of different scales, GCN can capture local, contour and global features respectively.
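As a rough illustration of Eq. (2) (our own sketch, not the authors' code; the module and variable names are hypothetical), a TGN block can be implemented as a normalized-adjacency aggregation over the V nodes fused with a 1 × t convolution along the temporal axis, so one block mixes spatial neighbours and temporal context in a single pass:

```python
import torch
import torch.nn as nn

class TGNBlock(nn.Module):
    """One TGN layer: spatial aggregation over a fixed graph (Eqs. 1-2)
    fused with a 1 x t temporal convolution. Input x: (N, C, T, V)."""
    def __init__(self, in_channels, out_channels, A, t_kernel=3):
        super().__init__()
        # A: (V, V) adjacency defining the sampling function S(p_v, h);
        # row-normalized so neighbour contributions are averaged.
        self.register_buffer("A", A / A.sum(dim=1, keepdim=True).clamp(min=1))
        # w(j) realized as a Conv2d kernel of size (t, 1): it slides over
        # time within each node's trajectory (the paper's 1 x t kernel).
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=(t_kernel, 1),
                              padding=(t_kernel // 2, 0))
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Sum F_i(S_v(j)) over neighbours j (the aggregation in Eq. 2).
        x = torch.einsum("nctv,vw->nctw", x, self.A)
        # Apply the temporal kernel to each node's full trajectory.
        return self.relu(self.bn(self.conv(x)))

# Usage: 25 NTU joints, 3 input channels, 300 frames.
A = torch.eye(25)  # placeholder adjacency; the real graphs follow Fig. 3
block = TGNBlock(3, 64, A)
print(block(torch.randn(8, 3, 300, 25)).shape)  # torch.Size([8, 64, 300, 25])
```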
Building on the above, we combine TGN with the multi-scale graph strategy to obtain the final model, MS-TGN, as shown in Fig. 4. The model consists of 10 TGN layers, each using a 3 × 1 convolutional kernel to extract the temporal features of every node, followed by a fully-connected layer that classifies from the extracted features.

Figure 4. The network architecture of MS-TGN. The multi-scale graphs are represented by the different adjacency matrices A_1, A_2 and A_3.

4. EXPERIMENT

4.1. Datasets and Implementation Details

NTU RGB+D. This large and widely used dataset contains 3D skeleton data collected with the Microsoft Kinect v2 [16]. It covers 60 action classes and 56,000 action sequences performed by 40 subjects, captured by three cameras fixed at −45°, 0° and 45°. Each sequence spans several frames, and each frame consists of 25 joints. We follow [16] for the cross-view (CV) and cross-subject (CS) experiments. In the cross-view setting, the training data comprises 37,920 sequences captured at 0° and 45°, and the test data comprises 18,960 sequences captured at −45°. In the cross-subject setting, the training data consists of sequences performed by 20 subjects, and the test data contains 16,560 sequences performed by the remaining subjects. We use top-1 accuracy for evaluation.

Kinetics Skeleton. Kinetics [6] is a video action recognition dataset collected from YouTube videos. Kinetics Skeleton applies the OpenPose [23] estimation toolbox to detect 18 joints per skeleton. It contains 400 action classes and 260,232 sequences; the training data consists of 240,436 sequences and the test data of the remaining 19,796. We use top-1 and top-5 accuracies for evaluation.

Unless otherwise stated, all proposed models use the following settings. The number of channels is 64 in the first four layers, 128 in the middle three layers and 256 in the last three layers. An SGD optimizer with Nesterov accelerated gradient is used, with learning-rate schedules designed separately for each dataset. The mini-batch size is 32 and the momentum is set to 0.9. All skeleton sequences are padded to T = 300 frames by replaying the action, and inputs are normalized and translated as in [18].
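A minimal sketch of how this configuration might be assembled (our own reading of the text, reusing the hypothetical TGNBlock above; the paper's exact layer wiring, pooling and learning-rate schedule are not specified here):

```python
import torch
import torch.nn as nn

class MSTGN(nn.Module):
    """10 TGN layers with the stated channel plan (4x64, 3x128, 3x256),
    one branch per scale graph, and a fully-connected classifier."""
    def __init__(self, graphs, num_classes=60, in_channels=3):
        super().__init__()
        channels = [64] * 4 + [128] * 3 + [256] * 3
        self.branches = nn.ModuleList()
        for A in graphs:  # A1, A2, A3: full-, part- and core-scale adjacency
            layers, c_in = [], in_channels
            for c_out in channels:
                layers.append(TGNBlock(c_in, c_out, A, t_kernel=3))
                c_in = c_out
            self.branches.append(nn.Sequential(*layers))
        self.fc = nn.Linear(256 * len(graphs), num_classes)

    def forward(self, x):  # x: (N, C, T, V)
        # Global average pooling over time and joints, then fuse branches.
        feats = [branch(x).mean(dim=(2, 3)) for branch in self.branches]
        return self.fc(torch.cat(feats, dim=1))

# Placeholder graphs: in the paper the part- and core-scale graphs have
# 11 and 7 nodes and would first require selecting joint subsets.
model = MSTGN([torch.eye(25)] * 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,  # lr illustrative
                            momentum=0.9, nesterov=True)
```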
4.2. Ablation Experiments

4.2.1. Feature Extractor: TGN

ST-GCN [12] and 2s-AGCN [18], two representative models that apply GCN and TCN alternately, are chosen as baselines. We replace their GCN & TCN modules with TGN while keeping their adjacency matrix construction strategies unchanged. The experimental results on the two datasets are listed in Table 1 and Table 2. From Table 1, the performance of ST-GCN rises to 82.3% and 90.8% on X-sub and X-view respectively, and the accuracy of 2s-AGCN increases by 0.4% on average. From Table 2, the accuracies of both models also improve. In conclusion, TGN is flexible enough to serve as a drop-in feature extractor and performs better than the GCN & TCN combination.

Table 1. Effectiveness of our TGN module on the NTU RGB+D dataset in terms of accuracy.

Model          TGN   X-sub(%)   X-view(%)
ST-GCN [12]          81.6       88.8
ST-GCN [12]    √     82.3       90.8
2s-AGCN [18]         88.5       95.1
2s-AGCN [18]   √     89.0       95.4
Js-Ours              86.0       93.7
Js-Ours        √     86.6       94.1
Bs-Ours              86.9       93.2
Bs-Ours        √     87.5       93.9

Table 2. Effectiveness of our TGN module on the Kinetics dataset in terms of accuracy.

Model          TGN   Top-1(%)   Top-5(%)
ST-GCN [12]          30.7       52.8
ST-GCN [12]    √     31.5       54.0
2s-AGCN [18]         36.1       58.7
2s-AGCN [18]   √     36.7       59.5
Js-Ours              35.0       93.7
Js-Ours        √     35.2       94.1
Bs-Ours              33.0       55.7
Bs-Ours        √     33.3       56.2

4.2.2. Multi-Scale Graph Strategy

Building on Section 4.2.1, we construct the full-scale graph containing all joints of the original data and a part-scale graph containing 11 joints on the NTU RGB+D dataset; on the Kinetics Skeleton dataset, the full-scale graph likewise contains all nodes and the part-scale graph contains 11 joints. We evaluate each scale graph, and Table 3 and Table 4 show the experimental results on the two datasets. In detail, adding a part-scale graph increases accuracy by 1.5% and 0.45% respectively, and adding a core-scale graph increases accuracy by a further 0.3% and 0.2% respectively. The core-scale graph provides global features of the whole body, the part-scale graph provides contour features of body parts, and the full-scale graph provides local features of each joint. Through feature fusion, the model obtains richer information and performs better.

Table 3. Effectiveness of the multi-scale graph on the NTU RGB+D dataset in terms of accuracy.

Full-Scale   Part-Scale   Core-Scale   X-sub(%)   X-view(%)
√                                      89.0       95.2
             √                         86.0       94.0
                          √            85.6       93.3
√            √                         89.2       95.7
√            √            √            89.5       95.9

Table 4. Effectiveness of the multi-scale graph on the Kinetics dataset in terms of accuracy.

Full-Scale   Part-Scale   Core-Scale   Top-1(%)   Top-5(%)
√                                      36.6       59.5
             √                         35.0       55.9
                          √            33.8       54.6
√            √                         36.9       59.9
√            √            √            37.3       60.2
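One plausible way to build the reduced-scale adjacency matrices (our own illustration; the paper does not list the exact joint subsets or edges, so the indices below are hypothetical) is to pick the expressive joints and connect them where body parts meet:

```python
import torch

def build_scale_graph(edges, num_nodes):
    """Symmetric adjacency with self-loops for one scale graph."""
    A = torch.eye(num_nodes)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A

# Hypothetical 11-node part-scale graph: indices into a reduced joint list
# (the mapping from the 25 NTU joints to these 11 nodes is an assumption).
part_edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6),
              (0, 7), (7, 8), (7, 9), (8, 10)]
A_part = build_scale_graph(part_edges, num_nodes=11)
print(A_part.shape)  # torch.Size([11, 11])
```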
4.3. Comparison with State-of-the-Art Methods

We compare MS-TGN with state-of-the-art skeleton-based action recognition methods on both the NTU RGB+D and Kinetics Skeleton datasets. The NTU RGB+D results are shown in Table 5: our model performs best in the cross-view experiment on NTU RGB+D, and it also achieves the highest top-1 accuracy on the Kinetics Skeleton dataset, as listed in Table 6.

Table 5. Performance comparisons on the NTU RGB+D dataset with the CS and CV settings.

Method           Year   X-sub(%)   X-view(%)
HBRNN-L [7]      2015   59.1       64.0
PA LSTM [16]     2016   62.9       70.3
STA-LSTM [21]    2017   73.4       81.2
GCA-LSTM [22]    2017   74.4       82.8
ST-GCN [12]      2018   81.5       88.3
DPRL+GCNN [25]   2018   83.5       89.8
SR-TSL [19]      2018   84.8       92.4
AS-GCN [10]      2019   86.8       94.2
2s-AGCN [18]     2019   88.5       95.1
VA-CNN [26]      2019   88.7       94.3
SGN [27]         2020   89.0       94.5
MS-TGN (ours)    -      89.5       95.9

Table 6. Performance comparisons on the Kinetics dataset with SOTA methods.

Method           Year   Top-1(%)   Top-5(%)
PA LSTM [16]     2016   16.4       35.3
TCN [8]          2017   20.3       40.0
ST-GCN [12]      2018   30.7       52.8
AS-GCN [10]      2019   34.8       56.6
2s-AGCN [18]     2019   36.1       58.7
NAS [15]         2020   37.1       60.1
MS-TGN (ours)    -      37.3       60.2

Besides, our model reduces computation and parameter count, as listed in Table 7, which indicates that it is simpler yet better at modelling spatial and temporal features.

Table 7. Comparison of computational cost with the state of the art. The #Params and FLOPs are calculated with THOP (PyTorch-OpCounter) [24].

Method           Year   X-sub(%)   #Params(M)   #FLOPs(G)
ST-GCN [12]      2018   81.6       3.1          15.2
AS-GCN [10]      2019   86.8       4.3          17.1
2s-AGCN [18]     2019   88.5       3.5          17.4
NAS [15]         2020   89.4       6.6          36.6
MS-TGN (ours)    -      89.5       3.0          15.0

Why does our model perform better with fewer parameters and less computation? In our designed graph, each node carries all frame data of one joint, which brings two advantages: (1) the TGN extractor contains only GCN without TCN, which reduces parameters and computation; (2) instead of extracting features alternately, TGN extracts spatial and temporal features at the same time, which strengthens their consistency.
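For reference, the Table 7 measurements with THOP look roughly like the following (a sketch reusing the hypothetical MSTGN model above; the authors' exact measurement script is not given):

```python
import torch
from thop import profile  # pip install thop (PyTorch-OpCounter) [24]

# Profile one forward pass on a single NTU-style input: C=3, T=300, V=25.
dummy = torch.randn(1, 3, 300, 25)
macs, params = profile(model, inputs=(dummy,))
# THOP reports multiply-accumulate operations (MACs), commonly quoted as FLOPs.
print(f"#Params: {params / 1e6:.1f}M  FLOPs: {macs / 1e9:.1f}G")
```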
5. CONCLUSIONS

The MS-TGN model proposed in this paper contains two main innovations: TGN, which extracts temporal and spatial features at the same time, and the multi-scale graph strategy, which obtains local and contour features simultaneously. TGN introduces a novel representation of skeleton sequences for action recognition that yields spatiotemporal features in a single pass, and the multi-scale graph strategy additionally extracts global spatial features. On two public datasets, the proposed MS-TGN achieves state-of-the-art accuracy with the fewest parameters and the least computation.

ACKNOWLEDGEMENT

This work is sponsored by the National Natural Science Foundation of China No. 61771281, the “New Generation Artificial Intelligence” major project of China No. 2018AAA0101605, the 2018 Industrial Internet Innovation and Development Project, and the Tsinghua University Initiative Scientific Research Program.

REFERENCES

[1] Bruna, Joan, et al. “Spectral Networks and Locally Connected Networks on Graphs.” International Conference on Learning Representations (ICLR), 2014.
[2] Du, Yong, et al. “Skeleton Based Action Recognition with Convolutional Neural Network.” 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 2015, pp. 579–583.
[3] Du, Yong, et al. “Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition.” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1110–1118.
[4] Feichtenhofer, Christoph, et al. “Convolutional Two-Stream Network Fusion for Video Action Recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1933–1941.
[5] Huang, Zhiwu, et al. “Deep Learning on Lie Groups for Skeleton-Based Action Recognition.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1243–1252.
[6] Kay, Will, et al. “The Kinetics Human Action Video Dataset.” ArXiv Preprint ArXiv:1705.06950, 2017.
[7] Ke, Qiuhong, et al. “A New Representation of Skeleton Sequences for 3D Action Recognition.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4570–4579.
[8] Kim, Tae Soo, and Austin Reiter. “Interpretable 3D Human Action Analysis with Temporal Convolutional Networks.” 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1623–1631.
[9] Li, Chao, et al. “Co-Occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation.” IJCAI’18 Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 786–792.
[10] Li, Maosen, et al. “Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition.” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3595–3603.
[11] Li, Shuai, et al. “Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN.” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 5457–5466.
[12] Yan, Sijie, et al. “Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition.” AAAI, 2018, pp. 7444–7452.
[13] Martinez, Julieta, et al. “On Human Motion Prediction Using Recurrent Neural Networks.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4674–4683.
[14] Ofli, Ferda, et al. “Sequence of the Most Informative Joints (SMIJ): A New Representation for Human Skeletal Action Recognition.” 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2012, pp. 8–13.
[15] Peng, Wei, et al. “Learning Graph Convolutional Network for Skeleton-Based Human Action Recognition by Neural Searching.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 3, 2020, pp. 2669–2676.
[16] Shahroudy, Amir, et al. “NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1010–1019.
[17] Shi, Lei, et al. “Skeleton-Based Action Recognition With Directed Graph Neural Networks.” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7912–7921.
[18] Shi, Lei, et al. “Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition.” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 12026–12035.
[19] Si, Chenyang, et al. “Skeleton-Based Action Recognition with Spatial Reasoning and Temporal Stack Learning.” Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 106–121.
[20] Si, Chenyang, et al. “An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition.” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1227–1236.
[21] Liu, Jun, et al. “Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition.” European Conference on Computer Vision, 2016, pp. 816–833.
[22] Song, Sijie, et al. “An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data.” AAAI’17 Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017, pp. 4263–4270.
[23] Cao, Zhe, et al. “OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields.” ArXiv Preprint ArXiv:1812.08008, 2018.
[24] Zhu, Ligeng. THOP: PyTorch-OpCounter. https://github.com/Lyken17/pytorch-OpCounter.
[25] Tang, Yansong, et al. “Deep Progressive Reinforcement Learning for Skeleton-Based Action Recognition.” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 5323–5332.
[26] Zhang, Pengfei, et al. “View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, 2019, pp. 1963–1978.
[27] Zhang, Pengfei, et al. “Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition.” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1112–1121.
[28] Shi, Lei, et al. “Skeleton-Based Action Recognition with Directed Graph Neural Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7912–7921.
[29] Zhang, Pengfei, et al. “Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1112–1121.