240429_Thanh_LabSeminar[TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification].pptx

TranSG: Transformer-Based
Skeleton Graph Prototype
Contrastive Learning
with Structure-Trajectory
Prompted Reconstruction for
Person Re-Identification
Tien-Bach-Thanh Do
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/04/29
Haocong Rao et al.
CVPR 2023

2
Introduction
● Person re-identification (re-ID) is a challenging task of retrieving and matching a specific person across
varying views
● Person re-ID via 3D skeletons has attracted attention in both academia and industry
● Skeleton-based person re-ID methods model unique body and motion representations with 3D
positions of key body joints
● Key contributions:
○ Present a generic TranSG paradigm to learn effective representations form skeleton graphs
for person re-ID
○ Devise a skeleton graph transformer (SGT) to fully capture relations in skeleton graphs and
integrate key correlative noed features into graph representations
○ Propose the graph prototype contrastive learning (GPC) to contrast and learn the most
representative graph features and identity-related semantics from both skeleton and sequence
levels for person re-ID
○ Devise the graph structure-trajectory prompted reconstruction (STPR) to exploit spatial-
temporal graph contexts for reconstruction, to capture more key graph semantics and unique
feature for person re-ID

3
Related Works
Person Re-identification Using Skeleton Data
● [13] 7 Euclidean distances between certain joints are computed to perform distance metric learning
● [12,7] 13 and 16 skeleton descriptors are utilized to identify different persons
● [8] CNN-based model PoseGait is devised to encode body-joint sequences and hand-crafted pose
features
● [6] proposes the AGE model to encode recognizable gait features from 3D skeleton sequences
● [14] proposed to encode prototypes and intra-sequence relations of masked skeleton sequences
● [18,19] perform multi-stage body-component relation learning based on multi-scale graphs
● [20] proposes a general skeleton feature re-ranking mechanism
[13] I. B. Barbosa, M. Cristani, A. Del Bue, L. Bazzani, and V. Murino, “Re-identification with RGB-D sensors,” in European Conference on Computer Vision (ECCV) Workshop, pp. 433–442, Springer, 2012.
[12] M. Munaro, A. Fossati, A. Basso, E. Menegatti, and L. Van Gool, “One-shot person re-identification with a consumer depth camera,” in Person Re-Identification, pp. 161– 181, Springer, 2014.
[7] P. Pala, L. Seidenari, S. Berretti, and A. Del Bimbo, “Enhanced skeleton and face 3D data for person re-identification from depth cameras,” Computers & Graphics, vol. 79, pp. 69–80, 2019.
[8] R. Liao, S. Yu, W. An, and Y. Huang, “A model-based gait recognition method with body pose and human prior knowledge,” Pattern Recognition, vol. 98, p. 107069, 2020.
[6] H. Rao, S. Wang, X. Hu, M. Tan, H. Da, J. Cheng, and B. Hu, “Self-supervised gait encoding with locality-aware attention for person re-identification,” in International Joint Conference on Artificial Intelligence (IJCAI), vol. 1, pp. 898–905, 2020.
[14] H. Rao and C. Miao, “SimMC: Simple masked contrastive learning of skeleton representations for unsupervised person re-identification,” in International Joint Conference on Artificial Intelligence (IJCAI), pp. 1290–1297, 2022.
[18] H. Rao, S. Xu, X. Hu, J. Cheng, and B. Hu, “Multi-level graph encoding with structural-collaborative relation learning for skeleton-based person re-identification,” in International Joint Conference on Artificial Intelligence (IJCAI), pp. 973–980,
2021.
[19] H. Rao, X. Hu, J. Cheng, and B. Hu, “SM-SGE: A self- supervised multi-scale skeleton graph encoding framework for person re-identification,” in Proceedings of the 29th ACM International Conference on Multimedia, pp. 1812–1820, 2021.
[20] H. Rao, Y. Li, and C. Miao, “Revisiting k-reciprocal distance re-ranking for skeleton-based person re-identification,” IEEE Signal Processing Letters, vol. 29, pp. 2103–2107, 2022.

4
Related Works
Contrastive Learning
● Aim to pull closer homogeneous or positive representation pairs while pushing farther negative pairs in a
certain feature space
● [21] an instance discrimination paradigm with exemplar tasks
● [27] approach based on the context auto-encoding and probabilistic contrastive loss to learn different-
domain representations
● [22] combines k-means clustering and CL
● [2,14] based on consecutive sequences or randomly masked sequences
[21] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3733–3742, 2018. 2
[27] A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[22] J. Li, P. Zhou, C. Xiong, and S. Hoi, “Prototypical contrastive learning of unsupervised representations,” in Inter- national Conference on Learning Representation (ICLR), 2021.
[2] H. Rao, S. Wang, X. Hu, M. Tan, Y. Guo, J. Cheng, X. Liu, and B. Hu, “A self-supervised gait encoding approach with locality-awareness for 3D skeleton based person re-identification,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, no. 01, pp. 1–1, 2021.
[14] H. Rao and C. Miao, “SimMC: Simple masked contrastive learning of skeleton representations for unsupervised person re-identification,” in International Joint Conference on Artificial Intelligence (IJCAI), pp. 1290–1297, 2022.

5
Related Works
Prompt Learning
● To provide additional knowledge, instruction, or context for the input of models => give more reliable
outputs for different tasks [28-36]
● [30] leverages language-based prompts to generalize the pre-trained visual representations to many
tasks
● [31] automatically model task-relevant prompt with continuous representations to improve the
downstream task performance
[28] F. Petroni, T. Rocktaschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller, “Language models as knowledge bases?,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2463–2473,
2019
[29] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning
(ICML), pp. 4904–4916, PMLR, 2021
[30] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning
(ICML), pp. 8748–8763, PMLR, 2021
[31] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022
[32] Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can we know what language models know?,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 423–438, 2020.
[33] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3045–3059, 2021
[34] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 4582–4597, 2021
[35] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neu- big, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” arXiv preprint arXiv:2107.13586, 2021
[36] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh, “Autoprompt: Eliciting knowledge from language models with automatically generated prompts,” in Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP), pp. 4222–4235, 2020

7
Methodology
The Proposed Approach
● 3D skeleton sequence X = (x1, … , xf) Rf*J*3, where xt ∈ RJ*3 denotes the tth skeleton with 3D coordinates
of J body joints
● Each skeleton sequence X corresponds to person identity y where y ∈ {1, … , C} and C is the number of
different classes
● ФT = {XT
i}N1
i=1, ФP = {XP
i}N2
i=1, ФG = {XG
i}N3
i=1 denote the training set, probe set, and gallery set that contain
N1, N2, N3 skeleton sequences of different person in different scenes and views
● Aim to learn an encoder to map skeleton sequences into effective representations => encoded skeleton
sequence representations in the probe set can be matched with the representations of the same identity
in the gallery set

8
Methodology
Skeleton Graph Construction
● Skeleton graphs with the body joints as nodes and the structural connections between adjacent joints as
edges
● Gt(Vt, Et) corresponding to the tth skeleton xt, consists of nodes Vt = {vt
1, vt
2, … ,vt
J}, vt
i ∈ R3 and edges
Et = {et
i,j|vt
i, vt
j ∈ Vt}, et
i,j ∈ R
● Vt, Et denote the set of nodes corresponding to J different body joints and the set of their internal
connection relations

9
Methodology
Skeleton Graph Transformer
● Aim to capture discriminative skeleton features
● Problem:
○ Body structural features, which can be inferred from the relations between adjacent body joints
○ Skeleton actional patterns, which are typically characterized by the relations among different body
components
● Each body-joint node as a basic body component, combine the above relation learning as a full-relation
learning of body-joint nodes => fully aggregate key body and motion features from skeleton graphs
● Positional encoding
● Position-encoded node representation
● Multiple independent FR heads, aggregate corresponding relational feature

10
Methodology
Skeleton Graph Transformer
● Apply Feed Forward Network with residual connections and batch normalization
● Final sequence-level graph representation S

11
Methodology
Graph Prototype Contrastive Learning
● Each individual’s skeletons usually share the same anthropometric features (skeletal lengths), while their
continuous sequence can characterize identity-specific walking patterns (gait)
● Given the encoded graph representations of training skeleton sequences , group
them by growth-truth classes as , where denote
● Graph prototype is generated
● Graph prototype contrastive loss

12
Methodology
Graph Structure-Trajectory Prompted Reconstruction Mechanism
● To exploit more valuable graph features and high-level semantics from both spatial and temporal context
of skeleton graphs, proposed a self-supervised graph Structure and Trajectory Prompted Reconstruction
mechanism
● Devise 2 graph context based prompts
○ Partial node positions of the same graph
○ Partial node trajectory among continuous graphs
● Randomly mask node positions to generate the masked graph representation
● Then exploited to prompt the skeleton reconstruction

13
Methodology
Graph Trajectory Prompted Reconstruction
● To encourage the model to capture more unique temporal patterns from skeleton graphs, propose to
reconstruct graph trajectories based on their partial dynamics, randomly mask the trajectory of each node
● The masked trajectory representation preserves partial graph dynamics of different body-joint nodes,
which are then use to prompt the skeleton reconstruction
● Propose the STPR objective to combine both graph structure and trajectory prompt reconstruction

14
Methodology
The Entire Approach
● Combine the proposed GPC and STPR to perform skeleton representation learning with
where λ is the weight coefficient to fuse two losses for training

15
Experimental Setups
4 datasets
• IAS
○ 11 different persons
• KS20
• BIWI
• KGBD

16
Experimental Setups
Implementation Details
• J = 20 in KGBD, IAS, BIWI (number of body joints)
• J = 25 in KS20
• J = 14 in the estimated skeleton data of CAIA-B
• f = 6 for Kinect-based skeleton datasets (IAS, KS20, BIWI, KGBD) (sequence length)
• f = 40 for RGB-estimated skelton data (CASIA-B)
• Evaluation Metrics
○ Cumulative Matching Characteristics (CMC)
○ Rank-1 accuracy (R1), Rank-5 accuracy (R5), Rank-10 accuracy (R10)
○ Mean Average Precision (mAP) is used to evaluate the overall performance

19
Conclusion
• TranSG learn effective representations from skelton graphs for person re-ID
• Devise a SGT to perform full-relation learning of body-joint nodes to aggregate key body and motion
features into graph representations
• GPC learn discriminative graph representations by contrasting their inherent similarity with the most
representative graph features
• Design STPR mechanism to encourage learning richer graph semantics and key patterns for person re-ID
• TranSG outperform existing SOTA models, scalable to be applied to different scenarios

240429_Thanh_LabSeminar[TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification].pptx

Recommended

Recommended

More Related Content

Similar to 240429_Thanh_LabSeminar[TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification].pptx

Similar to 240429_Thanh_LabSeminar[TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification].pptx (20)

More from thanhdowork

More from thanhdowork (20)

Recently uploaded

Recently uploaded (20)

240429_Thanh_LabSeminar[TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification].pptx