240429_Thanh_LabSeminar[TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification].pptx

thanhdowork 107 views 19 slides Apr 29, 2024

Slide Content

TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification
Tien-Bach-Thanh Do
Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/04/29
Haocong Rao et al., CVPR 2023

Introduction
- Person re-identification (re-ID) is the challenging task of retrieving and matching a specific person across varying views
- Person re-ID via 3D skeletons has attracted attention in both academia and industry
- Skeleton-based person re-ID methods model unique body and motion representations with the 3D positions of key body joints
Key contributions:
- Present a generic TranSG paradigm to learn effective representations from skeleton graphs for person re-ID
- Devise a skeleton graph transformer (SGT) to fully capture relations in skeleton graphs and integrate key correlative node features into graph representations
- Propose graph prototype contrastive learning (GPC) to contrast and learn the most representative graph features and identity-related semantics at both the skeleton and sequence levels for person re-ID
- Devise graph structure-trajectory prompted reconstruction (STPR) to exploit spatial-temporal graph contexts for reconstruction, capturing more key graph semantics and unique features for person re-ID

Related Works
Person Re-identification Using Skeleton Data
- [13] computes 7 Euclidean distances between certain joints to perform distance metric learning
- [12, 7] utilize 13 and 16 skeleton descriptors to identify different persons
- [8] devises the CNN-based model PoseGait to encode body-joint sequences and hand-crafted pose features
- [6] proposes the AGE model to encode recognizable gait features from 3D skeleton sequences
- [14] proposes to encode prototypes and intra-sequence relations of masked skeleton sequences
- [18, 19] perform multi-stage body-component relation learning based on multi-scale graphs
- [20] proposes a general skeleton feature re-ranking mechanism
References
[13] I. B. Barbosa, M. Cristani, A. Del Bue, L. Bazzani, and V. Murino, “Re-identification with RGB-D sensors,” in European Conference on Computer Vision (ECCV) Workshop, pp. 433–442, Springer, 2012.
[12] M. Munaro, A. Fossati, A. Basso, E. Menegatti, and L. Van Gool, “One-shot person re-identification with a consumer depth camera,” in Person Re-Identification, pp. 161–181, Springer, 2014.
[7] P. Pala, L. Seidenari, S. Berretti, and A. Del Bimbo, “Enhanced skeleton and face 3D data for person re-identification from depth cameras,” Computers & Graphics, vol. 79, pp. 69–80, 2019.
[8] R. Liao, S. Yu, W. An, and Y. Huang, “A model-based gait recognition method with body pose and human prior knowledge,” Pattern Recognition, vol. 98, p. 107069, 2020.
[6] H. Rao, S. Wang, X. Hu, M. Tan, H. Da, J. Cheng, and B. Hu, “Self-supervised gait encoding with locality-aware attention for person re-identification,” in International Joint Conference on Artificial Intelligence (IJCAI), vol. 1, pp. 898–905, 2020.
[14] H. Rao and C. Miao, “SimMC: Simple masked contrastive learning of skeleton representations for unsupervised person re-identification,” in International Joint Conference on Artificial Intelligence (IJCAI), pp. 1290–1297, 2022.
[18] H. Rao, S. Xu, X. Hu, J. Cheng, and B. Hu, “Multi-level graph encoding with structural-collaborative relation learning for skeleton-based person re-identification,” in International Joint Conference on Artificial Intelligence (IJCAI), pp. 973–980, 2021.
[19] H. Rao, X. Hu, J. Cheng, and B. Hu, “SM-SGE: A self-supervised multi-scale skeleton graph encoding framework for person re-identification,” in Proceedings of the 29th ACM International Conference on Multimedia, pp. 1812–1820, 2021.
[20] H. Rao, Y. Li, and C. Miao, “Revisiting k-reciprocal distance re-ranking for skeleton-based person re-identification,” IEEE Signal Processing Letters, vol. 29, pp. 2103–2107, 2022.

Related Works
Contrastive Learning
- Aims to pull homogeneous or positive representation pairs closer while pushing negative pairs farther apart in a certain feature space
- [21] proposes an instance discrimination paradigm with exemplar tasks
- [27] proposes an approach based on context auto-encoding and a probabilistic contrastive loss to learn different-domain representations
- [22] combines k-means clustering and contrastive learning (CL)
- [2, 14] are based on consecutive sequences or randomly masked sequences
References
[21] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3733–3742, 2018.
[27] A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[22] J. Li, P. Zhou, C. Xiong, and S. Hoi, “Prototypical contrastive learning of unsupervised representations,” in International Conference on Learning Representation (ICLR), 2021.
[2] H. Rao, S. Wang, X. Hu, M. Tan, Y. Guo, J. Cheng, X. Liu, and B. Hu, “A self-supervised gait encoding approach with locality-awareness for 3D skeleton based person re-identification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 01, pp. 1–1, 2021.
[14] H. Rao and C. Miao, “SimMC: Simple masked contrastive learning of skeleton representations for unsupervised person re-identification,” in International Joint Conference on Artificial Intelligence (IJCAI), pp. 1290–1297, 2022.

Related Works
Prompt Learning
- Provides additional knowledge, instruction, or context for the input of models so that they give more reliable outputs for different tasks [28-36]
- [30] leverages language-based prompts to generalize pre-trained visual representations to many tasks
- [31] automatically models task-relevant prompts with continuous representations to improve downstream task performance
References
[28] F. Petroni, T. Rocktaschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller, “Language models as knowledge bases?,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2463–2473, 2019.
[29] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning (ICML), pp. 4904–4916, PMLR, 2021.
[30] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning (ICML), pp. 8748–8763, PMLR, 2021.
[31] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.
[32] Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can we know what language models know?,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 423–438, 2020.
[33] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3045–3059, 2021.
[34] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 4582–4597, 2021.
[35] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” arXiv preprint arXiv:2107.13586, 2021.
[36] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh, “Autoprompt: Eliciting knowledge from language models with automatically generated prompts,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4222–4235, 2020.

Methodology

Methodology
The Proposed Approach
- A 3D skeleton sequence is $X = (x_1, \dots, x_f) \in \mathbb{R}^{f \times J \times 3}$, where $x_t \in \mathbb{R}^{J \times 3}$ denotes the $t$-th skeleton with the 3D coordinates of $J$ body joints
- Each skeleton sequence $X$ corresponds to a person identity $y$, where $y \in \{1, \dots, C\}$ and $C$ is the number of different classes
- $\Phi_T = \{X_i^T\}_{i=1}^{N_1}$, $\Phi_P = \{X_i^P\}_{i=1}^{N_2}$, and $\Phi_G = \{X_i^G\}_{i=1}^{N_3}$ denote the training set, probe set, and gallery set, which contain $N_1$, $N_2$, and $N_3$ skeleton sequences of different persons in different scenes and views
- The aim is to learn an encoder that maps skeleton sequences into effective representations, so that the encoded representations in the probe set can be matched with the representations of the same identity in the gallery set
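The probe-to-gallery matching step can be illustrated with a minimal sketch. It assumes the encoder has already produced fixed-length representations and that matching is done by nearest Euclidean distance; the function names and the toy 2-D feature values are hypothetical and are not taken from the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_probe_to_gallery(probe_feats, gallery_feats, gallery_ids):
    """For each probe representation, return the identity of the nearest gallery one."""
    predictions = []
    for p in probe_feats:
        distances = [euclidean(p, g) for g in gallery_feats]
        nearest = min(range(len(distances)), key=distances.__getitem__)
        predictions.append(gallery_ids[nearest])
    return predictions

# Toy 2-D representations (hypothetical values standing in for encoder outputs).
gallery_feats = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
gallery_ids = [1, 2, 2]
probe_feats = [[0.1, 0.9], [0.8, 0.2]]
print(match_probe_to_gallery(probe_feats, gallery_feats, gallery_ids))  # [1, 2]
```

The distance metric is a free choice in this sketch; any similarity measure over the encoded representations could be substituted.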

Methodology
Skeleton Graph Construction
- Skeleton graphs take the body joints as nodes and the structural connections between adjacent joints as edges
- The graph $G^t(V^t, E^t)$ corresponding to the $t$-th skeleton $x_t$ consists of nodes $V^t = \{v_1^t, v_2^t, \dots, v_J^t\}$, $v_i^t \in \mathbb{R}^3$, and edges $E^t = \{e_{i,j}^t \mid v_i^t, v_j^t \in V^t\}$, $e_{i,j}^t \in \mathbb{R}$
- $V^t$ and $E^t$ denote the set of nodes corresponding to the $J$ different body joints and the set of their internal connection relations
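As a rough illustration of this construction, the sketch below builds the node list and a symmetric adjacency matrix for one skeleton. The 5-joint layout, the bone list, and the choice of 1.0 as the edge value for connected joints are assumptions made for the example only.

```python
def build_skeleton_graph(joints, bones):
    """Return (nodes, adjacency) for a single skeleton.

    joints: list of J (x, y, z) positions, one per body joint.
    bones:  list of (i, j) index pairs for structurally adjacent joints.
    """
    num_joints = len(joints)
    adjacency = [[0.0] * num_joints for _ in range(num_joints)]
    for i, j in bones:
        # Undirected edge: the connection holds in both directions.
        adjacency[i][j] = 1.0
        adjacency[j][i] = 1.0
    return [list(p) for p in joints], adjacency

# Hypothetical 5-joint skeleton: head, neck, torso, left hand, right hand.
joints = [(0.0, 1.7, 0.0), (0.0, 1.5, 0.0), (0.0, 1.0, 0.0),
          (-0.5, 1.2, 0.0), (0.5, 1.2, 0.0)]
bones = [(0, 1), (1, 2), (1, 3), (1, 4)]
nodes, adjacency = build_skeleton_graph(joints, bones)
for row in adjacency:
    print(row)
```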

Methodology
Skeleton Graph Transformer
- Aim: capture discriminative skeleton features
- Problem:
  - Body structural features, which can be inferred from the relations between adjacent body joints
  - Skeleton actional patterns, which are typically characterized by the relations among different body components
- Treat each body-joint node as a basic body component and combine the above relation learning into a full-relation (FR) learning of body-joint nodes, to fully aggregate key body and motion features from skeleton graphs
- Positional encoding
- Position-encoded node representation
- Multiple independent FR heads aggregate the corresponding relational features

Methodology
Skeleton Graph Transformer
- Apply a feed-forward network with residual connections and batch normalization
- Final sequence-level graph representation S
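The slides do not reproduce the FR-head equations, so the following is only a sketch of how one head might aggregate relational features, under the assumption that it behaves like scaled dot-product self-attention over all body-joint nodes. Learned query, key, and value projections, positional encoding, and the feed-forward network are deliberately omitted; the function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def full_relation_head(nodes):
    """Aggregate features for every node from all nodes (itself included).

    nodes: list of J feature vectors of equal dimension.
    Assumption: relation strength is the scaled dot product of node features.
    """
    dim = len(nodes[0])
    scale = math.sqrt(dim)
    output = []
    for query in nodes:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in nodes]
        weights = softmax(scores)
        # Weighted sum of all node features, one value per feature dimension.
        aggregated = [sum(w * v[d] for w, v in zip(weights, nodes)) for d in range(dim)]
        output.append(aggregated)
    return output

nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in full_relation_head(nodes):
    print([round(x, 3) for x in row])
```

Because every node attends to every other node, this covers both adjacent-joint relations and relations between distant body components in a single step, which is the intuition behind full-relation learning described above.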

Methodology
Graph Prototype Contrastive Learning
- Each individual’s skeletons usually share the same anthropometric features (skeletal lengths), while their continuous sequence can characterize identity-specific walking patterns (gait)
- Given the encoded graph representations of the training skeleton sequences, group them by ground-truth class
- A graph prototype is generated from each group
- Graph prototype contrastive loss
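Since the prototype and loss equations are missing from the extracted slides, the sketch below only illustrates the general idea. It assumes that a prototype is the mean of the representations in a class and that the loss is a softmax cross-entropy over dot-product similarities to all prototypes with a temperature; both choices, and all names, are assumptions rather than the authors' formulation. It also works at a single level, whereas the slides describe contrast at both the skeleton and sequence levels.

```python
import math
from collections import defaultdict

def class_prototypes(features, labels):
    """Group features by ground-truth class and average each group (assumed prototype)."""
    groups = defaultdict(list)
    for f, y in zip(features, labels):
        groups[y].append(f)
    return {y: [sum(col) / len(fs) for col in zip(*fs)] for y, fs in groups.items()}

def prototype_contrastive_loss(features, labels, prototypes, temperature=0.1):
    """Mean cross-entropy of each feature against all prototypes (assumed loss form)."""
    classes = sorted(prototypes)
    total = 0.0
    for f, y in zip(features, labels):
        logits = [sum(a * b for a, b in zip(f, prototypes[c])) / temperature
                  for c in classes]
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the feature's own class prototype.
        total += log_norm - logits[classes.index(y)]
    return total / len(features)

features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 2, 2]
prototypes = class_prototypes(features, labels)
print(prototypes)
print(prototype_contrastive_loss(features, labels, prototypes))
```

Minimizing such a loss pulls each representation toward the prototype of its own identity and away from the prototypes of other identities.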

Methodology
Graph Structure-Trajectory Prompted Reconstruction Mechanism
- To exploit more valuable graph features and high-level semantics from both the spatial and temporal contexts of skeleton graphs, the authors propose a self-supervised graph Structure and Trajectory Prompted Reconstruction (STPR) mechanism
- Devise 2 graph-context-based prompts:
  - Partial node positions of the same graph
  - Partial node trajectory among continuous graphs
- Randomly mask node positions to generate the masked graph representation, which is then exploited to prompt the skeleton reconstruction
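The structure prompt can be sketched as random masking of joint positions within one skeleton. In this illustration, masked joints are replaced by zeros and the reconstruction target is scored with a mean squared error over all joints; the zero fill value, the error measure, and the function names are assumptions for the example.

```python
import random

def mask_node_positions(skeleton, num_masked, rng):
    """Return (masked_copy, masked_indices) with num_masked joints zeroed out.

    skeleton: list of J [x, y, z] joint positions.
    rng:      a random.Random instance, passed in for reproducibility.
    """
    masked_idx = sorted(rng.sample(range(len(skeleton)), num_masked))
    masked = [[0.0, 0.0, 0.0] if i in masked_idx else list(p)
              for i, p in enumerate(skeleton)]
    return masked, masked_idx

def reconstruction_error(predicted, target):
    """Mean squared error between two skeletons of the same shape."""
    squared = [(a - b) ** 2 for p, t in zip(predicted, target) for a, b in zip(p, t)]
    return sum(squared) / len(squared)

skeleton = [[0.0, 1.7, 0.1], [0.0, 1.5, 0.1], [0.0, 1.0, 0.1],
            [-0.5, 1.2, 0.1], [0.5, 1.2, 0.1]]
masked, masked_idx = mask_node_positions(skeleton, 2, random.Random(0))
print(masked_idx)
print(reconstruction_error(masked, skeleton))
```

The unmasked joints act as the partial-position prompt: a model would receive `masked` and be trained to recover `skeleton`.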

Methodology
Graph Trajectory Prompted Reconstruction
- To encourage the model to capture more unique temporal patterns from skeleton graphs, the authors propose to reconstruct graph trajectories based on their partial dynamics by randomly masking the trajectory of each node
- The masked trajectory representation preserves the partial graph dynamics of different body-joint nodes, which are then used to prompt the skeleton reconstruction
- The STPR objective combines both graph structure and trajectory prompted reconstruction
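Trajectory masking can be sketched in the same spirit: for each body-joint node, a random subset of frames is hidden so that only part of the node's motion remains visible. The zero fill value, the per-node number of masked frames, and the function name are assumptions for this illustration; the sequence length of 6 matches the value the slides give for Kinect-based datasets.

```python
import random

def mask_node_trajectories(sequence, num_masked_frames, rng):
    """Zero out num_masked_frames randomly chosen frames of every node's trajectory.

    sequence: list of f skeletons, each a list of J [x, y, z] positions.
    Returns (masked_sequence, masked_frames), where masked_frames[j] lists the
    frame indices hidden for node j. The input sequence is not modified.
    """
    num_frames = len(sequence)
    num_joints = len(sequence[0])
    masked = [[list(p) for p in skeleton] for skeleton in sequence]
    masked_frames = []
    for j in range(num_joints):
        frames = sorted(rng.sample(range(num_frames), num_masked_frames))
        masked_frames.append(frames)
        for t in frames:
            masked[t][j] = [0.0, 0.0, 0.0]
    return masked, masked_frames

# Toy sequence: f = 6 frames, J = 3 joints, hypothetical coordinates.
sequence = [[[t + 1.0, j + 1.0, 1.0] for j in range(3)] for t in range(6)]
masked, masked_frames = mask_node_trajectories(sequence, 2, random.Random(0))
print(masked_frames)
```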

Methodology
The Entire Approach
- Combine the proposed GPC and STPR objectives into a single training loss to perform skeleton representation learning, where λ is the weight coefficient that fuses the two losses

Experimental Setups
4 datasets:
- IAS: 11 different persons
- KS20: 20 different persons
- BIWI: 50 different persons
- KGBD: 164 different persons

Experimental Setups
Implementation Details
- Number of body joints: J = 20 in KGBD, IAS, and BIWI; J = 25 in KS20; J = 14 in the estimated skeleton data of CASIA-B
- Sequence length: f = 6 for Kinect-based skeleton datasets (IAS, KS20, BIWI, KGBD); f = 40 for RGB-estimated skeleton data (CASIA-B)
Evaluation Metrics
- Cumulative Matching Characteristics (CMC): Rank-1 accuracy (R1), Rank-5 accuracy (R5), Rank-10 accuracy (R10)
- Mean Average Precision (mAP) is used to evaluate the overall performance
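The two metric families can be computed from ranked gallery lists as in the following sketch. It uses the standard definitions of Rank-k accuracy and average precision; the function names and the toy rankings are hypothetical, and each ranked list is assumed to be sorted from most to least similar gallery item.

```python
def rank_k_accuracy(ranked_ids_per_probe, probe_ids, k):
    """Fraction of probes whose true identity appears in the top-k gallery results."""
    hits = sum(1 for ranked, pid in zip(ranked_ids_per_probe, probe_ids)
               if pid in ranked[:k])
    return hits / len(probe_ids)

def average_precision(ranked_ids, probe_id):
    """Average of the precision values at each rank where a correct match occurs."""
    hits = 0
    precision_sum = 0.0
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == probe_id:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(ranked_ids_per_probe, probe_ids):
    """Mean of the per-probe average precision values."""
    aps = [average_precision(r, p) for r, p in zip(ranked_ids_per_probe, probe_ids)]
    return sum(aps) / len(aps)

# Toy example: gallery identities sorted by similarity for two probes.
ranked = [[1, 2, 1], [1, 2, 2]]
probe_ids = [1, 2]
print(rank_k_accuracy(ranked, probe_ids, 1))
print(rank_k_accuracy(ranked, probe_ids, 5))
print(mean_average_precision(ranked, probe_ids))
```

Rank-k accuracy only asks whether any correct match appears early in the list, whereas mAP rewards placing all correct matches near the top, which is why it serves as the overall performance measure.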

Results SOTA comparison

Results Ablation Study

Conclusion
- TranSG learns effective representations from skeleton graphs for person re-ID
- The SGT performs full-relation learning of body-joint nodes to aggregate key body and motion features into graph representations
- GPC learns discriminative graph representations by contrasting their inherent similarity with the most representative graph features
- The STPR mechanism encourages learning richer graph semantics and key patterns for person re-ID
- TranSG outperforms existing SOTA models and is scalable to different scenarios