[20240812_LabSeminar_Huy]Spatio-Temporal Fusion for Human Action Recognition via Joint Trajectory Graph.pptx


Slide Content

Quang-Huy Tran
Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: [email protected]
2024-08-12
Spatio-Temporal Fusion for Human Action Recognition via Joint Trajectory Graph
Yaolin Zheng et al., AAAI-2024: The Thirty-Eighth AAAI Conference on Artificial Intelligence

OUTLINE: MOTIVATION / METHODOLOGY / EXPERIMENT & RESULT / CONCLUSION

MOTIVATION
Overview and Limitation
Human action recognition aims to accurately identify and classify different human actions from input videos or sequence data. It is one of the important tasks in Computer Vision, with broad applications. Graph Convolutional Networks (GCNs) and Transformers have been widely adopted in this research.
Challenges:
- Conventional GCN methods do not directly utilize spatio-temporal topology to capture comprehensive spatial-temporal dependencies: they only aggregate in the spatial dimension and extend the spatial graph, which is not sufficient for capturing temporal dynamic correlations.
- The density of information may vary between the spatial and temporal dimensions of the joint coordinates, leading to redundancy in the temporal dimension.
- Self-attention mechanisms do not capture the hidden topological information of each sequence, which negatively impacts the model's robustness and generalization.

INTRODUCTION
Contribution
- Propose Spatio-Temporal Dijkstra Attention (STDA): augments feature aggregation among neighboring nodes by integrating shortest-path (Dijkstra) concepts between joints.
- Introduce the Joint Trajectory Graph (JTG) as an input data representation: leverages trajectory information to enrich feature-aggregation capabilities for nodes and their interactions across frames.
- Incorporate the Koopman operator for classification: facilitates an encompassing perspective and superior classification performance.

METHODOLOGY
Problem Definition
- Divide the action sequences into several groups; each group has $N$ frames, and the joint trajectories are described with a graph structure.
- Joint Trajectory Graph $G = (V, E)$: $V$ is the set of nodes (all joints in the $N$ frames); $E$ is the set of edges, combining the spatial graph of joints within each frame with the joint trajectories of the nodes across the $N$ frames.
- $\tilde{A}$ denotes the adjacency matrix of a JTG, built from $A$, the physical connectivity of all joints in a frame (see the sketch below).
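To make the construction concrete, here is a minimal NumPy sketch of assembling a JTG adjacency, assuming trajectory edges simply connect the same joint in consecutive frames; the paper's exact edge definition may differ, and the function name `build_jtg_adjacency` is mine.

```python
import numpy as np

def build_jtg_adjacency(A: np.ndarray, N: int) -> np.ndarray:
    """Assemble a Joint Trajectory Graph adjacency from a single-frame
    skeleton adjacency A (V x V) and a group of N frames.

    Nodes are ordered frame-major: node f*V + v is joint v in frame f.
    Spatial edges copy A inside each frame; trajectory edges connect the
    same joint in consecutive frames (an assumed, simplified construction).
    """
    V = A.shape[0]
    A_jtg = np.zeros((N * V, N * V))
    I = np.eye(V)
    for f in range(N):
        # spatial (physical-connectivity) edges within frame f
        A_jtg[f*V:(f+1)*V, f*V:(f+1)*V] = A
        # trajectory edges to the next frame
        if f + 1 < N:
            A_jtg[f*V:(f+1)*V, (f+1)*V:(f+2)*V] = I
            A_jtg[(f+1)*V:(f+2)*V, f*V:(f+1)*V] = I
    return A_jtg

# toy example: a 3-joint chain over N = 4 frames
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
print(build_jtg_adjacency(A, N=4).shape)  # (12, 12)
```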

METHODOLOGY
Main Architecture:
a) JT-GraphFormer: Positional Encoding, STDA Module, TCN Module.
b) Koopman Operator.

METHODOLOGY
JT-GraphFormer
Positional Encoding: following the Transformer model, a PE is added for each frame to describe the sequential relationship:
$PE(p, 2i) = \sin\big(p / 10000^{2i/d}\big)$, $PE(p, 2i+1) = \cos\big(p / 10000^{2i/d}\big)$
where $p$ is the position of the joints, $i$ is the dimension index of the PE vector, and $d$ is the feature dimension.
TCN Module: a sequential aggregation module consisting of a convolution operation and a Batch Normalization (BN) operation. The input with $C_l$ channels is reshaped to the output with $C_{l+1}$ channels, where $C_{l+1}$ is the output channel number of the $l$-th block. A residual connection (res) is utilized between the input and output stages, implemented as a 1×1 convolution operation and a BN operation. A sketch of the positional encoding follows.
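For reference, a short NumPy sketch of the sinusoidal encoding above (the standard Transformer formulation the slide cites); variable names are illustrative, not from the paper.

```python
import numpy as np

def sinusoidal_pe(T: int, d: int) -> np.ndarray:
    """Standard sinusoidal positional encoding, one vector per frame:
    PE(p, 2i) = sin(p / 10000^(2i/d)), PE(p, 2i+1) = cos(p / 10000^(2i/d))."""
    pe = np.zeros((T, d))
    p = np.arange(T)[:, None]                        # frame positions
    div = np.power(10000.0, np.arange(0, d, 2) / d)  # 10000^(2i/d)
    pe[:, 0::2] = np.sin(p / div)
    pe[:, 1::2] = np.cos(p / div)
    return pe

pe = sinusoidal_pe(T=64, d=128)  # added to each frame's features
```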

METHODOLOGY
JT-GraphFormer
STDA Module: a multi-head self-attention mechanism. Adapting the Graphormer model, a spatial encoding based on the Dijkstra (shortest-path) matrix $D$ is added when computing the self-attention of the nodes:
$\hat{D} = \exp(-D) \odot M$
where $-D$ is the entrywise negation of $D$, $\exp(\cdot)$ computes exponential values for all entries of the matrix, and $M$ is a learnable matrix.
The attention for head $h$ combines the content scores with this encoding:
$A_h = \mathrm{softmax}\big(\alpha_h \, Q K^{\top} + \hat{D}\big)$
where $Q$, $K$ are obtained by 1×1 convolution operations on the input, and $\alpha_h$ is a learnable parameter that assigns adaptive weights to the different heads.
In forward propagation, the STDA output is obtained through an FFN structure and a residual structure:
$H^{(l+1)} = \mathrm{FFN}\big(\sigma(\mathrm{STDA}(H^{(l)}))\big) + H^{(l)}$
where $H^{(l)}$ is the input feature of the $l$-th block and $\sigma$ is the Leaky ReLU function. The FFN contains a convolution operation and a BN operation. A toy illustration follows.
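A single-head toy illustration of the STDA idea, not the paper's exact formulation: shortest-path distances over the JTG are computed with Dijkstra, mapped through $\exp(-D)$, modulated by a learnable matrix, and added to the $QK^{\top}$ scores as a bias. The matrices `M`, `Wq`, `Wk` stand in for learnable parameters and are random placeholders here.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def stda_toy(X, A_jtg, seed=0):
    """Single-head toy version of Spatio-Temporal Dijkstra Attention.

    X: (n, d) node features on the JTG; A_jtg: (n, n) adjacency matrix.
    Dijkstra distances D become a bias exp(-D) * M added to the QK^T
    scores before the softmax.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    D = shortest_path(A_jtg, method="D", unweighted=True)  # Dijkstra distances
    D[np.isinf(D)] = n                                     # cap unreachable pairs
    M = rng.standard_normal((n, n)) * 0.01                 # learnable matrix M
    D_hat = np.exp(-D) * M                                 # Dijkstra encoding

    Wq = rng.standard_normal((d, d)) / np.sqrt(d)          # stand-in for 1x1 conv
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d) + D_hat
    scores -= scores.max(axis=-1, keepdims=True)           # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ X

# usage: 6 nodes on a ring graph, 16-d features
n, d = 6, 16
ring = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
X = np.random.default_rng(1).standard_normal((n, d))
out = stda_toy(X, ring)  # (6, 16) attended features
```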

METHODOLOGY
Koopman Operator
The Koopman operator is a linear operator that describes a nonlinear dynamical system by mapping it into an infinite-dimensional Hilbert space, allowing the system's evolution to be represented in a linear space.
For illustrative purposes, let $H_t$ be the JT-GraphFormer's output feature at the $t$-th frame. Define the Koopman operator $\mathcal{K}$ as the linear operator relating the features of consecutive frames:
$H_{t+1} = \mathcal{K} H_t$
This gives an approximate representation of any continuous frame segment: for the feature segment from the $x$-th frame to the $y$-th frame, $H_y \approx \mathcal{K}^{\,y-x} H_x$.
Adopt the DMD algorithm [1] and minimize the Frobenius norm to update $\mathcal{K}$:
$\min_{\mathcal{K}} \|H_{t+1} - \mathcal{K} H_t\|_F$
[1] Kutz, J. N., Fu, X., Brunton, S. L., & Erichson, N. B. (2015, December). Multi-resolution dynamic mode decomposition for foreground/background separation and object tracking. In 2015 IEEE International Conference on Computer Vision Workshop (ICCVW) (pp. 921-929). IEEE.
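The Frobenius-norm objective above has a closed-form least-squares solution via the pseudoinverse; a minimal sketch, assuming frame features are stacked row-wise in `H` (names are mine).

```python
import numpy as np

def fit_koopman(H: np.ndarray) -> np.ndarray:
    """Fit a linear Koopman operator K from per-frame features.

    H: (T, d) array, one feature row per frame. With X = H[:-1] and
    Y = H[1:] stacked as columns, K minimizes ||Y - K X||_F, and the
    standard DMD least-squares solution is K = Y pinv(X).
    """
    X = H[:-1].T  # (d, T-1): features at frames t
    Y = H[1:].T   # (d, T-1): features at frames t+1
    return Y @ np.linalg.pinv(X)

# sanity check on a truly linear system H_{t+1} = K_true H_t
rng = np.random.default_rng(0)
K_true = 0.3 * rng.standard_normal((8, 8))
H = np.zeros((20, 8))
H[0] = rng.standard_normal(8)
for t in range(19):
    H[t + 1] = K_true @ H[t]
K = fit_koopman(H)
print(np.allclose(K @ H[:-1].T, H[1:].T, atol=1e-5))  # True: one-step prediction
```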

METHODOLOGY
Four-stream Ensemble
Previous studies demonstrate that simultaneously using different streams can significantly enhance the performance of human action recognition. The trained models are evaluated on four streams: joint, bone, joint motion, and bone motion (fusion sketched below).
- The bone stream uses the bone modality as input data.
- The joint-motion and bone-motion streams are constructed analogously from the motion of the joint and bone data.
- The ultimate result is calculated as a weighted average of the models' inference outputs.
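A sketch of the weighted-average fusion step; the stream weights below are illustrative placeholders chosen by me, not values from the paper.

```python
import numpy as np

def ensemble_scores(stream_scores, weights):
    """Fuse per-stream class scores by weighted average.

    stream_scores: dict of stream name -> (num_samples, num_classes) scores
    weights: dict of stream name -> scalar weight
    """
    total = sum(weights.values())
    fused = sum(w * stream_scores[name] for name, w in weights.items())
    return fused / total

# joint / bone / joint-motion / bone-motion streams (random stand-ins)
rng = np.random.default_rng(0)
scores = {s: rng.standard_normal((4, 60)) for s in
          ["joint", "bone", "joint_motion", "bone_motion"]}
weights = {"joint": 0.6, "bone": 0.6, "joint_motion": 0.4, "bone_motion": 0.4}
pred = ensemble_scores(scores, weights).argmax(axis=1)  # final class per sample
```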

EXPERIMENT AND RESULT
EXPERIMENT SETTINGS
Dataset: NTU RGB+D (NTU-60), NTU RGB+D 120 (NTU-120), and Northwestern-UCLA (N-UCLA).
Baselines (GCN-based methods): ST-GCN [1], 2s-AGCN [2], DGNN [3], Dynamic-GCN [4], SGN [5], DDGCN [6], DC-GCN+ADG [7], MS-G3D [8], MST-GCN [9], CTR-GCN [10], InfoGCN [11], STF [12], Ta-CNN [13], EfficientGCN [14], and CTR-GCN+FR [15].
Measurement: Accuracy.
[1] Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1).
[2] Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019). Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12026-12035).
[3] Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019). Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7912-7921).
[4] Ye, F., Pu, S., Zhong, Q., Li, C., Xie, D., & Tang, H. (2020). Dynamic GCN: Context-enriched topology learning for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia (pp. 55-63).
[5] Zhang, P., Lan, C., Zeng, W., Xing, J., Xue, J., & Zheng, N. (2020). Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1112-1121).
[6] Korban, M., & Li, X. (2020). DDGCN: A dynamic directed graph convolutional network for action recognition. In Computer Vision - ECCV 2020 (pp. 761-776). Springer International Publishing.
[7] Cheng, K., Zhang, Y., Cao, C., Shi, L., Cheng, J., & Lu, H. (2020). Decoupling GCN with DropGraph module for skeleton-based action recognition. In Computer Vision - ECCV 2020 (pp. 536-553). Springer International Publishing.
[8] Liu, Z., Zhang, H., Chen, Z., Wang, Z., & Ouyang, W. (2020). Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 143-152).
[9] Chen, Z., Li, S., Yang, B., Li, Q., & Liu, H. (2021). Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 2, pp. 1113-1122).
[10] Chen, Y., Zhang, Z., Yuan, C., Li, B., Deng, Y., & Hu, W. (2021). Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 13359-13368).
[11] Chi, H. G., Ha, M. H., Chi, S., Lee, S. W., Huang, Q., & Ramani, K. (2022). InfoGCN: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 20186-20196).
[12] Ke, L., Peng, K. C., & Lyu, S. (2022). Towards to-a-t spatio-temporal focus for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 1, pp. 1131-1139).
[13] Xu, K., Ye, F., Zhong, Q., & Xie, D. (2022). Topology-aware convolutional neural network for efficient skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 3, pp. 2866-2874).
[14] Song, Y. F., Zhang, Z., Shan, C., & Wang, L. (2022). Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2), 1474-1488.
[15] Zhou, H., Liu, Q., & Wang, Y. (2023). Learning discriminative representations for skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10608-10617).

EXPERIMENT AND RESULT
RESULT – Overall Performance

EXPERIMENT AND RESULT
RESULT – Ablation Study
Table: Top-1 accuracy (%) of the methods using various data modalities on the NTU-60 and NTU-120 datasets.
Table: Top-1 accuracy (%) of the Koopman operator and fully connected methods on the NTU-60 dataset under the X-Sub setting, using various data modalities.

CONCLUSION
Summarization
- Present the JT-GraphFormer model based on a joint trajectory topology structure: effectively captures the semantic information of the input joint trajectory data, enhancing the graph Transformer.
- Propose STDA: incorporates the intra-graph distances of joints within a JTG, empowering each node to discerningly allocate attention.
- The Koopman operator linearizes the extracted features in either the temporal or the spatial dimension, effectively capturing the dynamic shifts inherent to distinct action categories.