[NS][Lab_Seminar_250303]Autoregressive Adaptive Hypergraph Transformer for Skeleton-based Activity Recognition.pptx


About This Presentation

Autoregressive Adaptive Hypergraph Transformer for Skeleton-based Activity Recognition


Slide Content

Autoregressive Adaptive Hypergraph Transformer for Skeleton-based Activity Recognition
Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2025/03/03
Abhisek Ray et al., WACV 2025

Introduction
- Skeleton-based action recognition is gaining popularity.
- It is invariant to viewpoint, illumination changes, and background clutter.
- It has reduced computational cost and facilitates background adaptation.
- Applications include surveillance, smart security, healthcare, and Human-Computer Interaction.

Background and Motivation
- Traditional methods such as CNNs, RNNs, and GNNs have limitations in capturing non-Euclidean joint-bone attributes.
- GCNs with fixed sampling methods fail to address high-order correlations between skeleton nodes and multi-scale semantic information.
- Transformers are effective at capturing long-range dependencies and making complex contextual features accessible.
- Hypergraphs offer advantages over traditional graphs by capturing higher-order relationships among joints; they represent multi-scale contextual information and are more robust to noise.

Model: Hypergraph
- Hypergraphs capture higher-order relationships among joints; for example, in "waving", multiple joints (shoulder, elbow, wrist) are considered together (see the incidence-matrix sketch after this list).
- They represent multi-scale contextual information, for example in "running" or "walking".
- Hypergraphs are more robust to noise because they emphasize critical joints, for example in "sitting down".
- They aid feature fusion by combining spatial and temporal dynamics across multiple joints, as in "clapping".
- Hypergraphs adapt dynamically, adjusting joint importance throughout actions such as "jumping".
- They capture intricate interdependencies among multiple body parts, for example in "picking up and throwing an object".
Challenges
- Fixed hypergraphs cannot capture dynamic, action-dependent contextual features.
- Dynamically generated hypergraphs can disrupt the feature distribution.
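To make the hypergraph idea concrete, here is a minimal sketch of a skeleton hypergraph encoded as an incidence matrix. The joint indices and hyperedge groupings are illustrative only, not the paper's learned hyperedges.

```python
# Minimal sketch: a skeleton hypergraph as an incidence matrix.
# Joint groupings are hypothetical, not the paper's learned hyperedges.
import numpy as np

NUM_JOINTS = 25  # e.g., the NTU RGB+D skeleton

# Each hyperedge connects an arbitrary set of joints:
hyperedges = [
    [4, 5, 6, 7],      # e.g., one arm: shoulder, elbow, wrist, hand
    [8, 9, 10, 11],    # e.g., the other arm
    [0, 1, 2, 3, 20],  # e.g., spine and head
]

# Incidence matrix H: H[v, e] = 1 iff joint v belongs to hyperedge e.
H = np.zeros((NUM_JOINTS, len(hyperedges)))
for e, joints in enumerate(hyperedges):
    H[joints, e] = 1.0
```

Unlike an ordinary adjacency matrix, each column of H can tie together any number of joints at once, which is how a single hyperedge models a multi-joint action primitive such as "waving".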

Model: AutoregAd-HGformer
Figure 1. Model abstraction: model-agnostic iterative hypergraph (left), various attention mechanisms (middle), and AutoregAd-HGformer (right).

Model: AutoregAd-HGformer
- Introduces in-phase and out-phase hypergraph generation techniques.
- Employs a unique transformer design that analyzes the individual features of joints and hyperedges along with their mutual semantics.
Contributions
- Introduces a transformer-implemented hypergraph architecture for skeleton-based action sequences that mutates the hyperedge configuration adaptively.
- Proposes two hypergraph generation techniques that produce in-phase and out-phase hypergraphs for discrete and continuous feature alignment, respectively (a vector-quantization sketch follows this list).
- Hybrid learning (supervised and self-supervised) explores action-dependent features along the spatial, temporal, and channel dimensions.
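The conclusion slide describes the in-phase generation as vector quantized. As a rough illustration of what discrete feature alignment via vector quantization looks like, here is a minimal PyTorch sketch; the `quantize` helper and its codebook are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of a vector-quantization step for "in-phase"
# (discrete) feature alignment: each continuous feature is snapped to
# its nearest codebook entry. Not the authors' implementation.
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """z: (N, D) continuous features; codebook: (K, D) learnable codes."""
    # Pairwise squared distances between features and codes: (N, K)
    d = (z.pow(2).sum(1, keepdim=True)
         - 2 * z @ codebook.t()
         + codebook.pow(2).sum(1))
    idx = d.argmin(dim=1)        # nearest code per feature
    z_q = codebook[idx]          # quantized (discrete) features
    # Straight-through estimator so gradients still reach the encoder
    return z + (z_q - z).detach()
```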

Model
Figure 2. Proposed framework: autoregressive in-phase hypergraph quantizer (left) and adaptive hypergraph decoder (right).

Model
- Comprises three functional blocks: (i) hypergraph encoder, (ii) adaptive hypergraph decoder, and (iii) classifier (a compositional sketch follows this list).
- The Hypergraph Encoder (HypEnc) block comprises a stack of Frame Attentive Hypergraph Transformer (FAHT) units.
- The Adaptive Hypergraph Decoder block reconstructs skeleton sequences and provides low-dimensional features to the adaptive hypergraph generator.
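As a structural illustration only, the three blocks might compose as below; the encoder and decoder are abstract stand-ins, and all names and shapes are assumptions for this sketch.

```python
# Structural sketch only: how the three functional blocks might compose.
# enc/dec are abstract stand-ins for HypEnc and the adaptive decoder.
import torch
import torch.nn as nn

class ThreeBlockSketch(nn.Module):
    def __init__(self, enc: nn.Module, dec: nn.Module, dim: int, n_cls: int):
        super().__init__()
        self.enc, self.dec = enc, dec       # (i) encoder, (ii) decoder
        self.cls = nn.Linear(dim, n_cls)    # (iii) action classifier

    def forward(self, x):
        z = self.enc(x)                     # stack of FAHT units
        x_rec = self.dec(z)                 # skeleton reconstruction branch
        logits = self.cls(z.mean(dim=1))    # pooled features -> action classes
        return logits, x_rec
```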

Model: Hypergraph Encoder (HypEnc)
The output embedding $E$ for each node of the HypEnc block is expressed as

$E = \sigma\left(D_v^{-1/2} H W D_e^{-1} H^{\top} D_v^{-1/2} \, X \, \Theta\right)$

where $H$ is the incidence matrix, $W$ is the hyperedge weight matrix, $X$ is the input feature embedding, $D_v$ and $D_e$ are the vertex and hyperedge degree matrices (together forming the normalized adjacency matrix), and $\Theta$ is a learnable projection.
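Below is a minimal NumPy sketch of this standard HGNN-style hypergraph convolution; the ReLU nonlinearity and the function name are illustrative assumptions, not necessarily the paper's exact variant.

```python
# Minimal sketch of one hypergraph convolution matching the formula above.
# H: (V, E) incidence matrix, w: (E,) hyperedge weights, X: (V, C) features,
# Theta: (C, C') learnable projection. ReLU is an assumed nonlinearity.
import numpy as np

def hypergraph_conv(X, H, w, Theta):
    Dv = np.diag(1.0 / np.sqrt(H @ w))    # vertex degrees   D_v^{-1/2}
    De = np.diag(1.0 / H.sum(axis=0))     # hyperedge degrees D_e^{-1}
    A = Dv @ H @ np.diag(w) @ De @ H.T @ Dv   # normalized adjacency
    return np.maximum(A @ X @ Theta, 0.0)     # sigma = ReLU (assumption)
```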

Model: Frame Attentive Hypergraph Transformer (FAHT)
The FAHT unit in each group comprises a Spatio-Temporal Hypergraph Transformer (ST-HT) unit and a Spatiotemporal Attentive Hypergraph Transformer (STA-HT) unit.
- ST-HT: hypergraph convolution (HGC) is applied to obtain the hypergraph embedding, which is subsequently passed to the transformer unit to compute cross-attention with the quantized in-phase hypergraph and its weights.
- STA-HT: temporal attention is applied to the hypergraph features along the channel dimension to recognize the importance of each frame (a frame-attention sketch follows this list).
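A minimal sketch of frame-level temporal attention in the spirit of STA-HT, assuming features shaped (batch, frames, joints, channels); the single-linear scoring head is a hypothetical simplification of the unit described above.

```python
# Hypothetical sketch of frame-level temporal attention: score each frame
# from pooled channel statistics, then reweight the frames accordingly.
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Linear(channels, 1)     # per-frame importance score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, channels)
        ctx = x.mean(dim=2)                     # pool joints -> (B, T, C)
        a = torch.softmax(self.score(ctx), 1)   # frame weights   (B, T, 1)
        return x * a.unsqueeze(2)               # emphasize important frames
```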

Model: Adaptive Hypergraph Decoder
- Features from the hypergraph encoder block are passed through a Hyperedge Attention Network (HAN).
- The output embeddings from HAN are passed through the decoder to reconstruct the skeleton sequences and provide low-dimensional features to the adaptive hypergraph generator.

Model: Adaptive Hypergraph Decoder
- Hypergraph Decoder (HypDec): the combined residual and attentive features from HAN are passed through the 5-layer decoder, where each layer performs a hypergraph convolution.
- Attentive Hypergraph Generator: K-means clustering is applied to the intermediate output of the decoder to find hyperedges; the weights from HAN are normalized to calculate the importance of each hyperedge (a clustering sketch follows this list).
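A small sketch of the K-means step described above, assuming the decoder provides one low-dimensional feature vector per joint; the number of clusters and the use of scikit-learn are illustrative assumptions.

```python
# Sketch of adaptive hyperedge generation via K-means: cluster per-joint
# decoder features; each cluster of joints becomes one hyperedge.
import numpy as np
from sklearn.cluster import KMeans

def generate_hyperedges(feats: np.ndarray, k: int = 8) -> np.ndarray:
    """feats: (num_joints, dim) low-dimensional decoder features.
    Returns an incidence matrix H of shape (num_joints, k)."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats)
    H = np.zeros((feats.shape[0], k))
    H[np.arange(feats.shape[0]), labels] = 1.0
    return H
```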

Model: Loss Function

Model: Algorithm

Experiments: Datasets
- NTU RGB+D 60
- NTU RGB+D 120
- NW-UCLA

Experiments: Quantitative Comparison

Experiments: Ablation Analysis
Table 1. Impact of various attention mechanisms in the transformer block.
Table 2. Impact of different units in AutoregAd-HGformer.

Conclusion
- AutoregAd-HGformer aggregates multiscale graphs and higher-order contextual semantics with long-range motion features.
- Attention mechanisms and channel attention derive deep, motion-level, action-dependent features.
- Hybrid learning and iterative hyperedge clustering make the model more robust.
- Novel vector-quantized in-phase and model-agnostic out-phase hypergraph generation helps the model aggregate more robust features.
Future work
- Hyperedge-hyperedge self-attention