[NS][Lab_Seminar_250303]Autoregressive Adaptive Hypergraph Transformer for Skeleton-based Activity Recognition.pptx
About This Presentation
Autoregressive Adaptive Hypergraph Transformer for Skeleton-based Activity Recognition
Size: 1.35 MB
Language: en
Added: Mar 03, 2025
Slides: 18 pages
Slide Content
Autoregressive Adaptive Hypergraph Transformer for Skeleton-based Activity Recognition
Tien-Bach-Thanh Do
Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2025/03/03
Paper: Abhisek Ray et al., WACV 2025
Introduction
- Skeleton-based action recognition is gaining popularity
- It is invariant to viewpoint, illumination changes, and background clutter
- It has reduced computational cost and facilitates background adaptation
- Applications include surveillance, smart security, healthcare, and human-computer interaction
Background and Motivation
- Traditional methods such as CNNs, RNNs, and GNNs have limitations in capturing non-Euclidean joint-bone attributes
- GCNs with fixed sampling methods fail to address high-order correlations between skeleton nodes and multi-scale semantic information
- Transformers are effective at capturing long-range dependencies and make complex contextual features accessible
- Hypergraphs offer advantages over traditional graphs by capturing higher-order relationships among joints; they represent multi-scale contextual information and are more robust to noise
Model: Hypergraph
- Hypergraphs capture higher-order relationships among joints; in "waving", for example, multiple joints (shoulder, elbow, wrist) are considered together
- They represent multi-scale contextual information, for example in "running" or "walking"
- They are more robust to noise by emphasizing critical joints, for example in "sitting down"
- They aid feature fusion by combining spatial and temporal dynamics across multiple joints, as in "clapping"
- They adapt dynamically, adjusting joint importance throughout actions like "jumping"
- They capture intricate interdependencies among multiple body parts, for example in "picking up and throwing an object"
Challenges
- Fixed hypergraphs cannot capture dynamic, action-dependent contextual features
- Dynamically generated hypergraphs can disrupt the feature distribution
A toy incidence-matrix sketch of this joint-to-hyperedge representation follows.
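To make the hypergraph representation concrete, the NumPy sketch below builds a joint-to-hyperedge incidence matrix for a toy skeleton. The 25-joint count loosely follows the NTU RGB+D layout, and the body-part groupings are hypothetical stand-ins, not the paper's learned hyperedges.

```python
import numpy as np

# Hypothetical joint groupings for a 25-joint skeleton (illustrative only;
# the paper generates its hyperedges adaptively rather than fixing them by hand).
NUM_JOINTS = 25
hyperedges = {
    "right_arm": [4, 5, 6, 7],     # shoulder, elbow, wrist, hand
    "left_arm":  [8, 9, 10, 11],
    "torso":     [0, 1, 2, 3, 20],
    "right_leg": [12, 13, 14, 15],
    "left_leg":  [16, 17, 18, 19],
}

def build_incidence(num_joints: int, edges: dict) -> np.ndarray:
    """Return H of shape (num_joints, num_hyperedges), where H[v, e] = 1
    if joint v belongs to hyperedge e."""
    H = np.zeros((num_joints, len(edges)))
    for e, joints in enumerate(edges.values()):
        H[joints, e] = 1.0
    return H

H = build_incidence(NUM_JOINTS, hyperedges)
print(H.shape)  # (25, 5)
```

Unlike an ordinary adjacency matrix, one column of H can connect an arbitrary number of joints at once, which is what lets a single hyperedge cover the whole arm involved in "waving".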
Model: AutoregAd-HGformer
Figure 1. Model abstraction: model-agnostic iterative hypergraph (left), various attention mechanisms (middle), and AutoregAd-HGformer (right).
Model: AutoregAd-HGformer
- Introduces in-phase and out-phase hypergraph generation techniques
- Employs a unique transformer design that analyzes the individual features of joints and hyperedges along with their mutual semantics
Contributions
- Introduces a transformer-implemented hypergraph architecture for skeleton-based action sequences that mutates the hyperedge configuration adaptively
- Proposes two hypergraph generation techniques that produce in-phase and out-phase hypergraphs for discrete and continuous feature alignment, respectively
- Hybrid learning (supervised and self-supervised) explores action-dependent features along the spatial, temporal, and channel dimensions
Model
Figure 2. Proposed framework: autoregressive in-phase hypergraph quantizer (left) and adaptive hypergraph decoder (right).
Model
- Comprises three functional blocks: (i) hypergraph encoder, (ii) adaptive hypergraph decoder, and (iii) classifier
- The Hypergraph Encoder (HypEnc) block comprises a stack of Frame Attentive Hypergraph Transformer (FAHT) units
- The Adaptive Hypergraph Decoder block reconstructs the skeleton sequences and provides low-dimensional features to the adaptive hypergraph generator
A minimal structural sketch of the three blocks follows.
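For orientation only, the PyTorch sketch below mirrors the three-block layout (encoder, decoder, classifier). Every layer here is a placeholder: plain Linear layers stand in for the FAHT stack and the adaptive decoder, and the 60 output classes are an assumption (the NTU RGB+D class count), not a detail from the slides.

```python
import torch
import torch.nn as nn

class AutoregAdHGformerSketch(nn.Module):
    """Structural sketch only: the real model uses FAHT units, a HAN,
    and hypergraph convolutions in place of these Linear placeholders."""

    def __init__(self, in_dim=3, hid_dim=64, num_classes=60):
        super().__init__()
        self.hyp_enc = nn.Linear(in_dim, hid_dim)    # stands in for the FAHT stack
        self.hyp_dec = nn.Linear(hid_dim, in_dim)    # stands in for the adaptive decoder
        self.classifier = nn.Linear(hid_dim, num_classes)

    def forward(self, x):                            # x: (batch, frames, joints, in_dim)
        z = self.hyp_enc(x)                          # encoder features
        recon = self.hyp_dec(z)                      # reconstruction for self-supervision
        logits = self.classifier(z.mean(dim=(1, 2))) # pool frames/joints, then classify
        return logits, recon
```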
Model: Hypergraph Encoder (HypEnc)
- The output embedding $E$ for each node of the HypEnc block is expressed as a hypergraph convolution:
  $E = \sigma\left( D_v^{-1/2} H W D_e^{-1} H^{\top} D_v^{-1/2} \, X \, \Theta \right)$
- The normalized product $D_v^{-1/2} H W D_e^{-1} H^{\top} D_v^{-1/2}$ acts as the hypergraph adjacency matrix; $X$ is the input feature embedding, $H$ the incidence matrix, $W$ the hyperedge weight matrix, $D_v$ and $D_e$ the vertex and hyperedge degree matrices, and $\Theta$ a learnable projection (see the sketch below)
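Below is a minimal PyTorch sketch of one such convolution step, assuming the standard HGNN normalization written above; the shapes and the ReLU nonlinearity are illustrative choices, not the paper's exact configuration.

```python
import torch

def hypergraph_conv(X, H, W_e, Theta):
    """One hypergraph convolution step in the standard HGNN form:
    E = sigma(Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta).

    X:     (N, C_in)     node (joint) features
    H:     (N, E)        incidence matrix
    W_e:   (E,)          hyperedge weights
    Theta: (C_in, C_out) learnable projection

    Assumes every joint belongs to at least one hyperedge, so the
    degree matrices are invertible.
    """
    Dv = torch.diag((H * W_e).sum(dim=1).pow(-0.5))  # weighted vertex degrees^(-1/2)
    De = torch.diag(H.sum(dim=0).pow(-1.0))          # hyperedge degrees^(-1)
    A = Dv @ H @ torch.diag(W_e) @ De @ H.t() @ Dv   # normalized hypergraph adjacency
    return torch.relu(A @ X @ Theta)
```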
Model: Frame Attentive Hypergraph Transformer (FAHT)
- The FAHT unit in each group comprises a Spatio-Temporal Hypergraph Transformer (ST-HT) unit and a Spatiotemporal Attentive Hypergraph Transformer (STA-HT) unit
- ST-HT: hypergraph convolution (HGC) computes the hypergraph embedding, which then passes to the transformer unit to calculate cross-attention with the quantized in-phase hypergraph and its weights
- STA-HT: temporal attention is applied to the hypergraph features along the channel dimension to recognize the importance of each frame (a sketch follows below)
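The STA-HT description suggests scoring each frame and reweighting the sequence. The module below is a hedged sketch of one way to realize such frame-level temporal attention (pool the joints, score each frame from its channel statistics, softmax over time); the paper's actual attention design may differ.

```python
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    """Illustrative frame-level temporal attention: each frame receives
    a scalar importance derived from its pooled channel statistics."""

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(),
            nn.Linear(channels // 4, 1),
        )

    def forward(self, x):                # x: (batch, frames, joints, channels)
        pooled = x.mean(dim=2)           # (batch, frames, channels): pool over joints
        attn = torch.softmax(self.score(pooled), dim=1)  # (batch, frames, 1)
        return x * attn.unsqueeze(2)     # reweight every frame
```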
Model: Adaptive Hypergraph Decoder
- Features from the hypergraph encoder block are passed through a Hyperedge Attention Network (HAN)
- The output embeddings from HAN are passed through the decoder to reconstruct the skeleton sequences and to provide low-dimensional features to the adaptive hypergraph generator
Model: Adaptive Hypergraph Decoder
- Hypergraph Decoder (HypDec): the combined residual and attentive features from HAN are passed through a five-layer decoder, where each layer performs a hypergraph convolution
- Attentive Hypergraph Generator: K-means clustering is applied to the intermediate output of the decoder to find hyperedges; the weights from HAN are normalized to calculate the importance of each hyperedge (see the sketch below)
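The generator step can be sketched as follows: cluster the decoder's joint features to assign each joint to a hyperedge, then normalize the HAN weights into per-hyperedge importances. The `han_weights` argument and the softmax normalization are assumptions for illustration; the slides do not specify which normalization is used.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_hyperedges(node_feats: np.ndarray, num_edges: int,
                        han_weights: np.ndarray):
    """Cluster decoder features into hyperedges and turn HAN weights
    into normalized hyperedge importances.

    node_feats:  (N, C) intermediate decoder features per joint
    num_edges:   number of hyperedges (the K in K-means)
    han_weights: (num_edges,) raw attention weights from the HAN
    """
    labels = KMeans(n_clusters=num_edges, n_init=10).fit_predict(node_feats)
    H = np.zeros((node_feats.shape[0], num_edges))
    H[np.arange(node_feats.shape[0]), labels] = 1.0      # one-hot incidence matrix
    w = np.exp(han_weights) / np.exp(han_weights).sum()  # softmax-normalized importance
    return H, w
```

Each call regenerates the incidence matrix from the current features, which matches the iterative, action-dependent hyperedge updates the slides describe.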
Experiments: Ablation Analysis
Table 1. Impact of various attention mechanisms in the transformer block.
Table 2. Impact of different units in AutoregAd-HGformer.
Conclusion
- AutoregAd-HGformer aggregates multi-scale graphs and higher-order contextual semantics with long-range motion features
- Attention mechanisms and channel attention derive motion-level, deep, action-dependent features
- Hybrid learning and iterative hyperedge clustering make the model more robust
- The novel vector-quantized in-phase and model-agnostic out-phase hypergraph generation helps the model aggregate more robust features
Future work
- Hyperedge-hyperedge self-attention