[20240705_LabSeminar_Huy]Spatial-Temporal Graph-Based AU Relationship Learning for Facial Action Unit Detection.pptx


About This Presentation

Spatial-Temporal Graph-Based AU Relationship Learning for Facial Action Unit Detection


Slide Content

Quang-Huy Tran, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea. E-mail: [email protected]. 2024-07-05. Spatial-Temporal Graph-Based AU Relationship Learning for Facial Action Unit Detection, Zihan Wang et al., CVPR 2023: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023.

OUTLINE
- MOTIVATION
- METHODOLOGY
- EXPERIMENT & RESULT
- CONCLUSION

MOTIVATION
Overview and Limitations
- Human facial Action Units (AUs) play a significant role in human behavior understanding. They are annotated based on the anatomical characteristics of multiple facial muscle movements.
- AU detection is a challenging multi-label classification task: AUs are subtle movements of facial muscles, and different facial muscles have different ranges of movement (person-specific factors such as gender and age, or context such as background and illumination).
Challenges:
- Previous works do not consider temporal information.
- AU annotations exhibit a notable imbalance, which can lead to a biased model that is predisposed to learn the AU patterns annotated more frequently in the training set.

INTRODUCTION
Contribution
- Propose a spatio-temporal facial AU graph representation learning framework that jointly models the spatio-temporal relationships among the AUs of all face frames. Relationships between different AUs and the temporal information of a specific AU sequence interact and jointly guide the graph to learn a representation for each AU node.
- Pre-train an MAE model on human face databases, which can generate a strong facial representation from each input facial display.
- Overcome the data imbalance problem in action unit detection.

METHODOLOGY
Problem Definition
Given $T$ consecutive facial frames $X = \{x_1, \dots, x_T\}$, predict the multiple AU labels $\hat{y}_t = \{\hat{y}_t^1, \dots, \hat{y}_t^N\}$ for every frame, where $N$ represents the number of predicted AUs, $t$ denotes the frame index, and each $\hat{y}_t^i$ can be either activated (1) or inactivated (0).
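To make the input/output shapes concrete, here is a minimal sketch of the multi-label prediction target; the frame count, AU count and array names are illustrative assumptions, not values from the paper.

```python
import numpy as np

T, N = 8, 12                               # T consecutive frames, N action units (illustrative)
frames = np.zeros((T, 224, 224, 3))        # the input face frame sequence

# The model predicts one binary label per AU per frame:
# y_hat[t, i] == 1 means AU i is activated in frame t, 0 means inactivated.
y_hat = np.random.randint(0, 2, size=(T, N))
print(y_hat.shape)                         # (8, 12)
```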

METHODOLOGY Main Architecture

METHODOLOGY
Facial Representation Encoder
Pre-train an MAE model using a large amount of face images from CASIA-WebFace, AffectNet, IMDB-WIKI and CelebA:
- Encoder: randomly masked face images are fed in to generate latent features.
- Decoder: reconstructs the original image from these latent features.
Input: the facial image sequence $X = \{x_1, \dots, x_T\}$.
Output: the set of facial representations $F = \{f_1, \dots, f_T\}$ with $f_t \in \mathbb{R}^{m \times d}$, where $f_t$ represents a global facial representation of a face image, $m$ is the number of patches and $d$ denotes the dimension of each patch.
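A minimal sketch of MAE-style random patch masking, assuming a standard 75% mask ratio and ViT-like patch tokens; this only illustrates the idea and is not the authors' implementation.

```python
import torch

# Illustrative MAE-style random masking (a sketch, not the paper's code):
# the image is split into patch tokens, a large fraction is masked, and only the
# visible patches are passed to the encoder; the decoder later reconstructs the
# masked patches with an MSE loss against the original pixels.
def random_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (num_patches, dim). Returns visible patches plus kept/masked indices."""
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = torch.randperm(num_patches)
    keep_idx, mask_idx = perm[:num_keep], perm[num_keep:]
    return patches[keep_idx], keep_idx, mask_idx

patches = torch.randn(196, 768)              # e.g. 14 x 14 patches, 768-dim tokens
visible, keep_idx, mask_idx = random_mask(patches)
print(visible.shape)                          # torch.Size([49, 768])
```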

METHODOLOGY
Spatial-Temporal Graph Learning
AU-specific Feature Generator (AFG): consists of N branches, one per AU. Each branch is an FC layer followed by a global average pooling (GAP) layer.
Spatial GCN module: the Facial Graph Generator (FGG) constructs the adjacency, where every edge is first assumed to be connected. Each feature vector from the AFG is a node; feature similarities between nodes are calculated, and the top-K nearest neighbours with the highest similarity scores are chosen as each node's neighbours.
The new representation of AU $i$ in frame $t$ produced by the GCN is
$v_i^{t\,\prime} = \sigma\big( g\big( \textstyle\sum_{j} a_{ij}\, r(v_j^{t}) \big) \big)$,
where $\sigma$ is the activation function, $g$ and $r$ denote differentiable functions of the GCN layer, and $a_{ij}$ denotes the connectivity between $v_i$ and $v_j$.
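The FGG's top-K similarity graph can be sketched as follows; the use of cosine similarity, the value of K and the tensor names are assumptions for illustration, since the slides do not spell out the exact similarity function.

```python
import torch
import torch.nn.functional as F

# Sketch of a top-K similarity graph: each AU-specific feature from the AFG is a
# node, and edges connect every node to its K most similar neighbours.
def build_topk_adjacency(node_feats: torch.Tensor, k: int = 3) -> torch.Tensor:
    """node_feats: (N, d) AU node features. Returns an (N, N) 0/1 adjacency."""
    normed = F.normalize(node_feats, dim=-1)
    sim = normed @ normed.T                         # cosine similarity between nodes
    topk = sim.topk(k + 1, dim=-1).indices          # +1: every node is most similar to itself
    adj = torch.zeros_like(sim)
    adj.scatter_(1, topk, 1.0)                      # keep only the top-K (plus self) edges
    return adj

nodes = torch.randn(12, 512)                        # e.g. 12 AU nodes, 512-dim features
print(build_topk_adjacency(nodes).sum(dim=1))       # each row has k+1 ones
```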

METHODOLOGY
Spatial-Temporal Graph Learning
Temporal transformer module: apply a Transformer along the time dimension to the sequence of each node in $V$, where FFN is the feed-forward network of the Transformer, Att denotes the self-attention function, and the query, key and value projections are trainable weight matrices.
A cosine similarity calculating (SC) strategy is employed to predict the occurrence probability of each AU: $p_i^t = \sigma\big(\cos(v_i^t, s_i)\big)$, where $\sigma$ is the activation function and $s_i$ is a trainable vector.
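A hedged sketch of such a cosine-similarity prediction head follows; the class name, feature dimensions and the choice of a sigmoid activation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Each AU i owns a trainable vector s_i; the occurrence probability comes from the
# cosine similarity between the AU's node feature and s_i, squashed by a sigmoid.
class CosineHead(torch.nn.Module):
    def __init__(self, num_aus: int, dim: int):
        super().__init__()
        self.s = torch.nn.Parameter(torch.randn(num_aus, dim))  # one trainable vector per AU

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        """node_feats: (num_aus, dim) -> (num_aus,) occurrence probabilities."""
        cos = F.cosine_similarity(node_feats, self.s, dim=-1)
        return torch.sigmoid(cos)

head = CosineHead(num_aus=12, dim=512)
print(head(torch.randn(12, 512)).shape)   # torch.Size([12])
```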

METHODOLOGY
Spatial-Temporal Graph Learning
Loss Function: a two-stage training strategy is used to train the AU detection model.
- First stage (MAE pre-training): a Mean Square Error (MSE) loss constrains the difference between the reconstructed patches and the original patches at the pixel level; the loss compares the ground-truth pixels with the reconstructed pixels.
- Second stage (AU detection): this is a multi-label binary classification problem in which most AUs are inactivated for most face frames, so an asymmetric loss is used to optimize the network; the loss is computed between the prediction and the ground truth over all N AUs and T frames of the input face sequence.
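The slides do not reproduce the asymmetric loss formula, so the following is a sketch in the spirit of the asymmetric loss commonly used for imbalanced multi-label classification; the focusing exponents and the function name are assumptions.

```python
import torch

# Negatives (inactivated AUs) dominate the labels, so they are down-weighted more
# aggressively than positives via a larger focusing exponent.
def asymmetric_loss(probs: torch.Tensor, targets: torch.Tensor,
                    gamma_pos: float = 0.0, gamma_neg: float = 4.0) -> torch.Tensor:
    """probs, targets: (T, N) tensors in [0, 1]; returns a scalar loss."""
    eps = 1e-8
    pos_term = targets * (1 - probs) ** gamma_pos * torch.log(probs + eps)
    neg_term = (1 - targets) * probs ** gamma_neg * torch.log(1 - probs + eps)
    return -(pos_term + neg_term).mean()

probs = torch.sigmoid(torch.randn(8, 12))          # predicted AU probabilities
targets = (torch.rand(8, 12) > 0.8).float()        # sparse ground-truth activations
print(asymmetric_loss(probs, targets))
```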

EXPERIMENT AND RESULT
Experiment Settings
Datasets:
- MAE pre-training: CASIA-WebFace, AffectNet, IMDB-WIKI and CelebA.
- AU detection: Aff-Wild2.
Baselines: ME-Graph [1] and Netease [2].
Measurement: the average F1-score across all AUs, $F1_{avg} = \frac{1}{N}\sum_{i=1}^{N} F1_i$, where $N$ denotes the number of AUs. For an individual AU class, $F1_i = \frac{2 P_i R_i}{P_i + R_i}$, where $P_i$ is the calculated precision for the $i$-th AU and $R_i$ is its recall rate.
[1] Luo, C., Song, S., Xie, W., Shen, L., & Gunes, H. (2022). Learning multi-dimensional edge feature-based AU relation graph for facial action unit recognition. arXiv preprint arXiv:2205.01782.
[2] Zhang, W., Ma, B., Qiu, F., & Ding, Y. (2023). Multi-modal facial affective analysis based on masked autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5793-5802).
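As a concrete reading of the metric, here is a minimal sketch of per-AU F1 followed by macro averaging; the array names and the smoothing constant are assumptions.

```python
import numpy as np

# Per-AU F1 from binary predictions and labels, then averaged over all AUs.
def average_f1(preds: np.ndarray, labels: np.ndarray) -> float:
    """preds, labels: (num_frames, num_aus) binary arrays."""
    f1s = []
    for i in range(labels.shape[1]):
        tp = np.sum((preds[:, i] == 1) & (labels[:, i] == 1))
        fp = np.sum((preds[:, i] == 1) & (labels[:, i] == 0))
        fn = np.sum((preds[:, i] == 0) & (labels[:, i] == 1))
        precision = tp / (tp + fp + 1e-8)
        recall = tp / (tp + fn + 1e-8)
        f1s.append(2 * precision * recall / (precision + recall + 1e-8))
    return float(np.mean(f1s))

preds = np.random.randint(0, 2, (100, 12))
labels = np.random.randint(0, 2, (100, 12))
print(average_f1(preds, labels))
```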

EXPERIMENT AND RESULT RESULT – Overall Performance

EXPERIMENT AND RESULT RESULT – Ablation Study

CONCLUSION
Summarization
- Proposes an effective spatio-temporal AU relational GNN for AU occurrence recognition.
- MAE is introduced as the facial representation encoder for pre-training.
- A spatio-temporal graph learning module models spatial relationships between different AUs and temporal dependencies among different frames.