[NS][Lab_Seminar_240705]Self-Supervised Relation Alignment for Scene Graph Generation.pptx


Slide Content

Self-Supervised Relation Alignment for Scene Graph Generation
Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/07/05
Paper: Bicheng Xu et al., WACV 2024

Introduction
Idea
- A node is an object with a bounding box and a class label; directed edges encode pairwise relations among nodes.
- An SGG model contains 2 message passing networks: one for detecting the objects of interest, and one for predicting relations among the detected objects.
Challenge: sparsity of data
- Recent approaches try to address dataset bias with re-sampling or loss re-weighting, but these still do not solve the problem of data sparsity.
Proposal: a self-supervised relation alignment mechanism
- Given an SGG model, create an auxiliary (mirrored) relation prediction branch and align its relational predictions with those coming from the main supervised branch.
- Apply random masking on the relational feature input to form impoverished, augmented views of the data, which encourages information propagation and refinement of features within and across relations. (A toy scene graph representation is sketched below.)
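To make the graph structure concrete, here is a minimal sketch that is not taken from the slides: a scene graph as plain Python objects, where each node carries a bounding box and a class label and each directed edge carries a predicate. The class names (SceneGraphNode, Relation, SceneGraph) and the example triplet are illustrative only.

```python
# Minimal sketch (not from the slides) of a scene graph data structure.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneGraphNode:
    box: Tuple[float, float, float, float]  # bounding box (x1, y1, x2, y2)
    label: str                              # object class, e.g. "person"

@dataclass
class Relation:
    subject: int    # index of the subject node
    obj: int        # index of the object node
    predicate: str  # relation class, e.g. "riding"

@dataclass
class SceneGraph:
    nodes: List[SceneGraphNode] = field(default_factory=list)
    edges: List[Relation] = field(default_factory=list)

# Example: "person riding horse"
g = SceneGraph(
    nodes=[SceneGraphNode((10, 20, 110, 220), "person"),
           SceneGraphNode((60, 120, 300, 260), "horse")],
    edges=[Relation(subject=0, obj=1, predicate="riding")],
)
```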


Method: Self-Supervised Relation Alignment (General Pipeline)
- An SGG model contains 3 parts:
  - a pre-trained visual feature extractor, which takes an image and computes visual features;
  - an object detector, which localizes and classifies the objects of interest in the image;
  - a relation predictor, which predicts the relationship between any pair of detected objects.
- Motivated by the sparsity of data, the focus is on building a self-supervised alignment mechanism for the relation predictor:
  - to regularize learning with augmented data samples;
  - to encourage the relation predictor, which includes message passing and refinement of relational features, to more effectively propagate information both within a single relation representation and across such representations.
- To build the self-supervised alignment, first create a duplicate of the relation predictor, called the mirrored relation predictor.
- Apply random masking on the input to the mirrored relation predictor.
- An alignment loss is introduced to align the relation predictions from the mirrored predictor with those from the original branch (see the sketch after this list).
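A rough sketch of this two-branch pipeline, not the authors' implementation: it assumes `relation_predictor` is any module mapping relation features of size `feat_dim` to refined features of the same size, ties the weights of the two branches by reusing the same module, and unties only the final prediction head. The masking probability `mask_prob` and all names are assumptions.

```python
import torch
import torch.nn as nn

class AlignedRelationPredictor(nn.Module):
    """Original branch plus a weight-tied mirrored branch with an untied head."""
    def __init__(self, relation_predictor: nn.Module, feat_dim: int,
                 num_predicates: int, mask_prob: float = 0.3):
        super().__init__()
        self.body = relation_predictor                             # shared (tied) weights
        self.head = nn.Linear(feat_dim, num_predicates)            # original prediction head
        self.mirrored_head = nn.Linear(feat_dim, num_predicates)   # untied mirrored head
        self.mask_prob = mask_prob

    def forward(self, rel_feats: torch.Tensor):
        # Original supervised branch sees the clean relation features.
        logits = self.head(self.body(rel_feats))
        # Mirrored branch sees a randomly masked (impoverished) view of the same features.
        keep = (torch.rand_like(rel_feats) > self.mask_prob).float()
        mirrored_logits = self.mirrored_head(self.body(rel_feats * keep))
        return logits, mirrored_logits

# Usage with a toy relation predictor body (hypothetical sizes):
body = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
model = AlignedRelationPredictor(body, feat_dim=256, num_predicates=51)
logits, mirrored_logits = model(torch.randn(8, 256))
```

The two sets of logits would then be compared with the alignment loss described on the next slide.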

Method: Self-Supervised Relation Alignment (Random Masking and Alignment Loss)
Random Masking
- For each element in the feature matrix, the masking operation independently zeros it out with probability p.
- The masked input is then fed to the mirrored relation predictor.
Self-Supervised Alignment Loss
- Kullback-Leibler divergence is used to measure the closeness between the relation predictions of the two branches.
- An untied projection head predicts the relation logits in the mirrored relation predictor.
- Network weights between the mirrored and original relation predictors are all tied, except for the last prediction head. (A sketch of the masking and the KL alignment loss follows below.)
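A minimal sketch of the two pieces described above, under stated assumptions: the masking probability `p` is a hyperparameter, and the direction of the KL term (mirrored branch matched to a detached original branch) is an assumption, since the slides only state that KL divergence measures the closeness of the two sets of predictions.

```python
import torch
import torch.nn.functional as F

def random_mask(x: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    """Independently zero out each element of the feature matrix with probability p."""
    keep = (torch.rand_like(x) > p).float()
    return x * keep

def alignment_loss(mirrored_logits: torch.Tensor,
                   original_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence between the relation distributions of the two branches.
    Treating the original branch as a (detached) target is an assumption."""
    target = F.softmax(original_logits.detach(), dim=-1)
    log_pred = F.log_softmax(mirrored_logits, dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```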

Method: Self-Supervised Relation Alignment (Instantiation under SGTR)
- To test the effectiveness of the proposed self-supervised relation alignment, it is instantiated on the Transformer-based SGTR [25], which consists of:
  - a CNN feature extractor with a multi-layer Transformer encoder to extract image features;
  - an entity node generator, which predicts objects using a Transformer decoder;
  - a predicate node generator, which predicts relations using a Transformer decoder;
  - a bipartite graph assembling module that constructs the final bipartite matching between entities and predicates.
- The predicate node generator has 2 main components:
  - a relation image feature encoder, which refines the image features extracted by the feature extraction part;
  - a structural predicate decoder, a stack of self-attention and cross-attention Transformer blocks that generates the relation predictions.
- The self-supervised relation alignment mechanism is instantiated on the structural predicate decoder: random masking is applied to the key and value of its cross-attention Transformer blocks independently.
- Inside each block, the random mask is multiplied with the attention matrix, so every element of the attention matrix is independently masked with probability p before being fed to the softmax. (A simplified sketch of this masked cross-attention follows below.)

[25] Rongjie Li, Songyang Zhang, and Xuming He. SGTR: End-to-End Scene Graph Generation with Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19486–19496, 2022.
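The snippet below is a simplified, single-head sketch of what masking inside a cross-attention block could look like, not SGTR's actual decoder code: a binary mask with keep probability 1 - p is multiplied into the attention matrix before the softmax, as the slide describes. The tensor shapes and the absence of multi-head projections are simplifications.

```python
import math
import torch
import torch.nn.functional as F

def masked_cross_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                           p: float = 0.3) -> torch.Tensor:
    """q: (B, Tq, D) predicate queries; k, v: (B, Tk, D) relation image features."""
    d = q.size(-1)
    attn = q @ k.transpose(-2, -1) / math.sqrt(d)     # (B, Tq, Tk) attention matrix
    mask = (torch.rand_like(attn) > p).float()        # each entry kept with prob 1 - p
    attn = attn * mask                                # multiply the random mask in
    weights = F.softmax(attn, dim=-1)                 # softmax over the masked matrix
    return weights @ v
```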

Method: Self-Supervised Relation Alignment (Instantiation under Neural Motifs)
- Neural Motifs first utilizes a pre-trained object detector to extract image features and generate object proposals, then uses RoIAlign to extract object features and relation features, from which the scene graph is predicted.
- The object features and the object proposal location embeddings are fed to a bi-directional LSTM to predict the object class labels.
- These are then fed to another bi-directional LSTM (the edge-context LSTM) to generate refined object features, which are combined into a feature used to predict a relation label for each pair of object proposals.
- The self-supervised relation alignment mechanism is applied to the relation predictor:
  - duplicate the edge-context LSTM and its subsequent prediction layers;
  - apply random masks to both the input of the edge-context LSTM and the relation features;
  - replace the last 2 linear layers of the mirrored relation predictor with a 3-layer MLP with ReLU activations; this MLP is untied from the original relation predictor;
  - in the mirrored predictor, the pairwise-combined object features and the masked relation features are fed to the MLP to predict the relation label distributions (see the sketch after this list).
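A minimal sketch of the untied mirrored prediction head, under stated assumptions: the weight-tied duplicate of the edge-context LSTM is omitted, the feature dimensions and the concatenation-based fusion of pairwise object features with relation features are illustrative, and only the 3-layer MLP with ReLU comes from the slide.

```python
import torch
import torch.nn as nn

class MirroredRelationHead(nn.Module):
    """Untied 3-layer MLP (ReLU) standing in for the last two linear layers
    of the original Neural Motifs relation predictor."""
    def __init__(self, feat_dim: int, num_predicates: int, mask_prob: float = 0.3):
        super().__init__()
        self.mask_prob = mask_prob
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_predicates),
        )

    def forward(self, pair_obj_feats: torch.Tensor, rel_feats: torch.Tensor) -> torch.Tensor:
        # Randomly mask the relation features, then fuse them with the
        # pairwise-combined object features coming from the edge-context LSTM.
        keep = (torch.rand_like(rel_feats) > self.mask_prob).float()
        x = torch.cat([pair_obj_feats, rel_feats * keep], dim=-1)
        return self.mlp(x)   # relation label logits per object pair
```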

Experiments: Dataset and Evaluation Metrics
- Visual Genome dataset, with the same preprocessing procedure and train/val/test splits as previous works.
- The preprocessed Visual Genome dataset has 150 object categories and 50 relation categories.
- Main metric: mean Recall@K (mR@K).
- Evaluation settings: predicate classification (PredCls), scene graph classification (SGCls), and scene graph detection (SGDet). (A toy mR@K computation is sketched below.)
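As a toy illustration of the metric, not the official evaluation code: the sketch below computes mean Recall@K from simplified (subject, predicate, object) class triplets per image, ignoring the box-overlap matching that the real Visual Genome protocol requires.

```python
from collections import defaultdict
from typing import List, Tuple

Triplet = Tuple[int, int, int]  # (subject class, predicate class, object class)

def mean_recall_at_k(gt: List[List[Triplet]], pred: List[List[Triplet]], k: int) -> float:
    """Per-predicate Recall@K averaged over predicate classes (mR@K), simplified."""
    hits, totals = defaultdict(int), defaultdict(int)
    for gt_trips, pred_trips in zip(gt, pred):
        topk = set(pred_trips[:k])             # predictions assumed sorted by confidence
        for trip in gt_trips:
            totals[trip[1]] += 1               # ground-truth count per predicate class
            hits[trip[1]] += int(trip in topk)
    recalls = [hits[p] / totals[p] for p in totals]
    return sum(recalls) / len(recalls) if recalls else 0.0
```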

Experiments: Results

Experiments: Neural Motifs with Relation Alignment

Experiments: Ablation Study

Conclusion
- A simple-yet-effective self-supervised relation alignment module is proposed, which can be plugged into any existing scene graph generation model as an additional loss term.
- The relation prediction branch is mirrored, and randomly masked input is fed to the mirrored branch.
- Aligning the predictions from the mirrored and original branches encourages the model to learn better representations.
- Experiments in both cases (one-stage SGTR, two-stage Neural Motifs) show improvements.
- Future plan: design more sophisticated random masking patterns.