[NS][Lab_Seminar_240607]Unbiased Scene Graph Generation in Videos.pptx


About This Presentation

Unbiased Scene Graph Generation in Videos


Slide Content

Unbiased Scene Graph Generation in Videos. Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea. E-mail: osfa19730@catholic.ac.kr. 2024/06/07. Sayak Nag et al., CVPR 2023

Introduction Figure 1. (a) Long-tailed distribution of the predicate classes in Action Genome. (b) Visual relationship or predicate classification performance of two SOTA dynamic SGG methods, namely STTran and TRACE, falls off significantly for the tail classes

Introduction Figure 2. Noisy scene graph annotations in Action Genome increase the uncertainty of predicted scene graphs

Introduction Figure 3. Occlusion and motion blur caused by moving objects in videos render off-the-shelf object detectors such as Faster R-CNN ineffective at producing consistent object classifications

Method Figure 4. The object detector generates initial object proposals for each RGB frame in a video. The proposals are then passed to the OSPU, where they are first linked into sequences based on the object detector's confidence scores. These sequences are processed with a transformer encoder to generate temporally consistent object embeddings for improved object classification. The proposals and semantic information of each subject-object pair are passed to the PEG to generate a spatio-temporal representation of their relationships. Modeled as a spatio-temporal transformer, the PEG's encoder learns the spatial context of the relationships and its decoder learns their temporal dependencies. Due to the long-tail nature of the relationship/predicate classes, a Memory Bank in conjunction with the MDU is used during training to debias the PEG, enabling the production of more generalizable predicate embeddings. Finally, a K-component GMM head classifies the PEG embeddings and models the uncertainty associated with each predicate class for a given subject-object pair
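As a rough, high-level sketch of how the components described in Figure 4 fit together (the callables named here, such as ospu, peg, mdu, and gmm_head, are placeholders for this sketch, not the authors' released code):

```python
# High-level sketch of a TEMPURA-style pipeline as described in Figure 4.
# All callables (detector, ospu, peg, mdu, gmm_head) are illustrative placeholders.

def dynamic_sgg_forward(frames, detector, ospu, peg, mdu, gmm_head, training=False):
    # 1) Per-frame object proposals: boxes, RoIAligned features, class scores.
    proposals = [detector(frame) for frame in frames]

    # 2) OSPU: link proposals into object sequences (by detector confidence)
    #    and refine them with a temporal transformer encoder.
    object_embeddings = ospu(proposals)

    # 3) PEG: spatio-temporal transformer over subject-object pairs,
    #    yielding one relationship embedding per pair per frame.
    predicate_embeddings = peg(proposals, object_embeddings)

    # 4) Memory-guided debiasing (training only): the MDU refines the
    #    predicate embeddings using a memory bank of predicate prototypes.
    if training:
        predicate_embeddings = mdu(predicate_embeddings)

    # 5) K-component GMM head: predicate class probabilities plus an
    #    uncertainty estimate for each subject-object pair.
    predicate_probs, uncertainty = gmm_head(predicate_embeddings)
    return predicate_probs, uncertainty
```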

Background
- Scene graphs: nodes represent objects; edges represent relationships between objects.
- Objects: a Faster R-CNN with a ResNet-101 backbone, pretrained as the object detector, detects objects in each video frame and extracts bounding boxes, classes, and visual features.
- Relationships: pairs of objects are analyzed to predict their relationships (spatial, semantic, or action-based).
- A spatio-temporal transformer [1] analyzes the features of object pairs and their relative positions across frames, generating embeddings that encode the likelihood and type of relationship between object pairs; these embeddings construct the edges by predicting the relationships (type of relationship, e.g. holding, near, ...; directional information: subject -> object).
- The graph is constructed by combining nodes and edges to represent the entire scene at each frame.
- Scene graphs are integrated over consecutive frames: nodes and edges are matched across frames to keep objects and relationships consistent over time, and the graph is continuously updated as new frames arrive.
[1] Spatial-Temporal Transformer for Dynamic Scene Graph Generation, ICCV 2021

Background. The scene graph of each frame I_t in a video V = {I_1, I_2, ..., I_T} is G_t = {S_t, R_t, O_t}, where S_t = {s^t_1, s^t_2, ..., s^t_N(t)} is the set of subjects, O_t = {o^t_1, o^t_2, ..., o^t_N(t)} is the set of objects, and R_t is the set of relationships (predicates) between them.
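A minimal sketch of this per-frame graph structure in Python; the field names are illustrative, not taken from the paper's code:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ObjectNode:
    track_id: int                                # identity kept consistent across frames
    category: str                                # e.g. "person", "cup"
    bbox: Tuple[float, float, float, float]      # (x1, y1, x2, y2)

@dataclass
class RelationEdge:
    subject_idx: int                             # index into FrameSceneGraph.objects
    object_idx: int
    predicate: str                               # e.g. "holding", "in_front_of"

@dataclass
class FrameSceneGraph:
    """G_t = {S_t, R_t, O_t} for one frame I_t of the video."""
    frame_index: int
    objects: List[ObjectNode] = field(default_factory=list)      # subjects and objects
    relations: List[RelationEdge] = field(default_factory=list)  # R_t

# The dynamic scene graph of the whole video V = {I_1, ..., I_T}
# is then simply the list of per-frame graphs [G_1, ..., G_T].
VideoSceneGraph = List[FrameSceneGraph]
```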

Method Object Detection and Temporal Consistency
- Using the pretrained detector, obtain the set of objects O_t = {o^t_i}, i = 1, ..., N(t), in each frame, where o^t_i = {b^t_i, v^t_i, c^t_{o_i}}: b is the bounding box, v is the RoIAligned proposal feature, and c is the predicted class.
- The Object Sequence Processing Unit (OSPU) links the proposals into a set of object sequences and processes them with a transformer encoder, applying multi-head self-attention to learn long-term temporal dependencies.
- For an input sequence X with fixed positional encodings added, a single attention head computes A = softmax((X W_Q)(X W_K)^T / sqrt(d)) (X W_V); multi-head attention concatenates the heads and applies an output projection, and the result is layer-normalized and passed through an FFN to produce temporally consistent object embeddings.
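A minimal PyTorch sketch of an OSPU-style temporal encoder, assuming each object's proposals have already been linked and padded into a (batch, time, feature) tensor; the layer sizes, sinusoidal encoding, and class count are illustrative choices, not the paper's exact configuration:

```python
import math
import torch
import torch.nn as nn

class TemporalObjectEncoder(nn.Module):
    """Transformer encoder over an object sequence's per-frame RoI features (sketch)."""

    def __init__(self, feat_dim=2048, d_model=512, n_heads=8, n_layers=1, n_classes=36):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    @staticmethod
    def positional_encoding(T, d_model, device):
        # Fixed sinusoidal positional encodings over the temporal axis.
        pos = torch.arange(T, device=device, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, device=device, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(T, d_model, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, seq, pad_mask=None):
        # seq: (B, T, feat_dim) RoIAligned features of one object sequence per batch row.
        x = self.proj(seq)
        x = x + self.positional_encoding(x.size(1), x.size(2), x.device)
        x = self.encoder(x, src_key_padding_mask=pad_mask)   # self-attention over time
        # Temporally consistent embeddings and refined per-frame class logits.
        return x, self.classifier(x)
```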

Method Predicate Embedding Generator (PEG)
- The PEG combines the information of each subject-object pair into an embedding that summarizes the relationship between the two nodes.
- For each pair (i, j): v_i, v_j are the subject and object visual features; u_ij is the feature map of their union box computed by RoIAlign; s_i, s_j are the semantic GloVe embeddings of the subject and object classes taken from the final object logits; f_v, f_u are FFNs; and f_box projects the bounding boxes to a feature map.
- The pair representations are processed by a spatial encoder; the spatial encoder's output, arranged as per-pair temporal sequences, is the input of a temporal decoder; the temporal decoder's output yields the final set of predicate embeddings.
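A condensed PyTorch-style sketch of the PEG idea: concatenate the pair's visual, union-box, and semantic features, run a spatial transformer encoder over the pairs within each frame, then a temporal transformer decoder over each pair across frames. Dimensions and layer counts are illustrative, and the bounding-box-to-feature-map projection f_box is folded into the union feature for brevity:

```python
import torch
import torch.nn as nn

class PredicateEmbeddingGenerator(nn.Module):
    """Spatio-temporal transformer producing one embedding per subject-object pair (sketch)."""

    def __init__(self, vis_dim=2048, union_dim=256, sem_dim=200, d_model=512, n_heads=8):
        super().__init__()
        self.f_v = nn.Linear(vis_dim, 256)     # subject/object visual FFN
        self.f_u = nn.Linear(union_dim, 256)   # union-box feature FFN (f_box omitted here)
        self.fuse = nn.Linear(2 * 256 + 256 + 2 * sem_dim, d_model)
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(enc, num_layers=1)
        self.temporal_decoder = nn.TransformerDecoder(dec, num_layers=1)

    def forward(self, v_subj, v_obj, u_union, s_subj, s_obj):
        # All inputs: (P, T, dim) -- one row per subject-object pair, one step per frame.
        pair = torch.cat([self.f_v(v_subj), self.f_v(v_obj),
                          self.f_u(u_union), s_subj, s_obj], dim=-1)
        x = self.fuse(pair)                                   # (P, T, d_model)

        # Spatial encoder: attention among the pairs belonging to the same frame.
        spatial = self.spatial_encoder(x.transpose(0, 1))     # (T, P, d_model)

        # Temporal decoder: each pair attends over its own sequence across frames.
        temporal_in = spatial.transpose(0, 1)                 # (P, T, d_model)
        predicate_emb = self.temporal_decoder(tgt=temporal_in, memory=temporal_in)
        return predicate_emb                                  # final predicate embeddings
```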

Method Memory-guided Debiasing. Addresses the long-tailed bias in SGG datasets: during training, a memory bank together with the Memory Diffusion Unit (MDU) debiases the PEG so that it produces more generalizable predicate embeddings (see Figure 4).
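The slides give no equations for this step; as an illustration only, one common way to realize memory-guided debiasing is cross-attention from each predicate embedding to a learnable memory bank of per-predicate prototypes, roughly as below. This is an assumption about the general mechanism, not a faithful re-implementation of TEMPURA's MDU:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryDebiasingUnit(nn.Module):
    """Illustrative memory-guided debiasing: attend to per-predicate prototypes.

    NOTE: a generic sketch of the memory-bank-plus-attention idea, not the paper's exact MDU.
    """

    def __init__(self, d_model=512, num_predicates=26):
        super().__init__()
        # One learnable prototype per predicate class (the "memory bank").
        self.memory = nn.Parameter(torch.randn(num_predicates, d_model) * 0.02)
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, predicate_emb):
        # predicate_emb: (N, d_model) embeddings coming from the PEG.
        q = self.query(predicate_emb)                              # (N, d)
        k = self.key(self.memory)                                  # (C, d)
        v = self.value(self.memory)                                # (C, d)
        attn = F.softmax(q @ k.t() / q.size(-1) ** 0.5, dim=-1)    # (N, C)
        diffused = attn @ v                                        # information pulled from prototypes
        # Residual connection: enrich (debias) the original embedding.
        return predicate_emb + diffused
```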

Method Uncertainty Attenuated Predicate Classification. To address the noisy annotations in SGG data, the predicate classification head is modeled as a K-component Gaussian Mixture Model (GMM): for a given sample embedding, each mixture component is parameterized by a mean, a variance, and a mixture weight, which together capture the uncertainty of each predicate class.
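A hedged sketch of what a K-component, uncertainty-aware mixture head could look like: for each component it predicts a vector of logit means, a variance, and a mixture weight, and class probabilities are the mixture-weighted average over noisy (sampled) logits. The exact parameterization and loss used in TEMPURA may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMPredicateHead(nn.Module):
    """Illustrative K-component mixture classification head with uncertainty (sketch)."""

    def __init__(self, d_model=512, num_classes=26, k=4):
        super().__init__()
        self.k, self.num_classes = k, num_classes
        self.mean_head = nn.Linear(d_model, k * num_classes)   # per-component logit means
        self.var_head = nn.Linear(d_model, k * num_classes)    # per-component variances (pre-softplus)
        self.mix_head = nn.Linear(d_model, k)                   # mixture weights

    def forward(self, emb, n_samples=5):
        # emb: (N, d_model) predicate embeddings.
        n = emb.size(0)
        mu = self.mean_head(emb).view(n, self.k, self.num_classes)
        var = F.softplus(self.var_head(emb)).view(n, self.k, self.num_classes)
        pi = F.softmax(self.mix_head(emb), dim=-1)               # (N, K)

        # Monte-Carlo estimate of class probabilities: sample noisy logits
        # per component so high-variance (uncertain) predictions are attenuated.
        probs = torch.zeros_like(mu)
        for _ in range(n_samples):
            noisy_logits = mu + var.sqrt() * torch.randn_like(mu)
            probs = probs + F.softmax(noisy_logits, dim=-1)
        probs = probs / n_samples                                 # (N, K, C)

        # Mixture-weighted class probabilities and a simple per-class uncertainty score.
        class_probs = (pi.unsqueeze(-1) * probs).sum(dim=1)       # (N, C)
        uncertainty = (pi.unsqueeze(-1) * var).sum(dim=1)         # (N, C) predicted variance
        return class_probs, uncertainty
```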

Experiment Training
- The OSPU and GMM head are active (start firing) from the first epoch.
- Training loss: a combination of the object classification (OSPU) and the uncertainty-aware predicate classification (GMM head) objectives.

Experiment Dataset
- Action Genome, the largest benchmark dataset for video SGG: 234,253 annotated frames with 476,229 bounding boxes for 35 object classes (without person), and a total of 1,715,568 annotated predicate instances for 26 relationship classes.
- Standard metrics: Recall@K (R@K) and mean-Recall@K (mR@K) for K = {10, 20, 50}.
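A small sketch of how R@K and mR@K could be computed once predictions are matched to the ground truth; triplets are treated here as already-matched (subject, predicate, object) tuples, which glosses over the IoU box-matching step of the real protocol, and the mR@K variant shown aggregates per-class hits over all frames before averaging:

```python
from collections import defaultdict

def recall_at_k(gt_triplets, pred_triplets, k):
    """R@K for one frame: fraction of GT triplets recovered in the top-K predictions.

    gt_triplets:   set of (subject, predicate, object) tuples
    pred_triplets: list of (subject, predicate, object) tuples, sorted by confidence
    """
    if not gt_triplets:
        return None
    hits = gt_triplets & set(pred_triplets[:k])
    return len(hits) / len(gt_triplets)

def mean_recall_at_k(gt_per_frame, pred_per_frame, k, predicate_classes):
    """mR@K: average the per-predicate-class recall, so tail classes count equally."""
    per_class_hits = defaultdict(int)
    per_class_total = defaultdict(int)
    for gt, pred in zip(gt_per_frame, pred_per_frame):
        topk = set(pred[:k])
        for triplet in gt:
            predicate = triplet[1]
            per_class_total[predicate] += 1
            if triplet in topk:
                per_class_hits[predicate] += 1
    recalls = [per_class_hits[p] / per_class_total[p]
               for p in predicate_classes if per_class_total[p] > 0]
    return sum(recalls) / len(recalls) if recalls else 0.0
```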

Experiment Comparison

Experiment Ablation Studies

Conclusion
- The difficulty of generating dynamic scene graphs from videos can be attributed to several factors, ranging from the imbalanced predicate class distribution to video dynamics and the temporal fluctuation of predictions.
- Existing dynamic SGG methods have focused only on achieving high recall values, which are known to be biased towards head classes.
- TEMPURA is proposed for dynamic SGG and can compensate for these biases.
- It outperforms the SOTA in terms of the mean recall metric, showing its efficacy in long-term unbiased visual relationship learning from videos.