[NS][Lab_Seminar_240607]Unbiased Scene Graph Generation in Videos.pptx


About This Presentation

Unbiased Scene Graph Generation in Videos


Slide Content

Unbiased Scene Graph Generation in Videos. Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea. E-mail: osfa19730@catholic.ac.kr. 2024/06/07. Sayak Nag et al., CVPR 2023

Introduction Figure 1. (a) Long-tailed distribution of the predicate classes in Action Genome. (b) Visual relationship or predicate classification performance of two SOTA dynamic SGG methods, namely STTran and TRACE, falls off significantly for the tail classes

Introduction Figure 2. Noisy scene graph annotations in Action Genome increase the uncertainty of predicted scene graphs

Introduction Figure 3. Occlusion and motion blur caused by moving objects in videos render off-the-shelf object detectors such as Faster R-CNN ineffective at producing consistent object classifications

Method Figure 4. The object detector generates initial object proposals for each RGB frame in a video. The proposals are then passed to the OSPU, where they are first linked into sequences based on the object detector's confidence scores. These sequences are processed with a transformer encoder to generate temporally consistent object embeddings for improved object classification. The proposals and semantic information of each subject-object pair are passed to the PEG to generate a spatio-temporal representation of their relationships. Modeled as a spatio-temporal transformer, the PEG's encoder learns the spatial context of the relationships and its decoder learns their temporal dependencies. Due to the long-tail nature of the relationship/predicate classes, a Memory Bank in conjunction with the MDU is used during training to debias the PEG, enabling the production of more generalizable predicate embeddings. Finally, a K-component GMM head classifies the PEG embeddings and models the uncertainty associated with each predicate class for a given subject-object pair
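As a rough, high-level sketch of how the components described in Figure 4 fit together (the callables named here, such as ospu, peg, mdu, and gmm_head, are placeholders for this sketch, not the authors' released code):

```python
# High-level sketch of a TEMPURA-style pipeline as described in Figure 4.
# All callables (detector, ospu, peg, mdu, gmm_head) are illustrative placeholders.

def dynamic_sgg_forward(frames, detector, ospu, peg, mdu, gmm_head, training=False):
    # 1) Per-frame object proposals: boxes, RoIAligned features, class scores.
    proposals = [detector(frame) for frame in frames]

    # 2) OSPU: link proposals into object sequences (by detector confidence)
    #    and refine them with a temporal transformer encoder.
    object_embeddings = ospu(proposals)

    # 3) PEG: spatio-temporal transformer over subject-object pairs,
    #    yielding one relationship embedding per pair per frame.
    predicate_embeddings = peg(proposals, object_embeddings)

    # 4) Memory-guided debiasing (training only): the MDU refines the
    #    predicate embeddings using a memory bank of predicate prototypes.
    if training:
        predicate_embeddings = mdu(predicate_embeddings)

    # 5) K-component GMM head: predicate class probabilities plus an
    #    uncertainty estimate for each subject-object pair.
    predicate_probs, uncertainty = gmm_head(predicate_embeddings)
    return predicate_probs, uncertainty
```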

Background
- Scene graphs: nodes represent objects; edges represent relationships between objects.
- Objects: a Faster R-CNN with a ResNet-101 backbone, pretrained as the object detector, detects objects in each video frame and extracts bounding boxes, classes, and visual features.
- Relationships: pairs of objects are analyzed to predict their relationships (spatial, semantic, or action-based).
- A spatio-temporal transformer [1] analyzes the features of object pairs and their relative positions across frames, generating embeddings that encode the likelihood and type of relationship between object pairs; these embeddings construct the edges by predicting the relationships (type of relationship, e.g. holding, near, ...; directional information: subject -> object).
- The graph is constructed by combining nodes and edges to represent the entire scene at each frame.
- Scene graphs are integrated over consecutive frames: nodes and edges are matched across frames to keep objects and relationships consistent over time, and the graph is continuously updated as new frames arrive.
[1] Spatial-Temporal Transformer for Dynamic Scene Graph Generation, ICCV 2021

Background. The scene graph of each frame I_t in a video V = {I_1, I_2, ..., I_T} is G_t = {S_t, R_t, O_t}, where S_t = {s^t_1, s^t_2, ..., s^t_N(t)} is the set of subjects, O_t = {o^t_1, o^t_2, ..., o^t_N(t)} is the set of objects, and R_t is the set of relationships (predicates) between them.
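A minimal sketch of this per-frame graph structure in Python; the field names are illustrative, not taken from the paper's code:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ObjectNode:
    track_id: int                                # identity kept consistent across frames
    category: str                                # e.g. "person", "cup"
    bbox: Tuple[float, float, float, float]      # (x1, y1, x2, y2)

@dataclass
class RelationEdge:
    subject_idx: int                             # index into FrameSceneGraph.objects
    object_idx: int
    predicate: str                               # e.g. "holding", "in_front_of"

@dataclass
class FrameSceneGraph:
    """G_t = {S_t, R_t, O_t} for one frame I_t of the video."""
    frame_index: int
    objects: List[ObjectNode] = field(default_factory=list)      # subjects and objects
    relations: List[RelationEdge] = field(default_factory=list)  # R_t

# The dynamic scene graph of the whole video V = {I_1, ..., I_T}
# is then simply the list of per-frame graphs [G_1, ..., G_T].
VideoSceneGraph = List[FrameSceneGraph]
```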

Method Object Detection and Temporal Consistency
- Using the pretrained detector, obtain the set of objects O_t = {o^t_i}, i = 1, ..., N(t), in each frame, where o^t_i = {b^t_i, v^t_i, c^t_{o_i}}: b is the bounding box, v is the RoIAligned proposal feature, and c is the predicted class.
- The Object Sequence Processing Unit (OSPU) links the proposals into a set of object sequences and processes them with a transformer encoder, applying multi-head self-attention to learn long-term temporal dependencies.
- For an input sequence X with fixed positional encodings added, a single attention head computes A = softmax((X W_Q)(X W_K)^T / sqrt(d)) (X W_V); multi-head attention concatenates the heads and applies an output projection, and the result is layer-normalized and passed through an FFN to produce temporally consistent object embeddings.
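A minimal PyTorch sketch of an OSPU-style temporal encoder, assuming each object's proposals have already been linked and padded into a (batch, time, feature) tensor; the layer sizes, sinusoidal encoding, and class count are illustrative choices, not the paper's exact configuration:

```python
import math
import torch
import torch.nn as nn

class TemporalObjectEncoder(nn.Module):
    """Transformer encoder over an object sequence's per-frame RoI features (sketch)."""

    def __init__(self, feat_dim=2048, d_model=512, n_heads=8, n_layers=1, n_classes=36):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    @staticmethod
    def positional_encoding(T, d_model, device):
        # Fixed sinusoidal positional encodings over the temporal axis.
        pos = torch.arange(T, device=device, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, device=device, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(T, d_model, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, seq, pad_mask=None):
        # seq: (B, T, feat_dim) RoIAligned features of one object sequence per batch row.
        x = self.proj(seq)
        x = x + self.positional_encoding(x.size(1), x.size(2), x.device)
        x = self.encoder(x, src_key_padding_mask=pad_mask)   # self-attention over time
        # Temporally consistent embeddings and refined per-frame class logits.
        return x, self.classifier(x)
```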

Method Predicate Embedding Generator (PEG)
- The PEG combines the information of each subject-object pair into an embedding that summarizes the relationship between the two nodes.
- For each pair (i, j): v_i, v_j are the subject and object visual features; u_ij is the feature map of their union box computed by RoIAlign; s_i, s_j are the semantic GloVe embeddings of the subject and object classes taken from the final object logits; f_v, f_u are FFNs; and f_box projects the bounding boxes to a feature map.
- The pair representations are processed by a spatial encoder; the spatial encoder's output, arranged as per-pair temporal sequences, is the input of a temporal decoder; the temporal decoder's output yields the final set of predicate embeddings.
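A condensed PyTorch-style sketch of the PEG idea: concatenate the pair's visual, union-box, and semantic features, run a spatial transformer encoder over the pairs within each frame, then a temporal transformer decoder over each pair across frames. Dimensions and layer counts are illustrative, and the bounding-box-to-feature-map projection f_box is folded into the union feature for brevity:

```python
import torch
import torch.nn as nn

class PredicateEmbeddingGenerator(nn.Module):
    """Spatio-temporal transformer producing one embedding per subject-object pair (sketch)."""

    def __init__(self, vis_dim=2048, union_dim=256, sem_dim=200, d_model=512, n_heads=8):
        super().__init__()
        self.f_v = nn.Linear(vis_dim, 256)     # subject/object visual FFN
        self.f_u = nn.Linear(union_dim, 256)   # union-box feature FFN (f_box omitted here)
        self.fuse = nn.Linear(2 * 256 + 256 + 2 * sem_dim, d_model)
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(enc, num_layers=1)
        self.temporal_decoder = nn.TransformerDecoder(dec, num_layers=1)

    def forward(self, v_subj, v_obj, u_union, s_subj, s_obj):
        # All inputs: (P, T, dim) -- one row per subject-object pair, one step per frame.
        pair = torch.cat([self.f_v(v_subj), self.f_v(v_obj),
                          self.f_u(u_union), s_subj, s_obj], dim=-1)
        x = self.fuse(pair)                                   # (P, T, d_model)

        # Spatial encoder: attention among the pairs belonging to the same frame.
        spatial = self.spatial_encoder(x.transpose(0, 1))     # (T, P, d_model)

        # Temporal decoder: each pair attends over its own sequence across frames.
        temporal_in = spatial.transpose(0, 1)                 # (P, T, d_model)
        predicate_emb = self.temporal_decoder(tgt=temporal_in, memory=temporal_in)
        return predicate_emb                                  # final predicate embeddings
```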

Method Memory-guided Debiasing. Addresses the long-tailed bias in SGG datasets: during training, a memory bank together with the Memory Diffusion Unit (MDU) debiases the PEG so that it produces more generalizable predicate embeddings (see Figure 4).
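The slides give no equations for this step; as an illustration only, one common way to realize memory-guided debiasing is cross-attention from each predicate embedding to a learnable memory bank of per-predicate prototypes, roughly as below. This is an assumption about the general mechanism, not a faithful re-implementation of TEMPURA's MDU:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryDebiasingUnit(nn.Module):
    """Illustrative memory-guided debiasing: attend to per-predicate prototypes.

    NOTE: a generic sketch of the memory-bank-plus-attention idea, not the paper's exact MDU.
    """

    def __init__(self, d_model=512, num_predicates=26):
        super().__init__()
        # One learnable prototype per predicate class (the "memory bank").
        self.memory = nn.Parameter(torch.randn(num_predicates, d_model) * 0.02)
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, predicate_emb):
        # predicate_emb: (N, d_model) embeddings coming from the PEG.
        q = self.query(predicate_emb)                              # (N, d)
        k = self.key(self.memory)                                  # (C, d)
        v = self.value(self.memory)                                # (C, d)
        attn = F.softmax(q @ k.t() / q.size(-1) ** 0.5, dim=-1)    # (N, C)
        diffused = attn @ v                                        # information pulled from prototypes
        # Residual connection: enrich (debias) the original embedding.
        return predicate_emb + diffused
```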

Method Uncertainty Attenuated Predicate Classification. To address the noisy annotations in SGG data, the predicate classification head is modeled as a K-component Gaussian Mixture Model (GMM): for a given sample embedding, each mixture component is parameterized by a mean, a variance, and a mixture weight, which together capture the uncertainty of each predicate class.
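A hedged sketch of what a K-component, uncertainty-aware mixture head could look like: for each component it predicts a vector of logit means, a variance, and a mixture weight, and class probabilities are the mixture-weighted average over noisy (sampled) logits. The exact parameterization and loss used in TEMPURA may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMPredicateHead(nn.Module):
    """Illustrative K-component mixture classification head with uncertainty (sketch)."""

    def __init__(self, d_model=512, num_classes=26, k=4):
        super().__init__()
        self.k, self.num_classes = k, num_classes
        self.mean_head = nn.Linear(d_model, k * num_classes)   # per-component logit means
        self.var_head = nn.Linear(d_model, k * num_classes)    # per-component variances (pre-softplus)
        self.mix_head = nn.Linear(d_model, k)                   # mixture weights

    def forward(self, emb, n_samples=5):
        # emb: (N, d_model) predicate embeddings.
        n = emb.size(0)
        mu = self.mean_head(emb).view(n, self.k, self.num_classes)
        var = F.softplus(self.var_head(emb)).view(n, self.k, self.num_classes)
        pi = F.softmax(self.mix_head(emb), dim=-1)               # (N, K)

        # Monte-Carlo estimate of class probabilities: sample noisy logits
        # per component so high-variance (uncertain) predictions are attenuated.
        probs = torch.zeros_like(mu)
        for _ in range(n_samples):
            noisy_logits = mu + var.sqrt() * torch.randn_like(mu)
            probs = probs + F.softmax(noisy_logits, dim=-1)
        probs = probs / n_samples                                 # (N, K, C)

        # Mixture-weighted class probabilities and a simple per-class uncertainty score.
        class_probs = (pi.unsqueeze(-1) * probs).sum(dim=1)       # (N, C)
        uncertainty = (pi.unsqueeze(-1) * var).sum(dim=1)         # (N, C) predicted variance
        return class_probs, uncertainty
```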

Experiment Training
- The OSPU and GMM head are active (start firing) from the first epoch.
- Training loss: a combination of the object classification (OSPU) and the uncertainty-aware predicate classification (GMM head) objectives.

Experiment Dataset
- Action Genome, the largest benchmark dataset for video SGG: 234,253 annotated frames with 476,229 bounding boxes for 35 object classes (without person), and a total of 1,715,568 annotated predicate instances for 26 relationship classes.
- Standard metrics: Recall@K (R@K) and mean-Recall@K (mR@K) for K = {10, 20, 50}.
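A small sketch of how R@K and mR@K could be computed once predictions are matched to the ground truth; triplets are treated here as already-matched (subject, predicate, object) tuples, which glosses over the IoU box-matching step of the real protocol, and the mR@K variant shown aggregates per-class hits over all frames before averaging:

```python
from collections import defaultdict

def recall_at_k(gt_triplets, pred_triplets, k):
    """R@K for one frame: fraction of GT triplets recovered in the top-K predictions.

    gt_triplets:   set of (subject, predicate, object) tuples
    pred_triplets: list of (subject, predicate, object) tuples, sorted by confidence
    """
    if not gt_triplets:
        return None
    hits = gt_triplets & set(pred_triplets[:k])
    return len(hits) / len(gt_triplets)

def mean_recall_at_k(gt_per_frame, pred_per_frame, k, predicate_classes):
    """mR@K: average the per-predicate-class recall, so tail classes count equally."""
    per_class_hits = defaultdict(int)
    per_class_total = defaultdict(int)
    for gt, pred in zip(gt_per_frame, pred_per_frame):
        topk = set(pred[:k])
        for triplet in gt:
            predicate = triplet[1]
            per_class_total[predicate] += 1
            if triplet in topk:
                per_class_hits[predicate] += 1
    recalls = [per_class_hits[p] / per_class_total[p]
               for p in predicate_classes if per_class_total[p] > 0]
    return sum(recalls) / len(recalls) if recalls else 0.0
```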

Experiment Comparison

Experiment Ablation Studies

Conclusion
- The difficulty of generating dynamic scene graphs from videos can be attributed to several factors, ranging from the imbalanced predicate class distribution to video dynamics and the temporal fluctuation of predictions.
- Existing dynamic SGG methods have focused only on achieving high recall values, which are known to be biased towards head classes.
- TEMPURA is proposed for dynamic SGG and can compensate for these biases.
- It outperforms the SOTA in terms of the mean recall metric, showing its efficacy in long-term unbiased visual relationship learning from videos.