[NS][Lab_Seminar_251027]From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models.pptx

thanhdowork · 16 slides · Oct 27, 2025

About This Presentation

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models


Slide Content

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
Tien-Bach-Thanh Do
Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2025/10/27
Paper: Rongjie Li et al., CVPR 2024

Introduction
- Most prior SGG work focuses on a closed-world setting with a limited label space and struggles to capture the diverse visual relationships found in the real world.
- This limitation leads to incomplete scene representations and to domain gaps when scene graphs are used in downstream vision-language (VL) tasks.
- Previous attempts at open-vocabulary SGG (Ov-SGG) often focused only on novel entities, or on classifying open-set predicates given entity pairs.
- Goal: tackle the open-vocabulary SGG problem in a more general and challenging setting: generating scene graphs with both known and novel visual relation triplets directly from pixels.

Introduction
Figure 1. An illustration of the open-vocabulary SGG paradigm comparison. (A) Previous work adopts task-specific VLMs as predicate classifiers over given entity proposals; (B) our framework unifies generating scene graphs with novel predicates directly from images and conducting VL tasks.

Proposed Method
Figure 2. Illustration of the overall pipeline of our PGSG. We generate scene graph sequences from images using the VLM. The relation construction module then grounds the entities and converts the categorical labels from the sequence. For VL tasks, SGG training provides the parameters used to initialize the VLM for fine-tuning.

Proposed Method: Preliminary
- SGG: generate a scene graph G = {R, E} from an image I, consisting of visual relationships R and entities E.
- A relation triplet r = (e_s, p, e_o) in R represents the relationship between two entities, with the predicate category p drawn from the predicate space C_p.
- An entity e = (c, b) in E consists of a category label c in the entity category space C_e and a bounding box b.
- VLM: BLIP and InstructBLIP are used as base models, and SGG is performed as an image-to-sequence generation task: the vision encoder produces image features, and the text decoder outputs a token sequence together with its classification scores (a minimal data-structure sketch follows below).
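To make the notation above concrete, here is a minimal, illustrative sketch of the scene-graph data structures (entities with a category and a box, relation triplets, and the graph itself). The class and field names are assumptions chosen for illustration, not the paper's code.

```python
# Illustrative scene-graph data structures, assuming standard SGG conventions.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Entity:
    category: str                            # entity label c in the entity category space C_e
    box: Tuple[float, float, float, float]   # bounding box b = (x1, y1, x2, y2)

@dataclass
class RelationTriplet:
    subject: Entity
    predicate: str                           # predicate category p in the predicate space C_p
    obj: Entity

@dataclass
class SceneGraph:
    entities: List[Entity]
    relations: List[RelationTriplet]

# Example graph for "person riding horse"
person = Entity("person", (10, 20, 120, 300))
horse = Entity("horse", (80, 150, 400, 380))
graph = SceneGraph([person, horse], [RelationTriplet(person, "riding", horse)])
```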

Proposed Method: Scene Graph Sequence Generation
- Propose an SGS prompt: "Generate the scene graph of [triplet sequence] and [triplet sequence]".
- The prompt consists of two components: the prefix instruction "Generate the scene graph of" and K relation triplet sequences.
- Each triplet sequence follows subject-predicate-object order; its token sequences are the tokenized category names of the subject entity, the object entity, and the predicate.
- Introduce specialized relation-aware tokens [ENT] and [REL] to represent the compositional structure of relationships and the positions of entities (a prompt-construction sketch follows below).
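A hedged sketch of how such an SGS prompt could be assembled from relation triplets, using the prefix above and the "subject [ENT] predicate [REL] object [ENT]" pattern described on the inference slide; the exact token layout used by PGSG may differ.

```python
# Illustrative only: assembling a scene-graph-sequence (SGS) prompt from triplets.
PREFIX = "Generate the scene graph of"

def triplet_to_sequence(subject: str, predicate: str, obj: str) -> str:
    # Relation-aware tokens [ENT]/[REL] mark the entity positions and the predicate.
    return f"{subject} [ENT] {predicate} [REL] {obj} [ENT]"

def build_sgs_prompt(triplets) -> str:
    parts = [triplet_to_sequence(s, p, o) for s, p, o in triplets]
    return f"{PREFIX} " + " and ".join(parts)

print(build_sgs_prompt([("person", "riding", "horse"),
                        ("horse", "standing on", "grass")]))
# Generate the scene graph of person [ENT] riding [REL] horse [ENT] and
# horse [ENT] standing on [REL] grass [ENT]
```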

Proposed Method: Relationship Construction (Entity Grounding Module)
- Predicts the bounding box of each entity to ground the generated relations within the SGS.
- For each entity token sequence, its hidden states are extracted from the decoder.
- The token hidden states are used as queries Q to re-attend to the image features through an attention mechanism, locating the objects more accurately.
- Each entity's hidden states are transformed into a query vector by average pooling and a linear projection; boxes are decoded by cross-attention between the queries of the 2N entity token sequences and the image features, and each box b is predicted by an FFN (see the sketch after this list).
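A minimal PyTorch sketch of an entity grounding head following the description above: pool each entity span's hidden states into a query, cross-attend to the image features, and regress a box with an FFN. The layer sizes, the single attention layer, and the normalized box parameterization are assumptions, not the paper's exact design.

```python
# Illustrative entity grounding head (not the paper's implementation).
import torch
import torch.nn as nn

class EntityGroundingHead(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)          # query projection
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.box_ffn = nn.Sequential(                    # FFN box regressor
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid()          # normalized box coordinates
        )

    def forward(self, ent_token_states, image_feats):
        # ent_token_states: (B, 2N, T, D) hidden states of each entity token span
        # image_feats:      (B, P, D)     vision-encoder patch features
        queries = self.proj(ent_token_states.mean(dim=2))   # average pool -> (B, 2N, D)
        attended, _ = self.cross_attn(queries, image_feats, image_feats)
        return self.box_ffn(attended)                        # (B, 2N, 4) boxes

head = EntityGroundingHead()
boxes = head(torch.randn(1, 6, 5, 768), torch.randn(1, 196, 768))
print(boxes.shape)  # torch.Size([1, 6, 4])
```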

Proposed Method
Figure 3. Illustration of the entity grounding module and the category conversion module. Left: the entity grounding module localizes the entities within the scene graph sequences by predicting their bounding boxes. Right: the category conversion module maps the vocabulary sequence predictions into categorical predictions.

Proposed Method: Relationship Construction (Category Conversion Module)
- The categories of entities and predicates are represented by tokens from the language vocabulary space.
- The target category sets (entities and predicates) are tokenized into token sequences by the VLM's tokenizer; the superscript c denotes the category index, and T is the sequence length of each tokenized category name.
- The vocabulary-level predictions over a generated span are composed over these token sequences to obtain a categorical prediction (a scoring sketch follows below).
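A hedged sketch of the category conversion idea: each target category name is tokenized, and the generated span's vocabulary logits are pooled over that name's tokens to score the category. The pooling rule (average log-probability) and the token ids in the example are assumptions for illustration only.

```python
# Illustrative conversion from vocabulary-level logits to category-level scores.
import torch

def convert_to_categories(span_logits, category_token_ids):
    # span_logits:        (T_gen, V) vocabulary logits of the generated span
    # category_token_ids: dict mapping category name -> list of its token ids
    log_probs = span_logits.log_softmax(dim=-1)
    scores = {}
    for name, token_ids in category_token_ids.items():
        T = min(len(token_ids), log_probs.size(0))
        # average log-probability of the category's tokens over the span
        scores[name] = torch.stack(
            [log_probs[t, token_ids[t]] for t in range(T)]
        ).mean().item()
    return max(scores, key=scores.get), scores

# Hypothetical token ids, just to exercise the function.
vocab = {"dog": [1042], "horse": [2181], "zebra": [930, 22]}
pred_cat, all_scores = convert_to_categories(torch.randn(2, 30522), vocab)
print(pred_cat)
```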

Proposed Method: Learning
- SGG training uses a multi-task loss consisting of two parts.
- A loss for standard next-token-prediction language modeling, following the same formulation as the VLM backbone.
- A loss for the entity grounding module: bounding-box regression, where for each box prediction b in the SGS, GIoU and L1 losses measure the distance between b and the ground-truth box (a loss sketch follows below).
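A minimal sketch of the two-part training objective described above: a next-token language-modeling loss plus L1 and GIoU box regression for the grounding module. The loss weights and the use of torchvision's generalized_box_iou_loss are assumptions; the paper may weight and implement these terms differently.

```python
# Illustrative multi-task SGG training loss.
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def sgg_loss(lm_logits, target_ids, pred_boxes, gt_boxes,
             w_l1: float = 5.0, w_giou: float = 2.0):
    # lm_logits: (B, L, V); target_ids: (B, L); boxes: (N, 4) in xyxy format
    lm = F.cross_entropy(lm_logits.flatten(0, 1), target_ids.flatten(),
                         ignore_index=-100)                     # next-token LM loss
    l1 = F.l1_loss(pred_boxes, gt_boxes)                        # box L1 loss
    giou = generalized_box_iou_loss(pred_boxes, gt_boxes,
                                    reduction="mean")           # box GIoU loss
    return lm + w_l1 * l1 + w_giou * giou

pred = torch.tensor([[0.1, 0.1, 0.5, 0.5]])
gt = torch.tensor([[0.2, 0.2, 0.6, 0.6]])
loss = sgg_loss(torch.randn(1, 4, 100), torch.randint(0, 100, (1, 4)), pred, gt)
```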

Proposed Method: Inference
- Relationship triplet construction: employ a heuristic rule to match patterns such as "subject [ENT] predicate [REL] object [ENT]", and select the top 3 candidates based on prediction scores.
- Post-processing: refine the initial triplets into the final relationship triplets through filtering and ranking; remove self-connected relationships and apply a non-maximum suppression strategy to eliminate redundant relationships (a parsing sketch follows below).
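A hedged sketch of the inference-time parsing: match the "subject [ENT] predicate [REL] object [ENT]" pattern in the generated sequence, then drop self-connected triplets and exact duplicates. The real post-processing also ranks candidates by prediction score and applies box-level non-maximum suppression, which is omitted here.

```python
# Illustrative parsing of generated scene graph sequences into triplets.
import re

PATTERN = re.compile(r"(.+?)\s*\[ENT\]\s*(.+?)\s*\[REL\]\s*(.+?)\s*\[ENT\]")

def parse_triplets(sequence: str):
    triplets = []
    for subj, pred, obj in PATTERN.findall(sequence):
        subj, pred, obj = subj.strip(), pred.strip(), obj.strip()
        if subj == obj:                         # remove self-connected relationships
            continue
        if (subj, pred, obj) not in triplets:   # simple duplicate suppression
            triplets.append((subj, pred, obj))
    return triplets

seq = "person [ENT] riding [REL] horse [ENT] horse [ENT] standing on [REL] grass [ENT]"
print(parse_triplets(seq))
# [('person', 'riding', 'horse'), ('horse', 'standing on', 'grass')]
```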

Experiments
- Benchmarks: Panoptic Scene Graph (PSG), Visual Genome, OpenImages V6
- Evaluation protocols: SGDet, SGCls, PredCls
- Metrics: Recall and the class-balanced mean Recall (a small metric sketch follows below)
- VL tasks: visual grounding (RefCOCO/+/g), VQA (GQA), image captioning (COCO)
- VLM backbone: BLIP
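A small illustrative sketch of the Recall@K and mean-Recall@K metrics listed above: Recall@K is the fraction of ground-truth triplets recovered among the top-K predictions per image, and mean Recall averages per-predicate recalls to balance rare classes. Matching here is by triplet labels only; the actual SGG evaluation additionally requires box IoU overlap with the ground truth.

```python
# Simplified Recall@K and mean-Recall@K over label triplets (subject, predicate, object).
from collections import defaultdict

def recall_at_k(gt_triplets, pred_triplets, k=50):
    hits = sum(1 for t in gt_triplets if t in pred_triplets[:k])
    return hits / max(len(gt_triplets), 1)

def mean_recall_at_k(gt_triplets, pred_triplets, k=50):
    per_pred = defaultdict(lambda: [0, 0])        # predicate -> [hits, total]
    topk = pred_triplets[:k]
    for t in gt_triplets:
        per_pred[t[1]][1] += 1
        per_pred[t[1]][0] += int(t in topk)
    return sum(h / n for h, n in per_pred.values()) / max(len(per_pred), 1)
```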

Experiments
Table 2. Closed-vocabulary SGDet performance on the VG and PSG datasets. D: dataset name; B: visual backbone; M: method; ViT-B*: uses the BLIP ViT visual encoder as backbone; -c: denotes use of the closed-set classifier.
Table 3. Ablation study of the entity grounding module on the PSG dataset. L: indicates the transformer layers within the entity grounding module; C: denotes the different design choices for entity grounding.
Table 4. SGG performance and inference time with different output sequence lengths. SL: output sequence length; Sec: inference time in seconds per image.

Experiments
Table 5. Analysis of the quality of the generated scene graph sequences. SL: generated sequence length; # Trip.: average number of relation triplets; # U. Trip.: average number of unique relation triplets; # [REL]: average number of occurrences of the [REL] token; % Valid: percentage of valid relation triplets.
Table 6. Performance on the visual grounding task. RC, RC+ and RCg represent the RefCOCO, RefCOCO+ and RefCOCOg datasets, respectively.
Table 7. Performance comparison between the initial pre-trained VLM and PGSG on GQA. Answering accuracy is reported per question type as defined by the benchmark. E: denotes the entity-only scene graph sequence used for ablation.
Table 8. Performance comparison on COCO image captioning.

Experiments
Figure 4. Visualization of scene graph sequence predictions of PGSG.

Conclusion
- Presents a VLM-based open-vocabulary SGG model that formulates SGG as an image-to-graph translation paradigm, addressing the difficult problem of generating scene graphs with open-vocabulary predicates.
- Limitation: performance is suboptimal in standard closed-vocabulary settings, partly because the VLM uses a smaller input resolution (e.g., 384x384) than traditional SGG methods (e.g., 800x1333).
- Future work: improving the perception performance of VLMs with high-resolution input is a challenging and important direction.