[NS][Lab_Seminar_251027]From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models.pptx
About This Presentation
From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
Size: 2.22 MB
Language: en
Added: Oct 27, 2025
Slides: 16 pages
Slide Content
From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2025/10/27
Paper: Rongjie Li et al., CVPR 2024
Introduction
- Most prior SGG work has focused on a closed-world setting with a limited label space, struggling to capture the diverse visual relationships found in the real world
- This limitation leads to incomplete scene representations and domain gaps when scene graphs are used in downstream VL tasks
- Previous attempts at open-vocabulary SGG (OvSGG) often focused only on novel entities, or on classifying open-set predicates given entity pairs
- Goal: tackle the open-vocabulary SGG problem in a more general and challenging setting: generating scene graphs with both known and novel visual relation triplets directly from pixels
Introduction
Figure 1. Illustration of the open-vocabulary SGG paradigm comparison. (A) Previous work adopts a task-specific VLM as a predicate classifier over given entity proposals; (B) our framework offers a unified way to generate scene graphs with novel predicates directly from images and to conduct VL tasks.
Proposed Method
Figure 2. Illustration of the overall pipeline of PGSG. Scene graph sequences are generated from the images using the VLM; the relation construction module then grounds the entities and converts categorical labels from the sequence. For VL tasks, SGG training provides the parameters used to initialize the VLM for fine-tuning.
Proposed Method: Preliminary
- SGG: generate a scene graph from an image, consisting of visual relationships and entities
- A relation triplet represents a relationship between two entities, with the predicate drawn from a predicate category space
- Each entity consists of a category label from the entity category space and a bounding box
- VLM: BLIP and InstructBLIP serve as the base models and perform an image-to-sequence generation task: a vision encoder extracts image features, and a text decoder outputs a token sequence together with per-token classification scores (a minimal sketch of these structures follows below)
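The following is a minimal sketch of the scene-graph structures implied by the preliminary above; the class and field names are illustrative assumptions, not taken from the PGSG codebase.

```python
# Illustrative data structures for a scene graph: entities with category labels
# and bounding boxes, and subject-predicate-object relation triplets.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entity:
    category: str                            # label from the entity category space
    box: Tuple[float, float, float, float]   # bounding box (x1, y1, x2, y2)


@dataclass
class RelationTriplet:
    subject: Entity
    predicate: str                           # label from the predicate category space
    obj: Entity


@dataclass
class SceneGraph:
    entities: List[Entity]
    triplets: List[RelationTriplet]
```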
Proposed Method: Scene Graph Sequence Generation
- Propose an SGS prompt: "Generate the scene graph of [triplet sequence] and [triplet sequence]"
- It consists of 2 components: the prefix instruction "Generate the scene graph of" and K relation triplet sequences
- Each triplet sequence follows subject-predicate-object order; the token sequences are the tokenized category names of the subject entity, the object entity, and the predicate
- Specified relation-aware tokens [ENT] and [REL] are introduced to represent the compositional structure of relationships and the positions of entities (see the prompt-building sketch below)
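A hedged sketch of assembling such an SGS prompt with the [ENT] and [REL] tokens; the template follows the slide's description, and the function name and example categories are illustrative.

```python
# Build a scene-graph-sequence (SGS) prompt from (subject, predicate, object)
# category-name triplets, serializing each as "subject [ENT] predicate [REL] object [ENT]".
def build_sgs_prompt(triplets):
    parts = []
    for subj, pred, obj in triplets:
        parts.append(f"{subj} [ENT] {pred} [REL] {obj} [ENT]")
    return "Generate the scene graph of " + " and ".join(parts)


print(build_sgs_prompt([("man", "riding", "horse"), ("horse", "on", "grass")]))
# Generate the scene graph of man [ENT] riding [REL] horse [ENT] and horse [ENT] on [REL] grass [ENT]
```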
Proposed Method: Relationship Construction
Entity Grounding Module
- Predicts the bounding box of each entity to ground the generated relations within the SGS
- For each entity token sequence, the corresponding hidden states are extracted from the text decoder
- The token hidden states are used as queries to re-attend to the image features through an attention mechanism, locating the objects more accurately
- Each entity's hidden states are transformed into a query vector by average pooling and linear projection
- The queries of the 2N entity token sequences are decoded by cross-attention over the image features, and the bounding box B is predicted by an FFN (see the sketch below)
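A hedged sketch of the entity grounding module as described above: entity token hidden states are average-pooled and linearly projected into queries, which cross-attend to the image features, and an FFN regresses a box. The layer sizes, single-layer design, and box parameterization here are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class EntityGroundingModule(nn.Module):
    def __init__(self, hidden_dim=768, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)  # query projection after pooling
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.box_ffn = nn.Sequential(                  # box regression head
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4), nn.Sigmoid(),    # normalized box coordinates
        )

    def forward(self, entity_token_states, image_features):
        # entity_token_states: (num_entities, seq_len, hidden_dim) decoder hidden states
        #                      of each entity's token sequence
        # image_features:      (1, num_patches, hidden_dim) from the vision encoder
        queries = self.proj(entity_token_states.mean(dim=1)).unsqueeze(0)  # avg-pool + project
        attended, _ = self.cross_attn(queries, image_features, image_features)
        return self.box_ffn(attended.squeeze(0))       # (num_entities, 4)


# Toy usage with random tensors
module = EntityGroundingModule()
boxes = module(torch.randn(6, 5, 768), torch.randn(1, 197, 768))
print(boxes.shape)  # torch.Size([6, 4])
```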
Proposed Method
Figure 3. Illustration of the entity grounding module and the category conversion module. Left: the entity grounding module localizes the entities within scene graph sequences by predicting their bounding boxes. Right: the category conversion module maps vocabulary sequence predictions into categorical predictions.
Proposed Method: Relationship Construction
Category Conversion Module
- The categories of entities and predicates are represented by tokens from the language vocabulary space
- The target category sets (entity and predicate) are tokenized into token sequences by the VLM's tokenizer; the superscript c denotes the category index and T is the sequence length of each tokenized category name
- The categorical prediction is composed by scoring the generated tokens against these tokenized category names (a sketch follows below)
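A hedged sketch of the category conversion idea: each target category name is tokenized, and a generated name span is converted into a categorical score by averaging the predicted log-probabilities of that category's tokens. This scoring rule is a plausible reading of the slide, not the paper's exact formula; names and the toy vocabulary are assumptions.

```python
import torch


def convert_to_category(token_logits, category_token_ids):
    """
    token_logits:       (T, vocab_size) classification scores for a generated name span
    category_token_ids: dict {category_name: list of token ids}
    Returns the best-scoring category name and all category scores.
    """
    log_probs = token_logits.log_softmax(dim=-1)
    scores = {}
    for name, ids in category_token_ids.items():
        ids = ids[: log_probs.size(0)]                  # truncate to available positions
        positions = torch.arange(len(ids))
        scores[name] = log_probs[positions, torch.tensor(ids)].mean().item()
    best = max(scores, key=scores.get)
    return best, scores


# Toy usage with a 10-token vocabulary and two hypothetical categories
logits = torch.randn(3, 10)
best, scores = convert_to_category(logits, {"person": [1, 4], "surfboard": [2, 7, 9]})
print(best, scores)
```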
Proposed Method: Learning
- SGG training uses a multi-task loss consisting of 2 parts:
- A standard next-token-prediction language modeling loss, the same as in the VLM backbone
- A bounding-box regression loss for the entity grounding module: for each box prediction B in the SGS, the GIoU and L1 losses measure the distance between B and the ground-truth box
(a sketch of the combined loss follows below)
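A hedged sketch of the multi-task loss described above: the VLM's next-token language modeling loss plus GIoU and L1 box regression losses for the entity grounding module. The loss weights are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou


def sgg_loss(lm_logits, target_tokens, pred_boxes, gt_boxes,
             lambda_giou=2.0, lambda_l1=5.0):
    # Standard next-token prediction loss, as in the VLM backbone
    lm_loss = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                              target_tokens.view(-1))
    # Box regression losses for the entity grounding module (boxes in xyxy format)
    giou_loss = (1.0 - torch.diag(generalized_box_iou(pred_boxes, gt_boxes))).mean()
    l1_loss = F.l1_loss(pred_boxes, gt_boxes)
    return lm_loss + lambda_giou * giou_loss + lambda_l1 * l1_loss


# Toy usage with random logits and a single predicted/ground-truth box pair
loss = sgg_loss(torch.randn(4, 7, 100), torch.randint(0, 100, (4, 7)),
                torch.tensor([[0.1, 0.1, 0.5, 0.5]]), torch.tensor([[0.2, 0.2, 0.6, 0.6]]))
print(loss.item())
```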
Proposed Method: Inference
Relationship Triplet Construction
- Employ a heuristic rule to match patterns such as "subject [ENT] predicate [REL] object [ENT]"
- Select the top 3 matches based on prediction scores
Post-processing
- Refine the initial triplets into the final relationship triplets through filtering and ranking
- Remove self-connected relationships
- Apply a non-maximum suppression strategy to eliminate redundant relationships
(a parsing sketch follows below)
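A hedged sketch of the triplet construction step: a heuristic pattern match over the generated sequence, followed by removal of self-connected relationships and exact duplicates. The real pipeline additionally keeps the top matches by prediction score and applies NMS over the grounded boxes, which this sketch omits; the regex and function name are assumptions.

```python
import re

# Matches "subject [ENT] predicate [REL] object [ENT]" spans in a generated sequence
# (the instruction prefix is assumed to have been stripped already).
TRIPLET_PATTERN = re.compile(r"(.+?)\s*\[ENT\]\s*(.+?)\s*\[REL\]\s*(.+?)\s*\[ENT\]")


def parse_triplets(sequence):
    triplets = []
    for subj, pred, obj in TRIPLET_PATTERN.findall(sequence):
        subj = re.sub(r"^\s*and\s+", "", subj).strip()  # drop the "and" joining triplets
        pred, obj = pred.strip(), obj.strip()
        if subj == obj:                                 # remove self-connected relationships
            continue
        if (subj, pred, obj) in triplets:               # drop exact duplicates
            continue
        triplets.append((subj, pred, obj))
    return triplets


seq = "man [ENT] riding [REL] horse [ENT] and horse [ENT] on [REL] grass [ENT]"
print(parse_triplets(seq))
# [('man', 'riding', 'horse'), ('horse', 'on', 'grass')]
```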
Experiments
Table 2. Closed-vocabulary SGDet performance on the VG and PSG datasets. D: dataset name; B: visual backbone; M: method; ViT-B*: uses the BLIP ViT visual encoder as backbone; -c: denotes use of a closed-set classifier.
Table 3. Ablation study of the entity grounding module on the PSG dataset. L: the number of transformer layers within the entity grounding module; C: the design choice for entity grounding.
Table 4. SGG performance and inference time with different output sequence lengths. SL: output sequence length; Sec: inference time in seconds per image.
Experiments
Table 5. Analysis of the quality of generated scene graph sequences. SL: generated sequence length; # Trip.: average number of relation triplets; # U. Trip.: average number of unique relation triplets; # [REL]: average number of occurrences of [REL] tokens; % Valid: percentage of valid relation triplets.
Table 6. Performance on the visual grounding task. RC, RC+ and RCg denote the RefCOCO, RefCOCO+ and RefCOCOg datasets, respectively.
Table 7. Performance comparison between the initial pre-trained VLM and PGSG on GQA. Answering accuracy is reported per question type defined by the benchmark. E: denotes the entity-only scene graph sequence ablation.
Table 8. Performance comparison on COCO image captioning.
Experiments
Figure 4. Visualization of scene graph sequence predictions of PGSG.
Conclusion
- Presents a VLM-based open-vocabulary SGG model that formulates SGG as an image-to-graph translation paradigm, addressing the difficult problem of generating scene graphs with open-vocabulary predicates
- Limitation: performance is suboptimal in the standard closed-vocabulary setting, partly because the VLM uses a smaller input resolution (e.g., 384x384) than traditional SGG methods (e.g., 800x1333)
- Future work: improving the perception performance of VLMs with high-resolution inputs is a challenging and important direction