[NS][Lab_Seminar_240619]A Simple Baseline for Weakly-Supervised Scene Graph Generation.pptx
About This Presentation
A Simple Baseline for Weakly-Supervised Scene Graph Generation
Size: 1.85 MB
Language: en
Added: Jun 28, 2024
Slides: 15 pages
Slide Content
A Simple Baseline for Weakly-Supervised Scene Graph Generation
Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/06/19
Paper: Jing Shi et al., ICCV 2021
Introduction
Figure 1. Demonstration of the WS-SGG task. During training, only pairs of images and their ungrounded scene graphs are provided. At test time, given an image, the model should output the full grounded scene graph.
Introduction
- Given an image, SGG aims to generate a scene graph.
- Most current SGG models are trained with full scene-graph supervision, which relies on expensive annotations of object locations and relations and generalizes poorly to out-of-domain objects or relations.
- This work investigates weakly-supervised SGG (WS-SGG): training uses only the ungrounded scene graph, i.e., image-level object and relation labels without bounding boxes.
- Previous work: VSPNet [52] represents both objects and relations as nodes of a bipartite graph, with one part holding object nodes and the other relation nodes, where the role (subject or object) becomes an edge.
- The proposed simple WS-SGG baseline:
  - A weakly-supervised graph matching module aligns the visual graph with the label graph.
  - A standard supervised SGG model then generates the scene graph.
  - Matching uses an efficient first-order graph matching algorithm (nodes are matched by similarity only), trained with a Multi-Instance Learning mechanism and a contrastive learning objective.
  - The matched scene graph serves as pseudo ground truth for training the standard SGG model.
[52] Alireza Zareian, Svebor Karaman, and Shih-Fu Chang. Weakly supervised visual semantic parsing. In CVPR, pages 3736–3745, 2020
Proposed Algorithm [42] Yu-Siang Wang, Chenxi Liu, Xiaohui Zeng, and Alan Yuille. Scene graph parsing as dependency parsing. arXiv preprint arXiv:1803.09189, 2018
Proposed Algorithm
Figure 2. Demonstration of the pipeline. The model consists of a weakly-supervised graph matching module and a supervised SGG model. In the graph matching module, first-order graph matching is applied and the parameters of the embedding function F are learned via contrastive learning.
Proposed Algorithm
Problem Formulation
- Given an image I, the goal is to generate a visual graph G = {N, E}, where each node is a bounding box paired with an entity class and each edge is a predicate class connecting a subject node and an object node.
- During weakly-supervised training, the entity and predicate classes of G are unknown; supervision comes only from the label graph G', the ungrounded scene graph (object classes and relationships with no locations), obtained with a language parser [42].
- The label graph contains no location information for its entity nodes.
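The visual graph G and the label graph G' share the same node/edge structure; only the label graph lacks bounding boxes. A minimal sketch of such a data structure (all names are illustrative, not from the paper):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    cls: str                                                  # entity class, e.g. "person"
    box: Optional[Tuple[float, float, float, float]] = None   # None in the ungrounded label graph

@dataclass
class Edge:
    subj: int         # index of the subject node
    obj: int          # index of the object node
    predicate: str    # predicate class, e.g. "riding"

@dataclass
class SceneGraph:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)
```

A label graph would then be a SceneGraph whose nodes all have `box=None`, while the visual graph fills in detected boxes.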
Proposed Algorithm
Weakly-Supervised Graph Matching
- Objective: align the visual graph G and the label graph G' to obtain class labels for the nodes and edges of G.
- Node embeddings: embedding functions F and F' encode visual node features e and label node features e' into h and h'. Visual features are extracted via RoIPooling and concatenated with spatial features.
- First-Order Graph Matching (FOGM): compute the cosine similarity cos(h_i, h'_j) between each node h_i from G and each node h'_j from G'.
- Optimal alignment: solve for the one-to-one alignment pi that maximizes the total similarity, pi* = argmax_pi sum_i cos(h_i, h'_pi(i)).
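First-order matching reduces to a linear assignment problem on the pairwise cosine-similarity matrix, which (as the next slide notes) the Hungarian algorithm solves exactly. A minimal sketch using SciPy's assignment solver (function name and shapes are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def first_order_graph_matching(h, h_prime):
    """Align visual node embeddings h (n x d) with label node embeddings
    h_prime (m x d) by maximizing the total cosine similarity."""
    # L2-normalize rows so the dot product equals cosine similarity
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    h_prime = h_prime / np.linalg.norm(h_prime, axis=1, keepdims=True)
    sim = h @ h_prime.T                                       # pairwise cosine similarities
    rows, cols = linear_sum_assignment(sim, maximize=True)    # Hungarian algorithm
    return list(zip(rows, cols)), sim[rows, cols].sum()
```

With orthonormal embeddings that are a permuted copy of each other, the matcher recovers the permutation exactly.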
Proposed Algorithm
Weakly-Supervised Graph Matching
- Hungarian algorithm: solves the optimal alignment efficiently.
- Contrastive learning with a triplet loss: the embedding functions are learned by maximizing the similarity of matched node pairs and keeping unmatched pairs at least a positive margin less similar.
- Negative samples are unmatched objects within and across images, which strengthens the learned embeddings.
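The triplet objective above can be sketched as follows; the margin value and the averaging over negatives are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negatives, margin=0.2):
    """Contrastive triplet loss on cosine similarity: pull the matched
    (anchor, positive) pair together and push each unmatched negative
    at least `margin` below the positive similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos_sim = cos(anchor, positive)
    # hinge per negative: zero once the gap exceeds the margin
    losses = [max(0.0, margin - pos_sim + cos(anchor, n)) for n in negatives]
    return sum(losses) / len(negatives)
```

When the anchor already matches its positive perfectly and negatives are orthogonal, the loss is zero and no gradient flows.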
Proposed Algorithm
SGG Generation
- Train a standard supervised SGG model using the pseudo scene graph labels obtained from the weakly-supervised graph matching (WSGM) module.
- Pseudo labels: the aligned labels from WSGM serve as ground truth for training.
- Supervised learning: cross-entropy losses train the object and predicate classifiers:
  L_SGG = CrossEntropy(object predictions, pseudo ground truth) + CrossEntropy(predicate predictions, pseudo ground truth)
- Combined loss: the total loss is the sum of the graph matching loss and the SGG loss.
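A minimal numpy sketch of L_SGG as the sum of the two cross-entropy terms (function names and shapes are illustrative):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of logits."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def sgg_loss(obj_logits, obj_pseudo, pred_logits, pred_pseudo):
    """Supervised SGG loss on pseudo labels from graph matching:
    object-classification CE plus predicate-classification CE."""
    return cross_entropy(obj_logits, obj_pseudo) + cross_entropy(pred_logits, pred_pseudo)
```

Confident, correct logits drive both terms toward zero, so the loss rewards agreement with the pseudo ground truth.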
Proposed Algorithm
Selection of Node Embedding F
- A Multi-Layer Perceptron (MLP) is the default embedding function, but it encodes no edge information into the node representation.
- A GNN can encode edge context into nodes; the authors propose Edge Attention Message Passing (EAMP), which explicitly encodes edge-type features into node representations.
- In EAMP, the initial node state is the input node embedding. At the k-th iteration, a score measures the confidence that an edge points from node i to node j.
- D_p is an embedding dictionary of all predicates, whose first entry is the background class; the valid predicate dictionary is D_p without the background embedding.
Proposed Algorithm
Selection of Node Embedding F
- To make message passing aware of edge type, an attention score is computed from the pair-wise node feature over the valid predicate dictionary.
- Since two nodes may have no valid relation, the predicate attention is augmented with the edge confidence so it can also attend to the background class; the attended predicate representation is obtained from this augmented attention.
- Each node aggregates its neighbors' information through subject and object FC layers. The aggregation takes the predicate type into account, making message passing aware of relation categories, and a GRU updates the node feature.
- After K iterations, refined node features with edge-type context are obtained: soft attention is used for predicate types and hard attention for the label graph.
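One EAMP iteration can be sketched roughly as below. This is a heavily simplified illustration under assumed shapes: all weight matrices are hypothetical stand-ins for the paper's layers, only the object FC path is shown for aggregation, and a plain tanh update replaces the GRU:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def eamp_step(H, D_p, W_self, W_obj, W_pair, w_conf):
    """One simplified EAMP iteration (all weights illustrative).
    H: (n, d) node states. D_p: (P, d) predicate dictionary with
    row 0 the background class; rows 1: are the valid predicates."""
    n, d = H.shape
    H_new = np.empty_like(H)
    for i in range(n):
        agg = np.zeros(d)
        for j in range(n):
            if i == j:
                continue
            pair = np.concatenate([H[i], H[j]]) @ W_pair      # (d,) pair-wise feature
            conf = 1.0 / (1.0 + np.exp(-(pair @ w_conf)))     # edge-existence confidence
            att = softmax(D_p[1:] @ pair)                     # attention over valid predicates
            # augment with the background class: low confidence -> attend to background
            att_aug = np.concatenate([[1.0 - conf], conf * att])
            p = att_aug @ D_p                                 # attended predicate representation
            agg += (H[j] @ W_obj) * p                         # predicate-modulated neighbor message
        # simple nonlinearity standing in for the GRU update in the paper
        H_new[i] = np.tanh(H[i] @ W_self + agg)
    return H_new
```

Note the augmented attention is still a valid distribution: it sums to (1 - conf) + conf = 1, so a low edge confidence simply shifts mass onto the background predicate.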
Experiments
- Dataset: Visual Genome, 108,077 images with scene graph annotations; two common splits:
  - Keep the 150 most frequent object categories and 50 predicate types, with 75,651/32,422 train/test images.
  - Keep 200 object categories and 100 predicate types, with 73,801/25,857 train/test images.
- Metrics:
  - Instance-level recall: #correctly matched object instances / #all instances
  - Object-level recall: #correctly matched object categories / #all instances
  - Predicate-level recall: #correctly matched relations / #all relations
- Baselines: PPR-FCN [55], VTransE-MIL [55], VSPNet [52]
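As a small illustration of the instance-level metric, recall over node alignments can be computed as the fraction of ground-truth matches the matcher recovers (function name and pair encoding are illustrative):

```python
def matching_recall(pred_pairs, gt_pairs):
    """Fraction of ground-truth (visual_node, label_node) alignments
    recovered by the matcher: #correct matches / #all ground-truth matches."""
    gt = set(gt_pairs)
    hits = sum(1 for p in pred_pairs if p in gt)
    return hits / len(gt) if gt else 0.0
```

Object- and predicate-level recall follow the same ratio pattern, counting category and relation hits instead of instance alignments.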
Conclusion
- The WS-SGG task is decoupled into a WSGM module and a standard SGG model, with a contrastive learning framework built on efficient first-order graph matching.
- The method is much simpler than the previous approach while achieving significant improvements in both graph matching accuracy and SGG performance.