[NS][Lab_Seminar_240619]A Simple Baseline for Weakly-Supervised Scene Graph Generation.pptx

thanhdowork, Jun 28, 2024

About This Presentation

A Simple Baseline for Weakly-Supervised Scene Graph Generation


Slide Content

A Simple Baseline for Weakly-Supervised Scene Graph Generation
Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/06/19
Jing Shi et al., ICCV 2021

Introduction
Figure 1. Demonstration of the WS-SGG task. During training, only pairs of images and ungrounded scene graphs are provided. At test time, given an image, the model should output the full scene graph.

Introduction
- Given an image, SGG generates a scene graph.
- Most current SGG models are trained with full supervision from scene graph annotations, which rely on expensive annotations of object locations and relations and are hard to generalize to out-of-domain objects or relations.
- This work investigates weakly-supervised SGG: training uses an ungrounded scene graph composed of image-level object and relation labels, without bounding boxes.
- Previous work: VSPNet [52] treats objects as nodes and relations as edges, using a bipartite graph with object nodes in one part and relation nodes in the other, where the role (subject, object) becomes the edge.
- This paper proposes a simple WS-SGG baseline consisting of:
  - a weakly-supervised graph matching module that aligns the visual and label graphs, and
  - a standard supervised SGG model that generates the scene graph.
- An efficient first-order graph matching algorithm (matching by node similarity only) is trained with a Multi-Instance Learning mechanism and a contrastive learning objective.
- The matched scene graph serves as pseudo ground truth to train the standard SGG model.

[52] Alireza Zareian, Svebor Karaman, and Shih-Fu Chang. Weakly supervised visual semantic parsing. In CVPR, pages 3736-3745, 2020.

Proposed Algorithm
[42] Yu-Siang Wang, Chenxi Liu, Xiaohui Zeng, and Alan Yuille. Scene graph parsing as dependency parsing. arXiv preprint arXiv:1803.09189, 2018.

Proposed Algorithm
Figure 2. Demonstration of the pipeline. The whole model is composed of a weakly-supervised graph matching module and a supervised SGG model. In the graph matching module, first-order graph matching is applied, and the parameters of F are learned via contrastive learning.

Proposed Algorithm: Problem Formulation
- Given an image I, the goal is to generate a visual graph G = {N, E}, where each node is a bounding box paired with an entity class and each edge is a predicate class connecting a subject node and an object node.
- During training, the visual graph has unknown entity and predicate classes; the label graph G' (object classes and relationships, with no locations) is obtained with a language parser [42].
- The label graph denotes the ungrounded scene graph label; it contains no location information for its entity nodes.

Proposed Algorithm: Weakly-Supervised Graph Matching
- Objective: align the visual graph G and the label graph G' to obtain class labels for the nodes and edges in G.
- Node embeddings: embedding functions F and F' encode visual node features e and label node features e'. Visual features are extracted via RoIPooling and concatenated with spatial features.
- First-Order Graph Matching (FOGM): compute the cosine similarity between each node h_i from G and h'_j from G', then solve for the alignment that maximizes the total similarity.
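The first-order matching step above can be sketched with off-the-shelf tools: cosine similarity between the two embedding sets, then the Hungarian algorithm for the optimal assignment. A minimal illustration, assuming precomputed embedding matrices in place of the learned functions F and F':

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def first_order_graph_matching(h, h_prime):
    """Align visual node embeddings h (n x d) to label node embeddings
    h_prime (m x d) by maximizing total cosine similarity.
    Illustrative sketch; the learned embedding functions F and F' are
    replaced here by precomputed embeddings."""
    # L2-normalize rows so the dot product equals cosine similarity
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    h_prime = h_prime / np.linalg.norm(h_prime, axis=1, keepdims=True)
    sim = h @ h_prime.T                      # (n, m) similarity matrix
    # The Hungarian algorithm minimizes cost, so negate the similarity
    rows, cols = linear_sum_assignment(-sim)
    pairs = list(zip(rows.tolist(), cols.tolist()))
    return pairs, float(sim[rows, cols].sum())
```

Because the matching uses node similarity only (no edge-consistency terms), it reduces to a standard linear assignment problem, which `linear_sum_assignment` solves in polynomial time.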

Proposed Algorithm: Weakly-Supervised Graph Matching
- Hungarian algorithm: solves the optimal alignment efficiently.
- Contrastive learning with a triplet loss: the embedding functions are learned by maximizing the similarity of matched node pairs and minimizing that of unmatched pairs, enforced with a positive margin.
- Negative samples are unmatched objects within and across images, which strengthens the learned embeddings.
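The triplet objective described above can be written as a small NumPy sketch; the margin value and the flat list of negatives are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def triplet_matching_loss(anchor, positive, negatives, margin=0.2):
    """Margin-based triplet loss over cosine similarities.
    `anchor` is a matched visual-node embedding, `positive` its aligned
    label-node embedding, and `negatives` are unmatched label-node
    embeddings drawn from within and across images.
    The margin of 0.2 is an illustrative choice, not the paper's value."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    pos_sim = cos(anchor, positive)
    # Hinge: each negative should be at least `margin` less similar than the positive
    hinge = [max(0.0, margin + cos(anchor, n) - pos_sim) for n in negatives]
    return float(np.mean(hinge))
```

The loss is zero once every negative is separated from the positive by the margin, so training focuses on the hard, still-confusable negatives.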

Proposed Algorithm: SGG Generation
- Train a standard supervised SGG model using the pseudo scene graph labels obtained from WSGM.
- Pseudo labels: the aligned labels from WSGM serve as ground truth for training.
- Supervised learning: cross-entropy loss trains the object and predicate classifiers:
  L_SGG = CrossEntropy(object predictions, pseudo labels) + CrossEntropy(predicate predictions, pseudo labels)
- Combined loss: the total loss sums the graph matching loss and the SGG loss.
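The two-term objective can be sketched as follows, in single-example form; the helper is a plain softmax cross-entropy, and the function interface is an assumption for illustration:

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (log-sum-exp stabilized)."""
    z = logits - logits.max()
    return float(-(z[label] - np.log(np.exp(z).sum())))

def total_loss(obj_logits, obj_label, pred_logits, pred_label, match_loss):
    """Sketch of the combined objective: total = L_match + L_SGG, where
    L_SGG sums cross-entropy over object and predicate classification
    against the WSGM pseudo labels."""
    l_sgg = cross_entropy(obj_logits, obj_label) + cross_entropy(pred_logits, pred_label)
    return match_loss + l_sgg
```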

Proposed Algorithm: Selection of Node Embedding F
- A Multi-Layer Perceptron (MLP) is the default embedding function; in this case, no edge information is encoded into the node representation.
- Alternatively, a GNN can encode edge context into the nodes; the paper proposes Edge Attention Message Passing (EAMP), which allows edge-type features to be explicitly encoded into node representations.
- For EAMP, the initial node state is the input node embedding. At the k-th iteration, a score measures the confidence that an edge points from node i to node j.
- D_p is an embedding dictionary of all predicates, whose first entry is the background class; the valid predicate dictionary is D_p without the background embedding.

Proposed Algorithm: Selection of Node Embedding F
- To make the message-passing process aware of edge type, an attention score is computed from the pair-wise node feature against the valid predicate dictionary.
- When no valid relation exists between two nodes, the predicate attention is augmented with the edge confidence so it can attend to the background class; the attended predicate representation is obtained from this augmented attention.
- Each node aggregates its neighbors' information through subject and object FC layers. The aggregation takes the predicate type into account, making the message passing aware of relation categories, and a GRU updates the node feature.
- After K iterations, a refined node feature with edge-type context is obtained: soft attention over predicate types for the visual graph, hard attention for the label graph.
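A highly simplified PyTorch sketch of the EAMP idea: pair-wise node features attend over the valid predicate dictionary, messages combine the attended predicate representation with subject/object projections, and a GRU cell updates the node states. The layer shapes, message form, and loop-based implementation are illustrative assumptions, not the paper's exact design; the edge-confidence augmentation for the background class is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAttentionMessagePassing(nn.Module):
    """Simplified sketch of EAMP, not the paper's exact architecture."""
    def __init__(self, dim, num_predicates):
        super().__init__()
        self.pred_dict = nn.Embedding(num_predicates, dim)  # D_p, index 0 = background
        self.subj_fc = nn.Linear(dim, dim)   # subject-side aggregation
        self.obj_fc = nn.Linear(dim, dim)    # object-side aggregation
        self.pair_fc = nn.Linear(2 * dim, dim)
        self.gru = nn.GRUCell(dim, dim)      # node-state update

    def forward(self, h, edges, num_steps=2):
        # h: (n, dim) node states; edges: list of directed (i, j) pairs
        for _ in range(num_steps):
            msgs = torch.zeros_like(h)
            for i, j in edges:
                pair = self.pair_fc(torch.cat([h[i], h[j]], dim=-1))
                # Soft attention over valid predicates (row 0 = background is skipped;
                # the background/edge-confidence augmentation is omitted here)
                valid = self.pred_dict.weight[1:]
                attn = F.softmax(valid @ pair, dim=0)    # (P-1,)
                pred_repr = attn @ valid                 # attended predicate feature
                msgs[j] = msgs[j] + self.subj_fc(h[i]) + pred_repr
                msgs[i] = msgs[i] + self.obj_fc(h[j])
            h = self.gru(msgs, h)                        # GRU node update
        return h
```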

Experiments
- Dataset: Visual Genome, 108,077 images with scene graph annotations. Two common splits:
  - the most frequent 150 object categories and 50 predicate types, with train/test 75,651/32,422 images;
  - 200 object categories and 100 predicate types, with train/test 73,801/25,857 images.
- Metrics:
  - Instance-level recall = #correctly matched object instances / #all instances
  - Object-level recall = #correctly matched object categories / #all instances
  - Predicate-level recall = #correctly matched relations / #all relations
- Baselines: PPR-FCN [55], VTransE-MIL [55], VSPNet [52]
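As a toy illustration of the first two recall metrics above (the dict layouts are assumptions for illustration, not the paper's evaluation code):

```python
def matching_recalls(pred_align, true_align, pred_cats, true_cats):
    """Instance-level recall counts visual nodes matched to the correct
    ground-truth instance; object-level recall only requires the correct
    category. Predicate-level recall is computed analogously over relations.
    pred_align/true_align: visual-node index -> label-node index.
    pred_cats/true_cats: visual-node index -> category name."""
    instance = sum(pred_align.get(v) == g for v, g in true_align.items()) / len(true_align)
    object_level = sum(pred_cats.get(v) == c for v, c in true_cats.items()) / len(true_cats)
    return instance, object_level
```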


Conclusion
- The WS-SGG task is decoupled into a WSGM module and a standard SGG model, with a contrastive learning framework based on efficient first-order graph matching.
- The method is much simpler than the previous approach while achieving significant improvements in both graph matching accuracy and SGG performance.