[NS][Lab_Seminar_240729]VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering.pptx

thanhdowork · 18 slides · Aug 06, 2024

About This Presentation

VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering


Slide Content

VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering. Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea. E-mail: osfa19730@catholic.ac.kr. 2024/07/29. Yanan Wang et al., ICCV 2023.

Introduction Figure 1: Overview of VQA-GNN. Given an image and a QA sentence, we obtain unstructured knowledge (e.g., QA-concept node p and QA-context node z) and structured knowledge (e.g., scene-graph and concept-graph), and then unify them to perform bidirectional fusion for VQA.

Method Multimodal semantic graph Figure 2. Reasoning procedure of VQA-GNN. We first build a multimodal semantic graph for each given image-QA pair to unify unstructured (e.g., "node p" and "node z") and structured (e.g., "scene-graph" and "concept-graph") multimodal knowledge. Then we perform inter-modal message passing with a multimodal GNN-based bidirectional fusion method to update the representations of nodes z, p, v_i, and c_i for k+1 iterations in 2 steps. Finally, we predict the answer with these updated node representations. Here, "S" and "C" indicate scene-graph and concept-graph respectively. "LM_encoder" indicates a language model used to finetune the QA-context node representation, and "GNN" indicates a relation-graph neural network for iterative message passing.

Method Multimodal semantic graph - Scene-graph encoding
- Given an image, apply a pretrained scene-graph generator to extract a scene graph consisting of the (subject, predicate, object) triplets obtained at recall@20, representing the structured image context.
- Apply a pretrained object-detection model to embed the set of scene-graph nodes (a detector-side sketch follows).
- The predicted predicates define the edge types in the scene graph, giving the set of scene-graph edges.
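
Where the slide mentions a pretrained object-detection model for node embeddings, a minimal sketch of the node side is given below, assuming a torchvision Faster R-CNN as a stand-in detector; the predicate (edge) prediction of the actual scene-graph generator is not shown and is assumed to come from an external SGG model.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Stand-in detector used only for illustration, not the authors' checkpoint.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def scene_graph_nodes(image_tensor, score_threshold=0.5):
    # image_tensor: (3, H, W) float tensor scaled to [0, 1]
    with torch.no_grad():
        out = detector([image_tensor])[0]
    keep = out["scores"] >= score_threshold
    # kept boxes/labels become candidate scene-graph nodes; their region
    # features would serve as the node embeddings described above
    return out["boxes"][keep], out["labels"][keep]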

Method Multimodal semantic graph - QA-concept node retrieval
- With the assumption that the global image context of the correct choice aligns with the local image context => employ a pretrained sentence-BERT model to compute the similarity between each answer choice and all region-image descriptions in the Visual Genome dataset.
- Extract the relevant region images that capture the global image context associated with each choice.
- Retrieve the top-10 results and embed them with the same object detector.
- Average the embeddings to obtain the QA-concept node, denoted as p (a minimal sketch of this retrieval step follows).
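
A minimal sketch of this retrieval step, assuming the sentence-transformers library, a hypothetical list region_descriptions of Visual Genome region captions, and a matching tensor region_image_embeddings of detector features for those regions; the model name is an arbitrary sentence-BERT variant.

import torch
from sentence_transformers import SentenceTransformer, util

def qa_concept_node(answer_choice, region_descriptions, region_image_embeddings, top_k=10):
    model = SentenceTransformer("all-MiniLM-L6-v2")           # any sentence-BERT variant
    query = model.encode(answer_choice, convert_to_tensor=True)
    corpus = model.encode(region_descriptions, convert_to_tensor=True)
    scores = util.cos_sim(query, corpus)[0]                   # similarity to every region description
    top_idx = torch.topk(scores, k=top_k).indices             # top-10 matching regions
    # average the detector embeddings of the retrieved regions -> QA-concept node p
    return region_image_embeddings[top_idx].mean(dim=0)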

Method Multimodal semantic graph - Concept-graph retrieval
- Extract concept entities from both the image and the answer choices.
- Consider detected object names as potential contextual entities.
- For each answer choice, ground phrases that are mentioned as concepts in the ConceptNet KG (e.g., "shop").
- Use the grounded phrases to retrieve their 1-hop neighbor nodes from the ConceptNet KG.
- Use a word2vec model to compute a relevance score between each concept-node candidate and the answer choice, and prune irrelevant nodes whose relevance score is below 0.6 (a hedged sketch of this pruning rule follows).
- Combine the parsed local concept entities of the image with the retrieved subgraph.
- Since ConceptNet encompasses various relation types, if a local concept entity is found adjacent to a retrieved entity => build a new knowledge triple, e.g., (bottle, AtLocation, beverage).
- Construct a concept graph to depict the structured knowledge at the concept level, obtaining a collection of concept-graph nodes.
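
A hedged sketch of the pruning rule only, assuming gensim pretrained word2vec vectors (the vector file path is an assumption) and a candidate list already retrieved from ConceptNet; the 1-hop retrieval itself is not shown.

import numpy as np
from gensim.models import KeyedVectors

# assumed pretrained vectors; any word2vec binary works here
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def relevance(concept, answer_words):
    # average word2vec similarity between a candidate concept and the answer-choice tokens
    sims = [w2v.similarity(concept, w) for w in answer_words
            if w in w2v and concept in w2v]
    return float(np.mean(sims)) if sims else 0.0

def prune_candidates(candidates, answer_choice, threshold=0.6):
    # keep only 1-hop ConceptNet neighbors whose relevance score is >= 0.6
    answer_words = answer_choice.lower().split()
    return [c for c in candidates if relevance(c, answer_words) >= threshold]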

Method Multimodal semantic graph - QA-context node encoding
- Introduce a QA-context node z to inter-connect the scene-graph and concept-graph using 3 additional relation types: question edge r(q), answer edge r(a), and image edge r(e).
- r(e) links z with the scene-graph nodes V(s), capturing the relationship between the QA context and relevant entities in the scene-graph.
- r(q) and r(a) link z with entities extracted from the question and answer text, capturing the relationship between the QA context and relevant entities in the concept-graph.
- Construct the multimodal semantic graph G = {S, C}, which provides a joint reasoning space and includes the 2 sub-graphs (scene-graph and concept-graph) and the 2 super nodes (QA-concept node and QA-context node).
- Employ a RoBERTa LM as the encoder of the QA-context node z and finetune it with the GNN modules (a minimal encoding sketch follows).
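
A minimal sketch of the QA-context node encoding, assuming the Hugging Face transformers RoBERTa-Large checkpoint; taking the <s> token state as z is an illustrative pooling choice, and since the paper finetunes the LM jointly with the GNN, no_grad would be dropped during training.

import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
encoder = RobertaModel.from_pretrained("roberta-large")

def encode_qa_context(question, answer_choice):
    inputs = tokenizer(question + " " + answer_choice, return_tensors="pt", truncation=True)
    with torch.no_grad():                           # remove when finetuning with the GNN
        hidden = encoder(**inputs).last_hidden_state
    return hidden[:, 0, :]                          # <s> token state as QA-context node z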

Method Multimodal GNN-based bidirectional fusion
- The relation-GNN is built on GAT by introducing a multi-relation-aware message for attention-based message aggregation.
- The node types T = {Z, P, S, C} of the multimodal semantic graph cover the QA-context node z, the QA-concept node p, the scene-graph nodes, and the concept-graph nodes.
- A relation edge should capture the relationship from node i to node j.
- Obtain one-hot node-type embeddings, concatenate them with the edge embedding, and pass them through a 2-layer MLP to generate the multi-relation embedding.
- The multi-relation-aware message is a linear transformation of the representation of each source node i, conditioned on this multi-relation embedding (a sketch follows).
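
A sketch of the multi-relation-aware message under stated assumptions: one-hot type embeddings of the source and target nodes are concatenated with an edge-type embedding and passed through a 2-layer MLP to obtain a relation embedding, and the message is a linear transformation of the source node representation conditioned on it. Dimensions and the exact conditioning are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class MultiRelationMessage(nn.Module):
    def __init__(self, n_node_types, n_edge_types, hidden_dim):
        super().__init__()
        self.edge_emb = nn.Embedding(n_edge_types, hidden_dim)
        self.rel_mlp = nn.Sequential(                   # the 2-layer MLP from the slide
            nn.Linear(2 * n_node_types + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.msg_lin = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, h_src, src_type_onehot, dst_type_onehot, edge_type):
        # multi-relation embedding for the (type_i, type_j, relation) triple
        r = self.rel_mlp(torch.cat([src_type_onehot, dst_type_onehot,
                                    self.edge_emb(edge_type)], dim=-1))
        # message from node i to node j: linear transform of h_i conditioned on r
        return self.msg_lin(torch.cat([h_src, r], dim=-1))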

Method Multimodal GNN-based bidirectional fusion
- Perform message passing to update the node representations in each graph in parallel, aggregating the multi-relation-aware messages from the neighborhood of each node.
- Each node is updated via a linear transformation over its previous state and the aggregated messages, so that the structured nodes (graph entities) and unstructured nodes (super nodes) are fused bidirectionally (a generic aggregation sketch follows).
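
A generic single-head, GAT-style aggregation is sketched below as a stand-in for the relation-GNN update; the per-node softmax loop is written for clarity rather than speed, and the update rule (ReLU over a linear transform of the old state and aggregated messages) is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveAggregate(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.att = nn.Linear(2 * hidden_dim, 1)
        self.update = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, h, messages, edge_index):
        # h: (N, d) node states, messages: (E, d), edge_index: (2, E) of (source, target)
        dst = edge_index[1]                          # target node of each edge
        scores = self.att(torch.cat([h[dst], messages], dim=-1)).squeeze(-1)
        alpha = torch.zeros_like(scores)
        for j in dst.unique():                       # softmax over each node's incoming edges
            mask = dst == j
            alpha[mask] = F.softmax(scores[mask], dim=0)
        agg = torch.zeros_like(h).index_add_(0, dst, alpha.unsqueeze(-1) * messages)
        return F.relu(self.update(torch.cat([h, agg], dim=-1)))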

Method Inference and Learning
- To identify the correct answer, compute a probability for each answer choice from its multimodal semantic knowledge: the scene-graph, concept-graph, QA-context node, and QA-concept node representations.
- logit(a): the confidence score of answer choice a (a scoring sketch follows).
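
A sketch of the choice scoring under assumptions: the pooled scene-graph and concept-graph node states are concatenated with z and p and fed to a small MLP to produce logit(a); the mean pooling and MLP shape are illustrative choices, not the authors' exact head.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerScorer(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4 * hidden_dim, hidden_dim),
                                 nn.ReLU(),
                                 nn.Linear(hidden_dim, 1))

    def forward(self, z, p, scene_nodes, concept_nodes):
        s = scene_nodes.mean(dim=0)                  # pooled scene-graph representation
        c = concept_nodes.mean(dim=0)                # pooled concept-graph representation
        return self.mlp(torch.cat([z, p, s, c], dim=-1)).squeeze(-1)   # logit(a)

# Training: stack the logits of the 4 candidate answers and apply cross-entropy
# against the index of the correct choice, e.g.
# loss = F.cross_entropy(torch.stack(choice_logits).unsqueeze(0), correct_index)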

Experiments Setup
Visual Commonsense Reasoning (VCR)
- Contains 290k pairs of questions, answers, and rationales over 110k unique movie scenes.
- VCR consists of 2 tasks: VQA (Q->A) and answer justification (QA->R).
- Each question in the dataset is provided with 4 candidate answers.
- Q->A: select the best answer; QA->R: justify the given question-answer pair by picking the best rationale out of the 4 candidates.
- Jointly train VQA-GNN on Q->A and QA->R with a shared LM encoder, using the multimodal semantic graph for Q->A and the concept graph retrieved from the question-answer pair with a rationale candidate for QA->R.
- Use a pretrained RoBERTa-Large model to embed the QA-context node.
GQA dataset
- Contains 1.5M questions corresponding to 1,842 answer tokens, and 110K scene graphs.
- Define the question as a context node (node q) that fully connects the visual and textual SGs respectively to structure the multimodal semantic graphs.
- Node q is embedded with the pretrained RoBERTa-Large model; object node representations in the visual SG are initialized with object features, and object nodes in the textual SG by concatenating GloVe-based word embeddings of the object name and attributes (a sketch of this textual-node initialization follows).
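
A sketch of the GQA textual scene-graph node initialization mentioned above, assuming gensim's downloadable GloVe vectors; the exact GloVe variant and the attribute averaging are assumptions for illustration.

import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-300")          # assumed 300-d GloVe vectors

def textual_sg_node(object_name, attributes):
    def emb(word):
        return glove[word] if word in glove else np.zeros(300, dtype=np.float32)
    name_vec = emb(object_name)
    attr_vec = (np.mean([emb(a) for a in attributes], axis=0)
                if attributes else np.zeros(300, dtype=np.float32))
    # concatenate name and attribute embeddings -> 600-d textual SG node feature
    return np.concatenate([name_vec, attr_vec])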

Experiments Evaluation on VCR dataset

Experiments Effectiveness of the multimodal semantic graph

Experiments Analysis of the multimodal GNN method Figure 4. Ablation architectures. We find that our final VQA-GNN architecture with two modality-specialized GNNs overcomes the representation gaps between modalities

Experiments GQA dataset Table 4. Accuracy scores on the GQA validation set. All models are trained under the realistic setup of not using the annotated semantic functional programs

Experiments Ablation study on the bidirectional fusion Table 5. Ablation results on the effect of our proposed bidirectional fusion for GQA

Conclusion
- Proposed VQA-GNN, which unifies unstructured and structured multimodal knowledge to perform joint reasoning about the scene.
- Outperforms strong baseline VQA methods by 3.2% on VCR (Q-AR) and 4.6% on GQA => performs concept-level reasoning well.
- Ablation studies demonstrate the efficacy of the bidirectional fusion and the multimodal GNN method in unifying unstructured and structured multimodal knowledge.