Graph Representation Learning Methods for Visual Question Answering

- Multi-Graph Approach [147]: represents semantic, spatial, and implicit relations with three separate graphs. A GAT runs on each graph to assign importance and learn relation-aware node representations, and an aggregation step combines the GAT outputs to predict the answer.
- Symbolic Graphs [148]: uses symbolic graphs to represent image and question features. Nodes represent semantic units and edges represent relations; a message passing neural network aligns and processes the symbolic graphs.
- Graph Isomorphism Network for Graph Matching [150]: utilizes a graph isomorphism network to align and match the graph structures of the image and the question.
- TextVQA [154]: uses GNNs to integrate text recognition and reasoning for VQA over scene text.
- Visual Commonsense Reasoning [155], [156]: integrates commonsense knowledge graphs with GNNs for reasoning.

Figure overview:
- Graph Representations and Alignment [141]: diagram showing image and question graphs with the alignment process.
- Sparse Graph Representation [144]: visual of sparse graph generation for images conditioned on the question.
- Dynamic Visual Graphs [145]: diagram illustrating dynamic edge updates in visual graphs.
- Multi-Graph Approach [147]: diagram of semantic, spatial, and implicit graphs with GAT processing.
- Symbolic Graphs [148]: diagram of symbolic graphs for image and question with message passing.

References:
[141] D. Teney, L. Liu, and A. van den Hengel, "Graph-structured representations for visual question answering," in CVPR, 2017.
[144] W. Norcliffe-Brown, S. Vafeias, and S. Parisot, "Learning conditioned graph structures for interpretable visual question answering," in NeurIPS, vol. 31, 2018.
[145] R. Hu, A. Rohrbach, T. Darrell, and K. Saenko, "Language-conditioned graph networks for relational reasoning," in ICCV, 2019.
[147] L. Li, Z. Gan, Y. Cheng, and J. Liu, "Relation-aware graph attention network for visual question answering," in ICCV, 2019.
[148] E.-S. Kim, W. Y. Kang, K.-W. On, Y.-J. Heo, and B.-T. Zhang, "Hypergraph attention networks for multimodal learning," in CVPR, 2020.
[150] R. Saqur and K. Narasimhan, "Multimodal graph networks for compositional generalization in visual question answering," in NeurIPS, vol. 33, 2020.
[154] D. Gao, K. Li, R. Wang, S. Shan, and X. Chen, "Multi-modal graph neural network for joint reasoning on vision and scene text," in CVPR, 2020.
[155] A. Wu, L. Zhu, Y. Han, and Y. Yang, "Connective cognition network for directional visual commonsense reasoning," in NeurIPS, vol. 32, 2019.
[156] W. Yu, J. Zhou, W. Yu, X. Liang, and N. Xiao, "Heterogeneous graph learning for visual commonsense reasoning," in NeurIPS, vol. 32, 2019.
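The per-graph attention-then-aggregate pattern of the multi-graph approach [147] can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual model: the random toy features, the single shared attention parameters, and the sum fusion are all assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gat_layer(H, A, W, a):
    """One simplified graph-attention layer.
    H: (N, F) node features; A: (N, N) adjacency (nonzero = edge, self-loops included);
    W: (F, F') projection; a: (2*F',) attention vector."""
    Z = H @ W                                     # project node features
    N = Z.shape[0]
    logits = np.empty((N, N))
    for i in range(N):                            # e_ij = LeakyReLU(a . [z_i || z_j])
        for j in range(N):
            s = a @ np.concatenate([Z[i], Z[j]])
            logits[i, j] = s if s > 0 else 0.2 * s
    logits = np.where(A > 0, logits, -1e9)        # mask non-edges
    alpha = softmax(logits, axis=1)               # relation-aware attention weights
    return np.maximum(alpha @ Z, 0)               # ReLU-aggregated node representations

# Three relation graphs (semantic / spatial / implicit) over the same 4 objects (toy data).
rng = np.random.default_rng(0)
N, F = 4, 8
H = rng.normal(size=(N, F))
graphs = [np.eye(N) + (rng.random((N, N)) > 0.5) for _ in range(3)]
W = rng.normal(size=(F, F)) * 0.1
a = rng.normal(size=2 * F)

# Run a GAT layer per graph, then fuse by summation (one simple aggregation choice).
fused = sum(gat_layer(H, A, W, a) for A in graphs)
print(fused.shape)  # (4, 8)
```

In the full model the fused node representations would feed an answer classifier; here the sum stands in for whatever learned aggregation the method uses.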
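The alignment step shared by the symbolic-graph and graph-matching methods [148], [150] amounts to matching nodes across an image graph and a question graph. A minimal sketch of one common building block, soft alignment by row-normalised embedding similarity, is below; the function name, toy embeddings, and dimensions are illustrative assumptions, not any specific paper's formulation.

```python
import numpy as np

def soft_align(Zi, Zq):
    """Soft node alignment between two graphs' node embeddings.
    Zi: (Ni, D) image-graph nodes; Zq: (Nq, D) question-graph nodes.
    Returns P: (Ni, Nq), each row a distribution over question nodes."""
    S = Zi @ Zq.T                                  # pairwise similarity logits
    e = np.exp(S - S.max(axis=1, keepdims=True))   # stable row-wise softmax
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
Zi = rng.normal(size=(5, 16))   # toy image-graph node embeddings
Zq = rng.normal(size=(3, 16))   # toy question-graph node embeddings
P = soft_align(Zi, Zq)
print(P.shape)                  # (5, 3); each row sums to 1
```

Message passing would then propagate information along edges within each graph and across the aligned pairs; the alignment matrix P is the cross-graph routing table.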