[NS][Lab_Seminar_240617]A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective.pptx

thanhdowork 112 views 20 slides Jun 24, 2024

Slide Content

A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective
Tien-Bach-Thanh Do
Network Science Lab, Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/06/17
Chaoqi Chen et al.

Introduction
2D NATURAL IMAGES
  Image Classification: multi-label classification, few-shot learning, zero-shot learning, transfer learning
  Object Detection
  Image Segmentation
  Scene Graph Generation
VIDEO UNDERSTANDING
  Video Action Recognition
  Temporal Action Localization
  Others: video object segmentation, multi-object tracking, human motion prediction, trajectory prediction

Introduction
VISION + LANGUAGE
  Visual Question Answering
  Visual Grounding
  Image Captioning
  Others: image-text matching, vision-language navigation
3D DATA ANALYSIS
  3D Representation Learning: point cloud representation, mesh representation
  3D Understanding: 3D point cloud segmentation, 3D object detection, 3D visual grounding
  3D Generation: point cloud completion and upsampling, 3D data denoising, 3D reconstruction

Graph Representation Learning Methods
Image Classification (multi-label classification)
[41] builds a directed graph over the label space: each node is an object label (represented by its word embedding), and the edges model the inter-dependencies between labels
[46] proposes a cross-modality attention mechanism that aggregates visual features and label embeddings to generate category-wise attention maps
[47] decomposes the label graph into multiple sub-graphs and uses a graph transformer to exploit semantic and topological relations among object labels

[41] Z.-M. Chen, X.-S. Wei, P. Wang, and Y. Guo, “Multi-label image recognition with graph convolutional networks,” in CVPR, 2019
[46] R. You, Z. Guo, L. Cui, X. Long, Y. Bao, and S. Wen, “Cross-modality attention with semantic graph embedding for multi-label classification,” in AAAI, 2020
[47] H. D. Nguyen, X.-S. Vu, and D.-T. Le, “Modular graph transformer networks for multi-label image classification,” in AAAI, 2021
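As a rough illustration of the directed label graph described for [41], the sketch below derives edges from label co-occurrence in a toy training set. The conditional-probability rule, the threshold value, and the names `build_label_graph` and `samples` are assumptions made for this example, not details taken from the slide.

```python
from collections import Counter
from itertools import permutations


def build_label_graph(samples, threshold=0.4):
    """Return a directed graph: edge i -> j when P(j present | i present) >= threshold."""
    label_count = Counter()
    pair_count = Counter()
    for labels in samples:
        unique = set(labels)
        label_count.update(unique)
        # Ordered pairs, so (i, j) and (j, i) are counted separately.
        pair_count.update(permutations(sorted(unique), 2))
    graph = {label: set() for label in label_count}
    for (i, j), n_ij in pair_count.items():
        if n_ij / label_count[i] >= threshold:
            graph[i].add(j)
    return graph


# Hypothetical toy annotations: each entry is the label set of one image.
samples = [
    ["person", "surfboard"],
    ["person", "surfboard"],
    ["person", "dog"],
    ["person"],
    ["person"],
    ["person"],
]
graph = build_label_graph(samples, threshold=0.4)
```

In this toy set a surfboard always appears with a person, but a person rarely appears with a surfboard, so the graph keeps the edge surfboard -> person and drops the reverse one. That asymmetry is why the graph is directed.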

Graph Representation Learning Methods
Image Classification (few-shot learning)
These methods build a graph in which each node represents an image and each edge encodes a similarity-based relationship
Supervised Interpolation on Densely Connected Graphs [49]: treats few-shot learning (FSL) as a supervised interpolation problem; constructs a graph with images as nodes, connected by learnable similarity kernels; also applies to semi-supervised and active learning scenarios
Transductive Propagation Network [50]: constructs graphs in the embedding space and exploits the manifold structure to propagate labels from the support set to the query set
Edge-Labeling GNN [51]: predicts labels for edges rather than nodes, explicitly modeling intra- and inter-class similarities to improve classification
Bayesian GNN for Meta-Learning [52]: treats FSL as continual learning of tasks and captures correlations across tasks using Bayesian principles
Dual Complete Graph Network [53]: models both distribution-level and instance-level relations between classes and instances
Hierarchical Relations GNN [54]: reasons hierarchically over node relations, using bottom-up and top-down reasoning modules to capture complex relationships
Instance and Prototype GNN [55]: uses separate GNNs for instances and prototypes so that features adapt quickly to new tasks

[49] V. Garcia and J. Bruna, “Few-shot learning with graph neural networks,” in ICLR, 2018
[50] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang, “Learning to propagate labels: Transductive propagation network for few-shot learning,” in ICLR, 2019
[51] J. Kim, T. Kim, S. Kim, and C. D. Yoo, “Edge-labeling graph neural network for few-shot learning,” in CVPR, 2019
[52] Y. Luo, Z. Huang, Z. Zhang, Z. Wang, M. Baktash Motlagh, and Y. Yang, “Learning from the past: Continual meta-learning via bayesian graph modeling,” in AAAI, 2020
[53] L. Yang, L. Li, Z. Zhang, X. Zhou, E. Zhou, and Y. Liu, “Dpgn: Distribution propagation graph network for few-shot learning,” in CVPR, 2020
[54] C. Chen, K. Li, W. Wei, J. T. Zhou, and Z. Zeng, “Hierarchical graph neural networks for few-shot learning,” TCSVT, 2021
[55] T. Yu, S. He, Y.-Z. Song, and T. Xiang, “Hybrid graph neural networks for few-shot learning,” in AAAI, 2022
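The idea of propagating labels from a support set to a query set over a similarity graph, as summarized for [50], can be sketched with classic iterative label propagation. This is a minimal sketch under stated assumptions: the Gaussian similarity, the fixed `sigma` and `alpha`, and the function name `propagate_labels` are illustrative choices, and the learned embedding network of the actual method is replaced by hand-written 2-D features.

```python
import math


def propagate_labels(features, labels, num_classes, alpha=0.5, sigma=1.0, steps=50):
    """Iterate F <- alpha * S F + (1 - alpha) * Y on a Gaussian similarity graph.

    labels[i] is a class index for support samples and None for query samples.
    """
    n = len(features)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # no self-loops
                dist_sq = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
                weights[i][j] = math.exp(-dist_sq / (2 * sigma ** 2))
    degree = [sum(row) for row in weights]
    # Symmetric normalisation S = D^{-1/2} W D^{-1/2}.
    norm = [[weights[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]
    onehot = [[1.0 if labels[i] == c else 0.0 for c in range(num_classes)]
              for i in range(n)]
    scores = [row[:] for row in onehot]
    for _ in range(steps):
        scores = [
            [alpha * sum(norm[i][k] * scores[k][c] for k in range(n))
             + (1 - alpha) * onehot[i][c]
             for c in range(num_classes)]
            for i in range(n)
        ]
    return [row.index(max(row)) for row in scores]


# Two labelled support points followed by two unlabelled query points (toy data).
features = [(0.0, 0.0), (5.0, 5.0), (0.5, 0.2), (4.6, 5.1)]
labels = [0, 1, None, None]
predicted = propagate_labels(features, labels, num_classes=2)
```

Each query point inherits the label of the support point it is close to, because almost all of its normalised edge weight goes to that neighbour.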

Graph Representation Learning Methods
Image Classification (zero-shot learning)
GNNs facilitate knowledge representation and propagation between seen and unseen classes, capturing complex inter-class relations
Dense Graph Propagation [56]: exploits the hierarchical structure of knowledge graphs, propagating knowledge iteratively between nodes and their ancestors/descendants
Region Graph Representation [57]: represents each image as a region graph, where nodes are regions and edges are appearance similarities
Attribute Propagation Network [58]: generates and updates attribute vectors dynamically
Visual and Semantic Prototype Propagation [59]: propagates visual and semantic prototypes on an auto-generated graph
Graph Attention Networks [61]: model and integrate the cooperation between local and global features

[56] M. Kampffmeyer, Y. Chen, X. Liang, H. Wang, Y. Zhang, and E. P. Xing, “Rethinking knowledge graph propagation for zero-shot learning,” in CVPR, 2019
[57] G.-S. Xie, L. Liu, F. Zhu, F. Zhao, Z. Zhang, Y. Yao, J. Qin, and L. Shao, “Region graph embedding network for zero-shot learning,” in ECCV, 2020
[58] L. Liu, T. Zhou, G. Long, J. Jiang, and C. Zhang, “Attribute propagation network for graph zero-shot learning,” in AAAI, 2020
[59] L. Liu, T. Zhou, G. Long, J. Jiang, X. Dong, and C. Zhang, “Isometric propagation network for generalized zero-shot learning,” in ICLR, 2021
[61] S. Chen, Z. Hong, G. Xie, Q. Peng, X. You, W. Ding, and L. Shao, “Gndan: Graph navigated dual attention network for zero-shot learning,” IEEE TNNLS, 2022
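To make the ancestor propagation summarized for [56] more concrete, the sketch below collects all ancestors of a class in a small hierarchy together with their hop distance, then mixes their features using one weight per distance. The hierarchy, the feature values, the fixed `hop_weights`, and both function names are hypothetical; in a trained model the weights would be learned, and descendants would be handled by a symmetric pass.

```python
from collections import deque


def ancestors_by_distance(parents, node):
    """Breadth-first walk up the hierarchy; returns {ancestor: hop distance}."""
    distance = {}
    queue = deque([(node, 0)])
    while queue:
        current, hops = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in distance:
                distance[parent] = hops + 1
                queue.append((parent, hops + 1))
    return distance


def propagate_from_ancestors(features, parents, node, hop_weights):
    """Weighted sum of the node's own feature and the features of all its ancestors."""
    out = [hop_weights[0] * x for x in features[node]]
    for ancestor, hops in ancestors_by_distance(parents, node).items():
        # Distances beyond the weight table reuse the last weight.
        weight = hop_weights[min(hops, len(hop_weights) - 1)]
        for d, value in enumerate(features[ancestor]):
            out[d] += weight * value
    return out


# Hypothetical class hierarchy: child -> list of parents.
parents = {
    "zebra": ["equine"],
    "horse": ["equine"],
    "equine": ["mammal"],
    "mammal": ["animal"],
}
features = {"zebra": [0.0], "horse": [9.0], "equine": [3.0], "mammal": [6.0], "animal": [0.0]}
zebra_ancestors = ancestors_by_distance(parents, "zebra")
zebra_feature = propagate_from_ancestors(features, parents, "zebra", [0.5, 0.3, 0.15, 0.05])
```

The point of the dense connection is that "zebra" receives information from "mammal" in a single step instead of waiting for it to pass through "equine" over several layers.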

Graph Representation Learning Methods
Image Classification (transfer learning)
Domain adaptation: adapting a model trained in one domain (source) to perform well in another domain (target)
Domain generalization: developing models that generalize well to unseen domains without access to the target domain during training
GNNs leverage knowledge graphs to represent relationships between different classes, making the model transferable and reusable across domains
Graph Convolutional Adversarial Network (GCAN) [62]: constructs densely connected instance graphs from mini-batch samples for end-to-end training
Co-Teaching and Curriculum Learning GCN [64]: aggregates domain information using a GCN with co-teaching and curriculum learning strategies
Cross-Domain Prototype Alignment [66]: uses features from different GNN stages for alignment
Knowledge Graph for DA [67]: constructs a knowledge graph based on domain prototypes
Global Prototypical Relation Graphs with Self-Attention [68]: builds global prototypical relation graphs and updates them using a graph self-attention mechanism

[62] X. Ma, T. Zhang, and C. Xu, “Gcan: Graph convolutional adversarial network for unsupervised domain adaptation,” in CVPR, 2019, pp. 8266–8276
[64] S. Roy, E. Krivosheev, Z. Zhong, N. Sebe, and E. Ricci, “Curriculum graph co-teaching for multi-target domain adaptation,” in CVPR, 2021
[66] Z. Wang, Y. Luo, Z. Huang, and M. Baktash Motlagh, “Prototype-matching graph network for heterogeneous domain adaptation,” in ACM MM, 2020
[67] H. Wang, M. Xu, B. Ni, and W. Zhang, “Learning to combine: Knowledge aggregation for multi-source domain adaptation,” in ECCV, 2020

Graph Representation Learning Methods
Object Detection
Reasoning-RCNN [71]: propagates visual information globally to enable semantic reasoning
Spatial-aware Graph Relation Network (SGRN) [72]: uses a relation learner to create a sparse adjacency matrix for relational reasoning
RelationNet [70]: learns object relations through an adapted attention module
RelationNet++ [73]: uses a self-attention based decoder to integrate various object/part representations
Heterogeneous Graph Modeling [74]: utilizes a heterogeneous graph for comprehensive relational modeling
Graph Feature Pyramid Network (GraphFPN) [75]: explores contextual and hierarchical structures through within-scale and cross-scale feature interactions via spatial and channel attention
FGRR [76], [77]: uses bipartite GCNs and graph attention to model dependencies in pixel and semantic spaces
SIGMA [78]: uses cross-image graphs to capture relationships between domains
SRR-FSD [79]: uses a dynamic relation graph in which each class node facilitates semantic propagation
GNNs enable reasoning over semantic and spatial dependencies

[71] H. Xu, C. Jiang, X. Liang, L. Lin, and Z. Li, “Reasoning-rcnn: Unifying adaptive global reasoning into large-scale object detection,” in CVPR, 2019
[72] H. Xu, C. Jiang, X. Liang, and Z. Li, “Spatial-aware graph relation network for large-scale object detection,” in CVPR, 2019
[70] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, “Relation networks for object detection,” in CVPR, 2018
[73] C. Chi, F. Wei, and H. Hu, “Relationnet++: Bridging visual representations for object detection via transformer decoder,” NeurIPS, vol. 33, 2020
[74] Z. Li, X. Du, and Y. Cao, “Gar: Graph assisted reasoning for object detection,” in WACV, 2020
[75] G. Zhao, W. Ge, and Y. Yu, “Graphfpn: Graph feature pyramid network for object detection,” in ICCV, 2021
[76] C. Chen, J. Li, H.-Y. Zhou, X. Han, Y. Huang, X. Ding, and Y. Yu, “Relation matters: Foreground-aware graph-based relational reasoning for domain adaptive object detection,” IEEE TPAMI, 2022
[77] C. Chen, J. Li, Z. Zheng, Y. Huang, X. Ding, and Y. Yu, “Dual bipartite graph learning: A general approach for domain adaptive object detection,” in ICCV, 2021
[78] W. Li, X. Liu, and Y. Yuan, “Sigma: Semantic-complete graph matching for domain adaptive object detection,” in CVPR, 2022
[79] C. Zhu, F. Chen, U. Ahmed, Z. Shen, and M. Savvides, “Semantic relation reasoning for shot-stable few-shot object detection,” in CVPR, 2021

Graph Representation Learning Methods
Image Segmentation
Task: segment an image into semantically meaningful regions through pixel-wise labeling
Challenges:
  Traditional CNNs struggle with long-range dependencies and arbitrary shapes
  Holistic scene understanding is difficult due to intrinsic limitations of convolution operations
Dual Graph Convolutional Network (DGCNet) [80]: models spatial relationships among pixels and dependencies among feature-map channels, then combines the two
Global Reasoning Unit [81]: projects globally aggregated features to the node domain
Dynamic Graph Message Passing Network (DGMN) [82]: dynamically samples node neighborhoods and predicts node dependencies, filter weights, and affinities for information propagation
Scale-Aware GNN for Few-Shot Segmentation [87]: performs cross-scale relational reasoning over support and query images
Affinity Attention GNN for Weakly Supervised Segmentation [88]: uses an affinity attention layer to model long-range interactions
Bidirectional Graph Reasoning for Panoptic Segmentation [89]: builds separate graphs for 'things' and 'stuff' and propagates semantic representations within and across branches

[80] L. Zhang, X. Li, A. Arnab, K. Yang, Y. Tong, and P. Torr, “Dual graph convolutional network for semantic segmentation,” in BMVC, 2019
[81] Y. Chen, M. Rohrbach, Z. Yan, Y. Shuicheng, J. Feng, and Y. Kalantidis, “Graph-based global reasoning networks,” in CVPR, 2019
[82] L. Zhang, D. Xu, A. Arnab, and P. H. Torr, “Dynamic graph message passing networks,” in CVPR, 2020
[87] G.-S. Xie, J. Liu, H. Xiong, and L. Shao, “Scale-aware graph neural network for few-shot semantic segmentation,” in CVPR, 2021
[88] B. Zhang, J. Xiao, J. Jiao, Y. Wei, and Y. Zhao, “Affinity attention graph neural network for weakly supervised semantic segmentation,” IEEE TPAMI, 2021
[89] Y. Wu, G. Zhang, Y. Gao, X. Deng, K. Gong, X. Liang, and L. Lin, “Bidirectional graph reasoning network for panoptic segmentation,” in CVPR, 2020

Graph Representation Learning Methods
Scene Graph Generation
A scene graph represents objects as nodes and their relations as edges, providing a high-level understanding of the visual scene
GNNs construct and refine graph structures from image inputs without requiring extensive prior knowledge
Graph R-CNN [91]: constructs a sparse candidate graph and refines it using an attentional GCN
Factorizable Net [94]: uses a subgraph-based approach with spatially weighted message passing; treats each subgraph as a node, refines features using attention-like schemes, and passes messages among objects and subgroups
Attentive Relational Networks [95]: transforms label embeddings and visual features into a shared semantic space
Bipartite GNN [96]: estimates and propagates relation confidence in a multi-stage manner
Energy-Based Framework [97]: relies on a graph message passing algorithm to evaluate energy configurations for relation inference

[91] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn for scene graph generation,” in ECCV, 2018
[94] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, “Factorizable net: an efficient subgraph-based framework for scene graph generation,” in ECCV, 2018
[95] M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo, “Attentive relational networks for mapping images to scene graphs,” in CVPR, 2019
[96] R. Li, S. Zhang, B. Wan, and X. He, “Bipartite graph network with adaptive message passing for unbiased scene graph generation,” in CVPR, 2021
[97] M. Suhail, A. Mittal, B. Siddiquie, C. Broaddus, J. Eledath, G. Medioni, and L. Sigal, “Energy-based learning for scene graph generation,” in CVPR, 2021

Graph Representation Learning Methods
Video Action Recognition
Essential for understanding human motion by analyzing interactions between humans, objects, and joints
Long-Range Temporal Context Modeling [98]: connects humans and objects across frames based on appearance similarity and spatial-temporal relations, then uses GCNs to reason over the constructed graph for action prediction
[104] builds graphs centered around actors and applies GCNs for context modeling among objects; uses a relation-level graph for relation node contexts
ST-GCN Network [99]: builds GCNs on the joint graph to learn spatial and temporal action patterns
[110] uses GCNs on three types of graphs to capture complex joint relations
[113] adds edges between the limbs and the head, using GCNs for spatial relations and an LSTM for temporal dynamics

[98] [99]
[104] Y. Ou, L. Mi, and Z. Chen, “Object-relation reasoning graph for action recognition,” in CVPR, 2022
[110] L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Two-stream adaptive graph convolutional networks for skeleton-based action recognition,” in CVPR, 2019
[113] R. Zhao, K. Wang, H. Su, and Q. Ji, “Bayesian graph convolution lstm for skeleton based action recognition,” in ICCV, 2019
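A graph convolution over a skeleton joint graph, the building block behind the skeleton-based methods on this slide, can be sketched as one step of normalised neighbourhood aggregation. The sketch assumes the common normalisation D^{-1/2}(A + I)D^{-1/2} and omits learned weights and the temporal dimension; the three-joint chain and the function names are made up for illustration and do not reproduce any specific cited model.

```python
import math


def normalized_adjacency(num_joints, bones):
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]  # identity adds self-loops
    for a, b in bones:
        adj[a][b] = adj[b][a] = 1.0
    degree = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def spatial_aggregate(adj_hat, joint_features):
    """Compute adj_hat @ joint_features for one frame (learned weight matrix omitted)."""
    n = len(joint_features)
    channels = len(joint_features[0])
    return [[sum(adj_hat[i][k] * joint_features[k][c] for k in range(n))
             for c in range(channels)]
            for i in range(n)]


# Hypothetical 3-joint chain: shoulder(0) - elbow(1) - wrist(2).
bones = [(0, 1), (1, 2)]
adj_hat = normalized_adjacency(3, bones)
frame = [[1.0], [0.0], [0.0]]  # one feature channel per joint
out = spatial_aggregate(adj_hat, frame)
```

After one step the signal placed on the shoulder reaches the elbow but not the wrist, which shows why several layers (or extra edges, as in [113]) are needed to relate distant joints.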

Graph Representation Learning Methods
Temporal Action Localization
Task: localize temporal intervals and recognize action instances in untrimmed videos
[100] generates temporal proposals and forms a proposal graph whose nodes are proposals, connected according to temporal intersection over union and L1 distance; GCNs capture relations among proposals to predict boundaries and categories
[126] uses GNNs to generate temporal action proposals by capturing temporal dependencies
[127] uses GNNs to segment actions in videos by modeling temporal action patterns

[100] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan, “Graph convolutional networks for temporal action localization,” in ICCV, 2019
[126] Y. Bai, Y. Wang, Y. Tong, Y. Yang, Q. Liu, and J. Liu, “Boundary content graph neural network for temporal action proposal generation,” in ECCV, 2020
[127] Y. Huang, Y. Sugano, and Y. Sato, “Improving action segmentation via graph-based temporal reasoning,” in CVPR, 2020
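The proposal graph described for [100] can be illustrated by connecting proposals that overlap strongly in time or that are disjoint but close together. The sketch below is a simplified reading of that idea: the edge-type names, both thresholds, and the way the centre distance is normalised are assumptions of this example, and the proposals are toy (start, end) pairs in seconds.

```python
def temporal_iou(a, b):
    """Temporal intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def center_distance(a, b):
    """Distance between interval centres, divided by the summed durations (assumed)."""
    gap = abs((a[0] + a[1]) / 2 - (b[0] + b[1]) / 2)
    total = (a[1] - a[0]) + (b[1] - b[0])
    return gap / total if total > 0 else float("inf")


def build_proposal_graph(proposals, iou_threshold=0.7, distance_threshold=1.0):
    """Return {(i, j): edge_type} for every connected pair of proposals."""
    edges = {}
    for i in range(len(proposals)):
        for j in range(i + 1, len(proposals)):
            iou = temporal_iou(proposals[i], proposals[j])
            if iou >= iou_threshold:
                edges[(i, j)] = "contextual"   # strongly overlapping proposals
            elif iou == 0.0 and center_distance(proposals[i], proposals[j]) <= distance_threshold:
                edges[(i, j)] = "surrounding"  # disjoint but nearby proposals
    return edges


proposals = [(0, 10), (2, 11), (12, 15), (40, 50)]
edges = build_proposal_graph(proposals)
```

The last proposal stays isolated because it neither overlaps nor lies near the others; a GCN run on this graph would therefore refine it from its own features only.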

Graph Representation Learning Methods
Others
Video Object Segmentation
  [102] captures higher-order relations among video frames
  [103] uses a graph memory network that stores memory in a graph structure; memory cells act as nodes, and segmentation masks are predicted by querying the memory graph
Multi-Object Tracking
  [128] tracks objects by predicting edge status (active/non-active) to form connected components, using a time-aware message passing network on a detection graph
  [129] connects tracked and detected objects based on spatial distances and uses GNNs to aggregate features and predict connectivity
Human Motion Prediction
  [130] captures physical constraints and movement relations among body parts using multi-scale GNNs
  [131] uses space-time-separable GNNs to capture joint-joint, time-time, and joint-time interactions
Trajectory Prediction
  [133] represents pedestrian trajectories as graphs and uses GCNs to capture interactions based on relative locations and temporal information
  [134] performs crowd trajectory prediction on spatio-temporal graphs, extending GATs with the Transformer's self-attention mechanism

[102] W. Wang, X. Lu, J. Shen, D. J. Crandall, and L. Shao, “Zero-shot video object segmentation via attentive graph neural networks,” in ICCV, 2019
[103] X. Lu, W. Wang, M. Danelljan, T. Zhou, J. Shen, and L. V. Gool, “Video object segmentation with episodic graph memory networks,” in ECCV, 2020
[128] G. Brasó and L. Leal-Taixé, “Learning a neural solver for multiple object tracking,” in CVPR, 2020
[129] X. Weng, Y. Wang, Y. Man, and K. M. Kitani, “Gnn3dmot: Graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning,” in CVPR, 2020
[130] M. Li, S. Chen, Y. Zhao, Y. Zhang, Y. Wang, and Q. Tian, “Dynamic multiscale graph neural networks for 3d skeleton based human motion prediction,” in CVPR
[131] T. Sofianos, A. Sampieri, L. Franco, and F. Galasso, “Space-time-separable graph convolutional network for pose forecasting,” in ICCV, 2021
[133] A. Mohamed, K. Qian, M. Elhoseiny, and C. Claudel, “Social-stgcnn: A social spatio-temporal graph convolutional neural network for human trajectory prediction,” in CVPR, 2020
[134] C. Yu, X. Ma, J. Ren, H. Zhao, and S. Yi, “Spatio-temporal graph transformer networks for pedestrian trajectory prediction,” in ECCV, 2020
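The tracking formulation summarized for [128], where edges between detections are classified as active or non-active and tracks are read off as connected components, can be sketched with a small union-find. The edge scores below are invented stand-ins for the output of a message passing network, and the function name and threshold are assumptions; the sketch also skips the constraint that a detection should have at most one active link per direction in time.

```python
def tracks_from_edges(num_detections, edge_scores, threshold=0.5):
    """Group detections into tracks by merging the endpoints of active edges."""
    parent = list(range(num_detections))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in edge_scores.items():
        if score >= threshold:  # edge predicted as "active"
            parent[find(a)] = find(b)

    tracks = {}
    for detection in range(num_detections):
        tracks.setdefault(find(detection), []).append(detection)
    return sorted(tracks.values())


# Hypothetical scores for edges between detections in different frames.
edge_scores = {(0, 2): 0.9, (2, 4): 0.8, (1, 3): 0.7, (0, 3): 0.1, (1, 2): 0.2}
tracks = tracks_from_edges(5, edge_scores)
```

Detections 0, 2, and 4 end up in one track and detections 1 and 3 in another, since the low-scoring edges between the two groups are treated as non-active.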

Graph Representation Learning Methods
Visual Question Answering
Task: answer questions about visual content by aligning and reasoning over graph representations of images and text
GNNs capture and align complex relationships between visual objects and textual queries for accurate prediction
Graph Representations of Images and Questions [141]: constructs graphs for both the image and the question and aligns them for VQA
  Image graph: nodes represent visual objects; edges are relative spatial relations
  Question graph: nodes represent words; edges are dependencies among words
  Uses a recurrent unit followed by an attention mechanism
Language-Relevant Graph Construction [144]: generates a sparse graph representation of the image conditioned on the question
  Graph learner: produces the sparse image graph
  Graph CNN: captures language-relevant spatial relations
Dynamic Visual Graphs [145]: updates the edges between objects dynamically during GNN-based message passing, based on textual command vectors in each iteration
Compositional VQA Architecture [146]: uses a similar architecture with dynamic graph updates to answer multiple complex, compositional questions

[141] D. Teney, L. Liu, and A. van Den Hengel, “Graph-structured representations for visual question answering,” in CVPR, 2017
[144] W. Norcliffe-Brown, S. Vafeias, and S. Parisot, “Learning conditioned graph structures for interpretable visual question answering,” NeurIPS, vol. 31, 2018
[145] R. Hu, A. Rohrbach, T. Darrell, and K. Saenko, “Language-conditioned graph networks for relational reasoning,” in ICCV, 2019
[146] C. Jing, Y. Jia, Y. Wu, X. Liu, and Q. Wu, “Maintaining reasoning consistency in compositional visual question answering,” in CVPR, 2022

Graph Representation Learning Methods
Visual Question Answering
Multi-Graph Approach [147]: represents semantic, spatial, and implicit relations with three separate graphs
  GATs: run on each graph to assign importance and learn relation-aware node representations
  Aggregation: combines the GAT outputs to predict the answer
Symbolic Graphs [148]: uses symbolic graphs to represent image and question features
  Symbolic graphs: nodes represent semantic units; edges represent relations
  Message passing neural network: aligns and processes the symbolic graphs
Graph Isomorphism Network for Graph Matching [150]: uses a graph isomorphism network to align and match the graph structures of the image and the question
TextVQA [154]: uses GNNs to integrate text recognition and reasoning for VQA
Visual Commonsense Reasoning [155], [156]: integrates commonsense knowledge graphs with GNNs for reasoning
Figures:
  Graph Representations and Alignment [141]: diagram showing image and question graphs with the alignment process
  Sparse Graph Representation [144]: visual of sparse graph generation for images based on questions
  Dynamic Visual Graphs [145]: diagram illustrating dynamic edge updates in visual graphs
  Multi-Graph Approach [147]: diagram of semantic, spatial, and implicit graphs with GAT processing
  Symbolic Graphs [148]: diagram of symbolic graphs for image and question, with message passing

[147] L. Li, Z. Gan, Y. Cheng, and J. Liu, “Relation-aware graph attention network for visual question answering,” in ICCV, 2019
[148] E.-S. Kim, W. Y. Kang, K.-W. On, Y.-J. Heo, and B.-T. Zhang, “Hypergraph attention networks for multimodal learning,” in CVPR, 2020
[150] R. Saqur and K. Narasimhan, “Multimodal graph networks for compositional generalization in visual question answering,” NeurIPS, vol. 33, 2020
[154] D. Gao, K. Li, R. Wang, S. Shan, and X. Chen, “Multi-modal graph neural network for joint reasoning on vision and scene text,” in CVPR, 2020
[155] A. Wu, L. Zhu, Y. Han, and Y. Yang, “Connective cognition network for directional visual commonsense reasoning,” NeurIPS, vol. 32, 2019
[156] W. Yu, J. Zhou, W. Yu, X. Liang, and N. Xiao, “Heterogeneous graph learning for visual commonsense reasoning,” NeurIPS, vol. 32, 2019
[141] D. Teney, L. Liu, and A. van Den Hengel, “Graph-structured representations for visual question answering,” in CVPR, 2017
[144] W. Norcliffe-Brown, S. Vafeias, and S. Parisot, “Learning conditioned graph structures for interpretable visual question answering,” NeurIPS, vol. 31, 2018
[145] R. Hu, A. Rohrbach, T. Darrell, and K. Saenko, “Language-conditioned graph networks for relational reasoning,” in ICCV, 2019

Graph Representation Learning Methods
Visual Grounding
Task: ground expressions in videos to obtain the temporal boundaries or spatio-temporal tubes of the referred objects
Capturing temporal or spatio-temporal correlations is critical for accurate video grounding
Spatial and Temporal Dynamic Graphs [183]: constructs a spatial relation graph for each frame and a temporal dynamic graph across frames
  Spatial relation graph: nodes represent objects in a frame; edges capture spatial relations
  Temporal dynamic graph: nodes represent objects across frames; edges connect the same objects in consecutive frames
  Language fusion: integrates the language representation into the graphs
  Graph convolution: captures spatio-temporal correlations in the graphs
Interval-Based Temporal Graph [184]: constructs a temporal graph whose nodes are time intervals and whose edges represent overlapping relations
  GNN for message passing: passes messages over the graph to capture temporal correlations
Compositional Temporal Grounding [186]: uses a cross-graph GNN to capture hierarchical semantic correlations in videos and sentences
  Cross-graph GNN: captures correlations among global events, local actions, and atomic objects
  Hierarchical semantics: models semantic relationships across different levels

[183] Z. Zhang, Z. Zhao, Y. Zhao, Q. Wang, H. Liu, and L. Gao, “Where does it exist: Spatio-temporal video grounding for multi-form sentences,” in CVPR, 2020
[184] Y. Zhao, Z. Zhao, Z. Zhang, and Z. Lin, “Cascaded prediction network via segment tree for temporal video grounding,” in CVPR, 2021
[186] J. Li, J. Xie, L. Qian, L. Zhu, S. Tang, F. Wu, Y. Yang, Y. Zhuang, and X. E. Wang, “Compositional temporal grounding with structured variational cross-graph correspondence learning,” in CVPR, 2022
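The interval-based temporal graph summarized for [184], with time intervals as nodes and overlap as the edge criterion, followed by one round of message passing, might look like the sketch below. The mean aggregation stands in for a learned GNN layer, and the intervals, scalar features, and function names are hypothetical.

```python
def build_interval_graph(intervals):
    """Connect two (start, end) intervals when they share more than a boundary point."""
    edges = []
    for i in range(len(intervals)):
        for j in range(i + 1, len(intervals)):
            overlap_start = max(intervals[i][0], intervals[j][0])
            overlap_end = min(intervals[i][1], intervals[j][1])
            if overlap_start < overlap_end:
                edges.append((i, j))
    return edges


def mean_message_pass(features, edges):
    """One round of message passing: average each node with its neighbours."""
    neighbours = {i: [i] for i in range(len(features))}  # include the node itself
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    return [sum(features[k] for k in neighbours[i]) / len(neighbours[i])
            for i in range(len(features))]


intervals = [(0, 4), (3, 8), (8, 12), (2, 10)]
edges = build_interval_graph(intervals)
updated = mean_message_pass([1.0, 0.0, 0.0, 0.0], edges)
```

Intervals (3, 8) and (8, 12) only touch at a single time point, so they are not linked here; whether touching intervals should count as overlapping is a design choice left open by the slide.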

Graph Representation Learning Methods
Image Captioning
Task: generate complete and natural descriptions for images by capturing the relations between objects
Using object relations as natural priors improves the quality and relevance of the generated captions
Semantic and Spatial Graphs [187]: constructs semantic and spatial graphs to represent objects and their relations
  Graphs: nodes are objects; edges capture semantic and spatial relations
  GCN: extended to these graphs to obtain object representations
  Attention LSTM: decodes captions from the graph-enhanced object representations
Hierarchical Levels of Object Relations [189]: models object relations at three hierarchical levels (image, regions, and objects) and applies GCNs to the multi-level relations to enhance caption generation
Language Inductive Bias via Sentence Reconstruction [190]: captures the language inductive bias through sentence reconstruction; forms a dictionary using a GCN and sentence reconstruction, then applies attention between the image scene graph and the dictionary
Sub-Graph Decomposition for Captioning [191]: decomposes a scene graph into sub-graphs and chooses the important sub-graphs to decode at each time step

[187] T. Yao, Y. Pan, Y. Li, and T. Mei, “Exploring visual relationship for image captioning,” in ECCV, 2018, pp. 684–699
[189] T. Yao, Y. Pan, Y. Li, and T. Mei, “Hierarchy parsing for image captioning,” in ICCV, 2019
[190] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in CVPR, 2019
[191] Y. Zhong, L. Wang, J. Chen, D. Yu, and Y. Li, “Comprehensive image captioning via scene graph decomposition,” in ECCV, 2020

Graph Representation Learning Methods
Image Captioning
Abstract Scene Graph for User-Desired Captioning [192]: uses an abstract scene graph to control the objects, attributes, and relations that appear in captions
  Multi-relational GCN: encodes the abstract scene graph grounded in the image
Human-Object Interaction Graph [193]: combines the image scene graph with a human-object interaction graph
  Enhanced scene graph: integrates human-object interactions for enhanced captioning
  GCN-LSTM: uses a similar GCN-LSTM architecture to generate captions

[192] S. Chen, Q. Jin, P. Wang, and Q. Wu, “Say as you wish: Fine-grained control of image caption generation with abstract scene graphs,” in CVPR, 2020
[193] K. Nguyen, S. Tripathi, B. Du, T. Guha, and T. Q. Nguyen, “In defense of scene graphs for image captioning,” in ICCV, 2021

Graph Representation Learning Methods
Video Captioning
Task: generate descriptions for videos by capturing interactions among objects over time
Understanding object interactions and their temporal dynamics is critical for accurate video captioning
Object Relation Graph [195]: constructs an object relation graph connecting each object with its top-k corresponding objects across frames
  Graph construction: nodes are objects; edges connect the top-k corresponding objects in all frames
  GCN: learns relations on the graph
  Attention LSTM: generates descriptions from the object relational features
Spatio-Temporal Graph [196]: constructs a spatio-temporal graph to capture object interactions over time
  Graph construction: nodes represent objects; edges are weighted by normalized IoU and connect semantically similar objects in consecutive frames
  GCN: updates node features to capture spatio-temporal relationships
Temporal Graph Aggregation [197]: aggregates information in a temporal graph and adjusts the graph structure dynamically
  Temporal graph: aggregates node features over time
  LSTM decoder: adjusts the graph structure based on its hidden state and updates the hidden state based on the graph representation

[195] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z.-J. Zha, “Object relational graph with teacher-recommended learning for video captioning,” in CVPR, 2020
[196] B. Pan, H. Cai, D.-A. Huang, K.-H. Lee, A. Gaidon, E. Adeli, and J. C. Niebles, “Spatio-temporal graph for video captioning with knowledge distillation,” in CVPR, 2020
[197] S. Chen and Y.-G. Jiang, “Motion guided region message passing for video captioning,” in ICCV, 2021

Graph Representation Learning Methods
Video Captioning
Object Relation Graph
  Nodes: represent objects in all frames
  Edges: connect top-k corresponding objects based on appearance and semantics
  Usage: provides relational features for captioning
Spatio-Temporal Graph
  Nodes: represent objects in individual frames
  Edges: weighted by normalized IoU and connect semantically similar objects in consecutive frames
  Usage: captures spatio-temporal dynamics for generating video descriptions
Temporal Graph Aggregation
  Nodes: represent aggregated temporal features
  Edges: update based on hidden state and graph representation
  Usage: enhances temporal context understanding for captioning
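The temporal edges of the spatio-temporal graph above, weighted by normalized IoU between objects in consecutive frames, can be sketched as follows. Reading "normalized" as dividing each object's outgoing IoU values by their sum is an assumption of this example, the semantic-similarity check on object classes is left out, and the boxes and function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def temporal_edges(frames):
    """Return {(t, i, j): weight} linking box i in frame t to box j in frame t + 1.

    Weights leaving each source box are normalised to sum to 1 (assumed scheme).
    """
    edges = {}
    for t in range(len(frames) - 1):
        for i, src in enumerate(frames[t]):
            ious = [box_iou(src, dst) for dst in frames[t + 1]]
            total = sum(ious)
            for j, iou in enumerate(ious):
                if iou > 0:  # total > 0 is guaranteed whenever any iou > 0
                    edges[(t, i, j)] = iou / total
    return edges


# Two frames of hypothetical detections.
frames = [
    [(0, 0, 10, 10), (20, 20, 30, 30)],
    [(1, 1, 11, 11), (5, 5, 15, 15), (21, 20, 31, 30)],
]
edges = temporal_edges(frames)
```

The first box in frame 0 overlaps two boxes in frame 1, so its weight is split between them with most of it going to the nearly identical box, while the second box has a single successor and receives weight 1.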