[NS][Lab_Seminar_240610]Graph Representation Learning Meets Computer Vision: A Survey.pptx

Uploaded by thanhdowork, Jun 24, 2024 (14 slides)


Slide Content

Graph Representation Learning Meets Computer Vision: A Survey
Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/06/10
Licheng Jiao et al., IEEE Transactions on AI, 2023

Introduction
- A graph is a data structure with compositional and relational properties
- Since the theoretical breakthroughs of the GCN, graph learning has been a major topic in the DL community
- A CNN extracts features by traversing the entire image with a series of sliding filters and has good spatial invariance
- A CNN observes pixels at different locations to obtain as much contextual information as possible and identifies targets in the image through rich descriptions
- However, a CNN ignores the relationships between pixels, which hinders adaptive communication between pixels
- A graph is a collection of objects (nodes) and interactions between objects (edges)
- Graph nodes can independently represent the properties of objects, and graphs can flexibly build relationships between objects of interest
- Using graph-based representations to compactly encode complex visual scenes is an efficient way to achieve relational modeling and reasoning
Contributions
- Traces the evolution of graph processing algorithms from non-neural-network methods to neural networks
- Organizes 11 visual tasks and summarizes 8 graph modeling approaches commonly used across different visual tasks
- Discusses future directions for GRL and CV and lists 10 open issues

Preliminaries: Graph
- A graph is a kind of structured data consisting of a set of nodes and a set of edges connecting them
- Given a graph G = (V, E, W), V = {v1, v2, ..., vN} is the set of N nodes and E is the edge set
- The weighted adjacency matrix W is a real symmetric matrix whose element W_ij denotes the similarity of a pair of nodes; W_ij = 0 means there is no edge between nodes v_i and v_j
- A pair of nodes v_i, v_j can be connected by a directed edge v_i → v_j or an undirected edge v_i – v_j
- When v_i → v_j, v_j is a child of v_i in G; when v_i – v_j, v_j is a neighbor of v_i
- The k-hop neighborhood of node v_i is denoted Nei_k
- Laplacians are graph representation matrices with useful mathematical properties, obtained as transformations of the adjacency matrix
- Unnormalized Laplacian: L = D − W, where D is the diagonal degree matrix with D_ii = Σ_{j≠i} W_ij
- Normalized Laplacian: L_norm = D^(−1/2) L D^(−1/2) = I_N − D^(−1/2) W D^(−1/2); its eigenvalues lie in the range [0, 2]
- Common types of graph include trees, undirected (acyclic) graphs, directed acyclic graphs (DAGs), undirected circular graphs, partially directed acyclic graphs, ...
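The Laplacian definitions above can be checked numerically; a minimal NumPy sketch on a toy 3-node graph (the graph and weights are invented for illustration):

```python
import numpy as np

# Toy undirected weighted graph with 3 nodes:
# 0 -- 1 (weight 1), 1 -- 2 (weight 2), no edge between 0 and 2
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])

# Degree matrix: D_ii = sum_j W_ij (diagonal)
D = np.diag(W.sum(axis=1))

# Unnormalized Laplacian: L = D - W
L = D - W

# Normalized Laplacian: L_norm = I_N - D^(-1/2) W D^(-1/2)
D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
L_norm = np.eye(3) - D_inv_sqrt @ W @ D_inv_sqrt

# Eigenvalues of the normalized Laplacian always lie in [0, 2]
eigvals = np.linalg.eigvalsh(L_norm)
```

Note that each row of L sums to zero by construction, a property graph signal processing relies on.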

Preliminaries Graph

Preliminaries General Framework

Preliminaries: General Framework
- First, in many situations the collected data must be given a graph structure: e.g., for grid data such as images, the components of the image or its features should be structured as a graph before applying a graph processing algorithm
- The algorithm then generates node representations for the structured data
- To make nodes representable and inferable in a vector space, nodes are encoded as low-dimensional, dense real-valued vectors, called node embeddings
- Shallow and deep encoders share a common goal: learn a function mapping nodes in the original space to vectors in the low-dimensional space while preserving similarity, so that two nodes with similar features and structure in the original space are also similar in the vector space
- Shallow encoders are usually combined with similarity metric functions; the goal is to make node similarity in the embedding space approximate node similarity in the original graph space
- Common similarity functions include adjacency-based similarity, multihop similarity, and random walk approaches
- However, shallow encoding generates a unique embedding vector for each node, without fusing features between nodes and without parameter sharing
- Neural-network-based encoding methods instead generate node embeddings by aggregating neighbor information
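A minimal sketch of a shallow encoder with adjacency-based similarity: a free embedding table (no parameter sharing) is fit by gradient descent so that dot products of node vectors approximate adjacency entries. The toy graph, dimensions, and learning rate are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy adjacency matrix: two triangles joined by the edge 2-3
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

N, d = A.shape[0], 2
Z = 0.1 * rng.standard_normal((N, d))   # shallow encoder: one free vector per node

loss0 = np.sum((Z @ Z.T - A) ** 2)      # reconstruction loss before training

# Adjacency-based similarity: fit Z so that z_i . z_j approximates A_ij
lr = 0.01
for _ in range(2000):
    diff = Z @ Z.T - A
    Z -= lr * 4 * diff @ Z              # gradient of the squared Frobenius loss

loss = np.sum((Z @ Z.T - A) ** 2)       # should be well below loss0
```

With only d = 2 dimensions the reconstruction is a low-rank approximation, so the loss decreases but does not reach zero; that gap is one motivation for the neighbor-aggregating encoders discussed next.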

Graph Representation Learning Methods
- This section summarizes several graph representation learning methods commonly used in image processing
- Non-neural-network methods are represented by graph embedding methods and probabilistic graphical models (PGMs)
- Neural network methods are divided into graph recurrent neural networks (graph RNNs), GCNs, and variants of GNNs

Graph Representation Learning Methods: Neural Network Methods
- The key ideas are neighborhood aggregation and encoder parameter sharing

Graph Representation Learning Methods: Neural Network Methods
- In general form, neighborhood aggregation for node v at step/layer l+1 can be formulated as
  z_v^(l+1) = σ( W_l · Σ_{u ∈ Nei(v)} z_u^(l) / |Nei(v)| + B_l · z_v^(l) )
  where z_v^(0) = x_v means the initial step/layer-0 embedding equals the original node feature, and l is the index of a layer or an iterative step
- The node embedding at the (l+1)th step/layer depends not only on the previous-layer embeddings of the neighborhood but also on the node's own previous-layer embedding
- By iteratively computing local aggregations of neighboring features, information about nodes farther away in the graph can be indirectly propagated to node v
- The methods divide into 3 categories: graph RNNs, GCNs, and variants of GNNs
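One aggregation step in this general form can be sketched as follows; the mean aggregator, ReLU nonlinearity, and random toy weights are illustrative assumptions, not the survey's specific model:

```python
import numpy as np

def gnn_layer(Z, adj, W_self, W_neigh):
    """One neighborhood-aggregation step:
    z_v^(l+1) = ReLU( W_neigh . mean_{u in Nei(v)} z_u^(l) + W_self . z_v^(l) )."""
    deg = adj.sum(axis=1, keepdims=True)          # node degrees
    neigh_mean = (adj @ Z) / np.maximum(deg, 1)   # average neighbor embedding
    return np.maximum(0.0, neigh_mean @ W_neigh + Z @ W_self)

# Toy 4-node path graph 0-1-2-3 with 3-dim node features
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))       # z^(0) = original node features
W_self = rng.standard_normal((3, 3))  # B_l in the formula above
W_neigh = rng.standard_normal((3, 3)) # W_l in the formula above

Z1 = gnn_layer(X, adj, W_self, W_neigh)   # each node sees 1-hop neighbors
Z2 = gnn_layer(Z1, adj, W_self, W_neigh)  # 2-hop information reaches each node
```

Stacking the layer twice shows the iterative aspect: after two steps, node 0's embedding indirectly contains information from node 2, which is two hops away.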

Graph Representation Learning Methods: Application in Computer Vision
- The construction of graphs in CV is highly dependent on the specific task

Graph Representation Learning Methods: Application in Computer Vision (Classification)
- Key entities in an image can represent its visual properties, and associating these entities with the image allows a system to perform better in classification and segmentation tasks
- GRUs and LSTMs have been used for explicit modeling of label dependencies; the CNN-RNN framework exploits semantic redundancy and label co-occurrence dependencies for multilabel classification
- However, GRUs and LSTMs model region/label dependencies sequentially, so they cannot fully exploit the correlation between each region or label pair, and they do not explicitly model statistical label co-occurrence, which is also key to aiding multilabel image classification
- [64] builds a graph based on statistical label co-occurrence: after obtaining the feature vectors of all categories, these features are associated in the form of a graph, and interaction between categories is explored using a gated recurrent update mechanism to propagate messages through the graph and learn context-aware node-level features
- [65] uses a gated GNN for relative attribute learning, treating each pair of images as nodes and the relationships between the to-be-learned node representations as edges, constructing a graph to explore the similarities between multiple images
[64] T. Chen, M. Xu, X. Hui, H. Wu, and L. Lin, "Learning semantic-specific graph representation for multi-label image recognition," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 522–53
[65] Z. Meng, N. Adluru, H. J. Kim, G. Fung, and V. Singh, "Efficient relative attribute learning using graph neural networks," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 552–567
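A statistical label co-occurrence graph of the kind used as a starting point in such approaches can be sketched by counting label pairs over a training split; the label set and per-image annotations below are invented toy data:

```python
import numpy as np

labels = ["person", "dog", "frisbee", "car"]

# Per-image label sets from a toy training split (assumed data)
image_labels = [
    {"person", "dog"},
    {"person", "dog", "frisbee"},
    {"person", "frisbee"},
    {"car"},
    {"person", "car"},
]

idx = {name: i for i, name in enumerate(labels)}
N = len(labels)

# Co-occurrence counts: M_ij = number of images containing both label i and j
M = np.zeros((N, N))
for s in image_labels:
    for a in s:
        for b in s:
            if a != b:
                M[idx[a], idx[b]] += 1

# Conditional-probability adjacency: A_ij = P(L_j | L_i) = M_ij / count(L_i)
counts = np.array([sum(l in s for s in image_labels) for l in labels], dtype=float)
A = M / counts[:, None]
```

The resulting A is asymmetric by design: here "dog" always co-occurs with "person" (P = 1.0), while "person" appears without "dog" half the time (P = 0.5), which is exactly the directional statistic a label-dependency graph can exploit.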

Graph Representation Learning Methods: Application in Computer Vision (Semantic Segmentation)
- Semantic segmentation is the pixel-by-pixel classification of images
- Objective: assign a class label to each pixel in an image, creating a segmented output
- Graph-based segmentation: pixels are grouped into superpixels or clusters, and these groups are modeled as graphs for classification
- Combining CNNs with graph models: traditional segmentation methods can be enhanced by integrating graph models, where the CNN extracts features and graph models (e.g., CRFs) refine the pixel-wise classification
- Graph-FCN [78]: constructs graphs at the feature level, preserving local information while applying graph convolutions to improve segmentation accuracy
[78] Y. Lu, Y. Chen, D. Zhao, and J. Chen, "Graph-FCN for image semantic segmentation," in Proc. Int. Symp. Neural Netw., 2019, pp. 97–105
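A feature-level graph in the spirit of Graph-FCN can be sketched by turning each spatial location of a CNN feature map into a node and adding 4-connected grid edges; the shapes and random features below are toy assumptions, not Graph-FCN's actual construction:

```python
import numpy as np

# Toy CNN feature map: H x W spatial grid with C-dim features per location
H, W, C = 4, 5, 8
rng = np.random.default_rng(2)
feat = rng.standard_normal((H, W, C))

# Each spatial location becomes a graph node; node features are the CNN features
nodes = feat.reshape(H * W, C)

# 4-connected grid adjacency: edges between horizontally/vertically adjacent cells
A = np.zeros((H * W, H * W))
for r in range(H):
    for c in range(W):
        i = r * W + c
        if c + 1 < W:                       # right neighbor
            A[i, i + 1] = A[i + 1, i] = 1.0
        if r + 1 < H:                       # bottom neighbor
            A[i, i + W] = A[i + W, i] = 1.0
```

Graph convolutions applied over (nodes, A) then refine per-location class predictions while the grid edges preserve local spatial structure.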

Graph Representation Learning Methods: Application in Computer Vision (Object Detection)
- Objective: identify and locate objects within an image
- Graph modeling: represents objects and their spatial relationships as nodes and edges, capturing contextual information that aids detection
- Hypergraph modeling: extends graph representations by allowing edges to connect multiple nodes, providing richer contextual information
- Structure Inference Net (SIN) [92]: models objects and their relationships as a graph, with nodes representing objects and edges capturing their interactions, enhancing detection in complex scenes
- Universal-RCNN (transferable graph R-CNN) [94]: integrates GNNs with R-CNN to model interactions and context between objects, improving detection accuracy
[92] Y. Liu, R. Wang, S. Shan, and X. Chen, "Structure Inference Net: Object detection using scene-level context and instance-level relationships," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6985–6994
[94] H. Xu, L. Fang, X. Liang, W. Kang, and Z. Li, "Universal-RCNN: Universal object detector via transferable graph R-CNN," in Proc. 34th AAAI Conf. Artif. Intell., 2020, pp. 12492–12499
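A spatial-relation graph over detected objects can be sketched by connecting boxes that overlap; the boxes and the overlap criterion are toy assumptions standing in for a real detector's output and a model-specific edge rule:

```python
import numpy as np

# Detected boxes as (x1, y1, x2, y2); toy values standing in for detector output
boxes = np.array([[ 0,  0, 10, 10],
                  [ 8,  0, 18, 10],
                  [40, 40, 50, 50]], dtype=float)

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter)

# Spatial-relation graph: connect object pairs whose boxes overlap
n = len(boxes)
A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        if iou(boxes[i], boxes[j]) > 0:
            A[i, j] = A[j, i] = 1.0
```

Here the first two boxes overlap and get an edge, while the distant third box stays isolated; message passing over this graph is what lets context from one object inform the classification of another.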

Graph Representation Learning Methods: Application in Computer Vision (Scene Graph)
[97] B. Dai, Y. Zhang, and D. Lin, "Detecting visual relationships with deep relational networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3076–3086