A Generalization of Transformer Networks to Graphs.pptx


About This Presentation

A Generalization of Transformer Networks to Graphs


Slide Content

A Generalization of Transformer Networks to Graphs. Ho-Beom Kim, Network Science Lab, Dept. of Mathematics, The Catholic University of Korea. E-mail: [email protected]. 2023/10/30. Paper: Dwivedi, Vijay Prakash, AAAI 2021.

Introduction – Problem Statements: How do we deal with the sparsity characteristic of graphs? Which positional encoding technique should be used to represent the position of a node in the graph?

Introduction – Contributions: They find that the most fruitful ideas from the NLP transformer literature can be applied to graphs in a more efficient way, and posit that sparsity and positional encodings are two key aspects in the development of a Graph Transformer. They put forward a generalization of transformer networks to homogeneous graphs of arbitrary structure, namely the Graph Transformer, and an extended version of the Graph Transformer with edge features that allows explicit domain information to be used as edge features. Their method includes an elegant way to fuse node positional features using Laplacian eigenvectors, inspired by the heavy usage of positional encodings in NLP transformer models and by recent research on node positional features in GNNs. Their experiments demonstrate that the proposed model surpasses baseline isotropic and anisotropic GNNs.

Methodology – Graph Sparsity: Sparsity can be a great inductive bias for learning on graphs; when a graph is sparse, there are few edges between nodes. In the original transformer, attention turns a sentence into a fully connected graph of words, because it is hard to know in advance which word pairs interact meaningfully. The sequences handled by NLP transformers correspond to graphs of only tens or hundreds of nodes, so computing attention over all pairs is feasible. Graph datasets, by contrast, come with arbitrary connectivity structures (i.e., edges), and the number of nodes can reach millions or billions, so the full attention used in standard transformers would be impossible or very inefficient. The Graph Transformer therefore restricts attention to a node's local neighborhood, as sketched below.
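As a rough illustration only (not the paper's implementation, which uses sparse message passing over the edge list), attention can be restricted to graph neighbors by masking non-edges before the softmax. The function name and the dense adjacency mask `adj` (assumed to include self-loops) are assumptions of this sketch:

```python
import torch

def neighbor_attention(Q, K, adj):
    """Attention weights restricted to graph neighbors.

    Q, K: (n, d) query/key matrices; adj: (n, n) 0/1 adjacency matrix
    (assumed to include self-loops so every row has at least one edge).
    A standard transformer scores all n*n pairs; here non-neighbors are
    set to -inf so softmax assigns them zero weight.
    """
    d = Q.size(-1)
    scores = Q @ K.T / d ** 0.5                           # (n, n) pairwise scores
    scores = scores.masked_fill(adj == 0, float("-inf"))  # drop non-edges
    return torch.softmax(scores, dim=-1)                  # rows sum to 1 over neighbors
```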

Methodology – Overall Architecture

Methodology – Positional Encodings: The transformer in NLP uses positional encodings, which convey where each word token sits in the sequence and how far apart words are. Until now, GNNs have learned structural information about nodes that is invariant to the node's position in the graph. In the Graph Transformer, each node attends to its local neighbors, and Laplacian eigenvectors are used as the positional encoding: the symmetric normalized Laplacian is Δ = I − D^(−1/2) A D^(−1/2), with eigendecomposition Δ = U Λ U^T, where Λ holds the eigenvalues and the columns of U are the eigenvectors. Each node's entries in the k eigenvectors with the smallest non-trivial eigenvalues form its positional encoding (a minimal computation is sketched below).
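A minimal NumPy sketch of the Laplacian positional encoding, assuming a dense adjacency matrix; the function name is an assumption, and the paper's preprocessing details (sparse eigensolvers, random sign flipping of eigenvectors during training) are simplified away:

```python
import numpy as np

def laplacian_positional_encoding(A, k):
    """k-dimensional Laplacian eigenvector PE for each node.

    A: (n, n) dense adjacency matrix, k < n.
    L = I - D^{-1/2} A D^{-1/2} is the symmetric normalized Laplacian;
    row i of the returned (n, k) matrix is the PE of node i.
    """
    deg = A.sum(axis=1).astype(float)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigval, eigvec = np.linalg.eigh(L)   # eigenvalues in ascending order
    # Drop the trivial eigenvector (eigenvalue ~0); keep the next k.
    # Eigenvector signs are arbitrary; the paper flips them randomly in training.
    return eigvec[:, 1:k + 1]
```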

Methodology – Graph Transformer Architecture – Input: node features α_i and edge features β_ij are each embedded into d dimensions through linear projections. The positional encoding is likewise projected to d dimensions and added to the node embedding; this addition happens only at the input layer (see the sketch below).
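A minimal PyTorch sketch of this input embedding step; the class and attribute names are assumptions, but the flow (project α_i, β_ij and the Laplacian PE to d dimensions, add the PE to the node embedding at the input layer only) follows the slide:

```python
import torch.nn as nn

class GraphTransformerInput(nn.Module):
    """Input embedding sketch: project node/edge features and Laplacian PE
    to d dimensions, then add the PE to the node embedding."""
    def __init__(self, node_dim, edge_dim, pe_dim, d):
        super().__init__()
        self.node_proj = nn.Linear(node_dim, d)
        self.edge_proj = nn.Linear(edge_dim, d)
        self.pe_proj = nn.Linear(pe_dim, d)

    def forward(self, alpha, beta, lap_pe):
        h = self.node_proj(alpha) + self.pe_proj(lap_pe)  # node embedding + PE
        e = self.edge_proj(beta)                          # edge embedding
        return h, e
```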

Methodology – Graph Transformer Architecture – Graph Transformer Layer: The attention output is passed through a Feed Forward Network, and both the attention and FFN sub-blocks are wrapped in residual connections and normalization layers. Normalization uses either LayerNorm or BatchNorm. A simplified single-head sketch follows.
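The sketch below is a single-head simplification with a dense adjacency mask instead of the paper's multi-head sparse message passing; LayerNorm is used here although the paper also evaluates BatchNorm, and the module names are assumptions:

```python
import torch
import torch.nn as nn

class GraphTransformerLayerSketch(nn.Module):
    """One Graph Transformer layer, simplified to a single attention head."""
    def __init__(self, d, ffn_mult=2):
        super().__init__()
        self.q, self.k, self.v, self.o = (nn.Linear(d, d) for _ in range(4))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, ffn_mult * d), nn.ReLU(),
                                 nn.Linear(ffn_mult * d, d))

    def forward(self, h, adj):
        # h: (n, d) node features, adj: (n, n) 0/1 mask assumed to have self-loops
        d = h.size(-1)
        scores = self.q(h) @ self.k(h).T / d ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))       # neighbors only
        attn_out = self.o(torch.softmax(scores, dim=-1) @ self.v(h))
        h = self.norm1(h + attn_out)       # residual + normalization around attention
        h = self.norm2(h + self.ffn(h))    # FFN, then residual + normalization
        return h
```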

Methodology – Graph Transformer Architecture – Graph Transformer Layer with edge features: the layer produces both an edge attention output and a node attention output (equations shown on the slide).

Methodology – Graph Transformer Architecture – Graph Transformer Layer with edge features: The node and edge attention outputs are then passed to separate Feed Forward Networks, each preceded and succeeded by residual connections and normalization layers (see the sketch below).
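A rough single-head sketch of the edge-feature variant under the same dense-mask simplification; the paper's per-dimension score handling, clamping, and multi-head details are omitted, and all module names are assumptions. Edge embeddings modulate the node attention scores and are updated themselves, and the node and edge outputs each pass through their own FFN with residual connections and normalization:

```python
import torch
import torch.nn as nn

class GraphTransformerLayerWithEdges(nn.Module):
    """Edge-feature variant sketch: joint node and edge updates, single head."""
    def __init__(self, d, ffn_mult=2):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(d, d) for _ in range(3))
        self.e_proj = nn.Linear(d, d)
        self.o_h, self.o_e = nn.Linear(d, d), nn.Linear(d, d)
        self.norm_h1, self.norm_h2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.norm_e1, self.norm_e2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn_h = nn.Sequential(nn.Linear(d, ffn_mult * d), nn.ReLU(),
                                   nn.Linear(ffn_mult * d, d))
        self.ffn_e = nn.Sequential(nn.Linear(d, ffn_mult * d), nn.ReLU(),
                                   nn.Linear(ffn_mult * d, d))

    def forward(self, h, e, adj):
        # h: (n, d) nodes, e: (n, n, d) edges, adj: (n, n) mask with self-loops
        d = h.size(-1)
        # Per-dimension query-key products, modulated by the edge embedding.
        s = self.q(h)[:, None, :] * self.k(h)[None, :, :] * self.e_proj(e) / d ** 0.5
        # Edge attention output: residual + norm, then a separate FFN block.
        e_out = self.norm_e1(e + self.o_e(s))
        e_out = self.norm_e2(e_out + self.ffn_e(e_out))
        # Node attention output: reduce scores, mask non-edges, softmax over neighbors.
        w = s.sum(-1).masked_fill(adj == 0, float("-inf"))
        h_out = self.o_h(torch.softmax(w, dim=-1) @ self.v(h))
        h_out = self.norm_h1(h + h_out)
        h_out = self.norm_h2(h_out + self.ffn_h(h_out))
        return h_out, e_out
```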

Experiments – Datasets: ZINC (graph regression) is a molecular dataset whose task is regressing a graph property, constrained solubility. PATTERN (node classification) is generated using Stochastic Block Models (SBM). CLUSTER (node classification) is also a synthetic dataset generated with the SBM.

Experiments – Results of the Graph Transformer (GT) on all datasets

Experiments – Comparison of their best performing scores / Analysis of the Graph Transformer using different PE schemes

Conclusion – Conclusion / Future works: Their experiments consistently showed that using Laplacian eigenvectors as node positional encodings, and batch normalization in place of layer normalization around the transformer feed forward layers, enhanced the transformer universally across all experiments. In future work, they are interested in building on the Graph Transformer along aspects such as efficient training on single large graphs and applicability to heterogeneous domains, and in performing efficient graph representation learning that takes into account recent innovations in graph inductive biases.