240429_Thuy_Labseminar[Simplifying and Empowering Transformers for Large-Graph Representations].pptx
About This Presentation
Simplifying and Empowering Transformers for Large-Graph Representations
Size: 1.55 MB
Language: en
Added: Apr 29, 2024
Slides: 13 pages
Slide Content
Simplifying and Empowering Transformers for Large-Graph Representations (NeurIPS'23). Van Thuy Hoang, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea. E-mail: [email protected]. 2024-04-08
BACKGROUND: Message Passing GNNs vs Graph Transformers. Message-passing GNNs generate node embeddings based on local network neighborhoods: each node has an embedding at every layer, repeatedly combining messages from its neighbors using neural networks.
Message Passing GNNs vs Graph Transformers. In message-passing GNNs, a node's update is a function of its neighbors; in GTs, a node's update is a function of all nodes in the graph, thanks to the self-attention mechanism in the Transformer layer. The contrast is sketched below.
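A minimal illustrative contrast of the two update rules (not the paper's code): the first function aggregates only over a node's neighbors via the adjacency matrix, while the second attends over all nodes in the graph.

```python
import torch
import torch.nn.functional as F

def message_passing_update(X, A, W):
    """X: [N, d] node features, A: [N, N] adjacency, W: [d, d'] weights.
    Each node combines messages from its neighbors (mean aggregation)."""
    deg = A.sum(dim=1, keepdim=True).clamp(min=1)   # node degrees
    H = (A @ X) / deg                               # mean over neighbors only
    return F.relu(H @ W)

def global_attention_update(X, Wq, Wk, Wv):
    """Transformer-style update: every node attends to every other node."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    attn = F.softmax(Q @ K.T / K.shape[1] ** 0.5, dim=-1)  # dense [N, N] attention
    return attn @ V
```

The dense [N, N] attention matrix in the second function is exactly what makes global attention quadratic in the number of nodes, which motivates the challenges on the next slide.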
Graph Transformers: Challenges. How to build GTs for large-graph representations: the quadratic cost of global attention hinders scalability to large graphs, and there is an over-fitting problem.
Deep attention layers. Do we need many attention layers? Other Transformers often require multiple attention layers to reach the desired capacity.
The power of 1-layer attention. Mini-batch sampling randomly partitions the input graph into smaller mini-batches, which are fed into the SGFormer model, implemented with a one-layer global attention and a GNN network (see the sketch below).
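A minimal sketch of the random mini-batch partitioning described on this slide; the helper name and the commented usage are illustrative assumptions, not SGFormer's actual API.

```python
import torch

def random_node_batches(num_nodes, batch_size):
    """Shuffle node indices and split them into equally sized mini-batches."""
    perm = torch.randperm(num_nodes)
    return [perm[i:i + batch_size] for i in range(0, num_nodes, batch_size)]

# hypothetical usage: feed each batch's features and induced subgraph to the model
# for idx in random_node_batches(X.shape[0], 10_000):
#     out = model(X[idx], A[idx][:, idx])
```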
Simple Global Attention. A single layer of global attention is sufficient: through one-layer propagation over a densely connected attention graph, the information of each node can be adaptively propagated to arbitrary nodes within the batch. The computation of Eq. (3) can be achieved in O(N) time, which is much more efficient than the softmax attention of the original Transformer.
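A sketch of the linear-attention idea behind the O(N) claim, not a verbatim re-implementation of the paper's Eq. (3): by normalizing Q and K, dropping the softmax, and associating the product as Q(KᵀV) instead of (QKᵀ)V, the N x N attention matrix is never materialized, so the cost scales as O(N d²) rather than O(N² d).

```python
import torch
import torch.nn.functional as F

def linear_global_attention(X, Wq, Wk, Wv):
    """Softmax-free one-layer global attention computed in O(N) (sketch)."""
    N = X.shape[0]
    Q = F.normalize(X @ Wq, p=2, dim=-1)          # [N, d], row-normalized queries
    K = F.normalize(X @ Wk, p=2, dim=-1)          # [N, d], row-normalized keys
    V = X @ Wv                                    # [N, d]
    kv = (K.T @ V) / N                            # [d, d] summary of all nodes
    num = V + Q @ kv                              # propagate the summary to every node
    denom = 1 + (Q @ K.sum(dim=0, keepdim=True).T) / N   # [N, 1] row normalizer
    return num / denom
```

The key design choice is the order of matrix multiplication: `K.T @ V` is a d x d summary, so each node's output is obtained without ever forming the pairwise attention matrix.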
Incorporation of Structural Information. A simple yet effective scheme combines the attention output Z with the embeddings propagated by GNNs at the output layer; a hedged sketch follows below.
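One plausible reading of this combination, sketched below: the global-attention output Z and a GNN (e.g., GCN) embedding of the same batch are mixed by a weighted sum before a prediction head. The mixing weight `alpha` and the linear head are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class AttentionGNNCombiner(nn.Module):
    """Mixes the attention output with GNN-propagated embeddings at the output layer."""

    def __init__(self, hidden_dim, num_classes, alpha=0.5):
        super().__init__()
        self.alpha = alpha                        # assumed mixing weight
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, Z_attn, Z_gnn):
        # weighted sum of the two views, then the prediction head
        Z = (1.0 - self.alpha) * Z_attn + self.alpha * Z_gnn
        return self.out(Z)
```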
Empirical Evaluation. Datasets: Cora, CiteSeer, PubMed, Actor, Squirrel, Chameleon, Deezer.
Empirical Evaluation. Scalability test of training time per epoch on the Amazon2M dataset, randomly sampling subsets of nodes with the node count ranging from 10K to 100K.
SUMMARY. The potential of simple Transformer-style architectures for learning large-graph representations, where scalability is the bottleneck. A one-layer attention model combined with a vanilla GCN can, surprisingly, produce highly competitive performance. Remaining challenge: out-of-distribution learning.