240429_Thuy_Labseminar[Simplifying and Empowering Transformers for Large-Graph Representations].pptx


About This Presentation

Simplifying and Empowering Transformers for Large-Graph Representations


Slide Content

Simplifying and Empowering Transformers for Large-Graph Representations (NeurIPS 2023). Van Thuy Hoang, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea. E-mail: [email protected]. 2024-04-08

BACKGROUND: Message Passing GNNs vs. Graph Transformers. Message passing GNNs generate node embeddings based on local network neighborhoods: every node has an embedding at each layer, and each layer repeatedly combines messages from a node's neighbors using neural networks.
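A minimal sketch of one such message passing layer (mean aggregation over neighbors), assuming a dense adjacency matrix `adj` and node feature matrix `x`; the class name and weighting are illustrative, not the exact layer used in the paper.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of neighbor aggregation: h_v = relu(W_self x_v + W_nbr mean_{u in N(v)} x_u)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim)
        self.w_nbr = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: [N, in_dim] node features, adj: [N, N] binary adjacency (no self-loops)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # avoid division by zero for isolated nodes
        nbr_mean = adj @ x / deg                         # mean of neighbor features
        return torch.relu(self.w_self(x) + self.w_nbr(nbr_mean))
```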

Message Passing GNNs vs. Graph Transformers. In message passing GNNs, a node's update is a function of its neighbors; in Graph Transformers (GTs), a node's update is a function of all nodes in the graph, thanks to the self-attention mechanism in the Transformer layer.
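For contrast with the layer above, here is a minimal sketch of vanilla global self-attention over all nodes, where every node's update mixes information from every other node; the class name is illustrative, and the [N, N] attention matrix is what later makes the quadratic cost explicit.

```python
import torch
import torch.nn as nn

class GlobalNodeAttention(nn.Module):
    """Every node attends to every other node (vanilla softmax attention, O(N^2))."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x):
        # x: [N, dim] node embeddings for the whole graph (or batch)
        Q, K, V = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)  # [N, N] attention matrix
        return attn @ V                                             # each update is a function of all nodes
```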

Graph Transformers: Challenges. How to build GTs for large-graph representations? The quadratic cost of global attention hinders scalability to large graphs, and deep attention stacks are prone to over-fitting.

Deep attention layers: do we need many attention layers? Existing Transformers often require multiple attention layers to reach the desired capacity.

The power of one-layer attention: mini-batch sampling randomly partitions the input graph into smaller mini-batches, which are fed into the SGFormer model, implemented as a single-layer global attention combined with a GNN.
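A minimal sketch of this random node partitioning and mini-batch training loop, assuming dense tensors and a hypothetical `model(sub_x, sub_adj)` interface standing in for the SGFormer-style network; the batch size and training details are illustrative.

```python
import torch

def random_node_minibatches(num_nodes, batch_size):
    """Randomly partition node indices into mini-batches of smaller size."""
    perm = torch.randperm(num_nodes)
    return [perm[i:i + batch_size] for i in range(0, num_nodes, batch_size)]

def train_epoch(model, x, adj, labels, optimizer, batch_size=10_000):
    """Hypothetical training step: feed each batch's induced subgraph to the model."""
    for idx in random_node_minibatches(x.shape[0], batch_size):
        sub_x, sub_adj, sub_y = x[idx], adj[idx][:, idx], labels[idx]
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(sub_x, sub_adj), sub_y)
        loss.backward()
        optimizer.step()
```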

Simple Global Attention: a single layer of global attention is sufficient. Through one layer of propagation over the densely connected attention graph, the information of each node can be adaptively propagated to any other node within the batch. The attention in Eq. (3) can be computed in O(N) time, which is far more efficient than the Softmax attention of the original Transformer.
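The O(N) cost comes from associating the matrix products as Q(KᵀV) rather than (QKᵀ)V, so the N×N attention matrix is never materialized. The sketch below illustrates this linearization idea; the normalization is one simple choice and is not claimed to reproduce the paper's Eq. (3) exactly.

```python
import torch

def simple_linear_attention(Q, K, V, eps=1e-6):
    """Single-layer linear global attention sketch in the spirit of Eq. (3):
    compute K^T V first (a d x d matrix), so cost scales linearly in N."""
    N = Q.shape[0]
    Qn = Q / (Q.norm(dim=-1, keepdim=True) + eps)           # row-normalized queries
    Kn = K / (K.norm(dim=-1, keepdim=True) + eps)           # row-normalized keys
    kv = Kn.transpose(-2, -1) @ V                           # [d, d], linear in N
    k_sum = Kn.sum(dim=0, keepdim=True).transpose(-2, -1)   # [d, 1]
    denom = 1.0 + (Qn @ k_sum) / N                          # [N, 1] per-node normalizer
    return (V + (Qn @ kv) / N) / denom                      # never builds the N x N matrix
```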

Incorporation of Structural Information: a simple yet effective scheme combines the attention output Z with the embeddings propagated by a GNN at the output layer.
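A minimal sketch of such an output-layer combination, assuming the global-attention embedding `z_attn` and GNN embedding `z_gnn` are fused by a weighted sum controlled by an illustrative hyperparameter `alpha` (the exact weighting scheme in the paper may differ).

```python
import torch.nn as nn

class OutputCombiner(nn.Module):
    """Fuse global-attention embeddings with GNN-propagated embeddings at the output layer."""
    def __init__(self, hidden_dim, num_classes, alpha=0.5):
        super().__init__()
        self.alpha = alpha                                   # illustrative mixing weight
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, z_attn, z_gnn):
        z = (1 - self.alpha) * z_attn + self.alpha * z_gnn   # combine the two views
        return self.classifier(z)
```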

Empirical Evaluation. Datasets: Cora, CiteSeer, PubMed, Actor, Squirrel, Chameleon, Deezer.

Empirical Evaluation: results on node property prediction benchmarks.

Empirical Evaluation: scalability test of training time per epoch on the Amazon2M dataset, randomly sampling node subsets ranging from 10K to 100K nodes.
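A rough sketch of how such a per-epoch timing experiment could be set up on random node subsets of increasing size; dataset loading is omitted and the model interface is the same hypothetical `model(sub_x, sub_adj)` as above, not the paper's actual benchmarking harness.

```python
import time
import torch

def time_per_epoch(model, x, adj, labels, optimizer, sizes=(10_000, 25_000, 50_000, 100_000)):
    """Illustrative timing loop: train one step per random node subset of growing size."""
    results = {}
    for n in sizes:
        idx = torch.randperm(x.shape[0])[:n]                 # random subset of n nodes
        sub_x, sub_adj, sub_y = x[idx], adj[idx][:, idx], labels[idx]
        start = time.time()
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(sub_x, sub_adj), sub_y)
        loss.backward()
        optimizer.step()
        results[n] = time.time() - start                     # seconds per training step
    return results
```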

SUMMARY: The paper demonstrates the potential of simple Transformer-style architectures for learning large-graph representations, where scalability is the main bottleneck. A one-layer attention model combined with a vanilla GCN can, surprisingly, produce highly competitive performance. Out-of-distribution learning remains a challenge.