240610_Thuy_Labseminar[Rethinking Tokenizer and Decoder in Masked Graph Modeling for Molecules].pptx


About This Presentation

Rethinking Tokenizer and Decoder in Masked Graph Modeling for Molecules


Slide Content

Rethinking Tokenizer and Decoder in Masked Graph Modeling for Molecules
Van Thuy Hoang, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: [email protected]
2024-06-10

BACKGROUND: Graph Convolutional Networks (GCNs)
Key Idea: Each node aggregates information from its neighborhood to obtain a contextualized node embedding.
Limitation: Most GNNs focus on homogeneous graphs.
(Figure: neural transformation and aggregation of neighbors' information.)
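As a minimal illustration of the aggregate-then-transform idea (not taken from the slides), the sketch below implements one GCN-style layer over a dense adjacency matrix; the normalization and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One GCN-style layer: aggregate neighbors' information, then apply a neural transformation."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, F) node features, adj: (N, N) dense adjacency matrix
        adj_hat = adj + torch.eye(adj.size(0))                     # add self-loops
        deg_inv_sqrt = adj_hat.sum(dim=1).pow(-0.5)
        norm_adj = deg_inv_sqrt.unsqueeze(1) * adj_hat * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(norm_adj @ x))               # aggregate, transform, activate
```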

BACKGROUND: Motivation
A large part of deep learning revolves around finding rich representations of unstructured data such as images, text, and graphs; this motivates self-supervised learning (SSL) on graphs.

BACKGROUND: Motivation
The architecture of the Variational Graph AutoEncoder (VGAE).
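A minimal sketch of the VGAE idea (a GCN encoder producing a latent mean and log-variance, and an inner-product decoder reconstructing the adjacency); the class and layer names are illustrative assumptions and reuse the SimpleGCNLayer sketched above.

```python
import torch
import torch.nn as nn

class TinyVGAE(nn.Module):
    """Variational Graph AutoEncoder sketch: GCN encoder -> latent z -> inner-product decoder."""
    def __init__(self, in_dim, hid_dim, lat_dim):
        super().__init__()
        self.gcn = SimpleGCNLayer(in_dim, hid_dim)   # hypothetical GCN layer from the sketch above
        self.mu = nn.Linear(hid_dim, lat_dim)        # mean of the node latent distribution
        self.logvar = nn.Linear(hid_dim, lat_dim)    # log-variance of the node latent distribution

    def forward(self, x, adj):
        h = self.gcn(x, adj)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        adj_rec = torch.sigmoid(z @ z.t())                     # inner-product decoder
        return adj_rec, mu, logvar
```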

Graph Tokenizer
Given a graph G, the graph tokenizer employs a graph fragmentation function to break G into smaller subgraphs, such as nodes and motifs. These fragments are then mapped into fixed-length tokens that serve as the targets to be reconstructed later. The granularity of the graph tokens determines the abstraction level of the representations learned in masked modeling.
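The sketch below illustrates this tokenizer interface under the simplest choice, node-level fragments; the function and class names are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

def fragment_into_nodes(x, adj):
    """Fragmentation f(g): each node together with its 1-hop neighbors forms one fragment."""
    fragments = []
    for v in range(x.size(0)):
        neighbors = adj[v].nonzero(as_tuple=True)[0]
        fragments.append((v, neighbors))
    return fragments

class NodeTokenizer(nn.Module):
    """tok(g): map every fragment to a fixed-length token y_t in R^d (the reconstruction target)."""
    def __init__(self, in_dim, token_dim):
        super().__init__()
        self.embed = nn.Linear(in_dim, token_dim)

    def forward(self, x, adj):
        tokens = []
        for v, neighbors in fragment_into_nodes(x, adj):
            frag_feats = torch.cat([x[v:v + 1], x[neighbors]], dim=0)  # the fragment's node features
            tokens.append(self.embed(frag_feats.mean(dim=0)))          # pool, then project to R^d
        return torch.stack(tokens)                                     # (num_fragments, token_dim)
```

Swapping fragment_into_nodes for a motif-level fragmentation changes the granularity of the tokens, and with it the abstraction level of the reconstruction targets.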

Preliminary: Masked Graph Modeling
Three key steps: graph tokenizer, graph masking, and graph autoencoder.
Graph tokenizer: tok(g) = { y_t = m(t) ∈ R^d | t ∈ f(g) } generates the graph tokens used as reconstruction targets. The tokenizer tok(·) is composed of a fragmentation function f that breaks g into a set of subgraphs and a mapping m that embeds each fragment t into a d-dimensional token.
Graph masking: remask decoding masks the hidden representations of the masked nodes Vm again with a special token m1 before decoding.
Graph autoencoder: the encoder-decoder trained to reconstruct the tokens of the masked parts.
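A hedged sketch of remask decoding under these definitions: the masked nodes Vm are replaced by a mask token at the input, and their hidden representations are masked again with the special token m1 before decoding. The encoder and decoder here are placeholders.

```python
import torch

def remask_decode(x, adj, mask_idx, encoder, decoder, mask_token, remask_token):
    """Masked graph modeling forward pass with remask decoding (sketch).

    x            : (N, F) node features; mask_idx: indices of the masked nodes Vm
    mask_token   : (F,) token replacing the input features of Vm
    remask_token : (H,) special token m1 replacing the hidden states of Vm before decoding
    """
    x_masked = x.clone()
    x_masked[mask_idx] = mask_token        # input masking
    h = encoder(x_masked, adj)             # the encoder only sees the masked inputs
    h = h.clone()
    h[mask_idx] = remask_token             # remask: hide the encoder's states for Vm from the decoder
    return decoder(h, adj)                 # the decoder predicts the tokens y_t of the masked nodes
```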

Preliminary: Revisiting Molecule Tokenizers
The molecule tokenizers are summarized into four distinct categories.

(a) A pretrained GNN-based tokenizer. (b) A motif-based tokenizer that applies fragmentation functions for cycles and the remaining nodes. (c) A two-layer GIN-based tokenizer that extracts the 2-hop rooted subtree of every node in the graph.
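To make the motif-based option in (b) concrete, here is a small, hedged sketch using networkx: rings (cycles) become motifs and every node outside a ring becomes its own fragment. This is an illustration, not the paper's fragmentation code.

```python
import networkx as nx

def motif_fragments(g: nx.Graph):
    """Motif-based fragmentation: cycle (ring) motifs plus the remaining single nodes."""
    cycles = [frozenset(c) for c in nx.cycle_basis(g)]
    in_cycle = set().union(*cycles) if cycles else set()
    singles = [frozenset([v]) for v in g.nodes if v not in in_cycle]
    return cycles + singles
```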

Overview of the SimSGT framework. It applies the GTS architecture (GINE and GraphTrans) for both its encoder and decoder. SimSGT features a Simple GNN-based Tokenizer (SGT) and employs a new remask strategy to decouple the encoder and decoder of the GTS architecture.
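The sketch below shows how the pieces could fit together in one pretraining step (tokenize targets, sample Vm, encode with remask decoding, regress the targets on the masked nodes). It reuses the hypothetical NodeTokenizer and remask_decode sketches above and is not SimSGT's actual code.

```python
import torch
import torch.nn.functional as F

def pretrain_step(x, adj, tokenizer, encoder, decoder, mask_token, remask_token, mask_ratio=0.25):
    """One masked-graph-modeling step: reconstruct the tokenizer's targets for the masked nodes."""
    with torch.no_grad():
        targets = tokenizer(x, adj)                        # y_t for every node/fragment
    num_mask = max(1, int(mask_ratio * x.size(0)))
    mask_idx = torch.randperm(x.size(0))[:num_mask]        # sample the masked node set Vm
    pred = remask_decode(x, adj, mask_idx, encoder, decoder, mask_token, remask_token)
    return F.mse_loss(pred[mask_idx], targets[mask_idx])   # loss only on the masked nodes
```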

Simple GNN-based Tokenizer
SGT simplifies existing aggregation-based GNNs by removing the nonlinear update functions in the GNN layers. It is inspired by studies showing that carefully designed graph operators can generate effective node representations.
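A minimal sketch of that simplification: an aggregation-only layer with the nonlinear update (MLP/activation) removed, stacked a few times to produce the token targets. The names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SGTLayer(nn.Module):
    """Aggregation-only GNN layer: no nonlinear update function."""
    def forward(self, x, adj):
        adj_hat = adj + torch.eye(adj.size(0))                           # add self-loops
        deg_inv = adj_hat.sum(dim=1, keepdim=True).clamp(min=1).reciprocal()
        return deg_inv * (adj_hat @ x)                                   # mean aggregation, no MLP/activation

class SimpleGNNTokenizer(nn.Module):
    """Stack a few aggregation-only layers; the outputs serve as reconstruction targets."""
    def __init__(self, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([SGTLayer() for _ in range(num_layers)])

    def forward(self, x, adj):
        for layer in self.layers:
            x = layer(x, adj)
        return x
```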

Experiments: Molecular property prediction (transfer learning)

Experiments: Transfer learning performance for molecular property prediction (regression)

Conclusion and Future Work
The roles of the tokenizer and decoder in MGM for molecules are examined. A comprehensive range of molecule fragmentation functions is evaluated as molecule tokenizers; the results reveal that a subgraph-level tokenizer improves molecular representation learning (MRL) performance. For future work: the potential application of molecule tokenizers to joint molecule-text modeling.