250428_JunH_Seminar[CoLAKE: Contextualized Language and Knowledge Embedding].pptx


About This Presentation

CoLAKE: Contextualized Language and Knowledge Embedding


Slide Content

CoLAKE: Contextualized Language and Knowledge Embedding. Presenter: Cho Junhee, Network Science Lab, The Catholic University of Korea. E-mail: [email protected]

Abstract: Previous studies have incorporated knowledge into pre-trained language models (PLMs) by injecting only shallow, static, and separately pre-trained entity embeddings. However, such methods fail to jointly learn language and knowledge and cannot fully exploit the deep knowledge context.

Introduction: Deep contextualized language models pre-trained on large-scale unlabeled corpora have led to remarkable performance improvements across a wide range of natural language processing (NLP) tasks (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019). Limitation: these models have been shown to struggle to capture factual knowledge (Logan et al., 2019).

Introduction: Knowledge-enhanced pre-training aims to better capture factual knowledge by inserting pre-trained entity embeddings into PLMs, as exemplified by models such as ERNIE (Zhang et al., 2019) and KnowBERT (Peters et al., 2019). Limitation: these methods rely on human-curated structured knowledge.

Introduction (limitations of injecting human-curated structured knowledge): Because the entity embeddings are pre-trained separately and remain fixed during PLM training, language and knowledge are not jointly learned. Existing models incorporate only a single entity embedding into the PLM, so they fail to fully leverage the knowledge context. And since the pre-trained entity embeddings are static, updating the knowledge requires re-training. Together, these issues motivate a better integration of language representation and factual knowledge.

Introduction: CoLAKE jointly learns language representations and knowledge representations within a common representation space, dynamically represents entities based on both their knowledge context and language context, and extracts a subgraph centered around each entity from the knowledge graph (KG) to construct a dynamic knowledge context from the triplets within that subgraph.

Introduction CoLAKE

CoLAKE Overview: By pre-training on a structured, unsupervised word-knowledge graph (WK graph), CoLAKE simultaneously contextualizes both language representations and knowledge representations.

CoLAKE Graph Construction (making the WK graph; a code sketch of these steps follows the list):
1. Tokenize the sentence and build a fully connected word graph.
2. Identify entity mentions and link them to entities in the knowledge graph (KG) via entity linking.
3. Replace matched words with anchor nodes (entities) to align language and knowledge representations.
4. Extract knowledge subgraphs by selecting surrounding relations and entities (up to 15 neighbors, head-only triplets).
5. Merge the knowledge subgraphs with the original word graph to form a unified WK graph.
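A minimal sketch of these five steps in Python, assuming the entity links and KG neighbors are supplied by upstream components; the helper names (build_wk_graph, entity_mentions, kg_neighbors) are illustrative and not CoLAKE's actual code:

```python
# Minimal sketch of WK-graph construction. entity_mentions and kg_neighbors are
# assumed to come from an entity linker and the knowledge graph, respectively.
from dataclasses import dataclass, field

@dataclass
class WKGraph:
    nodes: list = field(default_factory=list)   # word, entity, and relation nodes
    edges: set = field(default_factory=set)     # undirected edges as (i, j) index pairs

def build_wk_graph(tokens, entity_mentions, kg_neighbors, max_neighbors=15):
    """tokens: word pieces of one sentence; entity_mentions: {token index: entity id};
    kg_neighbors: entity id -> list of (relation id, tail entity id) head-only triplets."""
    g = WKGraph()
    # Step 1: fully connected word graph
    g.nodes.extend(tokens)
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            g.edges.add((i, j))
    # Steps 2-3: replace linked mentions with anchor (entity) nodes
    for idx, ent in entity_mentions.items():
        g.nodes[idx] = ent
    # Steps 4-5: attach up to max_neighbors (relation, tail) pairs around each anchor
    for idx, ent in entity_mentions.items():
        for rel, tail in kg_neighbors.get(ent, [])[:max_neighbors]:
            r_idx = len(g.nodes); g.nodes.append(rel)
            t_idx = len(g.nodes); g.nodes.append(tail)
            g.edges.add((idx, r_idx))    # anchor -- relation
            g.edges.add((r_idx, t_idx))  # relation -- tail
    return g
```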

CoLAKE Model Architecture

CoLAKE Model Architecture: Embedding Layer (a minimal sketch follows)
- Token embedding: separate embedding tables are created for words, entities, and relations; words are further split with Byte Pair Encoding (BPE).
- Type embedding: distinguishes the type of each node (word, entity, or relation).
- Position embedding: each node is assigned a position index using a soft position index scheme, so indices may repeat; nodes within the same triplet receive consecutive positions.
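A minimal PyTorch sketch of such an embedding layer, assuming inputs arrive as per-type token ids, type ids (0 = word, 1 = entity, 2 = relation), and soft position ids; the class name WKEmbedding and its arguments are illustrative, not CoLAKE's actual code:

```python
import torch
import torch.nn as nn

class WKEmbedding(nn.Module):
    """Sketch: token + type + (soft) position embeddings for WK-graph nodes."""
    def __init__(self, n_words, n_entities, n_relations, hidden=768, max_pos=512):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, hidden)        # BPE word pieces
        self.entity_emb = nn.Embedding(n_entities, hidden)
        self.relation_emb = nn.Embedding(n_relations, hidden)
        self.type_emb = nn.Embedding(3, hidden)              # 0=word, 1=entity, 2=relation
        self.pos_emb = nn.Embedding(max_pos, hidden)         # soft positions may repeat

    def forward(self, token_ids, type_ids, pos_ids):
        hidden = self.word_emb.embedding_dim
        tok = torch.zeros(*token_ids.shape, hidden, device=token_ids.device)
        # look up each node in the embedding table that matches its type
        for t, table in enumerate((self.word_emb, self.entity_emb, self.relation_emb)):
            mask = type_ids == t
            if mask.any():
                tok[mask] = table(token_ids[mask])
        return tok + self.type_emb(type_ids) + self.pos_emb(pos_ids)
```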

CoLAKE Model Architecture: Masked Transformer Encoder (a minimal attention-masking sketch follows)
Masked multi-head self-attention:
1. Attention is allowed only between connected nodes.
2. For unconnected node pairs, the attention weight is set to −∞ to block interaction.
As a result, each node receives information only from its 1-hop neighbors at each layer.
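A minimal sketch of this masking step, assuming a boolean adjacency matrix (with self-connections) derived from the WK graph; the function name and tensor shapes are illustrative:

```python
import torch

def masked_attention_scores(q, k, adjacency):
    """q, k: (batch, n_nodes, d) query/key projections;
    adjacency: (batch, n_nodes, n_nodes) bool, True where two nodes are connected."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5              # raw attention logits
    scores = scores.masked_fill(~adjacency, float("-inf"))   # block unconnected pairs
    return torch.softmax(scores, dim=-1)                     # nodes attend to 1-hop neighbors only
```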

CoLAKE Pre-training Objective (a masking sketch follows the list)
The Masked Language Model (MLM) objective is extended to the entire WK graph: 15% of all nodes are randomly masked with the usual replacement rules (80% [MASK], 10% random node of the same type, 10% unchanged).
- Word node masking: predict masked words using both language and knowledge context.
- Entity node masking: for anchor entities, align language and knowledge representations; for other entities, learn contextualized entity embeddings.
- Relation node masking: predict relations between entities, as in a relation extraction task.
- Neighborhood drop: randomly drop neighbors (with 50% probability) to avoid trivial predictions based only on knowledge context.
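A minimal sketch of the 80/10/10 masking rule, assuming all node ids live in one tensor with a parallel type tensor; the function name, the vocab_sizes argument, and the -100 ignore label are illustrative conventions, not CoLAKE's exact implementation:

```python
import torch

def mask_nodes(token_ids, type_ids, vocab_sizes, mask_id, mask_prob=0.15):
    """token_ids, type_ids: (n_nodes,) long tensors; vocab_sizes: per-type vocabulary
    sizes for the 10% random-same-type replacement; returns (masked ids, labels)."""
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape) < mask_prob
    labels[~selected] = -100                                  # ignore unmasked nodes in the loss

    roll = torch.rand(token_ids.shape)
    masked = token_ids.clone()
    masked[selected & (roll < 0.8)] = mask_id                 # 80%: replace with [MASK]
    rand = selected & (roll >= 0.8) & (roll < 0.9)            # 10%: random node of the same type
    for t, size in enumerate(vocab_sizes):
        pick = rand & (type_ids == t)
        masked[pick] = torch.randint(0, size, (int(pick.sum()),))
    return masked, labels                                     # remaining 10%: left unchanged
```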

CoLAKE Model Training: Mixed CPU-GPU Training with Negative Sampling (a minimal sketch follows)
- Entity embeddings are too large to fit entirely into GPU memory, so only the entity embeddings are stored on the CPU while the Transformer components are trained on the GPU.
- Negative sampling: masked entities are predicted against a sampled set of candidate entities rather than the full entity vocabulary.
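A minimal sketch of this mixed CPU-GPU setup with negative sampling, assuming a 768-dimensional entity table, in-batch positives, and uniformly sampled negatives; the table size, scoring function, and sampling scheme are illustrative assumptions rather than CoLAKE's exact recipe:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
entity_table = torch.nn.Embedding(5_000_000, 768)        # large table, kept on the CPU

def entity_loss(hidden, target_ids, num_negatives=64):
    """hidden: (batch, 768) GPU states at masked entity positions;
    target_ids: (batch,) CPU entity ids of the gold entities."""
    negatives = torch.randint(0, entity_table.num_embeddings, (num_negatives,))
    candidate_ids = torch.cat([target_ids, negatives])    # positives first, then negatives
    candidates = entity_table(candidate_ids).to(device)   # move only the rows we need to the GPU
    logits = hidden @ candidates.t()                      # (batch, batch + num_negatives)
    labels = torch.arange(hidden.size(0), device=device)  # i-th example's positive is column i
    return F.cross_entropy(logits, labels)
```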

Experiments: Pre-Training Data & Implementation
- Data: English Wikipedia (2020/03/01)
- Knowledge graph (KG): Wikidata5M (21 million triplets)

Experiments: Pre-Training Data & Implementation (WK graph construction and pre-training settings; an optimizer-config sketch follows)
- Transformer: RoBERTa-base
- Optimizer: AdamW (β₁ = 0.9, β₂ = 0.98)
- Batch size: 2048
- Learning rate: 1e-4
- Hardware: 8 × 32 GB V100 GPUs
- Training time: 38 hours
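A minimal sketch of these optimizer settings, with a placeholder module standing in for the RoBERTa-base-initialized encoder (weight decay and learning-rate schedule are not given on the slide and are omitted):

```python
import torch

model = torch.nn.Linear(768, 768)  # placeholder for the RoBERTa-base-initialized CoLAKE encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.98))
```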

Experiments: Knowledge-Driven Tasks
- Tasks: Open Entity (entity typing) and FewRel (relation extraction)
- Method: entity linking with TAGME; joint encoding of entities and textual representations
- Results: improved F1 scores; 90.5% F1 on FewRel (new SOTA)

CoLAKE Knowledge Probing (a Precision@1 sketch follows)
- Tasks: LAMA / LAMA-UHN (evaluating factual knowledge)
- Method: cloze-style masking, measuring Precision@1 (P@1)
- Results: significant improvement over RoBERTa; outperformed K-Adapter on some subsets
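A minimal sketch of the Precision@1 metric with made-up example fillers:

```python
def precision_at_1(top1_predictions, gold_answers):
    """Fraction of cloze queries whose top-ranked prediction equals the gold filler."""
    assert len(top1_predictions) == len(gold_answers)
    correct = sum(p == g for p, g in zip(top1_predictions, gold_answers))
    return correct / len(gold_answers)

# Hypothetical example: 2 of 3 queries answered correctly -> P@1 ≈ 0.667
print(precision_at_1(["Paris", "physicist", "1945"], ["Paris", "physicist", "1944"]))
```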

Experiments: Language Understanding Tasks
- Tasks: GLUE benchmark (e.g., MNLI, QQP, QNLI)
- Results: maintained RoBERTa-level performance; 1.4% average improvement over KEPLER

Experiments: Word-Knowledge Graph Completion (an MRR sketch follows)
- Task: relation prediction
- Settings: transductive (predicting over seen entities) and inductive (predicting over unseen entities)
- Results: transductive, 82.5% MRR (better than RotatE); inductive, generalized well via neighbor aggregation
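A minimal sketch of the MRR metric with made-up example ranks:

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct relation for each test triplet."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical example: correct relation ranked 1st, 2nd, and 4th -> MRR ≈ 0.583
print(mean_reciprocal_rank([1, 2, 4]))
```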

Conclusion
- Experimental results: CoLAKE improves performance on knowledge-driven tasks, achieves high performance on the word-knowledge graph completion task, and demonstrates GNN-like properties (e.g., neighbor aggregation).
- Future directions: noise reduction in distant-supervision-based data; evaluation of template quality for graph-to-text conversion.