250428_JunH_Seminar[CoLAKE: Contextualized Language and Knowledge Embedding].pptx
About This Presentation
CoLAKE: Contextualized Language and Knowledge Embedding
Size: 1.41 MB
Language: en
Added: Apr 28, 2025
Slides: 23 pages
Slide Content
CoLAKE: Contextualized Language and Knowledge Embedding. Cho Junhee, Network Science Lab, The Catholic University of Korea. E-mail: [email protected]
Abstract Previous studies have incorporated knowledge into pre-trained language models (PLMs) by injecting only shallow, static, and separately pre-trained entity embeddings. However, such methods fail to jointly learn language and knowledge and cannot fully exploit the deep knowledge context around an entity.
Introduction Deep contextualized language models pre-trained on large-scale unlabeled corpora have led to remarkable performance improvements across a wide range of natural language processing (NLP) tasks (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019). However, it has been shown that these models struggle to capture factual knowledge (Logan et al., 2019). Limitation
Introduction Knowledge-enhanced PLMs aim to better capture factual knowledge by inserting pre-trained entity embeddings into PLMs, as exemplified by models such as ERNIE (Zhang et al., 2019) and KnowBERT (Peters et al., 2019). Limitation: human-curated structured knowledge.
Introduction Because the entity embeddings are pre-trained separately, they stay fixed during PLM training ⇒ language and knowledge are not jointly learned. Existing models incorporate only a single embedding per entity into the PLM ⇒ the deep knowledge context of an entity is not fully leveraged. Because the pre-trained entity embeddings are static, updating them requires re-training ⇒ better integration of language representation and factual knowledge is needed. Limitation: human-curated structured knowledge.
Introduction CoLAKE: Jointly learns language representations and knowledge representations within a common representation space. Dynamically represents entities based on both their knowledge context and language context. Extracts a subgraph centered around each entity from the knowledge graph (KG) and constructs a dynamic knowledge context from the triplets within the subgraph.
Introduction CoLAKE
CoLAKE Overview By pre-training on a structured, unsupervised word-knowledge graph (WK graph), CoLAKE simultaneously contextualizes both language representations and knowledge representations. Graph Construction
CoLAKE Graph Construction: Make WK Graph. 1. Tokenize the sentence and build a fully connected word graph. 2. Identify entities and link them to real entities in the knowledge graph (KG) using Entity Linking.
CoLAKE Graph Construction: Make WK Graph. 3. Replace matched words with anchor nodes (entities) to align language and knowledge representations. 4. Extract knowledge subgraphs by selecting surrounding relations and entities (up to 15 neighbors, head-only triplets). 5. Merge the knowledge subgraphs with the original word graph to form a unified WK Graph.
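The five construction steps can be sketched in a few lines. This is only an illustrative sketch with networkx; the helpers `link_entities` (e.g. a TAGME-style linker returning mention spans and entity ids) and `kg_neighbors` (returning head-only (relation, tail) pairs) are assumptions, not CoLAKE's actual interface.

```python
# A minimal sketch of WK-graph construction, assuming a pre-tokenized sentence
# and two hypothetical helpers: link_entities(tokens) -> [((start, end), entity_id)]
# and kg_neighbors(entity_id) -> [(relation, tail_entity)] with the entity as head.
import itertools
import networkx as nx

MAX_NEIGHBORS = 15  # cap of 15 neighboring triplets per anchor, as on the slide


def build_wk_graph(tokens, link_entities, kg_neighbors):
    g = nx.Graph()

    # Steps 1-2: fully connected word graph over the tokenized sentence.
    word_nodes = [("word", i, tok) for i, tok in enumerate(tokens)]
    g.add_nodes_from(word_nodes)
    g.add_edges_from(itertools.combinations(word_nodes, 2))

    for (start, end), entity_id in link_entities(tokens):
        # Step 3: replace the mention words with a single anchor entity node.
        anchor = ("entity", start, entity_id)
        mention = [n for n in word_nodes if start <= n[1] < end]
        kept = set(itertools.chain.from_iterable(g.neighbors(n) for n in mention))
        g.remove_nodes_from(mention)
        g.add_edges_from((anchor, n) for n in kept - set(mention))

        # Steps 4-5: attach up to 15 head-only triplets and merge the subgraph in.
        for relation, tail in kg_neighbors(entity_id)[:MAX_NEIGHBORS]:
            rel_node = ("relation", start, relation)
            g.add_edge(anchor, rel_node)
            g.add_edge(rel_node, ("entity", start, tail))
    return g
```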
CoLAKE Model Architecture
CoLAKE Model Architecture: Embedding Layer. Token Embedding: separate embedding tables are created for words, entities, and relations; words are further split and processed using Byte Pair Encoding (BPE). Type Embedding: distinguishes the type of each node (word, entity, or relation). Position Embedding: each node is assigned a position index using a Soft Position Index scheme, which allows repeated indices; nodes within the same triplet receive consecutive positions.
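A minimal PyTorch sketch of this embedding layer, assuming each node's input representation is the sum of a token embedding (from the table matching its type), a type embedding, and a soft-position embedding. The table sizes, hidden size, and the 0/1/2 type encoding are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class WKGraphEmbedding(nn.Module):
    def __init__(self, n_words, n_entities, n_relations, hidden=768, max_pos=512):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, hidden)          # BPE word pieces
        self.entity_emb = nn.Embedding(n_entities, hidden)
        self.relation_emb = nn.Embedding(n_relations, hidden)
        self.type_emb = nn.Embedding(3, hidden)                # 0=word, 1=entity, 2=relation
        self.position_emb = nn.Embedding(max_pos, hidden)      # soft positions may repeat

    def forward(self, node_ids, node_types, positions):
        # node_ids, node_types, positions: (batch, seq_len) LongTensors; each id
        # indexes the table selected by the node's type.
        h = torch.zeros(*node_ids.shape, self.type_emb.embedding_dim,
                        device=node_ids.device)
        for t, table in enumerate((self.word_emb, self.entity_emb, self.relation_emb)):
            mask = node_types == t
            h[mask] = table(node_ids[mask])
        return h + self.type_emb(node_types) + self.position_emb(positions)
```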
CoLAKE Model Architecture: Masked Transformer Encoder. Masked Multi-Head Self-Attention: 1. Attention is allowed only between connected nodes. 2. For nodes that are not connected, the attention weight is set to −∞ to block interaction. As a result, each node receives information only from its 1-hop neighbors at each layer.
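A single-head sketch of this masking rule: scores between nodes that are not adjacent in the WK graph are set to −∞ before the softmax, so each layer only mixes information along 1-hop edges. The real encoder is a multi-head Transformer; shapes here are illustrative.

```python
import torch
import torch.nn.functional as F


def masked_self_attention(q, k, v, adjacency):
    # q, k, v: (batch, seq_len, d); adjacency: (batch, seq_len, seq_len) bool,
    # True where two nodes are connected (self-loops should be included).
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~adjacency, float("-inf"))   # block unconnected pairs
    return F.softmax(scores, dim=-1) @ v
```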
CoLAKE Pre-training Objective. The Masked Language Model (MLM) objective is extended to the entire WK Graph. Randomly mask 15% of all nodes with specific replacement rules (80% [MASK], 10% random node of the same type, 10% unchanged). Word node masking: predict masked words using both language and knowledge context. Entity node masking: anchor entities align language and knowledge representations; other entities learn contextualized entity embeddings. Relation node masking: predict relations between entities, like a relation extraction task. Neighborhood Drop: randomly drop neighbors (50% chance) to avoid trivial prediction based only on knowledge context.
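A rough sketch of the 15% masking with the 80/10/10 replacement rule, applied uniformly to word, entity, and relation nodes. The per-type [MASK] ids and the same-type replacement pools are placeholders for illustration, not CoLAKE's real vocabularies.

```python
import random

MASK_IDS = {"word": 0, "entity": 0, "relation": 0}   # assumed [MASK] id per table


def mask_nodes(nodes, vocab_by_type, mask_prob=0.15):
    # nodes: list of (node_type, node_id); returns corrupted nodes plus labels
    # (-100 marks positions the loss should ignore).
    corrupted, labels = [], []
    for node_type, node_id in nodes:
        if random.random() < mask_prob:
            labels.append(node_id)                    # predict the original id
            r = random.random()
            if r < 0.8:                               # 80%: replace with [MASK]
                node_id = MASK_IDS[node_type]
            elif r < 0.9:                             # 10%: random node of the same type
                node_id = random.choice(vocab_by_type[node_type])
            # remaining 10%: keep the original node unchanged
        else:
            labels.append(-100)
        corrupted.append((node_type, node_id))
    return corrupted, labels
```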
CoLAKE Model Training. Mixed CPU-GPU Training: entity embeddings are too large to fit entirely into GPU memory, so only the entity embeddings are stored on the CPU while the Transformer components are trained on the GPU. Negative Sampling is also used during training.
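A minimal sketch of the mixed CPU-GPU idea: the large entity embedding table stays in CPU memory, and only the rows needed for the current batch are looked up on the CPU and shipped to the GPU where the Transformer runs. Sizes are scaled down for illustration; this is not CoLAKE's actual training code.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

entity_emb = nn.Embedding(1_000_000, 128)        # deliberately kept on the CPU
transformer = nn.TransformerEncoderLayer(d_model=128, nhead=8,
                                         batch_first=True).to(device)


def encode_batch(entity_ids):
    # entity_ids: (batch, seq_len) LongTensor of entity indices, on the CPU.
    rows = entity_emb(entity_ids)                # CPU lookup of just these rows
    return transformer(rows.to(device))          # only the gathered rows hit the GPU
```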
Experiments Pre-Training Data & Implementation Data: English Wikipedia (2020/03/01) Knowledge Graph (KG): Wikidata5M (21 million triplets)
Experiments Knowledge-Driven Tasks Tasks: Open Entity (Entity Typing) FewRel (Relation Extraction) Method: Entity Linking using TAGME Joint encoding of entities and textual representations Results: Improved F1 scores Achieved 90.5% F1 on FewRel (new SOTA)
CoLAKE Knowledge Probing Tasks: LAMA / LAMA-UHN (evaluating factual knowledge) Method: Cloze-style masking Measuring Precision@1 (P@1) Results: Significant improvement over RoBERTa Outperformed K-Adapter on some subsets
Experiments Language Understanding Tasks Tasks: GLUE Benchmark (e.g., MNLI, QQP, QNLI) Results: Maintained RoBERTa-level performance 1.4% average improvement over KEPLER
Experiments Word-Knowledge Graph Completion Task: Relation Prediction Setting: Transductive: predicting on seen entities Inductive: predicting on unseen entities Results: Transductive: 82.5% MRR (better than RotatE) Inductive: generalized well via neighbor aggregation
Conclusion CoLAKE Experimental Results Improved performance on knowledge-driven tasks. Achieved high performance in the Word-Knowledge Graph Completion task. Demonstrates GNN-like properties (e.g., neighbor aggregation). Future Directions Noise reduction in distant supervision-based data. Evaluation of template quality for graph-to-text conversion.