Contrastive Learning Using Graph Embeddings for Domain Adaptation of Language Models in the Process Industry
Contact
Anastasia Zhukova
[email protected]
https://gipplab.org/zhukova/
Anastasia Zhukova¹*, Jonas Lührs¹*, Christian E. Lobmüller², Bela Gipp¹
¹ University of Göttingen, Germany; ² eschbach GmbH, Germany; * Equal contribution
Methodology
• Document-similarity train dataset: 14K triplets
• Bi-encoder train dataset (record layouts for both sets are sketched below):
  o DR-MM + SID = 2.31M pairs
  o DR-MM + SID + GET = 2.41M pairs (GET contributes only 6% of the pairs)
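The poster does not specify how these training records are stored; the following is a minimal sketch of plausible record layouts. The class and field names (DocSimTriplet, BiEncoderPair, anchor, positive, negative, query, document, source) and the toy German example are illustrative assumptions, not the actual schema.

```python
from dataclasses import dataclass

@dataclass
class DocSimTriplet:
    """One of the ~14K document-similarity triplets used for step-1 fine-tuning."""
    anchor: str    # text of the anchor plant document (e.g., a shift-log entry)
    positive: str  # a close neighbor of the anchor in the graph-embedding (GE) space
    negative: str  # a more distant GE neighbor used as a hard negative

@dataclass
class BiEncoderPair:
    """One of the ~2.31M-2.41M query-document pairs used for step-2 fine-tuning."""
    query: str     # short, often abbreviation-heavy query text
    document: str  # relevant document / log entry
    source: str    # pair origin: "DR-MM", "SID", or "GET"

# Made-up toy record, for illustration only.
example = BiEncoderPair(query="P-101 Leckage Dichtung",
                        document="Dichtung an Pumpe P-101 getauscht, Anlage wieder angefahren",
                        source="SID")
```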
Evaluation
• Test collection: PITEB with 7 plants, 205 queries, 330K docs
Plants A, C, D, G were used to build the document-similarity triplets.

| Model                          | Params, M | Doc.-sim. fine-tuning | Bi-encoder fine-tuning | AP    | MRR   | nDCG  | Mean  |
|--------------------------------|-----------|-----------------------|------------------------|-------|-------|-------|-------|
| intfloat/multilingual-e5-large | 560       |                       |                        | 59.82 | 65.31 | 42.26 | 55.80 |
| OpenAI-text-embedding-3-large  | UNK       |                       |                        | 63.68 | 68.57 | 45.60 | 59.28 |
| BAAI/bge-m3                    | 560       |                       |                        | 66.24 | 71.33 | 50.94 | 62.84 |
| GBERT-base                     | 111       | -                     | MM                     | 59.86 | 65.72 | 42.99 | 56.19 |
| GBERT-base                     | 111       | -                     | DR-MM + SID            | 62.64 | 66.92 | 45.78 | 58.45 |
| GBERT-base                     | 111       | -                     | DR-MM + SID + GET      | 64.58 | 71.48 | 50.42 | 62.16 |
| GBERT-base                     | 111       | +                     | DR-MM + SID + GET      | 64.08 | 71.42 | 49.97 | 61.82 |
| daGBERT-base                   | 111       | -                     | DR-MM + SID            | 62.21 | 69.36 | 45.81 | 59.13 |
| daGBERT-base                   | 111       | -                     | DR-MM + SID + GET      | 62.36 | 69.24 | 48.06 | 59.89 |
| daGBERT-base                   | 111       | +                     | DR-MM + SID + GET      | 64.60 | 70.23 | 48.89 | 61.24 |
| mBERT                          | 179       | -                     | DR-MM + SID            | 64.81 | 70.85 | 48.11 | 61.26 |
| mBERT                          | 179       | -                     | DR-MM + SID + GET      | 65.22 | 70.86 | 50.70 | 62.26 |
| mBERT                          | 179       | +                     | DR-MM + SID + GET      | 67.12 | 72.58 | 51.56 | 63.75 |
| XLM-RoBERTa                    | 278       | -                     | DR-MM + SID            | 62.51 | 67.29 | 46.10 | 58.63 |
| XLM-RoBERTa                    | 278       | -                     | DR-MM + SID + GET      | 64.47 | 69.88 | 48.82 | 61.06 |
| XLM-RoBERTa                    | 278       | +                     | DR-MM + SID + GET      | 64.18 | 70.23 | 49.35 | 61.25 |

("+"/"-" in Doc.-sim. fine-tuning denotes with/without the first-step document-similarity fine-tuning; the top three rows are off-the-shelf baselines; Mean is the average of AP, MRR, and nDCG.)
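For reference, the table's retrieval metrics follow their standard definitions. Below is a minimal, self-contained sketch of per-query AP, reciprocal rank, and binary-relevance nDCG@k; this is not the authors' evaluation code, and the toy ranking is made up.

```python
import math

def average_precision(ranked_doc_ids, relevant_ids):
    """Mean of precision@rank over the ranks where a relevant document appears."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def reciprocal_rank(ranked_doc_ids, relevant_ids):
    """1/rank of the first relevant document, 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1)
              if doc_id in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Toy query with two relevant documents.
ranking, relevant = ["d3", "d7", "d1", "d9"], {"d7", "d1"}
print(average_precision(ranking, relevant))   # (1/2 + 2/3) / 2
print(reciprocal_rank(ranking, relevant))     # 0.5
print(ndcg_at_k(ranking, relevant, k=4))
```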
Findings
• Initializing the GE model with text encoders speeds up training
• Fine-tuning using GEs is more efficient than continual pretraining
• Two-step fine-tuning of mBERT outperforms mE5-large by 14.3% (7.96 points) despite mBERT having 3x fewer parameters, i.e., cost-efficient inference
• Fine-tuned mBERT outperforms M3 (BAAI/bge-m3) in 4 out of 7 plants
• Given the small size of the document-triplet and bi-encoder training sets, two-step fine-tuning is cost-efficient (1h + 1h on an A100 for mBERT); a sketch of the two steps follows below
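The two-step procedure is not spelled out in code here; the following is a minimal sketch of how it could look with the sentence-transformers library, assuming mean pooling over mBERT, a triplet loss for step 1, and in-batch negatives for step 2. The model name, hyperparameters, output path, and toy German log strings are illustrative assumptions, not the authors' exact configuration or data.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Step 0: wrap a pretrained multilingual encoder (mBERT) as a mean-pooled bi-encoder.
word_emb = models.Transformer("bert-base-multilingual-cased", max_seq_length=256)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_emb, pooling])

# Toy stand-ins for the real training sets (~14K GE-derived triplets / ~2.4M pairs).
triplets = [InputExample(texts=["Pumpe P-101 Leckage an der Dichtung",
                                "Dichtung an P-101 getauscht",
                                "Filterwechsel an Linie 3 durchgeführt"])]
pairs = [InputExample(texts=["P-101 Leckage",
                             "Dichtung an Pumpe P-101 getauscht, Anlage wieder angefahren"])]

# Step 1: document-similarity fine-tuning with a triplet loss on GE-derived triplets.
triplet_loader = DataLoader(triplets, shuffle=True, batch_size=1)
model.fit(train_objectives=[(triplet_loader, losses.TripletLoss(model))],
          epochs=1, warmup_steps=0)

# Step 2: bi-encoder fine-tuning with in-batch negatives on query-document pairs.
pair_loader = DataLoader(pairs, shuffle=True, batch_size=1)
model.fit(train_objectives=[(pair_loader, losses.MultipleNegativesRankingLoss(model))],
          epochs=1, warmup_steps=0)

model.save("mbert-process-industry-biencoder")
```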
Overview
• Knowledge graphs (KGs) encapsulate domain-specific semantics in the relationships between documents
• SciNCL proposed neighborhood contrastive learning with triplets collected from the graph embedding (GE) space of a citation graph of scientific papers (sampling sketched below)
• Text logs in the process industry contain much jargon and many abbreviations, acronyms, and codes, which require additional context for LMs to learn their semantic representations, e.g., for semantic search
• LM fine-tuning in an industrial setting is low-resource: data is scarce and often in languages other than English
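A minimal sketch of neighborhood-contrastive triplet sampling in the GE space, in the spirit of SciNCL: positives are drawn from an anchor document's nearest GE neighbors, hard negatives from a band just outside that neighborhood. The band boundaries (k_pos, k_neg_start, k_neg_end) and the brute-force cosine kNN are illustrative assumptions; the exact sampling parameters are not given here.

```python
import numpy as np

def sample_triplets(ge: np.ndarray, k_pos: int = 5, k_neg_start: int = 20,
                    k_neg_end: int = 25, seed: int = 0):
    """ge: (num_docs, dim) graph embeddings of the documents in a plant's KG."""
    rng = np.random.default_rng(seed)
    normed = ge / np.linalg.norm(ge, axis=1, keepdims=True)   # cosine-normalize
    order = np.argsort(-(normed @ normed.T), axis=1)          # neighbors by similarity; self at column 0
    triplets = []
    for anchor in range(len(ge)):
        positive = rng.choice(order[anchor, 1:1 + k_pos])            # close GE neighbor
        negative = rng.choice(order[anchor, k_neg_start:k_neg_end])  # harder, more distant neighbor
        triplets.append((anchor, int(positive), int(negative)))
    return triplets

# Toy usage: 100 random 32-dimensional graph embeddings.
print(sample_triplets(np.random.default_rng(1).normal(size=(100, 32)))[:3])
```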
Research goal: investigate the adaptation of SciNCL to the German process industry domain with sparse, heterogeneous KGs of production plants (part of the Plant assistant project).