Uploaded by swapnilkhadke3, Jul 07, 2024 (6 slides)
Pre-trained language models to curate oligogenic data
Workshop on "Text mining services to support scalable curation"
Charlotte Nachtegael, Université Libre de Bruxelles, Belgium
#ELIXIR24
It all begins with a database
OLIDA: the OLIgogenic diseases DAtabase
1808 variant combinations
Linked to 219 diseases
Involving 3778 variants in 1198 genes

DUVEL
Detection of Unlimited Variant Ensemble in Literature

Using Biomedical PLMs to detect oligogenic data

Model             Fine-tuning   F1-score   Precision   Recall
BiomedBERT        No            0.1004     0.15        0.07547
BiomedBERT        Yes           0.8171     0.7929      0.8428
BiomedBERT-large  No            0.1721     0.09414     1
BiomedBERT-large  Yes           0.8371     0.8506      0.8239
BioLinkBERT       No            0.01047    0.03125     0.006289
BioLinkBERT       Yes           0.8207     0.7941      0.8491
BioM-BERT         No            0.08571    0.1765      0.0566
BioM-BERT         Yes           0.8106     0.8592      0.7673

Model architectures: BERT base, BERT large, ELECTRA.
Gu, Y., et al. ACM Trans. Comput. Healthc., 2022; Yasunaga, M., et al. Proceedings of the 60th Annual Meeting of the ACL, 2022; Alrowili, S., et al. Proceedings of the 20th Workshop on Biomedical Language Processing, 2021.
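As a sanity check on the table, each F1-score is the harmonic mean of the reported precision and recall. A minimal sketch, using the fine-tuned BiomedBERT row as an example:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

# Fine-tuned BiomedBERT figures from the table above.
precision, recall = 0.7929, 0.8428
print(round(f1_score(precision, recall), 4))  # 0.8171, matching the table
```

The same check reproduces the other rows, e.g. the strikingly low non-fine-tuned BioLinkBERT F1 follows directly from its near-zero recall.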

Test on OLIDAv3 (BERT large)
18 research articles
40 647 unlabelled instances (89 OLIDA entries)
125 predicted as positive
65 true positives (= 12 OLIDA entries, 13.5%)

Reason                         Number of errors
Complex sentence               29
Missing context                25
Negation not detected          1
Unaffected patient             2
Same variant, different forms  1
Missing end of the sentence    2
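These counts imply a precision of 65/125 = 0.52 on the OLIDAv3 test set, and the six error categories account for all 60 false positives. A quick check of that arithmetic:

```python
# Held-out evaluation figures from the slide above.
predicted_positive = 125
true_positives = 65

precision = true_positives / predicted_positive
print(precision)  # 0.52

# The error breakdown covers every false positive (125 - 65 = 60).
errors = {
    "Complex sentence": 29,
    "Missing context": 25,
    "Negation not detected": 1,
    "Unaffected patient": 2,
    "Same variant, different forms": 1,
    "Missing end of the sentence": 2,
}
print(sum(errors.values()) == predicted_positive - true_positives)  # True
```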

References
● Charlotte Nachtegael and Barbara Gravel, et al. Scaling up oligogenic diseases research with OLIDA: the Oligogenic Diseases Database. Database, 2022:baac023, January 2022. ISSN 1758-0463. doi:10.1093/database/baac023.
● Yu Gu, et al. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthcare, 3(1), October 2021. ISSN 2691-1957. doi:10.1145/3458754. New York, NY, USA: Association for Computing Machinery.
● Michihiro Yasunaga, et al. LinkBERT: Pretraining language models with document links. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8003–8016, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-long.551.
● Sultan Alrowili and Vijay Shanker. BioM-transformers: Building large biomedical language models with BERT, ALBERT and ELECTRA. In Proceedings of the 20th Workshop on Biomedical Language Processing, pages 221–227, Online, June 2021. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2021.bionlp-1.24.
● Charlotte Nachtegael, et al. DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations. Database, 2024:baae039, May 2024. ISSN 1758-0463. doi:10.1093/database/baae039.