Understanding Scientific and Societal Adoption and Impact of Science Through NLP and Knowledge Graphs

stefandietze, 34 slides, May 26, 2024

About This Presentation

Keynote on analysing scholarly discourse, given at the Second International Workshop on Semantic Technologies and Deep Learning Models for Scientific, Technical and Legal Data (SemTech4STLD), held on 26 May 2024 at ESWC 2024.


Slide Content

Understanding Scientific and Societal Adoption and Impact of Science Through NLP and Knowledge Graphs
SemTech WS @ ESWC2024
Stefan Dietze, 26.05.2024

Motivation: science discourse vs. offline society & policies
[Diagram: discourse interactions, mediated by algorithms/AI, between science discourse (publications, news & social media) and society, media, politics & policies (offline & online)]

Motivation
▪ Science discourse is hidden in unstructured documents (publications vs. online media)
▪ Understanding science discourse and finding resources, within scientific contexts or within online discourse, is still a challenge
▪ Even more challenging: understanding the links between science & online discourse (e.g. how claims from scientific publications are debated online)

Different representations of machine knowledge
Gerhard Weikum at ISWC2023:
▪ KG: fruit platter
▪ Search engine/web: fruit salad
▪ LLM: smoothie

Different representations of machine knowledge in science
Gerhard Weikum at ISWC2023:
▪ KG: fruit platter (structured KBs of scientific data & metadata, "research knowledge graphs")
▪ Search engine/web: fruit salad (science discourse: publications, news, social media)
▪ LLM: smoothie (SciBERT and other learned representations)

Overview
NLP & KGs to…
Part 1: … interpret, link & find scientific resources
Part 2: … interpret, link & find scientific online discourse

Reproducibility & state-of-the-art crisis in computer science
"A worrying analysis of neural recommender approaches"
Are DL-based methods reproducible?
Do they actually beat simple baselines?
Dacrema, M. F., Cremonesi, P., Jannach, D., Are we really making much progress? A worrying analysis of recent neural recommendation approaches. ACM RecSys 2019.

Common questions for researchers
• Which top-tier publications cite which data/method? ("dataset authority"; a query sketch follows after this list)
• Which data was used to train/evaluate which method? Which method was used to produce what data?
• Which claims are supported/cited/rejected by what dataset or publication?
▪ Research Data
▪ Publications
▪ Code/Scripts
▪ ML Models
▪ Methods
▪ Claims
▪ Metrics
Relations between scientific resources, data, knowledge
Provenance & dependencies of research data, resources, knowledge
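To make the first question concrete, here is a minimal sketch of asking it against a research knowledge graph via SPARQL. The endpoint URL, prefix and property names (ex:Publication, ex:cites, ex:Dataset) are illustrative assumptions, not the actual SoftwareKG or any published vocabulary:

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint and schema; substitute a real research KG.
sparql = SPARQLWrapper("https://example.org/research-kg/sparql")
sparql.setQuery("""
    PREFIX ex: <https://example.org/schema#>
    SELECT ?publication ?dataset WHERE {
      ?publication a ex:Publication ;
                   ex:cites ?dataset .
      ?dataset a ex:Dataset .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["publication"]["value"], "cites", row["dataset"]["value"])

The point of the sketch: once relations like ex:cites are explicit and machine-interpretable, such provenance questions become one-line queries rather than manual literature surveys.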

Challenges
• Relations and semantics are not explicit
• Data & metadata about resources and concepts are not represented in a structured, machine-interpretable, integrated manner (hidden in publications, web pages, etc.)
• Persistent identifiers (e.g. DOIs) are used inconsistently (e.g. on publications/datasets, only to a small degree on ML models)
• Reproducibility crisis in CS/AI & beyond
KGs address all of the above (PIDs, relations and machine-interpretability)

From publications to machine-interpretable metadata KGs
Example: software citations
▪ Manual annotation ("SoMeSci")
▪ Training DL models for extracting software references in a large-scale corpus (3.5 M publications)
▪ Data lifting into a KG ("SoftwareKG", 300+ M triples/statements; see the sketch below)
https://data.gesis.org/softwarekg
https://data.gesis.org/somesci
Schindler, D., Bensmann, F., Dietze, S., Krüger, F., SoMeSci—A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles, CIKM 2021
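A minimal sketch of the lifting step with rdflib, turning one extracted software mention into RDF triples. The namespace and property names (EX.mentionedIn, EX.spanText, EX.refersTo) are illustrative assumptions, not the published SoftwareKG vocabulary:

from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("https://example.org/softwarekg/")  # hypothetical namespace
g = Graph()

paper = EX["paper/PMC123456"]      # hypothetical article identifier
mention = EX["mention/1"]
software = EX["software/SPSS"]

g.add((paper, RDF.type, EX.Article))
g.add((mention, RDF.type, EX.SoftwareMention))
g.add((mention, EX.mentionedIn, paper))           # provenance: where it was found
g.add((mention, EX.spanText, Literal("SPSS 24.0")))
g.add((mention, EX.refersTo, software))           # link to the software entity

print(g.serialize(format="turtle"))

Run at corpus scale, mention-level triples like these aggregate into the 300+ M statements reported above.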

From publications to machine-interpretable metadata KGs
DL-based hierarchical sequence labeling for SW mention detection (sketched below)
Schindler D, Bensmann F, Dietze S, Krüger F., The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central. PeerJ Computer Science 8:e835
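A minimal sketch of the two-stage idea with Hugging Face pipelines: a sentence-level filter first, then token-level tagging of software mentions on the retained sentences. Both model paths and the relevance label are placeholders, not the authors' released checkpoints:

from transformers import pipeline

# Placeholder checkpoints; substitute actual fine-tuned models.
sentence_filter = pipeline("text-classification",
                           model="path/to/sentence-relevance-model")
mention_tagger = pipeline("token-classification",
                          model="path/to/software-ner-model",
                          aggregation_strategy="simple")

sentence = "All statistical analyses were performed with SPSS 24.0."
# Stage 1: keep only sentences likely to contain a software mention.
if sentence_filter(sentence)[0]["label"] == "contains_software":  # assumed label name
    # Stage 2: tag the actual software mention spans.
    for ent in mention_tagger(sentence):
        print(ent["entity_group"], ent["word"], round(ent["score"], 3))

The hierarchy pays off at corpus scale: the cheap sentence filter discards the vast majority of sentences before the more expensive tagger runs.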

From publications to machine-interpretable metadata KGs
SW mention extraction performance (SciBERT)
Schindler D, Bensmann F, Dietze S, Krüger F., The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central. PeerJ Computer Science 8:e835

From publications to machine-interpretable metadata KGs
Understanding scientific software/data usage
▪ Understanding SW usage, citation habits and their evolution across disciplines
▪ Rise of data science = rise of software usage
https://data.gesis.org/softwarekg
Schindler D, Bensmann F, Dietze S, Krüger F., The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central. PeerJ Computer Science 8:e835

From publications to machine-interpretable metadata KGs
Understanding scientific software/data usage
▪ Top adopters of data science/AI/software…
https://data.gesis.org/softwarekg
Schindler D, Bensmann F, Dietze S, Krüger F., The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central. PeerJ Computer Science 8:e835

From publications to machine-interpretable metadata KGs
Understanding scientific software/data usage
▪ Top adopters of data science/AI/software…
▪ …follow the worst citation habits
https://data.gesis.org/softwarekg
Schindler D, Bensmann F, Dietze S, Krüger F., The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central. PeerJ Computer Science 8:e835

SOMD – Shared Task @ NSLP2024 @ ESWC2024
https://nfdi4ds.github.io/nslp2024/
https://nfdi4ds.github.io/nslp2024/docs/somd_shared_task.htm
Meet us tomorrow from 9 am @ NSLP Workshop

How about citations of ML models & datasets?
GESIS Scholarly Annotation Project (GSAP)
▪ Novel data model and annotation framework to annotate mentions of ML models and datasets (and their relations) in scholarly publications
▪ Corpus of 100 CS publications annotated with > 54 K mentions
▪ Finetuned PLMs as baseline models for detecting ML model and dataset mentions
▪ Goal: build up a knowledge graph of ML models, datasets and their relations and usage/adoption in CS research (=> GESIS Methods Hub, GESIS Search)
Otto, W., Zloch, M., Gan, L., Karmakar, S., Dietze, S. (2023). GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets. In Findings of the Association for Computational Linguistics: EMNLP 2023

GESIS Scholarly Annotation Project (GSAP) - NER
Model performance (an entity-level evaluation sketch follows below)
Otto, W., Zloch, M., Gan, L., Karmakar, S., Dietze, S. (2023). GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets. In Findings of the Association for Computational Linguistics: EMNLP 2023
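NER baselines of this kind are typically scored at the entity level over BIO tags; a small sketch with seqeval on invented toy data (the label names B-MLModel/B-Dataset merely mirror the GSAP entity types):

from seqeval.metrics import classification_report, f1_score

# Toy BIO-tagged sequences; entity types echo the GSAP annotation scheme.
y_true = [["O", "B-MLModel", "I-MLModel", "O", "B-Dataset"]]
y_pred = [["O", "B-MLModel", "I-MLModel", "O", "O"]]

print(f1_score(y_true, y_pred))           # entity-level micro F1
print(classification_report(y_true, y_pred))

Entity-level scoring counts a prediction as correct only if both the span and the type match, which is stricter than per-token accuracy.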

KTS/GESIS @ National Research Data Infrastructure (NFDI)
Relevant consortia with KTS/GESIS in leading roles:
• BERD@NFDI
https://www.berd-nfdi.de/
• NFDI4DataScience – National Research Data Infrastructure for Data Science & AI
https://www.nfdi4datascience.de/
• KonsortSWD
https://www.konsortswd.de/en/
• Base4NFDI
https://base4nfdi.de/

Overview
NLP & KGs to…
Part 1: … interpret, link & find scientific resources
Part 2: … interpret, link & find scientific online discourse

How about mentions of science resources on the Web?
Example: Twitter/X
▪ Percentage of tweets containing links to scientific articles (journals, publishers, science blogs, etc.)
▪ Uses a list of > 30 K science web domains (a filtering sketch follows below)
▪ Data source: TweetsKB (https://data.gesis.org/tweetskb/)
https://ai4sci-project.org/

Data: TweetsKB – a large-scale archive of societal discourse
▪ Archiving of a 1% sample from Twitter/X since 2013 (14 billion tweets)
▪ TweetsKB: subset of 3 billion prefiltered tweets (English, spam detection through a pretrained classifier) containing tweet metadata, hashtags, user mentions and dedicated features that capture tweet semantics (no actual user IDs and no full texts)
▪ Features include [CIKM2020, CIKM2022]:
o Disambiguated mentions of entities, linked to Wikipedia/DBpedia ("president"/"potus"/"trump" => dbp:DonaldTrump)
o Sentiment scores (positive/negative emotions)
o Geotags via a pretrained DeepGeo model
o Science references/claims [CIKM2022]
https://data.gesis.org/tweetskb (a query sketch follows below)

Feature    Total          Unique       Share of tweets with >= 1 feature
Hashtags   1,161,839,471  68,832,205   0.19
Mentions   1,840,456,543  149,277,474  0.38
Entities   2,563,433,997  2,265,201    0.56
Sentiment  1,265,974,641  -            0.50

Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM 2020
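A minimal sketch of querying a TweetsKB-style RDF dump with rdflib. The file name and the predicate ex:mentionsEntity are illustrative stand-ins; the published TweetsKB schema (see https://data.gesis.org/tweetskb) uses established vocabularies instead:

from rdflib import Graph

# Assumed local excerpt of a TweetsKB N-Triples dump.
g = Graph()
g.parse("tweetskb_sample.nt", format="nt")

# Illustrative predicate name, not the published schema.
q = """
PREFIX ex: <https://example.org/tweets#>
SELECT ?tweet ?entity WHERE {
    ?tweet ex:mentionsEntity ?entity .
    FILTER(CONTAINS(STR(?entity), "Donald_Trump"))
}
LIMIT 5
"""
for tweet, entity in g.query(q):
    print(tweet, "mentions", entity)

This is the payoff of the entity disambiguation above: all surface forms ("president", "potus", "trump") resolve to one entity URI, so a single pattern retrieves them all.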

Mentions of scientific knowledge on Twitter/X?
[Example tweets illustrating the categories: science claim, science reference, science relevance, no science]
https://ai4sci-project.org/
Hafid, S., Schellhammer, S., Bringay, S., Todorov, K., Dietze, S., SciTweets - A Dataset and Annotation Framework for Detecting Scientific Online Discourse, CIKM 2022

SciTweets: training LLMs to detect scientific online discourse
▪ Ground truth dataset, heuristics-based sampling strategy and annotation framework for testing classification models
▪ 1,261 expert-labeled tweets across all classes/labels
▪ Baseline classifiers based on the SciBERT transformer model, fine-tuned/tested on SciTweets (an inference sketch follows below)
▪ Ongoing: analysis of large-scale science discourse and its evolution (projects NEWORDER & AI4SCI)
https://github.com/AI-4-Sci/SciTweets
Hafid, S., Schellhammer, S., Bringay, S., Todorov, K., Dietze, S., SciTweets - A Dataset and Annotation Framework for Detecting Scientific Online Discourse, CIKM 2022
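A sketch of what inference with such a baseline could look like: SciBERT fine-tuned for multi-label classification over the SciTweets categories. The checkpoint path and the label names are placeholders for a model actually fine-tuned on SciTweets:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
# Placeholder path: a checkpoint fine-tuned on the SciTweets ground truth.
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/scitweets-finetuned-scibert")

tweet = "New study suggests coffee may lower the risk of heart disease."
inputs = tok(tweet, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]  # independent per-label scores

# Assumed label order, mirroring the SciTweets categories.
for label, p in zip(["science_claim", "science_reference", "science_relevance"], probs):
    print(label, round(p.item(), 3))

Sigmoid rather than softmax reflects that the categories are not mutually exclusive: a tweet can make a science claim and contain a science reference at the same time.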

How is public attention distributed?
Power law distribution
• 10% of studies receive > 75% of all Twitter mentions
• Long tail of studies with few mentions
• Data source: 1.67 M tweets mentioning at least one of the primary science studies in the "Altmetrics" corpus
[Plot: share of Twitter mentions (%) over the top x% of mentioned science studies; a computation sketch follows below]
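The concentration behind the plot can be measured by sorting studies by mention count and computing the cumulative share covered by the top x% of studies; a small sketch with invented counts:

import numpy as np

# Invented per-study Twitter mention counts (the real input: 1.67 M tweets
# matched against the Altmetrics corpus).
mentions = np.array([900, 300, 120, 50, 20, 10, 5, 3, 1, 1])

sorted_counts = np.sort(mentions)[::-1]                 # most-mentioned first
cum_share = np.cumsum(sorted_counts) / sorted_counts.sum()
top_pct = np.arange(1, len(sorted_counts) + 1) / len(sorted_counts)

for x, s in zip(top_pct, cum_share):
    print(f"top {x:4.0%} of studies -> {s:5.1%} of mentions")

On heavy-tailed data this curve rises steeply at first and flattens quickly, which is exactly the "10% receive > 75%" pattern reported above.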

Challenge: online science discourse is not well-informed
Links to actual scientific studies/context are missing in news & social media
▪ NLP models are able to predict the missing primary science reference (e.g. a DOI or journal paper link) for a given informal reference (e.g. "Heinsberg Studie") or secondary reference (news article)
▪ Supervised & unsupervised approaches using DL language models (a minimal unsupervised sketch follows below)
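A minimal unsupervised sketch of the linking task: embed the informal reference and candidate paper titles with an off-the-shelf sentence-transformers model and rank candidates by cosine similarity. The model name and candidate titles are illustrative, not the system evaluated in the talk:

from sentence_transformers import SentenceTransformer, util

# Off-the-shelf embedding model; not the approach from the talk.
model = SentenceTransformer("all-MiniLM-L6-v2")

informal_ref = "the Heinsberg study on COVID-19 infection rates"
candidates = [  # invented candidate paper titles
    "Infection fatality rate of SARS-CoV-2 in a German community",
    "Effectiveness of mask mandates across US states",
    "A survey of deep learning for recommender systems",
]

# Rank candidates by cosine similarity to the informal reference.
scores = util.cos_sim(model.encode(informal_ref), model.encode(candidates))[0]
best = int(scores.argmax())
print("predicted primary reference:", candidates[best], float(scores[best]))

A supervised variant would instead fine-tune on pairs of informal/secondary references and their known primary studies.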

Science discourse is "different"
[Examples from http://snopes.com: a non-science claim vs. a science claim]

Computational (AI) challenge
NLP methods (e.g. for fact-checking) perform worse on science discourse
[Charts: performance of state-of-the-art AI/deep learning models on standard benchmark datasets for claim check-worthiness detection, fake news detection and claim verification]
▪ Take-away: DL-based methods geared towards scientific discourse are required

Summary: scientific knowledge @ KTS/GESIS
KGs
▪ KGs about scholarly use of software, ML models & research data (e.g. SoftwareKG)
▪ Web-mined KGs of social science research data, e.g. public opinions, claims and attitudes expressed on social media (e.g. TweetsKB)
Science discourse (publications/online)
▪ Scholarly publications (e.g. PubMed, arXiv, SSOAR)
▪ Raw data about scientific online discourse (e.g. SciTweets, Altmetric)
Deep learning-based models & LLMs
▪ NLP and deep learning-powered methods for extracting large-scale KGs about methods, claims, data, software and their adoption
▪ Examples: SciBERT (SciTweets classifier) & SciDeBERTa (GSAP-NER)
https://gesis.org/kts

Key take-aways
▪ Complementary representations of science discourse with varying degrees of structure: KGs, unstructured Web (e.g. publications, Twitter, web pages), LLMs/learned representations
o Many works depend on all three (!)
o Wrt usability & applications: KGs are still seen as the gold standard (but are also the most costly to produce)
▪ Complementary knowledge construction approaches: manual curation/annotation and automatic information extraction for KG construction (relying on LLMs & unstructured documents)
▪ Challenges (often ignored in LLM-based research): model decay due to language & knowledge evolution => requires model & KG maintenance, yet constructing/updating LLMs/KGs is costly
▪ Essential challenge: keeping all three representations of science discourse synchronised

Thanks! (and: we are recruiting!) https://gesis.org/kts

@stefandietze
https://stefandietze.net
http://gesis.org/en/kts