FAIR & AI Ready KGs for Explainable Predictions.pdf

micheldumontier 132 views 50 slides Sep 30, 2024

About This Presentation

Towards Biomedical Neurosymbolic AI: From Knowledge Infrastructure to Explainable Predictions

The increased availability of biomedical data, particularly in the public domain, offers the opportunity to better understand human health and to develop effective therapeutics for a wide range of unmet me...


Slide Content

FAIR & AI Ready Knowledge Graphs
for Explainable Predictions
Michel Dumontier, PhD

Distinguished Professor of Data Science
Founder and Director, Institute of Data Science
Maastricht University

SeWeBMeDA :: 26-05-2024

@micheldumontier::KAUST-Hackathon:2023-02-07
Vast amounts of (largely open) data are now available

A common rejection module (CRM) for acute rejection across multiple organs identifies novel therapeutics for organ transplantation
Khatri et al. JEM. (2013) 210 (11): 2205
DOI: 10.1084/jem.20122709
Main Findings:
1. CRM of 11 overexpressed genes predicted future injury to a graft
2. Treating mice with existing drugs against specific CRM genes extended graft survival
3. Retrospective EHR data analysis supports treatment prediction

Key Observations:
1. Meta-analysis offers a more reliable estimate of the direction and magnitude of the effect
2. Existing data can be used to generate and validate new hypotheses

Significant effort is needed to find the right data, make sense of them, and use them for a new purpose

Data scientists could be more productive

Surprisingly low reproducibility of landmark studies

39% (39/100) in psychology [1]
21% (14/67) in pharmacology [2]
11% (6/53) in cancer [3]
unsatisfactory in machine learning [4]

[1] doi:10.1038/nature.2015.17433
[2] doi:10.1038/nrd3439-c1
[3] doi:10.1038/483531a
[4] https://openreview.net/pdf?id=By4l2PbQ-

Most published research findings are false.
- John Ioannidis, Stanford University
PLoS Med 2005;2(8): e124.

Probability to launch
Nature Reviews Drug Discovery 18, 495-496 (2019)
https://doi.org/10.1038/d41573-019-00074-z

Do we have access to the data we need to build good AI models?

To what extent will AI models improve the success of developing effective treatments?

Can they provide adequate explanations?

Data are challenging to reuse:
● difficult to obtain
● poorly described
● in different formats
● hard to integrate with other data

Models have significant limitations:
● built from limited, biased, or erroneous data
● aren't robust to different inputs
● difficult to explain decisions and explanations aren't satisfying
Translational Failure

High quality, linked, machine accessible, machine interpretable (meta)data from multiple sources and data types
Trustworthy, data-driven and knowledge-aligned, explainable models and predictions
Translational Success

Human Machine collaboration is crucial to our future work

Machines need to be able to discover and reuse data (and arguably any digital resource)

Research Directions

i) organize knowledge to answer questions about what we know and what we don’t know (but should)
ii) build models to predict, explain, and justify
iii) human-AI collaboration to create, maintain, correct, and complete knowledge

Knowledge Infrastructure Explainable Predictions
Neurosymbolic AI

http://www.nature.com/articles/sdata201618

Making FAIR Data
1. Collect: data
2. Describe: use ontologies + vocabularies; use standard metadata format (yields standardized metadata)
3. Transform: use ontologies + vocabularies; use standard data format (yields standardized data)
4. Publish: add provenance, license for data + metadata; deposit standardized data and standardized metadata in a data repository with a persistent data identifier and a persistent metadata identifier

Communities are publishing recipes to make FAIR data

How do we know it’s FAIR?
• FAIR Enough is a system to perform automated assessment of the technical quality of the FAIRness implementation.
• Uses a collection of metrics, implemented as web services.
• Fast owing to parallel execution
• Keeps track of past assessments to monitor status
• Offers search and query services
• Anybody can extend it via the service-based framework
• Open source and Docker deployable
https://fair-enough.semanticscience.org

https://lod-cloud.net/

Linked Data for the Life Sciences
Bio2RDF is an open source project that uses semantic web technologies to make it easier to reuse biomedical data.

It provides Linked Data and a queryable RDF knowledge graph.

• Since 2007, last updated 2014
• 30+ biomedical data sources
• 10B+ interlinked statements
• NCBI, EBI, SIB, DBCLS, NCBO, and many others (chem2bio2rdf) produce this content
chemicals/drugs/formulations
genomes/genes/proteins, domains
interactions, complexes & pathways
animal models and phenotypes
disease, genetic markers, treatments
terminologies & publications
Belleau et al. JBI 2008. 41(5):706-716.
Callahan et al. ESWC 2013. 200-212

The triple as a base unit of knowledge representation
“diclofenac is a drug”
subject: diclofenac
predicate: is a
object: drug

Formalization
“diclofenac is a drug”
subject: drugbank:DB00586
predicate: rdf:type
object: drugbank:Drug
label: drugbank:DB00586 rdfs:label "diclofenac"
RDF N-Triples format (standardized, machine interpretable):
<https://bio2rdf.org/drugbank:DB00586> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://bio2rdf.org/drugbank_vocabulary:Drug> .
1. Use RDF
2. Assign/reuse identifiers
3. Use or develop vocabularies
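
As a rough illustration of the formalization above, the sketch below writes the "diclofenac is a drug" statement and its label as N-Triples using only string formatting. The helper names (iri, literal, n_triple) are made up for this example, and the literal escaping covers only backslashes and double quotes, so it is a minimal sketch and not a complete N-Triples serializer.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain string literal (escapes only backslash and double quote)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def n_triple(subject: str, predicate: str, obj: str) -> str:
    """Join three already-formatted terms into one N-Triples statement."""
    return f"{subject} {predicate} {obj} ."


# Identifiers taken from the slide.
drug = "https://bio2rdf.org/drugbank:DB00586"
drug_class = "https://bio2rdf.org/drugbank_vocabulary:Drug"

type_statement = n_triple(iri(drug), iri(RDF_TYPE), iri(drug_class))
label_statement = n_triple(iri(drug), iri(RDFS_LABEL), literal("diclofenac"))

print(type_statement)
print(label_statement)
```

In practice an RDF library would handle serialization and escaping; the point here is only that each statement is three terms followed by a full stop.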

Biomedical Linked Data
has detailed provenance
linked to other resources
semantically typed
http(s) identifier
rich descriptions

An Interoperable Biomedical Knowledge Graph
diclofenac db:category NSAID
drug-gene association pharmgkb:drug diclofenac
drug-gene association pharmgkb:haplotype CYP2C9*1
Reproducible ML: new uses for existing drugs
Exploration: drug-target-disease networks
https://doi.org/10.7717/peerj-cs.281
https://doi.org/10.7717/peerj-cs.106
Custom Knowledge Portal: EbolaKB
https://doi.org/10.1093/database/bav049
Information Retrieval: Phenotypes of knock-out mouse models for the targets of a selected drug

het.io DRKG
ROBOKOP
PheKnowLator

Integrative KGs Semantic KGs

Knowledge Collaboratory (for small data)
An AI-powered (NLP model or GPT) web user interface to annotate biomedical text (NER, RE) and create standard-compliant statements (BioLink model) that can be made publicly available as author-signed nanopublications.
collaboratory.semanticscience.org/annotate

Technology to publish assertions using RDF
Contains RDF triples to specify the assertion, its provenance, and digital object metadata
Digitally signed by agent
TrustyURI hash to provide globally unique, persistent, immutable, verifiable identifier and payload
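
The idea of a content-derived, verifiable identifier can be sketched in a few lines. This is a simplified illustration only and not the Trusty URI algorithm, which defines its own RDF normalization and encoding. The namespace https://example.org/np/ and the function names are hypothetical, and sorting lines is a crude stand-in for proper RDF canonicalization.

```python
import base64
import hashlib

BASE_URI = "https://example.org/np/"  # hypothetical namespace for this sketch


def content_hash_id(rdf_text: str, base_uri: str = BASE_URI) -> str:
    """Derive an identifier from the content itself (illustrative only)."""
    # Assumption: sorting non-empty lines approximates canonicalization.
    lines = sorted(line.strip() for line in rdf_text.splitlines() if line.strip())
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).digest()
    token = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return base_uri + token


def is_unchanged(identifier: str, rdf_text: str, base_uri: str = BASE_URI) -> bool:
    """Recompute the hash and compare it with the identifier."""
    return identifier == content_hash_id(rdf_text, base_uri)


assertion_text = "\n".join([
    "<https://bio2rdf.org/drugbank:DB00586> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<https://bio2rdf.org/drugbank_vocabulary:Drug> .",
    "<https://bio2rdf.org/drugbank:DB00586> "
    '<http://www.w3.org/2000/01/rdf-schema#label> "diclofenac" .',
])

identifier = content_hash_id(assertion_text)
tampered = assertion_text.replace("DB00586", "DB00000")

print(identifier)
print(is_unchanged(identifier, assertion_text))
print(is_unchanged(identifier, tampered))
```

Because the identifier depends on the content, any edit to the statements produces a different identifier, which is what makes the published payload verifiable.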


Robust, Reproducible, Explainable Predictions

Neurosymbolic Artificial Intelligence (NAI)
NAI aims to combine symbolic reasoning methods (logic-based reasoning & rules) with sub-symbolic methods (neural networks, deep learning) to create models with high predictive performance and explainability.
Specifically:
● Integrates knowledge from different modalities
● Able to perform complex reasoning (e.g. deduction, induction, synthesis)
● Able to learn from examples/small data and big data
● Offers explanations (e.g. causal account of the phenomenon) and justifications (the evidence that supports the claims)
● Robust to noise and nonsense
● Handles cases out of the learning distribution

Predict new drug applications in a documented and reproducible manner
AUC 0.90 across all therapeutic indications
Scripts not available. Feature tables available.
Not reproducible!
Result: ROCAUC 0.83
Celebi R, Rebelo Moreira J, Hassan AA, Ayyar S, Ridder L, Kuhn T, Dumontier M. 2020. Towards FAIR protocols and workflows: the OpenPREDICT use case. PeerJ Computer Science 6:e281 https://doi.org/10.7717/peerj-cs.281

Explainable AI
• XAI methods such as SHAP provide information about feature importance for the model and in individual predictions
• When applied to OpenPredict, it’s too complicated to understand the contributions of derived features
• However, it is clearer when using a single feature predictor

XPREDICT
Identify a set of ranked paths through the KG that offer plausible explanations for a predicted drug indication based on their similarities to known drug-disease pairs.
1. Path generation
2. Path ranking
3. Explanation generation
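
The path generation step can be pictured with a small sketch. The code below is not the XPREDICT implementation; it is a generic depth-first enumeration of simple paths over a toy graph, and every node name, relation name, and the max_hops parameter is a placeholder invented for illustration. Ranking and explanation generation are not covered.

```python
from collections import defaultdict


def build_adjacency(triples):
    """Index (head, relation, tail) triples by head node."""
    adjacency = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))
    return adjacency


def find_paths(adjacency, source, target, max_hops=3):
    """Enumerate simple directed paths from source to target, depth first."""
    paths = []

    def walk(node, path, visited):
        if node == target:
            paths.append(list(path))
            return
        if len(path) >= max_hops:
            return
        for relation, neighbour in adjacency.get(node, []):
            if neighbour not in visited:
                path.append((node, relation, neighbour))
                visited.add(neighbour)
                walk(neighbour, path, visited)
                visited.remove(neighbour)
                path.pop()

    walk(source, [], {source})
    return paths


# Toy graph with placeholder entities (not real biomedical facts).
TRIPLES = [
    ("drug:X", "binds", "gene:G1"),
    ("drug:X", "binds", "gene:G2"),
    ("gene:G1", "participates_in", "pathway:P1"),
    ("pathway:P1", "involved_in", "disease:D"),
    ("gene:G2", "associated_with", "disease:D"),
]

paths = find_paths(build_adjacency(TRIPLES), "drug:X", "disease:D", max_hops=3)
for path in paths:
    print(" -> ".join(f"{h} [{r}] {t}" for h, r, t in path))
```

Bounding the number of hops keeps the enumeration tractable; on a real knowledge graph the number of candidate paths grows quickly with path length.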

Graph Representation Learning
We want to automatically discover effective representations needed for classification from raw data.

In graph representation learning, we encode the topology, node attributes, and edge information into low-dimensional vectors (or embeddings).

These vectors can then be used as features to train classifiers for link prediction, node classification, graph classification, etc.

GNN
Graph Neural Networks (GNNs) iteratively update node representations by aggregating features from neighboring nodes and possibly edges.
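
A minimal sketch of the aggregation idea, assuming plain mean aggregation with no learned weights or nonlinearity. Real layers such as GCN, GraphSAGE, or GAT add trainable parameters and differ in how neighbours are weighted. The three-node graph and feature values are invented for illustration.

```python
def mean_aggregate(features, neighbours):
    """One message-passing round: average each node with its neighbours."""
    updated = {}
    for node, vector in features.items():
        stack = [vector] + [features[n] for n in neighbours.get(node, [])]
        updated[node] = [sum(column) / len(stack) for column in zip(*stack)]
    return updated


# Toy undirected path graph a - b - c with two-dimensional features.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

one_hop = mean_aggregate(features, neighbours)
two_hop = mean_aggregate(one_hop, neighbours)

print(one_hop)
print(two_hop)
```

Applying the function twice lets information from node c reach node a, which is the sense in which the updates are iterative.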

Several methods (e.g. Saliency Maps) exist to extract a model-wise explanation for link prediction, node/graph classification.
Graph Neural Networks

But I don’t find these explanations salient at all.

Embeddings don’t lend themselves to meaningful explanations because each dimension does not necessarily represent or correspond with something we recognize

Methodology
1. Use Graph Neural Networks to capture semantics, graph structure and relationships between nodes
2. Apply Saliency Maps on predictions made by GNNs to identify relevant nodes for a specific prediction; this provides valuable insights into the graph's topology and highlights the most important components
3. Saliency Maps assign a score to each node in the network, which can be used to rank paths involving genes, pathways, diseases, and compounds
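
One way node scores could be turned into a path ranking is sketched below. The slides do not say how scores are combined along a path, so averaging is an assumption made here purely for illustration, and the node names and score values are hypothetical.

```python
def rank_paths(paths, node_scores):
    """Order paths by the mean score of their nodes, highest first.

    Assumption: mean aggregation; nodes without a score count as 0.0.
    """
    def mean_score(path):
        return sum(node_scores.get(node, 0.0) for node in path) / len(path)

    return sorted(paths, key=mean_score, reverse=True)


# Hypothetical saliency scores for placeholder nodes.
node_scores = {
    "compound:C": 0.9,
    "gene:G1": 0.8,
    "gene:G2": 0.2,
    "pathway:P1": 0.4,
    "disease:D": 0.7,
}

candidate_paths = [
    ["compound:C", "gene:G2", "disease:D"],
    ["compound:C", "gene:G1", "disease:D"],
    ["compound:C", "gene:G1", "pathway:P1", "disease:D"],
]

for path in rank_paths(candidate_paths, node_scores):
    print(" -> ".join(path))
```

Other aggregations (minimum, product, length-penalized sums) would give different orderings, so the choice should be treated as a design decision to validate against known drug-disease pairs.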

Experiments
Task: Link Prediction: Predict (Compound, treats, Disease)
Dataset: Drug Repurposing Knowledge Graph
Models: GraphSAGE, Graph Convolutional Network (GCN), Graph Attention Network (GAT)
Metrics:
• Precision: ratio of true positives to predicted positives
• Recall: ratio of true positives to actual positives
• Hits@5 & Hits@10: for each test instance we ranked the candidate tails and assessed whether the true tail appears within the top 5 or top 10
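
The metrics can be written out as a short sketch. The function names and the toy predictions and ranks are invented for illustration; ranks are assumed to be 1-based positions of the true tail among the ranked candidates.

```python
def precision_recall(predicted, actual):
    """Precision = TP / predicted positives; recall = TP / actual positives."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def hits_at_k(ranks, k):
    """Fraction of test triples whose true tail is ranked in the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Toy (compound, disease) pairs; placeholders, not real results.
predicted_pairs = [("c1", "d1"), ("c2", "d1"), ("c3", "d2")]
actual_pairs = [("c1", "d1"), ("c3", "d2"), ("c4", "d3"), ("c5", "d3")]
precision, recall = precision_recall(predicted_pairs, actual_pairs)

# Rank of the true tail for four hypothetical test triples.
true_tail_ranks = [1, 3, 7, 12]

print(precision, recall)
print(hits_at_k(true_tail_ranks, 5), hits_at_k(true_tail_ranks, 10))
```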

Prediction & Explanation for: Donepezil treats Alzheimer
The primary goal of Alzheimer's drugs, including Donepezil, is to maintain elevated acetylcholine (ACh) levels, thereby compensating for the loss of functioning cholinergic brain cells

Explanation subgraph emphasizes the crucial role of Donepezil binding to acetylcholinesterase (AChE) and butyrylcholinesterase (BChE)

Summary
Finding effective drugs remains an outstanding challenge in biomedicine.
Much progress has been made in constructing semantic biomedical knowledge graphs, and new infrastructure is available to publish, discover, and reuse biomedical knowledge in a manner that is FAIR and decentralized.
Embeddings-based prediction approaches show promising performance (notwithstanding data bias), but remain difficult to explain.
Neurosymbolic systems retain the semantics of the knowledge, and show promise in generating predictions that can recover logical explanations.

Michel Dumontier, PhD

Distinguished Professor of Data Science
Founder and Director, Institute of Data Science
Department of Advanced Computing Sciences
Maastricht University


Knowledge Infrastructure
Explainable Predictions
Neurosymbolic AI
FAIR & AI Ready Knowledge Graphs for Explainable Predictions