FAIR & AI Ready KGs for Explainable Predictions.pdf
About This Presentation
Towards Biomedical Neurosymbolic AI: From Knowledge Infrastructure to Explainable Predictions
The increased availability of biomedical data, particularly in the public domain, offers the opportunity to better understand human health and to develop effective therapeutics for a wide range of unmet medical needs. However, data scientists remain stymied by the fact that data remain hard to find and to productively reuse because data and their metadata i) are wholly inaccessible, ii) are in non-standard or incompatible representations, iii) do not conform to community standards, and iv) have unclear or highly restricted terms and conditions that preclude legitimate reuse. These limitations require a rethink on how data can be made machine and AI-ready - the key motivation behind the FAIR Guiding Principles. Concurrently, while recent efforts have explored the use of deep learning to fuse disparate data into predictive models for a wide range of biomedical applications, these models often fail even when the correct answer is already known, and fail to explain individual predictions in terms that data scientists can appreciate. These limitations suggest that new methods to produce practical artificial intelligence are still needed.
In this talk, I will discuss our work in (1) building an integrative knowledge infrastructure to prepare FAIR and "AI-ready" data and services along with (2) neurosymbolic AI methods to improve the quality of predictions and to generate plausible explanations. Attention is given to standards, platforms, and methods to wrangle knowledge into simple, but effective semantic and latent representations, and to make these available into standards-compliant and discoverable interfaces that can be used in model building, validation, and explanation. Our work, and those of others in the field, creates a baseline for building trustworthy and easy to deploy AI models in biomedicine.
Bio
Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University, founder and executive director of the Institute of Data Science, and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research explores socio-technological approaches for responsible discovery science, which includes collaborative multi-modal knowledge graphs, privacy-preserving distributed data mining, and AI methods for drug discovery and personalized medicine. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon Europe, the European Open Science Cloud, the US National Institutes of Health, and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.
Size: 7.5 MB
Language: en
Added: Sep 30, 2024
Slides: 50 pages
Slide Content
FAIR & AI Ready Knowledge Graphs
for Explainable Predictions
Michel Dumontier, PhD
Distinguished Professor of Data Science
Founder and Director, Institute of Data Science
Maastricht University
SeWeBMeDA :: 26-05-2024
Vast amounts of (largely open) data are now available
A common rejection module (CRM) for acute rejection across multiple organs identifies novel therapeutics for organ transplantation
Khatri et al. JEM (2013) 210(11):2205. DOI: 10.1084/jem.20122709
Main Findings:
1. A CRM of 11 overexpressed genes predicted future injury to a graft
2. Mice treated with existing drugs against specific CRM genes extended graft survival
3. Retrospective EHR data analysis supports the treatment prediction
Key Observations:
1. Meta-analysis offers a more reliable estimate of the direction and magnitude of the effect
2. Existing data can be used to generate and validate new hypotheses
Significant effort is needed to find the right data, make sense of them, and use them for a new purpose.
Data scientists could be more productive
Surprisingly low reproducibility of landmark studies
"Most published research findings are false." - John Ioannidis, Stanford University. PLoS Med 2005;2(8):e124
[Figure] Probability to launch. Nature Reviews Drug Discovery 18, 495-496 (2019). https://doi.org/10.1038/d41573-019-00074-z
Do we have access to the data we need to build good AI models?
To what extent will AI models improve the success of developing effective treatments?
Can they provide adequate explanations?
Data are challenging to reuse:
● difficult to obtain
● poorly described
● in different formats
● hard to integrate with other data
Models have significant limitations:
● built from limited, biased, or erroneous data
● aren't robust to different inputs
● difficult to explain decisions, and explanations aren't satisfying
→ Translational Failure
High quality, linked, machine accessible, machine interpretable (meta)data from multiple sources and data types
+ trustworthy, data-driven and knowledge-aligned, explainable models and predictions
→ Translational Success
Human-machine collaboration is crucial to our future work
Machines need to be able to discover and reuse data (and arguably any digital resource).
Research Directions
i) organize knowledge to answer questions about what we know and what we don't know (but should)
ii) build models to predict, explain, and justify
iii) human-AI collaboration to create, maintain, correct, and complete knowledge
Knowledge Infrastructure + Explainable Predictions → Neurosymbolic AI
http://www.nature.com/articles/sdata201618
Making FAIR Data
1. Collect: data
2. Describe: use ontologies + vocabularies; use a standard metadata format → Standardized Metadata
3. Transform: use ontologies + vocabularies; use a standard data format → Standardized Data
4. Publish: add provenance and a license for data + metadata; assign a Persistent Data Identifier and a Persistent Metadata Identifier; deposit the standardized data and metadata in a Data Repository
Communities are publishing recipes to make FAIR data.
How do we know it's FAIR?
• FAIR Enough is a system to perform automated assessment of the technical quality of a FAIRness implementation.
• Uses a collection of metrics, implemented as web services
• Fast owing to parallel execution
• Keeps track of past assessments to monitor status
• Offers search and query services
• Anybody can extend it via the service-based framework
• Open source and Docker deployable
https://fair-enough.semanticscience.org
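To give a flavor of what such a metric test does, here is a minimal, hypothetical sketch (not the FAIR Enough API) that checks two simple signals for a given identifier: whether it resolves, and whether machine-readable RDF metadata can be obtained via content negotiation.

import requests

def check_resolvable(uri: str) -> bool:
    """Findability-style signal: does the (persistent) identifier resolve?"""
    try:
        r = requests.head(uri, allow_redirects=True, timeout=10)
        return r.status_code < 400
    except requests.RequestException:
        return False

def check_machine_readable_metadata(uri: str) -> bool:
    """Interoperability-style signal: can we content-negotiate RDF metadata?"""
    accept = "text/turtle, application/rdf+xml, application/ld+json"
    try:
        r = requests.get(uri, headers={"Accept": accept}, timeout=10)
        ctype = r.headers.get("Content-Type", "")
        return any(t in ctype for t in ("turtle", "rdf+xml", "ld+json"))
    except requests.RequestException:
        return False

if __name__ == "__main__":
    uri = "https://bio2rdf.org/drugbank:DB00586"  # example resource
    print("resolvable:", check_resolvable(uri))
    print("machine-readable metadata:", check_machine_readable_metadata(uri))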
Linked Data for the Life Sciences
Bio2RDF is an open source project that uses semantic web technologies to make it easier to reuse biomedical data. It provides Linked Data and a queryable RDF knowledge graph.
• Since 2007, last updated 2014
• 30+ biomedical data sources
• 10B+ interlinked statements
• NCBI, EBI, SIB, DBCLS, NCBO, and many others (chem2bio2rdf) produce this content
Content: chemicals/drugs/formulations; genomes/genes/proteins, domains; interactions, complexes & pathways; animal models and phenotypes; disease, genetic markers, treatments; terminologies & publications
Belleau et al. JBI 2008. 41(5):706-716. Callahan et al. ESWC 2013. 200-212.
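Because Bio2RDF exposes a queryable RDF knowledge graph, its content can be retrieved with SPARQL. A small sketch using SPARQLWrapper is shown below; the endpoint URL and the exact class URI are assumptions and may differ between Bio2RDF releases and datasets.

from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed public endpoint; individual Bio2RDF datasets also expose their own endpoints.
sparql = SPARQLWrapper("https://bio2rdf.org/sparql")
sparql.setReturnFormat(JSON)

# Find resources typed as DrugBank drugs, together with their labels (first 10).
sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?drug ?label WHERE {
  ?drug a <https://bio2rdf.org/drugbank_vocabulary:Drug> ;
        rdfs:label ?label .
} LIMIT 10
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["drug"]["value"], "-", row["label"]["value"])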
The Triple as a base unit of knowledge representation
"diclofenac is a drug"
subject: diclofenac | predicate: is a | object: drug
Formalization
"diclofenac is a drug" becomes drugbank:DB00586 rdf:type drugbank:Drug, with the human-readable label "diclofenac" attached via rdfs:label.
RDF N-Triples format (standardized, machine interpretable):
<https://bio2rdf.org/drugbank:DB00586>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<https://bio2rdf.org/drugbank_vocabulary:Drug> .
1. Use RDF
2. Assign/reuse identifiers
3. Use or develop vocabularies
(see the sketch below)
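As a minimal illustration of these three steps, the sketch below uses rdflib to build the diclofenac triple from the previous slide, reusing the Bio2RDF identifier and the drugbank vocabulary term.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

BIO2RDF = Namespace("https://bio2rdf.org/")

g = Graph()
diclofenac = BIO2RDF["drugbank:DB00586"]          # 2. reuse an existing identifier
drug_class = BIO2RDF["drugbank_vocabulary:Drug"]  # 3. use a vocabulary term

g.add((diclofenac, RDF.type, drug_class))         # "diclofenac is a drug"
g.add((diclofenac, RDFS.label, Literal("diclofenac", lang="en")))

# 1. use RDF: serialize to the standardized N-Triples format
print(g.serialize(format="nt"))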
Biomedical Linked Data: detailed provenance; linked to other resources; semantically typed; http(s) identifiers; rich descriptions.
An Interoperable Biomedical Knowledge Graph (example): diclofenac is in the NSAID category (db:category), and a drug-gene association links diclofenac (pharmgkb:drug) to the CYP2C9*1 haplotype (pharmgkb:haplotype).
Reproducible ML: new uses for existing drugs
Exploration: drug-target-disease networks
https://doi.org/10.7717/peerj-cs.281
https://doi.org/10.7717/peerj-cs.106
Custom Knowledge Portal: EbolaKB (https://doi.org/10.1093/database/bav049)
Information Retrieval: phenotypes of knock-out mouse models for the targets of a selected drug
Integrative KGs: het.io, DRKG | Semantic KGs: ROBOKOP, PheKnowLator
Knowledge Collaboratory (for small data)
An AI-powered (NLP model or GPT) web user interface to annotate biomedical text (NER, RE) and create standard-compliant statements (BioLink model) that can be made publicly available as author-signed nanopublications.
collaboratory.semanticscience.org/annotate
Nanopublications: a technology to publish assertions using RDF
● Contains RDF triples to specify the assertion, its provenance, and digital object metadata
● Digitally signed by an agent
● TrustyURI hash provides a globally unique, persistent, immutable, verifiable identifier and payload
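A nanopublication groups its triples into named graphs for the assertion, its provenance, and publication info. The sketch below mimics that structure with rdflib's Dataset; it is illustrative only, all URIs are placeholders, and it omits the digital signature and TrustyURI hashing that a real nanopublication library would add.

from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import RDF, PROV, XSD

NP = Namespace("http://www.nanopub.org/nschema#")
EX = Namespace("https://example.org/np/1#")   # placeholder nanopublication URI

ds = Dataset()
assertion = ds.graph(EX.assertion)
provenance = ds.graph(EX.provenance)
pubinfo = ds.graph(EX.pubinfo)

# Assertion graph: the claim itself
assertion.add((URIRef("https://bio2rdf.org/drugbank:DB00586"),
               RDF.type,
               URIRef("https://bio2rdf.org/drugbank_vocabulary:Drug")))

# Provenance graph: where the assertion came from
provenance.add((EX.assertion, PROV.wasDerivedFrom,
                URIRef("https://www.drugbank.ca/drugs/DB00586")))

# Publication info graph: who published the nanopublication, and when
pubinfo.add((EX[""], PROV.wasAttributedTo, URIRef("https://orcid.org/0000-0000-0000-0000")))
pubinfo.add((EX[""], PROV.generatedAtTime, Literal("2024-05-26", datatype=XSD.date)))

# Head graph wiring assertion, provenance, and pubinfo together
head = ds.graph(EX.head)
head.add((EX[""], RDF.type, NP.Nanopublication))
head.add((EX[""], NP.hasAssertion, EX.assertion))
head.add((EX[""], NP.hasProvenance, EX.provenance))
head.add((EX[""], NP.hasPublicationInfo, EX.pubinfo))

print(ds.serialize(format="trig"))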
Robust, Reproducible, Explainable Predictions
Neurosymbolic Artificial Intelligence (NAI)
NAI aims to combine symbolic reasoning methods (logic-based reasoning & rules) with
sub-symbolic methods (neural networks, deep learning) to create models with high
predictive performance and explainability.
Specifically, a neurosymbolic system:
● Integrates knowledge from different modalities
● Is able to perform complex reasoning (e.g. deduction, induction, synthesis)
● Is able to learn from examples/small data and big data
● Offers explanations (e.g. a causal account of the phenomenon) and justifications (the evidence that supports the claims)
● Is robust to noise and nonsense
● Handles cases out of the learning distribution
Predict new drug applications in a documented and reproducible manner
Reported: AUC 0.90 across all therapeutic indications. Scripts not available; feature tables available. Not reproducible!
Reproduced result: ROC AUC 0.83
Celebi R, Rebelo Moreira J, Hassan AA, Ayyar S, Ridder L, Kuhn T, Dumontier M. 2020. Towards FAIR protocols and workflows: the OpenPREDICT use case. PeerJ Computer Science 6:e281. https://doi.org/10.7717/peerj-cs.281
Explainable AI
• XAI methods such as SHAP provide information about feature importance for the model and for individual predictions
• When applied to OpenPredict, it is too complicated to understand the contributions of derived features
• However, it is clearer when using a single-feature predictor
XPREDICT
Identify a set of ranked paths through the KG that offer plausible explanations for a predicted drug indication, based on their similarities to known drug-disease pairs (a path-generation sketch follows below).
1. Path generation
2. Path ranking
3. Explanation generation
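XPREDICT's first step, path generation, can be prototyped on a small KG with networkx; the sketch below enumerates short paths between a candidate drug and disease, which would then be ranked by their similarity to paths of known drug-disease pairs. The toy graph, node names, and relation labels are illustrative only.

import networkx as nx

# Toy knowledge graph: nodes are biomedical entities, edge labels are KG relations.
kg = nx.DiGraph()
kg.add_edge("Donepezil", "AChE", relation="binds")
kg.add_edge("AChE", "acetylcholine", relation="hydrolyzes")
kg.add_edge("acetylcholine", "Alzheimer's disease", relation="associated_with")
kg.add_edge("Donepezil", "BChE", relation="binds")
kg.add_edge("BChE", "Alzheimer's disease", relation="associated_with")

def generate_paths(graph, drug, disease, max_len=3):
    """Step 1 (path generation): all simple paths of at most max_len hops."""
    for path in nx.all_simple_paths(graph, source=drug, target=disease, cutoff=max_len):
        # Attach the relation labels so the path reads as a chain of triples.
        relations = [graph[u][v]["relation"] for u, v in zip(path, path[1:])]
        yield list(zip(path, relations + [None]))

for path in generate_paths(kg, "Donepezil", "Alzheimer's disease"):
    print(" -> ".join(node if rel is None else f"{node} -[{rel}]" for node, rel in path))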
Graph Representation Learning
We want to automatically discover effective representations needed for classification from raw data.
In graph representation learning, we encode the topology, node attributes, and edge information into low-dimensional vectors (or embeddings).
These vectors can then be used as features to train classifiers for link prediction, node classification, graph classification, etc.
Graph Neural Networks (GNNs)
GNNs iteratively update node representations by aggregating features from neighboring nodes and possibly edges (a minimal encoder sketch follows below).
Several methods (e.g. Saliency Maps) exist to extract a model-wise explanation for link prediction and node/graph classification.
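As a concrete, hypothetical sketch of a GNN encoder for link prediction, the example below uses PyTorch Geometric's GCNConv to produce node embeddings and scores a candidate (compound, treats, disease) edge with a dot product; the layer sizes and toy data are placeholders, not the models evaluated later.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCNLinkPredictor(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=64, out_dim=32):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

    def encode(self, x, edge_index):
        # Two rounds of neighborhood aggregation produce node embeddings.
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

    def decode(self, z, pairs):
        # Score candidate (head, tail) pairs, e.g. (compound, disease), by dot product.
        return (z[pairs[0]] * z[pairs[1]]).sum(dim=-1)

# Toy example: 4 nodes with 8-dimensional features and a few edges.
x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
model = GCNLinkPredictor(in_dim=8)
z = model.encode(x, edge_index)
scores = torch.sigmoid(model.decode(z, torch.tensor([[0, 1], [2, 3]])))
print(scores)  # predicted probability that each candidate edge exists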
But I don't find these explanations salient at all. Embeddings don't lend themselves to meaningful explanations because each dimension does not necessarily represent or correspond with something we recognize.
Methodology
1. Use Graph Neural Networks to capture semantics, graph structure, and relationships between nodes
2. Apply Saliency Maps to predictions made by GNNs to identify relevant nodes for a specific prediction; this provides valuable insights into the graph's topology and highlights the most important components
3. Saliency Maps assign a score to each node in the network, which can be used to rank paths involving genes, pathways, diseases, and compounds (a minimal saliency sketch follows below)
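Step 2 can be implemented as a simple gradient-based saliency map: take the gradient of a prediction's score with respect to the node feature matrix and use the per-node gradient magnitude as a relevance score. A minimal sketch, reusing the hypothetical GCNLinkPredictor from the earlier example, is shown below.

import torch

def node_saliency(model, x, edge_index, head, tail):
    """Score each node's relevance to the predicted (head, tail) link."""
    x = x.clone().requires_grad_(True)           # track gradients w.r.t. node features
    z = model.encode(x, edge_index)
    score = model.decode(z, torch.tensor([[head], [tail]])).squeeze()
    score.backward()                              # d(score) / d(node features)
    return x.grad.abs().sum(dim=1)                # one relevance value per node

# saliency = node_saliency(model, x, edge_index, head=0, tail=2)
# Nodes with high saliency can then be used to rank explanatory paths
# involving genes, pathways, diseases, and compounds (step 3).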
Experiments
Task: Link prediction: predict (Compound, treats, Disease)
Dataset: Drug Repurposing Knowledge Graph (DRKG)
Models: GraphSAGE, Graph Convolutional Network (GCN), Graph Attention Network (GAT)
Metrics:
• Precision: ratio of true positives to predicted positives
• Recall: ratio of true positives to actual positives
• Hits@5 & Hits@10: for each test instance we predict the tail and assess whether the true tail is ranked within the top 5 or 10 (a small sketch follows below)
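The ranking metric can be computed as follows: for each test triple, rank all candidate tails by the model's score and check whether the true tail appears in the top k. The function names and the trivial ranking function below are illustrative only.

def hits_at_k(ranked_tails, true_tail, k):
    """1 if the true tail appears in the top-k ranked candidates, else 0."""
    return int(true_tail in ranked_tails[:k])

def evaluate(test_triples, rank_candidates, ks=(5, 10)):
    """rank_candidates(head, relation) -> list of candidate tails, best first."""
    hits = {k: 0 for k in ks}
    for head, relation, true_tail in test_triples:
        ranked = rank_candidates(head, relation)
        for k in ks:
            hits[k] += hits_at_k(ranked, true_tail, k)
    n = len(test_triples)
    return {f"Hits@{k}": hits[k] / n for k in ks}

# Example with a trivial ranking function:
test = [("Donepezil", "treats", "Alzheimer's disease")]
print(evaluate(test, lambda h, r: ["Alzheimer's disease", "Parkinson's disease"]))
# {'Hits@5': 1.0, 'Hits@10': 1.0}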
Prediction & Explanation for: Donepezil treats Alzheimer's disease
The primary goal of Alzheimer's drugs, including Donepezil, is to maintain elevated acetylcholine (ACh) levels, thereby compensating for the loss of functioning cholinergic brain cells.
The explanation subgraph emphasizes the crucial role of Donepezil binding to acetylcholinesterase (AChE) and butyrylcholinesterase (BChE).
Summary
Finding effective drugs remains an outstanding challenge in biomedicine.
Much progress has been made in constructing semantic biomedical knowledge graphs, and new infrastructure is available to publish, discover, and reuse biomedical knowledge in a manner that is FAIR and decentralized.
Embeddings-based prediction approaches show promising performance (notwithstanding data bias), but remain difficult to explain.
Neurosymbolic systems retain the semantics of the knowledge, and show promise in generating predictions that can recover logical explanations.
Michel Dumontier, PhD
Distinguished Professor of Data Science
Founder and Director, Institute of Data Science
Department of Advanced Computing Sciences
Maastricht University