FAIR & AI Ready KGs for Explainable Predictions.pdf

micheldumontier 132 views 50 slides Sep 30, 2024


About This Presentation

Towards Biomedical Neurosymbolic AI: From Knowledge Infrastructure to Explainable Predictions

The increased availability of biomedical data, particularly in the public domain, offers the opportunity to better understand human health and to develop effective therapeutics for a wide range of unmet medical needs. However, data scientists remain stymied because data are hard to find and to productively reuse: data and their metadata i) are wholly inaccessible, ii) are in non-standard or incompatible representations, iii) do not conform to community standards, and iv) have unclear or highly restricted terms and conditions that preclude legitimate reuse. These limitations require a rethink of how data can be made machine- and AI-ready, which is the key motivation behind the FAIR Guiding Principles. Concurrently, while recent efforts have explored the use of deep learning to fuse disparate data into predictive models for a wide range of biomedical applications, these models often fail even when the correct answer is already known, and fail to explain individual predictions in terms that data scientists can appreciate. These limitations suggest that new methods to produce practical artificial intelligence are still needed.

In this talk, I will discuss our work in (1) building an integrative knowledge infrastructure to prepare FAIR and "AI-ready" data and services along with (2) neurosymbolic AI methods to improve the quality of predictions and to generate plausible explanations. Attention is given to standards, platforms, and methods to wrangle knowledge into simple, but effective semantic and latent representations, and to make these available through standards-compliant and discoverable interfaces that can be used in model building, validation, and explanation. Our work, and that of others in the field, creates a baseline for building trustworthy and easy-to-deploy AI models in biomedicine.

Bio

Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University, founder and executive director of the Institute of Data Science, and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research explores socio-technological approaches for responsible discovery science, which includes collaborative multi-modal knowledge graphs, privacy-preserving distributed data mining, and AI methods for drug discovery and personalized medicine. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon Europe, the European Open Science Cloud, the US National Institutes of Health, and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.

Size: 7.5 MB

Language: en

Added: Sep 30, 2024

Slides: 50 pages

Slide Content

FAIR & AI Ready Knowledge Graphs
for Explainable Predictions
Michel Dumontier, PhD

Distinguished Professor of Data Science
Founder and Director, Institute of Data Science
Maastricht University

SeWeBMeDA :: 26-05-2024

@micheldumontier::KAUST-Hackathon:2023-02-07
Vast amounts of (largely open) data are now available

A common rejection module (CRM) for acute rejection across multiple organs identifies novel therapeutics for organ transplantation
Khatri et al. JEM. (2013) 210 (11): 2205
DOI: 10.1084/jem.20122709
Main Findings:
1. CRM of 11 overexpressed genes predicted future injury to a graft
2. Treating mice with existing drugs against specific CRM genes extended graft survival
3. Retrospective EHR data analysis supports treatment prediction

Key Observations:
1. Meta-analysis offers a more reliable estimate of the direction and magnitude of the effect
2. Existing data can be used to generate and validate new hypotheses

Significant effort is needed to find the right data, make sense of them, and use them for a new purpose

Data scientists could be more productive

Surprisingly low reproducibility of landmark studies

39% (39/100) in psychology [1]
21% (14/67) in pharmacology [2]
11% (6/53) in cancer [3]
unsatisfactory in machine learning [4]

[1] doi:10.1038/nature.2015.17433
[2] doi:10.1038/nrd3439-c1
[3] doi:10.1038/483531a
[4] https://openreview.net/pdf?id=By4l2PbQ-

Most published research findings are false.
- John Ioannidis, Stanford University
PLoS Med 2005;2(8): e124.

Probability to launch
Nature Reviews Drug Discovery 18, 495-496 (2019)
https://doi.org/10.1038/d41573-019-00074-z

Do we have access to the data we need to
build good AI models?

To what extent will AI models improve the
success of developing effective treatments?

Can they provide adequate explanations?

Translational Failure

Data are challenging to reuse:
● difficult to obtain
● poorly described
● in different formats
● hard to integrate with other data

Models have significant limitations:
● built from limited, biased, or erroneous data
● aren't robust to different inputs
● difficult to explain decisions, and explanations aren't satisfying

Translational Success

High quality, linked, machine accessible, machine interpretable (meta)data from multiple sources and data types

Trustworthy, data-driven and knowledge-aligned, explainable models and predictions

Human-machine collaboration is crucial to our future work

Machines need to be able to discover and reuse data (and arguably any digital resource)

Research Directions

i) organize knowledge to answer questions about what we know and what we don’t know (but should)
ii) build models to predict, explain, and justify
iii) human-AI collaboration to create, maintain, correct, and complete knowledge

Knowledge Infrastructure
Explainable Predictions
Neurosymbolic AI

http://www.nature.com/articles/sdata201618

Making FAIR Data
1. Collect: data
2. Describe: use ontologies + vocabularies; use standard metadata format (yields standardized metadata)
3. Transform: use ontologies + vocabularies; use standard data format (yields standardized data)
4. Publish: add provenance and a license for data + metadata; deposit standardized data and standardized metadata in a data repository, each with a persistent identifier (persistent data identifier, persistent metadata identifier)
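As a rough illustration of the "Describe" and "Publish" steps, the sketch below builds a machine-readable metadata record that carries an identifier, a license, provenance, and a standard data format. It assumes schema.org's Dataset vocabulary expressed as JSON-LD, which the slides do not prescribe; the function name `describe_dataset` and all example.org and w3id.org URLs are hypothetical placeholders.

```python
import json

def describe_dataset(name, description, identifier, license_url,
                     download_url, media_type, derived_from):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "@id": identifier,
        "identifier": identifier,        # persistent identifier (placeholder)
        "name": name,
        "description": description,
        "license": license_url,          # explicit terms of reuse
        "isBasedOn": derived_from,       # simple provenance link
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": download_url,
            "encodingFormat": media_type,  # standard data format
        },
    }

# All values below are placeholders, not real resources.
record = describe_dataset(
    name="Example drug-gene associations",
    description="Toy dataset used to illustrate a metadata record.",
    identifier="https://w3id.org/example/dataset/1",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.org/data/associations.nt",
    media_type="application/n-triples",
    derived_from="https://example.org/source/raw-export",
)

print(json.dumps(record, indent=2))
```

Keeping the license and provenance in the same record as the identifier means a machine can decide whether reuse is permitted without consulting a human-readable landing page.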

Communities are publishing recipes to make FAIR data

How do we know it’s FAIR?
• FAIR Enough is a system to perform automated assessment of the technical quality of the FAIRness implementation.
• Uses a collection of metrics, implemented as web services.
• Fast owing to parallel execution
• Keeps track of past assessments to monitor status
• Offers search and query services
• Anybody can extend it via a service-based framework
• Open source and Docker deployable
https://fair-enough.semanticscience.org
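The idea of running a collection of independent metrics in parallel can be sketched as follows. This is not the FAIR Enough API: the three metric functions, the metadata keys, and the accepted identifier prefixes are all assumptions made for illustration, and local functions stand in for what the slide describes as web services.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical metrics: each takes a metadata dict and returns True or False.
def has_persistent_identifier(metadata):
    identifier = metadata.get("identifier", "")
    return identifier.startswith(("https://doi.org/", "https://w3id.org/"))

def has_license(metadata):
    return bool(metadata.get("license"))

def has_standard_format(metadata):
    standard = {"text/turtle", "application/n-triples", "application/ld+json"}
    return metadata.get("encodingFormat") in standard

METRICS = {
    "persistent_identifier": has_persistent_identifier,
    "license": has_license,
    "standard_format": has_standard_format,
}

def assess(metadata, metrics=METRICS):
    """Run every metric concurrently and collect a pass/fail report."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(check, metadata)
                   for name, check in metrics.items()}
        return {name: future.result() for name, future in futures.items()}

# Placeholder metadata with no license, so one metric should fail.
report = assess({
    "identifier": "https://w3id.org/example/dataset/1",
    "encodingFormat": "text/turtle",
})
score = sum(report.values()) / len(report)
print(report, score)
```

Because each metric is an independent callable, adding a new one only requires registering it in `METRICS`, which mirrors the extensibility the slide attributes to a service-based framework.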

https://lod-cloud.net/

Linked Data for the Life Sciences
Bio2RDF is an open source project that uses semantic web technologies to make it easier to reuse biomedical data.

It provides Linked Data and a queryable RDF knowledge graph.

• Since 2007, last updated 2014
• 30+ biomedical data sources
• 10B+ interlinked statements
• NCBI, EBI, SIB, DBCLS, NCBO, and many others (chem2bio2rdf) produce this content

Coverage:
chemicals/drugs/formulations
genomes/genes/proteins, domains
interactions, complexes & pathways
animal models and phenotypes
disease, genetic markers, treatments
terminologies & publications

Belleau et al. JBI 2008. 41(5):706-716.
Callahan et al. ESWC 2013. 200-212

The triple as a base unit of knowledge representation
“diclofenac is a drug”
subject: diclofenac
predicate: is a
object: drug
Formalization
“diclofenac is a drug”
drugbank:DB00586 rdf:type drugbank:Drug
drugbank:DB00586 rdfs:label "diclofenac"
RDF N-Triples format (standardized, machine interpretable):
<https://bio2rdf.org/drugbank:DB00586> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://bio2rdf.org/drugbank_vocabulary:Drug> .
1. Use RDF
2. Assign/reuse identifiers
3. Use or develop vocabularies
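To make the formalization concrete, here is a minimal sketch that serializes the two statements about diclofenac as N-Triples lines using only the Python standard library. The helper names `iri`, `literal`, and `ntriple` are hypothetical; the IRIs are the ones shown on the slide plus the standard rdfs:label IRI. A production pipeline would more likely rely on an RDF library, and the literal escaping here covers only the most common cases.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DICLOFENAC = "https://bio2rdf.org/drugbank:DB00586"
DRUG = "https://bio2rdf.org/drugbank_vocabulary:Drug"

def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"

def literal(value):
    """Quote a plain string literal, escaping common special characters."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n")
                    .replace("\r", "\\r"))
    return f'"{escaped}"'

def ntriple(subject, predicate, obj):
    """Format one statement as a single N-Triples line."""
    return f"{subject} {predicate} {obj} ."

lines = [
    ntriple(iri(DICLOFENAC), iri(RDF_TYPE), iri(DRUG)),
    ntriple(iri(DICLOFENAC), iri(RDFS_LABEL), literal("diclofenac")),
]
print("\n".join(lines))
```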

Biomedical Linked Data
has detailed provenance
linked to other resources
semantically typed
http(s) identifier
rich descriptions

An Interoperable Biomedical Knowledge Graph
diclofenac -> db:category -> NSAID
drug-gene association -> pharmgkb:drug -> diclofenac
drug-gene association -> pharmgkb:haplotype -> CYP2C9*1
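One way to read the figure labels above is as a small set of triples in which a drug-gene association node links diclofenac to the CYP2C9*1 haplotype. Under that assumption, the sketch below stores the triples in a list and answers a two-hop question by pattern matching. The `ex:` identifiers and the functions `match` and `haplotypes_for_drug` are hypothetical; only the predicate names come from the slide.

```python
# Toy triples; "ex:" identifiers are placeholders for real IRIs.
TRIPLES = [
    ("ex:diclofenac", "db:category", "ex:NSAID"),
    ("ex:association1", "rdf:type", "ex:DrugGeneAssociation"),
    ("ex:association1", "pharmgkb:drug", "ex:diclofenac"),
    ("ex:association1", "pharmgkb:haplotype", "ex:CYP2C9*1"),
]

def match(triples, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def haplotypes_for_drug(triples, drug):
    """Follow drug <- association -> haplotype across two triple patterns."""
    found = []
    for association, _, _ in match(triples, p="pharmgkb:drug", o=drug):
        for _, _, haplotype in match(triples, s=association,
                                     p="pharmgkb:haplotype"):
            found.append(haplotype)
    return found

print(haplotypes_for_drug(TRIPLES, "ex:diclofenac"))
```

The same join could be written as a SPARQL query against an RDF store; the list-based version only shows why shared identifiers make data from different sources joinable.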

Reproducible ML: new uses for existing drugs
https://doi.org/10.7717/peerj-cs.281
Exploration: drug-target-disease networks
https://doi.org/10.7717/peerj-cs.106
Custom Knowledge Portal: EbolaKB
https://doi.org/10.1093/database/bav049
Information Retrieval: Phenotypes of knock-out mouse models for the targets of a selected drug

Integrative KGs and Semantic KGs
het.io, DRKG, ROBOKOP, PheKnowLator

Knowledge Collaboratory (for small data)
An AI-powered (NLP model or GPT) web user interface to annotate biomedical text (NER, RE) and create standards-compliant statements (BioLink model) that can be made publicly available as author-signed nanopublications.
collaboratory.semanticscience.org/annotate

Technology to publish assertions using RDF
Contains RDF triples to specify the assertion, its provenance, and digital object metadata
Digitally signed by agent
TrustyURI hash to provide globally unique, persistent, immutable, verifiable identifier and payload
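The idea behind a hash-based identifier is that the identifier is derived from the content, so anyone holding the content can verify it. The sketch below illustrates that idea only: it is not the Trusty URI algorithm, which defines its own RDF normalization and encoding, and it does not cover digital signatures. The function names, the sort-based canonical form, and the example.org base URI are assumptions.

```python
import base64
import hashlib

def content_identifier(base_uri, triples):
    """Derive an identifier from a set of N-Triples lines.

    Sorting the lines is a simplistic stand-in for proper RDF
    normalization; it makes the hash independent of statement order.
    """
    canonical = "\n".join(sorted(triples)) + "\n"
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    code = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return base_uri + code

def verify(identifier, base_uri, triples):
    """Check that the content still matches the identifier."""
    return identifier == content_identifier(base_uri, triples)

BASE = "https://example.org/np/"  # placeholder namespace
assertion = [
    '<https://bio2rdf.org/drugbank:DB00586> <http://www.w3.org/2000/01/rdf-schema#label> "diclofenac" .',
    "<https://bio2rdf.org/drugbank:DB00586> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://bio2rdf.org/drugbank_vocabulary:Drug> .",
]

identifier = content_identifier(BASE, assertion)
print(identifier)
print(verify(identifier, BASE, assertion))      # unchanged content
print(verify(identifier, BASE, assertion[:1]))  # tampered content
```

Any change to the statements yields a different hash, which is what makes such an identifier immutable and verifiable in the sense the slide describes.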

Robust, Reproducible, Explainable Predictions

Neurosymbolic Artificial Intelligence (NAI)
NAI aims to combine symbolic reasoning methods (logic-based reasoning & rules) with sub-symbolic methods (neural networks, deep learning) to create models with high predictive performance and explainability.
Specifically:
● Integrates knowledge from different modalities
● Able to perform complex reasoning (e.g. deduction, induction, synthesis)
● Able to learn from examples/small data and big data
● Offers explanations (e.g. causal account of the phenomenon) and justifications (the evidence that supports the claims)
● Robust to noise and nonsense
● Handles cases out of the learning distribution

Predict new drug applications in a documented and reproducible manner
AUC 0.90 across all therapeutic indications
Scripts not available. Feature tables available.
Not reproducible!
Result: ROC AUC 0.83
Celebi R, Rebelo Moreira J, Hassan AA, Ayyar S, Ridder L, Kuhn T, Dumontier M. 2020. Towards FAIR protocols and workflows: the OpenPREDICT use case. PeerJ Computer Science 6:e281 https://doi.org/10.7717/peerj-cs.281
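For readers unfamiliar with the metric quoted above, ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pairwise comparison on invented toy scores; it is not the OpenPREDICT evaluation code, and the quadratic loop is only suitable for small inputs.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy labels and scores, invented for illustration.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_auc(labels, scores))
```

In this toy case eight of the nine positive-negative pairs are ordered correctly, so the value is 8/9.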

Explainable AI
• XAI methods such as SHAP provide information about feature importance for the model and in individual predictions
• When applied to OpenPredict, it is too complicated to understand the contributions of derived features
• However, it is clearer when using a single feature predictor
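SHAP is built on Shapley values, which attribute a prediction to features by averaging each feature's marginal contribution over all subsets of the other features. The sketch below computes exact Shapley values for a tiny hand-written model, replacing "absent" features with a baseline value. The model, instance, and baseline are invented; the SHAP library uses approximations that scale to real models, and nothing here reproduces the OpenPredict analysis.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    n = len(instance)

    def value(present):
        x = [instance[i] if i in present else baseline[i] for i in range(n)]
        return predict(x)

    attributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                total += weight * (value(present | {i}) - value(present))
        attributions.append(total)
    return attributions

def toy_model(x):
    """Invented model with one interaction term between features 0 and 2."""
    return 2.0 * x[0] + 1.0 * x[1] + 1.0 * x[0] * x[2]

phi = shapley_values(toy_model, instance=[1.0, 1.0, 1.0],
                     baseline=[0.0, 0.0, 0.0])
print(phi)
```

The interaction term is split evenly between features 0 and 2, which hints at why attributions over derived features are harder to interpret than attributions for a single-feature predictor.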

XPREDICT
Identify a set of ranked paths through the KG that offer plausible explanations for
a predicted drug indication based on their similarities to known drug-disease pairs.
1. Path generation
2. Path ranking
3. Explanation generation
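The three steps can be sketched on a toy graph as follows. This is not the XPREDICT implementation: the graph, the node and relation names, and the ranking rule are assumptions. Here a candidate path scores higher when its sequence of relation types also connects known drug-disease pairs, which is one simple way to express "similarity to known pairs".

```python
from collections import Counter

# Toy directed knowledge graph; all names are invented.
EDGES = [
    ("drugA", "binds", "gene1"),
    ("drugB", "binds", "gene2"),
    ("drugC", "binds", "gene1"),
    ("drugC", "resembles", "drugA"),
    ("gene1", "associated_with", "disease1"),
    ("gene2", "associated_with", "disease2"),
]
KNOWN_PAIRS = [("drugA", "disease1"), ("drugB", "disease2")]

def build_index(edges):
    index = {}
    for head, relation, tail in edges:
        index.setdefault(head, []).append((relation, tail))
    return index

def generate_paths(index, source, target, max_hops=3):
    """Step 1: enumerate simple paths from source to target."""
    paths = []

    def walk(node, visited, path):
        if node == target and path:
            paths.append(tuple(path))
            return
        if len(path) == max_hops:
            return
        for relation, neighbor in index.get(node, []):
            if neighbor not in visited:
                walk(neighbor, visited | {neighbor},
                     path + [(relation, neighbor)])

    walk(source, {source}, [])
    return paths

def relation_pattern(path):
    return tuple(relation for relation, _ in path)

def rank_paths(index, known_pairs, source, target, max_hops=3):
    """Step 2: rank candidates by how many known pairs share their pattern."""
    support = Counter()
    for drug, disease in known_pairs:
        support.update({relation_pattern(p)
                        for p in generate_paths(index, drug, disease, max_hops)})
    candidates = generate_paths(index, source, target, max_hops)
    return sorted(candidates,
                  key=lambda p: (-support[relation_pattern(p)], len(p)))

def explain(source, path):
    """Step 3: render a path as a readable explanation string."""
    parts = [source]
    for relation, node in path:
        parts.append(f"-[{relation}]-> {node}")
    return " ".join(parts)

index = build_index(EDGES)
ranked = rank_paths(index, KNOWN_PAIRS, "drugC", "disease1")
for path in ranked:
    print(explain("drugC", path))
```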

Graph Representation Learning
We want to automatically discover effective
representations needed for classification from raw data.

In graph representation learning, we encode the
topology, node attributes, and edge information into
low-dimensional vectors (or embeddings).

These vectors can then be used as features to train classifiers for link prediction, node classification, graph classification, etc.
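To show how embeddings become classifier features, the sketch below combines the vectors of a candidate (drug, disease) pair with an element-wise product and trains a small logistic regression by gradient descent. The two-dimensional embeddings are hand-set stand-ins for learned vectors, the training pairs are invented, and the pair-feature construction is one common choice rather than anything the slides specify.

```python
import math

# Hand-set vectors standing in for learned node embeddings (assumption).
EMBEDDINGS = {
    "drugA": [1.0, 0.0],
    "drugB": [0.0, 1.0],
    "disease1": [1.0, 0.0],
    "disease2": [0.0, 1.0],
}

# Toy labeled pairs: 1 = known link, 0 = no link.
TRAIN = [
    ("drugA", "disease1", 1),
    ("drugB", "disease2", 1),
    ("drugA", "disease2", 0),
    ("drugB", "disease1", 0),
]

def pair_features(head, tail):
    """Element-wise (Hadamard) product of the two node embeddings."""
    return [h * t for h, t in zip(EMBEDDINGS[head], EMBEDDINGS[tail])]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(examples, epochs=200, lr=0.5):
    """Batch gradient descent on the logistic loss."""
    dim = len(pair_features(examples[0][0], examples[0][1]))
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for head, tail, label in examples:
            x = pair_features(head, tail)
            z = sum(w * v for w, v in zip(weights, x)) + bias
            error = sigmoid(z) - label
            grad_w = [g + error * v for g, v in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / len(examples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(examples)
    return weights, bias

def link_probability(model, head, tail):
    weights, bias = model
    x = pair_features(head, tail)
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

model = train_logistic(TRAIN)
print(link_probability(model, "drugA", "disease1"))
print(link_probability(model, "drugA", "disease2"))
```

The example scores its own training pairs, so it demonstrates the mechanics only and says nothing about generalization.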

Graph Neural Networks (GNNs)
Graph Neural Networks iteratively update node representations by aggregating features from neighboring nodes and possibly edges.

Several methods (e.g. Saliency Maps) exist to extract a model-wise explanation for link prediction, node/graph classification.
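The neighborhood aggregation described above can be illustrated with a stripped-down layer that mixes each node's vector with the mean of its neighbors' vectors. This is a simplification: real GNN layers such as GCN, GraphSAGE, or GAT apply learned weight matrices (and, for GAT, attention), whereas the fixed `self_weight` and the toy graph here are assumptions.

```python
def mean_aggregate_layer(features, neighbors, self_weight=0.5):
    """One round of message passing with mean aggregation and ReLU."""
    updated = {}
    for node, vector in features.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            mean = [sum(features[n][d] for n in nbrs) / len(nbrs)
                    for d in range(len(vector))]
        else:
            mean = list(vector)  # isolated node keeps its own features
        combined = [self_weight * v + (1.0 - self_weight) * m
                    for v, m in zip(vector, mean)]
        updated[node] = [max(0.0, c) for c in combined]  # ReLU
    return updated

# Toy chain graph: drug - gene - disease (undirected adjacency lists).
NEIGHBORS = {
    "drug": ["gene"],
    "gene": ["drug", "disease"],
    "disease": ["gene"],
}
h0 = {"drug": [1.0, 0.0], "gene": [0.0, 1.0], "disease": [0.0, 0.0]}

h1 = mean_aggregate_layer(h0, NEIGHBORS)
h2 = mean_aggregate_layer(h1, NEIGHBORS)
print(h1)
print(h2)
```

After one layer the disease node only reflects the gene; after two layers it also carries a trace of the drug's features, which is how stacking layers widens a node's receptive field.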

But I don’t find these explanations salient at all.

Embeddings don’t lend themselves to meaningful explanations because each dimension does not necessarily represent or correspond with something we recognize.

Methodology
1. Use Graph Neural Networks to capture semantics, graph structure and relationships between nodes
2. Apply Saliency Maps on predictions made by GNNs to identify relevant nodes for a specific prediction; this provides valuable insights into the graph's topology and highlights the most important components
3. Saliency Maps assign a score to each node in the network, which can be used to rank paths involving genes, pathways, diseases, and compounds
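A gradient-based saliency map scores each input by how strongly the prediction changes when that input is perturbed. The sketch below applies this to node features, using finite differences in place of automatic differentiation, and then ranks paths by mean node saliency. The scoring function is an invented stand-in for a trained GNN, and the node names, weights, and path-ranking rule are assumptions rather than the method used in the talk.

```python
import math

def link_score(features):
    """Invented differentiable stand-in for a trained GNN link predictor."""
    c = features["compound"][0]
    ga = features["gene_a"][0]
    gb = features["gene_b"][0]
    d = features["disease"][0]
    z = 1.5 * c * ga + 0.8 * ga * d + 0.2 * c * gb + 0.1 * gb * d
    return 1.0 / (1.0 + math.exp(-z))

def node_saliency(score_fn, features, eps=1e-5):
    """Sum of absolute partial derivatives per node (central differences)."""
    saliency = {}
    for node, vector in features.items():
        total = 0.0
        for dim in range(len(vector)):
            plus = {k: list(v) for k, v in features.items()}
            minus = {k: list(v) for k, v in features.items()}
            plus[node][dim] += eps
            minus[node][dim] -= eps
            total += abs(score_fn(plus) - score_fn(minus)) / (2.0 * eps)
        saliency[node] = total
    return saliency

def rank_paths(paths, saliency):
    """Order paths by the mean saliency of the nodes they contain."""
    return sorted(paths,
                  key=lambda path: sum(saliency[n] for n in path) / len(path),
                  reverse=True)

features = {"compound": [1.0], "gene_a": [1.0],
            "gene_b": [1.0], "disease": [1.0]}
saliency = node_saliency(link_score, features)
paths = [("compound", "gene_b", "disease"),
         ("compound", "gene_a", "disease")]
ranked = rank_paths(paths, saliency)
print(saliency)
print(ranked[0])
```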

Experiments
Task: Link Prediction: Predict (Compound, treats, Disease)
Dataset: Drug Repurposing Knowledge Graph
Models: GraphSAGE, Graph Convolutional Network (GCN), Graph Attention
Network (GAT)
Metrics:
• Precision: ratio of true positives to predicted positives
• Recall: ratio of true positives to actual positives
• Hits@5 & Hits@10: for each test instance we ranked the candidate tails and assessed whether the true tail is ranked within the top 5 or top 10
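The metrics listed above can be computed as in the following sketch. The labels, predictions, and rankings are invented toy values, and the function names are hypothetical; Hits@k is taken here to mean the fraction of test instances whose true tail appears among the top k ranked candidates.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / predicted positives; recall = TP / actual positives."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred)
                         if t == 1 and p == 1)
    predicted_positives = sum(y_pred)
    actual_positives = sum(y_true)
    precision = (true_positives / predicted_positives
                 if predicted_positives else 0.0)
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall

def hits_at_k(ranked_tails, true_tails, k):
    """Fraction of test instances whose true tail is in the top k."""
    hits = sum(1 for ranking, truth in zip(ranked_tails, true_tails)
               if truth in ranking[:k])
    return hits / len(true_tails)

# Toy binary predictions.
precision, recall = precision_recall([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])

# Toy rankings of candidate diseases for two (compound, treats, ?) queries.
rankings = [["d1", "d2", "d3"], ["d3", "d1", "d2"]]
truths = ["d1", "d2"]
print(precision, recall)
print(hits_at_k(rankings, truths, 1), hits_at_k(rankings, truths, 3))
```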

Prediction & Explanation for: Donepezil treats Alzheimer
The primary goal of Alzheimer's drugs, including Donepezil, is to maintain elevated acetylcholine (ACh) levels, thereby compensating for the loss of functioning cholinergic brain cells.

Explanation subgraph emphasizes the crucial role of Donepezil binding to acetylcholinesterase (AChE) and butyrylcholinesterase (BChE).

Summary
Finding effective drugs remains an outstanding challenge in biomedicine.
Much progress has been made in constructing semantic biomedical knowledge graphs, and new infrastructure is available to publish, discover, and reuse biomedical knowledge in a manner that is FAIR and decentralized.
Embeddings-based prediction approaches show promising performance (notwithstanding data bias), but remain difficult to explain.
Neurosymbolic systems retain the semantics of the knowledge, and show promise in generating predictions that can recover logical explanations.

Michel Dumontier, PhD

Distinguished Professor of Data Science
Founder and Director, Institute of Data Science
Department of Advanced Computing Sciences
Maastricht University

[email protected]

FAIR & AI Ready Knowledge Graphs for Explainable Predictions
Knowledge Infrastructure
Explainable Predictions
Neurosymbolic AI
