Generating (useful) synthetic data for medical research and AI application

micheldumontier 331 views 27 slides Oct 17, 2024

About This Presentation

This talk addresses the growing importance of synthetic data in overcoming challenges related to data privacy and scarcity in healthcare. The presentation explores how artificially created datasets, which mimic the statistical properties of real patient data without exposing sensitive information, c...


Slide Content

Generating (useful) synthetic data for medical research and AI application. Michel Dumontier, Ph.D., Distinguished Professor of Data Science, Department of Advanced Computing Sciences, Maastricht University; Member, Birmingham Hugh Kaul Precision Medicine Institute, University of Alabama. UAB Symposium on AI in Medicine and Nursing, Oct 18, 2024. 1

Problem Statement: Large amounts of accurate patient data are needed to train and test AI systems that provide clinical decision support for the diagnosis, prognosis, treatment, and prevention of disease. Patient data remain difficult to obtain, may lack specific details regarding a particular condition, and give rise to privacy concerns when used. The lack of high-quality, representative data across groups and conditions may lead to incorrect inference in AI systems, which undermines their utility and trustworthiness. [Figure labels: AI assistant, train, non-representative patient, incorrect inference, training data] 2

Research questions: 1. Can synthetic data be used as a proxy for real data, while addressing privacy concerns? 2. Can we create synthetic data for patient populations that we don’t have the data for, so that AI assistants can make better recommendations? 3

Selected Sponsored Research Projects using Synthetic Data. AIDAVA (Horizon Europe; 4 years; 7M euros; Coordinator/PI): A project to use AI assistants to facilitate the curation of personal health data by patients and data specialists. Synthetic data used to facilitate development and testing. REALM (Horizon Europe; 4 years; 7M euros; Coordinator/PI): A regulatory sandbox for in-vitro diagnostic devices and (AI) Medical Device Software (MDSW). Synthetic data will be used to learn a model of application and real world data, and to challenge MDSW with unseen distributions of patient data. CHARM (ARPA-H; 3 years; $8M; UAB lead; UM subcontractor): A project to collect, harmonize, analyze, reason about, and manipulate biomedical data using open standards and generative AI technologies. Development of digital twins and use of synthetic data to support new scientific investigations. 4

https://arxiv.org/abs/2407.00116 5

Some types of health data: Images (CT, MRI, X-ray, Mammography, Ultrasound, etc.); Signal (ECG, EEG); Text (discharge summary, health and illness history, clinical guidelines); Tabular (Electronic Health Records). 6

DP-CGANS: Differentially Private Conditional Generative Adversarial NetworkS. A GAN is composed of a generator that produces data and a discriminator that judges its quality. A conditional GAN uses an additional input to steer generation towards a trained label. Differential privacy means that an observer cannot tell whether an individual's data were used in the computation. DP-CGANS works with mixed data types and imbalanced variables, finds correlations and dependencies between variables, and offers a privacy guarantee. Sun, Chang, Johan van Soest, and Michel Dumontier. "Generating synthetic personal health data using conditional generative adversarial networks combining with differential privacy." Journal of Biomedical Informatics 143 (2023): 104404. [Figure: Differential Privacy] 7
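To make the differential privacy idea concrete, here is a minimal Python sketch of the classic Laplace mechanism applied to a counting query. It is not the DP-CGANS training procedure (the slide does not say how noise is injected into the GAN); the function names `laplace_noise` and `private_count`, the toy records, and the value of epsilon are assumptions for illustration only.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy records: (age, has_condition)
patients = [(34, True), (51, False), (67, True), (45, True), (72, False)]
rng = random.Random(0)
noisy = private_count(patients, lambda r: r[1], epsilon=0.5, rng=rng)
print(f"true count = 3, released count = {noisy:.2f}")
```

A smaller epsilon gives a larger noise scale, so the released count reveals less about any single patient but is also less accurate.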

Utility > Privacy: the data are realistic and maintain real data associations; they could be used to learn an ML model to predict missing values. Privacy > Utility: the data tend towards the mean and do not reflect real populations; they could not be used in any effective ML model. Utility <------------ | ------------> Privacy 8

Privacy cost metrics: The privacy cost metrics measure how much information from the real data may be disclosed by the synthetic data and the generative models. Identity disclosure refers to identifying an individual in the training data (e.g. we can find one or more synthetic records within a certain distance of a real data sample). Lower precision and recall indicate a higher level of privacy. Attribute disclosure refers to predicting, at the individual level, the original values of the synthesized variables from other variables of the real data that are known to the attacker. A higher score represents a higher level of privacy. [Figures: Privacy measurement in identity disclosure; Privacy measurement in attribute disclosure] 9
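The identity disclosure idea can be sketched as a distance-based matching check. The slide does not give the exact metric definition, so the one below is an assumption: precision is the share of synthetic records that fall within a threshold of some real record, and recall is the share of real records that have such a synthetic neighbour. The function name `identity_disclosure`, the Euclidean distance, the threshold, and the toy points are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identity_disclosure(real, synthetic, threshold):
    """Return (precision, recall) of distance-based matches.

    precision: share of synthetic records within `threshold`
               of at least one real record.
    recall:    share of real records with at least one synthetic
               record within `threshold`.
    """
    syn_hits = sum(
        1 for s in synthetic
        if any(euclidean(s, r) <= threshold for r in real)
    )
    real_hits = sum(
        1 for r in real
        if any(euclidean(r, s) <= threshold for s in synthetic)
    )
    return syn_hits / len(synthetic), real_hits / len(real)


# Hypothetical two-variable records (already scaled to comparable units)
real = [(1.0, 2.0), (4.0, 4.0), (9.0, 1.0), (6.0, 7.0)]
synthetic = [(1.1, 2.1), (5.0, 5.5), (8.0, 3.0)]
precision, recall = identity_disclosure(real, synthetic, threshold=0.5)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Under this definition, lower precision and recall mean fewer synthetic records sit close to real individuals, which matches the slide's reading that low values indicate more privacy.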

The privacy vs utility tradeoff: A generative model is good if it can produce synthetic data that are as similar as possible to the real data. However, a generator whose output lies closer to the real data also increases the risk of revealing sensitive information. There is therefore an inevitable trade-off between data privacy and data utility when generating synthetic data, although its severity depends on the dataset. 10

[Diagram labels: Data, Generator, Synthetic data, DP-CGANS Model, Trained Model] 11

Research questions: 1. Can synthetic data be used as a proxy for real data, while addressing privacy concerns? 2. Can we create synthetic data for patient populations that we don’t have the data for, so that AI assistants can make better recommendations? 12

OntoCGAN is a conditional GAN that uses external knowledge in the form of ontology embeddings to guide the generation of synthetic (patient) data. (1) Train a cGAN with (most similar) patient data that excludes the target class (disease). (2) Check the quality of generated data against real data (and other generators). (3) Compare the performance of classifiers trained using real or synthetic data. [Figure: OntoCGAN framework] Chang Sun & Michel Dumontier. (2024). Generating Patient’s Electronic Health Records with Unseen Diseases Using Ontology-enhanced Generative Adversarial Networks. 10.21203/rs.3.rs-5043150/v1 13
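Step (1) and the conditioning idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the OntoCGAN implementation: the slide does not say how embeddings are fed to the generator, so concatenating noise with the class embedding (a common cGAN pattern) is assumed here, and the records, the 3-dimensional embeddings, and the helper names `split_unseen` and `generator_input` are hypothetical.

```python
import random

# Hypothetical toy records: (disease_label, feature_vector)
records = [
    ("MDS", [0.2, 0.7]), ("AML", [0.9, 0.1]),
    ("ALL", [0.4, 0.5]), ("AML", [0.8, 0.2]), ("MDS", [0.3, 0.6]),
]

# Hypothetical stand-ins for ontology-derived disease embeddings
embeddings = {
    "MDS": [0.1, 0.9, 0.3],
    "ALL": [0.5, 0.4, 0.8],
    "AML": [0.2, 0.8, 0.4],
}


def split_unseen(rows, target):
    """Step (1): keep the target disease out of the training set."""
    train = [r for r in rows if r[0] != target]
    held_out = [r for r in rows if r[0] == target]
    return train, held_out


def generator_input(label, noise_dim, rng):
    """Build a conditioned generator input: random noise + class embedding."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + embeddings[label]


train, held_out = split_unseen(records, target="AML")
rng = random.Random(42)
z_seen = generator_input("MDS", noise_dim=4, rng=rng)    # seen during training
z_unseen = generator_input("AML", noise_dim=4, rng=rng)  # only at generation time
print(len(train), len(held_out), len(z_unseen))
```

Because the embedding for the unseen disease lives in the same space as those of the training diseases, the generator can be queried with it even though no AML records were used in training.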

Use case: Acute Myeloid Leukemia (AML). Acute myeloid leukemia (AML) is a fast-growing, aggressive cancer that affects the bone marrow and blood. It is characterized by the rapid production of abnormal white blood cells, called myeloblasts, which can interfere with the production of normal blood cells. AML can spread to other parts of the body, including the lymph nodes, spleen, liver, and central nervous system. Several AML research studies have emphasized the importance of larger and more comprehensive patient datasets to better understand disease progression, evaluate the effects of various treatments, and enhance prediction accuracy. 14

We train a model with diseases similar to AML. AML clinical practice guidelines identify these conditions in differential diagnosis: Acute Lymphoblastic Leukemia, Anemia, Aplastic anemia, B-cell lymphoma, Bone marrow failure, Chronic myelogenous leukemia, Lymphoblastic lymphoma, Myelodysplastic syndrome, Myelophthisic anemia, Primary myelofibrosis. 15

Ontology Embeddings We generate disease embeddings with OWL2Vec* using text descriptions and logical inference of Orphanet Rare Disease Ontology (ORDO), Human Phenotype Ontology (HPO), and HOOM. 16

A total of 153 variables was reduced to 35 variables with <30% missing values. Included 6 diseases with similarity >= 0.7. 17
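The two selection rules on this slide can be sketched as simple filters. The slide does not state which similarity measure was applied to the disease embeddings, so cosine similarity is an assumption here, as are the toy table, the two-dimensional embeddings, and the names `missing_fraction` and `cosine_similarity`.

```python
import math


def missing_fraction(column):
    """Share of entries in a column that are missing (None)."""
    return sum(1 for v in column if v is None) / len(column)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical lab table: variable name -> values (None = missing)
table = {
    "hematocrit": [41.0, 38.5, None, 44.2, 39.9],
    "platelets": [210.0, None, None, 180.0, None],
    "wbc": [6.1, 7.4, 5.9, 8.8, 6.5],
}
# Keep variables with fewer than 30% missing values
kept_variables = [
    name for name, col in table.items() if missing_fraction(col) < 0.30
]

# Hypothetical disease embeddings
disease_vectors = {
    "AML": [1.0, 0.0],
    "disease_A": [0.9, 0.3],
    "disease_B": [0.1, 1.0],
}
target = disease_vectors["AML"]
# Keep diseases whose embedding similarity to the target is at least 0.7
similar_diseases = [
    name for name, vec in disease_vectors.items()
    if name != "AML" and cosine_similarity(target, vec) >= 0.7
]
print(kept_variables, similar_diseases)
```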

The hematocrit test (Hct) measures the percentage of red blood cells in the blood. 18

Complete Blood Count (CBC) 19

20

OntoCGAN outperforms CTGAN in capturing correlations between key AML variables 21
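One way to quantify how well a generator captures correlations is to compare pairwise correlations in the real and synthetic tables. The sketch below is an assumption about how such a comparison could be done, not the evaluation used in the talk: it uses Pearson correlation and the mean absolute gap, and the variable names, values, and the function `correlation_gap` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_gap(real, synthetic):
    """Mean absolute difference between pairwise Pearson correlations.

    Both arguments map variable names to columns and must share keys.
    A smaller gap means the synthetic table preserves correlations better.
    """
    names = sorted(real)
    gaps = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r_real = pearson(real[a], real[b])
            r_syn = pearson(synthetic[a], synthetic[b])
            gaps.append(abs(r_real - r_syn))
    return sum(gaps) / len(gaps)


# Hypothetical blood-count columns
real = {
    "hemoglobin": [8, 10, 12, 14],
    "hematocrit": [24, 30, 36, 42],
    "wbc": [12, 9, 7, 5],
}
# Synthetic table that keeps the joint structure
synthetic_a = {
    "hemoglobin": [9, 11, 13, 15],
    "hematocrit": [27, 33, 40, 44],
    "wbc": [11, 9, 8, 4],
}
# Synthetic table with the same marginals but shuffled columns
synthetic_b = {
    "hemoglobin": [9, 15, 11, 13],
    "hematocrit": [44, 27, 40, 33],
    "wbc": [8, 4, 11, 9],
}
gap_a = correlation_gap(real, synthetic_a)
gap_b = correlation_gap(real, synthetic_b)
print(f"gap A = {gap_a:.3f}, gap B = {gap_b:.3f}")
```

The second synthetic table has identical per-variable values to the first, which shows why marginal distributions alone are not enough to judge a generator.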

TSTR: Train on Synthetic data and Test on Real. TRTR: Train on Real data and Test on Real. [Figure panels: TRTR, TSTR, TSTR] 22
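The TSTR and TRTR protocols differ only in which training set is used; the test set is real data in both. The sketch below illustrates this with a deliberately simple nearest-centroid classifier. The classifier, the two-feature toy records, and the labels are assumptions for illustration, not the models or data used in the talk.

```python
def fit_centroids(rows):
    """Fit a nearest-centroid classifier. rows: list of (features, label)."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [v / counts[label] for v in acc]
        for label, acc in sums.items()
    }


def predict(centroids, features):
    """Return the label of the closest centroid (squared Euclidean)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, rows):
    """Share of rows whose predicted label equals the true label."""
    return sum(predict(centroids, f) == y for f, y in rows) / len(rows)


# Hypothetical records: ([feature_1, feature_2], diagnosis)
real_train = [([1.0, 1.2], "AML"), ([0.8, 1.0], "AML"),
              ([3.0, 3.1], "MDS"), ([3.2, 2.9], "MDS")]
synthetic_train = [([1.3, 0.9], "AML"), ([0.9, 1.4], "AML"),
                   ([2.7, 3.3], "MDS"), ([3.4, 2.6], "MDS")]
real_test = [([1.1, 0.9], "AML"), ([2.9, 3.0], "MDS"),
             ([0.7, 1.3], "AML"), ([3.1, 3.3], "MDS")]

trtr = accuracy(fit_centroids(real_train), real_test)       # Train Real, Test Real
tstr = accuracy(fit_centroids(synthetic_train), real_test)  # Train Synthetic, Test Real
print(f"TRTR accuracy = {trtr:.2f}, TSTR accuracy = {tstr:.2f}")
```

If the TSTR score is close to the TRTR score, the synthetic data carry enough signal to stand in for the real training data on that task.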

AML misclassified as Myelodysplastic syndrome (MDS). Blasts (immature myeloid cells) and bone marrow morphology are typically used to differentiate AML and MDS. In MIMIC, AML patients had blast counts, while this variable was missing for the other diseases, and there was no bone marrow data. Because the blast variable thereby became a diagnostic variable, it was excluded. MDS can progress to AML, so patients may have both MDS and AML diagnoses at the same time. 23

Summary: Conditional GANs were trained to generate patient data with statistical characteristics comparable to real data. DP-CGANS conditions on noise to generate patient data with a privacy guarantee. Onto-CGAN conditions on external knowledge (in the form of ontology embeddings) to generate unseen patient data. Developing meaningful evaluation metrics is key to understanding the performance of these systems. Synthetic data may be useful (a) as a proxy for real data to support early/preliminary investigations, where a privacy budget affects the privacy-utility tradeoff, and (b) in data augmentation to improve the performance of AI-based clinical decision support tools, where real world data may be lacking. 24

Future Directions: Extend synthetic data generation to other rare and complex disorders and explore its use in downstream health AI applications (diagnosis, prognosis, treatment, prevention). Multi-modal synthetic data generation for full EHR generation (text, image, omics, time series, etc.). Establishing the clinical effectiveness of synthetic data (validity), particularly for clinical decision support and in the testing of medical device software. 25

Acknowledgements. UAB (CHARM, Translator): Matt Might, Andy Crouse, Will Byrd. Health AI @ DACS: Chang Sun, Mahmoud Ibrahim, Pietro Bonizzi. VITO: Bart Elen, Gökhan Ertaylan. UAB Symposium: Carlos Cardenas, Heather Milam, Thomas Cleij, Bas Lemmens. Funding for this work supported in part by: 26

Classifying unseen diseases using synthetic patient data. Michel Dumontier. Health artificial intelligence to advance the science and practice of medicine. 27