[DSC DACH 24] Enhancing Medical Research through Synthetic Data Generation - Ena Anicic & Stipe Kabic

DataScienceConferenc1 59 views 34 slides Sep 19, 2024

About This Presentation

Recently, machine learning has been applied to various problems in medicine. There have been many great results, but many unsolved problems and issues remain. One of the largest roadblocks currently is the lack of large datasets as well as the inability to share existing data due to privacy...


Slide Content

Enhancing Medical Research through Synthetic Data Generation Ena Aničić & Stipe Kabić

OUTLINE 01 The value of synthetic data in medical research 02 A short tour of generative models and medical data 03 Case study: Synthetic peptide generation

The timing is perfect! Quality synthetic data can solve lots of issues in medical data. Generative AI has had massive breakthroughs in recent times.

Synthetic data solves key issues. Low data: augmentation. Strong regulation: sharing synthetic data. Diverse data: multimodal approaches. Danger of bias: balanced sampling.

Privacy is still a concern. AI models can memorize specific data points! Sensitive data can be hidden in model parameters! Synthetic data = Generative AI + Privacy. Evaluation = Quality + Privacy.

Differential privacy is a solid metric. The synthetization process is run on the real dataset (giving Synthetic 1) and on the same dataset without a specific patient (giving Synthetic 2).
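The comparison on this slide (the same process run with and without one patient) is the setting in which differential privacy is defined. As a minimal sketch of that idea, and not the method used in the talk, the Laplace mechanism below releases a noisy count. The function name `dp_count`, the lab values, the threshold, and the choice of epsilon are all hypothetical.

```python
import numpy as np

def dp_count(values, threshold, epsilon, rng):
    """Noisy count of values above a threshold (Laplace mechanism).

    Adding or removing one patient changes a count by at most 1,
    so the noise scale is sensitivity / epsilon = 1 / epsilon.
    """
    true_count = int(np.sum(np.asarray(values) > threshold))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

rng = np.random.default_rng(0)
full = np.array([5.1, 7.3, 6.8, 9.0, 4.2])  # hypothetical lab values
without_one = np.delete(full, 3)             # same cohort minus one patient

noisy_full = dp_count(full, 6.0, epsilon=0.5, rng=rng)
noisy_without = dp_count(without_one, 6.0, epsilon=0.5, rng=rng)
print(noisy_full, noisy_without)
```

The true counts differ by exactly one, but the added noise makes the two released values hard to tell apart. A smaller epsilon means more noise and a stronger guarantee.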

Replicate research done on real data. Step 1: Experiments on real data. Step 2: Replicate on synthetic data. Step 3: Compare results.

Train on synthetic, test on real. Model A is trained on real data and Model B on synthetic data; both are scored on the real test data and the scores are compared.
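A minimal sketch of the train-on-synthetic, test-on-real comparison follows. The nearest-centroid classifier, the toy data generator `make_data`, and the assignment of "Model A" to real data and "Model B" to synthetic data are assumptions for illustration. In practice the synthetic set would come from the generator being evaluated.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(model, X):
    classes, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

def accuracy(model, X, y):
    return float(np.mean(predict(model, X) == y))

def make_data(rng, n, shift):
    """Toy two-class data; stands in for both real and synthetic sets."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 3)) + shift * y[:, None]
    return X, y

rng = np.random.default_rng(0)
X_real_train, y_real_train = make_data(rng, 200, shift=2.0)
X_real_test, y_real_test = make_data(rng, 200, shift=2.0)
X_synth, y_synth = make_data(rng, 200, shift=2.0)  # placeholder for generator output

model_a = fit_centroids(X_real_train, y_real_train)  # trained on real data
model_b = fit_centroids(X_synth, y_synth)            # trained on synthetic data

# Both models are scored on the same held-out real test set.
score_a = accuracy(model_a, X_real_test, y_real_test)
score_b = accuracy(model_b, X_real_test, y_real_test)
print(f"real->real: {score_a:.3f}, synthetic->real: {score_b:.3f}")
```

If the synthetic data is useful, the two scores should be close. A large gap suggests the generator lost signal that the downstream task needs.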

Statistical properties should be kept

Should not be able to differentiate

OUTLINE 02 A short tour of generative models and medical data 03 Case study: Synthetic peptide generation 01 The value of synthetic data in medical research

Classical methods are limited

Medical data is complex and diverse

DL approaches have lots in common. GAN: Z → Generator → X', with a Discriminator comparing X' against X. Variational autoencoder: X → Encoder → Z → Decoder → X'. Diffusion: X → Add noise → Z → Remove noise → X'.

Sequential data - autoregression. The decoder takes T1, T2, ..., Tn and predicts Tn+1, extending the sequence to T1, T2, ..., Tn, Tn+1.
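The autoregressive loop can be sketched as below. The vocabulary and the function `next_token_probs` are hypothetical stand-ins for a trained decoder. The toy rule inside it exists only to make the example runnable.

```python
import numpy as np

VOCAB = ["A", "B", "C", "<end>"]

def next_token_probs(prefix):
    """Hypothetical stand-in for a trained decoder: P(T_{n+1} | T_1..T_n)."""
    # Toy rule: the longer the prefix, the more likely the sequence ends.
    p_end = min(0.1 * len(prefix), 1.0)
    rest = (1.0 - p_end) / 3
    return np.array([rest, rest, rest, p_end])

def generate(rng, max_len=20):
    tokens = []
    while len(tokens) < max_len:
        probs = next_token_probs(tokens)
        token = VOCAB[rng.choice(len(VOCAB), p=probs)]
        if token == "<end>":
            break
        tokens.append(token)  # the sampled T_{n+1} becomes part of the next input
    return tokens

sample = generate(np.random.default_rng(0))
print(sample)
```

Each step conditions on everything generated so far, which is what distinguishes autoregression from sampling all positions independently.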

Architecture depends on modality. Generative approaches: GAN | VAE | Diffusion | Autoregressive. Architectures: Transformer, GNN, XGBoost, Mamba, RNN, CNN. Modalities: Vision, Graphs, Tables, Text.

OUTLINE 03 Case study: Synthetic peptide generation 01 The value of synthetic data in medical research 02 A short tour of generative models and medical data

Data is sparse and high dimensional: ~10 clinical variables; ~20k peptide variables; missing values; low number of patients.

A lot to gain from synthetic data: reproduction of existing studies to prove the value of synthetic data; sharing of the data with the research community; ensuring privacy while creating a representative dataset.

Deep learning is difficult to apply: low amounts of data; high dimensionality; no previous research or pretrained models; need to iterate quickly and check viability of the idea.

Statistics – simple and efficient

Marginals can be directly estimated

Normalization makes models simple

Gaussians can model correlations

Distributions and correlations kept
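The slide titles above name the ingredients (directly estimated marginals, normalization, Gaussians for correlations) but not the exact procedure. One common way to combine them is a Gaussian-copula style sampler, sketched here as an assumption about what such a pipeline could look like rather than as the authors' implementation. The function names and the toy two-column dataset are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_std_normal = NormalDist()

def to_normal_scores(column):
    """Map one column to standard-normal scores through its empirical ranks."""
    ranks = column.argsort().argsort() + 1  # ranks 1..n, ties broken arbitrarily
    u = ranks / (len(column) + 1)           # strictly inside (0, 1)
    return np.array([_std_normal.inv_cdf(p) for p in u])

def sample_synthetic(real, n_samples, rng):
    """Empirical marginals combined with Gaussian correlations."""
    n_cols = real.shape[1]
    z = np.column_stack([to_normal_scores(real[:, j]) for j in range(n_cols)])
    corr = np.corrcoef(z, rowvar=False)     # correlations in normalized space
    z_new = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)
    u_new = np.vectorize(_std_normal.cdf)(z_new)
    # Map back through each column's empirical quantile function.
    return np.column_stack(
        [np.quantile(real[:, j], u_new[:, j]) for j in range(n_cols)]
    )

rng = np.random.default_rng(0)
x = rng.lognormal(size=500)                 # skewed toy variable
real = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=500)])
synthetic = sample_synthetic(real, 1000, rng)
print(np.corrcoef(real, rowvar=False)[0, 1],
      np.corrcoef(synthetic, rowvar=False)[0, 1])
```

Because the marginals here are empirical quantiles, synthetic values never leave the observed range of each column. A parametric tail model would be needed to extrapolate beyond it.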

Partition to get nicer distributions

Heavy tails are key for applications

The synthetic data is useful: model trained on synthetic data works on real data; studies done on real data successfully replicated; data distributions and correlations preserved; privacy is safe by definition!

Lots more to be done: tweaks and improvements of the current approach; deep learning approach; general model for clinical+peptide datasets; transfer learning between different peptide datasets.

Thank you! Questions?