Recently, machine learning has been applied to a wide range of problems in medicine. There have been many impressive results, but many problems and issues remain unsolved. One of the largest roadblocks at the moment is the lack of large datasets, together with the inability to share existing data due to privacy concerns. For these reasons, synthetic data generation using generative modeling techniques is an important research area in this field. In this talk, we give an overview of the different data modalities used in medicine and go over the state-of-the-art generative models for such data, highlighting successful applications on biomedical data. Finally, we walk through a case study of our recent research project, in which we successfully created a synthetic version of a dataset containing clinical and peptide data for heart failure patients.
Slide Content
Enhancing Medical Research through Synthetic Data Generation Ena Aničić & Stipe Kabić
OUTLINE
01 The value of synthetic data in medical research
02 A short tour of generative models and medical data
03 Case study: Synthetic peptide generation
The timing is perfect! Quality synthetic data can solve many issues with medical data, and generative AI has had massive breakthroughs in recent times.
Synthetic data solves key issues: low data (augmentation), strong regulation (sharing synthetic data instead of real data), diverse data (multimodal approaches), and danger of bias (balanced sampling).
Privacy is still a concern: AI models can memorize specific data points, and sensitive data can be hidden in model parameters. Synthetic data = generative AI + privacy; evaluation = quality + privacy.
Differential privacy is a solid metric: run the synthetization process on the real dataset and on the same dataset without a specific patient; the two resulting synthetic datasets (Synthetic 1 and Synthetic 2) should be indistinguishable.
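A toy numerical sketch of that comparison, with a hypothetical one-variable `synthesize` function standing in for the real pipeline; note that formal differential privacy is a guarantee about the synthetization mechanism itself, so a single empirical comparison like this only illustrates the intuition.

```python
# Toy illustration (not the talk's actual pipeline): removing any single patient
# from the real dataset should barely change the distribution of the synthetic output.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

def synthesize(real: np.ndarray, n: int = 5000) -> np.ndarray:
    """Hypothetical one-variable synthesizer: fit a Gaussian and sample from it."""
    return rng.normal(real.mean(), real.std(), size=n)

real = rng.lognormal(mean=1.0, sigma=0.5, size=300)    # stand-in for a clinical variable

synthetic_full = synthesize(real)                      # Synthetic 1: all patients
synthetic_minus_one = synthesize(np.delete(real, 42))  # Synthetic 2: one patient removed

# Crude empirical check: the two synthetic samples should be statistically indistinguishable.
res = ks_2samp(synthetic_full, synthetic_minus_one)
print(f"KS statistic = {res.statistic:.3f}, p-value = {res.pvalue:.3f}")
```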
Replicate research done on real data: Step 1, run experiments on real data; Step 2, replicate them on synthetic data; Step 3, compare the results.
Train on synthetic, test on real: train Model A on real data and Model B on synthetic data, evaluate both on the same real test data, and compare the scores.
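A minimal sketch of this train-on-synthetic, test-on-real comparison, using toy data and a logistic regression as stand-ins for the real pipeline and models:

```python
# Toy TSTR sketch: the "synthetic" set here is just a perturbed copy of the real
# training data, standing in for the output of a generative model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
X_syn = X_train + rng.normal(scale=0.3, size=X_train.shape)  # stand-in synthetic data
y_syn = y_train

def auc_on_real_test(X_tr, y_tr):
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print("Model A (train on real):     ", round(auc_on_real_test(X_train, y_train), 3))
print("Model B (train on synthetic):", round(auc_on_real_test(X_syn, y_syn), 3))
# Useful synthetic data => the two scores should be close.
```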
Statistical properties should be kept
One should not be able to differentiate real from synthetic data
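Two generic checks that match these two criteria, sketched on toy arrays (not the evaluation code from the talk): per-variable Kolmogorov-Smirnov statistics for the marginal distributions, and a "discriminator" classifier whose AUC should stay near 0.5 if real and synthetic rows are hard to tell apart.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 5))
synthetic = rng.normal(size=(500, 5))  # stand-in; in practice, the generated data

# (1) Marginals: one KS statistic per column (0 = identical empirical distributions).
ks_stats = [ks_2samp(real[:, j], synthetic[:, j]).statistic for j in range(real.shape[1])]
print("KS per variable:", np.round(ks_stats, 3))

# (2) Discriminator: label real=0, synthetic=1 and see how well a classifier separates them.
X = np.vstack([real, synthetic])
y = np.r_[np.zeros(len(real)), np.ones(len(synthetic))]
auc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5, scoring="roc_auc").mean()
print("Discriminator AUC (0.5 is ideal):", round(auc, 3))
```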
02 A short tour of generative models and medical data
Classical methods are limited
Medical data is complex and diverse
DL approaches have lots in common: a GAN maps latent noise Z through a generator to data X' and trains against a discriminator; a variational autoencoder encodes X to Z and decodes back to X'; a diffusion model adds noise to X and learns to remove it.
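A purely schematic sketch of that common structure, with untrained stand-in functions in place of real networks: in each family, sampling ultimately means turning noise into a data-space sample through a learned mapping.

```python
# Schematic only: the "networks" here are a fixed linear map and a fixed shrinkage
# step, just to show the shared sampling pattern of GANs, VAEs, and diffusion models.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 8))            # stand-in for learned generator/decoder weights

def generator(z):                      # GAN: x' = G(z), trained against a discriminator
    return z @ W

def decoder(z):                        # VAE: x' = Decoder(z), with z ~ N(0, I) at sampling time
    return z @ W

def denoise_step(x_t):                 # Diffusion: a learned step that removes a bit of noise
    return 0.9 * x_t

z = rng.normal(size=(1, 2))            # latent noise
x_gan, x_vae = generator(z), decoder(z)

x_t = rng.normal(size=(1, 8))          # diffusion starts from pure noise in data space
for _ in range(10):
    x_t = denoise_step(x_t)

print(x_gan.shape, x_vae.shape, x_t.shape)  # each path ends with a sample in data space
```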
Architecture depends on modality: the generative paradigm (GAN | VAE | Diffusion | Autoregressive) is paired with an architecture suited to the data type, e.g. CNNs for vision, GNNs for graphs, XGBoost for tables, and Transformers, Mamba, or RNNs for text.
03 Case study: Synthetic peptide generation
Data is sparse and high-dimensional: ~10 clinical variables, ~20k peptide variables, missing values, and a low number of patients.
A lot to gain from synthetic data: reproduction of existing studies to prove the value of synthetic data, sharing of the data with the research community, and ensuring privacy while creating a representative dataset.
Deep learning is difficult to apply: low amounts of data, no previous research or pretrained models, the need to iterate quickly and check the viability of the idea, and high dimensionality.
Statistics – simple and efficient
Marginals can be directly estimated
Normalization makes models simple
Gaussians can model correlations
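A minimal Gaussian-copula-style sketch of the three steps above (estimate marginals directly, normalize each variable to a standard normal, model correlations with a multivariate Gaussian), on toy data; this is an illustrative simplification, not the exact model from the case study.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(60, 10, 500),
    "peptide_a": rng.lognormal(0, 1, 500),   # heavy-tailed, like many peptide intensities
})

# 1) Estimate marginals directly from the data (here: empirical ranks in (0, 1)).
ranks = df.rank() / (len(df) + 1)

# 2) Normalize: map each variable to a standard normal via the probit function.
z = pd.DataFrame(stats.norm.ppf(ranks), columns=df.columns)

# 3) A multivariate Gaussian on the normalized data captures the correlations.
cov = np.cov(z.values, rowvar=False)
z_syn = rng.multivariate_normal(mean=np.zeros(len(df.columns)), cov=cov, size=len(df))

# Map back to the original scales through the empirical quantiles of each marginal.
u_syn = stats.norm.cdf(z_syn)
synthetic = pd.DataFrame(
    {c: np.quantile(df[c], u_syn[:, i]) for i, c in enumerate(df.columns)}
)
print(synthetic.describe())
```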
Distributions and correlations kept
Partition to get nicer distributions
Heavy tails are key for applications
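A small standalone illustration of the heavy-tail point: fitting a plain Gaussian to a log-normal "peptide intensity" truncates the upper tail, while rank/quantile-based resampling of the empirical marginal preserves it (toy data, illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

gaussian_syn = rng.normal(real.mean(), real.std(), size=5000)  # parametric marginal
empirical_syn = np.quantile(real, rng.uniform(size=5000))      # rank/quantile-based marginal

# Compare upper-tail quantiles: the Gaussian fit underestimates them badly.
for q in (0.95, 0.99, 0.999):
    print(f"q={q}: real={np.quantile(real, q):.1f}  "
          f"gaussian={np.quantile(gaussian_syn, q):.1f}  "
          f"empirical={np.quantile(empirical_syn, q):.1f}")
```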
The synthetic data is useful: a model trained on synthetic data works on real data, studies done on real data were successfully replicated, data distributions and correlations are preserved, and privacy is safe by definition!
Lots more to be done: tweaks and improvements of the current approach, a deep learning approach, a general model for clinical+peptide datasets, and transfer learning between different peptide datasets.