Machine Learning for Protein Design: Antibodies and Biologics
Tom Diethe
25 slides
May 23, 2024
About This Presentation
Talk given at the industry session of ECML-PKDD 2023
Slide Content
Machine Learning for Protein Design: Antibodies and Biologics
  Tom Diethe, Centre for Artificial Intelligence
  ECML-PKDD 2023, 20th September 2023
Centre for Artificial Intelligence: A Global Team
  Locations: Gaithersburg, Cambridge, Gothenburg, Barcelona
  We are a global, science-led, patient-focused pharmaceutical company. We are dedicated to transforming the future of healthcare by unlocking the power of what science can do for people, society and the planet.
Centre for Artificial Intelligence in a Nutshell
  Core capabilities: medical computer vision; graph-based machine learning that leverages knowledge graphs; ML-guided protein engineering; audio and biomedical signal processing; causal inference; multi-modal and multi-omics; large language models
  Actions:
    We always understand the potential value/impact of a project before commencing
    We always evaluate new projects against our current portfolio resourcing
    We rapidly evaluate and explore new high-impact opportunities
    We always look for opportunities to partner internally and externally
    We frequently review our priorities
1. Introduction & Background: Machine Learning for Antibody Design
Antibody engineering is the process of designing and creating new antibodies with desired affinity and developability properties.
  This is a global effort involving multiple teams within AZ: Biologics Engineering; Centre for Artificial Intelligence; Data Science and Advanced Analytics; R&D IT
Transforming biologics discovery with machine learning
  Figure labels: Antibody, Paratope, Epitope, CDR Loops, Antigen; example sequence VSYLSTASSLDY
  Accurately predict epitope-paratope residues
  ML-assisted DNA libraries for smarter experimental design
  In-silico developability assays
  Antibody-antigen docking
Centre for AI team and partners working on this project
The future for biologics discovery is augmented design
  From: fully wet-lab based (e.g. display technologies, B-cell repertoires, hybridoma)
  To: augmenting existing platforms with AI-driven capabilities (e.g. display technologies, B-cell repertoires, hybridoma, plus HT data & ML-enabled in silico design & optimisation)
  Stages shown in both workflows: Design/select, Make/Test, Optimize properties, Validate
  Key: performed in silico
Protein engineering runs in cycles of Design, Build, Test, Learn
  Input: parental protein sequence & edit regions
  DESIGN: ML designs
  BUILD: variant library
  TEST: affinity scores
  LEARN: updated ML models
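The Design, Build, Test, Learn cycle can be sketched as a simple loop. This is a minimal illustration, not the platform described in the talk: the per-position score table standing in for the "ML model", the single-point-mutation design step, and the `assay` callback standing in for the wet-lab measurement are all assumptions made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def design(parent, edit_positions, model, n_variants, rng):
    """DESIGN: propose single-point mutants inside the edit region and keep
    the ones the current model scores highest (hypothetical design rule)."""
    candidates = []
    for _ in range(5 * n_variants):
        pos = rng.choice(edit_positions)
        candidates.append(parent[:pos] + rng.choice(AMINO_ACIDS) + parent[pos + 1:])
    candidates = list(dict.fromkeys(candidates))  # de-duplicate, keep order

    def predicted(seq):
        return sum(model.get((p, seq[p]), 0.0) for p in edit_positions)

    return sorted(candidates, key=predicted, reverse=True)[:n_variants]


def learn(history, edit_positions):
    """LEARN: refit the toy model as the mean measured score per (position, residue)."""
    totals = {}
    for seq, score in history:
        for p in edit_positions:
            s, n = totals.get((p, seq[p]), (0.0, 0))
            totals[(p, seq[p])] = (s + score, n + 1)
    return {key: s / n for key, (s, n) in totals.items()}


def run_dbtl(parent, edit_positions, assay, n_cycles=3, n_variants=8, seed=0):
    """Run a few Design-Build-Test-Learn cycles and return the model and all data."""
    rng = random.Random(seed)
    model, history = {}, []
    for _ in range(n_cycles):
        library = design(parent, edit_positions, model, n_variants, rng)  # DESIGN/BUILD
        history.extend((seq, assay(seq)) for seq in library)              # TEST
        model = learn(history, edit_positions)                            # LEARN
    return model, history


# Example with a made-up assay that rewards tryptophan (W) content.
model, history = run_dbtl("EVQLQESGP", [2, 3, 4], lambda s: float(s.count("W")))
```

In a real campaign the `assay` step is the expensive part, so the design step would normally be far more selective than this toy ranking.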
Machine Learning driven Antibody and Biologics platform
The core ML components
The core ML components: Representation Learning
Protein databases are growing rapidly, opening the door for deep learning
  UniParc (UniProt Archive)
  Observed Antibody Space (OAS): >1B sequences from >80 studies; diverse immune states, organisms (primarily human and mouse), and individuals
  Source: Olsen, T.H., Boyles, F., and Deane, C.M. (2021). Protein Science. Oxford Protein Informatics Group (OPIG), Charlotte Deane’s lab
We trained our own models… Learning Representations from Patents
  We trained a transformer-based encoder [1], SelfPAD, on 300k patented antibody sequences [2]
  We used SelfPAD in library screening on internal targets
  Figure: Training SelfPAD on patented antibodies
  [1] Ashish Vaswani et al. Attention is all you need. In Conference on Neural Information Processing Systems
  [2] Konrad Krawczyk et al. Data mining patented antibody sequences
The core ML components: Oracles
Oracles are predictive models that predict the “fitness” of variants
  Antibody (e.g. EVQLQESGP) and antigen (e.g. VREPALSVA) amino acid sequences -> transformer model -> per-residue embedding vectors -> aggregation -> protein embedding
  Protein embeddings plus other residue/protein features -> downstream ML model -> affinity
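The oracle pipeline on this slide (per-residue embeddings, aggregation into a protein embedding, then a downstream model) can be sketched as follows. Everything here is an assumption for illustration: `toy_embed` is a made-up stand-in for the transformer, mean pooling is one possible aggregation among several, and the linear scorer stands in for whatever downstream model is actually used.

```python
def toy_embed(sequence, dim=4):
    """Hypothetical stand-in for a transformer encoder: returns one
    deterministic vector of length `dim` per residue."""
    return [[((ord(aa) * (d + 1)) % 7) / 7.0 for d in range(dim)] for aa in sequence]


def mean_pool(per_residue):
    """Aggregation: average the per-residue vectors into one protein embedding."""
    n = len(per_residue)
    dim = len(per_residue[0])
    return [sum(vec[d] for vec in per_residue) / n for d in range(dim)]


def oracle_features(antibody, antigen, extra=()):
    """Concatenate antibody embedding, antigen embedding and any other features."""
    return mean_pool(toy_embed(antibody)) + mean_pool(toy_embed(antigen)) + list(extra)


def linear_oracle(features, weights, bias=0.0):
    """Downstream model (assumed linear here): maps features to a predicted affinity."""
    return bias + sum(w * f for w, f in zip(weights, features))


# Example: 4 + 4 embedding dimensions plus one extra hand-crafted feature.
features = oracle_features("EVQLQESGP", "VREPALSVA", extra=[1.0])
score = linear_oracle(features, weights=[0.1] * len(features))
```

Mean pooling discards residue order, which is one reason other aggregations (attention pooling, CDR-only pooling) are often considered; the slide does not say which was used.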
Taxonomy of Design Problems
  Figure: decision tree with Yes/No branches; outcomes: Lead Optimisation; Lead Identification + Lead Optimisation
Design algorithms are methods that propose variants
  Baseline: encoder based on SVD of BLOSUM matrix
  Deep generative models
  Oracle-guided generation using transformer model + beam search
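As a rough illustration of oracle-guided beam search, the sketch below edits one position at a time and keeps only the best few partial designs according to an oracle. It is an assumption-laden toy: candidates are enumerated exhaustively over the 20 amino acids instead of being proposed by a transformer, and the oracle is a made-up function that counts matches to a hidden target sequence.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def beam_search(parent, edit_positions, oracle, beam_width=3):
    """Visit the editable positions in order; at each one, try every residue
    on every sequence in the beam and keep the `beam_width` best by oracle score."""
    beam = [parent]
    for pos in edit_positions:
        candidates = set()
        for seq in beam:
            for aa in AMINO_ACIDS:
                candidates.add(seq[:pos] + aa + seq[pos + 1:])
        # Highest oracle score first; ties broken alphabetically for determinism.
        beam = sorted(candidates, key=lambda s: (-oracle(s), s))[:beam_width]
    return beam


# Hypothetical oracle: number of positions that agree with a hidden target.
target = "EVWYQESGP"


def toy_oracle(seq):
    return sum(a == b for a, b in zip(seq, target))


designs = beam_search("EVQLQESGP", edit_positions=[2, 3], oracle=toy_oracle)
```

A wider beam trades compute for a lower chance of discarding a partial design that only looks good after later edits.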
Generative Models: Lead Optimization
  Figure labels: Antigen + G
The core ML components: Library Design
Library Design Criteria
  Use the generative model to find predicted high-affinity sequences
  Seek a diverse set of sample sequences
  Bias towards sequences that are closer to the parental sequence
  Don’t trust the generative models far away from the training data
  Sequences far from parental are more likely to be non-functional
  Simple algorithm: ancestral sampling
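The criteria above can be combined in a small sketch: draw sequences by ancestral sampling (left to right, one residue at a time), discard those too far from the parent, then pick a diverse subset. The generative model here is a hypothetical placeholder that simply keeps the parental residue with high probability and ignores the sampled prefix; the Hamming-distance cutoff and the greedy max-min diversity rule are likewise assumptions, not the method from the talk.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def toy_model(parent, stay=0.8):
    """Hypothetical generative model: P(parental residue) = `stay` at each position.
    A real autoregressive model would condition on `prefix`."""
    def next_residue_probs(prefix, pos):
        other = (1.0 - stay) / (len(AMINO_ACIDS) - 1)
        return {aa: (stay if aa == parent[pos] else other) for aa in AMINO_ACIDS}
    return next_residue_probs


def ancestral_sample(length, next_residue_probs, rng):
    """Sample residues left to right, each drawn given the prefix sampled so far."""
    seq = ""
    for pos in range(length):
        probs = next_residue_probs(seq, pos)
        residues = list(probs)
        seq += rng.choices(residues, weights=[probs[r] for r in residues])[0]
    return seq


def design_library(parent, model, size, max_distance, n_samples=500, seed=0):
    rng = random.Random(seed)
    pool = {ancestral_sample(len(parent), model, rng) for _ in range(n_samples)}
    # Stay close to the parent, where the model is more trustworthy.
    pool = sorted(s for s in pool if 0 < hamming(s, parent) <= max_distance)
    library = []
    while pool and len(library) < size:
        # Greedy max-min: add the candidate farthest from everything chosen so far.
        best = max(pool, key=lambda s: min((hamming(s, t) for t in library), default=0))
        library.append(best)
        pool.remove(best)
    return library


parent = "EVQLQESGP"
library = design_library(parent, toy_model(parent), size=10, max_distance=3)
```

Predicted affinity is not scored in this sketch; in practice an oracle score could be added as a further filter or as a weight in the selection step.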