Machine Learning for Protein Design: Antibodies and Biologics

TomDiethe 289 views 25 slides May 23, 2024

About This Presentation

Talk given at the industry session of ECML-PKDD 2023


Slide Content

Machine Learning for Protein Design: Antibodies and Biologics. Tom Diethe, Centre for Artificial Intelligence. ECML-PKDD 2023, 20th September 2023

Centre for Artificial Intelligence: A Global Team. Locations: Gaithersburg, Cambridge, Gothenburg, Barcelona. We are a global, science-led, patient-focused pharmaceutical company. We are dedicated to transforming the future of healthcare by unlocking the power of what science can do for people, society and the planet.

Centre for Artificial Intelligence in a Nutshell

Core Capabilities: medical computer vision; graph-based machine learning that leverages knowledge graphs; ML-guided protein engineering; audio and biomedical signal processing; causal inference; multi-modal and multi-omics; large language models. Highlighted: ML-guided protein engineering.

Actions: We always understand the potential value/impact of a project before commencing. We always evaluate new projects against our current portfolio resourcing. We rapidly evaluate and explore new high-impact opportunities. We always look for opportunities to partner internally and externally. We frequently review our priorities.

Section 1. Introduction & Background: Machine Learning for Antibody Design

Antibody engineering is the process of designing and creating new antibodies with desired affinity and developability properties. This is a global effort involving multiple teams within AZ: Biologics Engineering; Centre for Artificial Intelligence; Data Science and Advanced Analytics; R&D IT.

Transforming biologics discovery with machine learning. [Figure: antibody-antigen diagram labelled Antibody, Paratope, Epitope, CDR Loops, Antigen; example sequence VSYLSTASSLDY.] Applications shown: accurately predict epitope-paratope residues; ML-assisted DNA libraries for smarter experimental design; in-silico developability assays; antibody-antigen docking.

Centre for AI team and partners working on this project

The future for biologics discovery is augmented design. From fully wet-lab based (e.g. display technologies, B-cell repertoires, hybridoma) to augmenting existing platforms with AI-driven capabilities (e.g. display technologies, B-cell repertoires, hybridoma, PLUS HT data & ML-enabled in silico design & optimisation). [Figure: both workflows show the stages Design/select, Make/Test, Optimize properties and Validate; key: performed in silico.]

Protein engineering runs in cycles of Design, Build, Test, Learn. [Figure labels: parental protein sequence & edit regions (input); DESIGN: ML designs, variant library; BUILD; TEST: affinity scores; LEARN: updated ML models.]
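The cycle above can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the platform described in the talk: the "model" is a hypothetical table of per-position residue weights, the wet lab is replaced by a stand-in scoring function against a hidden target sequence, and all function names (design, build_and_test, learn, run_cycles) are invented for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def design(parent, edit_positions, model, n_variants, rng):
    """DESIGN: propose single-point variants, mutating only the edit regions."""
    variants = set()
    while len(variants) < n_variants:
        pos = rng.choice(edit_positions)
        # Residues the current model has scored well get a higher sampling weight.
        weights = [model.get((pos, aa), 1.0) for aa in AMINO_ACIDS]
        new_aa = rng.choices(AMINO_ACIDS, weights=weights, k=1)[0]
        variant = parent[:pos] + new_aa + parent[pos + 1:]
        if variant != parent:
            variants.add(variant)
    return sorted(variants)

def build_and_test(variants, hidden_target):
    """BUILD + TEST: stand-in for the wet lab (assumption); toy affinity score."""
    return {v: sum(a == b for a, b in zip(v, hidden_target)) for v in variants}

def learn(model, parent, scores):
    """LEARN: record the measured score for each mutation that was tested."""
    for variant, score in scores.items():
        for pos, (p, v) in enumerate(zip(parent, variant)):
            if p != v:
                model[(pos, v)] = 1.0 + score  # always positive

def run_cycles(parent, edit_positions, hidden_target,
               n_rounds=5, n_variants=8, seed=0):
    rng = random.Random(seed)
    model = {}
    best_seq = parent
    best_score = build_and_test([parent], hidden_target)[parent]
    history = [best_score]
    for _ in range(n_rounds):
        variants = design(best_seq, edit_positions, model, n_variants, rng)
        scores = build_and_test(variants, hidden_target)
        learn(model, best_seq, scores)
        top = max(scores, key=scores.get)
        if scores[top] > best_score:  # keep the parent unless a variant is better
            best_seq, best_score = top, scores[top]
        history.append(best_score)
    return best_seq, history
```

Calling run_cycles("EVQLQESGP", [2, 3, 4, 5], "EVALAESGP") returns the best sequence found and the best score after each round; by construction the score never decreases and residues outside the edit positions are never changed.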

Machine Learning-driven Antibody and Biologics platform

The core ML components

The core ML components: Representation Learning

Protein databases are growing rapidly, opening the door for deep learning. UniParc (UniProt Archive). Observed Antibody Space (OAS): >1B sequences from >80 studies; diverse immune states, organisms (primarily human and mouse), and individuals. Olsen, T.H., Boyles, F., and Deane, C.M. (2021). Protein Science. Oxford Protein Informatics Group (OPIG), Charlotte Deane’s lab.

We trained our own models: Learning Representations from Patents. We trained a transformer-based encoder [1], SelfPAD, on 300k patented antibody sequences [2]. We used SelfPAD in library screening on internal targets. [Figure: Training SelfPAD on patented antibodies.] [1] Ashish Vaswani et al. Attention is all you need. In Conference on Neural Information Processing Systems. [2] Konrad Krawczyk et al. Data mining patented antibody sequences.

The core ML components: Oracles

Oracles are predictive models that predict the “fitness” of variants (e.g. affinity). [Figure: the antibody amino acid sequence (e.g. EVQLQESGP) and the antigen amino acid sequence (e.g. VREPALSVA) each pass through a transformer model to give per-residue embedding vectors; these are aggregated into a protein embedding and, together with other residue/protein features, fed to a downstream ML model that predicts affinity.]
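The oracle pipeline (per-residue embeddings, aggregation, downstream model) can be sketched in a few lines. The assumptions here are substantial: a fixed random lookup table stands in for the pretrained transformer, aggregation is a simple mean over residues, the downstream model is ridge regression, and the "other residue/protein features" are omitted. None of these choices is stated in the slides, and the function names are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
EMBED_DIM = 8
_rng = np.random.default_rng(0)
# Stand-in for a pretrained transformer (assumption): one fixed vector per residue.
EMBEDDING_TABLE = {aa: _rng.normal(size=EMBED_DIM) for aa in AMINO_ACIDS}

def per_residue_embeddings(sequence):
    """Return an array of shape (len(sequence), EMBED_DIM)."""
    return np.stack([EMBEDDING_TABLE[aa] for aa in sequence])

def protein_embedding(sequence):
    """Aggregation step: mean-pool per-residue vectors into one vector."""
    return per_residue_embeddings(sequence).mean(axis=0)

def featurise(antibody, antigen):
    """Concatenate the antibody and antigen protein embeddings."""
    return np.concatenate([protein_embedding(antibody), protein_embedding(antigen)])

def _with_intercept(X):
    return np.hstack([X, np.ones((len(X), 1))])

def fit_ridge(X, y, alpha=1.0):
    """Downstream model: ridge regression fitted in closed form."""
    X1 = _with_intercept(X)
    A = X1.T @ X1 + alpha * np.eye(X1.shape[1])
    return np.linalg.solve(A, X1.T @ y)

def predict(weights, X):
    """Predicted fitness (e.g. affinity) for each feature row."""
    return _with_intercept(X) @ weights
```

With measured affinities y for a set of antibody variants against one antigen, X = np.stack([featurise(ab, antigen) for ab in variants]) followed by fit_ridge(X, y) gives a toy oracle that predict can apply to unseen variants.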

Multiple Oracles Lead to Multi-Objective Optimisation. Objectives: binding affinity; expressivity; thermal stability; (low) aggregation (shelf life); viscosity (absorption); …
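One common way to handle several oracles at once is to keep only the Pareto-optimal variants, i.e. those that no other variant beats on every objective. The slides do not say which multi-objective method was used, so the following is a generic sketch; the variant names and scores are made up, and every objective is assumed to be expressed so that higher is better (for example, aggregation is negated).

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b on every objective
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """candidates maps a variant name to its tuple of oracle scores.
    Returns the sorted names of variants that no other variant dominates."""
    return sorted(
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    )

# Hypothetical scores: (affinity, thermal stability, negated aggregation).
example = {
    "v1": (0.9, 0.2, -0.5),
    "v2": (0.5, 0.8, -0.1),
    "v3": (0.4, 0.7, -0.3),  # dominated by v2
    "v4": (0.9, 0.2, -0.6),  # dominated by v1
}
print(pareto_front(example))  # ['v1', 'v2']
```

The surviving variants represent different trade-offs (v1 favours affinity, v2 favours stability and low aggregation), which is why a single ranking is not enough once several oracles are in play.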

The core ML components: Design Algorithms

Taxonomy of Design Problems. [Figure: decision tree with Yes/No branches; outcomes: Lead Optimisation, Lead Identification + Lead Optimisation.]

Design algorithms are methods that propose variants. Baseline: encoder based on SVD of BLOSUM matrix. Deep generative models. Oracle-guided generation using transformer model + beam search.
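Beam search guided by an oracle can be sketched as follows. This is a simplified illustration, not the method from the talk: instead of a transformer proposing tokens, every amino acid is tried at each edit position, and the oracle is any callable that scores a full sequence. The function name and the toy oracle in the usage note are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def beam_search(parent, edit_positions, oracle, beam_width=3):
    """Visit the edit positions in order. At each one, try every residue in
    every sequence currently on the beam, then keep the beam_width sequences
    with the highest oracle score (ties broken alphabetically)."""
    beam = [parent]
    for pos in edit_positions:
        candidates = set()
        for seq in beam:
            for aa in AMINO_ACIDS:
                candidates.add(seq[:pos] + aa + seq[pos + 1:])
        beam = sorted(candidates, key=lambda s: (-oracle(s), s))[:beam_width]
    return beam
```

As a usage example, with a toy oracle that counts matches to a hidden target, oracle = lambda s: sum(a == b for a, b in zip(s, "EVALAESGP")), the call beam_search("EVQLQESGP", [2, 4], oracle) puts "EVALAESGP" first on the returned beam. Keeping more than one sequence per step is what distinguishes this from greedy search: a residue that looks slightly worse at one position can still lead to the best full sequence.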

Generative Models: Lead Optimization. [Figure labels: Antigen + G.]

The core ML components: Library Design

Library Design Criteria. Use the generative model to find predicted high-affinity sequences. Seek a diverse set of sample sequences. Bias towards sequences that are closer to the parental sequence: don’t trust the generative models far away from the training data, and sequences far from parental are more likely to be non-functional. Simple algorithm: ancestral sampling.
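The criteria above can be combined in a small sampling routine. This is a sketch under assumptions: the generative model is represented by a hypothetical callable next_probs(prefix) that returns a probability for each amino acid, the parental bias is implemented by mixing that distribution with a point mass on the parental residue, closeness is enforced with a Hamming-distance cap, and diversity is handled only by keeping distinct sequences. The slides do not specify any of these details.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def ancestral_sample(parent, next_probs, parent_bias, rng):
    """Sample residues left to right, each conditioned on the prefix so far."""
    prefix = ""
    for i in range(len(parent)):
        probs = next_probs(prefix)  # dict: amino acid -> model probability
        weights = [
            (1 - parent_bias) * probs[aa] + parent_bias * (aa == parent[i])
            for aa in AMINO_ACIDS
        ]
        prefix += rng.choices(AMINO_ACIDS, weights=weights, k=1)[0]
    return prefix

def design_library(parent, next_probs, size, max_distance=3,
                   parent_bias=0.7, seed=0, max_tries=10000):
    """Collect distinct sampled sequences that stay close to the parent."""
    rng = random.Random(seed)
    library = set()
    for _ in range(max_tries):
        if len(library) == size:
            break
        seq = ancestral_sample(parent, next_probs, parent_bias, rng)
        if seq != parent and hamming(seq, parent) <= max_distance:
            library.add(seq)  # the set keeps only distinct sequences
    return sorted(library)
```

For a quick check, a uniform stand-in model such as next_probs = lambda prefix: {aa: 1 / 20 for aa in AMINO_ACIDS} can be passed to design_library("EVQLQESGP", next_probs, size=10). Raising parent_bias or lowering max_distance keeps the library closer to the parental sequence, at the cost of diversity.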

MARBLE Workshop. When: Friday 22, morning. Where: PoliTo Room 12i
