AI in Chemistry: Deep Learning Models Love Really Big Data
csteinbeck
Sep 11, 2024
About This Presentation
Talk delivered at the 5.0 consortium meeting of the German National Research Data Infrastructure for Chemistry (NFDI4Chem).
It shows that neural networks have been used in chemistry since at least the early 1990s, then points out the advancements that were needed to reach deep learning. We then show a few recent highlights of deep learning in chemistry and finish with our own work on DECIMER.
Size: 9.94 MB
Language: en
Added: Sep 11, 2024
Slides: 52 pages
Slide Content
AI in Chemistry: Deep Learning Models Love Really Big Data
Christoph Steinbeck
Deep Learning in
Chemistry
•Prediction of
•Chemical properties
•Reactions
•Chemical structure
•Knowledge extraction
Counterpropagation Neural Network
with a few hundred neurons.
SGI Origin 200 workstation with one
180-MHz IP27 processor
and running IRIX 6.3. (Good old days)
What has changed since
the good old days?
•Advancements in algorithms and hardware (GPU training)
•Growth of neural networks from hundreds to hundreds of thousands of neurons
•Availability of big data in a few areas led to the iconic breakthroughs
SHK: “We have seen that AI methods require more data
than deterministic methods, and deep learning methods
need even more.”
SHK: “… it should be noted that datasets which are
considered gold standard tend to be very large. For
example, the ImageNet dataset, a gold standard in
image classification, contains 14,197,122 images as of
now.”
Highlights of Deep Learning in Chemistry
Jablonka, K. M.; et al. Nat. Mach. Intell. 2024, 6 (2), 161–169.
Our results raise a very important question: how can a natural
language model with no prior training in chemistry outperform
dedicated machine learning models, as we were able to show in
the case of high-entropy alloys in Fig. 2 and for various
molecule, material and chemical reaction properties in Extended
Data Table 2? To our knowledge, this fundamental question has
no rigorous answer.
As we show in this Article, a machine learning system built
using GPT-3 works impressively well for a wide range of
questions in chemistry—even for those for which we cannot use
conventional line representations such as SMILES. Compared
with conventional machine learning, it has many advantages.
GPT-3 can be used for many different applications.
Highlights of Deep Learning in Chemistry
He, J.; et al. J. Cheminformatics 2024, 16 (1), 95.
•Recurrent Neural Networks (RNNs)
•Variational Autoencoders (VAEs)
•Transformers
•Generative Adversarial Networks (GANs)
•Graph Neural Networks (GNNs)
•Diffusion-based Models
•Molecular generative model
•Scoring function
•Reinforcement Learning (RL) as a search algorithm
Highlights of Deep Learning in Chemistry
Kirkpatrick et al., Science 374, 1385–1389 (2021)
Highlights of Deep Learning in Chemistry
Abramson, et al., Nature 2024, 630 (8016), 493–500.
The introduction of AlphaFold 2 has spurred a revolution in
modelling the structure of proteins and their interactions,
enabling a huge range of applications in protein modelling and
design. Here we describe our AlphaFold 3 model with a
substantially updated diffusion-based architecture that is
capable of predicting the joint structure of complexes including
proteins, nucleic acids, small molecules, ions and modified
residues.
Highlights of Deep Learning in Chemistry
ACS Cent. Sci. 2019, 5 (9), 1572–1583
Organic synthesis is one of the key stumbling blocks in medicinal chemistry. A necessary
yet unsolved step in planning synthesis is solving the forward problem: Given reactants
and reagents, predict the products. Similar to other work, we treat reaction prediction as a
machine translation problem between simplified molecular-input line-entry system
(SMILES) strings (a text-based representation) of reactants, reagents, and the products.
We show that a multihead attention Molecular Transformer model outperforms all
algorithms in the literature, achieving a top-1 accuracy above 90% on a common
benchmark data set.
15 March 2016: Lee Sedol, a top-ranked Go player, loses the last of five games to AlphaGo. Lee Jin-man / AP
In the match against Lee,
DeepMind’s AlphaGo used 1,202
CPUs and 176 GPUs.
AlphaGo Zero: Mastering the Game of Go
without Human Knowledge
•DeepMind's AlphaGo Zero implements a Monte Carlo tree search with a convolutional neural network providing position evaluation and policy guidance.
•With only the rules of Go known, AlphaGo Zero improved to superhuman playing strength after a day of training (5 million games).
•It uses just a single machine in the Google Cloud with 4 TPUs.
Silver, D., Schrittwieser, J., Simonyan, K. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017). https://doi.org/10.1038/nature24270
Fabricating large numbers of chemical structure
depictions to solve the OCSR problem
The
Project
Information in printed literature is not readily
available in databases
Image Source: Chen et al. 2020, J. Nat. Prod.
Organism Name
Chemical Name
Chemical Class
Biol. Activity
Chemical Structures
Optical Chemical Structure Recognition (OCSR) Tools
Mol file
Black pixels on white paper
Rule based methods
1. Scanning
2. Vectorization
3. Searching for dashed lines
and dashed wedges
4. Character recognition
5. Graph compilation
6. Post processing
7. Display and editing
DECIMER: Deep LEarning for Chemical IMagE
Recognition
Image Source: Wijeratne et al. 2001, J. Nat. Prod.
Segmentation
(Identification &
Extraction)
Prediction
CN1C=NC2=C1C(=O)N(C(=O)N2C)C
Re-Depicted Structure
SMILES
(simplified molecular-
input line-entry system)
OCSR Engine
Kohulan Rajan
DECIMER – Image to SMILES
CN1C=NC2=C1C(=O)N(C(=O)N2C)C
Show and tell: Image Caption Generator
DECIMER – Image to SMILES
Reference and Image source: Xu et al. 2015, arXiv [cs.LG]
DECIMER – Image to SMILES (Performance)
Image source: Rajan et al. 2020, J Cheminform
[Figure: Tanimoto similarity calculations vs training data size. x-axis: dataset index; y-axes: percentage (0% to 100%) and Tanimoto similarity (0.0 to 1.0).
Average Tanimoto similarity on valid SMILES, in order of increasing dataset index: 0.13, 0.22, 0.38, 0.48, 0.53, 0.62, 0.68
Percentage of molecules with Tanimoto 1.0, in the same order: 0.1%, 0.2%, 6.7%, 13.2%, 18.9%, 26.0%, 27.0%]
Q: Query, T: Target
Similarity measure (Tanimoto coefficient):
$$\text{Tanimoto coefficient} = \frac{|T \cap Q|}{|T \cup Q|}$$
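The coefficient above can be sketched in a few lines of Python. This is a minimal illustration, assuming each molecule is represented as the set of indices of its "on" fingerprint bits; the example fingerprints are made up and the function name `tanimoto` is hypothetical.

```python
def tanimoto(query: set, target: set) -> float:
    """Tanimoto coefficient |T ∩ Q| / |T ∪ Q| for two feature sets."""
    union = query | target
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(query & target) / len(union)


# Hypothetical fingerprints: indices of the bits that are set.
q = {1, 4, 7, 9}
t = {1, 4, 8, 9, 12}
print(tanimoto(q, t))  # 3 shared bits / 6 bits in total = 0.5
```

A value of 1.0 means the two fingerprints are identical, which is why the performance slides report the share of predictions with Tanimoto 1.0 separately from the average.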
DECIMER – Image to SMILES (Performance)
The infamous Figure 5
Rajan, K., Zielesny, A. & Steinbeck, C.
DECIMER: towards deep learning for chemical image recognition.
J Cheminform 12, 65 (2020). https://doi.org/10.1186/s13321-020-00469-w
DECIMER – Image to SMILES (Training time)
Image source: Rajan et al. 2020, J Cheminform
25 epochs for the 15 million dataset: ~26 days
25 epochs for the 45 million dataset: ~78 days
CPU: Central Processing Unit
GPU: Graphics Processing Unit
TPU: Tensor Processing Unit
https://cloud.google.com/blog/products/ai-machine-learning/what-makes-tpus-fine-tuned-for-deep-learning
TPUs : Tensor
Processing Units
•A Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated circuit (ASIC) developed by Google.
•It was developed specifically to train larger deep learning models faster.
•Development started in 2013; available to the public since 2018.
•Only available through Google Cloud Platform.
•https://cloud.google.com/tpu
•https://en.wikipedia.org/wiki/Tensor_Processing_Unit
•https://cloud.google.com/blog/products/ai-machine-learning/what-makes-tpus-fine-tuned-for-deep-learning
•https://www.tensorflow.org/guide/tpu
GPU VS TPU training speed
Reference: Rajan et al. 2021, J Cheminform
•Why not GPUs?
Training time compared to an
Nvidia Tesla V100 GPU
•Single V3-8 TPU – 4x faster
•Single V4-8 TPU – 7x faster
•Single V5-8 TPU – 16x faster
=> Six months down to 11 days
[Figure: Time per epoch (GPU vs TPU). Bars: GPU, TPU V3-8, TPU V4-8, TPU V5-8; axis: time in hours (0 to 40).]
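The "six months down to 11 days" figure follows directly from the quoted speedup factors. A small worked calculation, assuming "six months" means roughly 182 days on the V100 GPU (the slide does not give an exact number of days):

```python
# Assumption: "six months" of GPU training is taken as 6 * 30.4 ≈ 182 days.
GPU_DAYS = 6 * 30.4

# Speedup factors relative to the V100 GPU, as quoted on the slide.
SPEEDUPS = {"TPU V3-8": 4, "TPU V4-8": 7, "TPU V5-8": 16}


def tpu_days(gpu_days: float, speedup: float) -> float:
    """Training time on a TPU that is `speedup` times faster than the GPU."""
    return gpu_days / speedup


for name, factor in SPEEDUPS.items():
    print(f"{name}: {tpu_days(GPU_DAYS, factor):.1f} days")
```

With the 16x factor this gives a little over 11 days, consistent with the number on the slide.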
DECIMER V2
•400 million images plus corresponding SMILES fed into the DNN.
•Images of chemical structure depictions were generated using RanDepict.
•No assumptions about the underlying problem (no concept of bonds, atoms, etc.) included.
“Caffeine” depicted using CDK
CN1C=NC2=C1C(=O)N(C(=O)N2C)C
EfficientNet-V2 CNN + Transformer DNN
SMILES
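Because the network has no concept of atoms or bonds, the SMILES string is treated purely as text, much like an image caption. A minimal sketch of how such a string could be split into tokens before training, assuming a simple regex tokenizer (an illustration only, not necessarily the tokenization DECIMER uses; `SMILES_TOKEN` and `tokenize` are hypothetical names):

```python
import re

# Assumed token pattern: bracket atoms, two-letter halogens, single letters,
# two-digit ring closures (%nn), ring-closure digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[A-Za-z]|%[0-9]{2}|[0-9]|[()=#+\-\\/.:~@*$])"
)


def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens; fail if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize {smiles!r}")
    return tokens


caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
print(tokenize(caffeine))
```

The decoder then predicts one such token at a time, conditioned on the image features and the tokens produced so far.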
This repository contains RanDepict, an easy-to-use utility to generate a big variety
of chemical structure depictions (random depiction styles and image
augmentations) based on
RDKit, CDK and Indigo.
RanDepict
Brinkhaus, H.O., Rajan, K., Zielesny, A. et al. RanDepict: Random chemical structure depiction generator.
J Cheminform 14, 31 (2022). https://doi.org/10.1186/s13321-022-00609-4
Distortion features controlled by fingerprints
USPTO: 5,719 images from the US Patent Office
UOB: 5,740 images from the University of Birmingham
CLEF: 992 images from the Conference and Labs of the Evaluation Forum test set
JPO: 450 images from the Japanese Patent Office
RanDepict250k: 250,000 chemical structure depictions generated with RanDepict
RanDepict250k_augmented: 250,000 depictions with additional augmentations generated with RanDepict
DECIMER hand-drawn: 5,088 chemical structure images from the DECIMER hand-drawn dataset
Indigo: 50,000 images generated by Staker et al. using Indigo. All images have a resolution of 224 x 224 pixels.
USPTO_big: 50,000 images from the USPTO from Staker et al. All images have a resolution of 224 x 224 pixels.
Img2Mol test set: 25,000 depictions used by Clévert et al. All images have a resolution of 224 x 224 pixels.
OCSR Benchmark Datasets
OCSR tool performance on Augmented Datasets
Clean Data
Augmented Data
•xy-shearing factor randomly drawn from [−0.1, 0.1]
•rotation (randomly drawn from [−5 °, 5°])
Reference: Clevert et al. 2021, Chemical Science
Bayer AG, Berlin
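The two augmentations listed above can be sketched as a random affine transform. This is a minimal illustration, assuming the shear is sampled independently for x and y and applied before the rotation; the order of composition and the function names are assumptions, not taken from Clevert et al.

```python
import math
import random


def sample_augmentation(rng: random.Random):
    """Draw shear factors from [-0.1, 0.1] and a rotation angle from [-5°, 5°]."""
    shear_x = rng.uniform(-0.1, 0.1)
    shear_y = rng.uniform(-0.1, 0.1)
    angle_deg = rng.uniform(-5.0, 5.0)
    return shear_x, shear_y, angle_deg


def affine_matrix(shear_x: float, shear_y: float, angle_deg: float):
    """2x2 matrix R @ S: shear S = [[1, sx], [sy, 1]] followed by rotation R."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [
        [c - s * shear_y, c * shear_x - s],
        [s + c * shear_y, s * shear_x + c],
    ]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
params = sample_augmentation(rng)
print(params)
print(affine_matrix(*params))
```

The resulting matrix would be handed to whatever image library performs the actual warp; even these small distortions are enough to degrade several OCSR tools noticeably.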
DECIMER Hand-Drawn Structures Dataset
Brinkhaus, H.O., Zielesny, A., Steinbeck, C. et al. DECIMER—hand-drawn molecule images dataset. J Cheminform 14, 36
(2022). https://doi.org/10.1186/s13321-022-00620-9
6000 diverse molecules selected from PubChem
using RDKit’s implementation of the MaxMin
algorithm based on Morgan fingerprints.
Dataset of 5088 images and corresponding
SMILES
The dataset at ZENODO:
https://doi.org/10.5281/zenodo.6456306
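The MaxMin selection mentioned above repeatedly picks the molecule that is farthest from everything already chosen. A pure-Python sketch of that idea, assuming fingerprints are given as sets of bit indices and Tanimoto distance is used; this is an illustration of the principle, not RDKit's implementation, and the toy fingerprints are made up.

```python
def tanimoto_distance(a: set, b: set) -> float:
    """1 - Tanimoto similarity for two fingerprint bit sets."""
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 1.0)


def maxmin_pick(fingerprints: list, n_pick: int, first: int = 0) -> list:
    """Greedy MaxMin: add the candidate whose nearest picked neighbour is farthest."""
    n = len(fingerprints)
    picked = [first]
    # Distance from every candidate to its closest already-picked molecule.
    min_dist = [tanimoto_distance(fp, fingerprints[first]) for fp in fingerprints]
    min_dist[first] = -1.0  # mark as used so it is never picked again
    while len(picked) < min(n_pick, n):
        nxt = max(range(n), key=min_dist.__getitem__)
        picked.append(nxt)
        for i in range(n):
            if min_dist[i] >= 0:
                d = tanimoto_distance(fingerprints[i], fingerprints[nxt])
                min_dist[i] = min(min_dist[i], d)
        min_dist[nxt] = -1.0
    return picked


# Hypothetical toy fingerprints (sets of "on" bit indices).
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
print(maxmin_pick(fps, 2))  # starts at index 0, then picks the most distant one
```

On a PubChem-sized pool the same idea needs an optimized implementation, which is what a library picker provides.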
Training and Testing - DECIMER Image Transformer
Augmented image
Non-augmented image
Augmented image with Markush structure
Non-augmented image with Markush structure
Synthetic hand-drawn structure
OCSR tools performance on hand drawn images
[Figure: bar chart, percentage axis 0% to 100%. Values per tool: identical predictions / Tanimoto 1.0 count / average Tanimoto]
OSRA: 1% / 0.73% / 0.17
MolVec: 1% / 1.61% / 0.23
Imago: 3% / 3.50% / 0.22
Img2Mol: 5% / 9.34% / 0.52
MolScribe: 7.65% / 11.18% / 0.59
SwinOCSR: 5% / 8.49% / 0.64
DECIMER (No Hand Drawn): 27% / 32.90% / 0.69
DECIMER (Synthetic Hand Drawn): 67.39% / 74.63% / 0.93
DECIMER
Workflow
Rajan, K., et al. Nature Communications 14, 5045 (2023).
Smiles TO iUpac Translator
Smiles TO iUpac Translator Version 2.0
STOUT: Smiles TO iUpac Translator
Data Sources: ZINC15, PubChem
IUPAC Names Generated with OpenEye Lexichem
Total Training Data Size: 998 Million
Maximum Length of SMILES: 600
Maximum Length of IUPAC Names: 700
Training Infrastructure: TPU V4-256 Chips, Google Cloud
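The two length limits above act as a filter on the training pairs. A minimal sketch of such a filter, assuming lengths are counted in characters (the slide does not say whether characters or tokens are meant); the constant and function names are hypothetical.

```python
# Limits from the slide; counting in characters is an assumption of this sketch.
MAX_SMILES_LEN = 600
MAX_IUPAC_LEN = 700


def within_limits(smiles: str, iupac_name: str) -> bool:
    """Keep a (SMILES, IUPAC name) pair only if both strings fit the limits."""
    return len(smiles) <= MAX_SMILES_LEN and len(iupac_name) <= MAX_IUPAC_LEN


pairs = [
    ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "1,3,7-trimethylpurine-2,6-dione"),
    ("C" * 601, "hypothetical over-long entry"),  # made-up example that is too long
]
kept = [pair for pair in pairs if within_limits(*pair)]
print(len(kept))  # 1
```

Fixed maximum lengths like these also determine the padded sequence length the translation model is trained on.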
STOUT: Smiles TO iUpac Translator
https://decimer.ai
Dear Both,
I know the AI world is now crazy. I asked Gemini what is DECIMER?
Results:
DECIMER (Deep Learning for Chemical Image Recognition) is an open-source platform designed to automatically identify, segment, and recognize
chemical structures within images from scientific publications. It utilizes deep learning models to extract chemical information from images and
convert it into machine-readable formats, such as SMILES.
DECIMER is a valuable tool in cheminformatics research, as it helps automate the process of extracting chemical data from literature, making it
more accessible for further analysis and research.
There are a few key components of the DECIMER project:
•DECIMER 1.0: The initial deep learning model for converting chemical structure images into SMILES.
•DECIMER-Segmentation: A tool for automatically extracting chemical structure depictions from scientific literature.
•DECIMER 2.0: An updated version of DECIMER that uses EfficientNet V2 and Transformer architectures, along with Google's TPUs, for faster training on large datasets.
•DECIMER.ai: An open platform that combines all these tools, offering a user-friendly interface for chemical structure recognition tasks.
You can learn more about DECIMER through the following resources:
•DECIMER.ai Web Application: https://decimer.ai/
•DECIMER.ai Publication: https://www.nature.com/articles/s41467-023-40782-0
•DECIMER 1.0 Publication: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00469-w
I hope this information is helpful! Let me know if you have any other questions.
Afterwards, I asked it to write a song based on this
information. I took the lyrics and gave them to suno.ai. Using
the new v3.5 model, I could create a 4-minute song
very easily.