Chemoinformatics:
Basic Concepts and Areas of Application
Alexandre Varnek
Laboratory of Chemoinformatics, University of Strasbourg
Double diploma UniStra/KFU
Chem(o)informatics
Cheminformatics
Chemical Informatics
Infochimie
Chémoinformatique
Хемоинформатика
Chemoinformatics is a generic term that encompasses the design, creation, organization,
management, retrieval, analysis, dissemination, visualization, and use of chemical
information
G. Paris, 1998
Chemoinformatics - definition
Chemoinformatics is the application of informatics methods
to solve chemical problems
J. Gasteiger, 2004
Chemoinformatics is the mixing of those information resources to transform data into
information and information into knowledge for the intended purpose of making better
decisions faster in the area of drug lead identification and optimization
F.K. Brown, 1998
Chemoinformatics is a field based on the representation of molecules as objects
(graphs or vectors) in a chemical space
A. Varnek & I. Baskin, 2011
Selected books in chemoinformatics
Gallium discovery:
the first QSAR success story
Predicted in 1869 by Dmitri Mendeleev
Discovered in 1875 by Paul-Émile Lecoq de Boisbaudran
Density (predicted) ≈ 6.0 g/cm³
Density (experimental, initial) = 4.7 g/cm³
Density (experimental, corrected) = 5.935 g/cm³
Chemoinformatics:
a new discipline combining several "old" fields
•Chemical databases
•Structure-Activity modeling (QSAR)
•Structure-based drug design
•Computer-aided synthesis design
Peter Willett Michael Lynch
Corwin Hansch Johann Gasteiger
Irwin D. Kuntz
Elias Corey Ivar Ugi
Hans-Joachim Böhm
• Needs for chemoinformatics
• Fundamentals of chemoinformatics
• Chemical Space paradigm
• Virtual screening approaches
• Perspectives
OUTLOOK
Needs in Chemoinformatics
Chemical universe
> 100 M compounds are
currently recorded
•How to select useful compounds from this huge dataset ?
•How to design new compounds ?
•How to synthesize these compounds ?
Target Protein
Large libraries
of molecules
High-Throughput Screening
Hit
experiment
computations
Virtual
Screening
Small Library of selected hits
Chemical universe:
• > 10⁸ compounds are currently available
• 10³³ druglike molecules could potentially be synthesised
(see P. Polischuk, T. Madzidov et al., JCAMD, 2013)
Virtual screening is indispensable for analysing the
huge number of possible protein-ligand combinations
Virtual screening must be very fast and efficient !
Human proteome:
Chemoinformatics as a
theoretical chemistry discipline
Chemoinformatics is defined as an individual discipline
characterized by its own molecular model, basic concepts,
major applications and learning approach
Theoretical chemistry
Quantum Chemistry
Force Field
Molecular Modelling
Chemoinformatics
- Molecular model
- Basic concepts
- Major applications
- Learning approaches
Molecular Model
Quantum Chemistry: electrons and nuclei
Force Field Molecular Modelling: atoms and bonds
Chemoinformatics: molecular graph, descriptor vector
Chemoinformatics is a field based on the representation of molecules as
objects (graphs or vectors) in a chemical space
Chemoinformatics: From Data to Knowledge
data (from measurement or calculation)
information (data placed in context)
knowledge (generalization of information)
inductive learning (from data to knowledge)
deductive learning (from knowledge to data)
Chemoinformatics learns from experimental data !
Basic concepts
Quantum Chemistry: wave/particle dualism
Force Field Molecular Modelling: classical mechanics
Chemoinformatics: chemical space
Chemical space paradigm
Chemical Space representations
graph-based / descriptor-based
SPACE = objects + metric
Graph-based chemical space
A. Schuffenhauer, P. Ertl, et al. J. Chem. Inf. Model., 2007, 47 (1), 47-58
Scaffold Tree
Natural Product Scaffold Tree
Courtesy of P. Ertl
Descriptor-based chemical space
vectorial space defined by molecular descriptors
Case study: Hansch Analysis
3 types of physicochemical parameters are used:
• Electronic (σ)
• Steric (δE_s)
• Hydrophobic (log P)
Biological Activity = f (Physicochemical parameters) + constant
Activity = a (log P)^2 + b log P + σ + δE_s + const
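A Hansch-type equation is linear in its coefficients, so it can be fitted by ordinary least squares. The sketch below is only an illustration: the descriptor values, the simulated activities and the coefficient names (`a`, `b`, `rho`, `delta`, `const`) are assumptions made up for this example, not data from the lecture.

```python
import numpy as np

# Hypothetical descriptor values for eight congeneric compounds (illustration only).
log_p = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
sigma = np.array([-0.17, 0.00, 0.23, 0.06, 0.45, 0.78, -0.27, 0.54])
e_s = np.array([0.00, -0.07, -0.36, -0.47, -1.54, -0.93, -0.55, -1.16])

# Hypothetical coefficients (a, b, rho, delta, const) used to simulate activities.
true_coef = np.array([-0.4, 1.8, 0.9, 0.3, 2.0])


def design_matrix(log_p, sigma, e_s):
    """Build the regression matrix with columns (log P)^2, log P, sigma, E_s, 1."""
    return np.column_stack([log_p**2, log_p, sigma, e_s, np.ones_like(log_p)])


X = design_matrix(log_p, sigma, e_s)
activity = X @ true_coef  # noise-free synthetic activities

# Ordinary least-squares estimate of the five coefficients.
coef, *_ = np.linalg.lstsq(X, activity, rcond=None)

# With a negative quadratic coefficient the parabola in log P has a maximum here.
log_p_opt = -coef[1] / (2 * coef[0])
print(coef, log_p_opt)
```

With real measurements the activities would carry noise, so the fitted coefficients would only approximate any underlying values, and the model would still need validation on compounds not used for fitting.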
Case study: Hansch Analysis
Molecule 1
Molecule 2
Molecular Descriptors:
a set of topological, electronic and geometric parameters calculated directly
from the molecular structure
Descriptors: D_1, D_2, …, D_i, …
Molecular graph
-Topological indices,
- Atomic charges,
- Inductive descriptors,
- Substructural fragments,
- Molecular volume and surface, …
Descriptor vector
> 5000 types of descriptors are reported
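To make the step from molecular graph to descriptor vector concrete, here is a minimal sketch in plain Python. It assumes a hydrogen-suppressed graph stored as an adjacency dictionary and computes three simple descriptors (atom count, bond count and the Wiener topological index). The function names and the two example graphs are hypothetical and chosen only for illustration; real descriptor packages compute far richer sets.

```python
from collections import deque


def shortest_path_lengths(adjacency, start):
    """Breadth-first search: topological distance from start to every atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist


def descriptor_vector(adjacency):
    """Return [atom count, bond count, Wiener index] for a connected graph."""
    n_atoms = len(adjacency)
    # Each bond appears twice in an undirected adjacency list.
    n_bonds = sum(len(nb) for nb in adjacency.values()) // 2
    # Wiener index: sum of distances over all unordered atom pairs.
    wiener = sum(
        d
        for atom in adjacency
        for d in shortest_path_lengths(adjacency, atom).values()
    ) // 2
    return [n_atoms, n_bonds, wiener]


# Hand-written carbon skeletons (hypothetical input format).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(descriptor_vector(n_butane))   # [4, 3, 10]
print(descriptor_vector(isobutane))  # [4, 3, 9]
```

The two isomers share atom and bond counts but differ in the Wiener index, which shows how a topological index can encode branching.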
Chemography:
Design and visualization of chemical space
Greenland: 2.2 M km²
Australia: 7.7 M km²
Arabian Peninsula: 3.5 M km²
Dimensionality Reduction problem
Swiss Roll
• GTM represents the latent space as a 2D “rubber sheet” (manifold) embedded in
the high-dimensional data space.
• The visualization plot is obtained by projecting the data points onto the manifold
and then letting the “rubber sheet” relax to its original flat form.
Generative Topographic Mapping (GTM)
N. Kireeva, I. Baskin, H. Gaspar, D. Horvath, G. Marcou, A. Varnek Mol. Inf. 2012, 31, 301–312
GTM of a dataset containing 10 activities from DUD
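GTM itself is a non-linear, probabilistic method and is not reproduced here. As a much simpler stand-in, the sketch below uses linear principal component analysis (PCA) to show the general idea of dimensionality reduction: mapping descriptor vectors onto two coordinates that can be plotted. The data are random numbers generated for illustration, and `pca_project` is a hypothetical helper name.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project rows of X onto the leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)  # centre each descriptor
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:n_components].T
    explained = (S**2)[:n_components] / np.sum(S**2)  # variance fractions
    return coords, explained


rng = np.random.default_rng(0)
# Hypothetical data: 50 "molecules" in a 10-dimensional descriptor space
# that lie close to a 2D plane, plus a little noise.
latent = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 10))
descriptors = latent @ mixing + 0.01 * rng.normal(size=(50, 10))

coords, explained = pca_project(descriptors)
print(coords.shape, explained.sum())
```

A linear projection like this cannot unroll a curved manifold such as the Swiss Roll, which is the motivation for non-linear methods like GTM.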
Similarity principle:
similar molecules possess similar properties
Chemical Similarity
0.82
0.39
0.84
0.72
0.67
0.64
0.53
0.56
0.52
reference
compound
Similar compounds possess similar properties
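Similarity scores of this kind are commonly computed as a Tanimoto coefficient between binary fingerprints. The sketch below assumes fingerprints stored as Python sets of "on" bit positions; the bit sets and molecule names are invented for illustration and do not correspond to the compounds on the slide.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # both fingerprints empty: a convention chosen for this sketch
    return len(fp_a & fp_b) / union


# Hypothetical fingerprints (sets of "on" bit indices).
reference = {1, 4, 7, 9, 12}
library = {
    "mol_A": {1, 4, 7, 9, 13},
    "mol_B": {1, 4, 20, 21},
    "mol_C": {30, 31},
}

# Rank library compounds by decreasing similarity to the reference.
ranked = sorted(
    library, key=lambda name: tanimoto(reference, library[name]), reverse=True
)
for name in ranked:
    print(name, round(tanimoto(reference, library[name]), 2))
```

The quantity 1 − Tanimoto can serve as a distance between molecules, which is one way to supply the metric that turns a set of molecules into a chemical space.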
Chemical space representation: Activity Landscapes
A_k = \frac{\sum_i A_i R_{ik}}{\sum_i R_{ik}}
Expectation of activity in node k for the training set
logK of Lu³⁺L complexes
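One reading of the node-activity expression is a weighted average: each training molecule i contributes its activity A_i to node k in proportion to a weight R_ik (in GTM, the responsibility of node k for molecule i). The sketch below computes this with NumPy. The weight matrix and activity values are made-up numbers, and the interpretation of R_ik as GTM responsibilities is an assumption.

```python
import numpy as np

# R[i, k]: hypothetical weight of node k for molecule i (each row sums to 1).
R = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.0],
    [0.0, 0.5, 0.5],
])
# A[i]: hypothetical activity (e.g. logK) of training molecule i.
A = np.array([4.0, 6.0, 10.0])

# A_k = sum_i A_i R_ik / sum_i R_ik, evaluated for every node at once.
node_activity = (A @ R) / R.sum(axis=0)
print(node_activity)
```

Nodes that receive no weight from any molecule would give a division by zero; this sketch does not handle that case, and a real implementation would need to mask such nodes.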
Strong binders Weak binders
Activity landscape of lanthanides’ binders
Generative Topographic Mapping
of the set of Ln binders
Contours correspond to different
logK values
H. Gaspar, I. Baskin, G. Marcou, A. Varnek unpublished results
Biopharmaceutics Drug Disposition Classification System
DATASET: 893 drugs
DESCRIPTORS: VolSurf
Case study: classification models for BDDCS classes
Visualization of models’ Applicability Domain
CPF ≤ 1, coverage =100 % CPF ≤ 5, coverage = 47 %
BDDCS classes probability distribution
Colored zones on the maps correspond to model’s applicability domain
H. Gaspar, G. Marcou, A. Varnek JCIM, 2013
Class Preference Factor: CPF = \frac{\max_k P(c_k \mid x)}{P(c_i \mid x)}, \quad \forall i \neq k
Chemoinformatics:
Properties predictions
Quantitative Structure-Activity Relationships
(QSAR)
Activity = F (structure)
= F (descriptors)
machine-learning methods
•neural networks, support vector machine,
random forest, naïve Bayes, PLS, …
A. Varnek & I. Baskin Machine Learning Methods in Chemoinformatics: Quo Vadis?
J. Chem. Inf. Model. 2012, 52, 1413−1437
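As a toy instance of Activity = F(descriptors), the sketch below trains a Bernoulli naïve Bayes classifier, one of the method families listed above, on binary fragment descriptors. Everything in it is an assumption for illustration: the fragment matrix, the active/inactive labels and the function names are invented, and the implementation is a bare-bones NumPy version rather than any particular library's API.

```python
import numpy as np


def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and per-class fragment probabilities (Laplace-smoothed)."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    p = np.array([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in classes
    ])
    return classes, log_prior, p


def predict_bernoulli_nb(model, X):
    """Assign each molecule to the class with the highest posterior log-probability."""
    classes, log_prior, p = model
    log_lik = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T
    return classes[np.argmax(log_prior + log_lik, axis=1)]


# Hypothetical training set: rows are molecules, columns mark the
# presence (1) or absence (0) of four substructural fragments.
X_train = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
])
y_train = np.array([1, 1, 1, 0, 0, 0])  # 1 = active, 0 = inactive (made-up labels)

model = fit_bernoulli_nb(X_train, y_train)
X_new = np.array([[1, 0, 0, 0], [0, 0, 1, 1]])
predictions = predict_bernoulli_nb(model, X_new)
print(predictions)  # [1 0]
```

A usable QSAR model would also need external validation and an applicability-domain check before its predictions are trusted for new compounds.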
predictions of > 20 physico-chemical
properties and NMR spectra for
each individual compound
Chemoinformatics tools in SciFinder:
Chemoinformatics:
virtual screening in 3D
Virtual screening : finding the needle in the haystack
CHEMICAL DATABASE
~10⁶ – 10⁹ molecules
What do these two molecules have in common?
Arg-Gly-Asp-Phe
Tirofiban
Pharmacophore model of a ligand complementary to integrin αIIbβ3
Positive charge, H-donor
Negative charge, H-acceptor
15.5 Å
5 Å
Hydrophobic interactions
pKi = 7.51
TanimotoCombo = 0.74
pKi = 7.82
TanimotoCombo = 0.67
pKi = 7.82
Molecular Shape similarity analysis
Molecular fields
Lock Key
Ligand-Protein complex
+
Hermann Emil Fischer
Ligand-to-protein docking :
Lock-and-key paradigm
Selected in silico designed compounds that were synthesized
and successfully tested for bioactivity
G. Schneider J Comput Aided Mol Des (2012) 26:115–120
Chemoinformatics:
areas of application
-Drug design (pharmacodynamics and pharmacokinetics),
-Prediction of physico-chemical properties,
-Materials design,
-Synthesis design,
-Molecular spectra simulations
Chemoinformatics:
perspectives
Assessment of biological activity
Assessment of side effects
See review by D. Rognan, British Journal of Pharmacology (2007), 1–15
Chemoinformatics : Complexity challenge
P. Csermely et al. Pharmacology & Therapeutics, 2012
Day 1: Databases
Veli-Pekka Hyttinen
Timur Madzidov
Gilles Marcou Dragos Horvath
Chemical Databases: Encoding, Storage and Search
of Chemical Structures
SciFinder - The choice for chemistry research
Tutorial with ChemAxon
Day 2: QSAR
Igor Tetko
Igor Baskin
Obtaining, Validation and Application of SAR/QSAR Models
SAR/QSAR Modelling: state of the art
Tutorial with OChem
Alex Tropsha
ADMET Predictions
Day 3: virtual screening in 3D
Conformational Sampling
Pharmacophore and Its Applications
Tutorial with LigandScout
Molecular Docking Methods
Gilles Marcou
Dragos Horvath
Thierry Langer
Sharon Bryant
Gilles Marcou Dragos Horvath
Tutorial with LeadIt
Day 4: Drug Design applications
Konstantin Balakin
Vladimir Poroikov
Computational Mapping Tools for Drug Discovery
Drug Design & Discovery in Academia