CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an open cip

NextMoveSoftware 4,847 views 28 slides Aug 20, 2017

About This Presentation

The Cahn-Ingold-Prelog (CIP) priority rules have been the cornerstone of written communication of stereochemical configuration for more than half a century. The rules rank ligands around a stereocentre, allowing a stereo-descriptor that is invariant to atom order and layout to be assigned, for example R (rig...


Slide Content

ACS Fall 2017, Washington, D.C.
comparing cahn-ingold-prelog
rule implementations:
the need for an open cip
John Mayfield, Daniel Lowe, Roger Sayle

“The Cahn–Ingold–Prelog (CIP) sequence rules … are a
standard process used in organic chemistry to completely and
unequivocally name a stereoisomer of a molecule.” - Wikipedia

If you are not naming stereoisomers
you (probably) don’t want to use CIP
Tools can give different answers,
What can we do about it?

NUMBER OF STEREOCENTRES PER ENTRY
[Figure: bar chart of % of Dataset for each dataset (chebi_154, chembl_23, pubchem, pubchem_substance, eMolecules170601), with a Count legend of 0 to 9 and +]
eMolecules 2017-Jun-01: 14 million total
PubChem Substance: 234 million total
PubChem Compound (Aug 17): 93 million total
ChEMBL 23: 1.7 million total
ChEBI 154: 95 thousand total

Many chemists are taught the CIP rules during their education; the rules are deceptively simple
‣Simple cases are easy for a human (and computers)
‣Complex cases are hard for a human (and computers)
IUPAC Blue Book (2013) extends the recommendations but is incomplete (and contains some mistakes)

The Sequence RULES
(in essence)
Rule 1
a. Higher atomic number precedes lower
b. An atom node duplicated closer to the root ranks higher than one duplicated further
Rule 2 Higher atomic mass number precedes lower
Rule 3 Z precedes E and this precedes nonstereogenic (nst) double bonds
Rule 4
a. Chiral stereogenic units precede pseudoasymmetric stereogenic units and these precede nonstereogenic units (R = S > r = s > nst)
b. When two ligands have different descriptor pairs, the one with the first chosen like descriptor pairs has priority over the one with a corresponding unlike descriptor pairs
c. r precedes s
Rule 5 An atom or group with descriptor R has priority over its enantiomorph S
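
The rules form a hierarchy: a later rule is only consulted once an earlier rule has been applied to the whole digraph without splitting the tie. A minimal sketch of that control flow follows, assuming a toy ligand record (hypothetical, not from any toolkit) that already lists atomic numbers and mass numbers sphere by sphere; only Rules 1a and 2 are modelled.

```python
# Toy ligand record (an assumption for illustration): per-sphere lists of
# atomic numbers ("z") and mass numbers ("mass"), each sorted high to low.
def by_spheres(key):
    """Build a comparator that walks every sphere of both ligands for one property."""
    def compare(a, b):
        # zip() assumes both toy ligands list the same number of spheres
        for sphere_a, sphere_b in zip(a[key], b[key]):
            if sphere_a != sphere_b:
                return 1 if sphere_a > sphere_b else -1
        return 0
    return compare

# Rules in hierarchical order; only 1a and 2 are sketched here.
RULES = [("1a", by_spheres("z")), ("2", by_spheres("mass"))]

def compare_ligands(a, b):
    """Return (result, rule name); a rule is tried only if all earlier rules tied."""
    for name, rule in RULES:
        result = rule(a, b)  # the rule sees every sphere before the next rule runs
        if result != 0:
            return result, name
    return 0, None

# -CD2-CH3 versus -CH2-CH2-OH: the deuteriums sit closer to the root,
# but Rule 1a finds the oxygen in the third sphere first.
ethyl_d2 = {"z": [[6], [6, 1, 1], [1, 1, 1]], "mass": [[12], [12, 2, 2], [1, 1, 1]]}
hydroxyethyl = {"z": [[6], [6, 1, 1], [8, 1, 1]], "mass": [[12], [12, 1, 1], [16, 1, 1]]}
methyl = {"z": [[6], [1, 1, 1]], "mass": [[12], [1, 1, 1]]}
methyl_d3 = {"z": [[6], [1, 1, 1]], "mass": [[12], [2, 2, 2]]}

print(compare_ligands(ethyl_d2, hydroxyethyl))  # (-1, '1a')
print(compare_ligands(methyl_d3, methyl))       # (1, '2')
```

The first comparison is the point of the sketch: an isotope difference in an early sphere does not get a chance to decide the ranking while Rule 1a can still find a difference further out.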

Example
[Figure: structure with numbered atoms (3 is the stereocentre) and its digraph, with spheres (i) and (ii) marked]
1. In sphere (i), C2 and C5 are tied: O > C5 = C2 > H
2. In sphere (ii), C2 and C5 are split: C,H,H > H,H,H, and therefore C2 > C5
3. The priority is 4, 2, 5, 6 and the configuration is S
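
The sphere-by-sphere comparison in this example can be sketched in a few lines. This is a simplification that only covers Rule 1a on an acyclic molecule: each branch is summarised as one sorted list of atomic numbers per sphere, whereas a full implementation compares sets branch by branch and handles duplicate nodes. Atom numbers 1 to 6 follow the slide; the ids given to the remaining hydrogens are hypothetical.

```python
from collections import defaultdict

# The example molecule, numbered as on the slide: 3 is the stereocentre,
# 4 the oxygen, 6 the hydrogen on the stereocentre. Ids 7 and up are
# hypothetical labels for the remaining hydrogens.
ATOMIC_NUM = {1: 6, 2: 6, 3: 6, 4: 8, 5: 6, 6: 1}
BONDS = [(1, 2), (2, 3), (3, 4), (3, 5), (3, 6)]
EXTRA_HYDROGENS = {1: 3, 2: 2, 4: 1, 5: 3}

NEIGHBOURS = defaultdict(list)
for a, b in BONDS:
    NEIGHBOURS[a].append(b)
    NEIGHBOURS[b].append(a)
next_id = 7
for heavy, count in EXTRA_HYDROGENS.items():
    for _ in range(count):
        ATOMIC_NUM[next_id] = 1
        NEIGHBOURS[heavy].append(next_id)
        NEIGHBOURS[next_id].append(heavy)
        next_id += 1

def spheres(root, ligand):
    """Atomic numbers in each sphere of the branch that starts at `ligand`."""
    result = []
    frontier = [(ligand, root)]  # (atom, parent) pairs; acyclic input assumed
    while frontier:
        result.append(sorted((ATOMIC_NUM[a] for a, _ in frontier), reverse=True))
        frontier = [(n, a) for a, p in frontier for n in NEIGHBOURS[a] if n != p]
    return result

def rank_ligands(root):
    """Order the neighbours of `root` from highest to lowest priority (Rule 1a only)."""
    return sorted(NEIGHBOURS[root], key=lambda lig: spheres(root, lig), reverse=True)

print(spheres(3, 2))    # [[6], [6, 1, 1], [1, 1, 1]]
print(spheres(3, 5))    # [[6], [1, 1, 1]]
print(rank_ligands(3))  # [4, 2, 5, 6]
```

C2 and C5 tie on the first sphere and split on the second, which reproduces the priority 4, 2, 5, 6 given above.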

DIGRAPHS
•Rules are applied to hierarchical directed acyclic graphs (digraphs)
•Comparison proceeds in “spheres” out from the root of the graph
•Combinatorial explosions for some structures
[Figure: example structure with numbered atoms and its digraph; duplicated nodes such as (1) are shown in parentheses]

PSEUDO-ASYMMETRY
Some confusion over lower case r and s
•Assigned only when Rule 5 has been used
•Not an indication of non-constitutional stereochemistry
Why? Reflection is superimposable:

AUXILIARY DESCRIPTORS
Auxiliary descriptors are used to split ties in symmetric
molecules by labelling the asymmetric digraphs
Tie in initial digraph
Calculate auxiliary
descriptors
R > S (Rule 5) 3:r
Picture: May, J. W. (2015). Cheminformatics for genome-scale metabolic
reconstructions (doctoral thesis).

mancude ring handling
P-92.1.4.4 Nomenclature of Organic Chemistry: IUPAC Recommendations and
Preferred Names 2013
Different Kekulé forms can result in different digraphs
Handled using fractional atomic numbers
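
As a worked illustration of the fractional atomic number idea, the sketch below averages, over the Kekulé forms, the atomic number of the atom a ring atom is double-bonded to. The function name and the pyridine example are assumptions chosen for illustration; the Blue Book section cited on this slide is the authority for the actual procedure.

```python
from fractions import Fraction

def duplicate_atomic_number(partners):
    """Mean atomic number of the double-bond partner over all Kekulé forms.

    `partners` holds one atomic number per Kekulé form: the atom that the
    ring atom is double-bonded to in that form.
    """
    return Fraction(sum(partners), len(partners))

# Carbon next to the nitrogen in pyridine (illustrative): double-bonded to
# N (7) in one Kekulé form and to C (6) in the other.
print(duplicate_atomic_number([7, 6]))     # 13/2
# Three contributing forms with partners C, C and N.
print(duplicate_atomic_number([6, 6, 7]))  # 19/3
```

Using exact fractions rather than floats keeps later comparisons such as 13/2 against 19/3 free of rounding surprises.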


MAJORITY HANDLED BY RULE 1a
Columns: ChEBI, ChEMBL, eMolecules, PubChem Compound, PubChem Substance
Rule 1a   281K 99.6%   1.8M 98.6%   2.4M 97.0%   53.5M 100.0%   93.1M 98.7%
Rule 1b   4   1   164   255
Rule 2    14   3,565   6,789
Rule 3    29   3   441   36   45
Rule 4a   122   126   273   4   12,770
Rule 4b   563 0.2%   4,037 0.2%   3,188 0.1%   125K 0.1%
Rule 4c   19   558
Rule 5    285 0.1%   23.4K 1.2%   69K 2.8%   15   1.1M 1.2%
Total     282K   1.9M   2.4M   53.5M   94.3M
Count is number of stereocentres; values of zero and percentages close to zero removed to reduce complexity

[Figure: % of Stereocentres (0 to 100) against Sphere (1 to 10, distance from root), plotted per dataset: chebi_154, chembl_23, eMolecules170601, pubchem, pubchem_substance]
Majority (but not all) stereocentres labelled within first few spheres
Best to generate digraph lazily as required
Some digraphs are far too big to generate fully (e.g. fullerenes)
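
Lazy generation can be sketched with a Python generator that yields one sphere at a time, so a caller that resolves a tie in the first few spheres never pays for the rest. This is a minimal sketch under stated assumptions: the adjacency map and function name are hypothetical, only ring closures create duplicate nodes here, and multiple bonds (which also add duplicates) are left out.

```python
from itertools import islice

def lazy_spheres(adj, root):
    """Yield the digraph one sphere at a time as (atom, is_duplicate, path) nodes."""
    sphere = [(root, False, (root,))]
    while sphere:
        yield sphere
        following = []
        for atom, is_duplicate, path in sphere:
            if is_duplicate:
                continue  # duplicate nodes are terminal and are not expanded
            parent = path[-2] if len(path) > 1 else None
            for neighbour in adj[atom]:
                if neighbour == parent:
                    continue
                # an atom already on the path from the root closes a ring:
                # add it as a duplicate node instead of walking round again
                following.append((neighbour, neighbour in path, path + (neighbour,)))
        sphere = following

# Three-membered carbon ring, hydrogens omitted for brevity.
ring = {1: [2, 3], 2: [1, 3], 3: [1, 2]}

first_two = list(islice(lazy_spheres(ring, 1), 2))       # only two spheres are built
sizes = [len(s) for s in lazy_spheres(ring, 1)]          # full expansion
duplicates = [n for s in lazy_spheres(ring, 1) for n in s if n[1]]
print(sizes)            # [1, 2, 2, 2]
print(len(duplicates))  # 2
```

Even in this tiny ring the root appears three times (once as the root and twice as a duplicate); in fused ring systems the number of paths, and so the number of nodes, grows much faster.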

comparison

Rule 1A
III
Centres 2.0 RR
JMol 14.20.3 RR
ACD/ChemSketch 14.05beta RR
Balloon 1.6.5beta RR
KnowItAll ChemWindow 2018 RR
ChemDraw 16.0 RR
BIOVIA Draw 2017 RR
MarvinSketch 17.17 R-
Indigo 1.3.0Beta.r16 -R
RDKit 2017.03.03 SR
DataWarrior 4.6.0 RR
CACTVS (NCI Resolver Aug 17) RR
OPSIN 2.3.1 RR
LexiChem (OEChem) 20170613 RR
ChemDoodle 7.0.2 RR
CDK 2.0 -R
JUMBO 6 R-
I
II

Rule 1B
Centres 2.0 R
JMol 14.20.3 R
ACD/ChemSketch 14.05beta R
Balloon 1.6.5beta R
KnowItAll ChemWindow 2018 R
ChemDraw 16.0 R
BIOVIA Draw 2017 -
MarvinSketch 17.17 -
Indigo 1.3.0Beta.r16 -
RDKit 2017.03.03 R
DataWarrior 4.6.0 -
CACTVS (NCI Resolver Aug 17) -
OPSIN 2.3.1 R
LexiChem (OEChem) 20170613 -
ChemDoodle 7.0.2 -
CDK 2.0 -
JUMBO 6 -

Rule 2
Jan 2015 | Aug 2017
Centres R R
JMol n/a R
ACD/ChemSketch R R
Balloon 1.6.5beta n/a R
KnowItAll ChemWindow n/a R
ChemDraw S S
Accelrys/BIOVIA Draw S R
MarvinSketch S S
Indigo R R
RDKit S S
DataWarrior S S
CACTVS S R
OPSIN R R
LexiChem (OEChem) S R
ChemDoodle S n/a
CDK S S
JUMBO - -
R or S? Let’s Vote https://nextmovesoftware.com/blog/2015/01/21/r-or-s-lets-vote/

Rule 4b
S
S
S
R
Centres 2.0 R
JMol 14.20.3 R
ACD/ChemSketch 14.05beta R
Balloon 1.6.5beta R
KnowItAll ChemWindow 2018 R
ChemDraw 16.0 R
BIOVIA Draw 2017 R
MarvinSketch 17.17 R
Indigo 1.3.0Beta.r16 R
RDKit 2017.03.03 S
DataWarrior 4.6.0 S
CACTVS (NCI Resolver Aug 17) S
OPSIN 2.3.1 -
LexiChem (OEChem) 20170613 -
ChemDoodle 7.0.2 s
CDK 2.0 -
JUMBO 6 -

MANCUDE RINGS
Centres 2.0 RR
JMol 14.20.3 RR
ACD/ChemSketch 14.05beta RR
Balloon 1.6.5beta RR
KnowItAll ChemWindow 2018 RR
ChemDraw 16.0 RR
BIOVIA Draw 2017 RR
MarvinSketch 17.17 RR
Indigo 1.3.0Beta.r16 SR
RDKit 2017.03.03 RR
DataWarrior 4.6.0 RR
CACTVS (NCI Resolver Aug 17) SR
OPSIN 2.3.1 SR
LexiChem (OEChem) 20170613 SR
ChemDoodle 7.0.2 SR
CDK 2.0 SR
JUMBO 6 SS
III
I
II

AUX DESCRIPTORS
Centres 2.0 R
JMol 14.20.3 R
ACD/ChemSketch 14.05beta R
Balloon 1.6.5beta R
KnowItAll ChemWindow 2018 R
ChemDraw 16.0 R
BIOVIA Draw 2017 R
MarvinSketch 17.17 -
Indigo 1.3.0Beta.r16 -
RDKit 2017.03.03 -
DataWarrior 4.6.0 -
CACTVS (NCI Resolver Aug 17) -
OPSIN 2.3.1 -
LexiChem (OEChem) 20170613 -
ChemDoodle 7.0.2 -
CDK 2.0 -
JUMBO 6 -

hard to implement A
MarvinSketch 17.17
Turning aromaticity on flips stereochemistry (e.g. CHEBI:16063)
[Figure: a structure drawn twice, labelled (S), (S) in one depiction and (S), (R) in the other]
Labels depend on input order
[Figure: a structure with six OH groups and atoms numbered 1 to 12 in three different orders, each carrying a different set of (R), (S), (r) and (s) labels]

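hard to implement B
ChemDraw 16.0
[Figure: three structures with (CH2)n chains of increasing length: (CH2)2 and CH2, (CH2)11 and (CH2)10, (CH2)17 and (CH2)16. The first two carry an (R) label; the third carries no label.]
Becomes undefined at distance ≥ 16
[Figure: two further structures with (CH2)2 and (CH2)11 chains, labelled with (R), (r) and (s)]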

open cip?
Why?
•Provide a blessed implementation that can be
used directly or compared against
•Toolkit agnostic library to facilitate downstream
integration

“FIX-CIP” COLLABORATION
Robert Hanson (JMol), John Mayfield (Centres)
Mikko Vainio (Balloon), Andrey Yerin (ACD/Name),
Sophia Gillian Musacchio (St. Olaf College)
Goals
•Discuss and resolve software inconsistencies
•Generate comprehensive test set based on
BlueBook structure
•Recommend rule amendments and additions
Publication in preparation

should you use CIP?
Yes
Systematic nomenclature
Human conversation (if no pen is
handy)
Probably not (better algorithms exist)
Unique labelling (see right)
Compute “conversation”
Finding/cleaning stereocentres
No
Relative comparison, e.g.
substructure search

[Figure, right: a structure drawn twice with its stereocentres labelled (R) or (S)]

acknowledgements
SciMix Poster
Robert Hanson (JMol)
Mikko Vainio (Balloon)
Andrey Yerin (ACD/Name)
Sophia Gillian Musacchio (St. Olaf
College)
Karl Nedwed (Bio-Rad)
Noel O’Boyle (NextMove Software)
Shuzhe Wang (NextMove Software)
John Mayfield, Daniel Lowe and Roger Sayle NextMove Software Ltd, Cambridge, UK.
NextMove Software Limited Innovation Centre (Unit 23) Cambridge Science Park Milton Road, Cambridge UK CB4 0EY
www.nextmovesoftware.com
Comparing Cahn-Ingold-Prelog Rule Implementations: the need for open-cip
Introduction
The Cahn-Ingold-Prelog (CIP) priority rules rank atoms around a stereogenic unit to assign a stereo-descriptor that is invariant to atom order and layout, for example R (right) or S (left) for tetrahedral atoms. A directed acyclic graph (digraph) is constructed for each stereogenic unit and the out edges from the root node are compared and ranked according to eight sequence rules [1]. Each rule is applied exhaustively and tested on the entire digraph before applying the next rule [2].
Conclusion
The CIP sequence rules provide a standard way for chemists to effectively describe the configurations of most stereogenic units. However, beyond simple cases the complexity of the rules necessitates the use of software as an aid to naming configurations. The results demonstrate that, even then, software implementations do not all agree on the configuration. Through the results presented here and the on-going effort of the Fix CIP collaboration, software should aim to converge upon consistent stereochemistry naming. An Open CIP software tool could provide “blessed” stereochemistry configuration names and provide a standard algorithm implementation for other vendors to integrate or adapt.
Acknowledgements
Robert Hanson, Andrey Yerin, Mikko Vainio, and Sophia Gillian Musacchio for initiating and participating in the “Fix CIP” collaboration and the many in-depth technical discussions that have led to improvements in the tools. Karl Nedwed for providing KnowItAll results. Philip Skinner for providing ChemDraw licenses. Noel O’Boyle for feedback and suggestions.
Bibliography
1. P-92.1.3, Nomenclature of Organic Chemistry: IUPAC Recommendations and Preferred Names 2013
2. Paulina Mata. The CIP System Again: Respecting Hierarchies Is Always a Must. J. Chem. Inf. Comput. Sci., 1999, 39 (6)
Results
Rule 1
a. Higher atomic number precedes lower
b. An atom node duplicated closer to the root ranks higher than one duplicated further
Rule 2 Higher atomic mass number precedes lower
Rule 3 Z precedes E and this precedes nonstereogenic (nst) double bonds
Rule 4
a. Chiral stereogenic units precede pseudoasymmetric stereogenic units and these precede nonstereogenic units (R = S > r = s > nst)
b. When two ligands have different descriptor pairs, the one with the first chosen like descriptor pairs has priority over the one with a corresponding unlike descriptor pairs
c. r precedes s
Rule 5 An atom or group with descriptor R has priority over its enantiomorph S
Stereochemistry in Databases
[Figure: Number of Stereogenic Units; bar chart of % of Dataset for each dataset (chebi_154, chembl_23, pubchem, pubchem_substance, eMolecules170601), with a Count legend of 0 to 9 and +]
eMolecules (June 2017): 14 million records
PubChem Substance: 234 million records
PubChem Compound (Aug 2017): 93 million records
ChEMBL 23: 1.7 million records
ChEBI 154: 95 thousand records
The number of defined stereogenic units per molecule varies between databases.
The application of Rule 1a to the digraph for 2-butanol ranks the out edges connected to the root as 4 > 2 > 5 > 6, giving the label S (4 > 2 > 5 are anticlockwise looking towards 6).
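
Once the ligands are ranked, turning "clockwise or anticlockwise when looking towards the lowest-priority ligand" into a label is a sign test on a signed volume. The sketch below assumes 3D coordinates for the four ligands in priority order; the function name and the coordinates are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def descriptor(a, b, c, d):
    """Return 'R' or 'S' for ligand positions ranked a > b > c > d.

    Uses the sign of the triple product (a-d) . ((b-d) x (c-d)): negative when
    a -> b -> c runs clockwise as seen with d pointing away from the viewer.
    """
    u = [a[i] - d[i] for i in range(3)]
    v = [b[i] - d[i] for i in range(3)]
    w = [c[i] - d[i] for i in range(3)]
    cross = (v[1] * w[2] - v[2] * w[1],
             v[2] * w[0] - v[0] * w[2],
             v[0] * w[1] - v[1] * w[0])
    volume = sum(u[i] * cross[i] for i in range(3))
    if volume == 0:
        return None  # planar or degenerate input, no label
    return "R" if volume < 0 else "S"

# Viewer on the +z axis, lowest-priority ligand d pointing away along -z.
top, right, left, away = (0, 2, 0), (2, -1, 0), (-2, -1, 0), (0, 0, -1)
print(descriptor(top, right, left, away))  # top -> right -> left is clockwise
print(descriptor(top, left, right, away))  # top -> left -> right is anticlockwise
```

With the hypothetical coordinates above the first call gives R and the second gives S, the anticlockwise case described for 2-butanol.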
Columns: ChEBI, ChEMBL, eMolecules, PubChem Compound¹, PubChem Substance
Rule 1a   281K 99.6%   1.8M 98.6%   2.4M 97.0%   53.5M 100.0%   93.1M 98.7%
Rule 1b   4   1   164   255
Rule 2    14   3,565   6,789
Rule 3    29   3   441   36   45
Rule 4a   122   126   273   4   12,770
Rule 4b   563 0.2%   4,037 0.2%   3,188 0.1%   125K 0.1%
Rule 4c   19   558
Rule 5    285 0.1%   23.4K 1.2%   69K 2.8%   15   1.1M 1.2%
Total     282K   1.9M   2.4M   53.5M   94.3M
The majority of stereogenic units are constitutionally asymmetric and can be ranked using Rule 1a. However, in some datasets the number of stereogenic units requiring Rule 4b and 5 can be significant.
Tool                          I    II   III  IV   V    VI   VII  VIII IX   X    XIa  XIb  XII  XIII
Centres 2.0                   R    R    R    R    R    R    R    R    R    r    R    R    r    R
JMol 14.20.3                  R    R    R    R    R    R    R    R    R    r    R    R    r    R
ACD/ChemSketch 14.05beta      R    R    R    R    R    R    R    R    R    r    R    R    r    R
Balloon 1.6.5beta             R    R    R    R    R    R    R    R    R    r    R    R    r    R
KnowItAll ChemWindow 2018     R    R    R    R    R    R    R    R    R    r    R    R    r    R⁵
ChemDraw 16.0                 R    R    R    R    S    R    R    R    R    r    R    R    r    R
BIOVIA Draw 2017              R    R    R    -    R    R    R    R    R    -¹   R    R    -¹   R
MarvinSketch 17.17            R    -    -    -    S    R    -    R    -    r    R    R    r    -
Indigo 1.3.0Beta.r16          -²   R    -    -    R    -    R    R    R    r    S    R    -    -
RDKit 2017.03.03              S    R    S    R    S    R    R    S    R    R    R    R    -    -
DataWarrior 4.6.0             R    R    R    -    S    R    R    S    R    R    R³   R    -    -
CACTVS (NCI Resolver Aug 17)  R    R    S    -    S⁴   R    R    S    R    R    S    R    -    -
OPSIN 2.3.1                   R    R    R    R    R    -    -    -    -    -    S    R    -    -
LexiChem (OEChem) 20170613    R    R    -    -    R    -    -    -    -    -    S    R    -    -
ChemDoodle 7.0.2              R    R    -    -    S    -    -    s    -    r    S    R    -    -
CDK 2.0                       -    R    R⁵   -    S    -    -    -    -    -    S    R    -    -
JUMBO 6                       R    -    S    -    -    -    -    -    -    -    S    S    -    -
Column groups: Constitutional (Rule 1a, 1b, 2); Geometrical + Topographical (Rule 3, 4a, 4b, 4c, 5); Special (Mancude, Aux Descriptors)
1. Pseudoasymmetric r/s labels not displayed but must be calculated due to answers given for IX and XIII
2. Runtime error occurs
3. Impossible to test as different Kekulé forms are normalised
4. R in CACTVS since Feb 2015, NCI resolver is old version
5. Other descriptor is assigned differently
A set of fourteen structures was collected to identify differences between software implementations. The structures were selected to cover all the sequence rules and their applications to special cases.
Eight sequence rules (in essence)
Fix CIP Collaboration
Since submitting this work for presentation, the developers of Centres, JMol, ACD/ChemSketch, and Balloon have begun a collaboration. We are in the process of submitting for publication an extended in-depth validation set and proposing sequence rule refinements and additions where they are required.
1. As part of PubChem Compound’s processing, non-constitutional stereochemistry is removed: for example, the nine stereoisomers of inositol are all represented by CID 892.
Atoms connected by double and triple bonds as well as ring closures result in duplicated nodes in the digraph. In the structure below atoms 5 and 6 appear twice and atom 1 (the root) appears three times.
Due to this duplication, complex ring systems can generate exponentially large digraphs that are not computationally tractable. Further complexity in digraphs is caused by the use of fractional atomic numbers in mancude ring-systems and assignment of auxiliary descriptors for applying Rules 3-5.
[Figure: example structure with numbered atoms and its digraph; duplicated nodes such as (1) are shown in parentheses]
[Figure: 2-butanol with numbered atoms and its digraph]