Chemical structure representation in PubChem

About This Presentation

252nd ACS National Meeting Philadelphia Fall 2016
Roger Sayle


Slide Content

Chemical structure representation in PubChem
Roger Sayle

NextMove Software, Cambridge, UK
252nd ACS National Meeting, Philadelphia, PA, Tuesday 23rd August 2016

Selected PubChem publications
•Sunghwan Kim, Paul A. Thiessen, Evan E. Bolton, Jie Chen, Gang Fu, Asta
Gindulyte, Lianyi Han, Jane He, Siqian He, Benjamin A. Shoemaker, Jiyao
Wang, Bo Yu, Jian Zhang and Stephen H. Bryant, “PubChem Substance and
Compound Databases”, Nucleic Acids Research, 2015.
•Volker D. Hähnke, Evan E. Bolton and Stephen H. Bryant, “PubChem atom
environments”, Journal of Cheminformatics, 7:41, 2015.
•Evan E. Bolton, Yanli Wang, Paul A. Thiessen, Stephen H. Bryant,
“PubChem: Integrated Platform of Small Molecules and Biological
Activities”, Annual Reports in Computational Chemistry, Volume 4,
Chapter 12, pp. 217-241, 2008.

Substance and compound
•A unique and invaluable feature of PubChem’s
architecture is the distinction between the deposited
structures (substances) and the normalized
structures (compounds), and the retention of both.

•PubChem Substance contains ~209.6M structures.
•PubChem Compound contains ~91.7M structures.
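
The substance/compound split can be pictured as a registry that groups every deposited record under its normalized structure while keeping the deposited form. The sketch below is only an illustration of that idea: the SIDs are made up, and TOY_STANDARD_FORM is a hypothetical lookup table standing in for PubChem's real standardization, which is far more involved.

```python
from collections import defaultdict

# Hypothetical stand-in for standardization: a few spellings of ethanol
# mapped to one normalized form. This is NOT PubChem's algorithm.
TOY_STANDARD_FORM = {
    "CCO": "CCO",
    "OCC": "CCO",
    "C(C)O": "CCO",
    "[CH3][CH2][OH]": "CCO",
}

def register(depositions):
    """Group deposited structures (substances) under normalized structures (compounds)."""
    compounds = defaultdict(list)
    for sid, deposited in depositions:
        normalized = TOY_STANDARD_FORM[deposited]
        # Retain the deposited form alongside the normalized key.
        compounds[normalized].append((sid, deposited))
    return dict(compounds)

# Made-up substance identifiers, for illustration only.
depositions = [(1, "CCO"), (2, "OCC"), (3, "[CH3][CH2][OH]")]
registry = register(depositions)
```

Three substances collapse onto one compound here, yet each deposited spelling remains recoverable from the registry.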

Molecular identity
•When are two chemical structures the same?
–Alternate chemical representations.
–Aromaticity and conjugation.
–Protonation states and tautomerism.
–Errors and typographical mistakes.

PubChem standardization service
https://pubchem.ncbi.nlm.nih.gov/standardize/standardize.cgi

Example 1: ethanol
•PubChem CID 702 has been deposited 1569 times
with six different explicit atom counts.
–1311 have 9 atoms and 8 bonds.
–249 have 3 atoms and 2 bonds.
–4 have 0 atoms and 0 bonds.
–2 have 4 atoms and 3 bonds.
–2 have 5 atoms and 4 bonds.
–1 has 7 atoms and 6 bonds.
•All have the same SMILES (“CCO”) and InChI.
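
The different atom counts are consistent with depositors drawing different numbers of hydrogens explicitly. The sketch below is a minimal model of that arithmetic, assuming ethanol is stored as three heavy atoms with implicit hydrogen counts; the function name and data layout are illustrative, not PubChem's. Each explicit hydrogen adds one atom and one bond.

```python
# Ethanol heavy-atom skeleton: (symbol, implicit hydrogen count), plus heavy-atom bonds.
HEAVY_ATOMS = [("C", 3), ("C", 2), ("O", 1)]
HEAVY_BONDS = [(0, 1), (1, 2)]

def explicit_counts(explicit_h_per_atom):
    """Return (atoms, bonds) when the given number of hydrogens on each
    heavy atom is drawn explicitly and the rest stay implicit."""
    for (symbol, implicit_h), n in zip(HEAVY_ATOMS, explicit_h_per_atom):
        if not 0 <= n <= implicit_h:
            raise ValueError(f"{symbol} carries only {implicit_h} hydrogens")
    h = sum(explicit_h_per_atom)
    return len(HEAVY_ATOMS) + h, len(HEAVY_BONDS) + h

all_explicit = explicit_counts([3, 2, 1])   # every hydrogen drawn
all_implicit = explicit_counts([0, 0, 0])   # heavy atoms only
hydroxyl_only = explicit_counts([0, 0, 1])  # only the O-H hydrogen drawn

# The six deposition counts on the slide add up to the stated total.
deposition_counts = [1311, 249, 4, 2, 2, 1]
total = sum(deposition_counts)
```

The three calls reproduce the 9/8, 3/2 and 4/3 atom/bond pairs listed above; the records with 0 atoms carry no structure and fall outside this model.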

Explicit vs. implicit hydrogens

Example 2: nitrobenzene
•PubChem CID 7416 has 164 distinct substance
depositions (2 without structures).

MDL molfile-ageddon
•Biovia 2017 changed the interpretation of CT files.
•This affects 342,689 SIDs and 213,097 CIDs.

Hydrogens: easy come/easy go?
•PubChem is inconsistent on protonation/hydrogens.
•Common organic element radicals are hydrogenated:
–[C] → C, [Cl] → Cl, [P] → P, [S] → S, [H] → [HH]
–[Li], [Be], [B], [Si], [As], [Se], [At], etc. remain unchanged.
•Some groups get deprotonated
–c1ccccc1[N+](=O)O → c1ccccc1[N+](=O)[O-]
•But generally protonation state is preserved
–CC(=O)O, CC(=O)[O-], [NH4+], [NH3+]CC(=O)[O-]
–C[N+](C)(C)O
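
The bare-atom behaviour listed above can be written down as a small lookup. The sketch below encodes only the cases shown on this slide; the function name is hypothetical, and any element the slide does not mention raises an error rather than guessing what PubChem does.

```python
# Element sets copied from the slide; nothing beyond it is assumed.
HYDROGENATED = {"C", "Cl", "P", "S"}
UNCHANGED = {"Li", "Be", "B", "Si", "As", "Se", "At"}

def normalize_bare_atom(smiles):
    """Map a single bracketed atom such as "[C]" to the form the slide reports."""
    if not (smiles.startswith("[") and smiles.endswith("]")):
        raise ValueError("expected a single bracketed atom")
    symbol = smiles[1:-1]
    if symbol == "H":
        return "[HH]"      # atomic hydrogen becomes molecular hydrogen
    if symbol in HYDROGENATED:
        return symbol      # unbracketed organic atom, hydrogens implied
    if symbol in UNCHANGED:
        return smiles
    raise ValueError(f"behaviour for {symbol!r} is not given on the slide")

examples = {s: normalize_bare_atom(s) for s in ["[C]", "[Cl]", "[H]", "[Si]"]}
```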

Example 3: o-xylene
•A major challenge in chemical databases is
aromaticity: recognizing that two structures that
differ only in their Kekulé forms are the same molecule.
CID 7237
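
A toy illustration of the problem, assuming o-xylene is stored as a plain bond table: the two Kekulé forms differ bond by bond, and only match once the ring bonds are treated as delocalized. The ring membership is hard-coded here; perceiving it in general is the difficult part, and this is not PubChem's procedure.

```python
# o-Xylene: ring atoms 0-5, methyl carbons 6 and 7 on adjacent ring atoms 0 and 1.
RING = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
METHYLS = [(0, 6), (1, 7)]

def kekule_form(double_bonds):
    """Build a bond table {atom pair: order} with the given ring double bonds."""
    bonds = {frozenset(b): 1 for b in RING + METHYLS}
    for b in double_bonds:
        bonds[frozenset(b)] = 2
    return bonds

def delocalize(bonds):
    """Replace the order of every ring bond with a single 'aromatic' label."""
    ring = {frozenset(b) for b in RING}
    return {b: ("aromatic" if b in ring else order) for b, order in bonds.items()}

form_a = kekule_form([(0, 1), (2, 3), (4, 5)])  # double bond between the methyl-bearing carbons
form_b = kekule_form([(1, 2), (3, 4), (5, 0)])  # single bond between them

naive_match = form_a == form_b
delocalized_match = delocalize(form_a) == delocalize(form_b)
```

A naive comparison reports two different structures (naive_match is False), while the delocalized comparison reports one (delocalized_match is True).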

PubChem canonical Kekulé SMILES
•A significant innovation in cheminformatics
was Evan Bolton’s development of a “canonical”
Kekulé SMILES form of a molecule.
•Different chemistry toolkits (and chemists!) differ in
opinion on which ring systems are aromatic and
which are not, hence PubChem’s wish to remain
“neutral” by only providing non-aromatic SMILES.

Bolton’s algorithm
•Steps of Bolton’s Canonical Kekulé Form Algorithm:

Tricky case: 10b,10c-dihydropyrene
•An important aspect is to aromatize all conjugated
cycles, not just those associated with SSSR.





•Unfortunately, this computationally demanding
requirement is a source of pain at the NCBI.

Conjugated ring systems
•Does it make sense to distinguish 4n+2 Hückel
aromaticity from conjugated ring systems?
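
For reference, the 4n+2 electron count itself is a one-line test. The sketch below checks only that count for a given number of π electrons; it says nothing about planarity or which cycle the electrons belong to, which is where toolkits actually disagree.

```python
def is_huckel_count(pi_electrons):
    """True when the pi-electron count has the form 4n + 2 for an integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0

results = {
    "benzene (6)": is_huckel_count(6),
    "cyclobutadiene (4)": is_huckel_count(4),
    "cyclooctatetraene (8)": is_huckel_count(8),
    "14-electron periphery (14)": is_huckel_count(14),
}
```

The 14-electron case corresponds to the outer conjugated cycle of a dihydropyrene, a cycle that a smallest-set-of-rings perception would not visit.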

Resonance forms
•CCN(=O)=O → CC[N+](=O)[O-]
•CCN=N#N → CCN=[N+]=[N-]
•CC[O+]=C=[N-] → CCOC#N
•C[P+](C)(C)[O-] → CP(=O)(C)C
•CC(=[NH2+])[O-] → CC(=O)N

•CS(=[OH+])(=O)[O-]

•C[S+2]([O-])([O-])C
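
One property every rewrite above should satisfy is conservation of net charge. The sketch below sums formal charges from bracket atoms with a regular expression, as a quick sanity check. It is a simplification that assumes charges are written as "+", "-", "+2" or "-2" at the end of the bracket, so it ignores forms such as "++" and atom-class suffixes.

```python
import re

BRACKET_ATOM = re.compile(r"\[([^\]]*)\]")
CHARGE = re.compile(r"([+-])(\d*)$")

def net_charge(smiles):
    """Sum the formal charges written inside bracket atoms of a SMILES string."""
    total = 0
    for content in BRACKET_ATOM.findall(smiles):
        m = CHARGE.search(content)
        if m:
            magnitude = int(m.group(2)) if m.group(2) else 1
            total += magnitude if m.group(1) == "+" else -magnitude
    return total

# Input/output pairs copied from the slide.
PAIRS = [
    ("CCN(=O)=O", "CC[N+](=O)[O-]"),
    ("CCN=N#N", "CCN=[N+]=[N-]"),
    ("CC[O+]=C=[N-]", "CCOC#N"),
    ("C[P+](C)(C)[O-]", "CP(=O)(C)C"),
    ("CC(=[NH2+])[O-]", "CC(=O)N"),
]
charges = [(net_charge(a), net_charge(b)) for a, b in PAIRS]
```

Every pair comes out as (0, 0): the normalization moves charges around but never changes the total.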

Tautomers are normalized
•CC(=N)O → CC(=O)N
•CC(=[NH2+])[O-] → CC(=O)N

•n1ccccc1O → [nH]1ccccc1=O
•n1ccc(O)cc1 → [nH]1ccc(=O)cc1

Classic tautomerism: Laar 1886
InChI=1S/C16H12N2O/c19-16-11-10-15(13-8-4-5-9-14(13)16)18-17-12-6-2-1-3-7-12/h1-11,19H
InChI=1S/C16H12N2O/c19-16-11-10-15(13-8-4-5-9-14(13)16)18-17-12-6-2-1-3-7-12/h1-11,17H
CID 5355205 (CAS 3651-02-3)
5 SIDs 13 SIDs
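
The two identifiers can be compared layer by layer to see exactly where the tautomers diverge. The sketch below simply splits each string on "/" and keys each layer by its prefix letter; it is a string-level comparison, not a use of the InChI library, and the helper name is illustrative.

```python
def inchi_layers(inchi):
    """Split an InChI string into its version, formula and lettered layers."""
    version, formula, *rest = inchi.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]   # e.g. "c..." -> key "c", "h..." -> key "h"
    return layers

SKELETON = "c19-16-11-10-15(13-8-4-5-9-14(13)16)18-17-12-6-2-1-3-7-12"
first = inchi_layers("InChI=1S/C16H12N2O/" + SKELETON + "/h1-11,19H")
second = inchi_layers("InChI=1S/C16H12N2O/" + SKELETON + "/h1-11,17H")

differing = sorted(key for key in first if first[key] != second[key])
```

Only the hydrogen layer differs: the formula and connectivity are identical, and the single movable hydrogen sits on atom 19 in one form and atom 17 in the other.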

But things could be improved...

Bonds to metals
•PubChem follows InChI in breaking bonds to metals.
–Table salt
•[Na]Cl → [Na+].[Cl-]
•[Na].[Cl] → [Na].Cl
–Zirconium(IV) ethoxide
•CCO[Zr](OCC)(OCC)OCC → [Zr].CCO.CCO.CCO.CCO
•[Zr+4].CC[O-].CC[O-].CC[O-].CC[O-]
–Grignard reagents
•c1ccccc1[Mg]Br → c1cccc[c-]1.[Mg+2].[Br-]
•c1ccccc1[Mg+].[Br-] → c1cccc[c-]1.[Mg+].[Br-]
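
The ionic disconnection seen for table salt and the first Grignard example can be sketched on a simple atom/bond list. This toy rule is an assumption made for illustration: it treats every bond as a plain atom pair, gives the metal +1 and its partner -1 for each broken bond, and does not reproduce every row above (the zirconium ethoxide output, for instance, is neutral).

```python
METALS = {"Na", "Mg", "Zr"}   # only the metals that appear on this slide

def disconnect_metals(atoms, bonds):
    """Break every metal/non-metal bond and shift one unit of charge per bond.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Returns (kept_bonds, charges).
    """
    charges = [0] * len(atoms)
    kept = []
    for i, j in bonds:
        i_metal, j_metal = atoms[i] in METALS, atoms[j] in METALS
        if i_metal != j_metal:
            metal, other = (i, j) if i_metal else (j, i)
            charges[metal] += 1
            charges[other] -= 1
        else:
            kept.append((i, j))
    return kept, charges

# Table salt drawn with a covalent bond: [Na]Cl
salt_bonds, salt_charges = disconnect_metals(["Na", "Cl"], [(0, 1)])

# Phenylmagnesium bromide: ring carbons 0-5, Mg on carbon 5, Br on Mg.
atoms = ["C"] * 6 + ["Mg", "Br"]
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
grignard_bonds, grignard_charges = disconnect_metals(atoms, ring + [(5, 6), (6, 7)])
```

The salt comes out as Na+ and Cl-, and the Grignard reagent as a phenyl anion, Mg2+ and Br-, matching the first row of each of those examples.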

Periodic table (circa 1997-2003)
•PubChem currently handles 109 of the 118 elements
in the periodic table [to be ratified in 2016].
•Hence “Mt” is the heaviest element at the moment.
•“Ds”, “Rg”, “Cn”, “Fl”, “Lv” already ratified.
•“Nh”, “Mc”, “Ts” and “Og” expected soon.

PubChem isotopes
•PubChem registration confirms that any specified
isotope has been observed experimentally.
•Hence [7CH4] is rejected, but [8CH4] is allowed.
•Interestingly, the [8CH4] of CID 11635947 has a
half-life of only two zeptoseconds (2×10⁻²¹ seconds).

•Another quirk is that PubChem doesn’t normalize
mononuclidic isotopes. Hence [19F]C (CID 58338844)
is the same molecule as FC (CID 11638), yet has its own CID.
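
A normalization of the kind this slide finds missing could be sketched as follows. The table is a deliberately small, illustrative subset of mononuclidic elements, and the function only strips the mass number inside the bracket; it is a suggestion of what such a step might look like, not something PubChem does.

```python
import re

# Illustrative subset of elements with a single stable isotope.
MONONUCLIDIC = {"F": 19, "Na": 23, "Al": 27, "P": 31, "I": 127}

# A mass number followed by an element symbol at the start of a bracket atom.
ISOTOPE_ATOM = re.compile(r"\[(\d+)([A-Z][a-z]?)")

def drop_mononuclidic_mass(smiles):
    """Remove the isotope label when it names an element's only stable isotope."""
    def strip_mass(match):
        mass, symbol = int(match.group(1)), match.group(2)
        if MONONUCLIDIC.get(symbol) == mass:
            return "[" + symbol
        return match.group(0)   # leave every other isotope label alone
    return ISOTOPE_ATOM.sub(strip_mass, smiles)

normalized = drop_mononuclidic_mass("[19F]C")     # "[F]C", the same molecule as FC
kept_label = drop_mononuclidic_mass("[18F]C")     # genuine isotope label is preserved
kept_carbon = drop_mononuclidic_mass("[8CH4]")    # carbon is not mononuclidic
```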

Disavowed by the government
•There are a number of species PubChem rejects:
–Chlorine dioxide O=[Cl]=O
–Carbide anions: [C-]#[C-] and [C-4]

•But there is hope…
–Disulfur dioxide: O=[S][S]=O → O=S=S=O

Related compounds/substances
•CID → CID
–Same Connectivity, Same Stereochemistry, Same Isotopes
–Same Parent Connectivity, Same Exact Parent
–Mixtures, Components and Neutralized Forms
–Unique Components
–Similar Compounds (90% Tanimoto), Similar Conformers
•CID → SID
–All, Same Structure, Mixture
•SID → SID
–Same Connectivity, Same Exact
•SID → CID
–PubChem SID




PubChem bond encoding
•PubChem allows depositors to specify advanced
representations of molecular structures such as
inorganics and organometallics via SD tags.
•PUBCHEM_NONSTANDARDBOND
–4 = Quadruple bond, 5 = Dative bond, 6 = Complex bond,
7 = Ionic bond.
•PUBCHEM_BONDANNOTATIONS
–2 = Hydrogen bond, 9 = Resonance bond, 10 = Bold bond,
11 = Fischer bond, 12 = Close contact.
•Relatively few depositors make use of these.
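
The codes above map naturally onto enumerations. In the sketch below the numeric values come from this slide, but the layout of the tag data (one "atom1 atom2 code" triple per line) is an assumption, as are the class and function names; check the PubChem deposition documentation before relying on it.

```python
from enum import IntEnum

class NonstandardBond(IntEnum):
    """Codes listed for PUBCHEM_NONSTANDARDBOND."""
    QUADRUPLE = 4
    DATIVE = 5
    COMPLEX = 6
    IONIC = 7

class BondAnnotation(IntEnum):
    """Codes listed for PUBCHEM_BONDANNOTATIONS."""
    HYDROGEN_BOND = 2
    RESONANCE = 9
    BOLD = 10
    FISCHER = 11
    CLOSE_CONTACT = 12

def parse_bond_tag(text, code_type):
    """Parse lines of 'atom1 atom2 code' (assumed layout) into typed records."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        atom1, atom2, code = (int(field) for field in fields)
        records.append((atom1, atom2, code_type(code)))
    return records

# Hypothetical tag body: atom 1 has a complex bond to atom 2 and a dative bond to atom 3.
records = parse_bond_tag("1 2 6\n1 3 5\n", NonstandardBond)
```

An unknown code raises ValueError from the enum constructor, which surfaces malformed depositions early.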

Final thoughts: abstract
For all of the grief that I give Evan, often over corner cases of chemical semantics that
only one or two people care about, it is fair to say that PubChem represents the
current state-of-the-art in chemical structure representation. Nobody does it better.
Under the surface, unseen to most users, are a large number of technical and scientific
innovations that have enabled PubChem to scale over the past decade and a half to
now contain approaching 100 million compounds. From simple design decisions such
as the substance vs. compound distinction [that allows PubChem to avoid the early
mistakes of CAS] to breakthroughs such as canonical Kekulé SMILES [to avoid the early
mistakes of Daylight Chemical Information Systems], the architecture of PubChem
contains a treasure trove of cheminformatics innovations, covering normalization,
tautomers, mixtures, 2D fingerprints and similarity, substructure search, biopolymers,
text mining and much more. During this presentation I hope to share some of the cool
insights that the remarkable staff at the NCBI often forget to mention or are too
modest to point out.
Congratulations Evan and Steve.

Acknowledgements
•Evan Bolton, Steve Bryant, Paul Thiessen, Volker
Hähnke, David Lipman and the PubChem team at the
NCBI.
•John May, at NextMove Software, for the analysis of
PubChem atom types affected by Biovia changes.
•The rest of the team at NextMove Software.
•George Vacek and the team at OpenEye Scientific
Software.