Chemical Structure Standardization and Synonym Filtering in PubChem
SunghwanKim95
May 01, 2020
About This Presentation
Presented at the 258th American Chemical Society (ACS) National Meeting in San Diego, CA (August 26, 2019).
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical data repository that provides information on various chemical entities, including small molecules, siRNA, miRNA, peptides, lipids, carbohydrates, chemically modified biologics, etc. One of the most commonly requested tasks in PubChem is to search for a compound by chemical name (also commonly called “chemical synonym”). PubChem performs this task by looking up chemical synonym-structure associations provided by individual depositors to PubChem. These name-structure associations are used to create links between chemicals and Medical Subject Headings (MeSH) terms, which in turn are used to generate associations between chemicals and PubMed articles. The accuracy of these depositor-provided synonym-structure associations is dependent upon two important quality control methods used in PubChem: (1) chemical structure standardization and (2) synonym filtering based on crowd voting. In this presentation, we will discuss the two quality control methods and their effects on the chemical synonym-structure associations.
Slide Content
Chemical Structure Standardization and Synonym Filtering in PubChem Sunghwan Kim, Ph.D., M.Sc. ACS National Meeting in San Diego, CA (August 26, 2019)
2 PubChem (https://pubchem.ncbi.nlm.nih.gov)
3 PubChem: Public chemical information resource. Collects data from 690+ sources. Disseminates data back to the public free of charge. Contains the largest amount of publicly available chemical information. Faces unique challenges in dealing with many big data issues on a daily basis, such as chemical structure standardization and name-structure association clean-up.
4-5 Data Organization in PubChem: 690+ data contributors. Substance deposition: depositor-provided substance descriptions, Substance ID (SID). Unique chemical structure extraction through standardization: unique chemical structures, Compound ID (CID). Assay deposition: depositor-provided bioactivity test results, Assay ID (AID). Assays record the activity of tested “substances”; the activity of “compounds” is derived from the associated “substances”.
6 Data Organization in PubChem: 690+ data contributors. Substance deposition: depositor-provided substance descriptions, Substance ID (SID). Unique chemical structure extraction through standardization: unique chemical structures, Compound ID (CID). Individual data depositors provide PubChem with chemical structures and chemical names (synonyms). They need to be organized and cleaned up through structure standardization and synonym filtering.
7 Common Issues with Chemical Structure Representations in PubChem
Drawing Conventions: Drawing conventions are often ignored in structures deposited by original data sources.
Aromatic Compounds: There are many Kekulé structures for aromatic compounds (shown: Kekulé 1, Kekulé 2, aromatic). Which one should be used as a standard?
Different Forms of the Same Molecule: Different tautomers, resonance forms, and protonation states (shown: tautomerism, mesomerism, ionization)! Choose the most stable one?
Different Forms of the Same Molecule: The stability depends upon the context (shown: most stable in vacuum, most stable in water).
12 PubChem Chemical Structure Standardization
PubChem Standardization
Validate chemical contents: atoms defined/real, implicit hydrogen, functional group, atom valence
Normalize representation: tautomer invariance, aromaticity detection, stereochemistry, explicit hydrogen
Detect components: isolate covalent units, neutralize (by H+ or e-), reprocess, detect unique components
Calculate: coordinates, properties, descriptors
14 J. Cheminform. (2018) 10:36
15 Standardization Statistics: ~90% of the substances are subject to standardization (mostly organic compounds). Standardization success rate: 99.64%. Modification rate: 44.43%. J. Cheminform. (2018) 10:36
Standardized Structures: The structure standardized by PubChem is not necessarily what one may expect (shown: most stable in vacuum, most stable in water, standardized by PubChem).
Standardized Structures: In most cases, tautomeric forms of a molecule are standardized into a single form. There are a few exceptions (shown: CID 18630 and CID 31261, tautomerization).
Standardization and Structure Identity Search: You can search PubChem using a structure as a query. The input structure may be provided using a line notation (e.g., SMILES, InChI) or drawn using the PubChem Sketcher. The input structure for an identity search is standardized before the search is performed. Therefore, hits from an identity search may have different structures from the original input structure.
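A structure query by line notation can also be sent programmatically. The sketch below only builds a request URL for PubChem's PUG REST interface, which the slides do not mention; the helper name `cids_by_smiles_url` is hypothetical, and the comment about input handling restates the slide rather than documented service behavior.

```python
from urllib.parse import quote

# Base address of PubChem's PUG REST service (not mentioned in the slides).
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cids_by_smiles_url(smiles: str) -> str:
    """Build a URL that asks for the CID(s) matching a SMILES string.

    Hypothetical helper for illustration. Per the slide, the input structure
    is standardized before matching, so the returned compound may be drawn
    differently from the query.
    """
    # Percent-encode characters such as "=" and "#" that occur in SMILES.
    # SMILES containing "/" may need to be sent another way; check the
    # service documentation before relying on this for such inputs.
    encoded = quote(smiles, safe="")
    return f"{PUG_REST_BASE}/compound/smiles/{encoded}/cids/TXT"


# One Kekulé form of benzene used as the query structure.
print(cids_by_smiles_url("C1=CC=CC=C1"))
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out so that the sketch runs without network access.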
22 MeSH Entry Terms: A set of “terms” related to ibuprofen, used to index PubMed articles to help find articles about ibuprofen.
23 Depositor-Supplied Synonyms: Synonyms provided for “substance” records by depositors. “Filtered” synonyms are provided on the “Compound” Summary.
24 Examples: Raw (unfiltered) depositor-provided synonyms associated with the largest number of CIDs
25-30 Unfiltered depositor-provided synonyms (page 1/3)
Synonym | # SIDs | # CIDs
N/A | 6,869 | 6,368
SPIRO COMPOUNDS WITH POLYCYCLIC COMPONENTS ARE NOT SUPPORTED IN CURRENT VERSION | 4,903 | 4,902
NULL | 4,610 | 4,599
ASSEMBLIES OF CYCLIC SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION | 2,554 | 2,554
NOT AVAILABLE | 1,867 | 1,816
LECITHIN | 1,157 | 1,142
DIACYLGLYCEROL | 847 | 842
DIGLYCERIDE | 841 | 841
MULTIPLICATIVE NOMENCLATURE IS NOT SUPPORTED IN CURRENT VERSION! | 797 | 794
VITASMLAB | 461 | 461
MIXTURE NAME | 419 | 413
CLA | 770 | 394
CHLOROPHYLL A | 749 | 393
NA | 7,081 | 371
Annotations on these slides: Various forms of “Not Available”. Great reduction in the structure count after structure standardization: SIDs named “NA” are standardized to Na (sodium). Error messages from name generation software. Names of chemical classes.
31-32 Unfiltered depositor-provided synonyms (page 2/3)
Synonym | # SIDs | # CIDs
(1-(5-CARBOXYPENTYL)-3,3-DIMETHYL-3H-INDOL-1-IUM-2-YL)METHANIDE HYDROBROMIDE | 405 | 345
ETHANONE,1- - | 328 | 328
CANNOT MAKE CHOICE: LIGANDS ARE COMPARED UP TO 10 SPHERES | 304 | 304
COMPLEX BRIDGED FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 302 | 302
TRIACYLGLYCEROL | 286 | 285
TRIGLYCERIDE | 286 | 285
QUINOLONE DER. | 280 | 279
UNABLE TO GENERATE VALUE | 274 | 264
UNL | 656 | 255
UNKNOWN LIGAND | 615 | 235
HEPT DERIV. | 213 | 211
MULTIPARENT NAMES FOR FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 208 | 208
ACHIRAL CENTER(S) | 187 | 187
Annotation on these slides: “Derivative” of a chemical.
34-37 Unfiltered depositor-provided synonyms (page 3/3)
Synonym | # SIDs | # CIDs
C9H11NO2 | 179 | 174
HEM | 4,645 | 165
BCR | 290 | 160
C10H13NO2 | 161 | 154
BETA-CAROTENE | 298 | 147
C8H10N2O2 | 149 | 144
C10H10N2O2 | 149 | 143
-ACETICACID | 141 | 141
C9H8N2O2 | 143 | 141
PROTOPORPHYRIN IX CONTAINING FE | 3,690 | 140
C8H9NO2 | 144 | 139
NAG | 9,599 | 130
METHANOL | 247 | 128
C8H9NO3 | 129 | 127
C10H9NO2 | 133 | 126
PYRIDINONE DERIV. | 130 | 126
N. A. | 128 | 125
Annotations on these slides: Molecular formula. Abbreviation for chemical names. Description. “Not available”.
38 Unfiltered depositor-provided synonyms
Depositor-provided synonyms include:
Real chemical names
Abbreviations for chemical names
“Derivatives” of some chemicals
Names of chemical classes
Molecular formulas
N/A, NULL, Not Available, NA, N.A., etc.
Error messages or comments
Not feasible to manually clean up. PubChem uses crowd-voting-based synonym filtering.
39 PubChem Synonym Filtering
40 PubChem Synonym Filtering: Crowd-voting approach. Check for a consensus among depositors on the name-structure association. Consensus threshold: >60% of the total votes. When a consensus is reached, the synonym is added to the “filtered” synonym list of the corresponding compound (standardized structure).
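The threshold rule can be sketched as a small function. This is an illustration of the ">60% of the total votes" rule stated on the slide, not PubChem's implementation; the function name `consensus` and the vote labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional

THRESHOLD_PERCENT = 60  # from the slide: consensus needs MORE than 60% of votes


def consensus(votes: Iterable[Hashable]) -> Optional[Hashable]:
    """Return the structure that wins more than 60% of the votes, else None."""
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        return None
    winner, top = counts.most_common(1)[0]
    # Integer comparison avoids floating-point edge cases at exactly 60%.
    return winner if top * 100 > THRESHOLD_PERCENT * total else None


print(consensus(["CID 2", "CID 2", "CID 1"]))  # 2 of 3 votes
print(consensus(["CID 1", "CID 2"]))           # 1 of 2 votes each
```

Because the comparison is strict, a structure holding exactly 60% of the votes (for example 3 of 5) does not reach consensus in this sketch.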
41 Synonyms that occur only “once” (diagram: Depositor 1, SID 1, Synonym A, CID 1): There is no disagreement in the name-structure association, so Synonym A is taken to mean CID 1 (although it may not be correct).
42 Synonyms occurring multiple times (diagram: Synonym A is attached to SID 1 through SID 10 from Depositors 1 to 4, and these substances map to CID 1, CID 2, and CID 3): Which one is the best choice?
43 Synonym filtering using crowd voting Two potential approaches Multiple-votes-per-depositor Single-vote-per-depositor
44 Multiple-Votes-per-Depositor Strategy (same diagram: Synonym A, SID 1 through SID 10, Depositors 1 to 4). Consensus threshold = 60%. # votes: CID 1, 3 (30%); CID 2, 5 (50%); CID 3, 2 (20%). No CID exceeds the threshold.
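The multiple-votes-per-depositor tally can be reproduced with a short sketch in which every substance (SID) counts as one vote. The arrows of the original diagram are not recoverable from the extracted text, so the SID-to-depositor and SID-to-CID assignment below is hypothetical, chosen only so that the totals match the slide (3, 5, and 2 votes).

```python
from collections import Counter

THRESHOLD_PERCENT = 60  # consensus needs more than 60% of the votes

# Hypothetical assignment (SID, depositor, CID) that reproduces the slide totals.
RECORDS = [
    ("SID 1", "Depositor 1", "CID 1"),
    ("SID 2", "Depositor 2", "CID 2"),
    ("SID 3", "Depositor 2", "CID 2"),
    ("SID 4", "Depositor 2", "CID 2"),
    ("SID 5", "Depositor 2", "CID 1"),
    ("SID 6", "Depositor 3", "CID 2"),
    ("SID 7", "Depositor 3", "CID 2"),
    ("SID 8", "Depositor 3", "CID 3"),
    ("SID 9", "Depositor 4", "CID 1"),
    ("SID 10", "Depositor 4", "CID 3"),
]

# Every substance casts one vote, regardless of who deposited it.
votes = Counter(cid for _sid, _depositor, cid in RECORDS)
total = sum(votes.values())

for cid in sorted(votes):
    print(cid, votes[cid], f"({votes[cid] * 100 // total}%)")

top_cid, top = votes.most_common(1)[0]
winner = top_cid if top * 100 > THRESHOLD_PERCENT * total else None
print("consensus:", winner)
```

With this strategy a depositor who submits many substances carries more weight than one who submits a single record, and in this example no structure passes the threshold.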
45-49 Single-Vote-per-Depositor Strategy (same diagram: Synonym A, SID 1 through SID 10, Depositors 1 to 4). Consensus threshold = 60%. # votes: CID 1, 1 (33%); CID 2, 2 (67%); CID 3, 0 (0%). Consensus has been reached! Synonym A = CID 2
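The single-vote-per-depositor tally can be sketched in the same way. Two things here are assumptions rather than statements from the slides: the SID-to-CID assignment is hypothetical (chosen to reproduce the 1, 2, and 0 votes shown), and each depositor is assumed to cast its one vote only when its own records agree by more than 60%, abstaining otherwise.

```python
from collections import Counter, defaultdict

THRESHOLD_PERCENT = 60  # consensus needs more than 60% of the votes

# Hypothetical (depositor, CID) pairs, one per substance record.
RECORDS = [
    ("Depositor 1", "CID 1"),
    ("Depositor 2", "CID 2"), ("Depositor 2", "CID 2"),
    ("Depositor 2", "CID 2"), ("Depositor 2", "CID 1"),
    ("Depositor 3", "CID 2"), ("Depositor 3", "CID 2"),
    ("Depositor 3", "CID 3"),
    ("Depositor 4", "CID 1"), ("Depositor 4", "CID 3"),
]


def winner(counts: Counter):
    """Return the key holding more than 60% of the counts, else None."""
    total = sum(counts.values())
    if total == 0:
        return None
    key, top = counts.most_common(1)[0]
    return key if top * 100 > THRESHOLD_PERCENT * total else None


def single_vote_per_depositor(records) -> Counter:
    """Collapse each depositor's records into at most one ballot."""
    per_depositor = defaultdict(Counter)
    for depositor, cid in records:
        per_depositor[depositor][cid] += 1
    ballots = Counter()
    for counts in per_depositor.values():
        choice = winner(counts)  # assumed: a depositor without internal consensus abstains
        if choice is not None:
            ballots[choice] += 1
    return ballots


ballots = single_vote_per_depositor(RECORDS)
result = winner(ballots)
print(dict(ballots), "->", result)
```

Under these assumptions Depositor 4 abstains (one record each for CID 1 and CID 3), leaving three ballots, of which CID 2 holds two.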
50 Additional consideration: Different contexts of chemical sameness. CID 6305 (L-Tryptophan), CID 1148 (Tryptophan), CID 9060 (D-Tryptophan), CID 12209747, CID 58478580
51 Additional consideration: Different contexts of chemical sameness
Abbr. | CACTVS hash code used | Description
CID | CID hash code | Connectivity + isotopes + stereochemistry
STE | CID stereo hash code | Connectivity + stereochemistry
CON | CID connectivity hash code | Connectivity
PCID | Parent CID hash code | CID of the parent compound
PSTE | Parent CID stereo hash code | STE of the parent compound
PCON | Parent CID connectivity hash code | CON of the parent compound
In practice, synonym filtering uses CACTVS hash codes (instead of CID) to determine whether a consensus is reached or not.
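The effect of voting at different levels of sameness can be sketched as follows. The hash strings are placeholders, not real CACTVS hash codes, and the ballot counts are invented for illustration; the only chemistry assumed is that L-, D-, and unspecified tryptophan share the same connectivity.

```python
from collections import Counter

THRESHOLD_PERCENT = 60  # consensus needs more than 60% of the votes

# One ballot per depositor for the name "tryptophan" (hypothetical data).
# Each ballot carries placeholder hash codes for three sameness contexts.
L_TRP = {"CID": "cid-L-trp", "STE": "ste-L-trp", "CON": "con-trp"}
DL_TRP = {"CID": "cid-trp", "STE": "ste-trp", "CON": "con-trp"}
D_TRP = {"CID": "cid-D-trp", "STE": "ste-D-trp", "CON": "con-trp"}
BALLOTS = [L_TRP, L_TRP, DL_TRP, DL_TRP, D_TRP]


def consensus_at(level: str, ballots):
    """Return the hash code that wins more than 60% at one level, else None."""
    counts = Counter(ballot[level] for ballot in ballots)
    total = sum(counts.values())
    if total == 0:
        return None
    code, top = counts.most_common(1)[0]
    return code if top * 100 > THRESHOLD_PERCENT * total else None


for level in ("CID", "STE", "CON"):
    print(level, consensus_at(level, BALLOTS))
```

In this example the depositors disagree on stereochemistry, so no consensus appears at the CID or STE level, while all five ballots agree at the connectivity (CON) level.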
55 Limitations of Synonym Filtering: Synonym filtering focuses on consistency, not correctness. It resolves the discrepancies in name-structure associations within and between depositors. It does not mean that filtered synonyms are correct. Example: Fentin acetate (CID 16682804). Its filtered synonyms include: m-Nitrobenzaldehyde 3-thio-4-o-tolylsemicarbazone; Benzaldehyde, m-nitro-, 3-thio-4-o-tolylsemicarbazone
58 Limitations of Synonym Filtering: Synonym filtering focuses on consistency, not correctness. Data sources integrate synonym data from other sources that are regarded as authoritative (e.g., government resources). Erroneous data in one source propagate into other sources. This practice lets incorrect name-chemical associations get more votes than they should during the synonym filtering process.
59 Limitations of Synonym Filtering: More than 90% of depositor-provided synonyms occur only once. They are automatically assigned to the structures represented by their corresponding CIDs.
60 Limitations of Synonym Filtering: Uracil (CID 1174); 2,4-Dihydroxypyrimidine (SID 377954591); 2-hydroxy-4(1h)-pyrimidinone (SID 341255477). Different tautomers are merged into one standardized tautomeric structure. Their names are also merged with those of the standardized tautomer.
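The merging of names can be sketched as a simple grouping step. The record layout and the function name `names_by_compound` are hypothetical; the SIDs, names, and CID are the ones shown on the slide.

```python
from collections import defaultdict

# Substance records from the slide: two tautomeric names, both standardized
# to the same compound (uracil, CID 1174).
SUBSTANCES = [
    {"sid": 377954591, "name": "2,4-Dihydroxypyrimidine", "cid": 1174},
    {"sid": 341255477, "name": "2-hydroxy-4(1h)-pyrimidinone", "cid": 1174},
]


def names_by_compound(substances):
    """Group depositor-provided names under the CID of the standardized structure."""
    merged = defaultdict(list)
    for record in substances:
        if record["name"] not in merged[record["cid"]]:
            merged[record["cid"]].append(record["name"])
    return dict(merged)


print(names_by_compound(SUBSTANCES))
```

Names that were correct for the deposited tautomer end up listed under a compound whose standardized drawing is a different tautomer, which is the limitation the slide points out.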
62 Summary
63 Summary: PubChem contains a large amount of chemical information provided by 690+ data sources. Through the chemical structure standardization process, PubChem standardizes depositor-provided chemical structures and extracts unique structures. PubChem uses crowd-voting-based synonym filtering to clean up name-structure associations provided by depositors.
64 Acknowledgements: Evan Bolton, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin Shoemaker, Paul Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang. The PubChem Team. PubChem depositors, users, and collaborators. Funded by the National Library of Medicine.