Chemical Structure Standardization and Synonym Filtering in PubChem

SunghwanKim95 133 views 64 slides May 01, 2020

About This Presentation

Presented at the 258th American Chemical Society (ACS) National Meeting in San Diego, CA (August 26, 2019).

PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical data repository that provides information on various chemical entities, including small molecules, siRNA, miRNA, peptides, lipi...


Slide Content

Chemical Structure Standardization and Synonym Filtering in PubChem Sunghwan Kim, Ph.D., M.Sc. ACS National Meeting in San Diego, CA (August 26, 2019)

2 PubChem (https://pubchem.ncbi.nlm.nih.gov)

3 PubChem
- Public chemical information resource.
- Collects data from 690+ sources.
- Disseminates data back to the public free of charge.
- Contains the largest amount of publicly available chemical information.
- Faces unique challenges in dealing with big data issues on a daily basis, including chemical structure standardization and cleanup of name-structure associations.

4-5 Data Organization in PubChem
[Diagram] 690+ data contributors make two kinds of deposition:
- Substance deposition: depositor-provided substance descriptions, identified by Substance ID (SID). Unique chemical structures are extracted from them through standardization and identified by Compound ID (CID).
- Assay deposition: depositor-provided bioactivity test results, identified by Assay ID (AID). They record the activity of tested "substances"; the activity of "compounds" is derived from the associated "substances".

6 Data Organization in PubChem
[Diagram] Substance deposition by 690+ data contributors: depositor-provided substance descriptions (SID) yield unique chemical structures (CID) through standardization.
Individual data depositors provide PubChem with:
- Chemical structures
- Chemical names (synonyms)
These need to be organized and cleaned up through:
- Structure standardization
- Synonym filtering

7 Common Issues with Chemical Structure Representations in PubChem

Drawing conventions Drawing conventions are often ignored in structures deposited by original data sources.

Aromatic Compounds
[Figure: Kekulé 1, Kekulé 2, and aromatic representations]
There are many Kekulé structures for aromatic compounds. Which one should be used as the standard?

Different Forms of the Same Molecule
[Figure: forms related by tautomerism, ionization, and mesomerism]
Different tautomers, resonance forms, and protonation states! Choose the most stable one?

Different Forms of the Same Molecule
[Figure: one form labeled "Most stable in vacuum", another labeled "Most stable in water"]
The stability depends upon the context.

12 PubChem Chemical Structure Standardization

PubChem Standardization
[Flowchart] Stages and their steps:
- Detect components: isolate covalent units; neutralize (by H+ or e-); reprocess; detect unique components
- Normalize representation: tautomer invariance; aromaticity detection; stereochemistry; explicit hydrogen
- Validate chemical contents: atoms defined/real; implicit hydrogen; functional group; atom valence
- Calculate: coordinates; properties; descriptors

14 J. Cheminform. (2018) 10:36

15 Standardization Statistics
- ~90% of the substances are subject to standardization. Mostly organic compounds.
- Standardization success rate: 99.64%
- Modification rate: 44.43%
Source: J. Cheminform. (2018) 10:36

Standardized Structures
[Figure: forms labeled "Most stable in vacuum" and "Most stable in water", and the structure standardized by PubChem]
The standardized structure is not necessarily what one may expect.

Standardized Structures
In most cases, tautomeric forms of a molecule are standardized into a single form. There are a few exceptions.
[Figure: CID 18630 and CID 31261, related by tautomerization]

Standardization and Structure Identity Search
You can search PubChem using a structure as a query. The input structure may be provided:
- using a line notation (e.g., SMILES, InChI)
- using the PubChem Sketcher
The input structure for an identity search is standardized before the search is performed. Therefore, hits from an identity search may have structures different from the original input structure.
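As a rough illustration of submitting a line notation as an identity query, the sketch below builds a request URL for PubChem's PUG REST interface. PUG REST is not mentioned in the slides; the `fastidentity` path layout, the helper name `identity_search_url`, and the hand-written SMILES are assumptions made for this example, and the block only constructs the URL without contacting PubChem.

```python
from urllib.parse import quote

# Base address of the PUG REST interface (assumed here; not part of the slides).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_search_url(smiles: str) -> str:
    """Build a URL asking for the CIDs identical to a SMILES query, as JSON."""
    # Percent-encode the SMILES so characters such as parentheses are URL-safe.
    encoded = quote(smiles, safe="")
    return f"{PUG_REST}/compound/fastidentity/smiles/{encoded}/cids/JSON"


# 2,4-Dihydroxypyrimidine, a tautomer of uracil (SMILES written by hand for illustration).
print(identity_search_url("Oc1ccnc(O)n1"))
```

Because the query is standardized before the search, the hit for a query like this one may be drawn as a different tautomer than the input.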

19 Standardization and Structure Identity Search
[Figure: identity search linking Uracil (CID 1174) with 2,4-Dihydroxypyrimidine (SID 377954591) and 2-Hydroxy-4(1H)-pyrimidinone (SID 341255477)]

20 Depositor-supplied synonyms & MeSH Entry Terms

21 Two kinds of chemical names in PubChem

22 MeSH Entry Terms A set of “terms” related to ibuprofen. Used to index PubMed articles to help find articles about ibuprofen.

23 Depositor-Supplied Synonyms Synonyms provided for “substance” records by depositors. “Filtered” synonyms are provided on the “Compound” Summary

24 Examples: raw (unfiltered) depositor-provided synonyms associated with the largest number of CIDs

25-30 Unfiltered Depositor-provided synonyms (page 1/3)
Synonym | # SIDs | # CIDs
N/A | 6,869 | 6,368
SPIRO COMPOUNDS WITH POLYCYCLIC COMPONENTS ARE NOT SUPPORTED IN CURRENT VERSION | 4,903 | 4,902
NULL | 4,610 | 4,599
ASSEMBLIES OF CYCLIC SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION | 2,554 | 2,554
NOT AVAILABLE | 1,867 | 1,816
LECITHIN | 1,157 | 1,142
DIACYLGLYCEROL | 847 | 842
DIGLYCERIDE | 841 | 841
MULTIPLICATIVE NOMENCLATURE IS NOT SUPPORTED IN CURRENT VERSION! | 797 | 794
VITASMLAB | 461 | 461
MIXTURE NAME | 419 | 413
CLA | 770 | 394
CHLOROPHYLL A | 749 | 393
NA | 7,081 | 371
Callouts added to this table on slides 26 to 30:
- Various forms of "Not Available"
- Great reduction in the structure count after structure standardization: SIDs are standardized to Na (sodium)
- Error messages from name generation software
- Names of chemical classes

31-32 Unfiltered Depositor-provided synonyms (page 2/3)
Synonym | # SIDs | # CIDs
(1-(5-CARBOXYPENTYL)-3,3-DIMETHYL-3H-INDOL-1-IUM-2-YL)METHANIDE HYDROBROMIDE | 405 | 345
ETHANONE,1- - | 328 | 328
CANNOT MAKE CHOICE: LIGANDS ARE COMPARED UP TO 10 SPHERES | 304 | 304
COMPLEX BRIDGED FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 302 | 302
TRIACYLGLYCEROL | 286 | 285
TRIGLYCERIDE | 286 | 285
QUINOLONE DER. | 280 | 279
UNABLE TO GENERATE VALUE | 274 | 264
UNL | 656 | 255
UNKNOWN LIGAND | 615 | 235
HEPT DERIV. | 213 | 211
MULTIPARENT NAMES FOR FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 208 | 208
ACHIRAL CENTER(S) | 187 | 187
Callout added to this table on slide 32:
- "Derivative" of a chemical

33-37 Unfiltered Depositor-provided synonyms (page 3/3)
Synonym | # SIDs | # CIDs
C9H11NO2 | 179 | 174
HEM | 4,645 | 165
BCR | 290 | 160
C10H13NO2 | 161 | 154
BETA-CAROTENE | 298 | 147
C8H10N2O2 | 149 | 144
C10H10N2O2 | 149 | 143
-ACETICACID | 141 | 141
C9H8N2O2 | 143 | 141
PROTOPORPHYRIN IX CONTAINING FE | 3,690 | 140
C8H9NO2 | 144 | 139
NAG | 9,599 | 130
METHANOL | 247 | 128
C8H9NO3 | 129 | 127
C10H9NO2 | 133 | 126
PYRIDINONE DERIV. | 130 | 126
N. A. | 128 | 125
Callouts added to this table on slides 34 to 37:
- Molecular formula
- Abbreviation for chemical names
- Description
- "Not available"

38 Unfiltered Depositor-provided synonyms
Depositor-provided synonyms include:
- Real chemical names
- Abbreviations for chemical names
- "Derivatives" of some chemicals
- Names of chemical classes
- Molecular formulas
- N/A, NULL, Not Available, NA, N.A., etc.
- Error messages or comments
It is not feasible to clean these up manually. PubChem uses crowd-voting-based synonym filtering.
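To make the categories above concrete, here is a small, purely illustrative sketch that sorts raw synonyms into some of them with string heuristics. This is not how PubChem filters synonyms (the slides describe a crowd-voting approach instead); the function name `categorize`, the category labels, and every pattern are assumptions chosen to fit the examples in the tables.

```python
import re

# Placeholder strings seen in the tables; the set itself is an assumption.
PLACEHOLDERS = {"N/A", "NA", "N.A.", "N. A.", "NULL", "NOT AVAILABLE"}
# A run of element-like symbols with optional counts, e.g. C9H11NO2.
FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")
# "DER." or "DERIV." as a standalone abbreviation.
DERIVATIVE = re.compile(r"\bDER(?:IV)?\.", re.IGNORECASE)


def categorize(synonym: str) -> str:
    """Assign a raw synonym to a rough category using string heuristics only."""
    name = synonym.strip()
    upper = name.upper()
    if upper in PLACEHOLDERS:
        return "placeholder"
    if "NOT SUPPORTED" in upper or upper.startswith(("CANNOT", "UNABLE")):
        return "error message"
    if DERIVATIVE.search(name):
        return "derivative"
    # Require a digit so that short abbreviations such as HEM are not misread.
    if FORMULA.fullmatch(name) and any(ch.isdigit() for ch in name):
        return "molecular formula"
    return "other"


for raw in ["N. A.", "C9H11NO2", "QUINOLONE DER.", "UNABLE TO GENERATE VALUE", "LECITHIN"]:
    print(raw, "->", categorize(raw))
```

Heuristics like these cannot tell a class name such as LECITHIN from a real chemical name, and they would treat NA as a placeholder even where a depositor meant sodium, which illustrates why rule-based cleanup does not scale.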

39 PubChem Synonym Filtering

40 PubChem Synonym Filtering
Crowd-voting approach:
- Check for a consensus on the name-structure association between depositors.
- Consensus threshold: >60% of the total votes.
- When a consensus is reached, the synonym is added to the "filtered" synonym list of the corresponding compound (standardized structure).

41 Synonyms that occur only once
[Diagram: Depositor 1 supplies Synonym A for SID 1, which maps to CID 1]
There is no disagreement in the name-structure association, so Synonym A is taken to mean CID 1 (although it may not be correct).

42 Synonyms occurring multiple times
[Diagram: Synonym A is attached to SID 1 (Depositor 1), SIDs 2 to 5 (Depositor 2), SIDs 6 to 8 (Depositor 3), and SIDs 9 and 10 (Depositor 4); the SIDs map to CID 1, CID 2, and CID 3]
Which one is the best choice?

43 Synonym filtering using crowd voting Two potential approaches Multiple-votes-per-depositor Single-vote-per-depositor

44 Multiple-Votes-per-Depositor Strategy
[Diagram as on slide 42]
# votes: CID 1: 3 (30%); CID 2: 5 (50%); CID 3: 2 (20%)
Consensus threshold = 60%, so no CID reaches a consensus.

45-49 Single-Vote-per-Depositor Strategy
[Diagram as on slide 42, built up over five slides]
# votes: CID 1: 1 (33%); CID 2: 2 (67%); CID 3: 0 (0%)
Consensus threshold = 60%
Consensus has been reached! Synonym A = CID 2
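The two voting strategies can be compared with a minimal sketch. The SID-to-CID assignments below are hypothetical: the flattened diagram does not preserve them, so they were chosen only to reproduce the tallies on the slides (3/5/2 votes by SID, 1/2/0 votes by depositor). The rule that a depositor votes only when its own SIDs exceed the same 60% threshold is likewise an assumption, inferred from the fact that four depositors cast only three votes.

```python
from collections import Counter

THRESHOLD = 0.60  # consensus requires strictly more than 60% of the votes

# Hypothetical (depositor, SID, CID) records for "Synonym A".
RECORDS = [
    ("Depositor 1", 1, "CID 1"),
    ("Depositor 2", 2, "CID 1"), ("Depositor 2", 3, "CID 1"),
    ("Depositor 2", 4, "CID 2"), ("Depositor 2", 5, "CID 3"),
    ("Depositor 3", 6, "CID 2"), ("Depositor 3", 7, "CID 2"),
    ("Depositor 3", 8, "CID 3"),
    ("Depositor 4", 9, "CID 2"), ("Depositor 4", 10, "CID 2"),
]


def consensus(votes):
    """Return the candidate holding more than THRESHOLD of the votes, else None."""
    tally = Counter(votes)
    total = sum(tally.values())
    if total == 0:
        return None
    winner, count = tally.most_common(1)[0]
    return winner if count / total > THRESHOLD else None


def multiple_votes_per_depositor(records):
    """Every SID counts as one vote."""
    return [cid for _, _, cid in records]


def single_vote_per_depositor(records):
    """Each depositor casts one vote, and only if its own SIDs reach a consensus."""
    by_depositor = {}
    for depositor, _, cid in records:
        by_depositor.setdefault(depositor, []).append(cid)
    votes = []
    for cids in by_depositor.values():
        choice = consensus(cids)
        if choice is not None:  # assumption: an undecided depositor abstains
            votes.append(choice)
    return votes


print(consensus(multiple_votes_per_depositor(RECORDS)))  # None (CID 2 has only 50%)
print(consensus(single_vote_per_depositor(RECORDS)))     # CID 2 (2 of 3 votes)
```

Counting one vote per depositor keeps a source with many SIDs from outweighing the others, which is the difference between the two outcomes here.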

50 Additional consideration: Different contexts of chemical sameness
[Figure: CID 6305 (L-Tryptophan), CID 1148 (Tryptophan), CID 9060 (D-Tryptophan), CID 12209747, CID 58478580]

51 Additional consideration: Different contexts of chemical sameness
Abbr. | CACTVS hash code used | Description
CID | CID hash code | Connectivity + isotopes + stereochemistry
STE | CID stereo hash code | Connectivity + stereochemistry
CON | CID connectivity hash code | Connectivity
PCID | Parent CID hash code | CID of the parent compound
PSTE | Parent CID stereo hash code | STE of the parent compound
PCON | Parent CID connectivity hash code | CON of the parent compound
In practice, synonym filtering uses CACTVS hash codes (instead of CID) to determine whether a consensus is reached or not.
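The effect of tallying votes at different levels of sameness can be sketched as follows. The hash strings are made-up stand-ins, not real CACTVS hash codes, and the vote list is hypothetical; the only chemistry relied on is that L-, D-, and unspecified tryptophan share a connectivity but differ in stereochemistry. The function name `consensus_at` is an assumption of this example.

```python
from collections import Counter

# Made-up keys standing in for hash codes at two levels of sameness.
HASHES = {
    6305: {"STE": "ste-L", "CON": "con-trp"},     # L-Tryptophan
    9060: {"STE": "ste-D", "CON": "con-trp"},     # D-Tryptophan
    1148: {"STE": "ste-none", "CON": "con-trp"},  # Tryptophan (stereo unspecified)
}


def consensus_at(level, voted_cids, threshold=0.60):
    """Tally votes by the hash key at `level`; return the key above threshold, else None."""
    if not voted_cids:
        return None
    tally = Counter(HASHES[cid][level] for cid in voted_cids)
    key, count = tally.most_common(1)[0]
    return key if count / len(voted_cids) > threshold else None


# Hypothetical votes for one synonym, split across stereo forms.
votes = [6305, 6305, 9060, 1148]
print(consensus_at("STE", votes))  # None: the largest stereo group has only 50%
print(consensus_at("CON", votes))  # con-trp: all votes share one connectivity
```

Votes that disagree at one level of detail can still agree at a coarser one, which is why the level at which sameness is judged matters for reaching a consensus.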

52-54 Filtered Depositor-provided synonyms with the largest number of CIDs
Synonym | # SIDs (before clustering) | # CIDs (before clustering) | # SIDs (after clustering) | # CIDs (after clustering)
124-07-2 (PARENT) | 27 | 25 | 27 | 25
VITAMIN B12 | 38 | 23 | 37 | 22
159351-69-6 | 50 | 23 | 48 | 21
64-18-6 (PARENT) | 25 | 23 | 22 | 20
1397-89-3 | 57 | 24 | 51 | 18
RIFAPENTINE | 59 | 18 | 59 | 18
7681-93-8 | 44 | 19 | 43 | 18
NYSTATIN | 61 | 28 | 34 | 17
50-14-6 | 61 | 17 | 61 | 17
104376-79-6 | 33 | 17 | 33 | 17
AMPHOTERICIN B | 67 | 21 | 63 | 17
68-19-9 | 37 | 21 | 33 | 17
ACONITINE | 47 | 19 | 45 | 17
QUININE SULFATE | 38 | 17 | 38 | 17
Callouts added to this table on slides 53 and 54:
- CAS numbers
- CAS numbers for parent compounds

55 Limitations of Synonym Filtering
Synonym filtering focuses on consistency, not correctness.
- It resolves discrepancies in name-structure associations within and between depositors.
- It does not mean that filtered synonyms are correct.
Example: Fentin acetate (CID 16682804). Its filtered synonyms include:
- m-Nitrobenzaldehyde 3-thio-4-o-tolylsemicarbazone
- Benzaldehyde, m-nitro-, 3-thio-4-o-tolylsemicarbazone

58 Limitations of Synonym Filtering
Synonym filtering focuses on consistency, not correctness.
- Data sources integrate synonym data from other sources that are regarded as authoritative (e.g., government resources).
- Erroneous data in one source propagate into other sources.
- This practice helps incorrect name-chemical associations get more votes than they should during the synonym filtering process.

59 Limitations of Synonym Filtering
More than 90% of depositor-provided synonyms occur only once. These are automatically assigned to the structures represented by their corresponding CIDs.

60 Limitations of Synonym Filtering
[Figure: Uracil (CID 1174), 2,4-Dihydroxypyrimidine (SID 377954591), 2-Hydroxy-4(1H)-pyrimidinone (SID 341255477)]
Different tautomers are merged into one standardized tautomeric structure. As a result, their names are also merged with those of the standardized tautomer.

61 Limitations of Synonym Filtering

62 Summary

63 Summary
- PubChem contains a large amount of chemical information provided by 690+ data sources.
- Through the chemical structure standardization process, PubChem standardizes depositor-provided chemical structures and extracts unique structures.
- PubChem uses crowd-voting-based synonym filtering to clean up name-structure associations provided by depositors.

64 Acknowledgements
Evan Bolton, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin Shoemaker, Paul Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang
The PubChem Team
PubChem depositors, users, and collaborators
Funded by the National Library of Medicine