PubChem: a public chemical information resource for big data chemistry
SunghwanKim95
Aug 11, 2020
About This Presentation
Presented at the Joint Statistical Meetings (JSM) 2020 (virtual) on August 3, 2020.
==== Abstract ====
The idea of “big data” has recently attracted much attention from the scientific community as well as the general public. An example of big data in chemistry is the data contained in PubChem, which is a public database of chemical substance descriptions and their biological activities at the National Institutes of Health. PubChem is a sizeable system with 235 million depositor-provided substance descriptions, 96 million unique chemical structures, 1.1 million biological assays, and 268 million biological activity result outcomes. It also contains significant amounts of scientific research data and the inter-relationships between chemicals, proteins, genes, scientific literature, patents and more. PubChem resources have been used in many studies for developing bioactivity and toxicity prediction models, discovering multi-target ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). This presentation provides an overview of how PubChem’s data, tools, and services can be used for bioassay data analysis and virtual screening (VS) and discusses important aspects of exploiting PubChem for drug discovery.
Slide Content
PubChem: a Public Chemical Information Resource for Big Data Chemistry
Sunghwan Kim, Ph.D., M.Sc.
Outline
- What is PubChem?
- What does PubChem have?
- Navigating PubChem
- Programmatic access to PubChem
- Showcase: bioactivity prediction model building
- Summary
1. What is PubChem?
https://pubchem.ncbi.nlm.nih.gov
- Chemical information resource at NIH.
- Serves scientific communities as well as the general public.
- ~5 million unique monthly users at peak (Apr. 2020), counting interactive users only (no bots).
- A similar amount of web traffic comes from programmatic users.
- One of the top 5 most visited chemistry websites in the world (https://www.alexa.com/topsites/category/Top/Science/Chemistry).
PubChem is a data aggregator.
- 750+ data sources: gov’t agencies, academic institutions, publishers, pharma companies, chemical vendors, scientific databases.
- Serves: the public, research communities, chemical biology, medicinal chemistry, drug design & discovery, cheminformatics, patent agents/examiners, chemical safety officers, educators/librarians, students, ……
- PubChem Sources: https://pubchem.ncbi.nlm.nih.gov/sources
2. What does PubChem have?
PubChem contains (as of August 2020):
- 103-M unique chemical structures
- 268-M bioactivity outcomes
- 1.2-M bioassay experiments
- 91-K genes & 95-K proteins (from 4-K organisms)
- 237-K pathways
- 31-M scientific articles about chemicals
- 3-M patent documents
PubChem Statistics: https://pubchemdocs.ncbi.nlm.nih.gov/statistics
Arguably, PubChem contains the largest amount of chemical information in the public domain.
PubChem data for drug discovery
- Drug information
  - Drug labeling
  - Drug indications
  - Mechanism of action
  - Target genes/proteins
  - ADMET (Absorption, Distribution, Metabolism, Excretion & Toxicity)
- Clinical trials information
  - ClinicalTrials.gov (https://clinicaltrials.gov/)
  - EU Clinical Trials Register (https://www.clinicaltrialsregister.eu/)
  - NIPH Clinical Trials Search of Japan (https://rctportal.niph.go.jp/en/)
- Regulatory information
  - FDA Orange Book
  - Unique ingredient identifiers, pharmacologic classes
  - EPA Substance Registry Services
  - Chemical data collected under the Toxic Substances Control Act and the Clean Air Act
- Patent information (USPTO, EPO, WIPO, JPO)
- Journal articles (PubMed & non-PubMed)
- Structural information
  - 2-D chemical structures
  - Line notations for 2-D chemical structures (SMILES, InChI, InChIKey)
  - Computationally generated 3-D structures
  - Experimental 3-D structures (from Crystallography Open Database)
  - Links to 3-D structures in PDB/CSD
- Chemical properties (solubility, pKa, molecular weight, logP, …)
- Spectral information (NMR, IR, UV, MS, GC-MS, LC-MS, …)
- Chemical vendor
- Synthesis
- ……
- Bioactivity data
  - High-throughput screening (HTS) data (NCATS, EPA, Broad Institute, Sanford-Burnham, Scripps, …)
  - Literature-extracted data from scientific articles and patent documents through text mining & manual curation (ChEMBL, IUPHAR/BPS Guide to PHARMACOLOGY, BindingDB, …)
3. Navigating PubChem
Gene/Protein Target Page
Suppose that you want to retrieve ALL active compounds against a given protein/gene target (e.g., HMGCR = 3-hydroxy-3-methylglutaryl-CoA reductase):
- To identify common chemical scaffolds responsible for bioactivity.
- To build a quantitative structure-activity relationship (QSAR) model.
The Gene/Protein Target page:
- Provides a target-centric view of PubChem data.
- Organizes all data available in PubChem for a given gene/protein.
Patent View Page
Suppose that you want to retrieve ALL chemicals mentioned in a given patent document.
The Patent View page:
- Provides a list of chemicals “mentioned” in the patent application/grant.
- Gives no information on why they are mentioned (e.g., as subject matter or as prior art?).
- Provides other information, including:
  - Title, abstract, date, inventor, …
  - International patent classification (IPC) codes
PubChem users have very diverse backgrounds/interests.
- PubChem’s web interfaces are optimized to perform commonly requested tasks interactively.
- Everything you can do with PubChem through the web browser can be automated through PubChem’s programmatic interfaces.
- Programmatic access enables one to do much more complicated tasks that cannot be done through the web browser.
Multiple programmatic access routes
Two major programmatic access methods:
- PUG-REST (primarily for computed properties): https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
- PUG-View (primarily for text information): https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
Request volume limitation:
- No more than 5 requests per second (see more at: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations).
- Violators/abusers may be blocked for a certain period of time.
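As a rough illustration of the two points on this slide (PUG-REST request URLs and the limit of 5 requests per second), the sketch below builds a few URLs and spaces out calls. The URL patterns follow PUG-REST's input/operation/output layout as understood here and should be checked against the PUG-REST documentation before use. The helper names, the `Throttle` class, and the example CID 2244 are illustrative assumptions and do not come from the presentation.

```python
import time
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cids, properties, fmt="JSON"):
    """URL for computed properties of one or more compounds, given their CIDs."""
    cid_part = ",".join(str(c) for c in cids)
    prop_part = ",".join(properties)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{prop_part}/{fmt}"


def assays_for_gene_url(gene_symbol, fmt="TXT"):
    """URL listing assay IDs (AIDs) whose target is the given gene symbol."""
    return f"{PUG_REST}/assay/target/genesymbol/{quote(gene_symbol)}/aids/{fmt}"


def active_cids_url(aid, fmt="TXT"):
    """URL listing the compounds reported as active in one assay."""
    return f"{PUG_REST}/assay/aid/{aid}/cids/{fmt}?cids_type=active"


class Throttle:
    """Keeps successive calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous call, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


# Example URLs only; nothing is sent over the network here.
example_urls = [
    property_url([2244], ["MolecularFormula", "MolecularWeight"]),
    assays_for_gene_url("HMGCR"),
    active_cids_url(1159531),
]
```

In a real workflow, one would call `Throttle.wait()` immediately before each HTTP request (for example with `urllib.request.urlopen`), so that a loop over many URLs stays within the stated limit.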
Bulk Download
- Structure Download Service (up to 500,000 compounds): https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi
- Assay Download Service (up to 1,000 assays): https://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi
- PubChem FTP Site: ftp://ftp.ncbi.nlm.nih.gov/pubchem
- PubChem RDF (Resource Description Framework): https://pubchemdocs.ncbi.nlm.nih.gov/rdf
5. Showcase: Bioactivity Prediction Model Building
Retinoid X Receptor (RXRA), PDB ID: 1FBY
- Involved in regulation of gene expression in various biological processes.
- Potential roles in:
  - metabolic signaling pathways
  - skin alopecia (spot baldness)
  - dermal cysts
  - cardiac development
  - insulin sensitization
  - ……
Data sets
- Tox21 (AID 1159531): quantitative HTS (qHTS) data for 10K compounds; predominantly inactive. After preprocessing, split 90%/10% into:
  - Training (4,916 compounds): 471 actives, 4,445 inactives
  - Test (547 compounds): 53 actives, 494 inactives
- ChEMBL (45 assays): data extracted from journal articles; predominantly active. After preprocessing:
  - External 1 (222 compounds): 205 actives, 17 inactives
- NCGC (2 assays): qHTS data; predominantly inactive; some overlap w/ Tox21. After preprocessing:
  - External 2 (489 compounds): 20 actives, 469 inactives
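The Tox21 training and test counts are consistent with a 90%/10% split done separately within each activity class. The slide does not say how the split was performed, so the sketch below is only a check of the arithmetic under that assumption, with each class's test share rounded up; the function name is hypothetical.

```python
def stratified_counts(class_sizes, test_percent=10):
    """Return {label: (n_train, n_test)}, rounding each class's test share up."""
    counts = {}
    for label, n in class_sizes.items():
        # Integer ceiling of n * test_percent / 100, avoiding float rounding.
        n_test = -(-n * test_percent // 100)
        counts[label] = (n - n_test, n_test)
    return counts


# Pooled Tox21 counts after preprocessing (training + test from the slide).
pooled = {"active": 471 + 53, "inactive": 4445 + 494}
split = stratified_counts(pooled)
# split == {"active": (471, 53), "inactive": (4445, 494)}
```

Under these assumptions the per-class counts match the slide, which also shows that the roughly 1:9 active-to-inactive ratio is the same in the training and test sets.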
Model Building
Molecular descriptors, generated using PaDEL [Yap CW (2011). J. Comput. Chem., 32 (7): 1466-1474]
Abbreviation | Name                       | Length
AP           | AtomPairs 2D Fingerprint   | 780
ESTAT        | Estate fingerprint         | 79
EXTFP*       | CDK Extended Fingerprint   | 1,024
FP*          | CDK fingerprint            | 1,024
GOFP*        | CDK graph only fingerprint | 1,024
KR           | Klekota-Roth fingerprint   | 4,860
MACCS        | MACCS fingerprint          | 166
PUB          | PubChem fingerprint        | 881
SUB          | Substructure fingerprint   | 307
* Hashed fingerprints
Model Building
Machine-learning algorithms (implemented in scikit-learn)
Abbreviation | Name                   | Hyperparameters optimized
NB           | Naïve Bayes            | (10^-10 ~ 1)
DT           | Decision tree          | max_depth_range (3 ~ 7); min_samples_split_range (3 ~ 7); min_samples_leaf_range (2 ~ 6)
kNN          | K-Nearest neighbors    | weights (uniform, minkowski, jaccard); n_neighbors (1 ~ 25)
RF           | Random forest          | n_estimators (10 ~ 200)
SVM          | Support vector machine | C (2^-10 ~ 2^10); (2^-10 ~ 2^10)
NN           | Neural network         | solver (lbfgs or adam); (10^-7 ~ 10^7)
10-fold cross-validation was used for hyperparameter optimization.
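To make the search space concrete, the sketch below enumerates the decision-tree grid and the SVM C range from the slide and shows how 10 folds partition the 4,916 training compounds. The slide gives only the range endpoints, so the integer steps and power-of-two spacing are assumptions. The parameter names follow scikit-learn's decision-tree arguments, and all helper names are hypothetical. In practice scikit-learn's `GridSearchCV` handles both the grid and the folds; this pure-Python version only illustrates the bookkeeping.

```python
from itertools import product


def power_range(base, lo_exp, hi_exp):
    """base**lo_exp ... base**hi_exp in integer exponent steps (assumed spacing)."""
    return [float(base) ** e for e in range(lo_exp, hi_exp + 1)]


def expand(grid):
    """All combinations of a {parameter name: candidate values} grid."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[n] for n in names))]


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


# Decision tree: ranges from the slide, integer steps assumed.
dt_grid = {
    "max_depth": range(3, 8),
    "min_samples_split": range(3, 8),
    "min_samples_leaf": range(2, 7),
}
dt_candidates = expand(dt_grid)        # 5 * 5 * 5 combinations

# SVM: C from 2^-10 to 2^10, one value per integer exponent (assumed).
svm_c_values = power_range(2, -10, 10)

# Validation-fold sizes for the 4,916 training compounds.
fold_sizes = [len(val) for _, val in k_fold_indices(4916, k=10)]
```

Each candidate is then trained on nine folds and scored on the held-out fold, and the mean score over the ten folds decides which hyperparameters to keep. A real run would shuffle the compounds (and probably stratify by activity) before forming the folds.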
Model Performance Evaluation
- Area under the receiver operating characteristic curve (AUC)
- Used for hyperparameter optimization.
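For readers less familiar with the metric, AUC equals the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one, with ties counted as one half. The minimal sketch below computes it directly from that definition; it is not the implementation used in the presentation, which relied on scikit-learn, and the function name and toy scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    labels: 1 for active, 0 for inactive; scores: higher means more likely active.
    Ties between an active and an inactive score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one active and one inactive")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four active/inactive pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

Because the value depends only on how actives rank relative to inactives, it is unaffected by the roughly 1:9 class imbalance in the training data, which makes it a reasonable target for hyperparameter optimization. The pairwise loop is quadratic, so a rank-based formula would be preferable for very large sets.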
Performance of the models (area under ROC curve, AUC)
- AUC scores of 0.7 were observed for models developed using PubChem/MACCS/CDK-FP with NN/SVM/RF/kNN.
- Maximum AUC score (0.77): PubChem fingerprint with RF.
- A similar trend was observed for the performance in terms of BACC scores (not shown here).
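BACC (balanced accuracy) is the mean of sensitivity and specificity. The slide does not define it, so the sketch below uses that standard definition and the test-set counts (53 actives, 494 inactives) to show why plain accuracy would be misleading here. The function name and the "predict everything inactive" baseline are illustrative assumptions.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on actives) and specificity (recall on inactives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Trivial baseline on the Tox21 test set: call every compound inactive.
n_active, n_inactive = 53, 494
accuracy = n_inactive / (n_active + n_inactive)                  # about 0.903
bacc = balanced_accuracy(tp=0, fn=n_active, tn=n_inactive, fp=0)  # 0.5
```

The baseline reaches about 90% accuracy without finding a single active compound, while its balanced accuracy is 0.5, the same as random guessing.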
General applicability of the models
[Figure: area under ROC curve (AUC) on the NCGC and ChEMBL external sets; inactive-to-active ratio = 1]
Summary
- PubChem is the largest source of publicly available chemical information, collected from more than 750 data sources.
- PubChem contains a wide range of annotated information for chemicals, including gene/protein targets, toxicity, chemical vendors, patents, ……
- PubChem contains a large amount of high-throughput screening data as well as literature-extracted bioactivity data.
- PubChem supports various types of searches (e.g., keyword search, identity/similarity search, substructure/superstructure searches, ……).
- PubChem supports programmatic access to its data, allowing for building an automated workflow.
- PubChem’s bioactivity data can be used to develop predictive models for bioactivity of small molecules.
Acknowledgements
- The PubChem Team: Evan Bolton, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin Shoemaker, Paul Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang
- PubChem users, depositors, and collaborators
- Funded by the National Library of Medicine