PubChem: a public chemical information resource for big data chemistry

SunghwanKim95 389 views 69 slides Aug 11, 2020

About This Presentation

Presented at the Joint Statistical Meetings (JSM) 2020 (virtual) on August 3, 2020.

==== Abstract ====

The idea of “big data” has recently been drawing much attention from the scientific community as well as the general public. An example of big data in chemistry is the data contained in PubChem...


Slide Content

PubChem: a Public Chemical Information Resource for Big Data Chemistry Sunghwan Kim, Ph.D., M.Sc.

2 Outline
1. What is PubChem?
2. What does PubChem have?
3. Navigating PubChem
4. Programmatic access to PubChem
5. Showcase: bioactivity prediction model building
Summary

1. What is PubChem? 3

4 https://pubchem.ncbi.nlm.nih.gov Chemical information resource at NIH. Serves scientific communities as well as the general public.

5 About 5 million unique monthly users at peak (Apr. 2020), counting interactive users only (no bots). A similar amount of web traffic comes from programmatic users. One of the top 5 most visited chemistry websites in the world (https://www.alexa.com/topsites/category/Top/Science/Chemistry).

6 PubChem is a data aggregator. PubChem Sources: https://pubchem.ncbi.nlm.nih.gov/sources
750+ data sources: gov’t agencies, academic institutions, publishers, pharma companies, chemical vendors, scientific databases.
Users: public, research communities, chemical biology, medicinal chemistry, drug design & discovery, cheminformatics, patent agents/examiners, chemical safety officers, educators/librarians, students, ……

2. What does PubChem have? 7

8 PubChem contains (as of August 2020):
- 103-M unique chemical structures
- 268-M bioactivity outcomes
- 1.2-M bioassay experiments
- 91-K genes & 95-K proteins (from 4-K organisms)
- 237-K pathways
- 31-M scientific articles about chemicals
- 3-M patent documents
PubChem Statistics: https://pubchemdocs.ncbi.nlm.nih.gov/statistics
Arguably, PubChem contains the largest amount of chemical information in the public domain.

9 PubChem data for drug discovery
Drug information: drug labeling, drug indications, mechanism of action, target genes/proteins, ADMET (Absorption, Distribution, Metabolism, Excretion & Toxicity).
Clinical trials information:
- ClinicalTrials.gov (https://clinicaltrials.gov/)
- EU Clinical Trials Register (https://www.clinicaltrialsregister.eu/)
- NIPH Clinical Trials Search of Japan (https://rctportal.niph.go.jp/en/)

10 PubChem data for drug discovery
Regulatory information:
- FDA Orange Book
- Unique ingredient identifiers, pharmacologic classes
- EPA Substance Registry Services
- Chemical data collected under the Toxic Substances Control Act and the Clean Air Act
Patent information (USPTO, EPO, WIPO, JPO)
Journal articles (PubMed & non-PubMed)

11 PubChem data for drug discovery
Structural information:
- 2-D chemical structures
- Line notations for 2-D chemical structures (SMILES, InChI, InChIKey)
- Computationally-generated 3-D structures
- Experimental 3-D structures (from Crystallography Open Database)
- Links to 3-D structures in PDB/CSD
Chemical properties (solubility, pKa, molecular weight, logP, …)
Spectral information (NMR, IR, UV, MS, GC-MS, LC-MS, …)
Chemical vendor
Synthesis
……

12 PubChem data for drug discovery
Bioactivity data:
- High-throughput screening (HTS) data (NCATS, EPA, Broad Institute, Sanford-Burnham, Scripps, …)
- Literature-extracted data from scientific articles and patent documents through text mining & manual curation (ChEMBL, IUPHAR/BPS Guide to PHARMACOLOGY, BindingDB, …)

3. Navigating PubChem 13

14 https://pubchem.ncbi.nlm.nih.gov

15 https://pubchem.ncbi.nlm.nih.gov

24 https://pubchem.ncbi.nlm.nih.gov

32 Gene/Protein Target Page
Suppose that you want to retrieve ALL active compounds against a given protein/gene target (e.g., HMGCR = 3-hydroxy-3-methylglutaryl-CoA reductase):
- To identify common chemical scaffolds responsible for bioactivity.
- To build a quantitative structure-activity relationship (QSAR) model.
The Gene/Protein Target page:
- Provides a target-centric view of PubChem data.
- Organizes all data available in PubChem for a given gene/protein.

33 https://pubchem.ncbi.nlm.nih.gov

34 https://pubchem.ncbi.nlm.nih.gov/#query=HMGCR&tab=gene

35 https://pubchem.ncbi.nlm.nih.gov/#query=HMGCR&tab=gene

36 https://pubchem.ncbi.nlm.nih.gov/gene/3156

37 https://pubchem.ncbi.nlm.nih.gov/gene/3156

38 https://pubchem.ncbi.nlm.nih.gov/gene/3156

39 https://pubchem.ncbi.nlm.nih.gov/gene/3156
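The target-centric retrieval described on slide 32 can also be approached programmatically. The following is a minimal sketch, assuming the PUG-REST URL patterns `gene/geneid/<id>/aids` and `assay/aid/<aid>/cids?cids_type=active`; the helper names (`gene_aids_url`, `assay_active_cids_url`, `parse_id_list`) are hypothetical and not part of the presentation. It only builds URLs and parses plain-text ID lists, so no network access is needed to try it.

```python
from typing import List

# Base URL of PUG-REST (assumed from the public PubChem documentation).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def gene_aids_url(gene_id: int) -> str:
    """URL that lists assay IDs (AIDs) associated with an NCBI Gene ID."""
    return f"{PUG_REST}/gene/geneid/{int(gene_id)}/aids/TXT"


def assay_active_cids_url(aid: int) -> str:
    """URL that lists compound IDs (CIDs) reported as active in one assay."""
    return f"{PUG_REST}/assay/aid/{int(aid)}/cids/TXT?cids_type=active"


def parse_id_list(text: str) -> List[int]:
    """Parse a TXT response (one integer ID per line) into a list of ints."""
    return [int(token) for token in text.split() if token.isdigit()]


# HMGCR is NCBI Gene ID 3156, as in the gene page URL on the slides.
hmgcr_assays = gene_aids_url(3156)
```

Fetching `hmgcr_assays`, parsing the result with `parse_id_list`, and then requesting `assay_active_cids_url(aid)` for each assay would give a union of active CIDs for the target. Whether that matches the Gene/Protein Target page exactly is not stated in the slides.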

40 Patent View Page
Suppose that you want to retrieve ALL chemicals mentioned in a given patent document.
The Patent View page:
- Provides a list of chemicals “mentioned” in the patent application/grant.
- Gives no information on why they are mentioned (e.g., as a subject matter or as a prior art?).
- Provides other information, including:
  - Title, abstract, date, inventor, …
  - International patent classification (IPC) codes

41 https://pubchem.ncbi.nlm.nih.gov/#query=US2019183840

42 https://pubchem.ncbi.nlm.nih.gov/#query=US2019183840

43 https://pubchem.ncbi.nlm.nih.gov/patent/US2019183840

44 https://pubchem.ncbi.nlm.nih.gov/patent/US2019183840

45 https://pubchem.ncbi.nlm.nih.gov/patent/US2019183840

46 https://pubchem.ncbi.nlm.nih.gov/patent/US2019183840

4. Programmatic Access to PubChem 47

50 PubChem users have very diverse backgrounds/interests. PubChem’s web interfaces are optimized to perform commonly requested tasks interactively. Everything you can do with PubChem through the web browser can be automated through PubChem’s programmatic interfaces. Programmatic access enables one to do much more complicated tasks that cannot be done through the web browser.

51 Multiple programmatic access routes
Two major programmatic access methods:
- PUG-REST (primarily for computed properties): https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
- PUG-View (primarily for text information): https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
Request volume limitation: no more than 5 requests per second (see more at: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations ). Violators/abusers may be blocked for a certain period of time.
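As an illustration of the two access routes and the 5-requests-per-second limit, here is a minimal sketch. The function and class names (`property_url`, `annotation_url`, `Throttle`) are hypothetical, and the URL patterns are assumptions based on the public PUG-REST and PUG-View documentation rather than anything shown on the slides. The throttle simply spaces calls at least 0.2 s apart; it is one simple way to stay under the limit, not an official client.

```python
import time

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"


def property_url(cids, properties):
    """PUG-REST URL for computed properties of one or more compounds."""
    cid_part = ",".join(str(int(c)) for c in cids)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{','.join(properties)}/JSON"


def annotation_url(cid):
    """PUG-View URL for the annotated (text) record of one compound."""
    return f"{PUG_VIEW}/data/compound/{int(cid)}/JSON"


class Throttle:
    """Space successive calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock   # injectable so the class can be tested offline
        self._sleep = sleep
        self._last = None     # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A caller would invoke `throttle.wait()` immediately before each HTTP request, for example before fetching `property_url([2244], ["MolecularWeight", "InChIKey"])`.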

52 Bulk Download
- Structure Download Service (up to 500,000 compounds): https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi
- Assay Download Service (up to 1,000 assays): https://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi
- PubChem FTP Site: ftp://ftp.ncbi.nlm.nih.gov/pubchem
- PubChem RDF: https://pubchemdocs.ncbi.nlm.nih.gov/rdf (RDF: Resource Description Framework)

5. Showcase: Bioactivity Prediction Model Building 53

54 Retinoid X Receptor α (RXRA), PDB ID: 1FBY
Involved in regulation of gene expression in various biological processes.
Potential roles in:
- metabolic signaling pathways
- skin alopecia (spot baldness)
- dermal cysts
- cardiac development
- insulin sensitization
- ……

55 Data sets
Tox21 (AID 1159531): quantitative HTS (qHTS) data for 10K compounds; predominantly inactive.

56 Data sets
Tox21 (AID 1159531): quantitative HTS (qHTS) data for 10K compounds; predominantly inactive. After preprocessing, the data were split 90%/10% into Training (4,916 compounds: 471 actives, 4,445 inactives) and Test (547 compounds: 53 actives, 494 inactives).

57 Data sets (each source was preprocessed)
- Tox21 (AID 1159531): quantitative HTS (qHTS) data for 10K compounds; predominantly inactive. Split 90%/10% into Training and Test.
- ChEMBL (45 assays): data extracted from journal articles; predominantly active.
- NCGC (2 assays): qHTS data; predominantly inactive; some overlap w/ Tox21.

Set         Source   Compounds  Actives  Inactives
Training    Tox21    4,916      471      4,445
Test        Tox21    547        53       494
External 1  ChEMBL   222        205      17
External 2  NCGC     489        20       469
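The 90%/10% split of the Tox21 set keeps the active-to-inactive ratio nearly the same in both parts, which is what a stratified split does. The slides do not say how the split was made, so the following is only a sketch using the standard library; `stratified_split` is a hypothetical helper, and rounding the held-out count up per class is an assumption that happens to reproduce the counts on the slide.

```python
import math
import random


def stratified_split(labels, test_fraction=0.10, seed=0):
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: round the held-out count up within each class.
        n_test = math.ceil(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# 524 actives (1) and 4,939 inactives (0): the Training + Test totals above.
labels = [1] * 524 + [0] * 4939
train_idx, test_idx = stratified_split(labels)
test_actives = sum(labels[i] for i in test_idx)
train_actives = sum(labels[i] for i in train_idx)
```

With these inputs the test part holds 53 actives and 494 inactives, and the training part 471 actives and 4,445 inactives.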

60 Model Building
Molecular descriptors, generated using PaDEL [Yap CW (2011). J. Comput. Chem., 32 (7): 1466-1474]

Abbreviation  Name                        Length
AP            AtomPairs 2D Fingerprint    780
ESTAT         Estate fingerprint          79
EXTFP*        CDK Extended Fingerprint    1,024
FP*           CDK fingerprint             1,024
GOFP*         CDK graph only fingerprint  1,024
KR            Klekota-Roth fingerprint    4,860
MACCS         MACCS fingerprint           166
PUB           PubChem fingerprint         881
SUB           Substructure fingerprint    307

* Hashed fingerprints

61 Model Building
Machine-learning algorithms (implemented in scikit-learn)

Abbreviation  Name                    Hyperparameters optimized
NB            Naïve Bayes             α (10^-10 ~ 1)
DT            Decision tree           max_depth_range (3 ~ 7); min_samples_split_range (3 ~ 7); min_samples_leaf_range (2 ~ 6)
kNN           K-Nearest neighbors     weights (uniform, minkowski, jaccard); n_neighbors (1 ~ 25)
RF            Random forest           n_estimators (10 ~ 200)
SVM           Support vector machine  C (2^-10 ~ 2^10); γ (2^-10 ~ 2^10)
NN            Neural network          solver (lbfgs or adam); α (10^-7 ~ 10^7)

10-fold cross-validation was used for hyperparameter optimization.
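To give a sense of the cost of this search, the sketch below enumerates the decision-tree grid and a plain 10-fold partition of the training set. It assumes the ranges on the slide are inclusive integer ranges and that they map onto the scikit-learn parameters `max_depth`, `min_samples_split`, and `min_samples_leaf`; the helpers `decision_tree_grid` and `k_fold_indices` are hypothetical, and the slides do not say whether folds were shuffled or stratified.

```python
from itertools import product


def decision_tree_grid():
    """All combinations of the decision-tree ranges (assumed inclusive)."""
    depths = range(3, 8)   # max_depth: 3 ~ 7
    splits = range(3, 8)   # min_samples_split: 3 ~ 7
    leaves = range(2, 7)   # min_samples_leaf: 2 ~ 6
    return [
        {"max_depth": d, "min_samples_split": s, "min_samples_leaf": m}
        for d, s, m in product(depths, splits, leaves)
    ]


def k_fold_indices(n_samples, k=10):
    """Yield (train, validation) index lists for k contiguous folds."""
    # The first (n_samples % k) folds get one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


grid = decision_tree_grid()
folds = list(k_fold_indices(4916, k=10))   # 4,916 training compounds
n_fits = len(grid) * len(folds)            # model fits for one fingerprint
```

Under these assumptions the decision tree alone requires 125 parameter combinations times 10 folds, that is 1,250 model fits per fingerprint type.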

62 Model Performance Evaluation
Area under the receiver operating characteristic curve (AUC): used for hyperparameter optimization.

63 Performance of the models: area under ROC curve (AUC)
- AUC scores of 0.7 were observed for models developed using PubChem/MACCS/CDK-FP with NN/SVM/RF/kNN.
- Maximum AUC score (0.77): PubChem fingerprint with RF.
- A similar trend was observed for the performance in terms of BACC scores (not shown here).
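For readers unfamiliar with the two metrics, here is a small self-contained sketch. It assumes BACC denotes balanced accuracy (the mean of sensitivity and specificity) and computes AUC as the probability that a randomly chosen active is scored above a randomly chosen inactive, with ties counted as one half. The function names and the toy data are illustrative only and are not results from the presentation.

```python
def roc_auc(y_true, scores):
    """AUC via pairwise comparison of active (1) and inactive (0) scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on actives) and specificity."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = sum(1 for y in y_true if y == 0)
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy example: 2 actives, 4 inactives.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

auc = roc_auc(y_true, scores)              # 7 of 8 pairs ranked correctly
bacc = balanced_accuracy(y_true, y_pred)   # (1/2 + 3/4) / 2
```

Both metrics are insensitive to the overall active-to-inactive ratio, which matters for data sets as imbalanced as the ones used here.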

64 General applicability of the models
[Figure: area under ROC curve (AUC) on the NCGC and ChEMBL external data sets; inactive-to-active ratio = 1]

Summary 65

66 PubChem is the largest source of publicly available chemical information, collected from more than 750 data sources. PubChem contains a wide range of annotated information for chemicals, including the gene/protein targets, toxicity, chemical vendors, patents, …… PubChem contains a large amount of high-throughput screening data as well as literature-extracted bioactivity data.

PubChem supports various types of searches (e.g., keyword search, identity/similarity search, substructure/superstructure searches, ……). PubChem supports programmatic access to its data, allowing for building an automated workflow. PubChem’s bioactivity data can be used to develop predictive models for bioactivity of small molecules. 67

68 Acknowledgements
The PubChem Team: Evan Bolton, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin Shoemaker, Paul Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang
PubChem users, depositors, and collaborators
Funded by the National Library of Medicine

69 Thank you! Questions? Sunghwan Kim Email: [email protected] SlideShare: https://www.slideshare.net/SunghwanKim95/presentations