PubChem: a Public Chemical Information Resource for Big Data Chemistry
Sunghwan Kim, Ph.D., M.Sc.
Outline
1. What is PubChem?
2. What does PubChem have?
3. Navigating PubChem
4. Programmatic access to PubChem
5. Showcase: bioactivity prediction model building
6. Summary
1. What is PubChem?
https://pubchem.ncbi.nlm.nih.gov
 Chemical information resource at NIH.
 Serves scientific communities as well as the general public.
 ~5 million unique monthly users at
peak (Apr. 2020).
 interactive users only
 No bots
 Similar amount of web traffic from
programmatic users.
 One of the top 5 most visited
chemistry websites in the world.
(https://www.alexa.com/topsites/category/Top/Science/Chemistry).
 PubChem is a data aggregator.
PubChem Sources: https://pubchem.ncbi.nlm.nih.gov/sources
750+ data sources:
o Gov’t agencies
o Academic institutions
o Publishers
o Pharma companies
o Chemical vendors
o Scientific databases
Public:
o Research communities
• Chemical biology
• Medicinal chemistry
• Drug design &
discovery
• Cheminformatics
o Patent agents/examiners
o Chemical safety officers
o Educators/Librarians
o Students
o ……
2. What does PubChem have?
 PubChem contains (as of August 2020):
• 103-M unique chemical structures.
• 268-M bioactivity outcomes
• 1.2-M bioassay experiments
• 91-K genes & 95-K proteins (from 4-K organisms).
• 237-K pathways
• 31-M scientific articles about chemicals
• 3-M patent documents
PubChem Statistics: https://pubchemdocs.ncbi.nlm.nih.gov/statistics
Arguably, PubChem contains the largest amount of
chemical information in the public domain.
 Drug information
• Drug labeling
• Drug indications
• Mechanism of action
• Target genes/proteins
• ADMET (Absorption, Distribution, Metabolism, Excretion & Toxicity)
 Clinical trials information
• ClinicalTrials.gov (https://clinicaltrials.gov/)
• EU Clinical Trials Register (https://www.clinicaltrialsregister.eu/)
• NIPH Clinical Trials Search of Japan (https://rctportal.niph.go.jp/en/)
 PubChem data for drug discovery
 Regulatory information
o FDA
• Orange Book
• Unique ingredient identifiers
• Pharmacologic classes
o EPA
• Substance Registry Services
• Chemical data collected under the:
• Toxic Substance Control Act
• Clean Air Act
 Patent information
(USPTO, EPO, WIPO, JPO)
 Journal articles
(PubMed & Non-PubMed)
 Structural information
• 2-D chemical structures
• Line notations for 2-D chemical structures (SMILES, InChI, InChIKey)
• Computationally-generated 3-D structures
• Experimental 3-D structures (from Crystallography Open Database)
• Links to 3-D structures in PDB/CSD
 Chemical properties
(solubility, pKa, molecular weight, logP, …)
 Spectral information
(NMR, IR, UV, MS, GC-MS, LC-MS, …)
 Chemical vendor
 Synthesis
 ……
 Bioactivity data
• High-throughput screening (HTS) data
(NCATS, EPA, Broad Institute, Sanford-Burnham, Scripps, …)
• Literature-extracted data from scientific articles and patent
documents through text mining & manual curation
(ChEMBL, IUPHAR/BPS Guide to PHARMACOLOGY, BindingDB, …)
3. Navigating PubChem
https://pubchem.ncbi.nlm.nih.gov
Gene/Protein Target Page
 Suppose that you want to:
o Retrieve ALL active compounds
against a given protein/gene target
(e.g., HMGCR=3-hydroxy-3-methylglutaryl-CoA reductase).
• To identify common chemical scaffolds responsible for bioactivity.
• To build a quantitative structure-activity relationship (QSAR) model.
Gene/Protein Target page
• Provides a target-centric view of PubChem data.
• Organizes all data available in PubChem for a given
gene/protein.
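The same target-centric retrieval can be scripted. The sketch below only builds the request URLs and parses the plain-text replies; it assumes the PUG-REST `assay/target/genesymbol` input and the `cids_type=active` option work as described in the PUG-REST documentation. The helper names are hypothetical, and no network call is made here.

```python
from urllib.parse import quote

# Base URL of PUG-REST (see the programmatic access section of this deck).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def aids_for_gene_url(gene_symbol):
    """URL that lists the assay IDs (AIDs) targeting a gene symbol, one per line."""
    return f"{PUG_REST}/assay/target/genesymbol/{quote(gene_symbol)}/aids/TXT"


def active_cids_url(aid):
    """URL that lists the compound IDs (CIDs) reported as active in one assay."""
    return f"{PUG_REST}/assay/aid/{int(aid)}/cids/TXT?cids_type=active"


def parse_id_list(text):
    """Turn a TXT reply (one integer ID per line) into a list of ints."""
    return [int(token) for token in text.split() if token.isdigit()]
```

A workflow would fetch `aids_for_gene_url("HMGCR")`, then fetch `active_cids_url(aid)` for each returned AID and merge the CID lists, staying within the request volume limits.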
https://pubchem.ncbi.nlm.nih.gov/#query=HMGCR&tab=gene
https://pubchem.ncbi.nlm.nih.gov/gene/3156
Patent View Page
 Suppose that you want to:
o Retrieve ALL chemicals mentioned in a given patent document.
Patent View page
• Provides a list of chemicals “mentioned” in the patent
application/grant.
• No information on why they are mentioned
(e.g., as subject matter or as prior art).
• Other information, including:
- Title, abstract, date, inventor, …
- International patent classification (IPC) codes
https://pubchem.ncbi.nlm.nih.gov/#query=US2019183840
https://pubchem.ncbi.nlm.nih.gov/patent/US2019183840
4. Programmatic Access to PubChem
 PubChem users have very diverse backgrounds/interests.
 PubChem’s web interfaces are optimized to perform commonly
requested tasks interactively.
 Everything you can do with PubChem through the web browser
can be automated through PubChem’s programmatic interfaces.
 Programmatic access enables one to do much more complicated
tasks that cannot be done through the web browser.
 Multiple programmatic access routes
 Two major programmatic access methods
o PUG-REST (primarily for computed properties).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
o PUG-View (primarily for text information).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
 Request volume limitation:
o No more than 5 requests per second
(See more at: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations)
o Violators/abusers may be blocked for a certain period of
time.
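A minimal sketch of a PUG-REST client that stays under the 5 requests per second limit. The `Throttle` class and the function names are hypothetical helpers, not part of PubChem; the URL pattern follows the compound property requests described in the PUG-REST documentation, and the property names used in the note below are assumed to be accepted as written.

```python
import json
import time
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid, properties):
    """Build a PUG-REST URL requesting computed properties for one CID as JSON."""
    names = ",".join(properties)
    return f"{PUG_REST}/compound/cid/{int(cid)}/property/{names}/JSON"


class Throttle:
    """Block so that successive calls are at least 1/max_per_second apart."""

    def __init__(self, max_per_second=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_properties(cids, properties, throttle=None):
    """Fetch properties for each CID, one throttled request at a time."""
    throttle = throttle or Throttle()
    results = {}
    for cid in cids:
        throttle.wait()
        with urllib.request.urlopen(property_url(cid, properties), timeout=30) as resp:
            results[cid] = json.load(resp)
    return results
```

For example, `fetch_properties([2244], ["MolecularWeight", "InChIKey"])` would issue one request for CID 2244. Sending requests one at a time with a fixed minimum gap is the simplest way to respect the limit; it is a conservative design choice, not a PubChem requirement.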
 Bulk Download
• Structure Download Service (up to 500,000 compounds)
https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi
• Assay Download Service (up to 1,000 assays)
https://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi
• PubChem FTP Site
ftp://ftp.ncbi.nlm.nih.gov/pubchem
• PubChem RDF
https://pubchemdocs.ncbi.nlm.nih.gov/rdf
RDF: Resource Description Framework
5. Showcase: Bioactivity Prediction Model Building
Retinoid X Receptor α (RXRA)
PDB ID: 1FBY
 Involved in regulation of gene expression in various biological
processes.
 Potential roles in:
• metabolic signaling pathways
• skin alopecia (spot baldness)
• dermal cysts
• cardiac development
• insulin sensitization
• ……
 Data sets (each source was preprocessed before use)
• Tox21 (AID 1159531): quantitative HTS (qHTS) data for 10K compounds; predominantly inactive. Split 90% / 10% into:
- Training (4,916 compounds): 471 actives, 4,445 inactives
- Test (547 compounds): 53 actives, 494 inactives
• ChEMBL (45 assays): data extracted from journal articles; predominantly active.
- External 1 (222 compounds): 205 actives, 17 inactives
• NCGC (2 assays): qHTS data; predominantly inactive; some overlap w/ Tox21.
- External 2 (489 compounds): 20 actives, 469 inactives
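The 90% / 10% split of the Tox21 data can be sketched as follows. The slides give only the proportions and the resulting counts, so the stratification by activity class, the random seed, and the rounding rule here are assumptions; the counts this sketch produces will be close to, but not identical with, those on the slide.

```python
import random


def stratified_split(labels, test_fraction=0.1, seed=0):
    """Return (train_idx, test_idx) with a similar class ratio in both parts.

    labels: sequence of class labels, e.g. 1 = active, 0 = inactive.
    """
    rng = random.Random(seed)
    # Group sample indices by class so each class is split separately.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

Splitting each class separately keeps the small active class represented in the test set, which matters when the data are predominantly inactive.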
Model Building
 Molecular descriptors
• Generated using PaDEL
[Yap CW (2011). J. Comput. Chem., 32 (7): 1466-1474]

Abbreviation   Name                          Length
AP             AtomPairs 2D fingerprint      780
ESTAT          E-state fingerprint           79
EXTFP*         CDK extended fingerprint      1,024
FP*            CDK fingerprint               1,024
GOFP*          CDK graph-only fingerprint    1,024
KR             Klekota-Roth fingerprint      4,860
MACCS          MACCS fingerprint             166
PUB            PubChem fingerprint           881
SUB            Substructure fingerprint      307
* Hashed fingerprints
 Machine-learning algorithms (implemented in scikit-learn)

Abbreviation   Name                      Hyperparameters optimized
NB             Naïve Bayes               (10^-10 ~ 1)
DT             Decision tree             max_depth_range (3 ~ 7); min_samples_split_range (3 ~ 7); min_samples_leaf_range (2 ~ 6)
kNN            k-nearest neighbors       weights (uniform, minkowski, jaccard); n_neighbors (1 ~ 25)
RF             Random forest             n_estimators (10 ~ 200)
SVM            Support vector machine    C (2^-10 ~ 2^10); (2^-10 ~ 2^10)
NN             Neural network            solver (lbfgs or adam); (10^-7 ~ 10^7)
 10-fold cross-validation was used for hyperparameter
optimization.
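The idea of a cross-validated grid search can be sketched without any library. This is an illustration of the technique only: the models in this work were built with scikit-learn, whose own tools were presumably used, and the `fit_and_score` callback, the function names, and the `gamma` parameter name in the note below are hypothetical.

```python
import itertools
import random


def power_grid(base, low_exp, high_exp, step=1):
    """Values base**low_exp ... base**high_exp, e.g. 2**-10 ... 2**10."""
    return [float(base) ** e for e in range(low_exp, high_exp + 1, step)]


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grid_search_cv(param_grid, fit_and_score, n_samples, k=10, seed=0):
    """Return (best_params, best_mean_score) over all parameter combinations.

    fit_and_score(params, train_idx, valid_idx) must train a model on
    train_idx and return its score (e.g. AUC) on valid_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for i, valid_idx in enumerate(folds):
            # Train on every fold except the one held out for validation.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(fit_and_score(params, train_idx, valid_idx))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

For an SVM-style search one might pass `{"C": power_grid(2, -10, 10), "gamma": power_grid(2, -10, 10)}`; every combination is scored as the mean over the 10 held-out folds, and the best-scoring combination is kept.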
Model Performance Evaluation
 Area under the Receiver operating characteristic curve (AUC)
 Used for hyperparameter optimization.
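 Balanced accuracy (BACC):
\[ \mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right) = \frac{1}{2}\left(\mathrm{SENS} + \mathrm{SPEC}\right) \]
 Sensitivity (SENS):
\[ \mathrm{SENS} = \frac{TP}{TP+FN} \]
 Specificity (SPEC):
\[ \mathrm{SPEC} = \frac{TN}{TN+FP} \]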
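The metrics on this slide can be computed directly from confusion-matrix counts and predicted scores. This is a plain-Python sketch with hypothetical function names; the AUC is computed here through its rank interpretation (the probability that a randomly chosen active is scored above a randomly chosen inactive), which is one of several equivalent ways to obtain it.

```python
def sensitivity(tp, fn):
    """SENS = TP / (TP + FN): fraction of actives predicted active."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """SPEC = TN / (TN + FP): fraction of inactives predicted inactive."""
    return tn / (tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """BACC = (SENS + SPEC) / 2."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))


def roc_auc(labels, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    labels: 1 = active, 0 = inactive; scores: higher means more likely active.
    Tied pairs count as half. Runs in O(n_active * n_inactive) time.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one active and one inactive")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

BACC is useful for imbalanced data: a model that calls every compound in the test set inactive (53 actives, 494 inactives) is right about 90% of the time, yet `balanced_accuracy(tp=0, fn=53, tn=494, fp=0)` is only 0.5.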
 Performance of the models
 AUC scores of 0.7 were
observed for models developed
using:
PubChem/MACCS/CDK-FP with
NN/SVM/RF/kNN
 Maximum AUC score (0.77):
PubChem fingerprint with RF
 Similar trend was observed for
the performance in terms of
BACC scores (not shown here).
Area under ROC curve (AUC)
 General applicability of the models
Area under ROC curve (AUC), inactive-to-active ratio = 1; panels: NCGC, ChEMBL
Summary
• PubChem is the largest source of publicly available
chemical information, collected from more than 750
data sources.
• PubChem contains a wide range of annotated
information for chemicals (e.g., gene/protein
targets, toxicity, chemical vendors, patents, ……).
• PubChem contains a large amount of high-throughput
screening data as well as literature-extracted
bioactivity data.
• PubChem supports various types of searches
(e.g., keyword search, identity/similarity search,
substructure/superstructure searches, ……).
• PubChem supports programmatic access to its data,
allowing for building an automated workflow.
• PubChem’s bioactivity data can be used to develop
predictive models for bioactivity of small molecules.
Acknowledgements
Evan Bolton
Jie Chen
Tiejun Cheng
Asta Gindulyte
Jia He
Siqian He
Qingliang Li
Benjamin Shoemaker
Paul Thiessen
Bo Yu
Leonid Zaslavsky
Jian Zhang
 The PubChem Team
 PubChem users, depositors, and collaborators
 Funded by the National Library of Medicine
Thank you!
Questions?
Sunghwan Kim
Email: sunghwan.kim@nih.gov
SlideShare: https://www.slideshare.net/SunghwanKim95/presentations

NCBI Minute: Integrating PubChem into Your Chemistry Teaching
 
How can you access PubChem programmatically?
How can you access PubChem programmatically?How can you access PubChem programmatically?
How can you access PubChem programmatically?
 

Recently uploaded

Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
silvermistyshot
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
Lokesh Patil
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
IqrimaNabilatulhusni
 
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
frank0071
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
yusufzako14
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
IshaGoswami9
 
nodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptxnodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptx
alishadewangan1
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
NoelManyise1
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
tonzsalvador2222
 
S.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary levelS.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary level
ronaldlakony0
 

Recently uploaded (20)

Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
 
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
nodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptxnodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptx
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
S.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary levelS.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary level
 

PubChem: a public chemical information resource for big data chemistry

  • 1. PubChem: a Public Chemical Information Resource for Big Data Chemistry Sunghwan Kim, Ph.D., M.Sc.
  • 2. Outline
      1. What is PubChem?
      2. What does PubChem have?
      3. Navigating PubChem
      4. Programmatic access to PubChem
      5. Showcase: bioactivity prediction model building
      6. Summary
  • 3. 1. What is PubChem?
  • 4. https://pubchem.ncbi.nlm.nih.gov is a chemical information resource at NIH. It serves scientific communities as well as the general public.
  • 5. ~5 million unique monthly users at peak (Apr. 2020), counting interactive users only (no bots). A similar amount of web traffic comes from programmatic users. One of the top 5 most visited chemistry websites in the world (https://www.alexa.com/topsites/category/Top/Science/Chemistry).
  • 6. PubChem is a data aggregator. PubChem Sources: https://pubchem.ncbi.nlm.nih.gov/sources
      o 750+ data sources: government agencies, academic institutions, publishers, pharma companies, chemical vendors, scientific databases
      o Public users:
          - Research communities (chemical biology, medicinal chemistry, drug design & discovery, cheminformatics)
          - Patent agents/examiners
          - Chemical safety officers
          - Educators/librarians
          - Students
          - ……
  • 7. 2. What does PubChem have?
  • 8. PubChem contains (as of August 2020):
      o 103-M unique chemical structures
      o 268-M bioactivity outcomes
      o 1.2-M bioassay experiments
      o 91-K genes & 95-K proteins (from 4-K organisms)
      o 237-K pathways
      o 31-M scientific articles about chemicals
      o 3-M patent documents
    PubChem Statistics: https://pubchemdocs.ncbi.nlm.nih.gov/statistics
    Arguably, PubChem contains the largest amount of chemical information in the public domain.
  • 9. PubChem data for drug discovery
      o Drug information
          - Drug labeling
          - Drug indications
          - Mechanism of action
          - Target genes/proteins
          - ADMET (Absorption, Distribution, Metabolism, Excretion & Toxicity)
      o Clinical trials information
          - ClinicalTrials.gov (https://clinicaltrials.gov/)
          - EU Clinical Trials Register (https://www.clinicaltrialsregister.eu/)
          - NIPH Clinical Trials Search of Japan (https://rctportal.niph.go.jp/en/)
  • 10. PubChem data for drug discovery
      o Regulatory information
          - FDA: Orange Book, unique ingredient identifiers, pharmacologic classes
          - EPA: Substance Registry Services; chemical data collected under the Toxic Substances Control Act and the Clean Air Act
      o Patent information (USPTO, EPO, WIPO, JPO)
      o Journal articles (PubMed & non-PubMed)
  • 11. PubChem data for drug discovery
      o Structural information
          - 2-D chemical structures
          - Line notations for 2-D chemical structures (SMILES, InChI, InChIKey)
          - Computationally generated 3-D structures
          - Experimental 3-D structures (from the Crystallography Open Database)
          - Links to 3-D structures in PDB/CSD
      o Chemical properties (solubility, pKa, molecular weight, logP, …)
      o Spectral information (NMR, IR, UV, MS, GC-MS, LC-MS, …)
      o Chemical vendors
      o Synthesis
      o ……
  • 12. PubChem data for drug discovery
      o Bioactivity data
          - High-throughput screening (HTS) data (NCATS, EPA, Broad Institute, Sanford-Burnham, Scripps, …)
          - Literature-extracted data from scientific articles and patent documents through text mining & manual curation (ChEMBL, IUPHAR/BPS Guide to PHARMACOLOGY, BindingDB, …)
  • 32. Gene/Protein Target Page
      o Suppose that you want to retrieve ALL active compounds against a given protein/gene target (e.g., HMGCR = 3-hydroxy-3-methylglutaryl-CoA reductase):
          - To identify common chemical scaffolds responsible for bioactivity.
          - To build a quantitative structure-activity relationship (QSAR) model.
      o The Gene/Protein Target page:
          - Provides a target-centric view of PubChem data.
          - Organizes all data available in PubChem for a given gene/protein.
  • 40. Patent View Page
      o Suppose that you want to retrieve ALL chemicals mentioned in a given patent document.
      o The Patent View page:
          - Provides a list of chemicals “mentioned” in the patent application/grant.
          - Gives no information on why they are mentioned (e.g., as subject matter or as prior art).
          - Provides other information, including the title, abstract, date, inventor, …, and International Patent Classification (IPC) codes.
  • 47. 4. Programmatic Access to PubChem
  • 50. PubChem users have very diverse backgrounds and interests. PubChem’s web interfaces are optimized to perform commonly requested tasks interactively. Everything you can do with PubChem through the web browser can be automated through PubChem’s programmatic interfaces. Programmatic access also enables much more complicated tasks that cannot be done through the web browser.
  • 51. Multiple programmatic access routes
      o Two major programmatic access methods:
          - PUG-REST (primarily for computed properties): https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
          - PUG-View (primarily for text information): https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
      o Request volume limitation:
          - No more than 5 requests per second (see more at: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations).
          - Violators/abusers may be blocked for a certain period of time.
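As a rough illustration of the two points on this slide (PUG-REST requests and the 5-requests-per-second limit), the sketch below builds request URLs and spaces calls out on the client side. The URL patterns are assumed from the PUG-REST documentation linked above, the helper names (`property_url`, `active_cids_url`, `Throttle`) are hypothetical, and no network request is actually sent.

```python
import time

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
MAX_REQUESTS_PER_SECOND = 5  # limit stated on the slide


def property_url(cid, properties, fmt="JSON"):
    """Build a PUG-REST URL for computed properties of one compound (CID)."""
    return f"{BASE}/compound/cid/{cid}/property/{','.join(properties)}/{fmt}"


def active_cids_url(aid, fmt="TXT"):
    """Build a PUG-REST URL listing the compounds tested active in one assay (AID)."""
    return f"{BASE}/assay/aid/{aid}/cids/{fmt}?cids_type=active"


class Throttle:
    """Sleep as needed so consecutive calls are at least 1/rate seconds apart."""

    def __init__(self, rate=MAX_REQUESTS_PER_SECOND,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous call, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


if __name__ == "__main__":
    throttle = Throttle()
    for cid in (2244, 3672):  # arbitrary example CIDs
        throttle.wait()  # call this before every real HTTP request
        print(property_url(cid, ["MolecularWeight", "InChIKey"]))
    print(active_cids_url(1159531))  # Tox21 AID used in the showcase
```

In a real workflow, each URL would be fetched with an HTTP client right after `throttle.wait()`, which keeps a simple loop under the stated request limit.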
  • 52. Bulk Download
      o Structure Download Service (up to 500,000 compounds): https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi
      o Assay Download Service (up to 1,000 assays): https://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi
      o PubChem FTP Site: ftp://ftp.ncbi.nlm.nih.gov/pubchem
      o PubChem RDF (Resource Description Framework): https://pubchemdocs.ncbi.nlm.nih.gov/rdf
  • 54. Retinoid X Receptor α (RXRA), PDB ID: 1FBY
      o Involved in regulation of gene expression in various biological processes.
      o Potential roles in:
          - metabolic signaling pathways
          - skin alopecia (spot baldness)
          - dermal cysts
          - cardiac development
          - insulin sensitization
          - ……
  • 55–59. Data sets (each source was preprocessed before use)
      o Tox21 (AID 1159531): quantitative HTS (qHTS) data for 10K compounds; predominantly inactive. Split 90%/10% into:
          - Training (4,916 compounds): 471 actives, 4,445 inactives
          - Test (547 compounds): 53 actives, 494 inactives
      o ChEMBL (45 assays): data extracted from journal articles; predominantly active.
          - External 1 (222 compounds): 205 actives, 17 inactives
      o NCGC (2 assays): qHTS data; predominantly inactive; some overlap with Tox21.
          - External 2 (489 compounds): 20 actives, 469 inactives
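The class balance of the four data sets on the slide can be checked with a few lines of arithmetic. The sketch below only reuses the counts quoted on the slide; the dictionary layout and function names are hypothetical.

```python
# Counts copied from the "Data sets" slide: (actives, inactives, stated size).
DATASETS = {
    "Training (Tox21)": (471, 4445, 4916),
    "Test (Tox21)": (53, 494, 547),
    "External 1 (ChEMBL)": (205, 17, 222),
    "External 2 (NCGC)": (20, 469, 489),
}


def inactive_to_active_ratio(actives, inactives):
    """Return the number of inactive compounds per active compound."""
    return inactives / actives


def summarize(datasets):
    """Check each stated size and return {name: inactive-to-active ratio}."""
    ratios = {}
    for name, (actives, inactives, size) in datasets.items():
        if actives + inactives != size:
            raise ValueError(f"{name}: counts do not add up to {size}")
        ratios[name] = inactive_to_active_ratio(actives, inactives)
    return ratios


if __name__ == "__main__":
    for name, ratio in summarize(DATASETS).items():
        print(f"{name}: {ratio:.2f} inactives per active")
```

The Tox21-derived sets and the NCGC set have many more inactives than actives, while the ChEMBL set has the opposite imbalance. This is why a metric that weights both classes equally is useful when comparing models across these sets.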
  • 60. Model Building: molecular descriptors
      o Generated using PaDEL [Yap CW (2011). J. Comput. Chem., 32 (7): 1466-1474].
      o Fingerprints (abbreviation: name, length):
          - AP: AtomPairs 2D fingerprint, 780
          - ESTAT: Estate fingerprint, 79
          - EXTFP*: CDK extended fingerprint, 1,024
          - FP*: CDK fingerprint, 1,024
          - GOFP*: CDK graph-only fingerprint, 1,024
          - KR: Klekota-Roth fingerprint, 4,860
          - MACCS: MACCS fingerprint, 166
          - PUB: PubChem fingerprint, 881
          - SUB: Substructure fingerprint, 307
        * Hashed fingerprints
  • 61. Model Building: machine-learning algorithms (implemented in scikit-learn)
      o Algorithms (abbreviation: name; hyperparameters optimized):
          - NB: Naïve Bayes; α (10^-10 ~ 1)
          - DT: Decision tree; max_depth_range (3 ~ 7), min_samples_split_range (3 ~ 7), min_samples_leaf_range (2 ~ 6)
          - kNN: K-nearest neighbors; weights (uniform, minkowski, jaccard), n_neighbors (1 ~ 25)
          - RF: Random forest; n_estimators (10 ~ 200)
          - SVM: Support vector machine; C (2^-10 ~ 2^10), γ (2^-10 ~ 2^10)
          - NN: Neural network; solver (lbfgs or adam), α (10^-7 ~ 10^7)
      o 10-fold cross-validation was used for hyperparameter optimization.
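To give a sense of the search cost implied by this slide, the sketch below enumerates a hyperparameter grid for the decision tree. It reads the slide's ranges as inclusive integer ranges, which is an assumption (the step size is not stated). The keys are scikit-learn's `DecisionTreeClassifier` parameter names, and `expand_grid` is a hypothetical helper.

```python
from itertools import product

# Decision-tree ranges from the slide, read as inclusive integer ranges
# (the step size is an assumption, not stated in the source).
PARAM_GRID = {
    "max_depth": list(range(3, 8)),          # 3 ~ 7
    "min_samples_split": list(range(3, 8)),  # 3 ~ 7
    "min_samples_leaf": list(range(2, 7)),   # 2 ~ 6
}
N_FOLDS = 10  # 10-fold cross-validation, as stated on the slide


def expand_grid(grid):
    """Return every hyperparameter combination as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


if __name__ == "__main__":
    candidates = expand_grid(PARAM_GRID)
    print(len(candidates), "candidate settings")
    print(len(candidates) * N_FOLDS, "model fits for an exhaustive search")
```

The same dictionary could be passed as `param_grid` to scikit-learn's `GridSearchCV` with `scoring="roc_auc"` and `cv=10`, which would match the AUC-based, 10-fold setup described on the slides. Whether the models were actually tuned with `GridSearchCV` is not stated in the source.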
  • 62. Model Performance Evaluation
      o Area under the receiver operating characteristic curve (AUC): used for hyperparameter optimization.
      o Balanced accuracy: $\mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right) = \frac{1}{2}\left(\mathrm{SENS} + \mathrm{SPEC}\right)$
      o Sensitivity: $\mathrm{SENS} = \frac{TP}{TP+FN}$
      o Specificity: $\mathrm{SPEC} = \frac{TN}{TN+FP}$
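The definitions on this slide translate directly into code. The sketch below is a minimal pure-Python version; the confusion-matrix counts in the example are made up for illustration and are not results from the study.

```python
def sensitivity(tp, fn):
    """SENS = TP / (TP + FN): fraction of actives predicted active."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """SPEC = TN / (TN + FP): fraction of inactives predicted inactive."""
    return tn / (tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """BACC = (SENS + SPEC) / 2."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))


def accuracy(tp, fn, tn, fp):
    """Plain accuracy, shown only for comparison with BACC."""
    return (tp + tn) / (tp + fn + tn + fp)


if __name__ == "__main__":
    # Hypothetical classifier that labels every compound "inactive"
    # on a set of 10 actives and 90 inactives.
    counts = dict(tp=0, fn=10, tn=90, fp=0)
    print("accuracy:", accuracy(**counts))
    print("BACC:", balanced_accuracy(**counts))
```

In this made-up case, plain accuracy is 0.9 even though no active compound is found, while BACC is 0.5. This illustrates why a balanced metric suits predominantly inactive screening data.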
  • 63. Performance of the models (area under ROC curve, AUC)
      o AUC scores of 0.7 were observed for models developed using PubChem/MACCS/CDK-FP fingerprints with NN/SVM/RF/kNN.
      o Maximum AUC score (0.77): PubChem fingerprint with RF.
      o A similar trend was observed for the performance in terms of BACC scores (not shown here).
  • 64. General applicability of the models [Figure: area under ROC curve (AUC) at an inactive-to-active ratio of 1; panels: NCGC, ChEMBL]
  • 66. PubChem is the largest source of publicly available chemical information, collected from more than 750 data sources. PubChem contains a wide range of annotated information for chemicals (e.g., gene/protein targets, toxicity, chemical vendors, patents, ……). PubChem contains a large amount of high-throughput screening data as well as literature-extracted bioactivity data.
  • 67. PubChem supports various types of searches (e.g., keyword search, identity/similarity search, substructure/superstructure searches, ……). PubChem supports programmatic access to its data, which allows automated workflows to be built. PubChem’s bioactivity data can be used to develop predictive models for the bioactivity of small molecules.
  • 68. Acknowledgements
      o The PubChem Team: Evan Bolton, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin Shoemaker, Paul Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang
      o PubChem users, depositors, and collaborators
      o Funded by the National Library of Medicine
  • 69. Thank you! Questions? Sunghwan Kim. Email: sunghwan.kim@nih.gov. SlideShare: https://www.slideshare.net/SunghwanKim95/presentations