PubChem and Big Data Chemistry
Sunghwan Kim, Ph.D., M.Sc.
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Email: sunghwan.kim@nih.gov
2
Outline
1. What Is PubChem?
2. What Does PubChem Have?
3. Exploring Chemical Information in PubChem
4. Programmatic Access to PubChem
5. Bioactivity Prediction Model Building with PubChem Data
6. PubChem and COVID-19 Conspiracy Theories
7. Summary
3
1. What Is PubChem?
4
 https://pubchem.ncbi.nlm.nih.gov
 Public chemical database at NIH.
 Contains information on various chemical entities:
• (Drug-like) small molecules
• siRNAs & miRNAs
• Carbohydrates
• Lipids
• Peptides
• Chemically modified macromolecules
• ……
PubChem Is a Public Chemical Information Resource
5
PubChem Is a Data Aggregator
PubChem Sources: https://pubchem.ncbi.nlm.nih.gov/sources
800+ data sources:
o Gov’t agencies
o Academic institutions
o Publishers
o Pharma companies
o Chemical vendors
o Scientific databases
Users:
o Biomedical researchers
• Chemical biology
• Medicinal chemistry
• Drug design & discovery
• Cheminformatics
o Data scientists
o Patent agents/examiners
o Chemical safety officers
o Educators/librarians
o Students
6
History of PubChem
 NIH Molecular Libraries Program (MLP)
 Common Fund project.
 Aimed to provide academic researchers with high-throughput
screening (HTS) resources for drug discovery.
 Had three components (subprojects):
o Large, shared compound library
o HTS centers at academic institutions
o Central data repository (PubChem)
9
History of PubChem
 PubChem was launched in 2004 as a component of MLP.
 All Common Fund projects are supported for at most 10 years.
 PubChem evolved to play a dual role:
 As a data archive
 As a knowledgebase
11
Monthly Usage Statistics (Unique Interactive Users Only)
[Figure: unique monthly users (millions, axis 0 to 6) over time. Source: Google Analytics]
 5 million unique interactive users per month at peak (Oct. 2020)
 Programmatic requests are not included.
 These statistics are therefore a lower bound on actual usage.
12
User Demographics (June 2020 through May 2021)
[Figure: number of users (millions) by age group]
Age group   Share of users
18-24       36.5%
25-34       27.4%
35-44       13.5%
45-54       10.4%
55-64       6.7%
65+         5.4%
34.64% of total users
~40% of PubChem users are aged between 18 and 24.
(likely to be college students)
13
2. What Does PubChem Have?
14
PubChem Data Content
• Structures and properties
• Spectra
• Chemical health & safety
• Bioactivity
• Chemical vendors & synthesis
19
PubChem Data Content
Clinical trials
Patents
Drugs
Scientific articles
23
Dual Role of PubChem
Archive Knowledgebase
25
Multiple data collections in PubChem
Archive:
• Substance: depositor-provided chemical data
• BioAssay: assay descriptions & test results
Knowledgebase:
• Compound: unique chemical structures
• Protein, Gene, Pathway, Patent: chemical data associated with a protein/gene/pathway/patent
26
 As of November 2021, PubChem contains:
• 276 million substance descriptions
• 111 million unique chemical structures
• 292 million biological activity test results
• 1.4 million biological assays, covering 21 thousand unique protein sequence targets.
(Arguably) the largest corpus of publicly available chemical information, drawn from 800+ data sources.
PubChem Statistics
27
PubChem’s Chemical Space
Property                                      Lipinski’s Rule of 5 (Ro5)   Congreve’s Rule of 3 (Ro3)
                                              for drug-likeness (a)        for lead-likeness (b)
Molecular weight                              ≤500                         ≤300
Octanol–water partition coefficient (log P)   ≤5                           ≤3
Number of H-bond donors                       ≤5                           ≤3
Number of H-bond acceptors                    ≤10                          ≤3
Number of rotatable bonds                     N/A                          ≤3
Polar surface area (PSA, Å²)                  N/A                          ≤60
(a) Lipinski et al., Adv. Drug Delivery Rev. 1997, 23(1–3), 3-25.
(b) Congreve et al., Drug Discov. Today, 2003, 8(19), 876-877.
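The thresholds in the table can be applied mechanically once the descriptor values for a molecule are known. The sketch below is one minimal way to do that in Python; the `Descriptors` container, the function names, and the example values are hypothetical and are not part of PubChem or of the original slides.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptor values for one molecule (hypothetical container)."""
    mol_weight: float          # g/mol
    log_p: float               # octanol-water partition coefficient
    hbond_donors: int
    hbond_acceptors: int
    rotatable_bonds: int
    polar_surface_area: float  # square angstroms


def ro5_violations(d: Descriptors) -> int:
    """Count how many of the four Rule-of-5 criteria are violated."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.hbond_donors > 5,
        d.hbond_acceptors > 10,
    ])


def passes_ro3(d: Descriptors) -> bool:
    """Return True when all six Rule-of-3 criteria from the table are met."""
    return (
        d.mol_weight <= 300
        and d.log_p <= 3
        and d.hbond_donors <= 3
        and d.hbond_acceptors <= 3
        and d.rotatable_bonds <= 3
        and d.polar_surface_area <= 60
    )


# Illustrative values only; they do not describe a specific PubChem record.
example = Descriptors(450.0, 4.2, 2, 7, 6, 95.0)
print(ro5_violations(example), passes_ro3(example))
```

A violation count is returned rather than a yes/no answer so that the same function can also group compounds by the number of Ro5 criteria they miss.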
28
PubChem’s Chemical Space
Set                          Compounds       Share
All compounds                110.6 million   100%
Lipinski’s Rule of 5 (Ro5)   78.9 million    71.36%
Congreve’s Rule of 3 (Ro3)   11.7 million    10.57%
29
PubChem’s Chemical Space
Set     Compounds      Share
Ro5     78.9 million   71.36%
Ro5−1   18.9 million   17.08%
Ro5−2   10.2 million   9.26%
Ro5−3   2.3 million    2.05%
Ro5−4   0.28 million   0.25%
Ro5 + Ro5−1 = 88.44%
30
Bioactivity Data in PubChem
All compounds: 110.7 million (100.00%)
• Not tested: 107.0 million (96.73%)
• Tested: 3.6 million (3.27%)
- Active (AC ≤ 1 nM): 74 thousand (0.07%)
- Active (1 nM < AC ≤ 1 µM): 777.5 thousand (0.70%)
- Active (others): 635.2 thousand (0.57%)
- Inactive: 2.1 million (1.93%)
AC: activity concentration (e.g., IC50, EC50, Ki, Kd, etc.)
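The potency bins used above can be expressed as a small helper. This is only a sketch: the function names are hypothetical, and it assumes that "Active (others)" covers actives with AC above 1 µM as well as actives with no reported AC value, which the slide does not state explicitly.

```python
from collections import Counter
from typing import Iterable, Optional


def activity_bin(ac_nm: Optional[float]) -> str:
    """Assign an active compound to a potency bin from its AC value in nM.

    Assumption: None (no AC reported) and values above 1 uM both fall
    into the "others" bin.
    """
    if ac_nm is None:
        return "active (others)"
    if ac_nm <= 1.0:
        return "active (AC <= 1 nM)"
    if ac_nm <= 1000.0:  # 1 uM = 1000 nM
        return "active (1 nM < AC <= 1 uM)"
    return "active (others)"


def tally_bins(ac_values_nm: Iterable[Optional[float]]) -> Counter:
    """Count how many compounds fall into each potency bin."""
    return Counter(activity_bin(v) for v in ac_values_nm)


# Made-up AC values in nM, for illustration only.
print(tally_bins([0.5, 1.0, 20.0, 1000.0, 5000.0, None]))
```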
31
Bioactivity Data in PubChem
High-Throughput Screening data
• From Molecular Libraries Program and other HTS projects.
• Many inactives
• False hits (e.g., aggregators, autofluorescent compounds)
• Typically measured at a single concentration
Literature-extracted data
• From manual curation or data mining
• No (or few) inactives
• Provided by various PubChem depositors, including ChEMBL, PDBbind, BindingDB, and Guide to Pharmacology
34
Availability of compounds for subsequent experiments
• Virtual screening hits should be synthesizable or purchasable.
• PubChem contains “real” molecules (not “virtual” molecules).
• For each compound, at least one data contributor claims to have the compound and/or information about it.
• Some of these contributors are chemical vendors (e.g., Sigma Aldrich).
35
 Two important aspects of PubChem records
(in the context of “compound availability”)
 Non-live compounds:
 Not searchable although they exist.
 No associated substances due to:
o Mistakenly submitted substances
o Incorrect information
o No intention to share
 Legacy designation:
 The depositor no longer keeps its records up to date.
o Discontinued funding, low business priority, …
Availability of compounds for subsequent experiments
37
3. Exploring Chemical Information in PubChem
38
Text Query
 Chemical name
 Gene/protein name
 Pathway name
 Patent ID
 CAS registry number
 PubChem record ID
(CID, SID, AID)
39
Multiple collections are searched simultaneously.
https://pubchem.ncbi.nlm.nih.gov/#query=%22salicylic%20acid%22
40
Compound Summary for salicylic acid (CID 338)
https://pubchem.ncbi.nlm.nih.gov/compound/338
41
Chemical Structure
Query
 SMILES
 InChI/InChIKey
42
Line notations for chemical structures
Simplified molecular-input line-entry system (SMILES)
CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5
IUPAC International Chemical Identifier (InChI)
InChI=1S/C17H21NO/c1-18(2)13-14-19-17(15-9-5-3-6-10-15)16-11-7-4-8-12-16/h3-12,17H,13-14H2,1-2H3
43
Multiple types of
chemical
structure search
 Identity
 2-D similarity
 3-D similarity
 Substructure
 Superstructure
44
 Identity Search
 Depending on what you mean by “identical molecules”, you will get different search results.
 What is the definition of “identity”?
→ Different tautomeric states,
Different stereoisomers,
Different isotopes,
Salt forms or mixtures, …
(ex) CHCl3 vs. CDCl3:
Both have nearly the same chemical properties but different spectroscopic properties.
 Users can search PubChem using different “nuances” of structural identity.
Chemical Structure Search
47
 Substructure Search
• use a substructure as a query
• search for compounds that contain the query substructure.
 Superstructure Search
• use a superstructure as a query
• search for compounds that are contained in the query superstructure.
Chemical Structure Search
48
 When do you use substructure searches?
ex. when you want to find all molecules that
have a particular molecular scaffold.
Cephalosporins
(a class of β-lactam antibiotics)
Substructure/Superstructure Search
Chemical Structure Search
49
 Similarity Search
 Why do we need similarity search?
• There is a huge imbalance of available information among compounds in PubChem.
For example, among 110.7 million compounds in PubChem,
- 3.6 million compounds (3.27%) have been tested in at least one assay.
- 1.5 million compounds (1.34%) have been found active in at least one assay.
• The remaining 107.0 million compounds (96.73%) have not been tested in any assay.
• Bioactivities of these compounds may be predicted from structurally similar compounds with
known bioactivities.
• “Similarity Principle” : structurally similar compounds are likely to have similar biological
properties.
Chemical Structure Search
50
 How can you quantify similarity?
• Similarity is very subjective and context-dependent.
• There are many different ways to quantify similarity.
• Different similarity methods will recognize different flavors of similarity.
• PubChem uses two different similarity measures.
- 2-D similarity based on molecular fingerprints.
- 3-D similarity based on rapid-overlay of chemical structures (ROCS).
 Similarity Search
Chemical Structure Search
51
 PubChem 2-D Similarity
• PubChem 881-bit binary fingerprints:
Each bit position represents the presence (=1) or
absence (=0) of a predefined molecular fragment.
• Structural similarity between two molecules is computed using the Tanimoto equation:
$$\mathrm{Tanimoto} = \frac{N_{AB}}{N_A + N_B - N_{AB}}$$
N_A: number of bits set for molecule A
N_B: number of bits set for molecule B
N_AB: number of bits set for both
• Tanimoto score ranges from 0 (for no similarity) to 1 (for identical molecules).
Chemical Structure Search
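The Tanimoto equation can be illustrated with a few lines of Python. In this sketch each fingerprint is represented as the set of bit positions that are on; the function name and the toy bit positions are hypothetical, and returning 0.0 when neither fingerprint has any bit set is an assumption made to avoid division by zero.

```python
from typing import Iterable


def tanimoto(bits_a: Iterable[int], bits_b: Iterable[int]) -> float:
    """Tanimoto coefficient between two fingerprints.

    Each argument lists the positions of the bits that are set (= 1).
    """
    set_a, set_b = set(bits_a), set(bits_b)
    n_ab = len(set_a & set_b)               # bits set in both molecules
    denom = len(set_a) + len(set_b) - n_ab  # N_A + N_B - N_AB
    if denom == 0:
        return 0.0  # assumption: two empty fingerprints score 0
    return n_ab / denom


# Toy fingerprints: N_A = 4, N_B = 3, N_AB = 2, so the score is 2 / (4 + 3 - 2).
print(tanimoto({1, 2, 3, 4}, {3, 4, 5}))  # 0.4
```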
53
 PubChem 3-D Similarity
 Three similarity measures:
• Shape-Tanimoto (ST): 3-D overlap between steric shapes of molecules
• Color-Tanimoto (CT): 3-D overlap between “feature” atoms
(H-bond donors/acceptors, Cationic/Anionic centers, rings and hydrophobes)
• Combo-Tanimoto (ComboT): the sum of ST and CT
 Both ST and CT range from 0 to 1, so ComboT ranges from 0 to 2 (without normalization to 1).
Chemical Structure Search
54
 PubChem 3-D Similarity
Chemical Structure Search
 3-D similarity quantification involves optimization of superposition between two molecules:
• ST-optimization: finds the superposition that maximizes the ST score between them.
• CT-optimization: considers both CT and ST scores during the optimization.
55
 Why does PubChem use two different similarity measures?
• 2-D similarity comparison is much faster than 3-D similarity comparison:
- 2-D: 10^6 comparisons per second
- 3-D: 10^2 to 10^3 comparisons per second
• However, 2-D similarity methods often fail to recognize structural similarity that 3-D similarity methods recognize easily.
(ex) CID 1548887 (Sulindac) vs. CID 3715 (Indomethacin): 2D = 0.39, ST = 0.92, CT = 0.52
Both are non-steroidal anti-inflammatory drugs (NSAIDs) and cyclooxygenase inhibitors.
Chemical Structure Search
56
Gene/Protein/Pathway Summary
 Suppose that you want to:
o Retrieve ALL active compounds
against a given protein/gene/pathway target
(e.g., HMGCR=3-hydroxy-3-methylglutaryl-CoA reductase).
• To identify common chemical scaffolds responsible for bioactivity.
• To build a quantitative structure-activity relationship (QSAR) model.
→Gene/Protein/Pathway Summary
• Provides a target-centric view of PubChem data.
• Organizes all data available in PubChem for a given
gene/protein/pathway.
57
58
Patent Summary
 Suppose that you want to:
o Retrieve ALL chemicals mentioned in a given patent document.
→Patent Summary page
• Provides a list of chemicals “mentioned” in the patent application/grant.
• No information on why they are mentioned
(e.g., as the claimed subject matter or as prior art?).
• Other information, including:
- Title, abstract, date, inventor, …
- International patent classification (IPC) codes
59
60
 https://pubchem.ncbi.nlm.nih.gov/classification
 Browse PubChem data using a classification of interest.
 Search for records annotated with the desired classification/term.
 A few examples of supported ontologies/classifications.
• MeSH (Medical Subject Headings)
• ChEBI (Chemical Entities of Biological Interest)
• FDA Pharm Classes
• PubChem Compound Table of Contents
• PubChem BioAssay Classification
• WHO ATC (Anatomical Therapeutic Chemical Classification System) Code
• WIPO International Patent Classification
Classification Browser
61
Classification
Browser
62
63
 Identifier Exchange Service
https://pubchemdocs.ncbi.nlm.nih.gov/identifier-exchange-service
 Score Matrix Service
https://pubchemdocs.ncbi.nlm.nih.gov/identifier-exchange-service
 Standardization Service
https://pubchem.ncbi.nlm.nih.gov/standardize/standardize.cgi
 PubChem Data Sources (https://pubchem.ncbi.nlm.nih.gov/sources)
 PubChem Widgets (https://pubchemdocs.ncbi.nlm.nih.gov/widgets)
 PubChem Upload (https://pubchem.ncbi.nlm.nih.gov/upload/)
 PubChem Blog (https://pubchemblog.ncbi.nlm.nih.gov)
 PubChemDocs (https://pubchemdocs.ncbi.nlm.nih.gov)
Other Tools & Services
64
4. Programmatic Access to PubChem
65
 PubChem users have very diverse
backgrounds/interests.
 PubChem’s web interfaces are optimized
to perform commonly requested tasks
interactively.
 Everything you can do with PubChem
through the web browser can be
automated through PubChem’s
programmatic interfaces.
 Programmatic access enables one to do
much more complicated tasks that cannot
be done through the web browser.
68
 Multiple programmatic access routes:
o Entrez Utilities (E-Utils)
o Power User Gateway (PUG)
o PUG-SOAP
o PUG-REST
o PubChem RDF REST
o PUG-View
 Two major programmatic access methods
o PUG-REST (primarily for computed properties).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
o PUG-View (primarily for text information).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
 Request volume limitation:
o No more than 5 requests per second
(See more at: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations)
o Violators/abusers may be blocked for a certain period of time.
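As an illustration of PUG-REST access under the 5-requests-per-second limit, the following Python sketch builds a property-lookup URL and spaces out successive requests. The helper names (`property_url`, `Throttle`, `fetch_json`) are hypothetical; the URL layout follows the PUG-REST pattern `compound/cid/<CIDs>/property/<properties>/JSON`, and CID 338 (salicylic acid) is taken from the slides. Spacing requests evenly is one conservative reading of the limit, not an official client.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, Optional

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cids: Iterable[int], properties: Iterable[str]) -> str:
    """Build a PUG-REST URL that requests computed properties for some CIDs."""
    cid_part = ",".join(str(cid) for cid in cids)
    prop_part = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/cid/{cid_part}/property/{prop_part}/JSON"


class Throttle:
    """Keep successive calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second: float = 5.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_json(url: str, throttle: Throttle) -> dict:
    """Fetch one URL (network access required) after honoring the throttle."""
    throttle.wait()
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


print(property_url([338], ["MolecularWeight", "XLogP"]))
```

With network access, `fetch_json(property_url([338], ["MolecularWeight", "XLogP"]), Throttle())` would return the parsed JSON response; reusing one `Throttle` instance across a loop keeps a batch job under the stated limit.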
5. Showcase: Bioactivity Prediction Model Building with PubChem Data
 Involved in regulation of gene expression in
various biological processes.
 Potential roles in:
• metabolic signaling pathways
• skin alopecia (spot baldness)
• dermal cysts
• cardiac development
• insulin sensitization
• ……
 Let’s build binary classifiers (i.e., active vs. inactive) for chemical modulators of RXRA.
Retinoid X Receptor α (RXRA)
PDB ID: 1FBY
Data sets (all data available in PubChem)
Source                Data set         Compounds   Actives   Inactives
Tox21 (AID 1159531)   Training (90%)   4,916       471       4,445
Tox21 (AID 1159531)   Test (10%)       547         53        494
ChEMBL (45 assays)    External 1       222         205       17
NCGC (2 assays)       External 2       489         20        469
• Tox21: quantitative HTS (qHTS) data for 10K compounds; predominantly inactive.
• ChEMBL: data extracted from journal articles; predominantly active.
• NCGC: qHTS data; predominantly inactive; some overlap w/ Tox21.
• Each source was preprocessed before use.
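A 90%/10% training/test division like the one above can be sketched in plain Python. The slides do not say how the split was made, so the stratified, per-class sampling and the ceiling rounding used here are assumptions, and `stratified_split` is a hypothetical helper rather than the author's procedure.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[int], test_fraction: float = 0.1,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split record indices into training and test sets, class by class."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Assumption: round the test-set size up within each class.
        n_test = math.ceil(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# 524 actives (label 1) and 4,939 inactives (label 0): the Tox21 totals
# implied by the training and test counts listed above.
labels = [1] * 524 + [0] * 4939
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Sampling within each class keeps the active-to-inactive ratio nearly the same in both subsets, which matters when actives are as rare as they are here.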
Model Building
 Molecular descriptors
• Generated using PaDEL [Yap CW (2011). J. Comput. Chem., 32 (7): 1466-1474]
Abbreviation   Name                         Length
AP             AtomPairs 2D fingerprint     780
ESTAT          Estate fingerprint           79
EXTFP*         CDK extended fingerprint     1,024
FP*            CDK fingerprint              1,024
GOFP*          CDK graph only fingerprint   1,024
KR             Klekota-Roth fingerprint     4,860
MACCS          MACCS fingerprint            166
PUB            PubChem fingerprint          881
SUB            Substructure fingerprint     307
* Hashed fingerprints
 Machine-learning algorithms (implemented in scikit-learn)
Abbreviation   Name                     Hyperparameters optimized
NB             Naïve Bayes              α (10^-10 ~ 1)
DT             Decision tree            max_depth_range (3 ~ 7); min_samples_split_range (3 ~ 7); min_samples_leaf_range (2 ~ 6)
kNN            K-nearest neighbors      weights (uniform, minkowski, jaccard); n_neighbors (1 ~ 25)
RF             Random forest            n_estimators (10 ~ 200)
SVM            Support vector machine   C (2^-10 ~ 2^10); γ (2^-10 ~ 2^10)
NN             Neural network           solver (lbfgs or adam); α (10^-7 ~ 10^7)
 10-fold cross-validation was used for hyperparameter optimization.
Model Performance Evaluation
 Area under the receiver operating characteristic curve (AUC)
→ Used for hyperparameter optimization.
 Balanced accuracy (BACC):
$$\mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right) = \frac{1}{2}\left(\mathrm{SENS} + \mathrm{SPEC}\right)$$
 Sensitivity (SENS):
$$\mathrm{SENS} = \frac{TP}{TP+FN}$$
 Specificity (SPEC):
$$\mathrm{SPEC} = \frac{TN}{TN+FP}$$
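The metrics above translate directly into code. The sketch below uses hypothetical function names and, as a worked example, a made-up model that labels every compound in the test set (53 actives, 494 inactives) as inactive; it is meant to show why balanced accuracy is more informative than plain accuracy on imbalanced data, not to reproduce any result from the slides.

```python
def sensitivity(tp: int, fn: int) -> float:
    """SENS = TP / (TP + FN): fraction of actives predicted active."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """SPEC = TN / (TN + FP): fraction of inactives predicted inactive."""
    return tn / (tn + fp)


def balanced_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """BACC = (SENS + SPEC) / 2."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))


# Hypothetical "always inactive" model on a set with 53 actives and 494 inactives.
tp, fn = 0, 53
tn, fp = 494, 0
plain_accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(plain_accuracy, 3))             # 0.903
print(balanced_accuracy(tp, tn, fp, fn))    # 0.5
```

Plain accuracy rewards the trivial model with about 90%, while BACC stays at 0.5, the value expected for a model with no discriminating power.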
 Performance of the models
 AUC scores of ≥0.7 were observed for models
developed using:
PubChem/MACCS/CDK-FP with
NN/SVM/RF/kNN
 Maximum AUC score (0.77):
PubChem fingerprint with RF
 Similar trend was observed for the
performance in terms of BACC scores
(not shown here).
Model Performance Evaluation
[Figure: area under ROC curve (AUC) of the models]
General applicability of the models
[Figure: area under ROC curve (AUC) for the NCGC and ChEMBL data sets; inactive-to-active ratio = 1]
83
6. PubChem and COVID-19 Conspiracy Theories
84
Page Views for Hydroxychloroquine (in 2020)
[Figure: page views (axis 0 to 14,000) for hydroxychloroquine, remdesivir, and dexamethasone, January to December 2020]
Annotated events:
• 3/17: Univ. of Minnesota begins testing hydroxychloroquine.
• 3/30: Emergency Use Authorization of hydroxychloroquine.
• 4/8: Trump said “What do you have to lose?”
• 5/18: Trump said he had been taking it.
• 7/28: Trump said he still thought it worked.
Drugs used for standard treatment of COVID-19 had fewer page views.
86
Source: https://silview.media/2020/10/04/atomic-bombshell-rothschilds-patented-covid-19-biometric-tests-in-2015-and-2017
System and Method for
Testing for COVID-19
(US 2020279585 A1)
Priority date:
2015-10-13
87
Source: https://www.reuters.com/article/uk-factcheck-patent/fact-check-rothschild-did-not-patent-a-test-for-covid-19-in-2015-and-2017-idUSKBN27C34O
88
89
90
This is a continuation-in-part application.
91
Continuation Application (same invention, additional claims)
 Adds new claims to a pending parent application (i.e., neither granted nor abandoned).
 Cannot change the specification of the invention.
 Has the same priority date as the parent application.
Continuation-in-Part (CIP) Application (modified invention, additional claims)
 Increases the scope of the application without having to file an entirely new application (and consequently losing the original filing date).
 Adds "enhancements" to the original invention disclosed in the parent application.
 New claims may also be added:
• Claims concerning the original invention: the same priority date as the parent application.
• Claims concerning the enhancement: the priority date is the filing date of the CIP application.
[Figure: patent family timeline; links between successive applications are labeled C, C, C, and CIP]
Provisional: 62/240,783 (10/13/2015)
Patent applications: 15/293,211 (10/13/2016); 15/495,485 (04/24/2017); 16/273,141 (02/11/2019); 16/704,844 (12/05/2019); 16/876,114 (05/17/2020)
Publications: 2017/0229149 A1 (8/10/2017); 2019/0325914 A1 (10/24/2019); 2020/0126593 A1 (4/23/2020); 2020/0279585 A1 (9/3/2020)
Patents (granted):
o 10,242,713 B2 (02/11/2019): System and method for using, processing, and displaying biometric data (20 claims)
o 10,522,188 B2 (12/31/2019): System and method for using, processing, and displaying biometric data (30 claims)
o 10,910,016 B2 (2/2/2021): System and method for using, processing, and displaying biometric data (20 claims)
o 11,024,339 B2 (6/1/2021): System and method for testing for COVID-19 (17 claims)
• The pre-pandemic applications are about a generic system/method that deals with biometric data.
• The post-pandemic application includes a modified invention and additional claims specific to COVID-19.
93
How to deal with this type of misinformation
 Consider PubChem as an information locator.
 PubChem data are from other data sources.
 More detailed information may be available at the original data source.
 It is highly recommended to check the original data source.
94
 Provide students with training/learning opportunities on technology transfer.
 Many students are not familiar with patents (in contrast to copyright/plagiarism).
 In general, whenever students have access to domain-specific data, there should be an introductory training opportunity for it.
How to deal with this type of misinformation
95
• PubChem is one of the largest sources of publicly available chemical
information
• PubChem is a data aggregator, which collects chemical information from
hundreds of data sources.
• PubChem contains chemical information useful for drug discovery.
• In addition to bioactivity data generated through high-throughput screenings,
PubChem contains a substantial amount of bioactivity information extracted
from scientific articles.
• Chemical vendor and patent information for compounds in PubChem helps
prioritize hit compounds for further screening.
Summary
96
• PubChem supports multiple programmatic access routes to its data, allowing
for automating complicated and specialized tasks beyond what PubChem’s
web interface supports.
• PubChem data can be used for developing computational prediction models
for bioactivity or toxicity of molecules, in conjunction with machine learning
methods.
• PubChem is used by millions of users, but some of them often misinterpret or
misunderstand PubChem data, which needs to be addressed by PubChem
as well as at a community level.
Summary
97
Acknowledgements
 The PubChem Team
Evan Bolton Jia He Thiessen Paul Zhi Sun
Jie Chen Siqian He Bo Yu
Tiejun Chung Qingliang Li Leonid Zaslavsky
Asta Gindulyte Ben Shoemaker Jian Zhang
 Collaborators
Prof. Robert Belford (UALR)
Prof. Ehren Bucholtz (U. of Health Sciences and Pharmacy in St. Louis)
ACS CHED Committee on Computers in Chemical Education (CCCE)
 Funding
Intramural Research Program of the National Library of Medicine
Thank you for your attention.
Questions?
Sunghwan Kim
(sunghwan.kim@nih.gov)

More Related Content

What's hot

Drug and Chemical Databases 2018 - Drug Discovery
Drug and Chemical Databases 2018 - Drug DiscoveryDrug and Chemical Databases 2018 - Drug Discovery
Drug and Chemical Databases 2018 - Drug DiscoveryGirinath Pillai
 
Lecture 9 molecular descriptors
Lecture 9  molecular descriptorsLecture 9  molecular descriptors
Lecture 9 molecular descriptorsRAJAN ROLTA
 
Structure based drug designing
Structure based drug designingStructure based drug designing
Structure based drug designingSeenam Iftikhar
 
Cheminformatics, concept by kk sahu sir
Cheminformatics, concept by kk sahu sirCheminformatics, concept by kk sahu sir
Cheminformatics, concept by kk sahu sirKAUSHAL SAHU
 
A Brief Overview of Cheminformatics
A Brief Overview of CheminformaticsA Brief Overview of Cheminformatics
A Brief Overview of CheminformaticsSunghwan Kim
 
Understanding Smiles
Understanding Smiles Understanding Smiles
Understanding Smiles Abhik Seal
 
Cheminformatics by kk sahu
Cheminformatics by kk sahuCheminformatics by kk sahu
Cheminformatics by kk sahuKAUSHAL SAHU
 
Computational Drug Design
Computational Drug DesignComputational Drug Design
Computational Drug Designbaoilleach
 
chemoinformatics ppt 2.pptx
chemoinformatics ppt 2.pptxchemoinformatics ppt 2.pptx
chemoinformatics ppt 2.pptxwadhava gurumeet
 
Structure based and ligand based drug designing
Structure based and ligand based drug designingStructure based and ligand based drug designing
Structure based and ligand based drug designingDr Vysakh Mohan M
 
Homology modeling of proteins (ppt)
Homology modeling of proteins (ppt)Homology modeling of proteins (ppt)
Homology modeling of proteins (ppt)Melvin Alex
 
Docking Score Functions
Docking Score FunctionsDocking Score Functions
Docking Score FunctionsSAKEEL AHMED
 
Molecular Mechanics in Molecular Modeling
Molecular Mechanics in Molecular ModelingMolecular Mechanics in Molecular Modeling
Molecular Mechanics in Molecular ModelingAkshay Kank
 
Chemical File Formats for storing chemical data
Chemical File Formats for storing chemical dataChemical File Formats for storing chemical data
Chemical File Formats for storing chemical dataAbhik Seal
 

What's hot (20)

Drug and Chemical Databases 2018 - Drug Discovery
Drug and Chemical Databases 2018 - Drug DiscoveryDrug and Chemical Databases 2018 - Drug Discovery
Drug and Chemical Databases 2018 - Drug Discovery
 
Lecture 9 molecular descriptors
Lecture 9  molecular descriptorsLecture 9  molecular descriptors
Lecture 9 molecular descriptors
 
Cheminformatics-1.ppt
Cheminformatics-1.pptCheminformatics-1.ppt
Cheminformatics-1.ppt
 
Structure based drug designing
Structure based drug designingStructure based drug designing
Structure based drug designing
 
Chemoinformatic
Chemoinformatic Chemoinformatic
Chemoinformatic
 
Cheminformatics, concept by kk sahu sir
Cheminformatics, concept by kk sahu sirCheminformatics, concept by kk sahu sir
Cheminformatics, concept by kk sahu sir
 
David
DavidDavid
David
 
A Brief Overview of Cheminformatics
A Brief Overview of CheminformaticsA Brief Overview of Cheminformatics
A Brief Overview of Cheminformatics
 
Understanding Smiles
Understanding Smiles Understanding Smiles
Understanding Smiles
 
Cheminformatics by kk sahu
Cheminformatics by kk sahuCheminformatics by kk sahu
Cheminformatics by kk sahu
 
Computational Drug Design
Computational Drug DesignComputational Drug Design
Computational Drug Design
 
Chemoinformatics
ChemoinformaticsChemoinformatics
Chemoinformatics
 
Pharmacophore mapping joon
Pharmacophore mapping joonPharmacophore mapping joon
Pharmacophore mapping joon
 
chemoinformatics ppt 2.pptx
chemoinformatics ppt 2.pptxchemoinformatics ppt 2.pptx
chemoinformatics ppt 2.pptx
 
Structure based and ligand based drug designing
Structure based and ligand based drug designingStructure based and ligand based drug designing
Structure based and ligand based drug designing
 
MOLECULAR DOCKING
MOLECULAR DOCKINGMOLECULAR DOCKING
MOLECULAR DOCKING
 
Homology modeling of proteins (ppt)
Homology modeling of proteins (ppt)Homology modeling of proteins (ppt)
Homology modeling of proteins (ppt)
 
Docking Score Functions
Docking Score FunctionsDocking Score Functions
Docking Score Functions
 
Molecular Mechanics in Molecular Modeling
Molecular Mechanics in Molecular ModelingMolecular Mechanics in Molecular Modeling
Molecular Mechanics in Molecular Modeling
 
Chemical File Formats for storing chemical data
Chemical File Formats for storing chemical dataChemical File Formats for storing chemical data
Chemical File Formats for storing chemical data
 

PubChem and Big Data Chemistry

  • 1. PubChem and Big Data Chemistry Sunghwan Kim, Ph.D., M.Sc. National Center for Biotechnology Information National Library of Medicine National Institutes of Health Email: sunghwan.kim@nih.gov
  • 2. 2 Outline 1. What Is PubChem? 2. What Does PubChem Have? 3. Exploring Chemical Information in PubChem 4. Programmatic Access to PubChem 5. Bioactivity Prediction Model Building with PubChem Data 6. PubChem and COVID-19 Conspiracy Theories 7. Summary
  • 3. 3 1. What Is PubChem?
  • 4. 4  https://pubchem.ncbi.nlm.nih.gov  Public chemical database at NIH.  Contains information on various chemical entities: • (Drug-like) small molecules • siRNAs & miRNAs • Carbohydrates • Lipids • Peptides • Chemically modified macromolecules • …… PubChem Is a Public Chemical Information Resource
  • 5. 5 PubChem Is a Data Aggregator PubChem Sources: https://pubchem.ncbi.nlm.nih.gov/sources Gov’t agencies Academic institutions Publishers Pharma companies Chemical vendors Scientific databases 800+ data sources Users o Biomedical Researchers • Chemical biology • Medicinal chemistry • Drug design & discovery • Cheminformatics o Data scientists o Patent agents/examiners o Chemical safety officers o Educators/librarians o Students
  • 6. 6 History of PubChem  NIH Molecular Libraries Program (MLP)  Common Fund project.
  • 7. 7 History of PubChem  NIH Molecular Libraries Program (MLP)  Common Fund project.  Aimed to provide academic researchers with high-throughput screening (HTS) resources for drug discovery.
  • 8. 8 History of PubChem  NIH Molecular Libraries Program (MLP)  Common Fund project.  Aimed to provide academic researchers with high-throughput screening (HTS) resources for drug discovery.  Had three components (subprojects): Large, shared compound library HTS centers at academic institutions Central data repository (PubChem)
  • 9. 9 History of PubChem  PubChem was launched in 2004 as a component of MLP.  All Common Fund projects are supported only up to 10 years. Large, shared compound library HTS centers at academic institutions Central data repository (PubChem)
  • 10. 10 History of PubChem  PubChem was launched in 2004 as a component of MLP.  All Common Fund projects are supported only up to 10 years.  PubChem evolved to play a dual role:  As a data archive  As a knowledgebase Large, shared compound library HTS centers at academic institutions Central data repository (PubChem)
  • 11. 11 Monthly Usage Statistics (Unique Interactive Users Only) [Chart: unique monthly users (millions) over time; source: Google Analytics] • 5 million unique interactive users per month at peak (Oct. 2020) • Programmatic requests are not included. • These statistics are a lower bound.
  • 12. 12 User Demographics (June 2020 through May 2021) [Chart: number of users (millions) by age group] • 18-24: 36.5% • 25-34: 27.4% • 35-44: 13.5% • 45-54: 10.4% • 55-64: 6.7% • 65+: 5.4% • 34.64% of total users • ~40% of PubChem users are aged between 18 and 24 (likely to be college students).
  • 15. 15 PubChem Data Content Structures and properties Spectra
  • 16. 16 PubChem Data Content Structures and properties Spectra Chemical health & safety
  • 17. 17 PubChem Data Content Structures and properties Spectra Chemical health & safety Bioactivity
  • 18. 18 PubChem Data Content Structures and properties Spectra Chemical health & safety Bioactivity Chemical vendors & synthesis
  • 21. 21 PubChem Data Content Clinical trials Patents Drugs
  • 22. 22 PubChem Data Content Clinical trials Patents Drugs Scientific articles
  • 23. 23 Dual Role of PubChem Archive Knowledgebase
  • 24. 24 Dual Role of PubChem Archive Knowledgebase
  • 25. 25 Multiple data collections in PubChem • Substance (archive): depositor-provided chemical data • BioAssay (archive): assay descriptions & test results • Compound (knowledgebase): unique chemical structures • Protein, Gene, Pathway, Patent: chemical data associated with a protein/gene/pathway/patent
  • 26. 26 PubChem Statistics • As of November 2021, PubChem contains: - 276 million substance descriptions - 111 million unique chemical structures - 292 million biological activity test results - 1.4 million biological assays, covering 21 thousand unique protein sequence targets. • (Arguably) the largest corpus of publicly available chemical information, from 800+ data sources.
  • 27. 27 PubChem’s Chemical Space
      Property                                    | Lipinski’s Rule of 5 (Ro5) for drug-likeness (a) | Congreve’s Rule of 3 (Ro3) for lead-likeness (b)
      Molecular weight                            | ≤500                                             | ≤300
      Octanol–water partition coefficient (log P) | ≤5                                               | ≤3
      Number of H-bond donors                     | ≤5                                               | ≤3
      Number of H-bond acceptors                  | ≤10                                              | ≤3
      Number of rotatable bonds                   | N/A                                              | ≤3
      Polar surface area (PSA)                    | N/A                                              | ≤60
      (a) Lipinski et al., Adv. Drug Delivery Rev. 1997, 23(1–3), 3-25.
      (b) Congreve et al., Drug Discov. Today, 2003, 8(19), 876-877.
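The two rule sets in the table above can be sketched as simple threshold checks. This is a minimal illustration, assuming the descriptor values have already been computed elsewhere; the `Descriptors` container, the function names, and the example values are hypothetical and are not taken from the slides or from any PubChem API. The PSA unit is assumed to be square angstroms.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container for this sketch)."""
    mol_weight: float          # g/mol
    log_p: float               # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int
    rotatable_bonds: int
    polar_surface_area: float  # assumed to be in square angstroms


def ro5_violations(d: Descriptors) -> int:
    """Count how many of the four Ro5 criteria from the table are violated."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_ro3(d: Descriptors) -> bool:
    """Return True only if all six Ro3 criteria from the table are satisfied."""
    return (
        d.mol_weight <= 300
        and d.log_p <= 3
        and d.h_bond_donors <= 3
        and d.h_bond_acceptors <= 3
        and d.rotatable_bonds <= 3
        and d.polar_surface_area <= 60
    )


# Illustrative descriptor values only (not looked up from PubChem).
example = Descriptors(
    mol_weight=180.2,
    log_p=1.2,
    h_bond_donors=1,
    h_bond_acceptors=4,
    rotatable_bonds=3,
    polar_surface_area=63.6,
)
print(ro5_violations(example))  # no Ro5 criterion is exceeded
print(passes_ro3(example))      # fails Ro3: 4 acceptors and PSA above 60
```

Returning a violation count rather than a yes/no answer for Ro5 makes it easy to group compounds by how many criteria they miss, while Ro3 is treated here as a strict all-or-nothing filter.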
  • 28. 28 PubChem’s Chemical Space • All compounds: 110.6 million (100%) • Lipinski’s Rule of 5 (Ro5): 78.9 million (71.36%) • Congreve’s Rule of 3 (Ro3): 11.7 million (10.57%)
  • 29. 29 PubChem’s Chemical Space • Ro5: 78.9 million (71.36%) • Ro5−1: 18.9 million (17.08%) • Ro5−2: 10.2 million (9.26%) • Ro5−3: 2.3 million (2.05%) • Ro5−4: 0.28 million (0.25%) • Ro5 + Ro5−1 = 88.44%
  • 30. 30 Bioactivity Data in PubChem • All compounds: 110.7 million (100.00%) • Not tested: 107.0 million (96.73%) • Tested: 3.6 million (3.27%) - Active (AC ≤ 1 nM): 74 thousand (0.07%) - Active (1 nM < AC ≤ 1 µM): 777.5 thousand (0.70%) - Active (others): 635.2 thousand (0.57%) - Inactive: 2.1 million (1.93%) • AC: activity concentration (e.g., IC50, EC50, Ki, Kd)
  • 31. 31 Bioactivity Data in PubChem High-Throughput Screening data • From Molecular Libraries Program and other HTS projects. • Many inactives • False hits (e.g., aggregators, autofluorescent compounds) • Typically measured at a single concentration Literature-extracted data
  • 32. 32 Bioactivity Data in PubChem High-Throughput Screening data • From Molecular Libraries Program and other HTS projects. • Many inactives • False hits (e.g., aggregators, autofluorescent compounds) • Typically measured at a single concentration Literature-extracted data
  • 33. 33 Bioactivity Data in PubChem High-Throughput Screening data • From Molecular Libraries Program and other HTS projects. • Many inactives • False hits (e.g., aggregators, autofluorescent compounds) • Typically measured at a single concentration Literature-extracted data • From manual curation or data mining • No (or few) inactives • Provided by various PubChem depositors, including ChEMBL, PDBbind, BindingDB, and Guide to Pharmacology
  • 34. 34 Availability of compounds for subsequent experiments • Virtual screening hits should be synthesizable or purchasable. • PubChem contains “real” molecules (not “virtual” molecules). • For each compound, at least one data contributor claims to have the compound and/or information about it. • Some of these contributors are chemical vendors (e.g., Sigma Aldrich).
  • 35. 35  Two important aspects of PubChem records (in the context of “compound availability”)  Non-live compounds:  Not searchable although they exist.  No associated substances due to: o Mistakenly submitted substances o Incorrect information o No intention to share Availability of compounds for subsequent experiments
  • 36. 36 Availability of compounds for subsequent experiments • Two important aspects of PubChem records (in the context of “compound availability”) • Non-live compounds: - Not searchable although they exist. - No associated substances due to: mistakenly submitted substances, incorrect information, or no intention to share. • Legacy designation: - The data source no longer keeps its records up to date (e.g., discontinued funding, low business priority, …).
  • 37. 37 3. Exploring Chemical Information in PubChem
  • 38. 38 Text Query  Chemical name  Gene/protein name  Pathway name  Patent ID  CAS registry number  PubChem record ID (CID, SID, AID)
  • 40. 40 Compound Summary for salicylic acid (CID 338) https://pubchem.ncbi.nlm.nih.gov/compound/338
  • 42. 42 Line notations for chemical structures • Simplified molecular-input line-entry system (SMILES): CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5 • IUPAC International Chemical Identifier (InChI): InChI=1S/C17H21NO/c1-18(2)13-14-19-17(15-9-5-3-6-10-15)16-11-7-4-8-12-16/h3-12,17H,13-14H2,1-2H3
  • 43. 43 Multiple types of chemical structure search  Identity  2-D similarity  3-D similarity  Substructure  Superstructure
  • 44. 44  Identity Search  Depending on what you mean by “identical molecules”, you will get different search results.  What is the definition of “identity”? → Different tautomeric states, Different stereoisomers, Different isotopes, Salt forms or mixtures, … Chemical Structure Search
  • 45. 45 Chemical Structure Search • Identity Search - Depending on what you mean by “identical molecules”, you will get different search results. - What is the definition of “identity”? → Different tautomeric states, different stereoisomers, different isotopes, salt forms or mixtures, … (ex) CHCl3 vs. CDCl3: both have nearly the same chemical properties but different spectroscopic properties.
  • 46. 46 Chemical Structure Search • Identity Search - Depending on what you mean by “identical molecules”, you will get different search results. - What is the definition of “identity”? → Different tautomeric states, different stereoisomers, different isotopes, salt forms or mixtures, … (ex) CHCl3 vs. CDCl3: both have nearly the same chemical properties but different spectroscopic properties. - Users can search PubChem using different “nuances” of structural identity.
  • 47. 47  Substructure Search • use a substructure as a query • search for compounds that contain the query substructure.  Superstructure Search • use a superstructure as a query • search for compounds that are contained in the query superstructure. Chemical Structure Search
  • 48. 48 Chemical Structure Search: Substructure/Superstructure Search  When do you use substructure searches? For example, when you want to find all molecules that share a particular molecular scaffold, such as the cephalosporins (a class of β-lactam antibiotics).
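The substructure and superstructure searches described above can also be issued programmatically. The sketch below only builds the request URL; it assumes the PUG-REST `fastsubstructure` and `fastsuperstructure` operations with a SMILES input and a CID list as JSON output. The helper name `structure_search_url` and the β-lactam ring query (`O=C1CCN1`, a simplified stand-in for a real cephalosporin scaffold) are illustrative choices, not something taken from the slides.

```python
from urllib.parse import quote

# Base URL of PubChem's PUG-REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Search operations this sketch supports (assumed PUG-REST operation names).
ALLOWED_SEARCHES = {"fastsubstructure", "fastsuperstructure"}


def structure_search_url(smiles: str, search: str = "fastsubstructure") -> str:
    """Build a PUG-REST URL that asks for the CIDs matching a SMILES query."""
    if search not in ALLOWED_SEARCHES:
        raise ValueError(f"unsupported search type: {search!r}")
    # Percent-encode every reserved character so that '=', '#', '/' and
    # similar SMILES symbols cannot be mistaken for URL syntax.
    encoded = quote(smiles, safe="")
    return f"{PUG_REST}/compound/{search}/smiles/{encoded}/cids/JSON"


# Hypothetical query: the four-membered beta-lactam ring.
url = structure_search_url("O=C1CCN1")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) should return a JSON document listing matching CIDs; any real script should also respect PubChem's request volume limits.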
  • 49. 49 Chemical Structure Search  Similarity Search  Why do we need similarity search? • There is a huge imbalance of available information among compounds in PubChem. For example, among 110.7 million compounds in PubChem, - 3.6 million compounds (3.27%) have been tested in at least one assay. - 1.5 million compounds (1.34%) have been tested to be active in at least one assay. • The remaining 107.1 million compounds (96.7%) have not been tested in any assay. • Bioactivities of these compounds may be predicted from structurally similar compounds with known bioactivities. • “Similarity Principle”: structurally similar compounds are likely to have similar biological properties.
  • 50. 50  How can you quantify similarity? • Similarity is very subjective and context-dependent. • There are many different ways to quantify similarity. • Different similarity methods will recognize different flavors of similarity. • PubChem uses two different similarity measures. - 2-D similarity based on molecular fingerprints. - 3-D similarity based on rapid-overlay of chemical structures (ROCS).  Similarity Search Chemical Structure Search
  • 51. 51  PubChem 2-D Similarity • PubChem 881-bit binary fingerprints: Each bit position represents the presence (=1) or absence (=0) of a predefined molecular fragment. Chemical Structure Search
  • 52. 52 Chemical Structure Search  PubChem 2-D Similarity • PubChem 881-bit binary fingerprints: Each bit position represents the presence (=1) or absence (=0) of a predefined molecular fragment. • Structural similarity between two molecules is computed using the Tanimoto equation:
      $$\mathrm{Tanimoto} = \frac{N_{AB}}{N_A + N_B - N_{AB}}$$
      where $N_A$ is the number of bits set for molecule A, $N_B$ is the number of bits set for molecule B, and $N_{AB}$ is the number of bits set for both.
    • The Tanimoto score ranges from 0 (for no similarity) to 1 (for identical molecules).
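As a minimal sketch of the Tanimoto equation, the function below represents each fingerprint as the set of bit positions that are switched on. The bit positions in the example are made up for illustration and do not correspond to real PubChem fingerprint fragments; returning 0.0 when neither fingerprint has any bit set is a convention chosen here, not something the slides specify.

```python
from __future__ import annotations


def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto score between two fingerprints given as sets of 'on' bit positions."""
    n_a = len(bits_a)            # N_A: bits set for molecule A
    n_b = len(bits_b)            # N_B: bits set for molecule B
    n_ab = len(bits_a & bits_b)  # N_AB: bits set for both
    denominator = n_a + n_b - n_ab
    # Convention assumed here: two empty fingerprints score 0.0.
    return n_ab / denominator if denominator else 0.0


# Made-up bit positions within an 881-bit fingerprint.
fp_a = {1, 5, 9, 300}
fp_b = {5, 9, 42}
score = tanimoto(fp_a, fp_b)  # 2 shared bits / (4 + 3 - 2) = 0.4
print(score)
```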
  • 53. 53 Chemical Structure Search  PubChem 3-D Similarity  Three similarity measures: • Shape-Tanimoto (ST): 3-D overlap between steric shapes of molecules • Color-Tanimoto (CT): 3-D overlap between “feature” atoms (H-bond donors/acceptors, cationic/anionic centers, rings and hydrophobes) • Combo-Tanimoto (ComboT): the sum of ST and CT  Both ST and CT range from 0 to 1, and ComboT ranges from 0 to 2 (without normalization to 1).
  • 54. 54  PubChem 3-D Similarity Chemical Structure Search  3-D similarity quantification involves optimization of superposition between two molecules: • ST-optimization: finds the superposition that maximizes the ST score between them. • CT-optimization: considers both CT and ST scores during the optimization.
  • 55. 55 Chemical Structure Search  Why does PubChem use two different similarity measures? • 2-D similarity comparison is much faster than 3-D similarity comparison - 2-D: $10^{6}$ comparisons per second - 3-D: $10^{2}$ ~ $10^{3}$ comparisons per second • However, 2-D similarity methods often fail to recognize structural similarity that can be easily recognized by 3-D similarity methods. (ex) CID 1548887 (Sulindac) vs. CID 3715 (Indomethacin): 2D = 0.39, ST = 0.92, CT = 0.52. Both are non-steroidal anti-inflammatory drugs (NSAIDs) and cyclooxygenase inhibitors.
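To make the speed gap concrete, the sketch below combines the sulindac/indomethacin scores into a ComboT value and estimates how long one query would take to compare against every compound. It assumes the 110.7 million compound count quoted earlier in the deck and a single serial pass at the stated comparison rates; real PubChem searches are organized differently, so these are order-of-magnitude figures only.

```python
def combo_tanimoto(st: float, ct: float) -> float:
    """ComboT is the plain sum of ST and CT, so it ranges from 0 to 2."""
    if not (0.0 <= st <= 1.0 and 0.0 <= ct <= 1.0):
        raise ValueError("ST and CT must each lie between 0 and 1")
    return st + ct


def scan_time_seconds(n_compounds: int, comparisons_per_second: float) -> float:
    """Time for one query to be compared against every compound, one at a time."""
    return n_compounds / comparisons_per_second


N_COMPOUNDS = 110_700_000  # compound count quoted in the deck

combo = combo_tanimoto(0.92, 0.52)                    # sulindac vs. indomethacin
t_2d = scan_time_seconds(N_COMPOUNDS, 1e6)            # about 111 seconds
t_3d_fast = scan_time_seconds(N_COMPOUNDS, 1e3)       # about 31 hours
t_3d_slow = scan_time_seconds(N_COMPOUNDS, 1e2)       # about 13 days

print(f"ComboT = {combo:.2f}")
print(f"2-D scan: {t_2d:.1f} s")
print(f"3-D scan: {t_3d_fast / 3600:.2f} h to {t_3d_slow / 86400:.1f} days")
```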
  • 56. 56 Gene/Protein/Pathway Summary  Suppose that you want to: o Retrieve ALL active compounds against a given protein/gene/pathway target (e.g., HMGCR=3-hydroxy-3-methylglutaryl-CoA reductase). • To identify common chemical scaffolds responsible for bioactivity. • To build a quantitative structure-activity relationship (QSAR) model. →Gene/Protein/Pathway Summary • Provides a target-centric view of PubChem data. • Organizes all data available in PubChem for a given gene/protein/pathway.
  • 58. 58 Patent Summary  Suppose that you want to: o Retrieve ALL chemicals mentioned in a given patent document. →Patent Summary page • Provides a list of chemicals “mentioned” in the patent application/grant. • No information on why they are mentioned (e.g., as subject matter or as prior art?). • Other information, including: - Title, abstract, date, inventor, … - International patent classification (IPC) codes
  • 60. 60 Classification Browser  https://pubchem.ncbi.nlm.nih.gov/classification  Browse PubChem data using a classification of interest.  Search for records annotated with the desired classification/term.  A few examples of supported ontologies/classifications: • MeSH (Medical Subject Headings) • ChEBI (Chemical Entities of Biological Interest) • FDA Pharm Classes • PubChem Compound Table of Contents • PubChem BioAssay Classification • WHO ATC (Anatomical Therapeutic Chemical Classification System) Code • WIPO International Patent Classification
  • 63. 63 Other Tools & Services  Identifier Exchange Service (https://pubchemdocs.ncbi.nlm.nih.gov/identifier-exchange-service)  Score Matrix Service (https://pubchemdocs.ncbi.nlm.nih.gov/score-matrix-service)  Standardization Service (https://pubchem.ncbi.nlm.nih.gov/standardize/standardize.cgi)  PubChem Data Sources (https://pubchem.ncbi.nlm.nih.gov/sources)  PubChem Widgets (https://pubchemdocs.ncbi.nlm.nih.gov/widgets)  PubChem Upload (https://pubchem.ncbi.nlm.nih.gov/upload/)  PubChem Blog (https://pubchemblog.ncbi.nlm.nih.gov)  PubChemDocs (https://pubchemdocs.ncbi.nlm.nih.gov)
  • 65. 65  PubChem users have very diverse backgrounds/interests.  PubChem’s web interfaces are optimized to perform commonly requested tasks interactively.
  • 66. 66  PubChem users have very diverse backgrounds/interests.  PubChem’s web interfaces are optimized to perform commonly requested tasks interactively.  Everything you can do with PubChem through the web browser can be automated through PubChem’s programmatic interfaces.
  • 67. 67  PubChem users have very diverse backgrounds/interests.  PubChem’s web interfaces are optimized to perform commonly requested tasks interactively.  Everything you can do with PubChem through the web browser can be automated through PubChem’s programmatic interfaces.  Programmatic access enables one to do much more complicated tasks that cannot be done through the web browser.
  • 68. 68  Multiple programmatic access routes: Entrez Utilities (E-Utils), Power User Gateway (PUG), PUG-SOAP, PUG-REST, PUG-View, PubChem RDF REST  Two major programmatic access methods o PUG-REST (primarily for computed properties). https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest o PUG-View (primarily for text information). https://pubchemdocs.ncbi.nlm.nih.gov/pug-view  Request volume limitation: o No more than 5 requests per second (See more at: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations) o Violators/abusers may be blocked for a certain period of time.
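One way a client script might stay under the limit of 5 requests per second is a small sliding-window throttle, sketched below. The `Throttle` class and its parameters are hypothetical helpers written for this illustration; they are not part of any PubChem library, and the clock and sleep functions are injectable only so that the behavior can be checked without real waiting.

```python
import time
from collections import deque


class Throttle:
    """Allow at most `max_calls` calls within any window of `period` seconds."""

    def __init__(self, max_calls=5, period=1.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # times of the most recent calls, oldest first

    def wait(self):
        """Block until one more call is allowed, then record it."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) == self.max_calls:
            # Window is full: sleep until the oldest call expires, then drop it.
            self._sleep(self.period - (now - self._stamps[0]))
            self._stamps.popleft()
            now = self._clock()
        self._stamps.append(now)


throttle = Throttle(max_calls=5, period=1.0)
for i in range(3):
    throttle.wait()  # call this right before each HTTP request
    print(f"request {i} may be sent now")
```

In a real script, `throttle.wait()` would precede every request to PUG-REST or PUG-View, so that a burst of queries is spread out instead of risking a temporary block.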
  • 69. 5. Showcase: Bioactivity Prediction Model Building with PubChem Data
  • 70. Retinoid X Receptor α (RXRA) (PDB ID: 1FBY)  Involved in regulation of gene expression in various biological processes.  Potential roles in: • metabolic signaling pathways • skin alopecia (spot baldness) • dermal cysts • cardiac development • insulin sensitization • ……  Let’s build binary classifiers (i.e., active vs. inactive) for chemical modulators of RXRA.
  • 71. Tox21 (AID 1159531) • Quantitative HTS (qHTS) data for 10K compounds • Predominantly inactive Data sets
  • 72. Data sets  Tox21 (AID 1159531) • Quantitative HTS (qHTS) data for 10K compounds • Predominantly inactive  After preprocessing, split 90%/10%: • Training (4,916 compounds): 471 actives, 4,445 inactives • Test (547 compounds): 53 actives, 494 inactives
  • 73. Data sets
      Source | Description | Set (after preprocessing) | Compounds | Actives | Inactives
      Tox21 (AID 1159531) | Quantitative HTS (qHTS) data for 10K compounds; predominantly inactive | Training (90%) | 4,916 | 471 | 4,445
      Tox21 (AID 1159531) | (same source) | Test (10%) | 547 | 53 | 494
      ChEMBL (45 assays) | Data extracted from journal articles; predominantly active | External 1 | 222 | 205 | 17
      NCGC (2 assays) | qHTS data; predominantly inactive; some overlap w/ Tox21 | External 2 | 489 | 20 | 469
  • 74. Data sets (the same training, test, External 1, and External 2 sets as on the previous slide) All data are available in PubChem.
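The slides state only that the preprocessed Tox21 data were split 90%/10% into training and test sets. The sketch below shows one common way to do such a split while keeping the active/inactive ratio similar in both parts (a stratified split). It is not necessarily the procedure used for the deck: `stratified_split` is a hypothetical helper, and its rounding yields 52 test actives rather than the 53 reported on the slide.

```python
import random


def stratified_split(labels, test_fraction=0.1, seed=0):
    """Split sample indices into train/test so each class keeps roughly the same share."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        # Collect and shuffle the indices belonging to this class.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


# Class sizes taken from the slide: 471 + 53 actives, 4,445 + 494 inactives.
labels = [1] * 524 + [0] * 4939
train_idx, test_idx = stratified_split(labels, test_fraction=0.1, seed=0)

n_test_actives = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_test_actives)
```

Stratifying matters here because actives make up under 10% of the data; a purely random split could leave the test set with too few actives to evaluate a model reliably.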
  • 77. Model Building  Molecular descriptors • Generated using PaDEL [Yap CW (2011). J. Comput. Chem., 32 (7): 1466-1474]
      Abbreviation | Name | Length
      AP | AtomPairs 2D Fingerprint | 780
      ESTAT | Estate fingerprint | 79
      EXTFP* | CDK Extended Fingerprint | 1,024
      FP* | CDK fingerprint | 1,024
      GOFP* | CDK graph only fingerprint | 1,024
      KR | Klekota-Roth fingerprint | 4,860
      MACCS | MACCS fingerprint | 166
      PUB | PubChem fingerprint | 881
      SUB | Substructure fingerprint | 307
      * Hashed fingerprints
  • 78. Model Building  Machine-learning algorithms (implemented in scikit-learn)
      Abbreviation | Name | Hyperparameters optimized
      NB | Naïve Bayes | α ($10^{-10}$ ~ 1)
      DT | Decision tree | max_depth_range (3 ~ 7); min_samples_split_range (3 ~ 7); min_samples_leaf_range (2 ~ 6)
      kNN | K-Nearest neighbors | weights (uniform, minkowski, jaccard); n_neighbors (1 ~ 25)
      RF | Random forest | n_estimators (10 ~ 200)
      SVM | Support vector machine | C ($2^{-10}$ ~ $2^{10}$); γ ($2^{-10}$ ~ $2^{10}$)
      NN | Neural network | solver (lbfgs or adam); α ($10^{-7}$ ~ $10^{7}$)
     10-fold cross-validation was used for hyperparameter optimization.
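The mechanics of 10-fold cross-validation over a hyperparameter grid can be sketched in plain Python, as below. The slide gives only the ranges, so stepping the SVM grid through integer powers of two is an assumption, and `kfold_indices` is a hypothetical helper; in practice scikit-learn utilities would normally handle the splitting and the search.

```python
def kfold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds.

    Shuffle the data beforehand if its order is not already random.
    """
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Assumed grid for the SVM ranges on the slide: 2^-10, 2^-9, ..., 2^10.
svm_c_grid = [2.0 ** e for e in range(-10, 11)]
svm_gamma_grid = [2.0 ** e for e in range(-10, 11)]

folds = list(kfold_indices(4916, k=10))  # 4,916 training compounds
print(len(folds), len(folds[0][1]), len(folds[-1][1]))
print(len(svm_c_grid) * len(svm_gamma_grid), "C/gamma combinations")
```

Each hyperparameter combination would be trained on the nine training folds and scored on the held-out fold, with the score averaged over the ten folds to select the best setting.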
  • 79. Model Performance Evaluation
     Area under the receiver operating characteristic curve (AUC) → Used for hyperparameter optimization.
     Balanced accuracy (BACC):
      $$\mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right) = \frac{1}{2}\left(\mathrm{Sens} + \mathrm{Spec}\right)$$
     Sensitivity (Sens):
      $$\mathrm{Sens} = \frac{TP}{TP+FN}$$
     Specificity (Spec):
      $$\mathrm{Spec} = \frac{TN}{TN+FP}$$
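A short sketch of these metrics shows why balanced accuracy suits imbalanced data. The confusion-matrix counts are hypothetical, apart from the class sizes of the test set (53 actives, 494 inactives), which are used to illustrate a model that labels every compound inactive.

```python
def sensitivity(tp, fn):
    """Fraction of true actives that are predicted active."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true inactives that are predicted inactive."""
    return tn / (tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))


# Hypothetical model that predicts "inactive" for all 547 test compounds.
tp, fn, tn, fp = 0, 53, 494, 0
plain_accuracy = (tp + tn) / (tp + fn + tn + fp)  # about 0.90, looks good
bacc = balanced_accuracy(tp, fn, tn, fp)          # 0.5, no better than chance

print(f"accuracy = {plain_accuracy:.3f}, BACC = {bacc:.3f}")
```

Plain accuracy rewards the trivial model because inactives dominate, whereas BACC weights both classes equally and exposes that no active was found.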
  • 80. Model Performance Evaluation  Performance of the models [Figure: area under ROC curve (AUC)]  AUC scores of ≥0.7 were observed for models developed using: PubChem/MACCS/CDK-FP with NN/SVM/RF/kNN  Maximum AUC score (0.77): PubChem fingerprint with RF  A similar trend was observed for the performance in terms of BACC scores (not shown here).
  • 81. General applicability of the models [Figure: area under ROC curve (AUC) for the NCGC and ChEMBL external sets; inactive-to-active ratio = 1]
  • 82. 83 6. PubChem and COVID-19 Conspiracy Theories
  • 83. 84 Page Views for Hydroxychloroquine (in 2020) [Figure: page views for hydroxychloroquine; x-axis: Jan to Dec; y-axis: page views, 0 to 14,000] Annotated events: • 3/17: Univ. of Minnesota begins testing hydroxychloroquine • 3/30: Emergency Use Authorization of hydroxychloroquine • 4/8: Trump said “What do you have to lose?” • 5/18: Trump said he had been taking it. • 7/28: Trump said he still thought it worked.
  • 84. 85 Page Views for Hydroxychloroquine (in 2020) [Figure: page views for hydroxychloroquine, remdesivir, and dexamethasone; x-axis: Jan to Dec; y-axis: page views, 0 to 14,000] Annotated events: • 3/17: Univ. of Minnesota begins testing hydroxychloroquine • 3/30: Emergency Use Authorization of hydroxychloroquine • 4/8: Trump said “What do you have to lose?” • 5/18: Trump said he had been taking it. • 7/28: Trump said he still thought it worked. Drugs used for standard treatment of COVID-19 had fewer page views.
  • 90. 91 Continuation Application vs. Continuation-in-Part (CIP) Application
      Continuation Application (same invention, additional claims):
       Adds new claims to a pending parent application (i.e., neither granted nor abandoned).
       Cannot change the specification of the invention.
       Has the same priority date as the parent application.
       Increases the scope of the application without having to file an entirely new application (and consequently losing the original filing date).
      Continuation-in-Part (CIP) Application (modified invention, additional claims):
       Adds "enhancements" to the original invention disclosed in the parent application.
       New claims may also be added: • Claims concerning the original invention: the same priority date as the parent application. • Claims concerning the enhancement: the priority date is the filing date of the CIP application.
  • 91. [Diagram: patent family, from the provisional application to the COVID-19 patent]
      Patent Application | Patent Application Publication | Patent (Granted) | Title (claims)
      62/240,783 (10/13/2015), Provisional | | |
      15/293,211 (10/13/2016) | | |
      15/495,485 (04/24/2017) | 2017/0229149 A1 (8/10/2017) | 10,242,713 B2 (02/11/2019) | System and method for using, processing, and displaying biometric data (20 claims)
      16/273,141 (02/11/2019) | 2019/0325914 A1 (10/24/2019) | 10,522,188 B2 (12/31/2019) | System and method for using, processing, and displaying biometric data (30 claims)
      16/704,844 (12/05/2019) | 2020/0126593 A1 (4/23/2020) | 10,910,016 B2 (2/2/2021) | System and method for using, processing, and displaying biometric data (20 claims)
      16/876,114 (05/17/2020) | 2020/0279585 A1 (9/3/2020) | 11,024,339 B2 (6/1/2021) | System and method for testing for COVID-19 (17 claims)
      Links between successive non-provisional applications, as labeled in the diagram: C, C, C, CIP (C = continuation; CIP = continuation-in-part).
    • The pre-pandemic applications are about a generic system/method that deals with biometric data.
    • The post-pandemic application includes a modified invention and additional claims specific to COVID-19.
  • 92. 93 How to deal with this type of misinformation  Consider PubChem as an information locator.  PubChem data are from other data sources.  More detailed information may be available at the original data source.  It is highly recommended to check the original data source.
  • 93. 94 How to deal with this type of misinformation  Provide students with training/learning opportunities on technology transfer.  Many students are not familiar with patents (in contrast to copyright/plagiarism).  In general, when students have access to some kind of domain-specific data, there should be an introductory training opportunity for it.
  • 94. 95 • PubChem is one of the largest sources of publicly available chemical information • PubChem is a data aggregator, which collects chemical information from hundreds of data sources. • PubChem contains chemical information useful for drug discovery. • In addition to bioactivity data generated through high-throughput screenings, PubChem contains a substantial amount of bioactivity information extracted from scientific articles. • Chemical vendor and patent information for compounds in PubChem helps prioritize hit compounds for further screening. Summary
  • 95. 96 • PubChem supports multiple programmatic access routes to its data, allowing for automating complicated and specialized tasks beyond what PubChem’s web interface supports. • PubChem data can be used for developing computational prediction models for bioactivity or toxicity of molecules, in conjunction with machine learning methods. • PubChem is used by millions of users, but some of them often misinterpret or misunderstand PubChem data, which needs to be addressed by PubChem as well as at a community level. Summary
  • 96. 97 Acknowledgements  The PubChem Team: Evan Bolton, Jia He, Paul Thiessen, Zhi Sun, Jie Chen, Siqian He, Bo Yu, Tiejun Cheng, Qingliang Li, Leonid Zaslavsky, Asta Gindulyte, Ben Shoemaker, Jian Zhang  Collaborators: Prof. Robert Belford (UALR); Prof. Ehren Bucholtz (U. of Health Sciences and Pharmacy in St. Louis); ACS CHED Committee on Computers in Chemical Education (CCCE)  Funding: Intramural Research Program of the National Library of Medicine
  • 97. Thank you for your attention. Questions? Sunghwan Kim (sunghwan.kim@nih.gov)