Presented at the Bioinformatics Seminar at the University of Arkansas, Little Rock on November 5, 2021.
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical database at the National Library of Medicine, National Institutes of Health. Arguably, PubChem is one of the largest chemical information resources in the public domain, with 111 million unique chemical structures, 1.39 million biological assays, and 292 million biological activity result outcomes. It also contains significant amounts of scientific research data and the inter-relationships between chemicals, proteins, genes, scientific literature, patents, and more. PubChem is a key resource for big data in chemistry and has been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). It has also been used for cheminformatics education as well as chemical health and safety training. This presentation provides a high-level overview of PubChem’s data, tools, and services.
1. PubChem and Big Data Chemistry
Sunghwan Kim, Ph.D., M.Sc.
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Email: sunghwan.kim@nih.gov
2. 2
Outline
1. What Is PubChem?
2. What Does PubChem Have?
3. Exploring Chemical Information in PubChem
4. Programmatic Access to PubChem
5. Bioactivity Prediction Model Building with PubChem Data
6. PubChem and COVID-19 Conspiracy Theories
7. Summary
4. 4
https://pubchem.ncbi.nlm.nih.gov
Public chemical database at NIH.
Contains information on various chemical entities:
• (Drug-like) small molecules
• siRNAs & miRNAs
• Carbohydrates
• Lipids
• Peptides
• Chemically modified macromolecules
• ……
PubChem Is a Public Chemical Information Resource
5. 5
PubChem Is a Data Aggregator
PubChem Sources: https://pubchem.ncbi.nlm.nih.gov/sources
800+ data sources:
• Government agencies
• Academic institutions
• Publishers
• Pharma companies
• Chemical vendors
• Scientific databases
Users:
o Biomedical Researchers
• Chemical biology
• Medicinal chemistry
• Drug design & discovery
• Cheminformatics
o Data scientists
o Patent agents/examiners
o Chemical safety officers
o Educators/librarians
o Students
7. 7
History of PubChem
NIH Molecular Libraries Program (MLP)
Common Fund project.
Aimed to provide academic researchers with high-throughput
screening (HTS) resources for drug discovery.
Had three components (subprojects):
• Large, shared compound library
• HTS centers at academic institutions
• Central data repository (PubChem)
9. 9
History of PubChem
PubChem was launched in 2004 as a component of MLP.
All Common Fund projects are supported only up to 10 years.
PubChem evolved to play a dual role:
As a data archive
As a knowledgebase
12. 12
User Demographics
(June 2020 through May 2021)
Age group   Share of users
18-24       36.5%
25-34       27.4%
35-44       13.5%
45-54       10.4%
55-64       6.7%
65+         5.4%
(Chart axes: age group vs. number of users in millions)
34.64% of total users
~40% of PubChem users are aged between 18 and 24.
(likely to be college students)
25. 25
Multiple data collections in PubChem
• Compound: unique chemical structures
• Substance: depositor-provided chemical data
• BioAssay: assay descriptions & test results
• Protein, Gene, Pathway, Patent: chemical data associated with a protein/gene/pathway/patent
Collections are labeled in the diagram as Archive or Knowledgebase.
26. 26
As of November 2021, PubChem contains:
• 276 million substance descriptions
• 111 million unique chemical structures
• 292 million biological activity test results
• 1.4 million biological assays, covering 21 thousand unique protein
sequence targets.
(Arguably) the largest corpus of
publicly available chemical information from 800+ data
sources.
PubChem Statistics
27. 27
PubChem’s Chemical Space
Criterion                                     Lipinski’s Rule of 5 (Ro5)   Congreve’s Rule of 3 (Ro3)
                                              for drug-likeness (a)        for lead-likeness (b)
Molecular weight                              ≤500                         ≤300
Octanol–water partition coefficient (Log P)   ≤5                           ≤3
Number of H-bond donors                       ≤5                           ≤3
Number of H-bond acceptors                    ≤10                          ≤3
Number of rotatable bonds                     N/A                          ≤3
Polar surface area (PSA)                      N/A                          ≤60
(a) Lipinski et al., Adv. Drug Delivery Rev. 1997, 23(1–3), 3-25.
(b) Congreve et al., Drug Discov. Today, 2003, 8(19), 876-877.
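The two rule sets above can be applied as simple threshold filters. The following is a minimal sketch, assuming the descriptor values have already been computed elsewhere; the `Descriptors` container and the example values are hypothetical, and the sketch requires every criterion to be met (some practitioners tolerate one violation).

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptor values for one compound (hypothetical container)."""
    mol_weight: float      # molecular weight
    log_p: float           # octanol-water partition coefficient
    h_donors: int          # number of H-bond donors
    h_acceptors: int       # number of H-bond acceptors
    rotatable_bonds: int   # number of rotatable bonds
    psa: float             # polar surface area


def passes_ro5(d: Descriptors) -> bool:
    """Rule of 5 thresholds as listed in the table (all must hold here)."""
    return (d.mol_weight <= 500 and d.log_p <= 5
            and d.h_donors <= 5 and d.h_acceptors <= 10)


def passes_ro3(d: Descriptors) -> bool:
    """Rule of 3 thresholds as listed in the table (all must hold here)."""
    return (d.mol_weight <= 300 and d.log_p <= 3
            and d.h_donors <= 3 and d.h_acceptors <= 3
            and d.rotatable_bonds <= 3 and d.psa <= 60)


# Illustrative values only; they are not taken from PubChem.
fragment = Descriptors(180.2, 1.2, 1, 3, 3, 55.0)
drug_like = Descriptors(450.5, 4.1, 2, 7, 6, 90.0)

for name, d in [("fragment", fragment), ("drug_like", drug_like)]:
    print(name, "Ro5:", passes_ro5(d), "Ro3:", passes_ro3(d))
```

Because every Ro3 threshold is at least as strict as the corresponding Ro5 threshold, any compound that passes Ro3 in this sketch also passes Ro5.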
28. 28
All compounds: 110.6 million (100%)
Lipinski’s Rule of 5 (Ro5): 78.9 million (71.36%)
Congreve’s Rule of 3 (Ro3): 11.7 million (10.57%)
PubChem’s Chemical Space
30. 30
Bioactivity Data in PubChem
All compounds: 110.7 million (100.00%)
• Not tested: 107.0 million (96.73%)
• Tested: 3.6 million (3.27%)
- Inactive: 2.1 million (1.93%)
- Active (AC ≤ 1 nM): 74 thousand (0.07%)
- Active (1 nM < AC ≤ 1 µM): 777.5 thousand (0.70%)
- Active (others): 635.2 thousand (0.57%)
AC: activity concentration (e.g., IC50, EC50, Ki, Kd, etc.)
31. 31
Bioactivity Data in PubChem
High-Throughput Screening data
• From Molecular Libraries
Program and other HTS projects.
• Many inactives
• False hits (e.g., aggregators, autofluorescent compounds)
• Typically measured at a single concentration
Literature-extracted data
• From manual curation or data mining
• No (or few) inactives
• Provided by various PubChem depositors, including ChEMBL, PDBbind, BindingDB, and Guide to Pharmacology
Bioactivity Data in PubChem
34. 34
• Virtual screening hits should be synthesizable or purchasable.
• PubChem contains “real” molecules (not “virtual” molecules).
• At least one data contributor claims to have the compound and/or information about it.
• Some of these contributors are chemical vendors (e.g., Sigma Aldrich).
Availability of compounds for subsequent experiments
35. 35
Two important aspects of PubChem records
(in the context of “compound availability”)
Non-live compounds:
Not searchable although they exist.
No associated substances due to:
o Mistakenly submitted substances
o Incorrect information
o No intention to share
Legacy designation:
The depositor no longer keeps its records up to date.
o Discontinued funding, low business priority, …
Availability of compounds for subsequent experiments
42. 42
Line notations for chemical structures
Simplified molecular-input line-entry system (SMILES)
CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5
IUPAC International Chemical Identifier (InChI)
InChI=1S/C17H21NO/c1-18(2)13-14-19-17(15-9-5-3-6-10-15)16-11-7-4-8-12-16/h3-12,17H,13-14H2,1-2H3
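An InChI string is organized as layers separated by "/". The sketch below splits the InChI shown on this slide into its layers; the helper name `split_inchi_layers` is hypothetical, and the code assumes that a formula layer directly follows the version, which holds for this example but not for every InChI.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into its '/'-separated layers.

    Assumes the second field is the molecular formula; later layers
    are keyed by their one-letter prefix (e.g. 'c' for connectivity,
    'h' for hydrogens).
    """
    prefix, _, rest = inchi.partition("=")
    if prefix != "InChI" or not rest:
        raise ValueError("not an InChI string")
    parts = rest.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]
    return layers


inchi = ("InChI=1S/C17H21NO/c1-18(2)13-14-19-17(15-9-5-3-6-10-15)"
         "16-11-7-4-8-12-16/h3-12,17H,13-14H2,1-2H3")
layers = split_inchi_layers(inchi)
for key, value in layers.items():
    print(f"{key}: {value}")
```

The leading "1S" marks a standard InChI, version 1; the formula, connectivity ("c"), and hydrogen ("h") layers follow in that order.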
44. 44
Identity Search
Depending on what you mean by “identical molecules”, you will get different search results.
What is the definition of “identity”?
→ Different tautomeric states,
Different stereoisomers,
Different isotopes,
Salt forms or mixtures, …
(e.g.) CHCl3 vs. CDCl3:
Both have essentially the same chemical properties but different spectroscopic properties.
Users can search PubChem using different “nuances” of structural identity.
Chemical Structure Search
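One way to express these "nuances" programmatically is through an identity-search request. The sketch below only builds request URLs and sends nothing; it assumes the PUG-REST `fastidentity` operation and its `identity_type` option as I understand them from the PUG-REST documentation, so the operation name and option values should be checked there before use. The helper `identity_search_url` is hypothetical.

```python
from urllib.parse import quote, urlencode

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_search_url(smiles: str, identity_type: str) -> str:
    """Build an identity-search URL (assumed PUG-REST 'fastidentity' syntax)."""
    # Percent-encode the SMILES so characters such as '(' or '[' are URL-safe.
    encoded = quote(smiles, safe="")
    query = urlencode({"identity_type": identity_type})
    return f"{PUG_REST}/compound/fastidentity/smiles/{encoded}/cids/JSON?{query}"


chloroform = "C(Cl)(Cl)Cl"

# Ignore isotopes and stereo: CHCl3 and CDCl3 would both be expected to match.
loose = identity_search_url(chloroform, "same_connectivity")
# Require matching isotopes as well: CDCl3 would be expected to drop out.
strict = identity_search_url(chloroform, "same_isotope")

print(loose)
print(strict)
```

The same query structure therefore yields different hit lists depending on which definition of identity is requested.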
47. 47
Substructure Search
• use a substructure as a query
• search for compounds that contain the query substructure.
Superstructure Search
• use a superstructure as a query
• search for compounds that are contained in the query superstructure.
Chemical Structure Search
48. 48
When do you use substructure searches?
ex. when you want to find all molecules that
have a particular molecular scaffold.
Cephalosporins
(a class of β-lactam antibiotics)
Substructure/Superstructure Search
Chemical Structure Search
49. 49
Similarity Search
Why do we need similarity search?
• There is a huge imbalance of available information among compounds in PubChem.
For example, among 110.7 million compounds in PubChem,
- 3.6 million compounds (3.27%) have been tested in at least one assay.
- 1.5 million compounds (1.34%) have been found active in at least one assay.
• The remaining 107.0 million compounds (96.73%) have not been tested in any assay.
• Bioactivities of these compounds may be predicted from structurally similar compounds with known bioactivities.
• “Similarity Principle”: structurally similar compounds are likely to have similar biological properties.
Chemical Structure Search
50. 50
How can you quantify similarity?
• Similarity is very subjective and context-dependent.
• There are many different ways to quantify similarity.
• Different similarity methods will recognize different flavors of similarity.
• PubChem uses two different similarity measures.
- 2-D similarity based on molecular fingerprints.
- 3-D similarity based on rapid-overlay of chemical structures (ROCS).
Similarity Search
Chemical Structure Search
51. 51
PubChem 2-D Similarity
• PubChem 881-bit binary fingerprints:
Each bit position represents the presence (=1) or
absence (=0) of a predefined molecular fragment.
• Structural similarity between two molecules is computed using the Tanimoto equation:
$$\mathrm{Tanimoto} = \frac{N_{AB}}{N_A + N_B - N_{AB}}$$
$N_A$: number of bits set for molecule A
$N_B$: number of bits set for molecule B
$N_{AB}$: number of bits set for both
• The Tanimoto score ranges from 0 (for no similarity) to 1 (for identical molecules).
Chemical Structure Search
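The Tanimoto equation can be evaluated directly on the sets of bit positions that are set in each fingerprint. This is a minimal sketch with made-up bit positions, not real PubChem fingerprints; returning 0.0 when neither fingerprint has any bit set is a convention chosen here.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    n_a = len(bits_a)             # bits set for molecule A
    n_b = len(bits_b)             # bits set for molecule B
    n_ab = len(bits_a & bits_b)   # bits set for both
    denominator = n_a + n_b - n_ab
    if denominator == 0:
        return 0.0                # both fingerprints empty (convention chosen here)
    return n_ab / denominator


# Made-up bit positions within an 881-bit fingerprint.
fp_a = {1, 5, 9, 20}
fp_b = {1, 5, 30}

print(tanimoto(fp_a, fp_b))   # 2 shared bits out of 5 distinct bits
print(tanimoto(fp_a, fp_a))   # identical fingerprints
```

Representing a fingerprint by its set bits keeps the arithmetic identical to the formula: the denominator is the size of the union of the two bit sets.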
53. 53
PubChem 3-D Similarity
Three similarity measures:
• Shape-Tanimoto (ST): 3-D overlap between steric shapes of molecules
• Color-Tanimoto (CT): 3-D overlap between “feature” atoms
(H-bond donors/acceptors, Cationic/Anionic centers, rings and hydrophobes)
• Combo-Tanimoto (ComboT): the sum of ST and CT
Both ST and CT range from 0 to 1, and ComboT ranges from 0 to 2 (without normalization to 1).
Chemical Structure Search
54. 54
PubChem 3-D Similarity
Chemical Structure Search
3-D similarity quantification involves optimization of superposition between two molecules:
• ST-optimization: finds the superposition that maximizes the ST score between them.
• CT-optimization: considers both CT and ST scores during the optimization.
55. 55
Why does PubChem use two different similarity measures?
• 2-D similarity comparison is much faster than 3-D similarity comparison.
- 2-D: 10^6 comparisons per second
- 3-D: 10^2 ~ 10^3 comparisons per second
• However, 2-D similarity methods often fail to recognize structural similarity that can be easily recognized by 3-D similarity methods.
Chemical Structure Search
Example: CID 1548887 (Sulindac) vs. CID 3715 (Indomethacin)
2D = 0.39, ST = 0.92, CT = 0.52
Both are non-steroidal anti-inflammatory drugs (NSAIDs) and cyclooxygenase inhibitors.
56. 56
Gene/Protein/Pathway Summary
Suppose that you want to:
o Retrieve ALL active compounds
against a given protein/gene/pathway target
(e.g., HMGCR=3-hydroxy-3-methylglutaryl-CoA reductase).
• To identify common chemical scaffolds responsible for bioactivity.
• To build a quantitative structure-activity relationship (QSAR) model.
→Gene/Protein/Pathway Summary
• Provides a target-centric view of PubChem data.
• Organizes all data available in PubChem for a given
gene/protein/pathway.
58. 58
Patent Summary
Suppose that you want to:
o Retrieve ALL chemicals mentioned in a given patent document.
→Patent Summary page
• Provides a list of chemicals “mentioned” in the patent application/grant.
• No information on why they are mentioned.
(e.g., as a subject matter or as a prior art?)
• Other information, including:
- Title, abstract, date, inventor, …
- International patent classification (IPC) codes
60. 60
https://pubchem.ncbi.nlm.nih.gov/classification
Browse PubChem data using a classification of interest.
Search for records annotated with the desired classification/term.
A few examples of supported ontologies/classifications.
• MeSH (Medical Subject Headings)
• ChEBI (Chemical Entities of Biological Interest)
• FDA Pharm Classes
• PubChem Compound Table of Contents
• PubChem BioAssay Classification
• WHO ATC (Anatomical Therapeutic Chemical Classification System) Code
• WIPO International Patent Classification
Classification Browser
65. 65
PubChem users have very diverse
backgrounds/interests.
PubChem’s web interfaces are optimized
to perform commonly requested tasks
interactively.
Everything you can do with PubChem
through the web browser can be
automated through PubChem’s
programmatic interfaces.
Programmatic access enables one to do
much more complicated tasks that cannot
be done through the web browser.
68. 68
Multiple programmatic access routes
Two major programmatic access methods
o PUG-REST (primarily for computed properties).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
o PUG-View (primarily for text information).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
Request volume limitation:
o No more than 5 requests per second
(See more at: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations)
o Violators/abusers may be blocked for a certain period of time.
Programmatic access routes (diagram):
• Entrez Utilities (E-Utils)
• Power User Gateway (PUG)
• PUG-SOAP
• PUG-REST
• PubChem RDF REST
• PUG-View
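A script that calls PUG-REST should stay within the request volume limit stated on this slide. The following is a minimal sketch: `property_url` follows the PUG-REST compound-property URL pattern, while the `Throttle` class and `fetch_json` helper are hypothetical names introduced here for illustration, not part of any PubChem library.

```python
import json
import time
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid: int, properties: list) -> str:
    """URL for computed properties of one compound, returned as JSON."""
    return f"{PUG_REST}/compound/cid/{cid}/property/{','.join(properties)}/JSON"


class Throttle:
    """Spaces calls so that no more than max_per_second are issued."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last = None   # time of the previous call, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


def fetch_json(url: str, throttle: Throttle) -> dict:
    """Wait for the throttle, then download and parse one JSON response."""
    throttle.wait()
    with urllib.request.urlopen(url) as response:
        return json.load(response)


url = property_url(2244, ["MolecularFormula", "MolecularWeight"])
print(url)
```

A single `Throttle` instance would be shared by all requests in a script, e.g. `fetch_json(url, throttle)` inside a loop over CIDs; the clock and sleep functions are injectable so the pacing logic can be checked without waiting.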
72. Data sets
Tox21 (AID 1159531)
• Quantitative HTS (qHTS) data for 10K compounds
• Predominantly inactive
After preprocessing, split 90% / 10% into:
• Training (4,916 compounds): 471 actives, 4,445 inactives
• Test (547 compounds): 53 actives, 494 inactives
73. Data sets
Tox21 (AID 1159531): qHTS data for 10K compounds, predominantly inactive. After preprocessing, split 90% / 10% into:
• Training (4,916 compounds): 471 actives, 4,445 inactives
• Test (547 compounds): 53 actives, 494 inactives
ChEMBL (45 assays): data extracted from journal articles, predominantly active. After preprocessing:
• External 1 (222 compounds): 205 actives, 17 inactives
NCGC (2 assays): qHTS data, predominantly inactive, some overlap w/ Tox21. After preprocessing:
• External 2 (489 compounds): 20 actives, 469 inactives
All data are available in PubChem.
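The class imbalance in these data sets can be quantified from the counts given on the slides. This short sketch uses only those counts; the `summarize` helper is a hypothetical name.

```python
# (actives, inactives) for each data set, as listed on the slides.
datasets = {
    "Training": (471, 4445),
    "Test": (53, 494),
    "External 1 (ChEMBL)": (205, 17),
    "External 2 (NCGC)": (20, 469),
}


def summarize(actives: int, inactives: int):
    """Return (total, fraction of actives, inactive-to-active ratio)."""
    total = actives + inactives
    return total, actives / total, inactives / actives


for name, (actives, inactives) in datasets.items():
    total, active_fraction, ratio = summarize(actives, inactives)
    print(f"{name}: {total} compounds, {active_fraction:.1%} active, "
          f"inactive-to-active ratio {ratio:.2f}")

# Share of the Tox21 compounds that went into the training set.
training_share = 4916 / (4916 + 547)
print(f"Training share: {training_share:.1%}")
```

The Tox21-derived and NCGC sets are dominated by inactives while the ChEMBL set is dominated by actives, which is why a metric that is insensitive to class proportions matters when comparing performance across them.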
77. Molecular descriptors
• Generated using PaDEL [Yap CW (2011). J. Comput. Chem., 32 (7): 1466-1474]
Model Building
Abbreviation   Name                          Length
AP             AtomPairs 2D Fingerprint      780
ESTAT          Estate fingerprint            79
EXTFP*         CDK Extended Fingerprint      1,024
FP*            CDK fingerprint               1,024
GOFP*          CDK graph only fingerprint    1,024
KR             Klekota-Roth fingerprint      4,860
MACCS          MACCS fingerprint             166
PUB            PubChem fingerprint           881
SUB            Substructure fingerprint      307
* Hashed fingerprints
78. Machine-learning algorithms (implemented in scikit-learn)
Abbreviation   Name                     Hyperparameters optimized
NB             Naïve Bayes              α (10^-10 ~ 1)
DT             Decision tree            max_depth_range (3 ~ 7); min_samples_split_range (3 ~ 7); min_samples_leaf_range (2 ~ 6)
kNN            k-Nearest neighbors      weights (uniform, minkowski, jaccard); n_neighbors (1 ~ 25)
RF             Random forest            n_estimators (10 ~ 200)
SVM            Support vector machine   C (2^-10 ~ 2^10); γ (2^-10 ~ 2^10)
NN             Neural network           solver (lbfgs or adam); α (10^-7 ~ 10^7)
10-fold cross-validation was used for hyperparameter optimization.
Model Building
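To give a sense of the size of such a search, the sketch below enumerates the decision-tree grid from the table. It assumes integer steps of 1 within each range, which the slide does not state, and uses the scikit-learn parameter names `max_depth`, `min_samples_split`, and `min_samples_leaf`; `expand_grid` is a hypothetical helper. In scikit-learn itself an exhaustive search of this kind is typically run with `GridSearchCV` (for example with `cv=10` and `scoring="roc_auc"`).

```python
from itertools import product

# Decision-tree ranges from the table; a step of 1 is an assumption.
dt_grid = {
    "max_depth": range(3, 8),           # 3 ~ 7
    "min_samples_split": range(3, 8),   # 3 ~ 7
    "min_samples_leaf": range(2, 7),    # 2 ~ 6
}


def expand_grid(grid: dict) -> list:
    """Return every combination of hyperparameter values as a dict."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


candidates = expand_grid(dt_grid)
n_folds = 10                            # 10-fold cross-validation
n_fits = len(candidates) * n_folds      # model fits for one fingerprint type

print(len(candidates), "candidate settings,", n_fits, "model fits")
print(candidates[0])
```

The count applies to one descriptor set; repeating the search for each of the nine fingerprint types multiplies the number of fits accordingly.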
79. Model Performance Evaluation
Area under the Receiver operating characteristic curve (AUC)
→ Used for hyperparameter optimization.
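Balanced accuracy (BACC):
$$\mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right) = \frac{1}{2}\left(\mathrm{Sens} + \mathrm{Spec}\right)$$
$$\mathrm{Sensitivity\ (Sens)} = \frac{TP}{TP+FN}$$
$$\mathrm{Specificity\ (Spec)} = \frac{TN}{TN+FP}$$
TP, FN, TN, FP: numbers of true positives, false negatives, true negatives, and false positives.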
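The balanced accuracy defined on this slide can be computed from the four confusion-matrix counts. This is a minimal sketch; the confusion-matrix counts in the first example are hypothetical, chosen only so that the class totals match the size of the test set (53 actives, 494 inactives).

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of actual actives that are predicted active."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of actual inactives that are predicted inactive."""
    return tn / (tn + fp)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity and specificity."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))


# Hypothetical predictions on a set with 53 actives and 494 inactives.
print(balanced_accuracy(tp=40, fn=13, tn=420, fp=74))

# A model that predicts every compound as inactive.
all_inactive = balanced_accuracy(tp=0, fn=53, tn=494, fp=0)
plain_accuracy = 494 / (53 + 494)
print(all_inactive, plain_accuracy)
```

The second case shows why BACC is useful for imbalanced data: predicting everything as inactive gives a high plain accuracy on this set but a BACC no better than chance.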
80. Performance of the models
AUC scores of ≥0.7 were observed for models
developed using:
PubChem/MACCS/CDK-FP with
NN/SVM/RF/kNN
Maximum AUC score (0.77):
PubChem fingerprint with RF
A similar trend was observed for the performance in terms of BACC scores (not shown here).
Area under ROC curve (AUC)
Model Performance Evaluation
81. General applicability of the models
[Figure: area under ROC curve (AUC) for the NCGC and ChEMBL external sets; inactive-to-active ratio = 1]
90. 91
Continuation Application (same invention, additional claims)
• Adds new claims to a pending parent application (i.e., neither granted nor abandoned).
• Cannot change the specification of the invention.
• Has the same priority date as the parent application.
Continuation-in-Part (CIP) Application (modified invention, additional claims)
• Increases the scope of the application without having to file an entirely new application (and consequently losing the original filing date).
• Adds "enhancements" to the original invention disclosed in the parent application.
• New claims may also be added:
- Claims concerning the original invention: the same priority date as the parent application.
- Claims concerning the enhancement: the priority date is the filing date of the CIP application.
91. [Diagram: patent family. Legend: Patent Application; Patent Application Publication; Patent (Granted). Links between applications are labeled C, C, C, and CIP.]
Provisional: 62/240,783 (10/13/2015)
Patent applications: 15/293,211 (10/13/2016); 15/495,485 (04/24/2017); 16/273,141 (02/11/2019); 16/704,844 (12/05/2019); 16/876,114 (05/17/2020)
Patent application publications: 2017/0229149 A1 (8/10/2017); 2019/0325914 A1 (10/24/2019); 2020/0126593 A1 (4/23/2020); 2020/0279585 A1 (9/3/2020)
Granted patents:
• 10,242,713 B2 (02/11/2019): System and method for using, processing, and displaying biometric data (20 claims)
• 10,522,188 B2 (12/31/2019): System and method for using, processing, and displaying biometric data (30 claims)
• 10,910,016 B2 (2/2/2021): System and method for using, processing, and displaying biometric data (20 claims)
• 11,024,339 B2 (6/1/2021): System and method for testing for COVID-19 (17 claims)
• The pre-pandemic applications are about a generic system/method that deals with biometric data.
• The post-pandemic application includes a modified invention and additional claims specific to COVID-19.
92. 93
How to deal with this type of misinformation
Consider PubChem as an information locator.
PubChem data are from other data sources.
More detailed information may be available at the original data source.
It is highly recommended to check the original data source.
93. 94
Provides students with training/learning opportunities for technology transfer.
Many students are not familiar with patents (in contrast to copyright/plagiarism).
In general, when students can access some sort of domain-specific data, there should be an introductory training opportunity for it.
How to deal with this type of misinformation
94. 95
• PubChem is one of the largest sources of publicly available chemical
information
• PubChem is a data aggregator, which collects chemical information from
hundreds of data sources.
• PubChem contains chemical information useful for drug discovery.
• In addition to bioactivity data generated through high-throughput screenings,
PubChem contains a substantial amount of bioactivity information extracted
from scientific articles.
• Chemical vendor and patent information for compounds in PubChem helps
prioritize hit compounds for further screening.
Summary
95. 96
• PubChem supports multiple programmatic access routes to its data, allowing
for automating complicated and specialized tasks beyond what PubChem’s
web interface supports.
• PubChem data can be used for developing computational prediction models
for bioactivity or toxicity of molecules, in conjunction with machine learning
methods.
• PubChem is used by millions of users, but some of them often misinterpret or
misunderstand PubChem data, which needs to be addressed by PubChem
as well as at a community level.
Summary
96. 97
Acknowledgements
The PubChem Team
Evan Bolton Jia He Thiessen Paul Zhi Sun
Jie Chen Siqian He Bo Yu
Tiejun Chung Qingliang Li Leonid Zaslavsky
Asta Gindulyte Ben Shoemaker Jian Zhang
Collaborators
Prof. Robert Belford (UALR)
Prof. Ehren Bucholtz (U. of Health Sciences and Pharmacy in St. Louis)
ACS CHED Committee on Computers in Chemical Education (CCCE)
Funding
Intramural Research Program of the National Library of Medicine
97. Thank you for your attention.
Questions?
Sunghwan Kim
(sunghwan.kim@nih.gov)