A web-seminar jointly organized by KWSE (Korean Woman Scientists & Engineers) and KWiSE (Korean-American Women in Science and Engineering). Presented on July 27, 2021.
PubChem: A Public Chemical Information Resource for Big Data Chemistry
1. PubChem: A Public Chemical Information Resource for Big Data Chemistry
Sunghwan Kim, Ph.D., M.Sc.
KWSE-KWiSE Joint Web-Seminar
July 27, 2021
2. Outline
1. What Is PubChem?
2. What Does PubChem Have?
3. Exploring Chemical Information in PubChem
4. Programmatic Access to PubChem
5. PubChem for Cheminformatics Education
6. Summary
4. PubChem Is a Public Chemical Information Resource
▪ https://pubchem.ncbi.nlm.nih.gov
▪ Public chemical database at NIH.
▪ Contains information on various chemical entities:
• (Drug-like) small molecules
• siRNAs & miRNAs
• Carbohydrates
• Lipids
• Peptides
• Chemically modified macromolecules
• ……
5. PubChem Is a Data Aggregator
PubChem Sources: https://pubchem.ncbi.nlm.nih.gov/sources
800+ data sources
o Gov’t agencies
o Academic institutions
o Publishers
o Pharma companies
o Chemical vendors
o Scientific databases
Users
o Biomedical researchers
• Chemical biology
• Medicinal chemistry
• Drug design & discovery
• Cheminformatics
o Data scientists
o Patent agents/examiners
o Chemical safety officers
o Educators/librarians
o Students
8. History of PubChem
➢ NIH Molecular Libraries Program (MLP)
▪ Common Fund project.
▪ Aimed to provide academic researchers with high-throughput screening (HTS) resources for drug discovery.
▪ Had three components:
• Large, shared compound library
• HTS centers at academic institutions
• Central data repository (PubChem)
10. History of PubChem
➢ PubChem was launched in 2004 as a component of MLP.
➢ All Common Fund projects are supported for at most 10 years.
➢ PubChem evolved to play a dual role:
▪ As a data archive
▪ As a knowledgebase
MLP components: large, shared compound library; HTS centers at academic institutions; central data repository (PubChem).
30. Gene/Protein/Pathway Summary
➢ Suppose that you want to:
o Retrieve ALL active compounds against a given protein/gene/pathway target (e.g., HMGCR = 3-hydroxy-3-methylglutaryl-CoA reductase).
• To identify common chemical scaffolds responsible for bioactivity.
• To build a quantitative structure-activity relationship (QSAR) model.
→ Gene/Protein/Pathway Summary
• Provides a target-centric view of PubChem data.
• Organizes all data available in PubChem for a given gene/protein/pathway.
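The same target-centric retrieval can be approximated programmatically. The following is a minimal sketch, assuming the PUG-REST URL patterns `assay/target/genesymbol/.../aids` and `assay/aid/.../cids?cids_type=active` and a TXT response with one ID per line. The helper names are hypothetical, AID 1000 is only a placeholder, and no request is actually sent.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def aids_for_gene_url(gene_symbol):
    """URL listing assay IDs (AIDs) whose target is the given gene symbol."""
    symbol = quote(gene_symbol, safe="")
    return f"{PUG_REST}/assay/target/genesymbol/{symbol}/aids/TXT"


def active_cids_url(aid):
    """URL listing compound IDs (CIDs) tested active in one assay."""
    return f"{PUG_REST}/assay/aid/{int(aid)}/cids/TXT?cids_type=active"


def parse_id_list(text):
    """Parse a TXT response body (assumed: one integer ID per line)."""
    return [int(token) for token in text.split()]


if __name__ == "__main__":
    print(aids_for_gene_url("HMGCR"))
    print(active_cids_url(1000))  # placeholder AID, not an HMGCR assay
```

Collecting the active CIDs over every AID returned for the gene gives a starting set for scaffold analysis or QSAR modeling; the requests themselves should respect PubChem's request volume limits.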
32. Patent Summary
➢ Suppose that you want to:
o Retrieve ALL chemicals mentioned in a given patent document.
→ Patent Summary page
• Provides a list of chemicals “mentioned” in the patent application/grant.
• No information on why they are mentioned (e.g., as subject matter or as prior art).
• Other information, including:
- Title, abstract, date, inventor, …
- International Patent Classification (IPC) codes
34. Classification Browser
▪ https://pubchem.ncbi.nlm.nih.gov/classification
▪ Browse PubChem data using a classification of interest.
▪ Search for records annotated with the desired classification/term.
▪ A few examples of supported ontologies/classifications:
• MeSH (Medical Subject Headings)
• ChEBI (Chemical Entities of Biological Interest)
• FDA Pharm Classes
• PubChem Compound Table of Contents
• PubChem BioAssay Classification
• WHO ATC (Anatomical Therapeutic Chemical Classification System) Code
• WIPO International Patent Classification
41. 41
➢ PubChem users have very diverse
backgrounds/interests.
➢ PubChem’s web interfaces are optimized
to perform commonly requested tasks
interactively.
➢ Everything you can do with PubChem
through the web browser can be
automated through PubChem’s
programmatic interfaces.
➢ Programmatic access enables one to do
much more complicated tasks that cannot
be done through the web browser.
42. 42
➢ Multiple programmatic access routes:
o Entrez Utilities (E-Utils)
o Power User Gateway (PUG)
o PUG-SOAP
o PUG-REST
o PUG-View
o PubChem RDF REST
➢ Two major programmatic access methods:
o PUG-REST (primarily for computed properties).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
o PUG-View (primarily for text information).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
➢ Request volume limitation:
o No more than 5 requests per second
(See more at: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations)
o Violators/abusers may be blocked for a certain period of time.
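As an illustration of the two access methods and the 5-requests-per-second limit, here is a minimal sketch, not an official client. It assumes the PUG-REST pattern `compound/cid/<cids>/property/<properties>/JSON` and the PUG-View pattern `data/compound/<cid>/JSON` with an optional `heading` argument. The function and class names (`property_url`, `annotation_url`, `Throttle`, `fetch_json`) are hypothetical.

```python
import json
import time
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"


def property_url(cids, properties):
    """PUG-REST URL for computed properties of one or more CIDs."""
    cid_part = ",".join(str(int(cid)) for cid in cids)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{','.join(properties)}/JSON"


def annotation_url(cid, heading=None):
    """PUG-View URL for the text annotations of one compound."""
    url = f"{PUG_VIEW}/data/compound/{int(cid)}/JSON"
    if heading:
        url += "?heading=" + quote(heading)
    return url


class Throttle:
    """Spaces calls so that consecutive requests start at least
    1/max_per_second seconds apart (5 per second by default)."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_json(url, throttle):
    """Fetch one URL after waiting for the throttle (needs network access)."""
    throttle.wait()
    with urlopen(url, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    print(property_url([2244], ["MolecularFormula", "MolecularWeight"]))
    print(annotation_url(2244, heading="Boiling Point"))
```

Sharing one `Throttle` instance across all calls keeps a long loop of requests under the stated limit; the clock and sleep functions are injectable only so the pacing logic can be checked without real delays.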
43. 43
➢ Bulk Download
o PubChem FTP Site
ftp://ftp.ncbi.nlm.nih.gov/pubchem
o PubChem RDF (Resource Description Framework)
https://pubchemdocs.ncbi.nlm.nih.gov/rdf
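For bulk download from a script, a minimal sketch follows. It assumes that the NCBI FTP host also serves the same paths over HTTPS, which is convenient where FTP is blocked. The helper names `to_https` and `download` are hypothetical, and no specific file path under the PubChem directory is assumed here.

```python
import shutil
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def to_https(url):
    """Rewrite an ftp:// URL to https://, leaving other URLs unchanged."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        return url
    return urlunsplit(("https",) + tuple(parts[1:]))


def download(url, destination):
    """Stream one file to disk without loading it all into memory."""
    destination = Path(destination)
    with urlopen(to_https(url)) as response, destination.open("wb") as out:
        shutil.copyfileobj(response, out)
    return destination


if __name__ == "__main__":
    print(to_https("ftp://ftp.ncbi.nlm.nih.gov/pubchem"))
```

Streaming with `shutil.copyfileobj` matters for bulk files, which can be far larger than available memory.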
45. User Demographics (June 2020 through May 2021)
[Bar chart: number of users (millions) by age group]
Age group | Share of users
18-24 | 36.5%
25-34 | 27.4%
35-44 | 13.5%
45-54 | 10.4%
55-64 | 6.7%
65+ | 5.4%
34.64% of total users
~40% of PubChem users are aged between 18 and 24 (likely to be college students).
47. 47
PubChem has strong potential as an educational resource,
especially for small institutions such as:
• primarily undergraduate institutions (PUIs)
• community colleges (CCs)
▪ Popularity: many young people are already using PubChem.
▪ Sustainability: it is 17 years old and not going away soon.
▪ Zero-cost (to students): U.S. taxpayers have already paid for it.
48. 48
➢ How about R1 universities with large endowments?
▪ Likely to have access to proprietary databases.
• Primarily used for research.
• Inconvenient off-campus access.
• Students will lose access when they graduate.
▪ Most students will eventually rely on public resources.
→ Need for training/education opportunities while in school.
49. Cheminformatics Online Chemistry Course (OLCC)
▪ Collaborative Teaching
▪ Instructors at multiple schools
▪ Open Education Resource (OER) Development
▪ Cheminformatics experts from outside
S. Kim et al., J. Chem. Educ. 2021, 98(2), 416–425.
52. Instructional Approach
▪ Cheminformatics experts prepare online reading materials & homework problem sets (course website).
▪ Course instructors and students run the course using the course materials at multiple schools.
▪ Face-to-face meeting
▪ Online discussion among experts, instructors & students
53. Enrollment Statistics
Semester | # Schools | # Students | Participating schools [a]
Fall 2015 | 4 | 36 | UALR (6), Centre (4), WVU (5), UNF (21)
Spring 2017 | 9 | 47 | UALR (12), Centre (0 [b]), UHSP (3), IQS (6 [c]), SDSU (4), Potsdam (3), UIS (6), Campbell (12), Rutgers (1)
Fall 2019 | 5 | 23 | UALR (2), Centre (3), UHSP (4), IQS (9 [c]), Otterbein (5)
[a] Numbers in parentheses are the numbers of enrolled students at individual schools.
[b] The course was taken by three faculty and staff members who participated in a faculty learning circle. No students enrolled.
[c] Not formally enrolled as the course was offered as a non-credit seminar.
54. Course Websites
OLCC course websites:
2015 • http://olcc.ccce.divched.org/Fall2015OLCC
• https://chem.libretexts.org/link?50598
2017 • http://olcc.ccce.divched.org/Spring2017OLCC
• https://chem.libretexts.org/link?83678
2019 • https://chem.libretexts.org/link?143689
➢ All course materials are freely available for reuse at:
▪ Committee on Computers in Chemical Education (CCCE) website
(http://olcc.ccce.divched.org)
▪ LibreTexts (https://libretexts.org)
55. PubChem-Related Topics
▪ Critical assessment of chemical information
▪ Chemical representations (e.g., InChI and SMILES)
▪ Search by chemical name
▪ Search by chemical structure
o Identity search
o 2-D/3-D similarity search
o Substructure/superstructure search
▪ Structure clustering
▪ Structure-activity relationship analysis
▪ Automation of chemical data retrieval
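Several of these search types map onto PUG-REST requests. The sketch below only builds the request URLs, assuming the `name`, `fastidentity`, `fastsimilarity_2d` (with a `Threshold` argument) and `fastsubstructure` input types described in the PUG-REST documentation. The function names are hypothetical, and SMILES strings containing characters such as "/" may need to be sent another way (for example as a POST body) rather than in the URL path.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def _encode(value):
    """Percent-encode a name or SMILES string for use in a URL path."""
    return quote(value, safe="")


def name_search_url(name):
    """Search by chemical name."""
    return f"{PUG_REST}/compound/name/{_encode(name)}/cids/JSON"


def identity_search_url(smiles):
    """Identity search for a structure given as SMILES."""
    return f"{PUG_REST}/compound/fastidentity/smiles/{_encode(smiles)}/cids/JSON"


def similarity_search_url(smiles, threshold=90):
    """2-D similarity search; threshold is a Tanimoto percentage."""
    base = f"{PUG_REST}/compound/fastsimilarity_2d/smiles/{_encode(smiles)}/cids/JSON"
    return f"{base}?Threshold={int(threshold)}"


def substructure_search_url(smiles):
    """Substructure search: compounds containing the query structure."""
    return f"{PUG_REST}/compound/fastsubstructure/smiles/{_encode(smiles)}/cids/JSON"


if __name__ == "__main__":
    aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
    print(name_search_url("aspirin"))
    print(identity_search_url(aspirin))
    print(similarity_search_url(aspirin, threshold=95))
    print(substructure_search_url("c1ccccc1"))
```

Each URL asks for a list of matching CIDs in JSON, which can then feed property or annotation requests.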
57. Python/R Programming (Fall 2019)
▪ Python Jupyter Notebooks (with sample code and assignments)
o Programmatic access to PubChem data
o Cheminformatics tasks using open-source software packages (e.g., RDKit, scikit-learn, and Mordred).
o Bioactivity prediction using machine learning and PubChem data.
▪ Similar materials for the R language (using JupyterLab and RStudio)
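To give a flavor of the cheminformatics tasks in such notebooks, here is a minimal pure-Python sketch of Tanimoto similarity between fingerprints represented as sets of "on" bit positions. It is a stand-in for what RDKit would normally compute, not RDKit's API and not taken from the course materials; the fingerprint bits and compound labels below are toy data.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def rank_by_similarity(query_bits, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query_bits, bits)) for name, bits in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy fingerprints (hypothetical bit positions, not real descriptors).
    library = {
        "compound_A": {1, 2, 3, 8},
        "compound_B": {2, 3, 4},
        "compound_C": {20, 21},
    }
    for name, score in rank_by_similarity({1, 2, 3}, library):
        print(f"{name}: {score:.2f}")
```

Ranking a library by this score against a known active is the basic idea behind 2-D similarity search and a common first step before building a machine-learning model.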
58. Python/R Programming (Fall 2019)
▪ Students had two options to run the notebooks:
o Download and run them on their own computers.
o Run the notebooks on JupyterHub available through LibreTexts.
59. Summary
• PubChem is the largest source of publicly available chemical information, collected from hundreds of data sources.
• All contents are provided to the public free of charge.
• PubChem contains a wide range of chemical information necessary for drug discovery.
• It is used by more than five million users per month at peak.
60. Summary
• PubChem’s web interface allows average users to readily perform commonly requested, simple tasks.
• PubChem supports programmatic access to its data, allowing advanced users to perform much more complex tasks that are not supported by the web interfaces.
• PubChem data can be downloaded in bulk.
• PubChem data, tools, and services can be used to teach cheminformatics to college students.
• Please reach out to us for collaboration.
61. Acknowledgements
▪ The PubChem Team
Evan Bolton, Jie Chen, Tiejun Chung, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Ben Shoemaker, Zhi Sun, Paul Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang
▪ Funding
Intramural Research Program of the National Library of Medicine
62. 62
Thank you for your attention.
Questions?
Sunghwan Kim, Ph.D., M.Sc.
Email: kimsungh@ncbi.nlm.nih.gov
LinkedIn: https://www.linkedin.com/in/sunghwan-kim/
63. References
▪ Getting the most out of PubChem for virtual screening
S. Kim, Expert Opin. Drug Discov. 2016, 11(9), 843–855.
▪ PubChem in 2021: new data content and improved web interfaces
S. Kim et al., Nucleic Acids Res. 2021, 49(D1), D1388–D1395.
▪ An update on PUG-REST: RESTful interface for programmatic access to PubChem
S. Kim et al., Nucleic Acids Res. 2018, 46(W1), W563–W570.
▪ PUG-View: programmatic access to chemical annotations integrated in PubChem
S. Kim et al., J. Cheminform. 2019, 11, 56.
▪ Teaching Cheminformatics through a Collaborative Intercollegiate Online Chemistry Course (OLCC)
S. Kim et al., J. Chem. Educ. 2021, 98(2), 416–425.