PubChem for drug discovery and chemical biology

The utility of PubChem
for academic drug discovery
and chemical biology
Christopher Southan
Drug Discovery Seminar, Stockholm, March 4th, 2019
Hosted by Per Arvidsson

Abstract
Since PubChem (https://pubchem.ncbi.nlm.nih.gov/) surfaced in 2004 it has
become the de facto global informatics hub, not just for chemistry but also
bioactivity. In addition, it is integrated within the very powerful Network
Entrez system that offers connectivity to many other entities in the
massive NCBI database resources including the literature in PubMed,
protein structures in PDB, genomic sequences, and MeSH terms. However,
the statistics of content (97.2 million compounds, 3.4 million of which have
240 million activity results against 12K protein targets) can seem daunting
to users. This presentation will give an overview of this content and the
basic concepts behind the major divisions of submitted substances (SIDs),
non-redundant compound entries (CIDs) and the mapping of activity results
to the latter (PubChem BioAssay). This will be followed by introducing
selected drug discovery-relevant resources inside PubChem including
ChEMBL and the IUPHAR/BPS Guide to Pharmacology. It will conclude with
a series of use cases, including searching against the 23 million structures
extracted from the IBM and SureChEMBL patent extraction sources in
PubChem.

Outline
• Basic context and content
• Drug records
• Patent chemistry < > documents
• Papers < > chemistry
• Chemical biology, getting to probes
• Submission
• Conclusions
• Further information

Comparisons: PubChem is well cited

PubChem CID growth 2005 - 2018

The Triumvirate: Substance, Compound, BioAssay
• SIDs can be biomolecules (e.g. large peptides and antibodies), no images
• CIDs merge SMILES strings < 1000 atoms to unique InChIs
• Average SID:CID ~ 2.6, drugs ~50, aspirin (CID 2244) has 307 SIDs plus 1563 mixture SIDs
• 4.7% of CIDs are mixtures

CID stats overview (March 2019)

Top sources by SID
https://cdsouthan.blogspot.com/2016/06/pubchem-source-of-month.html?q=pubchem+sources

Multiplexing: one drug > many forms

7 same isotopes = different stereo

52 same connectivity = 7 stereo + 45 isotopes
• Useful to understand and navigate these multiplexing chemistry rules
• Many drug molecules have complex connectivities

Will the real approved drugs please stand up?
(I) Selecting from ChEMBL
11629 drugs > 2716 Phase 4, 2194 with SMILES > ChEMBL IDs mapping to CIDs = 2193

(II) Selecting from DrugBank
10517 SIDs > 3144 approved drugs > mapping to 2086 CIDs

(III) Selecting from Guide to Pharmacology
9526 SIDs > 1509 approved drugs > mapping to 1321 CIDs

The Venn (by CID)
• Union total = 2965 so 3-way intersect of 958 is only 32%
• Ipso facto divergent approved drug structure capture
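
The union and 3-way intersect figures above are plain set arithmetic on CID lists. A minimal sketch follows, assuming each source's approved-drug CIDs have already been exported as Python sets; the function name and the toy CIDs are hypothetical, and only the final check uses the numbers from the slide.

```python
def venn_summary(source_a, source_b, source_c):
    """Return (union size, 3-way intersect size, intersect as fraction of union)."""
    union = source_a | source_b | source_c
    core = source_a & source_b & source_c
    fraction = len(core) / len(union) if union else 0.0
    return len(union), len(core), fraction


# Toy CID sets (hypothetical values, for illustration only)
toy_chembl = {1, 2, 3, 4}
toy_drugbank = {2, 3, 4, 5}
toy_gtopdb = {3, 4, 6}

toy_result = venn_summary(toy_chembl, toy_drugbank, toy_gtopdb)

# The slide's own figures: 958 shared CIDs out of a union of 2965
slide_percent = round(100 * 958 / 2965)
```

With the slide's counts, `slide_percent` evaluates to 32, matching the quoted 32%.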

Or use search history Booleans

Cumulative patent-extracted CIDs
• Drug discovery SAR in patent documents is ~2 to 5x more than in papers and appears years earlier
• Majority of lead series now covered by automated extraction sources
• BindingDB curates patent SAR (http://www.bindingdb.org/bind/ByPatent.jsp) and feeds to PubChem and ChEMBL
• SureChEMBL is the only live source with ~ two-monthly updates
• There are quality issues and overheads with automated chemistry extraction

Patent analysis: SureChEMBL < > PubChem
Query via the ChEMBL search interface

Patent document SAR > PubChem
1. > Patent (via SureChEMBL)
2. > SureChEMBL structures
3. > SAR for 52 examples in document
4. Example structure > PubChem
5. Check sources, patent number, similar compounds, ChEMBL and BioAssay intersects
6. Make SAR table
7. Check for missing compounds

Tanimoto similarity shell for SAR “walking”
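
As a reminder of what a similarity "shell" means, the Tanimoto coefficient for two fingerprints A and B is |A ∩ B| / |A ∪ B|. The sketch below is not PubChem's implementation (which uses its own substructure fingerprint); it assumes fingerprints are given as sets of on-bit positions, and the function names and toy bit sets are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def similarity_shell(query_bits, library, lower, upper=1.0):
    """Return (name, score) pairs with lower <= score <= upper, best first."""
    scored = ((name, tanimoto(query_bits, bits)) for name, bits in library.items())
    hits = [(name, score) for name, score in scored if lower <= score <= upper]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy fingerprints (hypothetical bit positions, not real compounds)
query = {1, 2, 3, 4}
library = {
    "analogue_a": {1, 2, 3, 5},
    "analogue_b": {1, 2, 3, 4, 5},
    "unrelated": {7, 8},
}

shell = similarity_shell(query, library, lower=0.7)
```

Lowering the `lower` threshold step by step widens the shell, which is the idea behind "walking" outwards from an example structure to its neighbours.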

PubChem indexes chemistry against the document
SciFinder had 212 substances, with 112 categorised as biological

• At GtoPdb expert curators judge what a paper is "about" in terms of the key
active compound to target mapping
• We focus on approved drugs, clinical candidates, immunopharmacology and
most recently malaria
• We link the PubMed ID (PMID) as a reference to that database record, the
chemical structure and quantitative bioactivity
• We then submit our entries as Substance Identifiers (SIDs) to PubChem, with
comments and the references included in the files
• PubChem, in turn, links our SIDs to our PMIDs (i.e. structure-to-document, s2d)
• PubChem merges identical SIDs to CIDs and PMIDs from different sources that
may index different structures
• For GtoPdb all the links above are reciprocal
• Other sources make analogous linking
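
The submit, link and merge steps above can be pictured with a toy sketch. This is not PubChem's actual pipeline: the record layout, the `merge_submissions` name, and every SID, PMID and structure key below are hypothetical placeholders. Real merging depends on PubChem's own structure standardisation rather than a simple string key.

```python
from collections import defaultdict


def merge_submissions(submissions):
    """Group (sid, structure_key, pmids) records by structure key.

    Each group stands in for one CID: it collects every SID that shares
    the structure and the union of the PMIDs those SIDs were submitted with.
    """
    merged = defaultdict(lambda: {"sids": [], "pmids": set()})
    for sid, structure_key, pmids in submissions:
        merged[structure_key]["sids"].append(sid)
        merged[structure_key]["pmids"].update(pmids)
    return dict(merged)


# Hypothetical submissions from two sources (placeholder identifiers)
submissions = [
    (101, "STRUCTURE-A", [9001]),        # source 1
    (202, "STRUCTURE-A", [9001, 9002]),  # source 2, same structure
    (303, "STRUCTURE-B", [9002]),        # source 2, different structure
]

compounds = merge_submissions(submissions)
```

In this toy case the two SIDs for `STRUCTURE-A` collapse into one compound-level record carrying both PMIDs, while PMID 9002 ends up linked to two different structures, which is the situation the last merging bullet describes.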
"We have spent millions putting chemistry into PDFs but
now we are spending more millions taking it back out" (Anon)

1. GtoPdb expert curators judge what a paper is "about" in terms of active
compounds as key lead structures and molecular mechanism of action (mmoa)
2. We focus on approved drugs, clinical candidates, immunopharmacology and
most recently malaria
3. We link the PubMed ID (PMID) as a reference to that database record, the
chemical structure and quantitative bioactivity
4. We then submit our entries as Substance Identifiers (SIDs) to PubChem, with
comments and the references included in the files
5. PubChem, in turn, links our SIDs to our PMIDs (i.e. structure-to-document, s2d)
6. PubChem merges identical SIDs to CIDs and PMIDs from different sources that
may index different structures
7. For GtoPdb all the links above are reciprocal

The Entrez system has linked 48 CIDs to the 58 papers

In this case there is a MeSH link to PubChem

Probably the correct structure

Chemical Probes: not cleanly indexed so use
external source for "mapping in" to PubChem

PubChem Identifier Exchange Service

Checking IK mapping fails by Googling InChIKey
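
Before searching for an InChIKey that failed to map, it can help to confirm the string is well formed and to build an exact-phrase query. The sketch below assumes the standard InChIKey layout (14 letters, hyphen, 10 letters, hyphen, 1 letter); the function names are hypothetical, and the example key is aspirin's.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_wellformed_inchikey(text):
    """True if the text has the InChIKey layout (no check that it maps anywhere)."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


def first_block(inchikey):
    """Return the first 14-letter block, which is derived from the connectivity."""
    if not is_wellformed_inchikey(inchikey):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return inchikey.strip().split("-")[0]


def exact_phrase_query(inchikey):
    """Wrap the key in double quotes for an exact-match web search."""
    return f'"{inchikey.strip()}"'


aspirin_key = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
```

Searching on the first block alone is one way to look for records that share connectivity but differ in stereo or isotopes.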

Getting into PubChem is easier than you think

Conclusions
• PubChem has become an essential resource for drug discovery and chemical
biology, academic as well as commercial
• Risk of proprietary structure query interception is realistically zero
• Many functionalities to explore and content sets to compare
• Can seem daunting but simple questions can be answered and engagement
practice pays off
• The combination of selects, Boolean history combinations and NCBI Entrez
connectivity is very powerful
• Like all such sources, it has quirks, caveats, submitter quality issues, gotchas and
constitutive cheminformatic challenges
• No chemistry rules are perfect but PubChem’s work well and are navigable
• Programmatic access by PUG REST, RDF triples: 138,312,069,777
• Has synergies with stand-alone sources such as ChEMBL, SureChEMBL, and
Guide to Pharmacology
• Literature to database connectivity is improving but still big shortfall in SAR
extraction from papers into CIDs and BioAssay
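
For the programmatic access point above, PUG REST requests are ordinary URLs. The sketch below only builds request URLs and fetches nothing; the helper names are hypothetical, and the paths follow the PUG REST `compound/cid/...` pattern as I understand it, so check them against the current PubChem documentation before relying on them.

```python
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(cid, properties=("InChIKey", "MolecularFormula"), fmt="JSON"):
    """URL for selected computed properties of one CID."""
    return f"{PUG_REST_BASE}/compound/cid/{int(cid)}/property/{','.join(properties)}/{fmt}"


def compound_sids_url(cid, fmt="JSON"):
    """URL listing the SIDs (submitted substances) behind one CID."""
    return f"{PUG_REST_BASE}/compound/cid/{int(cid)}/sids/{fmt}"


def compound_pubmed_url(cid, fmt="JSON"):
    """URL listing PubMed IDs cross-referenced to one CID."""
    return f"{PUG_REST_BASE}/compound/cid/{int(cid)}/xrefs/PubMedID/{fmt}"


# Aspirin is CID 2244
aspirin_urls = [
    compound_property_url(2244),
    compound_sids_url(2244),
    compound_pubmed_url(2244),
]
```

Any of these URLs can be opened in a browser or passed to `urllib.request.urlopen` to retrieve the JSON.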

Further PubChem tips and tricks
https://www.slideshare.net/cdsouthan/presentations https://cdsouthan.blogspot.com/
