1. The utility of PubChem
for academic drug discovery
and chemical biology
Christopher Southan
Drug Discovery Seminar, Stockholm, March 4th, 2019
Hosted by Per Arvidsson
2. Abstract
Since PubChem (https://pubchem.ncbi.nlm.nih.gov/) surfaced in 2004 it has
become the de facto global informatics hub, not just for chemistry but also
bioactivity. In addition, it is integrated within the very powerful Network
Entrez system that offers connectivity to many other entities in the
massive NCBI database resources including the literature in PubMed,
protein structures in PDB, genomic sequences, and MeSH terms. However,
the statistics of content (97.2 million compounds, 3.4 million of which have
240 million activity results against 12K protein targets) can seem daunting
to users. This presentation will give an overview of this content and the
basic concepts behind the major divisions of submitted substances (SIDs),
non-redundant compound entries (CIDs) and the mapping of activity results
to the latter (PubChem BioAssay). This will be followed by introducing
selected drug discovery-relevant resources inside PubChem including
ChEMBL and the IUPHAR/BPS Guide to Pharmacology. It will conclude with
a series of use cases, including searching against the 23 million structures
extracted from the IBM and SureChEMBL patent extraction sources in
PubChem.
3. Outline
• Basic context and content
• Drug records
• Patent chemistry < > documents
• Papers < > chemistry
• Chemical biology, getting to probes
• Submission
• Conclusions
• Further information
14. 52 same connectivity
= 7 stereo + 45
isotopes
• Useful to understand and
navigate these multiplexing
chemistry rules
• Many drug molecules have
complex connectivities
15. Will the real approved drugs please stand up?
(I) Selecting from ChEMBL
11629 drugs > 2716 Phase 4, 2194 with SMILES > ChEMBL IDs mapping to CIDs = 2193
16. (II) Selecting from DrugBank
10517 SIDs > 3144 approved drugs > mapping to 2086 CIDs
17. (III) Selecting from Guide to Pharmacology
9526 SIDs > 1509 approved drugs > mapping to 1321 CIDs
18. The Venn (by CID)
• Union total = 2965 so 3-way intersect of 958 is only 32%
• Ipso facto divergent approved drug structure capture
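The 32% figure above is plain set arithmetic on the three CID lists. A minimal Python sketch of that calculation follows; the toy CID sets are made up for illustration (they are not the real ChEMBL, DrugBank or GtoPdb lists), and only the last lines use the numbers quoted on the slide.

```python
def venn_summary(chembl, drugbank, gtopdb):
    """Return (union size, 3-way intersect size, intersect as a fraction of the union)."""
    union = chembl | drugbank | gtopdb
    core = chembl & drugbank & gtopdb
    return len(union), len(core), len(core) / len(union)


# Hypothetical toy CID sets, for illustration only.
chembl_cids = {1, 2, 3, 4, 5}
drugbank_cids = {3, 4, 5, 6}
gtopdb_cids = {4, 5, 7}

n_union, n_core, fraction = venn_summary(chembl_cids, drugbank_cids, gtopdb_cids)
print(n_union, n_core, round(fraction, 2))

# Figures quoted on the slide: 958 CIDs common to all three, out of a union of 2965.
slide_fraction = 958 / 2965
print(f"{slide_fraction:.0%}")  # prints 32%
```

With real data, each set would simply be the list of CIDs downloaded for one source.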
21. Cumulative patent-extracted CIDs
• Drug discovery SAR in patent documents is ~ 2 to 5x that in papers and appears years earlier
• Majority of lead series now covered from automated extraction sources
• BindingDB curates patent SAR (http://www.bindingdb.org/bind/ByPatent.jsp) and
feeds to PubChem and ChEMBL
• SureChEMBL is the only live source with ~ two-monthly updates
• There are quality issues and overheads with automated chemistry extraction
23. Patent document SAR > PubChem
1. > Patent (via SureChEMBL)
2. > SureChEMBL structures
3. > SAR for 52 examples in
document
4. Example struc > PubChem
5. Check sources, patent
number, similar
compounds, ChEMBL and
BioAssay intersects
6. Make SAR table
7. Check for missing cpds
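Steps 6 and 7 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the example numbers, IC50 values and CIDs are invented, the function names are hypothetical, and matching on the raw SMILES string is a simplification (in practice structures would be compared after standardisation, for instance via InChIKey).

```python
import csv
import io


def build_sar_table(examples, cid_lookup):
    """Join patent examples to CIDs; return (table rows, example numbers with no CID)."""
    rows, missing = [], []
    for ex in examples:
        cid = cid_lookup.get(ex["smiles"])
        if cid is None:
            missing.append(ex["example"])
        rows.append({**ex, "cid": cid if cid is not None else ""})
    return rows, missing


def to_csv(rows):
    """Serialise the SAR table as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["example", "smiles", "ic50_nM", "cid"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Invented patent examples and an invented SMILES-to-CID lookup.
examples = [
    {"example": 1, "smiles": "CCO", "ic50_nM": 12.0},
    {"example": 2, "smiles": "c1ccccc1", "ic50_nM": 450.0},
    {"example": 3, "smiles": "CC(=O)O", "ic50_nM": 3.5},
]
cid_lookup = {"CCO": 9000001, "c1ccccc1": 9000002}  # placeholder CIDs

rows, missing = build_sar_table(examples, cid_lookup)
print(to_csv(rows))
print("Examples without a CID:", missing)
```

The `missing` list is the starting point for step 7: those examples would need checking by hand against the patent document.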
27. • At GtoPdb, expert curators judge what a paper is "about" in terms of the key
active compound-to-target mapping
• We focus on approved drugs, clinical candidates, immunopharmacology and
most recently malaria
• We link the PubMed ID (PMID) as a reference to that database record, the
chemical structure and quantitative bioactivity
• We then submit our entries as Substance Identifiers (SIDs) to PubChem, with
comments and the references included in the files
• PubChem, in turn, links our SIDs to our PMIDs (i.e. structure-to-document, s2d)
• PubChem merges identical SIDs into CIDs, and merges PMIDs from different
sources that may index different structures
• For GtoPdb all the links above are reciprocal
• Other sources make analogous links
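The SID-to-CID merge and the pooling of PMIDs described above can be pictured with a small Python sketch. It is a toy model only: the records, the `structure_key` field and the function name are hypothetical, and PubChem's real structure standardisation is far more involved than grouping on a single key.

```python
from collections import defaultdict


def merge_submissions(substances):
    """Group substance records by structure key and pool their SIDs, sources and PMIDs."""
    compounds = defaultdict(lambda: {"sids": set(), "sources": set(), "pmids": set()})
    for rec in substances:
        entry = compounds[rec["structure_key"]]
        entry["sids"].add(rec["sid"])
        entry["sources"].add(rec["source"])
        entry["pmids"].update(rec["pmids"])
    return dict(compounds)


# Invented submissions: two sources deposit the same structure with different references.
substances = [
    {"sid": 1, "source": "GtoPdb", "structure_key": "KEY-A", "pmids": [111]},
    {"sid": 2, "source": "ChEMBL", "structure_key": "KEY-A", "pmids": [111, 222]},
    {"sid": 3, "source": "GtoPdb", "structure_key": "KEY-B", "pmids": [333]},
]

merged = merge_submissions(substances)
for key in sorted(merged):
    entry = merged[key]
    print(key, sorted(entry["sids"]), sorted(entry["sources"]), sorted(entry["pmids"]))
```

Each key plays the role of one CID: it collects every SID with that structure, and the structure-to-document links become the union of the PMIDs supplied by all submitters.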
"We have spent millions putting chemistry into PDFs but
now we are spending more millions taking it back out" (Anon)
28. 1. GtoPdb expert curators judge what a paper is "about" in terms of active
compounds as key lead structures and molecular mechanism of action (mmoa)
2. We focus on approved drugs, clinical candidates, immunopharmacology and
most recently malaria
3. We link the PubMed ID (PMID) as a reference to that database record, the
chemical structure and quantitative bioactivity
4. We then submit our entries as Substance Identifiers (SIDs) to PubChem, with
comments and the references included in the files
5. PubChem, in turn, links our SIDs to our PMIDs (i.e. structure-to-document, s2d)
6. PubChem merges identical SIDs into CIDs, and merges PMIDs from different
sources that may index different structures
7. For GtoPdb all the links above are reciprocal
40. Conclusions
• PubChem has become an essential resource for drug discovery and chemical
biology, academic as well as commercial
• Risk of proprietary structure query interception is realistically zero
• Many functionalities to explore and content sets to compare
• Can seem daunting but simple questions can be answered and engagement
practice pays off
• Combination of selects, Booleans, History combinations and NCBI Entrez
connectivity is very powerful
• Like all such sources, it has quirks, caveats, submitter quality issues, gotchas and
constitutive cheminformatic challenges
• No chemistry rules are perfect but PubChem’s work well and are navigable
• Programmatic access by PUG REST, RDF triples: 138,312,069,777
• Has synergies with stand-alone sources such as ChEMBL, SureChEMBL, and
Guide to Pharmacology
• Literature to database connectivity is improving but still big shortfall in SAR
extraction from papers into CIDs and BioAssay
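As a pointer for the programmatic access mentioned above, here is a minimal Python sketch that only builds PUG REST request URLs and does not call the service. The URL patterns are written as I understand them from the PUG REST documentation and should be checked against the current version; the helper function names are my own, and CID 2244 (aspirin) is just a convenient example.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(cid, properties, fmt="JSON"):
    """URL for selected computed properties of one compound (CID)."""
    props = ",".join(properties)
    return f"{PUG_REST}/compound/cid/{int(cid)}/property/{props}/{fmt}"


def sids_for_cid_url(cid, fmt="JSON"):
    """URL listing the submitted substances (SIDs) behind one CID."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/sids/{fmt}"


def cids_for_smiles_url(smiles, fmt="JSON"):
    """URL for an identity lookup by SMILES; the SMILES is percent-encoded."""
    return f"{PUG_REST}/compound/smiles/{quote(smiles, safe='')}/cids/{fmt}"


print(compound_property_url(2244, ["MolecularFormula", "InChIKey"]))
print(sids_for_cid_url(2244))
print(cids_for_smiles_url("CC(=O)O"))
```

The URLs can be pasted into a browser or fetched with any HTTP client. SMILES containing characters such as "/" or "#" may be safer sent by POST rather than in the URL path.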