Resolving cryptic needles to molecular structures: The GtoPdb experience
1. www.guidetopharmacology.org
Resolving cryptic needles to molecular
structures: The GtoPdb experience
Christopher Southan, Adam J. Pawson, Joanna L. Sharman, Helen
E. Benson and Elena Faccenda,
IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative
Physiology, University of Edinburgh
ACS CINF session: Find the Needle in a Haystack: Mining Data
from Large Chemical Spaces
1
http://www.slideshare.net/cdsouthan/southan-needles-acs
2. Abstract
2
The IUHAR/BPS Guide to PHARMACOLOGY database (GtoPdb) team has data-mined bioactive
chemistry since 2009 (PMID 24234439). Consequently, during the curation of 7586 needles (as
ligand entries) we have grappled extensively with the haystack. This work outlines challenges of
mapping company code numbers to structures (n2s) and lead compounds from a haystack of
anywhere between one and five million bioactive structures. By the time these are assigned non-
proprietary names, data linkages can usually be found. However, other valuable needles are lead
compounds approaching clinical development that can also be incisive pharmacological tools.
The use of company codes to designate these is often obfuscatory, with some journals even
allowing blinding where clinical reports have no n2s or links to primary data. The efforts to resolve
the NCATS and MRC repurposing candidates exemplified the problem (PMID 23159359).
Notwithstanding we have now curated 50 AZDs n2s including Open Innovation structures. Codes
also present back-mapping problems where we need to synonym-chain a) first-filings and early
papers b) consecutive different codes via mergers c) INN or USAN and d) an eventual trade
name. The mining challenges are compounded by ad hoc permutations of hyphen, no space, and
space, comma inclusions, dropping a leading zero, appending suffixes or even ghost codes. In
some cases we curate plausible patent structures pending disclosure. For others we found the
vendor-only n2s corroborated via patent match. Recent reports of potent lead structures are
particularly difficult to name-link and synonym-map. To ameliorate the problem we have recently
introduced binding synonyms such as “compound 17d [PMID 23099093]” or “example 98
(WO2011020806)”. This means users can not only immediately locate the exact structures inside
documents, including via our PubChem submissions, but often find expanded SAR series.
Broader issues of n2s obfuscation will be discussed, including the inherent contradiction with the
trend towards greater clinical trials transparency
3. Subtitle: A tale of three needles
3
http://cdsouthan.blogspot.se/2015/08/merck.html
4. Starting point: curating BACE1 clinical inhibitors for GtoPdb
4
MK-8931 blinded since 2011
one PubMed name match but no name-
to-structure (n2s)
Synonym-chaining to SCH 900931
6. Substance submissions
(SIDs) for CID 23627211
n2s for MK-8391 and SC-1359113
but not SCH 900931
RN 1613380-81-6 is a “ghost” entry
(i.e. no n2s)
6
7. ChEMBL maps CID 23627211 < > PMID: 23412139
(but MK-8931 = compound 13 not the lead as cpd 16 )
7
MK-8931 not mentioned in Merck
paper, or the ChEMBL SID
8. In 2015 verubecestat surfaces via Merck
8
N2s provenanced by the
INN and USAN entries but
no code back-mapping or
clinical trial link
9. N2s: corroborative decoding the IUPACs from the USAN
9
InChIKey search squares
the circle because no CID
pointer in the USAN
(sigh….)
11. Interesting surprise: Merck record verubecestat as more
potent against BACE2 than BACE1
11
n.b. so would it lower blood sugar ?
c.f. BACE2 as a new diabetes target: a patent review (2010 - 2012)
http://www.ncbi.nlm.nih.gov/pubmed/23506624
13. We are left with a lot of needle questions
• So what is the molecular resolution between MK-8391, SC 1359113,
SCH 900931 verubecestat, compound 13, example 25 etc?
• Have ChemIDplus introduced an unprovenanced incorrect n2s into
the CID 23627211 synonyms?
• What is the source of the cryptic n2s surfacing in the vendor jungle?
• Why did the USAN not include the code (they usually do)
• Will Merck directly clarify any of this (e.g. publish a PubMed paper on
verubecestat including an n2s)?
• Would ChemIDplus then correct their entry?
• If the primary n2s changes will secondary sources a) notice and b)
correct and update ?
• Could verubecestat ameliorate diabetes as well as AD? (not a joke)
13
14. In the meantime: GtoPdb does the best we can
14
cpd 16 [PMID: 23412139] GtoP 8698verubecestat GtoPdb 8699 MK-8931 GtoPdb 8931
15. Which includes document-linking the best needles
15
http://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=2330
BACE1
(upcoming
release
2015.2)
16. Conclusion: time to rethink n2s obfuscation?
(a.k.a. desist from hiding the needles)
• The GtoPdb team can testify that curating useful needles (names <>
structures <> data) from the haystack of ~60-100 million structures with ~
5 million bioactives is tough
• Code-blinding and synonym spaghetti make it even tougher
• They seriously confound big-data mining (but GtoPdb makes small data
minable)
• Is there any evidence that n2s blinding gives competitive advantage?
• Do code numbers and synonym chains have to remain an ad hoc mess?
• So why not adopt a universally useful form (e.g. ABCD123456)?
• Why do pharma companies obstinately decline to provenance their own
n2s in public databases?
• Pressure for clinical trial transparency and translational data mining is
building but there is still anomalous neglect of linking explicit structures.
• Some signs of n2s cross-corroboration across USAN, INN, FDA/SPL,
ChemIDPlus and PubChem – but could be better
16
17. References, acknowledgments and questions
17
http://www.ncbi.nlm.nih.gov/pubmed/24234439
http://www.ncbi.nlm.nih.gov/pubmed/23159359