Sorting bioactive wheat from database chaff: Challenges of discerning correct drug structures
Christopher Southan, Helen E. Benson, Elena Faccenda, Joanna L.
Sharman, Adam J. Pawson and Jamie A. Davies, IUPHAR/BPS Guide to
PHARMACOLOGY, Centre for Integrative Physiology, School of Biomedical
Sciences, University of Edinburgh, Hugh Robson Building, Edinburgh,
EH8 9XD, UK (enquiries@guidetopharmacology.org).
Available at http://www.slideshare.net/cdsouthan/sorting-bioactive-wheat-fr
Interpreting the Venn
The divergence between the drug collections can be expressed as the
intersect being 29% of the union, with 52% of the union being unique to
a single source. The intersect is only 8 structures more than in the
similar comparison from 2009, which used different sources and methods
(PMID 20298516). Importantly, each of the sets is compiled from a
database with an established reputation and regular update papers in
the NAR database issue. We thus emphasise that the analysis is not a
critique of these teams. However, the results highlight the challenge of
selecting drugs from the multiplexed options and show that sources
diverge in the rules they use for this selection. They also imply that
the concept of a “correct” drug structure is illusory, since the
consensus is only ~40% of what could be expected (and there is no
agreement on total counts anyway).
Analysis of the consensus set
We can formally analyse multiplexing via the detailed chemical
relationships that PubChem pre-computes for its 68.3 million compound
entries (as of April 7th 2015). Using the 815-CID intersect (from Fig 1),
relationship counting operations were performed (see the PubChem Help
documentation for details) as presented in the table below.
We can summarise the results (presented as averages) as follows:
• Each of the 815 CIDs was merged from 92 submissions (i.e. the
SID:CID ratio). Note this is a direct measure of “popularity” amongst
the PubChem sources, since this ratio is only 2.8 for all of PubChem.
• “Same connectivity” establishes that each drug is structurally related
to 23 other distinct CIDs (as a measure of multiplexing)
• “Same conn isotopes” establishes that 15 of the 23 are isotopic
derivatives (surprisingly, ~70% as virtual deuteration from patents)
• The related “no isotopes” query establishes that 7.5 of the 23 are
alternative stereoisomer representations
• Each drug is included in 68 distinct mixture CID entries
As a specific multiplexed example we can examine atorvastatin (GtoPdb
ligand 2949). In PubChem this has 102 SIDs and 51 related CIDs. Of
these, 44 are isotopic (38 deuterated) and 7 are alternative stereoisomer
forms. In addition, there are 295 mixtures. Tracking multiplexing (as
singletons or mixtures) by year in PubChem indicates that patent
extractions are the main reason for the recent increase.
Discussion
Representational multiplexing for bioactive chemistry in documents,
web pages and databases in cheminformatics has broadly confounding
effects. The activities affected include virtual screening and “big
data” mining. Metrics defining some of the problems have been presented
above. Consequently, our curatorial rules (see GtoPdb FAQ) have been
revised. We now check same connectivity, SID counts and BioAssay
records to support our choice of CID as ligand structure, and we
collaborate with PubChem for QC. We also alert users to significant
structural equivocality and split activity data. Our March 2015 release
thus has 1105 approved drug CIDs concordant with ChEMBL, DrugBank or
TTD. The persistent discordance in approved drug database records is of
concern, but efforts to produce definitive sets will require more
inter-source collaboration. In addition, regulatory bodies and
pharmaceutical companies need to engage directly with the provenance of
public database structures.
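As a rough illustration of the consistency check that can accompany
same-connectivity counts, the atorvastatin figures reported above can be
tallied in a few lines of Python. This is only a sketch: the class and
field names are hypothetical, and it assumes (as the atorvastatin numbers
suggest) that isotopic and stereoisomer variants together account for all
related CIDs. It does not query PubChem; the counts are typed in by hand.

```python
from dataclasses import dataclass


@dataclass
class RelatedCounts:
    """Related PubChem record counts for one drug (hypothetical field names)."""
    sids: int               # submissions merged into the chosen CID
    same_connectivity: int  # related CIDs sharing the same connectivity
    isotopic: int           # related CIDs that are isotopic variants
    deuterated: int         # isotopic variants that are deuterated
    stereo: int             # related CIDs that are alternative stereoisomers
    mixtures: int           # mixture CIDs that contain the drug

    def is_consistent(self) -> bool:
        # Assumption: isotopic and stereo variants partition the related set.
        return self.isotopic + self.stereo == self.same_connectivity

    def isotopic_fraction(self) -> float:
        # Share of related CIDs that are isotopic variants.
        return self.isotopic / self.same_connectivity

    def deuterated_fraction(self) -> float:
        # Share of isotopic variants that are deuterated.
        return self.deuterated / self.isotopic


# Counts as reported in the text for atorvastatin (GtoPdb ligand 2949).
atorvastatin = RelatedCounts(
    sids=102, same_connectivity=51, isotopic=44,
    deuterated=38, stereo=7, mixtures=295,
)
```

With these inputs `atorvastatin.is_consistent()` holds (44 + 7 = 51), and
`isotopic_fraction()` shows how strongly isotopic variants dominate the
related CIDs for this drug.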
The structural multiplexing issue
Since 2009 the Guide to PHARMACOLOGY database (GtoPdb) team
have curated 7586 ligands from papers, including approved drugs,
clinical candidates, research compounds, peptides and clinical
antibodies (PMID 24234439). As PubChem pushes towards 70
million compound identifiers (CIDs), we have noticed the problem
of “multiplexing”: during the curation of 5713 small molecules as
CIDs, we encountered many representations (i.e. different CIDs)
of the same pharmacological entities. Three types of variation
dominate: stereochemistry, mixtures and isotopic analogues.
These are known constitutive issues for chemical databases, but in
recent years we observed that this multiplexing was reaching
problematic proportions (i.e. more chaff), especially for clinically
used drugs (i.e. proportionally less wheat).
How many approved drugs are there?
Given that they represent the Crown Jewels of over five decades of
drug development, it is surprising that counts of approved small
molecules span a range from 1216 in the FDA Maximum Daily Dose
Database (PubChem Assay ID 1195) up to 2750 for the NCGC
Pharmaceutical Collection (PMID 21525397). This was reflected in a
comparison of three curated drug collections in 2009 that recorded
only 807 structures in common (PMID 20298516). The challenge
faced by GtoPdb in 2015 is the choice of which drug structures to
activity-map against which targets. For this reason we revisited
the comparison outlined in PMID 20298516, but within PubChem,
using its advanced chemical relationship mapping functionality.
Selection of approved drug sources
We chose three sources that a) submit to PubChem, b) capture
approved drugs, c) had been updated within the last two years and
d) had previously been compared in toto (PMID 24533037). For
DrugBank (DrugB), approved drugs were selected as 1504 CIDs. For
ChEMBL19, approved SMILES were selected from downloaded
records and ID-mapped to 1499 CIDs. For the Therapeutic Target
Database (TTD), the approved drug SDF file was downloaded,
converted to InChI strings and mapped to 1877 CIDs. The three sets
were then compared inside PubChem.
Results of the 3-way comparison
Fig 1. The Venn diagram shows that the intersect between the
three sources is 815 (i.e. CIDs in common). The union (all CIDs found
in at least one source) is 2750. Note also that 1435 CIDs are unique
to a single database (n.b. a TTD mapping enhancement increased the
overlap from the figure of 749 mentioned in the abstract).
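The three-way comparison was run inside PubChem, but the set logic
behind the Venn counts can be sketched in plain Python. The function
name and the toy CIDs below are invented for illustration and are not
the actual source lists. The last two lines simply recompute the
percentages from the counts reported in Fig 1.

```python
from typing import Dict, Set


def three_way_summary(sources: Dict[str, Set[int]]) -> Dict[str, float]:
    """Summarise the overlap between exactly three sets of CIDs."""
    if len(sources) != 3:
        raise ValueError("expected exactly three sources")
    sets = list(sources.values())
    union = set().union(*sets)
    if not union:
        raise ValueError("all sources are empty")
    intersect = set.intersection(*sets)
    # A CID is source-unique when it occurs in exactly one of the sets.
    unique = {cid for cid in union if sum(cid in s for s in sets) == 1}
    return {
        "union": len(union),
        "intersect": len(intersect),
        "source_unique": len(unique),
        "intersect_pct": 100 * len(intersect) / len(union),
        "unique_pct": 100 * len(unique) / len(union),
    }


# Toy CIDs, invented for illustration only.
toy = {
    "DrugBank": {1, 2, 3, 4},
    "ChEMBL": {1, 2, 5},
    "TTD": {1, 3, 6, 7},
}
summary = three_way_summary(toy)

# Percentages recomputed from the counts reported in Fig 1.
reported_intersect_pct = 100 * 815 / 2750  # intersect as a share of the union
reported_unique_pct = 100 * 1435 / 2750    # source-unique share of the union
```

In practice each set would hold the CIDs mapped from one source
(DrugBank, ChEMBL or TTD). The choice of structure identifier used for
the mapping (SMILES, InChI) affects the overlap, as the TTD mapping
enhancement noted above shows.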