For BioIT World, Boston, April 2014, Christopher Southan and Joanna L. Sharman
http://www.guidetopharmacology.org/
URL for the consensus set
http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1Fo7u3apR1bzS_UWr1YhHOTkZ/
(n.b. result numbers in the poster have been updated since the abstract submission due to DrugBank and ChEMBL updates in PubChem)
BACKGROUND: A 2009 comparison of database subsets of approved drugs recorded only 807 exact structures in common (PMID:20298516). Factors contributing to the low overlap included semantic naming inconsistencies, ambiguity in structure representation and the fact that neither regulatory bodies nor pharmaceutical companies directly verify public electronic chemical database records. This work is a current comparison of drug sources inside PubChem.
METHODS: We selected submitters that nominally included small-molecule drug collections and International Nonproprietary Names (INNs) and/or United States Adopted Names (USANs). Unions, intersections and differences were derived by using the Entrez query history interface to perform Boolean operations on retrieved sets. Additional filters were explored, including salt-stripping by selecting a covalent unit count of 1.
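The Boolean operations described in METHODS can be sketched offline with plain Python sets. This is only an illustration, not the Entrez query history workflow itself: the CIDs, the per-source sets and the unit_count lookup are all made-up placeholders, and the single-component filter simply mimics selecting a covalent unit count of 1.

```python
# Illustrative sketch only: all CIDs and counts below are invented
# placeholders, not real PubChem records.

# Hypothetical CID sets retrieved for each source
drugbank = {101, 102, 103, 104, 105}
ttd = {101, 102, 103, 106}
chembl = {101, 102, 104, 107}

# Hypothetical covalent unit count per CID (1 = single component,
# >1 = salt or mixture)
unit_count = {101: 1, 102: 2, 103: 1, 104: 1, 105: 1, 106: 1, 107: 3}


def single_component(cids):
    """Keep only CIDs whose covalent unit count is 1 (salt-stripping filter)."""
    return {cid for cid in cids if unit_count[cid] == 1}


# Apply the filter to each source before comparing
db_single = single_component(drugbank)
ttd_single = single_component(ttd)
chembl_single = single_component(chembl)

# Boolean operations equivalent to OR, AND and NOT on the retrieved sets
union_all = db_single | ttd_single | chembl_single
db_and_ttd = db_single & ttd_single
db_and_chembl = db_single & chembl_single
three_way = db_single & ttd_single & chembl_single
db_only = db_single - ttd_single - chembl_single

print("union:", sorted(union_all))
print("DrugBank AND TTD:", sorted(db_and_ttd))
print("DrugBank AND ChEMBL:", sorted(db_and_chembl))
print("three-way consensus:", sorted(three_way))
print("DrugBank only:", sorted(db_only))
```

Filtering before intersecting matters here: a salt form that appears in two sources would otherwise count towards the overlap even though it is excluded from the single-component totals.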
RESULTS: DrugBank 3.0 declares 1,541 small-molecule drugs, and the term “approved” returned 1,424 substances (SIDs) in PubChem. These collapse to 1,392 compounds (CIDs), and removal of mixtures reduces this to 1,325. The Therapeutic Target Database (TTD) declares 1,540 approved drugs on its website. The TTD CID overlap with the 1,325 DrugBank set was 1,108, and the equivalent figure for ChEMBL_17 was 1,141. The three-way consensus (from the DrugBank starting point) was 1,003. The term INN retrieves 7,916 CIDs, reducing to 7,180 single-component structures. USAN brings back 5,494, of which only 3,204 are single-component (i.e. more salt forms are designated as USANs). Of the 1,003 three-way set, 927 have an INN or USAN. The “same connectivity” query indicates that, on average, each of the 927 has nearly 20 canonically related CIDs. Issues associated with these metrics will be outlined and, depending on new source releases, the numbers will be updated.
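To put the counts above on a common scale, the following minimal sketch converts them to percentages. The figures are copied from the RESULTS text; the helper name pct and the variable names are hypothetical, introduced only for this illustration.

```python
# Counts taken from the RESULTS text; names are illustrative only.
DRUGBANK_SINGLE = 1325   # DrugBank "approved" CIDs after removing mixtures
TTD_OVERLAP = 1108       # overlap with TTD
CHEMBL_OVERLAP = 1141    # overlap with ChEMBL_17
THREE_WAY = 1003         # three-way consensus
INN_ALL, INN_SINGLE = 7916, 7180
USAN_ALL, USAN_SINGLE = 5494, 3204


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Overlaps expressed relative to the DrugBank starting set
ttd_pct = pct(TTD_OVERLAP, DRUGBANK_SINGLE)
chembl_pct = pct(CHEMBL_OVERLAP, DRUGBANK_SINGLE)
three_way_pct = pct(THREE_WAY, DRUGBANK_SINGLE)

# Fraction of name-retrieved CIDs that are single-component
inn_single_pct = pct(INN_SINGLE, INN_ALL)
usan_single_pct = pct(USAN_SINGLE, USAN_ALL)

print(f"TTD overlap: {ttd_pct}% of DrugBank set")
print(f"ChEMBL_17 overlap: {chembl_pct}% of DrugBank set")
print(f"Three-way consensus: {three_way_pct}% of DrugBank set")
print(f"INN single-component: {inn_single_pct}%")
print(f"USAN single-component: {usan_single_pct}%")
```

On these numbers roughly three quarters of the DrugBank starting set survives the three-way intersection, and the single-component fraction is markedly lower for USAN than for INN, consistent with the remark that more salt forms carry USANs.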
CONCLUSIONS: A surprising degree of non-overlap persists in drug structures. Our results are not a criticism of these valuable sources, but further analysis is needed of the multiplicity of structural representations and the fuzzy naming of essentially the same canonical drugs inside PubChem. This important issue in cheminformatics extends beyond the INNs to all pharmacologically active structures. It also rationalises our IUPHAR/BPS Guide to PHARMACOLOGY strategic choice of focusing on consensus sets for curation. This work indicates that definitive drug lists will remain elusive until there is more collective engagement on provenance, standardisation and cross-mapping.