Evolving consensus-based curatorial strategies


Presentation for a Departmental Seminar with the David Gloriam/GPCRDB team, Dept. of Pharmaceutical Sciences, University of Copenhagen, 6th May 2014


  1. www.guidetopharmacology.org
     Will the real drugs and targets please stand up? Evolving consensus-based curatorial strategies
     Chris Southan, IUPHAR/BPS Guide to PHARMACOLOGY Web portal Group, Centre for Integrative Physiology, School of Biomedical Sciences, University of Edinburgh, Hugh Robson Building, Edinburgh, EH8 9XD, UK. cdsouthan@hotmail.com
     Presented to the Gloriam/GPCRDB Team and the Dept. of Pharmaceutical Sciences, University of Copenhagen, 6th May 2014
  2. GToPdb: receptors, ligands, targets and drugs
     • An expert-curated database overseen by the IUPHAR Nomenclature Committee (NC-IUPHAR)
     • >70 subcommittees comprising ~700 international scientists working on individual target families
     • 4 full-time curators, 1 part-time admin, 1 developer
     • NC-IUPHAR publishes nomenclature recommendations and reviews on various topics in pharmacological journals and through the IUPHAR database
     • Subcommittees update their database pages annually
     • Continuously expanding to incorporate new data types, new targets and ligands, and new domain committees
     • Public database releases every 3-4 months
  3. Content
  4. Detailed annotation
  5. Pharmacological and clinical data
  6. Wellcome Trust Grant 099156/Z/12/Z
     • Key objective: “encompass all the human targets of current prescription medicines and the likely targets of future medicines”
     • Conceptually familiar from our established receptor/channel-centric database
     • But - needed to re-define curatorial approaches, caveats and end-points
     • Balance between theoretical rigour and pragmatic utility
     • Four foci - grant fulfilment, user value, data mining, data consumption
     • Discuss and document changes in curatorial strategies with practical guidelines
     • Add enhancements, new relationships and features
     • Control activity-mapping stringencies and relationship distributions
     • QC legacy content, harmonise and remediate where necessary
     • Aim for small, but perfectly-formed, data content vs. complete coverage
  7. Technical implementation
     • Restrict relationships to citable/provenanced quantitative mappings (typically IC50, Ki, Kd)
     • Formally tag data-supported “primary targets”
     • Only data-supported polypharmacology
     • Mask nutraceuticals, metabolites or endogenous hormones from bloating drug > target relationship space
     • Limit drug > multiple subunit mappings to direct interactions
     • Normalise targets to UniProt IDs and Swiss-Prot for human
     • Normalise drugs and ligands to PubChem compound records (CIDs)
     • Extend useful relationships e.g. drug > prodrug, drug > active metabolite, ligand = target (antibody > cytokine)
     • Flexibility to handle edge cases (e.g. heparinoids)
     • Options for selective expansion (e.g. kinases, proteases and Alzheimer’s)
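The first rule on this slide (only citable, quantitative mappings) can be sketched as a simple record filter. This is a minimal illustration under stated assumptions: the `Interaction` record, its field names and the `curate` helper are hypothetical and are not taken from the GToPdb schema.

```python
from dataclasses import dataclass
from typing import List, Optional

# Activity types the slide names as typical quantitative end-points.
QUANTITATIVE_TYPES = {"IC50", "Ki", "Kd"}


@dataclass(frozen=True)
class Interaction:
    """Hypothetical drug > target record (illustrative, not the real schema)."""
    ligand_cid: int                 # PubChem compound identifier (CID)
    target_uniprot: str             # UniProt accession of the target
    activity_type: Optional[str]    # e.g. "Ki"; None if no measurement
    value_nm: Optional[float]       # activity value, assumed here to be in nM
    pmid: Optional[int]             # citation providing provenance
    primary_target: bool = False    # curator-assigned "primary target" tag


def is_curatable(rec: Interaction) -> bool:
    """True only for relationships with a quantitative value and a citation."""
    return (
        rec.activity_type in QUANTITATIVE_TYPES
        and rec.value_nm is not None
        and rec.pmid is not None
    )


def curate(records: List[Interaction]) -> List[Interaction]:
    """Drop every relationship that lacks data support or provenance."""
    return [rec for rec in records if is_curatable(rec)]
```

Keeping the `primary_target` tag on the same record means that only data-supported relationships can ever carry it, which is one way to read the "formally tag data-supported primary targets" rule.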
  8. Defining limits for curation
     • The good news: capture of targets and drugs in databases and literature reports is continuously expanding
     • The bad news: no one agrees on numbers, relationship definitions, curatorial rules, identifiers, exact molecular structures, choices of primary sources or provenance attribution
     • More bad news: source proliferation < “circular” annotation
     • Human target range: 186 approved drugs in 2006 (PMID: 17139284) < 3,044 in ChEMBL_18
     • Approved drug ranges: 1,216 FDA Maximum Daily Dose (PubChem Assay ID 1195) < 2,750 for the NCGC Pharmaceutical Collection (PMID: 21525397)
     • Outer bioactivity ranges: 8,057 INNs < 928,875 actives in PubChem BioAssays < 6.3 million from GVKBIO with SAR from papers and patents
  9. Evolution of our consensus strategy
     Based on many collective years of curatorial engagement and deep source knowledge, we now pursue a consensus approach for the following reasons:
     1. Concordant sources are generally more likely to be right than wrong
     2. Curatorial efficiency of starting with solid consensus sets
     3. Multiple sources are informatically synergistic (if truly independent)
     4. Approach is flexible via source updates and testing different filters
     5. We control total numbers for matching to curatorial capacity
     6. The concept can easily be explained to users
     7. The exercise of comparing sources is very informative
     8. It forces entity identifier normalisation (via cross-mapping if necessary)
     9. Consensus lists per se have value for users (e.g. hosting on website)
  10. Will the real targets please stand up?
      • Compared as human Swiss-Prot IDs for 2013 database releases
      • The intersect is 351 and the union is 3,046 (i.e. 15% of the 20,265-protein human proteome)
      • Lists included approved, clinical and research targets
      Figure 7d from: “Comparing the chemical structure and protein content of ChEMBL, DrugBank, Human Metabolome Database and the Therapeutic Target Database” PMID: 24533037
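The intersect, union and proteome-coverage figures on this slide are plain set arithmetic over accession lists. A minimal sketch, assuming each source has already been reduced to a set of human Swiss-Prot accessions; the function name and the dictionary layout are illustrative only.

```python
def consensus_stats(sources, proteome_size):
    """Summarise the overlap between per-source sets of protein accessions.

    sources: mapping of source name -> iterable of Swiss-Prot accessions.
    proteome_size: number of proteins in the reference proteome.
    """
    id_sets = [set(ids) for ids in sources.values()]
    if not id_sets:
        raise ValueError("at least one source is required")
    intersect = set.intersection(*id_sets)   # present in every source
    union = set.union(*id_sets)              # present in any source
    return {
        "intersect": len(intersect),
        "union": len(union),
        "percent_of_proteome": round(100 * len(union) / proteome_size, 1),
    }
```

As a check on the quoted figure, `round(100 * 3046 / 20265, 1)` evaluates to 15.0, which matches the 15% on the slide.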
  11. Gene Ontology comparison indicates source selectivity
  12. Use a target consensus to populate the database
      • ChEMBL 17, 252 approved
      • Mathias Rask-Andersen et al., July 2013, 481 approved
      • Southan et al., 2013, 3-way human DrugBank/ChEMBL/TTD 352
      • 3-way or 2-way, 19 + 40 + 143 = 202 Targets Of Approved Drugs (TOADS) set selected for GToP upload
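One way to read the "3-way or 2-way" selection is as a k-of-n vote: keep a target if at least two of the three lists contain it. The sketch below illustrates that reading only; the function name and the `min_sources` threshold are assumptions, and the slide does not say which overlaps the 19, 40 and 143 correspond to.

```python
from collections import Counter


def k_way_consensus(sources, min_sources=2):
    """Return {accession: number of supporting sources} for consensus targets.

    sources: mapping of source name -> iterable of target accessions.
    min_sources: minimum number of lists that must contain a target.
    """
    # set(ids) ensures a duplicate inside one list is counted only once.
    support = Counter(acc for ids in sources.values() for acc in set(ids))
    return {acc: n for acc, n in support.items() if n >= min_sources}
```

Returning the support count alongside each accession keeps the 3-way and 2-way members distinguishable after selection, so the stricter subset can be prioritised for curation.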
  13. Will the real drugs please stand up?
      • Work up the following CID triage inside PubChem
      • Select DrugBank 1,504 “approved” drug structures
      • Select two additional sources, TTD and ChEMBL
      • Filter to remove salts and mixtures
      • Select synonym INN (WHO International Non-proprietary Name)
      • The final step was the Boolean intersect between all five
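The triage was run interactively inside PubChem, but its logic can be sketched offline as five predicates combined with a Boolean AND. Everything here is an assumption made for illustration: the `CompoundRecord` fields are hypothetical, and the salt/mixture test uses the SMILES convention that a "." separates disconnected components, which is only a proxy for PubChem's own filter.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# The three depositing sources named on the slide.
REQUIRED_SOURCES = frozenset({"DrugBank", "TTD", "ChEMBL"})


@dataclass(frozen=True)
class CompoundRecord:
    """Hypothetical per-CID summary (illustrative field names)."""
    cid: int
    smiles: str
    sources: FrozenSet[str]   # names of sources that submitted this structure
    has_inn: bool             # True if an INN appears among the synonyms


def is_single_component(smiles: str) -> bool:
    """Proxy salt/mixture filter: no '.' means one connected component."""
    return "." not in smiles


def passes_triage(rec: CompoundRecord) -> bool:
    """Boolean intersect of the three source filters, the salt filter and INN."""
    return (
        REQUIRED_SOURCES <= rec.sources
        and is_single_component(rec.smiles)
        and rec.has_inn
    )


def triage(records: List[CompoundRecord]) -> List[int]:
    """Return the CIDs that satisfy every filter."""
    return [rec.cid for rec in records if passes_triage(rec)]
```

Expressing each filter as its own predicate makes it easy to drop one and re-count, for example to see the effect of removing a source.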
  14. Observations and caveats
      • This set of 923 drugs can be accessed via the MyNCBI open URL http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1Fo7u3apR1bzS_UWr1YhHOTkZ/
      • TTD last submitted in Feb 2012, so drug content is capped to before that date (dropping TTD gives 1,117 CIDs)
      • Some metabolites (e.g. amino acids) come through the filters
      • Older drugs have no INN (e.g. aspirin)
      • Some peptide drug CIDs are missing (suggesting low concordance)
      • Approved fixed-mixtures are excluded (they do not get an INN)
      • The computed CID identity is actually a hash-code match, rather than via InChIKey (but this should give similar numbers)
      • Each of the 923 had 76 submissions (SIDs)
      • Applying “same (bond) connectivity” gives 18,749 but removing the virtual deuterated entries reduces this to 6,919 (i.e. the 923 have, on average, 7.5 alternative stereo CIDs)
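The "same connectivity" expansion was done with PubChem's own identity search. As an offline approximation, which is a swapped-in technique and not the method used for the slide, CIDs can be grouped by the first 14-character block of the standard InChIKey, which hashes the connectivity layer while stereochemistry and isotopes fall in the second block. The function names and the input mapping below are illustrative assumptions.

```python
from collections import defaultdict


def connectivity_block(inchikey: str) -> str:
    """Return the first (14-character) block of a standard InChIKey."""
    parts = inchikey.split("-")
    if len(parts) != 3 or len(parts[0]) != 14:
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return parts[0]


def group_by_connectivity(cid_to_inchikey):
    """Group CIDs whose InChIKeys share the same connectivity block."""
    groups = defaultdict(set)
    for cid, key in cid_to_inchikey.items():
        groups[connectivity_block(key)].add(cid)
    return dict(groups)


def mean_group_size(groups) -> float:
    """Average number of CIDs per connectivity group."""
    if not groups:
        raise ValueError("no groups to average")
    return sum(len(members) for members in groups.values()) / len(groups)
```

The slide's average can be checked directly: `round(6919 / 923, 1)` evaluates to 7.5.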
  15. Closing consensus drugs > targets
      • From Phase 1 targets > drugs we have moved to Phase 2 for drugs > targets
      • Current stats = 228 TOADS (inward mapping expanded the set by ~10%)
      • Current stats = 996 approved drugs (need to complete the activity mappings)
      • Note that antibodies and larger peptides (with no PubChem CIDs) are subsumed in the 996
      • 2013 new drug CIDs loaded http://cdsouthan.blogspot.se/2014/03/the-drugs-of-2013-in-pubchem.html
      • Will back-fill 2010-2012 new approvals as ligands, targets and activities (but most already there)
  16. GPCRdb/GToPdb collaborative opportunity
      • Inspect which GPCRs are concordant or discordant between the target lists
      • Might be able to do a similar exercise for GPCR-active drug/compound lists, depending on what we can find with linkage (e.g. GLIDA)
      • Work up a triage for alert triggers for new GPCR ligand structures in PDB (e.g. via MMDB)
  17. References and Acknowledgments
      The database team: Adam Pawson, Joanna Sharman, Helen Benson, Elena Faccenda