Evolving consensus-based curatorial strategies

www.guidetopharmacology.org
Will the real drugs and targets please stand up?
Evolving consensus-based curatorial strategies
Chris Southan, IUPHAR/BPS Guide to PHARMACOLOGY Web portal Group, Centre for Integrative
Physiology, School of Biomedical Sciences, University of Edinburgh, Hugh Robson Building, Edinburgh,
EH8 9XD, UK. cdsouthan@hotmail.com
Presented to the Gloriam/GPCRDB Team and the Dept. of Pharmaceutical Sciences,
University of Copenhagen, 6th May 2014

GToPdb: receptors, ligands, targets and drugs
• An expert-curated database overseen by the IUPHAR Nomenclature
Committee (NC-IUPHAR)
• >70 subcommittees comprising ~700 international scientists working on
individual target families.
• 4 full-time curators, 1 part-time admin, 1 developer.
• NC-IUPHAR publishes nomenclature recommendations and reviews on various
topics in pharmacological journals and through the IUPHAR database.
• Subcommittees update their database pages annually.
• Continuously expanding to incorporate new data types, new targets and
ligands and new domain committees
• Public database releases every 3-4 months

Pharmacological and clinical data

WellcomeTrust Grant 099156/Z/12/Z
• Key objective: “encompass all the human targets of current prescription
medicines and the likely targets of future medicines”
• Conceptually familiar from our established receptor/channel-centric database
• But - needed to re-define curatorial approaches, caveats and end-points
• Balance between theoretical rigour and pragmatic utility
• Four foci - grant fulfilment, user value, data mining, data consumption
• Discuss and document changes in curatorial strategies with practical guidelines
• Add enhancements, new relationships and features
• Control activity-mapping stringencies and relationship distributions
• QC legacy content, harmonise and remediate where necessary
• Aim for small, but perfectly-formed, data content vs. complete coverage

Technical implementation
• Restrict relationships to citable/provenanced quantitative mappings
(typically IC50, Ki, Kd)
• Formally tag data-supported “primary targets”
• Only data-supported polypharmacology
• Mask nutraceuticals, metabolites or endogenous hormones from bloating
drug > target relationship space
• Limit drug > multiple subunit mappings to direct interactions
• Normalize targets to UniProt IDs and Swiss-Prot for human
• Normalise drugs and ligands to PubChem compound records (CIDs)
• Extend useful relationships e.g. drug > prodrug, drug > active metabolite,
ligand = target (antibody > cytokine)
• Flexibility to handle edge cases (e.g. heparinoids)
• Options for selective expansion (e.g. kinases, proteases and Alzheimer’s)
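The first few rules above (quantitative, citable mappings normalised to UniProt IDs and PubChem CIDs) can be sketched as a record type plus a filter. This is only an illustration: the class, field names and accepted activity types are assumptions drawn from the bullets, not the GToPdb schema, and the identifiers in the example are placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the slide says "typically IC50, Ki, Kd"; the real list may be longer.
QUANTITATIVE_TYPES = {"IC50", "Ki", "Kd"}


@dataclass(frozen=True)
class Interaction:
    """One drug > target mapping (hypothetical field names)."""
    pubchem_cid: int            # ligand normalised to a PubChem CID
    uniprot_id: str             # target normalised to a UniProt accession
    activity_type: str          # e.g. "IC50", "Ki", "Kd"
    value_nm: float             # measured value, here assumed to be in nM
    pmid: Optional[int]         # citation giving provenance, if any
    primary_target: bool = False


def is_curatable(rec: Interaction) -> bool:
    """Keep only citable, quantitative mappings."""
    return (
        rec.activity_type in QUANTITATIVE_TYPES
        and rec.value_nm > 0
        and rec.pmid is not None
    )


# Placeholder identifiers, for illustration only.
example = Interaction(1, "P00000", "Ki", 12.0, 12345678, primary_target=True)
print(is_curatable(example))
```

A record with no citation, or with a non-quantitative activity type, would be rejected by the same check.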

Defining limits for curation
• The good news: capture of targets and drugs in databases and literature
reports is continuously expanding
• The bad news: no one agrees on numbers, relationship definitions,
curatorial rules, identifiers, exact molecular structures, choices of primary
sources or provenance attribution
• More bad news: source proliferation < “circular” annotation
• Human target range: 186 approved drugs in 2006 (PMID:17139284) <
3,044 in ChEMBL_18
• Approved drug ranges: 1,216 FDA Maximum Daily Dose (PubChem Assay
ID 1195) < 2,750 for the NCGC Pharmaceutical Collection (PMID:21525397)
• Outer bioactivity ranges: 8,057 INNs < 928,875 actives in PubChem
BioAssays < 6.3 million from GVKBIO with SAR from papers and patents

Evolution of our consensus strategy
Based on many collective years of curatorial engagement and deep source
knowledge we now pursue a consensus approach for the following reasons:
1. Concordant sources are generally more likely to be right than wrong
2. Curatorial efficiency of starting with solid consensus sets
3. Multiple sources are informatically synergistic (if truly independent)
4. Approach is flexible via source updates and testing different filters
5. We control total numbers for matching to curatorial capacity
6. The concept can easily be explained to users
7. The exercise of comparing sources is very informative
8. It forces entity identifier normalisation (via cross-mapping if necessary)
9. Consensus lists per se have value for users (e.g. hosting on website)

Will the real targets please stand up?
• Compared as human Swiss-Prot IDs for 2013 database releases
• Intersect is 351; the union is 3,046 (i.e. 15% of the 20,265-protein human proteome)
• Lists included approved, clinical and research targets
Figure 7d from: “Comparing the chemical structure and protein content of ChEMBL, DrugBank, Human Metabolome Database and the Therapeutic Target Database” PMID: 24533037
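The intersect/union comparison can be sketched with plain Python sets. The source names and accessions below are toy placeholders rather than real database content; only the final percentage uses figures from the slide (a union of 3,046 against a 20,265-protein proteome).

```python
from typing import Dict, Set, Tuple


def compare_sources(sources: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (intersect, union) of the target ID sets from all sources."""
    id_sets = list(sources.values())
    return set.intersection(*id_sets), set.union(*id_sets)


def proteome_percent(n_targets: int, proteome_size: int = 20265) -> float:
    """Share of the human proteome covered, as a percentage."""
    return 100.0 * n_targets / proteome_size


# Toy data: hypothetical sources and placeholder IDs.
toy = {
    "source_a": {"T1", "T2", "T3"},
    "source_b": {"T2", "T3", "T4"},
    "source_c": {"T3", "T5"},
}
intersect, union = compare_sources(toy)
print(sorted(intersect), sorted(union))
print(round(proteome_percent(3046)))  # slide figures: 3,046 of 20,265
```

The comparison is only meaningful once every source has been mapped to the same identifier type, here human Swiss-Prot IDs.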

Gene Ontology comparison indicates source selectivity

Use a target consensus to populate the database
• ChEMBL 17, 252 approved
• Mathias Rask-Andersen et al., July
2013, 481 approved
• Southan et al., 2013, 3-way human
DrugBank/ChEMBL/TTD 352
• 3-way or 2-way, 19 + 40 + 143 =
202 Targets Of Approved Drugs
(TOADS) set selected for GToP
upload
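A "3-way or 2-way" selection amounts to keeping every target that appears in at least two of the source lists. The sketch below shows that rule on toy data; the function name, source labels and IDs are hypothetical, and it does not reproduce how the 19 + 40 + 143 = 202 TOADS were actually assembled.

```python
from collections import Counter
from typing import Dict, Set


def consensus(sources: Dict[str, Set[str]], min_sources: int = 2) -> Set[str]:
    """Targets listed by at least `min_sources` of the sources."""
    # Each source is a set, so a target is counted once per source.
    counts = Counter(t for ids in sources.values() for t in ids)
    return {t for t, n in counts.items() if n >= min_sources}


# Toy data: hypothetical sources and placeholder IDs.
toy = {
    "list_a": {"T1", "T2", "T3"},
    "list_b": {"T2", "T3", "T4"},
    "list_c": {"T3", "T4", "T5"},
}
print(sorted(consensus(toy)))                 # 2-way or 3-way
print(sorted(consensus(toy, min_sources=3)))  # strict 3-way only
```

Raising or lowering `min_sources` is one way to match the size of the consensus set to curatorial capacity.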

Will the real drugs please stand up?
• Work up the following CID triage inside PubChem
• Select DrugBank 1504 “approved” drug structures
• Select two additional sources TTD and ChEMBL
• Filter to remove salts and mixtures
• Select synonym INN (WHO International Non-proprietary Name).
• The final step was the Boolean intersect between all five CID sets
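The triage was run inside PubChem itself, but the logic is a Boolean intersect of five CID sets, which can be sketched offline. The filter names and CIDs below are placeholders chosen for illustration; the second helper shows how dropping one filter changes the result.

```python
from typing import Dict, Set


def triage(filters: Dict[str, Set[int]]) -> Set[int]:
    """Boolean intersect of every CID set in `filters`."""
    return set.intersection(*filters.values())


def triage_without(filters: Dict[str, Set[int]], dropped: str) -> Set[int]:
    """Same intersect, leaving out one named filter."""
    kept = {name: cids for name, cids in filters.items() if name != dropped}
    return triage(kept)


# Toy data: placeholder CIDs, hypothetical filter names.
toy_filters = {
    "drugbank_approved": {1, 2, 3, 4, 5},
    "ttd": {1, 2, 3},
    "chembl": {1, 2, 3, 4},
    "no_salts_or_mixtures": {1, 2, 4, 5},
    "has_inn_synonym": {1, 2, 3, 4},
}
print(sorted(triage(toy_filters)))
print(sorted(triage_without(toy_filters, "ttd")))
```

Because the intersect can only shrink as filters are added, the least complete or least recently updated source sets the ceiling on the final count.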

Observations and caveats
• This set of 923 drugs can be accessed via the MyNCBI open URL
http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1Fo7u3apR1bzS_UWr1YhHOTkZ/
• TTD last submitted in Feb 2012, so drug content is capped to
before that date (dropping TTD gives 1117 CIDs)
• Some metabolites (e.g. amino acids) come through the filters
• Older drugs have no INN (e.g. aspirin)
• Some peptide drug CIDs are missing (suggesting low concordance)
• Approved fixed-mixtures are excluded (they do not get an INN)
• The computed CID identity is actually a hash-code match, rather
than via InChIKey (but this should give similar numbers)
• Each of the 923 had 76 submissions (SIDs)
• Applying “same (bond) connectivity” gives 18749 but removing the
virtual deuterated entries reduces this to 6919 (i.e. the 923 have,
on average, 7.5 alternative stereo CIDs)
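The averages quoted in the last bullet can be checked directly from the slide's own counts. The variable names are illustrative only.

```python
n_drugs = 923                  # consensus drug CIDs
same_connectivity = 18749      # CIDs sharing bond connectivity
without_deuterated = 6919      # after removing virtual deuterated entries

removed = same_connectivity - without_deuterated
per_drug = without_deuterated / n_drugs

print(removed)             # deuterated entries removed
print(round(per_drug, 1))  # same-connectivity CIDs per drug
```

Most of the same-connectivity expansion therefore comes from the virtual deuterated entries rather than from alternative stereo forms.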

Closing consensus drugs > targets
• From Phase 1 (targets > drugs) we have moved to Phase 2 (drugs >
targets)
• Current stats = 228 TOADS (inward mapping expanded the set by ~10%)
• Current stats = 996 approved drugs (need to complete the activity
mappings)
• Note that antibodies and larger peptides (with no PubChem CIDs) are
subsumed in the 996
• 2013 new drug CIDs loaded
http://cdsouthan.blogspot.se/2014/03/the-drugs-of-2013-in-pubchem.html
• Will back-fill 2010-2012 new approvals as ligands, targets and activities
(but most already there)

GPCRdb/GToPdb collaborative opportunity
• Inspect which GPCRs are concordant or discordant between the target
lists
• Might be able to do a similar exercise for GPCR-active drug/compound lists
– depending on what we can find with linkage (e.g. GLIDA)
• Work up a triage for alert triggers for new GPCR ligand structures in PDB
(e.g. via MMDB)

References and Acknowledgments
The database team: Adam Pawson, Joanna Sharman, Helen Benson, Elena Faccenda
