Chemistry-to-Protein Relationship Quality



BioIT 2012 poster

Published in: Technology

    1. The Chemistry-to-Protein Relationship Quality Challenge: Confounding Linked Data? (Poster, Chris Southan, BioIT Boston, 2012)
       Introduction
       As evidenced at this meeting, data integration to facilitate the generation of new knowledge is undergoing a quantum jump, driven by the generation of larger data sets, expanded computational capacity and semantic web federated queries across linked open sources.
       However, the cloud in this bright future is that molecular mechanistic relationships inferred from data of equivocal quality can become a house of cards. On a good day, these may remain local artefacts in the uber-network. On a bad day, the very linking on which utility depends can propagate errors instantly, remorselessly, globally and permanently.
       This poster compares inferred mechanistic mappings between chemical structures and proteins, both in curated drug databases and in large chemogenomic data portals. A surprising degree of discordance and a range of error types were found. It could also be shown that various curatorial and automated parsing errors were being passed on transitively between databases.
       The results are given below as a series of problems that are potentially confounding for linking between chemistry <> protein databases.
    2. Problem I: Constitutive Mapping Challenges
       We know that mapping between chemicals and proteins is neither pure nor simple. This is not even a complete list of what “compound X <> protein Y” relationships can encompass in databases.
       • Binds-to and modulates activity
       • Binds-to with known specificity (e.g. active or allosteric site in PDB)
       • Binds-to with molecular mechanism-of-action (mmoa): inhibitor, activator, agonist, antagonist
       • Binds-to with quantitative mmoa (Ki, IC50, Kd etc.)
       • Binds-to and is metabolically transformed by (e.g. P450)
       • Binds-to and is transported by (e.g. multidrug resistance-associated protein)
       • Binds-to but no activity modulation (e.g. albumin)
       • X transformation affects binding to Y (e.g. prodrug > drug > salt > metabolite)
       • X is non-canonical (e.g. enantiomers with different affinity for Y)
       • One X to many proteins (panel screen)
       • Data source ambiguous in description of X (e.g. errors or tautomers)
       • Data source ambiguous in description of Y (e.g. protein ID not resolved)
       • X does not bind Y, thus mmoa is indirect (e.g. up or down regulation of Y)
       • Many compounds to one Y (a throughput assay)
       • X has relevant linked data in addition to binding Y (e.g. plasma clearance)
       • Y is part of a functional complex (e.g. gamma secretase)
       • X-Y mechanistic coupling at different system levels (e.g. in vitro, in cellulo, in vivo and in clinico)
       • Y is species-specific
       • Y is non-canonical (e.g. splice variant, phosphorylated, activation clipped etc.)
    3. Problem II: The Numbers Don’t Add Up
       A collation of entity and relationship counts between databases and curated sets, ranked by compounds-per-protein
       • The statistical differences, spanning orders of magnitude, are only partially interpretable
       • No consensus definitions or hierarchies of “target” or “interaction” as concepts
       • Ipso facto, curation and/or parsing rules are very different
       • Evidence filtration functionality differs
       • Extraction substrates are mostly similar (e.g. journals, PubMed and other dbs)
       • Explicit but also cryptic circularity (e.g. large dbs subsuming smaller dbs)
    4. Problem III: Differential Chemistry Capture
       • We can compare the two premier academic drug mapping resources, DrugBank (DB) and the Therapeutic Target Database (TTD), which in principle have convergent capture concepts.
       • Both use expert curation teams to extract from the same primary data corpora.
       • The intra-PubChem comparison of chemical content (at the CID level) is shown below
         DB = 6720, TTD = 14631, Union = 19803, Intersect = 1548
       • Results show very different capture (e.g. the union is over 10x larger than the intersect)
       • Some of this is explicable (e.g. DB’s historical emphasis on PDB ligands and TTD picking up BioAssayed compounds from ChEMBL) but the reasons for other differences are less clear.
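The four counts on this slide can be cross-checked with simple set arithmetic. The sketch below is only an illustration: the function name `overlap_summary` and the reported fields are assumptions introduced here, and only the three input numbers (DB, TTD, intersect) come from the poster.

```python
def overlap_summary(size_a: int, size_b: int, intersect: int) -> dict:
    """Summarise the overlap of two sets from their sizes and their intersect.

    Hypothetical helper: uses inclusion-exclusion, |A u B| = |A| + |B| - |A n B|.
    """
    union = size_a + size_b - intersect
    return {
        "union": union,
        "only_a": size_a - intersect,          # content captured by A alone
        "only_b": size_b - intersect,          # content captured by B alone
        "union_to_intersect": union / intersect,
        "jaccard": intersect / union,          # shared fraction of all content
    }


# CID counts quoted on the slide: DrugBank, TTD and their intersect.
summary = overlap_summary(size_a=6720, size_b=14631, intersect=1548)
print(summary)
```

Under these inputs the recomputed union matches the 19803 quoted on the slide, and the union-to-intersect ratio comes out above 10, consistent with the "over 10x" statement.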
    5. Problem IV: Differential Target Capture
       • The Venn compares DrugBank with TTD and a re-curated DrugBank sub-set (R-An: Rask-Andersen et al., “Trends in the exploitation of novel drug targets”, 2011, PMID: 21804595)
       • While there are caveats related to set definitions, species filters and protein ID cross-mappings, the differential capture of the three manually curated sets is clear
       • The intersect, at only 170 human UniProt IDs, is ~ ½ the expected number of primary targets
       • Some of this is explicable (i.e. R-An picking up new targets) but the causes of other differences are unclear
       • Over 900 targets (this comparison excluded enzymes and transporters) are unique to DrugBank, so their curatorial rules are clearly different
    6. Problem V: Large chemistry <> protein dbs
       • Leading expert teams and significant resources
       • Overlaps in concepts and utility
       • Differences in approaches and technical implementation
    7. Problem V (ctd): Too Large to Verify but too Divergent to Trust?
       • Comparing atorvastatin <> proteins in four large-scale dbs
       • The 4-database intersect is only 8 from 143
       • 6 of these are probably indirect (no binding) and mechanistically unclear
       • Significant database-unique capture (e.g. CTD)
       • There are caveats with these exact numbers because they depend on protein database cross-mappings
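A multi-database comparison of this kind could be sketched as follows. Everything in the example is hypothetical: the database labels (`db_a` to `db_d`), the protein placeholders (`PROT_1` etc.) and the function name `compare_sources` are assumptions made for illustration, and the toy sets do not reproduce the poster's 8-from-143 result.

```python
from collections import Counter


def compare_sources(mappings: dict[str, set[str]]) -> dict:
    """Compare protein sets reported for one compound by several sources."""
    union = set().union(*mappings.values())
    intersect = set.intersection(*mappings.values())
    # How many sources support each protein mapping.
    support = Counter(p for proteins in mappings.values() for p in proteins)
    # Proteins reported by exactly one source (database-unique capture).
    unique = {
        db: {p for p in proteins if support[p] == 1}
        for db, proteins in mappings.items()
    }
    return {"union": union, "intersect": intersect, "unique": unique}


# Toy data only: placeholder identifiers, not real database content.
toy = {
    "db_a": {"PROT_1", "PROT_2", "PROT_3"},
    "db_b": {"PROT_1", "PROT_2", "PROT_4", "PROT_5"},
    "db_c": {"PROT_1", "PROT_2", "PROT_5"},
    "db_d": {"PROT_1", "PROT_2", "PROT_6"},
}
result = compare_sources(toy)
print(len(result["intersect"]), "of", len(result["union"]), "proteins in all sources")
for db, proteins in sorted(result["unique"].items()):
    print(db, "unique:", sorted(proteins))
```

As the slide's last bullet notes, any such count is only as good as the protein identifier cross-mapping applied before the sets are compared.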
    8. Problem VI: Whose curation is “correct”?
       • Protein <> atorvastatin results, automated vs curated (ChEMBL and DrugBank)
       • Sum is the proteins from the four dbs in the previous slide
       • Consensus is only HMGCR and CYP450 3A4
       • Unique capture of transporters and metabolic enzymes by DrugBank
       • Targets unique to DrugBank: human dipeptidyl peptidase 4, aryl hydrocarbon receptor
       • Targets unique to ChEMBL: cruzipain, pig dipeptidyl peptidase 4
    9. Problem VII: The PDB Hetero Entry Trap: False Drugs/Ligands and False Targets
       E.g. STITCH makes high-scoring links from DPPIV to galactose and fucose
    10. Problem VII (ctd): STITCH Cross-references the Same Errors in DrugBank that Passed them to PubChem
       DrugBank links to the wrong sugar isomer as CID 671379, and PubChem inherited the 40 targets in the “Biomolecular Interactions and Pathways” field. The DB entry is now deprecated.
    11. Problem VII (ctd): Mixed Mappings of the “Wrong” and “Right” (drug-relevant) Ligands
       Most of the mappings above are “right”; the one on the left is “wrong” (the sugar is in the crystal but is not a ligand or a drug in this context)
    12. Problem VIII: False-negatives
       • This clinically significant inferred interaction is missed by (all?) dbs
       • A guess is that neither text mining nor curation rules (as implemented in the 7 dbs checked here) connected the individual drug names to the general case triple “statins-inhibited-PAR-1”
       • We can grapple with false-positives via filtration rules and heuristic tuning, but false-negatives are a more difficult and potentially more serious problem
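One way to surface this kind of false-negative is to expand class-level statements to the individual drugs in the class, while keeping the provenance visible. The sketch below is an assumption-laden illustration, not a method described in the poster: the `Triple` type, the function `expand_class_triples` and the two-member class table are all hypothetical, and the expanded triples are labelled as inferred so they are not mistaken for direct evidence.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    evidence: str


def expand_class_triples(triples, class_members):
    """Derive member-level triples from class-level ones, tagged as inferred."""
    expanded = []
    for t in triples:
        for member in class_members.get(t.subject, []):
            expanded.append(
                Triple(member, t.predicate, t.obj,
                       f"inferred from class '{t.subject}'")
            )
    return expanded


# The class-level triple quoted on the slide.
class_level = [Triple("statins", "inhibited", "PAR-1", "literature, class-level")]
# Illustrative (deliberately incomplete) class membership table.
members = {"statins": ["atorvastatin", "simvastatin"]}

expanded = expand_class_triples(class_level, members)
for t in expanded:
    print(t)
```

Whether a class-level finding really holds for every member is a curatorial judgment, which is why the evidence field records the inference step instead of hiding it.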
    13. Ameliorating the Problems
       • Avoid “brainless parsing” and go for precision over recall
       • Make circularity explicit (e.g. dbs within dbs and curatorial recycling)
       • Refresh and update cross-links between dbs
       • Define biochemical and pharmacological relationships
       • Rigorous and deep QC (e.g. actually eyeball records)
       • Referential integrity checks (e.g. spot orphaned entities)
       • Display relationship distributions, inspect the extreme tails and attempt to understand them
       • Document curatorial practice (e.g. equivocality handling rules)
       • Facilitate annotation judgments and quality-based filtration (i.e. curatorial empowerment)
       • Consider canonical merging of chemical structures with multiplexed bioactivity mappings
       • Crowdsourcing (e.g. DrugBank comments > fixes and deprecations)
       • Encourage author mark-up at source (i.e. MIABE, PMID: 21878981)
       • “But wait, hold on – did anyone peer review the database?” (Williams and Ekins, 2012 ACS presentation)
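Two of the suggestions above, referential integrity checks and inspection of relationship distributions, lend themselves to a small sketch. The identifiers (`C1`, `P1`, ...), the function names and the toy link table are all hypothetical; a real database would supply its own entity tables and link records.

```python
from collections import Counter


def find_orphans(compound_ids, protein_ids, links):
    """Referential integrity: entities with no links, and links to unknown entities."""
    linked_compounds = {c for c, _ in links}
    linked_proteins = {p for _, p in links}
    return {
        "unlinked_compounds": set(compound_ids) - linked_compounds,
        "unlinked_proteins": set(protein_ids) - linked_proteins,
        "dangling_links": {
            (c, p) for c, p in links
            if c not in compound_ids or p not in protein_ids
        },
    }


def compounds_per_protein(links):
    """Distribution of distinct compounds mapped to each protein."""
    return Counter(p for _, p in set(links))


# Toy data only.
compounds = {"C1", "C2", "C3", "C4"}
proteins = {"P1", "P2", "P3"}
links = [("C1", "P1"), ("C2", "P1"), ("C3", "P1"), ("C1", "P2"), ("C5", "P2")]

orphans = find_orphans(compounds, proteins, links)
counts = compounds_per_protein(links)
print(orphans)
# The extreme tail: proteins with the most compounds are the first records to eyeball.
print(counts.most_common(1))
```

The checks only flag candidates; deciding whether an orphan or an outlier is an error still needs the manual inspection the list above calls for.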
    14. Conclusions
       • Linked Open Data is the new mining rock and roll; but...
       • Even just chemistry <> protein is subject to the caveats in this poster (and more besides)
       • At the very least, circumspection is needed if inferences from database linking are to be acted upon, validated and exploited
       • In the end, nothing saves us from database quality, so this has to be addressed by all of us
       Dr Christopher Southan
       ChrisDS Consulting: cdsouthan@hotmail.com
       Twitter: @cdsouthan
       Blog: