Quality Challenges in Linking Chemistry and Protein Data
1. The Chemistry-to-Protein Relationship Quality Challenge: Confounding Linked Data?
(Poster, Chris Southan, BioIT Boston, 2012)
Introduction
As evidenced at this meeting, data integration to facilitate the generation of new
knowledge is undergoing a quantum jump, driven by the generation of larger data sets,
expanded computational capacity and semantic web federated queries across linked open
sources.
However, the cloud in this bright future is that molecular mechanistic relationships inferred
from data of equivocal quality can become a house of cards. On a good day, these may
remain local artefacts in the uber-network. On a bad day, the very linking on which utility
depends can propagate errors instantly, remorselessly, globally and permanently.
This poster compares inferred mechanistic mappings between chemical structures and
proteins, both in curated drug databases and large chemogenomic data portals. A
surprising degree of discordance and different error types were found. It could also be
shown that various curatorial and automated parsing errors were being transitively passed
on between databases.
The results are given below as a series of problems that are potentially confounding for
linking between chemistry <> protein databases.
2. Problem I: Constitutive Mapping Challenges
We know mapping between chemicals and proteins is neither pure nor simple. This is not even a
complete list of what "compound X <> protein Y" relationships can encompass in databases:
• Binds-to and modulates activity
• Binds-to with known specificity (e.g. active or allosteric site in PDB)
• Binds-to with molecular mechanism-of-action (mmoa), e.g. inhibitor, activator, agonist, antagonist
• Binds-to with quantitative mmoa (Ki, IC50, Kd etc.)
• Binds-to and is metabolically transformed by (e.g. P450)
• Binds-to and is transported by (e.g. multidrug resistance-associated protein)
• Binds-to but no activity modulation (e.g. albumin)
• X transformation affects binding to Y (e.g. prodrug > drug > salt > metabolite)
• X is non-canonical (e.g. enantiomers with different affinity for Y)
• One X to many proteins (panel screen)
• Data source ambiguous in description of X (e.g. errors or tautomers)
• Data source ambiguous in description of Y (e.g. protein ID not resolved)
• X does not bind Y, thus mmoa is indirect (e.g. up or down regulation of Y)
• Many compounds to one Y (a high-throughput assay)
• X has relevant linked data in addition to binding Y (e.g. plasma clearance)
• Y is part of a functional complex (e.g. gamma secretase)
• X-Y mechanistic coupling at different system levels (e.g. in vitro, in cellulo, in vivo and in clinico)
• Y is species-specific
• Y is non-canonical (e.g. splice variant, phosphorylated, activation clipped etc)
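One way to keep these distinctions from collapsing into a bare "X <> Y" link is to store the relationship type explicitly. The following is a minimal Python sketch, not any database's actual schema: the `Relation` labels, the `Link` fields and the `direct_binders` helper are hypothetical names chosen for illustration, and only a few of the relationship types listed above are covered.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Relation(Enum):
    # Hypothetical labels for a handful of the relationship types listed above.
    BINDS_AND_MODULATES = "binds-to and modulates activity"
    METABOLIZED_BY = "binds-to and is metabolically transformed by"
    TRANSPORTED_BY = "binds-to and is transported by"
    BINDS_NO_MODULATION = "binds-to but no activity modulation"
    INDIRECT = "does not bind; effect on Y is indirect"


@dataclass(frozen=True)
class Link:
    compound: str
    protein: str
    relation: Relation
    species: Optional[str] = None       # Y is species-specific
    system_level: Optional[str] = None  # e.g. "in vitro", "in vivo"


def direct_binders(links):
    """Keep only links in which X is asserted to bind Y directly."""
    return [link for link in links if link.relation is not Relation.INDIRECT]


links = [
    Link("atorvastatin", "HMGCR", Relation.BINDS_AND_MODULATES, species="human"),
    Link("atorvastatin", "CYP3A4", Relation.METABOLIZED_BY, species="human"),
    Link("compound X", "protein Y", Relation.INDIRECT),  # placeholder names
]

for link in direct_binders(links):
    print(link.compound, "<>", link.protein, ":", link.relation.value)
```

With the type recorded, filtering on it becomes an explicit choice rather than a side effect of how a source happened to be parsed.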
3. Problem II: The Numbers Don’t Add Up
A collation of entity and relationship counts between databases and curated sets,
ranked by compounds-per-protein
• The statistical differences in orders of magnitude are only partially interpretable
• No consensus definitions or hierarchies of "target" or "interaction" as concepts
• Ipso facto, curation and/or parsing rules are very different
• Evidence filtration functionality differs
• Extraction substrates are mostly similar (e.g. journals, PubMed and other dbs)
• Explicit but also cryptic circularity (e.g. large dbs subsuming smaller dbs)
4. Problem III: Differential Chemistry Capture
• We can compare the two premier academic drug mapping resources, DrugBank (DB) and
Therapeutic Target Database (TTD), which in principle have convergent capture concepts.
• Both use expert curation teams to extract from the same primary data corpora.
• The intra-PubChem comparison of chemical content (at the CID level) is shown below
DB = 6720; TTD = 14631; Union = 19803; Intersect = 1548
• Results show very different capture (e.g. union is over 10x larger than the intersect)
• Some of this is explicable (e.g. DB’s historical emphasis on PDB ligands and TTD picking up
BioAssayed compounds from ChEMBL) but reasons for other differences are less clear.
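The counts above can be cross-checked with simple set arithmetic. This is a minimal Python sketch using only the four numbers on the slide; the function name `overlap_stats` and the choice of summary ratios (including the Jaccard index) are illustrative assumptions, not part of the original analysis.

```python
def overlap_stats(size_a, size_b, intersect):
    """Summarise the overlap of two sets from their sizes alone."""
    union = size_a + size_b - intersect  # inclusion-exclusion
    return {
        "union": union,
        "jaccard": intersect / union,             # shared fraction of the union
        "union_to_intersect": union / intersect,  # the "over 10x" ratio
        "frac_a_shared": intersect / size_a,
        "frac_b_shared": intersect / size_b,
    }


# DB = 6720, TTD = 14631, Intersect = 1548 (CID counts from the slide)
stats = overlap_stats(6720, 14631, 1548)
for name, value in stats.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

The computed union reproduces the 19803 reported on the slide, and the Jaccard index comes out below 0.08, which puts a single number on how little the two curated sets share.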
5. Problem IV: Differential Target Capture
• The Venn compares DrugBank with TTD and a re-curated DrugBank sub-set (Ra-An,
"Trends in the exploitation of novel drug targets", 2011, PMID: 21804595)
• While there are caveats related to set definitions, species filters and protein ID
cross-mappings, the differential capture of the three manually curated sets is clear
• The intersect at only 170 human UniProt IDs is ~ ½ the expected primary targets
• Some of this is explicable (i.e. Ra-An picking up new targets) but the causes of
other differences are unclear
• Over 900 targets (this comparison excluded enzymes and transporters) are
unique to DrugBank so their curatorial rules are clearly different
6. Problem V: Large chemistry <> protein Dbs
• Leading expert teams and significant resources
• Overlaps in concepts and utility
• Differences in approaches and technical implementation
7. Problem V (ctd): Too Large to Verify but too Divergent to Trust?
• Comparing atorvastatin <> proteins in four large-scale Dbs
• The 4-database intersect is only 8 from 143
• 6 of these are probably indirect (no binding) and mechanistically unclear
• Significant database-unique capture (e.g. CTD)
• There are caveats with these exact numbers because they depend on protein database
x-mappings
8. Problem VI: Whose curation is "correct"?
• Protein <> atorvastatin results, automated vs curated (ChEMBL and DrugBank)
• Sum is the proteins from the four dbs in the previous slide
• Consensus is only HMGCR and CYP450 3A4
• Unique capture of transporters and metabolic enzymes by DrugBank
• Targets unique to DrugBank: human Dipeptidyl peptidase 4, Aryl hydrocarbon
receptor
• Targets unique to ChEMBL: Cruzipain, pig Dipeptidyl peptidase 4
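The consensus and source-unique captures are plain set operations. The sketch below uses only the protein names listed on this slide, so the two sets are deliberately partial and are not the full database records; the short labels (`CYP3A4`, `DPP4`, `AHR`) and the `drop_species` helper are assumptions made for illustration.

```python
# Partial sets built only from names on the slide (not full database content).
drugbank = {"HMGCR", "CYP3A4", "human DPP4", "AHR"}
chembl = {"HMGCR", "CYP3A4", "Cruzipain", "pig DPP4"}

consensus = drugbank & chembl      # captured by both sources
only_drugbank = drugbank - chembl  # unique to DrugBank
only_chembl = chembl - drugbank    # unique to ChEMBL

SPECIES_PREFIXES = ("human ", "pig ")


def drop_species(label):
    """Collapse species-specific labels onto the bare protein name."""
    for prefix in SPECIES_PREFIXES:
        if label.startswith(prefix):
            return label[len(prefix):]
    return label


# Ignoring species changes the apparent consensus.
collapsed = {drop_species(p) for p in drugbank} & {drop_species(p) for p in chembl}

print(sorted(consensus))
print(sorted(collapsed))
```

Collapsing the species adds DPP4 to the overlap, a small instance of how the protein cross-mapping rule, rather than the underlying evidence, can decide what counts as agreement.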
9. Problem VII. The PDB Hetero Entry Trap: False Drug/ligands and False Targets
E.g. STITCH makes high-scoring links from DPPIV to galactose and fucose
10. Problem VII ctd. STITCH X-refs the Same Errors in DrugBank that Passed them to PubChem
DrugBank links to the wrong sugar isomer as CID 671379 and
PubChem inherited the 40 targets in the "Biomolecular
Interactions and Pathways" field. The DB entry is now deprecated
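Transitive inheritance of this kind can be traced if the "feeds into" relationships between sources are recorded. This is a minimal sketch assuming a hand-written graph: the `FEEDS` dictionary contains only the two hand-offs described on this slide, and the `downstream` helper is a hypothetical name, not a feature of any of these databases.

```python
from collections import deque

# Assumed "feeds into" graph, limited to the hand-offs named on the slide.
FEEDS = {
    "DrugBank": ["PubChem", "STITCH"],
    "PubChem": [],
    "STITCH": [],
}


def downstream(source, feeds):
    """Breadth-first search for every resource that can inherit from `source`.

    If `source` itself appears in the result, the graph is circular.
    """
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for target in feeds.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


affected = downstream("DrugBank", FEEDS)
print(sorted(affected))
```

When a record is deprecated at its origin, the same traversal gives the list of places where the inherited copy still needs to be refreshed.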
11. Problem VII ctd. Mixed mappings of the "Wrong" and "Right" (drug-relevant) Ligands
Most of the mappings above are "right"; the one on the left is "wrong"
(the sugar is in the crystal but is not a ligand or a drug in this context)
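A simple guard against this trap is to flag, rather than silently link, hetero entries that are often present in crystals without being drug-relevant ligands. The sketch below is illustrative only: the flag list holds just the two sugars named in this poster, the record layout is invented, and "inhibitor A" is a placeholder rather than a real PDB component.

```python
# Assumed review list; a real one would need curation and context
# (a sugar can be a genuine ligand for some proteins).
REVIEW_HETS = {"galactose", "fucose"}

# Hypothetical protein <> hetero-entry records parsed from structures.
records = [
    {"protein": "DPPIV", "het": "galactose"},
    {"protein": "DPPIV", "het": "fucose"},
    {"protein": "DPPIV", "het": "inhibitor A"},  # placeholder ligand
]


def split_hets(records, review_list):
    """Separate records into (keep, flagged-for-curator-review)."""
    keep, flagged = [], []
    for record in records:
        if record["het"].lower() in review_list:
            flagged.append(record)
        else:
            keep.append(record)
    return keep, flagged


keep, flagged = split_hets(records, REVIEW_HETS)
print("keep:", [r["het"] for r in keep])
print("review:", [r["het"] for r in flagged])
```

Flagged records go to a curator instead of being dropped, since whether a sugar is a ligand depends on the context.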
12. Problem VIII: False-negatives
• This clinically significant inferred interaction is missed by (all?) Dbs
• A guess is that neither text mining nor curation rules (as implemented in the 7
dbs checked here) connected the individual drug names to the general case
triple "statins-inhibited-PAR-1"
• We can grapple with false-positives via filtration rules and heuristic tuning but
false-negatives are a more difficult and potentially more serious problem
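One way to surface such misses is to expand class-level triples into member-level candidates for curator review. This is a hedged sketch: the `CLASS_MEMBERS` map and the `expand` function are hypothetical, the statin list is a short illustrative subset, and the expanded triples are candidates only, since a class-level statement does not establish that every member behaves the same way.

```python
# Assumed class-membership map (illustrative subset of the statin class).
CLASS_MEMBERS = {
    "statins": ["atorvastatin", "simvastatin", "rosuvastatin", "pravastatin"],
}


def expand(triple, class_members):
    """Expand a class-level (subject, predicate, object) triple.

    Each result carries a provenance note so that inferred links are never
    confused with links stated directly in a source.
    """
    subject, predicate, obj = triple
    members = class_members.get(subject)
    if members is None:
        return [(subject, predicate, obj, "as stated")]
    note = f"inferred from class: {subject}"
    return [(member, predicate, obj, note) for member in members]


candidates = expand(("statins", "inhibited", "PAR-1"), CLASS_MEMBERS)
for candidate in candidates:
    print(candidate)
```

Keeping the provenance note on each candidate lets downstream filters separate class-derived links from directly reported ones.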
13. Ameliorating the Problems
• Avoid "brainless parsing" and go for precision over recall
• Make circularity explicit (e.g. dbs within dbs and curatorial recycling)
• Refresh and update cross-links between dbs
• Define biochemical and pharmacological relationships
• Rigorous and deep QC (e.g. actually eyeball records)
• Referential integrity checks (e.g. spot orphaned entities)
• Display relationship distributions, inspect the extreme tails and attempt
to understand them
• Document curatorial practice (e.g. equivocality handling rules)
• Facilitate annotation judgments and quality-based filtration (i.e.
curatorial empowerment)
• Consider canonical merging of chemical structures with multiplexed
bioactivity mappings
• Crowdsourcing (e.g. DrugBank comments > fixes and deprecations)
• Encourage author mark-up at source (e.g. MIABE, PMID: 21878981)
• "But wait, hold on – did anyone peer review the database?"
(Williams and Ekins 2012 ACS presentation)
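Two of the checks above, referential integrity and inspecting the tails of relationship distributions, are cheap to automate. The following is a minimal sketch on made-up identifiers (`C1`, `P1`, etc.); the function names and the toy link table are assumptions for illustration, not a description of any database's QC pipeline.

```python
from collections import Counter

# Toy registries and link table; every identifier here is made up.
compounds = {"C1", "C2", "C3", "C5"}
proteins = {"P1", "P2"}
links = [("C1", "P1"), ("C2", "P1"), ("C3", "P1"),
         ("C1", "P2"), ("C4", "P2"), ("C2", "P3")]


def dangling_references(links, compounds, proteins):
    """Return IDs used in links but missing from the registries."""
    missing_compounds = {c for c, _ in links if c not in compounds}
    missing_proteins = {p for _, p in links if p not in proteins}
    return missing_compounds, missing_proteins


def unlinked(entities, linked_ids):
    """Return registered entities that no link mentions."""
    return set(entities) - set(linked_ids)


def compounds_per_protein(links):
    """Count distinct compounds per protein, largest first."""
    counts = Counter(p for _, p in set(links))
    return counts.most_common()


missing_c, missing_p = dangling_references(links, compounds, proteins)
lonely = unlinked(compounds, (c for c, _ in links))
ranked = compounds_per_protein(links)
head, tail = ranked[:1], ranked[-1:]  # the extremes worth eyeballing

print("dangling:", missing_c, missing_p)
print("unlinked compounds:", lonely)
print("most / least linked proteins:", head, tail)
```

The automated pass only nominates records; the extremes it surfaces still need to be read by a person to decide whether they are errors or real biology.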
14. Conclusions
• Linked Open Data is the new mining rock and roll; but...
• Even just chemistry <> protein is subject to the caveats in this poster (and
more besides)
• At the very least circumspection is needed if inferences from database
linking are to be acted upon, validated and exploited
• In the end, nothing saves us from database quality so this has to be
addressed by all of us
Dr Christopher Southan
ChrisDS Consulting:
http://www.cdsouthan.info/Consult/CDS_cons.htm
Email: cdsouthan@hotmail.com
Twitter: @cdsouthan
Blog: http://cdsouthan.blogspot.com/
LinkedIN: http://www.linkedin.com/in/cdsouthan
Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
Presentations: http://www.slideshare.net/cdsouthan