The Chemistry-to-Protein Relationship Quality
        Challenge: Confounding Linked Data?
                      (Poster, Chris Southan, BioIT Boston, 2012)

                                      Introduction

As evidenced at this meeting, data integration to facilitate the generation of new
knowledge is undergoing a quantum jump, driven by the generation of larger data sets,
expanded computational capacity and semantic web federated queries across linked open
sources.

However, the cloud in this bright future is that molecular mechanistic relationships inferred
from data of equivocal quality can become a house of cards. On a good day, these may
remain local artefacts in the uber-network. On a bad day, the very linking on which utility
depends can propagate errors instantly, remorselessly, globally and permanently.

This poster compares inferred mechanistic mappings between chemical structures and
proteins, both in curated drug databases and large chemogenomic data portals. A
surprising degree of discordance and different error types were found. It could also be
shown that various curatorial and automated parsing errors were being transitively passed
on between databases.

The results are given below as a series of problems that are potentially confounding for
linking between chemistry <> protein databases.
                                                                                            [1]
Problem I: Constitutive Mapping Challenges
We know mapping between chemicals and proteins is neither pure nor simple. The list below is not even a
complete account of what “compound X <> protein Y” relationships can encompass in databases.

•   Binds-to and modulates activity
•   Binds-to with known specificity (e.g. active or allosteric site in PDB)
•   Binds-to with molecular mechanism-of-action (mmoa): inhibitor, activator, agonist, antagonist
•   Binds-to with quantitative mmoa (Ki, IC50, Kd etc.)
•   Binds-to and is metabolically transformed by (e.g. P450)
•   Binds-to and is transported by (e.g. multidrug resistance-associated protein)
•   Binds-to but no activity modulation (e.g. albumin)
•   X transformation affects binding to Y (e.g. prodrug > drug > salt > metabolite)
•   X is non-canonical (e.g. enantiomers with different affinity for Y)
•   One X to-many proteins (panel screen)
•   Data source ambiguous in description of X (e.g. errors or tautomers)
•   Data source ambiguous in description of Y (e.g. protein ID not resolved)
•   X does not bind Y, thus mmoa is indirect (e.g. up or down regulation of Y)
•   Many compounds to-one Y (e.g. a high-throughput assay)
•   X has relevant linked data in addition to binding Y (e.g. plasma clearance)
•   Y is part of a functional complex (e.g. gamma secretase)
•   X-Y mechanistic coupling at different system levels (e.g. in vitro, in cellulo, in vivo and in clinico)
•   Y is species-specific
•   Y is non-canonical (e.g. splice variant, phosphorylated, activation clipped etc)
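
One way to keep these relationship types apart in a database is to store an explicit, typed relationship rather than a bare "X <> Y" link. The following is a minimal sketch under stated assumptions: the names RelationType, Relationship and direct_targets are hypothetical, the enum covers only a few of the cases listed above, and "compound_X" is a placeholder rather than a real record.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class RelationType(Enum):
    """Hypothetical, deliberately incomplete vocabulary of X <> Y relationships."""
    MODULATES = "binds and modulates activity"
    METABOLISED_BY = "binds and is metabolically transformed by"
    TRANSPORTED_BY = "binds and is transported by"
    BINDS_NO_MODULATION = "binds without activity modulation"
    INDIRECT = "no binding; indirect effect"


@dataclass(frozen=True)
class Relationship:
    compound_id: str
    protein_id: str
    relation: RelationType
    species: Optional[str] = None   # Y can be species-specific
    evidence: Optional[str] = None  # e.g. a PMID or assay identifier


def direct_targets(relationships: List[Relationship]) -> List[Relationship]:
    """Keep only links where the compound binds and modulates the protein."""
    return [r for r in relationships if r.relation is RelationType.MODULATES]


# Illustrative records only, not extracted from any database.
example = [
    Relationship("atorvastatin", "HMGCR", RelationType.MODULATES, species="human"),
    Relationship("atorvastatin", "CYP3A4", RelationType.METABOLISED_BY, species="human"),
    Relationship("compound_X", "albumin", RelationType.BINDS_NO_MODULATION),
]

print([r.protein_id for r in direct_targets(example)])
```

With the type stored explicitly, a query for "targets" no longer has to return metabolising enzymes or carrier proteins alongside the primary target.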
                                                                                                             [2]
Problem II: The Numbers Don’t Add Up
    A collation of entity and relationship counts between databases and curated sets,
    ranked by compounds-per-protein




•     The statistical differences, spanning orders of magnitude, are only partially interpretable
•     No consensus definitions or hierarchies of “target” or “interaction” as concepts
•     Ipso facto, curation and/or parsing rules are very different
•     Evidence filtration functionality differs
•     Extraction substrates are mostly similar (e.g. journals, PubMed and other dbs)
•     Explicit but also cryptic circularity (e.g. large dbs subsuming smaller dbs)
                                                                                          [3]
Problem III: Differential Chemistry Capture
•     We can compare the two premier academic drug mapping resources, DrugBank and
      Therapeutic Target Database, in principle having convergent capture concepts.
•     Both use expert curation teams to extract from the same primary data corpora.
•     The intra-PubChem comparison of chemical content (at the CID level) is shown below
    DB = 6720    TTD = 14631    Union = 19803    Intersect = 1548




•     Results show very different capture (e.g. the union is over 10x larger than the intersect)
•     Some of this is explicable (e.g. DB’s historical emphasis on PDB ligands and TTD picking up
      BioAssayed compounds from ChEMBL) but reasons for other differences are less clear.
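
As a quick arithmetic check on the counts quoted on this slide, the union follows from the two set sizes and the intersect by inclusion-exclusion. The sketch below uses only the four numbers given above; the variable names and the Jaccard index are additions for illustration.

```python
# PubChem CID counts copied from the slide.
drugbank = 6720
ttd = 14631
intersect = 1548

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
union = drugbank + ttd - intersect
print(union)  # 19803, matching the union quoted on the slide

# Ratio behind the "over 10x" remark.
print(round(union / intersect, 1))  # 12.8

# Jaccard index (intersect / union) as a single overlap score.
print(round(intersect / union, 3))  # 0.078

# Share of each source that the other also captures.
print(round(intersect / drugbank, 2), round(intersect / ttd, 2))
```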
                                                                                                      [4]
Problem IV: Differential Target Capture

                                        •   The Venn compares DrugBank with TTD
                                            and a re-curated DrugBank sub-set (Ra-An,
                                            “Trends in the exploitation of novel
                                            drug targets”, 2011, PMID: 21804595)

                                        •   While there are caveats related to set
                                            definitions, species filters and protein ID
                                            cross-mappings, the differential capture of
                                            the three manually curated sets is clear

                                        •   The intersect, at only 170 human UniProt
                                            IDs, is roughly half the expected number
                                            of primary targets

• Some of this is explicable (e.g. Ra-An picking up new targets) but the causes of
  other differences are unclear

•   Over 900 targets (this comparison excluded enzymes and transporters) are
    unique to DrugBank, so its curatorial rules are clearly different

                                                                                   [5]
Problem V: Large chemistry <> protein Dbs
 • Leading expert teams and significant resources
 • Overlaps in concepts and utility
 • Differences in approaches and technical implementation




                                                            [6]
Problem V (ctd): Too Large to Verify but too
           Divergent to Trust?

                             •   Comparing atorvastatin <>
                                 proteins in four large-scale
                                 Dbs

                             •   The 4-database intersect is
                                 only 8 from 143

                             •   6 of these are probably
                                 indirect (no binding) and
                                 mechanistically unclear

                             •   Significant database-unique
                                 capture (e.g. CTD)

                             •   There are caveats with these
                                 exact numbers because they
                                 depend on protein database
                                 x-mappings
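
A comparison of this kind reduces to set operations once each database's proteins for a compound have been mapped to a common identifier. The following is a minimal sketch, assuming that mapping has already been done; compare_sources, the source names and the P1..P7 identifiers are hypothetical placeholders, not the data behind the slide.

```python
from typing import Dict, Set, Tuple


def compare_sources(
    mappings: Dict[str, Set[str]]
) -> Tuple[Set[str], Set[str], Dict[str, Set[str]]]:
    """Return (union, consensus, per-source unique capture).

    mappings: source name -> set of protein IDs linked to one compound.
    """
    if not mappings:
        return set(), set(), {}
    all_ids = set().union(*mappings.values())
    consensus = set.intersection(*mappings.values())
    unique = {}
    for name, ids in mappings.items():
        # Everything the other sources report for this compound.
        others = set().union(*(v for k, v in mappings.items() if k != name))
        unique[name] = ids - others
    return all_ids, consensus, unique


# Toy input with placeholder identifiers.
toy = {
    "db_a": {"P1", "P2", "P3"},
    "db_b": {"P1", "P2", "P4"},
    "db_c": {"P1", "P5"},
    "db_d": {"P1", "P2", "P6", "P7"},
}

all_ids, consensus, unique = compare_sources(toy)
print(len(all_ids), sorted(consensus))
print({name: sorted(ids) for name, ids in unique.items()})
```

The result is only as good as the identifier cross-mapping fed into it, which is the caveat noted on the slide.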


                                                                [7]
Problem VI: Whose curation is “correct”?




•   Protein <> atorvastatin results, automated vs curated (ChEMBL and DrugBank)
•   “Sum” is the set of proteins from the four dbs in the previous slide
•   Consensus is only HMGCR and cytochrome P450 3A4 (CYP3A4)
•   Unique capture of transporters and metabolic enzymes by DrugBank
•   Targets unique to DrugBank: human Dipeptidyl peptidase 4, Aryl hydrocarbon
    receptor
•   Targets unique to ChEMBL: Cruzipain, pig Dipeptidyl peptidase 4
                                                                               [8]
Problem VII. The PDB Hetero Entry Trap:
      False Drug/ligands and False Targets

E.g. STITCH makes high-scoring links from DPPIV to galactose and fucose




                                                                         [9]
Problem VII ctd. STITCH X-refs the Same Errors
      in DrugBank that Passed them to PubChem




DrugBank links to the wrong sugar isomer as CID 671379 and
PubChem inherited the 40 targets in the “Biomolecular
Interactions and Pathways” field. The DrugBank entry is now deprecated.



                                                             [10]
Problem VII ctd. Mixed mappings of the
“Wrong” and “Right” (drug-relevant) Ligands




                            Most of the mappings above are
                            “right”; the one on the left is “wrong”
                            (the sugar is in the crystal but is not a
                            ligand or a drug in this context)
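
One possible guard against this trap is to flag PDB hetero components that are commonly glycosylation sugars or crystallisation additives before they are treated as drug-target links. The following is a minimal sketch under stated assumptions: the exclusion list is illustrative and far from complete, flag_het_links is a hypothetical helper, and "DRUG_X" is a placeholder code. Flagged links are set aside for review rather than deleted, since the same component can be a genuine ligand in another structure.

```python
from typing import Iterable, List, Tuple

# Illustrative, incomplete list of PDB chemical component codes that often
# appear in crystal structures without being drug-relevant ligands.
LIKELY_NON_LIGAND_HET = {
    "NAG": "N-acetylglucosamine",
    "GAL": "galactose",
    "FUC": "fucose",
    "MAN": "mannose",
    "GOL": "glycerol",
    "SO4": "sulfate ion",
}

Link = Tuple[str, str]  # (protein_id, het_code)


def flag_het_links(links: Iterable[Link]) -> Tuple[List[Link], List[Link]]:
    """Split links into (keep, review) based on the exclusion list."""
    keep: List[Link] = []
    review: List[Link] = []
    for protein_id, het_code in links:
        if het_code.upper() in LIKELY_NON_LIGAND_HET:
            review.append((protein_id, het_code))
        else:
            keep.append((protein_id, het_code))
    return keep, review


# Toy input: two sugar components and one placeholder ligand code.
links = [("DPP4", "GAL"), ("DPP4", "FUC"), ("DPP4", "DRUG_X")]
keep, review = flag_het_links(links)
print(keep)
print(review)
```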




                                                             [11]
Problem VIII: False-negatives




• This clinically significant inferred interaction is missed by (all?) Dbs

• A guess is that neither text mining nor curation rules (as implemented in the 7
  dbs checked here) connected the individual drug names to the general case
  triple “statins-inhibited-PAR-1”

• We can grapple with false-positives via filtration rules and heuristic tuning but
  false-negatives are a more difficult and potentially more serious problem
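
One way to reduce this kind of false-negative is to expand a class-level triple into one triple per class member, while recording that each was inferred rather than asserted. The following is a minimal sketch, assuming a class-membership table is available; CLASS_MEMBERS (with a short, incomplete list of statins) and expand_class_triples are hypothetical names for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
TaggedTriple = Tuple[str, str, str, str]

# Illustrative, incomplete class-membership table.
CLASS_MEMBERS: Dict[str, List[str]] = {
    "statins": ["atorvastatin", "simvastatin", "rosuvastatin", "pravastatin"],
}


def expand_class_triples(
    triples: List[Triple], class_members: Dict[str, List[str]]
) -> List[TaggedTriple]:
    """Expand class-level subjects into member-level triples with provenance."""
    expanded: List[TaggedTriple] = []
    for subj, pred, obj in triples:
        members = class_members.get(subj.lower())
        if members is None:
            # Subject is not a known class: pass the triple through unchanged.
            expanded.append((subj, pred, obj, "asserted"))
        else:
            for member in members:
                expanded.append((member, pred, obj, f"inferred from class '{subj}'"))
    return expanded


expanded = expand_class_triples([("statins", "inhibited", "PAR-1")], CLASS_MEMBERS)
for row in expanded:
    print(row)
```

Keeping the provenance tag matters here: a class-level statement does not guarantee that every member behaves the same way, so inferred triples should stay distinguishable from curated ones.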
                                                                                      [12]
Ameliorating the Problems
•   Avoid ”brainless parsing” and go for precision over recall
•   Make circularity explicit (e.g. dbs within dbs and curatorial recycling)
•   Refresh and update cross-links between dbs
•   Define biochemical and pharmacological relationships
•   Rigorous and deep QC (e.g. actually eyeball records)
•   Referential integrity checks (e.g. spot orphaned entities)
•   Display relationship distributions, inspect the extreme tails and attempt
    to understand them
•   Document curatorial practice (e.g. equivocality handling rules)
•   Facilitate annotation judgments and quality-based filtration (i.e.
    curatorial empowerment)
•   Consider canonical merging of chemical structures with multiplexed
    bioactivity mappings
•   Crowdsourcing (e.g. DrugBank comments > fixes and deprecations)
•   Encourage author mark-up at source (e.g. MIABE, PMID: 21878981)
•   “But wait, hold on – did anyone peer review the database?”
    (Williams and Ekins 2012 ACS presentation)
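
Two of the checks in this list, referential integrity and inspection of the distribution tails, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names and the toy identifiers (C1, P1, ...) are hypothetical, and the input is taken to be a flat list of (compound, protein) links plus the two entity tables.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Link = Tuple[str, str]  # (compound_id, protein_id)


def orphaned_entities(
    compounds: Iterable[str], proteins: Iterable[str], links: Iterable[Link]
) -> Dict[str, Set[str]]:
    """Report entities with no links and links pointing at unknown entities."""
    compounds, proteins, links = set(compounds), set(proteins), set(links)
    linked_c = {c for c, _ in links}
    linked_p = {p for _, p in links}
    return {
        "compounds_without_links": compounds - linked_c,
        "proteins_without_links": proteins - linked_p,
        "links_to_unknown_compounds": linked_c - compounds,
        "links_to_unknown_proteins": linked_p - proteins,
    }


def compounds_per_protein(links: Iterable[Link]) -> List[Tuple[str, int]]:
    """Count distinct compounds per protein, largest first, to expose the tails."""
    counts = Counter(p for _, p in set(links))
    return counts.most_common()


# Toy data with placeholder identifiers.
compounds = ["C1", "C2", "C3"]
proteins = ["P1", "P2"]
links = [("C1", "P1"), ("C2", "P1"), ("C2", "P1"), ("C4", "P3")]

report = orphaned_entities(compounds, proteins, links)
print(report)
print(compounds_per_protein(links))
```

The proteins at the top of the ranked list, and the entities that appear in the report, are the records most worth inspecting by eye.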
                                                                                [13]
Conclusions
• Linked Open Data is the new mining rock and roll; but...
• Even just chemistry <> protein is subject to the caveats in this poster (and
  more besides)
• At the very least circumspection is needed if inferences from database
  linking are to be acted upon, validated and exploited
• In the end, nothing saves us from database quality so this has to be
  addressed by all of us

Dr Christopher Southan
ChrisDS Consulting:
http://www.cdsouthan.info/Consult/CDS_cons.htm
Email: cdsouthan@hotmail.com
Twitter: @cdsouthan
Blog: http://cdsouthan.blogspot.com/
LinkedIN: http://www.linkedin.com/in/cdsouthan
Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
Presentations: http://www.slideshare.net/cdsouthan
                                                                             [14]

More Related Content

Viewers also liked

The effect of rosuvastatin on incident pneumonia from CMAJ 2012
The effect of rosuvastatin on incident pneumonia from CMAJ 2012The effect of rosuvastatin on incident pneumonia from CMAJ 2012
The effect of rosuvastatin on incident pneumonia from CMAJ 2012Soroka Medical Center
 
Rosuvastatin, pcsk9 concentrations, and ldl cholesterol response the jupiter ...
Rosuvastatin, pcsk9 concentrations, and ldl cholesterol response the jupiter ...Rosuvastatin, pcsk9 concentrations, and ldl cholesterol response the jupiter ...
Rosuvastatin, pcsk9 concentrations, and ldl cholesterol response the jupiter ...King Abdulaziz University - Jeddah
 
Jupiter Slides translate
Jupiter Slides translateJupiter Slides translate
Jupiter Slides translateguestef55fa
 
Design of gastroretentive bilayer floating films of propranolol hydrochloride...
Design of gastroretentive bilayer floating films of propranolol hydrochloride...Design of gastroretentive bilayer floating films of propranolol hydrochloride...
Design of gastroretentive bilayer floating films of propranolol hydrochloride...Namdeo Shinde
 
Crestor Tablets to treat high cholesterol and related conditions
Crestor Tablets to treat high cholesterol and related conditionsCrestor Tablets to treat high cholesterol and related conditions
Crestor Tablets to treat high cholesterol and related conditionsThe Swiss Pharmacy
 
Cardiovascular disorder
Cardiovascular disorderCardiovascular disorder
Cardiovascular disorderJack Frost
 
Statins (report biopharm) Pravastatin and Rosuvastatin
Statins (report biopharm) Pravastatin and RosuvastatinStatins (report biopharm) Pravastatin and Rosuvastatin
Statins (report biopharm) Pravastatin and RosuvastatinFretz Alfaro
 
JUPITER (Justification for the Use of Statins in Primary Prevention: An Inter...
JUPITER (Justification for the Use of Statins in Primary Prevention: An Inter...JUPITER (Justification for the Use of Statins in Primary Prevention: An Inter...
JUPITER (Justification for the Use of Statins in Primary Prevention: An Inter...theheart.org
 
A comparative study of Gaussian Graphical Model approaches to genomic data (R...
A comparative study of Gaussian Graphical Model approaches to genomic data (R...A comparative study of Gaussian Graphical Model approaches to genomic data (R...
A comparative study of Gaussian Graphical Model approaches to genomic data (R...Roberto Anglani
 
Causal comparative research ckv
Causal comparative research ckvCausal comparative research ckv
Causal comparative research ckvchina_velasco
 
Causal comparative research
Causal comparative researchCausal comparative research
Causal comparative researchDua FaTima
 

Viewers also liked (12)

The effect of rosuvastatin on incident pneumonia from CMAJ 2012
The effect of rosuvastatin on incident pneumonia from CMAJ 2012The effect of rosuvastatin on incident pneumonia from CMAJ 2012
The effect of rosuvastatin on incident pneumonia from CMAJ 2012
 
Rosuvastatin, pcsk9 concentrations, and ldl cholesterol response the jupiter ...
Rosuvastatin, pcsk9 concentrations, and ldl cholesterol response the jupiter ...Rosuvastatin, pcsk9 concentrations, and ldl cholesterol response the jupiter ...
Rosuvastatin, pcsk9 concentrations, and ldl cholesterol response the jupiter ...
 
ROSUVASTATIN CALCIUM PPT
ROSUVASTATIN CALCIUM PPTROSUVASTATIN CALCIUM PPT
ROSUVASTATIN CALCIUM PPT
 
Jupiter Slides translate
Jupiter Slides translateJupiter Slides translate
Jupiter Slides translate
 
Design of gastroretentive bilayer floating films of propranolol hydrochloride...
Design of gastroretentive bilayer floating films of propranolol hydrochloride...Design of gastroretentive bilayer floating films of propranolol hydrochloride...
Design of gastroretentive bilayer floating films of propranolol hydrochloride...
 
Crestor Tablets to treat high cholesterol and related conditions
Crestor Tablets to treat high cholesterol and related conditionsCrestor Tablets to treat high cholesterol and related conditions
Crestor Tablets to treat high cholesterol and related conditions
 
Cardiovascular disorder
Cardiovascular disorderCardiovascular disorder
Cardiovascular disorder
 
Statins (report biopharm) Pravastatin and Rosuvastatin
Statins (report biopharm) Pravastatin and RosuvastatinStatins (report biopharm) Pravastatin and Rosuvastatin
Statins (report biopharm) Pravastatin and Rosuvastatin
 
JUPITER (Justification for the Use of Statins in Primary Prevention: An Inter...
JUPITER (Justification for the Use of Statins in Primary Prevention: An Inter...JUPITER (Justification for the Use of Statins in Primary Prevention: An Inter...
JUPITER (Justification for the Use of Statins in Primary Prevention: An Inter...
 
A comparative study of Gaussian Graphical Model approaches to genomic data (R...
A comparative study of Gaussian Graphical Model approaches to genomic data (R...A comparative study of Gaussian Graphical Model approaches to genomic data (R...
A comparative study of Gaussian Graphical Model approaches to genomic data (R...
 
Causal comparative research ckv
Causal comparative research ckvCausal comparative research ckv
Causal comparative research ckv
 
Causal comparative research
Causal comparative researchCausal comparative research
Causal comparative research
 

Similar to Quality Challenges in Linking Chemistry and Protein Data

Evolving consensus-based curatorial strategies
Evolving consensus-based curatorial strategiesEvolving consensus-based curatorial strategies
Evolving consensus-based curatorial strategiesChris Southan
 
Southan real drugs_paris_oct_11_2014
Southan real drugs_paris_oct_11_2014Southan real drugs_paris_oct_11_2014
Southan real drugs_paris_oct_11_2014Chris Southan
 
Analysing targets and drugs to populate the GToP database
Analysing  targets and drugs to populate the GToP databaseAnalysing  targets and drugs to populate the GToP database
Analysing targets and drugs to populate the GToP databaseChris Southan
 
Structural Systems Pharmacology
Structural Systems PharmacologyStructural Systems Pharmacology
Structural Systems PharmacologyPhilip Bourne
 
Protein Interaction Reporters : Protein-Protein Interactions (PPI) elucidated...
Protein Interaction Reporters : Protein-Protein Interactions (PPI) elucidated...Protein Interaction Reporters : Protein-Protein Interactions (PPI) elucidated...
Protein Interaction Reporters : Protein-Protein Interactions (PPI) elucidated...Lorenz Lo Sauer
 
Slicing and dicing expert-curated protein targets in the Guide to PHARMACOLGY
Slicing and dicing expert-curated protein targets in the Guide to PHARMACOLGYSlicing and dicing expert-curated protein targets in the Guide to PHARMACOLGY
Slicing and dicing expert-curated protein targets in the Guide to PHARMACOLGYChris Southan
 
Connecting chemistry-to-biology
Connecting chemistry-to-biology Connecting chemistry-to-biology
Connecting chemistry-to-biology Chris Southan
 
Analysing the drug targets in the human genome
Analysing the drug targets in the human genomeAnalysing the drug targets in the human genome
Analysing the drug targets in the human genomeGuide to PHARMACOLOGY
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...Dr. Haxel Consult
 
Peptide tribulations
Peptide tribulationsPeptide tribulations
Peptide tribulationsChris Southan
 
Lecture 7 computer aided drug design
Lecture 7  computer aided drug designLecture 7  computer aided drug design
Lecture 7 computer aided drug designRAJAN ROLTA
 
Will the correct drugs please stand up?
Will  the correct drugs please stand up?Will  the correct drugs please stand up?
Will the correct drugs please stand up?Chris Southan
 
The End of the Drug Development Casino?
The End of the Drug Development Casino?The End of the Drug Development Casino?
The End of the Drug Development Casino?Paul Agapow
 
GtoPdb: A resource for cell-based perturbogens
GtoPdb:  A resource for cell-based perturbogensGtoPdb:  A resource for cell-based perturbogens
GtoPdb: A resource for cell-based perturbogensChris Southan
 
Estimating bioactivity database error rates, tiikkainen
Estimating bioactivity database error rates, tiikkainenEstimating bioactivity database error rates, tiikkainen
Estimating bioactivity database error rates, tiikkainenPekka Tiikkainen
 
IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...
IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...
IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...Chris Southan
 
Computational Drug Discovery: Machine Learning for Making Sense of Big Data i...
Computational Drug Discovery: Machine Learning for Making Sense of Big Data i...Computational Drug Discovery: Machine Learning for Making Sense of Big Data i...
Computational Drug Discovery: Machine Learning for Making Sense of Big Data i...Chanin Nantasenamat
 

Similar to Quality Challenges in Linking Chemistry and Protein Data (20)

Evolving consensus-based curatorial strategies
Evolving consensus-based curatorial strategiesEvolving consensus-based curatorial strategies
Evolving consensus-based curatorial strategies
 
Southan real drugs_paris_oct_11_2014
Southan real drugs_paris_oct_11_2014Southan real drugs_paris_oct_11_2014
Southan real drugs_paris_oct_11_2014
 
Analysing targets and drugs to populate the GToP database
Analysing  targets and drugs to populate the GToP databaseAnalysing  targets and drugs to populate the GToP database
Analysing targets and drugs to populate the GToP database
 
Structural Systems Pharmacology
Structural Systems PharmacologyStructural Systems Pharmacology
Structural Systems Pharmacology
 
Protein Interaction Reporters : Protein-Protein Interactions (PPI) elucidated...
Protein Interaction Reporters : Protein-Protein Interactions (PPI) elucidated...Protein Interaction Reporters : Protein-Protein Interactions (PPI) elucidated...
Protein Interaction Reporters : Protein-Protein Interactions (PPI) elucidated...
 
Slicing and dicing expert-curated protein targets in the Guide to PHARMACOLGY
Slicing and dicing expert-curated protein targets in the Guide to PHARMACOLGYSlicing and dicing expert-curated protein targets in the Guide to PHARMACOLGY
Slicing and dicing expert-curated protein targets in the Guide to PHARMACOLGY
 
Connecting chemistry-to-biology
Connecting chemistry-to-biology Connecting chemistry-to-biology
Connecting chemistry-to-biology
 
Analysing the drug targets in the human genome
Analysing the drug targets in the human genomeAnalysing the drug targets in the human genome
Analysing the drug targets in the human genome
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
 
Peptide tribulations
Peptide tribulationsPeptide tribulations
Peptide tribulations
 
Lecture 7 computer aided drug design
Lecture 7  computer aided drug designLecture 7  computer aided drug design
Lecture 7 computer aided drug design
 
Will the correct drugs please stand up?
Will  the correct drugs please stand up?Will  the correct drugs please stand up?
Will the correct drugs please stand up?
 
The End of the Drug Development Casino?
The End of the Drug Development Casino?The End of the Drug Development Casino?
The End of the Drug Development Casino?
 
GtoPdb: A resource for cell-based perturbogens
GtoPdb:  A resource for cell-based perturbogensGtoPdb:  A resource for cell-based perturbogens
GtoPdb: A resource for cell-based perturbogens
 
Estimating bioactivity database error rates, tiikkainen
Estimating bioactivity database error rates, tiikkainenEstimating bioactivity database error rates, tiikkainen
Estimating bioactivity database error rates, tiikkainen
 
Computer aided drug design
Computer aided drug designComputer aided drug design
Computer aided drug design
 
IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...
IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...
IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...
 
Mining public domain data as a basis for drug repurposing
Mining public domain data as a basis for drug repurposingMining public domain data as a basis for drug repurposing
Mining public domain data as a basis for drug repurposing
 
SLAS ADMET SIG: SLAS2013 Presentation
SLAS ADMET SIG: SLAS2013 PresentationSLAS ADMET SIG: SLAS2013 Presentation
SLAS ADMET SIG: SLAS2013 Presentation
 
Computational Drug Discovery: Machine Learning for Making Sense of Big Data i...
Computational Drug Discovery: Machine Learning for Making Sense of Big Data i...Computational Drug Discovery: Machine Learning for Making Sense of Big Data i...
Computational Drug Discovery: Machine Learning for Making Sense of Big Data i...
 

More from Chris Southan

FAIR connectivity for DARCP
FAIR  connectivity for DARCPFAIR  connectivity for DARCP
FAIR connectivity for DARCPChris Southan
 
Connectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityConnectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityChris Southan
 
Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Chris Southan
 
Guide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeGuide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeChris Southan
 
In silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentIn silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentChris Southan
 
Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Chris Southan
 
Desperately seeking DARCP
Desperately seeking DARCPDesperately seeking DARCP
Desperately seeking DARCPChris Southan
 
Seeking glimmers of light in Pharos “Tdark” proteins
Seeking glimmers of light in  Pharos “Tdark” proteinsSeeking glimmers of light in  Pharos “Tdark” proteins
Seeking glimmers of light in Pharos “Tdark” proteinsChris Southan
 
5HT2A modulators update for SAFER
5HT2A modulators update for SAFER5HT2A modulators update for SAFER
5HT2A modulators update for SAFERChris Southan
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databasesChris Southan
 
GtoPdb June 2019 poster
GtoPdb June 2019 posterGtoPdb June 2019 poster
GtoPdb June 2019 posterChris Southan
 
PubChem as a source of systems biology perturbagens
PubChem as a source of  systems biology perturbagensPubChem as a source of  systems biology perturbagens
PubChem as a source of systems biology perturbagensChris Southan
 
PubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyPubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyChris Southan
 
Will the real proteins please stand up
Will the real proteins please stand upWill the real proteins please stand up
Will the real proteins please stand upChris Southan
 
Peptide Tribulations
Peptide TribulationsPeptide Tribulations
Peptide TribulationsChris Southan
 
Looking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIRLooking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIRChris Southan
 
Guide to Immunopharmacology update
Guide to Immunopharmacology updateGuide to Immunopharmacology update
Guide to Immunopharmacology updateChris Southan
 
Druggable Proteome sources in UniProt
Druggable Proteome sources in UniProtDruggable Proteome sources in UniProt
Druggable Proteome sources in UniProtChris Southan
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbChris Southan
 

More from Chris Southan (20)

FAIR connectivity for DARCP
FAIR  connectivity for DARCPFAIR  connectivity for DARCP
FAIR connectivity for DARCP
 
Connectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityConnectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivity
 
Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2
 
Guide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeGuide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updae
 
In silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentIn silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug Development
 
Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?
 
Desperately seeking DARCP
Desperately seeking DARCPDesperately seeking DARCP
Desperately seeking DARCP
 
Seeking glimmers of light in Pharos “Tdark” proteins
Seeking glimmers of light in  Pharos “Tdark” proteinsSeeking glimmers of light in  Pharos “Tdark” proteins
Seeking glimmers of light in Pharos “Tdark” proteins
 
5HT2A modulators update for SAFER
5HT2A modulators update for SAFER5HT2A modulators update for SAFER
5HT2A modulators update for SAFER
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databases
 
GtoPdb June 2019 poster
GtoPdb June 2019 posterGtoPdb June 2019 poster
GtoPdb June 2019 poster
 
PubChem as a source of systems biology perturbagens
PubChem as a source of  systems biology perturbagensPubChem as a source of  systems biology perturbagens
PubChem as a source of systems biology perturbagens
 
PubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyPubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biology
 
Will the real proteins please stand up
Will the real proteins please stand upWill the real proteins please stand up
Will the real proteins please stand up
 
Peptide Tribulations
Peptide TribulationsPeptide Tribulations
Peptide Tribulations
 
Looking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIRLooking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIR
 
Guide to Immunopharmacology update
Guide to Immunopharmacology updateGuide to Immunopharmacology update
Guide to Immunopharmacology update
 
Druggable Proteome sources in UniProt
Druggable Proteome sources in UniProtDruggable Proteome sources in UniProt
Druggable Proteome sources in UniProt
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdb
 
Patents in PubChem
Patents in PubChemPatents in PubChem
Patents in PubChem
 

Quality Challenges in Linking Chemistry and Protein Data

  • 1. The Chemistry-to-Protein Relationship Quality Challenge: Confounding Linked Data? (Poster, Chris Southan, BioIT Boston, 2012)

    Introduction

    As evidenced from this meeting, data integration to facilitate the generation of new knowledge is undergoing a quantum jump driven by the generation of larger data sets, expanded computational capacity and semantic web federated queries across linked open sources.

    However, the cloud in this bright future is that molecular mechanistic relationships inferred from data of equivocal quality can become a house of cards. On a good day, these may remain local artefacts in the uber-network. On a bad day, the very linking on which utility depends can propagate errors instantly, remorselessly, globally and permanently.

    This poster compares inferred mechanistic mappings between chemical structures and proteins, both in curated drug databases and large chemogenomic data portals. A surprising degree of discordance and different error types were found. It could also be shown that various curatorial and automated parsing errors were being transitively passed on between databases.

    The results are given below as a series of problems that are potentially confounding for linking between chemistry <> protein databases.
  • 2. Problem I: Constitutive Mapping Challenges

    We know mapping between chemicals and proteins is neither pure nor simple. This is not even a complete list of what “compound X <> protein Y” relationships can encompass in databases.

    • Binds-to and modulates activity
    • Binds-to with known specificity (e.g. active or allosteric site in PDB)
    • Binds-to with molecular mechanism-of-action (mmoa): inhibitor, activator, agonist, antagonist
    • Binds-to with quantitative mmoa (Ki, IC50, Kd etc.)
    • Binds-to and is metabolically transformed by (e.g. P450)
    • Binds-to and is transported by (e.g. multidrug resistance-associated protein)
    • Binds-to but no activity modulation (e.g. albumin)
    • X transformation affects binding to Y (e.g. prodrug > drug > salt > metabolite)
    • X is non-canonical (e.g. enantiomers with different affinity for Y)
    • One X to-many proteins (panel screen)
    • Data source ambiguous in description of X (e.g. errors or tautomers)
    • Data source ambiguous in description of Y (e.g. protein ID not resolved)
    • X does not bind Y, thus mmoa is indirect (e.g. up or down regulation of Y)
    • Many compounds to-one Y (a throughput assay)
    • X has relevant linked data in addition to binding Y (e.g. plasma clearance)
    • Y is part of a functional complex (e.g. gamma secretase)
    • X-Y mechanistic coupling at different system levels (e.g. in vitro, in cellulo, in vivo and in clinico)
    • Y is species-specific
    • Y is non-canonical (e.g. splice variant, phosphorylated, activation clipped etc.)
  • 3. Problem II: The Numbers Don’t Add Up

    A collation of entity and relationship counts between databases and curated sets, ranked by compounds-per-protein.

    • The statistical differences in orders of magnitude are only partially interpretable
    • No consensus definitions or hierarchies of “target” or “interaction” as concepts
    • Ipso facto curation and/or parsing rules are very different
    • Evidence filtration functionality is different
    • Extraction substrates are mostly similar (e.g. journals, PubMed and other dbs)
    • Explicit but also cryptic circularity (e.g. large dbs subsuming smaller dbs)
  • 4. Problem III: Differential Chemistry Capture

    • We can compare the two premier academic drug mapping resources, DrugBank (DB) and the Therapeutic Target Database (TTD), which in principle have convergent capture concepts.
    • Both use expert curation teams to extract from the same primary data corpora.
    • The intra-PubChem comparison of chemical content (at the CID level) is shown below:
        DB        = 6720
        TTD       = 14631
        Union     = 19803
        Intersect = 1548
    • Results show very different capture (e.g. the union is over 10x larger than the intersect)
    • Some of this is explicable (e.g. DB’s historical emphasis on PDB ligands and TTD picking up BioAssayed compounds from ChEMBL) but the reasons for other differences are less clear.
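The four counts quoted on this slide can be sanity-checked with inclusion-exclusion: for two sets, the union equals the sum of the set sizes minus the intersect. The following is a minimal Python sketch using only the figures from the slide; the function names `union_size` and `jaccard` are illustrative and do not come from the poster.

```python
def union_size(a: int, b: int, intersect: int) -> int:
    """Inclusion-exclusion for two sets: |A ∪ B| = |A| + |B| - |A ∩ B|."""
    return a + b - intersect


def jaccard(a: int, b: int, intersect: int) -> float:
    """Overlap expressed as a fraction of the union (Jaccard index)."""
    return intersect / union_size(a, b, intersect)


# CID counts as quoted on the slide
DRUGBANK, TTD, INTERSECT, UNION = 6720, 14631, 1548, 19803

computed_union = union_size(DRUGBANK, TTD, INTERSECT)
union_to_intersect = UNION / INTERSECT
overlap = jaccard(DRUGBANK, TTD, INTERSECT)

print("quoted union matches inclusion-exclusion:", computed_union == UNION)
print("union / intersect ratio:", round(union_to_intersect, 1))
print("shared fraction of the union:", round(overlap, 3))
```

The quoted figures are self-consistent, and the ratio of roughly 12.8 is what lies behind the "over 10x" remark: under 8% of the combined CIDs are shared by both sources.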
  • 5. Problem IV: Differential Target Capture

    • The Venn compares DrugBank with TTD and a re-curated DrugBank sub-set (Ra-An, “Trends in the exploitation of novel drug targets”, 2011, PMID: 21804595)
    • While there are caveats related to set definitions, species filters and protein ID cross-mappings, the differential capture of the three manually curated sets is clear
    • The intersect, at only 170 human UniProt IDs, is ~½ the expected number of primary targets
    • Some of this is explicable (i.e. R-An picking up new targets) but the causes of other differences are unclear
    • Over 900 targets (this comparison excluded enzymes and transporters) are unique to DrugBank, so their curatorial rules are clearly different
  • 6. Problem V: Large chemistry <> protein Dbs

    • Leading expert teams and significant resources
    • Overlaps in concepts and utility
    • Differences in approaches and technical implementation
  • 7. Problem V (ctd): Too Large to Verify but too Divergent to Trust?

    • Comparing atorvastatin <> proteins in four large-scale Dbs
    • The 4-database intersect is only 8 from 143
    • 6 of these are probably indirect (no binding) and mechanistically unclear
    • Significant database-unique capture (e.g. CTD)
    • There are caveats with these exact numbers because they depend on protein database cross-mappings
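A comparison of this kind (the intersect across several databases plus what each one captures uniquely) can be sketched with plain set operations. The Python below is only an illustration: `capture_profile` is a hypothetical helper, and the database names and protein identifiers are toy placeholders, not the poster's data.

```python
from collections import Counter


def capture_profile(db_to_proteins):
    """Return (intersect, union, unique_per_db) for {db_name: set_of_protein_ids}."""
    sets = list(db_to_proteins.values())
    intersect = set.intersection(*sets)
    union = set.union(*sets)
    # Count how many databases report each protein
    seen_in = Counter(p for s in sets for p in s)
    unique_per_db = {
        db: {p for p in proteins if seen_in[p] == 1}
        for db, proteins in db_to_proteins.items()
    }
    return intersect, union, unique_per_db


# Toy placeholder identifiers (assumed for illustration only)
toy = {
    "db_a": {"P1", "P2", "P3"},
    "db_b": {"P1", "P2", "P4"},
    "db_c": {"P1", "P5"},
    "db_d": {"P1", "P2", "P6", "P7"},
}

intersect, union, unique_per_db = capture_profile(toy)
print(len(intersect), "shared by all, out of", len(union))
for db, proteins in sorted(unique_per_db.items()):
    print(db, "unique capture:", sorted(proteins))
```

In practice the identifiers would first have to be mapped to a common protein namespace, which is exactly the cross-mapping caveat noted on the slide.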
  • 8. Problem VI: Whose curation is “correct”?

    • Protein <> atorvastatin results, automated vs curated (ChEMBL and DrugBank)
    • Sum is the proteins from the four dbs in the previous slide
    • Consensus is only HMGCR and cytochrome P450 3A4
    • Unique capture of transporters and metabolic enzymes by DrugBank
    • Targets unique to DrugBank: human dipeptidyl peptidase 4, aryl hydrocarbon receptor
    • Targets unique to ChEMBL: cruzipain, pig dipeptidyl peptidase 4
  • 9. Problem VII: The PDB Hetero Entry Trap: False Drugs/Ligands and False Targets

    E.g. STITCH makes high-scoring links from DPPIV to galactose and fucose.
  • 10. Problem VII (ctd): STITCH Cross-references the Same Errors in DrugBank that Passed Them to PubChem

    DrugBank links to the wrong sugar isomer as CID 671379, and PubChem inherited the 40 targets in the “Biomolecular Interactions and Pathways” field. The DB entry is now deprecated.
  • 11. Problem VII (ctd): Mixed Mappings of the “Wrong” and “Right” (drug-relevant) Ligands

    Most of the mappings above are “right”; the one on the left is “wrong” (the sugar is in the crystal but is not a ligand or a drug in this context).
  • 12. Problem VIII: False-negatives

    • This clinically significant inferred interaction is missed by (all?) Dbs
    • A guess is that neither text mining nor curation rules (as implemented in the 7 dbs checked here) connected the individual drug names to the general case triple “statins-inhibited-PAR-1”
    • We can grapple with false-positives via filtration rules and heuristic tuning, but false-negatives are a more difficult and potentially more serious problem
  • 13. Ameliorating the Problems

    • Avoid “brainless parsing” and go for precision over recall
    • Make circularity explicit (e.g. dbs within dbs and curatorial recycling)
    • Refresh and update cross-links between dbs
    • Define biochemical and pharmacological relationships
    • Rigorous and deep QC (e.g. actually eyeball records)
    • Referential integrity checks (e.g. spot orphaned entities)
    • Display relationship distributions, inspect the extreme tails and attempt to understand them
    • Document curatorial practice (e.g. equivocality handling rules)
    • Facilitate annotation judgments and quality-based filtration (i.e. curatorial empowerment)
    • Consider canonical merging of chemical structures with multiplexed bioactivity mappings
    • Crowdsourcing (e.g. DrugBank comments > fixes and deprecations)
    • Encourage author mark-up at source (i.e. MIABE, PMID: 21878981)
    • “But wait, hold on – did anyone peer review the database?” (Williams and Ekins, 2012 ACS presentation)
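Two of the suggestions above, referential integrity checks and inspecting the tails of relationship distributions, lend themselves to a small illustration. The Python sketch below assumes a hypothetical layout in which links are stored as (compound ID, protein ID) pairs alongside separate entity tables; all identifiers are toy placeholders and the helper names are not from the poster.

```python
from collections import Counter


def orphaned_links(links, compounds, proteins):
    """Links whose compound or protein ID is missing from the entity tables."""
    return [(c, p) for c, p in links if c not in compounds or p not in proteins]


def compounds_per_protein(links):
    """Number of distinct compounds linked to each protein."""
    return Counter(p for _, p in set(links))


# Toy placeholder data (assumed for illustration only)
compounds = {"C1", "C2", "C3"}
proteins = {"P1", "P2"}
links = [
    ("C1", "P1"), ("C2", "P1"), ("C3", "P1"),
    ("C1", "P2"),
    ("C4", "P2"),  # compound C4 has no entity record
    ("C1", "P9"),  # protein P9 has no entity record
]

orphans = orphaned_links(links, compounds, proteins)
distribution = compounds_per_protein(links)
extreme_tail = distribution.most_common(1)  # the record to eyeball first

print("orphaned links:", orphans)
print("compounds per protein:", dict(distribution))
print("extreme tail:", extreme_tail)
```

Checks like these only flag candidates; deciding whether an outlier is an error or a genuine panel-screen result still needs the manual inspection the slide calls for.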
  • 14. Conclusions

    • Linked Open Data is the new mining rock and roll; but ...
    • Even just chemistry <> protein is subject to the caveats in this poster (and more besides)
    • At the very least, circumspection is needed if inferences from database linking are to be acted upon, validated and exploited
    • In the end, nothing saves us from database quality, so this has to be addressed by all of us

    Dr Christopher Southan
    ChrisDS Consulting: http://www.cdsouthan.info/Consult/CDS_cons.htm
    Email: cdsouthan@hotmail.com
    Twitter: @cdsouthan
    Blog: http://cdsouthan.blogspot.com/
    LinkedIn: http://www.linkedin.com/in/cdsouthan
    Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
    Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
    Presentations: http://www.slideshare.net/cdsouthan

Editor's Notes

  1. Over