Increasingly Accurate Representation of Biochemistry (v2)



Biochemical ontologies aim to accurately capture and represent biochemical entities and the relations between them. A fundamental starting point is biochemical identity, but our current approach to generating identifiers is haphazard, and integrating data is consequently error-prone. I will discuss plausible structure-based strategies for biochemical identity, whether at the level of the whole molecule or of some part of it (e.g. residues, collections of residues, atoms, collections of atoms, functional groups), such that identifiers may be generated automatically and independently of any curator or database. With structure-based identifiers in hand, we will be in a position to capture context-specific biochemical knowledge more accurately, such as how a set of residues in a binding site is involved in a chemical reaction, including the fact that a key nitrogen atom must first be deprotonated. Our representation of biochemical knowledge may thus improve to the point where manual and automatic methods of bio-curation are substantially more accurate.

Published in: Technology, Health & Medicine

    1. 1. Increasingly Accurate Representation of Biochemistry (v2). Michel Dumontier, Ph.D., Assistant Professor of Bioinformatics. Department of Biology, School of Computer Science, Institute of Biochemistry, Ottawa Institute of Systems Biology, Carleton University. SemWeb Group::Vancouver, 21/05/2009
    2. 2. Representational Issues • Biochemical Identity • Accurate Descriptions • Precise Identifiers • Modeling Situations
    3. 3. Which of these are different?
          # | A, B | Difference?
          1 | α-D-Glucose, alpha-D-Glucose | None, multiple names
          2 | α-D-Glucose, β-D-Glucose |
          3 | α-D-Glucose, β-D-Glucose, D-Glucose |
          4 | α-D-Glucose, α-D-Glucose-6-phosphate |
          5 | Hk, Hk(L529S) |
          6 | Hk(human), Hk(mouse) |
          7 | Hk(open), Hk(closed) |
          8 | Hk (L529), Hk (L540) |
          9 | Hk (L529), Hk (A530) |
    4. 4. α-D-Glucose vs β-D-Glucose • Rearrangement (isomer) • Related, but structurally different
    5. 5. α-D-Glucose and β-D-Glucose are more specific types* of D-Glucose * They resolve an ambiguity in stereochemistry
    6. 6. α-D-Glucose vs α-D-Glucose-6-Phosphate • Change (addition+removal) in atoms • Structurally different – one is not a type of the other!
    7. 7. Post-Translational Modifications • Structurally different • Unable to capture the difference with single letter AA sequence representation
    8. 8. Hexokinase (mutation)
                 500        510        520        530        540
          RRFHKTLRRL VPDSDVRFLL SESGSGKGAA MVTAVAYRLA EQHRQIEETL
                 500        510        520        530        540
          RRFHKTLRRL VPDSDVRFLL SESGSGKGAA MVTAVAYRSA EQHRQIEETL
          Leads to hemolytic anemia. Different sequence = different entity, related by some mutation process.
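The comparison on the hexokinase slide can be sketched in a few lines of Python. This is only an illustration, not the author's tooling: the function name `point_mutations` is hypothetical, and the starting position 491 is an assumption inferred from the ruler (read as labelling the last residue of each block of ten), which agrees with the L529, A530 and L540 labels used elsewhere in the deck.

```python
def point_mutations(reference: str, variant: str, first_position: int) -> list[str]:
    """Return substitutions such as 'L529S' between two equal-length sequences."""
    if len(reference) != len(variant):
        raise ValueError("this sketch only handles substitutions, not insertions or deletions")
    return [
        f"{ref}{first_position + offset}{var}"
        for offset, (ref, var) in enumerate(zip(reference, variant))
        if ref != var
    ]


# Sequence fragments copied from the slide, with the spaces between blocks removed.
WILD_TYPE = "RRFHKTLRRLVPDSDVRFLLSESGSGKGAAMVTAVAYRLAEQHRQIEETL"
MUTANT = "RRFHKTLRRLVPDSDVRFLLSESGSGKGAAMVTAVAYRSAEQHRQIEETL"

# Assumption: the ruler marks the END of each block, so the fragment starts at 491.
FIRST_POSITION = 491

print(point_mutations(WILD_TYPE, MUTANT, FIRST_POSITION))  # ['L529S']
```

The two fragments differ at exactly one position, which is enough to make them different entities under the structure-based view of identity argued for here.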
    10. 10. Hexokinase Open vs Closed Structurally identical, but conformationally different
    11. 11. Parts need to be identifiable and describable
          # | A, B | Difference?
          1 | α-D-Glucose, alpha-D-Glucose | None, multiple names
          2 | α-D-Glucose, β-D-Glucose | Structural (rearrangement)
          3 | α-D-Glucose, β-D-Glucose, D-Glucose | More specific type
          4 | α-D-Glucose, α-D-Glucose-6-phosphate | Structural (modification)
          5 | Hk, Hk(L529S) | Structural (mutation)
          6 | Hk(human), Hk(mouse) | Structural (sequence)
          7 | Hk(open), Hk(closed) | Conformational
          8 | Hk (L529), Hk (L540) | Positional
          9 | Hk (L529), Hk (A530) | Structural, positional
    12. 12. Biochemical identity is necessarily based on a description of structure
    13. 13. To determine whether two entities are identical, we have to compare their descriptions
    14. 14. Given A and B, how would you know that they are different?
    15. 15. Given two descriptions of a protein whose names differ, how do you know whether they describe the same entity or different ones? – Structure (sequence) – PTMs – Organism – Function, Process, Localization – Conformation
    16. 16. Biochemical identity is necessarily based on having accurate descriptions
    17. 17. Yet, current approaches add *annotations* rather than create new records with their respective descriptions
    18. 18. Current approach to assigning biochemical identifiers is erroneous, misleading or underspecified • Uniprot/Genbank: information gathered from multiple structural variants is attributed to the unmodified form – This conflates functionality arising from similar, but different, structural forms – Inaccurate specification of knowledge • Incomplete descriptions are just as bad – Reactome has an internal identifier for referring to different forms, but links to Uniprot entries – Obfuscates identity between databases
    19. 19. Biochemical relationship is necessarily based on a comparison of accurate descriptions
    20. 20. For each description, we must assign a unique name or identifier
    21. 21. If the description changes we need a new identifier!
    22. 22. 1. Precise Biochemical Identifiers • Identifiers and their exact descriptions are required for these kinds of entities: – atom : atomic interactions, catalytic mechanism – collection of atoms : binding/catalytic site, interaction – residue : post translational modification – collection of residues : motif/domain/interaction site – molecule : metabolism, signalling – complex : metabolism , signalling, scaffolds, containers • We need a reproducible methodology for naming and providing descriptions
    23. 23. Different molecules must have different identifiers • IUPAC International Chemical Identifier (InChI) • A data string that provides – the structure of a chemical compound – the convention for drawing the structure • It can be made by anyone, anywhere, at any time – a deterministic algorithm ensures that it is always written in the same way (syntactic identity) and fully specifies the molecular description (semantic identity) – It is a data identifier
    24. 24. (S)-Glutamic Acid
          InChI=
          {version} 1
          /{formula} C5H9NO4
          /c{connections} 6-3(5(9)10)1-2-4(7)8
          /h{H_atoms} 3H,1-2,6H2,(H,7,8)(H,9,10)
          /p{protons} +1
          /t{stereo:sp3} 3-
          /m{stereo:sp3:inverted} 0
          /s{stereo:type (1=abs, 2=rel, 3=rac)} 1
          /i{isotopic:atoms} 4+1
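To make the layered structure concrete, here is a minimal Python sketch that splits an InChI string into the layers named on the slide. The function `split_inchi` and the `LAYER_NAMES` table are hypothetical helpers written for this illustration; the input string is simply the slide's example with the curly-brace annotations removed. The sketch does not validate chemistry, and it assumes each layer prefix occurs at most once, which is not true of every InChI.

```python
# Layer prefixes as annotated on the slide (names are illustrative, not official).
LAYER_NAMES = {
    "c": "connections",
    "h": "hydrogens",
    "p": "protons",
    "t": "stereo_sp3",
    "m": "stereo_sp3_inverted",
    "s": "stereo_type",
    "i": "isotopic",
}


def split_inchi(inchi: str) -> dict[str, str]:
    """Split an InChI string on '/' and label each layer by its one-letter prefix."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    # Assumption: version and formula are always the first two fields.
    version, formula, *layers = inchi[len(prefix):].split("/")
    parsed = {"version": version, "formula": formula}
    for layer in layers:
        if not layer:
            continue  # tolerate an empty field such as a trailing '/'
        # Unknown prefixes are kept under their raw letter.
        parsed[LAYER_NAMES.get(layer[0], layer[0])] = layer[1:]
    return parsed


# The slide's (S)-glutamic acid example with the layer annotations stripped.
GLUTAMIC_ACID = "InChI=1/C5H9NO4/c6-3(5(9)10)1-2-4(7)8/h3H,1-2,6H2,(H,7,8)(H,9,10)/p+1/t3-/m0/s1/i4+1"

for name, value in split_inchi(GLUTAMIC_ACID).items():
    print(f"{name:20s} {value}")
```

Because every layer is part of the string, any change to the description (a proton, a stereocentre, an isotope) changes the identifier, which is the property the talk relies on.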
    25. 25. 2. Accurate Descriptions: α-D-Glucose
          InChI: InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1
          IUPAC: 6-(hydroxymethyl)oxane-2,3,4,5-tetrol OR (2R,3R,4S,5R,6R)-6-(hydroxymethyl)tetrahydro-2H-pyran-2,3,4,5-tetraol
          SMILES: O1[C@@H]([C@@H](O)([C@H](O)([C@@H](O)([C@@H]1(O)))))(CO)
          79025
          SDF, CML
    26. 26. OWL has explicit semantics. It can therefore be used to capture knowledge in a machine-understandable way
    27. 27. Chemical Ontology Chemical Knowledge for the Semantic Web. Mykola Konyk, Alexander De Leon, and Michel Dumontier. LNBI. 2008. 5109:169-176. Data Integration in the Life Sciences (DILS2008). Evry. France.
    28. 28. RDF/OWL descriptions of molecules
    29. 29. Describing chemical functional groups in OWL-DL for the classification of chemical compounds. [Figure: ethanol, with its methyl group and hydroxyl group labelled] • Functional groups describe chemical reactivity in terms of atoms and their connectivity, and exhibit characteristic chemical behavior when present in a compound. • Knowledge of functional groups is important in chemical synthesis, pharmaceutical design and lead optimization. N Villanueva-Rosales, M Dumontier. 2007. OWLED, Innsbruck, Austria.
    30. 30. Describing Functional Groups in DL [Figure labels: R group, O, R, H] HydroxylGroup: CarbonGroup that (hasSingleBondWith some (OxygenAtom that hasSingleBondWith some HydrogenAtom))
    31. 31. Fully Classified Ontology: 35 functional groups (FG)
    32. 32. And, we define certain compounds Alcohol: OrganicCompound that (hasPart some HydroxylGroup)
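The two definitions above (HydroxylGroup and Alcohol) can be imitated procedurally to show what a reasoner would conclude. The sketch below is not OWL and not the ontology from the cited paper: `Molecule`, `hydroxyl_groups` and `is_alcohol` are hypothetical names, only single bonds are modelled, the OrganicCompound condition is not checked, and only the hydroxyl hydrogen is written explicitly in the example molecules.

```python
from dataclasses import dataclass, field


@dataclass
class Molecule:
    """A tiny molecular graph: element symbols plus single bonds only."""
    elements: dict[str, str]  # atom label -> element symbol
    single_bonds: set[frozenset] = field(default_factory=set)

    def bond(self, a: str, b: str) -> None:
        if a == b:
            raise ValueError("an atom cannot be bonded to itself")
        self.single_bonds.add(frozenset((a, b)))

    def neighbours(self, atom: str, element: str) -> list[str]:
        """Atoms of the given element that are single-bonded to `atom`."""
        found = set()
        for pair in self.single_bonds:
            if atom in pair:
                (other,) = pair - {atom}
                if self.elements[other] == element:
                    found.add(other)
        return sorted(found)


def hydroxyl_groups(mol: Molecule) -> list[tuple[str, str, str]]:
    """Carbon single-bonded to an oxygen that is single-bonded to a hydrogen."""
    groups = []
    for carbon in sorted(a for a, e in mol.elements.items() if e == "C"):
        for oxygen in mol.neighbours(carbon, "O"):
            for hydrogen in mol.neighbours(oxygen, "H"):
                groups.append((carbon, oxygen, hydrogen))
    return groups


def is_alcohol(mol: Molecule) -> bool:
    """Mirrors 'hasPart some HydroxylGroup'; the OrganicCompound check is omitted."""
    return bool(hydroxyl_groups(mol))


# Ethanol, C-C-O-H (other hydrogens left implicit for brevity).
ethanol = Molecule({"C1": "C", "C2": "C", "O1": "O", "H1": "H"})
ethanol.bond("C1", "C2")
ethanol.bond("C2", "O1")
ethanol.bond("O1", "H1")

# Dimethyl ether, C-O-C: an oxygen, but no O-H.
ether = Molecule({"C1": "C", "O1": "O", "C2": "C"})
ether.bond("C1", "O1")
ether.bond("O1", "C2")

print(is_alcohol(ethanol), is_alcohol(ether))  # True False
```

The advantage of the OWL formulation over a procedure like this is that the definitions are declarative, so a reasoner classifies both the functional groups and the compounds without bespoke code per class.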
    33. 33. Organic Compound Ontology 28 OC
    34. 34. Question Answering • Query all attributes • Query PubChem, DrugBank and dbPedia
    35. 35. We also need Identifiers/Descriptions for Atoms • Atom identifiers need to be consistently assigned – OpenBabel plugin: component naming was first come, first served, combined with the mol identifier assigned in PubChem SDF files, e.g. id#aN, where a is the “atom” label and N is the position – Canonical numbering (InChI) is required • Atom descriptions need only specify the mereological relation, e.g. :id#aN :isProperPartOf :id
    36. 36. What about identifiers for collection of atoms? • Potentially useful in describing residues, PTMs, binding sites, etc. – Is the lack of connectivity sufficient? • Contiguous: – ranges (id#aN-aN) – enumerations (id#aN,aN,aN) • Non-contiguous: – Combination of ranges, enumerations?
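One way to explore the range and enumeration syntax is a small parser and a canonical formatter, sketched below in Python. The slide leaves open how ranges and enumerations combine, so the comma-separated mix accepted here (for example `id#a1-a3,a7`) is an assumption, and `parse_atom_collection` and `format_atom_collection` are hypothetical names.

```python
import re

_ATOM_LABEL = re.compile(r"a(\d+)")


def _position(label: str) -> int:
    match = _ATOM_LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not an atom label: {label!r}")
    return int(match.group(1))


def parse_atom_collection(identifier: str) -> tuple[str, list[int]]:
    """Expand 'id#a1-a3,a7' into ('id', [1, 2, 3, 7])."""
    base, _, fragment = identifier.partition("#")
    if not fragment:
        raise ValueError("identifier has no atom fragment")
    positions = set()
    for part in fragment.split(","):
        start, dash, end = part.partition("-")
        first = _position(start)
        last = _position(end) if dash else first
        if last < first:
            raise ValueError(f"descending range: {part!r}")
        positions.update(range(first, last + 1))
    return base, sorted(positions)


def format_atom_collection(base: str, positions) -> str:
    """Write a canonical identifier: sorted, with contiguous runs merged into ranges."""
    ordered = sorted(set(positions))
    if not ordered:
        raise ValueError("an atom collection needs at least one atom")
    runs = []
    for position in ordered:
        if runs and position == runs[-1][1] + 1:
            runs[-1][1] = position  # extend the current run
        else:
            runs.append([position, position])
    parts = [f"a{s}" if s == e else f"a{s}-a{e}" for s, e in runs]
    return f"{base}#{','.join(parts)}"


print(parse_atom_collection("id#a1-a3,a7"))       # ('id', [1, 2, 3, 7])
print(format_atom_collection("id", [7, 3, 1, 2]))  # id#a1-a3,a7
```

Formatting from a sorted set means the same collection of atoms always yields the same string, which matters if the string is to serve as an identifier.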
    37. 37. Can we reuse our positional nomenclature for residues? • Residues are generally referred to by their absolute position in the biopolymer sequence. – Global atom numbering: id#a50-a65 owl:sameAs id#r5 – Residue-specific atom numbering: id#r5_a1-r5_a15 owl:sameAs id#r5 • Collections of residues might follow the same rules as collections of atoms. – Useful for defining domains, motifs, etc.
    38. 38. While we’re at it, we could extend our expressive capability to create broader descriptions: • Specification: – Exactly mod1@posX – Only mod1@posX • Minimum: – At least mod1@posX • Combination: – mod1@posX AND mod2@posY, X != Y • Possibilities/Uncertainty: – (mod1 OR mod2)@posX • Exclusion: – not mod1@posX
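These broader descriptions behave like predicates over the set of modifications a molecular form carries. The Python sketch below is one possible reading, not the talk's formalism: a form is modelled as a set of (modification, position) pairs, "exactly" is interpreted as "these modifications and no others", the X != Y condition on combinations is not enforced, and all function names are hypothetical. The example forms are taken from the HIF1α slides later in the deck.

```python
from typing import Callable

Modification = tuple[str, int]        # (modification name, residue position)
Form = frozenset                      # the full set of modifications on one form
Constraint = Callable[[Form], bool]


def at_least(mod: str, pos: int) -> Constraint:
    """Minimum: the form carries mod@pos, possibly among others."""
    return lambda form: (mod, pos) in form


def exactly(*mods: Modification) -> Constraint:
    """Specification: the form carries these modifications and nothing else."""
    wanted = frozenset(mods)
    return lambda form: form == wanted


def all_of(*constraints: Constraint) -> Constraint:
    """Combination (AND)."""
    return lambda form: all(c(form) for c in constraints)


def any_of(*constraints: Constraint) -> Constraint:
    """Possibilities/uncertainty (OR)."""
    return lambda form: any(c(form) for c in constraints)


def exclude(constraint: Constraint) -> Constraint:
    """Exclusion (NOT)."""
    return lambda form: not constraint(form)


def matching(forms: dict[str, Form], constraint: Constraint) -> list[str]:
    return sorted(name for name, form in forms.items() if constraint(form))


# Hydroxylation states of HIF1α; the labels are made up for this example.
FORMS = {
    "unmodified": frozenset(),
    "P402hyd": frozenset({("hydroxyl", 402)}),
    "P564hyd": frozenset({("hydroxyl", 564)}),
    "P402hyd+P564hyd": frozenset({("hydroxyl", 402), ("hydroxyl", 564)}),
}

only_402 = all_of(at_least("hydroxyl", 402), exclude(at_least("hydroxyl", 564)))
print(matching(FORMS, only_402))  # ['P402hyd']
```

Composing constraints this way makes the difference between "at least" and "exactly" explicit, which is the distinction that annotation-style records tend to blur.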
    39. 39. So what if... we describe the structural features of the molecule (sequence + PTMs) with OWL, and generate an identifier from one of its serializations? That way we get a unique identifier with a description that is extensible and compatible with the semantic web.
    40. 40. Biological Identifier Service
    41. 41. Description to Identifier
    42. 42. What does this mean? • Identifier exactly matches the description – Great as a primary key for databases – Can be used for citation purposes (no more fuzzy diagrams!) • Exact description can be obtained for a given identifier • Description is extensible, and new identifiers can be autogenerated independently – Needs canonical serialization / central service – Histories can be made, and published
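The description-to-identifier idea can be sketched as "canonicalise, then hash". The Python below is an assumption-laden illustration rather than the Biological Identifier Service itself: sorted, compact JSON stands in for a canonical OWL serialization, SHA-256 is an arbitrary choice of digest, and `canonical_serialization` and `identifier_for` are hypothetical names.

```python
import hashlib
import json


def canonical_serialization(sequence: str, modifications) -> str:
    """Serialise a description so that equal descriptions give equal strings.

    `modifications` is an iterable of (position, name) pairs. Duplicates are
    dropped and the pairs are sorted, so input order does not matter.
    """
    description = {
        "sequence": sequence.upper(),
        "modifications": sorted(set(modifications)),
    }
    # sort_keys and fixed separators remove the remaining formatting freedom.
    return json.dumps(description, sort_keys=True, separators=(",", ":"))


def identifier_for(sequence: str, modifications=()) -> str:
    """Derive an identifier from the canonical serialization (SHA-256 is an assumption)."""
    serialized = canonical_serialization(sequence, modifications)
    return "sha256:" + hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# A made-up peptide; the point is only that the identifier tracks the description.
PEPTIDE = "MLAPYIPMDD"
plain = identifier_for(PEPTIDE)
hydroxylated = identifier_for(PEPTIDE, [(4, "hydroxyl")])
same_again = identifier_for(PEPTIDE.lower(), [(4, "hydroxyl"), (4, "hydroxyl")])

print(plain != hydroxylated)       # a changed description gets a new identifier
print(hydroxylated == same_again)  # the same description always gets the same one
```

A digest alone cannot be turned back into the description, so recovering the exact description for a given identifier still needs a lookup service or a published registry, which fits the slide's note about a central service.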
    43. 43. Case Study: HIF1α. Hypoxia-Inducible Factor 1, alpha chain (uniprot:Q16665). Master transcriptional regulator of the adaptive response to hypoxia • Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. EGLN3/PHD3 has also been shown to hydroxylate Pro-564. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation. [Callouts on the text] • Situation: b) Normoxic c) Hypoxic d) Other/Unspecified • Multiple structural forms • Part, named/unnamed regions • Selective interaction with parts • The part is the agent in the process
    44. 44. Structure-based biochemical identity: Differences between apples and oranges • HIF1α – au naturel • HIF1α – hydroxylated @P402 • HIF1α – hydroxylated @P564 • HIF1α – hydroxylated @P402 & @P564 • HIF1α – hydroxylated @P402 & (@P564) – ubiquitinated @K532 • HIF1α – L400A & L397A
    45. 45. Uniprot example revisited. Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation.
          Entities:
          :1 (HIF1α)
          :2 (HIF1α + P402hyd)
          :3 (HIF1α + P564hyd)
          :4 (HIF1α + P402hyd + P564hyd)
          :5 (EGLN1)
          :6 (VHL)
          Processes:
          :A rdfs:subClassOf :Hydroxylation
          :A hasParticipant (:0#r402 and :Substrate)
          :A hasParticipant (:1#r402 and :Product)
          :A hasParticipant (:5 and :Enzyme)
          :B rdfs:subClassOf :Interaction
          :B :hasParticipant (:2#r402 or :3#r564 or :4#r402,r564)
          :B :hasParticipant (:6)
          Please ignore the made up short-hand syntax!
    46. 46. Situational Modeling
    47. 47. Inferring Protein Participation • OWL role chain: hasParticipant o isPartOf -> hasParticipant – if a process has the part as a participant, then the whole is also a participant
          Asserted:
          :A rdfs:subClassOf :Hydroxylation
          :A hasParticipant (:0#r402 and :Substrate)
          :0#r402 :isPartOf :0
          :A hasParticipant (:1#r402 and :Product)
          :1#r402 :isPartOf :1
          Inferred:
          :A hasParticipant :0
          :A hasParticipant :1
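The effect of the role chain can be reproduced with a few lines of forward chaining over triples. This Python sketch is not an OWL reasoner and ignores the Substrate/Product role restrictions in the slide's shorthand; `infer_participants` is a hypothetical helper, and the triples are a simplified rendering of the slide's example.

```python
Triple = tuple[str, str, str]


def infer_participants(triples: set[Triple]) -> set[Triple]:
    """Apply hasParticipant o isPartOf -> hasParticipant until nothing new appears."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        part_of = {(s, o) for s, p, o in inferred if p == "isPartOf"}
        for process, predicate, participant in list(inferred):
            if predicate != "hasParticipant":
                continue
            for part, whole in part_of:
                if part == participant:
                    new_fact = (process, "hasParticipant", whole)
                    if new_fact not in inferred:
                        inferred.add(new_fact)
                        changed = True
    return inferred


# Simplified from the slide: role restrictions dropped, shorthand names kept.
ASSERTED = {
    (":A", "hasParticipant", ":0#r402"),
    (":0#r402", "isPartOf", ":0"),
    (":A", "hasParticipant", ":1#r402"),
    (":1#r402", "isPartOf", ":1"),
}

for fact in sorted(infer_participants(ASSERTED) - ASSERTED):
    print(fact)
```

Looping to a fixed point means a residue that is part of a domain that is part of a protein would also propagate participation up to the protein, which is what the chain axiom implies when it is applied repeatedly.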
    48. 48. We will add new knowledge about biochemicals and their parts into the linked data web through Bio2RDF!
    49. 49. Query descriptions to find matching biochemicals • Chemical – Structural – Conformation (e.g. open vs closed form) – Collections (alpha vs beta forms of D-glucose) • Biological – Species – mRNA/Gene from which it was transcribed/encoded – Reactions / post-translational modifications – Mutations
    50. 50. Summary • Biochemical identity is tightly linked to accurate descriptions. • Automatic and consistent identifier generation will allow anybody to specify findings according to the biopolymers for which they were observed – No curation required!!!! – Will be discovered automatically – Links biochemical knowledge at various levels of granularity • Situational modeling enables the careful separation of what is known under a particular circumstance.
    51. 51. Special thanks to PhD Student Leonid Chepelev for insightful discussions 