Accurate biochemical knowledge starting with precise structure-based criteria for molecular identity


Biochemical ontologies aim to capture and represent biochemical entities and the relations between them accurately and precisely. A fundamental starting point is the use of identifiers that precisely and uniquely identify a biochemical entity, whether it is a substance, a quality or a biological process. Yet our current approach to generating such identifiers is often haphazard and incomplete. This prevents us from accurately integrating knowledge and leads to underspecification of what we know. This talk aims to initiate a discussion on plausible structure-based strategies for biochemical identity, with the ultimate goal of generating identifiers automatically and independently of any curator or database, whether at the level of the whole molecule or of some part of it (e.g. residues, collections of residues, atoms, collections of atoms, functional groups). With structure-based identifiers in hand, we will be in a position to accurately capture specific biochemical knowledge, such as how a set of residues in a binding site is involved in a chemical reaction, including the fact that a key nitrogen atom must first be deprotonated. This will enhance our current representation of biochemical knowledge and make it fundamentally more useful.

Published in: Technology

  1. 1. Accurate biochemical knowledge starting with precise structure-based criteria for molecular identity. Michel Dumontier, Ph.D., Assistant Professor of Bioinformatics; Department of Biology, School of Computer Science, Institute of Biochemistry, Ottawa Institute of Systems Biology; Carleton University. 01/04/2009 NCBO Seminar Series::Michel Dumontier
  2. 2. Problem Statement (I) <ul><li>Although biochemical events can be described with reference to specific chemical substances, we may want to describe them at finer levels of (mereological) granularity. </li></ul><ul><ul><li>residue : post-translational modification </li></ul></ul><ul><ul><li>collection of residues : motif/domain/interaction site </li></ul></ul><ul><ul><li>atom : atomic interactions, catalytic mechanism </li></ul></ul><ul><ul><li>collection of atoms : binding/catalytic site, interaction </li></ul></ul><ul><li>This requires identifiers for parts, regions (contiguous and non-contiguous), and aggregates/complexes. </li></ul><ul><li>However, we do not (AFAIK) have a precise ( reproducible ) methodology to automatically generate these! </li></ul>
  3. 3. Bio2RDF: 2.3B triples of SPARQL-accessible linked biological data! Chemical Parts!
  4. 4. Case Study: HIF1α <ul><li>Hypoxia-Inducible Factor 1, alpha chain (uniprot:Q16665) </li></ul><ul><li>Master transcriptional regulator of the adaptive response to hypoxia </li></ul><ul><li>Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. EGLN3/PHD3 has also been shown to hydroxylate Pro-564. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation. </li></ul><ul><li>Context-Dependent Behavior </li></ul><ul><li>Normoxic Conditions </li></ul><ul><li>Hypoxic Conditions </li></ul>Callouts: multiple hydroxylations; part of a domain; the part is the agent in the process; selective interaction with parts.
  5. 5. Are these the same? <ul><li>HIF1α – au naturel </li></ul><ul><li>HIF1α </li></ul><ul><ul><li>hydroxylated @P402 </li></ul></ul><ul><li>HIF1α </li></ul><ul><ul><li>hydroxylated @P564 </li></ul></ul><ul><li>HIF1α </li></ul><ul><ul><li>hydroxylated @P402 & @P564 </li></ul></ul><ul><li>HIF1α </li></ul><ul><ul><li>hydroxylated @P402 & (@P564) </li></ul></ul><ul><ul><li>ubiquitinated @Lys-532 </li></ul></ul><ul><li>HIF1α </li></ul><ul><ul><li>L400A & L397A </li></ul></ul>
  6. 6. NO!!!! <ul><li>These are structurally different </li></ul><ul><li>Each exhibits distinct functionality! </li></ul><ul><li>Yet most databases ( UniProt / GenBank ) don’t have separate identifiers for them </li></ul><ul><li>Reactome has an internal identifier for referring to different forms, but it links to UniProt entries and doesn’t provide an explicit description of the structure that each form corresponds to! </li></ul>
  7. 7. So <ul><li>We have a clear need to be able to refer to distinct biochemical entities, based at least on their structure. </li></ul><ul><li>We also need to refer to arbitrary structural parts. </li></ul><ul><li>Should we generate all the combinations a priori??? </li></ul><ul><li>-> NO!! </li></ul><ul><li>Should we be able to automatically generate the identifier from the structural attributes? </li></ul><ul><li>-> YES!!! </li></ul><ul><li>Should we semantically annotate (manually or otherwise) those forms known to be involved in specific processes??? </li></ul><ul><li>-> YES!!! </li></ul><ul><li>What identifiers are unique for a given structure? </li></ul>
  8. 8. InChI <ul><li>IUPAC International Chemical Identifier (InChI) </li></ul><ul><li>A data string that provides </li></ul><ul><ul><li>the structure of a chemical compound </li></ul></ul><ul><ul><li>the convention for drawing the structure </li></ul></ul><ul><li>Different compounds must have different identifiers. Several attributes can be used to distinguish one compound from another. </li></ul><ul><ul><li>Chemical graph (connection table) </li></ul></ul><ul><ul><li>Formula </li></ul></ul><ul><ul><li>Atom type (only some atoms explicit) </li></ul></ul><ul><ul><li>Bond type </li></ul></ul><ul><ul><li>Stereochemistry </li></ul></ul><ul><ul><li>Mobile/fixed H atoms (tautomers) </li></ul></ul><ul><ul><li>Isotopic composition </li></ul></ul><ul><ul><li>Atomic charge </li></ul></ul>
  9. 9. (S)-Glutamic Acid InChI= {version}1 /{formula}C5H9NO4 /c{connections}6-3(5(9)10)1-2-4(7)8 /h{H_atoms}3H,1-2,6H2,(H,7,8)(H,9,10) /p{protons}+1 /t{stereo:sp3}3- /m{stereo:sp3:inverted}0 /s{stereo:type (1=abs, 2=rel, 3=rac)}1 /i{isotopic:atoms}4+1
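The layered layout on the slide above can be illustrated with a small parser. This is a minimal sketch, not the official InChI software: the function name `split_inchi_layers` is hypothetical, the string `GLU` is assembled by concatenating the annotated layers from the slide, and the parser assumes the second segment is the formula and keeps only the first layer seen for each one-letter prefix.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("expected a string starting with 'InChI='")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        # Simplification: assume the second segment is the formula layer.
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            # Simplification: keep only the first layer seen for each prefix.
            layers.setdefault(part[0], part[1:])
    return layers


# String assembled from the annotated layers on the slide (an assumption).
GLU = ("InChI=1/C5H9NO4/c6-3(5(9)10)1-2-4(7)8/h3H,1-2,6H2,(H,7,8)(H,9,10)"
       "/p+1/t3-/m0/s1/i4+1")

if __name__ == "__main__":
    for name, value in split_inchi_layers(GLU).items():
        print(name, value)
```

Each key in the returned dictionary corresponds to one of the annotations on the slide (`c` for connections, `h` for hydrogens, `p` for protons, and so on).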
  10. 10. More non-core info captured in “AuxInfo” string... Annotated: AuxInfo= {version}1 /{normalization_type}1 /N:{original_atom_numbers}5,6,2,7,1,4,8,9,10,11 /E:{atom_equivalence}(7,8)(9,10) /it:{abs_stereo_inverted:sp3}im /I:{isotopic:original_atom_numbers} /E:{isotopic:atom_equivalence}m /rA:{reversibility:atoms}11nCCHN+CCC.i13OOOO /rB:{reversibility:bonds}s1;N2;P2;s2;s5;s6;s7;d7;d1;s1; /rC:{reversibility:xyz}6.1671,-19.3365,0;7.0125,-18.4864,0;6.4113,-17.4485,0;7.6089,-17.4485,0;7.8578,-19.3318,0;8.891,-18.7306,0;9.7363,-19.576,0;9.7316,-20.7735,0;10.8916,-19.266,0;5.0071,-19.0265,0;6.1624,-20.534,0; Raw string: AuxInfo=1/1/N:5,6,2,7,1,4,8,9,10,11/E:(7,8)(9,10)/it:im/I:/E:m/rA:11nCCHN+CCC.i13OOOO/rB:s1;N2;P2;s2;s5;s6;s7;d7;d1;s1;/rC:6.1671,-19.3365,0;7.0125,-18.4864,0;6.4113,-17.4485,0;7.6089,-17.4485,0;7.8578,-19.3318,0;8.891,-18.7306,0;9.7363,-19.576,0;9.7316,-20.7735,0;10.8916,-19.266,0;5.0071,-19.0265,0;6.1624,-20.534,0;
  11. 11. So... InChI is really just a cryptic data identifier <ul><li>Clever software is required to gradually build the chemical identifiers in a series of well-defined steps – normalization, canonicalization, then serialization </li></ul><ul><li>Humans can’t (easily) generate them nor can they easily understand them. But that’s OK. </li></ul><ul><li>It’s not (user) extensible. But that’s OK. </li></ul>
  12. 12. InChI for Proteins??? <ul><li>Possible... but a 1000-residue protein would contain ~15,000 atoms on average.... </li></ul><ul><ul><li>OpenBabel seemed to struggle with anything over 100 residues </li></ul></ul><ul><ul><ul><li>Maybe needs some performance tweaking? </li></ul></ul></ul><ul><ul><li>Size of the string will be enormous </li></ul></ul><ul><ul><ul><li>We can use InChIKeys (SHA-256-based hash), but then we need to provide a you-submit-InChI, we-store-both and they-look-it-up service. </li></ul></ul></ul><ul><ul><li>Modularize InChI construction for (linear) polymers? </li></ul></ul><ul><ul><ul><li>Make InChI strings for each residue, and concatenate – rename the atoms according to the residue position </li></ul></ul></ul><ul><ul><li>We still need to translate the InChI string ... </li></ul></ul>
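The "you submit, we store both, they look it up" service mentioned on the slide above could look roughly like the following. This is a sketch under stated assumptions: the class `InchiRegistry` and its methods are hypothetical, and the key is a plain truncated SHA-256 hex digest used as a stand-in, not a real InChIKey (which has its own encoding defined by the InChI software).

```python
import hashlib


class InchiRegistry:
    """Toy in-memory service: submit an InChI, store it with its key, look it up."""

    def __init__(self):
        self._inchi_by_key = {}

    def submit(self, inchi: str) -> str:
        # Stand-in key: truncated SHA-256 hex digest, NOT a real InChIKey.
        key = hashlib.sha256(inchi.encode("utf-8")).hexdigest()[:27]
        stored = self._inchi_by_key.setdefault(key, inchi)
        if stored != inchi:
            # A short key can collide; a real service must handle this case.
            raise ValueError("hash collision for key " + key)
        return key

    def lookup(self, key: str) -> str:
        return self._inchi_by_key[key]


if __name__ == "__main__":
    registry = InchiRegistry()
    glucose = ("InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2"
               "/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1")
    key = registry.submit(glucose)
    print(key, "->", registry.lookup(key))
```

The point of the design is that the short key can serve as the identifier even when the full string is too long to pass around, at the cost of depending on a lookup service.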
  13. 13. OpenBabel (CML, SDF). α-D-Glucose (79025). SMILES: O1[C@@H]([C@@H](O)([C@H](O)([C@@H](O)([C@@H]1(O)))))(CO). InChI: InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1. IUPAC: 6-(hydroxymethyl)oxane-2,3,4,5-tetrol OR (2R,3R,4S,5R,6R)-6-(hydroxymethyl)tetrahydro-2H-pyran-2,3,4,5-tetraol
  14. 14. OWL Has Explicit Semantics <ul><li>Can therefore be used to capture knowledge in a machine-understandable way </li></ul>
  15. 15. Chemical Ontology. Chemical Knowledge for the Semantic Web. Mykola Konyk, Alexander De Leon, and Michel Dumontier. LNBI. 2008. 5109:169-176. Data Integration in the Life Sciences (DILS2008). Evry, France.
  16. 16. (no slide text)
  17. 17. Describing chemical functional groups in OWL-DL for the classification of chemical compounds. Knowledge of functional groups is important in chemical synthesis, pharmaceutical design and lead optimization. Functional groups describe chemical reactivity in terms of atoms and their connectivity, and exhibit characteristic chemical behavior when present in a compound. N Villanueva-Rosales, M Dumontier. 2007. OWLED, Innsbruck, Austria. Figure labels: ethanol; hydroxyl group; methyl group.
  18. 18. Describing Functional Groups in DL <ul><li>HydroxylGroup: </li></ul><ul><li>CarbonGroup that (hasSingleBondWith some (OxygenAtom that hasSingleBondWith some HydrogenAtom)) </li></ul>Diagram labels: O, H, R (R group).
  19. 19. Fully Classified Ontology: 35 functional groups (FG)
  20. 20. And, we define certain compounds <ul><li>Alcohol: </li></ul><ul><li>OrganicCompound that (hasPart some HydroxylGroup) </li></ul>
  21. 21. Organic Compound Ontology: 28 organic compounds (OC)
  22. 22. Question Answering <ul><li>Query all annotations </li></ul><ul><li>Query PubChem, DrugBank and DBpedia* </li></ul>* Requires import of relevant URIs
  23. 23. But... <ul><li>Molecules are represented as individuals because OWL-DL only allows tree-like class descriptions </li></ul><ul><ul><li>No variable binding (e.g. ?x) ... no cyclic molecule/functional group descriptions at the class level </li></ul></ul><ul><li>Boris Motik et al. have a proposal for Description Graphs; </li></ul><ul><ul><li>Robert Stevens & Duncan Hull are trying it out for chemical representation.... </li></ul></ul>
  24. 24. Identifiers for Atoms <ul><li>Atom identifiers can be consistently retrieved from the OpenBabel model. </li></ul><ul><ul><li>Canonical numbering means we can reliably refer to a specific region rather than a (possibly degenerate) sub-graph match. </li></ul></ul><ul><ul><li>In our plugin, URI component naming was based on the assigned molecule identifier </li></ul></ul><ul><ul><ul><li>e.g. pubchemid#aN, where N is the number </li></ul></ul></ul><ul><ul><li>Use InChIKey as base? </li></ul></ul><ul><ul><ul><li>e.g. InChIKey#aN </li></ul></ul></ul>
  25. 25. What about identifiers for collections of atoms? <ul><li>Potentially useful in describing residues, PTMs, binding sites, etc. </li></ul><ul><ul><li>Is the lack of connectivity sufficient? </li></ul></ul><ul><li>Contiguous: </li></ul><ul><ul><li>ranges (aN-aN) </li></ul></ul><ul><ul><li>enumerations (aN,aN,aN) </li></ul></ul><ul><li>Non-contiguous: </li></ul><ul><ul><li>Combination of ranges, enumerations? </li></ul></ul>
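The range and enumeration notation above can be generated mechanically from a set of canonical atom numbers. The following is a sketch only: the function names and the base string `EXAMPLEKEY` are hypothetical, and the `#aN` fragment syntax is taken from the slides rather than from any established standard.

```python
def atom_set_fragment(atoms) -> str:
    """Encode atom numbers as ranges (aN-aM) and enumerations (aN,aM)."""
    numbers = sorted(set(atoms))
    if not numbers:
        raise ValueError("at least one atom number is required")
    runs = []
    start = prev = numbers[0]
    for n in numbers[1:]:
        if n == prev + 1:
            prev = n  # still inside a run of consecutive atom numbers
            continue
        runs.append((start, prev))
        start = prev = n
    runs.append((start, prev))
    return ",".join(f"a{s}" if s == e else f"a{s}-a{e}" for s, e in runs)


def atom_set_identifier(base: str, atoms) -> str:
    """Attach the atom-set fragment to a molecule identifier (hypothetical scheme)."""
    return f"{base}#{atom_set_fragment(atoms)}"


if __name__ == "__main__":
    # EXAMPLEKEY is a placeholder for a molecule identifier such as an InChIKey.
    print(atom_set_identifier("EXAMPLEKEY", [10, 9, 1, 2, 3, 7]))
```

Sorting and merging consecutive numbers gives one fragment per atom set regardless of the order in which the atoms are listed, which is what makes the identifier reproducible.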
  26. 26. Can we reuse our positional nomenclature for residues? <ul><li>Residues are generally referred to by their absolute position in the biopolymer sequence. </li></ul><ul><ul><li>e.g. Pro @ X on Protein Y </li></ul></ul><ul><ul><li>InChIKey#a50-a65 owl:sameAs InChIKey#r5 </li></ul></ul><ul><ul><li>InChIKey#r5_a1-r5_a15 owl:sameAs InChIKey#r5 </li></ul></ul><ul><li>A collection of residues might follow the same rules as a collection of atoms. </li></ul><ul><ul><li>Useful for defining domains, motifs, etc. </li></ul></ul>
  27. 27. An Alternative Scheme <ul><li>We already have a simplified representation for biopolymers... </li></ul><ul><ul><li>Canonical sequence is represented by a string of single-letter characters </li></ul></ul><ul><ul><ul><li>DNA: ACGT </li></ul></ul></ul><ul><ul><ul><li>RNA: ACGU </li></ul></ul></ul><ul><ul><ul><li>Proteins: 20 amino acids (not B,J,O,U,X,Z) </li></ul></ul></ul><ul><ul><li>Modifications can be referred to with the ChEBI/PSI-MOD ontologies (e.g. prolyl hydroxylated residue @ 402) </li></ul></ul><ul><ul><ul><li>Each (modified) residue must have its InChI description so as to capture explicit structural deviations (deprotonation, etc.) </li></ul></ul></ul>
  28. 28. PSI-MOD contains modified residues with links to structural descriptions
  29. 29. But what if we have a modification that isn’t contained in the ontology? <ul><li>No problem... define your own term, with the corresponding structural description (InChI, SMILES), and add it to an ontology document... </li></ul><ul><ul><li>If you’re using OWL, you can add the import statement and publish it. </li></ul></ul><ul><li>And, of course, you should submit it to the appropriate ontology development teams (and later declare your term equivalent to theirs). </li></ul>
  30. 30. While we’re at it, we could extend our expressive capability to match that of OWL: <ul><li>Specification: </li></ul><ul><ul><li>Exactly mod1@posX </li></ul></ul><ul><ul><li>Only mod1@posX </li></ul></ul><ul><li>Minimum: </li></ul><ul><ul><li>At least [email_address] </li></ul></ul><ul><li>Combination: </li></ul><ul><ul><li>mod1@posX AND mod2@posY, X != Y </li></ul></ul><ul><li>Possibilities/Uncertainty: </li></ul><ul><ul><li>(mod1 OR mod2) @posX </li></ul></ul><ul><li>Exclusion: </li></ul><ul><ul><li>not mod1@posX </li></ul></ul>
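The combination, possibility and exclusion patterns above can be read as predicates over a protein form. The sketch below is an illustration only, not OWL: a form is modelled as a dictionary from residue position to a set of modification names, and the helper names `has`, `all_of`, `any_of` and `negate`, as well as the modification label `hydroxyproline`, are assumptions introduced for this example.

```python
def has(mod, pos):
    """mod@pos: the form carries modification `mod` at position `pos`."""
    return lambda form: mod in form.get(pos, set())


def all_of(*constraints):
    """Combination: every constraint must hold (AND)."""
    return lambda form: all(c(form) for c in constraints)


def any_of(*constraints):
    """Possibility: at least one constraint must hold (OR)."""
    return lambda form: any(c(form) for c in constraints)


def negate(constraint):
    """Exclusion: the constraint must not hold (NOT)."""
    return lambda form: not constraint(form)


# Hypothetical forms of HIF1α, keyed by residue position.
unmodified = {}
p402_only = {402: {"hydroxyproline"}}
doubly_hydroxylated = {402: {"hydroxyproline"}, 564: {"hydroxyproline"}}

# "hydroxylated at P402 OR hydroxylated at P564"
either_site = any_of(has("hydroxyproline", 402), has("hydroxyproline", 564))

if __name__ == "__main__":
    for form in (unmodified, p402_only, doubly_hydroxylated):
        print(form, either_site(form))
```

A description logic reasoner works under the open-world assumption, so `negate` here (which treats a missing entry as "not modified") is a closed-world simplification.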
  31. 31. So what if... <ul><li>we describe the structural features of the molecule with OWL (sequence + PTMs), and generate an identifier from one of its serializations (RDF/XML?) </li></ul><ul><li>That way we have the explicit description as the identifier, in a form that is compatible with the Semantic Web. </li></ul>
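One way to picture "an identifier generated from a serialization" is to hash a canonical serialization of the structural description. The sketch below swaps in canonical JSON for the OWL/RDF serialization proposed on the slide, purely for illustration; the function name `structure_identifier`, the toy sequence and the modification labels are all hypothetical.

```python
import hashlib
import json


def structure_identifier(sequence: str, modifications) -> str:
    """Hash a canonical description (sequence + modifications) into an identifier.

    `modifications` is an iterable of (position, modification_name) pairs.
    """
    description = {
        "sequence": sequence,
        # Sorting makes the result independent of the order of the input pairs.
        "modifications": sorted([pos, mod] for pos, mod in modifications),
    }
    # Canonical JSON stands in for an OWL serialization (an assumption).
    canonical = json.dumps(description, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    toy_sequence = "MPAP"  # toy sequence, not a real protein
    one = structure_identifier(toy_sequence, [(2, "hydroxyproline")])
    two = structure_identifier(
        toy_sequence, [(4, "hydroxyproline"), (2, "hydroxyproline")]
    )
    print(one)
    print(two)
```

Because the identifier is computed from the description alone, two parties describing the same modified form would arrive at the same identifier without consulting a database, provided they agree on the canonical serialization.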
  32. 32. (no slide text)
  33. 33. UniProt example revisited <ul><li>Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation. </li></ul>Statements: :A rdfs:subClassOf :Hydroxylation; :A :hasParticipant (:0#r402 and :Substrate); :A :hasParticipant (:1#r402 and :Product); :A :hasParticipant (:5 and :Enzyme); :B rdfs:subClassOf :Interaction; :B :hasParticipant (:2#r402 or :3#r564 or :4#r402,r564); :B :hasParticipant (:6). Legend: :1 (HIF1α); :2 (HIF1α + P402hyd); :3 (HIF1α + P564hyd); :4 (HIF1α + P402hyd + P564hyd); :5 (EGLN1); :6 (VHL). Please ignore the made-up shorthand syntax!
  34. 34. Inferring Protein Participation <ul><li>OWL Role Chain </li></ul><ul><li>hasParticipant o isPartOf -> hasParticipant </li></ul><ul><li>If a process has the part as a participant, then the whole is also a participant </li></ul>Asserted: :0#r402 :isPartOf :0; :1#r402 :isPartOf :1; :A rdfs:subClassOf :Hydroxylation; :A :hasParticipant (:0#r402 and :Substrate); :A :hasParticipant (:1#r402 and :Product). Inferred: :A :hasParticipant :0; :A :hasParticipant :1.
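The effect of the role chain `hasParticipant o isPartOf -> hasParticipant` can be mimicked with a small fixed-point computation over pairs. This is a sketch of the inference pattern only, not an OWL reasoner; the function name `infer_participants` is hypothetical and the identifiers are the shorthand ones from the slide.

```python
def infer_participants(has_participant, is_part_of):
    """Apply hasParticipant o isPartOf -> hasParticipant until nothing new appears.

    Both arguments are sets of (subject, object) pairs.
    """
    inferred = set(has_participant)
    changed = True
    while changed:
        changed = False
        for process, entity in list(inferred):
            for part, whole in is_part_of:
                if part == entity and (process, whole) not in inferred:
                    inferred.add((process, whole))
                    changed = True
    return inferred


# Shorthand identifiers from the slide: residue 402 of forms :0 and :1.
HAS_PARTICIPANT = {("A", "0#r402"), ("A", "1#r402")}
IS_PART_OF = {("0#r402", "0"), ("1#r402", "1")}

if __name__ == "__main__":
    for pair in sorted(infer_participants(HAS_PARTICIPANT, IS_PART_OF)):
        print(pair)
```

Iterating to a fixed point means that longer part-of paths (an atom in a residue in a protein) are also covered, as they would be if `isPartOf` were declared transitive.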
  35. 35. Contextual, but non-structural considerations in identifier generation? <ul><li>Chemical? </li></ul><ul><ul><li>pH? </li></ul></ul><ul><ul><li>Temperature? </li></ul></ul><ul><ul><li>Environment ( in vitro, in vivo, in silico )? </li></ul></ul><ul><li>Biological? </li></ul><ul><ul><li>Species? </li></ul></ul><ul><ul><li>mRNA/gene from which it was translated/encoded? </li></ul></ul><ul><li>Indirect Relationships? </li></ul><ul><ul><li>Point & Multiple Mutations? </li></ul></ul><ul><ul><li>Alternative Splice Variants? </li></ul></ul><ul><ul><li>Sequence Similarity? </li></ul></ul>
  36. 36. Summary <ul><li>We need a precise method to generate identifiers for biopolymers and arbitrary sets of their parts. </li></ul><ul><li>Consistent identifier generation will allow anybody to report findings against the biopolymers in which they were observed, whether or not those biopolymers exist in a database, and will allow us to link biochemical knowledge at finer levels of granularity. </li></ul><ul><li>(At least) two identifier schemes were put forward to initiate discussion, with the goal of setting a standard naming convention. </li></ul>
  37. 37. [email_address] Special thanks to PhD student Leonid Chepelev for insightful discussions