20090511 Manchester Biochemistry - Presentation Transcript
Increasingly Accurate Representation of Biochemistry
Michel Dumontier, Ph.D.
Assistant Professor of Bioinformatics
Department of Biology, School of Computer Science, Institute of Biochemistry, Ottawa Institute of Systems Biology
Carleton University
IMG Seminar: Manchester: Michel Dumontier, 11/05/2009
Biochemistry
Biochemistry aims to understand the structure and function of all living things at the molecular level
Master transcriptional regulator of the adaptive response to hypoxia
Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. EGLN3/PHD3 has also been shown to hydroxylate Pro-564. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation.
Situation
Normoxic
Hypoxic
Other/Unspecified
Multiple structural forms
Part, named/unnamed regions
The part is the agent in the process
Selective interaction with parts
Structure-based biochemical identity: Differences between apples and oranges
HIF1α: au naturel
HIF1α: hydroxylated @P402
HIF1α: hydroxylated @P564
HIF1α: hydroxylated @P402 & @P564
HIF1α: hydroxylated @P402 & (@P564), ubiquitinated @K532
HIF1α: L400A & L397A
Current approach to biochemical identity is erroneous, misleading or underspecified
Information gathered from multiple structural variants is attributed to the unmodified form.
Uniprot / Genbank
This conflates functionality arising from similar, but different structural forms
Inaccurate specification of knowledge
Incomplete descriptions are just as bad
Reactome has an internal identifier for referring to different forms, but links to Uniprot entries
Obfuscates identity between databases
Bio2RDF: 2.3B triples of SPARQL-accessible linked biological data! Chemical Parts!
1. Precise Biochemical Identifiers
Identifiers and their exact descriptions are required for these kinds of entities:
atom : atomic interactions, catalytic mechanism
collection of atoms : binding/catalytic site, interaction
residue : post translational modification
collection of residues : motif/domain/interaction site
Different molecules must have different identifiers
IUPAC International Chemical Identifier (InChI)
A data string that provides
the structure of a chemical compound
the convention for drawing the structure
It can be made by anyone, anywhere, at any time: a deterministic algorithm ensures that it is always written in the same way (syntactic identity) and fully specifies the molecular description (semantic identity).
2. Structure Accurate and Extensible Descriptions Required
α-D-Glucose
Formats: CML, SDF
Identifier: 79025
IUPAC: 6-(hydroxymethyl)oxane-2,3,4,5-tetrol OR (2R,3R,4S,5R,6R)-6-(hydroxymethyl)tetrahydro-2H-pyran-2,3,4,5-tetraol
SMILES: O1[C@@H]([C@@H](O)([C@H](O)([C@@H](O)([C@@H]1(O)))))(CO)
InChI: InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1
OWL Has Explicit Semantics
Can therefore be used to capture knowledge in a machine understandable way
http://code.google.com/p/semanticwebopenbabel/
Chemical Ontology Chemical Knowledge for the Semantic Web. Mykola Konyk , Alexander De Leon , and Michel Dumontier . LNBI . 2008. 5109:169-176. Data Integration in the Life Sciences (DILS2008) . Evry. France.
Describing chemical functional groups in OWL-DL for the classification of chemical compounds
N Villanueva-Rosales, M Dumontier. 2007. OWLED, Innsbruck, Austria.
Knowledge of functional groups is important in chemical synthesis, pharmaceutical design and lead optimization. Functional groups describe chemical reactivity in terms of atoms and their connectivity, and exhibit characteristic chemical behavior when present in a compound.
[Figure: ethanol, with its hydroxyl group and methyl group labelled]
Describing Functional Groups in DL
HydroxylGroup:
CarbonGroup that (hasSingleBondWith some (OxygenAtom that hasSingleBondWith some HydrogenAtom))
[Figure: hydroxyl group drawn as R-O-H, where R is the R group]
Fully Classified Ontology 35 FG
And, we define certain compounds
Alcohol:
OrganicCompound that (hasPart some HydroxylGroup)
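The two definitions above can be mimicked outside a reasoner. The following is a minimal Python sketch, not the OWL-DL ontology from the talk: it assumes a molecule is given as a dictionary of atoms plus a set of single bonds, and it simplifies "OrganicCompound" to "contains a carbon". The names `has_hydroxyl_group` and `is_alcohol` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Atoms = Dict[int, str]          # atom index -> element symbol
Bonds = Set[Tuple[int, int]]    # undirected single bonds between atom indices


def neighbours(atom: int, bonds: Bonds) -> List[int]:
    """Return every atom joined to `atom` by a single bond."""
    return [b if a == atom else a for a, b in bonds if atom in (a, b)]


def has_hydroxyl_group(atoms: Atoms, bonds: Bonds) -> bool:
    """True if some carbon is single-bonded to an oxygen that is
    single-bonded to a hydrogen (the HydroxylGroup pattern above)."""
    for carbon, element in atoms.items():
        if element != "C":
            continue
        for oxygen in neighbours(carbon, bonds):
            if atoms[oxygen] != "O":
                continue
            if any(atoms[h] == "H" for h in neighbours(oxygen, bonds)):
                return True
    return False


def is_alcohol(atoms: Atoms, bonds: Bonds) -> bool:
    """Alcohol: a compound that has a hydroxyl group as a part.
    Assumption: the carbon required by the pattern stands in for 'organic'."""
    return has_hydroxyl_group(atoms, bonds)


# Ethanol and dimethyl ether, with hydrogens on carbon omitted for brevity.
ethanol = ({1: "C", 2: "C", 3: "O", 4: "H"}, {(1, 2), (2, 3), (3, 4)})
dimethyl_ether = ({1: "C", 2: "O", 3: "C"}, {(1, 2), (2, 3)})
print(is_alcohol(*ethanol), is_alcohol(*dimethyl_ether))
```

Unlike the ontology, this procedural check does not classify anything automatically; it only illustrates what the class expression asks of a structure.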
Organic Compound Ontology 28 OC
Question Answering
Query all attributes
Query PubChem, DrugBank and dbPedia*
* Requires import of relevant URIs
But...
Molecules represented as individuals because OWL-DL only allows tree-like class expressions
No variable binding (e.g. ?x) ... no cyclic molecule/functional group descriptions at the class level
Boris Motik et al. proposed Description Graphs
Robert Stevens, Duncan Hull, Uli Sattler (and I) are exploring their use for chemical representation and sub-structure reasoning....
turns out that…
Using InChI’s precise numbering system, we can specify molecular graphs at the class level
Simple 3-carbon ring system
CarbonAtom that hasPosition value 1
and hasSingleBondTo exactly 1 (CarbonAtom that hasPosition value 2
and hasSingleBondTo exactly 1(CarbonAtom that hasPosition value 3
and hasSingleBondTo exactly 1 ( CarbonAtom that hasPosition value 1 )))
(ignoring hydrogens)
InChI=1/C3H6/c1-2-3-1/h1-3H2
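Writing such nested expressions by hand does not scale, so they would normally be generated. The sketch below is an illustration only, assuming a simple unbranched carbon ring with positions numbered 1..n; the function name `ring_class_expression` is hypothetical and the output imitates the class expression shown above rather than any validated OWL syntax.

```python
def ring_class_expression(size: int) -> str:
    """Build the nested class expression for a carbon ring of `size` atoms.

    The ring is closed by ending the chain with a reference back to
    position 1, exactly as in the 3-carbon example.
    """
    if size < 3:
        raise ValueError("a ring needs at least three atoms")
    # Innermost term: the atom that closes the ring.
    expression = "CarbonAtom that hasPosition value 1"
    # Wrap from the last position outward to the first.
    for position in range(size, 0, -1):
        expression = (
            f"CarbonAtom that hasPosition value {position} "
            f"and hasSingleBondTo exactly 1 ({expression})"
        )
    return expression


print(ring_class_expression(3))
```

The position values are what make the cycle expressible: the final term names position 1 again instead of binding a variable.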
Possible... but a 1000 residue protein would contain ~15,000 atoms on average....
Size of the string will be enormous
We can use InChIKeys (a fixed-length, SHA-256-based hash of the InChI), but then we need to provide a you-submit-InChI, we-store-both and they-look-it-up service.
OpenBabel seemed to struggle with anything over 100 residues
Needs some performance tweaking / commercial solutions
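The submit/store/look-up service mentioned above can be sketched in a few lines. This is a minimal in-memory illustration, not the official InChIKey algorithm: it assumes a plain SHA-256 hex digest from Python's `hashlib` as the short key, and the class name `IdentifierRegistry` is hypothetical.

```python
import hashlib
from typing import Dict, Optional


class IdentifierRegistry:
    """Stores each submitted InChI under a fixed-length hash key."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def submit(self, inchi: str) -> str:
        """You submit an InChI; we store both the key and the string."""
        key = hashlib.sha256(inchi.encode("utf-8")).hexdigest()
        self._store[key] = inchi
        return key

    def lookup(self, key: str) -> Optional[str]:
        """They look it up: return the full InChI, or None if unknown."""
        return self._store.get(key)


registry = IdentifierRegistry()
key = registry.submit("InChI=1/C3H6/c1-2-3-1/h1-3H2")
print(len(key), registry.lookup(key))
```

The key length stays constant however long the InChI grows, which is the point for a protein-sized string; the cost is that the description can only be recovered through the registry.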
Modularize InChI construction for (linear) polymers?
Make InChI strings for each residue, and concatenate; rename the atoms according to the residue position
InChI for Proteins?
Identifiers for Atoms
Atom identifiers can be consistently retrieved from the InChI model.
Canonical numbering means we can reliably refer to a specific region rather than a (possibly degenerate) sub-graph match.
In our plugin, component naming was based on the assigned molecule identifier
e.g. pubchemid#aN,
where a is the “atom” label and N is the position
Use hash of InChI as base?
e.g. id#aN
What about identifiers for collection of atoms?
Potentially useful in describing residues, PTMs, binding sites, etc.
Is the lack of connectivity sufficient?
Contiguous:
ranges (id#aN-aN)
enumerations (id#aN,aN,aN)
Non-contiguous:
Combination of ranges, enumerations?
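To make the proposed notation concrete, here is a small parser sketch. It assumes the fragment after `#` is a comma-separated mix of single atoms (`aN`) and ranges (`aN-aM`), which is one reading of the range and enumeration forms above; the function names are hypothetical.

```python
from typing import List, Tuple


def _atom_position(label: str) -> int:
    """Convert an atom label such as 'a12' to its position, 12."""
    if not label.startswith("a") or not label[1:].isdigit():
        raise ValueError(f"not an atom label: {label!r}")
    return int(label[1:])


def parse_atom_collection(identifier: str) -> Tuple[str, List[int]]:
    """Split 'id#a3-a6,a9' into the base identifier and sorted positions."""
    base, _, fragment = identifier.partition("#")
    if not base or not fragment:
        raise ValueError(f"expected 'id#fragment', got {identifier!r}")
    positions = set()
    for part in fragment.split(","):
        start, dash, end = part.partition("-")
        first = _atom_position(start)
        last = _atom_position(end) if dash else first
        if last < first:
            raise ValueError(f"descending range: {part!r}")
        positions.update(range(first, last + 1))
    return base, sorted(positions)


print(parse_atom_collection("mol#a1-a3,a7"))
```

Combining ranges and enumerations in one fragment covers the non-contiguous case without a second syntax.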
Can we reuse our positional nomenclature for residues?
Residues are generally referred to by their absolute position in the biopolymer sequence.
e.g. Pro @ X on Protein Y
id#a50-a65 owl:sameAs id#r5
id#r5_a1-r5_a15 owl:sameAs id#r5
Collection of residues might follow the same rules as a collection of atoms.
Useful for defining domains, motifs, etc
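The equivalence between a residue identifier and an atom range can be derived mechanically once atoms are numbered residue by residue. The sketch below assumes exactly that contiguous numbering (which the modular scheme would have to guarantee), uses heavy-atom counts for three in-chain residues as illustrative data, and ignores chain termini. The names `HEAVY_ATOMS`, `residue_atom_ranges` and `same_as_statements` are hypothetical.

```python
from typing import Dict, List, Tuple

# Heavy (non-hydrogen) atom counts for residues inside a chain.
# Illustrative subset only: glycine, alanine, proline.
HEAVY_ATOMS: Dict[str, int] = {"G": 4, "A": 5, "P": 7}


def residue_atom_ranges(sequence: str) -> Dict[int, Tuple[int, int]]:
    """Map each residue position to its (first, last) atom position."""
    ranges: Dict[int, Tuple[int, int]] = {}
    next_atom = 1
    for position, residue in enumerate(sequence, start=1):
        count = HEAVY_ATOMS[residue]
        ranges[position] = (next_atom, next_atom + count - 1)
        next_atom += count
    return ranges


def same_as_statements(base: str, sequence: str) -> List[str]:
    """Emit 'id#aM-aN owl:sameAs id#rK' lines in the slide's shorthand."""
    return [
        f"{base}#a{first}-a{last} owl:sameAs {base}#r{position}"
        for position, (first, last) in residue_atom_ranges(sequence).items()
    ]


for line in same_as_statements("id", "GAP"):
    print(line)
```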
We already have a simplified representation for biopolymers...
Canonical sequence is represented by a string of single letter characters
DNA: ACGT
RNA: ACGU
Proteins: 20 amino acids (not B,J,O,U,X,Z)
Modifications can be referred to with ChEBI/PSI-MOD ontology (e.g. Prolyl hydroxylated residue @ 402)
Each (modified) residue must have its InChI description so as to capture explicit structural deviations (de-protonation, etc.)
An Alternative Scheme
PSI-MOD contains modified residues with links to structural descriptions
But what if we have a modification that isn’t contained in the ontology!
No problem... define your own term, with the corresponding structural description (InChI, SMILES), and add to an ontology document...
If you’re using OWL, you can add the import statement and publish it.
And, of course, you should submit it to the appropriate ontology development teams. (and later make it equivalent to)
While we’re at it, we could extend our expressive capability to create broader descriptions:
Specification
Exactly mod1@posX
Only mod1@posX
Minimum:
At least mod1@posX
Combination:
mod1@posX AND mod2@posY, X != Y
Possibilities/Uncertainty:
(mod1 OR mod2) @posX
Exclusion:
not mod1@posX
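These description types behave like predicates over the set of modifications a molecule carries. The following sketch is one possible reading, not the OWL encoding: a molecular state is a frozen set of (modification, position) pairs, "only" is interpreted as "this modification and no other", and "exactly" is left out because the slide does not state how it differs. All function names are hypothetical.

```python
from typing import Callable, FrozenSet, Tuple

Mod = Tuple[str, int]            # (modification name, residue position)
State = FrozenSet[Mod]           # every modification present on a molecule
Spec = Callable[[State], bool]   # a description that a state may satisfy


def at_least(mod: str, pos: int) -> Spec:
    """Minimum: the modification is present; others are allowed."""
    return lambda state: (mod, pos) in state


def only(mod: str, pos: int) -> Spec:
    """Only: the modification is present and nothing else is."""
    return lambda state: state == frozenset({(mod, pos)})


def excludes(mod: str, pos: int) -> Spec:
    """Exclusion: the modification is absent at that position."""
    return lambda state: (mod, pos) not in state


def all_of(*specs: Spec) -> Spec:
    """Combination: every given description holds."""
    return lambda state: all(spec(state) for spec in specs)


def any_of(*specs: Spec) -> Spec:
    """Possibilities/uncertainty: at least one description holds."""
    return lambda state: any(spec(state) for spec in specs)


doubly_hydroxylated = frozenset({("hydroxyl", 402), ("hydroxyl", 564)})
vhl_binding_form = any_of(at_least("hydroxyl", 402), at_least("hydroxyl", 564))
print(vhl_binding_form(doubly_hydroxylated))
```

Note the difference this makes for data integration: `at_least` matches the doubly hydroxylated form, whereas `only` does not.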
So what if...
we describe the structural features of the molecule with OWL (sequence + PTMs), and generate an identifier from one of its serializations (RDF/XML?)
that way we get a unique identifier with a description that is extensible and compatible with the semantic web.
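An identifier derived from a serialization is only stable if the serialization is canonical, since the same description can usually be written out in more than one order. The sketch below illustrates the idea under loud assumptions: sorted, compact JSON stands in for a canonical RDF/XML document, SHA-256 stands in for whatever hash the service would adopt, and `description_identifier` is a hypothetical name.

```python
import hashlib
import json
from typing import Iterable, Tuple


def description_identifier(sequence: str,
                           modifications: Iterable[Tuple[str, int]]) -> str:
    """Hash a canonical form of (sequence + PTMs) into an identifier.

    Modifications are de-duplicated and sorted so that the order in
    which a curator lists them cannot change the identifier.
    """
    canonical = json.dumps(
        {
            "sequence": sequence,
            "modifications": sorted(set(modifications)),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# A toy sequence, not the real HIF1α sequence.
a = description_identifier("MPLP", [("hydroxyl", 2), ("hydroxyl", 4)])
b = description_identifier("MPLP", [("hydroxyl", 4), ("hydroxyl", 2)])
print(a == b)
```

Two providers describing the same form independently would then arrive at the same identifier, while any structural difference yields a different one.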
Biological Identifier Service
Extensible to create other class descriptions
Chemical
Conformation (e.g. Open vs closed form)
Biological
Species
mRNA/Gene from which it was transcribed/encoded
What does this mean to providers and consumers?
Automatic identifier and description generation
Data providers can get the identifier that exactly matches their entity.
Consumers can get the exact description of a reported identifier.
Registry can keep track of provider to entity
Discover where additional information can be found
Semantic Science will create a Bio2RDF endpoint to link semantically equivalent biochemical identifiers
Situational Modeling
Uniprot example revisited
Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation.
:A rdfs:subClassOf :Hydroxylation
:A hasParticipant (:0#r402 and :Substrate)
:A hasParticipant (:1#r402 and :Product)
:A hasParticipant (:5 and :Enzyme)
:B rdfs:subClassOf :Interaction
:B :hasParticipant (:2#r402 or :3#r564 or :4#r402,r564)
:B :hasParticipant (:6)
:1 (HIF1α)
:2 (HIF1α + P402hyd)
:3 (HIF1α + P564hyd)
:4 (HIF1α + P402hyd + P564hyd)
:5 (EGLN1)
:6 (VHL)
Please ignore the made-up shorthand syntax!
Inferring Protein Participation
OWL Role Chain
hasParticipant o isPartOf -> hasParticipant
If a process has a part as a participant, then the whole is also a participant
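The effect of this role chain can be imitated with a small closure computation. This is a sketch of the inference, not an OWL reasoner: it assumes each entity has at most one direct whole, and the function name `infer_participants` and the identifiers in the example are hypothetical.

```python
from typing import Dict, Set


def infer_participants(has_participant: Dict[str, Set[str]],
                       is_part_of: Dict[str, str]) -> Dict[str, Set[str]]:
    """Apply hasParticipant o isPartOf -> hasParticipant until nothing new
    is added, so every whole containing a participating part is included."""
    inferred: Dict[str, Set[str]] = {}
    for process, participants in has_participant.items():
        closure = set(participants)
        frontier = list(participants)
        while frontier:
            entity = frontier.pop()
            whole = is_part_of.get(entity)
            if whole is not None and whole not in closure:
                closure.add(whole)
                frontier.append(whole)  # the whole may itself be a part
        inferred[process] = closure
    return inferred


asserted = {"hydroxylation": {"HIF1a#r402"}}
parts = {"HIF1a#r402": "HIF1a#ODD", "HIF1a#ODD": "HIF1a"}
print(sorted(infer_participants(asserted, parts)["hydroxylation"]))
```

Only the residue is asserted as a participant; the domain and the protein follow from the part-of links.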
Biochemical ontologies aim to capture and represent biochemical entities and the relations that exist between them in an accurate manner. A fundamental starting point is biochemical identity, but our current approach for generating identifiers is haphazard and consequently integrating data is error-prone. I will discuss plausible structure-based strategies for biochemical identity, whether it be at the molecular level or some part thereof (e.g. residues, collections of residues, atoms, collections of atoms, functional groups), such that identifiers may be generated in an automatic and curator/database-independent manner. With structure-based identifiers in hand, we will be in a position to more accurately capture context-specific biochemical knowledge, such as how a set of residues in a binding site is involved in a chemical reaction, including the fact that a key nitrogen atom must first be de-protonated. Thus, our current representation of biochemical knowledge may improve such that manual and automatic methods of bio-curation are substantially more accurate.