Increasingly Accurate Representation of Biochemistry (v2) - Presentation Transcript
Increasingly Accurate Representation
of Biochemistry (v2)
Michel Dumontier, Ph.D.
Assistant Professor of Bioinformatics
Department of Biology, School of Computer Science
Institute of Biochemistry, Ottawa Institute of Systems Biology
Carleton University
SemWeb Group::Vancouver 21/05/2009
Representational Issues
Biochemical Identity
Accurate Descriptions
Precise Identifiers
Modeling Situations
Which of these are different?
# A, B Difference?
1 α-D-Glucose, alpha-D-Glucose None, multiple names
2 α-D-Glucose, β-D-Glucose
3 α-D-Glucose, β-D-Glucose, D-Glucose
4 α-D-Glucose, α-D-Glucose-6-phosphate
5 Hk, Hk(L529S)
6 Hk(human), Hk(mouse)
7 Hk(open), Hk(closed)
8 Hk (L529), Hk (L540)
9 Hk (L529), Hk (A530)
α-D-Glucose vs β-D-Glucose
• Rearrangement (isomer)
• Related, but structurally different
α-D-Glucose and β-D-Glucose are
more specific types* of D-Glucose
* They resolve an ambiguity in stereochemistry
α-D-Glucose vs
α-D-Glucose-6-Phosphate
• Change (addition+removal) in atoms
• Structurally different
– one is not a type of the other!
Post-Translational Modifications
• Structurally different
• Unable to capture the difference with
single letter AA sequence representation
Hexokinase (mutation)
500 510 520 530 540
RRFHKTLRRL VPDSDVRFLL SESGSGKGAA MVTAVAYRLA EQHRQIEETL
500 510 520 530 540
RRFHKTLRRL VPDSDVRFLL SESGSGKGAA MVTAVAYRSA EQHRQIEETL
Leads to hemolytic anemia
different sequence = different entity
related by some mutation process
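The relation between the two sequence fragments above can be derived mechanically. The following is a minimal sketch, assuming the ruler marks the last residue of each ten-residue block (so the fragment starts at position 491) and that only substitutions occur; the function name `point_mutations` is hypothetical.

```python
def point_mutations(wild_type, variant, start=1):
    """Return substitutions such as 'L529S' between two equal-length sequences."""
    wt = wild_type.replace(" ", "")
    var = variant.replace(" ", "")
    if len(wt) != len(var):
        raise ValueError("this sketch handles substitutions only, not insertions or deletions")
    return [
        f"{a}{start + i}{b}"
        for i, (a, b) in enumerate(zip(wt, var))
        if a != b
    ]


WILD_TYPE = "RRFHKTLRRL VPDSDVRFLL SESGSGKGAA MVTAVAYRLA EQHRQIEETL"
VARIANT = "RRFHKTLRRL VPDSDVRFLL SESGSGKGAA MVTAVAYRSA EQHRQIEETL"

# Assumption: the first residue shown is residue 491 (ruler labels sit on block ends).
mutations = point_mutations(WILD_TYPE, VARIANT, start=491)
print(mutations)  # ['L529S']
```

Under that numbering assumption the single difference lands on L529S, which matches the Hk(L529S) entry in the comparison table.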
Hexokinase
Open vs Closed
Structurally identical, but conformationally different
Parts need to be identifiable and
describable
# A, B Difference?
1 α-D-Glucose, alpha-D-Glucose None, multiple names
2 α-D-Glucose, β-D-Glucose Structural (rearrangement)
3 α-D-Glucose, β-D-Glucose, D-Glucose More specific type
4 α-D-Glucose, α-D-Glucose-6-phosphate Structural (modification)
5 Hk, Hk(L529S) Structural (mutation)
6 Hk(human), Hk(mouse) Structural (sequence)
7 Hk(open), Hk(closed) Conformational
8 Hk (L529), Hk (L540) Positional
9 Hk (L529), Hk (A530) Structural, positional
Biochemical identity
is necessarily based on
a description of structure
To determine identity, we have to
compare their descriptions
Given A and B
How would you know that they
are different?
Given two descriptions about a protein,
but where their names differ, how do you
know they are the same or different?
– Structure (sequence)
– PTMs
– Organism
– Function, Process, Localization
– Conformation
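One way to picture this comparison is to ignore names entirely and compare the descriptions field by field. The sketch below is an illustration only: the class `ProteinDescription`, its fields, and the short sequence fragments are hypothetical, and function, process and localization are left out for brevity.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProteinDescription:
    """Hypothetical structured description; the name is deliberately not part of it."""
    sequence: str
    ptms: frozenset = frozenset()
    organism: str = ""
    conformation: str = ""


def differences(a, b):
    """Return the names of the description fields in which a and b differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


hk_open = ProteinDescription("MVTAVAYRLA", organism="human", conformation="open")
hk_closed = ProteinDescription("MVTAVAYRLA", organism="human", conformation="closed")
hk_mutant = ProteinDescription("MVTAVAYRSA", organism="human", conformation="open")

print(differences(hk_open, hk_closed))
print(differences(hk_open, hk_mutant))
```

Two records are the same entity only when `differences` returns an empty list, whatever they happen to be called.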
Biochemical identity
is necessarily based on
having accurate descriptions
Yet, current approaches add *annotations*
rather than create new records with their
respective descriptions
Current approach to assigning biochemical identifiers is
erroneous, misleading or underspecified
• Information gathered from multiple structural variants is
attributed to the unmodified form.
Uniprot/Genbank
• This conflates functionality arising from similar, but
different structural forms
Inaccurate specification of knowledge
• Incomplete descriptions are just as bad
– Reactome has an internal identifier for referring to different
forms, but links to Uniprot entries
– Obfuscates identity between databases
Biochemical relationship
is necessarily based on
a comparison of accurate
descriptions
For each description, we must
assign a unique name or
identifier
If the description changes
we need a new identifier!
1. Precise Biochemical Identifiers
• Identifiers and their exact descriptions are
required for these kinds of entities:
– atom : atomic interactions, catalytic mechanism
– collection of atoms : binding/catalytic site,
interaction
– residue : post translational modification
– collection of residues : motif/domain/interaction site
– molecule : metabolism, signalling
– complex : metabolism, signalling, scaffolds,
containers
• We need a reproducible methodology for naming
and providing descriptions
Different molecules must have
different identifiers
• IUPAC International Chemical Identifier (InChI)
• A data string that provides
– the structure of a chemical compound
– the convention for drawing the structure
• It can be made by anyone, anywhere, at any time – a deterministic algorithm
ensures that it is always written in the same way (syntactic identity) and fully
specifies the molecular description (semantic identity).
– It is a data identifier
2. Accurate Descriptions
InChI InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1
IUPAC 6-(hydroxymethyl)oxane-2,3,4,5-tetrol OR (2R,3R,4S,5R,6R)-6-(hydroxymethyl)tetrahydro-2H-pyran-2,3,4,5-tetraol
SMILES O1[C@@H]([C@@H](O)([C@H](O)([C@@H](O)([C@@H]1(O)))))(CO)
PubChem CID 79025
SDF CML
α-D-Glucose
OWL Has Explicit Semantics
Can therefore be used to capture knowledge
in a machine understandable way
Chemical Ontology
Chemical Knowledge for the Semantic Web.
Mykola Konyk, Alexander De Leon, and Michel Dumontier. LNBI. 2008. 5109:169-176.
Data Integration in the Life Sciences (DILS2008). Evry. France.
RDF/OWL descriptions of molecules
http://code.google.com/p/semanticwebopenbabel/
Describing chemical functional groups in OWL-DL
for the classification of chemical compounds
methyl group
hydroxyl group
Ethanol
Knowledge of functional groups is important in chemical synthesis,
pharmaceutical design and lead optimization.
Functional groups describe chemical reactivity in terms of
atoms and their connectivity, and exhibit characteristic
chemical behavior when present in a compound.
N Villanueva-Rosales, M Dumontier. 2007. OWLED, Innsbruck, Austria.
Describing Functional Groups in DL
[Figure: hydroxyl group drawn as R-O-H, where R is the R group]
HydroxylGroup:
CarbonGroup that (hasSingleBondWith some (OxygenAtom that
hasSingleBondWith some HydrogenAtom))
Fully Classified Ontology
35 functional groups (FG)
And, we define certain compounds
Alcohol:
OrganicCompound that (hasPart some HydroxylGroup)
Organic Compound Ontology
28 organic compounds (OC)
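The HydroxylGroup and Alcohol definitions can be mimicked procedurally on a small molecular graph. This is a rough sketch rather than a DL reasoner: it assumes a molecule is given as a hypothetical dictionary of atom labels plus a list of single bonds, and it treats "CarbonGroup" simply as a carbon atom.

```python
def has_hydroxyl_group(atoms, single_bonds):
    """Check for a carbon single-bonded to an oxygen that is single-bonded to a hydrogen.

    atoms: {atom_id: element symbol}; single_bonds: iterable of (atom_id, atom_id).
    """
    neighbours = {atom_id: set() for atom_id in atoms}
    for a, b in single_bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    for atom_id, element in atoms.items():
        if element != "C":
            continue
        for other in neighbours[atom_id]:
            if atoms[other] == "O" and any(atoms[h] == "H" for h in neighbours[other]):
                return True
    return False


# Ethanol: CH3-CH2-OH
ethanol_atoms = {"c1": "C", "c2": "C", "o1": "O",
                 "h1": "H", "h2": "H", "h3": "H", "h4": "H", "h5": "H", "h6": "H"}
ethanol_bonds = [("c1", "c2"), ("c2", "o1"), ("o1", "h6"),
                 ("c1", "h1"), ("c1", "h2"), ("c1", "h3"),
                 ("c2", "h4"), ("c2", "h5")]

# Dimethyl ether: CH3-O-CH3 (hydrogens on carbon omitted; none sit on the oxygen)
ether_atoms = {"c1": "C", "o1": "O", "c2": "C"}
ether_bonds = [("c1", "o1"), ("o1", "c2")]

is_alcohol = has_hydroxyl_group(ethanol_atoms, ethanol_bonds)
print(is_alcohol, has_hydroxyl_group(ether_atoms, ether_bonds))
```

An OWL-DL reasoner reaches the same classification from the class definitions alone; the hand-written check here would also flag other O-H bearing compounds, which the full ontology distinguishes with additional definitions.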
Question Answering
• Query all attributes
• Query PubChem, DrugBank and dbPedia
We also need
Identifiers/Descriptions for Atoms
• Atom identifiers need to be consistently
assigned
– OpenBabel plugin component naming was first come,
first served along with the assigned mol identifier from
PubChem SDF files.
e.g. id#aN,
where a is the “atom” label and N is the position
– Canonical numbering (InChI) is required
• Atom descriptions need only specify the
mereological relation
:id#aN :isProperPartOf :id
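The naming scheme and the mereological relation can be generated together. A minimal sketch, assuming 1-based positions and that the atom order passed in is already canonical; the function name and the example identifier `:mol1` are hypothetical.

```python
def atom_part_triples(molecule_id, atom_count):
    """Build (subject, predicate, object) triples of the form id#aN isProperPartOf id."""
    return [
        (f"{molecule_id}#a{n}", ":isProperPartOf", molecule_id)
        for n in range(1, atom_count + 1)
    ]


triples = atom_part_triples(":mol1", 3)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

The sketch only numbers atoms; whether two tools assign the same N to the same atom still depends on a canonical numbering such as the one InChI provides.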
What about identifiers for
collection of atoms?
• Potentially useful in describing residues, PTMs,
binding sites, etc.
– Is the lack of connectivity sufficient?
• Contiguous:
– ranges (id#aN-aN)
– enumerations (id#aN,aN,aN)
• Non-contiguous:
– Combination of ranges, enumerations?
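A combined range-and-enumeration syntax is straightforward to expand into explicit atom positions. The sketch below assumes a fragment grammar of comma-separated tokens, each either `aN` or `aN-aM`; this grammar and the function name `parse_atom_collection` are illustrative assumptions, not a settled convention.

```python
import re

# One token is either a single atom "a7" or an inclusive range "a1-a3".
_TOKEN = re.compile(r"a(\d+)(?:-a(\d+))?$")


def parse_atom_collection(identifier):
    """Expand 'id#a1-a3,a7' into ('id', [1, 2, 3, 7])."""
    base, _, fragment = identifier.partition("#")
    if not fragment:
        raise ValueError("identifier has no '#' fragment")
    positions = set()
    for token in fragment.split(","):
        match = _TOKEN.match(token.strip())
        if match is None:
            raise ValueError(f"unrecognised token: {token!r}")
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        if last < first:
            raise ValueError(f"descending range: {token!r}")
        positions.update(range(first, last + 1))
    return base, sorted(positions)


print(parse_atom_collection("id#a50-a52"))
print(parse_atom_collection("id#a1-a3,a7"))
```

Because the result is a sorted set of positions, differently written but equivalent collections (for example `a1,a2,a3` and `a1-a3`) expand to the same list, which is one route to a canonical form.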
Can we reuse our positional
nomenclature for residues?
• Residues are generally referred to by their
absolute position in the biopolymer sequence.
Global atom numbering:
id#a50-a65 owl:sameAs id#r5
Residue specific atom numbering
id#r5_a1-r5_a15 owl:sameAs id#r5
• Collection of residues might follow the same
rules as a collection of atoms.
– Useful for defining domains, motifs, etc
While we’re at it, we could extend our expressive
capability to create broader descriptions:
• Specification
– Exactly mod1@posX
– Only mod1@posX
• Minimum :
– At least mod1@posX
• Combination:
– mod1@posX AND mod2@posY, X != Y
• Possibilities/Uncertainty:
– (mod1 OR mod2) @posX
• Exclusion:
– not mod1 @ posX
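These description patterns can be read as predicates over the set of modifications a protein form carries. The sketch below is one possible interpretation, not the presentation's definition: a form is a set of `(modification, position)` pairs, "at least" is existential, "only" is universal, and "exactly" is the conjunction of the two. All function names and the `acetyl` modification are hypothetical.

```python
def at_least(mod, pos):
    """The form carries mod at pos (it may carry others too)."""
    return lambda mods: (mod, pos) in mods


def only(mod, pos):
    """Every modification the form carries is mod at pos (true for an unmodified form)."""
    return lambda mods: all(m == (mod, pos) for m in mods)


def both(*predicates):
    """Combination: all predicates hold."""
    return lambda mods: all(p(mods) for p in predicates)


def either(*predicates):
    """Possibility/uncertainty: at least one predicate holds."""
    return lambda mods: any(p(mods) for p in predicates)


def excludes(mod, pos):
    """Exclusion: the form does not carry mod at pos."""
    return lambda mods: (mod, pos) not in mods


def exactly(mod, pos):
    """Assumed reading of 'exactly': carries mod at pos and nothing else."""
    return both(at_least(mod, pos), only(mod, pos))


doubly_hydroxylated = {("hydroxy", 402), ("hydroxy", 564)}
singly_hydroxylated = {("hydroxy", 402)}

print(at_least("hydroxy", 402)(doubly_hydroxylated))
print(exactly("hydroxy", 402)(doubly_hydroxylated))
print(exactly("hydroxy", 402)(singly_hydroxylated))
```

In OWL the same distinctions would be expressed with existential, universal, union and complement class expressions; the lambdas here only illustrate the intended matching behaviour.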
So what if...
we describe the structural
features of the molecule with
OWL (sequence + PTMs), and
generate an identifier from one
of its serializations?
that way we get a unique
identifier with a description that
is extensible and compatible
with the semantic web.
Biological Identifier Service
Description to Identifier
What does this mean?
• Identifier exactly matches the description
– Great as a primary key for databases
– Can be used for citation purposes (no more fuzzy
diagrams!)
• Exact description can be obtained for a given identifier.
• Description is extensible, and new identifiers can
be autogenerated, independently
– Needs canonical serialization / central service
– Histories can be made, and published
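The description-to-identifier step can be sketched as hashing a canonical serialization. This is an illustration under stated assumptions: the slides propose serializing an OWL description, whereas the sketch substitutes sorted JSON and SHA-256 to keep the example self-contained, and the function names are hypothetical.

```python
import hashlib
import json


def canonical_description(sequence, ptms=()):
    """Serialize sequence + PTMs so that equal descriptions give identical strings."""
    payload = {
        "sequence": sequence.replace(" ", "").upper(),
        # Sorting removes any dependence on the order in which PTMs were listed.
        "ptms": sorted([mod, pos] for mod, pos in ptms),
    }
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def identifier(sequence, ptms=()):
    """Derive an identifier from the canonical description (assumed: SHA-256 hex digest)."""
    text = canonical_description(sequence, ptms)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


unmodified = identifier("MVTAVAYRLA")
form_a = identifier("MVTAVAYRLA", [("hydroxy", 2), ("phospho", 7)])
form_b = identifier("mvtav AYRLA", [("phospho", 7), ("hydroxy", 2)])

print(form_a == form_b, form_a == unmodified)
```

A hash cannot be reversed into its description, so recovering the exact description for a given identifier would still rely on a service that stores the serialization alongside the identifier.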
Case Study: HIF1α
Hypoxia-Inducible Factor 1, alpha chain (uniprot:Q16665)
Master transcriptional regulator of the adaptive response to hypoxia
• Under normoxic conditions, HIF1α is hydroxylated on Pro-402
and Pro-564 in the oxygen-dependent degradation domain (ODD) by
EGLN1/PHD1 and EGLN2/PHD2. EGLN3/PHD3 has also been shown to
hydroxylate Pro-564. The hydroxylated prolines promote interaction with
VHL, initiating rapid ubiquitination and subsequent proteasomal
degradation.
Situation
b) Normoxic
c) Hypoxic
d) Other/Unspecified
Multiple structural forms
Part, named/unnamed regions
Selective interaction with parts
The part is the agent in the process
Uniprot example revisited
Under normoxic conditions, HIF1α is hydroxylated on Pro-402
and Pro-564 in the oxygen-dependent degradation domain (ODD) by
EGLN1/PHD1 and EGLN2/PHD2. The hydroxylated prolines promote
interaction with VHL, initiating rapid ubiquitination and subsequent
proteasomal degradation.
:1 (HIF1α)
:2 (HIF1α + P402hyd)
:3 (HIF1α + P564hyd)
:4 (HIF1α + P402hyd + P564hyd)
:5 (EGLN1)
:6 (VHL)
:A rdfs:subClassOf :Hydroxylation
:A hasParticipant (:0#r402 and :Substrate)
:A hasParticipant (:1#r402 and :Product)
:A hasParticipant (:5 and :Enzyme)
:B rdfs:subClassOf :Interaction
:B :hasParticipant (:2#r402 or :3#r564 or :4#r402,r564)
:B :hasParticipant (:6)
Please ignore the made up short-hand syntax!
Situational Modeling
Inferring Protein Participation
• OWL Role Chain
hasParticipant o isPartOf -> hasParticipant
if a process has a part as a participant, then the whole is
also a participant
:A rdfs:subClassOf :Hydroxylation
:A hasParticipant (:0#r402 and :Substrate)
:A hasParticipant (:1#r402 and :Product)
:0#r402 :isPartOf :0
:1#r402 :isPartOf :1
:A hasParticipant :0
:A hasParticipant :1
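The effect of the role chain can be reproduced with a small forward-chaining loop. This sketch stands in for what an OWL reasoner would do; it assumes both relations are given as plain sets of pairs and reuses the made-up shorthand identifiers from the slide.

```python
def infer_participants(has_participant, is_part_of):
    """Apply hasParticipant o isPartOf -> hasParticipant until nothing new is derived."""
    inferred = set(has_participant)
    changed = True
    while changed:
        changed = False
        for process, participant in list(inferred):
            for part, whole in is_part_of:
                if part == participant and (process, whole) not in inferred:
                    inferred.add((process, whole))
                    changed = True
    return inferred


has_participant = {(":A", ":0#r402"), (":A", ":1#r402")}
is_part_of = {(":0#r402", ":0"), (":1#r402", ":1")}

closure = infer_participants(has_participant, is_part_of)
for pair in sorted(closure):
    print(pair)
```

Repeating the loop until no new pair appears also covers nested parts, for example an atom that is part of a residue that is part of a protein.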
We will add new knowledge about biochemicals and
their parts into the linked data web through Bio2RDF!
Query descriptions to find matching
biochemicals
• Chemical
– Structural
– Conformation (e.g. open vs closed form)
– Collections (alpha vs beta forms of D-glucose)
• Biological
– Species
– mRNA/Gene from which it was transcribed/encoded
– Reactions / post-translational modifications
– Mutations
Summary
• Biochemical identity is tightly linked to accurate
descriptions.
• Automatic and consistent identifier generation will allow
anybody to specify findings according to the biopolymers
for which they were observed
– No curation required!!!!
– Will be discovered automatically
– link biochemical knowledge at various levels of granularity
• Situational modeling enables the careful separation of
what is known under a particular circumstance.
dumontierlab.com
michel_dumontier@carleton.ca
Special thanks to PhD Student Leonid Chepelev for insightful discussions
semanticscience.org
Biochemical ontologies aim to capture and represent biochemical entities and the relations that exist between them in an accurate manner. A fundamental starting point is biochemical identity, but our current approach for generating identifiers is haphazard and consequently integrating data is error-prone. I will discuss plausible structure-based strategies for biochemical identity, whether it be at the molecular level or some part thereof (e.g. residues, collection of residues, atoms, collection of atoms, functional groups), such that identifiers may be generated in an automatic and curator/database independent manner. With structure-based identifiers in hand, we will be in a position to more accurately capture context-specific biochemical knowledge, such as how a set of residues in a binding site are involved in a chemical reaction, including the fact that a key nitrogen atom must first be de-protonated. Thus, our current representation of biochemical knowledge may improve such that manual and automatic methods of bio-curation are substantially more accurate.