Enzyme Annotation using Conditional Ranking Algorithms
1. Enzyme Annotation using Conditional Ranking Algorithms
Michiel Stock
Faculty of Bioscience Engineering
Ghent University
6th of June 2014
KERMIT
Michiel Stock (KERMIT) Conditional Ranking of Enzymes 6th of June 2014 1 / 14
2. Outline
1 From Structure to Function
2 Ranking Enzymes
3 Learning to Rank
4 Results
5 Conclusion
3. From Structure to Function
What bioinformatics is (often) about
Bioinformatics for proteins
Using biological knowledge and statistical models to map information from a low level (e.g. protein structure) to a higher level (e.g. molecular function).
Sequence → Structure → Function
4. From Structure to Function
The data set
Data:
two data sets of ca. 1600 enzymes with 21 different functions
five different similarity measures of the active site
[Figure: active site of an enzyme]
5. From Structure to Function
The enzyme commission number
7. Ranking Enzymes
Conditional ranking of enzymes
Ranking enzymes
For an unannotated enzyme, rank the annotated enzymes so that those at the top have a function similar to that of the query.
Minimize ranking error: number of switches needed for a perfect ranking
Example: suppose one has an enzyme with unknown function: EC ?.?.?.?
1 EC 2.7.7.12
2 EC 2.7.7.12
3 EC 2.7.7.34
4 EC 2.7.1.12
5 EC 2.7.7.34
6 EC 4.2.3.90
7 EC 1.14.11
8 EC 4.6.1.11
⇒ EC 2.7.7.12
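The ranking example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalytic similarity is taken to be the number of leading EC digits two enzymes share (an assumption, chosen to match the label range 0 to 4 used in the slides), the query's true function is assumed to be EC 2.7.7.12, and the function names are hypothetical.

```python
def catalytic_similarity(ec_a, ec_b):
    """Assumed definition: number of leading EC digits that agree (0 to 4)."""
    shared = 0
    for digit_a, digit_b in zip(ec_a.split("."), ec_b.split(".")):
        if digit_a != digit_b:
            break
        shared += 1
    return shared


def ranking_error(relevances):
    """Number of adjacent switches needed to sort the ranking by relevance.

    This equals the number of pairs in which a less relevant item is
    ranked above a strictly more relevant one.
    """
    n = len(relevances)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if relevances[i] < relevances[j]
    )


# Ranking from the slide, best match first.
ranking = [
    "2.7.7.12", "2.7.7.12", "2.7.7.34", "2.7.1.12",
    "2.7.7.34", "4.2.3.90", "1.14.11", "4.6.1.11",
]
true_function = "2.7.7.12"  # assumed ground truth for the query

relevances = [catalytic_similarity(true_function, ec) for ec in ranking]
predicted_function = ranking[0]  # annotate the query with the top hit
error = ranking_error(relevances)
```

Under these assumptions the only misordered pair is the enzyme at rank 4 (two shared digits) standing above the one at rank 5 (three shared digits), so a single switch gives a perfect ranking.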
8. Learning to Rank
Learning the catalytic similarity
pair of enzymes: $e = (v, v')$
label $y_e \in \{0, 1, 2, 3, 4\}$: the catalytic similarity
five different structural similarities: $K^{\phi}(v, v')$
Label matrix (enzymes × enzymes):
     A  B  C  D  E  F  G
A    4  4  0  0  0
B    4  4  0  0  0
C    0  0  4  2  1
D    0  0  2  4  3
E    0  0  1  3  4
F
G
9. Learning to Rank
Pairwise features with the Kronecker product
[Figure: object kernel → pairwise kernel → learning algorithm (…, SVM, RLS, …)]
The Kronecker kernel is defined as:
$$K^{\Phi}\big((v, v'), (\bar{v}, \bar{v}')\big) = K^{\Phi}(e, \bar{e}) = K^{\phi}(v, \bar{v})\, K^{\phi}(v', \bar{v}')$$
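As a rough illustration of the Kronecker kernel definition, the sketch below builds the pairwise kernel matrix from a small object kernel with NumPy. The 3 × 3 kernel values and the helper names are made up for the example; only `numpy.kron` is a real library call.

```python
import numpy as np


def kronecker_pairwise_kernel(K):
    """Pairwise kernel matrix of size n^2 x n^2.

    Row (v, v') and column (v_bar, v_bar') hold K[v, v_bar] * K[v', v_bar'].
    """
    return np.kron(K, K)


def pair_index(v, v_prime, n):
    """Position of the pair (v, v') in the rows/columns of the pairwise kernel."""
    return v * n + v_prime


# Hypothetical object kernel between three enzymes.
K = np.array([
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])
n = K.shape[0]
K_pair = kronecker_pairwise_kernel(K)

e = pair_index(0, 2, n)      # pair (v, v') = (0, 2)
e_bar = pair_index(1, 2, n)  # pair (v_bar, v_bar') = (1, 2)
value = K_pair[e, e_bar]     # equals K[0, 1] * K[2, 2]
```

The explicit matrix grows as n² × n², so forming it is only practical for small examples like this one.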
10. Learning to Rank
Basic pairwise models
Use training data $T = \{(e, y_e)\}$ to fit a model:
$$h(e) = \sum_{\bar{e} \in T} a_{\bar{e}} K^{\Phi}(e, \bar{e}).$$
The function $h \in H$ can be fitted using the following optimisation problem:
$$A(T) = \arg\min_{h \in H} L(h, T) + \lambda \|h\|_{H}^{2}.$$
For conditional ranking we choose an approximation of the rank loss.
This problem has time complexity $O(n^3)$, with $n$ the number of enzymes.
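To give a feel for where a cubic cost can come from, here is a minimal sketch that fits the dual coefficients for a plain squared loss rather than the approximate rank loss used in the slides (a deliberate simplification). It assumes labels are available for all n × n enzyme pairs and exploits the Kronecker structure through one eigendecomposition of the object kernel. All names and the random data are hypothetical.

```python
import numpy as np


def fit_kronecker_ridge(K, Y, lam):
    """Dual coefficients A (n x n) for squared loss on the Kronecker kernel.

    Solves (K kron K + lam * I) vec(A) = vec(Y) using a single
    eigendecomposition of K, so the cost is O(n^3) rather than the
    O(n^6) of a direct solve on the n^2 x n^2 system.
    """
    eigvals, U = np.linalg.eigh(K)       # K = U diag(eigvals) U^T
    Y_rot = U.T @ Y @ U                  # labels expressed in the eigenbasis
    A_rot = Y_rot / (np.outer(eigvals, eigvals) + lam)
    return U @ A_rot @ U.T


def predict_pairs(K, A):
    """Entry [v, v'] is h(e) for the pair e = (v, v')."""
    return K @ A @ K


rng = np.random.default_rng(0)
n = 5
X = rng.normal(size=(n, 3))              # made-up enzyme features
K = X @ X.T                              # positive semi-definite object kernel
Y = rng.integers(0, 5, size=(n, n)).astype(float)  # made-up labels in {0,...,4}
lam = 1.0

A = fit_kronecker_ridge(K, Y, lam)
H = predict_pairs(K, A)
```

The result can be checked against a direct solve with `np.kron(K, K)` on small inputs; the eigendecomposition route gives the same coefficients without ever forming the pairwise kernel.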
11. Results
Qualitative improvement in the enzyme similarities
Example for CavBase structural similarity:
[Figure: three similarity matrices: ground truth, supervised, unsupervised. Lighter color = higher similarity.]
12. Results
Improvement of the ROC curves
ROC curves for the five different structural similarity measures, unsupervised and supervised:
[Figure: ROC curves for the different enzyme similarity measures of data set I. x-axis: false positive rate; y-axis: average true positive rate. Curves: CB, FP, LPCS, MCS and SW, each supervised (sup.) and unsupervised (unsup.).]
Improvement
Increase of AUC from ca. 0.7 to more than 0.8!
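For readers less familiar with the metric, the AUC for a single query can be read as the fraction of (relevant, irrelevant) enzyme pairs that the scores order correctly. The sketch below computes it that way; the scores and binary labels are invented for illustration and are not taken from the study.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Made-up similarity scores for five database enzymes and whether each
# shares the query's function (1) or not (0).
scores = [0.9, 0.8, 0.7, 0.4, 0.6]
labels = [1, 1, 0, 0, 1]
query_auc = auc(scores, labels)  # 5 of the 6 pairs are ordered correctly
```

A value of 0.5 corresponds to a random ordering and 1.0 to a perfect one.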
13. Conclusion
General conclusions
1 enzyme function prediction can nicely be cast in a conditional ranking framework
2 supervised ranking is a clear improvement upon the baseline
3 efficient enough for many bioinformatics applications
4 can be generalised to many other settings
14. Conclusion
Acknowledgements
Ghent University
Bernard De Baets
Willem Waegeman
University of Turku
Tapio Pahikkala
Antti Airola
University of Marburg
Thomas Fober
Eyke Hüllermeier
Want to know more?
[1] T. Pahikkala, A. Airola, M. Stock, B. De Baets, and W. Waegeman. Efficient regularized least-squares algorithms for conditional ranking on relational data. Machine Learning, 93(2-3):321–356, 2013.
[2] M. Stock, T. Fober, E. Hüllermeier, S. Glinca, G. Klebe, T. Pahikkala, A. Airola, B. De Baets, and W. Waegeman. Identification of functionally related enzymes by learning-to-rank methods. IEEE Transactions on Computational Biology and Bioinformatics, accepted for publication, 2014.