This document provides an overview of a seminar presentation on kernel methods for data integration in systems biology. It begins with short biographies of the presenter, who is trained as a mathematician and statistician and applies their skills to research in human health and animal genomics using various omics data types. Examples are given of the presenter's past work inferring networks and integrating gene expression and lipid data, as well as expression and 3D DNA location data. The talk will discuss how to integrate multiple omics data from different sources and types using kernels. Kernels allow reducing high-dimensional data to similarity matrices and are not restricted to numeric data. They also allow embedding expert knowledge and provide a framework for statistical learning.
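To make the similarity-matrix idea concrete, here is a minimal sketch (in Python with NumPy; the data and bandwidth are invented for illustration, this is not the presenter's code) of how a Gaussian kernel turns an n × p data matrix into an n × n similarity matrix:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Gaussian (RBF) kernel: K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    # Pairwise squared Euclidean distances via broadcasting: (n, n) matrix.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

# 5 samples with 1000 features each reduce to a 5 x 5 similarity matrix.
X = np.random.default_rng(0).normal(size=(5, 1000))
K = gaussian_kernel_matrix(X)
print(K.shape)  # (5, 5)
```

Each entry K[i, j] lies in (0, 1] and measures how similar samples i and j are; downstream kernel methods (kernel PCA, SVMs, kernel clustering) only ever see this n × n matrix, never the original high-dimensional features.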
Kernel methods and variable selection for exploratory analysis and multi-omic...
Nathalie Vialaneix
4th course on Computational Systems Biology of Cancer: Multi-omics and Machine Learning Approaches
International course, Curie training
https://training.institut-curie.org/courses/sysbiocancer2021
(remote)
September 29th, 2021
Mini useR! in Melbourne https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network/events/251933078/
MelbURN (Melbourne useR group) https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network
July 16th, 2018
Melbourne, Australia
Dimensionality reduction by matrix factorization using concept lattice in dat... (eSAT Journals)
Abstract: Concept lattices are an important technique that has become standard in data analytics and knowledge representation in many fields, such as statistics, artificial intelligence, pattern recognition, machine learning, information theory, social networks, information retrieval systems and software engineering. Formal concepts are adopted as the primitive notion; a concept is defined as a pair consisting of an intension and an extension. Formal concept analysis (FCA) can handle large amounts of data, generating concepts, rules and data visualizations. Matrix factorization methods have recently received greater exposure, mainly as an unsupervised learning method for latent variable decomposition. In this paper a novel method is proposed to decompose such concepts using Boolean matrix factorization for dimensionality reduction. The paper focuses on finding all the concepts and the object intersections. Keywords: data mining, formal concepts, lattice, matrix factorization, dimensionality reduction.
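For readers unfamiliar with Boolean matrix factorization, the Boolean product it relies on can be sketched as follows (a toy illustration with invented matrices, not the paper's algorithm): a binary matrix X is reconstructed from binary factors A and B with X[i, j] = OR over l of (A[i, l] AND B[l, j]), so each factor column/row pair covers a rectangle of ones, much like a formal concept:

```python
import numpy as np

def boolean_product(A, B):
    """Boolean matrix product: (A o B)[i, j] = OR_l (A[i, l] AND B[l, j])."""
    # Broadcast to (n, k, m), AND elementwise, then OR over the shared axis k.
    return np.any(A[:, :, None] & B[None, :, :], axis=1).astype(int)

# 3 objects x 2 latent "concepts", and 2 concepts x 3 attributes.
A = np.array([[1, 0], [1, 1], [0, 1]], dtype=bool)
B = np.array([[1, 1, 0], [0, 1, 1]], dtype=bool)
X = boolean_product(A, B)
print(X)  # [[1 1 0]
          #  [1 1 1]
          #  [0 1 1]]
```

The 3 x 3 binary matrix is thus represented exactly by two rank-1 Boolean factors, which is the dimensionality reduction the abstract describes.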
The variational Gaussian process (VGP) is a Bayesian nonparametric model which adapts its shape to match complex posterior distributions. The VGP generates approximate posterior samples by generating latent inputs and warping them through random non-linear mappings; the distribution over random mappings is learned during inference, enabling the transformed outputs to adapt to varying complexity.
Medical pathology images are visually evaluated by experts for disease diagnosis, but the connection between image features and the state of the cells in an image is typically unknown. To understand this relationship, we describe a multimodal modeling and inference framework that estimates the shared latent structure of joint gene expression levels and medical image features. The method is built around probabilistic canonical correlation analysis (PCCA), which is jointly fit to image embeddings that are learned using convolutional neural networks and linear embeddings of paired gene expression data. We finally discuss a set of theoretical and empirical challenges in domain adaptation settings arising from genomics data. (Based on work in collaboration with Gregory Gundersen and Barbara E. Engelhardt.)
Since the advent of the horseshoe priors for regularization, global-local shrinkage methods have proved to be a fertile ground for the development of Bayesian theory and methodology in machine learning. They have achieved remarkable success in computation, and enjoy strong theoretical support. Much of the existing literature has focused on the linear Gaussian case. The purpose of the current talk is to demonstrate that the horseshoe priors are useful more broadly, by reviewing both methodological and computational developments in complex models that are more relevant to machine learning applications. Specifically, we focus on methodological challenges in horseshoe regularization in nonlinear and non-Gaussian models; multivariate models; and deep neural networks. We also outline the recent computational developments in horseshoe shrinkage for complex models along with a list of available software implementations that allows one to venture out beyond the comfort zone of the canonical linear regression problems.
Universal Approximation Property via Quantum Feature Maps
---
The quantum Hilbert space can be used as a quantum-enhanced feature space in machine learning (ML) via a quantum feature map that encodes classical data into quantum states. We prove the ability to approximate any continuous function, with an optimal approximation rate, via quantum ML models built on typical quantum feature maps.
---
Contributed talk at Quantum Techniques in Machine Learning 2021, Tokyo, November 8-12 2021.
By Quoc Hoan Tran, Takahiro Goto and Kohei Nakajima
Bayesian inference for mixed-effects models driven by SDEs and other stochast... (Umberto Picchini)
An important, and well studied, class of stochastic models is given by stochastic differential equations (SDEs). In this talk, we consider Bayesian inference based on measurements from several individuals, to provide inference at the "population level" using mixed-effects modelling. We consider the case where dynamics are expressed via SDEs or other stochastic (Markovian) models. Stochastic differential equation mixed-effects models (SDEMEMs) are flexible hierarchical models that account for (i) the intrinsic random variability in the latent states dynamics, as well as (ii) the variability between individuals, and also (iii) account for measurement error. This flexibility gives rise to methodological and computational difficulties.
Fully Bayesian inference for nonlinear SDEMEMs is complicated by the typical intractability of the observed data likelihood, which motivates the use of sampling-based approaches such as Markov chain Monte Carlo. A Gibbs sampler is proposed to target the marginal posterior of all parameters of interest. The algorithm is made computationally efficient through careful use of blocking strategies, particle filters (sequential Monte Carlo) and correlated pseudo-marginal approaches. The resulting methodology is flexible and general, and is able to deal with a large class of nonlinear SDEMEMs [1]. In more recent work [2], we also explored ways to make inference even more scalable to an increasing number of individuals, while also dealing with state-space models driven by stochastic dynamic models other than SDEs, e.g. Markov jump processes and nonlinear solvers typically used in systems biology.
[1] S. Wiqvist, A. Golightly, A. T. McLean, U. Picchini (2020). Efficient inference for stochastic differential mixed-effects models using correlated particle pseudo-marginal algorithms. Computational Statistics & Data Analysis. https://doi.org/10.1016/j.csda.2020.107151
[2] S. Persson, N. Welkenhuysen, S. Shashkova, S. Wiqvist, P. Reith, G. W. Schmidt, U. Picchini, M. Cvijovic (2021). PEPSDI: Scalable and flexible inference framework for stochastic dynamic single-cell models, bioRxiv doi:10.1101/2021.07.01.450748.
MOCANAR: A MULTI-OBJECTIVE CUCKOO SEARCH ALGORITHM FOR NUMERIC ASSOCIATION RU... (cscpconf)
Extracting association rules from numeric features involves searching a very large search space. To deal with this problem, this paper uses a meta-heuristic algorithm that we have called MOCANAR. MOCANAR is a Pareto-based multi-objective cuckoo search algorithm which extracts high-quality association rules from numeric datasets. Support, confidence, interestingness and comprehensibility are the objectives considered in MOCANAR. MOCANAR extracts rules incrementally: in each run of the algorithm, a small number of high-quality rules are produced. The paper also presents a comprehensive taxonomy of meta-heuristic algorithms. Using this taxonomy, we decided on a cuckoo search algorithm because it is one of the most mature algorithms and is also simple to use and easy to comprehend. In addition, to our knowledge, this method has not previously been used as a multi-objective algorithm, nor in the association rule mining area. To demonstrate the merit and associated benefits of the proposed methodology, it has been applied to a number of datasets, and high-quality results in terms of the objectives were extracted.
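Two of the objectives named in the abstract, support and confidence, can be computed for a numeric interval rule as follows (a toy sketch with invented records and thresholds, not the MOCANAR implementation):

```python
# Hypothetical numeric rule: IF temp in [20, 30] THEN sales in [100, 200].
records = [
    {"temp": 25, "sales": 150},
    {"temp": 22, "sales": 180},
    {"temp": 35, "sales": 120},
    {"temp": 28, "sales": 90},
    {"temp": 10, "sales": 50},
]

antecedent = lambda r: 20 <= r["temp"] <= 30
consequent = lambda r: 100 <= r["sales"] <= 200

n = len(records)
n_ante = sum(antecedent(r) for r in records)
n_both = sum(antecedent(r) and consequent(r) for r in records)

support = n_both / n          # fraction of records matching the whole rule
confidence = n_both / n_ante  # among antecedent matches, fraction matching the rule
print(support, confidence)   # 0.4 0.666...
```

A multi-objective search such as MOCANAR would score every candidate rule on measures like these simultaneously and keep the Pareto-optimal ones.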
The data deluge we are living through today is fostering the development of effective and efficient techniques for the analysis and extraction of knowledge and insights from data. The Big Data paradigm, in particular its Volume and Velocity features, requires us to change our habits for treating data and for extracting information that is useful for discovering patterns and insights. Exploratory data analysis must also reformulate its aims: How many groups are in the data? How do we deal with data that doesn't fit in a PC's memory? How do we represent aggregated data or repeated measures on individuals? How does the data correlate? The Symbolic Data Analysis approach attempts to reformulate statistical thinking for this setting. In this talk, we present some tools for working with aggregated data described by empirical distributions of values.
Using real cases from different fields (sensor data, official statistics, data streams) and the HistDAWass R package, we show some recent solutions for unsupervised classification and for feature selection in a subspace clustering context, and how to interpret the results.
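HistDAWass is an R package, but the Wasserstein distance it builds on is easy to sketch: for two equal-size 1-D samples, the L2 Wasserstein distance is the root-mean-square difference between sorted values (quantiles). A minimal Python illustration with invented data:

```python
def wasserstein2(xs, ys):
    """L2 Wasserstein distance between two equal-size 1-D samples:
    RMS difference of sorted values, i.e. of matched empirical quantiles."""
    xs, ys = sorted(xs), sorted(ys)
    assert len(xs) == len(ys)
    return (sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)) ** 0.5

a = [1, 2, 3, 4]
b = [2, 3, 4, 5]           # same shape, shifted by 1
print(wasserstein2(a, b))  # 1.0 -- a pure location shift of 1
```

Because the distance compares whole distributions, clustering individuals described by histograms (as in the talk) amounts to clustering under this metric rather than on single summary values.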
Composite repetition-aware data structures (Fabio Cunial)
Paper: https://link.springer.com/chapter/10.1007/978-3-319-19929-0_3
Code: https://github.com/nicolaprezza/lz-rlbwt
Code: https://github.com/nicolaprezza/slz-rlbwt
In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT enables also a new representation of the suffix tree, whose size depends again on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal.
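As a toy illustration of why the RLBWT is small on repetitive text (a naive Python sketch, nothing like the paper's engineered implementation), one can build the BWT from sorted rotations and run-length encode it:

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' is a sentinel)."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def rle(s):
    """Run-length encode a string into (char, run_length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, n) for c, n in runs]

# Repetitive text -> the BWT clusters identical characters into few runs.
t = "banana" * 4
runs = rle(bwt(t))
print(len(t) + 1, len(runs))  # text length vs number of BWT runs
```

The RLBWT stores one (character, length) pair per run, so its size scales with the number of runs r rather than with the text length, which is exactly the repetition measure the abstract's index sizes depend on.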
Computational methods are used to analyze biological data. This paper introduces some of the many resources available for analyzing sequence data with bioinformatics software; it covers theoretical approaches to data resources and examines sequence alignments together with their databases. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret biological data, and has been used for in silico analyses of biological queries using mathematical and statistical techniques. Databases are essential for bioinformatics research and applications. Many databases exist, covering various information types: for example, DNA and protein sequences, molecular structures, phenotypes and biodiversity; databases may contain empirical data. Bioinformatics conceptualizes biology in terms of molecules and then applies informatics techniques from mathematics, computer science and statistics to understand and organize the information associated with these molecules on a large scale. People study bioinformatics in different ways. Some are devoted to developing new computational tools, from both software and hardware viewpoints, for better handling and processing of biological data; they develop new models and algorithms for existing questions, and propose and tackle new questions as new experimental techniques bring in new data. Others take the study of bioinformatics to be the study of biology from the viewpoint of informatics and systems.
Durgesh Raghuvanshi, Vivek Solanki, Neha Arora, Faiz Hashmi, "Computational of Bioinformatics", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume 4, Issue 4, June 2020. URL: https://www.ijtsrd.com/papers/ijtsrd30891.pdf ; paper URL: https://www.ijtsrd.com/engineering/computer-engineering/30891/computational-of-bioinformatics/durgesh-raghuvanshi
Data reduction techniques for high dimensional biological data (eSAT Journals)
Abstract: High-dimensional biological datasets have been growing rapidly in recent years. Extracting knowledge from, and analyzing, high-dimensional biological data is one of the key challenges, in which variety and veracity are two distinct characteristics. The questions that arise are how to perform dimensionality reduction for this heterogeneous data, how to develop a high-performance platform to efficiently analyze high-dimensional biological data, and how to find what is useful in this data. To discuss this issue in depth, this paper begins with a brief introduction to the data analytics available for biological data, followed by a discussion of big data analytics and then a survey of various data reduction methods for biological data. We propose a dense clustering algorithm for standard high-dimensional biological data.
Keywords: Big Data Analytics, Dimensionality Reduction
Credal Fusion of Classifications for Noisy and Uncertain Data (IJECEIAES)
This paper reports on an investigation into classification techniques employed to classify noisy and uncertain data. Classification is not an easy task, and it is a significant challenge to discover knowledge from uncertain data; many problems arise. Often we do not have a good or large learning database for supervised classification. Also, when training data contain noise or missing values, classification accuracy is affected dramatically. Extracting groups from such data is therefore difficult: they overlap and are not well separated from each other. Another problem is the uncertainty due to measuring devices. Consequently, the classification model is not robust enough to classify new objects. In this work, we present a novel classification algorithm to address these problems. We realize our main idea by using belief function theory to combine classification and clustering; this theory handles well the imprecision and uncertainty linked to classification. Experimental results show that our approach can significantly improve the quality of classification on a generic database.
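The combination step in belief function theory is typically Dempster's rule. A minimal sketch over a two-class frame (the mass values and helper names are invented for illustration, not the paper's algorithm):

```python
from itertools import product

# Frame of discernment {"a", "b"}; "ab" denotes the whole frame (ignorance).
def intersect(x, y):
    """Intersection of two focal sets encoded as substrings of 'ab'."""
    return "".join(c for c in "ab" if c in x and c in y)

def dempster(m1, m2):
    """Dempster's rule: multiply masses over pairs of focal sets, assign the
    product to their intersection, and renormalize by (1 - conflict)."""
    combined = {}
    conflict = 0.0
    for (x, mx), (y, my) in product(m1.items(), m2.items()):
        z = intersect(x, y)
        if z:
            combined[z] = combined.get(z, 0.0) + mx * my
        else:
            conflict += mx * my  # mass falling on the empty set
    return {z: v / (1.0 - conflict) for z, v in combined.items()}

m1 = {"a": 0.6, "ab": 0.4}  # one source: mostly believes class "a"
m2 = {"b": 0.5, "ab": 0.5}  # another source: mostly believes class "b"
m = dempster(m1, m2)
print(m)  # mass on "a", "b" and "ab" after combination, summing to 1
```

Unlike a probability vector, the combined mass can keep weight on "ab", explicitly representing the remaining ignorance, which is why the theory suits noisy and uncertain data.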
Presentation of the research paper 'Study of some data mining classification techniques' (International Research Journal of Engineering and Technology (IRJET), e-ISSN: 2395-0056, p-ISSN: 2395-0072, Volume 04, Issue 04, April 2017), prepared for academic purposes during postgraduate study. The study describes some data mining classification techniques, such as ANN, SVM and decision trees, with examples.
Dynamic Evolving Neuro-Fuzzy Inference System for Mortality Prediction (IJERA Editor)
In this paper we propose a dynamic evolving neuro-fuzzy inference system (DENFIS) to forecast mortality. DENFIS is an adaptive intelligent system suitable for dynamic time series prediction. An Evolving Clustering Method (ECM) drives the learning process. The typical fuzzy rules of neuro-fuzzy systems are updated during the learning process and adjusted according to the features of the data. This makes it possible to capture the changes in mortality evolution that underlie so-called longevity risk.
Rainfall Prediction using Data-Core Based Fuzzy Min-Max Neural Network for Cl... (IJERA Editor)
This paper proposes a rainfall prediction system using a classification technique. An advanced, modified neural network called the Data Core Based Fuzzy Min-Max Neural Network (DCFMNN) is used for pattern classification, and this classification method is applied to predict rainfall. The fuzzy min-max neural network (FMNN), which creates hyperboxes for classification and prediction, has a problem of overlapping neurons that is resolved in DCFMNN to give greater accuracy. The system is composed of hyperbox formation, two kinds of neurons (overlapping neurons and classifying neurons), and classification used for prediction. For each kind of hyperbox, its data core and the geometric center of the data are calculated. The advantages of this method are high accuracy and strong robustness. According to the evaluation results, this system gives better rainfall prediction and is a useful classification tool in a real environment.
Similar to Kernel methods for data integration in systems biology
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN (Sérgio Sacani)
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... (Ana Luísa Pinho)
Functional Magnetic Resonance Imaging (fMRI) provides a means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization.
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V... (Wasswaderrick3)
In this book, we use conservation of energy techniques on a fluid element to derive the modified Bernoulli equation of flow with viscous or friction effects. We derive the general equation of flow/velocity and then, from this, derive the Poiseuille flow equation, the transition flow equation and the turbulent flow equation. In situations where there are no viscous effects, the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the modified Bernoulli equation to derive equations of flow rate for pipes of different cross-sectional areas connected together. We also extend our techniques of energy conservation to a sphere falling in a viscous medium under the effect of gravity. We demonstrate Stokes' equation of terminal velocity and the turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium, and at the general equation of terminal velocity.
Toxic effects of heavy metals: Lead and Arsenic (sanjana502982)
Heavy metals are naturally occurring metallic chemical elements that have relatively high density and are toxic at even low concentrations. All toxic metals are termed heavy metals irrespective of their atomic mass and density, e.g. arsenic, lead, mercury, cadmium, thallium, chromium, etc.
Observation of Io's Resurfacing via Plume Deposition Using Ground-based Adapt... (Sérgio Sacani)
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io's surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io's trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io's surface using adaptive optics at visible wavelengths.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep... (University of Maribor)
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
What are greenhouse gases and how many gases affect the Earth? (moosaasad1975)
What are greenhouse gases, how do they affect the Earth and its environment, and what is the future of the environment and the Earth as weather and climate change?
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ... (Sérgio Sacani)
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest
imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters
spanning 0.4−0.9µm) and novel JWST images with 14 filters spanning 0.8−5µm, including 7 mediumband filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data
at > 2.3µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and
30.3-31.0 AB mag (5σ, r = 0.1” circular aperture) in individual filters. We measure photometric
redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts
z = 11.5 − 15. These objects show compact half-light radii of R1/2 ∼ 50 − 200pc, stellar masses of
M⋆ ∼ 107−108M⊙, and star-formation rates of SFR ∼ 0.1−1 M⊙ yr−1
. Our search finds no candidates
at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to
infer the properties of the evolving luminosity function without binning in redshift or luminosity that
marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the
impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results,
and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5
from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical
models for evolution of the dark matter halo mass function.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Studia Poinsotiana
I Introduction
II Subalternation and Theology
III Theology and Dogmatic Declarations
IV The Mixed Principles of Theology
V Virtual Revelation: The Unity of Theology
VI Theology as a Natural Science
VII Theology’s Certitude
VIII Conclusion
Notes
Bibliography
All the contents are fully attributable to the author, Doctor Victor Salas. Should you wish to get this text republished, get in touch with the author or the editorial committee of the Studia Poinsotiana. Insofar as possible, we will be happy to broker your contact.
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Kernel methods for data integration in systems biology
1. Kernel methods for data integration in systems biology
Nathalie Vialaneix
nathalie.vialaneix@inrae.fr
http://www.nathalievialaneix.eu
Séminaire CBI
February 17, 2020 – Toulouse
Nathalie Vialaneix, MIAT, INRAE Toulouse | Kernel methods for data integration in systems biology 1/37
2. A short bio
trained as a mathematician and statistician
applications: research applied to human health (obesity) and animal genomics
data: mostly transcriptome, but also Hi-C and metabolome and (to a lesser extent) scRNAseq, metagenomics, ATACseq, ...
methods: networks (inference, mining), omics data integration, machine learning (including random forests, SVM and neural networks)
3. Examples of past works
inferring and understanding the relations between gene expression, lipids and phenotypes (weight, waist circumference, ...) in adipose tissue (Diogenes)
⇒ network inference and mining, data integration, missing data, ... [Montastier et al., 2015, Imbert et al., 2018] and the R package RNAseqNet
4. Examples of past works
integrating expression and location (3D DNA FISH) for network inference in fetal pig tissues [Marti-Marimon et al., 2018]
5. Other activities
including training for biologists in RNAseq data analysis, basic statistics, graphics with R, ...
organizer of the working group “Biopuces” http://www.nathalievialaneix.eu/biopuces and active member of “Chrocogen” https://groupes.renater.fr/sympa/info/chrocogen
7. In this talk...
How to integrate multiple omics data from various sources and various types with kernels?
Disclaimer: equations included (not necessary to understand the talk but necessary for the speaker to understand her own work during the talk)
8. A primer on kernel methods for biology
10. Before we start: context and motivations
Data characteristics
a few (paired) samples
information at various levels
... but of heterogeneous types
and, when numeric, with a large
dimension
What we want to achieve
integrative analysis
to predict a phenotype, to
understand the typology of the
samples, ...
14. In short: what are kernels?
Data we are used to...
n samples on which p variables are measured: (x_i)_{i=1,...,n} with x_i ∈ R^p
From that, we can compute:
centers of gravity: x̄ = (1/n) ∑_{i=1}^n x_i
distances and dot products: d(x_i, x_{i'}) = [∑_{j=1}^p (x_{ij} − x_{i'j})²]^{1/2} and ⟨x_i, x_{i'}⟩ = ∑_{j=1}^p x_{ij} x_{i'j}
Kernels...
The characteristics of the n samples (x_i)_i are summarized by pairwise similarities.
More formally: an n × n matrix K, such that K is symmetric and positive definite (K_{ii'} = K(x_i, x_{i'}))
Representer Theorem: the solution of many learning problems in the associated feature space can be written as a linear combination of kernel evaluations at the samples, f = ∑_{i=1}^n α_i K(x_i, ·)
16. Why are kernels interesting?
1 because they can reduce high-dimensional data to small similarity matrices
2 because they are not restricted to data in R^p (kernels on graphs, between graphs, on text, ...) (some examples to come)
3 because they can embed expert knowledge (e.g., the phylogeny between taxa) (some examples to come)
4 because they offer a rigorous framework to extend many statistical methods (basic principles to come just after)
5 because they offer a clean and common framework for data integration (topic of this talk)
but:
1 the choice of a relevant kernel is still up to you...
2 they can strongly increase the computational time when n is large...
19. Kernel examples
1 R^p observations: Gaussian kernel K_{ii'} = exp(−γ ‖x_i − x_{i'}‖²)
2 nodes of a graph: diffusion kernels [Kondor and Lafferty, 2002]
3 sequence kernels (used to compute similarities between proteins, for instance): spectrum kernel [Jaakkola et al., 2000] (with HMM), convolution kernel [Saigo et al., 2004]
4 kernels between graphs (or “structured data”; used in metabolomics to compute similarities between metabolites based on their fragmentation trees): [Shen et al., 2014, Brouard et al., 2016]
More examples: [Mariette and Vialaneix, 2019]
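As an illustration of the first example, the Gaussian kernel matrix can be computed directly from the data matrix. A minimal NumPy sketch (the function name, data and γ value are ours, not the speaker's code):

```python
import numpy as np

def gaussian_kernel(X, gamma=1.0):
    """K[i, i'] = exp(-gamma * ||x_i - x_i'||^2) for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared Euclidean distances
    return np.exp(-gamma * np.maximum(d2, 0.0))   # clip round-off negatives

X = np.random.default_rng(0).normal(size=(6, 3))
K = gaussian_kernel(X, gamma=0.5)
assert np.allclose(K, K.T)                        # symmetric
assert np.all(np.linalg.eigvalsh(K) > -1e-8)      # positive (semi-)definite
```

The resulting n × n matrix is all a kernel method needs to see: the p columns of X never appear again downstream.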
20. Principles for learning from kernels
Start from any statistical method (PCA, regression, k-means clustering) and rewrite all quantities using:
K to compute distances and dot products: the dot product is ⟨φ(x_i), φ(x_{i'})⟩ = K_{ii'} and the distance is √(K_{ii} + K_{i'i'} − 2K_{ii'})
(implicit) linear or convex combinations of (φ(x_i))_i to describe all unobserved elements (centers of gravity and so on...)
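For instance, all pairwise feature-space distances follow from K alone. A small sketch (our naming; the linear-kernel case is used only as a sanity check, since there φ is the identity):

```python
import numpy as np

def feature_space_distances(K):
    """d(x_i, x_j) = sqrt(K_ii + K_jj - 2 K_ij), computed from K only."""
    diag = np.diag(K)
    d2 = diag[:, None] + diag[None, :] - 2 * K
    return np.sqrt(np.maximum(d2, 0.0))  # clip round-off negatives

# Sanity check: with the linear kernel K = X X^T (phi = identity),
# these must coincide with ordinary Euclidean distances.
X = np.random.default_rng(1).normal(size=(5, 4))
D = feature_space_distances(X @ X.T)
D_ref = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
assert np.allclose(D, D_ref)
```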
22. A simple example: k-means
1: Initialization: random initialization of P centers x̄_{C_j^1} ∈ R^p
2: for t = 1 to T do
3:   Affectation step: ∀ i = 1, ..., n, f^{t+1}(x_i) = argmin_{j=1,...,P} d(x_i, x̄_{C_j^t})
4:   Representation step: ∀ j = 1, ..., P, x̄_{C_j^t} = (1/|C_j^t|) ∑_{x_l ∈ C_j^t} x_l
5: end for (convergence)
6: return partition
26. A simple example: kernel k-means
1: Initialization: random initialization of a partition of (x_i)_i and x̄_{C_j^1} = (1/|C_j^1|) ∑_{x_i ∈ C_j^1} φ(x_i)
2: for t = 1 to T do
3:   Affectation step: ∀ i = 1, ..., n,
     f^{t+1}(x_i) = argmin_{j=1,...,P} ‖φ(x_i) − x̄_{C_j^t}‖²_H = argmin_{j=1,...,P} K_{ii} − (2/|C_j^t|) ∑_{x_l ∈ C_j^t} K_{il} + (1/|C_j^t|²) ∑_{x_l, x_{l'} ∈ C_j^t} K_{ll'}
4:   Representation step: ∀ j = 1, ..., P, x̄_{C_j^t} = (1/|C_j^t|) ∑_{x_l ∈ C_j^t} φ(x_l)
5: end for (convergence)
6: return partition
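The kernelized affectation step above is directly implementable, since it only touches entries of K. A minimal sketch (our own naming and initialization scheme, not the speaker's implementation):

```python
import numpy as np

def kernel_kmeans(K, P, init, n_iter=20):
    """Kernel k-means from a kernel matrix alone. The squared distance of
    phi(x_i) to the center of cluster C_j is
    K_ii - (2/|C_j|) * sum_{l in C_j} K_il + (1/|C_j|^2) * sum_{l,l' in C_j} K_ll'."""
    n = K.shape[0]
    labels = np.asarray(init)
    for _ in range(n_iter):
        d2 = np.full((n, P), np.inf)
        for j in range(P):
            idx = np.flatnonzero(labels == j)
            if idx.size == 0:
                continue  # empty cluster: keep it out of the argmin
            d2[:, j] = (np.diag(K)
                        - 2.0 * K[:, idx].sum(axis=1) / idx.size
                        + K[np.ix_(idx, idx)].sum() / idx.size ** 2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged
        labels = new_labels
    return labels

# Two well-separated blobs; with the linear kernel this is ordinary k-means.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
labels = kernel_kmeans(X @ X.T, P=2, init=np.array([0] * 5 + [1] * 15))
assert set(labels[:10]) == {0} and set(labels[10:]) == {1}
```

Swapping in a Gaussian or graph kernel for `X @ X.T` changes nothing else in the code, which is exactly the point of the rewriting trick.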
29. Beyond kernels: relational data
DNA barcoding (Astraptes fulgerator): optimal matching (edit) distances to differentiate species
Hi-C data: a pairwise measure (similarity) related to the physical 3D distance between loci in the cell, at genome scale [Ambroise et al., 2019, Randriamihamison et al., 2019]
Metagenomics: the dissimilarity between samples is better captured when the phylogeny between species is taken into account (Unifrac distances)
30. Combining relational data in an
unsupervised setting
33. What are metagenomic data?
Source: [Sommer et al., 2010]
abundance data: sparse n × p matrices with count data, samples in rows and descriptors (species, OTUs, KEGG groups, k-mers, ...) in columns. Generally p ≫ n.
phylogenetic tree (evolutionary history between species, OTUs, ...): one tree with p leaves, built from the sequences collected in the n samples.
35. What are metagenomic data used for?
produce a profile of the diversity of a given sample ⇒ allows comparing diversity across various conditions
used in various fields: environmental science, microbiota studies, ...
Processed by computing a relevant dissimilarity between samples (the standard Euclidean distance is not relevant) and by using this dissimilarity in subsequent analyses.
36. β-diversity data: dissimilarities between count data
Compositional dissimilarities (n_ig: count of species g in sample i):
Jaccard: the fraction of species specific to either sample i or j:
d_jac = ∑_g (I{n_ig>0, n_jg=0} + I{n_jg>0, n_ig=0}) / ∑_g I{n_ig+n_jg>0}
Bray-Curtis: the fraction of the sample which is specific to either sample i or j:
d_BC = ∑_g |n_ig − n_jg| / ∑_g (n_ig + n_jg)
Other dissimilarities are available in the R package phyloseq; most of them are not Euclidean.
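Both dissimilarities are direct to compute from two count vectors. A small sketch following the formulas above (the function names and toy counts are ours):

```python
import numpy as np

def d_jaccard(ni, nj):
    """Fraction of observed species present in only one of the two samples."""
    pi, pj = ni > 0, nj > 0
    return ((pi & ~pj) | (pj & ~pi)).sum() / (pi | pj).sum()

def d_bray_curtis(ni, nj):
    """Fraction of the total counts specific to either sample."""
    return np.abs(ni - nj).sum() / (ni + nj).sum()

ni = np.array([10, 0, 3, 5])  # counts of 4 species in sample i
nj = np.array([0, 2, 3, 5])   # counts of the same species in sample j
assert d_jaccard(ni, nj) == 0.5        # 2 one-sided species out of 4 observed
assert abs(d_bray_curtis(ni, nj) - 12 / 28) < 1e-12
```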
40. β-diversity data: phylogenetic dissimilarities
For each branch e, denote l_e its length and p_ei the fraction of counts in sample i corresponding to species below branch e.
Unifrac: the fraction of the tree specific to either sample i or sample j:
d_UF = ∑_e l_e (I{p_ei>0, p_ej=0} + I{p_ej>0, p_ei=0}) / ∑_e l_e I{p_ei+p_ej>0}
Weighted Unifrac: the fraction of the diversity specific to sample i or to sample j:
d_wUF = ∑_e l_e |p_ei − p_ej| / ∑_e l_e (p_ei + p_ej)
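Given per-branch lengths l_e and count fractions p_ei, both phylogenetic dissimilarities are one line each. A sketch following the formulas above (the tree values in the example are made up for illustration):

```python
import numpy as np

def d_unifrac(l, pi, pj):
    """Unweighted Unifrac: branch-length-weighted fraction of one-sided branches."""
    l, pi, pj = map(np.asarray, (l, pi, pj))
    one_sided = ((pi > 0) & (pj == 0)) | ((pj > 0) & (pi == 0))
    return (l * one_sided).sum() / (l * ((pi + pj) > 0)).sum()

def d_weighted_unifrac(l, pi, pj):
    """Weighted Unifrac: a branch-length-weighted Bray-Curtis on the p_e's."""
    l, pi, pj = map(np.asarray, (l, pi, pj))
    return (l * np.abs(pi - pj)).sum() / (l * (pi + pj)).sum()

l = np.array([1.0, 2.0, 1.0])    # branch lengths
pi = np.array([0.5, 0.5, 0.0])   # fraction of sample i's counts below each branch
pj = np.array([0.5, 0.0, 0.5])
assert d_unifrac(l, pi, pj) == 0.75          # (2 + 1) / (1 + 2 + 1)
assert d_weighted_unifrac(l, pi, pj) == 0.6  # 1.5 / 2.5
```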
41. TARA Oceans datasets
The 2009-2013 expedition
Co-directed by Étienne Bourgois and Éric Karsenti.
7,012 datasets collected from 35,000 samples of plankton and water (11,535 Gb of data).
Study of plankton: bacteria, protists, metazoans and viruses, representing more than 90% of the biomass in the ocean.
42. TARA Oceans datasets
Science (May 2015) - Studies on:
eukaryotic plankton diversity [de Vargas et al., 2015],
ocean viral communities [Brum et al., 2015],
the global plankton interactome [Lima-Mendez et al., 2015],
the global ocean microbiome [Sunagawa et al., 2015],
...
→ datasets of different types and from different sources, analyzed separately.
47. TARA Oceans datasets that we used
[Sunagawa et al., 2015, de Vargas et al., 2015, Brum et al., 2015]
Datasets used
environmental dataset: 22 numeric features (temperature, salinity, ...).
bacterial phylogenomic tree: computed from ∼35,000 OTUs.
bacterial functional composition: ∼63,000 KEGG orthologous groups.
eukaryotic plankton composition, split into 4 size groups: pico (0.8−5 µm), nano (5−20 µm), micro (20−180 µm) and meso (180−2000 µm).
virus composition: ∼867 virus clusters based on shared gene content.
48. TARA Oceans datasets that we used
Common samples
48 samples,
2 depth layers: surface (SRF) and deep chlorophyll maximum (DCM),
31 different sampling stations.
52. From multiple kernels to an integrated kernel
How to combine multiple kernels?
naive approach: K* = (1/M) ∑_m K^m
supervised framework: K* = ∑_m β_m K^m with β_m ≥ 0 and ∑_m β_m = 1, with the β_m chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011]
unsupervised framework, but with input space R^p [Zhuang et al., 2011]: K* = ∑_m β_m K^m with β_m ≥ 0 and ∑_m β_m = 1, with the β_m chosen so as to
minimize the distortion between all training data, ∑_{ij} K*(x_i, x_j) ‖x_i − x_j‖²;
AND minimize the approximation error of the original data by the kernel embedding, ∑_i ‖x_i − ∑_j K*(x_i, x_j) x_j‖².
Our proposal: 2 UMKL frameworks which do not require the data to take values in R^d.
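Whatever the weighting scheme, the combination step itself is simple: a convex combination of kernel matrices is again a valid kernel. A minimal sketch (the function name and toy matrices are ours):

```python
import numpy as np

def combine_kernels(kernels, beta=None):
    """K* = sum_m beta_m K^m with beta on the simplex (uniform by default);
    convex combinations of PSD matrices stay PSD."""
    M = len(kernels)
    beta = np.full(M, 1.0 / M) if beta is None else np.asarray(beta, float)
    assert np.all(beta >= 0) and np.isclose(beta.sum(), 1.0)
    return sum(b * K for b, K in zip(beta, kernels))

rng = np.random.default_rng(3)
Ks = []
for _ in range(3):
    A = rng.normal(size=(5, 5))
    Ks.append(A @ A.T)  # A A^T is symmetric PSD, hence a valid kernel

K_naive = combine_kernels(Ks)                     # the naive approach above
K_weighted = combine_kernels(Ks, [0.5, 0.3, 0.2])
assert np.all(np.linalg.eigvalsh(K_naive) > -1e-8)
```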
53. Multi-kernel/distances integration
How to “optimally” combine several relational datasets in an unsupervised setting?
For kernels K^1, ..., K^M obtained on the same n objects, search for: K_β = ∑_{m=1}^M β_m K^m with β_m ≥ 0 and ∑_m β_m = 1 [Mariette and Villa-Vialaneix, 2018]
R package mixKernel: https://cran.r-project.org/package=mixKernel
54. STATIS-like framework
[L’Hermier des Plantes, 1976, Lavit et al., 1994]
Similarities between kernels:
C_{mm'} = ⟨K^m, K^{m'}⟩_F / (‖K^m‖_F ‖K^{m'}‖_F) = Trace(K^m K^{m'}) / √(Trace((K^m)²) Trace((K^{m'})²))
(C_{mm'} is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the kernel framework)
maximize_v ∑_{m=1}^M ⟨K*(v), K^m / ‖K^m‖_F⟩_F = v^T C v
for K*(v) = ∑_{m=1}^M v_m K^m and v ∈ R^M such that ‖v‖_2 = 1.
Solution: the first eigenvector of C ⇒ set β = v / ∑_{m=1}^M v_m (consensual kernel).
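The STATIS-like weights can be sketched in a few lines: build the cosine matrix C between kernels, take its leading eigenvector, and rescale it to sum to one. This is an illustrative translation of the slide, not the mixKernel implementation:

```python
import numpy as np

def statis_weights(kernels):
    """C[m, m'] = Trace(K^m K^m') / (||K^m||_F ||K^m'||_F); the optimal v is
    the leading eigenvector of C, rescaled so the weights sum to one."""
    M = len(kernels)
    C = np.empty((M, M))
    for m in range(M):
        for mp in range(M):
            C[m, mp] = np.trace(kernels[m] @ kernels[mp]) / (
                np.linalg.norm(kernels[m]) * np.linalg.norm(kernels[mp]))
    v = np.linalg.eigh(C)[1][:, -1]  # eigenvector of the largest eigenvalue
    v = np.abs(v)  # entries of C are >= 0, so v can be taken non-negative
    return v / v.sum()

rng = np.random.default_rng(4)
Ks = []
for _ in range(3):
    A = rng.normal(size=(6, 6))
    Ks.append(A @ A.T)  # toy PSD kernels

beta = statis_weights(Ks)
assert np.all(beta >= 0) and np.isclose(beta.sum(), 1.0)
```

The weights favor the kernels most correlated (in the RV sense) with the others, hence the name "consensual kernel".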
55. A kernel preserving the original topology of the data I
Similarly to [Lin et al., 2010], preserve the local geometry of the data in the feature space.
Proxy of the local geometry:
K^m → G_k^m (k-nearest-neighbor graph) → A_k^m (adjacency matrix)
⇒ W = ∑_m I{A_k^m > 0} or W = ∑_m A_k^m
Feature-space geometry measured by:
Δ_i(β) = (⟨φ*_β(x_i), φ*_β(x_1)⟩, ..., ⟨φ*_β(x_i), φ*_β(x_n)⟩) = (K*_β(x_i, x_1), ..., K*_β(x_i, x_n))
56. A kernel preserving the original topology of the data II
Sparse version (solved with quadprog in R):
minimize_β ∑_{i,j=1}^N W_ij ‖Δ_i(β) − Δ_j(β)‖²
for K*_β = ∑_{m=1}^M β_m K^m and β ∈ R^M s.t. β_m ≥ 0 and ∑_{m=1}^M β_m = 1.
Non-sparse version (ADMM optimization [Boyd et al., 2011]):
minimize_v ∑_{i,j=1}^N W_ij ‖Δ_i(v) − Δ_j(v)‖²
for K*_v = ∑_{m=1}^M v_m K^m and v ∈ R^M s.t. v_m ≥ 0 and ‖v‖_2 = 1.
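Before optimizing, the criterion itself is easy to evaluate for a candidate β. A sketch of W and of the distortion term (our naming; the real method solves the quadratic program over β instead of just evaluating it):

```python
import numpy as np

def knn_adjacency(K, k):
    """k-nearest-neighbour adjacency A_k in the geometry induced by kernel K."""
    d2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K  # feature-space distances
    n = K.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        A[i, np.argsort(d2[i])[1:k + 1]] = 1  # skip the point itself
    return np.maximum(A, A.T)                 # symmetrize

def distortion(kernels, beta, k=3):
    """sum_ij W_ij ||Delta_i(beta) - Delta_j(beta)||^2, with Delta_i(beta)
    the i-th row of K*_beta and W the sum of the per-kernel k-NN adjacencies."""
    W = sum(knn_adjacency(K, k) for K in kernels)
    Kstar = sum(b * K for b, K in zip(beta, kernels))
    diff2 = ((Kstar[:, None, :] - Kstar[None, :, :]) ** 2).sum(axis=2)
    return float((W * diff2).sum())

rng = np.random.default_rng(5)
Ks = []
for _ in range(3):
    A = rng.normal(size=(8, 8))
    Ks.append(A @ A.T)  # toy PSD kernels

assert distortion(Ks, [1 / 3] * 3) >= 0.0
```

A low distortion means that samples that are neighbors under each individual kernel keep similar rows in the combined kernel, i.e., the local topology is preserved.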
58. Application to TARA Oceans
Similarity between datasets (STATIS)
Low similarities between meso-plankton (euk.meso) and the other datasets: strong geographical structure of mesoplanktonic communities [de Vargas et al., 2015].
Stronger similarities between environmental variables and small organisms than with larger ones [de Vargas et al., 2015, Sunagawa et al., 2015].
59. Integrating all TARA Oceans datasets
no particular pattern in terms of depth layers, but a clear pattern in terms of geography.
60. Application to TARA Oceans
Important variables
Rhizaria abundance strongly structures the differences between samples (analyses restricted to some organisms found differences mostly based on water depths), and waters from the Arctic and Pacific Oceans differ in terms of Rhizaria abundance.
61. Conclusions
Kernel methods are useful for:
dealing with different types of data
even when they are high-dimensional
combining them
However, they can be:
computationally intensive to train
not easy to interpret (work in progress with Jérôme Mariette and Céline Brouard on variable selection in an unsupervised setting)
62. SOMbrero
Madalina Olteanu, Fabrice Rossi, Marie Cottrell, Laura Bendhaïba and Julien Boelaert
SOMbrero and mixKernel
Jérôme Mariette
adjclust and Hi-C
Pierre Neuvial, Nathanaël Randriamihamison, Sylvain Foissac, Guillem Rigaill, Christophe Ambroise and Shubham Chaturvedi
63. Credits for pictures
Slide 3: image based on ENCODE project, by Darryl Leja (NHGRI), Ian Dunham
(EBI) and Michael Pazin (NHGRI)
Slide 8: k-means image from Wikimedia Commons by Weston.pace
Slide 10: Astraptes picture is from
https://www.flickr.com/photos/39139121@N00/2045403823/ by Anne Toal
(CC BY-SA 2.0), Hi-C experiment is taken from the article Matharu et al., 2015
DOI:10.1371/journal.pgen.1005640 (CC BY-SA 4.0) and metagenomics illustration is
taken from the article Sommer et al., 2010 DOI:10.1038/msb.2010.16 (CC BY-NC-SA
3.0)
Other pictures are from articles that I co-authored.
64. References
Ambroise, C., Dehman, A., Neuvial, P., Rigaill, G., and Vialaneix, N. (2019).
Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics.
Algorithms for Molecular Biology, 14:22.
Bach, F. (2013).
Sharp analysis of low-rank kernel matrix approximations.
Journal of Machine Learning Research, Workshop and Conference Proceedings, 30:185–209.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011).
Distributed optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122.
Brouard, C., Shen, H., Dürkop, K., d’Alché Buc, F., Böcker, S., and Rousu, J. (2016).
Fast metabolite identification with input output kernel regression.
Bioinformatics, 32(12):i28–i36.
Brum, J., Ignacio-Espinoza, J., Roux, S., Doulcier, G., Acinas, S., Alberti, A., Chaffron, S., Cruaud, C., de Vargas, C., Gasol, J.,
Gorsky, G., Gregory, A., Guidi, L., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Poulos, B., Schwenck, S., Speich, S.,
Dimier, C., Kandels-Lewis, S., Picheral, M., Searson, S., Tara Oceans coordinators, Bork, P., Bowler, C., Sunagawa, S., Wincker,
P., Karsenti, E., and Sullivan, M. (2015).
Patterns and ecological drivers of ocean viral communities.
Science, 348(6237).
Cortes, C., Mohri, M., and Talwalkar, A. (2010).
On the impact of kernel approximation on learning accuracy.
Journal of Machine Learning Research, Workshop and Conference Proceedings, 9:113–120.
Crone, L. and Crosby, D. (1995).
Statistical applications of a metric on subspaces to satellite meteorology.
Technometrics, 37(3):324–328.
65. de Vargas, C., Audic, S., Henry, N., Decelle, J., Mahé, P., Logares, R., Lara, E., Berney, C., Le Bescot, N., Probert, I.,
Carmichael, M., Poulain, J., Romac, S., Colin, S., Aury, J., Bittner, L., Chaffron, S., Dunthorn, M., Engelen, S., Flegontova, O.,
Guidi, L., Horák, A., Jaillon, O., Lima-Mendez, G., Lukeš, J., Malviya, S., Morard, R., Mulot, M., Scalco, E., Siano, R., Vincent, F.,
Zingone, A., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Acinas, S., Bork, P., Bowler, C.,
Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Raes, J., Sieracki, M. E., Speich, S.,
Stemmann, L., Sunagawa, S., Weissenbach, J., Wincker, P., and Karsenti, E. (2015).
Eukaryotic plankton diversity in the sunlit ocean.
Science, 348(6237).
Drineas, P. and Mahoney, M. (2005).
On the Nyström method for approximating a Gram matrix for improved kernel-based learning.
Journal of Machine Learning Research, 6:2153–2175.
Goldfarb, L. (1984).
A unified approach to pattern recognition.
Pattern Recognition, 17(5):575–582.
Gönen, M. and Alpaydin, E. (2011).
Multiple kernel learning algorithms.
Journal of Machine Learning Research, 12:2211–2268.
Imbert, A., Valsesia, A., Le Gall, C., Armenise, C., Lefebvre, G., Gourraud, P., Viguerie, N., and Villa-Vialaneix, N. (2018).
Multiple hot-deck imputation for network inference from RNA sequencing data.
Bioinformatics, 34(10):1726–1732.
Jaakkola, T., Diekhans, M., and Haussler, D. (2000).
A discriminative framework for detecting remote protein homologies.
Journal of Computational Biology, 7(1-2):95–114.
Kohonen, T. (2001).
Self-Organizing Maps, 3rd Edition, volume 30.
Springer, Berlin, Heidelberg, New York.
Kondor, R. and Lafferty, J. (2002).
66. Diffusion kernels on graphs and other discrete structures.
In Sammut, C. and Hoffmann, A., editors, Proceedings of the 19th International Conference on Machine Learning, pages
315–322, Sydney, Australia. Morgan Kaufmann Publishers Inc. San Francisco, CA, USA.
Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994).
The ACT (STATIS method).
Computational Statistics and Data Analysis, 18(1):97–119.
L’Hermier des Plantes, H. (1976).
Structuration des tableaux à trois indices de la statistique.
PhD thesis, Université de Montpellier.
Thèse de troisième cycle.
Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J., Roux, S., Vincent, F.,
Bittner, L., Darzi, Y., Wang, B., Audic, S., Berline, L., Bontempi, G., Cabello, A., Coppola, L., Cornejo-Castillo, F., d’Oviedo, F.,
de Meester, L., Ferrera, I., Garet-Delmas, M., Guidi, L., Lara, E., Pesant, S., Royo-Llonch, M., Salazar, F., Sánchez, P.,
Sebastian, M., Souffreau, C., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Gorsky, G.,
Not, F., Ogata, H., Speich, S., Stemmann, L., Weissenbach, J., Wincker, P., Acinas, S., Sunagawa, S., Bork, P., Sullivan, M.,
Karsenti, E., Bowler, C., de Vargas, C., and Raes, J. (2015).
Determinants of community structure in the global plankton interactome.
Science, 348(6237).
Lin, Y.-Y., Liu, T.-L., and Fuh, C.-S. (2010).
Multiple kernel learning for dimensionality reduction.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1147–1160.
Mariette, J., Olteanu, M., and Villa-Vialaneix, N. (2017a).
Efficient interpretable variants of online SOM for large dissimilarity data.
Neurocomputing, 225:31–48.
Mariette, J., Rossi, F., Olteanu, M., and Villa-Vialaneix, N. (2017b).
Accelerating stochastic kernel SOM.
In Verleysen, M., editor, XXVth European Symposium on Artificial Neural Networks, Computational Intelligence and Machine
Learning (ESANN 2017), pages 269–274, Bruges, Belgium. i6doc.
Mariette, J. and Vialaneix, N. (2019).
Approches à noyau pour l’analyse et l’intégration de données omiques en biologie des systèmes.
Forthcoming (book chapter).
Mariette, J. and Villa-Vialaneix, N. (2018).
Unsupervised multiple kernel learning for heterogeneous data integration.
Bioinformatics, 34(6):1009–1015.
Marti-Marimon, M., Vialaneix, N., Voillet, V., Yerle-Bouissou, M., Lahbib-Mansais, Y., and Liaubet, L. (2018).
A new approach of gene co-expression network inference reveals significant biological processes involved in porcine muscle
development in late gestation.
Scientific Reports, 8:10150.
Montastier, E., Villa-Vialaneix, N., Caspar-Bauguil, S., Hlavaty, P., Tvrzicka, E., Gonzalez, I., Saris, W., Langin, D., Kunesova, M.,
and Viguerie, N. (2015).
System model network for adipose tissue signatures related to weight changes in response to calorie restriction and subsequent
weight maintenance.
PLoS Computational Biology, 11(1):e1004047.
Olteanu, M. and Villa-Vialaneix, N. (2015).
On-line relational and multiple relational SOM.
Neurocomputing, 147:15–30.
Randriamihamison, N., Vialaneix, N., and Neuvial, P. (2019).
Applicability and interpretability of hierarchical agglomerative clustering with or without contiguity constraints.
Submitted for publication. Preprint arXiv 1909.10923.
Robert, P. and Escoufier, Y. (1976).
A unifying tool for linear multivariate statistical methods: the RV-coefficient.
Applied Statistics, 25(3):257–265.
Rossi, F., Hasenfuss, A., and Hammer, B. (2007).
Accelerating relational clustering algorithms with sparse prototype representation.
In Proceedings of the 6th Workshop on Self-Organizing Maps (WSOM 07), Bielefeld, Germany. Neuroinformatics Group, Bielefeld University.
Saigo, H., Vert, J.-P., Ueda, N., and Akutsu, T. (2004).
Protein homology detection using string alignment kernels.
Bioinformatics, 20(11):1682–1689.
Shen, H., Dührkop, K., Böcker, S., and Rousu, J. (2014).
Metabolite identification through multiple kernel learning on fragmentation trees.
Bioinformatics, 30(12):i157–i164.
Sommer, M., Church, G., and Dantas, G. (2010).
A functional metagenomic approach for expanding the synthetic biology toolbox for biomass conversion.
Molecular Systems Biology, 6(360).
Sunagawa, S., Coelho, L., Chaffron, S., Kultima, J., Labadie, K., Salazar, F., Djahanschiri, B., Zeller, G., Mende, D., Alberti, A.,
Cornejo-Castillo, F., Costea, P., Cruaud, C., d’Oviedo, F., Engelen, S., Ferrera, I., Gasol, J., Guidi, L., Hildebrand, F., Kokoszka,
F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral,
M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P.,
Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M., Weissenbach, J., Wincker, P.,
Karsenti, E., Raes, J., Acinas, S., and Bork, P. (2015).
Structure and function of the global ocean microbiome.
Science, 348(6237).
Villa, N. and Rossi, F. (2007).
A comparison between dissimilarity SOM and kernel SOM for clustering the vertices of a graph.
In 6th International Workshop on Self-Organizing Maps (WSOM 2007), Bielefeld, Germany. Neuroinformatics Group, Bielefeld University.
Williams, C. and Seeger, M. (2000).
Using the Nyström method to speed up kernel machines.
In Leen, T., Dietterich, T., and Tresp, V., editors, Advances in Neural Information Processing Systems (Proceedings of NIPS
2000), volume 13, Denver, CO, USA. Neural Information Processing Systems Foundation.
Zhuang, J., Wang, J., Hoi, S., and Lan, X. (2011).
Unsupervised multiple kernel clustering.
Journal of Machine Learning Research: Workshop and Conference Proceedings, 20:129–144.
Optimization issues
The sparse version writes $\min_\beta \beta^\top S \beta$ s.t. $\beta \geq 0$ and $\|\beta\|_1 = \sum_m \beta_m = 1$ ⇒ a standard QP problem with linear constraints (e.g., package quadprog in R).
The non-sparse version writes $\min_\beta \beta^\top S \beta$ s.t. $\beta \geq 0$ and $\|\beta\|_2 = 1$ ⇒ a QPQC problem (hard to solve).
The QPQC problem is solved using the Alternating Direction Method of Multipliers (ADMM) [Boyd et al., 2011], by replacing the previous optimization problem with
$$\min_{x,z}\; x^\top S x + \mathbb{1}_{\{x \geq 0\}}(x) + \mathbb{1}_{\{\|z\|_2^2 \geq 1\}}(z) \quad \text{subject to } x - z = 0.$$
The ADMM iterations [Boyd et al., 2011] then alternate three steps:
1. $\min_x\; x^\top S x + y^\top (x - z) + \frac{\lambda}{2}\|x - z\|^2$ under the constraint $x \geq 0$ (a standard QP problem);
2. projection onto $\{\|z\|_2 \geq 1\}$: $z = x / \min\{\|x\|_2, 1\}$;
3. update of the auxiliary variable: $y = y + \lambda(x - z)$.
A proposal to improve interpretability of K-PCA in our framework
Issue: how to assess the importance of a given species in the K-PCA?
Our datasets are either numeric (environmental) or built from an n × p count matrix. ⇒ For a given species, randomly permute its counts and re-do the analysis (kernel computation, with the same optimized weights, and K-PCA). The influence of a given species in a given dataset on a given PC subspace is then assessed by computing the Crone–Crosby distance between the two PCA subspaces [Crone and Crosby, 1995] (≈ the Frobenius norm between the projectors).
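The permutation scheme above can be sketched in a few lines of numpy. The kernel (a plain linear kernel on the count matrix), the data, and all function names here are illustrative assumptions, and the Crone–Crosby distance is implemented as the Frobenius norm between the two orthogonal projectors, scaled by 1/sqrt(2).

```python
import numpy as np

def kpca_subspace(K, d=2):
    """Leading d-dimensional eigen-subspace of a centered kernel matrix."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering operator
    Kc = J @ K @ J
    _, vecs = np.linalg.eigh(Kc)          # eigenvalues in ascending order
    return vecs[:, -d:]                   # top-d eigenvectors (orthonormal)

def crone_crosby(U, V):
    """Crone-Crosby distance between the subspaces spanned by U and V."""
    return np.linalg.norm(U @ U.T - V @ V.T, "fro") / np.sqrt(2.0)

def species_importance(X, j, d=2, seed=0):
    """Permute the counts of species (column) j, recompute the kernel and the
    K-PCA, and return the Crone-Crosby distance to the original PC subspace."""
    rng = np.random.default_rng(seed)
    K = X @ X.T                           # toy linear kernel on the count matrix
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # permute the species' counts
    Kp = Xp @ Xp.T
    return crone_crosby(kpca_subspace(K, d), kpca_subspace(Kp, d))

# Example (hypothetical data): importance of species 1 in a 4 x 3 count matrix
X = np.arange(12.0).reshape(4, 3)
imp = species_importance(X, j=1)
```

With this normalization the distance is 0 for identical subspaces and 1 for orthogonal one-dimensional subspaces, so species whose permutation barely moves the PC subspace score near 0.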