Integrating Tara Oceans datasets using unsupervised
multiple kernel learning
Nathalie Villa-Vialaneix
Joint work with Jérôme Mariette
http://www.nathalievilla.org
Séminaire de Probabilité et Statistique
Laboratoire J.A. Dieudonné, Université de Nice
Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 1/41
Outline
1 Metagenomic datasets and associated questions
2 A typical (and rich) case study: TARA Oceans datasets
3 A UMKL framework for integrating multiple metagenomic data
4 Application to TARA Oceans datasets
What are metagenomic data?
Source: [Sommer et al., 2010]
abundance data: sparse n × p matrices of count data, with samples in rows and descriptors (species, OTUs, KEGG groups, k-mers, ...) in columns. Generally p ≫ n.
phylogenetic tree (evolutionary history between species, OTUs, ...): one tree with p leaves, built from the sequences collected in the n samples.
What are metagenomic data used for?
produce a profile of the diversity of a given sample ⇒ makes it possible to compare diversity between various conditions
used in various fields: environmental science, microbiota, ...
Processed by computing a relevant dissimilarity between samples (the standard Euclidean distance is not relevant) and by using this dissimilarity in subsequent analyses.
β-diversity data: dissimilarities between count data
Compositional dissimilarities: $(n_{ig})$ is the count of species $g$ for sample $i$
Jaccard: the fraction of species specific to either sample $i$ or $j$:
$$d_{\text{jac}} = \frac{\sum_g \left( \mathbb{I}_{\{n_{ig}>0,\, n_{jg}=0\}} + \mathbb{I}_{\{n_{jg}>0,\, n_{ig}=0\}} \right)}{\sum_g \mathbb{I}_{\{n_{ig}+n_{jg}>0\}}}$$
Bray-Curtis: the fraction of the sample which is specific to either sample $i$ or $j$:
$$d_{\text{BC}} = \frac{\sum_g |n_{ig} - n_{jg}|}{\sum_g (n_{ig} + n_{jg})}$$
Other dissimilarities are available in the R package phyloseq, most of them not Euclidean.
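As an illustration of the two formulas above, here is a minimal pure-Python sketch. The function names and the toy count vectors are hypothetical; in practice these dissimilarities would usually be computed with a dedicated package such as phyloseq.

```python
def jaccard(n_i, n_j):
    """Fraction of observed species that are present in only one of the two samples."""
    specific = sum(1 for a, b in zip(n_i, n_j) if (a > 0) != (b > 0))
    observed = sum(1 for a, b in zip(n_i, n_j) if a + b > 0)
    return specific / observed if observed else 0.0


def bray_curtis(n_i, n_j):
    """Sum of absolute count differences divided by the total count of both samples."""
    diff = sum(abs(a - b) for a, b in zip(n_i, n_j))
    total = sum(a + b for a, b in zip(n_i, n_j))
    return diff / total if total else 0.0


# Toy counts for 4 species in two samples (illustrative values only).
sample_i = [10, 0, 5, 0]
sample_j = [0, 3, 5, 0]
d_jac = jaccard(sample_i, sample_j)      # 2 specific species out of 3 observed
d_bc = bray_curtis(sample_i, sample_j)   # 13 / 23
```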
β-diversity data: phylogenetic dissimilarities
Phylogenetic dissimilarities
For each branch $e$, denote by $l_e$ its length and by $p_{ei}$ the fraction of counts in sample $i$ corresponding to species below branch $e$.
Unifrac: the fraction of the tree specific to either sample $i$ or sample $j$:
$$d_{\text{UF}} = \frac{\sum_e l_e \left( \mathbb{I}_{\{p_{ei}>0,\, p_{ej}=0\}} + \mathbb{I}_{\{p_{ej}>0,\, p_{ei}=0\}} \right)}{\sum_e l_e\, \mathbb{I}_{\{p_{ei}+p_{ej}>0\}}}$$
Weighted Unifrac: the fraction of the diversity specific to sample $i$ or to sample $j$:
$$d_{\text{wUF}} = \frac{\sum_e l_e |p_{ei} - p_{ej}|}{\sum_e (p_{ei} + p_{ej})}$$
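The two Unifrac formulas can be sketched in a few lines once the tree has been summarized as branch lengths and per-branch fractions. The sketch below assumes these summaries are already available as plain lists (the function names and toy values are hypothetical); the denominator of the weighted version follows the formula as written on the slide.

```python
def unifrac(lengths, p_i, p_j):
    """Share of the covered branch length that is specific to one sample."""
    specific = sum(l for l, a, b in zip(lengths, p_i, p_j) if (a > 0) != (b > 0))
    covered = sum(l for l, a, b in zip(lengths, p_i, p_j) if a + b > 0)
    return specific / covered if covered else 0.0


def weighted_unifrac(lengths, p_i, p_j):
    """Branch-length-weighted differences of fractions, normalized as on the slide."""
    num = sum(l * abs(a - b) for l, a, b in zip(lengths, p_i, p_j))
    den = sum(a + b for a, b in zip(p_i, p_j))
    return num / den if den else 0.0


# Toy tree with 3 branches (illustrative values only).
lengths = [1.0, 2.0, 0.5]
p_i = [1.0, 0.0, 0.5]
p_j = [0.0, 1.0, 0.5]
d_uf = unifrac(lengths, p_i, p_j)            # (1 + 2) / (1 + 2 + 0.5)
d_wuf = weighted_unifrac(lengths, p_i, p_j)  # (1 + 2 + 0) / 3
```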
TARA Oceans datasets
The 2009-2013 expedition
Co-directed by Étienne Bourgois and Éric Karsenti.
7,012 datasets collected from 35,000 samples of plankton and water (11,535 Gb of data).
Study the plankton: bacteria, protists, metazoans and viruses, representing more than 90% of the biomass in the ocean.
TARA Oceans datasets
Science (May 2015) - Studies on:
eukaryotic plankton diversity [de Vargas et al., 2015],
ocean viral communities [Brum et al., 2015],
global plankton interactome [Lima-Mendez et al., 2015],
global ocean microbiome [Sunagawa et al., 2015],
...
→ datasets of different types and from different sources, analyzed separately.
Background of this talk
Objectives
Until now: many papers using many methods. No integrated analysis
performed.
What do the datasets reveal if integrated in a single analysis?
Our purpose: develop a generic method to integrate phylogenetic,
taxonomic and functional community composition together with
environmental factors.
TARA Oceans datasets that we used
Datasets used
environmental dataset: 22 numeric features (temperature, salinity, ...) [Sunagawa et al., 2015].
bacteria phylogenomic tree: computed from ∼35,000 OTUs [Sunagawa et al., 2015].
bacteria functional composition: ∼63,000 KEGG orthologous groups [Sunagawa et al., 2015].
eukaryotic plankton composition, split into 4 groups: pico (0.8 − 5 µm), nano (5 − 20 µm), micro (20 − 180 µm) and meso (180 − 2000 µm) [de Vargas et al., 2015].
virus composition: ∼867 virus clusters based on shared gene content [Brum et al., 2015].
TARA Oceans datasets that we used
Common samples
48 samples,
2 depth layers: surface (SRF) and deep chlorophyll maximum (DCM),
31 different sampling stations.
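Kernel methods
Kernel viewed as the dot product in an implicit Hilbert space
$K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $K(x_i, x_j) = K(x_j, x_i)$ and, $\forall\, m \in \mathbb{N}$, $\forall\, x_1, \ldots, x_m \in \mathcal{X}$, $\forall\, \alpha_1, \ldots, \alpha_m \in \mathbb{R}$,
$$\sum_{i,j=1}^{m} \alpha_i \alpha_j K(x_i, x_j) \geq 0.$$
⇒ [Aronszajn, 1950]
$$\exists!\, (\mathcal{H}, \langle \cdot, \cdot \rangle),\ \phi : \mathcal{X} \to \mathcal{H} \ \text{such that}\ K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$$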
Exploratory analysis with kernels
A well-known example: kernel PCA [Schölkopf et al., 1998]
PCA performed in the feature space induced by the kernel $K$.
In practice:
$K$ is centered: $K \leftarrow K - \frac{1}{N} K \mathbb{I}_N - \frac{1}{N} \mathbb{I}_N K + \frac{1}{N^2} \mathbb{I}_N K \mathbb{I}_N$, with $\mathbb{I}_N$ the $N \times N$ matrix of ones;
K-PCA is performed by the eigen-decomposition of the (centered) $K$.
If $(\alpha_k)_{k=1,\ldots,N} \in \mathbb{R}^N$ and $(\lambda_k)_{k=1,\ldots,N}$ are the eigenvectors and eigenvalues, the PC axes are
$$a_k = \sum_{i=1}^{N} \alpha_{ki}\, \phi(x_i)$$
and the $(a_k)_k$ are orthonormal in the feature space induced by the kernel:
$$\forall\, k, k', \quad \langle a_k, a_{k'} \rangle = \alpha_k^\top K \alpha_{k'} = \delta_{kk'} \quad \text{with} \quad \delta_{kk'} = \begin{cases} 0 & \text{if } k \neq k' \\ 1 & \text{otherwise.} \end{cases}$$
Coordinates of the projections of the observations $(\phi(x_i))_i$:
$$\langle a_k, \phi(x_i) \rangle = \sum_{j=1}^{N} \alpha_{kj} K_{ji} = K_{i.}\, \alpha_k = \lambda_k \alpha_{ki},$$
where $K_{i.}$ is the $i$-th row of $K$.
No representation for the variables (no real variables...).
Other unsupervised kernel methods: kernel SOM [Olteanu and Villa-Vialaneix, 2015, Mariette et al., 2017]
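The K-PCA steps described above (centering, eigen-decomposition, projection) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: it extracts only the first axis, uses power iteration as a stand-in for a full eigen-decomposition, and all function names and toy data are hypothetical.

```python
import math


def center_kernel(K):
    """Double-center a kernel matrix given as a list of lists."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]


def first_eigenpair(K, n_iter=500):
    """Leading eigenpair of a symmetric positive matrix by power iteration."""
    n = len(K)
    u = [1.0 + i for i in range(n)]  # arbitrary non-constant starting vector
    for _ in range(n_iter):
        w = [sum(K[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    lam = sum(u[i] * sum(K[i][j] * u[j] for j in range(n)) for i in range(n))
    return lam, u


# Toy example: linear kernel on the 1-D points 0, 1, 3.
x = [0.0, 1.0, 3.0]
K = [[a * b for b in x] for a in x]
Kc = center_kernel(K)
lam, u = first_eigenpair(Kc)
# Rescale the unit eigenvector so that alpha' K alpha = 1 (orthonormal axis).
alpha = [ui / math.sqrt(lam) for ui in u]
# Coordinates of the observations on the first axis: lambda * alpha.
coords = [lam * a for a in alpha]
```

With a linear kernel, the coordinates recovered this way coincide (up to sign) with the centered input points, which is a convenient sanity check.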
Usefulness of K-PCA
Non linear PCA
Source: By Petter Strandmark - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=3936753
[Mariette et al., 2017] K-PCA for non-numeric datasets, here qualitative time series: job trajectories after graduation from the French survey “Generation 98” [Cottrell and Letrémy, 2005].
Color is the mode of the trajectories.
From multiple dissimilarities to multiple kernels
1 several (non-Euclidean) dissimilarities $D^1, \ldots, D^M$, transformed into similarities with [Lee and Verleysen, 2007]:
$$K^m(x_i, x_j) = -\frac{1}{2} \left( D^m(x_i, x_j) - \frac{1}{N} \sum_{k=1}^{N} \left( D^m(x_i, x_k) + D^m(x_k, x_j) \right) + \frac{1}{N^2} \sum_{k,k'=1}^{N} D^m(x_k, x_{k'}) \right)$$
2 if the resulting similarity is not positive, clipping or flipping (removing the negative part of the eigenvalue decomposition, or taking its opposite) produces a kernel [Chen et al., 2009].
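The transformation in step 1 is a double centering of the dissimilarity matrix. The sketch below is a minimal pure-Python illustration with a hypothetical function name; it assumes a symmetric dissimilarity matrix and does not cover step 2 (clipping or flipping), which would require an eigen-decomposition. The toy input uses squared Euclidean distances, for which the result is the Gram matrix of the centered points.

```python
def dissimilarity_to_similarity(D):
    """Double-center a symmetric dissimilarity matrix and multiply by -1/2."""
    n = len(D)
    row = [sum(r) / n for r in D]
    col = [sum(D[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[-0.5 * (D[i][j] - row[i] - col[j] + tot) for j in range(n)]
            for i in range(n)]


# Toy example: squared Euclidean distances between the 1-D points 0, 1, 3.
x = [0.0, 1.0, 3.0]
D = [[(a - b) ** 2 for b in x] for a in x]
K = dissimilarity_to_similarity(D)

# For comparison: Gram matrix of the centered points.
mean = sum(x) / len(x)
gram = [[(a - mean) * (b - mean) for b in x] for a in x]
```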
From multiple kernels to an integrated kernel
How to combine multiple kernels?
naive approach: $K^* = \frac{1}{M} \sum_m K^m$
supervised framework: $K^* = \sum_m \beta_m K^m$ with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$, with $\beta_m$ chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011]
unsupervised framework, but the input space is $\mathbb{R}^d$ [Zhuang et al., 2011]: $K^* = \sum_m \beta_m K^m$ with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$, with $\beta_m$ chosen so as to
minimize the distortion between all training data, $\sum_{ij} K^*(x_i, x_j) \| x_i - x_j \|^2$;
AND minimize the error in the approximation of the original data by the kernel embedding, $\sum_i \left\| x_i - \sum_j K^*(x_i, x_j)\, x_j \right\|^2$.
Our proposal: 2 UMKL frameworks which do not require data to have values in $\mathbb{R}^d$.
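STATIS-like framework
[L’Hermier des Plantes, 1976, Lavit et al., 1994]
Similarities between kernels:
$$C_{mm'} = \frac{\langle K^m, K^{m'} \rangle_F}{\| K^m \|_F \, \| K^{m'} \|_F} = \frac{\operatorname{Trace}(K^m K^{m'})}{\sqrt{\operatorname{Trace}((K^m)^2)\, \operatorname{Trace}((K^{m'})^2)}}.$$
($C_{mm'}$ is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the kernel framework)
$$\text{maximize} \quad \sum_{m=1}^{M} \left\langle K^*(v), \frac{K^m}{\| K^m \|_F} \right\rangle_F = v^\top C v$$
for $K^*(v) = \sum_{m=1}^{M} v_m K^m$ and $v \in \mathbb{R}^M$ such that $\| v \|_2 = 1$.
Solution: first eigenvector of $C$ ⇒ set $\beta = \frac{v}{\sum_{m=1}^{M} v_m}$ (consensual kernel).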
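The solution above (first eigenvector of $C$, rescaled to sum to one) can be sketched as follows. This is a minimal pure-Python illustration with hypothetical function names; it uses power iteration in place of a full eigen-decomposition, which is adequate here because $C$ has non-negative entries when the kernels are positive.

```python
import math


def frobenius_inner(A, B):
    """Frobenius inner product of two matrices given as lists of lists."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def kernel_similarities(kernels):
    """Matrix C of cosine similarities between kernels."""
    norms = [math.sqrt(frobenius_inner(K, K)) for K in kernels]
    return [[frobenius_inner(Ka, Kb) / (na * nb) for Kb, nb in zip(kernels, norms)]
            for Ka, na in zip(kernels, norms)]


def statis_weights(kernels, n_iter=1000):
    """First eigenvector of C (power iteration), rescaled to sum to 1."""
    C = kernel_similarities(kernels)
    M = len(C)
    v = [1.0] * M
    for _ in range(n_iter):
        w = [sum(C[a][b] * v[b] for b in range(M)) for a in range(M)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    total = sum(v)
    return [x / total for x in v]


# Toy example: two identical kernels and a third, different one.
K1 = [[1.0, 0.0], [0.0, 1.0]]
K2 = [[1.0, 0.0], [0.0, 1.0]]
K3 = [[1.0, 1.0], [1.0, 1.0]]
beta = statis_weights([K1, K2, K3])
```

In this toy case the two identical kernels receive equal weights that are larger than the weight of the third kernel, which matches the consensus interpretation.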
A kernel preserving the original topology of the data I
From an idea similar to that of [Lin et al., 2010], find a kernel such that the local geometry of the data in the feature space is similar to that of the original data.
Proxy of the local geometry
$K^m \longrightarrow \mathcal{G}^m_k$ ($k$-nearest neighbors graph) $\longrightarrow A^m_k$ (adjacency matrix)
⇒ $W = \sum_m \mathbb{I}_{\{A^m_k > 0\}}$ or $W = \sum_m A^m_k$
Adjacency matrix image from: By S. Mohammad H. Oloomi, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=35313532
Feature space geometry measured by
$$\Delta_i(\beta) = \left\langle \phi^*_\beta(x_i), \begin{pmatrix} \phi^*_\beta(x_1) \\ \vdots \\ \phi^*_\beta(x_N) \end{pmatrix} \right\rangle = \begin{pmatrix} K^*_\beta(x_i, x_1) \\ \vdots \\ K^*_\beta(x_i, x_N) \end{pmatrix}$$
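A kernel preserving the original topology of the data II
Sparse version
$$\text{minimize} \quad \sum_{i,j=1}^{N} W_{ij} \left\| \Delta_i(\beta) - \Delta_j(\beta) \right\|^2$$
for $K^*_\beta = \sum_{m=1}^{M} \beta_m K^m$ and $\beta \in \mathbb{R}^M$ such that $\beta_m \geq 0$ and $\sum_{m=1}^{M} \beta_m = 1$.
$$\Leftrightarrow \quad \text{minimize} \quad \sum_{m,m'=1}^{M} \beta_m \beta_{m'} S_{mm'}$$
for $\beta \in \mathbb{R}^M$ such that $\beta_m \geq 0$ and $\sum_{m=1}^{M} \beta_m = 1$, with
$$S_{mm'} = \sum_{i,j=1}^{N} W_{ij} \left\langle \Delta^m_i - \Delta^m_j,\ \Delta^{m'}_i - \Delta^{m'}_j \right\rangle \quad \text{and} \quad \Delta^m_i = \begin{pmatrix} K^m(x_i, x_1) \\ \vdots \\ K^m(x_i, x_N) \end{pmatrix}.$$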
Non-sparse version
$$\text{minimize} \quad \sum_{i,j=1}^{N} W_{ij} \left\| \Delta_i(v) - \Delta_j(v) \right\|^2$$
for $K^*_v = \sum_{m=1}^{M} v_m K^m$ and $v \in \mathbb{R}^M$ such that $v_m \geq 0$ and $\| v \|_2 = 1$.
$$\Leftrightarrow \quad \text{minimize} \quad \sum_{m,m'=1}^{M} v_m v_{m'} S_{mm'}$$
for $v \in \mathbb{R}^M$ such that $v_m \geq 0$ and $\| v \|_2 = 1$, with
$$S_{mm'} = \sum_{i,j=1}^{N} W_{ij} \left\langle \Delta^m_i - \Delta^m_j,\ \Delta^{m'}_i - \Delta^{m'}_j \right\rangle \quad \text{and} \quad \Delta^m_i = \begin{pmatrix} K^m(x_i, x_1) \\ \vdots \\ K^m(x_i, x_N) \end{pmatrix}.$$
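Both versions rely on the same two ingredients: the graph weights $W$ and the matrix $S$. The sketch below shows one way these could be computed in pure Python. The function names are hypothetical, and two choices are assumptions rather than details given in the slides: neighbors are ranked with the kernel-induced squared distance $K_{ii} + K_{jj} - 2K_{ij}$, and the $k$-nearest neighbors graph is symmetrized.

```python
def knn_adjacency(K, k):
    """Symmetrized k-nearest-neighbors adjacency matrix derived from a kernel."""
    n = len(K)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        # Squared feature-space distance induced by the kernel (assumption).
        dist = [(K[i][i] + K[j][j] - 2 * K[i][j], j) for j in range(n) if j != i]
        for _, j in sorted(dist)[:k]:
            A[i][j] = A[j][i] = 1
    return A


def graph_weights(kernels, k):
    """W = sum over kernels of the k-NN adjacency matrices."""
    n = len(kernels[0])
    adj = [knn_adjacency(K, k) for K in kernels]
    return [[sum(A[i][j] for A in adj) for j in range(n)] for i in range(n)]


def s_matrix(kernels, W):
    """S[a][b] = sum_ij W_ij <Delta_i^a - Delta_j^a, Delta_i^b - Delta_j^b>."""
    n, M = len(W), len(kernels)
    S = [[0.0] * M for _ in range(M)]
    for i in range(n):
        for j in range(n):
            if W[i][j] == 0:
                continue
            # Delta_i^m is the i-th row of kernel m.
            diffs = [[K[i][t] - K[j][t] for t in range(n)] for K in kernels]
            for a in range(M):
                for b in range(M):
                    S[a][b] += W[i][j] * sum(p * q for p, q in zip(diffs[a], diffs[b]))
    return S


# Toy example: two linear kernels built on different 1-D descriptions.
x = [0.0, 1.0, 3.0]
y = [0.0, 2.0, 1.0]
K1 = [[a * b for b in x] for a in x]
K2 = [[a * b for b in y] for a in y]
W = graph_weights([K1, K2], k=1)
S = s_matrix([K1, K2], W)
```

With $S$ in hand, the objective for any weight vector is simply the quadratic form in $S$, which is what makes the problem a quadratic program.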
Optimization issues
Sparse version writes $\min_\beta \beta^\top S \beta$ such that $\beta \geq 0$ and $\| \beta \|_1 = \sum_m \beta_m = 1$ ⇒ standard QP problem with linear constraints (e.g., package quadprog in R).
Non-sparse version writes $\min_\beta \beta^\top S \beta$ such that $\beta \geq 0$ and $\| \beta \|_2 = 1$ ⇒ QCQP (quadratically constrained quadratic program) problem (hard to solve).
Equivalent to the following problem: $\min_{\beta, B} \operatorname{Trace}(S_2 X)$ such that $\operatorname{Trace}(A X) = 1$, $\operatorname{Trace}(A_j X) \geq 0$ and $B = \beta \beta^\top$, with:
$$X = \begin{pmatrix} 1 & \beta^\top \\ \beta & B \end{pmatrix}, \quad A = \begin{pmatrix} 0 & 0_M^\top \\ 0_M & I_M \end{pmatrix}, \quad A_j = \begin{pmatrix} 0 & 1_j^\top \\ 1_j & 0_{MM} \end{pmatrix}.$$
Relaxed into the following problem: $\min_{\beta, B} \operatorname{Trace}(S_2 X)$ such that $\operatorname{Trace}(A X) = 1$ and $\operatorname{Trace}(A_j X) \geq 0$, where $X$ (same block form as above) is positive semi-definite.
Semi-definite programming ⇒ efficient solvers exist.
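For the sparse version, any QP solver handling simplex constraints can be used. As a dependency-free illustration, the sketch below swaps in projected gradient descent with a Euclidean projection onto the simplex. This is not the quadprog approach mentioned above, only a simple stand-in, and the function names and toy matrices are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection onto {b : b >= 0, sum(b) = 1}."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for r, ur in enumerate(u, start=1):
        css += ur
        t = (css - 1.0) / r
        if ur - t > 0:
            theta = t  # keep the threshold of the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def solve_simplex_qp(S, n_iter=5000):
    """Minimize beta' S beta over the simplex by projected gradient descent."""
    M = len(S)
    beta = [1.0 / M] * M
    # Upper bound on the largest eigenvalue of 2S, used as inverse step size.
    L = 2.0 * max(sum(abs(x) for x in row) for row in S)
    if L == 0:
        return beta
    for _ in range(n_iter):
        grad = [2.0 * sum(S[a][b] * beta[b] for b in range(M)) for a in range(M)]
        beta = project_to_simplex([b - g / L for b, g in zip(beta, grad)])
    return beta


# Toy example 1: both weights stay positive (solution 1/3, 2/3).
beta_dense = solve_simplex_qp([[2.0, 0.0], [0.0, 1.0]])
# Toy example 2: the constraint beta >= 0 is active, giving a sparse solution.
beta_sparse = solve_simplex_qp([[1.0, 2.0], [2.0, 5.0]])
```

The second toy example shows why this formulation is called sparse: the simplex constraint can drive some weights exactly to zero.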
A proposal to improve interpretability of K-PCA in our framework
Issue: How to assess the importance of a given species in the K-PCA?
our datasets are either numeric (environmental) or built from an n × p count matrix
⇒ for a given species, randomly permute its counts and redo the analysis (kernel computation, with the same optimized weights, and K-PCA)
the influence of a given species in a given dataset on a given PC subspace is assessed by computing the Crone-Crosby distance between these two PCA subspaces [Crone and Crosby, 1995] (∼ Frobenius norm between the projectors)
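The comparison of two PC subspaces through their projectors can be sketched as follows. This minimal pure-Python illustration assumes each subspace is given by an orthonormal basis and returns the plain Frobenius norm of the difference between the two projectors; the exact scaling used in [Crone and Crosby, 1995] is not stated here, and the function names are hypothetical.

```python
import math


def projector(basis):
    """Orthogonal projector onto the span of an orthonormal basis."""
    d = len(basis[0])
    return [[sum(u[i] * u[j] for u in basis) for j in range(d)] for i in range(d)]


def projector_distance(basis_a, basis_b):
    """Frobenius norm of the difference between the two projectors."""
    Pa, Pb = projector(basis_a), projector(basis_b)
    return math.sqrt(sum((p - q) ** 2
                         for ra, rb in zip(Pa, Pb) for p, q in zip(ra, rb)))


# Toy example in R^3: identical axes versus orthogonal axes.
e1 = [1.0, 0.0, 0.0]
e2 = [0.0, 1.0, 0.0]
same = projector_distance([e1], [e1])
orthogonal = projector_distance([e1], [e2])
```

In the permutation procedure, one basis would come from the original K-PCA and the other from the K-PCA recomputed after permuting the counts of one species; a larger distance suggests a more influential species.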
Integrating ’omics data using kernels
$M$ TARA Oceans datasets $(x^m_i)_{i=1,\ldots,N,\ m=1,\ldots,M}$ measured on the same ocean samples $(1, \ldots, N)$, which take values in arbitrary spaces $(\mathcal{X}^m)_m$:
environmental dataset,
bacteria phylogenomic tree,
bacteria functional composition,
eukaryote pico-plankton composition,
...
virus composition.
Environmental dataset: standard Euclidean geometry, given by the linear kernel $K(x_i, x_j) = x_i^\top x_j$.
Bacteria phylogenomic tree: the weighted Unifrac distance, given by
$$d_{\text{wUF}}(x_i, x_j) = \frac{\sum_e l_e |p_{ei} - p_{ej}|}{\sum_e (p_{ei} + p_{ej})}.$$
All composition-based datasets (bacteria functional composition, eukaryote (pico, nano, micro, meso)-plankton composition and virus composition): calculated using the Bray-Curtis dissimilarity,
$$d_{\text{BC}}(x_i, x_j) = \frac{\sum_g |n_{ig} - n_{jg}|}{\sum_g (n_{ig} + n_{jg})},$$
$n_{ig}$: gene $g$ abundances summarized at the KEGG orthologous groups level in sample $i$.
Combination of $M$ kernels by a weighted sum
$$K^* = \sum_{m=1}^{M} \beta_m K^m,$$
where $\beta_m \geq 0$ and $\sum_{m=1}^{M} \beta_m = 1$.
Apply standard data mining methods (clustering, linear model, PCA, ...) in the feature space.
Correlation between kernels (STATIS)
Low correlations between the bacteria functional composition and
other datasets.
Strong correlation between environmental variables and small organisms (bacteria, eukaryote pico-plankton and virus).
Influence of $k$ (number of neighbors) on $(\beta_m)_m$
k ≥ 5 provides stable results
$(\beta_m)_m$ values returned by graph-MKL
The dataset least correlated with the others, the bacteria functional composition, has the highest coefficient.
Three kernels have a weight equal to 0 (sparse version).
Proof of concept: using [Sunagawa et al., 2015]
Datasets
139 samples, 3 layers (SRF, DCM and MES)
kernels: phychem, pro-OTUs and pro-OGs
Proteobacteria (clade SAR11 (Alphaproteobacteria) and SAR86) dominate the sampled areas of the ocean in terms of relative abundance and taxonomic richness.
K-PCA on $K^*$
K-PCA on $K^*$ - environmental dataset
Conclusion and perspectives
Summary
an integrative exploratory method
... particularly well suited for multi metagenomic datasets
with enhanced interpretability
Perspectives
implement SDP solution and test it
improve biological interpretation
soon-to-be-released R package
Questions?
References
Aronszajn, N. (1950).
Theory of reproducing kernels.
Transactions of the American Mathematical Society, 68(3):337–404.
Brum, J., Ignacio-Espinoza, J., Roux, S., Doulcier, G., Acinas, S., Alberti, A., Chaffron, S., Cruaud, C., de Vargas, C., Gasol, J.,
Gorsky, G., Gregory, A., Guidi, L., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Poulos, B., Schwenck, S., Speich, S.,
Dimier, C., Kandels-Lewis, S., Picheral, M., Searson, S., Tara Oceans coordinators, Bork, P., Bowler, C., Sunagawa, S., Wincker,
P., Karsenti, E., and Sullivan, M. (2015).
Patterns and ecological drivers of ocean viral communities.
Science, 348(6237).
Chen, Y., Garcia, E., Gupta, M., Rahimi, A., and Cazzanti, L. (2009).
Similarity-based classification: concepts and algorithms.
Journal of Machine Learning Research, 10:747–776.
Cottrell, M. and Letrémy, P. (2005).
How to use the Kohonen algorithm to simultaneously analyse individuals in a survey.
Neurocomputing, 63:193–207.
Crone, L. and Crosby, D. (1995).
Statistical applications of a metric on subspaces to satellite meteorology.
Technometrics, 37(3):324–328.
de Vargas, C., Audic, S., Henry, N., Decelle, J., Mahé, P., Logares, R., Lara, E., Berney, C., Le Bescot, N., Probert, I.,
Carmichael, M., Poulain, J., Romac, S., Colin, S., Aury, J., Bittner, L., Chaffron, S., Dunthorn, M., Engelen, S., Flegontova, O.,
Guidi, L., Horák, A., Jaillon, O., Lima-Mendez, G., Lukeš, J., Malviya, S., Morard, R., Mulot, M., Scalco, E., Siano, R., Vincent, F.,
Zingone, A., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Acinas, S., Bork, P., Bowler, C.,
Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Raes, J., Sieracki, M. E., Speich, S.,
Stemmann, L., Sunagawa, S., Weissenbach, J., Wincker, P., and Karsenti, E. (2015).
Eukaryotic plankton diversity in the sunlit ocean.
Science, 348(6237).
Gönen, M. and Alpaydin, E. (2011).
Multiple kernel learning algorithms.
Journal of Machine Learning Research, 12:2211–2268.
Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994).
The ACT (STATIS method).
Computational Statistics and Data Analysis, 18(1):97–119.
Lee, J. and Verleysen, M. (2007).
Nonlinear Dimensionality Reduction.
Information Science and Statistics. Springer, New York; London.
L’Hermier des Plantes, H. (1976).
Structuration des tableaux à trois indices de la statistique.
PhD thesis, Université de Montpellier.
Thèse de troisième cycle.
Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J., Roux, S., Vincent, F.,
Bittner, L., Darzi, Y., Wang, B., Audic, S., Berline, L., Bontempi, G., Cabello, A., Coppola, L., Cornejo-Castillo, F., d’Oviedo, F.,
de Meester, L., Ferrera, I., Garet-Delmas, M., Guidi, L., Lara, E., Pesant, S., Royo-Llonch, M., Salazar, F., Sánchez, P.,
Sebastian, M., Souffreau, C., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Gorsky, G.,
Not, F., Ogata, H., Speich, S., Stemmann, L., Weissenbach, J., Wincker, P., Acinas, S., Sunagawa, S., Bork, P., Sullivan, M.,
Karsenti, E., Bowler, C., de Vargas, C., and Raes, J. (2015).
Determinants of community structure in the global plankton interactome.
Science, 348(6237).
Lin, Y., Liu, T., and Fuh, C. (2010).
Multiple kernel learning for dimensionality reduction.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1147–1160.
Mariette, J., Olteanu, M., and Villa-Vialaneix, N. (2017).
Efficient interpretable variants of online SOM for large dissimilarity data.
Neurocomputing, 225:31–48.
Olteanu, M. and Villa-Vialaneix, N. (2015).
On-line relational and multiple relational SOM.
Neurocomputing, 147:15–30.
Robert, P. and Escoufier, Y. (1976).
A unifying tool for linear multivariate statistical methods: the RV-coefficient.
Applied Statistics, 25(3):257–265.
Schölkopf, B., Smola, A., and Müller, K. (1998).
Nonlinear component analysis as a kernel eigenvalue problem.
Neural Computation, 10(5):1299–1319.
Sommer, M., Church, G., and Dantas, G. (2010).
A functional metagenomic approach for expanding the synthetic biology toolbox for biomass conversion.
Molecular Systems Biology, 6(360).
Sunagawa, S., Coelho, L., Chaffron, S., Kultima, J., Labadie, K., Salazar, F., Djahanschiri, B., Zeller, G., Mende, D., Alberti, A.,
Cornejo-Castillo, F., Costea, P., Cruaud, C., d’Oviedo, F., Engelen, S., Ferrera, I., Gasol, J., Guidi, L., Hildebrand, F., Kokoszka,
F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral,
M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P.,
Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M., Weissenbach, J., Wincker, P.,
Karsenti, E., Raes, J., Acinas, S., and Bork, P. (2015).
Structure and function of the global ocean microbiome.
Science, 348(6237).
Zhuang, J., Wang, J., Hoi, S., and Lan, X. (2011).
Unsupervised multiple kernel clustering.
Journal of Machine Learning Research: Workshop and Conference Proceedings, 20:129–144.
 
AIS
AISAIS
AIS
 
Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...
 
A FRIENDLY APPROACH TO PARTICLE FILTERS IN COMPUTER VISION
A FRIENDLY APPROACH TO PARTICLE FILTERS IN COMPUTER VISIONA FRIENDLY APPROACH TO PARTICLE FILTERS IN COMPUTER VISION
A FRIENDLY APPROACH TO PARTICLE FILTERS IN COMPUTER VISION
 
Introduction to Metagenomics Data Analysis - UEB-VHIR - 2013
Introduction to Metagenomics Data Analysis - UEB-VHIR - 2013Introduction to Metagenomics Data Analysis - UEB-VHIR - 2013
Introduction to Metagenomics Data Analysis - UEB-VHIR - 2013
 
2014 sage-talk
2014 sage-talk2014 sage-talk
2014 sage-talk
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
 
JOSA TechTalks - Machine Learning in Practice
JOSA TechTalks - Machine Learning in PracticeJOSA TechTalks - Machine Learning in Practice
JOSA TechTalks - Machine Learning in Practice
 
Toast 2015 qiime_talk2
Toast 2015 qiime_talk2Toast 2015 qiime_talk2
Toast 2015 qiime_talk2
 
Pathogen Genome Data
Pathogen Genome DataPathogen Genome Data
Pathogen Genome Data
 
2006: Artificial Immune Systems - The Past, The Present, And The Future?
2006: Artificial Immune Systems - The Past, The Present, And The Future?2006: Artificial Immune Systems - The Past, The Present, And The Future?
2006: Artificial Immune Systems - The Past, The Present, And The Future?
 

Viewers also liked

Random Forest for Big Data
Random Forest for Big DataRandom Forest for Big Data
Random Forest for Big Data
tuxette
 
Slides Lycée Jules Fil 2014
Slides Lycée Jules Fil 2014Slides Lycée Jules Fil 2014
Slides Lycée Jules Fil 2014
tuxette
 
Interpretable Sparse Sliced Inverse Regression for digitized functional data
Interpretable Sparse Sliced Inverse Regression for digitized functional dataInterpretable Sparse Sliced Inverse Regression for digitized functional data
Interpretable Sparse Sliced Inverse Regression for digitized functional data
tuxette
 
Visualiser et fouiller des réseaux - Méthodes et exemples dans R
Visualiser et fouiller des réseaux - Méthodes et exemples dans RVisualiser et fouiller des réseaux - Méthodes et exemples dans R
Visualiser et fouiller des réseaux - Méthodes et exemples dans R
tuxette
 
Real Estate Customer Servicing GAP Analysis
Real Estate Customer Servicing GAP AnalysisReal Estate Customer Servicing GAP Analysis
Real Estate Customer Servicing GAP Analysis
Rahul Gaur
 
Graduation Research Project about Violence against Teachers
Graduation Research Project about Violence against TeachersGraduation Research Project about Violence against Teachers
Graduation Research Project about Violence against Teachers
Salsabil A.
 
Case Study Report: Fiskars Scissors
Case Study Report: Fiskars ScissorsCase Study Report: Fiskars Scissors
Case Study Report: Fiskars Scissors
Laura Katriina Pollard
 
Graph mining 2: Statistical approaches for graph mining
Graph mining 2: Statistical approaches for graph miningGraph mining 2: Statistical approaches for graph mining
Graph mining 2: Statistical approaches for graph mining
tuxette
 
디자인과 문화
디자인과 문화디자인과 문화
디자인과 문화
jieon lee
 
Exploring ISIS in Yemen
Exploring ISIS in YemenExploring ISIS in Yemen
Exploring ISIS in Yemen
AEI's Critical Threats Project
 
A short introduction to statistical learning
A short introduction to statistical learningA short introduction to statistical learning
A short introduction to statistical learning
tuxette
 
Adapting to water scarcity for Yemen's vulnerability communities: The case st...
Adapting to water scarcity for Yemen's vulnerability communities: The case st...Adapting to water scarcity for Yemen's vulnerability communities: The case st...
Adapting to water scarcity for Yemen's vulnerability communities: The case st...
NENAwaterscarcity
 
Landforms and Oceans Presentation
Landforms and Oceans PresentationLandforms and Oceans Presentation
Landforms and Oceans Presentation
gweesc
 
architectural design
architectural design architectural design
architectural design
ganesh keskar
 
Building an Information Infrastructure to Support Microbial Metagenomic Sciences
Building an Information Infrastructure to Support Microbial Metagenomic SciencesBuilding an Information Infrastructure to Support Microbial Metagenomic Sciences
Building an Information Infrastructure to Support Microbial Metagenomic Sciences
Larry Smarr
 
CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008
Saul Kravitz
 
Assembly of metagenomes
Assembly of metagenomesAssembly of metagenomes
Assembly of metagenomes
Lex Nederbragt
 
Tom Delmont: From the Terragenome Project to Global Metagenomic Comparisons: ...
Tom Delmont: From the Terragenome Project to Global Metagenomic Comparisons: ...Tom Delmont: From the Terragenome Project to Global Metagenomic Comparisons: ...
Tom Delmont: From the Terragenome Project to Global Metagenomic Comparisons: ...
GigaScience, BGI Hong Kong
 
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Larry Smarr
 
Machine Learning
Machine LearningMachine Learning

Viewers also liked (20)

Random Forest for Big Data
Random Forest for Big DataRandom Forest for Big Data
Random Forest for Big Data
 
Slides Lycée Jules Fil 2014
Slides Lycée Jules Fil 2014Slides Lycée Jules Fil 2014
Slides Lycée Jules Fil 2014
 
Interpretable Sparse Sliced Inverse Regression for digitized functional data
Interpretable Sparse Sliced Inverse Regression for digitized functional dataInterpretable Sparse Sliced Inverse Regression for digitized functional data
Interpretable Sparse Sliced Inverse Regression for digitized functional data
 
Visualiser et fouiller des réseaux - Méthodes et exemples dans R
Visualiser et fouiller des réseaux - Méthodes et exemples dans RVisualiser et fouiller des réseaux - Méthodes et exemples dans R
Visualiser et fouiller des réseaux - Méthodes et exemples dans R
 
Real Estate Customer Servicing GAP Analysis
Real Estate Customer Servicing GAP AnalysisReal Estate Customer Servicing GAP Analysis
Real Estate Customer Servicing GAP Analysis
 
Graduation Research Project about Violence against Teachers
Graduation Research Project about Violence against TeachersGraduation Research Project about Violence against Teachers
Graduation Research Project about Violence against Teachers
 
Case Study Report: Fiskars Scissors
Case Study Report: Fiskars ScissorsCase Study Report: Fiskars Scissors
Case Study Report: Fiskars Scissors
 
Graph mining 2: Statistical approaches for graph mining
Graph mining 2: Statistical approaches for graph miningGraph mining 2: Statistical approaches for graph mining
Graph mining 2: Statistical approaches for graph mining
 
디자인과 문화
디자인과 문화디자인과 문화
디자인과 문화
 
Exploring ISIS in Yemen
Exploring ISIS in YemenExploring ISIS in Yemen
Exploring ISIS in Yemen
 
A short introduction to statistical learning
A short introduction to statistical learningA short introduction to statistical learning
A short introduction to statistical learning
 
Adapting to water scarcity for Yemen's vulnerability communities: The case st...
Adapting to water scarcity for Yemen's vulnerability communities: The case st...Adapting to water scarcity for Yemen's vulnerability communities: The case st...
Adapting to water scarcity for Yemen's vulnerability communities: The case st...
 
Landforms and Oceans Presentation
Landforms and Oceans PresentationLandforms and Oceans Presentation
Landforms and Oceans Presentation
 
architectural design
architectural design architectural design
architectural design
 
Building an Information Infrastructure to Support Microbial Metagenomic Sciences
Building an Information Infrastructure to Support Microbial Metagenomic SciencesBuilding an Information Infrastructure to Support Microbial Metagenomic Sciences
Building an Information Infrastructure to Support Microbial Metagenomic Sciences
 
CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008
 
Assembly of metagenomes
Assembly of metagenomesAssembly of metagenomes
Assembly of metagenomes
 
Tom Delmont: From the Terragenome Project to Global Metagenomic Comparisons: ...
Tom Delmont: From the Terragenome Project to Global Metagenomic Comparisons: ...Tom Delmont: From the Terragenome Project to Global Metagenomic Comparisons: ...
Tom Delmont: From the Terragenome Project to Global Metagenomic Comparisons: ...
 
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 

Similar to Integrating Tara Oceans datasets using unsupervised multiple kernel learning

USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 3
USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 3USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 3
USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 3
Gianpaolo Coro
 
GBIF BIFA mentoring, Day 4b Event core, July 2016
GBIF BIFA mentoring, Day 4b Event core, July 2016GBIF BIFA mentoring, Day 4b Event core, July 2016
GBIF BIFA mentoring, Day 4b Event core, July 2016
Dag Endresen
 
John La Salle - Opening Plenary
John La Salle - Opening PlenaryJohn La Salle - Opening Plenary
John La Salle - Opening Plenary
Consortium for the Barcode of Life (CBOL)
 
Zhang et al ecn 2016 building an accessible weevil tissue collection for geno...
Zhang et al ecn 2016 building an accessible weevil tissue collection for geno...Zhang et al ecn 2016 building an accessible weevil tissue collection for geno...
Zhang et al ecn 2016 building an accessible weevil tissue collection for geno...
taxonbytes
 
USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 5
USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 5USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 5
USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 5
Gianpaolo Coro
 
E biothon workshop 2014 04 15 v1
E biothon workshop 2014 04 15 v1E biothon workshop 2014 04 15 v1
E biothon workshop 2014 04 15 v1
Vincent Breton
 
Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome SequencingMicrobial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Jonathan Eisen
 
Modelling Biodiversity Linked Data: Pragmatism May Narrow Future Opportunities
Modelling Biodiversity Linked Data: Pragmatism May Narrow Future OpportunitiesModelling Biodiversity Linked Data: Pragmatism May Narrow Future Opportunities
Modelling Biodiversity Linked Data: Pragmatism May Narrow Future Opportunities
Franck Michel
 
The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data Science
Robert Grossman
 
Gregoire Taillefer poster ESC final
Gregoire Taillefer poster ESC finalGregoire Taillefer poster ESC final
Gregoire Taillefer poster ESC final
Amélie Grégoire Taillefer
 
A Model to Represent Nomenclatural and Taxonomic Information as Linked Data. ...
A Model to Represent Nomenclatural and Taxonomic Information as Linked Data. ...A Model to Represent Nomenclatural and Taxonomic Information as Linked Data. ...
A Model to Represent Nomenclatural and Taxonomic Information as Linked Data. ...
Franck Michel
 
Perth ausplots presentation_070616_internet_qu
Perth ausplots presentation_070616_internet_quPerth ausplots presentation_070616_internet_qu
Perth ausplots presentation_070616_internet_qu
bensparrowau
 
Representation of metabolomic data with wavelets
Representation of metabolomic data with waveletsRepresentation of metabolomic data with wavelets
Representation of metabolomic data with wavelets
tuxette
 
CESAB-GEISHA-sfe2018
CESAB-GEISHA-sfe2018CESAB-GEISHA-sfe2018
CESAB-GEISHA-sfe2018
CESAB-FRB
 
Kernel methods for data integration in systems biology
Kernel methods for data integration in systems biology Kernel methods for data integration in systems biology
Kernel methods for data integration in systems biology
tuxette
 
Comparative study of ensemble deep learning models to determine the classific...
Comparative study of ensemble deep learning models to determine the classific...Comparative study of ensemble deep learning models to determine the classific...
Comparative study of ensemble deep learning models to determine the classific...
CSITiaesprime
 
Open Tree of Life @NSF
Open Tree of Life @NSFOpen Tree of Life @NSF
Open Tree of Life @NSF
Karen Cranston
 
Web Apollo: Lessons learned from community-based biocuration efforts.
Web Apollo: Lessons learned from community-based biocuration efforts.Web Apollo: Lessons learned from community-based biocuration efforts.
Web Apollo: Lessons learned from community-based biocuration efforts.
Monica Munoz-Torres
 
Long Term Ecological Research Network
Long Term Ecological Research NetworkLong Term Ecological Research Network
Long Term Ecological Research Network
TERN Australia
 
Étude du pathobiome respiratoire chez les jeunes bovins atteints de bronchopn...
Étude du pathobiome respiratoire chez les jeunes bovins atteints de bronchopn...Étude du pathobiome respiratoire chez les jeunes bovins atteints de bronchopn...
Étude du pathobiome respiratoire chez les jeunes bovins atteints de bronchopn...
tuxette
 

Similar to Integrating Tara Oceans datasets using unsupervised multiple kernel learning (20)

USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 3
USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 3USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 3
USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 3
 
GBIF BIFA mentoring, Day 4b Event core, July 2016
GBIF BIFA mentoring, Day 4b Event core, July 2016GBIF BIFA mentoring, Day 4b Event core, July 2016
GBIF BIFA mentoring, Day 4b Event core, July 2016
 
John La Salle - Opening Plenary
John La Salle - Opening PlenaryJohn La Salle - Opening Plenary
John La Salle - Opening Plenary
 
Zhang et al ecn 2016 building an accessible weevil tissue collection for geno...
Zhang et al ecn 2016 building an accessible weevil tissue collection for geno...Zhang et al ecn 2016 building an accessible weevil tissue collection for geno...
Zhang et al ecn 2016 building an accessible weevil tissue collection for geno...
 
USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 5
USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 5USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 5
USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 5
 
E biothon workshop 2014 04 15 v1
E biothon workshop 2014 04 15 v1E biothon workshop 2014 04 15 v1
E biothon workshop 2014 04 15 v1
 
Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome SequencingMicrobial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
 
Modelling Biodiversity Linked Data: Pragmatism May Narrow Future Opportunities
Modelling Biodiversity Linked Data: Pragmatism May Narrow Future OpportunitiesModelling Biodiversity Linked Data: Pragmatism May Narrow Future Opportunities
Modelling Biodiversity Linked Data: Pragmatism May Narrow Future Opportunities
 
The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data Science
 
Gregoire Taillefer poster ESC final
Gregoire Taillefer poster ESC finalGregoire Taillefer poster ESC final
Gregoire Taillefer poster ESC final
 
A Model to Represent Nomenclatural and Taxonomic Information as Linked Data. ...
A Model to Represent Nomenclatural and Taxonomic Information as Linked Data. ...A Model to Represent Nomenclatural and Taxonomic Information as Linked Data. ...
A Model to Represent Nomenclatural and Taxonomic Information as Linked Data. ...
 
Perth ausplots presentation_070616_internet_qu
Perth ausplots presentation_070616_internet_quPerth ausplots presentation_070616_internet_qu
Perth ausplots presentation_070616_internet_qu
 
Representation of metabolomic data with wavelets
Representation of metabolomic data with waveletsRepresentation of metabolomic data with wavelets
Representation of metabolomic data with wavelets
 
CESAB-GEISHA-sfe2018
CESAB-GEISHA-sfe2018CESAB-GEISHA-sfe2018
CESAB-GEISHA-sfe2018
 
Kernel methods for data integration in systems biology
Kernel methods for data integration in systems biology Kernel methods for data integration in systems biology
Kernel methods for data integration in systems biology
 
Comparative study of ensemble deep learning models to determine the classific...
Comparative study of ensemble deep learning models to determine the classific...Comparative study of ensemble deep learning models to determine the classific...
Comparative study of ensemble deep learning models to determine the classific...
 
Open Tree of Life @NSF
Open Tree of Life @NSFOpen Tree of Life @NSF
Open Tree of Life @NSF
 
Web Apollo: Lessons learned from community-based biocuration efforts.
Web Apollo: Lessons learned from community-based biocuration efforts.Web Apollo: Lessons learned from community-based biocuration efforts.
Web Apollo: Lessons learned from community-based biocuration efforts.
 
Long Term Ecological Research Network
Long Term Ecological Research NetworkLong Term Ecological Research Network
Long Term Ecological Research Network
 
Étude du pathobiome respiratoire chez les jeunes bovins atteints de bronchopn...
Étude du pathobiome respiratoire chez les jeunes bovins atteints de bronchopn...Étude du pathobiome respiratoire chez les jeunes bovins atteints de bronchopn...
Étude du pathobiome respiratoire chez les jeunes bovins atteints de bronchopn...
 

More from tuxette

Racines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en mathsRacines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en maths
tuxette
 
Méthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènesMéthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènes
tuxette
 
Méthodologies d'intégration de données omiques
Méthodologies d'intégration de données omiquesMéthodologies d'intégration de données omiques
Méthodologies d'intégration de données omiques
tuxette
 
Projets autour de l'Hi-C
Projets autour de l'Hi-CProjets autour de l'Hi-C
Projets autour de l'Hi-C
tuxette
 
Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?
tuxette
 
Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...
tuxette
 
ASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiquesASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiques
tuxette
 
Autour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWeanAutour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWean
tuxette
 
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
tuxette
 
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquesApprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
tuxette
 
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
tuxette
 
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
tuxette
 
Journal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation dataJournal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation data
tuxette
 
Overfitting or overparametrization?
Overfitting or overparametrization?Overfitting or overparametrization?
Overfitting or overparametrization?
tuxette
 
Selective inference and single-cell differential analysis
Selective inference and single-cell differential analysisSelective inference and single-cell differential analysis
Selective inference and single-cell differential analysis
tuxette
 
SOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricesSOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatrices
tuxette
 
Graph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype PredictionGraph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype Prediction
tuxette
 
A short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelsA short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction models
tuxette
 
Explanable models for time series with random forest
Explanable models for time series with random forestExplanable models for time series with random forest
Explanable models for time series with random forest
tuxette
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICS
tuxette
 

More from tuxette (20)

Racines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en mathsRacines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en maths
 
Méthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènesMéthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènes
 
Méthodologies d'intégration de données omiques
Méthodologies d'intégration de données omiquesMéthodologies d'intégration de données omiques
Méthodologies d'intégration de données omiques
 
Projets autour de l'Hi-C
Projets autour de l'Hi-CProjets autour de l'Hi-C
Projets autour de l'Hi-C
 
Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?
 
Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...
 
ASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiquesASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiques
 
Autour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWeanAutour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWean
 
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
 
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquesApprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
 
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
 
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
 
Journal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation dataJournal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation data
 
Overfitting or overparametrization?
Overfitting or overparametrization?Overfitting or overparametrization?
Overfitting or overparametrization?
 
Selective inference and single-cell differential analysis
Selective inference and single-cell differential analysisSelective inference and single-cell differential analysis
Selective inference and single-cell differential analysis
 
SOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricesSOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatrices
 
Graph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype PredictionGraph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype Prediction
 
Integrating Tara Oceans datasets using unsupervised multiple kernel learning

  • 1. Integrating Tara Oceans datasets using unsupervised multiple kernel learning Nathalie Villa-Vialaneix Joint work with Jérôme Mariette http://www.nathalievilla.org Séminaire de Probabilité et Statistique Laboratoire J.A. Dieudonné, Université de Nice Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 1/41
  • 2. Outline: 1 Metagenomic datasets and associated questions; 2 A typical (and rich) case study: TARA Oceans datasets; 3 A UMKL framework for integrating multiple metagenomic data; 4 Application to TARA Oceans datasets
  • 6. What are metagenomic data? Source: [Sommer et al., 2010]. Abundance data: sparse $n \times p$ matrices of counts, with samples in rows and descriptors (species, OTUs, KEGG groups, k-mers, ...) in columns; generally $p \gg n$. Phylogenetic tree (evolutionary history between species, OTUs, ...): one tree with $p$ leaves, built from the sequences collected in the $n$ samples.
  • 8. What are metagenomic data used for? They produce a profile of the diversity of a given sample, which makes it possible to compare diversity between various conditions. They are used in various fields: environmental science, microbiota, ... They are processed by computing a relevant dissimilarity between samples (the standard Euclidean distance is not relevant) and by using this dissimilarity in subsequent analyses.
  • 9. β-diversity data: dissimilarities between count data. Compositional dissimilarities, with $n_{ig}$ the count of species $g$ in sample $i$. Jaccard, the fraction of species specific to either sample $i$ or sample $j$: $d_{jac} = \frac{\sum_g \left( I_{\{n_{ig}>0,\, n_{jg}=0\}} + I_{\{n_{jg}>0,\, n_{ig}=0\}} \right)}{\sum_g I_{\{n_{ig}+n_{jg}>0\}}}$. Bray-Curtis, the fraction of the sample that is specific to either sample $i$ or sample $j$: $d_{BC} = \frac{\sum_g |n_{ig} - n_{jg}|}{\sum_g (n_{ig} + n_{jg})}$. Other dissimilarities are available in the R package phyloseq, most of them not Euclidean.
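The two compositional dissimilarities of slide 9 can be sketched in a few lines of plain Python. This is only an illustration: the function names and the count vectors are hypothetical, and the code assumes non-negative counts with at least one non-zero entry across the two samples.

```python
def jaccard(counts_i, counts_j):
    """Fraction of the observed species that are present in only one of the two samples."""
    specific = sum(1 for a, b in zip(counts_i, counts_j) if (a > 0) != (b > 0))
    observed = sum(1 for a, b in zip(counts_i, counts_j) if a + b > 0)
    return specific / observed


def bray_curtis(counts_i, counts_j):
    """Sum of absolute count differences divided by the total count of both samples."""
    diff = sum(abs(a - b) for a, b in zip(counts_i, counts_j))
    total = sum(a + b for a, b in zip(counts_i, counts_j))
    return diff / total


# Hypothetical counts of 4 species in two samples.
sample_i = [10, 0, 5, 0]
sample_j = [4, 6, 0, 0]
d_jac = jaccard(sample_i, sample_j)      # 2 specific species out of 3 observed
d_bc = bray_curtis(sample_i, sample_j)   # 17 / 25
```

The fourth species is absent from both samples, so it contributes to neither numerator nor denominator.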
  • 13. β-diversity data: phylogenetic dissimilarities. For each branch $e$ of the tree, denote by $l_e$ its length and by $p_{ei}$ the fraction of counts in sample $i$ corresponding to species below branch $e$. Unifrac, the fraction of the tree specific to either sample $i$ or sample $j$: $d_{UF} = \frac{\sum_e l_e \left( I_{\{p_{ei}>0,\, p_{ej}=0\}} + I_{\{p_{ej}>0,\, p_{ei}=0\}} \right)}{\sum_e l_e I_{\{p_{ei}+p_{ej}>0\}}}$. Weighted Unifrac, the fraction of the diversity specific to sample $i$ or to sample $j$: $d_{wUF} = \frac{\sum_e l_e |p_{ei} - p_{ej}|}{\sum_e (p_{ei} + p_{ej})}$.
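As a rough sketch, both Unifrac variants can be computed directly from the branch lengths $l_e$ and the per-branch fractions $p_{ei}$, taking the formulas exactly as written on the slide. The function names and the three-branch example are hypothetical; building the fractions from a real tree is not covered here.

```python
def unifrac(lengths, frac_i, frac_j):
    """Unweighted Unifrac: share of the observed branch length seen in one sample only."""
    branches = list(zip(lengths, frac_i, frac_j))
    specific = sum(l for l, p, q in branches if (p > 0) != (q > 0))
    observed = sum(l for l, p, q in branches if p + q > 0)
    return specific / observed


def weighted_unifrac(lengths, frac_i, frac_j):
    """Weighted Unifrac as on the slide: sum_e l_e |p_ei - p_ej| / sum_e (p_ei + p_ej)."""
    num = sum(l * abs(p - q) for l, p, q in zip(lengths, frac_i, frac_j))
    den = sum(p + q for p, q in zip(frac_i, frac_j))
    return num / den


# Hypothetical tree with three branches: one shared, one specific to each sample.
lengths = [1.0, 2.0, 0.5]
frac_i = [1.0, 0.5, 0.0]
frac_j = [1.0, 0.0, 0.5]
d_uf = unifrac(lengths, frac_i, frac_j)             # 2.5 / 3.5
d_wuf = weighted_unifrac(lengths, frac_i, frac_j)   # 1.25 / 3.0
```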
  • 15. TARA Oceans datasets. The 2009-2013 expedition, co-directed by Étienne Bourgois and Éric Karsenti: 7,012 datasets collected from 35,000 samples of plankton and water (11,535 Gb of data). Aim: study the plankton (bacteria, protists, metazoans and viruses), which represents more than 90% of the biomass in the ocean.
  • 16. TARA Oceans datasets. Science (May 2015), studies on: eukaryotic plankton diversity [de Vargas et al., 2015], ocean viral communities [Brum et al., 2015], global plankton interactome [Lima-Mendez et al., 2015], global ocean microbiome [Sunagawa et al., 2015], ... → datasets of different types and from different sources, analyzed separately.
  • 17. Background of this talk. Objectives. Until now: many papers using many methods, but no integrated analysis has been performed. What do the datasets reveal if integrated in a single analysis? Our purpose: develop a generic method to integrate phylogenetic, taxonomic and functional community composition together with environmental factors.
  • 22. TARA Oceans datasets that we used. Environmental dataset: 22 numeric features (temperature, salinity, ...) [Sunagawa et al., 2015]. Bacteria phylogenomic tree: computed from ∼35,000 OTUs [Sunagawa et al., 2015]. Bacteria functional composition: ∼63,000 KEGG orthologous groups [Sunagawa et al., 2015]. Eukaryotic plankton composition, split into 4 size groups: pico (0.8-5 µm), nano (5-20 µm), micro (20-180 µm) and meso (180-2000 µm) [de Vargas et al., 2015]. Virus composition: ∼867 virus clusters based on shared gene content [Brum et al., 2015].
  • 23. TARA Oceans datasets that we used. Common samples: 48 samples, 2 depth layers (surface (SRF) and deep chlorophyll maximum (DCM)), 31 different sampling stations.
  • 26. Kernel methods. A kernel is viewed as the dot product in an implicit Hilbert space: $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $K(x_i, x_j) = K(x_j, x_i)$ and $\forall\, m \in \mathbb{N}$, $\forall\, x_1, \dots, x_m \in \mathcal{X}$, $\forall\, \alpha_1, \dots, \alpha_m \in \mathbb{R}$, $\sum_{i,j=1}^m \alpha_i \alpha_j K(x_i, x_j) \geq 0$. ⇒ [Aronszajn, 1950] $\exists!\, (\mathcal{H}, \langle \cdot, \cdot \rangle)$ and $\phi : \mathcal{X} \to \mathcal{H}$ such that $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
  • 31. Exploratory analysis with kernels. A well-known example: kernel PCA [Schölkopf et al., 1998], i.e. PCA performed in the feature space induced by the kernel $K$. In practice, $K$ is centered, $K \leftarrow K - \frac{1}{N} K I_N - \frac{1}{N} I_N K + \frac{1}{N^2} I_N K I_N$ (with $I_N$ the $N \times N$ matrix of ones), and K-PCA is performed by the eigen-decomposition of the (centered) $K$. If $(\alpha_k)_{k=1,\dots,N} \in \mathbb{R}^N$ and $(\lambda_k)_{k=1,\dots,N}$ are the eigenvectors and eigenvalues, the PC axes are $a_k = \sum_{i=1}^N \alpha_{ki} \phi(x_i)$, and they are orthonormal in the feature space induced by the kernel: $\forall\, k, k'$, $\langle a_k, a_{k'} \rangle = \alpha_k^\top K \alpha_{k'} = \delta_{kk'}$, with $\delta_{kk'} = 0$ if $k \neq k'$ and $1$ otherwise. Coordinates of the projections of the observations $(\phi(x_i))_i$: $\langle a_k, \phi(x_i) \rangle = \sum_{j=1}^N \alpha_{kj} K_{ji} = K_{i\cdot}\, \alpha_k = \lambda_k \alpha_{ki}$, where $K_{i\cdot}$ is the $i$-th row of $K$. There is no representation for the variables (no real variables...). Other unsupervised kernel methods: kernel SOM [Olteanu and Villa-Vialaneix, 2015, Mariette et al., 2017].
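The K-PCA recipe above (center the kernel, eigen-decompose, read the coordinates as $\lambda_k \alpha_{ki}$) can be illustrated with a minimal standard-library sketch. Several choices here are assumptions made for the example, not part of the talk: the centering is the usual double centering (row means, column means, grand mean), only the first axis is extracted and it is obtained by power iteration rather than a full eigen-decomposition, and the data points and function names are hypothetical.

```python
import math


def center_kernel(K):
    """Double-center a symmetric kernel matrix (row means, column means, grand mean)."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]  # equals the column means since K is symmetric
    grand_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


def leading_eigenpair(K, n_iter=200):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    n = len(K)
    u = [float(i + 1) for i in range(n)]  # arbitrary start, assumed not orthogonal to the axis
    for _ in range(n_iter):
        w = [sum(K[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    lam = sum(u[i] * K[i][j] * u[j] for i in range(n) for j in range(n))
    return lam, u


# Hypothetical data: four points on a line in R^2, with the linear kernel K(x, y) = x . y.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
K = [[sum(a * b for a, b in zip(x, y)) for y in points] for x in points]

K_centered = center_kernel(K)
lam, u = leading_eigenpair(K_centered)
alpha = [x / math.sqrt(lam) for x in u]  # rescaled so that alpha' K alpha = 1
coords = [lam * a for a in alpha]        # coordinates on the first PC axis
```

With the linear kernel this reduces to ordinary PCA, so the coordinates are the centered positions of the points along the line, which is an easy way to check the sketch.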
  • 32. Usefulness of K-PCA. Non-linear PCA. Source: By Petter Strandmark - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=3936753
  • 33. Usefulness of K-PCA [Mariette et al., 2017]. K-PCA for non-numeric datasets, here a qualitative time series: job trajectories after graduation from the French survey “Generation 98” [Cottrell and Letrémy, 2005]. The color is the mode of the trajectories.
  • 34. From multiple dissimilarities to multiple kernels. 1. Several (non-Euclidean) dissimilarities $D^1, \dots, D^M$ are transformed into similarities with [Lee and Verleysen, 2007]: $K^m(x_i, x_j) = -\frac{1}{2} \left( D^m(x_i, x_j) - \frac{2}{N} \sum_{k=1}^N D^m(x_i, x_k) + \frac{1}{N^2} \sum_{k,k'=1}^N D^m(x_k, x_{k'}) \right)$. 2. If $K^m$ is not positive, clipping or flipping (removing the negative part of the eigenvalue decomposition or taking its opposite) produces kernels [Chen et al., 2009].
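The first step of slide 34 can be sketched as a double centering of the dissimilarity matrix. One assumption is made explicit here: the code subtracts the row mean and the column mean separately, which is the usual symmetric reading of the $\frac{2}{N}$ term and keeps the result symmetric. The clipping or flipping step is not covered, since it needs a full eigen-decomposition. The function name and the example are hypothetical.

```python
def dissimilarity_to_kernel(D):
    """Double-center a symmetric dissimilarity matrix and multiply by -1/2."""
    n = len(D)
    row_mean = [sum(row) / n for row in D]
    col_mean = [sum(D[k][j] for k in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (D[i][j] - row_mean[i] - col_mean[j] + grand_mean)
             for j in range(n)]
            for i in range(n)]


# Hypothetical check: squared Euclidean distances between the points 0, 1 and 3 on a line.
positions = [0.0, 1.0, 3.0]
D = [[(a - b) ** 2 for b in positions] for a in positions]
K = dissimilarity_to_kernel(D)

# For squared Euclidean distances, K should be the Gram matrix of the centered points.
mean = sum(positions) / len(positions)
gram = [[(a - mean) * (b - mean) for b in positions] for a in positions]
```

For a non-Euclidean dissimilarity such as Bray-Curtis, the resulting matrix may have negative eigenvalues, which is where the second step of the slide comes in.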
  • 38. From multiple kernels to an integrated kernel. How to combine multiple kernels? Naive approach: $K^* = \frac{1}{M} \sum_m K^m$. Supervised framework: $K^* = \sum_m \beta_m K^m$ with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$, with $\beta_m$ chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011]. Unsupervised framework, but with input space $\mathbb{R}^d$ [Zhuang et al., 2011]: $K^* = \sum_m \beta_m K^m$ with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$, with $\beta_m$ chosen so as to minimize the distortion between all training data, $\sum_{ij} K^*(x_i, x_j) \| x_i - x_j \|^2$, and to minimize the error made when approximating the original data by the kernel embedding, $\sum_i \| x_i - \sum_j K^*(x_i, x_j) x_j \|^2$. Our proposal: 2 UMKL frameworks which do not require data to have values in $\mathbb{R}^d$.
  • 41. STATIS-like framework [L’Hermier des Plantes, 1976, Lavit et al., 1994]. Similarities between kernels: $C_{mm'} = \frac{\langle K^m, K^{m'} \rangle_F}{\| K^m \|_F \| K^{m'} \|_F} = \frac{\mathrm{Trace}(K^m K^{m'})}{\sqrt{\mathrm{Trace}((K^m)^2)\, \mathrm{Trace}((K^{m'})^2)}}$ ($C_{mm'}$ is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the kernel framework). Maximize $\sum_{m=1}^M \left\langle K^*(v), \frac{K^m}{\| K^m \|_F} \right\rangle_F = v^\top C v$ for $K^*(v) = \sum_{m=1}^M v_m K^m$ and $v \in \mathbb{R}^M$ such that $\| v \|_2 = 1$. Solution: first eigenvector of $C$ ⇒ set $\beta = \frac{v}{\sum_{m=1}^M v_m}$ (consensual kernel).
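A small sketch of the STATIS-like weights: build the matrix $C$ of normalized Frobenius inner products, take its first eigenvector and rescale it to sum to one. The function names and the three toy kernels are hypothetical, and the first eigenvector is obtained by power iteration, which is a convenience of this example rather than a stated part of the method.

```python
import math


def frobenius_inner(A, B):
    """Frobenius inner product, equal to Trace(A B) for symmetric matrices."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def kernel_similarity(kernels):
    """Matrix C of cosine-like similarities between kernels."""
    norms = [math.sqrt(frobenius_inner(K, K)) for K in kernels]
    return [[frobenius_inner(Ka, Kb) / (na * nb) for Kb, nb in zip(kernels, norms)]
            for Ka, na in zip(kernels, norms)]


def first_eigenvector(C, n_iter=1000):
    """Power iteration started from the all-ones vector."""
    v = [1.0] * len(C)
    for _ in range(n_iter):
        w = [sum(c * x for c, x in zip(row, v)) for row in C]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def statis_weights(kernels):
    """Weights beta = v / sum(v), with v the first eigenvector of C."""
    v = first_eigenvector(kernel_similarity(kernels))
    total = sum(v)
    return [x / total for x in v]


# Hypothetical kernels on 2 samples: K1 and K2 are proportional, K3 differs.
K1 = [[1.0, 0.0], [0.0, 1.0]]
K2 = [[2.0, 0.0], [0.0, 2.0]]
K3 = [[1.0, 1.0], [1.0, 1.0]]
C = kernel_similarity([K1, K2, K3])
beta = statis_weights([K1, K2, K3])
```

In this toy case the two proportional kernels receive the same weight, and the kernel that agrees less with the others receives a smaller one, which is the consensus behaviour the slide describes.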
  • 44. A kernel preserving the original topology of the data I. From an idea similar to that of [Lin et al., 2010], find a kernel such that the local geometry of the data in the feature space is similar to that of the original data. Proxy of the local geometry: $K^m \longrightarrow \mathcal{G}^m_k$ ($k$-nearest neighbors graph) $\longrightarrow A^m_k$ (adjacency matrix) ⇒ $W = \sum_m I_{\{A^m_k > 0\}}$ or $W = \sum_m A^m_k$ (adjacency matrix image from: By S. Mohammad H. Oloomi, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=35313532 ). Feature space geometry measured by $\Delta_i(\beta) = \left\langle \phi^*_\beta(x_i), \begin{pmatrix} \phi^*_\beta(x_1) \\ \vdots \\ \phi^*_\beta(x_N) \end{pmatrix} \right\rangle = \begin{pmatrix} K^*_\beta(x_i, x_1) \\ \vdots \\ K^*_\beta(x_i, x_N) \end{pmatrix}$.
  • 46. A kernel preserving the original topology of the data II. Sparse version: minimize $\sum_{i,j=1}^N W_{ij} \| \Delta_i(\beta) - \Delta_j(\beta) \|^2$ for $K^*_\beta = \sum_{m=1}^M \beta_m K^m$ and $\beta \in \mathbb{R}^M$ such that $\beta_m \geq 0$ and $\sum_{m=1}^M \beta_m = 1$ ⇔ minimize $\sum_{m,m'=1}^M \beta_m \beta_{m'} S_{mm'}$ for $\beta \in \mathbb{R}^M$ such that $\beta_m \geq 0$ and $\sum_{m=1}^M \beta_m = 1$, with $S_{mm'} = \sum_{i,j=1}^N W_{ij} \langle \Delta^m_i - \Delta^m_j, \Delta^{m'}_i - \Delta^{m'}_j \rangle$ and $\Delta^m_i = \begin{pmatrix} K^m(x_i, x_1) \\ \vdots \\ K^m(x_i, x_N) \end{pmatrix}$.
  • 47. A kernel preserving the original topology of the data II. Non sparse version: minimize $\sum_{i,j=1}^N W_{ij} \| \Delta_i(v) - \Delta_j(v) \|^2$ for $K^*_v = \sum_{m=1}^M v_m K^m$ and $v \in \mathbb{R}^M$ such that $v_m \geq 0$ and $\| v \|_2 = 1$ ⇔ minimize $\sum_{m,m'=1}^M v_m v_{m'} S_{mm'}$ for $v \in \mathbb{R}^M$ such that $v_m \geq 0$ and $\| v \|_2 = 1$, with $S_{mm'}$ and $\Delta^m_i$ defined as in the sparse version.
  • 51. Optimization issues. The sparse version writes $\min_\beta \beta^\top S \beta$ such that $\beta \geq 0$ and $\| \beta \|_1 = \sum_m \beta_m = 1$ ⇒ standard QP problem with linear constraints (e.g. package quadprog in R). The non sparse version writes $\min_\beta \beta^\top S \beta$ such that $\beta \geq 0$ and $\| \beta \|_2 = 1$ ⇒ QCQP problem (hard to solve). It is equivalent to the following problem: $\min_{\beta, B} \mathrm{Trace}(S_2 X)$ such that $\mathrm{Trace}(A X) = 1$, $\mathrm{Trace}(A_j X) \geq 0$ and $B = \beta \beta^\top$, with $X = \begin{pmatrix} 1 & \beta^\top \\ \beta & B \end{pmatrix}$, $A = \begin{pmatrix} 0 & 0_M^\top \\ 0_M & I_M \end{pmatrix}$ and $A_j = \begin{pmatrix} 0 & 1_j^\top \\ 1_j & 0_{MM} \end{pmatrix}$ (here $I_M$ is the $M \times M$ identity matrix). It is relaxed by replacing the constraint $B = \beta \beta^\top$ with the requirement that $X$ is positive semi-definite: semi-definite programming ⇒ efficient solvers exist.
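To make the sparse problem concrete, here is a minimal sketch that minimizes $\beta^\top S \beta$ over the simplex. The slide points to a QP solver (quadprog in R); this example swaps in a plain projected-gradient loop so that it stays self-contained, and it assumes $S$ is symmetric positive semi-definite. The function names, the step-size rule and the $2 \times 2$ matrix are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {b : b >= 0, sum(b) = 1} (sort-based algorithm)."""
    cumulative = 0.0
    theta = 0.0
    for k, value in enumerate(sorted(v, reverse=True), start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        if value - candidate > 0:
            theta = candidate  # the last k satisfying the test gives the threshold
    return [max(x - theta, 0.0) for x in v]


def solve_sparse_weights(S, n_iter=5000):
    """Projected gradient descent for min b' S b subject to b >= 0 and sum(b) = 1."""
    m = len(S)
    # Conservative step: the gradient 2 S b is Lipschitz with constant at most
    # twice the largest absolute row sum of S.
    step = 1.0 / (2.0 * max(sum(abs(x) for x in row) for row in S))
    beta = [1.0 / m] * m
    for _ in range(n_iter):
        grad = [2.0 * sum(S[i][j] * beta[j] for j in range(m)) for i in range(m)]
        beta = project_to_simplex([b - step * g for b, g in zip(beta, grad)])
    return beta


# Hypothetical S for two kernels: the optimum puts all the weight on the first one.
S = [[1.0, 1.0], [1.0, 2.0]]
beta = solve_sparse_weights(S)
```

The toy example shows why this version is called sparse: the simplex constraint can drive some weights exactly to zero, as reported later in the talk for three of the kernels.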
  • 54. A proposal to improve interpretability of K-PCA in our framework. Issue: how to assess the importance of a given species in the K-PCA? Our datasets are either numeric (environmental) or built from an $n \times p$ count matrix ⇒ for a given species, randomly permute its counts and redo the analysis (kernel computation, with the same optimized weights, and K-PCA). The influence of a given species in a given dataset on a given PC subspace is assessed by computing the Crone-Crosby distance between the original and the permuted PCA subspaces [Crone and Crosby, 1995] (∼ Frobenius norm between the projectors).
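The subspace comparison used in this permutation scheme can be sketched as a Frobenius norm between orthogonal projectors. The $1/\sqrt{2}$ normalization is an assumption of this example (the slide only says the distance is similar to a Frobenius norm between projectors), and the function names and the bases in $\mathbb{R}^3$ are hypothetical. Each subspace is given by an orthonormal list of vectors.

```python
import math


def projector(basis):
    """Orthogonal projector onto the span of an orthonormal list of vectors."""
    n = len(basis[0])
    return [[sum(vec[i] * vec[j] for vec in basis) for j in range(n)] for i in range(n)]


def crone_crosby(basis_a, basis_b):
    """Frobenius norm of the difference of the two projectors, divided by sqrt(2)."""
    P, Q = projector(basis_a), projector(basis_b)
    n = len(P)
    squared = sum((P[i][j] - Q[i][j]) ** 2 for i in range(n) for j in range(n))
    return math.sqrt(squared / 2.0)


# Hypothetical 2-dimensional subspaces of R^3.
r = 1.0 / math.sqrt(2.0)
plane_xy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
plane_xy_rotated = [[r, r, 0.0], [r, -r, 0.0]]  # same plane, different basis
plane_xz = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

d_same = crone_crosby(plane_xy, plane_xy_rotated)  # close to 0
d_diff = crone_crosby(plane_xy, plane_xz)          # 1 with this normalization
```

Because the distance depends only on the projectors, it is unaffected by the sign or rotation ambiguity of the eigenvectors, which is what makes it usable for comparing the original and the permuted analyses.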
  • 56. Integrating ’omics data using kernels. $M$ TARA Oceans datasets $(x^m_i)_{i=1,\dots,N,\, m=1,\dots,M}$, measured on the same ocean samples $(1, \dots, N)$, which take values in arbitrary spaces $(\mathcal{X}^m)_m$: environmental dataset, bacteria phylogenomic tree, bacteria functional composition, eukaryote pico-plankton composition, ..., virus composition.
  • 57. Integrating ’omics data using kernels. Environmental dataset: standard Euclidean distance, given by the kernel $K(x_i, x_j) = x_i^\top x_j$.
  • 58. Integrating ’omics data using kernels. Bacteria phylogenomic tree: the weighted Unifrac distance, given by $d_{wUF}(x_i, x_j) = \frac{\sum_e l_e |p_{ei} - p_{ej}|}{\sum_e (p_{ei} + p_{ej})}$.
  • 59. Integrating ’omics data using kernels. All composition-based datasets (bacteria functional composition, eukaryote (pico, nano, micro, meso)-plankton composition and virus composition) are compared using the Bray-Curtis dissimilarity, $d_{BC}(x_i, x_j) = \frac{\sum_g |n_{ig} - n_{jg}|}{\sum_g (n_{ig} + n_{jg})}$, with $n_{ig}$ the abundance of gene $g$, summarized at the KEGG orthologous groups level, in sample $i$.
  • 60. Integrating ’omics data using kernels. Combination of the $M$ kernels by a weighted sum: $K^* = \sum_{m=1}^M \beta_m K^m$, where $\beta_m \geq 0$ and $\sum_{m=1}^M \beta_m = 1$.
  • 61. Integrating ’omics data using kernels. Apply standard data mining methods (clustering, linear model, PCA, ...) in the feature space.
  • 64. Correlation between kernels (STATIS). Low correlations between the bacteria functional composition and the other datasets. Strong correlation between environmental variables and small organisms (bacteria, eukaryote pico-plankton and virus).
  • 65. Influence of $k$ (number of neighbors) on $(\beta_m)_m$: $k \geq 5$ provides stable results.
  • 68. $(\beta_m)_m$ values returned by graph-MKL. The dataset least correlated with the others, the bacteria functional composition, has the highest coefficient. Three kernels have a weight equal to 0 (sparse version).
  • 69. Proof of concept: using [Sunagawa et al., 2015] Datasets 139 samples, 3 layers (SRF, DCM and MES) kernels: phychem, pro-OTUs and pro-OGs Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 30/41
  • 75. Proof of concept: using [Sunagawa et al., 2015] Proteobacteria (clades SAR11 (Alphaproteobacteria) and SAR86) dominate the sampled areas of the ocean in terms of relative abundance and taxonomic richness.
  • 76. K-PCA on $K^*$
  • 77. K-PCA on $K^*$ - environmental dataset
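Kernel PCA [Schölkopf et al., 1998] on a combined kernel can be sketched as an eigendecomposition of the centered kernel matrix. This is a minimal illustration under assumptions, not the code behind the figures of the talk: the function name `kernel_pca`, the toy kernels and the weights are hypothetical.

```python
import numpy as np


def kernel_pca(K, n_components=2):
    """Return sample coordinates on the leading kernel principal axes."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H  # center the kernel in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by rounding errors.
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    # Coordinates of the samples: eigenvectors scaled by sqrt(eigenvalue).
    scores = eigvecs * np.sqrt(eigvals)
    return scores, eigvals


# Toy combined kernel on 8 samples (random values and arbitrary weights).
rng = np.random.default_rng(2)
X1 = rng.normal(size=(8, 5))
X2 = rng.normal(size=(8, 3))
K_star = 0.4 * (X1 @ X1.T) + 0.6 * (X2 @ X2.T)

scores, eigvals = kernel_pca(K_star, n_components=2)
```

Plotting the two columns of `scores` against each other gives the usual sample map of a K-PCA.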
  • 79. Conclusion and perspectives: Summary: an integrative exploratory method, particularly well suited to multiple metagenomic datasets, with enhanced interpretability. Perspectives: implement the SDP solution and test it; improve biological interpretation; soon-to-be-released R package.
  • 80. Questions?
  • 81. References
    Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.
    Brum, J., Ignacio-Espinoza, J., Roux, S., Doulcier, G., Acinas, S., Alberti, A., Chaffron, S., Cruaud, C., de Vargas, C., Gasol, J., Gorsky, G., Gregory, A., Guidi, L., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Poulos, B., Schwenck, S., Speich, S., Dimier, C., Kandels-Lewis, S., Picheral, M., Searson, S., Tara Oceans coordinators, Bork, P., Bowler, C., Sunagawa, S., Wincker, P., Karsenti, E., and Sullivan, M. (2015). Patterns and ecological drivers of ocean viral communities. Science, 348(6237).
    Chen, Y., Garcia, E., Gupta, M., Rahimi, A., and Cazzanti, L. (2009). Similarity-based classification: concepts and algorithms. Journal of Machine Learning Research, 10:747–776.
    Cottrell, M. and Letrémy, P. (2005). How to use the Kohonen algorithm to simultaneously analyse individuals in a survey. Neurocomputing, 63:193–207.
    Crone, L. and Crosby, D. (1995). Statistical applications of a metric on subspaces to satellite meteorology. Technometrics, 37(3):324–328.
    de Vargas, C., Audic, S., Henry, N., Decelle, J., Mahé, P., Logares, R., Lara, E., Berney, C., Le Bescot, N., Probert, I., Carmichael, M., Poulain, J., Romac, S., Colin, S., Aury, J., Bittner, L., Chaffron, S., Dunthorn, M., Engelen, S., Flegontova, O., Guidi, L., Horák, A., Jaillon, O., Lima-Mendez, G., Lukeš, J., Malviya, S., Morard, R., Mulot, M., Scalco, E., Siano, R., Vincent, F., Zingone, A., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Acinas, S., Bork, P., Bowler, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Raes, J., Sieracki, M. E., Speich, S., Stemmann, L., Sunagawa, S., Weissenbach, J., Wincker, P., and Karsenti, E. (2015). Eukaryotic plankton diversity in the sunlit ocean. Science, 348(6237).
    Gönen, M. and Alpaydin, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268.
    Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18(1):97–119.
    Lee, J. and Verleysen, M. (2007). Nonlinear Dimensionality Reduction. Information Science and Statistics. Springer, New York; London.
    L’Hermier des Plantes, H. (1976). Structuration des tableaux à trois indices de la statistique. PhD thesis, Université de Montpellier. Thèse de troisième cycle.
    Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J., Roux, S., Vincent, F., Bittner, L., Darzi, Y., Wang, B., Audic, S., Berline, L., Bontempi, G., Cabello, A., Coppola, L., Cornejo-Castillo, F., d’Oviedo, F., de Meester, L., Ferrera, I., Garet-Delmas, M., Guidi, L., Lara, E., Pesant, S., Royo-Llonch, M., Salazar, F., Sánchez, P., Sebastian, M., Souffreau, C., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Gorsky, G., Not, F., Ogata, H., Speich, S., Stemmann, L., Weissenbach, J., Wincker, P., Acinas, S., Sunagawa, S., Bork, P., Sullivan, M., Karsenti, E., Bowler, C., de Vargas, C., and Raes, J. (2015). Determinants of community structure in the global plankton interactome. Science, 348(6237).
    Lin, Y., Liu, T., and Fuh, C. (2010). Multiple kernel learning for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1147–1160.
    Mariette, J., Olteanu, M., and Villa-Vialaneix, N. (2017). Efficient interpretable variants of online SOM for large dissimilarity data. Neurocomputing, 225:31–48.
    Olteanu, M. and Villa-Vialaneix, N. (2015). On-line relational and multiple relational SOM. Neurocomputing, 147:15–30.
    Robert, P. and Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Applied Statistics, 25(3):257–265.
    Schölkopf, B., Smola, A., and Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319.
    Sommer, M., Church, G., and Dantas, G. (2010). A functional metagenomic approach for expanding the synthetic biology toolbox for biomass conversion. Molecular Systems Biology, 6(360).
    Sunagawa, S., Coelho, L., Chaffron, S., Kultima, J., Labadie, K., Salazar, F., Djahanschiri, B., Zeller, G., Mende, D., Alberti, A., Cornejo-Castillo, F., Costea, P., Cruaud, C., d’Oviedo, F., Engelen, S., Ferrera, I., Gasol, J., Guidi, L., Hildebrand, F., Kokoszka, F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M., Weissenbach, J., Wincker, P., Karsenti, E., Raes, J., Acinas, S., and Bork, P. (2015). Structure and function of the global ocean microbiome. Science, 348(6237).
    Zhuang, J., Wang, J., Hoi, S., and Lan, X. (2011). Unsupervised multiple kernel clustering. Journal of Machine Learning Research: Workshop and Conference Proceedings, 20:129–144.