Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Integrating Tara Oceans datasets using unsupervised
multiple kernel learning
Nathalie Villa-Vialaneix
Joint work with Jérôme Mariette
http://www.nathalievilla.org
Séminaire de Probabilité et Statistique
Laboratoire J.A. Dieudonné, Université de Nice
Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 1/41

Outline
1 Metagenomic datasets and associated questions
2 A typical (and rich) case study: TARA Oceans datasets
3 A UMKL framework for integrating multiple metagenomic data
4 Application to TARA Oceans datasets

What are metagenomic data?
Source: [Sommer et al., 2010]

abundance data: sparse n × p matrices of count data, with samples in rows and
descriptors (species, OTUs, KEGG groups, k-mers, ...) in columns. Generally p ≫ n.
phylogenetic tree (evolutionary history between species, OTUs, ...): one tree
with p leaves, built from the sequences collected in the n samples.

What are metagenomic data used for?
produce a profile of the diversity of a given sample ⇒ makes it possible to
compare diversity between various conditions
used in various fields: environmental science, microbiota, ...
Processed by computing a relevant dissimilarity between samples
(the standard Euclidean distance is not relevant) and by using this dissimilarity
in subsequent analyses.

β-diversity data: dissimilarities between count data
Compositional dissimilarities, with $n_{ig}$ the count of species $g$ in sample $i$:
Jaccard: the fraction of species specific to either sample $i$ or sample $j$:
\[ d_{\mathrm{jac}} = \frac{\sum_g \left( I_{\{n_{ig}>0,\, n_{jg}=0\}} + I_{\{n_{jg}>0,\, n_{ig}=0\}} \right)}{\sum_g I_{\{n_{ig}+n_{jg}>0\}}} \]
Bray-Curtis: the fraction of the sample which is specific to either sample $i$ or sample $j$:
\[ d_{\mathrm{BC}} = \frac{\sum_g |n_{ig} - n_{jg}|}{\sum_g (n_{ig} + n_{jg})} \]
Other dissimilarities are available in the R package phyloseq; most of them are
not Euclidean.
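
As an illustration of the two compositional dissimilarities above, here is a minimal pure-Python sketch. The function names and the toy count vectors are made up for the example; in practice one would more likely rely on an existing package such as phyloseq.

```python
def jaccard(ni, nj):
    """Fraction of observed species present in exactly one of the two samples."""
    specific = sum(1 for a, b in zip(ni, nj) if (a > 0) != (b > 0))
    observed = sum(1 for a, b in zip(ni, nj) if a + b > 0)
    return specific / observed if observed else 0.0


def bray_curtis(ni, nj):
    """Sum of absolute count differences divided by the total count of both samples."""
    num = sum(abs(a - b) for a, b in zip(ni, nj))
    den = sum(a + b for a, b in zip(ni, nj))
    return num / den if den else 0.0


# Toy counts for 4 species (hypothetical values).
sample_i = [10, 0, 5, 0]
sample_j = [0, 3, 5, 0]

d_jac = jaccard(sample_i, sample_j)      # 2 of the 3 observed species are sample-specific
d_bc = bray_curtis(sample_i, sample_j)   # 13 differing counts out of 23 in total
print(d_jac, d_bc)
```

Returning 0 when neither sample contains any count is a convention chosen for this sketch, not something stated on the slide.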

β-diversity data: phylogenetic dissimilarities
Phylogenetic dissimilarities

For each branch $e$, denote by $l_e$ its length and by $p_{ei}$
the fraction of counts in sample $i$
corresponding to species below branch $e$.

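Unifrac: the fraction of the tree specific to either sample $i$ or sample $j$.
\[ d_{\mathrm{UF}} = \frac{\sum_e l_e \left( I_{\{p_{ei}>0,\, p_{ej}=0\}} + I_{\{p_{ej}>0,\, p_{ei}=0\}} \right)}{\sum_e l_e I_{\{p_{ei}+p_{ej}>0\}}} \]
Weighted Unifrac: the fraction of the diversity specific to sample $i$ or to sample $j$.
\[ d_{\mathrm{wUF}} = \frac{\sum_e l_e |p_{ei} - p_{ej}|}{\sum_e (p_{ei} + p_{ej})} \]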
TARA Oceans datasets
The 2009-2013 expedition
Co-directed by Étienne Bourgois
and Éric Karsenti.
7,012 datasets collected from
35,000 samples of plankton and
water (11,535 Gb of data).
Study the plankton: bacteria,
protists, metazoans and viruses
representing more than 90% of the
biomass in the ocean.

TARA Oceans datasets
Science (May 2015) - Studies on:
eukaryotic plankton diversity
[de Vargas et al., 2015],
ocean viral communities
[Brum et al., 2015],
global plankton interactome
[Lima-Mendez et al., 2015],
global ocean microbiome
[Sunagawa et al., 2015],
. . . .
→ datasets of different types and from different sources, analyzed separately.

Background of this talk
Objectives
Until now: many papers using many methods. No integrated analysis
performed.
What do the datasets reveal if integrated in a single analysis?
Our purpose: develop a generic method to integrate phylogenetic,
taxonomic and functional community composition together with
environmental factors.

TARA Oceans datasets that we used
Datasets used
[Sunagawa et al., 2015]
environmental dataset: 22 numeric features (temperature, salinity, . . . ).
bacteria phylogenomic tree: computed from ∼ 35,000 OTUs.
bacteria functional composition: ∼ 63,000 KEGG orthologous groups.
[de Vargas et al., 2015]
eukaryotic plankton composition split into 4 groups: pico (0.8 − 5µm),
nano (5 − 20µm), micro (20 − 180µm) and meso (180 − 2000µm).
[Brum et al., 2015]
virus composition: ∼ 867 virus clusters based on shared gene content.

Common samples
48 samples,
2 depth layers: surface
(SRF) and deep chlorophyll
maximum (DCM),
31 different sampling
stations.

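Kernel methods
Kernel viewed as the dot product in an implicit Hilbert space
$K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $K(x_i, x_j) = K(x_j, x_i)$ and, $\forall\, m \in \mathbb{N}$, $\forall\, x_1, \ldots, x_m \in \mathcal{X}$, $\forall\, \alpha_1, \ldots, \alpha_m \in \mathbb{R}$,
\[ \sum_{i,j=1}^m \alpha_i \alpha_j K(x_i, x_j) \geq 0. \]
⇒ [Aronszajn, 1950]
\[ \exists!\, (\mathcal{H}, \langle \cdot, \cdot \rangle),\ \phi : \mathcal{X} \to \mathcal{H} \text{ such that } K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle \]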

Exploratory analysis with kernels
A well-known example: kernel PCA [Schölkopf et al., 1998]
PCA performed in the feature space induced by the kernel K.

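In practice:
K is centered: $K \leftarrow K - \frac{1}{N} I_N K - \frac{1}{N} K I_N + \frac{1}{N^2} I_N K I_N$, where $I_N$ is the $N \times N$ matrix with all entries equal to 1;
K-PCA is performed by the eigen-decomposition of (centered) K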

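If $(\alpha_k)_{k=1,\ldots,N} \in \mathbb{R}^N$ and $(\lambda_k)_{k=1,\ldots,N}$ are the eigenvectors and eigenvalues,
PC axes are:
\[ a_k = \sum_{i=1}^N \alpha_{ki} \phi(x_i) \]
and $(a_k)_{k=1,\ldots,N}$ are orthonormal in the feature space induced by the kernel:
\[ \forall\, k, k', \quad \langle a_k, a_{k'} \rangle = \alpha_k^\top K \alpha_{k'} = \delta_{kk'} \quad \text{with} \quad \delta_{kk'} = \begin{cases} 0 & \text{if } k \neq k' \\ 1 & \text{otherwise.} \end{cases} \]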

Coordinates of the projection of the observations $(\phi(x_i))_i$:
\[ \langle a_k, \phi(x_i) \rangle = \sum_{j=1}^N \alpha_{kj} K_{ji} = K_{i.} \alpha_k = \lambda_k \alpha_{ki}, \]
where $K_{i.}$ is the $i$-th row of $K$.
No representation for the variables (no real variables...).
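
The K-PCA steps above (centering, eigen-decomposition, projection) can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names are invented, the example kernel is a linear kernel on random data, and the eigenvectors are rescaled by $1/\sqrt{\lambda_k}$, assuming the retained eigenvalues are positive, so that the axes are orthonormal in the feature space.

```python
import numpy as np


def center_kernel(K):
    """Double-center a kernel matrix (centering of the images in feature space)."""
    N = K.shape[0]
    ones = np.full((N, N), 1.0 / N)
    return K - ones @ K - K @ ones + ones @ K @ ones


def kernel_pca(K, n_components=2):
    """Return eigenvalues, coefficients alpha and sample coordinates of a K-PCA."""
    Kc = center_kernel(K)
    eigvals, eigvecs = np.linalg.eigh(Kc)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = eigvals[order]
    # Rescale so that alpha_k^T Kc alpha_k = 1 (requires lam > 0).
    alpha = eigvecs[:, order] / np.sqrt(lam)
    coords = Kc @ alpha                              # equals lam * alpha column-wise
    return lam, alpha, coords


# Toy example: linear kernel on random data (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = X @ X.T
lam, alpha, coords = kernel_pca(K, n_components=2)
print(coords)
```

With a linear kernel, `coords` should match the scores of an ordinary PCA up to the sign of each axis, which gives a convenient sanity check.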

Other unsupervised kernel methods: kernel SOM
[Olteanu and Villa-Vialaneix, 2015, Mariette et al., 2017]

Usefulness of K-PCA
Non linear PCA
Source: By Petter Strandmark - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=3936753

[Mariette et al., 2017] K-PCA for non numeric datasets - here a
qualitative time series: job trajectories after graduation from the French
survey “Generation 98” [Cottrell and Letrémy, 2005]
(color is the mode of the trajectories)

From multiple dissimilarities to multiple kernels
1 several (non Euclidean) dissimilarities $D^1, \ldots, D^M$, transformed into
similarities with [Lee and Verleysen, 2007]:
\[ K^m(x_i, x_j) = -\frac{1}{2} \left( D^m(x_i, x_j) - \frac{1}{N} \sum_{k=1}^N D^m(x_i, x_k) - \frac{1}{N} \sum_{k=1}^N D^m(x_k, x_j) + \frac{1}{N^2} \sum_{k,k'=1}^N D^m(x_k, x_{k'}) \right) \]
2 if the resulting similarity is not positive, clipping or flipping (removing the negative part of the
eigenvalue decomposition, or taking its opposite) produces a kernel
[Chen et al., 2009].
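
The two steps above can be sketched with NumPy. The sketch uses the usual symmetric double centering (row mean, column mean, grand mean) and applies it to whatever dissimilarity matrix is passed in (classical scaling would pass squared distances). Function names and the toy matrix are assumptions made for the example.

```python
import numpy as np


def dissimilarity_to_similarity(D):
    """Double-center a dissimilarity matrix: S = -1/2 * J D J with J = I - 11^T / N."""
    N = D.shape[0]
    J = np.eye(N) - np.full((N, N), 1.0 / N)
    return -0.5 * J @ D @ J


def make_psd(S, method="clip"):
    """Turn a symmetric similarity into a kernel by clipping or flipping its spectrum."""
    S = (S + S.T) / 2.0                      # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(S)
    if method == "clip":
        eigvals = np.maximum(eigvals, 0.0)   # drop the negative part
    elif method == "flip":
        eigvals = np.abs(eigvals)            # take the opposite of negative eigenvalues
    else:
        raise ValueError("method must be 'clip' or 'flip'")
    return (eigvecs * eigvals) @ eigvecs.T


# Toy dissimilarity violating the triangle inequality (hypothetical values).
D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])
S = dissimilarity_to_similarity(D)
K_clip = make_psd(S, "clip")
K_flip = make_psd(S, "flip")
print(np.linalg.eigvalsh(S), np.linalg.eigvalsh(K_clip))
```

Clipping discards the non-Euclidean part of the dissimilarity, whereas flipping keeps its magnitude; which one is preferable depends on the data.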

From multiple kernels to an integrated kernel
How to combine multiple kernels?
naive approach: $K^* = \frac{1}{M} \sum_m K^m$

supervised framework: $K^* = \sum_m \beta_m K^m$
with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$,
with $\beta_m$ chosen so as to minimize the prediction error
[Gönen and Alpaydin, 2011]

unsupervised framework, but the input space is $\mathbb{R}^d$
[Zhuang et al., 2011]:
$K^* = \sum_m \beta_m K^m$
with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$, with $\beta_m$ chosen so as to
minimize the distortion between all training data, $\sum_{ij} K^*(x_i, x_j) \|x_i - x_j\|^2$,
AND minimize the error made when approximating the original data by the kernel
embedding, $\sum_i \left\| x_i - \sum_j K^*(x_i, x_j) x_j \right\|^2$.
Our proposal: 2 UMKL frameworks which do not require data to have
values in $\mathbb{R}^d$.

STATIS like framework
[L’Hermier des Plantes, 1976, Lavit et al., 1994]
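Similarities between kernels:
\[ C_{mm'} = \frac{\langle K^m, K^{m'} \rangle_F}{\|K^m\|_F \|K^{m'}\|_F} = \frac{\operatorname{Trace}(K^m K^{m'})}{\sqrt{\operatorname{Trace}((K^m)^2)\, \operatorname{Trace}((K^{m'})^2)}}. \]
($C_{mm'}$ is an extension of the RV-coefficient [Robert and Escoufier, 1976] to
the kernel framework)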

\[ \text{maximize } \sum_{m=1}^M \left\langle K^*(v), \frac{K^m}{\|K^m\|_F} \right\rangle_F = v^\top C v \]
for $K^*(v) = \sum_{m=1}^M v_m K^m$ and $v \in \mathbb{R}^M$ such that $\|v\|_2 = 1$.
Solution: first eigenvector of $C$ ⇒ set $\beta = \frac{v}{\sum_{m=1}^M v_m}$
(consensual kernel).
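
A minimal NumPy sketch of this STATIS-like scheme: build the matrix $C$ of cosines between kernels, take its leading eigenvector and rescale it to sum to one. Function names and the random toy kernels are assumptions. The sign of the eigenvector is fixed so that its entries sum to a positive value, which is legitimate because an eigenvector is only defined up to sign.

```python
import numpy as np


def kernel_cosine_matrix(kernels):
    """C[m, m'] = <K^m, K^m'>_F / (||K^m||_F ||K^m'||_F)."""
    M = len(kernels)
    norms = [np.linalg.norm(Km, "fro") for Km in kernels]
    C = np.empty((M, M))
    for a in range(M):
        for b in range(M):
            # For symmetric matrices, Trace(K^a K^b) is the Frobenius inner product.
            C[a, b] = np.trace(kernels[a] @ kernels[b]) / (norms[a] * norms[b])
    return C


def statis_umkl_weights(kernels):
    """Weights beta = v / sum(v), with v the leading eigenvector of C."""
    C = kernel_cosine_matrix(kernels)
    eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalues
    v = eigvecs[:, -1]
    if v.sum() < 0:
        v = -v
    return v / v.sum(), C


# Toy example: three linear kernels built from random features (hypothetical data).
rng = np.random.default_rng(0)
kernels = [X @ X.T for X in (rng.normal(size=(8, d)) for d in (2, 3, 4))]
beta, C = statis_umkl_weights(kernels)
K_star = sum(b * Km for b, Km in zip(beta, kernels))
print(beta)
```

When all kernels are positive semi-definite, the entries of $C$ are non-negative, so the leading eigenvector can be chosen with non-negative entries and `beta` is a valid set of convex weights.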

A kernel preserving the original topology of the data I
From an idea similar to that of [Lin et al., 2010], ﬁnd a kernel such that the
local geometry of the data in the feature space is similar to that of the
original data.

Proxy of the local geometry
\[ K^m \longrightarrow \mathcal{G}^m_k \text{ ($k$-nearest neighbors graph)} \longrightarrow A^m_k \text{ (adjacency matrix)} \]
\[ \Rightarrow\ W = \sum_m I_{\{A^m_k > 0\}} \quad \text{or} \quad W = \sum_m A^m_k \]
Adjacency matrix image from: By S. Mohammad H. Oloomi, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=35313532

Feature space geometry measured by
\[ \Delta_i(\beta) = \left\langle \phi^*_\beta(x_i), \begin{pmatrix} \phi^*_\beta(x_1) \\ \vdots \\ \phi^*_\beta(x_N) \end{pmatrix} \right\rangle = \begin{pmatrix} K^*_\beta(x_i, x_1) \\ \vdots \\ K^*_\beta(x_i, x_N) \end{pmatrix} \]

A kernel preserving the original topology of the data II
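Sparse version
\[ \text{minimize } \sum_{i,j=1}^N W_{ij} \left\| \Delta_i(\beta) - \Delta_j(\beta) \right\|^2 \]
for $K^*_\beta = \sum_{m=1}^M \beta_m K^m$ and $\beta \in \mathbb{R}^M$ such that $\beta_m \geq 0$ and $\sum_{m=1}^M \beta_m = 1$.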

\[ \Leftrightarrow\ \text{minimize } \sum_{m,m'=1}^M \beta_m \beta_{m'} S_{mm'} \]
for $\beta \in \mathbb{R}^M$ such that $\beta_m \geq 0$ and $\sum_{m=1}^M \beta_m = 1$,
with $S_{mm'} = \sum_{i,j=1}^N W_{ij} \left\langle \Delta_i^m - \Delta_j^m, \Delta_i^{m'} - \Delta_j^{m'} \right\rangle$ and
\[ \Delta_i^m = \begin{pmatrix} K^m(x_i, x_1) \\ \vdots \\ K^m(x_i, x_N) \end{pmatrix}. \]
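
The sparse problem can be prototyped with NumPy as below. This is a sketch under several assumptions: the k-nearest-neighbour graph is built from the distances induced by each kernel and symmetrised, $W$ is the sum of the adjacency matrices, and the quadratic program is solved with a plain projected-gradient loop on the simplex, which stands in for a dedicated QP solver. All function names are invented for the example.

```python
import numpy as np


def knn_adjacency(K, k):
    """Symmetric 0/1 adjacency of the k-nearest-neighbour graph induced by kernel K."""
    K = np.asarray(K, dtype=float)
    d = np.diag(K)
    dist2 = d[:, None] + d[None, :] - 2.0 * K     # squared feature-space distances
    np.fill_diagonal(dist2, np.inf)               # a point is not its own neighbour
    N = K.shape[0]
    nearest = np.argsort(dist2, axis=1)[:, :k]
    A = np.zeros((N, N))
    A[np.repeat(np.arange(N), k), nearest.ravel()] = 1.0
    return np.maximum(A, A.T)


def s_matrix(kernels, W):
    """S[m, m'] = sum_ij W_ij <Delta_i^m - Delta_j^m, Delta_i^m' - Delta_j^m'>."""
    # diffs[m][i, j, :] = row i of K^m minus row j of K^m (O(N^3) memory, small N only).
    diffs = [Km[:, None, :] - Km[None, :, :] for Km in kernels]
    M = len(kernels)
    S = np.empty((M, M))
    for a in range(M):
        for b in range(M):
            S[a, b] = np.sum(W * np.sum(diffs[a] * diffs[b], axis=2))
    return S


def project_simplex(y):
    """Euclidean projection onto {b : b >= 0, sum(b) = 1}."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(y) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(y - css[rho] / (rho + 1), 0.0)


def sparse_umkl_weights(kernels, k=3, n_iter=5000):
    """Minimise beta^T S beta over the simplex by projected gradient descent."""
    W = sum(knn_adjacency(Km, k) for Km in kernels)
    S = s_matrix(kernels, W)
    step = 1.0 / (2.0 * np.linalg.norm(S, 2))     # inverse Lipschitz constant of 2 S beta
    beta = np.full(len(kernels), 1.0 / len(kernels))
    for _ in range(n_iter):
        beta = project_simplex(beta - step * 2.0 * S @ beta)
    return beta, S


# Toy example: three linear kernels on random features (hypothetical data).
rng = np.random.default_rng(0)
kernels = [X @ X.T for X in (rng.normal(size=(8, d)) for d in (2, 3, 4))]
beta, S = sparse_umkl_weights(kernels, k=3)
print(beta)
```

Because the objective is not scale-invariant, kernels with very different magnitudes are probably best rescaled (for instance to unit Frobenius norm) before running such a sketch.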

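Non sparse version
\[ \text{minimize } \sum_{i,j=1}^N W_{ij} \left\| \Delta_i(v) - \Delta_j(v) \right\|^2 \]
for $K^*_v = \sum_{m=1}^M v_m K^m$ and $v \in \mathbb{R}^M$ such that $v_m \geq 0$ and $\|v\|_2 = 1$
\[ \Leftrightarrow\ \text{minimize } \sum_{m,m'=1}^M v_m v_{m'} S_{mm'} \]
for $v \in \mathbb{R}^M$ such that $v_m \geq 0$ and $\|v\|_2 = 1$,
with $S_{mm'} = \sum_{i,j=1}^N W_{ij} \left\langle \Delta_i^m - \Delta_j^m, \Delta_i^{m'} - \Delta_j^{m'} \right\rangle$ and
\[ \Delta_i^m = \begin{pmatrix} K^m(x_i, x_1) \\ \vdots \\ K^m(x_i, x_N) \end{pmatrix}. \]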

Optimization issues
The sparse version reads $\min_\beta \beta^\top S \beta$ such that $\beta \geq 0$ and $\|\beta\|_1 = \sum_m \beta_m = 1$ ⇒
standard QP problem with linear constraints (e.g., package quadprog
in R).

The non sparse version reads $\min_\beta \beta^\top S \beta$ such that $\beta \geq 0$ and $\|\beta\|_2 = 1$ ⇒ QCQP
problem (hard to solve).

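Equivalent to the following problem: $\min_{\beta, B} \operatorname{Trace}(S_2 X)$ such that
$\operatorname{Trace}(AX) = 1$, $\operatorname{Trace}(A_j X) \geq 0$ and $B = \beta \beta^\top$, with:
\[ X = \begin{pmatrix} 1 & \beta^\top \\ \beta & B \end{pmatrix}, \quad A = \begin{pmatrix} 0 & 0_M^\top \\ 0_M & I_M \end{pmatrix}, \quad A_j = \begin{pmatrix} 0 & 1_j^\top \\ 1_j & 0_{MM} \end{pmatrix} \]
(so that $\operatorname{Trace}(AX) = \operatorname{Trace}(B) = \|\beta\|_2^2$ and $\operatorname{Trace}(A_j X) = 2\, 1_j^\top \beta$).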

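Relaxed into the following problem: $\min_{\beta, B} \operatorname{Trace}(S_2 X)$ such that
$\operatorname{Trace}(AX) = 1$, $\operatorname{Trace}(A_j X) \geq 0$ and
\[ X = \begin{pmatrix} 1 & \beta^\top \\ \beta & B \end{pmatrix} \text{ is positive semi-definite,} \]
with
\[ A = \begin{pmatrix} 0 & 0_M^\top \\ 0_M & I_M \end{pmatrix}, \quad A_j = \begin{pmatrix} 0 & 1_j^\top \\ 1_j & 0_{MM} \end{pmatrix}. \]
Semi-definite programming ⇒ efficient solvers exist.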

A proposal to improve interpretability of K-PCA in our
framework
Issue: How to assess the importance of a given species in the K-PCA?

our datasets are either numeric (environmental) or are built from a
n × p count matrix
⇒ for a given species, randomly permute counts and re-do the
analysis (kernel computation - with the same optimized weights - and
K-PCA)
the influence of a given species in a given dataset on a given PC
subspace is assessed by computing the Crone-Crosby distance
between these two PCA subspaces [Crone and Crosby, 1995] (∼
Frobenius norm between the projectors)
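
The permutation scheme above can be sketched as follows. Several choices here are assumptions rather than the authors' implementation: the PC subspaces are represented by the leading eigenvectors of the centered kernel matrices in $\mathbb{R}^N$, the distance is the Frobenius norm between the two orthogonal projectors scaled by $1/\sqrt{2}$, and `kernel_fn` is a hypothetical stand-in for the full kernel computation (a linear kernel in the toy example).

```python
import numpy as np


def top_axes(K, q):
    """Orthonormal basis (N x q) of the q leading eigenvectors of the centered kernel."""
    N = K.shape[0]
    J = np.eye(N) - np.full((N, N), 1.0 / N)
    eigvals, eigvecs = np.linalg.eigh(J @ K @ J)
    return eigvecs[:, np.argsort(eigvals)[::-1][:q]]


def subspace_distance(A, B):
    """Frobenius norm between the projectors onto span(A) and span(B), over sqrt(2)."""
    return np.linalg.norm(A @ A.T - B @ B.T, "fro") / np.sqrt(2.0)


def species_influence(counts, species, kernel_fn, q=2, seed=0):
    """Distance between the PC subspaces before and after permuting one species."""
    rng = np.random.default_rng(seed)
    permuted = counts.copy()
    permuted[:, species] = rng.permutation(permuted[:, species])
    return subspace_distance(top_axes(kernel_fn(counts), q),
                             top_axes(kernel_fn(permuted), q))


# Toy example: 10 samples, 5 species, linear kernel as a placeholder (hypothetical data).
rng = np.random.default_rng(1)
counts = rng.poisson(5.0, size=(10, 5)).astype(float)
linear_kernel = lambda X: X @ X.T
influences = [species_influence(counts, g, linear_kernel) for g in range(counts.shape[1])]
print(influences)
```

In a real analysis the permutation would typically be repeated several times per species and the distances averaged, since a single permutation is noisy.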

Integrating ’omics data using kernels
M TARA Oceans datasets $(x^m_i)_{i=1,\ldots,N,\ m=1,\ldots,M}$ measured on the same
ocean samples $(1, \ldots, N)$, which take
values in arbitrary spaces $(\mathcal{X}^m)_m$:
environmental dataset,
bacteria phylogenomic tree,
bacteria functional composition,
eukaryote pico-plankton composition,
. . .
virus composition.

Environmental dataset: standard Euclidean
distance, which corresponds to the linear kernel $K(x_i, x_j) = x_i^\top x_j$.

Bacteria phylogenomic tree: the weighted
Unifrac distance, given by
\[ d_{\mathrm{wUF}}(x_i, x_j) = \frac{\sum_e l_e |p_{ei} - p_{ej}|}{\sum_e (p_{ei} + p_{ej})}. \]

All composition-based datasets (bacteria
functional composition, eukaryote (pico,
nano, micro, meso)-plankton composition
and virus composition): kernels calculated from the
Bray-Curtis dissimilarity,
\[ d_{\mathrm{BC}}(x_i, x_j) = \frac{\sum_g |n_{ig} - n_{jg}|}{\sum_g (n_{ig} + n_{jg})}, \]
with $n_{ig}$ the abundance of gene $g$, summarized at the
KEGG orthologous groups level, in sample $i$.

Combination of M kernels by a weighted
sum
\[ K^* = \sum_{m=1}^M \beta_m K^m, \]
where $\beta_m \geq 0$ and $\sum_{m=1}^M \beta_m = 1$.

Apply standard data mining methods
(clustering, linear model, PCA, . . . ) in the
feature space.

Correlation between kernels (STATIS)

Low correlations between the bacteria functional composition and
other datasets.
Strong correlation between environmental variables and small
organisms (bacteria, eukaryote pico-plankton and viruses).

Inﬂuence of k (nb of neighbors) on (βm)m
k ≥ 5 provides stable results

(βm)m values returned by graph-MKL

The dataset least correlated with the others, the bacteria functional
composition, has the highest coefficient.
Three kernels have a weight equal to 0 (sparse version).

Proof of concept: using [Sunagawa et al., 2015]
Datasets
139 samples, 3 layers (SRF, DCM and MES)
kernels: phychem, pro-OTUs and pro-OGs

Proteobacteria (clades SAR11 (Alphaproteobacteria) and SAR86)
dominate the sampled areas of the ocean in terms of relative
abundance and taxonomic richness.

K-PCA on K∗

K-PCA on K∗
- environmental dataset

Conclusion and perspectives
Summary
an integrative exploratory method
... particularly well suited for multi metagenomic datasets
with enhanced interpretability
Perspectives
implement SDP solution and test it
improve biological interpretation
soon-to-be-released R package

Questions?

References
Aronszajn, N. (1950).
Theory of reproducing kernels.
Transactions of the American Mathematical Society, 68(3):337–404.
Brum, J., Ignacio-Espinoza, J., Roux, S., Doulcier, G., Acinas, S., Alberti, A., Chaffron, S., Cruaud, C., de Vargas, C., Gasol, J.,
Gorsky, G., Gregory, A., Guidi, L., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Poulos, B., Schwenck, S., Speich, S.,
Dimier, C., Kandels-Lewis, S., Picheral, M., Searson, S., Tara Oceans coordinators, Bork, P., Bowler, C., Sunagawa, S., Wincker,
P., Karsenti, E., and Sullivan, M. (2015).
Patterns and ecological drivers of ocean viral communities.
Science, 348(6237).
Chen, Y., Garcia, E., Gupta, M., Rahimi, A., and Cazzanti, L. (2009).
Similarity-based classification: concepts and algorithms.
Journal of Machine Learning Research, 10:747–776.
Cottrell, M. and Letrémy, P. (2005).
How to use the Kohonen algorithm to simultaneously analyse individuals in a survey.
Neurocomputing, 63:193–207.
Crone, L. and Crosby, D. (1995).
Statistical applications of a metric on subspaces to satellite meteorology.
Technometrics, 37(3):324–328.
de Vargas, C., Audic, S., Henry, N., Decelle, J., Mahé, P., Logares, R., Lara, E., Berney, C., Le Bescot, N., Probert, I.,
Carmichael, M., Poulain, J., Romac, S., Colin, S., Aury, J., Bittner, L., Chaffron, S., Dunthorn, M., Engelen, S., Flegontova, O.,
Guidi, L., Horák, A., Jaillon, O., Lima-Mendez, G., Lukeš, J., Malviya, S., Morard, R., Mulot, M., Scalco, E., Siano, R., Vincent, F.,
Zingone, A., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Acinas, S., Bork, P., Bowler, C.,
Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Raes, J., Sieracki, M. E., Speich, S.,
Stemmann, L., Sunagawa, S., Weissenbach, J., Wincker, P., and Karsenti, E. (2015).
Eukaryotic plankton diversity in the sunlit ocean.
Science, 348(6237).
Gönen, M. and Alpaydin, E. (2011).
Multiple kernel learning algorithms.
Journal of Machine Learning Research, 12:2211–2268.
Lavit, C., Escouﬁer, Y., Sabatier, R., and Traissac, P. (1994).
The ACT (STATIS method).
Computational Statistics and Data Analysis, 18(1):97–119.
Lee, J. and Verleysen, M. (2007).
Nonlinear Dimensionality Reduction.
Information Science and Statistics. Springer, New York; London.
L’Hermier des Plantes, H. (1976).
Structuration des tableaux à trois indices de la statistique.
PhD thesis, Université de Montpellier.
Thèse de troisième cycle.
Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J., Roux, S., Vincent, F.,
Bittner, L., Darzi, Y., Wang, B., Audic, S., Berline, L., Bontempi, G., Cabello, A., Coppola, L., Cornejo-Castillo, F., d’Oviedo, F.,
de Meester, L., Ferrera, I., Garet-Delmas, M., Guidi, L., Lara, E., Pesant, S., Royo-Llonch, M., Salazar, F., Sánchez, P.,
Sebastian, M., Souffreau, C., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Gorsky, G.,
Not, F., Ogata, H., Speich, S., Stemmann, L., Weissenbach, J., Wincker, P., Acinas, S., Sunagawa, S., Bork, P., Sullivan, M.,
Karsenti, E., Bowler, C., de Vargas, C., and Raes, J. (2015).
Determinants of community structure in the global plankton interactome.
Science, 348(6237).
Lin, Y., Liu, T., and Fuh, C.-S. (2010).
Multiple kernel learning for dimensionality reduction.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1147–1160.
Mariette, J., Olteanu, M., and Villa-Vialaneix, N. (2017).
Efﬁcient interpretable variants of online SOM for large dissimilarity data.
Olteanu, M. and Villa-Vialaneix, N. (2015).
On-line relational and multiple relational SOM.
Robert, P. and Escouﬁer, Y. (1976).
A unifying tool for linear multivariate statistical methods: the RV-coefficient.
Applied Statistics, 25(3):257–265.
Schölkopf, B., Smola, A., and Müller, K. (1998).
Nonlinear component analysis as a kernel eigenvalue problem.
Neural Computation, 10(5):1299–1319.
Sommer, M., Church, G., and Dantas, G. (2010).
A functional metagenomic approach for expanding the synthetic biology toolbox for biomass conversion.
Molecular Systems Biology, 6(360).
Sunagawa, S., Coelho, L., Chaffron, S., Kultima, J., Labadie, K., Salazar, F., Djahanschiri, B., Zeller, G., Mende, D., Alberti, A.,
Cornejo-Castillo, F., Costea, P., Cruaud, C., d’Oviedo, F., Engelen, S., Ferrera, I., Gasol, J., Guidi, L., Hildebrand, F., Kokoszka,
F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral,
M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P.,
Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M., Weissenbach, J., Wincker, P.,
Karsenti, E., Raes, J., Acinas, S., and Bork, P. (2015).
Structure and function of the global ocean microbiome.
Science, 348(6237).
Zhuang, J., Wang, J., Hoi, S., and Lan, X. (2011).
Unsupervised multiple kernel clustering.
Journal of Machine Learning Research: Workshop and Conference Proceedings, 20:129–144.
