Kernel methods and variable selection for exploratory
analysis and multi-omics integration
Nathalie Vialaneix
nathalie.vialaneix@inrae.fr
http://www.nathalievialaneix.eu
4th Course on Computational Systems Biology of Cancer: Multi-omics and
Machine Learning Approaches
September 29th, 2021
Who am I?
- researcher at INRAE (France)
- trained as a mathematician → statistics and machine learning for computational biology (omics)
- mostly involved in projects on animal genetics (transcriptomics, Hi-C, ATAC-seq...)
- kernel & network methods
Statistics & ML for high throughput data integration
Sept. 29th 2021 / Curie training / Nathalie Vialaneix
p. 2
Overview of the talk
Kernel methods: the basics
Integrating data with kernels
Improve interpretability
Main ideas behind kernel methods
Standard (omics) data analyses:
- data are (numeric) tables
- analyses are based on operations (distances, means, ...) in the variable space
Main ideas behind kernel methods
Kernel data analyses:
- data can be arbitrary objects
- analyses are based on transformations of the data into “similarities” between samples
More formally...
n samples (x_i)_i ∈ X
kernel: a symmetric, positive definite (n × n) matrix K that measures a “similarity” between the n entities of X, with
K(x, x') = ⟨φ(x), φ(x')⟩ for some feature map φ
Principles for learning from kernels
Start from any statistical method (PCA, regression, k-means clustering, ...) and rewrite all its quantities using:
- K to compute dot products and distances: the dot product is K_{ii'} and the distance is √(K_{ii} + K_{i'i'} − 2 K_{ii'})
- (implicit) linear or convex combinations of (φ(x_i))_i to describe all unobserved elements (centers of gravity, and so on)
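These two identities are all a kernel method needs in practice. A minimal sketch of the distance computation (written for this note, not code from the talk; `kernel_distances` is a name chosen here):

```python
import numpy as np

def kernel_distances(K):
    """Pairwise feature-space distances from a kernel matrix K,
    using d(i, i')^2 = K_ii + K_i'i' - 2 K_ii'."""
    diag = np.diag(K)
    sq = diag[:, None] + diag[None, :] - 2.0 * K
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from rounding
```

With a linear kernel K = X Xᵀ, this reproduces ordinary Euclidean distances, which is a quick sanity check for any kernel implementation.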
A simple example: k-means
1: Initialization: randomly initialize P centers x̄_{C_j^1} ∈ R^p
2: for t = 1 to T do
3:    Affectation step: ∀ i = 1, ..., n, f^{t+1}(x_i) = argmin_{j=1,...,P} d(x_i, x̄_{C_j^t})
4:    Representation step: ∀ j = 1, ..., P, x̄_{C_j^t} = (1 / |C_j^t|) Σ_{x_l ∈ C_j^t} x_l
5: end for   (until convergence)
6: return the partition
A simple example: k-means (kernel version)
1: Initialization: randomly initialize a partition of (x_i)_i and set x̄_{C_j^1} = (1 / |C_j^1|) Σ_{x_i ∈ C_j^1} φ(x_i)
2: for t = 1 to T do
3:    Affectation step: ∀ i = 1, ..., n,
      f^{t+1}(x_i) = argmin_{j=1,...,P} d(φ(x_i), x̄_{C_j^t})
                   = argmin_{j=1,...,P} [ K_{ii} − (2 / |C_j^t|) Σ_{x_l ∈ C_j^t} K_{il} + (1 / |C_j^t|²) Σ_{x_l, x_{l'} ∈ C_j^t} K_{ll'} ]
4:    Representation step: ∀ j = 1, ..., P, x̄_{C_j^t} = (1 / |C_j^t|) Σ_{x_l ∈ C_j^t} φ(x_l)
5: end for   (until convergence)
6: return the partition
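The affectation step touches the data only through entries of K, so the whole algorithm runs without ever computing φ. A compact illustrative implementation (written for this note, not the speaker's code; the `init` argument is an addition for reproducibility):

```python
import numpy as np

def kernel_kmeans(K, P, T=20, init=None, seed=0):
    """Kernel k-means: assign each sample to the cluster whose (implicit)
    feature-space center of gravity is closest, using only K."""
    n = K.shape[0]
    if init is None:
        labels = np.random.default_rng(seed).integers(0, P, size=n)
    else:
        labels = np.asarray(init).copy()
    for _ in range(T):
        dist = np.empty((n, P))
        for j in range(P):
            members = np.flatnonzero(labels == j)
            if members.size == 0:        # keep empty clusters unreachable
                dist[:, j] = np.inf
                continue
            # K_ii - (2/|C_j|) sum_l K_il + (1/|C_j|^2) sum_{l,l'} K_ll'
            dist[:, j] = (np.diag(K)
                          - 2.0 * K[:, members].mean(axis=1)
                          + K[np.ix_(members, members)].mean())
        new = dist.argmin(axis=1)
        if np.array_equal(new, labels):  # converged
            break
        labels = new
    return labels
```

With a linear kernel this reduces exactly to standard k-means; with any other kernel the centers stay implicit.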
Kernel examples
1. observations in R^p: Gaussian kernel K_{ii'} = exp(−γ ‖x_i − x_{i'}‖²)
2. nodes of a graph: diffusion kernel [Kondor and Lafferty, 2002]
3. sequence kernels between proteins: spectrum kernel [Jaakkola et al., 2000] (with HMM), convolution kernel [Saigo et al., 2004]
4. kernels between graphs (used between metabolites, based on their fragmentation trees): [Shen et al., 2014, Brouard et al., 2016]
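The Gaussian kernel of item 1 is direct to compute from the data matrix (illustrative snippet; `gaussian_kernel` is a name chosen here):

```python
import numpy as np

def gaussian_kernel(X, gamma=1.0):
    """Gaussian kernel matrix K_ii' = exp(-gamma * ||x_i - x_i'||^2)
    for an (n, p) data matrix X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)
```

The resulting matrix is symmetric with unit diagonal, and positive definite, as required for a kernel.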
Types of data integration methods [Ritchie et al., 2015]
Multiple kernel (or distance) integration
How to “optimally” combine several kernel datasets?
For kernels K^1, ..., K^M obtained on the same n objects, search for K_β = Σ_{m=1}^M β_m K^m with β_m ≥ 0 and Σ_m β_m = 1.
- naive approach: K* = (1/M) Σ_m K^m
- supervised framework: K* = Σ_m β_m K^m with β_m ≥ 0 and Σ_m β_m = 1, the β_m being chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011]
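Any convex combination of kernel matrices is again a valid kernel, so the combination step itself is a one-liner; a small sketch covering both the naive average and a weighted case (names chosen here, not from mixKernel):

```python
import numpy as np

def combine_kernels(kernels, beta=None):
    """Convex combination K* = sum_m beta_m K^m of (n, n) kernel matrices;
    defaults to the naive average beta_m = 1/M."""
    kernels = np.asarray(kernels, dtype=float)   # shape (M, n, n)
    M = kernels.shape[0]
    if beta is None:
        beta = np.full(M, 1.0 / M)
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0) and np.isclose(beta.sum(), 1.0)
    return np.tensordot(beta, kernels, axes=1)   # weighted sum over m
```

The interesting part, of course, is how β is chosen, which the next slides address.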
Combining kernels in an unsupervised setting
Multiple kernel integration
Idea of kernel consensus: find a kernel that achieves a consensus of all the kernels [Mariette and Villa-Vialaneix, 2018] (R package mixKernel),
with the consensus based on:
- STATIS [L'Hermier des Plantes, 1976, Lavit et al., 1994]
- a criterion that preserves the local geometry
STATIS-like framework
Similarities between kernels:
C_{mm'} = ⟨K^m, K^{m'}⟩_F / (‖K^m‖_F ‖K^{m'}‖_F) = Trace(K^m K^{m'}) / √(Trace((K^m)²) Trace((K^{m'})²)).
(C_{mm'} is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the kernel framework.)

maximize_v Σ_{m=1}^M ⟨K*(v), K^m / ‖K^m‖_F⟩_F = v^⊤ C v
for K*(v) = Σ_{m=1}^M v_m K^m and v ∈ R^M such that ‖v‖_2 = 1.

Solution: the first eigenvector v of C ⇒ set β = v / Σ_{m=1}^M v_m (consensual kernel).
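The actual implementation lives in the mixKernel R package; the Python sketch below just mirrors the formulas above (`statis_weights` is a name chosen here): build C from the kernel RV-coefficients, take its leading eigenvector, and normalize it to sum to one.

```python
import numpy as np

def statis_weights(kernels):
    """STATIS-like consensus weights for a list of (n, n) kernel matrices."""
    kernels = [np.asarray(K, dtype=float) for K in kernels]
    M = len(kernels)
    C = np.empty((M, M))
    for m in range(M):
        for mp in range(M):
            num = np.trace(kernels[m] @ kernels[mp])
            den = np.sqrt(np.trace(kernels[m] @ kernels[m])
                          * np.trace(kernels[mp] @ kernels[mp]))
            C[m, mp] = num / den                 # kernel RV-coefficient
    vals, vecs = np.linalg.eigh(C)
    v = np.abs(vecs[:, -1])   # leading eigenvector; nonnegative up to sign
    return v / v.sum()        # beta = v / sum_m v_m
```

Since all entries of C are nonnegative for PSD kernels, the leading eigenvector has entries of a single sign (Perron vector), so the normalized β is a valid convex weighting.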
A kernel preserving the original topology of the data I
Similarly to [Lin et al., 2010], preserve the local geometry of the data in the feature space.

Proxy of the local geometry:
K^m → G^m_k (k-nearest-neighbor graph) → A^m_k (adjacency matrix)
⇒ W = Σ_m 1{A^m_k > 0} or W = Σ_m A^m_k

Feature-space geometry measured by:
Δ_i(β) = ⟨φ*_β(x_i), (φ*_β(x_1), ..., φ*_β(x_n))⟩ = (K*_β(x_i, x_1), ..., K*_β(x_i, x_n))^⊤
A kernel preserving the original topology of the data II
Sparse version (quadprog in R):
minimize_β Σ_{i,j=1}^N W_{ij} ‖Δ_i(β) − Δ_j(β)‖²
for K*_β = Σ_{m=1}^M β_m K^m and β ∈ R^M s.t. β_m ≥ 0 and Σ_{m=1}^M β_m = 1.

Non-sparse version (ADMM optimization [Boyd et al., 2011]):
minimize_v Σ_{i,j=1}^N W_{ij} ‖Δ_i(v) − Δ_j(v)‖²
for K*_v = Σ_{m=1}^M v_m K^m and v ∈ R^M s.t. v_m ≥ 0 and ‖v‖_2 = 1.
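The weight matrix W can be built directly from each kernel through the kernel-induced distance; a sketch under these assumptions (function names chosen here, and the indicator variant of W is used):

```python
import numpy as np

def knn_adjacency(K, k):
    """k-nearest-neighbor adjacency from a kernel matrix, with neighbors
    taken under d^2(i, i') = K_ii + K_i'i' - 2 K_ii'."""
    diag = np.diag(K)
    d2 = diag[:, None] + diag[None, :] - 2.0 * K
    n = K.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(d2[i])
        neighbors = [j for j in order if j != i][:k]
        A[i, neighbors] = 1.0
    return np.maximum(A, A.T)   # symmetrize the k-NN graph

def topology_weights(kernels, k):
    """W = sum over kernels of the k-NN graph indicators 1{A^m_k > 0}."""
    return sum(knn_adjacency(K, k) for K in kernels)
```

W then weights which pairs (i, j) the objective on this slide tries to keep close in the combined feature space.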
What's next? Improve interpretability
Main drawbacks of kernel methods:
- they do not scale well with sample size
- the relation to the original features is lost ⇒ interpretation of the results is more difficult
⇒ Select the variables most relevant for a given kernel & task
Selecting features in kernels
Basic principles (for numerical features only):
- label each feature j with a weight w_j ∈ {0, 1} (selected or not)
- new kernel: K_w(x_i, x_{i'}) = K(w · x_i, w · x_{i'})
- find the best features by optimizing a quality criterion (after relaxation to w_j ≥ 0)
Examples:
- supervised framework: learn w to compute a kernel K^w_x best suited to predict Y ∈ R [Allen, 2013, Grandvalet and Canu, 2002]
- extended to unsupervised (exploratory) learning and to arbitrary outputs (kernel outputs, including multiple-output regression), with the non-convex and non-smooth optimization solved by proximal gradient descent [Brouard et al., 2021]; implemented in mixKernel
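The reweighted kernel K_w is easy to state concretely for a Gaussian base kernel (illustrative sketch; `weighted_gaussian_kernel` is a name chosen here): a feature with w_j = 0 drops out of the similarity entirely.

```python
import numpy as np

def weighted_gaussian_kernel(X, w, gamma=1.0):
    """K_w(x_i, x_i') = K(w * x_i, w * x_i') with a Gaussian base kernel."""
    Xw = X * w                  # elementwise feature reweighting
    sq = ((Xw[:, None, :] - Xw[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)
```

Setting w_j = 0 for a feature gives exactly the kernel computed on the data with that column removed, which is what makes the binary weights act as a selection mask.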
UKFS performances

                  lapl    SPEC    MCFS    NDFS    UDFS     Autoenc.   UKFS
“Carcinom” (n = 174, p = 9,182)
  ACC            164.02  106.52  184.17  200.88  138.48   143.13     206.55
  COR             28.14   30.75   29.56   27.49   30.30    33.18      24.75
  CPU (s)          0.25    2.47   11.69   6,162  99,138   > 4 days   326
“Glioma” (n = 50, p = 4,434)
  ACC            166.31  140.72  172.78  147.77  147.50   132.76     178.57
  COR             81.70   70.70   76.43   68.02   72.33    45.96      52.14
  CPU (s)          0.02    0.63    1.05   368     2,636   ∼ 12 h      23.74
“Koren” (n = 43, p = 980)
  ACC            172.90  225.25  233.94  263.04  263.48   239.76     242.39
  COR             48.18   52.34   49.94   48.48   48.69    32.60      47.77
  CPU (s)          0.01    0.07    1.11   5.88    9.70    ∼ 30 min    10.69
[Brouard et al., 2021]
- non-redundant features
- can incorporate information on relations between variables
KOKFS illustration
Used with the COVID-19 dataset from [Haug et al., 2020]:
- inputs: 46 government measures (0/1 encoding)
- outputs: time series of the reproduction number R (with a kernel based on Fréchet distances)
Most important measures:
Increase In Medical Supplies And Equipment
Border Health Check
Public Transport Restriction
The Government Provides Assistance To Vulnerable Populations
Individual Movement Restrictions
Increase Availability Of Ppe
Activate Case Notification
Border Restriction
Measures To Ensure Security Of Supply
Port And Ship Restriction
Activate Or Establish Emergency Response
Thank you for your attention!
Questions?
References
(unofficial) Beamer template made with the help of Thomas Schiex: https://forgemia.inra.fr/nathalie.villa-vialaneix/bainrae
Allen, G. I. (2013).
Automatic feature selection via weighted kernels and regularization.
Journal of Computational and Graphical Statistics, 22(2):284–299.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011).
Distributed optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122.
Brouard, C., Mariette, J., Flamary, R., and Vialaneix, N. (2021).
Feature selection for kernel methods in systems biology.
In preparation.
Brouard, C., Shen, H., Dührkop, K., d'Alché-Buc, F., Böcker, S., and Rousu, J. (2016).
Fast metabolite identification with input output kernel regression.
Bioinformatics, 32(12):i28–i36.
Gönen, M. and Alpaydin, E. (2011).
Multiple kernel learning algorithms.
Journal of Machine Learning Research, 12:2211–2268.
Grandvalet, Y. and Canu, S. (2002).
Adaptive scaling for feature selection in SVMs.
In Becker, S., Thrun, S., and Obermayer, K., editors, Proceedings of Advances in Neural Information Processing Systems (NIPS 2002), pages
569–576. MIT Press.
Haug, N., Geyrhofer, L., Londei, A., Dervic, E., Desvars-Larrive, A., Loreto, V., Pinior, B., Thurner, S., and Klimek, P. (2020).
Ranking the effectiveness of worldwide COVID-19 government interventions.
Nature Human Behaviour, 4(4):1303–1312.
Jaakkola, T., Diekhans, M., and Haussler, D. (2000).
A discriminative framework for detecting remote protein homologies.
Journal of Computational Biology, 7(1-2):95–114.
Kondor, R. and Lafferty, J. (2002).
Diffusion kernels on graphs and other discrete structures.
In Sammut, C. and Hoffmann, A., editors, Proceedings of the 19th International Conference on Machine Learning, pages 315–322, Sydney,
Australia. Morgan Kaufmann Publishers Inc. San Francisco, CA, USA.
Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994).
The ACT (STATIS method).
Computational Statistics and Data Analysis, 18(1):97–119.
L’Hermier des Plantes, H. (1976).
Structuration des tableaux à trois indices de la statistique.
PhD thesis, Université de Montpellier.
Thèse de troisième cycle.
Lin, Y.-Y., Liu, T.-L., and Fuh, C.-S. (2010).
Multiple kernel learning for dimensionality reduction.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1147–1160.
Mariette, J. and Villa-Vialaneix, N. (2018).
Unsupervised multiple kernel learning for heterogeneous data integration.
Bioinformatics, 34(6):1009–1015.
Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K. (2015).
limma powers differential expression analyses for RNA-sequencing and microarray studies.
Nucleic Acids Research, 43(7):e47.
Robert, P. and Escoufier, Y. (1976).
A unifying tool for linear multivariate statistical methods: the RV-coefficient.
Applied Statistics, 25(3):257–265.
Saigo, H., Vert, J.-P., Ueda, N., and Akutsu, T. (2004).
Protein homology detection using string alignment kernels.
Bioinformatics, 20(11):1682–1689.
Shen, H., Dührkop, K., Böcker, S., and Rousu, J. (2014).
Metabolite identification through multiple kernel learning on fragmentation trees.
Bioinformatics, 30(12):i157–i164.
Technical details: Unsupervised Kernel Feature Selection (UKFS)
Reduced kernel for X ∈ R^p: tune weights w ∈ {0, 1}^p such that the kernel based on w · X is very similar to the kernel based on X.

Optimization problem:
argmin_{w ∈ {0,1}^p} ‖K^w_x − K_x‖²_F   s.t.   Σ_{j=1}^p w_j ≤ d

Continuous relaxation (non-convex and non-smooth):
w* := argmin_{w ∈ (R^+)^p} ‖K^w_x − K_x‖²_F + λ ‖w‖_1, solved with proximal gradient descent.
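A schematic proximal-gradient loop for the relaxed problem (the real solver is in mixKernel; here the gradient of the smooth part is approximated by finite differences for brevity, the base kernel is Gaussian, and the step size and λ are arbitrary choices for illustration):

```python
import numpy as np

def gaussian_kernel_w(X, w, gamma=1.0):
    """Gaussian kernel on the feature-reweighted data w * X."""
    Xw = X * w
    sq = ((Xw[:, None, :] - Xw[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def ukfs_sketch(X, lam=0.1, gamma=1.0, step=0.05, iters=200):
    """Proximal gradient on w >= 0 for ||K^w_x - K_x||_F^2 + lam * ||w||_1."""
    K = gaussian_kernel_w(X, np.ones(X.shape[1]), gamma)   # reference kernel
    w = np.ones(X.shape[1])
    def loss(w):
        return ((gaussian_kernel_w(X, w, gamma) - K) ** 2).sum()
    eps = 1e-5
    for _ in range(iters):
        g = np.zeros_like(w)
        for j in range(w.size):            # finite-difference gradient
            e = np.zeros_like(w); e[j] = eps
            g[j] = (loss(w + e) - loss(w - e)) / (2 * eps)
        # prox of lam * ||.||_1 restricted to R^+: shift then clip at 0
        w = np.maximum(w - step * g - step * lam, 0.0)
    return w
```

On data where one column is constant (hence irrelevant to the kernel), the ℓ1 term drives that weight to zero while informative weights stay large, which is the selection behavior the slide describes.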