Kernel methods and variable selection for exploratory
analysis and multi-omics integration
Nathalie Vialaneix
nathalie.vialaneix@inrae.fr
http://www.nathalievialaneix.eu
4th Course on Computational Systems Biology of Cancer: Multi-omics and
Machine Learning Approaches
September 29th, 2021
Who am I?
- researcher at INRAE (France)
- trained as a mathematician → statistics and machine learning for computational biology (omics)
- mostly involved in projects on animal genetics (transcriptomics, Hi-C, ATAC-seq...)
- kernel & network methods
Statistics & ML for high throughput data integration
Sept. 29th 2021 / Curie training / Nathalie Vialaneix
p. 2
Overview of the talk
Kernel methods: the basics
Integrating data with kernels
Improve interpretability
Main ideas behind kernel methods
Standard (omics) data analyses:
- data are (numeric) tables
- analyses are based on operations (distances, means, ...) in the variable space
Main ideas behind kernel methods
Kernel data analyses:
- data can be arbitrary objects
- analyses are based on transformations of the data into “similarities” between samples
More formally...
n samples (x_i)_i ∈ X
kernel: a symmetric, positive definite (n × n) matrix K that measures a “similarity” between the n entities of X, with
K(x, x') = ⟨φ(x), φ(x')⟩ for some feature map φ
Principles for learning from kernels
Start from any statistical method (PCA, regression, k-means clustering, ...) and rewrite all its quantities using:
- K to compute dot products and distances: the dot product is K_{ii'} and the distance is √(K_{ii} + K_{i'i'} − 2 K_{ii'})
- (implicit) linear or convex combinations of (φ(x_i))_i to describe all unobserved elements (centers of gravity, and so on)
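These two identities are all a kernel method needs in practice. A minimal sketch of the distance computation (written for this note, not code from the talk; `kernel_distances` is a name chosen here):

```python
import numpy as np

def kernel_distances(K):
    """Pairwise feature-space distances from a kernel matrix K,
    using d(i, i')^2 = K_ii + K_i'i' - 2 K_ii'."""
    diag = np.diag(K)
    sq = diag[:, None] + diag[None, :] - 2.0 * K
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from rounding
```

With a linear kernel K = X Xᵀ, this reproduces ordinary Euclidean distances, which is a quick sanity check for any kernel implementation.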
A simple example: k-means
1: Initialization: randomly initialize P centers x̄_{C_j^1} ∈ R^p
2: for t = 1 to T do
3:    Affectation step: ∀ i = 1, ..., n, f^{t+1}(x_i) = argmin_{j=1,...,P} d(x_i, x̄_{C_j^t})
4:    Representation step: ∀ j = 1, ..., P, x̄_{C_j^t} = (1 / |C_j^t|) Σ_{x_l ∈ C_j^t} x_l
5: end for   (until convergence)
6: return the partition
A simple example: k-means (kernel version)
1: Initialization: randomly initialize a partition of (x_i)_i and set x̄_{C_j^1} = (1 / |C_j^1|) Σ_{x_i ∈ C_j^1} φ(x_i)
2: for t = 1 to T do
3:    Affectation step: ∀ i = 1, ..., n,
      f^{t+1}(x_i) = argmin_{j=1,...,P} d(φ(x_i), x̄_{C_j^t})
                   = argmin_{j=1,...,P} [ K_{ii} − (2 / |C_j^t|) Σ_{x_l ∈ C_j^t} K_{il} + (1 / |C_j^t|²) Σ_{x_l, x_{l'} ∈ C_j^t} K_{ll'} ]
4:    Representation step: ∀ j = 1, ..., P, x̄_{C_j^t} = (1 / |C_j^t|) Σ_{x_l ∈ C_j^t} φ(x_l)
5: end for   (until convergence)
6: return the partition
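The affectation step touches the data only through entries of K, so the whole algorithm runs without ever computing φ. A compact illustrative implementation (written for this note, not the speaker's code; the `init` argument is an addition for reproducibility):

```python
import numpy as np

def kernel_kmeans(K, P, T=20, init=None, seed=0):
    """Kernel k-means: assign each sample to the cluster whose (implicit)
    feature-space center of gravity is closest, using only K."""
    n = K.shape[0]
    if init is None:
        labels = np.random.default_rng(seed).integers(0, P, size=n)
    else:
        labels = np.asarray(init).copy()
    for _ in range(T):
        dist = np.empty((n, P))
        for j in range(P):
            members = np.flatnonzero(labels == j)
            if members.size == 0:        # keep empty clusters unreachable
                dist[:, j] = np.inf
                continue
            # K_ii - (2/|C_j|) sum_l K_il + (1/|C_j|^2) sum_{l,l'} K_ll'
            dist[:, j] = (np.diag(K)
                          - 2.0 * K[:, members].mean(axis=1)
                          + K[np.ix_(members, members)].mean())
        new = dist.argmin(axis=1)
        if np.array_equal(new, labels):  # converged
            break
        labels = new
    return labels
```

With a linear kernel this reduces exactly to standard k-means; with any other kernel the centers stay implicit.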
Kernel examples
1. observations in R^p: Gaussian kernel K_{ii'} = exp(−γ ‖x_i − x_{i'}‖²)
2. nodes of a graph: diffusion kernel [Kondor and Lafferty, 2002]
3. sequence kernels between proteins: spectrum kernel [Jaakkola et al., 2000] (with HMM), convolution kernel [Saigo et al., 2004]
4. kernels between graphs (used between metabolites, based on their fragmentation trees): [Shen et al., 2014, Brouard et al., 2016]
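The Gaussian kernel of item 1 is direct to compute from the data matrix (illustrative snippet; `gaussian_kernel` is a name chosen here):

```python
import numpy as np

def gaussian_kernel(X, gamma=1.0):
    """Gaussian kernel matrix K_ii' = exp(-gamma * ||x_i - x_i'||^2)
    for an (n, p) data matrix X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)
```

The resulting matrix is symmetric with unit diagonal, and positive definite, as required for a kernel.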
Types of data integration methods [Ritchie et al., 2015]
Multiple kernel (or distance) integration
How to “optimally” combine several kernel datasets?
For kernels K^1, ..., K^M obtained on the same n objects, search for K_β = Σ_{m=1}^M β_m K^m with β_m ≥ 0 and Σ_m β_m = 1.
- naive approach: K* = (1/M) Σ_m K^m
- supervised framework: K* = Σ_m β_m K^m with β_m ≥ 0 and Σ_m β_m = 1, the β_m being chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011]
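Any convex combination of kernel matrices is again a valid kernel, so the combination step itself is a one-liner; a small sketch covering both the naive average and a weighted case (names chosen here, not from mixKernel):

```python
import numpy as np

def combine_kernels(kernels, beta=None):
    """Convex combination K* = sum_m beta_m K^m of (n, n) kernel matrices;
    defaults to the naive average beta_m = 1/M."""
    kernels = np.asarray(kernels, dtype=float)   # shape (M, n, n)
    M = kernels.shape[0]
    if beta is None:
        beta = np.full(M, 1.0 / M)
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0) and np.isclose(beta.sum(), 1.0)
    return np.tensordot(beta, kernels, axes=1)   # weighted sum over m
```

The interesting part, of course, is how β is chosen, which the next slides address.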
Combining kernels in an unsupervised setting
Multiple kernel integration
Idea of kernel consensus: find a kernel that achieves a consensus of all the kernels [Mariette and Villa-Vialaneix, 2018] (R package mixKernel),
with the consensus based on:
- STATIS [L'Hermier des Plantes, 1976, Lavit et al., 1994]
- a criterion that preserves the local geometry
STATIS-like framework
Similarities between kernels:
C_{mm'} = ⟨K^m, K^{m'}⟩_F / (‖K^m‖_F ‖K^{m'}‖_F) = Trace(K^m K^{m'}) / √(Trace((K^m)²) Trace((K^{m'})²)).
(C_{mm'} is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the kernel framework.)

maximize_v Σ_{m=1}^M ⟨K*(v), K^m / ‖K^m‖_F⟩_F = v^⊤ C v
for K*(v) = Σ_{m=1}^M v_m K^m and v ∈ R^M such that ‖v‖_2 = 1.

Solution: the first eigenvector v of C ⇒ set β = v / Σ_{m=1}^M v_m (consensual kernel).
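The actual implementation lives in the mixKernel R package; the Python sketch below just mirrors the formulas above (`statis_weights` is a name chosen here): build C from the kernel RV-coefficients, take its leading eigenvector, and normalize it to sum to one.

```python
import numpy as np

def statis_weights(kernels):
    """STATIS-like consensus weights for a list of (n, n) kernel matrices."""
    kernels = [np.asarray(K, dtype=float) for K in kernels]
    M = len(kernels)
    C = np.empty((M, M))
    for m in range(M):
        for mp in range(M):
            num = np.trace(kernels[m] @ kernels[mp])
            den = np.sqrt(np.trace(kernels[m] @ kernels[m])
                          * np.trace(kernels[mp] @ kernels[mp]))
            C[m, mp] = num / den                 # kernel RV-coefficient
    vals, vecs = np.linalg.eigh(C)
    v = np.abs(vecs[:, -1])   # leading eigenvector; nonnegative up to sign
    return v / v.sum()        # beta = v / sum_m v_m
```

Since all entries of C are nonnegative for PSD kernels, the leading eigenvector has entries of a single sign (Perron vector), so the normalized β is a valid convex weighting.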
A kernel preserving the original topology of the data I
Similarly to [Lin et al., 2010], preserve the local geometry of the data in the feature space.

Proxy of the local geometry:
K^m → G^m_k (k-nearest-neighbor graph) → A^m_k (adjacency matrix)
⇒ W = Σ_m 1{A^m_k > 0} or W = Σ_m A^m_k

Feature-space geometry measured by:
Δ_i(β) = ⟨φ*_β(x_i), (φ*_β(x_1), ..., φ*_β(x_n))⟩ = (K*_β(x_i, x_1), ..., K*_β(x_i, x_n))^⊤
A kernel preserving the original topology of the data II
Sparse version (quadprog in R):
minimize_β Σ_{i,j=1}^N W_{ij} ‖Δ_i(β) − Δ_j(β)‖²
for K*_β = Σ_{m=1}^M β_m K^m and β ∈ R^M s.t. β_m ≥ 0 and Σ_{m=1}^M β_m = 1.

Non-sparse version (ADMM optimization [Boyd et al., 2011]):
minimize_v Σ_{i,j=1}^N W_{ij} ‖Δ_i(v) − Δ_j(v)‖²
for K*_v = Σ_{m=1}^M v_m K^m and v ∈ R^M s.t. v_m ≥ 0 and ‖v‖_2 = 1.
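The weight matrix W can be built directly from each kernel through the kernel-induced distance; a sketch under these assumptions (function names chosen here, and the indicator variant of W is used):

```python
import numpy as np

def knn_adjacency(K, k):
    """k-nearest-neighbor adjacency from a kernel matrix, with neighbors
    taken under d^2(i, i') = K_ii + K_i'i' - 2 K_ii'."""
    diag = np.diag(K)
    d2 = diag[:, None] + diag[None, :] - 2.0 * K
    n = K.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(d2[i])
        neighbors = [j for j in order if j != i][:k]
        A[i, neighbors] = 1.0
    return np.maximum(A, A.T)   # symmetrize the k-NN graph

def topology_weights(kernels, k):
    """W = sum over kernels of the k-NN graph indicators 1{A^m_k > 0}."""
    return sum(knn_adjacency(K, k) for K in kernels)
```

W then weights which pairs (i, j) the objective on this slide tries to keep close in the combined feature space.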
What's next? Improve interpretability
Main drawbacks of kernel methods:
- they do not scale well with sample size
- the relation to the original features is lost ⇒ interpretation of the results is more difficult
⇒ Select the variables most relevant for a given kernel & task
Selecting features in kernels
Basic principles (for numerical features only):
- label each feature j with a weight w_j ∈ {0, 1} (selected or not)
- new kernel: K_w(x_i, x_{i'}) = K(w · x_i, w · x_{i'})
- find the best features by optimizing a quality criterion (after relaxation to w_j ≥ 0)
Examples:
- supervised framework: learn w to compute a kernel K^w_x best suited to predict Y ∈ R [Allen, 2013, Grandvalet and Canu, 2002]
- extended to unsupervised (exploratory) learning and to arbitrary outputs (kernel outputs, including multiple-output regression), with the non-convex and non-smooth optimization solved by proximal gradient descent [Brouard et al., 2021]; implemented in mixKernel
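The reweighted kernel K_w is easy to state concretely for a Gaussian base kernel (illustrative sketch; `weighted_gaussian_kernel` is a name chosen here): a feature with w_j = 0 drops out of the similarity entirely.

```python
import numpy as np

def weighted_gaussian_kernel(X, w, gamma=1.0):
    """K_w(x_i, x_i') = K(w * x_i, w * x_i') with a Gaussian base kernel."""
    Xw = X * w                  # elementwise feature reweighting
    sq = ((Xw[:, None, :] - Xw[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)
```

Setting w_j = 0 for a feature gives exactly the kernel computed on the data with that column removed, which is what makes the binary weights act as a selection mask.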
UKFS performances

                  lapl    SPEC    MCFS    NDFS    UDFS     Autoenc.   UKFS
“Carcinom” (n = 174, p = 9,182)
  ACC            164.02  106.52  184.17  200.88  138.48   143.13     206.55
  COR             28.14   30.75   29.56   27.49   30.30    33.18      24.75
  CPU (s)          0.25    2.47   11.69   6,162  99,138   > 4 days   326
“Glioma” (n = 50, p = 4,434)
  ACC            166.31  140.72  172.78  147.77  147.50   132.76     178.57
  COR             81.70   70.70   76.43   68.02   72.33    45.96      52.14
  CPU (s)          0.02    0.63    1.05   368     2,636   ∼ 12 h      23.74
“Koren” (n = 43, p = 980)
  ACC            172.90  225.25  233.94  263.04  263.48   239.76     242.39
  COR             48.18   52.34   49.94   48.48   48.69    32.60      47.77
  CPU (s)          0.01    0.07    1.11   5.88    9.70    ∼ 30 min    10.69
[Brouard et al., 2021]
- non-redundant features
- can incorporate information on relations between variables
KOKFS illustration
Used with the COVID-19 dataset from [Haug et al., 2020]:
- inputs: 46 government measures (0/1 encoding)
- outputs: time series of the reproduction number R (with a kernel based on Fréchet distances)
Most important measures:
Increase In Medical Supplies And Equipment
Border Health Check
Public Transport Restriction
The Government Provides Assistance To Vulnerable Populations
Individual Movement Restrictions
Increase Availability Of Ppe
Activate Case Notification
Border Restriction
Measures To Ensure Security Of Supply
Port And Ship Restriction
Activate Or Establish Emergency Response
Thank you for your attention!
Questions?
References
(unofficial) Beamer template made with the help of Thomas Schiex: https://forgemia.inra.fr/nathalie.villa-vialaneix/bainrae
Allen, G. I. (2013).
Automatic feature selection via weighted kernels and regularization.
Journal of Computational and Graphical Statistics, 22(2):284–299.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011).
Distributed optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122.
Brouard, C., Mariette, J., Flamary, R., and Vialaneix, N. (2021).
Feature selection for kernel methods in systems biology.
In preparation.
Brouard, C., Shen, H., Dührkop, K., d'Alché-Buc, F., Böcker, S., and Rousu, J. (2016).
Fast metabolite identification with input output kernel regression.
Bioinformatics, 32(12):i28–i36.
Gönen, M. and Alpaydin, E. (2011).
Multiple kernel learning algorithms.
Journal of Machine Learning Research, 12:2211–2268.
Grandvalet, Y. and Canu, S. (2002).
Adaptive scaling for feature selection in SVMs.
In Becker, S., Thrun, S., and Obermayer, K., editors, Proceedings of Advances in Neural Information Processing Systems (NIPS 2002), pages
569–576. MIT Press.
Haug, N., Geyrhofer, L., Londei, A., Dervic, E., Desvars-Larrive, A., Loreto, V., Pinior, B., Thurner, S., and Klimek, P. (2020).
Ranking the effectiveness of worldwide COVID-19 government interventions.
Nature Human Behaviour, 4(4):1303–1312.
Jaakkola, T., Diekhans, M., and Haussler, D. (2000).
A discriminative framework for detecting remote protein homologies.
Journal of Computational Biology, 7(1-2):95–114.
Kondor, R. and Lafferty, J. (2002).
Diffusion kernels on graphs and other discrete structures.
In Sammut, C. and Hoffmann, A., editors, Proceedings of the 19th International Conference on Machine Learning, pages 315–322, Sydney,
Australia. Morgan Kaufmann Publishers Inc. San Francisco, CA, USA.
Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994).
The ACT (STATIS method).
Computational Statistics and Data Analysis, 18(1):97–119.
L’Hermier des Plantes, H. (1976).
Structuration des tableaux à trois indices de la statistique.
PhD thesis, Université de Montpellier.
Thèse de troisième cycle.
Lin, Y.-Y., Liu, T.-L., and Fuh, C.-S. (2010).
Multiple kernel learning for dimensionality reduction.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1147–1160.
Mariette, J. and Villa-Vialaneix, N. (2018).
Unsupervised multiple kernel learning for heterogeneous data integration.
Bioinformatics, 34(6):1009–1015.
Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K. (2015).
limma powers differential expression analyses for RNA-sequencing and microarray studies.
Nucleic Acids Research, 43(7):e47.
Robert, P. and Escoufier, Y. (1976).
A unifying tool for linear multivariate statistical methods: the RV-coefficient.
Applied Statistics, 25(3):257–265.
Saigo, H., Vert, J.-P., Ueda, N., and Akutsu, T. (2004).
Protein homology detection using string alignment kernels.
Bioinformatics, 20(11):1682–1689.
Shen, H., Dührkop, K., Böcker, S., and Rousu, J. (2014).
Metabolite identification through multiple kernel learning on fragmentation trees.
Bioinformatics, 30(12):i157–i164.
Technical details: Unsupervised Kernel Feature Selection (UKFS)
Reduced kernel for X ∈ R^p: tune weights w ∈ {0, 1}^p such that the kernel based on w · X is very similar to the kernel based on X.

Optimization problem:
argmin_{w ∈ {0,1}^p} ‖K^w_x − K_x‖²_F   s.t.   Σ_{j=1}^p w_j ≤ d

Continuous relaxation (non-convex and non-smooth):
w* := argmin_{w ∈ (R^+)^p} ‖K^w_x − K_x‖²_F + λ ‖w‖_1, solved with proximal gradient descent.
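A schematic proximal-gradient loop for the relaxed problem (the real solver is in mixKernel; here the gradient of the smooth part is approximated by finite differences for brevity, the base kernel is Gaussian, and the step size and λ are arbitrary choices for illustration):

```python
import numpy as np

def gaussian_kernel_w(X, w, gamma=1.0):
    """Gaussian kernel on the feature-reweighted data w * X."""
    Xw = X * w
    sq = ((Xw[:, None, :] - Xw[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def ukfs_sketch(X, lam=0.1, gamma=1.0, step=0.05, iters=200):
    """Proximal gradient on w >= 0 for ||K^w_x - K_x||_F^2 + lam * ||w||_1."""
    K = gaussian_kernel_w(X, np.ones(X.shape[1]), gamma)   # reference kernel
    w = np.ones(X.shape[1])
    def loss(w):
        return ((gaussian_kernel_w(X, w, gamma) - K) ** 2).sum()
    eps = 1e-5
    for _ in range(iters):
        g = np.zeros_like(w)
        for j in range(w.size):            # finite-difference gradient
            e = np.zeros_like(w); e[j] = eps
            g[j] = (loss(w + e) - loss(w - e)) / (2 * eps)
        # prox of lam * ||.||_1 restricted to R^+: shift then clip at 0
        w = np.maximum(w - step * g - step * lam, 0.0)
    return w
```

On data where one column is constant (hence irrelevant to the kernel), the ℓ1 term drives that weight to zero while informative weights stay large, which is the selection behavior the slide describes.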