SlideShare a Scribd company logo
1 of 31
Download to read offline
Intégration de données omiques multi-échelles : méthodes à
noyau et autres approches de machine learning
Nathalie Vialaneix
nathalie.vialaneix@inrae.fr
http://www.nathalievialaneix.eu
AG MathNum
Clermont-Ferrand, 17 mai 2022
GOS1 : Maı̂triser les méthodes pour acquérir, gérer et intégrer
données et connaissances face à la multiplication des sources
d’information
déploiement d’approches IoT
développement de capteurs
extractions à partir de données textuelle (NLP)
données participatives
gestion optimale des données
ontologies
entrepôts de données
edge & online computing
intégrer des données hétérogènes
réseaux
données spatio-temporelles
optimiser les choix d’acquisition
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 2
GOS1 : Maı̂triser les méthodes pour acquérir, gérer et intégrer
données et connaissances face à la multiplication des sources
d’information
déploiement d’approches IoT
développement de capteurs
extractions à partir de données textuelle (NLP)
données participatives
gestion optimale des données
ontologies
entrepôts de données
edge & online computing
intégrer des données hétérogènes
réseaux
données spatio-temporelles
optimiser les choix d’acquisition
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 2
Collected data at genomic level are huge, multi-scaled and of
heterogeneous types... an increasingly publicly available
[Foissac et al., 2019]
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 3
Analysis bottleneck
I as in humans, under-used data because: various experimental conditions
(meta-analyses needed), various scales (not always compatible), various types (not
always simply numeric), unperfect matching between experiments (missing
data/individuals), ...
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 4
Analysis bottleneck
I as in humans, under-used data because: various experimental conditions
(meta-analyses needed), various scales (not always compatible), various types (not
always simply numeric), unperfect matching between experiments (missing
data/individuals), ...
I in addition to humans: less discriminative phenotypes, interest is indirect (in
breeding not directly in the individual itself), much less data with poorer quality
annotation (inter-species transfer might be useful)
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 4
Multiple table analyses
(CCA, MFA, PLS, STATIS, ...)
Image from https://dimensionless.in
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 5
Types of data integration methods [Ritchie et al., 2015]
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 6
Types of objectives
Type of data to integrate
vertical:
horizontal:
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 7
Types of objectives
Type of data to integrate
vertical:
horizontal:
Type of analysis to perform
supervised:
unsupervised:
Left pictures courtesy Harold Duruflé
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 7
Want to know them all?
https://github.com/mikelove/awesome-multi-omics
Some specificities: can account for structure in data (network), are dedicated to a
specific omic (single-cell), can account for temporal/spatial information, can include
biological knowledge (mostly GO), ...
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 8
Scope of the rest of the talk
Unsupervised transformation based integration
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 9
Common representations using relational data
n samples (xi )i ∈ X, summarized by pairwise similarities or dissimilarities
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 10
Common representations using relational data
n samples (xi )i ∈ X, summarized by pairwise similarities or dissimilarities
kernels: symmetric and positive definite (n × n)-matrix
K that measures a “similarity” between n entities in X
K(x, x0) = hφ(x), φ(x0)i
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 10
Multiple kernel (or distance) integration
How to “optimally” combine several relational datasets in an unsupervised
setting?
For kernels K1, . . . , KM obtained on the same n objects, search: Kβ =
PM
m=1 βmKm
with βm ≥ 0 and
P
m βm = 1
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 11
Multiple kernel (or distance) integration
How to “optimally” combine several relational datasets in an unsupervised
setting?
For kernels K1, . . . , KM obtained on the same n objects, search: Kβ =
PM
m=1 βmKm
with βm ≥ 0 and
P
m βm = 1
Ideas of kernel consensus: find a kernel that performs a consensus of all kernels
[Mariette and Villa-Vialaneix, 2018] - R package mixKernel
with consensus based on:
I STATIS [L’Hermier des Plantes, 1976, Lavit et al., 1994]
I criterion that preserves local geometry
minimizeβ
n
X
i,j=1
Wij
|{z}
local proximity
× k∆i (β) − ∆j (β)k
2
| {z }
feature space distance
for βm ≥ 0 and kβk1,2 = 1
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 11
TARA Oceans expedition
Science (May 2015) - Studies on:
I eukaryotic plankton diversity
[de Vargas et al., 2015],
I ocean viral communities
[Brum et al., 2015],
I global plankton interactome
[Lima-Mendez et al., 2015],
I global ocean microbiome
[Sunagawa et al., 2015],
I . . . .
→ datasets from different types and
different sources analyzed separately.
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 12
Integrating TARA Oceans datasets
I For all compositional datasets, include
phylogeny information rather than
standard compositional
transformations (CLR and alike):
weighted Unifrac distance
I Perform PCA (could have been
clustering, linear model, . . . ) in the
feature space.
+ combine with a shuffling approach
to identify most influencing variables
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 13
Application to TARA oceans
Main facts
I Oceans typology related to longitude
I Rhizaria abundance structure the differences, especially between Arctic Oceans
and Pacific Oceans
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 14
What’s next? Improve interpretability
Select variables the most relevant for a given kernel
I built on works that select variables in X to compute a kernel Kx best to predict
Y ∈ R [Allen, 2013, Grandvalet and Canu, 2002]
I extended to unsupervised (exploratory) learning and arbitrary outputs (kernel
outputs, including multiple output regression)
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 15
What’s next? Improve interpretability
Select variables the most relevant for a given kernel
I built on works that select variables in X to compute a kernel Kx best to predict
Y ∈ R [Allen, 2013, Grandvalet and Canu, 2002]
I extended to unsupervised (exploratory) learning and arbitrary outputs (kernel
outputs, including multiple output regression)
Continuous relaxation of a discrete problem
non-convex and non-smooth optimization solved with proximal gradient descent
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 15
Making methods available for biologists
http://asterics.miat.inrae.fr
Ambition:
I easy-to-use and interactive
I helps user to know which analysis to
use and how to interpret results
I integrates domain expertise
I usable online or can be installed
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 16
Future needs for data integration
I improve interpretability of methods (integrate more biological knowledge)
I generic vs specific: omics are always evolving... (spatial transcriptomics is the new
era?)
I reduce computational needs to achieve the challenge of a more sustainable
research
I make them more widely used by the biological community: omics data are still
produced at a much faster rate than their use!
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 17
Thank you for your attention!
Questions?
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 18
References
(unofficial) Beamer template made with the help of Thomas Schiex, Matthias Zytnicki and Andreea Dreau:
https://forgemia.inra.fr/nathalie.villa-vialaneix/bainrae
Allen, G. I. (2013).
Automatic feature selection via weighted kernels and regularization.
Journal of Computational and Graphical Statistics, 22(2):284–299.
Brouard, C., Mariette, J., Flamary, R., and Vialaneix, N. (2021).
Feature selection for kernel methods in systems biology.
In preparation.
Brum, J., Ignacio-Espinoza, J., Roux, S., Doulcier, G., Acinas, S., Alberti, A., Chaffron, S., Cruaud, C., de Vargas, C., Gasol, J., Gorsky, G.,
Gregory, A., Guidi, L., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Poulos, B., Schwenck, S., Speich, S., Dimier, C.,
Kandels-Lewis, S., Picheral, M., Searson, S., Tara Oceans coordinators, Bork, P., Bowler, C., Sunagawa, S., Wincker, P., Karsenti, E., and
Sullivan, M. (2015).
Patterns and ecological drivers of ocean viral communities.
Science, 348(6237).
de Vargas, C., Audic, S., Henry, N., Decelle, J., Mahé, P., Logares, R., Lara, E., Berney, C., Le Bescot, N., Probert, I., Carmichael, M.,
Poulain, J., Romac, S., Colin, S., Aury, J., Bittner, L., Chaffron, S., Dunthorn, M., Engelen, S., Flegontova, O., Guidi, L., Horák, A., Jaillon,
O., Lima-Mendez, G., Lukeš, J., Malviya, S., Morard, R., Mulot, M., Scalco, E., Siano, R., Vincent, F., Zingone, A., Dimier, C., Picheral, M.,
Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Acinas, S., Bork, P., Bowler, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone,
D., Not, F., Ogata, H., Pesant, S., Raes, J., Sieracki, M. E., Speich, S., Stemmann, L., Sunagawa, S., Weissenbach, J., Wincker, P., and
Karsenti, E. (2015).
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 18
Eukaryotic plankton diversity in the sunlit ocean.
Science, 348(6237).
Foissac, S., Djebali, S., Munyard, K., Vialaneix, N., Rau, A., Muret, K., Esquerre, D., Zytnicki, M., Derrien, T., Bardou, P., Blanc, F., Cabau,
C., Crisci, E., Dhorne-Pollet, S., Drouet, F., Faraut, T., Gonzáles, I., Goubil, A., Lacroix-Lamande, S., Laurent, F., Marthey, S.,
Marti-Marimon, M., Mormal-Leisenring, R., Mompart, F., Quere, P., Robelin, D., SanCristobal, M., Tosser-Klopp, G., Vincent-Naulleau, S.,
Fabre, S., Pinard-Van der Laan, M.-H., Klopp, C., Tixier-Boichard, M., Acloque, H., Lagarrigue, S., and Giuffra, E. (2019).
Multi-species annotation of transcriptome and chromatin structure in domesticated animals.
BMC Biology, 17:108.
Grandvalet, Y. and Canu, S. (2002).
Adaptive scaling for feature selection in SVMs.
In Becker, S., Thrun, S., and Obermayer, K., editors, Proceedings of Advances in Neural Information Processing Systems (NIPS 2002), pages
569–576. MIT Press.
Haug, N., Geyrhofer, L., Londei, A., Dervic, E., Desvars-Larrive, A., Loreto, V., Pinior, B., Thurner, S., and Klimek, P. (2020).
Ranking the effectiveness of worldwide COVID-19 government interventions.
Nature Human Behaviour, 4(4):1303–1312.
Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994).
The ACT (STATIS method).
Computational Statistics and Data Analysis, 18(1):97–119.
L’Hermier des Plantes, H. (1976).
Structuration des tableaux à trois indices de la statistique.
PhD thesis, Université de Montpellier.
Thèse de troisième cycle.
Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J., Roux, S., Vincent, F., Bittner,
L., Darzi, Y., Wang, B., Audic, S., Berline, L., Bontempi, G., Cabello, A., Coppola, L., Cornejo-Castillo, F., d’Oviedo, F., de Meester, L.,
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 18
Ferrera, I., Garet-Delmas, M., Guidi, L., Lara, E., Pesant, S., Royo-Llonch, M., Salazar, F., Sánchez, P., Sebastian, M., Souffreau, C., Dimier,
C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Gorsky, G., Not, F., Ogata, H., Speich, S., Stemmann, L.,
Weissenbach, J., Wincker, P., Acinas, S., Sunagawa, S., Bork, P., Sullivan, M., Karsenti, E., Bowler, C., de Vargas, C., and Raes, J. (2015).
Determinants of community structure in the global plankton interactome.
Science, 348(6237).
Mariette, J. and Villa-Vialaneix, N. (2018).
Unsupervised multiple kernel learning for heterogeneous data integration.
Bioinformatics, 34(6):1009–1015.
Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K. (2015).
limma powers differential expression analyses for RNA-sequencing and microarray studies.
Nucleic Acids Research, 43(7):e47.
Sunagawa, S., Coelho, L., Chaffron, S., Kultima, J., Labadie, K., Salazar, F., Djahanschiri, B., Zeller, G., Mende, D., Alberti, A.,
Cornejo-Castillo, F., Costea, P., Cruaud, C., d’Oviedo, F., Engelen, S., Ferrera, I., Gasol, J., Guidi, L., Hildebrand, F., Kokoszka, F., Lepoivre,
C., Lima-Mendez, G., Poulain, J., Poulos, B., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral, M., Searson, S.,
Kandels-Lewis, S., Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Jaillon, O., Not,
F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M., Weissenbach, J., Wincker, P., Karsenti, E., Raes, J., Acinas, S., and Bork,
P. (2015).
Structure and function of the global ocean microbiome.
Science, 348(6237).
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 19
Technical details: Unsupervised Kernel Feature Selection
(UKFS)
Reduced kernel for X ∈ Rp
Tune weights w ∈ {0, 1}p such that the kernel based on w · X is very similar to the
kernel based on X
Optimization problem
argmin
w∈{0,1}p
kKw
x − Kx k2
F s.t
p
X
j=1
wj ≤ d
Continuous relaxation (non-convex and non-smooth)
w∗ := argmin
w∈(R+)p
kKw
x − Kx k2
F + λkwk1, solved with proximal gradient descent
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 19
UKFS performances
lapl SPEC MCFS NDFS UDFS Autoenc. UKFS
“Carcinom” (n = 174, p = 9 182)
ACC 164.02 106.52 184.17 200.88 138.48 143.13 206.55
COR 28.14 30.75 29.56 27.49 30.30 33.18 24.75
CPU 0.25 2.47 11.69 6,162 99,138 > 4 days 326
“Glioma” (n = 50, p = 4 434)
ACC 166.31 140.72 172.78 147.77 147.50 132.76 178.57
COR 81.70 70.70 76.43 68.02 72.33 45.96 52.14
CPU 0.02 0.63 1.05 368 2,636 ∼ 12h 23.74
“Koren” (n = 43, p = 980)
ACC 172.90 225.25 233.94 263.04 263.48 239.76 242.39
COR 48.18 52.34 49.94 48.48 48.69 32.60 47.77
CPU 0.01 0.07 1.11 5.88 9.70 ∼ 30 min 10.69
[Brouard et al., 2021]
I non redundant features
I can incorporate information on relations between variables
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 20
KOKFS illustration
Used with covid19 dataset from [Haug et al., 2020]:
I inputs: 46 government measures (0/1 encoding)
I outputs: R time series (with a kernel based on Fréchet distances)
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 21
KOKFS illustration
Used with covid19 dataset from [Haug et al., 2020]:
I inputs: 46 government measures (0/1 encoding)
I outputs: R time series (with a kernel based on Fréchet distances)
More important measures:
Increase In Medical Supplies And Equipment
Border Health Check
Public Transport Restriction
The Government Provides Assistance To Vulnerable Populations
Individual Movement Restrictions
Increase Availability Of Ppe
Activate Case Notification
Border Restriction
Measures To Ensure Security Of Supply
Port And Ship Restriction
Activate Or Establish Emergency Response
Statistics & ML for high throughput data integration
17 mai 2022 / AG MathNum / Nathalie Vialaneix
p. 21

More Related Content

Similar to Intégration de données omiques multi-échelles : méthodes à noyau et autres approches de machine learning

Machine Learning meets Granular Computing
Machine Learning meets Granular ComputingMachine Learning meets Granular Computing
Machine Learning meets Granular ComputingJenny Midwinter
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
 
IEEE Big data 2016 Title and Abstract
IEEE Big data  2016 Title and AbstractIEEE Big data  2016 Title and Abstract
IEEE Big data 2016 Title and Abstracttsysglobalsolutions
 
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG DataChallenges in Analytics for BIG Data
Challenges in Analytics for BIG DataPrasant Misra
 
Enabling Data Analytics from Knowledge Graphs @ ISWC 2017 Doctoral Consortium
Enabling Data Analytics from Knowledge Graphs @ ISWC 2017 Doctoral ConsortiumEnabling Data Analytics from Knowledge Graphs @ ISWC 2017 Doctoral Consortium
Enabling Data Analytics from Knowledge Graphs @ ISWC 2017 Doctoral ConsortiumHenrique O. Santos
 
Big data analysis and modelling
Big data analysis and modellingBig data analysis and modelling
Big data analysis and modellingkeivan mahdavi
 
Qu speaker series 14: Synthetic Data Generation in Finance
Qu speaker series 14: Synthetic Data Generation in FinanceQu speaker series 14: Synthetic Data Generation in Finance
Qu speaker series 14: Synthetic Data Generation in FinanceQuantUniversity
 
Smart Data Models session - SALTED.pptx
Smart Data Models session - SALTED.pptxSmart Data Models session - SALTED.pptx
Smart Data Models session - SALTED.pptxFIWARE
 
Survey on MapReduce in Big Data Clustering using Machine Learning Algorithms
Survey on MapReduce in Big Data Clustering using Machine Learning AlgorithmsSurvey on MapReduce in Big Data Clustering using Machine Learning Algorithms
Survey on MapReduce in Big Data Clustering using Machine Learning AlgorithmsIRJET Journal
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Map Reduce in Big fata
Map Reduce in Big fataMap Reduce in Big fata
Map Reduce in Big fataSuraj Sawant
 
The web of data: how are we doing so far
The web of data: how are we doing so farThe web of data: how are we doing so far
The web of data: how are we doing so farElena Simperl
 
SC4 Workshop 1: Logistics and big data German herrero
SC4 Workshop 1: Logistics and big data  German herreroSC4 Workshop 1: Logistics and big data  German herrero
SC4 Workshop 1: Logistics and big data German herreroBigData_Europe
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Geoffrey Fox
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...Geoffrey Fox
 
Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...
Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...
Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...Prasanta Paul
 
Deep learning 1.0 and Beyond, Part 1
Deep learning 1.0 and Beyond, Part 1Deep learning 1.0 and Beyond, Part 1
Deep learning 1.0 and Beyond, Part 1Deakin University
 

Similar to Intégration de données omiques multi-échelles : méthodes à noyau et autres approches de machine learning (20)

Machine Learning meets Granular Computing
Machine Learning meets Granular ComputingMachine Learning meets Granular Computing
Machine Learning meets Granular Computing
 
UNIT 1_2.ppt
UNIT 1_2.pptUNIT 1_2.ppt
UNIT 1_2.ppt
 
Smart Geo. Guido Satta (Maggio 2015)
Smart Geo. Guido Satta (Maggio 2015)Smart Geo. Guido Satta (Maggio 2015)
Smart Geo. Guido Satta (Maggio 2015)
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
IEEE Big data 2016 Title and Abstract
IEEE Big data  2016 Title and AbstractIEEE Big data  2016 Title and Abstract
IEEE Big data 2016 Title and Abstract
 
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG DataChallenges in Analytics for BIG Data
Challenges in Analytics for BIG Data
 
Enabling Data Analytics from Knowledge Graphs @ ISWC 2017 Doctoral Consortium
Enabling Data Analytics from Knowledge Graphs @ ISWC 2017 Doctoral ConsortiumEnabling Data Analytics from Knowledge Graphs @ ISWC 2017 Doctoral Consortium
Enabling Data Analytics from Knowledge Graphs @ ISWC 2017 Doctoral Consortium
 
Big data analysis and modelling
Big data analysis and modellingBig data analysis and modelling
Big data analysis and modelling
 
Qu speaker series 14: Synthetic Data Generation in Finance
Qu speaker series 14: Synthetic Data Generation in FinanceQu speaker series 14: Synthetic Data Generation in Finance
Qu speaker series 14: Synthetic Data Generation in Finance
 
Smart Data Models session - SALTED.pptx
Smart Data Models session - SALTED.pptxSmart Data Models session - SALTED.pptx
Smart Data Models session - SALTED.pptx
 
Survey on MapReduce in Big Data Clustering using Machine Learning Algorithms
Survey on MapReduce in Big Data Clustering using Machine Learning AlgorithmsSurvey on MapReduce in Big Data Clustering using Machine Learning Algorithms
Survey on MapReduce in Big Data Clustering using Machine Learning Algorithms
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Map Reduce in Big fata
Map Reduce in Big fataMap Reduce in Big fata
Map Reduce in Big fata
 
The web of data: how are we doing so far
The web of data: how are we doing so farThe web of data: how are we doing so far
The web of data: how are we doing so far
 
SC4 Workshop 1: Logistics and big data German herrero
SC4 Workshop 1: Logistics and big data  German herreroSC4 Workshop 1: Logistics and big data  German herrero
SC4 Workshop 1: Logistics and big data German herrero
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...
 
Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...
Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...
Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...
 
PointNet
PointNetPointNet
PointNet
 
Deep learning 1.0 and Beyond, Part 1
Deep learning 1.0 and Beyond, Part 1Deep learning 1.0 and Beyond, Part 1
Deep learning 1.0 and Beyond, Part 1
 

More from tuxette

Racines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en mathsRacines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en mathstuxette
 
Méthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènesMéthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènestuxette
 
Projets autour de l'Hi-C
Projets autour de l'Hi-CProjets autour de l'Hi-C
Projets autour de l'Hi-Ctuxette
 
Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?tuxette
 
ASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiquesASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiquestuxette
 
Autour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWeanAutour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWeantuxette
 
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...tuxette
 
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquesApprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquestuxette
 
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...tuxette
 
Journal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation dataJournal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation datatuxette
 
Overfitting or overparametrization?
Overfitting or overparametrization?Overfitting or overparametrization?
Overfitting or overparametrization?tuxette
 
Selective inference and single-cell differential analysis
Selective inference and single-cell differential analysisSelective inference and single-cell differential analysis
Selective inference and single-cell differential analysistuxette
 
SOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricesSOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricestuxette
 
Graph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype PredictionGraph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype Predictiontuxette
 
A short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelsA short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelstuxette
 
Explanable models for time series with random forest
Explanable models for time series with random forestExplanable models for time series with random forest
Explanable models for time series with random foresttuxette
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICStuxette
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICStuxette
 
A review on structure learning in GNN
A review on structure learning in GNNA review on structure learning in GNN
A review on structure learning in GNNtuxette
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practicetuxette
 

More from tuxette (20)

Racines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en mathsRacines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en maths
 
Méthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènesMéthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènes
 
Projets autour de l'Hi-C
Projets autour de l'Hi-CProjets autour de l'Hi-C
Projets autour de l'Hi-C
 
Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?
 
ASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiquesASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiques
 
Autour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWeanAutour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWean
 
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
 
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquesApprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
 
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
 
Journal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation dataJournal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation data
 
Overfitting or overparametrization?
Overfitting or overparametrization?Overfitting or overparametrization?
Overfitting or overparametrization?
 
Selective inference and single-cell differential analysis
Selective inference and single-cell differential analysisSelective inference and single-cell differential analysis
Selective inference and single-cell differential analysis
 
SOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricesSOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatrices
 
Graph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype PredictionGraph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype Prediction
 
A short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelsA short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction models
 
Explanable models for time series with random forest
Explanable models for time series with random forestExplanable models for time series with random forest
Explanable models for time series with random forest
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICS
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICS
 
A review on structure learning in GNN
A review on structure learning in GNNA review on structure learning in GNN
A review on structure learning in GNN
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practice
 

Recently uploaded

Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologycaarthichand2003
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPirithiRaju
 
Sulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptx
Sulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptxSulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptx
Sulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptxnoordubaliya2003
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationColumbia Weather Systems
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRlizamodels9
 
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdfBUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdfWildaNurAmalia2
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsssuserddc89b
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringPrajakta Shinde
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxyaramohamed343013
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 

Recently uploaded (20)

Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technology
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdf
 
Sulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptx
Sulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptxSulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptx
Sulphur & Phosphrus Cycle PowerPoint Presentation (2) [Autosaved]-3-1.pptx
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather Station
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdfBUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physics
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical Engineering
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docx
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensor
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)
 

Intégration de données omiques multi-échelles : méthodes à noyau et autres approches de machine learning

  • 1. Intégration de données omiques multi-échelles : méthodes à noyau et autres approches de machine learning Nathalie Vialaneix nathalie.vialaneix@inrae.fr http://www.nathalievialaneix.eu AG MathNum Clermont-Ferrand, 17 mai 2022
  • 2. GOS1 : Maı̂triser les méthodes pour acquérir, gérer et intégrer données et connaissances face à la multiplication des sources d’information déploiement d’approches IoT développement de capteurs extractions à partir de données textuelle (NLP) données participatives gestion optimale des données ontologies entrepôts de données edge & online computing intégrer des données hétérogènes réseaux données spatio-temporelles optimiser les choix d’acquisition Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 2
  • 3. GOS1 : Maı̂triser les méthodes pour acquérir, gérer et intégrer données et connaissances face à la multiplication des sources d’information déploiement d’approches IoT développement de capteurs extractions à partir de données textuelle (NLP) données participatives gestion optimale des données ontologies entrepôts de données edge & online computing intégrer des données hétérogènes réseaux données spatio-temporelles optimiser les choix d’acquisition Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 2
  • 4. Collected data at genomic level are huge, multi-scaled and of heterogeneous types... an increasingly publicly available [Foissac et al., 2019] Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 3
  • 5. Analysis bottleneck I as in humans, under-used data because: various experimental conditions (meta-analyses needed), various scales (not always compatible), various types (not always simply numeric), unperfect matching between experiments (missing data/individuals), ... Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 4
  • 6. Analysis bottleneck I as in humans, under-used data because: various experimental conditions (meta-analyses needed), various scales (not always compatible), various types (not always simply numeric), unperfect matching between experiments (missing data/individuals), ... I in addition to humans: less discriminative phenotypes, interest is indirect (in breeding not directly in the individual itself), much less data with poorer quality annotation (inter-species transfer might be useful) Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 4
  • 7. Multiple table analyses (CCA, MFA, PLS, STATIS, ...) Image from https://dimensionless.in Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 5
  • 8. Types of data integration methods [Ritchie et al., 2015] Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 6
  • 9. Types of objectives Type of data to integrate vertical: horizontal: Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 7
  • 10. Types of objectives Type of data to integrate vertical: horizontal: Type of analysis to perform supervised: unsupervised: Left pictures courtesy Harold Duruflé Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 7
  • 11. Want to know them all? https://github.com/mikelove/awesome-multi-omics Some specificities: can account for structure in data (network), are dedicated to a specific omic (single-cell), can account for temporal/spatial information, can include biological knowledge (mostly GO), ... Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 8
  • 12. Scope of the rest of the talk Unsupervised transformation based integration Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 9
  • 13. Common representations using relational data n samples (xi )i ∈ X, summarized by pairwise similarities or dissimilarities Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 10
  • 14. Common representations using relational data n samples (xi )i ∈ X, summarized by pairwise similarities or dissimilarities kernels: symmetric and positive definite (n × n)-matrix K that measures a “similarity” between n entities in X K(x, x0) = hφ(x), φ(x0)i Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 10
  • 15. Multiple kernel (or distance) integration How to “optimally” combine several relational datasets in an unsupervised setting? For kernels K1, . . . , KM obtained on the same n objects, search: Kβ = PM m=1 βmKm with βm ≥ 0 and P m βm = 1 Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 11
  • 16. Multiple kernel (or distance) integration How to “optimally” combine several relational datasets in an unsupervised setting? For kernels K1, . . . , KM obtained on the same n objects, search: Kβ = PM m=1 βmKm with βm ≥ 0 and P m βm = 1 Ideas of kernel consensus: find a kernel that performs a consensus of all kernels [Mariette and Villa-Vialaneix, 2018] - R package mixKernel with consensus based on: I STATIS [L’Hermier des Plantes, 1976, Lavit et al., 1994] I criterion that preserves local geometry minimizeβ n X i,j=1 Wij |{z} local proximity × k∆i (β) − ∆j (β)k 2 | {z } feature space distance for βm ≥ 0 and kβk1,2 = 1 Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 11
  • 17. TARA Oceans expedition Science (May 2015) - Studies on: I eukaryotic plankton diversity [de Vargas et al., 2015], I ocean viral communities [Brum et al., 2015], I global plankton interactome [Lima-Mendez et al., 2015], I global ocean microbiome [Sunagawa et al., 2015], I . . . . → datasets from different types and different sources analyzed separately. Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 12
  • 18. Integrating TARA Oceans datasets I For all compositional datasets, include phylogeny information rather than standard compositional transformations (CLR and alike): weighted Unifrac distance I Perform PCA (could have been clustering, linear model, . . . ) in the feature space. + combine with a shuffling approach to identify most influencing variables Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 13
  • 19. Application to TARA oceans Main facts I Oceans typology related to longitude I Rhizaria abundance structure the differences, especially between Arctic Oceans and Pacific Oceans Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 14
  • 20. What’s next? Improve interpretability Select variables the most relevant for a given kernel I built on works that select variables in X to compute a kernel Kx best to predict Y ∈ R [Allen, 2013, Grandvalet and Canu, 2002] I extended to unsupervised (exploratory) learning and arbitrary outputs (kernel outputs, including multiple output regression) Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 15
  • 21. What’s next? Improve interpretability Select variables the most relevant for a given kernel I built on works that select variables in X to compute a kernel Kx best to predict Y ∈ R [Allen, 2013, Grandvalet and Canu, 2002] I extended to unsupervised (exploratory) learning and arbitrary outputs (kernel outputs, including multiple output regression) Continuous relaxation of a discrete problem non-convex and non-smooth optimization solved with proximal gradient descent Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 15
  • 22. Making methods available for biologists http://asterics.miat.inrae.fr Ambition: I easy-to-use and interactive I helps user to know which analysis to use and how to interpret results I integrates domain expertise I usable online or can be installed Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 16
  • 23. Future needs for data integration I improve interpretability of methods (integrate more biological knowledge) I generic vs specific: omics are always evolving... (spatial transcriptomics is the new era?) I reduce computational needs to achieve the challenge of a more sustainable research I make them more widely used by the biological community: omics data are still produced at a much faster rate than their use! Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 17
  • 24. Thank you for your attention! Questions? Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 18
  • 25. References (unofficial) Beamer template made with the help of Thomas Schiex, Matthias Zytnicki and Andreea Dreau: https://forgemia.inra.fr/nathalie.villa-vialaneix/bainrae Allen, G. I. (2013). Automatic feature selection via weighted kernels and regularization. Journal of Computational and Graphical Statistics, 22(2):284–299. Brouard, C., Mariette, J., Flamary, R., and Vialaneix, N. (2021). Feature selection for kernel methods in systems biology. In preparation. Brum, J., Ignacio-Espinoza, J., Roux, S., Doulcier, G., Acinas, S., Alberti, A., Chaffron, S., Cruaud, C., de Vargas, C., Gasol, J., Gorsky, G., Gregory, A., Guidi, L., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Poulos, B., Schwenck, S., Speich, S., Dimier, C., Kandels-Lewis, S., Picheral, M., Searson, S., Tara Oceans coordinators, Bork, P., Bowler, C., Sunagawa, S., Wincker, P., Karsenti, E., and Sullivan, M. (2015). Patterns and ecological drivers of ocean viral communities. Science, 348(6237). de Vargas, C., Audic, S., Henry, N., Decelle, J., Mahé, P., Logares, R., Lara, E., Berney, C., Le Bescot, N., Probert, I., Carmichael, M., Poulain, J., Romac, S., Colin, S., Aury, J., Bittner, L., Chaffron, S., Dunthorn, M., Engelen, S., Flegontova, O., Guidi, L., Horák, A., Jaillon, O., Lima-Mendez, G., Lukeš, J., Malviya, S., Morard, R., Mulot, M., Scalco, E., Siano, R., Vincent, F., Zingone, A., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Acinas, S., Bork, P., Bowler, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Raes, J., Sieracki, M. E., Speich, S., Stemmann, L., Sunagawa, S., Weissenbach, J., Wincker, P., and Karsenti, E. (2015). Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 18
  • 26. Eukaryotic plankton diversity in the sunlit ocean. Science, 348(6237). Foissac, S., Djebali, S., Munyard, K., Vialaneix, N., Rau, A., Muret, K., Esquerre, D., Zytnicki, M., Derrien, T., Bardou, P., Blanc, F., Cabau, C., Crisci, E., Dhorne-Pollet, S., Drouet, F., Faraut, T., Gonzáles, I., Goubil, A., Lacroix-Lamande, S., Laurent, F., Marthey, S., Marti-Marimon, M., Mormal-Leisenring, R., Mompart, F., Quere, P., Robelin, D., SanCristobal, M., Tosser-Klopp, G., Vincent-Naulleau, S., Fabre, S., Pinard-Van der Laan, M.-H., Klopp, C., Tixier-Boichard, M., Acloque, H., Lagarrigue, S., and Giuffra, E. (2019). Multi-species annotation of transcriptome and chromatin structure in domesticated animals. BMC Biology, 17:108. Grandvalet, Y. and Canu, S. (2002). Adaptive scaling for feature selection in SVMs. In Becker, S., Thrun, S., and Obermayer, K., editors, Proceedings of Advances in Neural Information Processing Systems (NIPS 2002), pages 569–576. MIT Press. Haug, N., Geyrhofer, L., Londei, A., Dervic, E., Desvars-Larrive, A., Loreto, V., Pinior, B., Thurner, S., and Klimek, P. (2020). Ranking the effectiveness of worldwide COVID-19 government interventions. Nature Human Behaviour, 4(4):1303–1312. Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18(1):97–119. L’Hermier des Plantes, H. (1976). Structuration des tableaux à trois indices de la statistique. PhD thesis, Université de Montpellier. Thèse de troisième cycle. Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J., Roux, S., Vincent, F., Bittner, L., Darzi, Y., Wang, B., Audic, S., Berline, L., Bontempi, G., Cabello, A., Coppola, L., Cornejo-Castillo, F., d’Oviedo, F., de Meester, L., Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 18
  • 27. Ferrera, I., Garet-Delmas, M., Guidi, L., Lara, E., Pesant, S., Royo-Llonch, M., Salazar, F., Sánchez, P., Sebastian, M., Souffreau, C., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Gorsky, G., Not, F., Ogata, H., Speich, S., Stemmann, L., Weissenbach, J., Wincker, P., Acinas, S., Sunagawa, S., Bork, P., Sullivan, M., Karsenti, E., Bowler, C., de Vargas, C., and Raes, J. (2015). Determinants of community structure in the global plankton interactome. Science, 348(6237). Mariette, J. and Villa-Vialaneix, N. (2018). Unsupervised multiple kernel learning for heterogeneous data integration. Bioinformatics, 34(6):1009–1015. Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7):e47. Sunagawa, S., Coelho, L., Chaffron, S., Kultima, J., Labadie, K., Salazar, F., Djahanschiri, B., Zeller, G., Mende, D., Alberti, A., Cornejo-Castillo, F., Costea, P., Cruaud, C., d’Oviedo, F., Engelen, S., Ferrera, I., Gasol, J., Guidi, L., Hildebrand, F., Kokoszka, F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M., Weissenbach, J., Wincker, P., Karsenti, E., Raes, J., Acinas, S., and Bork, P. (2015). Structure and function of the global ocean microbiome. Science, 348(6237). Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 19
  • 28. Technical details: Unsupervised Kernel Feature Selection (UKFS) Reduced kernel for X ∈ Rp Tune weights w ∈ {0, 1}p such that the kernel based on w · X is very similar to the kernel based on X Optimization problem argmin w∈{0,1}p kKw x − Kx k2 F s.t p X j=1 wj ≤ d Continuous relaxation (non-convex and non-smooth) w∗ := argmin w∈(R+)p kKw x − Kx k2 F + λkwk1, solved with proximal gradient descent Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 19
  • 29. UKFS performances lapl SPEC MCFS NDFS UDFS Autoenc. UKFS “Carcinom” (n = 174, p = 9 182) ACC 164.02 106.52 184.17 200.88 138.48 143.13 206.55 COR 28.14 30.75 29.56 27.49 30.30 33.18 24.75 CPU 0.25 2.47 11.69 6,162 99,138 > 4 days 326 “Glioma” (n = 50, p = 4 434) ACC 166.31 140.72 172.78 147.77 147.50 132.76 178.57 COR 81.70 70.70 76.43 68.02 72.33 45.96 52.14 CPU 0.02 0.63 1.05 368 2,636 ∼ 12h 23.74 “Koren” (n = 43, p = 980) ACC 172.90 225.25 233.94 263.04 263.48 239.76 242.39 COR 48.18 52.34 49.94 48.48 48.69 32.60 47.77 CPU 0.01 0.07 1.11 5.88 9.70 ∼ 30 min 10.69 [Brouard et al., 2021] I non redundant features I can incorporate information on relations between variables Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 20
  • 30. KOKFS illustration Used with covid19 dataset from [Haug et al., 2020]: I inputs: 46 government measures (0/1 encoding) I outputs: R time series (with a kernel based on Fréchet distances) Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 21
  • 31. KOKFS illustration Used with covid19 dataset from [Haug et al., 2020]: I inputs: 46 government measures (0/1 encoding) I outputs: R time series (with a kernel based on Fréchet distances) More important measures: Increase In Medical Supplies And Equipment Border Health Check Public Transport Restriction The Government Provides Assistance To Vulnerable Populations Individual Movement Restrictions Increase Availability Of Ppe Activate Case Notification Border Restriction Measures To Ensure Security Of Supply Port And Ship Restriction Activate Or Establish Emergency Response Statistics & ML for high throughput data integration 17 mai 2022 / AG MathNum / Nathalie Vialaneix p. 21