Explainable models for time series with random forest
Nathalie Vialaneix(1) & Rémi Servien
(1) nathalie.vialaneix@inrae.fr
http://www.nathalievialaneix.eu
First PhenoDyn meeting
November 29-30, 2021
Scientific question
Purpose: prediction of a target quantity (e.g., yield) from functional data (e.g.,
weather time series)
Statistics & ML for high throughput data integration
Nov 29-30 2021 / PhenoDyn / Nathalie Vialaneix & Rémi Servien
p. 2
Scientific question & difficulty at stake
Purpose: Improve interpretability by selecting the most predictive intervals.
Challenge: Selection of intervals is not too hard (e.g., group Lasso) but creating the
relevant intervals (starting point, length) is hard.
Existing solutions: [Picheny et al., 2019, Grollemund et al., 2019]
Scientific question & framework of the presentation
Here: random forest
Why?
▶ versatile method for prediction
▶ easy to use and relatively fast
▶ good prediction ability in general
▶ natural framework for interpretability (importance through OOB samples)
What is needed to achieve that goal?
Three/four key ingredients
1. random forest for time series
2. (maybe optional) ... based on summary descriptors of intervals
3. building intervals
4. selecting intervals
A short reminder on random forest [Breiman, 2001]
[Diagram: from the learning set Ln, bootstrap samples L_n^{Θ1}, ..., L_n^{Θℓ}, ..., L_n^{Θq} are drawn; an RI tree ĥ(·, Θℓ, Θ′ℓ) is grown on each; the trees are aggregated into the forest predictor ĥ_RF-RI(·). Stages: Bootstrap → RI tree → Aggregation.]
Courtesy of Robin Genuer and Jean-Michel Poggi.
CART [Breiman et al., 1984]
[Diagram: a CART tree (nodes C1–C9) built with successive splits X1 ≤ d3, X2 ≤ d2 and X1 ≤ d1, shown alongside the induced rectangular partition of the (X1, X2) plane at thresholds d1, d2, d3.]
Technical details
▶ splits are made by randomly choosing mtry < d variables and finding the
"best split" among this selection
▶ aggregation: average (numeric target) or majority vote (class target)
▶ OOB error: average error (over trees) on samples not included in the tree's
bootstrap sample
▶ variable importance: the larger the increase in error after random permutation
of a variable, the more important that variable is
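As a minimal illustration of OOB error and permutation importance, here is a sketch on synthetic data, using scikit-learn as a stand-in for the R implementations used in the talk (data and parameters are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)   # only variable 0 matters

rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)          # OOB R^2: accuracy averaged over out-of-bag samples

# permutation importance: shuffle one column at a time and measure the
# increase in error; the larger the increase, the more important the variable
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(imp.importances_mean.argmax())   # → 0
```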
Extensions of random forest for time series
▶ Similarity based techniques
  ▶ Fréchet forest [Capitaine et al., 2020]
  ▶ Proximity forest [Lucas et al., 2019] (restricted to classification)
▶ Interval based techniques
  ▶ Time Series Forest [Deng et al., 2013] and its extension [Middlehurst et al., 2020]
  ▶ RISE [Lines et al., 2018] (one tree = one randomly selected interval)
▶ Dictionary or symbolic representation based techniques
  ▶ TS-CHIEF [Shifaz et al., 2020] (combines all types of splits, including
dictionary-based splits building on [Schäfer, 2015])
  ▶ symbolic representation of (multivariate) time series
[Baydogan and Runger, 2015]
Time Series Forest
Basic principles:
1. for a given tree: random sampling of intervals
2. for a given tree: compute summaries (mean, sd, slope for [Deng et al., 2013];
catch22 for [Middlehurst et al., 2020])
3. define splits as usual based on these summaries
What is useful for our question?
▶ combined with variable selection, could help identify important intervals (still to
be tested)
▶ ideas to summarize the information of an entire interval (already partially tested)
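The three steps above can be sketched as follows (a toy Python sketch with made-up data; actual TSF implementations differ in how intervals are drawn per tree):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n, T = 100, 60
X = rng.normal(size=(n, T))            # toy series: one row per sample
y = X[:, 20:30].mean(axis=1)           # target driven by one interval

def interval_summaries(X, intervals):
    """Mean, sd and slope of each series over each interval (TSF-style)."""
    feats = []
    for s, e in intervals:
        seg = X[:, s:e]
        t = np.arange(e - s) - (e - s - 1) / 2.0       # centred time index
        slope = (seg - seg.mean(1, keepdims=True)) @ t / (t ** 2).sum()
        feats += [seg.mean(1), seg.std(1), slope]
    return np.column_stack(feats)

# step 1: random intervals; step 2: summaries; step 3: usual RF splits
starts = rng.integers(0, T - 10, size=8)
widths = rng.integers(5, 11, size=8)
Z = interval_summaries(X, [(s, s + w) for s, w in zip(starts, widths)])
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(Z, y)
```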
Extensions on summaries
▶ supervised: accounting for Y (is that useful?): PLS, linear models (including
ridge)... similar to [Poterie et al., 2019] (grouped variables summarized by LDA)
and to [Rainforth and Wood, 2017] (CCA-based splits)
▶ unsupervised: first PC of a PCA, as in the ClustOfVar method [Chavent et al., 2012]
and in [Chavent et al., 2021] (can also be useful to build groups)
▶ could a step further be taken with oblique splits [Bertsimas and Dunn, 2017]
(let the forest decide how to combine variables to find the best split)?
See also: [Hornung and Boulesteix, 2021]
Strategies for building intervals
▶ precomputing intervals independently of Y (based on correlation between time
points): constrained extension of ClustOfVar (PCA-like criterion), adjclust
(adjacency-constrained clustering based on correlation between variables) ⇒ hierarchy
of intervals
▶ precomputing intervals independently of Y (based on greedy agglomeration):
alternating between
  ▶ a regression step (LM) between any two consecutive variables to select the best
merge (minimum loss or maximum gain in accuracy)
  ▶ a summary step (depends on the regression type)
⇒ hierarchy of intervals
▶ using random forest to compute a hierarchy of intervals based on the loss in
grouped importance, in a greedy manner [Gregorutti et al., 2015]
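A naive sketch of the adjacency-constrained agglomeration idea (the real adjclust package is an R package with a Ward-like criterion; the greedy correlation rule below is a simplification for illustration):

```python
import numpy as np

def merge_adjacent(X):
    """Greedy adjacency-constrained agglomeration: at each step, merge the
    pair of *consecutive* intervals whose mean series are most correlated.
    Returns the hierarchy as a list of interval partitions."""
    intervals = [(t, t + 1) for t in range(X.shape[1])]
    hierarchy = [list(intervals)]
    while len(intervals) > 1:
        means = np.array([X[:, s:e].mean(axis=1) for s, e in intervals])
        cors = [np.corrcoef(means[i], means[i + 1])[0, 1]
                for i in range(len(intervals) - 1)]
        i = int(np.argmax(cors))                       # best adjacent merge
        intervals[i:i + 2] = [(intervals[i][0], intervals[i + 1][1])]
        hierarchy.append(list(intervals))
    return hierarchy

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 12))
h = merge_adjacent(X)          # from 12 singletons down to one interval
```

The adjacency constraint is what guarantees that every cluster in the hierarchy is a contiguous interval of time points rather than an arbitrary set.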
Strategies for selecting variables (here, intervals) in RF
Huge literature... a few reviews: [Degenhardt et al., 2019, Speiser et al., 2019]
▶ based on importance:
  ▶ just use the importance...
  ▶ ranking by importance, then selection of the best model among those built with the
first k variables (k = 1, . . . , K): VSURF [Genuer et al., 2010]
  ▶ [Altmann et al., 2010] or [Szymczak et al., 2016], based on a data-driven
importance threshold (untested)
▶ based on external variable selection methods:
  ▶ knockoffs [Barber and Candès, 2015], as in Boruta [Kursa and Rudnicki, 2010]
  ▶ Relief [Robnik-Šikonja and Kononenko, 2003]
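The shadow-feature idea behind Boruta can be sketched like this (toy data; the actual Boruta package iterates this comparison with statistical tests rather than a single pass):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)   # only feature 0 is real

# Boruta-style idea: append shuffled "shadow" copies of every feature and
# keep only features whose importance beats the best shadow importance
shadows = rng.permuted(X, axis=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(np.hstack([X, shadows]), y)
imp = rf.feature_importances_
keep = np.where(imp[:6] > imp[6:].max())[0]
print(keep)  # → [0]
```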
Simulation setting
▶ predictors: 1,000 EMS time series (length: 444)
▶ important intervals
▶ target: y_i = log(1 + |⟨x_i, β⟩|) + ε with
β(t) = 4 × 1_{t∈[320,410]} + 2 × 1_{t∈[500,550]} − 1_{t∈[680,730]}
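This target can be simulated as follows (note: the slide states a series length of 444 while β's support extends to t = 730, so the grid length below is an assumption made purely for illustration, as are the sample size and noise level):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 744                                # assumed grid length (covers [680, 730])
t = np.arange(T)
# step-function coefficient beta(t) from the slide
beta = (4.0 * ((t >= 320) & (t <= 410))
        + 2.0 * ((t >= 500) & (t <= 550))
        - 1.0 * ((t >= 680) & (t <= 730)))

X = rng.normal(size=(100, T))          # stand-in predictor curves
eps = rng.normal(scale=0.1, size=100)
y = np.log(1.0 + np.abs(X @ beta)) + eps   # y_i = log(1 + |<x_i, beta>|) + eps_i
```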
Evaluated scenarios
Scenario 1: pre-computed groups, summary, RF and importance based evaluation
Courtesy of Louisa Villa.
Evaluated scenarios
Scenario 2: pre-computed groups, summary, selection and RF
Courtesy of Louisa Villa.
Evaluated scenarios
Scenario 3: groups computed in interaction with importance or variable selection (not
explained), summary, selection (or not) and RF
Courtesy of Louisa Villa.
Evaluation criteria
Resemblance of important/selected intervals with ground truth
Accuracy
A few take home messages (to be confirmed)
▶ pre-computed groups based on correlation (especially adjclust) perform better
▶ PLS is the best summary strategy
▶ combining a selection strategy with RF is computationally expensive and inefficient
▶ overall, the recovery of groups is a bit disappointing
adjclust + PLS + Boruta (scenario 2)
⇒ Model selection (or model aggregation?) seems critical...
To be continued...
References
Altmann, A., Tolosi, L., Sander, O., and Lengauer, T. (2010).
Permutation importance: a corrected feature importance measure.
Bioinformatics, 26(10):1340–1347.
Barber, R. F. and Candès, E. (2015).
Controlling the false discovery rate via knockoffs.
Annals of Statistics, 43(5):2055–2085.
Baydogan, M. G. and Runger, G. (2015).
Learning a symbolic representation for multivariate time series classification.
Data Mining and Knowledge Discovery, 29:400–422.
Bertsimas, D. and Dunn, J. (2017).
Optimal classification trees.
Machine Learning, 106(7):1039–1082.
Breiman, L. (2001).
Random forests.
Machine Learning, 45(1):5–32.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984).
Classification and Regression Trees.
Chapman and Hall, Boca Raton, Florida, USA.
Capitaine, L., Bigot, J., Thiébaut, R., and Genuer, R. (2020).
Fréchet random forests for metric space valued regression with non Euclidean predictors.
Preprint arXiv:1906.01741v2.
Chavent, M., Genuer, R., and Saracco, J. (2021).
Combining clustering of variables and feature selection using random forests.
Communications in Statistics - Simulation and Computation, 50(2):426–445.
Chavent, M., Liquet, B., Kuentz-Simonet, V., and Saracco, J. (2012).
ClustOfVar: an R package for the clustering of variables.
Journal of Statistical Software, 50(13):1–16.
Degenhardt, F., Seifert, S., and Szymczak, S. (2019).
Evaluation of variable selection methods for random forests and omics data sets.
Briefings in Bioinformatics, 20(2):492–503.
Deng, H., Runger, G., Tuv, E., and Martyanov, V. (2013).
A time series forest for classification and feature extraction.
Information Science, 239:142–153.
Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010).
Variable selection using random forests.
Pattern Recognition Letters, 31(14):2225–2236.
Gregorutti, B., Michel, B., and Saint-Pierre, P. (2015).
Grouped variable importance with random forests and application to multiple functional data
analysis.
Computational Statistics and Data Analysis, 90:15–35.
Grollemund, P.-M., Abraham, C., Baragatti, M., and Pudlo, P. (2019).
Bayesian functional linear regression with sparse step functions.
Bayesian Analysis, 14(1):111–135.
Hornung, R. and Boulesteix, A.-L. (2021).
Interaction forests: identifying and exploiting interpretable quantitative and qualitative interaction
effects.
Technical Report Number 237, Department of Statistics, University of Munich, Germany.
Kursa, M. and Rudnicki, W. (2010).
Feature selection with the Boruta package.
Journal of Statistical Software, 36(11):1–13.
Lines, J., Taylor, S., and Bagnall, A. (2018).
Time series classification with HIVE-COTE: the hierarchical vote collective of
transformation-based ensembles.
ACM Transactions on Knowledge Discovery from Data, 12(5):1–35.
Lucas, B., Shifaz, A., Pelletier, C., O’Neill, L., Zaidi, N., Goethals, B., Petitjean, F., and Webb,
G. I. (2019).
Proximity forest: an effective and scalable distance based classifier for time series.
Data Mining and Knowledge Discovery, 33:607–635.
Middlehurst, M., Large, J., and Bagnall, A. (2020).
The canonical interval forest (CIF) classifier for time series classification.
In Wu, X., Jermaine, C., Hu, X., Kotevskia, O., Lu, S., Xu, W., Aluru, S., Zhai, C., Al-Masri, E.,
Chen, Z., and Saltz, J., editors, Proceedings of IEEE International Conference on Big Data,
Atlanta, GA, USA. IEEE.
Picheny, V., Servien, R., and Villa-Vialaneix, N. (2019).
Interpretable sparse sliced inverse regression for functional data.
Statistics and Computing, 29(2):255–267.
Poterie, A., Dupuy, J.-F., Monbet, V., and Rouvière, L. (2019).
Classification tree algorithm for grouped variables.
Computational Statistics, 34:1613–1648.
Rainforth, T. and Wood, F. (2017).
Canonical correlation forests.
arXiv: 1507.05444.
Robnik-Šikonja, M. and Kononenko, I. (2003).
Theoretical and empirical analysis of ReliefF and RReliefF.
Machine Learning, 53(1-2):23–69.
Schäfer, P. (2015).
The BOSS is concerned with time series classification in the presence of noise.
Data Mining and Knowledge Discovery, 29(6):1505–1530.
Shifaz, A., Pelletier, C., Petitjean, F., and Webb, G. I. (2020).
TS-CHIEF: a scalable and accurate forest algorithm for time series classification.
Data Mining and Knowledge Discovery, 34:742–775.
Speiser, J. L., Miller, M. E., Tooze, J., and Ip, E. (2019).
A comparison of random forest variable selection methods for classification prediction modeling.
Expert Systems with Applications, 134:93–101.
Szymczak, S., Holzinger, E., Dasgupta, A., Malley, J., Molloy, A., Mills, J., Brody, L.,
Stambolian, D., and Bailey-Wilson, J. (2016).
r2VIM: a new variable selection method for random forests in genome-wide association studies.
BioData Mining, 9:7.
Dictionary/symbolic representation based
BOSS [Schäfer, 2015] and [Baydogan and Runger, 2015]
Based on: Fourier transform, then symbolic representation.
[Baydogan and Runger, 2015] is similar, except that the representation loses interval
information (based on a tree at the time-step level)
Dictionary/symbolic representation based
BOSS [Schäfer, 2015] and [Baydogan and Runger, 2015]
What is useful for our question? Uncertain... can the symbolic representation itself be
used to represent/select (windowed) intervals? (untested)
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 

Recently uploaded (20)

Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
IDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicineIDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicine
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 

Explainable models for time series with random forest

  • 1. Explainable models for time series with random forest Nathalie Vialaneix(1) & Rémi Servien (1) nathalie.vialaneix@inrae.fr http://www.nathalievialaneix.eu First PhenoDyn meeting November 29-30, 2021
  • 2. Scientific question Purpose: prediction of a target quantity (e.g., yield) from functional data (e.g., weather time series) Statistics & ML for high throughput data integration Nov 29-30 2021 / PhenoDyn / Nathalie Vialaneix & Rémi Servien p. 2
  • 3. Scientific question & difficulty at stake Purpose: Improve interpretability by selecting the most predictive intervals.
  • 4. Scientific question & difficulty at stake Purpose: Improve interpretability by selecting the most predictive intervals. Challenge: Selection of intervals is not too hard (e.g., group Lasso) but creating the relevant intervals (starting point, length) is hard. Existing solutions: [Picheny et al., 2019, Grollemund et al., 2019]
  • 5. Scientific question & framework of the presentation Here: random forest Why? - versatile method for prediction - easy to use and relatively fast - good prediction ability in general - natural framework for interpretability (importance through OOB samples)
  • 6. What is needed to achieve that goal? Three/four key ingredients 1. random forest for time series 2. (maybe optional) ... based on summary descriptors of intervals 3. building intervals 4. selecting intervals
  • 7. A short reminder on random forest [Breiman, 2001] [Diagram: bootstrap samples L_n^{Θ_1}, ..., L_n^{Θ_q} are drawn from the learning set L_n; a randomized tree ĥ(·, Θ_ℓ, Θ'_ℓ) is grown on each bootstrap sample; the trees are aggregated into the forest predictor ĥ_RF-RI(·).] Courtesy of Robin Genuer and Jean-Michel Poggi.
  • 8. CART [Breiman et al., 1984] [Diagram: a CART tree with splits X1 ≤ d3, X2 ≤ d2 and X1 ≤ d1 leading to leaves C3, C4, C8, C9, shown next to the corresponding rectangular partition of the (X1, X2) plane.]
  • 9. Technical details - splits are made by randomly choosing mtry < d candidate variables and by finding the “best split” among this selection - aggregation: average (target is numeric) or majority vote rule (target is a class)
  • 10. Technical details - splits are made by randomly choosing mtry < d candidate variables and by finding the “best split” among this selection - aggregation: average (target is numeric) or majority vote rule (target is a class) - OOB error: average error (over trees) on samples not included in the bootstrap sample of the tree - variable importance: the larger the increase in error after random permutation of a variable, the more important the variable is
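The OOB-error and permutation-importance mechanics above can be sketched with scikit-learn (a minimal illustration on synthetic data; the data-generating model and all parameter values are invented for the example):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))          # d = 10 predictors, 300 samples
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=300)

# max_features plays the role of mtry; oob_score=True activates the OOB estimate
rf = RandomForestRegressor(
    n_estimators=500, max_features=3, oob_score=True, random_state=0
).fit(X, y)
print(rf.oob_score_)  # OOB R^2, computed only on out-of-bag samples

# permutation importance: increase in error when a variable is randomly shuffled
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
ranking = imp.importances_mean.argsort()[::-1]
print(ranking[:2])    # the two truly informative variables come first
```

Here the two variables actually driving y (indices 0 and 3) are ranked on top, which is exactly the interpretability mechanism the slide describes.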
  • 11. Extensions of random forest for time series - Similarity based techniques: Fréchet forest [Capitaine et al., 2020]; Proximity forest [Lucas et al., 2019] (restricted to classification) Image by courtesy of Charlotte Pelletier
  • 12. Extensions of random forest for time series - Similarity based techniques: Fréchet forest [Capitaine et al., 2020]; Proximity forest [Lucas et al., 2019] (restricted to classification) - Interval based techniques: Time Series Forest [Deng et al., 2013] and its extension [Middlehurst et al., 2020]; RISE [Lines et al., 2018] (a tree = a randomly selected interval)
  • 13. Extensions of random forest for time series - Similarity based techniques: Fréchet forest [Capitaine et al., 2020]; Proximity forest [Lucas et al., 2019] (restricted to classification) - Interval based techniques: Time Series Forest [Deng et al., 2013] and its extension [Middlehurst et al., 2020]; RISE [Lines et al., 2018] (a tree = a randomly selected interval) - Dictionary or symbolic representation based techniques: TS-CHIEF [Shifaz et al., 2020] (combines all types of splits, including dictionary based splits built on the work of [Schäfer, 2015]); (multivariate time series) symbolic representation of time series [Baydogan and Runger, 2015] More on that
  • 14. Time Series Forest Basic principles: 1. for a given tree: random sampling of intervals 2. for a given tree: compute summaries (mean, sd, slope for [Deng et al., 2013] and catch22 for [Middlehurst et al., 2020]) 3. define splits as usual based on these summaries
  • 15. Time Series Forest Basic principles: 1. for a given tree: random sampling of intervals 2. for a given tree: compute summaries (mean, sd, slope for [Deng et al., 2013] and catch22 for [Middlehurst et al., 2020]) 3. define splits as usual based on these summaries What is useful for our question? - combined with variable selection, could help identify important intervals (still to be tested) - ideas to summarize the information of an entire interval (already partially tested)
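The interval-summary step of Time Series Forest can be sketched as follows (a simplified illustration, not the authors' implementation: random intervals are summarized by mean, sd and slope, and an off-the-shelf forest is fitted on the summaries; the synthetic data are invented for the example):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def interval_summaries(X, intervals):
    """Summarize each series over each interval by its mean, sd and slope."""
    feats = []
    for start, end in intervals:
        seg = X[:, start:end]
        t = np.arange(end - start)
        slope = np.polyfit(t, seg.T, deg=1)[0]  # least-squares slope per series
        feats += [seg.mean(axis=1), seg.std(axis=1), slope]
    return np.column_stack(feats)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 100))                  # 200 series of length 100
y = X[:, 40:60].mean(axis=1) + rng.normal(scale=0.1, size=200)

# random intervals (random start and length), as in Time Series Forest
starts = rng.integers(0, 90, size=15)
lengths = rng.integers(5, 10, size=15)
intervals = [(int(s), int(s + l)) for s, l in zip(starts, lengths)]

Z = interval_summaries(X, intervals)             # 15 intervals x 3 summaries
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(Z, y)
```

In the real method the intervals are re-drawn per tree; here a single draw is shared by the whole forest to keep the sketch short.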
  • 16. Extensions on summaries - supervised: accounting for Y (is that useful?): PLS, linear models (including ridge)... similar to [Poterie et al., 2019] (grouped variables summarized by LDA) and to [Rainforth and Wood, 2017] (CCA based splits)
  • 17. Extensions on summaries - supervised: accounting for Y (is that useful?): PLS, linear models (including ridge)... similar to [Poterie et al., 2019] (grouped variables summarized by LDA) and to [Rainforth and Wood, 2017] (CCA based splits) - unsupervised: first PC of PCA, as in the ClustOfVar method [Chavent et al., 2012] and in [Chavent et al., 2021] (can also be useful to build groups)
  • 18. Extensions on summaries - supervised: accounting for Y (is that useful?): PLS, linear models (including ridge)... similar to [Poterie et al., 2019] (grouped variables summarized by LDA) and to [Rainforth and Wood, 2017] (CCA based splits) - unsupervised: first PC of PCA, as in the ClustOfVar method [Chavent et al., 2012] and in [Chavent et al., 2021] (can also be useful to build groups) - Could a step further be taken by using oblique splits [Bertsimas and Dunn, 2017] (let the forest decide how to combine variables to find the best split)? See also: [Hornung and Boulesteix, 2021]
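The unsupervised summary (first principal component per group of time points) can be sketched like this (illustrative only; the interval boundaries below are arbitrary, and this is not the ClustOfVar algorithm itself):

```python
import numpy as np
from sklearn.decomposition import PCA

def pc1_summaries(X, intervals):
    """Replace each interval of time points by its first principal component."""
    cols = []
    for start, end in intervals:
        pc1 = PCA(n_components=1).fit_transform(X[:, start:end])
        cols.append(pc1.ravel())
    return np.column_stack(cols)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 60))                       # 100 series of length 60
Z = pc1_summaries(X, [(0, 20), (20, 40), (40, 60)])  # 3 intervals -> 3 features
```

Each column of Z then plays the role of one candidate split variable in the forest, one per interval.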
  • 19. Strategies for building intervals - Precomputing intervals independently of Y (based on correlation between time points): constrained extension of ClustOfVar (PCA-like criterion), adjclust (adjacency-constrained clustering based on correlation between variables) ⇒ hierarchy of intervals
  • 20. Strategies for building intervals - Precomputing intervals independently of Y (based on correlation between time points): constrained extension of ClustOfVar (PCA-like criterion), adjclust (adjacency-constrained clustering based on correlation between variables) ⇒ hierarchy of intervals - Precomputing intervals independently of Y (based on greedy agglomeration), alternating between a regression based step (LM) between any two consecutive variables to select the best merge (minimum loss or maximum gain in accuracy) and a summary step (depends on the regression type) ⇒ hierarchy of intervals
  • 21. Strategies for building intervals - Precomputing intervals independently of Y (based on correlation between time points): constrained extension of ClustOfVar (PCA-like criterion), adjclust (adjacency-constrained clustering based on correlation between variables) ⇒ hierarchy of intervals - Precomputing intervals independently of Y (based on greedy agglomeration), alternating between a regression based step (LM) between any two consecutive variables to select the best merge (minimum loss or maximum gain in accuracy) and a summary step (depends on the regression type) ⇒ hierarchy of intervals - Using random forest to compute a hierarchy of intervals based on the loss in grouped importance, in a greedy manner [Gregorutti et al., 2015]
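A minimal sketch of the adjacency-constrained agglomeration idea: repeatedly merge the pair of consecutive groups whose time points are most correlated, yielding a hierarchy of interval partitions. This mimics the spirit of adjclust, not its actual algorithm, and the data are invented for the example:

```python
import numpy as np

def adjacent_merge_order(X):
    """Greedily merge adjacent groups of time points by mean |correlation|."""
    C = np.abs(np.corrcoef(X.T))
    groups = [[j] for j in range(X.shape[1])]
    merges = []
    while len(groups) > 1:
        # score each pair of consecutive groups by mean |correlation| across them
        scores = [C[np.ix_(groups[i], groups[i + 1])].mean()
                  for i in range(len(groups) - 1)]
        i = int(np.argmax(scores))
        groups[i] = groups[i] + groups.pop(i + 1)   # merge the best adjacent pair
        merges.append([g[:] for g in groups])       # record the current partition
    return merges

rng = np.random.default_rng(3)
base = rng.normal(size=(50, 1))
# three strongly correlated consecutive variables followed by three noise variables
X = np.hstack([base + 0.1 * rng.normal(size=(50, 3)),
               rng.normal(size=(50, 3))])
hierarchy = adjacent_merge_order(X)
```

Because only adjacent groups may merge, every level of the hierarchy is a partition into contiguous intervals, which is exactly what the interval-building step needs.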
  • 22. Strategies for selecting variables (here, intervals) in RF Huge literature... a few reviews: [Degenhardt et al., 2019, Speiser et al., 2019]
  • 23. Strategies for selecting variables (here, intervals) in RF Huge literature... a few reviews: [Degenhardt et al., 2019, Speiser et al., 2019] - based on importance: just use the importance...; ranking with importance, then selection of the best model with the first k variables (k = 1, . . . , K): VSURF [Genuer et al., 2010]; [Altmann et al., 2010] or [Szymczak et al., 2016], based on a data-driven importance threshold (untested)
  • 24. Strategies for selecting variables (here, intervals) in RF Huge literature... a few reviews: [Degenhardt et al., 2019, Speiser et al., 2019] - based on importance: just use the importance...; ranking with importance, then selection of the best model with the first k variables (k = 1, . . . , K): VSURF [Genuer et al., 2010]; [Altmann et al., 2010] or [Szymczak et al., 2016], based on a data-driven importance threshold (untested) - based on external variable selection methods: knockoffs [Barber and Candès, 2015], as in Boruta [Kursa and Rudnicki, 2010]; Relief [Robnik-Šikonja and Kononenko, 2003]
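The importance-ranking strategy (rank variables by importance, then keep the nested subset of the first k variables with the best OOB score) can be sketched as below. This follows the spirit of VSURF rather than its exact procedure, and the data-generating model is invented for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_by_importance(X, y, max_k=None):
    """Rank variables by RF importance, then keep the top-k subset
    (k = 1, ..., K) whose forest reaches the best OOB score."""
    rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]
    best_k, best_oob = 1, -np.inf
    for k in range(1, (max_k or X.shape[1]) + 1):
        sub = RandomForestRegressor(
            n_estimators=300, oob_score=True, random_state=0
        ).fit(X[:, order[:k]], y)
        if sub.oob_score_ > best_oob:
            best_k, best_oob = k, sub.oob_score_
    return order[:best_k]

rng = np.random.default_rng(4)
X = rng.normal(size=(250, 8))
y = 2 * X[:, 1] + X[:, 5] + rng.normal(scale=0.3, size=250)
selected = select_by_importance(X, y, max_k=4)  # should recover variables 1 and 5
```

When the "variables" are interval summaries, the same loop selects intervals, which is the use case of the slide.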
  • 25. Simulation setting - predictors: 1,000 EMS time series (length: 444) - important intervals - target: y_i = log(1 + |⟨x_i, β⟩|) + ε_i with β(t) = 4 × 1_{t∈[320,410]} + 2 × 1_{t∈[500,550]} − 1_{t∈[680,730]}
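The target of this simulation design can be generated as follows (a sketch with placeholder predictors, since the EMS series themselves are not described here; the sampling grid mapping t to the 444 points and the noise level are assumptions made for the example):

```python
import numpy as np

def beta(t):
    """Step-function coefficient of the simulation setting."""
    return (4.0 * ((t >= 320) & (t <= 410))
            + 2.0 * ((t >= 500) & (t <= 550))
            - 1.0 * ((t >= 680) & (t <= 730)))

rng = np.random.default_rng(5)
t = np.linspace(0, 800, 444)            # assumed grid for the 444 sampling points
X = rng.normal(size=(1000, t.size))     # placeholder for the 1,000 EMS series
eps = rng.normal(scale=0.1, size=1000)  # noise level is an assumption

# <x_i, beta> approximated by a Riemann sum over the sampling grid
y = np.log1p(np.abs(X @ (beta(t) * (t[1] - t[0])))) + eps
```

Only the three intervals where β is non-zero influence y, so a good interval-selection strategy should recover exactly [320, 410], [500, 550] and [680, 730].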
  • 26. Evaluated scenarios Scenario 1: pre-computed groups, summary, RF and importance based evaluation Courtesy of Louisa Villa.
  • 27. Evaluated scenarios Scenario 2: pre-computed groups, summary, selection and RF Courtesy of Louisa Villa.
  • 28. Evaluated scenarios Scenario 3: groups computed in interaction with importance or variable selection (not explained), summary, selection (or not) and RF Courtesy of Louisa Villa.
  • 29. Evaluation criteria Resemblance of important/selected intervals with ground truth Accuracy
  • 30. A few take home messages (to be confirmed) - pre-computed groups based on correlation (especially adjclust) are better - PLS is the best summary strategy - combining a selection strategy with RF is computationally expensive and inefficient - overall, the recovery of groups is a bit disappointing
  • 31. adjclust + PLS + Boruta (scenario 2) ⇒ Model selection (or model aggregation?) seems critical...
  • 32. To be continued...
  • 33. References Altmann, A., Tolosi, L., Sander, O., and Lengauer, T. (2010). Permutation importance: a corrected feature importance measure. Bioinformatics, 26(10):1340–1347. Barber, R. F. and Candès, E. (2015). Controlling the false discovery rate via knockoffs. Annals of Statistics, 43(5):2055–2085. Baydogan, M. G. and Runger, G. (2015). Learning a symbolic representation for multivariate time series classification. Data Mining and Knowledge Discovery, 29:400–422. Bertsimas, D. and Dunn, J. (2017). Optimal classification trees. Machine Learning, 106(7):1039–1082. Breiman, L. (2001). Random forests.
  • 34. Machine Learning, 45(1):5–32. Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Chapman and Hall, Boca Raton, Florida, USA. Capitaine, L., Bigot, J., Thiébaut, R., and Genuer, R. (2020). Fréchet random forests for metric space valued regression with non Euclidean predictors. Preprint arXiv:1906.01741v2. Chavent, M., Genuer, R., and Saracco, J. (2021). Combining clustering of variables and feature selection using random forests. Communications in Statistics - Simulation and Computation, 50(2):426–445. Chavent, M., Liquet, B., Kuentz-Simonet, V., and Saracco, J. (2012). ClustOfVar: an R package for the clustering of variables. Journal of Statistical Software, 50(13):1–16. Degenhardt, F., Seifert, S., and Szymczak, S. (2019). Evaluation of variable selection methods for random forests and omics data sets.
  • 35. Briefings in Bioinformatics, 20(2):492–503. Deng, H., Runger, G., Tuv, E., and Martyanov, V. (2013). A time series forest for classification and feature extraction. Information Science, 239:142–153. Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236. Gregorutti, B., Michel, B., and Saint-Pierre, P. (2015). Grouped variable importance with random forests and application to multiple functional data analysis. Computational Statistics and Data Analysis, 90:15–35. Grollemund, P.-M., Abraham, C., Baragatti, M., and Pudlo, P. (2019). Bayesian functional linear regression with sparse step functions. Bayesian Analysis, 14(1):111–135. Hornung, R. and Boulesteix, A.-L. (2021).
  • 36. Interaction forests: identifying and exploiting interpretable quantitative and qualitative interaction effects. Technical Report Number 237, Department of Statistics, University of Munich, Germany. Kursa, M. and Rudnicki, W. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11):1–13. Lines, J., Taylor, S., and Bagnall, A. (2018). Time series classification with HIVE-COTE: the hierarchical vote collective of transformation-based ensembles. ACM Transactions on Knowledge Discovery from Data, 12(5):1–35. Lucas, B., Shifaz, A., Pelletier, C., O’Neill, L., Zaidi, N., Goethals, B., Petitjean, F., and Webb, G. I. (2019). Proximity forest: an effective and scalable distance based classifier for time series. Data Mining and Knowledge Discovery, 33:607–635. Middlehurst, M., Large, J., and Bagnall, A. (2020).
  • 37. The canonical interval forest (CIF) classifier for time series classification. In Wu, X., Jermaine, C., Hu, X., Kotevskia, O., Lu, S., Xu, W., Aluru, S., Zhai, C., Al-Masri, E., Chen, Z., and Saltz, J., editors, Proceedings of IEEE International Conference on Big Data, Atlanta, GA, USA. IEEE. Picheny, V., Servien, R., and Villa-Vialaneix, N. (2019). Interpretable sparse sliced inverse regression for functional data. Statistics and Computing, 29(2):255–267. Poterie, A., Dupuy, J.-F., Monbet, V., and Rouvière, L. (2019). Classification tree algorithm for grouped variables. Computational Statistics, 34:1613–1648. Rainforth, T. and Wood, F. (2017). Canonical correlation forests. arXiv: 1507.05444. Robnik-Šikonja, M. and Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53(1-2):23–69.
  • 38. Schäfer, P. (2015). The BOSS is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery, 29(6):1505–1530. Shifaz, A., Pelletier, C., Petitjean, F., and Webb, G. I. (2020). TS-CHIEF: a scalable and accurate forest algorithm for time series classification. Data Mining and Knowledge Discovery, 34:742–775. Speiser, J. L., Miller, M. E., Tooze, J., and Ip, E. (2019). A comparison of random forest variable selection methods for classification prediction modeling. Expert Systems with Applications, 134:93–101. Szymczak, S., Holzinger, E., Dasgupta, A., Malley, J., Molloy, A., Mills, J., Brody, L., Stambolian, D., and Bailey-Wilson, J. (2016). r2VIM: a new variable selection method for random forests in genome-wide association studies. BioData Mining, 9:7.
  • 39. Dictionary/symbolic representation based BOSS [Schäfer, 2015] and [Baydogan and Runger, 2015] Based on: Fourier transform, then symbolic representation. [Baydogan and Runger, 2015] is similar, except that the representation loses interval information (based on a tree at the time-step level)
  • 40. Dictionary/symbolic representation based BOSS [Schäfer, 2015] and [Baydogan and Runger, 2015] What is useful for our question? Uncertain... can the symbolic representation itself be used to represent/select (windowed) intervals? (untested) Back