Explainable models for time series with random forests
1. Explainable models for time series with random forests
Nathalie Vialaneix(1) & Rémi Servien
(1) nathalie.vialaneix@inrae.fr
http://www.nathalievialaneix.eu
First PhenoDyn meeting
November 29-30, 2021
2. Scientific question
Purpose: prediction of a target quantity (e.g., yield) from functional data (e.g.,
weather time series)
Statistics & ML for high throughput data integration
Nov 29-30 2021 / PhenoDyn / Nathalie Vialaneix & Rémi Servien
p. 2
4. Scientific question & difficulty at stake
Purpose: Improve interpretability by selecting the most predictive intervals.
Challenge: Selection of intervals is not too hard (e.g., group Lasso) but creating the
relevant intervals (starting point, length) is hard.
Existing solutions: [Picheny et al., 2019, Grollemund et al., 2019]
5. Scientific question & framework of the presentation
Here: random forest
Why?
- versatile method for prediction
- easy to use and relatively fast
- good prediction ability in general
- natural framework for interpretability (importance through OOB samples)
6. What is needed to achieve that goal?
Three/four key ingredients
1. random forest for time series
2. (maybe optional) ... based on summary descriptors of intervals
3. building intervals
4. selecting intervals
7. A short reminder on random forest [Breiman, 2001]
[Diagram: from the learning set Ln, bootstrap samples Ln^Θ1, …, Ln^Θq are drawn; a randomized-input tree ĥ(·, Θℓ, Θ′ℓ) is grown on each; the trees are aggregated into the forest predictor ĥRF−RI(·). Stages: bootstrap, RI tree, aggregation.]
Courtesy of Robin Genuer and Jean-Michel Poggi.
8. CART [Breiman et al., 1984]
[Diagram: a CART tree with internal splits X1 ≤ d3, X2 ≤ d2 and X1 ≤ d1, and the induced partition of the (X1, X2) plane into rectangular leaf cells C3, C4, C8 and C9.]
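As a hedged illustration (not the authors' code), the axis-aligned splits of such a tree can be reproduced with scikit-learn's CART implementation; the data, thresholds, and feature names below are invented for the sketch:

```python
# Toy CART example: nested axis-aligned splits on X1 and X2, as in the slide.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
# piecewise-constant target, constant on three rectangles of the (X1, X2) plane
y = np.where(X[:, 0] <= 0.5, np.where(X[:, 1] <= 0.3, 0.0, 1.0), 2.0)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["X1", "X2"]))  # the learned splits
```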
10. Technical details
- splits: randomly choose mtry < d variables and find the “best split” among this selection
- aggregation: average (numeric target) or majority vote (class target)
- OOB error: average error (over trees) on the samples not included in the tree's bootstrap sample
- variable importance: based on random permutation of a variable's values; the larger the resulting increase in error, the more important the variable
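A minimal sketch of these two ingredients (OOB error and permutation importance) with scikit-learn, on synthetic data; the model settings and the data-generating process are assumptions made for illustration only:

```python
# OOB error and permutation importance with a random forest (toy data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=200)

# oob_score=True evaluates each tree on the samples left out of its bootstrap
rf = RandomForestRegressor(n_estimators=300, max_features="sqrt",  # mtry < d
                           oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)  # OOB R^2

# importance: increase in error after randomly permuting one variable at a time
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(imp.importances_mean.argsort()[::-1][:2])  # the two informative variables
```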
Image by courtesy of Charlotte Pelletier
13. Extensions of random forest for time series
- Similarity based techniques
  - Fréchet forest [Capitaine et al., 2020]
  - Proximity forest [Lucas et al., 2019] (restricted to classification)
- Interval based techniques
  - Time Series Forest [Deng et al., 2013] and its extension [Middlehurst et al., 2020]
  - RISE [Lines et al., 2018] (a tree = a randomly selected interval)
- Dictionary or symbolic-representation based techniques
  - TS-CHIEF [Shifaz et al., 2020] (combines all types of splits, including dictionary-based splits building on [Schäfer, 2015])
  - (multivariate time series) symbolic representation of time series [Baydogan and Runger, 2015]
15. Time Series Forest
Basic principles:
1. for a given tree: random sampling of intervals
2. for a given tree: compute summaries (mean, sd, slope for [Deng et al., 2013] and
catch22 for [Middlehurst et al., 2020])
3. define splits as usual based on these summaries
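The three steps above can be sketched as follows (an illustrative toy version, not the TSF implementation; interval sampling is shared across trees here for simplicity, and the data are synthetic):

```python
# Time Series Forest idea: summarize random intervals by (mean, sd, slope)
# and train an ordinary random forest on these summary features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, T = 100, 60
X = rng.normal(size=(n, T)).cumsum(axis=1)   # toy time series
y = X[:, 20:30].mean(axis=1)                 # target depends on one interval

def interval_summaries(X, intervals):
    """Stack (mean, sd, slope) of each series over each interval."""
    feats = []
    for (a, b) in intervals:
        seg, t = X[:, a:b], np.arange(b - a)
        slope = np.polyfit(t, seg.T, deg=1)[0]   # per-series linear trend
        feats += [seg.mean(axis=1), seg.std(axis=1), slope]
    return np.column_stack(feats)

# 1. random sampling of intervals (start and length)
starts = rng.integers(0, T - 5, size=10)
lengths = rng.integers(5, 20, size=10)
intervals = [(int(s), min(int(s + l), T)) for s, l in zip(starts, lengths)]
# 2.-3. compute summaries, then the usual RF splits on the summary features
Z = interval_summaries(X, intervals)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(Z, y)
```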
What is useful for our question?
- combined with variable selection, could help identify important intervals (still to be tested)
- ideas to summarize the information of an entire interval (already partially tested)
18. Extensions on summaries
- supervised: accounting for Y (is that useful?): PLS, linear models (including ridge)... similar to [Poterie et al., 2019] (grouped variables summarized by LDA) and to [Rainforth and Wood, 2017] (CCA based splits)
- unsupervised: first PC of a PCA, as in the ClustOfVar method [Chavent et al., 2012] and in [Chavent et al., 2021] (can also be useful to build groups)
- could a step further be taken with oblique splits [Bertsimas and Dunn, 2017] (let the forest decide how to combine variables to find the best split)?

See also: [Hornung and Boulesteix, 2021]
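As a toy illustration of the unsupervised option (not the authors' code; data and interval boundaries are invented), an interval can be summarized by its first principal component and that single score can stand in for the interval's columns:

```python
# Summarize an interval of time points by the first PC of a PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50)).cumsum(axis=1)   # toy time series
segment = X[:, 10:25]                           # one candidate interval

# one PCA score per series: a single summary predictor for the whole interval
pc1 = PCA(n_components=1).fit_transform(segment).ravel()
```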
21. Strategies for building intervals
- Precomputing intervals independently of Y, based on correlation between time points: constrained extension of ClustOfVar (PCA-like criterion), adjclust (constrained clustering based on correlation between variables) ⇒ hierarchy of intervals
- Precomputing intervals independently of Y, based on greedy agglomeration: alternating between
  - a regression step (LM) between any two consecutive variables to select the best merge (minimum loss or maximum gain in accuracy)
  - a summary step (depends on the regression type)
  ⇒ hierarchy of intervals
- Using random forests to compute a hierarchy of intervals in a greedy manner, based on the loss in grouped importance [Gregorutti et al., 2015]
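The first strategy can be sketched with an adjacency-constrained agglomerative clustering of the time points — a rough stand-in for adjclust, not its actual algorithm; the data, number of clusters, and use of Ward linkage are assumptions:

```python
# Hierarchy of contiguous intervals: agglomerative clustering of time points
# in which only adjacent time points are allowed to merge.
import numpy as np
from scipy.sparse import diags
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40)).cumsum(axis=1)   # toy series, 40 time points
T = X.shape[1]

# connectivity: time point t may only merge with t-1 and t+1 (a chain graph)
connectivity = diags([1, 1, 1], [-1, 0, 1], shape=(T, T))

# cluster the *columns* (time points), hence the transpose
clust = AgglomerativeClustering(n_clusters=5, connectivity=connectivity,
                                linkage="ward").fit(X.T)
labels = clust.labels_
# with the adjacency constraint, every cluster is a contiguous interval
```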
24. Strategies for selecting variables (here, intervals) in RF
Huge literature... a few reviews: [Degenhardt et al., 2019, Speiser et al., 2019]
- based on importance:
  - just use the importance...
  - ranking by importance, then selection of the best model among those built with the first k variables (k = 1, …, K): VSURF [Genuer et al., 2010]
  - [Altmann et al., 2010] or [Szymczak et al., 2016], based on a data-driven importance threshold (untested)
- based on external variable selection methods:
  - knockoffs [Barber and Candès, 2015], as in Boruta [Kursa and Rudnicki, 2010]
  - Relief [Robnik-Šikonja and Kononenko, 2003]
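The ranking-then-nested-models idea can be sketched as follows (an illustration in the spirit of VSURF, not the VSURF algorithm itself; data and forest settings are invented):

```python
# Rank variables by importance, then pick the top-k model with smallest OOB error.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=300, oob_score=True,
                           random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]   # importance ranking

oob_err = []
for k in range(1, X.shape[1] + 1):                  # nested models on top-k vars
    rf_k = RandomForestRegressor(n_estimators=300, oob_score=True,
                                 random_state=0).fit(X[:, order[:k]], y)
    oob_err.append(1 - rf_k.oob_score_)             # OOB "error" (1 - R^2)

best_k = int(np.argmin(oob_err)) + 1
selected = order[:best_k]                           # retained variables
```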
25. Simulation setting
- predictors: 1,000 EMS time series (length: 444)
- important intervals
- target: yᵢ = log(1 + |⟨xᵢ, β⟩|) + εᵢ, with
β(t) = 4 × 1_{t∈[320,410]} + 2 × 1_{t∈[500,550]} − 1_{t∈[680,730]}
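A sketch of this simulation (not the actual design: the time grid, the predictor distribution, and the noise level are assumptions made so the β intervals are covered):

```python
# Simulated target from the step-function coefficient β(t) of the slide.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(750)                                  # assumed grid covering β's support
beta = (4.0 * ((t >= 320) & (t <= 410))
        + 2.0 * ((t >= 500) & (t <= 550))
        - 1.0 * ((t >= 680) & (t <= 730)))

X = rng.normal(size=(100, t.size)).cumsum(axis=1)   # stand-in functional predictors
eps = rng.normal(scale=0.1, size=100)               # assumed noise level
y = np.log(1 + np.abs(X @ beta)) + eps              # y_i = log(1 + |<x_i, beta>|) + eps_i
```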
26. Evaluated scenarios
Scenario 1: pre-computed groups, summary, RF and importance based evaluation
Courtesy of Louisa Villa.
27. Evaluated scenarios
Scenario 2: pre-computed groups, summary, selection and RF
Courtesy of Louisa Villa.
28. Evaluated scenarios
Scenario 3: groups computed in interaction with importance or variable selection (not
explained), summary, selection (or not) and RF
Courtesy of Louisa Villa.
29. Evaluation criteria
- resemblance of important/selected intervals with the ground truth
- accuracy
30. A few take home messages (to be confirmed)
- pre-computed groups based on correlation (especially adjclust) perform better
- PLS is the best summary strategy
- combining a selection strategy with RF is computationally expensive and inefficient
- overall, the recovery of groups is a bit disappointing
31. adjclust + PLS + Boruta (scenario 2)
⇒ Model selection (or model aggregation?) seems critical...
32. To be continued...
33. References
Altmann, A., Tolosi, L., Sander, O., and Lengauer, T. (2010).
Permutation importance: a corrected feature importance measure.
Bioinformatics, 26(10):1340–1347.
Barber, R. F. and Candès, E. (2015).
Controlling the false discovery rate via knockoffs.
Annals of Statistics, 43(5):2055–2085.
Baydogan, M. G. and Runger, G. (2015).
Learning a symbolic representation for multivariate time series classification.
Data Mining and Knowledge Discovery, 29:400–422.
Bertsimas, D. and Dunn, J. (2017).
Optimal classification trees.
Machine Learning, 106(7):1039–1082.
Breiman, L. (2001).
Random forests.
Machine Learning, 45(1):5–32.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984).
Classification and Regression Trees.
Chapman and Hall, Boca Raton, Florida, USA.
Capitaine, L., Bigot, J., Thiébaut, R., and Genuer, R. (2020).
Fréchet random forests for metric space valued regression with non Euclidean predictors.
Preprint arXiv:1906.01741v2.
Chavent, M., Genuer, R., and Saracco, J. (2021).
Combining clustering of variables and feature selection using random forests.
Communications in Statistics - Simulation and Computation, 50(2):426–445.
Chavent, M., Liquet, B., Kuentz-Simonet, V., and Saracco, J. (2012).
ClustOfVar: an R package for the clustering of variables.
Journal of Statistical Software, 50(13):1–16.
Degenhardt, F., Seifert, S., and Szymczak, S. (2019).
Evaluation of variable selection methods for random forests and omics data sets.
Briefings in Bioinformatics, 20(2):492–503.
Deng, H., Runger, G., Tuv, E., and Martyanov, V. (2013).
A time series forest for classification and feature extraction.
Information Science, 239:142–153.
Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010).
Variable selection using random forests.
Pattern Recognition Letters, 31(14):2225–2236.
Gregorutti, B., Michel, B., and Saint-Pierre, P. (2015).
Grouped variable importance with random forests and application to multiple functional data
analysis.
Computational Statistics and Data Analysis, 90:15–35.
Grollemund, P.-M., Abraham, C., Baragatti, M., and Pudlo, P. (2019).
Bayesian functional linear regression with sparse step functions.
Bayesian Analysis, 14(1):111–135.
Hornung, R. and Boulesteix, A.-L. (2021).
Interaction forests: identifying and exploiting interpretable quantitative and qualitative interaction
effects.
Technical Report Number 237, Department of Statistics, University of Munich, Germany.
Kursa, M. and Rudnicki, W. (2010).
Feature selection with the Boruta package.
Journal of Statistical Software, 36(11):1–13.
Lines, J., Taylor, S., and Bagnall, A. (2018).
Time series classification with HIVE-COTE: the hierarchical vote collective of
transformation-based ensembles.
ACM Transactions on Knowledge Discovery from Data, 12(5):1–35.
Lucas, B., Shifaz, A., Pelletier, C., O’Neill, L., Zaidi, N., Goethals, B., Petitjean, F., and Webb,
G. I. (2019).
Proximity forest: an effective and scalable distance based classifier for time series.
Data Mining and Knowledge Discovery, 33:607–635.
Middlehurst, M., Large, J., and Bagnall, A. (2020).
The canonical interval forest (CIF) classifier for time series classification.
In Wu, X., Jermaine, C., Hu, X., Kotevskia, O., Lu, S., Xu, W., Aluru, S., Zhai, C., Al-Masri, E.,
Chen, Z., and Saltz, J., editors, Proceedings of IEEE International Conference on Big Data,
Atlanta, GA, USA. IEEE.
Picheny, V., Servien, R., and Villa-Vialaneix, N. (2019).
Interpretable sparse sliced inverse regression for functional data.
Statistics and Computing, 29(2):255–267.
Poterie, A., Dupuy, J.-F., Monbet, V., and Rouvière, L. (2019).
Classification tree algorithm for grouped variables.
Computational Statistics, 34:1613–1648.
Rainforth, T. and Wood, F. (2017).
Canonical correlation forests.
Preprint arXiv:1507.05444.
Robnik-Šikonja, M. and Kononenko, I. (2003).
Theoretical and empirical analysis of ReliefF and RReliefF.
Machine Learning, 53(1-2):23–69.
Schäfer, P. (2015).
The BOSS is concerned with time series classification in the presence of noise.
Data Mining and Knowledge Discovery, 29(6):1505–1530.
Shifaz, A., Pelletier, C., Petitjean, F., and Webb, G. I. (2020).
TS-CHIEF: a scalable and accurate forest algorithm for time series classification.
Data Mining and Knowledge Discovery, 34:742–775.
Speiser, J. L., Miller, M. E., Tooze, J., and Ip, E. (2019).
A comparison of random forest variable selection methods for classification prediction modeling.
Expert Systems with Applications, 134:93–101.
Szymczak, S., Holzinger, E., Dasgupta, A., Malley, J., Molloy, A., Mills, J., Brody, L.,
Stambolian, D., and Bailey-Wilson, J. (2016).
r2VIM: a new variable selection method for random forests in genome-wide association studies.
BioData Mining, 9:7.
39. Dictionary/symbolic representation based
BOSS [Schäfer, 2015] and [Baydogan and Runger, 2015]
Based on: a Fourier transform, then a symbolic representation.
[Baydogan and Runger, 2015] is similar, except that its representation loses the interval information (based on a tree at the time-step level)
40. Dictionary/symbolic representation based
BOSS [Schäfer, 2015] and [Baydogan and Runger, 2015]
What is useful for our question? Uncertain... can the symbolic representation itself be used to represent/select (windowed) intervals? (untested)