Kernel methods and variable selection for exploratory analysis and multi-omic...
Nathalie Vialaneix
4th course on Computational Systems Biology of Cancer: Multi-omics and Machine Learning Approaches
International course, Curie training
https://training.institut-curie.org/courses/sysbiocancer2021
(remote)
September 29th, 2021
Mini useR! in Melbourne https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network/events/251933078/
MelbURN (Melbourne useR group) https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network
July 16th, 2018
Melbourne, Australia
Medical pathology images are visually evaluated by experts for disease diagnosis, but the connection between image features and the state of the cells in an image is typically unknown. To understand this relationship, we describe a multimodal modeling and inference framework that estimates the shared latent structure of joint gene expression levels and medical image features. The method is built around probabilistic canonical correlation analysis (PCCA), which is jointly fit to image embeddings learned using convolutional neural networks and to linear embeddings of paired gene expression data. We finally discuss a set of theoretical and empirical challenges in domain adaptation settings arising from genomics data. (Based on work in collaboration with Gregory Gundersen and Barbara E. Engelhardt.)
Dimensionality reduction by matrix factorization using concept lattice in dat... (eSAT Journals)
Abstract: Concept lattices are an important technique that has become standard in data analytics and knowledge representation in many fields, such as statistics, artificial intelligence, pattern recognition, machine learning, information theory, social networks, information retrieval systems, and software engineering. Formal concepts are adopted as the primitive notion: a concept is jointly defined as a pair consisting of an intension and an extension. FCA can handle huge amounts of data, generating concepts and rules and supporting data visualization. Matrix factorization methods have recently received greater exposure, mainly as an unsupervised learning method for latent variable decomposition. In this paper, a novel method is proposed to decompose such concepts using Boolean Matrix Factorization for dimensionality reduction. The paper focuses on finding all the concepts and the object intersections. Keywords: data mining, formal concepts, lattice, matrix factorization, dimensionality reduction.
Since the advent of the horseshoe priors for regularization, global-local shrinkage methods have proved to be a fertile ground for the development of Bayesian theory and methodology in machine learning. They have achieved remarkable success in computation, and enjoy strong theoretical support. Much of the existing literature has focused on the linear Gaussian case. The purpose of the current talk is to demonstrate that the horseshoe priors are useful more broadly, by reviewing both methodological and computational developments in complex models that are more relevant to machine learning applications. Specifically, we focus on methodological challenges in horseshoe regularization in nonlinear and non-Gaussian models; multivariate models; and deep neural networks. We also outline the recent computational developments in horseshoe shrinkage for complex models along with a list of available software implementations that allows one to venture out beyond the comfort zone of the canonical linear regression problems.
PhD Dissertation Talk, 22 April 2011
----
The main topic of this thesis is the important problem of mining numerical data, especially gene expression data. These data characterize the behaviour of thousands of genes in various biological situations (time, cell, etc.).
A difficult task consists in clustering genes to obtain classes of genes with similar behaviour, which are supposed to be involved together in a biological process.
Accordingly, we are interested in designing and comparing methods in the field of knowledge discovery from biological data. We propose to study how the conceptual classification method called Formal Concept Analysis (FCA) can handle the problem of extracting interesting classes of genes. For this purpose, we have designed and experimented with several original methods based on an extension of FCA called pattern structures. Furthermore, we show that these methods can enhance decision making in agronomy and crop sanity in the vast formal domain of information fusion.
A Novel Approach to Mathematical Concepts in Data Mining (ijdmtaiir)
This paper describes three fundamental mathematical programming approaches that are relevant to data mining: feature selection, clustering, and robust representation. The paper covers two clustering algorithms, the k-means algorithm and the k-median algorithm. Clustering is illustrated by the unsupervised learning of patterns and clusters that may exist in a given database, and it is a useful tool for Knowledge Discovery in Databases (KDD). The results of the k-median algorithm are used to identify blood cancer patients in a medical database. K-means clustering is a data mining/machine learning algorithm used to cluster observations into groups of related observations without any prior knowledge of those relationships. The k-means algorithm is one of the simplest clustering techniques and is commonly used in medical imaging, biometrics, and related fields.
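As a minimal illustration of the k-means procedure the paper describes (a toy sketch on hypothetical one-dimensional data, not the paper's implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 1-D data: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # empty clusters keep their previous centroid
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
print(kmeans(data, 2))  # two centroids, one near each group of points
```

Replacing the mean update by a median gives the k-median variant mentioned in the abstract.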
Bayes Nets Meetup Sept 29th 2016 - Bayesian Network Modelling by Marco Scutari (Bayes Nets meetup London)
A talk given at the Bayes Nets meetup on Sept 29th 2016 by Dr Marco Scutari from the University of Oxford. The title of the talk was Bayesian Network Modelling, with examples and case studies in Genetics and Systems Biology.
Accounting for Variance in Machine Learning Benchmarks
Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent
Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization, and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in the light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator better approaches the ideal estimator, at a 51-fold reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
MSL 5080, Methods of Analysis for Business Operations (madlynplamondon)
MSL 5080, Methods of Analysis for Business Operations 1
Course Learning Outcomes for Unit III
Upon completion of this unit, students should be able to:
2. Distinguish between the approaches to determining probability.
3. Contrast the major differences between the normal distribution and the exponential and Poisson
distributions.
Reading Assignment
Chapter 2: Probability Concepts and Applications, pp. 32–48
Unit Lesson
Mathematical truths provide us several useful means to estimate what will happen based on factors that are given or researched. After becoming familiar with the idea of probability, one can see how mathematics makes applications in government and business possible.
Probability Distributions
To look at probability distributions, one should first define a random variable: an unknown quantity that can take any real number as its value, including decimals and fractions. Discrete random variables have a certain limited range of values, while continuous random variables may have an infinite range of possible values and could be any value at all (Render, Stair, Hanna, & Hale, 2015).
One true tendency is that the outcomes observed over a group of trials cluster around a middle point of values, where the probabilities of occurrence are highest. They then taper off to one or both sides, as there are lower probabilities that the outcomes will be very far below or very far above the middle.
This middle point is called the mean or expected value E(X):
E(X) = Σ_{i=1}^{n} X_i P(X_i)
where X_i is a value of the random variable, and the summation sign Σ from i = 1 to n means you are adding all n possible values (Render et al., 2015).
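A quick numerical check of this formula, using the study guide's paint-can example with made-up probabilities (the values below are illustrative, not from the text):

```python
# Discrete distribution: X = cans of paint sold in a day.
values = [0, 1, 2, 3]          # the X_i
probs  = [0.1, 0.3, 0.4, 0.2]  # the P(X_i); they sum to 1

# Expected value: E(X) = sum_i X_i * P(X_i)
mean = sum(x * p for x, p in zip(values, probs))

# Variance (the spread discussed below): sigma^2 = sum_i (X_i - E(X))^2 * P(X_i)
var = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

print(mean, var)  # approximately 1.7 and 0.81
```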
The sum of these events can be shown as graphs. If the random variable has a discrete probability
distribution (e.g., cans of paint that can be sold in a day), then the graph of events may look like this:
UNIT III STUDY GUIDE
Binomial and Normal Distributions
The bar heights show the probability P(X) along the y-axis for each discrete value of X along the x-axis, with no fractions for discrete variables (no half-cans of paint).
The variance (σ²) is the spread of the distribution of events in a probability distribution (Render et al., 2015). The variance is interesting because a small variance may indicate that the event value will most likely be near the mean most of the time, while a large variance may show that the mean is not all that reliable a guide to what the event values will be, as the sp.
Min-based qualitative possibilistic networks are one of the effective tools for a compact representation of decision problems under uncertainty. Exact approaches for computing decisions based on possibilistic networks are limited by the size of the possibility distributions. Generally, these approaches are based on possibilistic propagation algorithms. An important step in the computation of the decision is the transformation of the DAG into a secondary structure known as the junction tree. This transformation is known to be costly and represents a difficult problem. We propose in this paper a new approximate approach for the computation of decisions under uncertainty within possibilistic networks. The computation of the optimal optimistic decision no longer goes through the junction tree construction step; instead, it is performed by calculating the degree of normalization in the moral graph resulting from merging the possibilistic network codifying the knowledge of the agent with that codifying its preferences.
Bayesian, frequentist and fiducial (BFF) inferences are much more congruous than they have been perceived historically in the scientific community. Most practitioners are probably more familiar with the competing narratives of the two dominant statistical inferential paradigms, Bayesian inference and frequentist inference. The third, lesser known fiducial inference paradigm was pioneered by R.A. Fisher in an attempt to define an inversion procedure for inference as an alternative to Bayes' theorem. Although each paradigm has its own strengths and limitations subject to their different philosophical underpinnings, this talk intends to bridge these three different inferential methodologies through the lenses of confidence distribution theory and artificial sampling procedures. The talk attempts to understand how uncertainty quantifications in these three distinct paradigms, Bayesian, frequentist, and fiducial inference, can be unified and compared on a foundational level, thereby increasing the range of possible techniques available to both statistical theorists and practitioners across all fields.
(May 29th, 2024) Advancements in Intravital Microscopy: Insights for Preclini... (Scintica Instrumentation)
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows ultra-fast, high-resolution imaging of cellular processes over time and space in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments, or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system's unique features and user-friendly software enable researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction, as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allowing for the evaluation of therapeutic interventions in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
Seminar on U.V. Spectroscopy by SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
This PDF is about schizophrenia.
For more details, visit the SELF-EXPLANATORY channel on YouTube:
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN (Sérgio Sacani)
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ... (Sérgio Sacani)
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4–0.9 µm) and novel JWST images with 14 filters spanning 0.8–5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at > 2.3 µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and 30.3–31.0 AB mag (5σ, r = 0.1″ circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5–15. These objects show compact half-light radii of R_1/2 ∼ 50–200 pc, stellar masses of M⋆ ∼ 10^7–10^8 M⊙, and star-formation rates of SFR ∼ 0.1–1 M⊙ yr^-1. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to infer the properties of the evolving luminosity function without binning in redshift or luminosity that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for evolution of the dark matter halo mass function.
Astronomy Update: Curiosity's exploration of Mars (Local Briefs, leadertele...)
Explainable models for time series with random forest
1. Explainable models for time series with random forest
Nathalie Vialaneix(1) & Rémi Servien
(1) nathalie.vialaneix@inrae.fr
http://www.nathalievialaneix.eu
First PhenoDyn meeting
November 29-30, 2021
2. Scientific question
Purpose: prediction of a target quantity (e.g., yield) from functional data (e.g., weather time series)
Statistics & ML for high throughput data integration
Nov 29-30 2021 / PhenoDyn / Nathalie Vialaneix & Rémi Servien
p. 2
4. Scientific question & difficulty at stake
Purpose: Improve interpretability by selecting the most predictive intervals.
Challenge: Selection of intervals is not too hard (e.g., group Lasso), but creating the relevant intervals (starting point, length) is hard.
Existing solutions: [Picheny et al., 2019, Grollemund et al., 2019]
5. Scientific question & framework of the presentation
Here: random forest
Why?
- versatile method for prediction
- easy to use and relatively fast
- good prediction ability in general
- natural framework for interpretability (importance through OOB samples)
6. What is needed to achieve that goal?
Three/four key ingredients
1. random forest for time series
2. (maybe optional) ... based on summary descriptors of intervals
3. building intervals
4. selecting intervals
7. A short reminder on random forest [Breiman, 2001]
[Diagram: from the learning set Ln, bootstrap samples Ln^Θ1, ..., Ln^Θℓ, ..., Ln^Θq are drawn; a randomized tree ĥ(·, Θℓ, Θ'ℓ) is built on each; the trees are aggregated into the forest predictor ĥRF-RI(·). Steps: Bootstrap → RI trees → Aggregation.]
Courtesy of Robin Genuer and Jean-Michel Poggi.
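The bootstrap-and-aggregate scheme of this diagram can be written in a few lines. A toy sketch with depth-1 "stump" trees standing in for full randomized CART trees (illustrative only, not the actual randomForest implementation):

```python
import random
import statistics

def bootstrap_sample(xs, ys, rng):
    """Draw n pairs with replacement (the Theta randomness of each tree)."""
    idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
    return [xs[i] for i in idx], [ys[i] for i in idx]

def fit_stump(xs, ys):
    """A depth-1 'tree': split at the sample median of x and predict the
    mean of y on each side (a stand-in for a full CART tree)."""
    s = statistics.median(xs)
    left = [y for x, y in zip(xs, ys) if x <= s] or ys
    right = [y for x, y in zip(xs, ys) if x > s] or ys
    ml, mr = statistics.mean(left), statistics.mean(right)
    return lambda x: ml if x <= s else mr

def forest_predict(trees, x):
    """Aggregation step: average the individual tree predictions."""
    return statistics.mean(t(x) for t in trees)

rng = random.Random(0)
xs = [i / 10 for i in range(50)]   # toy 1-D predictor
ys = [x ** 2 for x in xs]          # toy regression target
trees = [fit_stump(*bootstrap_sample(xs, ys, rng)) for _ in range(100)]
print(forest_predict(trees, 2.0))  # ensemble prediction at x = 2.0
```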
8. CART [Breiman et al., 1984]
[Diagram: a CART tree with root C1 split by X1 ≤ d3 / X1 > d3, then X2 ≤ d2 / X2 > d2 and X1 ≤ d1 / X1 > d1, with internal nodes C2, C3 and leaves C4, C5, C8, C9; alongside, the corresponding rectangular partition of the (X1, X2) plane by the thresholds d1, d2, d3 into the regions C4, C3, C8, C9.]
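Each threshold drawn in this diagram (d1, d2, d3) is found by an exhaustive search at a tree node. A minimal sketch of that search for a regression tree, using the CART sum-of-squared-errors criterion (the data below are made up):

```python
def best_split(X, y):
    """Scan every variable and every observed threshold; keep the split
    that most reduces the sum of squared errors of the two children."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = (float("inf"), None, None)  # (score, variable j, threshold d)
    for j in range(len(X[0])):
        for d in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= d]
            right = [yi for row, yi in zip(X, y) if row[j] > d]
            score = sse(left) + sse(right)
            if right and score < best[0]:   # skip the empty-right split
                best = (score, j, d)
    return best[1], best[2]                 # the split "X_j <= d"

# Toy data: y jumps when the first variable crosses 3
X = [[1, 0], [2, 5], [3, 1], [4, 4], [5, 2], [6, 3]]
y = [0.0, 0.1, 0.0, 9.9, 10.0, 10.1]
print(best_split(X, y))  # (0, 3): split on the first variable at 3
```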
10. Technical details
- splits are made by randomly choosing mtry < d variables and finding the "best split" among this selection
- aggregation: average (numeric target) or majority vote rule (class target)
- OOB error: average error (over trees) on samples not included in the bootstrap sample of the tree
- variable importance: the larger the increase in error after random permutation of a variable, the more important the variable is
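The permutation importance described above can be sketched as follows: shuffle one variable's column, remeasure the error, and report the increase. This is a generic, model-agnostic version on a made-up model and dataset, not randomForest's exact OOB-based computation:

```python
import random

def permutation_importance(predict, X, y, j, rng, n_rep=10):
    """Average increase in mean squared error when variable j is randomly
    permuted: the larger the increase, the more important variable j."""
    def mse(Xm):
        return sum((predict(row) - yi) ** 2 for row, yi in zip(Xm, y)) / len(y)

    base = mse(X)
    increases = []
    for _ in range(n_rep):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the link between X_j and y
        Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        increases.append(mse(Xp) - base)
    return sum(increases) / n_rep

# Toy model that only uses variable 0: variable 1 should get ~0 importance.
predict = lambda row: 2.0 * row[0]
rng = random.Random(0)
X = [[float(i), float(i % 3)] for i in range(30)]
y = [2.0 * row[0] for row in X]
print(permutation_importance(predict, X, y, 0, rng))  # large increase
print(permutation_importance(predict, X, y, 1, rng))  # 0.0
```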
13. Extensions of random forest for time series
- Similarity-based techniques
  - Fréchet forest [Capitaine et al., 2020]
  - Proximity forest [Lucas et al., 2019] (restricted to classification)
- Interval-based techniques
  - Time Series Forest [Deng et al., 2013] and its extension [Middlehurst et al., 2020]
  - RISE [Lines et al., 2018] (a tree = a randomly selected interval)
- Dictionary or symbolic-representation-based techniques:
  - TS-CHIEF [Shifaz et al., 2020] (combines all types of splits, including dictionary-based splits based on the work of [Schäfer, 2015])
  - (multivariate time series) symbolic representation of time series [Baydogan and Runger, 2015]
Image by courtesy of Charlotte Pelletier.
15. Time Series Forest
Basic principles:
1. for a given tree: random sampling of intervals
2. for a given tree: compute summaries (mean, sd, slope for [Deng et al., 2013] and catch22 for [Middlehurst et al., 2020])
3. define splits as usual based on these summaries
What is useful for our question?
- combined with variable selection, could help identify important intervals (still to be tested)
- ideas to summarize the information of an entire interval (already partially tested)
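Step 2 of the Time Series Forest recipe turns each series into a small feature vector per interval. A minimal sketch of the mean/sd/slope summaries of [Deng et al., 2013] (the example series is made up):

```python
import statistics

def interval_features(series, start, end):
    """Summaries of series[start:end]: mean, standard deviation, and the
    least-squares slope, as used by Time Series Forest to define splits."""
    window = series[start:end]
    n = len(window)
    t = list(range(n))
    mean_t, mean_x = (n - 1) / 2, statistics.mean(window)
    # slope of the least-squares line x ~ t over the interval
    num = sum((ti - mean_t) * (xi - mean_x) for ti, xi in zip(t, window))
    den = sum((ti - mean_t) ** 2 for ti in t)
    slope = num / den if den else 0.0
    sd = statistics.stdev(window) if n > 1 else 0.0
    return [mean_x, sd, slope]

series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
print(interval_features(series, 0, 4))  # mean 1.5, sd ~1.29, slope 1.0
```

Splits in the tree are then defined on these summary features rather than on raw time points.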
18. Extensions on summaries
- supervised, accounting for Y (is that useful?): PLS, linear models (including ridge)... similar to [Poterie et al., 2019] (grouped variables summarized by LDA) and to [Rainforth and Wood, 2017] (CCA-based splits)
- unsupervised: first PC of a PCA, as in the ClustOfVar method [Chavent et al., 2012] and in [Chavent et al., 2021] (can also be useful to build groups)
- Could a step further be taken by using oblique splits [Bertsimas and Dunn, 2017] (let the forest decide how to combine variables to find the best split)?
See also: [Hornung and Boulesteix, 2021]
21. Strategies for building intervals
I Precomputing intervals independently of Y (based on correlation between time
points): constrained extension of ClustOfVar (PCA-like criterion), adjclust
(constrained clustering based on correlation between variables) ⇒ hierarchy of
intervals
I Precomputing intervals independently of Y (based on greedy agglomeration):
alternating between
I a regression-based step (LM) between any two consecutive variables to select the best
merge (minimum loss or maximum gain in accuracy)
I a summary step (depends on the regression type)
⇒ hierarchy of intervals
I Using random forests to compute a hierarchy of intervals based on the loss in
grouped importance, in a greedy manner [Gregorutti et al., 2015]
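The adjacency-constrained clustering idea can be sketched as follows. This is a minimal illustration in the spirit of adjclust, not its actual algorithm; the mean-correlation merge criterion and the toy data are assumptions:

```python
# Sketch: agglomerative clustering of time points where only CONSECUTIVE
# groups may merge, so every cluster is an interval and the merge history
# is a hierarchy of intervals.
import numpy as np

def interval_hierarchy(X):
    """X: (n_samples, n_timepoints). Returns the list of merge states,
    each a list of intervals as (start, end) pairs (end inclusive)."""
    C = np.corrcoef(X, rowvar=False)
    intervals = [(j, j) for j in range(X.shape[1])]
    history = [list(intervals)]
    while len(intervals) > 1:
        # merge the pair of adjacent intervals with highest mean correlation
        best, best_sim = 0, -np.inf
        for i in range(len(intervals) - 1):
            a0, a1 = intervals[i]
            b0, b1 = intervals[i + 1]
            sim = C[a0:a1 + 1, b0:b1 + 1].mean()
            if sim > best_sim:
                best, best_sim = i, sim
        a, b = intervals[best], intervals[best + 1]
        intervals[best:best + 2] = [(a[0], b[1])]
        history.append(list(intervals))
    return history

rng = np.random.default_rng(1)
s1, s2 = rng.normal(size=50), rng.normal(size=50)
# time points 0-2 follow s1, 3-5 follow s2: two natural intervals
X = np.column_stack([s1, s1, s1, s2, s2, s2]) + 0.1 * rng.normal(size=(50, 6))
hierarchy = interval_hierarchy(X)
print(hierarchy[-2])  # the two-interval level of the hierarchy
```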
24. Strategies for selecting variables (here, intervals) in RF
Huge literature... a few reviews: [Degenhardt et al., 2019, Speiser et al., 2019]
I based on importance:
I just use the importance...
I ranking with importance, then selection of the best model with the first k variables
(k = 1, . . . , K), as in VSURF [Genuer et al., 2010]
I [Altmann et al., 2010] or [Szymczak et al., 2016], based on a data-driven importance
threshold (untested)
I based on external variable selection methods:
I Knockoffs [Barber and Candès, 2015], as in Boruta [Kursa and Rudnicki, 2010]
I Relief [Robnik-Šikonja and Kononenko, 2003]
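The ranking-then-nested-models strategy can be sketched as follows. This is a VSURF-like sketch only: a plain least-squares model and a correlation-based importance proxy stand in for the random forest, purely to keep the example self-contained:

```python
# Sketch: rank variables by an importance score, fit nested models on the
# top-k variables (k = 1..K), and keep the k with the lowest held-out error.
import numpy as np

def nested_selection(X, y, importance, n_train):
    order = np.argsort(importance)[::-1]          # decreasing importance
    Xtr, Xte, ytr, yte = X[:n_train], X[n_train:], y[:n_train], y[n_train:]
    errors = []
    for k in range(1, X.shape[1] + 1):
        cols = order[:k]
        coef, *_ = np.linalg.lstsq(Xtr[:, cols], ytr, rcond=None)
        errors.append(np.mean((Xte[:, cols] @ coef - yte) ** 2))
    best_k = int(np.argmin(errors)) + 1
    return order[:best_k]

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] + 2 * X[:, 4] + 0.1 * rng.normal(size=200)
# crude importance proxy: absolute correlation with y
imp = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(6)])
selected = nested_selection(X, y, imp, n_train=150)
print(sorted(int(i) for i in selected))  # contains the true variables 0 and 4
```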
25. Simulation setting
I predictors: 1,000 EMS time series (length: 444)
I important intervals
I target: yi = log(1 + |⟨xi, β⟩|) + ε with
β(t) = 4 × 1t∈[320,410] + 2 × 1t∈[500,550] − 1t∈[680,730]
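The simulated target above can be generated along these lines. The time grid, sample size, predictor distribution, and noise level below are illustrative choices, not taken from the slides:

```python
# Sketch of the simulation design: y_i = log(1 + |<x_i, beta>|) + eps,
# with a step-function beta supported on three intervals.
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(1000)                                # illustrative time grid
beta = (4 * ((t >= 320) & (t <= 410))
        + 2 * ((t >= 500) & (t <= 550))
        - 1 * ((t >= 680) & (t <= 730))).astype(float)

n = 50
X = rng.normal(size=(n, t.size))                   # stand-in predictors
eps = 0.1 * rng.normal(size=n)                     # illustrative noise level
y = np.log(1 + np.abs(X @ beta)) + eps
print(y.shape)
```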
26. Evaluated scenarios
Scenario 1: pre-computed groups, summary, RF, and importance-based evaluation
Courtesy of Louisa Villa.
27. Evaluated scenarios
Scenario 2: pre-computed groups, summary, selection and RF
Courtesy of Louisa Villa.
28. Evaluated scenarios
Scenario 3: groups computed in interaction with importance or variable selection (not
detailed here), summary, selection (or not), and RF
Courtesy of Louisa Villa.
29. Evaluation criteria
I resemblance of important/selected intervals with the ground truth
I accuracy
30. A few take home messages (to be confirmed)
I pre-computed groups based on correlation (especially adjclust) perform better
I PLS is the best summary strategy
I combining a selection strategy with RF is computationally expensive and inefficient
I overall, the recovery of groups is a bit disappointing
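The PLS summary strategy can be sketched as taking, for each group, the first PLS component of its variables against y, a supervised counterpart of the first-PC summary. A minimal PLS1 first component (weights proportional to X^T y; the toy data are my own):

```python
# Sketch: first PLS component scores of a group of variables w.r.t. y.
import numpy as np

def pls1_summary(X_group, y):
    """First PLS component scores of one group's variables against y."""
    Xc = X_group - X_group.mean(axis=0)   # center the variables
    yc = y - y.mean()                     # center the response
    w = Xc.T @ yc                         # PLS1 weights: proportional to X^T y
    w /= np.linalg.norm(w)                # unit-norm weight vector
    return Xc @ w                         # component scores

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X[:, 2] + 0.1 * rng.normal(size=100)
summary = pls1_summary(X, y)
# the supervised summary tracks y much better than an arbitrary column would
print(abs(np.corrcoef(summary, y)[0, 1]) > 0.9)
```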
31. adjclust + PLS + Boruta (scenario 2)
⇒ Model selection (or model aggregation?) seems critical...
32. To be continued...
33. References
Altmann, A., Tolosi, L., Sander, O., and Lengauer, T. (2010).
Permutation importance: a corrected feature importance measure.
Bioinformatics, 26(10):1340–1347.
Barber, R. F. and Candès, E. (2015).
Controlling the false discovery rate via knockoffs.
Annals of Statistics, 43(5):2055–2085.
Baydogan, M. G. and Runger, G. (2015).
Learning a symbolic representation for multivariate time series classification.
Data Mining and Knowledge Discovery, 29:400–422.
Bertsimas, D. and Dunn, J. (2017).
Optimal classification trees.
Machine Learning, 106(7):1039–1082.
Breiman, L. (2001).
Random forests.
Machine Learning, 45(1):5–32.
Breiman, L., Friedman, J., Olsen, R., and Stone, C. (1984).
Classification and Regression Trees.
Chapman and Hall, Boca Raton, Florida, USA.
Capitaine, L., Bigot, J., Thiébaut, R., and Genuer, R. (2020).
Fréchet random forests for metric space valued regression with non Euclidean predictors.
Preprint arXiv:1906.01741v2.
Chavent, M., Genuer, R., and Saracco, J. (2021).
Combining clustering of variables and feature selection using random forests.
Communications in Statistics - Simulation and Computation, 50(2):426–445.
Chavent, M., Liquet, B., Kuentz-Simonet, V., and Saracco, J. (2012).
ClustOfVar: an R package for the clustering of variables.
Journal of Statistical Software, 50(13):1–16.
Degenhardt, F., Seifert, S., and Szymczak, S. (2019).
Evaluation of variable selection methods for random forests and omics data sets.
Briefings in Bioinformatics, 20(2):492–503.
Deng, H., Runger, G., Tuv, E., and Martyanov, V. (2013).
A time series forest for classification and feature extraction.
Information Science, 239:142–153.
Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010).
Variable selection using random forests.
Pattern Recognition Letters, 31(14):2225–2236.
Gregorutti, B., Michel, B., and Saint-Pierre, P. (2015).
Grouped variable importance with random forests and application to multiple functional data
analysis.
Computational Statistics and Data Analysis, 90:15–35.
Grollemund, P.-M., Abraham, C., Baragatti, M., and Pudlo, P. (2019).
Bayesian functional linear regression with sparse step functions.
Bayesian Analysis, 14(1):111–135.
Hornung, R. and Boulesteix, A.-L. (2021).
Interaction forests: identifying and exploiting interpretable quantitative and qualitative interaction
effects.
Technical Report Number 237, Department of Statistics, University of Munich, Germany.
Kursa, M. and Rudnicki, W. (2010).
Feature selection with the Boruta package.
Journal of Statistical Software, 36(11):1–13.
Lines, J., Taylor, S., and Bagnall, A. (2018).
Time series classification with HIVE-COTE: the hierarchical vote collective of
transformation-based ensembles.
ACM Transactions on Knowledge Discovery from Data, 12(5):1–35.
Lucas, B., Shifaz, A., Pelletier, C., O’Neill, L., Zaidi, N., Goethals, B., Petitjean, F., and Webb,
G. I. (2019).
Proximity forest: an effective and scalable distance based classifier for time series.
Data Mining and Knowledge Discovery, 33:607–635.
Middlehurst, M., Large, J., and Bagnall, A. (2020).
The canonical interval forest (CIF) classifier for time series classification.
In Wu, X., Jermaine, C., Hu, X., Kotevskia, O., Lu, S., Xu, W., Aluru, S., Zhai, C., Al-Masri, E.,
Chen, Z., and Saltz, J., editors, Proceedings of IEEE International Conference on Big Data,
Atlanta, GA, USA. IEEE.
Picheny, V., Servien, R., and Villa-Vialaneix, N. (2019).
Interpretable sparse sliced inverse regression for functional data.
Statistics and Computing, 29(2):255–267.
Poterie, A., Dupuy, J.-F., Monbet, V., and Rouvière, L. (2019).
Classification tree algorithm for grouped variables.
Computational Statistics, 34:1613–1648.
Rainforth, T. and Wood, F. (2017).
Canonical correlation forests.
arXiv: 1507.05444.
Robnik-Šikonja, M. and Kononenko, I. (2003).
Theoretical and empirical analysis of ReliefF and RReliefF.
Machine Learning, 53(1-2):23–69.
Schäfer, P. (2015).
The BOSS is concerned with time series classification in the presence of noise.
Data Mining and Knowledge Discovery, 29(6):1505–1530.
Shifaz, A., Pelletier, C., Petitjean, F., and Webb, G. I. (2020).
TS-CHIEF: a scalable and accurate forest algorithm for time series classification.
Data Mining and Knowledge Discovery, 34:742–775.
Speiser, J. L., Miller, M. E., Tooze, J., and Ip, E. (2019).
A comparison of random forest variable selection methods for classification prediction modeling.
Expert Systems with Applications, 134:93–101.
Szymczak, S., Holzinger, E., Dasgupta, A., Malley, J., Molloy, A., Mills, J., Brody, L.,
Stambolian, D., and Bailey-Wilson, J. (2016).
r2VIM: a new variable selection method for random forests in genome-wide association studies.
BioData Mining, 9:7.
39. Dictionary/symbolic representation based
BOSS [Schäfer, 2015] and [Baydogan and Runger, 2015]
Based on: Fourier transform, then symbolic representation.
[Baydogan and Runger, 2015] is similar, except that the representation loses interval
information (it is based on a tree at the time-step level)
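The "Fourier transform then symbolic representation" idea can be sketched as follows. This is very rough: the window size, word length, and fixed breakpoints below are arbitrary choices, not the SFA/BOSS defaults:

```python
# Sketch: map each sliding window of a series to a short word over a small
# alphabet by quantizing its low-frequency DFT coefficients.
import numpy as np

def symbolic_word(window, n_coefs=3, alphabet="abcd"):
    coefs = np.fft.rfft(window)[1:n_coefs + 1]      # skip the mean term
    feats = np.concatenate([coefs.real, coefs.imag])
    # quantize each feature into alphabet bins (illustrative fixed breakpoints)
    bins = np.array([-1.0, 0.0, 1.0])
    return "".join(alphabet[i] for i in np.digitize(feats, bins))

rng = np.random.default_rng(5)
series = np.sin(np.linspace(0, 8 * np.pi, 128)) + 0.05 * rng.normal(size=128)
# one word of 2 * n_coefs = 6 letters per window of length 32, stride 8
words = [symbolic_word(series[s:s + 32]) for s in range(0, 96, 8)]
print(len(words), len(words[0]))
```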
40. Dictionary/symbolic representation based
BOSS [Schäfer, 2015] and [Baydogan and Runger, 2015]
What is useful for our question? Uncertain... can the symbolic representation itself be
used to represent/select (windowed) intervals? (untested)