The document discusses interpretable sparse sliced inverse regression (IS-SIR) for functional data regression. It begins with background on using metamodels as proxies for computationally expensive agronomic models to understand relationships between climate inputs and plant outputs. SIR is presented as a semi-parametric regression technique that identifies relevant subspaces to predict outputs from functional inputs. The proposal combines SIR with automatic interval selection to identify interpretable predictor intervals. Simulations are discussed to evaluate the proposed method.
The Bayesian paradigm provides a coherent approach for quantifying uncertainty given available data and prior information. Aspects of uncertainty that arise in practice include uncertainty regarding parameters within a model, the choice of model, and propagation of uncertainty in parameters and models for predictions. In this talk I will present Bayesian approaches for addressing model uncertainty given a collection of competing models including model averaging and ensemble methods that potentially use all available models and will highlight computational challenges that arise in implementation of the paradigm.
A comparison of three learning methods to predict N2O fluxes and N leaching (tuxette)
The document compares three machine learning methods - multi-layer perceptrons (neural networks), support vector machines (SVMs), and random forests - for predicting N2O fluxes and N leaching from various data inputs. It provides background on machine learning for regression problems, describes the three methods and how they are trained and tuned, and discusses the methodology and results of a study comparing the performance of these methods.
The document discusses using unusual data sources in insurance. It provides examples of using pictures, text, social media data, telematics, and satellite imagery in insurance. It also discusses challenges in analyzing complex and high-dimensional data from these sources and introduces machine learning tools like PCA, generalized linear models, and evaluating models using loss, risk, and cross-validation.
This document summarizes a seminar on econometrics and machine learning given by Arthur Charpentier at Università degli studi dell’Insubria in May 2018. It discusses the history and development of econometrics, including its probabilistic foundations. It also covers key econometric techniques like regression, maximum likelihood estimation, and nonparametric methods. Model selection criteria like AIC and BIC are also briefly discussed. The document provides a high-level overview of major topics in econometrics through the lens of its use in large datasets and connection to machine learning.
This document outlines an agenda for a presentation on big data and machine learning from an actuarial perspective. The presentation will include an introduction to statistical learning, covering classification and regression problems. It will discuss model selection, feature engineering, and other related computational topics. Code examples will be provided to illustrate machine learning techniques in action. The goal is to describe philosophical differences between machine learning and standard statistical approaches, as well as explain commonly used algorithms and how to implement them.
The document discusses quantiles and quantile regression. It begins by defining quantiles as the inverse of a cumulative distribution function. Quantile regression models the relationship between covariates and conditional quantiles, similar to how ordinary least squares regression models the conditional mean. The document also discusses median regression, which estimates relationships using the 1-norm rather than the 2-norm used in OLS. Median regression provides consistent estimates when the error term has a symmetric distribution.
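As an illustrative sketch (not from the document), the characterization of quantiles used in quantile regression can be checked in a few lines: the τ-th quantile of a sample minimizes the pinball (check) loss, and for τ = 0.5 the minimizer is the median.

```python
# Pinball (check) loss: the tau-quantile of a sample minimizes this loss.
def pinball_loss(y, q, tau):
    # tau * (v - q) when v >= q, (tau - 1) * (v - q) otherwise
    return sum(tau * (v - q) if v >= q else (tau - 1) * (v - q) for v in y)

y = [1, 2, 3, 4, 100]
# Scanning the sample values, the minimizer for tau = 0.5 is the median,
# which is robust to the outlier 100 (unlike the mean, 22).
best = min(y, key=lambda q: pinball_loss(y, q, 0.5))
print(best)  # 3
```

Quantile regression generalizes this by minimizing the same loss over the residuals of a linear model instead of over a raw sample.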
Mini useR! in Melbourne https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network/events/251933078/
MelbURN (Melbourne useR group) https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network
July 16th, 2018
Melbourne, Australia
The document proposes using random forests (RF), a machine learning tool, for approximate Bayesian computation (ABC) model choice rather than estimating model posterior probabilities. RF improves on existing ABC model choice methods by having greater discriminative power among models, being robust to the choice and number of summary statistics, requiring less computation, and providing an error rate to evaluate confidence in the model choice. The authors illustrate the power of the RF-based ABC methodology on controlled experiments and real population genetics datasets.
Surrogate models emulate expensive computer simulations. The objective is to approximate a function, $f$, of $d$ variables to a given tolerance, $\varepsilon$, using as few function values as possible, preferably $O(d)$. We explain how tractability theory provides lower bounds on the number of function values required by any possible method. We also propose a method for sampling and approximating $f$ that achieves this objective, and describe the kind of underlying structure that $f$ must have for success.
Pattern learning and recognition on statistical manifolds: An information-geo... (Frank Nielsen)
This document provides an overview of Frank Nielsen's talk on pattern learning and recognition using information geometry and statistical manifolds. The talk focuses on departing from vector space representations and dealing with (dis)similarities that do not have Euclidean or metric properties. This poses new theoretical and computational challenges for pattern recognition. The talk describes using exponential family mixture models defined on dually flat statistical manifolds induced by convex functions. On these manifolds, dual coordinate systems and dual affine geodesics allow for computing-friendly representations of divergences and similarities between probabilistic patterns. The techniques aim to achieve statistical invariance and enable algorithmic approaches to problems like Gaussian mixture modeling, shape retrieval, and diffusion tensor imaging analysis.
This document provides an overview of a 2004 CVPR tutorial on nonlinear manifolds in computer vision. The tutorial is divided into four parts that cover: (1) motivation for studying nonlinear manifolds and how differential geometry can be useful in vision, (2) tools from differential geometry like manifolds, tangent spaces, and geodesics, (3) statistics on manifolds like distributions and estimation, and (4) algorithms and applications in computer vision like pose estimation, tracking, and optimal linear projections. Nonlinear manifolds are important in computer vision because the underlying spaces in problems involving constraints, like objects on circles or matrices with orthogonality constraints, are nonlinear. Differential geometry provides a framework for generalizing tools from vector spaces to nonlinear manifolds.
Although we are often told not to do it, statistical scientists frequently predict the value of outcome measures of physical systems at input points far from the observed data. Since predictions are made in new regions of the input space, statistical theory cannot dictate optimal rules for measures of uncertainty associated with extrapolation. This talk presents several solutions based on simple principles. The solutions are illustrated via the analysis of data generated by dropping spheres of varying radii and masses from different heights. Some of the techniques apply to more complex physical systems. The efficacy of these techniques is demonstrated using data (experimental and simulated) of the level of complexity physical scientists frequently face. Scientists should tailor these techniques to fit the needs of a particular application.
This document provides teaching materials for a lesson on quadratic functions in vertex form. The lesson is designed for senior high school students and will take approximately 80 minutes to complete. It includes a teachers' guide, lesson plan, student worksheet, and instructions for setting up quadratic graphs in a spreadsheet. The lesson introduces vertex form and guides students in exploring how changing the parameters a, p, and q affects the graph of the function. Students will observe transformations of the graph as these parameters are varied and analyze how the vertex, y-intercept, and minimum/maximum values change.
Calibrating Probability with Undersampling for Unbalanced Classification (Andrea Dal Pozzolo)
This study examines how undersampling affects posterior probability estimates in unbalanced classification tasks. It shows that undersampling warps the posterior probabilities away from the true probabilities. However, the study presents a method to correct the warped probabilities using a simple formula, which provides calibrated probabilities without loss of predictive performance. Experiments on real-world datasets demonstrate that the corrected probabilities have better calibration than uncorrected probabilities while maintaining ranking quality.
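A correction of the kind the study describes has a simple closed form. The sketch below is ours, not the paper's code; `beta` is assumed to denote the fraction of majority-class examples kept by undersampling, and `ps` a posterior probability estimated on the undersampled data.

```python
def calibrate(ps, beta):
    """Map a posterior probability estimated after undersampling back
    toward the original class balance. beta is the fraction of
    majority-class examples kept by undersampling (0 < beta <= 1)."""
    return beta * ps / (beta * ps - ps + 1)

print(calibrate(0.5, 1.0))  # 0.5: no undersampling, probability unchanged
print(calibrate(0.5, 0.1))  # ~0.091: heavy undersampling inflated the raw estimate
```

Because the mapping is strictly increasing in `ps`, it recalibrates the probabilities without changing the ranking of the examples, which matches the study's claim that ranking quality is preserved.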
This document discusses various types of regression modeling and linear regression. It provides examples of linear regression analysis on fraud data and discusses assessing goodness of fit. It also briefly covers non-linear regression, problem areas like heteroskedasticity and collinearity, and model selection methods. Linear regression is presented geometrically and the assumptions and computations of ordinary least squares regression are explained.
This document provides an overview of regression models and their use in business analytics. It discusses simple and multiple linear regression models, how to develop regression equations from sample data, and how to interpret key outputs like the slope, intercept, coefficient of determination, and correlation coefficient. Regression analysis is presented as a valuable tool for managers to understand relationships between variables and predict outcomes. The document outlines the key steps in regression including developing scatter plots, calculating regression equations, and measuring the fit of regression models.
This document provides an overview of key concepts in regression analysis, including simple and multiple linear regression models. It outlines 10 learning objectives for the chapter, which cover topics like developing regression equations from sample data, interpreting regression outputs, assessing model fit, and addressing violations of regression assumptions. The document also includes sample regression calculations and residual plots for a case study on predicting home renovation sales from area payroll levels.
When is undersampling effective in unbalanced classification tasks? (Andrea Dal Pozzolo)
This document analyzes when undersampling is effective for addressing class imbalance in classification tasks. It introduces the concepts of warping in posterior distributions and increased variance due to sample removal with undersampling. It presents a theoretical condition under which undersampling is expected to improve classification accuracy based on comparing the ranking error probability with and without undersampling. Experiments on synthetic univariate and bivariate datasets are used to illustrate factors influencing whether the condition holds.
The document describes univariate and multivariate analyses for forecasting. For univariate analysis, Ornstein-Uhlenbeck and autoregressive models are used to analyze time series data. For multivariate analysis, linear regression is used to analyze correlations between multiple time series and to predict values. The analyses generate forecasts, confidence bands around predictions, and evaluations of prediction errors. The conclusion indicates that the methods provide useful predictions and that inflation rates are correlated across measures.
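As a hedged sketch (ours, not the document's code), an autoregressive model of the kind mentioned can be fit by least squares on lagged values, and the fit yields a one-step forecast with a rough confidence band:

```python
import numpy as np

rng = np.random.default_rng(0)
phi, n = 0.8, 500

# Simulate an AR(1) series: x_t = phi * x_{t-1} + noise
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

# Least-squares estimate of phi from the lagged regression
phi_hat = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])

# One-step-ahead forecast with a rough 95% band (unit innovation variance)
forecast = phi_hat * x[-1]
band = (forecast - 1.96, forecast + 1.96)
print(round(phi_hat, 2))
```

The same lagged-regression idea extends to the multivariate case by stacking several series into the design matrix.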
This document provides an overview of linear regression models. It discusses using linear regression to analyze the relationship between one or more independent variables and a dependent variable. Key points covered include:
- Linear regression can be used to measure relationships between variables, determine causal direction, and forecast variable values.
- The linear regression model relates a dependent variable to independent variables using a best fitting straight line.
- Ordinary least squares estimation is used to estimate the slope and intercept of the regression line by minimizing the sum of squared residuals.
- Diagnostic tests on residuals can check if assumptions like linearity, normality and equal variance are met.
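The steps above can be sketched in a few lines (illustrative, with numpy; not the document's own code):

```python
import numpy as np

# Toy data lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix with an intercept column; OLS minimizes ||y - X beta||^2
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals are (numerically) zero here; in practice one would plot them
# to check the linearity, normality and equal-variance assumptions
resid = y - X @ beta
print(beta)  # [1. 2.]  (intercept, slope)
```

On noisy data the residual vector is where the diagnostic checks listed above start: a residuals-versus-fitted plot for linearity and equal variance, a Q-Q plot for normality.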
Inferring networks from multiple samples with consensus LASSO (tuxette)
This document provides an overview of biological concepts and network inference methods. It discusses DNA, transcription, gene expression, and how transcriptomic data is obtained. Gene networks can be inferred from expression data using correlations or partial correlations between genes. Network inference focuses on direct relationships between genes and can identify interactions for previously unannotated genes.
Visualizing and mining networks - Methods and examples in R (tuxette)
PEPI IBIS general assembly, April 1st, 2014
This talk introduces the notion of networks and the basic problems generally associated with them (visualization, identification of important vertices, identification of modules). The notions are illustrated with examples on a real network using the R software.
Inferring networks from multiple samples with consensus LASSO (tuxette)
The document discusses network inference from gene expression data. It provides background on DNA, transcription, and gene expression. Gene expression data from microarrays contains measurements of thousands of genes across multiple samples. The goal is to infer a gene network or graph with nodes as genes and edges as strong links between gene expressions. Graphical Gaussian models (GGMs) are commonly used, where the concentration matrix encodes conditional independence relationships between genes. Several approaches are discussed for estimating the concentration matrix from data, including graphical lasso methods that promote sparse solutions.
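To make the concentration-matrix idea concrete, here is an illustrative numpy sketch (ours, not the document's code): for a Gaussian "chain" X1 - X2 - X3, the concentration (precision) matrix has a zero at position (1, 3), encoding that X1 and X3 are conditionally independent given X2.

```python
import numpy as np

# Covariance of a Gaussian chain X1 - X2 - X3 (AR(1)-like, rho = 0.6)
rho = 0.6
Sigma = np.array([[1.0,    rho,  rho**2],
                  [rho,    1.0,  rho],
                  [rho**2, rho,  1.0]])

# Concentration (precision) matrix: zero entries encode conditional independence
Theta = np.linalg.inv(Sigma)

# Partial correlation between variables i and j given all the others
d = np.sqrt(np.diag(Theta))
partial = -Theta / np.outer(d, d)
np.fill_diagonal(partial, 1.0)

print(partial[0, 2])  # numerically zero: X1 and X3 are linked only through X2
```

Graphical lasso methods estimate a sparse version of `Theta` directly from data, so that the surviving nonzero entries define the edges of the inferred gene network.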
Graph mining 2: Statistical approaches for graph mining (tuxette)
This document summarizes a talk on statistical approaches for graph mining. It introduces basic graph terminology and describes some standard global and local numerical characteristics for describing graph structure. These characteristics are calculated for a toy graph dataset and compared to random graph null models to identify which characteristics have unexpectedly high or low values compared to the random graphs. Clustering methods for graph mining are also outlined but not described in detail.
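Two of the standard characteristics mentioned (density and the local clustering coefficient) computed on a toy graph, as a stdlib-only sketch (ours, not the talk's):

```python
# Toy undirected graph: triangle 1-2-3 with a pendant vertex 4
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

n = len(adj)
m = sum(len(nb) for nb in adj.values()) // 2  # each edge counted twice
density = 2 * m / (n * (n - 1))

def clustering(v):
    # Fraction of pairs of neighbours of v that are themselves connected
    nb = list(adj[v])
    k = len(nb)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nb[j] in adj[nb[i]])
    return 2 * links / (k * (k - 1))

print(density)        # 4 edges on 4 vertices: 2/3
print(clustering(3))  # neighbours {1, 2, 4}: only the pair (1, 2) is linked, so 1/3
```

Comparing such values against their distribution under a random-graph null model, as the talk describes, indicates which features of the observed graph are unexpectedly high or low.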
Inferring networks from multiple samples with consensus LASSO (tuxette)
This document provides a short overview of network inference using graphical Gaussian models (GGMs). It discusses inferring networks from multiple samples, with the motivation being to identify genes that are linked independently or depending on different conditions. A naive approach of performing independent estimations on each sample is described. Joint network inference using the consensus LASSO method is then introduced to better identify common and condition-specific network structures across multiple related samples.
Classification and regression based on derivatives: a consistency result for ... (tuxette)
This document summarizes a presentation on using derivatives for the classification and regression of functions. It discusses using smoothing splines to estimate functions and their derivatives from discretely sampled data. A consistency result is presented: a classifier or regression function built from the estimated derivatives achieves the optimal Bayes risk asymptotically, as the number of sampling points and training examples increases. The key idea is to combine smoothing splines, which consistently estimate functions and their derivatives, with a consistent classifier or regressor applied to the estimated values.
Maximum likelihood estimation of regularisation parameters in inverse problem... (Valentin De Bortoli)
This document discusses an empirical Bayesian approach for estimating regularization parameters in inverse problems using maximum likelihood estimation. It proposes the Stochastic Optimization with Unadjusted Langevin (SOUL) algorithm, which uses Markov chain sampling to approximate gradients in a stochastic projected gradient descent scheme for optimizing the regularization parameter. The algorithm is shown to converge to the maximum likelihood estimate under certain conditions on the log-likelihood and prior distributions.
This document provides an overview of a tutorial on intelligent information gathering and submodular function optimization. The tutorial discusses how many artificial intelligence problems can be formulated as submodular optimization problems, including sensor placement, active learning, and structure learning. It introduces key concepts such as submodular set functions, examples of submodular functions including set cover and mutual information, and properties of submodular functions including closedness under nonnegative linear combinations and the relationship between submodularity and concavity.
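For intuition (an illustrative sketch, not code from the tutorial), the classic greedy algorithm for maximizing a monotone submodular function such as set coverage picks, at each step, the element with the largest marginal gain:

```python
def greedy_cover(sets, k):
    """Pick k sets greedily by marginal coverage gain; for monotone
    submodular objectives this achieves a (1 - 1/e) approximation."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(sets, key=lambda s: len(sets[s] - covered))
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}}
chosen, covered = greedy_cover(sets, 2)
print(chosen)  # ['C', 'A']: C adds 4 new elements, then A adds 3 more
```

The diminishing-returns property of submodular functions (each set's marginal gain can only shrink as coverage grows) is exactly what makes this greedy rule provably near-optimal for problems like sensor placement.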
Image sciences, image processing, image restoration, photo manipulation. Image and video representation. Digital versus analog imagery. Quantization and sampling. Sources and models of noise in digital CCD imagery: photon, thermal and readout noise. Sources and models of blur. Convolutions and point spread functions. Overview of other standard models, problems and tasks: salt-and-pepper and impulse noise, halftoning, inpainting, super-resolution, compressed sensing, high dynamic range imagery, demosaicing. Short introduction to other types of imagery: SAR, sonar, ultrasound, CT and MRI. Linear and ill-posed restoration problems.
This document summarizes Arthur Charpentier's presentation at the Rennes Risk Workshop in April 2015. It discusses extending concepts of risk from univariate to multivariate prospects, including characterizing attitudes to multivariate notions of increasing risk like the Rothschild-Stiglitz mean preserving increase in risk and Quiggin's monotone mean preserving increase in risk. It also generalizes the Bickel-Lehmann dispersion order to multivariate risks and examines its implications for risk sharing.
Several nonlinear models and methods for FDA (tuxette)
This document summarizes several nonlinear models and methods for functional data analysis (FDA), including nonparametric kernel models. It describes the Nadaraya-Watson kernel estimator for regression with functional data. This estimator takes a weighted average of the observed y-values, with weights based on a kernel function of the distance between the observed curves. The document outlines the assumptions needed for the estimator to converge pointwise and uniformly, and states the optimal rates of convergence. It also discusses choosing the kernel and bandwidth parameters and extending the estimator to functional data in Hilbert spaces.
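The Nadaraya-Watson estimator described can be written in a few numpy lines. This is our sketch, with a Gaussian kernel on scalar inputs for simplicity; in the functional setting the kernel argument would be a distance between curves rather than `X - x0`.

```python
import numpy as np

def nadaraya_watson(x0, X, y, h):
    """Kernel-weighted average of the observed y-values; the weights
    decay with the Gaussian-kernel distance between x0 and each X,
    and h is the bandwidth parameter."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return (w @ y) / w.sum()

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=X.size)

# The estimate at 0.25 should sit near the peak sin(pi/2) = 1
print(nadaraya_watson(0.25, X, y, 0.05))
```

The bandwidth `h` plays the role discussed in the document: too small and the estimate follows the noise, too large and it oversmooths, which is why it is usually chosen by cross-validation.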
This document summarizes Chris Swierczewski's general exam presentation on computational applications of Riemann surfaces and Abelian functions. The presentation covered the geometry and algebra of Riemann surfaces, including bases of cycles, holomorphic differentials, and period matrices. Applications discussed include using Riemann theta functions to find periodic solutions to integrable PDEs like the Kadomtsev–Petviashvili equation. The talk also discussed linear matrix representations of algebraic curves and the constructive Schottky problem of realizing a Riemann matrix as the period matrix of a curve.
This document discusses quantile regression and loss functions used in regression analysis, including ordinary least squares (OLS) and quantile regression. It provides mathematical definitions of quantiles, OLS regression using the L2 norm and expected value, and median regression using the L1 norm and median. Examples are given of how OLS regression minimizes the squared errors while median regression minimizes the absolute errors. References to early works on quantiles, regression, and estimation methods are also provided.
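The contrast between the squared-error and quantile criteria can be checked numerically. In this small sketch (our own toy data and helper names, not from the talk), minimizing the pinball loss over a constant predictor recovers an empirical quantile, and the outlier at 100 barely moves it:

```python
def pinball(u, tau):
    """Quantile (pinball) loss for a residual u at level tau."""
    return tau * u if u >= 0 else (tau - 1) * u

def best_constant(ys, loss):
    """Among the observed values, pick the constant minimizing the total loss."""
    return min(sorted(ys), key=lambda c: sum(loss(y - c) for y in ys))

data = [1, 2, 3, 4, 100]                               # one large outlier
print(best_constant(data, lambda u: pinball(u, 0.5)))  # median → 3
print(best_constant(data, lambda u: pinball(u, 0.7)))  # 0.7-quantile → 4
```

With the L2 (squared-error) criterion the fitted constant would instead be pulled toward the mean (22 here), which is the robustness argument for median and quantile regression.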
The document discusses composite infimal convolutions, which combine infimal convolutions and infimal postcompositions. It provides an equation that defines the composite infimal convolution and notes some special cases. It also lists several properties that have already been investigated for this operation, including topological, algebraic, convex analytical, and its amenability to proximal splitting algorithms. It concludes by posing some open questions about norm interpolation properties and refining splitting algorithms for related problems.
This document summarizes a talk given by Yoshihiro Mizoguchi on developing a Coq library for relational calculus. The talk introduces relational calculus and its applications. It describes implementing definitions and proofs about relations, Boolean algebras, relation algebras, and Dedekind categories in Coq. The library provides a formalization of basic notions in relational theory and can be used to formally verify properties of relations and to prove theorems automatically.
This document summarizes Arthur Charpentier's presentation on econometrics and statistical learning techniques. It discusses different perspectives on modeling data, including the causal story, conditional distribution story, and explanatory data story. It also covers topics like high dimensional data, computational econometrics, generalized linear models, goodness of fit, stepwise procedures, and testing in high dimensions. The presentation provides an overview of various statistical and econometric modeling techniques.
Numerical solution of boundary value problems by piecewise analysis method (Alexander Decker)
This document presents a numerical method called Piecewise-Homotopy Analysis Method (P-HAM) for solving fourth-order boundary value problems. P-HAM is based on the Homotopy Analysis Method (HAM) but uses multiple auxiliary parameters, with each parameter applied over a sub-range of the domain for improved accuracy. The document outlines the basic steps of P-HAM, including constructing the zero-order deformation equation and deriving the governing equations. It then applies P-HAM to solve two example problems and compares the results to other numerical methods.
This document summarizes a talk on inference on treatment effects after model selection. It discusses challenges with inferring treatment effects after refitting a model selected via a procedure like lasso. Specifically, refitting can lead to bias due to overfitting or underfitting the model. The document proposes using repeated data splitting to remove the overfitting bias. In each split, part of the data is used for model selection and the other part for estimating treatment effects without overfitting bias. This approach reduces bias compared to simply refitting the full model.
This document provides an introduction to key concepts in probability and statistics for machine learning. It covers topics such as sample spaces, events, axioms of probability, permutations, combinations, conditional probability, Bayes' rule, random variables, probability distributions, expectations, variance, transformations of random variables, jointly distributed random variables, parameter estimation, and the central limit theorem.
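As a worked illustration of Bayes' rule from the topics listed above (the prevalence and test characteristics below are made-up numbers, not from the document):

```python
# P(disease | positive test) via Bayes' rule, for a 1% prevalence disease
# tested with 95% sensitivity and 90% specificity.
prior = 0.01
sensitivity = 0.95        # P(+ | disease)
false_positive = 0.10     # 1 - specificity = P(+ | no disease)

evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence
print(round(posterior, 4))  # → 0.0876
```

Even with a fairly accurate test, a positive result implies less than a 9% chance of disease, because the low prior dominates: a standard illustration of why the prior matters.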
Density theorems for Euclidean point configurations (VjekoslavKovac1)
1. The document discusses density theorems for point configurations in Euclidean space. Density theorems study when a measurable set A contained in Euclidean space can be considered "large".
2. One classical result is that for any measurable set A ⊆ R² of positive upper Banach density, every sufficiently large real number is realized as the distance between some pair of points of A. This has been generalized to higher dimensions and to other point configurations.
3. Open questions remain about determining all point configurations P for which one can show that a sufficiently large measurable set A contained in high dimensional Euclidean space must contain a scaled copy of P.
This document discusses various methods for estimating normalizing constants that arise when evaluating integrals numerically. It begins by noting there are many computational methods for approximating normalizing constants across different communities. It then lists the topics that will be covered in the upcoming workshop, including discussions on estimating constants using Monte Carlo methods and Bayesian versus frequentist approaches. The document provides examples of estimating normalizing constants using Monte Carlo integration, reverse logistic regression, and Xiao-Li Meng's maximum likelihood estimation approach. It concludes by discussing some of the challenges in bringing a statistical framework to constant estimation problems.
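The Monte Carlo route mentioned above can be illustrated in a few lines: plain Monte Carlo integration of an unnormalized Gaussian density recovers its normalizing constant √(2π). This is a toy sketch with our own function names; the methods covered in the talk (reverse logistic regression, Meng's maximum likelihood approach) are considerably more sophisticated:

```python
import math, random

def unnormalized(x):
    """Unnormalized Gaussian density; its normalizing constant is sqrt(2*pi)."""
    return math.exp(-x * x / 2)

def mc_normalizing_constant(n, a=-10.0, b=10.0, seed=0):
    """Plain Monte Carlo: Z ≈ (b - a) * mean of f(U) for U uniform on [a, b]."""
    rng = random.Random(seed)
    total = sum(unnormalized(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

z_hat = mc_normalizing_constant(200_000)
print(z_hat, math.sqrt(2 * math.pi))  # estimate vs. exact value ≈ 2.5066
```

The uniform proposal is deliberately naive; importance sampling with a proposal closer to the integrand would reduce the variance, which is one motivation for the more refined estimators discussed in the workshop.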
Elementary Landscape Decomposition of the Hamiltonian Path Optimization Problem (jfrchicanog)
The document describes research on decomposing optimization problem landscapes into elementary components. It defines key landscape concepts like configuration space, neighborhood operators, and objective functions. It then introduces the idea of elementary landscapes where the objective function is a linear combination of eigenfunctions. The paper discusses decomposing general landscapes into a sum of elementary components and proposes using average neighborhood fitness for selection in non-elementary landscapes. It applies these concepts to the Hamiltonian Path Optimization problem, analyzing the problem's reversals and swaps neighborhoods.
Similar to Interpretable Sparse Sliced Inverse Regression for digitized functional data (20)
Roots at the top and leaves at the bottom: trees in maths (tuxette)
1. The document discusses methods for clustering and differential analysis of Hi-C matrices, which represent the 3D organization of DNA.
2. It proposes extending Ward's hierarchical clustering to directly use Hi-C similarity matrices while enforcing adjacency constraints. A fast algorithm was also developed.
3. A new method called "treediff" was created to perform differential analysis of Hi-C matrices based on the Wasserstein distance between hierarchical clusterings. Software implementations of these methods were also developed.
Kernel methods for the integration of heterogeneous data (tuxette)
The document discusses a presentation about multi-omics data integration methods using kernel methods. The presentation introduces kernel methods, how they can be used to integrate heterogeneous omics data, and examples of applications. Specifically, it discusses using kernel methods to perform unsupervised transformation-based integration of multi-omics data. It also presents an application of constrained kernel hierarchical clustering to analyze Hi-C data by directly using Hi-C matrices as kernels.
Omics data integration methodologies (tuxette)
This document summarizes a presentation on multi-omics data integration methods given by Nathalie Vialaneix on December 13, 2023. The presentation discusses different types of omics data that can be integrated, both vertically across different levels of omics data on the same samples and horizontally across similar types of omics data on different samples. It also discusses different analysis approaches that can be taken, including supervised and unsupervised methods. The rest of the presentation focuses on unsupervised transformation-based integration methods using kernels.
The document discusses current and future work on analyzing Hi-C data and differential analysis of Hi-C matrices. It describes a clustering method developed to partition chromosomes based on Hi-C matrix similarity. It also introduces a new method called treediff for differential analysis of Hi-C data that calculates the distance between hierarchical clusterings. Current work includes reviewing differential analysis methods, investigating differential subtrees with multiple testing control, and inferring chromatin interaction networks.
Can deep learning learn chromatin structure from sequence? (tuxette)
This document discusses a deep learning model called ORCA that can predict chromatin structure from DNA sequence. The model uses a neural network with an encoder to extract features from sequence and a decoder to predict Hi-C matrices. It was trained on Hi-C data from multiple cell types and can predict interactions between regions at various resolutions. The model accurately captures features like CTCF-mediated loops and can predict effects of structural variants on chromatin structure. It allows for in silico mutagenesis to study how mutations may alter 3D genome organization.
Multi-omics data integration methods: kernel and other machine learning appro... (tuxette)
The document discusses multi-omics data integration methods, particularly kernel methods. It describes how kernel methods transform data into similarity matrices between samples rather than relying on variable space. Multiple kernel integration approaches are presented that combine multiple similarity matrices into a consensus kernel in an unsupervised manner, such as through a STATIS-like framework that maximizes the similarity between kernels. Examples of applications to datasets from the TARA Oceans expedition are given.
This document provides an overview of the MetaboWean and Idefics projects. MetaboWean aims to study the co-evolution of gut microbiota and epithelium during suckling-to-weaning transition in rabbits, using metabolomics, metagenomics, and single-cell RNA sequencing data. Idefics integrates multiple omics datasets from human skin samples to understand relationships between microorganisms and molecules and how they are structured in patient groups. The datasets include metagenomics, metabolomics, and proteomics from host and microbiota.
Rserve, renv, flask, Vue.js in a Docker container to integrate omics data ... (tuxette)
ASTERICS is an interactive and integrative data analysis tool for omics data. It uses Rserve and PyRserve with Flask and Vue.js in a Docker container to integrate omics data. The backend uses Rserve and PyRserve with Flask on the server side, while the frontend uses Vue.js. This architecture was chosen for its open source and light design. Data communication between Rserve and PyRserve is limited, requiring an object database. ASTERICS is deployed using three Docker containers for R, Python, and
Machine learning for molecular biology and omics data analysis (tuxette)
This document summarizes a scientific presentation about molecular biology and omics data analysis. The presentation covers topics related to analyzing large omics datasets using methods like kernel methods, graphical models, and neural networks to learn gene regulation networks and predict phenotypes. Key challenges addressed are handling big data, missing values, non-Gaussian data types like counts and compositional data. The goal is to better understand complex biological systems from multi-omics data.
Some preliminary results from the evaluation of r... inference methods (tuxette)
The document summarizes preliminary results from evaluating methods for inferring gene regulatory networks from expression data in Bacillus subtilis. It finds that recall of the known network is generally poor (<20% for random forest), but inferred clusters still retain biological information about common regulators. It plans to confirm results, test restricting edges to sigma factors, and explore other inference methods like Bayesian networks and ARACNE.
Multi-scale omics data integration: kernel methods and other ap... (tuxette)
The document discusses methods for integrating multi-scale omics data using kernel and machine learning approaches. It describes how omics data is large, heterogeneous, and multi-scaled, creating bottlenecks for analysis. Methods discussed for data integration include multiple kernel learning to combine different relational datasets in an unsupervised way. The methods are applied to integrate different datasets from the TARA Oceans expedition to identify patterns in ocean microbial communities. Improving interpretability of the methods and making them more accessible to biological users is discussed.
Journal club: Validation of cluster analysis results on validation data (tuxette)
This document presents a framework for validating cluster analysis results on validation data. It describes situations where clustering is inferential versus descriptive and recommends using validation data separate from the data used for clustering. A typology of validation methods is provided, including validation based on the clustering method or results, and evaluation using internal validation, external validation, visual properties, or stability measures.
The document discusses the differences between overfitting and overparametrization in machine learning models. It explores how random forests may exhibit a phenomenon known as "double descent" where test error initially decreases then increases with more parameters before decreasing again. While double descent has been observed in other models, the document questions whether it is directly due to model complexity in random forests since very large trees may be unable to fully interpolate extremely large datasets.
Selective inference and single-cell differential analysis (tuxette)
This document discusses selective inference and single-cell differential analysis. It introduces the problem of "double dipping" in the standard single-cell analysis pipeline where the same dataset is used for clustering and differential analysis. Two approaches for addressing this are presented: 1) A method that perturbs clusters before testing for differences, and 2) A test based on a truncated distribution that assumes clusters and genes are given separately. Experiments applying these methods to real single-cell datasets are described. The document outlines challenges in extending these approaches to more complex analyses.
SOMbrero: an R package for self-organizing maps (tuxette)
SOMbrero is an R package that implements self-organizing map (SOM) algorithms. It can handle numeric, non-numeric, and relational data. The package contains functions for training SOMs, diagnosing results, and plotting maps. It also includes tools like a shiny app and vignettes to aid users without programming experience. SOMbrero supports missing data imputation and extends SOM to relational datasets through non-Euclidean distance measures.
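The online SOM algorithm that SOMbrero implements can be caricatured on scalar data. This is a stdlib-only sketch with our own names, learning-rate, and neighborhood schedules; SOMbrero itself is an R package with far more machinery (diagnostics, plots, relational data):

```python
import random

def train_som(data, n_units=5, epochs=50, seed=1):
    """Online SOM on scalar data with units arranged on a line (minimal sketch)."""
    rng = random.Random(seed)
    proto = [rng.uniform(min(data), max(data)) for _ in range(n_units)]
    steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):            # shuffled pass over data
            lr = 0.5 * (1 - t / steps)                   # decreasing learning rate
            radius = max(1.0 * (1 - t / steps), 0.01)    # shrinking neighborhood
            bmu = min(range(n_units), key=lambda j: abs(proto[j] - x))
            for j in range(n_units):
                h = 1.0 if abs(j - bmu) <= radius else 0.0
                proto[j] += lr * h * (x - proto[j])      # pull neighbors toward x
            t += 1
    return proto

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
protos = train_som(data)
print([round(p, 1) for p in protos])
```

The prototypes end up spread over the range of the data, with neighboring units mapped to nearby values, which is the topology-preservation property the package's plots are designed to inspect.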
Graph Neural Network for Phenotype Prediction (tuxette)
This document describes a study on using graph neural networks (GNNs) for phenotype prediction from gene expression data. The objectives are to determine if including network information can improve predictions, which network types work best, and if GNNs can learn network inferences. It provides background on GNNs and how they generalize convolutional layers to graph data. The authors implemented a GNN model from previous work as a starting point and tested it on different network types to see which network information is most useful for predictions. Their methodology involves comparing GNN performance to other methods like random forests using 10-fold cross validation.
A short and naive introduction to using networks in prediction models (tuxette)
The document provides an introduction to using network information in prediction models. It discusses representing a network as a graph with a Laplacian matrix. The Laplacian captures properties like random walks on the graph and heat diffusion. Eigenvectors of the Laplacian related to small eigenvalues are strongly tied to graph structure. The document discusses using the Laplacian in prediction models by working in the feature space defined by the Laplacian eigenvectors or directly regularizing a linear model with the Laplacian. This introduces network information and encourages similar contributions from connected nodes. The approaches are applied to problems like predicting phenotypes from gene expression using a known gene network.
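The Laplacian referred to here is L = D − A. A small sketch (our own toy graph, not from the talk) that builds it and checks the identity x^T L x = Σ_{(i,j)∈E} (x_i − x_j)², which is exactly what ties Laplacian regularization to "similar contributions from connected nodes":

```python
# Build the combinatorial Laplacian L = D - A of a small graph and verify
# that x^T L x equals the sum of squared differences over the edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n = 4
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1
deg = [sum(row) for row in A]
L = [[(deg[i] if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]

x = [1.0, 2.0, 0.5, -1.0]
quad = sum(x[i] * sum(L[i][j] * x[j] for j in range(n)) for i in range(n))
edge_sum = sum((x[i] - x[j]) ** 2 for i, j in edges)
print(all(sum(row) == 0 for row in L), quad == edge_sum)  # prints: True True
```

The zero row sums mean the constant vector is an eigenvector with eigenvalue 0; eigenvectors of the next-smallest eigenvalues vary smoothly over the graph, which is why they carry the structural information the talk exploits.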
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS (Sérgio Sacani)
The pathway(s) to seeding the massive black holes (MBHs) that exist at the heart of galaxies in the present and distant Universe remains an unsolved problem. Here we categorise, describe and quantitatively discuss the formation pathways of both light and heavy seeds. We emphasise that the most recent computational models suggest that rather than a bimodal-like mass spectrum between light and heavy seeds, with light at one end and heavy at the other, a continuum exists. Light seeds are more ubiquitous and the heavier seeds become less and less abundant due to the rarer environmental conditions required for their formation. We therefore examine the different mechanisms that give rise to different seed mass spectrums. We show how and why the mechanisms that produce the heaviest seeds are also among the rarest events in the Universe and are hence extremely unlikely to be the seeds for the vast majority of the MBH population. We quantify, within the limits of the current large uncertainties in the seeding processes, the expected number densities of the seed mass spectrum. We argue that light seeds must be at least 10^3 to 10^5 times more numerous than heavy seeds to explain the MBH population as a whole. Based on our current understanding of the seed population this makes heavy seeds (Mseed > 10^3 M⊙) a significantly more likely pathway given that heavy seeds have an abundance pattern that is close to, and likely in excess of, 10^-4 compared to light seeds. Finally, we examine the current state-of-the-art in numerical calculations and recent observations and plot a path forward for near-future advances in both domains.
Microbial interaction
Microorganisms interact with each other and can be physically associated with other organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont, or within another organism as an endobiont.
A microbial interaction may be positive, such as mutualism, proto-cooperation, or commensalism, or negative, such as parasitism, predation, or competition.
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: Amensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as a relationship in which each organism in the interaction benefits from the association. It is an obligatory relationship in which the mutualist and the host are metabolically dependent on each other.
The mutualistic relationship is very specific: one member of the association cannot be replaced by another species.
Mutualism requires close physical contact between the interacting organisms.
The mutualistic relationship allows organisms to exist in habitats that could not be occupied by either species alone.
The mutualistic relationship between organisms allows them to act as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are an excellent example of mutualism.
They are the association of specific fungi and certain genera of algae. In a lichen, the fungal partner is called the mycobiont and the algal partner is called the phycobiont.
II. Syntrophism:
It is an association in which the growth of one organism either depends on, or is improved by, a substrate provided by another organism.
In syntrophism, both organisms in the association benefit.
Compound A → (utilized by population 1) → Compound B → (utilized by population 2) → Compound C → (utilized by both populations 1 and 2) → Products
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B, but cannot metabolize beyond compound B without the co-operation of population 2. Population 2 is unable to utilize compound A, but it can metabolize compound B, forming compound C. Both populations together are then able to carry out the metabolic reactions leading to an end product that neither population could produce alone.
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane production by methanogenic bacteria depends upon interspecies hydrogen transfer from other, fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 from carbohydrates, which are then utilized by methanogenic bacteria (Methanobacter) to produce methane.
ii. Lactobacillus arabinosus and Enterococcus faecalis:
In minimal medium, Lactobacillus arabinosus and Enterococcus faecalis are able to grow together but not alone.
The synergistic relationship between E. faecalis and L. arabinosus occurs because E. faecalis requires folic acid, which is produced by L. arabinosus, while L. arabinosus requires phenylalanine, which is produced by E. faecalis.
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub... (Sérgio Sacani)
Context. The observation of several L-band emission sources in the S cluster has led to a rich discussion of their nature. However, a definitive answer to the classification of the dusty objects requires an explanation for the detection of compact Doppler-shifted Brγ emission. The ionized hydrogen in combination with the observation of mid-infrared L-band continuum emission suggests that most of these sources are embedded in a dusty envelope. These embedded sources are part of the S-cluster, and their relationship to the S-stars is still under debate. To date, the question of the origin of these two populations has been vague, although all explanations favor migration processes for the individual cluster members. Aims. This work revisits the S-cluster and its dusty members orbiting the supermassive black hole SgrA* on bound Keplerian orbits from a kinematic perspective. The aim is to explore the Keplerian parameters for patterns that might imply a nonrandom distribution of the sample. Additionally, various analytical aspects are considered to address the nature of the dusty sources. Methods. Based on the photometric analysis, we estimated the individual H−K and K−L colors for the source sample and compared the results to known cluster members. The classification revealed a noticeable contrast between the S-stars and the dusty sources. To fit the flux-density distribution, we utilized the radiative transfer code HYPERION and implemented a young stellar object Class I model. We obtained the position angle from the Keplerian fit results; additionally, we analyzed the distribution of the inclinations and the longitudes of the ascending node. Results. The colors of the dusty sources suggest a stellar nature consistent with the spectral energy distribution in the near and mid-infrared domains. Furthermore, the evaporation timescales of dusty and gaseous clumps in the vicinity of SgrA* are much shorter (≲ 2 yr) than the epochs covered by the observations (≈ 15 yr).
In addition to the strong evidence for the stellar classification of the D-sources, we also find a clear disk-like pattern following the arrangements of S-stars proposed in the literature. Furthermore, we find a global intrinsic inclination for all dusty sources of 60 ± 20°, implying a common formation process. Conclusions. The pattern of the dusty sources manifested in the distribution of the position angles, inclinations, and longitudes of the ascending node strongly suggests two different scenarios: the main-sequence stars and the dusty stellar S-cluster sources share a common formation history or migrated with a similar formation channel in the vicinity of SgrA*. Alternatively, the gravitational influence of SgrA* in combination with a massive perturber, such as a putative intermediate mass black hole in the IRS 13 cluster, forces the dusty objects and S-stars to follow a particular orbital arrangement. Key words: stars: black holes – stars: formation – Galaxy: center – galaxies: star formation
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi... (Sérgio Sacani)
We present the JWST discovery of SN 2023adsy, a transient object located in the host galaxy JADES-GS+53.13485−27.82088 with a host spectroscopic redshift of 2.903 ± 0.007. The transient was identified in deep James Webb Space Telescope (JWST)/NIRCam imaging from the JWST Advanced Deep Extragalactic Survey (JADES) program. Photometric and spectroscopic followup with NIRCam and NIRSpec, respectively, confirm the redshift and yield UV-NIR light-curve, NIR color, and spectroscopic information all consistent with a Type Ia classification. Despite its classification as a likely SN Ia, SN 2023adsy is both fairly red (E(B−V) ∼ 0.9), despite a host galaxy with low extinction, and has a high Ca II velocity (19,000 ± 2,000 km/s) compared to the general population of SNe Ia. While these characteristics are consistent with some Ca-rich SNe Ia, particularly SN 2016hnk, SN 2023adsy is intrinsically brighter than the low-z Ca-rich population. Although such an object is too red for any low-z cosmological sample, we apply a fiducial standardization approach to SN 2023adsy and find that the SN 2023adsy luminosity distance measurement is in excellent agreement (≲ 1σ) with ΛCDM. Therefore, unlike low-z Ca-rich SNe Ia, SN 2023adsy is standardizable and gives no indication that SN Ia standardized luminosities change significantly with redshift. A larger sample of distant SNe Ia is required to determine if SN Ia population characteristics at high-z truly diverge from their low-z counterparts, and to confirm that standardized luminosities nevertheless remain constant with redshift.
PPT on Alternate Wetting and Drying presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
TOPIC OF DISCUSSION: CENTRIFUGATION SLIDESHARE.pptx (shubhijain836)
Centrifugation is a powerful technique used in laboratories to separate components of a heterogeneous mixture based on their density. This process utilizes centrifugal force to rapidly spin samples, causing denser particles to migrate outward more quickly than lighter ones. As a result, distinct layers form within the sample tube, allowing for easy isolation and purification of target substances.
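The spinning described above is usually quantified through the relative centrifugal force, using the standard formula RCF = 1.118 × 10⁻⁵ × r(cm) × rpm². A small sketch (the rotor radius and speed in the example are our own illustrative values):

```python
def rcf(radius_cm, rpm):
    """Relative centrifugal force (in multiples of g) from rotor radius and speed."""
    return 1.118e-5 * radius_cm * rpm ** 2

# e.g. a 10 cm rotor spun at 3000 rpm:
print(round(rcf(10, 3000)))  # → 1006
```

Because RCF grows with the square of the rotational speed, doubling the rpm quadruples the separating force, which is why protocols specify ×g rather than rpm alone.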
PPT on Sustainable Land Management presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Anti-Universe And Emergent Gravity and the Dark Universe (Sérgio Sacani)
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
Interpretable Sparse Sliced Inverse Regression for digitized functional data
1. Interpretable Sparse Sliced Inverse Regression for
digitized functional data
Victor Picheny, Rémi Servien & Nathalie Villa-Vialaneix
nathalie.villa@toulouse.inra.fr
http://www.nathalievilla.org
Seminar, Institut de Mathématiques de Bordeaux
8 April 2016
Nathalie Villa-Vialaneix | IS-SIR 1/26
2. Outline
1 Background and motivation
2 Presentation of SIR
3 Our proposal
4 Simulations
8. A typical case study: meta-model in agronomy
climate (daily time series: rain, temperature, ...) → [Agronomic model] → plant phenotype predictions (yield, N leaching, ...)
Agronomic model:
based on biological and chemical knowledge;
computationally expensive to use;
useful for realistic predictions but not to understand the link between the inputs and the outputs.
Metamodeling: train a simplified, fast and interpretable model which can be used as a proxy for the agronomic model.
9. A first case study: SUNFLO [Casadebaig et al., 2011]
Inputs: 5 daily time series (length: one year) and 8 phenotypes for different sunflower types
Output: sunflower yield
Data: 1000 sunflower types × 190 climatic series (different places and years), i.e., n = 190 000 observations of variables in R^(5×183) × R^8
11. Main facts obtained from a preliminary study
R. Kpekou internship
The study focused on the influence of the climate on the yield: 5 functional variables digitized at 183 points.
Main result: using summaries of the variables (mean, sd, ...) over several weeks, together with an automatic aggregation procedure in a random forest, gave good prediction accuracy.
14. Question and mathematical framework
A functional regression problem: X: random variable (functional) & Y:
random real variable
E(Y|X)?
Data: n i.i.d. observations (xi, yi)i=1,...,n.
xi is not perfectly known but sampled at (fixed) points
xi = (xi(t1), . . . , xi(tp))T
∈ Rp
. We denote: X =
xT
1
...
xT
n
.
Question: Find a model which is easily interpretable and points out
relevant intervals for the prediction within the range of X.
Nathalie Villa-Vialaneix | IS-SIR 7/26
15. Related works (variable selection in FDA)
LASSO / L1 regularization in linear models: [Ferraty et al., 2010, Aneiros and Vieu, 2014] (isolated evaluation points), [Matsui and Konishi, 2011] (selects elements of an expansion basis), [James et al., 2009] (sparsity on derivatives: piecewise constant predictors)
[Fraiman et al., 2015]: blinding approach usable for various problems (PCA, regression...)
[Gregorutti et al., 2015]: adaptation of the variable importance in random forests to groups of variables
Our proposal: a semi-parametric (not entirely linear) model which selects relevant intervals, combined with an automatic procedure to define the intervals.
17. Outline
1 Background and motivation
2 Presentation of SIR
3 Our proposal
4 Simulations
18. SIR in multidimensional framework
SIR: a semi-parametric regression model for X ∈ R^p:
Y = F(a_1^T X, ..., a_d^T X, ε)
for a_1, ..., a_d ∈ R^p (to be estimated), F: R^(d+1) → R unknown, and ε an error independent of X.
Standard assumption for SIR:
Y ⊥ X | P_A(X),
in which A is the so-called EDR space, spanned by (a_k)_{k=1,...,d}.
20. Estimation
Equivalence between SIR and eigendecomposition
A is included in the space spanned by the first d Σ-orthogonal eigenvectors of the generalized eigendecomposition problem
Γa = λΣa,
with Σ = E[(X − E(X))(X − E(X))^T] (the covariance of X) and Γ = E[(E(X|Y) − E(X))(E(X|Y) − E(X))^T] (the covariance of E(X|Y)).
Estimation (when n > p):
compute X̄ = (1/n) Σ_{i=1}^n x_i and Σ̂ = (1/n) (X − 1_n X̄^T)^T (X − 1_n X̄^T);
split the range of Y into H different slices τ_1, ..., τ_H and estimate Ê(X|Y) as the (H × p) matrix with rows X̄_h = (1/n_h) Σ_{i: y_i ∈ τ_h} x_i, h = 1, ..., H, with n_h = |{i: y_i ∈ τ_h}|, and Γ̂ = Ê(X|Y)^T D Ê(X|Y) with D = Diag(n_1/n, ..., n_H/n);
solving the eigendecomposition problem Γ̂a = λΣ̂a gives the eigenvectors a_1, ..., a_d ⇒ Â = (a_1, ..., a_d), a (p × d) matrix.
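The estimation steps above can be sketched in a few lines of NumPy/SciPy (a minimal illustration, not the talk's code; equal-size slices are an assumption):

```python
import numpy as np
from scipy.linalg import eigh

def sir(X, y, H=10, d=2):
    """Basic SIR estimation (n > p case): slice the range of Y, estimate
    Sigma-hat and Gamma-hat, then solve Gamma a = lambda Sigma a."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                     # centered design matrix
    Sigma = Xc.T @ Xc / n                       # Sigma-hat (empirical covariance)
    order = np.argsort(y)                       # H slices of ~equal size
    M = np.zeros((H, p))                        # slice means of (centered) X
    w = np.zeros(H)                             # slice proportions n_h / n
    for h, idx in enumerate(np.array_split(order, H)):
        M[h] = Xc[idx].mean(axis=0)
        w[h] = len(idx) / n
    Gamma = M.T @ (w[:, None] * M)              # Gamma-hat = E-hat(X|Y)^T D E-hat(X|Y)
    vals, vecs = eigh(Gamma, Sigma)             # generalized eigendecomposition
    return vecs[:, np.argsort(vals)[::-1][:d]]  # first d EDR directions, (p x d)
```

With a monotone link and a single index, the leading eigenvector recovers the true direction up to scaling.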
24. Equivalent formulations
SIR as a regression problem: [Li and Yin, 2008] shows that SIR is equivalent to the (double) minimization of
E(A, C) = Σ_{h=1}^H p̂_h ‖X̄_h − X̄ − Σ̂ A C_h‖²
for X̄_h = (1/n_h) Σ_{i: y_i ∈ τ_h} x_i, A a (p × d) matrix and each C_h a vector in R^d.
Rk: Given A, each C_h is obtained as the solution of an ordinary least squares problem...
SIR as a canonical correlation problem: [Li and Nachtsheim, 2008] shows that SIR rewrites as the double optimization problem max_{a_j, φ} Cor(φ(Y), a_j^T X), where φ is any function R → R and the (a_j)_j are Σ-orthonormal.
Rk: The solution is shown to satisfy φ(y) = a_j^T E(X|Y = y), and a_j is also obtained as the solution of the mean square error problem
min_{a_j} E[(φ(Y) − a_j^T X)²].
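The remark that, given A, C is an ordinary least squares solution can be made concrete. The sketch below (my own illustration under the notation above, not code from the talk) solves each C_h by least squares with the common design B = Σ̂A:

```python
import numpy as np

def optimal_C(Xbar_h, Xbar, Sigma, A):
    """Given A, each C_h minimizes || Xbar_h - Xbar - Sigma A C_h ||^2,
    an ordinary least-squares problem with common design B = Sigma A."""
    B = Sigma @ A                               # (p x d) design shared by all slices
    return np.array([np.linalg.lstsq(B, m - Xbar, rcond=None)[0] for m in Xbar_h])
```

Alternating this step with the update of A is what makes the regression formulation practical.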
28. SIR in large dimensions: problem
In large dimensions (or in Functional Data Analysis), n < p, so Σ̂ is ill-conditioned and has no inverse ⇒ Z = (X − 1_n X̄^T) Σ̂^(−1/2) cannot be computed.
Different solutions have been proposed in the literature, based on:
prior dimension reduction (e.g., PCA) [Ferré and Yao, 2003] (in the framework of FDA)
regularization (ridge...) [Li and Yin, 2008, Bernard-Michel et al., 2008]
sparse SIR [Li and Yin, 2008, Li and Nachtsheim, 2008, Ni et al., 2005]
30. SIR in large dimensions: ridge penalty / L2-regularization of Σ̂
Following [Li and Yin, 2008], which shows that SIR is equivalent to the minimization of E(A, C), [Bernard-Michel et al., 2008] propose to add a ridge penalty in a high-dimensional setting:
E_2(A, C) = Σ_{h=1}^H p̂_h ‖X̄_h − X̄ − Σ̂ A C_h‖² + µ2 Σ_{h=1}^H p̂_h ‖A C_h‖²
They also show that this problem is equivalent to finding the eigenvectors of the generalized eigenvalue problem
Γ̂a = λ(Σ̂ + µ2 I_p)a.
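The regularized eigenproblem can be solved directly; here is a minimal sketch (my illustration), where the ridge term keeps the problem well posed even when Σ̂ is singular:

```python
import numpy as np
from scipy.linalg import eigh

def ridge_sir_directions(Gamma, Sigma, mu2, d):
    """Ridge SIR: eigenvectors of Gamma a = lambda (Sigma + mu2 I_p) a.
    The right-hand-side matrix is positive definite for any mu2 > 0,
    so the generalized eigendecomposition always exists."""
    p = Sigma.shape[0]
    vals, vecs = eigh(Gamma, Sigma + mu2 * np.eye(p))
    return vecs[:, np.argsort(vals)[::-1][:d]]  # first d directions
```

Even with a rank-deficient Σ̂ (the n < p situation), the leading direction is recovered.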
33. SIR in large dimensions: sparse versions
Specific issue when introducing sparsity in SIR: sparsity has to be put on a multiple-index model. Most authors use shrinkage approaches.
First version: sparse penalization of the ridge solution. If (Â, Ĉ) are the solutions of the ridge SIR described on the previous slide, [Ni et al., 2005, Li and Yin, 2008] propose to shrink this solution by minimizing
E_{s,1}(α) = Σ_{h=1}^H p̂_h ‖X̄_h − X̄ − Σ̂ Diag(α) Â Ĉ_h‖² + µ1 ‖α‖_{L1}
(regression formulation of SIR).
Second version: [Li and Nachtsheim, 2008] derive the sparse optimization problem from the correlation formulation of SIR:
min_{a_j^s} Σ_{i=1}^n (P_{â_j}(X|y_i) − (a_j^s)^T x_i)² + µ_{1,j} ‖a_j^s‖_{L1},
in which P_{â_j} is the projection of Ê(X|Y = y_i) = X̄_h onto the space spanned by the solution of the ridge problem.
35. Characteristics of the different approaches and possible extensions
                          [Li and Yin, 2008]       [Li and Nachtsheim, 2008]
sparsity on               shrinkage coefficients   estimates
nb of optimization pbs    1                        d
sparsity                  common to all dims       specific to each dim
Extension to block-sparse SIR (as in PCA)?
37. Outline
1 Background and motivation
2 Presentation of SIR
3 Our proposal
4 Simulations
38. IS-SIR: a two step approach
Background: back in the functional setting, we suppose that t_1, ..., t_p are split into D intervals I_1, ..., I_D.
First step: solve the ridge problem on the digitized functions (viewed as high-dimensional vectors) to obtain Â and Ĉ:
min_{A,C} Σ_{h=1}^H p̂_h ‖X̄_h − X̄ − Σ̂ A C_h‖² + µ2 Σ_{h=1}^H p̂_h ‖A C_h‖²
Second step: sparse shrinkage using the intervals. If P_Â(E(X|Y = y_i)) = (X̄_h − X̄)^T Â for the h such that y_i ∈ τ_h, and if P_i = (P_i^1, ..., P_i^d)^T and P^j = (P_1^j, ..., P_n^j)^T, we solve
argmin_{α ∈ R^D} Σ_{j=1}^d ‖P^j − (X Δ(â_j)) α‖² + µ1 ‖α‖_{L1}
with Δ(â_j) the (p × D) matrix whose entry in row l, column k is â_{jl} if t_l ∈ I_k, and 0 otherwise.
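The second step above can be sketched as follows (my own illustration: intervals are assumed to be given as lists of point indices, and a plain ISTA loop stands in for glmnet, which the talk actually uses):

```python
import numpy as np

def interval_design(a_j, intervals, p):
    """Delta(a_j): (p x D) matrix whose entry in row l, column k is a_j[l]
    when t_l belongs to interval I_k, and 0 otherwise."""
    Delta = np.zeros((p, len(intervals)))
    for k, idx in enumerate(intervals):
        Delta[idx, k] = a_j[idx]
    return Delta

def lasso_ista(Z, y, mu1, n_iter=2000):
    """Minimal ISTA solver for ||y - Z alpha||^2 / (2n) + mu1 ||alpha||_1."""
    n = len(y)
    L = np.linalg.norm(Z, 2) ** 2 / n           # Lipschitz constant of the gradient
    alpha = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        a = alpha - Z.T @ (Z @ alpha - y) / (n * L)   # gradient step
        alpha = np.sign(a) * np.maximum(np.abs(a) - mu1 / L, 0.0)  # soft threshold
    return alpha

def is_sir_shrinkage(X, P, A_hat, intervals, mu1):
    """Second IS-SIR step: a single LASSO in alpha over the D intervals,
    stacking the d regression problems P^j ~ (X Delta(a_hat_j)) alpha."""
    n, p = X.shape
    d = A_hat.shape[1]
    Z = np.vstack([X @ interval_design(A_hat[:, j], intervals, p) for j in range(d)])
    return lasso_ista(Z, np.concatenate([P[:, j] for j in range(d)]), mu1)
```

The output is one shrinkage coefficient per interval, so a zero coefficient removes an entire interval from every EDR direction at once.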
41. IS-SIR: characteristics
uses the approach based on the correlation formulation (because the dimensionality of the optimization problem is smaller);
uses a shrinkage approach and optimizes the shrinkage coefficients in a single optimization problem;
handles the functional setting by penalizing entire intervals and not just isolated points.
42. Parameter estimation
H (number of slices): SIR is usually known to be not very sensitive to the number of slices (> d + 1). We took H = 10 (i.e., 10/30 observations per slice);
µ2 and d (ridge estimate Â):
- L-fold CV for µ2 (for a d_0 large enough). Note that GCV as described in [Li and Yin, 2008] cannot be used, since the current version of the L2 penalty involves the use of an estimate of Σ^(−1);
- using again L-fold CV, ∀ d = 1, ..., d_0, an estimate of R(d) = d − E[Tr(Π_d Π̂_d)], in which Π_d and Π̂_d are the projector onto the first d dimensions of the EDR space and its estimate, is derived similarly as in [Liquet and Saracco, 2012]. The evolution of R̂(d) versus d is studied to select a relevant d.
µ1 (LASSO): glmnet is used, and µ1 is selected by CV along the regularization path.
46. An automatic approach to define intervals
1 Initial state: ∀ k = 1, ..., p, τ_k = {t_k}
2 Iterate:
- along the regularization path, select three values of µ1: P% of the coefficients are zero, P% of the coefficients are non-zero, best GCV;
- define D− ("strong zeros") and D+ ("strong non-zeros");
- merge consecutive "strong zeros" (resp. "strong non-zeros"), or "strong zeros" (resp. "strong non-zeros") separated by a small number of intervals of undetermined type;
until no more merges can be performed.
3 Output: a collection of models (the first with p intervals, the last with 1), M*_D (optimal for GCV) and the corresponding GCV_D versus D (number of intervals).
Final solution: minimize GCV_D over D.
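One merging pass of the iteration above can be sketched as follows (my illustration: labels '+', '-', '?' mark strong non-zeros, strong zeros and undetermined intervals, and max_gap is an assumed tolerance for the "few undetermined intervals" rule):

```python
def merge_intervals(labels, max_gap=1):
    """One merging pass: short undetermined runs (length <= max_gap) squeezed
    between two identical strong labels are relabelled, then consecutive
    identical labels are merged into blocks (label, number of intervals)."""
    labels = list(labels)
    k = 0
    while k < len(labels):
        if labels[k] == '?':
            j = k
            while j < len(labels) and labels[j] == '?':   # run of '?' at k..j-1
                j += 1
            if 0 < k and j < len(labels) and labels[k - 1] == labels[j] \
                    and j - k <= max_gap:
                for i in range(k, j):                     # absorb the short run
                    labels[i] = labels[j]
            k = j
        else:
            k += 1
    blocks = []                                           # merge equal neighbours
    for lab in labels:
        if blocks and blocks[-1][0] == lab:
            blocks[-1][1] += 1
        else:
            blocks.append([lab, 1])
    return [(lab, n) for lab, n in blocks]
```

Repeating such passes until nothing changes yields the decreasing collection of interval models from which GCV_D picks the final D.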
52. Outline
1 Background and motivation
2 Presentation of SIR
3 Our proposal
4 Simulations
53. Simulation framework
Data generated with Y = Σ_{j=1}^d log⟨X, a_j⟩, with X(t) = Z(t) + ε, in which Z is a Gaussian process with mean µ(t) = −5 + 4t − 4t² and the Matérn 3/2 covariance function with parameters σ = 0.1 and θ = 0.2/√3, and ε is a centered Gaussian variable, independent of Z, with standard deviation 0.1;
a_j(t) = sin((2 + j)πt/2 − (j − 1)π/3) 1_{I_j}(t);
two models: for (M1), d = 1 and I_1 = [0.2, 0.4]; for (M2), d = 3 and I_1 = [0, 0.1], I_2 = [0.5, 0.65], I_3 = [0.65, 0.78].
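A sketch of this data-generating process (my own illustration, not the talk's code: the inner products ⟨X, a_j⟩ are approximated by Riemann sums, the Matérn parameterization is an assumption, and the absolute value inside the log is added to keep the sketch well defined):

```python
import numpy as np

def matern32(s, t, sigma=0.1, theta=0.2 / np.sqrt(3)):
    """Matern 3/2 covariance with the parameters of the simulation setting
    (assumed form: sigma^2 (1 + r) exp(-r) with r = sqrt(3)|s - t| / theta)."""
    r = np.sqrt(3) * np.abs(s - t) / theta
    return sigma**2 * (1 + r) * np.exp(-r)

def simulate(n, p=100, intervals=((0.2, 0.4),), seed=0):
    """Model (M1)-style data: X(t) = Z(t) + eps, Z Gaussian with mean
    mu(t) = -5 + 4t - 4t^2, a_j(t) = sin((2+j) pi t / 2 - (j-1) pi / 3) on I_j."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, p)
    K = matern32(t[:, None], t[None, :])
    Z = rng.multivariate_normal(-5 + 4 * t - 4 * t**2, K, size=n)
    X = Z + 0.1 * rng.normal(size=(n, p))       # eps: centered Gaussian, sd 0.1
    A = np.zeros((p, len(intervals)))
    for j, (lo, hi) in enumerate(intervals, start=1):
        mask = (t >= lo) & (t <= hi)
        A[mask, j - 1] = np.sin((2 + j) * np.pi * t[mask] / 2 - (j - 1) * np.pi / 3)
    y = np.log(np.abs(X @ A) / p).sum(axis=1)   # Riemann-sum inner products
    return t, X, A, y
```

By construction, each a_j is supported on its interval I_j only, which is exactly what IS-SIR is expected to recover.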
63. Conclusion
IS-SIR:
sparse dimension reduction model adapted to the functional framework;
fully automated definition of relevant intervals in the range of the predictors.
Perspectives:
application to real data;
block-wise sparse SIR?
65. References
Aneiros, G. and Vieu, P. (2014). Variable selection in infinite-dimensional problems. Statistics and Probability Letters, 94:12–20.
Bernard-Michel, C., Gardes, L., and Girard, S. (2008). A note on sliced inverse regression with regularizations. Biometrics, 64(3):982–986.
Casadebaig, P., Guilioni, L., Lecoeur, J., Christophe, A., Champolivier, L., and Debaeke, P. (2011). SUNFLO, a model to simulate genotype-specific performance of the sunflower crop in contrasting environments. Agricultural and Forest Meteorology, 151(2):163–178.
Ferraty, F., Hall, P., and Vieu, P. (2010). Most-predictive design points for functional data predictors. Biometrika, 97(4):807–824.
Ferré, L. and Yao, A. (2003). Functional sliced inverse regression analysis. Statistics, 37(6):475–488.
Fraiman, R., Gimenez, Y., and Svarc, M. (2015). Feature selection for functional data. Journal of Multivariate Analysis. In press.
Gregorutti, B., Michel, B., and Saint-Pierre, P. (2015). Grouped variable importance with random forests and application to multiple functional data analysis. Computational Statistics and Data Analysis, 90:15–35.
James, G., Wang, J., and Zhu, J. (2009). Functional linear regression that's interpretable. Annals of Statistics, 37(5A):2083–2108.
Li, L. and Nachtsheim, C. (2008). Sparse sliced inverse regression. Technometrics, 48(4):503–510.
Li, L. and Yin, X. (2008). Sliced inverse regression with regularizations. Biometrics, 64:124–131.
Liquet, B. and Saracco, J. (2012). A graphical tool for selecting the number of slices and the dimension of the model in SIR and SAVE approaches. Computational Statistics, 27(1):103–125.
Matsui, H. and Konishi, S. (2011). Variable selection for functional regression models via the L1 regularization. Computational Statistics and Data Analysis, 55(12):3304–3310.
Ni, L., Cook, D., and Tsai, C. (2005). A note on shrinkage sliced inverse regression. Biometrika, 92(1):242–247.