Random subspace with trees for feature selection
under memory constraints
Antonio Sutera
Dept. of EECS, University of Liège, Belgium
Benelearn 2016,
Kortrijk, Belgium
September 12, 2016
Pierre Geurts, Louis Wehenkel (ULg),
Gilles Louppe (CERN & NYU)
Célia Châtel (Luminy)
1 / 15
Background: Ensemble of randomized trees
 Good classification method
2 / 15
Background: Ensemble of randomized trees for feature
selection
 Good classification method useful for feature selection
[Figure: ensemble of randomized trees 𝜑1, 𝜑2, ..., 𝜑M]
The importance of variable Xm for an ensemble of NT trees is given by:

Imp(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T : v(t) = X_m} p(t)\,\Delta i(t)

where p(t) = N_t / N and ∆i(t) is the impurity reduction at node t:

\Delta i(t) = i(t) - \frac{N_{t_L}}{N_t}\, i(t_L) - \frac{N_{t_R}}{N_t}\, i(t_R)
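For illustration (not part of the original slides), such MDI importances can be obtained directly from scikit-learn tree ensembles, whose feature_importances_ attribute implements this mean decrease of impurity; the dataset below is a synthetic stand-in:

```python
# Minimal sketch (assumed setup, not from the slides): MDI-style importance
# ranking with scikit-learn. X, y are a synthetic stand-in for a real dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# feature_importances_ is scikit-learn's mean-decrease-of-impurity score,
# i.e. the Imp(X_m) quantity above, normalized to sum to 1.
forest = ExtraTreesClassifier(n_estimators=500, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
for m in ranking[:10]:
    print(f"feature {m}: {forest.feature_importances_[m]:.4f}")
```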
Variable ranking by tree-based methods

feat1     feat2      ...   featm      Class
191.63    -128.29    ...   -107.59    0
241.07     44.47     ...     96.56    ...
179.17     -3.69     ...     56.67    0
...        ...       ...     ...      1
120.26    -30.47     ...     42.81    1

⇓

[Bar chart: %info importance scores for features ranked f15, f4, f10, f8, f9, f20, f11, f1, f13, f2, f12, f14, f3, f16, f17, f6, f18, f19, f7, f5]
3 / 15
Background: Feature relevance (Kohavi and John, 1997)
[Venn diagram: the feature set V split into irrelevant, weakly relevant, and strongly relevant features, with Markov boundary M]
Given an output Y and a set of input variables V, X ∈ V is
relevant iff ∃B ⊆ V such that Y ⊥̸⊥ X | B (i.e., Y and X are dependent given B),
irrelevant iff ∀B ⊆ V: Y ⊥⊥ X | B,
strongly relevant iff Y ⊥̸⊥ X | V \ {X},
weakly relevant iff X is relevant but not strongly relevant.
A Markov boundary is a minimal-size subset M ⊆ V such that Y ⊥⊥ V \ M | M.
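A small worked example (added for illustration, not from the slides): let Y = X1 ⊕ X2 with X1, X2 independent fair binary variables, X3 = X1 an exact copy, and X4 independent noise. Then X2 is strongly relevant (it stays informative given all other variables), X1 and X3 are only weakly relevant (each becomes redundant once the other is known), and X4 is irrelevant. Both {X1, X2} and {X2, X3} are Markov boundaries; because this distribution is not strictly positive, the Markov boundary is not unique.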
4 / 15
Background: Feature selection (Nilsson et al., 2007)
[Venn diagram: the feature set V split into irrelevant, weakly relevant, and strongly relevant features, with Markov boundary M]
Two different feature selection problems:
Minimal-optimal: find a Markov boundary for the output Y .
All-relevant: find all relevant features.
5 / 15
Random forests, variable importance and feature selection
Main results
In asymptotic conditions (infinite sample size and number of trees):
K = 1: Unpruned totally randomized trees solve the all-relevant feature selection problem.
K > 1: In the case of strictly positive distributions, non-random trees always find a superset F of the minimal-optimal solution whose size decreases with K.
[Venn diagram: V with irrelevant and strongly relevant features, and found subsets F1, F2, ..., Fp]
6 / 15
Motivation
Our objective: Design more efficient feature selection procedures
based on random forests
We address large-scale feature selection problems where one cannot assume that all variables can be stored in memory
We study and improve ensembles of trees grown from random
subsets of features
7 / 15
Random subspace for feature selection
Simplistic memory-constrained setting: we cannot grow trees with more than q features
Straightforward ensemble solution: Random Subspace (RS)
Train each ensemble tree from a random subset of q features
1. Repeat T times:
1.1 Let Q be a subset of q features randomly selected in V
1.2 Grow a tree only using features in Q (with randomization K)
2. Compute importance Impq,T (X) for all X
Proposed e.g. by (Ho, 1998) for accuracy improvement, by (Louppe and
Geurts, 2012) for handling large datasets and by (Draminski et al., 2010,
Konukoglu and Ganz, 2014) for feature selection
Let us study the population version of this algorithm.
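As an illustration (not the authors' code), a minimal Python sketch of this RS procedure, using scikit-learn trees as the base learner and MDI importances; the function and parameter names are assumptions:

```python
# Illustrative sketch of the Random Subspace (RS) importance procedure.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def random_subspace_importances(X, y, q=50, T=100, K=1, random_state=0):
    rng = np.random.RandomState(random_state)
    p = X.shape[1]
    imp = np.zeros(p)
    for _ in range(T):
        Q = rng.choice(p, size=q, replace=False)      # 1.1 random subset of q features
        tree = ExtraTreesClassifier(n_estimators=1, max_features=K,
                                    random_state=rng)
        tree.fit(X[:, Q], y)                          # 1.2 grow one tree on Q only
        imp[Q] += tree.feature_importances_           # accumulate MDI importances
    return imp / T                                    # 2. Imp_{q,T}(X) for all X
```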
8 / 15
RS for feature selection: study
Asymptotic guarantees:
Def. deg(X), with X relevant, is the size of the smallest B ⊆ V such that Y ⊥̸⊥ X | B
K = 1: If deg(X) < q for all relevant variables X: X is relevant iff Impq(X) > 0
K ≥ 1: If there are q or fewer relevant variables: X strongly relevant ⇒ Impq(X) > 0
Drawback: RS requires many trees to find high degree variables
E.g.: p = 10000, q = 50, k = 1 ⇒ \binom{p-k-1}{q-k-1} / \binom{p}{q} = 2.5 · 10^{-5}. On average, at least T = 40812 trees are required to find X.
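These numbers can be checked with a short, self-contained computation (a sketch added here for illustration; the probability is that of drawing X together with its k conditioning variables in one random subset of size q):

```python
# Sketch: probability that a variable X of degree k lands in the same
# random subset of q features as its k conditioning variables, and the
# expected number of trees needed to see this at least once.
from math import comb

p, q, k = 10000, 50, 1
prob = comb(p - k - 1, q - k - 1) / comb(p, q)
print(prob)             # ~2.45e-05
print(round(1 / prob))  # ~40812 trees on average
```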
9 / 15
Sequential Random Subspace (SRS)
Proposed algorithm:
1. Let F = ∅
2. Repeat T times:
2.1 Let Q = R ∪ C, where:
R is a subset of min{αq, |F|} features randomly taken from F
C is a subset of q − |R| features randomly selected in V \ R
2.2 Grow a tree only using features in Q
2.3 Add to F all features that get non-zero importance
3. Return F
[Diagram: memory Q of size q split into R (up to αq features drawn from F) and C (features drawn from outside F)]
Compared to RS: fill a fraction α of the memory with previously found relevant variables and a fraction 1 − α with randomly selected variables.
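In the same spirit as the RS sketch above, a minimal illustrative Python sketch of SRS; the function name, the scikit-learn base learner, and the handling of the initially empty F are my own choices, not the authors' implementation:

```python
# Illustrative sketch of Sequential Random Subspace (SRS).
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def sequential_random_subspace(X, y, q=50, T=100, alpha=0.5, K=1, random_state=0):
    rng = np.random.RandomState(random_state)
    p = X.shape[1]
    F = np.array([], dtype=int)                          # 1. F = empty set
    for _ in range(T):                                   # 2. repeat T times
        n_R = min(int(alpha * q), len(F))
        R = (rng.choice(F, size=n_R, replace=False)      # 2.1 R: features kept from F
             if n_R > 0 else np.array([], dtype=int))
        rest = np.setdiff1d(np.arange(p), R)
        C = rng.choice(rest, size=q - n_R, replace=False)  # C: fresh features from V \ R
        Q = np.concatenate([R, C])
        tree = ExtraTreesClassifier(n_estimators=1, max_features=K,
                                    random_state=rng).fit(X[:, Q], y)  # 2.2 grow a tree on Q
        found = Q[tree.feature_importances_ > 0]         # 2.3 non-zero importance
        F = np.union1d(F, found)
    return F                                             # 3. return F
```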
10 / 15
SRS for feature selection: study
Asymptotic guarantees: similar to RS if all relevant variables can fit into memory.
Convergence: SRS requires far fewer trees than RS in most cases.
For example,
[Illustration: example with variables X1, ..., X5 and a numerical simulation of the number of trees needed]
11 / 15
Experiments: results in feature selection
Dataset: Madelon (Guyon et al., 2007)
1500 samples (|LS|=1000, |TS|=500)
500 features, of which 20 are relevant (5 features that define Y, 5 random linear combinations of the first 5, and 10 noisy copies of the first 10)
[Plot: F-measure vs. number of iterations (0 to 4000) for RS (alpha=0) and SRS (alpha=1.0)]
Parameter: q = 50
12 / 15
Experiments: results in prediction
[Plot: accuracy vs. number of iterations (0 to 10000) for RS (alpha=0) and SRS (alpha=1.0)]
Parameter: q = 50
After 10000
trees/iterations:
RF (K = max): 0.81
RF (K = q): 0.70
RS : 0.68
SRS: 0.84
13 / 15
Conclusions
Future work on SRS:
The good performance of SRS is confirmed on other datasets, but more experiments are needed.
How to dynamically adapt K and α to improve correctness and
convergence?
Parallelization of each step or of the global procedure
Conclusion:
In most cases, accumulating relevant features speeds up the
discovery of new relevant features while improving the accuracy.
14 / 15
References
Célia Châtel, Sélection de variables à grande échelle à partir de forêtes aléatoires,
Master’s thesis, École Centrale de Marseille/Université de Liège, 2015.
Gilles Louppe and Pierre Geurts, Ensembles on random patches., ECML/PKDD (1)
(Peter A. Flach, Tijl De Bie, and Nello Cristianini, eds.), Lecture Notes in Computer
Science, vol. 7523, Springer, 2012, pp. 346–361.
Gilles Louppe, Understanding random forests: From theory to practice, Ph.D. thesis,
University of Liège, 2014.
G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts, Understanding variable
importances in forests of randomized trees, Advances in Neural Information Processing Systems, 2013.
15 / 15
Variable importance scores
Some interpretability can be retrieved through variable importance
scores
[Bar chart: %info importance scores for features ranked f15, f4, f10, f8, f9, f20, f11, f1, f13, f2, f12, f14, f3, f16, f17, f6, f18, f19, f7, f5]
e.g. the sum of entropy reductions at each node where the variable appears.
Ensemble of randomized trees
Improve standard classification and regression trees by reducing their variance
Many examples: Bagging (Breiman, 1996), Random Forests (Breiman, 2001), Extremely randomized trees (Geurts et al., 2006)
Standard Random Forests: bootstrap sampling + random selection of K features at each node
3 / 37
Two main importance measures:
The mean decrease of impurity (MDI): summing total impurity
reductions at all tree nodes where the variable appears (Breiman et
al., 1984)
The mean decrease of accuracy (MDA): measuring accuracy
reduction on out-of-bag samples when the values of the variable
are randomly permuted (Breiman, 2001)
These measures have found many successful applications such as:
Biomarker discovery
Gene regulatory network inference
(Huynh-Thu et al, Plos ONE, 2010 and Marbach et al., Nature Methods, 2012)
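As a hedged illustration of how the two measures can be computed in practice with scikit-learn (permutation_importance permutes on held-out data rather than on out-of-bag samples, so it only approximates Breiman's MDA; the dataset is a synthetic stand-in):

```python
# Sketch: MDI vs. permutation-based (MDA-like) importances in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
mdi = rf.feature_importances_                        # mean decrease of impurity
mda = permutation_importance(rf, X_te, y_te,
                             n_repeats=10, random_state=0).importances_mean
print(mdi[:5], mda[:5])
```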
1 / 8
Mean decrease of impurity (MDI): definition
[Figure: ensemble of randomized trees 𝜑1, 𝜑2, ..., 𝜑M]
The importance of variable Xm for an ensemble of NT trees is given by:

Imp(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T : v(t) = X_m} p(t)\,\Delta i(t)

where p(t) = N_t / N and ∆i(t) is the impurity reduction at node t:

\Delta i(t) = i(t) - \frac{N_{t_L}}{N_t}\, i(t_L) - \frac{N_{t_R}}{N_t}\, i(t_R)
2 / 8
Link with common definitions of variable relevance
In asymptotic setting (N = NT = ∞)
K = 1: Variable importances depend only on the relevant variables
A variable Xm is relevant iff Imp(Xm) > 0
The importance of a relevant variable is insensitive to the addition
or the removal of irrelevant variables in V .
⇒ Asymptotically, unpruned totally randomized trees thus solve the
all-relevant feature selection problem.
3 / 8
Link with common definitions of variable relevance
In asymptotic setting (N = NT = ∞)
K > 1: Variable importances can be influenced by the number of irrelevant variables, and there can be relevant variables with zero importance (due to the masking effect)
But:
Xm irrelevant ⇒ Imp(Xm) = 0
Xm strongly relevant ⇒ Imp(Xm) > 0
Strongly relevant features cannot be masked
[Venn diagram: V with irrelevant and strongly relevant features, and found subsets F1, F2, ..., Fp]
⇒ In the case of strictly positive distributions, non-random trees always find a superset of the minimal-optimal solution whose size decreases with K.
4 / 8
Experiments: protocol
Madelon data (Guyon et al., 2007)
1500 samples (|LS|=1000, |TS|=500)
20 relevant features: 5 features that define Y , 5
random linear combinations of the first 5, and 10
noisy copies of the first 10
Increasing number of irrelevant features: 480, 1480,
2980, 5480
Parameters: q = 50, K = q, no bootstrap, threshold randomization
(Geurts et al., 2006)
Evaluation:
Average over 50 random LS/TS splits
Evolution of TS accuracy with number of iterations
Evolution of the area under the precision-recall curve (auprc) with
number of iterations, when features are ranked according to
importances
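For reference, a small sketch of how such an auprc evaluation could be computed (the importance scores and relevance mask below are placeholders, not the actual experimental data):

```python
# Sketch: auprc of an importance-based ranking against known relevance labels.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.RandomState(0)
importances = rng.rand(500)              # stand-in for Imp_{q,T}(X)
is_relevant = np.zeros(500, dtype=int)
is_relevant[:20] = 1                     # toy mask: first 20 features relevant

auprc = average_precision_score(is_relevant, importances)
print(f"auprc = {auprc:.3f}")
```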
5 / 8
Experiments: results
Substantial improvement of both auprc and accuracy with SRS
The lower q/p, the larger the improvement
Only SRS always eventually ranks the features perfectly
[Plots: auprc and accuracy on Madelon vs. number of iterations (0 to 10000), for RS and SRS with 480, 1480, 2980, and 5480 irrelevant features]
6 / 8
Experiments: results in feature selection
Dataset: TIS
13375 samples
927 features
[Plot: F-measure vs. number of iterations (0 to 4000) for RS (alpha=0) and SRS (alpha=1.0)]
Parameter: q = 92
7 / 8
Experiments: results in prediction
[Plot: accuracy vs. number of iterations (0 to 10000) for RS (alpha=0) and SRS (alpha=1.0)]
Parameter: q = 92
After 10000
trees/iterations:
RF (K = max): 0.91
RF (K = q): 0.9
RS : 0.84
SRS: 0.91
8 / 8