2. Outline
• Local vs. global modeling
• Wrapper feature selection and local modeling
• F-Racing and subsampling
• Experimental results
Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection – p. 2/2
12. Global models: pros and cons
• Examples of global models are linear regression models and neural networks.
• PRO: even for huge datasets, a parametric model can be stored in a small memory.
• CON:
  • in the nonlinear case, learning procedures are typically slow and analytically intractable;
  • validation methods, which address the problem of assessing a global model on the basis of a finite amount of noisy samples, are computationally prohibitive.
13. Local models: pros and cons
• Examples of local models are locally weighted regression and nearest neighbours.
• We will consider here a Lazy Learning algorithm [2, 5, 4] presented in previous work.
• PRO: fast and easy local linear learning procedures for parametric identification and validation.
• CON:
  • the dataset of observed input/output data must always be kept in memory;
  • each prediction requires a repetition of the learning procedure.
14. Complexity in global and local modeling
• Consider a nonlinear regression problem where we have N training samples, n given features and Q query points (i.e. Q predictions to be performed).
• Let us compare the computational cost of a nonlinear global learner (e.g. a neural network) and a local learner (with k ≪ N neighbors).
• Suppose that the nonlinear global learning procedure relies on a nonlinear parametric identification step (e.g. backpropagation to compute the weights) and a structural identification step (e.g. K-fold cross-validation to define the number of hidden nodes).
• Suppose that the local learning relies on a local leave-one-out linear criterion (the PRESS statistic).
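The PRESS statistic is what makes the local leave-one-out criterion cheap: for a linear least-squares fit, all N leave-one-out residuals follow from a single fit, with no refitting. A minimal NumPy sketch of this identity (our own illustration, not the paper's code):

```python
import numpy as np

def press_loo_errors(X, y):
    """Leave-one-out residuals of a linear least-squares fit, obtained
    from a single fit via the PRESS identity e_i = r_i / (1 - h_ii)."""
    X1 = np.column_stack([np.ones(len(y)), X])     # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    r = y - X1 @ beta                              # ordinary residuals
    # leverages h_ii: diagonal of the hat matrix H = X (X^T X)^{-1} X^T
    h = np.diag(X1 @ np.linalg.pinv(X1.T @ X1) @ X1.T)
    return r / (1.0 - h)

# sanity check against explicit leave-one-out refitting
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=30)
e_press = press_loo_errors(X, y)

e_explicit = []
for i in range(30):
    keep = np.arange(30) != i
    X1 = np.column_stack([np.ones(29), X[keep]])
    beta, *_ = np.linalg.lstsq(X1, y[keep], rcond=None)
    e_explicit.append(y[i] - np.concatenate(([1.0], X[i])) @ beta)
e_explicit = np.array(e_explicit)
```

The one-shot PRESS residuals coincide exactly with the N explicit refits, which is why local linear validation costs only one least-squares fit per query.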
15. Complexity in global and local modeling
                                          GLOBAL             LOCAL
Parametric identification                 C_NLS              O(Nn) + C_LS
Structural identification (K-fold CV)     K · C_NLS          small
Cost of Q predictions                     (K + 1) · C_NLS    Q · (O(Nn) + C_LS)

where C_NLS and C_LS denote the cost of a nonlinear and a linear least-squares fit, respectively.
The global modeling approach is computationally advantageous with respect to the local one when the same model is expected to be used for many predictions. Otherwise, a local approach is preferable.
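As a back-of-the-envelope illustration of the trade-off (all cost figures below are hypothetical, chosen only to make the comparison concrete):

```python
# Hypothetical costs in arbitrary units: one nonlinear least-squares
# fit is assumed far more expensive than one linear one.
C_NLS, C_LS = 1e6, 1e2
N, n, K = 10_000, 20, 10

global_cost = lambda Q: (K + 1) * C_NLS    # train/validate once, predict cheaply
local_cost = lambda Q: Q * (N * n + C_LS)  # relearn locally at every query

# number of queries at which the global approach starts to pay off
Q_break_even = (K + 1) * C_NLS / (N * n + C_LS)
```

Under these made-up numbers the local approach wins below roughly 55 queries and the global one above; the point is only that the break-even grows with the cost of nonlinear training.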
16. Feature selection
• In recent years many applications of data mining (text mining, bioinformatics, sensor networks) deal with a very large number n of features (e.g. tens or hundreds of thousands of variables) and often comparably few samples.
• In these cases, it is common practice to adopt feature selection algorithms [7] to improve the generalization accuracy.
• Several techniques exist for feature selection: we focus here on wrapper search techniques.
• Wrapper methods assess subsets of variables according to their usefulness to a given learning machine. These methods conduct a search for a good subset using the learning algorithm itself as part of the evaluation function. The problem boils down to a stochastic state-space search.
• A well-known example of a greedy wrapper search is forward selection.
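A minimal sketch of forward selection wrapped around a local learner; the k-NN evaluator and all names here are our own illustration, not the algorithm used in the paper:

```python
import numpy as np

def loo_mse_knn(X, y, k=3):
    """Leave-one-out MSE of a k-nearest-neighbour regressor
    restricted to the given columns of X."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # hold out the query point
        nn = np.argsort(d)[:k]
        preds[i] = y[nn].mean()
    return float(np.mean((preds - y) ** 2))

def forward_selection(X, y, max_features=5):
    """Greedy wrapper search: at each step add the feature that most
    improves the leave-one-out score of the learner itself."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = np.inf
    while remaining and len(selected) < max_features:
        scores = {f: loo_mse_knn(X[:, selected + [f]], y) for f in remaining}
        f_best = min(scores, key=scores.get)
        if scores[f_best] >= best_score:   # no improvement: stop
            break
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_score

# toy run: only features 0 and 1 actually drive the output
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 6))
y = X[:, 0] + X[:, 1]
selected, score = forward_selection(X, y)
```

The learner itself is the evaluation function: each candidate subset is scored by the leave-one-out error of the very model that will be deployed.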
17. Why be local in feature selection?
• Suppose that we have F feature set candidates, N training samples, and that the assessment is performed by leave-one-out.
• The conventional approach is to test all the F leave-one-out models on all the N samples and choose the best.
• This requires the training of F · N different models, each one used for a single prediction.
• The use of a global model demands a huge retraining cost.
• Local approaches appear to be an effective alternative.
18. Racing and subsampling: an analogy
• You are a national team football trainer who has to select the goalkeeper among a set of four candidates for the next World Cup, starting next month.
• You have available only twenty days of training sessions and eight days to let the players play matches.
• Two options:
  1. (i) Train all the candidates during the first twenty days, (ii) test all of them with matches during the last eight days, and (iii) make a decision.
  2. (i) Alternate each week of training with two matches; (ii) after each week, assess the candidates and, if someone is significantly worse than the others, discard him; (iii) keep selecting among the others.
• In our analogy, the players are the feature subsets, the training days are the training data, and the matches are the test data.
19. The racing idea
• Suppose that we have F feature set candidates, N training samples, and that the assessment is performed by leave-one-out.
• The conventional approach is to test all the F models on all the N samples and eventually choose the best.
• The racing idea [8] is to test each feature set on one point at a time.
• After only a small number of points, by using statistical tests, we can detect that some feature sets are significantly worse than others.
• We can discard them and keep focusing on the others.
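One classical instantiation of this idea, going back to the racing algorithms of [8], eliminates candidates with Hoeffding confidence bounds. The sketch below is our own simplification, assuming per-point errors bounded in [0, B]:

```python
import math

def hoeffding_race(error_streams, B, delta=0.05):
    """Race candidates by feeding them one test point at a time.
    error_streams: one iterable of per-point errors (bounded in [0, B])
    per candidate.  A candidate is dropped once its Hoeffding lower
    bound exceeds the upper bound of the current best candidate."""
    streams = [iter(s) for s in error_streams]
    alive = set(range(len(streams)))
    sums = [0.0] * len(streams)
    t = 0
    while len(alive) > 1:
        t += 1
        try:
            for i in list(alive):
                sums[i] += next(streams[i])   # one more test point each
        except StopIteration:                 # data exhausted: stop racing
            t -= 1
            break
        # Hoeffding half-width of the confidence interval after t points
        eps = B * math.sqrt(math.log(2.0 / delta) / (2.0 * t))
        best_ub = min(sums[i] / t for i in alive) + eps
        alive = {i for i in alive if sums[i] / t - eps <= best_ub}
    return alive, t

# the clearly worse candidate is eliminated after a handful of points,
# not after the full 1000
fast = [0.1] * 1000
slow = [0.9] * 1000
alive, t = hoeffding_race([fast, slow], B=1.0)
```

The statistical test (Hoeffding bound here, the Friedman test later in the talk) is the pluggable part; the streaming-and-discarding loop is the racing idea itself.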
20. Non racing approach
Consider this simple example: we have F = 5 feature subsets and N = 10 samples, and we select the best feature set by leave-one-out cross-validation.
Squared error of each leave-one-out prediction (rows: held-out sample i = 1, …, 10; columns: feature subsets F1–F5):

         F1     F2     F3     F4     F5
i=1      0.1    0.3    0.2    0.0    0.05
i=2      0.4    0.6    0.5    0.1    0.2
i=3      0.3    1.7    0.4    0.1    0.4
i=4      0.7    2.5    1.2    0.9    0.8
i=5      0.5    2.0    1.0    0.4    0.5
i=6      2.0    3.1    2.7    1.9    2.4
i=7      0.1    4.0    3.5    0.0    3.0
i=8      4.0    5.2    5.3    3.5    8.4
i=9      3.2    4.0    3.9    3.4    4.2
i=10     4.0    4.0    4.0    0.2    3.9

ESTIMATED
MSE      1.5    2.7    2.2    1.0    2.4
                              WINNER
After 50 training and test procedures, we have the best candidate.
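The procedure amounts to filling the full F × N error matrix and taking the column with the smallest mean. Using the numbers of the example above:

```python
import numpy as np

# Leave-one-out squared errors from the example: rows are the ten
# held-out samples, columns are the feature subsets F1..F5.
errors = np.array([
    [0.1, 0.3, 0.2, 0.0, 0.05],
    [0.4, 0.6, 0.5, 0.1, 0.2],
    [0.3, 1.7, 0.4, 0.1, 0.4],
    [0.7, 2.5, 1.2, 0.9, 0.8],
    [0.5, 2.0, 1.0, 0.4, 0.5],
    [2.0, 3.1, 2.7, 1.9, 2.4],
    [0.1, 4.0, 3.5, 0.0, 3.0],
    [4.0, 5.2, 5.3, 3.5, 8.4],
    [3.2, 4.0, 3.9, 3.4, 4.2],
    [4.0, 4.0, 4.0, 0.2, 3.9],
])
mse = errors.mean(axis=0)        # one estimated MSE per feature subset
winner = int(np.argmin(mse))     # index 3, i.e. subset F4
```

Every one of the 50 entries costs one training-and-test procedure; racing exists precisely to avoid filling rows for columns that are already hopeless.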
25. F-racing for feature selection
• We propose a nonparametric multiple test, the Friedman test [6], to compare different configurations of input variables and to select the ones to be eliminated from the race.
• The use of the Friedman test for racing was first proposed by one of the authors in the context of a technique for comparing metaheuristics for combinatorial optimization problems [3]. This is the first time the technique is used in a feature selection setting.
• The main merit of this nonparametric approach is that it does not require formulating hypotheses on the distribution of the observations.
• The idea of F-racing consists in using blocking and paired multiple tests to compare different models under similar conditions and to discard the worst ones as soon as possible.
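A minimal NumPy sketch of one elimination round: each test point is a block, the candidates are ranked within each block, and the Friedman statistic gates the round. The fixed mean-rank gap used below is our own simplified stand-in for the post-hoc test actually used in F-Race:

```python
import numpy as np

CHI2_99 = {2: 9.210, 3: 11.345, 4: 13.277, 5: 15.086}  # 0.99 chi-square quantiles

def friedman_statistic(errors):
    """Friedman chi-square statistic for paired error observations.
    errors: (n blocks, k candidates); each block (row) is ranked
    separately (plain ranks; ties are broken arbitrarily here)."""
    n, k = errors.shape
    ranks = errors.argsort(axis=1).argsort(axis=1) + 1.0   # rank 1 = best
    R = ranks.sum(axis=0)                                  # rank sums
    stat = 12.0 / (n * k * (k + 1)) * np.sum(R ** 2) - 3.0 * n * (k + 1)
    return stat, R / n                                     # statistic, mean ranks

def friedman_race_step(errors, rank_gap=0.8):
    """One elimination round at the 0.01 level: if the Friedman test
    rejects, drop the candidates whose mean rank trails the best by
    more than rank_gap (a hypothetical threshold for illustration)."""
    n, k = errors.shape
    stat, mean_ranks = friedman_statistic(errors)
    if stat <= CHI2_99[k - 1]:
        return list(range(k))            # no significant difference yet
    best = mean_ranks.min()
    return [j for j in range(k) if mean_ranks[j] - best <= rank_gap]

# synthetic paired errors: candidate 2 is uniformly worse by 1.0,
# candidates 0 and 1 are statistically indistinguishable
base = np.linspace(0.0, 1.0, 30)
sign = np.where(np.arange(30) % 2 == 0, 1.0, -1.0)
errors = np.column_stack([base, base + 0.01 * sign, base + 1.0])
survivors = friedman_race_step(errors)
```

Blocking on the test points is what makes the comparison paired: each candidate is judged on exactly the same samples, so no distributional assumption on the errors themselves is needed.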
26. Sub-sampling and LL
• The goal of feature selection is to find the best subset in a set of alternatives.
• Given a set of alternative subsets, what we expect is a correct ranking of their generalization accuracy (e.g. F2 > F3 > F5 > F1 > F4).
• By subsampling we mean using a random subset of the training set to perform the assessment of the different feature sets.
• The rationale of subsampling is that by reducing the training set size N, we deteriorate the accuracy of each single feature subset without affecting their ranking.
• In LL, reducing the training set size N reduces the computational cost.
• This makes the LL approach more competitive.
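A synthetic illustration of this rationale (the per-point losses and their means below are made up): when the candidates are well separated, a small random subsample already recovers the full-sample ranking:

```python
import numpy as np

rng = np.random.default_rng(42)
N, F = 5000, 4
# hypothetical per-point losses of four feature subsets whose true
# mean errors are well separated: 1.0 < 1.5 < 2.0 < 2.5
true_means = np.array([1.0, 1.5, 2.0, 2.5])
losses = true_means + 0.5 * rng.normal(size=(N, F))

full_rank = np.argsort(losses.mean(axis=0))       # ranking on all N points
sub = rng.choice(N, size=100, replace=False)      # 2% random subsample
sub_rank = np.argsort(losses[sub].mean(axis=0))   # ranking on 100 points
```

Each individual estimate is noisier on the subsample, but the ordering of the candidates, which is all the selection step needs, is preserved at a fraction of the cost.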
27. RACSAM for feature selection
We propose the RACSAM (RACing + SAMpling) algorithm:
1. Define an initial group of promising feature subsets.
2. Start with small training and test sets.
3. Discard by racing all the feature subsets that appear significantly worse than the others.
4. Increase the training and test size until at most W winning models remain.
5. Update the group with new candidates proposed by the search strategy and go back to step 3.
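The five steps above can be sketched as the following loop. All function names, the two-standard-error elimination rule, and the parameters are our own simplifications for illustration, not the paper's API:

```python
import numpy as np

def race_round(errs):
    """Keep the candidates whose mean error is within two standard
    errors of the best one (a crude stand-in for the Friedman test)."""
    means = errs.mean(axis=1)
    se = errs.std(axis=1) / np.sqrt(errs.shape[1])
    b = means.argmin()
    return [i for i in range(len(means))
            if means[i] - 2 * se[i] <= means[b] + 2 * se[b]]

def racsam(pool, evaluate, propose, W=5, n0=50, max_n=800, rounds=3):
    """evaluate(c, n) -> n per-point errors of candidate c assessed
    with n training/test samples; propose(k) -> k new candidates."""
    for _ in range(rounds):                          # step 5: iterate
        n = n0                                       # step 2: start small
        while len(pool) > W and n <= max_n:
            errs = np.array([evaluate(c, n) for c in pool])
            pool = [pool[i] for i in race_round(errs)]   # step 3: discard
            n *= 2                                   # step 4: grow samples
        pool = pool + propose(W)                     # step 5: refill
    return pool[:W]

# toy run: each candidate's "true" error is simply its own value
rng = np.random.default_rng(7)
evaluate = lambda c, n: c + 0.01 * rng.normal(size=n)
propose = lambda k: []                               # no search strategy here
winners = racsam([0.1, 0.5, 0.9, 1.3], evaluate, propose, W=1)
```

The point of the structure is that expensive large-sample assessments are only ever paid for the few candidates that survive the cheap small-sample rounds.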
28. Experimental session
• We compare the accuracy of the LL algorithm enhanced by the RACSAM procedure to the accuracy of two state-of-the-art algorithms: an SVM for regression and a regression tree (RTREE).
• Two versions of the RACSAM algorithm were tested: the first (LL-RAC1) takes as feature set the best one (in terms of estimated Mean Absolute Error (MAE)) among the W winning candidates; the second (LL-RAC2) averages the predictions of the best W LL predictors.
• We set W = 5 and a p-value of 0.01.
29. Experimental results
Five-fold cross-validation on six real datasets of high dimensionality:
Ailerons (N = 14308, n = 40), Pole (N = 15000, n = 48),
Elevators (N = 16599, n = 18), Triazines (N = 186, n = 60),
Wisconsin (N = 194, n = 32) and Census (N = 22784, n = 137).
Dataset AIL POL ELE TRI WIS CEN
LL-RAC1 9.7e-5 3.12 1.6e-3 0.21 27.39 0.17
LL-RAC2 9.0e-5 3.13 1.5e-3 0.12 27.41 0.16
SVM 1.3e-4 26.5 1.9e-3 0.11 29.91 0.21
RTREE 1.8e-4 8.80 3.1e-3 0.11 33.02 0.17
30. Statistical significance
• LL-RAC1 vs. LL-RAC2:
  • LL-RAC2 is significantly better than LL-RAC1 3 times out of 6;
  • LL-RAC2 is never significantly worse than LL-RAC1.
• LL-RAC2 vs. state-of-the-art techniques:
  • LL-RAC2 is never significantly worse than SVM or RTREE;
  • LL-RAC2 is significantly better than SVM 5 times out of 6, and better than RTREE 6 times out of 6.
31. Software
• MATLAB toolbox on Lazy Learning [1].
• R contributed packages:
  • lazy package;
  • racing package.
• Web page: http://iridia.ulb.ac.be/~lazy.
• About 5000 accesses since October 2002.
32. Conclusions
• Wrapper strategies ask for a huge number of assessments. It is important to make this process faster and less prone to instability.
• Local strategies reduce the computational cost of training models that have to be used for few predictions.
• Racing speeds up the evaluation by discarding bad candidates as soon as they appear statistically significantly worse than the others.
• Sub-sampling combined with local learning can speed up the training in the preliminary phases, when it is important to discard the highest number of bad candidates.
35. References
[1] M. Birattari and G. Bontempi. The lazy learning toolbox, for use with MATLAB. Technical Report TR/IRIDIA/99-7, IRIDIA-ULB, Brussels, Belgium, 1999.
[2] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least-squares algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS 11, pages 375–381, Cambridge, 1999. MIT Press.
[3] M. Birattari, T. Stützle, L. Paquete, and K. Varrentrapp. A racing algorithm for configuring metaheuristics. In W. B. Langdon, editor, GECCO 2002, pages 11–18. Morgan Kaufmann, 2002.
[4] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for modeling and control design. International Journal of Control, 72(7/8):643–658, 1999.
[5] G. Bontempi, M. Birattari, and H. Bersini. A model selection approach for local learning. Artificial Intelligence Communications, 121(1), 2000.
[6] W. J. Conover. Practical Nonparametric Statistics. John Wiley & Sons, New York, NY, USA, third edition, 1999.
[7] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[8] O. Maron and A. Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11(1–5):193–225, 1997.