A model-based relevance estimation
approach for feature selection in
microarray datasets
Gianluca Bontempi, Patrick E. Meyer
{gbonte,pmeyer}@ulb.ac.be
Machine Learning Group,
Computer Science Department
ULB, Université Libre de Bruxelles
Boulevard de Triomphe - CP 212
Bruxelles, Belgium
http://www.ulb.ac.be/di/mlg
A model-based relevance estimation approach for feature selection in microarray datasets – p. 1/1
Outline
• Feature selection in microarray classification tasks
• Definition of relevance
• Relevance and feature selection
• Our approach to relevance estimation: between filter and wrapper
• Experimental results
Feature selection in microarray
• The availability of massive amounts of experimental data based on
genome-wide studies has given impetus in recent years to a large effort
in developing mathematical, statistical and computational techniques to
infer biological models from data.
• In many bioinformatics problems, the number of features is significantly
larger than the number of samples (high feature-to-sample ratio
datasets).
• This is typical of cancer classification tasks where a systematic
investigation of the correlation of expression patterns of thousands of
genes to specific phenotypic variations is expected to provide an
improved taxonomy of cancer.
• In this context, the number of features n corresponds to the number of
expressed gene probes (up to tens of thousands) and the number of
observations N to the number of tumor samples (typically tens to a few
hundred).
• Feature selection, and consequently gene selection, is required to
perform classification in such a high-dimensional task.
State-of-the-art
Feature selection requires an accurate assessment of a large number of
alternative subsets in terms of predictive power or relevance to the output
class.
Three main state-of-the-art approaches are
Filters: these are preprocessing methods which assess the merits of features
from the data without having recourse to any learning algorithm.
Examples: ranking, PCA, t-test.
Wrappers: these methods rely on a learning algorithm to assess and compare
subsets of variables. They conduct a search for a good subset using the
learning algorithm itself as part of the evaluation function. Examples are
the forward/backward methods proposed in classical regression
analysis.
Embedded methods: they perform variable selection as part of the learning
procedure and are usually specific to given learning machines.
Examples are classification trees and methods based on regularization
techniques (e.g. the lasso).
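As an illustration of the filter family above, a univariate ranking by two-sample t statistic (one of the listed examples) can be sketched in a few lines; `t_score` and `filter_select` are illustrative names, not code from the talk:

```python
import numpy as np

def t_score(X, y):
    """Per-feature absolute two-sample t statistic.

    X: (N, n) expression matrix, y: binary labels in {0, 1}.
    Returns one score per feature; higher = more discriminative.
    """
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled standard error; small epsilon guards constant features
    se = np.sqrt(X0.var(axis=0, ddof=1) / len(X0)
                 + X1.var(axis=0, ddof=1) / len(X1)) + 1e-12
    return np.abs(m0 - m1) / se

def filter_select(X, y, d):
    """Keep the d top-ranked features (a univariate filter: no learner involved)."""
    return np.argsort(t_score(X, y))[::-1][:d]
```

Note the defining property of a filter: the score is computed from the data alone, with no recourse to any learning algorithm.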
Between filters and wrappers
• Filter approaches rely on learner-independent estimators to assess the
relevance of a set of features. The rationale of filter techniques is that
the importance of a set of features should be independent of the
prediction technique.
Our contribution: we propose a model-based strategy to assess
the relevance of a set of features.
• Wrappers depend on a specific learner to assess a set of features and
end up returning a quantity which confounds the relevance of a
subset (the desired quantity) with the quality of the learner (not required). In
other words, wrappers return a biased estimate of the relevance of a
subset.
Our contribution: since wrapper bias may have a strong negative
impact on the selection procedure we propose a model-based
technique for relevance assessment which is low-biased.
Feature selection and relevance
• Let us consider a binary classification problem where x ∈ X ⊂ R^n and
y ∈ Y = {y0, y1}. Let s ⊆ x, s ∈ S be a subset of the input vector.
• Let us denote
p1(s) = Prob {y = y1|s}
p0(s) = Prob {y = y0|s}
• A feature selection problem can be formalized as a problem of (learner-independent) relevance maximization

s* = arg max_{s ⊆ x, |s| ≤ d} R_s

where the goal is to find the subset s that maximizes the relevance
quantity R_s, which accounts for the predictive power that the input s has
on the target y.
Relevance definitions
• A well known example of relevance measure is mutual information
I(s, y) = H(y) − H(y|s).
• Here we will focus on the quantity

R_s = ∫_S [p0^2(s) + p1^2(s)] dF_s(s) = ∫_S r(s) dF_s(s)

where r(s) = 1 − g(s) and g(s) is the Gini index of diversity.
• Note that

r(s) = p0^2(s) + p1^2(s) = 1 − 2 p0(s)(1 − p0(s)) = 1 − 2 Var{y|s}

where Var{y|s} is the conditional variance of y.
• Also, a monotone function

G_H(·) : [0, 1] → [0, 0.5]

maps the entropy H(y|s) of a binary variable y to the related Gini index g.
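The identities above are easy to check numerically; a minimal sketch for the binary case (the function names are ours, not the talk's):

```python
def relevance_r(p0):
    """r(s) = p0^2 + p1^2 for a binary class posterior (p1 = 1 - p0)."""
    p1 = 1.0 - p0
    return p0 ** 2 + p1 ** 2

def gini(p0):
    """Gini diversity index g(s) = 2 p0 (1 - p0), so that r(s) = 1 - g(s)."""
    return 2.0 * p0 * (1.0 - p0)

def cond_var(p0):
    """Conditional variance of a {0,1}-coded y given s: Var{y|s} = p0 (1 - p0)."""
    return p0 * (1.0 - p0)
```

For instance, at p0 = 0.3 all three expressions agree: r = 0.58 = 1 − g = 1 − 2 Var{y|s}, with the minimum r = 0.5 reached at p0 = 0.5 (the least predictable case).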
Bias of the wrapper approach
Given a learner h trained on a dataset of size N, the wrapper approach
translates the (learner-independent) relevance maximization problem into a
(learner-dependent) minimization problem

arg min_{s ⊆ x, |s| ≤ d} M_s^h = arg min_{s ⊆ x, |s| ≤ d} ∫_S MME^h(s) dF_s(s)

where the Mean Misclassification Error is decomposed as follows (Wolpert,
Kohavi, 96):

MME^h(s) = 1/2 [1 − (p0^2(s) + p1^2(s))]
         + 1/2 [(p0(s) − p̂0(s))^2 + (p1(s) − p̂1(s))^2]
         + 1/2 [1 − (p̂0^2(s) + p̂1^2(s))]
         = 1/2 (n(s) + b(s) + v(s))

where p̂0 = Prob {ŷ = y0|s}, n(s) = 1 − r(s) is the noise variance term, b(s) is
the learner squared bias and v(s) is the learner variance.
NB: the term b(s) is NOT dependent on relevance.
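As a sanity check of the decomposition: assuming a classifier that predicts y1 with probability p̂1 independently of y, the three terms sum to twice the misclassification error. A small numeric sketch (our own helper names):

```python
def mme_decomposition(p0, p0_hat):
    """Noise / bias / variance split of the mean misclassification error
    for a binary problem, following the Wolpert-Kohavi form on this slide."""
    p1, p1_hat = 1 - p0, 1 - p0_hat
    noise = 1 - (p0 ** 2 + p1 ** 2)                       # n(s) = 1 - r(s)
    bias = (p0 - p0_hat) ** 2 + (p1 - p1_hat) ** 2        # b(s)
    var = 1 - (p0_hat ** 2 + p1_hat ** 2)                 # v(s)
    return 0.5 * (noise + bias + var)

def mme_direct(p0, p0_hat):
    """P(yhat != y | s) when yhat is drawn from p_hat independently of y."""
    return p0 * (1 - p0_hat) + (1 - p0) * p0_hat
```

For p0 = 0.7 and p̂0 = 0.4 both routes give 0.54, and the bias term vanishes exactly when p̂0 = p0, leaving only the noise and variance contributions.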
Bias of the wrapper approach
• In real classification tasks, the one-zero misclassification error M_s^h of a
learner h for a subset s cannot be derived analytically but only estimated
(typically by cross-validation).
• A wrapper selection returns

s^h = arg min_{s ⊂ x, |s| ≤ d} M̂_s^h   (1)

where M̂_s^h is the estimate of the misclassification error of the learner h (e.g.
computed by cross-validation).
If a wrapper strategy relies on a generic learner h, that is, a learner
where the bias term b(s) is significantly different from zero, the
returned feature selection will depend on a quantity which is a
biased estimate of the term r(s) and consequently of the relevance
R_s. In other words, wrappers do not maximize relevance.
Unbiased wrapper approach
• Intuitively, the bias would be reduced if we adopted a learner having a
small bias term. A low-bias, yet high-variance, learner is the k-nearest
neighbour classifier (kNN) for small values of k.
• In particular, it has been shown that for a 1NN learner and a binary
classification problem

lim_{N→∞} M_s^1NN = 1 − R_s

where M_s^1NN is the misclassification error of a nearest neighbour classifier.
• Since cross-validation returns a consistent estimate of M_s^h and since
the quantity M_s^1NN asymptotically converges to one minus the relevance
R_s, we have that 1 − M̂_s^1NN is a consistent estimator of the relevance R_s.
• We propose then as relevance estimator

R̂_s^kNN = 1 − M̂_s^kNN

where M̂_s^kNN is the cross-validation error of a kNN learner with low k. This term
returns a low-bias, yet high-variance, estimate of the relevance of the
subset s.
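A minimal sketch of this estimator, assuming a plain leave-one-out kNN with Euclidean distance (the slides do not fix the cross-validation variant, so this is one plausible instantiation):

```python
import numpy as np

def r_hat_knn(X, y, k=1):
    """Relevance estimate R_hat = 1 - leave-one-out error of a kNN classifier.

    Plain-numpy leave-one-out kNN; a sketch, not the authors' code.
    X: (N, n) inputs restricted to the candidate subset s; y: class labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    N = len(y)
    errors = 0
    for i in range(N):
        d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        d[i] = np.inf                      # leave the test point out
        nn = np.argsort(d)[:k]             # indices of the k nearest neighbours
        # majority vote among the neighbours' labels
        vals, counts = np.unique(y[nn], return_counts=True)
        if vals[np.argmax(counts)] != y[i]:
            errors += 1
    return 1.0 - errors / N
```

On a subset whose classes are well separated the estimate approaches 1, reflecting high relevance; its variance comes from the single-neighbour vote and the small N typical of microarray data.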
Reducing the variance of the estimator
The low-bias, high-variance nature of the R̂_s^kNN estimator suggests that the
best way to employ this estimator is by combining it with other relevance
estimators.
We will take into consideration two possible estimators to combine with:
1. a direct model-based estimator p̂1 of the conditional probability
p1(s) = Prob {y = y1|s} and consequently of the quantity r(s).
This estimator first samples a set of N unclassified input vectors s_i
according to the empirical distribution F̂_s and then computes the Monte
Carlo estimate

R̂_s^D = (1/N) Σ_{i=1}^{N} [p̂1^2(s_i) + p̂0^2(s_i)] = 1 − (2/N) Σ_{i=1}^{N} p̂1(s_i)(1 − p̂1(s_i))
A similar estimator was proposed by Fukunaga in 1973 to estimate the
Bayes error.
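A sketch of the direct estimator, assuming some fitted conditional-probability model passed in as a callable `p1_hat` (hypothetical; the slides do not prescribe which model to use):

```python
import numpy as np

def r_hat_direct(p1_hat, X_emp, rng=None, n_draws=None):
    """Monte Carlo estimate R_hat_D = 1 - (2/N) * sum_i p1(s_i) (1 - p1(s_i)),
    with the inputs s_i resampled from the empirical distribution of X_emp.

    p1_hat: callable s -> estimated Prob{y = y1 | s} (any fitted model).
    """
    rng = rng or np.random.default_rng(0)
    X_emp = np.asarray(X_emp, dtype=float)
    N = n_draws or len(X_emp)
    idx = rng.integers(0, len(X_emp), size=N)   # bootstrap draw from F_hat
    p1 = np.array([p1_hat(s) for s in X_emp[idx]])
    return 1.0 - (2.0 / N) * np.sum(p1 * (1.0 - p1))
```

Two limiting cases make the formula concrete: a perfectly confident model (p̂1 ≡ 1 or 0) yields relevance 1, while a maximally uncertain one (p̂1 ≡ 0.5) yields the minimum value 0.5.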
2. a filter estimator based on the notion of mutual information: several filter
algorithms exploit this notion in order to estimate the relevance. An
example is the MRMR algorithm (Peng et al., 05) where the relevance of
a feature subset s, expressed in terms of the mutual information
I(s; y) = H(y) − H(y|s), is approximated by the incremental formulation

I_MRMR(s; y) = I_MRMR(s_i; y) + I(x_i; y) − (1/(m−1)) Σ_{x_j ∈ s_i} I(x_j; x_i)   (2)

where x_i is a feature belonging to the subset s, s_i is the set s with the feature
x_i set aside and m is the number of components of s. Now since
H(y|s) = H(y) − I(s; y) and G_s = 1 − R_s = G_H(H(y|s)) we obtain that

R̂_s^MRMR = 1 − G_H(H(y) − I_MRMR(s; y))

is an MRMR estimator of the relevance R_s, where G_H(·) is the monotone
mapping between H and the Gini index.
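The incremental score of Eq. (2) can be sketched with a plug-in discrete mutual-information estimate (our own helper names; real microarray features would first need discretization):

```python
import numpy as np
from math import log

def mutual_info(a, b):
    """Plug-in mutual information (in nats) between two discrete arrays."""
    a, b = np.asarray(a), np.asarray(b)
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            pab = np.mean((a == va) & (b == vb))   # joint frequency
            pa, pb = np.mean(a == va), np.mean(b == vb)
            if pab > 0:
                mi += pab * log(pab / (pa * pb))
    return mi

def mrmr_score(xi, selected, y):
    """Relevance-minus-redundancy increment for a candidate feature xi,
    given the already-selected columns (list of arrays), as in Eq. (2)."""
    relevance = mutual_info(xi, y)
    if not selected:
        return relevance
    # 'selected' plays the role of s_i, which has m - 1 components
    redundancy = sum(mutual_info(xj, xi) for xj in selected) / len(selected)
    return relevance - redundancy
```

The score rewards features informative about y and penalizes those redundant with features already picked; a duplicate of an already-selected feature scores zero.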
Proposed relevance estimators
We propose two novel relevance estimators based on the principle of
averaging

R̂'_s = (R̂_s^CV + R̂_s^D) / 2,    R̂''_s = (R̂_s^CV + R̂_s^MRMR) / 2

where R̂_s^CV denotes the kNN cross-validation estimator introduced above,
and the associated feature selection algorithms:

s^R' = arg max_{s ⊂ x, |s| ≤ d} R̂'_s,    s^R'' = arg max_{s ⊂ x, |s| ≤ d} R̂''_s
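The maximization over subsets is combinatorial; one plausible way to use the averaged estimators is a greedy forward search (a sketch under that assumption; the slides do not specify the search strategy):

```python
def forward_select(features, r_hat, d):
    """Greedy forward search approximating arg max R_hat over subsets of size <= d.

    features: iterable of feature names; r_hat: callable mapping a list of
    selected names to a relevance estimate (e.g. the average of a
    CV-based and a filter-based estimator, as on this slide).
    """
    selected = []
    for _ in range(d):
        best, best_r = None, -float("inf")
        for f in features:
            if f in selected:
                continue
            r = r_hat(selected + [f])      # score the enlarged candidate subset
            if r > best_r:
                best, best_r = f, r
        selected.append(best)
    return selected
```

Each step adds the single feature that most increases the averaged relevance estimate, so only O(n · d) subset evaluations are needed instead of the exponential exhaustive search.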
Experimental session
• 20 public domain microarray expression datasets
• external three-fold cross-validation assessment scheme
• to avoid any dependency between the learning algorithm employed by
the wrapper and the classifier used for prediction, the experimental
session is composed of two parts:
• Part 1: comparison with the wrapper WSVM and we use the set of
classifiers C1 ={TREE, NB, SVMSIGM, LDA, LOG} which does not
include the SVMLIN learner,
• Part 2: comparison with the wrapper WNB and we use the set of
classifiers C2 ={TREE, SVMSIGM, SVMLIN, LDA, LOG} which does
not include the NB learner.
Experiments with cancer datasets
Name N (samples) n (features) K (classes)
Golub 72 7129 2
Alon 62 2000 2
Notterman 36 7457 2
Nutt 50 12625 2
Shipp 77 7129 2
Singh 102 12600 2
Sorlie 76 7937 2
Wang 286 22283 2
Van’t Veer 65 24481 2
VandeVijver 295 24496 2
Sotiriou 99 7650 2
Pomeroy 60 7129 2
Khan 63 2308 4
Hedenfalk 22 3226 3
West 49 7129 4
Staunton 60 7129 9
Su 174 12533 11
Bhattacharjee 203 12600 5
Armstrong 72 12582 3
Ma 60 22575 3
Results 1st part
Name R’ WSVM R” MRMR RANK
Golub 0.0917 0.1177 0.1 0.1079 0.1225
Alon 0.2704 0.2658 0.2267 0.1996 0.2281
Notterman 0.1966 0.0985 0.1494 0.1472 0.1432
Nutt 0.3798 0.4171 0.3873 0.3847 0.4189
Shipp 0.1429 0.1319 0.1322 0.1362 0.1873
Singh 0.1619 0.1517 0.1266 0.1374 0.1328
Sorlie 0.3835 0.4314 0.3963 0.4004 0.3987
Wang 0.4282 0.4111 0.4218 0.4232 0.4181
Van’t Veer 0.2786 0.2638 0.2492 0.2217 0.2277
VandeVijver 0.454 0.4724 0.4365 0.4636 0.4482
Sotiriou 0.5279 0.5796 0.5351 0.5708 0.5339
Pomeroy 0.428 0.4191 0.4141 0.3876 0.4181
Khan 0.0878 0.1143 0.0582 0.0686 0.131
Hedenfalk 0.5475 0.5263 0.452 0.5273 0.5389
West 0.6463 0.6109 0.6186 0.5746 0.6109
Staunton 0.6822 0.71 0.6511 0.6865 0.7407
Su 0.2568 0.307 0.2549 0.3772 0.3352
Bhattacharjee 0.1232 0.1347 0.1105 0.1057 0.1515
Armstrong 0.1082 0.1199 0.1306 0.115 0.1122
Ma 0.2456 0.2041 0.2257 0.2413 0.2317
AVG 0.323 0.331 0.310 0.326 0.331
Worse/Better than R' (R'') 10/7 9/6 9/2
Results 2nd part
Name R’ WNB R” MRMR RANK
Golub 0.0886 0.1114 0.0971 0.1019 0.0904
Alon 0.2376 0.2568 0.2181 0.2109 0.221
Notterman 0.1852 0.2059 0.1491 0.1512 0.1645
Nutt 0.3929 0.3402 0.36 0.3898 0.4258
Shipp 0.1261 0.127 0.1198 0.1338 0.1734
Singh 0.1495 0.1454 0.1297 0.1377 0.1245
Sorlie 0.3848 0.4254 0.3808 0.3953 0.3838
Wang 0.4363 0.4345 0.4298 0.4281 0.4255
Van’t Veer 0.2747 0.2715 0.2421 0.2253 0.2325
VandeVijver 0.4626 0.44 0.4763 0.4721 0.4358
Sotiriou 0.5126 0.5578 0.5505 0.5732 0.5611
Pomeroy 0.4367 0.4389 0.4007 0.3902 0.4224
Khan 0.0804 0.0896 0.0628 0.0631 0.0901
Hedenfalk 0.5379 0.5187 0.4369 0.4904 0.4949
West 0.6413 0.6696 0.5542 0.5882 0.6728
Staunton 0.6689 0.8298 0.6981 0.6661 0.83
Su 0.2544 0.3096 0.2646 0.3739 0.3529
Bhattacharjee 0.1235 0.1209 0.101 0.1061 0.1186
Armstrong 0.1079 0.1668 0.125 0.1148 0.1034
Ma 0.2565 0.2635 0.2335 0.2443 0.2681
AVG 0.322 0.3335 0.315 0.327 0.331
Worse/Better than R' (R'') 9/2 10/3 11/2
Conclusions
• Feature selection demands accurate estimation of relevance of subsets
of features.
• Wrapper methods use cross-validation estimates of the misclassification
error of generic learners. We show that this entails a biased
estimation of relevance.
• The cross-validation assessment R̂_s^kNN returned by kNN techniques
with low k provides a low-bias yet high-variance estimator of relevance.
• Variance can be reduced by combining it with other estimators.
• Experiments on real datasets showed that the resulting relevance
estimator can outperform both conventional wrapper and filter
algorithms.

More Related Content

What's hot

Reliable ABC model choice via random forests
Reliable ABC model choice via random forestsReliable ABC model choice via random forests
Reliable ABC model choice via random forestsChristian Robert
 
Statistics symposium talk, Harvard University
Statistics symposium talk, Harvard UniversityStatistics symposium talk, Harvard University
Statistics symposium talk, Harvard UniversityChristian Robert
 
Machine learning
Machine learningMachine learning
Machine learningShreyas G S
 
Nber slides11 lecture2
Nber slides11 lecture2Nber slides11 lecture2
Nber slides11 lecture2NBER
 
Non-parametric analysis of models and data
Non-parametric analysis of models and dataNon-parametric analysis of models and data
Non-parametric analysis of models and datahaharrington
 
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Simplilearn
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation sourcebutest
 
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...Christian Robert
 
Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning Shenghui Wang
 
A Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation ProblemA Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation ProblemErika G. G.
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learningAkshay Kanchan
 
Principle of Maximum Entropy
Principle of Maximum EntropyPrinciple of Maximum Entropy
Principle of Maximum EntropyJiawang Liu
 
Machine learning (5)
Machine learning (5)Machine learning (5)
Machine learning (5)NYversity
 
A Framework to Adjust Dependency Measure Estimates for Chance
A Framework to Adjust Dependency Measure Estimates for Chance      A Framework to Adjust Dependency Measure Estimates for Chance
A Framework to Adjust Dependency Measure Estimates for Chance Simone Romano
 
Max Entropy
Max EntropyMax Entropy
Max Entropyjianingy
 
Mathematical Background for Artificial Intelligence
Mathematical Background for Artificial IntelligenceMathematical Background for Artificial Intelligence
Mathematical Background for Artificial Intelligenceananth
 

What's hot (20)

Reliable ABC model choice via random forests
Reliable ABC model choice via random forestsReliable ABC model choice via random forests
Reliable ABC model choice via random forests
 
Statistics symposium talk, Harvard University
Statistics symposium talk, Harvard UniversityStatistics symposium talk, Harvard University
Statistics symposium talk, Harvard University
 
Machine learning
Machine learningMachine learning
Machine learning
 
Nber slides11 lecture2
Nber slides11 lecture2Nber slides11 lecture2
Nber slides11 lecture2
 
Non-parametric analysis of models and data
Non-parametric analysis of models and dataNon-parametric analysis of models and data
Non-parametric analysis of models and data
 
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation source
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Including Factors in B...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Including Factors in B...MUMS: Bayesian, Fiducial, and Frequentist Conference - Including Factors in B...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Including Factors in B...
 
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Var...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Var...MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Var...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Var...
 
Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning
 
A Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation ProblemA Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation Problem
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
Principle of Maximum Entropy
Principle of Maximum EntropyPrinciple of Maximum Entropy
Principle of Maximum Entropy
 
Machine learning (5)
Machine learning (5)Machine learning (5)
Machine learning (5)
 
A Framework to Adjust Dependency Measure Estimates for Chance
A Framework to Adjust Dependency Measure Estimates for Chance      A Framework to Adjust Dependency Measure Estimates for Chance
A Framework to Adjust Dependency Measure Estimates for Chance
 
Max Entropy
Max EntropyMax Entropy
Max Entropy
 
Intractable likelihoods
Intractable likelihoodsIntractable likelihoods
Intractable likelihoods
 
Boston talk
Boston talkBoston talk
Boston talk
 
Mathematical Background for Artificial Intelligence
Mathematical Background for Artificial IntelligenceMathematical Background for Artificial Intelligence
Mathematical Background for Artificial Intelligence
 

Viewers also liked

国外互联网发展趋势
国外互联网发展趋势国外互联网发展趋势
国外互联网发展趋势envong
 
Client Overview Staffing
Client Overview StaffingClient Overview Staffing
Client Overview StaffingDan Stagliano
 
Janussik
JanussikJanussik
JanussikMAsjan
 
Feature selection and microarray data
Feature selection and microarray dataFeature selection and microarray data
Feature selection and microarray dataGianluca Bontempi
 
Computational Intelligence for Time Series Prediction
Computational Intelligence for Time Series PredictionComputational Intelligence for Time Series Prediction
Computational Intelligence for Time Series PredictionGianluca Bontempi
 
产品经理的视角
产品经理的视角产品经理的视角
产品经理的视角envong
 
Pony对QQMail的邮件摘录
Pony对QQMail的邮件摘录Pony对QQMail的邮件摘录
Pony对QQMail的邮件摘录envong
 
指引更工业化的设计
指引更工业化的设计指引更工业化的设计
指引更工业化的设计envong
 
When safety comes last; A short synopisis of events in Nigeria aviation (Pam,...
When safety comes last; A short synopisis of events in Nigeria aviation (Pam,...When safety comes last; A short synopisis of events in Nigeria aviation (Pam,...
When safety comes last; A short synopisis of events in Nigeria aviation (Pam,...Dung Rwang Pam
 
Adaptive model selection in Wireless Sensor Networks
Adaptive model selection in Wireless Sensor NetworksAdaptive model selection in Wireless Sensor Networks
Adaptive model selection in Wireless Sensor NetworksGianluca Bontempi
 
Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Combining Lazy Learning, Racing and Subsampling for Effective Feature SelectionCombining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Combining Lazy Learning, Racing and Subsampling for Effective Feature SelectionGianluca Bontempi
 
Safe and sustainable aviation in africa; alignment of policies, regulation an...
Safe and sustainable aviation in africa; alignment of policies, regulation an...Safe and sustainable aviation in africa; alignment of policies, regulation an...
Safe and sustainable aviation in africa; alignment of policies, regulation an...Dung Rwang Pam
 
Perspective of feature selection in bioinformatics
Perspective of feature selection in bioinformaticsPerspective of feature selection in bioinformatics
Perspective of feature selection in bioinformaticsGianluca Bontempi
 
FP7 evaluation & selection: the point of view of an evaluator
FP7 evaluation & selection: the point of view of an evaluatorFP7 evaluation & selection: the point of view of an evaluator
FP7 evaluation & selection: the point of view of an evaluatorGianluca Bontempi
 
CMS OPEN - Università degli studi di Macerata (UniMC)
CMS OPEN - Università degli studi di Macerata (UniMC)CMS OPEN - Università degli studi di Macerata (UniMC)
CMS OPEN - Università degli studi di Macerata (UniMC)Mauro Fava
 
A discourse on aviation with the Nigeria aviation safety initiaitive (NASI)
A discourse on aviation with the Nigeria aviation safety initiaitive (NASI)A discourse on aviation with the Nigeria aviation safety initiaitive (NASI)
A discourse on aviation with the Nigeria aviation safety initiaitive (NASI)Dung Rwang Pam
 
Nigeria aviation industry drifting in turbulent waters
Nigeria aviation industry drifting in turbulent watersNigeria aviation industry drifting in turbulent waters
Nigeria aviation industry drifting in turbulent watersDung Rwang Pam
 

Viewers also liked (19)

Blender
BlenderBlender
Blender
 
国外互联网发展趋势
国外互联网发展趋势国外互联网发展趋势
国外互联网发展趋势
 
Client Overview Staffing
Client Overview StaffingClient Overview Staffing
Client Overview Staffing
 
Janussik
JanussikJanussik
Janussik
 
Feature selection and microarray data
Feature selection and microarray dataFeature selection and microarray data
Feature selection and microarray data
 
Computational Intelligence for Time Series Prediction
Computational Intelligence for Time Series PredictionComputational Intelligence for Time Series Prediction
Computational Intelligence for Time Series Prediction
 
Ubuntu
UbuntuUbuntu
Ubuntu
 
产品经理的视角
产品经理的视角产品经理的视角
产品经理的视角
 
Pony对QQMail的邮件摘录
Pony对QQMail的邮件摘录Pony对QQMail的邮件摘录
Pony对QQMail的邮件摘录
 
指引更工业化的设计
指引更工业化的设计指引更工业化的设计
指引更工业化的设计
 
When safety comes last; A short synopisis of events in Nigeria aviation (Pam,...
When safety comes last; A short synopisis of events in Nigeria aviation (Pam,...When safety comes last; A short synopisis of events in Nigeria aviation (Pam,...
When safety comes last; A short synopisis of events in Nigeria aviation (Pam,...
 
Adaptive model selection in Wireless Sensor Networks
Adaptive model selection in Wireless Sensor NetworksAdaptive model selection in Wireless Sensor Networks
Adaptive model selection in Wireless Sensor Networks
 
Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Combining Lazy Learning, Racing and Subsampling for Effective Feature SelectionCombining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection
 
Safe and sustainable aviation in africa; alignment of policies, regulation an...
Safe and sustainable aviation in africa; alignment of policies, regulation an...Safe and sustainable aviation in africa; alignment of policies, regulation an...
Safe and sustainable aviation in africa; alignment of policies, regulation an...
 
Perspective of feature selection in bioinformatics
Perspective of feature selection in bioinformaticsPerspective of feature selection in bioinformatics
Perspective of feature selection in bioinformatics
 
FP7 evaluation & selection: the point of view of an evaluator
FP7 evaluation & selection: the point of view of an evaluatorFP7 evaluation & selection: the point of view of an evaluator
FP7 evaluation & selection: the point of view of an evaluator
 
CMS OPEN - Università degli studi di Macerata (UniMC)
CMS OPEN - Università degli studi di Macerata (UniMC)CMS OPEN - Università degli studi di Macerata (UniMC)
CMS OPEN - Università degli studi di Macerata (UniMC)
 
A discourse on aviation with the Nigeria aviation safety initiaitive (NASI)
A discourse on aviation with the Nigeria aviation safety initiaitive (NASI)A discourse on aviation with the Nigeria aviation safety initiaitive (NASI)
A discourse on aviation with the Nigeria aviation safety initiaitive (NASI)
 
Nigeria aviation industry drifting in turbulent waters
Nigeria aviation industry drifting in turbulent watersNigeria aviation industry drifting in turbulent waters
Nigeria aviation industry drifting in turbulent waters
 

Similar to A model-based relevance estimation approach for feature selection in microarray datasets

Recommendation system
Recommendation systemRecommendation system
Recommendation systemDing Li
 
A Modified KS-test for Feature Selection
A Modified KS-test for Feature SelectionA Modified KS-test for Feature Selection
A Modified KS-test for Feature SelectionIOSR Journals
 
Recommender system
Recommender systemRecommender system
Recommender systemBhumi Patel
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning IntroductionKuppusamy P
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionDario Panada
 
3.1 clustering
3.1 clustering3.1 clustering
3.1 clusteringKrish_ver2
 
DM UNIT_4 PPT for btech final year students
DM UNIT_4 PPT for btech final year studentsDM UNIT_4 PPT for btech final year students
DM UNIT_4 PPT for btech final year studentssriharipatilin
 
Download
DownloadDownload
Downloadbutest
 
Download
DownloadDownload
Downloadbutest
 
Probabilistic Collaborative Filtering with Negative Cross Entropy
Probabilistic Collaborative Filtering with Negative Cross EntropyProbabilistic Collaborative Filtering with Negative Cross Entropy
Probabilistic Collaborative Filtering with Negative Cross EntropyAlejandro Bellogin
 
Object class recognition by unsupervide scale invariant learning - kunal
Object class recognition by unsupervide scale invariant learning - kunalObject class recognition by unsupervide scale invariant learning - kunal
Object class recognition by unsupervide scale invariant learning - kunalKunal Kishor Nirala
 
CSE545_Porject
CSE545_PorjectCSE545_Porject
CSE545_Porjecthan li
 

Similar to A model-based relevance estimation approach for feature selection in microarray datasets (20)

Recommendation system
Recommendation systemRecommendation system
Recommendation system
 
A Modified KS-test for Feature Selection
A Modified KS-test for Feature SelectionA Modified KS-test for Feature Selection
A Modified KS-test for Feature Selection
 
Recommender system
Recommender systemRecommender system
Recommender system
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
lcr
lcrlcr
lcr
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning Introduction
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point Detection
 
3.1 clustering
3.1 clustering3.1 clustering
3.1 clustering
 
Recommender System Based On Statistical Implicative Analysis.doc
Recommender System Based On Statistical Implicative Analysis.docRecommender System Based On Statistical Implicative Analysis.doc
Recommender System Based On Statistical Implicative Analysis.doc
 
DM UNIT_4 PPT for btech final year students
DM UNIT_4 PPT for btech final year studentsDM UNIT_4 PPT for btech final year students
DM UNIT_4 PPT for btech final year students
 
Cs501 cluster analysis
Cs501 cluster analysisCs501 cluster analysis
Cs501 cluster analysis
 
Download
DownloadDownload
Download
 
Download
DownloadDownload
Download
 
Matlab:Regression
Matlab:RegressionMatlab:Regression
Matlab:Regression
 
Matlab: Regression
Matlab: RegressionMatlab: Regression
Matlab: Regression
 
Probabilistic Collaborative Filtering with Negative Cross Entropy
Probabilistic Collaborative Filtering with Negative Cross EntropyProbabilistic Collaborative Filtering with Negative Cross Entropy
Probabilistic Collaborative Filtering with Negative Cross Entropy
 
Object class recognition by unsupervide scale invariant learning - kunal
Object class recognition by unsupervide scale invariant learning - kunalObject class recognition by unsupervide scale invariant learning - kunal
Object class recognition by unsupervide scale invariant learning - kunal
 
Cluster
ClusterCluster
Cluster
 
CSE545_Porject
CSE545_PorjectCSE545_Porject
CSE545_Porject
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 

A model-based relevance estimation approach for feature selection in microarray datasets

  • 1. A model-based relevance estimation approach for feature selection in microarray datasets
Gianluca Bontempi, Patrick E. Meyer
{gbonte,pmeyer}@ulb.ac.be
Machine Learning Group, Computer Science Department
ULB, Université Libre de Bruxelles
Boulevard de Triomphe - CP 212, Bruxelles, Belgium
http://www.ulb.ac.be/di/mlg
A model-based relevance estimation approach for feature selection in microarray datasets – p. 1/1
  • 2. Outline
• Feature selection in microarray classification tasks
• Definition of relevance
• Relevance and feature selection
• Our approach to relevance estimation: between filter and wrapper
• Experimental results
  • 3. Feature selection in microarray
• The availability of massive amounts of experimental data based on genome-wide studies has given impetus in recent years to a large effort in developing mathematical, statistical and computational techniques to infer biological models from data.
• In many bioinformatics problems, the number of features is significantly larger than the number of samples (high feature-to-sample ratio datasets).
• This is typical of cancer classification tasks, where a systematic investigation of the correlation of the expression patterns of thousands of genes with specific phenotypic variations is expected to provide an improved taxonomy of cancer.
• In this context, the number of features n corresponds to the number of expressed gene probes (up to several thousands) and the number of observations N to the number of tumor samples (typically in the order of hundreds).
• Feature selection, and consequently gene selection, is required to perform classification in such a high-dimensional task.
  • 4. State-of-the-art
Feature selection requires an accurate assessment of a large number of alternative subsets in terms of predictive power or relevance to the output class. Three main state-of-the-art approaches are:
Filters: preprocessing methods which assess the merits of features from the data without having recourse to any learning algorithm. Examples: ranking, PCA, t-test.
Wrappers: methods that rely on a learning algorithm to assess and compare subsets of variables. They conduct a search for a good subset using the learning algorithm itself as part of the evaluation function. Examples are the forward/backward methods proposed in classical regression analysis.
Embedded methods: these perform variable selection as part of the learning procedure and are usually specific to given learning machines. Examples are classification trees and methods based on regularization techniques (e.g. the lasso).
  • 5. Between filters and wrappers
• Filter approaches rely on learner-independent estimators to assess the relevance of a set of features. The rationale of filter techniques is that the importance of a set of features should be independent of the prediction technique.
Our contribution: we propose a model-based strategy to assess the relevance of a set of features.
• Wrappers depend on a specific learner to assess a set of features and end up returning a quantity which confounds the relevance of a subset (the desired quantity) with the quality of the learner (not required). In other words, a wrapper returns a biased estimate of the relevance of a subset.
Our contribution: since the wrapper bias may have a strong negative impact on the selection procedure, we propose a model-based technique for relevance assessment which is low-biased.
  • 6. Feature selection and relevance
• Let us consider a binary classification problem where x ∈ X ⊂ R^n and y ∈ Y = {y0, y1}. Let s ⊆ x, s ∈ S, be a subset of the input vector.
• Let us denote
    p1(s) = Prob{y = y1 | s},    p0(s) = Prob{y = y0 | s}
• A feature selection problem can be formalized as a problem of (learner-independent) relevance maximization
    s* = arg max_{s ⊆ x, |s| ≤ d} R_s
where the goal is to find the subset s that maximizes the relevance quantity R_s, which accounts for the predictive power that the input s has on the target y.
  • 7. Relevance definitions
• A well-known example of a relevance measure is the mutual information I(s, y) = H(y) − H(y | s).
• Here we will focus on the quantity
    R_s = ∫_S [p0^2(s) + p1^2(s)] dF_s(s) = ∫_S r(s) dF_s(s)
where r(s) = 1 − g(s) and g(s) is the Gini index of diversity.
• Note that
    r(s) = p0^2(s) + p1^2(s) = 1 − 2 p0(s)(1 − p0(s)) = 1 − 2 Var{y | s}
where Var{y | s} is the conditional variance of y.
• Also, a monotone function G_H(·) : [0, 1] → [0, 0.5] maps the entropy H(y | s) of a binary variable y to the related Gini index g.
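The three equivalent expressions for the pointwise relevance r(s) on this slide can be checked numerically. The sketch below is purely illustrative: `p0` stands for a hypothetical conditional class probability, not a quantity computed from data.

```python
# Pointwise relevance r(s) = p0(s)^2 + p1(s)^2 = 1 - Gini, for a binary target.

def gini(p0: float) -> float:
    """Gini index of diversity for a binary class distribution (p0, 1 - p0)."""
    p1 = 1.0 - p0
    return 2.0 * p0 * p1  # equals 1 - (p0**2 + p1**2)

def pointwise_relevance(p0: float) -> float:
    """r(s) = 1 - g(s) = 1 - 2*Var(y|s) for y coded in {0, 1}."""
    return 1.0 - gini(p0)

# The three expressions on the slide coincide for any p0:
p0 = 0.8
assert abs(pointwise_relevance(p0) - (p0**2 + (1 - p0)**2)) < 1e-12
assert abs(pointwise_relevance(p0) - (1 - 2 * p0 * (1 - p0))) < 1e-12
```

Note that r(s) ranges from 0.5 (uninformative subset, p0 = 0.5) to 1 (deterministic class).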
  • 8. Bias of the wrapper approach
Given a learner h trained on a dataset of size N, the wrapper approach translates the (learner-independent) relevance maximization problem into a (learner-dependent) minimization problem
    arg min_{s ⊆ x, |s| ≤ d} M_s^h = arg min_{s ⊆ x, |s| ≤ d} ∫_S MME_h(s) dF_s(s)
where the Mean Misclassification Error is decomposed as follows (Wolpert, Kohavi, 96)
    MME_h(s) = 1/2 [1 − (p0^2(s) + p1^2(s))]
             + 1/2 [(p0(s) − p̂0(s))^2 + (p1(s) − p̂1(s))^2]
             + 1/2 [1 − (p̂0^2(s) + p̂1^2(s))]
             = 1/2 (n(s) + b(s) + v(s))
where p̂0(s) = Prob{ŷ = y0 | s}, n(s) = 1 − r(s) is the noise variance term, b(s) is the learner's squared bias and v(s) is the learner's variance.
NB: the term b(s) is NOT dependent on the relevance.
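The three-term decomposition above can be verified numerically: for a stochastic classifier that predicts y1 with probability p̂1(s), the misclassification probability at s equals (n + b + v)/2. The values of p0 and p̂0 below are illustrative, not taken from the slides.

```python
# Numerical check of the Wolpert-Kohavi decomposition of the
# misclassification error into noise, squared bias and variance.

def mme(p0: float, phat0: float) -> float:
    """P(yhat != y) when y ~ (p0, p1) and yhat ~ (phat0, phat1)."""
    p1, phat1 = 1 - p0, 1 - phat0
    return p0 * phat1 + p1 * phat0

def decomposition(p0: float, phat0: float) -> float:
    p1, phat1 = 1 - p0, 1 - phat0
    n = 1 - (p0**2 + p1**2)                 # noise term: 1 - r(s)
    b = (p0 - phat0)**2 + (p1 - phat1)**2   # learner squared bias
    v = 1 - (phat0**2 + phat1**2)           # learner variance
    return 0.5 * (n + b + v)

for p0, phat0 in [(0.9, 0.6), (0.5, 0.5), (0.7, 0.95)]:
    assert abs(mme(p0, phat0) - decomposition(p0, phat0)) < 1e-12
```

The check makes the slide's point concrete: only the noise term n(s) depends on the relevance r(s); the bias term b(s) depends purely on how well the learner approximates the conditional probabilities.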
  • 9. Bias of the wrapper approach
• In real classification tasks, the zero-one misclassification error M_s^h of a learner h for a subset s cannot be derived analytically but only estimated (typically by cross-validation).
• A wrapper selection returns
    s_h = arg min_{s ⊂ x, |s| ≤ d} M̂_s^h    (1)
where M̂_s^h is the estimate of the misclassification error of the learner h (e.g. computed by cross-validation).
• If a wrapper strategy relies on a generic learner h, that is, a learner whose bias term b(s) is significantly different from zero, the returned feature selection will depend on a quantity which is a biased estimate of the term r(s) and consequently of the relevance R_s. In other words, wrappers do not maximize relevance.
  • 10. Unbiased wrapper approach
• Intuitively, the bias would be reduced if we adopted a learner having a small bias term. A low-bias, yet high-variance, learner is the k-nearest neighbour classifier (kNN) for small values of k.
• In particular, it has been shown that for a 1NN learner and a binary classification problem
    lim_{N→∞} M_s^1NN = 1 − R_s
where M_s^1NN is the misclassification error of a nearest-neighbour classifier.
• Since cross-validation returns a consistent estimate of M_s^h, and since the quantity M_s^1NN asymptotically converges to one minus the relevance R_s, we have that 1 − M̂_s^1NN is a consistent estimator of the relevance R_s.
• We then propose as relevance estimator
    R̂_s^kNN = 1 − M̂_s^kNN
where M̂_s^kNN is the cross-validation error of a kNN learner with low k. This term returns a low-bias, yet high-variance, estimate of the relevance of the subset s.
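A minimal sketch of the estimator R̂_s^kNN for k = 1, using leave-one-out rather than the k-fold cross-validation of the slides (an implementation choice of ours) and a synthetic toy dataset in place of microarray data:

```python
import numpy as np

def loo_1nn_error(X, y):
    """Leave-one-out misclassification error of a 1NN classifier."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    errors = 0
    for i in range(len(y)):
        d = np.sum((X - X[i]) ** 2, axis=1)
        d[i] = np.inf                       # a point may not be its own neighbour
        errors += int(y[np.argmin(d)] != y[i])
    return errors / len(y)

def relevance_1nn(X, y):
    """R_hat_s^1NN = 1 - estimated misclassification error of 1NN on subset s."""
    return 1.0 - loo_1nn_error(X, y)

# Toy data: feature 0 determines the class, feature 2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = (X[:, 0] > 0).astype(int)
```

On such data the informative feature scores a relevance close to 1, while a noise feature scores close to 0.5, the minimum for a binary target.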
  • 11. Reducing the variance of the estimator
The low-bias high-variance nature of the R̂_s^kNN estimator suggests that the best way to employ this estimator is by combining it with other relevance estimators. We will take into consideration two possible estimators to combine with:
1. a direct model-based estimator p̂1 of the conditional probability p1(s) = Prob{y = y1 | s} and consequently of the quantity r(s). This estimator first samples a set of N unclassified input vectors s_i according to the empirical distribution F̂_s and then computes the Monte Carlo estimate
    R̂_s^D = (1/N) Σ_{i=1}^N [p̂1^2(s_i) + p̂0^2(s_i)] = 1 − (2/N) Σ_{i=1}^N p̂1(s_i)(1 − p̂1(s_i))
A similar estimator was proposed by Fukunaga in 1973 to estimate the Bayes error.
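The Monte Carlo estimator R̂_s^D can be sketched as follows. The conditional-probability model `p1_hat` is a hypothetical stand-in for whatever estimator of p1(s) is used; the pool `S` of unclassified inputs is illustrative.

```python
import numpy as np

def relevance_direct(p1_hat, S, n_draws=1000, seed=0):
    """Monte Carlo estimate R_hat_D of E[p0(s)^2 + p1(s)^2] under the empirical F_s."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(S), size=n_draws)          # resample inputs from the empirical F_s
    p1 = np.array([p1_hat(S[i]) for i in idx])
    return float(np.mean(1.0 - 2.0 * p1 * (1.0 - p1)))   # = mean of p1^2 + p0^2

# Illustrative pool of unclassified input vectors.
S = np.linspace(-3, 3, 200).reshape(-1, 1)
```

A deterministic model (p̂1 ∈ {0, 1} everywhere) yields a relevance of 1, while a coin-flip model (p̂1 ≡ 0.5) yields the minimum value 0.5, matching the behaviour of r(s).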
  • 12.
2. a filter estimator based on the notion of mutual information: several filter algorithms exploit this notion in order to estimate the relevance. An example is the MRMR algorithm (Peng et al., 05), where the relevance of a feature subset s, expressed in terms of the mutual information I(s, y) = H(y) − H(y | s), is approximated by the incremental formulation
    I_MRMR(s; y) = I_MRMR(s_i; y) + I(x_i; y) − (1/(m − 1)) Σ_{x_j ∈ s_i} I(x_j; x_i)    (2)
where x_i is a feature belonging to the subset s, s_i is the set s with the feature x_i set aside, and m is the number of components of s.
Now, since H(y | s) = H(y) − I(s, y) and G_s = 1 − R_s = G_H(H(y | s)), we obtain that
    R̂_s^MRMR = 1 − G_H(H(y) − I_MRMR(s, y))
is an MRMR estimator of the relevance R_s, where G_H(·) is the monotone mapping between the entropy H and the Gini index.
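The monotone map G_H from binary entropy to Gini index, used in R̂_s^MRMR above, can be sketched as follows. Inverting the entropy by bisection is an implementation choice of ours; the slides do not specify how G_H is computed.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a binary variable with P(y = y1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini_from_entropy(h: float) -> float:
    """G_H: invert H on [0, 0.5] by bisection, then return g = 2 p (1 - p)."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if binary_entropy(mid) < h:
            lo = mid
        else:
            hi = mid
    p = (lo + hi) / 2
    return 2 * p * (1 - p)

assert abs(gini_from_entropy(1.0) - 0.5) < 1e-6   # H = 1 bit -> p = 0.5 -> g = 0.5
assert gini_from_entropy(0.0) < 1e-6              # H = 0 -> pure class -> g = 0
```

The bisection exploits the fact that the binary entropy is strictly increasing on [0, 0.5], so the map from H to g is indeed monotone as the slides state.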
  • 13. Proposed relevance estimators
We propose two novel relevance estimators based on the principle of averaging
    R̂'_s = (R̂_s^kNN + R̂_s^D) / 2,    R̂''_s = (R̂_s^kNN + R̂_s^MRMR) / 2
and the associated feature selection algorithms:
    s_R' = arg max_{s ⊂ x, |s| ≤ d} R̂'_s,    s_R'' = arg max_{s ⊂ x, |s| ≤ d} R̂''_s
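The averaging principle and the maximization step can be sketched as below. Note that the slides do not specify the search strategy over subsets; since exhaustive search over all subsets of size ≤ d is infeasible for microarray-sized n, a greedy forward search is assumed here, and `score` stands in for any of the combined relevance estimates.

```python
def combined_relevance(r_knn: float, r_other: float) -> float:
    """Averaged estimator: R' = (R_kNN + R_D)/2 or R'' = (R_kNN + R_MRMR)/2."""
    return 0.5 * (r_knn + r_other)

def greedy_forward_selection(features, score, d):
    """Greedily grow a subset of at most d features maximising `score`.

    `score(subset)` is assumed to return an estimated relevance of the
    subset; the greedy heuristic replaces the intractable arg max over
    all subsets of size <= d.
    """
    selected = []
    while len(selected) < d:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break                       # no candidate improves the estimate
        selected.append(best)
    return selected
```

With a toy additive score, e.g. per-feature weights {0: 0.4, 1: 0.3, 2: 0.1}, the search with d = 2 selects features [0, 1], the two highest-scoring ones.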
  • 14. Experimental session
• 20 public-domain microarray expression datasets
• external three-fold cross-validation scheme
• to avoid any dependency between the learning algorithm employed by the wrapper and the classifier used for prediction, the experimental session is composed of two parts:
• Part 1: comparison with the wrapper WSVM, using the set of classifiers C1 = {TREE, NB, SVMSIGM, LDA, LOG}, which does not include the SVMLIN learner;
• Part 2: comparison with the wrapper WNB, using the set of classifiers C2 = {TREE, SVMSIGM, SVMLIN, LDA, LOG}, which does not include the NB learner.
  • 15. Experiments with cancer datasets
Name            N     n      K
Golub           72    7129   2
Alon            62    2000   2
Notterman       36    7457   2
Nutt            50    12625  2
Shipp           77    7129   2
Singh           102   12600  2
Sorlie          76    7937   2
Wang            286   22283  2
Van’t Veer      65    24481  2
VandeVijver     295   24496  2
Sotiriou        99    7650   2
Pomeroy         60    7129   2
Khan            63    2308   4
Hedenfalk       22    3226   3
West            49    7129   4
Staunton        60    7129   9
Su              174   12533  11
Bhattacharjee   203   12600  5
Armstrong       72    12582  3
Ma              60    22575  3
  • 16. Results 1st part
Name            R'      WSVM    R''     MRMR    RANK
Golub           0.0917  0.1177  0.1     0.1079  0.1225
Alon            0.2704  0.2658  0.2267  0.1996  0.2281
Notterman       0.1966  0.0985  0.1494  0.1472  0.1432
Nutt            0.3798  0.4171  0.3873  0.3847  0.4189
Shipp           0.1429  0.1319  0.1322  0.1362  0.1873
Singh           0.1619  0.1517  0.1266  0.1374  0.1328
Sorlie          0.3835  0.4314  0.3963  0.4004  0.3987
Wang            0.4282  0.4111  0.4218  0.4232  0.4181
Van’t Veer      0.2786  0.2638  0.2492  0.2217  0.2277
VandeVijver     0.454   0.4724  0.4365  0.4636  0.4482
Sotiriou        0.5279  0.5796  0.5351  0.5708  0.5339
Pomeroy         0.428   0.4191  0.4141  0.3876  0.4181
Khan            0.0878  0.1143  0.0582  0.0686  0.131
Hedenfalk       0.5475  0.5263  0.452   0.5273  0.5389
West            0.6463  0.6109  0.6186  0.5746  0.6109
Staunton        0.6822  0.71    0.6511  0.6865  0.7407
Su              0.2568  0.307   0.2549  0.3772  0.3352
Bhattacharjee   0.1232  0.1347  0.1105  0.1057  0.1515
Armstrong       0.1082  0.1199  0.1306  0.115   0.1122
Ma              0.2456  0.2041  0.2257  0.2413  0.2317
AVG             0.323   0.331   0.310   0.326   0.331
W/B than R' (R'')       10/7            9/6     9/2
  • 17. Results 2nd part
Name            R'      WNB     R''     MRMR    RANK
Golub           0.0886  0.1114  0.0971  0.1019  0.0904
Alon            0.2376  0.2568  0.2181  0.2109  0.221
Notterman       0.1852  0.2059  0.1491  0.1512  0.1645
Nutt            0.3929  0.3402  0.36    0.3898  0.4258
Shipp           0.1261  0.127   0.1198  0.1338  0.1734
Singh           0.1495  0.1454  0.1297  0.1377  0.1245
Sorlie          0.3848  0.4254  0.3808  0.3953  0.3838
Wang            0.4363  0.4345  0.4298  0.4281  0.4255
Van’t Veer      0.2747  0.2715  0.2421  0.2253  0.2325
VandeVijver     0.4626  0.44    0.4763  0.4721  0.4358
Sotiriou        0.5126  0.5578  0.5505  0.5732  0.5611
Pomeroy         0.4367  0.4389  0.4007  0.3902  0.4224
Khan            0.0804  0.0896  0.0628  0.0631  0.0901
Hedenfalk       0.5379  0.5187  0.4369  0.4904  0.4949
West            0.6413  0.6696  0.5542  0.5882  0.6728
Staunton        0.6689  0.8298  0.6981  0.6661  0.83
Su              0.2544  0.3096  0.2646  0.3739  0.3529
Bhattacharjee   0.1235  0.1209  0.101   0.1061  0.1186
Armstrong       0.1079  0.1668  0.125   0.1148  0.1034
Ma              0.2565  0.2635  0.2335  0.2443  0.2681
AVG             0.322   0.3335  0.315   0.327   0.331
W/B than R' (R'')       9/2             10/3    11/2
  • 18. Conclusions
• Feature selection demands an accurate estimation of the relevance of subsets of features.
• Wrapper methods use a cross-validation estimate of the misclassification error of generic learners. We show that this yields a biased estimate of relevance.
• The cross-validation assessment R̂_s^kNN returned by kNN techniques with low k provides a low-bias yet high-variance estimator of relevance.
• The variance can be reduced by combining it with other estimators.
• Experiments on real datasets showed that the resulting relevance estimators can outperform both conventional wrapper and filter algorithms.