Week 13 Feature Selection - Computer Vision Part 2
1. Program Studi Teknik Informatika
Fakultas Teknik - Universitas Surabaya
Feature Selection
Week 13
1604C055 - Machine Learning
2. Feature selection
• Using a huge number of features to build a machine learning model does not always produce good performance.
• Irrelevant features negatively affect model performance.
• Redundant features:
  – lead to overfitting,
  – reduce the generalization capability of the model,
  – reduce the accuracy of the model.
• Adding more and more features to the model:
  – increases the overall complexity of the model,
  – increases the computational time.
3. Feature selection
• Feature selection is a process used to automatically select the best subset of features in the dataset, i.e., those that contribute most to the prediction variable or output.
• Benefits:
  – Reduces overfitting: less redundant data means less opportunity to make decisions based on noise.
  – Improves accuracy: less misleading data means modeling accuracy improves.
  – Reduces training time: less data means that algorithms train faster.
4. Some techniques
• Filter methods select the best features by examining their statistical properties, e.g., variance threshold, correlation coefficient, chi-square test, ANOVA F-value statistic.
• Wrapper methods use trial and error to find the subset of features that produces models with the highest-quality predictions, e.g., forward feature selection, backward feature selection.
• Embedded methods select the best feature subset as part of, or as an extension of, a learning algorithm's training process.
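For orientation, the sketch below maps these three families to utilities in scikit-learn; this assumes scikit-learn is the toolkit used in this course (the later worked examples lean on it as well) and only shows the relevant imports, not a full pipeline.

```python
# Rough mapping of the three families to scikit-learn utilities
# (module paths are scikit-learn's own; a version >= 1.0 is assumed).
from sklearn.feature_selection import (
    VarianceThreshold,             # filter: variance threshold
    SelectKBest, chi2, f_classif,  # filter: chi-square test, ANOVA F-value
    SequentialFeatureSelector,     # wrapper: forward/backward selection
    SelectFromModel,               # embedded: selection from model-derived importances
)
```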
5. Variance threshold
• A simple method for feature selection based on variance.
• Remove all features whose variance is less than or equal to a threshold value.
• Variance of a feature with $n$ observations $x = (x_1, x_2, \ldots, x_n)$:

  $\mathrm{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

• For a binary feature with $n$ observations $x = (x_1, x_2, \ldots, x_n)$, $x_i \in \{0, 1\}$:

  $\mathrm{Var}(x) = p(1 - p)$

  where $p$ is the proportion of 1s (a quick numerical check follows below).
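As a quick sanity check, a minimal sketch assuming NumPy (the column of values is purely illustrative): the population variance of a binary feature indeed equals p(1 - p).

```python
import numpy as np

x = np.array([1, 0, 0, 0, 0])   # binary feature with p = 0.2
p = x.mean()                    # proportion of 1s
print(np.var(x))                # population variance (ddof=0) -> 0.16
print(p * (1 - p))              # p(1 - p)                     -> 0.16
```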
6. Variance threshold: example

  x1   x2   x3
   1    0    1
   0    1    0
   0    1    1
   0    1    0
   0    1    1
  p = 0.2   p = 0.8   p = 0.6

• Goal: select the features in which the proportion of 1s (or 0s) is less than 0.75.
• Threshold: t = 0.75(1 - 0.75) = 0.1875
• Var(x1) = 0.2(1 - 0.2) = 0.16 ≤ t
• Var(x2) = 0.8(1 - 0.8) = 0.16 ≤ t
• Var(x3) = 0.6(1 - 0.6) = 0.24 > t
• Selected feature: x3 (see the scikit-learn sketch below)
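The same example in code, as a minimal sketch assuming scikit-learn's VarianceThreshold, which removes every feature whose training-set variance does not exceed the given threshold:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The three binary features from the table above (rows = observations).
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [0, 1, 1],
              [0, 1, 0],
              [0, 1, 1]])

# Keep features whose variance is strictly greater than 0.75 * (1 - 0.75) = 0.1875.
selector = VarianceThreshold(threshold=0.75 * (1 - 0.75))
X_selected = selector.fit_transform(X)

print(selector.variances_)     # [0.16 0.16 0.24]
print(selector.get_support())  # [False False  True] -> only x3 is kept
print(X_selected)              # the remaining column, x3
```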
10. Correlation coefficient
• If two features are highly correlated, then the information contained in those features is very similar.
• Highly correlated features can therefore be considered redundant.
• Remove features that are highly correlated with another feature (correlation greater than a threshold value).
• Correlation coefficient of two features with $n$ observations $x_1 = (x_{11}, x_{12}, \ldots, x_{1n})$ and $x_2 = (x_{21}, x_{22}, \ldots, x_{2n})$:

  $\mathrm{corr}(x_1, x_2) = \dfrac{\sum_{i=1}^{n}(x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sqrt{\sum_{i=1}^{n}(x_{1i} - \bar{x}_1)^2}\,\sqrt{\sum_{i=1}^{n}(x_{2i} - \bar{x}_2)^2}}$
11. Correlation coefficient: example

  x1   x2   x3
   1    1    1
   2    3    0
   3    5    1
   4    7    0
   5    8    1

• Goal: select the features whose absolute correlation with every other feature is less than 0.95.
• |corr(x1, x2)| = 0.994 > 0.95
• |corr(x1, x3)| = 0 < 0.95
• |corr(x2, x3)| = 0.064 < 0.95
• x1 and x2 are highly correlated, so one of the pair (here x1) is dropped.
• Selected features: x2, x3 (see the pandas sketch below)
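A minimal sketch of this filter with pandas and NumPy; the column names are simply the ones from the table above, and which member of a highly correlated pair gets dropped is a convention, chosen here so the result matches the slide.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [1, 2, 3, 4, 5],
                   "x2": [1, 3, 5, 7, 8],
                   "x3": [1, 0, 1, 0, 1]})

corr = df.corr().abs()  # absolute Pearson correlation matrix
# Keep only the upper triangle so every pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop a feature if it is correlated above 0.95 with any later feature.
to_drop = [row for row in upper.index if (upper.loc[row] > 0.95).any()]

print(corr.round(3))                   # corr(x1, x2) ~ 0.994, the others are small
print(to_drop)                         # ['x1']
df_selected = df.drop(columns=to_drop)  # x2 and x3 remain
```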
16. Chi-square test
• Can only be applied to categorical features.
• Used to examine the independence of two categorical vectors, i.e., a feature and the target.
• If the feature and target are independent, then the feature is considered irrelevant.
• For a numerical feature, the chi-square test can still be applied by first transforming the quantitative feature into a categorical one (e.g., by binning).
• Chi-square measures how much the expected counts E and the observed counts O deviate from each other.
17. Chi-square test
• The chi-square statistic ($\chi^2$) summarizes the difference between the observed number of observations in each class of a categorical feature and the expected number if that feature were independent of (i.e., had no relationship with) the target:

  $\chi^2_{stat} = \sum_{i=1}^{c} \frac{(O_i - E_i)^2}{E_i}$

  where the sum runs over the $c$ classes (cells of the contingency table).
• $O_i$: the number of observations in class $i$
• $E_i$: the expected number of observations in class $i$ if there is no relationship between the feature and the target
18. Chi-square test steps
• Define the hypotheses:
  – H0: feature and target are independent
  – H1: feature and target are not independent
• Build a contingency table.
• Find the expected values:
  – under H0 the feature and target are independent,
  – so P(A ∩ B) = P(A) P(B).
• Calculate the chi-square statistic χ²_stat.
• Accept or reject the null hypothesis.
19. Chi-square test steps
• Accept or reject the null hypothesis:
  – Choose the level of significance, usually α = 0.05.
  – Determine the degrees of freedom, df = (n_c - 1)(n_r - 1), where n_c and n_r are the number of columns and rows in the contingency table, respectively.
  – Determine χ²_{α,df} from the χ² distribution table, or compute the p-value:

    p-value = P(χ² > χ²_stat)

  – Reject H0 if χ²_stat > χ²_{α,df} or p-value < α (see the SciPy sketch below).
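These steps map almost one-to-one onto scipy.stats.chi2_contingency. A minimal sketch, assuming SciPy is available; correction=False disables Yates' continuity correction so the result on a 2x2 table matches the hand calculation in the next slides.

```python
from scipy.stats import chi2_contingency

def chi_square_select(table, alpha=0.05):
    """Decide whether a feature is relevant from its feature-vs-target
    contingency table (rows = feature categories, columns = target classes)."""
    chi2_stat, p_value, dof, expected = chi2_contingency(table, correction=False)
    # Reject H0 (independence) when p-value < alpha, i.e. the feature is relevant.
    return chi2_stat, p_value, dof, expected, p_value < alpha
```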
20. Chi-square test: example

Raw data (first rows):

  x1  x2  y
   A   M  1
   A   M  1
   B   M  0
   B   M  0
   B   F  0
   …   …  …

Contingency table of x1 and y:

  Feature \ Target |  0 |  1 | Total
  A                | 20 | 10 |  30
  B                | 40 | 30 |  70
  Total            | 60 | 40 | 100

• Expected values:
  – E_A0 = 100 × (30/100) × (60/100) = 18
  – E_A1 = 100 × (30/100) × (40/100) = 12
  – E_B0 = 100 × (70/100) × (60/100) = 42
  – E_B1 = 100 × (70/100) × (40/100) = 28
• Chi-square statistic:

  χ²_stat = (20 - 18)²/18 + (10 - 12)²/12 + (40 - 42)²/42 + (30 - 28)²/28 = 0.7937
21. Chi-square test: example
• Accept or reject the null hypothesis:
  – Level of significance α = 0.05
  – Degrees of freedom df = (2 - 1)(2 - 1) = 1
  – χ²_{0.05,1} = 3.841 from the χ² distribution table
  – Since χ²_stat = 0.7937 < χ²_{0.05,1}, H0 is accepted (we fail to reject it).
  – Conclusion: x1 and y are independent, so x1 should not be selected.
22. Chi-square test: example

Raw data (first rows):

  x1  x2  y
   A   M  1
   A   M  1
   B   M  0
   B   M  0
   B   F  0
   …   …  …

Contingency table of x2 and y:

  Feature \ Target |  0 |  1 | Total
  M                | 20 | 30 |  50
  F                | 40 | 10 |  50
  Total            | 60 | 40 | 100

• Expected values:
  – E_M0 = 100 × (50/100) × (60/100) = 30
  – E_M1 = 100 × (50/100) × (40/100) = 20
  – E_F0 = 100 × (50/100) × (60/100) = 30
  – E_F1 = 100 × (50/100) × (40/100) = 20
• Chi-square statistic:

  χ²_stat = (20 - 30)²/30 + (30 - 20)²/20 + (40 - 30)²/30 + (10 - 20)²/20 = 16.667
23. Chi-square test: example
• Accept or reject the null hypothesis:
  – Level of significance α = 0.05
  – Degrees of freedom df = (2 - 1)(2 - 1) = 1
  – χ²_{0.05,1} = 3.841
  – Since χ²_stat = 16.667 > χ²_{0.05,1}, H0 is rejected.
  – Conclusion: x2 and y are dependent, so x2 can be selected (see the sketch below).
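Feeding the two contingency tables above to scipy.stats.chi2_contingency (again with correction=False so the 2x2 results match the hand calculations) reproduces both decisions; the p-values in the comments are approximate.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency tables from the x1 and x2 examples
# (rows = feature categories, columns = target classes 0 and 1).
table_x1 = np.array([[20, 10],
                     [40, 30]])
table_x2 = np.array([[20, 30],
                     [40, 10]])

for name, table in [("x1", table_x1), ("x2", table_x2)]:
    chi2_stat, p_value, dof, _ = chi2_contingency(table, correction=False)
    decision = "select" if p_value < 0.05 else "do not select"
    print(f"{name}: chi2 = {chi2_stat:.4f}, p = {p_value:.4f} -> {decision}")
# x1: chi2 = 0.7937,  p ~ 0.37   -> do not select (H0 accepted)
# x2: chi2 = 16.6667, p < 0.001  -> select        (H0 rejected)
```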
27. ANOVA
• ANOVA (Analysis of Variance) is a statistical test used to examine whether two or more groups differ from each other significantly, by comparing the mean of each group.
• In feature selection, the observations of a feature are grouped based on the target class.
• If the group means are significantly different, then the feature is considered relevant.
28. ANOVA steps
• Define the hypotheses:
  – H0: all group mean values are the same
  – H1: at least one of the group mean values differs
• Calculate the total sum of squares (SST), the between-group sum of squares (SSB), and the error sum of squares (SSE).
• Determine the degrees of freedom.
• Calculate the between-group mean square (MSB) and the error mean square (MSE).
• Calculate the F statistic.
29. ANOVA steps
• Suppose $x_g = (x_{g1}, x_{g2}, \ldots, x_{g n_g})$ are the observations of the feature that belong to group (class) $g$, for $g = 1, 2, \ldots, k$, where $k$ is the number of groups (classes).
• Calculate the sums of squares (a code sketch follows below):

  $\bar{x} = \frac{1}{n}\sum_{g=1}^{k}\sum_{i=1}^{n_g} x_{gi}, \qquad n = \sum_{g=1}^{k} n_g, \qquad \bar{x}_g = \frac{1}{n_g}\sum_{i=1}^{n_g} x_{gi}$

  $SST = \sum_{g=1}^{k}\sum_{i=1}^{n_g} (x_{gi} - \bar{x})^2$

  $SSB = \sum_{g=1}^{k} n_g (\bar{x}_g - \bar{x})^2$

  $SSE = SST - SSB$
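A minimal sketch of these quantities in NumPy, cross-checked against scipy.stats.f_oneway. Taking the F statistic as MSB/MSE with degrees of freedom k - 1 and n - k is the standard one-way ANOVA definition assumed here (the deck's step list above names MSB, MSE, and F but the formula itself is not shown on this slide); the sample data is purely illustrative.

```python
import numpy as np
from scipy.stats import f_oneway

# Illustrative data: one feature split into k = 2 groups by target class.
groups = [np.array([2.1, 2.5, 1.9, 2.3]),   # observations with target class 0
          np.array([3.0, 3.4, 2.8, 3.2])]   # observations with target class 1

all_x = np.concatenate(groups)
n, k = all_x.size, len(groups)
grand_mean = all_x.mean()

sst = np.sum((all_x - grand_mean) ** 2)                            # total sum of squares
ssb = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)   # between-group sum of squares
sse = sst - ssb                                                    # error sum of squares

msb = ssb / (k - 1)    # between-group mean square
mse = sse / (n - k)    # error mean square
f_stat = msb / mse

print(f_stat)                         # hand-computed F statistic
print(f_oneway(*groups).statistic)    # SciPy's one-way ANOVA F, should match
```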