Feature Selection Methods
Zahra Mojtahedin
Fall 2024
Outline
Introduction
Filter Methods
Wrapper Methods
Embedded Methods
Results
Conclusion
Introduction
• The problem of reducing irrelevant and redundant variables.
• Feature selection:
1. Understanding the data
2. Reducing computation requirements
3. Improving predictor performance
Filter Methods
• Variable ranking techniques.
• They are applied before classification, independently of the classifier.
• Correlation criteria:
• The Pearson correlation coefficient: R(i) = cov(x_i, Y) / sqrt(var(x_i) * var(Y))
• Python: numpy.corrcoef
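A minimal sketch of Pearson-based feature ranking with numpy.corrcoef; the pearson_rank helper and the toy data are illustrative assumptions:

```python
import numpy as np

def pearson_rank(X, y):
    """Rank features by absolute Pearson correlation with the target.

    X: (n_samples, n_features) array, y: (n_samples,) array.
    Returns (indices sorted from most to least correlated, scores).
    """
    # np.corrcoef of two 1-D arrays returns a 2x2 matrix; entry [0, 1]
    # is the correlation between feature j and the target.
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

# Toy usage: feature 0 and feature 2 drive the target, the rest are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=100)
order, scores = pearson_rank(X, y)
print(order, np.round(scores, 2))
```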
Filter Methods
• Mutual Information (MI): a measure of the dependency between two variables.
• The uncertainty in the output Y (entropy): H(Y) = -Σ_y p(y) log p(y)
• The conditional entropy: H(Y|X) = -Σ_x p(x) Σ_y p(y|x) log p(y|x)
• The mutual information: I(Y; X) = H(Y) - H(Y|X)
• Python: sklearn.feature_selection.mutual_info_classif
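A minimal sketch of MI-based filtering with sklearn.feature_selection.mutual_info_classif; the breast-cancer dataset is used purely as an illustrative assumption:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

# Estimate MI between each feature and the class label, then rank features.
X, y = load_breast_cancer(return_X_y=True)
mi = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(mi)[::-1]
print("Top 5 features by estimated MI:", ranking[:5])
```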
Filter Methods
• Feature ranking:
1. Computationally light
2. Avoids overfitting
3. Works well for certain datasets
4. The selected subset might not be optimal:
   • Correlation between variables is not considered, so redundant features may be selected.
Wrapper Methods
• Use the predictor as a black box.
• Objective: evaluate variable subsets based on predictor performance.
• Search algorithms are employed to find suboptimal feature subsets, since exhaustive search is intractable.
• These methods aim to balance computational feasibility and good results.
• Classified into Sequential Selection Algorithms and Heuristic Search Algorithms.
• Sequential selection starts with an empty set (or the full set), adding or removing features one at a time.
Sequential Selection Algorithms:
• These algorithms are iterative in nature (a minimal sketch follows this list).
1. Sequential Feature Selection (SFS): Starts with an empty set and adds the single feature with the highest objective function value. New features are added iteratively, provided they increase classification accuracy, until the desired number of features is included.
2. Sequential Backward Selection (SBS): Similar to SFS but starts with all variables and removes one at a time, discarding the feature whose removal has the lowest impact on predictor performance.
3. Sequential Floating Forward Selection (SFFS): Enhances SFS with a backtracking step. Features are added as in SFS and then conditionally excluded again if their removal improves the objective function. Repeats until the required number of features or the target performance is reached.
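A minimal sketch of SFS and SBS using scikit-learn's SequentialFeatureSelector with a linear SVM as the black-box predictor; the dataset, subset size, and cross-validation setting are illustrative assumptions. (Scikit-learn does not provide the floating SFFS variant; mlxtend's SequentialFeatureSelector with floating=True is one option for that.)

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
clf = SVC(kernel="linear")

# SFS: grow the subset one feature at a time, scored by cross-validated accuracy.
sfs = SequentialFeatureSelector(clf, n_features_to_select=5, direction="forward", cv=5)
sfs.fit(X, y)
print("SFS picked:", sfs.get_support(indices=True))

# SBS: start from all features and drop the least useful one at a time.
sbs = SequentialFeatureSelector(clf, n_features_to_select=5, direction="backward", cv=5)
sbs.fit(X, y)
print("SBS picked:", sbs.get_support(indices=True))
```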
Wrapper Methods
Heuristic Search Algorithms
• Genetic Algorithm (GA): Used for feature selection, where each chromosome bit represents the inclusion of a feature. It searches for the global maximum of the objective function, which is predictor performance (a simplified sketch follows this list).
• GA parameters and operators can be adapted to the specific data and application to optimize performance.
• Cluster-based Half Uniform Crossover Genetic Algorithm: a non-traditional GA with unique characteristics:
1. Selects the best N individuals from the combined parent and offspring pool.
2. Utilizes the highly disruptive Half Uniform Crossover (HUX), which swaps a random half of the non-matching alleles.
3. Mating only occurs between diverse parents: the Hamming distance between parents is calculated, and if it does not exceed a threshold, they do not mate.
4. If no offspring are generated (the threshold drops to zero), a cataclysmic mutation is introduced.
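A simplified GA sketch for feature selection: one bit per feature, with cross-validated SVM accuracy as the fitness. It uses plain elitist selection, uniform crossover, and bit-flip mutation rather than the HUX and Hamming-threshold scheme described above, and the dataset, population size, and rates are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

def fitness(mask):
    """Cross-validated SVM accuracy on the selected feature columns (0 if empty)."""
    if not mask.any():
        return 0.0
    return cross_val_score(SVC(kernel="linear"), X[:, mask], y, cv=3).mean()

# Initial population: random binary chromosomes, one bit per feature.
pop_size, n_gen, p_mut = 20, 15, 0.05
population = rng.integers(0, 2, size=(pop_size, n_features)).astype(bool)

for _ in range(n_gen):
    scores = np.array([fitness(ind) for ind in population])
    # Elitist selection: the better half survives and becomes the parent pool.
    parents = population[np.argsort(scores)[::-1][: pop_size // 2]]
    children = []
    while len(children) < pop_size - len(parents):
        a, b = parents[rng.integers(len(parents), size=2)]
        child = np.where(rng.random(n_features) < 0.5, a, b)   # uniform crossover
        child ^= rng.random(n_features) < p_mut                # bit-flip mutation
        children.append(child)
    population = np.vstack([parents] + children)

best = population[np.argmax([fitness(ind) for ind in population])]
print("Selected features:", np.flatnonzero(best))
```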
Embedded Methods
Embedded Feature Selection Methods:
• Objective: reduce computation time by incorporating feature selection into the training process.
• Motivation: filter methods based on Mutual Information (MI) have limitations.
• Greedy search algorithm: an objective function for feature selection that maximizes the MI between a feature and the class output.
• Goal: maximize MI with the class output while minimizing MI between the candidate feature and the subset of previously selected features.
• Criterion: select the feature f that maximizes I(Y; f) - β Σ_{s∈S} I(f; s), where S is the set of already selected features and β weights the redundancy penalty (a minimal sketch follows this list).
• Selection based on inter-feature MI yields a non-redundant feature subset.
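A minimal sketch of this greedy criterion, estimating I(Y; f) with mutual_info_classif and the inter-feature terms I(f; s) with mutual_info_regression; the dataset, β value, and subset size are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_breast_cancer(return_X_y=True)
n_select, beta = 5, 0.5

relevance = mutual_info_classif(X, y, random_state=0)   # I(Y; f) for every feature
selected, remaining = [], list(range(X.shape[1]))

while len(selected) < n_select:
    best_f, best_score = None, -np.inf
    for f in remaining:
        # Redundancy: MI between candidate f and each already-selected feature s.
        redundancy = sum(
            mutual_info_regression(X[:, [s]], X[:, f], random_state=0)[0]
            for s in selected
        )
        score = relevance[f] - beta * redundancy
        if score > best_score:
            best_f, best_score = f, score
    selected.append(best_f)
    remaining.remove(best_f)

print("Greedily selected features:", selected)
```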
• Classifier weight-based ranking: features are ranked for removal based on classifier weights w_j (a minimal sketch follows this list).
• w_j = (μ_j(+) - μ_j(-)) / (σ_j(+) + σ_j(-)), where μ_j(±) and σ_j(±) are the mean and standard deviation of feature j over the positive and negative classes.
• Features are ranked in proportion to their contribution to this correlation-like criterion.
• Sensitivity analysis on w_j is used for feature selection.
• Suggested method: track the change in the objective function, a linear discriminant function J based on w_j.
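A minimal sketch of this weight-based ranking, assuming w_j is computed from the per-class means and standard deviations as written above; the dataset is an illustrative assumption. For the sensitivity-analysis flavour, scikit-learn's RFE with a linear SVM is a closely related, readily available alternative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
pos, neg = X[y == 1], X[y == 0]

# w_j = (mu_j(+) - mu_j(-)) / (sigma_j(+) + sigma_j(-)):
# a large |w_j| means feature j separates the two classes well on its own.
w = (pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0))

ranking = np.argsort(np.abs(w))[::-1]
print("Top 5 features by |w_j|:", ranking[:5])
```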
Results
• Datasets: from the UCI Machine Learning Repository and MKS Instruments
• SVM classifier
• Results for correlation criteria using SVM (figure)
• Results for MI using SVM (figure)
• Results for SFFS using SVM (figure)
• An example of Mutual Information (figure)
Conclusion
• More information is not always better in machine learning applications.
• Applying feature selection provides:
1. Insight into the data
2. A better classifier model
3. Enhanced generalization
4. Identification of irrelevant variables
Thank You For Your Attention!