Module 3
Advanced Feature Engineering and Feature Selection
Introduction to Feature Engineering
Feature engineering is the process of improving a model’s accuracy by using domain knowledge to select and transform the most relevant variables in raw data into features that better represent the underlying problem to the predictive model.
Feature Engineering
Feature engineering covers four broad areas:
▪ Feature Transformation – missing value imputation, handling categorical features, outlier detection, feature scaling
▪ Feature Construction
▪ Feature Selection
▪ Feature Extraction
Missing Value Imputation
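As a minimal sketch of this step, the example below uses scikit-learn's SimpleImputer on a small, invented DataFrame; the column names, values, and imputation strategies are assumptions chosen only for illustration.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing entries in a numeric and a categorical column
df = pd.DataFrame({
    "age": [25, np.nan, 47, 31, np.nan],
    "city": ["Pune", "Delhi", np.nan, "Delhi", "Pune"],
})

# Numeric column: replace missing values with the column mean
num_imputer = SimpleImputer(strategy="mean")
df[["age"]] = num_imputer.fit_transform(df[["age"]])

# Categorical column: replace missing values with the most frequent category
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = cat_imputer.fit_transform(df[["city"]])

print(df)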
Handling Categorical Features
Outlier Detection
Interquartile range (IQR) = Upper Quartile − Lower Quartile = Q3 − Q1
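A minimal sketch of IQR-based outlier detection in Python; the 1.5 × IQR fence is the conventional rule, and the sample values are made up for illustration.

import numpy as np

# Hypothetical sample with one obvious outlier
values = np.array([12, 15, 14, 10, 13, 16, 11, 14, 95, 12])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1  # interquartile range = Q3 - Q1

# Conventional fences: points beyond 1.5 * IQR from the quartiles are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print("IQR:", iqr, "Outliers:", outliers)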
Feature Scaling
Why do we need feature scaling?
Feature scaling techniques fall into two categories: standardization and normalization.
Standardization (Z-score normalization)
Assume our dataset has random numeric values in the range 1 to 95,000, in random order. For illustration, consider a small dataset of just 10 values drawn from this range.
Looking at these values, their range is so wide that training a model on 10,000 such values would take a lot of time.
Standardization helps solve this problem by:
● Rescaling the values to a common scale, as z-scores with mean 0 and standard deviation 1.
● Keeping the relative spacing between the values intact.
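A minimal sketch of standardization with scikit-learn; the sample values are invented to mirror the 1 to 95,000 example above.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values spanning a wide range, as in the example above
x = np.array([[1], [250], [4_800], [12_000], [33_500],
              [47_000], [58_250], [69_900], [82_300], [95_000]], dtype=float)

scaler = StandardScaler()          # z-score: (x - mean) / std
x_std = scaler.fit_transform(x)

print(x_std.round(2).ravel())      # mean ~0, standard deviation ~1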
Normalization
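Normalization (min-max scaling) rescales each feature to a fixed range, typically [0, 1]. A minimal sketch, reusing the same invented values as in the standardization example:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same hypothetical wide-range values as in the standardization sketch
x = np.array([[1], [250], [4_800], [12_000], [33_500],
              [47_000], [58_250], [69_900], [82_300], [95_000]], dtype=float)

# Min-max scaling: (x - min) / (max - min), mapping values into [0, 1]
x_norm = MinMaxScaler(feature_range=(0, 1)).fit_transform(x)
print(x_norm.round(3).ravel())     # smallest value -> 0.0, largest -> 1.0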
Feature Selection Techniques
Feature selection is a crucial step in the machine learning pipeline, involving
the selection of a subset of relevant features (variables, predictors) for use in
model construction. Effective feature selection can improve model
performance, reduce overfitting, and decrease training time.
The role of feature selection in machine learning is:
1. To reduce the dimensionality of the feature space.
2. To speed up the learning algorithm.
3. To improve the predictive accuracy of a classification algorithm.
There are several techniques for feature selection:
Filter Methods
▪ In filter methods, features are selected on the basis of statistical measures.
▪ These methods do not depend on the learning algorithm and choose features as a pre-processing step.
▪ They are faster and less computationally expensive than wrapper methods.
▪ When dealing with high-dimensional data, it is computationally cheaper to use filter methods.
▪ They are very good for removing duplicated, correlated, and redundant features, but they do not by themselves remove multicollinearity.
Information Gain
Information gain is the amount of information a feature provides for identifying the target value; it measures the reduction in entropy when the data is split on that feature. The information gain of each attribute is calculated with respect to the target values and used for feature selection.
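Information gain is closely related to mutual information. A minimal sketch using scikit-learn's mutual information estimator to rank features; the built-in iris dataset is used purely for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# Estimate how much information each feature carries about the target
scores = mutual_info_classif(X, y, random_state=0)

# Rank features by score, highest (most informative) first
ranking = sorted(enumerate(scores), key=lambda s: s[1], reverse=True)
for idx, score in ranking:
    print(f"feature {idx}: {score:.3f}")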
Chi-square Test
The chi-square test is a technique to determine whether there is a relationship between two categorical variables. The chi-square statistic is calculated between each feature and the target variable, and the desired number of features with the best chi-square scores is selected.
Chi-square Test Example
Steps:
1. Define the null and alternative hypotheses:
Null hypothesis: there is no significant association between the two categorical variables.
Alternative hypothesis: there is a significant association between the two categorical variables.
2. Calculate the contingency table.
3. Calculate the expected values.
4. Calculate the chi-square value.
5. Compare the chi-square value with the critical value to accept or reject the null hypothesis.
Degrees of freedom = (r − 1)(c − 1)
Significance level = 0.05
In the worked example the chi-square value exceeds the critical value, so the null hypothesis is rejected; therefore, income level is a relevant feature for predicting subscription status.
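A minimal sketch of this test in Python using scipy; the contingency table of income level versus subscription status is invented for illustration.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = income level (low, medium, high),
# columns = subscription status (subscribed, not subscribed)
observed = np.array([
    [20, 80],
    [45, 55],
    [70, 30],
])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi-square = {chi2:.2f}, degrees of freedom = {dof}, p-value = {p_value:.4f}")
# At a 0.05 significance level, reject the null hypothesis when p_value < 0.05,
# i.e. conclude the feature is associated with the target
if p_value < 0.05:
    print("Income level appears to be a relevant feature.")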
Fisher’s Score
Fisher score is one of the most widely used supervised feature selection methods. The algorithm returns the ranks of the variables based on their Fisher scores in descending order.
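A minimal NumPy sketch of the idea behind the Fisher score: for each feature, the between-class spread of the class means is divided by the within-class variance. This is one common formulation; dedicated implementations (e.g. in the skfeature package) may differ in details.

import numpy as np
from sklearn.datasets import load_iris

def fisher_score(X, y):
    """Fisher score per feature: between-class spread over within-class spread."""
    scores = np.zeros(X.shape[1])
    overall_mean = X.mean(axis=0)
    for j in range(X.shape[1]):
        numerator, denominator = 0.0, 0.0
        for c in np.unique(y):
            Xc = X[y == c, j]
            numerator += len(Xc) * (Xc.mean() - overall_mean[j]) ** 2
            denominator += len(Xc) * Xc.var()
        scores[j] = numerator / denominator
    return scores

X, y = load_iris(return_X_y=True)
scores = fisher_score(X, y)
# Rank features by Fisher score, highest first
print(np.argsort(scores)[::-1], scores.round(2))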
Missing Value Ratio
The missing value ratio can be used to evaluate each feature against a threshold value. It is computed as the number of missing values in a column divided by the total number of observations. Any variable whose ratio exceeds the threshold can be dropped.
Missing Value Ratio:
1. Calculate the missing value ratio for each feature by dividing the number of missing values by
the total number of instances in the dataset.
2. Set a threshold for the acceptable missing value ratio (e.g., 0.8, meaning a feature may have at most 80% of its values missing to be kept).
3. Filter out features that have a missing value ratio above the threshold.
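A minimal pandas sketch of this filter; the DataFrame and the 0.8 threshold mirror the steps above and are invented for illustration.

import numpy as np
import pandas as pd

# Hypothetical dataset: feature "c" is entirely missing
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1.0, np.nan, 3.0, 4.0, 5.0],
    "c": [np.nan, np.nan, np.nan, np.nan, np.nan],
})

threshold = 0.8                              # maximum acceptable missing ratio
missing_ratio = df.isna().mean()             # per column: missing values / total rows
keep = missing_ratio[missing_ratio <= threshold].index

print(missing_ratio.to_dict())               # {'a': 0.0, 'b': 0.2, 'c': 1.0}
df_filtered = df[keep]                       # column "c" is dropped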
Advanced Feature Selection
Wrapper Methods
Wrapper methods, also referred to as greedy algorithms, train a model using a subset of features in an iterative manner.
Based on the conclusions drawn from the previously trained model, features are added or removed.
The stopping criterion for selecting the best subset is usually pre-defined by the person training the model, for example when the performance of the model starts to decrease or when a specific number of features has been reached.
The main advantage of wrapper methods over filter methods is that they provide an optimal set of features for training the model, thus resulting in better accuracy than filter methods, but they are computationally more expensive.
Forward selection
Forward selection is an iterative process that begins with an empty set of features. In each iteration it adds a feature and evaluates the performance to check whether it improves. The process continues until the addition of a new variable/feature does not improve the performance of the model (see the sketch after backward elimination below).
Backward elimination
Backward elimination is also an iterative approach, but it is the opposite of forward selection.
This technique begins the process by considering all the features and removes the least
significant feature. This elimination process continues until removing the features does not
improve the performance of the model.
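A minimal sketch of both strategies using scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24+); the estimator and the number of features to keep are arbitrary choices for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# Forward selection: start from an empty set and add features one by one
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward", cv=5
).fit(X, y)

# Backward elimination: start from all features and remove the least useful ones
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="backward", cv=5
).fit(X, y)

print("forward keeps:", forward.get_support(indices=True))
print("backward keeps:", backward.get_support(indices=True))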
Recursive Feature Elimination
Recursive feature elimination is a recursive greedy optimization approach in which features are selected by recursively considering smaller and smaller subsets of features. An estimator is trained on each set of features, and the importance of each feature is determined from the estimator's coef_ attribute or its feature_importances_ attribute.
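A minimal scikit-learn sketch of recursive feature elimination; the estimator and the target number of features are arbitrary for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# At each step, the least important feature (by coef_) is eliminated
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2, step=1)
rfe.fit(X, y)

print("selected features:", rfe.get_support(indices=True))
print("ranking (1 = selected):", rfe.ranking_)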
Exhaustive Feature Selection
Exhaustive feature selection evaluates every possible feature subset by brute force, which makes it the most thorough (and most expensive) feature selection method. It tries each possible combination of features and returns the best-performing feature set.
How Exhaustive Feature Selection Works
1. Generate all possible feature subsets: for a dataset with n features, this means evaluating 2^n subsets (including the empty set).
2. Evaluate each subset: train and evaluate a model using each subset of features. The evaluation metric could be accuracy, precision, recall, F1 score, etc.
3. Select the best subset: Identify the subset of features that provides the best performance
according to the chosen evaluation metric.
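A minimal brute-force sketch with itertools that scores every non-empty subset of the iris features by cross-validated accuracy; a dedicated implementation such as mlxtend's ExhaustiveFeatureSelector offers the same idea with more options.

from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

best_score, best_subset = -1.0, None
# Enumerate every non-empty subset of feature indices (2^n - 1 subsets)
for size in range(1, n_features + 1):
    for subset in combinations(range(n_features), size):
        score = cross_val_score(
            LogisticRegression(max_iter=1000), X[:, list(subset)], y, cv=5
        ).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print("best subset:", best_subset, "accuracy:", round(best_score, 3))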
Embedded Methods
1. Regularization
This method adds a penalty on the parameters (coefficients) of the machine learning model to avoid overfitting.
▪ Lasso Regression (L1 Regularization): Adds an L1 penalty (the absolute value of
the magnitude of coefficients) to the loss function. This can shrink some coefficients
to zero, effectively performing feature selection.
▪ Ridge Regression (L2 Regularization): Adds an L2 penalty (the square of the
magnitude of coefficients) to the loss function. While it does not perform feature
selection by shrinking coefficients to zero, it helps in reducing overfitting and
improving model generalization.
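A minimal sketch of embedded selection with L1 regularization, wrapping scikit-learn's Lasso in SelectFromModel; the dataset and the regularization strength alpha are arbitrary choices for illustration.

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# The L1 penalty shrinks some coefficients exactly to zero; those features are dropped
selector = SelectFromModel(Lasso(alpha=0.5)).fit(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("coefficients:", selector.estimator_.coef_.round(3))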
2. Tree-based methods
Decision Trees:
Decision Trees split the data into subsets based on the value of input features, and
the splits that provide the best separation (based on criteria like Gini impurity or
information gain) indicate the most important features.
The depth of the tree and the features selected for splits at various levels provide
insights into feature importance.
Random Forests:
Random Forests are ensembles of decision trees. They provide feature importance
by averaging the importance measures of each feature across all the trees.
Feature importance in Random Forests is typically calculated from the decrease in impurity (e.g., Gini impurity) attributable to each feature.
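A minimal sketch of tree-based importance with scikit-learn's RandomForestClassifier; the dataset and hyperparameters are illustrative only.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, averaged over all trees in the forest
for idx, importance in enumerate(forest.feature_importances_):
    print(f"feature {idx}: {importance:.3f}")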
Automated Feature Engineering
Automated feature engineering aims to simplify and speed up the process of creating features from raw data by leveraging algorithms and tools. This approach reduces manual effort and can uncover complex patterns and interactions that might otherwise be missed.
Benefits of Automated Feature Engineering
● Speed: Quickly generates and evaluates a large number of features.
● Complexity Handling: Captures complex interactions and transformations that might be
difficult to manually specify.
● Consistency: Applies feature engineering techniques uniformly across different datasets and
tasks.
● Performance: Often improves model performance by discovering useful features that
enhance predictive power.
EvalML AutoML library to automate Feature Engineering
EvalML is an open-source Python library designed to automate and streamline the machine learning workflow, with a particular focus on end-to-end model development.
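A minimal sketch of how such a search is typically started with EvalML's AutoMLSearch; the demo dataset, the data split, and the binary problem type are assumptions chosen for the example, so consult the EvalML documentation for the full API.

import evalml
from evalml.automl import AutoMLSearch

# Load one of EvalML's demo datasets and split it (binary classification example)
X, y = evalml.demos.load_breast_cancer()
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(
    X, y, problem_type="binary"
)

# Search over candidate pipelines; each pipeline bundles feature engineering steps
# (imputation, encoding, scaling) with an estimator
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary")
automl.search()

print(automl.rankings.head())        # leaderboard of evaluated pipelines
best = automl.best_pipeline          # best pipeline, including its feature engineering steps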
Feature Engineering for Specific Data Types
1. Numerical Data
▪ Feature Scaling
▪ Power Transformations
2.Categorical Data
▪ One hot encoding
▪ Label encoding
▪ Target Encoding
3.Text Data
▪ Bag of Words (BoW)
▪ TF-IDF (Term Frequency-Inverse Document Frequency)
▪ Word Embeddings
4.Time-Series Data
▪ Lag
▪ Fourier Transforms
▪ Time-Based Features
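A minimal pandas sketch of two of these transformations, one-hot encoding for categorical data and lag/time-based features for time-series data; the column names and values are invented for illustration.

import pandas as pd

# Hypothetical daily sales data with a categorical "store" column
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "store": ["A", "B", "A", "B", "A", "B"],
    "sales": [100, 80, 120, 90, 130, 95],
})

# Categorical data: one-hot encoding
df = pd.get_dummies(df, columns=["store"], prefix="store")

# Time-series data: lag feature and simple time-based features
df["sales_lag_1"] = df["sales"].shift(1)        # previous day's sales
df["day_of_week"] = df["date"].dt.dayofweek     # 0 = Monday
df["month"] = df["date"].dt.month

print(df)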
