Feature Engineering
AGENDA
• Feature engineering
• Feature selection
• Dealing with categorical data
Feature Scaling
Why Should We Use Feature Scaling?
• Datasets often have multiple features spanning varying degrees of magnitude, range, and units. This is a
significant obstacle, as several machine learning algorithms are highly sensitive to these differences in scale.
Feature Scaling
• Normalization
• Standardization
Normalization: Min-Max Scaling
• Normalization is a scaling technique in which values are shifted and
rescaled so that they end up ranging between 0 and 1. It is also
known as Min-Max scaling.
• Here’s the formula for normalization:
• X' = (X − X_min) / (X_max − X_min)
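A minimal sketch of min-max scaling using scikit-learn's MinMaxScaler; the toy feature values below are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0]])

# Rescale each feature to [0, 1]: X' = (X - X_min) / (X_max - X_min)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)  # every column now lies between 0 and 1
```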
Standardization – Z-score Normalization
• Standardization is another scaling technique where the values are
centered around the mean with a unit standard deviation.
• This means that the mean of the attribute becomes zero and the
resultant distribution has a unit standard deviation (equals 1).
• Here’s the formula for standardization:
• X' = (X − μ) / σ, where μ is the mean and σ the standard deviation of the feature
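A corresponding sketch using scikit-learn's StandardScaler, on the same made-up toy data as above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0]])

# Center each feature to mean 0 and scale to unit standard deviation: X' = (X - mean) / std
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
print(X_standardized.mean(axis=0))  # approximately 0 for each feature
print(X_standardized.std(axis=0))   # approximately 1 for each feature
```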
Feature Selection
Creating Features
• “Good” features are the key to accurate generalization.
• Domain knowledge can be used to generate a feature set.
Medical example: results of blood tests, age, smoking history
Game playing example: number of pieces on the board, control of the center of the board
• Data might not be in vector form.
Example: spam classification
“Bag of words”: throw out the order, keep a count of how many times each word appears (see the sketch below).
Sequence: one feature for the first letter in the email, one for the second letter, etc.
N-grams: one feature for every unique string of n consecutive characters (or words).
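A minimal bag-of-words sketch using scikit-learn's CountVectorizer, with a few made-up example emails:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy "emails" for illustration
docs = [
    "win money now",
    "meeting schedule for monday",
    "win a free prize now now",
]

# Bag of words: discard word order, keep per-document word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary (one feature per word)
print(X.toarray())                         # count of each word in each document
```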
What is feature selection?
Reducing the feature space by throwing out some of
the features
Feature Selection: Without vs. With
Without feature selection:
• Increased model complexity, making the model harder to interpret.
• Increased time complexity for training the model.
• A weaker model with inaccurate or less reliable predictions.
With feature selection, we can find the smallest set of features which results in:
• Training a machine learning algorithm faster.
• Reducing the complexity of a model and making it easier to interpret.
• Building a sensible model with better prediction power.
• Reducing over-fitting by selecting the right set of features.
Reasons for Feature Selection
• Want to find which features are relevant
A domain specialist may not be sure which factors are predictive of disease.
Common practice: throw in every feature you can think of, and let feature selection get rid of the useless ones.
• Want to maximize accuracy by removing irrelevant and noisy features
For spam, create a feature for each of ~10^5 English words.
Training with all features is computationally expensive.
Irrelevant features hurt generalization.
• Features have associated costs; want to optimize accuracy with the least expensive features
Embedded systems with limited resources
Voice recognition on a cell phone
Branch prediction in a CPU (4K code limit)
Terminology
Univariate method: considers one variable (feature) at a time
Multivariate method: considers subsets of variables (features) together
Filter method: ranks features or feature subsets independently of the
predictor (classifier)
Wrapper method: uses a classifier to assess features or feature subsets
Types of Feature Selection:
• Filter Methods
• Wrapper Methods
• Embedded Methods
Feature Selection Methods
• Filter: All Features → Filter (Score) → Selected Features → Supervised Learning Algorithm → Classifier
• Wrapper: All Features → Search ↔ Feature Evaluation Criterion (the search proposes a Feature Subset, the criterion returns a Criterion Value) → Selected Features → Supervised Learning Algorithm → Classifier
Filter method: These methods evaluate the intrinsic
characteristics of features independently of the model. Common filter techniques:
• Constant removal (Variance Threshold)
• Correlation-based
• Chi-Square Test (for categorical features)
• ANOVA (Analysis of Variance)
• Information Gain
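As an illustration of a statistical filter, here is a minimal sketch of the Chi-Square test filter using scikit-learn's SelectKBest with chi2; the toy data and the choice of k=2 are assumptions for illustration (chi2 expects non-negative feature values):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy non-negative feature matrix (4 features) and a binary target
X = np.array([[1, 0, 3, 10],
              [2, 1, 3, 20],
              [3, 0, 3, 10],
              [4, 1, 3, 25]])
y = np.array([0, 1, 0, 1])

# Keep the 2 features that score highest on the chi-square test against y
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support())  # boolean mask of selected features
print(X_selected)
```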
Constant removal: the goal of constant removal is to identify and
eliminate features that exhibit no variation, i.e. have constant values
across all data points in a dataset. It proceeds in three steps:
1. Calculate the variance or standard deviation for each feature.
2. Set a threshold for the variance.
3. Remove features below the threshold.
1) Calculate the variance or standard deviation for each feature
• The variance of a set of data points measures how far each data point
in the set is from the mean (average) of the data.
• A low variance indicates that a feature's values are relatively constant across different
instances in the dataset.
• Features with zero variance (or very low variance) are considered
constant.
2) Set a threshold: define a threshold value for the variance;
features with variance below this threshold are flagged for removal.
Considerations for choosing an appropriate threshold:
• Impact on model performance
• Domain knowledge
• Balance between information loss and noise reduction
• Dataset size
3) Remove constant features:
Eliminate the identified constant features from the dataset.
The remaining features are considered more informative and
are retained for further analysis or modeling.
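A minimal sketch of these three steps using scikit-learn's VarianceThreshold; the toy data and the 0.1 threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: the second feature is constant, the third is nearly constant
X = np.array([[1.0, 5.0, 0.90, 10.0],
              [2.0, 5.0, 0.91, 20.0],
              [3.0, 5.0, 0.90, 15.0],
              [4.0, 5.0, 0.89, 30.0]])

# Steps 1-3: compute per-feature variance, compare to a threshold, drop low-variance features
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # variance of each original feature
print(selector.get_support())  # which features were kept
print(X_reduced)               # constant / near-constant columns removed
```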
Filter Method Benefits
• Improved computational efficiency
• Improved model performance
• Faster training times
• Reduced noise in the dataset
• Reduced overfitting
Recursive Feature Elimination (RFE)
The RFE algorithm:
1. Rank the importance of all features using the chosen RFE machine learning algorithm.
2. Eliminate the least important feature.
3. Build a model using the remaining features.
4. Repeat steps 1-3 until the desired number of features is reached.
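A minimal sketch of RFE with scikit-learn, using logistic regression as an assumed ranking estimator on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic dataset: 10 features, only a few of which are informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Repeatedly fit the estimator, rank features, and drop the least important one
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)  # mask of the 3 selected features
print(rfe.ranking_)  # 1 = selected; larger numbers were eliminated earlier
```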
Categorical Data
Encoding Categorical Data
• There are different techniques to encode categorical
features as numeric quantities:
1) Label encoding
2) One-Hot encoding
Label Encoding
• Label encoding converts each
value in a column to a
number. Numerical labels
are always between 0 and
n_categories - 1.
Label Encoding Example
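A minimal sketch with scikit-learn's LabelEncoder; the color values are made up for illustration:

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green", "red"]

# Map each distinct category to an integer in [0, n_categories - 1]
encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)

print(encoder.classes_)  # ['blue' 'green' 'red']
print(encoded)           # [2 1 0 1 2]
```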
One-Hot Encoding
• The basic strategy is to
convert each category
value into a new
column and assign a 1
or 0 (True/False) value
to the column.
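A minimal sketch using pandas.get_dummies (an equivalent could be built with scikit-learn's OneHotEncoder); the data is illustrative:

```python
import pandas as pd

# Toy column of categorical values
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One new column per category; each row gets 1/True in its category's column and 0/False elsewhere
one_hot = pd.get_dummies(df, columns=["color"])
print(one_hot)
```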
Classification Metrics
Confusion Matrix
• TP (true positives), TN (true negatives), FN (false negatives), FP (false positives)
Evaluation of classification models from confusion matrix
• Accuracy
• Precision
• Recall (sensitivity)
• F1 Score
• Specificity
Evaluation of classification models: Accuracy
Accuracy simply measures how often the classifier makes the correct prediction.
It’s the ratio between the number of correct predictions and the total number of
predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Evaluation of classification models: Precision
Precision is a measure of correctness among the positive predictions. In simple
words, it tells us how many of the total predicted positives are actually positive:
Precision = TP / (TP + FP)
Evaluation of classification models: Recall
Recall (Sensitivity): a measure of how many of the actual positive observations are
predicted correctly, i.e. how many observations of the positive class are actually
predicted as positive. It is also known as Sensitivity:
Recall = TP / (TP + FN)
Evaluation of classification models: F1 Score
F1 score: the harmonic mean of precision and recall. It takes both false
positives and false negatives into account:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Evaluation of classification models: Specificity
Specificity: the proportion of actual negatives that are correctly predicted as negative:
Specificity = TN / (TN + FP)
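A minimal sketch computing these metrics with scikit-learn; the true and predicted labels below are made up:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy binary labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)

print("Accuracy:   ", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("Precision:  ", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:     ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 score:   ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print("Specificity:", tn / (tn + fp))                   # TN / (TN + FP)
```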
Regression Metrics
Evaluation of Regression models
• Mean Absolute Error (MAE)
• Mean Squared Error (MSE)
Evaluation of Regression models: Mean Squared Error
Mean Squared Error (MSE): the most popular metric used for regression
problems. It essentially finds the average of the squared difference
between the target value and the value predicted by the regression model:
MSE = (1/N) Σ (y_j − ŷ_j)²
Where:
• y_j: actual value
• ŷ_j (y_hat): predicted value from the regression model
• N: number of samples
Evaluation of Regression models: Mean Absolute Error
Mean Absolute Error (MAE): the average of the absolute difference between the
ground truth and the predicted values. Mathematically, it’s represented as:
MAE = (1/N) Σ |y_j − ŷ_j|
Where:
• y_j: actual value
• ŷ_j (y_hat): predicted value from the regression model
• N: number of samples
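A minimal sketch computing both regression metrics with scikit-learn; the values are toy numbers:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy ground-truth targets and model predictions
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

print("MAE:", mean_absolute_error(y_true, y_pred))  # mean of |y_j - y_hat_j|
print("MSE:", mean_squared_error(y_true, y_pred))   # mean of (y_j - y_hat_j)^2
```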
