Unit 4 :
Basics of Feature Engineering:
1
Silver Oak College Of Engineering And Technology
Outline
2
 Feature and Feature Engineering,
 Feature transformation:
 Construction
 Extraction,
 Feature subset selection :
 Issues in high-dimensional data,
 key drivers,
 measure
 overall process
Feature and Feature Engineering
Prof. Monali Suthar (SOCET-CE)
3
 Features are the inputs to a machine learning model, usually provided in the form of
structured columns.
 Algorithms require features with specific
characteristics to work properly.
 What is Feature Engineering?
 Feature engineering is the process of transforming raw data into
features that better represent the underlying problem to the
predictive models, resulting in improved model accuracy on unseen
data.
 Goals of Feature Engineering
1. Preparing the proper input dataset, compatible with the
machine learning algorithm requirements.
2. Improving the performance of machine learning models.
Feature Engineering Category
4
 Feature Engineering is divided into 3 broad categories:-
I. Feature Selection:
 It is all about selecting a small subset of features from a large pool of
features.
 We select those attributes which best explain the relationship of an
independent variable with the target variable.
 There are certain features which are more important than other
features to the accuracy of the model.
 It is different from dimensionality reduction because the
dimensionality reduction method does so by combining existing
attributes, whereas the feature selection method includes or excludes
those features.
 Ex: Chi-squared test, correlation coefficient scores, LASSO, Ridge
regression etc.
Feature Engineering Category
5
II. Feature Transformation:
 It means transforming our original feature to the functions of
original features.
 Ex: Scaling, discretization, binning and filling missing data values are
the most common forms of data transformation.
 To reduce right skewness of the data, we use log.
III. Feature Extraction:
 When the data to be processed by an algorithm is too large,
it often contains redundant information.
 Analysis with a large number of variables uses a lot of computation
power and memory, therefore we should reduce the dimensionality
of these types of variables.
 It is a term for constructing combinations of the variables.
 For tabular data, we use PCA to reduce features.
 For image, we can use line or edge detection.
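 As a rough illustration (not part of the original slides), a minimal PCA sketch with scikit-learn; the dataset and number of components are assumptions:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)             # keep 2 components instead of the 4 original features
X_reduced = pca.fit_transform(X)      # each component is a linear combination of the originals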
Feature transformation
6
 Feature transformation is the process of modifying
your data while preserving its information.
 These modifications make the data easier for Machine Learning
algorithms to understand, which delivers better
results.
 But why would we transform our features?
 data types are not suitable to be fed into a machine learning
algorithm, e.g. text, categories
 feature values may cause problems during the learning process,
e.g. data represented in different scales
 we want to reduce the number of features to plot and visualize
data, speed up training or improve the accuracy of a specific
model
Feature Engineering Techniques
7
 List of Techniques
1. Imputation
2. Handling Outliers
3. Binning
4. Log Transform
5. One-Hot Encoding
6. Grouping Operations
7. Feature Split
8. Scaling
9. Extracting Date
Imputation Using (Mean/Median) Values
8
 This works by calculating the mean/median of the
non-missing values in a column and then replacing
the missing values within each column separately
and independently from the others. It can only be
used with numeric data.
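 A minimal sketch of mean/median imputation using scikit-learn's SimpleImputer; the column names and values are hypothetical:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31], "salary": [50000, 60000, np.nan, 52000]})
imputer = SimpleImputer(strategy="median")        # or strategy="mean"
df[["age", "salary"]] = imputer.fit_transform(df[["age", "salary"]])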
Pros and Cons
9
 Pros:
• Easy and fast.
• Works well with small numerical datasets.
 Cons:
• Doesn't factor in the correlations between features. It
only works at the column level.
• Will give poor results on encoded categorical
features (do NOT use it on categorical features).
• Not very accurate.
• Doesn't account for the uncertainty in the
imputations.
Imputation Using (Most Frequent) or
(Zero/Constant) Values:
11
 Most-frequent imputation is another statistical strategy to
impute missing values, and it works with
categorical features (strings or numerical
representations) by replacing missing data with the
most frequent value within each column.
 Pros:
• Works well with categorical features.
 Cons:
• It also doesn't factor in the correlations between
features.
• It can introduce bias in the data.
Imputation Using (Most Frequent) or
(Zero/Constant) Values:
12
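 A minimal sketch of most-frequent and constant imputation with SimpleImputer; the column and fill value are hypothetical:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"city": ["Pune", np.nan, "Surat", "Pune"]})
mode_imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = mode_imputer.fit_transform(df[["city"]])
# constant imputation would use a fixed fill value instead:
const_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")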
Imputation Using k-NN
13
 The k-nearest-neighbors algorithm is commonly used
for simple classification. The algorithm uses 'feature
similarity' to predict the values of any new data
points.
 This means that a new point is assigned a value
based on how closely it resembles the points in the
training set. This can be very useful for imputing
missing values: find the k closest neighbors of the
observation with missing data and impute them based on the
non-missing values in the neighborhood.
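 A minimal sketch using scikit-learn's KNNImputer (numeric data only); the values and n_neighbors are hypothetical:
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [4.0, 5.0]])
imputer = KNNImputer(n_neighbors=2)     # fill the gap from the 2 most similar rows
X_imputed = imputer.fit_transform(X)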
Pros and Cons
14
 Pros:
• Can be much more accurate than the mean, median
or most frequent imputation methods (It depends on
the dataset).
 Cons:
• Computationally expensive. KNN works by storing
the whole training dataset in memory.
• K-NN is quite sensitive to outliers in the data (unlike
SVM)
Handling outlier
15
 Common causes of outliers:
• Incorrect data entry or errors during data processing.
• Missing values in a dataset.
• Data did not come from the intended sample.
• Errors occur during experiments.
• Not an error, but an unusual value compared to the rest of the data.
• A more extreme distribution than normal.
Handling outlier
16
Univariate method:
 Univariate analysis is the simplest form of analyzing data.
“Uni” means “one”, so in other words your data has only one
variable.
 It doesn't deal with causes or relationships (unlike regression),
and its major purpose is to describe: it takes data,
summarizes that data, and finds patterns in the data.
 Univariate and multivariate represent two approaches to
statistical analysis.
 Univariate involves the analysis of a single variable
while multivariate analysis examines two or more variables.
 Most multivariate analysis involves a dependent variable and
multiple independent variables.
Handling outlier with Z score
17
 The Z-score is the signed number of standard deviations by which
the value of an observation or data point is above the mean value of
what is being observed or measured.
 Z score is an important concept in statistics. Z score is also called
standard score. This score helps to understand if a data value is
greater or smaller than mean and how far away it is from the mean.
More specifically, Z score tells how many standard deviations away a
data point is from the mean.
 The intuition behind Z-score is to describe any data point by finding
their relationship with the Standard Deviation and Mean of the
group of data points. Z-score is finding the distribution of data
where mean is 0 and standard deviation is 1 i.e. normal distribution.
 Z score = (x − mean) / standard deviation
 If the z score of a data point is more than 3, it indicates that the data
point is quite different from the other data points. Such a data point
can be an outlier.
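 A minimal sketch of flagging outliers with the z-score rule; the data and the 3-standard-deviation threshold are illustrative:
import numpy as np

x = np.array([10, 12, 11, 13, 12, 95])        # hypothetical data with one extreme value
z_scores = (x - x.mean()) / x.std()
outlier_mask = np.abs(z_scores) > 3           # True for points far from the mean; the 3-sigma rule
outliers = x[outlier_mask]                    # is most meaningful on larger samples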
Binning
18
 Data binning (also called bucketing) is a data pre-processing method
used to minimize the effects of small observation errors.
 The original data values are divided into small intervals
known as bins and then they are replaced by a general
value calculated for that bin.
 This has a smoothing effect on the input data and may
also reduce the chances of overfitting in case of small
datasets.
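 A minimal sketch of binning with pandas; the bin edges and labels are hypothetical:
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 81])
age_bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                  labels=["child", "young adult", "middle age", "senior"])
# each value is replaced by the label of the interval (bin) it falls into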
Log Transform
19
 The Log Transform is one of the most popular
Transformation techniques out there.
 It is primarily used to convert a skewed distribution to a
normal distribution/less-skewed distribution.
 In this transform, we take the log of the values in a
column and use these values as the column instead.
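 A minimal sketch of a log transform; the column is hypothetical, and log1p is used so zero values do not cause errors:
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [20000, 35000, 50000, 1200000]})
df["income_log"] = np.log1p(df["income"])   # log(1 + x) compresses the long right tail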
Standard Scaler
20
 The Standard Scaler is another popular scaler that is very
easy to understand and implement.
 For each feature, the Standard Scaler scales the values
such that the mean is 0 and the standard deviation (and
hence the variance) is 1.
x_scaled = (x − mean) / std_dev
 However, the Standard Scaler assumes that the distribution of
the variable is normal. Thus, if the variables are not
normally distributed, we either choose a different scaler
or first convert the variables to a normal distribution and
then apply this scaler.
21
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# features: DataFrame of the numeric columns to scale; col_names: their names
df_scaled = pd.DataFrame(scaler.fit_transform(features[col_names]), columns=col_names)
df_scaled.head()
One-Hot Encoding
22
 A one hot encoding allows the representation of
categorical data to be more expressive.
 Many machine learning algorithms cannot work with
categorical data directly.
 The categories must be converted into numbers.
 This is required for both input and output variables that
are categorical.
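 A minimal sketch of one-hot encoding with pandas.get_dummies; the column and categories are hypothetical:
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df, columns=["color"])   # one binary column per category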
Feature subset selection
23
 Feature Selection is the most critical pre-processing
activity in any machine learning process. It intends to
select a subset of attributes or features that makes the
most meaningful contribution to a machine learning
activity.
High dimensional data
24
 "High-dimensional" refers to the large number of variables,
attributes, or features present in certain data sets, especially in
domains like DNA analysis, geographic information systems (GIS),
etc. Such data may have hundreds or thousands of dimensions,
which is not good from the machine learning perspective because it may
be a big challenge for any ML algorithm to handle. In addition,
a large amount of computation and time
will be required, and a model built on an extremely high number of
features may be very difficult to understand. For these reasons, it
is necessary to take a subset of the features instead of the
full set. So we can deduce that the objectives of feature selection
are:
1. Having a faster and more cost-effective (less need for computational
resources) learning model
2. Having a better understanding of the underlying model that generates
the data.
3. Improving the efficacy of the learning model.
Feature subset selection methods
25
1. Wrapper methods
 Wrapper methods train models on a certain subset of
features and evaluate the importance of each feature.
 Then they iterate and try a different subset of features until the
optimal subset is reached.
 Two drawbacks of this method are the large computation time
for data with many features, and that it tends to overfit the
model when there is not a large amount of data points.
 The most notable wrapper methods of feature selection
are forward selection, backward selection, and stepwise
selection.
Feature subset selection methods
26
1. Wrapper methods
 Forward selection starts with zero features, then, for each
individual feature, runs a model and determines the p-value
associated with the t-test or F-test performed. It then selects
the feature with the lowest p-value and adds that to the
working model.
 Backward selection starts with all features contained in the
dataset. It then runs a model, calculates a p-value
associated with the t-test or F-test of the model for each
feature, and removes the least significant feature, repeating
until only significant features remain.
 Stepwise selection is a hybrid of forward and backward
selection. It starts with zero features and adds the one feature
with the lowest significant p-value as described above.
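 A minimal sketch of forward/backward selection using scikit-learn's SequentialFeatureSelector; note that it ranks subsets by cross-validated score rather than p-values, and the estimator and feature count here are assumptions:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=2,
                                     direction="forward")   # "backward" starts from all features
selector.fit(X, y)
selected_mask = selector.get_support()   # boolean mask of the chosen features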
Feature subset selection methods
27
2. Filter methods
 Filter methods use a measure other than error rate to
determine whether that feature is useful.
 Rather than tuning a model (as in wrapper methods), a subset
of the features is selected through ranking them by a useful
descriptive measure.
 Benefits of filter methods are that they have a very low
computation time and will not overfit the data.
 However, one drawback is that they are blind to any
interactions or correlations between features.
 This will need to be taken into account separately, which will
be explained below. Three different filter methods
are ANOVA, Pearson correlation, and variance
thresholding.
Feature subset selection methods
28
2. Filter methods
 The ANOVA (analysis of variance) test looks at the variation
within the treatments of a feature and also between the
treatments.
 The Pearson correlation coefficient is a measure of the
similarity of two features that ranges between -1 and 1. A value
close to 1 or -1 indicates that the two features have a high
correlation and may be related.
 The variance of a feature determines how much predictive
power it contains. The lower the variance is, the less
information contained in the feature, and the less value it has in
predicting the response variable.
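 A minimal sketch of two filter methods, variance thresholding and Pearson correlation ranking; the dataset and threshold are assumptions:
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

X, y = load_iris(return_X_y=True, as_frame=True)

# variance thresholding: drop features whose variance falls below a (hypothetical) cutoff
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

# Pearson correlation: rank features by absolute correlation with the target
correlations = X.apply(lambda col: col.corr(y)).abs().sort_values(ascending=False)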
Feature subset selection methods
29
3. Embedded Methods
 Embedded methods perform feature selection as a part of the
model creation process.
 This generally leads to a happy medium between the two
methods of feature selection previously explained, as the
selection is done in conjunction with the model tuning
process.
 Lasso and Ridge regression are the two most common
feature selection methods of this type, and decision trees also
perform a form of feature selection as part of building the model.
Feature subset selection methods
30
3. Embedded Methods
 Lasso regression is another way to penalize the beta coefficients in a
model, and is very similar to Ridge regression. It also adds a penalty term
to the cost function of a model, with a lambda value that must be tuned.
 Because the L1 penalty can shrink some coefficients exactly to zero, Lasso
effectively removes those features from the model.
 The smaller the number of features a model has, the lower its complexity.
import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso()  # alpha (the lambda penalty) defaults to 1.0 and should be tuned
lasso.fit(X_train, y_train)
train_score = lasso.score(X_train, y_train)
test_score = lasso.score(X_test, y_test)
coeff_used = np.sum(lasso.coef_ != 0)  # features kept: those with non-zero coefficients
 An important note for Ridge and Lasso regression is that all of your features must
be standardized before fitting.
Feature subset selection methods
31
3. Embedded Methods
 Ridge regression penalizes the beta coefficients of a model
for being too large. Basically, it scales back the strength of correlation with
variables that may not be as important as others. Ridge regression is done
by adding a penalty term (also called a ridge estimator or shrinkage estimator)
to the cost function of the regression. The penalty term takes all of the betas
and scales them by a term lambda (λ) that must be tuned (usually with cross-
validation, which compares the same model with different values of lambda).
from sklearn.linear_model import Ridge

rr = Ridge(alpha=0.01)  # alpha is the lambda penalty strength; tune it via cross-validation
rr.fit(X_train, y_train)