zekeLabs
Data Preprocessing
Learning made Simpler !
www.zekeLabs.com
Agenda
● Transformers
● StandardScaler
● MinMaxScaler
● RobustScaler
● Normalization
● Binarization
● Encoding Categorical Features
● Imputation
● Polynomial Features
● Custom Transformer
● Text Processing
Why Preprocessing ?
● Learning algorithms have an affinity towards data in certain forms.
● Unscaled or unstandardized data might produce unacceptable predictions.
● Preprocessing refers to the transformations applied to data before it is fed to a learning algorithm.
http://benalexkeen.com/feature-scaling-with-scikit-learn/
Transformers
● Objects that transform data into a form that machine learning algorithms can consume
● Common API - fit, transform, fit_transform
● fit() - learns the mapping from the data
● transform() - applies the learned mapping to transform data
● fit_transform() - combines the two steps above
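A minimal sketch of this shared API, using StandardScaler as the example transformer (the data here is made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                     # fit(): learn the map (here, per-column mean and std)
X_train_s = scaler.transform(X_train)   # transform(): apply the learned map
X_test_s = scaler.transform(X_test)     # reuse the same map on unseen data
X_train_s = scaler.fit_transform(X_train)  # fit_transform(): both steps in one call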
StandardScaler
● Assumes the feature data is roughly normally distributed
● Scales each feature so that its mean is 0 and its standard deviation is 1
● If the data is not normally distributed, StandardScaler may not be a great idea
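A small sketch verifying the behaviour, on made-up data:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0, 60.0], [180.0, 80.0], [160.0, 50.0], [175.0, 75.0]])
X_std = StandardScaler().fit_transform(X)  # z = (x - mean) / std, per column
print(X_std.mean(axis=0))  # approximately [0, 0]
print(X_std.std(axis=0))   # approximately [1, 1]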
MinMaxScaler
● One of the most popular scaling methods
● Works on data which is not normally distributed
● Brings the data into the range [0,1] or [-1,1]
● Skewness is maintained, but the data is brought to the same scale
● Outliers end up inside the 0-1 range along with everything else, which can compress the bulk of the data into a narrow band (see the figure in the link above, where the outliers keep the two normal distributions separate).
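A minimal sketch on toy data; the default feature_range is (0, 1), and passing feature_range=(-1, 1) gives the [-1, 1] variant:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [10.0]])
X_mm = MinMaxScaler().fit_transform(X)  # (x - min) / (max - min)
print(X_mm.ravel())  # [0.0, 0.444..., 1.0] -- relative spacing (skew) preserved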
RobustScaler
● Most suited for data with outliers
● Uses the median and interquartile range (IQR) rather than the min and max
● The distributions are brought onto the
same scale and overlap, but the
outliers remain outside the bulk of
the new distributions.
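A minimal sketch showing how a single extreme value barely affects the scaling of the bulk (toy data):

import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
X_rs = RobustScaler().fit_transform(X)  # (x - median) / IQR
print(X_rs.ravel())  # the bulk stays near 0; the outlier remains far away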
Normalizer
● The Normalizer scales each sample by
dividing it by its magnitude in the
n-dimensional feature space
(for n features).
● Each point then lies within 1 unit of
the origin of this Cartesian
coordinate system.
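A minimal sketch on toy data; with the default L2 norm each sample lands exactly on the unit circle:

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0], [1.0, 0.0]])
X_norm = Normalizer(norm='l2').fit_transform(X)  # each ROW divided by its own magnitude
print(X_norm)  # [[0.6, 0.8], [1.0, 0.0]]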
Encoding Categorical Values
● Most learning algorithms require
numeric input, so categorical values
must be converted into numbers.
● Common strategies, covered next:
label encoding, one-hot encoding,
and ordinal encoding.
Label Encoding
● Learning algorithms don’t understand strings
● Categorical columns with string values ( yes/no ) need to be converted to
numbers.
● LabelEncoder encodes values as integers from 0 to n-1 for n classes
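A minimal sketch; note that LabelEncoder assigns codes in sorted order of the class labels:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(['yes', 'no', 'yes', 'no'])
print(y)            # [1 0 1 0]
print(le.classes_)  # ['no' 'yes'] -- sorted, coded 0 to n-1
print(le.inverse_transform([0, 1]))  # ['no' 'yes']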
One Hot Encoding
● Converts each categorical value into a vector in which one position is hot (1)
and the others are cold (0).
● Suitable for nominal data
● e.g. location ( delhi, mumbai etc. )
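A minimal sketch, assuming a scikit-learn version recent enough for OneHotEncoder to accept string input directly (0.20+):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

cities = np.array([['delhi'], ['mumbai'], ['delhi'], ['bangalore']])
enc = OneHotEncoder()
X_hot = enc.fit_transform(cities).toarray()  # sparse by default, densified for printing
print(enc.categories_)  # [array(['bangalore', 'delhi', 'mumbai'], dtype=object)]
print(X_hot)            # e.g. 'delhi' -> [0, 1, 0]: one position hot, the rest cold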
Ordinal Data Encoding
● Features sometimes contain ordinal data as strings, like low, medium, high
● Transforming such a column with LabelEncoder might not be a good option,
because it assigns codes in alphabetical order and breaks the natural ordering.
● We want to maintain the order relationship in the data
● Using pandas we can replace low with 0, medium with 1 & high with 2
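A minimal pandas sketch, with a made-up column name:

import pandas as pd

df = pd.DataFrame({'risk': ['low', 'high', 'medium', 'low']})
order = {'low': 0, 'medium': 1, 'high': 2}  # explicit map preserves low < medium < high
df['risk_encoded'] = df['risk'].map(order)
print(df)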
Binarizer
● Sets each feature value to 0 or 1 based on a threshold
● Commonly used with text data
● An important step before algorithms that expect binary input data
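A minimal sketch; the threshold below is an arbitrary illustrative choice, and values above it become 1 while the rest become 0:

import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[0.0, 3.0], [5.0, 0.2]])
X_bin = Binarizer(threshold=0.5).fit_transform(X)
print(X_bin)  # [[0. 1.] [1. 0.]]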
Imputation
● Real-world data might be incomplete; missing data is represented as blanks,
NaN, etc.
● Incomplete data is incompatible with scikit-learn estimators
● One way to deal with missing values is to discard the affected rows or columns.
● The other is to derive them from the existing data; that’s called imputation
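A minimal sketch using SimpleImputer (in older scikit-learn versions the same idea lived in sklearn.preprocessing.Imputer):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]])
imp = SimpleImputer(strategy='mean')  # also 'median', 'most_frequent'
print(imp.fit_transform(X))
# column 0: nan -> mean(1, 5) = 3; column 1: nan -> mean(2, 4) = 3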
Polynomial Features
● Sometimes we need to add complexity to the model
● Generates higher-degree and interaction terms from the existing features.
● Its main hyperparameter is degree
[ X, Y ] → [ 1, X, Y, X², XY, Y² ]
Polynomial transformer with degree 2
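A minimal sketch on a toy input; scikit-learn orders the degree-2 columns as 1, X, Y, X², XY, Y²:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # [X, Y]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))  # [[1. 2. 3. 4. 6. 9.]]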
Custom Transformer
● Sometimes the built-in transformers are not sufficient for data cleaning or
preprocessing.
● Custom transformers allow plain Python functions to be used for transforming
data
[ X, Y ] → [ log(X), log(Y) ]
Custom Transformer (log)
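A minimal sketch using FunctionTransformer, one way scikit-learn wraps a plain Python function as a transformer:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 10.0], [np.e, 100.0]])
log_transformer = FunctionTransformer(np.log)  # any element-wise function works
print(log_transformer.fit_transform(X))  # [[0.  2.302...] [1.  4.605...]]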
Outliers
● Data which doesn’t fit the
distribution of the entire dataset is
an outlier.
● Types of outliers - univariate,
multivariate.
● Univariate outlier - extreme with
respect to a single variable
● Multivariate outlier - extreme only
when multiple variables are
considered together
Outlier Reasons
● Experimental Error
● Data entry error
● Sampling Error
● Natural Outlier
● Intentional error
Outlier Impact
● Big impact on range, variance, and standard deviation
● The learning algorithm’s ability to generalize is impacted
Outlier Detection
● Extreme Value Analysis : z-score based methods (see the sketch below)
● Probabilistic and Statistical Models : how likely a point is under the fitted distribution
● Linear Models : data is projected into a lower dimension (e.g. with PCA); points far from
the projection plane are outliers
● Proximity-based Models : clustering-based methods, distance-based
methods
● High-Dimensional Outlier Detection : specialized methods for high-dimensional data
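A minimal z-score sketch for the extreme-value approach; zscore_outliers is a hypothetical helper written for this slide, not a library function:

import numpy as np

def zscore_outliers(x, threshold=3.0):
    # flag points whose |z-score| exceeds the threshold (hypothetical helper)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

x = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 50.0])
print(zscore_outliers(x, threshold=2.0))  # only the 50.0 is flagged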
Novelty Detection
● Add one more observation to the dataset
● Check whether the new observation is part of the existing distribution
● OneClassSVM can be used to detect novelty in a dataset
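A minimal sketch on synthetic training data; nu and gamma below are illustrative choices, not tuned values:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # the "known" distribution

clf = OneClassSVM(nu=0.05, kernel='rbf', gamma='auto').fit(X_train)
# +1 -> consistent with the training distribution, -1 -> novelty
print(clf.predict(np.array([[0.1, -0.2], [6.0, 6.0]])))  # expected roughly [ 1 -1]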
Text
● DictVectorizer
● CountVectorizer
● Tf-Idf
● HashingVectorizer
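A minimal CountVectorizer sketch on two made-up documents; the other vectorizers share the same fit/transform API:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['data preprocessing made simple', 'simple data pipelines']
vec = CountVectorizer()
X = vec.fit_transform(docs)     # sparse document-term count matrix
print(sorted(vec.vocabulary_))  # ['data', 'made', 'pipelines', 'preprocessing', 'simple']
print(X.toarray())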
Image
● Skimage library
● sklearn.feature_extraction.image
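A minimal sketch with sklearn.feature_extraction.image, using a toy array standing in for an image:

import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

img = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
patches = extract_patches_2d(img, patch_size=(2, 2))
print(patches.shape)  # (9, 2, 2): all 2x2 patches of a 4x4 image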
Thank You !!!
Visit : www.zekeLabs.com for more details
Let us know how we can help your organization upskill its
employees to stay updated in the ever-evolving IT industry.
Get in touch:
www.zekeLabs.com | +91-8095465880 | info@zekeLabs.com
