zekeLabs
Data Preprocessing
Learning made Simpler !
www.zekeLabs.com
Agenda
● Transformers
● StandardScaler
● MinMaxScaler
● RobustScaler
● Normalization
● Binarization
● Encoding Categorical Features
● Imputation
● Polynomial Features
● Custom Transformer
● Text Processing
Why Preprocessing ?
● Learning algorithms have an affinity towards data in certain forms.
● Unscaled or unstandardized data might produce unacceptable predictions.
● Preprocessing refers to the transformations applied to data before it is fed to a learning algorithm.
http://benalexkeen.com/feature-scaling-with-scikit-learn/
Transformers
● Objects that transform data into a form that machine learning algorithms can consume
● Common API - fit, transform, fit_transform
● fit() - learns the mapping from the data
● transform() - applies the learned mapping to transform data
● fit_transform() - combines the two steps above
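A minimal sketch of this shared API, using StandardScaler as the example transformer (the data here is made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                     # fit(): learn the map (here, per-column mean and std)
X_train_s = scaler.transform(X_train)   # transform(): apply the learned map
X_test_s = scaler.transform(X_test)     # reuse the same map on unseen data
X_train_s = scaler.fit_transform(X_train)  # fit_transform(): both steps in one call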
StandardScaler
● Assumes the feature data is roughly normally distributed
● Scales each feature so that its mean is 0 and its standard deviation is 1
● If the data is not normally distributed, StandardScaler may not be a great idea
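A small sketch verifying the behaviour, on made-up data:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0, 60.0], [180.0, 80.0], [160.0, 50.0], [175.0, 75.0]])
X_std = StandardScaler().fit_transform(X)  # z = (x - mean) / std, per column
print(X_std.mean(axis=0))  # approximately [0, 0]
print(X_std.std(axis=0))   # approximately [1, 1]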
MinMaxScaler
● One of the most popular scaling methods
● Works on data which is not normally distributed
● Brings the data into the range [0,1] or [-1,1]
● Skewness is maintained, but the data is brought to the same scale
● Outliers end up inside the 0-1 range along with everything else, which can compress the bulk of the data into a narrow band (see the figure in the link above, where the outliers keep the two normal distributions separate).
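A minimal sketch on toy data; the default feature_range is (0, 1), and passing feature_range=(-1, 1) gives the [-1, 1] variant:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [10.0]])
X_mm = MinMaxScaler().fit_transform(X)  # (x - min) / (max - min)
print(X_mm.ravel())  # [0.0, 0.444..., 1.0] -- relative spacing (skew) preserved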
RobustScaler
● Most suited for data with outliers
● Uses the median and interquartile range (IQR) rather than the min and max
● The distributions are brought onto the
same scale and overlap, but the
outliers remain outside the bulk of
the new distributions.
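A minimal sketch showing how a single extreme value barely affects the scaling of the bulk (toy data):

import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
X_rs = RobustScaler().fit_transform(X)  # (x - median) / IQR
print(X_rs.ravel())  # the bulk stays near 0; the outlier remains far away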
Normalizer
● The Normalizer scales each sample by
dividing it by its magnitude in the
n-dimensional feature space
(for n features).
● Each point then lies within 1 unit of
the origin of this Cartesian
coordinate system.
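A minimal sketch on toy data; with the default L2 norm each sample lands exactly on the unit circle:

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0], [1.0, 0.0]])
X_norm = Normalizer(norm='l2').fit_transform(X)  # each ROW divided by its own magnitude
print(X_norm)  # [[0.6, 0.8], [1.0, 0.0]]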
Encoding Categorical Values
● Most learning algorithms require
numeric input, so categorical values
must be converted into numbers.
● Common strategies, covered next:
label encoding, one-hot encoding,
and ordinal encoding.
Label Encoding
● Learning algorithms don’t understand strings
● Categorical columns with string values ( yes/no ) need to be converted to
numbers.
● LabelEncoder encodes values as integers from 0 to n-1 for n classes
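A minimal sketch; note that LabelEncoder assigns codes in sorted order of the class labels:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(['yes', 'no', 'yes', 'no'])
print(y)            # [1 0 1 0]
print(le.classes_)  # ['no' 'yes'] -- sorted, coded 0 to n-1
print(le.inverse_transform([0, 1]))  # ['no' 'yes']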
One Hot Encoding
● Converts each categorical value into a vector in which one position is hot (1)
and the others are cold (0).
● Suitable for nominal data
● e.g. location ( delhi, mumbai etc. )
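A minimal sketch, assuming a scikit-learn version recent enough for OneHotEncoder to accept string input directly (0.20+):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

cities = np.array([['delhi'], ['mumbai'], ['delhi'], ['bangalore']])
enc = OneHotEncoder()
X_hot = enc.fit_transform(cities).toarray()  # sparse by default, densified for printing
print(enc.categories_)  # [array(['bangalore', 'delhi', 'mumbai'], dtype=object)]
print(X_hot)            # e.g. 'delhi' -> [0, 1, 0]: one position hot, the rest cold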
Ordinal Data Encoding
● Features sometimes contain ordinal data as strings, like low, medium, high
● Transforming such a column with LabelEncoder might not be a good option,
because it assigns codes in alphabetical order and breaks the natural ordering.
● We want to maintain the order relationship in the data
● Using pandas we can replace low with 0, medium with 1 & high with 2
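A minimal pandas sketch, with a made-up column name:

import pandas as pd

df = pd.DataFrame({'risk': ['low', 'high', 'medium', 'low']})
order = {'low': 0, 'medium': 1, 'high': 2}  # explicit map preserves low < medium < high
df['risk_encoded'] = df['risk'].map(order)
print(df)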
Binarizer
● Sets each feature value to 0 or 1 based on a threshold
● Commonly used with text data
● An important step before algorithms that expect binary input data
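A minimal sketch; the threshold below is an arbitrary illustrative choice, and values above it become 1 while the rest become 0:

import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[0.0, 3.0], [5.0, 0.2]])
X_bin = Binarizer(threshold=0.5).fit_transform(X)
print(X_bin)  # [[0. 1.] [1. 0.]]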
Imputation
● Real-world data might be incomplete; missing data is represented as blanks,
NaN, etc.
● Incomplete data is incompatible with scikit-learn estimators
● One way to deal with missing values is to discard the affected rows or columns.
● The other is to derive them from the existing data; that’s called imputation
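A minimal sketch using SimpleImputer (in older scikit-learn versions the same idea lived in sklearn.preprocessing.Imputer):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]])
imp = SimpleImputer(strategy='mean')  # also 'median', 'most_frequent'
print(imp.fit_transform(X))
# column 0: nan -> mean(1, 5) = 3; column 1: nan -> mean(2, 4) = 3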
Polynomial Features
● Sometimes we need to add complexity to the model
● Generates higher-degree and interaction terms from the existing features.
● Its main hyperparameter is degree
[ X, Y ] → [ 1, X, Y, X², XY, Y² ]
Polynomial transformer with degree 2
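A minimal sketch on a toy input; scikit-learn orders the degree-2 columns as 1, X, Y, X², XY, Y²:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # [X, Y]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))  # [[1. 2. 3. 4. 6. 9.]]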
Custom Transformer
● Sometimes the built-in transformers are not sufficient for data cleaning or
preprocessing.
● Custom transformers allow plain Python functions to be used for transforming
data
[ X, Y ] → [ log(X), log(Y) ]
Custom Transformer (log)
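A minimal sketch using FunctionTransformer, one way scikit-learn wraps a plain Python function as a transformer:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 10.0], [np.e, 100.0]])
log_transformer = FunctionTransformer(np.log)  # any element-wise function works
print(log_transformer.fit_transform(X))  # [[0.  2.302...] [1.  4.605...]]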
Outliers
● Data which doesn’t fit the
distribution of the entire dataset is
an outlier.
● Types of outliers - univariate,
multivariate.
● Univariate outlier - extreme with
respect to a single variable
● Multivariate outlier - extreme only
when multiple variables are
considered together
Outlier Reasons
● Experimental Error
● Data entry error
● Sampling Error
● Natural Outlier
● Intentional error
Outlier Impact
● Big impact on range, variance, and standard deviation
● The learning algorithm’s ability to generalize is impacted
Outlier Detection
● Extreme Value Analysis : z-score based methods (see the sketch below)
● Probabilistic and Statistical Models : how likely a point is under the fitted distribution
● Linear Models : data is projected into a lower dimension (e.g. with PCA); points far from
the projection plane are outliers
● Proximity-based Models : clustering-based methods, distance-based
methods
● High-Dimensional Outlier Detection : specialized methods for high-dimensional data
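A minimal z-score sketch for the extreme-value approach; zscore_outliers is a hypothetical helper written for this slide, not a library function:

import numpy as np

def zscore_outliers(x, threshold=3.0):
    # flag points whose |z-score| exceeds the threshold (hypothetical helper)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

x = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 50.0])
print(zscore_outliers(x, threshold=2.0))  # only the 50.0 is flagged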
Novelty Detection
● Add one more observation to the dataset
● Check whether the new observation is part of the existing distribution
● OneClassSVM can be used to detect novelty in a dataset
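A minimal sketch on synthetic training data; nu and gamma below are illustrative choices, not tuned values:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # the "known" distribution

clf = OneClassSVM(nu=0.05, kernel='rbf', gamma='auto').fit(X_train)
# +1 -> consistent with the training distribution, -1 -> novelty
print(clf.predict(np.array([[0.1, -0.2], [6.0, 6.0]])))  # expected roughly [ 1 -1]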
Text
● DictVectorizer
● CountVectorizer
● Tf-Idf
● HashingVectorizer
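A minimal CountVectorizer sketch on two made-up documents; the other vectorizers share the same fit/transform API:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['data preprocessing made simple', 'simple data pipelines']
vec = CountVectorizer()
X = vec.fit_transform(docs)     # sparse document-term count matrix
print(sorted(vec.vocabulary_))  # ['data', 'made', 'pipelines', 'preprocessing', 'simple']
print(X.toarray())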
Image
● Skimage library
● sklearn.feature_extraction.image
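A minimal sketch with sklearn.feature_extraction.image, using a toy array standing in for an image:

import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

img = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
patches = extract_patches_2d(img, patch_size=(2, 2))
print(patches.shape)  # (9, 2, 2): all 2x2 patches of a 4x4 image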
Thank You !!!
Visit : www.zekeLabs.com for more details
Let us know how we can help your organization upskill its
employees to stay updated in the ever-evolving IT industry.
Get in touch:
www.zekeLabs.com | +91-8095465880 | info@zekeLabs.com
