zekeLabs
Data Preprocessing
Learning made Simpler !
www.zekeLabs.com
Agenda
● Transformers
● StandardScaler
● MinMaxScaler
● RobustScaler
● Normalization
● Binarization
● Encoding Categorical Features
● Imputation
● Polynomial Features
● Custom Transformer
● Text Processing
Why Preprocessing ?
● Learning algorithms often work best on data in a particular form or scale.
● Unscaled or unstandardized data might lead to unacceptable predictions.
● Preprocessing refers to the transformations applied to data before feeding it to a machine learning algorithm.
http://benalexkeen.com/feature-scaling-with-scikit-learn/
Transformers
● Objects that transform data so that it can be consumed by
machine learning algorithms
● Common API - fit, transform, fit_transform
● fit() - Creates the map (learns parameters from the data)
● transform() - Uses the map to transform data
● fit_transform() - Combines the two steps above
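The fit / transform / fit_transform pattern can be sketched as below. This is a minimal illustration using StandardScaler as the example transformer; the toy arrays `X_train` and `X_test` are made up for this sketch and are not from the slides.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 3 training samples with 2 features each.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                          # create the map: per-feature mean and std
X_train_scaled = scaler.transform(X_train)   # use the map on the training data
X_test_scaled = scaler.transform(X_test)     # reuse the same map on unseen data

# fit_transform does fit and transform on the same data in one call.
X_train_again = StandardScaler().fit_transform(X_train)
```

Fitting only on the training data and reusing the fitted object on new data keeps the two sets on the same scale.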
StandardScaler
● Works best when feature data is approximately normally
distributed
● Scales each feature so that its mean is 0 and its
standard deviation is 1
● If data is not normally distributed,
StandardScaler may not be a great idea
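A small sketch of what standardization computes, assuming a made-up matrix `X`. It compares scikit-learn's StandardScaler with the formula z = (x - mean) / std applied by hand to each column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

X_std = StandardScaler().fit_transform(X)

# The same result by hand: subtract the column mean, divide by the column std.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

col_means = X_std.mean(axis=0)   # approximately 0 for every column
col_stds = X_std.std(axis=0)     # approximately 1 for every column
```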
MinMaxScaler
● One of the most popular scaling methods
● Works on data which is not normally
distributed
● Brings the data into the range [0,1] (the default) or another chosen range such as [-1,1]
● Skewness is maintained but data is brought to the
same scale
● The two normal distributions are kept
separate by the outliers that are inside the
0-1 range.
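A minimal sketch of min-max scaling, assuming a made-up single-feature column that contains one large value. The `feature_range` argument selects the target range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature with one large value (100).
X = np.array([[1.0], [2.0], [3.0], [100.0]])

X_01 = MinMaxScaler().fit_transform(X)                       # default range [0, 1]
X_11 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)  # range [-1, 1]

# By hand for the default range: (x - min) / (max - min)
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

Because the maximum defines the scale, the single large value pushes the remaining points close to 0, which is why this scaler is sensitive to outliers.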
RobustScaler
● Most suited for data with outliers
● Rather than min and max, uses the median and
interquartile range
● The distributions are brought into
the same scale and overlap, but
the outliers remain outside the bulk
of the new distributions.
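A short sketch of robust scaling on a made-up column with one outlier. With its default settings, RobustScaler subtracts the median and divides by the interquartile range (75th minus 25th percentile); the manual computation is included for comparison.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one outlier (100).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_robust = RobustScaler().fit_transform(X)

# By hand: (x - median) / (Q3 - Q1)
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
X_manual = (X - median) / (q3 - q1)
```

The bulk of the data ends up near 0 while the outlier stays far away, instead of compressing the other points as min-max scaling would.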
Normalizer
● The Normalizer scales each sample (row)
by dividing each of its values by the sample's
magnitude in n-dimensional space
for n number of features.
● With the default L2 norm, each point is now
exactly 1 unit from the origin of this Cartesian
coordinate system.
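A minimal sketch of row-wise normalization, assuming made-up two-dimensional points. Unlike the scalers above, Normalizer works on each sample (row) rather than on each feature (column).

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical points in 2-dimensional feature space.
X = np.array([[3.0, 4.0], [1.0, 0.0], [5.0, 12.0]])

X_norm = Normalizer(norm="l2").fit_transform(X)

# Each row is divided by its own Euclidean length, so every row has length 1.
row_lengths = np.linalg.norm(X_norm, axis=1)
```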
Encoding Categorical Values
Label Encoding
● Learning algorithms don’t understand strings
● Categorical columns with string values (yes/no) need to be converted to
numbers.
● LabelEncoder encodes values as integers from 0 to n_classes - 1
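A small sketch of label encoding on a made-up yes/no column. LabelEncoder sorts the distinct values and assigns integer codes in that order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical yes/no column.
answers = ["yes", "no", "yes", "no", "yes"]

encoder = LabelEncoder()
codes = encoder.fit_transform(answers)       # integer codes from 0 to n_classes - 1
classes = list(encoder.classes_)             # sorted distinct values: index = code
decoded = list(encoder.inverse_transform(codes))  # back to the original strings
```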
One Hot Encoding
● Converts each categorical value into a vector in which one element is hot (1) & the others
are cold (0).
● Suitable for nominal data
● Like location (delhi, mumbai etc.)
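A minimal sketch of one-hot encoding for a made-up location column. OneHotEncoder expects a 2-D input (one column per feature) and returns a sparse matrix by default, which is converted to a dense array here for readability.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal feature: one column of city names.
cities = np.array([["delhi"], ["mumbai"], ["delhi"], ["chennai"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(cities).toarray()  # dense 0/1 matrix
categories = list(encoder.categories_[0])          # column order of the output
```

Each row contains exactly one 1, in the column that corresponds to its city.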
Ordinal Data Encoding
● Features often contain ordinal data stored as strings like low, medium, high
● Transforming such a column with LabelEncoder might not be a good
option: it assigns codes in sorted (alphabetical) order, so high=0, low=1, medium=2.
● We want to maintain the order relationship between the values
● Using pandas we can replace low by 0, medium by 1 & high by 2
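The pandas replacement can be sketched as follows. The DataFrame, the column name `level`, and the mapping dictionary are made up for illustration.

```python
import pandas as pd

# Hypothetical ordinal column stored as strings.
df = pd.DataFrame({"level": ["low", "high", "medium", "low"]})

# Explicit mapping that preserves the order low < medium < high.
order = {"low": 0, "medium": 1, "high": 2}
df["level_encoded"] = df["level"].map(order)
```

Writing the mapping by hand keeps the numeric order aligned with the meaning of the categories.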
Binarizer
● Sets each feature value to 0 or 1, depending on a threshold
● Commonly used with text data
● An important step before some algorithms that expect binary data
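A minimal sketch of binarization, assuming a made-up matrix of word counts. Values greater than the threshold become 1 and all other values become 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical word counts: rows are documents, columns are words.
counts = np.array([[0.0, 2.0, 0.0], [3.0, 0.0, 1.0]])

# threshold=0.0 turns "how many times" into "present or not".
binary = Binarizer(threshold=0.0).fit_transform(counts)
```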
Imputation
● Real world data might be incomplete; missing data is represented as blank,
NaN etc.
● Most scikit-learn estimators do not accept incomplete data
● One way to deal with incomplete rows or columns is to discard them.
● The other is to derive the missing values from existing data; that’s called imputation
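A small sketch of imputation with scikit-learn's SimpleImputer, assuming a made-up matrix with two missing entries. The mean strategy is only one option; median, most frequent, and constant values are also available.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with missing entries marked as NaN.
X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)   # each NaN replaced by its column mean
```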
Polynomial Features
● Sometimes we need to add complexity to the model
● Convert features to higher-degree terms.
● The hyper-parameter it takes is degree
[ X, Y ] → [ 1, X, Y, XY, X^2, Y^2 ]
Polynomial Transformer (2)
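A minimal sketch of degree-2 polynomial expansion on one made-up sample with X = 2 and Y = 3. Note that scikit-learn orders the output columns as [1, X, Y, X^2, XY, Y^2], which contains the same terms as listed above in a slightly different order.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical single sample with two features: X = 2, Y = 3.
X = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)   # columns: 1, X, Y, X^2, XY, Y^2
```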
Custom Transformer
● Sometimes, in-built transformers are not sufficient for data cleaning or
preprocessing.
● Custom Transformers allow Python functions to be used for transforming
data
[ X, Y ] → [ log(X), log(Y) ]
Custom Transformer (log)
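The log transformation can be sketched with scikit-learn's FunctionTransformer, which wraps a plain Python function as a transformer. The input values are made up and chosen to be positive, since the logarithm is undefined for zero and negative values.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical positive-valued data.
X = np.array([[1.0, np.e], [np.e ** 2, 1.0]])

# Wrap np.log so it exposes the usual fit / transform API.
log_transformer = FunctionTransformer(np.log)
X_log = log_transformer.fit_transform(X)
```

The wrapped function can then be used anywhere a built-in transformer is accepted.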
Outliers
● Data which doesn’t fit the
distribution of the rest of the dataset is
an outlier.
● Types of outliers - univariate,
multivariate.
● Univariate Outlier - Based on values
of one variable
● Multivariate Outlier - Based on
multiple variables
Outlier Reasons
● Experimental Error
● Data entry error
● Sampling Error
● Natural Outlier
● Intentional error
Outlier Impact
● Big impact on range, variance, and standard deviation
● The performance of learning algorithms is impacted
Outlier Detection
● Extreme Value Analysis : z-score based method
● Probabilistic and Statistical Models : How well a data point fits an assumed distribution
● Linear Models : Data is projected into a lower dimension (e.g. with PCA); points far from
this plane are outliers.
● Proximity-based Models : Clustering based methods, distance based
methods
● High-Dimensional Outlier Detection
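A minimal sketch of the z-score based method for a single variable, using NumPy only. The data and the cut-off of 2.5 are assumptions made for this example; a cut-off of 3 is common on larger samples, but in a sample this small no point can reach it.

```python
import numpy as np

# Hypothetical measurements with one suspicious value (95).
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0, 95.0])

# z-score: distance from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std()

threshold = 2.5  # assumed cut-off for this small sample
outliers = values[np.abs(z_scores) > threshold]
```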
Novelty Detection
● A new observation is added to the dataset
● Check whether the new observation is part of the existing distribution
● OneClassSVM can be used for novelty detection
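A small sketch of novelty detection with OneClassSVM. The training data is randomly generated around the origin, and the values of `nu` and `gamma` are assumptions for illustration rather than recommended settings. `predict` returns 1 for observations judged to belong to the training distribution and -1 for novelties.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical "normal" observations clustered around the origin.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu bounds the fraction of training points treated as outliers (assumed value).
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
detector.fit(X_train)

# New observations: one near the training data, one far away from it.
X_new = np.array([[0.1, -0.2], [10.0, 10.0]])
labels = detector.predict(X_new)   # 1 = fits the distribution, -1 = novelty
```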
Text
● DictVectorizer
● CountVectorizer
● Tf-Idf
● HashingVectorizer
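A minimal sketch of two of the text vectorizers listed above, using three made-up sentences. CountVectorizer produces raw word counts, while TfidfVectorizer reweights them so that words appearing in every document count for less.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus.
docs = ["the cat sat", "the dog sat", "the cat ran"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()  # rows = documents, columns = words
vocabulary = count_vec.vocabulary_                # word -> column index

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs).toarray()   # rows are L2-normalized by default
```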
Image
● scikit-image (skimage) library
● sklearn.feature_extraction.image
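A small sketch of one function from sklearn.feature_extraction.image, assuming a made-up 4x4 grayscale "image" built from consecutive integers. extract_patches_2d cuts the image into all overlapping patches of the requested size.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

# Hypothetical 4x4 grayscale image with pixel values 0..15.
image = np.arange(16).reshape(4, 4)

# All overlapping 2x2 patches: 3 positions per axis, so 9 patches in total.
patches = extract_patches_2d(image, (2, 2))
```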
Thank You !!!
Visit : www.zekeLabs.com for more details
Let us know how we can help your organization upskill its
employees to stay updated in the ever-evolving IT industry.
Get in touch:
www.zekeLabs.com | +91-8095465880 | info@zekeLabs.com

Data Preprocessing

  • 1. zekeLabs Data Preprocessing Learning made Simpler ! www.zekeLabs.com
  • 2. Agenda ● Transformers ● StandardScaler ● MinMaxScaler ● RobustScaler ● Normalization ● Binarization ● Encoding Categorical Features ● Imputation ● Polynomial Features ● Custom Transformer ● Text Processing
  • 3. Why Preprocessing ? ● Learning algorithms have an affinity towards certain kinds of data. ● Unscaled or unstandardized data might lead to unacceptable predictions. ● Preprocessing refers to transformations applied to data before feeding it to a machine learning algorithm.
  • 5. Transformers ● Objects which transform data so that it can be consumed by machine learning algorithms ● Common API - fit, transform, fit_transform ● fit() - Creates the map (learns parameters from the data) ● transform() - Uses the map to transform data ● fit_transform() - Combination of the above two
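A minimal sketch of the fit / transform / fit_transform pattern, using StandardScaler as the example transformer. The arrays `X_train` and `X_new` are made-up illustration data, not from the slides.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption): one feature, three training rows.
X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                          # create the map: learn mean and std
X_train_scaled = scaler.transform(X_train)   # use the map on the training data
X_new_scaled = scaler.transform(X_new)       # reuse the same map on unseen data

# fit_transform() does fit() and transform() in one call.
X_combined = StandardScaler().fit_transform(X_train)
```

The point of the split is that new data is transformed with parameters learned from the training data only.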
  • 6. StandardScaler ● Assumes feature data is normally distributed ● Scales each feature such that its mean is 0 & standard deviation is 1 ● If data is not normally distributed, StandardScaler may not be a great idea
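As a quick check of the "mean 0, standard deviation 1" claim, the sketch below standardizes two features on very different scales. The data values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on different scales (illustrative values).
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

X_std = StandardScaler().fit_transform(X)

col_means = X_std.mean(axis=0)  # per-feature mean after scaling
col_stds = X_std.std(axis=0)    # per-feature standard deviation after scaling
```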
  • 7. MinMaxScaler ● One of the most popular scaling methods ● Works on data which is not normally distributed ● Brings the data into the range [0,1] or [-1,1] ● Skewness is maintained but data is brought to the same scale ● The two normal distributions are kept separate by the outliers that are inside the 0-1 range.
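A small sketch of both target ranges mentioned above, using the `feature_range` parameter of MinMaxScaler. The skewed sample values are an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Skewed single feature with one large value (illustrative).
X = np.array([[1.0], [5.0], [10.0], [100.0]])

X_01 = MinMaxScaler().fit_transform(X)                        # default range [0, 1]
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)  # range [-1, 1]
```

The large value still dominates: the other points end up squeezed near the lower end of the range, which is why skewness is preserved.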
  • 8. RobustScaler ● Most suited for data with outliers ● Rather than min-max, uses the interquartile range ● The distributions are brought into the same scale and overlap, but the outliers remain outside the bulk of the new distributions.
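A minimal sketch of RobustScaler with its default settings, which center each feature on the median and divide by the interquartile range (25th to 75th percentile). The sample with a single outlier is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with an obvious outlier (illustrative).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()
X_robust = scaler.fit_transform(X)

median = scaler.center_[0]  # median learned from the data
iqr = scaler.scale_[0]      # interquartile range learned from the data
```

Because the median and interquartile range barely move when one extreme value is added, the bulk of the data keeps a sensible scale while the outlier stays far away.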
  • 9. Normalizer ● The Normalizer scales each sample by dividing each of its values by the sample's magnitude in n-dimensional space, where n is the number of features. ● Each point is now within 1 unit of the origin on this Cartesian coordinate system.
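The sketch below applies Normalizer with its default L2 norm, so each row is divided by its own Euclidean length. The two sample rows are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Each row is one sample with two features (illustrative values).
X = np.array([[3.0, 4.0], [6.0, 8.0]])

X_norm = Normalizer().fit_transform(X)     # default norm="l2"
row_lengths = np.linalg.norm(X_norm, axis=1)
```

Unlike the scalers, which work column by column, Normalizer works row by row, so both rows here map to the same unit vector.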
  • 10. Encoding Categorical Values
  • 11. Label Encoding ● Learning algorithms don’t understand strings ● Categorical columns with string values ( yes/no ) need to be converted to numbers. ● LabelEncoder encodes values between 0 and n_classes - 1
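A minimal sketch of LabelEncoder on a yes/no column. The sample values are illustrative; note that scikit-learn documents LabelEncoder as intended for target labels, so using it on input features is a simplification here.

```python
from sklearn.preprocessing import LabelEncoder

answers = ["yes", "no", "yes", "no"]  # illustrative string column

encoder = LabelEncoder()
codes = encoder.fit_transform(answers)   # integer codes, one per value
classes = list(encoder.classes_)         # classes in sorted order
decoded = list(encoder.inverse_transform(codes))  # back to the strings
```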
  • 12. One Hot Encoding ● Converts each categorical value into a vector in which one value is hot & the others are cold. ● Suitable for nominal data ● Like location ( delhi, mumbai etc. )
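A short sketch of one-hot encoding a location column with OneHotEncoder. The city list is illustrative ("pune" is an added assumption), and `.toarray()` is used so the result is a dense array regardless of the library's default sparse output.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column, shaped as (n_samples, 1); values are illustrative.
cities = np.array([["delhi"], ["mumbai"], ["delhi"], ["pune"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(cities).toarray()
categories = list(encoder.categories_[0])  # column order of the output
```

Each row has exactly one 1, so no artificial ordering between cities is introduced.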
  • 13. Ordinal Data Encoding ● Features often consist of ordinal data stored as strings like low, medium, high ● Transforming such a column with LabelEncoder might not be a good option, because the codes follow alphabetical order rather than the real order. ● We want to maintain the order relationship between values ● Using pandas we can replace low by 0, medium by 1 & high by 2
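The pandas replacement described above can be sketched with an explicit mapping. The column name `level` and the DataFrame contents are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with one ordinal column.
df = pd.DataFrame({"level": ["low", "high", "medium", "low"]})

# Explicit mapping that preserves the real order low < medium < high.
order = {"low": 0, "medium": 1, "high": 2}
df["level_code"] = df["level"].map(order)
```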
  • 14. Binarizer ● Sets each feature value to 0 or 1 depending on a threshold ● Commonly used with text data ● An important step before some algorithms expecting binary data
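A minimal sketch of Binarizer: values strictly greater than the threshold become 1, everything else becomes 0. The input matrix and the threshold of 0.0 are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[0.0, 2.0, -1.0],
              [3.0, 0.0, 0.5]])  # illustrative values

# threshold=0.0 is the default; it is spelled out here for clarity.
X_bin = Binarizer(threshold=0.0).fit_transform(X)
```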
  • 15. Imputation ● Real world data might be incomplete; missing data is represented as blank, nan etc. ● Incomplete data is incompatible with most scikit-learn estimators ● One way to deal with missing values is to discard them. ● The other is to derive them from existing data; that’s called imputation
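Both options can be sketched on a small array: dropping incomplete rows with a NumPy mask, and filling gaps with the column mean using SimpleImputer. The data and the choice of the mean strategy are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])  # illustrative data with missing values

# Option 1: discard every row that contains a missing value.
X_dropped = X[~np.isnan(X).any(axis=1)]

# Option 2: impute each missing value with the mean of its column.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
```

Discarding keeps only one of the three rows here, which shows how quickly data can be lost compared with imputation.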
  • 16. Polynomial Features ● Sometimes we need to add complexity to the model ● Converts data to higher degrees ● The hyper-parameter it takes is degree ● Example: Polynomial Transformer (2) maps [ X, Y ] to [ 1, X, Y, XY, X^2, Y^2 ]
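A small sketch of the degree-2 expansion with PolynomialFeatures, using X = 2 and Y = 3 as assumed sample values. scikit-learn emits the same terms as listed above, in the order 1, X, Y, X^2, XY, Y^2.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample with X = 2, Y = 3 (illustrative)

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)  # columns: 1, X, Y, X^2, XY, Y^2
```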
  • 17. Custom Transformer ● Sometimes, in-built transformers are not sufficient for data cleaning or preprocessing. ● Custom Transformers allow Python functions to be used for transforming data ● Example: Custom Transformer (log) maps [ X, Y ] to [ log(X), log(Y) ]
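One way to build the log transformer above is scikit-learn's FunctionTransformer, which wraps a plain Python function in the transformer API. This is a sketch under that assumption; the slides do not say which class they use, and the input values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Illustrative positive values, chosen so the logs are whole numbers.
X = np.array([[1.0, np.e],
              [np.e ** 2, 1.0]])

log_transformer = FunctionTransformer(np.log)  # wrap np.log as a transformer
X_log = log_transformer.fit_transform(X)
```

The wrapped function can then be used anywhere a built-in transformer is accepted, for example as a pipeline step.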
  • 18. Outliers ● A data point which doesn’t fit the distribution of the rest of the dataset is an outlier. ● Types of outliers - univariate, multivariate. ● Univariate Outlier - Based on values of one variable ● Multivariate Outlier - Based on multiple variables
  • 19. Outlier Reasons ● Experimental error ● Data entry error ● Sampling error ● Natural outlier ● Intentional error
  • 20. Outlier Impact ● Big impact on range, variance, and standard deviation ● The ability of learning algorithms is impacted
  • 21. Outlier Detection ● Extreme Value Analysis : z-score based methods ● Probabilistic and Statistical Models : How well a data point fits a distribution ● Linear Models : Data is projected onto a lower dimension (e.g. with PCA); points far from this plane are outliers ● Proximity-based Models : Clustering based methods, distance based methods ● High-Dimensional Outlier Detection
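The z-score method from the first bullet can be sketched with NumPy alone. The sample values and the cutoff of 3 standard deviations are assumptions; 3 is a common convention, not a rule from the slides.

```python
import numpy as np

# Illustrative univariate sample with one extreme value.
values = np.array([10, 12, 11, 9, 10, 11, 12, 10, 9, 11, 10, 50], dtype=float)

# z-score: distance from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std()

threshold = 3.0  # assumed cutoff
outliers = values[np.abs(z_scores) > threshold]
```

With very small samples the largest possible z-score is limited, so a fixed cutoff of 3 may flag nothing; that is one reason the other families of methods exist.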
  • 22. Novelty Detection ● Adding one more observation to a dataset ● Checking if the new observation is part of the distribution ● OneClassSVM can be used to detect whether a new observation is a novelty
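A minimal sketch of novelty detection with OneClassSVM: the model is fitted on training data assumed to be clean, then asked about new observations. The synthetic data, the random seed, and the `nu` and `gamma` settings are all assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # synthetic "normal" data

# nu roughly bounds the fraction of training points treated as outliers.
model = OneClassSVM(nu=0.05, gamma="scale").fit(X_train)

# Two new observations: one near the training data, one far from it.
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])
predictions = model.predict(X_new)  # +1 = fits the distribution, -1 = novelty
```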
  • 23. Text ● DictVectorizer ● CountVectorizer ● Tf-Idf ● HashingVectorizer
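Two of the listed vectorizers can be sketched on a toy corpus: CountVectorizer for raw word counts and TfidfVectorizer for tf-idf weights. The three documents are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]  # toy corpus

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()  # rows = documents, columns = words
vocab = sorted(count_vec.vocabulary_)             # words in column order

tfidf = TfidfVectorizer().fit_transform(docs).toarray()
row_norms = np.linalg.norm(tfidf, axis=1)         # rows are L2-normalized by default
```

In the tf-idf matrix the word "the", which appears in every document, gets a lower weight than words that appear in only some of them.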
  • 24. Image ● Skimage library ● sklearn.feature_extraction.image
  • 26. Visit : www.zekeLabs.com for more details THANK YOU Let us know how we can help your organization upskill its employees to stay updated in the ever-evolving IT industry. Get in touch: www.zekeLabs.com | +91-8095465880 | info@zekeLabs.com