Data Preprocessing: Unlocking the
Power of Manufacturing Data
A Guide to Data Cleaning and Transformation
What is data pre-processing?
Why is it important?
How do I pre-process data?
What role does it play in data analysis?
Why is it required?
Data Pre-processing
• Involves cleaning, transforming, reducing and preparing data for ML models
Goal of Data Pre-processing
• Ensures data is in the right format for exploration, analysis and modeling
Common Techniques in Pre-processing
• Data Cleaning includes
• Removing duplicates
• Dealing with missing values and outliers
• Correcting data entry errors
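The cleaning steps listed above can be sketched with the Python standard library alone. The records, the field names (`machine_id`, `temp_c`) and the specific entry error (a comma used as decimal separator) are hypothetical and chosen only for illustration.

```python
# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"machine_id": "M1", "temp_c": "71.5"},
    {"machine_id": "M2", "temp_c": "69,8"},   # entry error: comma as decimal separator
    {"machine_id": "M1", "temp_c": "71.5"},   # exact duplicate of the first record
    {"machine_id": "M3", "temp_c": ""},       # missing value
]

def clean(records):
    """Remove exact duplicates, fix a known entry error, and mark missing values."""
    seen = set()
    cleaned = []
    for rec in records:
        key = (rec["machine_id"], rec["temp_c"])
        if key in seen:              # skip rows already seen (duplicate removal)
            continue
        seen.add(key)
        text = rec["temp_c"].strip().replace(",", ".")   # correct the decimal separator
        temp = float(text) if text else None             # None marks a missing value
        cleaned.append({"machine_id": rec["machine_id"], "temp_c": temp})
    return cleaned

cleaned_records = clean(raw_records)
for rec in cleaned_records:
    print(rec)
```

Missing values are only flagged here (as `None`), not filled; how to treat them depends on why they are missing.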
Common Techniques in Pre-processing
• Data Transformation
Converting data to a format suitable for analysis.
It includes:
• Scaling the data
• Normalizing the data
• Transforming the data toward a normal distribution
Common Techniques in Pre-processing
• Feature Extraction
An important step that identifies, selects and extracts the most useful features from the data.
It includes:
• Selecting the most relevant features
• Combining multiple features into one
• Creating a new set of features from existing ones
Common Techniques in Pre-processing
• Data Splitting
Separates the entire dataset into training, validation and test sets.
➢ Training set → To fit the model
➢ Validation set → To tune hyperparameters
➢ Test set → To evaluate the final performance of the model
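A minimal sketch of a three-way split, assuming a 70/15/15 ratio and a seeded shuffle; neither the ratio nor the helper name `split_data` comes from the slides.

```python
import random

def split_data(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle rows and split them into training, validation and test sets."""
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # seeded shuffle for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]        # whatever remains goes to the test set
    return train, val, test

train, val, test = split_data(range(100))
print(len(train), len(val), len(test))
```

A random shuffle suits independent rows; time-ordered process data would usually be split chronologically instead.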
Data Cleaning – Handling Outliers
• What are outliers?
Data Cleaning – Handling Outliers
• Methods to handle outliers:
➢ IQR method:
o Involves calculating the interquartile range (IQR) → the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data.
o Outliers = points more than 1.5 × IQR below the 1st quartile or above the 3rd quartile
o Outliers identified can either be removed or replaced. How?
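The IQR rule above can be sketched as follows. The readings are hypothetical, and `method="inclusive"` is one of several quartile conventions; other conventions give slightly different fences. Capping at the nearest fence is shown as one possible way to replace an outlier, which is an assumption rather than the slides' prescribed answer.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

readings = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]   # hypothetical; 102 looks suspicious
lower, upper = iqr_bounds(readings)
outliers = [x for x in readings if x < lower or x > upper]

# Option 1: remove the outliers.
removed = [x for x in readings if lower <= x <= upper]
# Option 2: replace each outlier with the nearest fence (capping).
capped = [min(max(x, lower), upper) for x in readings]

print(outliers)
print(removed)
print(capped)
```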
Data Cleaning – Handling Outliers
• Methods to handle outliers:
➢ Z-score method:
o Involves standardizing the data by subtracting the mean and dividing by the standard deviation.
$$z = \frac{x - \mu}{\sigma}$$
o Outliers = points more than 3 to 4 standard deviations from the mean
o Outliers identified can either be removed or replaced. How?
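A short sketch of the z-score rule, assuming a threshold of 3 and the population standard deviation; the readings are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize values: z = (x - mean) / standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)    # population standard deviation
    return [(x - mu) / sigma for x in values]

# Hypothetical readings: 20 typical values and one extreme value.
readings = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50] * 2 + [80]
threshold = 3.0                          # the slides suggest 3 to 4
outliers = [x for x, z in zip(readings, z_scores(readings)) if abs(z) > threshold]
print(outliers)
```

Because the extreme value itself inflates the mean and standard deviation, the z-score method can miss outliers in very small samples; the IQR method is less affected by this.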
Data Transformation – Feature Scaling
• What is feature scaling?
Any example of why scaling is essential?
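One possible illustration of why scaling matters uses Euclidean distance, which many algorithms rely on. The three machine states and their units (spindle speed in rpm, vibration in mm/s) are hypothetical.

```python
import math

# Hypothetical machine states: (spindle speed in rpm, vibration in mm/s).
a = (3000.0, 0.5)
b = (3010.0, 4.5)   # similar speed, very different vibration
c = (3100.0, 0.5)   # different speed, identical vibration

def distance(p, q):
    """Euclidean distance between two points."""
    return math.dist(p, q)

# Unscaled: the rpm column dominates, so b looks far closer to a than c does.
print(distance(a, b), distance(a, c))

def min_max_columns(points):
    """Rescale every feature (column) to the range 0 to 1."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

# Scaled: both features now contribute on a comparable footing.
a_s, b_s, c_s = min_max_columns([a, b, c])
print(distance(a_s, b_s), distance(a_s, c_s))
```

Before scaling, the large vibration jump in `b` is almost invisible next to a 100 rpm change; after scaling, the two differences weigh about the same.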
Data Transformation – Feature Scaling
• Methods of feature scaling
Maximum absolute scaling: divides each value by the maximum absolute value of its feature.
$$x_{scaled} = \frac{x}{\max(|x|)}$$
Min-max scaling: scales each feature to the range 0 to 1.
$$x_{scaled} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
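The two formulas above translate directly into code. This sketch assumes the feature is not constant (otherwise the denominators would be zero); the sample values are hypothetical.

```python
def max_abs_scale(values):
    """x / max(|x|): results lie in the range -1 to 1."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values]

def min_max_scale(values):
    """(x - min) / (max - min): results lie in the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

feature = [-4.0, 0.0, 2.0, 8.0]      # hypothetical feature column
print(max_abs_scale(feature))         # [-0.5, 0.0, 0.25, 1.0]
print(min_max_scale(feature))
```

Dividing by the maximum absolute value keeps zeros at zero and preserves signs, while the min-max form shifts the data so the smallest value becomes 0.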
Data Transformation – Feature Scaling
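• Methods of feature scaling
Standardization: scales each feature to a mean of 0 and a standard deviation of 1.
$$x_{scaled} = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}$$
Equivalently, this scales each feature to a mean of 0 and a variance of 1, because the standard deviation is the square root of the variance.
$$x_{scaled} = \frac{x - \mathrm{mean}(x)}{\sqrt{\mathrm{var}(x)}}$$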
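A minimal sketch of scaling to zero mean and unit standard deviation, using the population standard deviation; the feature values are hypothetical.

```python
import statistics

def standardize(values):
    """(x - mean) / std: the result has mean 0 and standard deviation 1."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)    # square root of the population variance
    return [(v - mu) / sigma for v in values]

feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical; mean 5, std 2
scaled = standardize(feature)
print(scaled)
print(statistics.fmean(scaled), statistics.pstdev(scaled))
```

Since the standard deviation is the square root of the variance, a unit standard deviation and a unit variance describe the same scaling.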
Data Cleaning – Handling Missing values
• What are missing values?
Data Cleaning – Handling Missing values
• What are the types of missing values?
• Missing completely at random (MCAR):
o Occurs when the missingness is not related to any variable in the dataset.
o Data are missing completely at random → no systematic pattern
1. Understanding why data are missing is important.
2. If data are missing completely at random, the remaining data still represent the population.
3. If data are missing systematically, the analysis is biased.
• Missing at random (MAR):
o Occurs when the missingness is not completely random, but can be explained by other observed variables.
o Under MAR, methods that account for those observed variables can give asymptotically unbiased estimates.
• Missing NOT at random (MNAR):
o Occurs when the missingness is not random and cannot be explained by the observed variables; it depends on unobserved data.
Examples – MCAR, MAR, MNAR
• Test papers of 50 students in a class
The professor, while bringing the 50 papers, loses 3 of them to a sudden gust of wind → MCAR.
Each student's paper is lost with the same probability.
The professor gets 49 papers from 50 students because one student is absent → MAR.
The probability of missing data depends on observed data → class attendance → no effect of unobserved data
The professor gets 49 papers from 50 students because one student is absent, having not understood the exam topic → MNAR.
The probability of missing data depends on unobserved data → struggling with the exam topics
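The three mechanisms can also be compared in a small simulation. Everything here is an assumption made for illustration: the score distributions, the attendance rate and the missingness probabilities are invented, and the scenario only loosely mirrors the exam example.

```python
import random
import statistics

rng = random.Random(0)                    # fixed seed so the run is repeatable
n = 10_000
attended = [rng.random() < 0.8 for _ in range(n)]               # observed variable
scores = [rng.gauss(70 if att else 55, 10) for att in attended]  # hypothetical scores

def observed_mean(missing_flags):
    """Mean of the scores that are not flagged as missing."""
    return statistics.fmean([s for s, m in zip(scores, missing_flags) if not m])

# MCAR: every paper is lost with the same probability.
mcar = [rng.random() < 0.3 for _ in range(n)]
# MAR: missingness depends only on the observed attendance flag.
mar = [rng.random() < (0.1 if att else 0.7) for att in attended]
# MNAR: missingness depends on the score itself, which is what goes missing.
mnar = [rng.random() < (0.6 if s < 60 else 0.1) for s in scores]

full_mean = statistics.fmean(scores)
mcar_mean = observed_mean(mcar)
mnar_mean = observed_mean(mnar)

# Under MAR, comparing within the observed group (attendees) removes the bias.
attended_full = statistics.fmean([s for s, a in zip(scores, attended) if a])
attended_mar = statistics.fmean(
    [s for s, a, m in zip(scores, attended, mar) if a and not m]
)

print("full mean:", full_mean)
print("MCAR mean:", mcar_mean)
print("MNAR mean:", mnar_mean)
print("attendees, full vs MAR-observed:", attended_full, attended_mar)
```

In this setup the MCAR mean stays close to the full mean, the MNAR mean is pushed upward because low scores go missing more often, and the MAR data remain representative once attendance is taken into account.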
Data Cleaning – Handling Missing values
• How to deal with missing values?
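Two widely used options, deletion and mean imputation, can be sketched as follows. They are shown as common examples under the assumption that they fit this question; they are not necessarily the full set of approaches, and the readings are hypothetical.

```python
import statistics

temps = [71.5, None, 69.8, 70.4, None, 72.3]   # hypothetical readings; None = missing

# Option 1: deletion, i.e. drop the missing entries.
dropped = [t for t in temps if t is not None]

# Option 2: imputation, i.e. fill each gap with the mean of the observed values.
fill = statistics.fmean(dropped)
imputed = [fill if t is None else t for t in temps]

print(dropped)
print(imputed)
```

Both options are easiest to justify when values are missing completely at random; mean imputation also shrinks the variance of the feature.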
