SlideShare a Scribd company logo
What is Data Preprocessing?
• Data preprocessing is the process of transforming raw data into an
understandable format. It is also an important step in data mining as we
cannot work with raw data. The quality of the data should be checked
before applying machine learning or data mining algorithms.
• In machine learning processes, data preprocessing is critical for ensuring
large datasets are formatted in such a way that the data they contain can
be interpreted and parsed by learning algorithms.
• Preprocessing involves both data validation and data imputation.
• The goal of data validation is to assess whether the data in question is both
complete and accurate.
• The goal of data imputation is to correct errors and input missing values -- either
manually or automatically through business process automation(BPA)
programming.
Is data Pre-processing important?
Why is Data pre-processing important?
• Preprocessing of data is mainly used to check the data quality. The
quality can be checked by the following
• Accuracy: To check whether the data entered is correct or not.
• Completeness: To check whether the data is available or not
recorded.
• Consistency: To check whether the same data is kept in all the places
that do or do not match.
• Timeliness: The data should be updated correctly.
• Believability: The data should be trustable.
• Interpretability: The understandability of the data.
Steps in Data Pre-processing:
Steps in Data Pre-processing:
Handling missing values, noisy data
Data Cleaning
Data Integration
Data Transformation
Combing multiple datasets from
different databases
Data Reduction
Handling missing values:
1.Deleting the columns with missing data
2.Deleting the rows with missing data
3.Filling the missing data with a mean value – Imputation
4.Imputation with an additional column
If a column doesn't contribute enough to the model then we can
remove the column But this is an extreme case and should only be used
when there are many null values in the column.
• The possible ways to do this are:
1.Filling the missing data with the mean or median value if it’s a numerical variable.
2.Filling the missing data with mode(the value that appears most often) if it’s a categorical value.
3.Filling the numerical value with 0 or -999, or some other number that will not occur in the data.
This can be done so that the machine can recognize that the data is not real or is different.
4.Filling the categorical value with a new type for the missing values.
Code:
• Checking for null values:
• Calculating mean, mode or median:
• Adding the corrected feature to dataframe named ‘data’ :
Thank you

More Related Content

Similar to Data science engineering Preprocessing.pptx

UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...
UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...
UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...
subhashchandra197
 
Anwar kamal .pdf.pptx
Anwar kamal .pdf.pptxAnwar kamal .pdf.pptx
Anwar kamal .pdf.pptx
Luminous8
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Quality
priyanka rajput
 
Chapter 3.pdf
Chapter 3.pdfChapter 3.pdf
Chapter 3.pdf
DrGnaneswariG
 
Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...
Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...
Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...
SaiM947604
 
DATA PREPROCESSING AND DATA CLEANSING
DATA PREPROCESSING AND DATA CLEANSINGDATA PREPROCESSING AND DATA CLEANSING
DATA PREPROCESSING AND DATA CLEANSING
Ahtesham Ullah khan
 
Data pre processing
Data pre processingData pre processing
Data pre processing
kalavathisugan
 
omama munir 58.pptx
omama munir 58.pptxomama munir 58.pptx
omama munir 58.pptx
OmamaNoor2
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data Preparation
Umair Shafique
 
Unit2
Unit2Unit2
Data warehouse 14 data reconciliation tools
Data warehouse 14 data reconciliation toolsData warehouse 14 data reconciliation tools
Data warehouse 14 data reconciliation tools
Vaibhav Khanna
 
4 Data preparation and processing
4  Data preparation and processing4  Data preparation and processing
4 Data preparation and processing
Mahmoud Alfarra
 
Data cleaning, reduction and transformation.pdf
Data cleaning, reduction and transformation.pdfData cleaning, reduction and transformation.pdf
Data cleaning, reduction and transformation.pdf
9wldv5h8n
 
preprocessing
preprocessingpreprocessing
Machine Learning Approaches and its Challenges
Machine Learning Approaches and its ChallengesMachine Learning Approaches and its Challenges
Machine Learning Approaches and its Challenges
ijcnes
 
Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
Iffat Firozy
 
overview of_data_processing
overview of_data_processingoverview of_data_processing
overview of_data_processing
FEG
 
Mind Map Test Data Management Overview
Mind Map Test Data Management OverviewMind Map Test Data Management Overview
Mind Map Test Data Management Overview
dublinx
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data Mining
Nandakumar P
 
data wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhdata wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjh
VISHALMARWADE1
 

Similar to Data science engineering Preprocessing.pptx (20)

UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...
UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...
UNIT-1 Data pre-processing-Data cleaning, Transformation, Reduction, Integrat...
 
Anwar kamal .pdf.pptx
Anwar kamal .pdf.pptxAnwar kamal .pdf.pptx
Anwar kamal .pdf.pptx
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Quality
 
Chapter 3.pdf
Chapter 3.pdfChapter 3.pdf
Chapter 3.pdf
 
Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...
Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...
Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...
 
DATA PREPROCESSING AND DATA CLEANSING
DATA PREPROCESSING AND DATA CLEANSINGDATA PREPROCESSING AND DATA CLEANSING
DATA PREPROCESSING AND DATA CLEANSING
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
omama munir 58.pptx
omama munir 58.pptxomama munir 58.pptx
omama munir 58.pptx
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data Preparation
 
Unit2
Unit2Unit2
Unit2
 
Data warehouse 14 data reconciliation tools
Data warehouse 14 data reconciliation toolsData warehouse 14 data reconciliation tools
Data warehouse 14 data reconciliation tools
 
4 Data preparation and processing
4  Data preparation and processing4  Data preparation and processing
4 Data preparation and processing
 
Data cleaning, reduction and transformation.pdf
Data cleaning, reduction and transformation.pdfData cleaning, reduction and transformation.pdf
Data cleaning, reduction and transformation.pdf
 
preprocessing
preprocessingpreprocessing
preprocessing
 
Machine Learning Approaches and its Challenges
Machine Learning Approaches and its ChallengesMachine Learning Approaches and its Challenges
Machine Learning Approaches and its Challenges
 
Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
 
overview of_data_processing
overview of_data_processingoverview of_data_processing
overview of_data_processing
 
Mind Map Test Data Management Overview
Mind Map Test Data Management OverviewMind Map Test Data Management Overview
Mind Map Test Data Management Overview
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data Mining
 
data wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhdata wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjh
 

Recently uploaded

一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Enterprise Wired
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 

Recently uploaded (20)

一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 

Data science engineering Preprocessing.pptx

  • 1. What is Data Preprocessing? • Data preprocessing is the process of transforming raw data into an understandable format. It is also an important step in data mining as we cannot work with raw data. The quality of the data should be checked before applying machine learning or data mining algorithms. • In machine learning processes, data preprocessing is critical for ensuring large datasets are formatted in such a way that the data they contain can be interpreted and parsed by learning algorithms.
  • 2. • Preprocessing involves both data validation and data imputation. • The goal of data validation is to assess whether the data in question is both complete and accurate. • The goal of data imputation is to correct errors and input missing values -- either manually or automatically through business process automation(BPA) programming.
  • 4. Why is Data pre-processing important? • Preprocessing of data is mainly used to check the data quality. The quality can be checked by the following • Accuracy: To check whether the data entered is correct or not. • Completeness: To check whether the data is available or not recorded. • Consistency: To check whether the same data is kept in all the places that do or do not match. • Timeliness: The data should be updated correctly. • Believability: The data should be trustable. • Interpretability: The understandability of the data.
  • 5.
  • 6. Steps in Data Pre-processing:
  • 7. Steps in Data Pre-processing:
  • 8. Handling missing values, noisy data Data Cleaning Data Integration Data Transformation Combing multiple datasets from different databases Data Reduction
  • 9.
  • 10. Handling missing values: 1.Deleting the columns with missing data 2.Deleting the rows with missing data 3.Filling the missing data with a mean value – Imputation 4.Imputation with an additional column
  • 11. If a column doesn't contribute enough to the model then we can remove the column But this is an extreme case and should only be used when there are many null values in the column.
  • 12.
  • 13. • The possible ways to do this are: 1.Filling the missing data with the mean or median value if it’s a numerical variable. 2.Filling the missing data with mode(the value that appears most often) if it’s a categorical value. 3.Filling the numerical value with 0 or -999, or some other number that will not occur in the data. This can be done so that the machine can recognize that the data is not real or is different. 4.Filling the categorical value with a new type for the missing values.
  • 14.
  • 15. Code: • Checking for null values: • Calculating mean, mode or median: • Adding the corrected feature to dataframe named ‘data’ :
  • 16.