ML | OVERVIEW
OF DATA CLEANING
Dr. Sheetal Dhande-Dandge
Professor CSE
SIPNA COET
Data cleaning is one of the most important parts of machine learning and
plays a significant part in building a model. It is certainly not the
fanciest part of machine learning, and there are no hidden tricks or
secrets to uncover.
However, the success or failure of a project often depends on proper data
cleaning. Professional data scientists usually invest a large portion of
their time in this step because of the belief
that “Better data beats fancier algorithms”.
With a well-cleaned dataset, we can often achieve good results even with
simple algorithms, which can be very beneficial in terms of
computation when the dataset is large.
Different types of data require different types of cleaning. However, the
systematic approach described here can always serve as a good
starting point.
STEPS INVOLVED IN DATA CLEANING:
Data cleaning is a crucial step in the
machine learning (ML) pipeline. It
involves identifying and handling
missing, duplicate, or irrelevant
data.
The goal of data cleaning is to ensure
that the data is accurate, consistent,
and free of errors, as incorrect or
inconsistent data can negatively
impact the performance of the ML
model.
The main steps involved in data cleaning are:
• Handling missing data: This step involves identifying missing data and
handling it by removing the affected observations, imputing missing values
with a suitable estimate, or using techniques such as multiple imputation.
• Removing duplicates: This step involves identifying and removing
duplicate data using data deduplication techniques.
• Handling outliers: This step involves identifying and handling
outliers in the data, either by removing them or by transforming the
data to reduce their impact.
• Correcting errors: This step involves identifying and correcting
errors in the data using techniques such as data validation or
data correction algorithms.
Data cleaning is an iterative process: it may be necessary to repeat some
of the steps several times to ensure that the data is accurate
and consistent.
The choice of data cleaning techniques depends on the specific
requirements of the project, including the size and complexity of the
data and the desired outcome.
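Before applying any of these steps, it can help to measure how much of each problem is present. The following is a minimal sketch, assuming the data is held in a pandas DataFrame; the `audit` helper and the toy columns `age` and `city` are hypothetical names chosen for illustration.

```python
import pandas as pd


def audit(df: pd.DataFrame) -> dict:
    """Summarize common data-quality problems without modifying df."""
    return {
        "rows": len(df),
        # Number of missing values in each column
        "missing_per_column": df.isna().sum().to_dict(),
        # Number of rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Hypothetical toy data for illustration only
df = pd.DataFrame({
    "age": [25, None, 25, 40],
    "city": ["Pune", "Delhi", "Pune", None],
})
report = audit(df)
print(report)
```

Because cleaning is iterative, running such an audit again after each step is one way to check whether another pass is needed.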
• This step includes deleting duplicate, redundant, or
irrelevant values from your dataset. Duplicate
observations most frequently arise during data
collection, and irrelevant observations are those that
don’t actually fit the specific problem that you’re
trying to solve.
• Redundant observations greatly reduce efficiency
because the data repeats, and they may skew the results
toward either the correct or the incorrect
side, thereby producing unreliable results.
• Irrelevant observations are any type of data that is of
no use to us and can be removed directly.
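As a minimal sketch of this step in pandas: the toy columns below are hypothetical, and the rule that only "retail" rows are relevant is an assumption made purely for illustration.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are exact copies
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "segment": ["retail", "retail", "retail", "wholesale"],
    "amount": [100.0, 250.0, 250.0, 900.0],
})

# Remove exact duplicate rows, keeping the first occurrence
deduped = df.drop_duplicates().reset_index(drop=True)

# Remove observations that do not fit the problem being solved
# (assumption: the model only concerns retail customers)
relevant = deduped[deduped["segment"] == "retail"].reset_index(drop=True)

print(relevant)
```

Which rows count as irrelevant depends entirely on the problem, so the filter condition is something to decide per project rather than a fixed rule.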
• Errors that arise during measurement, data transfer,
or similar situations are called structural
errors. They include typos in feature names,
the same attribute under different names,
mislabeled classes (separate classes that should really
be the same), and inconsistent capitalization.
• For example, the model will treat America and america as
different classes or values, though they represent the
same value. Likewise, it will treat red, yellow, and
red-yellow as different classes or attributes, though one
class can be included in the other two. Such structural
errors make our model inefficient and give poor-quality
results.
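A minimal sketch of fixing such structural errors with pandas string methods follows. The column name and values are hypothetical, and title-casing is only one possible normalization convention.

```python
import pandas as pd

# Hypothetical toy data: a feature name with a trailing space and
# the same country written with inconsistent capitalization
df = pd.DataFrame({"Country ": ["America", "america", " AMERICA", "India"]})

# Clean the feature name
df.columns = df.columns.str.strip()

# Normalize whitespace and capitalization so the variants collapse
# into a single class
df["Country"] = df["Country"].str.strip().str.title()

print(df["Country"].unique())
```

Overlapping classes such as red, yellow, and red-yellow cannot be merged by string normalization alone; they need an explicit mapping chosen with knowledge of the domain.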
• Outliers can cause problems with certain
types of models. For example, linear
regression models are less robust to outliers
than decision tree models. Generally, we
should not remove an outlier unless we have a
legitimate reason, such as a suspicious
measurement that is unlikely to be part of
the real data. Removing outliers sometimes
improves performance and sometimes does not.
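One common way to identify candidate outliers, not prescribed by these slides, is the interquartile range (IQR) rule. The sketch below only flags values for review rather than deleting them; the helper name `iqr_outlier_mask` and the multiplier `k = 1.5` are conventional but assumed choices.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Hypothetical measurements with one suspicious value
values = pd.Series([10, 12, 11, 13, 12, 11, 95])
mask = iqr_outlier_mask(values)

# Inspect the flagged values before deciding whether to remove them
print(values[mask])
```

Keeping the flag separate from the removal makes it possible to check whether each flagged value is a measurement error or a genuine extreme observation.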
HANDLING MISSING DATA
Missing data is a deceptively tricky issue in
machine learning. We cannot simply ignore or
remove missing observations; they must
be handled carefully, as they can be an
indication of something important. The two
most common ways to deal with missing
data, and their drawbacks, are:
• Dropping observations with missing values.
  • The fact that the value was missing may be informative in
    itself.
  • Plus, in the real world, you often need to make predictions
    on new data even if some of the features are missing!
• Imputing the missing values from past observations.
  • Again, “missingness” is almost always informative in itself,
    and you should tell your algorithm if a value was missing.
  • Even if you build a model to impute your values, you’re
    not adding any real information. You’re just reinforcing
    the patterns already provided by other features.
Missing data is like missing a puzzle piece. If
you drop it, that’s like pretending the puzzle
slot isn’t there. If you impute it, that’s like
trying to squeeze in a piece from
somewhere else in the puzzle.
So, missingness is almost always informative
and an indication of something important,
and we must make our algorithm aware of
it by flagging it. By using this
technique of flagging and filling, you are
essentially allowing the algorithm to
estimate the optimal constant for
missingness, instead of just filling it in with
the mean.
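The flag-and-fill technique can be sketched in pandas as follows. The column name `income`, the indicator name `income_missing`, and the fill value 0 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with two missing values
df = pd.DataFrame({"income": [50.0, None, 70.0, None]})

# Flag: record which values were missing before anything is filled
df["income_missing"] = df["income"].isna().astype(int)

# Fill: use a constant so the column has no gaps; the indicator column
# lets the model learn its own adjustment for missingness
df["income"] = df["income"].fillna(0)

print(df)
```

The flag must be computed before the fill; once the gaps are filled, the information about which values were missing can no longer be recovered from the column itself.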
SOME DATA CLEANSING
TOOLS
• OpenRefine
• Trifacta Wrangler
• TIBCO Clarity
• Cloudingo
• IBM InfoSphere QualityStage
• Data cleaning is an important step in the machine learning process because it
can have a significant impact on the quality and performance of a model.
Data cleaning involves identifying and correcting or removing errors and
inconsistencies in the data.
Here is a simple example of data cleaning in Python:

import pandas as pd

# Load the data
df = pd.read_csv("data.csv")

# Drop rows with missing values
df = df.dropna()

# Remove duplicate rows
df = df.drop_duplicates()

# Remove unnecessary columns
df = df.drop(columns=["col1", "col2"])

# Standardize a numerical column (zero mean, unit standard deviation)
df["col3"] = (df["col3"] - df["col3"].mean()) / df["col3"].std()

# One-hot encode a categorical column; get_dummies returns one column
# per category, so the result replaces "col4" rather than being
# assigned back to it
df = pd.get_dummies(df, columns=["col4"])

# Save the cleaned data
df.to_csv("cleaned_data.csv", index=False)

This code has no explicit output statements, so it produces no output
when it is run. Instead, it modifies the data stored in the df DataFrame
and saves it to a new CSV file.
To see the cleaned data, print the df DataFrame or read the saved CSV
file. For example, add the following line at the end of the code:

print(df)
ADVANTAGES OF DATA CLEANING IN MACHINE
LEARNING:
Improved model performance: Data
cleaning helps improve the performance of
the ML model by removing errors,
inconsistencies, and irrelevant data, which
can help the model to better learn from the
data.
Increased accuracy: Data cleaning helps
ensure that the data is accurate, consistent,
and free of errors, which can help improve
the accuracy of the ML model.
Better representation of the data: Data
cleaning allows the data to be transformed
into a format that better represents the
underlying relationships and patterns in the
data, making it easier for the ML model to
learn from the data.
DISADVANTAGES OF DATA CLEANING IN MACHINE
LEARNING:
Time-consuming: Data cleaning
can be a time-consuming task,
especially for large and
complex datasets.
Error-prone: Data cleaning can
be error-prone, as it involves
transforming and cleaning the
data, which can result in the loss
of important information or the
introduction of new errors.
Limited understanding of the
data: Data cleaning can lead to a
limited understanding of the
data, as the transformed data
may not be representative of the
underlying relationships and
patterns in the data.
CONCLUSION:
• We have discussed four different steps in
data cleaning that make the data more reliable
and help produce good results. After properly
completing these steps, we’ll have
a robust dataset that avoids many of the most
common pitfalls. Data cleaning should not be
rushed, as it proves very beneficial in the
later stages of a project.
ML Data Cleaning: An Overview

  • 1. ML | OVERVIEW OF DATA CLEANING Dr. Sheetal Dhande-Dandge Professor CSE SIPNA COET
  • 2. ML | OVERVIEW OF DATA CLEANING Data cleaning is one of the important parts of machine learning. It plays a significant part in building a model. It surely isn’t the fanciest part of machine learning and at the same time, there aren’t any hidden tricks or secrets to uncover. However, the success or failure of a project relies on proper data cleaning. Professional data scientists usually invest a very large portion of their time in this step because of the belief that “Better data beats fancier algorithms”. If we have a well-cleaned dataset, there are chances that we can get achieve good results with simple algorithms also, which can prove very beneficial at times especially in terms of computation when the dataset size is large. Obviously, different types of data will require different types of cleaning. However, this systematic approach can always serve as a good starting point. Dr. sheetal Dhande-Dandge 2
  • 3. STEPS INVOLVED IN DATA CLEANING: Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves identifying and removing any missing, duplicate, or irrelevant data. The goal of data cleaning is to ensure that the data is accurate, consistent, and free of errors, as incorrect or inconsistent data can negatively impact the performance of the ML model. Dr. sheetal Dhande-Dandge 3
  • 4. The main steps involved in data cleaning are:
      • Handling missing data: This step involves identifying and handling missing data, which can be done by removing the affected records, imputing missing values with a suitable estimate, or using techniques such as multiple imputation.
      • Removing duplicates: This step involves identifying and removing any duplicate data, which can be done with data deduplication techniques.
      • Handling outliers: This step involves identifying and handling any outliers in the data, which can be done by removing the outliers or transforming the data to reduce their impact.
      • Correcting errors: This step involves identifying and correcting any errors in the data, which can be done with techniques such as data validation or data correction algorithms.
    Data cleaning is an iterative process: it may be necessary to repeat some of the steps several times to ensure that the data is accurate and consistent. The choice of data cleaning techniques depends on the specific requirements of the project, including the size and complexity of the data and the desired outcome.
  • 6. Removing unwanted observations means deleting duplicate, redundant, or irrelevant values from your dataset. Duplicate observations most frequently arise during data collection, and irrelevant observations are those that do not actually fit the specific problem you are trying to solve. Redundant observations can greatly reduce efficiency, because the repeated data may weight the results toward either the correct or the incorrect side, producing unfaithful results. Irrelevant observations are any type of data that is of no use to us and can be removed directly.
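The removal of duplicate and irrelevant observations described in slide 6 can be sketched with pandas. This is a minimal illustration, not the author's code: the table, its column names, and the choice of `internal_note` as the irrelevant column are all hypothetical.

```python
import pandas as pd

# Hypothetical order records; the second row repeats the first exactly.
df = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "city": ["Pune", "Pune", "Delhi", "Mumbai"],
    "amount": [250.0, 250.0, 120.0, 90.0],
    "internal_note": ["a", "a", "b", "c"],  # assumed irrelevant to the problem
})

# Count exact duplicate rows before dropping them, so the change is auditable.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each duplicated row.
deduped = df.drop_duplicates().reset_index(drop=True)

# Drop a column that has no bearing on the problem being solved.
cleaned = deduped.drop(columns=["internal_note"])

print(f"Removed {n_duplicates} duplicate row(s); {len(cleaned)} rows remain.")
```

Counting duplicates before dropping them is a small design choice that makes it easier to notice when a collection step starts producing far more repeats than expected.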
  • 7. Errors that arise during measurement, data transfer, or other similar situations are called structural errors. Structural errors include typos in feature names, the same attribute appearing under different names, mislabeled classes (separate classes that should really be the same), and inconsistent capitalization. For example, the model will treat "America" and "america" as different classes or values even though they represent the same value, or treat "red", "yellow", and "red-yellow" as different classes or attributes even though one class can be included in the other two. Structural errors like these make the model inefficient and give poor-quality results.
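One common way to repair the capitalization and labeling problems described in slide 7 is to normalize the text before modeling. The sketch below assumes a hypothetical country column; the sample values and the "usa" to "america" mapping are illustrative assumptions, not taken from the slides.

```python
import pandas as pd

# Hypothetical column with inconsistent capitalization, stray whitespace,
# and one alternative name for the same class.
countries = pd.Series(["America", "america", " AMERICA ", "USA", "India", "india"])

# Strip whitespace and lowercase so equivalent spellings collapse together.
normalized = countries.str.strip().str.lower()

# Map known alternative names onto one canonical label (assumed mapping).
canonical = normalized.replace({"usa": "america"})

n_before = countries.nunique()
n_after = canonical.nunique()
print(f"Distinct classes before: {n_before}, after: {n_after}")
```

Comparing the number of distinct values before and after normalization is a quick check that the cleanup merged the classes it was meant to merge and nothing more.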
  • 8. Outliers can cause problems with certain types of models. For example, linear regression models are less robust to outliers than decision tree models. In general, we should not remove outliers unless we have a legitimate reason to do so, such as suspicious measurements that are unlikely to be part of the real data. Removing them sometimes improves performance and sometimes does not.
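The slides do not prescribe a detection method for outliers. One widely used option, assumed here purely for illustration, is the interquartile range (IQR) rule with the conventional 1.5 multiplier. The sketch flags suspicious values rather than deleting them, in keeping with the advice in slide 8 to remove outliers only for a legitimate reason. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical measurements with one suspicious reading.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# IQR rule (an assumed choice of method): points far outside the middle
# 50% of the data are flagged for review.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag instead of dropping, so a person can decide whether removal is justified.
is_outlier = (values < lower) | (values > upper)

print(values[is_outlier].tolist())
```

A flagged value still needs a human decision: a reading of 95 may be a sensor fault, or it may be a rare but genuine observation that the model should see.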
  • 9. HANDLING MISSING DATA: Missing data is a deceptively tricky issue in machine learning. We cannot simply ignore or remove missing observations; they must be handled carefully because they can indicate something important. The two most common ways to deal with missing data are:
      • Dropping observations with missing values.
          • The fact that the value was missing may be informative in itself.
          • Plus, in the real world, you often need to make predictions on new data even if some of the features are missing!
      • Imputing the missing values from past observations.
          • Again, “missingness” is almost always informative in itself, and you should tell your algorithm if a value was missing.
          • Even if you build a model to impute your values, you’re not adding any real information. You’re just reinforcing the patterns already provided by other features.
    Missing data is like missing a puzzle piece. If you drop it, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s like trying to squeeze in a piece from somewhere else in the puzzle. Missing data is therefore informative and an indication of something important, and we must make our algorithm aware of it by flagging it. By flagging and filling, you essentially allow the algorithm to estimate the optimal constant for missingness, instead of just filling it in with the mean.
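The flag-and-fill technique described in slide 9 might look like the following in pandas. This is a sketch under stated assumptions: the `age` column and its values are hypothetical, and the fill value of 0 is an arbitrary placeholder, since the slides do not name a specific constant.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric feature with two missing entries.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, np.nan, 31.0]})

# Flag missingness first, so the information survives the fill.
df["age_missing"] = df["age"].isna().astype(int)

# Fill with a constant placeholder (0 is an assumption). The indicator
# column lets the model learn its own adjustment for missing rows.
df["age"] = df["age"].fillna(0)

print(df)
```

The order matters: the indicator has to be computed before the fill, because after `fillna` the missing rows can no longer be told apart from rows that genuinely contain the placeholder value.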
  • 10. SOME DATA CLEANSING TOOLS:
      • OpenRefine
      • Trifacta Wrangler
      • TIBCO Clarity
      • Cloudingo
      • IBM InfoSphere QualityStage
    Data cleaning is an important step in the machine learning process because it can have a significant impact on the quality and performance of a model. Data cleaning involves identifying and correcting or removing errors and inconsistencies in the data.
  • 11. Here is a simple example of data cleaning in Python:

        import pandas as pd

        # Load the data
        df = pd.read_csv("data.csv")

        # Drop rows with missing values
        df = df.dropna()

        # Remove duplicate rows
        df = df.drop_duplicates()

        # Remove unnecessary columns
        df = df.drop(columns=["col1", "col2"])

        # Normalize numerical columns (z-score: zero mean, unit standard deviation)
        df["col3"] = (df["col3"] - df["col3"].mean()) / df["col3"].std()

        # Encode categorical columns (one-hot: col4 is replaced by indicator columns)
        df = pd.get_dummies(df, columns=["col4"])

        # Save the cleaned data
        df.to_csv("cleaned_data.csv", index=False)

    This code has no explicit output statements, so it produces no output when run. Instead, it modifies the data stored in the df DataFrame and saves it to a new CSV file. To see the cleaned data, print the df DataFrame or read the saved CSV file, for example by adding print(df) as the last line of the code.
  • 12. ADVANTAGES OF DATA CLEANING IN MACHINE LEARNING:
      • Improved model performance: Data cleaning removes errors, inconsistencies, and irrelevant data, which helps the model learn better from the data.
      • Increased accuracy: Data cleaning helps ensure that the data is accurate, consistent, and free of errors, which can improve the accuracy of the ML model.
      • Better representation of the data: Data cleaning allows the data to be transformed into a format that better represents its underlying relationships and patterns, making it easier for the ML model to learn from the data.
  • 13. DISADVANTAGES OF DATA CLEANING IN MACHINE LEARNING:
      • Time-consuming: Data cleaning can be a time-consuming task, especially for large and complex datasets.
      • Error-prone: Data cleaning can be error-prone, as it involves transforming and cleaning the data, which can result in the loss of important information or the introduction of new errors.
      • Limited understanding of the data: Data cleaning can lead to a limited understanding of the data, as the transformed data may not be representative of the underlying relationships and patterns in the data.
  • 14. CONCLUSION: We have discussed four steps in data cleaning that make the data more reliable and help produce good results. After properly completing these steps, we will have a robust dataset that avoids many of the most common pitfalls. This step should not be rushed, because it proves very beneficial in the rest of the process.