SlideShare a Scribd company logo
Data Science
Data Preprocessing
(Data Cleaning)
Data Preprocessing
Data Preprocessing can be defined as a process of converting raw data
into a format that is understandable and usable for further analysis. It
is an important step in the Data Preparation stage. It ensures that the
outcome of the analysis is accurate, complete, and consistent.
Data preprocessing refers to the cleaning, transforming and integrating
of data to make it ready for analysis.
Data
Preprocessing
Data
Integration
Data
Transforma
tion
Data
Reduction
or
dimension
reduction
Data
Cleaning
Scaling, Normalization,
Categorical Encoding
Handling missing
values Outliers,
duplicates
Selecting relevant
features/Column, Data
Combining multiple
datasets/ Merging
Data Cleaning
Data cleaning involves the systematic identification and correction of errors,
inconsistencies, and inaccuracies within a dataset, encompassing tasks such
as handling missing values, removing duplicates, and addressing outliers.
Data cleaning is essential because raw data is often noisy, incomplete, and
inconsistent, which can negatively impact the accuracy and reliability of the
insights derived from it.
Handling Missing Values
Missing values can be handled by many techniques, such as removing
rows/columns containing NULL values and imputing NULL values using mean,
mode etc.
There are several useful functions for detecting, removing, and replacing null
values in Pandas DataFrame :
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• Interpolate()
Both function help in checking
whether a value is NaN or not
To fill null values in a datasets, we use fillna(), replace() and interpolate() function
these function replace NaN values with some value of their own.
Interpolate() function is basically used to fill NA values in the dataframe but it
uses various interpolation technique to fill the missing values rather than hard-
coding the value.
Outliers
Outliers are data points that are stand out from the rest. It can be either
much higher or much lower than the other data points, and its presence can
have a significant impact on the results of machine learning algorithms.
They are unusual values that don’t follow the overall pattern of our data.
They represent errors in measurement, bad data collection, or simply show
variables not considered when collecting the data.
For example, in a group of 5 students the test grades were 29, 38, 39, 27, and
2. The last value seems to be an outlier because it falls below the main pattern of
the other grades.
27 - mean (with outlier)
33.25 - mean (without outlier)
Handling Outliers
1. Removal:
This involves identifying and removing outliers from the dataset before
training the model. Common methods include:
• Thresholding: Outliers are identified as data points exceeding a certain
threshold (e.g., Z-score > 3).
• Distance-based methods: Outliers are identified based on their
distance from their nearest neighbors.
• Clustering: Outliers are identified as points not belonging to any
cluster or belonging to very small clusters.
Handling Outliers
2. Transformation
This involves transforming the data to reduce the influence of
outliers. Common methods include:
• Scaling: Standardizing or normalizing the data to have a mean of zero
and a standard deviation of one.
• Winsorization: Replacing outlier values with the nearest non-outlier
value.
• Log transformation: Applying a logarithmic transformation to
compress the data and reduce the impact of extreme values.
Handling Outliers
3. Robust Estimation
This involves using algorithms that are less sensitive to outliers. Some
examples include:
• Robust Statistics: Choose statistical methods less influenced by
outliers like median, mode and interquartile range instead of mean and
standard deviation.
• Outlier-insensitive clustering algorithms: Algorithms like DBSCAN
are less susceptible to the presence of outliers than K-means
clustering.
Handling Outliers
4. Modeling Outliers
This involves explicitly modeling the outliers as a separate group. This can
be done by:
• Adding a separate feature: Create a new feature indicating whether a
data point is an outlier or not.
• Using a mixture model: Train a model that assumes the data comes
from a mixture of multiple distributions, where one distribution
represents the outliers.
Handling Outliers
5. Visualization for Outlier Detection
We can use the box plot, or the box and whisker plot, to explore the dataset
and visualize the presence of outliers. The points that lie beyond the
whiskers are detected as outliers.
6. Impute Outliers
For outliers caused by missing or erroneous values, we can estimate
replacements using the mean, median, mode or more advanced imputation
methods.
Duplicates
Data points in a dataset that have the same values for all, or part of the
characteristics are said to have duplicate values. Due to problems with
data input, data collecting, or other circumstances, duplicate values
may appear.
These duplicate values add redundancy to our data and can make our
calculations go wrong.
Handling Duplicates
Duplicates can be handled in a variety of ways
The simplest and most straightforward way to handle duplicate data is to
delete it. This can reduce the noise and redundancy in our data, as well as
improve the efficiency and accuracy of our models.
For example, we can use the df.drop_duplicates() method in pandas to
remove duplicate rows or columns, specifying the subset, keep, and inplace
arguments.
In SQL we can use DISTINCT keyword in SELECT statement.
Thanks for Watching!

More Related Content

Similar to Data Science- Data Preprocessing, Data Cleaning.

Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
Yugal Kumar
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Quality
priyanka rajput
 
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdfExploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
JamieDornan2
 
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdfExploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
StephenAmell4
 
Pandas Data Cleaning and Preprocessing PPT.pptx
Pandas Data Cleaning and Preprocessing PPT.pptxPandas Data Cleaning and Preprocessing PPT.pptx
Pandas Data Cleaning and Preprocessing PPT.pptx
bajajrishabh96tech
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data Preparation
Umair Shafique
 
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdfExploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
JamieDornan2
 
Data pre processing
Data pre processingData pre processing
Data pre processing
kalavathisugan
 
Data Reduction
Data ReductionData Reduction
Data Reduction
Rajan Shah
 
Machine Learning Approaches and its Challenges
Machine Learning Approaches and its ChallengesMachine Learning Approaches and its Challenges
Machine Learning Approaches and its Challenges
ijcnes
 
DM_Notes.pptx
DM_Notes.pptxDM_Notes.pptx
DM_Notes.pptx
Workingad
 
Unit 3 Data Quality and Preprocessing .pptx
Unit 3 Data Quality and Preprocessing .pptxUnit 3 Data Quality and Preprocessing .pptx
Unit 3 Data Quality and Preprocessing .pptx
vipulkondekar
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
Knoldus Inc.
 
DataPreprocessing.pptx
DataPreprocessing.pptxDataPreprocessing.pptx
DataPreprocessing.pptx
Dr-Dipali Meher
 
EXPLORATORY DATA ANALYSIS IN STATISTICAL MODeLING.pptx
EXPLORATORY DATA ANALYSIS IN STATISTICAL MODeLING.pptxEXPLORATORY DATA ANALYSIS IN STATISTICAL MODeLING.pptx
EXPLORATORY DATA ANALYSIS IN STATISTICAL MODeLING.pptx
rakeshreghu98
 
Exploratory Data Analysis Unit 1 ppt presentation.pptx
Exploratory Data Analysis Unit 1 ppt presentation.pptxExploratory Data Analysis Unit 1 ppt presentation.pptx
Exploratory Data Analysis Unit 1 ppt presentation.pptx
Mayura shelke
 
ml.pdf by Tee
ml.pdf by Teeml.pdf by Tee
ml.pdf by Tee
mweweblogger
 
Data Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionData Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model Selection
Derek Kane
 
Data Cleaning Best Practices.pdf
Data Cleaning Best Practices.pdfData Cleaning Best Practices.pdf
Data Cleaning Best Practices.pdf
Uncodemy
 

Similar to Data Science- Data Preprocessing, Data Cleaning. (20)

Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Quality
 
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdfExploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
 
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdfExploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
 
Pandas Data Cleaning and Preprocessing PPT.pptx
Pandas Data Cleaning and Preprocessing PPT.pptxPandas Data Cleaning and Preprocessing PPT.pptx
Pandas Data Cleaning and Preprocessing PPT.pptx
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data Preparation
 
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdfExploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Data Reduction
Data ReductionData Reduction
Data Reduction
 
Machine Learning Approaches and its Challenges
Machine Learning Approaches and its ChallengesMachine Learning Approaches and its Challenges
Machine Learning Approaches and its Challenges
 
DM_Notes.pptx
DM_Notes.pptxDM_Notes.pptx
DM_Notes.pptx
 
Unit 3 Data Quality and Preprocessing .pptx
Unit 3 Data Quality and Preprocessing .pptxUnit 3 Data Quality and Preprocessing .pptx
Unit 3 Data Quality and Preprocessing .pptx
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
 
DataPreprocessing.pptx
DataPreprocessing.pptxDataPreprocessing.pptx
DataPreprocessing.pptx
 
EXPLORATORY DATA ANALYSIS IN STATISTICAL MODeLING.pptx
EXPLORATORY DATA ANALYSIS IN STATISTICAL MODeLING.pptxEXPLORATORY DATA ANALYSIS IN STATISTICAL MODeLING.pptx
EXPLORATORY DATA ANALYSIS IN STATISTICAL MODeLING.pptx
 
Exploratory Data Analysis Unit 1 ppt presentation.pptx
Exploratory Data Analysis Unit 1 ppt presentation.pptxExploratory Data Analysis Unit 1 ppt presentation.pptx
Exploratory Data Analysis Unit 1 ppt presentation.pptx
 
ml.pdf by Tee
ml.pdf by Teeml.pdf by Tee
ml.pdf by Tee
 
Assignmentdatamining
AssignmentdataminingAssignmentdatamining
Assignmentdatamining
 
Data Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionData Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model Selection
 
Data Cleaning Best Practices.pdf
Data Cleaning Best Practices.pdfData Cleaning Best Practices.pdf
Data Cleaning Best Practices.pdf
 

More from Megha Sharma

Data Management Activities, Extraction, Transformation and Loading (ETL)
Data Management Activities, Extraction, Transformation and Loading (ETL)Data Management Activities, Extraction, Transformation and Loading (ETL)
Data Management Activities, Extraction, Transformation and Loading (ETL)
Megha Sharma
 
Descriptive Statistics: Mean, Median Mode and Standard Deviation.
Descriptive Statistics: Mean, Median Mode and Standard Deviation.Descriptive Statistics: Mean, Median Mode and Standard Deviation.
Descriptive Statistics: Mean, Median Mode and Standard Deviation.
Megha Sharma
 
Model Evaluation Matrix: Confusion Matrix, F1 Score, ROC curve AUC
Model Evaluation Matrix: Confusion Matrix, F1 Score, ROC curve AUCModel Evaluation Matrix: Confusion Matrix, F1 Score, ROC curve AUC
Model Evaluation Matrix: Confusion Matrix, F1 Score, ROC curve AUC
Megha Sharma
 
Model Evaluation Matrix: Accuracy, precision and recall
Model Evaluation Matrix: Accuracy, precision and recallModel Evaluation Matrix: Accuracy, precision and recall
Model Evaluation Matrix: Accuracy, precision and recall
Megha Sharma
 
Visualization Techniques- Box plot, Line Chart, Scatter plot, Bar chart.
Visualization Techniques- Box plot, Line Chart, Scatter plot, Bar chart.Visualization Techniques- Box plot, Line Chart, Scatter plot, Bar chart.
Visualization Techniques- Box plot, Line Chart, Scatter plot, Bar chart.
Megha Sharma
 
Visualization Techniques ,Exploratory Data Analysis(EDA), Histogram
Visualization Techniques ,Exploratory Data Analysis(EDA), HistogramVisualization Techniques ,Exploratory Data Analysis(EDA), Histogram
Visualization Techniques ,Exploratory Data Analysis(EDA), Histogram
Megha Sharma
 
Data Preprocessing- Data transformation, Scaling, Normalization, Standardiza...
Data Preprocessing- Data transformation,  Scaling, Normalization, Standardiza...Data Preprocessing- Data transformation,  Scaling, Normalization, Standardiza...
Data Preprocessing- Data transformation, Scaling, Normalization, Standardiza...
Megha Sharma
 
Data Preprocessing- Feature Selection and Merging.
Data Preprocessing- Feature Selection and Merging.Data Preprocessing- Feature Selection and Merging.
Data Preprocessing- Feature Selection and Merging.
Megha Sharma
 
Different types of data. Qualitative, Quantitative, Ordinal, Nominal, Discret...
Different types of data. Qualitative, Quantitative, Ordinal, Nominal, Discret...Different types of data. Qualitative, Quantitative, Ordinal, Nominal, Discret...
Different types of data. Qualitative, Quantitative, Ordinal, Nominal, Discret...
Megha Sharma
 
Data Science comparison with AI, ML, BI, and data warehousing, data mining.
Data Science comparison with AI, ML, BI, and data warehousing, data mining.Data Science comparison with AI, ML, BI, and data warehousing, data mining.
Data Science comparison with AI, ML, BI, and data warehousing, data mining.
Megha Sharma
 
Data Science Introduction, Application of Data Science.
Data Science Introduction, Application of Data Science.Data Science Introduction, Application of Data Science.
Data Science Introduction, Application of Data Science.
Megha Sharma
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
Megha Sharma
 
Association Rule mining
Association Rule miningAssociation Rule mining
Association Rule mining
Megha Sharma
 
Bellman's equation Reinforcement learning - II
Bellman's equation Reinforcement learning - IIBellman's equation Reinforcement learning - II
Bellman's equation Reinforcement learning - II
Megha Sharma
 
Reinforcement learning in Machine learning
 Reinforcement learning in Machine learning Reinforcement learning in Machine learning
Reinforcement learning in Machine learning
Megha Sharma
 
E-M Algorithm
E-M AlgorithmE-M Algorithm
E-M Algorithm
Megha Sharma
 
Entropy and information gain in decision tree.
Entropy and information gain in decision tree.Entropy and information gain in decision tree.
Entropy and information gain in decision tree.
Megha Sharma
 
Types of Machine Learning. & Decision Tree.
Types of Machine Learning. & Decision Tree.Types of Machine Learning. & Decision Tree.
Types of Machine Learning. & Decision Tree.
Megha Sharma
 
If statements in C
If statements in CIf statements in C
If statements in C
Megha Sharma
 
Conditional and special operators
Conditional and special operatorsConditional and special operators
Conditional and special operators
Megha Sharma
 

More from Megha Sharma (20)

Data Management Activities, Extraction, Transformation and Loading (ETL)
Data Management Activities, Extraction, Transformation and Loading (ETL)Data Management Activities, Extraction, Transformation and Loading (ETL)
Data Management Activities, Extraction, Transformation and Loading (ETL)
 
Descriptive Statistics: Mean, Median Mode and Standard Deviation.
Descriptive Statistics: Mean, Median Mode and Standard Deviation.Descriptive Statistics: Mean, Median Mode and Standard Deviation.
Descriptive Statistics: Mean, Median Mode and Standard Deviation.
 
Model Evaluation Matrix: Confusion Matrix, F1 Score, ROC curve AUC
Model Evaluation Matrix: Confusion Matrix, F1 Score, ROC curve AUCModel Evaluation Matrix: Confusion Matrix, F1 Score, ROC curve AUC
Model Evaluation Matrix: Confusion Matrix, F1 Score, ROC curve AUC
 
Model Evaluation Matrix: Accuracy, precision and recall
Model Evaluation Matrix: Accuracy, precision and recallModel Evaluation Matrix: Accuracy, precision and recall
Model Evaluation Matrix: Accuracy, precision and recall
 
Visualization Techniques- Box plot, Line Chart, Scatter plot, Bar chart.
Visualization Techniques- Box plot, Line Chart, Scatter plot, Bar chart.Visualization Techniques- Box plot, Line Chart, Scatter plot, Bar chart.
Visualization Techniques- Box plot, Line Chart, Scatter plot, Bar chart.
 
Visualization Techniques ,Exploratory Data Analysis(EDA), Histogram
Visualization Techniques ,Exploratory Data Analysis(EDA), HistogramVisualization Techniques ,Exploratory Data Analysis(EDA), Histogram
Visualization Techniques ,Exploratory Data Analysis(EDA), Histogram
 
Data Preprocessing- Data transformation, Scaling, Normalization, Standardiza...
Data Preprocessing- Data transformation,  Scaling, Normalization, Standardiza...Data Preprocessing- Data transformation,  Scaling, Normalization, Standardiza...
Data Preprocessing- Data transformation, Scaling, Normalization, Standardiza...
 
Data Preprocessing- Feature Selection and Merging.
Data Preprocessing- Feature Selection and Merging.Data Preprocessing- Feature Selection and Merging.
Data Preprocessing- Feature Selection and Merging.
 
Different types of data. Qualitative, Quantitative, Ordinal, Nominal, Discret...
Different types of data. Qualitative, Quantitative, Ordinal, Nominal, Discret...Different types of data. Qualitative, Quantitative, Ordinal, Nominal, Discret...
Different types of data. Qualitative, Quantitative, Ordinal, Nominal, Discret...
 
Data Science comparison with AI, ML, BI, and data warehousing, data mining.
Data Science comparison with AI, ML, BI, and data warehousing, data mining.Data Science comparison with AI, ML, BI, and data warehousing, data mining.
Data Science comparison with AI, ML, BI, and data warehousing, data mining.
 
Data Science Introduction, Application of Data Science.
Data Science Introduction, Application of Data Science.Data Science Introduction, Application of Data Science.
Data Science Introduction, Application of Data Science.
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
 
Association Rule mining
Association Rule miningAssociation Rule mining
Association Rule mining
 
Bellman's equation Reinforcement learning - II
Bellman's equation Reinforcement learning - IIBellman's equation Reinforcement learning - II
Bellman's equation Reinforcement learning - II
 
Reinforcement learning in Machine learning
 Reinforcement learning in Machine learning Reinforcement learning in Machine learning
Reinforcement learning in Machine learning
 
E-M Algorithm
E-M AlgorithmE-M Algorithm
E-M Algorithm
 
Entropy and information gain in decision tree.
Entropy and information gain in decision tree.Entropy and information gain in decision tree.
Entropy and information gain in decision tree.
 
Types of Machine Learning. & Decision Tree.
Types of Machine Learning. & Decision Tree.Types of Machine Learning. & Decision Tree.
Types of Machine Learning. & Decision Tree.
 
If statements in C
If statements in CIf statements in C
If statements in C
 
Conditional and special operators
Conditional and special operatorsConditional and special operators
Conditional and special operators
 

Recently uploaded

B.ed spl. HI pdusu exam paper-2023-24.pdf
B.ed spl. HI pdusu exam paper-2023-24.pdfB.ed spl. HI pdusu exam paper-2023-24.pdf
B.ed spl. HI pdusu exam paper-2023-24.pdf
Special education needs
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
Delapenabediema
 
Solid waste management & Types of Basic civil Engineering notes by DJ Sir.pptx
Solid waste management & Types of Basic civil Engineering notes by DJ Sir.pptxSolid waste management & Types of Basic civil Engineering notes by DJ Sir.pptx
Solid waste management & Types of Basic civil Engineering notes by DJ Sir.pptx
Denish Jangid
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
PedroFerreira53928
 
Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)
rosedainty
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
Peter Windle
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
Nguyen Thanh Tu Collection
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Pavel ( NSTU)
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
Fish and Chips - have they had their chips
Fish and Chips - have they had their chipsFish and Chips - have they had their chips
Fish and Chips - have they had their chips
GeoBlogs
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
Celine George
 
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptxJose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
ricssacare
 
plant breeding methods in asexually or clonally propagated crops
plant breeding methods in asexually or clonally propagated cropsplant breeding methods in asexually or clonally propagated crops
plant breeding methods in asexually or clonally propagated crops
parmarsneha2
 
NLC-2024-Orientation-for-RO-SDO (1).pptx
NLC-2024-Orientation-for-RO-SDO (1).pptxNLC-2024-Orientation-for-RO-SDO (1).pptx
NLC-2024-Orientation-for-RO-SDO (1).pptx
ssuserbdd3e8
 

Recently uploaded (20)

B.ed spl. HI pdusu exam paper-2023-24.pdf
B.ed spl. HI pdusu exam paper-2023-24.pdfB.ed spl. HI pdusu exam paper-2023-24.pdf
B.ed spl. HI pdusu exam paper-2023-24.pdf
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
Solid waste management & Types of Basic civil Engineering notes by DJ Sir.pptx
Solid waste management & Types of Basic civil Engineering notes by DJ Sir.pptxSolid waste management & Types of Basic civil Engineering notes by DJ Sir.pptx
Solid waste management & Types of Basic civil Engineering notes by DJ Sir.pptx
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
 
Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
Fish and Chips - have they had their chips
Fish and Chips - have they had their chipsFish and Chips - have they had their chips
Fish and Chips - have they had their chips
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
 
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptxJose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
 
plant breeding methods in asexually or clonally propagated crops
plant breeding methods in asexually or clonally propagated cropsplant breeding methods in asexually or clonally propagated crops
plant breeding methods in asexually or clonally propagated crops
 
NLC-2024-Orientation-for-RO-SDO (1).pptx
NLC-2024-Orientation-for-RO-SDO (1).pptxNLC-2024-Orientation-for-RO-SDO (1).pptx
NLC-2024-Orientation-for-RO-SDO (1).pptx
 

Data Science- Data Preprocessing, Data Cleaning.

  • 2. Data Preprocessing Data Preprocessing can be defined as a process of converting raw data into a format that is understandable and usable for further analysis. It is an important step in the Data Preparation stage. It ensures that the outcome of the analysis is accurate, complete, and consistent. Data preprocessing refers to the cleaning, transforming and integrating of data to make it ready for analysis.
  • 3. Data Preprocessing Data Integration Data Transforma tion Data Reduction or dimension reduction Data Cleaning Scaling, Normalization, Categorical Encoding Handling missing values Outliers, duplicates Selecting relevant features/Column, Data Combining multiple datasets/ Merging
  • 4. Data Cleaning Data cleaning involves the systematic identification and correction of errors, inconsistencies, and inaccuracies within a dataset, encompassing tasks such as handling missing values, removing duplicates, and addressing outliers. Data cleaning is essential because raw data is often noisy, incomplete, and inconsistent, which can negatively impact the accuracy and reliability of the insights derived from it.
  • 5. Handling Missing Values Missing values can be handled by many techniques, such as removing rows/columns containing NULL values and imputing NULL values using mean, mode etc. There are several useful functions for detecting, removing, and replacing null values in Pandas DataFrame : • isnull() • notnull() • dropna() • fillna() • replace() • Interpolate() Both function help in checking whether a value is NaN or not To fill null values in a datasets, we use fillna(), replace() and interpolate() function these function replace NaN values with some value of their own. Interpolate() function is basically used to fill NA values in the dataframe but it uses various interpolation technique to fill the missing values rather than hard- coding the value.
  • 6. Outliers Outliers are data points that are stand out from the rest. It can be either much higher or much lower than the other data points, and its presence can have a significant impact on the results of machine learning algorithms. They are unusual values that don’t follow the overall pattern of our data. They represent errors in measurement, bad data collection, or simply show variables not considered when collecting the data. For example, in a group of 5 students the test grades were 29, 38, 39, 27, and 2. The last value seems to be an outlier because it falls below the main pattern of the other grades. 27 - mean (with outlier) 33.25 - mean (without outlier)
  • 7. Handling Outliers 1. Removal: This involves identifying and removing outliers from the dataset before training the model. Common methods include: • Thresholding: Outliers are identified as data points exceeding a certain threshold (e.g., Z-score > 3). • Distance-based methods: Outliers are identified based on their distance from their nearest neighbors. • Clustering: Outliers are identified as points not belonging to any cluster or belonging to very small clusters.
  • 8. Handling Outliers 2. Transformation This involves transforming the data to reduce the influence of outliers. Common methods include: • Scaling: Standardizing or normalizing the data to have a mean of zero and a standard deviation of one. • Winsorization: Replacing outlier values with the nearest non-outlier value. • Log transformation: Applying a logarithmic transformation to compress the data and reduce the impact of extreme values.
  • 9. Handling Outliers 3. Robust Estimation This involves using algorithms that are less sensitive to outliers. Some examples include: • Robust Statistics: Choose statistical methods less influenced by outliers like median, mode and interquartile range instead of mean and standard deviation. • Outlier-insensitive clustering algorithms: Algorithms like DBSCAN are less susceptible to the presence of outliers than K-means clustering.
  • 10. Handling Outliers 4. Modeling Outliers This involves explicitly modeling the outliers as a separate group. This can be done by: • Adding a separate feature: Create a new feature indicating whether a data point is an outlier or not. • Using a mixture model: Train a model that assumes the data comes from a mixture of multiple distributions, where one distribution represents the outliers.
  • 11. Handling Outliers 5. Visualization for Outlier Detection We can use the box plot, or the box and whisker plot, to explore the dataset and visualize the presence of outliers. The points that lie beyond the whiskers are detected as outliers. 6. Impute Outliers For outliers caused by missing or erroneous values, we can estimate replacements using the mean, median, mode or more advanced imputation methods.
  • 12. Duplicates Data points in a dataset that have the same values for all, or part of the characteristics are said to have duplicate values. Due to problems with data input, data collecting, or other circumstances, duplicate values may appear. These duplicate values add redundancy to our data and can make our calculations go wrong.
  • 13. Handling Duplicates Duplicates can be handled in a variety of ways The simplest and most straightforward way to handle duplicate data is to delete it. This can reduce the noise and redundancy in our data, as well as improve the efficiency and accuracy of our models. For example, we can use the df.drop_duplicates() method in pandas to remove duplicate rows or columns, specifying the subset, keep, and inplace arguments. In SQL we can use DISTINCT keyword in SELECT statement.