Data Cleaning and
Preprocessing with
Pandas
Data Preprocessing
• Data preprocessing involves a broader set of
activities that prepare raw data for analysis. It
includes cleaning, but also encompasses tasks
such as feature scaling, handling categorical
variables, data transformation, and splitting
data into training and testing sets. The purpose
is to make the data more suitable for machine
learning algorithms and statistical analysis.
Data Cleaning
• Data cleaning, also known as data cleansing or
scrubbing, is the process of identifying and
correcting errors or inconsistencies in datasets.
It involves handling missing values, removing
duplicates, correcting inaccuracies, and
ensuring data consistency. The goal is to
improve the quality and reliability of the data,
making it suitable for analysis.
Importance of Data Quality:
1 Reliable Insights
High-quality data ensures that the insights and
conclusions drawn from the analysis are reliable.
Inaccuracies or inconsistencies in the data can lead to
incorrect interpretations.
2 Better Decision-Making
Organizations rely on data-driven decision-
making. Clean and high-quality data
provides a solid foundation for making
informed and effective decisions.
3 Trust in Analytics
Stakeholders and decision-makers must trust the data
used in analytics. Quality data instills confidence in
the results and recommendations generated by
analytical models.
4 Avoiding Bias
Biased or incomplete data can lead to
biased results. Data quality is crucial to
avoid reinforcing existing biases and to
ensure fairness in decision-making
processes.
Roles in Data Analytics and Machine Learning
1 Improved Model Performance
Clean and preprocessed data is essential for
training accurate machine learning models. It
helps models generalize well to new, unseen
data, improving their performance.
2 Feature Engineering
Data preprocessing includes feature scaling,
handling categorical variables, and transforming
data. These activities contribute to creating
meaningful features, enhancing the model's ability
to capture patterns.
3 Efficient Analysis
Clean data accelerates the analysis process.
Analysts and data scientists can focus on
extracting insights rather than dealing with
data inconsistencies.
4 Enhanced Interpretability
Well-preprocessed data leads to models that are
easier to interpret. This is crucial for understanding
the factors influencing predictions or outcomes.
Key Concepts in Artificial Intelligence
Machine Perception
📷
AI systems can perceive
and interpret the world
through computer vision,
speech recognition, and
natural language
processing.
Knowledge
Representation and
Reasoning 📚
AI uses techniques to
represent and store
knowledge and apply logical
reasoning to solve complex
problems.
Planning and
Decision Making 🧭
AI systems can plan
sequences of actions and
make optimal decisions by
considering various factors
and constraints.
Applications of Machine Learning
Recommendation
Systems
ML algorithms personalize
recommendations on platforms
like Netflix and Amazon.
Speech Recognition
ML techniques transcribe
speech into text and power
voice-controlled systems.
Image Classification
ML models identify objects,
scenes, and people in images
for various applications.
Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
Remove Rows
new_df = df.dropna()
print(new_df.to_string())
Remove all Rows with NULL Values
df.dropna(inplace = True)
Replace NULL Values with 200
df.fillna(200, inplace = True)
Replace NULL Values in a Specific Column
df["col_name"] = df["col_name"].fillna(130)
Replace Using Mean, Median, and Mode
df["col_name"] = df["col_name"].fillna(df["col_name"].mean())
df["col_name"] = df["col_name"].fillna(df["col_name"].median())
df["col_name"] = df["col_name"].fillna(df["col_name"].mode()[0])
Convert into a Correct Format
df['Date'] = pd.to_datetime(df['Date'])
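The snippets above can be combined into one runnable example. The column names and fill values below are illustrative, not from a real dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "Duration": [60, 45, np.nan, 45],
    "Calories": [409.1, np.nan, 340.0, 282.4],
})

# Fill missing Calories with the column mean, missing Duration with a fixed value
df["Calories"] = df["Calories"].fillna(df["Calories"].mean())
df["Duration"] = df["Duration"].fillna(50)

print(df.isna().sum().sum())  # 0 — no missing values remain
```

Assigning the result back (`df["col"] = df["col"].fillna(...)`) is preferred over calling `fillna(inplace=True)` on a column selection, which can silently operate on a copy.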
Discovering Duplicates
print(df.duplicated())
Removing Duplicates
df.drop_duplicates(inplace = True)
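A minimal sketch of both duplicate-handling calls on a made-up frame:

```python
import pandas as pd

# Hypothetical frame where the third row repeats the first
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "score": [10, 20, 10]})

# duplicated() flags repeat rows (True only for the second "Ann" row)
print(df.duplicated())

# drop_duplicates() keeps the first occurrence of each row
df.drop_duplicates(inplace=True)
print(len(df))  # 2
```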
Dealing with Outliers
Outliers are data points that significantly deviate from the rest of the dataset.
Dealing with outliers involves identifying them, understanding their impact,
and deciding whether to remove or transform them.
Identifying Outliers using
Descriptive Statistics
Descriptive statistics, such as mean, median, and standard deviation, can be used to
identify outliers. Data points that fall far from the mean or median may be considered
outliers.
import pandas as pd
# Creating a DataFrame with outliers
data = {'Values': [1, 2, 3, 20, 25, 30, 35, 40]}
df = pd.DataFrame(data)
# Calculate mean and standard deviation
mean_val = df['Values'].mean()
std_dev = df['Values'].std()
# Identify outliers based on z-scores (points beyond 2 standard deviations)
outliers = df[(df['Values'] < mean_val - 2 * std_dev) | (df['Values'] > mean_val + 2 * std_dev)]
print("Original DataFrame:")
print(df)
print("\nOutliers identified using descriptive statistics:")
print(outliers)
Handling Outliers
(Removing or
Transforming)
Handling outliers involves deciding whether to remove them or transform
them to mitigate their impact on analysis or modeling.
import numpy as np
# Remove outliers using z-scores
df_no_outliers = df[(df['Values'] >= mean_val - 2 * std_dev) & (df['Values'] <= mean_val + 2 * std_dev)]
print("DataFrame after removing outliers:")
print(df_no_outliers)
# Log transformation to handle positively skewed data
df['Values_log'] = df['Values'].apply(lambda x: 0 if x == 0 else np.log(x))
print("DataFrame after log transformation:")
print(df)
Data Preprocessing
Data preprocessing is a crucial step in data preparation that involves cleaning
and transforming raw data into a format suitable for analysis, modeling, or
machine learning. The purpose of data preprocessing is to enhance data
quality, handle inconsistencies, and create a structured dataset that facilitates
accurate and meaningful analysis.
Purpose of Data Preprocessing
Improving Data Quality
Identify and correct errors, inaccuracies, or
inconsistencies in the dataset.
Enhancing Data Usability
Transform raw data into a format suitable for
analysis, modeling, or machine learning.
Reducing Bias
Handle biases in the data to ensure fair and
unbiased results.
Facilitating Feature Extraction
Prepare data for extracting meaningful
features that contribute to model
performance.
Key Steps in Data Preprocessing
Handling Missing Values:
Identify and handle missing values using
methods like imputation or removal.
Standardizing Formatting:
Standardize data formats, such as date
formats or units, to ensure consistency.
Data Transformation:
Apply transformations such as log
transformations or feature engineering
to create informative features.
Handling Duplicate Records:
Identify and remove duplicate records to
ensure each observation is unique.
Feature Scaling:
Scale numeric features to a standard
range to avoid dominance of certain
features in modeling.
Data Sampling:
If needed, perform data sampling
techniques like random sampling or
stratified sampling.
Handling Outliers:
Detect and handle outliers to prevent them
from unduly influencing analysis or
modeling.
Handling Categorical Data:
Encode categorical variables using
techniques like one-hot encoding or
label encoding.
Data Splitting:
Split the dataset into training and testing
sets for model evaluation.
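Several of the steps above can be chained into a single short pipeline. The data and column names here are hypothetical, and only a subset of the steps is shown:

```python
import pandas as pd
import numpy as np

# Hypothetical raw data: a missing age, a duplicate row, a skewed income column
raw = pd.DataFrame({
    "age": [25, np.nan, 40, 40],
    "city": ["Pune", "Delhi", "Pune", "Pune"],
    "income": [30000, 45000, 120000, 120000],
})

clean = (
    raw.drop_duplicates()  # duplicate records
       .assign(age=lambda d: d["age"].fillna(d["age"].median()))  # missing values
)
clean["city"] = clean["city"].astype("category")  # categorical handling
clean["income_log"] = np.log(clean["income"])     # data transformation
# Min-Max scale the age column (feature scaling)
clean["age_scaled"] = (clean["age"] - clean["age"].min()) / (clean["age"].max() - clean["age"].min())
print(clean)
```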
Feature Scaling
Feature scaling is a preprocessing technique used to standardize the
range of independent variables or features of a dataset. It ensures that
no single feature dominates the others, making the dataset more
amenable to machine learning algorithms that are sensitive to the scale
of input features.
Normalization vs. Standardization:
Normalization
Normalization (Min-Max scaling) scales the
values of features between 0 and 1. It
transforms the data into a specific range but
does not handle outliers well.
Standardization
Standardization (Z-score normalization)
transforms the data to have a mean of 0 and
a standard deviation of 1. It is more robust
to outliers compared to normalization.
Scaling Numeric Features using Pandas:
import pandas as pd
# Creating a DataFrame with numeric features
data = {'Feature1': [10, 20, 15, 25], 'Feature2': [500, 1000, 750, 1250]}
df = pd.DataFrame(data)
# Min-Max scaling using Pandas
df_normalized = (df - df.min()) / (df.max() - df.min())
print("Original DataFrame:")
print(df)
print("\nDataFrame after Min-Max scaling (Normalization):")
print(df_normalized)
Min-Max Scaling (Normalization)
Scaling Numeric Features using Pandas:
import pandas as pd
# Creating a DataFrame with numeric features
data = {'Feature1': [10, 20, 15, 25], 'Feature2': [500, 1000, 750, 1250]}
df = pd.DataFrame(data)
# Z-score normalization using Pandas
df_standardized = (df - df.mean()) / df.std()
print("Original DataFrame:")
print(df)
print("\nDataFrame after Z-score normalization (Standardization):")
print(df_standardized)
Z-score Normalization (Standardization)
Handling Categorical
Data
Categorical data represents variables that can take on a limited and
usually fixed number of values, often representing categories. Handling
categorical data involves encoding these variables into a format suitable
for machine learning models.
Encoding Categorical
Variables:
One-Hot Encoding:
One-hot encoding is a technique that converts
categorical variables into a binary matrix. Each
category becomes a separate column, and a binary
value indicates the presence or absence of that
category.
Pandas' get_dummies() function:
get_dummies() is a Pandas function that performs one-hot
encoding on categorical variables, creating dummy/indicator
variables for each category.
Label Encoding:
Label encoding assigns a unique numerical label to
each category in a categorical variable. It is suitable
when there is an ordinal relationship between
categories.
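Both encodings can be sketched in a few lines. The `size` column and the ordinal mapping below are illustrative assumptions:

```python
import pandas as pd

# Hypothetical ordinal categorical column
df = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# One-hot encoding with get_dummies(): one indicator column per category
one_hot = pd.get_dummies(df["size"], prefix="size")
print(one_hot.columns.tolist())  # ['size_L', 'size_M', 'size_S']

# Label encoding: map each category to a number (order chosen by hand here)
order = {"S": 0, "M": 1, "L": 2}
df["size_label"] = df["size"].map(order)
print(df["size_label"].tolist())  # [0, 1, 2, 1]
```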
Data Transformation
Log Transformation:
Log transformation involves applying the natural
logarithm to the values of a variable. It is useful for
handling positively skewed data and reducing the
impact of outliers.
Pandas' apply() function:
apply() is a Pandas function that applies a function along the axis
of a DataFrame. It can be used for custom transformations on
data.
Handling Skewed Data:
Skewed data refers to a distribution that is not
symmetrical. Handling skewed data involves
transforming it to achieve a more normal
distribution.
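A short sketch of the log transformation on made-up skewed data. `np.log1p` (log of 1 + x) is used here because it is safe at zero; the same result can be obtained via `apply()` with a custom function:

```python
import pandas as pd
import numpy as np

# Hypothetical positively skewed values (one large outlier)
df = pd.DataFrame({"income": [20000, 25000, 30000, 500000]})

# Vectorized log transform
df["income_log"] = np.log1p(df["income"])

# Equivalent element-wise version using apply()
df["income_log2"] = df["income"].apply(lambda x: np.log1p(x))

print(df["income_log"].equals(df["income_log2"]))  # True
```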
Data Sampling
Importance of Sampling
Data sampling is the process of selecting a subset
of data from a larger dataset. It is crucial for tasks
like model training and evaluation, especially when
dealing with large datasets.
Pandas' sample() method
sample() is a Pandas method that is used to
randomly select a specified number of rows or a
fraction of rows from a DataFrame.
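A minimal example of `sample()`, using a throwaway DataFrame; `random_state` makes the random draw reproducible:

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# Randomly select a fixed number of rows
subset = df.sample(n=10, random_state=42)

# Or a fraction of the rows
half = df.sample(frac=0.5, random_state=42)

print(len(subset), len(half))  # 10 50
```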
Data Splitting (Training and Testing Data)
Data splitting involves dividing a dataset into two parts: a training set used to train a machine learning model and a
testing set used to evaluate the model's performance on unseen data.
Using Pandas for Data Splitting
import pandas as pd
from sklearn.model_selection import train_test_split
# Creating a DataFrame
data = {'Feature1': [1, 2, 3, 4, 5], 'Target': [0, 1, 0, 1, 0]}
df = pd.DataFrame(data)
# Splitting data into features (X) and target variable (y)
X = df[['Feature1']]
y = df['Target']
# Using train_test_split for data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training Data:")
print(X_train, y_train)
print("\nTesting Data:")
print(X_test, y_test)
