Data Wrangling
Week 9
Dr. Ferdin Joe John Joseph
Faculty of Information Technology
Thai – Nichi Institute of Technology, Bangkok
Today’s Lesson
• Introduction to Data Cleaning
• Fundamentals of Data Cleaning
Data Cleaning: Definition
• Data cleaning refers to all the tasks and activities used to detect
and repair errors in data.
Scenario
• Kaggle’s 2017 survey on the state of data science and machine
learning found that dirty data is the most common barrier faced by
people who work with data.
Overview
Data collection and acquisition often introduce errors in data, for
example,
• missing values,
• typos,
• mixed formats,
• replicated entries for the same real-world entity,
• outliers,
• violations of business rules
Missing values
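Before choosing how to handle missing values, it helps to see where they are. The following is a minimal sketch with pandas; the survey table and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses; NaN marks an unanswered question.
survey = pd.DataFrame({
    "participant": ["p1", "p2", "p3", "p4"],
    "q1": [4, np.nan, 5, 3],
    "q2": [4, 2, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = survey.isna().sum()

# Rows that have at least one missing value.
incomplete_rows = survey[survey.isna().any(axis=1)]

print(missing_per_column)
print(incomplete_rows)
```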
How to handle missing values
Listwise Deletion: Delete all data from any participant with missing
values. If your sample is large enough, you can likely drop data
without substantial loss of statistical power. Be sure that the values are
missing at random and that you are not inadvertently removing a class
of participants.
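As a rough illustration, listwise deletion in pandas is a single `dropna()` call. The sketch below uses made-up data with a hypothetical `group` column and compares how many rows each group keeps, as a simple check that no class of participants is being removed disproportionately.

```python
import numpy as np
import pandas as pd

# Hypothetical responses; "group" is an assumed participant class.
survey = pd.DataFrame({
    "group": ["novice", "novice", "expert", "expert", "expert"],
    "q1": [4, np.nan, 5, 3, 4],
    "q2": [4, 2, 5, np.nan, 3],
})

# Listwise deletion: drop every participant with any missing answer.
complete = survey.dropna()

# Share of rows kept per group; very uneven shares are a warning sign.
kept_share = complete["group"].value_counts() / survey["group"].value_counts()

print(len(survey), "->", len(complete))
print(kept_share)
```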
How to handle missing values
Recover the Values: You can sometimes contact the participants and
ask them to fill out the missing values. For in-person studies, we’ve
found having an additional check for missing values before the
participant leaves helps.
How to handle missing values
Educated Guessing: It sounds arbitrary and is not the preferred course
of action, but you can often infer a missing value. For related questions,
such as those presented in a matrix, if the participant answers every
other item with a 4, assume that the missing value is also a 4.
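One possible way to apply this rule in pandas is sketched below. The matrix of related questions is made up, and `fill_if_uniform` is a hypothetical helper: it fills a gap only when all of the participant's other answers are identical, and otherwise leaves the value missing.

```python
import numpy as np
import pandas as pd

# Hypothetical matrix of related questions, one row per participant.
matrix = pd.DataFrame({
    "q1": [4, 3, 5],
    "q2": [4, np.nan, 5],
    "q3": [np.nan, 2, 5],
    "q4": [4, 5, 5],
})

def fill_if_uniform(row):
    """Fill gaps only when every answered item in the row has the same value."""
    answered = row.dropna()
    if len(answered) > 0 and answered.nunique() == 1:
        return row.fillna(answered.iloc[0])
    return row  # answers vary, so no guess is made

guessed = matrix.apply(fill_if_uniform, axis=1)
print(guessed)
```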
How to handle missing values
Average Imputation: Use the average value of the responses from the
other participants to fill in the missing value. If the average of the 30
responses on the question is a 4.1, use a 4.1 as the imputed value. This
choice is not always recommended because it can artificially reduce the
variability of your data but in some cases makes sense.
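A small sketch of average imputation with pandas, using made-up ratings. It also compares the standard deviation before and after filling, which shows the reduced variability mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for one question; NaN marks a missing response.
ratings = pd.Series([4, 5, np.nan, 3, 4, np.nan, 5, 4], name="q1")

mean_value = ratings.mean()           # NaN values are skipped by default
imputed = ratings.fillna(mean_value)  # every gap receives the same value

# The mean is unchanged, but the spread shrinks after imputation.
print("mean:", ratings.mean(), imputed.mean())
print("std: ", ratings.std(), imputed.std())
```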
How to handle missing values
Common-Point Imputation: For a rating scale, use the midpoint or the
most commonly chosen value. For example, on a five-point scale,
substitute a 3, the midpoint, or a 4, the most common value (in many
cases). This is a bit more structured than guessing, but it is still among
the riskier options. Use caution unless you have good reason and
data to support using the substitute value.
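Both variants can be sketched in pandas as follows. The ratings are made up, and the scale bounds `SCALE_MIN` and `SCALE_MAX` are assumptions for a five-point scale.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings on a five-point scale.
ratings = pd.Series([4, 4, np.nan, 3, 5, 4, np.nan, 2])

SCALE_MIN, SCALE_MAX = 1, 5                 # assumed scale bounds
midpoint = (SCALE_MIN + SCALE_MAX) / 2      # 3.0 on a five-point scale
most_common = ratings.mode().iloc[0]        # 4.0 in this sample

filled_midpoint = ratings.fillna(midpoint)
filled_mode = ratings.fillna(most_common)

print(filled_midpoint.tolist())
print(filled_mode.tolist())
```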
How to handle missing values
Regression Substitution: You can use multiple-regression analysis to
estimate a missing value. We use this technique to deal with missing
SUS scores. Regression substitution predicts the missing value from the
other values. In the case of missing SUS data, we had enough data to
create stable regression equations and predict the missing values
automatically in the calculator.
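The idea can be sketched with a least-squares fit in NumPy. This is not the calculator described above: the data, the column names, and the `design` helper are made up for illustration, and the example is deliberately tiny.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "score" is missing for the last participant.
df = pd.DataFrame({
    "q1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "q2": [2.0, 1.0, 4.0, 3.0, 5.0],
    "score": [5.0, 4.0, 11.0, 10.0, np.nan],
})

predictors = ["q1", "q2"]
known = df[df["score"].notna()]
missing = df[df["score"].isna()]

def design(frame):
    # Prepend a column of ones so the model includes an intercept.
    return np.column_stack([np.ones(len(frame)), frame[predictors].to_numpy()])

# Fit score ~ intercept + q1 + q2 on the rows where score is known.
coef, *_ = np.linalg.lstsq(design(known), known["score"].to_numpy(), rcond=None)

# Predict the missing score from the other values and fill it in.
df.loc[missing.index, "score"] = design(missing) @ coef
print(df)
```

With real data, many more complete rows than predictors are needed before the regression equation is stable enough to trust.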
How to handle missing values
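Multiple Imputation: The most sophisticated and currently most
popular approach takes the regression idea further by exploiting
correlations between responses. Instead of filling each gap once, it
creates several plausible completed datasets, analyzes each one, and
pools the results.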
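The following is a simplified sketch of the idea, not a full implementation. It uses simulated data, fills the gaps several times with a regression prediction plus random noise, and pools the mean across the completed datasets. A complete multiple-imputation procedure would also account for uncertainty in the regression coefficients and pool the variances; those steps are omitted here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated data (an assumption): y is correlated with x and has gaps.
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
y[rng.choice(200, size=40, replace=False)] = np.nan
df = pd.DataFrame({"x": x, "y": y})

observed = df["y"].notna()
X_obs = np.column_stack([np.ones(observed.sum()), df.loc[observed, "x"].to_numpy()])
y_obs = df.loc[observed, "y"].to_numpy()

# Regression of y on x, fitted on the observed rows only.
coef, *_ = np.linalg.lstsq(X_obs, y_obs, rcond=None)
residual_sd = np.std(y_obs - X_obs @ coef, ddof=2)

X_mis = np.column_stack([np.ones((~observed).sum()), df.loc[~observed, "x"].to_numpy()])

# Build m completed datasets; each one adds fresh noise to the prediction
# so the imputed values keep a realistic amount of variability.
m = 5
estimates = []
for _ in range(m):
    completed = df["y"].copy()
    noise = rng.normal(scale=residual_sd, size=len(X_mis))
    completed[~observed] = X_mis @ coef + noise
    estimates.append(completed.mean())

pooled_mean = float(np.mean(estimates))         # pooled point estimate
between_var = float(np.var(estimates, ddof=1))  # variation across imputations
print(pooled_mean, between_var)
```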
Typos
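One common way to repair typos in a categorical column is to match each value against a list of valid spellings. The sketch below uses `difflib` from the Python standard library; the city values, the `VALID` list, and the `cutoff` of 0.8 are assumptions chosen for illustration.

```python
import difflib
import pandas as pd

# Hypothetical column with misspellings, case differences, and stray spaces.
cities = pd.Series(["Bangkok", "Bangkk", "bangkok ", "Chiang Mai", "Chaing Mai", "Phuket"])
VALID = ["Bangkok", "Chiang Mai", "Phuket"]  # assumed list of correct spellings

def correct(value, cutoff=0.8):
    """Return the closest valid spelling, or the tidied value if none is close."""
    cleaned = value.strip().title()
    match = difflib.get_close_matches(cleaned, VALID, n=1, cutoff=cutoff)
    return match[0] if match else cleaned

corrected = cities.map(correct)
print(corrected.tolist())
```

A lower cutoff catches more typos but also risks merging values that are genuinely different, so the corrections are worth reviewing by hand.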
Mixed Formats
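Dates are a typical example of mixed formats. The sketch below tries a list of expected formats in turn and converts every value to one standard form. The sample values and `KNOWN_FORMATS` are assumptions; in particular, reading `15/03/2023` as day/month/year is a choice that must match how the data was actually recorded.

```python
from datetime import datetime
import pandas as pd

# Hypothetical column holding the same date written in different formats.
raw = pd.Series(["2023-03-15", "15/03/2023", "20230315", "not a date"])
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y%m%d"]  # assumed input formats

def parse_date(text):
    """Try each known format; return NaT when none of them fits."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return pd.NaT  # flag unparseable values instead of guessing

dates = pd.to_datetime(raw.map(parse_date))
normalized = dates.dt.strftime("%Y-%m-%d")
print(normalized.tolist())
```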
Replicated entries
• Identify Columns That Contain a Single Value
• Delete Columns That Contain a Single Value
• Consider Columns That Have Very Few Values
• Remove Columns That Have A Low Variance
• Delete Rows That Contain Duplicate Data
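The steps above can be sketched with pandas as follows. The table is made up, and `VARIANCE_THRESHOLD` is an assumed cutoff that in practice depends on the scale of each column.

```python
import pandas as pd

# Hypothetical table; row 2 repeats row 1.
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "country": ["TH"] * 5,                      # single value
    "flag": [0, 0, 0, 0, 1],                    # very few values
    "reading": [1.00, 1.01, 1.01, 0.99, 1.00],  # low variance
    "score": [10, 25, 25, 40, 55],
})

# Identify and delete columns that contain a single value.
unique_counts = df.nunique()
single_value_cols = unique_counts[unique_counts == 1].index.tolist()
df = df.drop(columns=single_value_cols)

# Columns with very few values are only flagged; dropping them is a judgment call.
few_value_cols = unique_counts[(unique_counts > 1) & (unique_counts <= 2)].index.tolist()

# Remove numeric columns whose variance falls below the assumed threshold.
VARIANCE_THRESHOLD = 0.001
variances = df.var(numeric_only=True)
low_variance_cols = variances[variances < VARIANCE_THRESHOLD].index.tolist()
df = df.drop(columns=low_variance_cols)

# Delete rows that duplicate an earlier row.
df = df.drop_duplicates().reset_index(drop=True)
print(df)
```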
Outliers
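One widely used way to flag outliers is the interquartile range (IQR) rule, sketched below with made-up values. The multiplier 1.5 is the conventional choice, not a requirement, and a flagged value should be checked before it is removed.

```python
import pandas as pd

# Hypothetical measurements; 95 is far from the rest.
values = pd.Series([12, 14, 15, 13, 16, 14, 15, 95])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values that fall outside the fences.
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]
cleaned = values[~is_outlier]

print(outliers.tolist())
```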
Activity
• Use the CSV files from Week 4 and clean the data with pandas
Next Week
https://github.com/ferdinjoe/DSA201
