Data
Preparation
This course content is being actively
developed by Delta Analytics, a 501(c)3
Bay Area nonprofit that aims to
empower communities to leverage their
data for good.
Please reach out with any questions or
feedback to inquiry@deltanalytics.org.
Find out more about our mission here.
Delta Analytics builds technical capacity
around the world.
Course overview:
Now let’s turn to the data we will be using...
✓ Module 1: Introduction to Machine Learning
❏ Module 2: Machine Learning Deep Dive
❏ Module 3: Model Selection and Evaluation
❏ Module 4: Linear Regression
❏ Module 5: Decision Trees
❏ Module 6: Ensemble Algorithms
❏ Module 7: Unsupervised Learning Algorithms
❏ Module 8: Natural Language Processing Part 1
❏ Module 9: Natural Language Processing Part 2
Machine learning helps
us answer questions.
How do we define the
question?
Before we even get to the models/algorithms, we have
to learn about our data and define our research
question.
Workflow diagram stages: Research Question · Exploratory Analysis ·
Data Validation + Cleaning · Modeling Phase · Performance
Machine learning takes place during the modeling phase.
~80% of your time as a data scientist is spent on data validation
and cleaning, preparing your data for analysis.
Examples of research questions:
● Does this patient have malaria?
● Can we monitor illegal deforestation by detecting
chainsaw noises in audio streamed from
rainforests?
A research question is the question we
want our model to answer.
Research
Question
We may have a question in mind before we look at the data, but
we will often use our exploration of the data to develop or refine
our research question.
Research
Question
Exploratory
Analysis
Data
Validation+
Cleaning
What comes first, the
chicken or the egg?
Data Validation and
Cleaning
Data
Cleaning
Source: Survey of 80 data scientists. Forbes article, March 23, 2016.
“Data preparation accounts for about 80%
of the work of data scientists.”
Data
Cleaning
Why do we need to validate and clean our
data?
Data often comes from multiple sources
- Do data align across different
sources?
Data is created by humans
- Does the data need to be transformed?
- Is it free from human bias and errors?
Go further with these readings: here and here
Data cleaning involves identifying any issues
with our data and confirming our
qualitative understanding of the data.
Data
Cleaning
Missing Data
Is there missing data? Is it
missing systematically?
Time Series Validation
Is the data for the correct
time range?
Are there unusual spikes in
the volume of loans over
time?
Data Type
Are all variables the right type?
Is a date treated like a date?
Data Range
Are all values in the
expected range? Are all
loan_amounts greater than
0?
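The range check above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original course material: the helper `find_invalid`, the sample records, and the rule passed as `is_valid` are all hypothetical, and missing values are deliberately skipped because they belong to a separate check.

```python
def find_invalid(records, field, is_valid):
    """Return (row index, value) pairs where the field fails the validity rule."""
    problems = []
    for i, record in enumerate(records):
        value = record.get(field)
        if value is None:
            continue  # missing values are handled by a separate check
        if not is_valid(value):
            problems.append((i, value))
    return problems


# Hypothetical loan records; only the field name comes from the slide.
loans = [
    {"loan_amounts": 250.0},
    {"loan_amounts": -40.0},
    {"loan_amounts": 0.0},
    {"loan_amounts": None},
]

# "Greater than 0" is a strict rule, so a value of exactly 0 is also reported.
bad_rows = find_invalid(loans, "loan_amounts", lambda v: v > 0)
```

Reporting the offending rows, rather than silently dropping them, leaves room to check the documentation before deciding what to do.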
Let’s step through some examples:
Data
Cleaning
Missing data
Time series
Data types
Data
Cleaning
Transforming
variables
After gaining an initial
understanding of your data, you
may need to transform it to be
used in analysis
Data
Cleaning
Is there missing data? Is data
missing at random or systematically?
Very few datasets
have no missing data;
most of the time you
will have to deal with
missing data.
The first question you
have to ask is what
type of missing data
you have.
Missing completely at random: there is no pattern in the
missing data. This is the best type of missing data you
can hope for.
Missing at random: there is a pattern in your missing data,
but it is explained by other variables you observed, not by
the values of your variables of interest.
Missing not at random: there
is a pattern in the missing data
that systematically affects your
primary variables.
Missing
data
Example: You have survey data from a random sample of high school students in
the U.S. Some students didn’t participate:
Data
Cleaning
Missing
data
If data is missing at random,
we can use the rest of the
nonmissing data without
worrying about bias!
If data is missing in a
non-random or systematic
way, your nonmissing data may
be biased
Some students
were sick the
day of the
survey
Some students
declined to
participate,
since the
survey asks
about grades
Is there missing data? Is data
missing at random or systematically?
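One rough way to look for systematic missingness is to compare the share of missing responses across groups. The sketch below assumes hypothetical fields (`grade_band`, `response`) and made-up survey rows; it is an illustration of the idea, not a formal statistical test.

```python
from collections import defaultdict


def missing_rate_by_group(records, group_field, value_field):
    """Return the fraction of missing values of value_field within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for record in records:
        tally = counts[record[group_field]]
        tally[1] += 1
        if record.get(value_field) is None:
            tally[0] += 1
    return {group: missing / total for group, (missing, total) in counts.items()}


# Hypothetical survey rows: None marks a student who did not respond.
students = [
    {"grade_band": "high", "response": 4},
    {"grade_band": "high", "response": 5},
    {"grade_band": "high", "response": None},
    {"grade_band": "high", "response": 3},
    {"grade_band": "low", "response": None},
    {"grade_band": "low", "response": None},
    {"grade_band": "low", "response": 2},
    {"grade_band": "low", "response": None},
]

rates = missing_rate_by_group(students, "grade_band", "response")
```

A large gap between groups, as in this made-up data, hints that nonresponse is related to grades. Similar rates across groups do not prove the data is missing at random; they only fail to show a pattern in the variable you checked.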
Data
Cleaning
Sometimes, you can replace
missing data.
● Drop missing observations
● Populate missing values with average
of available data
● Impute data
Missing
data
What you should do depends heavily on what makes
sense for your research question, and your data.
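The first two options can be sketched with the standard library alone. The sample values and helper names below are assumptions made for illustration; `None` stands in for a missing value.

```python
from statistics import mean


def drop_missing(values):
    """Keep only the observations that are present."""
    return [v for v in values if v is not None]


def fill_with_mean(values):
    """Replace each missing value with the average of the observed values."""
    observed = drop_missing(values)
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    average = mean(observed)
    return [average if v is None else v for v in values]


# Hypothetical loan amounts with two missing entries.
loan_amounts = [100.0, None, 300.0, 200.0, None]

dropped = drop_missing(loan_amounts)
filled = fill_with_mean(loan_amounts)
```

Dropping shrinks the dataset, while mean filling keeps every row but reduces the spread of the variable. Neither removes bias if the data is missing systematically.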
Data
Cleaning
Common imputation techniques
● Use the average of nonmissing values: Take the average of the
observations you do have to populate missing observations, i.e., assume
that the missing observation is also represented by the population average.
● Use an educated guess: It sounds arbitrary and often isn’t preferred, but
you can infer a missing value. For related questions, such as those often
presented in a matrix, if the participant responds with all “4s”, assume
that the missing value is a 4.
● Use common-point imputation: For a rating scale, use the midpoint or the
most commonly chosen value. For example, on a five-point scale, substitute
a 3, the midpoint, or a 4, the most common value (in many cases). This is a
bit more structured than guessing, but it’s still among the riskier
options. Use caution unless you have good reason and data to support using
the substitute value.
Source: Handling missing data
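Common-point imputation on a rating scale might look like the sketch below. The function name, the `strategy` parameter, and the sample ratings are hypothetical; the midpoint calculation assumes a scale with an odd number of points, such as 1 to 5.

```python
from statistics import mode


def fill_with_common_point(ratings, strategy="mode", scale=(1, 5)):
    """Fill missing ratings with the most common value or the scale midpoint."""
    observed = [r for r in ratings if r is not None]
    if strategy == "mode":
        fill = mode(observed)  # most frequently chosen rating
    elif strategy == "midpoint":
        fill = (scale[0] + scale[1]) // 2  # 3 on a 1-to-5 scale
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if r is None else r for r in ratings]


# Hypothetical five-point ratings with two missing answers.
ratings = [4, 5, None, 4, 2, None, 4]

filled_with_mode = fill_with_common_point(ratings, strategy="mode")
filled_with_midpoint = fill_with_common_point(ratings, strategy="midpoint")
```

The two strategies give different answers on the same data, which is one reason to be cautious: the choice of substitute value is itself an assumption about the respondents.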
Data
Cleaning
If we have observations over
time, we need to do time series
validation.
Ask yourself:
a. Is the data for the correct time range?
b. Are there unusual spikes in the data over
time?
What should we do if there are
unusual spikes in the data over time?
Time series
Data
Cleaning
How do we address unexpected
spikes in our data?
Data
anomaly
For certain datasets (like sales data), systematic
seasonal spikes are expected. For example,
around Christmas we would see a spike in sales
revenue. This is normal, and should not
necessarily be removed.
Time series
Systematic
spike
Random
spike
If the spike is isolated, it is probably unexpected,
and we may want to correct or remove the corrupted data.
For example, if sales for one month are
recorded in Kenyan shillings rather than US
dollars, the sales figures would be inflated. We should
clean the data by converting that month to US dollars or
perhaps removing it.
Note, sometimes there are natural anomalies in data that should be investigated first
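A simple way to surface isolated spikes for review is to compare each period against the median of the series. The sketch below is an assumption-laden illustration: the `threshold` of 5 times the median and the monthly figures are made up, and a real seasonal series would need a rule that accounts for expected peaks.

```python
from statistics import median


def flag_spikes(values, threshold=5.0):
    """Return the indexes of values more than `threshold` times the median."""
    typical = median(values)
    if typical <= 0:
        return []  # the ratio rule is not meaningful for a non-positive median
    return [i for i, v in enumerate(values) if v > threshold * typical]


# Hypothetical monthly sales; the fifth month looks like a unit mix-up.
monthly_sales = [1000, 1100, 950, 1050, 105000, 1020]

spikes = flag_spikes(monthly_sales)
```

The function only flags candidates. Whether a flagged month is a data error, such as a currency mix-up, or a genuine anomaly still has to be investigated by hand.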
Data
Cleaning
Are all variables the right
type?
● Integer: A number with no decimal places
● Float: A number with decimal places
● String: Text field, or more formally, a sequence of unicode characters
● Boolean: Can only be True or False (also called indicator or dummy variable)
● Datetime: Values meant to hold time data.
Fun blog on this topic: Data Carpentry
Many functions in Python are type specific, which means we need to make sure all of
our fields are being treated as the correct type:
integer float string date
Data type
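Data loaded from a text file often arrives with every field as a string, so each field has to be converted before type-specific functions behave as expected. The sketch below uses hypothetical field names and an assumed `YYYY-MM-DD` date format.

```python
from datetime import datetime


def parse_loan_row(raw):
    """Convert a row of strings into properly typed values."""
    return {
        "lender_count": int(raw["lender_count"]),                # integer
        "loan_amount": float(raw["loan_amount"]),                # float
        "sector": raw["sector"].strip(),                         # string
        "is_funded": raw["is_funded"].strip().lower() == "true",  # boolean
        "posted_date": datetime.strptime(raw["posted_date"], "%Y-%m-%d").date(),
    }


# Hypothetical raw row, as it might come out of a CSV file.
raw_row = {
    "lender_count": "12",
    "loan_amount": "250.50",
    "sector": " Agriculture ",
    "is_funded": "True",
    "posted_date": "2016-03-23",
}

row = parse_loan_row(raw_row)

# Strings compare character by character, so "10" sorts before "9".
string_comparison = "10" < "9"
number_comparison = 10 < 9
```

The last two lines show why the type matters: the same comparison gives opposite answers depending on whether the values are treated as text or as numbers.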
As you explore the data, some questions arise…
Data cleaning quiz!
Question #1
Question: There is an observation from the Kiva loan dataset that says a
loan was fully funded in the year 1804, but Kiva wasn’t even founded then.
What do I do?
Answer: Consult the data documentation. If no explanation exists, remove
this observation.
Time series: This question illustrates that you should always validate the
time range. Check the minimum and maximum observations in your data set.
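A time range check can be as small as the sketch below. The dates are made up, and the bounds are assumptions: the lower bound uses 2005, the year Kiva was founded, and the upper bound is an arbitrary cutoff chosen for the example.

```python
from datetime import date


def check_time_range(dates, earliest, latest):
    """Return the observed minimum, maximum, and any dates outside the bounds."""
    observed_min, observed_max = min(dates), max(dates)
    outliers = [d for d in dates if d < earliest or d > latest]
    return observed_min, observed_max, outliers


# Hypothetical funded dates, including one impossible value.
funded_dates = [date(2012, 5, 1), date(1804, 7, 9), date(2015, 11, 30)]

observed_min, observed_max, outliers = check_time_range(
    funded_dates,
    earliest=date(2005, 1, 1),   # assumed lower bound
    latest=date(2017, 12, 31),   # assumed upper bound
)
```

Looking at the minimum and maximum first is often enough to spot the problem; the list of outliers then tells you which rows to look up in the documentation.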
Question #2
Question: There is an observation that states a person’s birthday is
12/1/80, but the “age” variable is missing. What do I do?
Answer: We can calculate age (e.g., 2017 - 1980 = 37).
Missing data: This question illustrates how we may be able to leverage
other fields to make an educated guess about the missing age.
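The age calculation can be sketched as follows. Two assumptions are built in: the birthday string is read in US month/day/year order, and the "as of" date is chosen by the analyst. The helper name `age_on` is hypothetical.

```python
from datetime import date, datetime


def age_on(birthday, as_of):
    """Return the age in whole years on the given date."""
    had_birthday = (as_of.month, as_of.day) >= (birthday.month, birthday.day)
    return as_of.year - birthday.year - (0 if had_birthday else 1)


# Assumes month/day/year order; a two-digit year of 80 is parsed as 1980.
birthday = datetime.strptime("12/1/80", "%m/%d/%y").date()

age_end_of_2017 = age_on(birthday, date(2017, 12, 31))
age_mid_2017 = age_on(birthday, date(2017, 6, 1))
```

Subtracting the years alone matches the slide's answer only after the birthday has passed in the reference year, so the reference date is worth recording alongside the derived age.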
Question #3
Question: The variable “amount_funded” has values of both “N/A” and “0”.
What do I do?
Answer: Check the documentation to see whether there is a material
difference between N/A and 0.
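Until the documentation settles the question, it is safer to keep the two values distinct when loading the data. The sketch below assumes the raw values are strings and that "N/A", "NA", and empty text all mean missing; those markers are an assumption, not something the dataset confirms.

```python
def parse_amount(raw):
    """Convert raw text to a float, keeping missing markers as None."""
    text = raw.strip()
    if text.upper() in {"", "N/A", "NA"}:
        return None  # missing: we do not know the amount
    return float(text)  # "0" becomes 0.0: a recorded amount of zero


# Hypothetical raw values for amount_funded.
raw_amounts = ["0", "N/A", "125.00", "0"]

amounts = [parse_amount(a) for a in raw_amounts]
n_missing = sum(a is None for a in amounts)
n_zero = sum(a == 0 for a in amounts)
```

Counting the two cases separately makes it easy to report how many rows are affected when you ask the data owner what each value means.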
Question #4
Question: I’m not sure what currency the variable “amount_funded” is
reported in. What do I do?
Answer: Check the documentation and other variables, then convert to the
appropriate currency.
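Once the currency of each row is known, the conversion itself is straightforward. In the sketch below the `currency` field and the exchange rates are hypothetical placeholders; real rates would come from the documentation or a dated reference source.

```python
# Hypothetical exchange rates, for illustration only.
RATES_TO_USD = {"USD": 1.0, "KES": 0.01}


def to_usd(amount, currency, rates=RATES_TO_USD):
    """Convert an amount to US dollars, failing loudly on unknown currencies."""
    if currency not in rates:
        raise KeyError(f"no exchange rate for currency: {currency}")
    return amount * rates[currency]


# Hypothetical loans reported in two different currencies.
loans = [
    {"amount_funded": 500.0, "currency": "USD"},
    {"amount_funded": 50000.0, "currency": "KES"},
]

amounts_usd = [to_usd(loan["amount_funded"], loan["currency"]) for loan in loans]
```

Raising an error for an unknown currency is a deliberate choice here: silently treating an unlabeled amount as US dollars would recreate the inflated figures described earlier.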
A final note…
Data
Cleaning
Note that our examples were all very specific - you may or may not
encounter these exact examples in the wild. This is because data
cleaning is very often idiosyncratic and cannot be adequately
completed by following a predetermined set of steps - you must use
common sense!
Next we turn to exploratory analysis, for which we often have to
transform our data.
Data
transformation
You are on fire! Go straight to the
next module here.
Need to slow down and digest? Take a
minute to write us an email about
what you thought about the course. All
feedback small or large welcome!
Email: sara@deltanalytics.org
Find out more about
Delta’s machine
learning for good
mission here.

Module 1.2 data preparation

  • 2. This course content is being actively developed by Delta Analytics, a 501(c)3 Bay Area nonprofit that aims to empower communities to leverage their data for good. Please reach out with any questions or feedback to inquiry@deltanalytics.org. Find out more about our mission here. Delta Analytics builds technical capacity around the world.
  • 3. Course overview: Now let’s turn to the data we will be using... ✓ Module 1: Introduction to Machine Learning ❏ Module 2: Machine Learning Deep Dive ❏ Module 3: Model Selection and Evaluation ❏ Module 4: Linear Regression ❏ Module 5: Decision Trees ❏ Module 6: Ensemble Algorithms ❏ Module 7: Unsupervised Learning Algorithms ❏ Module 8: Natural Language Processing Part 1 ❏ Module 9: Natural Language Processing Part 2
  • 4. Machine learning helps us answer questions. How do we define the question?
  • 5. Before we even get to the models/algorithms, we have to learn about our data and define our research question. Research Question Exploratory Analysis Modeling Phase Performance Data Validation + Cleaning Machine learning takes place during the modeling phase. ~80% of your time as a data scientist is spent here, preparing your data for analysis
  • 6. Examples of research questions: ● Does this patient have malaria? ● Can we monitor illegal deforestation by detecting chainsaw noises in audio streamed from rainforests? A research question is the question we want our model to answer. Research Question
  • 7. We may have a question in mind before we look at the data, but we will often use our exploration of the data to develop or refine our research question. Research Question Exploratory Analysis Data Validation+ Cleaning What comes first, the chicken or the egg?
  • 9. Data Cleaning Source: Survey of 80 data scientists. Forbes article, March 23, 2016. “Data preparation accounts for about 80% of the work of data scientists.”
  • 10. Why do we need to validate and clean our data? ● Data often comes from multiple sources: do data align across different sources? ● Data is created by humans: does the data need to be transformed? Is it free from human bias and errors? Go further with these readings: here and here
  • 11. Data cleaning involves identifying any issues with our data and confirming our qualitative understanding of the data. ● Missing data: Is there missing data? Is it missing systematically? ● Time series validation: Is the data for the correct time range? Are there unusual spikes in the volume of loans over time? ● Data type: Are all variables the right type? Is a date treated like a date? ● Data range: Are all values in the expected range? Are all loan_amounts greater than 0?
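The checks on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the course's own code: the records, the `funded_date` field name, and the expected date bounds are all assumptions made up for the example.

```python
from datetime import date

# Hypothetical loan records. Only loan_amount is named on the slide;
# funded_date and the values are invented for illustration.
loans = [
    {"loan_amount": 500.0, "funded_date": date(2014, 3, 1)},
    {"loan_amount": None, "funded_date": date(2015, 7, 15)},
    {"loan_amount": -25.0, "funded_date": date(2016, 1, 9)},
    {"loan_amount": "300", "funded_date": date(1804, 5, 2)},
]

# Assumed bounds for the period the dataset is supposed to cover.
EXPECTED_START = date(2005, 1, 1)
EXPECTED_END = date(2017, 12, 31)


def check_record(record):
    """Return a list of the issues found in one record."""
    issues = []
    amount = record["loan_amount"]
    if amount is None:
        issues.append("missing loan_amount")          # missing data
    elif not isinstance(amount, (int, float)):
        issues.append("loan_amount has wrong type")   # data type
    elif amount <= 0:
        issues.append("loan_amount out of range")     # data range
    if not EXPECTED_START <= record["funded_date"] <= EXPECTED_END:
        issues.append("funded_date outside expected time range")  # time series
    return issues


report = {index: check_record(record) for index, record in enumerate(loans)}
for index, issues in report.items():
    print(index, issues or "ok")
```

Flagging issues rather than silently dropping rows keeps the decision about what to do with each record separate from the detection step.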
  • 12. Let’s step through some examples: ● Missing data ● Time series ● Data types ● Transforming variables: after gaining an initial understanding of your data, you may need to transform it to be used in analysis.
  • 13. Is there missing data? Is data missing at random or systematically? Very few datasets have no missing data; most of the time you will have to deal with it. The first question you have to ask is what type of missing data you have. ● Missing completely at random: there is no pattern in the missing data. This is the best type of missing data you can hope for. ● Missing at random: there is a pattern in your missing data but not in your variables of interest. ● Missing not at random: there is a pattern in the missing data that systematically affects your primary variables.
  • 14. Is there missing data? Is data missing at random or systematically? Example: You have survey data from a random sample of high school students in the U.S. Some students didn’t participate: ● Some students were sick the day of the survey. If data is missing at random, we can use the rest of the nonmissing data without worrying about bias! ● Some students declined to participate, since the survey asks about grades. If data is missing in a non-random or systematic way, your nonmissing data may be biased.
  • 16. Sometimes, you can replace missing data. Options: ● Drop missing observations ● Populate missing values with the average of available data ● Impute data. What you should do depends heavily on what makes sense for your research question and your data.
  • 17. Common imputation techniques: ● Use the average of nonmissing values: Take the average of the observations you do have to populate missing observations, i.e., assume that this observation is also represented by the population average. ● Use an educated guess: It sounds arbitrary and often isn’t preferred, but you can infer a missing value. For related questions, such as those often presented in a matrix, if the participant responds with all “4s”, assume that the missing value is a 4. ● Use common point imputation: For a rating scale, use the middle point or the most commonly chosen value. For example, on a five-point scale, substitute a 3, the midpoint, or a 4, the most common value (in many cases). This is a bit more structured than guessing, but it’s still among the more risky options. Use caution unless you have good reason and data to support using the substitute value. Source: Handling missing data
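As a rough sketch of two of these techniques, the snippet below fills missing answers on a five-point scale with the average of the nonmissing values and with a common point (the midpoint or the most frequent answer). The response data and the `impute` helper are hypothetical, and the example uses only Python's standard `statistics` module.

```python
from statistics import mean, mode

# Hypothetical responses on a five-point scale; None marks a missing answer.
responses = [4, 5, None, 2, 4, None, 4]

# Only the answers we actually observed.
observed = [value for value in responses if value is not None]


def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if value is None else value for value in values]


# Average of nonmissing values: (4 + 5 + 2 + 4 + 4) / 5
mean_filled = impute(responses, mean(observed))

# Common point imputation: scale midpoint, or most commonly chosen value.
midpoint_filled = impute(responses, 3)
mode_filled = impute(responses, mode(observed))

print(mean_filled)
print(midpoint_filled)
print(mode_filled)
```

Which fill value is defensible depends on the research question; the code only shows the mechanics, not a recommendation.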
  • 18. Data Cleaning If we have observations over time, we need to do time series validation. Ask yourself: a. Is the data for the correct time range? b. Are there unusual spikes in the data over time? What should we do if there are unusual spikes in the data over time? Time series
  • 19. How do we address unexpected spikes in our data? ● Systematic spike: For certain datasets (like sales data), systematic seasonal spikes are expected. For example, around Christmas we would see a spike in sales. This is normal, and should not necessarily be removed. ● Random spike (data anomaly): If the spike is isolated, it is probably unexpected, and we may want to remove the corrupted data. For example, if for one month sales are recorded in Kenyan Shillings rather than US dollars, it would inflate sales figures. We should do some data cleaning by converting to US dollars, or perhaps remove this month. Note: sometimes there are natural anomalies in data that should be investigated first.
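One simple way to surface an isolated spike like the currency mix-up described above is to compare each month against the median month. The sketch below is only an illustration: the sales figures, the spike threshold, and the exchange rate are all invented assumptions, and a flagged month should still be investigated before it is converted or removed.

```python
from statistics import median

# Hypothetical monthly sales. November is assumed to have been recorded in
# Kenyan Shillings; December is a normal seasonal increase.
monthly_sales = {
    "2016-09": 1000.0,
    "2016-10": 1100.0,
    "2016-11": 105000.0,
    "2016-12": 2100.0,
    "2017-01": 950.0,
}

SPIKE_FACTOR = 10    # assumed threshold: flag months above 10x the median
KES_PER_USD = 100.0  # hypothetical exchange rate, for illustration only

typical = median(monthly_sales.values())
suspect_months = [
    month for month, sales in monthly_sales.items()
    if sales > SPIKE_FACTOR * typical
]

# After confirming the cause in the documentation, convert the flagged months.
cleaned = {
    month: sales / KES_PER_USD if month in suspect_months else sales
    for month, sales in monthly_sales.items()
}

print(suspect_months)
print(cleaned)
```

With a threshold this high, the seasonal December increase is left alone while the mis-recorded month is flagged.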
  • 20. Are all variables the right type? ● Integer: A number with no decimal places ● Float: A number with decimal places ● String: Text field, or more formally, a sequence of unicode characters ● Boolean: Can only be True or False (also called indicator or dummy variable) ● Datetime: Values meant to hold time data. Many functions in Python are type specific, which means we need to make sure all of our fields are being treated as the correct type. Fun blog on this topic: Data Carpentry
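To make the type list concrete, here is a small sketch that converts a raw record, where every field arrives as text, into the types above. The field names, values, and the `parse_bool` helper are hypothetical; only standard-library calls are used.

```python
from datetime import datetime

# Hypothetical raw record: every field arrives as a string, as is common
# when reading a CSV file.
raw = {
    "loan_id": "1042",
    "loan_amount": "250.50",
    "borrower_name": "Amina",
    "is_funded": "True",
    "posted_date": "2016-03-23",
}


def parse_bool(text):
    """Convert 'true'/'false' text to a Boolean.

    bool("False") would be True, because any non-empty string is truthy,
    so the text has to be compared explicitly.
    """
    lowered = text.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    raise ValueError(f"not a Boolean value: {text!r}")


typed = {
    "loan_id": int(raw["loan_id"]),                   # integer
    "loan_amount": float(raw["loan_amount"]),         # float
    "borrower_name": raw["borrower_name"],            # string
    "is_funded": parse_bool(raw["is_funded"]),        # Boolean
    "posted_date": datetime.strptime(raw["posted_date"], "%Y-%m-%d"),  # datetime
}

for field, value in typed.items():
    print(field, type(value).__name__, value)
```

Once the date is a real datetime, comparisons such as "earlier than" or "same year" behave as expected, which they would not on the raw text.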
  • 21. As you explore the data, some questions arise… Data cleaning quiz!
  • 22. Question #1: There is an observation from the Kiva loan dataset that says a loan was fully funded in year 1804, but Kiva wasn’t even founded then. What do I do?
  • 23. Question #1: There is an observation that says this loan was fully funded in year 1804, but Kiva wasn’t even founded then. What do I do? Answer: Consult the data documentation. If no explanation exists, remove this observation. This question illustrates that you should always validate the time range (Data Cleaning: Time series). Check what the minimum and maximum observations in your data set are.
  • 24. Question #2: There is an observation that states a person’s birthday is 12/1/80 but the “age” variable is missing. What do I do?
  • 25. Question #2: There is an observation that states a person’s birthday is 12/1/80 but the “age” variable is missing. What do I do? Answer: We can calculate age (e.g. 2017 - 1980 = 37). This question illustrates how we may be able to leverage other fields to make an educated guess about the missing age (Data Cleaning: Missing data).
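The year subtraction in this answer is a quick approximation. A slightly more careful sketch, shown below, also checks whether the birthday has already occurred by the reference date. It assumes 12/1/80 means December 1, 1980 (US month/day order); the `age_on` helper and the reference dates are hypothetical.

```python
from datetime import date


def age_on(birthday, reference):
    """Return the age in whole years on the reference date."""
    years = reference.year - birthday.year
    # Subtract one if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (birthday.month, birthday.day):
        years -= 1
    return years


# Assumption: 12/1/80 is read as December 1, 1980.
birthday = date(1980, 12, 1)

print(age_on(birthday, date(2017, 12, 31)))  # 37, matches 2017 - 1980
print(age_on(birthday, date(2017, 6, 1)))    # 36, birthday not yet reached
```

Whether the one-year difference matters depends on the analysis, but the reference date used to fill in the missing age should be recorded either way.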
  • 26. Question #3: The variable “amount_funded” has values of both “N/A” and “0”. What do I do?
  • 27. Question #3: The variable “amount_funded” has values of both “N/A” and “0”. What do I do? Answer: Check the documentation to see whether there is a material difference between “N/A” and “0”.
  • 28. Question #4: I’m not sure what currency the variable “amount_funded” is reported in. What do I do?
  • 29. Question #4: I’m not sure what currency the variable “amount_funded” is reported in. What do I do? Answer: Check the documentation and other variables, then convert to the appropriate currency.
  • 30. A final note… Note that our examples were all very specific: you may or may not encounter these exact examples in the wild. This is because data cleaning is very often idiosyncratic and cannot be adequately completed by following a predetermined set of steps. You must use common sense! Next we turn to exploratory analysis, for which we often have to transform our data.
  • 31. You are on fire! Go straight to the next module here. Need to slow down and digest? Take a minute to write us an email about what you thought about the course. All feedback small or large welcome! Email: sara@deltanalytics.org
  • 32. Find out more about Delta’s machine learning for good mission here.