This talk discusses techniques for dealing with messy and imperfect data. Three examples are provided: (1) commercial real estate data with validity issues that were addressed through consensus, imputation, outlier removal and OCR; (2) click data with cookie issues that required transformations like sampling and clustering after preprocessing; (3) digital media data with access issues that necessitated working with insufficient data using decomposition and seeking more data. The overall message is to focus on the problem, identify data limitations, and apply creative solutions like reworking the data.
1. The Wild West of Data Wrangling
Sarah Guido
PyTennessee 2018
@sarah_guido
2. This talk:
• A day in the life of a data scientist
• Three jobs where I’ve dealt with uncooperative data
• Messy, incomplete, inconsistent, hard to get, hard to model
• Not ground truth
3. Who am I?
• Experienced data scientist
• Data sciencing in Python (and sometimes Scala)
• Wide variety of data: small, large, user, collected in-house, retrieved via API
• Twitter: @sarah_guido
10. Our journey
Necessary data transformations
Techniques to help with data issues
Work with less than ideal data
11. Example 1: Techniques to help with data issues
• Commercial real estate data
• Data validity concerns
• Not a lot of data
• Imperfect data for modeling
12-16. Example 1: Techniques to help with data issues
• Data validity issues
• Multiple sources of data for the same data points
• Entered by humans
• Missing
• Order of magnitude off
• Trapped in PDFs
• Data validity solutions
• Data point consensus
• Fill with mean (when the situation allows for it)
• Discover other complete sources
• Remove outliers
• OCR
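The deck doesn't include code for these fixes, but the consensus and mean-fill steps can be sketched in pandas. The column names and values here are hypothetical: a per-building median across sources absorbs a single order-of-magnitude error, and a column mean fills what's still missing afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-source records: three sources report building 1,
# two report building 2. One square-footage entry is an order of
# magnitude off; building 2 is missing its floor count entirely.
sources = pd.DataFrame({
    "building_id": [1, 1, 1, 2, 2],
    "square_feet": [10000, 10500, 105000, 20000, 21000],
    "floors": [12, 12, np.nan, np.nan, np.nan],
})

# Data point consensus: the per-building median across sources
# resists a single wildly wrong entry (105000 is outvoted).
consensus = sources.groupby("building_id").median()

# Fill with mean (when the situation allows for it): building 2's
# floor count is still missing after aggregation.
consensus["floors"] = consensus["floors"].fillna(consensus["floors"].mean())
```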
17. • The problem: can we predict if a building will sell the following year?
• The data: floors, location, square footage, price per sqft, etc.
• The goal: provide valuable insight to platform users
Example 1: Techniques to help with data issues
18. • First thought: logistic regression using scikit-learn
• Binary classification: sale/no sale
Example 1: Techniques to help with data issues
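A minimal sketch of that first-pass model. The real building data isn't shown in the deck, so synthetic features stand in for floors, square footage, and price per sqft:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the building features
# (floors, square footage, price per sqft, ...).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Binary classification: sale (1) / no sale (0).
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
sale_probability = model.predict_proba(X)[:, 1]  # P(sale) per building
```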
21. Problem: Class imbalance
Class imbalance
When the classes you are trying to predict are unevenly represented, classification models can become biased toward the majority class.
22. Solution 1: Stratified sampling
Stratified sampling
Creating a training sample that preserves the class distribution of your dataset.
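In scikit-learn, a stratified split is a one-liner via the `stratify` parameter of `train_test_split`; the imbalanced toy data here is made up:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 90% "no sale", 10% "sale".
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# stratify=y keeps the class ratio (nearly) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_ratio = Counter(y_train)[1] / len(y_train)
test_ratio = Counter(y_test)[1] / len(y_test)
```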
24. Solution 2: Gradient boosting
Gradient boosting
Produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
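A sketch with scikit-learn's `GradientBoostingClassifier`, again on made-up data; each shallow tree is fit to correct the errors of the trees before it:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=500, random_state=0)

# An ensemble of shallow decision trees, built sequentially so that
# each tree focuses on the mistakes of the ensemble so far.
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
clf.fit(X, y)
train_accuracy = clf.score(X, y)
```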
25. Solution: Techniques to help with data issues
Ways to treat the data:
improve validity/quality
Preprocessing techniques:
sampling, gradient boosting
26. • Link click data
• Cookie issues
• Lots of preprocessing to model
Example 2: Necessary data transformations
28. Example 2: Necessary data transformations
The problem: how can we identify similar patterns based on click data?
The data: time, geolocation, cookie, browser user-agent string, referrer
The goal: understand how people interact with content over time
30. Problem: Clustering user interactions
K-means clustering
An unsupervised learning method of grouping data together based on a distance metric.
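The talk used Scala/Spark for this step; a minimal scikit-learn equivalent on toy 2-D features looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D interaction features with two obvious groups.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.7, 8.2]])

# Assign each point to one of k=2 clusters by Euclidean distance
# to the nearest cluster centroid.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
```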
31. Problem: Clustering the data
• Only look at users with 5 or more interactions
• Each user has a different number of interactions
• Each data point ends up in a different cluster
• Complex feature space
37. Solution: Transform the data
Dates: 2017-04-09, 2017-04-13, 2017-04-30, 2017-05-01, 2017-05-12
Number of interactions: 5
Average time between interactions: ~8 days
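This transformation collapses a variable-length interaction history into fixed-size features, which is what k-means needs. Computed directly with the standard library, using the dates from the slide:

```python
from datetime import date

# The interaction dates shown on the slide.
dates = [date(2017, 4, 9), date(2017, 4, 13), date(2017, 4, 30),
         date(2017, 5, 1), date(2017, 5, 12)]

num_interactions = len(dates)
gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
avg_gap = sum(gaps) / len(gaps)  # 8.25, i.e. ~8 days
```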
38. Solution: Transform the data
referrer: facebook, twitter
One-hot encode and transform to matrix
• Facebook: [1, 0]
• Twitter: [0, 1]
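One way to produce exactly that matrix is pandas' `get_dummies` (the deck doesn't show which tool was used, so this is one reasonable sketch):

```python
import pandas as pd

# Toy referrer column.
clicks = pd.DataFrame({"referrer": ["facebook", "twitter", "facebook"]})

# One indicator column per referrer value, in sorted column order:
# facebook -> [1, 0], twitter -> [0, 1]
matrix = pd.get_dummies(clicks["referrer"]).astype(int)
```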
43. The problem: how can we effectively describe our audience?
The data: anonymized demographic and psychographic data
The goal: audience segmentation and channel analysis
Example 3: Work with less than ideal data
44. Problem: insufficient data
• Google Analytics data: only 1/3 of URLs
• Finicky API
• Semi-useless psychographic data
47. Solution: make it work!
• Sometimes you just have to settle for what you have
• Segmentation through decomposition techniques
• Go get more data!
• Reorganize the data you have!
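The deck doesn't name a specific decomposition technique; PCA is one common choice for squeezing segments out of wide demographic data, sketched here on synthetic stand-in features:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for demographic/psychographic features.
X, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=0)

# Project onto the two directions of greatest variance; the reduced
# space is easier to cluster and plot for audience segmentation.
pca = PCA(n_components=2)
reduced = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()
```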
48. General strategy
• What problem are you trying to solve?
• What’s wrong with your data?
• What do you need that you don’t have?
49. Keep in mind…
• Data your company collects is complicated
• What you do to your data will affect the model
• Creativity is your friend
• Lots of ways to solve the problem
• You don’t have to accept the data as it is!
Speaker notes:
What do I actually have to do to get data in shape for modeling?
Origin: a conversation with friends
The iris dataset is…
Here's the problem
Bootcamps: easy
Transition: let's begin the journey. Reonomy: use ML techniques to help with data issues.
Bitly: transform data as necessary to model
Mashable: work with what you have, and if you don't have something, find a way to get it
Reonomy problem setup: commercial real estate data from the city of NYC. Messy. Inconsistent across sources. Extract data from PDFs. Human-entered data == mistakes.
From the scikit-learn documentation: an ensemble model where trees focus on correcting the errors of previous trees
Exam example: focus
Briefly touch on cookies
I did this in Scala
Spark problems
scikit-learn code
Briefly: data infrastructure revamp
Don't despair! I love digging into really terrible data.