The Wild West of Data Wrangling
Sarah Guido
PyTennessee 2018
@sarah_guido
This talk:
• A day in the life of a data scientist
• Three jobs where I’ve dealt with uncooperative data
• Messy, incomplete, inconsistent, hard to get, hard to model
• Not ground truth
Who am I?
• Experienced data scientist
• Data sciencing in Python (and sometimes Scala)
• Wide variety of data: small, large, user, collected in-house, retrieved via API
• Twitter: @sarah_guido
Iris Dataset
Our journey to Mordor
• Techniques to help with data issues
• Necessary data transformations
• Work with less-than-ideal data
Example 1: Techniques to help with data issues
• Commercial real estate data
• Data validity concerns
• Not a lot of data
• Imperfect data for modeling
Example 1: Techniques to help with data issues
• Data validity issues
  • Multiple sources of data for the same data points
  • Entered by humans
  • Missing
  • Order of magnitude off
  • Trapped in PDFs
• Data validity solutions
  • Data point consensus
  • Fill with mean (when the situation allows for it)
  • Discover other complete sources
  • Remove outliers
  • OCR
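The "remove outliers" and "fill with mean" steps above can be sketched in pandas. The `sqft` column and its values are hypothetical; the point is the ordering — drop the order-of-magnitude errors first so they don't pollute the imputed mean:

```python
import pandas as pd

# Hypothetical building records; names and values are illustrative.
df = pd.DataFrame({
    "sqft": [12000.0, None, 9500.0, 11000.0, 950000.0],
})

# Remove order-of-magnitude outliers first (here: anything over
# 10x the median), keeping missing values for the next step.
median = df["sqft"].median()
df = df[(df["sqft"] < 10 * median) | df["sqft"].isna()].copy()

# Then fill the remaining gaps with the column mean, so the
# outlier doesn't skew the imputed value.
df["sqft"] = df["sqft"].fillna(df["sqft"].mean())
```

The 10x-the-median rule is just one possible heuristic; the right threshold depends on the domain.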
Example 1: Techniques to help with data issues
• The problem: can we predict if a building will sell the following year?
• The data: floors, location, square footage, price per square foot, etc.
• The goal: provide valuable insight to platform users
Example 1: Techniques to help with data issues
• First thought: logistic regression using scikit-learn
• Binary classification: sale / no sale
Problem…
Data: 95% no sale, 5% sale
Logistic regression: 95% accurate
DONE!
Problem: Class imbalance
Class imbalance
When the classes you are trying to predict are not equally represented, classification models can become biased toward the majority class.
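A quick way to see the trap: a baseline that always predicts "no sale" already scores 95% on a 95/5 split. A sketch with scikit-learn's `DummyClassifier` on hypothetical labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 95% "no sale" (0), 5% "sale" (1).
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features don't matter for this baseline

# A classifier that always predicts the majority class...
clf = DummyClassifier(strategy="most_frequent").fit(X, y)

# ...scores 95% accuracy while never predicting a single sale.
print(clf.score(X, y))  # 0.95
```

Any model that matches this baseline's accuracy has learned nothing about the sales you actually care about.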
Solution 1: Stratified sampling
Stratified sampling
Creating a training sample that preserves the distribution of classes in your dataset.
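A minimal sketch of stratified sampling with scikit-learn's `train_test_split(stratify=...)`, reusing the same hypothetical 95/5 labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical 95/5 imbalanced labels.
y = np.array([0] * 95 + [1] * 5)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 95/5 class ratio in both splits, so the
# rare "sale" class isn't lost from the training set by chance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

Without `stratify`, an unlucky random split could put all five positive examples in one side.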
Solution 2: Gradient boosting
Gradient boosting
Produces a prediction model as an ensemble of weak prediction models, typically decision trees, each one correcting the errors of the trees before it.
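A hedged sketch with scikit-learn's `GradientBoostingClassifier`. The real building features (floors, location, square footage) are stood in for by a hypothetical `make_classification` dataset with a similar 95/5 imbalance:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical imbalanced dataset: ~95% class 0, ~5% class 1.
X, y = make_classification(
    n_samples=1000, weights=[0.95], flip_y=0, random_state=0
)

# An ensemble of 100 shallow trees, fit sequentially; each tree
# focuses on the examples the previous trees got wrong.
gbc = GradientBoostingClassifier(n_estimators=100, max_depth=3)
gbc.fit(X, y)
preds = gbc.predict(X)
```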
Solution: Techniques to help with data issues
Ways to treat the data: improve validity/quality
Modeling techniques: stratified sampling, gradient boosting
Example 2: Necessary data transformations
• Link click data
• Cookie issues
• Lots of preprocessing to model
The problem: how can we identify similar patterns based on click data?
The data: time, geolocation, cookie, browser user-agent string, referrer
The goal: understand how people interact with content over time
Why Scala?
Problem: Clustering user interactions
K-means clustering
An unsupervised learning method that groups data points together based on a distance metric.
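The talk's clustering ran in Scala; as a Python sketch of the same idea, scikit-learn's `KMeans` on hypothetical per-user feature vectors (two synthetic behavior patterns, so the clusters are easy to see):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-user features (e.g. interaction count, avg gap),
# drawn as two well-separated groups of 50 users each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, (50, 2)),  # one behavior pattern
    rng.normal(5, 1, (50, 2)),  # another
])

# Partition the users into 2 clusters by Euclidean distance.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

Real click data is nowhere near this clean, which is exactly why the transformations below were needed.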
Problem: Clustering the data
• Only look at users with 5 or more interactions
• Each user has a different number of interactions
• Each data point ends up in a different cluster
• Complex feature space
Solution: Transform the data
date: 2017-04-09, 2017-04-13, 2017-04-30, 2017-05-01, 2017-05-12
Length of interactions: 5
Average time between interactions: ~8 days
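The summary above can be computed directly from the dates with the standard library, collapsing a variable-length list of timestamps into fixed-size features:

```python
from datetime import date

# The interaction dates from the slide.
dates = [date(2017, 4, 9), date(2017, 4, 13), date(2017, 4, 30),
         date(2017, 5, 1), date(2017, 5, 12)]

length = len(dates)  # number of interactions: 5

# Days between each consecutive pair of interactions.
gaps = [(b - a).days for a, b in zip(dates, dates[1:])]  # [4, 17, 1, 11]

avg_gap = sum(gaps) / len(gaps)  # 8.25 -> "~8 days"
```

Reducing every user to the same small set of numbers is what makes them comparable in a single feature space.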
Solution: Transform the data
referrer: facebook, twitter
One-hot encode and transform to matrix
• Facebook: [1, 0]
• Twitter: [0, 1]
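A sketch of that one-hot encoding step with pandas' `get_dummies` (the referrer values are from the slide; the series itself is hypothetical):

```python
import pandas as pd

# Hypothetical referrer column.
referrers = pd.Series(["facebook", "twitter", "facebook"])

# One column per referrer; each row becomes a 0/1 indicator vector.
# astype(int) pins the dtype, which varies across pandas versions.
matrix = pd.get_dummies(referrers).astype(int)
# facebook -> [1, 0], twitter -> [0, 1]
```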
Solution: Necessary data transformations
Rework your data in service of the problem you’re trying to solve
Example 3: Work with less-than-ideal data
• Digital media data
• Data access issues
• Difficulty retrieving data
• Data is insufficient
Example 3: Work with less-than-ideal data
The problem: how can we effectively describe our audience?
The data: anonymized demographic and psychographic data
The goal: audience segmentation and channel analysis
Example 3: Work with less-than-ideal data
Problem: insufficient data
• Google Analytics data – only 1/3 of URLs
• Finicky API
• Semi-useless psychographic data
Solution: accept defeat
Solution: make it work!
• Sometimes you just have to settle for what you have
• Segmentation through decomposition techniques
• Go get more data!
• Reorganize the data you have!
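One possible sketch of "segmentation through decomposition techniques", using scikit-learn's `NMF` on a hypothetical audience-by-topic matrix (every name and number here is illustrative, not from the talk):

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical engagement counts: 200 users x 10 content topics.
rng = np.random.default_rng(0)
X = rng.poisson(3, size=(200, 10)).astype(float)

# Decompose into 4 latent "segments": W maps users to segments,
# H maps segments to topics.
nmf = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_

# Assign each user to their strongest segment.
segments = W.argmax(axis=1)
```

PCA, LDA, or clustering could fill the same role; the point is deriving segments from the data you have rather than waiting for the data you wish you had.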
General strategy
• What problem are you trying to solve?
• What’s wrong with your data?
• What do you need that you don’t have?
Keep in mind…
• Data your company collects is complicated
• What you do to your data will affect the model
• Creativity is your friend
• Lots of ways to solve the problem
• You don’t have to accept the data as it is!
Thank you!
@sarah_guido

Editor's Notes

  • #3 what do I actually have to do to get data in shape for modeling
• #5 origin: convo with friends; the Iris dataset is… here’s the problem
  • #7 bootcamps - easy
• #9 transition: let’s begin the journey. Reonomy: use ML techniques to help with data issues.
  • #10 Bitly: transform data as necessary to model
  • #11 Mashable: work with what you have, and if you don’t have something, find a way to get it
• #12 Reonomy problem setup: commercial real estate data from the city of NYC. Messy, inconsistent across sources. Extract data from PDFs. Human-entered data == mistakes.
  • #21 picard facepalm gif
  • #24 from the scikit-learn documentation
• #25 ensemble model where trees focus on correcting the errors of previous trees (exam example)
  • #27 transition
  • #28 Briefly touch on cookies
  • #29 slide then… I did this in scala
  • #30 Spark problems
  • #31 scikit code
  • #42 transition
  • #43 Briefly – data infrastructure revamp
  • #50 don’t despair! I love digging into really terrible data