The Wild West of Data Wrangling
Sarah Guido
PyTennessee 2018
@sarah_guido
This talk:
• A day in the life of a data scientist
• Three jobs where I’ve dealt with uncooperative data
• Messy, incomplete, inconsistent, hard to get, hard to model
• Not ground truth
Who am I?
• Experienced data scientist
• Data sciencing in Python (and sometimes Scala)
• Wide variety of data: small, large, user, collected in-house, retrieved via API
• Twitter: @sarah_guido
Iris Dataset
Our journey to Mordor
• Techniques to help with data issues
• Necessary data transformations
• Work with less-than-ideal data
Example 1: Techniques to help with data issues
• Commercial real estate data
• Data validity concerns
• Not a lot of data
• Imperfect data for modeling
Example 1: Techniques to help with data issues
• Data validity issues
  • Multiple sources of data for the same data points
  • Entered by humans
  • Missing
  • Order of magnitude off
  • Trapped in PDFs
• Data validity solutions
  • Data point consensus
  • Fill with mean (when the situation allows for it)
  • Discover other complete sources
  • Remove outliers
  • OCR
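The "remove outliers" and "fill with mean" steps above can be sketched in pandas. The `sqft` column and its values are hypothetical; the point is the ordering — drop the order-of-magnitude errors first so they don't pollute the imputed mean:

```python
import pandas as pd

# Hypothetical building records; names and values are illustrative.
df = pd.DataFrame({
    "sqft": [12000.0, None, 9500.0, 11000.0, 950000.0],
})

# Remove order-of-magnitude outliers first (here: anything over
# 10x the median), keeping missing values for the next step.
median = df["sqft"].median()
df = df[(df["sqft"] < 10 * median) | df["sqft"].isna()].copy()

# Then fill the remaining gaps with the column mean, so the
# outlier doesn't skew the imputed value.
df["sqft"] = df["sqft"].fillna(df["sqft"].mean())
```

The 10x-the-median rule is just one possible heuristic; the right threshold depends on the domain.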
Example 1: Techniques to help with data issues
• The problem: can we predict if a building will sell the following year?
• The data: floors, location, square footage, price per square foot, etc.
• The goal: provide valuable insight to platform users
Example 1: Techniques to help with data issues
• First thought: logistic regression using scikit-learn
• Binary classification: sale / no sale
Problem…
Data: 95% no sale, 5% sale
Logistic regression: 95% accurate
DONE!
Problem: Class imbalance
Class imbalance
When the classes you are trying to predict are not equally represented, classification models can become biased toward the majority class.
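A quick way to see the trap: a baseline that always predicts "no sale" already scores 95% on a 95/5 split. A sketch with scikit-learn's `DummyClassifier` on hypothetical labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 95% "no sale" (0), 5% "sale" (1).
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features don't matter for this baseline

# A classifier that always predicts the majority class...
clf = DummyClassifier(strategy="most_frequent").fit(X, y)

# ...scores 95% accuracy while never predicting a single sale.
print(clf.score(X, y))  # 0.95
```

Any model that matches this baseline's accuracy has learned nothing about the sales you actually care about.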
Solution 1: Stratified sampling
Stratified sampling
Creating a training sample that preserves the distribution of classes in your dataset.
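A minimal sketch of stratified sampling with scikit-learn's `train_test_split(stratify=...)`, reusing the same hypothetical 95/5 labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical 95/5 imbalanced labels.
y = np.array([0] * 95 + [1] * 5)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 95/5 class ratio in both splits, so the
# rare "sale" class isn't lost from the training set by chance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

Without `stratify`, an unlucky random split could put all five positive examples in one side.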
Solution 2: Gradient boosting
Gradient boosting
Produces a prediction model as an ensemble of weak prediction models, typically decision trees, each one correcting the errors of the trees before it.
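A hedged sketch with scikit-learn's `GradientBoostingClassifier`. The real building features (floors, location, square footage) are stood in for by a hypothetical `make_classification` dataset with a similar 95/5 imbalance:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical imbalanced dataset: ~95% class 0, ~5% class 1.
X, y = make_classification(
    n_samples=1000, weights=[0.95], flip_y=0, random_state=0
)

# An ensemble of 100 shallow trees, fit sequentially; each tree
# focuses on the examples the previous trees got wrong.
gbc = GradientBoostingClassifier(n_estimators=100, max_depth=3)
gbc.fit(X, y)
preds = gbc.predict(X)
```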
Solution: Techniques to help with data issues
Ways to treat the data: improve validity/quality
Modeling techniques: stratified sampling, gradient boosting
Example 2: Necessary data transformations
• Link click data
• Cookie issues
• Lots of preprocessing to model
The problem: how can we identify similar patterns based on click data?
The data: time, geolocation, cookie, browser user-agent string, referrer
The goal: understand how people interact with content over time
Why Scala?
Problem: Clustering user interactions
K-means clustering
An unsupervised learning method that groups data points together based on a distance metric.
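The talk's clustering ran in Scala; as a Python sketch of the same idea, scikit-learn's `KMeans` on hypothetical per-user feature vectors (two synthetic behavior patterns, so the clusters are easy to see):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-user features (e.g. interaction count, avg gap),
# drawn as two well-separated groups of 50 users each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, (50, 2)),  # one behavior pattern
    rng.normal(5, 1, (50, 2)),  # another
])

# Partition the users into 2 clusters by Euclidean distance.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

Real click data is nowhere near this clean, which is exactly why the transformations below were needed.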
Problem: Clustering the data
• Only look at users with 5 or more interactions
• Each user has a different number of interactions
• Each data point ends up in a different cluster
• Complex feature space
Solution: Transform the data
date: 2017-04-09, 2017-04-13, 2017-04-30, 2017-05-01, 2017-05-12
Length of interactions: 5
Average time between interactions: ~8 days
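The summary above can be computed directly from the dates with the standard library, collapsing a variable-length list of timestamps into fixed-size features:

```python
from datetime import date

# The interaction dates from the slide.
dates = [date(2017, 4, 9), date(2017, 4, 13), date(2017, 4, 30),
         date(2017, 5, 1), date(2017, 5, 12)]

length = len(dates)  # number of interactions: 5

# Days between each consecutive pair of interactions.
gaps = [(b - a).days for a, b in zip(dates, dates[1:])]  # [4, 17, 1, 11]

avg_gap = sum(gaps) / len(gaps)  # 8.25 -> "~8 days"
```

Reducing every user to the same small set of numbers is what makes them comparable in a single feature space.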
Solution: Transform the data
referrer: facebook, twitter
One-hot encode and transform to matrix
• Facebook: [1, 0]
• Twitter: [0, 1]
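A sketch of that one-hot encoding step with pandas' `get_dummies` (the referrer values are from the slide; the series itself is hypothetical):

```python
import pandas as pd

# Hypothetical referrer column.
referrers = pd.Series(["facebook", "twitter", "facebook"])

# One column per referrer; each row becomes a 0/1 indicator vector.
# astype(int) pins the dtype, which varies across pandas versions.
matrix = pd.get_dummies(referrers).astype(int)
# facebook -> [1, 0], twitter -> [0, 1]
```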
Solution: Necessary data transformations
Rework your data in service of the problem you’re trying to solve
Example 3: Work with less-than-ideal data
• Digital media data
• Data access issues
• Difficulty retrieving data
• Data is insufficient
Example 3: Work with less-than-ideal data
The problem: how can we effectively describe our audience?
The data: anonymized demographic and psychographic data
The goal: audience segmentation and channel analysis
Example 3: Work with less-than-ideal data
Problem: insufficient data
• Google Analytics data – only 1/3 of URLs
• Finicky API
• Semi-useless psychographic data
Solution: accept defeat
Solution: make it work!
• Sometimes you just have to settle for what you have
• Segmentation through decomposition techniques
• Go get more data!
• Reorganize the data you have!
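One possible sketch of "segmentation through decomposition techniques", using scikit-learn's `NMF` on a hypothetical audience-by-topic matrix (every name and number here is illustrative, not from the talk):

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical engagement counts: 200 users x 10 content topics.
rng = np.random.default_rng(0)
X = rng.poisson(3, size=(200, 10)).astype(float)

# Decompose into 4 latent "segments": W maps users to segments,
# H maps segments to topics.
nmf = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_

# Assign each user to their strongest segment.
segments = W.argmax(axis=1)
```

PCA, LDA, or clustering could fill the same role; the point is deriving segments from the data you have rather than waiting for the data you wish you had.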
General strategy
• What problem are you trying to solve?
• What’s wrong with your data?
• What do you need that you don’t have?
Keep in mind…
• Data your company collects is complicated
• What you do to your data will affect the model
• Creativity is your friend
• Lots of ways to solve the problem
• You don’t have to accept the data as it is!
Thank you!
@sarah_guido

Editor's Notes

  • #3 what do I actually have to do to get data in shape for modeling
• #5 origin: convo with friends; the Iris dataset is… here’s the problem
  • #7 bootcamps - easy
• #9 transition: let’s begin the journey. Reonomy: use ML techniques to help with data issues.
  • #10 Bitly: transform data as necessary to model
  • #11 Mashable: work with what you have, and if you don’t have something, find a way to get it
• #12 Reonomy problem setup: commercial real estate data from the city of NYC. Messy, inconsistent across sources. Extract data from PDFs. Human-entered data == mistakes.
  • #21 picard facepalm gif
  • #24 from the scikit-learn documentation
• #25 ensemble model where trees focus on correcting the errors of previous trees (exam example)
  • #27 transition
  • #28 Briefly touch on cookies
  • #29 slide then… I did this in scala
  • #30 Spark problems
  • #31 scikit code
  • #42 transition
  • #43 Briefly – data infrastructure revamp
  • #50 don’t despair! I love digging into really terrible data