The Wild West of Data Wrangling
Sarah Guido
PyTennessee 2018
@sarah_guido
This talk:
• A day in the life of a data scientist
• Three jobs where I’ve dealt with uncooperative data
• Messy, incomplete, inconsistent, hard to get, hard to model
• Not ground truth
Who am I?
• Experienced data scientist
• Data sciencing in Python (and sometimes Scala)
• Wide variety of data: small, large, user, collected in-house, retrieved via API
• Twitter: @sarah_guido
Iris Dataset
Our journey to Mordor
• Techniques to help with data issues
• Necessary data transformations
• Work with less than ideal data
Example 1: Techniques to help with data issues
• Commercial real estate data
• Data validity concerns
• Not a lot of data
• Imperfect data for modeling
Example 1: Techniques to help with data issues
• Data validity issues
• Multiple sources of data for the same data points
• Entered by humans
• Missing
• Order of magnitude off
• Trapped in PDFs
• Data validity solutions
• Data point consensus
• Fill with mean (when situation allows for it)
• Discover other complete sources
• Remove outliers
• OCR
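Three of the solutions listed above (data point consensus, filling with the mean, and removing outliers) can be sketched in a few lines. This is only an illustration: the building names, the values, the choice of the median as the consensus rule, and the factor-of-5 outlier threshold are all assumptions, not details from the talk.

```python
from statistics import mean, median

# Hypothetical square footage reported for the same buildings by three sources.
# None marks a missing value; 80000 is a made-up order-of-magnitude entry error.
reports = {
    "bldg_a": [12000, 12500, None],
    "bldg_b": [8000, 80000, 8200],
    "bldg_c": [None, None, None],
    "bldg_d": [15000, 15000, 14800],
}


def consensus(values):
    """Median of the non-missing reports, or None if every source is missing."""
    present = [v for v in values if v is not None]
    return median(present) if present else None


# Data point consensus: one value per building across sources.
sqft = {name: consensus(vals) for name, vals in reports.items()}

# Fill with mean: only sensible when the situation allows for it.
fill_value = mean(v for v in sqft.values() if v is not None)
filled = {name: (v if v is not None else fill_value) for name, v in sqft.items()}

# Remove outliers: drop values far from the median (arbitrary factor of 5).
prices = [210, 195, 2300, 220, 205]  # hypothetical price per sqft
typical = median(prices)
kept = [p for p in prices if typical / 5 <= p <= typical * 5]

print(sqft)
print(filled)
print(kept)
```

The median is used for consensus here because a single entry that is an order of magnitude off does not move it, whereas it would drag a mean.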
Example 1: Techniques to help with data issues
• The problem: can we predict if a building will sell the following year?
• The data: floors, location, square footage, price per sqft, etc.
• The goal: provide valuable insight to platform users
Example 1: Techniques to help with data issues
• First thought: logistic regression using scikit-learn
• Binary classification: sale/no sale
Problem…
Data: 95% no sale, 5% sale
Logistic regression: 95% accurate
DONE!
Problem: Class imbalance
Class imbalance
When the classes you are trying to predict are not equally represented, classification models can become biased toward the majority class.
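To see why 95% accuracy is not good news on a 95/5 split, consider a "model" that always predicts the majority class. The sketch below uses made-up labels with the same proportions as the slide; it is not the talk's data or model.

```python
# 1 = sale, 0 = no sale; 5% positives, mirroring the split on the slide.
y_true = [1] * 50 + [0] * 950

# A "model" that has learned nothing and always predicts "no sale".
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: what fraction of real sales did we catch?
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# prints: accuracy=0.95 recall=0.00
```

Accuracy matches the headline number while the model never identifies a single sale, which is why a per-class metric such as recall is worth checking alongside it.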
Solution 1: Stratified sampling
Stratified sampling
Creating a training sample that preserves the distribution of classes in your dataset.
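One way to do this in scikit-learn is the `stratify` argument of `train_test_split`. The sketch below is an assumption about how it could look, using synthetic features and the 95/5 label split from the earlier slide; it is not the code from the talk.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 4 features, 5% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 50 + [0] * 950)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("train positives:", y_train.sum(), "of", len(y_train))
print("test positives:", y_test.sum(), "of", len(y_test))
```

Without `stratify`, a random split of a rare class can leave the test set with very few positives, or none, which makes evaluation unreliable.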
Solution 2: Gradient boosting
Gradient boosting
Produces a prediction model in the form of an ensemble of
weak prediction models, typically decision trees.
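A minimal sketch of gradient boosting on an imbalanced problem, using scikit-learn's `GradientBoostingClassifier`. The synthetic dataset from `make_classification` and every hyperparameter value here are assumptions chosen for illustration; the talk does not show its settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 95% class 0 and 5% class 1.
X, y = make_classification(
    n_samples=2000,
    n_features=8,
    n_informative=4,
    weights=[0.95, 0.05],
    random_state=0,
)

# Stratified split so both sets keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# An ensemble of shallow decision trees, each fit to the errors of the
# trees that came before it.
model = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("accuracy:", model.score(X_test, y_test))
print("recall on the minority class:", recall_score(y_test, pred))
```

Reporting minority-class recall next to accuracy makes it visible whether the model is doing more than predicting "no sale" for everything.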
Solution: Techniques to help with data issues
Ways to treat the data: improve validity/quality
Preprocessing and modeling techniques: stratified sampling, gradient boosting
Example 2: Necessary data transformations
• Link click data
• Cookie issues
• Lots of preprocessing to model
Example 2: Necessary data transformations
The problem: how can we identify similar patterns based on click data?
The data: time, geolocation, cookie, browser user agent string, referrer
The goal: understand how people interact with content over time
Why Scala?
Problem: Clustering user interactions
K-means clustering
An unsupervised learning method of grouping data together
based on a distance metric.
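As a small illustration of the idea, the sketch below runs scikit-learn's `KMeans` (which uses Euclidean distance) on two synthetic, well-separated groups of points. The data and the choice of two clusters are assumptions for the example, not the click data from the talk.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of 2-D points, far apart from each other.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
X = np.vstack([group_a, group_b])

# No labels are given; k-means assigns each point to the nearest centroid.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

print("cluster sizes:", np.bincount(labels))
print("centroids:")
print(kmeans.cluster_centers_)
```

Because the algorithm works on distances between fixed-length numeric vectors, every item being clustered needs the same set of features, which is what motivates the transformations later in this example.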
Problem: Clustering the data
• Only look at users with 5 or more interactions
• Each user has a different number of interactions
• Each data point ends up in a different cluster
• Complex feature space
Solution: Transform the data
date: 2017-04-09, 2017-04-13, 2017-04-30, 2017-05-01, 2017-05-12
Number of interactions: 5
Average time between interactions: ~8 days
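The two summary features on this slide can be derived from the list of dates as follows. The variable names are illustrative, and this is a plain-Python sketch rather than the Scala/Spark code the talk refers to.

```python
from datetime import date

# The five interaction dates from the slide, for a single user.
visits = [
    date(2017, 4, 9),
    date(2017, 4, 13),
    date(2017, 4, 30),
    date(2017, 5, 1),
    date(2017, 5, 12),
]

# Days between each consecutive pair of interactions.
gaps = [(later - earlier).days for earlier, later in zip(visits, visits[1:])]

n_interactions = len(visits)
avg_gap_days = sum(gaps) / len(gaps)

print(n_interactions, gaps, avg_gap_days)
# prints: 5 [4, 17, 1, 11] 8.25
```

Collapsing a variable-length history into fixed summaries like these gives every user one row with the same columns, regardless of how many interactions they had.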
Solution: Transform the data
referrer: facebook, twitter
One-hot encode and transform to matrix
• Facebook: [1, 0]
• Twitter: [0, 1]
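A hand-rolled sketch of the encoding above, to make the mapping explicit. The `one_hot` helper and the sample click list are hypothetical; in practice scikit-learn's `OneHotEncoder` does the same job.

```python
def one_hot(values, categories):
    """Return one row per value, with a 1 in the column of its category."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows


categories = ["facebook", "twitter"]
clicks = ["facebook", "twitter", "twitter"]  # hypothetical referrers

matrix = one_hot(clicks, categories)
print(matrix)
# prints: [[1, 0], [0, 1], [0, 1]]
```

Encoding categories as separate 0/1 columns avoids implying an order or a distance between referrers, which an integer code such as facebook=0, twitter=1 would do.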
Solution: Necessary data transformations
Rework your data in service of
the problem you’re trying to
solve
Example 3: Work with less than ideal data
• Digital media data
• Data access issues
• Difficulty retrieving data
• Data is insufficient
Example 3: Work with less than ideal data
The problem: how can we effectively describe our audience?
The data: anonymized demographic and psychographic data
The goal: audience segmentation and channel analysis
Problem: insufficient data
• Google Analytics data – 1/3 of URLs
• Finicky API
• Semi-useless psychographic data
Solution: accept defeat
Solution: make it work!
• Sometimes you just have to settle for what you have
• Segmentation through decomposition techniques
• Go get more data!
• Reorganize the data you have!
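The slides do not say which decomposition technique was used for segmentation. As one possibility, the sketch below applies PCA from scikit-learn to a made-up audience matrix; the data, its shape, and the choice of two components are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical audience matrix: 200 pieces of content, each described by
# 6 demographic share columns (random numbers stand in for real data).
rng = np.random.default_rng(0)
audience = rng.random((200, 6))

# Project onto the 2 directions that capture the most variance.
pca = PCA(n_components=2, random_state=0)
reduced = pca.fit_transform(audience)

print("reduced shape:", reduced.shape)
print("variance explained per component:", pca.explained_variance_ratio_)
```

The reduced coordinates can then be inspected or clustered to look for audience segments. Other decompositions, such as NMF, would slot into the same pattern.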
General strategy
• What problem are you trying to solve?
• What’s wrong with your data?
• What do you need that you don’t have?
Keep in mind…
• Data your company collects is complicated
• What you do to your data will affect the model
• Creativity is your friend
• Lots of ways to solve the problem
• You don’t have to accept the data as it is!
Thank you!
@sarah_guido

Editor's Notes

  1. what do I actually have to do to get data in shape for modeling
  2. origin – convo with friends the iris dataset is… here’s the problem
  3. bootcamps - easy
  4. transition: let’s begin the journey. Reonomy: use ML techniques to help with data issues.
  5. Bitly: transform data as necessary to model
  6. Mashable: work with what you have, and if you don’t have something, find a way to get it
  7. Reonomy problem setup: commercial real estate data from New York City. messy. inconsistent across sources. extract data from PDFs. human-entered data == mistakes
  8. picard facepalm gif
  9. from the scikit-learn documentation
  10. ensemble model where each tree focuses on correcting the errors of previous trees. exam example: focus
  11. transition
  12. Briefly touch on cookies
  13. slide then… I did this in Scala
  14. Spark problems
  15. scikit code
  16. transition
  17. Briefly – data infrastructure revamp
  18. don’t despair! I love digging into really terrible data