SlideShare a Scribd company logo
Sampling
An often overlooked art in exploratory
data analysis
Eli Bressert
@astrobiased
Stitch Fix
exploratory
data analysis
what to
optimize
1
2
What we
[data scientists]
do
1. obtain data
2. explore
3. do research/create data product
4. fine tune project and release
5. rinse and repeat
1. obtain data
2.explore
3. do research/create data product
4. fine tune project and release
5. rinse and repeat
basic statistics
simple graphics
formulate hypotheses
assess best models & approaches
graphic simplicity
0etric 00 0etric 01 0etric 02 0etric 03
0etric 04 0etric 05 0etric 06 0etric 07
0etric 08 0etric 09 0etric 10 0etric 11
0etric 12 0etric 13 0etric 14 0etric 15
0etric 16 0etric 17 0etric 18 0etric 19
0etric 20 0etric 21 0etric 22 0etric 23
0etric 24 0etric 25 0etric 26 0etric 27
0etric 28 0etric 29 0etric 30 0etric 31
0etric 32 0etric 33 0etric 34 0etric 35
0etric 36 0etric 37 0etric 38
metric00
metric01
metric02
metric03
metric04
metric05
metric 01
metric 02
metric 03
metric 04
metric 05
metric 06
−0.4
−0.3
−0.2
−0.1
0.0
0.1
0.2
0.3
0.4
−3 −2 −1 0 1 2 3 4
−4
−3
−2
−1
0
1
2
3
Anscombe’s Quartet
10 8.04
8 6.95
13 7.58
9 8.81
11 8.33
14 9.96
6 7.24
4 4.26
12 10.84
7 4.82
5 5.68
10 9.14
8 8.14
13 8.74
9 8.77
11 9.26
14 8.1
6 6.13
4 3.1
12 9.13
7 7.26
5 4.74
10 7.46
8 6.77
13 12.74
9 7.11
11 7.81
14 8.84
6 6.08
4 5.39
12 8.15
7 6.42
5 5.73
8 6.58
8 5.76
8 7.71
8 8.84
8 8.47
8 7.04
8 5.25
19 12.5
8 5.56
8 7.91
8 6.89
I II III IV
import seaborn as sns
from scipy.optimize import curve_fit
def func(x, a, b):
return a + b * x
df = sns.load_dataset(“anscombe")
df.x.mean()
df.y.mean()
df.x.var()
df.y.var()
df.x.corr(tmp.y))
popt, pcov = curve_fit(func, tmp.x, tmp.y)
Mean x: 9.0
Mean y: 7.5
Variance x: 11.00
Variance y: 4.13
Correlation between x and y: 0.816
Linear regression coefficients: y = 3.00 + 0.50x
http://goo.gl/Zuw4Qe
2
4
6
8
10
12
14
y
dataVet I dataVet II
2 4 6 8 10 12 14 16 18 20
x
2
4
6
8
10
12
14
y
dataVet III
2 4 6 8 10 12 14 16 18 20
x
dataVet IV
dataVet
I
II
III
IV
EDA results will affect all that follows
processing speed
faster technology
bigger data
Boundaries
Pushing
You have two options
design your
data sample
plan and
execute
hit the big red
button and wait
for the process
to finish
attention span
?
time cost
hit red button
design and sample
explore, hypothesize, model
explore, hypothesize, model
time
hit red button
design and sample
explore, hypothesize, model
explore, hypothesize, model
time
fail frequently
learn fast
tried and true
models and methods
sampling considerations
what you’re sampling
priors that you can assume
what operations you will run
?

More Related Content

Similar to Sampling: An an often overlooked art in exploratory data analysis

Image Classification
Image ClassificationImage Classification
Image Classification
Anwar Jameel
 
Visual Analytics Best Practices
Visual Analytics Best PracticesVisual Analytics Best Practices
Visual Analytics Best Practices
Tableau Software
 
2013.11.14 Big Data Workshop Bruno Voisin
2013.11.14 Big Data Workshop Bruno Voisin 2013.11.14 Big Data Workshop Bruno Voisin
2013.11.14 Big Data Workshop Bruno Voisin
NUI Galway
 
Information Visualization: See Patterns, Gain Insights & Make Decisions
Information Visualization: See Patterns, Gain Insights & Make DecisionsInformation Visualization: See Patterns, Gain Insights & Make Decisions
Information Visualization: See Patterns, Gain Insights & Make Decisions
University of Maryland
 
Using the python_data_toolkit_timbers_slides
Using the python_data_toolkit_timbers_slidesUsing the python_data_toolkit_timbers_slides
Using the python_data_toolkit_timbers_slides
Tiffany Timbers
 
Python seaborn cheat_sheet
Python seaborn cheat_sheetPython seaborn cheat_sheet
Python seaborn cheat_sheet
Nishant Upadhyay
 
Introduction to machine learning algorithms
Introduction to machine learning algorithmsIntroduction to machine learning algorithms
Introduction to machine learning algorithms
bigdata trunk
 
MyStataLab Assignment Help
MyStataLab Assignment HelpMyStataLab Assignment Help
MyStataLab Assignment Help
Statistics Assignment Help
 
Language Language Models (in 2023) - OpenAI
Language Language Models (in 2023) - OpenAILanguage Language Models (in 2023) - OpenAI
Language Language Models (in 2023) - OpenAI
SamuelButler15
 
Humanizing Data Analysis
Humanizing Data AnalysisHumanizing Data Analysis
Humanizing Data Analysis
Jan Aerts
 
ML基本からResNetまで
ML基本からResNetまでML基本からResNetまで
ML基本からResNetまで
Institute of Agricultural Machinery, NARO
 
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Simplilearn
 
visualisasi data praktik pakai excel, py
visualisasi data praktik pakai excel, pyvisualisasi data praktik pakai excel, py
visualisasi data praktik pakai excel, py
ElmaLyrics
 
Welcome to python
Welcome to pythonWelcome to python
Welcome to python
Kyunghoon Kim
 
A Semantic Web Platform for Improving the Automation and Reproducibility of F...
A Semantic Web Platform for Improving the Automation and Reproducibility of F...A Semantic Web Platform for Improving the Automation and Reproducibility of F...
A Semantic Web Platform for Improving the Automation and Reproducibility of F...
Ratnesh Sahay
 
Seminar PSU 10.10.2014 mme
Seminar PSU 10.10.2014 mmeSeminar PSU 10.10.2014 mme
Seminar PSU 10.10.2014 mme
Vyacheslav Arbuzov
 
Human_Activity_Recognition_Predictive_Model
Human_Activity_Recognition_Predictive_ModelHuman_Activity_Recognition_Predictive_Model
Human_Activity_Recognition_Predictive_ModelDavid Ritchie
 
Info vis 4-2012-part1
Info vis 4-2012-part1Info vis 4-2012-part1
Info vis 4-2012-part1
University of Maryland
 
Chapter 1: Linear Regression
Chapter 1: Linear RegressionChapter 1: Linear Regression
Chapter 1: Linear Regression
AkmelSyed
 
PSOk-NN: A Particle Swarm Optimization Approach to Optimize k-Nearest Neighbo...
PSOk-NN: A Particle Swarm Optimization Approach to Optimize k-Nearest Neighbo...PSOk-NN: A Particle Swarm Optimization Approach to Optimize k-Nearest Neighbo...
PSOk-NN: A Particle Swarm Optimization Approach to Optimize k-Nearest Neighbo...
Aboul Ella Hassanien
 

Similar to Sampling: An an often overlooked art in exploratory data analysis (20)

Image Classification
Image ClassificationImage Classification
Image Classification
 
Visual Analytics Best Practices
Visual Analytics Best PracticesVisual Analytics Best Practices
Visual Analytics Best Practices
 
2013.11.14 Big Data Workshop Bruno Voisin
2013.11.14 Big Data Workshop Bruno Voisin 2013.11.14 Big Data Workshop Bruno Voisin
2013.11.14 Big Data Workshop Bruno Voisin
 
Information Visualization: See Patterns, Gain Insights & Make Decisions
Information Visualization: See Patterns, Gain Insights & Make DecisionsInformation Visualization: See Patterns, Gain Insights & Make Decisions
Information Visualization: See Patterns, Gain Insights & Make Decisions
 
Using the python_data_toolkit_timbers_slides
Using the python_data_toolkit_timbers_slidesUsing the python_data_toolkit_timbers_slides
Using the python_data_toolkit_timbers_slides
 
Python seaborn cheat_sheet
Python seaborn cheat_sheetPython seaborn cheat_sheet
Python seaborn cheat_sheet
 
Introduction to machine learning algorithms
Introduction to machine learning algorithmsIntroduction to machine learning algorithms
Introduction to machine learning algorithms
 
MyStataLab Assignment Help
MyStataLab Assignment HelpMyStataLab Assignment Help
MyStataLab Assignment Help
 
Language Language Models (in 2023) - OpenAI
Language Language Models (in 2023) - OpenAILanguage Language Models (in 2023) - OpenAI
Language Language Models (in 2023) - OpenAI
 
Humanizing Data Analysis
Humanizing Data AnalysisHumanizing Data Analysis
Humanizing Data Analysis
 
ML基本からResNetまで
ML基本からResNetまでML基本からResNetまで
ML基本からResNetまで
 
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
 
visualisasi data praktik pakai excel, py
visualisasi data praktik pakai excel, pyvisualisasi data praktik pakai excel, py
visualisasi data praktik pakai excel, py
 
Welcome to python
Welcome to pythonWelcome to python
Welcome to python
 
A Semantic Web Platform for Improving the Automation and Reproducibility of F...
A Semantic Web Platform for Improving the Automation and Reproducibility of F...A Semantic Web Platform for Improving the Automation and Reproducibility of F...
A Semantic Web Platform for Improving the Automation and Reproducibility of F...
 
Seminar PSU 10.10.2014 mme
Seminar PSU 10.10.2014 mmeSeminar PSU 10.10.2014 mme
Seminar PSU 10.10.2014 mme
 
Human_Activity_Recognition_Predictive_Model
Human_Activity_Recognition_Predictive_ModelHuman_Activity_Recognition_Predictive_Model
Human_Activity_Recognition_Predictive_Model
 
Info vis 4-2012-part1
Info vis 4-2012-part1Info vis 4-2012-part1
Info vis 4-2012-part1
 
Chapter 1: Linear Regression
Chapter 1: Linear RegressionChapter 1: Linear Regression
Chapter 1: Linear Regression
 
PSOk-NN: A Particle Swarm Optimization Approach to Optimize k-Nearest Neighbo...
PSOk-NN: A Particle Swarm Optimization Approach to Optimize k-Nearest Neighbo...PSOk-NN: A Particle Swarm Optimization Approach to Optimize k-Nearest Neighbo...
PSOk-NN: A Particle Swarm Optimization Approach to Optimize k-Nearest Neighbo...
 

Recently uploaded

Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 

Recently uploaded (20)

Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 

Sampling: An an often overlooked art in exploratory data analysis