Sampling: An often overlooked art in exploratory data analysis

Eli Bressert
@astrobiased
Stitch Fix

1. exploratory data analysis
2. what to optimize

1. obtain data
2. explore
3. do research/create data product
4. fine tune project and release
5. rinse and repeat

2. explore

basic statistics
simple graphics
formulate hypotheses
assess best models & approaches
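The first two items above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the `values` list is hypothetical data (random draws), not a metric from the talk, and the text histogram stands in for a real plotting library.

```python
import math
import random
import statistics
from collections import Counter

# Hypothetical metric: 1,000 draws from a standard normal distribution.
rng = random.Random(42)
values = [rng.gauss(0, 1) for _ in range(1000)]

# Basic statistics.
summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
    "min": min(values),
    "max": max(values),
}
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")

# Simple graphic: a text histogram with one-unit-wide bins.
bins = Counter(math.floor(v) for v in values)
for left in sorted(bins):
    print(f"[{left:+d}, {left + 1:+d}) {'#' * (bins[left] // 10)}")
```

A summary like this is only a starting point for forming hypotheses; it says nothing yet about the shape of the relationship between variables.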

[Figure: wide table with columns metric 00, metric 01, metric 02, metric 03 ... metric 36, metric 37, metric 38]

[Figure: pairwise heatmap of metric 00 through metric 06, color scale from −0.4 to 0.4]

      I             II            III            IV
  x     y       x     y       x     y       x     y
 10    8.04    10    9.14    10    7.46     8    6.58
  8    6.95     8    8.14     8    6.77     8    5.76
 13    7.58    13    8.74    13   12.74     8    7.71
  9    8.81     9    8.77     9    7.11     8    8.84
 11    8.33    11    9.26    11    7.81     8    8.47
 14    9.96    14    8.10    14    8.84     8    7.04
  6    7.24     6    6.13     6    6.08     8    5.25
  4    4.26     4    3.10     4    5.39    19   12.50
 12   10.84    12    9.13    12    8.15     8    5.56
  7    4.82     7    7.26     7    6.42     8    7.91
  5    5.68     5    4.74     5    5.73     8    6.89

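import seaborn as sns
from scipy.optimize import curve_fit

def func(x, a, b):
    # straight line with intercept a and slope b
    return a + b * x

df = sns.load_dataset("anscombe")
# tmp was not defined on the slide; assumed here to be one of the
# four datasets (repeat for "II", "III", "IV")
tmp = df[df.dataset == "I"]
tmp.x.mean()
tmp.y.mean()
tmp.x.var()
tmp.y.var()
tmp.x.corr(tmp.y)
# popt holds the fitted (a, b); pcov is their covariance matrix
popt, pcov = curve_fit(func, tmp.x, tmp.y)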

Mean x: 9.0
Mean y: 7.5
Variance x: 11.00
Variance y: 4.13
Correlation between x and y: 0.816
Linear regression coefficients: y = 3.00 + 0.50x
http://goo.gl/Zuw4Qe
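The summary values above can be checked without seaborn or SciPy. The following is a minimal sketch using only the Python standard library, with the quartet values transcribed from the table in the slides; the `summarize` helper and the ordinary least-squares formula (slope = covariance / variance of x) are my own choices for illustration, not the code from the talk.

```python
from statistics import mean, variance

# Anscombe's quartet, transcribed from the table in the slides.
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I, II, III
quartet = {
    "I": (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return mean, variance, correlation and least-squares line for (x, y)."""
    mx, my = mean(x), mean(y)
    vx, vy = variance(x), variance(y)
    # Sample covariance (n - 1 denominator, matching statistics.variance).
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / vx
    return {
        "mean_x": mx, "mean_y": my, "var_x": vx, "var_y": vy,
        "corr": cov / (vx * vy) ** 0.5,
        "intercept": my - slope * mx, "slope": slope,
    }

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

All four datasets should print nearly identical summaries, which is the point of the quartet: the numbers alone cannot distinguish datasets that look very different when plotted.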

[Figure: scatter plots of y versus x in four panels: dataset I, dataset II, dataset III, dataset IV]

EDA results will affect all that follows

design your data sample
plan and execute
hit the big red button and wait for the process to finish

[Figure: two timelines along a time axis. One starts with "hit red button", the other with "design and sample"; each is followed by "explore, hypothesize, model".]

tried and true models and methods

what you’re sampling
priors that you can assume
what operations you will run
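As one illustration of why these questions matter, the sketch below compares a simple random sample with a stratified sample. Everything here is an assumption made for the example: the `population` rows, the `group` field, and both helper functions are hypothetical, and stratification is just one of several sampling designs that could apply.

```python
import random
from collections import defaultdict

def simple_random_sample(rows, n, seed=0):
    """Draw n rows uniformly at random without replacement."""
    return random.Random(seed).sample(rows, n)

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every stratum, keeping at least one row each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 10 "rare" rows and 990 "common" rows.
population = [{"id": i, "group": "rare" if i < 10 else "common"} for i in range(1000)]

srs = simple_random_sample(population, 100)
strat = stratified_sample(population, key=lambda r: r["group"], fraction=0.1)

print("simple random:", len(srs), "rows,", sum(r["group"] == "rare" for r in srs), "rare")
print("stratified:   ", len(strat), "rows,", sum(r["group"] == "rare" for r in strat), "rare")
```

The stratified draw guarantees that the rare group appears in the sample, while the simple random draw may or may not include it. Which design is appropriate depends on what is being sampled, what can be assumed about it beforehand, and which operations will run on the result.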
