Sampling
An often overlooked art in exploratory data analysis
Eli Bressert
@astrobiased
Stitch Fix
1. exploratory data analysis
2. what to optimize
What we [data scientists] do
1. obtain data
2. explore
3. do research/create data product
4. fine tune project and release
5. rinse and repeat
2. explore
basic statistics
simple graphics
formulate hypotheses
assess best models & approaches
graphic simplicity
[Figure: grid of panels titled Metric 00 through Metric 38]
[Figure: plot labeled metric00 through metric05 and metric 01 through metric 06, with a scale from −0.4 to 0.4]
[Figure: plot with axes running from −3 to 4 and from −4 to 3]
Anscombe’s Quartet
     I             II            III           IV
 x     y       x     y       x     y       x     y
10   8.04     10   9.14     10   7.46      8   6.58
 8   6.95      8   8.14      8   6.77      8   5.76
13   7.58     13   8.74     13  12.74      8   7.71
 9   8.81      9   8.77      9   7.11      8   8.84
11   8.33     11   9.26     11   7.81      8   8.47
14   9.96     14   8.10     14   8.84      8   7.04
 6   7.24      6   6.13      6   6.08      8   5.25
 4   4.26      4   3.10      4   5.39     19  12.50
12  10.84     12   9.13     12   8.15      8   5.56
 7   4.82      7   7.26      7   6.42      8   7.91
 5   5.68      5   4.74      5   5.73      8   6.89
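import seaborn as sns
from scipy.optimize import curve_fit

def func(x, a, b):
    # straight line with intercept a and slope b
    return a + b * x

df = sns.load_dataset("anscombe")
# `tmp` is not defined on the slide; assumed here to be one dataset of the quartet
tmp = df[df.dataset == "I"]
print(tmp.x.mean())
print(tmp.y.mean())
print(tmp.x.var())
print(tmp.y.var())
print(tmp.x.corr(tmp.y))
popt, pcov = curve_fit(func, tmp.x, tmp.y)  # popt holds (a, b)
print(popt)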
Mean x: 9.0
Mean y: 7.5
Variance x: 11.00
Variance y: 4.13
Correlation between x and y: 0.816
Linear regression coefficients: y = 3.00 + 0.50x
http://goo.gl/Zuw4Qe
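The summary statistics for the quartet can also be checked without seaborn or SciPy. The following is a minimal sketch in plain Python, using the values from the quartet table; the names `QUARTET` and `summarize` are illustrative and not from the talk. It computes the mean, sample variance, Pearson correlation, and least-squares line for each of the four datasets.

```python
# Anscombe's quartet, typed in from the table in this deck.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, sample variances, Pearson correlation, and least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "var_x": sxx / (n - 1), "var_y": syy / (n - 1),
        "corr": sxy / (sxx * syy) ** 0.5,
        "intercept": mean_y - slope * mean_x, "slope": slope,
    }


summaries = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

All four datasets print nearly identical summaries, which is the point of the quartet: the numbers alone do not reveal how different the data look when plotted.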
[Figure: Anscombe's Quartet, four panels of y versus x, one each for dataset I, II, III, and IV]
EDA results will affect all that follows
Pushing Boundaries
processing speed
faster technology
bigger data
You have two options
1. design your data sample, plan and execute
2. hit the big red button and wait for the process to finish
attention span
?
time cost
hit red button
design and sample
explore, hypothesize, model
explore, hypothesize, model
time
fail frequently
learn fast
tried and true
models and methods
sampling considerations
what you’re sampling
priors that you can assume
what operations you will run
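One way these considerations can shape a sample design is stratified sampling, sketched below as an illustration rather than as the method from the talk. It assumes a prior that the data contain a rare segment that a plain random sample might underrepresent; the record layout, segment names, and the function `stratified_sample` are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every stratum so group proportions are preserved."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one record per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical records of the form (segment, value), with one rare segment.
records = [("common", i) for i in range(9_000)] + [("rare", i) for i in range(1_000)]

sample = stratified_sample(records, key=lambda rec: rec[0], fraction=0.01)
counts = Counter(segment for segment, _ in sample)
print(counts)
```

Sampling per stratum guarantees that every segment appears in the sample in its original proportion, which matters when the operations you plan to run are grouped by that segment.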
?
