1. Data science workflow
Andrew Gelman
Dept of Statistics and Dept of Political Science
Columbia University, New York
PyData, New York, 28 Nov 2017
4. The (abridged) model in Stan
parameters {
real b;
real<lower=0> sigma_a;
real<lower=0> sigma_y;
vector[nteams] a;
}
model {
a ~ normal(b*prior_score, sigma_a)
sqrt_dif ~ normal(a[team1] - a[team2], sigma_y);
}
5. Fit the model
Inference for Stan model: worldcup_first_try.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 25% 50% 75% n_eff Rhat
b 0.46 0.00 0.09 0.40 0.46 0.52 1039 1.00
sigma_a 0.14 0.00 0.07 0.09 0.13 0.19 203 1.01
sigma_y 0.42 0.00 0.05 0.38 0.42 0.46 956 1.00
a[1] 0.35 0.00 0.13 0.27 0.36 0.44 4000 1.00
a[2] 0.39 0.00 0.12 0.31 0.38 0.46 4000 1.00
a[3] 0.43 0.01 0.15 0.33 0.42 0.52 756 1.00
a[4] 0.20 0.01 0.16 0.11 0.22 0.31 966 1.00
a[5] 0.29 0.00 0.13 0.21 0.29 0.36 4000 1.00
. . .
13. Stan code
data {
int J;
int n[J];
real x[J];
int y[J];
real r;
real R;
}
parameters {
real<lower=0> sigma;
}
model {
real p[J];
p = 2*Phi(asin((R-r)/x) / sigma) - 1;
y ~ binomial(n, p);
}
14. Fit the model
golf <- read.table("golf.txt", header=TRUE, skip=2)
x <- golf$x
y <- golf$y
n <- golf$n
J <- length(y)
r <- (1.68/2)/12
R <- (4.25/2)/12
fit1 <- stan("golf1.stan")
15. Check convergence
> print(fit1)
Inference for Stan model: golf1.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 25% 50% 75% n_eff Rhat
sigma 0.03 0.00 0.00 0.03 0.03 0.03 1692 1
sigma_degrees 1.53 0.00 0.02 1.51 1.53 1.54 1692 1
16. q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
0 5 10 15 20
0.00.20.40.60.81.0
What's the probability of making a golf putt?
Distance from hole (feet)
Probabilityofsuccess
Geometry−based model,
sigma = 1.5
17. q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
0 5 10 15 20
0.00.20.40.60.81.0
Two models fit to the golf putting data
Distance from hole (feet)
Probabilityofsuccess
Logistic regression,
a = 2.2, b = −0.3
Geometry−based model,
sigma = 1.5
22. 1970 1972 1974 1976 1978 1980 1982 1984 1986 1988
Trends
60
80
100
120
Relative Number of Births
Slow trend
Fast non-periodic component
Mean
Mon Tue Wed Thu Fri Sat Sun
Dayofweekeffect
60
80
100
120
1972
1976
1980
1984
1988
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Seasonaleffect
60
80
100
120
1972
1976
1980
1984
1988
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Dayofyeareffect
60
80
100
120
New year
Valentine's day
Leap dayApril 1st Memorial day
Independence day
Labor day
Halloween
Thanksgiving
Christmas
23. Mon Tue Wed Thu Fri Sat Sun
Dayofweekeffect
60
80
100
120
2002
2006
2010
2014
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Seasonaleffect
60
80
100
120
2002
2006
2010
2014
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Dayofyeareffect
60
80
100
120
New year
Valentine's day
Leap day
April 1st Memorial day
Independence day
Labor day
9/11
Halloween
Thanksgiving
Christmas
2000 2002 2004 2006 2008 2010 2012 2014
Trends
60
80
100
120
Relative Number of Births
Slow trend
Fast non-periodic component
Mean
24. Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Dayofyeareffect
50
60
70
80
90
100
110
120
New year
Valentine's day
Leap day
April 1stMemorial day
Independence day
Labor day
9/11
Halloween
Thanksgiving
Christmas
13th day of month
33. Some ideas in data science workflow
Data and information
Replication
Fake-data simulation (or statistical theory)
Comparing predictions to data
The network of models