Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
What to Upload to SlideShare
Next
Download to read offline and view in fullscreen.

Share

Data Science Workflow

Download to read offline

By Andrew Gelman
PyData New York City 2017

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all

Data Science Workflow

  1. 1. Data science workflow Andrew Gelman Dept of Statistics and Dept of Political Science Columbia University, New York PyData, New York, 28 Nov 2017
  2. 2. The (abridged) model in Stan parameters { real b; real<lower=0> sigma_a; real<lower=0> sigma_y; vector[nteams] a; } model { a ~ normal(b*prior_score, sigma_a) sqrt_dif ~ normal(a[team1] - a[team2], sigma_y); }
  3. 3. Fit the model Inference for Stan model: worldcup_first_try. 4 chains, each with iter=2000; warmup=1000; thin=1; post-warmup draws per chain=1000, total post-warmup draws=4000. mean se_mean sd 25% 50% 75% n_eff Rhat b 0.46 0.00 0.09 0.40 0.46 0.52 1039 1.00 sigma_a 0.14 0.00 0.07 0.09 0.13 0.19 203 1.01 sigma_y 0.42 0.00 0.05 0.38 0.42 0.46 956 1.00 a[1] 0.35 0.00 0.13 0.27 0.36 0.44 4000 1.00 a[2] 0.39 0.00 0.12 0.31 0.38 0.46 4000 1.00 a[3] 0.43 0.01 0.15 0.33 0.42 0.52 756 1.00 a[4] 0.20 0.01 0.16 0.11 0.22 0.31 966 1.00 a[5] 0.29 0.00 0.13 0.21 0.29 0.36 4000 1.00 . . .
  4. 4. Graph the estimates
  5. 5. Compare to model fit without prior rankings
  6. 6. Compare model to predictions
  7. 7. After finding and fixing a bug
  8. 8. q q q q q q q q q q q q q q q q q q q 0 5 10 15 20 0.00.20.40.60.81.0 Data on putts in pro golf Distance from hole (feet) Probabilityofsuccess 1346/1443 577/694 337/455 208/353 149/272 136/256 111/240 69/217 67/200 75/237 52/202 46/192 54/174 28/167 27/201 31/195 33/191 20/147 24/152
  9. 9. q q q q q q q q q q q q q q q q q q q 0 5 10 15 20 0.00.20.40.60.81.0 What's the probability of making a golf putt? Distance from hole (feet) Probabilityofsuccess Logistic regression, a = 2.2, b = −0.3
  10. 10. Geometry-based model x R r −2σ 0 2σ
  11. 11. Stan code data { int J; int n[J]; real x[J]; int y[J]; real r; real R; } parameters { real<lower=0> sigma; } model { real p[J]; p = 2*Phi(asin((R-r)/x) / sigma) - 1; y ~ binomial(n, p); }
  12. 12. Fit the model golf <- read.table("golf.txt", header=TRUE, skip=2) x <- golf$x y <- golf$y n <- golf$n J <- length(y) r <- (1.68/2)/12 R <- (4.25/2)/12 fit1 <- stan("golf1.stan")
  13. 13. Check convergence > print(fit1) Inference for Stan model: golf1. 4 chains, each with iter=2000; warmup=1000; thin=1; post-warmup draws per chain=1000, total post-warmup draws=4000. mean se_mean sd 25% 50% 75% n_eff Rhat sigma 0.03 0.00 0.00 0.03 0.03 0.03 1692 1 sigma_degrees 1.53 0.00 0.02 1.51 1.53 1.54 1692 1
  14. 14. q q q q q q q q q q q q q q q q q q q 0 5 10 15 20 0.00.20.40.60.81.0 What's the probability of making a golf putt? Distance from hole (feet) Probabilityofsuccess Geometry−based model, sigma = 1.5
  15. 15. q q q q q q q q q q q q q q q q q q q 0 5 10 15 20 0.00.20.40.60.81.0 Two models fit to the golf putting data Distance from hole (feet) Probabilityofsuccess Logistic regression, a = 2.2, b = −0.3 Geometry−based model, sigma = 1.5
  16. 16. Birthdays!
  17. 17. The published graphs show data from 30 days in the year
  18. 18. 1970 1972 1974 1976 1978 1980 1982 1984 1986 1988 Trends 60 80 100 120 Relative Number of Births Slow trend Fast non-periodic component Mean Mon Tue Wed Thu Fri Sat Sun Dayofweekeffect 60 80 100 120 1972 1976 1980 1984 1988 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Seasonaleffect 60 80 100 120 1972 1976 1980 1984 1988 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Dayofyeareffect 60 80 100 120 New year Valentine's day Leap dayApril 1st Memorial day Independence day Labor day Halloween Thanksgiving Christmas
  19. 19. Mon Tue Wed Thu Fri Sat Sun Dayofweekeffect 60 80 100 120 2002 2006 2010 2014 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Seasonaleffect 60 80 100 120 2002 2006 2010 2014 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Dayofyeareffect 60 80 100 120 New year Valentine's day Leap day April 1st Memorial day Independence day Labor day 9/11 Halloween Thanksgiving Christmas 2000 2002 2004 2006 2008 2010 2012 2014 Trends 60 80 100 120 Relative Number of Births Slow trend Fast non-periodic component Mean
  20. 20. Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Dayofyeareffect 50 60 70 80 90 100 110 120 New year Valentine's day Leap day April 1stMemorial day Independence day Labor day 9/11 Halloween Thanksgiving Christmas 13th day of month
  21. 21. Xbox estimates, adjusting for demographics
  22. 22. Xbox estimates, adjusting for demographics and partisanship
  23. 23. Data from 2016
  24. 24. Some ideas in data science workflow Data and information Replication Fake-data simulation (or statistical theory) Comparing predictions to data The network of models
  • PrakashGiri

    Jun. 19, 2021

By Andrew Gelman PyData New York City 2017

Views

Total views

704

On Slideshare

0

From embeds

0

Number of embeds

0

Actions

Downloads

19

Shares

0

Comments

0

Likes

1

×