Data Science Workflow

Andrew Gelman
Dept of Statistics and Dept of Political Science
Columbia University, New York
PyData, New York, 28 Nov 2017
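The (abridged) model in Stan
parameters {
  real b;
  real<lower=0> sigma_a;
  real<lower=0> sigma_y;
  vector[nteams] a;
}
model {
  // team abilities are centered on a multiple of the prior score
  a ~ normal(b*prior_score, sigma_a);
  // each game outcome depends on the difference in abilities
  sqrt_dif ~ normal(a[team1] - a[team2], sigma_y);
}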
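As a rough illustration of what the two sampling statements say, the following Python sketch draws fake data from the same generative structure. It is not part of the talk: the function name `simulate_worldcup`, the prior scores, and the game schedule are made up for illustration, and the parameter values are simply the posterior means from the fit summary.

```python
import random

def simulate_worldcup(prior_score, games, b, sigma_a, sigma_y, seed=0):
    """Draw team abilities and game outcomes from the abridged model.

    prior_score: one prior score per team (hypothetical inputs)
    games: list of (team1, team2) index pairs, 0-based
    Returns (a, sqrt_dif).
    """
    rng = random.Random(seed)
    # a ~ normal(b * prior_score, sigma_a)
    a = [rng.gauss(b * s, sigma_a) for s in prior_score]
    # sqrt_dif ~ normal(a[team1] - a[team2], sigma_y)
    sqrt_dif = [rng.gauss(a[t1] - a[t2], sigma_y) for t1, t2 in games]
    return a, sqrt_dif

# Made-up inputs, for illustration only
prior_score = [1.0, 0.5, 0.0, -0.5, -1.0]
games = [(0, 1), (2, 3), (0, 4), (1, 2)]

a, sqrt_dif = simulate_worldcup(prior_score, games, b=0.46, sigma_a=0.14, sigma_y=0.42)
for (t1, t2), d in zip(games, sqrt_dif):
    print(f"team {t1} vs team {t2}: simulated sqrt_dif = {d:+.2f}")
```

Fitting the model to data simulated this way, and checking that the known parameter values are recovered, is one way to test the code before using real data.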
Fit the model
Inference for Stan model: worldcup_first_try.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean   sd  25%  50%  75% n_eff Rhat
b        0.46    0.00 0.09 0.40 0.46 0.52  1039 1.00
sigma_a  0.14    0.00 0.07 0.09 0.13 0.19   203 1.01
sigma_y  0.42    0.00 0.05 0.38 0.42 0.46   956 1.00
a[1]     0.35    0.00 0.13 0.27 0.36 0.44  4000 1.00
a[2]     0.39    0.00 0.12 0.31 0.38 0.46  4000 1.00
a[3]     0.43    0.01 0.15 0.33 0.42 0.52   756 1.00
a[4]     0.20    0.01 0.16 0.11 0.22 0.31   966 1.00
a[5]     0.29    0.00 0.13 0.21 0.29 0.36  4000 1.00
. . .
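One way to read the summary: the se_mean column is the Monte Carlo standard error of the posterior mean, which is approximately sd / sqrt(n_eff). The Python sketch below checks that relation against the rounded values printed in the summary. The helper name `mc_se` is an illustrative choice, and because the inputs are rounded to two decimals this is only a consistency check.

```python
import math

# (name, sd, n_eff, reported se_mean), copied from the printed summary
rows = [
    ("b",       0.09, 1039, 0.00),
    ("sigma_a", 0.07,  203, 0.00),
    ("sigma_y", 0.05,  956, 0.00),
    ("a[1]",    0.13, 4000, 0.00),
    ("a[2]",    0.12, 4000, 0.00),
    ("a[3]",    0.15,  756, 0.01),
    ("a[4]",    0.16,  966, 0.01),
    ("a[5]",    0.13, 4000, 0.00),
]

def mc_se(sd, n_eff):
    """Approximate Monte Carlo standard error of a posterior mean."""
    return sd / math.sqrt(n_eff)

for name, sd, n_eff, reported in rows:
    approx = mc_se(sd, n_eff)
    print(f"{name:8s} sd/sqrt(n_eff) = {approx:.4f}   reported se_mean = {reported:.2f}")
```

The parameters with the smallest n_eff relative to their sd (a[3] and a[4]) are the ones whose se_mean rounds up to 0.01.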
Graph the estimates
Compare to model fit without prior rankings
Compare model to predictions
After finding and fixing a bug
[Figure: Data on putts in pro golf. Horizontal axis: distance from hole (feet), 0 to 20. Vertical axis: probability of success, 0 to 1. The 19 points are labeled with successes/attempts: 1346/1443, 577/694, 337/455, 208/353, 149/272, 136/256, 111/240, 69/217, 67/200, 75/237, 52/202, 46/192, 54/174, 28/167, 27/201, 31/195, 33/191, 20/147, 24/152.]
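The labels on the plot are raw counts, so the plotted success rates and their uncertainties can be recomputed directly. The Python sketch below does this with the usual binomial standard error sqrt(p(1-p)/n). The counts are kept in the order they appear on the slide; the sketch does not assign distances to them, since the transcript does not list the distances explicitly.

```python
import math

# successes/attempts, in the order the labels appear on the slide
counts = [
    (1346, 1443), (577, 694), (337, 455), (208, 353), (149, 272),
    (136, 256), (111, 240), (69, 217), (67, 200), (75, 237),
    (52, 202), (46, 192), (54, 174), (28, 167), (27, 201),
    (31, 195), (33, 191), (20, 147), (24, 152),
]

def proportion_and_se(y, n):
    """Empirical success rate and its binomial standard error."""
    p = y / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se

for y, n in counts:
    p, se = proportion_and_se(y, n)
    print(f"{y:4d}/{n:<4d}  p = {p:.3f}  se = {se:.3f}")
```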
[Figure: What's the probability of making a golf putt? Horizontal axis: distance from hole (feet), 0 to 20. Vertical axis: probability of success, 0 to 1. The 19 data points with a fitted curve: logistic regression, a = 2.2, b = −0.3.]
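The fitted curve can be evaluated with a few lines of Python. This sketch assumes the standard parameterization Pr(success) = logit^-1(a + b x), with x the distance in feet; the slide reports only the estimates a = 2.2 and b = −0.3, so the functional form and the function name `logistic_putt_prob` are assumptions.

```python
import math

def logistic_putt_prob(x, a=2.2, b=-0.3):
    """Inverse logit of a + b*x (assumed form of the fitted curve)."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

for x in (2, 5, 10, 20):
    print(f"x = {x:2d} ft   Pr(success) = {logistic_putt_prob(x):.3f}")
```

Under this form the curve crosses 50% at x = 2.2/0.3, a little over 7 feet.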
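Geometry-based model
[Figure: diagram of a putt from distance x, with ball radius r and hole radius R, and a distribution of the shot angle marked at −2σ, 0, 2σ.]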
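The diagram corresponds to a short derivation, sketched here under the assumption (suggested by the ±2σ markings and by the Stan code) that the angular error θ of the shot is normally distributed around zero with standard deviation σ. The putt goes in when the angle is small enough for the ball to reach the hole:

```latex
\[
\Pr(\text{success} \mid x)
  = \Pr\bigl(|\theta| < \theta_0(x)\bigr)
  = 2\,\Phi\!\left(\frac{\theta_0(x)}{\sigma}\right) - 1,
\qquad
\theta_0(x) = \arcsin\!\left(\frac{R - r}{x}\right),
\quad
\theta \sim \mathrm{N}(0, \sigma^2).
\]
```

Here Φ is the standard normal cumulative distribution function, and the middle equality uses Φ(−z) = 1 − Φ(z). The symbol θ_0 is introduced only for this sketch.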
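Stan code
data {
  int J;       // number of distances
  int n[J];    // attempts at each distance
  real x[J];   // distance from hole (feet)
  int y[J];    // successes at each distance
  real r;      // ball radius (feet)
  real R;      // hole radius (feet)
}
parameters {
  real<lower=0> sigma;   // sd of the shot angle (radians)
}
model {
  real p[J];
  // arrays are not divided elementwise, so compute p one distance at a time
  for (j in 1:J)
    p[j] = 2*Phi(asin((R-r)/x[j]) / sigma) - 1;
  y ~ binomial(n, p);
}
generated quantities {
  // not shown on the slide; sigma_degrees appears in the fit summary
  real sigma_degrees = sigma * 180 / pi();
}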
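To see what the model line computes, here is a small Python sketch of the same expression, 2*Phi(asin((R-r)/x)/sigma) - 1, using only the standard library. The function name `putt_success_prob` is illustrative; r and R are the radii in feet from the R script, and sigma is set to the 1.53 degrees reported in the fit summary, converted to radians.

```python
import math
from statistics import NormalDist

def putt_success_prob(x, sigma, r=(1.68 / 2) / 12, R=(4.25 / 2) / 12):
    """Geometry-based success probability for a putt from x feet.

    sigma is the sd of the shot angle in radians; r and R are in feet.
    """
    threshold = math.asin((R - r) / x)          # largest angle that still goes in
    return 2 * NormalDist().cdf(threshold / sigma) - 1

sigma = math.radians(1.53)                      # sigma_degrees from the fit summary
for x in (2, 5, 10, 20):
    print(f"x = {x:2d} ft   Pr(success) = {putt_success_prob(x, sigma):.3f}")
```

The model has a single free parameter, sigma; the dependence on distance comes entirely from the geometry.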
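Fit the model
library(rstan)

golf <- read.table("golf.txt", header=TRUE, skip=2)
x <- golf$x          # distance from hole (feet)
y <- golf$y          # successful putts
n <- golf$n          # attempted putts
J <- length(y)       # number of distances
r <- (1.68/2)/12     # ball radius in feet (1.68-inch diameter)
R <- (4.25/2)/12     # hole radius in feet (4.25-inch diameter)
# no data argument is given, so the data are taken from the variables defined above
fit1 <- stan("golf1.stan")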
Check convergence
> print(fit1)
Inference for Stan model: golf1.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

              mean se_mean   sd  25%  50%  75% n_eff Rhat
sigma         0.03    0.00 0.00 0.03 0.03 0.03  1692    1
sigma_degrees 1.53    0.00 0.02 1.51 1.53 1.54  1692    1
[Figure: What's the probability of making a golf putt? Horizontal axis: distance from hole (feet), 0 to 20. Vertical axis: probability of success, 0 to 1. The 19 data points with a fitted curve: geometry-based model, sigma = 1.5.]
[Figure: Two models fit to the golf putting data. Horizontal axis: distance from hole (feet), 0 to 20. Vertical axis: probability of success, 0 to 1. The 19 data points with two fitted curves: logistic regression, a = 2.2, b = −0.3; and geometry-based model, sigma = 1.5.]
Birthdays!
The published graphs show data from 30 days in the year
[Figure: Relative number of births, 1970 to 1988, in four panels, each with a vertical axis from 60 to 120. Trends (slow trend, fast non-periodic component, mean), by year. Day of week effect (Mon to Sun), with curves for 1972, 1976, 1980, 1984, 1988. Seasonal effect (Jan to Dec), with curves for the same years. Day of year effect (Jan to Dec), with labeled days: New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, Halloween, Thanksgiving, Christmas.]
[Figure: Relative number of births, 2000 to 2014, in four panels, each with a vertical axis from 60 to 120. Day of week effect (Mon to Sun), with curves for 2002, 2006, 2010, 2014. Seasonal effect (Jan to Dec), with curves for the same years. Day of year effect (Jan to Dec), with labeled days: New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, 9/11, Halloween, Thanksgiving, Christmas. Trends (slow trend, fast non-periodic component, mean), by year.]
[Figure: Day of year effect (Jan to Dec), vertical axis from 50 to 120, with labeled days: New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, 9/11, Halloween, Thanksgiving, Christmas, and the 13th day of each month.]
Xbox estimates, adjusting for demographics
Xbox estimates, adjusting for demographics and partisanship
Data from 2016
Some ideas in data science workflow
Data and information
Replication
Fake-data simulation (or statistical theory)
Comparing predictions to data
The network of models
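As an illustration of the fake-data simulation idea, the Python sketch below simulates putting data from the geometry-based model with a known sigma and then checks that an estimate recovers it. It is a stand-in, not the talk's workflow: the estimate is a maximum-likelihood grid search rather than a Stan fit, and the distances, sample sizes, grid, and function names are all made up for illustration.

```python
import math
import random
from statistics import NormalDist

R_MINUS_R = (4.25 / 2 - 1.68 / 2) / 12   # hole radius minus ball radius, in feet

def success_prob(x, sigma):
    """Geometry-based success probability from x feet (sigma in radians)."""
    return 2 * NormalDist().cdf(math.asin(R_MINUS_R / x) / sigma) - 1

def simulate(xs, ns, sigma, rng):
    """Simulate the number of successful putts at each distance."""
    return [sum(rng.random() < success_prob(x, sigma) for _ in range(n))
            for x, n in zip(xs, ns)]

def log_lik(sigma, xs, ns, ys):
    """Binomial log-likelihood, dropping the constant term."""
    total = 0.0
    for x, n, y in zip(xs, ns, ys):
        p = min(max(success_prob(x, sigma), 1e-12), 1 - 1e-12)
        total += y * math.log(p) + (n - y) * math.log(1 - p)
    return total

def fit_sigma(xs, ns, ys, grid):
    """Grid-search maximum-likelihood estimate of sigma."""
    return max(grid, key=lambda s: log_lik(s, xs, ns, ys))

rng = random.Random(1)
xs = list(range(2, 21))                  # hypothetical distances (feet)
ns = [500] * len(xs)                     # hypothetical attempts per distance
true_sigma = 0.03                        # known value used to generate the data
ys = simulate(xs, ns, true_sigma, rng)

grid = [0.010 + 0.0005 * i for i in range(81)]   # 0.010 to 0.050 radians
sigma_hat = fit_sigma(xs, ns, ys, grid)
print(f"true sigma = {true_sigma:.4f}, recovered sigma = {sigma_hat:.4f}")
```

If the recovered value were far from the true one, that would point to a bug in the model code or the fitting code before any real data were involved.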