computational science & engineering seminar, 16 oct 2013

  1. [Definitions | Statistics | Computation | Software | Conclusions | References] Statistics, computation, and software engineering: development and maintenance of mixed modeling software in R. Ben Bolker, McMaster University, Mathematics & Statistics and Biology. 15 October 2013. [Ben Bolker: Mixed model software]
  2. Outline
     - 1. Definitions and context
     - 2. Statistical challenges
     - 3. Computational challenges
     - 4. Software engineering
     - 5. Conclusions
  4. (Generalized) linear mixed models
     (G)LMMs: a statistical modeling framework incorporating:
     - Linear combinations of categorical and continuous predictors, and interactions
     - Response distributions in the exponential family (binomial, Poisson, and extensions)
     - Any smooth, monotonic link function (e.g. logistic, exponential models)
     - Flexible combinations of blocking factors (clustering; random effects)
     Applications in ecology, neurobiology, behaviour, epidemiology, real estate, ...
  8. Examples (grouping units in parentheses)
     - ecology: survival, predation, etc. (experimental plots)
     - genomics: presence/absence of polymorphisms, gene expression (individuals)
     - educational assessment: student scores (students × teachers)
     - psychology/sensometrics: decisions, responses to stimuli (individuals)
     - epidemiology: disease prevalence (postal codes, provinces, countries)
  13. Technical definition
     - Conditional distribution of the response: $Y_i \sim \mathrm{Distr}(g^{-1}(\eta_i), \phi)$, where $g^{-1}$ is the inverse link function and $\phi$ is the scale parameter
     - Linear predictor: $\eta = X\beta + Zb$ (fixed effects $X\beta$ plus random effects $Zb$)
     - Conditional modes: $b \sim \mathrm{MVN}(0, \Sigma(\theta))$, where $\Sigma(\theta)$ is the variance-covariance matrix
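The three parts of the technical definition (response distribution, linear predictor, random effects) can be read as a recipe for simulating data. The following is a minimal sketch, assuming a Bernoulli response with a logistic link, a single continuous predictor, and one independent random intercept per group, so that $\Sigma(\theta) = \sigma^2 I$. The function name `simulate_glmm` and all parameter values are illustrative choices, not taken from the talk; a Bernoulli response has no free scale parameter $\phi$.

```python
import math
import random

def simulate_glmm(n_groups=5, n_per_group=20, beta=(-0.5, 1.0), sigma=0.8, seed=1):
    """Simulate a logistic GLMM with one random intercept per group (toy example)."""
    rng = random.Random(seed)
    # b ~ MVN(0, Sigma(theta)); here Sigma = sigma^2 * I, i.e. independent intercepts
    b = [rng.gauss(0.0, sigma) for _ in range(n_groups)]
    rows = []
    for g in range(n_groups):
        for _ in range(n_per_group):
            x = rng.uniform(-1.0, 1.0)          # one continuous predictor
            eta = beta[0] + beta[1] * x + b[g]  # eta = X beta + Z b
            mu = 1.0 / (1.0 + math.exp(-eta))   # inverse link g^{-1}(eta)
            y = 1 if rng.random() < mu else 0   # Y_i ~ Bernoulli(mu)
            rows.append((g, x, mu, y))
    return b, rows

b, rows = simulate_glmm()
print(len(b), "groups;", len(rows), "observations")
```

Here the row of $Z$ for an observation is simply an indicator of its group, which is why `b[g]` is added directly to the linear predictor.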
  15. Estimation
     - Maximum likelihood estimation: $\underbrace{L(Y_i \mid \theta, \beta)}_{\text{likelihood}} = \int \cdots \int \underbrace{L(Y_i \mid \theta, \beta')}_{\text{data} \mid \text{random effects}} \times \underbrace{L(\beta' \mid \Sigma(\theta))}_{\text{random effects}} \, d\beta'$
     - Deterministic (precision vs. computational cost): penalized quasi-likelihood, Laplace approximation, adaptive Gauss-Hermite quadrature (Breslow, 2004) ...
     - Monte Carlo: frequentist and Bayesian (Booth and Hobert, 1999; Ponciano et al., 2009; Sung, 2007)
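To make the precision-versus-cost tradeoff concrete, here is a minimal sketch of the Laplace approximation to the integral over a random effect, compared against a brute-force trapezoid rule. It assumes a deliberately tiny model that is not from the talk: a single group with `y` successes in `n` binomial trials, a logistic link, a fixed intercept `beta`, and one scalar random intercept with standard deviation `sigma`. All function names are hypothetical.

```python
import math

def log_integrand(b, y, n, beta, sigma):
    """log of Binomial(y | n, logistic(beta + b)) * Normal(b | 0, sigma^2)."""
    eta = beta + b
    log_binom = math.log(math.comb(n, y)) + y * eta - n * math.log1p(math.exp(eta))
    log_norm = -0.5 * math.log(2 * math.pi * sigma**2) - b**2 / (2 * sigma**2)
    return log_binom + log_norm

def laplace(y, n, beta, sigma, tol=1e-10):
    """Laplace approximation: Gaussian integral matched at the mode of the integrand."""
    b = 0.0
    for _ in range(100):  # Newton's method for the mode of the log-integrand
        p = 1.0 / (1.0 + math.exp(-(beta + b)))
        grad = y - n * p - b / sigma**2
        hess = -n * p * (1 - p) - 1.0 / sigma**2
        step = grad / hess
        b -= step
        if abs(step) < tol:
            break
    p = 1.0 / (1.0 + math.exp(-(beta + b)))
    hess = -n * p * (1 - p) - 1.0 / sigma**2
    return math.exp(log_integrand(b, y, n, beta, sigma)) * math.sqrt(2 * math.pi / -hess)

def brute_force(y, n, beta, sigma, half_width=10.0, steps=20000):
    """Reference value: trapezoid rule over +/- half_width standard deviations."""
    lo = -half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0.0
    for k in range(steps + 1):
        w = 0.5 if k in (0, steps) else 1.0
        total += w * math.exp(log_integrand(lo + k * h, y, n, beta, sigma))
    return total * h

print("Laplace:    ", laplace(3, 10, 0.0, 1.0))
print("brute force:", brute_force(3, 10, 0.0, 1.0))
```

The Laplace version needs only a handful of Newton steps, while the reference evaluates the integrand 20,001 times. With many random effects the grid approach becomes infeasible, which is the motivation for the approximations listed on the slide.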
  16. Estimation: example (McKeon et al., 2012)
     [Figure: estimated log-odds of predation (axis from −6 to 2) for three effects (Added symbiont, Crab vs. Shrimp, Symbiont), compared across estimation methods: GLM (fixed), GLM (pooled), PQL, Laplace, AGQ]
  17. Inference
     - Standard inferential tools: mostly asymptotic or uncontrolled approximations
     - Solutions are computational and/or Bayesian: parametric bootstrap, MCMC
     - Good news: different problems for small vs large data
     - [Figure: inferred p value vs. true p value (axes 0.02 to 0.08); panels H2S, Anoxia, Osm, Cu]
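The parametric bootstrap mentioned above replaces an asymptotic reference distribution with one simulated from the fitted null model. The sketch below shows the idea on a toy problem that is not a mixed model and not from the talk: two binomial groups, a null model with one shared success probability, and the absolute difference in proportions as the test statistic. For a (G)LMM the same loop would simulate from the fitted reduced model and refit both models on each simulated data set. The function name `parametric_bootstrap_p` is hypothetical.

```python
import random

def parametric_bootstrap_p(y1, n1, y2, n2, n_sim=2000, seed=42):
    """Bootstrap p-value for H0: both groups share one success probability."""
    rng = random.Random(seed)
    p_null = (y1 + y2) / (n1 + n2)     # fitted null model
    observed = abs(y1 / n1 - y2 / n2)  # observed test statistic

    def draw(n):
        # One simulated binomial count under the fitted null model
        return sum(rng.random() < p_null for _ in range(n))

    exceed = 0
    for _ in range(n_sim):
        stat = abs(draw(n1) / n1 - draw(n2) / n2)
        if stat >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return (exceed + 1) / (n_sim + 1)

print(parametric_bootstrap_p(8, 30, 15, 30))
```

The cost is `n_sim` simulations (and, for a real mixed model, `n_sim` refits), which is why this is a computational rather than analytical solution.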
  21. Problems of big data
     - How big is big? Airline data: 12G
     - (G)LMM works on moderately large problems, e.g. student evaluations (≈ 75K total, 3K students, 1K profs)
     - Fairly clever linear algebra
     - Possible improvements? Chunking/parallelization; out-of-memory operation
  22. Sparse matrix algorithms
     - repeated decomposition of large, sparse matrices (especially $Z$)
     - fill-reducing permutation to improve sparsity pattern
     - further improvements possible: better matrix representation, parallelization?
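Why a fill-reducing permutation matters can be seen on a small "arrow" matrix: a diagonal matrix plus one dense row and column. The sketch below uses a plain dense Cholesky routine written for illustration only (it is not the sparse algorithm any package uses) and counts the nonzeros in the factor for two orderings of the same matrix. The helper names are hypothetical.

```python
import math

def arrow_matrix(n, dense_first=True):
    """Symmetric positive-definite matrix: diagonal plus one dense row and column."""
    hub = 0 if dense_first else n - 1
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = float(n)  # large diagonal keeps the matrix positive definite
        if i != hub:
            a[i][hub] = a[hub][i] = 1.0
    return a

def cholesky(a):
    """Dense Cholesky factorization A = L L^T, returning lower-triangular L."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(a[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def nonzeros(L, tol=1e-12):
    return sum(abs(v) > tol for row in L for v in row)

n = 6
print("dense row first:", nonzeros(cholesky(arrow_matrix(n, True))))
print("dense row last: ", nonzeros(cholesky(arrow_matrix(n, False))))
```

Both matrices have the same number of nonzeros and differ only by a symmetric permutation, yet eliminating the dense row first fills the whole lower triangle of the factor, while eliminating it last creates no fill at all.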
  23. Bounded optimization
     - Parameterize variance-covariance matrix $\Sigma(\theta)$
     - Positive definite or only semi-definite?
     - Disadvantages of transforming to unconstrain (Pinheiro and Bates, 1996)
     - (Disadvantages of boundary solutions)
     - [Figure: deviance (axis 0 to 30) in two panels, raw (axis 0 to 3) and log (axis −3 to 0)]
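One way to "transform to unconstrain" is the log-Cholesky parameterization, one of the options discussed by Pinheiro and Bates (1996). The sketch below, for a 2×2 matrix only, maps any real 3-vector `theta` to a covariance matrix by exponentiating the diagonal of a Cholesky factor. The function names are hypothetical, and this is an illustration of the idea rather than the parameterization used by any particular package.

```python
import math

def sigma_from_theta(theta):
    """Map an unconstrained 3-vector to a 2x2 covariance matrix (log-Cholesky)."""
    l11 = math.exp(theta[0])  # exp() keeps the diagonal of L strictly positive
    l21 = theta[1]            # the off-diagonal element is unconstrained
    l22 = math.exp(theta[2])
    # Sigma = L L^T with L = [[l11, 0], [l21, l22]]
    return [[l11 * l11, l11 * l21],
            [l11 * l21, l21 * l21 + l22 * l22]]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

for theta in [(0.0, 0.0, 0.0), (-2.0, 3.0, 1.0), (0.5, -1.0, -3.0)]:
    s = sigma_from_theta(theta)
    print(theta, "-> variances", (s[0][0], s[1][1]), "determinant", det2(s))
```

Every finite `theta` yields a strictly positive-definite matrix, so the optimizer needs no bounds. The price is that a zero variance (a semi-definite, boundary solution) is reached only as a diagonal parameter tends to minus infinity, where the deviance surface flattens out on the log scale.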
  25. Language tradeoffs
     - high-level/convenience: R
     - low-level/performance: C++
     - new wave? Julia
     - multi-language friction: mostly escaped in R/C++ case, at the price of complexity
  26. Getting it right vs. getting it written
     - the curse of neophilia: Superiority
     - many versions: nlme, lme4(a,b,Eigen)
     - "The moral of the story is that if you want to create a beautiful language, for god's sake don't make it useful" (Patrick Burns) ...
  27. Sociological issues
     - Wide user base: "As usual when software for complicated statistical inference procedures is broadly disseminated, there is potential for abuse and misinterpretation." (Breslow, 2004)
     - What if there is no good answer? "do no harm" vs. "better me than someone else"
     - Diagnostics and warning messages
     - End users vs. downstream developers
  29. Next steps
     - Alternative platforms/languages
     - Flexible correlation structures: spatial, temporal, phylogenetic ...
     - Improved MCMC methods?
     - Simulation tests of inferential tools
  30. Is it science?
     - "Science is what we understand well enough to explain to a computer. Art is everything else we do." (Donald Knuth)
     - [Figure: Public Library of Science data; articles per month (axis 10 to 50) by date (2006 to 2012); key: glmm, lme4]
  31. Acknowledgments
     - lme4: Doug Bates, Martin Mächler, Steve Walker
     - Data: Adrian Stier (UBC/OSU), Sea McKeon (Smithsonian), David Julian (UF)
     - NSERC (Discovery)
     - SHARCnet
  32. References
     - Booth, J.G. and Hobert, J.P., 1999. Journal of the Royal Statistical Society. Series B, 61(1):265-285. doi:10.1111/1467-9868.00176.
     - Breslow, N.E., 2004. In D.Y. Lin and P.J. Heagerty, editors, Proceedings of the second Seattle symposium in biostatistics: Analysis of correlated data, pages 1-22. Springer. ISBN 0387208623.
     - McKeon, C.S., Stier, A., et al., 2012. Oecologia, 169(4):1095-1103. ISSN 0029-8549. doi:10.1007/s00442-012-2275-2.
     - Pinheiro, J.C. and Bates, D.M., 1996. Statistics and Computing, 6(3):289-296. doi:10.1007/BF00140873.
     - Ponciano, J.M., Taper, M.L., et al., 2009. Ecology, 90(2):356-362. ISSN 0012-9658.
     - Sung, Y.J., 2007. The Annals of Statistics, 35(3):990-1011. ISSN 0090-5364. doi:10.1214/009053606000001389.