computational science & engineering seminar, 16 oct 2013

1. Statistics, computation, and software engineering: development and maintenance of mixed modeling software in R. Ben Bolker, McMaster University, Mathematics & Statistics and Biology, 15 October 2013 (short title: Mixed model software). Sections: Definitions, Statistics, Computation, Software, Conclusions, References.

2. Outline: 1. Definitions and context; 2. Statistical challenges; 3. Computational challenges; 4. Software engineering; 5. Conclusions.

4. (Generalized) linear mixed models. (G)LMMs are a statistical modeling framework incorporating: linear combinations of categorical and continuous predictors, and interactions; response distributions in the exponential family (binomial, Poisson, and extensions); any smooth, monotonic link function (e.g. logistic, exponential models); flexible combinations of blocking factors (clustering; random effects). Applications in ecology, neurobiology, behaviour, epidemiology, real estate, . . .
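The role of the link function can be illustrated with a minimal sketch. The function names and the coefficients `beta0` and `beta1` below are hypothetical, chosen only for illustration; they are not taken from the talk or from any mixed-model package.

```python
import math

def logit(p):
    """Logit link: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse logit: maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def log_link(mu):
    """Log link: maps a positive mean (e.g. a Poisson rate) to the real line."""
    return math.log(mu)

def inv_log_link(eta):
    """Inverse log link: maps a linear predictor back to a positive mean."""
    return math.exp(eta)

# Hypothetical intercept and slope, for illustration only.
beta0, beta1 = -1.0, 0.5

for x in (0.0, 2.0, 4.0):
    eta = beta0 + beta1 * x  # linear predictor on the link scale
    print(f"x={x:.1f}  eta={eta:+.2f}  "
          f"binomial mean={inv_logit(eta):.3f}  Poisson mean={inv_log_link(eta):.3f}")
```

Both links are smooth and monotonic, so the same linear predictor can drive a binomial probability or a Poisson mean without leaving the valid range of either.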

8. Examples. Ecology: survival, predation, etc. (experimental plots). Genomics: presence/absence of polymorphisms, gene expression (individuals). Educational assessment: student scores (students × teachers). Psychology/sensometrics: decisions, responses to stimuli (individuals). Epidemiology: disease prevalence (postal codes, provinces, countries).

13. Technical definition. Conditional distribution of the response: $Y_i \sim \mathrm{Distr}(g^{-1}(\eta_i), \phi)$, where $g^{-1}$ is the inverse link function and $\phi$ is the scale parameter. Linear predictor: $\eta = X\beta + Zb$, with fixed effects $X\beta$ and random effects $Zb$. Conditional modes: $b \sim \mathrm{MVN}(0, \Sigma(\theta))$, where $\Sigma(\theta)$ is the variance-covariance matrix.
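To make the definition concrete, here is a small simulation sketch of a binomial (Bernoulli) GLMM with a logit link and a single random intercept per group. It assumes the simplest case, where Σ(θ) is a multiple of the identity; the function name `simulate_glmm` and all parameter values are hypothetical.

```python
import math
import random

def simulate_glmm(n_groups=5, n_per_group=4, beta=(-0.5, 1.0), sigma_b=1.0, seed=1):
    """Simulate Y_i ~ Bernoulli(g^{-1}(eta_i)) with eta = X beta + Z b.

    Assumption: b ~ MVN(0, sigma_b^2 * I), i.e. independent random intercepts.
    """
    rng = random.Random(seed)
    b = [rng.gauss(0.0, sigma_b) for _ in range(n_groups)]  # random effects
    rows = []
    for g in range(n_groups):
        for _ in range(n_per_group):
            x = rng.uniform(-1.0, 1.0)
            x_row = [1.0, x]  # row of X: intercept and one covariate
            z_row = [1.0 if k == g else 0.0 for k in range(n_groups)]  # row of Z
            eta = (sum(xi * bi for xi, bi in zip(x_row, beta))
                   + sum(zi * bi for zi, bi in zip(z_row, b)))
            mu = 1.0 / (1.0 + math.exp(-eta))  # inverse logit link
            y = 1 if rng.random() < mu else 0  # Bernoulli response
            rows.append((g, x, eta, mu, y))
    return b, rows

b, rows = simulate_glmm()
for g, x, eta, mu, y in rows[:5]:
    print(f"group={g}  x={x:+.2f}  eta={eta:+.2f}  mu={mu:.3f}  y={y}")
```

Each row of Z simply picks out the random intercept of the observation's group, which is why Z is very sparse in practice.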

15. Estimation. Maximum likelihood estimation: $L(Y_i \mid \theta, \beta) = \int \cdots \int L(Y_i \mid \theta, \beta, b) \times L(b \mid \Sigma(\theta)) \, db$, where the left-hand side is the likelihood, the first factor in the integrand is the likelihood of the data given the random effects, and the second factor is the likelihood of the random effects $b$. Deterministic methods trade precision against computational cost: penalized quasi-likelihood, Laplace approximation, adaptive Gauss-Hermite quadrature (Breslow, 2004) . . . Monte Carlo methods: frequentist and Bayesian (Booth and Hobert, 1999; Ponciano et al., 2009; Sung, 2007).
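The precision/cost tradeoff can be sketched for the simplest possible case: one cluster of binary responses with a single random intercept. The code below compares a Laplace approximation of the marginal likelihood with brute-force trapezoidal integration on a fine grid. The brute-force integral is a stand-in for a more accurate quadrature rule, not an implementation of adaptive Gauss-Hermite quadrature, and the data and parameter values are made up.

```python
import math

def log_integrand(b, y, beta0, sigma):
    """log of [prod_i p(y_i | beta0 + b)] * Normal(b; 0, sigma^2)."""
    eta = beta0 + b
    loglik = sum(yi * eta - math.log1p(math.exp(eta)) for yi in y)
    logprior = -0.5 * (b / sigma) ** 2 - 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return loglik + logprior

def curvature(b, y, beta0, sigma):
    """Second derivative of the log integrand with respect to b."""
    p = 1.0 / (1.0 + math.exp(-(beta0 + b)))
    return -len(y) * p * (1.0 - p) - 1.0 / sigma ** 2

def laplace_marginal(y, beta0, sigma, n_iter=50):
    """Laplace approximation: Gaussian integral around the conditional mode."""
    b = 0.0
    for _ in range(n_iter):  # Newton iterations to find the mode of b
        p = 1.0 / (1.0 + math.exp(-(beta0 + b)))
        grad = sum(y) - len(y) * p - b / sigma ** 2
        b -= grad / curvature(b, y, beta0, sigma)
    hess = curvature(b, y, beta0, sigma)
    return math.exp(log_integrand(b, y, beta0, sigma)) * math.sqrt(2.0 * math.pi / -hess)

def numeric_marginal(y, beta0, sigma, half_width=10.0, n_points=4001):
    """Trapezoidal rule over [-half_width * sigma, +half_width * sigma]."""
    lo = -half_width * sigma
    step = 2.0 * half_width * sigma / (n_points - 1)
    vals = [math.exp(log_integrand(lo + k * step, y, beta0, sigma))
            for k in range(n_points)]
    return step * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

y = [1, 1, 0, 1, 0, 1]  # made-up binary responses for one cluster
lap = laplace_marginal(y, beta0=0.0, sigma=1.0)
num = numeric_marginal(y, beta0=0.0, sigma=1.0)
print(f"Laplace: {lap:.6f}  numeric: {num:.6f}  relative difference: {abs(lap - num) / num:.4f}")
```

The Laplace version needs only a handful of Newton steps, while the grid integral evaluates the integrand thousands of times; with several correlated random effects the grid approach quickly becomes infeasible.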

16. Estimation: example (McKeon et al., 2012). [Figure: estimated log-odds of predation for three effects (Symbiont, Crab vs. Shrimp, Added symbiont), compared across five estimation methods: GLM (fixed), GLM (pooled), PQL, Laplace, AGQ.]

17. Inference. Standard inferential tools: mostly asymptotic or uncontrolled approximations. Solutions are computational and/or Bayesian: parametric bootstrap, MCMC. Good news: different problems for small vs. large data. [Figure: inferred p value vs. true p value, in four panels: H2S, Anoxia, Osm, Cu.]
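The idea behind the parametric bootstrap can be sketched on a deliberately simple problem: testing for among-group variation in binomial counts. The null model (one shared success probability, no group effect) is fitted, data are simulated from it repeatedly, and the observed statistic is compared with the simulated ones. The statistic, function names, and data here are illustrative assumptions, not the procedure used in the talk.

```python
import random

def dispersion_stat(successes, n_trials):
    """Sample variance of the per-group proportions of successes."""
    props = [s / n_trials for s in successes]
    mean = sum(props) / len(props)
    return sum((p - mean) ** 2 for p in props) / (len(props) - 1)

def parametric_bootstrap_pvalue(successes, n_trials, n_boot=2000, seed=42):
    """P value for 'no among-group variation' by simulating from the fitted null."""
    rng = random.Random(seed)
    observed = dispersion_stat(successes, n_trials)
    # Fit the null model: a single success probability shared by all groups.
    p_hat = sum(successes) / (len(successes) * n_trials)
    exceed = 0
    for _ in range(n_boot):
        sim = [sum(rng.random() < p_hat for _ in range(n_trials)) for _ in successes]
        if dispersion_stat(sim, n_trials) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p value is never exactly zero.
    return (exceed + 1) / (n_boot + 1)

# Made-up counts out of 10 trials per group.
print(parametric_bootstrap_pvalue([0, 10, 1, 9, 0, 10, 2, 8], n_trials=10))
print(parametric_bootstrap_pvalue([5, 4, 6, 5, 5, 4, 6, 5], n_trials=10))
```

The same recipe carries over to mixed models in principle: fit the reduced model, simulate responses from it, refit both models to each simulated data set, and compare the observed test statistic with the simulated distribution. The cost is many model refits.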

21. Problems of big data. How big is big? Airline data: 12G. (G)LMM works on moderately large problems, e.g. student evaluations (≈ 75K total, 3K students, 1K profs). Fairly clever linear algebra. Possible improvements? Chunking/parallelization; out-of-memory operation.

22. Sparse matrix algorithms. Repeated decomposition of large, sparse matrices (especially Z). Fill-reducing permutation to improve sparsity pattern. Further improvements possible: better matrix representation, parallelization?
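Why a fill-reducing permutation matters can be seen on a textbook "arrow" matrix: one dense row and column plus a diagonal. The sketch below uses a small dense Cholesky routine written for illustration only (real sparse solvers work very differently) and counts nonzeros in the factor under two orderings.

```python
import math

def cholesky(A):
    """Dense Cholesky factorization A = L L^T (illustration only)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def count_nonzeros(L, tol=1e-12):
    """Number of entries of L larger than tol in absolute value."""
    return sum(1 for row in L for v in row if abs(v) > tol)

def arrow_matrix(n):
    """Positive definite matrix with a dense first row/column and a diagonal."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(1, n):
        A[0][i] = A[i][0] = 1.0
    for i in range(n):
        A[i][i] = float(n)  # diagonally dominant, hence positive definite
    return A

def permute(A, perm):
    """Symmetric permutation P A P^T."""
    return [[A[p][q] for q in perm] for p in perm]

n = 6
A = arrow_matrix(n)
reversed_order = list(range(n - 1, -1, -1))  # moves the dense row/column last

print("nonzeros, dense row first:", count_nonzeros(cholesky(A)))
print("nonzeros, dense row last: ", count_nonzeros(cholesky(permute(A, reversed_order))))
```

With the dense row first, eliminating it couples every remaining pair of variables and the factor fills in completely; with the dense row last, the factor is as sparse as the matrix itself.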

23. Bounded optimization. Parameterize the variance-covariance matrix Σ(θ). Positive definite or only semi-definite? Disadvantages of transforming to unconstrain (Pinheiro and Bates, 1996). (Disadvantages of boundary solutions.) [Figure: deviance profiles under the raw and log parameterizations.]
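The contrast between a bounded and an unconstrained parameterization can be sketched for a 2×2 covariance matrix built from a lower-triangular factor. In this sketch the "raw" scale lets diagonal elements of the factor reach zero (a singular, semi-definite Σ on the boundary), while a log-type transform removes the bounds but can never reach the boundary. The function names are hypothetical, and this is one common construction rather than a description of any particular package's internals.

```python
import math

def sigma_from_theta(theta):
    """Sigma = Lambda Lambda^T with Lambda = [[t1, 0], [t2, t3]].

    Raw scale: t1 and t3 are constrained to be >= 0; zero is allowed.
    """
    t1, t2, t3 = theta
    return [[t1 * t1, t1 * t2],
            [t1 * t2, t2 * t2 + t3 * t3]]

def theta_from_unconstrained(phi):
    """Log-type transform: exponentiate the diagonal so any real phi is valid."""
    p1, p2, p3 = phi
    return (math.exp(p1), p2, math.exp(p3))

def det2(S):
    """Determinant of a 2x2 matrix."""
    return S[0][0] * S[1][1] - S[0][1] * S[1][0]

# Boundary solution on the raw scale: t3 = 0 gives a singular Sigma.
boundary = sigma_from_theta((1.5, -0.7, 0.0))
print("raw scale, t3 = 0: det =", det2(boundary))

# Unconstrained scale: Sigma is always strictly positive definite,
# but the singular case corresponds to phi3 -> -infinity.
for phi3 in (0.0, -2.0, -5.0):
    S = sigma_from_theta(theta_from_unconstrained((0.0, -0.7, phi3)))
    print(f"log scale, phi3 = {phi3:+.1f}: det = {det2(S):.6g}")
```

An optimizer on the raw scale needs box constraints but can land exactly on a zero-variance estimate; on the transformed scale it needs no constraints but would have to run off to minus infinity to represent the same fit.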

25. Language tradeoffs. High-level/convenience: R. Low-level/performance: C++. New wave? Julia. Multi-language friction: mostly escaped in the R/C++ case, at the price of complexity.

26. Getting it right vs. getting it written. The curse of neophilia: "Superiority". Many versions: nlme, lme4(a,b,Eigen). "The moral of the story is that if you want to create a beautiful language, for god's sake don't make it useful" (Patrick Burns) ...

27. Sociological issues. Wide user base: "As usual when software for complicated statistical inference procedures is broadly disseminated, there is potential for abuse and misinterpretation." (Breslow, 2004). What if there is no good answer? "Do no harm" vs. "better me than someone else". Diagnostics and warning messages. End users vs. downstream developers.

29. Next steps. Alternative platforms/languages. Flexible correlation structures: spatial, temporal, phylogenetic . . . Improved MCMC methods? Simulation tests of inferential tools.

30. Is it science? "Science is what we understand well enough to explain to a computer. Art is everything else we do." (Donald Knuth) [Figure: Public Library of Science data; articles per month by date, 2006 to 2012, for the search keys "glmm" and "lme4".]

31. Acknowledgments. lme4: Doug Bates, Martin Mächler, Steve Walker. Data: Adrian Stier (UBC/OSU), Sea McKeon (Smithsonian), David Julian (UF). NSERC (Discovery); SHARCnet.

32. References.
Booth, J.G. and Hobert, J.P., 1999. Journal of the Royal Statistical Society. Series B, 61(1):265-285. doi:10.1111/1467-9868.00176.
Breslow, N.E., 2004. In D.Y. Lin and P.J. Heagerty, editors, Proceedings of the second Seattle symposium in biostatistics: Analysis of correlated data, pages 1-22. Springer. ISBN 0387208623.
McKeon, C.S., Stier, A., et al., 2012. Oecologia, 169(4):1095-1103. ISSN 0029-8549. doi:10.1007/s00442-012-2275-2.
Pinheiro, J.C. and Bates, D.M., 1996. Statistics and Computing, 6(3):289-296. doi:10.1007/BF00140873.
Ponciano, J.M., Taper, M.L., et al., 2009. Ecology, 90(2):356-362. ISSN 0012-9658.
Sung, Y.J., 2007. The Annals of Statistics, 35(3):990-1011. ISSN 0090-5364. doi:10.1214/009053606000001389.
