computational science & engineering seminar, 16 oct 2013

    • Definitions Statistics Computation Software Conclusions References
        Statistics, computation, and software engineering: development and maintenance of mixed modeling software in R
        Ben Bolker, McMaster University, Mathematics & Statistics and Biology
        15 October 2013
        Ben Bolker, Mixed model software
    • Outline
        1 Definitions and context
        2 Statistical challenges
        3 Computational challenges
        4 Software engineering
        5 Conclusions
    • Outline: 1 Definitions and context
    • (Generalized) linear mixed models
        (G)LMMs: a statistical modeling framework incorporating:
        • Linear combinations of categorical and continuous predictors, and interactions
        • Response distributions in the exponential family (binomial, Poisson, and extensions)
        • Any smooth, monotonic link function (e.g. logistic, exponential models)
        • Flexible combinations of blocking factors (clustering; random effects)
        Applications in ecology, neurobiology, behaviour, epidemiology, real estate, ...
    • Examples
        • ecology: survival, predation, etc. (experimental plots)
        • genomics: presence/absence of polymorphisms, gene expression (individuals)
        • educational assessment: student scores (students × teachers)
        • psychology/sensometrics: decisions, responses to stimuli (individuals)
        • epidemiology: disease prevalence (postal codes, provinces, countries)
    • Technical definition
        • conditional distribution of the response: $Y_i \sim \mathrm{Distr}\left(g^{-1}(\eta_i), \phi\right)$, where $g^{-1}$ is the inverse link function and $\phi$ is the scale parameter
        • linear predictor: $\eta = X\beta + Zb$, where $X\beta$ are the fixed effects and $Zb$ the random effects
        • conditional modes: $b \sim \mathrm{MVN}(0, \Sigma(\theta))$, where $\Sigma(\theta)$ is the variance-covariance matrix
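The technical definition can be read as a recipe for simulating data. The sketch below is illustrative only and is not taken from the talk: it assumes the simplest special case, a logistic model with a single random intercept per group, so that Z is a group-indicator matrix and Σ(θ) = σ²I. The function names, parameter values, and seed are all hypothetical.

```python
import math
import random


def inv_logit(eta):
    """Inverse logit link g^{-1}: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def simulate_glmm(n_groups, n_per_group, beta, sigma, seed=1):
    """Simulate binary responses from a random-intercept logistic GLMM.

    eta_ij = beta[0] + beta[1] * x_ij + b_j, with b_j ~ Normal(0, sigma^2)
    y_ij ~ Bernoulli(inv_logit(eta_ij))
    Returns a list of (group, x, y) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for j in range(n_groups):
        b_j = rng.gauss(0.0, sigma)              # random effect for group j
        for _ in range(n_per_group):
            x = rng.uniform(-1.0, 1.0)           # continuous predictor
            eta = beta[0] + beta[1] * x + b_j    # linear predictor: X beta + Z b
            y = 1 if rng.random() < inv_logit(eta) else 0
            rows.append((j, x, y))
    return rows


# Hypothetical parameter values, chosen only for illustration.
data = simulate_glmm(n_groups=10, n_per_group=20, beta=(-0.5, 1.0), sigma=0.8)
print(len(data), "observations,", sum(y for _, _, y in data), "successes")
```

The Bernoulli response has no free scale parameter, so φ does not appear in this special case.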
    • Outline: 2 Statistical challenges
    • Estimation
        • Maximum likelihood estimation: $\underbrace{L(Y_i \mid \theta, \beta)}_{\text{likelihood}} = \int \cdots \int \underbrace{L(Y_i \mid \theta, \beta')}_{\text{data} \mid \text{random effects}} \times \underbrace{L(\beta' \mid \Sigma(\theta))}_{\text{random effects}} \, d\beta'$
        • deterministic: precision vs. computational cost: penalized quasi-likelihood, Laplace approximation, adaptive Gauss-Hermite quadrature (Breslow, 2004) ...
        • Monte Carlo: frequentist and Bayesian (Booth and Hobert, 1999; Ponciano et al., 2009; Sung, 2007)
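To make the precision versus cost tradeoff concrete, here is a minimal sketch (not the implementation of any particular package) that integrates out one random intercept for a single group of binary observations. It assumes a logistic model with a known intercept and random-effect standard deviation, and compares the Laplace approximation (equivalent to one quadrature node), 3-node adaptive Gauss-Hermite quadrature, and a brute-force trapezoid rule used as a reference. The data and all function names are hypothetical.

```python
import math


def log_integrand(b, y, beta0, sigma):
    """log of L(y | b) * Normal(b; 0, sigma^2) for a random-intercept logistic model."""
    eta = beta0 + b
    loglik = sum(yi * eta - math.log1p(math.exp(eta)) for yi in y)
    logprior = -0.5 * b * b / sigma ** 2 - 0.5 * math.log(2 * math.pi * sigma ** 2)
    return loglik + logprior


def curvature(b, y, beta0, sigma):
    """Second derivative of the log-integrand with respect to b (always negative)."""
    p = 1.0 / (1.0 + math.exp(-(beta0 + b)))
    return -len(y) * p * (1.0 - p) - 1.0 / sigma ** 2


def find_mode(y, beta0, sigma, tol=1e-12):
    """Newton's method for the conditional mode of b."""
    b = 0.0
    for _ in range(100):
        p = 1.0 / (1.0 + math.exp(-(beta0 + b)))
        grad = sum(y) - len(y) * p - b / sigma ** 2
        step = grad / curvature(b, y, beta0, sigma)
        b -= step
        if abs(step) < tol:
            break
    return b


# Gauss-Hermite nodes and weights for the weight function exp(-x^2).
GH_RULES = {
    1: [(0.0, math.sqrt(math.pi))],
    3: [(-math.sqrt(1.5), math.sqrt(math.pi) / 6),
        (0.0, 2 * math.sqrt(math.pi) / 3),
        (math.sqrt(1.5), math.sqrt(math.pi) / 6)],
}


def agq(y, beta0, sigma, n_nodes):
    """Adaptive Gauss-Hermite quadrature, centred and scaled at the conditional mode."""
    b_hat = find_mode(y, beta0, sigma)
    s = 1.0 / math.sqrt(-curvature(b_hat, y, beta0, sigma))
    total = 0.0
    for x, w in GH_RULES[n_nodes]:
        b = b_hat + math.sqrt(2) * s * x
        total += w * math.exp(x * x + log_integrand(b, y, beta0, sigma))
    return math.sqrt(2) * s * total


def brute_force(y, beta0, sigma, lo=-10.0, hi=10.0, n=4000):
    """Trapezoid rule on a fine grid: slow but accurate reference value."""
    h = (hi - lo) / n
    vals = [math.exp(log_integrand(lo + k * h, y, beta0, sigma)) for k in range(n + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))


y_obs = [1, 1, 1, 1, 0]                # hypothetical data for one group
laplace = agq(y_obs, 0.0, 1.0, 1)      # one node reduces to the Laplace approximation
agq3 = agq(y_obs, 0.0, 1.0, 3)
reference = brute_force(y_obs, 0.0, 1.0)
print(laplace, agq3, reference)
```

The reference integral needs thousands of likelihood evaluations per group, the quadrature rules one or three (plus the mode search), which is the cost side of the tradeoff.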
    • Estimation: example (McKeon et al., 2012)
        [Figure: log-odds of predation for Symbiont, Crab vs. Shrimp, and Added symbiont, as estimated by GLM (fixed), GLM (pooled), PQL, Laplace, and AGQ]
    • Inference
        • Standard inferential tools: mostly asymptotic or uncontrolled approximations
        • Solutions are computational and/or Bayesian: parametric bootstrap, MCMC
        • Good news: different problems for small vs large data
        [Figure: inferred p value vs. true p value; panels: H2S, Anoxia, Osm, Cu]
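As a rough illustration of the parametric bootstrap mentioned above, the sketch below tests whether a between-group variance is zero in a balanced one-way layout. This is a deliberately simple stand-in for a mixed model: the test statistic is a mean-square ratio rather than a likelihood ratio, the data are synthetic, and every name and parameter value is hypothetical.

```python
import random
import statistics


def f_ratio(groups):
    """Between-group to within-group mean-square ratio for balanced groups."""
    k, n = len(groups), len(groups[0])
    means = [statistics.fmean(g) for g in groups]
    grand = statistics.fmean(means)
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g) / (k * (n - 1))
    return ms_between / ms_within


def simulate(k, n, mu, sigma_b, sigma_e, rng):
    """Simulate y_ij = mu + b_i + e_ij, b_i ~ N(0, sigma_b^2), e_ij ~ N(0, sigma_e^2)."""
    groups = []
    for _ in range(k):
        b = rng.gauss(0.0, sigma_b)
        groups.append([mu + b + rng.gauss(0.0, sigma_e) for _ in range(n)])
    return groups


def bootstrap_p_value(groups, n_boot=500, seed=1):
    """Parametric bootstrap p value for the null hypothesis of no group-level variance."""
    rng = random.Random(seed)
    k, n = len(groups), len(groups[0])
    observed = f_ratio(groups)
    pooled = [v for g in groups for v in g]
    mu0 = statistics.fmean(pooled)       # fit of the null model: common mean,
    sigma0 = statistics.stdev(pooled)    # a single residual standard deviation
    exceed = sum(
        f_ratio(simulate(k, n, mu0, 0.0, sigma0, rng)) >= observed
        for _ in range(n_boot)
    )
    return (exceed + 1) / (n_boot + 1)


# Synthetic data with a large between-group standard deviation.
data = simulate(k=8, n=5, mu=10.0, sigma_b=3.0, sigma_e=1.0, rng=random.Random(42))
p_value = bootstrap_p_value(data)
print(p_value)
```

The same pattern (fit the null model, simulate from it, refit, compare statistics) carries over to full mixed models, where each refit is far more expensive.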
    • Outline: 3 Computational challenges
    • Problems of big data
        • How big is big? Airline data: 12G
        • (G)LMM works on moderately large problems, e.g. student evaluations (≈ 75K total, 3K students, 1K profs)
        • Fairly clever linear algebra
        • Possible improvements?
            • Chunking/parallelization
            • Out-of-memory operation
    • Sparse matrix algorithms
        • repeated decomposition of large, sparse matrices (especially Z)
        • fill-reducing permutation to improve sparsity pattern
        • further improvements possible: better matrix representation, parallelization?
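The effect of a fill-reducing permutation can be seen on a toy example. The sketch below counts the fill-in (new nonzeros) created by symbolic Cholesky elimination of a symmetric "arrowhead" sparsity pattern under two orderings. It is a textbook illustration of the idea, not the algorithm used by any specific library; production code relies on heuristics such as minimum-degree ordering. The function name and the example pattern are hypothetical.

```python
def count_fill(n, edges, order):
    """Count fill-in entries created by Cholesky elimination in the given order.

    The symmetric sparsity pattern is given as a list of off-diagonal (i, j) pairs.
    Eliminating a variable connects all of its not-yet-eliminated neighbours to
    each other; every connection that was not already present is one fill-in entry.
    """
    nbrs = {v: set() for v in range(n)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    eliminated = set()
    fill = 0
    for v in order:
        live = [u for u in nbrs[v] if u not in eliminated]
        for idx, a in enumerate(live):
            for c in live[idx + 1:]:
                if c not in nbrs[a]:
                    nbrs[a].add(c)
                    nbrs[c].add(a)
                    fill += 1
        eliminated.add(v)
    return fill


# Arrowhead pattern: variable 0 is coupled to every other variable.
n = 6
arrow = [(0, j) for j in range(1, n)]

fill_natural = count_fill(n, arrow, order=list(range(n)))             # dense variable first
fill_permuted = count_fill(n, arrow, order=list(range(n - 1, -1, -1)))  # dense variable last
print(fill_natural, fill_permuted)
```

Eliminating the densely coupled variable first fills in the whole remaining block, while eliminating it last creates no fill at all, so the factor stays as sparse as the original matrix.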
    • Bounded optimization
        • Parameterize variance-covariance matrix Σ(θ)
        • Positive definite or only semi-definite?
        • Disadvantages of transforming to unconstrain (Pinheiro and Bates, 1996)
        • (Disadvantages of boundary solutions)
        [Figure: deviance plotted on raw and log parameter scales]
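One common way to parameterize Σ(θ) is through a lower-triangular Cholesky-type factor. The sketch below, which assumes a 2×2 matrix and uses hypothetical function names, shows both options raised on the slide: keeping the diagonal elements of the factor on their raw scale with a lower bound of zero (which allows singular, positive semi-definite solutions on the boundary), or log-transforming them to remove the constraint (which pushes the boundary to minus infinity). It is not presented as the parameterization of any particular package.

```python
import math


def sigma_from_theta(theta):
    """Build a 2x2 covariance matrix Sigma = L L^T from theta = (t1, t2, t3).

    L = [[t1, 0], [t2, t3]] with t1, t3 >= 0 (bounded) and t2 unconstrained.
    """
    t1, t2, t3 = theta
    return [[t1 * t1, t1 * t2],
            [t1 * t2, t2 * t2 + t3 * t3]]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def theta_from_unconstrained(tau):
    """Log-scale alternative: the diagonal elements are exp(tau), so no bounds are needed."""
    return (math.exp(tau[0]), tau[1], math.exp(tau[2]))


interior = sigma_from_theta((2.0, 1.0, 3.0))   # strictly positive definite
boundary = sigma_from_theta((1.0, 0.5, 0.0))   # t3 = 0: singular (semi-definite)
print(det2(interior), det2(boundary))

# On the log scale the singular case is unreachable: even a very negative tau
# still gives a strictly positive determinant.
nearly_singular = sigma_from_theta(theta_from_unconstrained((-20.0, 0.0, -20.0)))
print(det2(nearly_singular))
```

An optimizer working on the raw scale must handle the bound explicitly, while one working on the log scale can wander toward minus infinity when the best-fitting variance is zero.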
    • Outline: 4 Software engineering
    • Language tradeoffs
        • high-level/convenience: R
        • low-level/performance: C++
        • new wave? Julia
        • multi-language friction: mostly escaped in R/C++ case, at the price of complexity
    • Getting it right vs. getting it written
        • the curse of neophilia: Superiority
        • many versions: nlme, lme4(a,b,Eigen) ...
        • "The moral of the story is that if you want to create a beautiful language, for god's sake don't make it useful" (Patrick Burns)
    • Sociological issues
        • Wide user base: "As usual when software for complicated statistical inference procedures is broadly disseminated, there is potential for abuse and misinterpretation." (Breslow, 2004)
        • What if there is no good answer? "do no harm" vs. "better me than someone else"
        • Diagnostics and warning messages
        • End users vs. downstream developers
    • Outline: 5 Conclusions
    • Next steps
        • Alternative platforms/languages
        • Flexible correlation structures: spatial, temporal, phylogenetic ...
        • Improved MCMC methods?
        • Simulation tests of inferential tools
    • Is it science?
        • "Science is what we understand well enough to explain to a computer. Art is everything else we do." (Donald Knuth)
        [Figure: Public Library of Science data; articles per month by date, 2006 to 2012, for the keys glmm and lme4]
    • Acknowledgments
        • lme4: Doug Bates, Martin Mächler, Steve Walker
        • Data: Adrian Stier (UBC/OSU), Sea McKeon (Smithsonian), David Julian (UF)
        • NSERC (Discovery)
        • SHARCnet
    • References
        • Booth, J.G. and Hobert, J.P., 1999. Journal of the Royal Statistical Society. Series B, 61(1):265–285. doi:10.1111/1467-9868.00176.
        • Breslow, N.E., 2004. In D.Y. Lin and P.J. Heagerty, editors, Proceedings of the second Seattle symposium in biostatistics: Analysis of correlated data, pages 1–22. Springer. ISBN 0387208623.
        • McKeon, C.S., Stier, A., et al., 2012. Oecologia, 169(4):1095–1103. ISSN 0029-8549. doi:10.1007/s00442-012-2275-2.
        • Pinheiro, J.C. and Bates, D.M., 1996. Statistics and Computing, 6(3):289–296. doi:10.1007/BF00140873.
        • Ponciano, J.M., Taper, M.L., et al., 2009. Ecology, 90(2):356–362. ISSN 0012-9658.
        • Sung, Y.J., 2007. The Annals of Statistics, 35(3):990–1011. ISSN 0090-5364. doi:10.1214/009053606000001389.