# Bayesian Core: Chapter 4

This is the third part (Chapter 4) of our course associated with Bayesian Core.



1. Bayesian Core: A Practical Approach to Computational Bayesian Statistics — Chapter 4: Generalized linear models. Outline: generalisation of linear models; Metropolis–Hastings algorithms; the probit model; the logit model; loglinear models.
3. Generalisation of linear models. Linear models describe the connection between a response variable y and a set x of explanatory variables through a linear dependence relation with [approximately] normal perturbations. There are many instances where one or the other of these assumptions is not appropriate, e.g. when the support of y is restricted to R+ or to N.
5. bank: four measurements on 100 genuine Swiss banknotes and 100 counterfeit ones: x1, length of the bill (in mm); x2, width of the left edge (in mm); x3, width of the right edge (in mm); x4, bottom margin width (in mm). Response variable y: status of the banknote [0 for genuine and 1 for counterfeit]. Goal: a probabilistic model that predicts counterfeiting based on the four measurements.
9. The impossible linear model. Example of the influence of x4 on y: since y is binary, y | x4 ∼ B(p(x4)), so a normal model is impossible. Assuming a linear dependence in p(x) = E[y|x], p(x4i) = β0 + β1 x4i, estimated [by MLE] as p̂i = −2.02 + 0.268 xi4, gives p̂i = 0.12 for xi4 = 8 and p̂i = 1.19 for xi4 = 12! So linear dependence is impossible as well.
11. Generalisation of the linear dependence: a broader class of models to cover various dependence structures. Class of generalised linear models (GLM) where y | x, β ∼ f(y | x^T β), i.e., the dependence of y on x is partly linear.
12. Notations: same as in the linear regression chapter, with an n-sample y = (y1, …, yn) and the corresponding explanatory variables/covariates gathered in the n × k matrix X with generic entry xij, i.e. whose i-th row is x_i^T = (xi1, xi2, …, xik).
15. Specifications of GLMs. Definition (GLM): a GLM is a conditional model specified by two functions:
    1. the density f of y given x, parameterised by its expectation parameter µ = µ(x) [and possibly its dispersion parameter φ = φ(x)];
    2. the link g between the mean µ and the explanatory variables, written customarily as g(µ) = x^T β or, equivalently, E[y | x, β] = g⁻¹(x^T β).
    For identifiability reasons, g needs to be bijective.
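To make the link function concrete, here is a minimal sketch (not from the slides; `logit` and `inv_logit` are illustrative helper names) of the logit link and its inverse, showing that g is bijective so the mean can always be recovered from the linear predictor:

```python
import math

def logit(mu):
    """Link g: maps a mean mu in (0, 1) to the linear-predictor scale R."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse link g^{-1}: maps eta = x^T beta in R back to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# g is bijective: g^{-1}(g(mu)) recovers the mean (up to rounding).
for mu in (0.1, 0.5, 0.9):
    assert abs(inv_logit(logit(mu)) - mu) < 1e-12
```

Any strictly monotone map from the mean space onto R (probit Φ⁻¹, complementary log-log, …) would serve the same role.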
16. Likelihood: obvious representation ℓ(β, φ | y, X) = ∏_{i=1}^n f(y_i | x_i^T β, φ), with parameters β ∈ R^k and φ > 0.
17. Examples: ordinary linear regression is the case of a GLM where g(x) = x, φ = σ², and y | X, β, σ² ∼ N_n(Xβ, σ² I_n).
18. Examples (2): the case of binary and binomial data, when yi | xi ∼ B(ni, p(xi)) with known ni. Logit [or logistic regression] model: the link is the logit transform of the probability of success, g(pi) = log(pi / (1 − pi)), with likelihood
    ∏_{i=1}^n C(ni, yi) [exp(x_i^T β)/(1 + exp(x_i^T β))]^{yi} [1/(1 + exp(x_i^T β))]^{ni − yi}
    ∝ exp{ ∑_{i=1}^n yi x_i^T β } ∏_{i=1}^n (1 + exp(x_i^T β))^{−ni}.
20. Canonical link: a special link function g⋆ that appears in the natural exponential family representation of the density, g⋆(µ) = θ if f(y | µ) ∝ exp{T(y)·θ − Ψ(θ)}. Example: the logit link is canonical for the binomial model, since f(yi | pi) = C(ni, yi) exp{ yi log(pi/(1 − pi)) + ni log(1 − pi) }, and thus θi = log pi/(1 − pi).
21. Examples (3): it is customary to use the canonical link, but only customary... Probit model: the probit link function is given by g(µi) = Φ⁻¹(µi), where Φ is the standard normal cdf. Likelihood: ℓ(β | y, X) ∝ ∏_{i=1}^n Φ(x_i^T β)^{yi} (1 − Φ(x_i^T β))^{ni − yi}.
23. Log-linear models: the standard approach to describe associations between several categorical variables, i.e., variables with finite support. Sufficient statistic: the contingency table, made of the cross-classified counts for the different categorical variables.

    Example (Titanic survivors)

    | Survivor | Class | Male (Child) | Female (Child) | Male (Adult) | Female (Adult) |
    |----------|-------|--------------|----------------|--------------|----------------|
    | No       | 1st   | 0            | 0              | 118          | 4              |
    | No       | 2nd   | 0            | 0              | 154          | 13             |
    | No       | 3rd   | 35           | 17             | 387          | 89             |
    | No       | Crew  | 0            | 0              | 670          | 3              |
    | Yes      | 1st   | 5            | 1              | 57           | 140            |
    | Yes      | 2nd   | 11           | 13             | 14           | 80             |
    | Yes      | 3rd   | 13           | 14             | 75           | 76             |
    | Yes      | Crew  | 0            | 0              | 192          | 20             |
25. Poisson regression model: (1) each count yi is Poisson with mean µi = µ(xi); (2) link function connecting R+ with R, e.g. the logarithm, g(µi) = log(µi). Corresponding likelihood: ℓ(β | y, X) = ∏_{i=1}^n (1/yi!) exp{ yi x_i^T β − exp(x_i^T β) }.
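The log of this likelihood is easy to evaluate directly; a minimal sketch (illustrative code, not from the slides; `poisson_loglik` is a hypothetical helper):

```python
import math

def poisson_loglik(beta, X, y):
    """Poisson-regression log-likelihood with log link:
    sum_i [ y_i x_i^T beta - exp(x_i^T beta) - log(y_i!) ]."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * xij for b, xij in zip(beta, xi))
        total += yi * eta - math.exp(eta) - math.lgamma(yi + 1)
    return total

# Toy check: intercept-only model with counts 2 and 4; the MLE of the mean
# is the sample average 3, so beta = log(3) should beat any other value.
X, y = [[1.0], [1.0]], [2, 4]
assert poisson_loglik([math.log(3.0)], X, y) > poisson_loglik([0.0], X, y)
```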
28. Metropolis–Hastings algorithms: convergence assessment. Posterior inference in GLMs is harder than for linear models, so working with a GLM requires specific numerical or simulation tools [e.g., GLIM in classical analyses]. This is an opportunity to introduce a universal MCMC method: the Metropolis–Hastings algorithm.
29. Generic MCMC sampler: Metropolis–Hastings algorithms are generic/off-the-shelf MCMC algorithms. They only require the likelihood up to a constant [difference with the Gibbs sampler], can be tuned with a wide range of possibilities [difference with the Gibbs sampler & blocking], and are natural extensions of standard simulation algorithms, being based on the choice of a proposal distribution [difference in the Markov proposal q(x, y) and acceptance].
31. Why Metropolis? Originally introduced by Metropolis, Rosenbluth, Rosenbluth, Teller and Teller in a setup of optimisation on a discrete state-space, all authors being involved in Los Alamos during and after WWII. Physicist and mathematician Nicholas Metropolis is considered (with Stanislaw Ulam) to be the father of Monte Carlo methods. Also a physicist, Marshall Rosenbluth worked on the development of the hydrogen (H) bomb. Edward Teller was one of the first scientists to work on the Manhattan Project that led to the production of the A bomb; he also managed to design, with Ulam, the H bomb.
33. Generic Metropolis–Hastings sampler. For target π and proposal kernel q(x, y):
    Initialization: choose an arbitrary x^(0).
    Iteration t:
    1. Given x^(t−1), generate x̃ ∼ q(x^(t−1), ·).
    2. Calculate ρ(x^(t−1), x̃) = min{ [π(x̃)/q(x^(t−1), x̃)] / [π(x^(t−1))/q(x̃, x^(t−1))], 1 }.
    3. With probability ρ(x^(t−1), x̃), accept x̃ and set x^(t) = x̃; otherwise reject x̃ and set x^(t) = x^(t−1).
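The steps above can be sketched in a few lines (illustrative code, not from the slides): a generic MH routine that evaluates the target only up to a constant, here run with a normal random-walk proposal on a standard normal target.

```python
import math, random

def metropolis_hastings(log_target, prop_sample, prop_logpdf, x0, n_iter):
    """Generic MH: propose x~ ~ q(x, .), accept with probability
    min{1, [pi(x~)/q(x, x~)] / [pi(x)/q(x~, x)]}, computed in log scale."""
    x, chain = x0, []
    for _ in range(n_iter):
        prop = prop_sample(x)
        log_rho = (log_target(prop) - prop_logpdf(x, prop)) \
                  - (log_target(x) - prop_logpdf(prop, x))
        if math.log(random.random()) < min(0.0, log_rho):
            x = prop
        chain.append(x)
    return chain

# Target: N(0, 1) known up to a constant; proposal: N(x, 1) random walk.
random.seed(1)
chain = metropolis_hastings(
    log_target=lambda x: -0.5 * x * x,
    prop_sample=lambda x: x + random.gauss(0.0, 1.0),
    prop_logpdf=lambda x, y: -0.5 * (y - x) ** 2,
    x0=0.0, n_iter=20000)
print(sum(chain) / len(chain))  # close to the target mean 0
```

Note that unnormalised log-densities suffice throughout: the proportionality constants of both π and q cancel in log_rho.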
34. Universality: the algorithm only needs to simulate from q, which can be chosen [almost!] arbitrarily, i.e. unrelated to π [q is also called the instrumental distribution]. Note: knowing π and q only up to proportionality terms is fine, since the proportionality constants cancel in ρ.
36. Validation by Markov chain theory: the target π is the stationary distribution of the Markov chain (x^(t))_t because the acceptance probability ρ(x, y) satisfies the detailed balance equation π(x) q(x, y) ρ(x, y) = π(y) q(y, x) ρ(y, x) [integrate out x to see that π is stationary]. For convergence/ergodicity, the Markov chain must be irreducible: q has a positive probability of reaching all areas with positive π probability in a finite number of steps.
40. Choice of proposal: the theoretical guarantees of convergence are very general, but the choice of q is crucial in practice. A poor choice of q may result in very high rejection rates, with the Markov chain (x^(t))_t hardly moving, or in a myopic exploration of the support of π, that is, in a dependence on the starting value x^(0), with the chain stuck in the neighbourhood of a mode close to x^(0). Note: hybrid MCMC, i.e. the simultaneous use of different kernels, is valid and recommended.
42. The independence sampler: pick a proposal q that is independent of its first argument, q(x, y) = q(y). Then ρ simplifies into ρ(x, y) = min{ 1, [π(y)/q(y)] / [π(x)/q(x)] }. Special case q ∝ π: reduces to ρ(x, y) = 1 and iid sampling. Analogy with the Accept–Reject algorithm, where max π/q is replaced with the current value π(x^(t−1))/q(x^(t−1)), but the sequence of accepted x^(t)'s is not iid.
44. Choice of q: convergence properties are highly dependent on q. q needs to be positive everywhere on the support of π, and for a good exploration of this support, π/q needs to be bounded. Otherwise, the chain takes too long to reach regions with low q/π values.
47. The random walk sampler: the independence sampler requires too much global information about π, so opt for a local gathering of information instead. This means exploring the neighbourhood of the current value x^(t) in search of other points of interest. The simplest exploration device is based on random walk dynamics.
49. Random walks: the proposal is a symmetric transition density, q(x, y) = q_RW(y − x) = q_RW(x − y). The acceptance probability ρ(x, y) then reduces to the simpler form ρ(x, y) = min{ 1, π(y)/π(x) }, which only depends on the target π [and accepts all proposed values that increase π].
52. Choice of q_RW: considerable flexibility in the choice of q_RW — tails (normal versus Student's t) and scale (size of the neighbourhood). Random walk proposals can also be used for restricted-support targets [with a waste of simulations near the boundary], and can be tuned towards an acceptance probability of 0.234 at the burn-in stage [magic number!].
55. Convergence assessment. Capital question: how many iterations do we need to run?
    Rule #1: there is no absolute number of simulations, i.e. 1,000 is neither large nor small.
    Rule #2: it takes [much] longer to check for convergence than for the chain itself to converge.
    Rule #3: MCMC is a "what-you-get-is-what-you-see" algorithm: it fails to tell about unexplored parts of the space.
    Rule #4: when in doubt, run MCMC chains in parallel and check for consistency.
    Many "quick-&-dirty" solutions exist in the literature, but they are not necessarily trustworthy.
57. Prohibited dynamic updating: tuning the proposal in terms of its past performance can only be implemented during burn-in, because otherwise it cancels the Markovian convergence properties. The use of several MCMC proposals together within a single algorithm, with a circular or random design, is fine: it almost always brings an improvement compared with its individual components (at the cost of increased simulation time).
59. Effective sample size: how many iid simulations from π are equivalent to N simulations from the MCMC algorithm? Based on the estimated k-th order autocorrelation, ρ_k = corr(x^(t), x^(t+k)), the effective sample size is N^ess = n ( 1 + 2 ∑_{k=1}^{T0} ρ̂_k² )^{−1/2}. This is only a partial indicator, which fails to signal chains stuck in one mode of the target.
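An empirical version of this diagnostic can be sketched as follows (illustrative code, not from the slides; it uses the common form n/(1 + 2 Σ ρ̂_k), truncating the autocorrelation sum at the first non-positive term, rather than the slide's exact expression):

```python
import random

def effective_sample_size(chain, max_lag=200):
    """ESS ~= n / (1 + 2 * sum_k rho_hat_k), with the sum of empirical
    autocorrelations truncated at the first non-positive value."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    acc = 0.0
    for k in range(1, max_lag + 1):
        rho = sum((chain[t] - mean) * (chain[t + k] - mean)
                  for t in range(n - k)) / (n * var)
        if rho <= 0.0:
            break
        acc += rho
    return n / (1.0 + 2.0 * acc)

# An iid sequence keeps ESS near n; a sticky AR(1) chain loses most of it.
random.seed(0)
iid = [random.gauss(0.0, 1.0) for _ in range(2000)]
ar = [0.0]
for _ in range(1999):
    ar.append(0.95 * ar[-1] + random.gauss(0.0, 1.0))
print(effective_sample_size(iid), effective_sample_size(ar))
```

As the slide warns, a chain stuck in a single well-mixed mode can still report a large ESS.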
61. The probit model: likelihood. Recall the probit likelihood ℓ(β | y, X) ∝ ∏_{i=1}^n Φ(x_i^T β)^{yi} (1 − Φ(x_i^T β))^{ni − yi}. If no prior information is available, resort to the flat prior π(β) ∝ 1 and then obtain the posterior distribution π(β | y, X) ∝ ∏_{i=1}^n Φ(x_i^T β)^{yi} (1 − Φ(x_i^T β))^{ni − yi}, which is nonstandard and simulated using MCMC techniques.
62. MCMC resolution: the Metropolis–Hastings random walk sampler works well for binary regression problems with a small number of predictors. It uses the maximum likelihood estimate β̂ as starting value and the asymptotic (Fisher) covariance matrix of the MLE, Σ̂, as scale.
63. MLE proposal: the R function glm is very useful to get the maximum likelihood estimate of β and its asymptotic covariance matrix Σ̂. In R:

    mod=summary(glm(y~X-1,family=binomial(link="probit")))

    with mod$coeff[,1] denoting β̂ and mod$cov.unscaled denoting Σ̂.
64. MCMC algorithm: probit random-walk Metropolis–Hastings.
    Initialization: set β^(0) = β̂ and compute Σ̂.
    Iteration t:
    1. Generate β̃ ∼ N_{k+1}(β^(t−1), τ Σ̂).
    2. Compute ρ(β^(t−1), β̃) = min{ 1, π(β̃ | y)/π(β^(t−1) | y) }.
    3. With probability ρ(β^(t−1), β̃), set β^(t) = β̃; otherwise set β^(t) = β^(t−1).
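A bare-bones version of this sampler (illustrative sketch, not the book's R code: it replaces the τΣ̂ proposal covariance with a simple spherical proposal of standard deviation `scale`, and uses the flat-prior probit posterior of the earlier slides):

```python
import math, random

def log_post(beta, X, y):
    """Flat-prior probit log-posterior:
    sum_i [ y_i log Phi(x_i^T beta) + (1 - y_i) log(1 - Phi(x_i^T beta)) ]."""
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    total = 0.0
    for xi, yi in zip(X, y):
        p = Phi(sum(b * xij for b, xij in zip(beta, xi)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard the log at the boundary
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def probit_rw_mh(X, y, beta0, scale, n_iter):
    """Random-walk MH on the probit posterior; accepts with the ratio
    pi(beta~ | y) / pi(beta^(t-1) | y), as in the slide."""
    beta, chain = list(beta0), []
    for _ in range(n_iter):
        prop = [b + random.gauss(0.0, scale) for b in beta]
        if math.log(random.random()) < log_post(prop, X, y) - log_post(beta, X, y):
            beta = prop
        chain.append(list(beta))
    return chain

# Tiny synthetic example: y mostly switches from 0 to 1 as x grows, so the
# posterior mass of the single coefficient should sit on positive values.
random.seed(2)
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 1, 0, 1, 1]
chain = probit_rw_mh(X, y, beta0=[0.0], scale=0.5, n_iter=3000)
print(sum(b[0] for b in chain) / len(chain))  # posterior mean, positive here
```

Starting from β̂ and scaling the proposal by the MLE covariance, as the slide prescribes, mainly speeds up burn-in and improves mixing.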
65. bank: probit modelling with no intercept over the four measurements. [Figure: traces, histograms, and autocorrelations of the four parameter chains for three different scales τ = 1, 0.1, 10; the best mixing behaviour is associated with τ = 1.] The average of the parameters over 9,000 iterations gives the plug-in estimate p̂i = Φ(−1.2193 xi1 + 0.9540 xi2 + 0.9795 xi3 + 1.1481 xi4).
68. G-priors for probit models: the flat prior on β is inappropriate for comparison purposes and Bayes factors. Replace it with the hierarchical prior β | σ², X ∼ N_k(0_k, σ² (X^T X)⁻¹) and π(σ² | X) ∝ σ^{−3/2}, (almost) as in normal linear regression. Warning: the matrix X^T X is not the Fisher information matrix.
70. G-priors for testing: same argument as before — while π is improper, the use of the same variance factor σ² in both models means the normalising constant cancels in the Bayes factor. Posterior distribution of β:
    π(β | y, X) ∝ |X^T X|^{1/2} Γ((2k − 1)/4) π^{−k/2} ( β^T (X^T X) β )^{−(2k−1)/4} ∏_{i=1}^n Φ(x_i^T β)^{yi} [1 − Φ(x_i^T β)]^{1−yi}
    [where k matters!]
71. Marginal approximation. Marginal:
    f(y | X) ∝ |X^T X|^{1/2} π^{−k/2} Γ{(2k − 1)/4} ∫ ( β^T (X^T X) β )^{−(2k−1)/4} ∏_{i=1}^n Φ(x_i^T β)^{yi} [1 − Φ(x_i^T β)]^{1−yi} dβ,
    approximated by
    ( |X^T X|^{1/2} / (π^{k/2} M) ) Γ{(2k − 1)/4} |V̂|^{1/2} (4π)^{k/2} ∑_{m=1}^M ||X β^(m)||^{−(2k−1)/2} e^{+(β^(m) − β̂)^T V̂⁻¹ (β^(m) − β̂)/4} ∏_{i=1}^n Φ(x_i^T β^(m))^{yi} [1 − Φ(x_i^T β^(m))]^{1−yi},
    where β^(m) ∼ N_k(β̂, 2V̂), with β̂ the MCMC approximation of E^π[β | y, X] and V̂ the MCMC approximation of V(β | y, X).
72. Linear hypothesis: a linear restriction on β, H0 : Rβ = r (r ∈ R^q, R a q × k matrix), under which the parameter β^0 is (k − q)-dimensional and X0 and x0 are linear transforms of X and of x, of dimensions (n, k − q) and (k − q). Likelihood under H0:
    ℓ(β^0 | y, X0) ∝ ∏_{i=1}^n Φ(x_{0i}^T β^0)^{yi} [1 − Φ(x_{0i}^T β^0)]^{1−yi}.
74. Linear test: associated [projected] G-prior β^0 | σ², X0 ∼ N_{k−q}(0_{k−q}, σ² (X0^T X0)⁻¹) and π(σ² | X0) ∝ σ^{−3/2}. The marginal distribution of y is of the same type:
    f(y | X0) ∝ |X0^T X0|^{1/2} π^{−(k−q)/2} Γ{(2(k − q) − 1)/4} ∫ ||X0 β^0||^{−(2(k−q)−1)/2} ∏_{i=1}^n Φ(x_{0i}^T β^0)^{yi} [1 − Φ(x_{0i}^T β^0)]^{1−yi} dβ^0.
75. banknote: for H0 : β1 = β2 = 0, B^π_10 = 157.73 [against H0]. Generic regression-like output:

    |    | Estimate | Post. var. | log10(BF)      |
    |----|----------|------------|----------------|
    | X1 | −1.1552  | 0.0631     | 4.5844 (****)  |
    | X2 | 0.9200   | 0.3299     | −0.2875        |
    | X3 | 0.9121   | 0.2595     | −0.0972        |
    | X4 | 1.0820   | 0.0287     | 15.6765 (****) |

    Evidence against H0: (****) decisive, (***) strong, (**) substantial, (*) poor.
76. 76. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models The Probit Model Informative settings If prior information available on p(x), transform into prior distribution on β by technique of imaginary observations: 388 / 785
77. 77. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models The Probit Model Informative settings If prior information available on p(x), transform into prior distribution on β by technique of imaginary observations: Start with k diﬀerent values of the covariate vector, x1 , . . . , xk ˜ ˜ For each of these values, the practitioner speciﬁes (i) a prior guess gi at the probability pi associated with xi ; (ii) an assessment of (un)certainty about that guess given by a number Ki of equivalent “prior observations”. 389 / 785
78. 78. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models The Probit Model

Informative settings

If prior information available on p(x), transform into prior distribution on β by technique of imaginary observations:

Start with k different values of the covariate vector, x̃_1, …, x̃_k.

For each of these values, the practitioner specifies (i) a prior guess g_i at the probability p_i associated with x̃_i; (ii) an assessment of (un)certainty about that guess given by a number K_i of equivalent "prior observations".

On how many imaginary observations did you build this guess?

390 / 785
79. 79. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models The Probit Model

Informative prior

π(p_1, …, p_k) ∝ ∏_{i=1}^{k} p_i^{K_i g_i − 1} (1 − p_i)^{K_i (1 − g_i) − 1}

translates into [Jacobian rule]

π(β) ∝ ∏_{i=1}^{k} Φ(x̃_i^T β)^{K_i g_i − 1} [1 − Φ(x̃_i^T β)]^{K_i (1 − g_i) − 1} φ(x̃_i^T β)

391 / 785
80. 80. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models The Probit Model

Informative prior

π(p_1, …, p_k) ∝ ∏_{i=1}^{k} p_i^{K_i g_i − 1} (1 − p_i)^{K_i (1 − g_i) − 1}

translates into [Jacobian rule]

π(β) ∝ ∏_{i=1}^{k} Φ(x̃_i^T β)^{K_i g_i − 1} [1 − Φ(x̃_i^T β)]^{K_i (1 − g_i) − 1} φ(x̃_i^T β)

[Almost] equivalent to using the G-prior

β ∼ N_k(0_k, [∑_{j=1}^{k} x̃_j x̃_j^T]^{−1})

392 / 785
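The imaginary-observation prior on β can be evaluated directly on the log scale. A minimal pure-Python sketch (the book's own code is in R; the function name, the single imaginary design point, and the guess g = 0.7 worth K = 10 prior observations below are illustrative assumptions):

```python
import math

def Phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def log_prior(beta, xtil, g, K):
    """Unnormalised log of the imaginary-observation prior:
    sum_i [(K_i g_i - 1) log Phi(xt_i^T beta)
           + (K_i (1 - g_i) - 1) log(1 - Phi(xt_i^T beta))
           + log phi(xt_i^T beta)]."""
    lp = 0.0
    for xi, gi, Ki in zip(xtil, g, K):
        z = sum(b * x for b, x in zip(beta, xi))  # xt_i^T beta
        lp += (Ki * gi - 1) * math.log(Phi(z)) \
            + (Ki * (1 - gi) - 1) * math.log(1.0 - Phi(z)) \
            + math.log(phi(z))
    return lp

# one imaginary covariate point, prior guess 0.7 worth 10 observations
print(log_prior((0.0, 0.5), [(1.0, 1.0)], [0.7], [10]))
```

Each imaginary point thus contributes a Beta-like factor in Φ(x̃_i^T β), reweighted by the Jacobian term φ(x̃_i^T β).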
81. 81. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models The logit model

The logit model

Recall that [for n_i = 1]

y_i | µ_i ∼ B(1, µ_i), ϕ = 1 and g(µ_i) = exp(µ_i) / (1 + exp(µ_i)).

Thus

P(y_i = 1 | β) = exp(x_i^T β) / (1 + exp(x_i^T β))

with likelihood

ℓ(β | y, X) = ∏_{i=1}^{n} [exp(x_i^T β) / (1 + exp(x_i^T β))]^{y_i} [1 − exp(x_i^T β) / (1 + exp(x_i^T β))]^{1 − y_i}

393 / 785
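This likelihood is easiest to evaluate on the log scale. A minimal pure-Python sketch (the book's own code is in R; the toy data and names here are illustrative):

```python
import math

def log_likelihood(beta, y, X):
    """Log-likelihood of the logit model:
    sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)],
    with p_i = exp(x_i^T beta) / (1 + exp(x_i^T beta))."""
    total = 0.0
    for yi, xi in zip(y, X):
        eta = sum(b * x for b, x in zip(beta, xi))  # linear predictor x_i^T beta
        # log p_i = eta - log(1 + e^eta) and log(1 - p_i) = -log(1 + e^eta)
        log1pe = math.log1p(math.exp(eta))
        total += yi * (eta - log1pe) + (1 - yi) * (-log1pe)
    return total

# toy data: intercept plus one covariate, four observations
X = [(1.0, 0.5), (1.0, -1.2), (1.0, 2.0), (1.0, 0.1)]
y = [1, 0, 1, 0]
print(log_likelihood((0.0, 0.0), y, X))  # 4 * log(1/2) = -2.7726...
```

At β = 0 every p_i equals 1/2, so the log-likelihood reduces to n log(1/2), which makes the function easy to sanity-check.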
82. 82. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models The logit model Links with probit usual vague prior for β, π(β) ∝ 1 394 / 785
83. 83. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models The logit model

Links with probit

usual vague prior for β, π(β) ∝ 1

Posterior given by

π(β | y, X) ∝ ∏_{i=1}^{n} [exp(x_i^T β) / (1 + exp(x_i^T β))]^{y_i} [1 − exp(x_i^T β) / (1 + exp(x_i^T β))]^{1 − y_i}

[intractable]

395 / 785
84. 84. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models The logit model

Links with probit

usual vague prior for β, π(β) ∝ 1

Posterior given by

π(β | y, X) ∝ ∏_{i=1}^{n} [exp(x_i^T β) / (1 + exp(x_i^T β))]^{y_i} [1 − exp(x_i^T β) / (1 + exp(x_i^T β))]^{1 − y_i}

[intractable]

Same Metropolis–Hastings sampler

396 / 785
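The random-walk Metropolis–Hastings sampler used here can be sketched in a few lines of pure Python (the book uses R; the toy data, the scale τ = 0.3, and all names below are illustrative assumptions):

```python
import math
import random

def log_post(beta, y, X):
    """Log-posterior of the logit model under the flat prior pi(beta) = 1
    (equal to the log-likelihood up to an additive constant)."""
    lp = 0.0
    for yi, xi in zip(y, X):
        eta = sum(b * x for b, x in zip(beta, xi))
        lp += yi * eta - math.log1p(math.exp(eta))
    return lp

def rw_metropolis(y, X, beta0, tau=0.3, n_iter=2000, seed=42):
    """Random-walk M-H: propose beta' = beta + tau * N(0, I),
    accept with probability min(1, pi(beta' | y, X) / pi(beta | y, X))."""
    random.seed(seed)
    beta, lp = list(beta0), log_post(beta0, y, X)
    chain = []
    for _ in range(n_iter):
        prop = [b + tau * random.gauss(0.0, 1.0) for b in beta]
        lp_prop = log_post(prop, y, X)
        if math.log(random.random()) < lp_prop - lp:  # accept/reject step
            beta, lp = prop, lp_prop
        chain.append(list(beta))
    return chain

# toy data; the bank covariates would enter in exactly the same way
X = [(1.0, 0.5), (1.0, -1.2), (1.0, 2.0), (1.0, 0.1)]
y = [1, 0, 1, 0]
chain = rw_metropolis(y, X, beta0=[0.0, 0.0])
```

Because proposals are symmetric, the Hastings ratio reduces to the posterior ratio, which is why only `log_post` differences appear in the acceptance step.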
85. 85. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models The logit model

bank

Same scale factor τ = 1: slight increase in the skewness of the histograms of the β_i's.

[MCMC traces and histograms of the four β_i's]

Plug-in estimate of predictive probability of a counterfeit:

p̂_i = exp(−2.5888 x_i1 + 1.9967 x_i2 + 2.1260 x_i3 + 2.1879 x_i4) / [1 + exp(−2.5888 x_i1 + 1.9967 x_i2 + 2.1260 x_i3 + 2.1879 x_i4)]

397 / 785
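The plug-in formula is just a logistic transform of the fitted linear predictor. A small Python sketch using the posterior means printed on the slide (the measurement vector passed in at the end is a hypothetical example, not a real banknote from the data):

```python
import math

# posterior means of the logit coefficients, as printed on the slide
BETA_HAT = (-2.5888, 1.9967, 2.1260, 2.1879)

def plugin_prob(x, beta=BETA_HAT):
    """Plug-in predictive probability of a counterfeit:
    p = exp(x^T beta) / (1 + exp(x^T beta)), computed stably."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

# hypothetical measurements (x1..x4, in mm), for illustration only
print(plugin_prob((214.9, 130.1, 129.9, 10.5)))
```

Since the coefficient of x_i4 (bottom margin width) is positive, p̂_i increases monotonically in that measurement, all else equal.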
86. 86. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models The logit model

G-priors for logit models

Same story: flat prior on β inappropriate for Bayes factors, to be replaced with hierarchical prior,

β | σ^2, X ∼ N_k(0_k, σ^2 (X^T X)^{−1}) and π(σ^2 | X) ∝ σ^{−3/2}

Example (bank)

        Estimate   Post. var.   log10(BF)
X1      -2.3970    0.3286        4.8084 (****)
X2       1.6978    1.2220       -0.2453
X3       2.1197    1.0094       -0.1529
X4       2.0230    0.1132       15.9530 (****)

evidence against H0: (****) decisive, (***) strong, (**) substantial, (*) poor

398 / 785
87. 87. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Loglinear models

Introduction to loglinear models

Example (airquality)

Benchmark in R

> air=data(airquality)

Repeated measurements over 111 consecutive days of ozone u (in parts per billion) and maximum daily temperature v, discretized into dichotomous variables:

                         month
ozone      temp        5    6    7    8    9
[1,31]     [57,79]    17    4    2    5   18
           (79,97]     0    2    3    3    2
(31,168]   [57,79]     6    1    0    3    1
           (79,97]     1    2   21   12    8

Contingency table with 5 × 2 × 2 = 20 entries

399 / 785
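The 5 × 2 × 2 table can be held as a nested list for later use as counts y in the Poisson regression. The cell values below regroup the slide's run-together digits; that reading is an assumption, supported by the twenty counts summing to the 111 days of measurements:

```python
# counts[ozone][temp][month_index] for months 5..9;
# ozone levels: [1,31], (31,168]; temp levels: [57,79], (79,97]
counts = [
    [[17, 4, 2, 5, 18],    # ozone [1,31],   temp [57,79]
     [0, 2, 3, 3, 2]],     # ozone [1,31],   temp (79,97]
    [[6, 1, 0, 3, 1],      # ozone (31,168], temp [57,79]
     [1, 2, 21, 12, 8]],   # ozone (31,168], temp (79,97]
]
total = sum(sum(row) for block in counts for row in block)
print(total)  # 111, one count per day of measurement
```

Flattening this array row by row yields the response vector y = (y_1, …, y_20) of the loglinear model.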
88. 88. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Poisson regression

Observations/counts y = (y_1, …, y_n) are integers, so we can choose

y_i ∼ P(µ_i)

Saturated likelihood

ℓ(µ | y) = ∏_{i=1}^{n} (1 / y_i!) µ_i^{y_i} exp(−µ_i)

400 / 785
89. 89. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Poisson regression

Observations/counts y = (y_1, …, y_n) are integers, so we can choose

y_i ∼ P(µ_i)

Saturated likelihood

ℓ(µ | y) = ∏_{i=1}^{n} (1 / y_i!) µ_i^{y_i} exp(−µ_i)

GLM constraint via log-linear link

log(µ_i) = x_i^T β, i.e. y_i | x_i ∼ P(exp(x_i^T β))

401 / 785
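Under the log-linear link the log-likelihood has a simple closed form. A pure-Python sketch (the book's code is in R; the toy counts and design below are illustrative):

```python
import math

def poisson_loglik(beta, y, X):
    """Log-likelihood of y_i ~ P(mu_i) with log mu_i = x_i^T beta:
    sum_i [y_i * x_i^T beta - exp(x_i^T beta) - log(y_i!)]."""
    ll = 0.0
    for yi, xi in zip(y, X):
        eta = sum(b * x for b, x in zip(beta, xi))  # log mu_i
        ll += yi * eta - math.exp(eta) - math.lgamma(yi + 1)  # lgamma(n+1) = log(n!)
    return ll

# toy counts and design; for loglinear models X would be the incidence matrix
y = [3, 0, 2, 5]
X = [(1.0, 0.2), (1.0, -0.5), (1.0, 0.0), (1.0, 1.0)]
print(poisson_loglik((0.0, 0.0), y, X))
```

At β = 0 every µ_i equals 1, so the log-likelihood reduces to Σ_i [−1 − log(y_i!)], another convenient sanity check.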
90. 90. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models Categorical variables Special feature Incidence matrix X = (xi ) such that its elements are all zeros or ones, i.e. covariates are all indicators/dummy variables! 402 / 785
91. 91. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models Categorical variables Special feature Incidence matrix X = (xi ) such that its elements are all zeros or ones, i.e. covariates are all indicators/dummy variables! Several types of (sub)models are possible depending on relations between categorical variables. 403 / 785
92. 92. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Categorical variables

Special feature: Incidence matrix X = (x_i) such that its elements are all zeros or ones, i.e. covariates are all indicators/dummy variables!

Several types of (sub)models are possible depending on relations between categorical variables.

Re-special feature: Variable selection problem of a specific kind, in the sense that all indicators related to the same association must either remain or vanish at once. Thus far fewer submodels than in a regular variable selection problem.

404 / 785
93. 93. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Parameterisations

Example of three variables: 1 ≤ u ≤ I, 1 ≤ v ≤ J and 1 ≤ w ≤ K.

405 / 785
94. 94. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Parameterisations

Example of three variables: 1 ≤ u ≤ I, 1 ≤ v ≤ J and 1 ≤ w ≤ K.

Simplest non-constant model is

log(µ_τ) = ∑_{b=1}^{I} β_b^u I_b(u_τ) + ∑_{b=1}^{J} β_b^v I_b(v_τ) + ∑_{b=1}^{K} β_b^w I_b(w_τ),

that is,

log(µ_{l(i,j,k)}) = β_i^u + β_j^v + β_k^w,

where index l(i, j, k) corresponds to u = i, v = j and w = k.

406 / 785
95. 95. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Parameterisations

Example of three variables: 1 ≤ u ≤ I, 1 ≤ v ≤ J and 1 ≤ w ≤ K.

Simplest non-constant model is

log(µ_τ) = ∑_{b=1}^{I} β_b^u I_b(u_τ) + ∑_{b=1}^{J} β_b^v I_b(v_τ) + ∑_{b=1}^{K} β_b^w I_b(w_τ),

that is,

log(µ_{l(i,j,k)}) = β_i^u + β_j^v + β_k^w,

where index l(i, j, k) corresponds to u = i, v = j and w = k.

Saturated model is

log(µ_{l(i,j,k)}) = β_{ijk}^{uvw}

407 / 785
96. 96. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Log-linear model (over-)parameterisation

Representation

log(µ_{l(i,j,k)}) = λ + λ_i^u + λ_j^v + λ_k^w + λ_{ij}^{uv} + λ_{ik}^{uw} + λ_{jk}^{vw} + λ_{ijk}^{uvw},

as in Anova models.

408 / 785
97. 97. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Log-linear model (over-)parameterisation

Representation

log(µ_{l(i,j,k)}) = λ + λ_i^u + λ_j^v + λ_k^w + λ_{ij}^{uv} + λ_{ik}^{uw} + λ_{jk}^{vw} + λ_{ijk}^{uvw},

as in Anova models.

λ appears as the overall or reference average effect,
λ_i^u appears as the marginal discrepancy (against the reference effect λ) when u = i,
λ_{ij}^{uv} as the interaction discrepancy (against the added effects λ + λ_i^u + λ_j^v) when (u, v) = (i, j),
and so on...

409 / 785
98. 98. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Example of submodels

1. if both v and w are irrelevant, then log(µ_{l(i,j,k)}) = λ + λ_i^u,
2. if all three categorical variables are mutually independent, then log(µ_{l(i,j,k)}) = λ + λ_i^u + λ_j^v + λ_k^w,
3. if u and v are associated but are both independent of w, then log(µ_{l(i,j,k)}) = λ + λ_i^u + λ_j^v + λ_k^w + λ_{ij}^{uv},
4. if u and v are conditionally independent given w, then log(µ_{l(i,j,k)}) = λ + λ_i^u + λ_j^v + λ_k^w + λ_{ik}^{uw} + λ_{jk}^{vw},
5. if there is no three-factor interaction, then log(µ_{l(i,j,k)}) = λ + λ_i^u + λ_j^v + λ_k^w + λ_{ij}^{uv} + λ_{ik}^{uw} + λ_{jk}^{vw} [the most complete submodel]

410 / 785
99. 99. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Identifiability

Representation

log(µ_{l(i,j,k)}) = λ + λ_i^u + λ_j^v + λ_k^w + λ_{ij}^{uv} + λ_{ik}^{uw} + λ_{jk}^{vw} + λ_{ijk}^{uvw}

is not identifiable, but the Bayesian approach handles non-identifiable settings and still properly estimates identifiable quantities.

411 / 785
100. 100. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Identifiability

Representation

log(µ_{l(i,j,k)}) = λ + λ_i^u + λ_j^v + λ_k^w + λ_{ij}^{uv} + λ_{ik}^{uw} + λ_{jk}^{vw} + λ_{ijk}^{uvw}

is not identifiable, but the Bayesian approach handles non-identifiable settings and still properly estimates identifiable quantities.

Customary to impose identifiability constraints on the parameters: set to 0 all parameters corresponding to the first category of each variable, i.e. remove the indicator of the first category. E.g., if u ∈ {1, 2} and v ∈ {1, 2}, constraint could be

λ_1^u = λ_1^v = λ_{11}^{uv} = λ_{12}^{uv} = λ_{21}^{uv} = 0.

412 / 785
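For the 2 × 2 example on this slide, the constraint leaves exactly four free parameters, matching the four cells. A small Python sketch of the resulting incidence matrix (names are illustrative):

```python
def incidence_row(u, v):
    """One row of the incidence matrix for a 2 x 2 table under the constraint
    lambda^u_1 = lambda^v_1 = lambda^{uv}_{11} = lambda^{uv}_{12} = lambda^{uv}_{21} = 0.
    Remaining columns: constant lambda, I(u=2), I(v=2), I(u=2)I(v=2)."""
    return [1, int(u == 2), int(v == 2), int(u == 2 and v == 2)]

rows = [incidence_row(u, v) for u in (1, 2) for v in (1, 2)]
print(rows)
# 4 cells and 4 free parameters: the constrained saturated model is exactly identified
```

Dropping the first-category indicators this way is exactly what removing the corresponding columns of the full incidence matrix achieves.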
101. 101. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Inference under a flat prior

Noninformative prior π(β) ∝ 1 gives posterior distribution

π(β | y, X) ∝ ∏_{i=1}^{n} [exp(x_i^T β)]^{y_i} exp{−exp(x_i^T β)}
            = exp{ ∑_{i=1}^{n} y_i x_i^T β − ∑_{i=1}^{n} exp(x_i^T β) }

413 / 785
102. 102. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

Inference under a flat prior

Noninformative prior π(β) ∝ 1 gives posterior distribution

π(β | y, X) ∝ ∏_{i=1}^{n} [exp(x_i^T β)]^{y_i} exp{−exp(x_i^T β)}
            = exp{ ∑_{i=1}^{n} y_i x_i^T β − ∑_{i=1}^{n} exp(x_i^T β) }

Use of same random walk M-H algorithm as in probit and logit cases, starting with MLE evaluation

> mod=summary(glm(y~-1+X,family=poisson()))

414 / 785
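The recipe (flat prior, random-walk M-H started at the MLE) can be sketched in Python for the simplest intercept-only case, where the MLE that glm() would return has the closed form log ȳ (toy counts; the scale 0.5 and all names are illustrative assumptions):

```python
import math
import random

def log_post(beta, y):
    """Flat-prior log-posterior of the intercept-only Poisson model
    log mu_i = beta: sum_i [y_i * beta - exp(beta)], up to a constant."""
    return sum(y) * beta - len(y) * math.exp(beta)

y = [3, 0, 2, 5, 1, 4]
beta_mle = math.log(sum(y) / len(y))  # closed-form MLE, standing in for glm()

random.seed(1)
beta, lp, chain = beta_mle, log_post(beta_mle, y), []
for _ in range(5000):
    prop = beta + 0.5 * random.gauss(0.0, 1.0)  # random-walk proposal
    lp_prop = log_post(prop, y)
    if math.log(random.random()) < lp_prop - lp:  # symmetric proposal: posterior ratio
        beta, lp = prop, lp_prop
    chain.append(beta)

post_mean = sum(chain) / len(chain)
```

Starting at the MLE skips the burn-in that a poorly chosen initial value would require, which is exactly the point of the `glm()` call on the slide.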
103. 103. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

airquality

Identifiable non-saturated model involves 16 parameters. Obtained with 10,000 MCMC iterations with scale factor τ^2 = 0.5.

Effect          Post. mean   Post. var.
λ                2.8041      0.0612
λ^u_2           -1.0684      0.2176
λ^v_2           -5.8652      1.7141
λ^w_2           -1.4401      0.2735
λ^w_3           -2.7178      0.7915
λ^w_4           -1.1031      0.2295
λ^w_5           -0.0036      0.1127
λ^{uv}_{22}      3.3559      0.4490
λ^{uw}_{22}     -1.6242      1.2869
λ^{uw}_{23}     -0.3456      0.8432
λ^{uw}_{24}     -0.2473      0.6658
λ^{uw}_{25}     -1.3335      0.7115
λ^{vw}_{22}      4.5493      2.1997
λ^{vw}_{23}      6.8479      2.5881
λ^{vw}_{24}      4.6557      1.7201
λ^{vw}_{25}      3.9558      1.7128

415 / 785
104. 104. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

airquality: MCMC output

[Trace plots of the 16 parameters over the 10,000 MCMC iterations]

416 / 785
105. 105. Bayesian Core:A Practical Approach to Computational Bayesian Statistics Generalized linear models Loglinear models

airquality: MCMC output

[Trace plots and histograms of the 16 parameters over the 10,000 MCMC iterations]

417 / 785