1. Chapter 4:
Decision theory and Bayesian analysis
Bayesian modelling
Conjugate priors
Improper prior distributions
Bayesian inference
2. A pedestrian example
paired and orphan socks
A drawer contains an unknown number of socks, some of which
can be paired and some of which are orphans (single). One takes
at random 11 socks without replacement from this drawer: no pair
can be found among those. What can we infer about the total
number of socks in the drawer?
sounds like an impossible task
one observation x = 11 and two unknowns, nsocks and npairs
writing the likelihood is a challenge [exercise]
4. A prioris on socks
Given parameters nsocks and npairs, the set of socks is
S = \{ s_1, s_1, \ldots, s_{npairs}, s_{npairs}, s_{npairs+1}, \ldots, s_{nsocks} \}
and 11 socks picked at random from S give X unique socks.
Rasmus' reasoning
If you are a family of 3-4 persons then a guesstimate would be that
you have something like 15 pairs of socks in store. It is also
possible that you have much more than 30 socks. So as a prior for
nsocks I'm going to use a negative binomial with mean 30 and
standard deviation 15.
On the proportion of paired socks, 2 npairs / nsocks, I'm going to put a Beta
prior distribution that puts most of the probability over the range 0.75 to 1.0.
[Rasmus Bååth's Research Blog, Oct 20th, 2014]
6. Simulating the experiment
Given a prior distribution on nsocks and npairs,
nsocks ∼ Neg(30, 15),   npairs | nsocks ∼ ⌊nsocks/2⌋ Be(15, 2)
it is possible to
1 generate new values of nsocks and npairs,
2 generate a new observation of X, the number of unique socks out of 11,
3 accept the pair (nsocks, npairs) if the realisation of X is equal to 11
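This accept-reject scheme is short to code. Below is a minimal Python sketch (not from the slides), assuming numpy; the Neg(30, 15) prior is matched by moments to numpy's negative_binomial parameterisation, the Beta prior is placed on the proportion of paired socks as above, and names like prior_draw and unique_socks are illustrative only.

import numpy as np

rng = np.random.default_rng(0)

def prior_draw():
    # nsocks ~ Neg(30, 15): match mean 30, sd 15 to numpy's (r, p) parameters
    mean, sd = 30.0, 15.0
    p = mean / sd**2                      # mean / variance
    r = mean * p / (1 - p)
    n_socks = rng.negative_binomial(r, p)
    prop_pairs = rng.beta(15, 2)          # proportion of paired socks
    n_pairs = int(np.floor(n_socks / 2) * prop_pairs)
    return n_socks, n_pairs

def unique_socks(n_socks, n_pairs, n_draw=11):
    # draw n_draw socks without replacement, count the unique ones
    socks = np.concatenate([np.repeat(np.arange(n_pairs), 2),
                            np.arange(n_pairs, n_socks - n_pairs)])
    if socks.size < n_draw:
        return -1
    picked = rng.choice(socks, size=n_draw, replace=False)
    return np.unique(picked).size

accepted = []
while len(accepted) < 2000:
    ns, npr = prior_draw()
    if unique_socks(ns, npr) == 11:       # keep only draws matching X = 11
        accepted.append((ns, npr))

print("posterior median of nsocks:",
      int(np.median([ns for ns, _ in accepted])))

The accepted pairs are draws from the conditional distribution of (nsocks, npairs) given X = 11, as the next slide argues.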
8. Meaning
[Figure: density of the simulated values of nsocks, on the range 0 to 60]
The outcome of this simulation method is a distribution on the pair (nsocks, npairs) that is the conditional distribution of the pair given the observation X = 11.
Proof: Generations from π(nsocks, npairs) are accepted with probability
P\{X = 11 \mid (nsocks, npairs)\}
hence the accepted values are distributed from
\pi(nsocks, npairs) \, P\{X = 11 \mid (nsocks, npairs)\} \propto \pi(nsocks, npairs \mid X = 11)
10. General principle
Bayesian principle: given a probability distribution on the parameter θ, called the prior,
π(θ),
and an observation x of X ∼ f(x|θ), Bayesian inference relies on the conditional distribution of θ given X = x,
\pi(\theta \mid x) = \frac{\pi(\theta)\, f(x \mid \theta)}{\int \pi(\theta)\, f(x \mid \theta)\, d\theta}
called the posterior distribution
[Bayes' theorem]
Thomas Bayes (FRS, 1701?-1761)
11. Bayesian inference
The posterior distribution
π(θ|x),
as the distribution on the parameter θ conditional on the observation x, is used for all aspects of inference:
point estimation, e.g., E[h(θ)|x];
confidence intervals, e.g., {θ; π(θ|x) ≥ κ};
tests of hypotheses, e.g., π(θ = 0|x); and
prediction of future observations
14. Posterior distribution, defined up to a constant as
\pi(\theta \mid x) \propto f(x \mid \theta)\, \pi(\theta)
Operates conditional upon the observation(s) X = x
Integrates simultaneously the prior information and the information brought by x
Avoids averaging over the unobserved values of X
Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected [domino effect]
Provides a complete inferential scope and a unique motor of inference
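As a concrete illustration of the two slides above (not in the original deck), the posterior can be formed on a grid by multiplying likelihood and prior and normalising numerically; the Be(2, 2) prior and the binomial data x = 3, n = 10 are arbitrary choices, and scipy is assumed.

import numpy as np
from scipy.stats import binom, beta

theta = np.linspace(1e-6, 1 - 1e-6, 10001)
prior = beta.pdf(theta, 2, 2)                 # pi(theta)
lik = binom.pmf(3, 10, theta)                 # f(x | theta) with x = 3, n = 10
unnorm = lik * prior                          # pi(theta | x) up to a constant
post = unnorm / (unnorm.sum() * (theta[1] - theta[0]))
# conjugacy check: should match Be(2 + 3, 2 + 7) up to grid error
print(np.abs(post - beta.pdf(theta, 5, 9)).max())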
15. The thorny issue of the prior distribution
Compared with likelihood inference, based solely on
L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta),
Bayesian inference introduces an extra measure π(θ) that is chosen a priori, hence subjectively, by the statistician, based on
the hypothetical range of θ,
guesstimates of θ with an associated (lack of) precision,
the type of sampling distribution.
Note: there also exist reference solutions (see below)
17. Bayes' example
Billiard ball W rolled on a line of length one, with a uniform probability of stopping anywhere: W stops at p.
Second ball O then rolled n times under the same assumptions. X denotes the number of times the ball O stopped on the left of W.
Thomas Bayes' question
Given X, what inference can we make on p?
Modern translation:
Derive the posterior distribution of p given X, when
p ∼ U([0, 1]) and X ∼ B(n, p)
20. Resolution
Since
P(X = x \mid p) = \binom{n}{x} p^x (1 - p)^{n - x},
P(a \leq p \leq b \text{ and } X = x) = \int_a^b \binom{n}{x} p^x (1 - p)^{n - x}\, dp
and
P(X = x) = \int_0^1 \binom{n}{x} p^x (1 - p)^{n - x}\, dp,
21. Resolution (2)
then
P(a \leq p \leq b \mid X = x) = \frac{\int_a^b \binom{n}{x} p^x (1 - p)^{n - x}\, dp}{\int_0^1 \binom{n}{x} p^x (1 - p)^{n - x}\, dp} = \frac{\int_a^b p^x (1 - p)^{n - x}\, dp}{B(x + 1, n - x + 1)},
i.e.
p | x ∼ Be(x + 1, n − x + 1)
[Beta distribution]
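A numerical check of this resolution (a sketch; n, x, a, b are arbitrary and scipy is assumed): the ratio of integrals, approximated on a uniform grid, agrees with the Be(x + 1, n − x + 1) probability.

import numpy as np
from scipy.stats import binom, beta

n, x, a, b = 10, 3, 0.2, 0.5
p = np.linspace(0, 1, 100001)
lik = binom.pmf(x, n, p)                          # C(n, x) p^x (1-p)^(n-x)
ratio = lik[(p >= a) & (p <= b)].sum() / lik.sum()  # ratio of integrals on a uniform grid
post = beta(x + 1, n - x + 1)
print(ratio, post.cdf(b) - post.cdf(a))           # nearly identical values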
23. Conjugate priors
The easiest case is when the prior distribution stays within a parametric family.
Conjugacy
A family F of probability distributions on Θ is conjugate for a likelihood function f(x|θ) if, for every π ∈ F, the posterior distribution π(θ|x) also belongs to F.
In this case, posterior inference is tractable and reduces to updating the hyperparameters of the prior.
Example: in Thomas Bayes' example, the Be(a, b) prior is conjugate.
The hyperparameters are parameters of the priors; they are most often not treated as random variables.
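In code, a conjugate update is a one-line operation on hyperparameters. A sketch for the Be(a, b)/binomial pair of Bayes' example (scipy assumed; data values arbitrary):

from scipy.stats import beta

def update_beta(a, b, x, n):
    # Be(a, b) prior + x successes in n binomial trials -> Be(a + x, b + n - x)
    return a + x, b + n - x

a_post, b_post = update_beta(a=1, b=1, x=3, n=10)   # uniform prior = Be(1, 1)
print(beta(a_post, b_post).mean())                  # posterior mean (a + x)/(a + b + n)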
27. Exponential families and conjugacy
The family of exponential distributions
f(x \mid \theta) = C(\theta)\, h(x) \exp\{R(\theta) \cdot T(x)\} = h(x) \exp\{R(\theta) \cdot T(x) - \tau(\theta)\}
allows for conjugate priors of the form
\pi(\theta \mid \xi, \lambda) = K(\xi, \lambda) \exp\{R(\theta) \cdot \xi - \lambda\, \tau(\theta)\}
By the Pitman-Koopman-Darmois lemma, this is the only case [besides uniform distributions]
29. Illustration: Discrete/Multinomial - Dirichlet
If observations consist of positive counts Y_1, ..., Y_d modelled by a Multinomial M(θ_1, ..., θ_d) distribution
L(y \mid \theta, n) = \frac{n!}{\prod_{i=1}^{d} y_i!} \prod_{i=1}^{d} \theta_i^{y_i},
the conjugate family is the Dirichlet D(α_1, ..., α_d) distribution
\pi(\theta \mid \alpha) = \frac{\Gamma(\sum_{i=1}^{d} \alpha_i)}{\prod_{i=1}^{d} \Gamma(\alpha_i)} \prod_{i=1}^{d} \theta_i^{\alpha_i - 1},
defined on the probability simplex (θ_i ≥ 0, ∑_{i=1}^{d} θ_i = 1), where Γ is the gamma function
\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t}\, dt
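The corresponding hyperparameter update, as a short sketch (prior weights and counts arbitrary):

import numpy as np

def update_dirichlet(alpha, counts):
    # D(alpha) prior + multinomial counts y -> D(alpha + y) posterior
    return np.asarray(alpha, float) + np.asarray(counts, float)

alpha_post = update_dirichlet([1, 1, 1], [5, 2, 3])
print(alpha_post / alpha_post.sum())   # posterior mean of (theta_1, ..., theta_d)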
31. Standard exponential families
f(x|θ)              π(θ)              π(θ|x)
Normal N(θ, σ²)     Normal N(μ, τ²)   N(ρ(σ²μ + τ²x), ρσ²τ²), ρ⁻¹ = σ² + τ²
Poisson P(θ)        Gamma G(α, β)     G(α + x, β + 1)
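The normal line of this table coded directly, as a sketch (values arbitrary):

def normal_update(x, sigma2, mu, tau2):
    # x ~ N(theta, sigma2), theta ~ N(mu, tau2)
    # posterior: N(rho (sigma2 mu + tau2 x), rho sigma2 tau2), rho = 1/(sigma2 + tau2)
    rho = 1.0 / (sigma2 + tau2)
    return rho * (sigma2 * mu + tau2 * x), rho * sigma2 * tau2

print(normal_update(x=1.2, sigma2=1.0, mu=0.0, tau2=10.0))  # mean 10x/11, variance 10/11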
54. Justifications
Often automatic prior determination leads to improper prior distributions:
1 Only way to derive a prior in noninformative settings
2 Performance of estimators derived from these generalized distributions is usually good
3 Improper priors often occur as limits of proper distributions
4 More robust answers against possible misspecifications of the prior
68. Validation
Extension of the posterior distribution π(θ|x) associated with an improper prior, as given by Bayes's formula
\pi(\theta \mid x) = \frac{f(x \mid \theta)\, \pi(\theta)}{\int f(x \mid \theta)\, \pi(\theta)\, d\theta},
when
\int f(x \mid \theta)\, \pi(\theta)\, d\theta < \infty
70. Normal illustration
If x ∼ N(θ, 1) and π(θ) = ϖ, a constant, the pseudo marginal distribution is
m(x) = \varpi \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\{-(x - \theta)^2/2\}\, d\theta = \varpi
and the posterior distribution of θ is
\pi(\theta \mid x) = \frac{1}{\sqrt{2\pi}} \exp\{-(x - \theta)^2/2\},
i.e., corresponds to a N(x, 1) distribution.
[independent of ϖ!]
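A quick numerical confirmation of this slide (a sketch, scipy assumed): normalising f(x|θ) times any constant over θ recovers the N(x, 1) density, whatever the constant.

import numpy as np
from scipy.stats import norm

x = 0.7
theta = np.linspace(x - 12, x + 12, 100001)
unnorm = 3.14 * norm.pdf(x, loc=theta, scale=1.0)   # any constant prior value works
post = unnorm / (unnorm.sum() * (theta[1] - theta[0]))
print(np.abs(post - norm.pdf(theta, loc=x, scale=1.0)).max())  # ~ 0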
73. Warning
The mistake is to think of them [non-informative priors] as representing ignorance.
[Lindley, 1990]
Normal illustration: consider a N(0, τ²) prior. Then
\lim_{\tau \to \infty} P^{\pi}(\theta \in [a, b]) = 0
for any (a, b)
74. Warning
Noninformative priors cannot be expected to represent exactly total ignorance about the problem at hand, but should rather be taken as reference or default priors, upon which everyone could fall back when the prior information is missing.
[Kass and Wasserman, 1996]
Normal illustration: consider a N(0, τ²) prior. Then
\lim_{\tau \to \infty} P^{\pi}(\theta \in [a, b]) = 0
for any (a, b)
75. Haldane prior
Consider a binomial observation, x ∼ B(n, p), and the prior
π*(p) ∝ [p(1 − p)]⁻¹
[Haldane, 1931]
The marginal distribution,
m(x) = \int_0^1 [p(1 - p)]^{-1} \binom{n}{x} p^x (1 - p)^{n - x}\, dp = B(x, n - x),
is only defined for x ≠ 0, n.
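A short check of where this marginal exists (a sketch, scipy assumed): B(x, n − x) is finite only for 0 < x < n.

from scipy.special import beta as B

n = 10
for x in range(n + 1):
    print(x, B(x, n - x))   # diverges (infinite) at x = 0 and x = n, finite in between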
79. The Jeffreys prior
Based on the Fisher information
I(\theta) = E_{\theta}\left[ \frac{\partial \ell}{\partial \theta^{T}} \frac{\partial \ell}{\partial \theta} \right],
the Jeffreys prior density is
π*(θ) ∝ |I(θ)|^{1/2}
Pros:
relates to information theory
agrees with most invariant priors
is parameterisation invariant
81. Example
If x ∼ N_p(θ, I_p), the Jeffreys prior is
π(θ) ∝ 1,
and if η = ‖θ‖², then
π(η) = η^{p/2 − 1}
and
E^π[η | x] = ‖x‖² + p,
with bias 2p
[Not recommended!]
82. Example
If x ∼ B(n, θ), the Jeffreys prior is
Be(1/2, 1/2),
and, if n ∼ Neg(x, θ), the Jeffreys prior is derived from
I_2(\theta) = -E_{\theta}\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right] = E_{\theta}\left[ \frac{x}{\theta^2} + \frac{n - x}{(1 - \theta)^2} \right] = \frac{x}{\theta^2 (1 - \theta)},
so that
π_2(θ) ∝ θ⁻¹ (1 − θ)^{−1/2}
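A numeric sketch of the binomial case (values arbitrary; scipy assumed): the exact expectation of the squared score reproduces I(θ) = n/(θ(1 − θ)), whose square root is the Be(1/2, 1/2) density up to a constant.

import numpy as np
from scipy.stats import binom

def fisher_info_binomial(theta, n):
    # I(theta) = E[(d/dtheta log f(x | theta))^2], summed exactly over x = 0..n
    x = np.arange(n + 1)
    score = x / theta - (n - x) / (1 - theta)
    return np.sum(binom.pmf(x, n, theta) * score**2)

theta, n = 0.3, 10
print(fisher_info_binomial(theta, n), n / (theta * (1 - theta)))  # both ~ 47.62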
83. MAP estimator
When considering estimates of the parameter θ, one default solution is the maximum a posteriori (MAP) estimator
\arg\max_{\theta}\ \ell(\theta \mid x)\, \pi(\theta)
Motivations
most likely value of θ
penalized likelihood estimator
further appeal in restricted parameter spaces
85. Illustration
Consider x ∼ B(n, p). Possible priors:
\pi^*(p) = \frac{1}{B(1/2, 1/2)} p^{-1/2} (1 - p)^{-1/2}, \quad \pi_1(p) = 1, \quad \pi_2(p) = p^{-1} (1 - p)^{-1}.
Corresponding MAP estimators:
\delta^*(x) = \max\left\{ \frac{x - 1/2}{n - 1}, 0 \right\}, \quad \delta_1(x) = \frac{x}{n}, \quad \delta_2(x) = \max\left\{ \frac{x - 1}{n - 2}, 0 \right\}.
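These three estimators in code, as a sketch (the guards reproduce the max{., 0} truncations):

def map_estimators(x, n):
    # MAP under the Jeffreys, uniform and Haldane priors above
    d_star = max((x - 0.5) / (n - 1), 0.0)
    d_1 = x / n
    d_2 = max((x - 1.0) / (n - 2), 0.0)
    return d_star, d_1, d_2

print(map_estimators(x=3, n=10))   # (0.2778, 0.3, 0.25)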
86. Illustration [opposite]
MAP not always appropriate: when
f(x \mid \theta) = \frac{1}{\pi} \left[ 1 + (x - \theta)^2 \right]^{-1}
and
\pi(\theta) = \frac{1}{2} e^{-|\theta|},
then the MAP estimator of θ is always
δ*(x) = 0
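A grid search makes the point concrete (a sketch; the value x = 4 is arbitrary): the Laplace prior's spike at zero dominates the flat-tailed Cauchy likelihood.

import numpy as np

x_obs = 4.0
theta = np.linspace(-10.0, 10.0, 200001)                  # grid containing 0 exactly
log_post = -np.log1p((x_obs - theta)**2) - np.abs(theta)  # log-likelihood + log-prior, up to constants
print(theta[np.argmax(log_post)])                         # 0.0, for any x_obs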
87. Prediction
Inference on new observations depending on the same parameter, conditional on the current data.
If x ∼ f(x|θ) [observed], θ ∼ π(θ), and z ∼ g(z|x, θ) [unobserved], the predictive of z is the marginal conditional
g(z \mid x) = \int_{\Theta} g(z \mid x, \theta)\, \pi(\theta \mid x)\, d\theta.
88. time series illustration
Consider the AR(1) model
x_t = \varrho\, x_{t-1} + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma^2).
The predictive of x_T is then
x_T \mid x_{1:(T-1)} \sim \int \frac{\sigma^{-1}}{\sqrt{2\pi}} \exp\{-(x_T - \varrho x_{T-1})^2 / 2\sigma^2\}\, \pi(\varrho, \sigma \mid x_{1:(T-1)})\, d\varrho\, d\sigma,
and π(ϱ, σ | x_{1:(T-1)}) can be expressed in closed form
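The predictive integral is naturally approximated by simulation: average the AR(1) transition over posterior draws of (ϱ, σ). A sketch with placeholder posterior draws (in practice they would come from the closed-form posterior; all values here are illustrative):

import numpy as np

rng = np.random.default_rng(1)
rho_draws = rng.normal(0.8, 0.05, 1000)            # placeholder posterior sample of rho
sigma_draws = np.abs(rng.normal(1.0, 0.1, 1000))   # placeholder posterior sample of sigma
x_prev = 2.0                                       # last observed value x_{T-1}

x_T = rng.normal(rho_draws * x_prev, sigma_draws)  # one predictive draw per posterior draw
print(x_T.mean(), np.quantile(x_T, [0.025, 0.975]))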
91. Posterior mean
Theorem: the solution to
\arg\min_{\delta} E^{\pi}[(\theta - \delta)^2 \mid x]
is given by
δ^π(x) = E^π[θ | x]
[Posterior mean = Bayes estimator under quadratic loss]
92. Posterior median
Theorem: when θ ∈ ℝ, the solution to
\arg\min_{\delta} E^{\pi}[\,|\theta - \delta| \mid x\,]
is given by
δ^π(x) = median π(θ|x)
[Posterior median = Bayes estimator under absolute loss]
Obvious extension to
\arg\min_{\delta} E^{\pi}\left[ \sum_{i=1}^{p} |\theta_i - \delta_i| \ \middle|\ x \right]
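Both Bayes estimators can be read off a posterior sample; a sketch reusing the billiard posterior Be(x + 1, n − x + 1) with x = 3, n = 10 (scipy assumed):

import numpy as np
from scipy.stats import beta

draws = beta(3 + 1, 10 - 3 + 1).rvs(100_000, random_state=0)
print(draws.mean())       # Bayes estimator under quadratic loss (posterior mean)
print(np.median(draws))   # Bayes estimator under absolute loss (posterior median)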
104. Inference with conjugate priors
For conjugate distributions, posterior expectations of the natural parameters may be expressed analytically, for one or several observations.
Distribution        Conjugate prior     Posterior mean
Normal N(θ, σ²)     Normal N(μ, τ²)     (σ²μ + τ²x)/(σ² + τ²)
Poisson P(θ)        Gamma G(α, β)       (α + x)/(β + 1)
Gamma G(ν, θ)       Gamma G(α, β)       (α + ν)/(β + x)
120. Credible regions
A confidence region based on π(θ|x) is
C(x) = \{\theta;\ \pi(\theta \mid x) \geq k_{\alpha}\}
with
P(θ ∈ C(x) | x) = 1 − α
Highest posterior density (HPD) region
Example: if x ∼ N(θ, 1) and θ ∼ N(0, 10), then
θ | x ∼ N(10x/11, 10/11)
and
C(x) = \{\theta;\ |\theta - 10x/11| \leq k'_{\alpha}\} = (10x/11 - k'_{\alpha},\ 10x/11 + k'_{\alpha})
Warning: the frequentist coverage is not 1 − α, hence the name credible rather than confidence region.
Further validation of HPD regions as smallest-volume (1 − α)-coverage regions
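For this normal example the posterior is symmetric and unimodal, so the HPD region is the central interval; a sketch (scipy assumed):

import numpy as np
from scipy.stats import norm

def hpd_normal_example(x, alpha=0.05):
    # theta | x ~ N(10x/11, 10/11) under the N(0, 10) prior
    mean, sd = 10 * x / 11, np.sqrt(10 / 11)
    k = norm.ppf(1 - alpha / 2)
    return mean - k * sd, mean + k * sd

print(hpd_normal_example(1.5))   # 95% HPD interval for x = 1.5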