Bayesian Choice (slides)

These are slides for a 6-week course I give yearly at Université Paris Dauphine.
Bayesian Choice (slides) Presentation Transcript

  • 1. Bayesian Statistics, Christian P. Robert, Université Paris Dauphine and CREST-INSEE, http://www.ceremade.dauphine.fr/~xian, June 3, 2013
  • 2. Outline: Introduction; Decision-Theoretic Foundations of Statistical Inference; From Prior Information to Prior Distributions; Bayesian Point Estimation; Tests and model choice; Admissibility and Complete Classes; Hierarchical and Empirical Bayes Extensions, and the Stein Effect; A Defense of the Bayesian Choice
  • 3. Introduction / Vocabulary, concepts and first examples: Models; The Bayesian framework; Prior and posterior distributions; Improper prior distributions
  • 5. Introduction / Models / Parametric model: observations x1, . . . , xn generated from a probability distribution
    fi(xi|θi, x1, . . . , xi−1) = fi(xi|θi, x1:i−1), i.e. x = (x1, . . . , xn) ∼ f(x|θ), θ = (θ1, . . . , θn).
    Associated likelihood ℓ(θ|x) = f(x|θ) [inverted density]
  • 8. Introduction / The Bayesian framework / Bayes Theorem: Bayes theorem = inversion of probabilities. If A and E are events such that P(E) ≠ 0, P(A|E) and P(E|A) are related by
    P(A|E) = P(E|A)P(A) / [P(E|A)P(A) + P(E|Ac)P(Ac)] = P(E|A)P(A) / P(E)
    [Thomas Bayes, 1764]. Actualisation principle
  • 10. Introduction / The Bayesian framework / New perspective:
    ◮ Uncertainty on the parameters θ of a model modeled through a probability distribution π on Θ, called the prior distribution
    ◮ Inference based on the distribution of θ conditional on x, π(θ|x), called the posterior distribution:
    π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ .
  • 12. Introduction / The Bayesian framework / Definition (Bayesian model): A Bayesian statistical model is made of a parametric statistical model, (X, f(x|θ)), and a prior distribution on the parameters, (Θ, π(θ)).
  • 17. Introduction / The Bayesian framework / Justifications:
    ◮ Semantic drift from unknown to random
    ◮ Actualization of the information on θ by extracting the information on θ contained in the observation x
    ◮ Allows incorporation of imperfect information in the decision process
    ◮ Unique mathematical way to condition upon the observations (conditional perspective)
    ◮ Penalization factor
  • 19. Introduction / The Bayesian framework / Bayes' example: Billiard ball W rolled on a line of length one, with a uniform probability of stopping anywhere: W stops at p. A second ball O is then rolled n times under the same assumptions. X denotes the number of times the ball O stopped on the left of W. Bayes' question: given X, what inference can we make on p?
  • 20. Introduction / The Bayesian framework / Modern translation: derive the posterior distribution of p given X, when p ∼ U([0, 1]) and X ∼ B(n, p)
  • 21. Introduction / The Bayesian framework / Resolution: Since
    P(X = x|p) = C(n,x) p^x (1 − p)^(n−x),
    P(a < p < b and X = x) = ∫_a^b C(n,x) p^x (1 − p)^(n−x) dp
    and
    P(X = x) = ∫_0^1 C(n,x) p^x (1 − p)^(n−x) dp ,
  • 23. Introduction / The Bayesian framework / Resolution (2): then
    P(a < p < b | X = x) = ∫_a^b C(n,x) p^x (1 − p)^(n−x) dp / ∫_0^1 C(n,x) p^x (1 − p)^(n−x) dp
                         = ∫_a^b p^x (1 − p)^(n−x) dp / B(x + 1, n − x + 1) ,
    i.e. p|x ∼ Be(x + 1, n − x + 1) [Beta distribution]
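A minimal numerical check of Bayes' billiard result in Python; the values of n, x, a, b below are illustrative, not taken from the slides.

```python
# p | X = x ~ Be(x + 1, n - x + 1), so interval probabilities come from the Beta cdf.
from scipy.stats import beta

n, x = 10, 3          # hypothetical data
a, b = 0.2, 0.5       # hypothetical interval
posterior = beta(x + 1, n - x + 1)
print(posterior.cdf(b) - posterior.cdf(a))   # P(a < p < b | X = x)
```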
  • 25. Introduction / Prior and posterior distributions: Given f(x|θ) and π(θ), several distributions of interest:
    (a) the joint distribution of (θ, x), ϕ(θ, x) = f(x|θ)π(θ) ;
    (b) the marginal distribution of x, m(x) = ∫ ϕ(θ, x) dθ = ∫ f(x|θ)π(θ) dθ ;
  • 27. (c) the posterior distribution of θ, π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ = f(x|θ)π(θ) / m(x) ;
    (d) the predictive distribution of y, when y ∼ g(y|θ, x), g(y|x) = ∫ g(y|θ, x)π(θ|x) dθ .
  • 32. Introduction / Prior and posterior distributions / Posterior distribution, central to Bayesian inference:
    ◮ Operates conditional upon the observations
    ◮ Incorporates the requirement of the Likelihood Principle
    ◮ Avoids averaging over the unobserved values of x
    ◮ Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected
    ◮ Provides a complete inferential scope
  • 34. Introduction / Prior and posterior distributions / Example (Flat prior (1)): Consider x ∼ N(θ, 1) and θ ∼ N(0, 10). Then
    π(θ|x) ∝ f(x|θ)π(θ) ∝ exp{−(x − θ)²/2 − θ²/20}
           ∝ exp{−11θ²/20 + θx}
           ∝ exp{−(11/20)[θ − (10x/11)]²}
    and θ|x ∼ N(10x/11, 10/11)
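A sketch of this normal-normal update in code, using the general conjugate formulas; the observed value x = 2.0 is illustrative.

```python
# x ~ N(theta, lik_var), theta ~ N(prior_mean, prior_var)
# => theta | x ~ N(post_mean, post_var) with precisions adding up.
def normal_posterior(x, prior_mean=0.0, prior_var=10.0, lik_var=1.0):
    post_var = 1.0 / (1.0 / prior_var + 1.0 / lik_var)
    post_mean = post_var * (prior_mean / prior_var + x / lik_var)
    return post_mean, post_var

print(normal_posterior(2.0))   # (20/11, 10/11), matching N(10x/11, 10/11)
```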
  • 36. Introduction / Prior and posterior distributions / Example (HPD region): Natural confidence region
    C = {θ; π(θ|x) > k} = {θ; |θ − (10/11)x| < k′}
    Highest posterior density (HPD) region
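For a unimodal symmetric posterior such as the normal one above, the HPD region is an interval centred at the posterior mean; a small sketch with an assumed credibility level of 95% (the level is not specified on the slide).

```python
# HPD region for theta | x ~ N(10x/11, 10/11): mean plus/minus a normal quantile.
import numpy as np
from scipy.stats import norm

def normal_hpd(x, alpha=0.05):
    mean, var = 10 * x / 11, 10 / 11
    half = norm.ppf(1 - alpha / 2) * np.sqrt(var)
    return mean - half, mean + half

print(normal_hpd(2.0))
```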
  • 38. Introduction / Improper prior distributions / Improper distributions: Necessary extension from a prior distribution to a prior σ-finite measure π such that ∫_Θ π(θ) dθ = +∞. Improper prior distribution
  • 42. Introduction / Improper prior distributions / Justifications: Often automatic prior determination leads to improper prior distributions.
    1. Only way to derive a prior in noninformative settings
    2. Performances of estimators derived from these generalized distributions usually good
    3. Improper priors often occur as limits of proper distributions
    4. More robust answer against possible misspecifications of the prior
  • 45. 5. Generally more acceptable to non-Bayesians, with frequentist justifications, such as: (i) minimaxity, (ii) admissibility, (iii) invariance
    6. Improper priors preferred to vague proper priors such as a N(0, 100²) distribution
    7. Penalization factor in min_d ∫∫ L(θ, d) π(θ) f(x|θ) dx dθ
  • 47. Introduction / Improper prior distributions / Validation: Extension of the posterior distribution π(θ|x) associated with an improper prior π as given by Bayes's formula
    π(θ|x) = f(x|θ)π(θ) / ∫_Θ f(x|θ)π(θ) dθ ,
    when ∫_Θ f(x|θ)π(θ) dθ < ∞
  • 50. Introduction / Improper prior distributions / Example: If x ∼ N(θ, 1) and π(θ) = ϖ, constant, the pseudo marginal distribution is
    m(x) = ϖ ∫_{−∞}^{+∞} (1/√(2π)) exp{−(x − θ)²/2} dθ = ϖ
    and the posterior distribution of θ is
    π(θ|x) = (1/√(2π)) exp{−(x − θ)²/2} ,
    i.e., corresponds to a N(x, 1) distribution. [independent of ϖ]
  • 51. Introduction / Improper prior distributions / Warning: "The mistake is to think of them [non-informative priors] as representing ignorance" [Lindley, 1990]
  • 52. Introduction / Improper prior distributions / Example (Flat prior (2)): Consider a θ ∼ N(0, τ²) prior. Then lim_{τ→∞} Pπ(θ ∈ [a, b]) = 0 for any (a, b)
  • 54. Introduction / Improper prior distributions / Example (Haldane prior): Consider a binomial observation, x ∼ B(n, p), and π*(p) ∝ [p(1 − p)]⁻¹ [Haldane, 1931]. The marginal distribution,
    m(x) = ∫_0^1 [p(1 − p)]⁻¹ C(n,x) p^x (1 − p)^(n−x) dp = B(x, n − x),
    is only defined for x ≠ 0, n .
  • 55. Decision-Theoretic Foundations of Statistical Inference / Decision theory motivations: Evaluation of estimators; Loss functions; Minimaxity and admissibility; Usual loss functions
  • 57. Decision-Theoretic Foundations / Evaluation of estimators / Evaluating estimators: The purpose of most inferential studies is to provide the statistician/client with a decision d ∈ D. This requires an evaluation criterion for decisions and estimators, L(θ, d) [a.k.a. loss function]
  • 60. Decision-Theoretic Foundations / Evaluation of estimators / Bayesian Decision Theory: three spaces/factors:
    (1) on X, a distribution for the observation, f(x|θ);
    (2) on Θ, a prior distribution for the parameter, π(θ);
    (3) on Θ × D, a loss function associated with the decisions, L(θ, δ)
  • 61. Decision-Theoretic Foundations / Evaluation of estimators / Foundations / Theorem (Existence): There exists an axiomatic derivation of the existence of a loss function. [DeGroot, 1970]
  • 63. Decision-Theoretic Foundations / Loss functions / Estimators: The decision procedure δ is usually called an estimator (while its value δ(x) is called the estimate of θ). Fact: it is impossible to uniformly minimize (in d) the loss function L(θ, d) when θ is unknown
  • 65. Decision-Theoretic Foundations / Loss functions / Frequentist Principle: Average loss (or frequentist risk)
    R(θ, δ) = E_θ[L(θ, δ(x))] = ∫_X L(θ, δ(x)) f(x|θ) dx
    Principle: select the best estimator based on the risk function
  • 68. Decision-Theoretic Foundations / Loss functions / Difficulties with the frequentist paradigm:
    (1) The error is averaged over the different values of x proportionally to the density f(x|θ): not so appealing for a client, who wants optimal results for her data x!
    (2) The assumption of repeatability of experiments is not always grounded.
    (3) R(θ, δ) is a function of θ: there is no total ordering on the set of procedures.
  • 69. Decision-Theoretic Foundations / Loss functions / Bayesian principle: Integrate over the space Θ to get the posterior expected loss
    ρ(π, d|x) = Eπ[L(θ, d)|x] = ∫_Θ L(θ, d) π(θ|x) dθ ,
  • 71. Decision-Theoretic Foundations / Loss functions / Bayesian principle (2): Alternative: integrate over the space Θ and compute the integrated risk
    r(π, δ) = Eπ[R(θ, δ)] = ∫_Θ ∫_X L(θ, δ(x)) f(x|θ) dx π(θ) dθ ,
    which induces a total ordering on estimators. Existence of an optimal decision
  • 73. Decision-Theoretic Foundations / Loss functions / Bayes estimator / Theorem (Construction of Bayes estimators): An estimator minimizing r(π, δ) can be obtained by selecting, for every x ∈ X, the value δ(x) which minimizes ρ(π, δ|x), since
    r(π, δ) = ∫_X ρ(π, δ(x)|x) m(x) dx .
    Both approaches give the same estimator
  • 74. Decision-Theoretic Foundations / Loss functions / Bayes estimator (2) / Definition (Bayes optimal procedure): A Bayes estimator associated with a prior distribution π and a loss function L is
    arg min_δ r(π, δ) .
    The value r(π) = r(π, δπ) is called the Bayes risk
  • 77. Decision-Theoretic Foundations / Loss functions / Infinite Bayes risk: The above result is valid for both proper and improper priors when r(π) < ∞. Otherwise, a generalized Bayes estimator must be defined pointwise:
    δπ(x) = arg min_d ρ(π, d|x)
    if ρ(π, d|x) is well-defined for every x.
    Warning: generalized Bayes ≠ improper Bayes
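The pointwise construction can be mimicked numerically: approximate ρ(π, d|x) by a Monte Carlo average over a posterior sample and minimise in d. The posterior (a normal) and the quadratic loss below are illustrative stand-ins, not taken from the slides.

```python
# Pointwise Bayes decision: minimise the Monte Carlo estimate of the
# posterior expected loss rho(pi, d | x) over d.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_post = norm(loc=1.5, scale=0.8).rvs(size=5000, random_state=rng)  # assumed posterior sample

def loss(theta, d):            # quadratic loss, as an example
    return (theta - d) ** 2

def bayes_decision(sample):
    rho = lambda d: loss(sample, d).mean()    # Monte Carlo posterior expected loss
    return minimize_scalar(rho).x

print(bayes_decision(theta_post))  # close to the posterior mean, as expected under quadratic loss
```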
  • 80. Decision-Theoretic Foundations / Minimaxity and admissibility / Minimaxity: Frequentist insurance against the worst case and (weak) total ordering on D*. Definition (Frequentist optimality): the minimax risk associated with a loss L is
    R̄ = inf_{δ∈D*} sup_θ R(θ, δ) = inf_{δ∈D*} sup_θ E_θ[L(θ, δ(x))],
    and a minimax estimator is any estimator δ0 such that sup_θ R(θ, δ0) = R̄.
  • 84. Decision-Theoretic Foundations / Minimaxity and admissibility / Criticisms:
    ◮ Analysis in terms of the worst case
    ◮ Does not incorporate prior information
    ◮ Too conservative
    ◮ Difficult to exhibit/construct
  • 85. Decision-Theoretic Foundations / Minimaxity and admissibility / Example (Normal mean): Consider
    δ2(x) = [1 − (2p − 1)/||x||²] x if ||x||² ≥ 2p − 1, and 0 otherwise,
    to estimate θ when x ∼ Np(θ, Ip) under quadratic loss, L(θ, d) = ||θ − d||².
  • 86. [Figure: risk comparison of δ2 with δ1(x) = x, the maximum likelihood estimator, for p = 10, as a function of θ.] δ2 cannot be minimax
  • 87. Decision-Theoretic Foundations / Minimaxity and admissibility / Minimaxity (2) / Existence: If D ⊂ R^k is convex and compact, and if L(θ, d) is continuous and convex as a function of d for every θ ∈ Θ, there exists a nonrandomized minimax estimator.
  • 89. Decision-Theoretic Foundations / Minimaxity and admissibility / Connection with the Bayesian approach: The Bayes risks are always smaller than the minimax risk:
    r̲ = sup_π r(π) = sup_π inf_{δ∈D} r(π, δ) ≤ r̄ = inf_{δ∈D*} sup_θ R(θ, δ).
    Definition: the estimation problem has a value when r̲ = r̄, i.e. sup_π inf_{δ∈D} r(π, δ) = inf_{δ∈D*} sup_θ R(θ, δ). r̲ is the maximin risk and the corresponding π is the least favourable prior
  • 90. Decision-Theoretic Foundations / Minimaxity and admissibility / Maximin-ity: When the problem has a value, some minimax estimators are Bayes estimators for the least favourable distributions.
  • 91. Maximin-ity (2) / Example (Binomial probability): Consider x ∼ Be(θ) with θ ∈ {0.1, 0.5} and
    δ1(x) = 0.1,  δ2(x) = 0.5,  δ3(x) = 0.1 I_{x=0} + 0.5 I_{x=1},  δ4(x) = 0.5 I_{x=0} + 0.1 I_{x=1},
    under the loss L(θ, d) = 0 if d = θ, 1 if (θ, d) = (0.5, 0.1), 2 if (θ, d) = (0.1, 0.5)
  • 92. [Figure: risk set, with the risk points of δ1, δ2, δ3, δ4 and the randomized estimator δ*]
  • 94. Decision-Theoretic Foundations / Minimaxity and admissibility / Example (Binomial probability (2)): Minimax estimator at the intersection of the diagonal of R² with the lower boundary of the risk set R:
    δ*(x) = δ3(x) with probability α = 0.87, and δ2(x) with probability 1 − α.
    Also the randomized Bayes estimator for π(θ) = 0.22 I_{0.1}(θ) + 0.78 I_{0.5}(θ)
  • 95. Decision-Theoretic Foundations / Minimaxity and admissibility / Checking minimaxity / Theorem (Bayes & minimax): If δ0 is a Bayes estimator for π0 and if R(θ, δ0) ≤ r(π0) for every θ in the support of π0, then δ0 is minimax and π0 is the least favourable distribution
  • 96. Example (Binomial probability (3)): Consider x ∼ B(n, θ) with the loss L(θ, δ) = (δ − θ)². When θ ∼ Be(√n/2, √n/2), the posterior mean is
    δ*(x) = (x + √n/2) / (n + √n),
    with constant risk R(θ, δ*) = 1/[4(1 + √n)²]. Therefore, δ* is minimax. [H. Rubin]
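A quick numerical confirmation that this posterior mean has constant quadratic risk; the sample size n = 12 is an arbitrary illustration.

```python
# Check R(theta, delta*) = sum_x C(n,x) theta^x (1-theta)^(n-x) (delta*(x) - theta)^2
# equals 1 / (4 (1 + sqrt(n))^2) for several theta values.
import numpy as np
from scipy.stats import binom

n = 12
x = np.arange(n + 1)
delta = (x + np.sqrt(n) / 2) / (n + np.sqrt(n))

for theta in (0.1, 0.37, 0.5, 0.9):
    risk = np.sum(binom.pmf(x, n, theta) * (delta - theta) ** 2)
    print(theta, risk, 1 / (4 * (1 + np.sqrt(n)) ** 2))
```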
  • 97. Decision-Theoretic Foundations / Minimaxity and admissibility / Checking minimaxity (2) / Theorem (Bayes & minimax (2)): If, for a sequence (πn) of proper priors, the generalised Bayes estimator δ0 satisfies
    R(θ, δ0) ≤ lim_{n→∞} r(πn) < +∞
    for every θ ∈ Θ, then δ0 is minimax.
  • 99. Example (Normal mean): When x ∼ N(θ, 1), δ0(x) = x is a generalised Bayes estimator associated with π(θ) ∝ 1. Since, for πn(θ) = exp{−θ²/2n},
    R(δ0, θ) = E_θ[(x − θ)²] = 1 = lim_{n→∞} r(πn) = lim_{n→∞} n/(n + 1),
    δ0 is minimax.
  • 101. Decision-Theoretic Foundations / Minimaxity and admissibility / Admissibility: Reduction of the set of acceptable estimators based on "local" properties. Definition (Admissible estimator): An estimator δ0 is inadmissible if there exists an estimator δ1 such that, for every θ, R(θ, δ0) ≥ R(θ, δ1) and, for at least one θ0, R(θ0, δ0) > R(θ0, δ1). Otherwise, δ0 is admissible
  • 103. Minimaxity & admissibility: If there exists a unique minimax estimator, this estimator is admissible. The converse is false! If δ0 is admissible with constant risk, δ0 is the unique minimax estimator. The converse is false!
  • 106. Decision-Theoretic Foundations / Minimaxity and admissibility / The Bayesian perspective: Admissibility is strongly related to the Bayes paradigm: Bayes estimators often constitute the class of admissible estimators.
    ◮ If π is strictly positive on Θ, with r(π) = ∫_Θ R(θ, δπ) π(θ) dθ < ∞ and R(θ, δ) continuous, then the Bayes estimator δπ is admissible.
    ◮ If the Bayes estimator associated with a prior π is unique, it is admissible.
    Regular (≠ generalized) Bayes estimators are always admissible
  • 108. Decision-Theoretic Foundations / Minimaxity and admissibility / Example (Normal mean): Consider x ∼ N(θ, 1) and the test of H0 : θ ≤ 0, i.e. the estimation of I_{H0}(θ). Under the loss (I_{H0}(θ) − δ(x))², the estimator (p-value)
    p(x) = P0(X > x) (X ∼ N(0, 1)) = 1 − Φ(x)
    is Bayes under Lebesgue measure.
  • 109. Example (Normal mean (2)): Indeed
    p(x) = Eπ[I_{H0}(θ)|x] = Pπ(θ < 0|x) = Pπ(θ − x < −x|x) = 1 − Φ(x).
    The Bayes risk of p is finite and p(x) is admissible.
  • 110. Example (Normal mean (3)): Consider x ∼ N(θ, 1). Then δ0(x) = x is a generalised Bayes estimator and is admissible, but
    r(π, δ0) = ∫_{−∞}^{+∞} R(θ, δ0) dθ = ∫_{−∞}^{+∞} 1 dθ = +∞.
  • 111. Example (Normal mean (4)): Consider x ∼ Np(θ, Ip). If L(θ, d) = (d − ||θ||²)², the Bayes estimator for the Lebesgue measure is δπ(x) = ||x||² + p. This estimator is not admissible because it is dominated by δ0(x) = ||x||² − p
  • 113. Decision-Theoretic Foundations / Usual loss functions / The quadratic loss: Historically, the first loss function (Legendre, Gauss):
    L(θ, d) = (θ − d)²  or  L(θ, d) = ||θ − d||²
  • 114. Proper loss / Posterior mean: The Bayes estimator δπ associated with the prior π and with the quadratic loss is the posterior expectation
    δπ(x) = Eπ[θ|x] = ∫_Θ θ f(x|θ)π(θ) dθ / ∫_Θ f(x|θ)π(θ) dθ .
  • 116. Decision-Theoretic Foundations / Usual loss functions / The absolute error loss: Alternatives to the quadratic loss:
    L(θ, d) = |θ − d| ,
    or
    L_{k1,k2}(θ, d) = k2(θ − d) if θ > d, and k1(d − θ) otherwise.   (1)
    L1 estimator: the Bayes estimator associated with π and (1) is a (k2/(k1 + k2)) fractile of π(θ|x).
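This fractile is easy to approximate from a posterior sample; the Beta posterior and the weights k1, k2 below are illustrative assumptions, not values from the slides.

```python
# Under L_{k1,k2}, the Bayes estimate is the k2/(k1+k2) quantile of pi(theta|x).
import numpy as np
from scipy.stats import beta

k1, k2 = 1.0, 3.0
posterior_sample = beta(4, 8).rvs(size=10_000, random_state=1)   # stand-in posterior
print(np.quantile(posterior_sample, k2 / (k1 + k2)))
```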
  • 118. Decision-Theoretic Foundations / Usual loss functions / The 0-1 loss: Neyman–Pearson loss for testing hypotheses. Test of H0 : θ ∈ Θ0 versus H1 : θ ∉ Θ0. Then D = {0, 1}. The 0-1 loss:
    L(θ, d) = 1 − d if θ ∈ Θ0, and d otherwise
  • 120. Type-one and type-two errors: associated with the risk
    R(θ, δ) = E_θ[L(θ, δ(x))] = P_θ(δ(x) = 0) if θ ∈ Θ0, and P_θ(δ(x) = 1) otherwise.
    Theorem (Bayes test): The Bayes estimator associated with π and with the 0-1 loss is
    δπ(x) = 1 if P(θ ∈ Θ0|x) > P(θ ∉ Θ0|x), and 0 otherwise
  • 121. Decision-Theoretic Foundations / Usual loss functions / Intrinsic losses: Noninformative settings without a natural parameterisation: the estimators should be invariant under reparameterisation [ultimate invariance!]. Principle: corresponding parameterisation-free loss functions, L(θ, δ) = d(f(·|θ), f(·|δ)),
  • 123. Examples:
    1. the entropy distance (or Kullback–Leibler divergence), Le(θ, δ) = E_θ[log(f(x|θ)/f(x|δ))],
    2. the Hellinger distance, LH(θ, δ) = (1/2) E_θ[(√(f(x|δ)/f(x|θ)) − 1)²].
  • 124. Example (Normal mean): Consider x ∼ N(θ, 1). Then
    Le(θ, δ) = (1/2) E_θ[−(x − θ)² + (x − δ)²] = (1/2)(δ − θ)²,
    LH(θ, δ) = 1 − exp{−(δ − θ)²/8}.
    When π(θ|x) is a N(µ(x), σ²) distribution, the Bayes estimator of θ is δπ(x) = µ(x) in both cases.
  • 125. From Prior Information to Prior Distributions: Models; Subjective determination; Conjugate priors; Noninformative prior distributions
  • 126. From Prior Information to Prior Distributions / Models / Prior Distributions: The most critical and most criticized point of Bayesian analysis! Because... the prior distribution is the key to Bayesian inference
  • 127. But... In practice, it seldom occurs that the available prior information is precise enough to lead to an exact determination of the prior distribution. There is no such thing as the prior distribution!
  • 128. Rather... The prior is a tool summarizing available information as well as uncertainty related with this information. And... Ungrounded prior distributions produce unjustified posterior inference
  • 129. From Prior Information to Prior Distributions / Subjective determination / Subjective priors / Example (Capture probabilities): Capture–recapture experiment on migrations between zones. Prior information on capture and survival probabilities, pt and qit:

    Time             2           3           4           5            6
    pt mean          0.3         0.4         0.5         0.2          0.2
    95% cred. int.   [0.1,0.5]   [0.2,0.6]   [0.3,0.7]   [0.05,0.4]   [0.05,0.4]

    Site             A                         B
    Time             t=1,3,5     t=2,4         t=1,3,5     t=2,4
    qit mean         0.7         0.65          0.7         0.7
    95% cred. int.   [0.4,0.95]  [0.35,0.9]    [0.4,0.95]  [0.4,0.95]

  • 130. Example (Capture probabilities (2)): Corresponding prior modeling:

    Time    2          3          4           5            6
    Dist.   Be(6, 14)  Be(8, 12)  Be(12, 12)  Be(3.5, 14)  Be(3.5, 14)

    Site    A                           B
    Time    t=1,3,5        t=2,4        t=1,3,5        t=2,4
    Dist.   Be(6.0, 2.5)   Be(6.5, 3.5) Be(6.0, 2.5)   Be(6.0, 2.5)
  • 134. From Prior Information to Prior Distributions / Subjective determination / Strategies for prior determination:
    ◮ Use a partition of Θ in sets (e.g., intervals), determine the probability of each set, and approach π by a histogram
    ◮ Select significant elements of Θ, evaluate their respective likelihoods and deduce a likelihood curve proportional to π
    ◮ Use the marginal distribution of x, m(x) = ∫_Θ f(x|θ)π(θ) dθ
    ◮ Empirical and hierarchical Bayes techniques
  • 135. ◮ Select a maximum entropy prior when prior characteristics are known: Eπ[gk(θ)] = ωk (k = 1, . . . , K), with solution, in the discrete case,
    π*(θi) = exp{Σ_{k=1}^K λk gk(θi)} / Σ_j exp{Σ_{k=1}^K λk gk(θj)},
    and, in the continuous case,
    π*(θ) = exp{Σ_{k=1}^K λk gk(θ)} π0(θ) / ∫ exp{Σ_{k=1}^K λk gk(η)} π0(dη),
    the λk's being Lagrange multipliers and π0 a reference measure [Caveat]
  • 136. ◮ Parametric approximations: restrict the choice of π to a parameterised density π(θ|λ) and determine the corresponding (hyper-)parameters λ through the moments or quantiles of π
  • 137. Example: For the normal model x ∼ N(θ, 1), ranges of the posterior moments for fixed prior moments µ1 = 0 and µ2:

    µ2    x   Minimum mean   Maximum mean   Maximum variance
    3     0   -1.05          1.05           3.00
    3     1   -0.70          1.69           3.63
    3     2   -0.50          2.85           5.78
    1.5   0   -0.59          0.59           1.50
    1.5   1   -0.37          1.05           1.97
    1.5   2   -0.27          2.08           3.80

    [Goutis, 1990]
  • 138. From Prior Information to Prior Distributions / Conjugate priors: Specific parametric family with analytical properties. Definition: A family F of probability distributions on Θ is conjugate for a likelihood function f(x|θ) if, for every π ∈ F, the posterior distribution π(θ|x) also belongs to F. [Raiffa & Schlaifer, 1961] Only of interest when F is parameterised: switching from prior to posterior distribution is reduced to an updating of the corresponding parameters.
  • 145. From Prior Information to Prior Distributions / Conjugate priors / Justifications:
    ◮ Limited/finite information conveyed by x
    ◮ Preservation of the structure of π(θ)
    ◮ Exchangeability motivations
    ◮ Device of virtual past observations
    ◮ Linearity of some estimators
    ◮ Tractability and simplicity
    ◮ First approximations to adequate priors, backed up by robustness analysis
  • 146. From Prior Information to Prior Distributions / Conjugate priors / Exponential families / Definition: The family of distributions
    f(x|θ) = C(θ) h(x) exp{R(θ) · T(x)}
    is called an exponential family of dimension k. When Θ ⊂ R^k, X ⊂ R^k and f(x|θ) = C(θ) h(x) exp{θ · x}, the family is said to be natural.
  • 147. Interesting analytical properties:
    ◮ Sufficient statistics (Pitman–Koopman Lemma)
    ◮ Common enough structure (normal, binomial, Poisson, Wishart, &tc...)
    ◮ Analyticity (E_θ[x] = ∇ψ(θ), ...)
    ◮ Allow for conjugate priors: π(θ|µ, λ) = K(µ, λ) e^{θ·µ − λψ(θ)}
  • 148. From Prior Information to Prior Distributions / Conjugate priors / Standard exponential families:

    f(x|θ)              π(θ)               π(θ|x)
    Normal N(θ, σ²)     Normal N(µ, τ²)    N(ρ(σ²µ + τ²x), ρσ²τ²), with ρ⁻¹ = σ² + τ²
    Poisson P(θ)        Gamma G(α, β)      G(α + x, β + 1)
    Gamma G(ν, θ)       Gamma G(α, β)      G(α + ν, β + x)
    Binomial B(n, θ)    Beta Be(α, β)      Be(α + x, β + n − x)

  • 149. Standard exponential families [2]:

    f(x|θ)                          π(θ)                         π(θ|x)
    Negative binomial Neg(m, θ)     Beta Be(α, β)                Be(α + m, β + x)
    Multinomial Mk(θ1, . . . , θk)  Dirichlet D(α1, . . . , αk)  D(α1 + x1, . . . , αk + xk)
    Normal N(µ, 1/θ)                Gamma Ga(α, β)               G(α + 0.5, β + (µ − x)²/2)
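One row of this table in code, the Beta-Binomial case: the update is just parameter arithmetic. The prior parameters and data below are illustrative.

```python
# B(n, theta) likelihood with a Be(alpha, beta) prior gives a Be(alpha + x, beta + n - x) posterior.
from scipy.stats import beta

def beta_binomial_update(alpha, b, x, n):
    return beta(alpha + x, b + n - x)

post = beta_binomial_update(alpha=2, b=2, x=7, n=10)
print(post.mean(), post.interval(0.95))
```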
  • 150. From Prior Information to Prior Distributions / Conjugate priors / Linearity of the posterior mean: If
    θ ∼ π_{λ,x0}(θ) ∝ e^{θ·x0 − λψ(θ)}
    with x0 ∈ X, then Eπ[∇ψ(θ)] = x0/λ. Therefore, if x1, . . . , xn are i.i.d. f(x|θ),
    Eπ[∇ψ(θ)|x1, . . . , xn] = (x0 + n x̄)/(λ + n).
  • 151. But... Example: When x ∼ Be(α, θ) with known α,
    f(x|θ) ∝ Γ(α + θ)(1 − x)^θ / Γ(θ),
    and the conjugate distribution is not so easily manageable:
    π(θ|x0, λ) ∝ [Γ(α + θ)/Γ(θ)]^λ (1 − x0)^θ
  • 152. Example: Coin spun on its edge, proportion θ of heads. When spinning a given coin n times, the number of heads is x ∼ B(n, θ). Flat prior, or mixture prior
    (1/2)[Be(10, 20) + Be(20, 10)]  or  0.5 Be(10, 20) + 0.2 Be(15, 15) + 0.3 Be(20, 10).
    Mixtures of natural conjugate distributions also make conjugate families
  • 153. [Figure: the three prior distributions (1, 2 and 3 components) for the spinning-coin experiment]
  • 154. [Figure: the corresponding posterior distributions for 50 observations]
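A sketch of why Beta mixtures stay conjugate for binomial data: each component is updated as usual and the weights are reweighted by the component marginal likelihoods. The counts x = 32 heads out of n = 50 spins are assumed for illustration.

```python
# Posterior of a Beta-mixture prior under x ~ B(n, theta):
# components become Be(a_i + x, b_i + n - x), weights are rescaled by the
# per-component evidence B(a_i + x, b_i + n - x) / B(a_i, b_i).
import numpy as np
from scipy.special import betaln

weights = np.array([0.5, 0.2, 0.3])
a = np.array([10.0, 15.0, 20.0])
b = np.array([20.0, 15.0, 10.0])
x, n = 32, 50                                        # assumed data

log_marg = betaln(a + x, b + n - x) - betaln(a, b)   # component evidence (up to a common factor)
post_w = weights * np.exp(log_marg - log_marg.max())
post_w /= post_w.sum()
print(post_w, list(zip(a + x, b + n - x)))           # new weights and Beta parameters
```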
  • 156. From Prior Information to Prior Distributions / The noninformative conundrum: What if all we know is that we know "nothing"?! In the absence of prior information, prior distributions are solely derived from the sample distribution f(x|θ) [noninformative priors].
    Re-Warning: "Noninformative priors cannot be expected to represent exactly total ignorance about the problem at hand, but should rather be taken as reference or default priors, upon which everyone could fall back when the prior information is missing." [Kass and Wasserman, 1996]
  • 157. From Prior Information to Prior Distributions / Laplace's prior: Principle of Insufficient Reason (Laplace): Θ = {θ1, · · · , θp}, π(θi) = 1/p. Extension to continuous spaces: π(θ) ∝ 1
  • 158. ◮ Lack of reparameterization invariance/coherence: for ψ = e^θ, π1(ψ) = 1/ψ ≠ π2(ψ) = 1
    ◮ Problems of properness: x ∼ N(θ, σ²), π(θ, σ) = 1,
    π(θ, σ|x) ∝ e^{−(x−θ)²/2σ²} σ⁻¹ ⇒ π(σ|x) ∝ 1 (!!!)
  • 159. From Prior Information to Prior Distributions / Invariant priors: Principle: agree with the natural symmetries of the problem.
    - Identify invariance structures as group actions: G : x → g(x) ∼ f(g(x)|ḡ(θ)); Ḡ : θ → ḡ(θ); G* : L(d, θ) = L(g*(d), ḡ(θ))
    - Determine an invariant prior: π(ḡ(A)) = π(A)
  • 160. Solution: right Haar measure. But...
    ◮ Requires invariance to be part of the decision problem
    ◮ Missing in most discrete setups (Poisson)
  • 161. From Prior Information to Prior Distributions / The Jeffreys prior: Based on the Fisher information
    I(θ) = E_θ[(∂ℓ/∂θ)ᵗ (∂ℓ/∂θ)].
    The Jeffreys prior distribution is π*(θ) ∝ |I(θ)|^(1/2)
  • 162. Pros & Cons:
    ◮ Relates to information theory
    ◮ Agrees with most invariant priors
    ◮ Parameterization invariant
    ◮ Suffers from the dimensionality curse
    ◮ Not coherent for the Likelihood Principle
  • 163. Example: x ∼ Np(θ, Ip), η = ||θ||², π(η) = η^{p/2−1}; then Eπ[η|x] = ||x||² + p, with bias 2p
  • 164. Example: If x ∼ B(n, θ), Jeffreys' prior is Be(1/2, 1/2); and, if n ∼ Neg(x, θ), Jeffreys' prior is built from
    −E_θ[∂²/∂θ² log f(x|θ)] = E_θ[x/θ² + (n − x)/(1 − θ)²] = x/[θ²(1 − θ)],
    so that π2(θ) ∝ θ⁻¹(1 − θ)^(−1/2)
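A small Monte Carlo sanity check of the binomial case: the empirical Fisher information (mean squared score) should match n/[θ(1 − θ)], whose square root gives the Be(1/2, 1/2) shape. The value n = 20 is an arbitrary illustration.

```python
# Estimate I(theta) = E_theta[(d/dtheta log f(x|theta))^2] for x ~ B(n, theta)
# and compare with the closed form n / (theta (1 - theta)).
import numpy as np
from scipy.stats import binom

def fisher_info_binomial(theta, n, m=200_000, seed=0):
    x = binom.rvs(n, theta, size=m, random_state=seed)
    score = x / theta - (n - x) / (1 - theta)     # score function of the binomial model
    return np.mean(score ** 2)

n = 20
for theta in (0.2, 0.5, 0.8):
    print(theta, fisher_info_binomial(theta, n), n / (theta * (1 - theta)))
```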
  • 165. From Prior Information to Prior Distributions / Reference priors: Generalizes Jeffreys priors by distinguishing between nuisance and interest parameters. Principle: maximize the information brought by the data,
    E^n[∫ π(θ|xn) log(π(θ|xn)/π(θ)) dθ],
    and consider the limit of the πn. Outcome: most usually, the Jeffreys prior
  • 166. Nuisance parameters: for θ = (λ, ω), take π(λ|ω) = π^J(λ|ω), Jeffreys' prior conditional on ω (with fixed ω), and π(ω) = π^J(ω) for the marginal model
    f(x|ω) ∝ ∫ f(x|θ) π^J(λ|ω) dλ
    ◮ Depends on the ordering
    ◮ Problems of definition
  • 167. Example (Neyman–Scott problem): Observation of xij iid N(µi, σ²), i = 1, . . . , n, j = 1, 2. The usual Jeffreys prior for this model is π(µ1, . . . , µn, σ) = σ^{−n−1}, which is inconsistent because
    E[σ²|x11, . . . , xn2] = s²/(2n − 2), where s² = Σ_{i=1}^n (xi1 − xi2)²/2 ,
  • 168. Example (Neyman–Scott problem): The associated reference prior with θ1 = σ and θ2 = (µ1, . . . , µn) gives π(θ2|θ1) ∝ 1 and π(σ) ∝ 1/σ. Therefore, E[σ²|x11, . . . , xn2] = s²/(n − 2)
  • 169. From Prior Information to Prior Distributions / Matching priors: Frequency-validated priors: some posterior probabilities
    π(g(θ) ∈ Cx|x) = 1 − α
    must coincide with the corresponding frequentist coverage
    P_θ(Cx ∋ g(θ)) = ∫ I_{Cx}(g(θ)) f(x|θ) dx ,
    ...asymptotically
  • 170. For instance, Welch and Peers' identity: P_θ(θ ≤ kα(x)) = 1 − α + O(n^{−1/2}) and, for Jeffreys' prior, P_θ(θ ≤ kα(x)) = 1 − α + O(n^{−1})
  • 171. In general, the choice of a matching prior is dictated by the cancelation of a first-order term in an Edgeworth expansion, like
    [I″(θ)]^{−1/2} I′(θ) ∇log π(θ) + ∇ᵗ{I′(θ)[I″(θ)]^{−1/2}} = 0 .
  • 172. Example (Linear calibration model):
    yi = α + βxi + εi,  y0j = α + βx0 + ε0j  (i = 1, . . . , n, j = 1, . . . , k),
    with θ = (x0, α, β, σ²) and x0 the quantity of interest
  • 173. Example (Linear calibration model (2)): One-sided differential equation:
    |β|⁻¹ s^{−1/2} ∂/∂x0{e(x0)π(θ)} − e^{−1/2}(x0) sgn(β) n⁻¹ s^{1/2} ∂π(θ)/∂x0 − e^{−1/2}(x0)(x0 − x̄) s^{−1/2} ∂/∂β{sgn(β)π(θ)} = 0
    with s = Σ(xi − x̄)² and e(x0) = [(n + k)s + nk(x0 − x̄)²]/nk .
  • 174. Example (Linear calibration model (3)): Solutions
    π(x0, α, β, σ²) ∝ e(x0)^{(d−1)/2} |β|^d g(σ²) ,
    where g is arbitrary.
  • 175. Reference priors:

    Partition           Prior
    (x0, α, β, σ²)      |β| (σ²)^{−5/2}
    x0, α, β, σ²        e(x0)^{−1/2} (σ²)^{−1}
    x0, α, (σ², β)      e(x0)^{−1/2} (σ²)^{−3/2}
    x0, (α, β), σ²      e(x0)^{−1/2} (σ²)^{−1}
    x0, (α, β, σ²)      e(x0)^{−1/2} (σ²)^{−2}
  • 176. Other approaches:
    ◮ Rissanen's transmission information theory and minimum length priors
    ◮ Testing priors
    ◮ Stochastic complexity
  • 177. Bayesian Point Estimation: Bayesian inference; Bayesian Decision Theory; The particular case of the normal model; Dynamic models
  • 178. Bayesian Point Estimation / Posterior distribution: π(θ|x) ∝ f(x|θ) π(θ)
    ◮ extensive summary of the information available on θ
    ◮ integrates simultaneously prior information and the information brought by x
    ◮ unique motor of inference
  • 179. Bayesian Point Estimation / Bayesian inference / MAP estimator: With no loss function, consider using the maximum a posteriori (MAP) estimator
    arg max_θ ℓ(θ|x)π(θ)
  • 180. Motivations:
    ◮ Associated with 0-1 losses and Lp losses
    ◮ Penalized likelihood estimator
    ◮ Further appeal in restricted parameter spaces
  • 181. Example: Consider x ∼ B(n, p). Possible priors:
    π*(p) = [B(1/2, 1/2)]⁻¹ p^{−1/2}(1 − p)^{−1/2},  π1(p) = 1,  π2(p) = p⁻¹(1 − p)⁻¹.
    Corresponding MAP estimators:
    δ*(x) = max{(x − 1/2)/(n − 1), 0},  δ1(x) = x/n,  δ2(x) = max{(x − 1)/(n − 2), 0}.
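The three closed-form MAP estimators from this slide, written out as small functions; the data x = 3, n = 10 are illustrative.

```python
# MAP estimators of p for x ~ B(n, p) under the Jeffreys, flat, and Haldane priors
# (formulas copied from the slide).
def map_jeffreys(x, n):   # prior Be(1/2, 1/2)
    return max((x - 0.5) / (n - 1), 0.0)

def map_flat(x, n):       # prior pi(p) = 1
    return x / n

def map_haldane(x, n):    # prior proportional to [p(1 - p)]^{-1}
    return max((x - 1) / (n - 2), 0.0)

x, n = 3, 10
print(map_jeffreys(x, n), map_flat(x, n), map_haldane(x, n))
```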
  • 182. Not always appropriate / Example: Consider
    f(x|θ) = (1/π)[1 + (x − θ)²]⁻¹ ,
    and π(θ) = (1/2) e^{−|θ|}. The MAP estimator of θ is then always δ*(x) = 0
  • 183. Prediction: If x ∼ f(x|θ) and z ∼ g(z|x, θ), the predictive of z is
    gπ(z|x) = ∫_Θ g(z|x, θ) π(θ|x) dθ .
  • 184. Example: Consider the AR(1) model xt = ρ x_{t−1} + εt, εt ∼ N(0, σ²); the predictive of xT is then
    xT |x_{1:(T−1)} ∼ ∫ (σ⁻¹/√(2π)) exp{−(xT − ρ x_{T−1})²/2σ²} π(ρ, σ|x_{1:(T−1)}) dρ dσ ,
    and π(ρ, σ|x_{1:(T−1)}) can be expressed in closed form
  • 185. Bayesian Point Estimation / Bayesian Decision Theory: For a loss L(θ, δ) and a prior π, the Bayes rule is
    δπ(x) = arg min_d Eπ[L(θ, d)|x].
    Note: the practical computation is not always possible analytically.
  • 186. Conjugate priors: For conjugate distributions, the posterior expectations of the natural parameters can be expressed analytically, for one or several observations.

    Distribution      Conjugate prior    Posterior mean
    Normal N(θ, σ²)   Normal N(µ, τ²)    (µσ² + τ²x)/(σ² + τ²)
    Poisson P(θ)      Gamma G(α, β)      (α + x)/(β + 1)

  • 187.
    Distribution                       Conjugate prior              Posterior mean
    Gamma G(ν, θ)                      Gamma G(α, β)                (α + ν)/(β + x)
    Binomial B(n, θ)                   Beta Be(α, β)                (α + x)/(α + β + n)
    Negative binomial Neg(n, θ)        Beta Be(α, β)                (α + n)/(α + β + x + n)
    Multinomial Mk(n; θ1, . . . , θk)  Dirichlet D(α1, . . . , αk)  (αi + xi)/(Σ_j αj + n)
    Normal N(µ, 1/θ)                   Gamma G(α/2, β/2)            (α + 1)/(β + (µ − x)²)
  • 188. Example: Consider x1, ..., xn ∼ U([0, θ]) and θ ∼ Pa(θ0, α). Then
    θ|x1, ..., xn ∼ Pa(max(θ0, x1, ..., xn), α + n)
    and
    δπ(x1, ..., xn) = [(α + n)/(α + n − 1)] max(θ0, x1, ..., xn).
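A sketch of this Pareto update and the resulting Bayes estimate (posterior mean); the simulated data and hyperparameters are assumptions for illustration.

```python
# theta | x_1..x_n ~ Pa(max(theta0, x_(n)), alpha + n); posterior mean is the
# Bayes estimate under quadratic loss.
import numpy as np

def pareto_bayes_estimate(x, theta0, alpha):
    x = np.asarray(x)
    scale = max(theta0, x.max())
    a_post = alpha + x.size
    return a_post / (a_post - 1) * scale       # posterior mean of Pa(scale, a_post)

rng = np.random.default_rng(2)
sample = rng.uniform(0, 4.0, size=25)          # true theta = 4, illustrative
print(pareto_bayes_estimate(sample, theta0=1.0, alpha=2.0))
```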
  • 189. Even conjugate priors may lead to computational difficulties / Example: Consider x ∼ Np(θ, Ip) and
    L(θ, d) = (d − ||θ||²)² / (2||θ||² + p),
    for which δ0(x) = ||x||² − p has a constant risk, 1. For the conjugate distributions, Np(0, τ²Ip),
    δπ(x) = Eπ[||θ||²/(2||θ||² + p)|x] / Eπ[1/(2||θ||² + p)|x]
    cannot be computed analytically.
  • 190. Bayesian Point Estimation / The particular case of the normal model / The normal model: Importance of the normal model in many fields: Np(θ, Σ) with known Σ, normal conjugate distribution Np(µ, A). Under quadratic loss, the Bayes estimator is
    δπ(x) = x − Σ(Σ + A)⁻¹(x − µ) = (Σ⁻¹ + A⁻¹)⁻¹ (Σ⁻¹x + A⁻¹µ) ;
  • 191. Estimation of variance: If x̄ = (1/n) Σ_{i=1}^n xi and s² = Σ_{i=1}^n (xi − x̄)², the likelihood is
    ℓ(θ, σ | x̄, s²) ∝ σ^{−n} exp{−[s² + n(x̄ − θ)²]/2σ²}.
    The Jeffreys prior for this model is π*(θ, σ) = 1/σ², but invariance arguments lead to prefer π̃(θ, σ) = 1/σ
  • 192. In this case, the posterior distribution of (θ, σ) is
    θ|σ, x̄, s² ∼ N(x̄, σ²/n),   σ²|x̄, s² ∼ IG((n − 1)/2, s²/2).
    ◮ Conjugate posterior distributions have the same form
    ◮ θ and σ² are not a priori independent
    ◮ Requires a careful determination of the hyperparameters
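A sketch of posterior simulation from this two-stage representation, drawing σ² from its marginal and then θ given σ²; the simulated data are an illustrative assumption.

```python
# sigma^2 | data ~ IG((n-1)/2, s^2/2), theta | sigma^2, data ~ N(xbar, sigma^2/n)
import numpy as np
from scipy.stats import invgamma, norm

rng = np.random.default_rng(3)
data = rng.normal(1.0, 2.0, size=30)                     # illustrative sample
n, xbar = data.size, data.mean()
s2 = ((data - xbar) ** 2).sum()

sigma2 = invgamma.rvs((n - 1) / 2, scale=s2 / 2, size=10_000, random_state=rng)
theta = norm.rvs(loc=xbar, scale=np.sqrt(sigma2 / n), random_state=rng)
print(theta.mean(), np.quantile(theta, [0.025, 0.975]))  # posterior mean and 95% interval for theta
```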
  • 193. Linear models: Usual regression model
    y = Xβ + ε,  ε ∼ Nk(0, Σ),  β ∈ R^p.
    Conjugate distributions of the type β ∼ Np(Aθ, C), where θ ∈ R^q (q ≤ p). Strong connection with random-effect models, y = X1β1 + X2β2 + ε,
  • 194. Σ unknown: In this general case, the Jeffreys prior is
    πJ(β, Σ) = 1/|Σ|^{(k+1)/2} ,
    and the likelihood is
    ℓ(β, Σ|y) ∝ |Σ|^{−n/2} exp{−(1/2) tr[Σ⁻¹ Σ_{i=1}^n (yi − Xiβ)(yi − Xiβ)ᵗ]}
  • 195. ◮ suggests an (inverse) Wishart distribution on Σ
    ◮ posterior marginal distribution on β only defined for a large enough sample size
    ◮ no closed-form expression for the posterior marginal
  • 196. Special case: ε ∼ Nk(0, σ²Ik). The least-squares estimator β̂ has a normal distribution Np(β, σ²(XᵗX)⁻¹). Corresponding conjugate distributions on (β, σ²):
    β|σ² ∼ Np(µ, (σ²/n0)(XᵗX)⁻¹),   σ² ∼ IG(ν/2, s0²/2),
  • 197. since, if s² = ||y − Xβ̂||²,
    β|β̂, s², σ² ∼ Np((n0µ + β̂)/(n0 + 1), (σ²/(n0 + 1))(XᵗX)⁻¹),
    σ²|β̂, s² ∼ IG((k − p + ν)/2, [s² + s0² + (n0/(n0+1))(µ − β̂)ᵗXᵗX(µ − β̂)]/2).
  • 198. Bayesian StatisticsBayesian Point EstimationDynamic modelsThe AR(p) modelMarkovian dynamic modelxt ∼ N µ −pi=1̺i(xt−i − µ), σ2Appeal:◮ Among the most commonly used model in dynamic settings◮ More challenging than the static models (stationarityconstraints)◮ Different models depending on the processing of the startingvalue x0
  • 199. Bayesian StatisticsBayesian Point EstimationDynamic modelsStationarityStationarity constraints in the prior as a restriction on the values ofθ.AR(p) model stationary iff the roots of the polynomialP(x) = 1 −pi=1̺ixiare all outside the unit circle
  • 200. Bayesian StatisticsBayesian Point EstimationDynamic modelsClosed form likelihood
Conditional on the negative time values,
L(µ, ̺1, . . . , ̺p, σ | x1:T, x0:(−p+1)) = σ^{−T} ∏_{t=1}^{T} exp{−(xt − µ + Σ_{i=1}^{p} ̺i(xt−i − µ))² / 2σ²}
Natural conjugate prior for θ = (µ, ̺1, . . . , ̺p, σ²): a normal distribution on (µ, ̺1, . . . , ̺p) and an inverse gamma distribution on σ².
  • 201. Bayesian StatisticsBayesian Point EstimationDynamic modelsStationarity & priorsUnder stationarity constraint, complex parameter spaceThe Durbin–Levinson recursion proposes a reparametrization fromthe parameters ̺i to the partial autocorrelationsψi ∈ [−1, 1]which allow for a uniform prior.
  • 202. Bayesian StatisticsBayesian Point EstimationDynamic modelsTransform:0. Define ϕii = ψi and ϕij = ϕ(i−1)j − ψiϕ(i−1)(i−j), for i > 1and j = 1, · · · , i − 1 .1. Take ̺i = ϕpi for i = 1, · · · , p.Different approach via the real+complex roots of the polynomialP, whose inverses are also within the unit circle.
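The recursion of steps 0 and 1 translates directly into code; the sketch below (the function name is mine) maps partial autocorrelations ψ1, . . . , ψp in (−1, 1) to AR coefficients ̺1, . . . , ̺p, so that a uniform prior can be placed on the ψi's.

```python
import numpy as np

def partial_autocorr_to_ar(psi):
    """Durbin-Levinson reparametrisation: phi_{ii} = psi_i,
    phi_{ij} = phi_{(i-1)j} - psi_i phi_{(i-1)(i-j)}, and rho_i = phi_{pi}."""
    psi = np.asarray(psi, dtype=float)
    p = psi.size
    phi = np.zeros((p + 1, p + 1))
    for i in range(1, p + 1):
        phi[i, i] = psi[i - 1]
        for j in range(1, i):
            phi[i, j] = phi[i - 1, j] - psi[i - 1] * phi[i - 1, i - j]
    return phi[p, 1:p + 1]

print(partial_autocorr_to_ar([0.5, -0.3, 0.2]))   # coefficients of a stationary AR(3)
```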
  • 203. Bayesian StatisticsBayesian Point EstimationDynamic modelsStationarity & priors (contd.)
Jeffreys’ prior associated with the stationary representation is
πJ1(µ, σ², ̺) ∝ (1/σ²) · 1/√(1 − ̺²).
Within the non-stationary region |̺| > 1, the Jeffreys prior is
πJ2(µ, σ², ̺) ∝ (1/σ²) · (1/|1 − ̺²|) · √|1 − (1 − ̺^{2T})/(T(1 − ̺²))|.
The dominant part of the prior is the non-stationary region!
  • 204. Bayesian StatisticsBayesian Point EstimationDynamic modelsThe reference prior πJ1 is only defined when the stationary constraint holds.
Idea: Symmetrise to the region |̺| > 1:
πB(µ, σ², ̺) ∝ (1/σ²) × { 1/√(1 − ̺²) if |̺| < 1,  1/(|̺|√(̺² − 1)) if |̺| > 1 }
[Plot of the symmetrised prior πB as a function of ̺]
  • 205. Bayesian StatisticsBayesian Point EstimationDynamic modelsThe MA(q) modelxt = µ + ǫt −qj=1ϑjǫt−j , ǫt ∼ N (0, σ2)Stationary but, for identifiability considerations, the polynomialQ(x) = 1 −qj=1ϑjxjmust have all its roots outside the unit circle.
  • 206. Bayesian StatisticsBayesian Point EstimationDynamic modelsExampleFor the MA(1) model, xt = µ + ǫt − ϑ1ǫt−1,var(xt) = (1 + ϑ21)σ2It can also be writtenxt = µ + ˜ǫt−1 −1ϑ1˜ǫt, ˜ǫ ∼ N (0, ϑ21σ2) ,Both couples (ϑ1, σ) and (1/ϑ1, ϑ1σ) lead to alternativerepresentations of the same model.
  • 207. Bayesian StatisticsBayesian Point EstimationDynamic modelsRepresentationsx1:T is a normal random variable with constant mean µ andcovariance matrixΣ =σ2 γ1 γ2 . . . γq 0 . . . 0 0γ1 σ2 γ1 . . . γq−1 γq . . . 0 0...0 0 0 . . . 0 0 . . . γ1 σ2,with (|s| ≤ q)γs = σ2q−|s|i=0ϑiϑi+|s|Not manageable in practice
  • 208. Bayesian StatisticsBayesian Point EstimationDynamic modelsRepresentations (contd.)Conditional on (ǫ0, . . . , ǫ−q+1),L(µ, ϑ1, . . . , ϑq, σ|x1:T , ǫ0, . . . , ǫ−q+1) =σ−TTt=1exp−xt − µ +qj=1ϑjˆǫt−j22σ2,where (t > 0)ˆǫt = xt − µ +qj=1ϑjˆǫt−j, ˆǫ0 = ǫ0, . . . , ˆǫ1−q = ǫ1−qRecursive definition of the likelihood, still costly O(T × q)
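The recursive definition is straightforward to code; the following Python sketch (my own, with the pre-sample innovations set to zero for illustration) evaluates the conditional MA(q) log-likelihood in O(T × q) operations.

```python
import numpy as np

def ma_conditional_loglik(x, mu, theta, sigma, eps0=None):
    """Conditional log-likelihood of x_t = mu + eps_t - sum_j theta_j eps_{t-j},
    given past innovations eps0 = (eps_{1-q}, ..., eps_0); zeros by default."""
    q = len(theta)
    eps_hat = list(eps0) if eps0 is not None else [0.0] * q
    loglik = 0.0
    for xt in x:
        resid = xt - mu + sum(theta[j] * eps_hat[-(j + 1)] for j in range(q))
        loglik += -np.log(sigma) - resid ** 2 / (2 * sigma ** 2)
        eps_hat.append(resid)                 # hat-epsilon recursion
    return loglik

rng = np.random.default_rng(1)
eps = rng.normal(size=201)
x = 1.0 + eps[1:] - 0.4 * eps[:-1]            # MA(1) series with mu = 1, theta_1 = 0.4
print(ma_conditional_loglik(x, mu=1.0, theta=[0.4], sigma=1.0))
```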
  • 209. Bayesian StatisticsBayesian Point EstimationDynamic modelsRepresentations (contd.)State-space representationxt = Gyyt + εt , (2)yt+1 = Ftyt + ξt , (3)(2) is the observation equation and (3) is the state equation
  • 210. Bayesian StatisticsBayesian Point EstimationDynamic modelsFor the MA(q) modelyt = (ǫt−q, . . . , ǫt−1, ǫt)′andyt+1 =0 1 0 . . . 00 0 1 . . . 0. . .0 0 0 . . . 10 0 0 . . . 0yt + ǫt+100...01xt = µ − ϑq ϑq−1 . . . ϑ1 −1 yt .
  • 211. Bayesian StatisticsBayesian Point EstimationDynamic modelsExampleFor the MA(1) model, observation equationxt = (1 0)ytwithyt = (y1t y2t)′directed by the state equationyt+1 =0 10 0yt + ǫt+11ϑ1.
  • 212. Bayesian StatisticsBayesian Point EstimationDynamic modelsIdentifiabilityIdentifiability condition on Q(x): the ϑj’s vary in a complex space.New reparametrization: the ψi’s are the inverse partialauto-correlations
  • 213. Bayesian StatisticsTests and model choiceTests and model choiceIntroductionDecision-Theoretic Foundations of Statistical InferenceFrom Prior Information to Prior DistributionsBayesian Point EstimationTests and model choiceBayesian testsBayes factorsPseudo-Bayes factorsIntrinsic priorsOpposition to classical tests
  • 214. Bayesian StatisticsTests and model choiceBayesian testsConstruction of Bayes testsDefinition (Test)Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of astatistical model, a test is a statistical procedure that takes itsvalues in {0, 1}.
  • 215. Bayesian StatisticsTests and model choiceBayesian testsConstruction of Bayes testsDefinition (Test)Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of astatistical model, a test is a statistical procedure that takes itsvalues in {0, 1}.Example (Normal mean)For x ∼ N (θ, 1), decide whether or not θ ≤ 0.
  • 216. Bayesian StatisticsTests and model choiceBayesian testsDecision-theoretic perspectiveOptimal Bayes decision:Under the 0 − 1 loss functionL(θ, d) =0 if d = IΘ0 (θ)a0 if d = 1 and θ ∈ Θ0a1 if d = 0 and θ ∈ Θ0
  • 217. Bayesian StatisticsTests and model choiceBayesian testsDecision-theoretic perspectiveOptimal Bayes decision:Under the 0 − 1 loss functionL(θ, d) =0 if d = IΘ0 (θ)a0 if d = 1 and θ ∈ Θ0a1 if d = 0 and θ ∈ Θ0the Bayes procedure isδπ(x) =1 if Prπ(θ ∈ Θ0|x) ≥ a0/(a0 + a1)0 otherwise
  • 218. Bayesian StatisticsTests and model choiceBayesian testsBound comparisonDetermination of a0/a1 depends on consequences of “wrongdecision” under both circumstances
  • 219. Bayesian StatisticsTests and model choiceBayesian testsBound comparisonDetermination of a0/a1 depends on consequences of “wrongdecision” under both circumstancesOften difficult to assess in practice and replacement with “golden”bounds like .05, biased towards H0
  • 220. Bayesian StatisticsTests and model choiceBayesian testsBound comparison
Determination of a0/a1 depends on consequences of “wrong decision” under both circumstances
Often difficult to assess in practice and replacement with “golden” bounds like .05, biased towards H0
Example (Binomial probability)
Consider x ∼ B(n, p) and Θ0 = [0, 1/2]. Under the uniform prior π(p) = 1, the posterior probability of H0 is
Pπ(p ≤ 1/2|x) = [∫_0^{1/2} p^x (1 − p)^{n−x} dp] / B(x + 1, n − x + 1)
             = [(1/2)^{n+1} / B(x + 1, n − x + 1)] [1/(x + 1) + . . . + (n − x)! x!/(n + 1)!]
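Since the posterior is Be(x + 1, n − x + 1), the probability above is simply an incomplete beta function; a short check in Python (my own snippet, with arbitrary values of x and n):

```python
from scipy.stats import beta

def posterior_prob_H0(x, n):
    """P(p <= 1/2 | x) under the uniform prior, i.e. a Be(x+1, n-x+1) cdf at 1/2."""
    return beta.cdf(0.5, x + 1, n - x + 1)

for x, n in [(3, 10), (7, 10), (60, 100)]:
    print(n, x, round(posterior_prob_H0(x, n), 4))
```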
  • 221. Bayesian StatisticsTests and model choiceBayesian testsLoss/prior dualityDecompositionPrπ(θ ∈ Θ0|x) = Θ0π(θ|x) dθ= Θ0f(x|θ0)π(θ) dθΘ f(x|θ0)π(θ) dθ
  • 222. Bayesian StatisticsTests and model choiceBayesian testsLoss/prior dualityDecompositionPrπ(θ ∈ Θ0|x) = Θ0π(θ|x) dθ= Θ0f(x|θ0)π(θ) dθΘ f(x|θ0)π(θ) dθsuggests representationπ(θ) = π(Θ0)π0(θ) + (1 − π(Θ0))π1(θ)
  • 223. Bayesian StatisticsTests and model choiceBayesian testsLoss/prior dualityDecompositionPrπ(θ ∈ Θ0|x) = Θ0π(θ|x) dθ= Θ0f(x|θ0)π(θ) dθΘ f(x|θ0)π(θ) dθsuggests representationπ(θ) = π(Θ0)π0(θ) + (1 − π(Θ0))π1(θ)and decisionδπ(x) = 1 iffπ(Θ0)(1 − π(Θ0))Θ0f(x|θ0)π0(θ) dθΘc0f(x|θ0)π1(θ) dθ≥a0a1
  • 224. Bayesian StatisticsTests and model choiceBayesian testsLoss/prior dualityDecompositionPrπ(θ ∈ Θ0|x) = Θ0π(θ|x) dθ= Θ0f(x|θ0)π(θ) dθΘ f(x|θ0)π(θ) dθsuggests representationπ(θ) = π(Θ0)π0(θ) + (1 − π(Θ0))π1(θ)and decisionδπ(x) = 1 iffπ(Θ0)(1 − π(Θ0))Θ0f(x|θ0)π0(θ) dθΘc0f(x|θ0)π1(θ) dθ≥a0a1c What matters is (π(Θ0)/a0, (1 − π(Θ0))/a1)
  • 225. Bayesian StatisticsTests and model choiceBayes factorsA function of posterior probabilitiesDefinition (Bayes factors)For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∈ Θ0B01 =π(Θ0|x)π(Θc0|x)π(Θ0)π(Θc0)= Θ0f(x|θ)π0(θ)dθΘc0f(x|θ)π1(θ)dθ[Good, 1958 & Jeffreys, 1961]Goto Poisson exampleEquivalent to Bayes rule: acceptance ifB01 > {(1 − π(Θ0))/a1}/{π(Θ0)/a0}
  • 226. Bayesian StatisticsTests and model choiceBayes factorsSelf-contained conceptOutside decision-theoretic environment:◮ eliminates choice of π(Θ0)
  • 227. Bayesian StatisticsTests and model choiceBayes factorsSelf-contained conceptOutside decision-theoretic environment:◮ eliminates choice of π(Θ0)◮ but depends on the choice of (π0, π1)
  • 228. Bayesian StatisticsTests and model choiceBayes factorsSelf-contained conceptOutside decision-theoretic environment:◮ eliminates choice of π(Θ0)◮ but depends on the choice of (π0, π1)◮ Bayesian/marginal equivalent to the likelihood ratio
  • 229. Bayesian StatisticsTests and model choiceBayes factorsSelf-contained conceptOutside decision-theoretic environment:◮ eliminates choice of π(Θ0)◮ but depends on the choice of (π0, π1)◮ Bayesian/marginal equivalent to the likelihood ratio◮ Jeffreys’ scale of evidence:◮ if log10(Bπ10) is between 0 and 0.5, evidence against H0 is weak,◮ if it is between 0.5 and 1, evidence is substantial,◮ if it is between 1 and 2, evidence is strong, and◮ if it is above 2, evidence is decisive
  • 230. Bayesian StatisticsTests and model choiceBayes factorsHot handExample (Binomial homogeneity)Consider H0 : yi ∼ B(ni, p) (i = 1, . . . , G) vs. H1 : yi ∼ B(ni, pi).Conjugate priors pi ∼ Be(ξ/ω, (1 − ξ)/ω), with a uniform prior onE[pi|ξ, ω] = ξ and on p (ω is fixed)
  • 231. Bayesian StatisticsTests and model choiceBayes factorsHot handExample (Binomial homogeneity)Consider H0 : yi ∼ B(ni, p) (i = 1, . . . , G) vs. H1 : yi ∼ B(ni, pi).Conjugate priors pi ∼ Be(ξ/ω, (1 − ξ)/ω), with a uniform prior onE[pi|ξ, ω] = ξ and on p (ω is fixed)B10 =10Gi=110pyii (1 − pi)ni−yipα−1i (1 − pi)β−1d pi×Γ(1/ω)/[Γ(ξ/ω)Γ((1 − ξ)/ω)]dξ10 p i yi (1 − p) i(ni−yi)d pwhere α = ξ/ω and β = (1 − ξ)/ω.
  • 232. Bayesian StatisticsTests and model choiceBayes factorsHot handExample (Binomial homogeneity)Consider H0 : yi ∼ B(ni, p) (i = 1, . . . , G) vs. H1 : yi ∼ B(ni, pi).Conjugate priors pi ∼ Be(ξ/ω, (1 − ξ)/ω), with a uniform prior onE[pi|ξ, ω] = ξ and on p (ω is fixed)B10 =10Gi=110pyii (1 − pi)ni−yipα−1i (1 − pi)β−1d pi×Γ(1/ω)/[Γ(ξ/ω)Γ((1 − ξ)/ω)]dξ10 p i yi (1 − p) i(ni−yi)d pwhere α = ξ/ω and β = (1 − ξ)/ω.For instance, log10(B10) = −0.79 for ω = 0.005 and G = 138slightly favours H0.
  • 233. Bayesian StatisticsTests and model choiceBayes factorsA major modificationWhen the null hypothesis is supported by a set of measure 0,π(Θ0) = 0[End of the story?!]
  • 234. Bayesian StatisticsTests and model choiceBayes factorsA major modificationWhen the null hypothesis is supported by a set of measure 0,π(Θ0) = 0[End of the story?!]RequirementDefined prior distributions under both assumptions,π0(θ) ∝ π(θ)IΘ0 (θ), π1(θ) ∝ π(θ)IΘ1 (θ),(under the standard dominating measures on Θ0 and Θ1)
  • 235. Bayesian StatisticsTests and model choiceBayes factorsA major modificationWhen the null hypothesis is supported by a set of measure 0,π(Θ0) = 0[End of the story?!]RequirementDefined prior distributions under both assumptions,π0(θ) ∝ π(θ)IΘ0 (θ), π1(θ) ∝ π(θ)IΘ1 (θ),(under the standard dominating measures on Θ0 and Θ1)Using the prior probabilities π(Θ0) = ̺0 and π(Θ1) = ̺1,π(θ) = ̺0π0(θ) + ̺1π1(θ).Note If Θ0 = {θ0}, π0 is the Dirac mass in θ0
  • 236. Bayesian StatisticsTests and model choiceBayes factorsPoint null hypothesesParticular case H0 : θ = θ0Take ρ0 = Prπ(θ = θ0) and g1 prior density under Ha.
  • 237. Bayesian StatisticsTests and model choiceBayes factorsPoint null hypothesesParticular case H0 : θ = θ0Take ρ0 = Prπ(θ = θ0) and g1 prior density under Ha.Posterior probability of H0π(Θ0|x) =f(x|θ0)ρ0f(x|θ)π(θ) dθ=f(x|θ0)ρ0f(x|θ0)ρ0 + (1 − ρ0)m1(x)and marginal under Ham1(x) =Θ1f(x|θ)g1(θ) dθ.
  • 238. Bayesian StatisticsTests and model choiceBayes factorsPoint null hypotheses (cont’d)Dual representationπ(Θ0|x) = 1 +1 − ρ0ρ0m1(x)f(x|θ0)−1.andBπ01(x) =f(x|θ0)ρ0m1(x)(1 − ρ0)ρ01 − ρ0=f(x|θ0)m1(x)
  • 239. Bayesian StatisticsTests and model choiceBayes factorsPoint null hypotheses (cont’d)Dual representationπ(Θ0|x) = 1 +1 − ρ0ρ0m1(x)f(x|θ0)−1.andBπ01(x) =f(x|θ0)ρ0m1(x)(1 − ρ0)ρ01 − ρ0=f(x|θ0)m1(x)Connectionπ(Θ0|x) = 1 +1 − ρ0ρ01Bπ01(x)−1.
  • 240. Bayesian StatisticsTests and model choiceBayes factorsPoint null hypotheses (cont’d)Example (Normal mean)Test of H0 : θ = 0 when x ∼ N (θ, 1): we take π1 as N (0, τ2)m1(x)f(x|0)=σ√σ2 + τ2e−x2/2(σ2+τ2)e−x2/2σ2=σ2σ2 + τ2expτ2x22σ2(σ2 + τ2)andπ(θ = 0|x) = 1 +1 − ρ0ρ0σ2σ2 + τ2expτ2x22σ2(σ2 + τ2)−1
  • 241. Bayesian StatisticsTests and model choiceBayes factorsPoint null hypotheses (cont’d)Example (Normal mean)
Influence of τ:
τ/x    0       0.68    1.28    1.96
1      0.586   0.557   0.484   0.351
10     0.768   0.729   0.612   0.366
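The entries of this table can be checked from the formula on the previous slide; the sketch below (my own, taking σ = 1 and ρ0 = 1/2) reproduces the first row exactly, and suggests that the second row corresponds to a prior variance τ² = 10 rather than τ = 10, which is an assumption on my part.

```python
import numpy as np

def post_prob_null(x, tau, sigma=1.0, rho0=0.5):
    """pi(theta = 0 | x) for x ~ N(theta, sigma^2), prior mass rho0 at 0
    and N(0, tau^2) on theta under the alternative."""
    ratio = np.sqrt(sigma**2 / (sigma**2 + tau**2)) * np.exp(
        tau**2 * x**2 / (2 * sigma**2 * (sigma**2 + tau**2)))
    return 1.0 / (1.0 + (1 - rho0) / rho0 * ratio)

for tau2 in (1.0, 10.0):
    row = [round(post_prob_null(x, np.sqrt(tau2)), 3) for x in (0.0, 0.68, 1.28, 1.96)]
    print(tau2, row)
```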
  • 242. Bayesian StatisticsTests and model choiceBayes factorsA fundamental difficultyImproper priors are not allowed hereIfΘ1π1(dθ1) = ∞ orΘ2π2(dθ2) = ∞then either π1 or π2 cannot be coherently normalised
  • 243. Bayesian StatisticsTests and model choiceBayes factorsA fundamental difficultyImproper priors are not allowed hereIfΘ1π1(dθ1) = ∞ orΘ2π2(dθ2) = ∞then either π1 or π2 cannot be coherently normalised but thenormalisation matters in the Bayes factor Recall Bayes factor
  • 244. Bayesian StatisticsTests and model choiceBayes factorsConstants matterExample (Poisson versus Negative binomial)If M1 is a P(λ) distribution and M2 is a N B(m, p) distribution,we can takeπ1(λ) = 1/λπ2(m, p) = 1M I{1,··· ,M}(m) I[0,1](p)
  • 245. Bayesian StatisticsTests and model choiceBayes factorsConstants matter (cont’d)Example (Poisson versus Negative binomial (2))thenBπ12 =∞0λx−1x!e−λdλ1MMm=1∞0mx − 1px(1 − p)m−xdp= 11MMm=xmx − 1x!(m − x)!m!= 11MMm=xx/(m − x + 1)
  • 246. Bayesian StatisticsTests and model choiceBayes factorsConstants matter (cont’d)Example (Poisson versus Negative binomial (3))◮ does not make sense because π1(λ) = 10/λ leads to adifferent answer, ten times larger!
  • 247. Bayesian StatisticsTests and model choiceBayes factorsConstants matter (cont’d)Example (Poisson versus Negative binomial (3))◮ does not make sense because π1(λ) = 10/λ leads to adifferent answer, ten times larger!◮ same thing when both priors are improper
  • 248. Bayesian StatisticsTests and model choiceBayes factorsConstants matter (cont’d)Example (Poisson versus Negative binomial (3))◮ does not make sense because π1(λ) = 10/λ leads to adifferent answer, ten times larger!◮ same thing when both priors are improperImproper priors on common (nuisance) parameters do not matter(so much)
  • 249. Bayesian StatisticsTests and model choiceBayes factorsNormal illustration
Take x ∼ N(θ, 1) and H0 : θ = 0
Influence of the constant:
π(θ)/x    0.0      1.0      1.65     1.96      2.58
1         0.285    0.195    0.089    0.055     0.014
10        0.0384   0.0236   0.0101   0.00581   0.00143
  • 250. Bayesian StatisticsTests and model choiceBayes factorsVague proper priors are not the solution
Taking a proper prior with a “very large” variance (e.g., in BUGS)
  • 251. Bayesian StatisticsTests and model choiceBayes factorsVague proper priors are not the solution
Taking a proper prior with a “very large” variance (e.g., in BUGS) will most often result in an undefined or ill-defined limit
  • 252. Bayesian StatisticsTests and model choiceBayes factorsVague proper priors are not the solution
Taking a proper prior with a “very large” variance (e.g., in BUGS) will most often result in an undefined or ill-defined limit
Example (Lindley’s paradox)
If testing H0 : θ = 0 when observing x ∼ N(θ, 1), under a normal N(0, α) prior π1(θ),
B01(x) −→ +∞ as α −→ ∞
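A quick numerical illustration of the paradox (my own check, not from the slides): with B01 = ϕ(x; 0, 1)/ϕ(x; 0, 1 + α), the Bayes factor in favour of H0 grows without bound as the prior variance α increases, even at a "significant" observation.

```python
from scipy.stats import norm

x = 1.96                          # borderline significant at the 5% level
for alpha in (1, 10, 100, 1_000, 10_000):
    b01 = norm.pdf(x, 0, 1) / norm.pdf(x, 0, (1 + alpha) ** 0.5)
    print(alpha, round(b01, 3))   # B01 keeps increasing with alpha
```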
  • 253. Bayesian StatisticsTests and model choiceBayes factorsVague proper priors are not the solution (cont’d)Example (Poisson versus Negative binomial (4))B12 =10λα+x−1x!e−λβdλ1M mxm − x + 1βαΓ(α)if λ ∼ Ga(α, β)=Γ(α + x)x! Γ(α)β−x 1M mxm − x + 1=(x + α − 1) · · · αx(x − 1) · · · 1β−x 1M mxm − x + 1
  • 254. Bayesian StatisticsTests and model choiceBayes factorsVague proper priors are not the solution (cont’d)Example (Poisson versus Negative binomial (4))B12 =10λα+x−1x!e−λβdλ1M mxm − x + 1βαΓ(α)if λ ∼ Ga(α, β)=Γ(α + x)x! Γ(α)β−x 1M mxm − x + 1=(x + α − 1) · · · αx(x − 1) · · · 1β−x 1M mxm − x + 1depends on choice of α(β) or β(α) −→ 0
  • 255. Bayesian StatisticsTests and model choiceBayes factorsLearning from the sampleDefinition (Learning sample)Given an improper prior π, (x1, . . . , xn) is a learning sample ifπ(·|x1, . . . , xn) is proper and a minimal learning sample if none ofits subsamples is a learning sample
  • 256. Bayesian StatisticsTests and model choiceBayes factorsLearning from the sampleDefinition (Learning sample)Given an improper prior π, (x1, . . . , xn) is a learning sample ifπ(·|x1, . . . , xn) is proper and a minimal learning sample if none ofits subsamples is a learning sampleThere is just enough information in a minimal learning sample tomake inference about θ under the prior π
  • 257. Bayesian StatisticsTests and model choicePseudo-Bayes factorsPseudo-Bayes factorsIdeaUse one part x[i] of the data x to make the prior proper:
  • 258. Bayesian StatisticsTests and model choicePseudo-Bayes factorsPseudo-Bayes factorsIdeaUse one part x[i] of the data x to make the prior proper:◮ πi improper but πi(·|x[i]) proper◮ andfi(x[n/i]|θi) πi(θi|x[i])dθifj(x[n/i]|θj) πj(θj|x[i])dθjindependent of normalizing constant
  • 259. Bayesian StatisticsTests and model choicePseudo-Bayes factorsPseudo-Bayes factorsIdeaUse one part x[i] of the data x to make the prior proper:◮ πi improper but πi(·|x[i]) proper◮ andfi(x[n/i]|θi) πi(θi|x[i])dθifj(x[n/i]|θj) πj(θj|x[i])dθjindependent of normalizing constant◮ Use remaining x[n/i] to run test as if πj(θj|x[i]) is the trueprior
  • 260. Bayesian StatisticsTests and model choicePseudo-Bayes factorsMotivation◮ Provides a working principle for improper priors
  • 261. Bayesian StatisticsTests and model choicePseudo-Bayes factorsMotivation◮ Provides a working principle for improper priors◮ Gather enough information from data to achieve properness◮ and use this properness to run the test on remaining data
  • 262. Bayesian StatisticsTests and model choicePseudo-Bayes factorsMotivation◮ Provides a working principle for improper priors◮ Gather enough information from data to achieve properness◮ and use this properness to run the test on remaining data◮ does not use x twice as in Aitkin’s (1991)
  • 263. Bayesian StatisticsTests and model choicePseudo-Bayes factorsDetailsSince π1(θ1|x[i]) =π1(θ1)f1[i](x[i]|θ1)π1(θ1)f1[i](x[i]|θ1)dθ1B12(x[n/i]) =f1[n/i](x[n/i]|θ1)π1(θ1|x[i])dθ1f2[n/i](x[n/i]|θ2)π2(θ2|x[i])dθ2=f1(x|θ1)π1(θ1)dθ1f2(x|θ2)π2(θ2)dθ2π2(θ2)f2[i](x[i]|θ2)dθ2π1(θ1)f1[i](x[i]|θ1)dθ1= BN12(x)B21(x[i])c Independent of scaling factor!
  • 264. Bayesian StatisticsTests and model choicePseudo-Bayes factorsUnexpected problems!◮ depends on the choice of x[i]
  • 265. Bayesian StatisticsTests and model choicePseudo-Bayes factorsUnexpected problems!
◮ depends on the choice of x[i]
◮ many ways of combining pseudo-Bayes factors
◮ AIBF = B^N_ji × (1/L) Σ_ℓ B_ij(x[ℓ])
◮ MIBF = B^N_ji × med[B_ij(x[ℓ])]
◮ GIBF = B^N_ji × exp{(1/L) Σ_ℓ log B_ij(x[ℓ])}
◮ not often an exact Bayes factor
◮ and thus lacking the inner coherence properties B12 = B10 B02 and B01 = 1/B10.
[Berger & Pericchi, 1996]
  • 266. Bayesian StatisticsTests and model choicePseudo-Bayes factorsUnexpec’d problems (cont’d)Example (Mixtures)There is no sample size that proper-ises improper priors, except if atraining sample is allocated to each component
  • 267. Bayesian StatisticsTests and model choicePseudo-Bayes factorsUnexpec’d problems (cont’d)Example (Mixtures)There is no sample size that proper-ises improper priors, except if atraining sample is allocated to each componentReason Ifx1, . . . , xn ∼ki=1pif(x|θi)andπ(θ) =iπi(θi) with πi(θi)dθi = +∞ ,the posterior is never defined, becausePr(“no observation from f(·|θi)”) = (1 − pi)n
  • 268. Bayesian StatisticsTests and model choiceIntrinsic priorsIntrinsic priorsThere may exist a true prior that provides the same Bayes factor
  • 269. Bayesian StatisticsTests and model choiceIntrinsic priorsIntrinsic priorsThere may exist a true prior that provides the same Bayes factorExample (Normal mean)Take x ∼ N(θ, 1) with either θ = 0 (M1) or θ = 0 (M2) andπ2(θ) = 1.ThenBAIBF21 = B211√2π1nni=1 e−x21/2 ≈ B21 for N(0, 2)BMIBF21 = B211√2πe−med(x21)/2 ≈ 0.93B21 for N(0, 1.2)[Berger and Pericchi, 1998]
  • 270. Bayesian StatisticsTests and model choiceIntrinsic priorsIntrinsic priorsThere may exist a true prior that provides the same Bayes factorExample (Normal mean)Take x ∼ N(θ, 1) with either θ = 0 (M1) or θ = 0 (M2) andπ2(θ) = 1.ThenBAIBF21 = B211√2π1nni=1 e−x21/2 ≈ B21 for N(0, 2)BMIBF21 = B211√2πe−med(x21)/2 ≈ 0.93B21 for N(0, 1.2)[Berger and Pericchi, 1998]When such a prior exists, it is called an intrinsic prior
  • 271. Bayesian StatisticsTests and model choiceIntrinsic priorsIntrinsic priors (cont’d)
  • 272. Bayesian StatisticsTests and model choiceIntrinsic priorsIntrinsic priors (cont’d)Example (Exponential scale)Take x1, . . . , xni.i.d.∼ exp(θ − x)Ix≥θand H0 : θ = θ0, H1 : θ > θ0 , with π1(θ) = 1ThenBA10 = B10(x)1nni=1exi−θ0− 1−1is the Bayes factor forπ2(θ) = eθ0−θ1 − log 1 − eθ0−θ
  • 273. Bayesian StatisticsTests and model choiceIntrinsic priorsIntrinsic priors (cont’d)Example (Exponential scale)Take x1, . . . , xni.i.d.∼ exp(θ − x)Ix≥θand H0 : θ = θ0, H1 : θ > θ0 , with π1(θ) = 1ThenBA10 = B10(x)1nni=1exi−θ0− 1−1is the Bayes factor forπ2(θ) = eθ0−θ1 − log 1 − eθ0−θMost often, however, the pseudo-Bayes factors do not correspondto any true Bayes factor[Berger and Pericchi, 2001]
  • 274. Bayesian StatisticsTests and model choiceIntrinsic priorsFractional Bayes factorIdeause directly the likelihood to separate training sample from testingsampleBF12 = B12(x)Lb2(θ2)π2(θ2)dθ2Lb1(θ1)π1(θ1)dθ1[O’Hagan, 1995]
  • 275. Bayesian StatisticsTests and model choiceIntrinsic priorsFractional Bayes factorIdeause directly the likelihood to separate training sample from testingsampleBF12 = B12(x)Lb2(θ2)π2(θ2)dθ2Lb1(θ1)π1(θ1)dθ1[O’Hagan, 1995]Proportion b of the sample used to gain proper-ness
  • 276. Bayesian StatisticsTests and model choiceIntrinsic priorsFractional Bayes factor (cont’d)Example (Normal mean)
BF12 = (1/√b) e^{n(b−1)x̄_n²/2}
corresponds to the exact Bayes factor for the prior N(0, (1 − b)/(nb))
◮ If b constant, prior variance goes to 0
◮ If b = 1/n, prior variance stabilises around 1
◮ If b = n^{−α}, α < 1, prior variance goes to 0 too.
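A numerical check of this correspondence (my own sketch, with arbitrary sample values): the fractional Bayes factor above coincides with the exact Bayes factor computed from the N(0, (1 − b)/(nb)) prior on θ under the alternative.

```python
import numpy as np
from scipy.stats import norm

n, xbar, b = 25, 0.3, 1 / 25
fbf = (1 / np.sqrt(b)) * np.exp(n * (b - 1) * xbar ** 2 / 2)

tau2 = (1 - b) / (n * b)                       # matching prior variance
exact = norm.pdf(xbar, 0, np.sqrt(1 / n)) / norm.pdf(xbar, 0, np.sqrt(tau2 + 1 / n))
print(round(fbf, 4), round(exact, 4))          # identical values
```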
  • 277. Bayesian StatisticsTests and model choiceOpposition to classical testsComparison with classical testsStandard answerDefinition (p-value)The p-value p(x) associated with a test is the largest significancelevel for which H0 is rejected
  • 278. Bayesian StatisticsTests and model choiceOpposition to classical testsComparison with classical testsStandard answerDefinition (p-value)The p-value p(x) associated with a test is the largest significancelevel for which H0 is rejectedNoteAn alternative definition is that a p-value is distributed uniformlyunder the null hypothesis.
  • 279. Bayesian StatisticsTests and model choiceOpposition to classical testsp-valueExample (Normal mean)
Since the UMPU test is {|x| > k}, the standard p-value is
p(x) = inf{α; |x| > kα} = PX(|X| > |x|), X ∼ N(0, 1),
     = 1 − Φ(|x|) + Φ(−|x|) = 2[1 − Φ(|x|)].
Thus, if x = 1.68, p(x) = 0.10 and, if x = 1.96, p(x) = 0.05.
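For completeness, a one-line computation of this two-sided p-value (my own snippet):

```python
from scipy.stats import norm

def p_value(x):
    """Two-sided p-value 2[1 - Phi(|x|)] for the normal-mean test."""
    return 2 * (1 - norm.cdf(abs(x)))

print(round(p_value(1.68), 3), round(p_value(1.96), 3))
```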
  • 280. Bayesian StatisticsTests and model choiceOpposition to classical testsProblems with p-values
◮ Evaluation of the wrong quantity, namely the probability of exceeding the observed quantity (wrong conditioning)
◮ No transfer of the UMP optimality
◮ No decisional support (occurrences of inadmissibility)
◮ Evaluation only under the null hypothesis
◮ Huge numerical difference with the Bayesian range of answers
  • 281. Bayesian StatisticsTests and model choiceOpposition to classical testsBayesian lower boundsFor illustration purposes, consider a class G of prior distributionsB(x, G ) = infg∈Gf(x|θ0)Θ f(x|θ)g(θ) dθ,P(x, G ) = infg∈Gf(x|θ0)f(x|θ0) + Θ f(x|θ)g(θ) dθwhen ̺0 = 1/2 orB(x, G ) =f(x|θ0)supg∈G Θ f(x|θ)g(θ)dθ, P(x, G ) = 1 +1B(x, G )−1.
  • 282. Bayesian StatisticsTests and model choiceOpposition to classical testsResolutionLemma
If there exists an mle for θ, θ̂(x), the solutions of the Bayesian lower bounds are
B(x, G) = f(x|θ0)/f(x|θ̂(x))  and  P(x, G) = [1 + f(x|θ̂(x))/f(x|θ0)]^{−1},
respectively
  • 283. Bayesian StatisticsTests and model choiceOpposition to classical testsNormal case
When x ∼ N(θ, 1) and H0 : θ0 = 0, the lower bounds are
B(x, GA) = e^{−x²/2} and P(x, GA) = (1 + e^{x²/2})^{−1},
  • 284. Bayesian StatisticsTests and model choiceOpposition to classical testsNormal case
When x ∼ N(θ, 1) and H0 : θ0 = 0, the lower bounds are
B(x, GA) = e^{−x²/2} and P(x, GA) = (1 + e^{x²/2})^{−1},
i.e.
p-value   0.10    0.05    0.01    0.001
P         0.205   0.128   0.035   0.004
B         0.256   0.146   0.036   0.004
  • 285. Bayesian StatisticsTests and model choiceOpposition to classical testsNormal case
When x ∼ N(θ, 1) and H0 : θ0 = 0, the lower bounds are
B(x, GA) = e^{−x²/2} and P(x, GA) = (1 + e^{x²/2})^{−1},
i.e.
p-value   0.10    0.05    0.01    0.001
P         0.205   0.128   0.035   0.004
B         0.256   0.146   0.036   0.004
[Quite different!]
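This table can be reproduced by mapping each two-sided p-value back to the corresponding |x| and applying the lower bounds; a short Python check (my own sketch):

```python
import numpy as np
from scipy.stats import norm

for p in (0.10, 0.05, 0.01, 0.001):
    x = norm.ppf(1 - p / 2)                    # |x| giving this two-sided p-value
    B = np.exp(-x ** 2 / 2)
    P = 1 / (1 + np.exp(x ** 2 / 2))
    print(p, round(P, 3), round(B, 3))
```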
  • 286. Bayesian StatisticsTests and model choiceOpposition to classical testsUnilateral caseDifferent situation when H0 : θ ≤ 0
  • 287. Bayesian StatisticsTests and model choiceOpposition to classical testsUnilateral caseDifferent situation when H0 : θ ≤ 0◮ Single prior can be used both for H0 and Ha
  • 288. Bayesian StatisticsTests and model choiceOpposition to classical testsUnilateral caseDifferent situation when H0 : θ ≤ 0◮ Single prior can be used both for H0 and Ha◮ Improper priors are therefore acceptable
  • 289. Bayesian StatisticsTests and model choiceOpposition to classical testsUnilateral caseDifferent situation when H0 : θ ≤ 0◮ Single prior can be used both for H0 and Ha◮ Improper priors are therefore acceptable◮ Similar numerical values compared with p-values
  • 290. Bayesian StatisticsTests and model choiceOpposition to classical testsUnilateral agreementTheoremWhen x ∼ f(x − θ), with f symmetric around 0 and endowed withthe monotone likelihood ratio property, if H0 : θ ≤ 0, the p-valuep(x) is equal to the lower bound of the posterior probabilities,P(x, GSU ), when GSU is the set of symmetric unimodal priors andwhen x > 0.
  • 291. Bayesian StatisticsTests and model choiceOpposition to classical testsUnilateral agreementTheoremWhen x ∼ f(x − θ), with f symmetric around 0 and endowed withthe monotone likelihood ratio property, if H0 : θ ≤ 0, the p-valuep(x) is equal to the lower bound of the posterior probabilities,P(x, GSU ), when GSU is the set of symmetric unimodal priors andwhen x > 0.Reason:p(x) = Pθ=0(X > x) =+∞xf(t) dt = infK11 +0−K f(x−θ) dθK−K f(x−θ)dθ−1
  • 292. Bayesian StatisticsTests and model choiceOpposition to classical testsCauchy example
When x ∼ C(θ, 1) and H0 : θ ≤ 0, the lower bound is inferior to the p-value:
p-value   0.437   0.102   0.063   0.013   0.004
P         0.429   0.077   0.044   0.007   0.002
  • 293. Bayesian StatisticsTests and model choiceModel choiceModel choice and model comparisonChoice of modelsSeveral models available for the same observationMi : x ∼ fi(x|θi), i ∈ Iwhere I can be finite or infinite
  • 294. Bayesian StatisticsTests and model choiceModel choiceExample (Galaxy normal mixture)
Set of observations of radial speeds of 82 galaxies possibly modelled as a mixture of normal distributions
Mi : xj ∼ Σ_{ℓ=1}^{i} pℓi N(µℓi, σ²ℓi)
[Histogram of the 82 galaxy speeds with fitted density]
  • 295. Bayesian StatisticsTests and model choiceBayesian resolutionBayesian resolutionB FrameworkProbabilises the entire model/parameter space
  • 296. Bayesian StatisticsTests and model choiceBayesian resolutionBayesian resolutionB FrameworkProbabilises the entire model/parameter spaceThis means:◮ allocating probabilities pi to all models Mi◮ defining priors πi(θi) for each parameter space Θi
  • 297. Bayesian StatisticsTests and model choiceBayesian resolutionFormal solutionsResolution1. Computep(Mi|x) =piΘifi(x|θi)πi(θi)dθijpjΘjfj(x|θj)πj(θj)dθj
  • 298. Bayesian StatisticsTests and model choiceBayesian resolutionFormal solutionsResolution1. Computep(Mi|x) =piΘifi(x|θi)πi(θi)dθijpjΘjfj(x|θj)πj(θj)dθj2. Take largest p(Mi|x) to determine ‘‘best’’ model,or use averaged predictivejp(Mj|x)Θjfj(x′|θj)πj(θj|x)dθj
  • 299. Bayesian StatisticsTests and model choiceProblemsSeveral types of problems◮ Concentrate on selection perspective:◮ averaging = estimation = non-parsimonious = no-decision◮ how to integrate loss function/decision/consequences
  • 300. Bayesian StatisticsTests and model choiceProblemsSeveral types of problems◮ Concentrate on selection perspective:◮ averaging = estimation = non-parsimonious = no-decision◮ how to integrate loss function/decision/consequences◮ representation of parsimony/sparsity (Ockham’s rule)◮ how to fight overfitting for nested models
  • 301. Bayesian StatisticsTests and model choiceProblemsSeveral types of problems◮ Concentrate on selection perspective:◮ averaging = estimation = non-parsimonious = no-decision◮ how to integrate loss function/decision/consequences◮ representation of parsimony/sparsity (Ockham’s rule)◮ how to fight overfitting for nested modelsWhich loss?
  • 302. Bayesian StatisticsTests and model choiceProblemsSeveral types of problems (2)◮ Choice of prior structures◮ adequate weights pi:if M1 = M2 ∪ M3,
  • 303. Bayesian StatisticsTests and model choiceProblemsSeveral types of problems (2)◮ Choice of prior structures◮ adequate weights pi:if M1 = M2 ∪ M3, p(M1) = p(M2) + p(M3) ?◮ priors distributions◮ πi(θi) defined for every i ∈ I
  • 304. Bayesian StatisticsTests and model choiceProblemsSeveral types of problems (2)◮ Choice of prior structures◮ adequate weights pi:if M1 = M2 ∪ M3, p(M1) = p(M2) + p(M3) ?◮ priors distributions◮ πi(θi) defined for every i ∈ I◮ πi(θi) proper (Jeffreys)
  • 305. Bayesian StatisticsTests and model choiceProblemsSeveral types of problems (2)◮ Choice of prior structures◮ adequate weights pi:if M1 = M2 ∪ M3, p(M1) = p(M2) + p(M3) ?◮ priors distributions◮ πi(θi) defined for every i ∈ I◮ πi(θi) proper (Jeffreys)◮ πi(θi) coherent (?) for nested models
  • 306. Bayesian StatisticsTests and model choiceProblemsSeveral types of problems (2)◮ Choice of prior structures◮ adequate weights pi:if M1 = M2 ∪ M3, p(M1) = p(M2) + p(M3) ?◮ priors distributions◮ πi(θi) defined for every i ∈ I◮ πi(θi) proper (Jeffreys)◮ πi(θi) coherent (?) for nested modelsWarningParameters common to several models must be treated as separateentities!
  • 307. Bayesian StatisticsTests and model choiceProblemsSeveral types of problems (3)◮ Computation of predictives and marginals
- infinite dimensional spaces
- integration over parameter spaces
- integration over different spaces
- summation over many models (2^k)
  • 308. Bayesian StatisticsTests and model choiceCompatible priorsCompatibility principleDifficulty of finding simultaneously priors on a collection of modelsMi (i ∈ I)
  • 309. Bayesian StatisticsTests and model choiceCompatible priorsCompatibility principleDifficulty of finding simultaneously priors on a collection of modelsMi (i ∈ I)Easier to start from a single prior on a “big” model and to derivethe others from a coherence principle[Dawid & Lauritzen, 2000]
  • 310. Bayesian StatisticsTests and model choiceCompatible priorsProjection approachFor M2 submodel of M1, π2 can be derived as the distribution ofθ⊥2 (θ1) when θ1 ∼ π1(θ1) and θ⊥2 (θ1) is a projection of θ1 on M2,e.g.d(f(· |θ1), f(· |θ1⊥)) = infθ2∈Θ2d(f(· |θ1) , f(· |θ2)) .where d is a divergence measure[McCulloch & Rossi, 1992]
  • 311. Bayesian StatisticsTests and model choiceCompatible priorsProjection approachFor M2 submodel of M1, π2 can be derived as the distribution ofθ⊥2 (θ1) when θ1 ∼ π1(θ1) and θ⊥2 (θ1) is a projection of θ1 on M2,e.g.d(f(· |θ1), f(· |θ1⊥)) = infθ2∈Θ2d(f(· |θ1) , f(· |θ2)) .where d is a divergence measure[McCulloch & Rossi, 1992]Or we can look instead at the posterior distribution ofd(f(· |θ1), f(· |θ1⊥))[Goutis & Robert, 1998]
  • 312. Bayesian StatisticsTests and model choiceCompatible priorsOperational principle for variable selectionSelection ruleAmong all subsets A of covariates such thatd(Mg, MA) = Ex[d(fg(·|x, α), fA(·|xA, α⊥))] < ǫselect the submodel with the smallest number of variables.[Dupuis & Robert, 2001]
  • 313. Bayesian StatisticsTests and model choiceCompatible priorsKullback proximityAlternative to aboveDefinition (Compatible prior)Given a prior π1 on a model M1 and a submodel M2, a prior π2 onM2 is compatible with π1
  • 314. Bayesian StatisticsTests and model choiceCompatible priorsKullback proximityAlternative to aboveDefinition (Compatible prior)Given a prior π1 on a model M1 and a submodel M2, a prior π2 onM2 is compatible with π1 when it achieves the minimum Kullbackdivergence between the corresponding marginals:m1(x; π1) = Θ1f1(x|θ)π1(θ)dθ andm2(x); π2 = Θ2f2(x|θ)π2(θ)dθ,
  • 315. Bayesian StatisticsTests and model choiceCompatible priorsKullback proximityAlternative to aboveDefinition (Compatible prior)Given a prior π1 on a model M1 and a submodel M2, a prior π2 onM2 is compatible with π1 when it achieves the minimum Kullbackdivergence between the corresponding marginals:m1(x; π1) = Θ1f1(x|θ)π1(θ)dθ andm2(x); π2 = Θ2f2(x|θ)π2(θ)dθ,π2 = arg minπ2logm1(x; π1)m2(x; π2)m1(x; π1) dx
  • 316. Bayesian StatisticsTests and model choiceCompatible priorsDifficulties◮ Does not give a working principle when M2 is not a submodelM1
  • 317. Bayesian StatisticsTests and model choiceCompatible priorsDifficulties◮ Does not give a working principle when M2 is not a submodelM1◮ Depends on the choice of π1
  • 318. Bayesian StatisticsTests and model choiceCompatible priorsDifficulties◮ Does not give a working principle when M2 is not a submodelM1◮ Depends on the choice of π1◮ Prohibits the use of improper priors
  • 319. Bayesian StatisticsTests and model choiceCompatible priorsDifficulties◮ Does not give a working principle when M2 is not a submodelM1◮ Depends on the choice of π1◮ Prohibits the use of improper priors◮ Worse: useless in unconstrained settings...
  • 320. Bayesian StatisticsTests and model choiceCompatible priorsCase of exponential familiesModelsM1 : {f1(x|θ), θ ∈ Θ}andM2 : {f2(x|λ), λ ∈ Λ}sub-model of M1,∀λ ∈ Λ, ∃ θ(λ) ∈ Θ, f2(x|λ) = f1(x|θ(λ))
  • 321. Bayesian StatisticsTests and model choiceCompatible priorsCase of exponential familiesModelsM1 : {f1(x|θ), θ ∈ Θ}andM2 : {f2(x|λ), λ ∈ Λ}sub-model of M1,∀λ ∈ Λ, ∃ θ(λ) ∈ Θ, f2(x|λ) = f1(x|θ(λ))Both M1 and M2 are natural exponential familiesf1(x|θ) = h1(x) exp(θTt1(x) − M1(θ))f2(x|λ) = h2(x) exp(λTt2(x) − M2(λ))
  • 322. Bayesian StatisticsTests and model choiceCompatible priorsConjugate priorsParameterised (conjugate) priorsπ1(θ; s1, n1) = C1(s1, n1) exp(sT1 θ − n1M1(θ))π2(λ; s2, n2) = C2(s2, n2) exp(sT2 λ − n2M2(λ))
  • 323. Bayesian StatisticsTests and model choiceCompatible priorsConjugate priorsParameterised (conjugate) priorsπ1(θ; s1, n1) = C1(s1, n1) exp(sT1 θ − n1M1(θ))π2(λ; s2, n2) = C2(s2, n2) exp(sT2 λ − n2M2(λ))with closed form marginals (i = 1, 2)mi(x; si, ni) = fi(x|u)πi(u)du =hi(x)Ci(si, ni)Ci(si + ti(x), ni + 1)
  • 324. Bayesian StatisticsTests and model choiceCompatible priorsConjugate compatible priors(Q.) Existence and unicity of Kullback-Leibler projection(s∗2, n∗2) = arg min(s2,n2)KL(m1(·; s1, n1), m2(·; s2, n2))= arg min(s2,n2)logm1(x; s1, n1)m2(x; s2, n2)m1(x; s1, n1)dx
  • 325. Bayesian StatisticsTests and model choiceCompatible priorsA sufficient conditionSufficient statistic ψ = (λ, −M2(λ))Theorem (Existence)If, for all (s2, n2), the matrixVπ2s2,n2[ψ] − Em1s1,n1Vπ2s2,n2(ψ|x)is semi-definite negative,
  • 326. Bayesian StatisticsTests and model choiceCompatible priorsA sufficient conditionSufficient statistic ψ = (λ, −M2(λ))Theorem (Existence)If, for all (s2, n2), the matrixVπ2s2,n2[ψ] − Em1s1,n1Vπ2s2,n2(ψ|x)is semi-definite negative, the conjugate compatible prior exists, isunique and satisfiesEπ2s∗2,n∗2[λ] − Em1s1,n1[Eπ2s∗2,n∗2(λ|x)] = 0Eπ2s∗2,n∗2(M2(λ)) − Em1s1,n1[Eπ2s∗2,n∗2(M2(λ)|x)] = 0.
  • 327. Bayesian StatisticsTests and model choiceCompatible priorsAn application to linear regressionM1 and M2 are two nested Gaussian linear regression models withZellner’s g-priors and the same variance σ2 ∼ π(σ2):1. M1 :y|β1, σ2∼ N(X1β1, σ2), β1|σ2∼ N s1, σ2n1(XT1 X1)−1where X1 is a (n × k1) matrix of rank k1 ≤ n
  • 328. Bayesian StatisticsTests and model choiceCompatible priorsAn application to linear regressionM1 and M2 are two nested Gaussian linear regression models withZellner’s g-priors and the same variance σ2 ∼ π(σ2):1. M1 :y|β1, σ2∼ N(X1β1, σ2), β1|σ2∼ N s1, σ2n1(XT1 X1)−1where X1 is a (n × k1) matrix of rank k1 ≤ n2. M2 :y|β2, σ2∼ N(X2β2, σ2), β2|σ2∼ N s2, σ2n2(XT2 X2)−1,where X2 is a (n × k2) matrix with span(X2) ⊆ span(X1)
  • 329. Bayesian StatisticsTests and model choiceCompatible priorsAn application to linear regressionM1 and M2 are two nested Gaussian linear regression models withZellner’s g-priors and the same variance σ2 ∼ π(σ2):1. M1 :y|β1, σ2∼ N(X1β1, σ2), β1|σ2∼ N s1, σ2n1(XT1 X1)−1where X1 is a (n × k1) matrix of rank k1 ≤ n2. M2 :y|β2, σ2∼ N(X2β2, σ2), β2|σ2∼ N s2, σ2n2(XT2 X2)−1,where X2 is a (n × k2) matrix with span(X2) ⊆ span(X1)For a fixed (s1, n1), we need the projection (s2, n2) = (s1, n1)⊥
  • 330. Bayesian StatisticsTests and model choiceCompatible priorsCompatible g-priorsSince σ2 is a nuisance parameter, we can minimize theKullback-Leibler divergence between the two marginal distributionsconditional on σ2: m1(y|σ2; s1, n1) and m2(y|σ2; s2, n2)
  • 331. Bayesian StatisticsTests and model choiceCompatible priorsCompatible g-priorsSince σ2 is a nuisance parameter, we can minimize theKullback-Leibler divergence between the two marginal distributionsconditional on σ2: m1(y|σ2; s1, n1) and m2(y|σ2; s2, n2)TheoremConditional on σ2, the conjugate compatible prior of M2 wrt M1 isβ2|X2, σ2∼ N s∗2, σ2n∗2(XT2 X2)−1withs∗2 = (XT2 X2)−1XT2 X1s1n∗2 = n1
  • 332. Bayesian StatisticsTests and model choiceVariable selectionVariable selectionRegression setup where y regressed on a set {x1, . . . , xp} of ppotential explanatory regressors (plus intercept)
  • 333. Bayesian StatisticsTests and model choiceVariable selectionVariable selectionRegression setup where y regressed on a set {x1, . . . , xp} of ppotential explanatory regressors (plus intercept)Corresponding 2p submodels Mγ, where γ ∈ Γ = {0, 1}p indicatesinclusion/exclusion of variables by a binary representation,
  • 334. Bayesian StatisticsTests and model choiceVariable selectionVariable selection
Regression setup where y is regressed on a set {x1, . . . , xp} of p potential explanatory regressors (plus intercept)
Corresponding 2^p submodels Mγ, where γ ∈ Γ = {0, 1}^p indicates inclusion/exclusion of variables by a binary representation,
e.g. γ = 101001011 means that x1, x3, x6, x8 and x9 are included.
  • 335. Bayesian StatisticsTests and model choiceVariable selectionNotationsFor model Mγ,◮ qγ variables included◮ t1(γ) = {t1,1(γ), . . . , t1,qγ (γ)} indices of those variables andt0(γ) indices of the variables not included◮ For β ∈ Rp+1,βt1(γ) = β0, βt1,1(γ), . . . , βt1,qγ (γ)Xt1(γ) = 1n|xt1,1(γ)| . . . |xt1,qγ (γ) .
  • 336. Bayesian StatisticsTests and model choiceVariable selectionNotationsFor model Mγ,◮ qγ variables included◮ t1(γ) = {t1,1(γ), . . . , t1,qγ (γ)} indices of those variables andt0(γ) indices of the variables not included◮ For β ∈ Rp+1,βt1(γ) = β0, βt1,1(γ), . . . , βt1,qγ (γ)Xt1(γ) = 1n|xt1,1(γ)| . . . |xt1,qγ (γ) .Submodel Mγ is thusy|β, γ, σ2∼ N Xt1(γ)βt1(γ), σ2In
  • 337. Bayesian StatisticsTests and model choiceVariable selectionGlobal and compatible priorsUse Zellner’s g-prior, i.e. a normal prior for β conditional on σ2,β|σ2∼ N(˜β, cσ2(XTX)−1)and a Jeffreys prior for σ2,π(σ2) ∝ σ−2Noninformative g
  • 338. Bayesian StatisticsTests and model choiceVariable selectionGlobal and compatible priorsUse Zellner’s g-prior, i.e. a normal prior for β conditional on σ2,β|σ2∼ N(˜β, cσ2(XTX)−1)and a Jeffreys prior for σ2,π(σ2) ∝ σ−2Noninformative gResulting compatible priorN XTt1(γ)Xt1(γ)−1XTt1(γ)X ˜β, cσ2XTt1(γ)Xt1(γ)−1
  • 339. Bayesian StatisticsTests and model choiceVariable selectionGlobal and compatible priorsUse Zellner’s g-prior, i.e. a normal prior for β conditional on σ2,β|σ2∼ N(˜β, cσ2(XTX)−1)and a Jeffreys prior for σ2,π(σ2) ∝ σ−2Noninformative gResulting compatible priorN XTt1(γ)Xt1(γ)−1XTt1(γ)X ˜β, cσ2XTt1(γ)Xt1(γ)−1[Surprise!]
  • 343. Bayesian StatisticsTests and model choiceVariable selectionModel indexFor the hierarchical parameter γ, we useπ(γ) =pi=1τγii (1 − τi)1−γi,where τi corresponds to the prior probability that variable i ispresent in the model (and a priori independence between thepresence/absence of variables)
  • 344. Bayesian StatisticsTests and model choiceVariable selectionModel indexFor the hierarchical parameter γ, we useπ(γ) =pi=1τγii (1 − τi)1−γi,where τi corresponds to the prior probability that variable i ispresent in the model (and a priori independence between thepresence/absence of variables)Typically, when no prior information is available,τ1 = . . . = τp = 1/2, ie a uniform priorπ(γ) = 2−p
  • 345. Bayesian StatisticsTests and model choiceVariable selectionPosterior model probabilityCan be obtained in closed form:π(γ|y) ∝ (c+1)−(qγ +1)/2yTy −cyTP1yc + 1+˜βTXTP1X ˜βc + 1−2yTP1X ˜βc + 1−n/2.
  • 346. Bayesian StatisticsTests and model choiceVariable selectionPosterior model probabilityCan be obtained in closed form:π(γ|y) ∝ (c+1)−(qγ +1)/2yTy −cyTP1yc + 1+˜βTXTP1X ˜βc + 1−2yTP1X ˜βc + 1−n/2.Conditionally on γ, posterior distributions of β and σ2:βt1(γ)|σ2, y, γ ∼ Ncc + 1(U1y + U1X ˜β/c),σ2cc + 1XTt1(γ)Xt1(γ)−1,σ2|y, γ ∼ IGn2,yTy2−cyTP1y2(c + 1)+˜βTXTP1X ˜β2(c + 1)−yTP1X ˜βc + 1.
  • 347. Bayesian StatisticsTests and model choiceVariable selectionNoninformative caseUse the same compatible informative g-prior distribution with˜β = 0p+1 and a hierarchical diffuse prior distribution on c,π(c) ∝ c−1IN∗ (c)Recall g-prior
  • 348. Bayesian StatisticsTests and model choiceVariable selectionNoninformative caseUse the same compatible informative g-prior distribution with˜β = 0p+1 and a hierarchical diffuse prior distribution on c,π(c) ∝ c−1IN∗ (c)Recall g-priorThe choice of this hierarchical diffuse prior distribution on c is dueto the model posterior sensitivity to large values of c:
  • 349. Bayesian StatisticsTests and model choiceVariable selectionNoninformative caseUse the same compatible informative g-prior distribution with˜β = 0p+1 and a hierarchical diffuse prior distribution on c,π(c) ∝ c−1IN∗ (c)Recall g-priorThe choice of this hierarchical diffuse prior distribution on c is dueto the model posterior sensitivity to large values of c:Taking ˜β = 0p+1 and c large does not work
  • 350. Bayesian StatisticsTests and model choiceVariable selectionInfluence of cErase influenceConsider the 10-predictor full modely|β, σ2∼ Nβ0 +3i=1βixi +3i=1βi+3x2i + β7x1x2 + β8x1x3 + β9x2x3 + β10x1x2x3, σ2Inwhere the xis are iid U (0, 10)[Casella & Moreno, 2004]
  • 351. Bayesian StatisticsTests and model choiceVariable selectionInfluence of cErase influenceConsider the 10-predictor full modely|β, σ2∼ Nβ0 +3i=1βixi +3i=1βi+3x2i + β7x1x2 + β8x1x3 + β9x2x3 + β10x1x2x3, σ2Inwhere the xis are iid U (0, 10)[Casella & Moreno, 2004]True model: two predictors x1 and x2, i.e. γ∗ = 110. . .0,(β0, β1, β2) = (5, 1, 3), and σ2 = 4.
  • 352. Bayesian StatisticsTests and model choiceVariable selectionInfluence of c (2)
t1(γ)      c = 10     c = 100    c = 10³    c = 10⁴    c = 10⁶
0,1,2      0.04062    0.35368    0.65858    0.85895    0.98222
0,1,2,7    0.01326    0.06142    0.08395    0.04434    0.00524
0,1,2,4    0.01299    0.05310    0.05805    0.02868    0.00336
0,2,4      0.02927    0.03962    0.00409    0.00246    0.00254
0,1,2,8    0.01240    0.03833    0.01100    0.00126    0.00126
  • 353. Bayesian StatisticsTests and model choiceVariable selectionNoninformative case (cont’d)In the noninformative setting,π(γ|y) ∝∞c=1c−1(c + 1)−(qγ+1)/2yTy −cc + 1yTP1y−n/2converges for all y’s
  • 354. Bayesian StatisticsTests and model choiceVariable selectionCasella & Moreno’s example
t1(γ)        Σ_{c=1}^{10⁶} π(γ|y, c)π(c)
0,1,2        0.78071
0,1,2,7      0.06201
0,1,2,4      0.04119
0,1,2,8      0.01676
0,1,2,5      0.01604
  • 355. Bayesian StatisticsTests and model choiceVariable selectionGibbs approximationWhen p large, impossible to compute the posterior probabilities ofthe 2p models.
  • 356. Bayesian StatisticsTests and model choiceVariable selectionGibbs approximationWhen p large, impossible to compute the posterior probabilities ofthe 2p models.Use of a Monte Carlo approximation of π(γ|y)
  • 357. Bayesian StatisticsTests and model choiceVariable selectionGibbs approximationWhen p large, impossible to compute the posterior probabilities ofthe 2p models.Use of a Monte Carlo approximation of π(γ|y)Gibbs sampling• At t = 0, draw γ0 from the uniform distribution on Γ• At t, for i = 1, . . . , p, drawγti ∼ π(γi|y, γt1, . . . , γti−1, . . . , γt−1i+1 , . . . , γt−1p )
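A hedged sketch of this Gibbs sampler in Python (entirely my own illustration, not the authors' code), using the closed-form marginal π(γ|y) given a few slides earlier with β̃ = 0, a fixed g-prior constant c, and a uniform prior on γ; function and variable names, as well as the toy data, are mine.

```python
import numpy as np

def log_marginal(gamma, y, X, c):
    """log pi(gamma | y) up to a constant, g-prior with tilde-beta = 0."""
    n = y.size
    X1 = np.column_stack([np.ones(n), X[:, gamma.astype(bool)]])   # intercept + selected
    q = int(gamma.sum())
    P1y = X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]               # projection of y on span(X1)
    return (-(q + 1) / 2 * np.log(c + 1)
            - n / 2 * np.log(y @ y - c / (c + 1) * y @ P1y))

def gibbs_variable_selection(y, X, c=100.0, T=2000, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    gamma = rng.integers(0, 2, p).astype(float)        # gamma^0 uniform on {0,1}^p
    visits = {}
    for _ in range(T):
        for i in range(p):                              # update gamma_i given the others
            g1, g0 = gamma.copy(), gamma.copy()
            g1[i], g0[i] = 1.0, 0.0
            l1, l0 = log_marginal(g1, y, X, c), log_marginal(g0, y, X, c)
            gamma[i] = float(rng.random() < 1.0 / (1.0 + np.exp(l0 - l1)))
        key = tuple(gamma.astype(int))
        visits[key] = visits.get(key, 0) + 1
    return visits

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = 5 + X[:, 0] + 3 * X[:, 1] + 2 * rng.normal(size=60)    # only x1 and x2 matter
print(sorted(gibbs_variable_selection(y, X).items(), key=lambda kv: -kv[1])[:3])
```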
  • 358. Bayesian StatisticsTests and model choiceVariable selectionGibbs approximation (cont’d)Example (Simulated data)Severe multicolinearities among predictors for a 20-predictor fullmodely|β, σ2∼ N β0 +20i=1βixi, σ2Inwhere xi = zi + 3z, the zi’s and z are iid Nn(0n, In).
  • 359. Bayesian StatisticsTests and model choiceVariable selectionGibbs approximation (cont’d)Example (Simulated data)Severe multicolinearities among predictors for a 20-predictor fullmodely|β, σ2∼ N β0 +20i=1βixi, σ2Inwhere xi = zi + 3z, the zi’s and z are iid Nn(0n, In).True model with n = 180, σ2 = 4 and seven predictor variablesx1, x3, x5, x6, x12, x18, x20,(β0, β1, β3, β5, β6, β12, β18, β20) = (3, 4, 1, −3, 12, −1, 5, −6)
  • 360. Bayesian StatisticsTests and model choiceVariable selectionGibbs approximation (cont’d)Example (Simulated data (2))
γ                        π(γ|y)    π(γ|y) (Gibbs)
0,1,3,5,6,12,18,20       0.1893    0.1822
0,1,3,5,6,18,20          0.0588    0.0598
0,1,3,5,6,9,12,18,20     0.0223    0.0236
0,1,3,5,6,12,14,18,20    0.0220    0.0193
0,1,2,3,5,6,12,18,20     0.0216    0.0222
0,1,3,5,6,7,12,18,20     0.0212    0.0233
0,1,3,5,6,10,12,18,20    0.0199    0.0222
0,1,3,4,5,6,12,18,20     0.0197    0.0182
0,1,3,5,6,12,15,18,20    0.0196    0.0196
Gibbs (T = 100,000) results for β̃ = 0_21 and c = 100
  • 361. Bayesian StatisticsTests and model choiceVariable selectionProcessionary caterpillarInfluence of some forest settlement characteristics on thedevelopment of caterpillar colonies
  • 362. Bayesian StatisticsTests and model choiceVariable selectionProcessionary caterpillarInfluence of some forest settlement characteristics on thedevelopment of caterpillar colonies
  • 363. Bayesian StatisticsTests and model choiceVariable selectionProcessionary caterpillarInfluence of some forest settlement characteristics on thedevelopment of caterpillar coloniesResponse y log-transform of the average number of nests ofcaterpillars per tree on an area of 500 square meters (n = 33 areas)
  • 364. Bayesian StatisticsTests and model choiceVariable selectionProcessionary caterpillar (cont’d)Potential explanatory variables
x1 altitude (in meters), x2 slope (in degrees),
x3 number of pines in the square,
x4 height (in meters) of the tree at the center of the square,
x5 diameter of the tree at the center of the square,
x6 index of the settlement density,
x7 orientation of the square (from 1 if southbound to 2 otherwise),
x8 height (in meters) of the dominant tree,
x9 number of vegetation strata,
x10 mix settlement index (from 1 if not mixed to 2 if mixed).
  • 365. Bayesian StatisticsTests and model choiceVariable selection[Panel of plots indexed by x1, x2, . . . , x9]
  • 366. Bayesian StatisticsTests and model choiceVariable selectionBayesian regression output
              Estimate    BF        log10(BF)
(Intercept)    9.2714     26.334     1.4205 (***)
X1            -0.0037      7.0839    0.8502 (**)
X2            -0.0454      3.6850    0.5664 (**)
X3             0.0573      0.4356   -0.3609
X4            -1.0905      2.8314    0.4520 (*)
X5             0.1953      2.5157    0.4007 (*)
X6            -0.3008      0.3621   -0.4412
X7            -0.2002      0.3627   -0.4404
X8             0.1526      0.4589   -0.3383
X9            -1.0835      0.9069   -0.0424
X10           -0.3651      0.4132   -0.3838
evidence against H0: (****) decisive, (***) strong, (**) substantial, (*) poor
  • 367. Bayesian StatisticsTests and model choiceVariable selectionBayesian variable selection
t1(γ)             π(γ|y, X)   π(γ|y, X) (Gibbs)
0,1,2,4,5         0.0929      0.0929
0,1,2,4,5,9       0.0325      0.0326
0,1,2,4,5,10      0.0295      0.0272
0,1,2,4,5,7       0.0231      0.0231
0,1,2,4,5,8       0.0228      0.0229
0,1,2,4,5,6       0.0228      0.0226
0,1,2,3,4,5       0.0224      0.0220
0,1,2,3,4,5,9     0.0167      0.0182
0,1,2,4,5,6,9     0.0167      0.0171
0,1,2,4,5,8,9     0.0137      0.0130
Noninformative G-prior model choice and Gibbs estimations
  • 368. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsPostulatePrevious principle requires embedded models (or an encompassingmodel) and proper priors, while being hard to implement outsideexponential families
  • 369. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsPostulatePrevious principle requires embedded models (or an encompassingmodel) and proper priors, while being hard to implement outsideexponential familiesNow we determine prior measures on two models M1 and M2, π1and π2, directly by a compatibility principle.
  • 370. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsGeneralised expected posterior priors[Perez & Berger, 2000]EPP PrincipleStarting from reference priors πN1 and πN2 , substitute by priordistributions π1 and π2 that solve the system of integral equationsπ1(θ1) =XπN1 (θ1 | x)m2(x)dxandπ2(θ2) =XπN2 (θ2 | x)m1(x)dx,where x is an imaginary minimal training sample and m1, m2 arethe marginals associated with π1 and π2 respectively.
  • 371. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsMotivations◮ Eliminates the “imaginary observation” device andproper-isation through part of the data by integration underthe “truth”
  • 372. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsMotivations◮ Eliminates the “imaginary observation” device andproper-isation through part of the data by integration underthe “truth”◮ Assumes that both models are equally valid and equippedwith ideal unknown priorsπi, i = 1, 2,that yield “true” marginals balancing each model wrt theother
  • 373. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsMotivations◮ Eliminates the “imaginary observation” device andproper-isation through part of the data by integration underthe “truth”◮ Assumes that both models are equally valid and equippedwith ideal unknown priorsπi, i = 1, 2,that yield “true” marginals balancing each model wrt theother◮ For a given π1, π2 is an expected posterior priorUsing both equations introduces symmetry into the game
  • 374. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsDual propernessTheorem (Proper distributions)If π1 is a probability density then π2 solution toπ2(θ2) =XπN2 (θ2 | x)m1(x)dxis a probability densityc Both EPPs are either proper or improper
  • 375. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsBayesian coherenceTrue Bayes factor:If π1 and π2 are the EPPs and if their marginals are finite, then thecorresponding Bayes factorB1,2(x)is either a (true) Bayes factor or a limit of (true) Bayes factors.
  • 376. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsBayesian coherenceTrue Bayes factor:If π1 and π2 are the EPPs and if their marginals are finite, then thecorresponding Bayes factorB1,2(x)is either a (true) Bayes factor or a limit of (true) Bayes factors.Obviously only interesting when both π1 and π2 are improper.
  • 377. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsExistence/UnicityRecurrence conditionWhen both the observations and the parameters in both modelsare continuous, if the Markov chain with transitionQ θ′1 | θ1 = g θ1, θ′1, θ2, x, x′dxdx′dθ2whereg θ1, θ′1, θ2, x, x′= πN1 θ′1 | x f2 (x | θ2) πN2 θ2 | x′f1 x′| θ1 ,is recurrent, then there exists a solution to the integral equations,unique up to a multiplicative constant.
  • 378. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsConsequences◮ If the M chain is positive recurrent, there exists a unique pairof proper EPPS.
  • 379. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsConsequences◮ If the M chain is positive recurrent, there exists a unique pairof proper EPPS.◮ The transition density Q (θ′1 | θ1) has a dual transition densityon Θ2.
  • 380. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsConsequences◮ If the M chain is positive recurrent, there exists a unique pairof proper EPPS.◮ The transition density Q (θ′1 | θ1) has a dual transition densityon Θ2.◮ There exists a parallel M chain on Θ2 with identicalproperties; if one is (Harris) recurrent, so is the other.
  • 381. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsConsequences◮ If the M chain is positive recurrent, there exists a unique pairof proper EPPS.◮ The transition density Q (θ′1 | θ1) has a dual transition densityon Θ2.◮ There exists a parallel M chain on Θ2 with identicalproperties; if one is (Harris) recurrent, so is the other.◮ Duality property found both in the MCMC literature and indecision theory[Diebolt & Robert, 1992; Eaton, 1992]
  • 382. Bayesian StatisticsTests and model choiceSymmetrised compatible priorsConsequences◮ If the M chain is positive recurrent, there exists a unique pairof proper EPPS.◮ The transition density Q (θ′1 | θ1) has a dual transition densityon Θ2.◮ There exists a parallel M chain on Θ2 with identicalproperties; if one is (Harris) recurrent, so is the other.◮ Duality property found both in the MCMC literature and indecision theory[Diebolt & Robert, 1992; Eaton, 1992]◮ When Harris recurrence holds but the EPPs cannot be found,the Bayes factor can be approximated by MCMC simulation
  • 383. Bayesian StatisticsTests and model choiceExamplesPoint null hypothesis testingTesting H0 : θ = θ∗ versus H1 : θ = θ∗, i.e.M1 : f (x | θ∗) ,M2 : f (x | θ) , θ ∈ Θ.
  • 384. Bayesian StatisticsTests and model choiceExamplesPoint null hypothesis testingTesting H0 : θ = θ∗ versus H1 : θ = θ∗, i.e.M1 : f (x | θ∗) ,M2 : f (x | θ) , θ ∈ Θ.Default priorsπN1 (θ) = δθ∗ (θ) and πN2 (θ) = πN(θ)
  • 385. Bayesian StatisticsTests and model choiceExamplesPoint null hypothesis testingTesting H0 : θ = θ∗ versus H1 : θ = θ∗, i.e.M1 : f (x | θ∗) ,M2 : f (x | θ) , θ ∈ Θ.Default priorsπN1 (θ) = δθ∗ (θ) and πN2 (θ) = πN(θ)For x minimal training sample, consider the proper priorsπ1 (θ) = δθ∗ (θ) and π2 (θ) = πN(θ | x) f (x | θ∗) dx
  • 386. Bayesian StatisticsTests and model choiceExamplesPoint null hypothesis testing (cont’d)ThenπN1 (θ | x) m2 (x) dx = δθ∗ (θ) m2 (x) dx = δθ∗ (θ) = π1 (θ)andπN2 (θ | x) m1 (x) dx = πN(θ | x) f (x | θ∗) dx = π2 (θ)
  • 387. Bayesian StatisticsTests and model choiceExamplesPoint null hypothesis testing (cont’d)ThenπN1 (θ | x) m2 (x) dx = δθ∗ (θ) m2 (x) dx = δθ∗ (θ) = π1 (θ)andπN2 (θ | x) m1 (x) dx = πN(θ | x) f (x | θ∗) dx = π2 (θ)c π1 (θ) and π2 (θ) are integral priors
  • 388. Bayesian StatisticsTests and model choiceExamplesPoint null hypothesis testing (cont’d)ThenπN1 (θ | x) m2 (x) dx = δθ∗ (θ) m2 (x) dx = δθ∗ (θ) = π1 (θ)andπN2 (θ | x) m1 (x) dx = πN(θ | x) f (x | θ∗) dx = π2 (θ)c π1 (θ) and π2 (θ) are integral priorsNoteUniqueness of the Bayes factorIntegral priors and intrinsic priors coincide[Moreno, Bertolino and Racugno, 1998]
  • 389. Bayesian StatisticsTests and model choiceExamplesLocation modelsTwo location modelsM1 : f1 (x | θ1) = f1 (x − θ1)M2 : f2 (x | θ2) = f2 (x − θ2)
  • 390. Bayesian StatisticsTests and model choiceExamplesLocation modelsTwo location modelsM1 : f1 (x | θ1) = f1 (x − θ1)M2 : f2 (x | θ2) = f2 (x − θ2)Default priorsπNi (θi) = ci, i = 1, 2with minimal training sample size one
  • 391. Bayesian StatisticsTests and model choiceExamplesLocation modelsTwo location modelsM1 : f1 (x | θ1) = f1 (x − θ1)M2 : f2 (x | θ2) = f2 (x − θ2)Default priorsπNi (θi) = ci, i = 1, 2with minimal training sample size oneMarginal densitiesmNi (x) = ci, i = 1, 2
  • 392. Bayesian StatisticsTests and model choiceExamplesLocation models (cont’d)In that case, πN1 (θ1) and πN2 (θ2) are integral priors when c1 = c2:πN1 (θ1 | x) mN2 (x) dx = c2f1 (x − θ1) dx = c2πN2 (θ2 | x) mN1 (x) dx = c1f2 (x − θ2) dx = c1.
  • 393. Bayesian StatisticsTests and model choiceExamplesLocation models (cont’d)In that case, πN1 (θ1) and πN2 (θ2) are integral priors when c1 = c2:πN1 (θ1 | x) mN2 (x) dx = c2f1 (x − θ1) dx = c2πN2 (θ2 | x) mN1 (x) dx = c1f2 (x − θ2) dx = c1.c If the associated Markov chain is recurrent,πN1 (θ1) = πN2 (θ2) = care the unique integral priors and they are intrinsic priors[Cano, Kessler & Moreno, 2004]
  • 394. Bayesian StatisticsTests and model choiceExamplesLocation models (cont’d)Example (Normal versus double exponential)M1 : N(θ, 1), πN1 (θ) = c1,M2 : DE(λ, 1), πN2 (λ) = c2.Minimal training sample size one and posterior densitiesπN1 (θ | x) = N(x, 1) and πN2 (λ | x) = DE (x, 1)
  • 395. Bayesian StatisticsTests and model choiceExamplesLocation models (cont’d)Example (Normal versus double exponential (2))Transition θ → θ′ of the Markov chain made of steps :1. x′ = θ + ε1, ε1 ∼ N(0, 1)2. λ = x′ + ε2, ε2 ∼ DE(0, 1)3. x = λ+ ε3, ε3 ∼ DE(0, 1)4. θ′ = x + ε4, ε4 ∼ N(0, 1)i.e. θ′= θ + ε1 + ε2 + ε3 + ε4
  • 396. Bayesian StatisticsTests and model choiceExamplesLocation models (cont’d)Example (Normal versus double exponential (2))Transition θ → θ′ of the Markov chain made of steps :1. x′ = θ + ε1, ε1 ∼ N(0, 1)2. λ = x′ + ε2, ε2 ∼ DE(0, 1)3. x = λ+ ε3, ε3 ∼ DE(0, 1)4. θ′ = x + ε4, ε4 ∼ N(0, 1)i.e. θ′= θ + ε1 + ε2 + ε3 + ε4random walk in θ with finite second moment, null recurrentc Resulting Lebesgue measures π1 (θ) = 1 = π2 (λ) invariantand unique solutions to integral equations
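A quick simulation makes the null recurrence claim tangible: the composite transition above is a symmetric random walk whose increment is the sum ε1 + ε2 + ε3 + ε4, so the chain keeps returning near its starting point while its spread grows without bound. This is only an illustrative sketch; the number of iterations and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def transition(theta):
    # theta' = theta + eps1 + eps2 + eps3 + eps4: two N(0,1) and two DE(0,1) increments,
    # matching the four steps listed above
    return (theta
            + rng.normal(0.0, 1.0)
            + rng.laplace(0.0, 1.0)
            + rng.laplace(0.0, 1.0)
            + rng.normal(0.0, 1.0))

theta, path = 0.0, []
for _ in range(100_000):
    theta = transition(theta)
    path.append(theta)
path = np.array(path)

# null recurrent behaviour: frequent returns near the origin, but spread growing with time
print(np.mean(np.abs(path) < 1.0), path[:50_000].std(), path.std())
```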
  • 397. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility and Complete ClassesIntroductionDecision-Theoretic Foundations of Statistical InferenceFrom Prior Information to Prior DistributionsBayesian Point EstimationTests and model choiceAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsNecessary and sufficient admissibility conditions
  • 398. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsAdmissibility of Bayes estimatorsWarningBayes estimators may be inadmissible when the Bayes risk is infinite
  • 399. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsExample (Normal mean)Consider x ∼ N(θ, 1) with a conjugate prior θ ∼ N(0, σ²) and loss Lα(θ, δ) = e^{θ²/2α}(θ − δ)²
  • 400. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsExample (Normal mean)Consider x ∼ N(θ, 1) with a conjugate prior θ ∼ N(0, σ²) and loss Lα(θ, δ) = e^{θ²/2α}(θ − δ)²The associated generalized Bayes estimator is defined for α > σ²/(σ²+1) and δπα(x) = [(σ²+1)/σ²] [(σ²+1)/σ² − α⁻¹]⁻¹ δπ(x) = [α/(α − σ²/(σ²+1))] δπ(x).
  • 401. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsExample (Normal mean (2))The corresponding Bayes risk is r(π) = ∫_{−∞}^{+∞} e^{θ²/2α} e^{−θ²/2σ²} dθ
  • 402. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsExample (Normal mean (2))The corresponding Bayes risk is r(π) = ∫_{−∞}^{+∞} e^{θ²/2α} e^{−θ²/2σ²} dθ which is infinite for α ≤ σ².
  • 403. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsExample (Normal mean (2))The corresponding Bayes risk is r(π) = ∫_{−∞}^{+∞} e^{θ²/2α} e^{−θ²/2σ²} dθ which is infinite for α ≤ σ². Since δπα(x) = cx with c > 1 when α > α(σ²+1)/σ² − 1, i.e. when α < σ², δπα is inadmissible
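The inadmissibility argument can be checked numerically from the closed form above: writing δπα(x) = c x with c = [α/(α − σ²/(σ²+1))] · σ²/(σ²+1), the factor c crosses 1 exactly at α = σ². A minimal sketch (the values of α and σ² below are arbitrary):

```python
import numpy as np

def expansion_factor(alpha, sigma2):
    # coefficient c in delta_alpha(x) = c * x, from the closed form above;
    # only defined for alpha > sigma2 / (sigma2 + 1)
    t = sigma2 / (sigma2 + 1.0)          # usual Bayes shrinkage factor
    return alpha / (alpha - t) * t

sigma2 = 2.0                             # arbitrary prior variance
for alpha in (0.8, 1.5, 1.99, 2.01, 5.0):
    print(alpha, expansion_factor(alpha, sigma2))
# c > 1 for alpha < sigma2: an estimator c*x with c > 1 is dominated by x, hence inadmissible
```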
  • 404. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsFormal admissibility resultTheorem (Existence of an admissible Bayes estimator)If Θ is a discrete set and π(θ) > 0 for every θ ∈ Θ, then thereexists an admissible Bayes estimator associated with π
  • 405. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsBoundary conditionsIf f(x|θ) = h(x)e^{θ.T(x)−ψ(θ)}, θ ∈ [θ, θ̄], and π is a conjugate prior, π(θ|t0, λ) = e^{θ.t0−λψ(θ)}
  • 406. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsBoundary conditionsIf f(x|θ) = h(x)e^{θ.T(x)−ψ(θ)}, θ ∈ [θ, θ̄], and π is a conjugate prior, π(θ|t0, λ) = e^{θ.t0−λψ(θ)}Theorem (Conjugate admissibility)A sufficient condition for Eπ[∇ψ(θ)|x] to be admissible is that, for every θ < θ0 < θ̄, ∫_{θ0}^{θ̄} e^{−γ0λθ+λψ(θ)} dθ = ∫_{θ}^{θ0} e^{−γ0λθ+λψ(θ)} dθ = +∞.
  • 407. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsExample (Binomial probability)Consider x ∼ B(n, p). f(x|θ) = (n choose x) e^{(x/n)θ} (1 + e^{θ/n})^{−n}, θ = n log(p/(1 − p))
  • 408. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsExample (Binomial probability)Consider x ∼ B(n, p). f(x|θ) = (n choose x) e^{(x/n)θ} (1 + e^{θ/n})^{−n}, θ = n log(p/(1 − p))Then the two integrals ∫_{−∞}^{θ0} e^{−γ0λθ}(1 + e^{θ/n})^{λn} dθ and ∫_{θ0}^{+∞} e^{−γ0λθ}(1 + e^{θ/n})^{λn} dθ cannot diverge simultaneously if λ < 0.
  • 409. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsExample (Binomial probability (2))For λ > 0, the second integral is divergent if λ(1 − γ0) > 0 andthe first integral is divergent if γ0λ ≥ 0.
  • 410. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsExample (Binomial probability (2))For λ > 0, the second integral is divergent if λ(1 − γ0) > 0 and the first integral is divergent if γ0λ ≥ 0.Admissible Bayes estimators of p: δπ(x) = a x/n + b, 0 ≤ a ≤ 1, b ≥ 0, a + b ≤ 1.
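The admissible estimators a x/n + b are linear in x, so their quadratic risk has a closed form and can be compared with the MLE x/n directly. The sketch below uses the posterior mean under a Beta(1,1) prior, a = n/(n+2) and b = 1/(n+2), purely as one example of an admissible choice; the grid and the sample size n = 20 are arbitrary.

```python
import numpy as np

def risk_linear(p, n, a, b):
    # exact quadratic risk of delta(x) = a*x/n + b for x ~ Binomial(n, p):
    # variance term plus squared bias
    return a**2 * p * (1 - p) / n + (a * p + b - p)**2

n = 20
p_grid = np.linspace(0.01, 0.99, 99)
mle   = risk_linear(p_grid, n, a=1.0, b=0.0)              # x/n
bayes = risk_linear(p_grid, n, a=n/(n + 2), b=1/(n + 2))  # posterior mean under Beta(1,1)
print("max risk, MLE  :", float(mle.max()))
print("max risk, Bayes:", float(bayes.max()))
```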
  • 411. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsDifferential representationsSetting of multidimensional exponential familiesf(x|θ) = h(x)eθ.x−ψ(θ), θ ∈ Rp
  • 412. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsDifferential representationsSetting of multidimensional exponential familiesf(x|θ) = h(x)eθ.x−ψ(θ), θ ∈ RpMeasure g such thatIx(∇g) = ||∇g(θ)||eθ.x−ψ(θ)dθ < +∞
  • 413. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsDifferential representationsSetting of multidimensional exponential familiesf(x|θ) = h(x)eθ.x−ψ(θ), θ ∈ RpMeasure g such thatIx(∇g) = ||∇g(θ)||eθ.x−ψ(θ)dθ < +∞Representation of the posterior mean of ∇ψ(θ)δg(x) = x +Ix(∇g)Ix(g).
  • 414. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsSufficient admissibility conditions∫_{||θ||>1} g(θ) / (||θ||² log²(||θ|| ∨ 2)) dθ < ∞, ∫ ||∇g(θ)||²/g(θ) dθ < ∞, and ∀θ ∈ Θ, R(θ, δg) < ∞
  • 415. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsConsequenceTheoremIfΘ = Rpp ≤ 2the estimatorδ0(x) = xis admissible.
  • 416. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsConsequenceTheoremIfΘ = Rpp ≤ 2the estimatorδ0(x) = xis admissible.Example (Normal mean (3))If x ∼ Np(θ, Ip), p ≤ 2, δ0(x) = x is admissible.
  • 417. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsSpecial case of Np(θ, Σ)A generalised Bayes estimator of the formδ(x) = (1 − h(||x||))x1. is inadmissible if there exist ǫ > 0 and K < +∞ such that||x||2h(||x||) < p − 2 − ǫ for ||x|| > K2. is admissible if there exist K1 and K2 such thath(||x||)||x|| ≤ K1 for every x and||x||2h(||x||) ≥ p − 2 for ||x|| > K2[Brown, 1971]
  • 418. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsRecurrence conditionsGeneral caseEstimation of a bounded function g(θ)For a given prior π, Markovian transition kernelK(θ|η) =Xπ(θ|x)f(x|η) dx,Theorem (Recurrence)The generalised Bayes estimator of g(θ) is admissible if theassociated Markov chain (θ(n)) is π-recurrent.[Eaton, 1994]
  • 419. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsRecurrence conditions (cont.)Extension to the unbounded case, based on the (case dependent)transition kernelT(θ|η) = Ψ(η)−1(ϕ(θ) − ϕ(η))2K(θ|η) ,where Ψ(θ) normalizing factor
  • 420. Bayesian StatisticsAdmissibility and Complete ClassesAdmissibility of Bayes estimatorsRecurrence conditions (cont.)Extension to the unbounded case, based on the (case dependent)transition kernelT(θ|η) = Ψ(η)−1(ϕ(θ) − ϕ(η))2K(θ|η) ,where Ψ(θ) normalizing factorTheorem (Recurrence(2))The generalised Bayes estimator of ϕ(θ) is admissible if theassociated Markov chain (θ(n)) is π-recurrent.[Eaton, 1999]
  • 421. Bayesian StatisticsAdmissibility and Complete ClassesNecessary and sufficient admissibility conditionsNecessary and sufficient admissibility conditionsFormalisation of the statement that...
  • 422. Bayesian StatisticsAdmissibility and Complete ClassesNecessary and sufficient admissibility conditionsNecessary and sufficient admissibility conditionsFormalisation of the statement that......all admissible estimators are limits of Bayes estimators...
  • 423. Bayesian StatisticsAdmissibility and Complete ClassesNecessary and sufficient admissibility conditionsBlyth’s sufficient conditionTheorem (Blyth condition)If, for an estimator δ0, there exists a sequence (πn) of generalisedprior distributions such that(i) r(πn, δ0) is finite for every n;(ii) for every nonempty open set C ⊂ Θ, there exist K > 0 and Nsuch that, for every n ≥ N, πn(C) ≥ K; and(iii) limn→+∞r(πn, δ0) − r(πn) = 0;then δ0 is admissible.
  • 424. Bayesian StatisticsAdmissibility and Complete ClassesNecessary and sufficient admissibility conditionsExample (Normal mean (4))Consider x ∼ N(θ, 1) and δ0(x) = x. Choose πn as the measure with density gn(θ) = e^{−θ²/2n} [condition (ii) is satisfied]. The Bayes estimator for πn is δn(x) = nx/(n + 1), and r(πn) = ∫_R [θ²/(n + 1)² + n²/(n + 1)²] gn(θ) dθ = √(2πn) n/(n + 1) [condition (i) is satisfied]
  • 425. Bayesian StatisticsAdmissibility and Complete ClassesNecessary and sufficient admissibility conditionsExample (Normal mean (5))while r(πn, δ0) = ∫_R 1 · gn(θ) dθ = √(2πn). Moreover, r(πn, δ0) − r(πn) = √(2πn)/(n + 1) converges to 0. [condition (iii) is satisfied]
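A small numerical check of the Blyth quantities for this example, using the closed forms r(πn, δ0) = √(2πn) and r(πn) = √(2πn) n/(n+1), together with a Monte Carlo evaluation of r(πn) obtained by integrating R(θ, δn) = (n² + θ²)/(n+1)² against gn. The sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

for n in (1, 10, 100, 1000):
    mass = np.sqrt(2 * np.pi * n)            # total mass of g_n(theta) = exp(-theta^2 / 2n)
    r_delta0 = mass                          # R(theta, delta_0) = 1 for every theta
    r_bayes = mass * n / (n + 1)             # closed form for r(pi_n)
    # Monte Carlo check of r(pi_n), integrating R(theta, delta_n) = (n^2 + theta^2)/(n+1)^2
    theta = rng.normal(0.0, np.sqrt(n), size=200_000)
    r_bayes_mc = mass * np.mean((n**2 + theta**2) / (n + 1)**2)
    print(n, r_delta0 - r_bayes, r_bayes, r_bayes_mc)
```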
  • 426. Bayesian StatisticsAdmissibility and Complete ClassesNecessary and sufficient admissibility conditionsStein’s necessary and sufficient conditionAssumptions(i) f(x|θ) is continuous in θ and strictly positive on Θ; and(ii) the loss L is strictly convex, continuous and, if E ⊂ Θ iscompact,limδ →+∞infθ∈EL(θ, δ) = +∞.
  • 427. Bayesian StatisticsAdmissibility and Complete ClassesNecessary and sufficient admissibility conditionsStein’s necessary and sufficient condition (cont.)Theorem (Stein’s n&s condition)δ is admissible iff there exist1. a sequence (Fn) of increasing compact sets such thatΘ =nFn,2. a sequence (πn) of finite measures with support Fn, and3. a sequence (δn) of Bayes estimators associated with πnsuch that
  • 428. Bayesian StatisticsAdmissibility and Complete ClassesNecessary and sufficient admissibility conditionsStein’s necessary and sufficient condition (cont.)Theorem (Stein’s n&s condition (cont.))(i) there exists a compact set E0 ⊂ Θ such that infn πn(E0) ≥ 1;(ii) if E ⊂ Θ is compact, supnπn(E) < +∞;(iii) limnr(πn, δ) − r(πn) = 0; and(iv) limnR(θ, δn) = R(θ, δ).
  • 429. Bayesian StatisticsAdmissibility and Complete ClassesComplete classesComplete classesDefinition (Complete class)A class C of estimators is complete if, for every δ′ ∈ C , thereexists δ ∈ C that dominates δ′. The class is essentially complete if,for every δ′ ∈ C , there exists δ ∈ C that is at least as good as δ′.
  • 430. Bayesian StatisticsAdmissibility and Complete ClassesComplete classesA special caseΘ = {θ1, θ2}, with risk setR = {r = (R(θ1, δ), R(θ2, δ)), δ ∈ D∗},bounded and closed from below
  • 431. Bayesian StatisticsAdmissibility and Complete ClassesComplete classesA special caseΘ = {θ1, θ2}, with risk setR = {r = (R(θ1, δ), R(θ2, δ)), δ ∈ D∗},bounded and closed from belowThen, the lower boundary, Γ(R), provides the admissible points ofR.
  • 432. Bayesian StatisticsAdmissibility and Complete ClassesComplete classes[Figure: risk set R for Θ = {θ1, θ2}, with the risk points of estimators δ1, . . . , δ9 and the lower boundary Γ]
  • 433. Bayesian StatisticsAdmissibility and Complete ClassesComplete classesA special case (cont.)ReasonFor every r ∈ Γ(R), there exists a tangent line to R going throughr, with positive slope and equationp1r1 + p2r2 = k
  • 434. Bayesian StatisticsAdmissibility and Complete ClassesComplete classesA special case (cont.)ReasonFor every r ∈ Γ(R), there exists a tangent line to R going throughr, with positive slope and equationp1r1 + p2r2 = kTherefore r is a Bayes estimator for π(θi) = pi (i = 1, 2)
  • 435. Bayesian StatisticsAdmissibility and Complete ClassesComplete classesWald’s theoremsTheoremIf Θ is finite and if R is bounded and closed from below, then theset of Bayes estimators constitutes a complete class
  • 436. Bayesian StatisticsAdmissibility and Complete ClassesComplete classesWald’s theoremsTheoremIf Θ is finite and if R is bounded and closed from below, then theset of Bayes estimators constitutes a complete classTheoremIf Θ is compact and the risk set R is convex, if all estimators havea continuous risk function, the Bayes estimators constitute acomplete class.
  • 437. Bayesian StatisticsAdmissibility and Complete ClassesComplete classesExtensionsIf Θ not compact, in many cases, complete classes are made ofgeneralised Bayes estimators
  • 438. Bayesian StatisticsAdmissibility and Complete ClassesComplete classesExtensionsIf Θ not compact, in many cases, complete classes are made ofgeneralised Bayes estimatorsExampleWhen estimating the natural parameter θ of an exponential familyx ∼ f(x|θ) = eθ·x−ψ(θ)h(x), x, θ ∈ Rk,under quadratic loss, every admissible estimator is a generalisedBayes estimator.
  • 439. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical and Empirical Bayes ExtensionsIntroductionDecision-Theoretic Foundations of Statistical InferenceFrom Prior Information to Prior DistributionsBayesian Point EstimationTests and model choiceAdmissibility and Complete ClassesHierarchical and Empirical Bayes Extensions, and the Stein Effect
  • 440. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe Bayesian analysis is sufficiently reductive to produce effectivedecisions, but this efficiency can also be misused.
  • 441. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe Bayesian analysis is sufficiently reductive to produce effectivedecisions, but this efficiency can also be misused.The prior information is rarely rich enough to define a priordistribution exactly.
  • 442. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe Bayesian analysis is sufficiently reductive to produce effectivedecisions, but this efficiency can also be misused.The prior information is rarely rich enough to define a priordistribution exactly.Uncertainty must be included within the Bayesian model:◮ Further prior modelling◮ Upper and lower probabilities [Dempster-Shafer]◮ Imprecise probabilities [Walley]
  • 443. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisHierarchical Bayes analysisDecomposition of the prior distribution into several conditionallevels of distributions
  • 444. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisHierarchical Bayes analysisDecomposition of the prior distribution into several conditionallevels of distributionsOften two levels: the first-level distribution is generally a conjugateprior, with parameters distributed from the second-level distribution
  • 445. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisHierarchical Bayes analysisDecomposition of the prior distribution into several conditionallevels of distributionsOften two levels: the first-level distribution is generally a conjugateprior, with parameters distributed from the second-level distributionReal life motivations (multiple experiments, meta-analysis, ...)
  • 446. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisHierarchical modelsDefinition (Hierarchical model)A hierarchical Bayes model is a Bayesian statistical model, (f(x|θ), π(θ)), where π(θ) = ∫_{Θ1×...×Θn} π1(θ|θ1)π2(θ1|θ2) · · · πn+1(θn) dθ1 · · · dθn.
  • 447. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisHierarchical modelsDefinition (Hierarchical model)A hierarchical Bayes model is a Bayesian statistical model, (f(x|θ), π(θ)), where π(θ) = ∫_{Θ1×...×Θn} π1(θ|θ1)π2(θ1|θ2) · · · πn+1(θn) dθ1 · · · dθn.The parameters θi are called hyperparameters of level i (1 ≤ i ≤ n).
  • 448. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisExample (Rats (1))Experiment where rats are intoxicated by a substance, then treatedby either a placebo or a drug:xij ∼ N(θi, σ2c ), 1 ≤ j ≤ Jci , controlyij ∼ N(θi + δi, σ2a), 1 ≤ j ≤ Jai , intoxicationzij ∼ N(θi + δi + ξi, σ2t ), 1 ≤ j ≤ Jti , treatment
  • 449. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisExample (Rats (1))Experiment where rats are intoxicated by a substance, then treatedby either a placebo or a drug:xij ∼ N(θi, σ2c ), 1 ≤ j ≤ Jci , controlyij ∼ N(θi + δi, σ2a), 1 ≤ j ≤ Jai , intoxicationzij ∼ N(θi + δi + ξi, σ2t ), 1 ≤ j ≤ Jti , treatmentAdditional variable wi, equal to 1 if the rat is treated with thedrug, and 0 otherwise.
  • 450. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisExample (Rats (2))Prior distributions (1 ≤ i ≤ I),θi ∼ N(µθ, σ2θ), δi ∼ N(µδ, σ2δ ),andξi ∼ N(µP , σ2P ) or ξi ∼ N(µD, σ2D),depending on whether the ith rat is treated with a placebo or adrug.
  • 451. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisExample (Rats (2))Prior distributions (1 ≤ i ≤ I),θi ∼ N(µθ, σ2θ), δi ∼ N(µδ, σ2δ ),andξi ∼ N(µP , σ2P ) or ξi ∼ N(µD, σ2D),depending on whether the ith rat is treated with a placebo or adrug.Hyperparameters of the model,µθ, µδ, µP , µD, σc, σa, σt, σθ, σδ, σP , σD ,associated with Jeffreys’ noninformative priors.
  • 452. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisJustifications1. Objective reasons based on prior information
  • 453. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisJustifications1. Objective reasons based on prior informationExample (Rats (3))Alternative priorδi ∼ pN(µδ1, σ2δ1) + (1 − p)N(µδ2, σ2δ2),allows for two possible levels of intoxication.
  • 454. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysis2. Separation of structural information from subjectiveinformation
  • 455. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysis2. Separation of structural information from subjectiveinformationExample (Uncertainties about generalized linear models)yi|xi ∼ exp{θi · yi − ψ(θi)} , ∇ψ(θi) = E[yi|xi] = h(xtiβ) ,where h is the link function
  • 456. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysis2. Separation of structural information from subjectiveinformationExample (Uncertainties about generalized linear models)yi|xi ∼ exp{θi · yi − ψ(θi)} , ∇ψ(θi) = E[yi|xi] = h(xtiβ) ,where h is the link functionThe linear constraint ∇ψ(θi) = h(xtiβ) can move to an higher levelof the hierarchy,θi ∼ exp {λ [θi · ξi − ψ(θi)]}with E[∇ψ(θi)] = h(xtiβ) andβ ∼ Nq(0, τ2Iq)
  • 457. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysis3. In noninformative settings, compromise between theJeffreys noninformative distributions, and the conjugatedistributions.
  • 458. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysis3. In noninformative settings, compromise between theJeffreys noninformative distributions, and the conjugatedistributions.4. Robustification of the usual Bayesian analysis from afrequentist point of view
  • 459. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysis3. In noninformative settings, compromise between theJeffreys noninformative distributions, and the conjugatedistributions.4. Robustification of the usual Bayesian analysis from afrequentist point of view5. Often simplifies Bayesian calculations
  • 460. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisConditional decompositionsEasy decomposition of the posterior distribution
  • 461. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisConditional decompositionsEasy decomposition of the posterior distributionFor instance, ifθ|θ1 ∼ π1(θ|θ1), θ1 ∼ π2(θ1),
  • 462. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisConditional decompositionsEasy decomposition of the posterior distributionFor instance, ifθ|θ1 ∼ π1(θ|θ1), θ1 ∼ π2(θ1),thenπ(θ|x) =Θ1π(θ|θ1, x)π(θ1|x) dθ1,
  • 463. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisConditional decompositions (cont.)whereπ(θ|θ1, x) =f(x|θ)π1(θ|θ1)m1(x|θ1),m1(x|θ1) =Θf(x|θ)π1(θ|θ1) dθ,π(θ1|x) =m1(x|θ1)π2(θ1)m(x),m(x) =Θ1m1(x|θ1)π2(θ1) dθ1.
  • 464. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisConditional decompositions (cont.)Moreover, this decomposition works for the posterior moments,that is, for every function h,Eπ[h(θ)|x] = Eπ(θ1|x)[Eπ1[h(θ)|θ1, x]] ,whereEπ1[h(θ)|θ1, x] =Θh(θ)π(θ|θ1, x) dθ.
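The nested-expectation identity can be verified on a toy two-level hierarchy where everything is conjugate, say x|θ ∼ N(θ, 1), θ|θ1 ∼ N(θ1, 1), θ1 ∼ N(0, 1) (a model chosen only for this sketch, not from the slides). Then θ1|x ∼ N(x/3, 2/3) and Eπ1[θ|θ1, x] = (x + θ1)/2, and averaging the inner mean over draws of θ1|x recovers the closed-form posterior mean 2x/3.

```python
import numpy as np

rng = np.random.default_rng(2)

x, M = 1.7, 200_000                      # arbitrary observation and Monte Carlo size

# outer step: theta1 | x ~ N(x/3, 2/3), since marginally x | theta1 ~ N(theta1, 2)
theta1 = rng.normal(x / 3, np.sqrt(2 / 3), size=M)

# inner step: E[theta | theta1, x] = (x + theta1) / 2 for this toy conjugate hierarchy
inner_mean = (x + theta1) / 2

print(inner_mean.mean(), 2 * x / 3)      # nested decomposition vs. closed-form E[theta | x]
```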
  • 465. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisExample (Posterior distribution of the complete parameter vector)Posterior distribution of the complete parameter vector π((θi, δi, ξi)i, µθ, . . . , σc, . . . | D) ∝ ∏_{i=1}^{I} exp{−[(θi − µθ)²/2σθ² + (δi − µδ)²/2σδ²]} ∏_{j=1}^{J_i^c} exp{−(xij − θi)²/2σc²} ∏_{j=1}^{J_i^a} exp{−(yij − θi − δi)²/2σa²} ∏_{j=1}^{J_i^t} exp{−(zij − θi − δi − ξi)²/2σt²} ∏_{ℓi=0} exp{−(ξi − µP)²/2σP²} ∏_{ℓi=1} exp{−(ξi − µD)²/2σD²}
  • 466. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisLocal conditioning propertyTheorem (Decomposition)For the hierarchical model π(θ) = ∫_{Θ1×...×Θn} π1(θ|θ1)π2(θ1|θ2) · · · πn+1(θn) dθ1 · · · dθn, we have π(θi | x, θ, θ1, . . . , θi−1, θi+1, . . . , θn) = π(θi | θi−1, θi+1) with the convention θ0 = θ and θn+1 = 0.
  • 467. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisComputational issuesRarely an explicit derivation of the corresponding Bayes estimatorsNatural solution in hierarchical settings: use a simulation-basedapproach exploiting the hierarchical conditional structure
  • 468. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisComputational issuesRarely an explicit derivation of the corresponding Bayes estimatorsNatural solution in hierarchical settings: use a simulation-basedapproach exploiting the hierarchical conditional structureExample (Rats (4))The full conditional distributions correspond to standarddistributions and Gibbs sampling applies.
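As an illustration of the kind of Gibbs sampler this suggests, here is a minimal sketch for a stripped-down one-way hierarchy (a stand-in for the rats model, not the model itself): xij ∼ N(θi, σ²), θi ∼ N(µ, τ²), with a flat prior on µ and with σ², τ² treated as known so that both full conditionals are normal. The dimensions, variances and seed are all arbitrary; the full rats model adds further levels that are sampled in exactly the same way.

```python
import numpy as np

rng = np.random.default_rng(3)

I, J, sigma2, tau2 = 8, 5, 1.0, 2.0                    # arbitrary sizes and variances
true_theta = rng.normal(0.0, np.sqrt(tau2), I)
x = rng.normal(true_theta[:, None], np.sqrt(sigma2), size=(I, J))

theta, mu = x.mean(axis=1), x.mean()
keep_mu = []
for t in range(5_000):
    # theta_i | x, mu: normal, combining the group mean and the common mean mu
    prec = J / sigma2 + 1.0 / tau2
    mean = (J * x.mean(axis=1) / sigma2 + mu / tau2) / prec
    theta = rng.normal(mean, np.sqrt(1.0 / prec))
    # mu | theta: normal with mean mean(theta) and variance tau2 / I (flat prior on mu)
    mu = rng.normal(theta.mean(), np.sqrt(tau2 / I))
    if t >= 1_000:                                     # burn-in
        keep_mu.append(mu)

print("posterior mean of mu ~", np.mean(keep_mu), " sample grand mean:", x.mean())
```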
  • 469. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysis[Figure: MCMC trace plots over 10,000 iterations] Convergence of the posterior means
  • 470. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysis[Figure: histograms of the posterior samples for the control, intoxication, placebo and drug effects] Posteriors of the effects
  • 471. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysis
                 µδ              µD            µP             µD − µP
    Probability  1.00            0.9998        0.94           0.985
    Confidence   [-3.48,-2.17]   [0.94,2.50]   [-0.17,1.24]   [0.14,2.20]
  Posterior probabilities of significant effects
  • 472. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisHierarchical extensions for the normal modelForx ∼ Np(θ, Σ) , θ ∼ Np(µ, Σπ)the hierarchical Bayes estimator isδπ(x) = Eπ2(µ,Σπ|x)[δ(x|µ, Σπ)],
  • 473. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisHierarchical extensions for the normal modelForx ∼ Np(θ, Σ) , θ ∼ Np(µ, Σπ)the hierarchical Bayes estimator isδπ(x) = Eπ2(µ,Σπ|x)[δ(x|µ, Σπ)],withδ(x|µ, Σπ) = x − ΣW(x − µ),W = (Σ + Σπ)−1,π2(µ, Σπ|x) ∝ (det W)1/2exp{−(x − µ)tW(x − µ)/2}π2(µ, Σπ).
  • 474. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisExample (Exchangeable normal)Consider the exchangeable hierarchical modelx|θ ∼ Np(θ, σ21Ip),θ|ξ ∼ Np(ξ1, σ2πIp),ξ ∼ N (ξ0, τ2),where 1 = (1, . . . , 1)t ∈ Rp. In this case,δ(x|ξ, σπ) = x −σ21σ21 + σ2π(x − ξ1),
  • 475. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisExample (Exchangeable normal (2))π2(ξ, σ2π|x) ∝ (σ21 + σ2π)−p/2exp{−x − ξ1 22(σ21 + σ2π)}e−(ξ−ξ0)2/2τ2π2(σ2π)∝π2(σ2π)(σ21 + σ2π)p/2exp −p(¯x − ξ)22(σ21 + σ2π)−s22(σ21 + σ2π)−(ξ − ξ0)22τ2with s2= i(xi − ¯x)2.
  • 476. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisExample (Exchangeable normal (2))π2(ξ, σ2π|x) ∝ (σ21 + σ2π)−p/2exp{−x − ξ1 22(σ21 + σ2π)}e−(ξ−ξ0)2/2τ2π2(σ2π)∝π2(σ2π)(σ21 + σ2π)p/2exp −p(¯x − ξ)22(σ21 + σ2π)−s22(σ21 + σ2π)−(ξ − ξ0)22τ2with s2= i(xi − ¯x)2. Thenδπ(x) = Eπ2(σ2π|x)x −σ21σ21 + σ2π(x − ¯x1) −σ21 + σ2πσ21 + σ2π + pτ2(¯x − ξ0)1andπ2(σ2π|x) ∝τ exp − 12s2σ21 + σ2π+p(¯x − ξ0)2pτ2 + σ21 + σ2π(σ21 + σ2π)(p−1)/2(σ21 + σ2π + pτ2)1/2π2(σ2π).
  • 477. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisExample (Exchangeable normal (3))Notice the particular form of the hierarchical Bayes estimatorδπ(x) = x − Eπ2(σ2π|x) σ21σ21 + σ2π(x − ¯x1)−Eπ2(σ2π|x) σ21 + σ2πσ21 + σ2π + pτ2(¯x − ξ0)1.[Double shrinkage]
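The double-shrinkage estimator can also be evaluated numerically without the closed-form π2(σπ²|x): average the conditional posterior mean x − B(x − ξ1), B = σ1²/(σ1² + σπ²), against the joint posterior weights of (ξ, σπ²), which are proportional to the Np(ξ1, (σ1² + σπ²)Ip) marginal of x times the priors. The sketch below does this on a grid, with a flat prior on σπ² over a bounded range; the data, grid limits and hyperparameter values are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(4)

p, s1, xi0, tau2 = 5, 1.0, 0.0, 4.0              # arbitrary hyperparameters
x = rng.normal(2.0, 1.0, p)                      # arbitrary data

xi_grid  = np.linspace(-5.0, 8.0, 400)
spi_grid = np.linspace(0.01, 20.0, 400)          # bounded flat "prior" on sigma_pi^2
XI, SPI = np.meshgrid(xi_grid, spi_grid, indexing="ij")

# posterior weight of (xi, sigma_pi^2): N_p(x; xi*1, (s1+sigma_pi^2) I) * N(xi; xi0, tau2)
V = s1 + SPI
log_w = (-0.5 * p * np.log(V)
         - 0.5 * ((x[:, None, None] - XI) ** 2).sum(axis=0) / V
         - 0.5 * (XI - xi0) ** 2 / tau2)
w = np.exp(log_w - log_w.max())
w /= w.sum()

# average the conditional posterior mean x - B (x - xi 1) over the grid weights
B = s1 / (s1 + SPI)
delta = np.array([(w * (xc - B * (xc - XI))).sum() for xc in x])
print(np.round(x, 3))
print(np.round(delta, 3))   # shrunk toward the common mean and toward xi0
```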
  • 478. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisThe Stein effectIf a minimax estimator is unique, it is admissible
  • 479. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisThe Stein effectIf a minimax estimator is unique, it is admissibleConverseIf a constant risk minimax estimator is inadmissible, every otherminimax estimator has a uniformly smaller risk (!)
  • 480. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisThe Stein ParadoxIf a standard estimator δ∗(x) = (δ0(x1), . . . , δ0(xp)) is evaluatedunder weighted quadratic losspi=1ωi(δi − θi)2,with ωi > 0 (i = 1, . . . , p), there exists p0 such that δ∗ is notadmissible for p ≥ p0,
  • 481. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisThe Stein ParadoxIf a standard estimator δ∗(x) = (δ0(x1), . . . , δ0(xp)) is evaluatedunder weighted quadratic losspi=1ωi(δi − θi)2,with ωi > 0 (i = 1, . . . , p), there exists p0 such that δ∗ is notadmissible for p ≥ p0, although the components δ0(xi) areseparately admissible to estimate the θi’s.
  • 482. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisJames–Stein estimatorIn the normal case, δJS(x) = (1 − (p − 2)/||x||²) x dominates δ0(x) = x under quadratic loss for p ≥ 3, that is, p = Eθ[||δ0(x) − θ||²] > Eθ[||δJS(x) − θ||²].
  • 483. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisJames–Stein estimatorIn the normal case, δJS(x) = (1 − (p − 2)/||x||²) x dominates δ0(x) = x under quadratic loss for p ≥ 3, that is, p = Eθ[||δ0(x) − θ||²] > Eθ[||δJS(x) − θ||²].And δ+c(x) = (1 − c/||x||²)+ x = (1 − c/||x||²)x if ||x||² > c, 0 otherwise, improves on δ0 when 0 < c < 2(p − 2)
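The stated dominations are easy to reproduce by Monte Carlo: simulate x ∼ Np(θ, Ip) many times and compare the empirical quadratic risks of δ0, δJS and the positive-part version (here with c = p − 2; the choices p = 10, θ = (1, . . . , 1) and the seed are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(5)

p, M = 10, 50_000
theta = np.full(p, 1.0)                                  # arbitrary true mean
x = rng.normal(theta, 1.0, size=(M, p))

sq = (x ** 2).sum(axis=1, keepdims=True)
delta_js  = (1 - (p - 2) / sq) * x                       # James-Stein
delta_pos = np.clip(1 - (p - 2) / sq, 0.0, None) * x     # positive-part, c = p - 2

def risk(d):
    # empirical quadratic risk E||d - theta||^2
    return ((d - theta) ** 2).sum(axis=1).mean()

print(risk(x), risk(delta_js), risk(delta_pos))          # expect p > JS >= positive-part
```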
  • 484. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisUniversality◮ Other distributions than the normal distribution
  • 485. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisUniversality◮ Other distributions than the normal distribution◮ Other losses other than the quadratic loss
  • 486. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisUniversality◮ Other distributions than the normal distribution◮ Other losses other than the quadratic loss◮ Connections with admissibility
  • 487. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisUniversality◮ Other distributions than the normal distribution◮ Other losses other than the quadratic loss◮ Connections with admissibility◮ George’s multiple shrinkage
  • 488. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisUniversality◮ Other distributions than the normal distribution◮ Other losses other than the quadratic loss◮ Connections with admissibility◮ George’s multiple shrinkage◮ Robustess against distribution
  • 489. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisUniversality◮ Other distributions than the normal distribution◮ Other losses other than the quadratic loss◮ Connections with admissibility◮ George’s multiple shrinkage◮ Robustess against distribution◮ Applies for confidence regions
  • 490. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisUniversality◮ Other distributions than the normal distribution◮ Other losses other than the quadratic loss◮ Connections with admissibility◮ George’s multiple shrinkage◮ Robustess against distribution◮ Applies for confidence regions◮ Applies for accuracy (or loss) estimation
  • 491. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisUniversality◮ Other distributions than the normal distribution◮ Other losses other than the quadratic loss◮ Connections with admissibility◮ George’s multiple shrinkage◮ Robustess against distribution◮ Applies for confidence regions◮ Applies for accuracy (or loss) estimation◮ Cannot occur in finite parameter spaces
  • 492. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisA general Stein-type domination resultConsider z = (xt, yt)t ∈ Rp, with distributionz ∼ f(||x − θ||2+ ||y||2),and x ∈ Rq, y ∈ Rp−q.
  • 493. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisA general Stein-type domination result (cont.)Theorem (Stein domination of δ0)δh(z) = (1 − h(||x||2, ||y||2))xdominates δ0 under quadratic loss if there exist α, β > 0 suchthat:(1) tαh(t, u) is a nondecreasing function of t for every u;(2) u−βh(t, u) is a nonincreasing function of u for every t; and(3) 0 ≤ (t/u)h(t, u) ≤2(q − 2)αp − q − 2 + 4β.
  • 494. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisOptimality of hierarchical Bayes estimatorsConsiderx ∼ Np(θ, Σ)where Σ is known.Prior distribution on θ is θ ∼ Np(µ, Σπ).
  • 495. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisOptimality of hierarchical Bayes estimatorsConsiderx ∼ Np(θ, Σ)where Σ is known.Prior distribution on θ is θ ∼ Np(µ, Σπ).The prior distribution π2 of the hyperparameters(µ, Σπ)is decomposed asπ2(µ, Σπ) = π12(Σπ|µ)π22(µ).
  • 496. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisOptimality of hierarchical Bayes estimatorsIn this case,m(x) =Rpm(x|µ)π22(µ) dµ,withm(x|µ) = f(x|θ)π1(θ|µ, Σπ)π12(Σπ|µ) dθ dΣπ.
  • 497. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisOptimality of hierarchical Bayes estimatorsMoreover, the Bayes estimatorδπ(x) = x + Σ∇ log m(x)can be writtenδπ(x) = δ(x|µ)π22(µ|x) dµ,withδ(x|µ) = x + Σ∇ log m(x|µ),π22(µ|x) =m(x|µ)π22(µ)m(x).
  • 498. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisA sufficient conditionAn estimator δ is minimax under the lossLQ(θ, δ) = (θ − δ)tQ(θ − δ).if it satisfiesR(θ, δ) = Eθ[LQ(θ, δ(x))] ≤ tr(ΣQ)
  • 499. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisA sufficient condition (contd.)Theorem (Minimaxity)If m(x) satisfies the three conditions(1) Eθ[||∇ log m(x)||²] < +∞; (2) Eθ[(∂²m(x)/∂xi∂xj)/m(x)] < +∞; and, for 1 ≤ i ≤ p,(3) lim_{|xi|→+∞} ∇ log m(x) exp{−(1/2)(x − θ)ᵗΣ⁻¹(x − θ)} = 0,
  • 500. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisThe unbiased estimator of the risk of δπ is given byDδπ(x) = tr(QΣ)+2m(x)tr(Hm(x) ˜Q) − (∇ log m(x))t ˜Q(∇ log m(x))where˜Q = ΣQΣ, Hm(x) =∂2m(x)∂xi∂xjand...
  • 501. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisδπ is minimax if div(˜Q ∇√m(x)) ≤ 0,
  • 502. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisδπ is minimax if div(˜Q ∇√m(x)) ≤ 0,When Σ = Q = Ip, this condition is ∆√m(x) = Σ_{i=1}^{n} ∂²(√m(x))/∂xi² ≤ 0 [√m(x) superharmonic]
  • 503. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisSuperharmonicity conditionTheorem (Superharmonicity)δπ is minimax ifdiv ˜Q∇m(x|µ) ≤ 0.
  • 504. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectHierarchical Bayes analysisSuperharmonicity conditionTheorem (Superharmonicity)δπ is minimax ifdiv ˜Q∇m(x|µ) ≤ 0.N&S condition that does not depend on π22(µ)!
  • 505. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeEmpirical Bayes alternativeStrictly speaking, not a Bayesian method !
  • 506. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeEmpirical Bayes alternativeStrictly speaking, not a Bayesian method !(i) can be perceived as a dual method of the hierarchical Bayesanalysis;(ii) asymptotically equivalent to the Bayesian approach;(iii) usually classified as Bayesian by others; and(iv) may be acceptable in problems for which a genuine Bayesmodeling is too complicated/costly.
  • 507. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeParametric empirical BayesWhen hyperparameters from a conjugate prior π(θ|λ) areunavailable, estimate these hyperparameters from the marginaldistributionm(x|λ) =Θf(x|θ)π(θ|λ) dθby ˆλ(x) and to use π(θ|ˆλ(x), x) as a pseudo-posterior
  • 508. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeFundamental ad-hocqueryWhich estimate ˆλ(x) for λ ?Moment method or maximum likelihood or Bayes or &tc...
  • 509. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeExample (Poisson estimation)Consider xi distributed according to P(θi) (i = 1, . . . , n). When π(θ|λ) is Exp(λ), m(xi|λ) = ∫_0^{+∞} [e^{−θ} θ^{xi}/xi!] λ e^{−θλ} dθ = λ/(λ + 1)^{xi+1} = [1/(λ + 1)]^{xi} [λ/(λ + 1)], i.e. xi|λ ∼ Geo(λ/(λ + 1)).
  • 510. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeExample (Poisson estimation)Consider xi distributed according to P(θi) (i = 1, . . . , n). When π(θ|λ) is Exp(λ), m(xi|λ) = ∫_0^{+∞} [e^{−θ} θ^{xi}/xi!] λ e^{−θλ} dθ = λ/(λ + 1)^{xi+1} = [1/(λ + 1)]^{xi} [λ/(λ + 1)], i.e. xi|λ ∼ Geo(λ/(λ + 1)). Then λ̂(x) = 1/x̄ and the empirical Bayes estimator of θn+1 is δEB(xn+1) = (xn+1 + 1)/(λ̂ + 1) = [x̄/(x̄ + 1)](xn+1 + 1),
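A minimal sketch of this empirical Bayes estimate on simulated data; the value of λ, the sample size and the seed are arbitrary, and the oracle Bayes estimate that uses the true λ is printed alongside for comparison.

```python
import numpy as np

rng = np.random.default_rng(6)

lam_true, n = 0.7, 50                                # arbitrary rate and sample size
theta = rng.exponential(1.0 / lam_true, n + 1)       # theta_i ~ Exp(lam_true)
x = rng.poisson(theta)                               # x_i ~ Poisson(theta_i)

xbar = x[:n].mean()                                  # lambda_hat = 1 / xbar
delta_eb    = xbar / (xbar + 1.0) * (x[n] + 1)       # empirical Bayes estimate of theta_{n+1}
delta_bayes = (x[n] + 1) / (lam_true + 1.0)          # oracle Bayes estimate using the true lambda
print(x[n], delta_eb, delta_bayes)
```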
  • 511. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeEmpirical Bayes justifications of the Stein effectA way to unify the different occurrences of this paradox and showits Bayesian roots
  • 512. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativea. Point estimationExample (Normal mean)Consider x ∼ Np(θ, Ip) and θi ∼ N(0, τ²). The marginal distribution of x is then x|τ² ∼ Np(0, (1 + τ²)Ip) and the maximum likelihood estimator of τ² is τ̂² = (||x||²/p) − 1 if ||x||² > p, 0 otherwise. The corresponding empirical Bayes estimator of θi is then δEB(x) = τ̂²x/(1 + τ̂²) = (1 − p/||x||²)+ x.
  • 513. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeNormal modelTakex|θ ∼ Np(θ, Λ),θ|β, σ2π ∼ Np(Zβ, σ2πIp),with Λ = diag(λ1, . . . , λp) and Z a (p × q) full rank matrix.
  • 514. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeNormal modelTakex|θ ∼ Np(θ, Λ),θ|β, σ2π ∼ Np(Zβ, σ2πIp),with Λ = diag(λ1, . . . , λp) and Z a (p × q) full rank matrix.The marginal distribution of x isxi|β, σ2π ∼ N (z′iβ, σ2π + λi)and the posterior distribution of θ isθi|xi, β, σ2π ∼ N (1 − bi)xi + biz′iβ, λi(1 − bi) ,with bi = λi/(λi + σ2π).
  • 515. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeNormal model (cont.)If λ1 = . . . = λn = σ², the best equivariant estimators of β and b are given by β̂ = (ZᵗZ)⁻¹Zᵗx and b̂ = (p − q − 2)σ²/s², with s² = Σ_{i=1}^{p} (xi − zi′β̂)².
  • 516. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeNormal model (cont.)If λ1 = . . . = λn = σ², the best equivariant estimators of β and b are given by β̂ = (ZᵗZ)⁻¹Zᵗx and b̂ = (p − q − 2)σ²/s², with s² = Σ_{i=1}^{p} (xi − zi′β̂)².The corresponding empirical Bayes estimator of θ is δEB(x) = Zβ̂ + (1 − (p − q − 2)σ²/||x − Zβ̂||²)(x − Zβ̂), which is of the form of the general Stein estimator
  • 517. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeNormal model (cont.)When the means are assumed to be identical (exchangeability), thematrix Z reduces to the vector 1 and β ∈ R
  • 518. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeNormal model (cont.)When the means are assumed to be identical (exchangeability), the matrix Z reduces to the vector 1 and β ∈ R. The empirical Bayes estimator is then δEB(x) = x̄1 + (1 − (p − 3)σ²/||x − x̄1||²)(x − x̄1).
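Written as a function, the exchangeable-case estimator is a one-liner; note that, as displayed above, it carries no positive-part truncation of the shrinkage factor. The data vector and the value of σ² below are arbitrary.

```python
import numpy as np

def delta_eb_exchangeable(x, sigma2=1.0):
    # empirical Bayes estimator shrinking toward the common mean (requires p > 3)
    x = np.asarray(x, dtype=float)
    p, xbar = x.size, x.mean()
    shrink = 1.0 - (p - 3) * sigma2 / ((x - xbar) ** 2).sum()
    return xbar + shrink * (x - xbar)

x = np.array([2.1, 1.4, 3.0, 2.6, 1.9, 2.2, 0.8, 2.4])   # arbitrary data
print(delta_eb_exchangeable(x))
```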
  • 519. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeb. Variance evaluationEstimation of the hyperparameters β and σ2π considerably modifiesthe behavior of the procedures.
  • 520. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeb. Variance evaluationEstimation of the hyperparameters β and σ2π considerably modifiesthe behavior of the procedures.Point estimation generally efficient, but estimation of the posteriorvariance of π(θ|x, β, b) by the empirical variance,var(θi|x, ˆβ,ˆb)induces an underestimation of this variance
  • 521. Bayesian StatisticsHierarchical and Empirical Bayes Extensions, and the Stein EffectThe empirical Bayes alternativeMorris’ correctionδEB(x) = x − ˜B(x − ¯x1),V EBi (x) = σ2−p − 1p˜B +2p − 3ˆb(xi − ¯x)2,withˆb =p − 3p − 1σ2σ2 + ˆσ2π, ˆσ2π = max 0,||x − ¯x1||2p − 1− σ2πand˜B =p − 3p − 1min 1,σ2(p − 1)||x − ¯x1||2.
  • 522. Bayesian StatisticsA Defense of the Bayesian ChoiceUnlimited range of applications◮ artificial intelligence◮ biostatistic◮ econometrics◮ epidemiology◮ environmetrics◮ finance
  • 523. Bayesian StatisticsA Defense of the Bayesian ChoiceUnlimited range of applications (cont’d)◮ genomics◮ geostatistics◮ image processing and pattern recognition◮ neural networks◮ signal processing◮ Bayesian networks
  • 524. Bayesian StatisticsA Defense of the Bayesian Choice1. Choosing a probabilistic representationBayesian Statistics appears as the calculus of uncertaintyReminder:A probabilistic model is nothing but an interpretation of a givenphenomenon
  • 525. Bayesian StatisticsA Defense of the Bayesian Choice2. Conditioning on the dataAt the basis of statistical inference lies an inversion processbetween cause and effect. Using a prior distribution brings anecessary balance between observations and parameters and enableto operate conditional upon x
  • 526. Bayesian StatisticsA Defense of the Bayesian Choice3. Exhibiting the true likelihoodProvides a complete quantitative inference on the parameters andpredictive that points out inadequacies of frequentist statistics,while implementing the Likelihood Principle.
  • 527. Bayesian StatisticsA Defense of the Bayesian Choice4. Using priors as tools and summariesThe choice of a prior distribution π does not require any kind of belief in this distribution: rather, consider it as a tool that summarizes the available prior information and the uncertainty surrounding this information
  • 528. Bayesian StatisticsA Defense of the Bayesian Choice5. Accepting the subjective basis of knowledgeKnowledge is a critical confrontation between a prioris andexperiments. Ignoring these a prioris impoverishes analysis.
  • 529. Bayesian StatisticsA Defense of the Bayesian Choice5. Accepting the subjective basis of knowledgeKnowledge is a critical confrontation between a prioris andexperiments. Ignoring these a prioris impoverishes analysis.We have, for one thing, to use a language and ourlanguage is entirely made of preconceived ideas and hasto be so. However, these are unconscious preconceivedideas, which are a million times more dangerous than theother ones. Were we to assert that if we are includingother preconceived ideas, consciously stated, we wouldaggravate the evil! I do not believe so: I rather maintainthat they would balance one another.Henri Poincar´e, 1902
  • 530. Bayesian StatisticsA Defense of the Bayesian Choice6. Choosing a coherent system of inferenceTo force inference into a decision-theoretic mold allows for aclarification of the way inferential tools should be evaluated, andtherefore implies a conscious (although subjective) choice of theretained optimality.
  • 531. Bayesian StatisticsA Defense of the Bayesian Choice6. Choosing a coherent system of inferenceTo force inference into a decision-theoretic mold allows for aclarification of the way inferential tools should be evaluated, andtherefore implies a conscious (although subjective) choice of theretained optimality.Logical inference process Start with requested properties, i.e.loss function and prior distribution, then derive the best solutionsatisfying these properties.
  • 532. Bayesian StatisticsA Defense of the Bayesian Choice7. Looking for optimal frequentist proceduresBayesian inference widely intersects with the three notions ofminimaxity, admissibility and equivariance. Looking for an optimalestimator most often ends up finding a Bayes estimator.Optimality is easier to attain through the Bayes “filter”
  • 533. Bayesian StatisticsA Defense of the Bayesian Choice8. Solving the actual problemFrequentist methods justified on a long-term basis, i.e., from thestatistician viewpoint. From a decision-maker’s point of view, onlythe problem at hand matters! That is, he/she calls for an inferenceconditional on x.
  • 534. Bayesian StatisticsA Defense of the Bayesian Choice9. Providing a universal system of inferenceGiven the three factors(X , f(x|θ), (Θ, π(θ)), (D, L(θ, d)) ,the Bayesian approach validates one and only one inferentialprocedure
  • 535. Bayesian StatisticsA Defense of the Bayesian Choice10. Computing procedures as a minimization problemBayesian procedures are easier to compute than procedures ofalternative theories, in the sense that there exists a universalmethod for the computation of Bayes estimatorsIn practice, the effective calculation of the Bayes estimators isoften more delicate but this defect is of another magnitude.