
- 1. The 21st Bayesian Century “The 21st Century belongs to Bayes” as argued by a discussion on Bayesian testing and Bayesian model choice Christian P. Robert Université Paris Dauphine and CREST-INSEE http://www.ceremade.dauphine.fr/~xian http://xianblog.wordpress.com July 1, 2009
- 2. The 21st Bayesian Century A consequence of Bayesian statistics being given a proper name is that it encourages too much historical deference from people who think that the bibles of Jeﬀreys, de Finetti, Jaynes, and others have all the answers. —Gelman, Bayesian Analysis 3(3), 2008
- 3. The 21st Bayesian Century Outline Anyone not shocked by the Bayesian theory of inference has not understood it Senn, BA., 2008 Introduction Tests and model choice Bayesian Calculations A Defense of the Bayesian Choice
- 4. The 21st Bayesian Century Introduction Vocabulary and concepts Bayesian inference is a coherent mathematical theory but I don’t trust it in scientiﬁc applications. Gelman, BA, 2008 Introduction Models The Bayesian framework Improper prior distributions Noninformative prior distributions Tests and model choice Bayesian Calculations A Defense of the Bayesian Choice
- 5. The 21st Bayesian Century Introduction Models Parametric model Bayesians promote the idea that a multiplicity of parameters can be handled via hierarchical, typically exchangeable, models, but it seems implausible that this could really work automatically [instead of] giving reasonable answers using minimal assumptions. Gelman, BA, 2008 Observations $x_1,\ldots,x_n$ generated from a probability distribution
  $$f_i(x_i|\theta_i, x_1,\ldots,x_{i-1}) = f_i(x_i|\theta_i, x_{1:i-1})$$
  $$x = (x_1,\ldots,x_n) \sim f(x|\theta), \qquad \theta = (\theta_1,\ldots,\theta_n)$$
  Associated likelihood $\ell(\theta|x) = f(x|\theta)$ [inverted density & starting point]
- 6. The 21st Bayesian Century Introduction Models And [B] nonparametrics?! Equally very active and deﬁnitely very 21st, thank you, but not mentioned in this talk! 7th Workshop on Bayesian Nonparametrics - Collegio... http://bnpworkshop.carloalberto.org/ 21 - 25 June 2009, Moncalieri The 7th Workshop on Bayesian Nonparametrics will be held at the Collegio Carlo Alberto from June 21 to 25, 2009. The Collegio is a Research Institution housed in an historical building located in Moncalieri on the outskirts of Turin, Italy. The meeting will feature the latest developments in the area and will cover a wide variety of both theoretical and applied topics such as: foundations of the Bayesian nonparametric approach, construction and properties of prior distributions, asymptotics, interplay with probability theory and stochastic processes, statistical modelling, computational algorithms and applications in machine learning, biostatistics, bioinformatics, economics and econometrics. The Workshop will be structured in 4 tutorials on special topics, a series of invited talks and contributed posters sessions. News Tentative Workshop Schedule Abstract Book (last updated 27th May 2009) Workshop Poster
- 7. The 21st Bayesian Century Introduction The Bayesian framework Bayes theorem 101 Bayes theorem = Inversion of probabilities. If $A$ and $E$ are events such that $P(E) \neq 0$, $P(A|E)$ and $P(E|A)$ are related by
  $$P(A|E) = \frac{P(E|A)P(A)}{P(E|A)P(A) + P(E|A^c)P(A^c)} = \frac{P(E|A)P(A)}{P(E)}$$
  [Thomas Bayes (?)]
- 8. The 21st Bayesian Century Introduction The Bayesian framework Bayesian approach The impact of treating x as a fixed constant is to increase statistical power as an artefact Templeton, Molec. Ecol., 2009 New perspective
  - Uncertainty on the parameters $\theta$ of a model modeled through a probability distribution $\pi$ on $\Theta$, called prior distribution
  - Inference based on the distribution of $\theta$ conditional on $x$, $\pi(\theta|x)$, called posterior distribution
  $$\pi(\theta|x) = \frac{f(x|\theta)\pi(\theta)}{\int f(x|\theta)\pi(\theta)\,d\theta}.$$
- 9. The 21st Bayesian Century Introduction The Bayesian framework [Nonphilosophical] justiﬁcations Ignoring the sampling error of x undermines the statistical validity of all inferences made by the method Templeton, Molec. Ecol., 2009 ◮ Semantic drift from unknown to random ◮ Actualization of the information on θ by extracting the information on θ contained in the observation x ◮ Allows incorporation of imperfect information in the decision process ◮ Unique mathematical way to condition upon the observations (conditional perspective) ◮ Unique way to give meaning to statements like P(θ > 0)
- 10. The 21st Bayesian Century Introduction The Bayesian framework Posterior distribution Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience Gelman, BA, 2008 π(θ|x) central to Bayesian inference ◮ Operates conditional upon the observations ◮ Incorporates the requirement of the Likelihood Principle ◮ Avoids averaging over the unobserved values of x ◮ Coherent updating of the information available on θ ◮ Provides a complete inferential machinery
- 11. The 21st Bayesian Century Introduction Improper prior distributions Improper distributions If we take P (dσ) ∝ dσ as a statement that σ may have any value between 0 and ∞ (...), we must use ∞ instead of 1 to denote certainty. Jeffreys, ToP, 1939 Necessary extension from a prior distribution to a prior σ-finite measure $\pi$ such that
  $$\int_\Theta \pi(\theta)\,d\theta = +\infty$$
  Improper prior distribution [Weird? Inappropriate?? report!! ]
- 12. The 21st Bayesian Century Introduction Improper prior distributions Justiﬁcations If the parameter may have any value from −∞ to +∞, its prior probability should be taken as uniformly distributed Jeﬀreys, ToP, 1939 Automated prior determination often leads to improper priors 1. Similar performances of estimators derived from these generalized distributions 2. Improper priors as limits of proper distributions in many [mathematical] senses
- 13. The 21st Bayesian Century Introduction Improper prior distributions More justifications There is no good objective principle for choosing a noninformative prior (even if that concept were mathematically defined, which it is not) Gelman, BA, 2008 4. Robust answer against possible misspecifications of the prior 5. Frequentist justifications, such as: (i) minimaxity (ii) admissibility (iii) invariance (Haar measure) 6. Improper priors [much] preferred to vague proper priors like $N(0, 10^6)$
- 14. The 21st Bayesian Century Introduction Improper prior distributions Validation The mistake is to think of them as representing ignorance Lindley, JASA, 1990 Extension of the posterior distribution $\pi(\theta|x)$ associated with an improper prior $\pi$ as given by Bayes’s formula
  $$\pi(\theta|x) = \frac{f(x|\theta)\pi(\theta)}{\int_\Theta f(x|\theta)\pi(\theta)\,d\theta},$$
  when
  $$\int_\Theta f(x|\theta)\pi(\theta)\,d\theta < \infty$$
  Delete all emotional names
- 15. The 21st Bayesian Century Introduction Noninformative prior distributions Noninformative priors ...cannot be expected to represent exactly total ignorance about the problem, but should rather be taken as reference priors, upon which everyone could fall back when the prior information is missing. Kass and Wasserman, JASA, 1996 What if all we know is that we know “nothing” ?! In the absence of prior information, prior distributions solely derived from the sample distribution f (x|θ) Diﬃculty with uniform priors, lacking invariance properties.
- 16. The 21st Bayesian Century Introduction Noninformative prior distributions Jeffreys’ prior If we took the prior density for the parameters to be proportional to $|I(\theta)|^{1/2}$, it could be stated for any law that is differentiable with respect to all parameters that the total probability in any region of the $\theta_i$ would be equal to the total probability in the corresponding region of the $\theta_i'$ Jeffreys, ToP, 1939 Based on Fisher information (with $\ell$ here the log-likelihood)
  $$I(\theta) = E_\theta\left[\frac{\partial \ell}{\partial\theta^T}\,\frac{\partial \ell}{\partial\theta}\right]$$
  Jeffreys’ prior distribution is
  $$\pi^*(\theta) \propto |I(\theta)|^{1/2}$$
- 17. The 21st Bayesian Century Tests and model choice Tests and model choice The Jeﬀreys-subjective synthesis betrays a much more dangerous confusion than the Neyman-Pearson-Fisher synthesis as regards hypothesis tests Senn, BA, 2008 Introduction Tests and model choice Bayesian tests Bayes factors Opposition to classical tests Model choice Compatible priors Variable selection
- 18. The 21st Bayesian Century Tests and model choice Bayesian tests Construction of Bayes tests What is almost never used, however, is the Jeffreys significance test. Senn, BA, 2008 Definition (Test) Given an hypothesis $H_0: \theta \in \Theta_0$ on the parameter $\theta \in \Theta$ of a statistical model, a test is a statistical procedure that takes its values in $\{0, 1\}$. Example (Normal mean) For $x \sim N(\theta, 1)$, decide whether or not $\theta \le 0$.
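- 19. The 21st Bayesian Century Tests and model choice Bayesian tests Decision-theoretic perspective Loss functions [are] not relevant to statistical inference Gelman, BA, 2008 Theorem (Optimal Bayes decision) Under the 0−1 loss function
  $$L(\theta, d) = \begin{cases} 0 & \text{if } d = I_{\Theta_0}(\theta) \\ a_0 & \text{if } d = 1 \text{ and } \theta \notin \Theta_0 \\ a_1 & \text{if } d = 0 \text{ and } \theta \in \Theta_0 \end{cases}$$
  the Bayes procedure is
  $$\delta^\pi(x) = \begin{cases} 1 & \text{if } \Pr^\pi(\theta \in \Theta_0|x) \ge a_0/(a_0 + a_1) \\ 0 & \text{otherwise} \end{cases}$$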
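The decision rule of this theorem can be sketched in a few lines of Python. This is only an illustration: the function name `bayes_test` is hypothetical, and the posterior probability of the null is assumed to have been computed elsewhere.

```python
def bayes_test(post_prob_h0: float, a0: float, a1: float) -> int:
    """Return 1 (accept H0) or 0 (reject H0) under the a0/a1 weighted loss.

    post_prob_h0 is P(theta in Theta0 | x); a0 is the cost of wrongly
    accepting H0 and a1 the cost of wrongly rejecting it.
    """
    if a0 <= 0 or a1 <= 0:
        raise ValueError("losses a0 and a1 must be positive")
    if not 0.0 <= post_prob_h0 <= 1.0:
        raise ValueError("post_prob_h0 must be a probability")
    # Accept H0 exactly when its posterior probability reaches a0 / (a0 + a1).
    threshold = a0 / (a0 + a1)
    return 1 if post_prob_h0 >= threshold else 0
```

With equal losses the threshold is 1/2; raising `a0` makes wrongly accepting the null costlier and so demands more posterior mass on it.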
- 20. The 21st Bayesian Century Tests and model choice Bayes factors A function of posterior probabilities The method posits two or more alternative hypotheses and tests their relative fits to some observed statistics Templeton, Mol. Ecol., 2009 Definition (Bayes factors) For hypotheses $H_0: \theta \in \Theta_0$ vs. $H_a: \theta \notin \Theta_0$
  $$B_{01} = \frac{\pi(\Theta_0|x)}{\pi(\Theta_0^c|x)} \bigg/ \frac{\pi(\Theta_0)}{\pi(\Theta_0^c)} = \frac{\int_{\Theta_0} f(x|\theta)\pi_0(\theta)\,d\theta}{\int_{\Theta_0^c} f(x|\theta)\pi_1(\theta)\,d\theta}$$
  [Good, 1958 & Jeffreys, 1961]
- 21. The 21st Bayesian Century Tests and model choice Bayes factors Self-contained concept Having a high relative probability does not mean that a hypothesis is true or supported by the data Templeton, Mol. Ecol., 2009 Non-decision-theoretic:
  - eliminates choice of $\pi(\Theta_0)$
  - Bayesian/marginal equivalent to the likelihood ratio
  - Jeffreys’ scale of evidence:
    - if $\log_{10}(B_{10}^\pi)$ between 0 and 0.5, evidence against $H_0$ weak,
    - if $\log_{10}(B_{10}^\pi)$ between 0.5 and 1, evidence substantial,
    - if $\log_{10}(B_{10}^\pi)$ between 1 and 2, evidence strong and
    - if $\log_{10}(B_{10}^\pi)$ above 2, evidence decisive
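As a small illustration, Jeffreys’ scale can be written as a lookup on $\log_{10}(B_{10}^\pi)$. The function name is hypothetical, and two choices below are assumptions not fixed by the slide: which category each boundary value (0.5, 1, 2) falls into, and the label returned for negative values.

```python
def jeffreys_scale(log10_b10: float) -> str:
    """Map log10 of the Bayes factor B10 to Jeffreys' verbal scale of evidence against H0."""
    if log10_b10 < 0:
        # Assumption: a negative value means the data favour H0, so no evidence against it.
        return "none"
    if log10_b10 <= 0.5:
        return "weak"
    if log10_b10 <= 1:
        return "substantial"
    if log10_b10 <= 2:
        return "strong"
    return "decisive"
```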
- 22. The 21st Bayesian Century Tests and model choice Bayes factors A major modiﬁcation Considering whether a location parameter α is 0. The prior is uniform and we should have to take f (α) = 0 and B10 would always be inﬁnite Jeﬀreys, ToP, 1939 When the null hypothesis is supported by a set of measure 0, π(Θ0 ) = 0 and thus π(Θ0 |x) = 0. [End of the story?!]
- 23. The 21st Bayesian Century Tests and model choice Bayes factors Changing the prior to fit the hypotheses Requirement Defined prior distributions under both assumptions,
  $$\pi_0(\theta) \propto \pi(\theta) I_{\Theta_0}(\theta), \qquad \pi_1(\theta) \propto \pi(\theta) I_{\Theta_1}(\theta),$$
  (under the standard dominating measures on $\Theta_0$ and $\Theta_1$) Using the prior probabilities $\pi(\Theta_0) = \varrho_0$ and $\pi(\Theta_1) = \varrho_1$,
  $$\pi(\theta) = \varrho_0 \pi_0(\theta) + \varrho_1 \pi_1(\theta).$$
- 24. The 21st Bayesian Century Tests and model choice Bayes factors Point null hypotheses I have no patience for statistical methods that assign positive probability to point hypotheses of the θ = 0 type that can never actually be true Gelman, BA, 2008 Take $\rho_0 = \Pr^\pi(\theta = \theta_0)$ and $g_1$ prior density under $H_a$, with marginal $m_1(x) = \int f(x|\theta) g_1(\theta)\,d\theta$. Then
  $$\pi(\Theta_0|x) = \frac{f(x|\theta_0)\rho_0}{\int f(x|\theta)\pi(\theta)\,d\theta} = \frac{f(x|\theta_0)\rho_0}{f(x|\theta_0)\rho_0 + (1 - \rho_0) m_1(x)}$$
  and Bayes factor
  $$B_{01}^\pi(x) = \frac{f(x|\theta_0)\rho_0}{m_1(x)(1 - \rho_0)} \bigg/ \frac{\rho_0}{1 - \rho_0} = \frac{f(x|\theta_0)}{m_1(x)}$$
- 25. The 21st Bayesian Century Tests and model choice Bayes factors Point null hypotheses (cont’d) Example (Normal mean) Test of $H_0: \theta = 0$ when $x \sim N(\theta, 1)$: we take $\pi_1$ as $N(0, \tau^2)$
  $$\frac{m_1(x)}{f(x|0)} = \sqrt{\frac{\sigma^2}{\sigma^2 + \tau^2}}\, \exp\left\{\frac{\tau^2 x^2}{2\sigma^2(\sigma^2 + \tau^2)}\right\}$$
  and the posterior probability is

  | $\tau^2$ \ $x$ | 0 | 0.68 | 1.28 | 1.96 |
  |---|---|---|---|---|
  | 1 | 0.586 | 0.557 | 0.484 | 0.351 |
  | 10 | 0.768 | 0.729 | 0.612 | 0.366 |
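The posterior probabilities in this example can be recomputed from the ratio $m_1(x)/f(x|0)$. The sketch below assumes $\sigma^2 = 1$, equal prior weights $\rho_0 = 1/2$, and that the two rows of the table correspond to $\tau^2 = 1$ and $\tau^2 = 10$; none of these is stated explicitly on the slide, but under them the results land within rounding of the tabulated values. The function name is hypothetical.

```python
import math

def posterior_prob_null(x: float, tau2: float, sigma2: float = 1.0, rho0: float = 0.5) -> float:
    """Posterior probability of H0: theta = 0 for x ~ N(theta, sigma2), theta ~ N(0, tau2) under Ha."""
    # Ratio m1(x) / f(x|0), i.e. the Bayes factor B10 in favour of the alternative.
    b10 = math.sqrt(sigma2 / (sigma2 + tau2)) * math.exp(
        tau2 * x**2 / (2.0 * sigma2 * (sigma2 + tau2))
    )
    # pi(Theta0|x) = rho0 f / (rho0 f + (1 - rho0) m1) = 1 / (1 + (1 - rho0)/rho0 * B10)
    return 1.0 / (1.0 + (1.0 - rho0) / rho0 * b10)

if __name__ == "__main__":
    for tau2 in (1.0, 10.0):
        row = [posterior_prob_null(x, tau2) for x in (0.0, 0.68, 1.28, 1.96)]
        print(tau2, [f"{p:.3f}" for p in row])
```

Note how the posterior probability of the null stays above 0.35 at $x = 1.96$, the point where the classical two-sided test rejects at the 5% level.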
- 26. The 21st Bayesian Century Tests and model choice Opposition to classical tests Comparison with classical tests The 95 percent frequentist intervals will live up to their advertised coverage claims Wasserman, BA, 2008 Standard answer Definition (p-value) The p-value $p(x)$ associated with a test is the smallest significance level for which $H_0$ is rejected
- 27. The 21st Bayesian Century Tests and model choice Opposition to classical tests Problems with p-values The use of P implies that a hypothesis that may be true may be rejected because it had not predicted observable results that have not occurred Jeﬀreys, ToP, 1939 ◮ Evaluation of the wrong quantity, namely the probability to exceed the observed quantity.(wrong conditioning) ◮ Evaluation only under the null hypothesis ◮ Huge numerical diﬀerence with the Bayesian range of answers
- 28. The 21st Bayesian Century Tests and model choice Opposition to classical tests Bayesian lower bounds If the Bayes estimator has good frequency behavior then we might as well use the frequentist method. If it has bad frequency behavior then we shouldn’t use it. Wasserman, BA, 2008 Least favourable Bayesian answer is
  $$\underline{B}(x, G_A) = \inf_{g \in G_A} \frac{f(x|\theta_0)}{\int_\Theta f(x|\theta) g(\theta)\,d\theta},$$
  i.e., if there exists a mle for $\theta$, $\hat\theta(x)$,
  $$\underline{B}(x, G_A) = \frac{f(x|\theta_0)}{f(x|\hat\theta(x))}$$
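- 29. The 21st Bayesian Century Tests and model choice Opposition to classical tests Illustration Example (Normal case) When $x \sim N(\theta, 1)$ and $H_0: \theta_0 = 0$, the lower bounds are
  $$\underline{B}(x, G_A) = e^{-x^2/2} \quad\text{and}\quad \underline{P}(x, G_A) = \left(1 + e^{x^2/2}\right)^{-1},$$
  i.e.

  | p-value | 0.10 | 0.05 | 0.01 | 0.001 |
  |---|---|---|---|---|
  | $\underline{P}$ | 0.205 | 0.128 | 0.035 | 0.004 |
  | $\underline{B}$ | 0.256 | 0.146 | 0.036 | 0.004 |

  [Quite different!]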
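The two lower bounds are simple functions of the observation, so they can be evaluated at the point $x$ matching a given p-value. The sketch below assumes a two-sided test, so that $x$ is the $1 - p/2$ standard normal quantile (the slide does not say which convention was used); the function name is hypothetical.

```python
import math
from statistics import NormalDist

def lower_bounds(p_value: float) -> tuple[float, float]:
    """Return (B_lower, P_lower) at the observation x whose two-sided p-value is p_value."""
    # Assumption: two-sided p-value, so x is the upper p/2 quantile of N(0, 1).
    x = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    b_lower = math.exp(-x**2 / 2.0)               # lower bound on the Bayes factor B01
    p_lower = 1.0 / (1.0 + math.exp(x**2 / 2.0))  # lower bound on P(H0 | x) with equal prior weights
    return b_lower, p_lower

if __name__ == "__main__":
    for p in (0.10, 0.05, 0.01, 0.001):
        b, post = lower_bounds(p)
        print(f"p-value {p}: B >= {b:.3f}, P >= {post:.3f}")
```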
- 30. The 21st Bayesian Century Tests and model choice Model choice Model choice and model comparison There is no null hypothesis, which complicates the computation of sampling error Templeton, Mol. Ecol., 2009 Choice among models Several models available for the same observation(s)
  $$M_i : x \sim f_i(x|\theta_i), \qquad i \in I$$
  where $I$ can be finite or infinite
- 31. The 21st Bayesian Century Tests and model choice Model choice Bayesian resolution The posterior probabilities are constructed by using a numerator that is a function of the observation for a particular model, then divided by a denominator that ensures that the ”probabilities” sum to one Templeton, Mol. Ecol., 2009 Probabilise the entire model/parameter space
  - allocate probabilities $p_i$ to all models $M_i$
  - define priors $\pi_i(\theta_i)$ for each parameter space $\Theta_i$
  - compute
  $$\pi(M_i|x) = \frac{p_i \int_{\Theta_i} f_i(x|\theta_i)\pi_i(\theta_i)\,d\theta_i}{\sum_j p_j \int_{\Theta_j} f_j(x|\theta_j)\pi_j(\theta_j)\,d\theta_j}$$
- 32. The 21st Bayesian Century Tests and model choice Model choice Bayesian resolution(2) The numerators are not co-measurable across hypotheses, and the denominators are sums of non-co-measurable entities. This means that it is mathematically impossible for them to be probabilities. Templeton, Mol. Ecol., 2009
  - take largest $\pi(M_i|x)$ to determine “best” model, or use averaged predictive
  $$\sum_j \pi(M_j|x) \int_{\Theta_j} f_j(x'|\theta_j)\pi_j(\theta_j|x)\,d\theta_j$$
- 33. The 21st Bayesian Century Tests and model choice Model choice Natural Ockham’s razor Pluralitas non est ponenda sine necessitate Variation is random until the contrary is shown; and new parameters in laws, when they are suggested, must be tested one at a time, unless there is specific reason to the contrary. Jeffreys, ToP, 1939 The Bayesian approach naturally weights differently models with different parameter dimensions (BIC).
- 34. The 21st Bayesian Century Tests and model choice Compatible priors Compatibility principle Further complicating dimensionality of test statistics is the fact that the models are often not nested, and one model may contain parameters that do not have analogues in the other models and vice versa Templeton, Mol. Ecol., 2009 Difficulty of finding simultaneously priors on a collection of models Easier to start from a single prior on a “big” [encompassing] model and to derive others from a coherence principle [Dawid & Lauritzen, 2000]
- 35. The 21st Bayesian Century Tests and model choice Compatible priors An illustration for linear regression In the case $M_1$ and $M_2$ are two nested Gaussian linear regression models with Zellner’s g-priors and the same variance $\sigma^2 \sim \pi(\sigma^2)$:
  - $M_1 : y|\beta_1, \sigma^2 \sim N(X_1\beta_1, \sigma^2)$ with $\beta_1|\sigma^2 \sim N\left(s_1, \sigma^2 n_1 (X_1^T X_1)^{-1}\right)$ where $X_1$ is a $(n \times k_1)$ matrix of rank $k_1 \le n$
  - $M_2 : y|\beta_2, \sigma^2 \sim N(X_2\beta_2, \sigma^2)$ with $\beta_2|\sigma^2 \sim N\left(s_2, \sigma^2 n_2 (X_2^T X_2)^{-1}\right)$, where $X_2$ is a $(n \times k_2)$ matrix with $\mathrm{span}(X_2) \subseteq \mathrm{span}(X_1)$
  [© Marin & Robert, Bayesian Core]
- 36. The 21st Bayesian Century Tests and model choice Compatible priors Compatible g-priors I don’t see any role for squared error loss, minimax, or the rest of what is sometimes called statistical decision theory Gelman, BA, 2008 Since $\sigma^2$ is a nuisance parameter, minimize the Kullback-Leibler divergence between both marginal distributions conditional on $\sigma^2$: $m_1(y|\sigma^2; s_1, n_1)$ and $m_2(y|\sigma^2; s_2, n_2)$, with solution
  $$\beta_2|X_2, \sigma^2 \sim N\left(s_2^*, \sigma^2 n_2^* (X_2^T X_2)^{-1}\right)$$
  with
  $$s_2^* = (X_2^T X_2)^{-1} X_2^T X_1 s_1, \qquad n_2^* = n_1$$
- 37. The 21st Bayesian Century Tests and model choice Variable selection Variable selection Regression setup where $y$ regressed on a set $\{x_1, \ldots, x_p\}$ of $p$ potential explanatory regressors (plus intercept) Corresponding $2^p$ submodels $M_\gamma$, where $\gamma \in \Gamma = \{0, 1\}^p$ indicates inclusion/exclusion of variables by a binary representation, e.g. $\gamma = 101001011$ means that $x_1$, $x_3$, $x_6$, $x_8$ and $x_9$ are included.
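The binary encoding of a submodel is easy to decode programmatically. The helper below is a hypothetical illustration that assumes the leftmost bit of $\gamma$ refers to $x_1$.

```python
def included_variables(gamma: str) -> list[int]:
    """Return the 1-based indices j such that regressor x_j is included in model M_gamma."""
    if set(gamma) - {"0", "1"}:
        raise ValueError("gamma must be a string of 0s and 1s")
    # Assumption: bits are read left to right, the first one standing for x_1.
    return [j for j, bit in enumerate(gamma, start=1) if bit == "1"]

def number_of_submodels(p: int) -> int:
    """Each of the p regressors is either in or out, hence 2**p submodels."""
    return 2**p
```

With the 10 regressors of the caterpillar example later in the talk, `number_of_submodels(10)` gives 1024 candidate models, which is still small enough to enumerate exhaustively.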
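- 38. The 21st Bayesian Century Tests and model choice Variable selection Notations For model $M_\gamma$,
  - $q_\gamma$ variables included
  - $t_1(\gamma) = \{t_{1,1}(\gamma), \ldots, t_{1,q_\gamma}(\gamma)\}$ indices of those variables and $t_0(\gamma)$ indices of the variables not included
  - For $\beta \in \mathbb{R}^{p+1}$,
  $$\beta_{t_1(\gamma)} = \left(\beta_0, \beta_{t_{1,1}(\gamma)}, \ldots, \beta_{t_{1,q_\gamma}(\gamma)}\right), \qquad X_{t_1(\gamma)} = \left[1_n \,|\, x_{t_{1,1}(\gamma)} \,|\, \ldots \,|\, x_{t_{1,q_\gamma}(\gamma)}\right].$$
  Submodel $M_\gamma$ is thus
  $$y|\beta, \gamma, \sigma^2 \sim N\left(X_{t_1(\gamma)}\beta_{t_1(\gamma)}, \sigma^2 I_n\right)$$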
- 39. The 21st Bayesian Century Tests and model choice Variable selection Global and compatible priors Use Zellner’s g-prior, i.e. a normal prior for $\beta$ conditional on $\sigma^2$,
  $$\beta|\sigma^2 \sim N\left(\tilde\beta, c\sigma^2 (X^T X)^{-1}\right)$$
  and a Jeffreys prior for $\sigma^2$,
  $$\pi(\sigma^2) \propto \sigma^{-2}$$
  Resulting compatible prior
  $$\beta_{t_1(\gamma)} \sim N\left(\left(X_{t_1(\gamma)}^T X_{t_1(\gamma)}\right)^{-1} X_{t_1(\gamma)}^T X\tilde\beta,\ c\sigma^2 \left(X_{t_1(\gamma)}^T X_{t_1(\gamma)}\right)^{-1}\right)$$
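- 40. The 21st Bayesian Century Tests and model choice Variable selection Posterior model probability Can be obtained in closed form:
  $$\pi(\gamma|y) \propto (c+1)^{-(q_\gamma+1)/2} \left[y^T y - \frac{c\, y^T P_1 y}{c+1} + \frac{\tilde\beta^T X^T P_1 X \tilde\beta}{c+1} - \frac{2 y^T P_1 X \tilde\beta}{c+1}\right]^{-n/2}.$$
  Conditionally on $\gamma$, posterior distributions of $\beta$ and $\sigma^2$:
  $$\beta_{t_1(\gamma)}|\sigma^2, y, \gamma \sim N\left(\frac{c}{c+1}\left(U_1 y + U_1 X \tilde\beta / c\right),\ \frac{\sigma^2 c}{c+1}\left(X_{t_1(\gamma)}^T X_{t_1(\gamma)}\right)^{-1}\right),$$
  $$\sigma^2|y, \gamma \sim IG\left(\frac{n}{2},\ \frac{y^T y}{2} - \frac{c\, y^T P_1 y}{2(c+1)} + \frac{\tilde\beta^T X^T P_1 X \tilde\beta}{2(c+1)} - \frac{y^T P_1 X \tilde\beta}{c+1}\right).$$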
- 41. The 21st Bayesian Century Tests and model choice Variable selection Noninformative case Use the same compatible informative g-prior distribution with $\tilde\beta = 0_{p+1}$ and a hierarchical diffuse prior distribution on $c$,
  $$\pi(c) \propto c^{-1} I_{\mathbb{N}^*}(c) \quad\text{or}\quad \pi(c) \propto c^{-1} I_{c>0}$$
  The choice of this hierarchical diffuse prior distribution on $c$ is due to the model posterior sensitivity to large values of $c$: Taking $\tilde\beta = 0_{p+1}$ and $c$ large does not work
- 42. The 21st Bayesian Century Tests and model choice Variable selection Processionary caterpillar Influence of some forest settlement characteristics on the development of caterpillar colonies Response $y$ log-transform of the average number of nests of caterpillars per tree on an area of 500 square meters ($n = 33$ areas) [© Marin & Robert, Bayesian Core]
- 43. The 21st Bayesian Century Tests and model choice Variable selection Processionary caterpillar (cont’d) Potential explanatory variables
  - x1 altitude (in meters),
  - x2 slope (in degrees),
  - x3 number of pines in the square,
  - x4 height (in meters) of the tree at the center of the square,
  - x5 diameter of the tree at the center of the square,
  - x6 index of the settlement density,
  - x7 orientation of the square (from 1 if southbound to 2 otherwise),
  - x8 height (in meters) of the dominant tree,
  - x9 number of vegetation strata,
  - x10 mix settlement index (from 1 if not mixed to 2 if mixed).
  [Figure: panels labelled x1 to x9]
- 44. The 21st Bayesian Century Tests and model choice Variable selection Bayesian regression output

  | | Estimate | BF | log10(BF) | |
  |---|---|---|---|---|
  | (Intercept) | 9.2714 | 26.334 | 1.4205 | (***) |
  | X1 | -0.0037 | 7.0839 | 0.8502 | (**) |
  | X2 | -0.0454 | 3.6850 | 0.5664 | (**) |
  | X3 | 0.0573 | 0.4356 | -0.3609 | |
  | X4 | -1.0905 | 2.8314 | 0.4520 | (*) |
  | X5 | 0.1953 | 2.5157 | 0.4007 | (*) |
  | X6 | -0.3008 | 0.3621 | -0.4412 | |
  | X7 | -0.2002 | 0.3627 | -0.4404 | |
  | X8 | 0.1526 | 0.4589 | -0.3383 | |
  | X9 | -1.0835 | 0.9069 | -0.0424 | |
  | X10 | -0.3651 | 0.4132 | -0.3838 | |

  evidence against H0: (****) decisive, (***) strong, (**) substantial, (*) poor
- 45. The 21st Bayesian Century Tests and model choice Variable selection Bayesian variable selection

  | $t_1(\gamma)$ | $\pi(\gamma|y, X)$ |
  |---|---|
  | 0,1,2,4,5 | 0.0929 |
  | 0,1,2,4,5,9 | 0.0325 |
  | 0,1,2,4,5,10 | 0.0295 |
  | 0,1,2,4,5,7 | 0.0231 |
  | 0,1,2,4,5,8 | 0.0228 |
  | 0,1,2,4,5,6 | 0.0228 |
  | 0,1,2,3,4,5 | 0.0224 |
  | 0,1,2,3,4,5,9 | 0.0167 |
  | 0,1,2,4,5,6,9 | 0.0167 |
  | 0,1,2,4,5,8,9 | 0.0137 |

  Noninformative G-prior model choice
- 46. The 21st Bayesian Century Bayesian Calculations Bayesian Calculations Bayesian methods seem to quickly move to elaborate computation Gelman, BA, 2008 Introduction Tests and model choice Bayesian Calculations Implementation diﬃculties Bayes factor approximation ABC model choice A Defense of the Bayesian Choice
- 47. The 21st Bayesian Century Bayesian Calculations Implementation difficulties Implementation difficulties
  - Computing the posterior distribution $\pi(\theta|x) \propto \pi(\theta) f(x|\theta)$
  - Resolution of $\arg\min \int_\Theta L(\theta, \delta)\pi(\theta) f(x|\theta)\,d\theta$
  - Maximisation of the marginal posterior $\arg\max \int_{\Theta_{-1}} \pi(\theta|x)\,d\theta_{-1}$
- 48. The 21st Bayesian Century Bayesian Calculations Implementation difficulties Implementation further difficulties A statistical test returns a probability value, but rarely is the probability value per se the reason for an investigator performing the test Templeton, Mol. Ecol., 2009
  - Computing posterior quantities
  $$\delta^\pi(x) = \int_\Theta h(\theta)\,\pi(\theta|x)\,d\theta = \frac{\int_\Theta h(\theta)\,\pi(\theta) f(x|\theta)\,d\theta}{\int_\Theta \pi(\theta) f(x|\theta)\,d\theta}$$
  - Resolution (in $k$) of $P(\pi(\theta|x) \ge k\,|\,x) = \alpha$
- 49. The 21st Bayesian Century Bayesian Calculations Implementation difficulties Monte Carlo methods Bayesian simulation seems stuck in an infinite regress of inferential uncertainty Gelman, BA, 2008 Approximation of
  $$I = \int_\Theta g(\theta) f(x|\theta)\pi(\theta)\,d\theta,$$
  takes advantage of the fact that $f(x|\theta)\pi(\theta)$ is proportional to a density: If the $\theta_i$’s are from $\pi(\theta)$,
  $$\frac{1}{m}\sum_{i=1}^m g(\theta_i) f(x|\theta_i)$$
  converges (almost surely) to $I$
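- 50. The 21st Bayesian Century Bayesian Calculations Implementation difficulties Importance function A simulation method of inference hides unrealistic assumptions Templeton, Mol. Ecol., 2009 No need to simulate from $\pi(\cdot|x)$ or from $\pi$: if $h$ is a probability density,
  $$\int_\Theta g(\theta) f(x|\theta)\pi(\theta)\,d\theta = \int \frac{g(\theta) f(x|\theta)\pi(\theta)}{h(\theta)}\, h(\theta)\,d\theta$$
  and, for $\theta_1, \ldots, \theta_m$ simulated from $h$,
  $$\frac{\sum_{i=1}^m g(\theta_i)\,\omega(\theta_i)}{\sum_{i=1}^m \omega(\theta_i)} \quad\text{with}\quad \omega(\theta_i) = \frac{f(x|\theta_i)\pi(\theta_i)}{h(\theta_i)}$$
  approximates $E^\pi[g(\theta)|x]$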
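A minimal sketch of this self-normalised importance sampling estimator is given below. Everything specific in it is an assumption chosen for illustration: the toy model $x \sim N(\theta, 1)$ with prior $\theta \sim N(0, 1)$, the importance function $h = N(0, 2^2)$, the choice $g(\theta) = \theta$, and the function names. In this conjugate setting the exact posterior mean is $x/2$, which gives something to compare against.

```python
import math
import random

def normal_pdf(t: float, mean: float, sd: float) -> float:
    """Density of N(mean, sd^2) evaluated at t."""
    return math.exp(-0.5 * ((t - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def snis_posterior_mean(x: float, m: int = 50_000, seed: int = 0) -> float:
    """Self-normalised importance sampling estimate of E[theta | x]."""
    rng = random.Random(seed)
    num = 0.0
    den = 0.0
    for _ in range(m):
        theta = rng.gauss(0.0, 2.0)  # theta_i ~ h, wider than the posterior
        # omega(theta_i) = f(x|theta_i) pi(theta_i) / h(theta_i)
        w = normal_pdf(x, theta, 1.0) * normal_pdf(theta, 0.0, 1.0) / normal_pdf(theta, 0.0, 2.0)
        num += theta * w  # g(theta) = theta
        den += w
    return num / den
```

Taking $h$ with heavier tails than the posterior keeps the weights bounded, which is the usual requirement for the estimator to have finite variance.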
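- 51. The 21st Bayesian Century Bayesian Calculations Bayes factor approximation Bayes factor approximation ABC’s When approximating the Bayes factor
  $$B_{12} = \frac{\int_{\Theta_1} f_1(x|\theta_1)\pi_1(\theta_1)\,d\theta_1}{\int_{\Theta_2} f_2(x|\theta_2)\pi_2(\theta_2)\,d\theta_2} = \frac{Z_1}{Z_2}$$
  use of importance functions $\varpi_1$ and $\varpi_2$ and
  $$\hat B_{12} = \frac{n_1^{-1}\sum_{i=1}^{n_1} f_1(x|\theta_1^i)\pi_1(\theta_1^i)/\varpi_1(\theta_1^i)}{n_2^{-1}\sum_{i=1}^{n_2} f_2(x|\theta_2^i)\pi_2(\theta_2^i)/\varpi_2(\theta_2^i)}, \qquad \theta_j^i \sim \varpi_j(\theta)$$
  [Chopin & Robert, 2007]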
- 52. The 21st Bayesian Century Bayesian Calculations Bayes factor approximation Bridge sampling Special case: If
  $$\pi_1(\theta_1|x) \propto \tilde\pi_1(\theta_1|x), \qquad \pi_2(\theta_2|x) \propto \tilde\pi_2(\theta_2|x)$$
  live on the same space ($\Theta_1 = \Theta_2$), then
  $$B_{12} \approx \frac{1}{n}\sum_{i=1}^n \frac{\tilde\pi_1(\theta_i|x)}{\tilde\pi_2(\theta_i|x)}, \qquad \theta_i \sim \pi_2(\theta|x)$$
  [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
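The special case above can be checked on a toy pair of unnormalised densities whose normalising constants are known. The sketch assumes $\tilde\pi_1(\theta) = e^{-\theta^2/2}$ (so $Z_1 = \sqrt{2\pi}$) and $\tilde\pi_2(\theta) = e^{-\theta^2/8}$ (so $Z_2 = 2\sqrt{2\pi}$), giving $B_{12} = 1/2$; these densities and the function name are illustrative choices, not taken from the talk.

```python
import math
import random

def ratio_of_constants(n: int = 100_000, seed: int = 1) -> float:
    """Estimate B12 = Z1/Z2 by averaging pi1_tilde/pi2_tilde over draws from pi2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        theta = rng.gauss(0.0, 2.0)             # theta_i ~ pi2 = N(0, 2^2)
        pi1_tilde = math.exp(-theta**2 / 2.0)   # unnormalised N(0, 1)
        pi2_tilde = math.exp(-theta**2 / 8.0)   # unnormalised N(0, 4)
        total += pi1_tilde / pi2_tilde
    return total / n
```

Here $\pi_2$ has heavier tails than $\pi_1$, so the ratio is bounded by 1; swapping the roles of the two densities would give an unbounded ratio and a much less stable estimate.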
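- 53. The 21st Bayesian Century Bayesian Calculations Bayes factor approximation (Further) bridge sampling In addition
  $$B_{12} = \frac{\int \tilde\pi_1(\theta|x)\,\alpha(\theta)\,\pi_2(\theta|x)\,d\theta}{\int \tilde\pi_2(\theta|x)\,\alpha(\theta)\,\pi_1(\theta|x)\,d\theta} \quad \forall\, \alpha(\cdot)$$
  $$\approx \frac{\dfrac{1}{n_2}\sum_{i=1}^{n_2} \tilde\pi_1(\theta_{2i}|x)\,\alpha(\theta_{2i})}{\dfrac{1}{n_1}\sum_{i=1}^{n_1} \tilde\pi_2(\theta_{1i}|x)\,\alpha(\theta_{1i})}, \qquad \theta_{ji} \sim \pi_j(\theta|x)$$
- 54. The 21st Bayesian Century Bayesian Calculations Bayes factor approximation Optimal bridge sampling The optimal choice of auxiliary function is
  $$\alpha^\star(\theta) = \frac{n_1 + n_2}{n_1\pi_1(\theta|x) + n_2\pi_2(\theta|x)}$$
  leading to
  $$B_{12} \approx \frac{\dfrac{1}{n_2}\sum_{i=1}^{n_2} \dfrac{\tilde\pi_1(\theta_{2i}|x)}{n_1\pi_1(\theta_{2i}|x) + n_2\pi_2(\theta_{2i}|x)}}{\dfrac{1}{n_1}\sum_{i=1}^{n_1} \dfrac{\tilde\pi_2(\theta_{1i}|x)}{n_1\pi_1(\theta_{1i}|x) + n_2\pi_2(\theta_{1i}|x)}}$$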
- 55. The 21st Bayesian Century Bayesian Calculations Bayes factor approximation Approximating $Z_k$ from a posterior sample Use of the [harmonic mean] identity
  $$E^{\pi_k}\left[\frac{\varphi(\theta_k)}{\pi_k(\theta_k)L_k(\theta_k)}\,\bigg|\,x\right] = \int \frac{\varphi(\theta_k)}{\pi_k(\theta_k)L_k(\theta_k)}\,\frac{\pi_k(\theta_k)L_k(\theta_k)}{Z_k}\,d\theta_k = \frac{1}{Z_k}$$
  no matter what the proposal $\varphi(\cdot)$ is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output
- 56. The 21st Bayesian Century Bayesian Calculations Bayes factor approximation Comparison with regular importance sampling Harmonic mean: Constraint opposed to usual importance sampling constraints: $\varphi(\theta)$ must have lighter (rather than fatter) tails than $\pi_k(\theta_k)L_k(\theta_k)$ for the approximation
  $$\hat Z_{1k} = 1 \bigg/ \frac{1}{T}\sum_{t=1}^T \frac{\varphi(\theta_k^{(t)})}{\pi_k(\theta_k^{(t)})L_k(\theta_k^{(t)})}$$
  to have a finite variance. E.g., use finite support kernels (like Epanechnikov’s kernel) for $\varphi$
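The following sketch illustrates the estimator $\hat Z_{1k}$ with a finite-support $\varphi$. It relies on several assumptions made only for the example: a conjugate toy model ($x \sim N(\theta, 1)$, $\theta \sim N(0, 1)$) whose posterior $N(x/2, 1/2)$ is sampled directly in place of genuine MCMC output, and a uniform $\varphi$ on one posterior standard deviation either side of the posterior mean rather than an Epanechnikov kernel. The exact evidence, the $N(0, 2)$ density at $x$, is available for comparison.

```python
import math
import random

def normal_pdf(t: float, mean: float, sd: float) -> float:
    """Density of N(mean, sd^2) evaluated at t."""
    return math.exp(-0.5 * ((t - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def evidence_harmonic(x: float, T: int = 50_000, seed: int = 2) -> float:
    """Estimate Z = m(x) via the generalised harmonic mean identity."""
    rng = random.Random(seed)
    post_mean, post_sd = x / 2.0, math.sqrt(0.5)
    lo, hi = post_mean - post_sd, post_mean + post_sd  # support of phi
    total = 0.0
    for _ in range(T):
        theta = rng.gauss(post_mean, post_sd)          # stand-in for MCMC draws
        if lo <= theta <= hi:
            phi = 1.0 / (hi - lo)                      # uniform density on [lo, hi]
            total += phi / (normal_pdf(theta, 0.0, 1.0) * normal_pdf(x, theta, 1.0))
        # Outside [lo, hi], phi is zero and the term contributes nothing.
    return 1.0 / (total / T)

def exact_evidence(x: float) -> float:
    """Marginal density of x in the toy model: N(0, 2)."""
    return normal_pdf(x, 0.0, math.sqrt(2.0))
```

Because $\varphi$ vanishes in the tails, the ratio inside the average stays bounded, which is the lighter-tail requirement stated on the slide.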
- 57. The 21st Bayesian Century Bayesian Calculations Bayes factor approximation Approximating $Z$ using a mixture representation Bridge sampling redux Design a specific mixture for simulation [importance sampling] purposes, with density
  $$\varphi_k(\theta_k) \propto \omega_1\pi_k(\theta_k)L_k(\theta_k) + \varphi(\theta_k),$$
  where $\varphi(\cdot)$ is arbitrary (but normalised) Note: $\omega_1$ is not a probability weight
- 58. The 21st Bayesian Century Bayesian Calculations Bayes factor approximation Approximating $Z$ using a mixture representation (cont’d) Corresponding MCMC (=Gibbs) sampler At iteration $t$
  1. Take $\delta^{(t)} = 1$ with probability
  $$\omega_1\pi_k(\theta_k^{(t-1)})L_k(\theta_k^{(t-1)}) \Big/ \left\{\omega_1\pi_k(\theta_k^{(t-1)})L_k(\theta_k^{(t-1)}) + \varphi(\theta_k^{(t-1)})\right\}$$
  and $\delta^{(t)} = 2$ otherwise;
  2. If $\delta^{(t)} = 1$, generate $\theta_k^{(t)} \sim \mathrm{MCMC}(\theta_k^{(t-1)}, \theta_k)$ where $\mathrm{MCMC}(\theta_k, \theta_k')$ denotes an arbitrary MCMC kernel associated with the posterior $\pi_k(\theta_k|x) \propto \pi_k(\theta_k)L_k(\theta_k)$;
  3. If $\delta^{(t)} = 2$, generate $\theta_k^{(t)} \sim \varphi(\theta_k)$ independently
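- 59. The 21st Bayesian Century Bayesian Calculations Bayes factor approximation Evidence approximation by mixtures Rao-Blackwellised estimate
  $$\hat\xi = \frac{1}{T}\sum_{t=1}^T \omega_1\pi_k(\theta_k^{(t)})L_k(\theta_k^{(t)}) \Big/ \left\{\omega_1\pi_k(\theta_k^{(t)})L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})\right\},$$
  converges to $\omega_1 Z_k / \{\omega_1 Z_k + 1\}$ Deduce $\hat Z_{3k}$ from $\omega_1 \hat Z_{3k} / \{\omega_1 \hat Z_{3k} + 1\} = \hat\xi$, i.e.
  $$\hat Z_{3k} = \frac{\sum_{t=1}^T \pi_k(\theta_k^{(t)})L_k(\theta_k^{(t)}) \Big/ \left\{\omega_1\pi_k(\theta_k^{(t)})L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})\right\}}{\sum_{t=1}^T \varphi(\theta_k^{(t)}) \Big/ \left\{\omega_1\pi_k(\theta_k^{(t)})L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})\right\}}$$
  [Bridge sampler]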
- 60. The 21st Bayesian Century Bayesian Calculations Bayes factor approximation Chib’s representation Direct application of Bayes’ theorem: given $x \sim f_k(x|\theta_k)$ and $\theta_k \sim \pi_k(\theta_k)$,
  $$Z_k = m_k(x) = \frac{f_k(x|\theta_k)\,\pi_k(\theta_k)}{\pi_k(\theta_k|x)}$$
  Use of an approximation to the posterior
  $$\hat Z_k = \hat m_k(x) = \frac{f_k(x|\theta_k^*)\,\pi_k(\theta_k^*)}{\hat\pi_k(\theta_k^*|x)}.$$
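Chib’s identity holds for every value of $\theta_k$, which is easy to verify when the posterior is known exactly. The sketch below assumes a conjugate toy model ($x \sim N(\theta, 1)$, $\theta \sim N(0, 1)$, hence posterior $N(x/2, 1/2)$ and evidence equal to the $N(0, 2)$ density at $x$); in practice the exact posterior density in the denominator would be replaced by an approximation. Function names are hypothetical.

```python
import math

def normal_pdf(t: float, mean: float, sd: float) -> float:
    """Density of N(mean, sd^2) evaluated at t."""
    return math.exp(-0.5 * ((t - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def chib_evidence(x: float, theta_star: float) -> float:
    """Evidence from Chib's identity: likelihood * prior / posterior at theta_star."""
    likelihood = normal_pdf(x, theta_star, 1.0)
    prior = normal_pdf(theta_star, 0.0, 1.0)
    posterior = normal_pdf(theta_star, x / 2.0, math.sqrt(0.5))  # exact in this toy model
    return likelihood * prior / posterior

def exact_evidence(x: float) -> float:
    """Marginal density of x in the toy model: N(0, 2)."""
    return normal_pdf(x, 0.0, math.sqrt(2.0))
```

Any `theta_star` returns the same value here; with an estimated posterior density, a high-density point such as the posterior mode is the usual choice because the density estimate is most accurate there.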
- 61. The 21st Bayesian Century Bayesian Calculations Bayes factor approximation Case of latent variables For missing variable $z$ as in mixture models, natural Rao-Blackwell estimate
  $$\hat\pi_k(\theta_k^*|x) = \frac{1}{T}\sum_{t=1}^T \pi_k(\theta_k^*|x, z_k^{(t)}),$$
  where the $z_k^{(t)}$’s are Gibbs sampled latent variables
- 62. The 21st Bayesian Century Bayesian Calculations ABC model choice Approximate Bayesian Computation Simulation target is π(θ)f (x|θ) with likelihood f (x|θ) not in closed form. Likelihood-free rejection technique: ABC algorithm For an observation y ∼ f (y|θ), under the prior π(θ), keep jointly simulating θ′ ∼ π(θ) , x ∼ f (x|θ′ ) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]
- 63. The 21st Bayesian Century Bayesian Calculations ABC model choice A as approximative When $y$ is a continuous random variable, equality $x = y$ is replaced with a tolerance condition,
  $$\varrho(x, y) \le \epsilon$$
  where $\varrho$ is a distance between summary statistics Output distributed from
  $$\pi(\theta)\, P_\theta\{\varrho(x, y) < \epsilon\} \propto \pi(\theta\,|\,\varrho(x, y) < \epsilon)$$
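The likelihood-free rejection scheme can be sketched on a discrete toy problem where the exact answer is known. The example below is an assumption made for illustration: $y$ successes out of $n$ Bernoulli trials, a uniform prior on $\theta$, and the absolute difference in counts as the distance $\varrho$. With $\epsilon = 0$ the accepted values are exact draws from the Beta$(y + 1, n - y + 1)$ posterior.

```python
import random

def abc_rejection(y: int, n: int, eps: int, n_proposals: int = 50_000, seed: int = 3) -> list[float]:
    """Return the prior draws theta' whose simulated count falls within eps of y."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_proposals):
        theta = rng.random()                              # theta' ~ U(0, 1) prior
        x = sum(rng.random() < theta for _ in range(n))   # x ~ Binomial(n, theta')
        if abs(x - y) <= eps:                             # tolerance condition rho(x, y) <= eps
            accepted.append(theta)
    return accepted

if __name__ == "__main__":
    sample = abc_rejection(y=14, n=20, eps=0)
    print(len(sample), sum(sample) / len(sample))
```

Raising `eps` accepts more proposals but targets $\pi(\theta\,|\,\varrho(x, y) \le \epsilon)$ rather than the exact posterior, which is the trade-off the slide describes.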
- 64. The 21st Bayesian Century Bayesian Calculations ABC model choice Gibbs random fields Gibbs distribution The rv $y = (y_1, \ldots, y_n)$ is a Gibbs random field associated with the graph $G$ if
  $$f(y) = \frac{1}{Z}\exp\left\{-\sum_{c \in C} V_c(y_c)\right\},$$
  where $Z$ is the normalising constant, $C$ is the set of cliques of $G$ and $V_c$ is any function also called potential $U(y) = \sum_{c \in C} V_c(y_c)$ is the energy function $Z$ is usually unavailable in closed form
- 65. The 21st Bayesian Century Bayesian Calculations ABC model choice Potts model Potts model $V_c(y)$ is of the form
  $$V_c(y) = \theta S(y) = \theta \sum_{l \sim i} \delta_{y_l = y_i}$$
  where $l \sim i$ denotes a neighbourhood structure In most realistic settings, summation
  $$Z_\theta = \sum_{x \in \mathcal{X}} \exp\{\theta^T S(x)\}$$
  involves too many terms to be manageable and numerical approximations cannot always be trusted [Cucala, Marin, CPR & Titterington, JASA, 2009]
- 66. The 21st Bayesian Century Bayesian Calculations ABC model choice Neighbourhood relations Choice to be made between $M$ neighbourhood relations
  $$i \overset{m}{\sim} i' \quad (0 \le m \le M - 1) \quad\text{with}\quad S_m(x) = \sum_{i \overset{m}{\sim} i'} I_{\{x_i = x_{i'}\}}$$
  driven by the posterior probabilities of the models.
- 67. The 21st Bayesian Century Bayesian Calculations ABC model choice Model index Formalisation via a model index $\mathcal{M}$, new parameter with prior distribution $\pi(\mathcal{M} = m)$ and $\pi(\theta|\mathcal{M} = m) = \pi_m(\theta_m)$ Computational target:
  $$P(\mathcal{M} = m|x) \propto \int_{\Theta_m} f_m(x|\theta_m)\pi_m(\theta_m)\,d\theta_m\ \pi(\mathcal{M} = m)$$
- 68. The 21st Bayesian Century Bayesian Calculations ABC model choice Sufficient statistics If $S(x)$ sufficient statistic for the joint parameters $(\mathcal{M}, \theta_0, \ldots, \theta_{M-1})$,
  $$P(\mathcal{M} = m|x) = P(\mathcal{M} = m|S(x)).$$
  For each model $m$, sufficient statistic $S_m(\cdot)$ makes $S(\cdot) = (S_0(\cdot), \ldots, S_{M-1}(\cdot))$ also sufficient. For Gibbs random fields,
  $$x|\mathcal{M} = m \sim f_m(x|\theta_m) = f_m^1(x|S(x))\, f_m^2(S(x)|\theta_m) = \frac{1}{n(S(x))}\, f_m^2(S(x)|\theta_m)$$
  where
  $$n(S(x)) = \#\{\tilde x \in \mathcal{X} : S(\tilde x) = S(x)\}$$
  $S(x)$ also sufficient for the joint parameters [Specific to Gibbs random fields!]
- 69. The 21st Bayesian Century Bayesian Calculations ABC model choice ABC model choice Algorithm ABC-MC
  - Generate $m^*$ from the prior $\pi(\mathcal{M} = m)$.
  - Generate $\theta^*_{m^*}$ from the prior $\pi_{m^*}(\cdot)$.
  - Generate $x^*$ from the model $f_{m^*}(\cdot|\theta^*_{m^*})$.
  - Compute the distance $\rho(S(x^0), S(x^*))$.
  - Accept $(\theta^*_{m^*}, m^*)$ if $\rho(S(x^0), S(x^*)) < \epsilon$.
  [Cornuet, Grelaud, Marin & Robert, BA, 2008] Note When $\epsilon = 0$ the algorithm is exact
- 70. The 21st Bayesian Century Bayesian Calculations ABC model choice Toy example iid Bernoulli model versus two-state first-order Markov chain, i.e.
  $$f_0(x|\theta_0) = \exp\left(\theta_0 \sum_{i=1}^n I_{\{x_i = 1\}}\right) \Big/ \{1 + \exp(\theta_0)\}^n,$$
  versus
  $$f_1(x|\theta_1) = \frac{1}{2}\exp\left(\theta_1 \sum_{i=2}^n I_{\{x_i = x_{i-1}\}}\right) \Big/ \{1 + \exp(\theta_1)\}^{n-1},$$
  with priors $\theta_0 \sim U(-5, 5)$ and $\theta_1 \sim U(0, 6)$ (inspired by “phase transition” boundaries).
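A rough sketch of ABC model choice on this toy example follows. Beyond what the slides state, it assumes equal prior weights on the two models, an $L_1$ distance between the summary pairs $(S_0, S_1)$, and acceptance when that distance is at most $\epsilon$ (so that $\epsilon = 0$ means an exact match); all function names are hypothetical.

```python
import math
import random

def summary(x: list[int]) -> tuple[int, int]:
    """S(x) = (number of ones, number of equal consecutive pairs)."""
    return sum(x), sum(1 for a, b in zip(x, x[1:]) if a == b)

def simulate(m: int, theta: float, n: int, rng: random.Random) -> list[int]:
    """Draw x from model m: iid Bernoulli (m = 0) or two-state Markov chain (m = 1)."""
    p = 1.0 / (1.0 + math.exp(-theta))  # exp(theta) / (1 + exp(theta))
    if m == 0:
        return [int(rng.random() < p) for _ in range(n)]
    x = [int(rng.random() < 0.5)]       # initial state uniform on {0, 1}
    for _ in range(n - 1):
        stay = rng.random() < p         # keep the previous state with probability p
        x.append(x[-1] if stay else 1 - x[-1])
    return x

def abc_mc(x_obs: list[int], eps: int = 0, n_proposals: int = 20_000, seed: int = 4):
    """Return estimated posterior model probabilities [P(M=0|x), P(M=1|x)], or None if nothing is accepted."""
    rng = random.Random(seed)
    s_obs, n = summary(x_obs), len(x_obs)
    counts = [0, 0]
    for _ in range(n_proposals):
        m = rng.randrange(2)            # assumption: uniform prior on the model index
        theta = rng.uniform(-5.0, 5.0) if m == 0 else rng.uniform(0.0, 6.0)
        s = summary(simulate(m, theta, n, rng))
        if abs(s[0] - s_obs[0]) + abs(s[1] - s_obs[1]) <= eps:
            counts[m] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else None
```

For a sequence made of one long run of ones followed by one long run of zeros, the accepted proposals come almost entirely from the Markov chain model, as one would expect.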
- 71. The 21st Bayesian Century Bayesian Calculations ABC model choice Toy example (2) (left) Comparison of the true $BF_{m_0/m_1}(x^0)$ with $\widehat{BF}_{m_0/m_1}(x^0)$ (in logs) over 2,000 simulations and $4 \cdot 10^6$ proposals from the prior. (right) Same when using tolerance $\epsilon$ corresponding to the 1% quantile on the distances.
- 72. The 21st Bayesian Century A Defense of the Bayesian Choice A Defense of the Bayesian Choice Given the advances in practical Bayesian methods in the past two decades, anti-Bayesianism is no longer a serious option Gelman, BA, 2009 Bayesians are of course their own worst enemies. They make non-Bayesians accuse them of religious fervour, and an unwillingness to see another point of view. Davidson, 2009
- 73. The 21st Bayesian Century A Defense of the Bayesian Choice 1. Choosing a probabilistic representation Bayesian statistics is about making probability statements Gelman, BA, 2009 Bayesian Statistics appears as the calculus of uncertainty Reminder: A probabilistic model is nothing but an interpretation of a given phenomenon What is the meaning of RD’s t test example?!
- 74. The 21st Bayesian Century A Defense of the Bayesian Choice 1. Choosing a probabilistic representation (2) Inference is impossible. Davidson, 2009 The Bahadur–Savage problem stems from the inability to make choices about the shape of a statistical model, not from an impossibility to draw [Bayesian] inference. Further, a probability distribution is more than the sum of its moments. Ill-posed problems thus highlight issues with the model, not the inference.
- 75. The 21st Bayesian Century A Defense of the Bayesian Choice 2. Conditioning on the data Bayesian data analysis is a method for summarizing uncertainty and making estimates and predictions using probability statements conditional on observed data and an assumed model Gelman, BA, 2009 At the basis of statistical inference lies an inversion process between cause and effect. Using a prior distribution brings a necessary balance between observations and parameters and enables one to operate conditional upon x What is the data in RD’s t test example?! U ’s? Y ’s?
- 76. The 21st Bayesian Century A Defense of the Bayesian Choice 3. Exhibiting the true likelihood Frequentist statistics is an approach for evaluating statistical procedures conditional on some family of posited probability models Gelman, BA, 2009 Provides a complete quantitative inference on the parameters and predictive that points out inadequacies of frequentist statistics, while implementing the Likelihood Principle. There needs to be a true likelihood, including in non-parametric settings [Rousseau, Van der Vaart]
- 77. The 21st Bayesian Century A Defense of the Bayesian Choice 4. Using priors as tools and summaries Bayesian techniques allow prior beliefs to be tested and discarded as appropriate Gelman, BA, 2009 The choice of a prior distribution π does not require any kind of belief in this distribution: rather consider it as a tool that summarizes the available prior information and the uncertainty surrounding this information Non-identiﬁability is an issue in that the prior may strongly impact inference about identiﬁable bits
- 78. The 21st Bayesian Century A Defense of the Bayesian Choice 4. Using priors as tools and summaries (2) No uninformative prior exists for such models. Davidson, 2009 Reference priors can be deduced from the sampling distribution by an automated procedure, based on a minimal information principle that maximises the information brought by the data. Important literature on prior modelling for non-parametric problems, incl. smoothness constraints.
- 79. The 21st Bayesian Century A Defense of the Bayesian Choice 5. Accepting the subjective basis of knowledge Knowledge is a critical confrontation between a prioris and experiments. Ignoring these a prioris impoverishes analysis. We have, for one thing, to use a language and our language is entirely made of preconceived ideas and has to be so. However, these are unconscious preconceived ideas, which are a million times more dangerous than the other ones. Were we to assert that if we are including other preconceived ideas, consciously stated, we would aggravate the evil! I do not believe so: I rather maintain that they would balance one another. Henri Poincaré, 1902
- 80. The 21st Bayesian Century A Defense of the Bayesian Choice 6. Choosing a coherent system of inference Bayesian data analysis has three stages: formulating a model, ﬁtting the model to data, and checking the model ﬁt. The second step—inference—gets most of the attention, but the procedure as a whole is not automatic Gelman, BA, 2009 To force inference into a decision-theoretic mold allows for a clariﬁcation of the way inferential tools should be evaluated, and therefore implies a conscious (although subjective) choice of the retained optimality. Logical inference process Start with requested properties, i.e. loss function and prior distribution, then derive the best solution satisfying these properties.
- 81. The 21st Bayesian Century A Defense of the Bayesian Choice 6. Choosing a coherent system of inference (2) Asymptopia annoys Bayesians. Davidson, 2009 Asymptotics [for inference] sounds like a proxy for not specifying the model completely and thus for using another model, whereas asymptotics [for simulation] is quite acceptable. Bayesian inference does not escape asymptotic diﬃculties, see e.g. mixtures. NP Bootstrap aims at inference with no[t enough] modelling, while P Bayesian bootstrap is essentially using the Bayesian predictive
- 82. The 21st Bayesian Century A Defense of the Bayesian Choice 7. Looking for optimal frequentist procedures At intermediate levels of a Bayesian model, frequency properties typically take care of themselves. It is typically only at the top level of unreplicated parameters that we have to worry Gelman, BA, 2009 Bayesian inference widely intersects with the three notions of minimaxity, admissibility and equivariance (Haar). Looking for an optimal estimator most often ends up ﬁnding a Bayes estimator. Optimality is easier to attain through the Bayes “ﬁlter”
- 83. The 21st Bayesian Century A Defense of the Bayesian Choice 8. Solving the actual problem Frequentist methods have coverage guarantees; Bayesian methods don’t. In science, coverage matters Wasserman, BA, 2009 Frequentist methods are justiﬁed on a long-term basis, i.e., from the statistician’s viewpoint. From a decision-maker’s point of view, only the problem at hand matters! That is, he/she calls for an inference conditional on x.
- 84. The 21st Bayesian Century A Defense of the Bayesian Choice 9. Providing a universal system of inference Bayesian methods are presented as an automatic inference engine Gelman, BA, 2009 Given the three factors (X , f (x|θ)), (Θ, π(θ)), (D, L(θ, d)), the Bayesian approach validates one and only one inferential procedure
- 85. The 21st Bayesian Century A Defense of the Bayesian Choice 10. Computing procedures as a minimization problem The discussion of computational issues should not be allowed to obscure the need for further analysis of inferential questions Bernardo, BA, 2009 Bayesian procedures are easier to compute than procedures of alternative theories, in the sense that there exists a universal method for the computation of Bayes estimators Convergence assessment is an issue, but recent developments in adaptive MCMC allow for more conﬁdence in the output
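
The "universal method" invoked in the last slide can be read as minimizing the posterior expected loss E[L(θ, d) | x] over decisions d. The block below is a minimal sketch of that reading, not code from the presentation. The data (7 successes in 20 Bernoulli trials), the uniform Beta(1, 1) prior, the decision grid and the helper name `bayes_estimate` are all hypothetical choices made for illustration. The posterior expected loss is approximated by averaging over posterior draws, which in a less convenient model would come from an MCMC sampler instead of a closed-form Beta posterior.

```python
import numpy as np


def bayes_estimate(theta_draws, loss, candidates):
    """Return the candidate decision with the smallest Monte Carlo
    estimate of the posterior expected loss (hypothetical helper)."""
    # risk[j] approximates E[L(theta, d_j) | x] by an average over posterior draws
    risk = np.array([loss(theta_draws, d).mean() for d in candidates])
    return candidates[np.argmin(risk)]


rng = np.random.default_rng(0)

# Assumed setting: 7 successes in 20 trials with a Beta(1, 1) prior,
# so the posterior on theta is Beta(1 + 7, 1 + 13) = Beta(8, 14).
theta_draws = rng.beta(8, 14, size=20_000)

# Candidate decisions d: a regular grid over the parameter space [0, 1]
grid = np.linspace(0.0, 1.0, 1001)

# Quadratic loss: the minimizer should sit next to the posterior mean
quad = bayes_estimate(theta_draws, lambda t, d: (t - d) ** 2, grid)

# Absolute error loss: the minimizer should sit next to the posterior median
absl = bayes_estimate(theta_draws, lambda t, d: np.abs(t - d), grid)

print("quadratic loss:", quad, "posterior mean:", theta_draws.mean())
print("absolute loss: ", absl, "posterior median:", np.median(theta_draws))
```

The same function is reused for both losses: only the `loss` argument changes, which is the sense in which the three factors (model, prior, loss) determine the procedure. A grid search is used here purely for transparency; a numerical optimizer would be the natural replacement when the decision space is not one-dimensional.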
