
# Spanos lecture 7: An Introduction to Bayesian Inference


Aris Spanos: Phil 6334, Lecture 7: An Introduction to Bayesian Inference

Published in: Education, Technology

### Spanos lecture 7: An Introduction to Bayesian Inference

1. 1. PHIL 6334 - Probability/Statistics Lecture Notes 7: An Introduction to Bayesian Inference. Aris Spanos [Spring 2014]

   1 Introduction to Bayesian Inference

   These lecture notes provide an elementary introduction to Bayesian inference, focusing almost exclusively on two simple models, the simple Bernoulli and the simple Normal, to keep the technical details to a minimum. Textbooks often motivate the various techniques associated with Bayesian statistics by contrasting them with those of frequentist (classical) inference. Indeed, adherents of the Bayesian approach often begin with a litany of charges leveled against the frequentist approach, charges that are usually due to an inadequate understanding of the latter; see section 5. The following quotation from O’Hagan (1994), p. 16, is typical of such charges:

   “Broadly speaking, some of the arguments in favour of the Bayesian approach are that it is fundamentally sound, very flexible, produces clear and direct inferences and makes use of all the available information. In contrast, the classical approach suffers from some philosophical flaws, has restrictive range of inferences with rather indirect meaning and ignores prior information.”

   To address the above claims by O’Hagan one needs to compare the two approaches in depth, focusing more on the underlying reasoning and their primary objectives.

   - Section 2: Probability and its various interpretations
   - Section 3: Frequentist Inference
   - Section 4: Bayesian Inference
   - Section 5: Charges against the frequentist approach to inference
2. 2. 2 Probability and its various interpretations

   For mathematical purposes Kolmogorov’s axiomatization of probability is perfectly adequate, but when it comes to statistical inference an interpretation is absolutely necessary. What does mathematical probability correspond to in the real world? This correspondence will determine the kind of inductive procedures one should follow in ‘learning from data’. That is, the interpretation of probability will determine the nature of the inductive (statistical) inference called for.

   From the viewpoint of statistical inference let us consider the following interpretations of probability:

   - (i) the classical (equally-likely),
   - (ii) the degrees of belief,
   - (iii) the relative frequency, $\Pr(A)$.

   For a better understanding of the various interpretations of probability one should consider them in the context in which they were first developed. As early as the 18th century all three basic interpretations of probability were being used in different contexts without much thought of choosing one interpretation for all purposes.

   The classical interpretation was used in the context of games of chance and was viewed as stemming from equal probability outcomes based on some sort of physical symmetry.

   The degrees of belief interpretation originated from attempts to quantify the relationship between the evidence presented in courts and the degree of conviction in the mind of the judge.

   The relative frequency interpretation originated from mortality and socio-economic data gathered over long periods of time from the 16th century onwards.
3. 3. The question to be considered is whether any of these interpretations can provide a pertinent link between Kolmogorov’s mathematical theory of probability and empirical modeling and inference.

   2.1 The classical interpretation of probability

   It is generally accepted that, historically, the theory of probability was developed in the context of games of chance such as casting dice or tossing coins. It was only natural then that the first interpretation of probability was inextricably bound up with the chance mechanism of such games. The first explicit statement of the classical definition of probability was given by Laplace at the beginning of the 19th century.

   The classical definition. Consider the random experiment $\mathcal{E}$ which has $N$ equally likely outcomes, and suppose event $A$ occurs when $N_A$ of them occur. Then, according to the classical definition of probability:

   $$P(A)=\frac{N_A}{N}.$$

   Example. Let $X$ be the random variable denoting the number of dots on the sides of a die. When the die is symmetric and homogeneous, the probability distribution of $X$ is:

   | $x$ | 1 | 2 | 3 | 4 | 5 | 6 |
   |---|---|---|---|---|---|---|
   | $f(x)$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ |

   The first important feature of this definition is its reliance on the nature of an explicit chance mechanism such as casting dice or tossing coins. Its second crucial feature is that it utilizes the apparent physical symmetry of the device underlying the chance mechanism to define probability by evaluating it as “the ratio of the number of outcomes favorable to the event
4. 4. to the total number of possible outcomes, each assumed to be equally likely” (see Laplace, 1814).

   For the purposes of providing the missing link between the mathematical concept of a statistical model and the notion of chance regularity, this definition of probability is inadequate for a number of reasons, including: (i) it is based on an explicit chance mechanism, (ii) the chance mechanism has a built-in physical symmetry that leads to equally likely outcomes, (iii) it assumes that one can partition the set of outcomes into a finite number of equally likely events.

   This definition has been severely criticized in the literature, but the critics tend to concentrate their arrows on the equally likely clause.

   - What do we mean by equally likely and how do we recognize equally likely outcomes?

   Laplace proposed a principle for justifying equally likely cases, known as the principle of insufficient reason, or the principle of indifference. It is based on the idea that if one has no reason to favor one outcome over another, the outcomes are considered equally likely. This principle has given rise to several paradoxes and has been called into question repeatedly (Hacking, 1975).

   In addition to the objection to the equally likely clause, there is one crucial objection to the classical definition: it assumes that one can partition the set of outcomes into a finite number of equally likely events.

   - What happens when the random experiment does not enjoy this symmetry, such as the case of a biased coin?
   - What about axiom A1 of the mathematical definition?
5. 5. In the case of an infinite sample space the denominator of the above definition will get us into trouble.

   Empirical modeling. The classical interpretation of probability is too restrictive for empirical modeling purposes.

   2.2 The ‘degrees of belief’ interpretation of probability

   Our interest in the degrees of belief interpretation of probability stems from the fact that it leads to an approach to statistical inference known as the Bayesian approach. The degrees of belief interpretation of probability comes in two flavors: the subjective and the rational.

   A. Degrees of subjective belief

   The subjective interpretation considers the probability of an event $A$ as based on the personal judgment of whoever is assigning the probability; the personal judgment being based on the individual’s experience and background. In this sense the probability of event $A$ is based on the person’s beliefs and information relating to the experiment giving rise to event $A$.

   Example. In the case of tossing a fair coin a person is likely to assign the subjective probability $P(H)=\frac{1}{2}$, because a symmetric coin would render $H$ (heads) and $T$ (tails) a priori equally likely. In the case where the person in question has additional information relating to the mechanism, such as that the coin is bent, the subjective probability is likely to change.

   De Finetti introduced a convenient way to think of subjective probabilities in terms of betting odds. Let us consider the case of betting on the occurrence of an event $A$ where somebody offers odds 2 to 1, or in ratio form $\frac{1}{2}$. If the person whose degrees of subjective belief we are trying to assess thinks that these are fair odds, then we can proceed to
6. 6. evaluate her subjective probability via:

   $$\frac{\frac{1}{2}}{1+\frac{1}{2}}=\frac{1}{3},\quad\text{i.e. } P(A)=\frac{1}{3}.$$

   More generally, if the subjective probability for the occurrence of the event $A$ is $p$ (i.e. $P(A)=p$), then the odds ratio $o$ and the corresponding subjective probability $p$ take the form:

   $$o=\frac{p}{1-p}\ \Rightarrow\ p=\frac{o}{1+o}.$$

   As we can see, the subjective dimension of this probability arises from the fact that it is the decision of a particular individual whether the odds are fair or not. Another individual might consider as fair the odds ratio $o'$, which implies that her subjective probability is $p'=\frac{o'}{1+o'}\neq p$. This is not surprising because the personal experiences which influence judgment often differ between individuals.

   The question which naturally arises at this stage is whether such personal subjective probabilities will behave in accordance with the mathematical definition of probability. The answer is yes, under certain restrictions, as demonstrated by Ramsey (1926), de Finetti (1937) and Savage (1954).

   B. Degrees of rational belief

   Another question with regard to the degrees of belief interpretation of probability is whether one could find some way to establish that a particular odds ratio will be considered fair by a rational person, assuming a formal definition of rationality. The idea is that in such a case the subjective dimension will become less personalistic.

   Keynes (1921) was the first to propose such an interpretation, often called logical probability. Carnap (1950) generalized and extended it, but the logical interpretation has been severely criticized on several grounds.
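The odds-to-probability mapping above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes; the function names `odds_to_probability` and `probability_to_odds` are hypothetical, and exact fractions are used only to make the 1/2 to 1/3 example transparent.

```python
from fractions import Fraction


def odds_to_probability(odds):
    """Map an odds ratio o = p / (1 - p) to the implied probability p = o / (1 + o)."""
    return odds / (1 + odds)


def probability_to_odds(p):
    """Inverse map p -> p / (1 - p), defined for 0 <= p < 1."""
    return p / (1 - p)


# Odds of 2 to 1 in ratio form 1/2, as in the betting example.
p = odds_to_probability(Fraction(1, 2))
print(p)                       # 1/3
print(probability_to_odds(p))  # 1/2
```

Two individuals who regard different odds as fair simply feed different values into the same mapping, which is where the subjective dimension enters.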
7. 7. Empirical modeling. How appropriate is the degrees of belief interpretation of probability for empirical modeling? If data $\mathbf{x}_0:=(x_1,\ldots,x_n)$ contain systematic information in the form of chance regularities, ‘stochasticity’ is a feature of real-world phenomena and exists independently of one’s beliefs. Hence, the degrees of belief interpretation has a limited role in selecting an appropriate statistical model.

   2.3 The frequency interpretation of probability

   The frequency interpretation of probability can be traced back to the statistical regularities established during the 18th and 19th centuries, based on demographic, anthropometric, economic and social (crimes, violent deaths, etc.) data. The analysis of these data led to an amazing conclusion:

   “despite the unpredictability at the individual level (people, firms etc.) there was a remarkable stability of the relative frequencies at the aggregate level (groups) over long periods of time.”

   In the context of the frequency interpretation, the probability of an event $A$ is viewed as an empirical regularity associated with this event.

   The frequentist interpretation. Consider the case where one is able to repeat an experiment under identical conditions, and denote the relative frequency of the event $A$ after $n$ trials by $\left(\frac{n_A}{n}\right)$. The frequency interpretation then views the probability of event $A$ as the limit of the relative frequency $\left(\frac{n_A}{n}\right)$ as the number of repetitions goes to infinity:

   $$\Pr(A)=\lim_{n\to\infty}\left(\frac{n_A}{n}\right). \tag{1}$$

   Can one prove the above mathematical claim? NO, and thus the von Mises attempt to provide a frequentist interpretation of probability using (1) is a dead end for
8. 8. purely mathematical reasons! Instead, (1) is a mathematically false statement that reflects the intuition underlying the frequentist interpretation of probability. It becomes a mathematically true statement when the limit is interpreted in probabilistic terms; see Spanos (2013).

   A formal justification for the frequentist interpretation as the limit of relative frequencies is grounded on the Strong Law of Large Numbers (SLLN), which gives precise meaning to the claim ‘the sequence of relative frequencies $\{\frac{n_A}{n}\}_{n=1}^{\infty}$ converges to $\theta$ as $n\to\infty$’. To see that, let $A=\{X=1\}$ and $\bar{A}=\{X=0\}$, so that $\frac{n_A}{n}=\frac{1}{n}\sum_{k=1}^{n}X_k:=\bar{X}_n$.

   Borel (1909). For an IID Bernoulli process $\{X_k,\ k\in\mathbb{N}\}$ defining the simple Bernoulli model:

   $$\mathcal{M}_{\theta}(\mathbf{x}):\ X_k\sim\text{BerIID}(\theta,\ \theta(1-\theta)),\ k\in\mathbb{N}, \tag{2}$$

   $$\mathbb{P}\left(\lim_{n\to\infty}\left(\frac{1}{n}\sum_{k=1}^{n}X_k\right)=\theta\right)=1. \tag{3}$$

   That is, as $n\to\infty$ the stochastic sequence $\{\bar{X}_n\}_{n=1}^{\infty}$, where $\bar{X}_n=\frac{1}{n}\sum_{k=1}^{n}X_k$, converges to a constant $\theta$ with probability one.

   [Fig. 1: The Strong Law of Large Numbers in action (sample average plotted against the index)]

   The long-run metaphor. It is also important to note
9. 9. that the long-run metaphor associated with the frequentist interpretation envisions repeating the mechanism represented by an IID Bernoulli process and observing the relative frequency process $\{\bar{X}_n\}_{n=1}^{\infty}$ (almost surely) approximate $\theta$ as $n\to\infty$.

   It is crucial to emphasize that, when viewed in the context of a statistical model $\mathcal{M}_{\theta}(\mathbf{x})$, the key dimension of the long-run metaphor is not the temporal aspect. Keynes’s tongue-in-cheek comment that "in the long-run we will all be dead" is totally misplaced in this context.

   The crucial dimension of the long-run is repeatability (in principle) of the data-generating process. To generate the realization shown below takes a fraction of a nanosecond! Moreover, the long-run can be operationalized on a computer for any statistical model using the statistical generating mechanism (GM).

   Example. In the case of the simple Normal model, one can generate as many sample realizations as wanted using pseudo-random numbers:

   $$X_k=\mu+\sigma\epsilon_k,\quad \epsilon_k\sim N(0,1),\quad k=1,2,\ldots,n.$$

   Circularity? The issue often raised, when invoking the SLLN as a justification for the frequency definition of probability, is that the argument suffers from circularity: one uses probability to define probability!

   This claim is based on confusion. The SLLN states that, under certain restrictions on the probabilistic structure of $\{X_k,\ k\in\mathbb{N}\}$, one can deduce:

   $$\mathbb{P}\left(\lim_{n\to\infty}\left(\frac{1}{n}\sum_{k=1}^{n}X_k\right)=\theta\right)=1. \tag{4}$$

   This is a measure-theoretic result which asserts that the convergence holds everywhere in a domain $D\subset S$ except on a subset $D_0$, the latter being a set of measure zero ($\mathbb{P}(D_0)=0$) -
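The claim that the long-run can be operationalized on a computer can be sketched as follows. This is an illustrative simulation only, not the notes' own code: the function names, the seed, and the chosen values of $\theta$, $\mu$, $\sigma$ and $n$ are assumptions made for the example, and Python's pseudo-random generator stands in for the statistical GM.

```python
import random


def relative_frequencies(theta, n, seed=0):
    """Running relative frequencies of 'success' for n IID Bernoulli(theta) draws."""
    rng = random.Random(seed)
    successes = 0
    path = []
    for k in range(1, n + 1):
        successes += 1 if rng.random() < theta else 0
        path.append(successes / k)
    return path


def normal_sample(mu, sigma, n, seed=0):
    """One realization from the GM X_k = mu + sigma * eps_k, eps_k ~ N(0, 1)."""
    rng = random.Random(seed)
    return [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]


# Hypothetical theta = 0.5: the running average settles near theta as n grows.
path = relative_frequencies(theta=0.5, n=100_000)
for n in (10, 100, 1_000, 100_000):
    print(n, round(path[n - 1], 4))

# Hypothetical mu = 10, sigma = 2: as many realizations as wanted, on demand.
sample = normal_sample(mu=10.0, sigma=2.0, n=50_000)
print(round(sum(sample) / len(sample), 3))
```

Changing the seed produces a different realization of the same process, which is the sense in which repeatability, rather than elapsed time, is the key dimension of the long-run.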
10. 10. the impossible event. This mathematical theorem links the probability $P(A)$ to the relative frequencies $\left(\frac{1}{n}\sum_{k=1}^{n}x_k\right)$ via the interpretive provisions:

    [i] data $\mathbf{x}_0:=(x_1,x_2,\ldots,x_n)$ is viewed as a ‘truly typical’ realization of the process $\{X_k,\ k\in\mathbb{N}\}$ specified by $\mathcal{M}_{\theta}(\mathbf{x})$,

    [ii] the ‘typicality’ of $\mathbf{x}_0$ (e.g. IID) can be assessed using trenchant Mis-Specification (M-S) testing.

    These provisions provide a pertinent link between the mathematical framework and the data-generating mechanism that gave rise to $\mathbf{x}_0$.

    3 Frequentist Inference

    Statistical model $\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta\},\ \mathbf{x}\in\mathbb{R}_X^n$ $\Longrightarrow$ Distribution of the sample $f(\mathbf{x};\boldsymbol{\theta}),\ \mathbf{x}\in\mathbb{R}_X^n$

    Data $\mathbf{x}_0:=(x_1,x_2,\ldots,x_n)$ $\longrightarrow$ Likelihood function $L(\boldsymbol{\theta};\mathbf{x}_0),\ \boldsymbol{\theta}\in\Theta$

    [Diagram: Frequentist approach to statistical inference]

    For the frequentist approach:

    [a] The interpretation of probability is the frequency interpretation.

    [b] The systematic statistical information in data $\mathbf{x}_0$, in the form of chance regularities, constitutes the only relevant statistical information for selecting the statistical model. Substantive information comes in the form of restrictions on statistical
11. 11. parameters, but should not be imposed at the outset; its validity needs to be established before it is imposed.

    [c] The primary aim of the frequentist approach is to learn from data $\mathbf{x}_0$ about the ‘true’ underlying data-generating mechanism $\mathcal{M}^{*}(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}^{*})\},\ \mathbf{x}\in\mathbb{R}_X^n$, where $\boldsymbol{\theta}^{*}$ denotes the true value of $\boldsymbol{\theta}$. In general, the expression ‘$\boldsymbol{\theta}^{*}$ denotes the true value of $\boldsymbol{\theta}$’ is a shorthand for saying that ‘data $\mathbf{x}_0$ constitute a realization of the sample $\mathbf{X}$ with distribution $f(\mathbf{x};\boldsymbol{\theta}^{*})$’. This is achieved by employing reliable and effective inference procedures that revolve around the unknown parameter(s) $\boldsymbol{\theta}$, based on ascertainable error probabilities.

    The underlying inductive reasoning comes in two forms:

    (i) factual: the true state of nature (estimation and prediction), whatever that happens to be, and

    (ii) hypothetical: various hypothetical scenarios are compared to what actually happened (hypothesis testing).

    [d] Frequentist inference is based exclusively on the distribution of the sample $f(\mathbf{x};\boldsymbol{\theta}),\ \mathbf{x}\in\mathbb{R}_X^n$. This is because all frequentist procedures (estimators, test statistics and predictors) are based on statistics of the form $Y_n=g(X_1,X_2,\ldots,X_n)$, whose sampling distribution is determined by $f(\mathbf{x};\boldsymbol{\theta})$ via:

    $$F_n(y)=\mathbb{P}(Y_n\leq y)=\underbrace{\int\!\!\int\cdots\int}_{\{\mathbf{x}:\ g(x_1,x_2,\ldots,x_n)\leq y\}} f(\mathbf{x};\boldsymbol{\theta})\,dx_1\,dx_2\cdots dx_n. \tag{5}$$

    Its importance stems from the fact that all relevant error probabilities (coverage, type I and II, power, p-value) associated with frequentist inference are based on such a sampling distribution. In the case where the density function $f_n(y)$ is continuous: $f_n(y;\boldsymbol{\theta})=\frac{dF_n(y)}{dy}$.
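12. 12. 3.0.1 The simple Bernoulli model

    The discussion that follows focuses on the simple Bernoulli model, specified by:

    $$\mathcal{M}_{\theta}(\mathbf{x}):\ X_k\sim\text{BerIID}(\theta,\ \theta(1-\theta)),\ k\in\mathbb{N}:=(1,2,\ldots,n,\ldots),$$

    where ‘BerIID’ stands for ‘Bernoulli, Independent and Identically Distributed’.

    Using the IID assumptions, one can derive the distribution of the sample:

    $$f(x_1,x_2,\ldots,x_n;\theta)\overset{\text{IID}}{=}\prod_{k=1}^{n}f(x_k;\theta)\overset{\text{Ber}}{=}\prod_{k=1}^{n}\theta^{x_k}(1-\theta)^{1-x_k}=\theta^{\sum_{k=1}^{n}x_k}(1-\theta)^{\sum_{k=1}^{n}(1-x_k)}=\theta^{n\bar{x}_n}(1-\theta)^{n(1-\bar{x}_n)}, \tag{6}$$

    where $\bar{x}_n=\frac{1}{n}\sum_{k=1}^{n}x_k$ is the sample mean, and $Y=n\bar{X}_n$ denotes the number of ‘successes’ ($X_k=1$) in $n$ trials, i.e.

    $$f(\mathbf{x};\theta)=\theta^{n\bar{x}_n}(1-\theta)^{n(1-\bar{x}_n)},\ \text{for all } \mathbf{x}\in\mathbb{R}_X^n=\{0,1\}^n. \tag{7}$$

    Viewed in terms of $y=n\bar{x}_n$, the random variable $Y$ is Binomially distributed.

    The likelihood function is defined as proportional to the distribution of the sample evaluated at $\mathbf{x}_0$:

    $$L(\theta;\mathbf{x}_0)\propto f(\mathbf{x}_0;\theta),\ \text{for all } \theta\in[0,1]. \tag{8}$$

    Note that the proportionality ($\propto$) is important because $L(\theta;\mathbf{x}_0)$ is viewed as a function of $\theta\in\Theta$, but $f(\mathbf{x};\theta)$ is a function of $\mathbf{x}\in\mathbb{R}_X^n$, and they usually have very different dimensions. Indeed, in the simple Bernoulli case $f(\mathbf{x};\theta)$ is discrete but $L(\theta;\mathbf{x}_0)$ is continuous.

    Example (a). For $y=n\bar{x}_n=4$, $n=20$ the likelihood function takes the form:

    $$L(\theta;\mathbf{x}_0)\propto\theta^{4}(1-\theta)^{16},\ \theta\in[0,1].$$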
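The likelihood in Example (a) can be evaluated numerically with a short sketch. The function name `bernoulli_likelihood` and the grid resolution are illustrative assumptions; the code evaluates the kernel $\theta^{y}(1-\theta)^{n-y}$ on a grid and locates its maximum, which is unaffected by the proportionality constant.

```python
def bernoulli_likelihood(theta, y, n):
    """Likelihood kernel theta^y * (1 - theta)^(n - y) for y successes in n trials."""
    return theta ** y * (1 - theta) ** (n - y)


y, n = 4, 20  # Example (a): 4 successes in 20 trials

# The data are held fixed; theta varies over a grid on [0, 1].
grid = [i / 1000 for i in range(1001)]
theta_max = max(grid, key=lambda t: bernoulli_likelihood(t, y, n))
print(theta_max)  # 0.2

# Rescaling the kernel by a positive constant does not move the maximizer.
theta_max_scaled = max(grid, key=lambda t: 1e6 * bernoulli_likelihood(t, y, n))
print(theta_max_scaled == theta_max)  # True
```

The grid maximizer coincides with the relative frequency $y/n=.2$, a continuous function of $\theta$ built from a discrete distribution over $\mathbf{x}$.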
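13. 13. [Fig. 1: $f(\mathbf{x};\theta)$, $\mathbf{x}\in\mathbb{R}_X^n$ (Binomial, n=20, p=0.2; probability plotted against $y$)]

    [Fig. 2: $L(\theta;\mathbf{x}_0)$, $\theta\in[0,1]$ (likelihood plotted against $\theta$)]

    The sampling distribution of $y$ is shown in fig. 1 and the normalized form of the likelihood function is given in fig. 2.

    3.0.2 The simple Normal model

    The discussion that follows focuses on the simple Normal model, specified by:

    $$\mathcal{M}_{\theta}(\mathbf{x}):\ X_k\sim\text{NIID}(\mu,\ \sigma^2),\ k\in\mathbb{N}:=(1,2,\ldots,n,\ldots).$$

    The distribution of the sample takes the form:

    $$f(\mathbf{x};\boldsymbol{\theta})=\prod_{k=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma^2}(x_k-\mu)^2\right)=\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n}\exp\left\{-\frac{1}{2\sigma^2}\sum_{k=1}^{n}(x_k-\mu)^2\right\}.$$

    This means that the likelihood function is:

    $$L(\mu,\sigma^2;\mathbf{x}_0)\propto\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n}\exp\left\{-\frac{1}{2\sigma^2}\sum_{k=1}^{n}(x_k-\mu)^2\right\},\ -\infty<\mu<\infty,\ \sigma^2>0.$$

    In the case where $\sigma^2$ is known this simplifies to:

    $$L(\mu;\mathbf{x}_0)\propto\exp\left\{-\frac{1}{2\sigma^2}\sum_{k=1}^{n}(x_k-\mu)^2\right\},\ -\infty<\mu<\infty.$$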
14. 14. 3.1 Frequentist Estimation (point)

    In general, the Maximum Likelihood (ML) estimator of $\theta$ is defined by:

    $$\hat{\theta}(\mathbf{X})=\arg\max_{\theta\in\Theta}\ \ln L(\theta;\mathbf{x}).$$

    It is important to distinguish between three different concepts: $\theta$, an unknown constant; $\hat{\theta}(\mathbf{X})$, an estimator of $\theta$; and $\hat{\theta}(\mathbf{x}_0)$, an estimate of $\theta$.

    In the case of the simple Bernoulli model, the first order condition for maximizing $\ln L(\theta;\mathbf{x})$ is:

    $$\frac{d\ln L(\theta;\mathbf{x})}{d\theta}=\frac{n\bar{x}_n}{\theta}-\frac{n(1-\bar{x}_n)}{1-\theta}=0,$$

    which, when solved for $\theta$, gives rise to the ML estimator:

    $$\hat{\theta}(\mathbf{X})=\bar{X}_n=\frac{1}{n}\sum_{k=1}^{n}X_k. \tag{9}$$

    The sampling distribution of this estimator is:

    $$\hat{\theta}(\mathbf{X})\sim\text{Bin}\left(\theta,\ \frac{\theta(1-\theta)}{n};\ n\right), \tag{10}$$

    where ‘$\sim\text{Bin}(m,\ v;\ n)$’ reads ‘distributed Binomially with mean $m$, variance $v$ and sample size $n$’.

    What is a good (optimal) estimator in frequentist inference? It is a statistic $h(\mathbf{X})$ that pin-points $\theta^{*}$, the true value of $\theta$, most effectively. Its capacity (effectiveness) to pin-point $\theta^{*}$ is defined in terms of several optimal properties, such as:

    (i) Unbiasedness: $E(\hat{\theta}(\mathbf{X}))=\theta^{*}$,

    (ii) Full Efficiency: $Var(\hat{\theta}(\mathbf{X}))=CR(\theta^{*})$, the Cramér-Rao lower bound,

    (iii) Strong Consistency: $\mathbb{P}(\lim_{n\to\infty}\hat{\theta}(\mathbf{X})=\theta^{*})=1$.

    Example. The estimator $\hat{\theta}(\mathbf{X})=\frac{1}{n}\sum_{k=1}^{n}X_k$, whose sampling distribution is given by (10), is optimal because it is unbiased, fully efficient and strongly consistent!
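The sampling distribution in (10) can be approximated by simulation. The sketch below is only an illustration under stated assumptions: the value `theta_star = 0.2`, the number of replications and the seed are hypothetical choices, and the function names are not from the notes. It draws many samples of size $n=20$, computes the ML estimate for each, and compares the mean and variance of the estimates with $\theta^{*}$ and $\theta^{*}(1-\theta^{*})/n$.

```python
import random
from statistics import mean, pvariance


def ml_estimate(sample):
    """ML estimate for the simple Bernoulli model: the sample mean."""
    return sum(sample) / len(sample)


def simulate_estimates(theta, n, replications, seed=0):
    """ML estimates from repeated IID Bernoulli(theta) samples of size n."""
    rng = random.Random(seed)
    return [
        ml_estimate([1 if rng.random() < theta else 0 for _ in range(n)])
        for _ in range(replications)
    ]


theta_star, n = 0.2, 20  # hypothetical 'true' value, used only to drive the simulation
estimates = simulate_estimates(theta_star, n, replications=20_000)

print(round(mean(estimates), 3))       # close to theta_star (unbiasedness)
print(round(pvariance(estimates), 4))  # close to theta_star * (1 - theta_star) / n
print(theta_star * (1 - theta_star) / n)
```

Each estimate takes one of the values $0, \frac{1}{20}, \ldots, 1$, which is the discreteness of the sampling distribution that the notes contrast with the continuity of the likelihood.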
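15. 15. 3.2 Confidence Intervals (CIs)

    The most straightforward procedure to derive a Confidence Interval (CI) is to find a pivotal quantity whose distribution under $\theta=\theta^{*}$ is known.

    A. For the simple Bernoulli model such a pivotal quantity can be specified using the ML estimator $\hat{\theta}(\mathbf{X})=\frac{1}{n}\sum_{k=1}^{n}X_k$ and its sampling distribution in (10):

    $$d(\mathbf{X};\theta)=\frac{\sqrt{n}(\hat{\theta}(\mathbf{X})-\theta)}{\sqrt{\theta(1-\theta)}}\ \overset{\theta=\theta^{*}}{\sim}\ \text{Bin}(0,\ 1;\ n). \tag{11}$$

    In general, an optimal CI begins with an optimal estimator. Since its sampling distribution is known, one can specify a two-sided CI of the form:

    $$\mathbb{P}\left(-c_{\frac{\alpha}{2}}\leq d(\mathbf{X};\theta)<c_{\frac{\alpha}{2}};\ \theta=\theta^{*}\right)=1-\alpha,$$

    and then ‘solve’ for $\theta$ to derive the optimal $(1-\alpha)$ Confidence Interval (CI):

    $$\mathbb{P}\left(\left[\hat{\theta}-c_{\frac{\alpha}{2}}\sqrt{\tfrac{\hat{\theta}(1-\hat{\theta})}{n}}\right]\leq\theta<\left[\hat{\theta}+c_{\frac{\alpha}{2}}\sqrt{\tfrac{\hat{\theta}(1-\hat{\theta})}{n}}\right]\right)=1-\alpha, \tag{12}$$

    where $\hat{\theta}$ is a shorthand for $\hat{\theta}(\mathbf{X})$. In practice, finding $c_{\frac{\alpha}{2}}$ can be cumbersome, and in many cases the distribution in (11) can be approximated (for a large enough sample size, say $n\geq 20$) by the $N(0,1)$ distribution.

    Example (a). In the case where $y=n\bar{x}_n=4$, $\hat{\theta}=.2$, $n=20$, the approximate 95% observed CI for $\theta$ is:

    $$\left[.2-1.96\sqrt{\tfrac{.2(.8)}{20}}\right]=.025\leq\theta<.375=\left[.2+1.96\sqrt{\tfrac{.2(.8)}{20}}\right].$$

    Example (b). In the case where $y=n\bar{x}_n=12$, $\hat{\theta}=.6$, $n=20$, the approximate 95% observed CI for $\theta$ is:

    $$\left[.6-1.96\sqrt{\tfrac{.6(.4)}{20}}\right]=.385\leq\theta<.815=\left[.6+1.96\sqrt{\tfrac{.6(.4)}{20}}\right].$$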
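The two observed intervals can be reproduced with a short sketch based on the Normal approximation mentioned above. The function name `approx_bernoulli_ci` is hypothetical, and `statistics.NormalDist` from the Python standard library supplies the $N(0,1)$ quantile in place of the 1.96 used in the examples.

```python
from math import sqrt
from statistics import NormalDist


def approx_bernoulli_ci(y, n, alpha=0.05):
    """Approximate (1 - alpha) observed CI for theta using the N(0, 1) approximation."""
    theta_hat = y / n
    c = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    half_width = c * sqrt(theta_hat * (1 - theta_hat) / n)
    return theta_hat - half_width, theta_hat + half_width


for y in (4, 12):  # Examples (a) and (b), both with n = 20
    lower, upper = approx_bernoulli_ci(y, 20)
    print(y, round(lower, 3), round(upper, 3))
```

The printed bounds agree with the intervals in Examples (a) and (b) to three decimals; the coverage statement in (12) refers to the random interval before the data are plugged in, not to these observed numbers.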
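16. 16. B. For the simple Normal model (with $\sigma^2$ known) the pivotal quantity can be specified using the ML estimator $\bar{X}_n=\frac{1}{n}\sum_{k=1}^{n}X_k$, whose sampling distribution is:

    $$d(\mathbf{X};\mu)=\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}\ \overset{\mu=\mu^{*}}{\sim}\ N(0,1). \tag{13}$$

    Using (13) one can specify a two-sided CI of the form:

    $$\mathbb{P}\left(-c_{\frac{\alpha}{2}}\leq d(\mathbf{X};\mu)<c_{\frac{\alpha}{2}};\ \mu=\mu^{*}\right)=1-\alpha,$$

    and then ‘solve’ for $\mu$ to derive the optimal $(1-\alpha)$ Confidence Interval (CI):

    $$\mathbb{P}\left(\left[\bar{X}_n-c_{\frac{\alpha}{2}}\tfrac{\sigma}{\sqrt{n}}\right]\leq\mu<\left[\bar{X}_n+c_{\frac{\alpha}{2}}\tfrac{\sigma}{\sqrt{n}}\right]\right)=1-\alpha. \tag{14}$$

    3.3 Frequentist Testing

    A. In the context of the simple Bernoulli model, consider the hypotheses:

    $$H_0:\ \theta=\theta_0\ \text{ vs. }\ H_1:\ \theta>\theta_0. \tag{15}$$

    As in the case of interval estimation, a good test procedure begins with a good estimator, which gives rise to a test statistic. In this case the test statistic looks similar to (11):

    $$d(\mathbf{X})=\frac{\sqrt{n}(\hat{\theta}(\mathbf{X})-\theta_0)}{\sqrt{\theta_0(1-\theta_0)}}\ \overset{\theta=\theta_0}{\sim}\ \text{Bin}(0,\ 1;\ n),$$

    where $\hat{\theta}(\mathbf{X}):=\bar{X}_n$, but in fact it has two crucial differences:

    (i) the test statistic $d(\mathbf{X})$ involves no unknown parameters, unlike $d(\mathbf{X};\theta)$, and

    (ii) the evaluation of its sampling distribution is under the null ($\theta=\theta_0$), which is hypothetical, and not under $\theta=\theta^{*}$, which is factual.

    The optimal $\alpha$-significance level test takes the form:

    $$d(\mathbf{X})=\frac{\sqrt{n}(\bar{X}_n-\theta_0)}{\sqrt{\theta_0(1-\theta_0)}},\quad C_1(\alpha):=\{\mathbf{x}:\ d(\mathbf{x})>c_{\alpha}\}. \tag{16}$$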
17. 17. Example. In the case where $\bar{x}_n=.75$, $n=20$, $\alpha=.05$ ($c_{\frac{\alpha}{2}}=1.96$), $\theta_0=.5$, test (16) yields:

    $$d(\mathbf{x}_0)=\frac{\sqrt{20}(.75-.5)}{\sqrt{.5(1-.5)}}=2.236,$$

    leading to rejecting $H_0$. The p-value associated with this test is:

    $$\mathbb{P}(d(\mathbf{X})>2.236;\ H_0)=.013. \tag{17}$$

    Example. For $\bar{x}_n=.6$, $n=20$, $\alpha=.05$ ($c_{\frac{\alpha}{2}}=1.96$), $\theta_0=.5$, test (16) yields:

    $$d(\mathbf{x}_0)=\frac{\sqrt{20}(.6-.5)}{\sqrt{.5(1-.5)}}=.894,$$

    leading to accepting $H_0$. The p-value associated with this test is:

    $$\mathbb{P}(d(\mathbf{X})>.894;\ H_0)=.186. \tag{18}$$

    B. In the context of the simple Normal model (with $\sigma^2$ known) consider the hypotheses:

    $$H_0:\ \mu=\mu_0\ \text{ vs. }\ H_1:\ \mu>\mu_0.$$

    In this case the test statistic takes the form:

    $$d(\mathbf{X})=\frac{\sqrt{n}(\bar{X}_n-\mu_0)}{\sigma}\ \overset{\mu=\mu_0}{\sim}\ N(0,1),$$

    and gives rise to the optimal $\alpha$-significance level test:

    $$d(\mathbf{X})=\frac{\sqrt{n}(\bar{X}_n-\mu_0)}{\sigma},\quad C_1(\alpha):=\{\mathbf{x}:\ d(\mathbf{x})>c_{\alpha}\}.$$

    Example. For $\mu_0=10$, $\sigma=2$, $n=100$, $\alpha=.05$ ($c_{\frac{\alpha}{2}}=1.96$), $\bar{x}_n=10.6$ this yields:

    $$d(\mathbf{x}_0)=\frac{\sqrt{100}(10.6-10)}{2}=3.0,$$

    leading to rejecting $H_0$. The p-value associated with this test is:

    $$\mathbb{P}(d(\mathbf{X})>3.0;\ H_0)=.001. \tag{19}$$
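The three testing examples can be checked numerically with the sketch below. It assumes the $N(0,1)$ approximation for the Bernoulli test statistic (as suggested for $n\geq 20$ in the discussion of confidence intervals); the function names are hypothetical, and the sample mean in the Normal example is taken to be 10.6, the value used in the computation of $d(\mathbf{x}_0)$.

```python
from math import sqrt
from statistics import NormalDist


def bernoulli_test(xbar, n, theta0):
    """Test statistic and one-sided p-value for H0: theta = theta0 vs. H1: theta > theta0."""
    d = sqrt(n) * (xbar - theta0) / sqrt(theta0 * (1 - theta0))
    return d, 1 - NormalDist().cdf(d)  # p-value via the N(0, 1) approximation


def normal_test(xbar, n, mu0, sigma):
    """Test statistic and one-sided p-value for H0: mu = mu0 vs. H1: mu > mu0, sigma known."""
    d = sqrt(n) * (xbar - mu0) / sigma
    return d, 1 - NormalDist().cdf(d)


for label, (d, p) in [
    ("Bernoulli, xbar=.75", bernoulli_test(0.75, 20, 0.5)),
    ("Bernoulli, xbar=.60", bernoulli_test(0.60, 20, 0.5)),
    ("Normal, xbar=10.6", normal_test(10.6, 100, 10.0, 2.0)),
]:
    print(label, round(d, 3), round(p, 3))
```

In each case the p-value is evaluated under the null, i.e. under a hypothetical scenario, which is the second of the two differences between $d(\mathbf{X})$ and the pivot $d(\mathbf{X};\theta)$.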
18. 18. 4 Bayesian Inference

    4.1 The Bayesian framework

    Bayesian inference begins with a statistical model:

    $$\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta\},\ \mathbf{x}\in\mathbb{R}_X^n,\ \text{for } \boldsymbol{\theta}\in\Theta\subset\mathbb{R}^m,\ m<n, \tag{20}$$

    where $f(\mathbf{x};\boldsymbol{\theta})$ is the distribution of the sample $\mathbf{X}:=(X_1,\ldots,X_n)$, $\mathbb{R}_X^n$ is the sample space and $\Theta$ the parameter space. Bayesian inference modifies the frequentist inferential set up, based exclusively on (20), in two crucial respects:

    (A) It views the unknown parameter(s) $\boldsymbol{\theta}$ as random variables with their own distribution, known as the prior distribution:

    $$\pi(\cdot):\ \Theta\to[0,1],$$

    which represents one’s a priori assessment of how likely the various values of $\boldsymbol{\theta}$ in $\Theta$ are, which amounts to ranking the different models $\mathcal{M}_{\theta}(\mathbf{x})$ for all $\boldsymbol{\theta}\in\Theta$. In frequentist inference $\boldsymbol{\theta}$ is viewed as a set of unknown constants indexing $f(\mathbf{x};\boldsymbol{\theta}),\ \mathbf{x}\in\mathbb{R}_X^n$.

    (B) It re-interprets the distribution of the sample as conditional on the unknown parameters $\boldsymbol{\theta}$, denoted by $f(\mathbf{x}|\boldsymbol{\theta})$.

    Taken together these modifications imply that for Bayesians the joint distribution of the sample is re-defined as:

    $$f(\mathbf{x},\boldsymbol{\theta})=f(\mathbf{x}|\boldsymbol{\theta})\cdot\pi(\boldsymbol{\theta}),\ \forall\boldsymbol{\theta}\in\Theta,\ \forall\mathbf{x}\in\mathbb{R}_X^n, \tag{21}$$

    where $\forall$ denotes ‘for all’.

    Distinguishing criteria:

    [a] The Bayesian approach interprets probability as degrees of belief [subjective, logical or rational].

    [b] In the context of Bayesian inference, relevant information includes: (i) the data $\mathbf{x}_0:=(x_1,x_2,\ldots,x_n)$ and (ii) the prior distribution $\pi(\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta$.
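19. 19. [c] The primary aim of the Bayesian approach is to revise the initial ranking $\pi(\boldsymbol{\theta}),\ \forall\boldsymbol{\theta}\in\Theta$, in light of the data $\mathbf{x}_0$, as summarized by $L(\boldsymbol{\theta}|\mathbf{x}_0)$, and to update the ranking in terms of the posterior distribution derived using Bayes’ rule:

    $$\pi(\boldsymbol{\theta}|\mathbf{x}_0)=\frac{f(\mathbf{x}_0|\boldsymbol{\theta})\cdot\pi(\boldsymbol{\theta})}{\int_{\Theta}f(\mathbf{x}_0|\boldsymbol{\theta})\cdot\pi(\boldsymbol{\theta})\,d\boldsymbol{\theta}}\propto L(\boldsymbol{\theta}|\mathbf{x}_0)\cdot\pi(\boldsymbol{\theta}),\ \forall\boldsymbol{\theta}\in\Theta, \tag{22}$$

    where $L(\boldsymbol{\theta}|\mathbf{x}_0)\propto f(\mathbf{x}_0|\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta$, denotes the likelihood function, as re-interpreted by Bayesians.

    Remark: in relation to (22) it is important to note that, contrary to claims by Bayesians (e.g. Ghosh et al., 2006, p. 31):

    “$f(\mathbf{x}_0|\boldsymbol{\theta})$ is the density of $\mathbf{X}$ interpreted as the conditional density of $\mathbf{X}$ given $\boldsymbol{\theta}$. The numerator is the joint density of $\boldsymbol{\theta}$ and $\mathbf{X}$ and the denominator is the marginal density of $\mathbf{X}$”

    (i) $f(\mathbf{x}_0|\boldsymbol{\theta})$ is not “the conditional density of $\mathbf{X}$ given $\boldsymbol{\theta}$”, since the latter is defined by $f(\mathbf{x}|\boldsymbol{\theta}),\ \forall\mathbf{x}\in\mathbb{R}_X^n$,

    (ii) the numerator is not “the joint density of $\boldsymbol{\theta}$ and $\mathbf{X}$”, since the latter necessitates the double quantifier as in (21),

    (iii) the denominator is not “the marginal density of $\mathbf{X}$”, since the latter is defined by $f(\mathbf{x})=\int_{\Theta}f(\mathbf{x}|\boldsymbol{\theta})\cdot\pi(\boldsymbol{\theta})\,d\boldsymbol{\theta},\ \forall\mathbf{x}\in\mathbb{R}_X^n$; see Spanos (2014) for the details.

    Prior probabilities $\pi(\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta$ $\Downarrow$ Statistical model $\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta\},\ \mathbf{x}\in\mathbb{R}_X^n$ $\Uparrow$ Data $\mathbf{x}_0:=(x_1,\ldots,x_n)$

    Bayes’ rule $\Longrightarrow$ Posterior Distribution $\pi(\boldsymbol{\theta}|\mathbf{x}_0)\propto\pi(\boldsymbol{\theta})\cdot L(\boldsymbol{\theta}|\mathbf{x}_0)$

    [Diagram: The Bayesian approach to statistical inference]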
20. 20. A famous Bayesian, Savage (1954), summarized Bayesian inference succinctly by:

    “Inference means for us the change of opinion induced by evidence on the application of Bayes’ theorem.” (p. 178)

    O’Hagan (1994) is more specific:

    “Having obtained the posterior density $\pi(\boldsymbol{\theta}|\mathbf{x}_0)$, the final step of the Bayesian method is to derive from it suitable inference statements. The most usual inference question is this: After seeing the data $\mathbf{x}_0$, what do we now know about the parameter $\boldsymbol{\theta}$? The only answer to this question is to present the entire posterior distribution.” (p. 6)

    In this sense, learning from data in the context of the Bayesian perspective pertains to how the original beliefs $\pi(\boldsymbol{\theta})$ are revised in light of data $\mathbf{x}_0$, the revision coming in the form of the posterior: $\pi(\boldsymbol{\theta}|\mathbf{x}_0),\ \forall\boldsymbol{\theta}\in\Theta$.

    [d] For inference purposes, the only relevant point in the sample space $\mathbb{R}_X^n$ is just the data $\mathbf{x}_0$, as summarized by $L(\boldsymbol{\theta}|\mathbf{x}_0),\ \boldsymbol{\theta}\in\Theta$. This feature of Bayesian inference is formalized by the Likelihood Principle.

    Likelihood Principle. For inference purposes the only relevant sample information pertaining to $\boldsymbol{\theta}$ is contained in the likelihood function $L(\mathbf{x}_0|\boldsymbol{\theta}),\ \forall\boldsymbol{\theta}\in\Theta$. Moreover, two sample realizations $\mathbf{x}_0$ and $\mathbf{y}_0$ contain the same information about $\boldsymbol{\theta}$ if their likelihoods are proportional to one another, i.e. $L(\mathbf{x}_0|\boldsymbol{\theta})=c\,L(\mathbf{y}_0|\boldsymbol{\theta})$ for some constant $c>0$ (Berger and Wolpert, 1988, p. 19).

    Indeed, Bayesians take delight in poking fun at frequentist testing by quoting Jeffreys’s (1939) remark about the ‘absurdity’ of invoking realizations $\mathbf{x}\in\mathbb{R}_X^n$ other than $\mathbf{x}_0$:

    “What the use of P [p-value] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems
21. 21. a remarkable procedure.” (p. 385) [ha, ha, ha!!!!!]

    Frequentist inference procedures, such as estimation (point and interval), hypothesis testing and prediction, DO invoke realizations $\mathbf{x}\in\mathbb{R}_X^n$ other than $\mathbf{x}_0$, contravening the LP.

    4.2 The posterior distribution and its role in inference

    Bayesian inference begins with a statistical model $\mathcal{M}_{\theta}(\mathbf{x})$, like the frequentist approach, but it adds a prior distribution $\pi(\theta),\ \theta\in\Theta$, for the unknown parameter $\theta$, viewed as a random variable with its own distribution.

    A. In the case of the simple Bernoulli model, the prior of choice for Bayesians is the Beta distribution, defined in terms of two parameters $\alpha$ and $\beta$. For the discussion that follows let us assume that the prior $\pi(\theta)$ is Beta($\alpha$, $\beta$) distributed with density function:

    $$\pi(\theta)=\frac{1}{B(\alpha,\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1},\ \alpha>0,\ \beta>0,\ 0<\theta<1. \tag{23}$$

    The Beta distribution is very flexible in the sense that for different values of the parameters $(\alpha,\beta)$ the density function can take many different shapes; see fig. 4. Such a prior will be proper because it covers the whole of the parameter space and it integrates to one, i.e. $\int_0^1\pi(\theta)\,d\theta=1$.

    Bayesian inferences are based on the posterior distribution of $\theta$ given data $\mathbf{x}_0$, derived via Bayes’ formula:

    $$\pi(\theta|\mathbf{x}_0)=\frac{\pi(\theta)f(\mathbf{x}_0|\theta)}{\int_{\theta\in[0,1]}\pi(\theta)f(\mathbf{x}_0|\theta)\,d\theta}=\frac{\pi(\theta)f(\mathbf{x}_0|\theta)}{f(\mathbf{x}_0)},\ \forall\theta\in[0,1]. \tag{24}$$

    The denominator $f(\mathbf{x}_0)=\int_0^1\pi(\theta)f(\mathbf{x}_0|\theta)\,d\theta$ is a normalizing constant, which is important to ensure that $\pi(\theta|\mathbf{x}_0)$ is a proper density function, i.e. $\int_{\theta\in[0,1]}\pi(\theta|\mathbf{x}_0)\,d\theta=1$.
22. 22. [Fig. 4: Beta($\alpha$, $\beta$) densities for different values of $(\alpha,\beta)$]

    Example. In the case where $f(\mathbf{x}_0|\theta)$ is given by (6) and $\pi(\theta)$ is given by (23), the posterior distribution is:

    $$\pi(\theta|\mathbf{x}_0)\propto\left(\frac{1}{B(\alpha,\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}\right)\left[\theta^{n\bar{x}_n}(1-\theta)^{n(1-\bar{x}_n)}\right]=\frac{1}{B(\alpha,\beta)}\left[\theta^{n\bar{x}_n+\alpha-1}(1-\theta)^{n(1-\bar{x}_n)+\beta-1}\right]. \tag{25}$$

    In view of the formula in (23), this is recognized as an ‘unnormalized’ density of a Beta($\alpha^{*}$, $\beta^{*}$), where:

    $$\alpha^{*}=n\bar{x}_n+\alpha,\quad \beta^{*}=n(1-\bar{x}_n)+\beta. \tag{26}$$

    The fact that the prior distribution is Beta($\alpha$, $\beta$) and the posterior is also Beta($\alpha^{*}$, $\beta^{*}$) indicates that the Beta distribution is conjugate to the Bernoulli likelihood (Binomial), i.e., the Bernoulli and Beta constitute a conjugate pair because the prior and posterior belong to the same family of distributions.

    [Fig. 5: Jeffreys’ invariant prior for the Binomial, Beta($\alpha$=0.5, $\beta$=0.5): $\pi(\theta)=\frac{1}{B(.5,.5)}\theta^{-.5}(1-\theta)^{-.5}$]
23. 23. Example. For Jeffreys’ prior $\theta\sim\text{Beta}(.5,.5)$ (see fig. 5) and $n\bar{x}_n=4$, $n=20$:

    $$L(\theta;\mathbf{x}_0)\propto\theta^{4}(1-\theta)^{16},\ \theta\in[0,1],$$

    $$\pi(\theta|\mathbf{x}_0)=\frac{1}{B(\alpha^{*},\beta^{*})}\left[\theta^{n\bar{x}_n+\alpha-1}(1-\theta)^{n(1-\bar{x}_n)+\beta-1}\right], \tag{27}$$

    i.e. $\pi(\theta|\mathbf{x}_0)$ is Beta($\alpha^{*}$, $\beta^{*}$) with $\alpha^{*}=n\bar{x}_n+\alpha=4.5$, $\beta^{*}=n(1-\bar{x}_n)+\beta=16.5$.

    [Fig. 6: The (scaled) likelihood $L(\theta;\ n\bar{x}_n=4)$]

    [Fig. 7: Beta prior vs. posterior, $\pi(\theta)$ vs. $\pi(\theta|\mathbf{x}_0)$]

    The (scaled) likelihood $L(\theta;\mathbf{x}_0)$ is given in fig. 6 and the posterior $\pi(\theta|\mathbf{x}_0)$ is compared to the prior $\pi(\theta)$ in fig. 7. In fig. 8, the likelihood is compared to the posterior. Note that all three are continuous, but the sampling distribution of $\bar{X}_n\sim\text{Bin}\left(\theta,\ \frac{\theta(1-\theta)}{n};\ n\right)$ is discrete; all relevant frequentist error probabilities come from this distribution.

    [Fig. 8: Likelihood vs. posterior, $\pi(\theta|\mathbf{x}_0)$ (Beta(4.5, 16.5)) vs. $L(\theta;\ n\bar{x}_n=4)$ (scaled, Beta(5, 17))]

    [Fig. 9: $y\sim\text{Bin}(\theta=.2;\ n=20)$]
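The conjugate update in (26) reduces to simple arithmetic on the Beta parameters, as the following sketch illustrates. The function names are hypothetical; the posterior mean and the numerical integration check are additions for illustration only and are not part of the notes' example.

```python
from math import exp, lgamma, log


def beta_posterior(alpha, beta, y, n):
    """Conjugate update: Beta(alpha, beta) prior, y successes in n Bernoulli trials."""
    return alpha + y, beta + (n - y)


def beta_pdf(theta, a, b):
    """Beta(a, b) density at 0 < theta < 1, computed on the log scale."""
    log_B = lgamma(a) + lgamma(b) - lgamma(a + b)
    return exp((a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_B)


# Jeffreys prior Beta(.5, .5) with 4 successes in 20 trials.
a_star, b_star = beta_posterior(0.5, 0.5, y=4, n=20)
print(a_star, b_star)                       # 4.5 16.5
print(round(a_star / (a_star + b_star), 4)) # posterior mean, 0.2143

# Midpoint-rule check that the posterior density integrates to (about) one.
m = 10_000
integral = sum(beta_pdf((i + 0.5) / m, a_star, b_star) for i in range(m)) / m
print(round(integral, 4))
```

Because the update only adds counts to the prior parameters, the normalizing constant in (24) never has to be computed explicitly for this conjugate pair.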
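24. 24. B. In the case of the simple Normal model (with $\sigma^2$ known) the prior often selected is Normal:

    $$\pi(\mu)=\frac{1}{\sigma_0\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\right),\ \text{for } \mu\in\mathbb{R},$$

    denoted by $\mu\sim N(\mu_0,\ \sigma_0^2)$, where $(\mu_0,\ \sigma_0^2)$ are known values. Hence, the posterior is defined by:

    $$\pi(\mu|\mathbf{x}_0)\propto\pi(\mu)f(\mathbf{x}_0;\mu)=\left[\frac{1}{\sigma_0\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\right)\right]\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n}\exp\left\{-\frac{1}{2\sigma^2}\sum_{k=1}^{n}(x_k-\mu)^2\right\}\right]$$

    $$\propto\exp\left\{-\frac{1}{2}\left[\left(\frac{\mu-\mu_0}{\sigma_0}\right)^2+\sum_{k=1}^{n}\left(\frac{x_k-\mu}{\sigma}\right)^2\right]\right\}\propto\exp\left\{-\frac{1}{2}\mu^2\left(\frac{1}{\sigma_0^2}+\frac{n}{\sigma^2}\right)+\mu\left(\frac{\mu_0}{\sigma_0^2}+\frac{n\bar{x}_n}{\sigma^2}\right)\right\}\propto\exp\left\{-\frac{1}{2\sigma_1^2}(\mu-\mu_1)^2\right\}, \tag{28}$$

    where the parameters $(\mu_1,\ \sigma_1^2)$ of the posterior (after scaling), $\pi(\mu|\mathbf{x}_0)\sim N(\mu_1,\ \sigma_1^2)$, take the particular form:

    $$\mu_1=\sigma_1^2\left(\frac{\mu_0}{\sigma_0^2}+\frac{n\bar{x}_n}{\sigma^2}\right),\quad \sigma_1^2=\left(\frac{1}{\sigma_0^2}+\frac{n}{\sigma^2}\right)^{-1}=\frac{\sigma_0^2\sigma^2}{n\sigma_0^2+\sigma^2}.$$

    After some manipulation $\mu_1$ can be expressed in the form:

    $$\mu_1=\lambda\bar{x}_n+(1-\lambda)\mu_0,\ \text{where } \lambda=\frac{\sigma_0^2}{\sigma_0^2+\frac{\sigma^2}{n}}=\sigma_1^2\left(\frac{n}{\sigma^2}\right),$$

    which indicates that it is a convex combination of $\bar{x}_n$ and $\mu_0$, with the variance of the prior, $\sigma_0^2$, and the variance of the MLE, $Var(\bar{X}_n)=\frac{\sigma^2}{n}$, providing the weights. Moreover, as $n\to\infty$, $\lambda\to 1$ and $\mu_1\to\bar{x}_n$. In words, asymptotically (as $n\to\infty$) the mean of the posterior $\mu_1$ converges to the sample mean $\bar{x}_n$.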
25. 25. Example. (a) Consider the case where $\theta \sim N(8, 1)$, $\sigma=2$, $n=20$, $\bar{x}=11.6$. The posterior is:

$\pi(\theta|x_0) \sim N(11.022,\ .167)$, where $\sigma_1^2 = \left(\frac{1}{1}+\frac{20}{4}\right)^{-1} = .167$ and $\mu_1 = .167\left(\frac{8}{1}+\frac{20(11.6)}{4}\right) = 11.022$ (using the rounded value of $\sigma_1^2$).

[Fig. 10: Normal prior vs. posterior, $\pi(\theta)$ vs. $\pi(\theta|x_0)$, density against theta]
[Fig. 11: Normal posterior vs. likelihood, $\pi(\theta|x_0)$ vs. $L(\theta; x_0)$, density against theta]

(b) Consider the case where the prior is more precise, i.e. $\sigma_0^2=.5$: $\theta \sim N(8, .5)$, $\sigma=2$, $n=20$, $\bar{x}=11.6$. The posterior has a smaller variance:

$\pi(\theta|x_0) \sim N(10.582,\ .143)$, where $\sigma_1^2 = \left(\frac{1}{.5}+\frac{20}{4}\right)^{-1} = .143$ and $\mu_1 = .143\left(\frac{8}{.5}+\frac{20(11.6)}{4}\right) = 10.582$ (using the rounded value of $\sigma_1^2$).

[Fig. 12: Normal prior vs. posterior, $\pi(\theta)$ vs. $\pi(\theta|x_0)$, density against theta]
[Fig. 13: Normal posterior vs. likelihood, $\pi(\theta|x_0)$ vs. $L(\theta; x_0)$, density against theta]
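The Normal prior to Normal posterior update of (28) can be checked numerically. The sketch below is an illustration only; `normal_posterior` is a hypothetical helper name. In exact arithmetic the two posterior means come out as 11.0 and 74/7 (about 10.571); the values 11.022 and 10.582 quoted in the example arise when the variance is first rounded to .167 and .143.

```python
def normal_posterior(mu0, var0, sigma2, n, xbar):
    """Posterior mean, posterior variance and weight lambda for a N(mu0, var0) prior
    on the mean of a Normal model with known variance sigma2."""
    precision = 1.0 / var0 + n / sigma2           # prior precision + data precision
    var1 = 1.0 / precision
    mu1 = var1 * (mu0 / var0 + n * xbar / sigma2)
    lam = var1 * n / sigma2                       # weight on the sample mean
    return mu1, var1, lam

mu1_a, var1_a, lam_a = normal_posterior(mu0=8.0, var0=1.0, sigma2=4.0, n=20, xbar=11.6)
mu1_b, var1_b, lam_b = normal_posterior(mu0=8.0, var0=0.5, sigma2=4.0, n=20, xbar=11.6)
print(mu1_a, var1_a, lam_a)
print(mu1_b, var1_b, lam_b)

# The convex-combination form gives the same posterior mean.
print(lam_a * 11.6 + (1 - lam_a) * 8.0)
```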
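26. 26. Example. Consider the case of the simple Normal model (with $\sigma^2$ known), where the prior is Uniform:

$\pi(\theta)=1$ for $\theta\in\mathbb{R}:=(-\infty, \infty)$.

This prior is said to be improper because it does not integrate to one, since $\int_{-\infty}^{\infty}\pi(\theta)\,d\theta = \infty$ for any $\pi(\theta)=c\in(0,1]$.

In this case the posterior is proportional to the likelihood:

$\pi(\theta|x_0) \propto L(\theta; x_0) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n}\exp\left\{-\frac{1}{2\sigma^2}\sum_{k=1}^{n}(x_k-\theta)^2\right\}, \quad -\infty<\theta<\infty$

$= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n}\exp\left\{-\frac{n}{2\sigma^2}\left[\frac{1}{n}\sum_{k=1}^{n}(x_k-\bar{x})^2 + (\bar{x}-\theta)^2\right]\right\}$

$= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n}\exp\left\{-\frac{ns^2}{2\sigma^2}\right\}\exp\left\{-\frac{n}{2\sigma^2}(\theta-\bar{x})^2\right\}$ (29)

Hence, ignoring the constant term $\exp\{-\frac{ns^2}{2\sigma^2}\}$, where $s^2=\frac{1}{n}\sum_{k=1}^{n}(x_k-\bar{x})^2$, the posterior is Normally distributed:

$\pi(\theta|x_0) \sim N\left(\bar{x}, \frac{\sigma^2}{n}\right), \quad -\infty<\theta<\infty$ (30)

Note that in (30) the random variable is $\theta$, which varies with different values of the unknown parameter ($\theta\in(-\infty,\infty)$), and $\bar{x}$ is a constant. In contrast, the sampling distribution of $\bar{X}$:

$\bar{X} \sim N\left(\theta, \frac{\sigma^2}{n}\right), \quad x\in\mathbb{R}^n_X$ (31)

varies with different values of the sample realization ($x\in\mathbb{R}^n_X$). Hence, any talk about results with identical tail areas has to be qualified by what the two tail areas really represent.

► The intuitive difference between the two distributions is that one can draw (30) but not (31), since $\theta$ is unknown!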
27. 27. 4.3 Bayesian Point Estimation

According to O'Hagan (1994): “Classical inference theory is very concerned with constructing good inference rules. The primary concern of Bayesian inference, ..., is entirely different. The objective [of Bayesian inference] is to extract information concerning θ from the posterior distribution, and to present it helpfully via effective summaries. There are two criteria in this process. The first is to identify interesting features of the posterior distribution. ... The second criterion is good communication. Summaries should be chosen to convey clearly and succinctly all the features of interest.” (p. 14)

In this sense, the frequentist inference procedures of estimation, testing and prediction are viewed by Bayesians as stylized inferences which, from the Bayesian perspective, represent nothing more than different ways to summarize relevant information in the posterior distribution $\pi(\theta|x_0)$. Hence, the relevant criteria for 'goodness' (optimality) pertain to how well they identify and communicate relevant/interesting features of the posterior.

In the case of the stylized inference of a point estimate for $\theta$, one can think of several features of the posterior $\pi(\theta|x_0)$, such as a measure of location, that could potentially play such a role by indicating the most 'representative' value of $\theta$ in $[0,1]$.

[1] An obvious choice is to use the mode $\tilde{\theta}$ of the posterior distribution $\pi(\theta|x_0)$ in (27):

$\pi(\tilde{\theta}|x_0) = \sup_{\theta\in\Theta}\pi(\theta|x_0),$

which is the value of $\theta$ that is ranked highest by $\pi(\theta|x_0)$. We know that in the case of $\theta \sim \text{Beta}(\alpha,\beta)$ the mode of $\pi(\theta)$ is $\frac{\alpha-1}{\alpha+\beta-2}$. Hence, an example of a Bayesian estimate
28. 28. of $\theta$ is:

$\tilde{\theta} = \frac{\alpha^*-1}{\alpha^*+\beta^*-2} = \frac{n\bar{x}+\alpha-1}{n+\alpha+\beta-2}.$ (32)

If we compare this with the ML estimate $\hat{\theta}(x_0)=\bar{x}=\frac{1}{n}\sum_{k=1}^{n}x_k$, the two will coincide algebraically, i.e. $\tilde{\theta}=\bar{x}$, only when $\alpha=\beta=1$: $\theta \sim \text{Beta}(1,1)=U(0,1)$, $\pi(\theta)=1$ for $\theta\in[0,1]$.

[Fig. 10: Uniform prior, Beta: a=1, b=1, density against $\theta$]

[2] Another "natural" choice for a most 'representative' value of $\theta$ is the mean of the posterior. We know that for $\theta \sim \text{Beta}(\alpha,\beta)$, $E(\theta)=\frac{\alpha}{\alpha+\beta}$, and thus:

$\hat{\theta} = \frac{\alpha^*}{\alpha^*+\beta^*} = \frac{n\bar{x}+\alpha}{n+\alpha+\beta}.$ (33)

Example. Let $\pi(\theta) \sim \text{Beta}(.5, .5)$.
(a) $n\bar{x}=4$, $n=20$: $\alpha^*=n\bar{x}+\alpha=4.5$, $\beta^*=n(1-\bar{x})+\beta=16.5$, $\tilde{\theta}=\frac{3.5}{21-2}=.184$, $\hat{\theta}=\frac{4.5}{4.5+16.5}=.214$.
(b) $n\bar{x}=12$, $n=20$: $\alpha^*=n\bar{x}+\alpha=12.5$, $\beta^*=n(1-\bar{x})+\beta=8.5$, $\tilde{\theta}=\frac{11.5}{19}=.605$, $\hat{\theta}=\frac{12.5}{21}=.595$.

As we can see from the above numerical examples, the two Bayesian estimates can give rise to different numerical values, depending on how asymmetric the posterior distribution is; the more asymmetric, the greater the difference.

How does a Bayesian decide which one of the above estimators is better?
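The two point estimates (32) and (33) are simple functions of the posterior parameters, so the numerical example can be reproduced directly. This is a minimal sketch; the helper names `beta_mode` and `beta_mean` are hypothetical, and the mode formula assumes both parameters exceed 1.

```python
def beta_mode(a, b):
    """Mode of Beta(a, b); valid only when a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

def beta_mean(a, b):
    """Mean of Beta(a, b)."""
    return a / (a + b)

# Jeffreys' prior Beta(.5, .5) with n = 20 and 4 or 12 successes.
for successes in (4, 12):
    a_star = successes + 0.5
    b_star = (20 - successes) + 0.5
    print(successes, round(beta_mode(a_star, b_star), 3), round(beta_mean(a_star, b_star), 3))
```

The gap between mode and mean is larger in case (a), where the posterior is more skewed.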
29. 29. O'Hagan answers that question by contrasting frequentist (classical) inferences with Bayesian inferences: “In Bayesian terms, therefore, a good inference is one which contributes effectively to appropriating the information about θ which is conveyed by the posterior distribution.” (p. 14)

Indeed, O'Hagan (1994), p. 14, proceeds to argue that criteria for 'good' frequentist inference procedures are only parasitical on Bayes' theorem and enter the picture via the decision theoretic perspective: “... a study of decision theory has two potential benefits. First, it provides a link to classical inference. It thereby shows to what extent classical estimators, confidence intervals and hypotheses tests can be given a Bayesian interpretation or motivation. Second, it helps identify suitable summaries to give Bayesian answers to stylized inference questions which classical theory addresses.”

Invoking information other than the data and the prior. In point estimation, selecting an optimal Bayesian estimate of $\theta$ requires invoking additional information from decision theory, such as a loss (or utility) function $L(\hat{\theta}, \theta)$. Using different loss functions gives rise to different choices for the Bayes estimate:

(i) The Bayes estimate $\breve{\theta}$ that minimizes ($\forall\theta\in\Theta$) the expected zero-one loss function $L_{0-1}(\theta, \breve{\theta}) = 0$ if $\breve{\theta}=\theta$, $1$ if $\breve{\theta}\neq\theta$, is the mode of $\pi(\theta|x_0)$.

(ii) The Bayes estimate $\tilde{\theta}$ that minimizes ($\forall\theta\in\Theta$) the expected absolute loss function $L_1(\tilde{\theta}; \theta)=|\tilde{\theta}-\theta|$ is the median of $\pi(\theta|x_0)$.

(iii) The Bayes estimate $\hat{\theta}$ that minimizes ($\forall\theta\in\Theta$) the expected square loss function $L_2(\hat{\theta}; \theta)=(\hat{\theta}-\theta)^2$ is the mean
30. 30. of $\pi(\theta|x_0)$.

However, as O'Hagan (1994), p. 15, explains, the notion of optimality is not primary: “The posterior mean can therefore be seen as an estimate of θ which is best in the sense of minimizing expected square error. This is distinct from, but clearly related to, its more natural role as a useful summary of location of the posterior distribution.”

What is a good (optimal) estimate in Bayesian inference? The one that minimizes a particular expected loss function. Hence, in contrast to frequentist estimation, where optimality is assessed by the estimator's capacity to pin-point $\theta^*$, the true value of $\theta$, an optimal Bayesian estimate has nothing to do with that; it just minimizes a particular loss function for all possible values of $\theta$.

The most widely used loss function is the square:

$L(\hat{\theta}; \theta) = (\hat{\theta}(X)-\theta)^2, \quad \forall\theta\in\Theta,$

whose expected value (the risk function) is known as the Mean Square Error (MSE):

Bayesian: $MSE(\hat{\theta}(X); \theta) = E(\hat{\theta}(X)-\theta)^2, \quad \forall\theta\in\Theta.$ (34)

In relation to the expected loss function, a key property for an estimator $\hat{\theta}(X)$ is considered to be admissibility.

Admissibility. An estimator $\tilde{\theta}(X)$ is inadmissible with respect to a risk function $R(\theta, \hat{\theta})$ if there exists another estimator $\hat{\theta}(X)$ such that:

$R(\theta, \hat{\theta}) \leq R(\theta, \tilde{\theta}), \quad \forall\theta\in\Theta,$ (35)

and the strict inequality ($<$) holds for at least one value of $\theta$. Otherwise, $\tilde{\theta}(X)$ is said to be admissible with respect to the loss function $L(\theta, \hat{\theta})$.
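The dominance comparison in (35) can be sketched on a grid of parameter values. The example below is an illustration under stated assumptions: it compares the sample mean of a Normal model with known variance (risk $\sigma^2/n$ at every $\theta$) against a hypothetical constant estimator $c$ that ignores the data (risk $(c-\theta)^2$). The values of `sigma2`, `n`, `c` and the grid are made up for illustration. A grid check of two estimators does not establish admissibility, which quantifies over all competing estimators; it only shows whether one of the two dominates the other.

```python
def dominates(risk_a, risk_b, grid):
    """True if risk_a <= risk_b at every grid point and strictly smaller at one or more."""
    never_worse = all(risk_a(t) <= risk_b(t) for t in grid)
    somewhere_better = any(risk_a(t) < risk_b(t) for t in grid)
    return never_worse and somewhere_better

sigma2, n, c = 4.0, 20, 3.0                      # hypothetical values
grid = [i / 10 for i in range(-100, 101)]        # theta from -10 to 10

def risk_mean(theta):
    """MSE of the sample mean: sigma^2 / n for every theta."""
    return sigma2 / n

def risk_const(theta):
    """MSE of the constant estimator c: squared distance from theta."""
    return (c - theta) ** 2

mean_dominates = dominates(risk_mean, risk_const, grid)
const_dominates = dominates(risk_const, risk_mean, grid)
print(mean_dominates, const_dominates)
```

Neither estimator dominates the other here: the constant has smaller risk for $\theta$ near $c$ and larger risk elsewhere, which is the feature of the 'for all $\theta\in\Theta$' quantifier that the notes go on to criticize.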
31. 31. Warning: the Bayesian definition of the MSE in (34), based on the universal quantifier 'for all $\theta\in\Theta$', is at odds with the frequentist definition of the MSE, which is defined at a point $\theta=\theta^*$ [the true value of $\theta$]:

Frequentist: $MSE(\hat{\theta}(X); \theta^*) = E(\hat{\theta}(X)-\theta^*)^2.$ (36)

The factual nature of frequentist reasoning in estimation also brings out the impertinence of the notion of admissibility, stemming from its reliance on the quantifier 'for all $\theta\in\Theta$'. To see that more clearly, let us consider the following example.

Example: is $\bar{X}_n$ no better than a crystal ball estimator? For the simple Normal model:

$X_k \sim \text{NIID}(\theta, \sigma^2), \quad k=1,2,...,n,$ (37)

consider a MSE comparison between two estimators of $\theta$:
(i) the Maximum Likelihood Estimator (MLE): $\bar{X}_n=\frac{1}{n}\sum_{k=1}^{n}X_k$,
(ii) the 'crystal ball' estimator: $\theta_{cb}(x)=7405926$ for all $x\in\mathbb{R}^n_X$.

It turns out that both estimators are admissible and thus equally acceptable on admissibility grounds. This surprising result stems primarily from the quantifier 'for all $\theta\in\Theta$'. Indeed, for certain values of $\theta$ close to $\theta_{cb}$, say $\theta\in(\theta_{cb}\pm\lambda\frac{\sigma}{\sqrt{n}})$ for $0<\lambda<1$, $\theta_{cb}$ is 'better' than $\bar{X}_n$ since:

$MSE(\bar{X}_n; \theta)=\frac{\sigma^2}{n} > MSE(\theta_{cb}; \theta)$, with $MSE(\theta_{cb}; \theta) \leq \frac{\lambda^2\sigma^2}{n}$ for $\theta\in(\theta_{cb}\pm\lambda\frac{\sigma}{\sqrt{n}})$.

Common sense suggests that if a certain criterion of optimality cannot distinguish between $\bar{X}_n$ [a strongly consistent, unbiased, fully efficient and sufficient estimator] and an arbitrarily chosen real number that ignores the data altogether, it is practically useless for distinguishing between 'good' and 'bad' estimators in frequentist statistics or being viewed as a
32. 32. minimal property of estimators. Moreover, it is obvious that the source of the problem is the quantifier $\forall\theta\in\Theta$. In contrast to admissibility, the property of consistency instantly eliminates the crystal ball estimator $\theta_{cb}$.

4.4 Bayesian Credible Intervals (BCIs)

A Bayesian $(1-\alpha)$ credible interval for $\theta$ is constructed by ensuring that the area between $a$ and $b$ is equal to $(1-\alpha)$:

$\pi(a\leq\theta<b) = \int_a^b \pi(\theta|x_0)\,d\theta = 1-\alpha.$

In practice one can define an infinity of $(1-\alpha)$ credible intervals using the same posterior $\pi(\theta|x_0)$. To avoid this indeterminacy one needs to impose additional restrictions, like the interval with the shortest length or one with equal tails:

$\int_a^1 \pi(\theta|x_0)\,d\theta = 1-\frac{\alpha}{2}, \qquad \int_b^1 \pi(\theta|x_0)\,d\theta = \frac{\alpha}{2};$ see Robert (2007).

Example. For the simple Bernoulli model, the end points of an equal-tail credible interval can be evaluated using the F tables and the fact that:

$\theta \sim \text{Beta}(\alpha^*, \beta^*) \ \Rightarrow\ \frac{\beta^*\theta}{\alpha^*(1-\theta)} \sim F(2\alpha^*, 2\beta^*).$

Denoting by $f(\frac{\alpha}{2})$ and $f(1-\frac{\alpha}{2})$ the points of the $F(2\alpha^*, 2\beta^*)$ distribution with upper-tail areas $\frac{\alpha}{2}$ and $(1-\frac{\alpha}{2})$ respectively, the Bayesian $(1-\alpha)$ credible interval for $\theta$ is:

$\left[\left(1+\frac{\beta^*}{\alpha^* f(1-\frac{\alpha}{2})}\right)^{-1} \leq \theta \leq \left(1+\frac{\beta^*}{\alpha^* f(\frac{\alpha}{2})}\right)^{-1}\right].$

For $\pi(\theta) \sim \text{Beta}(.5, .5)$:
(a) $n\bar{x}=2$, $n=20$, $\alpha=.05$: $\alpha^*=n\bar{x}+\alpha=2.5$, $\beta^*=n(1-\bar{x})+\beta=18.5$, $f(1-\frac{\alpha}{2})=.163$, $f(\frac{\alpha}{2})=2.93$:

$\left(1+\frac{18.5}{2.5(.163)}\right)^{-1} \leq \theta \leq \left(1+\frac{18.5}{2.5(2.93)}\right)^{-1} = (.0216\leq\theta\leq .284).$ (38)
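33. 33. (b) $n\bar{x}=18$, $n=20$, $\alpha=.05$: $\alpha^*=n\bar{x}+\alpha=18.5$, $\beta^*=n(1-\bar{x})+\beta=2.5$, $\hat{\theta}=\frac{18.5}{21}=.881$, $f(1-\frac{\alpha}{2})=.341$, $f(\frac{\alpha}{2})=6.188$:

$\left(1+\frac{2.5}{18.5(.341)}\right)^{-1} \leq \theta \leq \left(1+\frac{2.5}{18.5(6.188)}\right)^{-1} = (.716\leq\theta\leq .979).$

How does one interpret a $(1-\alpha)$ BCI such as (38)?

► It provides a summary of the posterior distribution $\pi(\theta|x_0)$ by demarcating $a$ and $b$ that define the interval of length $b-a$ with the highest probability, i.e. it includes $100(1-\alpha)\%$ of the highest ranked values of $\theta\in[0,1]$.

4.4.1 Confidence vs. Credible Intervals

Example. For the simple (one parameter, $\sigma^2$ is known) Normal model, the sampling distribution of $\bar{X}=\frac{1}{n}\sum_{k=1}^{n}X_k$ and the posterior distribution of $\theta$ derived on the basis of an improper uniform prior [$\pi(\theta)=1$, $\forall\theta\in\mathbb{R}$] are:

$\bar{X} \overset{\theta=\theta^*}{\sim} N\left(\theta^*, \frac{\sigma^2}{n}\right),\ x\in\mathbb{R}^n_X, \qquad \pi(\theta|x_0) \sim N\left(\bar{x}, \frac{\sigma^2}{n}\right),\ \theta\in\mathbb{R}.$ (39)

The two distributions can be used, respectively, to construct $(1-\alpha)$ Confidence and Credible Intervals:

$P\left(\bar{X}-c_{\frac{\alpha}{2}}\left(\frac{\sigma}{\sqrt{n}}\right) \leq \theta \leq \bar{X}+c_{\frac{\alpha}{2}}\left(\frac{\sigma}{\sqrt{n}}\right);\ \theta=\theta^*\right) = 1-\alpha,$ (40)

$\pi\left(\bar{x}-c_{\frac{\alpha}{2}}\left(\frac{\sigma}{\sqrt{n}}\right) \leq \theta \leq \bar{x}+c_{\frac{\alpha}{2}}\left(\frac{\sigma}{\sqrt{n}}\right)\,\Big|\,x_0\right) = 1-\alpha.$ (41)

The two intervals might appear the same, but they are drastically different. First, in (40) the r.v. is $\bar{X}$ and its sampling distribution $f(\bar{x}; \theta)$ is defined over $x\in\mathbb{R}^n_X$, but in (41) the r.v. is $\theta$ and its posterior $\pi(\theta|x_0)$ is defined over $\theta\in\mathbb{R}$. Second, the reasoning underlying (40) is factual, but that of (41) is all-inclusive ($\forall\theta\in\mathbb{R}$).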
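The equal-tail credible intervals for the Beta posterior in cases (a) and (b) can also be obtained without F tables, by integrating the posterior density numerically. The sketch below is one way to do it using only the standard library (composite Simpson integration plus bisection); it is not the method of the notes, the function names are hypothetical, and it assumes both Beta parameters exceed 1 so that the density is bounded.

```python
from math import lgamma, log, exp

def beta_cdf(x, a, b, steps=1000):
    """P(theta <= x) for Beta(a, b) via composite Simpson; assumes a > 1, b > 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)

    def pdf(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return exp(log_norm + (a - 1) * log(t) + (b - 1) * log(1 - t))

    h = x / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3

def beta_quantile(p, a, b, iterations=50):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def equal_tail_interval(a, b, level=0.95):
    """Equal-tail credible interval: (1-level)/2 posterior mass in each tail."""
    tail = (1 - level) / 2
    return beta_quantile(tail, a, b), beta_quantile(1 - tail, a, b)

lo_a, hi_a = equal_tail_interval(2.5, 18.5)    # case (a)
lo_b, hi_b = equal_tail_interval(18.5, 2.5)    # case (b)
print(lo_a, hi_a)
print(lo_b, hi_b)
```

Because Beta(18.5, 2.5) is the mirror image of Beta(2.5, 18.5), the two intervals are reflections of each other around 0.5, which gives a quick consistency check on any tabulated values.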
34. 34. Hence, the $(1-\alpha)$ Confidence Interval (40) provides the shortest random upper $U(X)=\bar{X}+c_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}})$ and lower $L(X)=\bar{X}-c_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}})$ bounds that cover $\theta^*$, the true value of $\theta$, with probability $(1-\alpha)$.

In contrast, the $(1-\alpha)$ Credible Interval (41) provides the shortest interval of length $2c_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}})$ such that $100(1-\alpha)\%$ of the highest ranked (by $\pi(\theta|x_0)$) values of $\theta$ lie within it.

The above comparison suggests that a Bayesian $(1-\alpha)$ Credible Interval has nothing to do with the "true" $\theta$!

4.5 Bayesian Testing

Bayesian testing of hypotheses is not as easy to handle using the posterior distribution because of the technical difficulty in attaching probabilities to particular values of $\theta$ when the parameter space $\Theta$ is uncountable, since the probability attached to any single point is zero:

$\pi(\theta)=0, \quad \pi(\theta|x_0)=0, \quad \forall\theta\in\Theta.$

In their attempt to deflect attention away from this technical difficulty, Bayesians often criticize the use of a point hypothesis $\theta=\theta_0$ in frequentist testing as nonsensical because it can never be exactly true! This is a misplaced argument because the notion of 'exactly true' has no place in statistics.

4.5.1 Point null and alternative hypotheses

There have been several attempts to address the difficulty with point hypotheses, but no agreement seems to have emerged; see Robert (2007). Let us consider one such attempt for testing the hypotheses:

$H_0: \theta=\theta_0$ vs. $H_1: \theta=\theta_1.$

Like all Bayesian inferences, the basis is the posterior distribution. Hence, an obvious way to assess their respective degrees
35. 35. of belief is the posterior odds:

$\frac{\pi(\theta_0|x_0)}{\pi(\theta_1|x_0)} = \frac{L(\theta_0|x_0)\cdot\pi(\theta_0)}{L(\theta_1|x_0)\cdot\pi(\theta_1)} = \left(\frac{\pi(\theta_0)}{\pi(\theta_1)}\right)\left(\frac{L(\theta_0|x_0)}{L(\theta_1|x_0)}\right),$ (42)

where the factor $\frac{\pi(\theta_0)}{\pi(\theta_1)}$ represents the prior odds, and $\frac{L(\theta_0|x_0)}{L(\theta_1|x_0)}$ the likelihood ratio. In light of the fact that the technical problem stems from the prior $\pi(\theta)$ assigning probabilities to particular values of $\theta$, an obvious way to sidestep the problem is to cancel the prior odds factor, by using the ratio of the posterior to the prior odds to define the Bayes Factor (BF):

$BF(\theta_0, \theta_1|x_0) = \left(\frac{\pi(\theta_0|x_0)}{\pi(\theta_1|x_0)}\right)\Big/\left(\frac{\pi(\theta_0)}{\pi(\theta_1)}\right) = \frac{L(\theta_0|x_0)}{L(\theta_1|x_0)}.$ (43)

This addresses the technical problem because the likelihood function is definable for particular values of $\theta$. For this reason Bayesian testing is often based on the BF combined with certain rules of thumb concerning the strength of the degree of belief against $H_0$ as it relates to the magnitude of $BF(x_0; \theta_0)$ (Robert, 2007):

► $0 \leq BF(x_0; \theta_0) \leq 3.2$: the degree of belief against $H_0$ is poor,
► $3.2 < BF(x_0; \theta_0) \leq 10$: the degree of belief against $H_0$ is substantial,
► $10 < BF(x_0; \theta_0) \leq 100$: the degree of belief against $H_0$ is strong, and
► $BF(x_0; \theta_0) > 100$: the degree of belief against $H_0$ is decisive.

These rules of thumb, going from the BF to evidence for or against the null, have been questioned as largely ad hoc; see Kass and Raftery (1995).

The Likelihoodist approach. It is important to note that the Law of Likelihood defining the likelihood ratio:

$LR(\theta_0, \theta_1|x_0) = \frac{L(\theta_0|x_0)}{L(\theta_1|x_0)},$
36. 36. provides the basis of the Likelihoodist approach to testing, but applies only to tests of point vs. point hypotheses.

4.5.2 Composite hypotheses

A. In the context of the simple Bernoulli model consider the hypotheses:

$H_0: \theta\leq\theta_0$ vs. $H_1: \theta>\theta_0$, $\theta_0=.5$,

assuming a Jeffreys invariant prior $\theta \sim \text{Beta}(.5, .5)$ and data $n\bar{x}=12$, $n=20$. An obvious way to evaluate the posterior odds for these two interval hypotheses is as follows:

$\pi(\theta\leq\theta_0|x_0) = \frac{\Gamma(21)}{\Gamma(12.5)\Gamma(8.5)}\int_0^{.5}\theta^{11.5}(1-\theta)^{7.5}\,d\theta = .186,$
$\pi(\theta>\theta_0|x_0) = 1-\pi(\theta\leq\theta_0|x_0) = .814.$

One can then employ the posterior odds criterion:

$\frac{\pi(\theta\leq\theta_0|x_0)}{\pi(\theta>\theta_0|x_0)} = \frac{.186}{.814} = .229,$

which indicates that the degree of belief against $H_0$ is poor.

B. (a) In the context of the simple Normal model ($\sigma^2$ known) consider the hypotheses:

$H_0: \theta\leq 10$ vs. $H_1: \theta>10$,

assuming a Normal prior $\theta \sim N(8, 1)$, $\sigma=2$, $n=20$, $\bar{x}=11.6$. As shown in section 4.2, the posterior is:

$\pi(\theta|x_0) \sim N\left(\sigma_1^2\left(\frac{\mu_0}{\sigma_0^2}+\frac{n\bar{x}}{\sigma^2}\right),\ \frac{\sigma_0^2\sigma^2}{n\sigma_0^2+\sigma^2}\right),$

yielding the result $\pi(\theta|x_0) \sim N(11.022,\ .167)$, with $\sigma_1^2=(\frac{1}{1}+\frac{20}{4})^{-1}=.167$ and $\mu_1=.167(\frac{8}{1}+\frac{20(11.6)}{4})=11.022$.

The posterior odds criterion yields:

$\frac{\pi(\theta\leq 10|x_0)}{\pi(\theta>10|x_0)} = \frac{\int_{-\infty}^{10}\frac{1}{\sqrt{2\pi(.167)}}\exp\left(-\frac{1}{2(.167)}(\theta-11.022)^2\right)d\theta}{\int_{10}^{\infty}\frac{1}{\sqrt{2\pi(.167)}}\exp\left(-\frac{1}{2(.167)}(\theta-11.022)^2\right)d\theta} = \frac{.0062}{.994} = .0062$
37. 37. which indicates that the degree of belief against $H_0$ is very poor. This result should be contrasted with that of a N-P test yielding:

$d(x_0) = \frac{\sqrt{20}(11.6-10)}{2} = 3.578$, with $p(x_0)=.0002$,

that rejects $H_0$ at any conventional significance level.

What is the intuitive explanation behind these contradictory results between a Bayesian and a frequentist test? The frequentist test will reject $H_0$ if $d(X)=\frac{\sqrt{n}(\bar{X}-\theta_0)}{\sigma} > c_\alpha$, or $\bar{X} > \theta_0+c_\alpha\frac{\sigma}{\sqrt{n}}$. The posterior odds will "reject" $H_0$ if $\bar{x} > \theta_0+\frac{\sigma^2(\theta_0-8)}{n\sigma_0^2}$.

This indicates that in the case where the null value $\theta_0$ is chosen equal to the mean ($\mu_0=8$) of the prior distribution, i.e. $\theta \sim N(\theta_0, \sigma_0^2)$, the posterior odds will "reject" $H_0$ if $\bar{x} > \theta_0$, which ignores the sample size beyond its influence on $\bar{x}$. On the other hand, when $\sigma^2=\sigma_0^2$ the rejection threshold $\theta_0+\frac{(\theta_0-8)}{n}$ decreases with $n$, in contrast to $\sqrt{n}$ for the frequentist test.

(b) Let us change the prior to be an improper prior:

$\pi(\theta)=1, \quad \theta\in(-\infty, \infty).$

As shown in section 4.2, in this case the posterior is:

$\pi(\theta|x_0) \sim N\left(\bar{x}, \frac{\sigma^2}{n}\right), \quad -\infty<\theta<\infty,$ (44)

and thus the tail areas of this posterior would seem to match perfectly with those of the sampling distribution of $\bar{X}$:

$\bar{X} \sim N\left(\theta, \frac{\sigma^2}{n}\right), \quad x\in\mathbb{R}^n_X,$ (45)

but as argued in that section, this is an illusion; one is comparing apples and eggs!
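The numbers in case B(a) can be reproduced with the standard library's `statistics.NormalDist`. This is a sketch for checking the arithmetic, not code from the notes; it plugs in the rounded posterior $N(11.022, .167)$ used in the text, and the 5% level used for the frequentist threshold is an assumption made only for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Inputs from the example: prior N(8, 1), sigma = 2, n = 20, xbar = 11.6, null value 10.
mu0, var0, sigma, n, xbar, theta0 = 8.0, 1.0, 2.0, 20, 11.6, 10.0

# Posterior odds of H0: theta <= 10, using the rounded posterior quoted in the text.
posterior = NormalDist(mu=11.022, sigma=sqrt(0.167))
prob_h0 = posterior.cdf(theta0)
posterior_odds = prob_h0 / (1 - prob_h0)

# N-P test statistic and one-sided p-value.
d_obs = sqrt(n) * (xbar - theta0) / sigma
p_value = 1 - NormalDist().cdf(d_obs)

# Rejection thresholds for the sample mean under the two approaches.
bayes_threshold = theta0 + sigma**2 * (theta0 - mu0) / (n * var0)
c_alpha = NormalDist().inv_cdf(0.95)             # assumed alpha = .05
freq_threshold = theta0 + c_alpha * sigma / sqrt(n)

print(prob_h0, posterior_odds)
print(d_obs, p_value)
print(bayes_threshold, freq_threshold)
```

The Bayesian threshold moves toward the null value at rate $1/n$ while the frequentist one moves at rate $1/\sqrt{n}$, which is the contrast drawn in the text.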
38. 38. 4.5.3 Point null but composite alternative hypothesis

Pretending that point hypotheses are small intervals. A 'pragmatic' way to handle point hypotheses in Bayesian inference is to sidestep the technical difficulty in handling hypotheses of the form:

$H_0: \theta=\theta_0$ vs. $H_1: \theta\neq\theta_0$ (46)

by pretending that $H_0$ is actually $H_0: \theta\in\Theta_0:=(\theta_0-\varepsilon,\ \theta_0+\varepsilon)$ and attaching a spiked prior of the form:

$\pi(\theta=\theta_0)=p_0, \qquad p_1=\int_0^1\pi(\theta\neq\theta_0)\,d\theta = 1-p_0,$ (47)

i.e. attach a prior of $p_0$ to $\theta=\theta_0$, and then distribute the rest, $1-p_0$, to all the other values of $\theta$; see Berger (1985).

Example: large $n$ problem. In the context of the simple Bernoulli model, consider the case where $\theta_0=.2$, $\bar{x}=\frac{106298}{527135}=.20165233$, $n=527135$.

[i] Let us apply the Bayes factor procedure to the hypotheses (46) using a spiked prior (47) with $p_0=.5$. Since the ratio $\frac{\pi(\theta_0)}{\pi(\theta\neq\theta_0)}$ cancels out, the posterior odds reduces to the Bayes factor:

$BF(x_0; \theta_0) = \frac{L(\theta_0; x_0)}{\int_0^1 L(\theta; x_0)\,d\theta} = \frac{\binom{527135}{106298}(.2)^{106298}(1-.2)^{527135-106298}}{\int_0^1\binom{527135}{106298}\theta^{106298}(1-\theta)^{527135-106298}\,d\theta} = \frac{.000015394}{.000001897} = 8.115.$ (48)

► The result $BF(x_0; \theta_0) \simeq 8.115$ indicates that data $x_0$ favor the null ($\theta_0=.2$) against all other values of $\theta$ substantially.

It turns out, however, that this result is highly vulnerable to the fallacy of acceptance. For certain values $\theta^\ddagger$ in a subset of $\Theta_1=[0,1]-\{.2\}$, in particular:

$\theta^\ddagger\in(.2,\ .20331],$ (49)
39. 39. the values $\theta^\ddagger$ are favored by $BF(x_0; \theta^\ddagger)$ more strongly than $\theta_0=.2$, contradicting the original result. Indeed, as pointed out by Mayo (1996), p. 200, there is always the maximally likely alternative, $\ddot{\theta}=\hat{\theta}(x_0)=.20165233$, which the Bayes factor favors more than any other value of $\theta$. In particular, the Bayes factor favors $\ddot{\theta}$ 89 times more strongly than $\theta_0=.2$!

[ii] Applying an 'optimal' N-P test with $\alpha=.003$ ($c_{\frac{\alpha}{2}}=2.968$) to the above data yields:

$d(x_0) = \frac{\sqrt{527135}\left(\frac{106298}{527135}-.2\right)}{\sqrt{.2(1-.2)}} = 2.999,$ (50)

which rejects $H_0$. The p-value $p(x_0)=.0027$ suggests that data $x_0$ indicate 'some' discrepancy between $\theta_0$ and the 'true' $\theta$ (that gave rise to $x_0$), but provides no information about its magnitude. The post-data severity evaluation based on:

$SEV(T_\alpha; \theta>\theta_1) = P(x: d(X)\leq d(x_0);\ \theta>\theta_1 \text{ is false})$ (51)

indicates that, for a severity threshold of, say, $.9$, the claim for which data $x_0$ provide evidence is:

$\theta>.20095 \ \Rightarrow\ \gamma^*\leq .00095.$

In this sense, the frequentist approach can address the fallacy of acceptance by evaluating the warranted discrepancy from the null for the particular data and test.
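The large-$n$ calculations in (48) and (50) involve numbers far too small or too large for direct evaluation, so a log-scale computation is a natural sketch. The code below is an illustration, not the author's computation; `log_binom_pmf` is a hypothetical helper. It uses the fact that the integral of the binomial likelihood over $[0,1]$, as written in the denominator of (48), equals $1/(n+1)$.

```python
from math import lgamma, log, exp, sqrt

def log_binom_pmf(k, n, theta):
    """Log of the Binomial(n, theta) probability of k successes."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_comb + k * log(theta) + (n - k) * log(1 - theta)

n, k, theta0 = 527135, 106298, 0.2
theta_hat = k / n                                  # maximum likelihood estimate

# Bayes factor (48): likelihood at theta0 over the integrated likelihood 1/(n+1).
log_lik_null = log_binom_pmf(k, n, theta0)
bayes_factor = exp(log_lik_null + log(n + 1))

# Likelihood ratio of the maximally likely alternative against theta0.
lr_mle_vs_null = exp(log_binom_pmf(k, n, theta_hat) - log_lik_null)

# N-P test statistic (50).
d_obs = sqrt(n) * (theta_hat - theta0) / sqrt(theta0 * (1 - theta0))

print(bayes_factor)
print(lr_mle_vs_null)
print(d_obs)
```

The same data thus give a Bayes factor above 8 in favor of the null and a likelihood ratio of roughly 89 in favor of the MLE over the null, which is the tension the example is built around.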
40. 40. 4.6 Where do prior distributions come from?

4.6.1 Conjugate prior and posterior distributions

This is the case where the prior $\pi(\theta)$ and the posterior:

$\pi(\theta|x_0) \propto \pi(\theta)\cdot L(\theta; x_0), \quad \forall\theta\in\Theta,$

belong to the same family of distributions, i.e. $L(\theta; x_0)$ is family preserving.

Example. For the simple Bernoulli model:

$\pi(\theta) \sim \text{Beta}(\alpha, \beta), \quad L(\theta; x_0)\propto\theta^{n\bar{x}}(1-\theta)^{n(1-\bar{x})} \ \longrightarrow\ \pi(\theta|x_0) \sim \text{Beta}(\alpha^*, \beta^*).$

Table 2 presents some examples of conjugate pairs of prior and posterior distributions, as they combine with different likelihood forms.

Conjugate pairs make mathematical sense, but do they make 'modeling' sense? The various justifications in the Bayesian literature vary from 'these help the objectivity of inference' to 'they enhance the allure of the Bayesian approach as a black box', and these claims are often contradictory!

Table 1 - Conjugate pairs ($\pi(\theta)$, $\pi(\theta|x_0)$)
Likelihood | $\pi(\theta)$
Binomial (Bernoulli) | Beta($\alpha$, $\beta$)
Negative Binomial | Beta($\alpha$, $\beta$)
Poisson | Gamma($\alpha$, $\beta$)
Exponential | Gamma($\alpha$, $\beta$)
Gamma | Gamma($\alpha$, $\beta$)
Uniform | Pareto($\alpha$, $\beta$)
Normal for $\theta=\mu$ | N($\mu$, $\sigma^2$), $\mu\in\mathbb{R}$, $\sigma^2>0$
Normal for $\theta=\sigma^2$ | Inverse Gamma($\alpha$, $\beta$)
41. 41. 4.6.2 Jeffreys's prior

Fisher (1921) criticized the notion of prior ignorance using a uniform prior:

$\theta \sim U(0,1), \quad \forall\theta\in\Theta,$

to quantify a state of ignorance about the unknown parameter $\theta$. Fisher's criticism was that such a prior is non-invariant to reparameterizations: one is ignorant about $\theta$ but very informed about $\phi=h(\theta)$.

[Fig. 10: Uniform prior, Beta: a=1, b=1, density against $\theta$]
[Fig. 11: The Logistic prior, Logistic: Loc=0, Scale=1, density against $\phi$]

Example. If $\theta$ is uniformly distributed, then the logit transformation:

$\phi = \ln\left(\frac{\theta}{1-\theta}\right)$

gives rise to a very informative prior for $\phi$:

$\pi(\phi) = \frac{e^{\phi}}{(1+e^{\phi})^2}, \quad -\infty<\phi<\infty.$

This distribution attaches much higher probability to the values of $\phi$ around zero and very low probability to the values in the tails.

In response to Fisher's second criticism, Jeffreys (1939) proposed a new class of priors which are invariant to reparameterizations. This family of invariant priors was based on Fisher's average information:

$I(\theta; x) = E_x\left(\frac{1}{n}\left[\frac{\partial\ln L(\theta; x)}{\partial\theta}\right]^2\right) = \int\cdots\int_{x\in\mathbb{R}^n_X}\frac{1}{n}\left(\frac{\partial\ln L(\theta; x)}{\partial\theta}\right)^2 f(x; \theta)\,dx.$ (52)
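The non-invariance in the logit example above can be made concrete by comparing the prior probability that a uniform $\theta$ induces on intervals of $\phi$ of equal length. This is a small sketch with hypothetical function names; the intervals $[-1, 1]$ and $[4, 6]$ are chosen only for illustration.

```python
from math import exp

def logistic_pdf(phi):
    """Density induced on phi = ln(theta / (1 - theta)) by a uniform theta on (0, 1)."""
    e = exp(phi)
    return e / (1 + e) ** 2

def inverse_logit(phi):
    """Map phi back to theta; for uniform theta this is also the CDF of phi."""
    return 1 / (1 + exp(-phi))

def prob_phi_between(lo, hi):
    """P(lo <= phi <= hi) when theta is uniform on (0, 1)."""
    return inverse_logit(hi) - inverse_logit(lo)

p_center = prob_phi_between(-1.0, 1.0)   # interval of length 2 around zero
p_tail = prob_phi_between(4.0, 6.0)      # interval of length 2 in the right tail
print(logistic_pdf(0.0))                 # peak of the induced density
print(p_center, p_tail)
```

Two intervals of the same length receive very different prior probability, so a prior that is flat in $\theta$ is far from flat in $\phi$.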
42. 42. Note that the above derivation involves some hand-waving, in the sense that if the likelihood function $L(\theta; x_0)$ is viewed, as the Bayesians do, as only a function of the data $x_0$, then taking expectations outside the brackets makes no sense; the expectation is with respect to the distribution of the sample $f(x; \theta)$ for all possible values of $x\in\mathbb{R}^n_X$. As we can see, the derivation of $I(\theta; x)$ runs afoul of the likelihood principle, since all possible values of the sample $X$, not just the observed data $x_0$, are taken into account.

Note that in the case of a random (IID) sample, the Fisher information $I_n(\theta; x)$ for the sample $X:=(X_1, X_2, ..., X_n)$ is related to the above average information via:

$I_n(\theta; x) = nI(\theta; x).$

In the case of a single parameter, Jeffreys invariant prior takes the form:

$\pi(\theta) \propto \sqrt{I(\theta; x)},$ (53)

i.e. the likelihood function determines the prior distribution.

The simple Bernoulli model. In view of the fact that the log-likelihood takes the form:

$\ln L(\theta; x) = n\bar{x}\ln(\theta) + n(1-\bar{x})\ln(1-\theta),$

$\frac{d\ln L(\theta; x)}{d\theta} = \frac{n\bar{x}}{\theta} - \frac{n(1-\bar{x})}{1-\theta}, \qquad \frac{d^2\ln L(\theta; x)}{d\theta^2} = -\frac{n\bar{x}}{\theta^2} - \frac{n(1-\bar{x})}{(1-\theta)^2}.$

From the second derivative, it follows that:

$E\left(\frac{1}{n}\left[\frac{d\ln L(\theta; x)}{d\theta}\right]^2\right) = E\left(-\frac{1}{n}\frac{d^2\ln L(\theta; x)}{d\theta^2}\right) = \frac{1}{\theta(1-\theta)}.$ (54)

This follows directly from $E(\bar{X})=\theta$, since:

$E\left(-\frac{1}{n}\frac{d^2\ln L(\theta; x)}{d\theta^2}\right) = \frac{\theta}{\theta^2} + \frac{(1-\theta)}{(1-\theta)^2} = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}.$ (55)

From the definition of Jeffreys invariant prior we can deduce that for $\theta$:

$\pi(\theta) \propto \sqrt{I(\theta; x)} = \sqrt{\frac{1}{\theta(1-\theta)}} = \theta^{-\frac{1}{2}}(1-\theta)^{-\frac{1}{2}}, \quad 0<\theta<1,$ (56)
43. 43. which is an 'unnormalized' $\text{Beta}(\frac{1}{2}, \frac{1}{2})$ distribution; it needs the scaling $\frac{1}{B(.5, .5)}$. Note that Jeffreys prior (56) is also the reference prior for a one parameter statistical model; see Bernardo and Smith (1994).

5 Bayesian charges against frequentist inference

[1] Bayesian inference is fundamentally sound because it can be given an axiomatic foundation based on coherent (rational) decision making, but frequentist inference suffers from several philosophical flaws.

Remark: what does an axiomatic foundation have to do with inductive inference?

[2] Frequentist inference is not very flexible and has a restrictive range of applicability. According to Koop, Poirier and Tobias (2007): “Non-Bayesians, who we hereafter refer to as frequentists, argue that situations not admitting repetition under essentially identical conditions are not within the realm of statistical enquiry, and hence 'probability' should not be used in such situations. Frequentists define the probability of an event as its long-run relative frequency. ... that definition is nonoperational since only a finite number of trials can ever be conducted.” (p. 2)

Remark: where have these guys been for the last 80 years? Is inference with time series data beyond the intended scope of frequentist statistics?

[3] Bayesian inference produces clear and direct inferences, in contrast to frequentist inference producing unclear and indirect inferences, e.g. credible intervals vs. confidence intervals. “... the applied researcher would really like to be able to place a degree of belief on the hypothesis.” (Press, 2003, p. 220)
44. 44. Remark: if the primary aim of statistical inference is to learn from data $x_0$ about the 'true' underlying data-generating mechanism $\mathcal{M}^*(x)=\{f(x; \theta^*)\}$, $x\in\mathbb{R}^n_X$, what does a probability, say $\pi(\theta=\theta_0|x_0)=.7$, suggest about $\theta^*$?

[4] Bayesian inference makes use of all the available a priori information, but frequentist inference does not.

Remark: there is a crucial difference between prior substantive subject matter information and information in the form of a prior distribution. Frequentist inference is tailor-made to accommodate prior information in the form of restrictions on the statistical parameters suggested by substantive theories, say $G(\theta, \varphi)=0$, where $\theta$ and $\varphi$ denote the statistical and substantive parameters of interest. Indeed, substantive prior information in most scientific fields does not come in the form of a prior distribution $\pi(\theta)$, $\forall\theta\in\Theta$.

[5] A number of counter-examples, introduced by Bayesians, show that frequentist inference is fundamentally flawed.

Remark: the problem does not lie with frequentist testing, but with the statistical models introduced. The underlying statistical models are shown to be rigged; see Spanos (2010; 2011; 2012; 2013a-d).

[6] The subjectivity charge against Bayesians is misplaced because: “All statistical methods that use probability are subjective in the sense of relying on mathematical idealizations of the world. Bayesian methods are sometimes said to be especially subjective because of their reliance on a prior distribution, but in most problems, scientific judgement is necessary to specify both the 'likelihood' and the 'prior' parts of the model.” (Gelman, et al. (2004), p. 14) “... likelihoods are just as subjective as priors.” (Kadane, 2011,
45. 45. p. 445)

Remark: I know how to test the adequacy of the probabilistic assumptions defining the likelihood (they are the model assumptions) vis-a-vis data $x_0$, but how do I test the adequacy of the assumptions defining a prior $\pi(\theta)$, $\forall\theta\in\Theta$?

[7] For inference purposes, the only relevant point in the sample space $\mathbb{R}^n_X$ is just the data $x_0$, as summarized by the likelihood function $L(\theta|x_0)$, $\theta\in\Theta$.

Remark: as mentioned in section 4.1, the interpretation of the posterior distribution as proportional to the conditional distribution of $X$ given $\theta$ times the prior $\pi(\theta)$ is at odds with the Likelihood Principle; see Spanos (2014). Also, Mayo (2013) has shown that Birnbaum's (1962) 'proof' of the LP is erroneous.

[8] An effective way to generate frequentist optimal statistical procedures is to find the Bayes solution using a reasonable prior and then examine its frequentist properties to see whether it is satisfactory from the latter viewpoint; see Rubin (1984), Gelman et al (2004).

Remark: this is based on assuming that admissibility (as defined by Bayesians) is a desirable minimal property; it is not! Indeed, expected losses do not constitute legitimate frequentist error probabilities. The minimal property for frequentist inference is consistency, not expected loss relative efficiency.

The above [1]-[8] comments, criticisms and charges leveled against frequentist inference are largely misplaced and stem mostly from insufficient understanding, or just plain ignorance, on the part of the critics.

Further arguments on how the above criticisms can be countered will be part of the class discussion.
46. 46. 6 Appendix A: the N-P Lemma and its extensions

The cornerstone of the Neyman-Pearson (N-P) approach is the Neyman-Pearson lemma. Consider the simple generic statistical model:

$\mathcal{M}_\theta(x) = \{f(x; \theta)\},\ \theta\in\Theta:=\{\theta_0, \theta_1\},\ x\in\mathbb{R}^n_X,$ (57)

and the problem of testing the simple hypotheses:

$H_0: \theta=\theta_0$ vs. $H_1: \theta=\theta_1.$ (58)

■ The fact that the assumed parameter space is $\Theta:=\{\theta_0, \theta_1\}$, so that the hypotheses in (58) constitute a partition, is often left out of statistics textbook discussions of this famous lemma!

Existence. There exists an $\alpha$-significance level Uniformly Most Powerful (UMP) [$\alpha$-UMP] test based on:

$d(X) = h\left(\frac{f(x; \theta_1)}{f(x; \theta_0)}\right), \quad C_1(\alpha)=\{x: d(x)>c_\alpha\},$ (59)

where $h(.)$ is a monotone function.

Sufficiency. If an $\alpha$-level test of the form (59) exists, then it is UMP for testing (58).

Necessity. If $\{d(X), C_1(\alpha)\}$ is an $\alpha$-UMP test, then it will be given by (59).

At first sight the N-P lemma seems rather contrived because it is an existence result for a simple statistical model $\mathcal{M}_\theta(x)$ whose parameter space is artificial, $\Theta:=\{\theta_0, \theta_1\}$, but it fits perfectly into the archetypal formulation. To operationalize the existence result one would need to do two things:

(1) Find a transformation $h(.)$ that, when applied to $\frac{f(x; \theta_1)}{f(x; \theta_0)}$, yields a meaningful test statistic $d(X)$.
(2) Derive the distribution of $d(X)$ under both $H_0$ and $H_1$.
47. 47. Warning: this lemma is often misconstrued as suggesting that for an $\alpha$-UMP test to exist one needs to confine testing to simple-vs-simple cases even when $\Theta$ is uncountable; nonsense!

■ The construction of an $\alpha$-UMP test in realistic cases has nothing to do with simple-vs-simple hypotheses. Instead,
(i) it should be based on the archetypal N-P testing formulation based on partitioning $\Theta$, and
(ii) rely on monotone likelihood ratios and other features of the prespecified statistical model $\mathcal{M}_\theta(x)$.

Example. To illustrate these issues consider the simple-vs-simple hypotheses:

(i) $H_0: \theta=\theta_0$ vs. $H_1: \theta=\theta_1$ (60)

in the context of a simple Normal (one parameter) model:

$X_k \sim \text{NIID}(\theta, \sigma^2), \quad k=1,2,...,n.$ (61)

In this case, the N-P lemma does not apply because the two values $(\theta_0, \theta_1)$ do not constitute a partition of the parameter space $\Theta=\mathbb{R}$. Applying the N-P lemma requires setting up the ratio:

$\frac{f(x; \theta_1)}{f(x; \theta_0)} = \exp\left\{\frac{n}{\sigma^2}\bar{X}_n(\theta_1-\theta_0) - \frac{n}{2\sigma^2}(\theta_1^2-\theta_0^2)\right\},$ (62)

which is clearly not a test statistic as it stands. However, there exists a monotone function $h(.)$ which transforms (62) into a familiar test statistic (Spanos, 1999, pp. 708-9):

$d(X) = h\left(\frac{f(x; \theta_1)}{f(x; \theta_0)}\right) = \left[\left(\frac{\sigma}{\sqrt{n}(\theta_1-\theta_0)}\right)\ln\left(\frac{f(x; \theta_1)}{f(x; \theta_0)}\right) + \frac{\sqrt{n}(\theta_1-\theta_0)}{2\sigma}\right] = \frac{\sqrt{n}(\bar{X}_n-\theta_0)}{\sigma}.$

A UMP test can be derived when $d(X)=\frac{\sqrt{n}(\bar{X}_n-\theta_0)}{\sigma}$ is combined with information relating to the framing of the hypotheses.
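The step from the likelihood ratio (62) to the test statistic $d(X)$ can be verified numerically. The sketch below is an illustration with hypothetical function names; the data summary ($n=20$, $\sigma=2$, $\bar{x}=11.6$, $\theta_0=10$) is borrowed from the Normal example in section 4.5.2, and the alternative value $\theta_1=12$ is an assumption made only to have a concrete ratio.

```python
from math import sqrt

def log_likelihood_ratio(xbar, n, sigma, theta0, theta1):
    """Log of the ratio in (62) for the Normal model with known sigma."""
    return (n / sigma**2) * xbar * (theta1 - theta0) \
        - (n / (2 * sigma**2)) * (theta1**2 - theta0**2)

def d_from_log_ratio(log_ratio, n, sigma, theta0, theta1):
    """Monotone (increasing, for theta1 > theta0) transformation of the ratio."""
    delta = theta1 - theta0
    return sigma * log_ratio / (sqrt(n) * delta) + sqrt(n) * delta / (2 * sigma)

n, sigma, xbar, theta0, theta1 = 20, 2.0, 11.6, 10.0, 12.0   # theta1 is hypothetical
log_ratio = log_likelihood_ratio(xbar, n, sigma, theta0, theta1)
d_transformed = d_from_log_ratio(log_ratio, n, sigma, theta0, theta1)
d_direct = sqrt(n) * (xbar - theta0) / sigma

print(log_ratio)
print(d_transformed, d_direct)
```

The transformed ratio does not depend on the particular $\theta_1$ chosen, which is why the same statistic serves for every alternative on one side of $\theta_0$.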
48. 48. Conditions that give rise to UMP tests

[1] Point null vs. one-sided alternative. In the case of the simple Normal model, each pair of hypotheses:

(i) $H_0: \theta\leq\theta_0$ vs. $H_1: \theta>\theta_0$, and $H_0: \theta=\theta_0$ vs. $H_1: \theta>\theta_0$,
(ii) $H_0: \theta\geq\theta_0$ vs. $H_1: \theta<\theta_0$, and $H_0: \theta=\theta_0$ vs. $H_1: \theta<\theta_0$,

give rise to the same UMP tests. The existence of these $\alpha$-UMP tests extends the N-P lemma to more realistic cases by invoking two regularity conditions:

[2] Monotone likelihood ratio. The ratio (62) is a monotone function of the statistic $\bar{X}_n$, in the sense that for any two values $\theta_1>\theta_0$, $\frac{f(x; \theta_1)}{f(x; \theta_0)}$ changes monotonically with $\bar{X}_n$. This implies that $\frac{f(x; \theta_1)}{f(x; \theta_0)}>k$ if and only if $\bar{X}_n>c_0$.

This regularity condition is valid for most statistical models of interest in practice, including the one parameter Exponential family of distributions [Normal, Student's t, Pareto, Gamma, Beta, Binomial, Negative Binomial, Poisson, etc.].

[3] Convex alternative. The parameter space under $H_1$, say $\Theta_1$, is convex [contiguous], i.e. for any two values $(\theta_1, \theta_2)\in\Theta_1$ their convex combinations $\lambda\theta_1+(1-\lambda)\theta_2\in\Theta_1$ for any $0\leq\lambda\leq 1$.

When convexity does not hold, as with the 2-sided alternative:

(vi) (2-s): $H_0: \theta=\theta_0$ vs. $H_1: \theta\neq\theta_0$,

the test $T_\alpha:=\{d(X), C_1(\alpha)\}$, $C_1(\alpha)=\{x: |d(x)|>c_{\frac{\alpha}{2}}\}$, is $\alpha$-UMPU (Unbiased); the $\alpha$-level and p-value are:

$\alpha = P(|d(X)|>c_{\frac{\alpha}{2}};\ \theta=\theta_0), \qquad q(x_0) = P(|d(X)|>|d(x_0)|;\ \theta=\theta_0).$
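49. 7 Appendix B: Examples based on Jeffreys prior

For the simple Bernoulli model, consider selecting Jeffreys invariant prior:

$$\pi(\theta)= \frac{1}{B(0.5,\,0.5)}\,\theta^{-0.5}(1-\theta)^{-0.5},\quad \theta\in[0,1].$$

This gives rise to a posterior distribution of the form:

$$\pi(\theta|\mathbf{x}_0) \sim \text{Beta}\left(n\bar{x} + 0.5,\ n(1-\bar{x}) + 0.5\right),\quad \theta\in[0,1].$$

■ (a) For $n\bar{x}=2$, $n=20$ the likelihood function is: $L(\theta;\mathbf{x}_0) \propto \theta^{2}(1-\theta)^{18}$, $\theta\in[0,1]$, and the posterior density is: $\pi(\theta|\mathbf{x}_0) \sim \text{Beta}(2.5,\ 18.5)$, $\theta\in[0,1]$.

The Bayesian point estimates are: $\tilde{\theta}=\frac{1.5}{19}=0.0789$, $\hat{\theta}=\frac{2.5}{21}=0.119$.

A 0.95 credible interval for $\theta$ is: $\mathbb{P}(0.0214 \leq \theta < 0.3803)=0.95$:

$$\frac{1}{B(2.5,\,18.5)}\int_{0.0214}^{1}\theta^{1.5}(1-\theta)^{17.5}\,d\theta=0.975, \qquad \frac{1}{B(2.5,\,18.5)}\int_{0.3803}^{1}\theta^{1.5}(1-\theta)^{17.5}\,d\theta=0.025.$$

■ (b) For $n\bar{x}=18$, $n=20$ the likelihood function is: $L(\theta;\mathbf{x}_0) \propto \theta^{18}(1-\theta)^{2}$, $\theta\in[0,1]$, and the posterior density is: $\pi(\theta|\mathbf{x}_0) \sim \text{Beta}(18.5,\ 2.5)$, $\theta\in[0,1]$.

The Bayesian point estimates are: $\tilde{\theta}=\frac{17.5}{19}=0.921$, $\hat{\theta}=\frac{18.5}{21}=0.881$.

A 0.95 credible interval for $\theta$ is: $\mathbb{P}(0.716 \leq \theta < 0.97862)=0.95$:

$$\frac{1}{B(18.5,\,2.5)}\int_{0.716}^{1}\theta^{17.5}(1-\theta)^{1.5}\,d\theta=0.975, \qquad \frac{1}{B(18.5,\,2.5)}\int_{0.979}^{1}\theta^{17.5}(1-\theta)^{1.5}\,d\theta=0.025.$$

■ (c) For $n\bar{x}=72$, $n=80$ the likelihood function is: $L(\theta;\mathbf{x}_0) \propto \theta^{72}(1-\theta)^{8}$, $\theta\in[0,1]$,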
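50. and the posterior density is: $\pi(\theta|\mathbf{x}_0) \sim \text{Beta}(72.5,\ 8.5)$, $\theta\in[0,1]$.

The Bayesian point estimates are: $\tilde{\theta}=\frac{71.5}{79}=0.905$, $\hat{\theta}=\frac{72.5}{81}=0.895$.

A 0.95 credible interval for $\theta$ is: $\mathbb{P}(0.82 \leq \theta < 0.9515)=0.95$:

$$\frac{1}{B(72.5,\,8.5)}\int_{0.82}^{1}\theta^{71.5}(1-\theta)^{7.5}\,d\theta=0.975, \qquad \frac{1}{B(72.5,\,8.5)}\int_{0.9515}^{1}\theta^{71.5}(1-\theta)^{7.5}\,d\theta=0.025.$$

■ (d) For $n\bar{x}=40$, $n=80$ the likelihood function is: $L(\theta;\mathbf{x}_0) \propto \theta^{40}(1-\theta)^{40}$, $\theta\in[0,1]$, and the posterior density is: $\pi(\theta|\mathbf{x}_0) \sim \text{Beta}(40.5,\ 40.5)$, $\theta\in[0,1]$.

The Bayesian point estimates are: $\tilde{\theta}=\frac{39.5}{79}=0.5$, $\hat{\theta}=\frac{40.5}{81}=0.5$.

A 0.95 credible interval for $\theta$ is: $\mathbb{P}(0.3923 \leq \theta < 0.6077)=0.95$:

$$\frac{1}{B(40.5,\,40.5)}\int_{0.3923}^{1}\theta^{39.5}(1-\theta)^{39.5}\,d\theta=0.975, \qquad \frac{1}{B(40.5,\,40.5)}\int_{0.6077}^{1}\theta^{39.5}(1-\theta)^{39.5}\,d\theta=0.025.$$

In view of the symmetry of the posterior distribution, even the asymptotic Normal credible interval (??) should give a good approximation. Given that $\hat{\theta}=\frac{(n\bar{x}+\alpha)}{(n+\alpha+\beta)}=0.5$, the approximate credible interval is:

$$\mathbb{P}\left( \left[0.5-1.96\,\frac{\sqrt{0.5(1-0.5)}}{\sqrt{80}}\right]=0.390 \leq \theta < 0.610=\left[0.5+1.96\,\frac{\sqrt{0.5(1-0.5)}}{\sqrt{80}}\right] \right)=1-\alpha$$

which provides a reasonably good approximation to the exact one.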
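The calculations in cases (a)-(d) can be sketched numerically. The code below is a rough illustration, not the method used to produce the figures in these notes: it assumes the first point estimate is the posterior mode $(a-1)/(a+b-2)$ and the second is the posterior mean $a/(a+b)$ of a Beta$(a,b)$ posterior (which matches the reported ratios), it takes the credible interval to be equal-tailed, and it replaces a library Beta quantile routine with Simpson integration plus bisection. All function names are hypothetical, and quantiles obtained this way are approximate and need not match the reported values digit for digit.

```python
import math

def jeffreys_posterior(successes, n):
    """Beta(a, b) posterior for the simple Bernoulli model under the Beta(.5, .5) prior."""
    return successes + 0.5, n - successes + 0.5

def beta_pdf(t, a, b):
    """Beta(a, b) density; returning 0 at the endpoints is valid here since a, b > 1."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

def beta_cdf(x, a, b, steps=2000):
    """Composite Simpson approximation to P(theta <= x); steps must be even."""
    h = x / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3

def beta_quantile(p, a, b, tol=1e-8):
    """Bisection on the approximate CDF."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def summarize(successes, n, level=0.95):
    a, b = jeffreys_posterior(successes, n)
    mode = (a - 1) / (a + b - 2)
    mean = a / (a + b)
    tail = (1 - level) / 2
    exact = (beta_quantile(tail, a, b), beta_quantile(1 - tail, a, b))
    # Normal approximation centred at the posterior mean, with 1.96 for a 0.95 level.
    half_width = 1.96 * math.sqrt(mean * (1 - mean)) / math.sqrt(n)
    approx = (mean - half_width, mean + half_width)
    return mode, mean, exact, approx

for successes, n in [(2, 20), (18, 20), (72, 80), (40, 80)]:
    mode, mean, exact, approx = summarize(successes, n)
    print(f"n*xbar={successes}, n={n}: mode={mode:.4f}, mean={mean:.4f}, "
          f"exact=({exact[0]:.4f}, {exact[1]:.4f}), approx=({approx[0]:.4f}, {approx[1]:.4f})")
```

Comparing the `exact` and `approx` columns across the four cases gives a sense of when the Normal approximation is adequate: it is expected to do best for the symmetric posterior in case (d) and worst for the small, skewed samples in cases (a) and (b).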