PHIL 6334 - Probability/Statistics Lecture Notes 7:
An Introduction to Bayesian Inference
Aris Spanos [Spring 2014]
1 Introduction to Bayesian Inference
These lecture notes provide an elementary introduction to Bayesian
inference, focusing almost exclusively on two simple models,
the simple Bernoulli and the simple Normal, to keep the technical
details to a minimum.
Textbooks often motivate the various techniques associated
with Bayesian statistics by contrasting them to those
of frequentist (classical) inference. Indeed, adherents to the
Bayesian approach often begin with a litany of charges leveled
against the frequentist approach which are usually due
to inadequate understanding of the latter; see section 5. The
following quotation from O'Hagan (1994), p. 16, is typical of
such charges:
“Broadly speaking, some of the arguments in favour of the
Bayesian approach are that it is fundamentally sound, very flexi-
ble, produces clear and direct inferences and makes use of all the
available information. In contrast, the classical approach suffers
from some philosophical flaws, has restrictive range of inferences
with rather indirect meaning and ignores prior information.”
To address the above claims by O'Hagan one needs to
compare the two approaches in depth, focusing on their
underlying reasoning and primary objectives.
Section 2: Probability and its various interpretations
Section 3: Frequentist Inference
Section 4: Bayesian Inference
Section 5: Charges against the frequentist approach to
inference.
2 Probability and its various interpretations
For mathematical purposes Kolmogorov's axiomatization of
probability is perfectly adequate, but when it comes to statistical
inference the interpretation of probability is absolutely necessary. What
does mathematical probability correspond to in the real world?
This correspondence will determine the kind of inductive procedures
one should follow in 'learning from data'. That is, the
interpretation of probability will determine the nature of
the inductive (statistical) inference called for.
From the viewpoint of statistical inference let us consider
the following interpretations of probability:
(i) the classical (equally likely) interpretation,
(ii) the degrees of belief interpretation,
(iii) the relative frequency interpretation, $\Pr(A)$.
For a better understanding of the various interpretations of
probability one should consider them in the context in which
they were first developed. As early as the 18th century all
three basic interpretations of probability were being used in
different contexts without much thought of choosing one in-
terpretation for all purposes.
The classical interpretation was used in the context of
games of chance and was viewed as stemming from equal prob-
ability outcomes based on some sort of physical symmetry.
The degrees of belief interpretation originated from at-
tempts to quantify the relationship between the evidence pre-
sented in courts and the degree of conviction in the mind of
the judge.
The relative frequency interpretation originated from
mortality and socio-economic data gathered over long periods
of time from the 16th century onwards.
The question to be considered is whether any of these in-
terpretations can provide a pertinent link between Kol-
mogorov’s mathematical theory of probability and empirical
modeling and inference.
2.1 The classical interpretation of probability
It is generally accepted that, historically, the theory of probability
was developed in the context of games of chance such as
casting dice or tossing coins. It was only natural then that the
first interpretation of probability was inextricably bound up
with the chance mechanism of such games. The first explicit
statement of the classical definition of probability was given by
Laplace at the beginning of the 19th century.
The classical definition. Consider a random experiment
$\mathcal{E}$ with $N$ equally likely outcomes, and suppose that event $A$
occurs when any one of $m$ of these outcomes occurs; then, according
to the classical definition of probability:
$$P(A)=\frac{m}{N}.$$
Example. Let $X$ be the random variable denoting the
number of dots on the sides of a die. When the die is symmetric
and homogeneous:

Prob. distribution of $X$:
$x$:      1    2    3    4    5    6
$f(x)$:  1/6  1/6  1/6  1/6  1/6  1/6
The first important feature of this definition is its reliance
on the nature of an explicit chance mechanism such as casting
dice or tossing coins. Its second crucial feature is that it uti-
lizes the apparent physical symmetry of the device underlying
the chance mechanism to define probability by evaluating it
as “the ratio of the number of outcomes favorable to the event
to the total number of possible outcomes, each assumed to be
equally likely” (see Laplace, 1814).
For the purposes of providing the missing link between the
mathematical concept of a statistical model and the notion of
chance regularity, this definition of probability is inadequate
for a number of reasons including:
(i) it is based on an explicit chance mechanism,
(ii) the chance mechanism has a built-in physical
symmetry that leads to equally likely outcomes,
(iii) it assumes that one can partition the set of outcomes
into a finite number of equally likely events.
This definition has been severely criticized in the literature
but the critics tend to concentrate their arrows on the equally
likely clause.
■ What do we mean by equally likely and how do we recognize
equally likely outcomes?
Laplace proposed a principle for justifying equally likely
cases, known as the principle of insufficient reason,
or the principle of indifference. This is based on the idea that
if one has no reason to favor one outcome over another,
the outcomes are considered equally likely. This principle has given
rise to several paradoxes and has been called into question
repeatedly (Hacking, 1975).
In addition to the objection to the equally likely clause,
there is one crucial objection to the classical definition: it
assumes that one can partition the set of outcomes into a
finite number of equally likely events.
■ What happens when the random experiment does not
enjoy this symmetry, such as the case of a biased coin?
■ What about axiom A1 of the mathematical definition?
In the case of an infinite sample space the denominator of the
above definition will get us into trouble.
Empirical modeling. The classical interpretation of prob-
ability is too restrictive for empirical modeling purposes.
2.2 The ‘degrees of belief’ interpretation of probability
Our interest in the degree of belief interpretation of prob-
ability stems from the fact that it leads to an approach to
statistical inference known as the Bayesian approach.
The degree of belief interpretation of probability comes in
two flavors: the subjective and rational.
A. Degrees of subjective belief
The subjective interpretation considers the probability of
an event $A$ as based on the personal judgment of whoever is
assigning the probability, the personal judgement being based
on the individual's experience and background. In this sense
the probability of event $A$ is based on the person's beliefs and
information relating to the experiment giving rise to event $A$.
Example. In the case of tossing a fair coin a person is
likely to assign the subjective probability $P(H)=\frac{1}{2}$, because
a symmetric coin would render $H$ and $T$ a priori equally
likely. In the case where the person in question has additional
information relating to the mechanism, such as that the coin is
bent, the subjective probability is likely to change.
De Finetti introduced a convenient way to think of subjective
probabilities in terms of betting odds. Let us
consider the case of betting on the occurrence of an event $A$,
where somebody offers odds of 2 to 1, or in ratio form $\frac{1}{2}$. If the
person whose degrees of subjective belief we are trying to assess
thinks that these are fair odds, then we can proceed to
evaluate her subjective probability via:
$$\frac{1/2}{1+1/2}=\frac{1}{3},\quad \text{i.e. } P(A)=\frac{1}{3}.$$
More generally, if the subjective probability for the occurrence
of the event $A$ is $p$ (i.e. $P(A)=p$), then the odds ratio $\lambda$
and the corresponding subjective probability $p$ are related via:
$$\lambda=\frac{p}{1-p}\ \Rightarrow\ p=\frac{\lambda}{1+\lambda}.$$
As we can see, the subjective dimension of this probability
arises from the fact that it is the decision of a particular individual
whether the odds are fair or not. Another individual
might consider as fair a different odds ratio $\lambda'$, which implies that
her subjective probability is $p'=\frac{\lambda'}{1+\lambda'}\neq p$. This is not surprising
because the personal experiences which influence judgement
are often different between individuals.
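As a quick numerical check of the conversion between odds and subjective probability, here is a minimal Python sketch (the function names are ours, not from the notes):

```python
def prob_from_odds(odds_ratio):
    """Convert a fair odds ratio lambda = p/(1-p) into the probability p."""
    return odds_ratio / (1.0 + odds_ratio)

def odds_from_prob(p):
    """Convert a probability p into the corresponding odds ratio."""
    return p / (1.0 - p)

# The betting example above: odds of 2 to 1, i.e. ratio 1/2
print(prob_from_odds(0.5))    # 0.333... = P(A)
print(odds_from_prob(1 / 3))  # 0.5, recovering the odds ratio
```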
The question which naturally arises at this stage is whether
such personal subjective probabilities will behave in accordance
with the mathematical definition of probability. The
answer is yes, under certain restrictions, as demonstrated by
Ramsey (1926), de Finetti (1937) and Savage (1954).
B. Degrees of rational belief
Another question with regard to the degree of belief interpretation
of probability is whether one could find some way
to establish that a particular odds ratio will be considered fair
by a rational person, assuming a formal definition of rationality.
The idea is that in such a case the subjective
dimension becomes less personalistic.
Keynes (1921) was the first to propose such an interpreta-
tion, often called logical probability. Carnap (1950) general-
ized and extended it, but the logical interpretation has been
severely criticized on several grounds.
Empirical modeling. How appropriate is the degrees of
belief interpretation of probability for empirical modeling?
If data $\mathbf{x}_0:=(x_1,\ldots,x_n)$ contain systematic information in
the form of chance regularities, this 'stochasticity' is a feature of
the real-world phenomenon and exists independently of one's beliefs.
Hence, the degree of belief interpretation has a limited role to play in
selecting an appropriate statistical model.
2.3 The frequency interpretation of probability
The frequency interpretation of probability can be traced back
to the statistical regularities established during the 18th and
19th centuries, based on demographic, anthropometric, economic
and social (crimes, violent deaths, etc.) data. The
analysis of these data led to an amazing conclusion:
“despite the unpredictability at the individual level (people,
firms etc.) there was a remarkable stability of the relative
frequencies at the aggregate level (groups) over long periods
of time.”
In the context of the frequency interpretation, the probability
of an event $A$ is viewed as an empirical regularity associated
with this event.
The frequentist interpretation. Consider the case
where one is able to repeat an experiment under identical
conditions, and denote the relative frequency of the event $A$
after $n$ trials by $\frac{m_n}{n}$; then the frequency interpretation views
the probability of event $A$ as the limit of the relative frequency
$\frac{m_n}{n}$ as the number of repetitions goes to infinity:
$$\Pr(A)=\lim_{n\to\infty}\left(\frac{m_n}{n}\right). \qquad (1)$$
Can one prove the above mathematical claim?
NO, and thus the von Mises attempt to provide a frequentist
interpretation of probability using (1) is a dead end for
purely mathematical reasons! Instead, (1) is a mathematically
false statement that reflects the intuition underlying the
frequentist interpretation of probability. It becomes a mathematically
true statement when the limit is interpreted in probabilistic
terms; see Spanos (2013).
A formal justification for the frequentist interpretation as
the limit of relative frequencies is grounded on the Strong
Law of Large Numbers (SLLN), which gives precise meaning
to the claim 'the sequence of relative frequencies $\{\frac{m_n}{n}\}_{n=1}^{\infty}$
converges to $P(A)=\theta$ as $n\to\infty$'. To see that, let $X_k=1$ when $A$
occurs and $X_k=0$ otherwise, so that $\frac{m_n}{n}=\frac{1}{n}\sum_{k=1}^{n}X_k$.
Borel (1909). For an IID Bernoulli process $\{X_k,\ k\in\mathbb{N}\}$
defining the simple Bernoulli model:
$$\mathcal{M}_{\theta}(\mathbf{x}):\ X_k \sim \mathrm{BerIID}(\theta,\ \theta(1-\theta)),\ k\in\mathbb{N}, \qquad (2)$$
$$P\!\left(\lim_{n\to\infty}\left(\tfrac{1}{n}\sum_{k=1}^{n}X_k\right)=\theta\right)=1. \qquad (3)$$
That is, as $n\to\infty$ the stochastic sequence $\{\overline{X}_n\}_{n=1}^{\infty}$, where
$\overline{X}_n=\frac{1}{n}\sum_{k=1}^{n}X_k$, converges to the constant $\theta$ with probability
one.
Fig. 1: The Strong Law of Large Numbers in action (the running average $\overline{X}_n$ plotted against the index $n$).
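The figure is easy to reproduce on a computer; the following is a minimal sketch (the seed, $\theta=.5$ and $n=1000$ are our own choices) of the running relative frequency of an IID Bernoulli sample settling down around $\theta$, in line with (3):

```python
import numpy as np

rng = np.random.default_rng(seed=0)              # arbitrary seed
theta, n = 0.5, 1000                             # illustrative values
x = rng.binomial(1, theta, size=n)               # IID Bernoulli(theta) draws
running_avg = np.cumsum(x) / np.arange(1, n + 1)

print(running_avg[[9, 99, 999]])                 # relative frequency after 10, 100, 1000 trials
```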
The long-run metaphor. It is also important to note
that the long-run metaphor associated with the frequentist interpretation
envisions repeating the mechanism represented
by an IID Bernoulli process and observing the relative frequency
process $\{\overline{X}_n\}_{n=1}^{\infty}$ (almost surely) approximate $\theta$ as $n\to\infty$.
▶ It is crucial to emphasize that, when viewed in the context
of a statistical model $\mathcal{M}_{\theta}(\mathbf{x})$, the key dimension of the
long-run metaphor is not the temporal aspect. Keynes's tongue-in-cheek
comment that "in the long-run we will all be dead"
is totally misplaced in this context.
▶ The crucial dimension of the long-run is repeatability (in
principle) of the data-generating process. To generate the realization
shown below takes a fraction of a nanosecond! Moreover,
the long-run can be operationalized on a computer for
any statistical model using the statistical GM.
Example. In the case of the simple Normal model, one
can generate as many sample realizations as wanted using
pseudo-random numbers:
$$X_t=\mu+\sigma Z_t,\quad Z_t\sim\mathrm{N}(0,1),\quad t=1,2,\ldots,n.$$
Circularity? The issue often raised, when invoking the
SLLN as a justification for the frequency definition of probability,
is that the argument suffers from circularity:
one uses probability to define probability!
This claim is based on confusion. The SLLN states that,
under certain restrictions on the probabilistic structure of
$\{X_k,\ k\in\mathbb{N}\}$, one can deduce:
$$P\!\left(\lim_{n\to\infty}\left(\tfrac{1}{n}\sum_{k=1}^{n}X_k\right)=\theta\right)=1. \qquad (4)$$
This is a measure-theoretic result which asserts that the convergence
holds everywhere in a domain $D\subset\mathbb{R}$ except on a
subset $D_0$, the latter being a set of measure zero ($P(D_0)=0$),
the impossible event. This mathematical theorem links the
probability $P(A)=\theta$ to the relative frequencies $\left(\tfrac{1}{n}\sum_{k=1}^{n}X_k\right)$ via the
interpretive provisions:
[i] data $\mathbf{x}_0:=(x_1,x_2,\ldots,x_n)$ are viewed as a 'truly typical'
realization of the process $\{X_k,\ k\in\mathbb{N}\}$ specified by $\mathcal{M}_{\theta}(\mathbf{x})$,
[ii] the 'typicality' of $\mathbf{x}_0$ (e.g. IID) can be assessed using
trenchant Mis-Specification (M-S) testing.
These provisions provide a pertinent link between the mathematical
framework and the data-generating mechanism that
gave rise to $\mathbf{x}_0$.
3 Frequentist Inference
Statistical model: $\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\theta),\ \theta\in\Theta\},\ \mathbf{x}\in\mathbb{R}_X^n$
  $\Longrightarrow$ Distribution of the sample: $f(\mathbf{x};\theta),\ \mathbf{x}\in\mathbb{R}_X^n$
Data: $\mathbf{x}_0:=(x_1,x_2,\ldots,x_n)$
  $\Longrightarrow$ Likelihood function: $L(\theta;\mathbf{x}_0),\ \theta\in\Theta$

The frequentist approach to statistical inference
For the frequentist approach:
[a] The interpretation of probability is the frequency interpretation.
[b] The systematic statistical information in data $\mathbf{x}_0$, in
the form of chance regularities, constitutes the only relevant
statistical information for selecting the statistical model. Substantive
information comes in the form of restrictions on statistical
parameters, but should not be imposed at the outset;
its validity needs to be established before it is imposed.
[c] The primary aim of the frequentist approach is to learn
from data $\mathbf{x}_0$ about the 'true' underlying data-generating mechanism
$\mathcal{M}^{*}(\mathbf{x})=\{f(\mathbf{x};\theta^{*})\},\ \mathbf{x}\in\mathbb{R}_X^n$, where $\theta^{*}$ denotes the true value
of $\theta$. In general, the expression '$\theta^{*}$ denotes the true value of
$\theta$' is a shorthand for saying that 'data $\mathbf{x}_0$ constitute a realization
of the sample $\mathbf{X}$ with distribution $f(\mathbf{x};\theta^{*})$'. This is
achieved by employing reliable and effective inference procedures
that revolve around the unknown parameter(s) $\theta$, based
on ascertainable error probabilities.
▶ The underlying inductive reasoning comes in two forms:
(i) factual: the true state of nature (estimation and pre-
diction), whatever that happens to be, and
(ii) hypothetical: various hypothetical scenarios are com-
pared to what actually happened (hypothesis testing).
[d] Frequentist inference is based exclusively on the distribution
of the sample $f(\mathbf{x};\theta)$, $\mathbf{x}\in\mathbb{R}_X^n$. This is because all frequentist
procedures, estimators, test statistics and predictors,
are based on statistics of the form:
$$Y_n=g(X_1,X_2,\ldots,X_n),$$
whose sampling distribution is determined by $f(\mathbf{x};\theta)$ via:
$$F_{Y_n}(y)=P(Y_n\leq y)=\underbrace{\int\!\!\int\cdots\int}_{\{\mathbf{x}:\,g(x_1,x_2,\ldots,x_n)\leq y\}} f(\mathbf{x};\theta)\,dx_1 dx_2\cdots dx_n. \qquad (5)$$
Its importance stems from the fact that all relevant error probabilities
(coverage, type I and II, power, p-value) associated
with frequentist inference are based on such a sampling distribution.
In the case where the density function $f_{Y_n}(y)$ is continuous:
$f_{Y_n}(y;\theta)=\frac{dF_{Y_n}(y)}{dy}$.
3.0.1 The simple Bernoulli model
The discussion that follows focuses on the simple Bernoulli
model, specified by:
$$\mathcal{M}_{\theta}(\mathbf{x}):\ X_k \sim \mathrm{BerIID}(\theta,\ \theta(1-\theta)),\ k\in\mathbb{N}:=(1,2,\ldots,n,\ldots),$$
where 'BerIID' stands for 'Bernoulli, Independent and Identically
Distributed'.
Using the IID assumptions, one can derive the distribution
of the sample:
$$f(x_1,x_2,\ldots,x_n;\theta)\overset{\text{IID}}{=}\prod_{k=1}^{n} f(x_k;\theta)\overset{\text{Ber}}{=}\prod_{k=1}^{n}\theta^{x_k}(1-\theta)^{1-x_k}=\theta^{\sum_{k=1}^{n}x_k}(1-\theta)^{\sum_{k=1}^{n}(1-x_k)}=\theta^{n\overline{x}_n}(1-\theta)^{n(1-\overline{x}_n)}, \qquad (6)$$
where $\overline{x}_n=\frac{1}{n}\sum_{k=1}^{n}x_k$ is the sample mean, and $y=n\overline{x}_n$ denotes
the number of 'successes' ($x_k=1$) in $n$ trials, i.e.
$$f(\mathbf{x};\theta)=\theta^{n\overline{x}_n}(1-\theta)^{n(1-\overline{x}_n)},\ \text{for all } \mathbf{x}\in\mathbb{R}_X^n=\{0,1\}^n, \qquad (7)$$
which, viewed as a function of $y=n\overline{X}_n$, is Binomially distributed. The
likelihood function is defined as proportional to the distribution of the
sample:
$$L(\theta;\mathbf{x}_0)\propto f(\mathbf{x}_0;\theta),\ \text{for all } \theta\in[0,1]. \qquad (8)$$
Note that the proportionality ($\propto$) is important because $L(\theta;\mathbf{x}_0)$
is viewed as a function of $\theta\in\Theta$, but $f(\mathbf{x};\theta)$ is a function of
$\mathbf{x}\in\mathbb{R}_X^n$, and the two usually have very different dimensions. Indeed,
in the simple Bernoulli case $f(\mathbf{x};\theta)$ is discrete but $L(\theta;\mathbf{x}_0)$
is continuous.
Example (a). For $y=n\overline{x}_n=4$, $n=20$, the likelihood function
takes the form:
$$L(\theta;\mathbf{x}_0)\propto\theta^{4}(1-\theta)^{16},\ \theta\in[0,1].$$
Fig. 1: $f(y;\theta)$, the Binomial ($n=20$, $\theta=.2$) sampling distribution, $y\in\mathbb{R}_Y$.
Fig. 2: the likelihood $L(\theta;\mathbf{x}_0)$, $\theta\in[0,1]$.
The sampling distribution of  is shown in fig. 1 and the
normalized form of the likelihood function is given in fig. 2.
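A minimal sketch (using the example's $y=4$, $n=20$) of the two objects contrasted in figs. 1-2, the discrete sampling distribution of $Y$ and the continuous (scaled) likelihood of $\theta$:

```python
import numpy as np
from scipy import stats

n, y = 20, 4
# Fig. 1: Binomial(n, theta=.2) probabilities over y = 0,...,n
pmf = stats.binom.pmf(np.arange(n + 1), n, 0.2)
# Fig. 2: likelihood L(theta; x0) proportional to theta^y (1-theta)^(n-y) on a grid
theta = np.linspace(0, 1, 501)
lik = theta**y * (1 - theta)**(n - y)
lik_scaled = lik / (lik.sum() * (theta[1] - theta[0]))   # scale to integrate to one

print(pmf[y])                  # P(Y=4; theta=.2)
print(theta[np.argmax(lik)])   # the likelihood peaks at y/n = .2
```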
3.0.2 The simple Normal model
The discussion that follows focuses on the simple Normal
model, specified by:
$$\mathcal{M}_{\theta}(\mathbf{x}):\ X_k \sim \mathrm{NIID}(\mu,\ \sigma^{2}),\ k\in\mathbb{N}:=(1,2,\ldots,n,\ldots).$$
The distribution of the sample takes the form:
$$f(\mathbf{x};\boldsymbol{\theta})=\prod_{k=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{1}{2\sigma^{2}}(x_k-\mu)^{2}\right)=\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n}\exp\!\left\{-\frac{1}{2\sigma^{2}}\sum_{k=1}^{n}(x_k-\mu)^{2}\right\}.$$
This means that the likelihood function is:
$$L(\mu,\sigma^{2};\mathbf{x}_0)\propto\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n}\exp\!\left\{-\frac{1}{2\sigma^{2}}\sum_{k=1}^{n}(x_k-\mu)^{2}\right\},\ -\infty<\mu<\infty,\ \sigma^{2}>0.$$
In the case where $\sigma^{2}$ is known this simplifies to:
$$L(\mu;\mathbf{x}_0)\propto\exp\!\left\{-\frac{1}{2\sigma^{2}}\sum_{k=1}^{n}(x_k-\mu)^{2}\right\},\ -\infty<\mu<\infty.$$
3.1 Frequentist Estimation (point)
In general, the Maximum Likelihood (ML) estimator of  is
defined by:
b(X)=max
∈Θ
(ln (; x))
It is important to distinguish between three different concepts:
-unknown constant, b(X)-estimator of  b(x0)-estimate of 
In the case of the simple Bernoulli model, the first order
condition for maximizing ln (; x) is:
 ln (;x)

=

− (1−)
1−
=0
when solved for  gives rise to the MLE estimator:
b(X)==1

P
=1  (9)
The sampling distribution of this estimator is:
b(X) v Bin
³
 (1−)

; 
´
 (10)
where ‘v Bin ( ; )’ reads ‘distributed Binomially with mean
, variance  and sample size 
What is a good (optimal) estimator in frequentist
inference?
It’s a statistic (X) that pin-points ∗
 the true value of 
most effectively. Its capacity (effectiveness) to pin-points ∗
is
defined in terms of several optimal properties, such as (i) Un-
biaseness: (b(X))=∗
 (ii) Full Efficiency:  (b(X))=(∗
)
(iii) Strong Consistency: P(lim→∞
b(X)=∗
)=1.
Example. The estimator b(X)=1

P
=1  whose sam-
pling distribution is given by (10) is optimal because it is
unbiased, fully efficient and strongly consistent!
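A minimal simulation sketch (with an assumed $\theta^{*}=.2$ and $n=20$) of what 'pin-pointing $\theta^{*}$' means for $\hat{\theta}(\mathbf{X})=\overline{X}_n$: over repeated samples its mean is $\theta^{*}$ and its variance matches $\theta(1-\theta)/n$ in (10):

```python
import numpy as np

rng = np.random.default_rng(seed=2)
theta_star, n, reps = 0.2, 20, 10_000
theta_hat = rng.binomial(n, theta_star, size=reps) / n     # MLE over many samples

print(theta_hat.mean())                                    # close to .2 (unbiasedness)
print(theta_hat.var(), theta_star * (1 - theta_star) / n)  # close to theta(1-theta)/n
```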
3.2 Confidence Intervals (CIs)
The most straightforward procedure to derive a Confidence
Interval (CI) is to find a pivotal quantity whose distribution
under $\theta=\theta^{*}$ is known.
A. For the simple Bernoulli model such a pivotal quantity
can be specified using the ML estimator $\hat{\theta}(\mathbf{X})=\frac{1}{n}\sum_{k=1}^{n}X_k$
and its sampling distribution in (10):
$$d(\mathbf{X};\theta)=\frac{\sqrt{n}\,(\hat{\theta}(\mathbf{X})-\theta)}{\sqrt{\theta(1-\theta)}}\ \overset{\theta=\theta^{*}}{\sim}\ \mathrm{Bin}(0,1;n). \qquad (11)$$
In general, an optimal CI begins with an optimal estimator.
Since its sampling distribution is known, one can specify a
two-sided CI of the form:
$$P\!\left(-c_{\frac{\alpha}{2}}\leq d(\mathbf{X};\theta)<c_{\frac{\alpha}{2}};\ \theta=\theta^{*}\right)=1-\alpha,$$
and then 'solve' for $\theta$ to derive the optimal $(1-\alpha)$ Confidence
Interval (CI):
$$P\!\left(\hat{\theta}-c_{\frac{\alpha}{2}}\sqrt{\tfrac{\hat{\theta}(1-\hat{\theta})}{n}}\ \leq\ \theta\ <\ \hat{\theta}+c_{\frac{\alpha}{2}}\sqrt{\tfrac{\hat{\theta}(1-\hat{\theta})}{n}}\right)=1-\alpha, \qquad (12)$$
where $\hat{\theta}$ is a shorthand for $\hat{\theta}(\mathbf{X})$. In practice, finding $c_{\frac{\alpha}{2}}$ can
be cumbersome and in many cases (12) can be approximated
(for a large enough sample size, say $n\geq 20$) using the $\mathrm{N}(0,1)$
distribution.
Example (a). In the case where ==4, =2 =20
the approximate 95 observed CI for :∙
(2 − 196√
20
p
2(8))=025 ≤   0375=(2 + 196
q
2(8)
20
)
¸

Example (b). In the case where ==12, =6 =20
the approximate 95 observed CI for :∙
(6 − 196√
20
p
6(4))=385 ≤   0815=(6 + 196
q
6(4)
20
)
¸

15
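A minimal sketch (values from examples (a) and (b)) of the approximate 95% observed CI in (12), using the N(0,1) approximation for $c_{\frac{\alpha}{2}}$:

```python
import numpy as np
from scipy import stats

def approx_ci(y, n, alpha=0.05):
    """Approximate (1-alpha) observed CI for the Bernoulli parameter theta."""
    theta_hat = y / n
    c = stats.norm.ppf(1 - alpha / 2)                    # close to 1.96 for alpha = .05
    half = c * np.sqrt(theta_hat * (1 - theta_hat) / n)
    return theta_hat - half, theta_hat + half

print(approx_ci(4, 20))    # about (.025, .375), example (a)
print(approx_ci(12, 20))   # about (.385, .815), example (b)
```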
B. For the simple Normal model (with 2
known) the
pivotal quantify can be specified using the ML estimator
=1

P
=1  whose sampling distribution is:
(X; )=
√
(−)

=∗
v N (0 1)  (13)
Using (13) one can specify a two-sided CI of the form:
P
¡
−
2
≤ (X; )  
2
; =∗
¢
=1 − 
and then ‘solve’ for  to derive the optimal (1−) Confidence
Interval (CI):
P
³
[ − 
2
√

] ≤   [ + 
2
√

´
=1 −  (14)
3.3 Frequentist Testing
A. In the context of the simple Bernoulli model, consider the
hypotheses:
$$H_0:\ \theta=\theta_0\ \ \text{vs.}\ \ H_1:\ \theta>\theta_0. \qquad (15)$$
As in the case of interval estimation, a good test procedure begins
with a good estimator, which gives rise to a test statistic.
In this case the test statistic looks similar to (11):
$$d(\mathbf{X})=\frac{\sqrt{n}\,(\hat{\theta}(\mathbf{X})-\theta_0)}{\sqrt{\theta_0(1-\theta_0)}}\ \overset{\theta=\theta_0}{\sim}\ \mathrm{Bin}(0,1;n),$$
where $\hat{\theta}(\mathbf{X}):=\overline{X}_n$, but in fact it has two crucial differences:
(i) the test statistic $d(\mathbf{X})$ involves no unknown parameters,
unlike $d(\mathbf{X};\theta)$, and
(ii) the evaluation of its sampling distribution is under the
null ($\theta=\theta_0$), which is hypothetical, and not under $\theta=\theta^{*}$, which
is factual.
The optimal $\alpha$-significance level test takes the form:
$$d(\mathbf{X})=\frac{\sqrt{n}\,(\overline{X}_n-\theta_0)}{\sqrt{\theta_0(1-\theta_0)}},\qquad C_1(\alpha):=\{\mathbf{x}:\ d(\mathbf{x})>c_{\alpha}\}. \qquad (16)$$
Example. In the case where =75, =20 =05 (
2
=196)
0=5 test (16) yields:
(x0)=
√
20(75−5)
√
5(1−5)
=2236
leading to rejecting 0 The p-value associated with this test
is:
P((X)  2236; 0) = 013 (17)
Example. For =6, =20 =05 (
2
=196) 0=5 test
(16) yields:
(x0)=
√
20(6−5)
√
5(1−5)
=894
leading to accepting 0 The p-value associated with this test
is:
P((X)  894; 0) = 186 (18)
B. In the context of the simple Normal model (with 2
known) consider the hypotheses:
0 : =0 vs. 0 :   0
In this case the test statistic takes the form:
(X)=
√
(−0)

=0
v N (0 1) 
and gives rise to the optimal −significance level test takes
the form:
(X)=
√
(−0)

 1():={x : (x)  }
Example. For 0=10 =2 =100 =05 (
2
=196) =116
this yields:
(x0)=
√
100(106−10)
2
=30
leading to rejecting 0 The p-value associated with this test
is:
P((X)  30; 0) = 001 (19)
17
4 Bayesian Inference
4.1 The Bayesian framework
Bayesian inference begins with a statistical model:
$$\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\theta),\ \theta\in\Theta\},\ \mathbf{x}\in\mathbb{R}_X^n,\ \text{for } \theta\in\Theta\subset\mathbb{R}^{m},\ m<n, \qquad (20)$$
where $f(\mathbf{x};\theta)$ is the distribution of the sample $\mathbf{X}:=(X_1,\ldots,X_n)$,
$\mathbb{R}_X^n$ is the sample space and $\Theta$ the parameter space. Bayesian
inference modifies the frequentist inferential set up, based exclusively
on (20), in two crucial respects:
(A) It views the unknown parameter(s) $\theta$ as random variables
with their own distribution, known as the prior distribution:
$$\pi(\theta):\ \Theta\to[0,1],$$
which represents one's a priori assessment of how likely the
various values of $\theta$ in $\Theta$ are; this amounts to ranking the different
models $\mathcal{M}_{\theta}(\mathbf{x})$ for all $\theta\in\Theta$. In the frequentist approach $\theta$ is viewed
as a set of unknown constants indexing $f(\mathbf{x};\theta)$, $\mathbf{x}\in\mathbb{R}_X^n$.
(B) It re-interprets the distribution of the sample as conditional
on the unknown parameters $\theta$, denoted by $f(\mathbf{x}|\theta)$.
Taken together these modifications imply that for Bayesians
the joint distribution of the sample is re-defined as:
$$f(\mathbf{x},\theta)=f(\mathbf{x}|\theta)\cdot\pi(\theta),\ \forall\theta\in\Theta,\ \forall\mathbf{x}\in\mathbb{R}_X^n, \qquad (21)$$
where $\forall$ denotes 'for all'.
Distinguishing criteria:
[a] The Bayesian approach interprets probability as degrees
of belief [subjective, logical or rational].
[b] In the context of Bayesian inference, relevant information
includes:
(i) the data $\mathbf{x}_0:=(x_1,x_2,\ldots,x_n)$, and
(ii) the prior distribution $\pi(\theta)$, $\theta\in\Theta$.
[c] The primary aim of the Bayesian approach is to revise
the initial ranking $\pi(\theta)$, $\forall\theta\in\Theta$, in light of the data $\mathbf{x}_0$, as
précised by $L(\theta|\mathbf{x}_0)$, and to update the ranking in terms of the
posterior distribution derived using Bayes' rule:
$$\pi(\theta|\mathbf{x}_0)=\frac{f(\mathbf{x}_0|\theta)\cdot\pi(\theta)}{\int_{\Theta} f(\mathbf{x}_0|\theta)\cdot\pi(\theta)\,d\theta}\ \propto\ L(\theta|\mathbf{x}_0)\cdot\pi(\theta),\ \forall\theta\in\Theta, \qquad (22)$$
where $L(\theta|\mathbf{x}_0)\propto f(\mathbf{x}_0|\theta)$, $\theta\in\Theta$, denotes the likelihood function,
as re-interpreted by Bayesians.
Remark: in relation to (22) it is important to note that,
contrary to claims by Bayesians (e.g. Ghosh et al, 2006, p. 31):
"$f(\mathbf{x}_0|\theta)$ is the density of $\mathbf{X}$ interpreted as the conditional
density of $\mathbf{X}$ given $\theta$. The numerator is the joint density of $\theta$
and $\mathbf{X}$ and the denominator is the marginal density of $\mathbf{X}$"
(i) $f(\mathbf{x}_0|\theta)$ is not "the conditional density of $\mathbf{X}$ given $\theta$",
since the latter is defined by $f(\mathbf{x}|\theta)$, $\forall\mathbf{x}\in\mathbb{R}_X^n$,
(ii) the numerator is not "the joint density of $\theta$ and $\mathbf{X}$",
since the latter necessitates the double quantifier as in (21),
(iii) the denominator is not "the marginal density of $\mathbf{X}$",
since the latter is defined by $f(\mathbf{x})=\int_{\theta\in\Theta} f(\mathbf{x}|\theta)\cdot\pi(\theta)\,d\theta$, $\forall\mathbf{x}\in\mathbb{R}_X^n$;
see Spanos (2014) for the details.
Prior probabilities: $\pi(\theta),\ \theta\in\Theta$
  ⇓
Statistical model: $\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\theta),\ \theta\in\Theta\},\ \mathbf{x}\in\mathbb{R}_X^n$
  ⇑
Data: $\mathbf{x}_0:=(x_1,\ldots,x_n)$
  $\Big\}$ Bayes' rule $\Longrightarrow$ Posterior distribution: $\pi(\theta|\mathbf{x}_0)\propto\pi(\theta)\cdot L(\theta|\mathbf{x}_0)$

The Bayesian approach to statistical inference
A famous Bayesian, Savage (1954), summarized Bayesian
inference succinctly by:
"Inference means for us the change of opinion induced by evidence
on the application of Bayes' theorem." (p. 178)
O'Hagan (1994) is more specific:
"Having obtained the posterior density $\pi(\theta|\mathbf{x}_0)$, the final step
of the Bayesian method is to derive from it suitable inference
statements. The most usual inference question is this: After
seeing the data $\mathbf{x}_0$, what do we now know about the parameter $\theta$?
The only answer to this question is to present the entire posterior
distribution." (p. 6)
In this sense, learning from data in the context of the
Bayesian perspective pertains to how the original beliefs $\pi(\theta)$
are revised in light of data $\mathbf{x}_0$, the revision coming in the
form of the posterior: $\pi(\theta|\mathbf{x}_0)$, $\forall\theta\in\Theta$.
[d] For inference purposes, the only relevant point in the
sample space $\mathbb{R}_X^n$ is the data $\mathbf{x}_0$, as précised by $L(\theta|\mathbf{x}_0)$, $\theta\in\Theta$.
This feature of Bayesian inference is formalized by the Likelihood
Principle.
Likelihood Principle. For inference purposes the only
relevant sample information pertaining to $\theta$ is contained in
the likelihood function $L(\mathbf{x}_0|\theta)$, $\forall\theta\in\Theta$. Moreover, two sample
realizations $\mathbf{x}_0$ and $\mathbf{y}_0$ contain the same information
about $\theta$ if their likelihoods are proportional to one another,
i.e. $L(\mathbf{x}_0|\theta)=c\,L(\mathbf{y}_0|\theta)$ (Berger and Wolpert, 1988, p. 19).
Indeed, Bayesians take delight in poking fun at frequentist
testing by quoting Jeffreys's (1939) remark about the 'absurdity'
of invoking realizations '$\mathbf{x}\in\mathbb{R}_X^n$' other than $\mathbf{x}_0$:
"What the use of P [p-value] implies, therefore, is that a hypothesis
that may be true may be rejected because it has not
predicted observable results that have not occurred. This seems
a remarkable procedure." (p. 385) [ha, ha, ha!!!!!]
Frequentist inference procedures, such as estimation (point
and interval), hypothesis testing and prediction, DO invoke
realizations $\mathbf{x}\in\mathbb{R}_X^n$ other than $\mathbf{x}_0$, contravening the LP.
4.2 The posterior distribution and its role in inference
Bayesian inference begins with a statistical model M(x),
like the frequentist approach, but it adds a prior distribu-
tion () ∈Θ for the unknown parameter  viewed as a
random variable with its own distribution.
A. In the case of the simple Bernoulli model, the prior
of choice for Bayesians is the Beta distribution, defined in
terms of two parameters $\alpha$ and $\beta$. For the discussion that
follows let us assume that the prior $\pi(\theta)$ is Beta($\alpha,\beta$) distributed
with density function:
$$\pi(\theta)=\frac{1}{B(\alpha,\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1},\quad \alpha>0,\ \beta>0,\ 0<\theta<1. \qquad (23)$$
The Beta distribution is very flexible in the sense that for
different values of the parameters $(\alpha,\beta)$ the density function
can take many different shapes; see fig. 4. Such a prior is
proper because it covers the whole of the parameter space
and it integrates to one, i.e. $\int_{0}^{1}\pi(\theta)\,d\theta=1$.
Bayesian inferences are based on the posterior distribution
of $\theta$ given data $\mathbf{x}_0$, derived via Bayes' formula:
$$\pi(\theta|\mathbf{x}_0)=\frac{\pi(\theta)f(\mathbf{x}_0|\theta)}{\int_{\theta\in[0,1]}\pi(\theta)f(\mathbf{x}_0|\theta)\,d\theta}=\frac{\pi(\theta)f(\mathbf{x}_0|\theta)}{f(\mathbf{x}_0)},\ \forall\theta\in[0,1]. \qquad (24)$$
The denominator $f(\mathbf{x}_0)=\int_{0}^{1}\pi(\theta)f(\mathbf{x}_0|\theta)\,d\theta$ is a normalizing
constant, which is important to ensure that $\pi(\theta|\mathbf{x}_0)$ is a proper
density function, i.e. $\int_{\theta\in[0,1]}\pi(\theta|\mathbf{x}_0)\,d\theta=1$.
Fig. 4: Beta($\alpha,\beta$) densities for different values of $(\alpha,\beta)$, with $\alpha,\beta\in\{1,2,4\}$.
Example. In the case where (x0|) is given by (6) and
() is given by (23), the posterior distribution is:
(|x0) ∝
³
1
()
(−1)
(1 − )−1
´ £

(1 − )(1−)
¤
=
= 1
()
h
+(−1)
(1 − )(1−)+−1
i

(25)
In view of the formula in (23), this is recognized as an ‘un-
normalized’ density of a Beta(∗
 ∗
) where:
∗
= +  ∗
=(1 − ) +  (26)
Since the prior distribution is Beta( ) and the posterior
is also Beta(∗
 ∗
) indicates that the Beta distribution is
conjugate to the Bernoulli likelihood (Binomial), i.e., the
Bernoulli and Beta constitute a conjugate pair because the
prior and posterior belong to the same family of distributions.
Fig. 5: Jeffreys' invariant prior for the Binomial: $\pi(\theta)=\frac{1}{B(.5,.5)}\theta^{-.5}(1-\theta)^{-.5}$, i.e. Beta($.5,.5$).
Example. For Jeffreys’ prior:  v Beta(5 5) (see fig. 5)
and =4 =20:
(; x0) ∝ 4
(1 − )16
 ∈[0 1]
(|x0)=
1
( )
h
+(−1)
(1 − )(1−)+−1
i
 (27)
i.e. (|x0) is Beta(∗
 ∗
) with ∗
=+=45 ∗
=(1-
)+=165
Fig. 6: The (scaled) likelihood $L(\theta;y=4)$, $\theta\in[0,1]$.
Fig. 7: Prior $\pi(\theta)$ vs. posterior $\pi(\theta|\mathbf{x}_0)$ (Beta densities).
The (scaled) likelihood (; x0) is given in fig. 6 and the pos-
terior (|x0) is compared to the prior () in fig. 7. In
fig. 8, the likelihood is compared to the posterior. Note
that all three are continuous, but the sampling distribution
of vBin( (1−)

) is discrete; all relevant frequentist error
probabilities come from this distribution.
Fig. 8: Posterior $\pi(\theta|\mathbf{x}_0)$ [Beta(4.5, 16.5)] vs. (scaled) likelihood $L(\theta;y=4)$ [Beta(5, 17)].
Fig. 9: The sampling distribution $Y\sim\mathrm{Bin}$ ($\theta=.2$; $n=20$).
B. In the case of the simple Normal model, (with 2
known) the prior often selected is Normal:
()= 1
0
√
2
exp(− 1
22
0
(−0)2
) for ∈R
and denoted by:
 v N(0, 2
0), where (0, 2
0) are known values.
Hence, the posterior is defined by:
(|x0) ∝()(x0; ))=
=[ 1
0
√
2
exp(− 1
22
0
(−0)2
)]
h
( 1

√
2
)
exp{- 1
22
P
=1(−)2
}
i
∝ exp{−1
2
h
(−0
0
)2
) +
P
=1(−

)2
i
=
= exp{-1
2
2
³
1
2
0
+ 
2
´
+ 
³
0
2
0
+ 
2
´
= exp{− 1
22
1
( − 1)2
}
(28)
where the parameters (1 2
1) of the posterior (after scaling):
(|x0) v N(1, 2
1).
take the particular form:
1=2
1(0
2
0
+ 
2 ) 2
1=( 1
2
0
+ 
2 )−1
= 2
02
(2
0+2)

After some manipulation 1 can be expressed in the form:
1=+(1−)0 where = 2
0
2
0+2

= 2
1(2

)
which indicates that it is a convex combination of  and 0
with the variance of the prior, 2
0, and the variance of the
MLE   ()=2

, providing the weights. Moreover, as
 → ∞  → 1 and 1 →  In words, asymptotically (as
 → ∞) the mean of the posterior 1 converges to the sample
mean .
24
Example. (a) Consider the case where  v N(8, 1),
=2 =20 =116 The posterior is:
(|x0) v N(11022, 167),
2
1=(1
1
+ 20
4
)−1
=167 1=167(8
1
+ 20(116)
4
)=11022
Fig. 10: Prior $\pi(\mu)$ vs. posterior $\pi(\mu|\mathbf{x}_0)$ (Normal).
Fig. 11: Posterior $\pi(\mu|\mathbf{x}_0)$ vs. likelihood $L(\mu|\mathbf{x}_0)$ (Normal).
(b) Consider the case where the prior is more precise, i.e.
$\sigma_0^{2}=.5$: $\mu\sim\mathrm{N}(8,\ .5)$, $\sigma=2$, $n=20$, $\overline{x}_n=11.6$. The posterior has
a smaller variance:
$$\pi(\mu|\mathbf{x}_0)\sim\mathrm{N}(10.582,\ .143),$$
$$\sigma_1^{2}=\left(\tfrac{1}{.5}+\tfrac{20}{4}\right)^{-1}=.143,\qquad \mu_1=.143\left(\tfrac{8}{.5}+\tfrac{20(11.6)}{4}\right)=10.582.$$
Fig. 12: Prior $\pi(\mu)$ vs. posterior $\pi(\mu|\mathbf{x}_0)$ (Normal).
Fig. 13: Posterior $\pi(\mu|\mathbf{x}_0)$ vs. likelihood $L(\mu|\mathbf{x}_0)$ (Normal).
Example. Consider the case of the simple Normal model
(with $\sigma^{2}$ known), where the prior is Uniform:
$$\pi(\mu)=1\ \text{for } \mu\in\mathbb{R}:=(-\infty,\infty).$$
This prior is said to be improper because it does not integrate
to one, since:
$$\int_{-\infty}^{\infty}\pi(\mu)\,d\mu=\infty\ \text{for any } \pi(\mu)=c\in(0,1].$$
In this case the posterior is proportional to the likelihood:
$$\pi(\mu|\mathbf{x}_0)\propto L(\mu;\mathbf{x}_0)=\left(\tfrac{1}{\sigma\sqrt{2\pi}}\right)^{\!n}\exp\!\left\{-\tfrac{1}{2\sigma^{2}}\sum_{k=1}^{n}(x_k-\mu)^{2}\right\},\ -\infty<\mu<\infty$$
$$=\left(\tfrac{1}{\sigma\sqrt{2\pi}}\right)^{\!n}\exp\!\left\{-\tfrac{n}{2\sigma^{2}}\left[\tfrac{1}{n}\sum_{k=1}^{n}(x_k-\overline{x}_n)^{2}+(\overline{x}_n-\mu)^{2}\right]\right\}=\left(\tfrac{1}{\sigma\sqrt{2\pi}}\right)^{\!n}\exp\!\left\{-\tfrac{n s^{2}}{2\sigma^{2}}\right\}\exp\!\left\{-\tfrac{n}{2\sigma^{2}}(\overline{x}_n-\mu)^{2}\right\}. \qquad (29)$$
Hence, ignoring the constant term $\exp\{-\tfrac{n s^{2}}{2\sigma^{2}}\}$, where $s^{2}=\tfrac{1}{n}\sum_{k=1}^{n}(x_k-\overline{x}_n)^{2}$,
the posterior is Normally distributed:
$$\pi(\mu|\mathbf{x}_0)\sim\mathrm{N}\!\left(\overline{x}_n,\ \tfrac{\sigma^{2}}{n}\right),\ -\infty<\mu<\infty. \qquad (30)$$
Note that in (30) the random variable is $\mu$, which varies with
different values of the unknown parameter ($\mu\in(-\infty,\infty)$), and
$\overline{x}_n$ is a constant. In contrast, the sampling distribution of $\overline{X}_n$:
$$\overline{X}_n\sim\mathrm{N}\!\left(\mu,\ \tfrac{\sigma^{2}}{n}\right),\ \mathbf{x}\in\mathbb{R}_X^n, \qquad (31)$$
varies with different values of the sample realization ($\mathbf{x}\in\mathbb{R}_X^n$).
Hence, any talk about results with identical tail areas has to
be qualified by what the two tail areas really represent.
▶ The intuitive difference between the two distributions is
that one can draw (30) but not (31), since $\mu$ is unknown!
4.3 Bayesian Point Estimation
According to O’Hagan (1994):
“Classical inference theory is very concerned with constructing
good inference rules. The primary concern of Bayesian inference,
..., is entirely different. The objective [of Bayesian inference] is to
extract information concerning θ from the posterior distribution,
and to present it helpfully via effective summaries. There are
two criteria in this process. The first is to identify interesting
features of the posterior distribution. ... The second criterion
is good communication. Summaries should be chosen to convey
clearly and succinctly all the features of interest.” (p. 14)
In this sense, the frequentist inference procedures of estimation,
testing and prediction are viewed by Bayesians as stylized
inferences which, from the Bayesian perspective, represent
nothing more than different ways to summarize relevant information
in the posterior distribution $\pi(\theta|\mathbf{x}_0)$. Hence, the relevant
criteria for 'goodness' (optimality) pertain to how well
they identify and communicate relevant/interesting
features of the posterior.
In the case of the stylized inference of a point estimate for $\theta$,
one can think of several features of the posterior $\pi(\theta|\mathbf{x}_0)$, such
as a measure of location, that could potentially play such a
role by indicating the most 'representative' value of $\theta$ in $[0,1]$.
[1] An obvious choice is to use the mode $\tilde{\theta}$ of the posterior
distribution $\pi(\theta|\mathbf{x}_0)$ in (27):
$$\pi(\tilde{\theta}|\mathbf{x}_0)=\sup_{\theta\in\Theta}\pi(\theta|\mathbf{x}_0),$$
which is the value of $\theta$ that is ranked highest by $\pi(\theta|\mathbf{x}_0)$.
We know that in the case of $\theta\sim$ Beta($\alpha,\beta$) the mode of
$\pi(\theta)$ is $\frac{\alpha-1}{\alpha+\beta-2}$. Hence, an example of a Bayesian estimate
of $\theta$ is:
$$\tilde{\theta}=\frac{\alpha^{*}-1}{\alpha^{*}+\beta^{*}-2}=\frac{n\overline{x}_n+\alpha-1}{n+\alpha+\beta-2}. \qquad (32)$$
If we compare this with the ML estimate $\hat{\theta}(\mathbf{x}_0)=\overline{x}_n=\frac{1}{n}\sum_{k=1}^{n}x_k$,
the two will coincide algebraically, i.e. $\tilde{\theta}=\overline{x}_n$, only when
$\alpha=\beta=1$: $\theta\sim\mathrm{Beta}(1,1)=U(0,1)$, $\pi(\theta)=1$ for $\theta\in[0,1]$.
 
Fig. 10: The Uniform prior, Beta($a=1$, $b=1$).
[2] Another "natural" choice for a most ‘representative’
value of  is its mean of the posterior.
We know that for  vBeta( ) ()= 
+
 and thus:
b= ∗
∗+∗ = (+)
(++)
 (33)
Example. Let () vBeta(5 5).
(a) =4, =20 ∗
= + =45 ∗
=(1-)+=165
e= 35
21−2
=184 b= 45
45+165
=214
(b) =12, =20 ∗
=+=125 ∗
=(1−)+=85
e=115
19
=605 b=125
21
=595
As we can see from the above numerical examples, the two
Bayesian estimates can give rise to different numerical values,
depending on how asymmetric the posterior distribution is;
the more asymmetric the greater the difference.
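A minimal sketch (Beta(.5, .5) prior, examples (a) and (b)) of the two stylized Bayesian point estimates in (32)-(33):

```python
def beta_posterior_estimates(y, n, a=0.5, b=0.5):
    """Posterior mode (32) and posterior mean (33) for a Beta(a, b) prior."""
    a_star, b_star = y + a, (n - y) + b
    mode = (a_star - 1) / (a_star + b_star - 2)
    mean = a_star / (a_star + b_star)
    return mode, mean

print(beta_posterior_estimates(4, 20))    # about (.184, .214), example (a)
print(beta_posterior_estimates(12, 20))   # about (.605, .595), example (b)
```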
How does a Bayesian decide which one of the
above estimators is better?
O’Hagan answers that question by contrasting frequentist
(classical) inferences with Bayesian inferences:
“In Bayesian terms, therefore, a good inference is one which
contributes effectively to appropriating the information about θ
which is conveyed by the posterior distribution.” (p. 14)
Indeed, O’Hagan (1994), p. 14, proceeds to argue that
criteria for ‘good’ frequentist inference procedures are only
parasitical on Bayes’ theorem and enter the picture via the
decision theoretic perspective:
“... a study of decision theory has two potential benefits. First,
it provides a link to classical inference. It thereby shows to what
extent classical estimators, confidence intervals and hypotheses
tests can be given a Bayesian interpretation or motivation. Sec-
ond, it helps identify suitable summaries to give Bayesian answers
to stylized inference questions which classical theory addresses.”
Invoking information other than the data and the
prior. In the case of point estimation, to select an optimal
Bayesian estimate of $\theta$ one needs to invoke additional
information from decision theory, such as a loss (or utility)
function $L(\hat{\theta},\theta)$. Using different loss functions gives rise to
different choices for the Bayes estimate:
(i) The Bayes estimate $\breve{\theta}$ that minimizes ($\forall\theta\in\Theta$) the expected
zero-one loss function $L_{0-1}(\theta,\breve{\theta})=\begin{cases}0 & \text{if } \breve{\theta}=\theta\\ 1 & \text{if } \breve{\theta}\neq\theta\end{cases}$ is
the mode of $\pi(\theta|\mathbf{x}_0)$.
(ii) The Bayes estimate $\tilde{\theta}$ that minimizes ($\forall\theta\in\Theta$) the expected
absolute loss function $L_1(\tilde{\theta};\theta)=|\tilde{\theta}-\theta|$ is the median
of $\pi(\theta|\mathbf{x}_0)$.
(iii) The Bayes estimate $\hat{\theta}$ that minimizes ($\forall\theta\in\Theta$) the
expected square loss function $L_2(\hat{\theta};\theta)=(\hat{\theta}-\theta)^{2}$ is the mean
of $\pi(\theta|\mathbf{x}_0)$.
However, as O'Hagan (1994), p. 15, explains, the notion
of optimality is not primary:
"The posterior mean can therefore be seen as an estimate of
$\theta$ which is best in the sense of minimizing expected square error.
This is distinct from, but clearly related to, its more natural role
as a useful summary of location of the posterior distribution."
What is a good (optimal) estimate in Bayesian
inference? The one that minimizes a particular expected
loss function. Hence, in contrast to frequentist estimation,
where optimality is assessed by the estimator's capacity to
pin-point $\theta^{*}$, the true value of $\theta$, an optimal Bayesian estimate
has nothing to do with that; it just minimizes a particular loss
function for all possible values of $\theta$.
The most widely used loss function is the square:
$$L(\hat{\theta},\theta)=(\hat{\theta}(\mathbf{X})-\theta)^{2},\ \forall\theta\in\Theta,$$
whose expected value (the risk function) is known as the
Mean Square Error (MSE):
$$\text{Bayesian:}\quad \mathrm{MSE}(\hat{\theta}(\mathbf{X});\theta)=E(\hat{\theta}(\mathbf{X})-\theta)^{2},\ \forall\theta\in\Theta. \qquad (34)$$
In relation to the expected loss function, a key property for
an estimator $\hat{\theta}(\mathbf{X})$ is considered to be admissibility.
Admissibility. An estimator $\tilde{\theta}(\mathbf{X})$ is inadmissible with
respect to a risk function $R(\hat{\theta},\theta)$ if there exists another estimator
$\hat{\theta}(\mathbf{X})$ such that:
$$R(\hat{\theta},\theta)\leq R(\tilde{\theta},\theta),\ \forall\theta\in\Theta, \qquad (35)$$
and the strict inequality ($<$) holds for at least one value of $\theta$.
Otherwise, $\tilde{\theta}(\mathbf{X})$ is said to be admissible with respect to the
loss function $L(\hat{\theta},\theta)$.
Warning: the Bayesian definition of the MSE in (34),
based on the universal quantifier 'for all $\theta\in\Theta$', is at odds with
the frequentist definition of the MSE, which is defined at the
point $\theta=\theta^{*}$ [the true value of $\theta$]:
$$\text{Frequentist:}\quad \mathrm{MSE}(\hat{\theta}(\mathbf{X});\theta^{*})=E(\hat{\theta}(\mathbf{X})-\theta^{*})^{2}. \qquad (36)$$
The factual nature of frequentist reasoning in estimation
also brings out the impertinence of the notion of admissibility,
stemming from its reliance on the quantifier 'for all $\theta\in\Theta$'. To
see that more clearly let us consider the following example.
Example: $\overline{X}_n$ is no better than a crystal-ball estimator?
For the simple Normal model:
$$X_k\sim\mathrm{NIID}(\mu,\ \sigma^{2}),\ k=1,2,\ldots,n,\ldots, \qquad (37)$$
consider a MSE comparison between two estimators of $\mu$:
(i) the Maximum Likelihood Estimator (MLE): $\overline{X}_n=\frac{1}{n}\sum_{k=1}^{n}X_k$,
(ii) the 'crystal-ball' estimator: $\breve{\mu}(\mathbf{x})=7405926$ for all $\mathbf{x}\in\mathbb{R}_X^n$.
It turns out that both estimators are admissible and thus
equally acceptable on admissibility grounds. This surprising
result stems primarily from the quantifier 'for all $\mu\in\Theta$'. Indeed,
for certain values of $\mu$ close to $\breve{\mu}$, say $\mu\in(\breve{\mu}\pm\varepsilon\tfrac{\sigma}{\sqrt{n}})$ for
$0<\varepsilon<1$, $\breve{\mu}$ is 'better' than $\overline{X}_n$, since:
$$\mathrm{MSE}(\overline{X}_n;\mu)=\tfrac{\sigma^{2}}{n},\qquad \mathrm{MSE}(\breve{\mu};\mu)\leq\tfrac{\varepsilon^{2}\sigma^{2}}{n}\ \text{for}\ \mu\in(\breve{\mu}\pm\varepsilon\tfrac{\sigma}{\sqrt{n}}).$$
Common sense suggests that if a certain criterion of optimality
cannot distinguish between $\overline{X}_n$ [a strongly consistent,
unbiased, fully efficient and sufficient estimator] and an arbitrarily
chosen real number that ignores the data altogether,
it is practically useless for distinguishing between 'good' and
'bad' estimators in frequentist statistics, or for being viewed as a
minimal property of estimators. Moreover, it is obvious that
the source of the problem is the quantifier $\forall\mu\in\Theta$. In contrast
to admissibility, the property of consistency instantly eliminates
the crystal-ball estimator $\breve{\mu}$.
4.4 Bayesian Credible Intervals (BCIs)
A Bayesian (1−) credible interval for  is constructed by
ensuring that the area between  and  is equal to (1−):
( ≤   )=
R 

(|x0)=1−
In practice one can define an infinity of (1−) credible inter-
vals using the same posterior (|x0) To avoid this indeter-
minacy one needs to impose additional restrictions like the
interval with the shortest length or one with equal tails:R 1

(|x0)=(1−
2
)
R 1

(|x0)=
2
; see Robert (2007).
Example. For the simple Bernoulli model, the end points
of an equal-tail credible interval can be evaluated using the F
tables and the fact that:
$$\theta\sim\mathrm{Beta}(\alpha^{*},\beta^{*})\ \Rightarrow\ \frac{\beta^{*}\theta}{\alpha^{*}(1-\theta)}\sim\mathrm{F}(2\alpha^{*},\ 2\beta^{*}).$$
Denoting the upper $\frac{\alpha}{2}$ and upper $(1-\frac{\alpha}{2})$ percentage points of the
$\mathrm{F}(2\alpha^{*},2\beta^{*})$ distribution by $\mathrm{f}(\frac{\alpha}{2})$ and $\mathrm{f}(1-\frac{\alpha}{2})$ respectively, the
Bayesian $(1-\alpha)$ credible interval for $\theta$ is:
$$\left[\left(1+\frac{\beta^{*}}{\alpha^{*}\,\mathrm{f}(1-\frac{\alpha}{2})}\right)^{-1}\leq\ \theta\ \leq\left(1+\frac{\beta^{*}}{\alpha^{*}\,\mathrm{f}(\frac{\alpha}{2})}\right)^{-1}\right].$$
For $\pi(\theta)\sim$ Beta($.5,.5$): (a) $y=2$, $n=20$, $\alpha=.05$:
$$\alpha^{*}=y+\alpha=2.5,\qquad \beta^{*}=n(1-\overline{x}_n)+\beta=18.5,$$
$$\mathrm{f}(1-\tfrac{\alpha}{2})=.163,\qquad \mathrm{f}(\tfrac{\alpha}{2})=2.93,$$
$$\left(1+\tfrac{18.5}{2.5(.163)}\right)^{-1}\leq\ \theta\ \leq\left(1+\tfrac{18.5}{2.5(2.93)}\right)^{-1}=(.0216\leq\theta\leq.284). \qquad (38)$$
(b) $y=18$, $n=20$, $\alpha=.05$:
$$\alpha^{*}=y+\alpha=18.5,\qquad \beta^{*}=n(1-\overline{x}_n)+\beta=2.5,$$
$$\hat{\theta}_{B}=\tfrac{18.5}{21}=.881,\qquad \mathrm{f}(1-\tfrac{\alpha}{2})=.341,\qquad \mathrm{f}(\tfrac{\alpha}{2})=6.188,$$
$$\left(1+\tfrac{2.5}{18.5(.341)}\right)^{-1}\leq\ \theta\ \leq\left(1+\tfrac{2.5}{18.5(6.188)}\right)^{-1}=(.716\leq\theta\leq.979).$$
How does one interpret a (1 − ) BCI such as (38)?
I It provides a summary of the posterior distribution (|x0)
by demarcating  and  that define the interval of length −
with the highest probability, i.e. it includes (1−)% of the
highest ranked values of ∈[0 1].
4.4.1 Confidence vs. Credible Intervals
Example. For the simple (one parameter - 2
is known)
Normal model, the sampling distribution of =1

P
=1 
and the posterior distribution of  derived on the basis of an
improper uniform prior [()=1 ∀∈R] are:

=∗
v N(∗
 2

) x∈R
 (|x0) v N( 2

) ∈R (39)
The two distributions can be used, respectively, to construct
(1−) Confidence and Credible Intervals:
P
³
−
2
( √

) ≤  ≤ +
2
( √

); =∗
´
=1− (40)

³
−
2
( √

) ≤  ≤ +
2
( √

)|x0
´
=1− (41)
The two intervals might appear the same, but they are dras-
tically different.
First, in (40) the r.v. is  and its sampling distribution
(; ) is defined over x∈R
 but in (41) the r.v. is  and
its posterior (|x0) is defined over ∈R
Second, the reasoning underlying (40) is factual, but that
of (41) is all-inclusive (∀∈[0 1]).
33
Hence, the (1−) Confidence Interval (40) provides the
shortest random upper (X)=+
2
( √

) and lower
(X)=−
2
( √

) bounds that cover ∗
 the true value of
 with probability (1−). In contrast, the (1−) Credible
Interval (41) provides the shortest interval of length 2
2
( √

)
such that (1−)% of the highest ranked (by (|x0)) values
of  lie within it.
The above comparison suggests that a Bayesian (1−) Cred-
ible Interval has nothing to do with the "true" !
4.5 Bayesian Testing
Bayesian testing of hypotheses is not as easy to handle
using the posterior distribution because of the technical difficulty in
attaching probabilities to particular values of $\theta$ when the parameter
space $\Theta$ is uncountable, since:
$$\pi(\theta)=0,\qquad \pi(\theta|\mathbf{x}_0)=0,\ \forall\theta\in\Theta.$$
In their attempt to deflect attention away from this technical
difficulty, Bayesians often criticize the use of a point hypothesis
$\theta=\theta_0$ in frequentist testing as nonsensical because it can
never be exactly true! This is a misplaced argument because
the notion of 'exactly true' has no place in statistics.
4.5.1 Point null and alternative hypotheses
There have been several attempts to address the difficulty
with point hypotheses, but no agreement seems to have emerged;
see Robert (2007). Let us consider one such attempt for testing
the hypotheses:
$$H_0:\ \theta=\theta_0\ \ \text{vs.}\ \ H_1:\ \theta=\theta_1.$$
Like all Bayesian inferences, the basis is the posterior distribution.
Hence, an obvious way to assess the respective degrees
of belief is the posterior odds:
$$\frac{\pi(\theta_0|\mathbf{x}_0)}{\pi(\theta_1|\mathbf{x}_0)}=\frac{L(\theta_0|\mathbf{x}_0)\cdot\pi(\theta_0)}{L(\theta_1|\mathbf{x}_0)\cdot\pi(\theta_1)}=\left(\frac{\pi(\theta_0)}{\pi(\theta_1)}\right)\!\left(\frac{L(\theta_0|\mathbf{x}_0)}{L(\theta_1|\mathbf{x}_0)}\right), \qquad (42)$$
where the factor $\frac{\pi(\theta_0)}{\pi(\theta_1)}$ represents the prior odds and $\frac{L(\theta_0|\mathbf{x}_0)}{L(\theta_1|\mathbf{x}_0)}$
the likelihood ratio. In light of the fact that the technical problem
stems from the prior $\pi(\theta)$ assigning probabilities to particular
values of $\theta$, an obvious way to sidestep the problem is
to cancel the prior odds factor, by using the ratio of the posterior
odds to the prior odds to define the Bayes Factor (BF):
$$B(\theta_0,\theta_1|\mathbf{x}_0)=\left(\frac{\pi(\theta_0|\mathbf{x}_0)}{\pi(\theta_1|\mathbf{x}_0)}\right)\!\Big/\!\left(\frac{\pi(\theta_0)}{\pi(\theta_1)}\right)=\frac{L(\theta_0|\mathbf{x}_0)}{L(\theta_1|\mathbf{x}_0)}. \qquad (43)$$
This addresses the technical problem because the likelihood
function is definable for particular values of $\theta$.
For this reason Bayesian testing is often based on the BF
combined with certain rules of thumb concerning the strength
of the degree of belief against $H_0$, as it relates to the magnitude
of $B(\mathbf{x}_0;\theta_0)$ (Robert, 2007):
▶ $0\leq B(\mathbf{x}_0;\theta_0)\leq 3.2$: the degree of belief against $H_0$ is poor,
▶ $3.2<B(\mathbf{x}_0;\theta_0)\leq 10$: the degree of belief against $H_0$ is substantial,
▶ $10<B(\mathbf{x}_0;\theta_0)\leq 100$: the degree of belief against $H_0$ is strong, and
▶ $B(\mathbf{x}_0;\theta_0)>100$: the degree of belief against $H_0$ is decisive.
These rules of thumb, going from the BF to evidence for or
against the null, have been questioned as largely ad hoc; see
Kass and Raftery (1995).
The Likelihoodist approach. It is important to note
that the Law of Likelihood defining the likelihood ratio:
$$LR(\theta_0,\theta_1|\mathbf{x}_0)=\frac{L(\theta_0|\mathbf{x}_0)}{L(\theta_1|\mathbf{x}_0)}$$
provides the basis of the Likelihoodist approach to testing,
but applies only to tests of point vs. point hypotheses.
4.5.2 Composite hypotheses
A. In the context of the simple Bernoulli model consider the
hypotheses:
$$H_0:\ \theta\leq\theta_0\ \ \text{vs.}\ \ H_1:\ \theta>\theta_0,\qquad \theta_0=.5,$$
assuming a Jeffreys invariant prior $\theta\sim$ Beta($.5,.5$) and data
$y=12$, $n=20$.
An obvious way to evaluate the posterior odds for these two
interval hypotheses is as follows:
$$\pi(\theta\leq\theta_0|\mathbf{x}_0)=\frac{\Gamma(21)}{\Gamma(12.5)\Gamma(8.5)}\int_{0}^{.5}\theta^{11.5}(1-\theta)^{7.5}\,d\theta=.186,$$
$$\pi(\theta>\theta_0|\mathbf{x}_0)=1-\pi(\theta\leq\theta_0|\mathbf{x}_0)=.814.$$
One can then employ the posterior odds criterion:
$$\frac{\pi(\theta\leq\theta_0|\mathbf{x}_0)}{\pi(\theta>\theta_0|\mathbf{x}_0)}=\frac{.186}{.814}=.229,$$
which indicates that the degree of belief against $H_0$ is poor.
B. (a) In the context of the simple Normal model (2
known) consider the hypotheses:
0:  ≤ 10 vs. 1:   10
assuming a Normal prior  vN(8 1), =2 =20 =116
As shown in section 4.2, the posterior is:
(|x0) v N
³
2
1(0
2
0
+ 
2 ), 2
02
(2
0+2)
´
,
yielding the result: (|x0) v N(11022, 167),
2
1=(1
1
+ 20
4
)−1
=167 1=167(8
1
+ 20(116)
4
)=11022
The posterior odds criterion yields:
(≤10|x0)
(10|x0)
=
R 10
−∞
1√
2(167)
exp(− 1
2(167)(−11022)2)
R ∞
10
1√
2(167)
exp(− 1
2(167)(−11022)2)
= 0062
994
=0062
36
which indicates that the degree of belief against 0 is very
poor.
This result should be contrasted with that of a N-P test
yielding:
$$d(\mathbf{x}_0)=\frac{\sqrt{20}(11.6-10)}{2}=3.578,\ \text{with } p(\mathbf{x}_0)=.0002,$$
which rejects $H_0$ at any conventional significance level.
What is the intuitive explanation behind these contradic-
tory results between a Bayesian and a frequentist test?
This frequentist test will reject 0 if (X)=
√
(−0)

 
or   0+
√


The posterior odds will "reject" 0 if   0+2(0−8)
2
0
.
This indicates that in the case where the null value 0
is chosen equal to the mean (0=8) of the prior distribu-
tion, i.e.  vN(0 2
0), the posterior odds will "reject" 0
if   0 which ignores the sample size beyond its influ-
ence on  On the other hand, when 2
=2
0 the rejection
threhold 0+(0−8)

decreases with  in contrast to
√
 for the
frequentist test.
(b) Let us change the prior to an improper prior:
$$\pi(\mu)=1,\ \mu\in(-\infty,\infty).$$
As shown in section 4.2, in this case the posterior is:
$$\pi(\mu|\mathbf{x}_0)\sim\mathrm{N}\!\left(\overline{x}_n,\ \tfrac{\sigma^{2}}{n}\right),\ -\infty<\mu<\infty, \qquad (44)$$
and thus the tail areas of this posterior would seem to match
perfectly with those of the sampling distribution of $\overline{X}_n$:
$$\overline{X}_n\sim\mathrm{N}\!\left(\mu,\ \tfrac{\sigma^{2}}{n}\right),\ \mathbf{x}\in\mathbb{R}_X^n, \qquad (45)$$
but as argued in that section, this is an illusion; one is comparing
apples and eggs!
4.5.3 Point null but composite alternative hypothesis
Pretending that point hypotheses are small intervals.
A 'pragmatic' way to handle point hypotheses in Bayesian
inference is to sidestep the technical difficulty in handling hypotheses
of the form:
$$H_0:\ \theta=\theta_0\ \ \text{vs.}\ \ H_1:\ \theta\neq\theta_0, \qquad (46)$$
by pretending that $\theta_0$ is actually a small interval $H_0:\ \theta\in\Theta_0:=(\theta_0-\varepsilon,\ \theta_0+\varepsilon)$,
and attaching a spiked prior of the form:
$$\pi(\theta=\theta_0)=\pi_0,\qquad \int_{0}^{1}\pi(\theta\neq\theta_0)\,d\theta=1-\pi_0, \qquad (47)$$
i.e. attach a prior of $\pi_0$ to $\theta=\theta_0$, and then distribute the rest,
$1-\pi_0$, over all the other values of $\theta$; see Berger (1985).
Example- large  problem. In the context of the simple
Bernoulli model, consider the case where 0=2
=106298
527135
=020165233 =527135
[i] Let us apply the Bayes factor procedure to the hypothe-
ses (46) using a spiked prior (47) with 0=5 Since the ratio
(θ0)
(6=0)
cancels out, the posterior odds reduces to the Bayes
factor:
(x0; 0)= (0;x0)
R 1
0 (;x0)
=
(527135
106298)(2)106298(1−2)527135−106298
R 1
0 ((527135
106298)106298
(1−)527135−106298
)
=
=000015394
000001897
=8115
(48)
I The result (x0; 0)  8115 indicates that data x0 fa-
vor the null ( 0=2) against all other values of  substantially.
It turns out, however, that this result is highly vulnerable to
the fallacy of acceptance. For certain values ‡
in a subset of
Θ1=[0 1]−2 in particular:
‡
∈(2 20331] (49)
38
the values ‡
are favored by (x0; ‡
) more strongly than
0=2; contradicting the original result. Indeed, as pointed
out by Mayo (1996), p. 200, there is always the maximally
likely alternative, ¨
=b(x0)=20165233, for which the
Bayes factors favors ¨
more than any other value of . In
particular, the Bayes factors favors ¨
89 times stronger than
0=2!
[ii] Applying an ‘optimal’ N-P test with =003 (
2
=2968)
to the above data yields:
(x0)=
√
527135(106298
527135−2)
√
2(1−2)
=2999 (50)
which rejects 0. The p-value (x0)=0027 suggesting that
data x0 indicate ‘some’ discrepancy between 0 and the ‘true’
 (that gave rise to x0), but provides no information about
its magnitude. The post-data severity evaluation based on:
 (;   1) = P(x: (X) ≤ (x0);   1 is false)
(51)
indicates that, for a severity threshold of say 9, the claim for
which data x0 provide evidence for is:
  20095 ⇒ ∗
≤ 00095
In this sense, the frequentist approach can address the fallacy
of acceptance by evaluating the warranted discrepancy from
the null for the particular data and test.
39
4.6 Where do prior distributions come from?
4.6.1 Conjugate prior and posterior distributions
This is the case where the prior () and the posterior:
(|x0) ∝ () · (; x0) ∀∈Θ
belong to the same family of distributions, i.e. (; x0) is
family preserving.
Example. For the simple Bernoulli model:
()vBeta( )
(;x0)∝
(1−)(1−)
−→ (|x0)vBeta(∗
 ∗
)
Table 2 presents some examples of conjugate pairs of prior
and posterior distributions, as they combine with different
likelihood forms.
Conjugate pairs make mathematical sense, but do they make
'modeling' sense? The various justifications in the Bayesian
literature vary from 'they help the objectivity of inference' to
'they enhance the allure of the Bayesian approach as a black
box', and these claims are often contradictory!
Table 1 - Conjugate pairs (() (|x0))
Likelihood ()
Binomial (Bernoulli) Beta( )
Negative Binomial Beta( )
Poisson Gamma( )
Exponential Gamma( )
Gamma Gamma( )
Uniform Pareto( )
Normal for  =  N( 2
) ∈R 2
0
Normal for  = 2
Inverse Gamma( )
40
4.6.2 Jeffreys’s prior
Fisher (1921) criticized the notion of quantifying prior ignorance
about the unknown parameter $\theta$ using a uniform prior:
$$\theta\sim\mathrm{U}(0,1),\ \forall\theta\in\Theta.$$
Fisher's criticism was that such a prior is non-invariant
to reparameterizations: one is ignorant about $\theta$ but very informed
about $\eta=h(\theta)$.
Fig. 10: The Uniform prior, Beta($a=1$, $b=1$).
Fig. 11: The Logistic prior (location 0, scale 1).
Example. If  is uniformly distributed, then the logit
transformation:
= ln
¡ 
1−
¢

gives rise to a very informative prior for  :
()= 
(1+)2  −∞    ∞
This distribution attaches much higher probability to the val-
ues of  around zero and very low probability to the values
in the tails.
In response to Fisher's criticism, Jeffreys (1939) proposed
a new class of priors which are invariant to reparameterizations.
This family of invariant priors is based on Fisher's
average information:
$$\overline{I}(\theta;\mathbf{x})=E_{\mathbf{x}}\!\left(\frac{1}{n}\left[\frac{d\ln L(\theta;\mathbf{x})}{d\theta}\right]^{2}\right)=\int\cdots\int_{\mathbf{x}\in\mathbb{R}_X^n}\frac{1}{n}\left(\frac{d\ln L(\theta;\mathbf{x})}{d\theta}\right)^{2}f(\mathbf{x};\theta)\,d\mathbf{x}. \qquad (52)$$
Note that the above derivation involves some hand-waving, in
the sense that if the likelihood function $L(\theta;\mathbf{x}_0)$ is viewed,
as the Bayesians do, as only a function of the data $\mathbf{x}_0$, then
taking expectations outside the brackets makes no sense; the
expectation is with respect to the distribution of the sample
$f(\mathbf{x};\theta)$ for all possible values of $\mathbf{x}\in\mathbb{R}_X^n$. As we can see, the
derivation of $\overline{I}(\theta;\mathbf{x})$ runs afoul of the likelihood principle,
since all possible values of the sample $\mathbf{X}$, not just the observed
data $\mathbf{x}_0$, are taken into account. Note that in the case of a
random (IID) sample, the Fisher information $I_n(\theta;\mathbf{x})$ for the
sample $\mathbf{X}:=(X_1,X_2,\ldots,X_n)$ is related to the above average
information via $I_n(\theta;\mathbf{x})=n\,\overline{I}(\theta;\mathbf{x})$.
In the case of a single parameter, Jeffreys' invariant
prior takes the form:
$$\pi(\theta)\propto\sqrt{\overline{I}(\theta;\mathbf{x})}, \qquad (53)$$
i.e. the likelihood function determines the prior distribution.
The simple Bernoulli model. In view of the fact that
the log-likelihood takes the form:
$$\ln L(\theta;\mathbf{x})=n\overline{x}_n\ln(\theta)+n(1-\overline{x}_n)\ln(1-\theta),$$
$$\frac{d\ln L(\theta;\mathbf{x})}{d\theta}=\frac{n\overline{x}_n}{\theta}-\frac{n(1-\overline{x}_n)}{1-\theta},\qquad \frac{d^{2}\ln L(\theta;\mathbf{x})}{d\theta^{2}}=-\frac{n\overline{x}_n}{\theta^{2}}-\frac{n(1-\overline{x}_n)}{(1-\theta)^{2}}.$$
From the second derivative, it follows that:
$$E\!\left(\frac{1}{n}\left[\frac{d\ln L(\theta;\mathbf{x})}{d\theta}\right]^{2}\right)=E\!\left(-\frac{1}{n}\frac{d^{2}\ln L(\theta;\mathbf{x})}{d\theta^{2}}\right)=\frac{1}{\theta(1-\theta)}. \qquad (54)$$
This follows directly from $E(\overline{X}_n)=\theta$, since:
$$E\!\left(-\frac{1}{n}\frac{d^{2}\ln L(\theta;\mathbf{x})}{d\theta^{2}}\right)=\frac{E(\overline{X}_n)}{\theta^{2}}+\frac{1-E(\overline{X}_n)}{(1-\theta)^{2}}=\frac{1}{\theta}+\frac{1}{1-\theta}=\frac{1}{\theta(1-\theta)}. \qquad (55)$$
From the definition of Jeffreys' invariant prior we can deduce
that for $\theta$:
$$\pi(\theta)\propto\sqrt{\overline{I}(\theta;\mathbf{x})}=\sqrt{\tfrac{1}{\theta(1-\theta)}}=\theta^{-\frac{1}{2}}(1-\theta)^{-\frac{1}{2}},\ 0<\theta<1, \qquad (56)$$
which is an 'unnormalized' Beta($\frac{1}{2},\frac{1}{2}$) distribution; it needs the
scaling $\frac{1}{B(\frac{1}{2},\frac{1}{2})}$. Note that Jeffreys' prior (56) is also the reference
prior for a one-parameter statistical model; see Bernardo and
Smith (1994).
5 Bayesian charges against frequentist inference
[1] Bayesian inference is fundamentally sound because it can
be given an axiomatic foundation based on coherent (rational)
decision making, but frequentist inference suffers from several
philosophical flaws.
Remark: what does an axiomatic foundation have to do
with inductive inference?
[2] Frequentist inference is not very flexible and has a restric-
tive range of applicability.
According to Koop, Poirier and Tobias (2007):
"Non-Bayesians, who we hereafter refer to as frequentists,
argue that situations not admitting repetition under essentially
identical conditions are not within the realm of statistical enquiry,
and hence 'probability' should not be used in such situations. Frequentists define the probability of an event as its long-run relative frequency. ... that definition is nonoperational since only a finite number of trials can ever be conducted." (p. 2)
Remark: where have these guys been for the last 80 years? Is inference with time series data beyond the intended scope of frequentist statistics?
[3] Bayesian inference produces clear and direct inferences, in
contrast to frequentist inference producing unclear and indi-
rect inferences, e.g. credible intervals vs. confidence intervals.
“... the applied researcher would really like to be able to place
a degree of belief on the hypothesis.” (Press, 2003, p. 220)
Remark: if the primary aim of statistical inference is to learn from data x₀ about the 'true' underlying data-generating mechanism M*(x)={f(x; θ*)}, x∈Rⁿ_X, what does a probability, say π(θ=θ₀ | x₀)=.7, suggest about θ*?
[4] Bayesian inference makes use of all the available a priori
information, but frequentist inference does not.
Remark: there is a crucial difference between prior substantive (subject-matter) information and information in the form of a prior distribution. Frequentist inference is tailor-made to accommodate prior information in the form of restrictions on the statistical parameters suggested by substantive theories, say G(θ, φ)=0, where θ and φ denote the statistical and substantive parameters of interest, respectively. Indeed, substantive prior information in most scientific fields does not come in the form of a prior distribution π(θ), ∀θ∈Θ.
[5] A number of counter-examples, introduced by Bayesians,
show that frequentist inference is fundamentally flawed.
Remark: the problem does not lie with frequentist testing,
but with the statistical models introduced. The underlying
statistical models are shown to be rigged; see Spanos (2010;
2011; 2012; 2013a-d).
[6] The subjectivity charge against Bayesians is misplaced
because:
“All statistical methods that use probability are subjective in the sense of relying on mathematical idealizations of the world. Bayesian methods are sometimes said to be especially subjective because of their reliance on a prior distribution, but in most problems, scientific judgement is necessary to specify both the ‘likelihood’ and the ‘prior’ parts of the model.” (Gelman et al., 2004, p. 14)
“... likelihoods are just as subjective as priors.” (Kadane, 2011, p. 445)
Remark: I know how to test the adequacy of the probabilistic assumptions defining the likelihood (they are the model assumptions) vis-à-vis data x₀, but how do I test the adequacy of the assumptions defining a prior π(θ), ∀θ∈Θ?
[7] For inference purposes, the only relevant point in the sample space Rⁿ_X is the observed data x₀, as summarized by the likelihood function L(θ | x₀), θ∈Θ.
Remark: as mentioned in section 4.1, the interpretation of the posterior distribution as proportional to the conditional distribution of X given θ times the prior π(θ) is at odds with the Likelihood Principle; see Spanos (2014). Also, Mayo (2013) has shown that Birnbaum's (1962) 'proof' of the LP is erroneous.
[8] An effective way to generate frequentist optimal sta-
tistical procedures is to find the Bayes solution using a
reasonable prior and then examine its frequentist properties
to see whether it is satisfactory from the latter viewpoint; see
Rubin (1984) and Gelman et al. (2004).
Remark: this is based on assuming that admissibility (as
defined by Bayesians) is a desirable minimal property; it is
not! Indeed, expected losses do not constitute legitimate fre-
quentist error probabilities. The minimal property for fre-
quentist inference is consistency, not expected loss relative
efficiency.
The above [1]-[8] comments, criticisms and charges leveled against frequentist inference are largely misplaced and stem mostly from insufficient understanding, or just plain ignorance, on the part of the critics.
Further arguments on how the above criticisms can be coun-
tered will be part of the class discussion.
6 Appendix A: the N-P Lemma and its extensions
The cornerstone of the Neyman-Pearson (N-P) approach is
the Neyman-Pearson lemma. Contemplate the simple
generic statistical model:
Mθ(x)={(x; )} ∈Θ:={0 1}} x∈R
 (57)
and consider the problem of testing the simple hypotheses:
0: =0 vs. 1: =1 (58)
■ The fact that the assumed parameter space is Θ:={θ₀, θ₁} and that (58) constitutes a partition of it is often left out of most statistics textbook discussions of this famous lemma!
Existence. There exists an α-significance level Uniformly Most Powerful (UMP) [α-UMP] test based on:
$$d(\mathbf{X})=h\!\left(\frac{L(\mathbf{x};\theta_{1})}{L(\mathbf{x};\theta_{0})}\right),\qquad C_{1}(\alpha)=\{\mathbf{x}:\ d(\mathbf{x})>c_{\alpha}\},\qquad (59)$$
where h(·) is a monotone function.
Sufficiency. If an α-level test of the form (59) exists, then it is UMP for testing (58).
Necessity. If {d(X), C₁(α)} is an α-UMP test, then it will be given by (59).
At first sight the N-P lemma seems rather contrived because it is an existence result for a simple statistical model Mθ(x) whose parameter space Θ:={θ₀, θ₁} is artificial, but it fits perfectly into the archetypal formulation. To operationalize the existence result one needs to do two things:
(1) Find a transformation h(·) that, when applied to L(x; θ₁)/L(x; θ₀), yields a meaningful test statistic d(X).
(2) Derive the distribution of d(X) under both H₀ and H₁.
Warning: this lemma is often misconstrued as suggesting that for an α-UMP test to exist one needs to confine testing to simple-vs-simple hypotheses, even when Θ is uncountable; nonsense!
■ The construction of an α-UMP test in realistic cases has nothing to do with simple-vs-simple hypotheses. Instead,
(i) it should be based on the archetypal N-P testing formulation based on partitioning Θ, and
(ii) it should rely on monotone likelihood ratios and other features of the prespecified statistical model Mθ(x).
Example. To illustrate these issues consider the simple-vs-simple hypotheses:
$$\text{(i)}\quad H_{0}:\ \mu=\mu_{0}\ \ \text{vs.}\ \ H_{1}:\ \mu=\mu_{1}\qquad (60)$$
in the context of a simple Normal (one parameter, σ² known) model:
$$X_{k}\sim \text{NIID}(\mu,\sigma^{2}),\quad k=1,2,\ldots,n.\qquad (61)$$
In this case, the N-P lemma does not apply because the two values (μ₀, μ₁) do not constitute a partition of the parameter space Θ=R.
Applying the N-P lemma requires setting up the ratio:
$$\frac{L(\mathbf{x};\mu_{1})}{L(\mathbf{x};\mu_{0})}=\exp\!\left\{\frac{n\bar{x}_{n}}{\sigma^{2}}(\mu_{1}-\mu_{0})-\frac{n}{2\sigma^{2}}(\mu_{1}^{2}-\mu_{0}^{2})\right\},\qquad (62)$$
which is clearly not a test statistic as it stands. However, there exists a monotone function h(·) which transforms (62) into a familiar test statistic (Spanos, 1999, pp. 708-9):
$$d(\mathbf{X})=h\!\left(\frac{L(\mathbf{x};\mu_{1})}{L(\mathbf{x};\mu_{0})}\right)=\frac{\sigma}{\sqrt{n}(\mu_{1}-\mu_{0})}\ln\!\left(\frac{L(\mathbf{x};\mu_{1})}{L(\mathbf{x};\mu_{0})}\right)+\frac{\sqrt{n}(\mu_{1}-\mu_{0})}{2\sigma}=\frac{\sqrt{n}(\bar{X}_{n}-\mu_{0})}{\sigma}.$$
A UMP test can be derived when d(X)=√n(X̄ₙ−μ₀)/σ is combined with information relating to the framing of the hypotheses.
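The algebra behind this transformation is easy to verify numerically; the Python sketch below uses illustrative values (μ₀=0, μ₁=1, σ=1, n=10, not taken from the notes) to check that the ratio (62) depends on the data only through x̄ₙ, is monotone in it, and maps to d(X) under the affine transformation above.

```python
import numpy as np

# Illustrative (assumed) values, not from the notes: mu0=0, mu1=1, sigma=1, n=10.
mu0, mu1, sigma, n = 0.0, 1.0, 1.0, 10

def likelihood_ratio(xbar):
    # Ratio (62): L(x; mu1)/L(x; mu0) depends on the data only through xbar.
    return np.exp((n * xbar / sigma**2) * (mu1 - mu0)
                  - (n / (2 * sigma**2)) * (mu1**2 - mu0**2))

def d_stat(xbar):
    # The familiar test statistic d(X) = sqrt(n)*(xbar - mu0)/sigma.
    return np.sqrt(n) * (xbar - mu0) / sigma

xbars = np.linspace(-1.0, 2.0, 7)
ratios = likelihood_ratio(xbars)

# Monotonicity: a larger xbar always gives a larger likelihood ratio, so
# thresholding the ratio is equivalent to thresholding d(X) (or xbar).
assert np.all(np.diff(ratios) > 0)

# The affine map h(.) applied to the log-ratio recovers d(X) exactly.
recovered = (sigma / (np.sqrt(n) * (mu1 - mu0))) * np.log(ratios) \
            + np.sqrt(n) * (mu1 - mu0) / (2 * sigma)
print(np.allclose(recovered, d_stat(xbars)))   # True
```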
Conditions that give rise to UMP tests
[1] Point null vs. one-sided alternative. In the case of the simple Normal model, each pair of hypotheses:
(i) H₀: μ ≤ μ₀ vs. H₁: μ > μ₀, and H₀: μ = μ₀ vs. H₁: μ > μ₀,
(ii) H₀: μ ≥ μ₀ vs. H₁: μ < μ₀, and H₀: μ = μ₀ vs. H₁: μ < μ₀,
gives rise to the same UMP test.
The existence of these α-UMP tests extends the N-P lemma to more realistic cases by invoking two regularity conditions:
[2] Monotone likelihood ratio. The ratio (62) is a monotone function of the statistic X̄ₙ in the sense that, for any two values μ₁ > μ₀, L(x; μ₁)/L(x; μ₀) changes monotonically with x̄ₙ. This implies that L(x; μ₁)/L(x; μ₀) > k if and only if x̄ₙ > c₀.
This regularity condition is valid for most statistical models of interest in practice, including the one-parameter Exponential family of distributions [Normal, Student's t, Pareto, Gamma, Beta, Binomial, Negative Binomial, Poisson, etc.].
[3] Convex alternative. The parameter space under H₁, say Θ₁, is convex [contiguous], i.e. for any two values (μ₁, μ₂)∈Θ₁ their convex combinations λμ₁+(1−λ)μ₂∈Θ₁ for any 0 ≤ λ ≤ 1.
When convexity does not hold, as in the 2-sided alternative:
(vi) (2-s): H₀: μ = μ₀ vs. H₁: μ ≠ μ₀,
the test T_α:={d(X), C₁(α)}, with C₁(α)={x: |d(x)| > c_{α/2}}, is α-UMPU (Unbiased); the α-level and p-value are:
$$\alpha=P(|d(\mathbf{X})|>c_{\alpha/2};\ \mu=\mu_{0}),\qquad q(\mathbf{x}_{0})=P(|d(\mathbf{X})|>|d(\mathbf{x}_{0})|;\ \mu=\mu_{0}).$$
7 Appendix B: Examples based on Jeffreys prior
For the simple Bernoulli model, consider selecting the Jeffreys invariant prior:
$$\pi(\theta)=\frac{1}{B(.5,\,.5)}\,\theta^{-.5}(1-\theta)^{-.5},\qquad \theta\in[0,1].$$
This gives rise to a posterior distribution of the form:
$$\pi(\theta\,|\,\mathbf{x}_{0})\sim \text{Beta}\big(y+.5,\ (n-y)+.5\big),\qquad \theta\in[0,1],$$
where y:=Σxₖ denotes the number of successes in the n trials.
■ (a) For y=2, n=20 the likelihood function is:
$$L(\theta;\mathbf{x}_{0})\propto \theta^{2}(1-\theta)^{18},\qquad \theta\in[0,1],$$
and the posterior density is π(θ | x₀) ∼ Beta(2.5, 18.5), θ∈[0,1].
The Bayesian point estimates (the posterior mode and mean, respectively) are:
$$\tilde{\theta}=\tfrac{1.5}{19}=.0789,\qquad \hat{\theta}=\tfrac{2.5}{21}=.119.$$
A .95 credible interval for θ is P(.0214 ≤ θ < .284)=.95:
$$\tfrac{1}{B(2.5,\,18.5)}\int_{\theta=.0214}^{1}\theta^{1.5}(1-\theta)^{17.5}d\theta=.975,\qquad \tfrac{1}{B(2.5,\,18.5)}\int_{\theta=.284}^{1}\theta^{1.5}(1-\theta)^{17.5}d\theta=.025.$$
■ (b) For y=18, n=20 the likelihood function is:
$$L(\theta;\mathbf{x}_{0})\propto \theta^{18}(1-\theta)^{2},\qquad \theta\in[0,1],$$
and the posterior density is π(θ | x₀) ∼ Beta(18.5, 2.5), θ∈[0,1].
The Bayesian point estimates are:
$$\tilde{\theta}=\tfrac{17.5}{19}=.921,\qquad \hat{\theta}=\tfrac{18.5}{21}=.881.$$
A .95 credible interval for θ is P(.716 ≤ θ < .9786)=.95:
$$\tfrac{1}{B(18.5,\,2.5)}\int_{\theta=.716}^{1}\theta^{17.5}(1-\theta)^{1.5}d\theta=.975,\qquad \tfrac{1}{B(18.5,\,2.5)}\int_{\theta=.9786}^{1}\theta^{17.5}(1-\theta)^{1.5}d\theta=.025.$$
■ (c) For y=72, n=80 the likelihood function is:
$$L(\theta;\mathbf{x}_{0})\propto \theta^{72}(1-\theta)^{8},\qquad \theta\in[0,1],$$
and the posterior density is π(θ | x₀) ∼ Beta(72.5, 8.5), θ∈[0,1].
The Bayesian point estimates are:
$$\tilde{\theta}=\tfrac{71.5}{79}=.905,\qquad \hat{\theta}=\tfrac{72.5}{81}=.895.$$
A .95 credible interval for θ is P(.82 ≤ θ < .9515)=.95:
$$\tfrac{1}{B(72.5,\,8.5)}\int_{\theta=.82}^{1}\theta^{71.5}(1-\theta)^{7.5}d\theta=.975,\qquad \tfrac{1}{B(72.5,\,8.5)}\int_{\theta=.9515}^{1}\theta^{71.5}(1-\theta)^{7.5}d\theta=.025.$$
■ (d) For y=40, n=80 the likelihood function is:
$$L(\theta;\mathbf{x}_{0})\propto \theta^{40}(1-\theta)^{40},\qquad \theta\in[0,1],$$
and the posterior density is π(θ | x₀) ∼ Beta(40.5, 40.5), θ∈[0,1].
The Bayesian point estimates are:
$$\tilde{\theta}=\tfrac{39.5}{79}=.5,\qquad \hat{\theta}=\tfrac{40.5}{81}=.5.$$
A .95 credible interval for θ is P(.3923 ≤ θ < .6077)=.95:
$$\tfrac{1}{B(40.5,\,40.5)}\int_{\theta=.3923}^{1}\theta^{39.5}(1-\theta)^{39.5}d\theta=.975,\qquad \tfrac{1}{B(40.5,\,40.5)}\int_{\theta=.6077}^{1}\theta^{39.5}(1-\theta)^{39.5}d\theta=.025.$$
In view of the symmetry of the posterior distribution, even the asymptotic Normal credible interval should give a good approximation. Given that θ̂ = (y+α)/(n+α+β) = 0.5, the approximate credible interval is:
$$P\Big(\big[.5-1.96\tfrac{\sqrt{.5(1-.5)}}{\sqrt{80}}\big]=.390\ \le\ \theta\ <\ .610=\big[.5+1.96\tfrac{\sqrt{.5(1-.5)}}{\sqrt{80}}\big]\Big)=1-\alpha,$$
which provides a reasonably good approximation to the exact one.
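As a quick numerical check (a sketch, not part of the original notes), the exact equal-tail interval from the Beta(40.5, 40.5) posterior can be compared with the asymptotic Normal approximation above:

```python
import numpy as np
from scipy.stats import beta, norm

# Case (d): y = 40, n = 80 with the Jeffreys prior -> posterior Beta(40.5, 40.5).
a, b, n = 40.5, 40.5, 80

# Exact .95 equal-tail credible interval from the Beta posterior.
exact = beta.ppf([0.025, 0.975], a, b)

# Asymptotic Normal approximation centred at the posterior mean 0.5.
theta_hat = a / (a + b)
half_width = norm.ppf(0.975) * np.sqrt(theta_hat * (1 - theta_hat) / n)
approx = (theta_hat - half_width, theta_hat + half_width)

print("exact  :", np.round(exact, 3))    # roughly (0.392, 0.608)
print("approx :", np.round(approx, 3))   # roughly (0.390, 0.610)
```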
50

More Related Content

What's hot

Statistical Flukes, the Higgs Discovery, and 5 Sigma
Statistical Flukes, the Higgs Discovery, and 5 Sigma Statistical Flukes, the Higgs Discovery, and 5 Sigma
Statistical Flukes, the Higgs Discovery, and 5 Sigma jemille6
 
Mayo: Day #2 slides
Mayo: Day #2 slidesMayo: Day #2 slides
Mayo: Day #2 slidesjemille6
 
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in ScienceD. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in Sciencejemille6
 
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical InferenceSpanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inferencejemille6
 
hypothesis testing overview
hypothesis testing overviewhypothesis testing overview
hypothesis testing overviewi i
 
Feb21 mayobostonpaper
Feb21 mayobostonpaperFeb21 mayobostonpaper
Feb21 mayobostonpaperjemille6
 
An Introduction to Mis-Specification (M-S) Testing
An Introduction to Mis-Specification (M-S) TestingAn Introduction to Mis-Specification (M-S) Testing
An Introduction to Mis-Specification (M-S) Testingjemille6
 
Hypothesis Testing
Hypothesis TestingHypothesis Testing
Hypothesis Testingrishi.indian
 
Mayo minnesota 28 march 2 (1)
Mayo minnesota 28 march 2 (1)Mayo minnesota 28 march 2 (1)
Mayo minnesota 28 march 2 (1)jemille6
 
6 estimation hypothesis testing t test
6 estimation hypothesis testing t test6 estimation hypothesis testing t test
6 estimation hypothesis testing t testPenny Jiang
 
beyond objectivity and subjectivity; a discussion paper
beyond objectivity and subjectivity; a discussion paperbeyond objectivity and subjectivity; a discussion paper
beyond objectivity and subjectivity; a discussion paperChristian Robert
 
Discussion a 4th BFFF Harvard
Discussion a 4th BFFF HarvardDiscussion a 4th BFFF Harvard
Discussion a 4th BFFF HarvardChristian Robert
 
D. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics WarsD. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics Warsjemille6
 
"The Statistical Replication Crisis: Paradoxes and Scapegoats”
"The Statistical Replication Crisis: Paradoxes and Scapegoats”"The Statistical Replication Crisis: Paradoxes and Scapegoats”
"The Statistical Replication Crisis: Paradoxes and Scapegoats”jemille6
 
Senn repligate
Senn repligateSenn repligate
Senn repligatejemille6
 
D. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &LearningD. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &Learningjemille6
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testingpraveen3030
 

What's hot (20)

Statistical Flukes, the Higgs Discovery, and 5 Sigma
Statistical Flukes, the Higgs Discovery, and 5 Sigma Statistical Flukes, the Higgs Discovery, and 5 Sigma
Statistical Flukes, the Higgs Discovery, and 5 Sigma
 
Mayo: Day #2 slides
Mayo: Day #2 slidesMayo: Day #2 slides
Mayo: Day #2 slides
 
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in ScienceD. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
 
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical InferenceSpanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
 
hypothesis testing overview
hypothesis testing overviewhypothesis testing overview
hypothesis testing overview
 
Feb21 mayobostonpaper
Feb21 mayobostonpaperFeb21 mayobostonpaper
Feb21 mayobostonpaper
 
An Introduction to Mis-Specification (M-S) Testing
An Introduction to Mis-Specification (M-S) TestingAn Introduction to Mis-Specification (M-S) Testing
An Introduction to Mis-Specification (M-S) Testing
 
Hypothesis Testing
Hypothesis TestingHypothesis Testing
Hypothesis Testing
 
Mayo minnesota 28 march 2 (1)
Mayo minnesota 28 march 2 (1)Mayo minnesota 28 march 2 (1)
Mayo minnesota 28 march 2 (1)
 
6 estimation hypothesis testing t test
6 estimation hypothesis testing t test6 estimation hypothesis testing t test
6 estimation hypothesis testing t test
 
beyond objectivity and subjectivity; a discussion paper
beyond objectivity and subjectivity; a discussion paperbeyond objectivity and subjectivity; a discussion paper
beyond objectivity and subjectivity; a discussion paper
 
Discussion a 4th BFFF Harvard
Discussion a 4th BFFF HarvardDiscussion a 4th BFFF Harvard
Discussion a 4th BFFF Harvard
 
D. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics WarsD. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics Wars
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
"The Statistical Replication Crisis: Paradoxes and Scapegoats”
"The Statistical Replication Crisis: Paradoxes and Scapegoats”"The Statistical Replication Crisis: Paradoxes and Scapegoats”
"The Statistical Replication Crisis: Paradoxes and Scapegoats”
 
Senn repligate
Senn repligateSenn repligate
Senn repligate
 
D. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &LearningD. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &Learning
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 

Similar to Spanos lecture 7: An Introduction to Bayesian Inference

Philosophy of Probability
Philosophy of ProbabilityPhilosophy of Probability
Philosophy of ProbabilityKayla Leung
 
Review tversky & kahnemann (1974) judgment under uncertainty
Review   tversky & kahnemann (1974) judgment under uncertaintyReview   tversky & kahnemann (1974) judgment under uncertainty
Review tversky & kahnemann (1974) judgment under uncertaintyMicheal Axelsen
 
Bruno de fenitty - A Modern History of Probability Theory
Bruno de fenitty - A Modern History of Probability TheoryBruno de fenitty - A Modern History of Probability Theory
Bruno de fenitty - A Modern History of Probability TheoryChristian Torrico Avila
 
UNDERSTANDING DECISION/ GAME THEORY FOR BETTER RISK ASSESSMENT.
UNDERSTANDING DECISION/ GAME THEORY FOR BETTER RISK ASSESSMENT.UNDERSTANDING DECISION/ GAME THEORY FOR BETTER RISK ASSESSMENT.
UNDERSTANDING DECISION/ GAME THEORY FOR BETTER RISK ASSESSMENT.Kaustav Lahiri
 
Judgment Under Uncertainty: Anchoring and Adjustment Bias
Judgment Under Uncertainty:  Anchoring and Adjustment BiasJudgment Under Uncertainty:  Anchoring and Adjustment Bias
Judgment Under Uncertainty: Anchoring and Adjustment BiasMicheal Axelsen
 
Probability In Discrete Structure of Computer Science
Probability In Discrete Structure of Computer ScienceProbability In Discrete Structure of Computer Science
Probability In Discrete Structure of Computer SciencePrankit Mishra
 
Many decisions are based on beliefs concerning the likelihoo.docx
Many decisions are based on beliefs concerning the likelihoo.docxMany decisions are based on beliefs concerning the likelihoo.docx
Many decisions are based on beliefs concerning the likelihoo.docxalfredacavx97
 
35812 Topic discussion1Number of Pages 1 (Double Spaced).docx
35812 Topic discussion1Number of Pages 1 (Double Spaced).docx35812 Topic discussion1Number of Pages 1 (Double Spaced).docx
35812 Topic discussion1Number of Pages 1 (Double Spaced).docxrhetttrevannion
 
Research Methodology Module-05
Research Methodology Module-05Research Methodology Module-05
Research Methodology Module-05Kishor Ade
 
35845 Topic Group AssignmentNumber of Pages 1 (Double Spaced.docx
35845 Topic Group AssignmentNumber of Pages 1 (Double Spaced.docx35845 Topic Group AssignmentNumber of Pages 1 (Double Spaced.docx
35845 Topic Group AssignmentNumber of Pages 1 (Double Spaced.docxrhetttrevannion
 
35813 Topic Discussion2Number of Pages 1 (Double Spaced).docx
35813 Topic Discussion2Number of Pages 1 (Double Spaced).docx35813 Topic Discussion2Number of Pages 1 (Double Spaced).docx
35813 Topic Discussion2Number of Pages 1 (Double Spaced).docxrhetttrevannion
 

Similar to Spanos lecture 7: An Introduction to Bayesian Inference (20)

Maths probability
Maths probabilityMaths probability
Maths probability
 
Russo Ihpst Seminar
Russo Ihpst SeminarRusso Ihpst Seminar
Russo Ihpst Seminar
 
Philosophy of Probability
Philosophy of ProbabilityPhilosophy of Probability
Philosophy of Probability
 
Russo Vub Seminar
Russo Vub SeminarRusso Vub Seminar
Russo Vub Seminar
 
Russo Vub Seminar
Russo Vub SeminarRusso Vub Seminar
Russo Vub Seminar
 
Review tversky & kahnemann (1974) judgment under uncertainty
Review   tversky & kahnemann (1974) judgment under uncertaintyReview   tversky & kahnemann (1974) judgment under uncertainty
Review tversky & kahnemann (1974) judgment under uncertainty
 
Bruno de fenitty - A Modern History of Probability Theory
Bruno de fenitty - A Modern History of Probability TheoryBruno de fenitty - A Modern History of Probability Theory
Bruno de fenitty - A Modern History of Probability Theory
 
Probability And Statistics
Probability And StatisticsProbability And Statistics
Probability And Statistics
 
UNDERSTANDING DECISION/ GAME THEORY FOR BETTER RISK ASSESSMENT.
UNDERSTANDING DECISION/ GAME THEORY FOR BETTER RISK ASSESSMENT.UNDERSTANDING DECISION/ GAME THEORY FOR BETTER RISK ASSESSMENT.
UNDERSTANDING DECISION/ GAME THEORY FOR BETTER RISK ASSESSMENT.
 
Judgment Under Uncertainty: Anchoring and Adjustment Bias
Judgment Under Uncertainty:  Anchoring and Adjustment BiasJudgment Under Uncertainty:  Anchoring and Adjustment Bias
Judgment Under Uncertainty: Anchoring and Adjustment Bias
 
Probability In Discrete Structure of Computer Science
Probability In Discrete Structure of Computer ScienceProbability In Discrete Structure of Computer Science
Probability In Discrete Structure of Computer Science
 
Conditional Spank
Conditional SpankConditional Spank
Conditional Spank
 
Many decisions are based on beliefs concerning the likelihoo.docx
Many decisions are based on beliefs concerning the likelihoo.docxMany decisions are based on beliefs concerning the likelihoo.docx
Many decisions are based on beliefs concerning the likelihoo.docx
 
35812 Topic discussion1Number of Pages 1 (Double Spaced).docx
35812 Topic discussion1Number of Pages 1 (Double Spaced).docx35812 Topic discussion1Number of Pages 1 (Double Spaced).docx
35812 Topic discussion1Number of Pages 1 (Double Spaced).docx
 
Questions On Probability
Questions On ProbabilityQuestions On Probability
Questions On Probability
 
Research Methodology Module-05
Research Methodology Module-05Research Methodology Module-05
Research Methodology Module-05
 
35845 Topic Group AssignmentNumber of Pages 1 (Double Spaced.docx
35845 Topic Group AssignmentNumber of Pages 1 (Double Spaced.docx35845 Topic Group AssignmentNumber of Pages 1 (Double Spaced.docx
35845 Topic Group AssignmentNumber of Pages 1 (Double Spaced.docx
 
35813 Topic Discussion2Number of Pages 1 (Double Spaced).docx
35813 Topic Discussion2Number of Pages 1 (Double Spaced).docx35813 Topic Discussion2Number of Pages 1 (Double Spaced).docx
35813 Topic Discussion2Number of Pages 1 (Double Spaced).docx
 
Probability In Daily Life
Probability In Daily LifeProbability In Daily Life
Probability In Daily Life
 
Dubrovnik Pres
Dubrovnik PresDubrovnik Pres
Dubrovnik Pres
 

More from jemille6

“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”jemille6
 
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and ProbabilismStatistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and Probabilismjemille6
 
D. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfD. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfjemille6
 
reid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfreid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfjemille6
 
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022jemille6
 
Causal inference is not statistical inference
Causal inference is not statistical inferenceCausal inference is not statistical inference
Causal inference is not statistical inferencejemille6
 
What are questionable research practices?
What are questionable research practices?What are questionable research practices?
What are questionable research practices?jemille6
 
What's the question?
What's the question? What's the question?
What's the question? jemille6
 
The neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and MetascienceThe neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and Metasciencejemille6
 
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...jemille6
 
On Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the TwoOn Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the Twojemille6
 
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...jemille6
 
Comparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple TestingComparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple Testingjemille6
 
Good Data Dredging
Good Data DredgingGood Data Dredging
Good Data Dredgingjemille6
 
The Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of ProbabilityThe Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of Probabilityjemille6
 
Error Control and Severity
Error Control and SeverityError Control and Severity
Error Control and Severityjemille6
 
The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)jemille6
 
The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)jemille6
 
On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...jemille6
 
The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (jemille6
 

More from jemille6 (20)

“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”
 
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and ProbabilismStatistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
 
D. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfD. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdf
 
reid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfreid-postJSM-DRC.pdf
reid-postJSM-DRC.pdf
 
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
 
Causal inference is not statistical inference
Causal inference is not statistical inferenceCausal inference is not statistical inference
Causal inference is not statistical inference
 
What are questionable research practices?
What are questionable research practices?What are questionable research practices?
What are questionable research practices?
 
What's the question?
What's the question? What's the question?
What's the question?
 
The neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and MetascienceThe neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and Metascience
 
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
 
On Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the TwoOn Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the Two
 
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
 
Comparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple TestingComparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple Testing
 
Good Data Dredging
Good Data DredgingGood Data Dredging
Good Data Dredging
 
The Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of ProbabilityThe Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of Probability
 
Error Control and Severity
Error Control and SeverityError Control and Severity
Error Control and Severity
 
The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)
 
The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)
 
On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...
 
The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (
 

Recently uploaded

What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?TechSoup
 
General views of Histopathology and step
General views of Histopathology and stepGeneral views of Histopathology and step
General views of Histopathology and stepobaje godwin sunday
 
CapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptxCapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptxCapitolTechU
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...Nguyen Thanh Tu Collection
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICESayali Powar
 
Human-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming ClassesHuman-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming ClassesMohammad Hassany
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptxmary850239
 
Diploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfDiploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfMohonDas
 
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdfMaximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdfTechSoup
 
How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17Celine George
 
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxPISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxEduSkills OECD
 
Benefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive EducationBenefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive EducationMJDuyan
 
How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17Celine George
 
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptx
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptxPractical Research 1: Lesson 8 Writing the Thesis Statement.pptx
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptxKatherine Villaluna
 
How to Add a many2many Relational Field in Odoo 17
How to Add a many2many Relational Field in Odoo 17How to Add a many2many Relational Field in Odoo 17
How to Add a many2many Relational Field in Odoo 17Celine George
 
How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17Celine George
 
Patterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptxPatterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptxMYDA ANGELICA SUAN
 
Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxraviapr7
 

Recently uploaded (20)

What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?
 
General views of Histopathology and step
General views of Histopathology and stepGeneral views of Histopathology and step
General views of Histopathology and step
 
CapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptxCapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptx
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICE
 
Human-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming ClassesHuman-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming Classes
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptx
 
Diploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfDiploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdf
 
Finals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quizFinals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quiz
 
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdfMaximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
 
How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17
 
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxPISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
 
Benefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive EducationBenefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive Education
 
How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17
 
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptx
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptxPractical Research 1: Lesson 8 Writing the Thesis Statement.pptx
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptx
 
How to Add a many2many Relational Field in Odoo 17
How to Add a many2many Relational Field in Odoo 17How to Add a many2many Relational Field in Odoo 17
How to Add a many2many Relational Field in Odoo 17
 
How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17
 
Patterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptxPatterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptx
 
Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptx
 
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdfPersonal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
 

Spanos lecture 7: An Introduction to Bayesian Inference

  • 1. PHIL 6334 - Probability/Statistics Lecture Notes 7: An Introduction to Bayesian Inference Aris Spanos [Spring 2014] 1 Introduction to Bayesian Inference The lectures notes provide an elementary introduction to Bayesian inference focusing almost exclusively on two simple models; the simple Bernoulli and Normal models to keep the technical details to a minimum. Textbooks often motivate the various techniques associ- ated with Bayesian statistics by contrasting them to those of fequentist (classical) inference. Indeed, adherents to the Bayesian approach often begin with a litany of charges lev- eled against the frequentist approach which are usually due to inadequate understanding of the latter; see section 5. The following quotation from O’Hagan (1994), p. 16, is typical of such charges: “Broadly speaking, some of the arguments in favour of the Bayesian approach are that it is fundamentally sound, very flexi- ble, produces clear and direct inferences and makes use of all the available information. In contrast, the classical approach suffers from some philosophical flaws, has restrictive range of inferences with rather indirect meaning and ignores prior information.” The address the above claims by O’Hagan one needs to compare the two approaches in depth, focusing more on the underlying reasoning and their primary objectives. Section 2: Probability and its various interpretations Section 3: Frequentist Inference Section 4: Bayesian Inference Section 5: Charges against the frequentist approach to inference. 1
  • 2. 2 Probability and its various interpretations For mathematical purposes Kolmogorov’s axiomatization of probability is perfectly adequate, but when it comes to statis- tical inference the interpretation is absolutely necessary. What does mathematical probability correspond to in the real world? This correspondence will determine the kind of inductive pro- cedures one should follow in‘learning from data’. That is, the interpretation of probability will determine the nature of the inductive (statistical) inference called for. From the viewpoint of statistical inference let us con- sider the following interpretations of probability: (i) the classical (equally-likely) - () (ii) the degrees of belief - () (iii) the relative frequency - Pr(). For a better understanding of the various interpretations of probability one should consider them in the context in which they were first developed. As early as the 18th century all three basic interpretations of probability were being used in different contexts without much thought of choosing one in- terpretation for all purposes. The classical interpretation was used in the context of games of chance and was viewed as stemming from equal prob- ability outcomes based on some sort of physical symmetry. The degrees of belief interpretation originated from at- tempts to quantify the relationship between the evidence pre- sented in courts and the degree of conviction in the mind of the judge. The relative frequency interpretation originated from mortality and socio-economic data gathered over long periods of time from the 16th century onwards. 2
  • 3. The question to be considered is whether any of these in- terpretations can provide a pertinent link between Kol- mogorov’s mathematical theory of probability and empirical modeling and inference. 2.1 The classical interpretation of probability It is generally accepted that, historically, the theory of proba- bility was developed in the context of games of chance such as casting dice or tossing coins. It was only natural then that the first interpretation of probability was inextricably bound up with the chance mechanism of such games. The first explicit definition of the classical definition of probability is given by Laplace at the beginning of the 19th century. The classical definition. Consider the random exper- iment E which has  equally likely outcomes and event  occurs when  of them occur, then according to the classi- cal definition of probability: () = ¡  ¢  Example. Let  be the random variable denoting the number of dots on the sides of die. When a die is symmetric and homogeneous: Prob. distribution of  = 1 2 3 4 5 6 () 1 6 1 6 1 6 1 6 1 6 1 6 The first important feature of this definition is its reliance on the nature of an explicit chance mechanism such as casting dice or tossing coins. Its second crucial feature is that it uti- lizes the apparent physical symmetry of the device underlying the chance mechanism to define probability by evaluating it as “the ratio of the number of outcomes favorable to the event 3
  • 4. to the total number of possible outcomes, each assumed to be equally likely” (see Laplace, 1814). For the purposes of providing the missing link between the mathematical concept of a statistical model and the notion of chance regularity, this definition of probability is inadequate for a number of reasons including: (i) it is based on an explicit chance mechanism, (ii) the chance mechanism has a build-in physical symmetry that leads to equally likely outcomes, (iii) it assumes that one can partition the set of outcomes into a finite number of equally likely events. This definition has been severely criticized in the literature but the critics tend to concentrate their arrows on the equally likely clause. ¥ What do we mean by equally likely and how do we recog- nize equally likely outcomes? Laplace proposed a principle for justifying equally likely cases, known as: the principle of insufficient reason, or the principle of indifference. This is based on the idea that if one has no reason to favor one outcome over the another they are considered equally likely. This principle has given rise to several paradoxes and has been called into question repeatedly (Hacking, 1975). In addition to the objection to the equally likely clause, there is one crucial objection to the classical definition: it assumes that one can partition the set of outcomes into a finite number of equally likely events. ¥ What happens when the random experiment does not enjoy this symmetry, such as the case of a biased coin? ¥ What about axiom A1 of the mathematical definition? 4
  • 5. In the case of an infinite sample space the denominator of the above definition will get us into trouble. Empirical modeling. The classical interpretation of prob- ability is too restrictive for empirical modeling purposes. 2.2 The ‘degrees of belief’ interpretation of probability Our interest in the degree of belief interpretation of prob- ability stems from the fact that it leads to an approach to statistical inference known as the Bayesian approach. The degree of belief interpretation of probability comes in two flavors: the subjective and rational. A. Degrees of subjective belief The subjective interpretation considers the probability of an event  as based on the personal judgment of whoever is assigning the probability; the personal judgement being based on the individual’s experience and background. In this sense the probability of event  is based on the person’s beliefs and information relating to the experiment giving rise to event . Example. In the case of tossing a fair coin a person is likely to assign the subjective probability ()=1 2 because a symmetric coin would render  and  a priori equally likely. In the case where the person in question has additional information relating to the mechanism, such as the coin is bent, the subjective probability is likely to change. De Finetti introduced a convenient way to think of sub- jective probabilities is in terms of betting odds. Let us consider the case of betting on the occurrence of an event  and somebody offers odds 2 to 1 or in a ratio form 1 2  If the person whose degrees of subjective belief we are trying to as- sess thinks that these are fair odds, then we can proceed to 5
  • 6. evaluate her subjective probability via: 1 2 1+1 2 = 1 3  i.e. () = 1 3  More generally, if the subjective probability for the occur- rence of the event  is  (i.e. () = ) then the odds ratio ¨ and the corresponding subjective probability  take the form: ¨ =  (1−) ⇒  = ¨ 1+¨  As we can see, the subjective dimension of this probability arises from the fact that it is the decision of a particular in- dividual whether the odds are fair or not. Another individual might consider as fair the odds ratio ¨0  which implies that her subjective probability is 0 = ¨0 1+¨0 6= This is not surprising because the personal experiences which influence judgement are often different between individuals. The question which naturally arises at this stage is to whether such personal subjective probabilities will behave in accor- dance with the mathematical definition of probability. The answer is yes, under certain restrictions, as demonstrated by Ramsey (1926) , de Finetti (1937) and Savage (1954). B. Degrees of rational belief Another question with regard to the degree of belief inter- pretation of probability is whether one could find some way to establish that a particular odds ratio will be considered fair by a rational person; assuming a formal definition of ra- tionality. The idea being that in such a case the subjective dimension will become less personalistic. Keynes (1921) was the first to propose such an interpreta- tion, often called logical probability. Carnap (1950) general- ized and extended it, but the logical interpretation has been severely criticized on several grounds. 6
  • 7. Empirical modeling. How appropriate is the degrees of belief interpretation of probability for empirical modeling? If data x0:=(1  ) contain systematic information in the form of chance regularities, ‘stochasticity’ is a feature of real-world phenomena and exists independently of one’s be- liefs. Hence, degree of belief interpretation has limited role in selecting an appropriate statistical model. 2.3 The frequency interpretation of probability The frequency interpretation of probability can be traced back to the statistical regularities established during the 18th and 19th centuries, based on demographic, anthropomorphic, eco- nomic and social (crimes, violent deaths, etc.) data. The analysis of these data led to an amazing conclusion: “despite the unpredictability at the individual level (people, firms etc.) there was a remarkable stability of the relative frequencies at the aggregate level (groups) over long periods of time.” In the context of the frequency interpretation, the proba- bility of an event  is viewed as an empirical regularity asso- ciated with this event. The frequentist interpretation. Consider the case where one is able to repeat an experiment under identical conditions, and denote the relative frequency of the event  after  trials by ¡  ¢  then the frequency interpretation views the probability of event  as the limit of the relative frequency¡  ¢ as the number of repetitions goes to infinity: Pr() = lim→∞ ¡  ¢  (1) Can one prove the above mathematical claim? NO, and thus the von Mises attempt to provide a frequen- tist interpretation of probability using (1) is a dead end for 7
  • 8. purely mathematical reasons! Instead, (1) is a mathemati- cally false statement that reflects the intuition underlying the frequentist interpretation of probability. It becomes a mathe- matically true statement when the limit is interpreted in prob- abilistic terms; see Spanos (2013). A formal justification for the frequentist interpretation as the limit of relative frequencies is grounded on the Strong Law of Large Numbers (SLLN) that gives precise mean- ing to the claim ‘the sequence of relative frequencies {  }∞ =1 converges to  as  → ∞’. To see that let =(=1) and =(=0) and   =1  P =1 := Borel (1909). For an IID Bernoulli process { ∈N} defining the simple Bernoulli model: M(x):  v BerIID( (1−)) ∈N, (2) P( lim →∞ (1  P =1 ) = ) = 1 (3) That is, as  → ∞ the stochastic sequence {}∞ =1 where =1  P =1  converges to a constant  with probability one. 1 0 0 09 0 08 0 07 0 06 0 05 0 04 0 03 0 02 0 01 0 01 0 . 7 0 0 . 6 5 0 . 6 0 0 . 5 5 0 . 5 0 0 . 4 5 0 . 4 0 In d e x Average Fig. 1: The Strong Law of Large Numbers in action The long-run metaphor. It is also important to note 8
  • 9. that the long-run metaphor associated with the frequentist in- terpretation envisions repeating the mechanism represented by an IID Bernoulli process and observing the relative fre- quency process {}∞ =1 (almost surely)-approximate  as →∞ I It is crucial to emphasize that, when viewed in the con- text of a statistical model M(x), the key dimension of the long-run metaphor is not the temporal aspect. Keynes’s tongue- in-cheek comment that "in the long-run we will all be dead" is totally misplaced in this context. I The crucial dimension of the long-run is repeatability (in principle) of the data-generating process. To generate the re- alization shown below takes a fraction of a nanosecond! More- over, the long-run can be operationalized on a computer for any statistical model using the statistical GM. Example. In the case of the simple Normal model, one can generate as many sample realizations as wanted using pseudo-random numbers:  =  +   v N(0 1)  = 1 2    Circularity? The issue often raised, when invoking the SLLN as a justification for the frequency definition of proba- bility, is that the argument suffers from circularity: one uses probability to define probability! This claim is based on confusion. The SLLN states that, under certain restrictions on the probabilistic structure of { ∈N}, one can deduce: P( lim →∞ (1  P =1 ) = )=1 (4) This is a measure-theoretic result which asserts that the con- vergence holds everywhere in a domain  ⊂  except on a subset 0 the latter being a set of measure zero (P (0) =0)- 9
  • 10. the impossible event. This mathematical theorem links the probability () to the relative frequencies (1  P =1 ) via the interpretive provisions: [i] data x0:=(1 2     ) is viewed as a ‘truly typical’ realization of the process { ∈N} specified by M(x), [ii] the ‘typicality’ of x0 (e.g. IID) can be assessed using trenchant Mis-Specification (M-S) testing. These provisions provide a pertinent link between the math- ematical framework and the data-generating mechanism that gave rise to x0. 3 Frequentist Inference Statistical model Mθ(x)={(x; θ) θ∈Θ} x∈R  =⇒ Distribution of the sample (x; θ) x∈R  ↑ Data: x0:=(1 2  ) −→ ⇓ Likelihood function (θ; x0) θ∈Θ Frequentist approach to statistical inference For the frequentist approach: [a] The interpretation of probability is the frequency inter- pretation. [b] The systematic statistical information in data x0 in the form of chance regularities, constitutes the only relevant statistical information for selecting the statistical model. Sub- stantive information comes in the form of restrictions on sta- 10
  • 11. tistical parameters, but should not be imposed at the outset; its validity needs to be established before imposed. [c] The primary aim of the frequentist approach is to learn fromdata x0 about the ‘true’ underlying data-generating mech- anism M∗ (x)={(x; θ∗ )} x∈R ; θ∗ denotes the true value of θ In general, the expression ‘θ∗ denotes the true value of θ’ is a shorthand for saying that ‘data x0 constitute a real- ization of the sample X with distribution (x; θ∗ )’ This is achieved by employing reliable and effective inference proce- dures that revolve around the unknown parameter(s) θ based on ascertainable error probabilities. I The underlying inductive reasoning comes in two forms: (i) factual: the true state of nature (estimation and pre- diction), whatever that happens to be, and (ii) hypothetical: various hypothetical scenarios are com- pared to what actually happened (hypothesis testing). [d] Frequentist inference is based exclusively on the distri- bution of the sample (x; ), x∈R . This is because all fre- quentist procedures, estimators, test statistics and predictors, are based on statistics of the form: =(1 2  ) whose sampling distribution is determined by (x; ) via: ()=P( ≤ )= Z Z · · · Z | {z } {(12)≤} (x; )12 · · ·  (5) Its importance stems from the fact that all relevant error prob- abilities, coverage, type I and II, power, p-value, associated with frequentist inference are based on such a sampling dis- tribution. In the case where the density function () is con- tinuous: (; )=()  . 11
3.0.1 The simple Bernoulli model

The discussion that follows focuses on the simple Bernoulli model, specified by:

M_θ(x):  X_k ~ BerIID(θ, θ(1−θ)),  k∈N := (1, 2, ..., n, ...),

where 'BerIID' stands for 'Bernoulli, Independent and Identically Distributed'. Using the IID assumptions, one can derive the distribution of the sample:

f(x_1, x_2, ..., x_n; θ) = Π_{k=1}^n f(x_k; θ) = Π_{k=1}^n θ^{x_k}(1−θ)^{1−x_k} = θ^{Σ_{k=1}^n x_k}(1−θ)^{Σ_{k=1}^n (1−x_k)} = θ^{n x̄n}(1−θ)^{n(1−x̄n)},   (6)

where x̄n = (1/n)Σ_{k=1}^n x_k is the sample mean and y = n x̄n denotes the number of 'successes' (x_k = 1) in n trials, i.e.

f(x; θ) = θ^{n x̄n}(1−θ)^{n(1−x̄n)},  for all x∈R_X^n = {0, 1}^n,   (7)

which, viewed as a function of y = n x̄n, is Binomially distributed. The likelihood function is defined as proportional to the distribution of the sample:

L(θ; x0) ∝ f(x0; θ),  for all θ∈[0, 1].   (8)

Note that the proportionality (∝) is important because L(θ; x0) is viewed as a function of θ∈Θ, whereas f(x; θ) is a function of x∈R_X^n, and the two usually have very different dimensions. Indeed, in the simple Bernoulli case f(x; θ) is discrete, but L(θ; x0) is continuous in θ.

Example (a). For y = n x̄n = 4, n = 20, the likelihood function takes the form:

L(θ; x0) ∝ θ^4(1−θ)^{16},  θ∈[0, 1].
Fig. 1: f(x; θ), x∈R_X^n [Binomial, n = 20, θ = 0.2].   Fig. 2: L(θ; x0), θ∈[0, 1].

The sampling distribution of y is shown in fig. 1 and the normalized form of the likelihood function is given in fig. 2.

3.0.2 The simple Normal model

The discussion that follows focuses on the simple Normal model, specified by:

M_θ(x):  X_k ~ NIID(μ, σ²),  k∈N := (1, 2, ..., n, ...).

The distribution of the sample takes the form:

f(x; θ) = Π_{k=1}^n (1/(σ√(2π))) exp{−(x_k − μ)²/(2σ²)} = (1/(σ√(2π)))^n exp{−(1/(2σ²)) Σ_{k=1}^n (x_k − μ)²}.

This means that the likelihood function is:

L(μ, σ²; x0) ∝ (1/(σ√(2π)))^n exp{−(1/(2σ²)) Σ_{k=1}^n (x_k − μ)²},  −∞ < μ < ∞, σ² > 0.

In the case where σ² is known this simplifies to:

L(μ; x0) ∝ exp{−(1/(2σ²)) Σ_{k=1}^n (x_k − μ)²},  −∞ < μ < ∞.
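The content of figs. 1-2 can be reproduced numerically. The following is a hedged sketch (my illustration): it evaluates the Binomial sampling distribution of y = nX̄n under θ = 0.2 and the likelihood θ^4(1−θ)^16 for y = 4, n = 20, scaled to integrate to one over [0, 1].

```python
# Sketch reproducing figs. 1-2 numerically: the Binomial sampling distribution
# of y = n*xbar_n (n=20, theta=0.2) and the scaled likelihood for y=4.
import numpy as np
from scipy import stats

n, theta0, y = 20, 0.2, 4

# Fig. 1: P(Y = k) for Y ~ Bin(n, theta0), k = 0,...,12
support = np.arange(0, 13)
print(np.round(stats.binom.pmf(support, n, theta0), 3))

# Fig. 2: L(theta; x0) on a grid, scaled so it integrates to one over [0,1]
theta = np.linspace(0.0, 1.0, 1001)
lik = theta**y * (1.0 - theta)**(n - y)
lik_scaled = lik / (lik.sum() * (theta[1] - theta[0]))   # simple Riemann scaling
print("likelihood peaks at theta =", theta[np.argmax(lik_scaled)])   # ~0.2
```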
3.1 Frequentist (Point) Estimation

In general, the Maximum Likelihood (ML) estimator of θ is defined by:

θ̂(X) = arg max_{θ∈Θ} ln L(θ; x).

It is important to distinguish between three different concepts: θ, an unknown constant; θ̂(X), an estimator of θ; and θ̂(x0), an estimate of θ.

In the case of the simple Bernoulli model, the first-order condition for maximizing ln L(θ; x) is:

d ln L(θ; x)/dθ = n x̄n/θ − n(1−x̄n)/(1−θ) = 0,

which, when solved for θ, gives rise to the ML estimator:

θ̂(X) = X̄n = (1/n) Σ_{k=1}^n X_k.   (9)

The sampling distribution of this estimator is:

θ̂(X) ~ Bin(θ, θ(1−θ)/n; n),   (10)

where 'Bin(θ, θ(1−θ)/n; n)' reads 'distributed Binomially with mean θ, variance θ(1−θ)/n and sample size n'.

What is a good (optimal) estimator in frequentist inference? It is a statistic h(X) that pin-points θ*, the true value of θ, most effectively. Its capacity (effectiveness) to pin-point θ* is defined in terms of several optimal properties, such as (i) Unbiasedness: E(θ̂(X)) = θ*; (ii) Full Efficiency: Var(θ̂(X)) attains the Cramér-Rao lower bound CR(θ*); (iii) Strong Consistency: P(lim_{n→∞} θ̂(X) = θ*) = 1.

Example. The estimator θ̂(X) = (1/n)Σ_{k=1}^n X_k, whose sampling distribution is given by (10), is optimal because it is unbiased, fully efficient and strongly consistent!
3.2 Confidence Intervals (CIs)

The most straightforward procedure for deriving a Confidence Interval (CI) is to find a pivotal quantity whose distribution under θ = θ* is known.

A. For the simple Bernoulli model such a pivotal quantity can be specified using the ML estimator θ̂(X) = (1/n)Σ_{k=1}^n X_k and its sampling distribution in (10):

d(X; θ) = √n(θ̂(X) − θ)/√(θ(1−θ)) ~ Bin(0, 1; n) under θ = θ*.   (11)

In general, an optimal CI begins with an optimal estimator. Since its sampling distribution is known, one can specify a two-sided CI of the form:

P(−c_{α/2} ≤ d(X; θ) < c_{α/2}; θ = θ*) = 1 − α,

and then 'solve' for θ to derive the optimal (1−α) Confidence Interval (CI):

P( θ̂ − c_{α/2}√(θ̂(1−θ̂)/n) ≤ θ < θ̂ + c_{α/2}√(θ̂(1−θ̂)/n) ) = 1 − α,   (12)

where θ̂ is shorthand for θ̂(X). In practice, finding c_{α/2} can be cumbersome, and in many cases (12) can be approximated (for a large enough sample size, say n ≥ 20) using the N(0, 1) distribution.

Example (a). In the case where y = n x̄n = 4, i.e. x̄n = 0.2, n = 20, the approximate 95% observed CI for θ is:

[ (0.2 − (1.96/√20)√(0.2(0.8))) = 0.025 ≤ θ < 0.375 = (0.2 + (1.96/√20)√(0.2(0.8))) ].

Example (b). In the case where y = n x̄n = 12, i.e. x̄n = 0.6, n = 20, the approximate 95% observed CI for θ is:

[ (0.6 − (1.96/√20)√(0.6(0.4))) = 0.385 ≤ θ < 0.815 = (0.6 + (1.96/√20)√(0.6(0.4))) ].
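As a quick numerical check of the approximate interval (12), here is a minimal sketch (my own illustration) that reproduces examples (a) and (b) using the N(0, 1) approximation.

```python
# Sketch of the large-n approximate (1-alpha) CI in (12) for the Bernoulli model.
import numpy as np
from scipy import stats

def approx_ci(y, n, alpha=0.05):
    theta_hat = y / n
    z = stats.norm.ppf(1 - alpha / 2)                        # c_{alpha/2} = 1.96
    half = z * np.sqrt(theta_hat * (1 - theta_hat) / n)
    return theta_hat - half, theta_hat + half

print(approx_ci(4, 20))    # example (a): ~ (0.025, 0.375)
print(approx_ci(12, 20))   # example (b): ~ (0.385, 0.815)
```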
B. For the simple Normal model (with σ² known) the pivotal quantity can be specified using the ML estimator X̄n = (1/n)Σ_{k=1}^n X_k, whose sampling distribution yields:

d(X; μ) = √n(X̄n − μ)/σ ~ N(0, 1) under μ = μ*.   (13)

Using (13) one can specify a two-sided CI of the form:

P(−c_{α/2} ≤ d(X; μ) < c_{α/2}; μ = μ*) = 1 − α,

and then 'solve' for μ to derive the optimal (1−α) Confidence Interval (CI):

P( X̄n − c_{α/2}(σ/√n) ≤ μ < X̄n + c_{α/2}(σ/√n) ) = 1 − α.   (14)

3.3 Frequentist Testing

A. In the context of the simple Bernoulli model, consider the hypotheses:

H_0: θ = θ_0  vs.  H_1: θ > θ_0.   (15)

As in the case of interval estimation, a good test procedure begins with a good estimator, which gives rise to a test statistic. In this case the test statistic looks similar to (11):

d(X) = √n(θ̂(X) − θ_0)/√(θ_0(1−θ_0)) ~ Bin(0, 1; n) under θ = θ_0,  where θ̂(X) := X̄n,

but in fact it has two crucial differences: (i) the test statistic d(X) involves no unknown parameters, unlike d(X; θ), and (ii) the evaluation of its sampling distribution is under the null (θ = θ_0), which is hypothetical, and not under θ = θ*, which is factual. The optimal α-significance level test takes the form:

d(X) = √n(X̄n − θ_0)/√(θ_0(1−θ_0)),  C_1(α) := {x: d(x) > c_α}.   (16)
Example. In the case where x̄n = 0.75, n = 20, α = 0.05 (c_{α/2} = 1.96), θ_0 = 0.5, test (16) yields:

d(x0) = √20(0.75 − 0.5)/√(0.5(1−0.5)) = 2.236,

leading to rejecting H_0. The p-value associated with this test is:

P(d(X) > 2.236; H_0) = 0.013.   (17)

Example. For x̄n = 0.6, n = 20, α = 0.05 (c_{α/2} = 1.96), θ_0 = 0.5, test (16) yields:

d(x0) = √20(0.6 − 0.5)/√(0.5(1−0.5)) = 0.894,

leading to accepting H_0. The p-value associated with this test is:

P(d(X) > 0.894; H_0) = 0.186.   (18)

B. In the context of the simple Normal model (with σ² known) consider the hypotheses:

H_0: μ = μ_0  vs.  H_1: μ > μ_0.

In this case the test statistic takes the form:

d(X) = √n(X̄n − μ_0)/σ ~ N(0, 1) under μ = μ_0,

giving rise to the optimal α-significance level test:

d(X) = √n(X̄n − μ_0)/σ,  C_1(α) := {x: d(x) > c_α}.

Example. For μ_0 = 10, σ = 2, n = 100, α = 0.05 (c_{α/2} = 1.96), x̄n = 10.6, this yields:

d(x0) = √100(10.6 − 10)/2 = 3.0,

leading to rejecting H_0. The p-value associated with this test is:

P(d(X) > 3.0; H_0) = 0.001.   (19)
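The test statistics and p-values in (17)-(19) can be reproduced with the following minimal sketch (my own illustration), using the N(0, 1) approximation for the one-sided p-values.

```python
# Sketch of the one-sided tests above: the Bernoulli test (16) and its Normal analogue.
import numpy as np
from scipy import stats

def bernoulli_test(xbar, n, theta0):
    d = np.sqrt(n) * (xbar - theta0) / np.sqrt(theta0 * (1 - theta0))
    return d, stats.norm.sf(d)              # p-value P(d(X) > d(x0); H0)

def normal_test(xbar, n, mu0, sigma):
    d = np.sqrt(n) * (xbar - mu0) / sigma
    return d, stats.norm.sf(d)

print(bernoulli_test(0.75, 20, 0.5))        # d ~ 2.236, p ~ 0.013
print(bernoulli_test(0.60, 20, 0.5))        # d ~ 0.894, p ~ 0.186
print(normal_test(10.6, 100, 10, 2))        # d ~ 3.0,   p ~ 0.001
```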
4 Bayesian Inference

4.1 The Bayesian framework

Bayesian inference begins with a statistical model:

M_θ(x) = {f(x; θ), θ∈Θ}, x∈R_X^n,  for θ∈Θ⊂R^m,   (20)

where f(x; θ) is the distribution of the sample X := (X_1, ..., X_n), R_X^n is the sample space and Θ the parameter space. Bayesian inference modifies the frequentist inferential set-up, based exclusively on (20), in two crucial respects:

(A) It views the unknown parameter(s) θ as random variables with their own distribution, known as the prior distribution:

π(·): Θ → [0, 1],

which represents one's a priori assessment of how likely the various values of θ in Θ are; this amounts to ranking the different models M_θ(x) for all θ∈Θ. In frequentist inference θ is viewed as a set of unknown constants indexing f(x; θ), x∈R_X^n.

(B) It re-interprets the distribution of the sample as conditional on the unknown parameters θ, denoted by f(x|θ).

Taken together these modifications imply that for Bayesians the joint distribution of the sample is re-defined as:

f(x, θ) = f(x|θ)·π(θ),  ∀θ∈Θ, ∀x∈R_X^n,   (21)

where ∀ denotes 'for all'.

Distinguishing criteria:
[a] The Bayesian approach interprets probability as degrees of belief [subjective, logical or rational].
[b] In the context of Bayesian inference, the relevant information includes: (i) the data x0 := (x_1, x_2, ..., x_n), and (ii) the prior distribution π(θ), θ∈Θ.
[c] The primary aim of the Bayesian approach is to revise the initial ranking π(θ), ∀θ∈Θ, in light of the data x0, as summarized by the likelihood L(θ|x0), and to update the ranking in terms of the posterior distribution derived using Bayes' rule:

π(θ|x0) = f(x0|θ)·π(θ) / ∫_Θ f(x0|θ)·π(θ)dθ ∝ L(θ|x0)·π(θ),  ∀θ∈Θ,   (22)

where L(θ|x0) ∝ f(x0|θ), θ∈Θ, denotes the likelihood function, as re-interpreted by Bayesians.

Remark: in relation to (22) it is important to note that, contrary to claims by Bayesians (e.g. Ghosh et al, 2006, p. 31):

"f(x0|θ) is the density of X interpreted as the conditional density of X given θ. The numerator is the joint density of θ and X and the denominator is the marginal density of X"

(i) f(x0|θ) is not "the conditional density of X given θ", since the latter is defined by f(x|θ), ∀x∈R_X^n;
(ii) the numerator is not "the joint density of θ and X", since the latter necessitates the double quantifier as in (21);
(iii) the denominator is not "the marginal density of X", since the latter is defined by m(x) = ∫_Θ f(x|θ)·π(θ)dθ, ∀x∈R_X^n; see Spanos (2014) for the details.

Prior probabilities π(θ), θ∈Θ, the statistical model M_θ(x) = {f(x; θ), θ∈Θ}, x∈R_X^n, and the data x0 := (x_1, ..., x_n) are combined via Bayes' rule to yield the posterior distribution:

π(θ|x0) ∝ π(θ)·L(θ|x0).

The Bayesian approach to statistical inference.
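For concreteness, here is a minimal sketch (my own illustration, not the authors' code) of Bayes' rule (22) implemented on a grid for the simple Bernoulli model. The uniform Beta(1, 1) prior and the data y = 4, n = 20 are illustrative assumptions; the normalizing constant is approximated by numerical integration.

```python
# Sketch of Bayes' rule (22) on a grid for the simple Bernoulli model.
import numpy as np

theta = np.linspace(0.0, 1.0, 2001)
dtheta = theta[1] - theta[0]
y, n = 4, 20

prior = np.ones_like(theta)                      # pi(theta): Beta(1,1) = U(0,1)
likelihood = theta**y * (1.0 - theta)**(n - y)   # L(theta|x0), up to a constant
unnormalized = prior * likelihood
posterior = unnormalized / (unnormalized.sum() * dtheta)   # divide by m(x0)

print((posterior * dtheta).sum())                # ~1: a proper density
print(theta[np.argmax(posterior)])               # posterior mode ~ 0.2
```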
A famous Bayesian, Savage (1954), summarized Bayesian inference succinctly as follows:

"Inference means for us the change of opinion induced by evidence on the application of Bayes' theorem." (p. 178)

O'Hagan (1994) is more specific:

"Having obtained the posterior density π(θ|x0), the final step of the Bayesian method is to derive from it suitable inference statements. The most usual inference question is this: After seeing the data x0, what do we now know about the parameter θ? The only answer to this question is to present the entire posterior distribution." (p. 6)

In this sense, learning from data in the context of the Bayesian perspective pertains to how the original beliefs π(θ) are revised in light of data x0, the revision coming in the form of the posterior π(θ|x0), ∀θ∈Θ.

[d] For inference purposes, the only relevant point in the sample space R_X^n is the data x0, as summarized by L(θ|x0), θ∈Θ. This feature of Bayesian inference is formalized by the Likelihood Principle.

Likelihood Principle (LP). For inference purposes the only relevant sample information pertaining to θ is contained in the likelihood function L(θ|x0), ∀θ∈Θ. Moreover, two sample realizations x0 and y0 contain the same information about θ if their likelihoods are proportional to one another, i.e. L(θ|x0) = c·L(θ|y0) for some constant c > 0 (Berger and Wolpert, 1988, p. 19).

Indeed, Bayesians take delight in poking fun at frequentist testing by quoting Jeffreys's (1939) remark about the 'absurdity' of invoking realizations x∈R_X^n other than x0:

"What the use of P [p-value] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems
a remarkable procedure." (p. 385) [ha, ha, ha!!!!!]

Frequentist inference procedures, such as estimation (point and interval), hypothesis testing and prediction, DO invoke realizations x∈R_X^n other than x0, contravening the LP.

4.2 The posterior distribution and its role in inference

Bayesian inference begins with a statistical model M_θ(x), like the frequentist approach, but it adds a prior distribution π(θ), θ∈Θ, for the unknown parameter θ, viewed as a random variable with its own distribution.

A. In the case of the simple Bernoulli model, the prior of choice for Bayesians is the Beta distribution, defined in terms of two parameters α and β. For the discussion that follows let us assume that the prior π(θ) is Beta(α, β) distributed with density function:

π(θ) = (1/B(α, β)) θ^{α−1}(1−θ)^{β−1},  α > 0, β > 0, 0 < θ < 1.   (23)

The Beta distribution is very flexible in the sense that for different values of the parameters (α, β) the density function can take many different shapes; see fig. 4. Such a prior is proper because it covers the whole of the parameter space and it integrates to one, i.e. ∫_0^1 π(θ)dθ = 1.

Bayesian inferences are based on the posterior distribution of θ given data x0, derived via Bayes' formula:

π(θ|x0) = π(θ)L(θ|x0) / ∫_{θ∈[0,1]} π(θ)L(θ|x0)dθ = π(θ)L(θ|x0)/m(x0),  ∀θ∈[0, 1].   (24)

The denominator m(x0) = ∫_0^1 π(θ)L(θ|x0)dθ is a normalizing constant, which is important to ensure that π(θ|x0) is a proper density function, i.e. ∫_{θ∈[0,1]} π(θ|x0)dθ = 1.
Fig. 4: Beta(α, β) densities for different values of (α, β).

Example. In the case where L(θ|x0) is given by (6) and π(θ) is given by (23), the posterior distribution is:

π(θ|x0) ∝ [ (1/B(α, β)) θ^{α−1}(1−θ)^{β−1} ]·[ θ^{nx̄n}(1−θ)^{n(1−x̄n)} ] = (1/B(α, β)) θ^{nx̄n+α−1}(1−θ)^{n(1−x̄n)+β−1}.   (25)

In view of the formula in (23), this is recognized as an 'unnormalized' density of a Beta(α*, β*), where:

α* = nx̄n + α,  β* = n(1−x̄n) + β.   (26)

The fact that the prior distribution is Beta(α, β) and the posterior is also Beta(α*, β*) indicates that the Beta distribution is conjugate to the Bernoulli likelihood (Binomial); i.e. the Bernoulli and Beta constitute a conjugate pair because the prior and posterior belong to the same family of distributions.

Fig. 5: Jeffreys' invariant prior for the Binomial, π(θ) = (1/B(0.5, 0.5)) θ^{−0.5}(1−θ)^{−0.5}, i.e. Beta(0.5, 0.5).
Example. For Jeffreys' prior, θ ~ Beta(0.5, 0.5) (see fig. 5), and y = nx̄n = 4, n = 20:

L(θ; x0) ∝ θ^4(1−θ)^{16},  θ∈[0, 1],

π(θ|x0) = (1/B(α*, β*)) θ^{nx̄n+α−1}(1−θ)^{n(1−x̄n)+β−1},   (27)

i.e. π(θ|x0) is Beta(α*, β*) with α* = nx̄n + α = 4.5, β* = n(1−x̄n) + β = 16.5.

Fig. 6: the (scaled) likelihood L(θ; y = 4).   Fig. 7: the Beta prior π(θ) vs. the posterior π(θ|x0).
Fig. 8: the posterior π(θ|x0) [Beta(4.5, 16.5)] vs. the (scaled) likelihood [Beta(5, 17)].   Fig. 9: y ~ Bin(θ = 0.2; n = 20).

The (scaled) likelihood L(θ; x0) is given in fig. 6 and the posterior π(θ|x0) is compared to the prior π(θ) in fig. 7. In fig. 8, the likelihood is compared to the posterior. Note that all three are continuous, but the sampling distribution of θ̂(X) ~ Bin(θ, θ(1−θ)/n; n) in fig. 9 is discrete; all relevant frequentist error probabilities come from this distribution.
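The conjugate update (26) is trivial to compute; the sketch below (my own illustration) reproduces the posterior parameters of this example.

```python
# Sketch of the conjugate Beta-Bernoulli update in (26):
# alpha* = n*xbar + alpha, beta* = n*(1 - xbar) + beta.
def beta_posterior(y, n, alpha, beta):
    return alpha + y, beta + (n - y)            # (alpha*, beta*)

# Jeffreys prior Beta(0.5, 0.5) with y = 4, n = 20 (the example above)
print(beta_posterior(4, 20, 0.5, 0.5))          # (4.5, 16.5)
```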
B. In the case of the simple Normal model (with σ² known), the prior often selected is Normal:

π(μ) = (1/(σ_0√(2π))) exp{−(μ−μ_0)²/(2σ_0²)}  for μ∈R,

denoted by μ ~ N(μ_0, σ_0²), where (μ_0, σ_0²) are known values. Hence, the posterior is defined by:

π(μ|x0) ∝ π(μ)·L(μ; x0)
= [ (1/(σ_0√(2π))) exp{−(μ−μ_0)²/(2σ_0²)} ]·[ (1/(σ√(2π)))^n exp{−(1/(2σ²)) Σ_{k=1}^n (x_k−μ)²} ]
∝ exp{ −(1/2)[ ((μ−μ_0)/σ_0)² + Σ_{k=1}^n ((x_k−μ)/σ)² ] }
∝ exp{ −(1/2)[ μ²(1/σ_0² + n/σ²) − 2μ(μ_0/σ_0² + nx̄n/σ²) ] }
= exp{ −(μ − μ_1)²/(2σ_1²) },   (28)

where the parameters (μ_1, σ_1²) of the posterior (after scaling) π(μ|x0) ~ N(μ_1, σ_1²) take the particular form:

μ_1 = σ_1²(μ_0/σ_0² + nx̄n/σ²),  σ_1² = (1/σ_0² + n/σ²)^{−1} = σ_0²σ²/(σ² + nσ_0²).

After some manipulation, μ_1 can be expressed in the form:

μ_1 = λx̄n + (1−λ)μ_0,  where λ = σ_0²/(σ_0² + σ²/n) = σ_1²(n/σ²),

which indicates that it is a convex combination of x̄n and μ_0, with the variance of the prior, σ_0², and the variance of the MLE, Var(X̄n) = σ²/n, providing the weights. Moreover, as n → ∞, λ → 1 and μ_1 → x̄n. In words, asymptotically (as n → ∞) the mean of the posterior, μ_1, converges to the sample mean x̄n.
Example. (a) Consider the case where μ ~ N(8, 1), σ = 2, n = 20, x̄n = 11.6. The posterior is:

π(μ|x0) ~ N(11.022, 0.167),  σ_1² = (1/1 + 20/4)^{−1} = 0.167,  μ_1 = 0.167(8/1 + 20(11.6)/4) = 11.022.

Fig. 10: Normal prior π(μ) vs. posterior π(μ|x0).   Fig. 11: posterior π(μ|x0) vs. likelihood L(μ|x0).

(b) Consider the case where the prior is more precise, i.e. σ_0² = 0.5: μ ~ N(8, 0.5), σ = 2, n = 20, x̄n = 11.6. The posterior has a smaller variance:

π(μ|x0) ~ N(10.582, 0.143),  σ_1² = (1/0.5 + 20/4)^{−1} = 0.143,  μ_1 = 0.143(8/0.5 + 20(11.6)/4) = 10.582.

Fig. 12: Normal prior π(μ) vs. posterior π(μ|x0).   Fig. 13: posterior π(μ|x0) vs. likelihood L(μ|x0).
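A hedged sketch of the Normal-Normal update in (28) follows (my own illustration). Note that exact arithmetic gives μ_1 = 11.0 and 10.571 for cases (a) and (b); the reported 11.022 and 10.582 arise from rounding σ_1² to three decimals before multiplying, so the code output differs from the text in the third significant figure.

```python
# Sketch of the Normal-Normal update in (28), sigma^2 known:
# sigma1^2 = (1/sigma0^2 + n/sigma^2)^(-1), mu1 = sigma1^2*(mu0/sigma0^2 + n*xbar/sigma^2).
def normal_posterior(xbar, n, sigma2, mu0, sigma0_sq):
    sigma1_sq = 1.0 / (1.0 / sigma0_sq + n / sigma2)
    mu1 = sigma1_sq * (mu0 / sigma0_sq + n * xbar / sigma2)
    return mu1, sigma1_sq

print(normal_posterior(11.6, 20, 4.0, 8.0, 1.0))   # case (a): ~ (11.0, 0.167)
print(normal_posterior(11.6, 20, 4.0, 8.0, 0.5))   # case (b): ~ (10.57, 0.143)
```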
Example. Consider the case of the simple Normal model (with σ² known), where the prior is Uniform: π(μ) = 1 for μ∈R := (−∞, ∞). This prior is said to be improper because it does not integrate to one, since:

∫_{−∞}^{∞} π(μ)dμ = ∞  for any π(μ) = c∈(0, 1].

In this case the posterior is proportional to the likelihood:

π(μ|x0) ∝ L(μ; x0) = (1/(σ√(2π)))^n exp{−(1/(2σ²)) Σ_{k=1}^n (x_k−μ)²}
= (1/(σ√(2π)))^n exp{ −(n/(2σ²)) [ (1/n)Σ_{k=1}^n (x_k−x̄n)² + (x̄n−μ)² ] }
= (1/(σ√(2π)))^n exp{−ns²/(2σ²)}·exp{−(n/(2σ²))(μ−x̄n)²}.   (29)

Hence, ignoring the constant term exp{−ns²/(2σ²)}, where s² = (1/n)Σ_{k=1}^n (x_k−x̄n)², the posterior is Normally distributed:

π(μ|x0) ~ N(x̄n, σ²/n),  −∞ < μ < ∞.   (30)

Note that in (30) the random variable is μ, which varies with different values of the unknown parameter (μ∈(−∞, ∞)), and x̄n is a constant. In contrast, the sampling distribution of X̄n:

X̄n ~ N(μ, σ²/n),  x∈R_X^n,   (31)

varies with different values of the sample realization (x∈R_X^n). Hence, any talk about results with identical tail areas has to be qualified by what the two tail areas really represent.

▶ The intuitive difference between the two distributions is that one can draw (30) but not (31), since μ is unknown!
4.3 Bayesian Point Estimation

According to O'Hagan (1994):

"Classical inference theory is very concerned with constructing good inference rules. The primary concern of Bayesian inference, ..., is entirely different. The objective [of Bayesian inference] is to extract information concerning θ from the posterior distribution, and to present it helpfully via effective summaries. There are two criteria in this process. The first is to identify interesting features of the posterior distribution. ... The second criterion is good communication. Summaries should be chosen to convey clearly and succinctly all the features of interest." (p. 14)

In this sense, the frequentist inference procedures of estimation, testing and prediction are viewed by Bayesians as stylized inferences, which from the Bayesian perspective represent nothing more than different ways to summarize the relevant information in the posterior distribution π(θ|x0). Hence, the relevant criteria for 'goodness' (optimality) pertain to how well they identify and communicate relevant/interesting features of the posterior.

In the case of the stylized inference of a point estimate for θ, one can think of several features of the posterior π(θ|x0), such as a measure of location, that could potentially play such a role by indicating the most 'representative' value of θ in [0, 1].

[1] An obvious choice is to use the mode θ̃ of the posterior distribution π(θ|x0) in (27):

π(θ̃|x0) = sup_{θ∈Θ} π(θ|x0),

which is the value of θ that is ranked highest by π(θ|x0). We know that in the case of θ ~ Beta(α, β) the mode of π(θ) is (α−1)/(α+β−2). Hence, an example of a Bayesian estimate
of θ is:

θ̃ = (α*−1)/(α*+β*−2) = (nx̄n + α − 1)/(n + α + β − 2).   (32)

If we compare this with the ML estimate θ̂(x0) = x̄n = (1/n)Σ_{k=1}^n x_k, the two coincide algebraically, i.e. θ̃ = x̄n, only when α = β = 1, i.e. θ ~ Beta(1, 1) = U(0, 1), π(θ) = 1 for θ∈[0, 1].

Fig. 10: the Uniform prior, Beta(1, 1).

[2] Another "natural" choice for a most 'representative' value of θ is the mean of the posterior. We know that for θ ~ Beta(α, β), E(θ) = α/(α+β), and thus:

θ̂ = α*/(α*+β*) = (nx̄n + α)/(n + α + β).   (33)

Example. Let π(θ) ~ Beta(0.5, 0.5).
(a) y = 4, n = 20: α* = nx̄n + α = 4.5, β* = n(1−x̄n) + β = 16.5, θ̃ = 3.5/19 = 0.184, θ̂ = 4.5/21 = 0.214.
(b) y = 12, n = 20: α* = nx̄n + α = 12.5, β* = n(1−x̄n) + β = 8.5, θ̃ = 11.5/19 = 0.605, θ̂ = 12.5/21 = 0.595.

As we can see from the above numerical examples, the two Bayesian estimates can give rise to different numerical values, depending on how asymmetric the posterior distribution is; the more asymmetric, the greater the difference. How does a Bayesian decide which one of the above estimators is better?
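Before turning to that question, a minimal numerical check (my own illustration) of the two summaries in (32)-(33):

```python
# Sketch of the posterior mode (32) and posterior mean (33) of a Beta(a*, b*) posterior.
def beta_mode(a, b):
    return (a - 1) / (a + b - 2)

def beta_mean(a, b):
    return a / (a + b)

print(beta_mode(4.5, 16.5), beta_mean(4.5, 16.5))    # (a): ~0.184, ~0.214
print(beta_mode(12.5, 8.5), beta_mean(12.5, 8.5))    # (b): ~0.605, ~0.595
```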
O'Hagan answers that question by contrasting frequentist (classical) inferences with Bayesian inferences:

"In Bayesian terms, therefore, a good inference is one which contributes effectively to appropriating the information about θ which is conveyed by the posterior distribution." (p. 14)

Indeed, O'Hagan (1994), p. 14, proceeds to argue that criteria for 'good' frequentist inference procedures are only parasitical on Bayes' theorem and enter the picture via the decision-theoretic perspective:

"... a study of decision theory has two potential benefits. First, it provides a link to classical inference. It thereby shows to what extent classical estimators, confidence intervals and hypotheses tests can be given a Bayesian interpretation or motivation. Second, it helps identify suitable summaries to give Bayesian answers to stylized inference questions which classical theory addresses."

Invoking information other than the data and the prior. In the case of point estimation, to select an optimal Bayesian estimate of θ one needs to invoke additional information from decision theory, such as a loss (or utility) function L(θ̂, θ). Using different loss functions gives rise to different choices for Bayes' estimate:
(i) The Bayes estimate θ̆ that minimizes (∀θ∈Θ) the expected zero-one loss function L_{0-1}(θ, θ̆) = 0 if θ̆ = θ, 1 if θ̆ ≠ θ, is the mode of π(θ|x0).
(ii) The Bayes estimate θ̃ that minimizes (∀θ∈Θ) the expected absolute loss function L_1(θ̃; θ) = |θ̃ − θ| is the median of π(θ|x0).
(iii) The Bayes estimate θ̂ that minimizes (∀θ∈Θ) the expected square loss function L_2(θ̂; θ) = (θ̂ − θ)² is the mean of π(θ|x0).
However, as O'Hagan (1994), p. 15, explains, the notion of optimality is not primary:

"The posterior mean can therefore be seen as an estimate of θ which is best in the sense of minimizing expected square error. This is distinct from, but clearly related to, its more natural role as a useful summary of location of the posterior distribution."

What is a good (optimal) estimate in Bayesian inference? The one that minimizes a particular expected loss function. Hence, in contrast to frequentist estimation, where optimality is assessed by the estimator's capacity to pin-point θ*, the true value of θ, an optimal Bayesian estimate has nothing to do with that; it just minimizes a particular loss function for all possible values of θ. The most widely used loss function is the square loss:

L(θ̂, θ) = (θ̂(X) − θ)²,  ∀θ∈Θ,

whose expected value (the risk function) is known as the Mean Square Error (MSE):

Bayesian: MSE(θ̂(X); θ) = E(θ̂(X) − θ)²,  ∀θ∈Θ.   (34)

In relation to the expected loss function, a key property for an estimator θ̂(X) is considered to be admissibility.

Admissibility. An estimator θ̃(X) is inadmissible with respect to a risk function R(θ, θ̂) if there exists another estimator θ̂(X) such that:

R(θ, θ̂) ≤ R(θ, θ̃),  ∀θ∈Θ,   (35)

and the strict inequality (<) holds for at least one value of θ. Otherwise, θ̃(X) is said to be admissible with respect to the loss function L(θ̂, θ).
Warning: the Bayesian definition of the MSE in (34), based on the universal quantifier 'for all θ∈Θ', is at odds with the frequentist definition of the MSE, which is defined at the point θ = θ* [the true value of θ]:

Frequentist: MSE(θ̂(X); θ*) = E(θ̂(X) − θ*)².   (36)

The factual nature of frequentist reasoning in estimation also brings out the impertinence of the notion of admissibility, stemming from its reliance on the quantifier 'for all θ∈Θ'. To see this more clearly, let us consider the following example.

Example: X̄n is no better than a crystal ball estimator? For the simple Normal model:

X_k ~ NIID(μ, σ²),  k = 1, 2, ..., n, ...,   (37)

consider an MSE comparison between two estimators of μ:
(i) the Maximum Likelihood Estimator (MLE): X̄n = (1/n)Σ_{k=1}^n X_k;
(ii) the 'crystal ball' estimator: μ̆(x) = 7.405926 for all x∈R_X^n.

It turns out that both estimators are admissible, and thus equally acceptable on admissibility grounds. This surprising result stems primarily from the quantifier 'for all μ∈Θ'. Indeed, for certain values of μ close to μ̆, say μ∈(μ̆ ± ε(σ/√n)) for 0 < ε < 1, μ̆ is 'better' than X̄n, since:

MSE(X̄n; μ) = σ²/n > MSE(μ̆; μ) ≤ ε²(σ²/n)  for μ∈(μ̆ ± ε(σ/√n)).

Common sense suggests that if a certain criterion of optimality cannot distinguish between X̄n [a strongly consistent, unbiased, fully efficient and sufficient estimator] and an arbitrarily chosen real number that ignores the data altogether, it is practically useless for distinguishing between 'good' and 'bad' estimators in frequentist statistics, or for being viewed as a minimal property of estimators.
Moreover, it is obvious that the source of the problem is the quantifier ∀θ∈Θ. In contrast to admissibility, the property of consistency instantly eliminates the crystal ball estimator μ̆.

4.4 Bayesian Credible Intervals (BCIs)

A Bayesian (1−α) credible interval for θ is constructed by ensuring that the posterior probability between a and b equals (1−α):

P(a ≤ θ < b | x0) = ∫_a^b π(θ|x0)dθ = 1−α.

In practice one can define an infinity of (1−α) credible intervals using the same posterior π(θ|x0). To avoid this indeterminacy one needs to impose additional restrictions, like choosing the interval with the shortest length or the one with equal tails:

∫_a^1 π(θ|x0)dθ = 1−α/2,  ∫_b^1 π(θ|x0)dθ = α/2;

see Robert (2007).

Example. For the simple Bernoulli model, the end points of an equal-tail credible interval can be evaluated using F tables and the fact that:

θ ~ Beta(α*, β*)  ⇒  β*θ/(α*(1−θ)) ~ F(2α*, 2β*).

Denoting the α/2 and (1−α/2) quantiles of the F(2α*, 2β*) distribution by f_{α/2} and f_{1−α/2} respectively, the Bayesian (1−α) credible interval for θ is:

[ (1 + β*/(α*·f_{α/2}))^{−1} ≤ θ ≤ (1 + β*/(α*·f_{1−α/2}))^{−1} ].

For π(θ) ~ Beta(0.5, 0.5):
(a) y = 2, n = 20, α = 0.05: α* = nx̄n + α = 2.5, β* = n(1−x̄n) + β = 18.5, f_{0.025} = 0.163, f_{0.975} = 2.93:

(1 + 18.5/(2.5(0.163)))^{−1} ≤ θ ≤ (1 + 18.5/(2.5(2.93)))^{−1}  ⇒  (0.0216 ≤ θ ≤ 0.284).   (38)
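As a cross-check of case (a) (case (b) below can be verified the same way), the same equal-tail interval can be computed directly from the Beta quantile function; a hedged sketch (my own illustration) follows. It should agree with (38) up to rounding in the tabulated F percentiles.

```python
# Sketch: equal-tail 95% credible interval from the Beta(a*, b*) posterior,
# computed with the Beta quantile function instead of F tables.
from scipy import stats

a_star, b_star = 2.5, 18.5                       # case (a): y = 2, n = 20
lower = stats.beta.ppf(0.025, a_star, b_star)
upper = stats.beta.ppf(0.975, a_star, b_star)
print(lower, upper)                              # ~ (0.022, 0.28)
```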
(b) y = 18, n = 20, α = 0.05: α* = nx̄n + α = 18.5, β* = n(1−x̄n) + β = 2.5, θ̂ = 18.5/21 = 0.881, f_{0.025} = 0.341, f_{0.975} = 6.188:

(1 + 2.5/(18.5(0.341)))^{−1} ≤ θ ≤ (1 + 2.5/(18.5(6.188)))^{−1}  ⇒  (0.716 ≤ θ ≤ 0.979).

How does one interpret a (1−α) BCI such as (38)?
▶ It provides a summary of the posterior distribution π(θ|x0) by demarcating the points a and b that define the interval of length b−a with the highest posterior probability, i.e. it includes the (1−α)·100% highest ranked values of θ∈[0, 1].

4.4.1 Confidence vs. Credible Intervals

Example. For the simple (one parameter — σ² is known) Normal model, the sampling distribution of X̄n = (1/n)Σ_{k=1}^n X_k and the posterior distribution of μ derived on the basis of an improper uniform prior [π(μ) = 1, ∀μ∈R] are:

X̄n ~ N(μ*, σ²/n) under μ = μ*, x∈R_X^n;  π(μ|x0) ~ N(x̄n, σ²/n), μ∈R.   (39)

The two distributions can be used, respectively, to construct (1−α) Confidence and Credible Intervals:

P( X̄n − c_{α/2}(σ/√n) ≤ μ ≤ X̄n + c_{α/2}(σ/√n); μ = μ* ) = 1−α,   (40)

P( x̄n − c_{α/2}(σ/√n) ≤ μ ≤ x̄n + c_{α/2}(σ/√n) | x0 ) = 1−α.   (41)

The two intervals might appear the same, but they are drastically different. First, in (40) the random variable is X̄n and its sampling distribution f(x̄n; μ) is defined over x∈R_X^n, but in (41) the random variable is μ and its posterior π(μ|x0) is defined over μ∈R. Second, the reasoning underlying (40) is factual, but that of (41) is all-inclusive (∀μ∈R).
Hence, the (1−α) Confidence Interval (40) provides the shortest random upper bound U(X) = X̄n + c_{α/2}(σ/√n) and lower bound L(X) = X̄n − c_{α/2}(σ/√n) that cover μ*, the true value of μ, with probability (1−α). In contrast, the (1−α) Credible Interval (41) provides the shortest interval of length 2c_{α/2}(σ/√n) such that (1−α)·100% of the highest ranked (by π(μ|x0)) values of μ lie within it.

The above comparison suggests that a Bayesian (1−α) Credible Interval has nothing to do with the "true" μ!

4.5 Bayesian Testing

Bayesian testing of hypotheses is not as easy to handle using the posterior distribution, because of the technical difficulty in attaching probabilities to particular values of θ when the parameter space Θ is uncountable, since:

π(θ) = 0 and π(θ|x0) = 0, ∀θ∈Θ.

In their attempt to deflect attention away from this technical difficulty, Bayesians often criticize the use of a point hypothesis θ = θ0 in frequentist testing as nonsensical because it can never be exactly true! This is a misplaced argument because the notion of 'exactly true' has no place in statistics.

4.5.1 Point null and alternative hypotheses

There have been several attempts to address the difficulty with point hypotheses, but no agreement seems to have emerged; see Robert (2007). Let us consider one such attempt for testing the hypotheses:

H0: θ = θ0  vs.  H1: θ = θ1.

Like all Bayesian inferences, the basis is the posterior distribution. Hence, an obvious way to assess their respective degrees
of belief is the posterior odds:

π(θ0|x0)/π(θ1|x0) = [L(θ0|x0)·π(θ0)] / [L(θ1|x0)·π(θ1)] = (π(θ0)/π(θ1))·(L(θ0|x0)/L(θ1|x0)),   (42)

where the factor π(θ0)/π(θ1) represents the prior odds, and L(θ0|x0)/L(θ1|x0) the likelihood ratio. In light of the fact that the technical problem stems from the prior π(θ) assigning probabilities to particular values of θ, an obvious way to sidestep the problem is to cancel the prior odds factor, by using the ratio of the posterior to the prior odds to define the Bayes Factor (BF):

BF(θ0, θ1|x0) = (π(θ0|x0)/π(θ1|x0)) ÷ (π(θ0)/π(θ1)) = L(θ0|x0)/L(θ1|x0).   (43)

This addresses the technical problem because the likelihood function is definable for particular values of θ. For this reason Bayesian testing is often based on the BF, combined with certain rules of thumb concerning the strength of the degree of belief against H0 as it relates to the magnitude of BF(x0; θ0) (Robert, 2007):
▶ 1 ≤ BF(x0; θ0) ≤ 3.2: the degree of belief against H0 is poor,
▶ 3.2 < BF(x0; θ0) ≤ 10: the degree of belief against H0 is substantial,
▶ 10 < BF(x0; θ0) ≤ 100: the degree of belief against H0 is strong, and
▶ BF(x0; θ0) > 100: the degree of belief against H0 is decisive.
These rules of thumb, going from the BF to evidence for or against the null, have been questioned as largely ad hoc; see Kass and Raftery (1995).

The Likelihoodist approach. It is important to note that the Law of Likelihood, defining the likelihood ratio:

LR(θ0, θ1|x0) = L(θ0|x0)/L(θ1|x0),
provides the basis of the Likelihoodist approach to testing, but it applies only to tests of point vs. point hypotheses.

4.5.2 Composite hypotheses

A. In the context of the simple Bernoulli model, consider the hypotheses:

H0: θ ≤ θ0  vs.  H1: θ > θ0,  θ0 = 0.5,

assuming a Jeffreys invariant prior θ ~ Beta(0.5, 0.5) and data y = 12, n = 20. An obvious way to evaluate the posterior odds for these two interval hypotheses is as follows:

π(θ ≤ θ0|x0) = [Γ(21)/(Γ(12.5)Γ(8.5))] ∫_0^{0.5} θ^{11.5}(1−θ)^{7.5} dθ = 0.186,
π(θ > θ0|x0) = 1 − π(θ ≤ θ0|x0) = 0.814.

One can then employ the posterior odds criterion:

π(θ ≤ θ0|x0)/π(θ > θ0|x0) = 0.186/0.814 = 0.229,

which indicates that the degree of belief against H0 is poor.

B. (a) In the context of the simple Normal model (σ² known) consider the hypotheses:

H0: μ ≤ 10  vs.  H1: μ > 10,

assuming a Normal prior μ ~ N(8, 1), with σ = 2, n = 20, x̄n = 11.6. As shown in section 4.2, the posterior is:

π(μ|x0) ~ N( σ_1²(μ_0/σ_0² + nx̄n/σ²), σ_0²σ²/(σ² + nσ_0²) ),

yielding the result:

π(μ|x0) ~ N(11.022, 0.167),  σ_1² = (1/1 + 20/4)^{−1} = 0.167,  μ_1 = 0.167(8/1 + 20(11.6)/4) = 11.022.

The posterior odds criterion yields:

π(μ ≤ 10|x0)/π(μ > 10|x0) = [∫_{−∞}^{10} (1/√(2π(0.167))) exp{−(μ−11.022)²/(2(0.167))} dμ] / [∫_{10}^{∞} (1/√(2π(0.167))) exp{−(μ−11.022)²/(2(0.167))} dμ] = 0.0062/0.994 = 0.0062,
which indicates that the degree of belief against H0 is very poor. This result should be contrasted with that of an N-P test yielding:

d(x0) = √20(11.6 − 10)/2 = 3.578, with p(x0) = 0.0002,

which rejects H0 at any conventional significance level.

What is the intuitive explanation behind these contradictory results between the Bayesian and the frequentist test? The frequentist test will reject H0 if d(X) = √n(X̄n − μ0)/σ > c_α, i.e. if X̄n > μ0 + c_α(σ/√n). The posterior odds will "reject" H0 if X̄n > μ0 + σ²(μ0 − 8)/(nσ_0²). This indicates that in the case where the null value μ0 is chosen equal to the mean (μ0 = 8) of the prior distribution μ ~ N(μ0, σ_0²), the posterior odds will "reject" H0 whenever X̄n > μ0, which ignores the sample size beyond its influence on X̄n. On the other hand, when σ² = σ_0², the rejection threshold μ0 + (μ0 − 8)/n decreases with n at the rate 1/n, in contrast to the frequentist threshold, which decreases at the rate 1/√n.

(b) Let us change the prior to an improper uniform prior: π(μ) = 1, μ∈(−∞, ∞). As shown in section 4.2, in this case the posterior is:

π(μ|x0) ~ N(x̄n, σ²/n),  −∞ < μ < ∞,   (44)

and thus the tail areas of this posterior would seem to match perfectly those of the sampling distribution of X̄n:

X̄n ~ N(μ, σ²/n),  x∈R_X^n,   (45)

but, as argued in that section, this is an illusion; one is comparing apples and eggs!
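The posterior probabilities driving the two composite-hypothesis examples above can be checked directly from the posterior cdfs. The sketch below (my own illustration) uses the posterior parameters reported in the text, so the outputs should match up to rounding.

```python
# Sketch of the posterior-odds calculations for the composite hypotheses above.
import numpy as np
from scipy import stats

# A. Bernoulli, Jeffreys prior, y=12, n=20: posterior Beta(12.5, 8.5)
p0 = stats.beta.cdf(0.5, 12.5, 8.5)          # pi(theta <= 0.5 | x0) ~ 0.186
print(p0, p0 / (1 - p0))                     # posterior odds ~ 0.229

# B.(a) Normal, prior N(8,1), sigma=2, n=20, xbar=11.6: posterior ~ N(11.022, 0.167)
q0 = stats.norm.cdf(10, loc=11.022, scale=np.sqrt(0.167))
print(q0, q0 / (1 - q0))                     # pi(mu <= 10 | x0) ~ 0.006
```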
4.5.3 Point null but composite alternative hypothesis

Pretending that point hypotheses are small intervals. A 'pragmatic' way to handle point hypotheses in Bayesian inference is to sidestep the technical difficulty in handling hypotheses of the form:

H0: θ = θ0  vs.  H1: θ ≠ θ0   (46)

by pretending that H0 is actually H0: θ∈Θ0 := (θ0−ε, θ0+ε) and attaching a spiked prior of the form:

π(θ = θ0) = p0,  π(θ ≠ θ0) = 1−p0,   (47)

i.e. attach a prior probability p0 to θ = θ0, and then distribute the remaining 1−p0 over all the other values of θ; see Berger (1985).

Example — the large n problem. In the context of the simple Bernoulli model, consider the case where θ0 = 0.2, y = nx̄n = 106298, x̄n = 106298/527135 = 0.20165233, n = 527135.

[i] Let us apply the Bayes factor procedure to the hypotheses (46) using a spiked prior (47) with p0 = 0.5. Since the ratio π(θ0)/π(θ ≠ θ0) cancels out, the posterior odds reduce to the Bayes factor:

BF(x0; θ0) = f(x0; θ0) / ∫_0^1 f(x0; θ)dθ
= [C(527135, 106298)(0.2)^{106298}(1−0.2)^{527135−106298}] / [∫_0^1 C(527135, 106298) θ^{106298}(1−θ)^{527135−106298} dθ]
= 0.000015394/0.000001897 = 8.115.   (48)

▶ The result BF(x0; θ0) ≈ 8.115 indicates that data x0 favor the null (H0: θ = 0.2) against all other values of θ substantially. It turns out, however, that this result is highly vulnerable to the fallacy of acceptance. For certain values θ‡ in a subset of Θ1 = [0, 1]−{0.2}, in particular:

θ‡ ∈ (0.2, 0.20331],   (49)
the values θ‡ are favored by BF(x0; θ‡) more strongly than θ0 = 0.2, contradicting the original result. Indeed, as pointed out by Mayo (1996), p. 200, there is always the maximally likely alternative, θ̈ = θ̂(x0) = 0.20165233, for which the Bayes factor favors θ̈ more than any other value of θ. In particular, the Bayes factor favors θ̈ 89 times more strongly than θ0 = 0.2!

[ii] Applying an 'optimal' N-P test with α = 0.003 (c_{α/2} = 2.968) to the above data yields:

d(x0) = √527135(106298/527135 − 0.2)/√(0.2(1−0.2)) = 2.999,   (50)

which rejects H0. The p-value p(x0) = 0.0027 suggests that data x0 indicate 'some' discrepancy between θ0 and the 'true' θ (that gave rise to x0), but it provides no information about its magnitude. The post-data severity evaluation based on:

SEV(θ > θ1) = P(x: d(X) ≤ d(x0); θ > θ1 is false)   (51)

indicates that, for a severity threshold of, say, 0.9, the claim for which data x0 provide evidence is θ > θ1 for θ1 ≤ 0.20095, i.e. a warranted discrepancy from the null of at most γ* = 0.00095.

In this sense, the frequentist approach can address the fallacy of acceptance by evaluating the warranted discrepancy from the null for the particular data and test.
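The Bayes factor (48) and the test statistic (50) can be reproduced as follows. This is a hedged sketch (my own illustration): it computes the numerator as the Binomial pmf at y and uses the fact that, with a uniform weighting over θ, the denominator ∫_0^1 f(x0; θ)dθ equals 1/(n+1); the resulting value is roughly 8.1, in line with the 8.115 reported above.

```python
# Sketch of the large-n example: Bayes factor (48) and N-P test statistic (50).
import numpy as np
from scipy import stats

n, y, theta0 = 527135, 106298, 0.2

numerator = stats.binom.pmf(y, n, theta0)        # f(x0; theta0) ~ 1.5e-5
denominator = 1.0 / (n + 1)                      # integral of f(x0; theta) over [0,1]
print(numerator / denominator)                   # Bayes factor ~ 8.1

d = np.sqrt(n) * (y / n - theta0) / np.sqrt(theta0 * (1 - theta0))
print(d, 2 * stats.norm.sf(d))                   # d ~ 3.0, two-sided p-value ~ 0.0027
```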
4.6 Where do prior distributions come from?

4.6.1 Conjugate prior and posterior distributions

This is the case where the prior π(θ) and the posterior:

π(θ|x0) ∝ π(θ)·L(θ; x0),  ∀θ∈Θ,

belong to the same family of distributions, i.e. L(θ; x0) is family preserving.

Example. For the simple Bernoulli model:

π(θ) ~ Beta(α, β),  L(θ; x0) ∝ θ^{nx̄n}(1−θ)^{n(1−x̄n)}  ⟶  π(θ|x0) ~ Beta(α*, β*).

Table 1 presents some examples of conjugate pairs of prior and posterior distributions, as they combine with different likelihood forms. Conjugate pairs make mathematical sense, but do they make 'modeling' sense? The various justifications in the Bayesian literature vary from 'they help the objectivity of inference' to 'they enhance the allure of the Bayesian approach as a black box', and these claims are often contradictory!

Table 1 — Conjugate pairs (π(θ), π(θ|x0))
Likelihood                  π(θ)
Binomial (Bernoulli)        Beta(α, β)
Negative Binomial           Beta(α, β)
Poisson                     Gamma(α, β)
Exponential                 Gamma(α, β)
Gamma                       Gamma(α, β)
Uniform                     Pareto(α, β)
Normal (for θ = μ)          N(μ_0, σ_0²), μ_0∈R, σ_0² > 0
Normal (for θ = σ²)         Inverse Gamma(α, β)
4.6.2 Jeffreys's prior

Fisher (1921) criticized the practice of quantifying a state of ignorance about the unknown parameter θ using a uniform prior:

θ ~ U(0, 1),  ∀θ∈Θ.

Fisher's criticism was that such a prior is non-invariant to reparameterizations: one is ignorant about θ but very informed about φ = g(θ).

Fig. 10: the Uniform prior, Beta(1, 1).   Fig. 11: the Logistic distribution (location 0, scale 1).

Example. If θ is uniformly distributed, then the logit transformation φ = ln(θ/(1−θ)) gives rise to a very informative prior for φ:

π(φ) = e^φ/(1+e^φ)²,  −∞ < φ < ∞.

This distribution attaches much higher probability to values of φ around zero and very low probability to values in the tails.

In response to this criticism, Jeffreys (1939) proposed a new class of priors which are invariant to reparameterizations. This family of invariant priors is based on Fisher's average information:

I(θ; x) := E_x( (1/n)[∂ ln L(θ; x)/∂θ]² ) = ∫···∫_{x∈R_X^n} (1/n)(∂ ln L(θ; x)/∂θ)² f(x; θ)dx.   (52)
Note that the above derivation involves some hand-waving, in the sense that if the likelihood function L(θ; x0) is viewed, as Bayesians view it, as only a function of the data x0, then taking the expectation in (52) makes no sense; the expectation is with respect to the distribution of the sample f(x; θ) for all possible values of x∈R_X^n. As we can see, the derivation of I(θ; x) runs afoul of the likelihood principle, since all possible values of the sample X, not just the observed data x0, are taken into account. Note also that in the case of a random (IID) sample, the Fisher information I_n(θ; x) for the sample X := (X_1, X_2, ..., X_n) is related to the above average information via: I_n(θ; x) = n·I(θ; x).

In the case of a single parameter, Jeffreys invariant prior takes the form:

π(θ) ∝ √I(θ; x),   (53)

i.e. the likelihood function determines the prior distribution.

The simple Bernoulli model. The log-likelihood takes the form:

ln L(θ; x) = nx̄n ln(θ) + n(1−x̄n) ln(1−θ),
∂ ln L(θ; x)/∂θ = nx̄n/θ − n(1−x̄n)/(1−θ),  ∂² ln L(θ; x)/∂θ² = −nx̄n/θ² − n(1−x̄n)/(1−θ)².

From the second derivative it follows that:

E( (1/n)[∂ ln L(θ; x)/∂θ]² ) = E( −(1/n) ∂² ln L(θ; x)/∂θ² ) = 1/(θ(1−θ)).   (54)

This follows directly from E(X̄n) = θ, since:

E( −(1/n) ∂² ln L(θ; x)/∂θ² ) = θ/θ² + (1−θ)/(1−θ)² = 1/θ + 1/(1−θ) = 1/(θ(1−θ)).   (55)

From the definition of Jeffreys invariant prior we can deduce that for θ:

π(θ) ∝ √I(θ; x) = √(1/(θ(1−θ))) = θ^{−1/2}(1−θ)^{−1/2},  0 < θ < 1,   (56)
which is an 'unnormalized' Beta(1/2, 1/2) distribution; it only needs the scaling factor 1/B(1/2, 1/2). Note that Jeffreys prior (56) is also the reference prior for a one-parameter statistical model; see Bernardo and Smith (1994).

5 Bayesian charges against frequentist inference

[1] Bayesian inference is fundamentally sound because it can be given an axiomatic foundation based on coherent (rational) decision making, but frequentist inference suffers from several philosophical flaws.
Remark: what does an axiomatic foundation have to do with inductive inference?

[2] Frequentist inference is not very flexible and has a restrictive range of applicability. According to Koop, Poirier and Tobias (2007):
"Non-Bayesians, who we hereafter refer to as frequentists, argue that situations not admitting repetition under essentially identical conditions are not within the realm of statistical enquiry, and hence 'probability' should not be used in such situations. Frequentists define the probability of an event as its long-run relative frequency. ... that definition is nonoperational since only a finite number of trials can ever be conducted." (p. 2)
Remark: where have these authors been for the last 80 years? Is inference with time series data beyond the intended scope of frequentist statistics?

[3] Bayesian inference produces clear and direct inferences, in contrast to frequentist inference, which produces unclear and indirect inferences, e.g. credible intervals vs. confidence intervals.
"... the applied researcher would really like to be able to place a degree of belief on the hypothesis." (Press, 2003, p. 220)
Remark: if the primary aim of statistical inference is to learn from data x0 about the 'true' underlying data-generating mechanism M*(x) = {f(x; θ*)}, x∈R_X^n, what does a probability, say π(θ = θ0|x0) = 0.7, tell us about θ*?

[4] Bayesian inference makes use of all the available a priori information, but frequentist inference does not.
Remark: there is a crucial difference between prior substantive subject-matter information and information in the form of a prior distribution. Frequentist inference is tailor-made to accommodate prior information in the form of restrictions on the statistical parameters suggested by substantive theories, say G(θ, φ) = 0, where θ and φ denote the statistical and substantive parameters of interest. Indeed, substantive prior information in most scientific fields does not come in the form of a prior distribution π(θ), ∀θ∈Θ.

[5] A number of counter-examples, introduced by Bayesians, show that frequentist inference is fundamentally flawed.
Remark: the problem does not lie with frequentist testing, but with the statistical models involved. The underlying statistical models are shown to be rigged; see Spanos (2010; 2011; 2012; 2013a-d).

[6] The subjectivity charge against Bayesians is misplaced because:
"All statistical methods that use probability are subjective in the sense of relying on mathematical idealizations of the world. Bayesian methods are sometimes said to be especially subjective because of their reliance on a prior distribution, but in most problems, scientific judgement is necessary to specify both the 'likelihood' and the 'prior' parts of the model." (Gelman, et al. (2004), p. 14)
"... likelihoods are just as subjective as priors." (Kadane, 2011,
p. 445)
Remark: I know how to test the adequacy of the probabilistic assumptions defining the likelihood (they are the model assumptions) vis-à-vis data x0, but how do I test the adequacy of the assumptions defining a prior π(θ), ∀θ∈Θ?

[7] For inference purposes, the only relevant point in the sample space R_X^n is the data x0, as summarized by the likelihood function L(θ|x0), θ∈Θ.
Remark: as mentioned in section 4.1, the interpretation of the posterior distribution as proportional to the conditional distribution of X given θ times the prior π(θ) is at odds with the Likelihood Principle; see Spanos (2014). Also, Mayo (2013) has shown that Birnbaum's (1962) 'proof' of the LP is erroneous.

[8] An effective way to generate frequentist optimal statistical procedures is to find the Bayes solution using a reasonable prior and then examine its frequentist properties to see whether it is satisfactory from the latter viewpoint; see Rubin (1984), Gelman et al (2004).
Remark: this is based on assuming that admissibility (as defined by Bayesians) is a desirable minimal property; it is not! Indeed, expected losses do not constitute legitimate frequentist error probabilities. The minimal property for frequentist inference is consistency, not expected-loss relative efficiency.

The above comments, criticisms and charges [1]-[8] leveled against frequentist inference are largely misplaced and stem mostly from insufficient understanding, or just plain ignorance, on the part of the critics.

Further arguments on how the above criticisms can be countered will be part of the class discussion.
6 Appendix A: the N-P Lemma and its extensions

The cornerstone of the Neyman-Pearson (N-P) approach is the Neyman-Pearson lemma. Contemplate the simple generic statistical model:

M_θ(x) = {f(x; θ), θ∈Θ := {θ0, θ1}},  x∈R_X^n,   (57)

and consider the problem of testing the simple hypotheses:

H0: θ = θ0  vs.  H1: θ = θ1.   (58)

■ The fact that the assumed parameter space is Θ := {θ0, θ1}, so that (58) constitutes a partition of Θ, is often left out of most statistics textbook discussions of this famous lemma!

Existence. There exists an α-significance level Uniformly Most Powerful (UMP) [α-UMP] test based on:

d(X) = h( f(x; θ1)/f(x; θ0) ),  C1(α) = {x: d(x) > c_α},   (59)

where h(·) is a monotone function.
Sufficiency. If an α-level test of the form (59) exists, then it is UMP for testing (58).
Necessity. If {d(X), C1(α)} is an α-UMP test, then it will be given by (59).

At first sight the N-P lemma seems rather contrived, because it is an existence result for a simple statistical model M_θ(x) whose parameter space Θ := {θ0, θ1} is artificial, but it fits perfectly into the archetypal formulation. To operationalize the existence result one needs to do two things:
(1) Find a transformation h(·) that, when applied to f(x; θ1)/f(x; θ0), yields a meaningful test statistic d(X).
(2) Derive the distribution of d(X) under both H0 and H1.
Warning: this lemma is often misconstrued as suggesting that for an α-UMP test to exist one needs to confine testing to simple-vs-simple cases even when Θ is uncountable; nonsense!

■ The construction of an α-UMP test in realistic cases has nothing to do with simple-vs-simple hypotheses. Instead,
(i) it should be based on the archetypal N-P testing formulation based on partitioning Θ, and
(ii) it should rely on monotone likelihood ratios and other features of the prespecified statistical model M_θ(x).

Example. To illustrate these issues consider the simple-vs-simple hypotheses:

(i) H0: μ = μ0  vs.  H1: μ = μ1,   (60)

in the context of a simple Normal (one parameter) model:

X_k ~ NIID(μ, σ²),  k = 1, 2, ..., n, ....   (61)

In this case the N-P lemma does not apply, because the two values (μ0, μ1) do not constitute a partition of the parameter space Θ = R. Applying the N-P lemma requires setting up the ratio:

f(x; μ1)/f(x; μ0) = exp{ (n/σ²)(μ1 − μ0)X̄n − (n/(2σ²))(μ1² − μ0²) },   (62)

which, as it stands, is clearly not a test statistic. However, there exists a monotone function h(·) which transforms (62) into a familiar test statistic (Spanos, 1999, pp. 708-9):

d(X) = h( f(x; μ1)/f(x; μ0) ) = [σ/(√n(μ1 − μ0))] ln( f(x; μ1)/f(x; μ0) ) + √n(μ1 − μ0)/(2σ) = √n(X̄n − μ0)/σ.

A UMP test can be derived when d(X) = √n(X̄n − μ0)/σ is combined with information relating to the framing of the hypotheses.
Conditions that give rise to UMP tests

[1] Point null vs. one-sided alternative. In the case of the simple Normal model, each pair of hypotheses:

(i) H0: μ ≤ μ0 vs. H1: μ > μ0,  and  H0: μ = μ0 vs. H1: μ > μ0,
(ii) H0: μ ≥ μ0 vs. H1: μ < μ0,  and  H0: μ = μ0 vs. H1: μ < μ0,

gives rise to the same UMP test. The existence of these α-UMP tests extends the N-P lemma to more realistic cases by invoking two regularity conditions:

[2] Monotone likelihood ratio. The ratio (62) is a monotone function of the statistic X̄n, in the sense that for any two values μ1 > μ0, f(x; μ1)/f(x; μ0) changes monotonically with X̄n. This implies that f(x; μ1)/f(x; μ0) > k if and only if X̄n > c0. This regularity condition is valid for most statistical models of interest in practice, including the one-parameter Exponential family of distributions [Normal, Student's t, Pareto, Gamma, Beta, Binomial, Negative Binomial, Poisson, etc.].

[3] Convex alternative. The parameter space under H1, say Θ1, is convex [contiguous], i.e. for any two values (μ1, μ2)∈Θ1, their convex combinations λμ1 + (1−λ)μ2 ∈ Θ1 for any 0 ≤ λ ≤ 1. When convexity does not hold, as with the two-sided alternative:

(vi) (2-s): H0: μ = μ0 vs. H1: μ ≠ μ0,

the test T_α := {d(X), C1(α)}, C1(α) = {x: |d(x)| > c_{α/2}}, is α-UMPU (Unbiased); the α-level and p-value are:

α = P(|d(X)| > c_{α/2}; μ = μ0),  q(x0) = P(|d(X)| > |d(x0)|; μ = μ0).
7 Appendix B: Examples based on Jeffreys prior

For the simple Bernoulli model, consider selecting Jeffreys invariant prior:

π(θ) = (1/B(0.5, 0.5)) θ^{−0.5}(1−θ)^{−0.5},  θ∈[0, 1].

This gives rise to a posterior distribution of the form:

π(θ|x0) ~ Beta(nx̄n + 0.5, n(1−x̄n) + 0.5),  θ∈[0, 1].

■ (a) For y = 2, n = 20, the likelihood function is L(θ; x0) ∝ θ²(1−θ)^{18}, θ∈[0, 1], and the posterior density is π(θ|x0) ~ Beta(2.5, 18.5), θ∈[0, 1]. The Bayesian point estimates are:

θ̃ = 1.5/19 = 0.0789 (mode),  θ̂ = 2.5/21 = 0.119 (mean).

A 95% credible interval for θ is (0.0214 ≤ θ < 0.2803) = 0.95:

(1/B(2.5, 18.5)) ∫_{0.0214}^1 θ^{1.5}(1−θ)^{17.5} dθ = 0.975,  (1/B(2.5, 18.5)) ∫_{0.2803}^1 θ^{1.5}(1−θ)^{17.5} dθ = 0.025.

■ (b) For y = 18, n = 20, the likelihood function is L(θ; x0) ∝ θ^{18}(1−θ)², θ∈[0, 1], and the posterior density is π(θ|x0) ~ Beta(18.5, 2.5), θ∈[0, 1]. The Bayesian point estimates are:

θ̃ = 17.5/19 = 0.921 (mode),  θ̂ = 18.5/21 = 0.881 (mean).

A 95% credible interval for θ is (0.716 ≤ θ < 0.979) = 0.95:

(1/B(18.5, 2.5)) ∫_{0.716}^1 θ^{17.5}(1−θ)^{1.5} dθ = 0.975,  (1/B(18.5, 2.5)) ∫_{0.979}^1 θ^{17.5}(1−θ)^{1.5} dθ = 0.025.

■ (c) For y = 72, n = 80, the likelihood function is L(θ; x0) ∝ θ^{72}(1−θ)^8, θ∈[0, 1],
and the posterior density is π(θ|x0) ~ Beta(72.5, 8.5), θ∈[0, 1]. The Bayesian point estimates are:

θ̃ = 71.5/79 = 0.905 (mode),  θ̂ = 72.5/81 = 0.895 (mean).

A 95% credible interval for θ is (0.82 ≤ θ < 0.9515) = 0.95:

(1/B(72.5, 8.5)) ∫_{0.82}^1 θ^{71.5}(1−θ)^{7.5} dθ = 0.975,  (1/B(72.5, 8.5)) ∫_{0.9515}^1 θ^{71.5}(1−θ)^{7.5} dθ = 0.025.

■ (d) For y = 40, n = 80, the likelihood function is L(θ; x0) ∝ θ^{40}(1−θ)^{40}, θ∈[0, 1], and the posterior density is π(θ|x0) ~ Beta(40.5, 40.5), θ∈[0, 1]. The Bayesian point estimates are:

θ̃ = 39.5/79 = 0.5 (mode),  θ̂ = 40.5/81 = 0.5 (mean).

A 95% credible interval for θ is (0.3923 ≤ θ < 0.6077) = 0.95:

(1/B(40.5, 40.5)) ∫_{0.3923}^1 θ^{39.5}(1−θ)^{39.5} dθ = 0.975,  (1/B(40.5, 40.5)) ∫_{0.6077}^1 θ^{39.5}(1−θ)^{39.5} dθ = 0.025.

In view of the symmetry of the posterior distribution, even the asymptotic Normal credible interval should give a good approximation. Given that θ̂ = (nx̄n + α)/(n + α + β) = 0.5, the approximate credible interval is:

P( [0.5 − 1.96√(0.5(1−0.5))/√80] = 0.390 ≤ θ < 0.610 = [0.5 + 1.96√(0.5(1−0.5))/√80] | x0 ) = 1−α,

which provides a reasonably good approximation to the exact one.
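The Appendix B computations can be replicated with standard Beta-distribution routines. The following is a hedged sketch (my own illustration) for case (d), computing the posterior summaries, the exact equal-tail 95% credible interval and its asymptotic Normal approximation; the other cases follow by changing y and n.

```python
# Sketch reproducing Appendix B, case (d): Jeffreys prior, y = 40, n = 80,
# posterior Beta(40.5, 40.5), exact equal-tail 95% credible interval and
# its asymptotic Normal approximation.
import numpy as np
from scipy import stats

y, n = 40, 80
a_star, b_star = y + 0.5, (n - y) + 0.5

mode = (a_star - 1) / (a_star + b_star - 2)          # posterior mode ~ 0.5
mean = a_star / (a_star + b_star)                    # posterior mean ~ 0.5

exact = stats.beta.ppf([0.025, 0.975], a_star, b_star)
approx = mean + np.array([-1, 1]) * 1.96 * np.sqrt(mean * (1 - mean) / n)

print(mode, mean)
print(exact)      # exact interval ~ (0.392, 0.608)
print(approx)     # Normal approximation ~ (0.390, 0.610)
```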