JSM 2011 round table

Transcript

  • 1. Uncertainties within some Bayesian concepts: Examples from classnotes
    Christian P. Robert, Université Paris-Dauphine, IUF, and CREST-INSEE
    http://www.ceremade.dauphine.fr/~xian
    July 31, 2011
  • 2. Outline
    "Anyone not shocked by the Bayesian theory of inference has not understood it." — S. Senn, Bayesian Analysis, 2008
    1. Testing
    2. Fully specified models?
    3. Model choice
  • 3. Add: Call for vignettes
    Kerrie Mengersen and I are collecting proposals towards a collection of vignettes on the theme "When is Bayesian analysis really successful?", celebrating notable achievements of Bayesian analysis. [deadline: September 30]
  • 4. Bayes factors
    "The Jeffreys-subjective synthesis betrays a much more dangerous confusion than the Neyman-Pearson-Fisher synthesis as regards hypothesis tests." — S. Senn, BA, 2008
    Definition (Bayes factors): when testing H0: θ ∈ Θ0 vs. Ha: θ ∉ Θ0, use
      B01 = [π(Θ0|x) / π(Θ0^c|x)] / [π(Θ0) / π(Θ0^c)] = ∫_{Θ0} f(x|θ) π0(θ) dθ / ∫_{Θ0^c} f(x|θ) π1(θ) dθ
    [Good, 1958 & Jeffreys, 1939]
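To make the definition concrete, here is a minimal numerical sketch, for the one-sided problem H0: θ ≤ 0 vs. Ha: θ > 0 with x ~ N(θ, 1); the N(0, 1) prior and the observed value x = 1.5 are illustrative choices of mine, not taken from the slides.

```python
# Numerical Bayes factor for H0: theta <= 0 vs Ha: theta > 0, x ~ N(theta, 1).
# The N(0,1) prior and x_obs = 1.5 are illustrative assumptions.
import numpy as np
from scipy import stats
from scipy.integrate import quad

x_obs = 1.5
prior = stats.norm(0, 1)   # overall prior, to be restricted to Theta_0 and its complement

def integrand(theta):
    # f(x | theta) * pi(theta)
    return stats.norm.pdf(x_obs, loc=theta) * prior.pdf(theta)

num, _ = quad(integrand, -np.inf, 0.0)   # integral over Theta_0 = (-inf, 0]
den, _ = quad(integrand, 0.0, np.inf)    # integral over the complement
prior_odds = prior.cdf(0.0) / (1.0 - prior.cdf(0.0))

# B01 = posterior odds / prior odds
#     = ratio of integrals against the renormalised restricted priors pi_0 and pi_1
B01 = (num / den) / prior_odds
print("B01 =", round(B01, 3))
```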
  • 5–8. Self-contained concept
    Derived from the 0–1 loss and the Bayes rule: acceptance if
      B01 > {(1 − π(Θ0))/a1} / {π(Θ0)/a0}
    – but used outside any decision-theoretic environment
    – eliminates the choice of π(Θ0)
    – but still depends on the choice of (π0, π1)
    Jeffreys' [arbitrary] scale of evidence:
    – if log10(B10) is between 0 and 0.5, evidence against H0 is weak,
    – if log10(B10) is between 0.5 and 1, evidence is substantial,
    – if log10(B10) is between 1 and 2, evidence is strong, and
    – if log10(B10) is above 2, evidence is decisive
    Convergent if used with proper statistics.
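A throwaway helper encoding the verbal scale above (the band labels follow the slide; the "supports H0" case for B10 < 1 and the function itself are my additions):

```python
import math

def jeffreys_scale(B10: float) -> str:
    """Map a Bayes factor B10 (evidence against H0) onto Jeffreys' verbal scale."""
    b = math.log10(B10)
    if b < 0:
        return "supports H0"
    if b <= 0.5:
        return "weak evidence against H0"
    if b <= 1:
        return "substantial evidence against H0"
    if b <= 2:
        return "strong evidence against H0"
    return "decisive evidence against H0"

print(jeffreys_scale(30.0))   # log10(30) ~ 1.48 -> "strong evidence against H0"
```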
  • 9–10. Difficulties with ABC Bayes factors
    "This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model." — S. Sisson, Jan. 31, 2011, X.'Og
    In the Poisson versus geometric case, if E[yi] = θ0 > 0,
      lim_{n→∞} B12^η(y) = (θ0 + 1)^2 e^{−θ0} / θ0
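A small simulation sketch of the phenomenon: ABC model choice between Poisson and geometric samples using the sample mean as (insufficient) summary statistic. The priors (Exp(1) on the Poisson mean, U(0,1) on the geometric probability), the tolerance and the sample sizes are illustrative choices of mine, not the slides' exact setup.

```python
# ABC model choice with the sample mean as summary statistic:
# model 1 = Poisson(lambda), lambda ~ Exp(1); model 2 = geometric(p) on {0,1,...}, p ~ U(0,1).
import numpy as np

rng = np.random.default_rng(0)
n = 100
y_obs = rng.poisson(2.0, size=n)     # pseudo-observed data, actually from model 1
s_obs = y_obs.mean()

N, eps, accepted = 100_000, 0.05, []
for _ in range(N):
    m = rng.integers(1, 3)           # model index, equal prior weights
    if m == 1:
        y = rng.poisson(rng.exponential(1.0), size=n)
    else:
        y = rng.geometric(rng.uniform(), size=n) - 1   # shift to support {0, 1, 2, ...}
    if abs(y.mean() - s_obs) < eps:
        accepted.append(m)

accepted = np.array(accepted)
p1 = (accepted == 1).mean()
print("accepted draws:", accepted.size)
print("ABC estimate of P(M1 | s_obs) =", round(float(p1), 3),
      " ABC Bayes factor B12 =", round(float(p1 / (1 - p1)), 2))
```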
  • 11. Difficulties with ABC Bayes factors
    Laplace vs. Normal models: comparing a sample x1, . . . , xn from the Laplace (double-exponential) L(µ, 1/√2) distribution, with density
      f(x|µ) = (1/√2) exp{−√2 |x − µ|},
    or from the Normal N(µ, 1) distribution.
  • 12. Difficulties with ABC Bayes factors
    Empirical mean, median and variance have the same expectation under both models: useless!
  • 13. Difficulties with ABC Bayes factors
    Median absolute deviation: priceless!
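A quick check of both claims (my own illustration, taking µ = 0): mean, median and empirical variance are indistinguishable in expectation across the two models, while the median absolute deviation separates them.

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu = 100_000, 0.0
samples = {
    "Normal N(0,1)":         rng.normal(mu, 1.0, size=n),
    "Laplace L(0, 1/sqrt2)": rng.laplace(mu, 1.0 / np.sqrt(2), size=n),  # variance 2*b^2 = 1
}

def mad(x):
    """Median absolute deviation about the median."""
    return np.median(np.abs(x - np.median(x)))

for name, x in samples.items():
    print(f"{name}: mean={x.mean():+.3f} median={np.median(x):+.3f} "
          f"var={x.var():.3f} MAD={mad(x):.3f}")
# Expected: means/medians ~ 0 and variances ~ 1 for both samples,
# but MAD ~ 0.674 for the Normal vs ~ 0.490 for the Laplace.
```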
  • 14–15. Point null hypotheses
    "I have no patience for statistical methods that assign positive probability to point hypotheses of the θ = 0 type that can never actually be true." — A. Gelman, BA, 2008
    Particular case H0: θ = θ0. Take ρ0 = Pr^π(θ = θ0) and g1 the prior density under Ha.
    Posterior probability of H0:
      π(Θ0|x) = f(x|θ0) ρ0 / ∫ f(x|θ) π(θ) dθ = f(x|θ0) ρ0 / [f(x|θ0) ρ0 + (1 − ρ0) m1(x)]
    with marginal under Ha
      m1(x) = ∫_{Θ1} f(x|θ) g1(θ) dθ.
  • 16. Point null hypotheses (cont'd)
    Example (Normal mean): test of H0: θ = 0 when x ∼ N(θ, 1); taking π1 as N(0, τ^2), then
      π(θ = 0|x) = [1 + ((1 − ρ0)/ρ0) √(σ^2/(σ^2 + τ^2)) exp{τ^2 x^2 / 2σ^2(σ^2 + τ^2)}]^(−1)
    Influence of τ:
      τ \ x |  0      0.68   1.28   1.96
      1     |  0.586  0.557  0.484  0.351
      10    |  0.768  0.729  0.612  0.366
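The formula is easy to check numerically. With ρ0 = 1/2 and σ = 1 (assumptions on my part), τ = 1 reproduces the first row of the table exactly, while the second row is matched by τ^2 = 10 rather than τ = 10:

```python
import numpy as np

def post_prob_null(x, tau, rho0=0.5, sigma=1.0):
    """pi(theta = 0 | x) for x ~ N(theta, sigma^2), H1 prior N(0, tau^2), P(H0) = rho0."""
    s2, t2 = sigma**2, tau**2
    bf_against = np.sqrt(s2 / (s2 + t2)) * np.exp(t2 * x**2 / (2 * s2 * (s2 + t2)))
    return 1.0 / (1.0 + (1 - rho0) / rho0 * bf_against)

xs = np.array([0.0, 0.68, 1.28, 1.96])
for tau2 in (1.0, 10.0):
    print(f"tau^2 = {tau2:g}:", post_prob_null(xs, np.sqrt(tau2)).round(3))
# tau^2 = 1  -> [0.586 0.557 0.484 0.351]
# tau^2 = 10 -> [0.768 0.729 0.612 0.366]
```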
  • 17–18. A fundamental difficulty
    Improper priors are not allowed in this setting: if
      ∫_{Θ1} π1(dθ1) = ∞  or  ∫_{Θ2} π2(dθ2) = ∞
    then either π1 or π2 cannot be coherently normalised, but the normalisation matters in the Bayes factor.
  • 19–20. Jeffreys unaware of the problem??
    Example of testing for a zero normal mean:
    "If σ is the standard error and λ the true value, λ is 0 on q. We want a suitable form for its prior on q′. (...) Then we should take
      P(q dσ|H) ∝ dσ/σ
      P(q′ dσ dλ|H) ∝ f(λ/σ) dσ/σ dλ/λ
    where f [is a true density]." (ToP, V, §5.2)
    Unavoidable fallacy of the "same" σ?!
  • 21. Puzzling alternatives
    When taking two normal samples x11, . . . , x1n1 and x21, . . . , x2n2 with means λ1 and λ2 and the same standard deviation σ, testing H0: λ1 = λ2 gets otherworldly:
    "...we are really considering four hypotheses, not two as in the test for agreement of a location parameter with zero; for neither may be disturbed, or either, or both may."
    ToP then uses parameters (λ, σ) in all versions of the alternative hypotheses, with
      π0(λ, σ) ∝ 1/σ
      π1(λ, σ, λ1) ∝ 1/[π{σ^2 + (λ1 − λ)^2}]
      π2(λ, σ, λ2) ∝ 1/[π{σ^2 + (λ2 − λ)^2}]
      π12(λ, σ, λ1, λ2) ∝ σ/[π^2 {σ^2 + (λ1 − λ)^2}{σ^2 + (λ2 − λ)^2}]
  • 22. Puzzling alternatives
    ToP misses the points that
    1. λ does not have the same meaning under q, under q1 (where λ = λ2) and under q2 (where λ = λ1),
    2. λ has no precise meaning under q12 [hyperparameter?]: "On q12, since λ does not appear explicitly in the likelihood we can integrate it" (V, §5.41),
    3. even σ has a varying meaning over hypotheses,
    4. integrating over measures,
         P(q12 dσ dλ1 dλ2|H) ∝ (2/π) dσ dλ1 dλ2 / {4σ^2 + (λ1 − λ2)^2},
       simply defines a new improper prior...
  • 23–25. Addiction to models
    One potential difficulty with Bayesian analysis is its ultimate dependence on the specification of the model(s):
      π(θ|x) ∝ π(θ) f(x|θ)
    While Bayesian analysis allows for model variability, pruning, improvement, comparison, embedding, &tc., there is always a basic reliance [or at least conditioning] on the "truth" of an overall model. This may sound paradoxical given the many tools offered by Bayesian analysis; however, the method is blind once "out of the model", in the sense that it cannot assess the validity of a model without embedding this model inside another model.
  • 26–27. ABCµ: multiple errors [figures, © Ratmann et al., PNAS, 2009]
  • 28. No proper goodness-of-fit test
    "There is not the slightest use in rejecting any hypothesis unless we can do it in favor of some definite alternative that better fits the facts." — E.T. Jaynes, Probability Theory
    While the setting H0: M = M0 versus Ha: M ≠ M0 is rather artificial, there is no satisfactory way of answering the question.
  • 29–30. An approximate goodness-of-fit test
    Testing H0: M = Mθ versus Ha: M ≠ Mθ is rephrased as
      H0: min_θ d(Fθ, U(0,1)) = 0  versus  Ha: min_θ d(Fθ, U(0,1)) > 0
    and further as
      H0: Fθ(x) ∼ U(0, 1)  versus  Ha: Fθ(x) ∼ p0 U(0, 1) + (1 − p0) Σ_{i=1}^k ωi Be(αi εi, αi(1 − εi))
    with
      (αi, εi) ∼ [1 − exp{−(αi − 2)^2 − (εi − .5)^2}] × exp[−1/(αi^2 εi(1 − εi)) − 0.2 αi^2/2]
    [Verdinelli and Wasserman, 1998; Rousseau and Robert, 2001]
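Below is a rough, fixed-parameter illustration of the rephrasing (not the actual Rousseau-Robert procedure, which places a prior on the mixture parameters): the probability integral transforms Fθ(xi) of data generated from a t(3) distribution are computed under a wrong N(0, 1) null, and a Beta-mixture alternative with hand-picked (p0, ωi, αi, εi) is compared against the uniform.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.standard_t(df=3, size=500)     # data actually from a t(3) distribution
u = stats.norm.cdf(x)                  # PIT values under the (wrong) N(0,1) null

def log_alt(u, p0=0.5, w=(0.5, 0.5), alpha=(4.0, 4.0), eps=(0.2, 0.8)):
    """log-density of p0*U(0,1) + (1-p0)*sum_i w_i Be(alpha_i*eps_i, alpha_i*(1-eps_i))."""
    dens = p0 * np.ones_like(u)
    for wi, ai, ei in zip(w, alpha, eps):
        dens += (1 - p0) * wi * stats.beta.pdf(u, ai * ei, ai * (1 - ei))
    return float(np.log(dens).sum())

# The log-likelihood of the PIT values under the U(0,1) null is exactly 0,
# so this is a (fixed-parameter) log-likelihood ratio favouring Ha when positive.
print("log LR (Beta mixture vs. uniform):", round(log_alt(u), 2))
```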
  • 31. Robustness
    Models only partly defined through moment conditions
      Eθ[hi(x)] = Hi(θ),  i = 1, . . .
    i.e., no complete construction of the underlying model.
    Example (white noise in AR): the relation xt = ρ xt−1 + σ εt often makes no assumption on εt besides its first two moments...
    How can we run Bayesian analysis in such settings? Should we? [Lazar, 2005; Cornuet et al., 2011, in prep.]
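One pragmatic (and debatable) answer, sketched below for the AR example: replace the likelihood by a Gaussian quasi-likelihood on the empirical moment discrepancies and run a random-walk Metropolis sampler on (ρ, σ). This is a GMM-flavoured pseudo-posterior of my own, not the Lazar (2005) empirical-likelihood route nor the Cornuet et al. proposal; the noise distribution, tolerance scale and tuning constants are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
T, rho_true, sigma_true = 500, 0.6, 1.0
eps = rng.standard_t(df=5, size=T) * np.sqrt(3 / 5)   # heavy-tailed noise, rescaled to unit variance
x = np.zeros(T)
for t in range(1, T):
    x[t] = rho_true * x[t - 1] + sigma_true * eps[t]

def log_pseudo_post(rho, sigma, scale=0.05):
    """Flat prior on (-1,1) x (0,inf) times a Gaussian quasi-likelihood on two moment conditions."""
    if not (-1 < rho < 1 and sigma > 0):
        return -np.inf
    resid = x[1:] - rho * x[:-1]
    g = np.array([resid.mean(), resid.var() - sigma**2])   # assumed moments: E[eps]=0, E[eps^2]=1
    return -0.5 * np.sum((g / scale) ** 2)

theta, chain = np.array([0.0, 2.0]), []
for _ in range(20_000):
    prop = theta + rng.normal(scale=0.05, size=2)
    if np.log(rng.uniform()) < log_pseudo_post(*prop) - log_pseudo_post(*theta):
        theta = prop
    chain.append(theta.copy())

chain = np.array(chain)[5_000:]
print("pseudo-posterior means (rho, sigma):", chain.mean(axis=0).round(2))   # should sit near (0.6, 1.0)
```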
  • 32–33. [back to] Bayesian model choice
    "Having a high relative probability does not mean that a hypothesis is true or supported by the data." — A. Templeton, Mol. Ecol., 2009
    The formal Bayesian approach puts probabilities over the entire model/parameter space. This means:
    – allocating probabilities pi to all models Mi,
    – defining priors πi(θi) for each parameter space Θi,
    – picking the largest p(Mi|x) to determine the "best" model.
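For a case where everything is available in closed form, the sketch below computes p(M1 | x) exactly for the earlier Poisson-versus-geometric example, under illustrative conjugate priors (λ ~ Exp(1), p ~ U(0, 1)) and equal prior model weights.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(4)
y = rng.poisson(2.0, size=50)
n, S = y.size, int(y.sum())

# log marginal likelihoods
log_m1 = -gammaln(y + 1).sum() + gammaln(S + 1) - (S + 1) * np.log(n + 1)  # Poisson, lambda ~ Exp(1)
log_m2 = gammaln(n + 1) + gammaln(S + 1) - gammaln(n + S + 2)              # geometric on {0,1,...}, p ~ U(0,1)

log_B12 = log_m1 - log_m2
post_M1 = 1.0 / (1.0 + np.exp(-log_B12))      # p(M1 | y) with prior weights p_1 = p_2 = 1/2
print("log B12 =", round(float(log_B12), 2), "  p(M1 | y) =", round(float(post_M1), 4))
```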
  • 34. Several types of problems
    Concentrating on the selection perspective:
    – how to integrate the loss function/decision/consequences,
    – representation of parsimony/sparsity (Occam's rule),
    – how to fight overfitting for nested models.
  • 35. Several types of problems
    "Incoherent methods, such as ABC, Bayes factor, or any simulation approach that treats all hypotheses as mutually exclusive, should never be used with logically overlapping hypotheses." — A. Templeton, PNAS, 2010
    Choice of prior structures:
    – adequate weights pi: if M1 = M2 ∪ M3, should p(M1) = p(M2) + p(M3)?
    – prior distributions πi(·) defined for every i ∈ I,
    – πi(·) proper (Jeffreys),
    – πi(·) coherent (?) for nested models,
    – prior modelling inflation.
  • 36–37. Compatibility principle
    Difficulty of finding priors simultaneously on a collection of models Mi (i ∈ I).
    Easier to start from a single prior on a "big" model and to derive the others from a coherence principle. [Dawid & Lauritzen, 2000]
  • 38–39. Projection approach
    For M2 a submodel of M1, π2 can be derived as the distribution of θ2⊥(θ1) when θ1 ∼ π1(θ1) and θ2⊥(θ1) is a projection of θ1 on M2, e.g.
      d(f(·|θ1), f(·|θ1⊥)) = inf_{θ2 ∈ Θ2} d(f(·|θ1), f(·|θ2)),
    where d is a divergence measure. [McCulloch & Rossi, 1992]
    Or we can look instead at the posterior distribution of d(f(·|θ1), f(·|θ1⊥)). [Goutis & Robert, 1998; Dupuis & Robert, 2001]
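A toy instance of the projection idea (my own example, not McCulloch & Rossi's): project M1 = {N(µ, σ²)} onto the zero-mean submodel M2 = {N(0, τ²)} by minimising the Kullback-Leibler divergence, which here has the closed-form solution τ² = µ² + σ², so the induced π2 is just the push-forward of π1 through that map.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100_000
mu = rng.normal(0.0, 1.0, size=N)             # pi_1: mu ~ N(0, 1) (illustrative choice)
sigma2 = 1.0 / rng.gamma(2.0, 1.0, size=N)    #        sigma^2 ~ inverse-gamma(2, 1) (illustrative choice)

# KL( N(mu, sigma^2) || N(0, tau^2) ) is minimised at tau^2 = mu^2 + sigma^2
tau2 = mu**2 + sigma2                         # draws from the induced prior pi_2 on M2

print("induced prior on tau^2: mean =", round(float(tau2.mean()), 2),
      " median =", round(float(np.median(tau2)), 2))
```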
  • 40–42. Kullback proximity
    Alternative projection to the above.
    Definition (compatible prior): given a prior π1 on a model M1 and a submodel M2, a prior π2 on M2 is compatible with π1 when it achieves the minimum Kullback divergence between the corresponding marginals m1(x; π1) = ∫_{Θ1} f1(x|θ) π1(θ) dθ and m2(x; π2) = ∫_{Θ2} f2(x|θ) π2(θ) dθ, i.e.
      π2 = arg min_{π2} ∫ log[ m1(x; π1) / m2(x; π2) ] m1(x; π1) dx
  • 43. Difficulties
    "Further complicating dimensionality of test statistics is the fact that the models are often not nested, and one model may contain parameters that do not have analogues in the other models and vice versa." — A. Templeton, Mol. Ecol., 2009
    – Does not give a working principle when M2 is not a submodel of M1. [Pérez & Berger, 2000; Cano, Salmerón & Robert, 2006]
    – Depends on the choice of π1.
    – Prohibits the use of improper priors.
    – Worse: useless in unconstrained settings...
  • 44. A side remark: Zellner's g
    Use of Zellner's g-prior in linear regression, i.e. a normal prior for β conditional on σ^2,
      β|σ^2 ∼ N(β̃, g σ^2 (X^T X)^(−1)),
    and a Jeffreys prior for σ^2, π(σ^2) ∝ σ^(−2).
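For concreteness, a minimal sketch of a single draw from this prior on a toy design (the design matrix, g = 100 and β̃ = 0 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, g, sigma2 = 30, 3, 100.0, 1.0
X = rng.normal(size=(n, p))
beta_tilde = np.zeros(p)

cov = g * sigma2 * np.linalg.inv(X.T @ X)          # g * sigma^2 * (X'X)^{-1}
beta_draw = rng.multivariate_normal(beta_tilde, cov)
print("one draw of beta | sigma^2:", beta_draw.round(2))
```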
  • 45–46. Variable selection
    For the hierarchical parameter γ, we use
      π(γ) = Π_{i=1}^p τi^γi (1 − τi)^(1−γi),
    where τi corresponds to the prior probability that variable i is present in the model (and a priori independence between the presence/absence of variables).
    Typically (?), when no prior information is available, τ1 = . . . = τp = 1/2, i.e. a uniform prior π(γ) = 2^(−p).
  • 47–49. Influence of g
    Taking β̃ = 0_{p+1} and g large does not work.
    Consider the 10-predictor full model
      y|β, σ^2 ∼ N( β0 + Σ_{i=1}^3 βi xi + Σ_{i=1}^3 βi+3 xi^2 + β7 x1x2 + β8 x1x3 + β9 x2x3 + β10 x1x2x3 , σ^2 In )
    where the xi's are iid U(0, 10). [Casella & Moreno, 2004]
    True model: two predictors x1 and x2, i.e. γ* = 110...0, (β0, β1, β2) = (5, 1, 3), and σ^2 = 4.
  • 50. Influence of g
    Top models (indexed by included predictors, t1(γ)) and their posterior probabilities as g varies:
      t1(γ)      g = 10    g = 100   g = 10^3   g = 10^4   g = 10^6
      0,1,2      0.04062   0.35368   0.65858    0.85895    0.98222
      0,1,2,7    0.01326   0.06142   0.08395    0.04434    0.00524
      0,1,2,4    0.01299   0.05310   0.05805    0.02868    0.00336
      0,2,4      0.02927   0.03962   0.00409    0.00246    0.00254
      0,1,2,8    0.01240   0.03833   0.01100    0.00126    0.00126
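The sensitivity can be reproduced on a smaller scale with the standard null-based g-prior Bayes factor BF(Mγ : null) = (1 + g)^((n-1-kγ)/2) [1 + g(1 - R²γ)]^(-(n-1)/2) (centred covariates, β̃ = 0, uniform prior over models). The design below, with five candidate predictors and two truly active ones, is a simplification of mine, not the Casella & Moreno setup behind the table.

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 5
X = rng.uniform(0, 10, size=(n, p))
y = 5 + 1 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 2, size=n)   # true predictors: 0 and 1
Xc, yc = X - X.mean(axis=0), y - y.mean()

def r_squared(cols):
    if not cols:
        return 0.0
    Z = Xc[:, cols]
    beta, *_ = np.linalg.lstsq(Z, yc, rcond=None)
    return 1 - ((yc - Z @ beta) ** 2).sum() / (yc @ yc)

def log_bf_null(g, k, r2):
    """log Bayes factor of a k-predictor model against the intercept-only model."""
    return 0.5 * (n - 1 - k) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1 - r2))

models = [c for k in range(p + 1) for c in itertools.combinations(range(p), k)]
for g in (10, 100, 1e4, 1e6):
    log_bf = np.array([log_bf_null(g, len(m), r_squared(list(m))) for m in models])
    post = np.exp(log_bf - log_bf.max())
    post /= post.sum()                         # uniform prior over the 2^p models
    top = np.argsort(post)[::-1][:3]
    print(f"g = {g:g}:", [(models[i], round(float(post[i]), 3)) for i in top])
```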
  • 51. Case for a noninformative hierarchical solution
    Use the same compatible informative g-prior distribution with β̃ = 0_{p+1} and a hierarchical diffuse prior distribution on g, e.g.
      π(g) ∝ g^(−1) 𝕀_{ℕ*}(g)
    [Liang et al., 2007; Marin & Robert, 2007; Celeux et al., ca. 2011]
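A back-of-the-envelope version of this fix, for a single model summary: average the null-based g-prior Bayes factor over the diffuse prior π(g) ∝ 1/g on g ∈ {1, ..., c}. The truncation c and the n, k, R² values below are illustrative; this is only a schematic stand-in for the full hierarchical analysis.

```python
import numpy as np

def bf_null(g, n, k, r2):
    """Null-based g-prior Bayes factor of a k-predictor model vs. the intercept-only model."""
    return (1 + g) ** ((n - 1 - k) / 2) * (1 + g * (1 - r2)) ** (-(n - 1) / 2)

n, k, r2, c = 100, 2, 0.6, 10_000
g = np.arange(1, c + 1, dtype=float)
w = (1 / g) / np.sum(1 / g)                    # normalised pi(g) proportional to 1/g on {1, ..., c}

print("log10 BF at fixed g = 100   :", round(float(np.log10(bf_null(100.0, n, k, r2))), 2))
print("log10 BF averaged over pi(g):", round(float(np.log10(np.sum(w * bf_null(g, n, k, r2)))), 2))
```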
  • 52–53. Occam's razor
    Pluralitas non est ponenda sine necessitate.
    "Variation is random until the contrary is shown; and new parameters in laws, when they are suggested, must be tested one at a time, unless there is specific reason to the contrary." — H. Jeffreys, ToP, 1939
    There is no well-accepted implementation behind the principle... besides the fact that the Bayes factor naturally penalises larger models.