This document discusses Bayesian nonparametric posterior concentration rates under different loss functions.
1. It provides an overview of posterior concentration, the insight it gives into priors and inference, and how minimax rates over smoothness classes are used to benchmark concentration rates.
2. The proof technique involves constructing tests and relating Kullback-Leibler-type neighbourhoods to the loss function. Examples where nice results exist include density estimation, regression, and the white noise model.
3. For the white noise model, truncation priors attain the minimax L2 rate (adaptively, up to a log factor) and the minimax pointwise rate (non-adaptively), but for sup-norm loss existing test-based results only achieve a suboptimal rate. The document explores how to obtain better adaptation for sup-norm loss.
1. On adaptation for the posterior distribution under local and sup-norm losses
Judith Rousseau, Marc Hoffmann and Johannes Schmidt-Hieber
ENSAE - CREST and CEREMADE, Université Paris-Dauphine
January
2. Outline
1 Bayesian nonparametrics: posterior concentration
   Generalities
   Idea of the proof
   Examples of loss functions where things become less nice
2 Bayesian upper and lower bounds
   Lower bound
   The case of $\ell_\infty$ and adaptation
3 Links with confidence bands
4. Generalities
Model: $Y_1^n \mid \theta \sim p_\theta^n$ (density w.r.t. $\mu$), $\theta \in \Theta$
A priori: $\theta \sim \Pi$, the prior distribution
$\longrightarrow$ posterior distribution
$$d\Pi(\theta \mid Y_1^n) = \frac{p_\theta^n(Y_1^n)\, d\Pi(\theta)}{m(Y_1^n)}, \qquad Y_1^n = (Y_1, \dots, Y_n)$$
Posterior concentration: $d(\cdot,\cdot)$ = loss on $\Theta$ and $\theta_0 \in \Theta$ = true parameter,
$$E_{\theta_0}\big(\Pi[U_{\epsilon_n} \mid Y_1^n]\big) = 1 + o(1), \qquad U_{\epsilon_n} = \{\theta;\ d(\theta, \theta_0) \le \epsilon_n\}, \quad \epsilon_n \downarrow 0$$
Why should we care?
• Gives insight into some aspects of the prior
• Gives some insight into inference: interpretation of posterior credible regions (loosely)
• Helps understand the links between frequentist and Bayesian approaches
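To make the definition concrete, here is a minimal numerical sketch (illustrative, not from the slides): a scalar Gaussian model on a discretized parameter grid, and the posterior mass $\Pi[U_\epsilon \mid Y]$ of a shrinking neighbourhood of $\theta_0$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: theta is a scalar, Y_1..Y_n ~ N(theta, 1), prior = N(0, 1) on a grid.
theta0 = 0.3
n = 500
y = theta0 + rng.standard_normal(n)

grid = np.linspace(-3, 3, 2001)                       # discretized Theta
log_prior = -0.5 * grid**2                            # N(0, 1) prior, up to a constant
log_lik = -0.5 * ((y[:, None] - grid[None, :])**2).sum(axis=0)
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()                                    # normalized posterior on the grid

# Posterior mass of U_eps = {theta : |theta - theta0| <= eps} for shrinking eps
for eps in [0.5, 0.2, 0.1, 0.05]:
    mass = post[np.abs(grid - theta0) <= eps].sum()
    print(f"eps = {eps:4.2f}  Pi[U_eps | Y] = {mass:.3f}")
```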
5. Minimax concentration rates on a class $\Theta_\alpha(L)$
$$\sup_{\theta_0 \in \Theta_\alpha(L)} E_{\theta_0}\,\Pi\big[U^c_{M\epsilon_n(\alpha)} \mid Y_1^n\big] = o(1),$$
where $\epsilon_n(\alpha)$ = minimax rate under $d(\cdot,\cdot)$ over $\Theta_\alpha(L)$.
6. Examples of models and losses for which nice results exist
Density estimation: $Y_i \sim p_\theta$ i.i.d.
$$d(p_\theta, p_{\theta'})^2 = \int \big(\sqrt{p_\theta} - \sqrt{p_{\theta'}}\big)^2(x)\,dx, \qquad d(p_\theta, p_{\theta'}) = \int |p_\theta - p_{\theta'}|(x)\,dx$$
Regression function: $Y_i = f(x_i) + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$, $\theta = (f, \sigma)$
$$d(p_\theta, p_{\theta'}) = \|f - f'\|_2, \qquad d(p_\theta, p_{\theta'})^2 = n^{-1}\sum_{i=1}^n H^2\big(p_\theta(y \mid X_i), p_{\theta'}(y \mid X_i)\big), \quad H = \text{Hellinger}$$
White noise:
$$dY(t) = f(t)\,dt + n^{-1/2}\,dW(t) \ \Longleftrightarrow\ Y_i = \theta_i + n^{-1/2}\epsilon_i, \quad i \in \mathbb{N}, \qquad d(p_\theta, p_{\theta'}) = \|f - f'\|_2$$
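A small illustrative computation of the distances above for two fixed densities (a hypothetical example, not from the slides): squared Hellinger, L1 and L2 distances evaluated on a grid.

```python
import numpy as np

# Two illustrative densities on a grid: N(0, 1) and N(0.3, 1.2^2)
x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p, q = normal_pdf(x, 0.0, 1.0), normal_pdf(x, 0.3, 1.2)

hellinger_sq = np.trapz((np.sqrt(p) - np.sqrt(q)) ** 2, dx=dx)   # squared Hellinger distance
l1 = np.trapz(np.abs(p - q), dx=dx)                              # L1 distance
l2 = np.sqrt(np.trapz((p - q) ** 2, dx=dx))                      # L2 distance

print(f"H^2 = {hellinger_sq:.4f}, L1 = {l1:.4f}, L2 = {l2:.4f}")
```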
7. Examples: functional classes
$\Theta_\alpha(L)$ = Hölder class $H(\alpha, L)$; $\epsilon_n(\alpha) = n^{-\alpha/(2\alpha+1)}$ = minimax rate over $H(\alpha, L)$.
Density example, Hellinger loss. Prior = DPM (Dirichlet process mixture):
$$f(x) = f_{P,\sigma}(x) = \int \phi_\sigma(x - \mu)\,dP(\mu), \qquad \sigma \sim I\Gamma(a, b), \quad P \sim DP(A, G_0)$$
$$\sup_{f_0 \in \Theta_\alpha(L)} E_{f_0}\,\Pi\big[U^c_{M(n/\log n)^{-\alpha/(2\alpha+1)}}(f_0) \mid Y_1^n\big] = o(1),$$
$U_\epsilon(f_0) = \{f;\ h(f_0, f) \le \epsilon\}$ [is the $\log n$ term necessary?]
$$\Rightarrow\ E_{f_0}\big[h(\hat f, f_0)^2\big] \lesssim (n/\log n)^{-2\alpha/(2\alpha+1)}, \qquad \hat f(x) = E^\pi[f(x) \mid Y^n]$$
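For intuition about the DPM prior, here is a hedged sketch of one draw of $f_{P,\sigma}$ via a truncated stick-breaking representation of $DP(A, G_0)$; the base measure $G_0$ and all hyperparameter values are illustrative, not the ones used in the cited results.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_dpm_density(A=1.0, a=2.0, b=1.0, n_atoms=200):
    """One draw f_{P,sigma} from a DP(A, G0) mixture of Gaussians,
    using a truncated stick-breaking representation (illustrative values)."""
    v = rng.beta(1.0, A, size=n_atoms)                            # stick-breaking proportions
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))     # mixture weights
    mu = rng.standard_normal(n_atoms)                             # atoms drawn from G0 = N(0, 1)
    sigma = 1.0 / rng.gamma(a, 1.0 / b)                           # sigma ~ inverse-Gamma(a, b)
    def f(x):
        x = np.asarray(x)[..., None]
        return (w * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
                / (sigma * np.sqrt(2 * np.pi))).sum(axis=-1)
    return f

f = draw_dpm_density()
x = np.linspace(-4, 4, 9)
print(np.round(f(x), 3))   # values of one random density draw on a coarse grid
```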
9. Outline of the proof: tests and KL
$U_n = U_{M(n/\log n)^{-\alpha/(2\alpha+1)}}$, $\ell_n(\theta) = \log p_\theta^n(Y_1^n)$ and $\bar\epsilon_n = (n/\log n)^{-\alpha/(2\alpha+1)}$.
$$\Pi[U_n^c \mid Y_1^n] = \frac{\int_{U_n^c} e^{\ell_n(\theta) - \ell_n(\theta_0)}\,d\Pi(\theta)}{\int_{\Theta} e^{\ell_n(\theta) - \ell_n(\theta_0)}\,d\Pi(\theta)} := \frac{N_n}{D_n}$$
Tests $\phi_n = \phi_n(Y_1^n) \in [0, 1]$:
$$E_{\theta_0}\big(\Pi[U_n^c \mid Y_1^n]\big) \le E_{\theta_0}[\phi_n] + P_{\theta_0}\big(D_n < e^{-cn\bar\epsilon_n^2}\big) + e^{(c+\tau)n\bar\epsilon_n^2} \int_{U_n^c} E_\theta[1 - \phi_n]\,d\pi(\theta)$$
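The displayed bound follows from a standard splitting argument; spelled out (a reconstruction, not on the slide):
$$\begin{aligned}
E_{\theta_0}\big(\Pi[U_n^c \mid Y_1^n]\big)
 &\le E_{\theta_0}[\phi_n]
   + E_{\theta_0}\Big[(1-\phi_n)\,\Pi[U_n^c \mid Y_1^n]\,\mathbf{1}\{D_n < e^{-cn\bar\epsilon_n^2}\}\Big]
   + E_{\theta_0}\Big[(1-\phi_n)\,\tfrac{N_n}{D_n}\,\mathbf{1}\{D_n \ge e^{-cn\bar\epsilon_n^2}\}\Big] \\
 &\le E_{\theta_0}[\phi_n] + P_{\theta_0}\big(D_n < e^{-cn\bar\epsilon_n^2}\big)
   + e^{cn\bar\epsilon_n^2}\, E_{\theta_0}\big[(1-\phi_n)\,N_n\big],
\end{aligned}$$
and, by a change of measure (Fubini), $E_{\theta_0}[(1-\phi_n)\,N_n] = \int_{U_n^c} E_\theta[1-\phi_n]\,d\Pi(\theta)$. The extra factor $e^{\tau n\bar\epsilon_n^2}$ on the slide is additional slack.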
10. Constraints
Tests: we need
$$E_{\theta_0}[\phi_n] = o(1) \quad \& \quad \sup_{d(\theta, \theta_0) > M\bar\epsilon_n} E_\theta[1 - \phi_n] = o\big(e^{-cn\bar\epsilon_n^2}\big) \qquad \rightarrow \text{ depends on } d(\cdot,\cdot)$$
Denominator: we need $P_{\theta_0}\big(D_n < e^{-cn\bar\epsilon_n^2}\big) = o(1)$. Indeed,
$$D_n \ge \int_{S_n} e^{\ell_n(\theta) - \ell_n(\theta_0)}\,d\Pi(\theta) \ge e^{-2n\bar\epsilon_n^2}\,\Pi\big(S_n \cap \{\ell_n(\theta) - \ell_n(\theta_0) > -2n\bar\epsilon_n^2\}\big)$$
OK if $S_n = \{KL(p_{\theta_0}^n, p_\theta^n) \le n\bar\epsilon_n^2;\ V(p_{\theta_0}^n, p_\theta^n) \le n\bar\epsilon_n^2\}$ and
$$\Pi(S_n) \ge e^{-cn\bar\epsilon_n^2} \qquad \rightarrow \text{ links } d(\cdot,\cdot) \text{ with } KL(\cdot,\cdot)$$
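As a concrete instance (a standard computation, not spelled out on the slide): in the white noise model the KL neighbourhood $S_n$ is an $L_2$ ball, since
$$KL(p_{\theta_0}^n, p_\theta^n) = \frac{n}{2}\,\|\theta - \theta_0\|_2^2, \qquad V(p_{\theta_0}^n, p_\theta^n) = n\,\|\theta - \theta_0\|_2^2$$
(with $V$ the centered second moment of the log-likelihood ratio), so the condition $\Pi(S_n) \ge e^{-cn\bar\epsilon_n^2}$ becomes a condition on the prior mass of $L_2$ balls around $\theta_0$.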
12. White noise model and pointwise or sup-norm loss
White noise:
$$dY(t) = f(t)\,dt + n^{-1/2}\,dW(t) \ \Longleftrightarrow\ Y_{jk} = \theta_{jk} + n^{-1/2}\epsilon_{jk} \quad \text{(wavelet coefficients)}$$
Pointwise loss: $\ell_{x_0}(f, f_0) = (f(x_0) - f_0(x_0))^2$
Sup-norm loss: $\ell_\infty(f, f_0) = \sup_x |f(x) - f_0(x)|$
Random truncation prior:
$$J \sim P, \qquad \theta_{j,k} \sim g(\cdot)\ \forall k,\ \forall j \le J, \qquad \theta_{j,k} = 0\ \forall k,\ \forall j > J$$
L2 concentration:
$$\sup_{\alpha_1 \le \alpha \le \alpha_2}\ \sup_{f_0 \in H(\alpha, L)} E_{f_0}\,P^\pi\big(\|f - f_0\|_2 > M(n/\log n)^{-\alpha/(2\alpha+1)} \mid Y\big) = o(1)$$
$\ell_{x_0}$ concentration: $\forall\alpha\ \exists\epsilon > 0$,
$$\sup_{f_0 \in H(\alpha, L)} E_{f_0}\,P^\pi\big(\ell_{x_0}(f, f_0) > \epsilon\,(n/\log n)^{-2\alpha^2/(2\alpha+1)^2} \mid Y\big) = 1 + o(1)$$
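To see how the random truncation prior behaves, here is a hedged numerical sketch in the sequence model: a Gaussian slab $g$ gives a conjugate marginal likelihood per truncation level, and hence a tractable posterior over $J$. The truth, the prior on $J$ and all constants are illustrative, not those of the cited results.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Toy white-noise / sequence model with wavelet-like levels j = 0..Jmax.
n, Jmax, tau = 2000, 8, 1.0

# A "smooth" truth: Hoelder-type decay |theta_{j,k}| ~ 2^{-j(alpha + 1/2)}.
alpha = 1.0
theta0 = [0.5 * 2.0 ** (-j * (alpha + 0.5)) * rng.choice([-1, 1], size=2 ** j)
          for j in range(Jmax + 1)]
y = [t + rng.standard_normal(t.shape) / np.sqrt(n) for t in theta0]

# Random truncation prior: J ~ P, theta_{j,k} ~ N(0, tau^2) for j <= J, 0 otherwise.
def log_marginal_given_J(J):
    """log p(Y | J): coefficients are N(0, tau^2 + 1/n) below the cut, N(0, 1/n) above."""
    s_in, s_out = np.sqrt(tau ** 2 + 1.0 / n), np.sqrt(1.0 / n)
    lp = 0.0
    for j, yj in enumerate(y):
        lp += norm.logpdf(yj, scale=s_in if j <= J else s_out).sum()
    return lp

log_prior_J = np.array([-0.5 * j for j in range(Jmax + 1)])   # illustrative prior on J
log_post_J = log_prior_J + np.array([log_marginal_given_J(J) for J in range(Jmax + 1)])
post_J = np.exp(log_post_J - log_post_J.max())
post_J /= post_J.sum()
print("posterior over truncation level J:", np.round(post_J, 3))
```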
13. Deterministic truncation prior
$J := J_n(\alpha)$ with $2^{J_n(\alpha)} = (n/\log n)^{1/(2\alpha+1)}$:
$$\theta_{j,k} \sim g(\cdot)\ \forall k,\ \forall j \le J_n(\alpha), \qquad \theta_{j,k} = 0\ \forall k,\ \forall j > J_n(\alpha)$$
L2 concentration:
$$\sup_{f_0 \in H(\alpha, L)} E_{f_0}\,P^\pi\big(\|f - f_0\|_2 > M(n/\log n)^{-\alpha/(2\alpha+1)} \mid Y\big) = o(1)$$
$\ell_{x_0}$ concentration: $\forall\alpha\ \exists M > 0$,
$$\sup_{f_0 \in H(\alpha, L)} E_{f_0}\,P^\pi\big(\ell_{x_0}(f, f_0) > M(n/\log n)^{-2\alpha/(2\alpha+1)} \mid Y\big) = o(1)$$
• What does it mean? Can we have adaptation with $\ell_{x_0}$ or $\ell_\infty$?
14. Why didn't it work?
Same problem as in the frequentist setting (see M. Low's papers).
Take $f_1 = 0$ and $f_0$ with coefficients $\theta^0_{j,0} = \epsilon_n(\alpha)^2$ for all $j \ge 1$ with $2^j \le L(n/\log n)^{2\alpha/(2\alpha+1)}$. Then
$$\sum_{j \ge 1,\,k} (\theta^0_{j,k})^2 \le \epsilon_n(\alpha)^2, \qquad \Big(\sum_{j \ge 1} 2^{j/2}\,\theta^0_{j,0}\Big)^2 \gtrsim (n/\log n)^{-2\alpha^2/(2\alpha+1)^2},$$
and
$$P^\pi[J = 0 \mid Y] = 1 + o_{P_0}(1):$$
$f_0$ looks too much like $f_1 = 0$.
15. Giné & Nickl: posterior concentration rates for $\ell_\infty$ via tests
At best they obtain
$$\sup_{f_0 \in H(\alpha, L)} E_{f_0}\,P^\pi\big(\ell_\infty(f, f_0) > M(n/\log n)^{-(\alpha - 1/2)/(2\alpha+1)} \mid Y\big) = o(1)$$
Proof based on tests $\rightarrow$ suboptimal. Can we do better?
17. Bayesian lower bounds: white noise model
Let $d(\theta, \theta')$ be a symmetric semi-metric, e.g. $d(\theta, \theta') = \ell_{x_0}(f_\theta, f_{\theta'})$ or $\ell_\infty(f_\theta, f_{\theta'})$.
Dual of the modulus of continuity:
$$\phi(\theta, \epsilon) = \inf_{\theta' \in \Theta}\big\{\|\theta - \theta'\|_2\ \big|\ d(\theta, \theta') > \epsilon\big\}, \qquad \phi(\epsilon) = \inf_{\theta} \phi(\theta, \epsilon)$$
Theorem. Let $C > 0$ and $\epsilon_n$ be such that $Q_n(C) := \{\theta;\ \phi(\theta, 2\epsilon_n) \le C\,\phi(\epsilon_n)\} \ne \emptyset$. Then $\forall \theta_0 \in Q_n(C)$, $\forall \Pi$, $\exists K > 0$,
$$E_{\theta_0,n}\big[P^\pi[d(\theta, \theta_0) \ge \epsilon_n \mid Y]\big] \ge e^{-Kn\phi(\epsilon_n)^2}$$
19. Consequences
Consequence 1: We obtain, as in Cai and Low, with
$$\epsilon_n = \inf\{\epsilon;\ \phi(\epsilon) > M/\sqrt{n}\}, \qquad \forall u_n = o(\epsilon_n): \quad E_{\theta_0,n}\big[P^\pi[d(\theta, \theta_0) \ge u_n \mid Y]\big] \ne o(1)$$
Consequence 2: If $\phi(\epsilon_n) = o(\epsilon_n)$, then
$$e^{-Kn\phi(\epsilon_n)^2} \gg e^{-Kn\epsilon_n^2}$$
$\rightarrow$ proofs based on tests will lead to suboptimal concentration rates, where
$$\bar\epsilon_n = \inf\{\epsilon;\ \phi(\epsilon) > M\epsilon_n\}$$
21. The case of $\ell_\infty$
$$Y_{j,k} = \theta_{j,k} + n^{-1/2}\epsilon_{j,k}, \qquad \epsilon_{j,k} \sim N(0, 1) \text{ i.i.d.}$$
$$\ell_\infty(f_\theta, f_{\theta'}) = \sum_j 2^{j/2} \max_k |\theta_{j,k} - \theta'_{j,k}|$$
$$\phi\big(\epsilon_n(\beta)\big) = O\Big(\frac{\log n}{\sqrt{n}}\Big), \qquad \epsilon_n(\beta) = (n/\log n)^{-\beta/(2\beta+1)}, \quad \Theta = H(\beta, L)$$
Theorem. There is a prior $\Pi$ s.t. $\forall C < 1/2$,
$$\sup_{\beta_1 \le \beta \le \beta_2}\ \sup_{\theta_0 \in H(\beta, L)} E_{\theta_0}\big(P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta) \mid Y]\big) \le e^{-C\log n}, \qquad \epsilon_n(\beta) := (n/\log n)^{-\beta/(2\beta+1)}$$
Examples: sieve prior (discrete prior), spike and slab.
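A heuristic for why $\phi(\epsilon_n(\beta))$ is so small for $\ell_\infty$ (a reconstructed back-of-the-envelope computation, not spelled out on the slide): starting from $\theta = 0$, perturb a single coefficient at level $j$ by the largest amount allowed in $H(\beta, L)$,
$$\theta'_{j,0} = \delta_j = L\,2^{-j(\beta + 1/2)} \ \Rightarrow\ \ell_\infty(f_\theta, f_{\theta'}) \approx 2^{j/2}\delta_j = L\,2^{-j\beta}, \qquad \|\theta - \theta'\|_2 = \delta_j.$$
Choosing the deepest level compatible with $2^{-j\beta} \gtrsim \epsilon_n(\beta)$, i.e. $2^j \asymp (n/\log n)^{1/(2\beta+1)}$, gives
$$\phi\big(\epsilon_n(\beta)\big) \lesssim L\,2^{-j(\beta+1/2)} \asymp (n/\log n)^{-1/2} \lesssim \frac{\log n}{\sqrt n},$$
so an $\ell_\infty$-separation of order $\epsilon_n(\beta)$ can be bought at an $L_2$ cost of order $1/\sqrt n$, much smaller than $\epsilon_n(\beta)$: exactly the regime $\phi(\epsilon_n) = o(\epsilon_n)$ of Consequence 2.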
22. Spike and slab prior
$\forall j \le J_n$ with $2^{J_n} \approx n$, $\forall k$:
$$\theta_{j,k} \sim \Big(1 - \frac{1}{n}\Big)\delta_0 + \frac{1}{n}\,g(\cdot)$$
with $\log g$ smooth (Laplace, Gaussian, Student).
Adaptive posterior concentration in L2 (losing a $\log n$ factor) and in $\ell_\infty$ at the rate $(n/\log n)^{-\beta/(2\beta+1)}$.
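A minimal sketch of the per-coefficient spike-and-slab posterior when $g$ is Gaussian, so the computation is conjugate; the hyperparameters are illustrative, not the ones behind the theorem.

```python
import numpy as np
from scipy.stats import norm

def spike_slab_posterior(y, n, w=None, tau=1.0):
    """Per-coefficient posterior for theta ~ (1 - w) delta_0 + w N(0, tau^2),
    observed through y = theta + N(0, 1/n) noise (Gaussian slab => conjugate)."""
    w = 1.0 / n if w is None else w
    s2 = 1.0 / n
    m_slab = norm.pdf(y, scale=np.sqrt(tau ** 2 + s2))     # marginal density under the slab
    m_spike = norm.pdf(y, scale=np.sqrt(s2))               # marginal density under the spike
    p_incl = w * m_slab / (w * m_slab + (1.0 - w) * m_spike)   # posterior P(theta != 0 | y)
    shrink = tau ** 2 / (tau ** 2 + s2)
    post_mean = p_incl * shrink * y                        # posterior mean of theta
    return p_incl, post_mean

rng = np.random.default_rng(3)
n = 1000
theta = np.array([0.0, 0.0, 0.3, 1.0])                     # two null and two signal coefficients
y = theta + rng.standard_normal(theta.shape) / np.sqrt(n)
p_incl, post_mean = spike_slab_posterior(y, n)
print("inclusion prob:", np.round(p_incl, 3))
print("posterior mean:", np.round(post_mean, 3))
```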
23. Some connections with confidence sets
Adaptive confidence sets $C_n$:
$$\inf_\theta P_\theta(\theta \in C_n) \ge 1 - \alpha$$
and
$$\sup_{\beta_1 \le \beta \le \beta_2}\ \sup_{\theta_0 \in H(\beta, L)} \epsilon_n(\beta)^{-1}\, E_{\theta_0}[|C_n|] < +\infty,$$
with $|C_n| = \sup_{\theta, \theta' \in C_n} d(\theta, \theta')$.
If $d(\cdot,\cdot) = \ell_\infty$: such adaptive confidence sets do not exist (M. Low).
Hoffmann and Nickl: on $H(\beta_1, L) \cup H(\beta_2, L)$ with $\beta_2 > \beta_1$, set
$$\tilde\Theta_n = H(\beta_2, L) \cup \{\theta \in H(\beta_1, L);\ \ell_\infty(\theta, H(\beta_2, L)) > M\epsilon_n(\beta_1)\}.$$
Then an adaptive confidence set exists over $\tilde\Theta_n$.
24. 1st Bayesian perspective
$H(\beta_1, L) \cup H(\beta_2, L)$ with $\beta_2 > \beta_1$. If
$$\sup_{\beta \in \{\beta_1, \beta_2\}}\ \sup_{\theta_0 \in H(\beta, L)} E_{\theta_0}\big(P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta) \mid Y]\big) \le e^{-C\log n},$$
set $C_n = \{\theta_0;\ P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta, \theta_0) \mid Y] < n^{-C}/\alpha\}$. Then
$$P_{\theta_0}[\theta_0 \in C_n^c] \le \alpha\, n^{C}\, E_{\theta_0}\big[P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta, \theta_0) \mid Y]\big] \le \alpha.$$
Problem: control of $E_\theta[|C_n|]$:
$$\sup_{\theta \in H(\beta_1, L)} E_\theta[|C_n|] \lesssim \epsilon_n(\beta_1) \ \rightarrow \text{ OK}, \qquad \sup_{\theta \in H(\beta_2, L)} E_\theta[|C_n|] \asymp \epsilon_n(\beta_1) \ \rightarrow \text{ BAD}$$
But on $\tilde\Theta := H(\beta_2, L) \cup \{\theta \in H(\beta_1, L);\ \ell_\infty(\theta, H(\beta_2, L)) > \epsilon_n(\beta_1)\}$,
$$\tilde C_n = C_n \cap \tilde\Theta \quad \text{is an adaptive confidence set.}$$
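The coverage display above is Markov's inequality applied to the posterior tail probability; written out (not on the slide):
$$P_{\theta_0}\big[\theta_0 \notin C_n\big]
= P_{\theta_0}\Big[P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta, \theta_0) \mid Y] \ge n^{-C}/\alpha\Big]
\le \frac{\alpha}{n^{-C}}\, E_{\theta_0}\big[P^\pi[\cdots \mid Y]\big]
\le \alpha\, n^{C}\, e^{-C\log n} = \alpha.$$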
25. A better (?) Bayesian perspective: back to basics
If
$$\sup_{\beta \in [\beta_1, \beta_2]}\ \sup_{\theta_0 \in H(\beta, L)} E_{\theta_0}\big(P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta) \mid Y]\big) \le e^{-C\log n}$$
and
$$\sup_{\beta \in [\beta_1, \beta_2]}\ \sup_{\theta_0 \in H(\beta, L)} E_{\theta_0}\big[\ell_\infty(\theta_0, \hat\theta)\big]\, \epsilon_n(\beta)^{-1} < +\infty,$$
set $C_n = \{\theta;\ \ell_\infty(\theta, \hat\theta) \le k_n(\alpha_n)\}$ with $k_n(\alpha_n)$ chosen so that $P^\pi[\theta \in C_n \mid Y] \ge 1 - \alpha_n$. Then
$$\int_\Theta P_\theta[\theta \in C_n]\, d\pi(\theta) \ge 1 - \alpha_n$$
and
$$\sup_{\theta \in H(\beta, L)} E_\theta[|C_n|] \le 2M\epsilon_n(\beta),$$
if $\Theta$ is bounded.
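A hedged sketch of this construction from posterior draws: take $\hat\theta$ = posterior mean and $k_n(\alpha_n)$ = the $(1-\alpha_n)$ posterior quantile of the sup-norm distance to $\hat\theta$. The conjugate Gaussian posterior used here is only illustrative, not the prior of the theorem.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative "posterior": conjugate Gaussian posterior per coefficient in the
# sequence model y_i = theta_i + N(0, 1/n), prior theta_i ~ N(0, tau^2).
n, p, tau, alpha_n = 500, 64, 1.0, 0.05
theta_true = 0.5 * np.arange(1.0, p + 1) ** (-1.5)
y = theta_true + rng.standard_normal(p) / np.sqrt(n)

shrink = tau ** 2 / (tau ** 2 + 1.0 / n)
post_mean = shrink * y
post_sd = np.sqrt(shrink / n)

# Draw posterior samples, take theta_hat = posterior mean, and choose k_n(alpha_n)
# as the (1 - alpha_n) quantile of the posterior sup-norm distance to theta_hat.
samples = post_mean + post_sd * rng.standard_normal((5000, p))
sup_dist = np.max(np.abs(samples - post_mean), axis=1)
k_n = np.quantile(sup_dist, 1.0 - alpha_n)

print(f"credible sup-norm radius k_n(alpha_n) = {k_n:.4f}")
print(f"sup-norm error of the point estimate  = {np.max(np.abs(post_mean - theta_true)):.4f}")
```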
26. Conclusion
• Bayesian methods are well suited to risks related to Kullback-Leibler divergence: L2 in regression, Hellinger or L1 in density estimation, etc.
• How to understand some specific features in these big models? More tricky.
• Good nonparametric priors: have good properties for a wide range of loss functions.
• Why should we care? → interpretation of credible bands!?
• Extension to models other than white noise. [Done]
• Can we go further than the 2nd Bayesian interpretation (confidence sets)?