This document discusses Bayesian nonparametric posterior concentration rates under different loss functions.
1. It provides an overview of posterior concentration, the insight it gives into priors and inference, and how minimax rates over smoothness classes are used to benchmark concentration rates.
2. The proof technique involves constructing tests and relating Kullback-Leibler-type neighbourhoods to the loss function. Examples where nice results exist include density estimation, regression, and the white noise model.
3. For the white noise model, truncation priors attain the minimax L2 rate (adaptively, up to a log factor) and the minimax pointwise rate (non-adaptively), but for sup-norm loss existing test-based results only achieve a suboptimal rate. The document explores how to obtain better adaptation for sup-norm loss.
1. On adaptation for the posterior distribution under local and sup-norm losses
Judith Rousseau, Marc Hoffmann and Johannes Schmidt-Hieber
ENSAE - CREST and CEREMADE, Université Paris-Dauphine
January
2. Outline
1 Bayesian nonparametrics: posterior concentration
   Generalities
   Idea of the proof
   Examples of loss functions where things become less nice
2 Bayesian upper and lower bounds
   Lower bound
   The case of $\ell_\infty$ and adaptation
3 Links with confidence bands
4. Generalities
Model: $Y_1^n \mid \theta \sim p_\theta^n$ (density w.r.t. $\mu$), $\theta \in \Theta$
A priori: $\theta \sim \Pi$, the prior distribution
$\longrightarrow$ posterior distribution
$$d\Pi(\theta \mid Y_1^n) = \frac{p_\theta^n(Y_1^n)\, d\Pi(\theta)}{m(Y_1^n)}, \qquad Y_1^n = (Y_1, \dots, Y_n)$$
Posterior concentration: $d(\cdot,\cdot)$ = loss on $\Theta$ and $\theta_0 \in \Theta$ = true parameter,
$$E_{\theta_0}\big(\Pi[U_{\epsilon_n} \mid Y_1^n]\big) = 1 + o(1), \qquad U_{\epsilon_n} = \{\theta;\ d(\theta, \theta_0) \le \epsilon_n\}, \quad \epsilon_n \downarrow 0$$
Why should we care?
• Gives insight into some aspects of the prior
• Gives some insight into inference: interpretation of posterior credible regions (loosely)
• Helps understand the links between frequentist and Bayesian approaches
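To make the definition concrete, here is a minimal numerical sketch (illustrative, not from the slides): a scalar Gaussian model on a discretized parameter grid, and the posterior mass $\Pi[U_\epsilon \mid Y]$ of a shrinking neighbourhood of $\theta_0$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: theta is a scalar, Y_1..Y_n ~ N(theta, 1), prior = N(0, 1) on a grid.
theta0 = 0.3
n = 500
y = theta0 + rng.standard_normal(n)

grid = np.linspace(-3, 3, 2001)                       # discretized Theta
log_prior = -0.5 * grid**2                            # N(0, 1) prior, up to a constant
log_lik = -0.5 * ((y[:, None] - grid[None, :])**2).sum(axis=0)
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()                                    # normalized posterior on the grid

# Posterior mass of U_eps = {theta : |theta - theta0| <= eps} for shrinking eps
for eps in [0.5, 0.2, 0.1, 0.05]:
    mass = post[np.abs(grid - theta0) <= eps].sum()
    print(f"eps = {eps:4.2f}  Pi[U_eps | Y] = {mass:.3f}")
```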
5. Minimax concentration rates on a class $\Theta_\alpha(L)$
$$\sup_{\theta_0 \in \Theta_\alpha(L)} E_{\theta_0}\,\Pi\big[U^c_{M\epsilon_n(\alpha)} \mid Y_1^n\big] = o(1),$$
where $\epsilon_n(\alpha)$ = minimax rate under $d(\cdot,\cdot)$ over $\Theta_\alpha(L)$.
6. Examples of models and losses for which nice results exist
Density estimation: $Y_i \sim p_\theta$ i.i.d.
$$d(p_\theta, p_{\theta'})^2 = \int \big(\sqrt{p_\theta} - \sqrt{p_{\theta'}}\big)^2(x)\,dx, \qquad d(p_\theta, p_{\theta'}) = \int |p_\theta - p_{\theta'}|(x)\,dx$$
Regression function: $Y_i = f(x_i) + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$, $\theta = (f, \sigma)$
$$d(p_\theta, p_{\theta'}) = \|f - f'\|_2, \qquad d(p_\theta, p_{\theta'})^2 = n^{-1}\sum_{i=1}^n H^2\big(p_\theta(y \mid X_i), p_{\theta'}(y \mid X_i)\big), \quad H = \text{Hellinger}$$
White noise:
$$dY(t) = f(t)\,dt + n^{-1/2}\,dW(t) \ \Longleftrightarrow\ Y_i = \theta_i + n^{-1/2}\epsilon_i, \quad i \in \mathbb{N}, \qquad d(p_\theta, p_{\theta'}) = \|f - f'\|_2$$
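A small illustrative computation of the distances above for two fixed densities (a hypothetical example, not from the slides): squared Hellinger, L1 and L2 distances evaluated on a grid.

```python
import numpy as np

# Two illustrative densities on a grid: N(0, 1) and N(0.3, 1.2^2)
x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p, q = normal_pdf(x, 0.0, 1.0), normal_pdf(x, 0.3, 1.2)

hellinger_sq = np.trapz((np.sqrt(p) - np.sqrt(q)) ** 2, dx=dx)   # squared Hellinger distance
l1 = np.trapz(np.abs(p - q), dx=dx)                              # L1 distance
l2 = np.sqrt(np.trapz((p - q) ** 2, dx=dx))                      # L2 distance

print(f"H^2 = {hellinger_sq:.4f}, L1 = {l1:.4f}, L2 = {l2:.4f}")
```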
7. Examples: functional classes
$\Theta_\alpha(L)$ = Hölder class $H(\alpha, L)$; $\epsilon_n(\alpha) = n^{-\alpha/(2\alpha+1)}$ = minimax rate over $H(\alpha, L)$.
Density example, Hellinger loss. Prior = DPM (Dirichlet process mixture):
$$f(x) = f_{P,\sigma}(x) = \int \phi_\sigma(x - \mu)\,dP(\mu), \qquad \sigma \sim I\Gamma(a, b), \quad P \sim DP(A, G_0)$$
$$\sup_{f_0 \in \Theta_\alpha(L)} E_{f_0}\,\Pi\big[U^c_{M(n/\log n)^{-\alpha/(2\alpha+1)}}(f_0) \mid Y_1^n\big] = o(1),$$
$U_\epsilon(f_0) = \{f;\ h(f_0, f) \le \epsilon\}$ [is the $\log n$ term necessary?]
$$\Rightarrow\ E_{f_0}\big[h(\hat f, f_0)^2\big] \lesssim (n/\log n)^{-2\alpha/(2\alpha+1)}, \qquad \hat f(x) = E^\pi[f(x) \mid Y^n]$$
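For intuition about the DPM prior, here is a hedged sketch of one draw of $f_{P,\sigma}$ via a truncated stick-breaking representation of $DP(A, G_0)$; the base measure $G_0$ and all hyperparameter values are illustrative, not the ones used in the cited results.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_dpm_density(A=1.0, a=2.0, b=1.0, n_atoms=200):
    """One draw f_{P,sigma} from a DP(A, G0) mixture of Gaussians,
    using a truncated stick-breaking representation (illustrative values)."""
    v = rng.beta(1.0, A, size=n_atoms)                            # stick-breaking proportions
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))     # mixture weights
    mu = rng.standard_normal(n_atoms)                             # atoms drawn from G0 = N(0, 1)
    sigma = 1.0 / rng.gamma(a, 1.0 / b)                           # sigma ~ inverse-Gamma(a, b)
    def f(x):
        x = np.asarray(x)[..., None]
        return (w * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
                / (sigma * np.sqrt(2 * np.pi))).sum(axis=-1)
    return f

f = draw_dpm_density()
x = np.linspace(-4, 4, 9)
print(np.round(f(x), 3))   # values of one random density draw on a coarse grid
```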
9. Outline of the proof: tests and KL
$U_n = U_{M(n/\log n)^{-\alpha/(2\alpha+1)}}$, $\ell_n(\theta) = \log p_\theta^n(Y_1^n)$ and $\bar\epsilon_n = (n/\log n)^{-\alpha/(2\alpha+1)}$.
$$\Pi[U_n^c \mid Y_1^n] = \frac{\int_{U_n^c} e^{\ell_n(\theta) - \ell_n(\theta_0)}\,d\Pi(\theta)}{\int_{\Theta} e^{\ell_n(\theta) - \ell_n(\theta_0)}\,d\Pi(\theta)} := \frac{N_n}{D_n}$$
Tests $\phi_n = \phi_n(Y_1^n) \in [0, 1]$:
$$E_{\theta_0}\big(\Pi[U_n^c \mid Y_1^n]\big) \le E_{\theta_0}[\phi_n] + P_{\theta_0}\big(D_n < e^{-cn\bar\epsilon_n^2}\big) + e^{(c+\tau)n\bar\epsilon_n^2} \int_{U_n^c} E_\theta[1 - \phi_n]\,d\pi(\theta)$$
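The displayed bound follows from a standard splitting argument; spelled out (a reconstruction, not on the slide):
$$\begin{aligned}
E_{\theta_0}\big(\Pi[U_n^c \mid Y_1^n]\big)
 &\le E_{\theta_0}[\phi_n]
   + E_{\theta_0}\Big[(1-\phi_n)\,\Pi[U_n^c \mid Y_1^n]\,\mathbf{1}\{D_n < e^{-cn\bar\epsilon_n^2}\}\Big]
   + E_{\theta_0}\Big[(1-\phi_n)\,\tfrac{N_n}{D_n}\,\mathbf{1}\{D_n \ge e^{-cn\bar\epsilon_n^2}\}\Big] \\
 &\le E_{\theta_0}[\phi_n] + P_{\theta_0}\big(D_n < e^{-cn\bar\epsilon_n^2}\big)
   + e^{cn\bar\epsilon_n^2}\, E_{\theta_0}\big[(1-\phi_n)\,N_n\big],
\end{aligned}$$
and, by a change of measure (Fubini), $E_{\theta_0}[(1-\phi_n)\,N_n] = \int_{U_n^c} E_\theta[1-\phi_n]\,d\Pi(\theta)$. The extra factor $e^{\tau n\bar\epsilon_n^2}$ on the slide is additional slack.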
10. Constraints
Tests: we need
$$E_{\theta_0}[\phi_n] = o(1) \quad \& \quad \sup_{d(\theta, \theta_0) > M\bar\epsilon_n} E_\theta[1 - \phi_n] = o\big(e^{-cn\bar\epsilon_n^2}\big) \qquad \rightarrow \text{ depends on } d(\cdot,\cdot)$$
Denominator: we need $P_{\theta_0}\big(D_n < e^{-cn\bar\epsilon_n^2}\big) = o(1)$. Indeed,
$$D_n \ge \int_{S_n} e^{\ell_n(\theta) - \ell_n(\theta_0)}\,d\Pi(\theta) \ge e^{-2n\bar\epsilon_n^2}\,\Pi\big(S_n \cap \{\ell_n(\theta) - \ell_n(\theta_0) > -2n\bar\epsilon_n^2\}\big)$$
OK if $S_n = \{KL(p_{\theta_0}^n, p_\theta^n) \le n\bar\epsilon_n^2;\ V(p_{\theta_0}^n, p_\theta^n) \le n\bar\epsilon_n^2\}$ and
$$\Pi(S_n) \ge e^{-cn\bar\epsilon_n^2} \qquad \rightarrow \text{ links } d(\cdot,\cdot) \text{ with } KL(\cdot,\cdot)$$
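As a concrete instance (a standard computation, not spelled out on the slide): in the white noise model the KL neighbourhood $S_n$ is an $L_2$ ball, since
$$KL(p_{\theta_0}^n, p_\theta^n) = \frac{n}{2}\,\|\theta - \theta_0\|_2^2, \qquad V(p_{\theta_0}^n, p_\theta^n) = n\,\|\theta - \theta_0\|_2^2$$
(with $V$ the centered second moment of the log-likelihood ratio), so the condition $\Pi(S_n) \ge e^{-cn\bar\epsilon_n^2}$ becomes a condition on the prior mass of $L_2$ balls around $\theta_0$.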
12. White noise model and pointwise or sup-norm loss
White noise:
$$dY(t) = f(t)\,dt + n^{-1/2}\,dW(t) \ \Longleftrightarrow\ Y_{jk} = \theta_{jk} + n^{-1/2}\epsilon_{jk} \quad \text{(wavelet coefficients)}$$
Pointwise loss: $\ell_{x_0}(f, f_0) = (f(x_0) - f_0(x_0))^2$
Sup-norm loss: $\ell_\infty(f, f_0) = \sup_x |f(x) - f_0(x)|$
Random truncation prior:
$$J \sim P, \qquad \theta_{j,k} \sim g(\cdot)\ \forall k,\ \forall j \le J, \qquad \theta_{j,k} = 0\ \forall k,\ \forall j > J$$
L2 concentration:
$$\sup_{\alpha_1 \le \alpha \le \alpha_2}\ \sup_{f_0 \in H(\alpha, L)} E_{f_0}\,P^\pi\big(\|f - f_0\|_2 > M(n/\log n)^{-\alpha/(2\alpha+1)} \mid Y\big) = o(1)$$
$\ell_{x_0}$ concentration: $\forall\alpha\ \exists\epsilon > 0$,
$$\sup_{f_0 \in H(\alpha, L)} E_{f_0}\,P^\pi\big(\ell_{x_0}(f, f_0) > \epsilon\,(n/\log n)^{-2\alpha^2/(2\alpha+1)^2} \mid Y\big) = 1 + o(1)$$
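To see how the random truncation prior behaves, here is a hedged numerical sketch in the sequence model: a Gaussian slab $g$ gives a conjugate marginal likelihood per truncation level, and hence a tractable posterior over $J$. The truth, the prior on $J$ and all constants are illustrative, not those of the cited results.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Toy white-noise / sequence model with wavelet-like levels j = 0..Jmax.
n, Jmax, tau = 2000, 8, 1.0

# A "smooth" truth: Hoelder-type decay |theta_{j,k}| ~ 2^{-j(alpha + 1/2)}.
alpha = 1.0
theta0 = [0.5 * 2.0 ** (-j * (alpha + 0.5)) * rng.choice([-1, 1], size=2 ** j)
          for j in range(Jmax + 1)]
y = [t + rng.standard_normal(t.shape) / np.sqrt(n) for t in theta0]

# Random truncation prior: J ~ P, theta_{j,k} ~ N(0, tau^2) for j <= J, 0 otherwise.
def log_marginal_given_J(J):
    """log p(Y | J): coefficients are N(0, tau^2 + 1/n) below the cut, N(0, 1/n) above."""
    s_in, s_out = np.sqrt(tau ** 2 + 1.0 / n), np.sqrt(1.0 / n)
    lp = 0.0
    for j, yj in enumerate(y):
        lp += norm.logpdf(yj, scale=s_in if j <= J else s_out).sum()
    return lp

log_prior_J = np.array([-0.5 * j for j in range(Jmax + 1)])   # illustrative prior on J
log_post_J = log_prior_J + np.array([log_marginal_given_J(J) for J in range(Jmax + 1)])
post_J = np.exp(log_post_J - log_post_J.max())
post_J /= post_J.sum()
print("posterior over truncation level J:", np.round(post_J, 3))
```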
13. Deterministic truncation prior
$J := J_n(\alpha)$ with $2^{J_n(\alpha)} = (n/\log n)^{1/(2\alpha+1)}$:
$$\theta_{j,k} \sim g(\cdot)\ \forall k,\ \forall j \le J_n(\alpha), \qquad \theta_{j,k} = 0\ \forall k,\ \forall j > J_n(\alpha)$$
L2 concentration:
$$\sup_{f_0 \in H(\alpha, L)} E_{f_0}\,P^\pi\big(\|f - f_0\|_2 > M(n/\log n)^{-\alpha/(2\alpha+1)} \mid Y\big) = o(1)$$
$\ell_{x_0}$ concentration: $\forall\alpha\ \exists M > 0$,
$$\sup_{f_0 \in H(\alpha, L)} E_{f_0}\,P^\pi\big(\ell_{x_0}(f, f_0) > M(n/\log n)^{-2\alpha/(2\alpha+1)} \mid Y\big) = o(1)$$
• What does it mean? Can we have adaptation with $\ell_{x_0}$ or $\ell_\infty$?
14. Why didn't it work?
Same problem as in the frequentist setting (see M. Low's papers).
Take $f_1 = 0$ and $f_0$ with coefficients $\theta^0_{j,0} = \epsilon_n(\alpha)^2$ for all $j \ge 1$ with $2^j \le L(n/\log n)^{2\alpha/(2\alpha+1)}$. Then
$$\sum_{j \ge 1,\,k} (\theta^0_{j,k})^2 \le \epsilon_n(\alpha)^2, \qquad \Big(\sum_{j \ge 1} 2^{j/2}\,\theta^0_{j,0}\Big)^2 \gtrsim (n/\log n)^{-2\alpha^2/(2\alpha+1)^2},$$
and
$$P^\pi[J = 0 \mid Y] = 1 + o_{P_0}(1):$$
$f_0$ looks too much like $f_1 = 0$.
15. Giné & Nickl: posterior concentration rates for $\ell_\infty$ via tests
At best they obtain
$$\sup_{f_0 \in H(\alpha, L)} E_{f_0}\,P^\pi\big(\ell_\infty(f, f_0) > M(n/\log n)^{-(\alpha - 1/2)/(2\alpha+1)} \mid Y\big) = o(1)$$
Proof based on tests $\rightarrow$ suboptimal. Can we do better?
17. Bayesian lower bounds: white noise model
Let $d(\theta, \theta')$ be a symmetric semi-metric, e.g. $d(\theta, \theta') = \ell_{x_0}(f_\theta, f_{\theta'})$ or $\ell_\infty(f_\theta, f_{\theta'})$.
Dual of the modulus of continuity:
$$\phi(\theta, \epsilon) = \inf_{\theta' \in \Theta}\big\{\|\theta - \theta'\|_2\ \big|\ d(\theta, \theta') > \epsilon\big\}, \qquad \phi(\epsilon) = \inf_{\theta} \phi(\theta, \epsilon)$$
Theorem. Let $C > 0$ and $\epsilon_n$ be such that $Q_n(C) := \{\theta;\ \phi(\theta, 2\epsilon_n) \le C\,\phi(\epsilon_n)\} \ne \emptyset$. Then $\forall \theta_0 \in Q_n(C)$, $\forall \Pi$, $\exists K > 0$,
$$E_{\theta_0,n}\big[P^\pi[d(\theta, \theta_0) \ge \epsilon_n \mid Y]\big] \ge e^{-Kn\phi(\epsilon_n)^2}$$
19. Consequences
Consequence 1: We obtain, as in Cai and Low, with
$$\epsilon_n = \inf\{\epsilon;\ \phi(\epsilon) > M/\sqrt{n}\}, \qquad \forall u_n = o(\epsilon_n): \quad E_{\theta_0,n}\big[P^\pi[d(\theta, \theta_0) \ge u_n \mid Y]\big] \ne o(1)$$
Consequence 2: If $\phi(\epsilon_n) = o(\epsilon_n)$, then
$$e^{-Kn\phi(\epsilon_n)^2} \gg e^{-Kn\epsilon_n^2}$$
$\rightarrow$ proofs based on tests will lead to suboptimal concentration rates, where
$$\bar\epsilon_n = \inf\{\epsilon;\ \phi(\epsilon) > M\epsilon_n\}$$
21. The case of $\ell_\infty$
$$Y_{j,k} = \theta_{j,k} + n^{-1/2}\epsilon_{j,k}, \qquad \epsilon_{j,k} \sim N(0, 1) \text{ i.i.d.}$$
$$\ell_\infty(f_\theta, f_{\theta'}) = \sum_j 2^{j/2} \max_k |\theta_{j,k} - \theta'_{j,k}|$$
$$\phi\big(\epsilon_n(\beta)\big) = O\Big(\frac{\log n}{\sqrt{n}}\Big), \qquad \epsilon_n(\beta) = (n/\log n)^{-\beta/(2\beta+1)}, \quad \Theta = H(\beta, L)$$
Theorem. There is a prior $\Pi$ s.t. $\forall C < 1/2$,
$$\sup_{\beta_1 \le \beta \le \beta_2}\ \sup_{\theta_0 \in H(\beta, L)} E_{\theta_0}\big(P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta) \mid Y]\big) \le e^{-C\log n}, \qquad \epsilon_n(\beta) := (n/\log n)^{-\beta/(2\beta+1)}$$
Examples: sieve prior (discrete prior), spike and slab.
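A heuristic for why $\phi(\epsilon_n(\beta))$ is so small for $\ell_\infty$ (a reconstructed back-of-the-envelope computation, not spelled out on the slide): starting from $\theta = 0$, perturb a single coefficient at level $j$ by the largest amount allowed in $H(\beta, L)$,
$$\theta'_{j,0} = \delta_j = L\,2^{-j(\beta + 1/2)} \ \Rightarrow\ \ell_\infty(f_\theta, f_{\theta'}) \approx 2^{j/2}\delta_j = L\,2^{-j\beta}, \qquad \|\theta - \theta'\|_2 = \delta_j.$$
Choosing the deepest level compatible with $2^{-j\beta} \gtrsim \epsilon_n(\beta)$, i.e. $2^j \asymp (n/\log n)^{1/(2\beta+1)}$, gives
$$\phi\big(\epsilon_n(\beta)\big) \lesssim L\,2^{-j(\beta+1/2)} \asymp (n/\log n)^{-1/2} \lesssim \frac{\log n}{\sqrt n},$$
so an $\ell_\infty$-separation of order $\epsilon_n(\beta)$ can be bought at an $L_2$ cost of order $1/\sqrt n$, much smaller than $\epsilon_n(\beta)$: exactly the regime $\phi(\epsilon_n) = o(\epsilon_n)$ of Consequence 2.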
22. Spike and slab prior
$\forall j \le J_n$ with $2^{J_n} \approx n$, $\forall k$:
$$\theta_{j,k} \sim \Big(1 - \frac{1}{n}\Big)\delta_0 + \frac{1}{n}\,g(\cdot)$$
with $\log g$ smooth (Laplace, Gaussian, Student).
Adaptive posterior concentration in L2 (losing a $\log n$ factor) and in $\ell_\infty$ at the rate $(n/\log n)^{-\beta/(2\beta+1)}$.
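A minimal sketch of the per-coefficient spike-and-slab posterior when $g$ is Gaussian, so the computation is conjugate; the hyperparameters are illustrative, not the ones behind the theorem.

```python
import numpy as np
from scipy.stats import norm

def spike_slab_posterior(y, n, w=None, tau=1.0):
    """Per-coefficient posterior for theta ~ (1 - w) delta_0 + w N(0, tau^2),
    observed through y = theta + N(0, 1/n) noise (Gaussian slab => conjugate)."""
    w = 1.0 / n if w is None else w
    s2 = 1.0 / n
    m_slab = norm.pdf(y, scale=np.sqrt(tau ** 2 + s2))     # marginal density under the slab
    m_spike = norm.pdf(y, scale=np.sqrt(s2))               # marginal density under the spike
    p_incl = w * m_slab / (w * m_slab + (1.0 - w) * m_spike)   # posterior P(theta != 0 | y)
    shrink = tau ** 2 / (tau ** 2 + s2)
    post_mean = p_incl * shrink * y                        # posterior mean of theta
    return p_incl, post_mean

rng = np.random.default_rng(3)
n = 1000
theta = np.array([0.0, 0.0, 0.3, 1.0])                     # two null and two signal coefficients
y = theta + rng.standard_normal(theta.shape) / np.sqrt(n)
p_incl, post_mean = spike_slab_posterior(y, n)
print("inclusion prob:", np.round(p_incl, 3))
print("posterior mean:", np.round(post_mean, 3))
```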
23. Some connections with confidence sets
Adaptive confidence sets $C_n$:
$$\inf_\theta P_\theta(\theta \in C_n) \ge 1 - \alpha$$
and
$$\sup_{\beta_1 \le \beta \le \beta_2}\ \sup_{\theta_0 \in H(\beta, L)} \epsilon_n(\beta)^{-1}\, E_{\theta_0}[|C_n|] < +\infty,$$
with $|C_n| = \sup_{\theta, \theta' \in C_n} d(\theta, \theta')$.
If $d(\cdot,\cdot) = \ell_\infty$: such adaptive confidence sets do not exist (M. Low).
Hoffmann and Nickl: on $H(\beta_1, L) \cup H(\beta_2, L)$ with $\beta_2 > \beta_1$, set
$$\tilde\Theta_n = H(\beta_2, L) \cup \{\theta \in H(\beta_1, L);\ \ell_\infty(\theta, H(\beta_2, L)) > M\epsilon_n(\beta_1)\}.$$
Then an adaptive confidence set exists over $\tilde\Theta_n$.
24. 1st Bayesian perspective
$H(\beta_1, L) \cup H(\beta_2, L)$ with $\beta_2 > \beta_1$. If
$$\sup_{\beta \in \{\beta_1, \beta_2\}}\ \sup_{\theta_0 \in H(\beta, L)} E_{\theta_0}\big(P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta) \mid Y]\big) \le e^{-C\log n},$$
set $C_n = \{\theta_0;\ P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta, \theta_0) \mid Y] < n^{-C}/\alpha\}$. Then
$$P_{\theta_0}[\theta_0 \in C_n^c] \le \alpha\, n^{C}\, E_{\theta_0}\big[P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta, \theta_0) \mid Y]\big] \le \alpha.$$
Problem: control of $E_\theta[|C_n|]$:
$$\sup_{\theta \in H(\beta_1, L)} E_\theta[|C_n|] \lesssim \epsilon_n(\beta_1) \ \rightarrow \text{ OK}, \qquad \sup_{\theta \in H(\beta_2, L)} E_\theta[|C_n|] \asymp \epsilon_n(\beta_1) \ \rightarrow \text{ BAD}$$
But on $\tilde\Theta := H(\beta_2, L) \cup \{\theta \in H(\beta_1, L);\ \ell_\infty(\theta, H(\beta_2, L)) > \epsilon_n(\beta_1)\}$,
$$\tilde C_n = C_n \cap \tilde\Theta \quad \text{is an adaptive confidence set.}$$
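The coverage display above is Markov's inequality applied to the posterior tail probability; written out (not on the slide):
$$P_{\theta_0}\big[\theta_0 \notin C_n\big]
= P_{\theta_0}\Big[P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta, \theta_0) \mid Y] \ge n^{-C}/\alpha\Big]
\le \frac{\alpha}{n^{-C}}\, E_{\theta_0}\big[P^\pi[\cdots \mid Y]\big]
\le \alpha\, n^{C}\, e^{-C\log n} = \alpha.$$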
25. A better (?) Bayesian perspective: back to basics
If
$$\sup_{\beta \in [\beta_1, \beta_2]}\ \sup_{\theta_0 \in H(\beta, L)} E_{\theta_0}\big(P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta) \mid Y]\big) \le e^{-C\log n}$$
and
$$\sup_{\beta \in [\beta_1, \beta_2]}\ \sup_{\theta_0 \in H(\beta, L)} E_{\theta_0}\big[\ell_\infty(\theta_0, \hat\theta)\big]\, \epsilon_n(\beta)^{-1} < +\infty,$$
set $C_n = \{\theta;\ \ell_\infty(\theta, \hat\theta) \le k_n(\alpha_n)\}$ with $k_n(\alpha_n)$ chosen so that $P^\pi[\theta \in C_n \mid Y] \ge 1 - \alpha_n$. Then
$$\int_\Theta P_\theta[\theta \in C_n]\, d\pi(\theta) \ge 1 - \alpha_n$$
and
$$\sup_{\theta \in H(\beta, L)} E_\theta[|C_n|] \le 2M\epsilon_n(\beta),$$
if $\Theta$ is bounded.
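A hedged sketch of this construction from posterior draws: take $\hat\theta$ = posterior mean and $k_n(\alpha_n)$ = the $(1-\alpha_n)$ posterior quantile of the sup-norm distance to $\hat\theta$. The conjugate Gaussian posterior used here is only illustrative, not the prior of the theorem.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative "posterior": conjugate Gaussian posterior per coefficient in the
# sequence model y_i = theta_i + N(0, 1/n), prior theta_i ~ N(0, tau^2).
n, p, tau, alpha_n = 500, 64, 1.0, 0.05
theta_true = 0.5 * np.arange(1.0, p + 1) ** (-1.5)
y = theta_true + rng.standard_normal(p) / np.sqrt(n)

shrink = tau ** 2 / (tau ** 2 + 1.0 / n)
post_mean = shrink * y
post_sd = np.sqrt(shrink / n)

# Draw posterior samples, take theta_hat = posterior mean, and choose k_n(alpha_n)
# as the (1 - alpha_n) quantile of the posterior sup-norm distance to theta_hat.
samples = post_mean + post_sd * rng.standard_normal((5000, p))
sup_dist = np.max(np.abs(samples - post_mean), axis=1)
k_n = np.quantile(sup_dist, 1.0 - alpha_n)

print(f"credible sup-norm radius k_n(alpha_n) = {k_n:.4f}")
print(f"sup-norm error of the point estimate  = {np.max(np.abs(post_mean - theta_true)):.4f}")
```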
26. Conclusion
• Bayesian methods are well suited to risks related to Kullback-Leibler divergence: L2 in regression, Hellinger or L1 in density estimation, etc.
• How to understand some specific features in these big models? More tricky.
• Good nonparametric priors: have good properties for a wide range of loss functions.
• Why should we care? → interpretation of credible bands!?
• Extension to models other than white noise. [Done]
• Can we go further than the 2nd Bayesian interpretation (confidence sets)?