1. Bayesian optimal adaptive estimation using a sieve prior
YES IV Workshop
Julyan Arbel, arbel@ensae.fr
ENSAE-CREST-Université Paris Dauphine
November 9, 2010
4. Introduction
• Posterior concentration rate and risk convergence rate in a Bayesian nonparametric setting.
• Results in the same spirit as those of Ghosal, Ghosh and van der Vaart (2000) and Ghosal and van der Vaart (2007), in the specific case of models suitable for the use of sieve priors.
• Use of a family of sieve priors (introduced by Zhao (2000) in the white noise model).
• Infinite-dimensional parameter from a Sobolev smoothness class.
7. Notations
• Consider a model $(\mathcal{X}^{(n)}, \mathcal{A}^{(n)}, P_\theta^{(n)} : \theta \in \Theta)$ with observations $X^{(n)} = (X_i^n)_{1 \le i \le n}$, and
$$\Theta = \bigcup_{k=1}^{\infty} \mathbb{R}^k.$$
• Denote by $\theta_0$ the parameter associated with the true model. Densities are denoted $p_\theta^{(n)}$ ($p_0^{(n)}$ for $\theta_0$). The vector of the first $k$ coordinates of $\theta_0$ is denoted $\theta_{0k}$.
• A sieve prior $\Pi$ on $\Theta$ is defined as follows:
$$\Pi(\theta) = \sum_k \lambda_k \Pi_k(\theta), \qquad \sum_k \lambda_k = 1,$$
and, under $\Pi_k$, independently
$$\frac{\theta_i}{\tau_i} \sim g, \quad \text{where } \tau_i > 0.$$
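To make the construction concrete, here is a minimal sketch (not from the slides) of sampling from such a sieve prior, assuming $\lambda_k \propto e^{-k \log k}$, a constant scale $\tau_i = \tau$, and $g$ standard Gaussian; the function name `sample_sieve_prior` and the truncation `k_max` are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sieve_prior(k_max=50, tau=1.0, rng=rng):
    """Draw theta from a sieve prior, truncated at dimension k_max.

    Step 1: pick a dimension k with weight lambda_k proportional to
            exp(-k log k).
    Step 2: given k, draw theta_i = tau * Z_i with Z_i ~ g = N(0, 1),
            i.e. theta_i / tau ~ g.
    """
    k = np.arange(1, k_max + 1)
    log_lam = -k * np.log(k)               # log lambda_k up to a constant
    lam = np.exp(log_lam - log_lam.max())
    lam /= lam.sum()                       # normalize so sum_k lambda_k = 1
    dim = rng.choice(k, p=lam)             # random truncation level k
    return tau * rng.standard_normal(dim)  # k coordinates; the rest are zero

theta = sample_sieve_prior()
```

Because $\lambda_k$ decays like $e^{-k\log k}$, most prior draws are low-dimensional; higher-dimensional models only gain weight through the likelihood in the posterior.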
8. We define four different divergences
$$K(f, g) = \int f \log(f/g)\, d\mu, \qquad V_{p,0}(f, g) = \int f \left|\log(f/g) - K(f, g)\right|^p d\mu,$$
$$\overline{K}(f, g) = \int p_0^{(n)} \left|\log(f/g)\right| d\mu, \qquad \overline{V}_{p,0}(f, g) = \int p_0^{(n)} \left|\log(f/g) - \overline{K}(f, g)\right|^p d\mu.$$
9. Define a Kullback–Leibler neighborhood
$$B_n = \left\{ \theta : K\left(p_0^{(n)}, p_\theta^{(n)}\right) \le n\varepsilon_n^2, \; V_{p,0}\left(p_0^{(n)}, p_\theta^{(n)}\right) \le \left(n\varepsilon_n^2\right)^{p/2} \right\}.$$
We use a semimetric $d_n$ on $\Theta$, and define $\Theta_n = \left\{ \theta \in \mathbb{R}^{k_n}, \|\theta\| \le \omega_n \right\}$ with $k_n = k_0 n\varepsilon_n^2 / \log n$ and $\omega_n$ some power of $n$.
The posterior distribution is defined by
$$\Pi\left(B \mid X^{(n)}\right) = \frac{\int_B p_\theta^{(n)}\left(X^{(n)}\right) d\Pi(\theta)}{\int_\Theta p_\theta^{(n)}\left(X^{(n)}\right) d\Pi(\theta)}.$$
12. Assumptions
Assumption 1 On the prior
Assume there exist $a, b, c, d > 0$ such that $\lambda_k$ and $g$ satisfy
$$e^{-ak\log k} \le \lambda_k \le e^{-bk\log k},$$
$$A e^{-A_1 |t|^d} \le g(t) \le B e^{-B_1 |t|^d},$$
$$\exists\, T, \tau_0 > 0 \ \text{s.t.}\ \min_{i \le k_n} \tau_i \ge n^{-T} \ \text{and}\ \max_{i > 0} \tau_i \le \tau_0 < \infty,$$
$$\sum_{i=1}^{k_n} |\theta_{0i}|^d / \tau_i^d \le C k_n \log n.$$
Assumption 2 On the rate of convergence
The rate of convergence $\varepsilon_n$ is bounded below by the two inequalities
$$K\left(p_0^{(n)}, p_{0k_n}^{(n)}\right) \le n\varepsilon_n^2, \quad \text{and} \quad V_{p,0}\left(p_0^{(n)}, p_{0k_n}^{(n)}\right) \le \left(n\varepsilon_n^2\right)^{p/2}.$$
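As a quick illustration not spelled out on the slide, the two-sided tail condition on $g$ is satisfied exactly by standard choices:

```latex
% Gaussian density: bounds hold with d = 2, A = B = (2\pi)^{-1/2}, A_1 = B_1 = 1/2:
g(t) = \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2} \quad (d = 2);
% Laplace density: bounds hold with d = 1, A = B = 1/2, A_1 = B_1 = 1:
\qquad g(t) = \frac{1}{2}\, e^{-|t|} \quad (d = 1).
```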
14. Assumption 3 On divergences
$K$ and $V_{p,0}$ satisfy
$$K\left(p_{0k_n}^{(n)}, p_\theta^{(n)}\right) \le C\, \frac{n}{2}\, \|\theta_{0k_n} - \theta\|^2, \qquad V_{p,0}\left(p_{0k_n}^{(n)}, p_\theta^{(n)}\right) \le C n^{p/2} \|\theta_{0k_n} - \theta\|^p.$$
Assumption 4 On the semimetric $d_n$
There exist $G_0, G > 0$ such that, for any two $\theta, \theta'$,
$$d_n(\theta, \theta') \le C k_n^{G_0} \|\theta - \theta'\|^G.$$
15. Assumption 5 Test condition
There exist constants $c_1, \zeta > 0$ such that for every $\varepsilon > 0$ and for each $\theta_1$ such that $d_n(\theta_1, \theta_0) > \varepsilon$, one can construct a test statistic $\phi_n \in [0, 1]$ which satisfies
$$\mathbb{E}_0^{(n)} \phi_n \le e^{-c_1 n \varepsilon^2}, \qquad \sup_{d_n(\theta, \theta_1) < \varepsilon\zeta} \mathbb{E}_\theta^{(n)} (1 - \phi_n) \le e^{-c_1 n \varepsilon^2}.$$
18. Results
Theorem Posterior concentration rate
The rate of convergence of the posterior distribution relative to $d_n$ is $\varepsilon_n$:
$$\mathbb{E}_0^{(n)} \Pi\left( d_n^2(\theta, \theta_0) \ge M \varepsilon_n^2 \mid X^{(n)} \right) \to 0.$$
Corollary Risk convergence rate
If the assumptions are satisfied with $p > 2$, and if $d_n$ is bounded, then the integrated posterior risk given $\theta_0$ and $\Pi$ converges at least at the same rate $\varepsilon_n$:
$$R_n^{d_n}(\theta_0, \Pi) = \mathbb{E}_0^{(n)} \mathbb{E}^\Pi\left[ d_n^2(\theta, \theta_0) \mid X^{(n)} \right] = O\left(\varepsilon_n^2\right).$$
19. Suppose the true parameter $\theta_0$ has Sobolev regularity $\beta > 1/2$:
$$\Theta_\beta(Q_0) = \left\{ \theta : \sum_{i=1}^{\infty} \theta_i^2 i^{2\beta} \le Q_0 < \infty \right\}.$$
Then the assumption of the following Corollary holds in the Gaussian white noise model and in the regression model. For these models, the rate given in the Corollary coincides with the minimax rate (up to a $\log n$ term): it is in this sense adaptive optimal.
20. Corollary
If $\theta_0 \in \Theta_\beta(Q_0)$ and
$$K\left(p_0^{(n)}, p_{0k_n}^{(n)}\right) \le C n \|\theta_0 - \theta_{0k_n}\|^2, \qquad V_{p,0}\left(p_0^{(n)}, p_{0k_n}^{(n)}\right) \le C n^{p/2} \|\theta_0 - \theta_{0k_n}\|^p,$$
then the rate $\varepsilon_n$ is
$$\varepsilon_n = \varepsilon_0 \left( \frac{\log n}{n} \right)^{\frac{\beta}{2\beta+1}}.$$
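A heuristic not on the slides, sketched under the assumption that the corollary's bounds are tight: the rate can be read off by balancing the truncation bias over the Sobolev ball against the per-model cost $k \log n / n$ coming from the prior weights $\lambda_k$:

```latex
\|\theta_0 - \theta_{0k}\|^2
= \sum_{i > k} \theta_{0i}^2
\le k^{-2\beta} \sum_{i > k} \theta_{0i}^2\, i^{2\beta}
\le Q_0\, k^{-2\beta},
\qquad
k^{-2\beta} \asymp \frac{k \log n}{n}
\;\Longrightarrow\;
k_n \asymp \left( \frac{n}{\log n} \right)^{\frac{1}{2\beta+1}},
\quad
\varepsilon_n^2 \asymp \left( \frac{\log n}{n} \right)^{\frac{2\beta}{2\beta+1}}.
```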
24. White noise model
$$dX^n(t) = f_0(t)\, dt + \frac{1}{\sqrt{n}}\, dW(t), \quad 0 \le t \le 1.$$
By projection on a Fourier basis $(\phi_i)$, this is equivalent to the normal mean model
$$X_i^n = \theta_{0i} + \frac{1}{\sqrt{n}}\, \xi_i, \quad i = 1, 2, \ldots$$
Global $L^2$ loss:
$$R_n^{L^2} = \mathbb{E}_0^{(n)} \left\| \hat{f}_n - f_0 \right\|^2 = \mathbb{E}_0^{(n)} \sum_{i=1}^{\infty} \left( \hat{\theta}_{ni} - \theta_{0i} \right)^2.$$
Pointwise $\ell^2$ loss at a point $t$ (with $a_i = \phi_i(t)$):
$$R_n^{\ell^2} = \mathbb{E}_0^{(n)} \left( \hat{f}_n(t) - f_0(t) \right)^2 = \mathbb{E}_0^{(n)} \left( \sum_{i=1}^{\infty} a_i \left( \hat{\theta}_{ni} - \theta_{0i} \right) \right)^2.$$
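As an illustration (not from the slides), a minimal sketch of the sieve-prior posterior in this normal mean model, assuming Gaussian $g$ with a constant scale $\tau$, $\lambda_k \propto e^{-k \log k}$, and a hypothetical truncation `k_max`. With these conjugate choices, each model's marginal likelihood has a closed form, so the posterior weights over the truncation level $k$ can be computed exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_norm_pdf(x, var):
    """Elementwise log density of N(0, var) at x."""
    return -0.5 * (np.log(2 * np.pi * var) + x**2 / var)

def posterior_over_k(x, n, tau=1.0):
    """Posterior weights of the truncation level k under the sieve prior.

    Under model k: theta_i ~ N(0, tau^2) for i <= k and theta_i = 0 for
    i > k, so marginally X_i ~ N(0, tau^2 + 1/n) for i <= k and
    X_i ~ N(0, 1/n) for i > k.  Prior weights lambda_k ∝ exp(-k log k).
    """
    ks = np.arange(1, x.size + 1)
    cum_sig = np.cumsum(log_norm_pdf(x, tau**2 + 1.0 / n))
    cum_noise = np.cumsum(log_norm_pdf(x, 1.0 / n))
    # log marginal likelihood of model k: "signal" terms up to k, "noise" after
    log_marg = cum_sig[ks - 1] + (cum_noise[-1] - cum_noise[ks - 1])
    log_post = log_marg - ks * np.log(ks)   # add log lambda_k
    w = np.exp(log_post - log_post.max())
    return w / w.sum()

# Truth: a Sobolev-type sequence theta_{0i} = i^{-3/2}, noise level 1/sqrt(n)
n, k_max = 1000, 30
theta0 = np.arange(1, k_max + 1, dtype=float) ** (-1.5)
x = theta0 + rng.standard_normal(k_max) / np.sqrt(n)
weights = posterior_over_k(x, n)
```

The weights favor moderate $k$: coordinates whose signal exceeds the noise level $1/\sqrt{n}$ are worth the $\lambda_k$ penalty, later ones are not, which mirrors the bias–dimension balance behind the rate $\varepsilon_n$.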
25. Results in the white noise model
We show that the model satisfies Assumptions 1 to 5.
Proposition
Under global loss, the concentration and risk rates are adaptive optimal:
$$\mathbb{E}_0^{(n)} \Pi\left( \|\theta - \theta_0\|^2 \ge M \varepsilon_n^2 \mid X^{(n)} \right) \to 0,$$
$$R_n^{L^2}(\theta_0, \Pi) = \mathbb{E}_0^{(n)} \mathbb{E}^\Pi\left[ \|\theta - \theta_0\|^2 \mid X^{(n)} \right] = O\left(\varepsilon_n^2\right).$$
27. Pointwise loss
The pointwise $\ell^2$ loss does not satisfy Assumption 4. We can show the following lower bound on the rate of the associated risk.
Proposition
Under pointwise loss, a lower bound on the frequentist risk rate is given by
$$\sup_{\theta_0 \in \Theta_\beta(Q_0)} R_n^{\ell^2}(\theta_0, \Pi) \gtrsim \frac{n^{-\frac{2\beta - 1}{2\beta + 1}}}{\log^2 n}.$$
A globally optimal estimator cannot be pointwise optimal (result stated by Cai, Low and Zhao, 2007).
There is thus a penalty from global to pointwise loss of (up to a $\log n$ term)
$$n^{\frac{1}{2\beta(2\beta+1)}}.$$
31. Conclusion
• We have first derived posterior concentration and risk convergence rates for a variety of models that accommodate a sieve prior.
• In a second result we have obtained a lower bound for the frequentist risk under pointwise loss, showing that the sieve prior does not achieve the optimal rate under pointwise loss.
• Further work should focus on the posterior concentration rate under pointwise loss.