1. Bayesian optimal adaptive estimation using a sieve prior
YES IV Workshop
Julyan Arbel, arbel@ensae.fr
ENSAE-CREST-Université Paris Dauphine
November 9, 2010
4. Introduction
• Posterior concentration rate and risk convergence rate in a Bayesian nonparametric setting.
• Results in the same spirit as those of Ghosal, Ghosh and van der Vaart (2000) and Ghosal and van der Vaart (2007), in the specific case of models suitable for the use of sieve priors.
• Use of a family of sieve priors (introduced by Zhao (2000) in the white noise model).
• Infinite-dimensional parameter from a Sobolev smoothness class.
7. Notations
• Consider a model $(\mathcal{X}^{(n)}, \mathcal{A}^{(n)}, P_\theta^{(n)} : \theta \in \Theta)$ with observations $X^{(n)} = (X_i^n)_{1 \le i \le n}$, and
$$\Theta = \bigcup_{k=1}^{\infty} \mathbb{R}^k.$$
• Denote by $\theta_0$ the parameter associated with the true model. Densities are denoted $p_\theta^{(n)}$ ($p_0^{(n)}$ for $\theta_0$). The vector of the first $k$ coordinates of $\theta_0$ is denoted $\theta_{0k}$.
• A sieve prior $\Pi$ on $\Theta$ is defined as follows:
$$\Pi(\theta) = \sum_k \lambda_k \Pi_k(\theta), \qquad \sum_k \lambda_k = 1,$$
and, under $\Pi_k$, independently
$$\frac{\theta_i}{\tau_i} \sim g, \quad \text{where } \tau_i > 0.$$
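To make the construction concrete, here is a minimal sketch (not from the slides) of sampling from such a sieve prior, assuming $\lambda_k \propto e^{-k \log k}$, a constant scale $\tau_i = \tau$, and $g$ standard Gaussian; the function name `sample_sieve_prior` and the truncation `k_max` are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sieve_prior(k_max=50, tau=1.0, rng=rng):
    """Draw theta from a sieve prior, truncated at dimension k_max.

    Step 1: pick a dimension k with weight lambda_k proportional to
            exp(-k log k).
    Step 2: given k, draw theta_i = tau * Z_i with Z_i ~ g = N(0, 1),
            i.e. theta_i / tau ~ g.
    """
    k = np.arange(1, k_max + 1)
    log_lam = -k * np.log(k)               # log lambda_k up to a constant
    lam = np.exp(log_lam - log_lam.max())
    lam /= lam.sum()                       # normalize so sum_k lambda_k = 1
    dim = rng.choice(k, p=lam)             # random truncation level k
    return tau * rng.standard_normal(dim)  # k coordinates; the rest are zero

theta = sample_sieve_prior()
```

Because $\lambda_k$ decays like $e^{-k\log k}$, most prior draws are low-dimensional; higher-dimensional models only gain weight through the likelihood in the posterior.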
8. We define four different divergences
$$K(f, g) = \int f \log(f/g)\, d\mu, \qquad V_{p,0}(f, g) = \int f \left|\log(f/g) - K(f, g)\right|^p d\mu,$$
$$\overline{K}(f, g) = \int p_0^{(n)} \left|\log(f/g)\right| d\mu, \qquad \overline{V}_{p,0}(f, g) = \int p_0^{(n)} \left|\log(f/g) - \overline{K}(f, g)\right|^p d\mu.$$
9. Define a Kullback–Leibler neighborhood
$$B_n = \left\{ \theta : K\left(p_0^{(n)}, p_\theta^{(n)}\right) \le n\varepsilon_n^2, \; V_{p,0}\left(p_0^{(n)}, p_\theta^{(n)}\right) \le \left(n\varepsilon_n^2\right)^{p/2} \right\}.$$
We use a semimetric $d_n$ on $\Theta$, and define $\Theta_n = \left\{ \theta \in \mathbb{R}^{k_n}, \|\theta\| \le \omega_n \right\}$ with $k_n = k_0 n\varepsilon_n^2 / \log n$ and $\omega_n$ some power of $n$.
The posterior distribution is defined by
$$\Pi\left(B \mid X^{(n)}\right) = \frac{\int_B p_\theta^{(n)}\left(X^{(n)}\right) d\Pi(\theta)}{\int_\Theta p_\theta^{(n)}\left(X^{(n)}\right) d\Pi(\theta)}.$$
12. Assumptions
Assumption 1 On the prior
Assume there exist $a, b, c, d > 0$ such that $\lambda_k$ and $g$ satisfy
$$e^{-ak\log k} \le \lambda_k \le e^{-bk\log k},$$
$$A e^{-A_1 |t|^d} \le g(t) \le B e^{-B_1 |t|^d},$$
$$\exists\, T, \tau_0 > 0 \ \text{s.t.}\ \min_{i \le k_n} \tau_i \ge n^{-T} \ \text{and}\ \max_{i > 0} \tau_i \le \tau_0 < \infty,$$
$$\sum_{i=1}^{k_n} |\theta_{0i}|^d / \tau_i^d \le C k_n \log n.$$
Assumption 2 On the rate of convergence
The rate of convergence $\varepsilon_n$ is bounded below by the two inequalities
$$K\left(p_0^{(n)}, p_{0k_n}^{(n)}\right) \le n\varepsilon_n^2, \quad \text{and} \quad V_{p,0}\left(p_0^{(n)}, p_{0k_n}^{(n)}\right) \le \left(n\varepsilon_n^2\right)^{p/2}.$$
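As a quick illustration not spelled out on the slide, the two-sided tail condition on $g$ is satisfied exactly by standard choices:

```latex
% Gaussian density: bounds hold with d = 2, A = B = (2\pi)^{-1/2}, A_1 = B_1 = 1/2:
g(t) = \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2} \quad (d = 2);
% Laplace density: bounds hold with d = 1, A = B = 1/2, A_1 = B_1 = 1:
\qquad g(t) = \frac{1}{2}\, e^{-|t|} \quad (d = 1).
```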
14. Assumption 3 On divergences
$K$ and $V_{p,0}$ satisfy
$$K\left(p_{0k_n}^{(n)}, p_\theta^{(n)}\right) \le C\, \frac{n}{2}\, \|\theta_{0k_n} - \theta\|^2, \qquad V_{p,0}\left(p_{0k_n}^{(n)}, p_\theta^{(n)}\right) \le C n^{p/2} \|\theta_{0k_n} - \theta\|^p.$$
Assumption 4 On the semimetric $d_n$
There exist $G_0, G > 0$ such that, for any two $\theta, \theta'$,
$$d_n(\theta, \theta') \le C k_n^{G_0} \|\theta - \theta'\|^G.$$
15. Assumption 5 Test condition
There exist constants $c_1, \zeta > 0$ such that for every $\varepsilon > 0$ and for each $\theta_1$ such that $d_n(\theta_1, \theta_0) > \varepsilon$, one can construct a test statistic $\phi_n \in [0, 1]$ which satisfies
$$\mathbb{E}_0^{(n)} \phi_n \le e^{-c_1 n \varepsilon^2}, \qquad \sup_{d_n(\theta, \theta_1) < \varepsilon\zeta} \mathbb{E}_\theta^{(n)} (1 - \phi_n) \le e^{-c_1 n \varepsilon^2}.$$
18. Results
Theorem Posterior concentration rate
The rate of convergence of the posterior distribution relative to $d_n$ is $\varepsilon_n$:
$$\mathbb{E}_0^{(n)} \Pi\left( d_n^2(\theta, \theta_0) \ge M \varepsilon_n^2 \mid X^{(n)} \right) \to 0.$$
Corollary Risk convergence rate
If the assumptions are satisfied with $p > 2$, and if $d_n$ is bounded, then the integrated posterior risk given $\theta_0$ and $\Pi$ converges at least at the same rate $\varepsilon_n$:
$$R_n^{d_n}(\theta_0, \Pi) = \mathbb{E}_0^{(n)} \mathbb{E}^\Pi\left[ d_n^2(\theta, \theta_0) \mid X^{(n)} \right] = O\left(\varepsilon_n^2\right).$$
19. Suppose the true parameter $\theta_0$ has Sobolev regularity $\beta > 1/2$:
$$\Theta_\beta(Q_0) = \left\{ \theta : \sum_{i=1}^{\infty} \theta_i^2 i^{2\beta} \le Q_0 < \infty \right\}.$$
Then the assumption of the following Corollary holds in the Gaussian white noise model and in the regression model. For these models, the rate given in the Corollary coincides with the minimax rate (up to a $\log n$ term): it is in this sense adaptive optimal.
20. Corollary
If $\theta_0 \in \Theta_\beta(Q_0)$ and
$$K\left(p_0^{(n)}, p_{0k_n}^{(n)}\right) \le C n \|\theta_0 - \theta_{0k_n}\|^2, \qquad V_{p,0}\left(p_0^{(n)}, p_{0k_n}^{(n)}\right) \le C n^{p/2} \|\theta_0 - \theta_{0k_n}\|^p,$$
then the rate $\varepsilon_n$ is
$$\varepsilon_n = \varepsilon_0 \left( \frac{\log n}{n} \right)^{\frac{\beta}{2\beta+1}}.$$
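A heuristic not on the slides, sketched under the assumption that the corollary's bounds are tight: the rate can be read off by balancing the truncation bias over the Sobolev ball against the per-model cost $k \log n / n$ coming from the prior weights $\lambda_k$:

```latex
\|\theta_0 - \theta_{0k}\|^2
= \sum_{i > k} \theta_{0i}^2
\le k^{-2\beta} \sum_{i > k} \theta_{0i}^2\, i^{2\beta}
\le Q_0\, k^{-2\beta},
\qquad
k^{-2\beta} \asymp \frac{k \log n}{n}
\;\Longrightarrow\;
k_n \asymp \left( \frac{n}{\log n} \right)^{\frac{1}{2\beta+1}},
\quad
\varepsilon_n^2 \asymp \left( \frac{\log n}{n} \right)^{\frac{2\beta}{2\beta+1}}.
```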
24. White noise model
$$dX^n(t) = f_0(t)\, dt + \frac{1}{\sqrt{n}}\, dW(t), \quad 0 \le t \le 1.$$
By projection on a Fourier basis $(\phi_i)$, this is equivalent to the normal mean model
$$X_i^n = \theta_{0i} + \frac{1}{\sqrt{n}}\, \xi_i, \quad i = 1, 2, \ldots$$
Global $L^2$ loss:
$$R_n^{L^2} = \mathbb{E}_0^{(n)} \left\| \hat{f}_n - f_0 \right\|^2 = \mathbb{E}_0^{(n)} \sum_{i=1}^{\infty} \left( \hat{\theta}_{ni} - \theta_{0i} \right)^2.$$
Pointwise $\ell^2$ loss at a point $t$ (with $a_i = \phi_i(t)$):
$$R_n^{\ell^2} = \mathbb{E}_0^{(n)} \left( \hat{f}_n(t) - f_0(t) \right)^2 = \mathbb{E}_0^{(n)} \left( \sum_{i=1}^{\infty} a_i \left( \hat{\theta}_{ni} - \theta_{0i} \right) \right)^2.$$
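As an illustration (not from the slides), a minimal sketch of the sieve-prior posterior in this normal mean model, assuming Gaussian $g$ with a constant scale $\tau$, $\lambda_k \propto e^{-k \log k}$, and a hypothetical truncation `k_max`. With these conjugate choices, each model's marginal likelihood has a closed form, so the posterior weights over the truncation level $k$ can be computed exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_norm_pdf(x, var):
    """Elementwise log density of N(0, var) at x."""
    return -0.5 * (np.log(2 * np.pi * var) + x**2 / var)

def posterior_over_k(x, n, tau=1.0):
    """Posterior weights of the truncation level k under the sieve prior.

    Under model k: theta_i ~ N(0, tau^2) for i <= k and theta_i = 0 for
    i > k, so marginally X_i ~ N(0, tau^2 + 1/n) for i <= k and
    X_i ~ N(0, 1/n) for i > k.  Prior weights lambda_k ∝ exp(-k log k).
    """
    ks = np.arange(1, x.size + 1)
    cum_sig = np.cumsum(log_norm_pdf(x, tau**2 + 1.0 / n))
    cum_noise = np.cumsum(log_norm_pdf(x, 1.0 / n))
    # log marginal likelihood of model k: "signal" terms up to k, "noise" after
    log_marg = cum_sig[ks - 1] + (cum_noise[-1] - cum_noise[ks - 1])
    log_post = log_marg - ks * np.log(ks)   # add log lambda_k
    w = np.exp(log_post - log_post.max())
    return w / w.sum()

# Truth: a Sobolev-type sequence theta_{0i} = i^{-3/2}, noise level 1/sqrt(n)
n, k_max = 1000, 30
theta0 = np.arange(1, k_max + 1, dtype=float) ** (-1.5)
x = theta0 + rng.standard_normal(k_max) / np.sqrt(n)
weights = posterior_over_k(x, n)
```

The weights favor moderate $k$: coordinates whose signal exceeds the noise level $1/\sqrt{n}$ are worth the $\lambda_k$ penalty, later ones are not, which mirrors the bias–dimension balance behind the rate $\varepsilon_n$.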
25. Results in the white noise model
We show that the model satisfies Assumptions 1 to 5.
Proposition
Under global loss, the concentration and risk rates are adaptive optimal:
$$\mathbb{E}_0^{(n)} \Pi\left( \|\theta - \theta_0\|^2 \ge M \varepsilon_n^2 \mid X^{(n)} \right) \to 0,$$
$$R_n^{L^2}(\theta_0, \Pi) = \mathbb{E}_0^{(n)} \mathbb{E}^\Pi\left[ \|\theta - \theta_0\|^2 \mid X^{(n)} \right] = O\left(\varepsilon_n^2\right).$$
27. Pointwise loss
The pointwise $\ell^2$ loss does not satisfy Assumption 4. We can show the following lower bound on the rate of the associated risk.
Proposition
Under pointwise loss, a lower bound on the frequentist risk rate is given by
$$\sup_{\theta_0 \in \Theta_\beta(Q_0)} R_n^{\ell^2}(\theta_0, \Pi) \gtrsim \frac{n^{-\frac{2\beta - 1}{2\beta + 1}}}{\log^2 n}.$$
A globally optimal estimator cannot be pointwise optimal (result stated by Cai, Low and Zhao, 2007).
There is thus a penalty from global to pointwise loss of (up to a $\log n$ term)
$$n^{\frac{1}{2\beta(2\beta+1)}}.$$
31. Conclusion
• We have first derived posterior concentration and risk convergence rates for a variety of models that accommodate a sieve prior.
• In a second result we have obtained a lower bound for the frequentist risk under pointwise loss, showing that the sieve prior does not achieve the optimal rate under pointwise loss.
• Further work should focus on the posterior concentration rate under pointwise loss.