Shape restrictions such as monotonicity often arise naturally. In this talk, we consider a Bayesian approach to monotone nonparametric regression with normal errors. We assign a prior through piecewise constant functions and impose a conjugate normal prior on the coefficients. Since the resulting functions need not be monotone, we project samples from the posterior onto the allowed parameter space to construct a "projection posterior". We obtain the limit of a suitably centered and scaled posterior distribution for the function value at a point. The limit distribution has interesting similarities with, and differences from, the corresponding limit distribution of the maximum likelihood estimator. By comparing the quantiles of these two distributions, we observe an interesting new phenomenon: the coverage of a credible interval can exceed the credibility level, the exact opposite of a phenomenon observed by Cox for smooth regression. We describe a recalibration strategy that modifies the credible interval to meet the correct level of coverage.
This talk is based on joint work with Moumita Chakraborty, a doctoral student at North Carolina State University.
MUMS: Bayesian, Fiducial, and Frequentist Conference - Coverage of Credible Intervals for Monotone Regression, Subhashis Ghosal, April 29, 2019
1. Coverage of credible intervals for monotone regression
Subhashis Ghosal,
North Carolina State University
BFF Conferences, Duke University, April 29, 2019
1 / 29
4. Monotone regression
$Y_i = f(X_i) + \varepsilon_i$, $\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$, $i = 1, \dots, n$, independent of
$X_1, \dots, X_n \sim G$, where $G$ has a bounded positive density $g$.
$f : [0, 1] \to \mathbb{R}$ is monotone.
We wish to make inference on $f$. More specifically, we shall construct a
credible interval for $f(x_0)$ at a fixed point with guaranteed frequentist
coverage.
For the theoretical study, the errors will only be assumed to be sub-Gaussian
with mean zero.
5. Constructing a prior for f, with monotonicity in mind
A finite random series of step functions:
$f = \sum_{j=1}^{J} \theta_j \mathbb{1}(\xi_{j-1} < x \le \xi_j)$,
where $0 = \xi_0 < \xi_1 < \cdots < \xi_{J-1} < \xi_J = 1$ are the knots, $\theta_1, \dots, \theta_J$
are the coefficients, and $J$ is the number of terms.
For the present purpose, it suffices to work with a deterministic choice
of J and equispaced knots.
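As a small numeric illustration, evaluating such a step function is elementary. A numpy-only sketch, assuming the equispaced knots $\xi_j = j/J$ from the slide (the coefficient values are illustrative):

```python
import numpy as np

def step_function(theta, x):
    """Evaluate f(x) = sum_j theta_j 1(xi_{j-1} < x <= xi_j)
    with equispaced knots xi_j = j/J on (0, 1]."""
    theta = np.asarray(theta)
    J = len(theta)
    # x in (xi_{j-1}, xi_j] falls in bin j, i.e. 0-based index j - 1
    j = np.clip(np.ceil(np.asarray(x) * J).astype(int) - 1, 0, J - 1)
    return theta[j]

theta = np.array([0.1, 0.4, 0.5, 0.9])           # monotone coefficients, J = 4
print(step_function(theta, [0.10, 0.30, 0.80]))  # -> [0.1 0.4 0.9]
```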
6. Prior
We need to put a prior on $\theta = (\theta_1, \dots, \theta_J)$.
The conjugate prior for $\theta$ is normal; it can be chosen to be a product
of independent normals.
This does not give monotone functions:
$f = \sum_{j=1}^{J} \theta_j \mathbb{1}(\xi_{j-1} < x \le \xi_j)$ is monotone if and only if
$\theta_1 \le \cdots \le \theta_J$.
A simple way to achieve this is to restrict the normal prior to the cone
$\{\theta : \theta_1 \le \cdots \le \theta_J\}$.
Technically this is also conjugate, but it is not easy to work with
theoretically (and, to some extent, computationally).
7. Projection posterior
"If the model space is too complicated for direct prior imposition,
posterior computation, and theoretical analysis, a prior is put on a
larger space and the posterior is projected onto the desired subset."
In this context, samples can be generated from the conjugate normal
posterior and then isotonized to $\theta^*_1 \le \cdots \le \theta^*_J$: minimize
$\sum_{j=1}^{J} N_j (\theta^*_j - \theta_j)^2$ subject to $\theta^*_1 \le \cdots \le \theta^*_J$, where
$N_j = \sum_{i=1}^{n} \mathbb{1}(\xi_{j-1} < X_i \le \xi_j)$.
There are efficient algorithms for this purpose, such as PAVA
(discussed later).
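The unrestricted conjugate step itself is elementary. A minimal numpy sketch, assuming independent $N(0, v)$ priors on the $\theta_j$ and known $\sigma^2$ (the prior variance $v$ and all numbers below are illustrative choices, not from the talk):

```python
import numpy as np

def bin_posterior(x, y, J, sigma2, v=100.0):
    """Posterior mean and variance of each theta_j under independent
    N(0, v) priors, N(0, sigma2) errors, equispaced knots xi_j = j/J."""
    bins = np.clip(np.ceil(x * J).astype(int) - 1, 0, J - 1)
    N = np.bincount(bins, minlength=J)             # counts N_j per bin
    S = np.bincount(bins, weights=y, minlength=J)  # sums of Y per bin
    prec = N / sigma2 + 1.0 / v                    # posterior precision
    return (S / sigma2) / prec, 1.0 / prec         # mean, variance

rng = np.random.default_rng(0)
x = rng.uniform(size=500)
y = np.floor(4 * x) / 4 + rng.normal(scale=0.1, size=500)  # monotone step truth
mean, var = bin_posterior(x, y, J=4, sigma2=0.01)
draw = mean + np.sqrt(var) * rng.normal(size=4)  # one posterior sample of theta
# draw need not be monotone; it is then isotonized (weighted PAVA) as above
```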
8. Why is the projection posterior good?
If the unrestricted posterior concentrates, then so does the projection
posterior, around a monotone function:
$\sum_{j=1}^{J} |\theta^*_j - \theta_{0j}| \le \sum_{j=1}^{J} |\theta^*_j - \theta_j| + \sum_{j=1}^{J} |\theta_j - \theta_{0j}| \le 2 \sum_{j=1}^{J} |\theta_j - \theta_{0j}|,$
implying that $\|f^* - f_0\|_1 \le 2 \|f - f_0\|_1$.
Thus accuracy of the projection posterior is inherited from the
unrestricted posterior, and hence it suffices to study the latter.
9. Why do step functions give good approximations?
Approximation properties of step functions:
To approximate a monotone function using $J$ steps, we can get the
optimal approximation error $J^{-1}$
using equal-length intervals in terms of the $L_1$-distance;
using varying intervals for the $L_p$-distance (equal intervals give a
suboptimal $J^{-1/p}$ rate).
The coefficients can be chosen to be monotone.
No additional smoothness is needed.
10. Handling unknown σ
Unknown $\sigma$ can be handled by a plug-in or a fully Bayes approach.
Integrating out $\theta$ by conjugacy, an explicit expression for the marginal
likelihood of $\sigma$ is obtained, and the MLE can be shown to be
consistent.
In the fully Bayes approach, $\sigma^2$ is given a conjugate inverse gamma
prior. This leads to a consistent posterior for $\sigma$.
Effectively σ can almost be treated as known.
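The inverse gamma step can be sketched as a generic normal-model Gibbs update (the IG(a0, b0) hyperparameters and the stand-in residuals are illustrative assumptions, not values from the talk):

```python
import numpy as np

def draw_sigma2(residuals, a0=2.0, b0=1.0, rng=None):
    """One Gibbs draw of sigma^2 from its inverse gamma full conditional
    IG(a0 + n/2, b0 + sum(residuals^2)/2) under an IG(a0, b0) prior."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(residuals)
    shape = a0 + n / 2.0
    rate = b0 + 0.5 * float(np.sum(np.square(residuals)))
    return rate / rng.gamma(shape)   # inverse gamma via a reciprocal gamma draw

rng = np.random.default_rng(1)
res = rng.normal(scale=2.0, size=2000)   # stand-in residuals, true sigma = 2
draws = np.array([draw_sigma2(res, rng=rng) for _ in range(200)])
```

With this much data, the posterior draws of $\sigma^2$ concentrate near the true value 4, which is what makes $\sigma$ "almost known" in the analysis.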
12. The slogcm operator
The solution $\theta^*_j$ of the isotonization problem is given by the slope
(left derivative) of the greatest convex minorant (slogcm) of the cumulative
sum diagram
$\left\{(0, 0),\ \left(n^{-1} \textstyle\sum_{k=1}^{j} N_k,\ n^{-1} \sum_{k=1}^{j} N_k \bar{Y}_k\right)_{j=1}^{J}\right\}$
at the point $\sum_{k=1}^{j} N_k / n$.
At a point $x_0$,
$f^*(x_0) = \mathrm{slogcm}\left\{(0, 0),\ \left(n^{-1} \textstyle\sum_{k=1}^{j} N_k,\ n^{-1} \sum_{k=1}^{j} N_k \bar{Y}_k\right)_{j=1}^{J}\right\}(\lceil x_0 J \rceil / J)$.
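The slogcm of a finite diagram can be computed directly as the segment slopes of the lower convex hull, e.g. by a monotone-chain scan. A numpy-only sketch (the example points are illustrative):

```python
import numpy as np

def slogcm(x, y):
    """Left-derivative slopes of the greatest convex minorant of the
    points (x_k, y_k); slopes[k] covers the segment (x_k, x_{k+1}]."""
    hull = [0]                                  # indices on the lower hull
    for k in range(1, len(x)):
        # pop the last hull point while it lies on or above the new chord
        while len(hull) >= 2:
            i, j = hull[-2], hull[-1]
            if (y[j] - y[i]) * (x[k] - x[j]) >= (y[k] - y[j]) * (x[j] - x[i]):
                hull.pop()
            else:
                break
        hull.append(k)
    slopes = np.empty(len(x) - 1)
    for i, j in zip(hull[:-1], hull[1:]):       # expand hull slopes to the grid
        slopes[i:j] = (y[j] - y[i]) / (x[j] - x[i])
    return slopes

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 1.0, 3.0])
print(slogcm(x, y))   # [0.5 0.5 2. ]: nondecreasing, as GCM slopes must be
```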
13. Pool adjacent violator algorithm
The greatest convex minorant of a cumulative sum diagram is
commonly obtained by the pool adjacent violators algorithm (PAVA).
PAVA algebraically describes a method of successive approximation to
the GCM in O(n) time.
Whenever it sees a violation of monotonicity between two adjacent
points (blocks), it pools them and replaces both by the same weighted
average.
Works fine for both ordinary and weighted sum of squares, as well as
several other convex criteria.
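A minimal weighted PAVA sketch, in the stack-based formulation with back-merging of blocks (a generic implementation, not the talk's code; weights default to 1):

```python
import numpy as np

def pava(y, w=None):
    """Weighted pool-adjacent-violators: least-squares isotonic fit of y
    with weights w, via a single pass with back-merging of blocks."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    means, weights, sizes = [], [], []
    for yi, wi in zip(y, w):
        means.append(yi)
        weights.append(wi)
        sizes.append(1)
        # pool while the last two blocks violate monotonicity
        while len(means) >= 2 and means[-2] > means[-1]:
            m2, w2, s2 = means.pop(), weights.pop(), sizes.pop()
            m1, w1, s1 = means.pop(), weights.pop(), sizes.pop()
            wt = w1 + w2
            means.append((w1 * m1 + w2 * m2) / wt)
            weights.append(wt)
            sizes.append(s1 + s2)
    return np.repeat(means, sizes)

print(pava([1.0, 3.0, 2.0, 4.0]))      # pools the violating pair (3, 2) into 2.5
print(pava([2.0, 1.0], w=[3.0, 1.0]))  # weighted pooling: (3*2 + 1*1)/4 = 1.75
```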
15. Credibility and coverage
Coverage is $E[\mathbb{1}(\theta \in S(X)) \mid \theta]$.
Credibility is $E[\mathbb{1}(\theta \in S(X)) \mid X]$.
Both are some sort of "projection" of the actually important quantity,
$\mathbb{1}(\theta \in S(X))$.
Together credibility and coverage may give a much more complete
picture if they are in close agreement.
In parametric problems, they often agree in view of the Bernstein-von
Mises theorem.
Even second-order matching is possible with probability matching
priors (Jeffreys' prior if there is no nuisance parameter).
In curve estimation with optimal smoothing, coverage of a credible
set may be arbitrarily low [Cox (1993)].
This is usually addressed by undersmoothing or inflating a credible
region.
16. Coverage of pointwise credible interval
Fix $x_0 \in (0, 1)$ where $f_0'(x_0) > 0$.
Consider the projection posterior given by the distribution of f ∗(x0),
where f ∗ is the isotonization of f .
Consider a $(1 - \alpha)$-credible interval (around the mean, median, or equal-tailed). Does its coverage go to $1 - \alpha$?
If there were a Bernstein-von Mises type theorem,
$\Pi(n^{1/3}(f^*(x_0) - \hat{f}(x_0)) \le z \mid \mathbb{D}_n) \to_p H(z)$ and
$P(n^{1/3}(f_0(x_0) - \hat{f}(x_0)) \le z \mid f_0) \to H(z)$, for some suitable estimator $\hat{f}$,
then the coverage would approach $1 - \alpha$, and the posterior median (mean
also, if $H$ is symmetric) would be asymptotically equivalent to $\hat{f}(x_0)$.
17. Centering estimator
What is the right choice of ˆf ?
The common estimator is the MLE: minimize $\sum_{i=1}^{n} (Y_i - f(X_i))^2$
subject to monotonicity. The solution is the isotonization of the pairs
$(X_i, Y_i)$, $i = 1, \dots, n$.
This has a limiting distribution under centering at $f_0(x_0)$ and scaling
by $n^{1/3}$, known as the Chernoff distribution: the distribution of the
argmin of a two-sided Brownian motion with a parabolic drift.
The key technique in the proof is the "switch relation": for a lower
semicontinuous function $\Phi$ on an interval $I$ and all $t \in I$, $v \in \mathbb{R}$,
$\mathrm{slogcm}(\Phi)(t) > v$ if and only if $\operatorname{argmin}^+_{s \in I} (\Phi(s) - vs) < t$,
and its mirror image for the right derivative.
Brownian motion comes through the Donsker theorem for a local
empirical process.
The Chernoff distribution is symmetric and has tails lighter than the
normal.
18. Sieve MLE
But the structure of this estimator hardly resembles that of the
prior/posterior.
Modify to the sieve MLE: minimize $\sum_{i=1}^{n} (Y_i - f(X_i))^2$ subject to
$f = \sum_{j=1}^{J} \theta_j \mathbb{1}_{I_j}$, $\theta_1 \le \cdots \le \theta_J$.
With some tweaking, for any choice $n^{1/3} \ll J \ll n^{2/3}$, the normalized
sieve MLE also has the same asymptotic Chernoff distribution $CZ$,
where $Z$ is a standard Chernoff variable, $C = 2b(a/b)^{2/3}$,
$a = \sigma^2 / g(x_0)$, $b = f_0'(x_0) / 2$.
Let $f^* = \sum_{j=1}^{J} \theta^*_j \mathbb{1}_{I_j}$ be the isotonization of a random draw from the
unrestricted posterior, so that the distribution of $f^*$ is the projection
posterior.
19. No Bernstein-von Mises type theorem
Theorem
Let $W_1, W_2$ be independent two-sided Brownian motions on $\mathbb{R}$ with
$W_1(0) = W_2(0) = 0$, $Z_2 = \arg\min\{W_1(t) + W_2(t) + t^2 : t \in \mathbb{R}\}$,
$Z_1 = \arg\min\{W_1(t) + t^2 : t \in \mathbb{R}\}$, and $C = 2b(a/b)^{2/3}$
with $a = \sigma_0^2 / g(x_0)$ and $b = f_0'(x_0) / 2$.
(a) For every $z \in \mathbb{R}$, $P_0(n^{1/3}(\hat{f}_n(x_0) - f_0(x_0)) \le z) \to P(CZ_1 \le z)$.
(b) For every $z \in \mathbb{R}$, $P_0(n^{1/3}(f^*(x_0) - f_0(x_0)) \le z) \to P(CZ_2 \le z)$.
(c) The conditional process $z \mapsto \Pi(n^{1/3}(f^*(x_0) - \hat{f}_n(x_0)) \le z \mid \mathbb{D}_n)$ does
not have a limit in probability.
20. Figure (three panels: Case 1, Case 2, Case 3)
Figure: Plot demonstrating that $\Pi(n^{1/3}(f^*(x_0) - \hat{f}_n(x_0)) \le 0 \mid \mathbb{D}_n)$ does not have a
limit in probability, using sample size $n = 2000$ in three cases.
21. Coverage of credible intervals
Let
$F^*_n(z \mid \mathbb{D}_n) = \Pi\left(n^{1/3}(f^*(x_0) - f_0(x_0)) \le z \mid \mathbb{D}_n\right)$,
$F^*_{a,b}(z \mid W_1) = P\left(2b(a/b)^{2/3} \arg\min_{t \in \mathbb{R}} V(t) \le z \mid W_1\right)$,
where $V(t) = W_1(t) + W_2(t) + t^2$. For every $n \ge 1$, $\gamma \in [0, 1]$, define
$Q_{n,\gamma} = \inf\{z \in \mathbb{R} : \Pi(f^*(x_0) \le z \mid \mathbb{D}_n) \ge 1 - \gamma\}$,
$I_{n,\gamma} = [Q_{n,1-\gamma/2},\ Q_{n,\gamma/2}]$,
$\Delta^*_{W_1,W_2} = \arg\min_{t \in \mathbb{R}} \{W_1(t) + W_2(t) + t^2\}$.
We are primarily interested in $P_0(f_0(x_0) \in I_{n,\gamma})$, the coverage of the
credible interval.
22. Limiting coverage
Theorem
(a) For every $z \in \mathbb{R}$, $F^*_n(z \mid \mathbb{D}_n) \rightsquigarrow F^*_{a,b}(z \mid W_1)$;
(b) the distribution of $F^*_{a,b}(0 \mid W_1)$ is symmetric about $1/2$;
(c) the limiting coverage of $I_{n,\gamma}$ is characterized as follows:
$P_0(f_0(x_0) \in I_{n,\gamma}) \to P\left(\tfrac{\gamma}{2} \le P(\Delta^*_{W_1,W_2} \ge 0 \mid W_1) \le 1 - \tfrac{\gamma}{2}\right)$.
23. Thus the Bayesian and frequentist distributions do not exactly tally,
meaning that the asymptotic coverage is not $1 - \alpha$. But how do they
compare?
For every $\alpha, \gamma \in [0, 1]$, define
$A(\gamma) = P\left(P(\Delta^*_{W_1,W_2} \ge 0 \mid W_1) \le \gamma\right)$, $\gamma(\alpha) = 2A^{-1}(\alpha/2)$.
Thus the theorem says that the limiting coverage of the
$(1 - \gamma)$-credible interval $I_{n,\gamma}$ is $1 - 2A(\gamma/2)$, which depends only on $\gamma$.
It does not match the nominal level (1 − γ), but something
remarkable happens using a recalibration: If the target coverage is
(1 − α), instead of starting with a (1 − α)-credible interval, start with
a (1 − γ)-credible interval, where A(γ/2) = α/2. Then the limiting
coverage (1 − α) is attained exactly.
Unlike for a confidence interval based on the MLE, there is no need to
estimate nuisance parameters.
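The map $A$ has no closed form, but it can be approximated by nested Monte Carlo over the two Brownian motions. A rough sketch; the truncation to $[-4, 4]$, the grid step, and the replication counts are all illustrative choices, and a serious computation (such as the table on the next slide) would need a much finer discretization:

```python
import numpy as np

def two_sided_bm(rng, n, h):
    """Two-sided Brownian motion W on the grid t = -n*h, ..., n*h, W(0) = 0."""
    right = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(h), n))])
    left = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(h), n))])[::-1]
    return np.concatenate([left[:-1], right])

rng = np.random.default_rng(0)
n, h = 200, 0.02                  # grid on [-4, 4]; the truncation is crude
t = np.arange(-n, n + 1) * h
M, K = 100, 100                   # outer (W1) and inner (W2) Monte Carlo sizes
u = np.empty(M)                   # u[m] estimates P(Delta* >= 0 | W1)
for m in range(M):
    w1 = two_sided_bm(rng, n, h)
    argmins = np.empty(K)
    for k in range(K):
        v = w1 + two_sided_bm(rng, n, h) + t ** 2   # V(t) = W1 + W2 + t^2
        argmins[k] = t[np.argmin(v)]
    u[m] = np.mean(argmins >= 0)

def A(gamma):
    """Monte Carlo estimate of A(gamma) = P(P(Delta* >= 0 | W1) <= gamma)."""
    return np.mean(u <= gamma)
```

Inverting the estimated $A$ at $\alpha/2$ then gives the recalibrated credibility level $1 - \gamma$.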
24. Recalibration table
1 − α 0.900 0.920 0.940 0.950 0.960 0.970 0.980 0.990
1 − γ 0.874 0.897 0.922 0.934 0.946 0.960 0.973 0.986
Simulation based, hence of limited accuracy.
This is therefore a reverse Cox phenomenon: we have to shrink a
credible interval to get nominal coverage.
25. Main idea of the proof
How do we use the weak convergence assertion?
$P_0(f_0(x_0) \le Q_{n,\gamma})$
$= P_0\left(\Pi(f^*(x_0) \le f_0(x_0) \mid \mathbb{D}_n) \le 1 - \gamma\right)$
$= P_0\left(\Pi(n^{1/3}(f^*(x_0) - f_0(x_0)) \le 0 \mid \mathbb{D}_n) \le 1 - \gamma\right)$
$= P_0\left(F^*_n(0 \mid \mathbb{D}_n) \le 1 - \gamma\right)$
$\to P\left(F^*_{a,b}(0 \mid W_1) \le 1 - \gamma\right)$
$= P\left(P\left(C \arg\min_{t \in \mathbb{R}} \{W_1(t) + W_2(t) + t^2\} \ge 0 \mid W_1\right) \le 1 - \gamma\right)$
$= P\left(P(\Delta^*_{W_1,W_2} \ge 0 \mid W_1) \le 1 - \gamma\right)$.
27. Coverage
Consider data generated from the regression function
$f_0(x) = e^{x-0.5} / (1 + e^{x-0.5})$, with $G$ uniform on $[0, 1]$.
Choose $x_0 = 0.5$, $\sigma = 1$, $J \approx n^{1/3} \log n$.
Take n = 500, 1000, 1500, 2000.
For each n, 1000 Monte Carlo samples are used.
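The data-generating step of this design can be sketched as follows (the integer rounding of $J$ is an arbitrary choice made here for illustration):

```python
import numpy as np

def simulate(n, sigma=1.0, rng=None):
    """One dataset from the design above: X ~ Uniform[0, 1],
    f0(x) = exp(x - 0.5) / (1 + exp(x - 0.5)), Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(size=n)
    f0 = np.exp(x - 0.5) / (1 + np.exp(x - 0.5))
    y = f0 + rng.normal(scale=sigma, size=n)
    J = int(round(n ** (1 / 3) * np.log(n)))   # J ~ n^{1/3} log n
    return x, y, J

x, y, J = simulate(1000, rng=np.random.default_rng(0))   # J = 69 for n = 1000
```

Each Monte Carlo replicate then feeds into the projection-posterior pipeline of the earlier slides to record whether the credible interval covers $f_0(x_0)$.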