Shape restrictions such as monotonicity often arise naturally. In this talk, we consider a Bayesian approach to monotone nonparametric regression with normal errors. We assign a prior through piecewise constant functions and impose a conjugate normal prior on the coefficients. Since the resulting functions need not be monotone, we project samples from the posterior onto the allowed parameter space to construct a "projection posterior". We obtain the limit of a suitably centered and scaled posterior distribution for the function value at a point. The limit distribution has interesting similarities with, and differences from, the corresponding limit distribution of the maximum likelihood estimator. By comparing the quantiles of these two distributions, we observe an interesting new phenomenon: the coverage of a credible interval can exceed the credibility level, the exact opposite of a phenomenon observed by Cox for smooth regression. We describe a recalibration strategy that modifies the credible interval to meet the correct level of coverage.
This talk is based on joint work with Moumita Chakraborty, a doctoral student at North Carolina State University.
MUMS: Bayesian, Fiducial, and Frequentist Conference - Coverage of Credible Intervals for Monotone Regression, Subhashis Ghosal, April 29, 2019
1. Coverage of credible intervals for monotone regression
Subhashis Ghosal,
North Carolina State University
BFF Conferences, Duke University, April 29, 2019
1 / 29
4. Monotone regression
$Y_i = f(X_i) + \varepsilon_i$, $\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$, $i = 1, \dots, n$, independent of
$X_1, \dots, X_n \sim G$, where $G$ has a bounded positive density $g$.
$f : [0, 1] \to \mathbb{R}$ is monotone.
We wish to make inference on $f$. More specifically, we shall construct a
credible interval for $f(x_0)$ at a fixed point with guaranteed frequentist
coverage.
For the theoretical study, the errors will only be assumed to be sub-Gaussian
with mean zero.
5. Constructing a prior for f, with monotonicity in mind
A finite random series of step functions:
$f = \sum_{j=1}^{J} \theta_j \mathbb{1}(\xi_{j-1} < x \le \xi_j)$,
where $0 = \xi_0 < \xi_1 < \cdots < \xi_{J-1} < \xi_J = 1$ are the knots, $\theta_1, \dots, \theta_J$
are the coefficients, and $J$ is the number of terms.
For the present purpose, it suffices to work with a deterministic choice
of J and equispaced knots.
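As a small numeric illustration, evaluating such a step function is elementary. A numpy-only sketch, assuming the equispaced knots $\xi_j = j/J$ from the slide (the coefficient values are illustrative):

```python
import numpy as np

def step_function(theta, x):
    """Evaluate f(x) = sum_j theta_j 1(xi_{j-1} < x <= xi_j)
    with equispaced knots xi_j = j/J on (0, 1]."""
    theta = np.asarray(theta)
    J = len(theta)
    # x in (xi_{j-1}, xi_j] falls in bin j, i.e. 0-based index j - 1
    j = np.clip(np.ceil(np.asarray(x) * J).astype(int) - 1, 0, J - 1)
    return theta[j]

theta = np.array([0.1, 0.4, 0.5, 0.9])           # monotone coefficients, J = 4
print(step_function(theta, [0.10, 0.30, 0.80]))  # -> [0.1 0.4 0.9]
```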
6. Prior
We need to put a prior on $\theta = (\theta_1, \dots, \theta_J)$.
The conjugate prior for $\theta$ is normal; it can be chosen to be a product
of independent normals.
This does not give monotone functions:
$f = \sum_{j=1}^{J} \theta_j \mathbb{1}(\xi_{j-1} < x \le \xi_j)$ is monotone if and only if
$\theta_1 \le \cdots \le \theta_J$.
A simple way to achieve this is to restrict the normal prior to the cone
$\{\theta : \theta_1 \le \cdots \le \theta_J\}$.
Technically this is also conjugate, but it is not easy to work with
theoretically (and, to some extent, computationally).
7. Projection posterior
"If the model space is too complicated for direct prior imposition,
posterior computation, and theoretical analysis, a prior is put on a
larger space and the posterior is projected onto the desired subset."
In this context, samples can be generated from the conjugate normal
posterior and then isotonized to $\theta^*_1 \le \cdots \le \theta^*_J$: minimize
$\sum_{j=1}^{J} N_j (\theta^*_j - \theta_j)^2$ subject to $\theta^*_1 \le \cdots \le \theta^*_J$, where
$N_j = \sum_{i=1}^{n} \mathbb{1}(\xi_{j-1} < X_i \le \xi_j)$.
There are efficient algorithms for this purpose, such as PAVA
(discussed later).
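The unrestricted conjugate step itself is elementary. A minimal numpy sketch, assuming independent $N(0, v)$ priors on the $\theta_j$ and known $\sigma^2$ (the prior variance $v$ and all numbers below are illustrative choices, not from the talk):

```python
import numpy as np

def bin_posterior(x, y, J, sigma2, v=100.0):
    """Posterior mean and variance of each theta_j under independent
    N(0, v) priors, N(0, sigma2) errors, equispaced knots xi_j = j/J."""
    bins = np.clip(np.ceil(x * J).astype(int) - 1, 0, J - 1)
    N = np.bincount(bins, minlength=J)             # counts N_j per bin
    S = np.bincount(bins, weights=y, minlength=J)  # sums of Y per bin
    prec = N / sigma2 + 1.0 / v                    # posterior precision
    return (S / sigma2) / prec, 1.0 / prec         # mean, variance

rng = np.random.default_rng(0)
x = rng.uniform(size=500)
y = np.floor(4 * x) / 4 + rng.normal(scale=0.1, size=500)  # monotone step truth
mean, var = bin_posterior(x, y, J=4, sigma2=0.01)
draw = mean + np.sqrt(var) * rng.normal(size=4)  # one posterior sample of theta
# draw need not be monotone; it is then isotonized (weighted PAVA) as above
```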
8. Why is the projection posterior good?
If the unrestricted posterior concentrates, then so does the projection
posterior, around a monotone function:
$\sum_{j=1}^{J} |\theta^*_j - \theta_{0j}| \le \sum_{j=1}^{J} |\theta^*_j - \theta_j| + \sum_{j=1}^{J} |\theta_j - \theta_{0j}| \le 2 \sum_{j=1}^{J} |\theta_j - \theta_{0j}|,$
implying that $\|f^* - f_0\|_1 \le 2 \|f - f_0\|_1$.
Thus accuracy of the projection posterior is inherited from the
unrestricted posterior, and hence it suffices to study the latter.
9. Why do step functions give good approximations?
Approximation properties of step functions:
To approximate a monotone function using $J$ steps, we can get the
optimal approximation error $J^{-1}$
using equal-length intervals in terms of the $L_1$-distance;
using varying intervals for the $L_p$-distance (equal intervals give a
suboptimal $J^{-1/p}$ rate).
The coefficients can be chosen to be monotone.
No additional smoothness is needed.
10. Handling unknown σ
Unknown $\sigma$ can be handled by a plug-in or a fully Bayes approach.
Integrating out $\theta$ by conjugacy, an explicit expression for the marginal
likelihood of $\sigma$ is obtained, and the MLE can be shown to be
consistent.
In the fully Bayes approach, $\sigma^2$ is given a conjugate inverse gamma
prior. This leads to a consistent posterior for $\sigma$.
Effectively σ can almost be treated as known.
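The inverse gamma step can be sketched as a generic normal-model Gibbs update (the IG(a0, b0) hyperparameters and the stand-in residuals are illustrative assumptions, not values from the talk):

```python
import numpy as np

def draw_sigma2(residuals, a0=2.0, b0=1.0, rng=None):
    """One Gibbs draw of sigma^2 from its inverse gamma full conditional
    IG(a0 + n/2, b0 + sum(residuals^2)/2) under an IG(a0, b0) prior."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(residuals)
    shape = a0 + n / 2.0
    rate = b0 + 0.5 * float(np.sum(np.square(residuals)))
    return rate / rng.gamma(shape)   # inverse gamma via a reciprocal gamma draw

rng = np.random.default_rng(1)
res = rng.normal(scale=2.0, size=2000)   # stand-in residuals, true sigma = 2
draws = np.array([draw_sigma2(res, rng=rng) for _ in range(200)])
```

With this much data, the posterior draws of $\sigma^2$ concentrate near the true value 4, which is what makes $\sigma$ "almost known" in the analysis.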
12. The slogcm operator
The solution $\theta^*_j$ of the isotonization problem is given by the slope
(left derivative) of the greatest convex minorant (slogcm) of the cumulative
sum diagram
$\left\{(0, 0),\ \left(n^{-1} \textstyle\sum_{k=1}^{j} N_k,\ n^{-1} \sum_{k=1}^{j} N_k \bar{Y}_k\right)_{j=1}^{J}\right\}$
at the point $\sum_{k=1}^{j} N_k / n$.
At a point $x_0$,
$f^*(x_0) = \mathrm{slogcm}\left\{(0, 0),\ \left(n^{-1} \textstyle\sum_{k=1}^{j} N_k,\ n^{-1} \sum_{k=1}^{j} N_k \bar{Y}_k\right)_{j=1}^{J}\right\}(\lceil x_0 J \rceil / J)$.
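The slogcm of a finite diagram can be computed directly as the segment slopes of the lower convex hull, e.g. by a monotone-chain scan. A numpy-only sketch (the example points are illustrative):

```python
import numpy as np

def slogcm(x, y):
    """Left-derivative slopes of the greatest convex minorant of the
    points (x_k, y_k); slopes[k] covers the segment (x_k, x_{k+1}]."""
    hull = [0]                                  # indices on the lower hull
    for k in range(1, len(x)):
        # pop the last hull point while it lies on or above the new chord
        while len(hull) >= 2:
            i, j = hull[-2], hull[-1]
            if (y[j] - y[i]) * (x[k] - x[j]) >= (y[k] - y[j]) * (x[j] - x[i]):
                hull.pop()
            else:
                break
        hull.append(k)
    slopes = np.empty(len(x) - 1)
    for i, j in zip(hull[:-1], hull[1:]):       # expand hull slopes to the grid
        slopes[i:j] = (y[j] - y[i]) / (x[j] - x[i])
    return slopes

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 1.0, 3.0])
print(slogcm(x, y))   # [0.5 0.5 2. ]: nondecreasing, as GCM slopes must be
```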
13. Pool adjacent violator algorithm
The greatest convex minorant of a cumulative sum diagram is
commonly obtained by the pool adjacent violators algorithm (PAVA).
PAVA algebraically describes a method of successive approximation to
the GCM in O(n) time.
Whenever it sees a violation of monotonicity between two adjacent
points (blocks), it pools them and replaces both by the same weighted
average.
Works fine for both ordinary and weighted sum of squares, as well as
several other convex criteria.
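A minimal weighted PAVA sketch, in the stack-based formulation with back-merging of blocks (a generic implementation, not the talk's code; weights default to 1):

```python
import numpy as np

def pava(y, w=None):
    """Weighted pool-adjacent-violators: least-squares isotonic fit of y
    with weights w, via a single pass with back-merging of blocks."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    means, weights, sizes = [], [], []
    for yi, wi in zip(y, w):
        means.append(yi)
        weights.append(wi)
        sizes.append(1)
        # pool while the last two blocks violate monotonicity
        while len(means) >= 2 and means[-2] > means[-1]:
            m2, w2, s2 = means.pop(), weights.pop(), sizes.pop()
            m1, w1, s1 = means.pop(), weights.pop(), sizes.pop()
            wt = w1 + w2
            means.append((w1 * m1 + w2 * m2) / wt)
            weights.append(wt)
            sizes.append(s1 + s2)
    return np.repeat(means, sizes)

print(pava([1.0, 3.0, 2.0, 4.0]))      # pools the violating pair (3, 2) into 2.5
print(pava([2.0, 1.0], w=[3.0, 1.0]))  # weighted pooling: (3*2 + 1*1)/4 = 1.75
```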
15. Credibility and coverage
Coverage is $E[\mathbb{1}(\theta \in S(X)) \mid \theta]$.
Credibility is $E[\mathbb{1}(\theta \in S(X)) \mid X]$.
Both are some sort of "projection" of the actually important quantity,
$\mathbb{1}(\theta \in S(X))$.
Together credibility and coverage may give a much more complete
picture if they are in close agreement.
In parametric problems, they often agree in view of the Bernstein-von
Mises theorem.
Even second-order matching is possible with probability matching
priors (Jeffreys' prior if there is no nuisance parameter).
In curve estimation with optimal smoothing, coverage of a credible
set may be arbitrarily low [Cox (1993)].
This is usually addressed by undersmoothing or inflating a credible
region.
16. Coverage of pointwise credible interval
Fix $x_0 \in (0, 1)$ where $f_0'(x_0) > 0$.
Consider the projection posterior given by the distribution of f ∗(x0),
where f ∗ is the isotonization of f .
Consider a $(1 - \alpha)$-credible interval (around the mean, median, or equal-tailed). Does its coverage go to $1 - \alpha$?
If there were a Bernstein-von Mises type theorem,
$\Pi(n^{1/3}(f^*(x_0) - \hat{f}(x_0)) \le z \mid \mathbb{D}_n) \to_p H(z)$ and
$P(n^{1/3}(f_0(x_0) - \hat{f}(x_0)) \le z \mid f_0) \to H(z)$, for some suitable estimator $\hat{f}$,
then the coverage would approach $1 - \alpha$, and the posterior median (mean
also, if $H$ is symmetric) would be asymptotically equivalent to $\hat{f}(x_0)$.
17. Centering estimator
What is the right choice of ˆf ?
The common estimator is the MLE: minimize $\sum_{i=1}^{n} (Y_i - f(X_i))^2$
subject to monotonicity. The solution is the isotonization of the pairs
$(X_i, Y_i)$, $i = 1, \dots, n$.
This has a limiting distribution under centering at $f_0(x_0)$ and scaling
by $n^{1/3}$, known as the Chernoff distribution: the distribution of the
argmin of a two-sided Brownian motion with a parabolic drift.
The key technique in the proof is the "switch relation": for a lower
semicontinuous function $\Phi$ on an interval $I$ and all $t \in I$, $v \in \mathbb{R}$,
$\mathrm{slogcm}(\Phi)(t) > v$ if and only if $\operatorname{argmin}^+_{s \in I} (\Phi(s) - vs) < t$,
and its mirror image for the right derivative.
Brownian motion comes through the Donsker theorem for a local
empirical process.
The Chernoff distribution is symmetric and has tails lighter than the
normal.
18. Sieve MLE
But the structure of this estimator hardly resembles that of the
prior/posterior.
Modify to the sieve MLE: minimize $\sum_{i=1}^{n} (Y_i - f(X_i))^2$ subject to
$f = \sum_{j=1}^{J} \theta_j \mathbb{1}_{I_j}$, $\theta_1 \le \cdots \le \theta_J$.
With some tweaking, for any choice $n^{1/3} \ll J \ll n^{2/3}$, the normalized
sieve MLE also has the same asymptotic Chernoff distribution $CZ$,
where $Z$ is a standard Chernoff variable, $C = 2b(a/b)^{2/3}$,
$a = \sigma^2 / g(x_0)$, $b = f_0'(x_0) / 2$.
Let $f^* = \sum_{j=1}^{J} \theta^*_j \mathbb{1}_{I_j}$ be the isotonization of a random draw from the
unrestricted posterior, so that the distribution of $f^*$ is the projection
posterior.
19. No Bernstein-von Mises type theorem
Theorem
Let $W_1, W_2$ be independent two-sided Brownian motions on $\mathbb{R}$ with
$W_1(0) = W_2(0) = 0$, $Z_2 = \arg\min\{W_1(t) + W_2(t) + t^2 : t \in \mathbb{R}\}$,
$Z_1 = \arg\min\{W_1(t) + t^2 : t \in \mathbb{R}\}$, and $C = 2b(a/b)^{2/3}$
with $a = \sigma_0^2 / g(x_0)$ and $b = f_0'(x_0) / 2$.
(a) For every $z \in \mathbb{R}$, $P_0(n^{1/3}(\hat{f}_n(x_0) - f_0(x_0)) \le z) \to P(CZ_1 \le z)$.
(b) For every $z \in \mathbb{R}$, $P_0(n^{1/3}(f^*(x_0) - f_0(x_0)) \le z) \to P(CZ_2 \le z)$.
(c) The conditional process $z \mapsto \Pi(n^{1/3}(f^*(x_0) - \hat{f}_n(x_0)) \le z \mid \mathbb{D}_n)$ does
not have a limit in probability.
20. Figure (three panels: Case 1, Case 2, Case 3)
Figure: Plot demonstrating that $\Pi(n^{1/3}(f^*(x_0) - \hat{f}_n(x_0)) \le 0 \mid \mathbb{D}_n)$ does not have a
limit in probability, using sample size $n = 2000$ in three cases.
21. Coverage of credible intervals
Let
$F^*_n(z \mid \mathbb{D}_n) = \Pi\left(n^{1/3}(f^*(x_0) - f_0(x_0)) \le z \mid \mathbb{D}_n\right)$,
$F^*_{a,b}(z \mid W_1) = P\left(2b(a/b)^{2/3} \arg\min_{t \in \mathbb{R}} V(t) \le z \mid W_1\right)$,
where $V(t) = W_1(t) + W_2(t) + t^2$. For every $n \ge 1$, $\gamma \in [0, 1]$, define
$Q_{n,\gamma} = \inf\{z \in \mathbb{R} : \Pi(f^*(x_0) \le z \mid \mathbb{D}_n) \ge 1 - \gamma\}$,
$I_{n,\gamma} = [Q_{n,1-\gamma/2},\ Q_{n,\gamma/2}]$,
$\Delta^*_{W_1,W_2} = \arg\min_{t \in \mathbb{R}} \{W_1(t) + W_2(t) + t^2\}$.
We are primarily interested in $P_0(f_0(x_0) \in I_{n,\gamma})$, the coverage of the
credible interval.
22. Limiting coverage
Theorem
(a) For every $z \in \mathbb{R}$, $F^*_n(z \mid \mathbb{D}_n) \rightsquigarrow F^*_{a,b}(z \mid W_1)$;
(b) the distribution of $F^*_{a,b}(0 \mid W_1)$ is symmetric about $1/2$;
(c) the limiting coverage of $I_{n,\gamma}$ is characterized as follows:
$P_0(f_0(x_0) \in I_{n,\gamma}) \to P\left(\tfrac{\gamma}{2} \le P(\Delta^*_{W_1,W_2} \ge 0 \mid W_1) \le 1 - \tfrac{\gamma}{2}\right)$.
23. Thus the Bayesian and frequentist distributions do not exactly tally,
meaning that the asymptotic coverage is not $1 - \alpha$. But how do they
compare?
For every $\alpha, \gamma \in [0, 1]$, define
$A(\gamma) = P\left(P(\Delta^*_{W_1,W_2} \ge 0 \mid W_1) \le \gamma\right)$, $\gamma(\alpha) = 2A^{-1}(\alpha/2)$.
Thus the theorem says that the limiting coverage of the
$(1 - \gamma)$-credible interval $I_{n,\gamma}$ is $1 - 2A(\gamma/2)$, which depends only on $\gamma$.
It does not match the nominal level (1 − γ), but something
remarkable happens using a recalibration: If the target coverage is
(1 − α), instead of starting with a (1 − α)-credible interval, start with
a (1 − γ)-credible interval, where A(γ/2) = α/2. Then the limiting
coverage (1 − α) is attained exactly.
Unlike for a confidence interval based on the MLE, there is no need to
estimate nuisance parameters.
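The map $A$ has no closed form, but it can be approximated by nested Monte Carlo over the two Brownian motions. A rough sketch; the truncation to $[-4, 4]$, the grid step, and the replication counts are all illustrative choices, and a serious computation (such as the table on the next slide) would need a much finer discretization:

```python
import numpy as np

def two_sided_bm(rng, n, h):
    """Two-sided Brownian motion W on the grid t = -n*h, ..., n*h, W(0) = 0."""
    right = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(h), n))])
    left = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(h), n))])[::-1]
    return np.concatenate([left[:-1], right])

rng = np.random.default_rng(0)
n, h = 200, 0.02                  # grid on [-4, 4]; the truncation is crude
t = np.arange(-n, n + 1) * h
M, K = 100, 100                   # outer (W1) and inner (W2) Monte Carlo sizes
u = np.empty(M)                   # u[m] estimates P(Delta* >= 0 | W1)
for m in range(M):
    w1 = two_sided_bm(rng, n, h)
    argmins = np.empty(K)
    for k in range(K):
        v = w1 + two_sided_bm(rng, n, h) + t ** 2   # V(t) = W1 + W2 + t^2
        argmins[k] = t[np.argmin(v)]
    u[m] = np.mean(argmins >= 0)

def A(gamma):
    """Monte Carlo estimate of A(gamma) = P(P(Delta* >= 0 | W1) <= gamma)."""
    return np.mean(u <= gamma)
```

Inverting the estimated $A$ at $\alpha/2$ then gives the recalibrated credibility level $1 - \gamma$.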
24. Recalibration table
1 − α 0.900 0.920 0.940 0.950 0.960 0.970 0.980 0.990
1 − γ 0.874 0.897 0.922 0.934 0.946 0.960 0.973 0.986
Simulation based, hence of limited accuracy.
This is therefore a reverse Cox phenomenon: we have to shrink a
credible interval to get nominal coverage.
25. Main idea of the proof
How do we use the weak convergence assertion?
$P_0(f_0(x_0) \le Q_{n,\gamma})$
$= P_0\left(\Pi(f^*(x_0) \le f_0(x_0) \mid \mathbb{D}_n) \le 1 - \gamma\right)$
$= P_0\left(\Pi(n^{1/3}(f^*(x_0) - f_0(x_0)) \le 0 \mid \mathbb{D}_n) \le 1 - \gamma\right)$
$= P_0\left(F^*_n(0 \mid \mathbb{D}_n) \le 1 - \gamma\right)$
$\to P\left(F^*_{a,b}(0 \mid W_1) \le 1 - \gamma\right)$
$= P\left(P\left(C \arg\min_{t \in \mathbb{R}} \{W_1(t) + W_2(t) + t^2\} \ge 0 \mid W_1\right) \le 1 - \gamma\right)$
$= P\left(P(\Delta^*_{W_1,W_2} \ge 0 \mid W_1) \le 1 - \gamma\right)$.
27. Coverage
Consider data generated from the regression function
$f_0(x) = e^{x-0.5} / (1 + e^{x-0.5})$, with $G$ uniform on $[0, 1]$.
Choose $x_0 = 0.5$, $\sigma = 1$, $J \approx n^{1/3} \log n$.
Take n = 500, 1000, 1500, 2000.
For each n, 1000 Monte Carlo samples are used.
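The data-generating step of this design can be sketched as follows (the integer rounding of $J$ is an arbitrary choice made here for illustration):

```python
import numpy as np

def simulate(n, sigma=1.0, rng=None):
    """One dataset from the design above: X ~ Uniform[0, 1],
    f0(x) = exp(x - 0.5) / (1 + exp(x - 0.5)), Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(size=n)
    f0 = np.exp(x - 0.5) / (1 + np.exp(x - 0.5))
    y = f0 + rng.normal(scale=sigma, size=n)
    J = int(round(n ** (1 / 3) * np.log(n)))   # J ~ n^{1/3} log n
    return x, y, J

x, y, J = simulate(1000, rng=np.random.default_rng(0))   # J = 69 for n = 1000
```

Each Monte Carlo replicate then feeds into the projection-posterior pipeline of the earlier slides to record whether the credible interval covers $f_0(x_0)$.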