In many statistical problems, estimators are derived by maximizing a data-dependent function of the parameter, usually in the form of an empirical average, e.g. maximum likelihood and least squares. For asymptotic studies, ordinary calculus fails if the parameter is non-Euclidean or has an infinite-dimensional component. In such cases, it becomes useful to capitalize on general M-estimation theorems formulated in terms of Banach-space-valued parameters. Although these theorems are built on sophisticated empirical process results and arguments, the underlying ideas are usually simple enough to grasp. For example, if the objective function converges in probability to a limit function, then the estimator should also tend in probability to the maximizer of that limit function (consistency). Likewise, if a centered and re-scaled objective function, taking a centered and re-scaled parameter as argument, converges weakly to a tight limit process, then the centered and re-scaled estimator should also converge weakly to the (hopefully tight) maximizer of that limit process. Along this line, it is clear that the proper re-scaling rate on the parameter (argument) should render the standardized objective function (in the form of an empirical process) asymptotically tight. Therefore, it is not surprising that the re-scaling rate is intimately related to the "modulus of continuity" of the empirical process. This is essentially the spirit of rate theorems such as Theorem 3.2.5 of van der Vaart and Wellner (1996) and Theorem 7.4 of van de Geer (2000), where technical tools such as the "peeling device" are employed to make the mathematics go through.
This presentation tries to deliver the intuitive ideas behind the general M-estimation theory. For a complete and rigorous treatment, please refer to the relevant parts of van der Vaart and Wellner (1996) and van de Geer (2000), among others. An application to one-sample estimation with current status data is briefly described; for details, please refer to Groeneboom and Wellner (1992).
The slides were originally presented in the class STOR 831 Advanced Probability in Fall 2013 at UNC Chapel Hill as a final project.
Rates of convergence in M-Estimation with an example from current status data
Rates of Convergence in M-Estimation
With an Example from Current Status Data
Lu Mao
Department of Biostatistics
The University of North Carolina at Chapel Hill
Email: lmao@unc.edu
Lu Mao (UNC CH) Dec 4, 2013 1 / 23
1 M-Estimation
Motivation
Z-Estimation
Consistency
Distribution
Rate of convergence
2 Example: Current Status Data
Description of problem
Characterization of MLE
Consistency and distributional results
MOTIVATION
Examples:
  Parametric model: $\{p_\theta(x) : \theta \in \Theta\}$,
    $\theta_0 = \arg\max_{\theta \in \Theta} P \log p_\theta$
  Regression: $Y = g_0(X) + \epsilon$, $E(\epsilon \mid X) = 0$, $g_0 \in \mathcal{G}$,
    $g_0 = \arg\min_{g \in \mathcal{G}} P(Y - g(X))^2$
Natural estimators:
  MLE: $\hat\theta = \arg\max_{\theta \in \Theta} \mathbb{P}_n \log p_\theta$
  LSE: $\hat g = \arg\min_{g \in \mathcal{G}} \mathbb{P}_n (Y - g(X))^2$
INTRODUCTION
M-estimator:
  $\hat\theta_n = \arg\max_{\theta \in \Theta} M_n(\theta)$
  $M_n$: data-dependent criterion function
  $M_n \to M$, and $\theta_0 = \arg\max_{\theta \in \Theta} M(\theta)$
  Typically $M_n(\theta) = \mathbb{P}_n m_\theta$
Analysis steps:
  Consistency: $\hat\theta_n \to_p \theta_0$
  Rate of convergence: $r_n(\hat\theta_n - \theta_0) = O_p(1)$
  Asymptotic distribution: $r_n(\hat\theta_n - \theta_0) \rightsquigarrow Z$
Z-ESTIMATION
Special case: $m_\theta$ smooth in the parameter ($P\dot m_{\theta_0} = 0$)
(Approximately) solve:
  $\mathbb{P}_n \dot m_{\hat\theta_n} = o_p(n^{-1/2})$
Provided that
  $\mathbb{G}_n \dot m_{\hat\theta_n} = \mathbb{G}_n \dot m_{\theta_0} + o_p(1)$   (consistency + Donskerness)
we have
  $\sqrt{n}(P\dot m_{\hat\theta_n} - P\dot m_{\theta_0}) = -\mathbb{G}_n \dot m_{\theta_0} + o_p(1)$
  $V_{\theta_0} \sqrt{n}(\hat\theta_n - \theta_0) = -\mathbb{G}_n \dot m_{\theta_0} + o_p(1 + \|\sqrt{n}(\hat\theta_n - \theta_0)\|)$
  $\sqrt{n}(\hat\theta_n - \theta_0) = -V_{\theta_0}^{-1} \mathbb{G}_n \dot m_{\theta_0} + o_p(1)$
where
  $V_{\theta_0} = \frac{\partial}{\partial\theta} P\dot m_\theta \big|_{\theta = \theta_0}$
Z-ESTIMATION
Example (Sample median)
  $\hat\theta_n = \arg\min_\theta \mathbb{P}_n |X - \theta|$
or equivalently
  $-1/n \le \mathbb{P}_n \operatorname{sign}(X - \hat\theta_n) \le 1/n$
Since $P\operatorname{sign}(X - \theta) = 1 - 2F(\theta)$, we get $V_{\theta_0} = -2f(\theta_0)$, so
  $\sqrt{n}(\hat\theta_n - \theta_0) = (2f(\theta_0))^{-1} \mathbb{G}_n \operatorname{sign}(X - \theta_0) + o_p(1) \rightsquigarrow N\big(0, \{4f(\theta_0)^2\}^{-1}\big)$
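The limiting variance $\{4f(\theta_0)^2\}^{-1}$ can be checked numerically. The sketch below is not part of the original slides; it assumes standard normal data, for which $\theta_0 = 0$, $f(\theta_0) = 1/\sqrt{2\pi}$, and the limiting variance is $\pi/2 \approx 1.571$.

```python
# Monte Carlo sketch (not from the slides) of the sample-median result
#   sqrt(n) * (median_n - theta0) ~ N(0, 1 / (4 f(theta0)^2)).
# Assumption: X ~ N(0, 1), so theta0 = 0, f(theta0) = 1/sqrt(2*pi),
# and the limiting variance is pi/2 ~ 1.5708.
import math
import random

random.seed(2013)

def sample_median(xs):
    """Midpoint of the two central order statistics."""
    s = sorted(xs)
    n = len(s)
    return (s[(n - 1) // 2] + s[n // 2]) / 2

n, reps = 400, 2000
scaled = [math.sqrt(n) * sample_median([random.gauss(0.0, 1.0) for _ in range(n)])
          for _ in range(reps)]
var_hat = sum(z * z for z in scaled) / reps
print(var_hat)  # close to pi/2 for large n and reps
```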
M-ESTIMATION
M-Estimation in general:
  Non-smoothness in the parameter
  Constraints on the parameter
  Resulting estimator possibly not root-n consistent, i.e.
  $r_n(\hat\theta_n - \theta_0) = O_p(1)$, where $r_n \ne \sqrt{n}$
CONSISTENCY
Theorem 1.1 (VW Corollary 3.2.3)
Let $M_n(\theta)$ be a stochastic process indexed by a metric space $\Theta$, and let $M : \Theta \to \mathbb{R}$ be a deterministic function.
(a) Suppose $\|M_n - M\|_\Theta \to_p 0$ and the true parameter $\theta_0$ satisfies
  $M(\theta_0) > \sup_{\theta \notin G} M(\theta)$
for every open set $G$ containing $\theta_0$. Then if $M_n(\hat\theta_n) \ge \sup_\theta M_n(\theta) - o_p(1)$, we have $\hat\theta_n \to_p \theta_0$.
(b) Suppose that $\|M_n - M\|_K \to_p 0$ for every compact $K \subset \Theta$ and that the map $\theta \mapsto M(\theta)$ is upper-semicontinuous with a unique maximum at $\theta_0$. Then the same conclusion is true provided that $\hat\theta_n = O_p(1)$.
CONSISTENCY
Example (Sample median)
  $\hat\theta_n = \arg\min_\theta \mathbb{P}_n |X - \theta|$
Use Theorem 1.1(b):
  $|\hat\theta_n| \le \mathbb{P}_n|X - \hat\theta_n| + \mathbb{P}_n|X| \le 2\mathbb{P}_n|X| = O_p(1)$
  $\{|x - \theta| : \theta \in K\}$ is Glivenko-Cantelli
  Uniqueness of $\theta_0$ as minimizer of $P|X - \theta|$:
    $\frac{\partial}{\partial\theta} P|X - \theta| = 2F(\theta) - 1, \quad \frac{\partial^2}{\partial\theta^2} P|X - \theta| = 2f(\theta)$
  so $P|X - \theta|$ is strictly convex on the support of $X$
Conclusion: $\hat\theta_n \to_p \theta_0$
CONSISTENCY
Theorem 1.2 (Wald)
Let $\theta \mapsto m_\theta(x)$ be upper-semicontinuous for every $x$, and for every sufficiently small ball $U \subset \Theta$,
  $P \sup_{\theta \in U} m_\theta < \infty.$
Then if $\theta_0$ is the unique maximizer of $Pm_\theta$, $\hat\theta_n = O_p(1)$ and $\mathbb{P}_n m_{\hat\theta_n} \ge \mathbb{P}_n m_{\theta_0} - o_p(1)$, we have
  $\hat\theta_n \to_p \theta_0.$
DISTRIBUTION
Suppose $\hat h_n := r_n(\hat\theta_n - \theta_0) = O_p(1)$; then
  $\hat h_n = \arg\max_h r_n^2 \{M_n(\theta_0 + r_n^{-1} h) - M_n(\theta_0)\} =: \arg\max_h \mathbb{H}_n(h)$
Theorem 1.3 (Argmax)
Suppose that $\mathbb{H}_n \rightsquigarrow \mathbb{H}$ in $\ell^\infty(K)$ for every compact $K \subset \mathbb{R}^d$, for a limit process $\mathbb{H}$ with continuous sample paths that have a unique point of maximum $\hat h$. If $\hat h_n = O_p(1)$ and $\mathbb{H}_n(\hat h_n) \ge \sup_h \mathbb{H}_n(h) - o_p(1)$, then
  $\hat h_n \rightsquigarrow \hat h.$
DISTRIBUTION
Example (Parametric MLE), with $r_n = \sqrt{n}$:
  $\mathbb{H}_n(h) = r_n^2 \mathbb{P}_n(\log p_{\theta_0 + r_n^{-1} h} - \log p_{\theta_0})$
  $\qquad = \log \prod_{i=1}^n \frac{p_{\theta_0 + h/\sqrt{n}}}{p_{\theta_0}}(X_i)$
  $\qquad = h^T \mathbb{G}_n \dot\ell_{\theta_0} - \frac{1}{2} h^T I_{\theta_0} h + o_p(1)$   (LAN)
  $\qquad \rightsquigarrow h^T Z - \frac{1}{2} h^T I_{\theta_0} h =: \mathbb{H}(h), \quad Z \sim N(0, I_{\theta_0})$
Therefore
  $\sqrt{n}(\hat\theta_n - \theta_0) = \hat h_n \rightsquigarrow \arg\max_h \mathbb{H}(h) = I_{\theta_0}^{-1} Z \sim N(0, I_{\theta_0}^{-1})$
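The argmax/LAN conclusion can be checked on a concrete family. The sketch below is an illustration, not from the slides; it uses the exponential model $p_\theta(x) = \theta e^{-\theta x}$, for which $\hat\theta_n = 1/\bar X_n$ and $I_{\theta_0} = \theta_0^{-2}$, so $\sqrt{n}(\hat\theta_n - \theta_0) \rightsquigarrow N(0, \theta_0^2)$.

```python
# Numerical check (an illustration, not from the slides) of the argmax/LAN
# result for the exponential model p_theta(x) = theta * exp(-theta * x):
# theta_hat_n = 1 / mean(X), I_theta = 1 / theta^2, so
#   sqrt(n) * (theta_hat_n - theta0) ~ N(0, theta0^2).
import math
import random

random.seed(831)

theta0, n, reps = 1.0, 400, 2000
scaled = []
for _ in range(reps):
    xbar = sum(random.expovariate(theta0) for _ in range(n)) / n
    scaled.append(math.sqrt(n) * (1.0 / xbar - theta0))

mean_hat = sum(scaled) / reps
var_hat = sum((z - mean_hat) ** 2 for z in scaled) / reps
print(mean_hat, var_hat)  # mean near 0, variance near theta0^2 = 1
```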
DISTRIBUTION
In general,
  $\mathbb{H}_n(h) = r_n^2 \mathbb{P}_n(m_{\theta_0 + r_n^{-1} h} - m_{\theta_0})$
  $\qquad = \frac{r_n^2}{\sqrt{n}} \mathbb{G}_n(m_{\theta_0 + r_n^{-1} h} - m_{\theta_0}) + r_n^2 P(m_{\theta_0 + r_n^{-1} h} - m_{\theta_0})$
  $\qquad = \frac{r_n^2}{\sqrt{n}} \mathbb{G}_n(m_{\theta_0 + r_n^{-1} h} - m_{\theta_0}) + \frac{1}{2} h^T V_{\theta_0} h + o_p(1), \quad V_\theta = \frac{\partial^2}{\partial\theta^2} P m_\theta$
  $\qquad \rightsquigarrow \mathbb{G}(h) + \frac{1}{2} h^T V_{\theta_0} h \qquad (1)$
for some zero-mean Gaussian process $\mathbb{G}$.
Note that convergence of the $\mathbb{G}_n$ term concerns empirical processes indexed by
  $\mathcal{F}_n := \frac{r_n^2}{\sqrt{n}} \mathcal{M}_{K/r_n}, \quad \text{where } \mathcal{M}_\delta = \{m_\theta - m_{\theta_0} : d(\theta, \theta_0) \le \delta\}$
DISTRIBUTION
Empirical processes on index sets changing with $n$: VW Section 2.11.
If (1) does hold, the variance function of $\mathbb{G}$ is given by
  $E(\mathbb{G}(h) - \mathbb{G}(g))^2 = \lim_{n \to \infty} \frac{r_n^4}{n} P(m_{\theta_0 + h/r_n} - m_{\theta_0 + g/r_n})^2$
The remaining (key) issue: finding the right rate $r_n$.
RATE OF CONVERGENCE
Theorem 1.4 (Rate of Convergence, VW Theorem 3.2.5)
Let $M_n$ be stochastic processes indexed by a semimetric space $(\Theta, d)$ and $M : \Theta \to \mathbb{R}$ a deterministic function, such that for every $\theta$ in a neighborhood of $\theta_0$,
  $M(\theta) - M(\theta_0) \lesssim -d^2(\theta, \theta_0).$
Suppose that, for all sufficiently small $\delta$,
  $E \sup_{d(\theta, \theta_0) < \delta} |(M_n - M)(\theta) - (M_n - M)(\theta_0)| \lesssim \frac{\phi_n(\delta)}{\sqrt{n}},$
for functions $\phi_n$ such that $\delta \mapsto \phi_n(\delta)/\delta^\alpha$ is decreasing for some $\alpha < 2$. Let
  $r_n^2 \phi_n(1/r_n) \le \sqrt{n}, \quad \text{for every } n.$
If the sequence $\hat\theta_n$ satisfies $M_n(\hat\theta_n) \ge M_n(\theta_0) - O_p(r_n^{-2})$ and converges in probability to $\theta_0$, then
  $r_n d(\hat\theta_n, \theta_0) = O_p(1).$
RATE OF CONVERGENCE
Remark:
For empirical-type criterion functions,
  $E\|\mathbb{G}_n\|_{\mathcal{M}_\delta} \lesssim \phi_n(\delta)$
Use the maximal inequality
  $E\|\mathbb{G}_n\|_{\mathcal{M}_\delta} \lesssim J_{[\,]}(1, \mathcal{M}_\delta, L_2(P)) (P M_\delta^2)^{1/2},$
where $M_\delta$ is the envelope function of $\mathcal{M}_\delta$. If
  $J_{[\,]}(1, \mathcal{M}_\delta, L_2(P)) = \int_0^1 \sqrt{1 + \log N_{[\,]}(\epsilon \|M_\delta\|_{P,2}, \mathcal{M}_\delta, L_2(P))}\, d\epsilon \lesssim 1,$
uniformly in $\delta$, then take
  $\phi_n(\delta) = (P M_\delta^2)^{1/2}$
If $\phi_n(\delta) = \delta^\alpha$ for some $\alpha < 2$, then
  $r_n = n^{\frac{1}{2(2-\alpha)}}$
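For the common case $\phi_n(\delta) = \delta^\alpha$ the rate can be read off mechanically: $r_n^2 \phi_n(1/r_n) = r_n^{2-\alpha} \le \sqrt{n}$ gives $r_n = n^{1/(2(2-\alpha))}$. A tiny helper (an illustration, not from the slides) makes the bookkeeping explicit:

```python
# Bookkeeping helper (an illustration, not from the slides): solve
#   r_n^2 * phi_n(1/r_n) <= sqrt(n)   with   phi_n(delta) = delta**alpha,
# i.e. r_n^(2 - alpha) <= n^(1/2), giving r_n = n^(1 / (2 * (2 - alpha))).
def rate_exponent(alpha):
    """Exponent gamma such that r_n = n**gamma when phi_n(delta) = delta**alpha."""
    assert alpha < 2, "theorem requires delta -> phi_n(delta)/delta^alpha decreasing, alpha < 2"
    return 1.0 / (2.0 * (2.0 - alpha))

print(rate_exponent(1.0))  # 0.5 : the Lipschitz case, r_n = sqrt(n)
print(rate_exponent(0.5))  # 1/3 : the current status case, r_n = n^(1/3)
```

Here $\alpha = 1$ recovers the $\sqrt{n}$ rate of the Lipschitz example, while $\alpha = 1/2$ gives the $n^{1/3}$ rate that appears in the current status application.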
RATE OF CONVERGENCE
Example (Lipschitz in parameter):
If for every $\theta_1, \theta_2$ in a neighborhood of $\theta_0$,
  $|m_{\theta_1} - m_{\theta_2}| \le \dot m(x) \|\theta_1 - \theta_2\|,$
with $P \dot m^2(x) \lesssim 1$, then
  $\phi_n(\delta) = (P M_\delta^2)^{1/2} \lesssim \delta.$
This gives
  $r_n = \sqrt{n}$
CURRENT STATUS
Interval censoring Case 1 (current status):
  Time-to-event data, each subject examined only once
  Example: a cross-sectional antibody test of people of various ages against Hepatitis A virus (Keiding, 1991)
Statistical problem: observe i.i.d. $(U, \Delta)$,
  $U \sim G$ on $\mathbb{R}^+$
  $\Delta = I(T \le U)$, $T \sim F$ on $\mathbb{R}^+$, $T \perp U$
Aim: estimate $F$
Method: nonparametric MLE (NPMLE)
CHARACTERIZATION OF MLE
Regularity conditions: $F_0$ and $G$ admit Lebesgue densities $f$ and $g$ respectively
Likelihood:
  $l_n(F) = \mathbb{P}_n(\Delta \log F(U) + (1 - \Delta) \log(1 - F(U)))$
NPMLE: denoted $\hat F_n$
CHARACTERIZATION OF MLE
Theorem 2.1 (GW Proposition 1.2)
Re-order the observation times in ascending order such that $U_1 \le \cdots \le U_n$. Let $H_n$ be the greatest convex minorant (GCM) of the points $(i, \sum_{j=1}^i \Delta_j)$. Then $\hat F_n(U_i)$ is the left derivative of $H_n$ at $i$. Algebraically,
  $\hat F_n(U_i) = \max_{1 \le j \le i} \min_{i \le k \le n} \frac{\sum_{m=j}^k \Delta_m}{k - j + 1}$
Corollary 2.2
Denote by $D_n$ the right-continuous step function defined by the points $(i/n, n^{-1} \sum_{j=1}^i \Delta_j)$; then
  $\hat F_n(U_i) \le a \iff \arg\min_s \{D_n(s) - a s\} \ge i/n.$
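The max-min formula in Theorem 2.1 can be implemented directly. The sketch below is a naive $O(n^2)$ illustration (not the algorithm of GW; in practice one would use the pool-adjacent-violators algorithm, which runs in $O(n)$):

```python
# Direct implementation (a naive O(n^2) sketch) of the max-min formula
#   F_hat(U_i) = max_{j<=i} min_{k>=i} (Delta_j + ... + Delta_k) / (k - j + 1).
def npmle_current_status(deltas):
    """NPMLE values F_hat(U_1), ..., F_hat(U_n) from the indicators Delta_i
    listed in increasing order of the observation times U_i."""
    n = len(deltas)
    prefix = [0]  # prefix[i] = Delta_1 + ... + Delta_i
    for d in deltas:
        prefix.append(prefix[-1] + d)
    fhat = []
    for i in range(1, n + 1):
        best = 0.0
        for j in range(1, i + 1):
            # min over k >= i of the running mean of Delta_j, ..., Delta_k
            m = min((prefix[k] - prefix[j - 1]) / (k - j + 1)
                    for k in range(i, n + 1))
            best = max(best, m)
        fhat.append(best)
    return fhat

print(npmle_current_status([0, 1, 0, 1, 1]))  # [0.0, 0.5, 0.5, 1.0, 1.0]
```

The output agrees with the left derivatives of the GCM of the cumulative-sum diagram, and is non-decreasing as a distribution function must be.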
CONSISTENCY OF MLE
Theorem 2.3 (Consistency of $\hat F_n(t)$)
Fix $t$ and assume that $f(t), g(t) > 0$; then
  $\hat F_n(t) \to_p F_0(t)$
Proof. See Example 5.17 ([V], p. 49) for a proof by Wald's consistency (Theorem 1.2), with $(\Theta, d)$ = the space of distribution functions equipped with the weak topology.
DISTRIBUTION OF MLE
Theorem 2.4 (Groeneboom, 1987)
Fix $t$ and assume that $f(t), g(t) > 0$; then
  $n^{1/3}\{\hat F_n(t) - F_0(t)\} \rightsquigarrow \left(\frac{4 F_0 (1 - F_0) f}{g}(t)\right)^{1/3} \arg\min_h \{Z(h) + h^2\},$
where $Z$ is a two-sided Brownian motion process originating from zero.
Proof. We first use Theorem 1.4 and the subsequent Remark to establish that $r_n = n^{1/3}$, and then use the Argmax Theorem (Theorem 1.3) to find the asymptotic distribution. See Example 3.2.15 ([VW], p. 298) for details.
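The limit variable $\arg\min_h \{Z(h) + h^2\}$ (a Chernoff-type distribution) can be approximated by brute force. The sketch below is not from the slides, and the grid step and truncation window are arbitrary choices; it discretizes a two-sided Brownian motion on $[-3, 3]$ and records the minimizer of $Z(h) + h^2$:

```python
# Brute-force sketch (not from the slides) of argmin_h { Z(h) + h^2 },
# Z a two-sided Brownian motion started at 0. Grid step and the
# truncation window [-3, 3] are arbitrary simulation choices.
import math
import random

random.seed(42)

def chernoff_argmin(step=0.01, span=3.0):
    """One draw of the grid minimizer of Z(h) + h**2 over [-span, span]."""
    m = int(span / step)
    sd = math.sqrt(step)           # BM increment scale over one grid step
    best_h, best_v = 0.0, 0.0      # value at h = 0 is Z(0) + 0 = 0
    for sign in (1.0, -1.0):       # the two halves are independent BMs
        z = 0.0
        for i in range(1, m + 1):
            z += random.gauss(0.0, sd)
            h = sign * i * step
            v = z + h * h
            if v < best_v:
                best_h, best_v = h, v
    return best_h

draws = [chernoff_argmin() for _ in range(2000)]
mean = sum(draws) / len(draws)
sd = math.sqrt(sum((d - mean) ** 2 for d in draws) / len(draws))
print(mean, sd)  # mean near 0 by symmetry; the distribution is concentrated
```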
References
Groeneboom, P. (1987). Asymptotics for interval censored observations. Report 87-18.
[GW] Groeneboom, P. and Wellner, J. A. (1992). Information Bounds and Nonparametric Maximum Likelihood Estimation (Vol. 19). Springer.
[V] van der Vaart, A. W. (2000). Asymptotic Statistics (Vol. 3). Cambridge University Press.
[VW] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer.