1 Introduction

All models are wrong. Some models are useful. –George E.P. Box

1. Data Generating Process (DGP), the joint distribution of the data
   $$f(z_1, \dots, z_n; \theta)$$
   where the $z_i$ in general are vector-valued observations.

2. The theoretical (economic) model, being a simplification, is different from the DGP.

3. The DGP is unknown.

4. Statistical model of the data.
   (a) Provide a sufficiently good approximation to the DGP to make inference valid.
   (b) If the approximation is "bad" and inference is invalid we say that the model is misspecified.
   (c) There may be several "valid" models, differing in "goodness".

5. If the parameters of the theoretical model can be uniquely determined from the parameters of the statistical model we say that the theoretical model is identified.

6. In many cases we are only interested in a subset of the variables, $y_i$, and can write the DGP as
   $$f(z_1, \dots, z_n; \theta) = f_1(y_1, \dots, y_n \mid x_1, \dots, x_n; \theta_1)\, f_2(x_1, \dots, x_n; \theta_2).$$
   If $x_i$ is exogenous, $f_2$ can be ignored and it is sufficient to model $f_1$. Roughly speaking this is the case when $\theta_2$ does not contain any information about $\theta_1$.

In what follows the DGP is assumed known and all these issues are ignored!
2 Small sample properties of general estimators (criteria)

Definition 1 An estimator, $\hat\theta$, of $\theta$ is a function of the data, $\hat\theta(Z_1, \dots, Z_n)$. As such it is a random variable and has a sampling variability.

Definition 2 An estimate of $\theta$ is the estimator evaluated at the current sample, $\hat\theta(z_1, \dots, z_n)$.
Definition 3 (Unbiased) An estimator $\hat\theta$ of $\theta$ is unbiased if $E(\hat\theta) = \theta$. $b(\hat\theta, \theta) = E(\hat\theta) - \theta$ is the bias of $\hat\theta$.

Example 1 Consider the estimator $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$ of $\sigma^2$ where the $X_i$ are uncorrelated, $E(X_i) = \mu$ and $Var(X_i) = \sigma^2$. We have
$$(X_i - \bar X)^2 = (X_i - \mu - \bar X + \mu)^2 = (X_i - \mu)^2 - 2(X_i - \mu)(\bar X - \mu) + (\bar X - \mu)^2$$
$$E(X_i - \bar X)^2 = E(X_i - \mu)^2 - 2E(X_i - \mu)(\bar X - \mu) + E(\bar X - \mu)^2 = \sigma^2 - 2\frac{\sigma^2}{n} + \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2$$
and it is clear that $E(\hat\sigma^2) = \frac{n-1}{n}\sigma^2$ with $b(\hat\sigma^2, \sigma^2) = -\frac{\sigma^2}{n}$.
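The small-sample bias derived in Example 1 is easy to check by simulation. The following sketch is not part of the original notes and assumes NumPy; the sample size, variance and replication count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 10, 4.0, 100_000

# the biased estimator from Example 1: divide by n, not n - 1
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
sigma2_hat = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

print(sigma2_hat.mean())       # Monte Carlo estimate of E(sigma2_hat)
print((n - 1) / n * sigma2)    # theoretical value (n-1)/n * sigma^2 = 3.6
```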
Definition 4 (MSE) The Mean Square Error (MSE) of an estimator, $\hat\theta$, is given by $MSE(\hat\theta, \theta) = E(\hat\theta - \theta)^2$.

Remark 1 Note that we have
$$E(\hat\theta - \theta)^2 = E\left[\hat\theta - E(\hat\theta) + E(\hat\theta) - \theta\right]^2 = E\left[\hat\theta - E(\hat\theta)\right]^2 + 2E\left\{\left[\hat\theta - E(\hat\theta)\right]\left[E(\hat\theta) - \theta\right]\right\} + E\left[E(\hat\theta) - \theta\right]^2 = Var(\hat\theta) + 0 + b(\hat\theta, \theta)^2.$$
That is, the MSE of an unbiased estimator is just the variance.
Definition 5 (Relative efficiency) Let $\hat\theta_1$ and $\hat\theta_2$ be two alternative estimators of $\theta$. Then the ratio of the MSEs, $MSE(\hat\theta_1, \theta)/MSE(\hat\theta_2, \theta)$, is called the relative efficiency of $\hat\theta_1$ with respect to $\hat\theta_2$.

Definition 6 (UMVUE) An estimator $\hat\theta$ is a uniformly minimum variance unbiased estimator (UMVUE) if $E(\hat\theta) = \theta$ and, for any other unbiased estimator $\tilde\theta$, $Var(\hat\theta) \le Var(\tilde\theta)$ for all $\theta$.
Example 2 Consider the class, $\hat\mu = \sum_{i=1}^n w_i X_i$, of linear estimators of $\mu = E(X_i)$, where $Var(X_i) = \sigma^2$ and the $X_i$ are uncorrelated. Unbiasedness clearly requires that $\sum w_i = 1$ and the variance is given by
$$Var(\hat\mu) = E\left[\sum w_i (X_i - \mu)\right]^2 = E\sum_i \sum_j w_i w_j (X_i - \mu)(X_j - \mu) = \sigma^2 \sum w_i^2.$$
One unbiased estimator in this class is the familiar $\bar X$, which sets $w_i = 1/n$ and has variance $\sigma^2/n$. We will show that this is the UMVUE in the class of linear estimators. The first order condition for minimizing $Var(\hat\mu)$ subject to the restriction $\sum w_i = 1$ is
$$2\sigma^2 w_i = \lambda$$
for $\lambda$ the Lagrange multiplier. That is, all the weights are equal; together with $\sum w_i = 1$ this gives $w_i = 1/n$.
Remark 2 The notion of minimizing the variance is suggestive. One can define a general class of estimators by requiring the estimator to minimize the sample analogue of the variance,
$$\hat\mu = \arg\min_m\; n^{-1}\sum_{i=1}^n (X_i - m)^2,$$
with FOC $-2 n^{-1}\sum_{i=1}^n (X_i - \hat\mu) = 0$ and solution $\hat\mu = \frac{1}{n}\sum X_i$. This is the class of Least Squares estimators.
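As a small numerical illustration of Remark 2 (not from the notes; assumes NumPy, data distribution chosen arbitrarily), the least squares criterion is minimized at the sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=50)

# evaluate n^{-1} * sum (x_i - m)^2 over a grid of candidate values m
grid = np.linspace(x.min(), x.max(), 10_001)
crit = ((x[:, None] - grid[None, :]) ** 2).mean(axis=0)

print(grid[crit.argmin()])   # grid minimizer of the criterion
print(x.mean())              # closed-form least squares solution
```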
Example 3 Consider the linear regression model
$$y_i = x_i'\beta + \varepsilon_i, \qquad i = 1, \dots, n, \quad \text{or} \quad y = X\beta + \varepsilon.$$
The least squares estimator of $\beta$, $b$, is obtained by minimizing $q = e'e = (y - Xb)'(y - Xb)$. The FOC is
$$\frac{\partial q}{\partial b'} = -2(y - Xb)'X = 0$$
$$y'X = b'X'X$$
with solution
$$b = (X'X)^{-1}X'y$$
provided that $X'X$ has full rank (so the inverse is well-defined), i.e. that $X$ has rank $k$.

Theorem 1 (Gauss-Markov) Assume that $X$ is non-stochastic and $E(\varepsilon) = 0$, $Var(\varepsilon) = \sigma^2 I$. Then $Var(b) = \sigma^2 (X'X)^{-1}$ and $b$ is the BLUE (Best Linear Unbiased Estimator) of $\beta$. That is, $b$ is the UMVUE in the class of linear estimators, $\tilde b = Ay$.

Proof. Write
$$b = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon$$
so that $E(b) = \beta$ and
$$Var(b) = E\left[(b - \beta)(b - \beta)'\right] = E\left[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\right] = (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1} = (X'X)^{-1}X'\sigma^2 I X(X'X)^{-1} = \sigma^2 (X'X)^{-1}.$$
To prove that $b$ is BLUE, let $\tilde b = Ay$ be an unbiased linear estimator of $\beta$. Defining $C = A - (X'X)^{-1}X'$ we have
$$\tilde b = \left[C + (X'X)^{-1}X'\right]y = Cy + b = CX\beta + C\varepsilon + b.$$
Unbiasedness requires $CX = 0$, so that $\tilde b - \beta = \left[C + (X'X)^{-1}X'\right]\varepsilon$ and
$$Var(\tilde b) = E\left\{\left[C + (X'X)^{-1}X'\right]\varepsilon\varepsilon'\left[C + (X'X)^{-1}X'\right]'\right\} = \sigma^2 CC' + \sigma^2 CX(X'X)^{-1} + \sigma^2 (X'X)^{-1}X'C' + \sigma^2 (X'X)^{-1} = \sigma^2 CC' + \sigma^2 (X'X)^{-1}$$
and the variance of $\tilde b$ exceeds the variance of $b$ by the positive semi-definite matrix $\sigma^2 CC'$. This implies that $Var(\alpha'\tilde b) = Var(\alpha' b) + \sigma^2 \alpha' CC'\alpha \ge Var(\alpha' b)$ for any linear combination $\alpha'$.
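The claim $Var(b) = \sigma^2 (X'X)^{-1}$ for fixed $X$ can be illustrated by simulation. The sketch below is my own toy setup (design matrix, $\beta$ and $\sigma^2$ are arbitrary choices), not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 50, 2.0
beta = np.array([1.0, -0.5])

X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed, non-stochastic design
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20_000
bs = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)
    bs[r] = XtX_inv @ (X.T @ y)            # OLS via the normal equations

print(np.cov(bs, rowvar=False))            # Monte Carlo variance of b
print(sigma2 * XtX_inv)                    # theoretical sigma^2 (X'X)^{-1}
```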
Definition 7 (Sufficiency) Let $f(x; \theta)$ be the joint density of the data. $T(x)$ is said to be a sufficient statistic for $\theta$ if $g(x \mid T)$, the density of $x$ conditional on $T$, does not depend on $\theta$.

Remark 3 A sufficient statistic $T$ captures all the information about $\theta$ in the data. This means that we can base estimators on $T$ rather than the full sample.

Theorem 2 (Factorization theorem) Let $X_1, \dots, X_n$ be a random sample from $f(x; \theta)$. Then $T(x)$ is a sufficient statistic for $\theta$ iff
$$f(x; \theta) = g(x) f(T(x); \theta)$$
where $g$ does not depend on $\theta$.

Example 4 Let $X_i$ be iid Bernoulli with parameter $p$. $T = \sum_{i=1}^n X_i$ is then a sufficient statistic (i.e. the number of successes in $n$ trials). The joint pdf is given by
$$f(x; p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{n - \sum x_i} = g(x) f(T; p)$$
and we can put $g(x) = 1$ and $f(T; p) = p^T(1-p)^{n-T}$ with $T = \sum_{i=1}^n X_i$.

Remark 4 Note that sufficient statistics are not unique and may differ in how good they are at reducing the data. In the previous example $T_2 = n - \sum X_i$ and $T_3 = \left(\sum X_i,\; n - \sum X_i\right)$ are clearly sufficient statistics as well.
3 Large sample properties of general estimators (criteria)

Definition 8 (Consistency) An estimator $\hat\theta$ of $\theta$ is consistent if $\hat\theta \xrightarrow{p} \theta$.

Definition 9 (Asymptotically unbiased) An estimator $\hat\theta$ of $\theta$ is asymptotically unbiased if $n^{\delta}(\hat\theta - \theta) \xrightarrow{d} Z$ for some $\delta > 0$, where $Z$ is a non-degenerate random variable with $E(Z) = 0$.

Remark 5 The requirement $n^{\delta}(\hat\theta - \theta) \xrightarrow{d} Z$, $\delta > 0$, implies that $\hat\theta$ is a consistent estimator. Typically $\delta = 1/2$ and $\hat\theta$ is referred to as a $\sqrt{n}$-consistent estimator.

Definition 10 (ARE) Let $\hat\theta_1$ and $\hat\theta_2$ be two estimators of $\theta$ such that $\sqrt{n}(\hat\theta_1 - \theta) \xrightarrow{d} N(0, \sigma_1^2(\theta))$ and $\sqrt{n}(\hat\theta_2 - \theta) \xrightarrow{d} N(0, \sigma_2^2(\theta))$; the asymptotic relative efficiency (ARE) of $\hat\theta_1$ relative to $\hat\theta_2$ is given by $\sigma_1^2(\theta)/\sigma_2^2(\theta)$, where $\sigma_1^2(\theta) = \lim_{n\to\infty} n\,Var(\hat\theta_1)$ and $\sigma_2^2(\theta) = \lim_{n\to\infty} n\,Var(\hat\theta_2)$.

Definition 11 (Best asymptotically normal (BAN)) $\hat\theta$ is said to be asymptotically efficient if

1. $\hat\theta \xrightarrow{p} \theta$ for all $\theta \in \Theta$;

2. $\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} N(0, \sigma^2(\theta))$;

3. There is no other estimator, $\tilde\theta$, fulfilling 1) and 2) with $\sigma_{\tilde\theta}^2(\theta) < \sigma^2(\theta)$.
Example 5 Consider again the linear regression model $y = X\beta + \varepsilon$. If we add the assumption that $\lim_{n\to\infty} n^{-1}X'X = Q$, or $\operatorname{plim} n^{-1}X'X = \lim n^{-1}E(X'X) = Q$ for $X$ stochastic, with $Q$ a positive definite matrix, we have that the OLS estimator $b$ is consistent. We prove this for the case of fixed $X$. We have that
$$b = \beta + \left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}.$$
By assumption $n^{-1}X'X \to Q$ and
$$\frac{X'\varepsilon}{n} = \frac{\sum x_i \varepsilon_i}{n}$$
looks like something a law of large numbers could apply to. We have $E(x_i\varepsilon_i) = 0$ and $Var(X'\varepsilon) = E(X'\varepsilon\varepsilon'X) = E(X'\sigma^2 I X) = \sigma^2 X'X$. This immediately gives
$$\lim_{n\to\infty} Var(X'\varepsilon/n) = \lim_{n\to\infty}\frac{1}{n}\, n^{-1}\sigma^2 X'X = \lim_{n\to\infty}\frac{1}{n}\cdot\lim_{n\to\infty} n^{-1}\sigma^2 X'X = \lim_{n\to\infty}\frac{1}{n}\,\sigma^2 Q = 0$$
and $\operatorname{plim} n^{-1}X'\varepsilon = 0$ by the Markov LLN. It follows that $\operatorname{plim} b = \beta$.

If in addition $E\left|\lambda' x_i \varepsilon_i\right|^{2+\delta} \le B < \infty$ for $\lambda'\lambda = 1$ then $b$ is also asymptotically normal. We will use this to establish that the condition
$$\lim_{n\to\infty}\frac{\sum_{i=1}^n E\left|\lambda' x_i \varepsilon_i\right|^{2+\delta}}{\left(\sum_{i=1}^n Var(\lambda' x_i \varepsilon_i)\right)^{(2+\delta)/2}} = 0$$
for the Liapunov theorem holds. Since the numerator is dominated by $nB$ and
$$\lim_{n\to\infty} n^{-1}\sum Var(\lambda' x_i \varepsilon_i) = \sigma^2 \lambda'\left[\lim n^{-1}\sum x_i x_i'\right]\lambda = \sigma^2 \lambda' Q \lambda > 0$$
we have
$$\lim_{n\to\infty}\frac{\sum E\left|\lambda' x_i \varepsilon_i\right|^{2+\delta}}{\left(\sum Var(\lambda' x_i \varepsilon_i)\right)^{(2+\delta)/2}} \le \lim_{n\to\infty}\frac{nB}{\left(\sum Var(\lambda' x_i \varepsilon_i)\right)^{(2+\delta)/2}} = \lim_{n\to\infty}\frac{n^{-\delta/2} B}{\left(n^{-1}\sum Var(\lambda' x_i \varepsilon_i)\right)^{(2+\delta)/2}} = \frac{\lim_{n\to\infty} n^{-\delta/2} B}{\left(\lim_{n\to\infty} n^{-1}\sum Var(\lambda' x_i \varepsilon_i)\right)^{(2+\delta)/2}} = 0.$$
We also have that $\lim_{n\to\infty}\sqrt{n}(\mu_n - \mu) = 0$ trivially holds because $\mu_n = E(\lambda' x_i \varepsilon_i/n) = 0$ for all $i$, and thus $\lim_{n\to\infty}\mu_n = 0 = \mu$. The Liapunov CLT now gives that $\sqrt{n}\sum(\lambda' x_i \varepsilon_i/n) = \sqrt{n}\,\lambda'(X'\varepsilon/n) \xrightarrow{d} N(0, \sigma^2 \lambda' Q \lambda)$. Applying the Cramér-Wold device gives $\sqrt{n}(X'\varepsilon/n) \xrightarrow{d} N(0, \sigma^2 Q)$. Using Cramér's theorem then gives
$$\sqrt{n}(b - \beta) = \left(n^{-1}X'X\right)^{-1}\sqrt{n}(X'\varepsilon/n) \xrightarrow{d} N\left(0, \sigma^2 Q^{-1}\right)$$
since $\lim_{n\to\infty}\left(n^{-1}X'X\right)^{-1} = Q^{-1}$.
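A simulation sketch (my own illustrative setup, not from the notes) of the limiting result $\sqrt{n}(b - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1})$ for a single stochastic regressor; the errors are deliberately non-normal so that the normality of $\sqrt{n}(b - \beta)$ comes from the CLT rather than from the error distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
beta, sigma2, n, reps = 2.0, 1.0, 500, 20_000

stats = np.empty(reps)
for r in range(reps):
    x = rng.normal(1.0, 1.0, size=n)                    # Q = E(x^2) = 2
    eps = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)  # mean 0, variance 1, non-normal
    y = beta * x + eps
    b = (x @ y) / (x @ x)                               # OLS with a single regressor
    stats[r] = np.sqrt(n) * (b - beta)

Q = 2.0
print(stats.mean(), stats.var())   # approximately 0 and sigma^2 / Q = 0.5
print(sigma2 / Q)
```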
4 Maximum likelihood

Definition 12 (Likelihood) The likelihood is the data density viewed as a function of the parameters,
$$L(\theta; x) = f(x; \theta).$$
The likelihood is a random variable since it depends on the data.

Definition 13 (MLE) We define the maximum likelihood estimator (MLE) as
$$\hat\theta = \arg\max_{\theta\in\Theta} L(\theta; x)$$
where $x = (x_1, \dots, x_n)$ denotes the data, and $x_i$ and $\theta$ may be vectors.

Remark 6 Alternatively, the MLE can be defined as the solution to the FOC
$$\frac{\partial L(\theta; x)}{\partial\theta} = 0.$$
This definition has two problems: the likelihood may have local maxima, i.e. there may be multiple solutions to the FOC, and the derivative may not be well defined. Despite these shortcomings we will, for simplicity, rely on this as the definition of the MLE for much of what follows.
Example 6 Suppose that $X_i$, $i = 1, \dots, n$, are iid $U(0, \theta)$. We have
$$f(x) = \begin{cases}\frac{1}{\theta} & 0 \le x \le \theta\\ 0 & \text{otherwise}\end{cases}$$
and the likelihood is given by
$$L(\theta; x) = \frac{1}{\theta^n}\, I\left(X_{(n)} \le \theta\right)$$
where $X_{(n)}$ is the $n$th order statistic, i.e. $X_{(n)} = \max(X_1, \dots, X_n)$. It is clear that the FOC $-n\theta^{-(n+1)} = 0$ will not provide a sensible answer. On the other hand it is easily seen, since $L(\theta; x)$ is decreasing in $\theta$, that the likelihood is maximized by $\hat\theta = X_{(n)}$.
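A quick simulation of Example 6 (not in the original notes; $\theta$ and $n$ are arbitrary) shows that the MLE $\hat\theta = X_{(n)}$ is biased downwards, since $E(X_{(n)}) = \frac{n}{n+1}\theta$:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 5.0, 20, 50_000

x = rng.uniform(0.0, theta, size=(reps, n))
theta_hat = x.max(axis=1)        # MLE: the largest observation

print(theta_hat.mean())          # Monte Carlo E(theta_hat)
print(n / (n + 1) * theta)       # theoretical n/(n+1) * theta
```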
Remark 7 For independent data we can write the likelihood as
$$L(\theta; x) = f(x_1, \dots, x_n; \theta) = \prod_{i=1}^n f_i(x_i; \theta)$$
and, conveniently, the log-likelihood as
$$\ln L(\theta; x) = l(\theta; x) = \sum_{i=1}^n \ln f_i(x_i; \theta).$$
This decomposition turns out to be crucial in the derivation of many of the properties of MLEs. For dependent data we can, somewhat less conveniently, write
$$L(\theta; x) = f(x_1; \theta) f(x_2 \mid x_1; \theta) \cdots f(x_n \mid x_1, \dots, x_{n-1}; \theta) = \prod_{i=1}^n f(x_i \mid x_j,\, j < i; \theta)$$
$$l(\theta; x) = \sum_{i=1}^n \ln f(x_i \mid x_j,\, j < i; \theta)$$
and the derivations below go through with relatively small changes.
Definition 14 (Score) The derivative of the log-likelihood
$$s(\theta; x) = \frac{\partial l(\theta; x)}{\partial\theta}$$
is referred to as the score vector.

Lemma 1 The score vector evaluated at the true parameter values, $\theta_0$, has expectation zero.

Proof. Since $L(\theta; x)$ is the density of the data we have
$$1 = \int L(\theta_0; x)\,dx.$$
Differentiate both sides w.r.t. $\theta$:
$$0 = \frac{\partial}{\partial\theta}\int L(\theta_0; x)\,dx = \int \frac{\partial L(\theta_0; x)}{\partial\theta}\,dx = \int \frac{1}{L(\theta_0; x)}\frac{\partial L(\theta_0; x)}{\partial\theta}\, L(\theta_0; x)\,dx = \int \frac{\partial l(\theta_0; x)}{\partial\theta}\, L(\theta_0; x)\,dx = E\left[s(\theta_0; x)\right].$$
Definition 15 (Fisher Information) The information matrix is the variance-covariance matrix of the score vector evaluated at the true parameter values $\theta_0$,
$$I(\theta) = E\left[s(\theta_0; x) s(\theta_0; x)'\right] = E\left[\frac{\partial l(\theta_0; x)}{\partial\theta}\frac{\partial l(\theta_0; x)}{\partial\theta'}\right].$$

Remark 8 Note the use of the convention that the derivative w.r.t. the column vector $\theta$ is a column vector and the derivative w.r.t. the row vector $\theta'$ is a row vector.

Remark 9 The Fisher information is a measure of the information about $\theta$ we, on average, can expect to find in a sample of given size.

Theorem 3 (Information matrix equality)
$$I(\theta) = -E\left[\frac{\partial^2 l(\theta_0; x)}{\partial\theta\partial\theta'}\right] = E\left[\frac{\partial l(\theta_0; x)}{\partial\theta}\frac{\partial l(\theta_0; x)}{\partial\theta'}\right] = Var(s(\theta_0; x))$$
Proof. Write
$$0 = \int \frac{\partial l(\theta_0; x)}{\partial\theta}\, L(\theta_0; x)\,dx$$
and differentiate both sides:
$$0 = \int \frac{\partial l(\theta_0; x)}{\partial\theta}\frac{\partial L(\theta_0; x)}{\partial\theta'}\,dx + \int \frac{\partial^2 l(\theta_0; x)}{\partial\theta\partial\theta'}\, L(\theta_0; x)\,dx = \int \frac{\partial l(\theta_0; x)}{\partial\theta}\frac{\partial l(\theta_0; x)}{\partial\theta'}\, L(\theta_0; x)\,dx + \int \frac{\partial^2 l(\theta_0; x)}{\partial\theta\partial\theta'}\, L(\theta_0; x)\,dx.$$
That is,
$$E\left[\frac{\partial l(\theta_0; x)}{\partial\theta}\frac{\partial l(\theta_0; x)}{\partial\theta'}\right] = -E\left[\frac{\partial^2 l(\theta_0; x)}{\partial\theta\partial\theta'}\right].$$

Remark 10 For iid data we can write the information as
$$I(\theta) = -nE\left[\frac{\partial^2 \ln f(x_i; \theta_0)}{\partial\theta\partial\theta'}\right] = nE\left[\frac{\partial \ln f(x_i; \theta_0)}{\partial\theta}\frac{\partial \ln f(x_i; \theta_0)}{\partial\theta'}\right].$$
Condition 1 We have assumed that $\frac{\partial}{\partial\theta}\int L(\theta_0; x)\,dx = \int \frac{\partial L(\theta_0; x)}{\partial\theta}\,dx$ holds. This is not necessarily the case. Roughly speaking, the requirement for this to hold is that the distribution isn't too fat-tailed and that the domain of $x$ does not depend on $\theta$. Sufficient conditions for this and the Cramér-Rao theorem below (Theorem 5) are that

1. The parameter space $\Theta$, $\theta \in \Theta$, is an open rectangle, or we can restrict the parameter space to an open rectangle.

2. The domain of $x$ does not depend on $\theta$.

3. The score vector $s$ has finite expectation and variance $\forall\theta \in \Theta$.
Example 7 (Example 6 continued) With the uniform likelihood we have $l(\theta; x) = -n\ln(\theta)$ for $\theta \ge X_{(n)}$ and
$$\frac{\partial l(\theta; x)}{\partial\theta} = -\frac{n}{\theta}, \qquad \frac{\partial^2 l(\theta; x)}{\partial\theta^2} = \frac{n}{\theta^2}$$
and it is clear that both the information matrix equality and the lemma fail to hold. This should not be surprising since the domain of $X_i$ depends on $\theta$.
Example 8 Suppose that $X_i \sim NID(\mu, \sigma^2)$, $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/2\sigma^2}$, with likelihood
$$L(\mu, \sigma^2; x) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left(-\sum_{i=1}^n (x_i - \mu)^2/2\sigma^2\right)$$
$$l(\mu, \sigma^2; x) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$$
with
$$\frac{\partial l}{\partial\mu} = \frac{\sum_{i=1}^n (x_i - \mu)}{\sigma^2}, \qquad \frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \mu)^2$$
yielding the familiar estimates $\hat\mu = \bar x$, $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$.

It is easily verified that $E\left(\frac{\partial l}{\partial\mu}\right) = E\left(\frac{\partial l}{\partial\sigma^2}\right) = 0$. Furthermore
$$E\left(\frac{\partial l}{\partial\mu}\right)^2 = E\left[\frac{\sum_{i=1}^n (x_i - \mu)}{\sigma^2}\right]^2 = \frac{1}{\sigma^4}E\left[\sum_{i=1}^n\sum_{j=1}^n (x_i - \mu)(x_j - \mu)\right] = \frac{n\sigma^2}{\sigma^4} = \frac{n}{\sigma^2}$$
$$E\left(\frac{\partial l}{\partial\sigma^2}\right)^2 = E\left[-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \mu)^2\right]^2 = E\left[\frac{n^2}{4\sigma^4} - \frac{n}{2\sigma^6}\sum_{i=1}^n (x_i - \mu)^2 + \frac{1}{4\sigma^8}\sum_{i=1}^n\sum_{j=1}^n (x_i - \mu)^2(x_j - \mu)^2\right]$$
where, by independence,
$$E(x_i - \mu)^2(x_j - \mu)^2 = \begin{cases}3\sigma^4 & i = j\\ \sigma^4 & i \ne j\end{cases}$$
so that
$$E\left(\frac{\partial l}{\partial\sigma^2}\right)^2 = \frac{n^2}{4\sigma^4} - \frac{n^2}{2\sigma^4} + \frac{\left[3n + n(n-1)\right]\sigma^4}{4\sigma^8} = \frac{n}{2\sigma^4}$$
$$E\left(\frac{\partial l}{\partial\mu}\frac{\partial l}{\partial\sigma^2}\right) = E\left[\frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu)\left(-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \mu)^2\right)\right] = E\left[-\frac{n}{2\sigma^4}\sum_{i=1}^n (x_i - \mu) + \frac{1}{2\sigma^6}\sum_{i=1}^n\sum_{j=1}^n (x_i - \mu)(x_j - \mu)^2\right] = 0$$
and the information matrix is given by
$$I(\mu, \sigma^2) = \begin{pmatrix}\frac{n}{\sigma^2} & 0\\ 0 & \frac{n}{2\sigma^4}\end{pmatrix}.$$
To verify that the information matrix equality holds we evaluate
$$E\left(\frac{\partial^2 l}{\partial\mu^2}\right) = E\left(-\frac{\sum_{i=1}^n 1}{\sigma^2}\right) = -\frac{n}{\sigma^2}$$
$$E\left(\frac{\partial^2 l}{\partial(\sigma^2)^2}\right) = E\left(\frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^n (x_i - \mu)^2\right) = \frac{n}{2\sigma^4} - \frac{n\sigma^2}{\sigma^6} = -\frac{n}{2\sigma^4}$$
$$E\left(\frac{\partial^2 l}{\partial\mu\partial\sigma^2}\right) = E\left(-\frac{1}{\sigma^4}\sum_{i=1}^n (x_i - \mu)\right) = 0$$
and it is clear that $E\left[\frac{\partial l}{\partial\theta}\frac{\partial l}{\partial\theta'}\right] = -E\left[\frac{\partial^2 l}{\partial\theta\partial\theta'}\right]$ holds.
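The two forms of $I(\mu, \sigma^2)$ in Example 8 can be checked numerically. The sketch below (assuming NumPy; parameter values are arbitrary, not from the notes) averages the outer product of the sample score, evaluated at the true parameters, over simulated samples and compares it with the analytic information matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, n, reps = 1.0, 2.0, 25, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

# sample score evaluated at the true (mu, sigma^2), one row per replication
s_mu = (x - mu).sum(axis=1) / sigma2
s_s2 = -n / (2 * sigma2) + ((x - mu) ** 2).sum(axis=1) / (2 * sigma2 ** 2)
S = np.column_stack([s_mu, s_s2])

print(S.T @ S / reps)                             # Monte Carlo E(s s')
print(np.array([[n / sigma2, 0.0],
                [0.0, n / (2 * sigma2 ** 2)]]))   # analytic I(mu, sigma^2)
```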
5 Small sample optimality results

Remark 11 Maximum likelihood estimators are functions of sufficient statistics rather than the full sample. To see this, note that if $T$ is a sufficient statistic we can write the likelihood as (recall the Factorization theorem)
$$L(x; \theta) = g(x) f(T; \theta) \implies l(x; \theta) = \ln g(x) + \ln f(T; \theta)$$
where $g(x)$ is a function of the data only and $f(T; \theta)$ is the marginal density of $T$. Maximizing $\ln f(T; \theta)$ w.r.t. $\theta$ will obviously give the same result as maximizing $l(x; \theta)$.

Theorem 4 (Rao-Blackwell) Let the density of the data be indexed by the parameter $\theta$, $T$ be a sufficient statistic for $\theta$ and $t(x)$ be an unbiased estimator of $u(\theta)$. Define the new estimator $\hat\theta = E(t(x) \mid T)$; then

1. $\hat\theta$ is an unbiased estimator of $u(\theta)$

2. $Var(\hat\theta) \le Var(t)$.

Proof. We must first establish that $\hat\theta$ can be used as an estimator, i.e. that it does not depend on $\theta$ and can be computed from the sample. To see this, note that $t(x)$ is a function of the sample and since $T$ is a sufficient statistic $g(x \mid T)$ does not depend on $\theta$. Consequently $\hat\theta = E(t(x) \mid T) = \int t(x) g(x \mid T)\,dx$ is independent of $\theta$. To show part 1, note that $E(\hat\theta) = E\left[E(t(x) \mid T)\right] = E(t(x)) = u(\theta)$ by the law of iterated expectations. For part 2 we have from Theorem 5.6 in Ramanathan that $Var(X) = E\left[Var(X \mid Y)\right] + Var\left[E(X \mid Y)\right]$; setting $t = X$ and $\hat\theta = E(X \mid Y)$ it is clear that part 2 must hold.

Remark 12 Rao-Blackwellization provides a general way of obtaining a reasonable estimator. Find an unbiased estimator (which by no means has to be a good estimator) and a sufficient statistic and construct the new estimator using the Rao-Blackwell theorem. In some cases this will even be an optimal estimator in the sense that it is a UMVUE.
Example 9 Consider again the case with iid Bernoulli data with parameter $p$. Suppose we take $t(X) = X_1$. Clearly this is an unbiased estimator of $p$: $E(X_1) = p$ and $Var(X_1) = p(1-p)$. The sufficient statistic is $T = \sum X_i$. Calculating $\hat p = E(X_1 \mid T)$ is a combinatorial problem: there are in total $n!/\left[T!(n-T)!\right]$ equally likely permutations of the $T$ ones and $n-T$ zeros given $T$. Of these there are $(n-1)!/\left[(T-1)!(n-T)!\right]$ permutations where $X_1 = 1$. This gives
$$P(X_1 = 1 \mid T) = \frac{(n-1)!\,T!}{n!\,(T-1)!} = \frac{T}{n}$$
and $\hat p = T/n$ with $E(\hat p) = E(T)/n = p$ and $Var(\hat p) = Var(T)/n^2 = p(1-p)/n$.
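A small simulation sketch of Example 9 (sample size and $p$ are arbitrary choices, not from the notes): Rao-Blackwellizing the crude estimator $X_1$ reduces the variance from $p(1-p)$ to $p(1-p)/n$ while keeping it unbiased.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 0.3, 10, 200_000

x = rng.binomial(1, p, size=(reps, n))
t_crude = x[:, 0]          # unbiased but crude: t(X) = X_1
t_rb = x.mean(axis=1)      # Rao-Blackwellized estimator E(X_1 | T) = T/n

print(t_crude.mean(), t_crude.var())   # approx p and p(1-p) = 0.21
print(t_rb.mean(), t_rb.var())         # approx p and p(1-p)/n = 0.021
```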
Definition 16 (Exponential family) A distribution characterized by a $k$-dimensional parameter vector $\theta$ is said to belong to the exponential family if its density or probability function can be written on the form
$$f(x) = C(\theta)\exp\left[\sum_{i=1}^k q_i(\theta) T_i(x)\right] h(x).$$

Remark 13 It follows from the factorization theorem that $(T_1, \dots, T_k)$ are sufficient statistics for $\theta$.

Remark 14 The exponential family is a large class of distributions, containing among others the binomial, normal, geometric, exponential and Poisson distributions.

Example 10 Consider the random variable $X$ with the normal pdf $f(x) = (2\pi\sigma^2)^{-0.5} e^{-0.5(x-\mu)^2/\sigma^2}$. To deduce that this pdf belongs to the exponential family, first note that $\theta = (\mu, \sigma^2)'$ and write
$$(2\pi\sigma^2)^{-0.5} e^{-0.5(x-\mu)^2/\sigma^2} = \frac{e^{-0.5\mu^2/\sigma^2}}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x}\cdot 1 = C(\theta)\, e^{q_1(\theta) T_1(x) + q_2(\theta) T_2(x)}\, h(x)$$
where $C(\theta) = \frac{e^{-0.5\mu^2/\sigma^2}}{\sqrt{2\pi\sigma^2}}$, $q_1(\theta) = -\frac{1}{2\sigma^2}$, $T_1(x) = x^2$, $q_2(\theta) = \frac{\mu}{\sigma^2}$, $T_2(x) = x$, and $h(x) = 1$.
In many cases it is not possible to establish the existence of a UMVUE. In those cases it is of interest to know how good the estimator at hand is. Is it worth the effort to try to find a better estimator? To answer this question we need to know how far off we are from the best possible case.
Theorem 5 (Cramér-Rao) Let $\hat\theta$ be an unbiased estimator of the $k$-dimensional parameter vector $\theta$ and suppose that the regularity conditions 1 hold. Then $Var(\hat\theta) - I^{-1}(\theta)$ is a positive semi-definite matrix and we write $Var(\hat\theta) \ge I^{-1}(\theta)$.

Proof. We have $\theta = E(\hat\theta) = \int \hat\theta\, L(\theta; x)\,dx$ and differentiate both sides w.r.t. $\theta$:
$$\frac{\partial\theta}{\partial\theta'} = I = \int \hat\theta\,\frac{\partial L(\theta; x)}{\partial\theta'}\,dx = \int \hat\theta\,\frac{1}{L(\theta; x)}\frac{\partial L(\theta; x)}{\partial\theta'}\, L(\theta; x)\,dx = \int \hat\theta\,\frac{\partial l(\theta; x)}{\partial\theta'}\, L(\theta; x)\,dx = \int \hat\theta\, s(\theta; x)'\, L(\theta; x)\,dx = Cov(\hat\theta, s)$$
since $E(s) = 0$, where $s$ is the score vector. The variance of $(\hat\theta', s')'$ is then
$$Var\begin{pmatrix}\hat\theta\\ s\end{pmatrix} = \begin{pmatrix}Var(\hat\theta) & I\\ I & I(\theta)\end{pmatrix}.$$
Note that any variance matrix is positive semi-definite and hence the variance of the linear combination $\left[I, -I^{-1}(\theta)\right](\hat\theta', s')'$ is positive semi-definite. This variance is given by
$$\begin{pmatrix}I & -I^{-1}(\theta)\end{pmatrix}\begin{pmatrix}Var(\hat\theta) & I\\ I & I(\theta)\end{pmatrix}\begin{pmatrix}I\\ -I^{-1}(\theta)\end{pmatrix} = Var(\hat\theta) - I^{-1}(\theta) \ge 0$$
which establishes the result.
Remark 15 The inverse information matrix $I^{-1}(\theta)$ provides a lower bound for the variance of an unbiased estimator and is referred to as the Cramér-Rao lower bound.

Remark 16 In the scalar parameter case the Cramér-Rao lower bound reduces to $Var(\hat\theta) \ge I(\theta)^{-1} = 1/I(\theta)$.

Remark 17 The notation $Var(\hat\theta) \ge I^{-1}(\theta)$ is justified in the vector-valued parameter case by noting that $a'\left[Var(\hat\theta) - I^{-1}(\theta)\right]a \ge 0$, or $a'\,Var(\hat\theta)\,a \ge a'\,I^{-1}(\theta)\,a$, for an arbitrary vector $a$ when $Var(\hat\theta) - I^{-1}(\theta)$ is positive semi-definite. That is, there is no linear combination $a'\hat\theta$ of any unbiased estimator $\hat\theta$ with smaller variance than $a'\,I^{-1}(\theta)\,a$.

Remark 18 There is no guarantee that there is an unbiased estimator that attains the Cramér-Rao lower bound.
Example 11 The information for the parameters $(\mu, \sigma^2)$ with iid normal data was obtained in Example 8 as
$$I(\mu, \sigma^2) = \begin{pmatrix}\frac{n}{\sigma^2} & 0\\ 0 & \frac{n}{2\sigma^4}\end{pmatrix}$$
and the Cramér-Rao lower bound is given by
$$I^{-1}(\mu, \sigma^2) = \begin{pmatrix}\frac{\sigma^2}{n} & 0\\ 0 & \frac{2\sigma^4}{n}\end{pmatrix}.$$
It is clear that $\bar x$ attains the lower bound, but $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2$ does not, because $Var(s^2) = \frac{2\sigma^4}{n-1}$, which follows from noting that $\sum_{i=1}^n (X_i - \bar X)^2/\sigma^2 \sim \chi^2(n-1)$. Clearly $Var(s^2)$ is greater than the Cramér-Rao lower bound for any finite $n$.

Theorem 6 Suppose that $t$ is an unbiased estimator of $\theta$ that attains the Cramér-Rao lower bound. Then $t$ is the MLE of $\theta$.
Proof. From the proof of the Cramér-Rao theorem we have that $Var(t) - I^{-1}(\theta) = Var\left(\left[I, -I^{-1}(\theta)\right](t', s')'\right)$ if $t$ is an unbiased estimator. By assumption $Var(t) - I^{-1}(\theta) = 0$, so $\left[I, -I^{-1}(\theta)\right](t', s')'$ must be constant and there is an exact linear relation between $t$ and $s$. Since $t$ is unbiased the linear relation has the form $t = A(\theta)\, s(\theta; x) + \theta$, or $s(\theta; x) = A^{-1}(\theta)(t - \theta)$. Setting the score to zero we obtain the MLE as $\hat\theta = t$.

Remark 19 This is a rather strong optimality result for MLEs but it should not be taken to imply that the MLE always is unbiased or that it always attains the Cramér-Rao lower bound. In particular it does not imply that an MLE is UMVUE.

Example 12 Consider again the case of iid normal data. The MLE of $\sigma^2$ is $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$ with $E(\hat\sigma^2) = \frac{n-1}{n}\sigma^2$ (biased) and $Var(\hat\sigma^2) = \frac{2\sigma^4(n-1)}{n^2}$.
6 Large sample optimality results

Theorem 7 (Consistency of MLE) Subject to the regularity conditions 1 the MLE $\hat\theta_n$ is consistent, $\hat\theta_n \xrightarrow{p} \theta_0$, the true parameter value.

Theorem 8 (Asymptotic normality of MLE) Let $\Omega^{-1} = \lim \frac{1}{n} I(\theta)$. If the regularity conditions 1 hold and if in addition the statistical model is identified and $l(\theta; x)$ is twice continuously differentiable then the asymptotic distribution of the MLE, $\hat\theta$, is normal,
$$\sqrt{n}\left(\hat\theta_n - \theta_0\right) \xrightarrow{d} N(0, \Omega).$$

Proof. We will again, for simplicity, assume that the data are iid. Note that this implies that
$$I(\theta) = nE\left[\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta}\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta'}\right] = -nE\left[\frac{\partial^2 \ln f(X_i; \theta_0)}{\partial\theta\partial\theta'}\right].$$
That is,
$$\Omega^{-1} = E\left[\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta}\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta'}\right] = Var\left[\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta}\right]$$
in this case. By the mean value theorem we can write, for some value $\bar\theta$ between $\theta_0$ and $\hat\theta_n$,
$$s_n(\theta_0; x) = s_n(\hat\theta_n; x) + \frac{\partial s_n(\bar\theta; x)}{\partial\theta'}\left(\theta_0 - \hat\theta_n\right) = \frac{\partial s_n(\bar\theta; x)}{\partial\theta'}\left(\theta_0 - \hat\theta_n\right)$$
since the MLE $\hat\theta_n$ sets the score to zero. Alternatively we can write this as
$$\left(\theta_0 - \hat\theta_n\right) = \left[\frac{\partial s_n(\bar\theta; x)}{\partial\theta'}\right]^{-1} s_n(\theta_0; x)$$
provided that $\frac{\partial s_n(\bar\theta; x)}{\partial\theta'}$ has full rank. Since
$$s_n(\theta_0; x) = \sum_{i=1}^n \frac{\partial \ln f(X_i; \theta_0)}{\partial\theta}$$
where $f(X_i; \theta_0)$ and $\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta}$ are iid random variables, we have by the (multivariate) Lindeberg-Lévy CLT that
$$\frac{1}{\sqrt{n}}\, s_n(\theta_0; x) \xrightarrow{d} N\left(0, \Omega^{-1}\right).$$
Secondly,
$$\frac{\partial s_n(\theta_0; x)}{\partial\theta'} = \sum_{i=1}^n \frac{\partial^2 \ln f(X_i; \theta_0)}{\partial\theta\partial\theta'}$$
is a sum of iid random matrices and
$$\frac{1}{n}\frac{\partial s_n(\theta_0; x)}{\partial\theta'} \xrightarrow{p} -\Omega^{-1}$$
by the Khinchine WLLN. In addition, $\hat\theta_n \xrightarrow{p} \theta_0$ implies $\bar\theta \xrightarrow{p} \theta_0$ and
$$\frac{1}{n}\frac{\partial s_n(\bar\theta; x)}{\partial\theta'} \xrightarrow{p} -\Omega^{-1}$$
by the Slutsky theorem. Note that this implies $-\Omega\,\frac{1}{n}\frac{\partial s_n(\bar\theta; x)}{\partial\theta'} \xrightarrow{p} I$. Next, write
$$\frac{1}{n}\frac{\partial s_n(\bar\theta; x)}{\partial\theta'}\;\sqrt{n}\left(\theta_0 - \hat\theta_n\right) = \frac{1}{\sqrt{n}}\, s_n(\theta_0; x).$$
Since $-\Omega\,\frac{1}{n}\frac{\partial s_n(\bar\theta; x)}{\partial\theta'} \xrightarrow{p} I$ we have that $\sqrt{n}\left(\theta_0 - \hat\theta_n\right)$ has the same limiting distribution as
$$-\Omega\,\frac{1}{\sqrt{n}}\, s_n(\theta_0; x) \xrightarrow{d} N(0, \Omega)$$
which establishes the result.
Remark 20 The variance of the limiting distribution for the MLE is the inverse of the limit of the average information. That is, asymptotically the MLE attains the Cramér-Rao lower bound. This implies that the MLE is Best Asymptotically Normal, i.e. there is no other asymptotically normal estimator whose limiting distribution has a smaller variance. This provides a strong rationale for the use of maximum likelihood.

Remark 21 Note the crucial role that the information matrix equality plays in giving us a simple form for the variance of the limiting distribution.

Example 13 For normal data, $X_i$ iid $N(\mu, \sigma^2)$, the information matrix is given by
$$I(\mu, \sigma^2) = \begin{pmatrix}\frac{n}{\sigma^2} & 0\\ 0 & \frac{n}{2\sigma^4}\end{pmatrix}.$$
It follows that
$$\sqrt{n}\left[\begin{pmatrix}\hat\mu\\ \hat\sigma^2\end{pmatrix} - \begin{pmatrix}\mu\\ \sigma^2\end{pmatrix}\right] \xrightarrow{d} N(0, \Omega)$$
for
$$\Omega = \lim n I^{-1}(\mu, \sigma^2) = \begin{pmatrix}\sigma^2 & 0\\ 0 & 2\sigma^4\end{pmatrix}.$$
From exercise 5 in the asymptotics lecture notes we deduce that $\sqrt{n}\left(\hat\sigma_n^2 - \sigma^2\right) \xrightarrow{d} N\left(0, (\kappa - 1)\sigma^4\right)$ where $\kappa = E(X_i - \mu)^4/\sigma^4 = 3$ for normal data.
Example 14 Suppose that $X_i$, $i = 1, \dots, n$, is iid Bernoulli with parameter $p$. The log-likelihood is
$$l(p; x) = T\ln p + (n - T)\ln(1 - p)$$
for $T = \sum_{i=1}^n x_i$. The score is
$$\frac{\partial l(p; x)}{\partial p} = \frac{T}{p} - \frac{n - T}{1 - p}.$$
Setting the score to zero and solving for $p$ gives the MLE as $\hat p = \frac{T}{n}$. We obtain the Fisher information as
$$I(p) = -E\left[\frac{\partial^2 l(p; x)}{\partial p^2}\right] = E\left[\frac{T}{p^2} + \frac{n - T}{(1 - p)^2}\right] = \frac{np}{p^2} + \frac{n(1 - p)}{(1 - p)^2} = \frac{n}{p} + \frac{n}{1 - p} = \frac{n}{p(1 - p)}.$$
Since the regularity conditions hold it follows that $\hat p$ is consistent and that
$$\sqrt{n}(\hat p - p) \xrightarrow{d} N(0, p(1 - p)).$$
The results are easily verified by applying a suitable LLN and CLT to $\hat p = \sum_{i=1}^n x_i/n$. Noting that $T \sim Bin(n, p)$, a common rule of thumb for when the asymptotic distribution provides a good approximation to the exact finite sample distribution is that $np(1 - p) \ge 9$.
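A quick Monte Carlo sketch of Example 14 (the values of $n$ and $p$ below are illustrative choices, not from the notes, and satisfy the rule of thumb since $np(1-p) = 21$):

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, reps = 0.3, 100, 100_000

t = rng.binomial(n, p, size=reps)   # T ~ Bin(n, p)
p_hat = t / n                       # the MLE
z = np.sqrt(n) * (p_hat - p)

print(z.mean(), z.var())            # approximately 0 and p(1-p)
print(p * (1 - p))                  # limiting variance 0.21
```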
7 When the form of the likelihood is unknown (optional)

1. It generally is unknown.

2. We can't expect to get exact small sample results.
   (a) We must rely on asymptotic results.
   (b) In special cases we may be able to obtain the small sample bias and variance of the estimator.

3. Maximum likelihood is out of the question.

4. Maximize the wrong likelihood, on purpose or out of ignorance: Quasi Maximum Likelihood (QML). The QMLE can, under more restrictive conditions than above, be shown to be consistent and asymptotically normal. The major difference is that the information matrix equality doesn't hold for the QMLE and we get
   $$\sqrt{n}\left(\hat\theta_{QML} - \theta_0\right) \xrightarrow{d} N\left(0, A^{-1}BA^{-1}\right)$$
   for
   $$A = \operatorname{plim}\frac{1}{n}\frac{\partial s_n(\theta_0; x)}{\partial\theta'}, \qquad B = \operatorname{plim}\frac{1}{n}\, s_n(\theta_0; x)\, s_n(\theta_0; x)'.$$
   (A numerical sketch of this sandwich variance follows at the end of this section.)

5. Estimators that don't rely on the likelihood.
   (a) Least squares.
   (b) Generalized Method of Moments (GMM). GMM specifies a set of $k$ moment conditions $E\left[g_n(\theta_0; x)\right] = 0$, where $\theta$ is a $k$-dimensional parameter vector, and minimizes $g_n(\theta; x)' g_n(\theta; x)$. It is possible to show, under more restrictive conditions than above, that the GMM estimator is consistent and asymptotically normal,
   $$\sqrt{n}\left(\hat\theta_{GMM} - \theta_0\right) \xrightarrow{d} N(0, V)$$
   where $V^{-1} = \lim\frac{1}{n} Var(g_n(\theta_0; x))$.

Remark 22 We know that the MLE attains the Cramér-Rao lower bound asymptotically and it should be clear that we in general suffer a loss in efficiency by using estimators other than the MLE.

Remark 23 Note that Least Squares and ML are special cases of GMM. This is seen by setting the FOCs of LS or ML as the GMM moment conditions, e.g. $E\left[s_n(\theta_0; x)\right] = 0$ for ML.
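As referenced in item 4 above, here is a sketch of the sandwich variance $A^{-1}BA^{-1}$ in a deliberately misspecified toy example (my own construction, not from the notes): a Gaussian quasi-likelihood with unit variance is used for the mean of data that are actually exponential, so the robust and the naive "ML" variances differ.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5_000
x = rng.exponential(scale=2.0, size=n)   # true mean 2, true variance 4

# Gaussian quasi-log-likelihood with unit variance: l_i(mu) = -(x_i - mu)^2 / 2
mu_qml = x.mean()                        # the QMLE of the mean
s = x - mu_qml                           # per-observation scores at the QMLE

A = -1.0                                 # average per-observation Hessian (exact here)
B = (s ** 2).mean()                      # average squared score
sandwich = B / A ** 2                    # A^{-1} B A^{-1}

print(sandwich / n)                      # robust variance of mu_qml, approx 4/n
print(-1.0 / (A * n))                    # naive "ML" variance 1/n, too small here
print(x.var(ddof=1) / n)                 # direct estimate of Var of the sample mean
```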
8 Worked exercises

8.1 Exercises

1. Exercise 8.1 (b)-(e) in Ramanathan.

2. Exercise 8.2 in Ramanathan.

3. Exercise 8.9 (a)-(c) in Ramanathan. In addition, obtain $E(\hat\alpha)$ and $Var(\hat\alpha)$ where $\hat\alpha$ is the MLE of $\alpha$.

4. Consider the regression model $y = x\beta + z\gamma + \varepsilon$ where $\beta$ and $\gamma$ are scalars. In addition we are told that the $\varepsilon_i$ are iid, the $x_i$ are iid, $x_i$ and $\varepsilon_i$ are independent of each other, $\frac{x'x}{n} \xrightarrow{p} c = E(x_i^2) \ne 0$ and $\frac{x'z}{n} \xrightarrow{p} d \ne 0$.
   (a) Suppose $\gamma$ is known and consider the estimator $b = \frac{x'(y - \gamma z)}{x'x}$. Obtain the limiting distribution of $b$.
   (b) Suppose instead that $\gamma$ is estimated by $\tilde\gamma$, independent of $\varepsilon$, with $\sqrt{n}(\tilde\gamma - \gamma) \xrightarrow{d} N(0, 1)$. Define the estimator
   $$\tilde e = \frac{x'(y - \tilde\gamma z)}{x'x}$$
   and obtain the limiting distribution of $\tilde e$.
   (c) Are $b$ and $\tilde e$ consistent estimators of $\beta$?
8.2 Solutions

1. $f(x; \theta) = k\theta^x$, a discrete geometric distribution, i.e. $k = 1 - \theta$.

   (b) We have
   $$L(x; \theta) = \prod_{i=1}^n (1 - \theta)\theta^{x_i} = (1 - \theta)^n \theta^{\sum_{i=1}^n x_i}$$
   and it is clear from the factorization theorem that $\sum_{i=1}^n x_i$ and $\bar x = \frac{1}{n}\sum_{i=1}^n x_i$ are sufficient statistics.

   (c) We have
   $$\frac{\partial l(x; \theta)}{\partial\theta} = -\frac{n}{1 - \theta} + \frac{\sum_{i=1}^n x_i}{\theta}$$
   $$\frac{\partial^2 l(x; \theta)}{\partial\theta^2} = -\frac{n}{(1 - \theta)^2} - \frac{\sum_{i=1}^n x_i}{\theta^2}.$$
   Since $E(x_i) = \frac{\theta}{1 - \theta}$ we have
   $$I(\theta) = -E\left[\frac{\partial^2 l(x; \theta)}{\partial\theta^2}\right] = \frac{n}{(1 - \theta)^2} + \frac{n}{(1 - \theta)\theta} = \frac{n}{(1 - \theta)^2\theta}.$$
   It is easy to verify that the outer product (score) form of the information matrix gives the same result,
   $$E\left[\frac{\partial l(x; \theta)}{\partial\theta}\right]^2 = E\left[-\frac{n}{1 - \theta} + \frac{\sum_{i=1}^n x_i}{\theta}\right]^2 = \frac{n^2}{(1 - \theta)^2} - \frac{2n E\left(\sum_{i=1}^n x_i\right)}{(1 - \theta)\theta} + \frac{E\left(\sum_{i=1}^n x_i\right)^2}{\theta^2}$$
   and, using independence,
   $$= \frac{n^2}{(1 - \theta)^2} - \frac{2n^2}{(1 - \theta)^2} + \frac{E\left(\sum_{i=1}^n x_i^2\right)}{\theta^2} + \frac{E\left(\sum_{i=1}^n\sum_{j\ne i} x_i x_j\right)}{\theta^2} = \frac{n^2}{(1 - \theta)^2} - \frac{2n^2}{(1 - \theta)^2} + \frac{n}{(1 - \theta)^2\theta} + \frac{n}{(1 - \theta)^2} + \frac{n(n - 1)}{(1 - \theta)^2} = \frac{n}{(1 - \theta)^2\theta}.$$

   (d) Setting the score to zero we have
   $$\frac{\partial l(x; \theta)}{\partial\theta} = -\frac{n}{1 - \theta} + \frac{\sum_{i=1}^n x_i}{\theta} = 0 \iff \frac{\sum_{i=1}^n x_i}{n} = \frac{\theta}{1 - \theta}$$
   with the solution
   $$\hat\theta = \frac{\bar x}{1 + \bar x}.$$

   (e) Since the $x_i$ are iid we have $\bar x \xrightarrow{p} E(x_i) = \frac{\theta}{1 - \theta}$ by the Khinchine WLLN. It follows from the Slutsky theorem that $\hat\theta = g(\bar x) = \frac{\bar x}{1 + \bar x} \xrightarrow{p} g\left(\frac{\theta}{1 - \theta}\right) = \theta$ (a simulation check of this follows below).
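The simulation check referenced in part (e): a sketch assuming NumPy (note that `numpy`'s geometric sampler counts trials starting at 1, so 1 is subtracted to match the support $x = 0, 1, 2, \dots$ with success probability $1 - \theta$ used here).

```python
import numpy as np

rng = np.random.default_rng(9)
theta = 0.6
for n in (50, 500, 5_000, 50_000):
    # P(x) = (1 - theta) * theta^x on x = 0, 1, 2, ...
    x = rng.geometric(1.0 - theta, size=n) - 1
    xbar = x.mean()
    theta_hat = xbar / (1.0 + xbar)      # the MLE from part (d)
    print(n, theta_hat)                  # converges to theta = 0.6 as n grows
```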
2. $f(x; \theta) = \theta x^{\theta - 1}$ for $0 \le x \le 1$ and $\theta > 0$.

   (a) $\int_0^1 \theta x^{\theta - 1}\,dx = \left[x^{\theta}\right]_0^1 = 1$. It follows that $\int_0^1 x^{\theta}\,dx = \frac{1}{\theta + 1}$ and hence that
   $$E(x) = \int_0^1 x\,\theta x^{\theta - 1}\,dx = \frac{\theta}{\theta + 1}.$$

   (b) Differentiating $\frac{1}{\theta} = \int_0^1 x^{\theta - 1}\,dx$ w.r.t. $\theta$ gives
   $$-\frac{1}{\theta^2} = \frac{\partial}{\partial\theta}\int_0^1 x^{\theta - 1}\,dx = \int_0^1 \frac{\partial}{\partial\theta}\, e^{(\theta - 1)\ln x}\,dx = \int_0^1 \ln x\; x^{\theta - 1}\,dx.$$
   It follows that $E(\ln x) = \int_0^1 \ln x\,\theta x^{\theta - 1}\,dx = -\frac{1}{\theta}$.

   (c)
   $$\frac{2}{\theta^3} = \frac{\partial^2}{\partial\theta^2}\frac{1}{\theta} = \frac{\partial^2}{\partial\theta^2}\int_0^1 x^{\theta - 1}\,dx = \int_0^1 \frac{\partial}{\partial\theta}\ln x\; e^{(\theta - 1)\ln x}\,dx = \int_0^1 (\ln x)^2\, x^{\theta - 1}\,dx.$$
   Which gives
   $$E(\ln x)^2 = \int_0^1 (\ln x)^2\,\theta x^{\theta - 1}\,dx = \frac{2}{\theta^2}, \qquad Var(\ln x) = E\left(\ln x - E(\ln x)\right)^2 = E(\ln x)^2 - \left[E(\ln x)\right]^2 = \frac{2}{\theta^2} - \frac{1}{\theta^2} = \frac{1}{\theta^2}.$$

   (d) We have the random sample $x_1, \dots, x_n$. Independence gives the joint density as
   $$f(x_1, \dots, x_n; \theta) = \prod_{i=1}^n f(x_i; \theta) = \theta^n \left(\prod_{i=1}^n x_i\right)^{\theta - 1}.$$
   The likelihood is thus $L(\theta; x_1, \dots, x_n) = \theta^n \left(\prod_{i=1}^n x_i\right)^{\theta - 1}$. It follows from the factorization theorem (8.1) that $T_1 = \prod_{i=1}^n x_i$ is a sufficient statistic, since we can factorize the likelihood into the function $h(T_1, \theta) = \theta^n T_1^{\theta - 1}$, depending only on $\theta$ and $T_1$, and the function $g(x) = 1$, which does not depend on $\theta$ and $T_1$. The factorization theorem is if and only if; that is, $T_2 = \sum_{i=1}^n x_i$ is a sufficient statistic only if we can factorize the likelihood correspondingly for $T_2$. Inspection of the likelihood function shows that this is impossible and consequently $T_2$ is not a sufficient statistic for $\theta$. $T_3 = \sum_{i=1}^n \ln x_i$, on the other hand, is a sufficient statistic.

   (e) $\ln L(\theta; x) = n\ln(\theta) + (\theta - 1)\sum_{i=1}^n \ln x_i$ and
   $$\frac{\partial \ln L}{\partial\theta} = \frac{n}{\theta} + \sum_{i=1}^n \ln x_i.$$
   Setting the derivative to zero yields $\hat\theta = -\frac{n}{\sum_{i=1}^n \ln x_i}$. To verify that this is a maximum we need to show that the second derivative is negative at $\hat\theta$. We have $\frac{\partial^2 \ln L}{\partial\theta^2} = -\frac{n}{\theta^2}$, which is negative everywhere, so $\hat\theta$ is indeed a maximum. The corresponding estimator of $1/\theta$ is $1/\hat\theta = -\frac{1}{n}\sum_{i=1}^n \ln x_i$. It is easy to establish that this guess is correct for $\gamma = g(\theta)$ when $g$ is a monotone function (it holds for non-monotone functions as well, but is trickier to show); by monotonicity the inverse function $\theta = g^{-1}(\gamma)$ exists.

   For the asymptotic distribution, define
   $$Z_n = \frac{\overline{\ln x} - E(\ln x)}{\sqrt{Var\left(\overline{\ln x}\right)}}.$$
   We then have $Z_n \xrightarrow{d} N(0, 1)$ since the $\ln x_i$ are independent with $Var(\ln x_i) = 1/\theta^2 < \infty$ and thus fulfil the conditions of the Lindeberg-Lévy CLT.
   Comment: we have (for this estimator) verified the claim that ML estimators are asymptotically normally distributed. To see that the result is in accordance with theorem 8.12, note that $Z_n$ can be written as $Z_n = \sqrt{n}\,\theta\left(\overline{\ln x} + \frac{1}{\theta}\right)$.
3. Comment: The mean and variance we obtained shouldn't be too surprising. The distribution of $x$ is an exponential distribution with a shift in the location; that is, if $y$ is exponentially distributed with parameter $\lambda$, then $x$ is obtained as $x = y + \alpha$.

   (b) The likelihood is given by $L = \prod_{i=1}^n \frac{1}{\lambda}\, e^{-(x_i - \alpha)/\lambda}$. The expectation $E\left[\sum_{i=1}^n (x_i - \alpha)\right]^2 = \sum_{i=1}^n\sum_{j=1}^n E\left[(x_i - \alpha)(x_j - \alpha)\right]$ is a little bit tricky. For $i \ne j$ we have independence and $E\left[(x_i - \alpha)(x_j - \alpha)\right] = E(x_i - \alpha)E(x_j - \alpha) = \lambda^2$, and there are $n(n - 1)$ terms with $i \ne j$. This leaves $n$ terms with $i = j$ where we have $E(x_i - \alpha)^2 = 2\lambda^2$.

   Comment: The reason for using the score form of the information matrix is that the information matrix equality $E(SS') = E\left[\frac{\partial \ln L}{\partial\theta}\frac{\partial \ln L}{\partial\theta'}\right] = -E\left[\frac{\partial^2 \ln L}{\partial\theta\partial\theta'}\right]$ for $\theta = (\alpha, \lambda)'$ doesn't hold for this likelihood. When establishing that $E\left[\frac{\partial \ln L}{\partial\theta}\frac{\partial \ln L}{\partial\theta'}\right] = -E\left[\frac{\partial^2 \ln L}{\partial\theta\partial\theta'}\right]$ we needed to interchange the order of integration and differentiation, which is not valid here since the domain of $x$ depends on $\alpha$.

   Setting the score with respect to $\lambda$ to zero gives the MLE of $\lambda$ as $\frac{1}{n}\sum_{i=1}^n (x_i - \alpha)$, provided $\alpha$ is known. $S_1$ is obviously of little use for obtaining the MLE of $\alpha$. Instead we need to look at the likelihood function itself; writing this as $L = \frac{1}{\lambda^n}\, e^{-\sum_{i=1}^n (x_i - \alpha)/\lambda}$ it is clear that the likelihood is an increasing function of $\alpha$. On the other hand we have the condition $x_i \ge \alpha$; that is, the likelihood of observing a value of $x$ smaller than $\alpha$ is zero. The value of $\alpha$ maximizing the likelihood is thus the smallest value of $x_i$ in the sample, or the first order statistic; denote this by $x_{(1)}$. We have $T_1 = \hat\alpha = x_{(1)}$ and $T_2 = \hat\lambda = \frac{1}{n}\sum_{i=1}^n \left(x_i - x_{(1)}\right)$.

   Extra: From p. 137 in Ramanathan we get the density of the first order statistic as
   $$f_{x_{(1)}}(x) = n\left[1 - F_x(x)\right]^{n-1} f_x(x).$$
   We obtain the distribution function of $x$ as $F_x(x) = \int_\alpha^x \frac{1}{\lambda}\, e^{-(t - \alpha)/\lambda}\,dt = 1 - e^{-(x - \alpha)/\lambda}$.
4. (a) We have
   $$b = \frac{x'(y - \gamma z)}{x'x} = \frac{x'(x\beta + \varepsilon)}{x'x} = \frac{x'x\beta + x'\varepsilon}{x'x} = \beta + \frac{x'\varepsilon}{x'x}$$
   where $\frac{x'x}{n} \xrightarrow{p} c$. In addition, $\frac{1}{n}x'\varepsilon = \frac{1}{n}\sum_{i=1}^n x_i\varepsilon_i$, a sample average which a CLT might apply to. By assumption we have $E(x_i\varepsilon_i) = E(x_i)E(\varepsilon_i) = 0$ and $Var(x_i\varepsilon_i) = E(x_i^2\varepsilon_i^2) = E(x_i^2)\sigma^2 = \sigma^2 c < \infty$. Since $x_i$ and $\varepsilon_i$ are iid, $x_i\varepsilon_i$ is iid as well and the conditions for the Lindeberg-Lévy CLT hold. That is,
   $$\frac{1}{\sqrt{n}}\, x'\varepsilon \xrightarrow{d} N\left(0, \sigma^2 c\right).$$
   Write
   $$\sqrt{n}(b - \beta) = \frac{n^{-1/2}\, x'\varepsilon}{x'x/n}$$
   and it follows that
   $$\sqrt{n}(b - \beta) \xrightarrow{d} N\left(0, \sigma^2/c\right).$$

   (b) We have
   $$\tilde e = \frac{x'(y - \tilde\gamma z)}{x'x} = \beta + \frac{x'\varepsilon}{x'x} - \frac{x'z}{x'x}(\tilde\gamma - \gamma)$$
   so that
   $$\sqrt{n}(\tilde e - \beta) = \frac{n^{-1/2}\, x'\varepsilon}{x'x/n} - \frac{x'z/n}{x'x/n}\;\sqrt{n}(\tilde\gamma - \gamma)$$
   where the first term converges in distribution to $N(0, \sigma^2/c)$ and the second term to $N\left(0, \frac{d^2}{c^2}\right)$ since $\operatorname{plim}\frac{x'z/n}{x'x/n} = d/c$. Note that these limiting distributions are the same as those of $\frac{n^{-1/2}\, x'\varepsilon}{c}$ and $\frac{d}{c}\sqrt{n}(\tilde\gamma - \gamma)$, and the second hence does not depend on $x$ and $z$. By independence of $\varepsilon$ and $\tilde\gamma$ it follows that $\sqrt{n}(\tilde e - \beta)$ converges in distribution to the sum of two independent normal random variables. That is,
   $$\sqrt{n}(\tilde e - \beta) \xrightarrow{d} N\left(0, \frac{\sigma^2}{c} + \frac{d^2}{c^2}\right).$$

   (c) In both cases we have convergence in distribution when scaling by $\sqrt{n}$. It follows from corollary 2 in the Asymptotics lecture notes that $(b - \beta) \xrightarrow{p} 0$ and $(\tilde e - \beta) \xrightarrow{p} 0$ and the estimators are consistent.