1 Introduction

All models are wrong. Some models are useful. –George E.P. Box

1. Data Generating Process (DGP), the joint distribution of the data
   $$f(z_1, \dots, z_n; \theta)$$
   where the $z_i$ in general are vector-valued observations.

2. The theoretical (economic) model, being a simplification, is different from the DGP.

3. The DGP is unknown.

4. Statistical model of the data.
   (a) Provide a sufficiently good approximation to the DGP to make inference valid.
   (b) If the approximation is "bad" and inference is invalid we say that the model is misspecified.
   (c) There may be several "valid" models, differing in "goodness".

5. If the parameters of the theoretical model can be uniquely determined from the parameters of the statistical model we say that the theoretical model is identified.

6. In many cases we are only interested in a subset of the variables, $y_i$, and can write the DGP as
   $$f(z_1, \dots, z_n; \theta) = f_1(y_1, \dots, y_n \mid x_1, \dots, x_n; \theta_1)\, f_2(x_1, \dots, x_n; \theta_2).$$
   If $x_i$ is exogenous, $f_2$ can be ignored and it is sufficient to model $f_1$. Roughly speaking this is the case when $\theta_2$ does not contain any information about $\theta_1$.

In what follows the DGP is assumed known and all these issues are ignored!
2 Small sample properties of general estimators (criteria)

Definition 1 An estimator, $\hat\theta$, of $\theta$ is a function of the data, $\hat\theta(Z_1, \dots, Z_n)$. As such it is a random variable and has a sampling variability.

Definition 2 An estimate of $\theta$ is the estimator evaluated at the current sample, $\hat\theta(z_1, \dots, z_n)$.
Definition 3 (Unbiased) An estimator $\hat\theta$ of $\theta$ is unbiased if $E(\hat\theta) = \theta$. $b(\hat\theta, \theta) = E(\hat\theta) - \theta$ is the bias of $\hat\theta$.

Example 1 Consider the estimator $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$ of $\sigma^2$ where the $X_i$ are uncorrelated, $E(X_i) = \mu$ and $Var(X_i) = \sigma^2$. We have
$$(X_i - \bar X)^2 = (X_i - \mu - \bar X + \mu)^2 = (X_i - \mu)^2 - 2(X_i - \mu)(\bar X - \mu) + (\bar X - \mu)^2$$
$$E(X_i - \bar X)^2 = E(X_i - \mu)^2 - 2E(X_i - \mu)(\bar X - \mu) + E(\bar X - \mu)^2 = \sigma^2 - 2\frac{\sigma^2}{n} + \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2$$
and it is clear that $E(\hat\sigma^2) = \frac{n-1}{n}\sigma^2$ with $b(\hat\sigma^2, \sigma^2) = -\frac{\sigma^2}{n}$.
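The small-sample bias derived in Example 1 is easy to check by simulation. The following sketch is not part of the original notes and assumes NumPy; the sample size, variance and replication count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 10, 4.0, 100_000

# the biased estimator from Example 1: divide by n, not n - 1
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
sigma2_hat = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

print(sigma2_hat.mean())       # Monte Carlo estimate of E(sigma2_hat)
print((n - 1) / n * sigma2)    # theoretical value (n-1)/n * sigma^2 = 3.6
```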
Definition 4 (MSE) The Mean Square Error (MSE) of an estimator, $\hat\theta$, is given by $MSE(\hat\theta, \theta) = E(\hat\theta - \theta)^2$.

Remark 1 Note that we have
$$E(\hat\theta - \theta)^2 = E\left[\hat\theta - E(\hat\theta) + E(\hat\theta) - \theta\right]^2 = E\left[\hat\theta - E(\hat\theta)\right]^2 + 2E\left\{\left[\hat\theta - E(\hat\theta)\right]\left[E(\hat\theta) - \theta\right]\right\} + E\left[E(\hat\theta) - \theta\right]^2 = Var(\hat\theta) + 0 + b(\hat\theta, \theta)^2.$$
That is, the MSE of an unbiased estimator is just the variance.
Definition 5 (Relative efficiency) Let $\hat\theta_1$ and $\hat\theta_2$ be two alternative estimators of $\theta$. Then the ratio of the MSEs, $MSE(\hat\theta_1, \theta)/MSE(\hat\theta_2, \theta)$, is called the relative efficiency of $\hat\theta_1$ with respect to $\hat\theta_2$.

Definition 6 (UMVUE) An estimator $\hat\theta$ is a uniformly minimum variance unbiased estimator (UMVUE) if $E(\hat\theta) = \theta$ and, for any other unbiased estimator $\tilde\theta$, $Var(\hat\theta) \le Var(\tilde\theta)$ for all $\theta$.
Example 2 Consider the class, $\hat\mu = \sum_{i=1}^n w_i X_i$, of linear estimators of $\mu = E(X_i)$, where $Var(X_i) = \sigma^2$ and the $X_i$ are uncorrelated. Unbiasedness clearly requires that $\sum w_i = 1$ and the variance is given by
$$Var(\hat\mu) = E\left[\sum w_i (X_i - \mu)\right]^2 = E\sum_i \sum_j w_i w_j (X_i - \mu)(X_j - \mu) = \sigma^2 \sum w_i^2.$$
One unbiased estimator in this class is the familiar $\bar X$, which sets $w_i = 1/n$ and has variance $\sigma^2/n$. We will show that this is the UMVUE in the class of linear estimators. The first order condition for minimizing $Var(\hat\mu)$ subject to the restriction $\sum w_i = 1$ is
$$2\sigma^2 w_i = \lambda$$
for $\lambda$ the Lagrange multiplier. That is, all the weights are equal; together with $\sum w_i = 1$ this gives $w_i = 1/n$.
Remark 2 The notion of minimizing the variance is suggestive. One can define a general class of estimators by requiring the estimator to minimize the sample analogue of the variance,
$$\hat\mu = \arg\min_m\; n^{-1}\sum_{i=1}^n (X_i - m)^2,$$
with FOC $-2 n^{-1}\sum_{i=1}^n (X_i - \hat\mu) = 0$ and solution $\hat\mu = \frac{1}{n}\sum X_i$. This is the class of Least Squares estimators.
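As a small numerical illustration of Remark 2 (not from the notes; assumes NumPy, data distribution chosen arbitrarily), the least squares criterion is minimized at the sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=50)

# evaluate n^{-1} * sum (x_i - m)^2 over a grid of candidate values m
grid = np.linspace(x.min(), x.max(), 10_001)
crit = ((x[:, None] - grid[None, :]) ** 2).mean(axis=0)

print(grid[crit.argmin()])   # grid minimizer of the criterion
print(x.mean())              # closed-form least squares solution
```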
Example 3 Consider the linear regression model
$$y_i = x_i'\beta + \varepsilon_i, \qquad i = 1, \dots, n, \quad \text{or} \quad y = X\beta + \varepsilon.$$
The least squares estimator of $\beta$, $b$, is obtained by minimizing $q = e'e = (y - Xb)'(y - Xb)$. The FOC is
$$\frac{\partial q}{\partial b'} = -2(y - Xb)'X = 0$$
$$y'X = b'X'X$$
with solution
$$b = (X'X)^{-1}X'y$$
provided that $X'X$ has full rank (so the inverse is well-defined), i.e. that $X$ has rank $k$.

Theorem 1 (Gauss-Markov) Assume that $X$ is non-stochastic and $E(\varepsilon) = 0$, $Var(\varepsilon) = \sigma^2 I$. Then $Var(b) = \sigma^2 (X'X)^{-1}$ and $b$ is the BLUE (Best Linear Unbiased Estimator) of $\beta$. That is, $b$ is the UMVUE in the class of linear estimators, $\tilde b = Ay$.

Proof. Write
$$b = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon$$
so that $E(b) = \beta$ and
$$Var(b) = E\left[(b - \beta)(b - \beta)'\right] = E\left[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\right] = (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1} = (X'X)^{-1}X'\sigma^2 I X(X'X)^{-1} = \sigma^2 (X'X)^{-1}.$$
To prove that $b$ is BLUE, let $\tilde b = Ay$ be an unbiased linear estimator of $\beta$. Defining $C = A - (X'X)^{-1}X'$ we have
$$\tilde b = \left[C + (X'X)^{-1}X'\right]y = Cy + b = CX\beta + C\varepsilon + b.$$
Unbiasedness requires $CX = 0$, so that $\tilde b - \beta = \left[C + (X'X)^{-1}X'\right]\varepsilon$ and
$$Var(\tilde b) = E\left\{\left[C + (X'X)^{-1}X'\right]\varepsilon\varepsilon'\left[C + (X'X)^{-1}X'\right]'\right\} = \sigma^2 CC' + \sigma^2 CX(X'X)^{-1} + \sigma^2 (X'X)^{-1}X'C' + \sigma^2 (X'X)^{-1} = \sigma^2 CC' + \sigma^2 (X'X)^{-1}$$
and the variance of $\tilde b$ exceeds the variance of $b$ by the positive semi-definite matrix $\sigma^2 CC'$. This implies that $Var(\alpha'\tilde b) = Var(\alpha' b) + \sigma^2 \alpha' CC'\alpha \ge Var(\alpha' b)$ for any linear combination $\alpha'$.
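The claim $Var(b) = \sigma^2 (X'X)^{-1}$ for fixed $X$ can be illustrated by simulation. The sketch below is my own toy setup (design matrix, $\beta$ and $\sigma^2$ are arbitrary choices), not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 50, 2.0
beta = np.array([1.0, -0.5])

X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed, non-stochastic design
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20_000
bs = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)
    bs[r] = XtX_inv @ (X.T @ y)            # OLS via the normal equations

print(np.cov(bs, rowvar=False))            # Monte Carlo variance of b
print(sigma2 * XtX_inv)                    # theoretical sigma^2 (X'X)^{-1}
```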
Definition 7 (Sufficiency) Let $f(x; \theta)$ be the joint density of the data. $T(x)$ is said to be a sufficient statistic for $\theta$ if $g(x \mid T)$, the density of $x$ conditional on $T$, does not depend on $\theta$.

Remark 3 A sufficient statistic $T$ captures all the information about $\theta$ in the data. This means that we can base estimators on $T$ rather than the full sample.

Theorem 2 (Factorization theorem) Let $X_1, \dots, X_n$ be a random sample from $f(x; \theta)$. Then $T(x)$ is a sufficient statistic for $\theta$ iff
$$f(x; \theta) = g(x) f(T(x); \theta)$$
where $g$ does not depend on $\theta$.

Example 4 Let $X_i$ be iid Bernoulli with parameter $p$. $T = \sum_{i=1}^n X_i$ is then a sufficient statistic (i.e. the number of successes in $n$ trials). The joint pdf is given by
$$f(x; p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{n - \sum x_i} = g(x) f(T; p)$$
and we can put $g(x) = 1$ and $f(T; p) = p^T(1-p)^{n-T}$ with $T = \sum_{i=1}^n X_i$.

Remark 4 Note that sufficient statistics are not unique and may differ in how good they are at reducing the data. In the previous example $T_2 = n - \sum X_i$ and $T_3 = \left(\sum X_i,\; n - \sum X_i\right)$ are clearly sufficient statistics as well.
3 Large sample properties of general estimators (criteria)

Definition 8 (Consistency) An estimator $\hat\theta$ of $\theta$ is consistent if $\hat\theta \xrightarrow{p} \theta$.

Definition 9 (Asymptotically unbiased) An estimator $\hat\theta$ of $\theta$ is asymptotically unbiased if $n^{\delta}(\hat\theta - \theta) \xrightarrow{d} Z$ for some $\delta > 0$, where $Z$ is a non-degenerate random variable with $E(Z) = 0$.

Remark 5 The requirement $n^{\delta}(\hat\theta - \theta) \xrightarrow{d} Z$, $\delta > 0$, implies that $\hat\theta$ is a consistent estimator. Typically $\delta = 1/2$ and $\hat\theta$ is referred to as a $\sqrt{n}$-consistent estimator.

Definition 10 (ARE) Let $\hat\theta_1$ and $\hat\theta_2$ be two estimators of $\theta$ such that $\sqrt{n}(\hat\theta_1 - \theta) \xrightarrow{d} N(0, \sigma_1^2(\theta))$ and $\sqrt{n}(\hat\theta_2 - \theta) \xrightarrow{d} N(0, \sigma_2^2(\theta))$; the asymptotic relative efficiency (ARE) of $\hat\theta_1$ relative to $\hat\theta_2$ is given by $\sigma_1^2(\theta)/\sigma_2^2(\theta)$, where $\sigma_1^2(\theta) = \lim_{n\to\infty} n\,Var(\hat\theta_1)$ and $\sigma_2^2(\theta) = \lim_{n\to\infty} n\,Var(\hat\theta_2)$.

Definition 11 (Best asymptotically normal (BAN)) $\hat\theta$ is said to be asymptotically efficient if

1. $\hat\theta \xrightarrow{p} \theta$ for all $\theta \in \Theta$;

2. $\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} N(0, \sigma^2(\theta))$;

3. There is no other estimator, $\tilde\theta$, fulfilling 1) and 2) with $\sigma_{\tilde\theta}^2(\theta) < \sigma^2(\theta)$.
Example 5 Consider again the linear regression model $y = X\beta + \varepsilon$. If we add the assumption that $\lim_{n\to\infty} n^{-1}X'X = Q$, or $\operatorname{plim} n^{-1}X'X = \lim n^{-1}E(X'X) = Q$ for $X$ stochastic, with $Q$ a positive definite matrix, we have that the OLS estimator $b$ is consistent. We prove this for the case of fixed $X$. We have that
$$b = \beta + \left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}.$$
By assumption $n^{-1}X'X \to Q$ and
$$\frac{X'\varepsilon}{n} = \frac{\sum x_i \varepsilon_i}{n}$$
looks like something a law of large numbers could apply to. We have $E(x_i\varepsilon_i) = 0$ and $Var(X'\varepsilon) = E(X'\varepsilon\varepsilon'X) = E(X'\sigma^2 I X) = \sigma^2 X'X$. This immediately gives
$$\lim_{n\to\infty} Var(X'\varepsilon/n) = \lim_{n\to\infty}\frac{1}{n}\, n^{-1}\sigma^2 X'X = \lim_{n\to\infty}\frac{1}{n}\cdot\lim_{n\to\infty} n^{-1}\sigma^2 X'X = \lim_{n\to\infty}\frac{1}{n}\,\sigma^2 Q = 0$$
and $\operatorname{plim} n^{-1}X'\varepsilon = 0$ by the Markov LLN. It follows that $\operatorname{plim} b = \beta$.

If in addition $E\left|\lambda' x_i \varepsilon_i\right|^{2+\delta} \le B < \infty$ for $\lambda'\lambda = 1$ then $b$ is also asymptotically normal. We will use this to establish that the condition
$$\lim_{n\to\infty}\frac{\sum_{i=1}^n E\left|\lambda' x_i \varepsilon_i\right|^{2+\delta}}{\left(\sum_{i=1}^n Var(\lambda' x_i \varepsilon_i)\right)^{(2+\delta)/2}} = 0$$
for the Liapunov theorem holds. Since the numerator is dominated by $nB$ and
$$\lim_{n\to\infty} n^{-1}\sum Var(\lambda' x_i \varepsilon_i) = \sigma^2 \lambda'\left[\lim n^{-1}\sum x_i x_i'\right]\lambda = \sigma^2 \lambda' Q \lambda > 0$$
we have
$$\lim_{n\to\infty}\frac{\sum E\left|\lambda' x_i \varepsilon_i\right|^{2+\delta}}{\left(\sum Var(\lambda' x_i \varepsilon_i)\right)^{(2+\delta)/2}} \le \lim_{n\to\infty}\frac{nB}{\left(\sum Var(\lambda' x_i \varepsilon_i)\right)^{(2+\delta)/2}} = \lim_{n\to\infty}\frac{n^{-\delta/2} B}{\left(n^{-1}\sum Var(\lambda' x_i \varepsilon_i)\right)^{(2+\delta)/2}} = \frac{\lim_{n\to\infty} n^{-\delta/2} B}{\left(\lim_{n\to\infty} n^{-1}\sum Var(\lambda' x_i \varepsilon_i)\right)^{(2+\delta)/2}} = 0.$$
We also have that $\lim_{n\to\infty}\sqrt{n}(\mu_n - \mu) = 0$ trivially holds because $\mu_n = E(\lambda' x_i \varepsilon_i/n) = 0$ for all $i$, and thus $\lim_{n\to\infty}\mu_n = 0 = \mu$. The Liapunov CLT now gives that $\sqrt{n}\sum(\lambda' x_i \varepsilon_i/n) = \sqrt{n}\,\lambda'(X'\varepsilon/n) \xrightarrow{d} N(0, \sigma^2 \lambda' Q \lambda)$. Applying the Cramér-Wold device gives $\sqrt{n}(X'\varepsilon/n) \xrightarrow{d} N(0, \sigma^2 Q)$. Using Cramér's theorem then gives
$$\sqrt{n}(b - \beta) = \left(n^{-1}X'X\right)^{-1}\sqrt{n}(X'\varepsilon/n) \xrightarrow{d} N\left(0, \sigma^2 Q^{-1}\right)$$
since $\lim_{n\to\infty}\left(n^{-1}X'X\right)^{-1} = Q^{-1}$.
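A simulation sketch (my own illustrative setup, not from the notes) of the limiting result $\sqrt{n}(b - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1})$ for a single stochastic regressor; the errors are deliberately non-normal so that the normality of $\sqrt{n}(b - \beta)$ comes from the CLT rather than from the error distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
beta, sigma2, n, reps = 2.0, 1.0, 500, 20_000

stats = np.empty(reps)
for r in range(reps):
    x = rng.normal(1.0, 1.0, size=n)                    # Q = E(x^2) = 2
    eps = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)  # mean 0, variance 1, non-normal
    y = beta * x + eps
    b = (x @ y) / (x @ x)                               # OLS with a single regressor
    stats[r] = np.sqrt(n) * (b - beta)

Q = 2.0
print(stats.mean(), stats.var())   # approximately 0 and sigma^2 / Q = 0.5
print(sigma2 / Q)
```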
4 Maximum likelihood

Definition 12 (Likelihood) The likelihood is the data density viewed as a function of the parameters,
$$L(\theta; x) = f(x; \theta).$$
The likelihood is a random variable since it depends on the data.

Definition 13 (MLE) We define the maximum likelihood estimator (MLE) as
$$\hat\theta = \arg\max_{\theta\in\Theta} L(\theta; x)$$
where $x = (x_1, \dots, x_n)$ denotes the data, and $x_i$ and $\theta$ may be vectors.

Remark 6 Alternatively, the MLE can be defined as the solution to the FOC
$$\frac{\partial L(\theta; x)}{\partial\theta} = 0.$$
This definition has two problems: the likelihood may have local maxima, i.e. there may be multiple solutions to the FOC, and the derivative may not be well defined. Despite these shortcomings we will, for simplicity, rely on this as the definition of the MLE for much of what follows.
Example 6 Suppose that $X_i$, $i = 1, \dots, n$, are iid $U(0, \theta)$. We have
$$f(x) = \begin{cases}\frac{1}{\theta} & 0 \le x \le \theta\\ 0 & \text{otherwise}\end{cases}$$
and the likelihood is given by
$$L(\theta; x) = \frac{1}{\theta^n}\, I\left(X_{(n)} \le \theta\right)$$
where $X_{(n)}$ is the $n$th order statistic, i.e. $X_{(n)} = \max(X_1, \dots, X_n)$. It is clear that the FOC $-n\theta^{-(n+1)} = 0$ will not provide a sensible answer. On the other hand it is easily seen, since $L(\theta; x)$ is decreasing in $\theta$, that the likelihood is maximized by $\hat\theta = X_{(n)}$.
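A quick simulation of Example 6 (not in the original notes; $\theta$ and $n$ are arbitrary) shows that the MLE $\hat\theta = X_{(n)}$ is biased downwards, since $E(X_{(n)}) = \frac{n}{n+1}\theta$:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 5.0, 20, 50_000

x = rng.uniform(0.0, theta, size=(reps, n))
theta_hat = x.max(axis=1)        # MLE: the largest observation

print(theta_hat.mean())          # Monte Carlo E(theta_hat)
print(n / (n + 1) * theta)       # theoretical n/(n+1) * theta
```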
Remark 7 For independent data we can write the likelihood as
$$L(\theta; x) = f(x_1, \dots, x_n; \theta) = \prod_{i=1}^n f_i(x_i; \theta)$$
and, conveniently, the log-likelihood as
$$\ln L(\theta; x) = l(\theta; x) = \sum_{i=1}^n \ln f_i(x_i; \theta).$$
This decomposition turns out to be crucial in the derivation of many of the properties of MLEs. For dependent data we can, somewhat less conveniently, write
$$L(\theta; x) = f(x_1; \theta) f(x_2 \mid x_1; \theta) \cdots f(x_n \mid x_1, \dots, x_{n-1}; \theta) = \prod_{i=1}^n f(x_i \mid x_j,\, j < i; \theta)$$
$$l(\theta; x) = \sum_{i=1}^n \ln f(x_i \mid x_j,\, j < i; \theta)$$
and the derivations below go through with relatively small changes.
Definition 14 (Score) The derivative of the log-likelihood
$$s(\theta; x) = \frac{\partial l(\theta; x)}{\partial\theta}$$
is referred to as the score vector.

Lemma 1 The score vector evaluated at the true parameter values, $\theta_0$, has expectation zero.

Proof. Since $L(\theta; x)$ is the density of the data we have
$$1 = \int L(\theta_0; x)\,dx.$$
Differentiate both sides w.r.t. $\theta$:
$$0 = \frac{\partial}{\partial\theta}\int L(\theta_0; x)\,dx = \int \frac{\partial L(\theta_0; x)}{\partial\theta}\,dx = \int \frac{1}{L(\theta_0; x)}\frac{\partial L(\theta_0; x)}{\partial\theta}\, L(\theta_0; x)\,dx = \int \frac{\partial l(\theta_0; x)}{\partial\theta}\, L(\theta_0; x)\,dx = E\left[s(\theta_0; x)\right].$$
Definition 15 (Fisher Information) The information matrix is the variance-covariance matrix of the score vector evaluated at the true parameter values $\theta_0$,
$$I(\theta) = E\left[s(\theta_0; x) s(\theta_0; x)'\right] = E\left[\frac{\partial l(\theta_0; x)}{\partial\theta}\frac{\partial l(\theta_0; x)}{\partial\theta'}\right].$$

Remark 8 Note the use of the convention that the derivative w.r.t. the column vector $\theta$ is a column vector and the derivative w.r.t. the row vector $\theta'$ is a row vector.

Remark 9 The Fisher information is a measure of the information about $\theta$ we, on average, can expect to find in a sample of given size.

Theorem 3 (Information matrix equality)
$$I(\theta) = -E\left[\frac{\partial^2 l(\theta_0; x)}{\partial\theta\partial\theta'}\right] = E\left[\frac{\partial l(\theta_0; x)}{\partial\theta}\frac{\partial l(\theta_0; x)}{\partial\theta'}\right] = Var(s(\theta_0; x))$$
Proof. Write
$$0 = \int \frac{\partial l(\theta_0; x)}{\partial\theta}\, L(\theta_0; x)\,dx$$
and differentiate both sides:
$$0 = \int \frac{\partial l(\theta_0; x)}{\partial\theta}\frac{\partial L(\theta_0; x)}{\partial\theta'}\,dx + \int \frac{\partial^2 l(\theta_0; x)}{\partial\theta\partial\theta'}\, L(\theta_0; x)\,dx = \int \frac{\partial l(\theta_0; x)}{\partial\theta}\frac{\partial l(\theta_0; x)}{\partial\theta'}\, L(\theta_0; x)\,dx + \int \frac{\partial^2 l(\theta_0; x)}{\partial\theta\partial\theta'}\, L(\theta_0; x)\,dx.$$
That is,
$$E\left[\frac{\partial l(\theta_0; x)}{\partial\theta}\frac{\partial l(\theta_0; x)}{\partial\theta'}\right] = -E\left[\frac{\partial^2 l(\theta_0; x)}{\partial\theta\partial\theta'}\right].$$

Remark 10 For iid data we can write the information as
$$I(\theta) = -nE\left[\frac{\partial^2 \ln f(x_i; \theta_0)}{\partial\theta\partial\theta'}\right] = nE\left[\frac{\partial \ln f(x_i; \theta_0)}{\partial\theta}\frac{\partial \ln f(x_i; \theta_0)}{\partial\theta'}\right].$$
Condition 1 We have assumed that $\frac{\partial}{\partial\theta}\int L(\theta_0; x)\,dx = \int \frac{\partial L(\theta_0; x)}{\partial\theta}\,dx$ holds. This is not necessarily the case. Roughly speaking, the requirement for this to hold is that the distribution isn't too fat-tailed and that the domain of $x$ does not depend on $\theta$. Sufficient conditions for this and the Cramér-Rao theorem below (Theorem 5) are that

1. The parameter space $\Theta$, $\theta \in \Theta$, is an open rectangle, or we can restrict the parameter space to an open rectangle.

2. The domain of $x$ does not depend on $\theta$.

3. The score vector $s$ has finite expectation and variance $\forall\theta \in \Theta$.
Example 7 (Example 6 continued) With the uniform likelihood we have $l(\theta; x) = -n\ln(\theta)$ for $\theta \ge X_{(n)}$ and
$$\frac{\partial l(\theta; x)}{\partial\theta} = -\frac{n}{\theta}, \qquad \frac{\partial^2 l(\theta; x)}{\partial\theta^2} = \frac{n}{\theta^2}$$
and it is clear that both the information matrix equality and the lemma fail to hold. This should not be surprising since the domain of $X_i$ depends on $\theta$.
Example 8 Suppose that $X_i \sim NID(\mu, \sigma^2)$, $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/2\sigma^2}$, with likelihood
$$L(\mu, \sigma^2; x) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left(-\sum_{i=1}^n (x_i - \mu)^2/2\sigma^2\right)$$
$$l(\mu, \sigma^2; x) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$$
with
$$\frac{\partial l}{\partial\mu} = \frac{\sum_{i=1}^n (x_i - \mu)}{\sigma^2}, \qquad \frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \mu)^2$$
yielding the familiar estimates $\hat\mu = \bar x$, $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$.

It is easily verified that $E\left(\frac{\partial l}{\partial\mu}\right) = E\left(\frac{\partial l}{\partial\sigma^2}\right) = 0$. Furthermore
$$E\left(\frac{\partial l}{\partial\mu}\right)^2 = E\left[\frac{\sum_{i=1}^n (x_i - \mu)}{\sigma^2}\right]^2 = \frac{1}{\sigma^4}E\left[\sum_{i=1}^n\sum_{j=1}^n (x_i - \mu)(x_j - \mu)\right] = \frac{n\sigma^2}{\sigma^4} = \frac{n}{\sigma^2}$$
$$E\left(\frac{\partial l}{\partial\sigma^2}\right)^2 = E\left[-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \mu)^2\right]^2 = E\left[\frac{n^2}{4\sigma^4} - \frac{n}{2\sigma^6}\sum_{i=1}^n (x_i - \mu)^2 + \frac{1}{4\sigma^8}\sum_{i=1}^n\sum_{j=1}^n (x_i - \mu)^2(x_j - \mu)^2\right]$$
where, by independence,
$$E(x_i - \mu)^2(x_j - \mu)^2 = \begin{cases}3\sigma^4 & i = j\\ \sigma^4 & i \ne j\end{cases}$$
so that
$$E\left(\frac{\partial l}{\partial\sigma^2}\right)^2 = \frac{n^2}{4\sigma^4} - \frac{n^2}{2\sigma^4} + \frac{\left[3n + n(n-1)\right]\sigma^4}{4\sigma^8} = \frac{n}{2\sigma^4}$$
$$E\left(\frac{\partial l}{\partial\mu}\frac{\partial l}{\partial\sigma^2}\right) = E\left[\frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu)\left(-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \mu)^2\right)\right] = E\left[-\frac{n}{2\sigma^4}\sum_{i=1}^n (x_i - \mu) + \frac{1}{2\sigma^6}\sum_{i=1}^n\sum_{j=1}^n (x_i - \mu)(x_j - \mu)^2\right] = 0$$
and the information matrix is given by
$$I(\mu, \sigma^2) = \begin{pmatrix}\frac{n}{\sigma^2} & 0\\ 0 & \frac{n}{2\sigma^4}\end{pmatrix}.$$
To verify that the information matrix equality holds we evaluate
$$E\left(\frac{\partial^2 l}{\partial\mu^2}\right) = E\left(-\frac{\sum_{i=1}^n 1}{\sigma^2}\right) = -\frac{n}{\sigma^2}$$
$$E\left(\frac{\partial^2 l}{\partial(\sigma^2)^2}\right) = E\left(\frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^n (x_i - \mu)^2\right) = \frac{n}{2\sigma^4} - \frac{n\sigma^2}{\sigma^6} = -\frac{n}{2\sigma^4}$$
$$E\left(\frac{\partial^2 l}{\partial\mu\partial\sigma^2}\right) = E\left(-\frac{1}{\sigma^4}\sum_{i=1}^n (x_i - \mu)\right) = 0$$
and it is clear that $E\left[\frac{\partial l}{\partial\theta}\frac{\partial l}{\partial\theta'}\right] = -E\left[\frac{\partial^2 l}{\partial\theta\partial\theta'}\right]$ holds.
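The two forms of $I(\mu, \sigma^2)$ in Example 8 can be checked numerically. The sketch below (assuming NumPy; parameter values are arbitrary, not from the notes) averages the outer product of the sample score, evaluated at the true parameters, over simulated samples and compares it with the analytic information matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, n, reps = 1.0, 2.0, 25, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

# sample score evaluated at the true (mu, sigma^2), one row per replication
s_mu = (x - mu).sum(axis=1) / sigma2
s_s2 = -n / (2 * sigma2) + ((x - mu) ** 2).sum(axis=1) / (2 * sigma2 ** 2)
S = np.column_stack([s_mu, s_s2])

print(S.T @ S / reps)                             # Monte Carlo E(s s')
print(np.array([[n / sigma2, 0.0],
                [0.0, n / (2 * sigma2 ** 2)]]))   # analytic I(mu, sigma^2)
```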
5 Small sample optimality results

Remark 11 Maximum likelihood estimators are functions of sufficient statistics rather than the full sample. To see this, note that if $T$ is a sufficient statistic we can write the likelihood as (recall the Factorization theorem)
$$L(x; \theta) = g(x) f(T; \theta) \implies l(x; \theta) = \ln g(x) + \ln f(T; \theta)$$
where $g(x)$ is a function of the data only and $f(T; \theta)$ is the marginal density of $T$. Maximizing $\ln f(T; \theta)$ w.r.t. $\theta$ will obviously give the same result as maximizing $l(x; \theta)$.

Theorem 4 (Rao-Blackwell) Let the density of the data be indexed by the parameter $\theta$, $T$ be a sufficient statistic for $\theta$ and $t(x)$ be an unbiased estimator of $u(\theta)$. Define the new estimator $\hat\theta = E(t(x) \mid T)$; then

1. $\hat\theta$ is an unbiased estimator of $u(\theta)$

2. $Var(\hat\theta) \le Var(t)$.

Proof. We must first establish that $\hat\theta$ can be used as an estimator, i.e. that it does not depend on $\theta$ and can be computed from the sample. To see this, note that $t(x)$ is a function of the sample and since $T$ is a sufficient statistic $g(x \mid T)$ does not depend on $\theta$. Consequently $\hat\theta = E(t(x) \mid T) = \int t(x) g(x \mid T)\,dx$ is independent of $\theta$. To show part 1, note that $E(\hat\theta) = E\left[E(t(x) \mid T)\right] = E(t(x)) = u(\theta)$ by the law of iterated expectations. For part 2 we have from Theorem 5.6 in Ramanathan that $Var(X) = E\left[Var(X \mid Y)\right] + Var\left[E(X \mid Y)\right]$; setting $t = X$ and $\hat\theta = E(X \mid Y)$ it is clear that part 2 must hold.

Remark 12 Rao-Blackwellization provides a general way of obtaining a reasonable estimator. Find an unbiased estimator (which by no means has to be a good estimator) and a sufficient statistic and construct the new estimator using the Rao-Blackwell theorem. In some cases this will even be an optimal estimator in the sense that it is a UMVUE.
Example 9 Consider again the case with iid Bernoulli data with parameter $p$. Suppose we take $t(X) = X_1$. Clearly this is an unbiased estimator of $p$: $E(X_1) = p$ and $Var(X_1) = p(1-p)$. The sufficient statistic is $T = \sum X_i$. Calculating $\hat p = E(X_1 \mid T)$ is a combinatorial problem: there are in total $n!/\left[T!(n-T)!\right]$ equally likely permutations of the $T$ ones and $n-T$ zeros given $T$. Of these there are $(n-1)!/\left[(T-1)!(n-T)!\right]$ permutations where $X_1 = 1$. This gives
$$P(X_1 = 1 \mid T) = \frac{(n-1)!\,T!}{n!\,(T-1)!} = \frac{T}{n}$$
and $\hat p = T/n$ with $E(\hat p) = E(T)/n = p$ and $Var(\hat p) = Var(T)/n^2 = p(1-p)/n$.
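A small simulation sketch of Example 9 (sample size and $p$ are arbitrary choices, not from the notes): Rao-Blackwellizing the crude estimator $X_1$ reduces the variance from $p(1-p)$ to $p(1-p)/n$ while keeping it unbiased.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 0.3, 10, 200_000

x = rng.binomial(1, p, size=(reps, n))
t_crude = x[:, 0]          # unbiased but crude: t(X) = X_1
t_rb = x.mean(axis=1)      # Rao-Blackwellized estimator E(X_1 | T) = T/n

print(t_crude.mean(), t_crude.var())   # approx p and p(1-p) = 0.21
print(t_rb.mean(), t_rb.var())         # approx p and p(1-p)/n = 0.021
```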
Definition 16 (Exponential family) A distribution characterized by a $k$-dimensional parameter vector $\theta$ is said to belong to the exponential family if its density or probability function can be written on the form
$$f(x) = C(\theta)\exp\left[\sum_{i=1}^k q_i(\theta) T_i(x)\right] h(x).$$

Remark 13 It follows from the factorization theorem that $(T_1, \dots, T_k)$ are sufficient statistics for $\theta$.

Remark 14 The exponential family is a large class of distributions, containing among others the binomial, normal, geometric, exponential and Poisson distributions.

Example 10 Consider the random variable $X$ with the normal pdf $f(x) = (2\pi\sigma^2)^{-0.5} e^{-0.5(x-\mu)^2/\sigma^2}$. To deduce that this pdf belongs to the exponential family, first note that $\theta = (\mu, \sigma^2)'$ and write
$$(2\pi\sigma^2)^{-0.5} e^{-0.5(x-\mu)^2/\sigma^2} = \frac{e^{-0.5\mu^2/\sigma^2}}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x}\cdot 1 = C(\theta)\, e^{q_1(\theta) T_1(x) + q_2(\theta) T_2(x)}\, h(x)$$
where $C(\theta) = \frac{e^{-0.5\mu^2/\sigma^2}}{\sqrt{2\pi\sigma^2}}$, $q_1(\theta) = -\frac{1}{2\sigma^2}$, $T_1(x) = x^2$, $q_2(\theta) = \frac{\mu}{\sigma^2}$, $T_2(x) = x$, and $h(x) = 1$.
In many cases it is not possible to establish the existence of a UMVUE. In those cases it is of interest to know how good the estimator at hand is. Is it worth the effort to try to find a better estimator? To answer this question we need to know how far off we are from the best possible case.
Theorem 5 (Cramér-Rao) Let $\hat\theta$ be an unbiased estimator of the $k$-dimensional parameter vector $\theta$ and suppose that the regularity conditions 1 hold. Then $Var(\hat\theta) - I^{-1}(\theta)$ is a positive semi-definite matrix and we write $Var(\hat\theta) \ge I^{-1}(\theta)$.

Proof. We have $\theta = E(\hat\theta) = \int \hat\theta\, L(\theta; x)\,dx$ and differentiate both sides w.r.t. $\theta$:
$$\frac{\partial\theta}{\partial\theta'} = I = \int \hat\theta\,\frac{\partial L(\theta; x)}{\partial\theta'}\,dx = \int \hat\theta\,\frac{1}{L(\theta; x)}\frac{\partial L(\theta; x)}{\partial\theta'}\, L(\theta; x)\,dx = \int \hat\theta\,\frac{\partial l(\theta; x)}{\partial\theta'}\, L(\theta; x)\,dx = \int \hat\theta\, s(\theta; x)'\, L(\theta; x)\,dx = Cov(\hat\theta, s)$$
since $E(s) = 0$, where $s$ is the score vector. The variance of $(\hat\theta', s')'$ is then
$$Var\begin{pmatrix}\hat\theta\\ s\end{pmatrix} = \begin{pmatrix}Var(\hat\theta) & I\\ I & I(\theta)\end{pmatrix}.$$
Note that any variance matrix is positive semi-definite and hence the variance of the linear combination $\left[I, -I^{-1}(\theta)\right](\hat\theta', s')'$ is positive semi-definite. This variance is given by
$$\begin{pmatrix}I & -I^{-1}(\theta)\end{pmatrix}\begin{pmatrix}Var(\hat\theta) & I\\ I & I(\theta)\end{pmatrix}\begin{pmatrix}I\\ -I^{-1}(\theta)\end{pmatrix} = Var(\hat\theta) - I^{-1}(\theta) \ge 0$$
which establishes the result.
Remark 15 The inverse information matrix $I^{-1}(\theta)$ provides a lower bound for the variance of an unbiased estimator and is referred to as the Cramér-Rao lower bound.

Remark 16 In the scalar parameter case the Cramér-Rao lower bound reduces to $Var(\hat\theta) \ge I(\theta)^{-1} = 1/I(\theta)$.

Remark 17 The notation $Var(\hat\theta) \ge I^{-1}(\theta)$ is justified in the vector-valued parameter case by noting that $a'\left[Var(\hat\theta) - I^{-1}(\theta)\right]a \ge 0$, or $a'\,Var(\hat\theta)\,a \ge a'\,I^{-1}(\theta)\,a$, for an arbitrary vector $a$ when $Var(\hat\theta) - I^{-1}(\theta)$ is positive semi-definite. That is, there is no linear combination $a'\hat\theta$ of any unbiased estimator $\hat\theta$ with smaller variance than $a'\,I^{-1}(\theta)\,a$.

Remark 18 There is no guarantee that there is an unbiased estimator that attains the Cramér-Rao lower bound.
Example 11 The information for the parameters $(\mu, \sigma^2)$ with iid normal data was obtained in Example 8 as
$$I(\mu, \sigma^2) = \begin{pmatrix}\frac{n}{\sigma^2} & 0\\ 0 & \frac{n}{2\sigma^4}\end{pmatrix}$$
and the Cramér-Rao lower bound is given by
$$I^{-1}(\mu, \sigma^2) = \begin{pmatrix}\frac{\sigma^2}{n} & 0\\ 0 & \frac{2\sigma^4}{n}\end{pmatrix}.$$
It is clear that $\bar x$ attains the lower bound, but $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2$ does not, because $Var(s^2) = \frac{2\sigma^4}{n-1}$, which follows from noting that $\sum_{i=1}^n (X_i - \bar X)^2/\sigma^2 \sim \chi^2(n-1)$. Clearly $Var(s^2)$ is greater than the Cramér-Rao lower bound for any finite $n$.

Theorem 6 Suppose that $t$ is an unbiased estimator of $\theta$ that attains the Cramér-Rao lower bound. Then $t$ is the MLE of $\theta$.
Proof. From the proof of the Cramér-Rao theorem we have that $Var(t) - I^{-1}(\theta) = Var\left(\left[I, -I^{-1}(\theta)\right](t', s')'\right)$ if $t$ is an unbiased estimator. By assumption $Var(t) - I^{-1}(\theta) = 0$, so $\left[I, -I^{-1}(\theta)\right](t', s')'$ must be constant and there is an exact linear relation between $t$ and $s$. Since $t$ is unbiased the linear relation has the form $t = A(\theta)\, s(\theta; x) + \theta$, or $s(\theta; x) = A^{-1}(\theta)(t - \theta)$. Setting the score to zero we obtain the MLE as $\hat\theta = t$.

Remark 19 This is a rather strong optimality result for MLEs but it should not be taken to imply that the MLE always is unbiased or that it always attains the Cramér-Rao lower bound. In particular it does not imply that an MLE is UMVUE.

Example 12 Consider again the case of iid normal data. The MLE of $\sigma^2$ is $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$ with $E(\hat\sigma^2) = \frac{n-1}{n}\sigma^2$ (biased) and $Var(\hat\sigma^2) = \frac{2\sigma^4(n-1)}{n^2}$.
6 Large sample optimality results

Theorem 7 (Consistency of MLE) Subject to the regularity conditions 1 the MLE $\hat\theta_n$ is consistent, $\hat\theta_n \xrightarrow{p} \theta_0$, the true parameter value.

Theorem 8 (Asymptotic normality of MLE) Let $\Omega^{-1} = \lim \frac{1}{n} I(\theta)$. If the regularity conditions 1 hold and if in addition the statistical model is identified and $l(\theta; x)$ is twice continuously differentiable then the asymptotic distribution of the MLE, $\hat\theta$, is normal,
$$\sqrt{n}\left(\hat\theta_n - \theta_0\right) \xrightarrow{d} N(0, \Omega).$$

Proof. We will again, for simplicity, assume that the data are iid. Note that this implies that
$$I(\theta) = nE\left[\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta}\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta'}\right] = -nE\left[\frac{\partial^2 \ln f(X_i; \theta_0)}{\partial\theta\partial\theta'}\right].$$
That is,
$$\Omega^{-1} = E\left[\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta}\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta'}\right] = Var\left[\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta}\right]$$
in this case. By the mean value theorem we can write, for some value $\bar\theta$ between $\theta_0$ and $\hat\theta_n$,
$$s_n(\theta_0; x) = s_n(\hat\theta_n; x) + \frac{\partial s_n(\bar\theta; x)}{\partial\theta'}\left(\theta_0 - \hat\theta_n\right) = \frac{\partial s_n(\bar\theta; x)}{\partial\theta'}\left(\theta_0 - \hat\theta_n\right)$$
since the MLE $\hat\theta_n$ sets the score to zero. Alternatively we can write this as
$$\left(\theta_0 - \hat\theta_n\right) = \left[\frac{\partial s_n(\bar\theta; x)}{\partial\theta'}\right]^{-1} s_n(\theta_0; x)$$
provided that $\frac{\partial s_n(\bar\theta; x)}{\partial\theta'}$ has full rank. Since
$$s_n(\theta_0; x) = \sum_{i=1}^n \frac{\partial \ln f(X_i; \theta_0)}{\partial\theta}$$
where $f(X_i; \theta_0)$ and $\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta}$ are iid random variables, we have by the (multivariate) Lindeberg-Lévy CLT that
$$\frac{1}{\sqrt{n}}\, s_n(\theta_0; x) \xrightarrow{d} N\left(0, \Omega^{-1}\right).$$
Secondly,
$$\frac{\partial s_n(\theta_0; x)}{\partial\theta'} = \sum_{i=1}^n \frac{\partial^2 \ln f(X_i; \theta_0)}{\partial\theta\partial\theta'}$$
is a sum of iid random matrices and
$$\frac{1}{n}\frac{\partial s_n(\theta_0; x)}{\partial\theta'} \xrightarrow{p} -\Omega^{-1}$$
by the Khinchine WLLN. In addition, $\hat\theta_n \xrightarrow{p} \theta_0$ implies $\bar\theta \xrightarrow{p} \theta_0$ and
$$\frac{1}{n}\frac{\partial s_n(\bar\theta; x)}{\partial\theta'} \xrightarrow{p} -\Omega^{-1}$$
by the Slutsky theorem. Note that this implies $-\Omega\,\frac{1}{n}\frac{\partial s_n(\bar\theta; x)}{\partial\theta'} \xrightarrow{p} I$. Next, write
$$\frac{1}{n}\frac{\partial s_n(\bar\theta; x)}{\partial\theta'}\;\sqrt{n}\left(\theta_0 - \hat\theta_n\right) = \frac{1}{\sqrt{n}}\, s_n(\theta_0; x).$$
Since $-\Omega\,\frac{1}{n}\frac{\partial s_n(\bar\theta; x)}{\partial\theta'} \xrightarrow{p} I$ we have that $\sqrt{n}\left(\theta_0 - \hat\theta_n\right)$ has the same limiting distribution as
$$-\Omega\,\frac{1}{\sqrt{n}}\, s_n(\theta_0; x) \xrightarrow{d} N(0, \Omega)$$
which establishes the result.
Remark 20 The variance of the limiting distribution for the MLE is the inverse of the limit of the average information. That is, asymptotically the MLE attains the Cramér-Rao lower bound. This implies that the MLE is Best Asymptotically Normal, i.e. there is no other asymptotically normal estimator whose limiting distribution has a smaller variance. This provides a strong rationale for the use of maximum likelihood.

Remark 21 Note the crucial role that the information matrix equality plays in giving us a simple form for the variance of the limiting distribution.

Example 13 For normal data, $X_i$ iid $N(\mu, \sigma^2)$, the information matrix is given by
$$I(\mu, \sigma^2) = \begin{pmatrix}\frac{n}{\sigma^2} & 0\\ 0 & \frac{n}{2\sigma^4}\end{pmatrix}.$$
It follows that
$$\sqrt{n}\left[\begin{pmatrix}\hat\mu\\ \hat\sigma^2\end{pmatrix} - \begin{pmatrix}\mu\\ \sigma^2\end{pmatrix}\right] \xrightarrow{d} N(0, \Omega)$$
for
$$\Omega = \lim n I^{-1}(\mu, \sigma^2) = \begin{pmatrix}\sigma^2 & 0\\ 0 & 2\sigma^4\end{pmatrix}.$$
From exercise 5 in the asymptotics lecture notes we deduce that $\sqrt{n}\left(\hat\sigma_n^2 - \sigma^2\right) \xrightarrow{d} N\left(0, (\kappa - 1)\sigma^4\right)$ where $\kappa = E(X_i - \mu)^4/\sigma^4 = 3$ for normal data.
Example 14 Suppose that $X_i$, $i = 1, \dots, n$, is iid Bernoulli with parameter $p$. The log-likelihood is
$$l(p; x) = T\ln p + (n - T)\ln(1 - p)$$
for $T = \sum_{i=1}^n x_i$. The score is
$$\frac{\partial l(p; x)}{\partial p} = \frac{T}{p} - \frac{n - T}{1 - p}.$$
Setting the score to zero and solving for $p$ gives the MLE as $\hat p = \frac{T}{n}$. We obtain the Fisher information as
$$I(p) = -E\left[\frac{\partial^2 l(p; x)}{\partial p^2}\right] = E\left[\frac{T}{p^2} + \frac{n - T}{(1 - p)^2}\right] = \frac{np}{p^2} + \frac{n(1 - p)}{(1 - p)^2} = \frac{n}{p} + \frac{n}{1 - p} = \frac{n}{p(1 - p)}.$$
Since the regularity conditions hold it follows that $\hat p$ is consistent and that
$$\sqrt{n}(\hat p - p) \xrightarrow{d} N(0, p(1 - p)).$$
The results are easily verified by applying a suitable LLN and CLT to $\hat p = \sum_{i=1}^n x_i/n$. Noting that $T \sim Bin(n, p)$, a common rule of thumb for when the asymptotic distribution provides a good approximation to the exact finite sample distribution is that $np(1 - p) \ge 9$.
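A quick Monte Carlo sketch of Example 14 (the values of $n$ and $p$ below are illustrative choices, not from the notes, and satisfy the rule of thumb since $np(1-p) = 21$):

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, reps = 0.3, 100, 100_000

t = rng.binomial(n, p, size=reps)   # T ~ Bin(n, p)
p_hat = t / n                       # the MLE
z = np.sqrt(n) * (p_hat - p)

print(z.mean(), z.var())            # approximately 0 and p(1-p)
print(p * (1 - p))                  # limiting variance 0.21
```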
7 When the form of the likelihood is unknown (optional)

1. It generally is unknown.

2. We can't expect to get exact small sample results.
   (a) We must rely on asymptotic results.
   (b) In special cases we may be able to obtain the small sample bias and variance of the estimator.

3. Maximum likelihood is out of the question.

4. Maximize the wrong likelihood, on purpose or out of ignorance: Quasi Maximum Likelihood (QML). The QMLE can, under more restrictive conditions than above, be shown to be consistent and asymptotically normal. The major difference is that the information matrix equality doesn't hold for the QMLE and we get
   $$\sqrt{n}\left(\hat\theta_{QML} - \theta_0\right) \xrightarrow{d} N\left(0, A^{-1}BA^{-1}\right)$$
   for
   $$A = \operatorname{plim}\frac{1}{n}\frac{\partial s_n(\theta_0; x)}{\partial\theta'}, \qquad B = \operatorname{plim}\frac{1}{n}\, s_n(\theta_0; x)\, s_n(\theta_0; x)'.$$
   (A numerical sketch of this sandwich variance follows at the end of this section.)

5. Estimators that don't rely on the likelihood.
   (a) Least squares.
   (b) Generalized Method of Moments (GMM). GMM specifies a set of $k$ moment conditions $E\left[g_n(\theta_0; x)\right] = 0$, where $\theta$ is a $k$-dimensional parameter vector, and minimizes $g_n(\theta; x)' g_n(\theta; x)$. It is possible to show, under more restrictive conditions than above, that the GMM estimator is consistent and asymptotically normal,
   $$\sqrt{n}\left(\hat\theta_{GMM} - \theta_0\right) \xrightarrow{d} N(0, V)$$
   where $V^{-1} = \lim\frac{1}{n} Var(g_n(\theta_0; x))$.

Remark 22 We know that the MLE attains the Cramér-Rao lower bound asymptotically and it should be clear that we in general suffer a loss in efficiency by using estimators other than the MLE.

Remark 23 Note that Least Squares and ML are special cases of GMM. This is seen by setting the FOCs of LS or ML as the GMM moment conditions, e.g. $E\left[s_n(\theta_0; x)\right] = 0$ for ML.
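As referenced in item 4 above, here is a sketch of the sandwich variance $A^{-1}BA^{-1}$ in a deliberately misspecified toy example (my own construction, not from the notes): a Gaussian quasi-likelihood with unit variance is used for the mean of data that are actually exponential, so the robust and the naive "ML" variances differ.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5_000
x = rng.exponential(scale=2.0, size=n)   # true mean 2, true variance 4

# Gaussian quasi-log-likelihood with unit variance: l_i(mu) = -(x_i - mu)^2 / 2
mu_qml = x.mean()                        # the QMLE of the mean
s = x - mu_qml                           # per-observation scores at the QMLE

A = -1.0                                 # average per-observation Hessian (exact here)
B = (s ** 2).mean()                      # average squared score
sandwich = B / A ** 2                    # A^{-1} B A^{-1}

print(sandwich / n)                      # robust variance of mu_qml, approx 4/n
print(-1.0 / (A * n))                    # naive "ML" variance 1/n, too small here
print(x.var(ddof=1) / n)                 # direct estimate of Var of the sample mean
```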
8 Worked exercises

8.1 Exercises

1. Exercise 8.1 (b)-(e) in Ramanathan.

2. Exercise 8.2 in Ramanathan.

3. Exercise 8.9 (a)-(c) in Ramanathan. In addition, obtain $E(\hat\alpha)$ and $Var(\hat\alpha)$ where $\hat\alpha$ is the MLE of $\alpha$.

4. Consider the regression model $y = x\beta + z\gamma + \varepsilon$ where $\beta$ and $\gamma$ are scalars. In addition we are told that the $\varepsilon_i$ are iid, the $x_i$ are iid, $x_i$ and $\varepsilon_i$ are independent of each other, $\frac{x'x}{n} \xrightarrow{p} c = E(x_i^2) \ne 0$ and $\frac{x'z}{n} \xrightarrow{p} d \ne 0$.
   (a) Suppose $\gamma$ is known and consider the estimator $b = \frac{x'(y - \gamma z)}{x'x}$. Obtain the limiting distribution of $b$.
   (b) Suppose instead that $\gamma$ is estimated by $\tilde\gamma$, independent of $\varepsilon$, with $\sqrt{n}(\tilde\gamma - \gamma) \xrightarrow{d} N(0, 1)$. Define the estimator
   $$\tilde e = \frac{x'(y - \tilde\gamma z)}{x'x}$$
   and obtain the limiting distribution of $\tilde e$.
   (c) Are $b$ and $\tilde e$ consistent estimators of $\beta$?
8.2 Solutions

1. $f(x; \theta) = k\theta^x$, a discrete geometric distribution, i.e. $k = 1 - \theta$.

   (b) We have
   $$L(x; \theta) = \prod_{i=1}^n (1 - \theta)\theta^{x_i} = (1 - \theta)^n \theta^{\sum_{i=1}^n x_i}$$
   and it is clear from the factorization theorem that $\sum_{i=1}^n x_i$ and $\bar x = \frac{1}{n}\sum_{i=1}^n x_i$ are sufficient statistics.

   (c) We have
   $$\frac{\partial l(x; \theta)}{\partial\theta} = -\frac{n}{1 - \theta} + \frac{\sum_{i=1}^n x_i}{\theta}$$
   $$\frac{\partial^2 l(x; \theta)}{\partial\theta^2} = -\frac{n}{(1 - \theta)^2} - \frac{\sum_{i=1}^n x_i}{\theta^2}.$$
   Since $E(x_i) = \frac{\theta}{1 - \theta}$ we have
   $$I(\theta) = -E\left[\frac{\partial^2 l(x; \theta)}{\partial\theta^2}\right] = \frac{n}{(1 - \theta)^2} + \frac{n}{(1 - \theta)\theta} = \frac{n}{(1 - \theta)^2\theta}.$$
   It is easy to verify that the outer product (score) form of the information matrix gives the same result,
   $$E\left[\frac{\partial l(x; \theta)}{\partial\theta}\right]^2 = E\left[-\frac{n}{1 - \theta} + \frac{\sum_{i=1}^n x_i}{\theta}\right]^2 = \frac{n^2}{(1 - \theta)^2} - \frac{2n E\left(\sum_{i=1}^n x_i\right)}{(1 - \theta)\theta} + \frac{E\left(\sum_{i=1}^n x_i\right)^2}{\theta^2}$$
   and, using independence,
   $$= \frac{n^2}{(1 - \theta)^2} - \frac{2n^2}{(1 - \theta)^2} + \frac{E\left(\sum_{i=1}^n x_i^2\right)}{\theta^2} + \frac{E\left(\sum_{i=1}^n\sum_{j\ne i} x_i x_j\right)}{\theta^2} = \frac{n^2}{(1 - \theta)^2} - \frac{2n^2}{(1 - \theta)^2} + \frac{n}{(1 - \theta)^2\theta} + \frac{n}{(1 - \theta)^2} + \frac{n(n - 1)}{(1 - \theta)^2} = \frac{n}{(1 - \theta)^2\theta}.$$

   (d) Setting the score to zero we have
   $$\frac{\partial l(x; \theta)}{\partial\theta} = -\frac{n}{1 - \theta} + \frac{\sum_{i=1}^n x_i}{\theta} = 0 \iff \frac{\sum_{i=1}^n x_i}{n} = \frac{\theta}{1 - \theta}$$
   with the solution
   $$\hat\theta = \frac{\bar x}{1 + \bar x}.$$

   (e) Since the $x_i$ are iid we have $\bar x \xrightarrow{p} E(x_i) = \frac{\theta}{1 - \theta}$ by the Khinchine WLLN. It follows from the Slutsky theorem that $\hat\theta = g(\bar x) = \frac{\bar x}{1 + \bar x} \xrightarrow{p} g\left(\frac{\theta}{1 - \theta}\right) = \theta$ (a simulation check of this follows below).
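The simulation check referenced in part (e): a sketch assuming NumPy (note that `numpy`'s geometric sampler counts trials starting at 1, so 1 is subtracted to match the support $x = 0, 1, 2, \dots$ with success probability $1 - \theta$ used here).

```python
import numpy as np

rng = np.random.default_rng(9)
theta = 0.6
for n in (50, 500, 5_000, 50_000):
    # P(x) = (1 - theta) * theta^x on x = 0, 1, 2, ...
    x = rng.geometric(1.0 - theta, size=n) - 1
    xbar = x.mean()
    theta_hat = xbar / (1.0 + xbar)      # the MLE from part (d)
    print(n, theta_hat)                  # converges to theta = 0.6 as n grows
```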
2. $f(x; \theta) = \theta x^{\theta - 1}$ for $0 \le x \le 1$ and $\theta > 0$.

   (a) $\int_0^1 \theta x^{\theta - 1}\,dx = \left[x^{\theta}\right]_0^1 = 1$. It follows that $\int_0^1 x^{\theta}\,dx = \frac{1}{\theta + 1}$ and hence that
   $$E(x) = \int_0^1 x\,\theta x^{\theta - 1}\,dx = \frac{\theta}{\theta + 1}.$$

   (b) Differentiating $\frac{1}{\theta} = \int_0^1 x^{\theta - 1}\,dx$ w.r.t. $\theta$ gives
   $$-\frac{1}{\theta^2} = \frac{\partial}{\partial\theta}\int_0^1 x^{\theta - 1}\,dx = \int_0^1 \frac{\partial}{\partial\theta}\, e^{(\theta - 1)\ln x}\,dx = \int_0^1 \ln x\; x^{\theta - 1}\,dx.$$
   It follows that $E(\ln x) = \int_0^1 \ln x\,\theta x^{\theta - 1}\,dx = -\frac{1}{\theta}$.

   (c)
   $$\frac{2}{\theta^3} = \frac{\partial^2}{\partial\theta^2}\frac{1}{\theta} = \frac{\partial^2}{\partial\theta^2}\int_0^1 x^{\theta - 1}\,dx = \int_0^1 \frac{\partial}{\partial\theta}\ln x\; e^{(\theta - 1)\ln x}\,dx = \int_0^1 (\ln x)^2\, x^{\theta - 1}\,dx.$$
   Which gives
   $$E(\ln x)^2 = \int_0^1 (\ln x)^2\,\theta x^{\theta - 1}\,dx = \frac{2}{\theta^2}, \qquad Var(\ln x) = E\left(\ln x - E(\ln x)\right)^2 = E(\ln x)^2 - \left[E(\ln x)\right]^2 = \frac{2}{\theta^2} - \frac{1}{\theta^2} = \frac{1}{\theta^2}.$$

   (d) We have the random sample $x_1, \dots, x_n$. Independence gives the joint density as
   $$f(x_1, \dots, x_n; \theta) = \prod_{i=1}^n f(x_i; \theta) = \theta^n \left(\prod_{i=1}^n x_i\right)^{\theta - 1}.$$
   The likelihood is thus $L(\theta; x_1, \dots, x_n) = \theta^n \left(\prod_{i=1}^n x_i\right)^{\theta - 1}$. It follows from the factorization theorem (8.1) that $T_1 = \prod_{i=1}^n x_i$ is a sufficient statistic, since we can factorize the likelihood into the function $h(T_1, \theta) = \theta^n T_1^{\theta - 1}$, depending only on $\theta$ and $T_1$, and the function $g(x) = 1$, which does not depend on $\theta$ and $T_1$. The factorization theorem is if and only if; that is, $T_2 = \sum_{i=1}^n x_i$ is a sufficient statistic only if we can factorize the likelihood correspondingly for $T_2$. Inspection of the likelihood function shows that this is impossible and consequently $T_2$ is not a sufficient statistic for $\theta$. $T_3 = \sum_{i=1}^n \ln x_i$, on the other hand, is a sufficient statistic.

   (e) $\ln L(\theta; x) = n\ln(\theta) + (\theta - 1)\sum_{i=1}^n \ln x_i$ and
   $$\frac{\partial \ln L}{\partial\theta} = \frac{n}{\theta} + \sum_{i=1}^n \ln x_i.$$
   Setting the derivative to zero yields $\hat\theta = -\frac{n}{\sum_{i=1}^n \ln x_i}$. To verify that this is a maximum we need to show that the second derivative is negative at $\hat\theta$. We have $\frac{\partial^2 \ln L}{\partial\theta^2} = -\frac{n}{\theta^2}$, which is negative everywhere, so $\hat\theta$ is indeed a maximum. The corresponding estimator of $1/\theta$ is $1/\hat\theta = -\frac{1}{n}\sum_{i=1}^n \ln x_i$. It is easy to establish that this guess is correct for $\gamma = g(\theta)$ when $g$ is a monotone function (it holds for non-monotone functions as well, but is trickier to show); by monotonicity the inverse function $\theta = g^{-1}(\gamma)$ exists.

   For the asymptotic distribution, define
   $$Z_n = \frac{\overline{\ln x} - E(\ln x)}{\sqrt{Var\left(\overline{\ln x}\right)}}.$$
   We then have $Z_n \xrightarrow{d} N(0, 1)$ since the $\ln x_i$ are independent with $Var(\ln x_i) = 1/\theta^2 < \infty$ and thus fulfil the conditions of the Lindeberg-Lévy CLT.
   Comment: we have (for this estimator) verified the claim that ML estimators are asymptotically normally distributed. To see that the result is in accordance with theorem 8.12, note that $Z_n$ can be written as $Z_n = \sqrt{n}\,\theta\left(\overline{\ln x} + \frac{1}{\theta}\right)$.
3. Comment: The mean and variance we obtained shouldn't be too surprising. The distribution of $x$ is an exponential distribution with a shift in the location; that is, if $y$ is exponentially distributed with parameter $\lambda$, then $x$ is obtained as $x = y + \alpha$.

   (b) The likelihood is given by $L = \prod_{i=1}^n \frac{1}{\lambda}\, e^{-(x_i - \alpha)/\lambda}$. The expectation $E\left[\sum_{i=1}^n (x_i - \alpha)\right]^2 = \sum_{i=1}^n\sum_{j=1}^n E\left[(x_i - \alpha)(x_j - \alpha)\right]$ is a little bit tricky. For $i \ne j$ we have independence and $E\left[(x_i - \alpha)(x_j - \alpha)\right] = E(x_i - \alpha)E(x_j - \alpha) = \lambda^2$, and there are $n(n - 1)$ terms with $i \ne j$. This leaves $n$ terms with $i = j$ where we have $E(x_i - \alpha)^2 = 2\lambda^2$.

   Comment: The reason for using the score form of the information matrix is that the information matrix equality $E(SS') = E\left[\frac{\partial \ln L}{\partial\theta}\frac{\partial \ln L}{\partial\theta'}\right] = -E\left[\frac{\partial^2 \ln L}{\partial\theta\partial\theta'}\right]$ for $\theta = (\alpha, \lambda)'$ doesn't hold for this likelihood. When establishing that $E\left[\frac{\partial \ln L}{\partial\theta}\frac{\partial \ln L}{\partial\theta'}\right] = -E\left[\frac{\partial^2 \ln L}{\partial\theta\partial\theta'}\right]$ we needed to interchange the order of integration and differentiation, which is not valid here since the domain of $x$ depends on $\alpha$.

   Setting the score with respect to $\lambda$ to zero gives the MLE of $\lambda$ as $\frac{1}{n}\sum_{i=1}^n (x_i - \alpha)$, provided $\alpha$ is known. $S_1$ is obviously of little use for obtaining the MLE of $\alpha$. Instead we need to look at the likelihood function itself; writing this as $L = \frac{1}{\lambda^n}\, e^{-\sum_{i=1}^n (x_i - \alpha)/\lambda}$ it is clear that the likelihood is an increasing function of $\alpha$. On the other hand we have the condition $x_i \ge \alpha$; that is, the likelihood of observing a value of $x$ smaller than $\alpha$ is zero. The value of $\alpha$ maximizing the likelihood is thus the smallest value of $x_i$ in the sample, or the first order statistic; denote this by $x_{(1)}$. We have $T_1 = \hat\alpha = x_{(1)}$ and $T_2 = \hat\lambda = \frac{1}{n}\sum_{i=1}^n \left(x_i - x_{(1)}\right)$.

   Extra: From p. 137 in Ramanathan we get the density of the first order statistic as
   $$f_{x_{(1)}}(x) = n\left[1 - F_x(x)\right]^{n-1} f_x(x).$$
   We obtain the distribution function of $x$ as $F_x(x) = \int_\alpha^x \frac{1}{\lambda}\, e^{-(t - \alpha)/\lambda}\,dt = 1 - e^{-(x - \alpha)/\lambda}$.
4. (a) We have
   $$b = \frac{x'(y - \gamma z)}{x'x} = \frac{x'(x\beta + \varepsilon)}{x'x} = \frac{x'x\beta + x'\varepsilon}{x'x} = \beta + \frac{x'\varepsilon}{x'x}$$
   where $\frac{x'x}{n} \xrightarrow{p} c$. In addition, $\frac{1}{n}x'\varepsilon = \frac{1}{n}\sum_{i=1}^n x_i\varepsilon_i$, a sample average which a CLT might apply to. By assumption we have $E(x_i\varepsilon_i) = E(x_i)E(\varepsilon_i) = 0$ and $Var(x_i\varepsilon_i) = E(x_i^2\varepsilon_i^2) = E(x_i^2)\sigma^2 = \sigma^2 c < \infty$. Since $x_i$ and $\varepsilon_i$ are iid, $x_i\varepsilon_i$ is iid as well and the conditions for the Lindeberg-Lévy CLT hold. That is,
   $$\frac{1}{\sqrt{n}}\, x'\varepsilon \xrightarrow{d} N\left(0, \sigma^2 c\right).$$
   Write
   $$\sqrt{n}(b - \beta) = \frac{n^{-1/2}\, x'\varepsilon}{x'x/n}$$
   and it follows that
   $$\sqrt{n}(b - \beta) \xrightarrow{d} N\left(0, \sigma^2/c\right).$$

   (b) We have
   $$\tilde e = \frac{x'(y - \tilde\gamma z)}{x'x} = \beta + \frac{x'\varepsilon}{x'x} - \frac{x'z}{x'x}(\tilde\gamma - \gamma)$$
   so that
   $$\sqrt{n}(\tilde e - \beta) = \frac{n^{-1/2}\, x'\varepsilon}{x'x/n} - \frac{x'z/n}{x'x/n}\;\sqrt{n}(\tilde\gamma - \gamma)$$
   where the first term converges in distribution to $N(0, \sigma^2/c)$ and the second term to $N\left(0, \frac{d^2}{c^2}\right)$ since $\operatorname{plim}\frac{x'z/n}{x'x/n} = d/c$. Note that these limiting distributions are the same as those of $\frac{n^{-1/2}\, x'\varepsilon}{c}$ and $\frac{d}{c}\sqrt{n}(\tilde\gamma - \gamma)$, and the second hence does not depend on $x$ and $z$. By independence of $\varepsilon$ and $\tilde\gamma$ it follows that $\sqrt{n}(\tilde e - \beta)$ converges in distribution to the sum of two independent normal random variables. That is,
   $$\sqrt{n}(\tilde e - \beta) \xrightarrow{d} N\left(0, \frac{\sigma^2}{c} + \frac{d^2}{c^2}\right).$$

   (c) In both cases we have convergence in distribution when scaling by $\sqrt{n}$. It follows from corollary 2 in the Asymptotics lecture notes that $(b - \beta) \xrightarrow{p} 0$ and $(\tilde e - \beta) \xrightarrow{p} 0$ and the estimators are consistent.