SELECTED TOPICS IN MATHEMATICAL STATISTICS
LAJOS HORVÁTH
Abstract. This is the outcome of the online Math 6070 class during the COVID-19 epidemic.
1. Some problems in nonparametric statistics
First we consider some examples of various statistical problems. In all cases we should be able
to obtain large sample approximations, but getting the critical values might not be simple. Also, the
large sample approximations might not work for our sample sizes. We assume in this section
that
Assumption 1.1. X1, X2, . . . , XN are independent and identically distributed random variables
with distribution function F.
We begin with a simple hypothesis testing question which has already been discussed.
1.1. Kolmogorov–Smirnov and related statistics. We wish to test the null hypothesis that
F(t) = F0(t) for all −∞ < t < ∞, where F0(t) is a given distribution function. We assume that
F0 is continuous. There are several well known tests for this problem. The first two are due to
Kolmogorov and Smirnov:
\[
T_{N,1} = N^{1/2} \sup_{-\infty < t < \infty} \bigl| F_N(t) - F_0(t) \bigr|, \qquad
T_{N,2} = N^{1/2} \sup_{-\infty < t < \infty} \bigl( F_N(t) - F_0(t) \bigr),
\]
where
\[
F_N(t) = \frac{1}{N} \sum_{i=1}^{N} I\{X_i \le t\}, \qquad -\infty < t < \infty,
\]
denotes the empirical distribution function of our sample. The distributions of the statistics TN,1
and TN,2 do not depend on F0 for any sample size as long as
(1.1) F0 is continuous.
By the probability integral transformation, U1 = F(X1), U2 = F(X2), . . . , UN = F(XN) are independent,
identically distributed random variables, uniform on [0, 1]. Since
\[
T_{N,1} = N^{1/2} \sup_{-\infty < t < \infty} \Bigl| \frac{1}{N} \sum_{i=1}^{N} I\{F(X_i) \le F(t)\} - F_0(t) \Bigr|
        = N^{1/2} \sup_{0 < u < 1} \Bigl| \frac{1}{N} \sum_{i=1}^{N} I\{U_i \le u\} - F_0\bigl(F^{-1}(u)\bigr) \Bigr|,
\]
where F−1 is the generalized inverse of F. (If F is not strictly increasing, then F−1 is not uniquely
defined, but it still satisfies F(F−1(u)) = u.) Under the null hypothesis F = F0, so F0(F−1(u)) = u, and
the distribution of TN,1 is indeed free of F0; the same argument applies to TN,2.
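To make the computation concrete, here is a minimal sketch (our addition, not part of the notes) of how TN,1 and TN,2 can be evaluated from a sample; the standard normal F0 and the simulated data are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 200
x = rng.normal(size=N)      # data; here H0 happens to be true
F0 = norm.cdf               # hypothesized continuous distribution function

xs = np.sort(x)
i = np.arange(1, N + 1)
F0s = F0(xs)

# sup_t (F_N - F_0) and sup_t (F_0 - F_N) are attained at the jump points of F_N
D_plus = np.max(i / N - F0s)
D_minus = np.max(F0s - (i - 1) / N)

T_N1 = np.sqrt(N) * max(D_plus, D_minus)   # two-sided Kolmogorov-Smirnov statistic
T_N2 = np.sqrt(N) * D_plus                 # one-sided version
print(T_N1, T_N2)
```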
We already discussed in class the weak convergence of the uniform empirical and quantile
processes. Due to the probability integral transformation, the following results are immediate
consequences:
\[
(1.2)\qquad T_{N,1} \;\stackrel{\mathcal{D}}{\longrightarrow}\; \sup_{0 \le u \le 1} |B(u)|
\]
and
\[
(1.3)\qquad T_{N,2} \;\stackrel{\mathcal{D}}{\longrightarrow}\; \sup_{0 \le u \le 1} B(u),
\]
where {B(u), 0 ≤ u ≤ 1} denotes a Brownian bridge. We already discussed the definition of the
Brownian bridge when we studied the weak convergence of the process constructed from the uniform
order statistics. The Brownian bridge is defined as B(t) = W(t) − tW(1), where W(t) is a Wiener
process. Hence B(0) = B(1) = 0 (it is tied down). It is a Gaussian process, i.e. its finite dimensional
distributions are multivariate normal. The parameters of these multivariate normal distributions can
be computed from the facts that EB(t) = 0 and EB(t)B(s) = min(t, s) − ts. The Brownian bridge
is continuous with probability 1. More on the Brownian bridge can be found on page 153 of DasGupta (2008).
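The limiting laws in (1.2) and (1.3) are easy to approximate by Monte Carlo. A minimal sketch (our addition): simulate Brownian bridges on a grid via B(t) = W(t) − tW(1); the grid size, number of replications and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_grid, n_rep = 1000, 20000
t = np.arange(1, n_grid + 1) / n_grid

sup_abs = np.empty(n_rep)
sup_pos = np.empty(n_rep)
for r in range(n_rep):
    # Wiener path on the grid, then tie it down at t = 1
    W = np.cumsum(rng.normal(scale=np.sqrt(1.0 / n_grid), size=n_grid))
    B = W - t * W[-1]
    sup_abs[r] = np.max(np.abs(B))
    sup_pos[r] = np.max(B)

# Monte Carlo critical values for the limits in (1.2) and (1.3)
print(np.quantile(sup_abs, 0.95))   # close to 1.36 (Kolmogorov distribution)
print(np.quantile(sup_pos, 0.95))   # close to 1.22, since P(sup B > x) = exp(-2x^2)
```

Because the supremum is taken over a finite grid, the simulated quantiles slightly underestimate the true ones.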
Hence (1.2) and (1.3) provide large sample approximations for our test statistics under the null
hypothesis. We even know how good the approximations in (1.2) and (1.3) are. There are constants
c1 and c2 such that
\[
(1.4)\qquad \sup_{-\infty < x < \infty} \Bigl| P\{T_{N,1} \le x\} - P\Bigl\{ \sup_{0 \le u \le 1} |B(u)| \le x \Bigr\} \Bigr| \le c_1 \frac{\log N}{N^{1/2}}
\]
and
\[
(1.5)\qquad \sup_{-\infty < x < \infty} \Bigl| P\{T_{N,2} \le x\} - P\Bigl\{ \sup_{0 \le u \le 1} B(u) \le x \Bigr\} \Bigr| \le c_2 \frac{\log N}{N^{1/2}}.
\]
These results follow immediately from the Komlós, Major and Tusnády approximation (cf. DasGupta, 2008, p. 162).
There are explicit bounds for c1 and c2, but these are so large that they are
useless in practice. It is even more interesting, from a theoretical point of view, that the results in
(1.4) and (1.5) are very close to the best possible ones; N^{-1/2} is a lower bound for the rate. Since the limiting
distribution functions in (1.4) and (1.5) are known explicitly, we can check how well these approximations work
for finite sample sizes. Chapter 9 in Shorack and Wellner (1986) contains formulae and bounds,
including exact and asymptotic bounds for TN,1 and TN,2.
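A rough finite-sample check (our addition): simulate TN,1 under the null, where by the probability integral transformation we may work with uniform observations, and compare with the limiting Kolmogorov distribution, whose classical series representation is coded below; the sample sizes, replication count and seed are arbitrary.

```python
import numpy as np

def kolmogorov_cdf(x, terms=100):
    # P(sup|B| <= x) = 1 - 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 x^2)
    k = np.arange(1, terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * x**2))

def T_N1_uniform(N, rng):
    # under H0, F_0(X_i) is uniform on [0, 1], so we may take F_0(t) = t
    u = np.sort(rng.uniform(size=N))
    i = np.arange(1, N + 1)
    return np.sqrt(N) * max(np.max(i / N - u), np.max(u - (i - 1) / N))

rng = np.random.default_rng(2)
x0 = 1.36   # roughly the asymptotic 95% point
for N in (20, 100, 500):
    stats = np.array([T_N1_uniform(N, rng) for _ in range(5000)])
    print(N, np.mean(stats <= x0).round(3), round(kolmogorov_cdf(x0), 3))
```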
We need to study the behavior of the test statistics under suitable alternatives. First we look at
TN,1. We assume that
\[
(1.6)\qquad H_A:\ \text{there is } t_0 \text{ such that } F(t_0) \ne F_0(t_0).
\]
If (1.6) holds, then
\[
(1.7)\qquad T_{N,1} \;\stackrel{P}{\longrightarrow}\; \infty.
\]
Since
\[
T_{N,1} = N^{1/2} \sup_{-\infty < t < \infty} \bigl| F_N(t) - F(t) + F(t) - F_0(t) \bigr|,
\]
we get the lower bound
\[
N^{1/2} \sup_{-\infty < t < \infty} |F(t) - F_0(t)| - N^{1/2} \sup_{-\infty < t < \infty} |F_N(t) - F(t)| \le T_{N,1}.
\]
The weak convergence of the empirical process yields
\[
N^{1/2} \sup_{-\infty < t < \infty} |F_N(t) - F(t)| = O_P(1)
\]
and (1.6) gives
\[
N^{1/2} \sup_{-\infty < t < \infty} |F(t) - F_0(t)| \to \infty,
\]
completing the proof of (1.7). However, the one-sided statistic TN,2 might not be able to reject the null hypothesis under
the alternative of (1.6). The statistic TN,2 is consistent under the alternative
\[
(1.8)\qquad H_A:\ \text{there is } t_0 \text{ such that } F(t_0) > F_0(t_0).
\]
Similarly to the proof of (1.7), one can show that
\[
(1.9)\qquad T_{N,2} \;\stackrel{P}{\longrightarrow}\; \infty.
\]
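The consistency in (1.7) is easy to see in a small simulation (our addition). Here F0 is the N(0, 1) distribution function while the data come from N(0.3, 1); both choices are only for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
for N in (100, 400, 1600, 6400):
    x = np.sort(rng.normal(loc=0.3, size=N))   # true F is N(0.3, 1)
    i = np.arange(1, N + 1)
    F0 = norm.cdf(x)                           # hypothesized F_0 is N(0, 1)
    T_N1 = np.sqrt(N) * max(np.max(i / N - F0), np.max(F0 - (i - 1) / N))
    print(N, round(T_N1, 2))
# the leading term N^(1/2) sup|F - F0| roughly doubles each time N is quadrupled
```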
The other class of statistics is due to Cramér and von Mises. We provide two formulas. If you
know how to integrate with respect to a function (Riemann–Stieltjes integration), use that form; if not,
use the formula in which the density f0(t) = F0'(t) appears. There are two possibilities for us:
\[
T_{N,3} = N \int_{-\infty}^{\infty} \bigl(F_N(t) - F_0(t)\bigr)^2 \, dF_0(t)
        = N \int_{-\infty}^{\infty} \bigl(F_N(t) - F_0(t)\bigr)^2 f_0(t)\, dt
\]
and
\[
T_{N,4} = N \int_{-\infty}^{\infty} \bigl(F_N(t) - F_0(t)\bigr)^2 \, dt.
\]
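For computation it is convenient that TN,3 has the well-known order-statistic form TN,3 = 1/(12N) + Σ_{i=1}^{N} ((2i − 1)/(2N) − F0(X_(i)))², which avoids explicit integration. A minimal sketch (our addition), with a standard normal F0 chosen only as an example:

```python
import numpy as np
from scipy.stats import norm

def cramer_von_mises(x, F0):
    """T_{N,3} via 1/(12N) + sum_i ((2i-1)/(2N) - F0(X_(i)))^2."""
    x = np.sort(x)
    N = len(x)
    i = np.arange(1, N + 1)
    return 1.0 / (12 * N) + np.sum(((2 * i - 1) / (2 * N) - F0(x)) ** 2)

rng = np.random.default_rng(4)
x = rng.normal(size=200)
print(cramer_von_mises(x, norm.cdf))
```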
Similarly to the first two statistics TN,1 and TN,2, the distribution of TN,3 does not depend on
F0 if (1.1) holds; this follows from the probability integral transformation again. However, the distribution
of TN,4 does depend on F0. This means that we need to use different Monte Carlo simulations for
different F0's. The weak convergence of the uniform empirical process, already used in the
justification of (1.2) and (1.3), can be used to show that
\[
(1.10)\qquad T_{N,3} \;\stackrel{\mathcal{D}}{\longrightarrow}\; \int_0^1 B^2(u)\, du
\]
and
\[
(1.11)\qquad T_{N,4} \;\stackrel{\mathcal{D}}{\longrightarrow}\; \int_{-\infty}^{\infty} B^2\bigl(F_0(t)\bigr)\, dt,
\]
where, as before, {B(u), 0 ≤ u ≤ 1} denotes a Brownian bridge. The rate of convergence in (1.10)
and (1.11) is much better than in (1.4) and (1.5). Namely, there are c3 and c4 such that
\[
(1.12)\qquad \sup_{-\infty < x < \infty} \Bigl| P\{T_{N,3} \le x\} - P\Bigl\{ \int_0^1 B^2(u)\, du \le x \Bigr\} \Bigr| \le \frac{c_3}{N}
\]
and
\[
(1.13)\qquad \sup_{-\infty < x < \infty} \Bigl| P\{T_{N,4} \le x\} - P\Bigl\{ \int_{-\infty}^{\infty} B^2\bigl(F_0(t)\bigr)\, dt \le x \Bigr\} \Bigr| \le \frac{c_4}{N}.
\]
The upper bound in (1.12) was obtained by Götze (cf. Shorack and Wellner, 1986, p. 223), and his
method can be used to prove (1.13). It is conjectured that these results are optimal: it is impossible
to replace 1/N with a sequence which converges to 0 faster. The theoretical results in (1.12)
and (1.13) were observed empirically a long time ago. This is one of the reasons for the popularity
of TN,3.
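This fast convergence can be seen empirically. The sketch below (our addition) simulates the null distribution of TN,3 for several sample sizes, using the order-statistic formula mentioned earlier, and shows that the upper quantiles barely move with N; the sample sizes, replication count and seed are arbitrary.

```python
import numpy as np

def T_N3_uniform(N, rng):
    # under H0 take F_0(t) = t on [0, 1] and use the order-statistic formula
    u = np.sort(rng.uniform(size=N))
    i = np.arange(1, N + 1)
    return 1.0 / (12 * N) + np.sum(((2 * i - 1) / (2 * N) - u) ** 2)

rng = np.random.default_rng(8)
for N in (10, 50, 1000):
    stats = np.array([T_N3_uniform(N, rng) for _ in range(20000)])
    print(N, np.round(np.quantile(stats, [0.90, 0.95, 0.99]), 3))
```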
There is an interesting connection between U statistics and the Cramér–von Mises statistics. It can
be shown that the Cramér–von Mises statistics are essentially U statistics. This claim is supported
by a famous expansion of the integral of the squared Brownian bridge:
\[
(1.14)\qquad \int_0^1 B^2(t)\, dt = \sum_{k=1}^{\infty} \frac{1}{k^2 \pi^2} N_k^2,
\]
where Nk, k ≥ 1, are independent standard normal random variables. This is like the limit of
degenerate U statistics. The result in (1.14) is a consequence of the Karhunen–Loève theorem.
They showed that
\[
\{B(t),\ 0 \le t \le 1\} \;\stackrel{\mathcal{D}}{=}\; \Bigl\{ \sqrt{2}\, \sum_{k=1}^{\infty} N_k \frac{1}{k\pi} \sin(k\pi t),\ 0 \le t \le 1 \Bigr\}.
\]
This result looks obvious, in some sense, since B(t) is square integrable, so we should have an expansion
with respect to a basis. The interesting part is that the Nk's are iid standard normal random
variables; if a different basis is used, this will not be the case. We use this special basis because
the functions sin(kπt) are the eigenfunctions of the covariance operator f ↦ ∫_0^1 K(t, s)f(s)ds with kernel K(t, s) = min(t, s) − ts.
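The expansion (1.14) also gives a convenient way to approximate the limit distribution in (1.10): truncate the series and simulate. A minimal sketch (our addition); the truncation point, replication count and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
k_max, n_rep = 500, 20000
k = np.arange(1, k_max + 1)
weights = 1.0 / (k**2 * np.pi**2)

Z = rng.normal(size=(n_rep, k_max))
limit_sample = (Z**2) @ weights     # draws from the truncated series in (1.14)

print(limit_sample.mean())              # E int_0^1 B^2 = 1/6
print(np.quantile(limit_sample, 0.95))  # approximate 95% critical value for T_{N,3}
```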
There is another interesting and useful formula for the Brownian bridge: the integral
\[
B(t) = \int_0^t \frac{1-t}{1-s}\, dW(s), \qquad 0 \le t \le 1,
\]
also defines a Brownian bridge. However, first we need to define integration with respect to a
Wiener process. We have two roads: one is to study Itô integration; the other possibility is much
simpler. We just assume that integration by parts defines the integral, so
\[
\int_0^t \frac{1}{1-s}\, dW(s) = \frac{W(t)}{1-t} - \int_0^t W(s)\, d\Bigl(\frac{1}{1-s}\Bigr)
                              = \frac{W(t)}{1-t} - \int_0^t \frac{W(s)}{(1-s)^2}\, ds,
\]
and therefore
\[
\int_0^t \frac{1-t}{1-s}\, dW(s) = W(t) - (1-t) \int_0^t \frac{W(s)}{(1-s)^2}\, ds.
\]
This integral representation of the Brownian bridge is often used in biostatistics.
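As a numerical sanity check of the representation as reconstructed above (our addition), one can discretize the integral and verify that the variance of the simulated process at a fixed point is close to t(1 − t); the grid, the evaluation point t = 1/2 and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n_grid, n_rep = 800, 4000
s = np.arange(1, n_grid + 1) / n_grid
ds = 1.0 / n_grid
m = n_grid // 2            # evaluate at t = 0.5, away from the singularity at s = 1
t = s[m - 1]

vals = np.empty(n_rep)
for r in range(n_rep):
    W = np.cumsum(rng.normal(scale=np.sqrt(ds), size=n_grid))
    integral = np.sum(W[:m] / (1.0 - s[:m]) ** 2) * ds   # Riemann sum over (0, t]
    vals[r] = W[m - 1] - (1.0 - t) * integral

print(vals.mean())         # should be close to 0
print(vals.var())          # should be close to t * (1 - t) = 0.25
```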
Next we discuss the consistency of the Cramér–von Mises tests. If (1.1) and (1.6) hold, then
\[
(1.15)\qquad T_{N,3} \;\stackrel{P}{\longrightarrow}\; \infty
\]
and
\[
(1.16)\qquad T_{N,4} \;\stackrel{P}{\longrightarrow}\; \infty.
\]
We write
\[
N \int_{-\infty}^{\infty} \bigl(F_N(t) - F_0(t)\bigr)^2\, dF_0(t)
 = N \int_{-\infty}^{\infty} \bigl([F_N(t) - F(t)] + [F(t) - F_0(t)]\bigr)^2\, dF_0(t)
\]
\[
 = N \int_{-\infty}^{\infty} \bigl(F_N(t) - F(t)\bigr)^2\, dF_0(t)
 + 2N \int_{-\infty}^{\infty} [F_N(t) - F(t)][F(t) - F_0(t)]\, dF_0(t)
 + N \int_{-\infty}^{\infty} \bigl(F(t) - F_0(t)\bigr)^2\, dF_0(t),
\]
and by the weak convergence of the Cramér–von Mises statistic
\[
\int_{-\infty}^{\infty} \bigl(N^{1/2}[F_N(t) - F(t)]\bigr)^2\, dF_0(t) = O_P(1).
\]
Using the Cauchy–Schwarz inequality we obtain that
\[
N \Bigl| \int_{-\infty}^{\infty} [F_N(t) - F(t)][F(t) - F_0(t)]\, dF_0(t) \Bigr|
 \le N \Bigl( \int_{-\infty}^{\infty} [F_N(t) - F(t)]^2\, dF_0(t) \Bigr)^{1/2}
        \Bigl( \int_{-\infty}^{\infty} [F(t) - F_0(t)]^2\, dF_0(t) \Bigr)^{1/2}
\]
\[
 = N^{1/2} \Bigl( \int_{-\infty}^{\infty} \bigl[N^{1/2}(F_N(t) - F(t))\bigr]^2\, dF_0(t) \Bigr)^{1/2}
        \Bigl( \int_{-\infty}^{\infty} [F(t) - F_0(t)]^2\, dF_0(t) \Bigr)^{1/2}
 = O_P\bigl(N^{1/2}\bigr),
\]
since the first factor in the last product is OP(1) and the second one is a constant. According to our
condition, ∫(F(t) − F0(t))² dF0(t) > 0 under (1.6), so the third term in the decomposition above grows
like N and dominates the other two terms. Therefore (1.15) is proven. Similar arguments give (1.16).
  • 65. MATHEMATICAL STATISTICS 5 According to our condition Z ∞ ∞ [F(t) − F0(t)]2 dF0(t) 0 and therefore (1.15) is proven. Similar arguments give (1.16). One of the basic advise is that “do not compare apples and oranges”. One of the interpretation is that we should compare variables with the same or essentially the same variances. Since the observations are independent, under H0 we have that var (FN (t) − F0(t)) = 1 N F0(t)(1 − F0(t)), so the variance of the variables used in all statistics so far, depend on t. Darling and Erdős (1956) suggested the following statistic to test the null hypothesis against the alternative in (1.6): (1.17) TN,5 = sup −∞x∞ N1/2|FN (t) − F0(t)| (F0(t)(1 − F0(t)))1/2 . The statistic TN,5 is called self normalized. However, even under the null hypothesis (1.18) lim N→∞ P{TN,5 ≥ C} = 1, for all C , i.e. TN,5 is unbounded in probability. Here is a heuristic argument for (1.17). We observe that TN,5 does not depend on F0. If the weak convergence of the empirical process to a Brownian bridge holds, the distribution of TN,5 should be close to the distribution of sup0t1 |B(t)|/(t(1 − t))1/2. But according to the law of the iterated logarithm, lim sup t→0 |B(t)| t1/2 = ∞ a.s. and therefore P sup 0t1 |B(t)|/(t(1 − t))1/2 = ∞ = 1. We can also use the empirically self normalized (1.19) TN,6 = sup X1,N xXN,N N1/2|FN (t) − F0(t)| (FN (t)(1 − FN (t)))1/2 , where X1,N = min{X1, X2, . . . , XN } and XN,N = max{X1, X2, . . . , XN }. Using the result of Darling and Erdős (1956), it can be shown that under the null hypothesis lim N→∞ P (2 log log N)1/2 TN,5 ≤ x + 2 log log N + 1 2 log log log N − 1 2 log π (1.20) = exp(−2e−x ) for all x. The limit result in (1.20) also holds for TN,6. Here we have the interesting result that even under the null hypothesis TN,5 and TN,6 converge to ∞ in probability, since 1 (2 log log N)1/2 TN,5 P → 1. However, under the alternative (1.6), N−1/2 TN,5 P → c, where c is a positive constant and if t 6= t0, then c ≥ |F(t) − F0(t)| (F0(t)(1 − F0(t))1/2 . This means the under the alternative TN,5 will be much larger than under the null. This obser- vation makes it possible to use bootstrap. The rate of convergence in (1.20) is slow. The limit is
  • 66. 6 LAJOS HORVÁTH an extreme value and, in general, convergence to extreme values can be slow. Also, the norming sequences in (1.20) are chosen for their “simplicity”. They do not have any statistical meaning like the norming with the mean and the variance in the central limit theorem. There is an important observation in this discussion: the test statistic has a limit distribution under the null and converges in probability to ∞ under the alternative. In case of TN,5 and TN,6, we should say that they converge to ∞ much faster. Next we consider testing if our sample belongs to a specific family of distributions. 1.2. Parameter estimated processes. We assume that Assumption 1.1 holds. Now we wish to the null hypothesis F0(t, λ) = ( 0, if t 0 1 − e−t/λ, if t ≥ 0, (1.21) where λ is an unknown parameter. The true value of the parameter is λ0. It is natural to estimate λ from the sample by the maximum likelihood estimator λ̂N = X̄N = 1 N N X i=1 Xi. If the null hypothesis of (1.21) is true, then FN (t) should be close to F0(t, λ̂N ) for all −∞ t ∞, because FN (t) always estimates the true distribution function. Hence we study the difference between FN (t) and F0(t, λ̂N ). We start withe a Taylor expansion with respect the parameter F0(t, λ̂N ) − F0(t, λ0) = g1(t, λ0)(λ̂N − λ0) + 1 2 g2(t, λ∗ )(λ̂N − λ0)2 , where λ∗ is between λ̂N and λ0, We can assume that t 0 since both FN (t) and F(t, λ) are 0 for t ≤ 0. Let g1(t, λ) = ∂F0(t, λ) ∂λ and g2(t, λ) = ∂2F0(t, λ) ∂λ2 . We know from the law of large numbers that (1.22) λ̂N P → λ0, and from the central limit theorem that N1/2(λ̂N − λ0) is asymptotically normal and therefore (1.23) N1/2 |λ̂N − λ0| = OP (1). It is elementary that sup−∞t∞ |g2(t, λ)| is bounded, as a function of λ, in a neighbourhood of λ0. Hence (1.22) yields (1.24) sup 0t∞ |g2(t, λ∗ )| = OP (1). Putting together (1.23) and (1.24) we conclude (1.25) sup 0t∞ |g2(t, λ∗ )|(λ̂N − λ0)2 = OP 1 N . These arguments give the important observation that sup 0t∞
  • 67.
  • 68.
  • 69. N1/2 [FN (t) − F0(t, λ̂N )] − N1/2 [FN (t) − F0(t, λ0) − g1(t, λ0)(λ̂N − λ0)]
  • 70.
  • 71.
  • 73. MATHEMATICAL STATISTICS 7 The result is in (1.26) is very important and it can be proven in more generality. The process N1/2(FN (t) − F0(t, λ̂N )) is called parameter estimated empirical process and it is often used to check a null hypothesis when an unknown parameter appears under the null hypothesis. We know some facts already: (1.27) N1/2 (FN (t) − F0(t, λ0) D[0,∞] −→ B(F0(t, λ0)) and the asymptotic normality of N1/2(λ̂N −λ0). However, this is not enough! We need them jointly since both terms appear in (1.26). The key is a formula which you might have learnt in probability. We know that λ0 = EX1 = Z ∞ 0 tf0(t, λ0)dt = Z ∞ 0 (1 − F0(t, λ0))dt. Using integration by parts, Z ∞ 0 tf0(t, λ0)dt = − Z ∞ 0 t(1 − F0(t, λ0))0 dt = −t(1 − F0(t, λ0))
  • 74.
  • 75.
  • 76.
  • 77. ∞ 0 + Z ∞ 0 (1 − F0(t, λ0))dt. Clearly, lim t→0 t(1 − F0(t, λ0)) = 0 and by the existence of the expected value lim t→∞ t(1 − F0(t, λ0)) = 0. Thus we get N1/2 (λ̂N − λ0) = Z ∞ 0 N1/2 (F0(t, λ0) − F̂N (t))dt = − Z ∞ 0 N1/2 (F̂N (t) − F0(t, λ0))dt. Now everything is expressed in terms of the empirical process. The weak convergence of the empirical process in (3.21) yields (1.28) N1/2 (FN (t)−F0(t, λ0)−g1(t, λ0)(λ̂N −λ0)) D[0,∞] −→ B(F(t, λ0))+g1(t, λ0) Z ∞ 0 B(F0(u, λ0))du. It looks obvious that (3.21) implies (1.29) Z ∞ 0 N1/2 (F̂N (t) − F0(t, λ0))dt D → Z ∞ 0 B(F0(u, λ0))du, but it requires a little work. For any C 0, the weak convergence of the empirical process to B(F(·)) implies that N1/2 FN (t) − F0(t, λ0) + g1(t, λ0) Z C 0 N1/2 (F̂N (u) − F0(u, λ0))du (1.30) D[0,∞] −→ B(F(t, λ0)) + g1(t, λ0) Z C 0 B(F0(u, λ0))du. Also, by the Cauchy–Schwartz inequality we have var Z ∞ C B(F0(u, λ0))du = E Z ∞ C B(F0(u, λ0))du 2 (1.31) = E Z ∞ C Z ∞ C B(F0(u, λ0))B(F0(v, λ0))dudv = Z ∞ C Z ∞ C E[B(F0(u, λ0))B(F0(v, λ0))]dudv ≤ Z ∞ C Z ∞ C (E[B(F0(u, λ0))]2 )1/2 (E[B(F0(v, λ0))]2 )1/2 dudv
  • 78. 8 LAJOS HORVÁTH = Z ∞ C Z ∞ C [F0(u, λ0)(1 − F0(u, λ0))]1/2 [F0(v, λ0)(1 − F0(v, λ0))]1/2 dudv = Z ∞ C [F0(u, λ0)(1 − F0(u, λ0))]1/2 du 2 → 0, as C → ∞. The same arguments yield for any N that var Z ∞ C N1/2 (F̂N (u) − F0(u, λ0)) → 0, as C → ∞. (1.32) Now Chebishev’s inequality implies on account of (1.31) and (1.32) that
  • 79.
  • 80.
  • 81.
  • 83.
  • 84.
  • 85.
  • 86. P → 0 and for all 0 lim C→∞ lim sup N→∞ P
  • 87.
  • 88.
  • 89.
  • 90. Z ∞ C N1/2 (F̂N (u) − F0(u, λ0))
  • 91.
  • 92.
  • 93.
  • 94. = 0. Now the proof of (1.28) is complete. In light of (1.29) we suggest the parameter estimated Kolmogorov–Smirnov statistics: TN,7 = N1/2 sup 0t∞
  • 95.
  • 96.
  • 97. FN (t) − F0(t, λ̂N )
  • 98.
  • 99.
  • 100. and TN,8 = N1/2 sup 0t∞ FN (t) − F0(t, λ̂N ) . Now the limit distributions of TN,7 and TN,8 can be derived easily from (1.29). Namely, (1.33) TN,8 D → sup 0t∞
  • 101.
  • 102.
  • 103.
  • 104. B(F0(t, λ0)) + g1(t, λ0) Z ∞ 0 B(F0(u, λ0))du
  • 105.
  • 106.
  • 107.
  • 108. and (1.34) TN,9 D → sup 0t∞ B(F0(t, λ0)) + g1(t, λ0) Z ∞ 0 B(F0(u, λ0))du . How to use (1.33) and (1.34) ? This is highly not obvious. Please note that the limit depends on the parametric form of F0, but this is not an issue since we know that F0 is exponential. But the dependence on λ0 is more serious since λ0 is unknown. However, a little work shows that TN,8 and TN,9 do not depend on λ0. Note that N1/2 sup 0t∞
  • 109.
  • 110.
  • 111. FN (t) − F0(t, λ̂N )
  • 112.
  • 113.
  • 115.
  • 116.
  • 117. FN (t) − (1 − e−t/λ̂N )
  • 118.
  • 119.
  • 121.
  • 122.
  • 123. FN (uλ̂N ) − (1 − e−u )
  • 124.
  • 125.
  • 126. . By definition, FN (uλ̂N ) = 1 N N X i=1 I{Xi/λ̂N ≤ u} and Xi λ̂N = Xi X̄N = Xi/λ0 PN j=1 Xj/λ0 . Hence TN,7 and therefore the limit distribution does not depend on λ0. Now Monte Carlo simula- tions could be used to get the distribution of the limit in (1.33), since we can assume that λ0 = 1 in
  • 127. MATHEMATICAL STATISTICS 9 the limit. This argument also works for TN,8. Hence TN,7 and TN,8, and therefore their limits, are free of the unknown parameter. Our argument works for scale families. With some modifications, it can be done for location and location and scale families. Let assume that we are in a location family. In this case the underlying distribution is F(t, λ) = F0(t − λ). Hence sup −∞t∞
  • 128.
  • 129.
  • 130. FN (t) − F0(t − λ̂N )
  • 131.
  • 132.
  • 134.
  • 135.
  • 136. [FN (t) − F0(t − λ0)] + [λ0 − λ̂N ])
  • 137.
  • 138.
  • 140.
  • 141.
  • 142. FN (u + λ0) − F0(u + [λ0 − λ̂N ])
  • 143.
  • 144.
  • 145. and FN (u + λ0) = 1 N N X i=1 I{Xi ≤ t + λ0} = 1 N N X i=1 I{Xi − λ0 ≤ t} Since we are in a location family, the distribution of Xi − λ0 does not depend on λ0. We showed that in case of location families, if λ̂N is the maximum likelihood estimator, then the distribution of λ0 − λ̂N does not depend on λ0 (more is true, the value of λ0 − λ̂N does not depend on λ0). The same argument work for the location and scale families. As an example, let assume that F0 is a Gamma distribution with parameters λ (scale parameter) and κ (shape parameter). We assume that κ is known. Since we are in the scale family, the arguments used in the exponential case would work. So far we considered Kolmogorov–Smirnov type processes for parameter estimated processes. In case of scale families (this includes the exponential we discussed at the beginning of this section), N Z ∞ −∞ (FN (t) − F0(t, λ̂N ))2 dF0(t, λ̂N ) = N Z ∞ −∞ (FN (t) − F0(t, λ̂N ))2 f0(t, λ̂N )dt do not depend on λ0. An other possibility for parameter free method is the parameter estimated Cramér–von Mises statistic N Z ∞ −∞ (FN (t) − F0(t, λ̂N ))2 dFN (t) ≈ N X i=1 i N − F0(Xi,N , λ̂N ) 2 , where X1,N ≤ X2,N ≤ . . . ≤ XN,N are the order statistics. To establish the consistency of TN,7 is easy.We assume that under the alternative HA : inf λ0 sup 0t∞ |F(t) − F0(t, λ)| 0 and in this case TN,7 P → ∞. We have that TN,8 P → ∞, if HA : inf λ0 sup 0t∞ (F(t) − F0(t, λ)) 0. The asymptotic behaviour of the parameter estimated Cramér–von Mises statistics can be discussed in the same way. The self normalized statistics also can be used to test if the underlying distribution in a parametric form. For example, in case of testing for exponentiality we can use sup 0t∞ N1/2|FN (t) − F0(t, λ̂N )| (F0(t, λ̂N )(1 − F0(t, λ̂N )))1/2
  • 146. 10 LAJOS HORVÁTH and sup X1,N tXN,N N1/2|FN (t) − F0(t, λ̂N )| (FN (t)(1 − FN (t)))1/2 , where X1,N = min(X1, X2, . . . , XN ) and XN,N = max(X1, X2, . . . , XN ). We found the similar pattern for the test statistics as in Section 1.1. The test statistics convergence in distribution to a limit and they converge to ∞ in probability under the alternative. It turns out that the estimation of the parameter does not effect the limit distribution, i.e. (1.20) holds for the parameter estimated statistics as well. Scale family. The underlying density is in the form f(t, λ) = 1 λ f0(t/λ), where λ 0 is a parameter. We use the empirical distribution function of Yi = Xi X̄N . 1 ≤ i ≤ N X̄N = 1 N N X i=1 Xi The distribution of Yi does not depend on λ0, the true value of the parameter under the null hypothesis. Hence the limit of N1/2 (HN (x) − F0(x)) with HN (x) = 1 N N X i=1 I{Yi ≤ x}, does not depend on λ0 but it DOES on f0. We used the notation F0 0 = f0. The sample mean X̄N might not be the maximum likelihood estimator. What is the maximum likelihood estimator? The likelihood function is L(λ) = N Y i=1 1 λ f0(Xi/λ) and the log likelihood is `(λ) = −N log λ + N X i=1 log f0(Xi/λ). We compute the derivative `0 (λ) = − N λ + N X i=1 1 f0(Xi/λ) f0 0(Xi/λ) − Xi λ2 and we need to solve the equation − N X i=1 1 f0(Xi/λ) f0 0(Xi/λ) − Xi λ = N. The equation depends only on Xi/λ. This shows that λ̂N /λ0 does not depend on λ0, where λ̂N is the maximum likelihood estimator. Hence the parameter estimated statistics do not depend on the unknown scale parameter under the null hypothesis. Next we consider a typical two sample problem.
  • 147. MATHEMATICAL STATISTICS 11 1.3. Comparing two samples. In addition to Assumption 1.1 we require Assumption 1.2. Y1, Y2, . . . , YM are independent and identically distributed random variables with distribution function H. It is a very common problem to test H0 : F(t) = H(t) for all t. In addition to FN (t) we define the empirical distribution of the Y sample HM (t) = 1 M M X i=1 Yi. If H0 is true, the difference should be small. Due to independence, under the null hypothesis we have var(FN (t) − HM (t)) = 1 N + 1 M (F(t)(1 − F(t))) = N + M NM (F(t)(1 − F(t))), so our consideration will be based on the two sample version of the empirical process uN,M (t) = NM N + M 1/2 [(FN (t) − HM (t)) − (F(t) − H(t))] . Of course, under the null hypothesis F(t)−H(t) = 0 for all t in the definition of uN,M (t). The weak convergence of uN,M (t) is immediate consequence of the weak convergence of empirical processes: (1.35) uN,M (t) D[−∞,∞] −→ c 1/2 0 B(1) (F(t)) + (1 − c0)1/2 B(1) (H(t)), where B(1) and B(2) are independent Brownian bridges, and lim N,M→∞ M N + M = c0. We observe that (1.36) {c 1/2 0 B(1) (F(t)) + (1 − c0)1/2 B(2) (F(t)), −∞ t ∞} D = {B(F(t)), −∞ t ∞}. Since B(1)(F(t)), B(2)(F(t)) are jointly Gaussian, they linear combination will be Gaussian. Hence we need to compute the mean and the covariance of c 1/2 0 B(1)(F(t)) + (1 − c0)1/2B(2)(F(t)). Since EB(1)(F(t)) = 0 EB(2)(F(t)) = 0, we get that E[c 1/2 0 B(1) (F(t)) + (1 − c0)1/2 B(2) (F(t))] = 0. Using the independence of B(1) and B(2) and EB(1)(t)B(1)(s) = EB(2)(t)B(2)(s) = min(t, s) − ts, E[c 1/2 0 B(1) (F(t)) + (1 − c0)1/2 B(2) (F(t))][c 1/2 0 B(1) (F(s)) + (1 − c0)1/2 B(2) (F(s))] = E[c 1/2 0 B(1) (F(t))][c 1/2 0 B(1) (F(s))] + E[(1 − c0)1/2 B(2) (F(t))][(1 − c0)1/2 B(2) (F(s))] = c0[min(F(t), F(s)) − F(t)F(s)] + (1 − c0)[min(F(t), F(s)) − F(t)F(s) = min(F(t), F(s)) − F(t)F(s), which is exactly the covariance function of B(F(t)). We suggest the following statistics: TN,M,1 = sup −∞t∞ |uN,M (t)| and TN,M,2 = sup −∞t∞ uN,M (t).
  • 148. 12 LAJOS HORVÁTH If H0 holds and F = H is continuous, then (1.37) TN,M,1 D → sup −∞t∞ |B(F(t))| = sup 0≤t≤1 |B(t)| and (1.38) TN,M,2 D → sup −∞t∞ B(F(t)) = sup 0≤t≤1 B(t) The result in (1.37) and (1.38) are immediate consequences of (1.35) and (1.36). If F is continuous, then the distributions of TN,M,1 and TN,M,2 do not depend on F under H0. This observation is an immediate consequence of the probability integral transformation. The statistics TN,M,1 and TN,M,2 are Kolmogorov–Smirnov type statistics. Similarly to the previous discussions we can define Cramér–von Mises type statistics as well. We can define similarly Cramér– von Mises statistics: Z ∞ −∞ u2 N,M dFN (t) ≈ M N + M N X i=1 i N − HM (Xi,N ) 2 , where X1,N ≤ X2,N ≤ . . . , XN,N are the order statistics of the first sample. Or similarly, Z ∞ −∞ u2 N,M dHM (t) ≈ NM N + M M X i=1 i M − FN (Yi,M ) 2 , where Y1,M ≤ Y2,M ≤ . . . , YM,M are the order statistics of the second sample. Using again the weak convergence of uN,M (t), one can prove that under H0 Z ∞ −∞ u2 N,M dFN (t) D → Z 1 0 B2 (u)du and Z ∞ −∞ u2 N,M dHM (t) D → Z 1 0 B2 (u)du. So far the observations were not only independent but also identically distributed even under the alternative hypothesis. The next problem is interesting since we want to test that the assumption of identically distributed data will be tested. The topic of Section 1.4 is very popular in the literature and it is called the change point problem or testing for the stability of the data. 1.4. Change point. We assume that Assumption 1.1 holds but we observe Z1, Z2, . . . , ZN defined by Zi = ( µ0 + Xi, if 1 ≤ k ≤ k∗ µA + Xi, if k ∗ +1 ≤ k ≤ N. (1.39) We call k∗ the time of change and it is unknown. Similarly, the means, µ0 6= µA before and after the change are also unknown. Of course these are the means of the Zi’s, if EXi = 0. Hence we need to modify Assumption 1.1: Assumption 1.3. X1, X2, . . . , XN are independent and identically distributed random variables with EXi = 0 and EX2 i = σ2.
  • 149. MATHEMATICAL STATISTICS 13 The model assumes that the variance is constant. First we even assume that σ is known to find a suitable test statistic. We will discuss how to proceed if σ is unknown. We note that σ is a nuisance parameter so we have no interest in its value. Recently, data examples confirmed that σ might not be constant during the observation period. It might be time dependent so we wish to detect changes in the mean even if the variance is changing as well. We only want to detect a change point if the mean changes regardless what happens to the variance of the observations. Only few results are available now. We assume now that (1.40) σ2 is known. We wish to test the stability of the model, i.e. the mean remains constant during the observation period: H0 : k∗ N against the alternative HA : 1 k∗ N. The null hypothesis postulates the change occurs outside of the observation period so it does not matter for us. Under the alternative the means changes exactly once. Our model is called “at most one change” (AMOC). First we need to find a test statistic. Let assume that k∗ = k is known. In this case this a simple two sample problem. We cut the data into two parts at k and we compute the sample means for each segment with Z̄k = 1 k k X i=1 Zi and Ẑk = 1 N − k N X i=k+1 Zi. If H0 holds, than |Z̄k −Ẑk| is small, the difference between the two empirical means can be explained by the variability in the data. Using Assumption 1.3 we get var Z̄k − Ẑk = σ2 k + σ2 N − k = σ2 N k(N − k) , so we reject the means of Z1, Z2, . . . , Zk and Zk+1, Zk+2, . . . , ZN are the same if Qk = 1 σ k(N − k) N 1/2 |Z̄k − Ẑk| is large. The statistic Qk should be familiar since this is the two sample z–test if the observations are normal! To prove this claim assume that X1, X2, X3, . . . , XN are independent identically distributed normal random variables with EXi = 0, EX2 i = σ2, σ2 is known. We wish to test that the means of Z1, Z2, . . . , Zk and Zk+1, Zk+2, . . . , ZN are the same. Let µ1 be the mean of the first sample, µ2 be the mean of the second sample and µ be the mean under the null hypothesis. The maximum likelihood estimators are µ̂1 = 1 k k X i=1 Zi, µ̂2 = 1 N − k N X i=k+1 Zi and µ̂ = 1 N N X i=1 Zi. Hence the likelihood ratio is k Y i=1 1 √ 2π exp(−(Zi − µ̂1)2 /(2σ2 )) N Y i=k+1 1 √ 2π exp(−(Zi − µ̂2)2 /(2σ2 )) N Y i=1 1 √ 2π exp(−(Zi − µ̂)2 /(2σ2 ))
  • 150. 14 LAJOS HORVÁTH = exp 1 2σ2 N X i=1 (Zi − µ̂)2 − k X i=1 (Zi − µ̂1)2 − N X i=k+1 (Zi − µ̂2)2 !! = exp 1 2σ2 kµ̂2 1 + (N − k) ˆ µ2 2 − Nµ̂2 = exp 1 2σ2 kµ̂2 1 + (N − k) ˆ µ2 2 − N((k/N)µ̂1 + ((N − k)/N)µ̂2)2 = exp 1 2σ2 k(N − k) N (µ̂1 − µ̂2)2 , proving our claim. Since k is unknown we use the rule: (1.41) reject H0, if max 1≤kN 1 σ |Qk| is large. A simple algebra shows that the rule in (1.41) might not work. It is easy to see that under H0 Qk = 1 σ N k(N − k) 1/2
  • 151.
  • 152.
  • 153.
  • 154.
  • 156.
  • 157.
  • 158.
  • 159.
  • 160. . So by the law of the iterated logarithm for partial sums of independent and identically distributed random variables (Das Gupta, 2008, pp. 8) (1.42) max 1≤kN |Qk| P → ∞ as N → ∞. We observe that σ max 1≤kN |Qk| ≥ max 1≤kN/2 N k(N − k) 1/2
  • 161.
  • 162.
  • 163.
  • 164.
  • 166.
  • 167.
  • 168.
  • 169.
  • 171.
  • 172.
  • 173.
  • 174.
  • 176.
  • 177.
  • 178.
  • 179.
  • 181.
  • 182.
  • 183.
  • 184.
  • 186.
  • 187.
  • 188.
  • 189.
  • 191.
  • 192.
  • 193.
  • 194.
  • 196.
  • 197.
  • 198.
  • 199.
  • 201.
  • 202.
  • 203.
  • 204.
  • 206.
  • 207.
  • 208.
  • 209.
  • 210. . The central limit theorem yields N−1/2
  • 211.
  • 212.
  • 213.
  • 214.
  • 216.
  • 217.
  • 218.
  • 219.
  • 220. = OP (1) and the law of the iterated logarithm implies lim sup N→∞ (log log(N/2))−1/2 max 1≤kN/2 1 k 1/2
  • 221.
  • 222.
  • 223.
  • 224.
  • 226.
  • 227.
  • 228.
  • 229.
  • 230. 0 a.s., completing the proof of (1.42). The result in (1.42) was first observed empirically by economists. This caused a stir since the z–test is widely used (without checking the required assumptions) and using the normal table for the critical values of max1≤kN |Qk| caused over rejection, and it was getting worse as N was increasing. Andrews (1993) is the most popular contribution to the applicability of the z–test for the change point problem. He observed that the law of the iterated comes into action for large and small k. He suggested rejecting for large values of max bNαc≤k≤N−bNαc 1 σ |Qk|,
  • 231. MATHEMATICAL STATISTICS 15 where b·c is the integer part and 0 α 1/2 is chosen by the practitioner. Since the change point problem is common in economics (“nothing last forever”), there has been a tremendous interest in the choice of α. The choice of 5% and 10% is recommended. Using the weak convergence of partial sums to a Wiener process can be used to prove that under the null hypothesis (1.43) max bNαc≤k≤N−bNαc 1 σ |Qk| D → sup α≤t≤1−α |B(t)| (t(1 − t))1/2 , where {B(u), 0 ≤ u ≤ 1} is a Brownian bridge. The functional L(f) = supα≤u≤1−α |f(u)| is a continuous functional on the Skorokhod space D[0, 1], so the weak convergence of partial sums gives (1.43). Looking at (1.43) it is clear why (2.20) holds if α = 0. The limit cannot be finite in this case according to the law of iterated logarithm for the Wiener process. Hence Andrews (1993) claimed that no limit result can be established for max1≤kN |Qk|. This claim is strongly believed in econometrics but it was not even true when Andrews (1993) published his famous paper. If we look at again (1.19), we face the same issue. The self–normalization (i.e. taking the maximum of random variables with constant variance) puts too much weight at the beginning and the end of the data. If Darling and Erdős (1956) can be used to get the limit in (1.19), it might work in the present case as well. We will return to this question later. The limit result of (1.43) suggests that we should remove the weight and work with (1.44) TN,11 = max 1≤k≤N 1 σ N−1/2
  • 232.
  • 233.
  • 234.
  • 235.
  • 237.
  • 238.
  • 239.
  • 240.
Now we have reached a famous and useful statistic. It is called CUSUM (CUmulative SUM) in the literature. One of its interesting features is that it does not depend on the unknown mean under the null hypothesis. The mean is a nuisance parameter and its value does not appear in CUSUM type statistics. We usually refer to the maximally selected z–statistic as the standardized CUSUM. The limit distribution of T_{N,11} under the null hypothesis is very simple:
(1.45) T_{N,11} →_D sup_{0≤u≤1} |B(u)|.
The result in (1.45) follows from the weak convergence of partial sums. Using (1.45), it is easy to detect changes in the data since the distribution of sup_{0≤u≤1} |B(u)| is known and tabulated. In T_{N,11} we can recognise a Kolmogorov–Smirnov type statistic. A possible Cramér–von Mises type statistic for the change point problem is
∫_0^1 ( N^{−1/2} [ Σ_{i=1}^{⌊Nu⌋} Z_i − (⌊Nu⌋/N) Σ_{i=1}^{N} Z_i ] )² du
and under H_0
∫_0^1 ( N^{−1/2} [ Σ_{i=1}^{⌊Nu⌋} Z_i − (⌊Nu⌋/N) Σ_{i=1}^{N} Z_i ] )² du →_D ∫_0^1 B²(t) dt.
The behaviour of T_{N,11} is very simple under the exactly one change point alternative. Namely,
(1.46) if µ_0 ≠ µ_A, then T_{N,11} →_P ∞.
We note that (1.46) also holds in case of several changes in the mean. We can use the CUSUM to estimate the time of change k* from the data. The estimator is the point where the CUSUM achieves its largest value:
k̂_N = min{ k : 1 ≤ k ≤ N, |Σ_{i=1}^{k} Z_i − (k/N) Σ_{i=1}^{N} Z_i| = max_{1≤j≤N} |Σ_{i=1}^{j} Z_i − (j/N) Σ_{i=1}^{N} Z_i| }.
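To make this concrete, here is a minimal sketch (mine, not part of the notes) computing the CUSUM statistic T_{N,11} and the change point estimator k̂_N; the variance σ² is treated as known, in line with the text, and the toy data and function names are my own illustrative choices.

import numpy as np

def cusum_statistic(z, sigma):
    """T_{N,11} = max_k |sum_{i<=k} Z_i - (k/N) sum_{i<=N} Z_i| / (sigma * sqrt(N)),
    together with k_hat, the first index where |CUSUM| is maximal."""
    z = np.asarray(z, dtype=float)
    n = z.size
    csum = np.cumsum(z)
    cusum = csum - np.arange(1, n + 1) / n * csum[-1]
    t_stat = np.max(np.abs(cusum)) / (sigma * np.sqrt(n))
    k_hat = int(np.argmax(np.abs(cusum))) + 1
    return t_stat, k_hat

# toy example: the mean changes from 0 to 1 at k* = 60, sigma = 1 is known
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(1.0, 1.0, 40)])
print(cusum_statistic(z, sigma=1.0))   # reject if the statistic exceeds a quantile of
                                       # sup|B(u)| (about 1.36 at the 5% level)

With an estimated variance one simply replaces sigma by σ̂_{N,1} or σ̂_{N,2} from the discussion below, which gives T_{N,12} and T_{N,13}.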
If the change occurs in the middle of the data, i.e. k* = ⌊Nθ⌋ with some 0 < θ < 1, then
(1.47) k̂_N/N →_P θ,
i.e. we can consistently approximate θ. So we can do testing in the very unlikely case when σ is known. If we can estimate σ from the sample, we have more realistic procedures. The first candidate is the sample variance:
σ̂²_{N,1} = (1/(N − 1)) Σ_{i=1}^{N} (Z_i − Z̄_N)².
We have already established that under the null hypothesis σ̂²_{N,1} →_P σ², so it is asymptotically consistent. But this is not the case under the alternative! We note that
(1.48) Z̄_N = (k*/N)(1/k*) Σ_{i=1}^{k*} Z_i + ((N − k*)/N)(1/(N − k*)) Σ_{i=k*+1}^{N} Z_i →_P µ̄ = θµ_0 + (1 − θ)µ_A.
Elementary algebra gives
σ̂²_{N,1} = (1/(N − 1)) Σ_{i=1}^{k*} (Z_i − µ_0 + µ_0 − µ̄ + µ̄ − Z̄_N)² + (1/(N − 1)) Σ_{i=k*+1}^{N} (Z_i − µ_A + µ_A − µ̄ + µ̄ − Z̄_N)²
and
(1/(N − 1)) Σ_{i=1}^{k*} (Z_i − µ_0 + µ_0 − µ̄ + µ̄ − Z̄_N)²
= (k*/(N − 1))(1/k*) Σ_{i=1}^{k*} (Z_i − µ_0)² + (k*/(N − 1))(µ_0 − µ̄)² + (k*/(N − 1))(Z̄_N − µ̄)² + (2k*(µ_0 − µ̄)/(N − 1))(1/k*) Σ_{i=1}^{k*} (Z_i − µ_0) + (2k*(µ̄ − Z̄_N)/(N − 1))(1/k*) Σ_{i=1}^{k*} (Z_i − µ_0) + (2k*/(N − 1))(µ_0 − µ̄)(µ̄ − Z̄_N).
Using now the law of large numbers we obtain that
(1/k*) Σ_{i=1}^{k*} (Z_i − µ_0)² →_P σ², (1/k*) Σ_{i=1}^{k*} (Z_i − µ_0) →_P 0,
and by (1.48) Z̄_N − µ̄ →_P 0. Thus we conclude
(1.49) (1/(N − 1)) Σ_{i=1}^{k*} (Z_i − µ_0 + µ_0 − µ̄ + µ̄ − Z̄_N)² →_P θσ² + θ(µ̄ − µ_0)².
Similar arguments give
(1.50) (1/(N − 1)) Σ_{i=k*+1}^{N} (Z_i − µ_A + µ_A − µ̄ + µ̄ − Z̄_N)² →_P (1 − θ)σ² + (1 − θ)(µ̄ − µ_A)².
Putting together (1.49) and (1.50) we get that
(1.51) σ̂²_{N,1} →_P σ² + θ(µ_0 − µ̄)² + (1 − θ)(µ̄ − µ_A)²,
so we are overestimating σ². This is the penalty for not taking the possible change in the mean into account. Thus we could try
σ̂²_{N,2} = (1/(N − 1)) [ Σ_{i=1}^{k̂_N} (Z_i − Z̄_{k̂_N})² + Σ_{i=k̂_N+1}^{N} (Z_i − Z̃_{k̂_N})² ],
where Z̄_{k̂_N} and Z̃_{k̂_N} denote the sample means of the first k̂_N and of the last N − k̂_N observations, respectively. On account of k̂_N ≈ k*, it looks obvious that σ̂²_{N,2} → σ² in probability. But now the null hypothesis is the problem, since there is no k* under the null hypothesis! Using the weak convergence of the CUSUM process, it can be shown that k̂_N/N converges in distribution. It requires lengthy calculations to show that
(1.52) σ̂²_{N,2} →_P σ²
under the null and also under the alternative. Now we can define two statistics which do not require the knowledge of σ²:
(1.53) T_{N,12} = max_{1≤k≤N} (1/σ̂_{N,1}) N^{−1/2} |Σ_{i=1}^{k} Z_i − (k/N) Σ_{i=1}^{N} Z_i|
and
(1.54) T_{N,13} = max_{1≤k≤N} (1/σ̂_{N,2}) N^{−1/2} |Σ_{i=1}^{k} Z_i − (k/N) Σ_{i=1}^{N} Z_i|.
We note that under the no change null hypothesis
(1.55) T_{N,12} →_D sup_{0≤u≤1} |B(u)|
and
(1.56) T_{N,13} →_D sup_{0≤u≤1} |B(u)|.
Under the alternative T_{N,12} →_P ∞ and T_{N,13} →_P ∞. The suggested tests are very similar and it is not immediately clear what the difference between them is. Neither of them is perfect: in T_{N,12} we overestimate the variance under the alternative, so we reduce the power, while in T_{N,13} an additional estimation step is used, which might affect the behaviour in case of small and moderate sample sizes. This is a typical situation in statistics: we have a choice, but which one is better is not obvious.

1.5. Total Time on Test. The total time on test (TTT) is a popular concept in engineering. For example, the best test for exponentiality is based on TTT. TTT is defined for positive variables, so this will be assumed in this part. One of the ingredients for TTT is the function
z(x) = (1/(1 − F(x))) ∫_0^x (1 − F(u)) du.
The estimator of z(x) is simple, we just replace F with F_N, resulting in
ẑ_N(x) = (1/(1 − F_N(x))) ∫_0^x (1 − F_N(u)) du.
The weak convergence of N^{1/2}(F_N(x) − F(x)) to B(F(x)) (B is a Brownian bridge) yields that if F(T) < 1, then
(1.57) N^{1/2}(ẑ_N(x) − z(x)) →_{D[0,T]} Γ(x),
where
Γ(x) = (B(F(x))/(1 − F(x))²) ∫_0^x (1 − F(u)) du − (1/(1 − F(x))) ∫_0^x B(F(u)) du.
The proof of (1.57) can be derived from the weak convergence of the empirical process with the help of some algebra. By (1.57) we have that
(1.58) T_{N,15} →_D sup_{0≤x≤T} |Γ(x)|, where T_{N,15} = sup_{0≤x≤T} N^{1/2} |ẑ_N(x) − z(x)|.
Getting the distribution of the limit in (1.58) is hopeless since it depends on the unknown F. We will demonstrate that the bootstrap works. Interestingly, (1.58) is mainly used to construct confidence bands for z(x), 0 ≤ x ≤ T.
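As an illustration (mine, not from the notes), a short Python sketch of the estimator ẑ_N(x): it evaluates ẑ_N on a grid, using the fact that for positive observations ∫_0^x (1 − F_N(u)) du = (1/N) Σ min(X_i, x).

import numpy as np

def ttt_hat(x_grid, sample):
    """Evaluate z_hat_N(x) = (1/(1 - F_N(x))) * integral_0^x (1 - F_N(u)) du on a grid."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = xs.size
    out = np.empty(len(x_grid))
    for j, x in enumerate(x_grid):
        fn_x = np.searchsorted(xs, x, side="right") / n   # empirical distribution function at x
        integral = np.minimum(xs, x).mean()               # integral of (1 - F_N) over [0, x]
        out[j] = integral / (1.0 - fn_x) if fn_x < 1.0 else np.nan
    return out

rng = np.random.default_rng(1)
data = rng.exponential(scale=1.0, size=200)
grid = np.linspace(0.0, 2.0, 5)
print(ttt_hat(grid, data))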
2. Several versions of resampling

In Section 1 we discussed several common hypothesis testing problems in statistics and possible approaches to tackle them. The procedures based on a single sample had the following form: we defined a test statistic T_N and established the following properties: we reject for large values of T_N,
(2.1) lim_{N→∞} P{T_N ≤ x} = D(x)
under the null hypothesis, where D denotes the limiting distribution function, and
(2.2) T_N →_P ∞
under the alternative. Based on the original sample X_1, X_2, . . . , X_N we want to create another sample, called the bootstrap sample, X*_1, X*_2, . . . , X*_L, which should resemble the original observations. We consider the original sample X_1, X_2, . . . , X_N as fixed values, i.e. we condition with respect to them. Due to this conditioning we use P_X to denote P{ · | X}, where X = (X_1, X_2, . . . , X_N). From the bootstrap sample we compute our test statistic T^{(1)}_L, as T_N was computed from the original sample. Please note that we have not said how the bootstrap sample is obtained. We repeat this procedure independently R times, resulting in the bootstrap statistics T^{(1)}_L, T^{(2)}_L, . . . , T^{(R)}_L. Next we compute their empirical distribution function
(2.3) D_{N,L,R}(x) = (1/R) Σ_{i=1}^{R} I{T^{(i)}_L ≤ x}, −∞ < x < ∞.
If we can show that under the null hypothesis
(2.4) lim_{min(N,L)→∞} sup_{−∞<x<∞} |P_X{T^{(1)}_L ≤ x} − D(x)|
= 0 a.s., then the law of large numbers implies that
(2.5) sup_{−∞<x<∞} |D_{N,L,R}(x) − D(x)| → 0 if min(N, L, R) → ∞.
Equation (2.5) means that for almost all realizations of the original sample X, D_{N,L,R}(x) converges to D. So you must be extremely unlucky if the bootstrap is not working for you! We require from the bootstrap statistic that it is bounded in probability under the alternative:
(2.6) |T^{(1)}_L| = O_{P_X}(1).
The construction of the critical values will explain why (2.6) is crucial for the bootstrap to work. Let 0 < α < 1 and define the bootstrap critical value c_{N,L,R} = c_{N,L,R}(α) by D_{N,L,R}(c_{N,L,R}) = 1 − α. (There is a minor technical issue, since D_{N,L,R}(x) is a jump function, so it might not take the value 1 − α. In this case we take the smallest number where D_{N,L,R}(x) is for the first time larger than 1 − α. This works, for example, if D(x) is continuous.) Using (2.5) we obtain that P{T_N > c_{N,L,R}} → α, as min(N, L, R) → ∞, under the null hypothesis. The requirement in (2.6) implies that |c_{N,L,R}(α)| = O_{P_X}(1), as min(N, L, R) → ∞, even under the alternative, and therefore on account of (2.2) we get under the alternative that P{T_N > c_{N,L,R}} → 1, as min(N, L, R) → ∞. This means that the rejection rate under the null hypothesis is asymptotically α and we reject under the alternative with probability going to 1. The statistics in Section 1 can be bootstrapped. The bootstrap is simple: you need to run the same program several times and you will get the critical value very easily.
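A minimal sketch of this generic recipe (my own illustration, with a placeholder resampling step; every name here is hypothetical): draw R bootstrap samples, compute T^{(1)}_L, . . . , T^{(R)}_L, and take the empirical (1 − α) quantile of D_{N,L,R} as the critical value.

import numpy as np

def bootstrap_critical_value(x, statistic, resample, alpha=0.05, L=None, R=999, seed=0):
    """Generic bootstrap critical value c_{N,L,R}(alpha).
    statistic: function mapping a sample to the test statistic T.
    resample:  function (rng, x, L) -> bootstrap sample of size L; how it is built
               (with replacement, parametric, permutation, ...) is up to the user."""
    rng = np.random.default_rng(seed)
    L = len(x) if L is None else L
    t_boot = np.array([statistic(resample(rng, x, L)) for _ in range(R)])
    # empirical (1 - alpha) quantile of D_{N,L,R}
    return np.quantile(t_boot, 1.0 - alpha)

# example: nonparametric bootstrap (selection with replacement) for a studentized mean
def with_replacement(rng, x, L):
    return rng.choice(x, size=L, replace=True)

def studentized_mean(s, center):
    s = np.asarray(s, dtype=float)
    return abs(np.sqrt(len(s)) * (s.mean() - center) / s.std(ddof=1))

x = np.random.default_rng(1).exponential(size=50)
c = bootstrap_critical_value(x, lambda s: studentized_mean(s, x.mean()), with_replacement)
t_obs = studentized_mean(x, 1.0)       # testing H0: the mean equals 1
print(t_obs, c)                        # reject if t_obs > c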
This sounds nice but, of course, questions arise. How to choose L? How to choose R? We discuss these questions later. The theory supporting our discussion works well if the limit is derived from Gaussian processes. But this might not be the case for Poisson processes and extreme values. Usually the bootstrap is better (it provides better critical values) and this is proven in several cases, like bootstrapping the mean. In Section 1 we tried to discuss problems where, in some cases, only the bootstrap can provide critical values. The bootstrap can be used to construct confidence intervals and confidence bands as well.

2.1. Nonparametric bootstrap. This is probably the most popular method for resampling. The permutation method (selection without replacement) is older but it has more limited use; it is a modification of Fisher's exact test. Our sample is X = (X_1, X_2, . . . , X_N). As in the introduction of this section, we assume that X is given, so we consider it as constant, i.e. we condition with respect to X. We assume that
(2.7) F is a continuous distribution function.
If (2.7) holds, then there is no tie among the X_i's with probability 1. Now we select from X with replacement, resulting in X*_1, X*_2, . . . , X*_L. Due to the construction,
(2.8) X*_1, X*_2, . . . , X*_L are independent and identically distributed random variables.
The computation of the common distribution function is very simple. Since there is no tie among the X_i's, due to the random selection, P_X{X*_1 = X_j} = 1/N, 1 ≤ j ≤ N, so the proportion of X_i's which are less than or equal to x gives the conditional probability that X*_1 is less than or equal to x. This means that the common distribution function of the bootstrap sample is
F_N(x) = (1/N) Σ_{i=1}^{N} I{X_i ≤ x},
i.e. the empirical distribution function of the original sample. It is important to note that F_N(x) is a jump function even if (2.7) holds. This causes problems when the definition of T_N, the statistic we want to bootstrap, assumes that (2.7) holds. For example, statistics based on densities will have this problem. However, as we discussed earlier, F_N is an excellent estimate for F. For example,
(2.9) sup_{−∞<x<∞} |F_N(x) − F(x)| → 0 a.s.
We even know that the rate of convergence in (2.9) is N^{−1/2}(log log N)^{1/2}, according to the law of the iterated logarithm for empirical processes. The computation of the mean and the variance of X*_1 is simple since it takes the value X_i, 1 ≤ i ≤ N, with probability 1/N, so
E_X X*_1 = (1/N) Σ_{i=1}^{N} X_i = X̄_N,
i.e. the conditional expected value of X*_1 is the sample mean of the original sample. We use E_X to denote the conditional expected value when we condition with respect to X. Similarly,
var_X(X*_1) = (1/N) Σ_{i=1}^{N} X_i² − X̄_N².
Also, E[E_X X*_1] = µ and E[var_X(X*_1)] = ((N − 1)/N)σ², where EX_1 = µ and var(X_1) = σ². According to the central limit theorem,
(2.10) sup_{−∞<x<∞} |P{ N^{−1/2} Σ_{i=1}^{N} (X_i − µ)/σ ≤ x } − Φ(x)| → 0 as N → ∞.
The bootstrap version of this result is
(2.11) sup_{−∞<x<∞} |P_X{ L^{−1/2} Σ_{i=1}^{L} (X*_i − X̄_N)/σ̄_N ≤ x } − P{ N^{−1/2} Σ_{i=1}^{N} (X_i − µ)/σ ≤ x }|
→ 0 a.s., as min(N, L) → ∞. The theoretical mean and variance in (2.10) are replaced with the conditional mean and variance of the bootstrapped observations. The proof is very simple if we also assume that E|X_1|³ < ∞. According to the Berry–Esseen theorem, there is an absolute constant c such that
(2.12) sup_{−∞<x<∞} |P_X{ L^{−1/2} Σ_{i=1}^{L} (X*_i − X̄_N)/σ̄_N ≤ x } − Φ(x)|
≤ (c/L^{1/2}) E_X|X*_1 − X̄_N|³ / σ̄³_N.
We have, by the definition of X*_1,
E_X|X*_1 − X̄_N|³ = (1/N) Σ_{i=1}^{N} |X_i − X̄_N|³.
Using the law of large numbers, we have that X̄_N → µ, σ̄_N → σ and (1/N) Σ_{i=1}^{N} |X_i − X̄_N|³ → E|X_1 − µ|³ almost surely, so these variables are bounded with probability 1. Hence (2.12) implies that
(2.13) sup_{−∞<x<∞} |P_X{ L^{−1/2} Σ_{i=1}^{L} (X*_i − X̄_N)/σ̄_N ≤ x } − Φ(x)|
→ 0 a.s., as L → ∞. Now we get (2.11) from (2.10) and (2.13). The proof of (2.11) is typical of how theoretical issues of the bootstrap are handled: one shows that the test statistic and its bootstrap version have the same limit distribution. DasGupta (2008) has a lengthy discussion of bootstrapping the mean. He provides theoretical evidence that the rate of convergence in (2.11) is better than in (2.10). This has been confirmed empirically in the literature. Please set up Monte Carlo simulations to provide numerical evidence that the rate of convergence is better in (2.11); just provide some graphs (a minimal sketch of such a simulation is given right after the definition of T*_{L,1} below). Bootstrapping the mean provides theoretical results, but it is not too useful in real life applications due to the enormous amount of existing results on the sample mean. We illustrate on the Kolmogorov–Smirnov statistic why the bootstrap works. We already obtained T_{N,1} in Problem 1. Now we obtain its bootstrap version from the sample X*_1, X*_2, . . . , X*_L. Their empirical distribution function is
F*_L(x) = (1/L) Σ_{i=1}^{L} I{X*_i ≤ x}
and now we can define T*_{L,1} as
T*_{L,1} = sup_{−∞<x<∞} L^{1/2} |F*_L(x) − F_N(x)|.
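The simulation mentioned above could look like the following sketch (mine; the exponential(1) data, the sample sizes and the repetition counts are arbitrary choices, and for a cleaner picture one would average the bootstrap error over many observed samples instead of one):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
grid = np.linspace(-3.0, 3.0, 121)

def ecdf_on_grid(values, grid):
    return np.searchsorted(np.sort(values), grid, side="right") / len(values)

def true_law(N, reps=20000):
    """Monte Carlo approximation of the law of sqrt(N)(mean - mu)/sigma for Exp(1) data."""
    samples = rng.exponential(size=(reps, N))
    t = np.sqrt(N) * (samples.mean(axis=1) - 1.0) / 1.0
    return ecdf_on_grid(t, grid)

def bootstrap_law(x, R=2000):
    """Conditional law of sqrt(L)(bootstrap mean - sample mean)/sigma_bar, with L = N."""
    N = len(x)
    sig = np.sqrt(np.mean(x**2) - x.mean()**2)
    idx = rng.integers(0, N, size=(R, N))
    t = np.sqrt(N) * (x[idx].mean(axis=1) - x.mean()) / sig
    return ecdf_on_grid(t, grid)

Ns = [10, 20, 40, 80]
err_normal, err_boot = [], []
for N in Ns:
    g = true_law(N)
    err_normal.append(np.max(np.abs(g - norm.cdf(grid))))      # distance in (2.10)
    x = rng.exponential(size=N)                                 # one observed sample
    err_boot.append(np.max(np.abs(bootstrap_law(x) - g)))       # distance in (2.11)

plt.plot(Ns, err_normal, "o-", label="normal approximation (2.10)")
plt.plot(Ns, err_boot, "s-", label="bootstrap approximation (2.11)")
plt.xlabel("N"); plt.ylabel("sup distance"); plt.legend(); plt.show()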
We provide some heuristic arguments proving
(2.14) sup_{−∞<x<∞} |P_X{T*_{L,1} ≤ x} − P{ sup_{0≤u≤1} |B(u)| ≤ x }|
→ 0 a.s., as min(N, L) → ∞, where B is a Brownian bridge. By the weak convergence of the empirical process,
L^{1/2}(F*_L(x) − F_N(x)) ≈ B(F_N(x)),
and by (2.9) and the almost sure continuity of B, B(F_N(x)) ≈ B(F(x)). Since F is continuous, sup_{−∞<x<∞} |B(F(x))| = sup_{0≤u≤1} |B(u)|. Hence we have (2.14). Please note that (2.14) holds regardless of whether H_0 or the alternative holds. Hence we have (2.4) and (2.5). In case of the Kolmogorov–Smirnov statistic the limit distribution has a known form. Hence you could investigate the question whether the bootstrap method provides a better approximation for the distribution of T_{N,1} than the limit distribution. There are several cases where the limit distribution of the test statistic depends on the underlying distribution of the data. In these cases the bootstrap might be the only method to get critical values to test our null hypothesis. Now we return to the test for exponentiality. From the bootstrap sample we estimate λ by
λ̂*_L = (1/L) Σ_{i=1}^{L} X*_i,
so the bootstrapped parameter estimated empirical process could be defined as L^{1/2}(F*_L(x) − F_0(x, λ̂*_L)). Following the proof in Section 1.2, one can show that sup_{0≤x<∞} L^{1/2}|F*_L(x) − F_0(x, λ̂*_L)| converges to the limit of T_{N,8}. So it works under the null. Now the bad news. Under the alternative
(2.15) sup_{0≤x<∞} L^{1/2} |F*_L(x) − F_0(x, λ̂*_L)| →_P ∞.
The proof of (2.15) is the same as what we did in Section 1.2. Hence with this method we will not be able to reject exponentiality even if it is false. The problem is that we should be using the distribution function of the bootstrap sample, which is F_N(x), and F_N does not contain any place for the parameter λ. We need to do something else, which will be done in the next subsection. Next we consider the two sample problem. If we select from the X sample with replacement and separately from the Y's with replacement, the procedure will not work. The distribution of the bootstrapped X sample will be around F, while the distribution of the bootstrapped Y sample will be close to H. This means that T_{N,M,1} and its bootstrapped version, T*_{N*,M*,1}, behave in exactly the same way. Hence
(2.16) P_{X,Y}{T*_{N*,M*,1} > K} → 1 a.s. for all K.
Now we combine the two samples into one, Z = (X_1, X_2, . . . , X_N, Y_1, Y_2, . . . , Y_M)'. We select from Z with replacement, resulting in Z* = (Z*_1, Z*_2, . . . , Z*_L). Due to the random selection with replacement, conditionally on Z, these are independent and identically distributed with
P_Z{Z*_1 ≤ x} = (1/(N + M)) Σ_{i=1}^{N+M} I{Z_i ≤ x} = (N/(N + M)) ( (1/N) Σ_{i=1}^{N} I{X_i ≤ x} ) + (M/(N + M)) ( (1/M) Σ_{i=1}^{M} I{Y_i ≤ x} ).
Let N* = ⌊LN/(N + M)⌋ and M* = L − N*, where ⌊·⌋ denotes the integer part. Now X* = (Z*_1, Z*_2, . . . , Z*_{N*}) and Y* = (Z*_{N*+1}, Z*_{N*+2}, . . . , Z*_L). Regardless of whether the original samples satisfy
the null or the alternative hypothesis, X* and Y* have the same distribution (conditionally on Z, the empirical distribution function of Z). Hence under the null as well as under the alternative
sup_{−∞<x<∞} |P_Z{T*_{N*,M*,1} ≤ x} − P{ sup_{0≤u≤1} |B(u)| ≤ x }|
→ 0 a.s., where T*_{N*,M*,1} is the bootstrap version of T_{N,M,1} and B is a Brownian bridge. Hence both (2.4) and (2.6) are satisfied.

2.2. Parametric bootstrap. Now we discuss how to get critical values for T_{N,8}. Let X*_1, X*_2, . . . , X*_L be independent, identically distributed random variables with distribution function F_0(x, λ̂_N), i.e. given X, these simulated random variables are independent and P_X{X*_i ≤ x} = F_0(x, λ̂_N). We need to estimate the parameter from the bootstrap sample as well; this estimator is denoted by λ̂*_L. Clearly, we need to use
λ̂*_L = (1/L) Σ_{i=1}^{L} X*_i.
Now the bootstrap version of the parameter estimated empirical process is L^{1/2}(F*_L(x) − F_0(x, λ̂*_L)). We note that, using integration by parts,
(2.17) λ̂_N = ∫_0^∞ (1 − F_0(u, λ̂_N)) du
and
(2.18) λ̂*_L = ∫_0^∞ (1 − F*_L(u)) du.
Using again the mean value theorem, for almost all realizations of X we have, uniformly in x,
L^{1/2}(F*_L(x) − F_0(x, λ̂*_L)) = L^{1/2}(F*_L(x) − F_0(x, λ̂_N) + F_0(x, λ̂_N) − F_0(x, λ̂*_L)) = L^{1/2}(F*_L(x) − F_0(x, λ̂_N)) + g_1(x, λ̂_N) ∫_0^∞ L^{1/2}(F*_L(u) − F_0(u, λ̂_N)) du + o_X(1),
where
g_1(u, λ) = ∂F_0(u, λ)/∂λ.
Conditionally on X, L^{1/2}(F*_L(x) − F_0(x, λ̂_N)) ≈ B(F_0(x, λ̂_N)). By the strong law of large numbers λ̂_N → µ a.s., where µ = EX_1, which is true under the null and the alternative. By the continuity of the Brownian bridge, B(F_0(x, λ̂_N)) ≈ B(F_0(x, µ)). Thus we get for almost all realizations of X that
L^{1/2}(F*_L(x) − F_0(x, λ̂*_L)) →_{D[0,∞)} B(F_0(x, µ)) + g_1(x, µ) ∫_0^∞ B(F_0(u, µ)) du.
If H_0 holds, then µ = λ_0, so T_{N,7} and sup_{0≤x<∞} L^{1/2}|F*_L(x) − F_0(x, λ̂*_L)| have the same limit distribution. Under the alternative
sup_{0≤x<∞} L^{1/2} |F*_L(x) − F_0(x, λ̂*_L)| = O_{P_X}(1).
Hence this method provides a correct resampling for T_{N,7}. The more general case can be handled in the same way.
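A minimal sketch of this parametric bootstrap for the exponentiality test (my own illustration; the use of the sample mean as λ̂ and the re-estimation from the bootstrap sample follow the text, while the toy data and the 5% level are assumptions):

import numpy as np

def exp_cdf(x, lam):
    """F_0(x, lambda): exponential distribution function with mean lambda."""
    return 1.0 - np.exp(-np.asarray(x) / lam)

def param_boot_stat(x, L, rng):
    """One bootstrap replication of sup_x L^{1/2} |F*_L(x) - F_0(x, lambda*_L)|."""
    lam_hat = x.mean()                             # lambda estimated from the original sample
    xs = rng.exponential(scale=lam_hat, size=L)    # bootstrap sample drawn from F_0(., lambda_hat)
    lam_star = xs.mean()                           # parameter re-estimated from the bootstrap sample
    xs_sorted = np.sort(xs)
    ecdf_upper = np.arange(1, L + 1) / L           # F*_L at the order statistics (right limits)
    ecdf_lower = np.arange(0, L) / L               # left limits, to catch the jumps
    f0 = exp_cdf(xs_sorted, lam_star)
    return np.sqrt(L) * max(np.max(np.abs(ecdf_upper - f0)), np.max(np.abs(ecdf_lower - f0)))

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=100)           # observed data
reps = np.array([param_boot_stat(x, L=len(x), rng=rng) for _ in range(999)])
crit = np.quantile(reps, 0.95)                     # bootstrap critical value
print(crit)

The observed statistic of the exponentiality test (sup_x N^{1/2}|F_N(x) − F_0(x, λ̂_N)|) is then compared with crit.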
2.3. Resampling without replacement (the permutation method). First we show that the permutations of the original sample can be used for the change point problem. Let Z*_1, Z*_2, . . . , Z*_N be a random permutation of Z = (Z_1, Z_2, . . . , Z_N). We note that the permuted variables (selection without replacement) are not independent, but the dependence is weak. It is easy to see that
P{Z*_i = Z_j} = 1/N, 1 ≤ i, j ≤ N,
and
P{Z*_i = Z_j, Z*_k = Z_ℓ} = 1/(N(N − 1)), 1 ≤ i, j, k, ℓ ≤ N, i ≠ k, j ≠ ℓ.
Hence
|P{Z*_i = Z_j, Z*_k = Z_ℓ} − P{Z*_i = Z_j} P{Z*_k = Z_ℓ}| = 1/(N²(N − 1)),
which is much smaller than P{Z*_i = Z_j}. Also, P_Z{Z*_i ≤ x} = F_N(x), 1 ≤ i ≤ N, with F_N(x) = (1/N) Σ_{i=1}^{N} I{Z_i ≤ x}. The permuted statistic is
T*_{N,11} = max_{1≤k≤N} (1/σ̄_N) N^{−1/2} |Σ_{i=1}^{k} Z*_i − (k/N) Σ_{i=1}^{N} Z*_i|.
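Before stating its limit, a small sketch of the permutation loop (my own illustration; σ̄_N is taken to be the uncentered form of the sample standard deviation, and the number of permutations is an arbitrary choice):

import numpy as np

def permutation_critical_value(z, alpha=0.05, R=999, seed=0):
    """Permutation critical value for the CUSUM statistic."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    n = z.size
    sigma_bar = np.sqrt(np.mean(z**2) - z.mean()**2)
    stats = np.empty(R)
    for r in range(R):
        zp = rng.permutation(z)                    # selection without replacement
        csum = np.cumsum(zp)
        cusum = csum - np.arange(1, n + 1) / n * csum[-1]
        stats[r] = np.max(np.abs(cusum)) / (sigma_bar * np.sqrt(n))
    return np.quantile(stats, 1.0 - alpha)

# reject if the CUSUM statistic of the original (unpermuted) data exceeds this value.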
Using the weak dependence between the permuted variables, one can show that
(2.19) sup_{−∞<x<∞} |P_Z{T*_{N,11} ≤ x} − P{ sup_{0≤u≤1} |B(u)| ≤ x }|
  • 371.
  • 372.
  • 373.
  • 374. PZ{T∗ N∗,M∗,1 ≤ x} − P{ sup 0≤u≤1 |B(u)| ≤ x}
  • 375.
  • 376.
  • 377.
where T*_{N*,M*,1} is the bootstrap version of T_{N,M,1} and B is a Brownian bridge. Hence both (2.4) and (2.6) are satisfied. So far we only had to assume that min(N, L) → ∞. Of course, choosing a much larger L would result in lots of ties (we try to imitate continuous distributions, which do not have ties). Since we apply limit theorems, L cannot be small. Usually, L = N is used. However, in case of extremes it might not work.

2.4. Bootstrapping the largest observation. DasGupta (2008) contains an example where the bootstrap does not work when we sample with replacement. The original sample size is N and we generate a bootstrap sample X*_1, X*_2, . . . , X*_N. Let X_{N,N} = max(X_1, X_2, . . . , X_N) be the maximum of the original sample and X*_{N,N} = max(X*_1, X*_2, . . . , X*_N). The bootstrap works in this case if X_{N,N} = X*_{N,N}, but the probability that X*_{N,N} < X_{N,N} is (1 − 1/N)^N → 1/e, N → ∞. This is clear, since this event occurs exactly when every selection avoids X_{N,N}, i.e. each draw comes from the other N − 1 possibilities. This example suggests that if we increase the bootstrap sample size, we might hit X_{N,N}. However, this is not the case. Let us assume that X_1, X_2, . . . , X_N are independent and identically distributed exponential(1) random variables, i.e.
F(t) = 0, if t < 0, and F(t) = 1 − e^{−t}, if t ≥ 0.
Let X_{N,N} be the largest order statistic and Y_N = X_{N,N} − log N. As we did before,
P{Y_N ≤ t} = P{X_{N,N} ≤ t + log N} = F^N(t + log N)
and
F^N(t + log N) = 0, if t < −log N, and F^N(t + log N) = (1 − e^{−t}/N)^N, if t ≥ −log N.
Thus we get that for all −∞ < t < ∞
lim_{N→∞} P{Y_N ≤ t} = H(t), where H(t) = e^{−e^{−t}}, −∞ < t < ∞.
We can do the same for the bootstrap sample. Let Y*_L = max(X*_1, X*_2, . . . , X*_L) − log L. As before, P_X{Y*_L ≤ t} = F_N^L(t + log L), where, as before, F_N(t) is the empirical distribution function of X = (X_1, X_2, . . . , X_N). According to the law of the iterated logarithm for the empirical process we have
(2.20) lim sup_{N→∞} (2N/log log N)^{1/2} sup_{−∞<t<∞} |F_N(t) − F(t)| = 1 a.s.
Next we write
F_N^L(t + log L) = (F(t + log L) + F_N(t + log L) − F(t + log L))^L.
If t is fixed and N is so large that t + log L > 0, then
F(t + log L) + F_N(t + log L) − F(t + log L) = 1 − e^{−t}/L + F_N(t + log L) − F(t + log L) = 1 − (1/L) e^{−t} (1 + L(F_N(t + log L) − F(t + log L))).
If we want to use the formula (1 − x_n/n)^n → e^{−x}, if x_n → x, with x_{N,L} = e^{−t}(1 + L|F_N(t + log L) − F(t + log L)|), we need that L|F_N(t + log L) − F(t + log L)| → 0 a.s. In light of (2.20) this is satisfied if
(2.21) L (log log N / N)^{1/2} → 0.
According to (2.21), the bootstrap with replacement works if L is not large; essentially L must be less than N^{1/2}. So more is not better in this case. Please note that if N = 100, then we should use a bootstrap sample size less than 10. Doing asymptotic theory with 10 observations is somewhat questionable. The rate of convergence to extreme value limits can be very slow, so the bootstrap might not be better than using the limit results for the original sample. In any case, if (2.21) holds, then
lim_{N→∞} P_X{Y*_L ≤ t} = H(t) a.s.
So far we have bootstrapped the observations directly. Now we consider the case when the observations are not identically distributed, so selection with or without replacement will not work. All the bootstrap methods we discussed so far produced identically distributed random variables.

2.5. Residual bootstrap. We illustrate this method on linear models. We assume that
(2.22) y_i = x_i'β_0 + ε_i, 1 ≤ i ≤ N,
where x_i = (x_{i,1}, x_{i,2}, . . . , x_{i,d})' ∈ R^d and β_0 ∈ R^d. As usual, y_1, y_2, . . . , y_N and x_1, x_2, . . . , x_N are observed. We note that in statistics the x_i's are given numbers, while in econometrics they are modeled as random variables. The errors ε_1, ε_2, . . . , ε_N are unobservable random errors. We assume that ε_1, ε_2, . . . , ε_N are independent and identically distributed random variables with
(2.23) Eε_i = 0 and 0 < Eε_i² = σ² < ∞.
The parameters of interest are β_0 and σ². We estimate β_0 using the least squares estimator β̂_N. The residuals are defined by ε̂_i = y_i − x_i'β̂_N, 1 ≤ i ≤ N. We collect some facts from linear models. First we write (2.22) in matrix form. Let Y_N = (y_1, y_2, . . . , y_N)', E_N = (ε_1, ε_2, . . . , ε_N)', β = (β_1, β_2, . . . , β_d)' and let X_N be the N × d matrix whose ith row is x_i' = (x_{i,1}, x_{i,2}, . . . , x_{i,d}). The matrix form of (2.22) is Y_N = X_Nβ_0 + E_N. The least squares estimator β̂_N is defined by the minimization problem
β̂_N = argmin_β ||Y_N − X_Nβ||²,
where ||·|| is the Euclidean norm (the square root of the sum of the squares of the coordinates). We obtain the solution in explicit form:
β̂_N = (X_N'X_N)^{−1} X_N'Y_N,
assuming that X_N'X_N is nonsingular. Usually, the properties of β̂_N are established assuming the normality of the errors. If normality of the errors is assumed, then one possibility is the parametric bootstrap. However, the bootstrap in this case is not very useful, since the normality of β̂_N is proven, so only the estimation of σ is needed for statistical inference. If the errors are not necessarily normal, then β̂_N is still asymptotically normal. Namely,
(2.24) N^{1/2}(β̂_N − β_0) →_D N_d(0, σ²A^{−1}),
where
(2.25) lim_{N→∞} (1/N) X_N'X_N = A,
β_0 is the true value of the parameter and N_d denotes a d–dimensional normal random variable. It is easy to interpret the condition in (2.25):
(2.26) lim_{N→∞} (1/N) Σ_{ℓ=1}^{N} x_{ℓ,i} x_{ℓ,j} = a_{i,j}, A = {a_{i,j}, 1 ≤ i, j ≤ d}.
If the x_i's are modeled as random variables, then (2.26) is just the law of large numbers. Hence (2.25) is a very natural assumption in linear models. If we go back to the definition of the residuals, then we have
ε̂_i = ε_i − x_i'(β̂_N − β_0).
Let z_i = (x_i', ε̂_i), Z = {z_i, 1 ≤ i ≤ N}. We choose L times, independently of each other and with replacement, from {z_i, 1 ≤ i ≤ N}, resulting in {z*_i, 1 ≤ i ≤ L}. Now we define
(2.27) y*_i = (x*_i)'β̂_N + ε̂*_i, 1 ≤ i ≤ L.
Using the same notation as before but adding a *, we write
(2.28) Y* = X*β̂_N + E*
and therefore
β̂*_N = ((X*)'X*)^{−1}(X*)'Y*.
Using (2.27) we get that
β̂*_N = β̂_N + ((X*)'X*)^{−1}(X*)'E*.
Using again the law of large numbers,
lim_{L→∞} (1/L)(X*)'X* = A a.s.
We note that
E_Z[ ((X*)'X*)^{−1}(X*)'E* ( ((X*)'X*)^{−1}(X*)'E* )' ] = ((X*)'X*)^{−1}(X*)' E_Z[E*(E*)'] X* ((X*)'X*)^{−1} ≈ L^{−2} A^{−1}(X*)' E_Z[E*(E*)'] X* A^{−1}.
Conditionally on Z, the ε̂*'s are independent and identically distributed and therefore
E_Z[E*(E*)'] = E_Z(ε̂*_1)² I_{L×L},
where I_{L×L} is the L × L identity matrix. Thus we have
L^{−2} A^{−1}(X*)' E_Z[E*(E*)'] X* A^{−1} ≈ L^{−2} E_Z(ε̂*_1)² A^{−1}(X*)' I_{L×L} X* A^{−1} ≈ L^{−1} E_Z(ε̂*_1)² A^{−1}.
It is easy to see that E_Z(ε̂*_1)² → σ² a.s. Thus we conjecture that for almost all realizations of Z
(2.29) L^{1/2}(β̂*_N − β̂_N) →_D N_d(0, σ²A^{−1}).
The proofs of (2.24) and (2.29) are essentially the same. We showed that
E_Z β̂*_N = β̂_N and E_Z[(β̂*_N − β̂_N)(β̂*_N − β̂_N)'] = (1/L) E_Z(ε̂*_1)² A^{−1} + o_Z(1/L) = (σ²/L) A^{−1} + o_Z(1/L),
hence the first order (mean) and the second order (variance) properties of β̂*_N and β̂_N are practically the same. Of course, this is not a proof of the normality of the estimators, but these are necessary results for normality. Since β̂_N always converges to the true value of the parameter of the linear model, the limit theorem (2.29) can be used to justify this bootstrap method for hypothesis testing. We only need that min(N, L) → ∞. The resampling of residuals is also a popular technique in time series analysis. For the sake of simplicity we consider an autoregressive AR(1) sequence. We assume that {ε_i, −∞ < i < ∞} are independent and identically distributed random variables. The AR(1) sequence is the solution of the recursion
(2.30) y_i = ρ y_{i−1} + ε_i, −∞ < i < ∞.
If |ρ| < 1 and E|ε_0|^δ < ∞ with some δ > 0, then (2.30) has a unique solution given by
(2.31) y_i = Σ_{ℓ=0}^{∞} ρ^ℓ ε_{i−ℓ}, −∞ < i < ∞.
We can estimate ρ with ρ̂_N, the least squares estimator. It is established in time series analysis that N^{1/2}(ρ̂_N − ρ) is asymptotically normal. Observing y_1, y_2, . . . , y_N, we define the residuals as ε̂_i = y_i − ρ̂_N y_{i−1}, 2 ≤ i ≤ N. We select from {ε̂_2, ε̂_3, . . . , ε̂_N} with replacement, creating {ε̂*_1, ε̂*_2, . . . , ε̂*_L}. If the statistical inference is about the ε_i's, we are done. If our interest is in y_1, y_2, . . . , y_N, then we define the bootstrap sample by
(2.32) y*_i = ρ̂_N y*_{i−1} + ε̂*_i, 2 ≤ i ≤ L,
with some initial value y*_0. However, if i is small, the solution of (2.32) is certainly not close to the solution of (2.30) given by the infinite sum in (2.31). Hence we do not use all the y*_i's, only those with i ≥ L_0. This way we get a bootstrap sample of size L − L_0. L_0 is the burn–in period and the practical advice is to take L_0 = 25 or 50. Next we discuss how the bootstrap can help to construct confidence bands.
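Before moving on, a minimal sketch (mine) of the AR(1) residual bootstrap just described: the least squares estimator of ρ and the burn–in idea follow the text; centering the residuals and running the recursion for L + L_0 steps before dropping the first L_0 are my own (common, but assumed) choices.

import numpy as np

def ar1_residual_bootstrap(y, L, L0=50, seed=0):
    """One bootstrap series from an AR(1) fit: estimate rho, resample centered residuals,
    rebuild the recursion and drop the burn-in period."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    rho_hat = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)   # least squares estimator of rho
    resid = y[1:] - rho_hat * y[:-1]
    resid = resid - resid.mean()                             # centering (assumed, not in the text)
    eps_star = rng.choice(resid, size=L + L0, replace=True)
    y_star = np.empty(L + L0)
    prev = 0.0                                               # arbitrary initial value y*_0
    for i in range(L + L0):
        prev = rho_hat * prev + eps_star[i]
        y_star[i] = prev
    return y_star[L0:]                                       # burn-in removed

# toy usage
rng = np.random.default_rng(3)
eps = rng.normal(size=300)
y = np.empty(300); y[0] = eps[0]
for i in range(1, 300):
    y[i] = 0.6 * y[i - 1] + eps[i]
print(len(ar1_residual_bootstrap(y, L=300)))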
2.6. Confidence bands. We recall the TTT function z(x) from Section 1.5. We want to define two random functions, z_{N,1}(x) and z_{N,2}(x), such that
lim_{N→∞} P{z_{N,1}(x) ≤ z(x) ≤ z_{N,2}(x) for all x ∈ [0, a]} = 1 − α.
We choose
z_{N,1}(x) = ẑ_N(x) − cN^{−1/2} and z_{N,2}(x) = ẑ_N(x) + cN^{−1/2}.
According to our theory,
lim_{N→∞} P{z_{N,1}(x) ≤ z(x) ≤ z_{N,2}(x) for all x ∈ [0, a]} = P{ sup_{0≤x≤a} |Γ(x)| ≤ c },
so c = c(1 − α) comes from the equation
P{ sup_{0≤x≤a} |Γ(x)| ≤ c } = 1 − α.
However, the computation of the distribution function P{sup_{0≤x≤a} |Γ(x)| ≤ c} is hopeless, since it depends on the unknown F. Let X*_1, X*_2, . . . , X*_L be the bootstrap sample and define the bootstrap version of ẑ_N(x) by
ẑ*_L(x) = (1/(1 − F*_L(x))) ∫_0^x (1 − F*_L(u)) du,
where, as before,
F*_L(u) = (1/L) Σ_{i=1}^{L} I{X*_i ≤ u}.
Using our previous arguments, one can show that
P_X{ sup_{0≤x≤a} L^{1/2} |ẑ*_L(x) − ẑ_N(x)| ≤ c } → P{ sup_{0≤x≤a} |Γ(x)| ≤ c } a.s.,
hence the bootstrap can be used to estimate c(1 − α) from the sample. We obtain ĉ_{N,L,R} = ĉ_{N,L,R}(1 − α) as our estimate, where N is the original sample size, L is the bootstrap sample size and R is the number of repetitions of the bootstrap procedure. We get that if ẑ_{N,1}(x) = ẑ_N(x) − ĉ_{N,L,R}N^{−1/2} and ẑ_{N,2}(x) = ẑ_N(x) + ĉ_{N,L,R}N^{−1/2}, then
lim_{min(N,L,R)→∞} P{ẑ_{N,1}(x) ≤ z(x) ≤ ẑ_{N,2}(x) for all x ∈ [0, a]} = 1 − α.
The construction of confidence bands for a regression line is also a popular question. Let a(x) = β_0 + β_1x, b ≤ x ≤ d. We observe the line at N points, giving the observations (y_i, x_i), 1 ≤ i ≤ N,
y_i = a(x_i) + ε_i = β_0 + β_1x_i + ε_i.
We wish to define a_{N,1}(x) and a_{N,2}(x) from the sample such that
lim_{N→∞} P{a_{N,1}(x) ≤ a(x) ≤ a_{N,2}(x) for all x ∈ [b, d]} = 1 − α.
We try
a_{N,1}(x) = β̂_{0,N} + β̂_{1,N}x − cN^{−1/2} and a_{N,2}(x) = β̂_{0,N} + β̂_{1,N}x + cN^{−1/2},
where β̂_{0,N} and β̂_{1,N} are the least squares estimators of the parameters. Then
P{a_{N,1}(x) ≤ a(x) ≤ a_{N,2}(x) for all x ∈ [b, d]} = P{ sup_{b≤x≤d} |N^{1/2}(β_0 − β̂_{0,N}) + N^{1/2}(β_1 − β̂_{1,N})x| ≤ c }.
We already know that
(N^{1/2}(β_0 − β̂_{0,N}), N^{1/2}(β_1 − β̂_{1,N})) →_D N_2,
where N_2 = (N_1, N_2) is a bivariate normal random vector. Hence Γ_0(x) = N_1 + N_2x, b ≤ x ≤ d, is a Gaussian process and we need to find c = c(1 − α) such that
(2.33) P{ sup_{b≤x≤d} |Γ_0(x)| ≤ c } = 1 − α.
We have at least two possibilities to get c in (2.33). The mean of N_2 is 0 and its covariance matrix is known explicitly and is easy to estimate from the sample. Hence we can easily estimate the distribution of N_2 and we could use Monte Carlo simulations. The other possibility is to use the bootstrap, as was done for z(x). To reflect the variability of the data in the bands, one might also try using the limit distribution of
sup_{b≤x≤d} |N^{1/2}(β_0 − β̂_{0,N}) + N^{1/2}(β_1 − β̂_{1,N})x| / (var(N^{1/2}(β_0 − β̂_{0,N}) + N^{1/2}(β_1 − β̂_{1,N})x))^{1/2}.
There was a restriction on the choice of L, the bootstrap sample size, in the case of extreme values. It turns out that there is no restriction on L if we bootstrap a statistic with a normal limit (or a limit derived from normal random variables and/or processes).
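For instance, the Monte Carlo route to c(1 − α) in (2.33) could look like the following sketch (mine; it assumes that an estimate Σ̂ of the covariance matrix of (N_1, N_2), e.g. σ̂²N(X_N'X_N)^{−1} from the regression, is available):

import numpy as np

def regression_band_constant(sigma_hat, b, d, alpha=0.05, reps=100000, seed=0):
    """Monte Carlo approximation of c(1-alpha) in (2.33): simulate (N1, N2) ~ N_2(0, Sigma_hat)
    and take the (1-alpha) quantile of sup_{b<=x<=d} |N1 + N2 x|."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(mean=[0.0, 0.0], cov=sigma_hat, size=reps)
    # |N1 + N2 x| is convex in x, so its maximum over [b, d] is attained at an endpoint
    sup_vals = np.maximum(np.abs(draws[:, 0] + draws[:, 1] * b),
                          np.abs(draws[:, 0] + draws[:, 1] * d))
    return np.quantile(sup_vals, 1.0 - alpha)

# example with an assumed estimated covariance matrix of (N1, N2)
Sigma_hat = np.array([[1.0, -0.3], [-0.3, 0.5]])
print(regression_band_constant(Sigma_hat, b=0.0, d=1.0))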
3. Density estimation

So far we have discussed the estimation of the distribution function and the theory related to it. These are fundamental results and the distribution of several statistics can be derived from this theory. In some cases, looking at the data, we want to guess the distribution of the underlying observations. This is practically impossible to do from the distribution function or from its estimate, since all distribution functions look much the same. However, the shape of a density is relatively distinctive, and everybody can see the difference between the shapes of exponential and normal densities. So the estimation of densities could provide important tools for statistical analysis. However, density estimates are rarely used in hypothesis testing, for example, and their rate of convergence to the limit can be extremely slow. A large part of the statistical literature shows efforts to avoid the estimation of densities. The main problem is that the density, as a derivative, is a limit, and in real life we have trouble estimating limits. We discuss several versions of density estimation.

3.1. Kernel density estimator. Let X_1, X_2, . . . , X_N be independent and identically distributed random variables with distribution function F. The density function f is defined by f(t) = F'(t). The kernel density estimator was introduced by Rosenblatt (1956) and Parzen (1962) and is defined by
(3.1) f̂_N(t) = (1/(Nh_N)) Σ_{i=1}^{N} K((t − X_i)/h_N),
where h_N is the bandwidth (analogous to the length of the bins in a histogram) and K(·) is the kernel. One natural requirement is that f̂_N(t) is a density function for each N. This requirement is satisfied if
(3.2) K is a density function.
A function is a density function if it is nonnegative and its integral over the line is 1. It is clear from (3.2) that f̂_N(t) ≥ 0 and
∫_{−∞}^{∞} f̂_N(t) dt = (1/(Nh_N)) Σ_{i=1}^{N} ∫_{−∞}^{∞} K((t − X_i)/h_N) dt,
and by the change of variable u = (t − X_i)/h_N,
∫_{−∞}^{∞} K((t − X_i)/h_N) dt = h_N ∫_{−∞}^{∞} K(u) du = h_N.
Hence ∫_{−∞}^{∞} f̂_N(t) dt = 1. Assumption (3.2) is attractive but it will limit how small the bias of f̂_N(t) can be. We will have two assumptions on the window (bandwidth) h_N:
(3.3) h_N → 0
and
(3.4) Nh_N → ∞, as N → ∞.
Assumptions (3.3) and (3.4) will require a careful balancing act. According to (3.3), h_N should be small, but according to (3.4), h_N cannot be too small. These conflicting requirements will lead us to the optimal choice of h_N. We need (3.3) to get an asymptotically unbiased estimator,
i.e. E f̂_N(t) → f(t). The assumption in (3.4) will imply that var(f̂_N(t)) → 0. First we consider the behaviour of f̂_N(t) at a fixed point t. Let Λ be a neighbourhood of t. It is easy to see that
E f̂_N(t) = (1/h) E K((t − X_1)/h),
since the observations are identically distributed. By definition,
(1/h) E K((t − X_1)/h) = (1/h) ∫_{−∞}^{∞} K((t − u)/h) f(u) du = ∫_{−∞}^{∞} K(x) f(t − xh) dx.
Next we show that
(3.5) ∫_{−∞}^{∞} K(x) f(t − xh) dx → f(t).
We assume
(3.6) sup_{−∞<u<∞} K(u) < ∞,
(3.7) sup_{−∞<x<∞} f(x) < ∞,
i.e. K and f are bounded functions. Also,
(3.8) f(u) is continuous if u ∈ Λ.
Using (3.2), (3.3) and (3.6)–(3.8) we show that (3.5) holds. Let ε > 0. We choose A so large that
∫_{−∞}^{−A} K(x) f(t − xh) dx ≤ sup_{−∞<u<∞} f(u) ∫_{−∞}^{−A} K(x) dx ≤ ε
and
∫_{A}^{∞} K(x) f(t − xh) dx ≤ sup_{−∞<u<∞} f(u) ∫_{A}^{∞} K(x) dx ≤ ε.
Using the continuity assumed in (3.8) together with (3.3), there is an integer N_0 such that
sup_{−A≤x≤A} |f(t − xh) − f(t)| ≤ ε, if N ≥ N_0.
Hence for N ≥ N_0 we have
|∫_{−A}^{A} K(x)(f(t − xh) − f(t)) dx| ≤ sup_{−A≤u≤A} |f(t − uh) − f(t)| ∫_{−A}^{A} K(x) dx ≤ sup_{−A≤u≤A} |f(t − uh) − f(t)| ∫_{−∞}^{∞} K(x) dx ≤ ε.
Also, by the choice of A we get
( ∫_{−∞}^{−A} K(x) dx + ∫_{A}^{∞} K(x) dx ) f(t) ≤ 2ε,
completing the proof of (3.5). Hence
(3.9) E f̂_N(t) → f(t),
i.e. the estimator is asymptotically unbiased. In applications the rate of convergence will be important. We need to strengthen our assumptions on the smoothness of K and f:
(3.10) ∫_{−∞}^{∞} x² K(x) dx < ∞,
(3.11) sup_{−∞<x<∞} |f'(x)| < ∞ and sup_{−∞<x<∞} |f''(x)| < ∞,
(3.12) f''(u) is continuous, if u ∈ Λ.
Using a two term Taylor expansion we obtain that
∫_{−∞}^{∞} K(x)(f(t − xh) − f(t)) dx = −h f'(t) ∫_{−∞}^{∞} x K(x) dx + (1/2) ∫_{−∞}^{∞} K(x)(xh)² f''(ξ(x)) dx,
where ξ(x) satisfies |ξ(x) − t| ≤ |x|h. Using now (3.10)–(3.12) and repeating our previous arguments we can show that
∫_{−∞}^{∞} K(x) x² f''(ξ(x)) dx → f''(t) ∫_{−∞}^{∞} x² K(x) dx.
Thus we conclude
E f̂_N(t) = f(t) − h f'(t) ∫_{−∞}^{∞} x K(x) dx + (h²/2) f''(t) ∫_{−∞}^{∞} x² K(x) dx + o(h²).
Since we want to have a small bias, from now on we assume that
(3.13) ∫_{−∞}^{∞} x K(x) dx = 0.
If K is symmetric around 0, assumption (3.13) holds. Under (3.13),
E f̂_N(t) = f(t) + (h²/2) f''(t) ∫_{−∞}^{∞} x² K(x) dx + o(h²).
Now we turn to the computation of the variance. Since the observations are independent and identically distributed, we get that
var(f̂_N(t)) = (1/(N²h²)) Σ_{i=1}^{N} var(K((t − X_i)/h)) = (1/(Nh²)) var(K((t − X_1)/h))
and
(1/h) var(K((t − X_1)/h)) = (1/h) E[K((t − X_1)/h)]² − (1/h) [E K((t − X_1)/h)]².
We already showed that
(3.14) E K((t − X_1)/h) = O(h).
Repeating our previous calculations we conclude
(1/h) E[K((t − X_1)/h)]² = (1/h) ∫_{−∞}^{∞} K²((t − x)/h) f(x) dx = ∫_{−∞}^{∞} K²(u) f(t − uh) du
= f(t) ∫_{−∞}^{∞} K²(u) du + o(1).
Summarizing our calculations we have that
var(f̂_N(t)) = (1/(Nh)) ( f(t) ∫_{−∞}^{∞} K²(u) du + o(1) )
and therefore var(f̂_N(t)) → 0 if and only if (3.4) holds. Since f̂_N(t) is biased, the mean square error is used to evaluate its performance. By definition,
MSE(f̂_N(t)) = var(f̂_N(t)) + (E f̂_N(t) − f(t))² = (1/(Nh)) f(t) ∫_{−∞}^{∞} K²(u) du + (h⁴/4)(f''(t))² ( ∫_{−∞}^{∞} u² K(u) du )² + o(1/(Nh)) + o(h⁴),
if (3.13) holds. Now it is easy to find the h which gives the smallest value of MSE(f̂_N(t)), at least asymptotically:
h_opt = c_0 N^{−1/5}, where c_0 = (c_1/c_2)^{1/5}, c_1 = f(t) ∫_{−∞}^{∞} K²(u) du and c_2 = (f''(t))² ( ∫_{−∞}^{∞} u² K(u) du )².
The result on the optimal h looks nice but it is not too useful. It depends on t, but in our definition of the kernel density estimator the window depends only on the sample size. Also, since f is unknown, we cannot compute c_0. However, we have the interesting observation that the optimal h_N is proportional to N^{−1/5}, so it will be crucial for any theory to cover this case. We wish to use an estimator which minimizes the mean square error, i.e. we choose h and K where
min_K min_h E(f̂_N(t) − f(t))²
is attained. We already found h_opt, and plugging this value into the formula for the MSE, we need to minimize the resulting expression with respect to K. This is hard, but the value of the MSE does not change too much with the kernel. Hence the crucial question is the choice of h. There are some kernels which are often used in practice:
K(u) = (1/(2c)) I{−c ≤ u ≤ c} (uniform density),
K(u) = (1/(2π)^{1/2}) e^{−u²/2} (normal density),
K(u) = (1 − |u|) I{|u| ≤ 1} (triangular or Bartlett kernel),
K(u) = (3/4)(1 − u²) I{|u| ≤ 1} (Epanechnikov kernel),
K(u) = (3/(20√5))(5 − u²) I{−√5 ≤ u ≤ √5} (rescaled Epanechnikov kernel) and
K(u) = (1/(2π)) (sin(u/2)/(u/2))², −∞ < u < ∞, K(0) = 1/(2π) (Fejér kernel).
All kernels have finite support except the normal and the Fejér kernels. The kernel density estimates based on the normal and the Fejér kernels are infinitely many times differentiable. The others might provide non–differentiable or only a few times differentiable density estimates. The Epanechnikov kernel minimizes the mean square error. The Fejér kernel comes from the theory of Fourier analysis. In practice,
there is little difference between estimators using different kernels. Next we consider the limit distribution of f̂_N(t). We show that (Nh)^{1/2}(f̂_N(t) − f(t)) is asymptotically normal for each t. We decompose the difference between the estimated and the true density as
f̂_N(t) − f(t) = [f̂_N(t) − E f̂_N(t)] + [E f̂_N(t) − f(t)],
the random error and the deterministic bias. The bias term will not play any role in the limit if (Nh)^{1/2} h² → 0, i.e.
(3.15) hN^{1/5} → 0.
This means that using the optimal window the asymptotic mean of (Nh)^{1/2}(f̂_N(t) − f(t)) will not be 0. This is natural, since the optimal window gives the same order for the square of the bias and for the variance of f̂_N(t). Since we already know the exact behaviour of the bias term, we consider the normality of
f̂_N(t) − E f̂_N(t) = (1/(Nh)) Σ_{i=1}^{N} ( K((t − X_i)/h) − E K((t − X_i)/h) ).
We introduce
η_{i,N} = (1/h^{1/2}) ( K((t − X_i)/h) − E K((t − X_i)/h) ),
which are independent and identically distributed random variables. Also, by (3.6) these are bounded random variables, but the bound depends on N. Regardless, we can use Liapounov's condition (cf. DasGupta, 2008, p. 64) to establish normality. Now
var(η_{i,N}) = (1/h) ( E K²((t − X_i)/h) − [E K((t − X_i)/h)]² ).
We showed that E K((t − X_i)/h) = O(h). Repeating our previous arguments we get that
(1/h) E K²((t − X_i)/h) = (1/h) ∫_{−∞}^{∞} K²((t − x)/h) f(x) dx = ∫_{−∞}^{∞} K²(u) f(t − uh) du = f(t) ∫_{−∞}^{∞} K²(u) du + o(1).
Now we compute Eη⁴_{i,N} (essentially we only need an upper bound). The only reason why we compute the 4th moment is that in this case we can get the exact asymptotics, and the method can be used in other cases as well. Namely,
Eη⁴_{i,N} = (1/h²) ( E K⁴((t − X_i)/h) − 4 E K³((t − X_i)/h) E K((t − X_i)/h) + 6 E K²((t − X_i)/h) [E K((t − X_i)/h)]² − 4 E K((t − X_i)/h) [E K((t − X_i)/h)]³ + [E K((t − X_i)/h)]⁴ ).
For ℓ = 1, 2, 3 and 4 we have
E K^ℓ((t − X_i)/h) = ∫_{−∞}^{∞} K^ℓ((t − x)/h) f(x) dx = h ∫_{−∞}^{∞} K^ℓ(u) f(t − uh) du = h f(t) ∫_{−∞}^{∞} K^ℓ(u) du + o(h).
Thus we get
Eη⁴_{i,N} = (1/h) f(t) ∫_{−∞}^{∞} K⁴(u) du + o(1/h).
According to the Liapounov condition, we need to show that
(3.16) ( Σ_{i=1}^{N} Eη⁴_{i,N} )^{1/4} / ( Σ_{i=1}^{N} Eη²_{i,N} )^{1/2} → 0.
We showed that
( Σ_{i=1}^{N} Eη²_{i,N} )^{1/2} ≈ N^{1/2} and ( Σ_{i=1}^{N} Eη⁴_{i,N} )^{1/4} ≈ N^{1/4} h^{−1/4},
and therefore (3.16) holds if Nh → ∞. Thus we get the asymptotic normality of (Nh)^{1/2}(f̂_N(t) − E f̂_N(t)):
(3.17) (Nh)^{1/2}(f̂_N(t) − E f̂_N(t)) →_D N( 0, f(t) ∫_{−∞}^{∞} K²(u) du ),
where N denotes a normal random variable. The normalization in the central limit theorem in (3.17) shows that the rate of convergence is always slower than, for example, the convergence of the empirical distribution function to the theoretical one. Also, we need to choose the kernel K and the window h. We also see that the optimal window h ≈ N^{−1/5} will give a central limit theorem for (Nh)^{1/2}(f̂_N(t) − f(t)), but the expected value of the limiting normal will not be 0. We are interested in f(t) as a function of t, so we would like to estimate it at several points simultaneously. First we consider the correlation between (Nh)^{1/2}(f̂_N(t) − E f̂_N(t)) and (Nh)^{1/2}(f̂_N(s) − E f̂_N(s)). Using independence we get that
E[ (Nh)^{1/2}(f̂_N(t) − E f̂_N(t)) (Nh)^{1/2}(f̂_N(s) − E f̂_N(s)) ]
= (1/(Nh)) Σ_{i=1}^{N} Σ_{j=1}^{N} E[ ( K((t − X_i)/h) − E K((t − X_i)/h) )( K((s − X_j)/h) − E K((s − X_j)/h) ) ]
= (1/(Nh)) Σ_{i=1}^{N} E[ ( K((t − X_i)/h) − E K((t − X_i)/h) )( K((s − X_i)/h) − E K((s − X_i)/h) ) ]
= (1/h) E[ ( K((t − X_1)/h) − E K((t − X_1)/h) )( K((s − X_1)/h) − E K((s − X_1)/h) ) ].
Now
E[ ( K((t − X_1)/h) − E K((t − X_1)/h) )( K((s − X_1)/h) − E K((s − X_1)/h) ) ] = E[ K((t − X_1)/h) K((s − X_1)/h) ] − E K((t − X_1)/h) E K((s − X_1)/h) = E[ K((t − X_1)/h) K((s − X_1)/h) ] + O(h²)
on account of (3.14). Following our earlier calculations we get that
E[ K((t − X_1)/h) K((s − X_1)/h) ] = ∫_{−∞}^{∞} K((t − x)/h) K((s − x)/h) f(x) dx = h ∫_{−∞}^{∞} K(u) K(u + (t − s)/h) f(t − uh) du.
Since K is integrable on the line, if t ≠ s, then for all u
(3.18) K(u + (t − s)/h) → 0,
since |t − s|/h → ∞. On account of (3.6) and (3.7) we have that
0 ≤ K(u) K(u + (t − s)/h) f(t − uh) ≤ sup_x K(x) sup_x f(x) K(u),
K is integrable on R, and by (3.18) K(u) K(u + (t − s)/h) f(t − uh) → 0 for all u ∈ R, so by the Lebesgue dominated convergence theorem we have
(3.19) ∫_{−∞}^{∞} K(u) K(u + (t − s)/h) f(t − uh) du → 0, if t ≠ s.
Thus we proved that (Nh)^{1/2}(f̂_N(t) − E f̂_N(t)) and (Nh)^{1/2}(f̂_N(s) − E f̂_N(s)) are asymptotically uncorrelated if t ≠ s. Now we try to establish the multivariate central limit theorem for density estimates. Let t_1 < t_2 < . . . < t_R and assume our conditions on the density are satisfied at these points. We show that
(3.20) ( (Nh)^{1/2}(f̂_N(t_1) − E f̂_N(t_1)), (Nh)^{1/2}(f̂_N(t_2) − E f̂_N(t_2)), . . . , (Nh)^{1/2}(f̂_N(t_R) − E f̂_N(t_R)) ) →_D N_R(0, Σ),
where N_R is an R dimensional normal random vector, Σ is a diagonal matrix and
diag(Σ) = ( f(t_1) ∫_{−∞}^{∞} K²(u) du, f(t_2) ∫_{−∞}^{∞} K²(u) du, . . . , f(t_R) ∫_{−∞}^{∞} K²(u) du ).
There is a standard method to prove the asymptotic normality of a random vector: the Cramér–Wold theorem (DasGupta, 2008, p. 9). According to this theorem we need to show that all linear combinations are asymptotically normal, i.e. for all λ_1, λ_2, . . . , λ_R,
(3.21) Σ_{ℓ=1}^{R} λ_ℓ (Nh)^{1/2}(f̂_N(t_ℓ) − E f̂_N(t_ℓ)) →_D N( 0, Σ_{ℓ=1}^{R} λ²_ℓ f(t_ℓ) ∫_{−∞}^{∞} K²(u) du ).
The proof of (3.21) is also based on the Liapounov theorem. Using the just proven asymptotic uncorrelatedness, we obtain that
E[ Σ_{ℓ=1}^{R} λ_ℓ (1/h^{1/2})( K((t_ℓ − X_i)/h) − E K((t_ℓ − X_i)/h) ) ]² = E[ (1/h^{1/2}) Σ_{ℓ=1}^{R} λ_ℓ ( K((t_ℓ − X_i)/h) − E K((t_ℓ − X_i)/h) ) ]² = (1/h) E[ Σ_{ℓ=1}^{R} λ_ℓ ( K((t_ℓ − X_i)/h) − E K((t_ℓ − X_i)/h) ) ]² = Σ_{ℓ=1}^{R} λ²_ℓ f(t_ℓ) ∫_{−∞}^{∞} K²(u) du + o(1).
The Hölder inequality yields (x_1 + x_2 + . . . + x_R)⁴ ≤ R⁴(x⁴_1 + x⁴_2 + . . . + x⁴_R) and therefore we get that
E[ Σ_{ℓ=1}^{R} λ_ℓ (1/h^{1/2})( K((t_ℓ − X_i)/h) − E K((t_ℓ − X_i)/h) ) ]⁴ ≤ R⁴ Σ_{ℓ=1}^{R} λ⁴_ℓ E[ (1/h^{1/2})( K((t_ℓ − X_i)/h) − E K((t_ℓ − X_i)/h) ) ]⁴ ≤ R⁴ 2⁴ Σ_{ℓ=1}^{R} λ⁴_ℓ ( (1/h²) E K⁴((t_ℓ − X_i)/h) + (1/h²) [E K((t_ℓ − X_i)/h)]⁴ ) = O(1/h).
So if
η̄_{i,N} = Σ_{ℓ=1}^{R} λ_ℓ (1/h^{1/2})( K((t_ℓ − X_i)/h) − E K((t_ℓ − X_i)/h) ),
then Eη̄_{i,N} = 0 and, using the formulae above, we get that
( Σ_{i=1}^{N} Eη̄⁴_{i,N} )^{1/4} / ( Σ_{i=1}^{N} Eη̄²_{i,N} )^{1/2} → 0.
Hence the Liapounov condition is satisfied and therefore (3.21) holds. We obtained several results on the kernel density estimator at fixed t's. The rate of convergence, (Nh)^{−1/2}, is slower than the usual N^{−1/2}. Also, the asymptotic independence in (3.20) will cause problems when we look at the estimate on an interval [a, b]. We wish to obtain “global” results for f̂_N(t). The popular choices are
sup_{a≤t≤b} |f̂_N(t) − f(t)| and ( ∫_a^b |f̂_N(t) − f(t)|^p dt )^{1/p},
where p ≥ 1. What is the “natural” norm? It is argued that p = 1 is the “natural” norm, since the L_1 distance between two densities is always finite (it is at most 2), while all the other norms put restrictions on f. The visualization of f̂_N(t) together with f is supported by the sup–norm. To obtain global results, the pointwise assumptions on f must hold on [a − ε, b + ε] with some ε > 0. Under these conditions,
sup_{a≤t≤b} |f̂_N(t) − f(t)| →_P 0,
i.e. the estimator is uniformly weakly consistent. The limit distribution of the sup norm and the L_2 norm of the kernel density estimator was determined by Bickel and Rosenblatt (1973). They consider
M_{N,1} = (Nh)^{1/2} sup_{a≤t≤b} f^{−1/2}(t) |f̂_N(t) − f(t)|
and
M_{N,2} = Nh ∫_a^b (f̂_N(t) − f(t))² a(t) dt,
where a(t) is a weight function. Bickel and Rosenblatt (1973) explicitly define numerical sequences r_{1,N} and r_{2,N} such that
(3.22) P{r_{1,N}(M_{N,1} − r_{2,N}) ≤ x} → exp(−2e^{−x}).
We note that r_{1,N} ≈ (log N)^{1/2} and r_{2,N} ≈ (log N)^{1/2}, and the limit is an extreme value distribution. These suggest that the rate of convergence in (3.22) is slow. This conjecture was checked empirically. They also showed that there are constants r_3 and r_4 such that
(3.23) P{ (1/(r_3 h^{1/2}))(M_{N,2} − r_4) ≤ x } → Φ(x),
where Φ(x) denotes the standard normal distribution function. The rate of convergence in (3.23) is better than in (3.22), so usually (3.23) is used in hypothesis testing. Csörgő and Horváth (1988) extend the results in (3.23) to the functionals
M_{N,3} = (Nh)^{p/2} ∫_a^b |f̂_N(t) − f(t)|^p a(t) dt,
where a(t) is a weight function, for all p ≥ 1. Their result is mainly used when p = 1, since this gives the natural L_1 norm for density estimates. Since the rate of convergence in (3.22) is rather low, we discuss how to use the bootstrap to get c_N = c_N(α) such that
(3.24) lim_{N→∞} P{M_{N,1} ≤ c_N} = 1 − α.
The result in (3.24) can be used to construct confidence bands for the density on [a, b] and for hypothesis testing. We note that c_N(α) does not have a limit as N → ∞, it is increasing like (2 log(1/h))^{1/2}. If we use the bootstrap with replacement, X*_1, X*_2, . . . , X*_N are independent and identically distributed random variables but they are discrete, so conditionally on X they do not have a density function. Note that, due to the difficult form of r_{1,N} and r_{2,N}, the bootstrap sample size is N, the original sample size. However, even though there is no density, we can formally compute
f̂*_N(t) = (1/(Nh)) Σ_{i=1}^{N} K((t − X*_i)/h),
which is a density for all N: it satisfies f̂*_N(t) ≥ 0 and
∫_{−∞}^{∞} f̂*_N(t) dt = 1.
This is really interesting, since we are using a density to estimate a density that, conditionally on X, does not exist. The bootstrap statistic is
M*_{N,1} = (Nh)^{1/2} sup_{a≤t≤b} f̂_N^{−1/2}(t) |f̂*_N(t) − f̂_N(t)|.
We cannot repeat our previous arguments, since conditionally on X, f̂_N(t) is not the density of the bootstrap observations. However, if c*_N(α) is defined by P_X{M*_{N,1} ≤ c*_N} = 1 − α, then
(3.25) lim_{N→∞} P{M_{N,1} ≤ c*_N(α)} = 1 − α.
Hence we can use c*_N(α) as an approximation for c_N(α). It is more natural to use a bootstrap sample whose conditional distribution, given X, has density f̂_N(t). Since f̂_N(t) is a density function,
F̂_N(x) = ∫_{−∞}^{x} f̂_N(t) dt
defines a distribution function. We note that
F̂_N(x) = (1/(Nh)) Σ_{i=1}^{N} ∫_{−∞}^{x} K((t − X_i)/h) dt = (1/N) Σ_{i=1}^{N} 𝒦((x − X_i)/h),
where 𝒦(u) = ∫_{−∞}^{u} K(t) dt, i.e. 𝒦(u) is the distribution function satisfying 𝒦'(u) = K(u). Hence F̂_N(x) is a smooth estimator of the underlying distribution function F. Let Z_1, Z_2, . . . , Z_N be independent, identically distributed random variables with distribution function F̂_N(x), conditionally on X. Now we compute the kernel density estimator f̃*_N(t) from Z_1, Z_2, . . . , Z_N. The corresponding sup statistic is
M̃*_{N,1} = (Nh)^{1/2} sup_{a≤t≤b} f̂_N^{−1/2}(t) |f̃*_N(t) − f̂_N(t)|.
If c̃*_N = c̃*_N(1 − α) is defined by P_X{M̃*_{N,1} ≤ c̃*_N} = 1 − α, then
lim_{N→∞} P{M_{N,1} ≤ c̃*_N(α)} = 1 − α,
so we have another resampling based estimator for c_N(α). Our discussion introduced a smooth estimator for F, and this estimator is used to define the smoothed bootstrap. Next we consider the effect of estimating a parameter in the fitted density function. We assume that the underlying density function is in the parametric form f(t, θ). The true value of the parameter is θ_0, i.e. f_0(t) = f(t, θ_0). We estimate the parameter with θ̂_N satisfying
(3.26) N^{1/2}(θ̂_N − θ_0) = O_P(1).
We have seen that several estimators satisfy (3.26), for example maximum likelihood, least squares, U–statistics and so on. Assume that f(t, θ) has a bounded partial derivative in θ in a neighbourhood of θ_0, i.e. there is a constant C such that
|∂f(t, θ)/∂θ| ≤ C
for all t and all θ in a neighbourhood of θ_0. Then by (3.26) and the mean value theorem we get that
sup_{a≤t≤b} |f(t, θ̂_N) − f(t, θ_0)| = O_P(N^{−1/2})
and therefore
(3.27) (Nh)^{1/2} sup_{a≤t≤b} |f̂_N(t) − f(t, θ̂_N)| = (Nh)^{1/2} sup_{a≤t≤b} |f̂_N(t) − f(t, θ_0)|
+ o_P(1). This means that estimating parameters does not affect the results on density estimation. This is different from the parameter estimated empirical process, where the estimation of the parameter changes the asymptotics.

3.2. Cross validation. We have seen that if we minimize MSE(f̂_N(t)) = E(f̂_N(t) − f(t))² with respect to the window (smoothing parameter), then h depends on t. In order to find an “optimal” window, the other possible criterion is the minimization of the mean integrated square error
MISE(h) = E ∫_a^b (f̂_N(t) − f(t))² dt.
So we choose h_opt which minimizes MISE(h), i.e. h_opt = argmin_h MISE(h). Since
E ∫_a^b (f̂_N(t) − f(t))² dt = ∫_a^b E(f̂_N(t) − E f̂_N(t))² dt + 2 ∫_a^b E[(f̂_N(t) − E f̂_N(t))(E f̂_N(t) − f(t))] dt + ∫_a^b (E f̂_N(t) − f(t))² dt = ∫_a^b var(f̂_N(t)) dt + ∫_a^b (E f̂_N(t) − f(t))² dt = (1/(Nh)) ∫_a^b f(t) dt ∫_{−∞}^{∞} K²(u) du + (h⁴/4) ∫_a^b (f''(t))² dt ( ∫_{−∞}^{∞} u² K(u) du )² + o(h⁴) + o(1/(Nh)),
we get, at least asymptotically,
h_opt = c* N^{−1/5}, with c* = { ∫_a^b f(t) dt ∫_{−∞}^{∞} K²(u) du }^{1/5} { ∫_a^b (f''(t))² dt ( ∫_{−∞}^{∞} u² K(u) du )² }^{−1/5},
which depends on the unknown f. Cross validation provides a data based estimator for h_opt. We write
∫_a^b (f̂_N(t) − f(t))² dt = ∫_a^b (f̂_N(t))² dt − 2 ∫_a^b f̂_N(t) f(t) dt + ∫_a^b f²(t) dt = J(h) + ∫_a^b f²(t) dt.
Since ∫_a^b f²(t) dt does not depend on h, we need to minimize J(h). The estimator of J(h) is
J̄(h) = ∫_a^b (f̂_N(t))² dt − (2/N) Σ_{i=1}^{N} ∫_a^b f̂_N(t) f̂^{(−i)}_N(t) dt,
where f̂^{(−i)}_N(t) is the kernel density estimator computed without the ith observation. The sample based cross validation estimator is ĥ = argmin_h J̄(h). It can be shown that
(3.28) ĥ/h_opt →_P 1
and
(3.29) MISE(ĥ)/MISE(h_opt) →_P 1.
(Note: proving ĥ − h_opt →_P 0 would not be too useful, since both terms go to 0.) The result in (3.29) means that using ĥ we get the asymptotically most efficient kernel density estimator. Also, we need to check that our results proven for the non–random window h_opt (expansion of the bias, variance, asymptotic normality, asymptotic distribution of norms) go through for the estimator computed with the random window ĥ. These have been established in the literature (cf. Silverman (1986)), so it is justified to use ĥ. We discussed cross validation in the context of finding the optimal window. The same idea is also used in model validation in machine learning, for example; however, not always is only one element removed to get the comparison estimates, but several. The same idea also appears in jackknife estimators. For computational purposes we approximate J̄(h) with
J*(h) = (1/(N²h)) Σ_{i=1}^{N} Σ_{j=1}^{N} K*((X_i − X_j)/h) + (2/(Nh)) K(0),
where
K*(x) = K^{(2)}(x) − 2K(x) and K^{(2)}(x) = ∫_{−∞}^{∞} K(x − y)K(y) dy.
The numerical work is still substantial, and the Fast Fourier Transform is suggested for the computation. Since the computation of the cross validation is not too simple, some suggestions are given which are only supported by simulations. If f is a normal density, h* = 1.06σN^{−1/5} is suggested, where σ is the standard deviation of the observations. Since σ is unknown, it is estimated by σ̂ = min(sample standard deviation, interquartile range/π). Hence h* = 1.06σ̂N^{−1/5} is computable from the sample. This rule of thumb is used for non–normal densities as well. Usually, instead of 1.06, several other constants are tried and the “best” is used in the analysis. Choosing h requires practice!
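A small illustration of the rule of thumb (my own sketch; it implements the kernel estimator (3.1) directly with the normal kernel and the bandwidth h* = 1.06 σ̂ N^{−1/5} described above):

import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(t_grid, x, h=None):
    """Kernel density estimator (3.1) with the normal kernel; if h is None, use the
    rule of thumb h = 1.06 * sigma_hat * N^(-1/5)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if h is None:
        iqr = np.subtract(*np.percentile(x, [75, 25]))
        sigma_hat = min(x.std(ddof=1), iqr / np.pi)   # as in the text above
        h = 1.06 * sigma_hat * n ** (-0.2)
    t_grid = np.asarray(t_grid, dtype=float)
    u = (t_grid[:, None] - x[None, :]) / h
    return gaussian_kernel(u).mean(axis=1) / h, h

rng = np.random.default_rng(4)
x = rng.normal(size=400)
grid = np.linspace(-3, 3, 61)
fhat, h_used = kde(grid, x)
print(h_used, fhat[30])   # the estimate near 0 should be close to 0.399 for standard normal data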
3.3. Histogram. The histogram is widely popular, since it was the first density estimator and it has been around since the 1880's. Also, even the simplest statistical software contains it. It is not better than the kernel density estimator, and the estimate is a step function even for smooth densities. The definition is very simple. We assume for the sake of simplicity that the support of f is [0, 1]. (Essentially, we have an interval which contains all the observations, or we use an interval such that the integral of the density on this interval is close to 1. Roughly speaking, we need a relatively large but not too large interval for the construction of the estimator.) We construct a histogram with equal length bins. Let m be an integer and define the bins
B_j = { x : (j − 1)/m < x ≤ j/m }, j = 1, 2, . . . , m.
If
Y_j = Σ_{i=1}^{N} I{X_i ∈ B_j} and p̂_j = Y_j/N, j = 1, 2, . . . , m,
then the histogram is defined as
f̃_N(t) = Σ_{j=1}^{m} (p̂_j/h) I{t ∈ B_j} with h = 1/m.
Hence the histogram is a step function; its value is the proportion of the observations in the bin, rescaled by the bin width. It is clear that the histogram is closely related to the kernel density estimator with the uniform kernel. (Writing h = 1/m is just an effort to connect the number of bins with the window.) It can be shown that
MISE(h) = E ∫_0^1 (f̃_N(t) − f(t))² dt = (h²/12) ∫_0^1 (f'(t))² dt + 1/(Nh) + o(h²) + o(1/(Nh)),
so in this case the optimal window is
h_opt = N^{−1/3} ( 6 / ∫_0^1 (f'(t))² dt )^{1/3}.
If we compare the optimal window of order N^{−1/3} to the optimal window of order N^{−1/5} for kernel densities, we see that the histogram converges to its limit much more slowly. The optimal h (i.e. the number of bins m) can be found by cross validation as well. Let
Ĵ(h) = 2/(h(N − 1)) − ((N + 1)/(h(N − 1))) Σ_{j=1}^{m} p̂_j².
Cross validation gives the window ĥ = argmin_h Ĵ(h), so m = ⌊1/ĥ⌋. There are several versions of the histogram, including unequal bin sizes, data driven bins and so on. Silverman (1986) contains a readable account of histograms.

Local log likelihood (local polynomial smoothing). The likelihood method can be extended to function spaces, so the likelihood for f is
Σ_{i=1}^{N} log f(X_i) − N ( ∫_{−∞}^{∞} f(u) du − 1 ).
Maximizing the log likelihood function above does not give an acceptable result. Nonparametric likelihood arguments give that the locally smoothed log likelihood should be used, and the likelihood estimator for f is
f̂_N(t) = argmax_f L(t; f), where L(t; f) = Σ_{i=1}^{N} K((t − X_i)/h) log f(X_i) − N ∫_{−∞}^{∞} K((t − u)/h) f(u) du.
Maximizing with respect to f is hard, so we approximate log f(u) near t with a polynomial of the form
p_t(a, u) = Σ_{j=0}^{r} (a_j/j!)(t − u)^j, a = (a_0, a_1, . . . , a_r).
So we need to maximize
Σ_{i=1}^{N} K((t − X_i)/h) Σ_{j=0}^{r} (a_j/j!)(t − X_i)^j − N ∫_{−∞}^{∞} K((t − u)/h) exp( Σ_{j=0}^{r} (a_j/j!)(t − u)^j ) du
with respect to a. The maximum is reached at â = (â_0, â_1, . . . , â_r). Then the estimator is
f̂_N(t) = e^{â_0}.
This method is also called local polynomial smoothing. It is called local because of the expansion of log f around t. This method requires the choice of K, h and r. Due to the larger number of tuning parameters, we can get good results. The best fit would be obtained for large r, but in this case the estimation error would increase. So h and r should be picked using the data. This is implemented in several statistical packages. You can find more on local polynomial smoothing in Fan and Gijbels (1996).

3.4. Estimation with series. If φ_1, φ_2, . . . are orthonormal functions on [0, 1], i.e.
∫_0^1 φ_i(u)φ_j(u) du = 0, if i ≠ j, and 1, if i = j,
then
(3.30) f(t) = Σ_{ℓ=1}^{∞} c_ℓ φ_ℓ(t), c_ℓ = ∫_0^1 f(u)φ_ℓ(u) du.
The expansion in (3.30) requires some assumptions to make sense; at least
∫_0^1 f²(u) du < ∞
is needed. This gives a meaningful expansion for a fixed t and also in L_2, the space of square integrable functions on [0, 1]. Assuming appropriate further conditions, the infinite sum converges in the sup–norm or in L_1 (the space of integrable functions on [0, 1]). It is easy to give an unbiased estimator for c_ℓ:
ĉ_ℓ = (1/N) Σ_{i=1}^{N} φ_ℓ(X_i).
Clearly,
E ĉ_ℓ = (1/N) Σ_{i=1}^{N} E φ_ℓ(X_i) = E φ_ℓ(X_1) = ∫_0^1 φ_ℓ(u) f(u) du = c_ℓ.
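To close this part, a minimal sketch (mine) of the series estimator: it uses the cosine basis φ_1(u) = 1, φ_{ℓ}(u) = √2 cos((ℓ − 1)πu) on [0, 1], which is orthonormal, estimates ĉ_ℓ = (1/N) Σ φ_ℓ(X_i) and truncates the expansion (3.30) at a finite number of terms (the truncation point is an arbitrary choice here).

import numpy as np

def cosine_basis(l, u):
    """Orthonormal cosine basis on [0,1]: phi_1 = 1, phi_l = sqrt(2) cos((l-1) pi u) for l >= 2."""
    u = np.asarray(u, dtype=float)
    return np.ones_like(u) if l == 1 else np.sqrt(2.0) * np.cos((l - 1) * np.pi * u)

def series_density(t_grid, x, n_terms=10):
    """Truncated series estimator f_hat(t) = sum_{l <= n_terms} c_hat_l phi_l(t)."""
    x = np.asarray(x, dtype=float)
    est = np.zeros(len(t_grid))
    for l in range(1, n_terms + 1):
        c_hat = cosine_basis(l, x).mean()        # unbiased estimator of c_l
        est += c_hat * cosine_basis(l, t_grid)
    return est

rng = np.random.default_rng(5)
x = rng.beta(2.0, 5.0, size=500)                 # data supported on [0, 1]
grid = np.linspace(0.0, 1.0, 11)
print(series_density(grid, x))                   # note: a truncated estimate may dip below 0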