SlideShare a Scribd company logo
1 of 61
Download to read offline
SELECTED TOPICS IN MATHEMATICAL STATISTICS
LAJOS HORVÁTH
Abstract. This is the outcome of the online Math 6070 class during the COVID-19 epidemic.
1. Some problems in nonparametric statistics
First we consider some examples for various statistical problems. In all cases we should be able
to get large sample approximations but getting the critical values might not be simple. Also, the
large sample approximations might not work in case of our sample sizes. We assume in this section
that
Assumption 1.1. X1, X2, . . . , XN are independent and identically distributed random variables
with distribution function F.
First we consider a simple hypothesis question which has been already discussed.
1.1. Kolmogov–Smirnov and related statistics. We wish to test the null hypothesis that
F(t) = F0(t) for all −∞ < t < ∞, where F0(t) is a given distribution function. We assume that
F0 is continuous. There are several well known tests for this problem. The first two are due to
Kolmogorov and Smirnov:
TN,1 = N1/2
sup
−∞<t<∞
|FN (t) − F0(t)| ,
TN,2 = N1/2
sup
−∞<t<∞
(FN (t) − F0(t)) ,
where
FN (t) =
1
N
N
X
`=1
I{Xi ≤ t}, −∞ < t < ∞
denotes the empirical distribution function of our sample. The distributions of the statistics TN,1
and TN,2 do not depend on F0 for any sample size as long as
(1.1) F0 is continuous.
By the probability integral transformation U1 = F(X1), U2 = F(X2), . . . , UN = F(XN ) are inde-
pendent identically distributed random variables, uniform on [0, 1]. Since
TN,1 = N1/2
sup
−∞<t<∞
1
N
N
X
`=1
I{F(Xi) ≤ F(t)} − F0(t)
= N1/2
sup
0<u<1
1
N
N
X
`=1
I{Ui ≤ u} − F0(F−1
(u))
,
where F−1 is the generalized inverse of F. (If F is not strictly increasing then F−1 is not uniquely
defined but it will satisfy F(F−1(u)) = u. )
We already discussed in the class of the weak convergence of the uniform empirical and quan-
tile processes. Due to the probability integral transformation the following results are immediate
consequences:
(1.2) TN,1
D
→ sup
0≤u≤1
|B(u)|
1
2 LAJOS HORVÁTH
and
(1.3) TN,2
D
→ sup
0≤u≤1
B(u),
where {B(u), 0 ≤ u ≤ 1} denotes a Brownian bridge. We already discussed the definition of the
Brownian bridge when we studied the weak convergence of the process constructed from the uniform
order statistics. The Brownian bridge is defined as B(t) = W(t) − tW(1), where W(t) is a Wiener
process. Hence B(0) = B(1) = 0 (tied down). It is a Gaussian process, i.e. the finite dimensional
distributions are multivariate normal. The parameters of the multivariate normal distribution can
be computed from the facts that EB(t) = 0 and EB(t)B(s) = min(t, s) − ts. The Brownian bridge
is continous with probability 1. More on the Brownian bridge is on page 153 of DasGupta(2008).
Hence (1.2) and (1.3) provide large sample approximations for our test statistics under the null
hypothesis. Even we know how good are the approximations in (1.2) and (1.3). There are constants
c1 and c2 such that
(1.4) sup
−∞<x<∞
P {TN,1 ≤ x} − P

sup
0≤u≤1
|B(u)| ≤ x
≤ c1
log N
N1/2
and
(1.5) sup
−∞x∞
P {TN,2 ≤ x} − P

sup
0≤u≤1
B(u) ≤ x
≤ c2
log N
N1/2
.
These results follows immediately from the Komlós, Major and Tusnády approximation (cf. Das-
Gupta, 2008, p. 162). There are explicit bounds for c1 and c2 but these are so large that they are
useless in practice. It is even more interesting, from a theoretical point of view, that the results in
(1.4) and (1.5) are very close to the best possible ones; N−1/2 are lower bounds. Since the limiting
distribution functions in (1.4) and (1.5) are known explicitly, we could check how these results work
for finite sample sizes. Chapter 9 in Shorack and Wellner (1986) contains formulae and bounds,
including exact and asymptotic bounds for TN,1 and TN,2.
We need to study the behavior of the test statistics under suitable alternatives. First we look at
TN,1. We assume that
(1.6) HA : there is t0 such that F(t0) 6= F0(t0).
If (1.6) holds, then
(1.7) TN,1
P
→ ∞.
Since
TN,1 = N1/2
sup
−∞t∞
|FN (t) − F(t) + F(t) − F0(t)| ,
we get the lower bound
N1/2
sup
−∞t∞
|F(t) − F0(t)| − N1/2
sup
−∞t∞
|FN (t) − F(t)| ≤ TN,1.
The weak convergence of the empirical process yields
N1/2
sup
−∞t∞
|FN (t) − F(t)| = OP (1)
and (1.6) gives
N1/2
sup
−∞t∞
|F(t) − F0(t)| → ∞,
completing the proof of (1.9). However, we might not be able to reject the null hypothesis under
the alternative of (1.6). The statistic TN,2 is consistent under the alternative
(1.8) HA : there is t0 such that F(t0)  F0(t0).
MATHEMATICAL STATISTICS 3
Similarly to the proof of (1.9), one can show that
(1.9) TN,2
P
→ ∞.
The other class of statistics are due to Cramér and von Mises. We provide two formulas. If you
know how to integrate with respect to a function, please use those. If not, use the formula, where
the density f0(t) = F0
0(t) appears. There are two possibilities for us:
TN,3 = N
Z ∞
∞
(FN (t) − F0(t))2
dF0(t) = N
Z ∞
∞
(FN (t) − F0(t))2
f0(t)dt
and
TN,4 = N
Z ∞
∞
(FN (t) − F0(t))2
dt.
Similarly to the first two statistics TN,1 and TN,2, the distribution of TN,3 also does not depend on
F0, if (1.1) holds. Using the probability integral transformation again, we get that TN,3 also does
not depend on F0. However, the distribution of TN,4 does depend on F0. This means that we need
to use different Monte Carlo simulations for different F0’s. The weak convergence of the uniform
empirical process already used in the justification of (1.2) and (1.3) can be used to show that
(1.10) TN,3
D
→
Z 1
0
B2
(u)du
and
(1.11) TN,4
D
→
Z 1
0
B2
(F0(t))dt,
where, as before, {B(u), 0 ≤ u ≤ 1} denotes a Brownian bridge. The rate of convergence in (1.10)
and (1.11) is much better than in (1.4) and (1.5). Namely, there are c3 and c4 such that
(1.12) sup
−∞x∞
P {TN,3 ≤ x} − P
Z 1
0
B2
(u)du ≤ x
≤
c3
N
and
(1.13) sup
−∞x∞
P {TN,4 ≤ x} − P
Z ∞
−∞
B2
(F0(t))dt) ≤ x
≤
c4
N
.
The upper bound in (1.12) was obtained by Götze (cf. Shorack and Wellner, 1986, p. 223) and his
method can be used to prove (1.13). It is conjectured that these results are optimal, it is impossible
to replace 1/N with a sequence which would converge to 0 faster. The theoretical results in (1.12)
and (1.13) were observed empirically a long time ago. This is one of the reasons for the popularity
of TN,3.
There is an interesting connection between U statistics and the Cramér–von Mises statistics. It can
be shown that the Cramér–von Mises statistics are essentially U statistics. This claim is supported
by a famous expansion of the square integral of the Brownian bridge:
(1.14)
Z 1
0
B2
(t)dt =
∞
X
k=1
1
k2π2
N2
,
where Ni, i ≥ 1 are independent standard normal random variable. This is like the limit of the
degenerate U statistics. The result in (1.14) is a consequence of the Karhunen–Loéve theorem.
They showed that
{B(t), 0 ≤ t ≤ 1}
D
=
(
√
2
∞
X
k=1
Nk
1
kπ
sin(kπs)
)
.
4 LAJOS HORVÁTH
This result looks obvious, in some sense, since B(t) is square integrable, so we should have expan-
sion with respect a basis. Here the interesting part is that the Ni’s are iid standard normal random
variables. If a different basis is used, this will not be correct. We use a special basis here, since
sin(kπs) are the eigenfunction of the operator K(t, s)f(s)ds.
There is am other interesting and useful formula for the Brownian bridge: the integral
B(t) =
Z t
0
1 − t
1 − s
dW(s) 0 ≤ t ≤ 1,
also defines a Brownian bridge. However, first we need to define integration with respect to a
Wiener process. We have two roads: study Ito integration. The other possibility is much simpler.
We just assume that integration by parts defines the formula, so
Z t
0
1 − t
1 − s
dW(s) =
1
1 − t
W(t) −
Z t
0
W(s)d
1
1 − s
=
1
1 − t
W(t) +
Z t
0
1
(1 − s)2
W(s)ds.
This integral representation of the Brownian bridge is often used in biostatistics.
Next we discuss the consistency of the Cramér–von Mises tests. If (1.1) and (1.6), then
(1.15) TN,3
P
→ ∞
and
(1.16) TN,4
P
→ ∞.
We write
N
Z ∞
∞
(FN (t) − F0(t))2
dF0(t)
= N
Z ∞
∞
([FN (t) − F(t)] + [F(t) − F0(t)])2
dF0(t)
= N
Z ∞
∞
(FN (t) − F(t))2
dF0(t) + 2N
Z ∞
∞
[FN (t) − F(t)][F(t) − F0(t)]dF0(t)
+ N
Z ∞
∞
(F(t) − F0(t))2
dF0(t)
and by the weak convergence of the Cramér–von Mises statistic
Z ∞
∞
(N1/2
[FN (t) − F(t)])2
dF0(t) = OP (1).
Using the Cauchy–Schwartz inequality we obtain that
N
Z ∞
∞
[FN (t) − F(t)][F(t) − F0(t)]dF0(t)

More Related Content

Similar to mathstat.pdf

Fourier Specturm via MATLAB
Fourier Specturm via MATLABFourier Specturm via MATLAB
Fourier Specturm via MATLABZunAib Ali
 
Low rank tensor approximation of probability density and characteristic funct...
Low rank tensor approximation of probability density and characteristic funct...Low rank tensor approximation of probability density and characteristic funct...
Low rank tensor approximation of probability density and characteristic funct...Alexander Litvinenko
 
Fourier series of odd functions with period 2 l
Fourier series of odd functions with period 2 lFourier series of odd functions with period 2 l
Fourier series of odd functions with period 2 lPepa Vidosa Serradilla
 
WaveletTutorial.pdf
WaveletTutorial.pdfWaveletTutorial.pdf
WaveletTutorial.pdfshreyassr9
 
Computing f-Divergences and Distances of\\ High-Dimensional Probability Densi...
Computing f-Divergences and Distances of\\ High-Dimensional Probability Densi...Computing f-Divergences and Distances of\\ High-Dimensional Probability Densi...
Computing f-Divergences and Distances of\\ High-Dimensional Probability Densi...Alexander Litvinenko
 
Find the compact trigonometric Fourier series for the periodic signal.pdf
Find the compact trigonometric Fourier series for the periodic signal.pdfFind the compact trigonometric Fourier series for the periodic signal.pdf
Find the compact trigonometric Fourier series for the periodic signal.pdfarihantelectronics
 
dhirota_hone_corrected
dhirota_hone_correcteddhirota_hone_corrected
dhirota_hone_correctedAndy Hone
 
International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI) International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI) inventionjournals
 
Time Series Analysis
Time Series AnalysisTime Series Analysis
Time Series AnalysisAmit Ghosh
 
Natalini nse slide_giu2013
Natalini nse slide_giu2013Natalini nse slide_giu2013
Natalini nse slide_giu2013Madd Maths
 
Stochastic Calculus, Summer 2014, July 22,Lecture 7Con.docx
Stochastic Calculus, Summer 2014, July 22,Lecture 7Con.docxStochastic Calculus, Summer 2014, July 22,Lecture 7Con.docx
Stochastic Calculus, Summer 2014, July 22,Lecture 7Con.docxdessiechisomjj4
 
Ch1 representation of signal pg 130
Ch1 representation of signal pg 130Ch1 representation of signal pg 130
Ch1 representation of signal pg 130Prateek Omer
 
A Fibonacci-like universe expansion on time-scale
A Fibonacci-like universe expansion on time-scaleA Fibonacci-like universe expansion on time-scale
A Fibonacci-like universe expansion on time-scaleOctavianPostavaru
 
On Generalized Classical Fréchet Derivatives in the Real Banach Space
On Generalized Classical Fréchet Derivatives in the Real Banach SpaceOn Generalized Classical Fréchet Derivatives in the Real Banach Space
On Generalized Classical Fréchet Derivatives in the Real Banach SpaceBRNSS Publication Hub
 

Similar to mathstat.pdf (20)

Fourier Specturm via MATLAB
Fourier Specturm via MATLABFourier Specturm via MATLAB
Fourier Specturm via MATLAB
 
Low rank tensor approximation of probability density and characteristic funct...
Low rank tensor approximation of probability density and characteristic funct...Low rank tensor approximation of probability density and characteristic funct...
Low rank tensor approximation of probability density and characteristic funct...
 
01_AJMS_277_20_20210128_V1.pdf
01_AJMS_277_20_20210128_V1.pdf01_AJMS_277_20_20210128_V1.pdf
01_AJMS_277_20_20210128_V1.pdf
 
Fourier series of odd functions with period 2 l
Fourier series of odd functions with period 2 lFourier series of odd functions with period 2 l
Fourier series of odd functions with period 2 l
 
WaveletTutorial.pdf
WaveletTutorial.pdfWaveletTutorial.pdf
WaveletTutorial.pdf
 
Computing f-Divergences and Distances of\\ High-Dimensional Probability Densi...
Computing f-Divergences and Distances of\\ High-Dimensional Probability Densi...Computing f-Divergences and Distances of\\ High-Dimensional Probability Densi...
Computing f-Divergences and Distances of\\ High-Dimensional Probability Densi...
 
D021018022
D021018022D021018022
D021018022
 
Signal Processing Homework Help
Signal Processing Homework HelpSignal Processing Homework Help
Signal Processing Homework Help
 
Find the compact trigonometric Fourier series for the periodic signal.pdf
Find the compact trigonometric Fourier series for the periodic signal.pdfFind the compact trigonometric Fourier series for the periodic signal.pdf
Find the compact trigonometric Fourier series for the periodic signal.pdf
 
dhirota_hone_corrected
dhirota_hone_correcteddhirota_hone_corrected
dhirota_hone_corrected
 
International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI) International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI)
 
Fourier series
Fourier seriesFourier series
Fourier series
 
Kanal wireless dan propagasi
Kanal wireless dan propagasiKanal wireless dan propagasi
Kanal wireless dan propagasi
 
Time Series Analysis
Time Series AnalysisTime Series Analysis
Time Series Analysis
 
Data Analysis Assignment Help
Data Analysis Assignment HelpData Analysis Assignment Help
Data Analysis Assignment Help
 
Natalini nse slide_giu2013
Natalini nse slide_giu2013Natalini nse slide_giu2013
Natalini nse slide_giu2013
 
Stochastic Calculus, Summer 2014, July 22,Lecture 7Con.docx
Stochastic Calculus, Summer 2014, July 22,Lecture 7Con.docxStochastic Calculus, Summer 2014, July 22,Lecture 7Con.docx
Stochastic Calculus, Summer 2014, July 22,Lecture 7Con.docx
 
Ch1 representation of signal pg 130
Ch1 representation of signal pg 130Ch1 representation of signal pg 130
Ch1 representation of signal pg 130
 
A Fibonacci-like universe expansion on time-scale
A Fibonacci-like universe expansion on time-scaleA Fibonacci-like universe expansion on time-scale
A Fibonacci-like universe expansion on time-scale
 
On Generalized Classical Fréchet Derivatives in the Real Banach Space
On Generalized Classical Fréchet Derivatives in the Real Banach SpaceOn Generalized Classical Fréchet Derivatives in the Real Banach Space
On Generalized Classical Fréchet Derivatives in the Real Banach Space
 

Recently uploaded

Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPirithiRaju
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologycaarthichand2003
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxMurugaveni B
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...Universidade Federal de Sergipe - UFS
 
Good agricultural practices 3rd year bpharm. herbal drug technology .pptx
Good agricultural practices 3rd year bpharm. herbal drug technology .pptxGood agricultural practices 3rd year bpharm. herbal drug technology .pptx
Good agricultural practices 3rd year bpharm. herbal drug technology .pptxSimeonChristian
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...lizamodels9
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationColumbia Weather Systems
 
Functional group interconversions(oxidation reduction)
Functional group interconversions(oxidation reduction)Functional group interconversions(oxidation reduction)
Functional group interconversions(oxidation reduction)itwameryclare
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationColumbia Weather Systems
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 

Recently uploaded (20)

Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technology
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
Good agricultural practices 3rd year bpharm. herbal drug technology .pptx
Good agricultural practices 3rd year bpharm. herbal drug technology .pptxGood agricultural practices 3rd year bpharm. herbal drug technology .pptx
Good agricultural practices 3rd year bpharm. herbal drug technology .pptx
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather Station
 
Functional group interconversions(oxidation reduction)
Functional group interconversions(oxidation reduction)Functional group interconversions(oxidation reduction)
Functional group interconversions(oxidation reduction)
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather Station
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 

mathstat.pdf

  • 1. SELECTED TOPICS IN MATHEMATICAL STATISTICS LAJOS HORVÁTH Abstract. This is the outcome of the online Math 6070 class during the COVID-19 epidemic. 1. Some problems in nonparametric statistics First we consider some examples for various statistical problems. In all cases we should be able to get large sample approximations but getting the critical values might not be simple. Also, the large sample approximations might not work in case of our sample sizes. We assume in this section that Assumption 1.1. X1, X2, . . . , XN are independent and identically distributed random variables with distribution function F. First we consider a simple hypothesis question which has been already discussed. 1.1. Kolmogov–Smirnov and related statistics. We wish to test the null hypothesis that F(t) = F0(t) for all −∞ < t < ∞, where F0(t) is a given distribution function. We assume that F0 is continuous. There are several well known tests for this problem. The first two are due to Kolmogorov and Smirnov: TN,1 = N1/2 sup −∞<t<∞ |FN (t) − F0(t)| , TN,2 = N1/2 sup −∞<t<∞ (FN (t) − F0(t)) , where FN (t) = 1 N N X `=1 I{Xi ≤ t}, −∞ < t < ∞ denotes the empirical distribution function of our sample. The distributions of the statistics TN,1 and TN,2 do not depend on F0 for any sample size as long as (1.1) F0 is continuous. By the probability integral transformation U1 = F(X1), U2 = F(X2), . . . , UN = F(XN ) are inde- pendent identically distributed random variables, uniform on [0, 1]. Since TN,1 = N1/2 sup −∞<t<∞
  • 2.
  • 3.
  • 4.
  • 5.
  • 7.
  • 8.
  • 9.
  • 10.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16. 1 N N X `=1 I{Ui ≤ u} − F0(F−1 (u))
  • 17.
  • 18.
  • 19.
  • 20.
  • 21. , where F−1 is the generalized inverse of F. (If F is not strictly increasing then F−1 is not uniquely defined but it will satisfy F(F−1(u)) = u. ) We already discussed in the class of the weak convergence of the uniform empirical and quan- tile processes. Due to the probability integral transformation the following results are immediate consequences: (1.2) TN,1 D → sup 0≤u≤1 |B(u)| 1
  • 22. 2 LAJOS HORVÁTH and (1.3) TN,2 D → sup 0≤u≤1 B(u), where {B(u), 0 ≤ u ≤ 1} denotes a Brownian bridge. We already discussed the definition of the Brownian bridge when we studied the weak convergence of the process constructed from the uniform order statistics. The Brownian bridge is defined as B(t) = W(t) − tW(1), where W(t) is a Wiener process. Hence B(0) = B(1) = 0 (tied down). It is a Gaussian process, i.e. the finite dimensional distributions are multivariate normal. The parameters of the multivariate normal distribution can be computed from the facts that EB(t) = 0 and EB(t)B(s) = min(t, s) − ts. The Brownian bridge is continous with probability 1. More on the Brownian bridge is on page 153 of DasGupta(2008). Hence (1.2) and (1.3) provide large sample approximations for our test statistics under the null hypothesis. Even we know how good are the approximations in (1.2) and (1.3). There are constants c1 and c2 such that (1.4) sup −∞<x<∞
  • 23.
  • 24.
  • 25.
  • 26. P {TN,1 ≤ x} − P sup 0≤u≤1 |B(u)| ≤ x
  • 27.
  • 28.
  • 29.
  • 30. ≤ c1 log N N1/2 and (1.5) sup −∞x∞
  • 31.
  • 32.
  • 33.
  • 34. P {TN,2 ≤ x} − P sup 0≤u≤1 B(u) ≤ x
  • 35.
  • 36.
  • 37.
  • 38. ≤ c2 log N N1/2 . These results follows immediately from the Komlós, Major and Tusnády approximation (cf. Das- Gupta, 2008, p. 162). There are explicit bounds for c1 and c2 but these are so large that they are useless in practice. It is even more interesting, from a theoretical point of view, that the results in (1.4) and (1.5) are very close to the best possible ones; N−1/2 are lower bounds. Since the limiting distribution functions in (1.4) and (1.5) are known explicitly, we could check how these results work for finite sample sizes. Chapter 9 in Shorack and Wellner (1986) contains formulae and bounds, including exact and asymptotic bounds for TN,1 and TN,2. We need to study the behavior of the test statistics under suitable alternatives. First we look at TN,1. We assume that (1.6) HA : there is t0 such that F(t0) 6= F0(t0). If (1.6) holds, then (1.7) TN,1 P → ∞. Since TN,1 = N1/2 sup −∞t∞ |FN (t) − F(t) + F(t) − F0(t)| , we get the lower bound N1/2 sup −∞t∞ |F(t) − F0(t)| − N1/2 sup −∞t∞ |FN (t) − F(t)| ≤ TN,1. The weak convergence of the empirical process yields N1/2 sup −∞t∞ |FN (t) − F(t)| = OP (1) and (1.6) gives N1/2 sup −∞t∞ |F(t) − F0(t)| → ∞, completing the proof of (1.9). However, we might not be able to reject the null hypothesis under the alternative of (1.6). The statistic TN,2 is consistent under the alternative (1.8) HA : there is t0 such that F(t0) F0(t0).
  • 39. MATHEMATICAL STATISTICS 3 Similarly to the proof of (1.9), one can show that (1.9) TN,2 P → ∞. The other class of statistics are due to Cramér and von Mises. We provide two formulas. If you know how to integrate with respect to a function, please use those. If not, use the formula, where the density f0(t) = F0 0(t) appears. There are two possibilities for us: TN,3 = N Z ∞ ∞ (FN (t) − F0(t))2 dF0(t) = N Z ∞ ∞ (FN (t) − F0(t))2 f0(t)dt and TN,4 = N Z ∞ ∞ (FN (t) − F0(t))2 dt. Similarly to the first two statistics TN,1 and TN,2, the distribution of TN,3 also does not depend on F0, if (1.1) holds. Using the probability integral transformation again, we get that TN,3 also does not depend on F0. However, the distribution of TN,4 does depend on F0. This means that we need to use different Monte Carlo simulations for different F0’s. The weak convergence of the uniform empirical process already used in the justification of (1.2) and (1.3) can be used to show that (1.10) TN,3 D → Z 1 0 B2 (u)du and (1.11) TN,4 D → Z 1 0 B2 (F0(t))dt, where, as before, {B(u), 0 ≤ u ≤ 1} denotes a Brownian bridge. The rate of convergence in (1.10) and (1.11) is much better than in (1.4) and (1.5). Namely, there are c3 and c4 such that (1.12) sup −∞x∞
  • 40.
  • 41.
  • 42.
  • 43. P {TN,3 ≤ x} − P Z 1 0 B2 (u)du ≤ x
  • 44.
  • 45.
  • 46.
  • 48.
  • 49.
  • 50.
  • 51. P {TN,4 ≤ x} − P Z ∞ −∞ B2 (F0(t))dt) ≤ x
  • 52.
  • 53.
  • 54.
  • 55. ≤ c4 N . The upper bound in (1.12) was obtained by Götze (cf. Shorack and Wellner, 1986, p. 223) and his method can be used to prove (1.13). It is conjectured that these results are optimal, it is impossible to replace 1/N with a sequence which would converge to 0 faster. The theoretical results in (1.12) and (1.13) were observed empirically a long time ago. This is one of the reasons for the popularity of TN,3. There is an interesting connection between U statistics and the Cramér–von Mises statistics. It can be shown that the Cramér–von Mises statistics are essentially U statistics. This claim is supported by a famous expansion of the square integral of the Brownian bridge: (1.14) Z 1 0 B2 (t)dt = ∞ X k=1 1 k2π2 N2 , where Ni, i ≥ 1 are independent standard normal random variable. This is like the limit of the degenerate U statistics. The result in (1.14) is a consequence of the Karhunen–Loéve theorem. They showed that {B(t), 0 ≤ t ≤ 1} D = ( √ 2 ∞ X k=1 Nk 1 kπ sin(kπs) ) .
  • 56. 4 LAJOS HORVÁTH This result looks obvious, in some sense, since B(t) is square integrable, so we should have expan- sion with respect a basis. Here the interesting part is that the Ni’s are iid standard normal random variables. If a different basis is used, this will not be correct. We use a special basis here, since sin(kπs) are the eigenfunction of the operator K(t, s)f(s)ds. There is am other interesting and useful formula for the Brownian bridge: the integral B(t) = Z t 0 1 − t 1 − s dW(s) 0 ≤ t ≤ 1, also defines a Brownian bridge. However, first we need to define integration with respect to a Wiener process. We have two roads: study Ito integration. The other possibility is much simpler. We just assume that integration by parts defines the formula, so Z t 0 1 − t 1 − s dW(s) = 1 1 − t W(t) − Z t 0 W(s)d 1 1 − s = 1 1 − t W(t) + Z t 0 1 (1 − s)2 W(s)ds. This integral representation of the Brownian bridge is often used in biostatistics. Next we discuss the consistency of the Cramér–von Mises tests. If (1.1) and (1.6), then (1.15) TN,3 P → ∞ and (1.16) TN,4 P → ∞. We write N Z ∞ ∞ (FN (t) − F0(t))2 dF0(t) = N Z ∞ ∞ ([FN (t) − F(t)] + [F(t) − F0(t)])2 dF0(t) = N Z ∞ ∞ (FN (t) − F(t))2 dF0(t) + 2N Z ∞ ∞ [FN (t) − F(t)][F(t) − F0(t)]dF0(t) + N Z ∞ ∞ (F(t) − F0(t))2 dF0(t) and by the weak convergence of the Cramér–von Mises statistic Z ∞ ∞ (N1/2 [FN (t) − F(t)])2 dF0(t) = OP (1). Using the Cauchy–Schwartz inequality we obtain that N
  • 57.
  • 58.
  • 59.
  • 60. Z ∞ ∞ [FN (t) − F(t)][F(t) − F0(t)]dF0(t)
  • 61.
  • 62.
  • 63.
  • 64. ≤ N Z ∞ ∞ [FN (t) − F(t)]2 dF0(t) 1/2 Z ∞ ∞ [F(t) − F0(t)]2 dF0(t) 1/2 = N1/2 Z ∞ ∞ [N1/2 (FN (t) − F(t))]2 dF0(t) 1/2 Z ∞ ∞ [F(t) − F0(t)]2 dF0(t) 1/2 and Z ∞ ∞ [N1/2 (FN (t) − F(t))]2 dF0(t) = OP (1), N1/2 Z ∞ ∞ [F(t) − F0(t)]2 dF0(t) 1/2 = O(N1/2 ).
  • 65. MATHEMATICAL STATISTICS 5 According to our condition Z ∞ ∞ [F(t) − F0(t)]2 dF0(t) 0 and therefore (1.15) is proven. Similar arguments give (1.16). One of the basic advise is that “do not compare apples and oranges”. One of the interpretation is that we should compare variables with the same or essentially the same variances. Since the observations are independent, under H0 we have that var (FN (t) − F0(t)) = 1 N F0(t)(1 − F0(t)), so the variance of the variables used in all statistics so far, depend on t. Darling and Erdős (1956) suggested the following statistic to test the null hypothesis against the alternative in (1.6): (1.17) TN,5 = sup −∞x∞ N1/2|FN (t) − F0(t)| (F0(t)(1 − F0(t)))1/2 . The statistic TN,5 is called self normalized. However, even under the null hypothesis (1.18) lim N→∞ P{TN,5 ≥ C} = 1, for all C , i.e. TN,5 is unbounded in probability. Here is a heuristic argument for (1.17). We observe that TN,5 does not depend on F0. If the weak convergence of the empirical process to a Brownian bridge holds, the distribution of TN,5 should be close to the distribution of sup0t1 |B(t)|/(t(1 − t))1/2. But according to the law of the iterated logarithm, lim sup t→0 |B(t)| t1/2 = ∞ a.s. and therefore P sup 0t1 |B(t)|/(t(1 − t))1/2 = ∞ = 1. We can also use the empirically self normalized (1.19) TN,6 = sup X1,N xXN,N N1/2|FN (t) − F0(t)| (FN (t)(1 − FN (t)))1/2 , where X1,N = min{X1, X2, . . . , XN } and XN,N = max{X1, X2, . . . , XN }. Using the result of Darling and Erdős (1956), it can be shown that under the null hypothesis lim N→∞ P (2 log log N)1/2 TN,5 ≤ x + 2 log log N + 1 2 log log log N − 1 2 log π (1.20) = exp(−2e−x ) for all x. The limit result in (1.20) also holds for TN,6. Here we have the interesting result that even under the null hypothesis TN,5 and TN,6 converge to ∞ in probability, since 1 (2 log log N)1/2 TN,5 P → 1. However, under the alternative (1.6), N−1/2 TN,5 P → c, where c is a positive constant and if t 6= t0, then c ≥ |F(t) − F0(t)| (F0(t)(1 − F0(t))1/2 . This means the under the alternative TN,5 will be much larger than under the null. This obser- vation makes it possible to use bootstrap. The rate of convergence in (1.20) is slow. The limit is
  • 66. 6 LAJOS HORVÁTH an extreme value and, in general, convergence to extreme values can be slow. Also, the norming sequences in (1.20) are chosen for their “simplicity”. They do not have any statistical meaning like the norming with the mean and the variance in the central limit theorem. There is an important observation in this discussion: the test statistic has a limit distribution under the null and converges in probability to ∞ under the alternative. In case of TN,5 and TN,6, we should say that they converge to ∞ much faster. Next we consider testing if our sample belongs to a specific family of distributions. 1.2. Parameter estimated processes. We assume that Assumption 1.1 holds. Now we wish to the null hypothesis F0(t, λ) = ( 0, if t 0 1 − e−t/λ, if t ≥ 0, (1.21) where λ is an unknown parameter. The true value of the parameter is λ0. It is natural to estimate λ from the sample by the maximum likelihood estimator λ̂N = X̄N = 1 N N X i=1 Xi. If the null hypothesis of (1.21) is true, then FN (t) should be close to F0(t, λ̂N ) for all −∞ t ∞, because FN (t) always estimates the true distribution function. Hence we study the difference between FN (t) and F0(t, λ̂N ). We start withe a Taylor expansion with respect the parameter F0(t, λ̂N ) − F0(t, λ0) = g1(t, λ0)(λ̂N − λ0) + 1 2 g2(t, λ∗ )(λ̂N − λ0)2 , where λ∗ is between λ̂N and λ0, We can assume that t 0 since both FN (t) and F(t, λ) are 0 for t ≤ 0. Let g1(t, λ) = ∂F0(t, λ) ∂λ and g2(t, λ) = ∂2F0(t, λ) ∂λ2 . We know from the law of large numbers that (1.22) λ̂N P → λ0, and from the central limit theorem that N1/2(λ̂N − λ0) is asymptotically normal and therefore (1.23) N1/2 |λ̂N − λ0| = OP (1). It is elementary that sup−∞t∞ |g2(t, λ)| is bounded, as a function of λ, in a neighbourhood of λ0. Hence (1.22) yields (1.24) sup 0t∞ |g2(t, λ∗ )| = OP (1). Putting together (1.23) and (1.24) we conclude (1.25) sup 0t∞ |g2(t, λ∗ )|(λ̂N − λ0)2 = OP 1 N . These arguments give the important observation that sup 0t∞
  • 67.
  • 68.
  • 69. N1/2 [FN (t) − F0(t, λ̂N )] − N1/2 [FN (t) − F0(t, λ0) − g1(t, λ0)(λ̂N − λ0)]
  • 70.
  • 71.
  • 73. MATHEMATICAL STATISTICS 7 The result is in (1.26) is very important and it can be proven in more generality. The process N1/2(FN (t) − F0(t, λ̂N )) is called parameter estimated empirical process and it is often used to check a null hypothesis when an unknown parameter appears under the null hypothesis. We know some facts already: (1.27) N1/2 (FN (t) − F0(t, λ0) D[0,∞] −→ B(F0(t, λ0)) and the asymptotic normality of N1/2(λ̂N −λ0). However, this is not enough! We need them jointly since both terms appear in (1.26). The key is a formula which you might have learnt in probability. We know that λ0 = EX1 = Z ∞ 0 tf0(t, λ0)dt = Z ∞ 0 (1 − F0(t, λ0))dt. Using integration by parts, Z ∞ 0 tf0(t, λ0)dt = − Z ∞ 0 t(1 − F0(t, λ0))0 dt = −t(1 − F0(t, λ0))
  • 74.
  • 75.
  • 76.
  • 77. ∞ 0 + Z ∞ 0 (1 − F0(t, λ0))dt. Clearly, lim t→0 t(1 − F0(t, λ0)) = 0 and by the existence of the expected value lim t→∞ t(1 − F0(t, λ0)) = 0. Thus we get N1/2 (λ̂N − λ0) = Z ∞ 0 N1/2 (F0(t, λ0) − F̂N (t))dt = − Z ∞ 0 N1/2 (F̂N (t) − F0(t, λ0))dt. Now everything is expressed in terms of the empirical process. The weak convergence of the empirical process in (3.21) yields (1.28) N1/2 (FN (t)−F0(t, λ0)−g1(t, λ0)(λ̂N −λ0)) D[0,∞] −→ B(F(t, λ0))+g1(t, λ0) Z ∞ 0 B(F0(u, λ0))du. It looks obvious that (3.21) implies (1.29) Z ∞ 0 N1/2 (F̂N (t) − F0(t, λ0))dt D → Z ∞ 0 B(F0(u, λ0))du, but it requires a little work. For any C 0, the weak convergence of the empirical process to B(F(·)) implies that N1/2 FN (t) − F0(t, λ0) + g1(t, λ0) Z C 0 N1/2 (F̂N (u) − F0(u, λ0))du (1.30) D[0,∞] −→ B(F(t, λ0)) + g1(t, λ0) Z C 0 B(F0(u, λ0))du. Also, by the Cauchy–Schwartz inequality we have var Z ∞ C B(F0(u, λ0))du = E Z ∞ C B(F0(u, λ0))du 2 (1.31) = E Z ∞ C Z ∞ C B(F0(u, λ0))B(F0(v, λ0))dudv = Z ∞ C Z ∞ C E[B(F0(u, λ0))B(F0(v, λ0))]dudv ≤ Z ∞ C Z ∞ C (E[B(F0(u, λ0))]2 )1/2 (E[B(F0(v, λ0))]2 )1/2 dudv
  • 78. 8 LAJOS HORVÁTH = Z ∞ C Z ∞ C [F0(u, λ0)(1 − F0(u, λ0))]1/2 [F0(v, λ0)(1 − F0(v, λ0))]1/2 dudv = Z ∞ C [F0(u, λ0)(1 − F0(u, λ0))]1/2 du 2 → 0, as C → ∞. The same arguments yield for any N that var Z ∞ C N1/2 (F̂N (u) − F0(u, λ0)) → 0, as C → ∞. (1.32) Now Chebishev’s inequality implies on account of (1.31) and (1.32) that
  • 79.
  • 80.
  • 81.
  • 83.
  • 84.
  • 85.
  • 86. P → 0 and for all 0 lim C→∞ lim sup N→∞ P
  • 87.
  • 88.
  • 89.
  • 90. Z ∞ C N1/2 (F̂N (u) − F0(u, λ0))
  • 91.
  • 92.
  • 93.
  • 94. = 0. Now the proof of (1.28) is complete. In light of (1.29) we suggest the parameter estimated Kolmogorov–Smirnov statistics: TN,7 = N1/2 sup 0t∞
  • 95.
  • 96.
  • 97. FN (t) − F0(t, λ̂N )
  • 98.
  • 99.
  • 100. and TN,8 = N1/2 sup 0t∞ FN (t) − F0(t, λ̂N ) . Now the limit distributions of TN,7 and TN,8 can be derived easily from (1.29). Namely, (1.33) TN,8 D → sup 0t∞
  • 101.
  • 102.
  • 103.
  • 104. B(F0(t, λ0)) + g1(t, λ0) Z ∞ 0 B(F0(u, λ0))du
  • 105.
  • 106.
  • 107.
  • 108. and (1.34) TN,9 D → sup 0t∞ B(F0(t, λ0)) + g1(t, λ0) Z ∞ 0 B(F0(u, λ0))du . How to use (1.33) and (1.34) ? This is highly not obvious. Please note that the limit depends on the parametric form of F0, but this is not an issue since we know that F0 is exponential. But the dependence on λ0 is more serious since λ0 is unknown. However, a little work shows that TN,8 and TN,9 do not depend on λ0. Note that N1/2 sup 0t∞
  • 109.
  • 110.
  • 111. FN (t) − F0(t, λ̂N )
  • 112.
  • 113.
  • 115.
  • 116.
  • 117. FN (t) − (1 − e−t/λ̂N )
  • 118.
  • 119.
  • 121.
  • 122.
  • 123. FN (uλ̂N ) − (1 − e−u )
  • 124.
  • 125.
  • 126. . By definition, FN (uλ̂N ) = 1 N N X i=1 I{Xi/λ̂N ≤ u} and Xi λ̂N = Xi X̄N = Xi/λ0 PN j=1 Xj/λ0 . Hence TN,7 and therefore the limit distribution does not depend on λ0. Now Monte Carlo simula- tions could be used to get the distribution of the limit in (1.33), since we can assume that λ0 = 1 in
  • 127. MATHEMATICAL STATISTICS 9 the limit. This argument also works for TN,8. Hence TN,7 and TN,8, and therefore their limits, are free of the unknown parameter. Our argument works for scale families. With some modifications, it can be done for location and location and scale families. Let assume that we are in a location family. In this case the underlying distribution is F(t, λ) = F0(t − λ). Hence sup −∞t∞
  • 128.
  • 129.
  • 130. FN (t) − F0(t − λ̂N )
  • 131.
  • 132.
  • 134.
  • 135.
  • 136. [FN (t) − F0(t − λ0)] + [λ0 − λ̂N ])
  • 137.
  • 138.
  • 140.
  • 141.
  • 142. FN (u + λ0) − F0(u + [λ0 − λ̂N ])
  • 143.
  • 144.
  • 145. and FN (u + λ0) = 1 N N X i=1 I{Xi ≤ t + λ0} = 1 N N X i=1 I{Xi − λ0 ≤ t} Since we are in a location family, the distribution of Xi − λ0 does not depend on λ0. We showed that in case of location families, if λ̂N is the maximum likelihood estimator, then the distribution of λ0 − λ̂N does not depend on λ0 (more is true, the value of λ0 − λ̂N does not depend on λ0). The same argument work for the location and scale families. As an example, let assume that F0 is a Gamma distribution with parameters λ (scale parameter) and κ (shape parameter). We assume that κ is known. Since we are in the scale family, the arguments used in the exponential case would work. So far we considered Kolmogorov–Smirnov type processes for parameter estimated processes. In case of scale families (this includes the exponential we discussed at the beginning of this section), N Z ∞ −∞ (FN (t) − F0(t, λ̂N ))2 dF0(t, λ̂N ) = N Z ∞ −∞ (FN (t) − F0(t, λ̂N ))2 f0(t, λ̂N )dt do not depend on λ0. An other possibility for parameter free method is the parameter estimated Cramér–von Mises statistic N Z ∞ −∞ (FN (t) − F0(t, λ̂N ))2 dFN (t) ≈ N X i=1 i N − F0(Xi,N , λ̂N ) 2 , where X1,N ≤ X2,N ≤ . . . ≤ XN,N are the order statistics. To establish the consistency of TN,7 is easy.We assume that under the alternative HA : inf λ0 sup 0t∞ |F(t) − F0(t, λ)| 0 and in this case TN,7 P → ∞. We have that TN,8 P → ∞, if HA : inf λ0 sup 0t∞ (F(t) − F0(t, λ)) 0. The asymptotic behaviour of the parameter estimated Cramér–von Mises statistics can be discussed in the same way. The self normalized statistics also can be used to test if the underlying distribution in a parametric form. For example, in case of testing for exponentiality we can use sup 0t∞ N1/2|FN (t) − F0(t, λ̂N )| (F0(t, λ̂N )(1 − F0(t, λ̂N )))1/2
  • 146. 10 LAJOS HORVÁTH and sup X1,N tXN,N N1/2|FN (t) − F0(t, λ̂N )| (FN (t)(1 − FN (t)))1/2 , where X1,N = min(X1, X2, . . . , XN ) and XN,N = max(X1, X2, . . . , XN ). We found the similar pattern for the test statistics as in Section 1.1. The test statistics convergence in distribution to a limit and they converge to ∞ in probability under the alternative. It turns out that the estimation of the parameter does not effect the limit distribution, i.e. (1.20) holds for the parameter estimated statistics as well. Scale family. The underlying density is in the form f(t, λ) = 1 λ f0(t/λ), where λ 0 is a parameter. We use the empirical distribution function of Yi = Xi X̄N . 1 ≤ i ≤ N X̄N = 1 N N X i=1 Xi The distribution of Yi does not depend on λ0, the true value of the parameter under the null hypothesis. Hence the limit of N1/2 (HN (x) − F0(x)) with HN (x) = 1 N N X i=1 I{Yi ≤ x}, does not depend on λ0 but it DOES on f0. We used the notation F0 0 = f0. The sample mean X̄N might not be the maximum likelihood estimator. What is the maximum likelihood estimator? The likelihood function is L(λ) = N Y i=1 1 λ f0(Xi/λ) and the log likelihood is `(λ) = −N log λ + N X i=1 log f0(Xi/λ). We compute the derivative `0 (λ) = − N λ + N X i=1 1 f0(Xi/λ) f0 0(Xi/λ) − Xi λ2 and we need to solve the equation − N X i=1 1 f0(Xi/λ) f0 0(Xi/λ) − Xi λ = N. The equation depends only on Xi/λ. This shows that λ̂N /λ0 does not depend on λ0, where λ̂N is the maximum likelihood estimator. Hence the parameter estimated statistics do not depend on the unknown scale parameter under the null hypothesis. Next we consider a typical two sample problem.
  • 147. MATHEMATICAL STATISTICS 11 1.3. Comparing two samples. In addition to Assumption 1.1 we require Assumption 1.2. Y1, Y2, . . . , YM are independent and identically distributed random variables with distribution function H. It is a very common problem to test H0 : F(t) = H(t) for all t. In addition to FN (t) we define the empirical distribution of the Y sample HM (t) = 1 M M X i=1 Yi. If H0 is true, the difference should be small. Due to independence, under the null hypothesis we have var(FN (t) − HM (t)) = 1 N + 1 M (F(t)(1 − F(t))) = N + M NM (F(t)(1 − F(t))), so our consideration will be based on the two sample version of the empirical process uN,M (t) = NM N + M 1/2 [(FN (t) − HM (t)) − (F(t) − H(t))] . Of course, under the null hypothesis F(t)−H(t) = 0 for all t in the definition of uN,M (t). The weak convergence of uN,M (t) is immediate consequence of the weak convergence of empirical processes: (1.35) uN,M (t) D[−∞,∞] −→ c 1/2 0 B(1) (F(t)) + (1 − c0)1/2 B(1) (H(t)), where B(1) and B(2) are independent Brownian bridges, and lim N,M→∞ M N + M = c0. We observe that (1.36) {c 1/2 0 B(1) (F(t)) + (1 − c0)1/2 B(2) (F(t)), −∞ t ∞} D = {B(F(t)), −∞ t ∞}. Since B(1)(F(t)), B(2)(F(t)) are jointly Gaussian, they linear combination will be Gaussian. Hence we need to compute the mean and the covariance of c 1/2 0 B(1)(F(t)) + (1 − c0)1/2B(2)(F(t)). Since EB(1)(F(t)) = 0 EB(2)(F(t)) = 0, we get that E[c 1/2 0 B(1) (F(t)) + (1 − c0)1/2 B(2) (F(t))] = 0. Using the independence of B(1) and B(2) and EB(1)(t)B(1)(s) = EB(2)(t)B(2)(s) = min(t, s) − ts, E[c 1/2 0 B(1) (F(t)) + (1 − c0)1/2 B(2) (F(t))][c 1/2 0 B(1) (F(s)) + (1 − c0)1/2 B(2) (F(s))] = E[c 1/2 0 B(1) (F(t))][c 1/2 0 B(1) (F(s))] + E[(1 − c0)1/2 B(2) (F(t))][(1 − c0)1/2 B(2) (F(s))] = c0[min(F(t), F(s)) − F(t)F(s)] + (1 − c0)[min(F(t), F(s)) − F(t)F(s) = min(F(t), F(s)) − F(t)F(s), which is exactly the covariance function of B(F(t)). We suggest the following statistics: TN,M,1 = sup −∞t∞ |uN,M (t)| and TN,M,2 = sup −∞t∞ uN,M (t).
  • 148. 12 LAJOS HORVÁTH If H0 holds and F = H is continuous, then (1.37) TN,M,1 D → sup −∞t∞ |B(F(t))| = sup 0≤t≤1 |B(t)| and (1.38) TN,M,2 D → sup −∞t∞ B(F(t)) = sup 0≤t≤1 B(t) The result in (1.37) and (1.38) are immediate consequences of (1.35) and (1.36). If F is continuous, then the distributions of TN,M,1 and TN,M,2 do not depend on F under H0. This observation is an immediate consequence of the probability integral transformation. The statistics TN,M,1 and TN,M,2 are Kolmogorov–Smirnov type statistics. Similarly to the previous discussions we can define Cramér–von Mises type statistics as well. We can define similarly Cramér– von Mises statistics: Z ∞ −∞ u2 N,M dFN (t) ≈ M N + M N X i=1 i N − HM (Xi,N ) 2 , where X1,N ≤ X2,N ≤ . . . , XN,N are the order statistics of the first sample. Or similarly, Z ∞ −∞ u2 N,M dHM (t) ≈ NM N + M M X i=1 i M − FN (Yi,M ) 2 , where Y1,M ≤ Y2,M ≤ . . . , YM,M are the order statistics of the second sample. Using again the weak convergence of uN,M (t), one can prove that under H0 Z ∞ −∞ u2 N,M dFN (t) D → Z 1 0 B2 (u)du and Z ∞ −∞ u2 N,M dHM (t) D → Z 1 0 B2 (u)du. So far the observations were not only independent but also identically distributed even under the alternative hypothesis. The next problem is interesting since we want to test that the assumption of identically distributed data will be tested. The topic of Section 1.4 is very popular in the literature and it is called the change point problem or testing for the stability of the data. 1.4. Change point. We assume that Assumption 1.1 holds but we observe Z1, Z2, . . . , ZN defined by Zi = ( µ0 + Xi, if 1 ≤ k ≤ k∗ µA + Xi, if k ∗ +1 ≤ k ≤ N. (1.39) We call k∗ the time of change and it is unknown. Similarly, the means, µ0 6= µA before and after the change are also unknown. Of course these are the means of the Zi’s, if EXi = 0. Hence we need to modify Assumption 1.1: Assumption 1.3. X1, X2, . . . , XN are independent and identically distributed random variables with EXi = 0 and EX2 i = σ2.
  • 149. MATHEMATICAL STATISTICS 13 The model assumes that the variance is constant. First we even assume that σ is known to find a suitable test statistic. We will discuss how to proceed if σ is unknown. We note that σ is a nuisance parameter so we have no interest in its value. Recently, data examples confirmed that σ might not be constant during the observation period. It might be time dependent so we wish to detect changes in the mean even if the variance is changing as well. We only want to detect a change point if the mean changes regardless what happens to the variance of the observations. Only few results are available now. We assume now that (1.40) σ2 is known. We wish to test the stability of the model, i.e. the mean remains constant during the observation period: H0 : k∗ N against the alternative HA : 1 k∗ N. The null hypothesis postulates the change occurs outside of the observation period so it does not matter for us. Under the alternative the means changes exactly once. Our model is called “at most one change” (AMOC). First we need to find a test statistic. Let assume that k∗ = k is known. In this case this a simple two sample problem. We cut the data into two parts at k and we compute the sample means for each segment with Z̄k = 1 k k X i=1 Zi and Ẑk = 1 N − k N X i=k+1 Zi. If H0 holds, than |Z̄k −Ẑk| is small, the difference between the two empirical means can be explained by the variability in the data. Using Assumption 1.3 we get var Z̄k − Ẑk = σ2 k + σ2 N − k = σ2 N k(N − k) , so we reject the means of Z1, Z2, . . . , Zk and Zk+1, Zk+2, . . . , ZN are the same if Qk = 1 σ k(N − k) N 1/2 |Z̄k − Ẑk| is large. The statistic Qk should be familiar since this is the two sample z–test if the observations are normal! To prove this claim assume that X1, X2, X3, . . . , XN are independent identically distributed normal random variables with EXi = 0, EX2 i = σ2, σ2 is known. We wish to test that the means of Z1, Z2, . . . , Zk and Zk+1, Zk+2, . . . , ZN are the same. Let µ1 be the mean of the first sample, µ2 be the mean of the second sample and µ be the mean under the null hypothesis. The maximum likelihood estimators are µ̂1 = 1 k k X i=1 Zi, µ̂2 = 1 N − k N X i=k+1 Zi and µ̂ = 1 N N X i=1 Zi. Hence the likelihood ratio is k Y i=1 1 √ 2π exp(−(Zi − µ̂1)2 /(2σ2 )) N Y i=k+1 1 √ 2π exp(−(Zi − µ̂2)2 /(2σ2 )) N Y i=1 1 √ 2π exp(−(Zi − µ̂)2 /(2σ2 ))
  • 150. 14 LAJOS HORVÁTH = exp 1 2σ2 N X i=1 (Zi − µ̂)2 − k X i=1 (Zi − µ̂1)2 − N X i=k+1 (Zi − µ̂2)2 !! = exp 1 2σ2 kµ̂2 1 + (N − k) ˆ µ2 2 − Nµ̂2 = exp 1 2σ2 kµ̂2 1 + (N − k) ˆ µ2 2 − N((k/N)µ̂1 + ((N − k)/N)µ̂2)2 = exp 1 2σ2 k(N − k) N (µ̂1 − µ̂2)2 , proving our claim. Since k is unknown we use the rule: (1.41) reject H0, if max 1≤kN 1 σ |Qk| is large. A simple algebra shows that the rule in (1.41) might not work. It is easy to see that under H0 Qk = 1 σ N k(N − k) 1/2
  • 151.
  • 152.
  • 153.
  • 154.
  • 156.
  • 157.
  • 158.
  • 159.
  • 160. . So by the law of the iterated logarithm for partial sums of independent and identically distributed random variables (Das Gupta, 2008, pp. 8) (1.42) max 1≤kN |Qk| P → ∞ as N → ∞. We observe that σ max 1≤kN |Qk| ≥ max 1≤kN/2 N k(N − k) 1/2
  • 161.
  • 162.
  • 163.
  • 164.
  • 166.
  • 167.
  • 168.
  • 169.
  • 171.
  • 172.
  • 173.
  • 174.
  • 176.
  • 177.
  • 178.
  • 179.
  • 181.
  • 182.
  • 183.
  • 184.
  • 186.
  • 187.
  • 188.
  • 189.
  • 191.
  • 192.
  • 193.
  • 194.
  • 196.
  • 197.
  • 198.
  • 199.
  • 201.
  • 202.
  • 203.
  • 204.
  • 206.
  • 207.
  • 208.
  • 209.
  • 210. . The central limit theorem yields N−1/2
  • 211.
  • 212.
  • 213.
  • 214.
  • 216.
  • 217.
  • 218.
  • 219.
  • 220. = OP (1) and the law of the iterated logarithm implies lim sup N→∞ (log log(N/2))−1/2 max 1≤kN/2 1 k 1/2
  • 221.
  • 222.
  • 223.
  • 224.
  • 226.
  • 227.
  • 228.
  • 229.
  • 230. 0 a.s., completing the proof of (1.42). The result in (1.42) was first observed empirically by economists. This caused a stir since the z–test is widely used (without checking the required assumptions) and using the normal table for the critical values of max1≤kN |Qk| caused over rejection, and it was getting worse as N was increasing. Andrews (1993) is the most popular contribution to the applicability of the z–test for the change point problem. He observed that the law of the iterated comes into action for large and small k. He suggested rejecting for large values of max bNαc≤k≤N−bNαc 1 σ |Qk|,
  • 231. MATHEMATICAL STATISTICS 15 where b·c is the integer part and 0 α 1/2 is chosen by the practitioner. Since the change point problem is common in economics (“nothing last forever”), there has been a tremendous interest in the choice of α. The choice of 5% and 10% is recommended. Using the weak convergence of partial sums to a Wiener process can be used to prove that under the null hypothesis (1.43) max bNαc≤k≤N−bNαc 1 σ |Qk| D → sup α≤t≤1−α |B(t)| (t(1 − t))1/2 , where {B(u), 0 ≤ u ≤ 1} is a Brownian bridge. The functional L(f) = supα≤u≤1−α |f(u)| is a continuous functional on the Skorokhod space D[0, 1], so the weak convergence of partial sums gives (1.43). Looking at (1.43) it is clear why (2.20) holds if α = 0. The limit cannot be finite in this case according to the law of iterated logarithm for the Wiener process. Hence Andrews (1993) claimed that no limit result can be established for max1≤kN |Qk|. This claim is strongly believed in econometrics but it was not even true when Andrews (1993) published his famous paper. If we look at again (1.19), we face the same issue. The self–normalization (i.e. taking the maximum of random variables with constant variance) puts too much weight at the beginning and the end of the data. If Darling and Erdős (1956) can be used to get the limit in (1.19), it might work in the present case as well. We will return to this question later. The limit result of (1.43) suggests that we should remove the weight and work with (1.44) TN,11 = max 1≤k≤N 1 σ N−1/2
  • 232.
  • 233.
  • 234.
  • 235.
  • 237.
  • 238.
  • 239.
  • 240.
  • 241. . Now we reached a famous and useful statistic. It is called CUSUM (CUmulative SUMS) in the literature. One of the interesting feature is that it does not depend on the unknown mean under the null hypothesis. The mean is a nuisance parameter and its value does not appear in CUSUM type statistics. We usually refer to the maximally selected z–statistic as the standardized CUSUM. The limit distribution of TN,11 under the null hypothesis is very simple: (1.45) TN,11 D → sup 0≤u≤1 |B(u)|. The result in (1.45) follows from the weak convergence of partial sums. Using (1.45), it is easy to detect changes in the data since the distribution of sup0≤u≤1 |B(u)| is known and tabulated. In TN,11 we can recognise a Kolmogorov–Smirnov type statistics. A possible Cramér–von Mises type statistic for the change point problem is Z 1 0  N−1/2   bNuc X i=1 Zi − bNuc N N X i=1 Zi     2 du and under H0 Z 1 0  N−1/2   bNuc X i=1 Zi − bNuc N N X i=1 Zi     2 du D → Z 1 0 B2 (t)dt. The behaviour of TN,11 is very simple under the exactly one change point alternative. Namely, (1.46) if µ0 6= µA, then TN,11 P → ∞. We note that (1.46) holds in case of several changes in the mean. We can use TN,11 estimate the time of change k∗ from the data. The estimator is the point where the CUSUM achieves it largest value: k̂N = ( k : 1 ≤ k ≤ N :
  • 242.
  • 243.
  • 244.
  • 245.
  • 247.
  • 248.
  • 249.
  • 250.
  • 252.
  • 253.
  • 254.
  • 255.
  • 257.
  • 258.
  • 259.
  • 260.
  • 261. ) .
  • 262. 16 LAJOS HORVÁTH If the change occurs in the middle of the data, i.e. k∗ N = bNθc, with some 0 θ 1, then (1.47) k∗ N N P → θ, i.e. we can consistently approximate θ. So we can do testing in the very unlikely case when σ is known. If we can estimate σ from the sample, we have more realistic procedures. The first candidate is the sample mean: σ̂2 N,1 = 1 N − 1 N X i=1 Zi − Z̄N 2 . We have already established that under the null hypothesis σ̂2 N,1 P → σ2 , so it is asymptotically consistent. But this is not the case under the alternative! We note that (1.48) Z̄N = k∗ N 1 k∗ k∗ X i=1 Zi + N − k∗ N 1 N − k∗ N X i=k∗+1 Zi P → µ̄ = θµ0 + (1 − θ)µA. Elementary algebra gives σ̂2 N = 1 N − 1 k∗ X i=1 Zi − µ0 + µ0 − µ̄ + µ̄ − Z̄N )2 + 1 N − 1 N X i=k∗+1 Zi − µA + µA − µ̄ + µ̄ − Z̄N 2 and 1 N − 1 k∗ X i=1 Zi − µ0 + µ0 − µ̄ + µ̄ − Z̄N )2 = k∗ N − 1 1 k∗ k∗ X i=1 (Zi − µ0)2 + k∗ N − 1 (µ0 − µ̄)2 + k∗ N − 1 (Z̄N − µ̄)2 + 2k∗(µ0 − µ̄) N − 1 1 k∗ k∗ X i=1 (Zi − µ0) + 2k∗(µ̄ − X̄N ) N − 1 1 k∗ k∗ X i=1 (Zi − µ0) + 2k∗ N − 1 (µ0 − µ̄)(µ̄ − X̄N ). Using now the law of large numbers we obtain that 1 k∗ k∗ X i=1 (Zi − µ0)2 P → σ2 , k∗ X i=1 (Zi − µ0) P → 0 and by (1.48) X̄N − µ̄ P → 0. Thus we conclude (1.49) 1 N − 1 k∗ X i=1 Zi − µ0 + µ0 − µ̄ + µ̄ − Z̄N )2 P → θσ2 + θ(µ̄ − µ0)2 . Similar arguments give (1.50) 1 N − 1 N X i=k∗+1 Zi − µA + µA − µ̄ + µ̄ − Z̄N 2 P → (1 − θ)σ2 + (1 − θ)(µ̄ − µA)2 .
  • 263. MATHEMATICAL STATISTICS 17 Putting together (1.49) and (1.50) we get that (1.51) σ̂2 N,1 P → σ2 + θ(µ0 − µ̄)2 + (1 − θ)(µ̄ − µA)2 , so we are overestimating σ2. This is the penalty that the possible change in the mean is not taken into account. Thus we could try σ̂2 N,2 = 1 N − 1   k̂N X i=1 Zi − Z̄k̂N 2 + N X i=k̂N +1 Zi − Ẑk̂N 2   . On account of k̂N ≈ k∗, it looks obvious that σ̂2 N,2 → σ2 in probability. But now the null hypothesis is the problem since there is no k∗ under the null hypothesis! Using the weak convergence of the CUSUM process, it can be shown that k̂N /N converges in distribution. It requires lengthy calculations to show that (1.52) σ̂2 N,2 P → σ2 under the null and also under the alternative. Now we can define two statistics which does not require the knowledge of σ2: (1.53) TN,12 = max 1≤k≤N 1 σ̂N,1 N−1/2
  • 264.
  • 265.
  • 266.
  • 267.
  • 269.
  • 270.
  • 271.
  • 272.
  • 273. and (1.54) TN,13 = max 1≤k≤N 1 σ̂N,2 N−1/2
  • 274.
  • 275.
  • 276.
  • 277.
  • 279.
  • 280.
  • 281.
  • 282.
  • 283. . We note that under the no change null hypothesis (1.55) TN,12 D → sup 0≤u≤1 |B(u)|, and (1.56) TN,13 D → sup 0≤u≤1 |B(u)|. Under the alternative TN,12 P → ∞ and TN,13 P → ∞. The suggested tests are very similar and it is not immediate to see difference between them. None of them are perfect: in TN,12 we are overestimating the variance, so we reduce the power while in TN,13 an additional estimation is used which might impact the behaviour in case of small and moderate sample sizes. This is a typical situation in statistics. We have a choice but which one is better is not obvious. 1.5. Total Time on Test. The total time on test (TTT) is a popular concept in engineering. For example, the best test for exponentiality is based on TTT. TTT is defined for positive variables, so this will be assumed in this part. One of the ingredients for TTT is the function z(x) = 1 1 − F(x) Z x 0 (1 − F(u))du. The estimator of z(x) is simple, we just replace F with FN resulting in ẑN (x) = 1 1 − FN (x) Z x 0 (1 − FN (u))du.
  • 284. 18 LAJOS HORVÁTH The weak convergence of N1/2(FN (x)−F(x)) to B(F(x)) (B is a Brownian bridge yields if F(T) 1, then (1.57) N1/2 (ẑN (x) − z(x)) D[0,T] −→ Γ(x), where Γ(x) = B(F(x)) (1 − F(x))2 Z x 0 (1 − F(u))du − 1 1 − F(x) Z x 0 B(F(u))du. The proof of (1.57) can be derived from the weak convergence of the empirical process with the help of some algebra. By (1.57) we have that (1.58) TN,15 D → sup 0≤x≤T |Γ(x)|, where TN,15 = sup 0≤x≤T N1/2 |ẑN (x) − z(x)|. Getting the distribution of the limit in (1.58) is hopeless since it depends on the unknown F. We will demonstrate that bootstrap works. Interestingly, (1.58) mainly used to construct confidence bands for z(x), 0 ≤ x ≤ T.
  • 285. MATHEMATICAL STATISTICS 19 2. Several versions of resampling In Section 1 we discussed several common hypothesis testing problems in statistics and possible approaches how to tackle them. The procedures based on a single sample had the following form: we defined a test statistic TN and established the following properties: we reject for large values of TN , (2.1) lim N→∞ P{TN ≤ x} = D(x) under the null hypothesis, where D denotes the limiting distribution function and (2.2) TN P → ∞ under the alternative. Based on the original sample X1, X2, . . . , XN we want to create an other sample, called bootstrap sample, X∗ 1 , X∗ 2 , . . . , X∗ L which should resemble the original observation. We consider the original sample X1, X2, . . . , XN as fixed values, i.e. we condition with respect to them. Due to this condi- tioning we use PX to denote P{ · |X}, where X = (X1, X2, . . . , XN ). From the bootstrap sample we compute our test statistic T (1) L , as TN was computed from the original sample. Please note that we have not said how the bootstrap sample was obtained. We repeat this procedure independently of each other R times resulting in the bootstrap statistics T (1) L , T (2) L , . . . , T (R) L . Next we compute their empirical distribution function (2.3) DN,L,R(x) = 1 R R X i=1 I{T (i) L ≤ x}, −∞ x ∞. If we can show that under the null hypothesis (2.4) lim min(N,L)→∞ sup −∞x∞
  • 286.
  • 287.
  • 288. PX{T (1) L ≤ x} − D(x)
  • 289.
  • 290.
  • 291. = 0 a.s., since in this case the law of large numbers implies that (2.5) sup −∞x∞ |DN,L,R(x) − D(x)| → 0 if min(N, L, R) → ∞. Equation (2.5) means that for almost all realization of the original sample X, DN,L(x) converges to D. So you must be extremely unlucky if the bootstrap is not working for you! We require from the bootstrap statistic that it is bounded in probability under the alternative: (2.6) |T (1) L | = OPX (1). The construction of the critical values will explain why (2.6) is crucial for the bootstrap to work. Let 0 α 1 and define the bootstrap critical value cN,L,R = cN,L(α) by DN,L,R(cN,L,R) = 1 − α. (There is a minor technical issue, since DN,L(x) is a jump function so it might not take the value 1 − α. In this case the smallest number for which DN,L(x) is the first time larger than 1 − α. This works, for example, D(x) is a continuous.) Using (2.5) we obtain that P{TN cN,L,R} → α, as min(N, L, R) → ∞ under the null hypothesis. The requirement in (2.6) implies that |cN,L,R(α)| = OP X(1), as min(N, L, R) → ∞ even under the alternative and therefore on account of (2.2) we get under the alternative that P{TN cN,L,R} → 1, as min(N, L, R) → ∞. This means that the rejection rate under the null hypothesis is asymptotically α and we reject the alternative with probability going to 1. The statistics in Section 1 can be bootstrapped. Bootstrap is simple, you need to run the same program several times and you will get the critical value very
  • 292. 20 LAJOS HORVÁTH easily. This sounds nice but, of course, questions will arise. How to choose L? How to choose R? We discuss these questions later. The theory supporting our discussion works well if the limit is derived from Gaussian processes. But this might not be the case for Poisson processes and extreme values. Usually the bootstrap is better (it provides better critical values) and this is proven in several cases, like bootstrapping the mean. In Section 1 we tried to discuss problems where, in some cases, only bootstrap can provide critical values. The bootstrap can be used to construct confidence intervals and confidence bands as well. 2.1. Nonparametric bootstrap. This is probably the most popular method for resampling. However, the permutation method (selection without replacement) is older but it has more limited use. It is a modification of Fisher’s exact test. Our sample is X = (X1, X2, . . . , XN ). As in the introduction of this lecture, we assume that bX is given so we consider as constant, i.e. we condition with respect to X. We assume that (2.7) F is a continuous distribution function. If (2.7) holds, then there is no tie among the Xi’s with probability 1. Now we select from X with replacement, resulting in X∗ 1 , X∗ 2 , . . . , X∗ L. Due to the construction, (2.8) X∗ 1 , X∗ 2 , . . . , X∗ L are independent and identically distributed random variables. The computation of the common distribution function is very simple. Since there is no tie among the Xi’s, due to the random selection, PX{X∗ 1 = Xj} = 1 N , 1 ≤ j ≤ N, so the number of Xi’s which are less than x gives the conditional probability that X∗ 1 less or equal than x. This means that the common distribution function of the bootstrap sample is FN (x) = 1 N N X i=1 I{Xi ≤ x}, i.e. the empirical distribution of the original sample. It is important to note that FN (x) is a jump function even if (2.8) holds. This cause problems when the definition of TN , the statistic we want to bootstrap, assumes that (2.7) holds. For example, statistics based on densities will have this problem. However, as we discussed earlier, FN is an excellent estimate for F. For example, (2.9) sup −∞x∞ |FN (x) − F(x)| → 0 a.s. Even we know that the rate of convergence in (2.9) is N−1/2(log log)1/2 according to the law of the iterated logarithm for empirical processes. The computation of the mean and the variance of X∗ 1 is simple since it takes Xi, 1 ≤ i ≤ N, with probability 1/N, so EXX∗ 1 = 1 N N X i=1 Xi = X̄N i.e. the conditional expected value of X∗ 1 is the sample mean of the original sample. We use EX to denote the conditional expected value when we condition with respect to X. Similarly, varX(X∗ 1 ) = 1 N N X i=1 X2 i − X̄2 N .
  • 293. MATHEMATICAL STATISTICS 21 Also, E[EXX∗ 1 ] = µ and E[varX(X∗ 1 )] = (N/(N − 1))σ2, where EX1 = µ and var(X1) = σ2. According to the central theorem (2.10) sup −∞x∞
  • 294.
  • 295.
  • 296.
  • 297.
  • 299.
  • 300.
  • 301.
  • 302.
  • 303. → 0 as N → ∞. The bootstrap version of this result is (2.11) sup −∞x∞
  • 304.
  • 305.
  • 306.
  • 307.
  • 308. PX ( N−1/2 L X i=1 (X∗ i − X̄N )/σ̄N ≤ x ) − P ( N−1/2 N X i=1 (Xi − µ)/σ ≤ x )
  • 309.
  • 310.
  • 311.
  • 312.
  • 313. → 0 a.s., as min(N, L) → ∞. The theoretical mean and variance in (2.10) are replaced with the conditional mean and variance of the bootstrapped observations. The proof is very simple if we also assume that |X1|3 ∞. According to the Berry–Esseen theorem, there is an absolute constant c such that (2.12) sup −∞x∞
  • 314.
  • 315.
  • 316.
  • 317.
  • 318. PX ( N−1/2 L X i=1 (X∗ i − X̄N )/σ̄N ≤ x ) − Φ(x)
  • 319.
  • 320.
  • 321.
  • 322.
  • 323. ≤ c L1/2 EX|X∗ 1 − X̄N |3 σ̄3 N . We have by the definition of X∗ 1 , EX|X∗ 1 − X̄N |3 = 1 N N X i=1 |Xi − X̄N |3 . Using the law of large numbers, we have that X̄N → µ, σ̄N → σ and 1 N N X i=1 |Xi − X̄N |3 → E|X1 − µ|3 almost surely, so these variables are bounded with probability 1. Hence (2.12) implies that (2.13) sup −∞x∞
  • 324.
  • 325.
  • 326.
  • 327.
  • 328. PX ( N−1/2 L X i=1 (X∗ i − X̄N )/σ̄N ≤ x ) − Φ(x)
  • 329.
  • 330.
  • 331.
  • 332.
  • 333. → 0 a.s., as L → ∞. Now we get (2.11) from (2.10) and (2.13). The proof of (2.11) is typical to deal with theoretical issues of the bootstrap. It is shown that the test statistic and its bootstrap version have the same limit distribution. Das Gupta (2008) has a lengthy discussion of bootstrapping the mean. He provides theoretical evidence that the rate of convergence in (2.11) is better than in (2.10). This has been confirmed empirically in the literature. Please set up Monte Carlo simulations to provide numerical evidence that the rate of convergence is better in (2.11). Just provide some graphs. Bootstrapping the mean provides theoretical results but not too useful in real life applications due to the enormous amount of results on the sample mean. We illustrate on the Kolmogorov–Smirov statistic why the bootstrap works. We already obtained TN,1 of Problem 1. Now we obtain it bootstrap version from the sample X∗ 1 , X∗ 2 , . . . , X∗ L. Their empirical distribution function is F∗ L(x) = 1 L L X i=1 I{X∗ i ≤ x} and now we can define T∗ L,1 as T∗ L,1 = max −∞x∞ L1/2 |F∗ L(x) − FN (x)|.
  • 334. 22 LAJOS HORVÁTH We provide some heuristic arguments proving (2.14) sup −∞x∞ |
  • 335.
  • 336.
  • 337.
  • 338. PX{T∗ L,1 ≤ x} − P sup 0≤u≤1 |B(u)| ≤ x
  • 339.
  • 340.
  • 341.
  • 342. → 0 a.s., as min(N, L) → ∞, where B is a Brownian bridge. By the weak convergence of the empirical process L1/2 (F∗ L(x) − FN (x)) ≈ B(FN (x)) and by (2.9) and the almost sure continuity of B, B(FN (x)) ≈ B(F(x)). Since F is continuous, sup−∞x∞ |B(F(x))| = sup0≤u≤1 |B(u)|. Hence we have (2.14). Please note that (2.14) holds regardless we have H0 or the alternative. Hence we have (2.4) and (2.5). In case of the Kolmogorov–Smirnov statistic the limit distribution has a known form. Hence you could investigate the question if the bootstrap method provides a better approximation for the distribution of TN,1 than the limit distribution. There are several cases where the limit distribution of the test statistic depends on the underlying distribution of the data. In this case the bootstrap might be the only method to get critical values to test our null hypothesis. Now we return to the test for exponentiality. From the bootstrap sample we estimate λ by λ̂∗ L = 1 L L X i=1 X∗ i , so the bootstrapped parameter could be defined as L1/2(F∗ L(x) − F0(x, λ̂∗ L)). Following the proof in Section 1.2, one can show that sup0≤x∞ L1/2|F∗ L(x) − F0(x, λ̂∗ L)| converges to the limit if TN,8. So it works under the null. Now the bad news. Under the alternative (2.15) sup 0≤x∞ L1/2 |F∗ L(x) − F0(x, λ̂∗ L)| P → ∞. The proof of (2.15) is the same what we did in Section 1.2. Hence with method we will not be able to reject exponantiality even if it is false. The problem is that we should use the distribution func- tion of the bootstrap sample, which is FN (x) which does not contain any place for the parameter λ. We need to do something else which will be done in the next section. Next we consider we consider the two sample problem. If we select from the X sample with replacement and separately from the Y’s with replacement, the procedure will not work. The distribution of the bootstrapped X sample will be around F, while the distribution of the Y sample will be close to H. This means that TN,M,1 and its bootstrapped version, T∗ N∗,M∗,1 behave exactly in the same way. Hence (2.16) PX,Y{T∗ N∗,M∗,1 K} → 1 a.s. for all K. Now we combine the two samples into 1, Z = (X1, X2, . . . , XN , Y1, Y2, . . . , YM )T . We select from Z with replacement, resulting in Z∗ = (Z∗ 1 , Z∗ 2 , . . . , Z∗ L). Due to the random selection with replace- ment, conditionally on Z, these are independent and identically distributed with PZ{Z∗ 1 ≤ x} = 1 N + M N+M X i=1 I{Zi ≤ x} = N N + M 1 N N X i=1 I{Xi ≤ x} ! + M N + M 1 M M X i=1 I{Zi ≤ x} ! . Let N∗ = bLN/(N + M)c and M∗ = L − N∗, where b·c denotes the integer part. Now X = (Z∗ 1 , Z∗ 2 , . . . , ZN∗ ) and Y∗ = (Z∗ N∗+1, Z∗ N∗+2, . . . , Z∗ L∗ ). Regardless if the original samples satisfy
  • 343. MATHEMATICAL STATISTICS 23 the null or the alternative hypothesis, X∗ and Y∗ have the same distribution (conditionally on Z, the empirical distribution function of Z). Hence under the null as well as under the alternative sup −∞x∞
  • 344.
  • 345.
  • 346.
  • 347. PZ{T∗ N∗,M∗,1 ≤ x} − P{ sup 0≤u≤1 |B(u)| ≤ x}
  • 348.
  • 349.
  • 350.
  • 351. → 0 a.s., where T∗ N∗,M∗,1 is the bootstrap version of TN,M,1 and B is a Brownian bridge. Hence both (2.4) and (2.6) are satisfied. 2.2. Parametric bootstrap. Now we discuss how to get critical values for TN,8. Let X∗ 1 , X∗ 2 , . . . , X∗ L be independent, identically distributed random variables with distribution function F0(x, λ̂N ), i.e. given X, these simulated random variables are independent and PX{X∗ i ≤ x} = F0(x, λ̂N ). We need to estimate the parameter, for the bootstrap this is denoted by λ̂∗ N . Clearly, we need to use λ̂∗ L = 1 L L X i=1 X∗ i . Now the bootstrap version of the parameter estimated empirical process is L1/2 (F∗ L(x) − F0(x, λ̂∗ L)). We note that using integration by parts, (2.17) λ̂N = Z ∞ 0 (1 − F0(u, λ̂N )du and (2.18) λ̂∗ L = Z ∞ 0 (1 − F∗ L(u))du. Using again the mean value theorem we get for almost all realization of bX we have uniformly in x L1/2 (F∗ L(x) − F0(x, λ̂∗ L)) = L1/2 (F∗ L(x) − F0(x, λ̂N ) + F0(x, λ̂N ) − F0(x, λ̂∗ L)) = L1/2 (F∗ L(x) − F0(x, λ̂N )) + g1(x, λ̂N ) Z ∞ 0 L1/2 (F∗ L(u) − F0(u, λ̂N ))du + oX(1), where g1(u, λ) = ∂F0(u, λ) ∂λ . Conditionally on X, L1/2(F∗ L(x) − F0(x, λ̂N )) ≈ B(F0(x, λ̂N )). By the strong law of large numbers λ̂N → µ a.s., where µ = EX1, which is true under the null and the alternative. By the continuity of the Brownian bridge, B(F0(x, λ̂N )) ≈ B(F0(x, µ)). Thus we get for almost all realization of X that L1/2 (F∗ L(x) − F0(x, λ̂∗ L)) D[0,∞] −→ B(F0(x, µ)) + g1(x, µ) Z ∞ 0 B(F0(u, µ)). If H0 holds, then µ = λ0, so TN,7 and sup0≤x∞ L1/2|F∗ L(x) − F0(x, λ̂∗ L)| have the same limit distribution. Under the alternative sup 0≤x∞ L1/2 |F∗ L(x) − F0(x, λ̂∗ L)| = OPX (1). Hence this method provides a correct resampling for TN,7. The more general case can be handled in the same way.
  • 352. 24 LAJOS HORVÁTH 2.3. Resampling without replacement (the permutation method). First we show that the permutations of the original sample can be used for the change point problem. Let Z∗ 1 , Z∗ 2 , . . . , Z∗ N be a random permutation of Z = (Z1, Z2, . . . , ZN ). We note that the permuted variables (selection without replacement) are not independent, but the dependence is week. It is easy to see that P{Z∗ i = Zj} = 1 N , 1 ≤ i, j ≤ N and P{Z∗ i = Zj, Z∗ k = Z`} = 1 N(N − 1) , 1 ≤ i, j, k, ` ≤ N, i 6= k, j 6= `. Hence |P{Z∗ i = Zj, Z∗ k = Z`} − P{Z∗ i = Zj}P{Z∗ k = Z`}| = 1 N2(N − 1) , which is much smaller than P{Z∗ i = Zj}. Also, PZ{Z∗ i ≤ x} = FN (x), 1 ≤ i ≤ N, and FN (x) = 1 N N X i=1 I{Zi ≤ x}. The permuted statistic is T∗ N,11 = max 1≤k≤N 1 σ̄N N−1/2
  • 353.
  • 354.
  • 355.
  • 356.
  • 358.
  • 359.
  • 360.
  • 361.
  • 362. . Using the weak dependence between the permuted variables, one can show that (2.19) sup −∞x∞
  • 363.
  • 364.
  • 365.
  • 366. PZ{T∗ N,11 ≤ x} − P sup 0≤u≤1 |B(u)| ≤ x
  • 367.
  • 368.
  • 369.
  • 370. → 0 a.s., where B is a Brownian bridge. Under the alternative there is a change at k∗ so the distribution of the permuted random variables is a linear combination of two distributions, but the permuted variables will have the same marginal distribution. Namely, for all 1 ≤ j ≤ N, PZ{Z∗ j ≤ x} = FN (x) = 1 N k∗ X i=1 Zi + N X i=k∗+1 Zi ! = k∗ N 1 k∗ k∗ X i=1 Zi ! + N − k∗ N 1 N − k∗ N X i=k∗+1 Zi ! . Hence if there is no change in the means of the observations, then we still have that σ̄N → τ2 and τ2 might be different from σ2. However, this is still enough to claim that T∗ N,11 = OPZ (1), so the permutation method can be used to find critical values for the change point statistics based on partial sum processes. The permutation resampling was considered by Fisher (without calling it resampling). The idea is very simple. We permute Z resulting in Z∗ = (Z∗ 1 , Z∗ 2 , . . . , Z∗ N+M ). These are not independent since we selected without replacement but they have the same distribution, conditionally on Z, PZ{Z∗ i ≤ x} = 1 N + M N+M X `=1 I{Z` ≤ x}. Now X∗ = (Z∗ 1 , Z∗ 2 , . . . , Z∗ N ) and Y∗ = (Z∗ N+1, Z∗ N+2, . . . , Z∗ N+M ). Regardless if the original data satisfy the null hypothesis or the alternative, the marginal distributions, conditionally on Z, are the same. Hence under the null as well as under the alternative sup −∞x∞
  • 371.
  • 372.
  • 373.
  • 374. PZ{T∗ N∗,M∗,1 ≤ x} − P{ sup 0≤u≤1 |B(u)| ≤ x}
  • 375.
  • 376.
  • 377.
  • 379. MATHEMATICAL STATISTICS 25 where T∗ N∗,M∗,1 is the bootstrap version of TN,M,1 and B is a Brownian bridge. Hence both (2.4) and (2.6) are satisfied. So far we only had to assume that min(N, L) → ∞. Of course, choosing a much larger L would result in lots of ties (we try to imitate continuous distributions which do not have ties). Since we apply limit theorems, L cannot be small. Usually, L = N is used. However, in case of extremes it might not work. 2.4. Bootstrapping the largest observation. Das Gupta (2008) contains an example when the bootstrap does not work when we sample with replacement. The original sample size is N and we generated a bootstrap sample X∗ 1 , X∗ 2 , . . . , X∗ N . Let XN,N = max(X1, X2, . . . , XN ) be the maximum in the original sample and X∗ N,N = max(X∗ 1 , X∗ 2 , . . . , X∗ N ). The bootstrap works in this case if XN,N = X∗ N,N , but the probability that X∗ N,N XN,N is (1 − 1/N)N → 1/e, N → ∞. This is clear, since during the selection, we cannot pick XN,N , so we need to choose from N − 1 possibilities. This example suggests if we increase the bootstrap sample size, we might hit XN,N . However, this is not the case. Let assume that X1, X2, . . . , XN be independent and identically distributed exponential(1) random variables, i.e. F(t) = ( 0, if t 0 1 − e−t , if t ≥ 0. Let XN,N be the largest order statistics and YN = XN,N − log N. As we did before P{YN ≤ t} = P{XN,N ≤ t + log N} = FN (t + log N) and FN (t + log N) =      0, if t − log N 1 − e−t N N , if t ≥ − log N. Thus we get that for all −∞ t ∞ lim N→∞ P{YN ≤ t} = H(t), where H(t) = e−e−t , −∞ t ∞. We can do the same for the bootstrap sample. Let Y ∗ L = max(X∗ 1 , X∗ 2 , . . . , X∗ L) − log L. As before, PX{Y ∗ L ≤ t} = FL N (t + log L), where, as before, FN (t) is the empirical distribution function of X = (X1, X2, . . . , XN ). According to the law of the iterated logarithm for the empirical process we have (2.20) lim sup N→∞ N 2 log log N 1/2 sup −∞t∞ |FN (t) − F(t)| = 1 a.s. Next we write FL N (t − log L) = (F(t + log L) + FN (t + log L) − F(t + log L))L . If t is fixed and N is so large that t + log L 0 F(t + log L) + FN (t + log L) − F(t + log L) = 1 − e−t L + FN (t + log L) − F(t + log L) = 1 − 1 L e−t (1 + L(FN (t + log L) − F(t + log L))) .
  • 380. 26 LAJOS HORVÁTH If want use the formula 1 − xn n → e−x , if xn → x with xN,L = e−t (1 + L|FN (t + log L) − F(t + log L)|) we need that L|FN (t + log L) − F(t + log L)| → 0 a.s. In light of (2.20) this is satisfied if (2.21) L N log log N 1/2 → 0. According to (2.21), the bootstrap with replacement works if L is not large, essentially L must be less than N1/2. So more is not better in this case. Please note that if N = 100, than we should use a bootstrap sample size less than 10. Doing asymptotic theory with 10 observations is somewhat questionable. The rate of convergence to extreme values could be very slow so the bootstrap might not be better than using the limit results for the original sample. In any case, if (2.21) holds, the lim N→∞ PXP{Y ∗ L ≤ t} = H(t), a.s. So far we bootstrapped the observations directly. Now we consider the case when the observations are not identically distributed so selection with or without replacement will not work. All the bootstrap methods we discussed so far produced identically distributed random variables. 2.5. Residual bootstrap. We illustrate this method by linear models. We assume that (2.22) yi = x i β0 + i, 1 ≤ i ≤ N, where xi = (xi,1, xi,2, . . . , xi,d) ∈ Rd and β0 ∈ Rd. As usual, y1, y2, . . . , yN and x1, x2, . . . , xN are observed. We note that in statistics the xi’s are given numbers, while in econometrics, they modeled as random variables. The errors, 1, 2, . . . , N are unobservable random errors. We assume that 1, 2, . . . , N are independent and identically distributed random variables with (2.23) Ei = 0 and 0 E2 i = σ2 ∞. The parameters of interest are β0 and σ2. We estimate β0 using the least squares β̂N . The residuals defined by ˆ i = yt − x i β̂N , 1 ≤ i ≤ N. We collect some facts from linear models. First we write (2.22) in matrix form. Let YN = (y1, y2, . . . , yN ), EN = (1, 2, . . . , N ), β = (β1, β2, . . . , βd) and XN =         x1,1 x1,2 . . . x1,d x2,1 x2,2 . . . x2,d . . . . . . . . . . . . . . . . . . xN,1 xN,2 . . . xN,d         The matrix form of (2.22) is Y = XN β + EN . The least square estimator β̂ is defined by the minimization problem β̂N = argminβ kYN − XN βk2 ,
  • 381. MATHEMATICAL STATISTICS 27 where k · k is the Euclidian norm (some of the squares of the elements of the matrix). We obtain the solution in explicit form: β̂N = X X −1 X Y, assuming that X N XN is nonsingular. Usually, the properties of β̂N are established assuming the normality of the errors. If normality of the errors is assumed, then one possibility is the parametric bootstrap. However, the bootstrap in this case is not very useful, since in this case the normality of β̂N is proven, so only the estimation of σ is needed for statistical inference. If the errors are not necessarily normal, then β̂N is still asymptotically normal. Namely, (2.24) N1/2 (β̂N − β0) D → Nd(0, σ2 A−1 ), where (2.25) lim N→∞ 1 N X X = A, β0 is the true value of the parameter and Nd denotes a d–dimensional normal random variable. It is easy to interpret the condition in (2.25): (2.26) lim N→∞ 1 N N X `=1 xi,`x`,j = ai,j, A = {ai,j, 1 ≤ i, j ≤ d}. If the xi’s are modeled as random variables, then (2.26) is just the law of large numbers. Hence (2.25) is a very natural assumption in linear models. If we go back the definition of the residuals, then we have ˆ i = i − x i (β̂N − β0). Let zi = (x i , ˆ i), Z = {zi, 1 ≤ i ≤ N}. We choose L-times, independently of each other with replacement from {zi, 1 ≤ i ≤ N}, resulting in {z∗ i , 1 ≤ i ≤ L}. Now we define (2.27) y∗ i = (x∗ i ) β̂N + ˆ ∗ i , 1 ≤ i ≤ L. Using the same notation as before but putting up ∗ we write (2.28) Y∗ = X∗ β̂N + E∗ and therefore β̂ ∗ N = (X∗ ) X∗ −1 (X∗ ) Y∗ . Using (2.27) we get that β̂ ∗ N = β̂N + (X∗ ) X∗ −1 (X∗ ) E∗ . Using again the law of large numbers lim L→∞ 1 L (X∗ ) X∗ = A a.s. We note that EZ (X∗ ) X∗ −1 (X∗ ) E∗ (X∗ ) X∗ −1 (X∗ ) E∗ = (X∗ ) X∗ −1 (X∗ ) E h E∗ (E∗ ) i X∗ (X∗ ) X∗ −1 ≈ L2 A−1 (X∗ ) EZ h E∗ (E∗ ) i X∗ A−1 . Conditionally on Z, ∗’s are independent and identically distributed and therefore EZ h E∗ (E∗ ) i = EZ(∗ 1)2 IN×N
  • 382. 28 LAJOS HORVÁTH where IN×N is the N × N identity matrix. Thus we have L2 A−1 (X∗ ) EZ h E∗ (E∗ ) i X∗ A−1 ≈ L2 EZ(∗ 1)2 A−1 (X∗ ) IN×N X∗ A−1 ≈ LEZ(∗ 1)2 A−1 . It is easy to see that EZ(∗ 1)2 → σ2 a.s. Thus we conjecture that for almost all realization of Z (2.29) L1/2 (β̂ ∗ N − β̂N ) D → Nd(0, σ2 A−1 ). The proofs of (2.24) and (2.29) are essentially the same. We showed that EZβ̂ ∗ N = β̂N EZ h β̂ ∗ N (β̂ ∗ N ) i = 1 L EZ(∗ 1)2 A−1 + oZ 1 L = σ2 L A−1 + oZ 1 L , hence the first (mean) and the second order (variance) properties of β̂ ∗ N and β̂N are practically the same. Of course, this is not the proof of the normality of the estimators but these are necessary results for normality. Since β̂N always converges to the true value of the parameter of the linear model, the limit theorem 2.29 can be used to justify this bootstrap method for hypothesis testing. We only need that min(N, L) → ∞. The resampling of the residuals is also a popular technique in time series analysis. For the sake if simplicity we consider an autoregressive AR(1) sequence. We assume that {i, −∞ t ∞} are independent and identically distributed random variables. The AR(1) sequence is the solution of the recursion (2.30) yi = ρyi−1 + i, −∞ i ∞. If |ρ| 1 and |0|δ ∞ with some δ 0, then (2.30) has a unique solution given by (2.31) yi = X `=0 ρ` i−`, −∞ i ∞. We can estimate ρ with ρ̂N , the least square estimator. It is established in time series that N1/2(ρ̂N − ρ) is asymptotically normal. Observing y1, y2, . . . , yN , we define the residuals as ˆ i = yi − ρ̂N yi−1, 2 ≤ i ≤ N. We select from {ˆ 2, ˆ 3, . . . , ˆ N } with replacement, creating {ˆ ∗ 1, ˆ ∗ 2, . . . , ˆ ∗ L}. If the statistical inference is about the i’s, we are done. If our interest is in y1, y2, . . . , yN , then we define the boostrap sample (2.32) y∗ i = ρy∗ i−1 + ˆ ∗ i , 2 ≤ i ≤ L, with some initial value y∗ 0. However, of i is small the solution of (2.32) is certainly not close to the solution of (2.30) given by the infinite sum in (2.31). Hence we do not use all y∗ i ’s only if i ≥ L0. This we get a bootstrap sample of size L − L0. L0 is the burn in period and the practical advise is that L0 = 25 or 50. Next we discuss how the bootstrap can help to construct confidence bands.
  • 383. MATHEMATICAL STATISTICS 29 2.6. Confidence bands. We recall the TTT function z(x) from Section 1.5. We want to define to random functions, zN,1(x) and zN,2(x) such that lim N→∞ P {zN,1(x) ≤ z(x) ≤ zN,2(x) for all x ∈ [0, a]} = 1 − α. If we choose zN,1(x) = z(x) − cN−1/2 and zN,2(x) = z(x) + cN−1/2 According to our theory, lim N→∞ P {zN,1(x) ≤ z(x) ≤ zN,2(x) for all x ∈ [0, a]} = P{ sup 0≤x≤a |Γ(x)| ≤ c}, so c = c(1 − α) is coming from the equation P sup 0≤x≤a |Γ(x)| ≤ c = 1 − α. However, the computation of the distribution distribution function P{sup0≤x≤a |Γ(x)| ≤ c} is hopeless since it depends on the unknown F. Let X∗ 1 , X∗ 2 , . . . , X∗ L be the bootstrap sample and define the bootstrap version of zN (x) by ẑ∗ L(x) = 1 1 − F∗ L(x) Z x 0 (1 − F∗ L(u))du, where, as before, F∗ L(u) = 1 L L X i=1 I{X∗ i ≤ u}. Using our previous arguments, one can show that PZ{ sup 0≤x≤a L1/2 |ẑ∗ L(x) − ẑN (x) ≤ c} → P sup 0≤x≤a |Γ(x)| ≤ c a.s., hence the bootstrap can be used to estimate c(1−α) from the sample. We obtain ĉN,L,R = ĉN,L,R(1− α) as our estimate, where N is the original sample size, L is the bootstrap sample size and R is the number of the repetations of the bootstrap procedure. We get that if ẑN,1(x) = z(x)−ĉN,L,RN−1/2 and ẑN,2(x) = z(x) + ĉN,L,RN−1/2, then lim min(N,L,R)→∞ PX {zN,1(x) ≤ z(x) ≤ zN,2(x) for all x ∈ [0, a]} = 1 − α a.s. The construction of confidence bands for a regression line is also a popular question. Let a(x) = β0 + β1x, b ≤ x ≤ d. We observe the line at N points giving the observations yi, xi, 1 ≤ i ≤ N yi = a(xi) + i = β0 + β1xi + i. We wish to define aN,1(x) and aN,2(x) from the sample that lim N→∞ P{aN,1(x) ≤ a(x) ≤ aN,2(x) for all x ∈ [b, d]} = 1 − α. We try aN,1(x) = β̂0,N + β̂1,N − cN−1/2 and aN,1(x) = β̂0,N + β̂1,N + cN−1/2 , where β̂0,N and β̂1,N are the least squares estimators for the parameters. Then P {aN,1(x) ≤ a(x) ≤ aN,2(x) for all x ∈ [b, d]} = P ( sup b≤x≤d |N1/2 (β0 − β̂0,N ) + N1/2 (β1 − β̂1,N )x| ≤ c ) . We already know that N1/2 (β0 − β̂0,N ), N1/2 (β1 − β̂1,N ) D → N2,
  • 384. 30 LAJOS HORVÁTH where N2 = (N1, N2) is a bivariate normal random variable. Hence Γ0 (x) = N1 + N2x, b ≤ x ≤ d is a Gaussian process and we need to find c = c(1 − α) such that (2.33) P ( sup b≤x≤d |Γ0(x)| ≤ c ) = 1 − α. We have at least to possibilities to get c in (2.33). The mean of N2 is 0 and it covariance matrix is known explicitly and it is easy to estimate from the sample. Hence we can easily estimate N2 and we could use Monte Carlo simulations. The other possibility is using the bootstrap as it was done for z(x). To reflect the variability of the data even in the bands, one might try using the limit distribution of sup b≤x≤d |N1/2(β0 − β̂0,N ) + N1/2(β1 − β̂1,N )x| (var(N1/2(β0 − β̂0,N ) + N1/2(β1 − β̂1,N )x))1/2 . There was restriction on the choice of L, the bootstrap sample size in case of extreme values. It turns out that there is no restriction on L if we bootstrap a statistic with normal limit (or limit derived from normal random variables and/or processes).
  • 385. MATHEMATICAL STATISTICS 31 3. Density estimation So far we have discussed the estimation of the distribution function and the theory related to it. These are fundamental results and the distribution of several statistics can be derived from this theory. In some cases, looking at the data, we want to guess the distribution of the underlying ob- servations. This is impossible from the distribution function and from its estimate to do this since all distributions look the same. However, the shape of densities is relatively unique and everybody could see the difference between the shapes of exponential and normal densities. So the estimation of densities could provide important tools for statistical analysis. However, they are rarely used in hypothesis testing, for example, and their rate of convergence to the limit can be extremely slow. A large part of the statistical literature shows efforts how to avoid the estimation of densities. The main problem is that the density, as a derivative, is a limit. In real life, we have trouble to estimate limits. We discuss several versions of density estimation. 3.1. Kernel density estimator. Let X1, X2, . . . , XN be independent and identically distributed random variables with distribution function F. The density function f is defined by f(t) = F0 (t). The kernel density estimate was introduced by Rosenblatt (1956) and Parzen (1962) and it is defined by ˆ fN (t) = 1 NhN N X i=1 K t − Xi hN , (3.1) where hN is the bandwidth (analogous to length of bins in a histogram) and K(·) is the kernel. One natural requirement is that ˆ fN (t) is a density function for each N. This requirement is satisfied if (3.2) K is a density function. A function is a density function if it is non negative and its integral on the line is 1. It is clear from (3.2) that ˆ fN (t) ≥ 0 and Z ∞ −∞ ˆ fN (t)dt = 1 NhN N X i=1 Z ∞ −∞ K t − Xi hN dt and by change of variable, u = (t − Xi)/hN , Z ∞ −∞ K t − Xi hN dt = hN Z ∞ −∞ K(u)du = hN . Hence Z ∞ −∞ ˆ fN (t)dt = 1. Assumption (3.2) is attractive but it will limit how small of the bias of ˆ fN (t) can be. We will have two assumptions on the window (bandwidth) hN : (3.3) hN → 0 and (3.4) NhN → ∞, as N → ∞. Assumptions (3.3) and (3.4) will require careful balancing act. According to (3.3), hN should be small but according to (3.4), hN cannot be too small. These conflicting requirements will lead us to the optimal choice of hN . We need (3.3) to get an asymptotically unbiased estimator,
  • 386. 32 LAJOS HORVÁTH i.e. E ˆ fN (t) → f(t). The assumption in (3.4) will imply that var( ˆ fN (t)) → 0. First we consider the behaviour of ˆ fN (t) at a fixed point t. Let Λ be a neighbourhood of t. It is easy to see that E ˆ fN (t) = 1 h EK t − X1 h , since the observations are identically distributed. By definition, 1 h EK t − X1 h = 1 h Z ∞ −∞ K t − u h f(u)du = Z ∞ −∞ K(x)f(t − xh)dx. Next we show that (3.5) Z ∞ −∞ K(x)f(t − xh)dx → f(t). We assume (3.6) sup −∞u∞ K(u) ∞, (3.7) sup −∞x∞ f(x) ∞, i.e. K and f are bounded functions. Also, (3.8) f(u) is continuous if u ∈ Λ. Using (3.2), (3.3) and (3.6)–(3.8) we show that (3.5) holds. Let 0. We choose A so large that Z −A −∞ K(x)f(t − xh)dx ≤ sup −∞u∞ f(u) Z −A −∞ K(x) ≤ and Z ∞ A K(x)f(t − xh)dx ≤ sup −∞u∞ f(u) Z ∞ A K(x) ≤ . Using the continuity assumed in (3.8) with (3.3), there is an integer N0 such that sup −A≤x≤A |f(t − xh) − f(t)| ≤ , if N ≥ N0. Hence for N ≥ N0 we have
  • 387.
  • 388.
  • 389.
  • 390. Z A −A K(x)(f(t − xh) − f(t))dx
  • 391.
  • 392.
  • 393.
  • 394. ≤ sup −A≤u≤A |f(t − uh) − f(t)| Z A −A K(x) ≤ sup −A≤u≤A |f(t − uh) − f(t)| Z ∞ −∞ K(x) ≤ . Also, by the choice of A we get
  • 395.
  • 396.
  • 397.
  • 399.
  • 400.
  • 401.
  • 402. f(t) ≤ , completing the proof of (3.5). Hence (3.9) E ˆ fN (t) → f(t),
  • 403. MATHEMATICAL STATISTICS 33 i.e. the estimator is asymptotically unbiased. In the applications the rate of convergence will be important. We need to increase our assumptions on the smoothness of K and f: (3.10) Z ∞ −∞ x2 K(x)dx ∞, (3.11) sup −∞x∞ |f0 (x)| ∞ and sup −∞x∞ |f00 (x)|, (3.12) f00 (u) is continuous, if u ∈ Λ. Using a two term Taylor expansion we obtain that Z ∞ ∞ K(x)(f(t − xh) − f(t))dx = −h Z ∞ −∞ K(x)f0 (t)xhdx + 1 2 Z ∞ −∞ K(x)(xh)2 f00 (ξ(x))dx, where ξ(x) satisfies |ξ(x) − t| ≤ |x|h. Using now (3.10)–(3.12), repeating our previous arguments we can show that Z ∞ −∞ K(x)x2 f00 (ξ(x))dx → f00 (t) Z ∞ −∞ x2 K(x)dx. Thus we conclude E ˆ fN (t) = f(t) − hf0 (t) Z ∞ −∞ xK(x)dx + h2 2 f00 (t) Z ∞ −∞ x2 K(x)dx + o(h2 ). Since we want to have small bias, from now on we assume that (3.13) Z ∞ −∞ xK(x)dx = 0. If K is symmetric around 0, assumption (3.13) holds. Under (3.13) E ˆ fN (t) = f(t) + h2 2 f00 (t) Z ∞ −∞ x2 K(x)dx + o(h2 ). Now we turn to the computation of the variance. Since the observations are independent and identically distributed we get that var( ˆ fN (t)) = 1 N2h2 N X i=1 var K t − Xi h = 1 Nh2 var K t − X1 h . and 1 h var K t − X1 h = 1 h E K t − X1 h 2 − 1 h EK t − X1 h 2 . We already showed that EK t − X1 h = O(h). (3.14) Repeating our previous calculations we conclude 1 h E K t − X1 h 2 = 1 h Z ∞ −∞ K2 t − x h f(x)dx = Z ∞ −∞ K2 (u)f(t − uh)du
  • 404. 34 LAJOS HORVÁTH = f(t) Z ∞ −∞ K2 (u)du + o(1). Summarizing our calculations we have that var( ˆ fN (t)) = 1 Nh f(t) Z ∞ −∞ K2 (u)du + o(1) and therefore var( ˆ fN (t)) → 0 if and only if (3.4) holds. Since ˆ fN (t) is biased, the mean square error is used to evaluate its performance. By definition, MSE( ˆ fN (t)) = var( ˆ fN (t)) + (E ˆ fN (t) − f(t))2 = 1 Nh f(t) Z ∞ −∞ K2 (u)du + h4 4 (f00 (t))2 Z ∞ −∞ u2 K(u)du 2 + o 1 Nh + o(h4 ), if (3.13) holds. Now it is easy to find h which gives the smallest value of MSE( ˆ fN (t)), at least asymptotically: hopt = c0N−1/5 , where c0 = (c1/c2)1/5 c1 = f(t) Z ∞ −∞ K2 (u)du and c2 = (f00 (t))2 Z ∞ −∞ u2 K(u)du 2 . The result on the optimal h looks nice but it is not too useful. It depends on t but in our definition of the kernel density estimator, the window depends only on the sample size. Also, since f is unknown, we cannot compute c0. However, we have the interesting observation that the optimal hN is proportional to N−1/5 so it will be crucial for any theory to cover this case. We wish to use an estimator which minimizes the mean square error, i.e. we choose h and K, where min K min h E( ˆ fN (t) − f(t))2 is reached. We already found hopt and plugging this value into the formula for the MSE, we need to maximize this expression with respect to K. This is hard but the value of the MSE will not change too much. Hence the crucial question is the choice of h. There are some kernels which are often used in practice: K(u) = 1 2c I{−c ≤ u ≤ c} (uniform density), K(u) = 1 (2π)1/2 e−u2/2 (normal density), K(u) = (1 − |u|)I{|u| ≤ 1}, (triangular or Bartlett), K(u) = 3 4 (1 − u2 )I{|u| ≤ 1} (Epanechnikov kernel), K(u) = 30 20 √ 5 (5 − u2 )I{− √ 5 ≤ u ≤ √ 5} (Epanechnikov kernel) and K(u) = 1 2π sin(u/2) u/2 2 , −∞ u ∞, K(0) = 1 2π (Fejér kernel). All kernels have finite support except the normal and the Fejér. The kernel densities based on the normal and the Fejér kernels are infinitely times differentiable. The others might not provide dif- ferentiable or only few times differentiable density estimates. The Epanechnikov kernel minimizes the mean square error. The Fejér kernel is coming from the theory of Fourier analysis. In practice,
  • 405. MATHEMATICAL STATISTICS 35 there is little difference between estimators using different kernels. Next we consider the limit distribution of ˆ fN (t). We show that (Nh)1/2( ˆ fN (t) − f(t)) is asymptot- ically normal for each t. We decompose the difference between the empirical and the true density as ˆ fN (t) − f(t) = [ ˆ fN (t) − E ˆ fN (t)] + [E ˆ fN (t) − f(t)], the random error and the numerical bias. The bias term will not play any role in the limit if (Nh)1/2 h2 → 0, i.e. (3.15) hN1/5 → 0. This means that using the optimal window the asymptotic mean of (Nh)1/2( ˆ fN (t) − f(t)) will not be 0. This is natural since the optimal window will give the same order of the square of the bias and the variance of ˆ fN (t). Since we already know the exact behaviour of the bias term, we consider the normality of ˆ fN (t) − E ˆ fN (t) = 1 Nh N X i=1 K t − Xi h − EK t − Xi h . We introduce ηi,N = 1 h1/2 K t − Xi h − EK t − Xi h , which are independent and identically distributed random variables. Also, by (3.6) these are bounded random variables, but the bound depends on N. Regardless, we can use Liapounov’s condition (cf. DasGupta, 2008 p. 64) to establish normality. Now var(ηi,N ) = 1 h EK2 t − Xi h − EK t − Xi h 2 # . We showed that EK t − Xi h = O(h). Repeating our previous arguments we get that 1 h EK2 t − Xi h = 1 h Z ∞ −∞ K2 t − x h f(x)dx = Z ∞ −∞ K2 (u)f(t − uh)du = f(t) Z ∞ −∞ K2 (u)du + o(1). Now we compute Eη4 i,N (essentially we only need an upper bound). The only reason why we compute that 4th moment because in this case we can get the exact asymptotic and the method can be used in other cases as well. Namely, Eη4 i,N = 1 h2 K4 t − Xi h − 4EK3 t − Xi h EK t − Xi h + 6EK2 t − Xi h EK t − Xi h 2 − 4EK t − Xi h EK t − Xi h 3 + EK t − Xi h 4 .
  • 406. 36 LAJOS HORVÁTH For ` = 1, 2, 3 and 4 we have EK` t − Xi h = Z ∞ ∞ K` t − x h fxdx = h Z ∞ −∞ K` (u)f(t − uh)du = hf(t) Z ∞ −∞ K` (u)du + o(h). Thus we get Eη4 i,N = 1 h f(t) Z ∞ −∞ K4 (u)du + o 1 h . According to the Liapounov condition, we need to show that N X i=1 Eη4 i,N !1/4 N X i=1 Eη2 i,N !1/2 → 0. (3.16) We showed that N X i=1 Eη2 i,N !1/2 ≈ N1/2 and N X i=1 Eη4 i,N !1/4 ≈ N1/4 h−1/4 , and therefore (3.16) holds if Nh → ∞. Thus we get the asymptotic normality of (Nh)1/2( ˆ fN (t) − E ˆ fN (t)): (Nh)1/2 ( ˆ fN (t) − E ˆ fN (t)) D → N 0, f(t) Z ∞ −∞ K2 (u)du , (3.17) where N denotes a normal random variable. The normalization in the central limit theorem in (3.17) shows that the rate of convergence is always slower than, for example, the convergence of the empirical distribution function to the theoretical one. Also, we need to choose the kernel K and the window h. We also see that the optimal window h ≈ N−1/5 will give a central limit theorem for (Nh)1/2( ˆ fN (t) − f(t)) but the expected value of the limiting normal will not be 0. We are interested in f(t), as a function of t, so we would like to estimate at several points simulta- neously. First we consider the correlation between (Nh)1/2( ˆ fN (t) − E ˆ fN (t)) and (Nh)1/2( ˆ fN (s) − E ˆ fN (s)). Using independence we get that E (Nh)1/2 ( ˆ fN (t) − E ˆ fN (t))(Nh)1/2 ( ˆ fN (s) − E ˆ fN (s)) = 1 Nh N X i=1 N X j=1 E K t − Xi h − EK t − Xi h K s − Xj h − EK s − Xj h = 1 Nh N X i=1 E K t − Xi h − EK t − Xi h K s − Xi h − EK s − Xi h = 1 h E K t − X1 h − EK t − X1 h K s − X1 h − EK s − X1 h .
  • 407. MATHEMATICAL STATISTICS 37 Now E K t − X1 h − EK t − X1 h K s − X1 h − EK s − X1 h = E K t − X1 h K s − X1 h − EK t − X1 h EK s − X1 h = E K t − X1 h K s − X1 h + O(h2 ) on account of (3.14). Following our earlier calculations we get that E K t − X1 h K s − X1 h = Z ∞ −∞ K t − x h K s − x h f(x)dx = h Z ∞ −∞ K(u)K u + t − s h f(t − uh)du. Since K is integrable on the line, if t 6= s, then for all u (3.18) K u + t − s h → 0, since |t − s|/h → ∞. On account of (3.6) and (3.7) we have that 0 ≤ K(u)K u + t − s h f(t − uh) ≤≤ sup x K(x) sup x f(x) K(u), K is inegrable on R and by (3.18) K(u)K u + t − s h f(t − uh) → 0 for all u ∈ Rd , so by the Lebesgue dominated convergence theorem we have (3.19) Z ∞ −∞ K(u)K u + t − s h f(t − uh)du → 0, if t 6= s. Thus we proved that (Nh)1/2( ˆ fN (t) − E ˆ fN (t)) and (Nh)1/2( ˆ fN (s) − E ˆ fN (s)) are asymptotically uncorrelated if t 6= s. Now we try to establish the multivariate central limit theorem for density estimates. Let t1 t2 . . . tR and our conditions on the density are satisfied at these points. We show that (Nh)1/2 ( ˆ fN (t1) − E ˆ fN (t1)), (Nh)1/2 ( ˆ fN (t2) − E ˆ fN (t2)), (3.20) . . . , (Nh)1/2 ( ˆ fN (tR) − E ˆ fN (tR)) ! D → NR(0, Σ) where NR is an R dimensional normal random vector, Σ is a diagonal matrix and diag(Σ) = f(t1) Z ∞ −∞ K2 (u)du, f(t2) Z ∞ −∞ K2 (u)du, . . . , f(tR) Z ∞ −∞ K2 (u)du . There is a standard method how to prove the asymptotic normality of a random vector. This is the Cramér–Wold theorem (DasGupta, 2008, p. 9). According to this theorem we need to show that all linear combinations are asymptotically normal, i.e. for all λ1.λ2, . . . , λR R X `=1 λ`(Nh)1/2 ( ˆ fN (t`) − E ˆ fN (t`)) D → N 0, R X `=1 λ2 ` f(t`) Z ` −∞ K2 (u)du ! . (3.21)
  • 408. 38 LAJOS HORVÁTH The proof of (3.21) is also based on Liapounov theorem. Using the just proven asymptotic uncor- relation, we obtain that E R X `=1 λ` 1 h1/2 K t` − Xi h − EK t` − Xi h #2 = E 1 h1/2 ( R X `=1 λ` K t` − Xi h − EK t` − Xi h )#2 = 1 h E R X `=1 λ` K t` − Xi h − EK t` − Xi h #2 = R X `=1 λ2 ` f(t`) Z ∞ −∞ K2 (u)du + o(1). The Hölder inequality yields (x1 + x2 + . . . + xR)4 ≤ R4(x4 1 + x4 2 + . . . + x4 R) and therefore we get that E R X `=1 λ` 1 h1/2 K t` − Xi h − EK t` − Xi h #4 ≤ R4 R X `=1 λ4 ` E 1 h1/2 K t` − Xi h − EK t` − Xi h 4 ≤ R4 24 R X `=1 λ4 ` 1 h2 EK4 t` − Xi h + 1 h2 EK t` − Xi h 4 # = O 1 h . So if η̄i,N = R X `=1 λ` 1 h1/2 K t` − Xi h − EK t` − Xi h , then Eη̄i,N = 0 and using the formulae above we get the N X i=1 η̄4 i,N !1/4 N X i=1 η̄2 i,N !1/2 → 0. Hence the Liapounov condition is satisfied and therefore (3.21) holds. We obtain several results on the kernel density estimators for fixed t’s. The rate of convergence is (Nh)−1/2 is slower than the usual N−1/2. Also, the asymptotic independence (3.20) will cause problems when we look at the estimate on the interval [a, b]. We wish to obtain “global” results for ˆ fN (t). The popular choices are sup a≤t≤b | ˆ fN (t) − f(t)| and Z b a | ˆ fN (t) − f(t)|p dt 1/p ,
  • 409. MATHEMATICAL STATISTICS 39 where p ≥ 1. What is the “natural” norm? It is argued that p = 1 is the “natural” norm, since the L1 norm of densities is always finite, it is always less than 2. All the other norms put restrictions on f. The visualization of ˆ fN (t) together with f is supported by the sup–norm. To obtain global results, the point wise assumptions on f must hold on [a − , b + ] with some 0. Under these conditions, sup a≤t≤b | ˆ fN (t) − f(t)| P → 0 i.e. the estimator is uniformly weakly consistent. The limit distribution of the sup norm and the L2 norm of the kernel density estimator was determined by Bickel and Rosenblatt (1973). They consider MN,1 = (Nh)1/2 sup a≤t≤b f−1/2 || ˆ fN (t) − f(t)| and MN,2 = Nh Z b a Z b a ( ˆ fN (t) − f(t))2 a(t)dt, where a(t) is a weight function. Bickel and Rosenblatt (1973) explicitly define numerical sequences r1,N and r2,N such that (3.22) P {r1,N (MN,1 − r2,N ) ≤ x} → exp(−2e−x ). We note that r1,N ≈ (log N)1/2 and r2,N ≈ (log N)1/2 and the limit is an extreme value distribution. These suggest that the rate of convergence in (3.22) is slow. This conjecture was checked empirically. They also showed that there are constants r3 and r4 such that (3.23) P 1 r3h1/2 (MN,2 − r4) → Φ(x), where Φ(x) denotes the standard normal distribution function. The rate of convergence in (3.23) is better than in (3.22), so usually (3.23) is used in hypothesis testing. Csörgő and Horváth (1988) extend the results in (3.23) to the functionals MN,3 = (Nh)p/2 Z b a | ˆ fN (t) − f(t)|p a(t)dt, where a(t) is a weight function for all p ≥ 1. Their result is mainly used when p = 1 since this will give the natural L1 norm for density estimates. Since the rate of convergence in (3.22) is rather low, we discuss how to use the bootstrap to get cN = cN (α) such that (3.24) lim N→∞ P{MN,1 ≤ cN } = 1 − α. The result in (3.24) can be used to construct confidence bands for the density on [a, b] and for hypothesis testing. We note that cN (α) does not have a limit as N → ∞, it is inceasing like (2 log(1/h))1/2. If we use boostrap with replacement, X∗ 1 , X∗ 2 , . . . , X∗ N are independent and identi- cally distributed random variables but they are discrete, so conditionally on X they do not have a density function. Note that due to the difficult form of r1,N and r2,N , the bootstrap sample size is N, the original sample size. However, even there is no density, we can formally compute ˆ f∗ N (t) = 1 Nh N X i=1 K t − X∗ i h , which is a density for all N, it satisfies that ˆ f∗ N (t) ≥ 0 and Z ∞ −∞ ˆ f∗ N (t)dt = 1.
  • 410. 40 LAJOS HORVÁTH This is really interesting since with a density we are estimating a non existing density if we condition on X. The bootstrap statistic is M∗ N,1 = (Nh)1/2 sup a≤t≤b ˆ f −1/2 N (t)| ˆ f∗ N (t) − ˆ fN (t)|. We cannot repeat our previous arguments since conditionally on X, ˆ fN (t) is not the density of the bootstrap observations. However, if c∗ N (α) is defined by P{M∗ N,1 ≤ c∗ N } = 1 − α, then (3.25) lim N→∞ P{MN,1 ≤ c∗ N (α)} = 1 − α. Hence we can use c∗ N (α) as an approximation for cN (α). It is more natural to use a boostrap sample with density function ˆ fN (t), conditionally on X. Since ˆ fN (t) is a density function, F̂N (x) = Z x −∞ ˆ fN (t)dt defines a distribution function. We note that F̂N (x) = 1 Nh N X i=1 Z x −∞ K t − Xi h dt = 1 N N X i=1 K x − Xi h , where K(u) = Z u −∞ K(t)dt, i.e. K(u) is the distribution function satisfying K0(u) = K(u). Hence F̂N (x) is a smooth estimator for the underlying distribution function F. Let Z1, Z2, . . . , ZN be independent identically random variables with distribution function F̂N (x), conditionally on X. Now we compute the kernel density estimator ˜ f∗ N (t) from Z1, Z2, . . . , ZN . The corresponding sup statistic is M̃∗ N,1 = (Nh)1/2 sup a≤t≤b ˆ f −1/2 N (t)| ˜ f∗ N (t) − ˆ fN (t)|. If c̃∗ N = c̃∗ N (1 − α) is defined by lim N→∞ P{MN,1 ≤ c̃∗ N (α)} = 1 − α, so we have an other resampling based estimator for cN (α). Our discussion introduced a smooth estimator for F and this estimator is used to define the smoothed bootstrap. Next we consider the effect of estimating a parameter in the fitted density function. We assume that the underlying density function is in the parametric form f(t, θ). The true value of the parameter is θ0, i.e. f0(t) = f(t, θ0). We estimate the parameter with θ̂N satisfying (3.26) N1/2 (θ̂N − θ0) = OP(1). We have seen that several estimators satisfy (3.26), for example, maximum likelihood, least squares, U–statistics and so on. If f(t, θ) has a bounded derivative in a neighbourhood of θ0, i.e. there is a constant C such that ∂f(t, θ) ∂θ ≤ C for all t and θ in a neighbourhood of θ0. Hence by (3.26) and the mean value theorem we get that sup a≤t≤b
  • 411.
  • 412.
  • 413. f(t, θ̂N ) − f(t, θ0)
  • 414.
  • 415.
  • 417. MATHEMATICAL STATISTICS 41 and therefore (3.27) (Nh)1/2 sup a≤t≤b
  • 418.
  • 419.
  • 420. ˆ fN (t) − f(t, θ̂N )
  • 421.
  • 422.
  • 424.
  • 425.
  • 426. ˆ fN (t) − f(t, θ0)
  • 427.
  • 428.
  • 429. + oP (1). This means that estimating parameters does not effect the results on density estimation. This is different from the parameter estimated empirical process where the estimation of the parameter changes the asymptotics. 3.2. Cross validation. We have seen if we minimize MSE( ˆ fN (t)f (t))2 with respect to the window (smoothing parameter), then h depends on t. In order to find an “optimal” window, the other possible criteria is the minimization of the mean integrated square error MISE(h) = E Z b a ( ˆ fN (t) − f(t))2 dt. So we choose hopt which minimizes MISE(h), i.e. hopt = argminhMISE(h). Since E Z b a ( ˆ fN (t) − f(t))2 dt = Z b a E( ˆ fN (t) − E ˆ fN (t))2 dt + 2 Z b a E( ˆ fN (t) − E ˆ fN (t))(E ˆ fN (t) − f(t))dt + Z b a (E ˆ fN (t) − f(t))2 dt = Z b a var( ˆ fN(t))dt + Z b a (E ˆ fN (t) − f(t))2 dt = 1 Nh Z b a f(t)dt Z ∞ −∞ K2 (u)du 2 + h4 4 Z b a (f00 (t))2 dt Z ∞ −∞ u2 K(u)du 2 + o(h4 ) + o 1 Nh So, at least asymptotically. hopt = c∗N−1/5 , with c∗ = (Z b a f(t)dt Z ∞ −∞ K2 (u)du 2 )1/5 (Z b a (f00 (t))2 dt Z ∞ −∞ u2 K(u)du 2 )−1/5 , which depends on the unknown f. Cross validation provides a data based estimator for hopt. We write Z b 0 ( ˆ fN (t) − f(t))2 dt = Z b a ( ˆ fN (t))2 dt − 2 Z b a ˆ fN (t)f(t)dt + Z b a f2 (t)dt = J(h) + Z b a f2 (t)dt. Since Z b a f2 (t)dt does not depend on h, we need to minimize J(h). The estimator for J(h) is ¯ J(h) = Z b a ( ˆ fN (t))2 dt − 2 N N X i=1 Z b a ˆ fN (t) ˆ f (−i) N (t)dt,
  • 430. 42 LAJOS HORVÁTH where ˆ f (−i) N (t) is the kernel density estimator without the ith observation. The sample based cross validation estimator is ĥ = argminh{ ¯ J(h)}. It can be shown that (3.28) ĥ hopt P → 1 and (3.29) MISE(ĥ) MISE(hopt) P → 1. (Note: proving ĥ − hopt P → 0 would not be too useful since both terms go to 0.) The result in (3.29) that using ĥ we get the asymptotically most efficient kernel density estimator. Also, we need to check that our results proven for the non random MISE(hopt) window (expansion of the bias, variance, asymptotic normality, asymptotic distribution of norms) will go through for the random ˆ fN . These have been established in the literature (cf. Silverman (1986)), so it is justified to use ˆ f. We discussed cross validation in the context of finding the optimal window. The same idea is also used in model validation on machine learning, for example. However, not always one element is removed to get the comparison estimates, but several. Also, the same idea is used in case of jackknife estimators. For computational purpose we approximate ˆ J(h) with J∗ (h) = 1 Nh2 N X i=1 N X j=1 K∗ Xi − Xj h + 2 Nh K(0), where K∗ (x) = K(2) (x) − 2K(x) and K(2) (x) = Z ∞ −∞ K(x − y)K(y)dy. The numerical work is still substantial and Fast Fourier Transform is suggested for the computation. The computation of the cross validation is not too simple so some suggestions are given which only supported by simulations. If f is a normal density h∗ = 1.06σN−1/5 is suggested where σ is the variance of the observations. Since σ is unknown, it is estimated by σ̂ = min(sample standard deviation, interquartile range/π). Hence h∗ = 1.06σ̂N−1/5 is computable from the sample. This rule of thumb is used for non–normal densities as well. Usually instead of 1.06, several other constants are tried and the “best” is used in the analysis. Choosing h requires practice!
  • 431. MATHEMATICAL STATISTICS 43 3.3. Histogram. The histogram is wildly popular, since this was the first density estimator and it has been around since 1880’s. Also, even the simplest statistical software contains it. It is not better than the kernel density estimator and the estimate for smooth densities is a step function. The definition is very simple. We assume for the sake of simplicity that the support of f is [0, 1]. (Essentially, we have an interval which contains all the observations, or we use an interval such that the integral of the density on this interval is close to 1. Roughly speaking, we need a relatively large but not too large interval for the construction of the density.) We construct histogram with equal length bins. Let m be an integer and define the bins Bj = x : j − 1 m x ≤ j m , j = 1, 2, . . . , m. If Yj = N X i=1 I {Xi ∈ Bj} , and p̂j = Yj N , j = 1, 2, . . . , m then the histogram is defined as ˜ fN (t) = m X j=1 p̂j h I {t ∈ Bj} with h = 1 m . Hence the histogram is a step function, the value of the percentage of the observations in the bin. It is clear that the histogram is closely related to the kernel density estimator with the uniform kernel. (Writing h = 1/m is just an effort to connect the number of bins with the window.) It can be shown that MISE = Z 1 0 ( ˜ fN (t) − f(t))2 dt = h2 12 Z 1 0 (f00 (t))2 dt + 1 Nh + o(h2 ) + o 1 Nh , so in this case the optimal window is hopt = N−1/3     6 Z 1 0 (f00 (t))2 dt     1/3 . If we compare the optimal window of order N−1/3 to the optimal window of order N−1/5 for kernel densities, we see that the histogram converges to its limit much slower. Finding the optimal h (finding the number of bins m) can be found by cross validation as well. Let ˆ Jh = 2 h(N − 1) − N + 1 h(N − 1) m X j=1 p̂j. The cross validation gives the window ĥ = argminh ˆ J(h) so m = b1/ĥc. There are several versions of the histogram, including non equal bin sizes, data driven bins and so on. Silverman (1986) contains a readable account of histograms. Local log likelihood (local polynomial smoothing). The likelihood method can be extended to function spaces so the likelihood for f is N X i=1 log f(Xi) − N Z ∞ −∞ f(u)du − 1 .
  • 432. 44 LAJOS HORVÁTH Maximizing the log likelihood function above does not give acceptable result. Nonparametric likelihood arguments give that the locally smoothed log likelihood should be used and the likelihood estimator for f is ˆ fN (t) = argmaxf L(t; f) where L(t; f) = N X i=1 K t − Xi h log Xi − N Z ∞ −∞ K t − u h f(u)du. Maximizing with respect to f is hard so we approximate log t with a polynomial in the form pt(a, u) = r X j=0 aj j! (t − u)j , a = (a0, a1, . . . , ar). So we need to minimize N X i=1 N X i=1 K t − Xi h r X j=0 aj j! (t − Xi)j − N Z ∞ −∞ K t − u h exp   r X j=0 aj j! (t − u)j   du with respect to a. The minimum is reached at â = (â0, â1, . . . , âr), Then the estimator is ˆ fN (t) = eâ0 . This method is also called to local polynomial smoothing. It is called local because of the expansion of log t around t. This method requires the choice of K, h, r. Due to the choices of more parame- ters, we can get good results. The best result would be for large r but in this case the error would increase. So h and r should be picked using the data. This is implemented in several statistical packages. You can find more on local polynomial smoothing in Fan and Gijbels (1996) 3.4. Estimation with series. If φ1, φ2, . . . are orthonormal functions on [0, 1], i.e. Z 1 0 φi(u)φj(u)du = ( 0, if i 6= j 1, if i = j then (3.30) f(t) = ∞ X `=1 c`φ`(t), c` = Z 1 0 f(u)φ`(u)du. The expansion in (3.30) requires some assumptions to make sense, at least Z 1 0 f2 (u)du ∞ is needed. This give you a meaningful estimate for a fixed t and also in L2, the space of square integrable functions on [0, 1]. Assuming appropriate further conditions, we have that the infinite sum converges in the sup–norm or in L1 (the space of integrable functions on [0, 1]). It is easy to unbiased estimator for c` ĉ` = 1 N N X i=1 φ`(Xi). Clearly, Eĉ` = 1 N N X i=1 Eφ`(Xi) = Eφ`(X1) = Z 1 0 φ`(u)f(u)du = c`.