Kernel density estimation
Introduction
Definition: let Y be a random variable with probability density
function f. The probability that Y takes a value between a and b is

P(a < Y < b) = \int_a^b f(y) \, dy  for all a < b.
Interest:
From a sample of observations, the density function provides
much more information than standard descriptive statistics
(mean, median, standard deviation, skewness, kurtosis, ...)
Some properties of the data can be read off a simple plot
of the density function
Emmanuel Flachaire Non-Parametric Econometrics
Parametric estimation
Maximum likelihood
Let us assume that a sample of n observations, y_1, ..., y_n, is
drawn from a parametric distribution f(y; \theta).
If the data are i.i.d., the joint density function is:

f(y; \theta) = \prod_{i=1}^n f(y_i; \theta)
To estimate the parameters, we maximize this density function
(the "likelihood"), or more easily, its logarithmic transformation:

\ell(y; \theta) = \log f(y; \theta) = \sum_{i=1}^n \log f(y_i; \theta)
Problem: poor estimation if f is misspecified (see CAC40)
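To make the maximum likelihood step concrete, here is a minimal sketch (not from the slides) for a Gaussian model, where the ML estimates have closed forms; the function names are illustrative.

```python
import numpy as np

def normal_mle(y):
    """Closed-form ML estimates for a Gaussian model f(y; mu, sigma)."""
    mu = y.mean()
    sigma = np.sqrt(((y - mu) ** 2).mean())  # ML divides by n, not n-1
    return mu, sigma

def log_likelihood(y, mu, sigma):
    """l(y; theta) = sum_i log f(y_i; theta) for the Gaussian density."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (y - mu) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=5000)
mu_hat, sigma_hat = normal_mle(y)
```

If the data actually come from, say, a skewed distribution, this Gaussian fit is poor: that is the misspecification risk mentioned above.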
Nonparametric method: histogram
Define the intervals: choose a number of bins and a starting
value for the first interval
Then, the histogram is calculated as follows:

\hat f(y) = \frac{1}{n} \times \frac{\text{number of observations in the interval containing } y}{\text{width of the interval}}

The number of bins defines the degree of smoothness of the
histogram
The starting value can have an impact too (see next slide)
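The histogram formula above can be sketched in a few lines (illustrative code, not from the slides):

```python
import numpy as np

def histogram_density(y, y0, h, m):
    """Histogram estimator: per bin, (1/n) * (count in bin) / h,
    with m bins of width h starting at the chosen value y0."""
    edges = y0 + h * np.arange(m + 1)
    counts, _ = np.histogram(y, bins=edges)
    return edges, counts / (len(y) * h)

rng = np.random.default_rng(1)
y = rng.normal(size=1000)
edges, fhat = histogram_density(y, y0=-4.0, h=0.5, m=16)
```

Changing h (or the number of bins) changes the smoothness; shifting y0 changes the plot as well, which is the sensitivity discussed on the next slide.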
Nonparametric method: histogram
Intervals are defined arbitrarily: the number of bins and the
starting value have to be chosen a priori
Problem: the plot of the histogram is very sensitive to these
choices
Nonparametric method: naïve estimator
Definition
Histogram: with m intervals of the same width h,

\hat f(y) = \frac{1}{nh} \sum_{i=1}^n I(z_k < y_i < z_{k+1}),  with y \in [z_k; z_{k+1}]

Naïve estimator:

\hat f(y) = \frac{1}{nh} \sum_{i=1}^n I\left(y - \frac{h}{2} < y_i < y + \frac{h}{2}\right)

The global density is obtained with a "moving window"
(overlapping intervals)
Nonparametric method: naïve estimator
Advantage: the choice of a starting value is not required
Problem: the plot is not smooth
Nonparametric method: kernel estimator
Definition
Naïve estimator:

\hat f(y) = \frac{1}{nh} \sum_{i=1}^n w\left(\frac{y - y_i}{h}\right),  where w(x) = 1/2 if |x| < 1, 0 otherwise

Kernel estimator: replace w by a kernel function K,

\hat f(y) = \frac{1}{nh} \sum_{i=1}^n K\left(\frac{y - y_i}{h}\right)

where K gives larger weights to closer observations. The kernel
function must satisfy \int_{-\infty}^{\infty} K(x) \, dx = 1. Thus,
any density function can be used.
Nonparametric method: kernel estimator
Kernel function
Gaussian kernel: the standard Normal density function,

K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}

Epanechnikov kernel: a second-order polynomial chosen so that it
is a density,

K(x) = \frac{3(1 - x^2/5)}{4\sqrt{5}} if |x| < \sqrt{5},  0 otherwise
Nonparametric method: kernel estimator
Kernel function
Triangular kernel:

K(x) = 1 - |x| if |x| < 1,  0 otherwise

Rectangular kernel:

K(x) = 1/2 if |x| < 1,  0 otherwise

It corresponds to the naïve estimator: its weight function w is a
normalized (unit-mass) density.
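The four kernels above can be written directly (illustrative code; each is a density integrating to 1):

```python
import numpy as np

def gaussian(x):
    # standard Normal density
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def epanechnikov(x):
    # 3(1 - x^2/5) / (4 sqrt(5)) on |x| < sqrt(5); has unit variance
    s5 = np.sqrt(5)
    return np.where(np.abs(x) < s5, 3 * (1 - x ** 2 / 5) / (4 * s5), 0.0)

def triangular(x):
    return np.where(np.abs(x) < 1, 1 - np.abs(x), 0.0)

def rectangular(x):
    # the naive estimator's weight function w
    return np.where(np.abs(x) < 1, 0.5, 0.0)
```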
Nonparametric method: kernel estimator
The kernel density estimator is defined as:

\hat f(y) = \frac{1}{nh} \sum_{i=1}^n K\left(\frac{y - y_i}{h}\right)

The kernel density estimate depends on the choices of:
the kernel function K
the bandwidth parameter h
The kernel density estimate is very sensitive to h, not to K
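The estimator above is a few lines of code (an illustrative sketch using a direct vectorized sum, fine for moderate n):

```python
import numpy as np

def gaussian_kernel(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def kde(grid, data, h, K=gaussian_kernel):
    """fhat(y) = (1/(n h)) * sum_i K((y - y_i)/h), on a grid of y values."""
    u = (np.asarray(grid)[:, None] - np.asarray(data)[None, :]) / h
    return K(u).sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(2)
data = rng.normal(size=500)
grid = np.linspace(-4.0, 4.0, 401)
fhat = kde(grid, data, h=0.3)
```

Re-running with different h changes the plot markedly; swapping the Gaussian kernel for, say, the Epanechnikov kernel barely does.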
How to select the bandwidth?
Principle
Manually: from a comparison of several plots
Automatic selection: based on a criterion
We would select h such that \hat f is as close as possible to f
Criterion: h_opt minimizes the distance between \hat f and f
How to select the bandwidth?
Mean squared error
Let us consider the estimation at one specific point y. We can
define the mean squared error as:

MSE(h) = E\left[(\hat f(y) - f(y))^2\right]

This expression can be decomposed into two terms:

MSE(h) = \left(E[\hat f(y)] - f(y)\right)^2 + Var[\hat f(y)]

The first term is the squared bias and the second term is the variance.
Minimizing the MSE ≡ a bias/variance trade-off
How to select the bandwidth?
Mean integrated squared error
In practice, the distance should be minimized over all points.
Thus, we use the mean integrated squared error,

MISE(h) = E\int [\hat f(y) - f(y)]^2 \, dy

We have MISE(h) = \int MSE(h) \, dy, from which it follows:

MISE(h) = \int \left(E[\hat f(y)] - f(y)\right)^2 dy + \int Var[\hat f(y)] \, dy

The first term is the "sum" of squared biases and the second term is
the "sum" of variances over all points
How to select the bandwidth?
Result
From an approximation of the MISE, Silverman shows that

h_opt = \left[\int x^2 K(x) dx\right]^{-2/5} \left[\int K(x)^2 dx\right]^{1/5} \left[\int f''(y)^2 dy\right]^{-1/5} n^{-1/5}

Some comments can be drawn:
h decreases to 0 as the sample size n increases
\int f''(y)^2 dy measures the fluctuations of f: we expect h_opt to be
small if the density fluctuates quite a lot
With h_opt, the kernel function that minimizes the MISE is the
Epanechnikov kernel function
Finally, the bandwidth selection depends on the true underlying
density function: we need to know f to select h_opt!
How to select the bandwidth?
Rule of thumb
With a Gaussian distribution f in h_opt and a Gaussian kernel,
Silverman shows that the optimal value of the bandwidth is:

h_opt = 1.059 \, \sigma \, n^{-1/5}

The variance is sensitive to the presence of outliers in the data,
whereas the interquartile range is not. The Silverman rule of thumb is:

\hat h_opt = 0.9 \min\left(\hat\sigma ; \frac{\hat q_3 - \hat q_1}{1.349}\right) n^{-1/5}

with 0.9 rather than 1.059 to reduce the risk of oversmoothing
This very simple rule works well in practice, in many cases
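The rule of thumb translates directly into code (illustrative sketch):

```python
import numpy as np

def silverman_bandwidth(y):
    """hhat_opt = 0.9 * min(sigma_hat, (q3 - q1)/1.349) * n^(-1/5)."""
    n = len(y)
    sigma = np.std(y, ddof=1)
    q1, q3 = np.percentile(y, [25, 75])
    return 0.9 * min(sigma, (q3 - q1) / 1.349) * n ** (-0.2)

rng = np.random.default_rng(3)
y = rng.normal(size=1000)
h = silverman_bandwidth(y)
```

The min with the interquartile range is what protects the rule against outliers: adding one extreme observation inflates the standard deviation but leaves the bandwidth almost unchanged.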
How to select the bandwidth?
Plug-in method
The Silverman rule of thumb is obtained by replacing f in the
MISE by a reference distribution (Gaussian)
Plug-in method: a (first) nonparametric estimate of f is
used in the MISE to obtain a new h_opt
Much more numerical computation is needed
How to select the bandwidth?
Cross validation
Minimizing the ISE rather than the MISE:

ISE(h) = \int [\hat f - f]^2 dy = \int \hat f^2 dy - 2\int \hat f f \, dy + \int f^2 dy

This criterion is specific to one sample. It is conceptually
different, but makes no difference in practice
the last term does not depend on h: no impact
the second term is E[\hat f(y)]: it can be estimated by n^{-1} \sum_{i=1}^n \hat f_{-i}(y_i)
Minimizing the ISE amounts to minimizing:

CV(h) = \int \hat f^2 dy - \frac{2}{n} \sum_{i=1}^n \hat f_{-i}(y_i)

which can be computed from the data
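A least-squares cross-validation sketch (illustrative; the integral of fhat squared is approximated on a grid, and the leave-one-out value fhat_{-i}(y_i) is obtained by dropping the K(0) self-term):

```python
import numpy as np

def gaussian_kernel(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def cv_score(h, y, grid):
    """CV(h) = int fhat^2 dy - (2/n) sum_i fhat_{-i}(y_i)."""
    n = len(y)
    fhat = gaussian_kernel((grid[:, None] - y[None, :]) / h).sum(axis=1) / (n * h)
    int_f2 = (fhat ** 2).sum() * (grid[1] - grid[0])  # Riemann approximation
    Kmat = gaussian_kernel((y[:, None] - y[None, :]) / h)
    loo = (Kmat.sum(axis=1) - gaussian_kernel(0.0)) / ((n - 1) * h)
    return int_f2 - 2.0 * loo.mean()

rng = np.random.default_rng(4)
y = rng.normal(size=300)
grid = np.linspace(-5.0, 5.0, 501)
hs = np.linspace(0.05, 1.0, 20)
scores = [cv_score(h, y, grid) for h in hs]
h_cv = hs[int(np.argmin(scores))]
```

Very small bandwidths are penalized because the leave-one-out term collapses while the integral of fhat squared blows up.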
How to select the bandwidth?
Limits
Advantage: automatic selection methods work well in practice
When does it fail? Poor performance with skewed and
heavy-tailed distributions
Why? The bandwidth h is fixed over the whole sample
Solution: adaptive methods, where the bandwidth varies with the
degree of concentration of the data
Adaptive methods
The nearest neighbours
Principle: define intervals such that they contain the same
number of observations
Kernel method: h is fixed, the number of observations per interval varies
Nearest neighbours: the number of observations per interval is fixed, h varies
Define d_i(y) = |y_i - y| and order the distances
d_1(y) ≤ d_2(y) ≤ ··· ≤ d_n(y), so that k observations lie in
[y - d_k(y); y + d_k(y)]. The naïve estimator with h = 2d_k(y) is
the nearest neighbours estimator:

\hat f(y) = \frac{k}{2n \, d_k(y)}
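A direct sketch of the nearest neighbours estimator (illustrative code):

```python
import numpy as np

def nn_density(grid, data, k):
    """fhat(y) = k / (2 n d_k(y)), with d_k(y) the k-th smallest |y_i - y|."""
    n = len(data)
    d = np.abs(np.asarray(grid)[:, None] - np.asarray(data)[None, :])
    dk = np.sort(d, axis=1)[:, k - 1]
    return k / (2 * n * dk)

rng = np.random.default_rng(5)
data = rng.normal(size=500)
grid = np.linspace(-3.0, 3.0, 121)
fhat = nn_density(grid, data, k=30)
```

At y = ±3 the estimate stays well above the true N(0, 1) density (about 0.004): the tails decay only like 1/d_k(y), which is the overestimation discussed on the next slide.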
Adaptive methods
The nearest neighbours
Nearest neighbours estimator:

\hat f(y) = \frac{k}{2n \, d_k(y)}

Something new: the bandwidth depends on y. It follows that \hat f:
is not smooth, because a weight equal to 1 is given to
observations in the interval and 0 otherwise, and because
d_k(y) has discontinuous derivatives
overestimates the tails: \hat f(y_0) ≠ 0 for y_0 << y_min or
y_0 >> y_max, thus it is not a density function
Adaptive methods
Generalized nearest neighbours
Principle: we replace the 0/1 weighting function by a kernel
function,

\hat f(y) = \frac{1}{2n \, d_k(y)} \sum_{i=1}^n K\left(\frac{y - y_i}{2 d_k(y)}\right)

It reduces the overestimation of the tails, but does not remove it
The bandwidth is still a function of d_k(y), which has
discontinuous derivatives
Problem: these problems come from the fact that the
bandwidth depends on y
Adaptive methods
Adaptive kernel
We replace the bandwidth 2d_k(y) by h\lambda_i:

\hat f(y) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h\lambda_i} K\left(\frac{y - y_i}{h\lambda_i}\right),  \lambda_i = \left(\frac{g}{\tilde f(y_i)}\right)^{\alpha}

where g is the geometric mean of the \tilde f(y_i), \alpha is a value
between 0 and 1, and \tilde f(y_i) is a pilot estimate.
the bandwidth does not depend on y → the problems of the
nearest neighbours estimator disappear
\lambda_i is small when the density is large (center of the
distribution) and large when the density is small (tails)
A common choice: \alpha = 1/2, with h obtained from the Silverman rule of thumb
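A sketch of the adaptive kernel estimator (illustrative; here the pilot estimate is a fixed-bandwidth Gaussian KDE evaluated at the data points, and alpha = 1/2):

```python
import numpy as np

def gaussian_kernel(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def adaptive_kde(grid, data, h, alpha=0.5):
    """fhat(y) = (1/n) sum_i (1/(h lam_i)) K((y - y_i)/(h lam_i)),
    lam_i = (g / fpilot(y_i))^alpha, g the geometric mean of the fpilot."""
    n = len(data)
    # pilot: fixed-bandwidth KDE at the observations themselves
    fpilot = gaussian_kernel((data[:, None] - data[None, :]) / h).sum(axis=1) / (n * h)
    g = np.exp(np.log(fpilot).mean())
    lam = (g / fpilot) ** alpha
    hi = h * lam  # one bandwidth per observation
    u = (np.asarray(grid)[:, None] - data[None, :]) / hi[None, :]
    return (gaussian_kernel(u) / hi[None, :]).sum(axis=1) / n

rng = np.random.default_rng(6)
data = rng.lognormal(size=500)          # skewed, heavy right tail
grid = np.linspace(0.01, 12.0, 600)
fhat = adaptive_kde(grid, data, h=0.3)
```

Observations in the sparse right tail get large lam_i, hence wide bandwidths, while observations near the mode get narrow ones.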
Adaptive methods
GNP per capita
[Figure: kernel density estimation of GDP per capita in 121 countries,
comparing the adaptive kernel with the simple kernel.]
Quality of density estimation
To assess the quality of the density estimation, we use the mean
integrated absolute error (MIAE) measure,

MIAE = E\left[\int_0^{\infty} |\hat f(x) - f(x)| \, dx\right].

In our experiments, data are generated from:
Lognormal distributions, \Lambda(x; 0, \sigma)
Singh-Maddala distributions, SM(x; 2.8, 0.193, q)
Mixtures of two Singh-Maddala distributions:

\frac{2}{5} SM(x; 2.8, 0.193, 1.7) + \frac{3}{5} SM(x; 5.8, 0.593, q)

As \sigma increases and q decreases, the upper tail of the distribution
decays more slowly. The sample size is n = 500 and N = 100.
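The MIAE can be approximated by Monte Carlo. Here is a sketch under a simpler design than the slides' (lognormal truth only, Gaussian-kernel KDE with the Silverman bandwidth; all choices are illustrative):

```python
import numpy as np

def gaussian_kernel(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def kde(grid, data, h):
    u = (grid[:, None] - data[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(data) * h)

def silverman(y):
    q1, q3 = np.percentile(y, [25, 75])
    return 0.9 * min(np.std(y, ddof=1), (q3 - q1) / 1.349) * len(y) ** (-0.2)

def miae_lognormal(sigma=0.5, n=500, N=100, seed=0):
    """Monte Carlo MIAE = E int |fhat - f| dx over N replications."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(1e-3, 12.0, 600)
    dx = grid[1] - grid[0]
    # true lognormal(0, sigma) density on the grid
    f = np.exp(-np.log(grid) ** 2 / (2 * sigma ** 2)) \
        / (grid * sigma * np.sqrt(2 * np.pi))
    errs = []
    for _ in range(N):
        y = rng.lognormal(sigma=sigma, size=n)
        fhat = kde(grid, y, silverman(y))
        errs.append(np.abs(fhat - f).sum() * dx)
    return float(np.mean(errs))
```

Increasing sigma (a heavier upper tail) tends to raise the MIAE, which is the pattern the experiments above are designed to exhibit.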
Quality of density estimation
[Figure: densities of the mixture of two Singh-Maddala distributions
for q = 0.8, 0.6 and 0.4.]