2. We often face the problem that the density of the variable of interest is unknown. One popular method of estimating an unknown density is the histogram estimator.
The decision on the number of bins or the bin width of a histogram is often made arbitrarily or subjectively, but it need not be. Here we review the statistical procedures that have been proposed for choosing the optimal bin width and bin number.
3. We review the methods prevalent in the statistical literature for determining the optimal number of bins and the bin width of a histogram.
We also present a comparative analysis to determine which methods are more efficient.
The measure we use to compare the various optimal-binning methods is sup_x |ĥ(x) − f(x)|, where ĥ(x) is the histogram density estimator at x and f(x) is the true density at the point x.
4. Proposed methods of interest for optimal binning
• Sturges’ rule and Doane’s modification
• Scott’s rule and the Freedman-Diaconis modification
• Bayesian optimal binning
• Optimal binning by Hellinger risk minimization
• Penalized maximum log-likelihood with penalty A and the Hogg penalty
• The stochastic complexity (Kolmogorov complexity) method
5. Sturges’ Rule
Construct a frequency distribution with k bins, each of width 1 and centered on the points i = 0, 1, …, k−1, and choose the bin count of the i-th bin to be the binomial coefficient C(k−1, i). As k increases, this ideal frequency histogram assumes the shape of a normal density with mean (k−1)/2 and variance (k−1)/4.
The total sample size is then
n = Σ_{i=0}^{k−1} C(k−1, i) = 2^{k−1},
where k is the number of bins to be used. Solving for k gives Sturges’ rule
k = 1 + log2(n).
We split the sample range into k bins of equal length, so Sturges’ rule gives a regular histogram.
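A minimal numerical sketch of this rule (the function name and the choice to round up are illustrative, not from the source):

```python
import numpy as np

def sturges_bins(n):
    """Sturges' rule: k = 1 + log2(n), rounded up to an integer."""
    return int(np.ceil(1 + np.log2(n)))

# For n = 1000 (the sample size used in the simulations later),
# 1 + log2(1000) is about 10.97, i.e. roughly 10-11 bins depending on rounding.
print(sturges_bins(1000))
```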
6. Conceptual fallacy of Sturges’ rule
• There is a conceptual fallacy in the derivation of Sturges’ rule. Instead of choosing n = 2^{k−1}, the same construction is satisfied by any n, provided the individual cell frequencies are scaled accordingly:
• m(i) = number of observations in the i-th cell could equally well have been taken to be m(i) = [C(k−1, i)/2^{k−1}]·n, which still sums to n and still approaches a normal shape as k grows.
• So, intuitively, there is no reason for choosing this particular n given the motivation employed in Sturges’ rule.
Doane’s modification
For skewed or kurtotic distributions, additional bins may be required. Doane proposed increasing the number of bins by log2(1 + ŷ), where ŷ is the standardized skewness coefficient of the sample.
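A sketch of Doane’s adjustment, assuming the standardized skewness ŷ is the sample skewness divided by its usual finite-sample standard error; that standardization and the helper name are assumptions, not taken from the source:

```python
import numpy as np
from scipy.stats import skew

def doane_bins(x):
    """Sturges' count plus Doane's correction log2(1 + |g1| / sigma_g1),
    where g1 is the sample skewness and sigma_g1 its (assumed) standard error."""
    x = np.asarray(x)
    n = len(x)
    g1 = skew(x)
    sigma_g1 = np.sqrt(6.0 * (n - 2) / ((n + 1) * (n + 3)))
    return int(np.ceil(1 + np.log2(n) + np.log2(1 + abs(g1) / sigma_g1)))

print(doane_bins(np.random.chisquare(df=2, size=1000)))
```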
7. Scott’s rule and the Freedman-Diaconis modification
• We obtain an optimal bin width by minimizing the asymptotic expected L2 error. The histogram estimator is ĥ(x) = ν_k/(nh), where h is the bin width, n is the total number of observations, and ν_k is the number of observations lying in the k-th bin.
• The optimal bin width is
h*(x) = [f(x_k)/(2γ²n)]^{1/3},
where x_k is some point lying in the k-th bin and γ is the Lipschitz continuity constant of f.
For the normal density, this gives h* = 3.5·sd(x)·n^{−1/3} in the regular case.
The Freedman-Diaconis modification for non-normal data is h* = 2·(IQR)·n^{−1/3}, where IQR is the interquartile range.
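Both normal-reference bin widths are easy to compute directly; the sketch below is illustrative (function names are ours), with the number of bins obtained by dividing the sample range by h*:

```python
import numpy as np

def scott_bin_width(x):
    """Scott's normal-reference bin width h* = 3.5 * sd(x) * n^(-1/3)."""
    x = np.asarray(x)
    return 3.5 * x.std(ddof=1) * len(x) ** (-1.0 / 3.0)

def fd_bin_width(x):
    """Freedman-Diaconis bin width h* = 2 * IQR * n^(-1/3)."""
    x = np.asarray(x)
    q75, q25 = np.percentile(x, [75, 25])
    return 2.0 * (q75 - q25) * len(x) ** (-1.0 / 3.0)

def n_bins(x, h):
    """Number of equal-width bins covering the sample range for width h."""
    return int(np.ceil((x.max() - x.min()) / h))

x = np.random.normal(size=1000)
print(n_bins(x, scott_bin_width(x)), n_bins(x, fd_bin_width(x)))
```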
8. Hellinger risk minimization
• The Hellinger risk between the histogram density estimator ĥ(x), for a given bin number k of a regular histogram, and the true density f(x) is defined as
H(ĥ, f) = ½ ∫ (√ĥ(x) − √f(x))² dx.
• We minimize this quantity over different choices of the bin width or bin number.
• If the true f is known, there is no problem in evaluating this integral. If the true f is not known, one may estimate f by bootstrapping over repeated samples from f.
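A minimal sketch of this minimization, assuming the true density is known (an N(0,1) reference is used purely for illustration) and approximating the integral on a uniform grid; the grid range, search range, and function names are assumptions:

```python
import numpy as np
from scipy.stats import norm

def hellinger_risk(x, k, true_pdf, grid):
    """Squared Hellinger distance between a k-bin regular histogram density
    estimate and the (assumed known) true density, on a uniform grid."""
    dens, edges = np.histogram(x, bins=k, density=True)
    idx = np.clip(np.searchsorted(edges, grid, side="right") - 1, 0, k - 1)
    h_hat = np.where((grid >= edges[0]) & (grid <= edges[-1]), dens[idx], 0.0)
    dx = grid[1] - grid[0]
    return 0.5 * np.sum((np.sqrt(h_hat) - np.sqrt(true_pdf(grid))) ** 2) * dx

# choose the bin number minimizing the risk against the N(0,1) reference
x = np.random.normal(size=1000)
grid = np.linspace(-4.0, 4.0, 2001)
risks = {k: hellinger_risk(x, k, norm.pdf, grid) for k in range(2, 51)}
print(min(risks, key=risks.get))
```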
9. Bayesian model for optimal binning
The likelihood of the data given the parameters M (the number of bins) and the vector of bin probabilities π is
P(d | π, M, I) = (M/V)^N π_1^{n_1} π_2^{n_2} ⋯ π_{M−1}^{n_{M−1}} π_M^{n_M},
where V = Mv and v is the bin width.
Assume the prior densities are defined as follows:
P(M | I) = 1/C, where C is the maximum number of bins taken into account;
P(π | M) = Γ(M/2)/Γ(1/2)^M · [π_1 π_2 ⋯ π_M]^{−1/2}, which is a Dirichlet distribution with M parameters equal to ½, the conjugate prior of the multinomial distribution.
The joint posterior P(π, M | d, I) = k·P(π | M)·P(M | I)·P(d | π, M) is then integrated over π to obtain the marginal posterior of M, which when maximized yields the optimal value of M.
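A sketch of the resulting marginal log-posterior of M, obtained by integrating the Dirichlet(½, …, ½) weights out of the likelihood above for a regular histogram over the sample range; additive constants not depending on M are dropped, and the function name and search range are illustrative:

```python
import numpy as np
from scipy.special import gammaln

def log_posterior_M(x, M):
    """Marginal log-posterior of the bin number M (up to an additive constant)
    for a regular M-bin histogram with Dirichlet(1/2, ..., 1/2) bin weights."""
    n_k, _ = np.histogram(x, bins=M)
    N = len(x)
    return (N * np.log(M)
            + gammaln(M / 2.0) - M * gammaln(0.5)
            - gammaln(N + M / 2.0)
            + np.sum(gammaln(n_k + 0.5)))

x = np.random.normal(size=1000)
print(max(range(2, 101), key=lambda M: log_posterior_M(x, M)))
```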
10. Maximum penalized log-likelihood method
In this case we maximize the log-likelihood of the multinomial distribution corresponding to a histogram, but with a penalty function added. The penalized log-likelihood is thus of the form
PL = log L(ĥ; x_1, x_2, …, x_n) − pen_n(I),
where I is the partition of the sample range into disjoint intervals. Note that these bins need not be of equal length, i.e. the histogram may be irregular.
There are various choices of the penalty; our two choices, for a partition into D bins, are
pen_A =
The first penalty is applicable to both the regular and irregular cases.
pen_B (Hogg or Akaike penalty) = D − 1.
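A minimal sketch for the regular case with the Hogg/Akaike penalty pen_B = D − 1 (penalty A is not reproduced here); the multinomial log-likelihood of the histogram density is maximized over the bin number D, and the function name and search range are assumptions:

```python
import numpy as np

def penalized_loglik(x, D):
    """Log-likelihood of a regular D-bin histogram density minus pen_B = D - 1."""
    x = np.asarray(x)
    n = len(x)
    counts, edges = np.histogram(x, bins=D)
    h = edges[1] - edges[0]                      # common bin width
    nz = counts[counts > 0]                      # empty bins contribute 0
    loglik = np.sum(nz * np.log(nz / (n * h)))
    return loglik - (D - 1)

x = np.random.chisquare(df=2, size=1000)
print(max(range(2, 101), key=lambda D: penalized_loglik(x, D)))
```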
11. Stochastic complexity method
• This is based on the idea of encoding the data with the minimum number of bits. It is a form of penalized maximum likelihood with the number of bits, or description length, as the penalty.
• If P(X|θ) is the distribution of the data with θ unknown, and σ_i(θ) is the standard deviation of the best estimator of the i-th coordinate of θ, then the description length is given by
−log2 P(X|θ) + Σ_i log2(…).
• We define the stochastic complexity as
−log2 ∫ P(X|θ) π(θ) dθ.
Taking a uniform prior for θ and P(X|θ) to be the multinomial distribution with cell counts N_1, …, N_m, the stochastic complexity reduces to the negative log of
l = (m−1)!·N_1!·N_2! ⋯ N_m!/(m+n−1)!,
which is maximized with respect to m to get the number of bins.
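A hedged computational sketch of this criterion: we assume a regular histogram over the sample range and take P(X|θ) as the likelihood of the histogram density, which puts a bin-width factor m^N in front of the integrated multinomial term l above (the constant factor from the sample range is dropped); names and the search range are illustrative:

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(x, m):
    """log of m^N * (m-1)! * N_1! * ... * N_m! / (m+n-1)! for a regular m-bin
    histogram; minimizing the stochastic complexity -log2 of this quantity is
    the same as maximizing it."""
    counts, _ = np.histogram(x, bins=m)
    N = len(x)
    return (N * np.log(m)                       # bin-width (density) factor
            + gammaln(m)                        # (m-1)!
            + np.sum(gammaln(counts + 1))       # N_1! ... N_m!
            - gammaln(m + N))                   # (m+n-1)!

x = np.random.chisquare(df=2, size=1000)
print(max(range(2, 101), key=lambda m: log_evidence(x, m)))
```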
12. Simulation design
To compare the various binning methods, we use simulation experiments from three reference distributions, namely Chi-square(2), Normal(0,1), and Uniform(1,10).
We compare the statistic T = sup_x |ĥ(x) − f(x)| across the methods and examine how small the value of T is on average for each method.
We simulate 1000 observations from each reference distribution, compute the T statistic for each simulated run, and repeat the experiment 200 times to obtain a distribution of T.
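A sketch of one such experiment for the N(0,1) reference distribution with Sturges’ binning; the supremum is approximated over a fine grid, and the grid range, seed, and function names are assumptions:

```python
import numpy as np
from scipy.stats import norm

def T_statistic(x, bins, true_pdf, grid):
    """T = sup_x |h_hat(x) - f(x)|, approximated over a fine grid."""
    dens, edges = np.histogram(x, bins=bins, density=True)
    idx = np.clip(np.searchsorted(edges, grid, side="right") - 1, 0, len(dens) - 1)
    h_hat = np.where((grid >= edges[0]) & (grid <= edges[-1]), dens[idx], 0.0)
    return np.max(np.abs(h_hat - true_pdf(grid)))

rng = np.random.default_rng(0)
k = int(np.ceil(1 + np.log2(1000)))             # Sturges' bin count for n = 1000
grid = np.linspace(-4.0, 4.0, 2001)
T_vals = [T_statistic(rng.normal(size=1000), k, norm.pdf, grid) for _ in range(200)]
print(np.mean(T_vals), np.var(T_vals))
```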
13. Mean and variance of T for Chi-square(2)

Method             Mean no. of bins   Mean(T)   Variance(T)
Sturges                  10           0.1364    0.00031
Doane                    15           0.1028    0.00018
Scott                    19           0.0874    0.00015
Hellinger                13           0.1151    0.00022
FD                       32           0.0747    0.00023
Kolmogorov               10           0.2744    0.01288
Bayesian                 12           0.1177    0.00037
Hogg                     18           0.0948    0.000194
Irregular(penA)           6           0.1134    0.00028
15. Analysis of the Chi-square simulation
For Chi-square(2), the Freedman-Diaconis and Scott rules performed very well in terms of a smaller mean value of T.
Kolmogorov’s complexity method has the maximum spread in the T values. The distribution of T under Sturges’ rule dominates that under the Freedman-Diaconis and Scott rules.
The irregular histogram method under pen_A gives far fewer bins than the other methods.
16. Mean and variance of T for N(0,1)

Method             Mean no. of bins   Mean(T)    Variance(T)
Sturges                  10           0.0909     0.00013
Scott                    18           0.08377    0.00025
Hellinger                20           0.08309    0.00022
FD                       25           0.08687    0.00029
Kolmogorov               13           0.2243     0.0137
Bayesian                 13           0.0912     0.00022
Hogg                     12           0.0855     0.00011
Irregular(penA)           6           0.1984     0.00113
18. For Normal(0,1), we left out Doane’s modification, as it is meant for non-normal or skewed distributions.
Sturges’ rule and Scott’s rule performed very well in the normal case, which is expected given that they are designed under normality assumptions.
The Scott, Freedman-Diaconis, and Sturges rules are very close to one another in terms of the distribution of T.
The penalized log-likelihood with penalty A has a distribution of T that dominates the T distribution under the other methods.
The T distributions under stochastic complexity and Hellinger distance have the maximum spread; the minimum spread is under Sturges’ rule.
19. Mean and variance of T for U(1,10)

Method       Mean no. of bins   Mean(T)   Variance(T)
Sturges            10           0.1298    0.00036
Scott               9           0.1288    0.00035
Doane              11           0.1308    0.00051
FD                  9           0.1283    0.00032
Bayesian            9           0.1274    0.000361
21. Analysis under the U(1,10) distribution
Most of the remaining methods give only 1 or 2 bins in the uniform case, so they cannot be compared with the methods above, which are more stable in nature.
However, the Scott, Freedman-Diaconis, and Sturges rules performed well, with small values of T and small variation in the values of T under repeated simulations.
22. Similar to the univariate case, we try to generalize our methods to bivariate distributions.
Here we simulate observations from a bivariate normal distribution with mean (0,0), ρ = 0.5, and σ² = 1 for each coordinate.
The methods we use are the multivariate extension of Bayesian optimal binning and the multivariate Scott rule.
23. In the same vein as the univariate case, the multivariate Scott rule is determined by minimizing the asymptotic expected L2 error. The multivariate Scott choice of bin width along the k-th coordinate is
h*_k = 3.5·σ_{x_k}·n^{−1/(d+2)},
where d is the dimension of the dataset and σ_{x_k} is the standard deviation along the k-th coordinate.
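A minimal sketch of these per-coordinate bin widths for the bivariate normal sample described above (the function name and random seed are illustrative):

```python
import numpy as np

def scott_bin_widths(X):
    """Per-coordinate multivariate Scott bin widths h_k = 3.5 * sd_k * n^(-1/(d+2))."""
    X = np.asarray(X)
    n, d = X.shape
    return 3.5 * X.std(axis=0, ddof=1) * n ** (-1.0 / (d + 2))

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=1000)
print(scott_bin_widths(X))
```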
24. The 3-D histogram obtained for the T-statistic distribution under Scott’s rule
26. Bayesian optimal binning for the multivariate normal case
In this case, we select M_x bins along the X axis and M_y bins along the Y axis and define M = M_x·M_y. The joint likelihood is
P(d | π, M_x, M_y, I) = (M/V)^N π_1^{n_1} π_2^{n_2} ⋯ π_M^{n_M},
which is quite analogous to the univariate case. Again taking a rectangular (uniform) prior for (M_x, M_y) and an M-dimensional Dirichlet distribution with each parameter ½ as the prior for π, we integrate out π and maximize the marginal posterior over (M_x, M_y).
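A sketch of the resulting search over (Mx, My), reusing the univariate marginal log-posterior structure with M = Mx·My cells; the function name, search ranges, and seed are assumptions:

```python
import numpy as np
from scipy.special import gammaln

def log_posterior_2d(X, Mx, My):
    """2-D analogue of the univariate marginal log-posterior: M = Mx*My cells,
    Dirichlet(1/2, ..., 1/2) prior on the cell probabilities, flat prior on
    (Mx, My); constants not depending on (Mx, My) are dropped."""
    counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=(Mx, My))
    n_k = counts.ravel()
    N, M = len(X), Mx * My
    return (N * np.log(M)
            + gammaln(M / 2.0) - M * gammaln(0.5)
            - gammaln(N + M / 2.0)
            + np.sum(gammaln(n_k + 0.5)))

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=1000)
print(max(((Mx, My) for Mx in range(2, 31) for My in range(2, 31)),
          key=lambda p: log_posterior_2d(X, *p)))
```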
29. We have dealt only with histogram estimators in this paper. However, one may apply a smoothing parameter to make the estimator more efficient and analyze the values of the T statistic for various smoothing parameters.
We have used only the Bayesian and Scott multivariate extensions. However, one may try to generalize the other methods to the multivariate case.
One may also use other forms of penalty and observe for which penalty the resulting estimator is most efficient.
30. From all three univariate simulation experiments we infer that Scott’s and the Freedman-Diaconis methods have been the most efficient in reducing the values of T.
No method, however, is uniformly best under all scenarios.
For the bivariate normal case, using Scott’s rule and Bayesian optimal binning, we find that the T value is smaller on average under Scott’s rule than under Bayesian optimal binning.