Transcript of "Selection of bin width and bin number in histograms"
Selection of bin width and bin number for Histograms
We often face the problem that the density of a variable of interest is unknown. One popular method of estimating the unknown density is the histogram estimator. The decision on the bin number or bin width in a histogram is often made arbitrarily or subjectively, but it need not be. Here we review the literature on the statistical procedures that have been proposed for choosing the optimal bin width and bin number.
We review the methods prevalent in the statistical literature for determining the optimal number of bins and the bin width in a histogram. We also present a comparative analysis to determine which methods are more efficient. The measure we use to compare the methods of optimal binning is $\sup_x |\hat{h}(x) - f(x)|$, where $\hat{h}(x)$ is the histogram density estimator at $x$ and $f(x)$ is the true density at $x$.
Proposed methods of interest for optimal binning
• Sturges' rule and Doane's modification
• Scott's rule and the Freedman-Diaconis modification
• Bayesian optimal binning
• Optimal binning by Hellinger risk minimization
• Penalized maximum log-likelihood method with penalty A and the Hogg penalty
• Stochastic complexity or Kolmogorov complexity method
Sturges' Rule
Construct a frequency distribution with k bins, each of width 1 and centered on the points i = 0, 1, ..., k-1. Choose the bin count of the i-th bin to be the binomial coefficient $\binom{k-1}{i}$. As k increases, this ideal frequency histogram assumes the shape of a normal density with mean (k-1)/2 and variance (k-1)/4. The total sample size is then $n = \sum_{i=0}^{k-1} \binom{k-1}{i} = 2^{k-1}$, where k is the number of bins to be used. Solving for k gives $k = 1 + \log_2 n$. We split the sample range into k such bins of equal length, so Sturges' rule gives a regular histogram.
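As a small illustration of the rule above, the following sketch computes the Sturges bin count and checks the binomial identity that the derivation relies on. The function name `sturges_bins` is hypothetical, and rounding up to the next integer is an assumption, since the slide does not say how a non-integer value of 1 + log2(n) should be handled.

```python
import math

def sturges_bins(n: int) -> int:
    """Sturges' bin count k = 1 + log2(n); rounding up is an assumption."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 + math.ceil(math.log2(n))

# Identity behind the rule: the binomial bin counts sum to 2**(k-1).
k = 8
assert sum(math.comb(k - 1, i) for i in range(k)) == 2 ** (k - 1)

print(sturges_bins(1000))
```

The bin count depends only on n, not on the shape of the data.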
Conceptual fallacy of Sturges' rule
• There is a conceptual fallacy in the derivation of Sturges' rule. Instead of choosing $n = 2^{k-1}$, one could have chosen any n for which the individual cell frequencies keep the binomial shape.
• The number of observations in the i-th cell, m(i), could equally well have been taken to be $m(i) = n \binom{k-1}{i} 2^{-(k-1)}$.
• So, intuitively, the motivation behind Sturges' rule gives no reason for choosing this particular n.
Doane's law
For skewed or kurtotic distributions, additional bins may be required. Doane proposed increasing the number of bins by log2(1 + ŷ), where ŷ is the standardized skewness coefficient.
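Doane's correction can be sketched as follows. The slide only says that the skewness is "standardized". Dividing the sample skewness by its standard error sqrt(6(n-2)/((n+1)(n+3))) and taking the absolute value follow the commonly quoted form of Doane's formula, and are assumptions here. The function name `doane_bins` is hypothetical, and the result is left unrounded.

```python
import math

def doane_bins(data):
    """Sturges' count plus Doane's skewness term (returned unrounded)."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    if m2 == 0:
        raise ValueError("data must not be constant")
    skewness = m3 / m2 ** 1.5
    # Assumed standardization: standard error of the sample skewness.
    std_err = math.sqrt(6.0 * (n - 2) / ((n + 1) * (n + 3)))
    extra_bins = math.log2(1.0 + abs(skewness) / std_err)
    return 1.0 + math.log2(n) + extra_bins

symmetric = [float(x) for x in range(-50, 51)]
skewed = [float(x * x) for x in range(101)]
print(doane_bins(symmetric), doane_bins(skewed))
```

For perfectly symmetric data the extra term vanishes and the count reduces to Sturges' value; skewed data get more bins.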
Scott's rule and the Freedman-Diaconis modification
• We obtain an optimal bin width by minimizing the asymptotic expected L2 norm of the estimation error.
• The histogram estimator is $\hat{h}(x) = V_k/(nh)$, where h is the bin width, n is the total number of observations and $V_k$ is the number of observations lying in the k-th bin.
• The optimal bin width is $h^*(x) = [f(x_k)/(2\gamma^2 n)]^{1/3}$, where $x_k$ is some point lying in the k-th bin and γ is the Lipschitz continuity factor.
• For a normal density this gives $h^* = 3.5\,\mathrm{sd}(x)\,n^{-1/3}$ for the regular case.
• The Freedman-Diaconis modification for non-normal data is $h^* = 2\,(\mathrm{IQ})\,n^{-1/3}$, where IQ is the interquartile range of the sample.
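The two bin-width formulas above can be sketched with the standard library alone. The helper names are hypothetical. The quartiles come from `statistics.quantiles` with its default method, which is one of several quartile conventions, and converting a width into a bin count by dividing the sample range and rounding up is an assumption.

```python
import math
import statistics

def scott_width(data):
    """Scott's bin width for the normal reference case: 3.5 * sd * n^(-1/3)."""
    n = len(data)
    return 3.5 * statistics.stdev(data) * n ** (-1.0 / 3.0)

def freedman_diaconis_width(data):
    """Freedman-Diaconis bin width: 2 * IQR * n^(-1/3)."""
    n = len(data)
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    return 2.0 * (q3 - q1) * n ** (-1.0 / 3.0)

def bins_from_width(data, width):
    """Number of equal-width bins needed to cover the sample range."""
    return max(1, math.ceil((max(data) - min(data)) / width))

data = [float(x) for x in range(1, 9)]
print(scott_width(data), freedman_diaconis_width(data))
print(bins_from_width(data, scott_width(data)))
```

Because the interquartile range is less sensitive to outliers than the standard deviation, the Freedman-Diaconis width reacts less to heavy tails.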
Hellinger risk minimization
• The Hellinger risk between the histogram density estimator $\hat{h}(x)$ of a regular histogram with a given bin width and the true density f(x) is defined as $H = E \int \left(\sqrt{\hat{h}(x)} - \sqrt{f(x)}\right)^2 dx$, up to a constant factor that depends on convention.
• We minimize this quantity over different choices of the bin width or bin number.
• If the true f is known, the integral poses no problem. If the true f is not known, one may estimate f by bootstrapping over repeated resamples of the observed data.
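For the case where the true density is known, the following sketch approximates the squared Hellinger distance between a regular histogram and that density for a single sample. Averaging it over repeated samples would estimate the risk. All names are hypothetical. The sketch assumes the convention without the factor 1/2, a midpoint-rule integral over a fixed interval [a, b], and N(0,1) as the reference density.

```python
import math
import random

def regular_histogram(data, k):
    """Left edge, bin width and density heights of a k-bin regular histogram."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        counts[min(int((x - lo) / width), k - 1)] += 1
    n = len(data)
    return lo, width, [c / (n * width) for c in counts]

def hellinger_sq(data, k, true_pdf, a, b, grid=20000):
    """Midpoint-rule approximation of the integral of (sqrt(h) - sqrt(f))^2."""
    lo, width, heights = regular_histogram(data, k)
    step = (b - a) / grid
    total = 0.0
    for i in range(grid):
        x = a + (i + 0.5) * step
        j = math.floor((x - lo) / width)
        h = heights[j] if 0 <= j < k else 0.0  # estimator is 0 outside the range
        total += (math.sqrt(h) - math.sqrt(true_pdf(x))) ** 2
    return total * step

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
for k in (1, 5, 10, 20, 40):
    print(k, hellinger_sq(sample, k, normal_pdf, -6.0, 6.0))
```

Scanning k and picking the smallest value mimics the minimization described above for a known f.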
Bayesian model for optimal binning
• The likelihood of the data d given the number of bins M and the vector of bin probabilities π = (π_1, ..., π_M) is $P(d \mid \pi, M, I) = (M/V)^N \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_{M-1}^{n_{M-1}} \pi_M^{n_M}$, where N is the number of observations, $n_k$ is the count in the k-th bin, V = Mv and v is the bin width.
• The prior densities are defined as follows: $P(M \mid I) = 1/C$, where C is the maximum number of bins taken into account, and $P(\pi \mid M) = \frac{\Gamma(M/2)}{\Gamma(1/2)^M} [\pi_1 \pi_2 \cdots \pi_M]^{-1/2}$. The latter is a Dirichlet distribution with all M parameters equal to 1/2, which is a conjugate prior of the multinomial distribution.
• The posterior $P(\pi, M \mid d, I) = k\, P(\pi \mid M)\, P(M \mid I)\, P(d \mid \pi, M, I)$, with k a normalizing constant, is obtained and integrated over π to get the marginal posterior distribution of M, which when maximized yields the optimal value of M.
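Integrating the Dirichlet prior against the likelihood above gives, up to an additive constant, log P(M | d, I) = N log M + log Γ(M/2) − M log Γ(1/2) − log Γ(N + M/2) + Σ_k log Γ(n_k + 1/2). The sketch below evaluates this expression and searches over M. The function names, the use of the sample range as V, and the cap `max_bins` (playing the role of C) are assumptions made for illustration.

```python
import math
import random

def log_posterior(data, M):
    """Log marginal posterior of M bins, up to an additive constant."""
    n = len(data)
    lo, hi = min(data), max(data)
    width = (hi - lo) / M
    counts = [0] * M
    for x in data:
        counts[min(int((x - lo) / width), M - 1)] += 1
    return (n * math.log(M)
            + math.lgamma(M / 2.0)
            - M * math.lgamma(0.5)
            - math.lgamma(n + M / 2.0)
            + sum(math.lgamma(c + 0.5) for c in counts))

def optimal_bins(data, max_bins=50):
    """Bin count in 1..max_bins that maximizes the log posterior."""
    return max(range(1, max_bins + 1), key=lambda M: log_posterior(data, M))

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
print(optimal_bins(sample))
```

The flat prior 1/C does not affect the maximizer, so it is dropped along with the other constants.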
Maximum penalized log-likelihood method
• Here we maximize the log-likelihood of the multinomial distribution corresponding to a histogram, but with a penalty function subtracted. The penalized log-likelihood is thus of the form $Pl = \log L(\hat{h}; x_1, x_2, \ldots, x_n) - \mathrm{pen}_n(I)$, where I is the partition of the sample range into disjoint intervals.
• These bins need not be of equal length, i.e., the histogram may be irregular.
• There are various choices of penalty. For a histogram with D bins, our two choices are penalty A (penA), which is applicable to both regular and irregular cases, and penalty B (penB, the Hogg or Akaike penalty), $\mathrm{pen}_B = D - 1$.
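A minimal sketch of the criterion for the regular case with the penalty D − 1 follows. For a regular histogram with D bins of width w, the maximized log-likelihood is Σ_j N_j log(N_j / (n w)), with empty bins contributing nothing. The function names and the search cap `max_bins` are hypothetical, and penalty A and the irregular case are not covered.

```python
import math
import random

def penalized_loglik(data, D):
    """Regular-histogram log-likelihood minus the penalty D - 1."""
    n = len(data)
    lo, hi = min(data), max(data)
    width = (hi - lo) / D
    counts = [0] * D
    for x in data:
        counts[min(int((x - lo) / width), D - 1)] += 1
    loglik = sum(c * math.log(c / (n * width)) for c in counts if c > 0)
    return loglik - (D - 1)

def best_bin_count(data, max_bins=50):
    """Bin count in 1..max_bins that maximizes the penalized log-likelihood."""
    return max(range(1, max_bins + 1), key=lambda D: penalized_loglik(data, D))

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
print(best_bin_count(sample))
```

Without the penalty the log-likelihood keeps growing as bins are added, which is why some penalty on D is needed.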
Stochastic complexity method
• This is based on the idea of encoding the data with the minimum number of bits. It is a form of penalized maximum likelihood (PML) with the number of bits, or description length, as the penalty.
• If P(X|Θ) is the distribution of the data with Θ unknown, and σ_i(Θ) is the standard deviation of the best estimator of the i-th coordinate of Θ, then the description length is $-\log_2 P(X \mid \Theta)$ plus a sum over the coordinates i of $\log_2$ terms in $\sigma_i(\Theta)$.
• We define the stochastic complexity as $-\log_2 \int P(X \mid \Theta)\, \pi(\Theta)\, d\Theta$.
• Taking a uniform prior for Θ and P(X|Θ) to be the multinomial distribution, we get the stochastic complexity from $l = (m-1)!\, N_1!\, N_2! \cdots N_m! / (m+n-1)!$, where $N_j$ is the count in the j-th of m bins. Maximizing l with respect to m gives the number of bins.
Simulation design
• To compare the binning methods, we run simulation experiments with 3 reference distributions, namely Chi-square(2), Normal(0,1) and Uniform(1,10).
• We compute the statistic $T = \sup_x |\hat{h}(x) - f(x)|$ for each method and compare how small T is on average.
• We simulate 1000 observations from each reference distribution, compute the T statistic for each simulated run, and repeat the experiment 200 times to get a distribution of T.
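The design above can be sketched for one reference distribution as follows. This is only an illustration under stated assumptions: the supremum is approximated on a fine grid over the sample range, the bin count is fixed at 10 instead of being chosen by one of the rules, and only 20 replications are run to keep the example fast. All names are hypothetical.

```python
import math
import random
import statistics

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def sup_error(data, k, pdf, grid=2000):
    """Grid approximation of sup |h_hat(x) - f(x)| over the sample range."""
    n = len(data)
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        counts[min(int((x - lo) / width), k - 1)] += 1
    heights = [c / (n * width) for c in counts]
    worst = 0.0
    for i in range(grid):
        x = lo + (i + 0.5) * (hi - lo) / grid
        j = min(int((x - lo) / width), k - 1)
        worst = max(worst, abs(heights[j] - pdf(x)))
    return worst

rng = random.Random(0)
T_values = []
for _ in range(20):  # the study described above uses 200 replications
    sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    T_values.append(sup_error(sample, 10, normal_pdf))

print(statistics.mean(T_values), statistics.variance(T_values))
```

Replacing the fixed bin count with the output of a binning rule gives the per-method mean and variance of T.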
Mean and variance of T for Chi-square(2)
Method             Mean no. of bins   Mean(T)   Variance(T)
Sturges            10                 0.1364    0.00031
Doane              15                 0.1028    0.00018
Scott              19                 0.0874    0.00015
Hellinger          13                 0.1151    0.00022
FD                 32                 0.0747    0.00023
Kolmogorov         10                 0.2744    0.01288
Bayesian           12                 0.1177    0.00037
Hogg               18                 0.0948    0.000194
Irregular (penA)   6                  0.1134    0.00028
Analysis of the chi-square simulation
• For Chi-square(2), the Freedman-Diaconis and Scott rules perform very well in terms of a small mean value of T.
• Kolmogorov's complexity method has the largest spread in the values of T.
• The distribution of T under Sturges' rule dominates that under the Freedman-Diaconis and Scott rules, i.e., T tends to be larger under Sturges' rule.
• The irregular histogram method under penA gives far fewer bins than the other methods.
Mean and variance of T for N(0,1)
Method             Mean no. of bins   Mean(T)   Variance(T)
Sturges            10                 0.0909    0.00013
Scott              18                 0.08377   0.00025
Hellinger          20                 0.08309   0.00022
FD                 25                 0.08687   0.00029
Kolmogorov         13                 0.2243    0.0137
Bayesian           13                 0.0912    0.00022
Hogg               12                 0.0855    0.00011
Irregular (penA)   6                  0.1984    0.00113
• For Normal(0,1), we left out Doane's modification, as it is meant for non-normal or skewed distributions.
• Sturges' rule and Scott's rule perform very well in the normal case, which is expected given that they are designed under normality assumptions.
• The Scott, Freedman-Diaconis and Sturges rules are very close to one another in terms of the distribution of T.
• The penalized log-likelihood with penalty A has a distribution of T that dominates the distribution of T under the other methods.
• The distributions of T under stochastic complexity and Hellinger distance have the maximum spread. The minimum spread is under Sturges' rule.
Mean and variance of T for U(1,10)
Method     Mean no. of bins   Mean(T)   Variance(T)
Sturges    10                 0.1298    0.00036
Scott      9                  0.1288    0.00035
Doane      11                 0.1308    0.00051
FD         9                  0.1283    0.00032
Bayesian   9                  0.1274    0.000361
Analysis under the U(1,10) distribution
• Most of the other methods give only 1 or 2 bins in the uniform case, so they cannot be compared with the methods tabulated, which are more stable.
• The Scott, Freedman-Diaconis and Sturges rules, however, perform well, with small values of T and small variation in T under repeated simulations.
As in the univariate case, we extend the methods to bivariate distributions. Here we simulate observations from a bivariate normal distribution with mean (0,0), ρ = 0.5 and σ² = 1. The methods we use are the multivariate extension of Bayesian optimal binning and the multivariate Scott's rule.
As in the univariate case, the multivariate Scott's rule is obtained by minimizing the asymptotic expected L2 error. The multivariate Scott's choice of bin width along the k-th coordinate is $h_k^* = 3.5\, \sigma_{x_k}\, n^{-1/(2+d)}$, where d is the dimension of the dataset, n is the number of observations and $\sigma_{x_k}$ is the standard deviation along the k-th coordinate.
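A short sketch of the multivariate rule, using the commonly quoted form h_k = 3.5 σ_k n^(−1/(2+d)), is given below, together with one way to simulate the bivariate normal design with ρ = 0.5. The function name and the construction of correlated pairs from two independent standard normals are assumptions for illustration.

```python
import math
import random
import statistics

def scott_widths(rows):
    """Per-coordinate Scott bin widths: 3.5 * sd_k * n^(-1/(2+d))."""
    n = len(rows)
    d = len(rows[0])
    factor = n ** (-1.0 / (2 + d))
    return [3.5 * statistics.stdev(column) * factor for column in zip(*rows)]

# Bivariate normal with mean (0,0), unit variances and correlation rho.
rho = 0.5
rng = random.Random(0)
rows = []
for _ in range(1000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    rows.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))

print(scott_widths(rows))
```

With d = 1 the exponent becomes −1/3, so the rule reduces to the univariate Scott width.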
The 3-d histogram obtained for T statistic distribution under Scott rule
Distribution of the T statistic for the bivariate normal under Scott's rule
Bayesian optimal binning for the multivariate normal case
• In this case we select M_x bins along the X axis and M_y bins along the Y axis and define M = M_x M_y.
• The joint likelihood h(x, y, M_x, M_y) is analogous to the likelihood in the univariate case, with the M cells in place of the bins.
• We again take a rectangular prior for (M_x, M_y) and a Dirichlet distribution of dimension M, with each parameter equal to 1/2, as the prior for π.
Bivariate normal histogram under Bayesian optimal binning
Distribution of T under the Bayesian rule for the bivariate normal
We have dealt only with histogram estimators in this paper. However, one may apply a smoothing parameter to make the estimator more efficient and analyze the values of the T statistic for various smoothing parameters. We have used only the Bayesian and Scott multivariate extensions; one may try to generalize the other methods to the multivariate case. One may also use other forms of penalty and observe for which penalty the resulting estimator is most efficient.
From all three univariate simulation experiments we infer that the Scott and Freedman-Diaconis methods have been the most efficient in reducing the values of T. No method, however, is uniformly best under all scenarios. For the bivariate normal case, comparing Scott's rule and Bayesian optimal binning, we find that T is smaller on average under Scott's rule than under Bayesian optimal binning.