Statistical software for Sampling from Finite Populations: an analysis using Monte Carlo simulations

Michele De Meo
PhD in Statistics
Summary of the thesis
Università degli Studi di Bari - Italy
micheledemeo@gmail.com

Aim of the thesis
Test the quality of the software for probability proportional to size (PPS) sampling
Quality here means the ability of the software to preserve the theoretical properties of the algorithms
Method: Monte Carlo simulations

PPS Sampling without Replacement
1. Hanurav-Vijayan
2. Rao-Sampford
Three important properties:
First-order inclusion probability proportional to size
These algorithms enable computation of joint selection probabilities
Joint selection probabilities usually ensure non-negativity and stability of the Sen-Yates-Grundy variance estimator
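
A minimal Python sketch of the first and third properties, under stated assumptions. The population size N = 10 and the helper names `inclusion_probabilities` and `syg_variance` are illustrative and do not come from the thesis; the choice x_i = i with n = 5 follows the simulation setup. The variance function uses the standard Sen-Yates-Grundy form for the Horvitz-Thompson total and expects the joint probabilities of the sampled units to be supplied by the caller.

```python
from itertools import combinations


def inclusion_probabilities(x, n):
    """First-order inclusion probabilities pi_i = n * x_i / sum(x)."""
    total = sum(x)
    pi = [n * xi / total for xi in x]
    if max(pi) > 1:
        raise ValueError("some unit has n * x_i / sum(x) > 1")
    return pi


def syg_variance(y, pi, pi_joint):
    """Sen-Yates-Grundy variance estimate of the Horvitz-Thompson total.

    y, pi: study values and first-order probabilities of the sampled units.
    pi_joint[i][j]: joint inclusion probability of sampled units i and j.
    """
    v = 0.0
    for i, j in combinations(range(len(y)), 2):
        weight = (pi[i] * pi[j] - pi_joint[i][j]) / pi_joint[i][j]
        v += weight * (y[i] / pi[i] - y[j] / pi[j]) ** 2
    return v


# Hypothetical population (assumption): x_i = i for i = 1..10, n = 5.
x = list(range(1, 11))
pi = inclusion_probabilities(x, n=5)
print([round(p, 4) for p in pi])
```

Each pairwise term is non-negative whenever pi_i * pi_j >= pi_ij, which is why the joint probabilities matter for the non-negativity of the estimator.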

Statistical Software and PPS sampling
Closed source software:
SAS PROC SURVEYSELECT
SPSS Complex Samples
Open source software:
R sampling package

Some notes
The published description of Hanurav-Vijayan (H-V) appears confused. Two later scientific articles attempted to correct the original one.
According to the official user's guides of SAS and SPSS, the H-V algorithm was implemented in a way that does not exactly coincide with the original one. The two implementations also seem to differ from each other.
The source code of SAS and SPSS is "closed" and not available, so it is not possible to check the implemented algorithm "directly".
Hanurav-Vijayan is not available in R, so the algorithm was implemented (and tested) following the published literature.

Simulation and software control: the sampled population.
The target population (used to test the algorithms) is the following (assuming a sample of size n = 5):
The auxiliary variable (x) used to select the sample is equal to i. This is a "trick" that simplifies the code while keeping the experiment valid.

Simulation and software control: the sampled population.
A positive outcome of the tests performed with this population (with a sample of size n = 5) is necessary but not sufficient for the validity of the sampling algorithms.
A negative outcome with this population and this sample would be sufficient to invalidate an algorithm: it must "work" regardless of the population or the sample size.

Simulation and software control: test 1
The Joint Probability Matrix for Hanurav-Vijayan and Rao-Sampford is well known:
A first test is comparing the output of the software with the correct matrix.

Such a test is not easy in SAS and SPSS:
the computation is a hidden procedure
the returned matrix refers only to the selected units, not to the whole population
an ad-hoc procedure had to be developed

Simulation and software control: results of test 1
In SAS, SPSS and R (for both Hanurav-Vijayan and Rao-Sampford) the matrix is exactly equal to the original one!

Perform a Monte Carlo simulation to obtain a numerical estimate of the joint probability matrix.
Measure the "distance" between estimates and the original matrix.

Simulation steps
1. define the target population and sample size;
2. define the matrix $P^{(0)} = [0]_{N \times N}$, where N is the population size;
3. define the number of simulations (K);
4. execute the following steps K times, for H = 1, 2, …, K:
5. draw a sample of n = 5 units, then build the vector $s_H = [s_H(i)]_{N \times 1}$. The element $s_H(i)$ is equal to 1 if unit i of the population has been selected, 0 otherwise;
6. update the matrix P: $P^{(H)} = P^{(H-1)} + s_H s_H'$;
7. the cross product $s_H s_H'$ produces a symmetric N x N matrix. Element (i,j) of this matrix is 1 for pairs of drawn units, 0 otherwise.
At the end of this simulation process, the "numerical" estimate of the joint probability matrix is equal to: $P = P^{(K)} / K$

The justification for this simulation process is the weak law of large numbers:
P is a "good" estimate of $\pi$
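
The simulation steps can be sketched in Python as follows. This is an illustration under stated assumptions and not the thesis code (which was written in R, SAS and SPSS): the sampler is Sampford's rejective procedure for Rao-Sampford sampling as it is commonly described, the population x_i = i with N = 10 is hypothetical, and the function names are invented for the example. The procedure requires n * x_i / sum(x) < 1 for every unit.

```python
import random


def sampford_sample(x, n, rng):
    """Rao-Sampford rejective draw: returns a set of n distinct 0-based indices."""
    total = sum(x)
    p = [xi / total for xi in x]
    lam = [pi / (1 - n * pi) for pi in p]  # weights for draws 2..n
    units = range(len(x))
    while True:
        first = rng.choices(units, weights=p, k=1)
        rest = rng.choices(units, weights=lam, k=n - 1)
        sample = set(first + rest)
        if len(sample) == n:  # accept only if all drawn units are distinct
            return sample


def estimate_joint_matrix(x, n, K, seed=0):
    """Accumulate s_H s_H' over K samples and return P(K) / K."""
    rng = random.Random(seed)
    N = len(x)
    P = [[0] * N for _ in range(N)]
    for _ in range(K):
        s = sampford_sample(x, n, rng)
        for i in s:          # adding 1 to every (i, j) with i, j in the sample
            for j in s:      # is the same as adding the outer product s s'
                P[i][j] += 1
    return [[count / K for count in row] for row in P]


# Hypothetical population (assumption): x_i = i for i = 1..10, n = 5.
x = list(range(1, 11))
P = estimate_joint_matrix(x, n=5, K=2000)
print([round(P[i][i], 3) for i in range(len(x))])  # estimates of n * x_i / sum(x)
```

The diagonal of the estimated matrix holds the first-order inclusion probabilities, and the off-diagonal elements hold the joint ones, so a single accumulation covers both.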

It is possible to "measure the distance" between the estimated and true values using the following distribution:
For each pair of units (i,j), we can use the p-level to assess "how good the estimate is".
P-levels too close to zero indicate a wrong output!
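
The distribution referred to above is not reproduced in this text. As a hedged illustration only, the sketch below assumes that the number of times a pair is drawn in K independent samples is Binomial(K, π_ij) and uses the normal approximation to obtain a two-sided p-level. The function name `p_level` and the numeric inputs are hypothetical.

```python
import math


def p_level(p_hat, pi_true, K):
    """Two-sided p-level for H0: the estimate p_hat has mean pi_true.

    Assumes a normal approximation to Binomial(K, pi_true), with 0 < pi_true < 1.
    """
    se = math.sqrt(pi_true * (1 - pi_true) / K)
    z = (p_hat - pi_true) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Hypothetical values: a true joint probability of 0.20 and two estimates.
K = 10_000_001
print(p_level(0.2001, 0.20, K))  # estimate close to the true value
print(p_level(0.2100, 0.20, K))  # estimate far from the true value, p-level near 0
```

With K this large the standard error is tiny, so even small systematic deviations from the true matrix push the p-level to zero.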

Simulation results P-level in R for Hanurav-Vijayan.
K = 10,000,001 and n = 5. P-levels < .01 are highlighted in red.

Simulation results P-level in SAS for Hanurav-Vijayan.

Simulation results P-level in SPSS for Hanurav-Vijayan.

Simulation results P-level in R for Rao-Sampford.

Simulation results P-level in SAS for Rao-Sampford.

Simulation results P-level in SPSS for Rao-Sampford.

Conclusions
R is better suited to the development of this type of simulation
the data.frame (the "container" of the data) is easily managed as a matrix "object", so it is easy to access the data
Specific "libraries" are needed to work with matrices in SAS and SPSS
R code is more "fluent" and more powerful!

Conclusions
Looking at the joint probability matrix in R, there is a clear "correspondence" between estimated and true values (for both Rao-Sampford and Hanurav-Vijayan)
The results are generally poor for both algorithms as tested in SAS and SPSS:
almost all p-values are equal to zero, even for the first-order inclusion probabilities!

Conclusions
SAS and SPSS do not converge to the correct result, for two possible reasons:
1. wrong implementation of the algorithm in the program code (in both packages!)
2. wrong pseudo-random number generator (PRNG):

Conclusions
Pseudo-Random Number Generator used:
SAS
Linear congruential generator, Park-Miller (modulus: 2^31-1, period: 2^31-2)
R and SPSS
Mersenne-Twister (period: 2^19937 - 1)
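
To make the two generator families concrete, here is a small Python sketch. The multiplier 16807 is the "minimal standard" value proposed by Park and Miller and is an assumption here: the slides do not state which multiplier SAS uses. Python's `random` module is based on the Mersenne Twister, so it stands in for the generator family used by R and SPSS.

```python
import random

M = 2**31 - 1  # modulus of the multiplicative congruential generator
A = 16807      # Park-Miller "minimal standard" multiplier (assumption)


def park_miller(seed):
    """Yield successive states of x_{k+1} = A * x_k mod M."""
    state = seed
    while True:
        state = (A * state) % M
        yield state


# Park and Miller's published check: starting from seed 1, the state
# after 10,000 steps should be 1043618065.
gen = park_miller(1)
for _ in range(10_000):
    value = next(gen)
print(value)

# Mersenne Twister via the standard library, seeded for reproducibility.
mt = random.Random(12345)
print(mt.random())
```

The state space of the congruential generator holds about 2.1 billion values, which is worth comparing with the number of random draws a simulation with K = 10,000,001 samples consumes.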

Conclusions
SAS and SPSS lead to strongly "biased" results, regardless of the cause of non-convergence (both for Hanurav-Vijayan and for Rao-Sampford)
This has a negative impact on the validity of simulation studies carried out with these procedures (in SAS and SPSS). For example: Monte Carlo simulations to verify the bias of an estimator (such as for the total or the variance)

Bibliography
Brewer, K. and Hanif, M. (1983), Sampling with Unequal Probabilities, Springer-Verlag, New York.
Chambers, J. (2008), Software for Data Analysis: Programming with R, Springer, New York.
Chieppa, M. and D'Orazio, M. (1999), Appunti di Teoria dei Campioni, Università degli Studi del Sannio, Benevento.
Cicchitelli, G., Herzel, A. and Montanari, G. E. (1997), Il campionamento statistico, Il Mulino, Bologna.
Efron, B. and Tibshirani, R. (1991), Statistical data analysis in the computer age, Science, vol. 253, p. 390-395.
Fishman, G. and Moore, L. R. (1981), In search of correlation in multiplicative congruential generators with modulus 2**31-1, Computer Science and Statistics, Proceedings of the 13th Symposium on the Interface, p. 155-157.
Fox, D. (1989), Computer Selection of Size-Biased Samples, The American Statistician, vol. 43 (3), p. 168-171.
Gentle, J., Härdle, W. and Mori, Y. (2008), Handbook of Computational Statistics: Concepts and Methods, Springer-Verlag, New York.
Golmant, J. (1990), Correction: Computer Selection of Size-Biased Samples, The American Statistician, vol. 44 (2), p. 194.
Hanurav, T. (1967), Optimum Utilization of Auxiliary Information: Sampling of Two Units from a Stratum, Journal of the Royal Statistical Society, vol. B (29), p. 374-391.
Lauro, C. (1996), Computational statistics or statistical computing, is that the question?, Computational Statistics and Data Analysis, vol. 23 (1), p. 191-193.
L'Ecuyer, P. (1990), Random Numbers for Simulation, Communications of the ACM, vol. 33 (10), p. 85-97.
Marsaglia, G. (1995), The Diehard Battery of Tests of Randomness, Tech. rep., Florida State University, http://www.stat.fsu.edu/pub/diehard/.
Matsumoto, M. and Nishimura, T. (1998), Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator, ACM Transactions on Modeling and Computer Simulation, vol. 8 (1), p. 3-30.
Mecatti, F. (2004), Lezioni di Metodi di Simulazione, Università degli Studi di Milano-Bicocca.
Mood, A., Graybill, A. and Boes, D. (1991), Introduzione alla Statistica, McGraw-Hill, Milano.
Rao, J. (1965), On two simple schemes of unequal probability sampling without replacement, The Indian Journal of Statistics, vol. 3, p. 173-180.
Raynald, L. (2008), Programming and Data Management for SPSS Statistics 17.0. A Guide for SPSS Statistics and SAS Users, http://www.spss.com/statistics/base/ProgDataMgmtSPSS17.pdf.
Sampford, M. (1967), On sampling without replacement with unequal probabilities of selection, Biometrika, vol. 54, p. 499-513.
SAS-Institute (1999), SAS/STAT User's Guide - Version 8, http://www.math.wpi.edu/saspdf/stat/chap63.pdf.
Sen, A. (1953), On the estimate of variance in sampling with varying probabilities, Journal of the Indian Society of Agricultural Statistics, vol. 5, p. 119-127.
Tillé, Y. (2006), Sampling Algorithms, Springer Series in Statistics, New York.
Tillé, Y. and Alina, M. (2009), Package sampling, http://cran.r-project.org/web/packages/sampling/sampling.pdf.
Vijayan, K. (1967), An Exact pps Sampling Scheme: Generalization of a Method of Hanurav, Journal of the Royal Statistical Society, vol. B (30), p. 556-566.
Watts, D. (1991), Correction: Computer Selection of Size-Biased Samples, The American Statistician, vol. 45 (2), p. 172.
Wu, C. (2005), R/S-PLUS codes for the pseudo EL method and the Rao-Sampford sampling procedure, Tech. rep., University of Waterloo, http://www.math.uwaterloo.ca/~cbwu/Rcodes/04JSS.R.
Yates, F. and Grundy, P. (1953), Selection without replacement from within strata with probability proportional to size, Journal of the Royal Statistical Society, vol. B (15), p. 253-261.
