Statistical software for Sampling from Finite Populations: an analysis using Monte Carlo simulations 
Michele De Meo 
PhD in Statistics 
Summary of the thesis 
Università degli Studi di Bari - Italy 
micheledemeo@gmail.com
Aim of the thesis 
Test the quality of statistical software for probability proportional to size (PPS) sampling 
Quality is the ability of the software to preserve the theoretical properties of the sampling algorithms. 
Method: Monte Carlo simulations
PPS Sampling without Replacement 
1. Hanurav-Vijayan 
2. Rao-Sampford 
Three important properties: 
First-order inclusion probability proportional to size 
These algorithms enable computation of joint selection probabilities 
Joint selection probabilities usually ensure non-negativity and stability of the Sen-Yates-Grundy variance estimator
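The role of the joint selection probabilities can be made concrete with a minimal sketch of the Sen-Yates-Grundy variance estimator for a fixed-size sample. The function name and argument layout are illustrative assumptions, not taken from the thesis. Each pair of sampled units contributes a non-negative term whenever pi_i * pi_j >= pi_ij, which is the non-negativity condition mentioned above.

```python
def syg_variance(y, pi, pi_joint):
    """Sen-Yates-Grundy variance estimate of the Horvitz-Thompson total.

    y        : observed values for the n sampled units
    pi       : first-order inclusion probabilities of those units
    pi_joint : n x n matrix of joint inclusion probabilities of those units
    (names and layout are hypothetical, chosen for this illustration)
    """
    n = len(y)
    estimate = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # The weight is non-negative when pi_i * pi_j >= pi_ij.
            weight = (pi[i] * pi[j] - pi_joint[i][j]) / pi_joint[i][j]
            estimate += weight * (y[i] / pi[i] - y[j] / pi[j]) ** 2
    return estimate


# Two sampled units with pi_i = 0.5 each and joint probability 0.2.
example = syg_variance([1.0, 3.0], [0.5, 0.5], [[0.5, 0.2], [0.2, 0.5]])
print(example)
```

When y is exactly proportional to the inclusion probabilities, every squared difference vanishes and the estimate is zero, which is why PPS sampling pays off when the auxiliary variable tracks the study variable closely.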
Statistical Software and PPS sampling 
Closed source software: 
SAS PROC SURVEYSELECT 
SPSS Complex Samples 
Open source software: 
R sampling package
Some notes 
The official documentation of the Hanurav-Vijayan (H-V) algorithm appears confused: two later scientific articles tried to correct the original one. 
According to the official user's guides of SAS and SPSS, the H-V algorithm was implemented in a way that does not exactly coincide with the original. The two implementations also seem to differ from each other. 
The source code of SAS and SPSS is closed and not available, so the implemented algorithm cannot be checked directly. 
Hanurav-Vijayan is not available in R, so the algorithm was implemented (and tested) according to the official bibliography.
Simulation and software control: the sampled population. 
The target population used to test the algorithms is the following (assuming a sample of size n = 5): 
The auxiliary variable x used to select the sample is set equal to the unit index i. This "trick" simplifies the code without affecting the validity of the experiment.
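With x equal to the unit index, the target first-order inclusion probabilities follow directly from pi_i = n * x_i / sum(x). The sketch below computes them exactly. The population size N = 10 is a hypothetical value chosen for illustration, because the population definition on the slide is not reproduced in this transcript; with x_i = i and n = 5, any N of at least 9 keeps every probability at or below 1.

```python
from fractions import Fraction


def inclusion_probabilities(x, n):
    """Target first-order inclusion probabilities pi_i = n * x_i / sum(x)."""
    total = sum(x)
    pi = [Fraction(n * xi, total) for xi in x]
    if max(pi) > 1:
        raise ValueError("n * x_i / sum(x) exceeds 1 for at least one unit")
    return pi


N = 10                      # hypothetical population size (not given in the transcript)
x = list(range(1, N + 1))   # auxiliary variable equal to the unit index i
pi = inclusion_probabilities(x, n=5)
print([str(p) for p in pi])
print(sum(pi))              # the probabilities sum to the sample size n
```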
Simulation and software control: the sampled population. 
A positive outcome of the tests performed with this population (and with the sample of size n = 5) is necessary but not sufficient for the validity of the sampling algorithms. 
A negative outcome with this population and this sample would be sufficient to invalidate an algorithm, because it must "work" regardless of the population or the sample size.
Simulation and software control: test 1 
The Joint Probability Matrix for Hanurav-Vijayan and Rao-Sampford is well known: 
A first test is comparing the output of the software with the correct matrix.
Simulation and software control: test 1 
Such a test is not easy in SAS and SPSS: 
the computation is a hidden procedure 
the returned matrix refers only to the selected units, not to the whole population 
so an ad-hoc procedure must be developed
Simulation and software control: results of test 1 
In SAS, SPSS and R (for both Hanurav-Vijayan and Rao-Sampford) the matrix is exactly equal to the original one!
Simulation and software control: test 2 
Perform a Monte Carlo simulation to obtain a numerical estimate of the joint probability matrix. 
Measure the "distance" between estimates and the original matrix.
Simulation steps 
1. Define the target population and the sample size. 
2. Define the matrix P(0) = [0] of size N x N, where N is the population size. 
3. Define the number of simulations (K). 
4. Execute the following steps K times, for h = 1, 2, …, K. 
5. Draw a sample of n = 5 units, then build the N x 1 vector s_h = [s_h(i)]. The element s_h(i) is equal to 1 if unit i of the population has been selected, 0 otherwise. 
6. Update the matrix P: P(h) = P(h-1) + s_h s_h'. 
7. The outer product s_h s_h' is a symmetric N x N matrix. Its element (i,j) is 1 if units i and j were both drawn, 0 otherwise. 
At the end of this simulation process, the "numerical" estimate of the Joint Probability Matrix is the accumulated matrix divided by the number of simulations: P = P(K) / K.
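The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the thesis: the sampler is the rejective form of the Rao-Sampford procedure (Hanurav-Vijayan is not attempted here), the population size N = 10 is hypothetical, and all function names are invented for the example.

```python
import random


def rao_sampford_sample(x, n, rng):
    """Draw one sample of n distinct units with the rejective Rao-Sampford scheme.

    The first unit is drawn with probability p_i = x_i / sum(x); the other
    n - 1 units are drawn with replacement with probability proportional to
    p_i / (1 - n * p_i). The draw is repeated until all n units are distinct.
    """
    total = sum(x)
    p = [xi / total for xi in x]
    if n * max(p) >= 1:
        raise ValueError("requires n * x_i / sum(x) < 1 for every unit")
    lam = [pi / (1 - n * pi) for pi in p]
    units = range(len(x))
    while True:
        first = rng.choices(units, weights=p, k=1)
        rest = rng.choices(units, weights=lam, k=n - 1)
        sample = set(first + rest)
        if len(sample) == n:
            return sample


def estimate_joint_matrix(x, n, K, rng):
    """Monte Carlo estimate of the joint inclusion probability matrix."""
    N = len(x)
    counts = [[0] * N for _ in range(N)]        # step 2: P(0) = 0
    for _ in range(K):                          # step 4: repeat K times
        s = rao_sampford_sample(x, n, rng)      # step 5: draw the sample
        for i in s:                             # step 6: add s_h s_h'
            for j in s:
                counts[i][j] += 1
    return [[c / K for c in row] for row in counts]   # P = P(K) / K


if __name__ == "__main__":
    x = list(range(1, 11))                      # hypothetical N = 10, x_i = i
    P_hat = estimate_joint_matrix(x, n=5, K=2000, rng=random.Random(1))
    print([round(P_hat[i][i], 3) for i in range(len(x))])
```

The diagonal of the estimate holds the first-order inclusion probabilities, so it can be compared with n * x_i / sum(x) as a quick sanity check before looking at the off-diagonal pairs.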
The justification for this simulation process is the weak law of large numbers: 
P is a "good" estimate of π
The "distance" between the estimated and the true value can be measured using the following distribution: 
For each pair of units (i,j), the p-level indicates how good the estimate is. 
P-levels too close to zero indicate wrong output!
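The distribution shown on the slide is not reproduced in this transcript. As a hedged illustration, the sketch below assumes the usual normal approximation for a proportion estimated from K independent replications: under a correct implementation, the estimate for a pair (i,j) is approximately normal with mean pi_ij and variance pi_ij * (1 - pi_ij) / K. The function name is hypothetical.

```python
import math


def p_level(p_hat, pi_true, K):
    """Two-sided p-level for an estimated probability against its true value.

    Assumes the normal approximation to the binomial proportion; this is an
    assumption of the sketch, not necessarily the test used in the thesis.
    """
    if not 0.0 < pi_true < 1.0:
        raise ValueError("pi_true must lie strictly between 0 and 1")
    standard_error = math.sqrt(pi_true * (1.0 - pi_true) / K)
    z = (p_hat - pi_true) / standard_error
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))


print(p_level(0.5098, 0.5, 10_000))   # z is about 1.96, so the p-level is about 0.05
```

With K in the millions, as in the results that follow, even small systematic deviations from the true matrix drive the p-level towards zero, which is what makes the test sensitive to implementation errors.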
Simulation results P-level in R for Hanurav-Vijayan. 
k=10,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results P-level in SAS for Hanurav-Vijayan. 
k=10,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results P-level in SPSS for Hanurav-Vijayan. 
k=1,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results P-level in R for Rao-Sampford. 
k=10,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results P-level in SAS for Rao-Sampford. 
k=1,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results P-level in SPSS for Rao-Sampford. 
k=1,000,001 and n=5. p-level<.01 highlighted in red.
Conclusions 
R is better suited to developing this type of simulation 
the data.frame (the "container" of the data) is easily handled as a matrix object, so the data are easy to access 
SAS and SPSS need specific "libraries" to work with matrices 
R code is more fluent and more powerful!
Conclusions 
Looking at the joint probability matrix in R, there is a clear correspondence between estimated and true values (for both Rao-Sampford and Hanurav-Vijayan) 
The results are generally poor for both algorithms tested in SAS and SPSS: 
almost all p-values are equal to zero, even for the first-order inclusion probabilities!
Conclusions 
SAS and SPSS do not converge to the expected result, for two possible reasons: 
1. wrong implementation of the algorithm in the program code (in both of them!) 
2. wrong pseudo-random number generator (PRNG):
Conclusions 
Pseudo-Random Number Generator used: 
SAS 
Linear congruential generator, Park-Miller (modulus: 2^31 - 1, period: 2^31 - 2) 
R and SPSS 
Mersenne-Twister (period: 2^19937 - 1)
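For readers unfamiliar with this class of generator, the sketch below shows a Park-Miller multiplicative congruential generator with modulus 2^31 - 1. The multiplier 16807 is the "minimal standard" value proposed by Park and Miller; the slides do not state which multiplier SAS uses, so this is an illustration of the generator family, not a reproduction of the SAS implementation.

```python
MODULUS = 2**31 - 1     # Mersenne prime used as modulus
MULTIPLIER = 16807      # 7**5, the Park-Miller "minimal standard" multiplier


def park_miller(seed):
    """Yield the sequence x_{k+1} = (16807 * x_k) mod (2^31 - 1)."""
    if not 0 < seed < MODULUS:
        raise ValueError("seed must satisfy 0 < seed < 2^31 - 1")
    state = seed
    while True:
        state = (MULTIPLIER * state) % MODULUS
        yield state


gen = park_miller(1)
first_values = [next(gen) for _ in range(3)]
print(first_values)
# Uniform variates on (0, 1) are obtained by dividing each state by MODULUS.
print([v / MODULUS for v in first_values])
```

The whole state of this generator is a single 31-bit integer, so a simulation that consumes tens of millions of variates uses a noticeable fraction of the full cycle, whereas the Mersenne Twister period is vastly larger than any practical run.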
Conclusions 
SAS and SPSS lead to strongly biased results, regardless of the cause of non-convergence (for both Hanurav-Vijayan and Rao-Sampford) 
This has a negative impact on the validity of simulation studies carried out with these procedures in SAS and SPSS, for example Monte Carlo simulations that verify the bias of an estimator (such as an estimator of the total or of the variance)
Bibliography 
Brewer, K. and Hanif, M. (1983), Sampling with Unequal Probabilities, Springer-Verlag, New York. 
Chambers, J. (2008), Software for Data Analysis: Programming with R, Springer, New York. 
Chieppa, M. and D'Orazio, M. (1999), Appunti di Teoria dei Campioni, Università degli Studi del Sannio, Benevento. 
Cicchitelli, G., Herzel, A. and Montanari, G. E. (1997), Il campionamento statistico, Il Mulino, Bologna. 
Efron, B. and Tibshirani, R. (1991), Statistical data analysis in the computer age, Science, vol. 253, p. 390-395. 
Fishman, G. and Moore, L. R. (1981), In search of correlation in multiplicative congruential generators with modulus 2**31-1, Computer Science and Statistics, Proceedings of the 13th Symposium on the Interface, p. 155-157. 
Fox, D. (1989), Computer Selection of Size-Biased Samples, The American Statistician, vol. 43 (3), p. 168-171. 
Gentle, J., Härdle, W. and Mori, Y. (2008), Handbook of Computational Statistics: Concepts and Methods, Springer-Verlag, New York. 
Golmant, J. (1990), Correction: Computer Selection of Size-Biased Samples, The American Statistician, vol. 44 (2), p. 194. 
Hanurav, T. (1967), Optimum Utilization of Auxiliary Information: Sampling of Two Units from a Stratum, Journal of the Royal Statistical Society, vol. B (29), p. 374-391. 
Lauro, C. (1996), Computational statistics or statistical computing, is that the question?, Computational Statistics and Data Analysis, vol. 23 (1), p. 191-193. 
L'Ecuyer, P. (1990), Random Numbers for Simulation, Communications of the ACM, vol. 33 (10), p. 85-97. 
Marsaglia, G. (1995), The Diehard Battery of Tests of Randomness, Technical report, Florida State University, http://www.stat.fsu.edu/pub/diehard/. 
Matsumoto, M. and Nishimura, T. (1998), Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator, ACM Transactions on Modeling and Computer Simulation, vol. 8 (1), p. 3-30. 
Mecatti, F. (2004), Lezioni di Metodi di Simulazione, Università degli Studi di Milano-Bicocca. 
Mood, A., Graybill, A. and Boes, D. (1991), Introduzione alla Statistica, McGraw-Hill, Milano. 
Rao, J. (1965), On two simple schemes of unequal probability sampling without replacement, The Indian Journal of Statistics, vol. 3, p. 173-180. 
Raynald, L. (2008), Programming and Data Management for SPSS Statistics 17.0. A Guide for SPSS Statistics and SAS Users, http://www.spss.com/statistics/base/ProgDataMgmtSPSS17.pdf. 
Sampford, M. (1967), On sampling without replacement with unequal probabilities of selection, Biometrika, vol. 54, p. 499-513. 
SAS-Institute (1999), SAS/STAT User's Guide - Version 8, http://www.math.wpi.edu/saspdf/stat/chap63.pdf. 
Sen, A. (1953), On the estimate of variance in sampling with varying probabilities, Journal of the Indian Society of Agricultural Statistics, vol. 5, p. 119-127. 
Tillé, Y. (2006), Sampling Algorithms, Springer Series in Statistics, New York. 
Tillé, Y. and Alina, M. (2009), Package sampling, http://cran.r-project.org/web/packages/sampling/sampling.pdf. 
Vijayan, K. (1967), An Exact pps Sampling Scheme: Generalization of a Method of Hanurav, Journal of the Royal Statistical Society, vol. B (30), p. 556-566. 
Watts, D. (1991), Correction: Computer Selection of Size-Biased Samples, The American Statistician, vol. 45 (2), p. 172. 
Wu, C. (2005), R/S-PLUS codes for the pseudo EL method and the Rao-Sampford sampling procedure, Technical report, University of Waterloo, http://www.math.uwaterloo.ca/~cbwu/Rcodes/04JSS.R. 
Yates, F. and Grundy, P. (1953), Selection without replacement from within strata with probability proportional to size, Journal of the Royal Statistical Society, vol. B (15), p. 253-261.

Statistical software for Sampling from Finite Populations: an analysis using Monte Carlo simulations

  • 1. Statistical software for Sampling from Finite Populations: an analysis using Monte Carlo simulations Michele De Meo PhD in Statistics Summary of the thesis Università degli Studi di Bari - Italy micheledemeo@gmail.com
  • 2. Aim of the thesis Test the quality of the software for probability proportional to size (PPS) sampling The quality is the ability of the software to ensure properties of the algorithms: Monte Carlo Simulations
  • 3. PPS Sampling without Replacement 1.Hanurav-Vijayan 2.Rao-Sampford Three important properties: First-order inclusion probability proportional to size These algorithms enable computation of joint selection probabilities Joint selection probabilities usually ensure non-negativity and stability of the Sen-Yates-Grundy variance estimator
  • 4. Statistical Software and PPS sampling Closed source software: SAS PROC SURVEYSELECT SPSS COMPLEX SAMPLE Open source software: R SAMPLING library
  • 5. Some notes The official documentation of the Hanurav-Vijayan (H-V) algorithm seems confused. Two scientific articles tried to correct the original one. According to the official user's guides of SAS and SPSS, the H-V algorithm was implemented in a way that does not exactly coincide with the original one, and the two implementations seem to differ from each other. The source code of SAS and SPSS is closed and not available, so it is not possible to check the implemented algorithm directly. Hanurav-Vijayan is not available in R, so the algorithm was developed (and tested) according to the official bibliography.
  • 6. Simulation and software control: the sampled population. The target population (used to test the algorithms) is the following (assuming a sample of size n = 5): The auxiliary variable (x) used to select the sample is equal to i. This is a "trick" that simplifies the code while keeping the experiment valid.
  • 7. Simulation and software control: the sampled population. A positive outcome of the tests performed with this population (with a sample of size n = 5) is necessary but not sufficient for the validity of the sampling algorithms. A negative outcome with this population and this sample would be sufficient to invalidate the algorithm: it must "work" regardless of the population or the sample size.
  • 8. Simulation and software control: test 1 The Joint Probability Matrix for Hanurav-Vijayan and Rao-Sampford is well known: A first test compares the output of the software with the correct matrix.
  • 9. Simulation and software control: test 1 Such a test is not easy in SAS and SPSS: it is a hidden procedure, and the returned matrix refers only to the selected units, not to the whole population, so an ad-hoc procedure must be developed.
  • 10. Simulation and software control: results of test 1 In SAS, SPSS and R (for Hanurav-Vijayan and Rao-Sampford) the matrix is exactly equal to the original one!
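
As an illustration of what a "correct matrix" for test 1 can look like, the sketch below computes the exact joint probability matrix of the Rao-Sampford design by enumerating every possible sample. It assumes the standard form of the Sampford design, in which a sample s has probability proportional to (1 - sum of p_i over s) times the product of p_i / (1 - n p_i) over s, with p_i = x_i / sum(x). The population x = 1, ..., 10 is a hypothetical stand-in, since the population table from the slides is not reproduced in this transcript, and the function name is illustrative.

```python
from itertools import combinations
from math import prod


def sampford_joint_matrix(x, n):
    """Exact joint inclusion probabilities of the Rao-Sampford design.

    Enumerates all samples of size n, so it is only practical for
    small populations. Diagonal entries are the first-order
    inclusion probabilities.
    """
    total = sum(x)
    p = [xi / total for xi in x]
    if any(n * pi >= 1 for pi in p):
        raise ValueError("every unit needs n * x_i / sum(x) < 1")
    lam = [pi / (1 - n * pi) for pi in p]
    N = len(x)
    matrix = [[0.0] * N for _ in range(N)]
    norm = 0.0
    for s in combinations(range(N), n):
        # Unnormalised probability of drawing exactly the sample s.
        w = (1 - sum(p[i] for i in s)) * prod(lam[i] for i in s)
        norm += w
        for i in s:
            for j in s:
                matrix[i][j] += w
    return [[v / norm for v in row] for row in matrix]


if __name__ == "__main__":
    x = list(range(1, 11))  # hypothetical population: x_i = i, N = 10
    pi = sampford_joint_matrix(x, n=5)
    print([round(pi[i][i], 4) for i in range(len(x))])
```

The diagonal of the result should reproduce n * x_i / sum(x), which gives a quick sanity check before comparing the off-diagonal entries with the software output.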
  • 11. Simulation and software control: test 2 Perform a Monte Carlo simulation to obtain a numerical estimate of the joint probability matrix. Measure the "distance" between estimates and the original matrix.
  • 12. Simulation steps 1. Define the target population and the sample size. 2. Define the matrix P(0) = [0] of size N x N, where N is the population size. 3. Define the number of simulations (K). 4. Execute the following steps K times, for H = 1, 2, …, K. 5. Draw a sample of n = 5 units, then build the N x 1 vector s_H = [s_H(i)], where s_H(i) is equal to 1 if unit i of the population has been selected, 0 otherwise. 6. Update the matrix P: P(H) = P(H-1) + s_H * s_H’. 7. The cross product s_H * s_H’ produces a symmetric N x N matrix whose element (i,j) is 1 for pairs of drawn units, 0 otherwise. At the end of this simulation process, the "numerical" estimate of the Joint Probability Matrix is the matrix of accumulated counts divided by the number of simulations, P(K) / K.
  • 13. The justification for this simulation process is the weak law of large numbers: P is a "good" estimate of π
  • 14. It is possible to "measure the distance" between the estimated and the real value using the following distribution: For each pair of units (i,j), we can use the p-level to analyze "how good the estimate is". P-levels too close to zero indicate a wrong output!
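
The simulation steps can be sketched in Python as follows. This is a minimal illustration, not the code used in the thesis: the sampler is Sampford's rejective procedure (first unit drawn with probability p_i, the remaining n - 1 drawn with replacement with probability proportional to p_i / (1 - n p_i), and the whole sample rejected if any unit repeats), the population x = 1, ..., 10 is hypothetical, and K is kept far smaller than the millions of replications reported in the slides.

```python
import random


def sampford_sample(x, n, rng):
    """One PPS sample without replacement via Sampford's rejective procedure."""
    total = sum(x)
    p = [xi / total for xi in x]
    if any(n * pi >= 1 for pi in p):
        raise ValueError("every unit needs n * x_i / sum(x) < 1")
    q = [pi / (1 - n * pi) for pi in p]
    units = list(range(len(x)))
    while True:
        first = rng.choices(units, weights=p, k=1)
        rest = rng.choices(units, weights=q, k=n - 1)
        sample = set(first + rest)
        if len(sample) == n:  # accept only samples with n distinct units
            return sample


def estimate_joint_matrix(x, n, k, seed=0):
    """Monte Carlo estimate of the joint probability matrix, P(K) / K."""
    rng = random.Random(seed)
    N = len(x)
    counts = [[0] * N for _ in range(N)]  # P(0) = [0], N x N
    for _ in range(k):
        s = sampford_sample(x, n, rng)
        # Adding s_H * s_H' is the same as incrementing every drawn pair.
        for i in s:
            for j in s:
                counts[i][j] += 1
    return [[c / k for c in row] for row in counts]


if __name__ == "__main__":
    x = list(range(1, 11))  # hypothetical population: x_i = i
    est = estimate_joint_matrix(x, n=5, k=2000, seed=1)
    print([round(est[i][i], 3) for i in range(len(x))])
```

The diagonal of the estimate approximates the first-order inclusion probabilities n * x_i / sum(x), and the approximation tightens as K grows.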
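
The distribution shown on the slide is not reproduced in this transcript. As one plausible way to obtain a p-level for each pair (i,j), the sketch below assumes that the number of times the pair is drawn in K independent replications is binomial and uses the normal approximation to that binomial. Both the test and the function name `p_level` are assumptions made for illustration.

```python
import math


def p_level(p_hat, pi_true, k):
    """Two-sided p-value for H0: the true joint probability is pi_true.

    p_hat   : Monte Carlo estimate of the joint probability
    pi_true : theoretical joint probability (0 < pi_true < 1)
    k       : number of independent replications
    Uses the normal approximation to the binomial (an assumption).
    """
    se = math.sqrt(pi_true * (1 - pi_true) / k)
    z = (p_hat - pi_true) / se
    # P(|Z| > |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # An estimate close to the true value gives a large p-level,
    # a systematically shifted estimate gives a p-level near zero.
    print(p_level(0.2003, 0.2, 1_000_000))
    print(p_level(0.2100, 0.2, 1_000_000))
```

With millions of replications the standard error is tiny, so even a small systematic deviation from the theoretical matrix drives the p-level to zero.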
  • 15. Simulation results P-level in R for Hanurav-Vijayan. k=10,000,001 and n=5. p-level<.01 highlighted in red.
  • 16. Simulation results P-level in SAS for Hanurav-Vijayan. k=10,000,001 and n=5. p-level<.01 highlighted in red.
  • 17. Simulation results P-level in SPSS for Hanurav-Vijayan. k=1,000,001 and n=5. p-level<.01 highlighted in red.
  • 18. Simulation results P-level in R for Rao-Sampford. k=10,000,001 and n=5. p-level<.01 highlighted in red.
  • 19. Simulation results P-level in SAS for Rao-Sampford. k=1,000,001 and n=5. p-level<.01 highlighted in red.
  • 20. Simulation results P-level in SPSS for Rao-Sampford. k=1,000,001 and n=5. p-level<.01 highlighted in red.
  • 21. Conclusions R is better suited for the development of this type of simulations the data.frame (the "container" of the data) is easily managed as an "object" matrix, therefore it's easy to access the data Specific "libraries" needed for working with matrices in SAS and SPSS R code is "sliding" and more powerful!
  • 22. Conclusions Looking at the join probability matrix in R, there is a clear "correspondence" between estimated and real values (for both Rao-Sampford and Hanurav-Vijayan) The results show a general bad situation for both algorithms tested in SAS and SPSS: almost all p-values are equal to zero, even for the inclusion probabilities of first-order!
  • 23. Conclusions SAS and SPSS do not converge to the result for two reasons: 1.wrong implementation of the algorithm in program code (both of them!) 2.wrong pseudo-random number generator (PRNG):
  • 24. Conclusions Pseudo-Random Number Generator used: SAS: linear congruential generator, Park-Miller (modulus: 2^31 - 1, period: 2^31 - 2). R and SPSS: Mersenne Twister (period: 2^19937 - 1).
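
For readers unfamiliar with the two generator families, here is a small Python sketch of a multiplicative congruential generator with modulus 2^31 - 1. The multiplier 16807 is the Park-Miller "minimal standard" value and is an assumption made for illustration: the slides do not state which multiplier SAS uses. Python's own `random` module is based on the Mersenne Twister, so it serves as the comparison.

```python
import random

MODULUS = 2**31 - 1  # Mersenne prime used as the modulus


def lehmer_stream(seed, multiplier=16807, modulus=MODULUS):
    """Yield successive states of x_{t+1} = multiplier * x_t mod modulus.

    The seed must lie in 1 .. modulus - 1; the state never reaches 0,
    so at most modulus - 1 distinct values can appear before repeating.
    """
    if not 0 < seed < modulus:
        raise ValueError("seed must be in 1 .. modulus - 1")
    state = seed
    while True:
        state = (multiplier * state) % modulus
        yield state


def lehmer_uniform(state, modulus=MODULUS):
    """Map an integer state to a uniform value in (0, 1)."""
    return state / modulus


if __name__ == "__main__":
    gen = lehmer_stream(seed=1)
    print([next(gen) for _ in range(3)])

    # Python's random.Random uses the Mersenne Twister (period 2^19937 - 1).
    mt = random.Random(12345)
    print(mt.random())
```

A state space of about 2^31 values is small compared with simulations that draw tens of millions of samples of several units each, which is one reason the period of the generator is relevant here.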
  • 25. Conclusions SAS and SPSS lead to results strongly "biased", regardless of the cause of non-convergence (both for Hanurav-Vijayan and for Rao-Sampford) Negative impact on the validity of the simulation studies carried out by these procedures (in SAS and SPSS). For example: Monte Carlo simulations to verify the bias of an estimator (such as for the total or the variance)
  • 26. Bibliography Brewer, K. e Hanif, M. (1983), Sampling with Unequal Probabilities, Springer-Verlag, New-York. Chambers, J. (2008), Software for Data Analysis: Programming with R, Springer, New York. Chieppa, M. e D'Orazio, M. (1999), Appunti di Teoria dei Campioni,Università degli Studi del Sannio, Benevento. Cicchitelli, G., Herzel, A. e Montanari, G. E. (1997), Il campionamento statistico, Il Mulino, Bologna. Efron, B. e Tibshirani, R. (1991), Statistical data analysis in the computer age, Science, vol. 253, p. 390395. Fishman, G. e More, L. R. (1981),In search of correlation in multiplicative congruential generators with modulus 2**31-1, Computer Science and Statistics, Proceedings of the 13th Symposium on the Interface, p. 155157. Fox, D. (1989), Computer Selection of Size-Biased Samples, The American Statistician, vol. 43 (3), p. 168171. Gentle, J., Hardle, W. e Mori, Y. (2008), Handbook of Computational Statistics: Concepts and Methods, Springer-Verlag, New-York. Golmant, J. (1990), Correction: Computer Selection of Size-Biased Samples, The American Statistician, vol. 44 (2), p. 194. Hanurav, T. (1967), Optimum Utilization of Auxiliary Information: Sampling of Two Units from a Stratum, Journal of the Royal Statistical Society, vol. B (29), p. 374391. Lauro, C. (1996), Computational statistics or statistical computing, is that the question?, Computational Statistics and Data Analysis, vol. 23 (1), p. 191-193. Capitolo 5 L'Ecuyer, P. (1990), Random Numbers for Simulation, Communications of the ACM, vol. 33 (10), p. 8597. Marsaglia, G. (1995), The Diehard Battery of Tests of Randomness, Rap. tecn., Florida State University, http://www.stat.fsu.edu/pub/diehard/. Matsumoto, M. e Nishimura, T. (1998), Mersenne Twister: A 623-Dimensionally Equidistrited Uniform Pseudo-Random Number Generator , ACM Transactions on Modeling and Computer Simulation, vol. 8 (1), p. 330. Mecatti, F. 
(2004), Lezioni di Metodi di Simulazione, Università degli Studi di Milano-Bicocca. Mood, A., Graybill, A. e Boes, D. (1991), Introduzione alla Statistica,Mc-Graw-Hill, Milano. Rao, J. (1965), On two simple schemes of unequal probability sampling without replacement, The Indian Journal of Statistics, vol. 3, p. 173-180. Raynald, L. (2008), Programming and Data Management for SPSS Statistics 17.0. A Guide for SPSS Statistics and SAS R Users, http: //www.spss.com/statistics/base/ProgDataMgmtSPSS17.pdf. Sampford, M. (1967), On sampling without replacement with unequal probabilities of selection, Biometrika, vol. 54, p. 499-513. SAS-Institute (1999), SAS/STAT R User's Guide - Version 8, http://www.math.wpi.edu/saspdf/stat/chap63.pdf. Sen, A. (1953), On the estimate of variance in sampling with varying probabilities , Journal of the Indian Society of Agricultural Statistics, vol. 5, p. 119-127. Tillé, Y. (2006), Sampling Algorithms, Springer Series in Statistics, New-York. Tillé, Y. e Alina, M. (2009), Package sampling, http://cran.r-project.org/web/packages/sampling/sampling.pdf. Vijayan, K. (1967), An Exact pps Sampling Scheme: Generalization of a Method of Hanurav, Journal of the Royal Statistical Society, vol. B (30), p. 556-566. Watts, D. (1991), Correction: Computer Selection of Size-Biased Samples, The American Statistician, vol. 45 (2), p. 172. Wu, C. (2005), R/S-PLUS codes for the pseudo EL method and the Rao-Sampford sampling procedure, Rap. tecn., University of Waterloo, http://www.math.uwaterloo.ca/~cbwu/Rcodes/04JSS.R. Yates, F. e Grundy, P. (1953), Selection without replacement from within strata with probability proportional to size, Journal of the Royal Statistical Society, vol. B (15), p. 253-261.