COMPLEX SAMPLING Siti Haslinda Mohd Din Statistician Institute for Public Health
JUST A MINUTE
One day some papers catch fire in a wastebasket in the Dean’s office. Luckily, a physicist, a chemist and a statistician happen to be nearby. Naturally, they rush in to help. The physicist whips out a notebook and starts to work on how much energy would have to be removed from the fire in order to stop the combustion. The chemist works on determining which reagent would have to be added to the fire to prevent oxidation. While they are doing this, the statistician is setting fires to all the other wastebaskets in the adjacent offices. “What are you doing?” the Dean demands. To which the statistician replies, “To solve a problem of this magnitude, you need a LARGE SAMPLE SIZE.”
http://www.amstat.org/publications/ise/v10n3/friedman.html
Survey Sampling
• The subject of survey sampling is concerned with the process of selecting members of the population to be included in the survey, and with estimation.
• A sample design needs to be developed to meet the survey objectives.
Properties of complex sampling
A given complex sample can have some or all of the following features: STRATIFICATION + CLUSTER + MULTISTAGE
Properties of complex sampling
Stratification
• Selecting samples independently within non-overlapping subgroups of the population, or strata. For example, strata may be socioeconomic groups, job categories, age groups, or ethnic groups.
• With stratification, you can:
– ensure adequate sample sizes for subgroups of interest,
– improve the precision of overall estimates, and
– use different sampling methods from stratum to stratum.
Properties of complex sampling
Clustering
• Involves the selection of groups of sampling units, or clusters. For example, clusters may be schools, hospitals, or geographical areas, and sampling units may be students, patients, or citizens.
• Clustering is common in multistage designs and area (geographic) samples.
Properties of complex sampling
Multiple stages
• In multistage sampling:
– a first-stage sample is selected based on clusters;
– a second-stage sample is created by drawing subsamples from the selected clusters;
– if the second-stage sample is based on subclusters, a third stage is added to the sample.
• For example:
– at the first stage of a survey, a sample of cities could be drawn;
– from the selected cities, households could be sampled;
– finally, from the selected households, individuals could be polled.
Example: South Zone
[Diagram: the South Zone is stratified by state (Johor, Negeri Sembilan, Melaka). Each state is further stratified into Urban and Rural. Within each urban or rural stratum, enumeration blocks (EB) form the clusters; the diagram marks which enumeration blocks are selected and which are not selected.]
Sampling Weight
• Uniform under simple random sampling (SRS), but varies under unequal-probability sampling.
• Sampling weights are automatically computed while drawing a complex sample and ideally correspond to the “frequency” that each sampling unit represents in the target population. Therefore, the sum of the weights over the sample should estimate the population size.
Sampling Weight
• Used:
– to compensate for unequal probabilities of selection;
– to adjust for nonresponse (a unit that fails to respond);
– in post-stratification, to adjust the weighted sample distribution for certain variables (e.g. age and sex) so that it conforms to the known population distribution.
• The aim is to improve the precision of sample estimates and to compensate for noncoverage and nonresponse.
Basic weighting approach
• Suppose sample element i was selected with probability π_i.
• Then sample element i represents 1/π_i elements in the population: W_i = 1/π_i
• Example: a sample element selected with probability 1/10 represents 10 elements in the population.
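The inverse-probability rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `base_weight` and the three example probabilities are hypothetical.

```python
from fractions import Fraction


def base_weight(selection_probability):
    """Return the base sampling weight 1/pi for one sample element."""
    if not 0 < selection_probability <= 1:
        raise ValueError("selection probability must be in (0, 1]")
    return 1 / selection_probability


# Hypothetical sample: two elements drawn with probability 1/10, one with 1/5.
probabilities = [Fraction(1, 10), Fraction(1, 10), Fraction(1, 5)]
weights = [base_weight(p) for p in probabilities]

# The sum of the weights estimates the size of the population represented.
estimated_population_size = sum(weights)
print(weights, estimated_population_size)
```

Exact fractions are used here only so that the arithmetic matches the slide example without rounding noise.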
Weighting for Unequal Probabilities of Selection
• Consider an EPSEM (Equal Probability of Selection Method) sample of 6 households selected from 240 households. One adult is selected at random in each selected household.
• The probability of selection of the βth adult in household α is
– P(αβ) = P(α) × P(β|α) = f × (1/B_α) = 1/w_α
– where f is the household sampling fraction and B_α is the number of adults in household α.
• If f = 6/240 = 1/40 and B_α = 3, then P(αβ) = (1/40) × (1/3) = 1/120.
• Therefore each selected adult represents 120 adults in the population: W = 120.
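The two-stage calculation above (household first, then one adult within the household) can be checked with a short sketch. The function name `adult_weight` is a hypothetical label; the numbers are those of the slide example.

```python
from fractions import Fraction


def adult_weight(household_fraction, adults_in_household):
    """Weight of the one adult selected at random within a sampled household.

    P(adult) = P(household) * P(adult | household) = f * (1 / B),
    and the weight is the inverse of that probability.
    """
    probability = household_fraction * Fraction(1, adults_in_household)
    return 1 / probability


f = Fraction(6, 240)  # 6 households sampled out of 240
print(adult_weight(f, 3))  # household with 3 adults
print(adult_weight(f, 1))  # household with a single adult
```

Although households are selected with equal probability, adults are not: an adult living alone gets a weight of 40, while an adult in a three-adult household gets 120.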
Nonresponse
Sources of failure to obtain observations (responses, measurements) on some elements selected and designated for the sample:
• Not at home
• Refusals
• Incapacitated or unable to respond
• Not found
• Lost schedules
Nonresponse (NR) refers to eligible respondents and should exclude the ineligibles, such as vacant dwellings and households without the specified kinds of population elements. The NR rate is computed from responses and nonresponses among the eligible only.
Disposition of the sample with components of response and nonresponse
• Total Units (Initial Sample) (1)
– Resolved (2)
   · Units in Scope (4)
      Respondents (6): Refusal Conversions (11), Other Respondents (12)
      Nonrespondents (7): Refusals (13), Noncontacts (14), Other Nonrespondents (15)
   · Units Out of Scope (5)
      Nonexistent Units (8)
      Units Temporarily Out of Scope (9)
      Units Permanently Out of Scope (10)
– Unresolved (3)
   · Estimated Units in Scope (3A)
   · Estimated Units Out of Scope (3B)
• Response rate = /(+[3A])
• Nonresponse rate = (+[3A])/(+[3A])
• Estimated units in scope [3A] = /X
Adapted from Hidiroglou et al. (1993)
Weighting for Nonresponse
• Compute weighted response rates in subgroups of the sample.
• Use the inverse of the subgroup response rates for the nonresponse adjustment.
• Weighted response rate = (weighted number of completed interviews with eligible elements) / (weighted number of eligible elements in the sample)
• Exclude empty dwellings, destroyed dwellings, addresses that are not dwellings, and ineligible elements.
• W2 = 1 / response rate = n_h / n_h'
– n_h = number of eligible sampled elements in subgroup h
– n_h' = number of actual responses in subgroup h
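A minimal sketch of the subgroup nonresponse adjustment follows. It assumes each sampled unit is a dictionary with hypothetical keys `group`, `w1` (selection weight), `eligible` and `responded`; the records themselves are invented for illustration.

```python
def nonresponse_adjustments(records):
    """Return W2 per subgroup: weighted eligible units / weighted respondents."""
    eligible_weight = {}
    respondent_weight = {}
    for r in records:
        if not r["eligible"]:
            continue  # vacant dwellings and other ineligibles are excluded
        g = r["group"]
        eligible_weight[g] = eligible_weight.get(g, 0.0) + r["w1"]
        if r["responded"]:
            respondent_weight[g] = respondent_weight.get(g, 0.0) + r["w1"]
    # W2 is the inverse of the weighted response rate in each subgroup.
    return {
        g: eligible_weight[g] / respondent_weight[g]
        for g in eligible_weight
        if respondent_weight.get(g, 0.0) > 0
    }


# Hypothetical sample: 4 eligible urban units (3 respond), 2 eligible rural
# units (both respond), and one ineligible rural unit that is ignored.
sample = [
    {"group": "urban", "w1": 40.0, "eligible": True, "responded": True},
    {"group": "urban", "w1": 40.0, "eligible": True, "responded": True},
    {"group": "urban", "w1": 40.0, "eligible": True, "responded": True},
    {"group": "urban", "w1": 40.0, "eligible": True, "responded": False},
    {"group": "rural", "w1": 120.0, "eligible": True, "responded": True},
    {"group": "rural", "w1": 120.0, "eligible": True, "responded": True},
    {"group": "rural", "w1": 120.0, "eligible": False, "responded": False},
]
w2 = nonresponse_adjustments(sample)
print(w2)
```

Subgroups with no respondents are left out of the result in this sketch; in practice such cells are usually collapsed with a neighbouring subgroup before adjusting.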
Total weight
• W = W1 × W2
– W1 = weight for unequal selection probabilities
– W2 = weight for nonresponse
Weighting for Post-Stratification
• The weighted sample distribution is made to conform to a known population distribution.
• W3 = (population count in the specific category) / (weighted sample count in the specific category)
• Example: suppose the known population of females aged 25-64 living in the North area is 12,800,100, whereas the weighted sample total for that category is 11,325,553.
• The post-stratification weight is therefore W3 = 12,800,100 / 11,325,553 = 1.13
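The arithmetic of the example can be reproduced directly. The function name `post_stratification_factor` is a hypothetical label; the two totals are the ones quoted on the slide.

```python
def post_stratification_factor(population_total, weighted_sample_total):
    """Ratio that scales the weighted sample up (or down) to the known total."""
    if weighted_sample_total <= 0:
        raise ValueError("weighted sample total must be positive")
    return population_total / weighted_sample_total


w3 = post_stratification_factor(12_800_100, 11_325_553)
print(round(w3, 2))
```

A factor above 1 means the weighted sample under-represents the category relative to the known population count.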
Total weight
• W = W1 × W2 × W3
– W1 = weight for unequal selection probabilities
– W2 = weight for nonresponse
– W3 = weight for post-stratification
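Putting the three components together, a final weight can be sketched as below. The record keys (`cell`, `w1`, `w2`), the cell labels and the population totals are all hypothetical; W3 is computed per post-stratification cell from the weighted respondents in that cell.

```python
def final_weights(respondents, population_totals):
    """Return W = W1 * W2 * W3 for each respondent, in input order."""
    # Weighted sample total per cell before post-stratification (W1 * W2).
    weighted_total = {}
    for r in respondents:
        cell = r["cell"]
        weighted_total[cell] = weighted_total.get(cell, 0.0) + r["w1"] * r["w2"]

    weights = []
    for r in respondents:
        w3 = population_totals[r["cell"]] / weighted_total[r["cell"]]
        weights.append(r["w1"] * r["w2"] * w3)
    return weights


# Hypothetical respondents and known population totals for two cells.
respondents = [
    {"cell": "F 25-64", "w1": 120.0, "w2": 1.25},
    {"cell": "F 25-64", "w1": 40.0, "w2": 1.25},
    {"cell": "M 25-64", "w1": 120.0, "w2": 1.0},
]
totals = {"F 25-64": 260.0, "M 25-64": 150.0}
w = final_weights(respondents, totals)
print(w)
```

After this step the final weights in each cell sum to the known population total for that cell, which is the property post-stratification is meant to enforce.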
Variance Estimation
• Linearization
– Taylor series approximation (Wolter 1985)
– Best for simple statistics, e.g. a weighted mean (Frankel, 1971)
• Replication (resampling methods)
– Balanced Repeated Replication (BRR)
– Jackknife estimation (Kish & Frankel 1974; Krewski and Rao 1981; Kovar, Rao and Wu 1988; Rao, Wu, and Yue 1992; Shao 1996)
– Maximum-likelihood estimates (Brillinger, 1964)
– Best for complex statistics such as regression coefficients (Frankel, 1971)
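As an illustration of the replication idea, here is a minimal sketch of a stratified delete-one-cluster jackknife for a weighted mean. It is one common form of the jackknife, not necessarily the variant used in the cited papers. The record keys (`stratum`, `psu`, `w`, `y`) and the toy data are assumptions; each replicate drops one cluster (PSU), rescales the remaining clusters in the same stratum by n_h/(n_h - 1), and the squared deviations are combined with the factor (n_h - 1)/n_h.

```python
from collections import defaultdict


def weighted_mean(rows, weights):
    """Weighted mean of y using the supplied weights."""
    return sum(w * r["y"] for r, w in zip(rows, weights)) / sum(weights)


def jackknife_variance(rows):
    """Return (estimate, variance) for the weighted mean of y."""
    psus_by_stratum = defaultdict(set)
    for r in rows:
        psus_by_stratum[r["stratum"]].add(r["psu"])

    theta = weighted_mean(rows, [r["w"] for r in rows])
    variance = 0.0
    for stratum, psus in psus_by_stratum.items():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two clusters")
        for dropped in psus:
            replicate = []
            for r in rows:
                if r["stratum"] != stratum:
                    replicate.append(r["w"])  # other strata are unchanged
                elif r["psu"] == dropped:
                    replicate.append(0.0)  # the dropped cluster
                else:
                    replicate.append(r["w"] * n_h / (n_h - 1))
            theta_rep = weighted_mean(rows, replicate)
            variance += (n_h - 1) / n_h * (theta_rep - theta) ** 2
    return theta, variance


# Toy data: one stratum, four single-unit clusters with equal weights.
toy = [{"stratum": "A", "psu": i, "w": 1.0, "y": float(i)} for i in (1, 2, 3, 4)]
estimate, var = jackknife_variance(toy)
print(estimate, var)
```

For this toy case the jackknife reproduces the textbook variance of a simple mean, s²/n = (5/3)/4 = 5/12, which is a quick sanity check on the replicate factors.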
Comparison of the proportion of smoking pregnant mothers by years of schooling

Years of schooling    Weighted proportion    Unweighted proportion
< 12 years            0.315 ± 0.010          0.328 ± 0.007
12 years              0.373 ± 0.012          0.332 ± 0.008
> 12 years            0.202 ± 0.011          0.217 ± 0.008

Data source: National Maternal and Child Survey 1988, US
Comparison of the highest prevalence by state and gender

Prevalence (%)
                   SPSS               STATA
State          Male    Female     Male    Female
Johor          26.97   29.62      25.39   28.75
Kedah          20.36   28.39      19.63   27.12
Kelantan       19.00   27.09      16.08   24.39
Melaka         29.06   34.84      29.67   33.99
N. Sembilan    30.99   34.56      28.40   34.18
Pahang         26.27   37.48      24.06   39.02
P. Pinang      24.81   28.80      24.40   27.09
Perak          27.21   31.96      26.58   31.02
Perlis         24.91   35.98      22.49   35.29
Selangor       25.65   28.66      25.26   26.73
Sarawak        21.46   26.73      17.41   28.18
Sabah          22.84   26.51      18.28   25.80
Terengganu     26.58   35.17      33.75   32.17
WPKL           30.80   29.94      30.29   29.39

Source: National Health Morbidity Survey 1996
The difference based on the highest prevalence of obesity among adults in Kedah, by gender and ethnicity

                  Prevalence   Gender   Ethnicity   S.E.   (95% CI)
Without weights   32.35        Female   Indian      5.68   (21.22, 43.48)
With weights      29.87        Female   Chinese     4.54   (20.98, 38.76)

Source: National Health Morbidity Survey 1996
Things to be considered if a design-based inference approach is chosen
• What is the nature of the sample design? Was a stratified multistage sample design used? Was a cluster sample design used? Were unequal probabilities of selection applied?
• Were there adjustments for nonresponse or coverage errors? Is there a weight, or are there several weights, that must be applied when different parts of the sample are analyzed?
• Are there important measurement issues that could affect survey analyses? Is item nonresponse an important problem for some variables?
• How can the results be interpreted, and what kinds of inference are appropriate in view of the complex survey design?
Steps required for performing a design-based analysis
Paul S. Levy and Stanley Lemeshow (1999)
• Identify the following elements of the sample design:
– Stratification
– Clustering
– Population sizes required for determination of the finite population correction
• Determine the sampling weight.
• Determine a final sampling weight that incorporates nonresponse and post-stratification adjustments.
• Ensure that the data required for an appropriate design-based analysis are available.
• Determine the procedure and the set of commands for performing the required analysis.
• Run the analysis and carefully interpret the findings.
Further reading
• Skinner, C. J., Holt, D., and Smith, T. M. F. 1989. Analysis of Complex Surveys. New York: John Wiley and Sons.
• Levy, P. S., and Lemeshow, S. 1999. Sampling of Populations: Methods and Applications, 3rd Ed. New York: John Wiley and Sons.
• Cochran, W. G. 1977. Sampling Techniques. New York: John Wiley and Sons.
• Kish, L. 1965. Survey Sampling. New York: John Wiley and Sons.
• Kish, L. 1987. Statistical Design for Research. New York: John Wiley and Sons.
• Murthy, M. N. 1967. Sampling Theory and Methods. Calcutta, India: Statistical Publishing Society.
• Korn, E. L., and Graubard, B. I. 1995. Examples of Differing Weighted and Unweighted Estimates From a Sample Survey. The American Statistician, 49(3), 291-295.
• Lee, E. S., Forthofer, R. N., and Lorimor, R. J. 1986. Analysis of Complex Sample Survey Data: Problems and Strategies. Sociological Methods & Research, 15, 69-100.