Statistics 522: Sampling and Survey Techniques
                                        Topic 6

Topic Overview
This topic will cover

   • Sampling with unequal probabilities

   • Sampling one primary sampling unit

   • One-stage sampling with replacement


Unequal probabilities
   • Recall πi is the probability that unit i is selected as part of the sample.

   • Most designs we have studied so far have the πi equal.

   • Now we consider general designs where the πi can vary with i.

   • There are situations where this can give much better results.

Example 6.1
   • Survey of nursing home residents in Philadelphia to determine preferences on life-
     sustaining treatments

   • 294 nursing homes with a total of 37,652 beds (number of residents not known at the
     planning stage)

   • Use cluster sampling

   • Suppose we choose an SRS of the 294 nursing homes and then an SRS of 10 residents
     of each selected home.

   • A nursing home with 20 beds has the same probability of being sampled as a nursing
     home with 1000 beds.

   • 10 residents from the 20-bed home represent fewer people than 10 residents from the
     1000-bed home.




Self-weighting
   • This procedure gives a sample that is not self-weighting.

   • Alternatives that are self-weighting:

       – A one-stage cluster sample
       – Sample a fixed percentage of the residents of each selected nursing home.

The two-stage cluster design
   • The two-stage cluster design (SRS of homes, then equal proportion SRS of residents
     in each selected home)

       – Gives a mathematically valid estimator

SRS at first stage
Three shortcomings:

   • We would expect ti to be proportional to the number of beds Mi in nursing home i, so
     the estimators will have large variance.

   • Equal percentage sampling in each selected home may be difficult to administer.

   • Cost is not known in advance (we don't know whether the sample will contain large or small homes).

The study
   • They drew a sample of 57 nursing homes with probabilities proportional to the number
     of beds.

   • Then they took an SRS of 30 beds (and their occupants) from a list of all beds within
     each selected nursing home.

Properties
   • Each bed is equally likely to be in the sample (note beds vs occupants).

   • The cost is known before selecting the sample.

   • The same number of interviews is taken at each nursing home.

   • The estimators will have smaller variance




Key ideas
  • When sampling with unequal probabilities, we deliberately vary the selection proba-
    bilities.

  • We compensate by using weights in the estimation.

  • The key is that we know the selection probabilities

Notation
  • The probability that psu i is in the sample is πi .

  • The probability that psu i is selected on the first draw is ψi .

  • We will consider an artificial situation where n = 1, so πi = ψi .

Sampling one psu
  • Sample size is n = 1.

  • Suppose we are interested in estimating the population total.

  • ti is the total for psu i.

  • To illustrate the ideas, we will assume that we know the whole population.


The Example
  • N = 4 supermarkets

  • Size (in square meters) varies.

  • Select n = 1 with probabilities proportional to size.

  • Record total sales

  • Using the data from one store we want to estimate total sales for the four stores in the
    population.

The population
                                 Store    Size          ψi   ti
                                   A      100         1/16 11
                                   B      200         2/16 20
                                   C      300         3/16 24
                                   D     1000        10/16 245
                                 Total   1600            1 300

Weights
   • The weights wi are the inverses of the selection probabilities ψi .

   • The weighted estimator of the population total is t̂ψ = Σ wi ti .

   • There are four possible samples.

   • We calculate t̂ψ for each.


The samples
                              Sample      ψi       wi      ti    t̂ψ
                                A       1/16       16      11   176
                                B       2/16        8      20   160
                                C       3/16     16/3      24   128
                                D      10/16    16/10     245   392

Sampling distribution of the estimate t̂ψ

                                    Sample      ψi      t̂ψ
                                      1       1/16     176
                                      2       2/16     160
                                      3       3/16     128
                                      4      10/16     392

Mean of the sampling distribution of t̂ψ

                 E(t̂ψ) = (1/16)(176) + (2/16)(160) + (3/16)(128) + (10/16)(392) = 300 = t

   • So t̂ψ is unbiased.

   • This will always be true:

                 E(t̂ψ) = Σ ψi wi ti = Σ ti = t

Variance of the sampling distribution of t̂ψ

    Var(t̂ψ) = (1/16)(176 − 300)² + (2/16)(160 − 300)² + (3/16)(128 − 300)² + (10/16)(392 − 300)² = 14248

Compare with the variance for an SRS of size n = 1, where the estimate is N ti = 4ti (so the
possible values are 44, 80, 96, and 980):

    Var(t̂SRS) = (1/4)(44 − 300)² + (1/4)(80 − 300)² + (1/4)(96 − 300)² + (1/4)(980 − 300)² = 154488
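
These calculations are small enough to verify directly. A minimal Python sketch, using only the values from the tables above:

```python
# Sampling distribution of the weighted estimator for the supermarket example.
psi = {"A": 1/16, "B": 2/16, "C": 3/16, "D": 10/16}  # selection probabilities
t   = {"A": 11,   "B": 20,   "C": 24,   "D": 245}    # store totals (sales)

t_hat = {s: t[s] / psi[s] for s in t}                # one estimate per sample
mean_pps = sum(psi[s] * t_hat[s] for s in t)         # 300.0 = population total
var_pps  = sum(psi[s] * (t_hat[s] - mean_pps) ** 2 for s in t)  # 14248.0

# Under an SRS of size 1, each store has probability 1/4 and the estimate
# is N * t_i, so the possible estimates are 44, 80, 96, and 980.
N = 4
var_srs = sum((1 / N) * (N * t[s] - mean_pps) ** 2 for s in t)  # 154488.0

print(t_hat)   # {'A': 176.0, 'B': 160.0, 'C': 128.0, 'D': 392.0}
print(mean_pps, var_pps, var_srs)
```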

Interpretation
  • Store D is the largest and we expect it to account for a large portion of the total sales.

  • Therefore, we give it a higher probability of being in the sample (10/16) than it would
    have with an SRS (1/4).

  • If it is selected, we multiply its sales by (16/10) to estimate total sales.


One-stage sampling with replacement
  • Suppose n > 1 and we sample with replacement.

  • This implies πi = 1 − (1 − ψi )n .

  • Probability that item i is selected on the first draw is the same as the probability that
    item i is selected on any other draw.

  • Sampling with replacement gives us n independent estimates of the population total,
    one for each unit in sample.

  • We average these n estimates.

   • The estimated variance is the sample variance of these n estimates divided by n. (A
     quick simulation check of the πi formula above follows this list.)
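
The identity πi = 1 − (1 − ψi)ⁿ is easy to check by simulation. A quick sketch; the ψi here are just the supermarket probabilities from earlier, reused for illustration:

```python
import random

psi = [1/16, 2/16, 3/16, 10/16]   # draw probabilities
n, reps = 3, 200_000
hits = [0] * len(psi)
for _ in range(reps):
    draws = random.choices(range(len(psi)), weights=psi, k=n)  # with replacement
    for i in set(draws):          # count each unit at most once per sample
        hits[i] += 1

empirical = [h / reps for h in hits]
theory = [1 - (1 - p) ** n for p in psi]
print(empirical)   # should be close to `theory`
print(theory)      # [0.1760..., 0.3301..., 0.4636..., 0.9473...]
```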

Example 6.2
  • N = 15 classes of elementary statistics

  • Mi students in class i (i = 1 to 15)

  • Values of Mi range from 20 to 100.

  • We want a sample of 5 classes.

  • Each student in the selected classes will fill out a questionnaire.

  • (It is possible for the same class to be selected more than once.)

Randomization
  • There are a total of 647 students in these classes.

  • Select 5 random numbers between 1 and 647.

  • Think about ordering the students by class.

  • Each random number corresponds to a student and the corresponding class will be in
    the sample.

This method
  • This method is called the cumulative-size method.
  • It is based on M1 , M1 + M2 , M1 + M2 + M3 , . . .
  • An alternative is to use the cumulative sums of the ψi and select random numbers
    between 0 and 1.
  • For this example, ψi = Mi/647 (a sketch of the method follows this list).
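
A minimal sketch of the cumulative-size method. The class sizes Mi below are hypothetical: the notes give only their sum (647) and their range (20 to 100).

```python
import bisect
import itertools
import random

# Hypothetical class sizes in [20, 100] that sum to 647.
M = [44, 33, 26, 22, 71, 63, 20, 44, 54, 34, 46, 24, 46, 100, 20]
assert sum(M) == 647
cum = list(itertools.accumulate(M))      # M1, M1+M2, M1+M2+M3, ...

n = 5
draws = [random.randint(1, 647) for _ in range(n)]       # one student each
classes = [bisect.bisect_left(cum, d) for d in draws]    # class containing it
print(classes)   # repeats possible: the psus are drawn with replacement
```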

Alternative
  • Systematic sampling is often used as an alternative in this setting.
       – The basic idea is the same.
       – Not technically sampling with replacement
       – Works well whenever systematic sampling works well.
       – See page 186 for details.
  • Lahiri's method (a sketch follows this list)
       – Involves two stages of randomization
       – Rejection sampling: corresponds to classroom problem in Problem Set 2.
       – Can be inefficient.
       – See page 187 for details
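
A sketch of Lahiri's rejection method, reusing the hypothetical class sizes from the sketch above: pick a class uniformly, then accept it with probability Mi/max(M).

```python
import random

def lahiri_draw(M, rng=random):
    M_max = max(M)
    while True:
        i = rng.randrange(len(M))    # stage 1: a psu chosen uniformly
        u = rng.randint(1, M_max)    # stage 2: a uniform size threshold
        if u <= M[i]:                # accept with probability M_i / M_max,
            return i                 # so overall P(select i) is prop. to M_i

M = [44, 33, 26, 22, 71, 63, 20, 44, 54, 34, 46, 24, 46, 100, 20]
print([lahiri_draw(M) for _ in range(5)])   # n = 5 draws, with replacement
```

The rejection step is where the inefficiency comes from: a draw is wasted with probability 1 − M̄/Mmax, which is large when the sizes vary widely.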


Estimation Theory
  • Let Qi be the number of times unit i occurs in the sample.
  • Then t̂ψ = (1/n) Σ Qi ti/ψi .

  • The estimated variance of t̂ψ is

                 (1/(n(n − 1))) Σ Qi (ti/ψi − t̂ψ)²

  • The estimate and its estimated variance are both unbiased; a small sketch of these
    formulas follows.
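
A small sketch of the formulas above. Summing over the draws (with repeats) is equivalent to summing Qi times over the distinct units:

```python
def pps_wr_estimate(sample, psi, t):
    """With-replacement pps estimate of the total and its estimated variance.

    sample: indices drawn (repeats allowed); psi, t: draw probability and
    total of each psu in the population.
    """
    n = len(sample)
    per_draw = [t[i] / psi[i] for i in sample]   # one estimate per draw
    t_hat = sum(per_draw) / n
    v_hat = sum((x - t_hat) ** 2 for x in per_draw) / (n * (n - 1))
    return t_hat, v_hat

# Hypothetical illustration with the supermarket data and n = 3 draws:
psi = [1/16, 2/16, 3/16, 10/16]
t = [11, 20, 24, 245]
print(pps_wr_estimate([3, 1, 3], psi, t))   # stores D, B, D
```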

Choosing the selection probabilities
  • We want small variance for our estimator.
       – Often, ti is related to the size of the psu.
       – We can take ψi proportional to Mi or some other measure of the size of psu i.

PPS
  • This procedure is called sampling with probability proportional to size (pps).

  • The formulas for the estimate and variance can be simplified for this special case.
    With K = Σ Mi (the total number of ssus in the population),

                 ψi = Mi/K
                 ti/ψi = (K/Mi) ti = K ȳi

    where ȳi = ti/Mi is the mean per ssu in psu i.

  • See page 190 for details

  • See Example 6.5 on pages 190-192

Two-stage sampling with replacement
  • Basic ideas are very similar to one-stage sampling.

  • ψi is the probability that psu i is selected on the first (or any) draw.

  • We take a sample of mi ssus from each selected psu.

Sampling ssu’s
  • Usually we use an SRS.

  • Alternatives include

       – systematic sampling
       – any other probability sampling method

  • Note that if a psu is selected more than once, a separate, independent second-stage
    sample is required each time it is selected.

Estimates and SE’s
  • Weights are used to make the estimators unbiased.

  • Formulas are similar to those for one-stage.

  • See (6.8) and (6.9) on page 192




Outline of the procedure
  1. Determine the ψi .

  2. Select the n psus (with replacement).

  3. Select the ssus.

  4. For each selected psu, estimate its total and weight it:

                                        t̂ψ,i = weight × t̂i = t̂i /ψi

  5. The average of these n values is t̂ψ .

  6. SE is the standard error of these values (sd/√n); a sketch of the whole procedure
     follows.
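
A sketch of the whole procedure under simple assumptions: an SRS of a fixed m ssus at the second stage, with pop[i] holding the (hypothetical) ssu values of psu i.

```python
import random

def two_stage_wr(pop, psi, n, m, rng=random):
    draws = rng.choices(range(len(pop)), weights=psi, k=n)  # step 2
    per_draw = []
    for i in draws:                              # step 3: independent SRS,
        ssus = rng.sample(pop[i], m)             # repeated if i is redrawn
        t_hat_i = len(pop[i]) / m * sum(ssus)    # step 4: psu total estimate
        per_draw.append(t_hat_i / psi[i])        # ... times the weight 1/psi_i
    t_hat = sum(per_draw) / n                    # step 5: average
    se = (sum((x - t_hat) ** 2 for x in per_draw) / (n * (n - 1))) ** 0.5
    return t_hat, se                             # step 6: sd / sqrt(n)
```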


Unequal probability sampling without replacement
   • ψi is the probability of selection on the first draw.

   • The probability of selection on later draws depends on which units were selected on
     earlier draws.

Estimation
   • πi is called the inclusion probability. (Σpop πi = n)

   • πi,j is the probability that both psu i and psu j are in the sample. (Σj≠i πi,j = (n − 1)πi )

   • Weights (the inverse of the inclusion probability)

        – we use πi /n in place of the with-replacement ψi

   • The recommended procedure is to use the Horvitz-Thompson (HT) estimator and the
     associated SE. (t̂HT = Σsam t̂i /πi )

   • See page 196-197 for details.

   • This estimator can be generalized to other designs that do not use replacement; a
     minimal sketch of the HT sum follows.
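
A minimal sketch of the HT estimator itself; t_hat maps each sampled psu to its estimated total and pi gives its inclusion probability (all values hypothetical):

```python
def horvitz_thompson(t_hat, pi):
    # Sum of estimated psu totals, each inflated by 1 / (inclusion probability).
    return sum(t_hat[i] / pi[i] for i in t_hat)

print(horvitz_thompson({"a": 120.0, "b": 85.0, "c": 210.0},
                       {"a": 0.40,  "b": 0.25, "c": 0.60}))   # 990.0
```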


Randomization Theory
The framework is:

   • Probability sampling without replacement for the psus for the first stage

   • Sampling at the second stage is independent of sampling at the first stage


Horvitz-Thompson
  • Randomization theory can be used to prove the Horvitz-Thompson Theorem.

      – Expected value of the estimator is t.
      – Formula for the variance of the estimator

The estimator
  • t̂HT = Σ t̂i /πi

       – where the sum is over the psu's selected in the first stage.

  • The idea behind the proofs is to condition on which psus are in the sample.

  • Study pages 205-210


Model
  • One-way random effects ANOVA model

                                           Yi,j = Ai + εi,j

    where

       – the Ai are random variables with mean µ and variance σA²
       – the εi,j are random variables with mean 0 and variance σ²
       – the Ai and the εi,j are uncorrelated

The pps estimator
  • πi = nMi /K is the inclusion probability.

                                 t̂P = Σ (K/(nMi )) t̂i      (sum over the sampled psus)

  • We rewrite this as a weighted estimator:

                                 t̂i = (Mi /mi ) Σj Yi,j
                                 t̂P = Σ wi,j Yi,j

    where wi,j = (K/(nMi )) × (Mi /mi ) = K/(nmi ).

  • Take expected values to show that the estimator is unbiased (a sketch follows).
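
A sketch of that expectation step, conditional on any fixed sample S of psus. Each sampled psu contributes mi terms with weight K/(nmi ), and E Yi,j = µ under the model:

```latex
E\,\hat{t}_P
  \;=\; \sum_{i \in S}\sum_{j} w_{i,j}\,E\,Y_{i,j}
  \;=\; \mu \sum_{i \in S} m_i \cdot \frac{K}{n\,m_i}
  \;=\; \mu\, n \cdot \frac{K}{n}
  \;=\; K\mu
  \;=\; E\Bigl(\sum_{i=1}^{N}\sum_{j=1}^{M_i} Y_{i,j}\Bigr),
```

so t̂P is model-unbiased for the population total.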


Variance
  • The variance can be computed.

  • See page 211

  • The variance depends, through the Mi , on which psu's are selected.

  • The variance is smallest when psu’s with the largest Mi are chosen.

Recall
  • Estimate of the population total is the weighted average of the t̂i for the selected psus.

  • The weights wi are the inverses of the probabilities of selection.


Elephants
  • A circus needed to ship its 50 elephants.

  • They needed to estimate the total weight of the animals.

  • It is not easy to weigh 50 elephants and they were in a hurry.

  • They had data from three years ago.

Sample
  • The owner wanted to base the estimate on a sample.

  • Three years ago, Dumbo's weight was equal to the herd average.

  • The owner wanted to weigh Dumbo and multiply by 50.

  • The statistician said:

NO
  • You have to use probability sampling and the Horvitz-Thompson estimator.

  • They compromised:

       – The probability of selecting Dumbo was set as 99/100.
       – The probability of selecting each of the other elephants was 1/4900.




Who was selected
  • Dumbo, of course.

  • The owner was happy and said now we can estimate the weight of the 50 elephants as
    50 times Dumbo's weight, 50y.

  • The statistician said

NO
  • The estimate of the total weight of the 50 elephants should be Dumbo's weight divided
    by his probability of selection.

  • This is y/(99/100) or 100y/99.

  • The theory behind this estimator is rigorous

What if
  • The owner asked

       – What if the randomization had selected Jumbo, the largest elephant in the herd?

  • The statistician replied: 4900y, where y is Jumbo's weight.

Conclusion
  • The statistician lost his circus job and became a teacher of statistics.

  • A bad model leads to a highly variable estimator (see the simulation below).

  • Due to Basu (1971).
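
The point about variability is easy to see by simulation. A hypothetical sketch, assuming every elephant weighs about 4000 kg (so the true total is 200000):

```python
import random

weights = [4000.0] * 50            # hypothetical weights; Dumbo is index 0
pi = [99/100] + [1/4900] * 49      # the compromise selection probabilities
print(sum(pi))                     # approx. 1: a valid design for a sample of one

def ht_estimate(rng=random):
    i = rng.choices(range(50), weights=pi, k=1)[0]
    return weights[i] / pi[i]      # HT estimate from the single selected unit

est = [ht_estimate() for _ in range(100_000)]
print(sum(est) / len(est))         # near 200000: the estimator is unbiased
print(min(est), max(est))          # about 4040 vs 19600000: wildly variable
```

Unbiasedness holds exactly, but almost every sample yields roughly 4040 and the rare non-Dumbo sample yields 19,600,000; that is Basu's point.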




