Sampling is a powerful tool to obtain valuable information about a population quickly and at a fraction of the cost. But the sample size and sampling plan have to be proper to yield scientifically valid and acceptable conclusions. We describe this challenge in understandable terms for all and back it up with sufficient statistical concepts for the benefit of students.
Adequate Sample Size
1. balasubp@gmail.com
Adequacy of Sample Size in Population Surveys
Dr. P. Balasubramanian, Ph.D.
Founder & CEO, Theme Work Analytics, Bangalore
& West Lafayette, IN, USA
Please obtain prior permission for reuse.
Feel free to download for self study.
Oct 2016
8. Homogeneous Population ..examples..
! Almost everyone (say 95% of citizens) believes that
the city is pedestrian friendly.
! 98% of the batteries supplied by Sunshine Power
Solutions Company served their warranty period of two
years without any claim.
Heterogeneous Population ..examples..
! Opinion varied widely among the rural residents about
the utility of the fertilizer credit scheme of the
government.
! Infant mortality rate ranged from 2 per thousand to
20 per thousand in different states in a developing
country.
14. sample size tables …….preamble
! We will present a series of tables showing the required sample
size for a given population size, allowable margin of error and
expected confidence level.
! We assume that the population is quite heterogeneous in terms
of the parameter being studied. This will result in the
maximum sample size ever needed.
! There is an elegant mathematical formula to calculate these
values. We will present the formula in a later section.
! There are many ready reckoners and eCalculators to help us
find the sample size. One such calculator from SurveyMonkey
is available at
https://www.surveymonkey.com/mp/sample-size-calculator/
15. sample size tables
Table 1: required sample size, N = 10000
Margin of error    90% CL    95% CL    99% CL
1%                   4021      4900      6247
2%                   1440      1937      2939
5%                    262       370       625
10%                    67        96       164
Table 2: required sample size, N = 100000
Margin of error    90% CL    95% CL    99% CL
1%                   6301      8763     14267
2%                   1654      2345      3995
5%                    269       383       662
10%                    68        96       167
! If we accept a higher margin of error (such as 10%) then
even when the population size (N) is 100000, the required
sample size is 68 (at 90% Confidence Level) and only 167 (at
99% Confidence Level)!
! The sample size has quickly converged to these numbers
and is almost constant at higher Margins of Error and lower
Confidence Levels.
16. sample size tables
! For a population of 10000, the maximum sample size
needed (for a high level of accuracy) is 6247. [It is 62.5% of
the population]. Quite high.
! However, when the population size is 100000, the maximum
sample size needed is only 14267. [It is 14.3% of the
population]
17. sample size tables…some more..
Table 3: required sample size, N = 1000000
Margin of error    90% CL    95% CL    99% CL
1%                   6680      9513     16369
2%                   1679      2396      4144
5%                    269       385       666
10%                    68        97       167
Table 4: required sample size, N = 10000000
Margin of error    90% CL    95% CL    99% CL
1%                   6720      9595     16614
2%                   1681      2401      4159
5%                    269       385       666
10%                    68        97       167
! The sample size converges quickly as the population size increases.
! The maximum sample size when the population is 10 million is
16614 (about 0.17% of the population!)
! At 5% Margin of Error and 99% Confidence Level the required
sample size is quite low at 666!
18. sample size tables…at population size of 100 million
! At a population size of 100 million the sample size has converged
for all but two scenarios.
! The maximum sample size needed for even larger populations is
16641 (as determined from the eCalculator).
! Hence any (random sample) survey that covers the entire
population of the world can be carried out to a high degree of
accuracy with a sample size of 16641.
Table 5: required sample size, N = 100 million
Margin of error    90% CL    95% CL    99% CL
1%                   6724      9604     16639
2%                   1681      2401      4161
5%                    269       385       666
10%                    68        97       167
19. sample size tables…at population size of 100 million
! With a sample size of 68, we can study the global population at
a moderate level of accuracy (10% Margin of Error at 90% Confidence Level)!
! This is however true only when everyone in the population has
an equal chance of being selected in the sample.
[ The eCalculator will also reveal that when the population size is less than 1000 we need
to sample almost everyone to get 1% Margin of Error and 99% Confidence Level ]
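The tables above can be reproduced from the sample size formula that the deck presents on slide 29. The sketch below is only an illustration under stated assumptions: it uses rounded Z values of 1.64, 1.96 and 2.58, takes p = 0.5 (maximum heterogeneity), and rounds up to the next whole respondent. These choices are inferred from the table values and are not stated by the author; the function names are hypothetical.

```python
import math

# Assumed rounded Z values for each confidence level (inferred from the tables).
Z_VALUES = {"90%": 1.64, "95%": 1.96, "99%": 2.58}
MARGINS = [0.01, 0.02, 0.05, 0.10]


def corrected_sample_size(population, z, margin, p=0.5):
    """Sample size corrected for population size, rounded up."""
    n1 = z ** 2 * p * (1 - p) / margin ** 2      # uncorrected sample size
    n0 = n1 / (1 + (n1 - 1) / population)         # corrected for population size
    return math.ceil(n0)


def table(population):
    """Map each margin of error to sample sizes at 90%, 95% and 99% confidence."""
    return {
        margin: [corrected_sample_size(population, z, margin)
                 for z in Z_VALUES.values()]
        for margin in MARGINS
    }


# Print the table for N = 10000 in the same layout as Table 1.
for margin, sizes in table(10000).items():
    print(f"{margin:>5.0%}", *sizes)
```

Calling `table(100000)` or `table(1000000)` in the same way gives values to compare against Tables 2 and 3.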
20. Formula for Sample Size…..preamble….
! We need to revisit the concepts of Margin of Error, Confidence
Level and Homogeneity to understand the Sample Size formula.
! Further we have to grasp some fundamental concepts from
Statistics and Probability Theory.
! Normal Distribution and Central Limit Theorem are terms and
concepts used by scientists, engineers and psephologists in this
context.
22. Normal Distribution ( alias Bell Curve )
According to the Normal Distribution, when the population is
very large, the observed values follow a bell shaped
curve in which (a) most values are concentrated near the
centre and (b) values are distributed symmetrically around the centre.
In our Battery example, the average
life can be 24 months. Then the actual
life of a battery can range from 2 to
46 months. The majority of the batteries
will show a life of 22 to 26 months.
[Figure: bell curve of battery life. X-axis: Life in months; Y-axis: No. of Batteries]
If the Margin of Error specified is 5 % ( 1.2 months) then we wish the sample
study to find the average battery life to be in the range of 22.8 to 25.2
months. The chosen sample size should ensure this.
24. Confidence Level …revisited….
[Figure: bell curve of battery life. X-axis: Life in months; Y-axis: No. of Batteries]
! In our example, the probability of a Sample Study finding a value between 20
and 28 months is given by the area under the curve between these two lines.
(This area is to be divided by the total area under the curve.)
! Let us say the area is 50%. Then the probability is 0.5. It means there is a
probability of 0.5 that our Sample Study will find the average life of batteries
to fall between 20 and 28 months.
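The area-under-the-curve idea can be sketched in Python with the standard library. The mean of 24 months comes from the battery example; the standard deviation of 6 months is a purely hypothetical value chosen for illustration, since the slides do not give one.

```python
from statistics import NormalDist

# Mean of 24 months is from the battery example.
# The standard deviation of 6 months is an assumption, not from the slides.
battery_life = NormalDist(mu=24, sigma=6)


def probability_between(dist, low, high):
    """Area under the normal curve between low and high.

    The total area under the curve is 1, so no further division is needed.
    """
    return dist.cdf(high) - dist.cdf(low)


p = probability_between(battery_life, 20, 28)
print(f"Probability of a value between 20 and 28 months: {p:.3f}")
```

With this assumed standard deviation the area comes out close to one half, in line with the 50% figure used in the bullet above.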
28. Suppose we can transform any given “mean” and “standard deviation” to 0 and 1
respectively; then the area under the curve can be obtained from a standardized
table. The Standard Table considers a normal distribution with mean=0 and SD=1
as shown below. Later we can recover values in the original units by a reverse
transformation. A variable called z ( z = (x - Mu) / Sigma ) [Mu is the population mean and
Sigma is the Standard Deviation of the population] performs this magical
transformation!
Standard Normal Distribution.
Now we are armed with all the concepts and are ready
to look at the formula!
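As a minimal sketch, the z transformation and the lookup of a Z value for a given confidence level can be written with Python's standard library instead of a printed table. The function names are illustrative, and the two-sided convention (splitting the leftover probability equally between both tails) is an assumption consistent with the Z of 2.58 quoted for 99% later in the deck.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Transform a value x to the standard normal scale: z = (x - Mu) / Sigma."""
    return (x - mu) / sigma


def z_for_confidence(confidence):
    """Z value for a two-sided confidence level such as 0.95 or 0.99."""
    tail = (1 - confidence) / 2          # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: Z = {z_for_confidence(level):.3f}")
```

The values agree with the usual table entries; the sample size tables in this deck appear to use them rounded to two decimals.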
29. Formula for determining the Sample Size.
n1 = Z**2 x p x (1-p) / (e**2)
n0 = n1 / [1 + (n1-1)/N]
! n1 = Sample Size uncorrected for the population size
! n0 = Sample Size corrected for the population size
! Z = The Z statistic value as derived from a normal distribution table for
a given confidence level. (It is 2.58 at 99% Confidence Level)
! p = estimate of the proportion of the population voting for the
proposition
! e = Margin of Error
! N = Estimated population size
! Symbol ** represents “raised to the power of”
This formula holds good for medium and large size
populations and where the study is aimed at finding the %
voting for a proposition.
30. Formula for determining the Sample Size.
n1 = Z**2 x p x (1-p) / (e**2)
n0 = n1 / [1 + (n1-1)/N]
For smaller populations (less than N = 1000) we need to use tables of a
different but similar distribution, called the “t distribution”,
instead of normal distribution tables.
Example: For Z = 2.58 (at the Confidence Level of 99%), p = 0.5 (maximum dispersion
of opinions), e = 0.01 (that is, 1% Margin of Error) and N = 1 million,
n1 = 16641 and n0 = 16369. [Same value shown in Table 3 earlier]
If the population size is 100000 instead of 1 million then n1 = 16641 and n0 = 14267
If the population size is 10000 instead of 1 million then n1 = 16641 and n0 = 6247
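The arithmetic behind the first example can be written out step by step. This is only a worked restatement of the formula on this slide; note that n1 does not depend on N, so it equals 16641 in all three cases.

```latex
\begin{align*}
n_1 &= \frac{Z^2 \, p \, (1-p)}{e^2}
     = \frac{2.58^2 \times 0.5 \times 0.5}{0.01^2}
     = \frac{1.6641}{0.0001}
     = 16641 \\
n_0 &= \frac{n_1}{1 + (n_1 - 1)/N}
     = \frac{16641}{1 + 16640/1000000}
     = \frac{16641}{1.01664}
     \approx 16368.6
\end{align*}
```

Rounding up to the next whole respondent gives 16369.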
31. Formula for determining the Sample Size in
arriving at a mean instead of a proportion
n1 = Z**2 x SD**2 / (e**2)
n0 = n1 / [1 + (n1-1)/N]
(SD stands for Standard Deviation)
Similar to the earlier formula except that
(1) Term p x (1-p) is replaced by SD**2
(2) error term e must be in the same units as SD
SD of the population is unknown prior to the survey. Hence
we can use an estimate determined through presampling.
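A short Python sketch of the formula for a mean follows. The inputs in the usage line are hypothetical: a standard deviation of 6 months (as if estimated through presampling), a margin of error of 1.2 months taken from the battery example, and a population of 10000 batteries. Rounding up to a whole unit is also an assumption.

```python
import math


def sample_size_for_mean(z, sd, margin, population):
    """Sample size for estimating a mean, corrected for population size.

    sd and margin must be expressed in the same units (here, months).
    """
    n1 = (z * sd / margin) ** 2                   # Z**2 x SD**2 / e**2
    n0 = n1 / (1 + (n1 - 1) / population)         # correction for population size
    return math.ceil(n0)


# Hypothetical inputs: Z = 1.96 (95% confidence), SD = 6 months,
# margin of error = 1.2 months, population of 10000 batteries.
print(sample_size_for_mean(z=1.96, sd=6, margin=1.2, population=10000))
```

Halving the margin of error roughly quadruples n1, because e enters the formula squared.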
32. Formula for determining the Sample Size.
Observations
n1 = Z**2 x p x (1-p) / (e**2)
n0 = n1 / [1 + (n1-1)/N]
! The higher the Confidence Level (Z) required, the higher the
sample size needed.
! The lower the Margin of Error (e) allowed, the higher the
sample size required.
! When p = 0.5 the term p x (1-p) reaches a
maximum of 0.25. For any other p value the product
term p x (1-p) will be less than 0.25. Hence the
sample size needed is maximum when p = 0.5
! The formula for n0 converges to n1 for large values
of N. We have earlier seen that this convergence
is nearly complete at N = 100 million when the CL needed is 99%
and ME is 1%. For relaxed requirements the
convergence occurs even at lower N values.
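The last two observations can be checked with a brief derivation, written here in standard notation with the same symbols as the formula on this slide.

```latex
\begin{align*}
\frac{d}{dp}\,\bigl[p(1-p)\bigr] &= 1 - 2p = 0
  \;\Rightarrow\; p = \tfrac{1}{2}, \qquad p(1-p) = \tfrac{1}{4} \\
\frac{d^2}{dp^2}\,\bigl[p(1-p)\bigr] &= -2 < 0
  \quad \text{(so this is a maximum)} \\
\lim_{N \to \infty} n_0 &= \lim_{N \to \infty} \frac{n_1}{1 + (n_1 - 1)/N} = n_1
\end{align*}
```

Because the correction term is (n1-1)/N, a smaller n1 (relaxed Confidence Level or Margin of Error) makes that term negligible at a much smaller N.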
33. A priori data needed…
Population size and characteristics
Most of the surveys require that we know in advance
a) Size of the population
b) population characteristics with respect to the study focus (such as the
standard deviation or expected proportion)
! For example, crime against women in any community is never fully
reported. Hence one cannot accurately know, in advance, the total
number of women affected. If one proposes to study how they are
impacted, then the relevant population cannot be known in advance.
! Similarly the standard deviation of income distribution among
residents of a city may not be known already.
! (But the formula for Sample Size calculation requires such data)
! We circumvent this problem by arriving at an estimate based on
prior studies or through presampling methods.
34. Sub Groups and Stratified Sampling
! It may be worthwhile to form sub groups and study them as
different strata in certain surveys.
! (For example, we may wish to find out the opinion of age
wise groups)
! Hence age wise strata need to be formed and the sample
size formula is to be applied within each stratum
! Aggregation of the study variate across strata requires due
weightage being given to each stratum based on its
population size.
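The weighting step in the last bullet can be sketched as follows. The three age-wise strata, their population sizes and their observed proportions are invented for illustration only; the function name is hypothetical.

```python
def weighted_proportion(strata):
    """Combine per-stratum proportions, weighting each by its population size.

    strata: list of (stratum_population_size, observed_proportion) pairs.
    """
    total_population = sum(size for size, _ in strata)
    weighted_sum = sum(size * proportion for size, proportion in strata)
    return weighted_sum / total_population


# Hypothetical age-wise strata: (population size, proportion in favour).
age_strata = [
    (3000, 0.40),   # e.g. under 30
    (5000, 0.55),   # e.g. 30 to 60
    (2000, 0.70),   # e.g. over 60
]
overall = weighted_proportion(age_strata)
print(f"Population-weighted proportion: {overall:.3f}")
```

A simple unweighted average of the three proportions would give 0.55 here, which overstates the result because the smallest stratum has the highest proportion.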
35. Why do pollsters go wrong?
! Pollsters and psephologists carry out opinion or attitude
surveys to determine what is likely to happen. Sometimes
their predictions go wrong.
! The Brexit opinion poll conducted prior to the voting in
Britain is a good example. Similarly many election
results predicted on the basis of prior or exit polls have
gone wrong.
Not all Sample Studies are similar in context.
Their contextual difference must be well
understood prior to the study.
37. Opinion Surveys among voters
! Many a time a substantial number of voters remain
undecided till the last minute.
! Survey instruments are not clever enough to detect
preferences of “sitting on the fence” voters.
! Sample Size turns out to be inadequate when multiple
probes are included in a single questionnaire.
! Voters may have a reason for withholding information or
misleading the pollsters. Survey instruments cannot detect
such devious behaviour.
! Inadequate randomness in Sample Selection
Better design of the survey in terms of instruments, sample size and
sampling plan, and training of the administrators, along with use
of modern Data Analytics aids, can improve predictability of
results.
39. Issues in Clinical Trials
They have many special characteristics
compared to regular sample studies.
! The study duration tends to be long; as much as 18 months on average
! The study population size may be unknown. Data on attribute dispersion
can be sparse.
! Hence Sample Size determination is a complex issue
! Samples tend to drop out during the study.
! The need to bifurcate the study population is a special requirement. One group
has to be administered the placebo. The other group is likely to benefit from
the study.
! Sample selection becomes a moral and ethical issue
! Both under selection and over selection of the study population can cause
a dilemma.
40. Small Sample Studies
What can be done when sample size has shrunk
unwittingly or otherwise?
! The results can be presented at a lower Confidence Level or
higher Margin of Error.
! Valid results can be presented at some of the strata levels or
with relaxed survey focus
! It is common to change the study focus to in-depth probing on
select topics when study population drops out midway in clinical
trials. (modify the null hypothesis)
! There are many techniques and tools available to guide in data
collection and data analysis, specific to small sample studies.
There are expert groups dedicated to analyzing small sample
data.
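One way to see what "a higher Margin of Error" means for a shrunken sample is to invert the sample size formula from slide 29 and solve for e. The sketch below assumes a proportion study with p = 0.5 and a sample smaller than the population; the numbers in the usage lines (a planned sample of 385 shrinking to 200 in a population of 1 million) are hypothetical.

```python
import math


def achievable_margin(n, z, population, p=0.5):
    """Margin of error that a sample of size n supports at a given Z value.

    Inverts n0 = n1 / [1 + (n1-1)/N] to recover n1, then solves
    n1 = Z**2 x p x (1-p) / e**2 for e. Requires n < population.
    """
    n1 = n * (population - 1) / (population - n)   # undo the population correction
    return z * math.sqrt(p * (1 - p) / n1)


# Hypothetical case: 385 respondents planned, only 200 remain, N = 1 million.
planned = achievable_margin(385, z=1.96, population=1_000_000)
actual = achievable_margin(200, z=1.96, population=1_000_000)
print(f"planned margin: {planned:.1%}, margin after shrinkage: {actual:.1%}")
```

The same function can be called with a smaller Z to see how much Confidence Level would have to be given up to hold the original Margin of Error.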
41. balasubp@gmail.com
Dr. P. Balasubramanian,
Founder & CEO, Theme Work Analytics,
Gurukrupa,508, 47th Cross
Jayanagar 5th Block
Bangalore, India 560041
Ph: 91 80 4121 4297