Adequate Sample Size

balasubp@gmail.com
Adequacy of Sample Size in Population Surveys

Dr. P. Balasubramanian, Ph.D.
Founder & CEO, Theme Work Analytics, Bangalore
& West Lafayette, IN, USA
Please obtain prior permission for reuse.
Feel free to download for self study.
Oct 2016

! Adequacy defined
! Relevant population
! Population characteristics
! Focus of the survey and its relevance to sampling
! Unbiased sampling?
! Population size vs sample size; revelations
! Formula for sample size
! A priori data needed: population size and its
characteristics
! Sub groups and stratified sampling
! Why do pollsters go wrong?
! Questionnaire Design
! Clinical Trials issues
! Small sample studies

! Engineers need to figure out the best features to
provide in devices such as mobile phones,
laptops, automobiles and washing machines.
! Managers of many firms try hard to determine the
response of customers to new product introductions.
! Pharma companies frequently conduct clinical trials
before launching new drugs in the
marketplace.
! Pollsters and psephologists use surveys all the time to
predict what issues dominate voters’ minds and who
is likely to win an election.

There are many common characteristics
amongst these diverse requirements.
Population Studies are needed everywhere

! The study has to be conducted and concluded quickly.
(The reasonable time frame being a few days to a few months.)
! It is not possible to poll the entire population and
do an exhaustive study since that would call for extended
time periods and also prove to be very expensive.
! Hence we resort to sample studies.
(meaning a small percentage of the population is polled)
! Results are tabulated or analyzed.
! The underlying belief here is that the sample study
findings and conclusions are equally valid and
applicable to the entire population.

Hence Sample Studies can turn out to be cost
effective and be conducted in reasonable time
periods.

Need for Sample Studies

! There are two other fundamental requirements in
Sample Studies: that
! (1) the sample chosen should truly reflect the
characteristics of the population
! (2) the sample size should be sufficient to draw
conclusions truly representative of the
population.

Hence adequacy of sample is defined based on
these two requirements.
Sample Vs Population

! The study population contains every unit or member to
which (whom) we wish to apply the conclusion arising from the
sample study.
! For example, in an election for office bearers in a housing
society, everyone with the voting right is relevant population.
It is immaterial whether he/she is a citizen of that country or region.
! Similarly in a general election, every citizen, irrespective of
where he/she lives (inside or outside the country) constitutes
the relevant population.
The concept of relevant population

Incorrectly identified population will result
in invalid conclusions.

! Units or members of a population do not exhibit
uniform attributes, characteristics or features.

! For example, the longevity of people living in a
community can differ widely. The price they are
willing to pay for any object can vary significantly.

! A homogeneous population is one with marginal
variation of the characteristics under study.

! A population with extreme variations is defined as
heterogeneous.
We will need a larger sample to draw meaningful
conclusions from a heterogeneous population.
Homogeneous VS Heterogeneous population

Homogeneous Population: examples
! Almost everyone (say 95% of citizens) believes that
the city is pedestrian friendly.
! 98% of the batteries supplied by Sunshine Power
Solutions Company served their warranty period of two
years without any claim.

Heterogeneous Population: examples
! Opinion varied widely among the rural residents about
the utility of the fertilizer credit scheme of the
government.
! Infant mortality rate ranged from 2 per thousand to
20 per thousand in different states in a developing
country.

! Clarity on purpose of the study, its focus and what
inferences we wish to draw is critical for its success.

! Ambiguity in its mission will result in incorrect
identification of the relevant population, inadequate
design of survey instruments and unreliable
conclusions.

! For example, a study of reasons for failure among
firms requires an unambiguous definition of “failure”.
The relevant population must include both failed and
successful companies.
Focus of the study will determine the relevant
population as well as its homogeneity.
Relevance of focusing on study objectives

! We have earlier stated that “the sample chosen should
truly reflect the characteristics of the population”.

! Hence sample units need to be chosen in such a way that
collectively they become a mini population in terms of the
characteristics being studied.

! For example, if the focus of the study is malnutrition in a
community, the sample units cannot come from schools alone
or from workplaces alone. They must come from both the
schools and the workplaces.

! Every unit in the population must have an equal chance of
being present in the study. This is called Unbiased Sampling.

Unbiased Sampling

! There are scientific methods to select the sample units
randomly from the population to ensure there is no bias in
sampling.

! Simple Random Sampling (SRS), Stratified Sampling and
Cluster Sampling are some of these methods.

! Random Sampling requires a finite population to give reliable
results. Further, each unit must be distinctly identified.
Unbiased Sampling techniques are the means to ensure
comprehensive representation of the population most
efficiently.
Unbiased Sampling

! We have earlier stated that the second fundamental
requirement of a sample study is that “the sample size
should be sufficient to draw conclusions truly representative
of the population”.
! There is no assurance that the study will yield an exact
result. (“exact” meaning 100% accuracy with reference to
the population)
! There will be a margin of error between the study findings
and the true population characteristics. This is known as
Sampling Error.
! Hence it is appropriate to present the result as a range
rather than a point estimate.
We can now turn our attention to the issue of
sample size determination.
The Margin of Error goes down as the Sample Size
increases.

! Even with the description of the estimate as a range and
not as a single point, we can speak with a degree of
confidence only and not with absolute certainty.
! We can state it with 95% or 99% confidence level (or less)
based on the sample size.
Continuing with the issue of sample size
determination…
The Confidence Level goes up as the Sample Size
increases.
Hence a high Confidence Level (say 99%) and a low
Margin of Error (say 1%) is achieved with a high
sample size.
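
The link between Sample Size and Margin of Error can be sketched numerically. The snippet below is only an illustration and is not part of the original slides: it inverts the proportion formula presented later in this deck (n1 = Z**2 x p x (1-p) / e**2), ignores the population-size correction, and assumes Z = 1.96 (95% Confidence Level) and p = 0.5. The function name margin_of_error is hypothetical.

```python
import math

def margin_of_error(n, z=1.96, p=0.5):
    """Approximate margin of error for a proportion estimated from a
    simple random sample of size n drawn from a large population.
    Assumption: no finite-population correction is applied."""
    return z * math.sqrt(p * (1 - p) / n)

# Quadrupling the sample size halves the margin of error.
for n in (100, 400, 1600, 6400):
    print(f"n = {n:>5}  margin of error = {100 * margin_of_error(n):.2f}%")
```

Under these assumptions, halving the Margin of Error requires roughly four times the sample, which is why very tight margins become expensive.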

sample size tables…preamble

! We will present a series of tables showing the required sample
size for a given population size, allowable margin of error and
expected confidence level.

! We assume that the population is quite heterogeneous in terms
of the parameter being studied. This will result in the
maximum sample size ever needed.
! There is an elegant mathematical formula to calculate these
values. We will present the formula in a later section.

! There are many ready reckoners and eCalculators to help us
find the sample size. One such calculator from SurveyMonkey
is available at
https://www.surveymonkey.com/mp/sample-size-calculator/

sample size tables

Table 1: N = 10000

margin of error    confidence level
                   90%      95%      99%
1%                 4021     4900     6247
2%                 1440     1937     2939
5%                  262      370      625
10%                  67       96      164

Table 2: N = 100000

margin of error    confidence level
                   90%      95%      99%
1%                 6301     8763    14267
2%                 1654     2345     3995
5%                  269      383      662
10%                  68       96      167

! If we accept a higher margin of error (such as 10%) then
even when the population size (N) is 100000, the required
sample size is only 68 (at 90% Confidence Level) and only 167 (at
99% Confidence Level)!
! The sample size has quickly converged to these numbers
and is almost constant at higher Margins of Error and lower
Confidence Levels.


! For a population of 10000, the maximum sample size
needed (for high level of accuracy) is 6247. [It is 62.5% of
the population]. Quite high.
! However when population size is 100000, the maximum
sample size needed is only 14267. [It is 14.3% of the
population]

sample size tables…some more..

Table 3: N = 1000000

margin of error    confidence level
                   90%      95%      99%
1%                 6680     9513    16369
2%                 1679     2396     4144
5%                  269      385      666
10%                  68       97      167

Table 4: N = 10000000

margin of error    confidence level
                   90%      95%      99%
1%                 6720     9595    16614
2%                 1681     2401     4159
5%                  269      385      666
10%                  68       97      167

! The sample size converges quickly as population size increases.
! The maximum sample size when the population is 10 million is
16614 (0.16% of the population!)
! At 5% Margin of Error and 99% Confidence Level the required
sample size is quite low at 666!

sample size tables…at population size of 100 million


! At population size of 100 million the sample size has converged
for all but two scenarios.
! The maximum sample size needed for even larger populations is
16641 (as determined from the eCalculator).
! Hence any (random sample) survey that covers the entire
population of the world can be carried out to a high degree of
accuracy with a sample size of 16641.

Table 5: N = 100 million

margin of error    confidence level
                   90%      95%      99%
1%                 6724     9604    16639
2%                 1681     2401     4161
5%                  269      385      666
10%                  68       97      167


! With a sample size of 68, we can study the global population at
a moderate level of accuracy!
! This is however true only when everyone in the population has
an equal chance of being selected in the sample.


[The eCalculator will also reveal that when the population size is less than 1000 we need
to sample almost everyone to get 1% Margin of Error and 99% Confidence Level]

Formula for Sample Size…preamble…
! We need to revisit the concepts of Margin of Error, Confidence
Level and Homogeneity to understand the Sample Size formula.

! Further we have to grasp some fundamental concepts from
Statistics and Probability Theory.
! Normal Distribution and Central Limit Theorem are terms and
concepts used by scientists, engineers and psephologists in this
context.

Margin of Error…revisited…
! A Sample Study is unlikely to yield the exact result. (For example,
the average age of residents in a city, based on census, was 32.1.
One sample study conducted in the same city found it to be
31.5, while a second study resulted in the value of 32.3.)
! Margin of Error is the difference between the actual value and
the value determined by the sample study.
! Before the study commences, we can specify the desired
Margin of Error (say 2% or 5% away from the actual value) and
then determine the sample size accordingly. Margin of Error is
also known as Degree of Precision in some texts.
The Margin of Error goes down as the Sample Size
increases.

Normal Distribution (alias Bell Curve)
According to the Normal Distribution, when the population is
very large, the observed values will lie within a bell-shaped
curve which has (a) most values concentrated near the
centre and (b) values distributed symmetrically around the centre.
In our battery example, the average life can be 24 months. Then the
actual life of a battery can range from 2 to 46 months. The majority of
the batteries will show a life of 22 to 26 months.
[Figure: bell curve of battery life. X-axis: life in months; Y-axis: number of batteries.]
If the Margin of Error specified is 5% (1.2 months) then we wish the sample
study to find the average battery life to be in the range of 22.8 to 25.2
months. The chosen sample size should ensure this.

Confidence Level…revisited…
! Even when multiple Sample Studies are done with the same
population, there is no assurance that the exact value (as per the
population) will be found. Neither individual Sample Study values
nor the average of Sample Studies is assured to get us the exact
value.
! The Bell Curve explains the phenomenon. Due to Sampling Error,
the values will lie around the exact value: more of them very
close to it, some away from it and a few far away from it.
! The area under this curve between two vertical lines
represents the probability that the value found will lie
between those lines.

[Figure: bell curve of battery life. X-axis: life in months; Y-axis: number of batteries.]
! In our example, the probability of a Sample Study finding a value between 20
and 28 months is given by the area under the curve between these two lines.
(This area is to be divided by the total area under the curve.)
! Let us say the area is 50%. Then the probability is 0.5. It means there is a
probability of 0.5 that our Sample Study will find the average life of batteries
to fall between 20 and 28 months.

[Figure: bell curve of battery life. X-axis: life in months; Y-axis: number of batteries.]
! Since we desire to have very high Confidence Levels (say 95%) the
area under the curve should accordingly be 95%.
! Further we wish the Margin of Error to be low (say 5%). That calls for
the Sample Study value to fall within a range of 1.2 from 24 months.
! Combining the two together, we can say that we wish to find the
sample size to give us a 95% Confidence Level that the Sample value
will fall between 22.8 and 25.2 months.

Homogeneity is expressed in terms of congruence of opinion
or level of dispersion around the average value
Homogeneity…revisited….
[Figure: two bell curves side by side, a narrow one for a homogeneous population and a wide one for a heterogeneous population.]
The Dispersion around the average (also called the mean in statistics)
is measured and expressed as standard deviation (SD).

Normal Distribution assures us that within 1 SD around the mean
we have the area under the curve equal to 68%. With 2 SD around
the mean the area will be 95% and with 3 SD it will be 99.7%.
Homogeneity…revisited….
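
The 68% / 95% / 99.7% areas quoted above can be checked with a few lines of Python. This is a small sketch rather than anything from the slides; it relies on the standard identity that the area within k SD of the mean of a normal curve equals erf(k / sqrt(2)). The function name area_within_k_sd is hypothetical.

```python
import math

def area_within_k_sd(k):
    """Fraction of a normal distribution lying within k standard
    deviations of the mean, computed from the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * area_within_k_sd(k):.1f}%")
```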

Suppose we can transform any given “mean” and “standard deviation” to 0 and 1
respectively; then the area under the curve can be obtained from a standardized
table. The Standard Table considers a normal distribution with mean=0 and SD=1.
Later we can also get the appropriate values by a reverse transformation
process. A variable called z ( z = (x - Mu) / Sigma ) [Mu is the population mean and
Sigma is the Standard Deviation of the population] performs this magical
transformation!
Standard Normal Distribution.
Now we are armed with all the concepts and are ready
to look at the formula!
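
The z transformation and the lookup of a z value for a given Confidence Level can be sketched with Python's standard library in place of a printed table. This is an illustrative sketch, not the author's procedure: the function names are hypothetical, and the SD of 2 months used for the battery example is an assumed figure, since the slides do not state one.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma

def z_for_confidence(level):
    """Two-sided z value for a confidence level such as 0.95, i.e. the
    point that leaves (1 - level) / 2 in each tail of the standard normal."""
    return NormalDist().inv_cdf(0.5 + level / 2)

# Battery example: mean life 24 months; SD of 2 months is an assumption.
print(z_score(26, mu=24, sigma=2))

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_for_confidence(level):.4f}")
```

Rounded to two decimals these z values are 1.64, 1.96 and 2.58; the last one is the value the slides quote for the 99% Confidence Level.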

Formula for determining the Sample Size.
n1 = Z**2 x p x (1-p) / (e**2)
n0 = n1 / [1 + (n1-1)/N]
! n1 = Sample Size uncorrected for the population size
! n0 = Sample Size corrected for the population size
! Z = The Z statistic value as derived from a normal distribution table for
a given confidence level. (It is 2.58 at 99% Confidence Level)
! p = estimate of the proportion of the population voting for the
proposition
! e = Margin of Error
! N = Population size estimated
! Symbol ** represents “raised to the power of”
This formula holds good for medium and large size
populations and where the study is aimed at finding the %
voting for a proposition.
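
The two-step formula can be written as a short Python function. This is a minimal sketch under stated assumptions: the function name sample_size_proportion is hypothetical, and rounding n0 up to the next whole number is an assumption that happens to reproduce the table entries shown earlier; the slides do not state the rounding rule.

```python
import math

def sample_size_proportion(N, z, e, p=0.5):
    """Return (n1, n0) for estimating a proportion.
    n1: sample size uncorrected for population size.
    n0: sample size corrected for population size N, rounded up
        (the rounding rule is an assumption, not from the slides)."""
    n1 = z**2 * p * (1 - p) / e**2
    n0 = n1 / (1 + (n1 - 1) / N)
    return n1, math.ceil(n0)

# 99% Confidence Level (z = 2.58), 1% Margin of Error, N = 1 million
n1, n0 = sample_size_proportion(N=1_000_000, z=2.58, e=0.01)
print(round(n1), n0)

# 95% Confidence Level (z = 1.96), 5% Margin of Error, N = 10000
print(sample_size_proportion(N=10_000, z=1.96, e=0.05)[1])
```

With these inputs the function gives n0 = 16369 and n0 = 370, matching the corresponding cells of Table 3 and Table 1.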

n1 = Z**2 x p x (1-p) / (e**2)
n0 = n1 / [1 + (n1-1)/N]
For smaller populations (less than N = 1000) we need to use
tables of a different but similar distribution called the “t distribution”
instead of normal distribution tables.

Example: For Z = 2.58 (at the Confidence Level of 99%), p = 0.5 (maximum dispersion
of opinions), e = 0.01 (that is, 1% Margin of Error) and N = 1 million,
n1 = 2.58**2 x 0.25 / 0.0001 = 16641 and the n0 value is 16369. [Same value shown in Table 3 earlier]

n1 does not depend on N, so it stays at 16641 for other population sizes.
If the population size is 100000 instead of 1 million then n1 = 16641 and n0 = 14267
If the population size is 10000 instead of 1 million then n1 = 16641 and n0 = 6247

Formula for determining the Sample Size in
arriving at a mean instead of a proportion
n1 = Z**2 x SD**2 / (e**2)
n0 = n1 / [1 + (n1-1)/N]

(SD stands for Standard Deviation)
Similar to the earlier formula except that
(1) Term p x (1-p) is replaced by SD**2
(2) error term e must be in the same units as SD
SD of the population is unknown prior to the survey. Hence
we can use an estimate determined through presampling.
Observations
n1 = Z**2 x p x (1-p) / (e**2)
n0 = n1 / [1 + (n1-1)/N]
! The higher the Confidence Level (Z) required, the higher the
sample size needed.
! The lower the Margin of Error (e) allowed, the higher the
sample size required.
! When p = 0.5 the term p x (1-p) reaches a
maximum of 0.25. For any other p value the product
term p x (1-p) will be less than 0.25. Hence the
sample size needed is maximum when p = 0.5.
! The formula for n0 converges to n1 for large values
of N. We have earlier seen that this convergence
occurs for N = 1000000 when the CL needed is 99%
and ME is 1%. For relaxed requirements the
convergence occurs even at lower N values.
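
The convergence of n0 towards n1 can be seen by evaluating the formula over a range of population sizes. This is a sketch under the same assumptions as the tables (Z = 2.58, e = 0.01, p = 0.5); the function name and the choice to round n0 up are assumptions.

```python
import math

def corrected_sample_size(N, z=2.58, e=0.01, p=0.5):
    """n0 from the slides' formula, rounded up to a whole unit."""
    n1 = z**2 * p * (1 - p) / e**2
    return math.ceil(n1 / (1 + (n1 - 1) / N))

population_sizes = [1_000, 10_000, 100_000, 1_000_000, 10_000_000, 100_000_000]
sizes = [corrected_sample_size(N) for N in population_sizes]

for N, n0 in zip(population_sizes, sizes):
    print(f"N = {N:>11,}  n0 = {n0:>6,}  ({100 * n0 / N:.3f}% of the population)")
```

The first row also illustrates the earlier remark that for N below 1000 nearly everyone must be sampled to reach 1% Margin of Error at 99% Confidence Level.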

A priori data needed…
Population size and characteristics
Most of the surveys require that we know in advance
a) size of the population
b) population characteristics with respect to the study focus (such as the
standard deviation or expected proportion)
! For example, crime against women in any community is never fully
reported. Hence one cannot accurately know, in advance, the total
number of women affected. If one proposes to study how they are
impacted, then the relevant population cannot be known in advance.
! Similarly the standard deviation of income distribution among
residents of a city may not be known already.
! (But the formula for Sample Size calculation requires such data)
! We circumvent this problem by arriving at an estimate based on
prior studies or through presampling methods.

Sub Groups and Stratified Sampling
! It may be worthwhile to form sub groups and study them as
different strata in certain surveys.

! (For example, we may wish to find out the opinion of age-wise
groups.)

! Hence age-wise strata need to be formed and the sample
size formula is to be applied within each stratum.

! Aggregation of the study variate across strata requires due
weightage being given to each stratum based on its
population size.
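
The weighting step can be illustrated with a short calculation. The strata sizes and per-stratum results below are invented purely for illustration, and the function name weighted_proportion is hypothetical; the sketch only shows the population-weighted aggregation described above.

```python
def weighted_proportion(strata):
    """Combine per-stratum estimates into one overall estimate,
    weighting each stratum by its population size.
    strata: list of (stratum_population, stratum_estimate) pairs."""
    total = sum(size for size, _ in strata)
    return sum(size * estimate for size, estimate in strata) / total

# Hypothetical age-wise strata: (population, proportion in favour)
age_strata = [(3000, 0.40), (5000, 0.55), (2000, 0.70)]

overall = weighted_proportion(age_strata)
simple_average = sum(est for _, est in age_strata) / len(age_strata)
print(round(overall, 3), round(simple_average, 3))
```

In this made-up case the weighted estimate (0.535) differs from the plain average of the three strata (0.55), which is why the weights matter.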

Why do pollsters go wrong?
! Pollsters and psephologists carry out opinion or attitude
surveys to determine what is likely to happen. Sometimes
their predictions go wrong.

! The Brexit opinion poll conducted prior to the voting in
Britain is a good example. Similarly many election
results predicted on the basis of prior or exit polls have
gone wrong.
Not all Sample Studies are similar in context.
Their contextual difference must be well
understood prior to the study.

Significant Differences in Sample Studies

! To gauge the property or attribute distribution pattern within the
population: sample units are neutral to the outcome.
! To carry out an opinion survey among voters or citizens: sample units
can be untruthful.
! To conduct a clinical trial: the survey owner may withhold information.

Study objectives can differ and so can the behaviour of stakeholders

Opinion Surveys among voters
! Many a time a substantial number of voters remain
undecided till the last minute.
! Survey instruments are not clever enough to detect
preferences of “sitting on the fence” voters.
! Sample Size turns out to be inadequate when multiple
probes are included in a single questionnaire.
! Voters have a reason for withholding information or
misleading the pollsters. Survey instruments cannot detect
such devious behaviour.
! Inadequate randomness in Sample Selection.
Better design of survey in terms of instruments, sample size and
sampling plan and training the administrators along with use
of modern Data Analytics aids can improve predictability of
results.

Questionnaire Design
! When sample units are neutral to the outcome, veracity of data
is not an issue.
! However most opinion surveys may end up with data not
reflecting the true opinion of the persons interviewed.
! Hence it is preferable to design the questionnaire with multiple
choice queries rather than binary responses.
! Further the sample size needs to be increased (by 25 to 50%) to
account for this unreliability of response.
! Redundant queries need to be included to cross validate
responses and to discover anomalies.
! Leading questions are to be avoided.
! Questions must be sensitive to gender, race, region, etc.

Issues in Clinical Trials
They have many special characteristics
compared to regular sample studies.
! The study duration tends to be long; as much as 18 months on average.
! The study population size may be unknown. Data on attribute dispersion
can be sparse.
! Hence Sample Size determination is a complex issue.
! Sample units tend to drop out during the study.
! Need to bifurcate the study population is a special requirement. One group
has to be administered the placebo. The other group is likely to benefit from
the study.
! Sample selection becomes a moral and ethical issue.
! Both under selection and over selection of study population can cause
a dilemma.

Small Sample Studies
! The results can be presented at a lower Confidence Level or
higher Margin of Error.
! Valid results can be presented at some of the strata levels or
with relaxed survey focus.
! It is common to change the study focus to in-depth probing on
select topics when study population drops out midway in clinical
trials. (modify the null hypothesis)
! There are many techniques and tools available to guide in data
collection and data analysis, specific to small sample studies.
There are expert groups dedicated to analyzing small sample
data.
What can be done when sample size has shrunk
unwittingly or otherwise?

Dr. P. Balasubramanian,
Founder & CEO, Theme Work Analytics,
Gurukrupa,508, 47th Cross
Jayanagar 5th Block
Bangalore, India 560041
Ph: 91 80 4121 4297
