SAMPLING AND SAMPLE SIZE
Dr. Keerti Jain,
NIIT University, Neemrana
POPULATION AND SAMPLE
Population:
The set of all measurements
of interest to the researcher
(the collection of all responses,
measurements, or counts that are
of interest)
Sample:
A subset of the population
3/26/2020, Dr. Keerti Jain, NIIT University Neemrana
POPULATION DEFINITION
• A population can be defined as including all people or items
with the characteristic one wishes to understand.
• Because there is very rarely enough time or money to gather
information from everyone or everything in a population, the
goal becomes finding a representative sample (or subset) of
that population.
• The population from which the sample is drawn may not be
the same as the population about which we actually want
information. Often there is a large but incomplete overlap
between these two groups, for example because of sampling frame issues.
EXAMPLE
• We might study rats in order to get a better
understanding of human health, or we might study
records from people born in 2008 in order to make
predictions about people born in 2009.
SAMPLING
A sample is “a smaller (but hopefully
representative) collection of units from a
population used to determine truths about that
population” (Field, 2005)
WHY SAMPLING?
• What is your population of interest?
• To whom do you want to generalize your
results?
    • All doctors
    • School children
    • Indians
    • Women aged 15-45 years
    • Other
• Can you sample the entire population?
WHY SAMPLING?
• Lower cost
• Less field time
• But less accuracy, because of sampling error
• When it’s impossible to study the whole
population
WHEN MIGHT YOU SAMPLE THE ENTIRE POPULATION?
• When your population is very small
• When you have extensive resources
• When you don’t expect a very high response
TERMINOLOGY
Target population:
The population to be studied, to which the investigator wants to generalize the
results
Sampling unit:
The smallest unit that can be selected for the sample
Study population:
The part of the target population from which the investigator collects the
sample
Sampling frame:
List of all the sampling units from which the sample is drawn
Sampling scheme:
Method of selecting sampling units from the sampling frame
SAMPLING
[Figure labels: Target population; Study population; Sample frame; Sample]
SAMPLING BREAKDOWN [figure]
EXAMPLE OF SAMPLING FRAME
The sampling frame is the list from which the
potential respondents are drawn
• Registrar’s office
• Class rosters
IMPORTANCE OF SAMPLING FRAME
• In the most straightforward case, such as the sentencing of a batch of
material from production (acceptance sampling by lots), it is possible
to identify and measure every single item in the population and to
include any one of them in our sample. However, in the more general
case this is not possible.
• There is no way to identify all rats in the set of all rats. Where voting
is not compulsory, there is no way to identify in advance which people
will actually vote at a forthcoming election.
• As a remedy, we seek a sampling frame which has the property that
we can identify every single element and include any in our sample.
• The sampling frame must be representative of the population.
FACTORS INFLUENCING SAMPLE REPRESENTATIVENESS
• Sampling procedure
• Sample size
• Participation (response)
SAMPLING PROCESS
The sampling process comprises several stages:
• Defining the population of concern
• Specifying a sampling frame, a set of items or events possible to measure
• Specifying a sampling method for selecting items or events from the frame
• Determining the sample size
• Implementing the sampling plan
• Sampling and data collection
• Reviewing the sampling process
TYPES OF SAMPLING TECHNIQUES
• Non-Probability Sampling
• Probability Sampling
NON-PROBABILITY SAMPLING
• Probability of being chosen is unknown
• Cheaper, but results cannot be generalised to the population
• Potential for bias
PROBABILITY SAMPLING
• Random sampling
• Each subject has a known, non-zero probability of being
selected
• Allows application of statistical
sampling theory to results in order to:
    • Generalise
    • Test hypotheses
TYPES OF NON-PROBABILITY SAMPLE
• Convenience sample
• Purposive sample
• Judgmental sample
• Quota sample
• Snowball sample
• Panel sample
TYPES OF PROBABILITY SAMPLING
• Simple Random Sample
• Systematic random sample
• Stratified random sample
• Multistage sample
• Multiphase sample
• Cluster sample
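Three of the schemes listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the sampling frame is a hypothetical roster of student IDs 1 to 100, the strata ("odd" and "even" IDs) are invented for the example, and the function names are not from the slides.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every unit is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def systematic_sample(frame, n, seed=None):
    """Pick a random start, then take every k-th unit, with k = len(frame) // n."""
    rng = random.Random(seed)
    k = len(frame) // n
    start = rng.randrange(k)
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(frame, stratum_of, n_per_stratum, seed=None):
    """Draw a simple random sample of fixed size inside each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for label in sorted(strata):
        sample.extend(rng.sample(strata[label], n_per_stratum))
    return sample

# Hypothetical sampling frame: student IDs 1..100 taken from a class roster.
frame = list(range(1, 101))
srs = simple_random_sample(frame, 10, seed=1)
sys_sample = systematic_sample(frame, 10, seed=1)
strat = stratified_sample(frame, lambda i: "odd" if i % 2 else "even", 5, seed=1)
```

In each case the selection probability of every unit is known in advance, which is what separates these schemes from the non-probability samples.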
ERRORS IN SAMPLE
• Systematic error (or bias)
    • Inaccurate response (information bias)
    • Selection bias
• Sampling error (random error)
TYPE 1 ERROR
• The probability of finding a difference between our
sample and the population when there really
isn’t one
• Known as α (the probability of a “type 1 error”)
• Usually set at 5% (or 0.05)
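A small simulation can make the meaning of α concrete. The sketch below is an illustration, not part of the slides: it assumes a two-sided z-test with known standard deviation 1, samples of size 30, and a fixed random seed, and it repeatedly samples from a population where the null hypothesis is true. The share of samples that are (wrongly) declared different should land near 5%.

```python
import random
from statistics import NormalDist, fmean

def false_positive_rate(n=30, trials=2000, alpha=0.05, seed=0):
    """Estimate how often a true null hypothesis is rejected at level alpha."""
    rng = random.Random(seed)
    # Two-sided critical value, about 1.96 when alpha = 0.05.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # The population mean really is 0, so any rejection is a type 1 error.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) / (1.0 / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

rate = false_positive_rate()
```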
TYPE 2 ERROR
• The probability of not finding a difference that actually
exists between our sample and the
population
• Known as β (the probability of a “type 2 error”)
• Power is (1 − β) and is usually set at 80%
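SAMPLE SIZE FOR ESTIMATING POPULATION MEAN
• n = (Zσ / E)²
where Z is the standard normal value for the chosen confidence level
σ is the standard deviation of the outcome
E is the tolerable margin of error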
EXAMPLE 1
• An investigator wants to estimate the mean systolic blood
pressure in children with congenital heart disease who are
between the ages of 3 and 5. How many children should be
enrolled in the study? The investigator plans on using a 95%
confidence interval (so Z=1.96) and wants a margin of error of 5
units. The standard deviation of systolic blood pressure is
unknown, but the investigators conduct a literature search and
find that the standard deviation of systolic blood pressures in
children with other cardiac defects is between 15 and 20. To
estimate the sample size, we consider the larger standard
deviation in order to obtain the most conservative (largest) sample
size.
SOLUTION
n = (Zσ / E)² = (1.96 × 20 / 5)² ≈ 61.5, rounded up to 62
In order to ensure that the 95% confidence interval estimate of the mean
systolic blood pressure in children between the ages of 3 and 5 with
congenital heart disease is within 5 units of the true mean, a sample of size
62 is needed.
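The arithmetic behind the figure of 62 can be checked with a short sketch. It assumes the standard formula n = (Zσ / E)² with the result rounded up to a whole subject; the function name is chosen for this illustration only. Running it with the smaller standard deviation from the literature range shows why the larger value is the conservative choice.

```python
import math

def sample_size_mean(z, sigma, margin):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the margin."""
    return math.ceil((z * sigma / margin) ** 2)

# Values from Example 1: Z = 1.96, margin of error 5, sigma between 15 and 20.
n_conservative = sample_size_mean(1.96, 20, 5)  # larger sigma
n_optimistic = sample_size_mean(1.96, 15, 5)    # smaller sigma
```

With σ = 20 the sketch gives 62, matching the solution; with σ = 15 it gives only 35, which would be too few if the true standard deviation were closer to 20.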
Example 2
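SAMPLE SIZES FOR TWO INDEPENDENT SAMPLES
• n1 = n2 = 2(Zσ / E)²
where σ is the common standard deviation of the outcome in the two groups
E is the tolerable margin of error for the difference in means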
EXAMPLE 3
EXAMPLE 4
An investigator wants to compare two diet programs in children who are
obese. One diet is a low fat diet, and the other is a low carbohydrate diet.
The plan is to enroll children and weigh them at the start of the study.
Each child will then be randomly assigned to either the low fat or the low
carbohydrate diet. Each child will follow the assigned diet for 8 weeks, at
which time they will again be weighed. The number of pounds lost will
be computed for each child. Based on data reported from diet trials in
adults, the investigator expects that 20% of all children will not complete
the study. A 95% confidence interval will be estimated to quantify the
difference in weight lost between the two diets and the investigator
would like the margin of error to be no more than 3 pounds. How many
children should be recruited into the study?
SOLUTION
Samples of size n1=56 and n2=56 will ensure that the 95% confidence interval for
the difference in weight lost between diets will have a margin of error of no more
than 3 pounds. These sample sizes refer to the numbers of children with
complete data.
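Because 56 per group is the number of children with complete data, the number to recruit has to be inflated for the expected 20% dropout. A minimal sketch of that adjustment, assuming dropout is the same in both groups and using a helper name invented for this illustration:

```python
import math

def inflate_for_attrition(n_complete, dropout_rate):
    """Number to enroll so that about n_complete subjects finish the study."""
    # The small tolerance guards against floating-point noise before rounding up.
    return math.ceil(n_complete / (1 - dropout_rate) - 1e-9)

# Values from Example 4: 56 completers needed per group, 20% expected dropout.
per_group = inflate_for_attrition(56, 0.20)
total = 2 * per_group
```

Under these assumptions the investigator would recruit 70 children per diet, 140 in total.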
SAMPLE SIZE FOR ONE SAMPLE, DICHOTOMOUS OUTCOME (PROPORTION)
n = p(1 − p)(Z / E)²
where p is the expected proportion
E is the sampling error or tolerable margin of error
E = largest acceptable difference between population proportion and sample proportion
EXAMPLE 5
An investigator wants to estimate the proportion of anemic children in a
certain preparatory school. A similar study at another school found a
proportion of 30%.
Compute the minimum sample size required at a 95% confidence level,
accepting a difference of up to 4% from the true population proportion.
SOLUTION
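The worked solution is not recoverable from the slide, so the sketch below only shows how the numbers in the problem statement would be combined. It assumes the usual formula n = p(1 − p)(Z / E)² with Z = 1.96 for 95% confidence and rounds up to a whole child; the function name is an assumption of this illustration.

```python
import math

def sample_size_proportion(z, p, margin):
    """Smallest n for which z * sqrt(p * (1 - p) / n) does not exceed the margin."""
    return math.ceil(p * (1 - p) * (z / margin) ** 2)

# Values from Example 5: anticipated proportion 30%, tolerable difference 4%.
n_anemia = sample_size_proportion(1.96, 0.30, 0.04)
```

Under these assumptions the computation gives 0.30 × 0.70 × (1.96 / 0.04)² ≈ 504.2, which rounds up to 505 children.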
EXAMPLE 6
An investigator wants to estimate the proportion of freshmen
at his University who currently smoke cigarettes (i.e., the
prevalence of smoking). How many freshmen should be
involved in the study to ensure that a 95% confidence interval
estimate of the proportion of freshmen who smoke is within
5% of the true proportion?
SOLUTION
In order to ensure that the 95% confidence interval estimate of the
proportion of freshmen who smoke is within 5% of the true proportion, a
sample of size 385 is needed.
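Example 6 gives no prior estimate of the smoking prevalence. A common approach, assumed here because the slide's computation is not recoverable, is to use p = 0.5, since p(1 − p) is largest there and so yields the most conservative sample size. The sketch below scans a few hypothetical values of p to illustrate this, again assuming n = p(1 − p)(Z / E)².

```python
import math

def sample_size_proportion(z, p, margin):
    """Sample size for estimating one proportion to within the given margin."""
    return math.ceil(p * (1 - p) * (z / margin) ** 2)

# Hypothetical candidate prevalences; Z = 1.96 and margin 5% are from Example 6.
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
sizes = {p: sample_size_proportion(1.96, p, 0.05) for p in candidates}
worst_p = max(sizes, key=sizes.get)
```

The largest requirement occurs at p = 0.5 and equals 385, which agrees with the sample size stated in the solution.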
SAMPLE SIZES FOR TWO SAMPLES, DICHOTOMOUS OUTCOME (PROPORTIONS)
n1 = n2 = [p1(1 − p1) + p2(1 − p2)](Z / E)²
where p1 and p2 are the expected proportions in the two groups
E is the sampling error or tolerable margin of error
E = largest acceptable error in the estimated difference between the two proportions
EXAMPLE 7
EXAMPLE 8
• An investigator wants to estimate the impact of smoking during
pregnancy on premature delivery. Normal pregnancies last
approximately 40 weeks and premature deliveries are those that occur
before 37 weeks. The 2005 National Vital Statistics report indicates that
approximately 12% of infants are born prematurely in the United
States. The investigator plans to collect data through medical record
review and to generate a 95% confidence interval for the difference in
proportions of infants born prematurely to women who smoked during
pregnancy as compared to those who did not. How many women
should be enrolled in the study to ensure that the 95% confidence
interval for the difference in proportions has a margin of error of no
more than 4%?
SOLUTION
The sample sizes (i.e., numbers of women who smoked and did not smoke
during pregnancy) can be computed using the formula shown above.
National data suggest that 12% of infants are born prematurely. We will use
that estimate for both groups in the sample size computation.
Samples of size n1=508 women who smoked during pregnancy and n2=508
women who did not smoke during pregnancy will ensure that the 95%
confidence interval for the difference in proportions who deliver prematurely
will have a margin of error of no more than 4%.
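The figure of 508 per group can be reproduced with a short sketch. It assumes the standard per-group formula n = [p1(1 − p1) + p2(1 − p2)](Z / E)², rounded up, with 12% used for both groups as the solution describes; the function name is chosen for this illustration.

```python
import math

def sample_size_two_proportions(z, p1, p2, margin):
    """Per-group n for a confidence interval on a difference in proportions."""
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(variance_sum * (z / margin) ** 2)

# Values from Example 8: 12% prematurity in both groups, margin of error 4%.
n_per_group = sample_size_two_proportions(1.96, 0.12, 0.12, 0.04)
```

The computation is 2 × 0.12 × 0.88 × (1.96 / 0.04)² ≈ 507.1, which rounds up to 508 women in each group.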