Understanding Sampling Methods and Biases in Epidemiological Studies

 Census study: the entire population of interest is
included in the study
 Most studies require sampling because collecting data on
an entire population is generally not feasible
 Collect information on a subset that represents (or is
similar to) the entire population

 Surveys are widely used.
◦ Politics: Poll on views of presidential candidates
◦ Economy: Poll on consumer confidence (overall
economy outlook, personal finance, etc.)
 Some type of sampling is usually involved in
surveys.

 Sampling is usually discussed in the context
of cross-sectional studies.
 However, similar concepts apply to other
study designs:
◦ selection of subjects in a cohort study
◦ selection of controls in a case-control study
◦ assignment of subjects to different study groups in
a clinical trial

 Target population or universe
◦ the group one wishes to generalize to or make inferences about
 Sampling frame
◦ the list of the target population from which the sample is drawn
◦ operational definition of who, where, and when
 Sampling unit or element
◦ the unit of observation for data collection.

 A sampling frame is an instrument that includes all units of the
survey population, e.g., lists, maps, lists of
imidugudu
 Clear rules (inclusion and exclusion criteria) are needed
to define what is in and what is out of the survey
population.
 Coverage error (e.g. bad or incomplete lists)
can be a problem.

Advantages of sampling:
 Time efficient
 Less costly
 Potentially more accurate (since it is more feasible to
maintain quality control over a smaller number of
subjects)

Disadvantage of sampling:
 Potential bias in the selection of subjects, which may
lead to errors in the interpretation of results and reduce the
ability to generalize the results beyond the subjects
actually studied

 Is the sample drawn from the target population?
 Is the sampling scheme biased?
 Can we demonstrate that enrolled subjects are
reasonably comparable to the population of interest
with respect to important demographic/clinical
characteristics?

 Probability sampling
◦ each member of the population has a known, non-zero
probability of being selected
◦ the probability is not necessarily the same for all members
 Non-probability sampling
◦ members are selected from the population in some non-random
manner.

Types of probability sampling:
 Simple random sample
 Systematic random sample
 Stratified sample
 Cluster sample

 Simple random sampling (SRS) is the purest form of probability
sampling (the gold standard among the probability sampling techniques)
 Every element in the population has an equal chance of
being included in the sample (e.g., drawing numbers
from a hat).

 Select the sample using random numbers generated by
random number tables or computer programs.
 Software available: Epi Info, etc.
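
As a minimal sketch of drawing a simple random sample by computer, the Python standard library can be used. The frame of 1,000 unit IDs, the function name, and the seed below are hypothetical, chosen only for illustration:

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units from the sampling frame without replacement.

    Every unit has the same chance of selection, and every possible
    set of n units is equally likely.
    """
    rng = random.Random(seed)  # seeded only to make the example repeatable
    return rng.sample(frame, n)

frame = list(range(1, 1001))  # hypothetical frame: unit IDs 1..1000
sample = simple_random_sample(frame, 100, seed=42)
print(len(sample), len(set(sample)))  # 100 100 (no unit is drawn twice)
```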

 In systematic sampling, the researcher selects a random starting point and then
systematically selects subjects from the sampling frame
at a specified interval (e.g., every kth unit is selected)
 The sampling interval (k) is based on the required sample size; the
starting point is chosen at random between 1 and k
◦ e.g., if sampling frame N = 1,000 and sample size n = 100, then k =
1,000/100 = 10
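
The N = 1,000, n = 100 example can be sketched as follows. This is an illustration only: it assumes N is an exact multiple of n, and the function name and seed are hypothetical.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select every k-th unit after a random start, with k = N / n."""
    N = len(frame)
    k = N // n                  # sampling interval (assumes N is a multiple of n)
    rng = random.Random(seed)
    start = rng.randrange(k)    # random start among the first k positions
    return frame[start::k][:n]

frame = list(range(1, 1001))    # hypothetical frame: unit IDs 1..1000
sample = systematic_sample(frame, 100, seed=7)
print(len(sample), sample[1] - sample[0])  # 100 10
```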

 With systematic sampling, each element in the population has a chance of
being selected, but the probabilities of different sets
of elements being included in the sample are not all
equal.
 Suppose we select 3 individuals out of 10 by taking
every third person after a random start (1, 2, 3, or 4):
◦ What is the probability that both #1 and #2 are
selected?
◦ What is the probability that both #1 and #4 are
selected?
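
One way to check the two questions is to list every possible sample. The sketch below assumes the four starts are equally likely; the variable and function names are illustrative. For comparison, under simple random sampling of 3 out of 10, every pair of individuals would have the same joint probability, (3 × 2)/(10 × 9) = 1/15.

```python
from fractions import Fraction

starts = [1, 2, 3, 4]   # possible random starts, assumed equally likely
k = 3                   # take every third person
n = 3                   # sample size

# The four possible samples: {1,4,7}, {2,5,8}, {3,6,9}, {4,7,10}
samples = [{s + i * k for i in range(n)} for s in starts]

def joint_inclusion_probability(a, b):
    """Probability that individuals a and b are both in the sample."""
    hits = sum(1 for s in samples if a in s and b in s)
    return Fraction(hits, len(samples))

print(joint_inclusion_probability(1, 2))  # 0   (never selected together)
print(joint_inclusion_probability(1, 4))  # 1/4 (only when the start is 1)
```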

 For systematic sampling to be valid, the sampling frame
must be in approximately random order (no periodic
pattern that coincides with the sampling interval k).

 Stratified sampling is used when there is a particular interest in making sure that:
◦ certain groups will be included in the study, or
◦ some groups will be sampled at a higher rate than others, or
◦ there is geographic representativeness
 The population is divided into strata, or groups of units
having certain characteristic(s) in common, and then a
sample of units is drawn from each stratum (stratum-specific
sampling may be proportionate or disproportionate).

 First, classify the population into subpopulations
(strata), based on existing information (e.g. grades in
a school), and then select separate samples from
each of the strata using SRS or other methods.
 If the strata sample sizes are proportional to the
strata population sizes (i.e. a uniform sampling
fraction is used), it is known as proportionate
stratification; otherwise, it is considered
disproportionate stratification.

 Proportionate
◦ same sampling fraction in each stratum
 Disproportionate
◦ different sampling fraction in each stratum.
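
A minimal sketch of proportionate stratification with SRS inside each stratum. The school grades, stratum sizes, and 10% sampling fraction are hypothetical values used only for illustration:

```python
import random

def proportionate_stratified_sample(strata, fraction, seed=None):
    """Draw an SRS from each stratum using the same sampling fraction."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        n_h = round(len(units) * fraction)   # stratum sample size
        sample[name] = rng.sample(units, n_h)
    return sample

# Hypothetical strata: three school grades of different sizes
strata = {
    "grade_1": [f"g1_{i}" for i in range(200)],
    "grade_2": [f"g2_{i}" for i in range(150)],
    "grade_3": [f"g3_{i}" for i in range(50)],
}
sample = proportionate_stratified_sample(strata, 0.10, seed=1)
print({name: len(units) for name, units in sample.items()})
# {'grade_1': 20, 'grade_2': 15, 'grade_3': 5}
```

For disproportionate stratification, the single fraction would be replaced by a separate fraction per stratum (for example, a higher one for grade_3), and the analysis would then need sample weights.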

 Compared with SRS of the same size, a
proportionate stratification sample with SRS in
each of the strata will have a similar or smaller
variance.
 The gain in precision is large if the within-strata
variation is small and/or the between-strata
variation is large.

 The stratum that is given a higher sampling fraction usually
has
 a relatively small size (e.g. minority groups in national surveys); or
 a relatively high variance in terms of the variable of interest.
 Disproportionate stratification can result in a higher variance
of the sample mean than a SRS of the same size.

 The primary purpose of cluster sampling is to maximize the
dispersion of the sample throughout the community in order to represent
the diversity that exists while also minimizing costs
(money and time)
 Geographic area is divided into “clusters”
 “Clusters” are first sampled from all of the clusters in a
defined area (often using probability proportionate to
size [PPS] design) and then data are collected on all or a
subsample of units in each sampled cluster.
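
The first stage (choosing clusters with probability proportionate to size) is often done with a systematic pass over the cumulative population list. The sketch below shows one such approach under stated assumptions: the cluster names and sizes are hypothetical, and a cluster larger than the sampling interval could be selected more than once.

```python
import random

def pps_systematic(cluster_sizes, n_clusters, seed=None):
    """Select clusters with probability proportional to size (systematic PPS)."""
    rng = random.Random(seed)
    total = sum(cluster_sizes.values())
    interval = total / n_clusters            # sampling interval on the cumulative scale
    start = rng.random() * interval          # random start in [0, interval)
    targets = iter(start + i * interval for i in range(n_clusters))

    selected, cumulative = [], 0
    target = next(targets, None)
    for name, size in cluster_sizes.items():
        cumulative += size
        # A cluster is selected each time a target falls inside its range
        while target is not None and target < cumulative:
            selected.append(name)
            target = next(targets, None)
    return selected

# Hypothetical clusters (e.g., villages) with their population sizes
cluster_sizes = {"A": 500, "B": 300, "C": 900, "D": 400, "E": 900}
selected = pps_systematic(cluster_sizes, 3, seed=3)
print(selected)
```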

 Two-stage sampling: Only a sample of the
elements in the selected clusters is included.
 Multistage sampling: A hierarchy of clusters is used;
this is sometimes also described as cluster sampling.

 Compared with an SRS of the same size, cluster sampling
often (although not all the time) leads to a loss in precision.
 The justification for cluster sampling is the reduced cost (time
and money).
 Need to consider weighting and variance estimation issues
when analyzing complex sample data (unequal probability of
sample selection and clustering)
 Sample weight reflects the inverse of the selection probability
– can be interpreted as the number of population members a
given subject represents.
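
A small sketch of how sample weights enter an estimate. The four subjects, their selection probabilities, and their outcome values are made up for illustration, and variance estimation for complex samples is not shown:

```python
def weighted_mean(values, selection_probabilities):
    """Estimate a population mean using inverse-probability sample weights."""
    weights = [1.0 / p for p in selection_probabilities]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical data: 1 = has the condition, 0 = does not
values = [1, 0, 1, 1]
# Two subjects sampled with probability 0.05 (each represents 20 people)
# and two with probability 0.25 (each represents 4 people)
probabilities = [0.05, 0.05, 0.25, 0.25]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, probabilities)
print(f"unweighted = {unweighted:.2f}, weighted = {weighted:.2f}")
# unweighted = 0.75, weighted = 0.58
```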

 In sampling with stratification, all strata would be
included in the final sample.
 We would like to have strata that are internally
homogeneous and externally heterogeneous.
 In cluster sampling, only a sample of the clusters
will be included in the final sample.
 We would like each cluster to be as heterogeneous
as possible.

Types of non-probability sampling:
 Convenience sample
 Consecutive sample
 Quota sample
 Snowball sample

 In convenience sampling, subjects are selected because of their
convenient accessibility to the investigator
 Subjects are chosen simply because they are the easiest to
obtain for the study (easy, fast, and usually the least
expensive and troublesome)
 Investigator makes no attempt (or only a limited attempt)
to ensure that the sample is an accurate representation of
the population of interest
 Greatest potential for bias.

 Consecutive sampling is a strict version of convenience sampling,
where every available subject is selected, i.e., the complete accessible
population is studied
 This is the best choice of the non-probability sampling
techniques, since by studying everybody available, a
good representation of the overall population is possible
in a reasonable period of time.

 Quota sampling is the non-probability equivalent of stratified sampling
 The researcher first identifies the strata of interest and
then more targeted sampling is used to select the
required number of subjects from each stratum.

 Snowball sampling is used when the desired sample is hard to
reach or the sample characteristic is rare
 Relies on referrals from initial subjects to generate
additional subjects (also called chain-referral
sampling)
 Reduced likelihood that the sample will represent the
population of interest.

 Internal validity
◦ Did the study measure what it set out to measure?
 External validity (generalizability)
◦ Can the study results be extrapolated to a wider population?
 Our focus will be on threats to internal validity –
“bias” undermines the internal validity of a study

 Bias is defined as the result of systematic error in the design
or conduct of a study
 Bias results from flaws in either the method of selection
of study participants [selection bias] or the procedures
for gathering exposure and/or outcome information
[information bias]

 Selection bias is a systematic error in the ascertainment of study
subjects, resulting in a tendency toward distorting the
exposure-outcome association
 Present when individuals have different probabilities of
being included in the study sample according to the
exposure and the outcome of interest.

Recall the basic principles of case-control studies:
 Study cases should be representative of all cases that
arise in the source population with respect to the exposure
of interest
 Controls should be representative of the source
population with respect to the exposure of interest

 Problems may arise when, for example:
◦ there is differential participation according to exposure status
◦ exposure leads to increased likelihood of diagnosis (also called
detection bias or medical surveillance bias or, in the case of
hospital-based cases, Berkson’s bias)
◦ prevalent (or surviving) cases are used instead of incident cases.

 Non-response bias
◦ Rates of response to surveys and questionnaires may be related to
exposure and/or disease status
 Loss to follow-up
◦ Major source of bias in cohort studies – persons lost to follow-up
may differ from those who stay in the study with respect to both
exposure and outcome
 Volunteer/compliance bias
◦ In studies comparing disease outcome in persons who volunteer
or comply with medical treatment to those who do not, better
results might be expected among those persons who volunteer or
comply than among those who do not

 There is no statistical test to determine whether
selection bias has occurred – must be assessed through
critical review of the study
 Generally speaking, there is no statistical procedure to
“fix” selection bias: once it has occurred, it cannot be
corrected in the analysis
 In general, how to avoid selection bias (easier said than
done!):
◦ ensure that the study design is appropriate
◦ establish appropriate selection criteria
◦ minimize non-participation and loss to follow-up

 Information bias is a systematic error in the measurement of
information on exposure or outcome

 Interviewer bias
◦ May result if the interviewer or data collector is aware of the
disease status (in a case-control study) or the exposure status
(in cohort and experimental studies)
 Recall bias
◦ May result if cases are more (or less) likely to recall an exposure
than controls, or exposed persons are more (or less) likely to
report outcomes than unexposed persons (in cohort and
experimental studies)

 Social desirability bias
◦ Occurs because subjects are systematically more likely to
provide a socially acceptable response
 Hawthorne effect
◦ Refers to the situation where subjects change their behavior in
response to being observed

 Some strategies for minimizing information bias, where
applicable:
◦ clearly and appropriately define study variables
◦ use more objective sources of information (e.g., physical
examination, medical records, laboratory results, death
certificates, etc.) rather than self-report
◦ provide initial and ongoing training of research staff
◦ use blinding (at least to the study hypothesis)
◦ use multiple observers

1) Study design: Minimize Bias
(more on this in upcoming lectures)
2) Study implementation:
Quality Assurance & Quality Control
3) Use “validated tools” (best if validated in your
population)

To determine whether HIV infection is a risk factor for
tuberculosis (TB), a cohort of 215 people infected with HIV (HIV+)
and a cohort of 298 people living in similar conditions but not
infected with HIV (HIV-) were followed for 2 years.
After two years, the following results were noted:

EXPOSURE   COHORT SIZE   No. OF TB CASES
HIV+       215           8
HIV-       298           1

                   DISEASE (TB)
                   +       -      TOTAL
EXPOSURE  HIV+     8       207    215
          HIV-     1       297    298
TOTAL              9       504    513

          INCIDENCE          RR     95% CI
HIV+      8/215 = 3.72%      11     1.4-88
HIV-      1/298 = 0.34%
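
The relative risk and its confidence interval can be reproduced from the 2x2 table. The sketch below assumes the usual log-scale (Wald) approximation for the 95% CI; the slide does not state which method was used, but this one gives matching figures.

```python
import math

def risk_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed, z=1.96):
    """Relative risk with an approximate CI computed on the log scale."""
    risk_exposed = cases_exposed / n_exposed
    risk_unexposed = cases_unexposed / n_unexposed
    rr = risk_exposed / risk_unexposed
    # Standard error of ln(RR) for cumulative incidence data
    se_log_rr = math.sqrt(1 / cases_exposed - 1 / n_exposed
                          + 1 / cases_unexposed - 1 / n_unexposed)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

rr, lower, upper = risk_ratio(8, 215, 1, 298)
print(f"RR = {rr:.1f}, 95% CI {lower:.1f} to {upper:.1f}")
# RR = 11.1, 95% CI 1.4 to 88.0
```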

41 subjects who developed TB selectively dropped out of the study:
 40 TB+, HIV+
 1 TB+, HIV-

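Results if the 41 lost cases had been counted (cases actually
observed are shown in parentheses):

                   TB
                   +         -      TOTAL
EXPOSURE  HIV+     48 (8)    167    215
          HIV-     2 (1)     296    298
TOTAL              50        463    513

          INCIDENCE           RR
HIV+      48/215 = 22.3%      33
HIV-      2/298 = 0.67%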
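
A short sketch comparing the relative risk actually observed with the one that would have been obtained under complete follow-up, using the counts from the example. The function name is illustrative.

```python
def risk_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Relative risk from cumulative incidence in two cohorts."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)

# Observed: TB cases among subjects who stayed in the study
observed = risk_ratio(8, 215, 1, 298)

# Complete follow-up: add the 40 HIV+ and 1 HIV- TB cases who dropped out
complete = risk_ratio(8 + 40, 215, 1 + 1, 298)

print(f"observed RR = {observed:.1f}, complete follow-up RR = {complete:.1f}")
# observed RR = 11.1, complete follow-up RR = 33.3
```

Because loss to follow-up was related to both exposure and outcome, the observed relative risk underestimates the association by a factor of about three.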
