2. Introduction to Sampling techniques
Subjective sample: a sample that is not selected at random.
Random sample: a sample chosen on the basis of chance or probability, for example by using random number tables.
Population parameters: characteristics of the population, e.g. the population mean and the population variance.
Sample statistics: characteristics of a sample, e.g. the sample mean, sample variance and sample proportion.
3. Sampling techniques
Sampling frame: a list of distinct and distinguishable units of a given population, used for selecting units from the population into a random sample.
Types of sampling frames include area (map) frames and list frames.
4. Sampling techniques
Classified into two:
Probability sampling techniques
Non-probability sampling techniques
Probability sampling techniques select units based on chance, while non-probability sampling techniques select units in ways that are not based on chance.
Reading assignment: read and make notes about non-probability sampling techniques.
5. Probability sampling techniques
These include:
Simple random sampling
Probability proportional to size (PPS) sampling
Stratified random sampling
Systematic sampling
Cluster sampling
Multi-stage sampling
6. Probability sampling techniques
Simple random sampling (SRS)
The simplest sampling design.
Its simplicity makes it the starting point for the study of sampling designs; all other designs are derived from it.
Description:
SRS is the sampling design in which every unit in the population has the same probability, 1/N, of being selected at each draw.
7. Probability sampling techniques
The probability of a unit being selected into the sample is also the same for each unit.
Sample selection: SRS can be done with replacement or without replacement.
SRS with replacement: under this method, a population unit can appear more than once in a sample, and the order in which units appear in the sample counts. Each selected unit is put back into the population after every draw, before the subsequent unit is selected.
8. Probability sampling techniques
SRS without replacement: under this arrangement, no unit of the population is allowed to appear more than once in any sample, and the order in which units appear in the sample does not count.
Here the units are not put back into the population before the selection of the subsequent unit at any draw.
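The two selection schemes can be sketched in Python as follows. This is a minimal illustration, not a survey tool: the population of 10 numbered units, the sample size of 5 and the `seed` argument are assumptions made for the example.

```python
import random

def srs_with_replacement(population, n, seed=None):
    """Draw n units; each drawn unit goes back before the next draw."""
    rng = random.Random(seed)
    return [rng.choice(population) for _ in range(n)]

def srs_without_replacement(population, n, seed=None):
    """Draw n distinct units; a drawn unit is not put back."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population: units labelled 1..10 (N = 10)
units = list(range(1, 11))
with_repl = srs_with_replacement(units, 5, seed=1)
without_repl = srs_without_replacement(units, 5, seed=1)

print("With replacement:   ", with_repl)     # repeats are possible
print("Without replacement:", without_repl)  # all units are distinct
```

With replacement, the same unit may show up more than once; without replacement, the sample size cannot exceed N.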
Stratified random sampling
Design description:
The population to be sampled is first divided into sub-populations that are as internally homogeneous as possible with respect to the variables under study.
9. Probability sampling techniques
We then select a random sample from each sub-group independently, according to some criterion.
N.B. The sub-groups are called strata.
Advantages of the design:
It is administratively convenient (i.e. in terms of sub-divisions/sub-populations).
When judiciously done, stratification can increase the precision of estimators.
10. Probability sampling techniques
It permits the use of different sampling methods in different strata.
Note: before stratification is done, it is necessary to:
Define the basis for stratification and the strata
Decide on the number of strata
Basis for stratification:
A stratification variable should be related to the variable(s) under study; ideally, the stratification variable should be the study variable itself.
11. Probability sampling techniques
However, if information about the study variable is not available a priori (before the survey is undertaken), in practice we use a variable that is highly correlated with the study variable(s) as the stratification variable.
Sample allocation to strata
The total sample is allocated to strata using one of the following three methods:
Proportional allocation
Optimum allocation
Arbitrary allocation
12. Probability sampling techniques
Proportional allocation:
With this method, the number of sampling units selected from each stratum is proportional to its size (demonstrate the procedure):
n_i = w_i × n, where w_i = N_i/N is the share of the population in stratum i and n is the total sample size.
Optimum allocation: reading assignment.
Arbitrary allocation: the sample is allocated to the strata arbitrarily, without using either of the two methods above.
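Proportional allocation, n_i = w_i × n, can be sketched as below. The stratum names and sizes are hypothetical, and the weight is taken as w_i = N_i/N (the stratum's share of the population), which is the usual reading of the formula.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of size n to strata in proportion to size.

    stratum_sizes maps each stratum name to its population size N_i.
    """
    N = sum(stratum_sizes.values())  # total population size
    return {name: n * N_i / N for name, N_i in stratum_sizes.items()}

# Hypothetical strata (N = 1000) and a total sample of n = 100
strata = {"urban": 500, "peri-urban": 300, "rural": 200}
allocation = proportional_allocation(strata, 100)

for name, n_i in allocation.items():
    print(name, n_i)
```

When the n_i do not come out as whole numbers, they have to be rounded so that they still add up to n.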
13. Probability sampling techniques
Cluster sampling: the sampling technique in which the population is subdivided into sub-populations (clusters) which are, ideally, internally heterogeneous with respect to the study characteristics.
Here the clusters are the sampling units, and once a cluster is selected, all units in that cluster are enumerated.
Clusters may be equal or unequal in size.
Advantages of the design:
Cluster sampling can lead to simple field instructions and training, leaving little room for error. Hence the design can facilitate better control of some aspects of non-sampling errors.
14. Probability sampling techniques
It is applicable even if a sampling frame of individual elements is not available.
It saves on time and costs, since once a cluster is selected, all its elements are enumerated.
15. Probability sampling techniques
How to select a cluster sample (a sample of clusters from the many clusters):
A cluster sample is selected from the sampling frame of clusters in the same way as a sample of elements is selected from the sampling frame of elements: by SRS, PPS, stratified random sampling or systematic sampling (illustrate on the board with clear diagrams).
16. Probability sampling techniques
Efficiency of cluster sampling:
If clusters are made up of randomly assembled elements, and so are heterogeneous in nature, then cluster sampling is as efficient as simple random sampling.
If clusters are made up of contiguous elements, the elements within a cluster will tend to be similar (homogeneous), which increases the variance of the estimates and reduces their precision and efficiency.
Design effect or design efficiency factor (Deff)
17. Probability sampling techniques
Deff: a measure of the relative efficiency of the design compared with what it would have been had the sample been selected by simple random sampling.
It is important to note that Deff is not a single quantity attached to a design but rather a set of quantities, one for each estimated variable, given a fixed sample size.
For a fixed sample size, the design effect of a cluster sample is given by the following relationship:
18. Probability sampling techniques
Deff(estimate) = var(estimate)_cluster / var(estimate)_SRS = 1 + ρ(M - 1)
where M is the cluster size, assumed to be equal for all clusters, and ρ is the intra-cluster correlation coefficient.
Interpretation of the Deff values
If Deff = 1, then var(estimate)_cluster = var(estimate)_SRS; that is, the cluster design is as efficient as if the sample had been selected by SRS.
19. Probability sampling techniques
If Deff < 1, the variance of the estimate from the cluster sample is less than it would be had the sample been selected by SRS, implying that the cluster sample is more efficient than an SRS of the same size.
If Deff > 1 (reading assignment for the group/class).
Suppose Deff = 1.4: for this variable, the variance of the cluster sample is 40% higher than that of an equivalent SRS.
20. Probability sampling techniques
Suppose Deff = 0.6: for this variable, the variance of the cluster sample is 40% lower than that of an equivalent SRS.
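A small sketch of the relationship Deff = 1 + ρ(M - 1) follows. The values ρ = 0.02 and M = 21 are hypothetical, chosen only so that the result matches the Deff = 1.4 case discussed above.

```python
def design_effect(rho, cluster_size):
    """Deff = 1 + rho * (M - 1) for clusters of equal size M."""
    return 1 + rho * (cluster_size - 1)

def design_effect_from_variances(var_cluster, var_srs):
    """Deff as the ratio of the two variances for the same sample size."""
    return var_cluster / var_srs

# Hypothetical values: intra-cluster correlation 0.02, clusters of 21 units
deff = design_effect(rho=0.02, cluster_size=21)
change_pct = (deff - 1) * 100  # % change in variance relative to SRS

print(f"Deff = {deff:.2f}; variance differs from SRS by {change_pct:+.0f}%")
```

Even a small ρ produces a noticeable design effect when clusters are large, because ρ is multiplied by M - 1.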
Remarks:
The use of a small number of large clusters reduces costs, but it generally increases sampling error.
A larger number of smaller clusters behaves in the opposite manner (i.e. it increases field costs but reduces sampling error).
21. Probability sampling techniques
Way forward:
Clusters that are too big should be broken into average-sized clusters before sampling is done.
Clusters that are too small should be regrouped into average-sized clusters before sampling is done.
22. Comparison between stratification
and clustering
No. | Strata | Clusters
1 | Each stratum is a fraction of the population | Each cluster is a fraction of the population
2 | Each stratum is investigated | Only a sample of clusters is investigated
3 | Within each stratum, the sample size is fixed in advance | The size of the sample varies if the size of clusters varies
4 | Higher precision than is achieved by SRS | Lower precision than is achieved by SRS
5 | Higher costs than in SRS | Lower costs than in SRS
6 | To improve the precision of estimates, strata should be internally homogeneous | To improve the precision of estimates, clusters should be internally heterogeneous
23. Probability sampling techniques
Systematic sampling method
It is a practical and convenient way of selecting units from ordered lists.
Design description:
Given a population of size N, from which we want to select a sample of size n:
We first get the sampling interval k = N/n
We then select a random start r such that 1 ≤ r ≤ k
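The two steps above can be sketched as follows. The remaining step used here, taking every k-th unit after the random start (positions r, r + k, r + 2k, ...), is the standard systematic rule and is an assumption as far as these notes go. The sketch also assumes N is a multiple of n; the list of 100 units is hypothetical.

```python
import random

def systematic_sample(units, n, seed=None):
    """Select n units from an ordered list by systematic sampling."""
    N = len(units)
    k = N // n                      # sampling interval k = N/n
    rng = random.Random(seed)
    r = rng.randint(1, k)           # random start, 1 <= r <= k
    # 1-based positions r, r + k, r + 2k, ... converted to 0-based indexes
    return [units[r - 1 + j * k] for j in range(n)]

# Hypothetical ordered list of N = 100 units, sample of n = 10 (k = 10)
sample = systematic_sample(list(range(1, 101)), 10, seed=0)
print(sample)
```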
26. Sources of data
To address most statistical problems we need data. Below are the sources of data.
Routinely kept records: these data arise from the keeping of records of day-to-day transactions and activities.
Surveys: if the data needed to answer a question are not available from routinely kept records, the logical source may be a survey.
Experiments: frequently, the data needed to answer a question are available only as the result of an experiment.
External sources: at times the data needed may already exist in published reports, commercially available data banks, or the research literature. In other words, we may find that someone else has already asked the same question, and the answer they …
27. Types of variables
Quantitative variable: a variable that can be measured in the usual sense, e.g. height, weight, age, blood pressure, etc. The measurements made on quantitative variables convey information regarding amount.
Qualitative variable: a variable that cannot be measured in the usual sense; instead we just come up with categories. Examples of such variables are a person's ethnic group, place of residence, religion, etc. The measurements made on qualitative variables convey information regarding attribute.
28. Types of variables
Random variable: a variable whose values cannot be exactly predicted in advance, e.g. the variable “adult age” is a random variable.
Discrete random variable: variables can be categorized further as discrete or continuous. A discrete variable is characterized by gaps or interruptions in the values that it can assume. The gaps indicate the absence of values between the particular values that the variable assumes.
29. Types of variables
E.g. the variable “daily number of admissions of patients to a general hospital” is an example of a discrete random variable.
Continuous variable: a variable that does not possess gaps or interruptions in the values it takes on, e.g. height, age, weight, etc.
30. Measurement and
measurement scales
Measurement: This may be defined as the
assignment of numbers to objects or events
according to a set of rules.
The different measurement scales include:
(1) The nominal scale
(2) The ordinal scale
(3) The interval scale
(4) The ratio scale
31. Measurement scales
The nominal scale: the lowest measurement scale. As the name implies, it consists of “naming” observations or classifying them into various mutually exclusive and collectively exhaustive categories. E.g. the practice of using numbers to distinguish among the various medical diagnoses constitutes measurement on the nominal scale.
Other examples include such dichotomies as male-female, well-sick, under 65 years of age-65 and over, child-adult, and married-not married.
32. Measurement scales
The ordinal scale: whenever observations are not only different from category to category, but can also be ranked according to some criterion, they are said to be measured on an ordinal scale. E.g. individuals can be classified according to socio-economic status as low, medium, or high. The intelligence of children may be categorised and ranked as above average, average, or below average. The implication is that if a finer breakdown were made, resulting in more categories, these too could be ordered in a similar manner.
33. Measurement scales
The function of the numbers assigned to ordinal data is to order (or rank) the observations from lowest to highest; hence the term “ordinal”.
The interval scale: a more sophisticated scale than the nominal and ordinal scales in that it is possible not only to order measurements, but also to know the distance between any two measurements.
34. Measurement scales
E.g. the distance between a measure of 20 and a measure of 30 is equal to the distance between a measure of 30 and a measure of 40. The ability to do this implies the use of a unit distance and a zero point, both of which are arbitrary.
35. Descriptive statistics
o Here we are to look at:
Measures of central tendency
Measures of dispersion
Application of measures of location and their
limitations
36. Measures of location and
dispersion
Descriptive measure: a measure that summarizes the data by means of a single value. A descriptive measure computed from a sample is called a statistic; a descriptive measure computed from a population is called a parameter.
Several types of descriptive measures can be computed from a set of data; here we limit our discussion to measures of central tendency and measures of dispersion.
37. Measures of location and
dispersion
Measures of central tendency: these convey information regarding the average value of a set of values.
The three most commonly used measures of central tendency are the mean, median, and mode.
The mean: also called the arithmetic mean, simple mean or average. It is used where numbers can be added, i.e. for numerical data measured on interval and ratio scales.
Mean: defined as the sum of all the observations (∑x) divided by the number of observations (n), i.e. mean = ∑x/n.
38. Measures of location and
dispersion
o E.g. consider the ages in months of 9 under-five children in a malaria clinic: 29, 20, 40, 32, 26, 28, 20, 20 and 40. Their mean age = ∑x/n = 255/9 = 28.3 months.
39. Measures of location and
dispersion
o Limitation of the mean: it is sensitive to extreme values. E.g. suppose we had an additional age of 60 months in the above example; the new mean = 315/10 = 31.5 months. The mean has increased by 3.2 months, so it is not a good estimate for skewed data.
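The mean calculation and its sensitivity to an extreme value can be checked with a few lines of Python, using the ages from the example; the variable names are illustrative only.

```python
from statistics import mean

# Ages in months of the 9 under-five children from the example
ages = [29, 20, 40, 32, 26, 28, 20, 20, 40]

mean_before = mean(ages)           # sum(x) / n = 255 / 9
mean_after = mean(ages + [60])     # one extreme value added: 315 / 10

print(f"Mean of 9 ages:        {mean_before:.1f} months")
print(f"Mean with 60 included: {mean_after:.1f} months")
print(f"Shift caused by one value: {mean_after - mean_before:.1f} months")
```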
o The median: the middle value of a set of observations arranged in either ascending or descending order.
o E.g. reconsider the ages in months of the under-five children in the malaria clinic, 29, 20, 40, 32, 26, 28, 20, 20, 40, 60, which can be arranged as 20, 20, 20, 26, 28, 29, 32, 40, 40, 60. With 10 observations the middle falls between the 5th and 6th values (28 and 29), so the median is 28.5 months.
40. Measures of location and
dispersion
Steps to identifying the median
Arrange the series in ascending or descending order
Find the middle rank of the observations using the formula: mid rank = (n + 1)/2
If n is odd, the middle rank falls on an observation; if n is even, the middle rank falls between two observations, and the median is the average of those two
Identify the value of the median
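The steps above can be sketched as a small function. It uses the malaria clinic ages from the earlier example, once with 9 observations (n odd) and once with the extra age of 60 months (n even); the function name is illustrative.

```python
def median_by_rank(values):
    """Median via the middle rank (n + 1) / 2 of the ordered values."""
    ordered = sorted(values)            # step 1: arrange in ascending order
    n = len(ordered)
    mid_rank = (n + 1) / 2              # step 2: middle rank (1-based)
    if n % 2 == 1:
        # n odd: the middle rank falls on an observation
        return ordered[int(mid_rank) - 1]
    # n even: the middle rank falls between two observations; average them
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

ages_9 = [29, 20, 40, 32, 26, 28, 20, 20, 40]
ages_10 = ages_9 + [60]

median_9 = median_by_rank(ages_9)
median_10 = median_by_rank(ages_10)
print(median_9, median_10)
```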
41. Measures of location and
dispersion
The mode: the most common value in the list of observations. E.g. for the ages (in months) of 8 children, 20, 20, 20, 26, 28, 32, 40, 40, the mode is 20 because it occurs 3 times, which is the highest number of repetitions.
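Counting repetitions to find the mode can be sketched with `collections.Counter`, using the eight ages from the example above.

```python
from collections import Counter

# Ages in months of the 8 children from the example
ages = [20, 20, 20, 26, 28, 32, 40, 40]

counts = Counter(ages)                             # value -> repetitions
mode_value, mode_count = counts.most_common(1)[0]  # most frequent value

print(f"Mode = {mode_value} (occurs {mode_count} times)")
```

If two values tie for the highest count, `most_common(1)` reports only one of them, so the full `counts` should be inspected for multimodal data.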
The geometric mean: Reading assignment!!
42. Measures of location and
dispersion(exercise)
Determine the median height in meters of 7 women attending an ANC clinic if their heights are as follows: 1.6, 1.5, 1.4, 1.6, 1.5, 1.7, 1.55
Determine the median height in meters of 8 women attending an ANC clinic if their heights are as follows: 1.72, 1.6, 1.5, 1.4, 1.6, 1.5, 1.7, 1.55
Calculate the geometric mean of the number of patients reporting obesity-related complications in a hospital if the records are as follows: 2, 2, 4, 8, 8, 16, 16, 32, 64.
43. Measures of location and dispersion
Relationship between the mean, mode and median in symmetrical and asymmetrical distributions:
If mean = mode = median for a data set, then the data are likely to have come from a symmetrical distribution / symmetrically distributed population.
If mode > median > mean, then the data set is likely to have come from a population that is skewed to the left (negatively skewed).
44. Measures of location and dispersion
• If mean > median > mode, the data set is likely to have come from a distribution that is skewed to the right (positively skewed).
Exercise:
• Find out the nature of the distribution of the population of kids from which a sample was taken, given that the sampled ages in months were: 29, 20, 40, 32, 26, 28, 20, 20, 40
45. Measures of location and dispersion
Measures of dispersion:
Definition: a measure of dispersion is a measure that shows the extent of the spread of the data.
A measure of dispersion is a real number that is zero if all the data are identical and increases as the data become more diverse.
• The commonest measures of dispersion are the range, the standard deviation, and the coefficient of variation.
46. Measures of location and dispersion
Range: the difference between the highest (maximum) and the lowest (minimum) values in a set of observations.
Steps to determining the range:
Arrange the data in ascending order
Identify the minimum and maximum value
Take their difference to get the range
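As a short sketch of these steps, the range of the nine malaria clinic ages used earlier in these notes can be computed as follows.

```python
# Ages in months of the 9 under-five children from the earlier example
ages = [29, 20, 40, 32, 26, 28, 20, 20, 40]

ordered = sorted(ages)                        # arrange in ascending order
minimum, maximum = ordered[0], ordered[-1]    # identify min and max
data_range = maximum - minimum                # take their difference

print(f"min = {minimum}, max = {maximum}, range = {data_range} months")
```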
47. Measures of location and dispersion
Percentiles, quartiles and inter-quartile range: left as a reading assignment. Please do it!
Variance and standard deviation:
Steps to calculating the variance:
Calculate the sample mean (x̄)
Compute the difference between each observation and the mean
Square the differences
48. Measures of location and dispersion
Get the sum of the squared deviations
Divide the result by the degrees of freedom (n - 1) to get the variance; taking the square root of the variance gives the standard deviation
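The steps above translate directly into code. The sketch below applies them to the nine malaria clinic ages from the mean example rather than to any assignment data; the function name is illustrative.

```python
import math

def sample_variance(values):
    """Sample variance following the steps: mean, deviations, squares, sum, / (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n                      # sample mean
    deviations = [x - x_bar for x in values]     # observation minus mean
    squared = [d ** 2 for d in deviations]       # squared differences
    return sum(squared) / (n - 1)                # divide by degrees of freedom

ages = [29, 20, 40, 32, 26, 28, 20, 20, 40]
variance = sample_variance(ages)
std_dev = math.sqrt(variance)                    # square root of the variance

print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```

Python's built-in `statistics.variance` and `statistics.stdev` use the same n - 1 divisor and can serve as a cross-check.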
Assignment:
• The data below show the number of outpatients in 9 clinics. Use them to calculate the variance and standard deviation of this data set:
49. Measures of location and dispersion
Clinic Number of outpatients
Kawempe 20
Naguru 30
Komamboga 32
Nakawa 40
Rubaga 44
Makindye 60
Makerere 63
Bwaise 70
Kalerewe 80
50. Measures of location and dispersion
Coefficient of variation (CV): a measure of relative spread, given by:
CV = (standard deviation / mean) × 100
• It is more relevant when we want to compare the spread in two samples whose units of measurement are not necessarily the same, e.g. comparing blood pressure and pulse pressure (the difference between systolic and diastolic BP).
• The CV is therefore used to compare variation between groups whose units of measurement are not the same.
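A brief sketch of the CV follows. The weights are hypothetical values made up for illustration; the same weights are expressed in kilograms and in grams to show that the CV, unlike the standard deviation, does not depend on the unit of measurement.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = (sample standard deviation / mean) * 100."""
    return stdev(values) / mean(values) * 100

# Hypothetical weights of five people, in kilograms and in grams
weights_kg = [60, 65, 70, 75, 80]
weights_g = [w * 1000 for w in weights_kg]

cv_kg = coefficient_of_variation(weights_kg)
cv_g = coefficient_of_variation(weights_g)

print(f"SD in kg: {stdev(weights_kg):.2f}, SD in g: {stdev(weights_g):.2f}")
print(f"CV in kg: {cv_kg:.2f}%, CV in g: {cv_g:.2f}%")
```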
51. Methods of data presentation
The ordered array
Frequency distribution
The graphical method
52. Methods of data presentation
Ordered array: here we present the data in a straightforward, simple way. E.g. suppose 12 students in a biostatistics course have each achieved a score in a knowledge test; we can simply present these data as an ordered array: 61, 69, 72, 76, 78, 83, 85, 85, 86, 88, 93 and 97.
o This is suitable for small data sets, but it is problematic if the data set is big, as it is:
Too detailed
Too broad
Difficult to interpret
53. Methods of data presentation
• Frequency tables: here we subdivide numerical data into classes, e.g. age groups, and indicate the counts in each group.
• Graphical methods of data presentation: here we have the following:
Histogram: histograms mainly show area. The continuous variable of interest is on the x-axis, usually in grouped form based on ranges; the size of the range should be uniform. The frequency of occurrence is on the y-axis.
54. Methods of data presentation
o Frequency polygon: derived from a histogram; a line is drawn to indicate the frequencies. Frequency polygons are useful when comparing two distributions on the same graph.
o Line graph: indicates the variation of one discrete or continuous variable with another.
o Scatter plots: similar to line graphs; they indicate the variation of one continuous variable with another.
55. Methods of data presentation
Stem and leaf plots: in such plots, each data value is split into a “stem” and a “leaf”. The leading (summarizing) digits of the display constitute the stem, while the more varying digits represent the leaf.
Box and whisker plots: the data are divided into a box and whiskers. The plot is useful for comparing different sets of sampled data and for gauging their spread about the centre of the data. To construct a box and whisker plot, we do the following:
56. Methods of data presentation
We arrange the data set from the smallest to the largest value
We draw a uniform box over the inter-quartile range
We draw whiskers outward from the box, to cover the parts of the data that are outside the inter-quartile range
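The quantities needed for these steps can be sketched as below, using the 12 test scores from the ordered-array example. The quartiles come from Python's `statistics.quantiles` with its default method; other textbooks and packages use slightly different quartile conventions, so treat the exact box edges as one reasonable choice rather than the only one.

```python
from statistics import quantiles

# The 12 knowledge-test scores from the ordered-array example
scores = [61, 69, 72, 76, 78, 83, 85, 85, 86, 88, 93, 97]

ordered = sorted(scores)                  # arrange from smallest to largest
q1, median, q3 = quantiles(ordered, n=4)  # quartiles: box edges and middle line
iqr = q3 - q1                             # the box spans the inter-quartile range

# Whiskers run from the box edges out to the smallest and largest values
lower_whisker = (ordered[0], q1)
upper_whisker = (q3, ordered[-1])

print(f"box: {q1} to {q3} (IQR = {iqr}), median line at {median}")
print(f"whiskers: {lower_whisker} and {upper_whisker}")
```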
Assignment: please read about and cite two examples of each of the following:
Frequency tables
Histogram
Frequency polygon
57. Methods of data presentation
Reading assignment (continued):
Line graph
Scatter plots
Stem and leaf plots
Box and whisker plots
59. Probability theory
See the separate slides and run through them quickly, but make sure the class understands. Probability theory comes next, in a separate slide deck; follow it closely.
Once that section has ended, give an assignment.
Assignment: develop it and leave it behind.