This document discusses various study designs and methodologies used in quantitative research. It begins by outlining the objectives and possible subsections of a methodology section. It then discusses study area, period, and population. The document explains how to choose study designs and describes descriptive, case-control, cohort, experimental, and mixed methods designs. It provides details on variables, sampling techniques, data collection tools, and analyses. Overall, the document serves as a guide for planning and conducting quantitative research studies.
The most ambitious definition of health is that proposed by WHO in 1948: “health is a state of complete physical, mental, and social well-being and not merely the absence of disease or infirmity.” However, practical definitions of health and disease are needed in epidemiology, which concentrates on aspects of health that are easily measurable and amenable to improvement. Definitions of health states used by epidemiologists therefore tend to be simple.
These slides contain B.Pharm 8th Sem Biostatistics and Research Methodology, Unit-3.
Topics covered: Designing the methodology, Sample size determination and power of a study, Report writing and presentation of data, Protocol, Cohort studies, Observational studies, Experimental studies, Designing clinical trials, various phases.
2. Objectives
At the end of this section, the students will be able to:
• Differentiate the study designs relevant to a specific research question
• Identify the appropriate variables for a specific study
• Explain the study population, sample size calculation and sampling techniques for quantitative studies
3. Methodology
Possible sub-sections of the methodology:
• The study design
• Setting/Area
• Population of the study
• Sample size and sampling strategies
• Variables
4. Study area
• Location, physical features (climate, altitude...), population size and composition
• Infrastructure: education, health, communication…
• Economy
Study period
• Time required to conduct the study
5. Choosing study designs
A study design is a specific plan or protocol for
conducting the study, which allows the investigator to
translate the conceptual hypothesis into an operational
one.
• A study design is the process that guides researchers on
how to collect, analyze and interpret observations.
• It is a logical model that guides the investigator in the
various stages of the research.
6. How to Choose a Research Design
• Does it adequately test the hypotheses?
• Hypotheses determine participants, variables measured & data analysis methods
• Are results generalizable?
• Replicate to other samples and other contexts
• Random selection of participants
• Does it identify and control extraneous factors?
• Eliminate alternative explanations for results to increase confidence in cause-effect conclusions (internal validity)
• Control depends on type of design
7. How to Choose a Research Design…
• Can the hypothesis be rejected or retained via
statistical means? (statistical conclusion validity)
• Need reliable measures
• Need large enough sample to detect true effect
• Is the design efficient in using available resources?
• Optimal balance between research design, time, resources
and researcher expertise
8. How to choose: problems call for specific approaches…
E.g. if the problem calls for:
a) the identification of factors that influence an outcome
b) the utility of an intervention, or
c) knowing the prevalence of diseases
then a quantitative approach is best.
• If a concept or phenomenon needs to be understood because little research has been done on it, then a qualitative approach is preferred
• A mixed methods design is useful when either the quantitative or qualitative approach by itself is inadequate to best understand a research problem
11. Features of Descriptive Study
• Studies the occurrence of diseases with respect to time, place and person.
• Useful for health managers to allocate resources and to plan effective prevention programs.
• Useful to generate epidemiological hypotheses in the search for disease risk factors.
12. Features of Descriptive Study…
• Not aimed specifically to test a hypothesis
• No attempt to gather data on controls
• Inexpensive and less time-consuming: can use
information collected routinely.
• Most common type of epidemiological study in the
medical literature.
13. Case Report and Case Series
• Documentation of an unusual medical occurrence with a detailed description
• First clue in the identification of a new disease or an adverse effect of exposure
• The profile of a single patient is reported in detail by one or more clinicians
14. Case Series
• A collection of individual case reports occurring within a fairly short period of time.
• An individual case report expanded to include a number of patients with a given disease
• Helps in identifying the beginning or presence of an epidemic.
• Helps in hypothesis formulation
• Lacks a comparison group
15. Case report/series: Limitations
–Lack of denominator to calculate rates of disease
–Lack of comparison group
–No selection of appropriate study populations
–Sampling variations
– No sampling employed, emerging cases are
reported.
16. A cross-sectional study (survey)
Snapshot of the health status of populations at a certain
point in time.
Compare prevalence of disease in persons with and
without the exposure of interest
Cross-sectional studies must be done on representative
samples of the population.
18. Advantage of Cross-sectional
• Provides prevalence estimates of exposure and
disease.
• Easier to perform than studies that require follow-up (hence relatively inexpensive).
• Can evaluate multiple risk (and protective) factors
and health outcomes at the same point in time
19. Advantage …
• May identify groups of persons at high or low risk
of disease
• Can be used to generate hypotheses about
associations between predictive factors and disease
outcomes
20. Limitation of cross-sectional
The temporal (time) sequence between exposure and disease cannot be established
* i.e. the chicken-or-egg dilemma.
Example: In a study of knowledge of modern contraceptives, did the women know about the method and then start to use it, or did they learn about it because they were using it?
21. Analytic Studies
• Focus on identifying risk factors
• Always use comparison group
• Test hypotheses
• Relatively costly
• Less often used than descriptive studies
22. Case-control Studies
A case-control study is one in which persons with a condition ("cases") and suitable comparison subjects ("controls") are identified, and then the two groups are compared with respect to prior exposure.
– Subjects are sampled by their outcome status.
Exposure status is assessed retrospectively
Relatively cheap in terms of time and cost
The measure of association is the odds ratio
24. Case control: numbers and ratio
• A single control group is optimal in most studies. Conditions for multiple control groups:
– when a single control group is considered not to be appropriate.
– when the selected group has a specific deficiency that could be avoided by inclusion of another control group.
Control-case ratio
• The efficiency of a study increases with the control-case ratio only up to about 4:1.
27. Cohort Studies
• Cohort studies are epidemiologic designs that
identify comparison groups according to their
exposure status.
• Disease free subjects are sampled by exposure
status
29. Characteristics of a Cohort Study
• Groups of individuals defined on the basis of
presence/absence of exposure to the suspected risk
factor
• All potential subjects must be initially free of the
disease under investigation
• Eligible participants are then followed over time to
assess occurrence of disease
30. Types of Cohort Studies
•Classification of cohort studies depends on the
temporal relationship between the initiation of the
study and the occurrence of the outcome
Prospective Cohort Study
Retrospective Cohort Study
31. Prospective Cohort Studies
•The investigator collects information on the exposure status of the
cohort members at the time the study begins, and identifies new
cases of disease (or deaths) from that time forward
•The exposures may have occurred at the beginning of study BUT
the outcome has certainly not yet occurred.
•After the selection of the cohort, participants must be followed over
time to assess incidence of disease.
E.g. identify oral contraceptive users and non-users; follow them in the years to come and assess heart disease status.
33. Retrospective Cohort Studies
• Both the exposures and the outcomes have already
occurred when the study is initiated.
• Exposure status is established from information
recorded at some time in the past, and disease
incidence (or mortality) is determined from then until
the present.
34. Retrospective Cohort Studies…
• Either interview the participants, or use medical records,
to determine their subsequent history from that point to the
present in terms of developing outcome.
• Retrospective: deals with past events; can be done
quickly
• Cohort: the comparison is made between users and non-users of OCs (oral contraceptives)
35. Factors in Selection of Exposed Group
• Frequency of the exposure of interest: ability of
obtaining sufficient exposed individuals in a
reasonable period of time
• Ability to obtain complete and accurate exposure
and outcome information on all study subjects
37. Advantages of Cohort Design
• Valuable when the exposure is rare
• Allows direct measurement of risk
• Can elucidate temporal relationship
• Minimize bias in ascertainment of exposure
• Can examine multiple effects of a single
exposure
38. Disadvantages of Cohort Design
• Not suitable for rare diseases
• Cannot be applied to diseases with a long incubation period
• Costly in terms of time and resources
• Obtaining complete information for all comparison groups is difficult
• Loss to follow-up
44. Classification of Intervention Studies:
• Based on population
• Clinical trial - usually performed in clinical setting and the
subjects are patients.
• Field trial- used in testing medicine for preventive purpose and
the subjects are healthy people.
• Community trial - a field trial in which the unit of study is a group of people/community.
45. Interventions that Can Be Evaluated
• New drugs and new treatment of diseases
• New medical and health care technology
• New methods of primary prevention
• New programs for screening
• New ways of organizing and delivering health services
• New behavioral intervention programs
46. Ethical Considerations in experimental studies
• Risks vs benefits
• Comparison: Standard care vs placebo
• Ethical approval
• Informed consent & confidentiality
• Freedom to withdraw
• Duty of care
• Stopping/Monitoring
• Reporting findings
• Quality: ‘Poor’ quality research is unethical!
47. Experimental studies: Advantages
• The major advantage of experimental studies lies in the strength of the causal inference that can be made.
– it is very difficult to make causal inferences based on observational studies.
• Experimental studies offer the best design for controlling confounding variables.
• Gold standard for epidemiologic research
– Randomized Controlled Trials (RCTs)
48. The Quality of “Gold Standard"
• Randomization
• Blinding
• Use of Placebo
49. Assignment
1. Review three articles for one specific study design. From the articles you reviewed:
• Write the characteristics of the study design in those articles
• Strengths and limitations of the study designs reviewed
• Which study design should be repeated to address the limitations of the designs used, and how would it do so?
• How do you calculate the sample size for that study design?
2. How do you develop data collection tools?
• What are the sources of the tool?
• How do you assess whether it is measuring what it intended to measure or not?
• For question 1, write the full name and DOI (Digital Object Identifier) of each article
50. Sampling Methods
♣ Sampling involves the selection of a number of study
units from a defined study population.
♣ The population is too large for us to consider collecting
information from all its members.
♣ Instead we select a sample of individuals hoping that the
sample is representative of the population.
51. Sampling…
• Importance of sampling:
- To save time and money
- Measurements more accurate on samples than entire
population (census)
Defining the population:
- Target population
- Study population
52. Sampling…
When taking a sample, we will be confronted with the
following questions:
• What is the group of people from which we want to draw a
sample?
• How many people do we need in our sample?
• How will these people be selected?
• What are the errors to be confronted with when taking a
random sample?
53. Definitions of Population
• Target population (reference population or source
population): Is that population about which an investigator
wishes to draw a conclusion.
• Study population: Population from which the sample
actually is drawn
• Sampling unit: The unit of selection in the sampling process. For example, in a sample of districts, the sampling unit is a district; in a sample of persons, a person; etc.
54. Definitions of Population…
• Study unit: The unit on which the observations will be collected. For example, persons in a study of disease prevalence, or households in a study of family size.
N.B. The sampling unit is not necessarily the same as the study unit.
• Sampling frame: The list of units from which the sample is to be selected. The existence of an adequate and up-to-date sampling frame often defines the study population.
55. What is a defined population?
♣ The problem of obtaining a sample which is
representative of a larger population needs special
attention.
♣ The population under consideration should be clearly
defined.
♣ It is only after having such a clearly defined population (i.e., in terms of geographical area, type of study subjects, etc.) that the selection of the random sample can take place.
♣ What are the main reasons for the necessity of such “clear definitions of the population”?
56. How are the study subjects selected?
♣ An important issue influencing the choice of the most
appropriate sampling method is whether a sampling
frame is available (can be maintained), that is, a listing
of all the units that compose the population.
♣ Two broad areas: Non-probability sampling method
and probability sampling method
57. Sampling methods…
♣ Non-probability sampling methods - used when a
sampling frame does not exist
Examples:
• Convenience sampling
• Quota sampling
• These sampling methods do not claim to be
representative of the entire population.
When do you use these techniques?
58. b) Probability sampling methods
♣ They involve random selection procedures to ensure
that each unit of the sample is chosen on the basis of
chance
All units of the population should have an equal or at
least a known chance of being included in the sample.
♣ Sample findings can be generalized
59. b) Probability sampling methods
1. Simple Random Sampling
2. Systematic Sampling
3. Stratified sampling
4. Cluster sampling (all selected clusters will be
considered –take care of clustering effect)
5. Multi-Stage Sampling (consider the design effect)
60. Simple random sampling (SRS)
• This is the most basic scheme of random sampling.
• Each unit in the sampling frame has an equal
chance of being selected
• Representativeness of the sample is ensured.
• However, it is costly to conduct SRS.
• Moreover, minority subgroups of interest in the population may not be present in the sample in sufficient numbers for study.
61. Simple random sampling (SRS)…
To select a simple random sample you need to:
Make a numbered list of all the units in the population from
which you want to draw a sample.
Each unit on the list should be numbered in sequence from 1
to N (where N is the size of the population)
Decide on the size of the sample
Select the required number of study units, using a “lottery”
method or a table of random numbers.
62. Simple random sampling…
Lottery method : for a small population it may be possible to use the
“lottery” method: each unit in the population is represented by a
slip of paper, these are put in a box and mixed, and a sample of the
required size is drawn from the box.
Table of random numbers: if there are many units, however, the
above technique soon becomes laborious. Selection of the units is
greatly facilitated and made more accurate by using a set of random
numbers in which a large number of digits is set out in random
order. The property of a table of random numbers is that, whichever
way it is read, vertically in columns or horizontally in rows, the
order of the digits is random.
Computer-generated random list: random selection can also be done with statistical software or a random number generator.
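The lottery and random-number-table procedures above can be reproduced in software. The sketch below assumes a numbered frame of 1..N, as described in the SRS steps:

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every unit in the
    frame has an equal chance of selection (the 'lottery')."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

frame = list(range(1, 1201))  # units numbered 1..N, with N = 1200
sample = simple_random_sample(frame, 100, seed=42)
print(len(sample))  # 100
```

The seed is only for reproducibility in this illustration; in a real study the selection should be documented but not hand-picked.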
63. Systematic Sampling
• Individuals are chosen at regular intervals (for example, every k-th unit) from the sampling frame.
• The first unit to be selected is taken at random from among the first k units.
• For example, a systematic sample is to be selected from 1200 students of a school.
• The sample size is decided to be 100. The sampling fraction is: 100/1200 = 1/12.
64. Systematic Sampling…
• The number of the first student to be included in the
sample is chosen randomly, for example by blindly
picking one out of twelve pieces of paper, numbered 1 to
12.
• If number 6 is picked, every twelfth student will be
included in the sample, starting with student number 6,
until 100 students are selected.
• The numbers selected would be 6,18,30,42,etc.
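The worked example above (N = 1200, n = 100, k = 12, random start 6) can be sketched as:

```python
def systematic_sample(N, n, start):
    """Select every k-th unit, k = N // n, beginning at the
    randomly chosen 'start' (1 <= start <= k)."""
    k = N // n
    return [start + i * k for i in range(n)]

sample = systematic_sample(1200, 100, start=6)
print(sample[:4])  # [6, 18, 30, 42]
```

In practice the start would come from a random draw between 1 and k, not a fixed value.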
65. Merits
• Systematic sampling is usually less time consuming and
easier to perform than simple random sampling. It
provides a good approximation to SRS.
• Unlike SRS, systematic sampling can be conducted
without a sampling frame (useful in some situations
where a sampling frame is not readily available). E.g. In
patients attending a health center, where it is not possible
to predict in advance who will be attending.
66. Demerits
• If there is any sort of cyclic pattern in the ordering
of the subjects which coincides with the sampling
interval, the sample will not be representative of the
population.
67. Stratified Sampling
• Appropriate when the distribution of the characteristic to be
studied is strongly affected by certain variable (heterogeneous
population).
• The population is first divided into groups (strata) according to a
characteristic of interest (eg., geographic area, prevalence of
disease, etc.)
• A separate sample is taken independently from each stratum, by
simple random or systematic sampling.
• Proportional allocation - the same sampling fraction is used for each stratum.
• Non-proportional allocation - a different sampling fraction is used for each stratum, or the strata are unequal in size and a fixed number of units is selected from each stratum.
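Proportional allocation applies the same sampling fraction to every stratum. A minimal sketch with hypothetical stratum sizes:

```python
def proportional_allocation(strata_sizes, total_sample):
    """Allocate the sample so each stratum gets
    total_sample * (stratum size / population size) units."""
    N = sum(strata_sizes)
    return [round(total_sample * size / N) for size in strata_sizes]

# Hypothetical strata of 600, 300 and 100 units; total sample of 100
print(proportional_allocation([600, 300, 100], 100))  # [60, 30, 10]
```

Rounding may make the allocations sum to slightly more or less than the intended total; in practice the remainder is adjusted in one stratum.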
68. Stratified Sampling:
• Merit
- The representativeness of the sample is improved. That
is, adequate representation of minority subgroups of
interest can be ensured by stratification and by varying
the sampling fraction between strata as required.
• Demerit
- A sampling frame has to be prepared separately for each stratum.
69. Cluster sampling
• The selection of groups of study units (clusters) instead of the
selection of study units individually
• The sampling unit is a cluster, and the sampling frame is a list of
these clusters.
• Procedure - the reference population (homogeneous) is divided
into clusters.
• These clusters are often geographic units (e.g. districts, villages,
etc.).
- A sample of such clusters is selected.
- All the units in the selected clusters are studied.
• It is preferable to select a large number of small clusters rather than a small number of large clusters.
70. Cluster sampling…
• Merit - A list of all the individual study units in the
reference population is not required. It is sufficient to
have a list of clusters.
• Demerit - It is based on the assumption that the
characteristic to be studied is uniformly distributed
throughout the reference population, which may not
always be the case.
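The cluster procedure (select whole clusters, then study all units within them) can be sketched as follows; the village frame here is hypothetical:

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters; every unit inside a
    selected cluster enters the study."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

# Hypothetical frame: 5 villages, each listing its household IDs
villages = {f"v{i}": list(range(i * 10, (i + 1) * 10)) for i in range(5)}
units = cluster_sample(villages, 2, seed=1)
print(len(units))  # 20
```

Note that only the list of clusters is needed as a frame, which is exactly the merit described above.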
71. Multi-stage sampling
• This method is appropriate when the reference population is large
and widely scattered.
• Selection is done in stages until the final sampling units (e.g., households or persons) are arrived at.
• The primary sampling unit (PSU) is the sampling unit (usually
large size) in the first sampling stage.
• The secondary sampling unit (SSU) is the sampling unit in the
second sampling stage. etc.
• Example - The PSUs could be kebeles and the SSUs could be
households.
72. Multi-stage sampling
• Merit - Cuts the cost of preparing sampling frame
• Demerit - Sampling error is increased compared with a
simple random sample.
Multistage sampling gives less precise estimates than
simple random sampling for the same sample size, but
the reduction in cost usually far outweighs this, and
allows for a larger sample size.
That is, a design effect needs to be considered.
73. What are the errors to be confronted with when
taking a random sample?
♣ When we take a sample, our results will not exactly equal the correct results for the whole population. That is, our results will be subject to error. This error has two components:
a) Sampling error (i.e., random error)
b) Non-sampling error (i.e., bias)
74. Sampling error (i.e., random error)
•Random error consists of random deviations from
the true value, which can occur in any direction.
• Sampling error (random error) can be
minimized by increasing the size of the sample
75. Non Sampling error (i.e., bias)
• Bias consists of systematic deviations from the true
value, always in the same direction
• It is possible to eliminate or reduce the non-sampling
error (bias) by careful design of the sampling
procedure and by taking care of the errors that may
arise during data analysis.
76. 2. Nonprobability Sampling
• Here, the sample is less likely to be representative of
the population, thus it is difficult to extrapolate from
the sample to the population.
• It is used when there is no sampling frame or when it is impossible to conduct probability sampling due to economic and feasibility factors.
77. Nonprobability Sampling Cont..
• Judgmental or Purposive Sampling: The researcher chooses the sample based on who he/she thinks would be appropriate for the study.
• Convenience Sampling: The selection of units from the population is based on availability and/or accessibility.
• Quota Sampling: It starts with systematically setting “quotas” to represent subgroups of a population. Data are then collected to meet the predefined quotas.
• Snowball Sampling: The researcher begins by identifying someone who meets the inclusion criteria of the study. That subject is then asked to recommend others he/she knows who also meet the criteria.
79. How do we ensure enough precision to make good
programmatic decisions after the results are analysed?
• We calculate the sample size we will need in our study before we start
collecting data
• If a study does not calculate the sample size beforehand, it may waste resources gathering more data than necessary, or gather too few data to answer the question
• Supervision and training are more difficult with more teams and
study workers
• Not calculating sample size can lead to a greater chance of having
bias in the results.
80. Sample size determination for study
• Sample size: the number of study subjects required to estimate a parameter in a population
• Any sample size will give you an estimate of the population parameter
• However, the larger the sample, the greater the precision
• Always calculate the sample size that gives the required precision!
• Too large a sample size: too expensive and time-consuming
• Too small a sample size: inadequate precision to give a good estimate or to show a difference
81. Sample size depends on
1. Estimated variability (denoted by P)
2. The precision, or margin of error (denoted by d)
3. The sampling design (clustering; design effect) (denoted by g)
4. Size of the population
5. Feasibility (cost)
6. Confidence level (Z value of certainty)
7. Non-response rate
82. Sampling variability and precision
If the whole population is examined, then there is no uncertainty.
If a sample is taken, then sampling variability is introduced – the level of precision.
An estimate based on a sample depends on chance!
It depends on which of the numerous different samples of study subjects is selected.
83. 1. Estimated variability
• This is the estimated proportion of the event in the population as
estimated from similar population
• It is usually taken from a similar study in the literature
• It can also be obtained from a pilot study
• In the absence of a way to estimate this variability, the maximum
variability of 50% is used.
• If there are two similar studies estimating the proportion, it is
safer to take the one nearer to 50%.
84. 2. What is precision (margin of error)?
• It is the maximum distance, at a given confidence level, by which the
sample estimate is expected to differ from the true population value.
• It is expressed through the standard error of the mean or of the
proportion of the sample
• The standard error is a function of the standard deviation divided by
the square root of the sample size
• Sample size is inversely related to the margin of error

SE (mean) = SD / √n

SE (proportion) = √( p(1-p) / n )
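The two standard-error formulas above can be sketched in Python (a minimal illustration; the function names are my own, not from the slides):

```python
import math

def se_mean(sd, n):
    """Standard error of a sample mean: SD / sqrt(n)."""
    return sd / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

# Quadrupling n halves the standard error, i.e. doubles the precision:
# se_mean(10, 25) -> 2.0, while se_mean(10, 100) -> 1.0
```

This is why the slides stress that sample size is inversely related to the margin of error: the error shrinks with the square root of n.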
85. Cont’d …
• A measure of how close an estimate is to the true
value of a population parameter.
• For example,
• a prevalence of 10% from a sample of size 20 gives a 95%
confidence interval of 1% to 31%, which is not very precise or
informative.
• a prevalence of 10% from a sample of size 400 gives a 95%
confidence interval of 7% to 13%, which is sufficiently accurate.
• Sample size calculations help to avoid this
situation.
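The effect of sample size on interval width can be illustrated with the normal-approximation (Wald) interval sketched below. Note this is an assumption on my part: the slide's n = 20 interval (1% to 31%) appears to come from an exact binomial method, which differs from this approximation in small samples; the n = 400 case matches.

```python
import math

def wald_ci(p, n, z=1.96):
    """Approximate 95% CI for a proportion (normal approximation)."""
    se = math.sqrt(p * (1 - p) / n)
    return (p - z * se, p + z * se)

# n = 400: roughly 7% to 13%, precise enough to act on
lo, hi = wald_ci(0.10, 400)
# n = 20: a much wider, far less informative interval
lo20, hi20 = wald_ci(0.10, 20)
```

Running both cases makes the sample-size argument concrete: the n = 20 interval is more than four times as wide as the n = 400 interval.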
86. Cont’d…
• Precision is set by the researcher according to how closely the
true population proportion needs to be estimated
• A precision of 1-3% is chosen when resources allow and high
precision is required
• It is usually set between 1% and 5% when the variability
estimate is expressed as a percentage
87. 3. The sampling method (clustering)
• Sample size is also related to the sampling method
• If cluster sampling is used, the random error produced will
be larger, which lowers the power
• People within a cluster are homogeneous, while clusters are more
heterogeneous between one another
• This homogeneity within clusters produces higher
imprecision
• Therefore, taking a higher number of clusters with fewer
individuals in each cluster lowers the design effect
88. Cont’d…
• The design effect is the ratio of the variance produced by cluster
sampling to the variance produced by SRS
• It commonly ranges between 1.5 and 10.
• If a correction is not made, the random error is inflated and a
deviation from the true estimate can result
• In the case of cluster sampling, the sample size is multiplied
by the estimated design effect
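The design-effect correction described above amounts to a single multiplication; a minimal sketch (function name is my own):

```python
import math

def adjust_for_clustering(n_srs, deff):
    """Inflate a simple-random-sampling sample size by the design effect (DEFF)."""
    return math.ceil(n_srs * deff)

# With DEFF = 2, a design that needs 384 subjects under SRS
# needs 768 subjects under cluster sampling.
```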
89. 4. Size of a population
• Sample size for finite population needs finite
population correction
• Finite population is considered when the study is done
in a sub-population having no other reference
population
• The correction reduces the required sample size, taking the
size of the population into account
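One common form of the finite population correction is sketched below; this assumes the standard adjustment n = n0 / (1 + (n0 - 1)/N), though texts state it in slightly different forms:

```python
import math

def fpc_adjust(n0, N):
    """Reduce an initial sample size n0 for a finite population of size N."""
    return math.ceil(n0 / (1 + (n0 - 1) / N))

# For an initial n0 = 384 and a population of N = 1000,
# the requirement drops to 278 subjects.
```

The smaller the population relative to n0, the larger the reduction; for very large N the correction changes almost nothing.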
90. 5. Feasibility (cost)
• Sample size is also dependent on the cost of sampling
• More subjects provide higher precision but at a higher cost
• Beyond the optimal sample size, each increment in sample size adds
only a little precision
• Beyond the optimal sample size, adding subjects costs more than the
precision it adds
• Therefore, the issue of cost should be controlled through the precision
estimate
91. 6. Confidence level (Z value of certainty)
• The confidence level expresses how certain we want to be that the
sample estimate captures the population
parameter
• Population parameter is the true estimate of the
population but unknown
• It is estimated based on Central limit theorem
• Central limit theorem: The distribution of the sample
means will be nearly Normal regardless of how the
variable is distributed in the population as long as the
sample size is large enough.
92. SAMPLE SIZE FORMULA FOR PROPORTION
Sample size for a single population proportion:
n = Z²(1-α/2) p(1-p) / d²
Sample size for comparing two population proportions:
n = (Z(1-α/2) + Z(1-β))² [p1(1-p1) + p2(1-p2)] / (p1-p2)²
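Both formulas can be computed directly. The sketch below assumes a two-sided test with Z(1-α/2) = 1.96 (95% confidence) and Z(1-β) = 0.84 (80% power); the function names are my own:

```python
import math

def n_single_proportion(p, d, z=1.96):
    """Sample size for estimating one proportion p with margin of error d."""
    return math.ceil(z**2 * p * (1 - p) / d**2)

def n_two_proportions(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Sample size per group for detecting a difference between p1 and p2."""
    num = (z_alpha + z_beta)**2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(num / (p1 - p2)**2)

# Maximum-variability case: p = 0.5 and d = 0.05 gives the familiar n = 385.
```

The single-proportion result with p = 0.5 and d = 0.05 is the widely quoted "about 384" figure, which is why 50% is used as the conservative default when variability is unknown.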
94. Variable
What is a variable?
What types of variables do you know?
What is the importance of knowing types of variables?
Identify the type of the outcome variable, and also indicate the
outcome variable and possible independent variables for the title
below:
Exclusive breastfeeding practice and associated factors among
mothers of children under two years in 'X' town
95. Variable
Variable: A characteristic which takes different
values in different persons, places, or things.
Any aspect of an individual or object that is
measured (e.g. BP) or recorded (e.g. age, sex) and
takes any value.
There may be one variable in a study or many.
E.g. A study of treatment outcome of TB
97. Depending on scales of measurement we have:
Four levels of measurement
• Nominal measures
• Ordinal measures
• Interval measures
• Ratio measures
98.
1. Nominal scale:
• The simplest type of data, in which the values fall into
un-ordered categories or classes
• Uses names, labels or symbols to assign each
measurement.
• Examples: Blood type, sex, race, marital status
99.
2. Ordinal scale:
• Assigns each measurement to one of a limited number of
categories that are ranked in terms of order.
• However, the distances between the categories are uneven
or unknown
• Although non-numerical, can be considered to have a
natural ordering
• Examples: Patient status, cancer stages, social class
100.
3. Interval scale:
- Measured on a continuum
- Differences between any two numbers on a scale are of known size.
Example: Temperature in °F on 4 consecutive days
Days:     A   B   C   D
Temp. °F: 50  55  60  65
For these data, day A at 50°F is not only cooler than day D at 65°F, but is
15° cooler.
- It has no true zero point. "0" is arbitrarily chosen and doesn't reflect the absence
of temperature.
- Interval data differ from ordinal data because the differences between adjacent
scores are equal.
101.
4. Ratio scale:
- Measurement begins at a true zero point and the scale has
equal space.
- In ratio scales, zero is an absolute absence of the variable
- The data can be categorized, ranked, evenly spaced, and
has a natural zero.
- Examples: Height, weight, BP, etc.
- 40cm height is twice 20cm height.
102. Why is Level of Measurement Important?
• Helps to decide how to interpret the data from that
variable.
• Helps to decide what statistical analysis is appropriate on
the values that were assigned.
• If a measure is nominal, then you know that you would
never average the data values or do a t-test on the data.
103. Exercises:
Give the correct scales of measurement for each variable
1. Temperature (Celsius)
2. Hair colour
3. Job satisfaction index (1-5)
4. Number of heart attacks
5. Calendar year
6. Serum uric acid (mg/100ml)
7. Number of accidents in a 3 - year period
8. Number of cases of each reportable disease reported by a
health worker
9. The average weight gain of six 1-year old dogs with a special
diet supplement was 950 grams last month.
104. Dependent and independent variables
• Because in health research we often look for
associations, it is important to make a distinction
between dependent and independent variables.
• Both the dependent and independent variables
together with their operational definitions (when
necessary) should be stated.
105. Dependent and independent variables …
• The variable that is used to describe or measure the
problem under study is called the dependent
variable.
• The variables that are used to describe or measure
the factors that are assumed to influence (or cause)
the problem are called independent variables.
106. Variables …
• For example, in a study of relationship between
smoking and lung cancer, "suffering from lung
cancer" (with the values yes, no) would be the
dependent variable and "smoking" (with the
values no, less than a packet/day, 1 to 2
packets/day, more than 2 packets/day) would be the
independent variable.
107. For each of the following research questions identify the outcome
and independent variables
a. The prevalence of contraceptive use among HIV +VE women in
the reproductive age group in Y town
b. The incidence of COVID-19 infection among under five children
in Kebelle “X” in Y town
c. Is double burden malnutrition the emerging problem among under
5 children in Oromia region?
d. Factors associated with Age at first sexual initiation among youths
visiting HIV testing and counselling centres in North Shoa Zone,
Ethiopia
108. Background variables
In almost every study involving human subjects, background
variables (demographic characteristics such as age, sex, educational
status, monthly family income, marital status and religion) are
included.
These background variables are often related to a number of
independent variables, so that they influence the problem indirectly.
Hence they are called background variables or background
characteristics.
The researcher cannot manipulate background variables.
109. Operationalizing variables
• Operationalizing variables means that you make
them ‘measurable'.
• Example: In a study on VCT acceptance, you want
to determine the level of knowledge concerning
HIV in order to find out to what extent the factor
‘poor knowledge’ influences willingness to be
tested for HIV.
110. Cont’d
• The variable ‘level of knowledge’ cannot be
measured as such.
• You would need to develop a series of questions to
assess a person’s knowledge.
• The answers to these questions form an indicator of
someone’s knowledge on this issue, which can then
be categorized.
111. Cont’d …
If 10 questions were asked, you might decide that the knowledge of those
with:
0 to 3 correct answers is poor,
4 to 6 correct answers is reasonable, and
7 to 10 correct answers is good.
Operational definitions of variables are used in order to:
• Avoid ambiguity
• Make the variables more measurable
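The scoring rule above can be operationalized directly in code; a minimal sketch using the slide's cut-offs (the function name is my own):

```python
def knowledge_level(correct_answers):
    """Categorize a 10-question knowledge score using the slide's cut-offs:
    0-3 poor, 4-6 reasonable, 7-10 good."""
    if correct_answers <= 3:
        return "poor"
    if correct_answers <= 6:
        return "reasonable"
    return "good"
```

Writing the rule down this explicitly is exactly what an operational definition does: anyone applying it to the same answers will assign the same category.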
112. Data collection
Introduction
Data collection is a crucial stage in the planning and
implementation of a study.
Data analysis becomes difficult when the data collection has been:
- superficial,
- biased, or
- incomplete.
Therefore, we should concentrate all possible efforts on
developing appropriate tools, and should test them
several times.
113. Depending on the type of study, different data-collection
techniques may be used.
In HSR studies we usually combine a number of
different techniques and look at problems from different
perspectives (triangulation).
114. The choice of methods of data collection is based on:
The resource required
Acceptability of the method
Coverage of the method
Familiarization of the procedure
Relevance
The accuracy of information they will yield
Practical considerations, such as, the need for personnel, time,
equipment and other facilities, in relation to what is available
115. OVERVIEW OF DATA COLLECTION TECHNIQUES
Data-collection techniques allow us to systematically
collect information about our
- objects of study (people, objects, phenomena)
- the settings in which they occur.
In the collection of data we have to be systematic.
If data are collected haphazardly, it will be difficult to
answer our research questions in a conclusive way.
116. • Various data collection techniques can be used such as:
➢Using available information
➢Observing
➢Interviewing (face-to-face)
➢Administering questionnaire
➢Focus group discussion
➢Physiological measurement (in vitro vs in vivo)
117. 1. Using available information/documentary
sources
Locating the sources and retrieving the information is a
good starting point in any data collection effort.
These include:
• Health Information System Data,
• Census Data,
• Unpublished Reports
• Publications
• Clinical Records
• Personal Records,
• Death Certificates,
• Published Mortality Statistics,
• Census Publications, etc
• Key Informants
• Newspapers
118. Advantages:
Documents can provide ready made information
relatively easily
The best means of studying past events.
Data collection is inexpensive
119. Disadvantages:
Problems of reliability and validity
There is a possibility that errors may occur when the
information is extracted from the records.
Since the records are maintained not for research
purposes, but for clinical, administrative or other ends,
the information required may not be recorded at all, or
only partly recorded.
120. 2. Observing
Is a technique that involves systematically selecting,
watching and recording behavior and characteristics of
living beings, objects or phenomena.
Observation of human behavior is a much-used data
collection technique.
It can be undertaken in different ways:
Participant observation
Non-participant observation
Structured observation
Unstructured observation
121. Participant observation
The observer takes part in the situation he or she
observes
E.g., a nurse researcher observing how nurses
communicate with their patients while taking
part in patient care activities in the facility.
122. Non-participant observation
• The observer watches the situation, openly or concealed,
but does not participate in the situation being observed.
• Observation is commonly used in qualitative studies
• Phenomena amenable to observation in research include:
☞Activities and behaviour
☞Characteristics and conditions of individuals
☞Skill attainment and performance
☞Verbal and nonverbal communication
☞Environmental characteristics
123. Observation can be made by using structured or
unstructured tools or both
a) Unstructured observation:
➢Involves spontaneously observing & recording what is seen using field
diaries or field notes
Conducted in an open and free manner in a sense that there would be no
pre-determined variables or objectives.
b) Structured observation:
➢ The researcher carefully defines what is to be observed & how the
observations are to be made, recorded, & coded
Data collection is conducted using specific variables and according to a
pre-defined schedule.
124. cont’d…
• Involves the use of categorical system or checklist or rating
scales to guide observation & recording
• If observations are made using a calibrated scale they may be
called measurements
• Measurements often require additional tools, e.g., a weighing
scale to measure weight, a tape measure to measure height, a
thermometer to measure body temperature.
• Observations can give additional, more accurate information
on behaviour of people than interviews or questionnaires
125. Advantages of observation
• Give additional, more accurate information on behavior of
people than interviews or questionnaires.
• Check on the information collected through interviews
especially on sensitive topics such as alcohol or drug use, or
stigmatizing diseases.
• They can also be made on objects. For example, the presence
or absence of a latrine and its state of cleanliness may be
observed.
• In some studies they can be the major research technique.
126. Disadvantages of observation
They are time consuming
They are most often used in small-scale studies.
Investigators' or observers' own bias, prejudices,
desires, etc.
Need more resources and skilled manpower when
high-level machines are used.
Ethical issues
127. 3. Interviewing
• Is a data-collection technique that involves oral questioning of
respondents, either individually or as a group.
• Depending on whether the data collected are qualitative,
quantitative, or both, the interview can be:
Face to face interview
Telephone interview
Self-reported/completed questionnaire
• Answers to questions posed during an interview can be
recorded by:
• ☞ writing them down either during the interview itself or
immediately after interview; or
• ☞ Tape-recording the responses; or
• ☞ A combination of both
128. Advantages of interviewing
Can stimulate and maintain the respondent's interest and the frank
answering of questions.
If anxiety is aroused, the interviewer can allay it.
Can repeat questions which are not understood, and give
standardized explanations where necessary.
An interviewer can ask "follow-up" or "probing" questions to
clarify a response.
Can make observations during the interview.
129. Disadvantages of interviewing
• Questions may be misunderstood
• Time consuming
• Need to set up interviews
• Can be expensive
• Respondent bias
• Needs a set of questions
130. 4. Administering written
questionnaires
Is a data collection technique in which written questions
are presented that are to be answered by the respondents
in written form
It can be administered in different ways, such as by:
• Sending questionnaires by mail
• Self-administered questionnaires
• Interviewer -administered questionnaires
131. Administering written
questionnaires cont’d…
Advantages:
• Less expensive
• Permits anonymity & may result in more honest
responses
• does not require research assistants
• Eliminates bias due to phrasing questions differently
with different respondents
132. Administering written questionnaires cont'd…
Disadvantages:
• Cannot be used with illiterates
• There is often a low rate of response
• Questions may be misunderstood
133. Rating scale
• The question in self-administered questionnaire can be open-ended
or closed (with pre-categorized answers)
• Closed questions can be composed of dichotomous questions,
multiple-choice questions, rank-order questions, & rating scales
• Rating scales elicit responses in terms of the degree of attitude,
perception, need, or experience
• Rating scales are composite psychosocial scales used to
make fine quantitative discriminations among people with
different attitudes, perceptions, needs, or experiences
134. Rating scale…
• These psychosocial scales are:
☞Likert scales (summated rating scales)
☞Semantic differential scales
☞Visual analogue scale
Likert Scales
• Consist of several declarative statements (items)
expressing viewpoints or opinion or attitude of subjects
• Responses are on an agree/disagree continuum (usually
ranging from 4 - 7 response options) i.e., Strongly agree,
agree, uncertain, disagree, strongly disagree
135. Likert scale
• Values are placed on each response, with 1 on the most
negative response & highest (4-7) value on most positive
response
• Responses to items are summed to compute a total scale
score
• Example of Likert scale response labels:
SD = Strongly Disagree; D = Disagree; UN = Uncertain;
A = Agree; SA = Strongly Agree
136. Semantic Differential Scales
• Used to measure attitudes & beliefs
• Require ratings of various concepts
• Rating scales involve bipolar adjective pairs, with 7-point
ratings
• Value of 1 denotes the most negative response & 7
denotes the most positive response
• Ratings for each dimension are summed to compute a total
score for each concept
139. Differences between data collection
techniques and data collection tools

Data collection technique             Data collection tool
Using available information           Checklist; data compilation forms
Observation                           Eyes and other senses, pen/paper, watch, scales, microscope, etc.
Interviewing                          Interview guide, checklist, questionnaire, tape recorder
Administering written questionnaire   Questionnaire
140. Data collection instruments
• Types of questions
Depending on how questions are asked and recorded
we can distinguish major possibilities:
1. Closed questions
2. Open-ended questions
3. Semi-opened questions
142.
Closed questions
A list of possible answers or options
Commonly used for background variables
Should be exhaustive & mutually exclusive
What is your marital status?
1. Single
2. Married
3. Divorced
4. Separated
5. Widowed
143.
Open-ended questions
Free to answer with fewer limits imposed by the
researcher
Useful for exploring new areas
What is your opinion on the services provided in the
antenatal (AN) care?
_______________________________________
_____
_______________________________________
144.
Semi-opened questions
What is your occupation?
(1) Dependent
(2) Manual labourer
(3) Government employee
(4) Private employee
(5) Owned business
(6) Others (please specify) _____________
145. Open-ended questions
(allowing for completely open as well as partially
categorized answers)
They permit free responses which should be recorded in the
respondents' own words.
Such questions are useful for obtaining in-depth information on:
• facts with which the researcher is not very familiar,
• opinions, attitudes and suggestions of informants
Examples:
1. 'At what age did the child start supplementary food?'
2. 'What is your opinion on the services provided in the ANC?'
(Explain why.)
3. 'What do you think are the reasons some adolescents in this area
start using drugs?'
146. Advantage of open-ended
questions…
Allow you to probe more deeply into issues of interest
being raised.
Information provided in the respondents' own words
might be useful
Providing valuable new insights on the problem.
Permit unlimited number of answers
147. Risks of completely open-ended
questions…
A big risk is incomplete recording of all relevant issues
covered in the discussion.
Analysis is time-consuming and requires experience;
otherwise important data may be lost.
Skilled interviewers are needed to get the discussion
started and focused on relevant issues and to record all
information collected.
148. 2. Closed questions:
Have a list of possible options or answers
from which the respondents must choose.
Example: closed ended question
What is the current breastfeeding status of mother ?
A. Exclusive breastfeeding
B. Partial breastfeeding
C. Not breastfeeding
149. Advantages of closed ended
questions
It saves time
Comparing responses of different groups, or of the same
group over time, becomes easier.
Answers easier to analyze on computer
Response choices make question clearer
Risks of closed ended questions:
• In the case of illiterate respondents, bias will be introduced
• Many choices can be confusing
• Can't tell if respondent misinterpreted the question
• Fine distinctions may be lost
150. Questionnaire Design
• Designing a questionnaire always takes several drafts.
• In the first draft we should concentrate on the content.
• In the second, we should look critically at the formulation and
sequencing of the questions.
• Then we should scrutinize the format of the questionnaire.
• Finally, we should do a test-run to check whether the
questionnaire gives us the information we required & whether both
the respondents & we feel at ease with it.
151. Steps in designing questionnaire
Step 1: Content
Step 2: Formulating questions
Step 3: Sequencing the questions
Step 4: Formatting the questionnaire
Step 5: Translation
Step 6: pre-test
152. Step 1: Content
Take your objectives and variables as a starting point.
Decide what questions will be needed to measure or to
define your variables and reach your objectives.
153. Step 2: Formulating
questions:
Formulate one or more questions that will provide the
information needed for each variable.
Check whether each question measures one thing at a time.
Take care that questions are specific and precise enough
that different respondents do not interpret them differently.
Avoid words with double or vaguely defined meanings or
that are emotionally laden e.g., omit concepts such as dirty
(clinics), lazy (patients), or unhealthy (foods)
Ask sensitive questions in a socially acceptable way.
Avoid leading questions.
A question is leading if it suggests a certain answer.
154. Step 3: Sequencing the questions
• Design your interview schedule or questionnaire to be
'informant friendly'.
• The sequence of questions must be logical for the
respondent and allow as much as possible for a "natural"
discussion.
• Organize the questions in a logical order and use simple,
everyday language.
• Pose more sensitive questions as late as possible in the
interview.
155. Step 4: Formatting the
questionnaire
• When you finalize your questionnaire, be sure that:
• An introductory page explaining the purpose of the
study & confidentiality issue is attached to the
questionnaire
• Sufficient space is provided for answers to open-ended
questions
• Page layout & margins are properly formatted
156. Step 5:Translation
If interview will be conducted in one or more local
languages, the questionnaire has to be translated to
standardize the way questions will be asked.
After having it translated you should have it
retranslated into the original language.
You can then compare the two versions for
differences and make a decision concerning the final
phrasing of difficult concepts.
157. Step 6: Pretest
A pretest usually refers to a small-scale trial of a
particular research component.
A pretest serves as a trial run that allows us to
identify potential problems in the proposed study.
As a result, a good deal of time, effort, and money
can be saved in the long run
Pretesting is:
Simpler
Less time consuming and less costly
158. Pre test…
• A pretest determines whether the instrument is clearly
worded, free from major biases, and useful in
generating desired information
• When do we carry out a pre-test?
• Pre-testing the data collection 1-2 weeks before starting
the fieldwork so that you have time to make revisions.
159. Pre test…
Components to be assessed during the pre-test?
The reactions of respondents to:
• The research procedures and
• Questions related to sensitive issues.
The appropriateness of format and wording of
questionnaires and the accuracy of the translations.
The time needed to carry out interviews, observations or
measurements.
160. Pretesting and Pilot study
• Pretest – usually refers to a small-scale trial of particular
research components
• Pilot study – is the process of carrying out a preliminary
study, going through the entire research procedure with a
small sample
161. Measurement properties
• Whatever the type of measurement, its performance
can be described in several ways:
Validity
Reliability
Range
Variation
Responsiveness (an instrument's ability to detect
change over time)
162. Measurement
• Involves rules for assigning numeric values to
qualities of objects to designate the quantity of the
attribute
• Advantage of measurement
–It removes guesswork in gathering information
163. Measurement error
Systematic Error - Also called "constant error" or "bias"
• Design or instrument Error which affect the data in a consistent
way.
–Either pull all the scores up or all of the scores down.
Therefore, systematic errors affect the group mean score
• a systematic upward or downward distortion of the level of
measurement
Random Error/noise
• Transient aspects of the measurement situation that cause
variable errors.
• These errors cause greater variability within the data set, but do
not make the mean score higher or lower.
164. Reliability
• Is the consistency of measurement results across
persons, occasions, locations and instruments
• Consistency of responses to a question (if you get on
your scale and it tells you that you weigh 110 lbs
one minute, then you step on it again and it tells you
that you weigh 115, it is not very reliable).
165. Reliability…
• Is the degree to which the same results are obtained when the
measurement is repeated.
• Repeated measurements of a stable phenomenon by different people
and instruments at different times and places get similar results.
• Reproducibility and precision are other words for this property
• Relates to the consistency of a measure, or the degree to which an
instrument measures the same way each time it is used under the
same condition with the same subjects
• In practice, it is unlikely that exactly the same results will be
obtained every time, owing to changes over time in the population and the sample
166. • There are three methods of testing the reliability of research
instruments:
1.Tests for the stability of the instruments (how stable it is over time)
2.Tests for equivalence (consistency of the results by different
investigators)
3.Internal consistency (the measurement of the concept is consistent
in all parts of the test).
167. Reliability…
• Stability: the same score is obtained when the instrument
is used with the same people on a separate occasion
–Test-Retest Reliability(stability): Administer the same
questionnaire at a later time
–Reliability coefficient
• Equivalence: the consistency of the instrument by
different observers/raters
–Interrater reliablity
• Internal consistency: the extent that all its subparts
measure the same characteristics
–Split-Half Reliability
–Cronbach's alpha/coefficient alpha
168. Tests of Stability
• A stable research instrument is one that can be repeated on the same
individual more than once and achieve the same results.
• In observational methods, when the characteristic being observed is
expected to change over time, a test of stability cannot be used.
• Repeated observations and test/retest procedures are used to test
the stability of an instrument.
• The Pearson correlation coefficient, which takes a value
between -1 and 1, is used for this calculation
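Test-retest stability is typically quantified by correlating the two administrations; a from-scratch sketch of the Pearson coefficient used for this (not code from the slides):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between paired scores x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Scores that rise and fall together at test and retest give r close to +1
# (high stability); unrelated scores give r close to 0.
```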
169. Tests of Equivalence
• Tests of equivalence attempt to determine if the same results
can be obtained using different observers at the same time
or if similar tests given at the same time yield the same
results.
• The equivalence aspect considers how much error may get
introduced by different investigators or different samples of the
items being studied
170. Test of internal consistency
• Internal consistency refers to the extent to which all parts of the
measurement technique are measuring the same concept
• The most common statistic is called Cronbach's alpha
o Cronbach's alpha can be calculated in SPSS, SAS, or STATA
o Cronbach's alpha gives the lower bound for reliability.
o If it is high for the whole scale (>= 0.7), then you know the scale is reliable
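Cronbach's alpha can also be computed by hand from the item variances and the variance of the total score; a sketch using sample variances (in practice SPSS, SAS, or STATA would be used, and this function name is my own):

```python
def cronbach_alpha(items):
    """items: one list of scores per item, respondents in the same order.
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = len(items)
    n = len(items[0])

    def var(xs):  # sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# When items move together across respondents, total-score variance is large
# relative to the item variances and alpha approaches 1.
```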
171. • Variability of observer ratings is reflected in observer
disagreement
• It is indicated by how consistently observers classify individual
subjects into the same category on the measurement
scale
• The kappa coefficient is one of the most common
approaches.
• It ranges from -1 to +1
172. Cohen's kappa measures the agreement between two raters who
each classify N items into C mutually exclusive categories.
The equation for κ is:
κ = (Po - Pe) / (1 - Pe) = (actual agreement beyond chance) / (potential agreement beyond chance)
Po = the total proportion of observations on which there is agreement
Pe = the proportion of agreement expected by chance alone.
173. Agreement matrix for kappa statistic
(inter-rater agreement, 2 observers, dichotomous data)
OBSERVER B
OBSERVER A
Yes No TOTALS
Yes a b f1
No c d f2
TOTALS n1 n2 N
174. Agreement matrix for kappa statistic
(2 observers, dichotomous data)
OBSERVER B
OBSERVER A
Yes No TOTALS
Yes 69 15 84
No 18 48 66
TOTALS 87 63 150
175. K (Cont'd)
• Observed agreement (Po) = (a+d)/N = (69 + 48)/150 = 0.78 or
78%.
• Agreement expected by chance (Pe) is calculated from the
products of the marginal totals:
(Pe) = [(f1 × n1)/N + (f2 × n2)/N] × 1/N
84 × 87/150 = 48.72
66 × 63/150 = 27.72
Then divide the sum [76.44] by 150 to get Pe = 0.51 or 51%.
176. K (Cont'd)
• K = (Po - Pe) / (1 - Pe) = (0.78 - 0.51) / (1 - 0.51) = 0.27 / 0.49 = 0.55 or 55%
Kappa varies from -1 to +1, with a value of zero denoting
agreement no better than chance (negative values denote
agreement worse than chance!)
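The worked example can be reproduced in code, using the cell labels a-d from the agreement matrix (a sketch; the function name is my own):

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table:
    a = both raters 'yes', d = both raters 'no', b and c = disagreements."""
    N = a + b + c + d
    po = (a + d) / N                                     # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / N**2  # chance agreement
    return (po - pe) / (1 - pe)

# The slide's table (a=69, b=15, c=18, d=48) gives kappa of about 0.55,
# i.e. moderate agreement on the interpretation scale that follows.
```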
177. Reliability...
Interpretation of Kappa
• Poor agreement = Less than 0.20
• Fair agreement = 0.20 to 0.40
• Moderate agreement = 0.40 to 0.60
• Good agreement = 0.60 to 0.80
• Very good agreement = 0.80 to 1.00
178. Validity
• Validity (accuracy) is the degree to which the results of a
measurement correspond to the true state of the
phenomenon being measured.
• For clinical observations that can be measured by physical
means, the observed measurement is compared with some
accepted standard.
• Thus, it is relatively easy to establish validity.
179. • Some other clinical measurements such as pain, nausea, dyspnea,
depression, and fear cannot be verified physically.
• In patient care, information about these phenomena is usually
obtained informally by “taking a history.”
• More formal and standardized approaches, used in research, are
structured interviews and questionnaires.
Validity…
180. • A valid measurement thus requires both a valid method
(instrument for measurement) and a valid observer
(measurer).
• Individual questions (items) are designed to measure
specific phenomena (e.g., symptoms, feelings, attitudes,
knowledge, beliefs) called constructs
• Three general strategies are used to establish the validity of
measurements that cannot be directly verified physically.
Validity…
181. • Is the extent to which a particular method of measurement includes all
of the dimensions of the construct one intends to measure and nothing
more.
• For example, a scale for measuring pain would have content validity if
it included questions about aching, throbbing, pressure, burning, and
stinging, but not about itching, nausea, and tingling.
• It looks at whether the instrument adequately covers all the content
that it should with respect to the variable.
• In other words, does the instrument cover the entire domain related to
the variable, or construct it was designed to measure?
A. Content Validity
182. B. Criterion Validity
• Also called "concurrent validity"; it has to do with the
correlation between:
• measurement items on the one hand and known and
accepted standard measures or criteria on the other.
• The criterion can be any other instrument that measures the same variable
• Criterion validity is measured in three ways:
1. Convergent validity—shows that an instrument is
highly correlated with instruments measuring similar
variables.
183. 2. Divergent validity—shows that an instrument is poorly correlated to instruments that
measure different variables.
• For example, there should be a low correlation between an instrument that measures
motivation and one that measures self-efficacy.
3. Predictive validity—means that the instrument should have high correlations with
future criteria.
• Predictive validity is like concurrent validity except that a time elapses
between the criterion and test measures
• For example, a score of high self-efficacy related to performing a task should predict
the likelihood of a participant completing the task.
Criterion Validity…
184. C. Construct validity
• It refers to whether you can draw inferences about test scores related
to the concept being studied.
• Construct validation is the accumulation of evidence to support the
interpretation of what a measure reflects
• For example, if a person has a high score on a survey that measures
anxiety, does this person truly have a high degree of anxiety?
• There are three types of evidence that can be used to demonstrate a
research instrument has construct validity:
• Homogeneity—meaning that the instrument measures one construct.
185. • Convergence—this occurs when the instrument measures concepts
similar to that of other instruments.
• However, if there are no similar instruments available this will not be
possible to do.
• Theory evidence—this is evident when behaviour is similar to
theoretical propositions of the construct measured in the instrument.
• For example, when an instrument measures anxiety, one would expect
to see that participants who score high on the instrument for anxiety
also demonstrate symptoms of anxiety in their day-to-day lives
Construct validity…
186. Summary of validity
• Validity is defined as the extent to which a concept is
accurately measured in a quantitative study
• Content validity: The extent to which a research
instrument accurately measures all aspects of a construct
• Construct validity: The extent to which a research
instrument (or tool) measures the intended construct
• Criterion validity: The extent to which a research
instrument is related to other instruments measuring the same variable
187. • To assess the accuracy of any particular measuring
'instrument', we should distinguish between the reliability
of the data collected and their validity.
• Reliability is essentially the extent of the agreement or
consistency between repeated measurements
• Validity is the extent to which a method of measurement
provides a true assessment of that which it purports to
measure
Reliability versus Validity
190. Stages in the Data Collection Process
Three main stages can be distinguished:
Stage 1: Permission to proceed
Stage 2: Data collection
Stage 3: Data handling
192. Data are numbers which can be measured or can be
obtained by counting.
Data are sources of facts or information from which
conclusions can be drawn after they are statistically
treated in some way.
They are the raw material for statistics.
193. • Data processing and analysis should start in the field,
with checking for completeness of the data and
performing quality control checks, while sorting the data
by instrument used and by group of informants.
• Data from small samples may even be processed and
analyzed as soon as they are collected.
194. Data processing, analyzing & interpretation
Data processing involves:
• Data entry
• Data coding
• Data categorizing
• Data cleaning
Analyzing & interpretation
195. WHAT IS DATA PROCESSING?
• Data processing refers to:
• Data entry onto a computer
• Data coding
• Data categorizing
• Data checks and correction
• The aim of this process is to produce a relatively “clean”
data set which may be imported into a statistical package.
• When to start?
196. Data pro…
Why process data?
It helps the researcher to ensure that:
• All the information one needs has been collected, and in a
standardized way;
• She/he has not collected unnecessary data which will never be
analyzed.
• Provide better insight into the feasibility of the analysis to be
performed as well as the resources that are required.
• It assures the appropriateness of the data collection tools that
he/she needs.
197. Data can be processed:-
• Manually, using data master sheets, manual sorting,
or tally counts.
• By computer, using existing software for data analysis
(e.g., SPSS, Epi…).
198. Computer compilation consists of the following
steps:
1. Choosing an appropriate computer program
2. Data entry
3. Verification or validation of the data
4. Programming (if necessary)
5. Computer outputs/prints
199. I. Data entry
• Data entry concerns the transfer of data from a
questionnaire to a computer file.
• It is a process of entering raw data into a computer
• It is a process where raw data could be manipulated
and changed
• Data is coded and entered into a computer
200. Data entry….
• We can use any software
• EPI info
• EPI 6 (DOS format)
• EpiData (Danish
format)
• SPSS for windows
• Excel (office)
• Access (office)
• etc
201. Selection of data entry software
• There are different computer software packages for data entry
• A software package is selected based on:
• Its robustness against accidental changes to entered data
• Its lower cost
• The presence of a program checking for consistency
• Non-visibility of the whole data set to the data entry clerk
• Ability to support double entry and its validation
202. DATA ENTRY...
Who does data entry?
• Data are often entered into a computer by a clerk who may
not be familiar with how the research was designed or
how the data were collected.
• To facilitate data entry and minimize errors, the data
entry person should not make guesses, calculations,
coding, etc.
• Data entry is quick and easy if the data entry person
simply types the information seen on the
answer sheet (i.e. direct data entry).
203. DATA ENTRY...
If the questionnaire is adequately designed, direct data entry is
possible if:
• Answers are put in separate column or separate answer sheet
• Documents are edited before data entry
• Closed ended questions are pre-coded etc.
When working with computers, remember to:
• Save your work frequently
• Keep back-ups (more than one copy)
• Share time with other users, etc.
204. 2. DATA CODING
For computers to work their magic they must be able to read your
data. In general, computers work best with numbers.
Alphabetic codes and open ended responses must be translated to
numbers through the process called “coding”.
Coding:
is assigning a separate (non-overlapping) numerical code
for separate answers and missing values
E.g. instead of using “Male” and “Female” for the
variable sex, it can be indicated as:
1= Male, 2= Female
205. Coding may be pre-coding, post-coding, or recoding
• Pre-coding:- done while the questionnaire is being written.
• Post-coding:- done after respondents have answered the
questions.
• used for open-ended questions for which
response categories can't be anticipated.
• Recoding:-changes earlier coding to facilitate
meaningful analysis.
206. Coding missing values:
• Missing values occur when measurements were not
taken, or respondents did not answer, etc.
• In general, missing values should not be entered as a
“blank” because some statistical packages interpret
blanks as zeros
• Ideally, a code should be chosen to denote a missing
value (e.g. code “9”, “99”, or “999” is often used for
missing values).
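The coding conventions above can be sketched as a small translation step; the dictionary, the missing-value code of 9, and the variable names are illustrative, following the examples on these slides (1 = Male, 2 = Female).

```python
# Sketch of coding text responses into numbers and marking missing
# values with a dedicated code rather than a blank (here, 9).
SEX_CODES = {"Male": 1, "Female": 2}
MISSING = 9

def code_sex(response):
    """Translate a text response to its numeric code; anything else -> missing."""
    return SEX_CODES.get(response, MISSING)

responses = ["Male", "Female", "Female", ""]   # last respondent did not answer
coded = [code_sex(r) for r in responses]
print(coded)  # [1, 2, 2, 9]

# Exclude the missing code before analysis, so 9 is never treated as data
valid = [c for c in coded if c != MISSING]
```

Using an explicit missing-value code, and filtering it out before analysis, avoids the problem noted above of statistical packages silently reading blanks as zeros.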
207. DATA CODING...
Who does the coding?
• The principal investigator should coordinate the coding
process and ideally all the coding should be done by one
person.
• Certainly, no more than three different people should be
involved in this process.
• If the work is done by more than one person, they should
have codebook
• Code book: It is essentially a list of each variable entered,
its column location, and the codes associated with the values
of the variable
208. Code book provides
oA guide used in the coding process
oLocating the variables
oAssignments of the values of the variable
oList of the code assignments of the values of
the variable
oDecoding back to original variables when
reporting.
DATA CODING...
209. Coding conventions
• Common responses should have the same code in each
question, as this minimizes mistakes by coders.
• For example:
• Yes (or positive response) code - Y or 1
• No (or negative response) code - N or 2
• Don’t know code - D or 8
• No response/unknown code - U or 9
210. 3. Data categorizing
• Decisions have to be made concerning how to
categorize responses.
• For categorical variables that are investigated
through closed questions or observation, the categories
have been decided upon beforehand.
• In interviews the answers to open-ended questions (for
example, ‘Why do you visit the health centre?’) can
be pre-categorised to a certain extent, depending on
the knowledge of possible answers that may be given.
211. • However, there should always be a category called ‘Others,
specify . . .’, which can only be categorised afterwards.
• For numerical variables, the data are often better collected
without any pre-categorisation.
• This is advisable if you do not know exactly the range and the
dispersion of the different values of these variables when you
collect your sample.
Example:
Home-clinic distance for out-patients,
income
Age
Weight
212. 4. Data cleaning
• Once data is entered, the second step is data cleaning
• Data cleaning is the process of reconciling the data entered in
a computer (soft copy) with the hard copy on
paper
• The aim of this process is to produce a clean set of data
for statistical analysis.
• Checking for errors, impossible or implausible values
and inconsistencies that may be due to coding or data
entry process.
• No matter how carefully the data have been entered, some
errors are inevitable.
213. DATA CLEANING…
Errors can result from:
• Incorrect reading
• Incorrect reporting
• Incorrect filling
• Incorrect sensing
• Incorrect coding
• Incorrect typing
• etc.
214. • Data cleaning occurs on three occasions,
1. During template formation
2. During data entry
3. After data is entered
(the more we use combination of the above cleaning
process, the more valid will be our data)
DATA CLEANING…
215. I. Cleaning during template formation
• This means programming checks into the data entry template
• The program is formed
• by limiting the values that can be entered for a variable
• by checking the consistency of values
• by providing a good skipping pattern
• by enforcing mandatory (“must enter”) fields
• by making the computer calculate and check for
consistency, etc
216. II. Cleaning during data entry
a. Using two computers
• It is when we use two data clerks with two computers
• Data entered by the two computers are validated for
similarity
• When there is difference, a correction measure (based
on the hard copy) is taken
218. b. Double entry using a single computer
• It is also possible to do double entry using the same
computer
• In EPI data version 3.1
• When the second entry differs from the first, a beep
sounds, and a corrective measure can be undertaken
• Counter-checking entered data by the principal investigator
is another method
Other cleaning:
• Counter-checking 5 to 10% of the data entered each day
is also useful
219. III. Cleaning after data entry is completed
• It is by making
• Simple frequency,
• Tabulating variables for consistency, and
• Sorting (in SPSS)
• Outliers and missing values are usually evaluated
(against the hard copy)
• Giving serial number for the hard and soft copy makes
things simple
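The post-entry checks above (simple frequencies, flagging implausible values) can be sketched as follows; the variable name `ages`, the plausible range, and the helper function are all illustrative, not part of any particular package.

```python
# Minimal sketch of cleaning after data entry: a simple frequency
# table plus a range check that flags impossible values for
# verification against the hard copy.
from collections import Counter

ages = [23, 25, 230, 31, 25, -4, 40]   # 230 and -4 are entry errors

# 1. Simple frequency table (spot odd or duplicated values)
freq = Counter(ages)

# 2. Flag out-of-range values, keeping their positions so the
#    matching hard-copy questionnaire can be found by serial number
def implausible(values, low, high):
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]

suspects = implausible(ages, low=0, high=120)
print(suspects)  # [(2, 230), (5, -4)]
```

Reporting the index along with the value mirrors the slide's advice to give the same serial number to the hard and soft copies, which makes tracing a suspect record back to paper straightforward.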
220. Analysis of epidemiological study
• Quantitative data analysis is making sense of the
numbers to permit meaningful interpretation
It involves:
1. organizing the data
2. doing the calculations
3. interpreting the information
• lessons learned
4. explaining limitations
221. Analysis of epidemiological study…
Prerequisites for analysis
1. Be well acquainted with the objectives of the study
2. Knowledge of the type of variables (dependent/
independent)
3. Knowledge of the measurement of variables
4. Knowledge of the type of analysis needed for each
objective (and design)
5. Knowledge of statistics to be done
6. Selection of statistical software for analysis
222. Awareness of study objectives
• Research is carried out principally to answer study
questions
• Our:
• Results should answer the objectives (study questions)
• Discussion should interpret what the results mean in
answering the objectives
• Conclusion should be based on the answers to the objectives
• Recommendations should also be based on findings, not on
wishes
223. Knowledge of types of analysis and study design
• Each study design has a distinct type of analysis
• For descriptive designs, analysis may be based on data
summaries (point estimates) and interval estimation
(confidence intervals)
• For analytic studies, analysis is based on comparison
224. Components of Data Analysis
Data processing
• Data entry
• Coding
• Cleaning
• Descriptive /exploratory
• Frequencies,
• Tables and graphs
• Cross tabulations (chi-squares, spearman’s correlation…)
• Measures of central tendency and variations
• Proportions/percentages
• Analytic /inferential
• Estimation
• Confidence intervals (P-value, OR,…)
• Hypothesis testing
• Statistical models
Tadesse A., 2013
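As an example of the descriptive/inferential components listed above, a 95% confidence interval for a proportion can be computed with the usual normal approximation; the function name and the sample figures (50 of 100) are illustrative.

```python
# Point estimate and 95% CI for a proportion using the normal
# approximation: p ± z * sqrt(p(1-p)/n), with z = 1.96.
import math

def proportion_ci(successes, n, z=1.96):
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)     # standard error of the proportion
    return p, (p - z * se, p + z * se)

p, (lo, hi) = proportion_ci(50, 100)
print(round(lo, 3), round(hi, 3))  # 0.402 0.598
```

The same point-estimate-plus-interval pattern underlies the confidence intervals mentioned for descriptive designs on the previous slide.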
225. Statistical Inference
• Depending on different factors, there are a number of statistical
models which will be appropriate for the data we have in hand
• These are:
• Objective of the study
• Study Design
• Nature of the variable
• Distribution of the variable
• The nature of the data
• Sample size
• The number of groups we want to compare
226. Different t-tests
• If we want to compare two independent groups to see whether
there is a significant difference or not:
• Independent sample t-test
• If our aim is to compare two dependent groups
(measurements before and after treatment, two
measurements taken from each individual in a group, …)
• Paired sample t-test
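The independent-samples t statistic described above can be computed by hand from the textbook formula; the function and the two small samples are illustrative (this sketch assumes equal variances, i.e. the pooled version of the test).

```python
# Pooled independent-samples t statistic:
# t = (mean_x - mean_y) / sqrt(sp^2 * (1/nx + 1/ny)),
# where sp^2 is the pooled sample variance with df = nx + ny - 2.
import math

def pooled_t(x, y):
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)   # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)  # pooled variance
    se = math.sqrt(sp2 * (1 / nx + 1 / ny))
    return (mx - my) / se

t = pooled_t([1, 2, 3], [2, 3, 4])
print(round(t, 4))  # -1.2247
```

In practice the statistic (and its p-value) would come from a package such as SPSS or SciPy; the hand computation only shows what the package is doing.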
227. Regressions
• Linear regression
• If the response variable (y) is continuous
Simple linear regression:
y = β0 + β1x + ε
Multiple linear regression:
y = α + β1x1 + β2x2 + ... + βixi + εi
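The simple linear regression coefficients can be obtained from the closed-form least-squares formulas; the function name and the noise-free data below are illustrative.

```python
# Least-squares fit of y = b0 + b1*x using the closed-form
# formulas: b1 = Sxy / Sxx, b0 = mean(y) - b1 * mean(x).
def simple_linear_regression(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx          # slope
    b0 = my - b1 * mx       # intercept
    return b0, b1

b0, b1 = simple_linear_regression([1, 2, 3, 4], [2, 4, 6, 8])
print(b0, b1)  # 0.0 2.0
```

With real (noisy) data the fitted line minimizes the sum of squared residuals rather than passing through every point, which is the ε term in the model above.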
228. Regression…
• When the response variable is categorical
• Logistic regression
Binary logistic regression
Bivariate logistic regression
Multiple logistic regression
• Analysis of variance (ANOVA)
• Survival analysis
ln(P/(1 − P)) = α + β1x1 + β2x2 + ... + βixi
logit(p) = ln(P/(1 − P)) = α + βx
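The logit link that appears in the logistic regression model above, and its inverse (which maps a linear predictor back to a probability), can be sketched directly; the function names are illustrative.

```python
# The logit transform used in logistic regression and its inverse.
# logit maps a probability in (0, 1) to the whole real line, so the
# right-hand side alpha + beta*x can take any value.
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

p = 0.8
print(round(inv_logit(logit(p)), 6))  # 0.8
```

Fitting the β coefficients themselves requires maximum likelihood estimation, which is what packages such as SPSS perform when running a logistic regression.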