Module 1.1: Introduction
What is statistics?
What is Biostatistics?
Why do we study biostatistics?
• Statistics is a field of study concerned with:
1. the collection, organization, summarization,
and analysis of data; and
2. the drawing of inferences about a body of
data when only a part of the data is observed.
• Biostatistics: When the tools of statistics are
employed on data derived from the biological
sciences, medicine, or public health, we use
the term biostatistics.
• Statistics versus statistic (field of study versus
numerical quantity computed from sample data)
• Roughly speaking, the field of statistics can be
divided into:
• Mathematical Statistics: the study &
development of statistical theory and methods in
the abstract and
• Applied Statistics: the application of statistical
methods to solve real problems involving
randomly generated data, and the development
of new statistical methodology motivated by real
problems
Rationale of studying Statistics
• Statistics provides a way of organizing information on
a wider and more formal basis than relying on the
exchange of anecdotes or biography and personal
experiences
• More and more things are now measured quantitatively
in medicine and public health
• There is a great deal of intrinsic (inherent) variation in
most biological processes
• The medical and public health literature is replete
with reports in which statistical techniques are
used extensively
• The planning, conduct and interpretation of much of
medical and public health research are becoming
increasingly reliant on statistical technology
Limitations of statistics
• It deals with only those subjects of inquiry that are
capable of being quantitatively measured and
numerically expressed.
• It deals with aggregates of facts and attaches no
importance to individual items: it is suited only to
studies in which group characteristics are of interest.
• Statistical results are only approximately, not
mathematically, correct.
• It can be misused to establish wrong conclusions and
should therefore be used only by experts.
• Remember the three kinds of lies: lies, damned lies, and
statistics
• Evan Esar’s Definition of Statistics and Quote:
“The science of producing unreliable facts from
reliable figures”
• “Statistics is the only science that enables
different experts using the same figures to draw
different conclusions”
Variable
• If, as we observe a characteristic, we find that it takes
on different values in different persons, places, or
things, we call the characteristic a variable. The characteristic is not the
same when observed in different possessors of it.
• Quantitative variable: one that can be
measured in the usual sense. For example,
measurements of the heights of adults, the
weights of children, and the ages of patients.
• Qualitative variable: a characteristic that can only be
categorized, such as possessing or not possessing
some characteristic of interest, ethnic group, etc.
• Random Variable: Whenever we determine the
height, weight, or age of an individual, the result is
frequently referred to as a value of the respective
variable.
• When the values obtained arise as a result of
chance factors, so that they cannot be exactly
predicted in advance, the variable is called a
random variable.
• When a child is born, we cannot predict exactly his
or her height at maturity. Attained adult height is
the result of numerous genetic and environmental
factors.
Scales of measurement
• Scales of measurement refer to ways in which
variables/numbers are defined and categorized.
Each scale of measurement determines the
appropriateness for use of certain statistical
analyses.
• There are four scales of measurement: nominal,
ordinal, interval, and ratio.
• Nominal: Categorical data and numbers that are simply
used as identifiers or names represent a nominal scale
of measurement.
• Example: gender coded as Female = 1 and Male = 2, or
vice versa
• Ordinal: An ordinal scale of measurement represents
an ordered series of relationships or rank order.
• Example: Likert-type scales, such as “How much pain are you in
today?” rated from 1 (no pain) to 10 (high pain),
represent ordinal data.
• Interval: A scale which represents quantity and has
equal units but for which zero represents simply an
additional point of measurement is an interval scale.
• In interval scales zero does not represent the absolute
lowest value.
• Example: measurement of temperature on the Fahrenheit
scale, measurement of sea level
• Ratio: The ratio scale of measurement is similar to
the interval scale in that it also represents quantity
and has equality of units. However, this scale also has
an absolute zero (no numbers exist below the zero). A
negative length is not possible.
• Example: physical measures such as height and weight.
• Often, the distinction between interval and ratio
scales can be ignored in statistical analyses.
• The distinction between these two scales and the ordinal and
nominal scales is more important.
Data
• Data are observations of random variables
made on the elements of a population or sample
• Data are the quantities (numbers) or qualities
(attributes) measured or observed that are to be
collected and/or analyzed
• The word data is plural, datum is singular
• A collection of data is often called a data set
(singular)
Data and information
• Data is raw, unorganized facts that need to be
processed. Data can be something simple and
seemingly random and useless until it is
organized.
• Example: Each newborn’s birth weight
• When data is processed, organized, structured
or presented in a given context so as to make it
useful, it is called information.
• Example: Mean birth weight of newborns
Types of data
1. Nominal data
• In statistics/biostatistics, we encounter many
different types of data.
• One of the simplest types of data is nominal data,
in which the values fall into unordered categories
or classes. Example: sex, marital status, ethnicity,
religion, etc.
• Numbers are often used to represent the
categories. In a certain study, for instance, males
might be assigned the value 1 and females the
value 0.
2. Ordinal data
• When the order among categories becomes
important, the observations are referred to as
ordinal data.
• For example injuries may be classified according
to their level of severity, so that
1= fatal, 2= severe, 3= moderate, and 4= minor.
• Here a natural order exists among the groupings:
a smaller number represents a more serious
injury. However we are still not concerned with
the magnitude of these numbers.
3. Discrete data
• For discrete data both ordering and magnitude
are important.
• In this case, the numbers represent actual
measurable quantities or counts rather than
mere labels.
• Examples of discrete data include the number of
car accidents in a given month, the number of
times a woman has given birth.
4. Continuous data
• Data that represent measurable quantities but
are not restricted to taking on certain specified
values.
• In this case the difference between any two
possible data values can be arbitrarily small.
• Examples of continuous data include time, the
serum cholesterol level of a patient, etc.
Types and Methods of Data Collection
• Statistical data may be classified
into two categories depending on their
source:
- Primary Data: are those data which are
collected by the investigator himself for the
purpose of a specific inquiry or study.
- Secondary Data: when an investigator
uses data which have already been collected by
others.
Data collection methods
1. Observation
• It is a technique that involves systematically
selecting, watching, and recording behaviors of
people, measuring characteristics or other
phenomena.
• It includes all methods from simple visual
observations to the use of high level machines.
• Advantage: Gives relatively more accurate data
on behavior and activities.
• Disadvantages: Investigator’s or observer’s own
bias, prejudice, desires may be reflected and
needs more resources and skilled human power
during the use of high level machines.
2. Self-administered Questionnaire & Interviews
• These are the most commonly used research data
collection techniques.
• Self-administered questionnaire is
– simpler and cheaper
– can be administered to many persons
simultaneously
– can be sent by post (unlike interviews)
• But requires a certain level of education and skill
on the part of the respondents
• People of a low socio-economic status are less
likely to respond
3. Face-to-face and telephone interviews
– An interview is a conversation for gathering
information. A research interview involves an
interviewer, who coordinates the process of the
conversation and asks questions, and an
interviewee, who responds to those questions.
– A good interviewer can stimulate and maintain
the respondent’s interest, and can create a
rapport (understanding) and atmosphere
conducive to the answering of questions.
– If anxiety is aroused, the interviewer can allay it. If
a question is not understood, the interviewer can
repeat and explain it.
4. Mailed Questionnaire Method
• The investigator prepares a questionnaire
pertaining to the field of inquiry, which is sent by
post to the informants together with a polite
covering letter explaining in detail the aims and
objectives of collecting the information
• The letter requests the respondents to cooperate by
furnishing the correct replies and returning the
questionnaire duly filled in
• Drawback: response rates tend to be relatively
low, and there may be under-representation of
less literate subjects
5. Use of Documentary Sources
• Includes clinical and other personal records,
death certificates, published mortality statistics,
census publications, etc.
• Examples:
- Official publications of CSA
- Publication of MoH and other Ministries
- Newspapers and Journals
- International publications (WHO, UNICEF)
- Records of Hospitals or any HI
6. Computer Direct Interviews
• These are interviews in which the Interviewees
enter their own answers directly into a computer.
• They can be used at malls, trade shows, offices,
and so on.
• The Survey System's optional Interviewing
Module and Interview Stations can easily create
computer-direct interviews. Some researchers
set up a Web page survey for this purpose.
Advantages
• The virtual elimination of data entry and editing
costs
• You will get more accurate answers to sensitive
questions
• Elimination of interviewer bias
• Ensuring skip patterns are accurately followed
• Response rates are usually higher
Disadvantages
• The Interviewees must have access to a
computer or one must be provided for them.
• As with mail surveys, computer direct
interviews may have serious response rate
problems in populations of lower
educational and literacy levels. This method
may grow in importance as computer use
increases.
Choosing a method of data collection
• Decision makers need information
that is relevant, timely, accurate
and usable
• The selection of the method of data collection
is also based on practical considerations,
such as:
– The need for personnel, skills, equipment, etc.
in relation to what is available, and the urgency with
which results are needed.
– The acceptability of the procedures to the
subjects: the absence of inconvenience,
unpleasantness, or untoward consequences.
– The probability that the method will provide
good coverage, i.e. will supply the required
information about all or almost all members of
the population or sample.
The choice of survey method will also depend
on several factors. These include:
• Speed: Email and Web page surveys are the fastest methods,
followed by telephone interviewing. Mail surveys are the
slowest.
• Cost: Personal interviews are the most expensive, followed by
telephone and then mail. Email and Web page surveys
are the least expensive for large samples.
• Computer and Internet usage: Web page and Email surveys offer significant
advantages, but you may not be able to generalize their
results to the population as a whole.
• Literacy levels: Illiterate and less-educated people rarely respond to mail
surveys.
• Sensitive questions: People are more likely to answer sensitive questions
when interviewed directly by a computer in one form or
another.
Designing Questionnaire
When designing a questionnaire the following
points should be taken into account
– Keep it (questions) short and simple (KISS)
– Questions should be unambiguous and not
double barreled
– Use simple and direct language. The
questions must be clearly understood by
respondent.
– The wording of a question should be simple
and to the point.
– The best kinds of questions are those which
allow a pre-printed answer to be ticked
– Questions should be neither irrelevant nor too
personal
– Leading questions shouldn’t be asked. A “leading
question” is one that suggests the answer.
– The questionnaire should be designed so that the
questions fall into a logical sequence.
– After finalizing the questionnaire,
translate it into the local languages to be used for data
collection.
– The last step in questionnaire design is the pilot: test the
questionnaire with a small number of interviews
before conducting the main interviews.
General Considerations
• To be successful, involve other experts and
relevant decision-makers in the questionnaire
design process
• Formulate a plan for doing the statistical
analysis during the design stage of the project
• If you used one method in the past and need
to compare results, stick to that method,
unless there is a compelling reason to change
Types of questions
Open-ended Questions:
- Permit free responses that should be recorded
in the respondent’s own words.
They are used for:
– Facts with which the researcher is not very
familiar
– Opinions, attitudes, and suggestions of
informants, or
– Sensitive issues
Closed Questions:
– Offer a list of possible options or answers
from which the respondents must choose.
– The options offered should be exhaustive
and mutually exclusive, and
– The number of options should be kept as small as
possible.
Interviewing technique
• Before the questionnaire is used for the data
collection, it should be pre-tested
• Manuals that explain each of the questions should
be prepared – question-by-question specification
• Enumerators and field supervisors should be
trained before they are deployed to the field
• Enumerators should create a good communication
environment with the respondents.
• They should explain the questions in the
questionnaire precisely to the respondent and should
not lead the respondent.
• There should be strong supervision of the field
work until it is completed.
Rules for asking questions
• Read questions as they are written
• Do not change the order of questions
• Read the questions slowly and clearly
• Read questions in a pleasant voice
• Maintain culturally appropriate eye contact
• Read the entire question to the respondent
• Do not skip questions
• Verify information given by the respondent
Interviewing tactics for sensitive questions
• Sensitive questions may offend the
respondents. They may:
– Expose the respondent’s ignorance
– Call for a socially unacceptable answer
– Cause embarrassment
Possible tactics (Barton)
– The everybody approach: “As you know, many
people have been arrested for being involved in
theft. Do you happen to have been arrested for being
involved in theft?”
– The other people approach: “Do you know anyone
who has been arrested for theft? How about yourself?”
– The Kinsey technique: stare firmly into the
respondent’s eyes and ask in simple, clear-cut
language such as that to which the respondent is
accustomed, and with an air of assuming that
everybody has done everything, “Have you ever been
arrested for being involved in theft?”
Informed consent
Participation in a survey should be voluntary, and a
respondent can refuse to be interviewed or
measured, etc.
The information given should be simple and clear
and adapted to the respondent’s level of
understanding.
Informed consent can be either signed or verbal
The interviewer is responsible for:
– explaining what the survey is about,
– providing all the necessary information, and
– making sure the respondent understands the
implications of his/her participation before
giving his/her consent.
• Consent must be documented by asking the
respondent to sign an Informed Consent Form
or to give verbal consent before the
interview.
– These forms must mention:
• who will be doing the study,
• the types of questions that will be asked,
• why the study is being done, and
• who will have access to the information
provided.
Module 1.2: Methods of data
processing, organization and
presentation
No. Ht Wt Sex age FEV No. Ht Wt Sex age FEV
1 175.2 79.2 1 57 3.80 16 177.5 69.7 1 32 4.10
2 164.5 92.4 6 60 3.50 17 164.0 719 2 58 3.15
3 168.5 64.6 1 62 1.48 18 174.0 63.2 1 45 4.25
4 180.0 82.6 1 43 4.35 19 161.0 60.0 2 59 2.75
5 156.0 79.9 2 13 2.70 20 169.5 63.3 3 53 3.32
6 170.0 80.9 1 61 2.35 21 181.5 101.3 1 37 4.20
7 170.0 79.7 1 67 149 22 173.0 72.9 1 47 4.45
8 162.0 57.4 1 63 2.95 23 473.6 55.9 2 39 3.65
9 177.0 98.1 1 46 4.20 24 178.2 39.2 1 70 3.05
10 285.0 61.6 2 47 2.45 25 159.0 63.5 2 42 3.20
11 156.0 60.0 2 43 2.10 26 149.0 69.2 2 58 29.3
12 157.0 62.0 3 34 3.41 27 159.0 80.3 2 63 2.45
13 150.0 51.8 2 49 2.70 28 190.0 883.0 1 60 4.65
14 154.0 58.1 2 47 2.45 29 175.0 85.0 7 41 3.75
15 165.0 70.6 1 79 3.10 30 168.7 855 1 60 3.15
Data cleaning and editing
• When the questionnaires are collected from the
field, they should be coded and edited
• Checks are basically of two sorts, range checks
and consistency checks.
Range checks: exclude, for example, the
erroneous occurrence of code 3 for sex,
which should only be code 1(male) or code
2(female).
Consistency checks: detect impossible
combinations of data
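The two kinds of checks can be sketched in Python on a few rows of the raw table above. This is only an illustration: the plausible ranges and the body-mass-index limits are assumptions chosen for the example, not values given in the slides, and the function names are hypothetical.

```python
# Range and consistency checks on a few rows of the raw (uncleaned) table.
# All limits below are illustrative assumptions, not values from the slides.
RANGES = {"ht": (120.0, 220.0), "wt": (30.0, 200.0), "fev": (0.3, 8.0)}
VALID_SEX = {1, 2}          # 1 = male, 2 = female
BMI_LIMITS = (15.0, 60.0)   # assumed plausible body-mass index (kg/m^2)

records = [
    {"no": 1, "ht": 175.2, "wt": 79.2, "sex": 1, "fev": 3.80},
    {"no": 2, "ht": 164.5, "wt": 92.4, "sex": 6, "fev": 3.50},
    {"no": 10, "ht": 285.0, "wt": 61.6, "sex": 2, "fev": 2.45},
    {"no": 17, "ht": 164.0, "wt": 719.0, "sex": 2, "fev": 3.15},
    {"no": 24, "ht": 178.2, "wt": 39.2, "sex": 1, "fev": 3.05},
]

def range_errors(rec):
    """Return the fields whose values fall outside their allowed range or codes."""
    errors = [f for f, (lo, hi) in RANGES.items() if not lo <= rec[f] <= hi]
    if rec["sex"] not in VALID_SEX:
        errors.append("sex")
    return errors

def consistency_errors(rec):
    """Flag an implausible height/weight combination (checked only if ranges pass)."""
    if range_errors(rec):
        return []
    bmi = rec["wt"] / (rec["ht"] / 100) ** 2
    return [] if BMI_LIMITS[0] <= bmi <= BMI_LIMITS[1] else ["ht/wt"]

for rec in records:
    print(rec["no"], range_errors(rec), consistency_errors(rec))
```

Record 24 passes every range check but fails the consistency check, which is the kind of error a range check alone cannot detect.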
Basic precautions recommended to
minimize errors during the handling of
data:
• Avoid any unnecessary copying of data from one
form to another
• Use a verification procedure during data entry -
range and skip rules, double data entry, etc.
• Check all calculations carefully, for example date
conversions, units of measurement, etc.
Data organization: Tables
The use of tables for presenting data involves
grouping the data into mutually exclusive categories
of the variable and counting the number of
occurrences in each category
• Tables should be as simple as possible and self-
explanatory
• Numerical entries of zero should be written
explicitly rather than indicated by a dash
• Totals should be shown either in the top row and
the first column or in the last row and last column
• If data are not original, their source should be
given in a footnote
Asthma versus sex and smoking

                      Presence of asthma
Sex and               No             Yes
smoking status        n      %       n      %      Total
Sex
  Female              459    91.6    42     8.4    501
  Male                439    93.0    33     7.0    472
  Total               898    92.3    75     7.7    973
Smoking
  Never smoker        480    91.4    45     8.6    525
  Ex-smoker           254    91.7    23     8.3    277
  Current smoker      164    95.9    7      4.1    171
  Total               898    92.3    75     7.7    973
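As a small sketch of how the row percentages in the asthma table arise from the counts, the Python snippet below recomputes them. The counts are copied from the table; the function name `row_percentages` is a hypothetical helper introduced for illustration.

```python
# (no asthma, asthma) counts copied from the table above.
counts = {
    "Female": (459, 42),
    "Male": (439, 33),
    "Never smoker": (480, 45),
    "Ex-smoker": (254, 23),
    "Current smoker": (164, 7),
}

def row_percentages(no, yes):
    """Return (% no, % yes, row total), with percentages rounded to one decimal."""
    total = no + yes
    return round(100 * no / total, 1), round(100 * yes / total, 1), total

for label, (no, yes) in counts.items():
    print(label, row_percentages(no, yes))
```

Each percentage is taken within its row, so the two percentages in a row add up to 100.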
Data presentation: Diagrams
• Allows readers to obtain an overall grasp of the
data presented.
• The relationship can be seen more quickly and
easily from a graph than from a table.
• The choice of one graph over another depends
on personal preference and/or the type of data.
– Bar charts and pie charts are commonly used for
quantitative discrete or qualitative data
– Histograms, frequency polygons, and line graphs
are used for quantitative continuous data
[Figure: Component bar graph of smoking status and presence of asthma. Horizontal axis: smoking status (never smoker, ex-smoker, current smoker); vertical axis: number of individuals; legend: asthma No / Yes.]
Pie-chart – smoking status (%)
Never smoker: 54%; Ex-smoker: 28%; Current smoker: 18%
Histogram for FEV1 data
[Figure: Neonatal Mortality Rate by Sex. Horizontal axis: surveillance year (2005 to 2011); vertical axis: NNMR per 1000 LB (0.0 to 70.0); series: Female, Male. Data labels in extraction order: 65.8, 34.2, 37.2, 46.3, 25.8, 29.0, 29.3, 50.2, 44.8, 49.0, 54.6, 41.4, 38.7, 34.3.]
General rules for constructing graphs
• Every graph should be self-explanatory and as
simple as possible
• Titles are usually placed below the graph
• Legends or keys should be used to differentiate
variables if more than one is shown
• Axis labels should be placed so that they read from
the left side and from the bottom
• The units into which the scale is divided should
be clearly indicated
• The numerical scale representing frequency
must start at zero or a break in the line should
be shown
Module 1.3: Data summarization
Data Exploration
• The exploration procedure produces summary
statistics and graphical displays
• The reasons for using the explore procedure are:
– data screening,
– outlier identification,
– description,
– assumption checking, and
– characterizing differences among
subpopulations (groups of cases).
No. Ht Wt Sex age FEV No. Ht Wt Sex age FEV
1 175.2 79.2 1 57 3.80 16 177.5 69.7 1 32 4.10
2 164.5 92.4 1 60 3.50 17 164.0 71.9 2 58 3.15
3 168.5 64.6 1 62 1.48 18 174.0 63.2 1 45 4.25
4 180.0 82.6 1 43 4.35 19 161.0 60.0 2 59 2.75
5 156.0 79.9 2 47 2.70 20 169.5 63.3 2 53 3.32
6 170.0 80.9 1 61 2.35 21 181.5 101.3 1 37 4.20
7 170.0 79.7 1 67 0.80 22 173.0 72.9 1 47 4.45
8 162.0 57.4 1 63 2.95 23 164.2 55.9 2 39 3.65
9 177.0 98.1 1 46 4.20 24 178.2 93.2 1 70 3.05
10 160.5 61.6 2 47 2.45 25 159.0 63.5 2 42 3.20
11 156.0 60.0 2 43 2.10 26 149.0 69.2 2 58 2.20
12 157.0 62.0 2 34 3.41 27 159.0 80.3 2 63 2.45
13 150.0 51.8 2 49 2.70 28 190.0 88.3 1 60 4.65
14 154.0 58.1 2 47 2.45 29 175.0 85.0 1 41 3.75
15 165.0 70.6 1 79 3.10 30 168.7 85.5 1 60 3.15
• Data screening may show that you have
unusual values, extreme values, gaps in
the data, or other peculiarities.
• Exploring the data can help to determine
whether the statistical techniques that you
are considering for data analysis are
appropriate.
• The exploration may indicate that you need
to transform the data if the technique
requires some known distribution, say the
Normal distribution.
Measures of Central tendency
- The arithmetic mean, median and mode
- The arithmetic mean is unique, takes into
account all data points and lends itself to
further manipulation, but it is sensitive to
extreme values
- The median is unique, does not depend on the exact
value of every data point and is not affected by extreme values
- The mode might not exist or might not be unique; it can
be determined for qualitative data
Exercise
• Calculate the mean, median and mode for
the whole sample, and sex-specific
summary values, using the data in the
table below
• Sex: 1 = Male, 2 = Female
• Height is measured in cm, weight in kg,
age in years and FEV in liters
Ht Wt Sex age FEV
175.2 79.2 1 57 3.80
164.5 92.4 1 60 3.50
168.5 64.6 1 62 1.48
180.0 82.6 1 43 4.35
156.0 79.9 2 47 2.70
170.0 80.9 1 61 2.35
170.0 79.7 1 67 0.80
162.0 57.4 1 63 2.95
177.0 98.1 1 46 4.20
160.5 61.6 2 47 2.45
156.0 60.0 2 43 2.10
157.0 62.0 2 34 3.41
150.0 51.8 2 49 2.70
154.0 58.1 2 47 2.45
165.0 70.6 1 79 3.10
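As a minimal sketch of the exercise, the Python snippet below computes the three measures for the height column of the table above, using only the standard library. The variable names are illustrative, and `multimode` is used instead of `mode` so that ties are reported rather than hidden.

```python
from statistics import mean, median, multimode

# Heights (cm) of the 15 subjects in the table above, in row order.
heights = [175.2, 164.5, 168.5, 180.0, 156.0, 170.0, 170.0, 162.0,
           177.0, 160.5, 156.0, 157.0, 150.0, 154.0, 165.0]

height_mean = mean(heights)        # sum of the values divided by their number
height_median = median(heights)    # middle value of the sorted data
height_modes = multimode(heights)  # every value tied for most frequent

print(round(height_mean, 2), height_median, height_modes)
```

For these 15 heights the mean is 164.38 cm and the median is 164.5 cm, while 156.0 and 170.0 each occur twice, so the mode is not unique. The same calls can be repeated on the rows with Sex = 1 and Sex = 2 to obtain sex-specific values.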
Summary values
Sex Age Ht Wt FEV
Male Mean 54.85 173.54 80.27 3.42
Median 59.94 174.00 80.90 3.75
Mode 32.47 170.00 57.40 4.20
Sum 932.47 2950.10 1364.60 58.13
n 17 17 17 17
Female Mean 49.16 158.40 64.42 2.81
Median 47.40 159.00 62.00 2.70
Mode 34.43 156.00 60.00 2.45
Sum 639.04 2059.20 837.50 36.53
n 13 13 13 13
Both Mean 52.38 166.98 73.40 3.16
Median 50.96 166.75 71.25 3.15
Mode 32.47 156.00 60.00 2.45
Sum 1571.51 5009.30 2202.10 94.66
n 30 30 30 30
Measures of Variation/Dispersion
• Dispersion of a set of observations refers to the
scatteredness of observations around a measure
of central tendency
Commonly used measures of variation:
Range, Percentiles, and Standard deviation.
Of these measures, only the standard deviation assesses
the scatter of observations around the mean; the range
and percentiles describe spread without reference to a central value
The Coefficient of Variation
When comparing the variability of two or more sets of
data, for the same or different variables, standard
deviations may lead to fallacious results.
• The variables involved might be measured in
different units, or might be different characteristics
• Coefficient of Variation (CV) is the standard
deviation expressed as a percentage of the
mean.
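In symbols, CV = 100 × (standard deviation / mean). The Python sketch below illustrates why the CV is useful for comparison: the same made-up heights expressed in centimetres and in metres have different standard deviations but the same CV. The data and function names are illustrative assumptions; the last line recomputes one CV from summary values (standard deviation 1.07, mean 3.42).

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100 * stdev(values) / mean(values)

def cv_from_summary(sd, avg):
    """CV computed from an already known standard deviation and mean."""
    return 100 * sd / avg

# Made-up heights, once in centimetres and once in metres.
heights_cm = [150.0, 160.0, 170.0]
heights_m = [1.5, 1.6, 1.7]

print(stdev(heights_cm), stdev(heights_m))   # depends on the unit
print(coefficient_of_variation(heights_cm),
      coefficient_of_variation(heights_m))   # unit-free
print(round(cv_from_summary(1.07, 3.42), 1))
```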
Use the above data to determine standard deviation and Coefficient of variation
Sex Age Ht Wt FEV
Male Mean 54.85 173.54 80.27 3.42
Variance 160.7 49.53 157.22 1.15
Std dev 12.68 7.04 12.54 1.07
CV 23.1 4.1 15.6 31.3
Range 46.06 28 43.9 3.85
Female Mean 49.16 158.4 64.42 2.81
Variance 74.16 32.65 74.78 0.24
Std dev 8.61 5.71 8.65 0.49
CV 17.5 3.6 13.4 17.4
Range 28.98 20.5 28.5 1.55
Both Mean 52.38 166.98 73.40 3.16
Variance 127.58 99.03 181.48 0.83
Std dev 11.3 9.95 13.47 0.91
CV 21.6 6.0 18.4 28.8
Range 46.06 41 49.5 3.85
Data transformations
• The assumptions underlying a statistical method
may not always be satisfied by a particular set of
data.
• For example, a distribution may be positively
skewed rather than normal. Such problems can
often be overcome simply by transforming the
data to a different scale of measurement
• The most common choice is the logarithmic
transformation
Logarithmic transformation
• When a logarithmic transformation is applied
to a variable, each individual value is replaced
by its logarithm.
y = log x
• Where x is the original value and y the
transformed value.
• The logarithmic transformation tends both to equalize
the standard deviations of the groups being compared and to remove
positive skewness (absence of symmetry)
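The effect on skewness can be illustrated with a short Python sketch. The data are made up for the example, and the `skewness` helper (the moment coefficient m3 / m2^1.5) is one common definition chosen here as an assumption; the slides do not specify a formula.

```python
import math

def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    avg = sum(values) / n
    m2 = sum((v - avg) ** 2 for v in values) / n
    m3 = sum((v - avg) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up positively skewed values, for illustration only.
x = [1, 2, 2, 3, 4, 8, 16, 64]
y = [math.log(v) for v in x]   # y = log x, applied to each individual value

print(round(skewness(x), 2), round(skewness(y), 2))
```

The transformed values are still positively skewed, but much less so than the original ones. The base of the logarithm does not matter for this comparison, because changing the base only rescales y.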
Choice of a transformation
• There are alternative transformations
• Reciprocal transformation: stronger than
the logarithmic, and appropriate if the
distribution is considerably more positively
skewed than lognormal.
y = 1/x
• Square root transformation: used when the
constant variance assumption does not hold
true. It is weaker than the logarithmic transformation.
$y = \sqrt{x}$
• Negative skewness can be removed by using a
power transformation, such as a square or a
cubic transformation; the strength increases with
the order of the power
Histogram & Normal curve with transformations
Module 2: Probability and
Probability Distributions
Probability Distributions
• Definition: A random variable is a numerical
quantity that takes different values with specified
probabilities.
• There are two types of random variables: discrete
and continuous.
• Definition: A random variable for which there
exists a discrete set of values with specified
probabilities is a discrete random variable.
• Example: Diarrhoea is one of the most frequent
reasons for visiting health institutions in the first 2
years of life in children.
• Let X be the random variable that represents the
number of episodes of diarrhoea in the first 2
years of life. Then X is a discrete random
variable, which takes on values 0,1,2, ....
• Definition: A random variable whose values form
a continuum (i.e., have no gaps) such that ranges
of values occur with specified probabilities is a
continuous random variable.
Probability Mass Function for a Discrete
Random Variable
• The values taken by a discrete random variable
and their associated probabilities can be expressed
by a rule, or relationship, that is called a probability
mass function (pmf).
• Definition: A pmf is a mathematical relationship, or
rule, that assigns to any possible value r of a discrete
random variable X the probability P(X = r). This
assignment is made for all values r that have
positive probability. The pmf is also referred to as the
probability distribution.
General rules which apply to any
probability distribution
1. Since the values of a probability distribution are
probabilities, they must be numbers in the
interval from 0 to 1.
2. Since a random variable has to take on one of
its values, the sum of all the values of a
probability distribution must be equal to 1.
• Example: Check whether the following function
can serve as the probability distribution of an
appropriate random variable
$f(x) = \frac{x + 2}{12}$ for x = 1, 2, and 3
Substituting the values of x, f(1)=3/12, f(2)=4/12,
and f(3)=5/12
Since none of these values is negative or greater
than one, and since their sum 3/12+4/12+5/12 = 1,
the given function is a probability distribution
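The same check can be sketched in Python with exact fractions, so that the sum is tested without rounding error. The helper name `is_probability_distribution` is hypothetical and simply encodes the two general rules stated above.

```python
from fractions import Fraction

def f(x):
    """Candidate probability function f(x) = (x + 2) / 12 from the example."""
    return Fraction(x + 2, 12)

def is_probability_distribution(probs):
    """Rule 1: every value lies in [0, 1]. Rule 2: the values sum to 1."""
    return all(0 <= p <= 1 for p in probs) and sum(probs) == 1

probs = [f(x) for x in (1, 2, 3)]
print(probs, is_probability_distribution(probs))
```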
Example on Hypertension-control:
• Suppose a physician agrees to use a new anti-
hypertensive drug on a trial basis on the first 4
untreated hypertensives whom she encounters in
her practice before deciding whether to adopt the
drug for routine use.
• Let X = the number of patients out of 4 who are
brought under control. Suppose that from
previous experience with the drug, for any clinical
practice, the drug company expects the following
probabilities.
r 0 1 2 3 4
P(X=r) .008 .076 .265 .411 .240
Example:
• For the above table, for any clinical practice, the
probability that between 0 and 4 hypertensives
are brought under control = 1, i.e.,
• 0.008 + 0.076 + 0.265 + 0.411 + 0.240 = 1
• What is the probability that:
– At least two patients are brought under control?
– At most three patients are brought under control?
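One way to answer the two questions is to add up the tabulated probabilities of the values that satisfy each condition. A minimal Python sketch, with the probabilities copied from the table and a hypothetical helper `prob`:

```python
# P(X = r) for r = 0..4, copied from the hypertension-control table.
pmf = {0: 0.008, 1: 0.076, 2: 0.265, 3: 0.411, 4: 0.240}

def prob(event):
    """Sum P(X = r) over all r for which event(r) is true."""
    return sum(p for r, p in pmf.items() if event(r))

at_least_two = prob(lambda r: r >= 2)    # P(X >= 2) = P(2) + P(3) + P(4)
at_most_three = prob(lambda r: r <= 3)   # P(X <= 3) = 1 - P(4)

print(round(at_least_two, 3), round(at_most_three, 3))
```

This gives P(X ≥ 2) = 0.916 and P(X ≤ 3) = 0.760.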
1. Binomial distribution
• The Binomial distribution with parameters n and
p is a discrete probability distribution of the
number of successes in a sequence of n
independent binary (yes/no) experiments, each of
which yields success with probability p.
• A useful summary measure, used to describe
binary variables, is the proportion with which the
variable took one of its values, called success.
• The binomial distribution is used to model the
number of successes in a sample of size n drawn
with replacement from a population of size N.
The Binomial Distribution
• Definition: The distribution of the number of
successes (r) in n statistically independent trials,
where the probability of success on each trial is
P, is known as the binomial distribution, and has
a probability mass function given by:
$$P(X = r) = \binom{n}{r} P^{r} (1 - P)^{n - r}, \quad r = 0, 1, 2, \ldots, n$$
where
$$\binom{n}{r} = \frac{n!}{r!\,(n - r)!}$$
• The mean is nP and the variance is nP(1 − P)
Probability mass function for the binomial
distribution
Example:
• What is the probability of obtaining 2 boys out of
5 children if the probability of a boy is 0.51 at
each birth and the sexes of successive children
are considered independent random variables?
• n=5, p=0.51, 1-p=0.49 and r=2
$$P(X = 2) = \binom{5}{2} (0.51)^{2} (0.49)^{3} = \frac{5!}{2!\,3!} (0.51)^{2} (0.49)^{3} = 0.306$$
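The calculation can be checked with a few lines of Python. This is a sketch using only the standard library (`math.comb` requires Python 3.8 or later); the function name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) = C(n, r) * p**r * (1 - p)**(n - r)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 5, 0.51                       # five births, P(boy) = 0.51
p_two_boys = binomial_pmf(2, n, p)   # probability of exactly 2 boys
total = sum(binomial_pmf(r, n, p) for r in range(n + 1))  # all outcomes
mean, variance = n * p, n * p * (1 - p)

print(round(p_two_boys, 3), round(total, 6), mean, round(variance, 4))
```

The probabilities over r = 0, …, 5 sum to 1, and the expected number of boys is np = 2.55 with variance np(1 − p) = 1.2495.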
Continuous Probability Distribution
• A continuous probability distribution is a smooth
density curve that models the distribution of a
continuous random variable.
• The area under the curve is 1, and the area
within any interval is the
probability that the value of the random variable
is in that interval.
• Density function is a formula used to represent
the distribution of a continuous random variable.
Definition
• A nonnegative function f(x) is the probability
density function of a continuous random variable if:
– the total area bounded by its curve and the x-
axis is equal to one, and
– the subarea under the curve bounded by the curve, the x-axis and
the perpendiculars erected at any two points a and b
gives the probability that x is between a and b
2. Normal distribution
• The Normal distribution, also called the Gaussian
distribution, is the most important
distribution in all of statistics.
• The normal density is given by:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty$$
where π = 3.141… and e = 2.718…
Characteristics
1. It is symmetrical about its mean
2. Mean, median and mode are equal
3. The total area under the curve above the x
axis is one square unit
4. The area within one SD of the mean in both directions
is approximately 68% of the total area
5. The maximum height of the curve, at the mean, is $\frac{1}{\sigma\sqrt{2\pi}}$
6. The normal distribution is determined by two
parameters, the mean and the standard deviation.
[Figure: The Normal distribution curve, with mean μ and standard deviation σ.]
The standard Normal distribution
• Definition: A normal distribution with mean 0
and variance 1 will be referred to as a standard,
or unit, normal distribution. This distribution is
denoted by N(0,1).
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^{2}}, \qquad -\infty < z < +\infty$$
This distribution is symmetrical about 0 (the mean),
since f(z) = f(-z). About 68% of the area under the
standard normal density lies between -1 and +1, about
95% lies between -2 and +2, and about 99% lies between
-2.5 and +2.5
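The quoted areas can be reproduced numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `area_within` is only illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def area_within(k: float) -> float:
    """Area under the standard normal density between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 2.5):
    print(k, round(area_within(k), 4))
```

The three areas come out at roughly 0.68, 0.95 and 0.99, matching the rounded figures on the slide.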
97
Application of Normal distribution
• Example:
Suppose it is known that the heights of a population
of individuals are approximately normally
distributed with a mean of 70 inches and a standard
deviation of 3 inches. What is the probability that
a person picked at random from this group will be
a) between 65 and 74 inches tall?
b) taller than 75 inches?
c) shorter than 65 inches?
98
Solution
Step 1: Transform this to the standard normal
distribution by using $z = \frac{x - \mu}{\sigma}$
Step 2: Determine the area under the curve
bounded by the curve, the x-axis and the two points,
P(a < z < b).
Step 3: Look up the z distribution table for the
area corresponding to each value of z.
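As a sketch of these three steps applied to the height example (mean 70 inches, SD 3 inches), the standard normal table lookup can be replaced by `statistics.NormalDist().cdf`. The variable names are illustrative only.

```python
from statistics import NormalDist

mu, sigma = 70, 3
phi = NormalDist().cdf  # standard normal cumulative distribution function

def z_score(x: float) -> float:
    """Step 1: transform x to the standard normal scale."""
    return (x - mu) / sigma

# Steps 2 and 3: areas under the standard normal curve
p_between = phi(z_score(74)) - phi(z_score(65))  # a) P(65 < X < 74)
p_above = 1 - phi(z_score(75))                   # b) P(X > 75)
p_below = phi(z_score(65))                       # c) P(X < 65)

print(round(p_between, 3), round(p_above, 3), round(p_below, 3))
```

Parts b) and c) give the same probability because 75 and 65 are equally far from the mean and the curve is symmetric.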


 


99
3. The t-distribution
• The t-distribution is a family of continuous
probability distributions that arise when
estimating the mean of a normally distributed
population in situations where the sample
size is small and population standard
deviation is unknown.
• Whereas a normal distribution describes a full
population, t-distributions describe samples
drawn from a full population; accordingly,
the t-distribution for each sample size is
different.
100
The t-distribution
• The t-distribution is similar in shape to the
Normal distribution but is more spread out with
longer tails than the standard Normal.
• It is symmetrical about zero, its mean. The
variance is σ² = k/(k-2) for k > 2, where k = df. The mean
does not exist for k = 1, and the variance does not exist for k = 1, 2
• The df increases with the sample size. As the
sample size increases, the shape of the t-
distribution becomes increasingly more like the
standard Normal distribution.
• It is used for estimation of means.
101
The t-distribution
$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$



102
The t-distribution
ν = n−1 degrees of freedom
103
Module 3.1:
Sampling methods and
Sample size estimation
104
Why sample?
• It is usually not cost effective or practicable to
collect and examine all the data that might be
available.
• Instead it is often necessary to draw a sample of
information from the whole population to enable
the detailed examination required to take place.
• Sampling provides a means of gaining
information about the population without the
need to examine the population in its entirety.
105
• Purposes of sampling: Provides various
types of statistical information of a
qualitative or quantitative nature about the
whole by examining a few selected units.
• Advantages of sample based studies
– Cost effectiveness
– Timeliness
– Inaccessibility of some people
– Less destructive in data summarization
– Accuracy
106
Caveats
• Sampling can provide a valid, defensible
methodology but it is important to match
the type of sample needed to the type of
analysis required.
• The auditor should also take care to check
the quality of the information from which
the sample is to be drawn. If the quality is
poor, sampling may not be justified.
107
Sampling Designs
• Sample design covers the method of selection, the
sample structure and plans for analysing and
interpreting the results.
• Sample designs can vary from simple to complex
and depend on the type of information required and
the way the sample is selected.
• The design will impact upon the size of the sample
and the way in which analysis is carried out. In
simple terms the tighter the required precision and
the more complex the design the larger the sample
size.
108
Sampling Designs
• The design may make use of the characteristics
of the population, but it does not have to be
proportionally representative.
• It may be necessary to draw a larger sample
than would be expected from some parts of the
population;
• For example, to select more from a minority
grouping to ensure that we get sufficient data for
analysis on such groups.
109
Sampling Designs
• The aim of the design is to achieve a
balance between the required precision
and the available resources.
110
Definition of terms
• Sample – Subset of the population of interest
• Sampling – process of selecting units from
the population of interest so that by studying
the sample we generalize our result back to
population.
• Sampling can provide a valid, defensible
methodology but it is important to match the
type of sample needed to the type of analysis
required.
111
• Population - Finite or infinite set of objects
whose properties are to be studied.
• Study population/sample population –
subset of target population chosen so as to be
representative of the total population
• Sampling unit - unit of selection in the
sampling process.
• Study unit – subject on which information is
collected.
112
Conditions that need to be met
The sample must be well chosen – Representative
– the method of choosing the sample matters
– the best methods involve the planned
introduction of chance
– A sampling procedure should be fair, selecting
people for inclusion in the sample in an impartial
way, so as to get a representative cross section of
the public – No selection bias
When a selection procedure is biased, taking a large
sample does not help. This just repeats the basic
mistake on a large scale
113
Conditions …
A sample chosen in a haphazard fashion, or
because it is ‘handy’, is unlikely to be a
representative one. This kind of sample may be
used in exploratory surveys to get a ‘feel’ for
the situation
The sample must be sufficiently large –
Sample size
There must be adequate coverage of the sample
– Response rate
– Non-respondents can be very different from
respondents. When there is a high non-response
rate, look out for non-response bias.
114
Is a sample any good?
Some samples are really bad. To find out
whether a sample is any good, ask:
1. How was it chosen?
2. Was there selection bias?
3. Was there non-response bias?
These questions might not be answered just
by looking at the data
115
Sampling techniques/methods
• Sampling is the process of selecting a number of
study units from a defined study population.
• Clearly define study population and study unit
– Study population – individuals, households,
institutions, records, etc…
– Study units – an individual, a household, an
institution or a record
116
Sampling cont…
• Types: probability and non-probability
– Probability – quantitative studies
– Non-probability – qualitative studies
• Probability sampling technique:
– Involves using random selection procedures to ensure that each
unit of the sample is chosen on the basis of chance.
– All units of the study population should have an equal, or at
least a known non-zero chance of being included in the sample.
– Sample drawn in such a way that it is representative of the
population
– The type to be used depends on population composition and
availability of sampling frame
117
Sampling cont…
Probability sampling methods include:
– Simple random sampling
– Systematic sampling
– Stratified sampling
– Cluster sampling
– Multistage sampling
118
1. Simple random sampling
• Selecting required number of sampling units
randomly from list of all units
– Up-to-date Sampling frame
– Random selection – manually using table of random
numbers or using computer programs
• E.g. 250 households from list of 9000 households
• Better representativeness but costly and
representativeness reduced in heterogeneous
population
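A computer-based random selection for the household example might look like the sketch below. The sampling frame is hypothetical (household IDs 1 to 9000), and the seed is fixed only so the draw can be repeated.

```python
import random

# Hypothetical sampling frame: household IDs 1..9000
households = list(range(1, 9001))

rng = random.Random(2024)  # fixed seed, used only for reproducibility
sample = rng.sample(households, k=250)  # simple random sample without replacement

print(len(sample), sorted(sample)[:5])
```

`random.sample` draws without replacement, so no household can be selected twice.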
119
2. Systematic sampling
• Sampling units are selected at regular intervals. The
starting unit is selected randomly
• Example: to select a sample of 100 students from
2500, first calculate sampling interval=2500/100=25.
Then randomly select the first student from among the
first 25 and finally pick every 25th student thereafter
• Easier and less time consuming
• Can be done without sampling frame – sequential
studies
• Risk of bias if there is cyclic repetition
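The student example can be sketched as follows. It assumes the students are numbered 1 to 2500 in list order; the function name `systematic_sample` is illustrative.

```python
import random

def systematic_sample(population_size: int, sample_size: int, rng: random.Random) -> list:
    """Pick a random start within the first interval, then every k-th unit."""
    interval = population_size // sample_size  # sampling interval k
    start = rng.randint(1, interval)           # random start between 1 and k
    return [start + i * interval for i in range(sample_size)]

rng = random.Random(7)  # fixed seed, used only for reproducibility
students = systematic_sample(2500, 100, rng)
print(students[:4])
```

Only the starting point is random; every later selection is fixed by the interval, which is why a cyclic pattern in the list can bias the sample.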
120
3. Stratified sampling
• Used when the population structure consists of distinct
subgroups/strata
• Ensures proportions of individuals with certain
characteristics in the sample will be the same as those
in the whole population
– Representation of groups with different characteristics
• The study population must be divided into strata of
the characteristic (Example: residence, age, sex,
profession) and then random or systematic samples
are obtained from each stratum
121
3. Stratified sampling cont.
• Depending on the need, samples from each stratum
can be drawn either proportional to their size or
non-proportionally/equal size from each stratum
– Proportional - using the same sampling fraction (n/N) in each stratum
– Equal size – to represent small groups
• Improved representativeness
• Estimates can be obtained for each stratum and the
population
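Proportional allocation can be sketched as below. The stratum names and sizes are hypothetical, chosen only to illustrate applying the same sampling fraction n/N to each stratum.

```python
import random

def proportional_allocation(strata_sizes: dict, n: int) -> dict:
    """Split a total sample of n across strata in proportion to stratum size."""
    total = sum(strata_sizes.values())
    return {name: round(n * size / total) for name, size in strata_sizes.items()}

strata_sizes = {"urban": 6000, "rural": 3000}  # hypothetical strata
allocation = proportional_allocation(strata_sizes, n=300)

# Draw a simple random sample of unit indices within each stratum
rng = random.Random(1)
sample = {name: rng.sample(range(strata_sizes[name]), k) for name, k in allocation.items()}
print(allocation)
```

With sizes that do not divide evenly, rounding can make the stratum samples add up to slightly more or less than n, so the total should be checked.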
122
4. Cluster sampling
• Groups of study units (clusters) instead of individual
study units are selected at a time
• Assumes homogeneity of population with respect to the
characteristic to be measured
• All the study units in the selected clusters are
included in the study
• Used in geographically scattered areas where visiting
dispersed study units is time consuming and costly
• Example: a simple random sample of 5 villages from
30 villages
• Easier but less representative
123
5. Multistage sampling
• Carried out in stages – primary sampling units (PSU), secondary sampling units (SSU), …
• Used in very large and diverse populations
• The method used in most community-based big
studies
• E.g. In a study to be undertaken in a big town the
sampling may involve stages like selection of
kefetegnas, kebeles and finally houses
• Representativeness and reduced cost
124
5. Multistage sampling
• The larger the number of clusters, the greater is
the likelihood that the sample will be
representative.
• Further, the sampling units at community level
should be selected randomly (avoid convenience
sampling!).
125
Bias in sampling
• Bias in sampling is a systematic error in
sampling procedures, which leads to a distortion
in the results of the study.
• Bias can be introduced as a consequence of
improper sampling procedures, which result in
the sample not being representative of the study
population.
126
Bias …
• There are several possible sources of bias that
may arise when sampling. The most well known
source is non-response.
• Non-response can occur in any interview
situation
• Respondents may refuse or forget to fill in the
questionnaire
• The problem lies in the fact that non-respondents
in a sample may exhibit characteristics that differ
systematically from the characteristics of
respondents.
127
Bias …
There are several ways to deal with this problem and
reduce the possibility of bias:
1. Data collection tools should be pre-tested.
2. If non-response is due to absence of the subjects,
follow-up of non-respondents may be considered.
3. If non-response is due to refusal to co-operate, an
extra, separate study of non-respondents may be
considered in order to identify to what extent they
differ from respondents.
4. Include additional people in the sample, so that non-
respondents can be replaced if their absence was
very unlikely to be related to the topic being studied.
128
Bias …
Other sources of bias in sampling:
Studying volunteers only – volunteers are
motivated to participate in the study.
Sampling of registered patients only –
Patients reporting to a clinic are likely to
differ systematically from people seeking
alternative treatments
Seasonal bias.
Tarmac bias – easily accessible by car.
129
Non-probability sampling methods
Quota Sampling: Each data collector is assigned
a fixed quota of subjects to interview; the number
falling into certain categories (like residence, sex,
age, etc.) are also fixed. Within those quotas, the
interviewers are free to select anybody they like.
From a common-sense point of view, quota sampling
looks good. It seems to guarantee that the sample
will be like the population with respect to all the
important characteristics that affect the variable of
interest.
130
In quota sampling, the sample is hand-picked
to resemble the population with respect to
some key characteristics. The method
seems reasonable, but does not work very
well. The reason is unintentional bias on
the part of the interviewers.
131
Other non-probability sampling methods
• Purposive sampling
• Snowball or chain sampling
• Extreme case sampling
• Maximum variation sampling
• Homogeneous sampling
• Critical case sampling
132
Sample size estimation
• How many subjects are needed in the sample
to draw conclusions about the whole
population?
– Depends on expected variation in the data and
number of units per cell for analysis
– The eventual sample size is a compromise between
what is desirable and what is feasible
133
Sample size cont…
• Minimum sample size can be calculated
depending on the objective of the study
– Estimation of population parameter with certain
precision
• Single variable estimation (single population mean,
proportion or rate)
• Descriptive studies - Prevalence, coverage and utilization
rate studies
– Test of significant difference between groups
• Analytic studies - comparative cross-sectional, case-
control, cohort and clinical trials
134
Sample size - single proportion
• For making a confidence limit statement (such as in a
prevalence study), the following formula can be used
to estimate minimum sample size:
$$n = \frac{Z_{1-\alpha/2}^{2}\, P(1-P)}{d^{2}}$$
• For population <10,000, use finite population
correction:
$$n_f = \frac{N\, Z_{1-\alpha/2}^{2}\, P(1-P)}{d^{2}(N-1) + Z_{1-\alpha/2}^{2}\, P(1-P)}$$
135
Single proportion cont…
• Parameters in the formula
– n is the minimum sample size
– P is the estimate of the prevalence rate for the
population
• Taken from available data or a pilot study result; if neither
is available, use 0.5, which gives the largest minimum sample
size. If P is given as a range, take the value closest to 0.5.
– d is the margin of sampling error tolerated
– Z1-α/2 is the standard normal value at the 100(1-α)%
confidence level; α is mostly taken to be 5%
• Usually the 95% confidence level is used, for which Z1-α/2 = 1.96
– N is the population size
136
Exercise
• What sample size do we need to estimate the
prevalence of HIV among residents of a town such
that the error of estimation is within 1% of its actual
parameter with 95% confidence?
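One way to work this exercise is sketched below. It assumes no prior estimate of prevalence is available, so P = 0.5 is used, and it takes d = 0.01. The function name and the population size of 8000 in the second call are hypothetical.

```python
from math import ceil
from statistics import NormalDist
from typing import Optional

def sample_size_single_proportion(p: float, d: float, confidence: float = 0.95,
                                  population: Optional[int] = None) -> int:
    """Minimum n for estimating one proportion, with optional finite population correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    core = z**2 * p * (1 - p)
    if population is None:
        n = core / d**2
    else:
        n = population * core / (d**2 * (population - 1) + core)
    return ceil(n)

n = sample_size_single_proportion(p=0.5, d=0.01)  # assumed P = 0.5
n_fpc = sample_size_single_proportion(p=0.5, d=0.01, population=8000)  # hypothetical N
print(n, n_fpc)
```

Under these assumptions the uncorrected answer is 9604, and the finite population correction lowers it when the town is small.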
137
Measuring prevalence for more than one
item in one group
• Take estimated prevalence of the most important item
to be measured or
• Determine sample size for each item/specific
objective and then
– Take estimated prevalence of the item that gives
the maximum sample size
138
Sample size-two proportion
For a test-of-significance study the following formula can
be used:
$$n = \frac{\left(Z_{1-\alpha/2} + Z_{1-\beta}\right)^{2}\left[p_1(1-p_1) + p_2(1-p_2)\right]}{(p_1 - p_2)^{2}}$$
Parameters:
n - size of sample in each group
P1, P2 – estimated population prevalence in the
comparison groups
β = 1 - Power (power is the probability that, if the two
proportions differ, the test will produce a significant difference)
– Usually a power of 80% or 90% is used
139
Exercise
A study is designed to assess the difference in the
proportion of physicians leaving health services in
urban and rural areas. From available literature 30% and
15% of physicians are estimated to leave services in
rural and urban areas within three years of graduation
respectively. What sample size is required for the study?
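A sketch of this exercise follows. The slide does not state the significance level or power, so α = 0.05 (two-sided) and 80% power are assumed here. The function name is illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(p1: float, p2: float,
                                alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size per group for comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{1-alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # Z_{1-beta}
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)

# 30% of rural and 15% of urban physicians assumed to leave within three years
n_per_group = sample_size_two_proportions(0.30, 0.15)
print(n_per_group)
```

Under these assumptions about 118 physicians are needed in each group; asking for 90% power increases the figure.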
140
Sample size – case-control studies
• Formula –
$$n = \frac{\left[z_{1-\alpha/2}\sqrt{\left(1+\frac{1}{m}\right)p'(1-p')} + z_{1-\beta}\sqrt{\frac{p_1(1-p_1)}{m} + p_0(1-p_0)}\right]^{2}}{(p_1 - p_0)^{2}}$$
• Parameters:
– P1, P0 – estimated prevalence of exposure in the cases
and controls respectively
– P0 can be estimated as the population prevalence of
exposure
– P′ – derived from P1, P0, m and odds ratio
– OR : odds ratio of exposures between cases and
controls
– m : number of control subjects per case subject
141
Exercise
• Example: Suppose you want to test for a
difference in exposure status between cases and
controls at 95% confidence level and with power of
80% using a 1:1 ratio of cases to controls while
looking for an odds ratio of 2. You assume the
prevalence of exposure among controls is 25%. What
sample size do you need?
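The exercise can be sketched as below. The slide does not spell out how P1 and P′ are derived, so two common definitions are assumed: P1 = P0·OR / (1 + P0(OR − 1)) and P′ = (P1 + m·P0)/(m + 1). No continuity correction is applied, and the function name is illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_case_control(p0: float, odds_ratio: float, m: int = 1,
                             alpha: float = 0.05, power: float = 0.80) -> int:
    """Number of cases needed in an unmatched case-control study (no continuity correction)."""
    # Assumed derivations, not stated on the slide:
    p1 = p0 * odds_ratio / (1 + p0 * (odds_ratio - 1))  # exposure prevalence among cases
    p_prime = (p1 + m * p0) / (m + 1)                   # weighted average exposure
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    term_null = z_alpha * sqrt((1 + 1 / m) * p_prime * (1 - p_prime))
    term_alt = z_beta * sqrt(p1 * (1 - p1) / m + p0 * (1 - p0))
    return ceil((term_null + term_alt) ** 2 / (p1 - p0) ** 2)

n_cases = sample_size_case_control(p0=0.25, odds_ratio=2, m=1)
print(n_cases)
```

Under these assumptions the implied exposure prevalence among cases is 40%, and about 152 cases (with 152 controls) are needed.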
142
Sample size-two proportion
• More than one comparison variable – take the one
with the smallest estimated difference
– To get largest sample size
• Different formulae
– Case-control studies
– Matched studies
– Survival analysis
– Other cases
• Reference
– http://www.statsdirect.com/help/sample_size_and_methods/sms.htm
143
Five key factors
1. Confidence level: how certain you want to be that the
population figure is within the sample estimate and its
associated precision.
2. Variability in the population: the SD is the most usual
measure and often needs to be estimated.
3. Margin of error or precision: a measure of the possible
difference between the sample estimate and the actual
population value.
4. The population proportion: the proportion of items in
the population displaying the attributes that you are
seeking.
5. Population size: only important if the sample size is
greater than 5% of the population in which case the
sample size reduces.
144
Sample size – other considerations
• Non-response
– Add contingency – say 10%
• More – sensitive topic, self-administered questionnaire
(up to 30%)
– Response rate for
• Cross-sectional survey >85%
• Cohort - >60-80%
• Sampling technique
– In complex samples (cluster, multistage) increase the
sample size to account for design effect
145
Sample size – other considerations cont.
– Design effect - ratio of the variance of an estimate derived from
a complex sampling design to the variance of the estimate
from a simple random sample of the same size
– Usually sample size is multiplied by 2 (or 1.5) in cluster
sampling
• Increase – large PSU, many stages, clustered variable
• Qualitative methods – estimate, not determined
• Better to have good quality data than large sample
after a certain point
• Better to have representative than large sample
– Use representative sampling techniques
146
Sampling distribution
Definition: A parameter is a numerical descriptive
measure of a population (e.g. μ). A statistic is a
numerical descriptive measure of a sample (e.g. X̄).
To each sample statistic there corresponds a
population parameter. We use X̄, S², S, p, etc. to
estimate μ, σ², σ, P (or π), etc.
147
Sampling distribution of Means
• The sampling distribution of means is one of the
most fundamental concepts of statistical
inference, and it has remarkable properties.
• Since it is a frequency distribution, it has its own
mean and standard deviation
Example: let a population of size 6 have the following values for
weight of individuals: 55.7, 66.7, 85.5, 79.7,
122.4 and 78.1. Select all possible samples of size
3 from this population, check whether the sample mean
is an unbiased estimate of the population mean, and
calculate the standard error of the sample mean.
148
Measurements of weight of individuals of
the population
Population values: 55.7 66.7 85.5 79.7 122.4 78.1
Sum of observations (ΣX): 488.1
Population mean: $\mu = \frac{\sum X}{N} = 81.35$
Population SD: $\sigma = \sqrt{\frac{\sum (X-\mu)^{2}}{N}} = 20.77$
All possible unique samples: $\binom{N}{n} = \binom{6}{3} = 20$
149
Sample Obs1 Obs2 Obs3 Mean
S1 55.7 66.7 85.5 69.30
S2 55.7 66.7 79.7 67.37
S3 55.7 66.7 122.4 81.60
S4 55.7 66.7 78.1 66.83
S5 55.7 85.5 79.7 73.63
S6 55.7 85.5 122.4 87.87
S7 55.7 85.5 78.1 73.10
S8 55.7 79.7 122.4 85.93
S9 55.7 79.7 78.1 71.17
S10 55.7 122.4 78.1 85.40
S11 66.7 85.5 79.7 77.30
S12 66.7 85.5 122.4 91.53
S13 66.7 85.5 78.1 76.77
S14 66.7 79.7 122.4 89.60
S15 66.7 79.7 78.1 74.83
S16 66.7 122.4 78.1 89.07
S17 85.5 79.7 122.4 95.87
S18 85.5 79.7 78.1 81.10
S19 85.5 122.4 78.1 95.33
S20 79.7 122.4 78.1 93.40
Sum of means 1627.00
Mean of means 81.35
Variance of means 86.27
SD of sample means 9.29
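Sample mean: $\bar{X} = \frac{\sum X}{n}$
Sample variance: $S^{2} = \frac{\sum (X - \bar{X})^{2}}{n-1}$
Mean of sample means: $\mu_{\bar{X}} = \frac{\sum \bar{X}}{\binom{N}{n}}$
Standard deviation of $\bar{X}$ (standard error of $\bar{X}$): $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$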
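The table above can be reproduced by enumerating every sample of size 3. This is a sketch using only the Python standard library; the variable names are illustrative.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

weights = [55.7, 66.7, 85.5, 79.7, 122.4, 78.1]
N, n = len(weights), 3

# Mean of each of the C(6, 3) = 20 possible samples
sample_means = [mean(s) for s in combinations(weights, n)]

mean_of_means = mean(sample_means)
se_empirical = pstdev(sample_means)  # SD of the 20 sample means
se_formula = pstdev(weights) / sqrt(n) * sqrt((N - n) / (N - 1))  # sigma/sqrt(n) with finite population correction

print(len(sample_means), round(mean_of_means, 2), round(se_empirical, 2), round(se_formula, 2))
```

The mean of the sample means equals the population mean (81.35), which is what unbiasedness means here, and the SD of the sample means (9.29) agrees with the formula.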
150
Properties
1. The mean of the sampling distribution of means
is the same as the population mean, μ
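2. The SD of the sampling distribution of sample
means (the standard error) is $\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$ when sampling
without replacement from a finite population of size N; it is
≈ σ/√n when N is large relative to n
3. The sampling distribution of sample means is
approximately normal, regardless of the shape
of the population distribution, provided n is large
enough (n > 30) (Central limit theorem).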
151
Module 3.2: Estimation
and Hypothesis Testing
152
Descriptive result
153
Estimation
Definition
Calculating some statistics from sample data
that is offered as an approximation of the
corresponding parameter of the population
from which the sample was drawn.
154
Cont…
Estimator: a method or rule used to compute
an estimate.
An estimator needs to have the characteristic of
unbiasedness.
• An estimator T of the parameter x is said to be an
unbiased estimator of x if E(T) = x.
155
Cont…
• Estimation is calculating, from sample data, some statistic
that offers an approximation for the corresponding
parameter of the population from which the sample is
drawn.
• Properties of good estimators
– Unbiased: An estimator is said to be unbiased if, on
average over repeated samples, it takes on the value of
the population parameter
– Efficiency: An estimator is said to be efficient if in the
class of unbiased estimators it has minimum variance
– Consistency: A sequence of estimators is said to be
consistent if it converges in probability to the true value
of the parameter
– Sufficiency: an estimator is sufficient if it uses all the
sample information
156
Estimation methods
• Point estimate:
a single numeric value used to estimate the
corresponding population parameter.
Frequently used point estimators (sample statistics):
Sample statistic              Corresponding population parameter
sample mean                   population mean
sample variance               population variance
sample standard deviation     population standard deviation
sample proportion             population proportion
157
Interval Estimate
• Interval estimate:
Two numerical values defining a range of
values that, with a specified degree of
confidence, we feel include the parameter
being estimated.
158
Cont…
• Even if the sample mean is a good estimator,
it is better to report an interval that conveys the
probable magnitude of the population mean.
• Confidence intervals are about putting some
bounds on how far away the truth might be from
your estimate.
• The sample mean is the best unbiased estimator of the population mean.
159
Cont…
• If the sample is drawn from a normally distributed
population, the sampling distribution of the sample mean will be normal.
• Even if the distribution of the population is non-normal,
the sampling distribution will be approximately normal
if the sample size is sufficiently large.
• Ninety-five percent (95%) of the possible values of $\bar{x}$
will lie within two standard errors of μ, that is, within
$\mu \pm 2\sigma_{\bar{x}}$ (estimated from the sample by $\bar{x} \pm 2 s_{\bar{x}}$)
160
Interval estimator component
• Reliability coefficient: the value of Z or t that multiplies the
standard error:
$$\bar{x} \pm z\,\frac{\sigma}{\sqrt{n}} \qquad \text{or} \qquad \bar{x} \pm t\,\frac{s}{\sqrt{n}}$$
• Standard error – measure of sample mean
variability in repeated sampling.
161
Standard Error of the Mean
• It helps us to quantify in some way how good our
estimate of the mean is of the true, & unknown,
population mean- how large an error might we
be making
• Standard error of the sample mean is SD/√n, and it
is:
• The error that arises from variability in the sample
means
• It indicates the variability of the distribution of
means of samples caused by sampling error
and measurement error.
162
Confidence interval
• The confidence interval provides a range that is
highly likely (often 95% or 99%) to contain the
true population value, or parameter that is being
estimated.
• The narrower the interval the more informative is
the result. It is usually calculated using the point
estimate and its standard error.
163
• Provide an interval around our estimate
showing how much error there might be
either side of the estimate
lower confidence limit | estimate | upper confidence limit
164
Interval estimate for mean:
one sample situation
• Confidence interval of the mean with known
population standard deviation:
$$\bar{x} \pm z_{1-\alpha/2}\,SE(\bar{x}) = \bar{x} \pm Z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
• Confidence interval of the mean with unknown
population standard deviation for small sample
size:
$$\bar{x} \pm t_{1-\alpha/2}(df)\,se(\bar{x}) = \bar{x} \pm t_{1-\alpha/2}(n-1)\,\frac{s}{\sqrt{n}}$$
165
Cont…
Interpretation of confidence interval
• Probabilistic: in repeated sampling from a
normally distributed population with known SD (σ),
100(1-α)% of all intervals constructed this way will in
the long run include the population mean
• Practical: when sampling from a normally
distributed population with known SD (σ), we are
100(1-α)% confident that the single computed interval
contains the population mean.
166
Cont…
• Commonly used values of the confidence coefficient are
0.9, 0.95 & 0.99, with associated reliability coefficient
values of 1.645, 1.96 and 2.58 respectively for the
standard normal random variable (Z).
• Precision:
The quantity obtained by multiplying the reliability
factor by the SE of the mean is called the margin of
error.
167
Computing a 95 and 99% CI for μ
• Given x̄ = 19.26, σ = 2.52 and n = 117
• At 95% confidence level, α = 0.05 (α/2=0.025) and at 99%
α = 0.01 (α/2=0.005)
• Z0.975 = 1.96 and Z0.995 = 2.58
95% CI for μ becomes
• 19.26 ± 1.96*2.52/√117 = (18.80 ≤ μ ≤ 19.72)
99% CI for μ becomes
• 19.26 ± 2.58*2.52/√117 = (18.66 ≤ μ ≤ 19.86)
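The two intervals can be recomputed with a short sketch. The reliability coefficient comes from `statistics.NormalDist().inv_cdf` instead of a table, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar: float, sigma: float, n: int, confidence: float) -> tuple:
    """CI for the mean when the population SD is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # reliability coefficient
    margin = z * sigma / sqrt(n)                        # margin of error
    return xbar - margin, xbar + margin

ci95 = z_confidence_interval(19.26, 2.52, 117, 0.95)
ci99 = z_confidence_interval(19.26, 2.52, 117, 0.99)
print([round(v, 2) for v in ci95], [round(v, 2) for v in ci99])
```

The 99% interval is wider than the 95% interval because a higher confidence level requires a larger reliability coefficient.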
168
Computing CI for μ when σ is unknown
• When the population SD (σ) is unknown, it
should be estimated from the sample SD (s)
• Accordingly, the standard error of the sample
mean will be estimated by s/√n
• Therefore, a 95% CI (say) for μ with n < 30 will
be based on the t-statistic as:
$$\bar{x} \pm t_{0.975}(n-1)\,\frac{s}{\sqrt{n}}$$
where (n-1) is the degrees of freedom
169
Example
• Consider the following summary information
based on data on systolic blood pressure of a
random sample of 30 individuals selected from a
normal population: x̄ = 115.9, s = 16.3. Compute a 95%
and 99% CI for μ
• n=30, df=30-1=29, at 95% confidence level, t0.975(29)=
2.045 and at 99%, t0.995(29)=2.756, se(x̄)=16.3/√30=2.98
• 95% CI for μ: 115.9 ± 2.045*2.98 = (109.8 ≤ μ ≤ 122.0)
• 99% CI for μ: 115.9 ± 2.756*2.98 = (107.7 ≤ μ ≤ 124.1)
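The same calculation as a sketch in Python. The standard library has no t quantile function, so the critical values 2.045 and 2.756 are taken from the slide rather than computed. The function name is illustrative.

```python
from math import sqrt

def t_confidence_interval(xbar: float, s: float, n: int, t_crit: float) -> tuple:
    """CI for the mean when sigma is unknown; t_crit is read from a t table with n-1 df."""
    se = s / sqrt(n)  # estimated standard error of the mean
    return xbar - t_crit * se, xbar + t_crit * se

ci95 = t_confidence_interval(115.9, 16.3, 30, t_crit=2.045)  # t_0.975(29) from the slide
ci99 = t_confidence_interval(115.9, 16.3, 30, t_crit=2.756)  # t_0.995(29) from the slide
print([round(v, 1) for v in ci95], [round(v, 1) for v in ci99])
```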
170
Standard Error of the difference between
two sample means
• Most medical research is comparative, as a
result we are more often concerned with two or
more samples rather than a single sample, i.e.,
compare difference between two samples.
• This helps in deciding whether or not it is likely
that the two means are equal
• When the interval includes 0, the two means
might be equal.
• When the interval does not include zero, the data
suggest that the two means are different.
171
Cont….
The Z statistic can be used in a confidence
interval to estimate the difference between two means
if the variances of the populations are known.
A 95% confidence interval for the difference of the
two means is given by:
$$(\bar{X}_1 - \bar{X}_2) \pm Z_{0.975}\sqrt{\frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}} = (\bar{X}_1 - \bar{X}_2) \pm 1.96\sqrt{\frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}}$$
172
Unknown Variance
The t statistic is used when the
population standard deviations are unknown
and the sample sizes are small, under two sets of
conditions:
1. When equal variances are assumed
2. When the variances are unequal
173
Cont…
• When the variances are equal, the sample variances are
pooled to estimate the common variance.
• The pooled estimate is obtained as a weighted
average of the two sample variances.
• Each sample variance is weighted by its degrees
of freedom (n-1).
• If the sample sizes are equal, the weighted
average equals the arithmetic mean of the two
sample variances.
• If the sample sizes are different, the weighted average
takes advantage of the additional information
provided by the larger sample.
174
Unknown but equal variances
• The pooled standard deviation (Sp) is
calculated using the following formula:
$$S_p = \sqrt{\frac{(n_1-1)S_1^{2} + (n_2-1)S_2^{2}}{n_1+n_2-2}}$$
• Then the standard error of the difference
of the two sample means is:
$$se(\bar{X}_1 - \bar{X}_2) = S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
175
Example: Was there a difference in the mean
fasting blood glucose level between men and
women given data from normal populations
Sex Mean SD n
Men 98.14 19.59 57
Women 95.19 14.03 59
Total 96.64 16.98 116
• Compute a 95% CI for the population mean
difference
– Assuming the standard deviations (SD) are
population SD
– Assuming the population variances are unknown but
assumed to be equal
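Both parts of this example can be sketched as follows. For the pooled case the critical value t0.975 with 114 df is taken as approximately 1.981 from a t table; that value is an assumption, not something given on the slide.

```python
from math import sqrt
from statistics import NormalDist

mean_m, sd_m, n_m = 98.14, 19.59, 57  # men
mean_w, sd_w, n_w = 95.19, 14.03, 59  # women
diff = mean_m - mean_w

# (a) Treating the SDs as known population SDs: Z-based interval
se_known = sqrt(sd_m**2 / n_m + sd_w**2 / n_w)
z = NormalDist().inv_cdf(0.975)
ci_known = (diff - z * se_known, diff + z * se_known)

# (b) Unknown but equal variances: pooled SD and t-based interval
sp = sqrt(((n_m - 1) * sd_m**2 + (n_w - 1) * sd_w**2) / (n_m + n_w - 2))
se_pooled = sp * sqrt(1 / n_m + 1 / n_w)
T_CRIT = 1.981  # assumed table value of t_0.975 with 114 df
ci_pooled = (diff - T_CRIT * se_pooled, diff + T_CRIT * se_pooled)

print([round(v, 2) for v in ci_known], [round(v, 2) for v in ci_pooled])
```

Both intervals include 0, so under either assumption the data are compatible with equal mean fasting blood glucose in men and women.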
176
Factors affecting the length of a
confidence interval (CI)
– Sample size (n)
– Standard deviation (σ)
– Confidence level (1-α)
177
Hypothesis Testing
Why is hypothesis testing so important?
• Hypothesis testing provides an objective
framework for making decisions using
probabilistic method, rather than relying on
subjective impressions.
• The Null hypothesis, denoted by Ho, is the
hypothesis that is to be tested.
• The alternative hypothesis H1 is the hypothesis
that in some sense contradicts the null
hypothesis.
178
Cont…
• While making decision on the null and
alternative hypothesis, we have four
possible outcomes:
1. We accept Ho, and Ho is in fact true – confidence level
(1-α).
2. We accept Ho, and H1 is in fact true – Type II error (β).
3. We reject Ho, and Ho is in fact true – Type I error (α).
4. We reject Ho, and H1 in fact is true – Power of the test
(1- β).
179
One Sample Test for the Mean from a
Normal population
1. One Sided Alternative (One-tailed)
– Unknown Variance
• A one tailed test is a test in which the values of
the parameter being studied (in this case mean)
under the alternative hypothesis are allowed to be
either greater than or less than the values of the
parameter under the null hypothesis, but not both
































180
Cont…
I. Alternative mean < Null mean
• One sample t-test for the mean of a normal
distribution with unknown variance to test the
hypothesis Ho: μ = μo vs H1: μ < μo:
$$t = \frac{\bar{X} - \mu_o}{s/\sqrt{n}}$$
If t < tn-1, α (which equals -tn-1, 1-α), then reject Ho
If t ≥ tn-1, α, then do not reject Ho
181
Cont…
Two ways to determine statistical significance:
1. Critical value method – comparing the tabulated
value of the test statistic to the calculated value
for a given level of significance
2. P-value method
182
Cont…
The p-value is the α level at which the given
value of the test statistic (such as t) would be on
the borderline between the acceptance and
rejection zones.
P = P(tn-1 ≤ t)
where P is the area to the left of 't' under a tn-1
distribution.
183
Guidelines to judge p-value
1. If 0.01 <= p < 0.05, statistically significant
2. If 0.001 <= p < 0.01, statistically highly
significant
3. If p < 0.001, very highly statistically
significant
4. If p ≥ 0.05, not statistically significant
184
II. Alternative mean >Null mean
• To test the hypothesis:
Ho: μ = μo vs H1: μ > μo, variance unknown
With a significance level α, the test is based on 't'
where:
$$t = \frac{\bar{x} - \mu_o}{s/\sqrt{n}}$$
• If t > tn-1, 1-α, Ho is rejected
• If t ≤ tn-1, 1-α, Ho is accepted
185
Cont…
2. Two-sided alternatives (two tailed)
It is a test in which the values of the parameter
being studied under the alternate hypothesis are
allowed to be either greater than or less than the
values of the parameter under the null hypothesis,
Ho.
186
Cont…
• To test the hypothesis:
Ho: μ = μo versus H1: μ ≠ μo with a significance
level of α
$$t = \frac{\bar{x} - \mu_o}{s/\sqrt{n}}$$
|t| > tn-1, 1-α/2: Ho rejected
|t| ≤ tn-1, 1-α/2: Ho accepted
187
Cont…
• P-value for two tailed t-test
$$t = \frac{\bar{x} - \mu_o}{s/\sqrt{n}}$$
$$P = \begin{cases} 2\,P(t_{n-1} \le t) & \text{if } t \le 0 \\ 2\,[1 - P(t_{n-1} \le t)] & \text{if } t > 0 \end{cases}$$
188
Cont…
One sample Z-test - Two Tailed
• The critical values and p-values for the one
sample t-test have been specified in terms of
percentiles of the t distribution, assuming that the
underlying variance is unknown.
• In some applications, the variance may be
assumed known from prior studies. In this case,
the test statistic t-test is replaced by the test
statistic ′Z′
189
Cont…
To test the hypothesis, we use
$$z = \frac{\bar{x} - \mu_o}{\sigma/\sqrt{n}}$$
– If Z < Zα/2 or Z > Z1-α/2, reject Ho
– If Zα/2 ≤ Z ≤ Z1-α/2, don’t reject Ho
190
Cont…
• One Tail
• Alternative mean < Null mean (Variance
Known)
– Z < Zα, then Ho rejected
– Z ≥ Zα, Ho accepted
• Alternative mean > Null mean (Variance
Known)
– Z > Z1-α, then Ho rejected
– Z ≤ Z1-α, Ho accepted
191
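The Z-test decision rules can be sketched as one function. The sample values (x̄ = 19.26, σ = 2.52, n = 117) are reused from the earlier confidence interval example; the two null values 18.5 and 19.5 are hypothetical, chosen to fall outside and inside the 95% interval (18.80, 19.72).

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar: float, mu0: float, sigma: float, n: int,
           alternative: str = "two-sided", alpha: float = 0.05) -> tuple:
    """One-sample Z-test with known sigma. Returns (z, p_value, reject_H0)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    phi = NormalDist().cdf(z)
    if alternative == "less":       # H1: mean < mu0
        p = phi
    elif alternative == "greater":  # H1: mean > mu0
        p = 1 - phi
    else:                           # H1: mean != mu0
        p = 2 * min(phi, 1 - phi)
    return z, p, p < alpha

z1, p1, reject1 = z_test(19.26, 18.5, 2.52, 117)  # hypothetical mu0 outside the 95% CI
z2, p2, reject2 = z_test(19.26, 19.5, 2.52, 117)  # hypothetical mu0 inside the 95% CI
print(round(z1, 2), reject1, round(z2, 2), reject2)
```

Comparing the p-value with α is equivalent to comparing Z with the critical values. The null value outside the 95% interval is rejected and the one inside is not.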
Relationship between Hypothesis
Testing and confidence interval
–Two sided case
• Suppose we are testing Ho: μ = μo versus
H1: μ ≠ μo. Ho is rejected with a two-sided level
α test if and only if the two-sided 100(1-α)% confidence
interval for μ does not contain μo; otherwise
accept Ho.
192
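The equivalence between the two-sided test and the confidence interval can be checked numerically. The sketch below uses the known-variance (Z) case so that it needs only the Python standard library; the function name `z_test_and_ci` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_and_ci(xbar, mu0, sigma, n, alpha=0.05):
    """Return (rejected by the test, mu0 lies outside the CI) for a two-sided Z-test."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    reject_by_test = abs((xbar - mu0) / se) > z_crit
    lower, upper = xbar - z_crit * se, xbar + z_crit * se
    reject_by_ci = not (lower <= mu0 <= upper)
    return reject_by_test, reject_by_ci


# Hypothetical sample: mean 128, known SD 20, n = 36.
for mu0 in (115, 120, 125, 130, 135):
    by_test, by_ci = z_test_and_ci(128.0, mu0, 20.0, 36)
    print(f"mu0 = {mu0}: test rejects = {by_test}, CI excludes mu0 = {by_ci}")
```

For every null value tried, the two columns agree, which is the "if and only if" relationship stated on the slide.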
Hypothesis Testing Two Sample
Inference
• In two-sample hypothesis testing, the
underlying parameters of two different
populations, neither of whose values is
assumed known, are compared.
• Two samples are said to be paired when
each data point of the first sample is
matched and is related to a unique data
point of the second sample.
193
Cont…
• Two samples are said to be independent
if the data points in one sample are
unrelated to the data points in the second
sample
194
The paired t-test
• The test statistic is denoted by
$t = \dfrac{\bar{d}}{SD(d)/\sqrt{n}}$
where $\bar{d}$ is the mean of the observed differences, SD(d) is the sample standard deviation of
the observed differences and n is the number of
differences
195
Cont…
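• Degrees of freedom: n-1
– If $t > t_{n-1,\,1-\alpha/2}$ or $t < -t_{n-1,\,1-\alpha/2}$, then Ho is rejected.
– If $-t_{n-1,\,1-\alpha/2} \le t \le t_{n-1,\,1-\alpha/2}$, then Ho is not rejected.
• The two-sided p-value is twice the tail area beyond |t| under the t distribution with n-1 degrees of freedom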
196
• Example:
• Suppose a sample of 20 students was
given a test before studying a particular
module and then again after completing
the module.
• We want to find out if, in general, our
teaching leads to improvements in
students’ knowledge/skills (i.e. test
scores).
197
Student   Pre-module score   Post-module score   Difference
1         18                 22                   4
2         21                 25                   4
3         16                 17                   1
4         22                 24                   2
5         19                 16                  -3
6         24                 29                   5
7         17                 20                   3
8         21                 23                   2
9         23                 19                  -4
10        18                 20                   2
11        14                 15                   1
12        16                 15                  -1
13        16                 18                   2
14        19                 26                   7
15        18                 18                   0
16        20                 24                   4
17        12                 18                   6
18        22                 25                   3
19        15                 19                   4
20        17                 16                  -1
198
199
• Hypothesis: Ho: Δ = 0 and HA: Δ ≠ 0
• Calculating the mean and standard deviation of
the differences: $\bar{d}$ = 2.05 and sd(d) = 2.837.
Therefore, se($\bar{d}$) = 2.837/√20 = 0.634
• So, we have: t = 2.05/0.634 = 3.231 on 19 df with
p = 0.004.
• Therefore, there is strong evidence that, on
average, the module does lead to improvements.
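The arithmetic of this example can be reproduced in a few lines of Python. This is a sketch using only the standard library; the scores are those of the 20 students in the table, the variable names are illustrative, and the critical value 2.093 is the 97.5th percentile of t with 19 df as found in standard t tables.

```python
from math import sqrt
from statistics import mean, stdev

# Pre- and post-module scores for students 1 to 20.
pre = [18, 21, 16, 22, 19, 24, 17, 21, 23, 18,
       14, 16, 16, 19, 18, 20, 12, 22, 15, 17]
post = [22, 25, 17, 24, 16, 29, 20, 23, 19, 20,
        15, 15, 18, 26, 18, 24, 18, 25, 19, 16]

diffs = [after - before for before, after in zip(pre, post)]
n = len(diffs)
d_bar = mean(diffs)          # mean difference
sd_d = stdev(diffs)          # sample SD (n - 1 in the denominator)
se_d = sd_d / sqrt(n)        # standard error of the mean difference
t = d_bar / se_d

T_CRIT = 2.093               # t_{19, 0.975}, taken from a standard t table
print(f"mean = {d_bar:.2f}, sd = {sd_d:.3f}, se = {se_d:.3f}, t = {t:.3f}")
print("reject Ho at the 5% level:", abs(t) > T_CRIT)
```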
Two sample t-test for independent
samples with equal variance
• The equation is given by:
$t = \dfrac{\bar{X}_1 - \bar{X}_2}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
where $S_p^2 = \dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$, the weighted average of variance 1 and variance 2,
is used as the estimate of $\sigma^2$
• The degrees of freedom will be the sum of the degrees of
freedom of the two samples, i.e., (n1-1) + (n2-1)
200
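A minimal sketch of the equal-variance two-sample t statistic, assuming the pooled variance is the degrees-of-freedom-weighted average of the two sample variances. The function name `pooled_t` and the two groups of measurements are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(sample1, sample2):
    """Equal-variance two-sample t statistic, its df, and the pooled variance."""
    n1, n2 = len(sample1), len(sample2)
    df = (n1 - 1) + (n2 - 1)
    # Weighted average of the two sample variances (weights = their df).
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    t = (mean(sample1) - mean(sample2)) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))
    return t, df, sp2


# Hypothetical measurements for two independent groups.
group_a = [5.1, 4.8, 6.0, 5.5, 5.9, 5.3]
group_b = [4.2, 4.9, 4.4, 5.0, 4.1, 4.6, 4.3]
t, df, sp2 = pooled_t(group_a, group_b)
print(f"t = {t:.3f} on {df} df (pooled variance {sp2:.3f})")
```

The resulting t is compared with the t distribution on (n1-1) + (n2-1) degrees of freedom, here 11.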
Estimation and Hypothesis testing
of population proportion
201
Sampling distribution of proportions
Construction
• It is done in the same manner as that of
the mean
• Take all possible samples of a given size
• Compute the sample proportion for each
• Prepare a frequency distribution of the
proportions
202
Cont…
Characteristics:
– When the sample size is large the distribution is
approximately normal
– The mean of the distribution, $\mu_{\hat{p}}$, will be equal
to the true proportion P.
– The variance of the distribution, $\sigma^2_{\hat{p}}$, will be
equal to $\dfrac{P(1-P)}{n}$
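The mean and variance stated above can be verified exactly for a small case. The sketch below assumes a binomial model (independent draws with a constant true proportion), which stands in for "all possible samples"; the function name `sampling_distribution` and the values n = 50 and P = 0.3 are hypothetical.

```python
from math import comb


def sampling_distribution(n: int, p: float) -> dict:
    """Exact distribution of the sample proportion X/n when X ~ Binomial(n, p)."""
    return {k / n: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


n, p = 50, 0.3                     # hypothetical sample size and true proportion
dist = sampling_distribution(n, p)
mean_phat = sum(value * prob for value, prob in dist.items())
var_phat = sum((value - mean_phat) ** 2 * prob for value, prob in dist.items())

print(f"mean of p-hat     = {mean_phat:.4f} (true proportion {p})")
print(f"variance of p-hat = {var_phat:.6f} (P(1-P)/n = {p * (1 - p) / n:.6f})")
```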
203
Sampling distribution of difference
between two proportions
• For independent random samples of sizes n1 and n2 drawn
from two populations of dichotomous variables, where
P1 and P2 are the population proportions of
the characteristic:
• The distribution of $\hat{p}_1 - \hat{p}_2$ is approximately normal with
mean: $\mu_{\hat{p}_1 - \hat{p}_2} = P_1 - P_2$
• And variance: $\sigma^2_{\hat{p}_1 - \hat{p}_2} = \dfrac{P_1(1-P_1)}{n_1} + \dfrac{P_2(1-P_2)}{n_2}$
204
Estimation of single proportions
• Confidence intervals of proportions by
approximation to the normal distribution and the
sample standard deviation.
• The confidence interval for the population
proportion:
$p \pm Z\sqrt{\dfrac{p(1-p)}{n}}$
where p is the proportion of successes (event),
q = (1 - p) is the proportion of failures,
n is the sample size and Z denotes the z value
relating to a defined probability level.
205
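As a sketch, the interval for a single proportion can be computed as follows. The counts (49 swimmers with erosion of dental enamel out of 294) are taken from the swimming example later in this deck; the function name `proportion_ci` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes: int, n: int, level: float = 0.95):
    """Normal-approximation confidence interval for a single proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    return p_hat, se, p_hat - z * se, p_hat + z * se


p_hat, se, lower, upper = proportion_ci(49, 294)
print(f"p = {p_hat:.3f}, SE = {se:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Rounded to three decimals, these values agree with the prevalence, standard error and 95% limits tabulated for that example.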
Estimation of difference between
two proportions
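• Unbiased point estimators of P1 and P2 are $\hat{p}_1$ and $\hat{p}_2$, so $\hat{p}_1 - \hat{p}_2$ estimates P1 - P2
• Standard error of the estimate when n1 and n2 are
large enough and the proportions are not close to 1 or 0
• Since population proportions are not known, the standard error is estimated from the sample proportions:
$\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$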
206
Cont…
• Therefore, the 100(1-α)% confidence interval will be:
$(\hat{p}_1 - \hat{p}_2) \pm Z_{(1-\alpha/2)}\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
207
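A hedged sketch of this interval in Python, applied to the counts from the swimming example later in this deck (32 of 150 versus 17 of 144). The function name `difference_ci` is illustrative, and because the code carries full precision its output may differ in the third decimal place from figures computed with rounded intermediate values.

```python
from math import sqrt
from statistics import NormalDist


def difference_ci(x1, n1, x2, n2, level=0.95):
    """Confidence interval for P1 - P2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p1 - p2
    return diff, se, diff - z * se, diff + z * se


diff, se, lower, upper = difference_ci(32, 150, 17, 144)
print(f"difference = {diff:.3f}, SE = {se:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

An interval that excludes zero points the same way as a two-sided test rejecting P1 = P2 at the corresponding level.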
Hypothesis testing on single
population proportions
• Follows from the properties of the sampling
distribution of the sample proportion
• The null hypothesis $H_o: P = P_o$
and
• The alternative hypothesis $H_A: P \ne P_o$
208
Cont…
• Test statistic:
$Z = \dfrac{\hat{p} - P_o}{\sqrt{\dfrac{P_o(1-P_o)}{n}}}$
• When Ho is true, Z is approximately distributed as a standard normal
distribution
209
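The test statistic can be sketched in Python as below. The example call uses the swimming data from later in this deck (49 of 294 with erosion of dental enamel) against the claimed prevalence of 14%; the function name `one_proportion_z` is illustrative and the p-value is two-sided.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(successes: int, n: int, p0: float):
    """Z statistic and two-sided p-value for Ho: P = p0."""
    p_hat = successes / n
    se0 = sqrt(p0 * (1 - p0) / n)      # standard error computed under Ho
    z = (p_hat - p0) / se0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = one_proportion_z(49, 294, 0.14)
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

Note that the standard error uses the hypothesized proportion, not the sample proportion, because the statistic is evaluated under Ho.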
Testing differences between two
sample proportions
• The most commonly used test
Ho: P1-P2 = 0 or P1=P2
• Under Ho, the pooled estimate for the proportions will be
$\bar{p} = \dfrac{x_1 + x_2}{n_1 + n_2} = \dfrac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$
• Standard error
$\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\dfrac{\bar{p}(1-\bar{p})}{n_1} + \dfrac{\bar{p}(1-\bar{p})}{n_2}}$
210
Cont…
• The test statistic will be:
$z = \dfrac{(\hat{p}_1 - \hat{p}_2) - (P_1 - P_2)}{\sigma_{\hat{p}_1 - \hat{p}_2}}$
211
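A minimal Python sketch of the two-proportion test with a pooled standard error, applied to the swimming example that follows (32 of 150 versus 17 of 144). The function name `two_proportion_z` is illustrative; the code keeps full precision, so its results may differ slightly from hand calculations that round intermediate values.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Z test of Ho: P1 = P2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) / n1 + pooled * (1 - pooled) / n2)
    z = ((p1 - p2) - 0) / se           # P1 - P2 = 0 under Ho
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, se, p_value


z, se, p_value = two_proportion_z(32, 150, 17, 144)
print(f"z = {z:.2f}, pooled SE = {se:.4f}, two-sided p = {p_value:.3f}")
```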
Example: Comparison of number of swimming
hours per week by swimmers with or without erosion of
dental enamel (EDE)

Swimming hours per week   EDE: Yes   EDE: No   Total
≥ 6 hours                 32         118       150
< 6 hours                 17         127       144
Total                     49         245       294
212
1. Estimate the prevalence of erosion of
dental enamel and calculate a 95% CI
Prevalence of EDE (P): 0.167
Standard error: 0.022
95% CI for P: lower 0.124, upper 0.209
2. From previous studies among
swimmers it is claimed that the
prevalence of erosion of dental enamel
was 14%. Is the claim justified? Give
your p-value
213
3. Compute the respective prevalence of erosion
of dental enamel for those who had ≥ 6 hours
and < 6 hours of swimming time and calculate a
95% CI for the difference in the prevalence.
4. Is there a difference in the prevalence of erosion
of dental enamel between the two swimming
times? Give your p-value
214
Amount of swimming time per week   P
≥ 6 hours                          0.213
< 6 hours                          0.118
Total                              0.167
p1 – p2                            0.095

Ho: P1=P2, HA: P1≠P2
se(p1-p2), pooled under Ho         0.044
Z                                  2.174

95% CI for P1-P2
se(p1-p2), unpooled                0.042
Lower 95%                          0.013
Upper 95%                          0.177
215
Exercise: A study was conducted to look at the
effect of oral contraceptives (OC) on heart disease
in women 40-44 years of age over 3 years. Given
the following data, is there a difference in the rate of
myocardial infarction (MI) between OC-users and non-users? Compute
a 95% CI for the difference.

OC-use group    MI over 3 years: Yes   MI over 3 years: No   Total
OC-users        13                     4,987                 5,000
Non-OC-users    7                      9,993                 10,000
Total           20                     14,980                15,000
216
Common Statistical errors
217
Errors in
• Design
• Execution
• Analysis
• Presentation
• Interpretation
• Omission
218
Statistical errors related to study design
• Study aims and primary outcome measures
not clearly stated
• Inadequate sample size
• Choice of an inappropriate high-risk sample to
make inferences about the general population
• Failure to report number of participants or
observations
• Use of an inappropriate control group
219
Errors in execution
• Failure to adhere to the study protocol
– Misuse of sample selection procedures
– Exclusion and inclusion criteria not strictly
followed
– Failure to follow randomization procedures
220
Statistical errors in presentation
• Inadequate graphical or numerical description of
basic data
– Presenting or plotting mean but no indication of
variability
– Giving SE instead of SD to describe data
– Failure to define ± notation for describing variability
– Numerical information (data and results) given to an
unrealistic level of precision
– Inappropriate graph selection that doesn’t reflect
characteristics of variables, and use of three-dimensional
graphs for two-dimensional data
221
222
Statistical errors in analysis
• Using methods of analysis when assumptions are
not met
• Analyzing paired data ignoring the pairing
• Failing to take account of ordered categories
• Treating multiple observations on one subject as
independent
– Improper multiple pair-wise comparisons of more than
two groups
– Quoting confidence intervals that include impossible
values
• Failure to use multivariate techniques to adjust
for confounding factors
223
Statistical errors in interpretation of
study findings
• Wrong interpretation of results
– “Non-significant” interpreted as “no effect” or
“no difference”
– Drawing conclusions not supported by the
study data
– Significance claimed without data analysis
or statistical test mentioned
• Failure to discuss sources of potential bias and
confounding factors
224
Consequences of statistical errors
• Impossible to get ethical approval to conduct the
study
• Other researchers may be led to follow a false line
of investigation
• Patients may receive an inferior treatment, either
as a direct consequence of the result of the study
or possibly by the delay in the introduction of a
truly effective treatment
• If the results go unchallenged the researchers
may use the same inferior statistical methods in
future research, and others may copy them because of the
inappropriate conclusions
225

Advanced Biostatistics presentation pptx

  • 23. 3. Face-to-face and telephone interviews – An interview is a conversation for gathering information. A research interview involves an interviewer, who coordinates the process of the conversation and asks questions, and an interviewee, who responds to those questions. – A good interviewer can stimulate and maintain the respondent’s interest, and can create a rapport (understanding) and an atmosphere conducive to the answering of questions. – If anxiety is aroused, the interviewer can allay it. If a question is not understood, the interviewer can repeat it and explain.
  • 24. 4. Mailed Questionnaire Method • The investigator prepares a questionnaire pertaining to the field of inquiry and sends it by post to the informants, together with a polite covering letter explaining in detail the aims and objectives of collecting the information • The letter requests the respondents to cooperate by furnishing correct replies and returning the questionnaire duly filled in • Drawback: response rates tend to be relatively low, and there may be under-representation of less literate subjects
  • 25. 5. Use of Documentary Sources • Includes clinical and other personal records, death certificates, published mortality statistics, census publications, etc. • Examples: - Official publications of CSA - Publication of MoH and other Ministries - Newspapers and Journals - International publications (WHO, UNICEF) - Records of Hospitals or any HI 25
  • 26. 6. Computer Direct Interviews • These are interviews in which the Interviewees enter their own answers directly into a computer. • They can be used at malls, trade shows, offices, and so on. • The Survey System's optional Interviewing Module and Interview Stations can easily create computer-direct interviews. Some researchers set up a Web page survey for this purpose. 26
  • 27. Advantages • The virtual elimination of data entry and editing costs • You will get more accurate answers to sensitive questions • Elimination of interviewer bias • Ensuring skip patterns are accurately followed • Response rates are usually higher 27
  • 28. Disadvantages • The Interviewees must have access to a computer or one must be provided for them. • As with mail surveys, computer direct interviews may have serious response rate problems in populations of lower educational and literacy levels. This method may grow in importance as computer use increases. 28
  • 29. Choosing Method of data collection • Decision Makers Need Information that is Relevant, Timely, Accurate and Useable 29
  • 30. • The selection of the method of data collection is also based on practical considerations, such as: • The need for personnel, skills, equipment, etc. in relation to what is available, and the urgency with which results are needed. • The acceptability of the procedures to the subjects – the absence of inconvenience, unpleasantness, or untoward consequences • The probability that the method will provide good coverage, i.e. will supply the required information about all or almost all members of the population or sample
  • 31. Choice of survey method will also depend on several factors. These include: Speed Email and Web page surveys are the fastest methods, followed by telephone interviewing. Mail surveys are the slowest. Cost Personal interviews are the most expensive followed by telephone and then mail. Email and Web page surveys are the least expensive for large samples. Computer and Internet Usage Web page and Email surveys offer significant advantages, but you may not be able to generalize their results to the population as a whole. Literacy Levels Illiterate and less-educated people rarely respond to mail surveys. Sensitive Questions People are more likely to answer sensitive questions when interviewed directly by a computer in one form or another. 31
  • 32. Designing Questionnaire When designing a questionnaire the following points should be taken into account – Keep it (questions) short and simple (KISS) – Questions should be unambiguous and not double barreled – Use simple and direct language. The questions must be clearly understood by respondent. – The wording of a question should be simple and to the point. – The best kinds of questions are those which allow a pre-printed answer to be ticked 32
  • 33. – Questions should be neither irrelevant nor too personal – Leading questions shouldn’t be asked. A “leading question” is one that suggests the answer. – The questionnaire should be designed so that the questions fall into a logical sequence. – After finalizing the questionnaire, translate it into the local languages to be used for data collection – The last step in questionnaire design is to test the questionnaire with a small number of interviews before conducting the main interviews (a pilot).
  • 34. General Considerations  To be successful involve other experts and relevant decision-makers in the questionnaire design process  Formulate a plan for doing the statistical analysis during the design stage of the project  If you used one method in the past and need to compare results, stick to that method, unless there is a compelling reason to change 34
  • 35. Types of questions Open-ended Questions: - Permit free responses that should be recorded in the respondent’s own words. It is used in  Facts with which the researcher is not very familiar  Opinions, attitudes, and suggestions of informants, or  Sensitive issues 35
  • 36. Closed Questions:  Offer a list of possible options or answers from which the respondents must choose.  Offer a list of options that are exhaustive and mutually exclusive, and  Keep the number of options as few as possible. 36
  • 37. Interviewing technique • Before the questionnaire is used for the data collection, it should be pre-tested • Manuals that explain each of the questions should be prepared – question-by-question specification • Enumerators and field supervisors should be trained before they are deployed to the field 37
  • 38. • Enumerators should create a good communication environment with the respondents. • They should explain the questions in the questionnaire precisely to the respondent, and should not lead the respondent. • There should be strong supervision of the field work until it is completed.
  • 39. Rules for asking questions  Read Qs as they are written  Do not change order of Qs  Read the Qs slowly and clearly  Read Qs in a pleasant voice  Maintain eye contact which is culturally appropriate  Read the entire question to Respondent  Do not skip Qs  Verify information given by Respondent 39
  • 40. Interviewing tactics of Sensitive Questions • Sensitive questions may offend the respondents –Expose the respondent’s ignorance –Call for socially unacceptable answer –Embarrassments 45
  • 41. Possible tactics (Barton) – The everybody approach – As you know, many people have been arrested for being involved in theft. Do you happen to have been arrested for being involved in theft? – The other people approach – Do you know anyone who has been arrested for theft? How about yourself? – The Kinsey technique – stare firmly into the respondent’s eyes and ask in simple, clear-cut language such as that to which the respondent is accustomed, and with an air of assuming that everybody has done everything, ‘Have you ever been arrested for being involved in theft?’
  • 42. Informed consents Participation in a survey should be voluntary and a respondent can refuse to be interviewed or measured, etc. The information given should be simple and clear and adapted to the respondent’s level of understanding. Informed consents can be either signed or verbal 48
  • 43. The interviewer is responsible for explaining: – what the survey is about, – providing all the necessary information, and – making sure the respondent understands the implications of his/her participation before giving his/her consent. • The information given should be simple and clear and adapted to the respondent’s level of understanding. 49
  • 44. • Consents must be documented by asking the respondents to sign an Informed Consent Form or give verbal consent before doing the interview. – These forms must mention: • who will be doing the study, • the types of questions that will be asked, • why the study is being done, and • who will have access to the information provided. 50
  • 45. Module 1.2: Methods of data processing, organization and presentation 51
  • 46. No. Ht Wt Sex age FEV No. Ht Wt Sex age FEV 1 175.2 79.2 1 57 3.80 16 177.5 69.7 1 32 4.10 2 164.5 92.4 6 60 3.50 17 164.0 719 2 58 3.15 3 168.5 64.6 1 62 1.48 18 174.0 63.2 1 45 4.25 4 180.0 82.6 1 43 4.35 19 161.0 60.0 2 59 2.75 5 156.0 79.9 2 13 2.70 20 169.5 63.3 3 53 3.32 6 170.0 80.9 1 61 2.35 21 181.5 101.3 1 37 4.20 7 170.0 79.7 1 67 149 22 173.0 72.9 1 47 4.45 8 162.0 57.4 1 63 2.95 23 473.6 55.9 2 39 3.65 9 177.0 98.1 1 46 4.20 24 178.2 39.2 1 70 3.05 10 285.0 61.6 2 47 2.45 25 159.0 63.5 2 42 3.20 11 156.0 60.0 2 43 2.10 26 149.0 69.2 2 58 29.3 12 157.0 62.0 3 34 3.41 27 159.0 80.3 2 63 2.45 13 150.0 51.8 2 49 2.70 28 190.0 883.0 1 60 4.65 14 154.0 58.1 2 47 2.45 29 175.0 85.0 7 41 3.75 15 165.0 70.6 1 79 3.10 30 168.7 855 1 60 3.15 52
  • 47. Data cleaning and editing • When the questionnaires are collected from the field, they should be coded and edited • Checks are basically of two sorts: range checks and consistency checks. • Range checks exclude, for example, the erroneous occurrence of code 3 for sex, which should only be code 1 (male) or code 2 (female). • Consistency checks detect impossible combinations of data
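A range check of the kind described on this slide could be sketched in Python as follows. The sex codes 1 and 2 come from the slide; the numeric limits for height, weight and FEV are illustrative assumptions rather than clinical standards, and the six rows are copied from the raw data table on the preceding slide.

```python
# Allowed codes and plausible ranges. The sex codes come from the slide;
# the numeric limits are illustrative assumptions, not clinical standards.
VALID_SEX = {1, 2}
RANGES = {"ht": (100.0, 220.0), "wt": (30.0, 200.0), "fev": (0.5, 7.0)}

# A few rows copied from the raw data table: (no, ht, wt, sex, age, fev)
rows = [
    (1, 175.2, 79.2, 1, 57, 3.80),
    (2, 164.5, 92.4, 6, 60, 3.50),
    (10, 285.0, 61.6, 2, 47, 2.45),
    (12, 157.0, 62.0, 3, 34, 3.41),
    (17, 164.0, 719.0, 2, 58, 3.15),
    (26, 149.0, 69.2, 2, 58, 29.3),
]


def range_errors(row):
    """Return a list of 'field=value' strings for values outside their range."""
    no, ht, wt, sex, age, fev = row
    errors = []
    if sex not in VALID_SEX:
        errors.append(f"sex={sex}")
    for name, value in (("ht", ht), ("wt", wt), ("fev", fev)):
        low, high = RANGES[name]
        if not low <= value <= high:
            errors.append(f"{name}={value}")
    return errors


# Map record number -> list of range errors, keeping only records with errors
flagged = {row[0]: range_errors(row) for row in rows if range_errors(row)}
for no, errors in flagged.items():
    print(no, errors)
```

Flagged records would then be checked against the original questionnaires rather than corrected by guesswork.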
  • 48. Basic precautions recommended to minimize errors during the handling of data: • Avoid any unnecessary copying of data from one form to another • Use a verification procedure during data entry - range and skip rules, double data entry, etc. • Check all calculations carefully, example – date conversion, units of measurement, etc. 54
  • 49. Data organization: Tables • The use of tables for presenting data involves grouping the data into mutually exclusive categories of the variable and counting the number of occurrences in each category • Tables should be as simple as possible and self-explanatory • Numerical entries of zero should be explicitly written rather than indicated by a dash • Totals should be shown either in the top row and the first column or in the last row and last column • If data are not original, their source should be given in a footnote
  • 50. Asthma versus sex and smoking
      Sex and smoking status    Presence of asthma             Total
                                No              Yes
                                n       %       n      %
      Sex
        Female                  459     91.6    42     8.4     501
        Male                    439     93.0    33     7.0     472
        Total                   898     92.3    75     7.7     973
      Smoking
        Never smoker            480     91.4    45     8.6     525
        Ex-smoker               254     91.7    23     8.3     277
        Current smoker          164     95.9     7     4.1     171
        Total                   898     92.3    75     7.7     973
  • 51. Data presentation: Diagrams • Allows readers to obtain an overall grasp of the data presented. • The relationship can be seen more quickly and easily from a graph than from a table. • The choice of one graph over the other depends on personal choices and/or the type of the data. Bar chart and pie chart are commonly used for quantitative discrete or qualitative data Histograms, frequency polygon, and line graphs are used for quantitative continuous data 57
  • 52. Component bar graph: smoking status and presence of asthma [Figure: x-axis: smoking status (never smoker, ex-smoker, current smoker); y-axis: number of individuals, scale 0 to 100; legend: asthma No, Yes]
  • 53. Pie-chart – smoking status (%) Never smoker 54% Ex-smoker 28% Current smoker 18% 59
  • 55. Neonatal Mortality Rate by Sex [Figure: x-axis: surveillance year (2005 to 2011); y-axis: NNMR per 1000 LB, scale 0.0 to 70.0; series: Female, Male; data labels as extracted: 65.8, 34.2, 37.2, 46.3, 25.8, 29.0, 29.3, 50.2, 44.8, 49.0, 54.6, 41.4, 38.7, 34.3]
  • 56. General rules for constructing graphs • Every graph should be self-explanatory and as simple as possible • Titles are usually placed below the graph • Legends or keys should be used to differentiate variables if more than one is shown • The axes label should be placed to read from the left side and from the bottom • The units into which the scale is divided should be clearly indicated • The numerical scale representing frequency must start at zero or a break in the line should be shown 62
  • 57. Module 1.3: Data summarization 63
  • 58. Data Exploration • The exploration procedure produces summary statistics and graphical displays • The reasons for using the explore procedure are: – data screening, – outlier identification, – description, – assumption checking, and – characterizing differences among subpopulations (groups of cases). 64
  • 59. No. Ht Wt Sex age FEV No. Ht Wt Sex age FEV 1 175.2 79.2 1 57 3.80 16 177.5 69.7 1 32 4.10 2 164.5 92.4 1 60 3.50 17 164.0 71.9 2 58 3.15 3 168.5 64.6 1 62 1.48 18 174.0 63.2 1 45 4.25 4 180.0 82.6 1 43 4.35 19 161.0 60.0 2 59 2.75 5 156.0 79.9 2 47 2.70 20 169.5 63.3 2 53 3.32 6 170.0 80.9 1 61 2.35 21 181.5 101.3 1 37 4.20 7 170.0 79.7 1 67 0.80 22 173.0 72.9 1 47 4.45 8 162.0 57.4 1 63 2.95 23 164.2 55.9 2 39 3.65 9 177.0 98.1 1 46 4.20 24 178.2 93.2 1 70 3.05 10 160.5 61.6 2 47 2.45 25 159.0 63.5 2 42 3.20 11 156.0 60.0 2 43 2.10 26 149.0 69.2 2 58 2.20 12 157.0 62.0 2 34 3.41 27 159.0 80.3 2 63 2.45 13 150.0 51.8 2 49 2.70 28 190.0 88.3 1 60 4.65 14 154.0 58.1 2 47 2.45 29 175.0 85.0 1 41 3.75 15 165.0 70.6 1 79 3.10 30 168.7 85.5 1 60 3.15 65
  • 60. • Data screening may show that you have unusual values, extreme values, gaps in the data, or other peculiarities. • Exploring the data can help to determine whether the statistical techniques that you are considering for data analysis are appropriate. • The exploration may indicate that you need to transform the data if the technique requires some known distribution, say the Normal distribution. 66
  • 61. Measures of Central tendency - The arithmetic mean, median and mode - The arithmetic mean is unique, takes into account all data points and lends itself to further mathematical manipulation, but it is sensitive to extreme values - The median is unique, does not depend on the exact value of every data point and is not affected by extreme values - The mode might not exist or might not be unique; it can be determined for qualitative data
  • 62. Exercise • Calculate the mean, median and mode for the whole sample and sex-specific summary values using the data in the table below • Sex: 1 = Male, 2 = Female • Height is measured in cm, weight in kg, age in years and FEV in liters
  • 63. Ht Wt Sex age FEV 175.2 79.2 1 57 3.80 164.5 92.4 1 60 3.50 168.5 64.6 1 62 1.48 180.0 82.6 1 43 4.35 156.0 79.9 2 47 2.70 170.0 80.9 1 61 2.35 170.0 79.7 1 67 0.80 162.0 57.4 1 63 2.95 177.0 98.1 1 46 4.20 160.5 61.6 2 47 2.45 156.0 60.0 2 43 2.10 157.0 62.0 2 34 3.41 150.0 51.8 2 49 2.70 154.0 58.1 2 47 2.45 165.0 70.6 1 79 3.10 69
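One way to work through this exercise is sketched below with Python's standard `statistics` module, using only the height column of the 15 rows listed on this slide. The helper name `summarize` is an illustrative choice, and because only these 15 rows are used, the results need not match summary tables computed on a larger sample.

```python
from statistics import mean, median, multimode

# Height (cm) and sex code (1 = male, 2 = female) for the 15 listed subjects
heights = [175.2, 164.5, 168.5, 180.0, 156.0, 170.0, 170.0, 162.0,
           177.0, 160.5, 156.0, 157.0, 150.0, 154.0, 165.0]
sexes = [1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1]


def summarize(values):
    """Return n, mean, median and all modes (there may be more than one)."""
    return {
        "n": len(values),
        "mean": round(mean(values), 2),
        "median": median(values),
        "mode": multimode(values),
    }


overall = summarize(heights)
male = summarize([h for h, s in zip(heights, sexes) if s == 1])
female = summarize([h for h, s in zip(heights, sexes) if s == 2])

for label, summary in (("Both", overall), ("Male", male), ("Female", female)):
    print(label, summary)
```

`multimode` is used instead of `mode` because a mode may not be unique; for the 15 heights two values are tied as most frequent.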
  • 64. Summary values
      Sex      Statistic   Age        Ht         Wt         FEV
      Male     Mean        54.85      173.54     80.27      3.42
               Median      59.94      174.00     80.90      3.75
               Mode        32.47      170.00     57.40      4.20
               Sum         932.47     2950.10    1364.60    58.13
               n           17         17         17         17
      Female   Mean        49.16      158.40     64.42      2.81
               Median      47.40      159.00     62.00      2.70
               Mode        34.43      156.00     60.00      2.45
               Sum         639.04     2059.20    837.50     36.53
               n           13         13         13         13
      Both     Mean        52.38      166.98     73.40      3.16
               Median      50.96      166.75     71.25      3.15
               Mode        32.47      156.00     60.00      2.45
               Sum         1571.51    5009.30    2202.10    94.66
               n           30         30         30         30
  • 65. Measures of Variation/Dispersion • Dispersion of a set of observations refers to the scatter of the observations around a measure of central tendency • Commonly used measures of variation: range, percentiles, and standard deviation. • Of these, only the standard deviation measures dispersion in this sense, since it assesses the scatter of the observations around the mean; the range and percentiles describe spread without reference to a central value.
  • 66. The Coefficient of Variation To compare the variability of two or more sets of data for same or different variables, standard deviations may lead to fallacious results. • The variables involved might be measured in different units, or different characteristics • Coefficient of Variation (CV) is the standard deviation expressed as a percentage of the mean. 72
  • 67. Use the above data to determine standard deviation and Coefficient of variation
      Sex      Statistic   Age        Ht         Wt         FEV
      Male     Mean        54.85      173.54     80.27      3.42
               Variance    160.7      49.53      157.22     1.15
               Std dev     12.68      7.04       12.54      1.07
               CV          23.1       4.1        15.6       31.3
               Range       46.06      28         43.9       3.85
      Female   Mean        49.16      158.4      64.42      2.81
               Variance    74.16      32.65      74.78      0.24
               Std dev     8.61       5.71       8.65       0.49
               CV          17.5       3.6        13.4       17.4
               Range       28.98      20.5       28.5       1.55
      Both     Mean        52.38      166.98     73.40      3.16
               Variance    127.58     99.03      181.48     0.83
               Std dev     11.3       9.95       13.47      0.91
               CV          21.6       6.0        18.4       28.8
               Range       46.06      41         49.5       3.85
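As a quick check on the CV rows, the coefficient of variation can be recomputed from the tabulated means and standard deviations. The sketch below does this for the male rows only; the function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation expressed as a percentage of the mean."""
    return 100.0 * std_dev / mean


# (mean, std dev) for males, copied from the table on this slide
male = {
    "age": (54.85, 12.68),
    "ht": (173.54, 7.04),
    "wt": (80.27, 12.54),
    "fev": (3.42, 1.07),
}

cv = {name: round(coefficient_of_variation(sd, m), 1)
      for name, (m, sd) in male.items()}
print(cv)
```

Because the CV has no units, it lets the variability of height (cm) be compared directly with that of FEV (liters), which the raw standard deviations do not.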
  • 68. Data transformations • The assumptions underlying a statistical method may not always be satisfied by a particular set of data. • For example, a distribution may be positively skewed rather than normal. Such problems can often be overcome simply by transforming the data to a different scale of measurement • The most common choice is the logarithmic transformation 74
  • 69. Logarithmic transformation • When a logarithmic transformation is applied to a variable, each individual value is replaced by its logarithm. y = log x • Where x is the original value and y the transformed value. • The logarithm has the effect both of equalizing the standard deviations and removing skewness (absence of symmetry) 75
  • 70. Choice of a transformation • There are alternative transformations • Reciprocal transformation:- is stronger than the logarithmic, and would be appropriate if the distribution were considerably more positively skewed than lognormal. Y=1/x 76
  • 71. • Square root transformation: $y = \sqrt{x}$; it is used when the constant variance assumption does not hold true. • It is weaker than the logarithmic transformation. • Negative skewness can be removed by using a power transformation, such as a square or a cubic transformation; the strength increases with the order of the power.
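The relative strength of these transformations can be illustrated with a small sketch. The data below are made up (a doubling sequence chosen so that the logarithms are evenly spaced), and skewness is computed with the simple moment formula $m_3 / m_2^{3/2}$; both are assumptions for illustration only.

```python
import math


def skewness(values):
    """Population moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


raw = [1, 2, 4, 8, 16, 32, 64]           # made-up, positively skewed values
rooted = [math.sqrt(v) for v in raw]      # y = sqrt(x), the weaker transformation
logged = [math.log(v) for v in raw]       # y = log x

print("raw:   ", skewness(raw))
print("sqrt:  ", skewness(rooted))
print("log:   ", skewness(logged))
```

For this sample the square root reduces the positive skewness only partly, while the logarithm removes it, in line with the ordering of strengths described on the slides.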
  • 72. Histogram & Normal curve with transformations 78
  • 73. Module 2: Probability and Probability Distributions 79
  • 74. Probability Distributions • Definition: A random variable is a numerical quantity that takes different values with specified probabilities. • There are two types of random variables: discrete and continuous. • Definition: A random variable for which there exists a discrete set of values with specified probabilities is a discrete random variable.
  • 75. Probability Distributions • Example: Diarrhoea is one of the most frequent reasons for visiting health institutions in the first 2 years of life in children. • Let X be the random variable that represents the number of episodes of diarrhoea in the first 2 years of life. Then X is a discrete random variable, which takes on values 0,1,2, .... • Definition: A random variable whose values form a continuum (i.e., have no gaps) such that ranges of values occur with specified probabilities is a continuous random variable. 81
  • 76. Probability Mass Function for a Discrete Random Variable • The values taken by a discrete random variable and their associated probabilities can be expressed by a rule, or relationship, that is called a probability mass function (pmf). • Definition: A pmf is a mathematical relationship, or rule, that assigns to any possible value r of a discrete random variable X the probability P(X = r). This assignment is made for all values r that have positive probability. The pmf is also referred to as a probability distribution.
  • 77. General rules which apply to any probability distribution 1. Since the values of a probability distribution are probabilities, they must be numbers in the interval from 0 to 1. 2. Since a random variable has to take on one of its values, the sum of all the values of a probability distribution must be equal to 1. • Example: Check whether the following function can serve as the probability distribution of an appropriate random variable 83
  • 78. General rules … $f(x) = \frac{x+2}{12}$ for x = 1, 2, and 3. Substituting the values of x, f(1) = 3/12, f(2) = 4/12, and f(3) = 5/12. Since none of these values is negative or greater than one, and since their sum 3/12 + 4/12 + 5/12 = 1, the given function is a probability distribution.
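The two rules can be checked mechanically. The sketch below uses exact fractions so that the sum is tested without rounding error; the variable names are illustrative.

```python
from fractions import Fraction


def f(x):
    """Candidate probability function f(x) = (x + 2) / 12."""
    return Fraction(x + 2, 12)


support = [1, 2, 3]
probs = [f(x) for x in support]

# Rule 1: every value lies in the interval from 0 to 1
each_in_unit_interval = all(0 <= p <= 1 for p in probs)
# Rule 2: the values sum to exactly 1
total = sum(probs)

is_valid = each_in_unit_interval and total == 1
print(probs, total, is_valid)
```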
  • 79. Example on Hypertension-control: • Suppose a physician agrees to use a new anti-hypertensive drug on a trial basis on the first 4 untreated hypertensives whom she encounters in her practice before deciding whether to adopt the drug for routine use. • Let X = the number of patients out of 4 who are brought under control. Suppose that from previous experience with the drug, for any clinical practice, the drug company expects the following probabilities.
      r          0       1       2       3       4
      P(X=r)     .008    .076    .265    .411    .240
  • 80. Example: • For the above table, for any clinical practice, the probability that between 0 and 4 hypertensives are brought under control = 1, i.e., • 0.008 + 0.076 + 0.265 + 0.411 + 0.240 = 1 • What is the probability that: – At least two patients are brought under control? – At most three patients are brought under control?
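Both questions reduce to adding the relevant entries of the table. A minimal sketch, with the probabilities copied from the hypertension-control table:

```python
# P(X = r) for r = 0..4, copied from the hypertension-control table
pmf = {0: 0.008, 1: 0.076, 2: 0.265, 3: 0.411, 4: 0.240}

# At least two patients brought under control: P(X >= 2)
p_at_least_two = sum(p for r, p in pmf.items() if r >= 2)

# At most three patients brought under control: P(X <= 3)
p_at_most_three = sum(p for r, p in pmf.items() if r <= 3)

print(round(p_at_least_two, 3), round(p_at_most_three, 3))
```

The second probability can also be obtained as 1 − P(X = 4), since the probabilities sum to 1.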
  • 81. 1. Binomial distribution • The Binomial distribution with parameters n and p is a discrete probability distribution of the number of successes in a sequence of n independent binary (yes/no) experiments, each of which yields success with probability p. • A useful summary measure, used to describe binary variables, is the proportion with which the variable took one of its values, called success. • The binomial distribution is used to model the number of successes in a sample of size n drawn with replacement from a population of size N. 87
  • 82. The Binomial Distribution • Definition: The distribution of the number of successes (r) in n statistically independent trials, where the probability of success on each trial is p, is known as the binomial distribution, and has a probability mass function given by: $P(X = r) = \binom{n}{r} p^{r} (1-p)^{n-r}$, r = 0, 1, 2, …, n, where $\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$ • The mean is np and the variance is np(1-p)
  • 83. Probability mass function for the binomial distribution 89
  • 84. Example: • What is the probability of obtaining 2 boys out of 5 children if the probability of a boy is 0.51 at each birth and the sexes of successive children are considered independent random variables? • n = 5, p = 0.51, 1 - p = 0.49 and r = 2: $P(X = 2) = \binom{5}{2} (0.51)^{2} (0.49)^{3} = \frac{5!}{2!\,3!} (0.51)^{2} (0.49)^{3} = 0.306$
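The same calculation can be sketched in Python with `math.comb` for the binomial coefficient. The function name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(r, n, p):
    """P(X = r) for a binomial distribution with parameters n and p."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


n, p = 5, 0.51
p_two_boys = binomial_pmf(2, n, p)

# Sanity checks: the probabilities sum to 1 and the mean equals n * p
total = sum(binomial_pmf(r, n, p) for r in range(n + 1))
mean = sum(r * binomial_pmf(r, n, p) for r in range(n + 1))

print(round(p_two_boys, 3), round(total, 6), round(mean, 2))
```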
  • 85. Continuous Probability Distribution • A continuous probability distribution is a smooth density curve that models the distribution of a continuous random variable. • The area under the curve is 1, and the area over any interval is the probability that the value of the random variable falls in that interval. • A density function is a formula used to represent the distribution of a continuous random variable.
  • 86. Definition • A nonnegative function f(x) is called a probability distribution (probability density function) for a continuous random variable X if: – The total area bounded by its curve and the x-axis is equal to one – The subarea bounded by the curve, the x-axis and the perpendiculars erected at any two points a and b gives the probability that X is between a and b
  • 87. 2. Normal distribution • The Normal distribution, also called the Gaussian distribution, is the most important distribution in all of statistics. • The normal density is given by: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$, $-\infty < x < \infty$, where π = 3.141… and e = 2.718…
  • 88. Characteristics 1. It is symmetrical about its mean 2. Mean, median and mode are equal 3. The total area under the curve above the x-axis is one square unit 4. One SD from the mean in both directions covers approximately 68% of the area 5. The height of the curve at the mean = $1/(\sigma\sqrt{2\pi})$ 6. The normal distribution is determined by two parameters, the mean and the standard deviation.
  • 89. The Normal Distribution curve σ = σx μ = μx 95
  • 91. The standard Normal distribution • Definition: A normal distribution with mean 0 and variance 1 will be referred to as a standard, or unit, normal distribution. This distribution is denoted by N(0,1). $f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^{2}}$ for $-\infty < z < +\infty$ • This distribution is symmetrical about 0 (the mean), since f(z) = f(-z). About 68% of the area under the standard normal density lies between -1 and +1, about 95% lies between -2 and +2, and about 99% lies between -2.5 and +2.5
  • 92. Application of Normal distribution • Example: Suppose it is known that the heights of a population of individuals are approximately normally distributed with a mean of 70 inches and a standard deviation of 3 inches. What is the probability that a person picked at random from this group will be a) between 65 and 74 inches tall? b) taller than 75 inches? c) shorter than 65 inches?
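The quoted areas can be checked with `statistics.NormalDist` from the Python standard library, which defaults to mean 0 and standard deviation 1. The helper name `central_area` is an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def central_area(k):
    """Area under the standard normal density between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 2.5):
    print(k, round(central_area(k), 4))
```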
  • 93. Solution Step 1: Transform each value of x to the standard normal scale by using $z = \frac{x - \mu}{\sigma}$. Step 2: Determine the area bounded by the curve, the x-axis and the two transformed points, P(a < z < b). Step 3: Look up the z distribution table for the area corresponding to each value of z.
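In place of a printed z table, the three probabilities in the height example can be sketched with `statistics.NormalDist`; the mean of 70 inches and standard deviation of 3 inches come from the example, and the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 70, 3
heights = NormalDist(mu=mu, sigma=sigma)

# Step 1: standardize the cut-off points, z = (x - mu) / sigma
z_65 = (65 - mu) / sigma
z_74 = (74 - mu) / sigma

# Steps 2 and 3: areas under the curve, read from the cdf instead of a table
p_between_65_and_74 = heights.cdf(74) - heights.cdf(65)   # part a
p_above_75 = 1 - heights.cdf(75)                          # part b
p_below_65 = heights.cdf(65)                              # part c

print(round(z_65, 2), round(z_74, 2))
print(round(p_between_65_and_74, 3), round(p_above_75, 3), round(p_below_65, 3))
```

Parts b and c give the same answer because 75 and 65 are equally far from the mean and the distribution is symmetrical.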
  • 94. 3. The t-distribution • The t-distribution is a family of continuous probability distributions that arise when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown. • Whereas a normal distribution describes a full population, t-distributions describe samples drawn from a full population; accordingly, the t-distribution for each sample size is different. 100
  • 95. The t-distribution • The t-distribution is similar in shape to the Normal distribution but is more spread out, with longer tails than the standard Normal. • It is symmetrical about zero, its mean, and its variance is $\sigma^2 = k/(k-2)$ for k > 2, where k = df; µ does not exist for k = 1, and $\sigma^2$ does not exist for k = 1, 2 • The df increases with the sample size. As the sample size increases, the shape of the t-distribution becomes increasingly like the standard Normal distribution. • It is used for estimation of means.
  • 97. The t-distribution ν = n−1 degrees of freedom 103
  • 98. Module 3.1: Sampling methods and Sample size estimation 104
  • 99. Why sample? • It is usually not cost effective or practicable to collect and examine all the data that might be available. • Instead it is often necessary to draw a sample of information from the whole population to enable the detailed examination required to take place. • Sampling provides a means of gaining information about the population without the need to examine the population in its entirety. 105
  • 100. • Purposes of sampling: Provides various types of statistical information of a qualitative or quantitative nature about the whole by examining a few selected units. • Advantages of sample based studies – Cost effectiveness – Timeliness – Inaccessibility of some people – Less destructive in data summarization – Accuracy 106
  • 101. Caveats • Sampling can provide a valid, defensible methodology but it is important to match the type of sample needed to the type of analysis required. • The auditor should also take care to check the quality of the information from which the sample is to be drawn. If the quality is poor, sampling may not be justified. 107
  • 102. Sampling Designs • Sample design covers the method of selection, the sample structure and plans for analysing and interpreting the results. • Sample designs can vary from simple to complex and depend on the type of information required and the way the sample is selected. • The design will impact upon the size of the sample and the way in which analysis is carried out. In simple terms the tighter the required precision and the more complex the design the larger the sample size. 108
  • 103. Sampling Designs • The design may make use of the characteristics of the population, but it does not have to be proportionally representative. • It may be necessary to draw a larger sample than would be expected from some parts of the population; • For example, to select more from a minority grouping to ensure that we get sufficient data for analysis on such groups. 109
  • 104. Sampling Designs • The aim of the design is to achieve a balance between the required precision and the available resources. 110
  • 105. Definition of terms • Sample – Subset of the population of interest • Sampling – process of selecting units from the population of interest so that by studying the sample we generalize our result back to population. • Sampling can provide a valid, defensible methodology but it is important to match the type of sample needed to the type of analysis required. 111
  • 106. • Population - Finite or infinite set of objects whose properties are to be studied. • Study population/sample population – subset of target population chosen so as to be representative of the total population • Sampling unit - unit of selection in the sampling process. • Study unit – subject on which information is collected. 112
  • 107. Conditions that need to be met The sample must be well chosen – Representative • the method of choosing the sample matters • the best methods involve the planned introduction of chance • A sampling procedure should be fair, selecting people for inclusion in the sample in an impartial way, so as to get a representative cross section of the public – No selection bias • When a selection procedure is biased, taking a large sample does not help. This just repeats the basic mistake on a larger scale
  • 108. Conditions … A sample chosen in a haphazard fashion, or because it is ‘handy’, is unlikely to be a representative one. This kind of sample may be used in exploratory surveys to get a ‘feel’ for the situation. The sample must be sufficiently large – Sample size. There must be adequate coverage of the sample – Response rate • Non-respondents can be very different from respondents. When there is a high non-response rate, look out for non-response bias.
  • 109. Is a sample any good? Some samples are really bad. To find out whether a sample is any good, ask: 1. How was it chosen? 2. Was there selection bias? 3. Was there non-response bias? These questions might not be answered just by looking at the data
  • 110. Sampling techniques/methods • Sampling is the process of selecting a number of study units from a defined study population. • Clearly define study population and study unit – Study population – individuals, households, institutions, records, etc… – Study units – an individual, a household, an institution or a record 116
  • 111. Sampling cont… • Types: probability and non-probability – Probability – quantitative studies – Non-probability – qualitative studies • Probability sampling technique: – Involves using random selection procedures to ensure that each unit of the sample is chosen on the basis of chance. – All units of the study population should have an equal, or at least a known non-zero chance of being included in the sample. – Sample drawn in such a way that it is representative of the population – The type to be used depends on population composition and availability of sampling frame 117
  • 112. Sampling cont… Probability sampling methods include: – Simple random sampling – Systematic sampling – Stratified sampling – Cluster sampling – Multistage sampling 118
  • 113. 1. Simple random sampling • Selecting the required number of sampling units randomly from a list of all units – Requires an up-to-date sampling frame – Random selection – manually using a table of random numbers, or using computer programs • E.g. 250 households from a list of 9000 households • Better representativeness, but costly; representativeness is reduced in a heterogeneous population
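  The household example above can be sketched in a few lines of Python. This is only an illustration: the frame is assumed to be a simple list of household numbers 1 to 9000, and the function name `simple_random_sample` and the seed are hypothetical choices, not part of the slides.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every unit has the same chance."""
    rng = random.Random(seed)      # seed only makes the draw reproducible
    return rng.sample(frame, n)

# Assumed sampling frame: household numbers 1..9000
households = list(range(1, 9001))
sample = simple_random_sample(households, 250, seed=42)
print(len(sample), sorted(sample)[:5])
```

  Sampling without replacement is used here so that no household is selected twice.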
  • 114. 2. Systematic sampling • Sampling units are selected at regular intervals. The starting unit is selected randomly • Example: to select a sample of 100 students from 2500, first calculate the sampling interval = 2500/100 = 25. Then randomly select the first student from among the first 25, and finally pick every 25th student after that • Easier and less time consuming • Can be done without a sampling frame – sequential studies • Risk of bias if there is cyclic repetition
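  A minimal sketch of the 100-from-2500 example, assuming the students are numbered 1 to 2500 in list order. The helper name `systematic_sample` is hypothetical, and the sketch assumes the population size is a multiple of the sample size.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start within the first interval, then every k-th unit."""
    k = len(frame) // n                 # sampling interval (2500 // 100 = 25)
    rng = random.Random(seed)
    start = rng.randrange(k)            # 0-based position within the first k units
    return [frame[start + i * k] for i in range(n)]

students = list(range(1, 2501))         # assumed numbering of the 2500 students
sample = systematic_sample(students, 100, seed=1)
print(sample[:4])
```

  If the list has a periodic pattern with the same period as the interval, every selected unit falls at the same point of the cycle, which is the bias risk noted on the slide.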
  • 115. 3. Stratified sampling • Used when the population consists of distinct subgroups/strata • Ensures that the proportions of individuals with certain characteristics in the sample will be the same as those in the whole population – Representation of groups with different characteristics • The study population must be divided into strata of the characteristic (Example: residence, age, sex, profession) and then random or systematic samples are obtained from each stratum
  • 116. 3. Stratified sampling cont. • Depending on the need, samples from each stratum can be drawn either proportional to their size or non-proportionally/equal size from each stratum – Proportional – using the same sampling fraction (n/N) in every stratum – Equal size – to represent small groups • Improved representativeness • Estimates can be obtained for each stratum and the population
  • 117. 4. Cluster sampling • Groups of study units (clusters) instead of individual study units are selected at a time • Assumes homogeneity of the population with respect to the characteristic to be measured • All the study units in the selected clusters are included in the study • Used in geographically scattered areas where visiting dispersed study units is time consuming and costly • Example: a simple random sample of 5 villages from 30 villages • Easier but less representative
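  Proportional allocation can be sketched as follows. The two strata ("urban" with 6000 units, "rural" with 3000 units) and the total sample of 300 are made-up numbers for illustration only, and the function names are hypothetical.

```python
import random

def proportional_allocation(strata_sizes, n):
    """Allocate n across strata using the common sampling fraction n/N."""
    N = sum(strata_sizes.values())
    # Rounding can make the total differ slightly from n in general.
    return {name: round(n * size / N) for name, size in strata_sizes.items()}

def stratified_sample(strata, n, seed=None):
    """Simple random sample inside each stratum, sized proportionally."""
    rng = random.Random(seed)
    sizes = {name: len(units) for name, units in strata.items()}
    alloc = proportional_allocation(sizes, n)
    return {name: rng.sample(units, alloc[name]) for name, units in strata.items()}

# Hypothetical strata: unit identifiers only
strata = {"urban": list(range(6000)), "rural": list(range(6000, 9000))}
sample = stratified_sample(strata, 300, seed=7)
print({name: len(units) for name, units in sample.items()})
```

  For equal-size allocation, the same sampling step would be used with a fixed number per stratum instead of `proportional_allocation`.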
  • 118. 5. Multistage sampling • Carried out in stages – PSU, SSU… • Used in very large and diverse populations • The method used in most community-based big studies • E.g. In a study to be undertaken in a big town the sampling may involve stages like selection of kefetegnas, kebeles and finally houses • Representativeness and reduced cost 124
  • 119. 5. Multistage sampling • The larger the number of clusters, the greater is the likelihood that the sample will be representative. • Further, the sampling units at community level should be selected randomly (avoid convenience sampling!). 125
  • 120. Bias in sampling • Bias in sampling is a systematic error in sampling procedures, which leads to a distortion in the results of the study. • Bias can be introduced as a consequence of improper sampling procedures, which result in the sample not being representative of the study population. 126
  • 121. Bias … • There are several possible sources of bias that may arise when sampling. The most well known source is non-response. • Non-response can occur in any interview situation • Respondents may refuse or forget to fill in the questionnaire • The problem lies in the fact that non-respondents in a sample may exhibit characteristics that differ systematically from the characteristics of respondents. 127
  • 122. Bias … There are several ways to deal with this problem and reduce the possibility of bias: 1. Data collection tools should be pre-tested. 2. If non-response is due to absence of the subjects, follow-up of non-respondents may be considered. 3. If non-response is due to refusal to co-operate, an extra, separate study of non-respondents may be considered in order to identify to what extent they differ from respondents. 4. Include additional people in the sample, so that non-respondents can be replaced if their absence was very unlikely to be related to the topic being studied.
  • 123. Bias … Other sources of bias in sampling: • Studying volunteers only – volunteers are motivated to participate in the study. • Sampling of registered patients only – patients reporting to a clinic are likely to differ systematically from people seeking alternative treatments. • Seasonal bias. • Tarmac bias – studying only areas easily accessible by car.
  • 124. Non-probability sampling methods Quota Sampling: Each data collector is assigned a fixed quota of subjects to interview; the numbers falling into certain categories (like residence, sex, age, etc.) are also fixed. Otherwise, the interviewers are free to select anybody they like. From a common sense point of view, quota sampling looks good. It seems to guarantee that the sample will be like the population with respect to all the important characteristics that affect the variable of interest.
  • 125. In quota sampling, the sample is hand-picked to resemble the population with respect to some key characteristics. The method seems reasonable, but does not work very well. The reason is unintentional bias on the part of the interviewers. 131
  • 126. Other non-probability sampling methods • Purposive sampling • Snowball or chain sampling • Extreme case sampling • Maximum variation sampling • Homogeneous sampling • Critical case sampling 132
  • 127. Sample size estimation • How many subjects are needed in the sample to enable us to draw conclusions about the whole population? – Depends on the expected variation in the data and the number of units per cell for analysis – The eventual sample size is a compromise between what is desirable and what is feasible
  • 128. Sample size cont… • Minimum sample size can be calculated depending on the objective of the study – Estimation of a population parameter with a certain precision • Single variable estimation (single population mean, proportion or rate) • Descriptive studies - Prevalence, coverage and utilization rate studies – Test of significant difference between groups • Analytic studies - comparative cross-sectional, case-control, cohort and clinical trials
  • 129. Sample size - single proportion • For making a confidence limit statement (such as in a prevalence study), the following formula can be used to estimate the minimum sample size: $n = \frac{Z_{1-\alpha/2}^{2}\,P(1-P)}{d^{2}}$ • For a population < 10,000, use the finite population correction: $n_f = \frac{N\,Z_{1-\alpha/2}^{2}\,P(1-P)}{d^{2}(N-1) + Z_{1-\alpha/2}^{2}\,P(1-P)}$
  • 130. Single proportion cont… • Parameters in the formula – n is the minimum sample size – P is an estimate of the prevalence rate for the population • From available data or a pilot study result; otherwise use 0.5, which gives the largest minimum sample size; if given as a range, take the value closest to 0.5. – d is the margin of sampling error tolerated – $Z_{1-\alpha/2}$ is the standard normal value at the 100(1-α)% confidence level; α is mostly taken to be 5% • Usually a 95% confidence level is used, for which Z = 1.96 – N is the population size
  • 131. Exercise • What sample size do we need to estimate the prevalence of HIV among residents of a town such that the error of estimation is within 1% of its actual parameter with 95% confidence? 137
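  One way to approach this exercise is sketched below. The exercise gives no prior prevalence, so the sketch assumes P = 0.5 (the conservative choice described above) and reads "within 1%" as an absolute margin d = 0.01. The function name and the town size of 8000 used for the finite population correction are hypothetical.

```python
import math
from statistics import NormalDist

def n_single_proportion(p, d, conf=0.95, N=None):
    """Minimum sample size for estimating one proportion within margin d."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # about 1.96 for 95%
    x = z ** 2 * p * (1 - p)
    if N is None:
        return math.ceil(x / d ** 2)
    # Finite population correction for a population of size N
    return math.ceil(N * x / (d ** 2 * (N - 1) + x))

n = n_single_proportion(p=0.5, d=0.01)             # assumed P = 0.5, d = 0.01
n_f = n_single_proportion(p=0.5, d=0.01, N=8000)   # hypothetical town of 8000
print(n, n_f)
```

  Under these assumptions the uncorrected size is 9604; a known prior prevalence far from 0.5 would give a smaller figure.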
  • 132. Measuring prevalence for more than one item in one group • Take estimated prevalence of the most important item to be measured or • Determine sample size for each item/specific objective and then – Take estimated prevalence of the item that gives the maximum sample size 138
  • 133. Sample size-two proportion For a test of significance study the following formula can be used: $n = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^{2}\,[\,p_1(1-p_1) + p_2(1-p_2)\,]}{(p_1 - p_2)^{2}}$ Parameters: n - size of sample in each group P1, P2 – estimated population prevalence in the comparison groups β = 1 - Power (power is the probability that, if the two proportions differ, the test will produce a significant difference) – Usually a power of 80% or 90% is used
  • 134. Exercise A study is designed to assess the difference in the proportion of physicians leaving health services in urban and rural areas. From available literature 30% and 15% of physicians are estimated to leave services in rural and urban areas within three years of graduation respectively. What sample size is required for the study? 140
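  A sketch of this exercise using the two-proportion formula from the previous slide. The exercise does not state a confidence level or power, so 95% and 80% are assumed here; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def n_two_proportions(p1, p2, conf=0.95, power=0.80):
    """Sample size per group for comparing two independent proportions."""
    z_a = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # Z(1 - alpha/2)
    z_b = NormalDist().inv_cdf(power)                # Z(1 - beta)
    numerator = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(numerator / (p1 - p2) ** 2)

# Rural 30% vs urban 15% leaving services; 95% confidence, 80% power assumed
n_80 = n_two_proportions(0.30, 0.15)
n_90 = n_two_proportions(0.30, 0.15, power=0.90)
print(n_80, n_90)
```

  With these assumptions about 118 physicians per group are needed; raising the power to 90% increases the requirement.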
  • 135. Sample size – case-control studies • Formula – $n = \frac{\left[\,Z_{1-\alpha/2}\sqrt{\left(1+\frac{1}{m}\right)p'(1-p')} + Z_{1-\beta}\sqrt{p_1(1-p_1) + \frac{p_0(1-p_0)}{m}}\,\right]^{2}}{(p_1 - p_0)^{2}}$ • Parameters: – P1, P0 – estimated prevalence of exposure in the cases and controls respectively – P0 can be estimated as the population prevalence of exposure – P′ – derived from P1, P0, m and odds ratio – OR: odds ratio of exposure between cases and controls – m: number of control subjects per case subject
  • 136. Exercise • Example: Suppose you want to test for a difference in exposure status between cases and controls at 95% confidence level and with power of 80%, using a 1:1 ratio of cases to controls, while looking for an odds ratio of 2. You assume the prevalence of exposure among controls is 25%. What sample size do you need?
  • 137. Sample size-two proportion • More than one comparison variable – take the one with the smallest estimated difference – To get the largest sample size • Different formulae – Case-control studies – Matched studies – Survival analysis – Other cases • Reference – http://www.statsdirect.com/help/sample_size_and_methods/sms.htm
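  The exercise can be sketched with the unmatched case-control formula. Two relations are assumptions here, since the slides only say that P′ is "derived": the exposure prevalence among cases is taken as p1 = OR·p0 / (1 + p0(OR − 1)), and the pooled prevalence as p′ = (p1 + m·p0)/(m + 1). The function name is hypothetical.

```python
import math
from statistics import NormalDist

def n_case_control(p0, odds_ratio, m=1, conf=0.95, power=0.80):
    """Number of cases for an unmatched case-control study (m controls per case)."""
    z_a = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    z_b = NormalDist().inv_cdf(power)
    # Assumed relations (not spelled out on the slides):
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))   # exposure among cases
    p_bar = (p1 + m * p0) / (m + 1)                      # pooled prevalence p'
    term_a = z_a * math.sqrt((1 + 1 / m) * p_bar * (1 - p_bar))
    term_b = z_b * math.sqrt(p1 * (1 - p1) + p0 * (1 - p0) / m)
    return math.ceil((term_a + term_b) ** 2 / (p1 - p0) ** 2)

cases = n_case_control(p0=0.25, odds_ratio=2, m=1)
print(cases, "cases and", cases * 1, "controls")
```

  Under these assumptions the result is 152 cases and 152 controls, before any continuity correction or allowance for non-response.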
  • 138. Five key factors 1. Confidence level: how certain you want to be that the population figure is within the sample estimate and its associated precision. 2. Variability in the population: the SD is the most usual measure and often needs to be estimated. 3. Margin of error or precision: a measure of the possible difference between the sample estimate and the actual population value. 4. The population proportion: the proportion of items in the population displaying the attributes that you are seeking. 5. Population size: only important if the sample size is greater than 5% of the population in which case the sample size reduces. 144
  • 139. Sample size – other considerations • Non-response – Add contingency – say 10% • More – sensitive topic, self-administered questionnaire (up to 30%) – Response rate for • Cross-sectional survey >85% • Cohort - >60-80% • Sampling technique – In complex samples (cluster, multistage) increase the sample size to account for design effect 145
  • 140. Sample size – other considerations cont. – Design effect - the ratio of the variance of an estimate derived from a complex sampling design to the variance of the estimate from a simple random sample – Usually the sample size is multiplied by 2 (or 1.5) in cluster sampling • Increase – large PSU, many stages, clustered variable • Qualitative methods – sample size is estimated, not determined • Beyond a certain point, it is better to have good quality data than a large sample • Better to have a representative than a large sample – Use representative sampling techniques
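  The two adjustments on these slides (design effect and non-response contingency) might be combined as in the sketch below. The base size of 384 is a hypothetical starting point, and applying the contingency as a simple multiplier (1 + 10%) is one reading of "add contingency", not the only possible one.

```python
import math

def adjust_sample_size(n, design_effect=1.0, contingency=0.10):
    """Inflate a simple-random-sample size for design effect and non-response."""
    return math.ceil(n * design_effect * (1 + contingency))

# Hypothetical base size of 384, cluster design (deff = 2), 10% contingency
adjusted = adjust_sample_size(384, design_effect=2, contingency=0.10)
print(adjusted)
```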
  • 141. Sampling distribution Definition: A parameter is a numerical descriptive measure of a population (e.g. μ). A statistic is a numerical descriptive measure of a sample (e.g. $\bar{X}$). To each sample statistic there corresponds a population parameter. We use $\bar{X}$, $S^{2}$, S, p, etc. to estimate μ, σ², σ, P (or π), etc.
  • 142. Sampling distribution of Means • The sampling distribution of means is one of the most fundamental concepts of statistical inference, and it has remarkable properties. • Since it is a frequency distribution, it has its own mean and standard deviation. Example: let a population of size 6 have the following values for weight of individuals: 55.7, 66.7, 85.5, 79.7, 122.4 and 78.1. Select all possible samples of size 3 from this population, check whether the sample mean is an unbiased estimate of the population mean, and calculate the standard error of the sample mean.
  • 143. Measurements of weight of individuals of the population • Population values: 55.7, 66.7, 85.5, 79.7, 122.4, 78.1 • Sum of observations: 488.1 • Population mean (µ): 81.35 • Population SD (σ): 20.77 • All possible unique samples of size 3: 20 • Formulas: $\mu = \frac{\sum X}{N}$, $\sigma^{2} = \frac{\sum (X-\mu)^{2}}{N}$, number of samples $= \binom{N}{n}$
  • 144.
      Sample  Obs1   Obs2   Obs3   Mean
      S1      55.7   66.7   85.5   69.30
      S2      55.7   66.7   79.7   67.37
      S3      55.7   66.7   122.4  81.60
      S4      55.7   66.7   78.1   66.83
      S5      55.7   85.5   79.7   73.63
      S6      55.7   85.5   122.4  87.87
      S7      55.7   85.5   78.1   73.10
      S8      55.7   79.7   122.4  85.93
      S9      55.7   79.7   78.1   71.17
      S10     55.7   122.4  78.1   85.40
      S11     66.7   85.5   79.7   77.30
      S12     66.7   85.5   122.4  91.53
      S13     66.7   85.5   78.1   76.77
      S14     66.7   79.7   122.4  89.60
      S15     66.7   79.7   78.1   74.83
      S16     66.7   122.4  78.1   89.07
      S17     85.5   79.7   122.4  95.87
      S18     85.5   79.7   78.1   81.10
      S19     85.5   122.4  78.1   95.33
      S20     79.7   122.4  78.1   93.40
      Sum of means: 1627.00 • Mean of means: 81.35 • Variance of means: 86.27 • SD of sample means: 9.29
      Sample mean: $\bar{X} = \frac{\sum X}{n}$ • Sample variance: $S^{2} = \frac{\sum (X-\bar{X})^{2}}{n-1}$ • Mean of sample means: $\mu_{\bar{X}} = \mu$ • Standard deviation (standard error) of $\bar{X}$: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$
  • 145. Properties 1. The mean of the sampling distribution of means is the same as the population mean, μ 2. The SD of the sampling distribution of sample means is ≈ σ/√n if the population is large; for a finite population of size N it is $\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$ 3. The sampling distribution of sample means is approximately normal, regardless of the shape of the population distribution, provided n is large enough (> 30) (Central limit theorem).
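  The weight example can be checked by brute force. The sketch below enumerates all 20 samples of size 3 from the six population values given above and compares the SD of the sample means with the finite-population standard error formula; variable names are illustrative.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

weights = [55.7, 66.7, 85.5, 79.7, 122.4, 78.1]   # population values from the example
N, n = len(weights), 3

mu = mean(weights)            # population mean
sigma = pstdev(weights)       # population SD (divides by N)

# Means of all possible samples of size 3 (without replacement)
sample_means = [mean(s) for s in combinations(weights, n)]

mean_of_means = mean(sample_means)
se_enumerated = pstdev(sample_means)                       # SD of the 20 sample means
se_formula = sigma / sqrt(n) * sqrt((N - n) / (N - 1))     # with finite population correction

print(len(sample_means), round(mean_of_means, 2),
      round(se_enumerated, 2), round(se_formula, 2))
```

  The mean of the sample means equals the population mean (unbiasedness), and the enumerated and formula-based standard errors agree.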
  • 146. Module 3.2: Estimation and Hypothesis Testing 152
  • 148. Estimation Definition: calculating, from sample data, some statistic that is offered as an approximation of the corresponding parameter of the population from which the sample was drawn.
  • 149. Cont… Estimator: a method or rule used to compute an estimate. An estimator needs to have the characteristic of unbiasedness. • An estimator T of the parameter x is said to be an unbiased estimator of x if E(T) = x.
  • 150. Cont… • Estimation is calculating, from sample data, some statistic that offers an approximation for the corresponding parameter of the population from which the sample is drawn. • Properties of good estimators – Unbiased: An estimator is said to be unbiased if in the long run it takes on the value of the population parameter – Efficiency: An estimator is said to be efficient if in the class of unbiased estimators it has minimum variance – Consistency: A sequence of estimators is said to be consistent if it converges in probability to the true value of the parameter – Sufficiency: an estimator is sufficient if it uses all the sample information 156
  • 151. Estimation methods • Point estimate: a single numeric value used to estimate the corresponding population parameter. • Frequently used point estimators (sample statistic → corresponding population parameter): – sample mean → population mean – sample variance → population variance – sample standard deviation → population standard deviation – sample proportion → population proportion
  • 152. Interval Estimate • Interval estimate: Two numerical values defining a range of values that, with a specified degree of confidence, we feel include the parameter being estimated. 158
  • 153. Cont… • Even if the sample mean is a good-quality estimator, it is better to give an interval for the probable magnitude of the population mean. • Confidence intervals are about putting some bounds on how far away the truth might be from your estimate. • The sample mean is the best unbiased estimator of the population mean.
  • 154. Cont… • If the sample is drawn from a normally distributed population, the sampling distribution of the mean will be normal. • Even if the distribution of the population is non-normal, the sampling distribution will be approximately normal if the sample size is sufficiently large. • Ninety-five percent (95%) of the possible values of $\bar{x}$ will lie within two standard deviations of μ, i.e. within $\mu \pm 2\sigma_{\bar{x}}$ (in practice, $\bar{x} \pm 2 s_{\bar{x}}$)
  • 155. Interval estimator component • Reliability coefficient: the value of Z or t that multiplies the standard error: $\bar{x} \pm z\,\frac{\sigma}{\sqrt{n}}$ or $\bar{x} \pm t\,\frac{s}{\sqrt{n}}$ • Standard error – a measure of the variability of the sample mean in repeated sampling.
  • 156. Standard Error of the Mean • It helps us to quantify how good our estimate is of the true, and unknown, population mean: how large an error might we be making? • The standard error of the sample mean is $\frac{SD}{\sqrt{n}}$ and it is: • The error that arises from variability in the sample means • It indicates the variability of the distribution of means of samples caused by sampling error and measurement error.
  • 157. Confidence interval • The confidence interval provides a range that is highly likely (often 95% or 99%) to contain the true population value, or parameter that is being estimated. • The narrower the interval the more informative is the result. It is usually calculated using the point estimate and its standard error. 163
  • 158. • Provide an interval around our estimate showing how much error there might be on either side of the estimate: [lower confidence limit … estimate … upper confidence limit]
  • 159. Interval estimate for mean: one sample situation • Confidence interval of the mean with known population standard deviation: $\bar{x} \pm Z_{1-\alpha/2}\,SE(\bar{x}) = \bar{x} \pm Z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}$ • Confidence interval of the mean with unknown population standard deviation, for small sample size: $\bar{x} \pm t_{1-\alpha/2}(df)\,se(\bar{x}) = \bar{x} \pm t_{1-\alpha/2}(n-1)\,\frac{s}{\sqrt{n}}$
  • 160. Cont… Interpretation of confidence interval • Probabilistic: in repeated sampling from a normally distributed population with known SD (σ), 100(1-α)% of all intervals of the form $\bar{x} \pm Z_{1-\alpha/2}\,\sigma/\sqrt{n}$ will in the long run include the population mean • Practical: when sampling from a normally distributed population with known SD (σ), we are 100(1-α)% confident that the single computed interval contains the population mean.
  • 161. Cont… • Commonly used values of the confidence coefficient are 0.90, 0.95 and 0.99, with associated reliability coefficients of 1.645, 1.96 and 2.58 respectively for the standard normal random variable (Z). • Precision: the quantity obtained by multiplying the reliability coefficient by the SE of the mean is called the margin of error.
  • 162. Computing a 95% and 99% CI for μ • Given $\bar{x}$ = 19.26, σ = 2.52 and n = 117 • At 95% confidence level, α = 0.05 (α/2 = 0.025) and at 99%, α = 0.01 (α/2 = 0.005) • Z0.975 = 1.96 and Z0.995 = 2.58 • 95% CI for μ: 19.26 ± 1.96 × 2.52/√117 = (18.80 ≤ μ ≤ 19.72) • 99% CI for μ: 19.26 ± 2.58 × 2.52/√117 = (18.66 ≤ μ ≤ 19.86)
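  The same intervals can be reproduced with Python's standard library. This sketch takes the normal quantile from `statistics.NormalDist` rather than a table, so the reliability coefficient is slightly more precise than 1.96 or 2.58; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, conf=0.95):
    """CI for the mean when the population SD (sigma) is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # reliability coefficient
    margin = z * sigma / sqrt(n)                   # margin of error
    return x_bar - margin, x_bar + margin

ci95 = z_interval(19.26, 2.52, 117, conf=0.95)
ci99 = z_interval(19.26, 2.52, 117, conf=0.99)
print([round(v, 2) for v in ci95], [round(v, 2) for v in ci99])
```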
  • 163. Computing CI for μ when σ is unknown • When the population SD (σ) is unknown, it should be estimated from the sample SD (s) • Accordingly, the standard error of the sample mean will be estimated by s/√n • Therefore, the (say) 95% CI for μ with n < 30 will be based on the t-statistic as: $\bar{x} \pm t_{0.975}(n-1)\,\frac{s}{\sqrt{n}}$, where (n-1) is the degrees of freedom
  • 164. Example • Consider the following summary information based on data on systolic blood pressure of a random sample of 30 individuals selected from a normal population: $\bar{x}$ = 115.9, s = 16.3. Compute a 95% and 99% CI for μ • n = 30, df = 30-1 = 29; at 95% confidence level, t0.975(29) = 2.045 and at 99%, t0.995(29) = 2.756; se($\bar{x}$) = 16.3/√30 = 2.98 • 95% CI for μ: 115.9 ± 2.045 × 2.98 = (109.8 ≤ μ ≤ 122.0) • 99% CI for μ: 115.9 ± 2.756 × 2.98 = (107.7 ≤ μ ≤ 124.1)
  • 165. Standard Error of the difference between two sample means • Most medical research is comparative; as a result we are more often concerned with two or more samples rather than a single sample, i.e., with comparing the difference between two samples. • This helps in deciding whether or not it is likely that the two means are equal • When the interval includes 0, the two means might be equal. • When the interval does not include zero, there is evidence that the two means are different.
  • 166. Cont…. The Z statistic can be used in a confidence interval to estimate the difference between two means if the variances of the populations are known. A 95% confidence interval for the difference of the two means is given by: $(\bar{X}_1 - \bar{X}_2) \pm Z_{0.975}\sqrt{\frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}} = (\bar{X}_1 - \bar{X}_2) \pm 1.96\sqrt{\frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}}$
  • 167. Unknown Variance The t-test statistic is used when the population standard deviations are unknown and the sample sizes are small, under two sets of conditions: 1. When equal variances are assumed 2. When the variances are unequal
  • 168. Cont… • When the variances are equal, the sample variances are pooled to estimate the common variance. • The pooled estimate is obtained as a weighted average of the two sample variances. • Each sample variance is weighted by its degrees of freedom (n-1). • If the sample sizes are equal, the weighted average equals the arithmetic mean of the two sample variances. • If the sample sizes are different, the weighted average takes advantage of the additional information provided by the larger sample.
  • 169. Unknown but equal variances • The pooled standard deviation (Sp) is calculated using the following formula: $S_p = \sqrt{\frac{(n_1-1)S_1^{2} + (n_2-1)S_2^{2}}{n_1+n_2-2}}$ • Then the standard error of the difference of the two sample means is: $se(\bar{X}_1 - \bar{X}_2) = S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
  • 170. Example: Was there a difference in the mean fasting blood glucose level between men and women, given data from normal populations?
      Sex     Mean    SD      n
      Men     98.14   19.59   57
      Women   95.19   14.03   59
      Total   96.64   16.98   116
      • Compute a 95% CI for the population mean difference – Assuming the standard deviations (SD) are population SDs – Assuming the population variances are unknown but assumed to be equal
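  A sketch of the blood pressure example. Python's standard library has no t-distribution quantile function, so the tabulated critical values quoted on the slide (2.045 and 2.756 for 29 df) are passed in as arguments; the function name is hypothetical.

```python
from math import sqrt

def t_interval(x_bar, s, n, t_crit):
    """CI for the mean when sigma is unknown; t_crit comes from a t table (n-1 df)."""
    se = s / sqrt(n)                 # estimated standard error of the mean
    margin = t_crit * se
    return x_bar - margin, x_bar + margin

ci95 = t_interval(115.9, 16.3, 30, t_crit=2.045)   # t0.975(29) from the slide
ci99 = t_interval(115.9, 16.3, 30, t_crit=2.756)   # t0.995(29) from the slide
print([round(v, 1) for v in ci95], [round(v, 1) for v in ci99])
```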
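  Both parts of the fasting blood glucose example can be sketched from the summary table. The t critical value for 114 degrees of freedom is not given on the slides; 1.981 is an approximate table value assumed here, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

m1, s1, n1 = 98.14, 19.59, 57    # men
m2, s2, n2 = 95.19, 14.03, 59    # women
diff = m1 - m2

# (a) Treating the SDs as known population SDs: z-based interval
se_known = sqrt(s1**2 / n1 + s2**2 / n2)
z = NormalDist().inv_cdf(0.975)
ci_known = (diff - z * se_known, diff + z * se_known)

# (b) Unknown but equal variances: pooled SD and t-based interval
sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
se_pooled = sp * sqrt(1 / n1 + 1 / n2)
t_crit = 1.981                   # assumed approximate t0.975 with 114 df
ci_pooled = (diff - t_crit * se_pooled, diff + t_crit * se_pooled)

print([round(v, 2) for v in ci_known], [round(v, 2) for v in ci_pooled])
```

  Both intervals include zero, so under either assumption the data are compatible with equal mean glucose levels in men and women.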
  • 171. Factors affecting the length of a confidence interval (CI) – Sample size (n) – Standard deviation (σ) – Confidence level (1-α) 177
  • 172. Hypothesis Testing Why is hypothesis testing so important? • Hypothesis testing provides an objective framework for making decisions using probabilistic method, rather than relying on subjective impressions. • The Null hypothesis, denoted by Ho, is the hypothesis that is to be tested. • The alternative hypothesis H1 is the hypothesis that in some sense contradicts the null hypothesis. 178
  • 173. Cont… • While making a decision on the null and alternative hypotheses, we have four possible outcomes: 1. We accept Ho, and Ho is in fact true – confidence level (1-α). 2. We accept Ho, and H1 is in fact true – Type II error (β). 3. We reject Ho, and Ho is in fact true – Type I error (α). 4. We reject Ho, and H1 is in fact true – Power of the test (1-β).
  • 174. One Sample Test for the Mean from a Normal population 1. One-sided alternative (one-tailed), unknown variance • A one-tailed test is a test in which the values of the parameter being studied (in this case the mean) under the alternative hypothesis are allowed to be either greater than or less than the values of the parameter under the null hypothesis, but not both
  • 175. Cont… I. Alternative mean < Null mean • One-sample t-test for the mean of a normal distribution with unknown variance, to test the hypothesis Ho: μ = μo vs H1: μ < μo, using $t = \frac{\bar{X} - \mu_o}{s/\sqrt{n}}$ • If t < $t_{n-1,\alpha}$ (that is, t < $-t_{n-1,1-\alpha}$), then reject Ho • If t ≥ $t_{n-1,\alpha}$, then do not reject Ho
  • 176. Cont… Two ways to determine statistical significance: 1. Critical value method – comparing the tabulated value of the test statistic to the calculated value for a given level of significance 2. P-value method 182
  • 177. Cont… The p-value is the α level at which the given value of the test statistic (such as t) would be on the borderline between the acceptance and rejection regions. $p = P(t_{n-1} \le t)$, where p is the area to the left of t under a $t_{n-1}$ distribution.
  • 178. Guidelines to judge p-value 1. If 0.01 <= p < 0.05, statistically significant 2. If 0.001 <= p < 0.01, statistically highly significant 3. If p < 0.001, very highly statistically significant 4. If p >= 0.05, not statistically significant
  • 179. II. Alternative mean > Null mean • To test the hypothesis Ho: μ = μo vs H1: μ > μo, variance unknown, with significance level α, the test is based on t, where $t = \frac{\bar{x} - \mu_o}{s/\sqrt{n}}$ • If t > $t_{n-1,1-\alpha}$, Ho is rejected • If t ≤ $t_{n-1,1-\alpha}$, Ho is accepted
  • 180. Cont… 2. Two-sided alternatives (two tailed) It is a test in which the values of the parameter being studied under the alternate hypothesis are allowed to be either greater than or less than the values of the parameter under the null hypothesis, Ho. 186
  • 181. Cont… • To test the hypothesis Ho: μ = μo versus H1: μ ≠ μo with a significance level of α, using $t = \frac{\bar{x} - \mu_o}{s/\sqrt{n}}$: • If |t| > $t_{n-1,1-\alpha/2}$, Ho is rejected • If |t| ≤ $t_{n-1,1-\alpha/2}$, Ho is accepted
  • 182. Cont… • P-value for two-tailed t-test, with $t = \frac{\bar{x} - \mu_o}{s/\sqrt{n}}$: $p = 2\,P(t_{n-1} \le t)$ if t ≤ 0, and $p = 2\,[1 - P(t_{n-1} \le t)]$ if t > 0
  • 183. Cont… One sample Z-test - Two Tailed • The critical values and p-values for the one sample t-test have been specified in terms of percentiles of the t distribution, assuming that the underlying variance is unknown. • In some applications, the variance may be assumed known from prior studies. In this case, the test statistic t-test is replaced by the test statistic ′Z′ 189
  • 184. Cont… To test the hypothesis, we use $z = \frac{\bar{x} - \mu_o}{\sigma/\sqrt{n}}$ • If Z < $Z_{\alpha/2}$ or Z > $Z_{1-\alpha/2}$, reject Ho • If $Z_{\alpha/2}$ ≤ Z ≤ $Z_{1-\alpha/2}$, do not reject Ho
  • 185. Cont… • One Tail • Alternative mean < Null mean (Variance Known) – If Z < $Z_{\alpha}$, then Ho is rejected – If Z ≥ $Z_{\alpha}$, Ho is accepted • Alternative mean > Null mean (Variance Known) – If Z > $Z_{1-\alpha}$, then Ho is rejected – If Z ≤ $Z_{1-\alpha}$, Ho is accepted
  • 186. Relationship between Hypothesis Testing and confidence interval – Two-sided case • Suppose we are testing Ho: μ = μo versus H1: μ ≠ μo. Ho is rejected with a two-sided level α test if and only if the two-sided 100(1-α)% confidence interval for μ does not contain μo; otherwise accept Ho.
  • 187. Hypothesis Testing Two Sample Inference • In a two sample hypothesis testing, the underlying parameters of two different Population, neither of whose values is assumed Known, are compared. • Two samples are said to be Paired when each data point of the first sample is matched and is related to a unique data point of the second sample. 193
  • 188. Cont… • Two samples are said to be independent if the data points in one sample are unrelated to the data points in the second sample 194
  • 189. The paired t-test • The statistic is given by $t = \frac{\bar{d}}{SD(d)/\sqrt{n}}$, where $\bar{d}$ is the mean of the observed differences, SD(d) is the sample standard deviation of the observed differences and n is the number of differences
  • 190. Cont… • Degrees of freedom: n-1 – If t > $t_{n-1,1-\alpha/2}$ or t < $-t_{n-1,1-\alpha/2}$, then Ho is rejected. – If $-t_{n-1,1-\alpha/2}$ ≤ t ≤ $t_{n-1,1-\alpha/2}$, then Ho is not rejected. • The p-value is twice the tail area beyond |t| under the $t_{n-1}$ distribution
  • 191. • Example: • Suppose a sample of 20 students were given a test before studying a particular module and then again after completing the module. • We want to find out if, in general, our teaching leads to improvements in students’ knowledge/skills (i.e. test scores). 197
  • 192.
      Student  Pre-module  Post-module  Difference
      1        18          22            4
      2        21          25            4
      3        16          17            1
      4        22          24            2
      5        19          16           -3
      6        24          29            5
      7        17          20            3
      8        21          23            2
      9        23          19           -4
      10       18          20            2
      11       14          15            1
      12       16          15           -1
      13       16          18            2
      14       19          26            7
      15       18          18            0
      16       20          24            4
      17       12          18            6
      18       22          25            3
      19       15          19            4
      20       17          16           -1
  • 193. • Hypothesis: Ho: Δ = 0 and HA: Δ ≠ 0 • Calculating the mean and standard deviation of the differences: $\bar{d}$ = 2.05 and sd(d) = 2.837. Therefore, se($\bar{d}$) = 2.837/√20 = 0.634 • So, we have: t = 2.05/0.634 = 3.231 on 19 df with p = 0.004. • Therefore, there is strong evidence that, on average, the module does lead to improvements.
  • 194. Two sample t-test for independent samples with equal variance • The equation is given by: $t = \frac{\bar{X}_1 - \bar{X}_2}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$, where the weighted average of variance 1 and variance 2 is used as the estimate of $\sigma^{2}$ • The degrees of freedom will be the sum of the degrees of freedom of the two samples, i.e., (n1-1) + (n2-1)
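  The paired t statistic can be recomputed from the 20 pre- and post-module scores in the table. The sketch stops at the t statistic and degrees of freedom, because the standard library does not provide the t distribution needed for the p-value; variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

pre  = [18, 21, 16, 22, 19, 24, 17, 21, 23, 18,
        14, 16, 16, 19, 18, 20, 12, 22, 15, 17]
post = [22, 25, 17, 24, 16, 29, 20, 23, 19, 20,
        15, 15, 18, 26, 18, 24, 18, 25, 19, 16]

diffs = [b - a for a, b in zip(pre, post)]   # post minus pre for each student
n = len(diffs)
d_bar = mean(diffs)                          # mean difference
sd_d = stdev(diffs)                          # sample SD of differences (n-1 divisor)
se_d = sd_d / sqrt(n)                        # standard error of the mean difference
t_stat = d_bar / se_d
df = n - 1

print(round(d_bar, 2), round(sd_d, 3), round(se_d, 3), round(t_stat, 3), df)
```

  The observed t exceeds the two-sided 5% critical value for 19 df (about 2.09), which agrees with the conclusion on the slide.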
  • 195. Estimation and Hypothesis testing of population proportion 201
  • 196. Sampling distribution of proportions Construction • It is done in the same manner as that of the mean • take all possible samples of a given size • Compute the sample proportion for each • Prepare a frequency distribution of the proportions 202
  • 197. Cont… Characteristics: – When the sample size is large, the distribution is approximately normal – The mean of the distribution, $\mu_{\hat{p}}$, will be equal to the true proportion P – The variance of the distribution, $\sigma^{2}_{\hat{p}}$, will be equal to $\frac{p(1-p)}{n}$
  • 198. Sampling distribution of difference between two proportions • For independent random samples of sizes n1 and n2 drawn from two populations of dichotomous variables, where P1 and P2 are the population proportions of the characteristic: • The distribution of $\hat{p}_1 - \hat{p}_2$ is approximately normal with mean: $\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$ • And variance: $\sigma^{2}_{\hat{p}_1 - \hat{p}_2} = \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}$
  • 199. Estimation of single proportions • Confidence intervals of proportions are obtained by approximation to the normal distribution, using the sample standard deviation. • The confidence interval for the population proportion: $p \pm Z\sqrt{\frac{p(1-p)}{n}}$, where p is the proportion of successes (events), q = (1 - p) is the proportion of failures, n is the sample size and Z denotes the z value relating to a defined probability level.
  • 200. Estimation of difference between two proportions • The unbiased point estimators are $\hat{p}_1$ and $\hat{p}_2$, so the point estimate of the difference is $\hat{p}_1 - \hat{p}_2$ • Standard error of the estimate, when n1 and n2 are large enough and the proportions are not close to 1 or 0 • Since the population proportions are not known, it is estimated by: $\hat{\sigma}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
  • 201. Cont… • Therefore, the 100(1-α)% confidence interval will be: $(\hat{p}_1 - \hat{p}_2) \pm Z_{1-\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
  • 202. Hypothesis testing on single population proportions • Follows from the properties of the sampling distribution of the sample proportion • The null hypothesis: $H_o: P = P_o$ • The alternate hypothesis: $H_A: P \ne P_o$
  • 203. Cont… • Test statistic: $Z = \frac{\hat{p} - p_o}{\sqrt{\frac{p_o(1-p_o)}{n}}}$ • When Ho is true, Z is approximately distributed as a standard normal variable
  • 204. Testing differences between two sample proportions • The most commonly used test is Ho: P1-P2 = 0, or P1 = P2 • Under Ho the two proportions are equal, so the pooled estimate of the common proportion will be $\bar{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$ • Standard error: $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}$
  • 205. Cont… • The test statistic will be: $z = \frac{(\hat{p}_1 - \hat{p}_2) - (P_1 - P_2)}{\sigma_{\hat{p}_1 - \hat{p}_2}}$
  • 206. Example: Comparison of number of swimming hours by swimmers with or without erosion of dental enamel (EDE)
      Swimming hours per week   EDE: Yes   EDE: No   Total
      ≥ 6 hours                 32         118       150
      < 6 hours                 17         127       144
      Total                     49         245       294
      Prevalence of EDE (P): 0.167 • Standard error: 0.022 • 95% CI for P: Lower 0.124, Upper 0.209
  • 207. 1. Estimate the prevalence of erosion of dental enamel and calculate a 95% CI. 2. From previous studies among swimmers, it is claimed that the prevalence of erosion of dental enamel was 14%. Is the claim justified? Give your p-value.
  • 208. 3. Compute the respective prevalence of erosion of dental enamel for those who had ≥ 6 hours and < 6 hours of swimming time and calculate a 95% CI for the difference in the prevalence. 4. Is there a difference in the prevalence of erosion of dental enamel between the two swimming times? Give your p-value.
  • 209.
      Amount of swimming time per week    P
      ≥ 6 hours                           0.213
      < 6 hours                           0.118
      Total                               0.167
      p1 – p2                             0.095
      Test of Ho: P1=P2, HA: P1≠P2
      se(p1-p2)                           0.044
      Z                                   2.174
      95% CI for P1-P2
      se(p1-p2)                           0.042
      Lower 95%                           0.013
      Upper 95%                           0.177
  • 210. Exercise: A study was conducted to look at the effect of oral contraceptives (OC) on heart disease in women 40-44 years of age over 3 years. Given the following data, is there a difference in the rate of MI between OC-users and non-users? Compute a 95% CI for the difference.
      OC-use group     MI over 3 years: Yes    MI over 3 years: No    Total
      OC-users                           13                  4,987     5,000
      Non-OC-users                        7                  9,993    10,000
      Total                              20                 14,980    15,000
  • 212. Errors in • Design • Execution • Analysis • Presentation • Interpretation • Omission
  • 213. Statistical errors related to study design • Study aims and primary outcome measures not clearly stated • Inadequate sample size • Choice of an inappropriate high-risk sample to make inferences about the general population • Failure to report the number of participants or observations • Use of an inappropriate control group
  • 214. Errors in execution • Failure to adhere to the study protocol – Misuse of sample selection procedures – Exclusion and inclusion criteria not strictly followed – Failure to follow randomization procedures
  • 215. Statistical errors in presentation • Inadequate graphical or numerical description of basic data – Presenting or plotting the mean with no indication of variability – Giving SE instead of SD to describe data – Failure to define the ± notation used for describing variability – Data and results reported to an unrealistic level of numerical precision – Inappropriate graph selection that does not reflect the characteristics of the variables, and use of three-dimensional graphs for two-dimensional data
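The difference between SD and SE can be illustrated with a short Python sketch. The measurements below are hypothetical values invented purely for illustration; they are not from the slides.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical measurements, invented for illustration only
values = [120, 132, 118, 141, 127, 135, 124, 130]

sd = stdev(values)            # spread of the individual observations
se = sd / sqrt(len(values))   # precision of the sample mean

print(f"mean = {mean(values):.1f}, SD = {sd:.1f}, SE = {se:.1f}")
```

Because SE equals SD divided by the square root of n, it is always smaller than SD for n > 1, so reporting "mean ± SE" without saying so makes the data look less variable than they are.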
  • 217. Statistical errors in analysis • Using methods of analysis when their assumptions are not met • Analyzing paired data while ignoring the pairing • Failing to take account of ordered categories • Treating multiple observations on one subject as independent • Improper multiple pair-wise comparisons of more than two groups • Quoting confidence intervals that include impossible values • Failure to use multivariate techniques to adjust for confounding factors
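One reason unadjusted pair-wise comparisons are a problem is that the chance of at least one false positive grows with the number of tests. The sketch below assumes, as a simplification, that the comparisons are independent and each is tested at level α, in which case the familywise error rate is 1 - (1 - α)^k for k comparisons. The function name is hypothetical.

```python
from math import comb


def familywise_error_rate(n_groups, alpha=0.05):
    """Approximate chance of at least one false positive across all
    pair-wise comparisons, assuming the comparisons are independent."""
    n_comparisons = comb(n_groups, 2)  # number of distinct pairs of groups
    return n_comparisons, 1 - (1 - alpha) ** n_comparisons


for groups in (3, 4, 5):
    k, fwer = familywise_error_rate(groups)
    print(f"{groups} groups: {k} comparisons, familywise error about {fwer:.2f}")
```

Real pair-wise comparisons share data and are not independent, so these figures are only a rough guide, but they show why the nominal 5% level no longer holds once several unadjusted tests are run.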
  • 218. Statistical errors in interpretation of study findings • Wrong interpretation of results – “Non-significant” interpreted as “no effect” or “no difference” – Drawing conclusions not supported by the study data – Significance claimed without any data analysis or statistical test being mentioned • Failure to discuss sources of potential bias and confounding factors
  • 219. Consequences of statistical errors • It may be impossible to get ethical approval to conduct the study • Other researchers may be led to follow a false line of investigation • Patients may receive an inferior treatment, either as a direct consequence of the results of the study or possibly through delay in the introduction of a truly effective treatment • If the results go unchallenged, the researchers may use the same inferior statistical methods in future research, and others may copy them, leading to inappropriate conclusions