Advanced Biostatistics presentation pptx

Module 1.1: Introduction
What is statistics?
What is Biostatistics?
Why we study Biostatistics?

• Statistics is a field of study concerned with:
1. the collection, organization, summarization,
and analysis of data; and
2. the drawing of inferences about a body of
data when only a part of the data is observed.
• Biostatistics: When the tools of statistics are
employed on the data derived from the biological
sciences and medicine or public health, we use
the term biostatistics.

• Statistics versus statistic (field of study versus
numerical quantity computed from sample data)
• Roughly speaking, the field of statistics can be
divided into:
• Mathematical Statistics: the study &
development of statistical theory and methods in
the abstract and
• Applied Statistics: the application of statistical
methods to solve real problems involving
randomly generated data, and the development
of new statistical methodology motivated by real
problems

Rationale of studying Statistics
• Statistics provides a way of organizing information on
a wider and more formal basis than relying on the
exchange of anecdotes or biography and personal
experiences
• More and more things are now measured quantitatively
in medicine and public health
• There is a great deal of intrinsic (inherent) variation in
most biological processes

• The medical and public health literature is replete
with reports in which statistical techniques are
used extensively
• The planning, conduct and interpretation of much of
medical and public health research are becoming
increasingly reliant on statistical technology

Limitations of statistics
• It deals with only those subjects of inquiry that are
capable of being quantitatively measured and
numerically expressed.
• It deals with aggregates of facts and attaches no
importance to individual items: it is suited only when group
characteristics are to be studied.
• Statistical data are only approximately, not
mathematically, correct.

• It can be used to establish wrong conclusions and
therefore should be used only by experts.
• Remember the three kinds of lies: lies, damned lies, and
statistics
• Evan Esar’s Definition of Statistics and Quote:
“The science of producing unreliable facts from
reliable figures”
• “Statistics is the only science that enables
different experts using the same figures to draw
different conclusions”

Variable
• A characteristic that takes on different values in
different persons, places, or things is called a
variable. The characteristic is not the
same when observed in different possessors of it.
• Quantitative variable: one that can be
measured in the usual sense. For example,
measurements on the heights of adults, the
weights of children, and the ages of patients.
• Qualitative variable: a characteristic that can only be
categorized, such as possessing or not possessing
some characteristic of interest, ethnic group, etc.

• Random Variable: Whenever we determine the
height, weight, or age of an individual, the result is
frequently referred to as a value of the respective
variable.
• When the values obtained arise as a result of
chance factors, so that they cannot be exactly
predicted in advance, the variable is called a
random variable.
• When a child is born, we cannot predict exactly his
or her height at maturity. Attained adult height is
the result of numerous genetic and environmental
factors.

Scales of measurement
• Scales of measurement refer to ways in which
variables/numbers are defined and categorized.
Each scale of measurement determines the
appropriateness for use of certain statistical
analyses.
• There are four scales of measurement: nominal,
ordinal, interval, and ratio.

• Nominal: Categorical data and numbers that are simply
used as identifiers or names represent a nominal scale
of measurement.
• Example: gender, coding Female as 1 and Male as 2 or
vice versa
• Ordinal: An ordinal scale of measurement represents
an ordered series of relationships or rank order.
• Example: Likert-type scales; how much pain are you in
today? (on a scale of 1 to 10 with one being no pain
and ten being high pain), represent ordinal data.

• Interval: A scale which represents quantity and has
equal units but for which zero represents simply an
additional point of measurement is an interval scale.
• In interval scales zero does not represent the absolute
lowest value.
• Example: Measurement of temperature on the Fahrenheit
scale, measurement of sea level

• Ratio: The ratio scale of measurement is similar to
the interval scale in that it also represents quantity
and has equality of units. However, this scale also has
an absolute zero (no numbers exist below the zero). A
negative length is not possible.
• Example: physical measures height and weight.
• Often, the distinction between interval and ratio
scales can be ignored in statistical analyses.
• The distinction between these two scales and the ordinal and
nominal scales is more important.

Data
• Data are observations of random variables
made on the elements of a population or sample
• Data are the quantities (numbers) or qualities
(attributes) measured or observed that are to be
collected and/or analyzed
• The word data is plural, datum is singular
• A collection of data is often called a data set
(singular)

Data and information
• Data are raw, unorganized facts that need to be
processed. Data can be something simple and
seemingly random and useless until they are
organized.
• Example: Each newborn’s birth weight
• When data are processed, organized, structured
or presented in a given context so as to make them
useful, they are called information.
• Example: Mean birth weight of newborns

Types of data
1. Nominal data
• In statistics/biostatistics, we encounter many
different types of data.
• One of the simplest types of data is nominal data,
in which the values fall into unordered categories
or classes. Example: sex, marital status, ethnicity,
religion, etc.
• Numbers are often used to represent the
categories. In a certain study, for instance, males
might be assigned the value 1 and females the
value 0

2. Ordinal data
• When the order among categories becomes
important, the observations are referred to as
ordinal data.
• For example injuries may be classified according
to their level of severity, so that
1= fatal, 2= severe, 3= moderate, and 4= minor.
• Here a natural order exists among the groupings:
a smaller number represents a more serious
injury. However we are still not concerned with
the magnitude of these numbers.

3. Discrete data
• For discrete data both ordering and magnitude
are important.
• In this case, the numbers represent actual
measurable quantities or counts rather than
mere labels.
• Examples of discrete data include the number of
car accidents in a given month, the number of
times a woman has given birth.

4. Continuous data
• Data that represent measurable quantities but
are not restricted to taking on certain specified
values.
• In this case the difference between any two
possible data values can be arbitrarily small.
• Examples of continuous data include time, the
serum cholesterol level of a patient, etc.

Types and Methods of Data Collection
• Statistical data may be classified
under two categories depending upon the
source:
- Primary data: data
collected by the investigators themselves for the
purpose of a specific inquiry or study.
- Secondary data: data that have already been collected by
others and are used by the investigator.

Data collection methods
1. Observation
• It is a technique that involves systematically
selecting, watching, and recording behaviors of
people, measuring characteristics or other
phenomena.
• It includes all methods from simple visual
observations to the use of high level machines.
• Advantage: Gives relatively more accurate data
on behavior and activities.
• Disadvantages: The investigator’s or observer’s own
bias, prejudice, or desires may be reflected, and the method
needs more resources and skilled human power
when high level machines are used.

2. Self-administered Questionnaire & Interviews
• These are the most commonly used research data
collection techniques.
• Self-administered questionnaire is
– simpler and cheaper
– can be administered to many persons
simultaneously
– can be sent by post (unlike interviews)
• But requires a certain level of education and skill
on the part of the respondents
• People of a low socio-economic status are less
likely to respond

3. Face-to-face and telephone interviews
– An interview is a conversation for gathering
information. A research interview involves an
interviewer, who coordinates the process of the
conversation and asks questions, and an
interviewee, who responds to those questions.
– A good interviewer can stimulate and maintain
the respondent’s interest, and can create a
rapport (understanding) and atmosphere
conducive to the answering of questions.
– If anxiety is aroused, the interviewer can allay it. If
a question is not understood, the interviewer can
repeat it and explain.

4. Mailed Questionnaire Method
• The investigator prepares a questionnaire
pertaining to the field of inquiry and sends it by
post to the informants together with a polite
covering letter explaining in detail the aims and
objectives of collecting the information
• The letter requests the respondents to cooperate by
furnishing the correct replies and returning the
questionnaire duly filled in
• Drawback: response rates tend to be relatively
low, and there may be underrepresentation of
less literate subjects

5. Use of Documentary Sources
• Includes clinical and other personal records,
death certificates, published mortality statistics,
census publications, etc.
• Examples:
- Official publications of CSA
- Publication of MoH and other Ministries
- Newspapers and Journals
- International publications (WHO, UNICEF)
- Records of Hospitals or any HI

6. Computer Direct Interviews
• These are interviews in which the Interviewees
enter their own answers directly into a computer.
• They can be used at malls, trade shows, offices,
and so on.
• The Survey System's optional Interviewing
Module and Interview Stations can easily create
computer-direct interviews. Some researchers
set up a Web page survey for this purpose.

Advantages
• The virtual elimination of data entry and editing
costs
• You will get more accurate answers to sensitive
questions
• Elimination of interviewer bias
• Ensuring skip patterns are accurately followed
• Response rates are usually higher

Disadvantages
• The Interviewees must have access to a
computer or one must be provided for them.
• As with mail surveys, computer direct
interviews may have serious response rate
problems in populations of lower
educational and literacy levels. This method
may grow in importance as computer use
increases.

Choosing Method of data
collection
• Decision Makers Need Information
that is Relevant, Timely, Accurate
and Useable

• The selection of the method of data collection
is also based on practical considerations,
such as:
– The need for personnel, skills, equipment, etc.
in relation to what is available, and the urgency with
which results are needed.
– The acceptability of the procedures to the
subjects: the absence of inconvenience,
unpleasantness, or untoward consequences.
– The probability that the method will provide
good coverage, i.e. will supply the required
information about all or almost all members of
the population or sample.

Choice of survey method will also depend
on several factors. These include:
• Speed: Email and Web page surveys are the fastest methods,
followed by telephone interviewing. Mail surveys are the
slowest.
• Cost: Personal interviews are the most expensive, followed by
telephone and then mail. Email and Web page surveys
are the least expensive for large samples.
• Computer and Internet usage: Web page and Email surveys offer
significant advantages, but you may not be able to generalize their
results to the population as a whole.
• Literacy levels: Illiterate and less-educated people rarely respond
to mail surveys.
• Sensitive questions: People are more likely to answer sensitive
questions when interviewed directly by a computer in one form or
another.

Designing Questionnaire
When designing a questionnaire the following
points should be taken into account
– Keep it (questions) short and simple (KISS)
– Questions should be unambiguous and not
double barreled
– Use simple and direct language. The
questions must be clearly understood by
respondent.
– The wording of a question should be simple
and to the point.
– The best kinds of questions are those which
allow a pre-printed answer to be ticked

– Questions should be neither irrelevant nor too
personal
– Leading questions shouldn’t be asked. A “leading
question” is one that suggests the answer.
– The questionnaire should be designed so that the
questions fall into a logical sequence.
– After finalizing the questionnaire,
translate it into the local languages to be used for data
collection
– The last step in questionnaire design is to test the
questionnaire with a small number of interviews
before conducting your main interviews (a pilot).

General Considerations
• To be successful, involve other experts and
relevant decision-makers in the questionnaire
design process
• Formulate a plan for doing the statistical
analysis during the design stage of the project
• If you used one method in the past and need
to compare results, stick to that method,
unless there is a compelling reason to change

Types of questions
Open-ended Questions:
- Permit free responses that should be recorded
in the respondent’s own words.
They are used for:
• Facts with which the researcher is not very
familiar
• Opinions, attitudes, and suggestions of
informants, or
• Sensitive issues

Closed Questions:
• Offer a list of possible options or answers
from which the respondents must choose.
• Offer a list of options that are exhaustive
and mutually exclusive, and
• Keep the number of options as few as
possible.

Interviewing technique
• Before the questionnaire is used for the data
collection, it should be pre-tested
• Manuals that explain each of the questions should
be prepared – question-by-question specification
• Enumerators and field supervisors should be
trained before they are deployed to the field

• Enumerators should create a good communication
environment with the respondents.
• They should precisely explain the questions in the
questionnaire to the respondent, and should
not lead the respondent.
• There should be strong supervision of the field
work until it is completed.

Rules for asking questions
• Read questions as they are written
• Do not change the order of questions
• Read the questions slowly and clearly
• Read questions in a pleasant voice
• Maintain eye contact which is culturally
appropriate
• Read the entire question to the respondent
• Do not skip questions
• Verify information given by the respondent

Interviewing tactics of Sensitive
Questions
• Sensitive questions may offend
respondents because they may:
– Expose the respondent’s ignorance
– Call for a socially unacceptable answer
– Cause embarrassment

Possible tactics (Barton)
– The everybody approach: “As you know, many
people have been arrested for being involved in
theft. Do you happen to have been arrested for being
involved in theft?”
– The other people approach: “Do you know anyone
who has been arrested for theft? How about yourself?”
– The Kinsey technique: stare firmly into the
respondent’s eyes and ask in simple, clear-cut
language such as that to which the respondent is
accustomed, and with an air of assuming that
everybody has done everything, “Have you ever
been arrested for being involved in theft?”

Informed consents
Participation in a survey should be voluntary and a
respondent can refuse to be interviewed or
measured, etc.
The information given should be simple and clear
and adapted to the respondent’s level of
understanding.
Informed consents can be either signed or verbal

The interviewer is responsible for explaining:
– what the survey is about,
– providing all the necessary information, and
– making sure the respondent understands the
implications of his/her participation before
giving his/her consent.

• Consents must be documented by asking the
respondents to sign an Informed Consent Form
or give verbal consent before doing the
interview.
– These forms must mention:
• who will be doing the study,
• the types of questions that will be asked,
• why the study is being done, and
• who will have access to the information
provided.

Module 1.2: Methods of data
processing, organization and
presentation

No. Ht Wt Sex age FEV No. Ht Wt Sex age FEV
1 175.2 79.2 1 57 3.80 16 177.5 69.7 1 32 4.10
2 164.5 92.4 6 60 3.50 17 164.0 719 2 58 3.15
3 168.5 64.6 1 62 1.48 18 174.0 63.2 1 45 4.25
4 180.0 82.6 1 43 4.35 19 161.0 60.0 2 59 2.75
5 156.0 79.9 2 13 2.70 20 169.5 63.3 3 53 3.32
6 170.0 80.9 1 61 2.35 21 181.5 101.3 1 37 4.20
7 170.0 79.7 1 67 149 22 173.0 72.9 1 47 4.45
8 162.0 57.4 1 63 2.95 23 473.6 55.9 2 39 3.65
9 177.0 98.1 1 46 4.20 24 178.2 39.2 1 70 3.05
10 285.0 61.6 2 47 2.45 25 159.0 63.5 2 42 3.20
11 156.0 60.0 2 43 2.10 26 149.0 69.2 2 58 29.3
12 157.0 62.0 3 34 3.41 27 159.0 80.3 2 63 2.45
13 150.0 51.8 2 49 2.70 28 190.0 883.0 1 60 4.65
14 154.0 58.1 2 47 2.45 29 175.0 85.0 7 41 3.75
15 165.0 70.6 1 79 3.10 30 168.7 855 1 60 3.15

Data cleaning and editing
• When the questionnaires are collected from the
field, they should be coded and edited
• Checks are basically of two sorts, range checks
and consistency checks.
Range checks: exclude, for example, the
erroneous occurrence of code 3 for sex,
which should only be code 1(male) or code
2(female).
Consistency checks: detect impossible
combinations of data
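As a rough illustration of the two kinds of checks, the sketch below applies range rules and one consistency rule to a few records taken from the uncleaned table. The plausible limits for height and FEV, the `pregnant` field, and record 99 are assumptions made up for the example; they are not part of the slides.

```python
# Illustrative limits only (assumptions, not values given in the slides).
RANGE_RULES = {
    "sex": lambda v: v in (1, 2),          # 1 = male, 2 = female
    "ht": lambda v: 100.0 <= v <= 220.0,   # height in cm
    "fev": lambda v: 0.3 <= v <= 8.0,      # FEV in liters
}


def range_errors(record):
    """Return the names of fields whose values fall outside the allowed range."""
    return [field for field, ok in RANGE_RULES.items() if not ok(record[field])]


def consistency_errors(record):
    """Flag impossible combinations; 'pregnant' is a hypothetical field."""
    errors = []
    if record["sex"] == 1 and record.get("pregnant") == "yes":
        errors.append("male recorded as pregnant")
    return errors


records = [
    {"no": 1, "ht": 175.2, "sex": 1, "fev": 3.80},
    {"no": 2, "ht": 164.5, "sex": 6, "fev": 3.50},    # sex code 6 is invalid
    {"no": 7, "ht": 170.0, "sex": 1, "fev": 149.0},   # FEV of 149 is implausible
    {"no": 10, "ht": 285.0, "sex": 2, "fev": 2.45},   # height of 285 cm is implausible
    {"no": 99, "ht": 170.0, "sex": 1, "fev": 3.00, "pregnant": "yes"},  # made-up record
]

flagged = {r["no"]: range_errors(r) + consistency_errors(r) for r in records}
for no, problems in flagged.items():
    print(no, problems or "ok")
```

Flagged records would normally be checked against the original questionnaire rather than corrected by guesswork.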

Basic precautions recommended to
minimize errors during the handling of
data:
• Avoid any unnecessary copying of data from one
form to another
• Use a verification procedure during data entry -
range and skip rules, double data entry, etc.
• Check all calculations carefully, example – date
conversion, units of measurement, etc.

Data organization: Tables
The use of tables for presenting data involves
grouping the data into mutually exclusive categories
of the variable, and counting the number of
occurrences in each category
• Tables should be as simple as possible and
self-explanatory
• Numerical entries of zero should be explicitly
written rather than indicated by a dash
• Totals should be shown either in the top row and
the first column or in the last row and last column
• If data are not original, their source should be
given in a footnote

Asthma versus sex and smoking

Sex and smoking status     Presence of asthma
                           No             Yes
                           n      %       n     %      Total
Sex
  Female                   459    91.6    42    8.4    501
  Male                     439    93.0    33    7.0    472
  Total                    898    92.3    75    7.7    973
Smoking
  Never smoker             480    91.4    45    8.6    525
  Ex-smoker                254    91.7    23    8.3    277
  Current smoker           164    95.9     7    4.1    171
  Total                    898    92.3    75    7.7    973
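The percentages in the table are row percentages: each count divided by its row total. A minimal sketch of that arithmetic, using the counts from the table (the function name `row_percentages` is just an illustrative choice):

```python
# Counts copied from the "Asthma versus sex and smoking" table.
counts = {
    "Female": {"No": 459, "Yes": 42},
    "Male": {"No": 439, "Yes": 33},
    "Never smoker": {"No": 480, "Yes": 45},
    "Ex-smoker": {"No": 254, "Yes": 23},
    "Current smoker": {"No": 164, "Yes": 7},
}


def row_percentages(row):
    """Convert one row of counts to percentages of the row total."""
    total = sum(row.values())
    return {category: round(100 * n / total, 1) for category, n in row.items()}


for label, row in counts.items():
    print(label, row_percentages(row), "Total:", sum(row.values()))
```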

Data presentation: Diagrams
• Allows readers to obtain an overall grasp of the
data presented.
• The relationship can be seen more quickly and
easily from a graph than from a table.
• The choice of one graph over the other depends
on personal choices and/or the type of the data.
– Bar charts and pie charts are commonly used for
quantitative discrete or qualitative data
– Histograms, frequency polygons, and line graphs
are used for quantitative continuous data

[Figure: Component bar graph of smoking status and presence of asthma.
x-axis: Smoking status (Never smoker, Ex-smoker, Current smoker);
y-axis: Number of individuals; legend: asthma No, Yes]

Pie chart: smoking status (%)
Never smoker: 54%
Ex-smoker: 28%
Current smoker: 18%

[Figure: Neonatal Mortality Rate by Sex.
x-axis: Surveillance year (2005 to 2011);
y-axis: NNMR per 1000 LB; series: Female, Male]

General rules for constructing graphs
• Every graph should be self-explanatory and as
simple as possible
• Titles are usually placed below the graph
• Legends or keys should be used to differentiate
variables if more than one is shown
• Axis labels should be placed to read from
the left side and from the bottom
• The units into which the scale is divided should
be clearly indicated
• The numerical scale representing frequency
must start at zero or a break in the line should
be shown

Module 1.3: Data summarization

Data Exploration
• The exploration procedure produces summary
statistics and graphical displays
• The reasons for using the explore procedure are:
– data screening,
– outlier identification,
– description,
– assumption checking, and
– characterizing differences among
subpopulations (groups of cases).

No. Ht Wt Sex age FEV No. Ht Wt Sex age FEV
1 175.2 79.2 1 57 3.80 16 177.5 69.7 1 32 4.10
2 164.5 92.4 1 60 3.50 17 164.0 71.9 2 58 3.15
3 168.5 64.6 1 62 1.48 18 174.0 63.2 1 45 4.25
4 180.0 82.6 1 43 4.35 19 161.0 60.0 2 59 2.75
5 156.0 79.9 2 47 2.70 20 169.5 63.3 2 53 3.32
6 170.0 80.9 1 61 2.35 21 181.5 101.3 1 37 4.20
7 170.0 79.7 1 67 0.80 22 173.0 72.9 1 47 4.45
8 162.0 57.4 1 63 2.95 23 164.2 55.9 2 39 3.65
9 177.0 98.1 1 46 4.20 24 178.2 93.2 1 70 3.05
10 160.5 61.6 2 47 2.45 25 159.0 63.5 2 42 3.20
11 156.0 60.0 2 43 2.10 26 149.0 69.2 2 58 2.20
12 157.0 62.0 2 34 3.41 27 159.0 80.3 2 63 2.45
13 150.0 51.8 2 49 2.70 28 190.0 88.3 1 60 4.65
14 154.0 58.1 2 47 2.45 29 175.0 85.0 1 41 3.75
15 165.0 70.6 1 79 3.10 30 168.7 85.5 1 60 3.15

• Data screening may show that you have
unusual values, extreme values, gaps in
the data, or other peculiarities.
• Exploring the data can help to determine
whether the statistical techniques that you
are considering for data analysis are
appropriate.
• The exploration may indicate that you need
to transform the data if the technique
requires some known distribution, say the
Normal distribution.

Measures of Central tendency
- The arithmetic mean, median and mode
- The arithmetic mean is unique, takes into
account all data points and lends itself to
further manipulation, but is sensitive to
extreme values
- The median is unique, does not depend on the value of
every data point and is not affected by extreme values
- The mode might not exist or might not be unique; it can
be determined for qualitative data

Exercise
• Calculate the mean, median and mode for
the whole sample and sex specific
summary values using the data in the
table below
• Sex – 1=Male, 2=Female
• Height is measured in cm, weight in kg,
age in years and FEV in liters

Ht Wt Sex age FEV
175.2 79.2 1 57 3.80
164.5 92.4 1 60 3.50
168.5 64.6 1 62 1.48
180.0 82.6 1 43 4.35
156.0 79.9 2 47 2.70
170.0 80.9 1 61 2.35
170.0 79.7 1 67 0.80
162.0 57.4 1 63 2.95
177.0 98.1 1 46 4.20
160.5 61.6 2 47 2.45
156.0 60.0 2 43 2.10
157.0 62.0 2 34 3.41
150.0 51.8 2 49 2.70
154.0 58.1 2 47 2.45
165.0 70.6 1 79 3.10
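One way to approach the exercise, shown here for height only, is sketched below with Python's standard `statistics` module. It uses just the 15 rows listed in the table above, so the results are not expected to match a summary based on all 30 subjects; the helper name `summarize` is an assumption of the example.

```python
import statistics

# (height in cm, sex) for the 15 rows of the exercise table; 1 = male, 2 = female
rows = [
    (175.2, 1), (164.5, 1), (168.5, 1), (180.0, 1), (156.0, 2),
    (170.0, 1), (170.0, 1), (162.0, 1), (177.0, 1), (160.5, 2),
    (156.0, 2), (157.0, 2), (150.0, 2), (154.0, 2), (165.0, 1),
]


def summarize(values):
    """Return n, mean, median and all modes of a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),  # a list, since the mode may not be unique
    }


overall = summarize([ht for ht, _ in rows])
by_sex = {sex: summarize([ht for ht, s in rows if s == sex]) for sex in (1, 2)}

print("Both:", overall)
print("Male:", by_sex[1])
print("Female:", by_sex[2])
```

For the whole sample the heights 156.0 and 170.0 each occur twice, which illustrates that the mode need not be unique.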

Summary values
Sex Age Ht Wt FEV
Male Mean 54.85 173.54 80.27 3.42
Median 59.94 174.00 80.90 3.75
Mode 32.47 170.00 57.40 4.20
Sum 932.47 2950.10 1364.60 58.13
n 17 17 17 17
Female Mean 49.16 158.40 64.42 2.81
Median 47.40 159.00 62.00 2.70
Mode 34.43 156.00 60.00 2.45
Sum 639.04 2059.20 837.50 36.53
n 13 13 13 13
Both Mean 52.38 166.98 73.40 3.16
Median 50.96 166.75 71.25 3.15
Mode 32.47 156.00 60.00 2.45
Sum 1571.51 5009.30 2202.10 94.66
n 30 30 30 30

Measures of Variation/Dispersion
• Dispersion of a set of observations refers to the
scatter of observations around a measure
of central tendency
Commonly used measures of variation:
Range, Percentiles, and Standard deviation.
Of these measures, only the standard deviation
assesses the scatter of observations around the mean

The Coefficient of Variation
When comparing the variability of two or more sets of
data, for the same or different variables, standard
deviations may lead to fallacious results.
• The variables involved might be measured in
different units, or might be different characteristics
• Coefficient of Variation (CV) is the standard
deviation expressed as a percentage of the
mean.
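A minimal sketch of the calculation, CV = 100 × (standard deviation / mean), is given below. It assumes the sample standard deviation is intended; the five heights are an arbitrary subset used only to show that the CV does not change when the unit changes, and the last line rechecks one CV from summary values of 12.68 (standard deviation) and 54.85 (mean).

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


heights_cm = [175.2, 164.5, 168.5, 180.0, 156.0]
heights_m = [h / 100 for h in heights_cm]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
print(f"CV in cm: {cv_cm:.2f}%, CV in m: {cv_m:.2f}%")  # same value in either unit

# CV from summary values alone: standard deviation 12.68, mean 54.85
cv_from_summary = round(100 * 12.68 / 54.85, 1)
print(cv_from_summary)
```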

Use the above data to determine standard deviation and Coefficient of variation
Sex Age Ht Wt FEV
Male Mean 54.85 173.54 80.27 3.42
Variance 160.7 49.53 157.22 1.15
Std dev 12.68 7.04 12.54 1.07
CV 23.1 4.1 15.6 31.3
Range 46.06 28 43.9 3.85
Female Mean 49.16 158.4 64.42 2.81
Variance 74.16 32.65 74.78 0.24
Std dev 8.61 5.71 8.65 0.49
CV 17.5 3.6 13.4 17.4
Range 28.98 20.5 28.5 1.55
Both Mean 52.38 166.98 73.40 3.16
Variance 127.58 99.03 181.48 0.83
Std dev 11.3 9.95 13.47 0.91
CV 21.6 6.0 18.4 28.8
Range 46.06 41 49.5 3.85

Data transformations
• The assumptions underlying a statistical method
may not always be satisfied by a particular set of
data.
• For example, a distribution may be positively
skewed rather than normal. Such problems can
often be overcome simply by transforming the
data to a different scale of measurement
• The most common choice is the logarithmic
transformation

Logarithmic transformation
• When a logarithmic transformation is applied
to a variable, each individual value is replaced
by its logarithm.
y = log x
• Where x is the original value and y the
transformed value.
• The logarithm often has the effect both of equalizing
the standard deviations and removing positive
skewness (absence of symmetry)

Choice of a transformation
• There are alternative transformations
• Reciprocal transformation: stronger than
the logarithmic, and appropriate if the
distribution is considerably more positively
skewed than lognormal.
$y = 1/x$

• Square root transformation: used when the
constant variance assumption does not hold
true.
$y = \sqrt{x}$
• It is weaker than the logarithmic transformation.
• Negative skewness can be removed by using a
power transformation, such as a square or a
cubic transformation; the strength increases with
the order of the power
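To make the relative strength of the transformations concrete, the sketch below compares the skewness of a small positively skewed sample before and after the square root and logarithmic transformations. The data are made up for illustration, and skewness is computed with the simple moment coefficient, which is one of several possible definitions.

```python
import math


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


x = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 50]  # made-up, positively skewed values

skew_raw = skewness(x)
skew_sqrt = skewness([math.sqrt(v) for v in x])   # weaker transformation
skew_log = skewness([math.log(v) for v in x])     # stronger transformation

print(f"raw: {skew_raw:.2f}, sqrt: {skew_sqrt:.2f}, log: {skew_log:.2f}")
```

For this sample the skewness shrinks from the raw scale to the square root scale and again to the log scale, in line with the ordering of strength described above.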

Histogram & Normal curve with transformations

Module 2: Probability and
Probability Distributions

• Definition: A random variable is a numerical
quantity that takes different values with specified
probabilities.
• There are two types of random variables: discrete
and continuous.
• Definition: A random variable for which there
exists a discrete set of values with specified
probabilities is a discrete random variable.

• Example: Diarrhoea is one of the most frequent
reasons for visiting health institutions in the first 2
years of life in children.
• Let X be the random variable that represents the
number of episodes of diarrhoea in the first 2
years of life. Then X is a discrete random
variable, which takes on values 0,1,2, ....
• Definition: A random variable whose values form
a continuum (i.e., have no gaps) such that ranges
of values occur with specified probabilities is a
continuous random variable.

Probability Mass Function for a Discrete
Random Variable
• The values taken by a discrete random variable
and their associated probabilities can be expressed
by a rule, or relationship, that is called a probability
mass function (pmf).
• Definition: A pmf is a mathematical relationship, or
rule, that assigns to any possible value r of a discrete
random variable X the probability P(X = r). This
assignment is made for all values r that have
positive probability. The pmf is also referred to as
the probability distribution.

General rules which apply to any
probability distribution
1. Since the values of a probability distribution are
probabilities, they must be numbers in the
interval from 0 to 1.
2. Since a random variable has to take on one of
its values, the sum of all the values of a
probability distribution must be equal to 1.
• Example: Check whether the following function
can serve as the probability distribution of an
appropriate random variable

$f(x) = \frac{x + 2}{12}$ for x = 1, 2, and 3
Substituting the values of x, f(1)=3/12, f(2)=4/12,
and f(3)=5/12
Since none of these values is negative or greater
than one, and since their sum 3/12+4/12+5/12 = 1,
the given function is a probability distribution
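The same check can be written as a short script. The sketch below uses exact fractions so that the sum is compared with 1 without rounding error; the names `f` and `is_valid` are illustrative.

```python
from fractions import Fraction


def f(x):
    """Candidate probability function f(x) = (x + 2) / 12."""
    return Fraction(x + 2, 12)


values = [f(x) for x in (1, 2, 3)]

# Rule 1: every value lies between 0 and 1. Rule 2: the values sum to 1.
is_valid = all(0 <= p <= 1 for p in values) and sum(values) == 1
print(values, is_valid)
```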

Example on Hypertension-control:
• Suppose a physician agrees to use a new anti-
hypertensive drug on a trial basis on the first 4
untreated hypertensives whom she encounters in
her practice before deciding whether to adopt the
drug for routine use.
• Let X = the number of patients out of 4 who are
brought under control. Suppose that from
previous experience with the drug, for any clinical
practice, the drug company expects the following
probabilities.
r 0 1 2 3 4
P(X=r) .008 .076 .265 .411 .240

Example:
• For the above table, for any clinical practice, the
probability that between 0 and 4 hypertensives
are brought under control = 1, i.e.,
• 0.008 + 0.076 + 0.265 + 0.411 + 0.240 = 1
• What is the probability that:
– At least two patients brought under control?
– At most three patients brought under control?
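The two questions can be answered by adding the relevant probabilities from the table. A minimal sketch, with the table stored as a dictionary (the variable names are illustrative):

```python
# P(X = r) for r = 0..4, copied from the hypertension-control table.
pmf = {0: 0.008, 1: 0.076, 2: 0.265, 3: 0.411, 4: 0.240}

# At least two under control: P(X >= 2) = P(2) + P(3) + P(4)
at_least_two = sum(p for r, p in pmf.items() if r >= 2)

# At most three under control: P(X <= 3) = 1 - P(4)
at_most_three = sum(p for r, p in pmf.items() if r <= 3)

print(f"P(X >= 2) = {at_least_two:.3f}")
print(f"P(X <= 3) = {at_most_three:.3f}")
```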

1. Binomial distribution
• The Binomial distribution with parameters n and
p is a discrete probability distribution of the
number of successes in a sequence of n
independent binary (yes/no) experiments, each of
which yields success with probability p.
• A useful summary measure, used to describe
binary variables, is the proportion with which the
variable took one of its values, called success.
• The binomial distribution is used to model the
number of successes in a sample of size n drawn
with replacement from a population of size N.

The Binomial Distribution
• Definition: The distribution of the number of
successes (r) in n statistically independent trials,
where the probability of success on each trial is
p, is known as the binomial distribution, and has
a probability mass function given by:
$P(X = r) = \binom{n}{r} p^{r} (1 - p)^{n - r}, \quad r = 0, 1, 2, \ldots, n$
where $\binom{n}{r} = \frac{n!}{r!\,(n - r)!}$
• The mean is np and variance is np(1-p)

Probability mass function for the binomial
distribution

Example:
• What is the probability of obtaining 2 boys out of
5 children if the probability of a boy is 0.51 at
each birth and the sexes of successive children
are considered independent random variables?
• n=5, p=0.51, 1-p=0.49 and r=2
$P(X = 2) = \binom{5}{2} (0.51)^{2} (0.49)^{3} = \frac{5!}{2!\,3!} (0.51)^{2} (0.49)^{3} = 0.306$
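The calculation can be reproduced with a few lines of Python. This is a sketch using the standard library's `math.comb` for the binomial coefficient; the function name `binomial_pmf` is an illustrative choice.

```python
import math


def binomial_pmf(r, n, p):
    """P(X = r) for a binomial distribution with n trials and success probability p."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)


n, p = 5, 0.51
p_two_boys = binomial_pmf(2, n, p)
print(f"P(X = 2) = {p_two_boys:.3f}")

# The probabilities over r = 0..n should sum to 1, and the mean should equal n * p.
total = sum(binomial_pmf(r, n, p) for r in range(n + 1))
mean = sum(r * binomial_pmf(r, n, p) for r in range(n + 1))
print(total, mean)
```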

Continuous Probability Distribution
• A continuous probability distribution is a smooth
density curve that models the distribution of a
continuous random variable.
• The area under the curve is 1 and the area
within any interval is the
probability that the value of the random variable
is in that interval.
• Density function is a formula used to represent
the distribution of a continuous random variable.

Definition
• A nonnegative function f(x) is the probability
density function of a continuous random variable if:
– The total area bounded by its curve and the x-axis
is equal to one
– The subarea under the curve bounded by the curve, the x-axis and
the perpendiculars erected at any two points a and b
gives the probability that x is between a and b

2. Normal distribution
• The Normal distribution, also called the Gaussian
distribution, is the most important distribution
in all of statistics.
• The normal density is given by:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty$
where π = 3.141… and e = 2.718…

Characteristics
1. It is symmetrical about its mean
2. Mean, median and mode are equal
3. The total area under the curve above the x
axis is one square unit
4. One SD from the mean in both directions covers
approximately 68% of the area
5. The maximum height of the curve, at x = μ, is $\frac{1}{\sigma\sqrt{2\pi}}$
6. The normal distribution is determined by the
parameters standard deviation and mean.

[Figure: The Normal distribution curve, with standard deviation σ = σx and mean μ = μx]

The standard Normal distribution
• Definition: A normal distribution with mean 0
and variance 1 will be referred to as a standard,
or unit, normal distribution. This distribution is
denoted by N(0,1).
$$f(z)=\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}z^{2}},\qquad -\infty<z<+\infty$$
This distribution is symmetrical about 0 (the mean),
since f(z) = f(-z). About 68% of the area under the
normal density lies between -1 and +1, about 95% lies
between -2 and +2, and about 99% lies between
-2.5 and +2.5
97
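
The areas quoted for the standard normal distribution can be verified with a short sketch. It assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the helper name `central_area` is illustrative only.

```python
from statistics import NormalDist

# Standard normal distribution N(0, 1).
Z = NormalDist(mu=0, sigma=1)


def central_area(k: float) -> float:
    """Area under the standard normal density between -k and +k."""
    return Z.cdf(k) - Z.cdf(-k)


for k in (1, 2, 2.5):
    print(k, round(central_area(k), 4))
```

The printed areas are close to 0.68, 0.95 and 0.99 respectively, matching the rounded figures on the slide.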

Application of Normal distribution
• Example:
Suppose it is known that the heights of a population
of individuals are approximately normally
distributed with a mean of 70 inches and standard
deviation of 3 inches. What is the probability that
a person picked at random from this group will be
a) between 65 and 74 inches tall?
b) greater than 75 inches tall?
c) less than 65 inches tall?
98

Solution
Step 1: Transform the values to the standard normal
distribution by using $z=\frac{x-\mu}{\sigma}$
Step 2: Determine the area bounded by the curve,
the x-axis and the two points,
P(a < z < b).
Step 3: Look up the z distribution table for the
area corresponding to each value of z.
99
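
The three steps above can be sketched in code for the height example (mean 70 inches, SD 3 inches). This sketch replaces the z table lookup with `statistics.NormalDist.cdf`, which is a substitution made here for convenience, not the method shown on the slide.

```python
from statistics import NormalDist

# Heights are assumed approximately N(70, 3^2), as stated in the example.
heights = NormalDist(mu=70, sigma=3)

# Step 1: standardise the cut points, z = (x - mu) / sigma.
z_65 = (65 - 70) / 3   # about -1.67
z_74 = (74 - 70) / 3   # about  1.33
z_75 = (75 - 70) / 3   # about  1.67

# Steps 2 and 3: areas under the curve, taken from the cdf instead of a table.
p_between = heights.cdf(74) - heights.cdf(65)   # a) P(65 < X < 74)
p_above_75 = 1 - heights.cdf(75)                # b) P(X > 75)
p_below_65 = heights.cdf(65)                    # c) P(X < 65)

print(round(p_between, 3), round(p_above_75, 3), round(p_below_65, 3))
```

Because the distribution is symmetric about 70, answers (b) and (c) coincide: 75 and 65 are equally far from the mean.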

3. The t-distribution
• The t-distribution is a family of continuous
probability distributions that arise when
estimating the mean of a normally distributed
population in situations where the sample
size is small and population standard
deviation is unknown.
• Whereas a normal distribution describes a full
population, t-distributions describe samples
drawn from a full population; accordingly,
the t-distribution for each sample size is
different.
100

The t-distribution
• The t-distribution is similar in shape to the
Normal distribution but is more spread out with
longer tails than the standard Normal.
• It is symmetrical about zero, its mean (for k > 1),
and its variance is σ2 = k/(k-2) for k > 2, where k = df.
The mean does not exist for k = 1 and the variance is
not finite for k = 1, 2
• The df increases with the sample size. As the
sample size increases, the shape of the t-
distribution becomes increasingly more like the
standard Normal distribution.
• It is used for estimation of means.
101

The t-distribution
$$t=\frac{\bar{X}-\mu}{s/\sqrt{n}}$$
102

The t-distribution
ν = n−1 degrees of freedom
103

Module 3.1:
Sampling methods and
Sample size estimation
104

Why sample?
• It is usually not cost effective or practicable to
collect and examine all the data that might be
available.
• Instead it is often necessary to draw a sample of
information from the whole population to enable
the detailed examination required to take place.
• Sampling provides a means of gaining
information about the population without the
need to examine the population in its entirety.
105

• Purposes of sampling: Provides various
types of statistical information of a
qualitative or quantitative nature about the
whole by examining a few selected units.
• Advantages of sample based studies
– Cost effectiveness
– Timeliness
– Inaccessibility of some people
– Less destructive in data summarization
– Accuracy
106

Caveats
• Sampling can provide a valid, defensible
methodology but it is important to match
the type of sample needed to the type of
analysis required.
• The auditor should also take care to check
the quality of the information from which
the sample is to be drawn. If the quality is
poor, sampling may not be justified.
107

Sampling Designs
• Sample design covers the method of selection, the
sample structure and plans for analysing and
interpreting the results.
• Sample designs can vary from simple to complex
and depend on the type of information required and
the way the sample is selected.
• The design will impact upon the size of the sample
and the way in which analysis is carried out. In
simple terms the tighter the required precision and
the more complex the design the larger the sample
size.
108

Sampling Designs
• The design may make use of the characteristics
of the population, but it does not have to be
proportionally representative.
• It may be necessary to draw a larger sample
than would be expected from some parts of the
population;
• For example, to select more from a minority
grouping to ensure that we get sufficient data for
analysis on such groups.
109

Sampling Designs
• The aim of the design is to achieve a
balance between the required precision
and the available resources.
110

Definition of terms
• Sample – Subset of the population of interest
• Sampling – process of selecting units from
the population of interest so that by studying
the sample we generalize our result back to
population.
111

• Population - Finite or infinite set of objects
whose properties are to be studied.
• Study population/sample population –
subset of target population chosen so as to be
representative of the total population
• Sampling unit - unit of selection in the
sampling process.
• Study unit – subject on which information is
collected.
112

Conditions that need to be met
The sample must be well chosen – Representative
– The method of choosing the sample matters
– The best methods involve the planned
introduction of chance
– A sampling procedure should be fair, selecting
people for inclusion in the sample in an impartial
way, so as to get a representative cross section of
the public – No selection bias
When a selection procedure is biased, taking a large
sample does not help. This just repeats the basic
mistake on a larger scale
113

Conditions …
A sample chosen in a haphazard fashion, or
because it is ‘handy’, is unlikely to be a
representative one. This kind of sample may be
used in exploratory surveys to get a ‘feel’ for
the situation
The sample must be sufficiently large –
Sample size
There must be adequate coverage of the sample
– Response rate
– Non-respondents can be very different from
respondents. When there is a high non-response
rate, look out for non-response bias.
114

Is a sample any good?
Some samples are really bad. To find out
whether a sample is any good, ask:
1. How was it chosen?
2. Was there selection bias?
3. Was there non-response bias?
These questions might not be answered just
by looking at the data
115

Sampling techniques/methods
• Sampling is the process of selecting a number of
study units from a defined study population.
• Clearly define study population and study unit
– Study population – individuals, households,
institutions, records, etc…
– Study units – an individual, a household, an
institution or a record
116

Sampling cont…
• Types: probability and non-probability
– Probability – quantitative studies
– Non-probability – qualitative studies
• Probability sampling technique:
– Involves using random selection procedures to ensure that each
unit of the sample is chosen on the basis of chance.
– All units of the study population should have an equal, or at
least a known non-zero chance of being included in the sample.
– Sample drawn in such a way that it is representative of the
population
– The type to be used depends on population composition and
availability of sampling frame
117

Sampling cont…
Probability sampling methods include:
– Simple random sampling
– Systematic sampling
– Stratified sampling
– Cluster sampling
– Multistage sampling
118

1. Simple random sampling
• Selecting required number of sampling units
randomly from list of all units
– Up-to-date Sampling frame
– Random selection – manually using table of random
numbers or using computer programs
• E.g. 250 households from list of 9000 households
• Better representativeness but costly and
representativeness reduced in heterogeneous
population
119
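
Simple random sampling from a sampling frame can be sketched as follows, using the slide's example of 250 households from a list of 9000. The household identifiers 1 to 9000 and the function name are hypothetical stand-ins for a real, up-to-date frame.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, each equally likely."""
    rng = random.Random(seed)       # seeded generator for a reproducible draw
    return rng.sample(frame, n)     # sampling without replacement


# Hypothetical sampling frame: household IDs 1..9000.
households = list(range(1, 9001))
selected = simple_random_sample(households, 250, seed=42)
print(len(selected), len(set(selected)))
```

Fixing the seed only makes the draw repeatable for auditing; any seed gives a valid simple random sample.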

2. Systematic sampling
• Sampling units are selected at regular intervals. The
starting unit is selected randomly
• Example: to select a sample of 100 students from
2500, first calculate the sampling interval = 2500/100 = 25.
Then randomly select the first student from among the first
25 and finally pick every 25th student thereafter
• Easier and less time consuming
• Can be done without sampling frame – sequential
studies
• Risk of bias if there is cyclic repetition
120
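
The systematic sampling example (100 students from 2500, interval 25) can be sketched like this. The student numbering 1 to 2500 and the function name are assumptions made for illustration, and the sketch assumes the frame size is a multiple of the sample size.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Pick a random start within the first interval, then every k-th unit."""
    k = len(frame) // n                 # sampling interval, 2500 // 100 = 25
    rng = random.Random(seed)
    start = rng.randrange(k)            # random position 0..k-1 in the first interval
    return [frame[start + i * k] for i in range(n)]


# Hypothetical ordered list of students numbered 1..2500.
students = list(range(1, 2501))
chosen = systematic_sample(students, 100, seed=1)
print(chosen[:4])
```

If the list has a cyclic pattern whose period matches the interval, every selected unit falls at the same point of the cycle, which is the bias risk noted on the slide.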

3. Stratified sampling
• Used when the population structure consists of distinct
subgroups/strata
• Ensures proportions of individuals with certain
characteristics in the sample will be the same as those
in the whole population
– Representation of groups with different characteristics
• The study population must be divided into strata of
the characteristic (Example: residence, age, sex,
profession) and then random or systematic samples
are obtained from each stratum
121

3. Stratified sampling cont.
• Depending on the need, samples from each stratum
can be drawn either proportional to their size or
non-proportionally (e.g. equal size from each stratum)
– Proportional – using the same sampling fraction (n/N) in every stratum
– Equal size – to represent small groups
• Improved representativeness
• Estimates can be obtained for each stratum and the
population
122
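
Proportional allocation across strata can be sketched as below. The stratum names and sizes are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def proportional_allocation(strata_sizes: dict, n: int) -> dict:
    """Allocate a total sample of n to strata in proportion to stratum size."""
    total = sum(strata_sizes.values())          # population size N
    # Each stratum gets n * (N_h / N), i.e. the same sampling fraction n/N.
    return {name: round(n * size / total) for name, size in strata_sizes.items()}


# Hypothetical population of 9000 households split by residence.
strata = {"urban": 6000, "rural": 3000}
allocation = proportional_allocation(strata, 300)
print(allocation)  # {'urban': 200, 'rural': 100}
```

With less convenient numbers the rounded allocations may not sum exactly to n, so a final adjustment of one or two units is sometimes needed.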

4. Cluster sampling
• Groups of study units (clusters) instead of individual
study units are selected at a time
• Assumes homogeneity of population with respect to the
characteristic to be measured
• All the study units in the selected clusters are
included in the study
• Used in geographically scattered areas where visiting
dispersed study units is time consuming and costly
• Example: a simple random sample of 5 villages from
30 villages
• Easier but less representative
123

5. Multistage sampling
• Carried out in stages – PSU, SSU…
• Used in very large and diverse populations
• The method used in most community-based big
studies
• E.g. In a study to be undertaken in a big town the
sampling may involve stages like selection of
kefetegnas, kebeles and finally houses
• Representativeness and reduced cost
124

5. Multistage sampling
• The larger the number of clusters, the greater is
the likelihood that the sample will be
representative.
• Further, the sampling units at community level
should be selected randomly (avoid convenience
sampling!).
125

Bias in sampling
• Bias in sampling is a systematic error in
sampling procedures, which leads to a distortion
in the results of the study.
• Bias can be introduced as a consequence of
improper sampling procedures, which result in
the sample not being representative of the study
population.
126

Bias …
• There are several possible sources of bias that
may arise when sampling. The most well known
source is non-response.
• Non-response can occur in any interview
situation
• Respondents may refuse or forget to fill in the
questionnaire
• The problem lies in the fact that non-respondents
in a sample may exhibit characteristics that differ
systematically from the characteristics of
respondents.
127

Bias …
There are several ways to deal with this problem and
reduce the possibility of bias:
1. Data collection tools should be pre-tested.
2. If non-response is due to absence of the subjects,
follow-up of non-respondents may be considered.
3. If non-response is due to refusal to co-operate, an
extra, separate study of non-respondents may be
considered in order to identify to what extent they
differ from respondents.
4. Include additional people in the sample, so that non-
respondents can be replaced if their absence was
very unlikely to be related to the topic being studied.
128

Bias …
Other sources of bias in sampling:
• Studying volunteers only – volunteers are
motivated to participate in the study.
• Sampling of registered patients only –
patients reporting to a clinic are likely to
differ systematically from people seeking
alternative treatments.
• Seasonal bias.
• Tarmac bias – studying only areas easily accessible by car.
129

Non-probability sampling methods
Quota Sampling: Each data collector is assigned
a fixed quota of subjects to interview; the numbers
falling into certain categories (like residence, sex,
age, etc.) are also fixed. Within these constraints, the
interviewers are free to select anybody they like.
From a common sense point of view, quota sampling
looks good. It seems to guarantee that the sample
will be like the population with respect to all the
important characteristics that affect the variable of
interest.
130

In quota sampling, the sample is hand-picked
to resemble the population with respect to
some key characteristics. The method
seems reasonable, but does not work very
well. The reason is unintentional bias on
the part of the interviewers.
131

Other non-probability sampling methods
• Purposive sampling
• Snowball or chain sampling
• Extreme case sampling
• Maximum variation sampling
• Homogeneous sampling
• Critical case sampling
132

Sample size estimation
• How many subjects are needed in the sample
to enable conclusions to be drawn about the whole
population?
– Depends on expected variation in the data and
number of units per cell for analysis
– The eventual sample size is a compromise between
what is desirable and what is feasible
133

Sample size cont…
• Minimum sample size can be calculated
depending on the objective of the study
– Estimation of population parameter with certain
precision
• Single variable estimation (single population mean,
proportion or rate)
• Descriptive studies - Prevalence, coverage and utilization
rate studies
– Test of significant difference between groups
• Analytic studies - comparative cross-sectional, case-
control, cohort and clinical trials
134

Sample size - single proportion
• For making a confidence limit statement (such as in a
prevalence study), the following formula can be used
to estimate the minimum sample size:
$$n=\frac{Z_{1-\alpha/2}^{2}\,P(1-P)}{d^{2}}$$
• For a population <10,000, use the finite population
correction:
$$n_f=\frac{N\,Z_{1-\alpha/2}^{2}\,P(1-P)}{d^{2}(N-1)+Z_{1-\alpha/2}^{2}\,P(1-P)}$$
135

Single proportion cont…
• Parameters in the formula
– n is minimum sample size
– P is an estimate of the prevalence rate for the
population
• From available data or a pilot study result; otherwise 0.5 should be
used, since it gives the largest minimum sample size; if P is given
as a range, take the value closest to 0.5.
– d is the margin of sampling error tolerated
– Z1-α/2 is the standard normal value at the 100(1-α)%
confidence level; α is mostly taken to be 5%
• Usually the 95% confidence level is used, for which Z = 1.96
– N is the population size
136

Exercise
• What sample size do we need to estimate the
prevalence of HIV among residents of a town such
that the error of estimation is within 1% of its actual
parameter with 95% confidence?
137
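
One way to work the HIV prevalence exercise is sketched below. The exercise gives no prior prevalence, so the sketch assumes P = 0.5 (the conservative choice described on the previous slide) and reads "within 1%" as an absolute margin d = 0.01; both are assumptions. The function names and the town size N = 8000 in the finite population example are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_single_proportion(p: float, d: float, confidence: float = 0.95) -> float:
    """Minimum sample size n = Z^2 * P(1-P) / d^2 for a single proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 at 95%
    return z**2 * p * (1 - p) / d**2


def n_finite_population(p: float, d: float, N: int, confidence: float = 0.95) -> float:
    """Finite population version: N Z^2 P(1-P) / (d^2 (N-1) + Z^2 P(1-P))."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    v = z**2 * p * (1 - p)
    return N * v / (d**2 * (N - 1) + v)


n = n_single_proportion(p=0.5, d=0.01)
print(ceil(n))  # 9604

# Hypothetical town of 8000 residents: the correction reduces the requirement.
n_f = n_finite_population(p=0.5, d=0.01, N=8000)
print(ceil(n_f))
```

Sample sizes are rounded up, since rounding down would give slightly less precision than requested.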

Measuring prevalence for more than one
item in one group
• Take estimated prevalence of the most important item
to be measured or
• Determine sample size for each item/specific
objective and then
– Take estimated prevalence of the item that gives
the maximum sample size
138

Sample size-two proportion
For a test of significance study the following formula can
be used:
$$n=\frac{\left(Z_{1-\alpha/2}+Z_{1-\beta}\right)^{2}\left[p_1(1-p_1)+p_2(1-p_2)\right]}{(p_1-p_2)^{2}}$$
Parameters:
n - size of sample in each group
P1, P2 – estimated population prevalence in the
comparison groups
β = 1 - Power (power is the probability that, if the two proportions
differ, the test will produce a significant difference)
– Usually a power of 80% or 90% is used
139

Exercise
A study is designed to assess the difference in the
proportion of physicians leaving health services in
urban and rural areas. From available literature 30% and
15% of physicians are estimated to leave services in
rural and urban areas within three years of graduation
respectively. What sample size is required for the study?
140
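
The physician exercise can be sketched as follows. The exercise does not state the confidence level or power, so 95% and 80% are assumed here; the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_two_proportions(p1: float, p2: float,
                      confidence: float = 0.95, power: float = 0.80) -> float:
    """Sample size per group for comparing two independent proportions."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # Z_{1-alpha/2}
    z_beta = NormalDist().inv_cdf(power)                        # Z_{1-beta}
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2


# 30% of rural and 15% of urban physicians are estimated to leave services.
n_per_group = n_two_proportions(0.30, 0.15)
print(ceil(n_per_group))  # 118
```

Under these assumptions about 118 physicians are needed in each of the two groups; choosing 90% power instead would raise the figure.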

Sample size – case-control studies
• Formula –
$$n=\frac{\left[z_{1-\alpha/2}\sqrt{(m+1)\,p'(1-p')}+z_{1-\beta}\sqrt{m\,p_1(1-p_1)+p_0(1-p_0)}\right]^{2}}{m\,(p_1-p_0)^{2}}$$
• Parameters:
– P1, P0 – estimated prevalence of exposure in the cases
and controls respectively
– P0 can be estimated as the population prevalence of
exposure
– P′ – derived from P1, P0, m and odds ratio
– OR : odds ratio of exposures between cases and
controls
– m : number of control subjects per case subject
141

Exercise
• Example: Suppose you want to test for a
difference in exposure status between cases and
controls at 95% confidence level and with power of
80% using a 1:1 ratio of cases to controls while
looking for an odds ratio of 2. You assume the
prevalence of exposure among controls is 25%. What
sample size do you need?
142
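
A possible working of this exercise is sketched below using a commonly used unmatched case-control sample size formula. The slide only says that P′ is "derived from" P1, P0, m and the odds ratio; the sketch assumes P1 = P0·OR / (1 + P0(OR − 1)) and P′ = (P1 + m·P0) / (1 + m), which are the usual definitions but are not spelled out in the source. Function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def case_exposure_prevalence(p0: float, odds_ratio: float) -> float:
    """Exposure prevalence among cases implied by p0 and the odds ratio (assumed form)."""
    return p0 * odds_ratio / (1 + p0 * (odds_ratio - 1))


def n_cases(p0: float, odds_ratio: float, m: int = 1,
            confidence: float = 0.95, power: float = 0.80) -> float:
    """Number of cases needed, with m controls per case."""
    p1 = case_exposure_prevalence(p0, odds_ratio)
    p_prime = (p1 + m * p0) / (1 + m)            # weighted average exposure (assumed form)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt((m + 1) * p_prime * (1 - p_prime))
                 + z_beta * sqrt(m * p1 * (1 - p1) + p0 * (1 - p0))) ** 2
    return numerator / (m * (p1 - p0) ** 2)


# OR = 2, 25% exposure among controls, 1:1 ratio, 95% confidence, 80% power.
print(round(case_exposure_prevalence(0.25, 2), 2))  # 0.4
print(ceil(n_cases(0.25, 2, m=1)))                  # 152
```

Under these assumptions the study needs about 152 cases and 152 controls. Increasing m lowers the number of cases required, at the cost of recruiting more controls.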

Sample size-two proportion
• More than one comparison variable – take the one
with the smallest estimated difference
– To get largest sample size
• Different formulae
– Case-control studies
– Matched studies
– Survival analysis
– Other cases
• Reference
– http://www.statsdirect.com/help/sample_size_and_methods/sms.htm
143

Five key factors
1. Confidence level: how certain you want to be that the
population figure is within the sample estimate and its
associated precision.
2. Variability in the population: the SD is the most usual
measure and often needs to be estimated.
3. Margin of error or precision: a measure of the possible
difference between the sample estimate and the actual
population value.
4. The population proportion: the proportion of items in
the population displaying the attributes that you are
seeking.
5. Population size: only important if the sample size is
greater than 5% of the population in which case the
sample size reduces.
144

Sample size – other considerations
• Non-response
– Add contingency – say 10%
• More – sensitive topic, self-administered questionnaire
(up to 30%)
– Response rate for
• Cross-sectional survey >85%
• Cohort - >60-80%
• Sampling technique
– In complex samples (cluster, multistage) increase the
sample size to account for design effect
145

Sample size – other considerations cont.
– Design effect – ratio of the variance of an estimate derived from
a complex sampling design to the variance of the estimate
from a simple random sample of the same size
– Usually the sample size is multiplied by 2 (or 1.5) in cluster
sampling
• Increase – large PSU, many stages, clustered variable
• Qualitative methods – estimate, not determined
• Better to have good quality data than large sample
after a certain point
• Better to have representative than large sample
– Use representative sampling techniques
146
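
The two adjustments discussed above (design effect and a non-response contingency) can be combined in a small sketch. The starting sample size of 384 and the function name are hypothetical; the design effect of 2 and the 10% contingency are the illustrative values mentioned on the slides.

```python
from math import ceil


def adjust_sample_size(n: float, design_effect: float = 1.0,
                       non_response: float = 0.0) -> int:
    """Inflate a simple-random-sample size for design effect and non-response."""
    # Multiply by the design effect, then add the non-response contingency.
    return ceil(n * design_effect * (1 + non_response))


# Hypothetical: n = 384 from the single proportion formula, cluster design
# with design effect 2 and a 10% non-response contingency.
final_n = adjust_sample_size(384, design_effect=2, non_response=0.10)
print(final_n)  # 845
```

Some texts divide by (1 − non-response rate) instead of multiplying by (1 + rate); the slide's "add contingency" wording is followed here.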

Sampling distribution
Definition: A parameter is a numerical descriptive
measure of a population (e.g. μ). A statistic is a
numerical descriptive measure of a sample (e.g. X̄).
To each sample statistic there corresponds a
population parameter. We use X̄, S2, S, p, etc. to
estimate μ, σ2, σ, P (or π), etc.
147

Sampling distribution of Means
• The sampling distribution of means is one of the
most fundamental concepts of statistical
inference, and it has remarkable properties.
• Since it is a frequency distribution, it has its own
mean and standard deviation
Example: let a population of size 6 have the following values for
weight of individuals: 55.7, 66.7, 85.5, 79.7,
122.4 and 78.1. Select all possible samples of size
3 from this population, check whether the sample mean
is an unbiased estimate of the population mean, and
calculate the standard error of the sample mean.
148

Measurements of weight of individuals of
the population
Population values: 55.7 66.7 85.5 79.7 122.4 78.1
Sum of observations: 488.1
Population mean (µ): 81.35, where $\mu=\frac{\sum X}{N}$
Population SD (σ): 20.77, where $\sigma=\sqrt{\frac{\sum (X-\mu)^{2}}{N}}$
All possible unique samples: 20, where $\binom{N}{n}=\binom{6}{3}=20$
149

Sample Obs1 Obs2 Obs3 Mean
S1 55.7 66.7 85.5 69.30
S2 55.7 66.7 79.7 67.37
S3 55.7 66.7 122.4 81.60
S4 55.7 66.7 78.1 66.83
S5 55.7 85.5 79.7 73.63
S6 55.7 85.5 122.4 87.87
S7 55.7 85.5 78.1 73.10
S8 55.7 79.7 122.4 85.93
S9 55.7 79.7 78.1 71.17
S10 55.7 122.4 78.1 85.40
S11 66.7 85.5 79.7 77.30
S12 66.7 85.5 122.4 91.53
S13 66.7 85.5 78.1 76.77
S14 66.7 79.7 122.4 89.60
S15 66.7 79.7 78.1 74.83
S16 66.7 122.4 78.1 89.07
S17 85.5 79.7 122.4 95.87
S18 85.5 79.7 78.1 81.10
S19 85.5 122.4 78.1 95.33
S20 79.7 122.4 78.1 93.40
Sum of means 1627.00
Mean of means 81.35
Variance of means 86.27
SD of sample means 9.29
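Sample mean: $\bar{X}=\frac{\sum X}{n}$
Sample variance: $S^{2}=\frac{\sum (X-\bar{X})^{2}}{n-1}$
Mean of sample means: $\mu_{\bar{X}}=\frac{\sum \bar{X}}{\binom{N}{n}}$
Standard deviation of $\bar{X}$ (standard error of $\bar{X}$): $\sigma_{\bar{X}}=\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$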
150
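
The table above can be reproduced by enumerating every sample of size 3 from the six population values. This is a minimal sketch using only the standard library; the variable names are chosen here for illustration.

```python
from itertools import combinations
from math import comb, sqrt
from statistics import mean, pstdev

population = [55.7, 66.7, 85.5, 79.7, 122.4, 78.1]
N, n = len(population), 3

# All C(6, 3) = 20 unique samples and their means.
sample_means = [mean(sample) for sample in combinations(population, n)]

mu = mean(population)                  # population mean
sigma = pstdev(population)             # population SD (divisor N)
mean_of_means = mean(sample_means)     # equals mu, so the sample mean is unbiased
se_enumerated = pstdev(sample_means)   # SD of the 20 sample means
# Standard error from the formula with the finite population correction.
se_formula = sigma / sqrt(n) * sqrt((N - n) / (N - 1))

print(len(sample_means), comb(N, n))
print(round(mu, 2), round(mean_of_means, 2))
print(round(se_enumerated, 2), round(se_formula, 2))
```

The enumerated SD of the sample means agrees with the formula value, and both match the 9.29 shown in the table.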

Properties
1. The mean of the sampling distribution of means
is the same as the population mean, μ
2. The SD of the sampling distribution of sample
means is $\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$, which is ≈ σ/√n when the population size N is large relative to n
3. The sampling distribution of sample means is
approximately normal, regardless of the shape
of the population distribution, provided n is large
enough (n > 30) (Central limit theorem).
151

Module 3.2: Estimation
and Hypothesis Testing
152

Estimation
Definition
Calculating some statistics from sample data
that is offered as an approximation of the
corresponding parameter of the population
from which the sample was drawn.
154

Cont…
Estimator: a method or rule for computing the
value of an estimate.
An estimator needs to have the property of
unbiasedness.
• A statistic T is said to be an unbiased
estimator of the parameter x if E(T) = x.
155

Cont…
• Estimation is calculating, from sample data, some statistic
that offers an approximation for the corresponding
parameter of the population from which the sample is
drawn.
• Properties of good estimators
– Unbiased: An estimator is said to be unbiased if in
the long run it takes on the value of the population
parameter
– Efficiency: An estimator is said to be efficient if in the
class of unbiased estimators it has minimum variance
– Consistency: A sequence of estimators is said to be
consistent if it converges in probability to the true value
of the parameter
– Sufficiency: an estimator is sufficient if it uses all the
sample information
156

Estimation methods
• Point estimate:
a single numeric value used to estimate the
corresponding population parameter.
Frequently used point estimators (sample statistics):
Sample statistic              Corresponding population parameter
Sample mean                   Population mean
Sample variance               Population variance
Sample standard deviation     Population standard deviation
Sample proportion             Population proportion
157

Interval Estimate
• Interval estimate:
Two numerical values defining a range of
values that, with a specified degree of
confidence, we feel include the parameter
being estimated.
158

Cont…
• Even though the sample mean is a good estimator,
it is better to give an interval indicating the
probable magnitude of the population mean.
• Confidence intervals are about putting some
bounds on how far away the truth might be from
your estimate.
• The sample mean is the best unbiased estimator of the population mean.
159

Cont…
• If the sample is drawn from a normally distributed
population, the sampling distribution of the mean will be normal.
• Even if the distribution of the population is
non-normal, the sampling distribution of the mean will be approximately
normal if the sample size is sufficiently large.
• Ninety-five percent (95%) of the possible values of
$\bar{x}$ will lie within two standard errors of μ, i.e. within $\mu\pm 2\sigma_{\bar{x}}$
160

Interval estimator component
• Reliability coefficient: the value of Z or t that multiplies the
standard error:
$$\bar{x}\pm z\,\frac{\sigma}{\sqrt{n}}\qquad\text{or}\qquad\bar{x}\pm t\,\frac{s}{\sqrt{n}}$$
• Standard error – measure of sample mean
variability in repeated sampling.
161

Standard Error of the Mean
• It helps us to quantify in some way how good our
estimate of the mean is of the true, & unknown,
population mean- how large an error might we
be making
• The standard error of the sample mean is SD/√n. It is:
• the error that arises from variability in the sample
means
• an indication of the variability of the distribution of
means of samples caused by sampling error
and measurement error.
162

Confidence interval
• The confidence interval provides a range that is
highly likely (often 95% or 99%) to contain the
true population value, or parameter that is being
estimated.
• The narrower the interval the more informative is
the result. It is usually calculated using the point
estimate and its standard error.
163

• Provide an interval around our estimate
showing how much error there might be
either side of the estimate
[Diagram: lower confidence limit | estimate | upper confidence limit]
164

Interval estimate for mean:
one sample situation
• Confidence interval of the mean with known
population standard deviation:
$$\bar{x}\pm z_{1-\alpha/2}\,SE(\bar{x})=\bar{x}\pm Z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
• Confidence interval of the mean with unknown
population standard deviation for small sample
size:
$$\bar{x}\pm t_{1-\alpha/2}(df)\,se(\bar{x})=\bar{x}\pm t_{1-\alpha/2}(n-1)\,\frac{s}{\sqrt{n}}$$
165

Cont…
Interpretation of confidence interval
• Probabilistic: in repeated sampling from a
normally distributed population with known SD (σ),
100(1-α)% of all intervals of the form x̄ ± Z1-α/2 σ/√n
will in the long run include the population
mean
• Practical: when sampling from a normally
distributed population with known SD (σ), we are
100(1-α)% confident that the single computed interval
contains the population mean.
166

Cont…
• Commonly used confidence coefficients are
0.90, 0.95 and 0.99, with associated reliability coefficients
of 1.645, 1.96 and 2.58 respectively for the
standard normal random variable (Z).
• Precision:
The quantity obtained by multiplying the reliability
coefficient by the SE of the mean is called the margin of
error.
167

Computing a 95 and 99% CI for μ
• Given x̄ = 19.26, σ = 2.52 and n = 117
• At 95% confidence level, α = 0.05 (α/2 = 0.025) and at 99%
α = 0.01 (α/2 = 0.005)
• Z0.975 = 1.96 and Z0.995 = 2.58
95% CI for μ becomes
• 19.26 ± 1.96 × 2.52/√117 = (18.80 ≤ μ ≤ 19.72)
99% CI for μ becomes
• 19.26 ± 2.58 × 2.52/√117 = (18.66 ≤ μ ≤ 19.86)
168
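
The two intervals above can be reproduced with a short sketch. It takes the Z values from `statistics.NormalDist.inv_cdf` in place of a table; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar: float, sigma: float, n: int,
                          confidence: float = 0.95) -> tuple:
    """CI for the mean when the population SD (sigma) is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # reliability coefficient
    margin = z * sigma / sqrt(n)                          # margin of error
    return xbar - margin, xbar + margin


ci_95 = z_confidence_interval(19.26, 2.52, 117, confidence=0.95)
ci_99 = z_confidence_interval(19.26, 2.52, 117, confidence=0.99)
print([round(v, 2) for v in ci_95])  # [18.8, 19.72]
print([round(v, 2) for v in ci_99])  # [18.66, 19.86]
```

The 99% interval is wider than the 95% interval: more confidence is bought with less precision.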

Computing CI for μ when σ is unknown
• When the population SD (σ) is unknown, it
should be estimated from the sample SD (s)
• Accordingly, the standard error of the sample
mean will be estimated by s/√n
• Therefore, a 95% CI (say) for μ with n < 30 will
be based on the t-statistic as:
$$\bar{x}\pm t_{0.975}(n-1)\,\frac{s}{\sqrt{n}}$$
where (n-1) is the degrees of freedom
169

Example
• Consider the following summary information
based on data on systolic blood pressure of a
random sample of 30 individuals selected from a
normal population: x̄ = 115.9, s = 16.3. Compute a 95% and 99% CI
for μ
• n = 30, df = 30-1 = 29; at 95% confidence level, t0.975(29) =
2.045 and at 99%, t0.995(29) = 2.756; se(x̄) = 16.3/√30 = 2.98
• 95% CI for μ: 115.9 ± 2.045 × 2.98 = (109.8 ≤ μ ≤ 122.0)
• 99% CI for μ: 115.9 ± 2.756 × 2.98 = (107.7 ≤ μ ≤ 124.1)
170
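
The systolic blood pressure example can be sketched as follows. Python's standard library has no t distribution, so the critical values 2.045 and 2.756 are passed in as the tabulated values quoted on the slide; the function name is illustrative.

```python
from math import sqrt


def t_confidence_interval(xbar: float, s: float, n: int, t_crit: float) -> tuple:
    """CI for the mean when sigma is unknown, given a tabulated t value for n-1 df."""
    se = s / sqrt(n)              # estimated standard error of the mean
    margin = t_crit * se
    return xbar - margin, xbar + margin


# n = 30, so df = 29; t values are taken from the slide (t table).
ci_95 = t_confidence_interval(115.9, 16.3, 30, t_crit=2.045)
ci_99 = t_confidence_interval(115.9, 16.3, 30, t_crit=2.756)
print([round(v, 1) for v in ci_95])  # [109.8, 122.0]
print([round(v, 1) for v in ci_99])  # [107.7, 124.1]
```

If a statistics package is available, the critical values can be obtained from its t quantile function instead of a printed table.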

Standard Error of the difference between
two sample means
• Most medical research is comparative; as a
result we are more often concerned with two or
more samples rather than a single sample, i.e.,
with comparing the difference between two samples.
• A confidence interval for the difference helps in deciding whether or not it is likely
that the two means are equal
• When the interval includes 0, the two means
might be equal.
• When the interval does not include zero, the data suggest that the two
means are different.
171

Cont….
The Z statistic can be used in a confidence
interval to estimate the difference between two means
if the variances of the populations are known.
A 95% confidence interval for the difference of the
two means is given by:
$$(\bar{X}_1-\bar{X}_2)\pm Z_{0.975}\sqrt{\frac{\sigma_1^{2}}{n_1}+\frac{\sigma_2^{2}}{n_2}}=(\bar{X}_1-\bar{X}_2)\pm 1.96\sqrt{\frac{\sigma_1^{2}}{n_1}+\frac{\sigma_2^{2}}{n_2}}$$
172

Unknown Variance
The t statistic is used when the
population standard deviations are unknown
and the sample sizes are small, under two sets of
conditions:
1. When equal variances are assumed
2. When the variances are unequal
173

Cont…
• When the variances are equal, the sample variances are
pooled to estimate the common variance.
• The pooled estimate is obtained as a weighted
average of the two sample variances.
• Each sample variance is weighted by its degrees
of freedom (n-1).
• If the sample sizes are equal, the weighted
average equals the arithmetic mean of the two
sample variances.
• If the sample sizes are different, the weighted average
takes advantage of the additional information
provided by the larger sample.
174

Unknown but equal variances
• The pooled standard deviation (Sp) is
calculated using the following formula:
$$S_p=\sqrt{\frac{(n_1-1)S_1^{2}+(n_2-1)S_2^{2}}{n_1+n_2-2}}$$
• Then the standard error of the difference
of the two sample means is:
$$se(\bar{X}_1-\bar{X}_2)=S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}$$
175

Example: Was there a difference in the mean
fasting blood glucose level between men and
women, given the following data from normal populations?
Sex Mean SD n
Men 98.14 19.59 57
Women 95.19 14.03 59
Total 96.64 16.98 116
• Compute a 95% CI for the population mean
difference
– Assuming the standard deviations (SD) are
population SD
– Assuming the population variances are unknown but
assumed to be equal
176
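
Both parts of the fasting blood glucose example can be sketched as below. The t critical value 1.981 for 114 degrees of freedom is an approximate table value supplied here as an assumption (it does not appear on the slide); function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def ci_diff_known_sd(m1, sd1, n1, m2, sd2, n2, confidence=0.95):
    """CI for mu1 - mu2 treating the SDs as known population SDs (Z based)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    diff = m1 - m2
    return diff - z * se, diff + z * se


def ci_diff_pooled(m1, sd1, n1, m2, sd2, n2, t_crit):
    """CI for mu1 - mu2 with unknown but equal variances (pooled SD, t based)."""
    sp = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    se = sp * sqrt(1 / n1 + 1 / n2)
    diff = m1 - m2
    return diff - t_crit * se, diff + t_crit * se


men = (98.14, 19.59, 57)      # mean, SD, n
women = (95.19, 14.03, 59)

ci_z = ci_diff_known_sd(*men, *women)
# 57 + 59 - 2 = 114 df; 1.981 is an assumed approximate table value.
ci_t = ci_diff_pooled(*men, *women, t_crit=1.981)
print([round(v, 2) for v in ci_z])
print([round(v, 2) for v in ci_t])
```

Both intervals run from roughly -3.3 to 9.2 and include zero, so on this evidence the two population means might be equal.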

Factors affecting the length of a
confidence interval (CI)
– Sample size (n)
– Standard deviation (σ)
– Confidence level (1-α)
177

Hypothesis Testing
Why is hypothesis testing so important?
• Hypothesis testing provides an objective
framework for making decisions using
probabilistic method, rather than relying on
subjective impressions.
• The Null hypothesis, denoted by Ho, is the
hypothesis that is to be tested.
• The alternative hypothesis H1 is the hypothesis
that in some sense contradicts the null
hypothesis.
178

Cont…
• While making decision on the null and
alternative hypothesis, we have four
possible outcomes:
1. We accept Ho, and Ho is in fact true – confidence level
(1-α).
2. We accept Ho, and H1 is in fact true – Type II error (β).
3. We reject Ho, and Ho is in fact true – Type I error (α).
4. We reject Ho, and H1 in fact is true – Power of the test
(1- β).
179

One Sample Test for the Mean from a
Normal population
1. One Sided Alternative (One-tailed)
• Unknown Variance
• A one tailed test is a test in which the values of
the parameter being studied (in this case the mean)
under the alternative hypothesis are allowed to be
either greater than or less than the values of the
parameter under the null hypothesis, but not both
180

Cont…
I. Alternative mean < Null mean
• One sample t-test for the mean of a normal
distribution with unknown variance, to test the
hypothesis Ho: μ = μo versus H1: μ < μo at significance level α:
$$t=\frac{\bar{X}-\mu_0}{s/\sqrt{n}}$$
If t < tn-1, α (that is, t < -tn-1, 1-α), then reject Ho
If t ≥ tn-1, α, then do not reject Ho
181

Cont…
Two ways to determine statistical significance:
1. Critical value method – comparing the tabulated
value of the test statistic to the calculated value
for a given level of significance
2. P-value method
182

Cont…
The p-value is the α level at which the given
value of the test statistic (such as t) would be on
the borderline between the acceptance and
rejection regions.
p = P(tn-1 ≤ t)
where p is the area to the left of t under a tn-1
distribution.
183

Guidelines to judge p-value
1. If 0.01 <= p < 0.05, statistically significant
2. If 0.001 <= p < 0.01, statistically highly
significant
3. If p < 0.001, very highly statistically
significant
4. If p ≥ 0.05, not statistically significant
184

II. Alternative mean >Null mean
• To test the hypothesis:
Ho: μ = μo vs H1: μ > μo, variance unknown
With a significance level α, the test is based on ‘t’
where:
$$t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}$$
• If t > tn-1, 1-α, Ho is rejected
• If t ≤ tn-1, 1-α, Ho is accepted
185

Cont…
2. Two-sided alternatives (two tailed)
It is a test in which the values of the parameter
being studied under the alternate hypothesis are
allowed to be either greater than or less than the
values of the parameter under the null hypothesis,
Ho.
186

Cont…
• To test the hypothesis:
Ho: μ = μo versus H1: μ ≠ μo with a significance
level of α
$$t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}$$
If |t| > tn-1, 1-α/2, Ho is rejected
If |t| ≤ tn-1, 1-α/2, Ho is accepted
187

Cont…
• P-value for two tailed t-test
$$t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}$$
$$p=\begin{cases}2\,P(t_{n-1}\le t) & \text{if } t\le 0\\ 2\,[1-P(t_{n-1}\le t)] & \text{if } t>0\end{cases}$$
188

Cont…
One sample Z-test - Two Tailed
• The critical values and p-values for the one
sample t-test have been specified in terms of
percentiles of the t distribution, assuming that the
underlying variance is unknown.
• In some applications, the variance may be
assumed known from prior studies. In this case,
the t statistic is replaced by the test
statistic Z
189

Cont…
To test the hypothesis, we use
$$z=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}$$
• If Z < Zα/2 or Z > Z1-α/2, reject Ho
• If Zα/2 ≤ Z ≤ Z1-α/2, don’t reject Ho
190
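
A two-tailed one sample Z-test can be sketched as follows. The data reuse the earlier summary values (x̄ = 19.26, σ = 2.52, n = 117); the null value μo = 19 is hypothetical and chosen only to illustrate the calculation, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(xbar: float, mu0: float, sigma: float, n: int,
                      alpha: float = 0.05) -> tuple:
    """Two-tailed Z-test of Ho: mu = mu0 with known sigma.

    Returns (z, p_value, reject_h0).
    """
    z = (xbar - mu0) / (sigma / sqrt(n))              # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # area in both tails
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)      # Z_{1-alpha/2}
    return z, p_value, abs(z) > z_crit


# Hypothetical null value mu0 = 19.
z, p_value, reject = one_sample_z_test(19.26, 19.0, 2.52, 117)
print(round(z, 3), round(p_value, 3), reject)
```

Here Ho is not rejected at the 5% level, which agrees with the earlier 95% confidence interval (18.80, 19.72) containing 19.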

Cont…
• One Tail
• Alternative mean < Null mean (Variance
Known)
– If Z < Zα, then Ho is rejected
– If Z ≥ Zα, Ho is accepted
• Alternative mean > Null mean (Variance
Known)
– If Z > Z1-α, then Ho is rejected
– If Z ≤ Z1-α, Ho is accepted
191

Relationship between Hypothesis
Testing and confidence interval
–Two sided case
• Suppose we are testing Ho: μ = μo versus
H1: μ ≠ μo. Ho is rejected with a two-sided level
α test if and only if the two-sided 100(1-α)% confidence
interval for μ does not contain μo; otherwise
accept Ho.
192
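The equivalence between the two-sided test and the confidence interval can be checked numerically. The sketch below uses the known-variance Z case so that only the standard library is needed (the same argument holds for the t case); the function names and the summary numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, alpha=0.05):
    """Two-sided 100(1 - alpha)% interval for the mean, sigma assumed known."""
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


def two_sided_z_rejects(xbar, mu0, sigma, n, alpha=0.05):
    """True when the two-sided level-alpha Z-test rejects Ho: mu = mu0."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


# Hypothetical summary data, chosen only for illustration.
xbar, sigma, n = 103.0, 15.0, 100
low, high = z_confidence_interval(xbar, sigma, n)

results = []
for mu0 in (99.0, 100.0, 101.0, 104.0, 106.0):
    outside = not (low <= mu0 <= high)
    rejected = two_sided_z_rejects(xbar, mu0, sigma, n)
    results.append((mu0, outside, rejected))
    print(f"mu0={mu0}: outside CI={outside}, Ho rejected={rejected}")
```

For every candidate null value, "outside the interval" and "Ho rejected" agree, which is the statement of the slide in computational form.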

Hypothesis Testing Two Sample
Inference
• In two-sample hypothesis testing, the
underlying parameters of two different
populations, neither of whose values is
assumed known, are compared.
• Two samples are said to be paired when
each data point of the first sample is
matched and is related to a unique data
point of the second sample.
193

Cont…
• Two samples are said to be independent
if the data points in one sample are
unrelated to the data points in the second
sample
194

The paired t- test
• The test statistic is denoted by
$t = \dfrac{\bar{d}}{SD(d) / \sqrt{n}}$
where $\bar{d}$ is the mean of the observed differences,
SD(d) is the sample standard deviation of
the observed differences and n is the number of
differences
195

Cont…
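• Degrees of freedom: n-1
– If $t > t_{n-1,\,1-\alpha/2}$ or $t < -t_{n-1,\,1-\alpha/2}$, then Ho is
rejected.
– If $-t_{n-1,\,1-\alpha/2} \le t \le t_{n-1,\,1-\alpha/2}$, Ho is not rejected.
• The p-value is 2 × the tail area beyond |t| under the
t distribution with n-1 degrees of freedom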
196

• Example:
• Suppose a sample of 20 students were
given a test before studying a particular
module and then again after completing
the module.
• We want to find out if, in general, our
teaching leads to improvements in
students’ knowledge/skills (i.e. test
scores).
197

Student   Pre-module   Post-module   Difference   Student   Pre-module   Post-module   Difference
          score        score                                score        score
1         18           22             4           11        14           15             1
2         21           25             4           12        16           15            -1
3         16           17             1           13        16           18             2
4         22           24             2           14        19           26             7
5         19           16            -3           15        18           18             0
6         24           29             5           16        20           24             4
7         17           20             3           17        12           18             6
8         21           23             2           18        22           25             3
9         23           19            -4           19        15           19             4
10        18           20             2           20        17           16            -1
198

199
• Hypothesis: Ho: Δ = 0 and HA: Δ ≠ 0
• Calculating the mean and standard deviation of
the differences: $\bar{d}$ = 2.05 and sd(d) = 2.837.
Therefore, se($\bar{d}$) = 2.837/$\sqrt{20}$ = 0.634
• So, we have: t = 2.05/0.634 = 3.231 on 19 df with
p = 0.004.
• Therefore, there is strong evidence that, on
average, the module does lead to improvements.
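The arithmetic of this example can be reproduced from the score table. The sketch below uses only the standard library; the variable names are hypothetical, and the critical value 2.093 is the tabulated 97.5th percentile of the t distribution with 19 df, entered by hand rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Pre- and post-module scores for students 1 to 20, copied from the table.
pre = [18, 21, 16, 22, 19, 24, 17, 21, 23, 18,
       14, 16, 16, 19, 18, 20, 12, 22, 15, 17]
post = [22, 25, 17, 24, 16, 29, 20, 23, 19, 20,
        15, 15, 18, 26, 18, 24, 18, 25, 19, 16]

diffs = [after - before for before, after in zip(pre, post)]
n = len(diffs)
d_bar = mean(diffs)   # mean difference
sd_d = stdev(diffs)   # sample standard deviation (n - 1 in the denominator)
se_d = sd_d / sqrt(n)  # standard error of the mean difference
t_stat = d_bar / se_d

T_CRIT_19_975 = 2.093  # tabulated t value for 19 df, two-sided alpha = 0.05

print(f"mean diff = {d_bar:.2f}, sd = {sd_d:.3f}, se = {se_d:.3f}")
print(f"t = {t_stat:.3f} on {n - 1} df")
print("reject Ho at the 5% level:", abs(t_stat) > T_CRIT_19_975)
```

The paired analysis works on the 20 differences as a single sample, which is why it reduces to a one-sample t-test of Δ = 0.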

Two sample t-test for independent
samples with equal variance
• The equation is given by:
$t = \dfrac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
where $S_p^2$, the weighted average of variance 1 and variance 2,
$S_p^2 = \dfrac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$,
is used as the estimate of the common variance $\sigma^2$
• The degrees of freedom will be the sum of the degrees of
freedom of the two samples, i.e., (n1-1) + (n2-1)
200
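A minimal sketch of the equal-variance two-sample t statistic follows. The function name `pooled_t` and the two small data sets are hypothetical; the pooled variance is the weighted average of the two sample variances, with weights equal to their degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x1, x2):
    """Equal-variance two-sample t statistic; returns (t, df, pooled variance)."""
    n1, n2 = len(x1), len(x2)
    df = (n1 - 1) + (n2 - 1)
    # Weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / df
    t = (mean(x1) - mean(x2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df, sp2


# Hypothetical data, chosen so the arithmetic is easy to follow by hand.
group_1 = [2.0, 4.0, 6.0, 8.0]
group_2 = [1.0, 3.0, 5.0, 7.0]
t_stat, df, sp2 = pooled_t(group_1, group_2)
print(f"t = {t_stat:.4f} on {df} df, pooled variance = {sp2:.4f}")
```

The resulting t is compared with the t distribution on (n1-1) + (n2-1) degrees of freedom, here 6.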

Estimation and Hypothesis testing
of population proportion
201

Sampling distribution of proportions
Construction
• It is done in the same manner as that of
the mean
• take all possible samples of a given size
• Compute the sample proportion for each
• Prepare a frequency distribution of the
proportions
202

Cont…
Characteristics:
– When the sample size is large the distribution is
approximately normal
– The mean of the distribution, $\mu_{\hat{p}}$, will be equal
to the true proportion P.
– The variance of the distribution, $\sigma^2_{\hat{p}}$, will be
equal to $\dfrac{P(1 - P)}{n}$
203
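The mean and variance of the sampling distribution can be verified exactly rather than by listing samples. This sketch assumes the number of subjects with the characteristic follows a binomial distribution (independent draws with a constant true proportion); the function name and the values n = 50, P = 0.3 are hypothetical.

```python
from math import comb


def phat_distribution(n, p):
    """Exact distribution of the sample proportion X/n, X ~ Binomial(n, p)."""
    return {k / n: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


n, p = 50, 0.3  # hypothetical sample size and true proportion
dist = phat_distribution(n, p)

mean_phat = sum(value * prob for value, prob in dist.items())
var_phat = sum((value - mean_phat) ** 2 * prob for value, prob in dist.items())

print(f"mean of p-hat     = {mean_phat:.4f}  (true P = {p})")
print(f"variance of p-hat = {var_phat:.6f}  (P(1-P)/n = {p * (1 - p) / n:.6f})")
```

The computed mean matches P and the computed variance matches P(1 - P)/n up to floating-point error.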

Sampling distribution of difference
between two proportions
• For independent random samples of sizes n1 and n2 drawn
from two populations of dichotomous variables,
where P1 and P2 are the population proportions of
the characteristic
• Distribution of $\hat{p}_1 - \hat{p}_2$ is approximately normal with
mean: $\mu_{\hat{p}_1 - \hat{p}_2} = P_1 - P_2$
• And variance: $\sigma^2_{\hat{p}_1 - \hat{p}_2} = \dfrac{P_1(1 - P_1)}{n_1} + \dfrac{P_2(1 - P_2)}{n_2}$
204

Estimation of single proportions
• Confidence intervals of proportions by
approximation to the normal distribution and the
sample standard deviation.
• The confidence interval for the population
proportion:
$p \pm Z \sqrt{\dfrac{p(1 - p)}{n}}$
where p is the proportion of successes (event),
q=(1 - p) is the proportion of failures,
n is the sample size and Z denotes the z value
relating to a defined probability level.
205

Estimation of difference between
two proportions
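• The unbiased point estimators of P1 and P2 are $\hat{p}_1$ and $\hat{p}_2$,
so $\hat{p}_1 - \hat{p}_2$ estimates P1 - P2
• Standard error of the estimate when n1 and n2 are
large enough and $\hat{p}_1$ and $\hat{p}_2$ are not close to 1 or 0
• Since population proportions are not known, they are
replaced by the sample proportions:
$\hat{\sigma}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$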
206

Cont…
• Therefore, the 100(1-α)% confidence interval will be:
$(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2} \sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$
207

Hypothesis testing on single
population proportions
• Follows from the properties of the sampling
distribution of the sample proportion
• The null hypothesis
Ho: $P = P_0$
and
• The alternative hypothesis
HA: $P \ne P_0$
208

Cont…
• Test statistic
$Z = \dfrac{\hat{p} - P_0}{\sqrt{\dfrac{P_0(1 - P_0)}{n}}}$
• When Ho is true, Z is approximately distributed as a
standard normal distribution
209

Testing differences between two
sample proportions
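• The most commonly used test
Ho: P1-P2 = 0 or P1=P2
• Under Ho, P1 = P2 = P, thus the pooled estimate for the proportions will be
$\bar{p} = \dfrac{x_1 + x_2}{n_1 + n_2} = \dfrac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$
• Standard error
$\hat{\sigma}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\dfrac{\bar{p}(1 - \bar{p})}{n_1} + \dfrac{\bar{p}(1 - \bar{p})}{n_2}}$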
210

Cont…
• The test statistic will be:
$z = \dfrac{(\hat{p}_1 - \hat{p}_2) - (P_1 - P_2)}{\hat{\sigma}_{\hat{p}_1 - \hat{p}_2}}$
211

Example: Comparison of number of swimming
hours by swimmers with or without erosion of
dental enamel
Number of swimming    Erosion of dental enamel (EDE)
hours per week        Yes      No       Total
≥ 6 hours             32       118      150
< 6 hours             17       127      144
Total                 49       245      294
212
Prevalence of EDE (P) 0.167
Standard error 0.022
95% CI for P: Lower 0.124
Upper 0.209

1. Estimate the prevalence of erosion of
dental enamel and calculate a 95% CI
2. From previous studies among
swimmers it is claimed that the
prevalence of erosion of dental enamel
was 14%. Is the claim justified? Give
your p-value
213
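Questions 1 and 2 can be worked through with the counts from the table (49 swimmers with EDE out of 294). The sketch below is one way to do it, assuming the normal approximation and a two-sided test of the claimed 14%; the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

cases, n = 49, 294  # swimmers with EDE, total swimmers (from the table)
p0 = 0.14           # prevalence claimed by previous studies

std_normal = NormalDist()
z_975 = std_normal.inv_cdf(0.975)  # about 1.96

# Question 1: point estimate and 95% CI (standard error uses the sample p).
p_hat = cases / n
se_hat = sqrt(p_hat * (1 - p_hat) / n)
ci_low, ci_high = p_hat - z_975 * se_hat, p_hat + z_975 * se_hat

# Question 2: test Ho: P = 0.14 (standard error uses the null value p0).
se_null = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se_null
p_value = 2 * (1 - std_normal.cdf(abs(z)))

print(f"prevalence = {p_hat:.3f}, se = {se_hat:.3f}")
print(f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

Under these assumptions the estimate, standard error and interval match the values shown on the slide, and the test gives z of about 1.32 with a two-sided p-value of about 0.19, so the data do not contradict the claimed prevalence of 14%.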

3. Compute the respective prevalence of erosion
of dental enamel for those who had ≥ 6 hours
and < 6 hours of swimming time and calculate a
95% CI for the difference in the prevalence.
4. Is there a difference in the prevalence of erosion
of dental enamel between the two swimming
times? Give your p-value
214

Amount of swimming time per week P
≥ 6 hours 0.213
< 6 hours 0.118
Total 0.167
p1 – p2 0.095
Ho: P1=P2, HA: P1≠P2
se(p1-p2) 0.044
Z 2.174
95% CI for P1-P2
se(p1-p2) 0.042
Lower 95% 0.013
Upper 95% 0.177
215
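The two-group comparison can be recomputed from the counts in the swimming table. This is a sketch with hypothetical variable names: the test uses the pooled standard error under Ho, while the confidence interval uses the unpooled standard error. Carried at full precision, the results differ slightly in the second or third decimal from the figures listed above, which appear to have been computed from rounded intermediate values.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 32, 150  # EDE cases and total, >= 6 hours per week
x2, n2 = 17, 144  # EDE cases and total, < 6 hours per week

std_normal = NormalDist()
p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Test of Ho: P1 = P2, using the pooled proportion for the standard error.
pooled = (x1 + x2) / (n1 + n2)
se_pooled = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = diff / se_pooled
p_value = 2 * (1 - std_normal.cdf(abs(z)))

# 95% CI for P1 - P2, using the unpooled standard error.
se_unpooled = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_975 = std_normal.inv_cdf(0.975)
ci_low, ci_high = diff - z_975 * se_unpooled, diff + z_975 * se_unpooled

print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, difference = {diff:.3f}")
print(f"pooled se = {se_pooled:.4f}, z = {z:.2f}, p = {p_value:.3f}")
print(f"unpooled se = {se_unpooled:.4f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

Either way the conclusion is the same: the interval excludes 0 and the two-sided p-value is below 0.05, so the prevalence differs between the two swimming-time groups at the 5% level.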

Exercise: A study was conducted to look at the
effect of oral contraceptives (OC) on heart disease
in women 40-44 years of age over 3 years. Given
the following data, is there a difference in the rate of
MI between OC-users and non-users? Compute
95% CI for the difference.
OC-use group     MI status over 3 years
                 Yes      No        Total
OC-users         13       4,987     5,000
Non-OC-users     7        9,993     10,000
Total            20       14,980    15,000
216

Errors in
• Design
• Execution
• Analysis
• Presentation
• Interpretation
• Omission
218

Statistical errors related to study design
• Study aims and primary outcome measures
not clearly stated
• Inadequate sample size
• Choice of inappropriate high risk sample to
make inferences about the general population
• Failure to report number of participants or
observations
• Use of an inappropriate control group
219

Errors in execution
• Failure to adhere to the study protocol
– Misuse of sample selection procedures
– Exclusion and inclusion criteria not strictly
followed
– Failure to follow randomization procedures
220

Statistical errors in presentation
• Inadequate graphical or numerical description of
basic data
– Presenting or plotting mean but no indication of
variability
– Giving SE instead of SD to describe data
– Failure to define ± notation for describing variability
– Numerical information given to an unrealistic level
of precision to present data and results
– Inappropriate graph selection that doesn’t reflect
characteristics of variables and use of three
dimensional graph for two dimension presentation
221

Statistical errors in analysis
• Using methods of analysis when assumptions are
not met
• Analyzing paired data ignoring the pairing
• Failing to take account of ordered categories
• Treating multiple observations on one subject as
independent
o Improper multiple pair-wise comparisons of more than
two groups
o Quoting confidence intervals that include impossible
values
• Failure to use multivariate techniques to adjust
for confounding factors
223

Statistical errors in interpretation of
study findings
• Wrong interpretation of results
 “non significant” interpreted as “no effect”, or
“no difference”
 Drawing conclusions not supported by the
study data
 Significance claimed without data analysis
or statistical test mentioned
• Failure to discuss sources of potential bias and
confounding factors
224

Consequences of statistical errors
• Impossible to get ethical approval to conduct the
study
• Other researchers may be led to follow a false line
of investigation
• Patients may receive an inferior treatment, either
as a direct consequence of the results of the study
or possibly by the delay in the introduction of a
truly effective treatment
• If the results go unchallenged the researchers
may use the same inferior statistical methods in
future research, and others may copy them due to
inappropriate conclusions
225
