MPH Biostatistics Course on Data Presentation

Teresa Kisi (MPH in Epidemiology and Biostatistics, Assist. Prof.)
Basic Biostatistics for MPH students
11/17/2018 1
Arsi University, College of Health Science,
Department of Public Health

Course content
Topics
1. Introduction
2. Methods of data collection and presentation
3. Summary measures
4. Probability and probability distributions
5. Sampling methods and sample size determination
6. Statistical inference
Facilitator: Mr. Teresa Kisi (MPH in Epidemiology and Biostatistics, Assist. Prof.)
Email: terek7@gmail.com

Course description
This course covers descriptive statistics and some
intermediate-level inferential statistics for public
health. The descriptive part deals with frequency
distributions and measures of central tendency and
variability. The course then covers probability and
probability distributions, sampling and sample size
determination, statistical estimation and sampling
distributions, and hypothesis testing.

Learning Objectives:
 At the end of the course we will be able to:
– Discuss the role of statistics in health science and explain
the main uses of statistical methods in the broader field of
health care;
– Describe methods of collecting and recording data, and
present data in the form of tables, graphs, etc.;
– Calculate measures of central tendency and dispersion
– Apply different sample size determination and sampling
techniques
– Explain the context and meaning of statistical estimation
and hypothesis testing.

Evaluation
Evaluation criteria Percent
Assignments 40%
Final exam 60%
NB: Grading will be as per the grading scale of the university registrar

Chapter one:
Introduction to Biostatistics
Objectives of the chapter
 After completing this chapter, we will be able to:
– Define Statistics and Biostatistics
– Enumerate the importance and limitations of
statistics
– Define and identify the different types of variables
and explain why we need to classify variables

Objectives cont’d…
– Identify the different methods of medical and
biological data organization and presentation
– Identify the criteria for selecting a method
to organize and present data
– Discuss data summarization methods

Statistics?

Statistics
 The science of assembling and interpreting numerical
data (Bland, 2000)
 The discipline concerned with the treatment of
numerical data derived from groups of individuals
(Armitage et al., 2001).
 Generally the term statistics is used to mean either
statistical data or statistical methods.

Statistics cont’d…
Statistical data: refers to numerical
descriptions of things. These descriptions may
take the form of counts or measurements.
E.g. statistics of malaria cases include fever
cases, number of positives obtained, sex and
age distribution of positive cases, etc.

 NB: Even though statistical data always denote
figures (numerical descriptions), it must be
remembered that not all numerical descriptions
are statistical data.
Why?

 Statistical methods: refers to the methods used
for collecting, organizing, analyzing and
interpreting numerical data in order to understand a
phenomenon or make wise decisions. In this sense,
statistics is a branch of the scientific method that helps
us to understand better the object under study.

Biostatistics?

 Biostatistics: The tools of statistics are employed in
many fields: business, education, psychology,
agriculture, and economics, to mention only a few.
 When the data being analyzed are derived from
public health, the biological sciences and medicine,
we use the term biostatistics to distinguish this
particular application of statistical tools and
concepts.

–Types of biostatistics?

Types of biostatistics
Biostatistics
1. Descriptive statistics
– Collecting data
– Organizing data
– Summarizing data
– Presenting data
2. Inferential statistics
– Making inferences
– Hypothesis testing
– Determining relationships
– Making predictions

Types of Biostatistics
1. Descriptive (exploratory) statistics: the branch concerned with
collecting, organizing, presenting and
summarizing data.
It includes techniques for tabular and graphical
presentation of data as well as the methods used to
summarize a body of data with one or two
meaningful figures.
E.g. At our health centre, 50 patients were diagnosed
with angina last year.

Descriptive statistics cont’d …
 Some statistical summaries which are especially
common in descriptive analyses are:
Measures of central tendency
Measures of dispersion
Cross-tabulation /contingency table
Histogram
Quantile, Q-Q plot
Scatter plot
Box plot

2. Inferential Statistics:
 Consists of generalizing from samples to populations,
performing hypothesis tests, determining relationships
among variables, and making predictions.
 This branch of statistics deals with techniques for drawing
conclusions about a population.
 The inferences are drawn from particular properties of
a sample to particular properties of the population.
Inferential statistics builds upon descriptive statistics.

Inferential Statistics cont’d...
NB: It encompasses a variety of procedures to
ensure that the inferences are sound and rational,
even though they may not always be correct.

Statistical inference cont’d…
 In short, inferential statistics enables us to make
confident decisions in the face of uncertainty.
E.g. Antibiotics reduce the duration of viral throat
infections by 1-2 days.
Five percent of women aged 30-49 consult their GP
each year with heavy menstrual bleeding.

Summary
Descriptive statistical methods
– Provide summary indices for a given data, e.g.
arithmetic mean, median, standard deviation,
coefficient of variation, etc.
Inductive (inferential) statistical methods
– Produce statistical inferences about a population
based on information from a sample drawn from
that population; they need to take variation into account
– Estimate population values from sample values
(sample → population)
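As a rough illustration of the summary indices named under descriptive statistical methods, the sketch below computes the arithmetic mean, median, standard deviation and coefficient of variation with Python's standard `statistics` module. The blood pressure readings and the variable names are made up for illustration only; they do not come from the course data.

```python
import statistics

# Hypothetical systolic blood pressure readings (mmHg); illustrative values only.
sbp = [110, 118, 120, 124, 130, 136, 142]

mean_sbp = statistics.mean(sbp)      # arithmetic mean
median_sbp = statistics.median(sbp)  # middle value of the sorted data
sd_sbp = statistics.stdev(sbp)       # sample standard deviation (divides by n - 1)
cv_sbp = sd_sbp / mean_sbp * 100     # coefficient of variation, in percent

print(f"mean = {mean_sbp:.1f}, median = {median_sbp}")
print(f"standard deviation = {sd_sbp:.1f}, coefficient of variation = {cv_sbp:.1f}%")
```

Each of these figures only describes the seven readings themselves. Saying something about the population they were drawn from is the task of inferential methods.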

Summary cont’d …
• E.g.
At our health centre, 50 patients were diagnosed
with angina last year. (descriptive)
Antibiotics reduce the duration of viral throat
infections by 1-2 days. (inferential)
Five percent of women aged 30-49 consult their GP
each year with heavy menstrual bleeding.
(inferential)

• Why do we need biostatistics?

Why do we need biostatistics?
 Main reason: handling variations:
o Biological variation
–Among individuals as well as within same
individual over time
»Example: height, weight, blood pressure,
eye color ...
o Sample variation:
Biomedical research projects are usually carried
out on small numbers of study subjects

Why do we need to learn biostatistics? cont’d…
 Essential for scientific method of investigation
– Formulate hypothesis
– Design study to objectively test hypothesis
– Collect reliable and unbiased data
– Process and evaluate data rigorously
– Interpret and draw appropriate conclusions

Why do we need to learn biostatistics? cont’d…
 Essential for understanding, appraisal and critique of
scientific literature
 Public health and medicine are becoming
increasingly quantitative.

Limitations of statistics:
 It deals only with those subjects of inquiry that are
capable of being quantitatively measured and
numerically expressed.
 It deals with aggregates of facts and attaches no importance
to individual items, so it is suited only when group
characteristics are to be studied.
 Statistical data are only approximations and are not
mathematically exact.

Variables
 Variable: a characteristic or attribute under study
that can assume different values for different elements.
Some examples of variables include:
 Diastolic blood pressure
 Heart rate
 Height
 Weight
 Stage of bladder cancer

Variables cont’d…
 Random variable: a variable whose values are
determined by chance.
 Data: the measurements or observations (values)
of a variable.
 Data set: a collection of observations on a
variable.

Variables cont’d…
Variable     Mrs. Brown   Mr. Patel   Mr. Amanda
Age          32           24          20
Sex          Female       Male        Male
Blood type   O            O           A
The row labels are the variables, each entry is a value (data), and the many values taken together form the data set.

Types of variables
 Depending on the characteristic of the measurement,
a variable can be:
Qualitative (categorical) variable
A variable or characteristic that cannot be
measured in quantitative form but can only be
identified by name or category; that is, a variable that
can be placed into distinct categories according to
some characteristic or attribute.
 For instance: place of birth, ethnic group, type of
drug, stage of breast cancer (I, II, III, or IV),
degree of pain (low, moderate, severe or
unbearable).

Types of variables cont’d…
• The categories should be clear cut (not overlapping)
and cover all the possibilities. For example, sex
(male or female), disease stage (depends on disease),
ever smoked (yes or no).

Types of variables cont’d…
Quantitative (numerical) variable:
 Is one that can be measured and expressed numerically.
 It can be of two types: discrete and continuous.
Discrete Data
The values of a discrete variable are usually whole
numbers, such as the number of episodes of
diarrhoea in the first five years of life.
Observations can take only certain numerical values.
Numerical discrete data occur when the observations
are integers that correspond to a count of some
sort.

Discrete Data cont’d…
 Some common examples are:
 The number of bacteria colonies on a plate,
 The number of cells within a prescribed area
upon microscopic examination,
 The number of heart beats within a specified
time interval,
 A mother’s number of births (parity)
and number of pregnancies (gravidity),
 The number of episodes of illness a patient
experiences during some time period, etc.

Continuous Data
A continuous variable is a measurement on a
continuous scale
Each observation theoretically falls somewhere
along a continuum.
One is not restricted, in principle, to particular
values such as the integers of the discrete scale.
Most clinical measurements, such as:
 Blood pressure,
 Serum cholesterol level,
 Height, weight, age, etc. are on a numerical
continuous scale.

Continuous Data cont’d…
Continuous data are used to report a measurement
of the individual that can take on any value within
an acceptable range.
Data
– Qualitative
– Quantitative: discrete or continuous

Scales of measurement
Data come in various sizes and shapes, and it is
important to know about these so that the proper
analysis can be used on the data.
There are four levels at which we measure:
Nominal scales of measurement
This may be thought of as the "naming" level. This level of
measurement does not put subjects in any particular
order. There is no logical basis for saying one
category is higher or lower than another. In
research activities a YES/NO scale is nominal.

Nominal scales cont’d…
The simplest data consist of unordered,
dichotomous, or "either/or" types of
observations, i.e., either the patient lives or the
patient dies, either he has some particular
attribute or he does not.
 Examples are: Blood group, Gender, religious
affiliation

Nominal scales cont’d…
 The nominal level of measurement classifies data
into mutually exclusive (non-overlapping),
exhaustive categories in which no order or ranking
can be imposed on the data.

Ordinal Scales of Measurement
An ordinal scale is next up the list in terms of power of
measurement. The simplest ordinal scale is a ranking.
At this level we put subjects in order from lowest to
highest.
It is important to know that ranks do not tell us by
how much subjects differ.
There is no objective distance between any two points
on such a scale.
Hence, an ordinal scale only lets you interpret gross
order and not the relative positional distances.

Ordinal Scales cont’d…
 E.g. If we are told that third-year students have better
knowledge than first-year students, we still do not
know by how much they are better.
To measure the size of the difference between
subjects we need the next level of measurement.

Some examples of this scale of
measurement include:
• Academic status, job satisfaction index,
employment status, response to treatment
(none, slow, moderate, fast)
• Likert scale:
1. strongly agree
2. agree
3. no opinion
4. disagree
5. strongly disagree

 The ordinal level of measurement classifies data
into categories that can be ranked; however, precise
differences between the ranks do not exist.

Interval Scales of Measurement
 It is more powerful than nominal and ordinal scales, as it not
only orders the categories but also shows the exact
distances between them.
 On interval measurement scales, one unit on the scale
represents the same magnitude on the trait or
characteristic being measured across the whole range
of the scale.
 They do not have a "true" zero point, however, and
therefore it is not possible to make statements about
how many times higher one score is than another.

Interval Scales cont’d …
 A good example of an interval scale is the Fahrenheit
scale for temperature.
 Equal differences on this scale represent equal
differences in temperature, but the scale is not a RATIO
Scale. Thus, a temperature of 30 degrees is not twice
as warm as that of 15 degrees.
The interval level of measurement ranks data,
and precise differences between units of measure
do exist; however, there is no meaningful zero.
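The claim that 30 degrees Fahrenheit is not twice as warm as 15 degrees can be checked numerically. The sketch below converts both readings to kelvins, a temperature scale with a true zero, and compares the two ratios. The function name `fahrenheit_to_kelvin` and the use of the kelvin scale are choices made here for illustration and do not come from the slides.

```python
def fahrenheit_to_kelvin(temp_f: float) -> float:
    """Convert degrees Fahrenheit to kelvins (a ratio scale with a true zero)."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


low_f, high_f = 15.0, 30.0

# Differences are meaningful on an interval scale.
difference_f = high_f - low_f

# The naive ratio of the Fahrenheit readings depends on the arbitrary zero point.
naive_ratio = high_f / low_f

# The ratio of absolute temperatures is the physically meaningful comparison.
kelvin_ratio = fahrenheit_to_kelvin(high_f) / fahrenheit_to_kelvin(low_f)

print(f"difference = {difference_f} degrees Fahrenheit")
print(f"naive ratio = {naive_ratio:.2f}, ratio in kelvins = {kelvin_ratio:.3f}")
```

The Fahrenheit readings give a ratio of 2, while the ratio of the absolute temperatures is only slightly above 1. This is why ratio statements are reserved for scales with a true zero.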

Ratio Scales of Measurement
 The highest level of measurement
 This has the properties of an interval scale together
with a fixed origin or zero point.
 Examples of variables which are ratio scaled include
weights, lengths and times.

Ratio Scales cont’d…
 Ratio scales permit the researcher to compare both
differences in scores and the relative magnitude of
scores.
– For instance the difference between 5 and 10
minutes is the same as that between 10 and 15
minutes, and 10 minutes is twice as long as 5
minutes.

Ratio Scales cont’d…
 The ratio level of measurement possesses all the
characteristics of interval measurement, and there
exists a true zero. In addition, true ratios exist
between different units of measure.

Summary
Depending on the characteristic of the measurement, a variable can be:
Variables
– Qualitative/Categorical: cannot be measured in quantitative form.
– Quantitative: can be measured and expressed numerically.
   – Discrete: takes whole/integer numbers.
   – Continuous: a measurement on a continuous scale.

Summary cont’d…
Based on the scales of measurement, variables can be:
– Nominal: only category and no ranking
– Ordinal: category + ranking (no clear distance)
– Interval: ranking + clear distance between categories, but no true zero
– Ratio: interval properties plus a true zero

Summary table for the four scales of measurement
Power     Scale      Characteristics
Highest   Ratio      Equal interval with absolute zero
          Interval   Equal interval without absolute zero
          Ordinal    Ordering
Lowest    Nominal    Naming
Power increases from nominal to ratio.

Categorize the following variables into nominal,
ordinal, interval or ratio
 Gender
 Grade (A, B, C, D and F)
 Rating scale (poor, good, excellent)
 Eye colour
 Political affiliation
 Religious affiliation
 Ranking of tennis players
 Major field
 Nationality
 Height
 Weight
 Time
 Age
 IQ
 Temperature
 Salary

ASSIGNMENT 1
Exercise 1: Table 1.6 contains the characteristics of cases and controls
from a case-control study into stressful life events and breast cancer
in women (Protheroe et al. 1999). Categorize the variables in the
table into nominal, ordinal, interval or ratio.
Exercise 2: Table 1.7 is from a cross-sectional study to determine the
incidence of pregnancy-related venous thromboembolic events and
their relationship to selected risk factors, such as maternal age,
parity, smoking, and so on (Lindqvist et al. 1999). Categorize the
variables in the table into nominal, ordinal, interval or ratio.
Exercise 3: Table 1.8 is from a study to compare two lotions, Malathion
and d-phenothrin, in the treatment of head lice (Chosidow et
al. 1994). Of 193 schoolchildren, 95 were given Malathion
and 98 d-phenothrin. Categorize the variables in the table into
nominal, ordinal, interval or ratio.


ASSIGNMENT 2
 Four migraine patients are asked to assess the severity
of their migraine pain one hour after the first
symptoms of an attack, by marking a point on a
horizontal line, 100 mm long. The line is marked ‘No
pain’, at the left-hand end, and ‘Worst possible pain’ at
the right-hand end. The distance of each patient’s mark
from the left-hand end is subsequently measured with
a mm ruler, and their scores are 25 mm, 44 mm, 68
mm and 85 mm. What sort of data is this? Can you
calculate the average pain of these four patients? Note
that this form of measurement (using a line and getting
subjects to mark it) is known as a visual analogue scale
(VAS).

 Response and explanatory variables
 A variable can also be either a response (dependent,
outcome) variable or an explanatory (independent,
predictor) variable.
 Response (dependent, outcome) variable: a
variable that can be affected by the explanatory
variable; it is the outcome of a study.
It is the variable you would be interested in predicting or
forecasting.
 Explanatory variables are any variables that explain
the response variable.

Exercise 1:
In a study to determine whether surgery or
chemotherapy results in higher survival rates for a
certain type of cancer,
which variable is the explanatory variable and which
one is the response variable?

• What is the importance of
variable classification?

• Source of Data?

Primary source of data
Primary data are collected with the direct involvement of
the researcher. Censuses and sample surveys are sources of
primary data.
Experiments are another means of getting the
data needed to answer a question.

Source of Data…
Secondary data
The data needed to answer a question may already
exist in the form of published reports, commercially
available data banks, or the research literature.
In this case the data are obtained from already collected
sources such as newspapers, magazines, DHS, hospital
records and existing data such as:
Mortality reports
Morbidity reports
Epidemic reports
Reports of laboratory utilization (including
laboratory test results)

Data collection methods?

Data collection methods
 Before any statistical work can be done data must be
collected.
 Data collection is a crucial stage in the planning and
implementation of a study.
 Data collection techniques allow us to systematically
collect data about our objects of study and about
the setting in which they occur.

Data collection methods…
 The methods of collecting data may be broadly
classified as:
Self-administered questionnaires
The use of documentary sources,
Observation
Interviews
Tape recording
Filming
Photography
Focus group discussion

The choice of methods of data collection is
based on:
♣ The type of information to be collected from the
source.
♣ The accuracy of information they will yield
♣ Practical considerations, such as, the need
for personnel, time, equipment and other
facilities, in relation to what is available.

 The method that provides more satisfactory information will
often be the more expensive or inconvenient one.
♣ Therefore, accuracy must be balanced against
practical considerations (resources and other
practical limitations).

1) Observation
 Observation is a technique that involves
systematically selecting, watching and recording
behaviors of people or other phenomena and
aspects of the setting in which they occur, for the
purpose of getting (gaining) specific information.
 It includes all methods, from simple visual
observation to the use of sophisticated equipment or
facilities, such as radiographic and biochemical
tests, X-ray machines, microscopes, clinical
examinations, and microbiological examinations.

Observation…
 Advantages: Gives relatively more accurate data on
behavior and activities
 Disadvantages: The investigator’s or observer’s own
biases, desires, etc. may affect the results, and more resources
and skilled human power are needed when high-level
machines are used.

2) The use of documentary sources
Clinical records and other personal records, death
certificates, published mortality statistics, census
publications, etc.
Advantages
 Documents can provide ready made information
relatively easily
 The best means of studying past events.
Disadvantages
 Problems of reliability and validity (because the
information is collected by a number of different
persons who may have used different definitions or
methods of obtaining data).

3) Interviewing
It involves oral questioning of respondents, either
individually or as a group.
Answers can be recorded by writing them down, by
tape-recording the responses, or by a
combination of the two.
Interviews can be conducted with varying degrees of
flexibility (high degree of flexibility vs. low degree of
flexibility).

Interviewing cont’d…
A) High degree of flexibility /unstructured:
Usually used when the researcher has little
understanding of the problem
Is frequently applied in exploratory studies

B) Low degree of flexibility / highly structured
interview.
Useful when the researcher is relatively
knowledgeable about expected answers or when
the number of respondents being interviewed is
relatively large
Questionnaires may be used with a fixed list of
questions in a standard sequence, which have mainly
fixed or pre-categorized answers

 Ways of interviewing participants:
Face to face
Telephone

Interviews cont’d…
Face to face interviews:
A good interviewer can stimulate and maintain the
respondent’s interest and encourage frank answers to
questions.
If anxiety is aroused (e.g., why am I being asked these
questions?), the interviewer can allay it.
An interviewer can repeat questions which are not
understood, and give standardized explanations
where necessary.
An interviewer can make observations during the
interview; i.e., note is taken not only of what the
subject says but also how he says it.

Telephone interviews
Telephone interviews can be a very effective and
economical way of collecting data for quantitative
research
May be useful when the respondents to be
interviewed are widely dispersed geographically
NB: The questionnaire should be fairly short, and a prior
appointment may enhance the response rate and the length of
the interview

While interviewing, care should be taken
not to influence the responses; the interviewer
should ask questions in a neutral manner, should
not show agreement, disagreement, or
surprise, and should record the respondent’s precise
answers without sifting or interpreting them.

4) Self-administered questionnaires
 Written questions are presented that are to be
answered by the respondents in written form.
 The respondent reads the questions and fills in the
answers by him/herself (sometimes in the presence of
an interviewer who “stands by” to give assistance if
necessary).
 The use of self-administered questionnaires is simpler
and cheaper. They can be administered to many persons
simultaneously.

Self-administered questionnaires cont’d ….
A written questionnaire can be administered in
different ways, such as by:
– Sending questionnaires by mail
– Gathering all or part of the respondents in one
place at one time, giving oral or written
instructions, and letting them fill out the
questionnaires
The main problems with postal questionnaires are
that response rates tend to be relatively low, and
that there may be under-representation of less
literate subjects.

Self-administered questionnaires cont’d…
Advantages
Less expensive; permit anonymity and may result in
more honest responses; do not require research
assistants; eliminate bias due to phrasing questions
differently with different respondents
Disadvantages
Cannot be used with illiterate respondents; there is often a low
rate of response; questions may be misunderstood

Problems in gathering data?

Problems in gathering data
Common problems might include:
 Language barriers
 Lack of adequate time
 Expense
 Inadequately trained and experienced staff
 Invasion of privacy
 Suspicion (mistrust)
 Bias (any systematic error)
 Cultural norms (e.g. which may preclude (prevent)
men interviewing women)

Types of Questions
 Depending on how questions are asked and recorded, we
can distinguish two major possibilities: open-ended
questions and closed questions.
Open-ended questions
Open-ended questions permit free responses that
should be recorded in the respondent’s own words. The
respondent is not given any possible answers to choose
from.
Such questions are useful to obtain information on:
 Facts with which the researcher is not very familiar,
 Opinions, attitudes, and suggestions of informants

Open-ended questions…
For example
Can you describe exactly what the traditional birth
attendant did when your labor started?
What do you think are the reasons for a high drop-out
rate of village health committee members?
What would you do if you noticed that your daughter
(school girl) had a relationship with a teacher?

 Closed Questions
Closed questions offer a list of possible options or
answers from which the respondents must choose.
Closed questions are useful if the range of possible
responses is known.
When designing closed questions one should try to:
 Offer a list of options that are exhaustive and
mutually exclusive

Closed Questions…
For example
What is your marital status?
1. Single
2. Married/living together
3. Separated
4. Divorced
5. Widowed
Have you ever gone to the local village health worker
for treatment?
1. Yes
2. No

Requirements of questions
 Must have validity – that is, the question we design
should be one that gives an obviously valid and relevant
measurement of the variable.
 Must be clear and unambiguous – the way in which
questions are worded can ‘make or break’ a questionnaire.
They must be phrased in language that the
respondent is expected to understand, and that all respondents will
understand in the same way.
To ensure clarity, each question should contain only one
idea; ‘double-barrelled’ questions like
‘Do you take your child to a doctor when he has a cold or
has diarrhea?’ are difficult to answer, and the answers are
difficult to interpret.

Requirements of questions …
 Must not be offensive – whenever possible it is wise
to avoid questions that may offend the respondent,
for example, those which may seem to expose the
respondent’s ignorance, and those requiring him to
give a socially unacceptable answer.
 The questions should be fair - They should not be
loaded.
Short questions are generally regarded as preferable
to long ones.

Requirements of questions …
 Sensitive questions - It may not be possible to avoid
asking ‘sensitive’ questions that may offend
respondents. In such situations the interviewer
(questioner) should ask them very carefully and wisely.

Methods of data organization and presentation
 The data collected in a survey is called raw data. In most
cases, useful information is not immediately evident
from the mass of unsorted data.
 Collected data need to be organized in such a way as to
condense the information they contain in a way that will
show patterns of variation clearly.

1. Frequency Distributions
 Quite often, data are presented in a meaningful
way by preparing a frequency distribution. If
this is not done, the raw data will not convey any
meaning and any pattern in them may not be
detected.
 Given a set of scores, constructing a frequency
distribution includes computing proportions (P) or percentages.

Frequency Distributions cont’d …
 A frequency distribution gives the number of
units (e.g., people) that fall into each of a series of specified
categories.
 The frequency is the count of the number of times
that a particular value or category occurs in a data set.
 The relative frequency is the frequency of the
event/value/category divided by the total number of
data points.
A frequency distribution can be grouped or ungrouped.

Ungrouped Frequency Distribution
 It is used to present a categorical variable in a simplified
and easily understandable way.
 This frequency table can be constructed by listing all
possible categories of the variable and then counting
the number of observations falling in each category
as its frequency.

Example
The following data are on the current age of women,
collected from 240 women (data 1).

Example: Consider the data collected on age at first
marriage of 240 women (data 1). One of the variables in
this dataset is the religion followed by the women. For
this type of variable, we can use an ungrouped
frequency distribution to summarize the data as follows:
Religion     Frequency   Relative frequency (%)
Orthodox     103         42.9
Muslim       33          13.8
Protestant   97          40.4
Others*      7           2.9
Total        240         100
*Catholic, non-religious
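The relative frequencies in the religion table can be reproduced with a few lines of Python. This is a minimal sketch that starts from the counts shown in the table; the names `frequencies` and `relative` are chosen here for illustration.

```python
# Frequencies taken from the religion table (n = 240).
frequencies = {"Orthodox": 103, "Muslim": 33, "Protestant": 97, "Others": 7}

total = sum(frequencies.values())

# Relative frequency (%) = category frequency / total number of observations * 100
relative = {category: count / total * 100 for category, count in frequencies.items()}

print(f"{'Religion':<12}{'Frequency':>10}{'Relative frequency (%)':>25}")
for category, count in frequencies.items():
    print(f"{category:<12}{count:>10}{relative[category]:>25.1f}")
print(f"{'Total':<12}{total:>10}{sum(relative.values()):>25.1f}")
```

When only the raw observations are available, `collections.Counter` from the standard library can produce the counts before this step.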

Grouped Frequency Distribution
Presenting data using a grouped frequency
distribution is not as simple as using an ungrouped one. In
this case we need to compute some values first. These
values are given below:
Number of classes (K): the number of categories
the table will have.
The number of classes can be estimated using
Sturges' rule as:
K = 1 + 3.322 log10(n)
Where:
K = number of classes
n = sample size.
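Sturges' rule is easy to evaluate in code. The sketch below applies it to the sample size of data 1 (n = 240). Rounding the result up to the next whole number is an assumption made here, since the slides do not state a rounding convention, and the function name `sturges_classes` is hypothetical.

```python
import math


def sturges_classes(n: int) -> int:
    """Number of classes K = 1 + 3.322 * log10(n), rounded up (assumed convention)."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return math.ceil(1 + 3.322 * math.log10(n))


k = sturges_classes(240)
print(f"Suggested number of classes for n = 240: {k}")
```

The rule gives a starting point only; the number of classes can still be adjusted for a clearer table.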

Grouped Frequency cont’d…
• Then the width of each class, W, can be computed
as:
W = (largest value - smallest value) / K

Class limits: the smallest
and largest values that can go into a class; each class has
a lower and an upper class limit.
Lower class limit: the smallest observation that can fall in the
class.
Upper class limit: the lower class limit plus the
width of the class minus one.
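The rule for class limits can be sketched as a short loop. The example uses the values from the worked example for data 1 (smallest observation 15, class width 4, 9 classes) and assumes whole-number data, so that the next class starts one unit above the previous upper limit. The function name `class_limits` is hypothetical.

```python
def class_limits(smallest: int, width: int, k: int) -> list:
    """Return (lower, upper) class limits for k classes of whole-number data."""
    limits = []
    lower = smallest
    for _ in range(k):
        upper = lower + width - 1  # upper limit = lower limit + width - 1
        limits.append((lower, upper))
        lower = upper + 1          # the next class starts one unit higher
    return limits


limits = class_limits(smallest=15, width=4, k=9)
for lower, upper in limits:
    print(f"{lower}-{upper}")
```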

 When forming classes, always make sure that each item
(measurement or observation) goes into one and only
one class, i.e. classes should be mutually exclusive
(namely, that successive classes have no values in
common).
 To this end we must make sure that the smallest and
largest values fall within the classification, that none of
the values can fall into possible gaps between successive
classes.

 Note that Sturges' rule should not be regarded
as final; it should be considered as a guide only. The number of
classes specified by the rule may be increased
or decreased for convenient or clear presentation.

 Class boundaries (true limits): limits that
are determined mathematically so that the intervals of a
continuous variable are continuous in both directions and
no gap exists between classes. They are obtained by
subtracting 0.5 from the lower class limit and adding 0.5 to the upper class
limit:
 Lower class boundary = lower class limit - 0.5
 Upper class boundary = upper class limit + 0.5
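A small helper makes the boundary calculation explicit. In this sketch the adjustment is half of the recording unit, which equals 0.5 for whole-number data. The `unit` parameter and the function name `class_boundaries` are assumptions introduced here to cover data recorded to other precisions.

```python
def class_boundaries(lower_limit: float, upper_limit: float, unit: float = 1.0):
    """Return (lower boundary, upper boundary) for one class.

    `unit` is the smallest step in which the data are recorded
    (1 for whole numbers, 0.1 for one decimal place, and so on).
    """
    half_unit = unit / 2
    return lower_limit - half_unit, upper_limit + half_unit


first = class_boundaries(15, 18)
second = class_boundaries(19, 22)
print(first, second)

# The upper boundary of one class equals the lower boundary of the next,
# so no gap remains between successive classes.
print(first[1] == second[0])
```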

 Class mark / mid-point (Xc) of an interval: the value
that lies midway between the lower
true limit (LTL) and the upper true limit (UTL) of a
class.
It is calculated as the average of the lower and upper
class limits:
Xc = (lower class limit + upper class limit) / 2

NB: The constructed grouped frequency distribution is expected to satisfy the following:
– Class intervals should be continuous (for continuous data), non-overlapping (mutually exclusive) and exhaustive.
– Class intervals should generally be of the same width.
– Open-ended class intervals should be avoided. These are classes like "less than 10", "greater than 65", and so on.

Example for data 1
• The number of classes (K) can be computed using Sturges' rule. With n = 240 observations (the total frequency in the table for this example):
K = 1 + 3.322 log10(240) ≈ 8.9 ≈ 9
• The width W of each class is then the range of the data divided by K.
• Thus the width of each class can be 4, and the lower class limit of the first class will be the minimum observation in the dataset.

Example for data 1
• Thus, the grouped frequency distribution of the current age of women can be constructed as:
Class limit   Class boundary   Class mark   Frequency   RF (%)   CF
15-18         14.5-18.5        16.5         15          6.25     15
19-22         18.5-22.5        20.5         49          20.42    64
23-26         22.5-26.5        24.5         51          21.25    115
27-30         26.5-30.5        28.5         40          16.67    155
31-34         30.5-34.5        32.5         21          8.75     176
35-38         34.5-38.5        36.5         22          9.17     198
39-42         38.5-42.5        40.5         18          7.50     216
43-46         42.5-46.5        44.5         15          6.25     231
47-50         46.5-50.5        48.5         9           3.75     240

Example for data 1 cont’d …
RF and CF are the relative frequency and cumulative frequency respectively.
• Note that the value added to or subtracted from the class limits to get the class boundaries depends on the number of decimal places in the data being summarized: it is half of the smallest unit of measurement (0.5 for whole numbers, 0.05 for data recorded to one decimal place).
The width of a class is found from the true class limits by subtracting the lower true limit from the upper true limit of any particular class.
For example, taking the fourth class of the above distribution, W = 30.5 - 26.5 = 4.
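The construction steps for a grouped frequency distribution can be sketched in Python. This is only an illustration under stated assumptions: the function names `sturges_classes` and `grouped_frequency` are hypothetical, the ages are made-up whole numbers (not the course dataset), and the width is rounded up so that the K classes are guaranteed to cover the maximum.

```python
import math

def sturges_classes(n):
    """Number of classes K from Sturges' rule, rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))

def grouped_frequency(values, width, k):
    """Build k classes of whole-number data, starting at the minimum value."""
    start = min(values)
    n = len(values)
    rows, cumulative = [], 0
    for i in range(k):
        lower = start + i * width
        upper = lower + width - 1                 # upper class limit
        freq = sum(lower <= v <= upper for v in values)
        cumulative += freq
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "mark": (lower + upper) / 2,          # class mark
            "freq": freq,
            "rf_percent": round(100 * freq / n, 2),
            "cf": cumulative,
        })
    return rows

# Hypothetical ages, used only to exercise the functions.
ages = [15, 17, 18, 19, 21, 22, 22, 24, 25, 26, 27, 30, 31, 35, 38, 42]
k = sturges_classes(len(ages))
# Range / K, rounded up so the last class always contains the maximum.
width = (max(ages) - min(ages)) // k + 1
table = grouped_frequency(ages, width, k)
for row in table:
    print(row)
```

As the notes say, Sturges' rule is a guide only, so K and the width can be adjusted by hand before calling `grouped_frequency`.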

Statistical Tables
A statistical table is an orderly and systematic
presentation of data in rows and columns.
Rows : are horizontal arrangements.
Columns: are vertical arrangements.

Statistical Tables cont’d…
• Depending on the purpose for which the table is designed and the complexity of the relationship, a table can be either a simple frequency table or a cross tabulation.
Simple frequency table: used when the individual observations involve only a single variable.
Cross tabulation: used to obtain the frequency distribution of one variable by the subsets of another variable.

Statistical tables cont’d…
Construction of tables
Although there are no hard and fast rules, the following general principles should be followed in constructing tables.
Tables should be as simple as possible.
Tables should be self-explanatory:
• The title should be clear and to the point (a good title answers: what? when? where? how classified?) and it should be placed above the table.
• Each row and column should be labeled.

Statistical tables cont’d …
• Numerical entries of zero should be explicitly written rather than indicated by a dash. Dashes are reserved for missing or unobserved data.
• If data are not original, their source should be given in a footnote.

Tables cont’d…
One-variable/ Simple frequency table
– The most basic table is a simple frequency distribution with one variable
Example:
Table 1: Blood group of voluntary blood donors examined in Red Cross Blood bank, within a day, May 2006 (n=548)
[Table not reproduced; the slide marks its title, rows and columns.]

Example 2: simple table cont’d...
Table 2: Clinical symptoms among 54 patients with S. Typhimurium infection, Oslo, Norway, May 1998
Symptoms      Cases (n)   %
Diarrhoea     54          100
Fever         35          65
Headache      12          22
Joint pain    4           7
Muscle pain   4           7

Two and three variable table
If two variables are cross tabulated, it is a two-variable table.
If the tabulation is among three variables, it is a three-variable table.
In cross-tabulated frequency distributions with row and column totals, the choice of denominator depends on the variable of interest that is to be compared across the subsets of the other variable.

Two and three variable table cont’d…
Table 3: Distribution of variable 1 by variable 2, population X (n=58), place Y, period Z
              Variable 2
Variable 1    Value 1   Value 2   Value 3   Total
Value 1       2         4         7         13
Value 2       3         5         3         11
Value 3       4         5         4         13
Value 4       5         6         2         13
Unknown       3         2         3         8
Total         17        22        19        58
Explanation of acronyms, units used, …

Two and three variable table cont’d...
Table 1. Cases of Salmonella Typhimurium infection by age group and sex, Herøy, Norway, 1999
                    Sex
Age group (years)   Male   Female   Total
0 - 9               7      5        12
10 - 19             5      5        10
20 - 29             5      5        10
30 - 39             1      4        5
40 - 49             2      3        5
50 - 59             0      3        3
60 - 69             2      1        3
70 -                2      4        6
Total               24     30       54

Two and three variable table cont’d...
Distribution of participants by age, sex and residence
Residence   Age     Male   Female   Total
Urban       15-24   34     76       110
            25-34   48     56       104
            35-44   65     54       119
Rural       15-24   56     58       114
            25-34   78     53       131
            35-44   46     47       93
Total               369    395      764

Common form of a two-by-two table
It is a special form of table, a favorite among epidemiologists.
It is used to examine whether there is a relationship between two variables.
Exposure      Number of cases   Number of controls   Total
Exposed       23                23                   46
Non exposed   4                 139                  143
Total         27                162                  189

Composite/ Higher Order Table
It is a large table combining several separate variables/tables.
Age, sex and other demographic variables may be combined to form a single table.

– Example of composite table
Characteristics               Number   Percent
Marital status
  Single                      50       67.6
  Married                     20       27.0
  Divorced/ widowed           4        5.4
Current Residence (n=73)
  Within the PA               40       54.8
  Within the PA (H. Post)     25       34.2
  Within the nearest town     8        11.0
Residence of origin
  Within the PA               4        5.4
  Outside the PA              24       32.4
  Outside the Woreda          46       62.2
Training TVETI
  Axum                        19       25.7
  Makele                      55       74.3
Totals                        74       100

Graphical Presentation
Graphs are often easier to interpret than tables, perhaps at the expense of detail.
A variety of graphs are used depending on the type of data.
For categorical/qualitative or discrete quantitative variables, the pie chart and bar chart are appropriate; for continuous quantitative variables, we can use the histogram, frequency polygon, cumulative frequency curve, box plot, etc.

Graphical Presentation cont’d…
There are general rules that are commonly accepted for the construction of graphs.
Every graph should be self-explanatory and as simple as possible.
Titles are usually placed below the graph and should again answer questions like: what? where? when? how classified?
Legends or keys should be used to differentiate variables if more than one is shown.

Graphical Presentation cont’d…
The units into which the scale is divided should be clearly indicated.
The numerical scale representing frequency must start at zero, or a break in the line should be shown.

Examples of graphs:
Bar Chart
Bar diagrams are used to represent and compare the
frequency distribution of discrete variables and
attributes or categorical series. When we represent
data using bar diagram, all the bars must have equal
width and the distance between bars must be equal.
Each category of variable is represented by a bar
Variables are categorical, or treated as qualitative
It can be displayed as horizontal or vertical

Types of bar charts
There are different types of bar diagrams:
A. Simple bar chart: It is a one-dimensional
diagram in which the bar represents the
whole of the magnitude.
The height or length of each bar indicates
the size (frequency) of the figure
represented.
– one variable
– It can be displayed as horizontal or vertical

Types of bar charts cont’d…
Figure 1: Immunization status of children in Adami Tulu Wereda, 1995

Type of bar chart cont’d ...
B. Grouped bar chart
– Data from 2-variable or more variable tables
– Distinct colours or shading is used to
differentiate
– Legend is necessary

E.g. Grouped/ joined bar chart
Figure 2: TT immunization status by marital status of women 15-49 years, Asendabo town, 1996.
[Figure annotations: the meaning of each bar is shown in a legend; the bars of one cell are joined, and cells are separated by a space.]

C. Stacked bar chart
– It shows the same data as a grouped bar chart, using a single bar per category
– Different groups are shown as different segments within a single bar
– The overall change is easier to see, but changes between groups may be harder to see than with grouped bars

E.g. Stacked bar chart (absolute values)
Figure 3: Cases of S. Typhimurium infection by age group and sex, Herøy, Norway, 1999
[Figure: stacked bars of number of cases (0 to 14) by age group (0 - 9 to 70 -), with segments for male and female.]

D. 100% component bar chart
– It is a variant of the stacked bar chart in which every bar is scaled to 100% rather than drawn at its real value;
– It is helpful for comparing the contribution of different subgroups within the categories of the main variable

E.g. 100% component bar chart
Figure 4: Cases of S. Typhimurium infection by age group and sex, Herøy, Norway, 1999
[Figure: proportional distribution by sex (0% to 100%) within each age group (0 - 9 to 70 -), with segments for male and female.]

Pie charts
• A pie chart is a circle divided into sectors so that the areas of the sectors are proportional to the frequencies.
• It is split into segments to show percentages or the relative contributions of categories of data.
• It is a good method of representation if you wish to compare a part of a group with the whole group.
• The number of categories should not be too large.

e.g. Pie chart
Fig.5. Distribution of religion of participants from Kunama ethnic group
among Eritrean Refugees in Shimelba Camp, July 2006

Quantitative continuous data
Histogram: the graph of the frequency distribution of a continuous measurement variable.
It is constructed on the basis of the following principles:
The horizontal axis is a continuous scale running from one extreme end of the distribution to the other. It should be labeled with the name of the variable and the units of measurement.

Histograms cont’d …
For each class in the distribution, a vertical rectangle is drawn with its base on the horizontal axis, extending from one class boundary to the other, so there is never a gap between the rectangles.
The bases of all rectangles are determined by the width of the class intervals.

Histograms cont’d …
The area of each column is proportional to the number of observations in that interval.
In constructing a histogram:
– Use equal class intervals
– Do not use scale breaks
A second variable can be shown by shading.

Figure 6: Age distribution of women in a reproductive age group
included in a study of violence against women in Butajira, 1984.

Frequency polygon
• If we join the midpoints of the tops of the adjacent rectangles of the histogram with line segments, a frequency polygon is obtained.
Note: it is not essential to draw a histogram in order to obtain a frequency polygon. It can be drawn without erecting the rectangles of the histogram, as follows:

Frequency polygon cont’d…
• Mark the horizontal scale with the numerical values of the midpoints of the intervals.
• Erect an ordinate on the midpoint of each interval; its length represents the frequency of the class on whose midpoint it is erected.
• Join the tops of the ordinates and extend the connecting lines down to the horizontal scale.

Construction of a frequency polygon from a histogram
[Figure: histogram of cases (patients and staff members, one box per case) by date and time of onset in 6-hour intervals, 27 to 30 August; vertical axis 0 to 15 cases. The mid point/class mark of each interval is marked.]

Ogive or cumulative frequency curve:
• To construct an ogive:
Compute the cumulative frequency of the distribution.
• Prepare a graph with the cumulative frequency on the vertical axis and the true upper class limits (class boundaries) of the intervals scaled along the horizontal axis (X-axis).
The true lower limit of the lowest class interval is also included on the X-axis scale; it is the true upper limit of the next lower interval, which has a cumulative frequency of 0.
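The points to plot for an ogive can be listed with a few lines of Python. This is a minimal sketch, assuming the class boundaries and frequencies of the grouped age distribution shown earlier in these notes; the helper name `ogive_points` is hypothetical.

```python
from itertools import accumulate

def ogive_points(lowest_boundary, upper_boundaries, frequencies):
    """Return (x, cumulative frequency) pairs for plotting an ogive."""
    cumulative = list(accumulate(frequencies))
    # The curve starts at the lowest true limit with cumulative frequency 0.
    return [(lowest_boundary, 0)] + list(zip(upper_boundaries, cumulative))

upper = [18.5, 22.5, 26.5, 30.5, 34.5, 38.5, 42.5, 46.5, 50.5]
freq = [15, 49, 51, 40, 21, 22, 18, 15, 9]
points = ogive_points(14.5, upper, freq)
print(points)
```

Joining these points with line segments gives the cumulative frequency curve.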


Summarizing Data
 The first step in looking at data is to describe the
data at hand in some concise way.
 One type of measure useful for summarizing data
defines the center, or middle, of the sample.

Measures of Central Tendency/ Measures of Location
• Measures of central tendency: the various methods of determining the value around which the data tend to concentrate. A measure of central tendency is thus a single value that sums up or describes the mass of the data.
• These measures include:
Mean,
Median and
Mode.

Arithmetic Mean/simple Mean ($\bar{X}$)
Definition: the arithmetic mean is the sum of all observations divided by the number of observations. It is usually denoted by $\bar{X}$.
• Let X1, X2, ..., XN be the N measurements obtained from N subjects. Then the mean for ungrouped data is defined as:
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$

The mean for grouped data can be computed as follows:
$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_{ci}}{\sum_{i=1}^{k} f_i}$$
• Where: k = the number of classes,
Xci = class mark of the ith class and
fi = frequency of the ith class

Properties of the mean
Individual extreme values (also known as 'outliers') can distort its ability to represent the typical value of a variable; this is the main weakness of the mean.
It is unique for a given set of data.
Its value is determined by every item in the series.
The sum of the deviations about it is zero.

Example 1
• Consider the data on birth weight (in kilograms) of 10 newborn children at University of Gondar hospital:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43.
Then the average birth weight can be computed as:
$$\bar{X} = \frac{2.51 + 3.01 + \dots + 2.43}{10} = \frac{25.72}{10} = 2.572 \text{ kg}$$

Example 2
• Compute the mean for the grouped frequency distribution given below:
The grouped frequency distribution for current age of women

Class limit   Class boundary   Class mark   Frequency   RF (%)   CF
15-18         14.5-18.5        16.5         15          6.25     15
19-22         18.5-22.5        20.5         49          20.42    64
23-26         22.5-26.5        24.5         51          21.25    115
27-30         26.5-30.5        28.5         40          16.67    155
31-34         30.5-34.5        32.5         21          8.75     176
35-38         34.5-38.5        36.5         22          9.17     198
39-42         38.5-42.5        40.5         18          7.50     216
43-46         42.5-46.5        44.5         15          6.25     231
47-50         46.5-50.5        48.5         9           3.75     240

Example 2 cont’d…
$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_{ci}}{n} = \frac{6960}{240} = 29.0 \text{ years}$$
• Where: fi = frequency of the ith class
Xci = class mark (mid-point) of the ith class
n = total sample size
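The grouped mean for the age distribution can be checked with a short Python sketch. The class marks and frequencies are taken from the table above; the function name `grouped_mean` is a hypothetical helper.

```python
def grouped_mean(class_marks, frequencies):
    """Mean of grouped data: sum of f_i * Xc_i divided by the sum of f_i."""
    total = sum(f * x for x, f in zip(class_marks, frequencies))
    return total / sum(frequencies)

marks = [16.5, 20.5, 24.5, 28.5, 32.5, 36.5, 40.5, 44.5, 48.5]
freq = [15, 49, 51, 40, 21, 22, 18, 15, 9]
mean_age = grouped_mean(marks, freq)
print(mean_age)  # 29.0
```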

Median
• An alternative measure of central location, perhaps second in popularity to the arithmetic mean.
• Suppose there are n observations in a sample. If these observations are ordered from smallest to largest, then the median is defined as follows:
The median is a value such that at least half of the observations are less than or equal to it and at least half of the observations are greater than or equal to it.
The median is the midpoint of the ordered data array.

Median cont’d …
 To find the median of a data set:
Arrange the data in ascending order.
Find the middle observation of this ordered
data.

Median cont’d…
• If the number of observations n is ODD, then the median is the middle data point:
Median = the ((n+1)/2)th observation
• If n is EVEN, then the median is the average of the two values around the middle:
Median = the average of the (n/2)th and (n/2 + 1)th observations

Median cont’d…
• Extreme values do NOT affect the median, making the median a good alternative to the mean as a measure of central tendency when such values occur.

Example:
• Consider the data on the weight (in kg) of 10 newborn children at University of Gondar hospital within a month:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43.
– Find the median of the data.

Example cont’d…
• First arrange the data in ascending order:
1.98, 2.02, 2.33, 2.33, 2.43, 2.51, 2.88, 2.98, 3.01, 3.25.
• As n = 10 is even, we take the two middle observations (the 5th and 6th); the median is their average:
Median = (2.43 + 2.51)/2 = 2.47 kg
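The odd/even rule for the median can be sketched in Python as follows. The function name `median_of` is hypothetical, and the second data set (with one weight replaced by an implausibly large value) is invented only to show that an extreme value leaves the median unchanged.

```python
def median_of(values):
    """Median by the odd/even rule applied to the sorted data."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                      # the ((n+1)/2)th observation
    return (data[mid - 1] + data[mid]) / 2    # average of the two middle values

weights = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]
med = median_of(weights)

# Hypothetical data: the largest weight replaced by an extreme value.
with_extreme = [32.5 if w == 3.25 else w for w in weights]
med_extreme = median_of(with_extreme)
print(med, med_extreme)
```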

Median cont’d…
Median for grouped data:
• The median for grouped data is defined by:
$$\text{Median} = LCB + \left(\frac{\frac{n}{2} - F_c}{f_c}\right) W$$
Where:
LCB = lower class boundary of the median class
Fc = cumulative frequency just before the median class
fc = frequency of the median class
W = class width and n = number of observations.

Example median for grouped data 1
• Consider the example on age of women presented earlier as a grouped frequency distribution. Compute the median for the grouped data.
• To compute the median for grouped data, we first need to find the median class. In this example half of the observations is n/2 = 120.
• Let us look at the distribution with the cumulative frequency:

Example median for grouped data 1 cont’d…

• As we can see from the distribution, the first class whose cumulative frequency reaches 120 is the class with cumulative frequency 155 (the preceding class reaches only 115). So the median class is the 4th class, 27-30.
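Once the median class is known, the grouped-median formula (LCB plus the fraction (n/2 - Fc)/fc of the class width W) can be applied. The sketch below assumes the lower class boundaries and frequencies of the age distribution in these notes; `grouped_median` is a hypothetical helper name.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped data: LCB + ((n/2 - Fc) / fc) * W."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("the distribution has no observations")
    half = n / 2
    cumulative = 0                      # Fc: cumulative frequency so far
    for lcb, f in zip(lower_boundaries, frequencies):
        if cumulative + f >= half:      # first class whose CF reaches n/2
            return lcb + (half - cumulative) / f * width
        cumulative += f

lcbs = [14.5, 18.5, 22.5, 26.5, 30.5, 34.5, 38.5, 42.5, 46.5]
freq = [15, 49, 51, 40, 21, 22, 18, 15, 9]
median_age = grouped_median(lcbs, freq, width=4)
print(median_age)  # 27.0
```

Here LCB = 26.5, Fc = 115, fc = 40 and W = 4, so the median is 26.5 + (5/40) × 4 = 27.0 years.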

Mode
• The mode is the value appearing most frequently.
• It can be obtained by counting the number of appearances of each observation in the list.
• It is important for summarising nominal/categorical data.
• Disadvantages:
– With a small number of observations, there may be no mode.
– Sometimes there may be more than one mode, as with a bimodal (two-peaked) distribution.
• Example
a. 22, 66, 69, 70, 73 (no modal value)
b. 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg)
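Counting appearances to find the mode can be sketched in Python. The helper `modes` is hypothetical; it returns an empty list when every value appears only once (no mode) and several values when the data are multimodal. The third data set is invented to show the bimodal case.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                       # every value appears once: no mode
    return [v for v, c in counts.items() if c == top]

example_a = [22, 66, 69, 70, 73]
example_b = [1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]
bimodal = [1, 1, 2, 2, 3]               # hypothetical two-mode data

print(modes(example_a), modes(example_b), modes(bimodal))
```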


Mode cont’d…
NB: For grouped data, the mode is reported as the modal class, the class with the largest frequency.

Skewness:
• If extremely low or extremely high observations are present in a distribution, then the mean tends to shift towards those scores.
• Based on the type of skewness, distributions can be:
• Symmetrical distribution: data values are evenly distributed on both sides of the three measures of central tendency (mean, median and mode).
• It is neither positively nor negatively skewed. A curve is symmetrical if one half of the curve is the mirror image of the other half.
• If the distribution is symmetric and has only one mode, all three measures are the same, an example being the normal distribution.


Positively skewed distribution: occurs when the majority of scores are at the left end of the curve and a few extremely large scores are scattered at the right end.
• For positively skewed distributions (where the upper, or right, tail of the distribution is longer than the lower, or left, tail) the measures are ordered as follows:
Mode < median < mean.


Negatively skewed distribution: occurs when the majority of scores are at the right end of the curve and a few small scores are scattered at the left end.
For negatively skewed distributions (where the left tail of the distribution is longer than the right tail), the reverse ordering occurs:
Mean < median < mode.


Measures of Dispersion/ Variation
• Measures of dispersion or variability give us information about the spread of the scores in our distribution.
• Without knowing something about how the data are dispersed, measures of central tendency may be misleading.
• The most common measures of dispersion include:
Range,
Inter-quartile range,
Variance,
Standard deviation and
Coefficient of variation.

Measures of Dispersion/ Variation
• Consider the following three datasets:
Dataset 1: 7, 7, 7, 7, 7, 7; mean = 7, s.d. = 0
Dataset 2: 6, 7, 7, 7, 7, 8; mean = 7, s.d. = 0.63
Dataset 3: 3, 2, 7, 8, 9, 13; mean = 7, s.d. = 4.05
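The point of the three datasets (same mean, very different spread) can be verified with Python's standard `statistics` module. This is a small illustrative check; `statistics.stdev` computes the sample standard deviation with the n - 1 divisor.

```python
import statistics

datasets = {
    "Dataset 1": [7, 7, 7, 7, 7, 7],
    "Dataset 2": [6, 7, 7, 7, 7, 8],
    "Dataset 3": [3, 2, 7, 8, 9, 13],
}

# (mean, sample standard deviation rounded to 2 decimals) for each dataset
summary = {
    name: (statistics.mean(values), round(statistics.stdev(values), 2))
    for name, values in datasets.items()
}
for name, (mean, sd) in summary.items():
    print(name, "mean =", mean, "s.d. =", sd)
```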

Measures of Dispersion cont’d…
• RANGE: the difference between the largest and smallest observations in the data.
EXAMPLE: Consider the data on the weight (in kg) of 10 newborn children at University of Gondar hospital within a month:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43.

The range for the dataset can be computed by first arranging all observations in ascending order:
1.98, 2.02, 2.33, 2.33, 2.43, 2.51, 2.88, 2.98, 3.01, 3.25.
Range = Maximum - Minimum = 3.25 - 1.98 = 1.27 kg
• Because it is based on the two extreme cases in the entire distribution, the range may change considerably if either extreme case drops out, while the removal of any other case would not affect it at all.
• It wastes information, since it takes no account of the rest of the data.

• The extreme values may be unreliable; that is, they are the most likely to be faulty.
• It is not suitable for the mathematical treatment required in deriving the techniques of statistical inference.

Quantiles
The pth percentile is the value Vp such that p percent of the sample points are less than or equal to Vp.
The median, being the 50th percentile, is a special case of a quantile.
As was the case for the median, a different definition is needed for the pth percentile, depending on whether np/100 is an integer or not.

Quantiles cont’d …
With the sample ordered from smallest to largest, the pth percentile is defined by:
1. The (k+1)th ordered sample point if np/100 is not an integer (where k is the largest integer less than np/100)
2. The average of the (np/100)th and (np/100 + 1)th ordered observations if np/100 is an integer.

Quantiles cont’d …
Example 1: Compute the 10th and 90th percentiles for the birth weight data below.
Suppose the sample consists of birth weights (in grams) of all liveborn infants born at a private hospital in a city during a 1-week period.
3265, 3323, 2581, 2759, 3260, 3649, 2841
3248, 3245, 3200, 3609, 3314, 3484, 3031
2838, 3101, 4146, 2069, 3541, 2834

Sorting the data from smallest to largest:
2069 2581 2759 2834 2838 2841 3031 3101 3200
3245 3248 3260 3265 3314 3323 3484 3541 3609
3649 4146
Solution: Since 20×0.1 = 2 and 20×0.9 = 18 are integers, the 10th and 90th percentiles are defined by:

10th percentile = the average of the 2nd and 3rd values = (2581+2759)/2 = 2670 g
90th percentile = the average of the 18th and 19th values = (3609+3649)/2 = 3629 g
We would estimate that 80 percent of birth weights fall between 2670 g and 3629 g, which gives us an overall feel for the spread of the distribution.
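The two-case percentile definition used in this example can be written as a short Python function. This is a sketch of that specific definition (other software may use different interpolation rules); the name `percentile` is a hypothetical helper and p is assumed to lie strictly between 0 and 100.

```python
def percentile(values, p):
    """pth percentile using the two-case definition in these notes."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    data = sorted(values)
    n = len(data)
    k, remainder = divmod(n * p, 100)
    k = int(k)
    if remainder == 0:
        # np/100 is an integer: average the kth and (k+1)th ordered values
        return (data[k - 1] + data[k]) / 2
    # otherwise take the (k+1)th ordered value (index k when counting from 0)
    return data[k]

birth_weights = [3265, 3323, 2581, 2759, 3260, 3649, 2841,
                 3248, 3245, 3200, 3609, 3314, 3484, 3031,
                 2838, 3101, 4146, 2069, 3541, 2834]
p10 = percentile(birth_weights, 10)
p90 = percentile(birth_weights, 90)
print(p10, p90)  # 2670.0 3629.0
```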

• Quartiles: quantiles that divide the distribution into four equal parts. The second quartile is the median.
• The interquartile range (IQR): the difference between the third and the first quartiles (Q3 - Q1).
• To compute it, we first sort the data in ascending order, then find the first quartile (the value below which the first quarter of the observations lie) and then the third quartile.

Example 2:
Given the following data set (ages of patients), find the interquartile range.
18, 59, 24, 42, 21, 23, 24, 32
1. Sort the data from lowest to highest:
18 21 23 24 24 32 42 59
2. Find the bottom and the top quarters of the data
3. Find the difference (interquartile range) between the two quartiles.

• 1st quartile = the {(n+1)/4}th observation = the (2.25)th observation = 21 + (23-21) × 0.25 = 21.5
• 3rd quartile = the {3(n+1)/4}th observation = the (6.75)th observation = 32 + (42-32) × 0.75 = 39.5
Hence, IQR = 39.5 - 21.5 = 18
• The interquartile range is preferable to the range because it is less prone to distortion by a single large or small value; that is, outliers in the data have little effect on it. Also, it can be computed when the distribution has open-ended classes.
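The quartile positions (n+1)/4 and 3(n+1)/4 with linear interpolation between neighbouring observations can be sketched in Python. The function name `quartile` is hypothetical, and positions that fall outside the data are simply clamped to the smallest or largest value, which is an assumption of this sketch rather than part of the notes.

```python
def quartile(values, q):
    """qth quartile (q = 1, 2 or 3) at position q*(n+1)/4, interpolated."""
    data = sorted(values)
    n = len(data)
    position = q * (n + 1) / 4          # 1-based, may be fractional
    lower = int(position)
    fraction = position - lower
    if lower < 1:
        return data[0]                  # clamp (assumption of this sketch)
    if lower >= n:
        return data[-1]
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

ages = [18, 59, 24, 42, 21, 23, 24, 32]
q1 = quartile(ages, 1)
q3 = quartile(ages, 3)
iqr = q3 - q1
print(q1, q3, iqr)  # 21.5 39.5 18.0
```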

Box and Whisker plot
• Box plots summarize data using a five-number summary:
• The 25th percentile (first quartile), the median (second quartile), the 75th percentile (third quartile), and the minimum and maximum observed values that are not statistically outlying.
• The heavy black line inside each box marks the 50th percentile, or median, of the group distribution.
• The lower and upper hinges, or box boundaries, mark the 25th (Q1) and 75th (Q3) percentiles respectively.
• Whiskers appear above and below the hinges. They are vertical lines ending in horizontal lines at the largest and smallest observed values that are not statistical outliers.

Box and Whisker plot cont’d…
• Outliers are identified with an O: the outlier in yield 1 is labeled 1O, the one in yield 2 is labeled 17O and the one in yield 3 is labeled 58O.
• The labels 1, 3, 17 and 58 refer to the row number in the Data Editor where that observation is found.
• Extreme values are marked with an asterisk (*). In this case the extreme value in the first yield is labeled *3.

[Figure: box plots of score to survey question (vertical axis, 10.00 to 30.00) for group one, group two and group three (different groups having different training); observations 1, 3, 17 and 58 are flagged.]

Information obtained from a box and whisker plot
– If the median is near the center of the box, the distribution is approximately symmetric.
– If the median falls below the center of the box, the distribution is positively skewed.
– If the median falls above the center of the box, the distribution is negatively skewed.

Information obtained from a box and whisker plot cont’d …
– If the whiskers are about the same length, the distribution is approximately symmetric.
– If the top whisker is longer than the bottom whisker, the distribution is positively skewed.
– If the bottom whisker is longer than the top whisker, the distribution is negatively skewed.

Outlier
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
Before abnormal observations can be singled out, it is necessary to characterize normal observations.
Two activities are essential for characterizing a set of data:
Examination of the overall shape of the graphed data for important features, including symmetry and departures from assumptions.
Examination of the data for unusual observations that are far from the mass of data. These points are often referred to as outliers.

Outlier cont’d…
The following quantities (called fences) are needed for identifying extreme values in the tails of the distribution:
lower inner fence: Q1 - 1.5 × IQR
upper inner fence: Q3 + 1.5 × IQR
lower outer fence: Q1 - 3 × IQR
upper outer fence: Q3 + 3 × IQR
Where: Q1 = 1st quartile
Q3 = 3rd quartile
IQR = interquartile range
A point beyond an inner fence on either side is considered a mild outlier. A point beyond an outer fence is considered an extreme outlier.
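The fence rules can be illustrated with a small Python sketch. It reuses the quartiles from the age example in these notes (Q1 = 21.5, Q3 = 39.5); the function names `fences` and `classify` and the test values 70 and 100 are hypothetical, added only for illustration.

```python
def fences(q1, q3):
    """Inner and outer fences computed from the quartiles."""
    iqr = q3 - q1
    return {
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "lower_outer": q1 - 3 * iqr,
        "upper_outer": q3 + 3 * iqr,
    }

def classify(value, f):
    """Label a value by comparing it with the fences."""
    if value < f["lower_outer"] or value > f["upper_outer"]:
        return "extreme outlier"
    if value < f["lower_inner"] or value > f["upper_inner"]:
        return "mild outlier"
    return "not an outlier"

age_fences = fences(21.5, 39.5)
print(age_fences)
for age in [59, 70, 100]:   # 59 is in the example data; 70 and 100 are made up
    print(age, classify(age, age_fences))
```

With these fences, the largest observed age (59) lies inside the upper inner fence of 66.5, so the example data contain no outliers.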

Standard Deviation and Variance
• Variance:
While the inter-quartile range avoids the problem of outliers, it creates another problem in that it ignores half of the data.
The solution to both problems is to measure variability from the center of the distribution.
Variance measures how far, on average, scores deviate or differ from the mean.
Variance is the average of the squared distances of each value from the mean.

Mathematically, the formula for the population variance is defined as:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
The mathematical formula for the sample variance is defined as:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Variance cont’d…
Short cut formula for the sample variance:
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$

Standard Deviation
• The sample and population standard deviations are denoted by s and σ (by convention) respectively.
• The standard deviation, S.D., is just the positive square root of the variance.
• It expresses exactly the same information as the variance, but re-scaled to be in the same units as the mean.
• Mathematically, the population standard deviation is:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Standard Deviation cont’d…
• The sample standard deviation can be defined as:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
• Example 1: The areas of sprayable surfaces with DDT from a sample of 15 houses are measured as follows (in m²):
101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145

Example 1 cont’d …
• Find the variance and standard deviation of the above distribution.
• Solution
The mean of the sample is 125 m².
Variance (sample) = s² = Σ(xi - x̄)²/(n-1) = {(101-125)² + (105-125)² + … + (145-125)²} / (15-1)
= 2502/14
= 178.71 m⁴
Hence, the standard deviation = √178.71 = 13.37 m²
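The worked example can be reproduced in Python, using both the definitional formula and the short cut formula for the sample variance. This is a plain-Python sketch with variable names chosen for illustration.

```python
import math

areas = [101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135,
         136, 137, 140, 145]
n = len(areas)
mean = sum(areas) / n

# Definitional formula: sum of squared deviations divided by n - 1
sum_sq_dev = sum((x - mean) ** 2 for x in areas)
var_definition = sum_sq_dev / (n - 1)

# Short cut formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)
var_shortcut = (sum(x * x for x in areas) - sum(areas) ** 2 / n) / (n - 1)

sd = math.sqrt(var_definition)
print(mean, sum_sq_dev, round(var_definition, 2), round(sd, 2))
```

Both formulas give the same variance; the short cut form avoids computing each deviation from the mean.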

Variance for grouped frequency distribution
• In a grouped frequency distribution, the variance is computed as:
$$s^2 = \frac{n\sum_{i=1}^{k} f_i x_{ci}^2 - \left(\sum_{i=1}^{k} f_i x_{ci}\right)^2}{n(n-1)}$$
• Where:
fi = frequency of the ith class
xci = class mark of the ith class
n = total number of observations in the sample

Example of Variance for grouped frequency distribution
• Consider the following data on time spent by college students on leisure activities. Compute the standard deviation.

$$s = \sqrt{\frac{n\sum_{i=1}^{k} f_i x_{ci}^2 - \left(\sum_{i=1}^{k} f_i x_{ci}\right)^2}{n(n-1)}}$$
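The grouped formula can be applied in Python as sketched below. Because the leisure-time data for this example are not reproduced in these notes, the sketch instead uses the class marks and frequencies of the age-of-women distribution presented earlier; that substitution and the function name `grouped_variance` are assumptions made for illustration.

```python
import math

def grouped_variance(class_marks, frequencies):
    """Sample variance of grouped data from class marks and frequencies."""
    n = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(class_marks, frequencies))
    sum_fx2 = sum(f * x * x for x, f in zip(class_marks, frequencies))
    return (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))

marks = [16.5, 20.5, 24.5, 28.5, 32.5, 36.5, 40.5, 44.5, 48.5]
freq = [15, 49, 51, 40, 21, 22, 18, 15, 9]

variance = grouped_variance(marks, freq)
sd = math.sqrt(variance)
print(round(variance, 2), round(sd, 2))
```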

Coefficient of variation
• The standard deviation is an absolute measure of the deviation of observations around their mean and is expressed in the same units as the data.
• Because of this, the standard deviation cannot be used directly to compare variability between datasets.
• The coefficient of variation is often used for this purpose.
• The coefficient of variation (CV) is defined by:
$$CV = \frac{s}{\bar{X}}$$
(often multiplied by 100 and expressed as a percentage)
• The coefficient of variation is most useful in comparing the variability of several different samples, each with a different mean.

Coefficient of variation cont’d…
• CV is a relative measure, free from the unit of measurement.
Example:
Weights of newborn elephants (kg): 929, 878, 895, 937, 801, 853, 939, 972, 841, 826
n = 10, X̄ = 887.1, s = 56.50, CV = 0.0637
Weights of newborn mice (kg): 0.72, 0.63, 0.59, 0.79, 1.06, 0.42, 0.31, 0.38, 0.96, 0.89
n = 10, X̄ = 0.68, s = 0.255, CV = 0.375
Mice show greater birth-weight variation.
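The comparison can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the helper name `coefficient_of_variation` is hypothetical. Computed from the unrounded mean, the CV for the mice comes out near 0.378, slightly above the 0.375 obtained when the mean is first rounded to 0.68.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unit-free)."""
    return statistics.stdev(values) / statistics.mean(values)

elephants = [929, 878, 895, 937, 801, 853, 939, 972, 841, 826]
mice = [0.72, 0.63, 0.59, 0.79, 1.06, 0.42, 0.31, 0.38, 0.96, 0.89]

cv_elephants = coefficient_of_variation(elephants)
cv_mice = coefficient_of_variation(mice)
print(round(cv_elephants, 4), round(cv_mice, 3))
```

Although the elephants have by far the larger standard deviation in absolute terms, their CV is much smaller, which is the sense in which the mice show greater birth-weight variation.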

When to use the coefficient of variation
• When comparison groups have very different means (CV is suitable as it expresses the standard deviation relative to its corresponding mean)
• When different units of measurement are involved, e.g. group 1 unit is mm and group 2 unit is gm (CV is suitable for comparison as it is unit-free)
• In such cases, the standard deviation should not be used for comparison
