1.
Chapter two
DATA ORGANIZATION AND
PRESENTATION
Mengistu Y. (BSC, MPH-HI)
2017
1
4/17/2023
2.
Learning objectives
At the end of this section students are expected to:
• understand the nature of data
• organize and present data according to the need of
the activity
• present data in tables and graphs for
information use.
3.
Data organization and presentation
• Statistics is used to organize and interpret research
observations and findings.
• Before interpretation & communication of the
findings, the raw data must be organized and
presented in a clear and understandable way.
• Techniques are used to organize and summarize a set of
data in a concise way:
– Organization of data
– Summarization of data
– Presentation of data
4.
Cont...
• Numbers that have not been summarized and
organized are called raw data
• Descriptive statistics include tables, graphical
or chart displays, and the calculation of summary
measures such as means and proportions.
• The methods of describing variables differ
depending on the type of data (Numerical or
Categorical).
5.
Organizing data
Categorical data
• Table of frequency
distributions
– Frequency
– Relative frequency
– Cumulative frequencies
• Graphs
– Bar charts
– Pie charts
Continuous or discrete data
• Frequency distribution
• Summary measures
• Graphs
– Histograms
– Frequency polygons
– Cumulative frequency polygons
– Stem-and-leaf plots
– Box-and-whisker plots
– Scatter plots
6.
Frequency distributions
• A frequency distribution is a presentation of the
number of times (or the frequency) that each value (or
group of values) occurs in the study population.
• Ordered array: A simple arrangement of individual
observations in order of magnitude.
• A simple and effective way of summarizing categorical
data is to construct a frequency distribution table.
• This is done by counting the number of observations
falling into each of the categories, or levels of the
variables.
• Consider, for example, the variable birth weight with
levels ‘Very low’, ‘Low’, ‘Normal’ and ‘Big’.
7.
Relative Frequency
• Sometimes it is useful to compute the
proportion, or percentage, of observations in
each category.
• The distribution of proportions is called the
relative frequency distribution of the variable.
• Given a total number of observations, the
relative frequency distribution is easily derived
from the frequency distribution.
8.
Cumulative frequency
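• Two other distributions, the cumulative frequency and the
cumulative relative frequency, are particularly useful for
describing ordinal data.
• They are not meaningful for nominal data, whose categories
have no order. E.g. you would never say that 70% of
observations are below the colour blue.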
• The cumulative frequency is the number of
observations in the category plus observations in
all categories smaller than it.
• Cumulative relative frequency is the
proportion of observations in the category plus
observations in all categories smaller than it, and
is obtained by dividing the cumulative frequency
by the total number of observations.
9.
Table 2. Distribution of birth weight of newborns
between 1976 and 1996 at TAH.
BWT        Freq.   Rel. freq. (%)   Cum. freq.   Cum. rel. freq. (%)
Very low      43        0.4              43            0.4
Low          793        8.0             836            8.4
Normal      8870       88.9            9706           97.3
Big          268        2.7            9974          100
Total       9974      100
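
The relative and cumulative columns of Table 2 follow from the raw counts alone. The Python sketch below is only an illustration of that arithmetic; the function name `frequency_table` is an assumption made here, not something defined in the slides.

```python
from itertools import accumulate


def frequency_table(counts):
    """Return rows of (category, freq, rel. freq %, cum. freq, cum. rel. freq %).

    `counts` maps each category to its frequency. The categories must be
    given in their natural (ordinal) order, otherwise the cumulative
    columns carry no meaning.
    """
    total = sum(counts.values())
    cumulative = list(accumulate(counts.values()))
    rows = []
    for (category, freq), cum in zip(counts.items(), cumulative):
        rows.append((
            category,
            freq,
            round(100 * freq / total, 1),   # relative frequency in percent
            cum,                            # cumulative frequency
            round(100 * cum / total, 1),    # cumulative relative frequency in percent
        ))
    return rows


# Counts taken from Table 2 (birth weight of newborns at TAH).
bwt_counts = {"Very low": 43, "Low": 793, "Normal": 8870, "Big": 268}
bwt_rows = frequency_table(bwt_counts)
for row in bwt_rows:
    print(row)
```

Each cumulative value is the frequency of the category plus the frequencies of all categories below it, which is why the ordering of the dictionary matters.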
10.
Frequency distribution for numerical data
• Once the data are in an ordered array, further useful
summarization may be achieved by grouping the data.
• To group a set of observations we select a set of
continuous, non-overlapping intervals such
that each value in the set of observations can be
placed in one, and only one, of the intervals.
• These intervals are usually referred to as class
intervals.
11.
• One of the first considerations when
data are to be grouped is how many
intervals to include.
• The question is how best to
organize such data, especially when
the data set is too large to be
taken in by eye.
15.
Example of categorized data of age
16.
How to calculate class interval?
To determine the number of class intervals and the
corresponding width, we use Sturges' rule:
$K = 1 + 3.322 \log_{10} n$
$W = \dfrac{L - S}{K}$
where
K = number of class intervals
n = number of observations
W = width of the class interval
L = the largest value
S = the smallest value
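
As a quick check of the two formulas, the Python sketch below evaluates them for the leisure-time example used in this chapter (n = 80, largest value 38, smallest value 10). The function names are assumptions chosen for illustration, and the logarithm is taken to base 10, which is what reproduces the chapter's value of 7.32.

```python
import math


def sturges_classes(n):
    """Number of class intervals K = 1 + 3.322 * log10(n), before rounding."""
    return 1 + 3.322 * math.log10(n)


def class_width(largest, smallest, k):
    """Class width W = (L - S) / K."""
    return (largest - smallest) / k


# Values from the leisure-time example: 80 students, range 10 to 38 hours.
k_exact = sturges_classes(80)
k = round(k_exact)
w = class_width(38, 10, k)
print(f"K = {k_exact:.2f}, rounded to {k} classes; W = {w}")
```

In practice K is rounded to a whole number, and W is often rounded further to a convenient width.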
17.
Example
• Construct a grouped frequency
distribution of the following data on the
amount of time (in hours) that 80 college
students devoted to leisure activities
during a typical school week:
19.
The amount of time (in hours) that 80 college students devoted to leisure activities
during a typical school week
• Using the above formula,
K = 1 + 3.322 log (80)
= 7.32 ≈ 7 classes
• Maximum value = 38 and minimum value = 10
• W = Range/K = (38 – 10)/7 = 28/7 = 4
• Using a width of 5 (a common rule of thumb), we
can construct a grouped frequency distribution
for the above data.
21.
Mid-point and True-limits
Mid-point (class mark): the value of the interval
that lies midway between the lower and the upper
limits of a class.
True limits (class boundaries): those limits
that make an interval of a continuous variable
continuous in both directions.
They are used to smooth the class intervals:
subtract 0.5 from the lower limit and add 0.5 to the upper
limit.
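
The two definitions can be sketched in Python as follows. The function names and the `unit` parameter are assumptions added for illustration; the class limits are borrowed from the age-at-marriage example in this chapter, where values are recorded in whole years.

```python
def class_boundaries(lower, upper, unit=1):
    """True limits of a class whose limits are recorded to the nearest `unit`.

    Half a unit is subtracted from the lower limit and added to the upper
    limit (0.5 when the data are whole numbers).
    """
    half = unit / 2
    return lower - half, upper + half


def class_midpoint(lower, upper):
    """Class mark: the value midway between the lower and upper limits."""
    return (lower + upper) / 2


# Class limits from the age-at-marriage example (ages in whole years).
limits = [(15, 19), (20, 24), (25, 29)]
boundaries = [class_boundaries(lo, hi) for lo, hi in limits]
midpoints = [class_midpoint(lo, hi) for lo, hi in limits]
print(boundaries)
print(midpoints)
```

With the boundaries applied, the upper boundary of one class coincides with the lower boundary of the next, so the intervals leave no gaps.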
22.
Contd…
• Note: In the construction of a cumulative
frequency distribution, if we cumulate
from the lowest value of the variable to the highest,
the resulting frequency distribution is called the
`less than cumulative frequency distribution',
and if we cumulate from the highest to the
lowest value the resulting frequency distribution
is called the `more than cumulative frequency
distribution'. The most common cumulative
frequency is the less than cumulative frequency.
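
The difference between the two directions of cumulation can be seen in a short Python sketch. The function names are assumptions made for illustration; the frequencies are those of the age-at-marriage example in this chapter (classes 15-19 up to 45-49).

```python
from itertools import accumulate


def less_than_cumulative(freqs):
    """Cumulate from the lowest class to the highest."""
    return list(accumulate(freqs))


def more_than_cumulative(freqs):
    """Cumulate from the highest class to the lowest."""
    return list(accumulate(reversed(freqs)))[::-1]


# Number of women per age group, lowest class first.
marriage_freqs = [11, 36, 28, 13, 7, 3, 2]
less_than = less_than_cumulative(marriage_freqs)
more_than = more_than_cumulative(marriage_freqs)
print(less_than)
print(more_than)
```

The last entry of the less-than series and the first entry of the more-than series both equal the total number of observations.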
24.
• Class interval (class width): the length of a class,
given by the difference between its class
boundaries. For the 1st class, the interval is 5.
• Note: As the sample size increases and the interval
width is reduced, the sample distribution increasingly
resembles the population distribution.
25.
– Class intervals should be continuous, non-overlapping,
mutually exclusive and exhaustive.
– Too few intervals result in loss of information.
– Too many intervals defeat the objective of
summarization.
– Class intervals generally should be of the same
width (sometimes this is impossible).
– Open-ended class intervals should be avoided.
26.
Exercise
• Construct a grouped frequency distribution and complete the
following table for the age of patients (years) in a diabetic
clinic in Addis Ababa, 2010.
27.
Age of patients (years) in a diabetic clinic in
Addis Ababa, 2010
Columns to complete:
Age group (years) | Class limit | Class boundary | Class mid-point | Tally | Freq. (fi) | Relative frequency, fraction (%) | Cumulative freq. (< method) | Cumulative freq. (> method) | Relative cum. freq. (< method) | Relative cum. freq. (> method)
Last row: Total
29.
Data table
Guidelines for constructing tables
• Keep them simple
• Limit the number of variables
• All tables should be self-explanatory
• Include a clear title telling what, where and
when
• Clearly label the rows and columns
30.
Contd…
• State clearly the unit of measurement used
• Explain codes and abbreviations in a footnote
• Show totals
• If the data are not original, indicate the source in
a footnote
31.
Graphical presentation of data
• A variety of graph styles can be used to present
data.
• The most commonly used types of graph are pie
charts, bar diagrams, histograms, frequency
polygons and scatter diagrams.
• The purpose of using a graph is to tell others
about a set of data quickly, allowing them to
grasp the important characteristics of the data.
• In other words, graphs are visual aids to rapid
understanding.
32.
Importance of graphs
• Diagrams have greater attraction than mere
figures.
• They please the eye, add a spark of
interest and so catch the attention.
• They help in deriving the required
information in less time and without
mental strain.
• They are easier to remember than
mere figures.
• They facilitate comparison.
33.
Bar charts
• Bar chart: displays the frequency distribution for
nominal or ordinal data.
• In a bar chart the various categories into which the
observations fall are represented along the horizontal axis,
and a vertical bar is drawn above each category such that
the height of the bar represents either the frequency
or the relative frequency of observations within the
class.
• The vertical axis should always start from 0, but the
horizontal axis can start from anywhere.
• The bars should be of equal width and should be
separated from one another so as not to imply
continuity.
34.
Figure 1. Bar charts showing the frequency distribution (Freq.) and the
relative frequency distribution (Rel. freq.) of the variable ‘BWT’
(categories: Very low, Low, Normal, Big).
35.
Bar charts for comparison
• Multiple bar chart: In order to compare the
distribution of a variable for two or more
groups, bars are often drawn alongside each
other for the groups being compared in a single bar
chart.
• Subdivided bar chart: If there are different
quantities forming the subdivisions of the
totals, simple bars may be subdivided in the
ratio of the various subdivisions to exhibit the
relationship of the parts to the whole.
36.
Fig 2. Bar chart indicating categories of birth weight (Low, Normal, Big;
vertical axis in percent) of 9975 newborns grouped by antenatal follow-up
of the mothers (Yes/No). Data labels shown in the chart: 9, 88.9, 2.1 and
7.9, 89, 3.1.
37.
Example: Plasmodium species distribution for confirmed
malaria cases, Zeway, 2003
38.
Pie chart
• Pie chart: displays the frequency
distribution for nominal or ordinal data.
• In a pie chart the various categories into
which the observations fall are represented
by sectors of a circle.
• Each sector represents either the
frequency or the relative frequency of
observations within the class; its angle
is proportional to the frequency or the
relative frequency.
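
Because the sector angles are proportional to the frequencies, each angle is the relative frequency multiplied by 360 degrees. The Python sketch below shows this arithmetic for the birth-weight counts of Table 2; the function name `sector_angles` is an assumption made for illustration.

```python
def sector_angles(counts):
    """Angle in degrees of each pie sector: 360 * frequency / total."""
    total = sum(counts.values())
    return {category: 360 * freq / total for category, freq in counts.items()}


# Counts taken from Table 2 (birth weight of newborns at TAH).
bwt_counts = {"Very low": 43, "Low": 793, "Normal": 8870, "Big": 268}
angles = sector_angles(bwt_counts)
for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```

The same angles result whether frequencies or relative frequencies are used, since both are divided by their own total.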
39.
Figure 3. Pie charts showing the frequency distribution of
the variable ‘BWT’.
Fig 3(a) Pie chart indicating the frequency of categories
of birth weight (Very low 43, Low 793, Normal 8870, Big 268).
Fig 3(b) Pie chart indicating the relative frequency (%) of
categories of birth weight (Very low 0.4, Low 8, Normal 88.9, Big 2.7).
40.
Histogram
• A histogram is a frequency distribution with
continuous class intervals that has been turned into a
graph.
• Given a set of numerical data, we can obtain an
impression of the shape of its distribution by
constructing a histogram.
• A histogram is constructed by choosing a set of
non-overlapping intervals (class intervals) and
counting the number of observations that fall in
each class.
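
The counting step can be sketched in Python as below. The function name and the choice of classes (starting at 20, width 20) are assumptions made only for illustration; the twelve values are the raw data of the box-plot example in this chapter. Each class is treated as the half-open interval [lower, upper), so every observation falls in exactly one class.

```python
def tally_classes(values, start, width, n_classes):
    """Count the values falling in each class [lower, lower + width)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside the class intervals")
        counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Raw data from the box-plot example; the class choice is hypothetical.
raw_values = [35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58]
classes = tally_classes(raw_values, start=20, width=20, n_classes=5)
for lower, upper, count in classes:
    print(f"{lower} to under {upper}: {count}")
```

The resulting counts are the bar heights of the histogram for that choice of class intervals.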
41.
Histograms cont…
• The number of observations in each class
is called the frequency. Hence a histogram
is a graph of a frequency distribution.
• It is necessary that the class intervals be
non-overlapping so that each observation
falls in one and only one interval.
42.
Histograms cont…
• Except possibly for the first and last classes, class intervals
are usually chosen to be of equal width. If this
is not the case, the histogram could give a
misleading impression of the shape of the data.
• In drawing the histogram, smoothing the
class intervals is an important step: we
subtract 0.5 from the lower limit and add 0.5 to the
upper limit of each interval.
43.
Example
Distribution of the age of women at the time of
marriage
Age group No. of women
15-19 11
20-24 36
25-29 28
30-34 13
35-39 7
40-44 3
45-49 2
44.
Figure: Histogram of the age of women at the time of marriage
(horizontal axis: age group, with class boundaries 14.5-19.5, 19.5-24.5,
24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5; vertical axis:
no. of women).
45.
Fig 5. A histogram displaying the frequency distribution of birth
weight of newborns at Tikur Anbessa Hospital (horizontal axis: birth
weight; Mean = 3126, Std. Dev = 502.34, N = 9975).
46.
Frequency polygons
• Instead of drawing bars for each class interval,
sometimes a single point is drawn at the mid-point
of each class interval and consecutive
points are joined by straight lines.
• Graphs drawn in this way are called frequency
polygons.
• Frequency polygons are superior to histograms
for comparing two or more sets of data.
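
The points of a frequency polygon can be listed with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, the classes are taken to be of equal width, and the polygon is closed with a zero-frequency point one class width beyond each end, a common drawing convention that the slides do not mention. The data are the age-at-marriage frequencies from this chapter.

```python
def polygon_points(limits, freqs):
    """(mid-point, frequency) pairs, closed with a zero point at each end.

    Assumes at least two classes of equal width, so that the spacing
    between consecutive mid-points is constant.
    """
    mids = [(lo + hi) / 2 for lo, hi in limits]
    spacing = mids[1] - mids[0]
    points = list(zip(mids, freqs))
    return [(mids[0] - spacing, 0)] + points + [(mids[-1] + spacing, 0)]


# Age groups and number of women from the age-at-marriage example.
age_limits = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]
marriage_freqs = [11, 36, 28, 13, 7, 3, 2]
points = polygon_points(age_limits, marriage_freqs)
print(points)
```

Joining these points in order with straight lines gives the polygon; a second group can be overlaid on the same axes for comparison.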
47.
Fig. 6. Frequency polygon of birth weight of 9975 newborns at Tikur
Anbessa Hospital for males and females (horizontal axis: birth weight;
vertical axis: %; series: Males, Females).
48.
Box and Whisker Plot
It is another way to display information when
the objective is to illustrate certain locations
(and the skewness) of the distribution.
It can be used to display a set of discrete or
continuous observations using a single vertical
axis; only certain summaries of the data are
shown.
49.
Box plot cont...
A box is drawn with the top of the box at the third
quartile (75%) and the bottom at the first quartile
(25%).
The location of the mid-point (50%) of the
distribution is indicated with a horizontal line in the
box.
Finally, straight lines, or whiskers, are drawn from the
centre of the top of the box to the largest observation
and from the centre of the bottom of the box to the
smallest observation.
50.
Box cont....
The box plot is then completed as follows:
Draw a vertical bar from the upper quartile to
the largest non-outlying value in the sample.
Draw a vertical bar from the lower quartile to the
smallest non-outlying value in the sample.
Any values that lie outside the box (the IQR) but are not
outliers are covered by the whiskers on the plot
(IQR = P75 – P25).
51.
Box plots are useful for comparing two or
more groups of observations
52.
Drawing a box-and-whisker plot
Raw data
35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58
Order the data
29 34 35 39 41 44 50 54 58 64 72 104
Median = (44 + 50)/2 = 47 = Q2
Q1 = (35 + 39)/2 = 37
Q3 = (58 + 64)/2 = 61
Min = 29, Max = 104
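
The summary values above can be reproduced with a short Python sketch. Two assumptions are made here and are not stated in the slides: quartiles are computed as the medians of the lower and upper halves of the ordered data (which matches Q1 = 37 and Q3 = 61), and outliers are defined by the common 1.5 × IQR convention. The function names are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd number of values the middle one belongs to neither half.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return min(data), median(lower), median(data), median(upper), max(data)


def whisker_ends(values, k=1.5):
    """Smallest and largest non-outlying values (fences at k * IQR, assumed 1.5)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    inside = [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]
    return min(inside), max(inside)


raw = [35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58]
summary = five_number_summary(raw)
whiskers = whisker_ends(raw)
print(summary)
print(whiskers)
```

Under the 1.5 × IQR convention the value 104 lies beyond the upper fence, so the upper whisker would stop at 72 and 104 would be drawn as a separate point. Other quartile conventions can give slightly different values of Q1 and Q3.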
54.
Scatter plot
Most studies in medicine involve measuring
more than one characteristic, and graphs
displaying the relationship between two
characteristics are common in the literature.
When both variables are qualitative,
we can use a multiple bar graph.
When one of the characteristics is qualitative
and the other is quantitative, the data can be
displayed in box-and-whisker plots.
55.
Scatter plot ….
For two quantitative variables we use bivariate
plots (also called scatter plots or scatter
diagrams).
A scatter plot is used to see whether a relationship exists
between the two measures.
A scatter diagram is constructed by drawing
X- and Y-axes.
Each point or dot
represents a pair of values measured for a single
study subject.
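
The direction and strength of a linear relationship seen in a scatter plot can also be summarized numerically. The Python sketch below uses Pearson's correlation coefficient, which the slides do not define; it is brought in here only as one common summary. The paired values are made up for illustration and are not data from this chapter.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical pairs: one (x, y) pair per study subject.
hours_of_training = [2, 4, 6, 8, 10]
accidents = [50, 42, 30, 22, 10]
r = pearson_r(hours_of_training, accidents)
print(f"r = {r:.3f}")
```

A value of r below zero corresponds to a cloud of points sloping downward, a value above zero to one sloping upward, and a value near zero to no linear relationship.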
56.
Scatter Plots and Types of Correlation
Figure: Negative correlation (as x increases, y decreases).
x = hours of training
y = number of accidents
57.
Scatter Plots and Types of Correlation
Figure: Positive correlation (as x increases, y increases).
x = SAT score (Math SAT)
y = GPA
58.
Scatter Plots and Types of Correlation
Figure: No linear correlation.
x = height
y = IQ
59.
Scatter Diagram…
1. Direction of Relationship
Positive or negative (two panels, each plotting Y against X)
60.
2. Form of Relationship
Linear or curvilinear (two panels, each plotting Y against X)
61.
3. Degree of Relationship
Strong or weak (two panels, each plotting Y against X)
62.
Line graph
Useful for assessing the trend of a particular situation
over time, e.g. monitoring the trend of epidemics.
The time, in weeks, months or years, is marked along
the horizontal axis.
Values of the quantity being studied are marked on the
vertical axis.
Values for each category are connected by a continuous
line.
Sometimes two or more lines are drawn on the same
graph using the same scale so that the plotted lines
are comparable.
63.
Figure: No. of microscopically confirmed malaria cases by species and month
at Zeway malaria control unit, 2003 (horizontal axis: months, Jan to Dec;
vertical axis: no. of confirmed malaria cases; series: Positive,
P. falciparum, P. vivax).
Line graph cont..
The following graph shows the level of zidovudine
(AZT) in the blood of HIV/AIDS patients at
several times after administration of the drug,
for patients with normal fat absorption and with fat
malabsorption.
A line graph can also be used to depict the
relationship between two continuous
variables, like a scatter diagram.
65.
Line graph cont…..
Figure: Response to administration of zidovudine in two groups of AIDS
patients in hospital X, 1999 (horizontal axis: time since administration,
in minutes; vertical axis: blood zidovudine concentration; series: fat
malabsorption, normal fat absorption).
66.
Choosing graphs
Type of data or purpose   Appropriate graphs
Metric/Numerical          - Histogram (one continuous var.)
                          - Frequency polygon (one or more cont. var.)
                          - Cumulative frequency polygon (ogive curve)
                          - Box and whisker (one cont. and one cat. var.)
                          - Stem and leaf (one cont. var.)
                          - Scatter (two cont. var.)
Categorical               - Bar (one or more cat. var.) (simple/multiple)
                          - Pie (one cat. var.)
Trend                     - Line (one cont. and one cat. var., or two cont. var.)