Mbarara University of Science and Technology
Faculty of Medicine, Department of Pharmacy
P.O. Box 1410, Mbarara, Uganda, http://www.must.ac.ug
1
Biostatistics
&
Demography
Summarizing and presenting data
Edward .J. LUKYAMUZI (MPS)
Formulating researchable questions
• EXERCISE: WORKING IN GROUPS OF 5 STUDENTS:
1. IDENTIFY A PROBLEM WHICH YOU THINK SHOULD BE
INVESTIGATED FURTHER OR WHICH NEEDS TO BE
SOLVED.
2. DESCRIBE THE PROBLEM USING THE 5 Ws Technique
3. USING ONE OF THE GUIDELINES, FORMULATE A
RESEARCH QUESTION
4. IDENTIFY THE MAIN VARIABLES FOR YOUR RESEARCH
5. IDENTIFY DATA SOURCES
6. CHOOSE THE APPROPRIATE DATA COLLECTION
METHOD(S)
Note: This exercise will continue to feed into future class discussions and assignments,
therefore individual participation is good for learning and teamwork .
5 Ws Technique to understand a problem
1. Who - Who does the problem affect? Who is experiencing
the problem? Specific groups, organizations, customers, etc.
2. What –What are the effects/impact the problem is causing?
(to individuals, families, communities, Institutions, MoH etc.)
- What will happen when it is fixed? - What would happen if
we didn’t solve the problem?
3. When - When does the issue occur? Is there a trend?
4. Where - Where is the issue occurring? Only in certain
locations, specific level of health facilities, in certain
processes, etc.
5. Why – Why does this problem occur? Why is it important
that we fix the problem?
Objectives…
• Explain how proportions and percentages are
calculated
• Explain two methods for graphically displaying the
distribution of categorical data.
• Explain the method of construction of a histogram;
describe the shape, centre, and spread
• Define a percentile, name the 3 important percentiles
and derive their values from a frequency table
• Define and describe the characteristics of the mean,
median, and mode, and identify when to use each.
• Define and describe the characteristics of standard
deviation, interquartile range and identify when to
use each
4
Recap…
5
Quantitative
Discrete
(count data)
Continuous
(Measured on a continuum)
Interval
Ratio
Categorical
Nominal Ordinal
Binary or
Dichotomous
Data types
Four scales of measurement…
Introduction…
• In public health and health research we are interested in describing
a group (of people or things e.t.c.) rather than an individual
person.
• We have noted the inherent variability in biological, socio-
economic, or behavioral processes.
• Knowledge and application of appropriate statistical methods
enable us to describe groups of people with varying experiences.
7
Summarizing and Presenting data : Methods and
tools
• Summarising categorical data
• Counts
• Proportions
• Percentages
• Rates, Ratios
• Summarizing quantitative data
• Measures of central tendency (Mode, median, mean)
• Measures of dispersion (minimum & maximum, IQR,
standard deviation)
• Graphical presentation of data
• Categorical data (bar chart, pie chart)
• Continuous data (histogram, box plot)
8
Summarising Categorical Data:
Frequency counts
• The frequency distribution tells us how many times different
values of a variable occur in a given sample or population.
Frequency distribution of a random sample of 200 new
students:
Example: Frequency distribution of PHA III students by sex
9
Sex Number of
students: counts
Males 36
Females 14
Total 50
Summarizing Categorical Data: Proportions
• Proportion
• Numerator is included in the denominator
(a / a+b) , where a+b=n
e.g. #students who are Males /Total number of students
Frequency Distribution of Year 1 Students by Sex
10
Sex Number of
students
Proportion
Males 108 0.54
Females 92 0.46
Total 200 1.0
Summarizing Categorical Data: Percentages
• Proportions are often converted into percentages for
ease of interpretation.
• Percentages are relative frequencies expressed per
100.
– Frequency% =(a/n)*100
Percentage of Students by Marital Status
11
Sex Number of
students
Proportion Percentage
Males 108 0.54 54.0
Females 92 0.46 46.0
Total 200 1.0 100.0
Other Basic Measures of Frequency:
Ratios
Ratios show the relative sizes of two events
Examples
• Number of doctors to population size
• Number of nurses to number of patients in a hospital
• Number of girls to number of boys
• Number of households with a bed net to number of
households without a bed net
12
Other Basic Measures of Frequency: Rates
A rate is a measure of frequency of occurrence of
events per unit time.
For example
- Number of disease events per year
- Number of cases of dysentery per week
• Numerator is number of events
• Denominator is total time of observation
• Time may be hours, days, weeks, months, years etc
13
Mbarara University of Science and Technology
Faculty of Medicine, Department of Pharmacy
P.O. Box 1410, Mbarara, Uganda, http://www.must.ac.ug
14
SUMMARIZING CONTINUOUS DATA
Distribution of continuous data
• Values of continuous data exist on a continuum.
• As the sample size increases, the number of unique values
are so large, that it is difficult to summarise such data by
making frequency counts of individual values
• See example of frequency distribution of total cholesterol
of 250 people.
15
Example: Frequency distribution of cholesterol
in mg/dl (n=250)
16
164 1 0.40 2.82
163 1 0.40 2.42
159 1 0.40 2.02
154 1 0.40 1.61
150 2 0.81 1.21
135 1 0.40 0.40
totchol Freq. Percent Cum.
464 1 0.40 100.00
332 1 0.40 99.60
326 1 0.40 99.19
325 1 0.40 98.79
320 2 0.81 98.39
. . . .
. . . .
Distribution of Continuous data
• Continuous data is first grouped into classes
comprising of successive ranges of values of the
variable of interest
• In each class/group, frequency counts are
determined
• Frequency distributions of continuous data have a
shape
17
Frequency distribution of cholesterol data,
Class width is 20 g/dl
18
Frequency distributions of continuous data have a shape; easier to see
on a graph
Total 248 100.00
455- 1 0.40 100.00
315- 5 2.02 99.60
295- 15 6.05 97.58
275- 19 7.66 91.53
255- 39 15.73 83.87
235- 52 20.97 68.15
215- 40 16.13 47.18
195- 35 14.11 31.05
175- 26 10.48 16.94
155- 12 4.84 6.45
135- 4 1.61 1.61
totcholcat Freq. Percent Cum.
Frequency distribution ( from Table above)
presented on a Graph (Histogram)
19
0
5
10
15
20
100 200 300 400 500
Total cholesterol, g/dl
Observe the
Shape,
Centre and
Spread of the data.
20
Tools for Visualising distribution of continuous
data: Histogram
0
5
10
15
20
100 200 300 400 500
Total cholesterol, g/dl
Does the centre
represents the Mean,
Median and Mode?
21
Recall: Frequency distribution of cholesterol in
mg/dl (n=250)
164 1 0.40 2.82
163 1 0.40 2.42
159 1 0.40 2.02
154 1 0.40 1.61
150 2 0.81 1.21
135 1 0.40 0.40
totchol Freq. Percent Cum.
464 1 0.40 100.00
332 1 0.40 99.60
326 1 0.40 99.19
325 1 0.40 98.79
320 2 0.81 98.39
. . . .
. . . .
Summarising the distribution of total
cholesterol data (Continuous data)
• Minimum is 135 mg/dl
• Maximum is 464 mg/dl
• Sample size is 250
• The complete frequency table is too long, making it
difficult to make meaning out of the frequency list.
• A histogram can help visualise the distribution
22
Creating a Histogram
 Determine the minimum and maximum values for the variable
of interest e.g. income
• Minimum is 135, Maximum is 464
• Determine the range (max-min=329)
 Determine the number of classes/groups you want to have e.g.
11.
 Determine the Class Interval width ≈Range divide by number
of classes ≈30 mg/dl
 Create mutually exclusive categories/classes of the original
(continuous) variable.
 Classes are also called bins .
 Start the first interval at a convenient value below the
minimum. 134.5 g/dl,
 Therefore first class will be 134.5 to 164.5 g
 Second class will be 164.5 to 194.5 g/dl
23
Creating a Histogram
 Determine the frequency counts and relative frequency
for each category/class
 To plot the graph:
• On the horizontal axis mark equally spaced values of the
lower boundary of each class
• On the vertical axis, the length represents the frequency
• Plot the frequencies for each class as bars.
• The height of the bar will be proportional to the
frequency of that class.
• The width of the bars is the same.
24
0
10
20
30
100 200 300 400 500
total cholesterol, mg/dl
25
Class width is ≈30 mg/dl
Is the shape of a
Histogram sensitive to
the number of classes?
26
0
5
10
15
20
Percent
100 200 300 400 500
total cholesterol, g/dl
Shape of a Histogram is
sensitive to the number
of classes
Class width is 16.45 mg/dl
Distribution of continuous data
• The shape of the frequency distribution can be
symmetrical or asymmetrical
• A symmetric distribution has the same shape on
both sides of the mean (the centre)
• If outlying values occur only in one direction, the
distribution is said to be skewed
• Normally distributed data has zero skewness
27
Shape of distribution of continuous data:
Symmetrical
28
• Symmetrical distribution has the same shape both sides of the
mean.
• Horizontal axis represents the X-variable while the Vertical
axis represents the frequencies.
Values of a continuous variable (X)
Frequency
Shape of distribution of continuous data:
Skewed to the right
29
Shape of distribution of continuous data:
Skewed to the Left
30
Mbarara University of Science and Technology
Faculty of Medicine, Department of Pharmacy
P.O. Box 1410, Mbarara, Uganda, http://www.must.ac.ug
31
MEASURES OF CENTRAL TENDENCY
Measures of central tendency
• The central tendency of a distribution is an estimate of
the "center" of a distribution of values.
• Enable us to describe the characteristics of a typical
member of a group of people.
• Three major types of estimates of central tendency:
• Mode
• Median
• Mean (Arithmetic mean)
• Other measures of central tendency
• Geometric mean
32
Mode
• The mode is the most frequently occurring value in the
set of observations
33
Using this age data,
45 19 23 10 16 21 25 17 21 18 15 18 21 13 16 23 21 24 18 19 26 20 21
19 20 25 26 20 23 8 23 18 24 16 30 24 15 22 27 20
What is the modal age
• In a given dataset, a variable may have more
than one mode (bimodal distribution)
– distribution with two different peaks
Median
• The Median is the score found at the exact middle of the
set of values
• Arrange observations in order of magnitude, the median is
the middle observation
• Median divides the set of observations into two equal
parts such that the number of values equal to or greater
than the median is equal to the number of values less than
or equal to the median
34
Consider the following age data:
45 19 23 10 16 21 25 17 21 18 15 18 21 13 16 23 21 24 18 19 26 20 21 19
20 25 26 20 23 8 23 18 24 16 30 24 15 22 27 20
- Is The median age of the 40 individuals is 20.5
years
The Arithmetic Mean
• The arithmetic mean – is the most popular measure of
central tendency
• Calculation of mean requires numerical data
• For a given variable, the mean is obtained by
1. Adding all values in the sample
2. Dividing the sum by the number of observations in the
sample (sample size)
• For a given set of data and variable, there is only one mean
35
Arithmetic mean- computation
• Calculating the arithmetic mean
• Formula
– where n is the sample size and xi is the random variable.
• The mean is affected by outliers(extreme but legitimate
values)
36
Arithmetic mean -Example
37
For this set of observations in a sample, calculate the mean.
45 19 23 10 16 21 25 17 21 18 15 18 21 13 16 23 21 24 18 19
26 20 21 19 20 25 26 20 23 8 23 18 24 16 30 24 15 22 27 20
Step 1: First obtain a sum:
45+ 19+ 23+ 10+ 16+ 21+ 25 +17 +21 +18 +15 + 18 + 21+ 13+ 16+ 23+ 21+
24+ 18+ 19+26+ 20+ 21 +19 +20 + 25 +26 + 20 +23 + 8 +23 +18 +24 +16
+30 +24 +15 +22+ 27 +20
Sample size (n)=40
Thus, arithmetic mean =sum/n =20.75
Step 2: Divide the sum by the sample size (n)
Note: The arithmetic mean is affected by outliers.
Summary of the Age data of 40 individuals:
• Median=20.5 years,
• Modal age=21 years
• Mean=20.75 years
• Based on these measures, what can you conclude
about the shape of the distribution of age in this
sample?
38
39
0
5
10
15
20
25
Percent
8 101214161820222426283032343638404244
Age(years)
Using a graph to see the distribution helps to identify key
features such as presence of outliers
Outliers
Outliers & Arithmetic Mean
• Outliers fall outside the general pattern of the distribution.
• The value of the arithmetic mean is sensitive to/affected by
outliers
• If the data contains outliers-
1. Find out if there is a problem with data entry.
2. Assess the extent to which the outlier affects the results:
for example calculate the mean with and without the
outlier
40
The Geometric Mean
• For a variable X with observations xi, the geometric mean
of a set of n observations is equal to the nth root of the
cross-product of the n observations.
• For a given data set, the geometric mean is less than or
equal to the arithmetic mean.
• Do not calculate the geometric mean if you have a
zero or negative value in your data set.
41
Geometric mean=
Application of Geometric Mean
• The geometric mean is a useful measure of central tendency
for highly skewed data e.g. gametocyte density, antibody
titers;
• Geometric mean is used to describe the central tendency of
data which is basically skewed BUT normally distributed on
a log-scale.
• Compared to the arithmetic mean, the geometric mean is
less affected by extreme values.
• The GM is appropriate for describing proportional growth,
e.g. annual population growth, average rate of return on
investment over a period of time.
42
Choice of appropriate measure of central
tendency
• The arithmetic mean is calculated for numerical data and for
symmetric or ≈ symmetric distributions
• The median is suitable for ordinal data or numerical data if
the distribution is skewed
• The mode is used to describe bimodal distributions (esp. in
disease-age/time distributions) e.g. seasonal
distribution of malaria
• For a symmetric distribution, the mean ≈ median ≈ mode
43
Mbarara University of Science and Technology
Faculty of Medicine, Department of Pharmacy
P.O. Box 1410, Mbarara, Uganda, http://www.must.ac.ug
Mbarara University of Science and Technology
Faculty of Medicine, Department of Pharmacy
P.O. Box 1410, Mbarara, Uganda, http://www.must.ac.ug
MEASURES OF VARIATION
45
Assessing Variation/Dispersion in Data
 Variation of data is also commonly referred to as dispersion.
 Dispersion refers to the spread of the values around the central
point.
 The starting point when assessing dispersion in a set of data is
to use visualisation tools such as the Histogram, the Box Plot,
Symmetry Plot, and Quantile plot.
 These tools enable the researcher to make a qualitative
description of the extent of variation observed in the data.
 Data variation is often quantified using measures such as the
range, percentiles, inter-quartile range, the standard deviation,
coefficient of variation , and standard error.
46
47
0
5
10
15
20
100 200 300 400 500
Total Cholesterol, mg/dl
Observe the
Spread of the data.
Tools for Visualising distribution of continuous
data: Histogram
Starts at 135 mg/dl; Class width is 21.93 mg/dl
The graph (Histogram)
helps to see the spread
of the data
Alternative methods for viewing the
distribution of continuous data
• Although the histogram is a popular tool for displaying
visualising the distribution of continuous data, histograms are
sensitive to the number of bins used in their construction.
• They can be inaccurate in informing researchers about the
nature of the distribution of data.
• Other approaches to understanding the distribution of
continuous data include: Box Plot, Symmetry Plot, Quantile
plot etc
48
Tools for Visualising variation: the Box Plot
49
Box Plot showing the distribution of total cholesterol.
100
200
300
400
500
Box Plot- visualising variation in age
50
Box Plot showing the distribution of age (n=40)
10
20
30
40
50
Age
(years)
Box plot…
• Standardized way of displaying data.
• Based on five number summary;
1. Minimum
2. First quartile (Q1)
3. Median
4. Third quartile (Q3)
5. Maximum
51
How to draw a Box Plot
1. Sort the data from minimum to maximum
2. Determine the Min, Q1, Median, Q3, Maximum
3. Determine the IQR (i.e. Q3-Q1) and the value of IQR*1.5
4. Obtain the values of Q1-IQR*1.5, and Q3+IQR*1.5
5. Draw and Label a vertical line that includes the range of the distribution
6. Draw a central box from Q1 to Q3
7. Draw a horizontal line for the median inside the box
8. Extend vertical lines (whiskers) from the box (at Q1 and at Q3) out to
the lower and upper bounds of data falling within the general
distribution (i.e. not outliers). Length of the whisker is ≈1.5 times the
IQR.
Determining Q1 & Q3
• These are known as the lower and upper quartiles
1. Arrange the data set in numerical order
2. If your data set has an odd number of observations, Q1 is the
median of the lower half of the data set.
3. If your data set has an even number of observations, Q1 is
the average of the middle two values of the lower half of
the data set.
• Consider the data set {1, 3, 4, 7, 8, 9, 10, 12, 14, 15}.
• Arrange the data in numerical order: {1, 3, 4, 7, 8, 9, 10, 12, 14,
15}.
• (3+4)/2 = 3.5.
53
54
The Box Plot
Box plot for displaying the
distribution of
quantitative data.
Upper inner fence =Q3+ (1.5XIQR)
Lowe inner fence =Q1- (1.5XIQR)
Any data that falls outside the fences are outliers.
55
Source: https://en.wikipedia.org/wiki/Boxplot
Box Plot: Location of fences
• When the calculated value of Upper fence is greater
than the maximum observation in the data, the
fence will be located at the observed maximum
value.
• When the calculated value of the Lower fence is
less than the minimum value in the data, the fence
will be located at the observed minimum value.
Box Plot showing the distribution of cholesterol.
57
Data Values which fall outside the fences are called outliers.
Observe the
location of the
median
relative to Q1
and Q3
100
200
300
400
500
totchol
(mg/dl)
8
10
12141618202224262830323436384042
44
Box Plot- distribution of age
58
Data Values which fall outside the fences are called outliers.
Using this box plot,
estimate the median,
Q1 and Q3.
Tools for Visualising variation: the Symmetry Plot
• Also known as a normal probability plot or a normal plot.
• Assess whether a data set follows a normal distribution.
• Scatter plot of the data against a theoretical normal
distribution.
• If the data is symmetric and approximately bell-shaped,
the points on the plot will form a roughly straight line.
• If the data deviates significantly from normality, the
points on the plot will deviate from a straight line.
59
Tools for Visualising variation: the Symmetry
Plot
60
The symmetry plot
showing distribution
of data around the
median.
If data is symmetrically distributed, all plotted values will lie along the
reference line.
0
50
100
150
200
250
Distance
above
median
0 20 40 60 80 100
Distance below median
Quantifying variation: The Range
• Range
• The range is the difference between the highest value
(maximum)minus the lowest value (minimum)
• In practice the lowest and the highest values in the data are
reported
• Inter-quartile range
• Is the difference between the 1st quartile (25th percentile)
and the 3rd quartile(75th percentile)
• The inter-quartile range contains the central 50% of the
observations
61
Quantifying variation: Standard Deviation
• Standard deviation is a measure of the spread of
observations about their mean
• It is a measure of how much on average each of the
values in the distribution deviates from the mean
• Standard deviation is an essential part of many statistical
tests
• The value of the standard deviation is affected by outliers
62
Calculating the Standard Deviation
1. Calculate the arithmetic mean
2. Calculate and square the (difference between each
observation in the data set and the mean)
3. Obtain a sum of the squared deviations
4. Divide the sum of the squared deviations by n-1,
(number of observations in the sample minus one)
63
Computation of variance and standard deviation
64
Id_number age: xi (xi-mean) (xi-mean)^2
UG023 21 0.25 0.0625
UG024 19 -1.75 3.0625
UG025 20 -0.75 0.5625
UG026 25 4.25 18.0625
UG027 26
UG028 20
UG029 23
UG030 8
UG031 23
UG032 18
UG033 24
UG034 16
UG035 30
UG036 24
UG037 15
UG038 22
UG039 27
UG040 20
sum 830
mean
Variance=
Standard deviation=square-root of variance=
UG001 45 24.25 588.0625
UG002 19 -1.75 3.0625
UG003 23
UG004 10
UG005 16
UG006 21
UG007 25
UG008 17
UG009 21
UG010 18
UG011 15
UG012 18
UG013 21
UG014 13
UG015 16
UG016 23
UG017 21
UG018 24
UG019 18
UG020 19
UG021 26
UG022 20
Median= 20.5, Mean=20.75,
Relationship between standard deviation,
the mean and distribution of observations
• If the distribution of observations of a given variable is
approx normal:
–Approximately 68% of the observations in the sample fall
within one standard deviation of the mean (Mean±1SD)
–Approximately 95% of the observations in the sample
fall within two standard deviations of the mean
(Mean±2SD)
–Approximately 99.7% of the observations in the sample
fall within three standard deviations of the mean
(Mean±3SD)
65
66
Summary of Cholesterol Data
• Sample size: 250 persons
• Minimum: 135 g/dl
• Median income: 237 g/dl
• Mean (Average): 236.3 g/dl
• Maximum: 464 g/dl
• Standard deviation: 42.6 g/dl
• Question. Based on the values of the mean and median
cholesterol, what can you say about the distribution of
cholesterol in this sample?
67
Coefficient of Variation (CV)
• It is a ratio of the standard deviation to the mean
• The formula is
• where SD=standard deviation and is the mean
• CV can be used to compare variation between data
sets
• Mainly applied in laboratory testing and quality
control procedures
68
Standard Error (SE)
• SE is used to assess how closely sample estimates (like the
sample mean) relate to the population parameter
(population mean)
• Used in computation of confidence intervals and testing
statistical significance
• (more on standard error later)
69
Choice of measures of dispersion
• Standard deviation is appropriate when the mean is used to describe
central tendency (symmetric data)
• The inter-quartile range is used to describe the central 50% of a
distribution, regardless of its shape
• The percentile may also be used when the mean is used but the
objective is to compare a set of observations with the norm
• The range is used with numerical data when the purpose is to
emphasize extreme values
• Percentiles and inter-quartile range are used when the median is used
(skewed data)
• The coefficient of variation is used when the intent is to compare
distributions of variables measured on different scales
70
Take home assignment
• Using dummy data from the research questions that were
pitched in class, provide a summary of the data collected as;
1. Summarize and present data on socio demographics of the
study population.
2. Present the data of your findings using the appropriate data
presentation method.
3. Comment on the spread and symmetry of your data findings
Please work in your groups to have a PowerPoint
presentation ready for a 5 minute presentation in our next
class
71

2. Biostatistics and Demography lecture.pdf

  • 1.
    Mbarara University ofScience and Technology Faculty of Medicine, Department of Pharmacy P.O. Box 1410, Mbarara, Uganda, http://www.must.ac.ug 1 Biostatistics & Demography Summarizing and presenting data Edward .J. LUKYAMUZI (MPS)
  • 2.
    Formulating researchable questions •EXERCISE: WORKING IN GROUPS OF 5 STUDENTS: 1. IDENTIFY A PROBLEM WHICH YOU THINK SHOULD BE INVESTIGATED FURTHER OR WHICH NEEDS TO BE SOLVED. 2. DESCRIBE THE PROBLEM USING THE 5 Ws Technique 3. USING ONE OF THE GUIDELINES, FORMULATE A RESEARCH QUESTION 4. IDENTIFY THE MAIN VARIABLES FOR YOUR RESEARCH 5. IDENTIFY DATA SOURCES 6. CHOOSE THE APPROPRIATE DATA COLLECTION METHOD(S) Note: This exercise will continue to feed into future class discussions and assignments, therefore individual participation is good for learning and teamwork .
  • 3.
    5 Ws Techniqueto understand a problem 1. Who - Who does the problem affect? Who is experiencing the problem? Specific groups, organizations, customers, etc. 2. What –What are the effects/impact the problem is causing? (to individuals, families, communities, Institutions, MoH etc.) - What will happen when it is fixed? - What would happen if we didn’t solve the problem? 3. When - When does the issue occur? Is there a trend? 4. Where - Where is the issue occurring? Only in certain locations, specific level of health facilities, in certain processes, etc. 5. Why – Why does this problem occur? Why is it important that we fix the problem?
  • 4.
    Objectives… • Explain howproportions and percentages are calculated • Explain two methods for graphically displaying the distribution of categorical data. • Explain the method of construction of a histogram; describe the shape, centre, and spread • Define a percentile, name the 3 important percentiles and derive their values from a frequency table • Define and describe the characteristics of the mean, median, and mode, and identify when to use each. • Define and describe the characteristics of standard deviation, interquartile range and identify when to use each 4
  • 5.
    Recap… 5 Quantitative Discrete (count data) Continuous (Measured ona continuum) Interval Ratio Categorical Nominal Ordinal Binary or Dichotomous Data types
  • 6.
    Four scales ofmeasurement…
  • 7.
    Introduction… • In publichealth and health research we are interested in describing a group (of people or things e.t.c.) rather than an individual person. • We have noted the inherent variability in biological, socio- economic, or behavioral processes. • Knowledge and application of appropriate statistical methods enable us to describe groups of people with varying experiences. 7
  • 8.
    Summarizing and Presentingdata : Methods and tools • Summarising categorical data • Counts • Proportions • Percentages • Rates, Ratios • Summarizing quantitative data • Measures of central tendency (Mode, median, mean) • Measures of dispersion (minimum & maximum, IQR, standard deviation) • Graphical presentation of data • Categorical data (bar chart, pie chart) • Continuous data (histogram, box plot) 8
  • 9.
    Summarising Categorical Data: Frequencycounts • The frequency distribution tells us how many times different values of a variable occur in a given sample or population. Frequency distribution of a random sample of 200 new students: Example: Frequency distribution of PHA III students by sex 9 Sex Number of students: counts Males 36 Females 14 Total 50
  • 10.
    Summarizing Categorical Data:Proportions • Proportion • Numerator is included in the denominator (a / a+b) , where a+b=n e.g. #students who are Males /Total number of students Frequency Distribution of Year 1 Students by Sex 10 Sex Number of students Proportion Males 108 0.54 Females 92 0.46 Total 200 1.0
  • 11.
    Summarizing Categorical Data:Percentages • Proportions are often converted into percentages for ease of interpretation. • Percentages are relative frequencies expressed per 100. – Frequency% =(a/n)*100 Percentage of Students by Marital Status 11 Sex Number of students Proportion Percentage Males 108 0.54 54.0 Females 92 0.46 46.0 Total 200 1.0 100.0
  • 12.
    Other Basic Measuresof Frequency: Ratios Ratios show the relative sizes of two events Examples • Number of doctors to population size • Number of nurses to number of patients in a hospital • Number of girls to number of boys • Number of households with a bed net to number of households without a bed net 12
  • 13.
    Other Basic Measuresof Frequency: Rates A rate is a measure of frequency of occurrence of events per unit time. For example - Number of disease events per year - Number of cases of dysentery per week • Numerator is number of events • Denominator is total time of observation • Time may be hours, days, weeks, months, years etc 13
  • 14.
    Mbarara University ofScience and Technology Faculty of Medicine, Department of Pharmacy P.O. Box 1410, Mbarara, Uganda, http://www.must.ac.ug 14 SUMMARIZING CONTINUOUS DATA
  • 15.
    Distribution of continuousdata • Values of continuous data exist on a continuum. • As the sample size increases, the number of unique values are so large, that it is difficult to summarise such data by making frequency counts of individual values • See example of frequency distribution of total cholesterol of 250 people. 15
  • 16.
    Example: Frequency distributionof cholesterol in mg/dl (n=250) 16 164 1 0.40 2.82 163 1 0.40 2.42 159 1 0.40 2.02 154 1 0.40 1.61 150 2 0.81 1.21 135 1 0.40 0.40 totchol Freq. Percent Cum. 464 1 0.40 100.00 332 1 0.40 99.60 326 1 0.40 99.19 325 1 0.40 98.79 320 2 0.81 98.39 . . . . . . . .
  • 17.
    Distribution of Continuousdata • Continuous data is first grouped into classes comprising of successive ranges of values of the variable of interest • In each class/group, frequency counts are determined • Frequency distributions of continuous data have a shape 17
  • 18.
    Frequency distribution ofcholesterol data, Class width is 20 g/dl 18 Frequency distributions of continuous data have a shape; easier to see on a graph Total 248 100.00 455- 1 0.40 100.00 315- 5 2.02 99.60 295- 15 6.05 97.58 275- 19 7.66 91.53 255- 39 15.73 83.87 235- 52 20.97 68.15 215- 40 16.13 47.18 195- 35 14.11 31.05 175- 26 10.48 16.94 155- 12 4.84 6.45 135- 4 1.61 1.61 totcholcat Freq. Percent Cum.
  • 19.
    Frequency distribution (from Table above) presented on a Graph (Histogram) 19 0 5 10 15 20 100 200 300 400 500 Total cholesterol, g/dl Observe the Shape, Centre and Spread of the data.
  • 20.
    20 Tools for Visualisingdistribution of continuous data: Histogram 0 5 10 15 20 100 200 300 400 500 Total cholesterol, g/dl Does the centre represents the Mean, Median and Mode?
  • 21.
    21 Recall: Frequency distributionof cholesterol in mg/dl (n=250) 164 1 0.40 2.82 163 1 0.40 2.42 159 1 0.40 2.02 154 1 0.40 1.61 150 2 0.81 1.21 135 1 0.40 0.40 totchol Freq. Percent Cum. 464 1 0.40 100.00 332 1 0.40 99.60 326 1 0.40 99.19 325 1 0.40 98.79 320 2 0.81 98.39 . . . . . . . .
  • 22.
    Summarising the distributionof total cholesterol data (Continuous data) • Minimum is 135 mg/dl • Maximum is 464 mg/dl • Sample size is 250 • The complete frequency table is too long, making it difficult to make meaning out of the frequency list. • A histogram can help visualise the distribution 22
  • 23.
    Creating a Histogram Determine the minimum and maximum values for the variable of interest e.g. income • Minimum is 135, Maximum is 464 • Determine the range (max-min=329)  Determine the number of classes/groups you want to have e.g. 11.  Determine the Class Interval width ≈Range divide by number of classes ≈30 mg/dl  Create mutually exclusive categories/classes of the original (continuous) variable.  Classes are also called bins .  Start the first interval at a convenient value below the minimum. 134.5 g/dl,  Therefore first class will be 134.5 to 164.5 g  Second class will be 164.5 to 194.5 g/dl 23
  • 24.
    Creating a Histogram Determine the frequency counts and relative frequency for each category/class  To plot the graph: • On the horizontal axis mark equally spaced values of the lower boundary of each class • On the vertical axis, the length represents the frequency • Plot the frequencies for each class as bars. • The height of the bar will be proportional to the frequency of that class. • The width of the bars is the same. 24
  • 25.
    0 10 20 30 100 200 300400 500 total cholesterol, mg/dl 25 Class width is ≈30 mg/dl Is the shape of a Histogram sensitive to the number of classes?
  • 26.
    26 0 5 10 15 20 Percent 100 200 300400 500 total cholesterol, g/dl Shape of a Histogram is sensitive to the number of classes Class width is 16.45 mg/dl
  • 27.
    Distribution of continuousdata • The shape of the frequency distribution can be symmetrical or asymmetrical • A symmetric distribution has the same shape on both sides of the mean (the centre) • If outlying values occur only in one direction, the distribution is said to be skewed • Normally distributed data has zero skewness 27
  • 28.
    Shape of distributionof continuous data: Symmetrical 28 • Symmetrical distribution has the same shape both sides of the mean. • Horizontal axis represents the X-variable while the Vertical axis represents the frequencies. Values of a continuous variable (X) Frequency
  • 29.
    Shape of distributionof continuous data: Skewed to the right 29
  • 30.
    Shape of distributionof continuous data: Skewed to the Left 30
  • 31.
    Mbarara University ofScience and Technology Faculty of Medicine, Department of Pharmacy P.O. Box 1410, Mbarara, Uganda, http://www.must.ac.ug 31 MEASURES OF CENTRAL TENDENCY
  • 32.
    Measures of centraltendency • The central tendency of a distribution is an estimate of the "center" of a distribution of values. • Enable us to describe the characteristics of a typical member of a group of people. • Three major types of estimates of central tendency: • Mode • Median • Mean (Arithmetic mean) • Other measures of central tendency • Geometric mean 32
  • 33.
    Mode • The modeis the most frequently occurring value in the set of observations 33 Using this age data, 45 19 23 10 16 21 25 17 21 18 15 18 21 13 16 23 21 24 18 19 26 20 21 19 20 25 26 20 23 8 23 18 24 16 30 24 15 22 27 20 What is the modal age • In a given dataset, a variable may have more than one mode (bimodal distribution) – distribution with two different peaks
  • 34.
    Median • The Medianis the score found at the exact middle of the set of values • Arrange observations in order of magnitude, the median is the middle observation • Median divides the set of observations into two equal parts such that the number of values equal to or greater than the median is equal to the number of values less than or equal to the median 34 Consider the following age data: 45 19 23 10 16 21 25 17 21 18 15 18 21 13 16 23 21 24 18 19 26 20 21 19 20 25 26 20 23 8 23 18 24 16 30 24 15 22 27 20 - Is The median age of the 40 individuals is 20.5 years
  • 35.
    The Arithmetic Mean •The arithmetic mean – is the most popular measure of central tendency • Calculation of mean requires numerical data • For a given variable, the mean is obtained by 1. Adding all values in the sample 2. Dividing the sum by the number of observations in the sample (sample size) • For a given set of data and variable, there is only one mean 35
  • 36.
    Arithmetic mean- computation •Calculating the arithmetic mean • Formula – where n is the sample size and xi is the random variable. • The mean is affected by outliers(extreme but legitimate values) 36
  • 37.
    Arithmetic mean -Example 37 Forthis set of observations in a sample, calculate the mean. 45 19 23 10 16 21 25 17 21 18 15 18 21 13 16 23 21 24 18 19 26 20 21 19 20 25 26 20 23 8 23 18 24 16 30 24 15 22 27 20 Step 1: First obtain a sum: 45+ 19+ 23+ 10+ 16+ 21+ 25 +17 +21 +18 +15 + 18 + 21+ 13+ 16+ 23+ 21+ 24+ 18+ 19+26+ 20+ 21 +19 +20 + 25 +26 + 20 +23 + 8 +23 +18 +24 +16 +30 +24 +15 +22+ 27 +20 Sample size (n)=40 Thus, arithmetic mean =sum/n =20.75 Step 2: Divide the sum by the sample size (n) Note: The arithmetic mean is affected by outliers.
  • 38.
    Summary of theAge data of 40 individuals: • Median=20.5 years, • Modal age=21 years • Mean=20.75 years • Based on these measures, what can you conclude about the shape of the distribution of age in this sample? 38
  • 39.
    39 0 5 10 15 20 25 Percent 8 101214161820222426283032343638404244 Age(years) Using agraph to see the distribution helps to identify key features such as presence of outliers Outliers
  • 40.
    Outliers & ArithmeticMean • Outliers fall outside the general pattern of the distribution. • The value of the arithmetic mean is sensitive to/affected by outliers • If the data contains outliers- 1. Find out if there is a problem with data entry. 2. Assess the extent to which the outlier affects the results: for example calculate the mean with and without the outlier 40
  • 41.
    The Geometric Mean •For a variable X with observations xi, the geometric mean of a set of n observations is equal to the nth root of the cross-product of the n observations. • For a given data set, the geometric mean is less than or equal to the arithmetic mean. • Do not calculate the geometric mean if you have a zero or negative value in your data set. 41 Geometric mean=
  • 42.
    Application of GeometricMean • The geometric mean is a useful measure of central tendency for highly skewed data e.g. gametocyte density, antibody titers; • Geometric mean is used to describe the central tendency of data which is basically skewed BUT normally distributed on a log-scale. • Compared to the arithmetic mean, the geometric mean is less affected by extreme values. • The GM is appropriate for describing proportional growth, e.g. annual population growth, average rate of return on investment over a period of time. 42
  • 43.
    Choice of appropriatemeasure of central tendency • The arithmetic mean is calculated for numerical data and for symmetric or ≈ symmetric distributions • The median is suitable for ordinal data or numerical data if the distribution is skewed • The mode is used to describe bimodal distributions (esp. in disease-age/time distributions) e.g. seasonal distribution of malaria • For a symmetric distribution, the mean ≈ median ≈ mode 43
  • 44.
    Mbarara University ofScience and Technology Faculty of Medicine, Department of Pharmacy P.O. Box 1410, Mbarara, Uganda, http://www.must.ac.ug
  • 45.
    Mbarara University ofScience and Technology Faculty of Medicine, Department of Pharmacy P.O. Box 1410, Mbarara, Uganda, http://www.must.ac.ug MEASURES OF VARIATION 45
  • 46.
    Assessing Variation/Dispersion inData  Variation of data is also commonly referred to as dispersion.  Dispersion refers to the spread of the values around the central point.  The starting point when assessing dispersion in a set of data is to use visualisation tools such as the Histogram, the Box Plot, Symmetry Plot, and Quantile plot.  These tools enable the researcher to make a qualitative description of the extent of variation observed in the data.  Data variation is often quantified using measures such as the range, percentiles, inter-quartile range, the standard deviation, coefficient of variation , and standard error. 46
  • 47.
    47 0 5 10 15 20 100 200 300400 500 Total Cholesterol, mg/dl Observe the Spread of the data. Tools for Visualising distribution of continuous data: Histogram Starts at 135 mg/dl; Class width is 21.93 mg/dl The graph (Histogram) helps to see the spread of the data
  • 48.
    Alternative methods forviewing the distribution of continuous data • Although the histogram is a popular tool for displaying visualising the distribution of continuous data, histograms are sensitive to the number of bins used in their construction. • They can be inaccurate in informing researchers about the nature of the distribution of data. • Other approaches to understanding the distribution of continuous data include: Box Plot, Symmetry Plot, Quantile plot etc 48
  • 49.
    Tools for Visualisingvariation: the Box Plot 49 Box Plot showing the distribution of total cholesterol. 100 200 300 400 500
  • 50.
    Box Plot- visualisingvariation in age 50 Box Plot showing the distribution of age (n=40) 10 20 30 40 50 Age (years)
  • 51.
    Box plot… • Standardizedway of displaying data. • Based on five number summary; 1. Minimum 2. First quartile (Q1) 3. Median 4. Third quartile (Q3) 5. Maximum 51
  • 52.
    How to drawa Box Plot 1. Sort the data from minimum to maximum 2. Determine the Min, Q1, Median, Q3, Maximum 3. Determine the IQR (i.e. Q3-Q1) and the value of IQR*1.5 4. Obtain the values of Q1-IQR*1.5, and Q3+IQR*1.5 5. Draw and Label a vertical line that includes the range of the distribution 6. Draw a central box from Q1 to Q3 7. Draw a horizontal line for the median inside the box 8. Extend vertical lines (whiskers) from the box (at Q1 and at Q3) out to the lower and upper bounds of data falling within the general distribution (i.e. not outliers). Length of the whisker is ≈1.5 times the IQR.
  • 53.
    Determining Q1 &Q3 • These are known as the lower and upper quartiles 1. Arrange the data set in numerical order 2. If your data set has an odd number of observations, Q1 is the median of the lower half of the data set. 3. If your data set has an even number of observations, Q1 is the average of the middle two values of the lower half of the data set. • Consider the data set {1, 3, 4, 7, 8, 9, 10, 12, 14, 15}. • Arrange the data in numerical order: {1, 3, 4, 7, 8, 9, 10, 12, 14, 15}. • (3+4)/2 = 3.5. 53
  • 54.
    54 The Box Plot Boxplot for displaying the distribution of quantitative data. Upper inner fence =Q3+ (1.5XIQR) Lowe inner fence =Q1- (1.5XIQR) Any data that falls outside the fences are outliers.
  • 55.
  • 56.
    Box Plot: Locationof fences • When the calculated value of Upper fence is greater than the maximum observation in the data, the fence will be located at the observed maximum value. • When the calculated value of the Lower fence is less than the minimum value in the data, the fence will be located at the observed minimum value.
  • 57.
    Box Plot showingthe distribution of cholesterol. 57 Data Values which fall outside the fences are called outliers. Observe the location of the median relative to Q1 and Q3 100 200 300 400 500 totchol (mg/dl)
  • 58.
    8 10 12141618202224262830323436384042 44 Box Plot- distributionof age 58 Data Values which fall outside the fences are called outliers. Using this box plot, estimate the median, Q1 and Q3.
  • 59.
    Tools for Visualisingvariation: the Symmetry Plot • Also known as a normal probability plot or a normal plot. • Assess whether a data set follows a normal distribution. • Scatter plot of the data against a theoretical normal distribution. • If the data is symmetric and approximately bell-shaped, the points on the plot will form a roughly straight line. • If the data deviates significantly from normality, the points on the plot will deviate from a straight line. 59
  • 60.
    Tools for Visualisingvariation: the Symmetry Plot 60 The symmetry plot showing distribution of data around the median. If data is symmetrically distributed, all plotted values will lie along the reference line. 0 50 100 150 200 250 Distance above median 0 20 40 60 80 100 Distance below median
  • 61.
    Quantifying variation: TheRange • Range • The range is the difference between the highest value (maximum)minus the lowest value (minimum) • In practice the lowest and the highest values in the data are reported • Inter-quartile range • Is the difference between the 1st quartile (25th percentile) and the 3rd quartile(75th percentile) • The inter-quartile range contains the central 50% of the observations 61
  • 62.
    Quantifying variation: StandardDeviation • Standard deviation is a measure of the spread of observations about their mean • It is a measure of how much on average each of the values in the distribution deviates from the mean • Standard deviation is an essential part of many statistical tests • The value of the standard deviation is affected by outliers 62
  • 63.
    Calculating the StandardDeviation 1. Calculate the arithmetic mean 2. Calculate and square the (difference between each observation in the data set and the mean) 3. Obtain a sum of the squared deviations 4. Divide the sum of the squared deviations by n-1, (number of observations in the sample minus one) 63
  • 64.
    Computation of varianceand standard deviation 64 Id_number age: xi (xi-mean) (xi-mean)^2 UG023 21 0.25 0.0625 UG024 19 -1.75 3.0625 UG025 20 -0.75 0.5625 UG026 25 4.25 18.0625 UG027 26 UG028 20 UG029 23 UG030 8 UG031 23 UG032 18 UG033 24 UG034 16 UG035 30 UG036 24 UG037 15 UG038 22 UG039 27 UG040 20 sum 830 mean Variance= Standard deviation=square-root of variance= UG001 45 24.25 588.0625 UG002 19 -1.75 3.0625 UG003 23 UG004 10 UG005 16 UG006 21 UG007 25 UG008 17 UG009 21 UG010 18 UG011 15 UG012 18 UG013 21 UG014 13 UG015 16 UG016 23 UG017 21 UG018 24 UG019 18 UG020 19 UG021 26 UG022 20 Median= 20.5, Mean=20.75,
  • 65.
    Relationship between standarddeviation, the mean and distribution of observations • If the distribution of observations of a given variable is approx normal: –Approximately 68% of the observations in the sample fall within one standard deviation of the mean (Mean±1SD) –Approximately 95% of the observations in the sample fall within two standard deviations of the mean (Mean±2SD) –Approximately 99.7% of the observations in the sample fall within three standard deviations of the mean (Mean±3SD) 65
  • 66.
  • 67.
    Summary of CholesterolData • Sample size: 250 persons • Minimum: 135 g/dl • Median income: 237 g/dl • Mean (Average): 236.3 g/dl • Maximum: 464 g/dl • Standard deviation: 42.6 g/dl • Question. Based on the values of the mean and median cholesterol, what can you say about the distribution of cholesterol in this sample? 67
  • 68.
    Coefficient of Variation(CV) • It is a ratio of the standard deviation to the mean • The formula is • where SD=standard deviation and is the mean • CV can be used to compare variation between data sets • Mainly applied in laboratory testing and quality control procedures 68
  • 69.
    Standard Error (SE) •SE is used to assess how closely sample estimates (like the sample mean) relate to the population parameter (population mean) • Used in computation of confidence intervals and testing statistical significance • (more on standard error later) 69
  • 70.
    Choice of measuresof dispersion • Standard deviation is appropriate when the mean is used to describe central tendency (symmetric data) • The inter-quartile range is used to describe the central 50% of a distribution, regardless of its shape • The percentile may also be used when the mean is used but the objective is to compare a set of observations with the norm • The range is used with numerical data when the purpose is to emphasize extreme values • Percentiles and inter-quartile range are used when the median is used (skewed data) • The coefficient of variation is used when the intent is to compare distributions of variables measured on different scales 70
  • 71.
    Take home assignment •Using dummy data from the research questions that were pitched in class, provide a summary of the data collected as; 1. Summarize and present data on socio demographics of the study population. 2. Present the data of your findings using the appropriate data presentation method. 3. Comment on the spread and symmetry of your data findings Please work in your groups to have a PowerPoint presentation ready for a 5 minute presentation in our next class 71