This Open Educational Resource (OER) introduces the concepts of quantitative and qualitative statistics, central tendency (mean, median, and mode), and dispersion (standard deviation and interquartile range). This document was created for PgCAP Digital Education.
OER Descriptive Statistics (University of Edinburgh)
1. Summarising research data using
descriptive statistics
Open Educational Resource
Dr Leonard Ho
ACRC Systematic Reviewer, Usher Institute
2. Objectives
• To understand different types of variables
• To calculate the central tendency and dispersion of continuous data
• To present data with appropriate diagrams
3. Useful books
From John Wiley & Sons: https://media.wiley.com/product_data/coverImage300/19/08654287/0865428719.jpg From Amazon UK: https://m.media-amazon.com/images/I/41B+txZryWL._SX283_BO1,204,203,200_.jpg From Amazon UK: https://m.media-amazon.com/images/I/61L9H502OYL.jpg
4. Statistics
• “The science of collecting, summarising, presenting and interpreting data,
and of using them to estimate the magnitude of associations and test
hypotheses”
From John Wiley & Sons: https://media.wiley.com/product_data/coverImage300/19/08654287/0865428719.jpg
5. Two types of statistics
• Descriptive statistics
• Summarising and describing the behaviour of data in a dataset
• Mean, standard deviation…
• Inferential statistics
• Making predictions and testing hypotheses with the data in a dataset
• Regressions, chi-square tests…
6. Types of variables
• Variable is a quantity or characteristic that can be
measured or observed
• Quantitative (numeric) variable contains data that
describe a measurable quantity
• Qualitative (categorical) variable contains data that
describe a characteristic
[Diagram: Variables → Quantitative, Qualitative]
7. Quantitative variable
• Continuous variable
• Contains data that lie on a continuum and can take any
values
• Height, weight...
• Very common in scientific research
• Discrete variable
• Contains data that do not lie on a continuum and can
only take whole numbers (integers)
• Number of strokes per day…
• “Not splittable”
[Diagram: Variables → Quantitative (Continuous, Discrete), Qualitative]
8. Qualitative variable
• Ordinal variable
• Contains data that fall into categories, and there is an
intrinsic ordering of the categories
• Educational level (Primary < Secondary < Tertiary)…
• Nominal variable
• Contains data that fall into categories, but there is no
intrinsic ordering of the categories
• Location (Aberdeen, Edinburgh, Glasgow)…
[Diagram: Variables → Quantitative (Continuous, Discrete), Qualitative (Ordinal, Nominal)]
9. “Mooing pill”
• We are going to sell a drug that makes people moo like Highland cattle
• We conducted a randomised controlled trial on 1,500 people in Scotland
• Our dataset contains the following variables:
• Sex (Male / Female)
• Body mass index (kg/m²)
• Educational level (Primary / Secondary / Tertiary)
• Number of pills necessary to trigger mooing
From Visit Scotland: https://www.visitscotland.com/blog/wp-content/uploads/2019/10/HC-on-coastal-road.jpg
10. Our dataset
Label | Variable | Type of variable
Sex | Sex | Nominal
BMI | Body mass index | Continuous
Edu_level | Educational level | Ordinal
No_pill | Number of pills | Discrete
How do we describe BMI?
11. Description of continuous data
• Describe the central tendency
• Average of data
• Describe the dispersion
• Spread of data
12. Central tendency (Median)
• Median is the midway value of a list of ordered data (ascending or descending)
• “Midway” refers to the middle number or the average of two middle numbers
• [1, 1, 1, 3, 4, 4, 7, 7, 7]: Median is 4
• [1, 1, 1, 3, 3, 4, 4, 7, 7, 7]: Median is 3.5
• Divides the list of data into upper and lower halves
• Not affected by extreme values
• [–11111, 1, 1, 3, 4, 4, 7, 7, 99999]: Median is still 4
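The median rule above can be sketched in a few lines of Python. This is only an illustration: the helper name `median_by_hand` is an assumption made for this example, and the standard library's `statistics.median` is shown alongside as a cross-check.

```python
from statistics import median


def median_by_hand(values):
    """Sort the data, then take the middle value
    (or the average of the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# The three lists from the slide
odd_list = [1, 1, 1, 3, 4, 4, 7, 7, 7]
even_list = [1, 1, 1, 3, 3, 4, 4, 7, 7, 7]
extreme_list = [-11111, 1, 1, 3, 4, 4, 7, 7, 99999]

print(median_by_hand(odd_list))      # 4
print(median_by_hand(even_list))     # 3.5
print(median_by_hand(extreme_list))  # 4 (unchanged by the extreme values)
print(median(even_list))             # 3.5 (standard library cross-check)
```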
13. Central tendency (Mean)
• Mean is the sum of a list of data divided by the number of observations
• [1, 1, 1, 3, 4, 4, 7, 7, 7]: Mean is 3.89
• Affected by extreme values
• [1, 1, 1, 3, 4, 4, 7, 7, 77777]: Mean is 8645
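As a quick check of the two examples above, the mean can be computed directly. The function name `mean_by_hand` is only a label chosen for this sketch.

```python
def mean_by_hand(values):
    """Sum of the data divided by the number of observations."""
    return sum(values) / len(values)


plain = [1, 1, 1, 3, 4, 4, 7, 7, 7]
with_extreme = [1, 1, 1, 3, 4, 4, 7, 7, 77777]

print(round(mean_by_hand(plain), 2))  # 3.89
print(mean_by_hand(with_extreme))     # 8645.0 (pulled up by one extreme value)
```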
14. Central tendency (Mode)
• Mode is the value that occurs most often in a list of data
• [1, 1, 3, 4, 4, 7, 7, 7]: Mode is 7
• A list of data may have more than one modal value
• More relevant for integers (like discrete variables)
• If we round data, we would lose much information!
[Figure: modes calculated from original data versus rounded data]
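The points about modes can be illustrated with the standard library's `statistics.multimode`, which returns every modal value. The `bimodal` and `measurements` lists below are hypothetical values invented for this sketch, not data from the trial.

```python
from statistics import multimode

# List from the slide: a single mode
slide_list = [1, 1, 3, 4, 4, 7, 7, 7]
print(multimode(slide_list))  # [7]

# Hypothetical list with two modal values
bimodal = [1, 1, 1, 3, 4, 7, 7, 7]
print(multimode(bimodal))  # [1, 7]

# Hypothetical continuous measurements: every value is unique,
# so a mode only appears after rounding (and detail is lost).
measurements = [22.1, 22.4, 22.6, 23.3, 23.4]
rounded = [round(x) for x in measurements]
print(len(multimode(measurements)))  # 5 (each value occurs once)
print(multimode(rounded))            # [23]
```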
15. Presenting data with histogram
• Histogram
• Shows the distribution of data by plotting the
data in rectangles, “bins” (not bars),
corresponding to categories along the x-axis
• The bins have heights that are proportional to
the frequencies of observations
• No gaps between bins because the categories
are on a continuum!
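To make the idea of bins concrete, here is a minimal sketch that counts observations in adjacent, equal-width intervals and prints a text histogram. The `bmi_sample` values, the bin width of 5, and the helper name `histogram_counts` are all assumptions for illustration; a plotting library would normally draw the rectangles.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins adjacent intervals
    [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical BMI values (kg/m²), not taken from the trial dataset
bmi_sample = [17.2, 19.8, 21.4, 22.0, 22.9, 23.1, 23.6, 24.8, 26.3, 29.5]

counts = histogram_counts(bmi_sample, start=15, width=5, n_bins=3)
print(counts)  # [2, 6, 2]

# Adjacent bins share their edges, so there are no gaps between them
for i, count in enumerate(counts):
    low = 15 + 5 * i
    print(f"{low}-{low + 5}: " + "#" * count)
```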
16. Central tendencies of BMI
• Central tendencies of BMI (kg/m²) among our 1,500 participants
[Histogram annotations: Median ≈ Mean; normally distributed (roughly)]
17. Distribution of data
Mean: 22.95
Median: 23.08
From Biology For Life: https://www.biologyforlife.com/uploads/2/2/3/9/22392738/c101b0da6ea1a0dab31f80d9963b0368_orig.png
19. Description of continuous data
• Describe the central tendency
• Average of data
• Describe the dispersion
• Spread of data
20. Dispersion (Range)
• Range is the difference between the maximum and the minimum values in a list
of data
• [1, 1, 1, 3, 4, 4, 7, 7, 7]: Range is 6
• Affected by extreme values
• [1, 1, 1, 3, 4, 4, 7, 7, 77777]: Range is 77776
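The range calculation is short enough to sketch directly; `value_range` is a name chosen for this example only.

```python
def value_range(values):
    """Difference between the maximum and the minimum values."""
    return max(values) - min(values)


plain = [1, 1, 1, 3, 4, 4, 7, 7, 7]
with_extreme = [1, 1, 1, 3, 4, 4, 7, 7, 77777]

print(value_range(plain))         # 6
print(value_range(with_extreme))  # 77776
```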
21. Dispersion (Interquartile range)
• Interquartile range (IQR) summarises the spread of
the middle 50% of data in an ordered list
• Not affected by extreme values
• Difference between the upper (Q3) and the lower
(Q1) quartiles in a list of ordered data
• Q3: Between the maximum and median (Q2)
• Q1: Between the minimum and Q2
Example with 11 values: [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
Min = 1, Q1 = 2, Q2 = 4, Q3 = 7, Max = 8
IQR = 7 – 2 = 5
Example with 14 values: [1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9]
Min = 1, Q1 = 2, Q2 = 4.5, Q3 = 7, Max = 9
IQR = 7 – 2 = 5
There are many ways to calculate quartiles!
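One way to reproduce the two worked examples is the "median of each half" approach sketched below, which leaves the median itself out of both halves when the list has an odd length. This is an assumption about the method behind the slide values; the helper name `quartiles_by_halves` is hypothetical. The last line calls the standard library's `statistics.quantiles` with `method="inclusive"` to show that another convention can be applied to the same data.

```python
from statistics import median, quantiles


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) as the medians of the lower half,
    the whole list, and the upper half of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd-length list, skip the middle value itself
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


eleven = [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
fourteen = [1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9]

q1, q2, q3 = quartiles_by_halves(eleven)
print(q1, q2, q3, "IQR =", q3 - q1)  # 2 4 7 IQR = 5

q1, q2, q3 = quartiles_by_halves(fourteen)
print(q1, q2, q3, "IQR =", q3 - q1)  # 2 4.5 7 IQR = 5

# A different quartile convention applied to the same 11 values
print(quantiles(eleven, n=4, method="inclusive"))
```

Because conventions differ, it helps to state which quartile method was used whenever an IQR is reported.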
23. Presenting data with box plot (1)
• Box plot (Box and whisker plot)
• Shows at least 5 pieces of summary information about a
list of data:
• Median = Horizontal line in the box
• Upper quartile = Top edge of the box
• Lower quartile = Bottom edge of the box
• Maximum = Top of the upper whisker
• Minimum = Bottom of the lower whisker
From the University of Newcastle: https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/images/Box_and_whiskers_explanation_inkscape(2).png
24. Presenting data with box plot (2)
• Box plot (Box and whisker plot)
• Always check whether there are outliers
• Observations that are far away from the others
• Common definitions of outliers:
• Lower outlier(s) < Q1 − (1.5*IQR)
• Upper outlier(s) > Q3 + (1.5*IQR)
• Remember to amend the minimum and maximum values
on the plot
• Minimum becomes the smallest observation at or above the
cut-off for lower outliers
• Maximum becomes the largest observation at or below the
cut-off for upper outliers
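The outlier cut-offs and the amended whisker ends can be sketched as follows. The list reuses the earlier 11-value example with its maximum replaced by a hypothetical value of 30, and Q1 = 2 and Q3 = 7 are passed in directly (assumed to come from the same quartile method as before). The function names are illustrative only.

```python
def outlier_cutoffs(q1, q3, k=1.5):
    """Return (lower cut-off, upper cut-off) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, q1, q3):
    """Separate observations inside the cut-offs from outliers."""
    low, high = outlier_cutoffs(q1, q3)
    inside = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inside, outliers


# Hypothetical data: the last value is deliberately extreme
data = [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 30]

print(outlier_cutoffs(2, 7))  # (-5.5, 14.5)

inside, outliers = split_outliers(data, q1=2, q3=7)
print(outliers)                  # [30]
print(min(inside), max(inside))  # 1 7 (amended whisker ends)
```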
25. Dispersion (Standard deviation) (1)
• Standard deviation (SD) describes the spread of
data around the mean: roughly, the typical distance
between the mean and each observation
• The larger the SD, the more spread out the data
• We use all data to calculate SD
• (Think about how we calculate range and IQR)
• Affected by extreme values
From Cuemath: https://d138zd1ktt9iqe.cloudfront.net/media/seo_landing_files/standard-deviation-formula-1626765976.png
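The formula image is not reproduced in this text. As a hedged sketch, assuming the sample form of the SD (the sum of squared differences from the mean is divided by n − 1; the population form divides by n), the calculation looks like this. The standard library's `statistics.stdev` uses the same sample form and serves as a cross-check.

```python
from math import sqrt
from statistics import stdev


def sample_sd(values):
    """Sample standard deviation: square root of the sum of squared
    differences from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_differences = sum((x - mean) ** 2 for x in values)
    return sqrt(squared_differences / (n - 1))


plain = [1, 1, 1, 3, 4, 4, 7, 7, 7]
with_extreme = [1, 1, 1, 3, 4, 4, 7, 7, 77777]

print(round(sample_sd(plain), 2))  # 2.62
print(round(stdev(plain), 2))      # 2.62 (standard library cross-check)

# Every observation enters the calculation, so one extreme value
# inflates the SD considerably.
print(sample_sd(with_extreme) > sample_sd(plain))  # True
```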
27. Dispersion (Standard deviation) (3)
• The 68–95–99.7 Rule
• About 68% of data fall within ± 1 SD of the mean
• About 95% of data fall within ± 2 SD of the mean
• About 99.7% of data fall within ± 3 SD of the mean
• (Applicable to normal distribution)
Mean = 22.95
Mean + 1*SD = 26.42
Mean + 2*SD = 29.89
Mean – 1*SD = 19.48
Mean – 2*SD = 16.01
Our data may not completely fulfil this rule
because their distribution is slightly skewed
From Biology For Life: https://www.biologyforlife.com/uploads/2/2/3/9/22392738/sd2_orig.png
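For a perfectly normal distribution, the shares behind this rule can be checked with the standard library's `statistics.NormalDist`. The second part recomputes the cut-offs shown above from the reported mean (22.95) and an SD of 3.47, the value implied by those cut-offs. This is a sketch of the theoretical rule, not a re-analysis of the trial data.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

# Share of a normal distribution lying within ± k SD of the mean
shares = {}
for k in (1, 2, 3):
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    shares[k] = round(share * 100, 1)
    print(f"within ± {k} SD: {shares[k]}%")
# within ± 1 SD: 68.3%
# within ± 2 SD: 95.4%
# within ± 3 SD: 99.7%

# Cut-offs for BMI, using the mean and SD reported in the slides
mean_bmi, sd_bmi = 22.95, 3.47
lower_2sd = round(mean_bmi - 2 * sd_bmi, 2)
upper_2sd = round(mean_bmi + 2 * sd_bmi, 2)
print(lower_2sd, upper_2sd)  # 16.01 29.89
```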
28. Dispersion of BMI
• Dispersion of BMI (kg/m²) among our 1,500 participants
Measurement | Dispersion | Interpretation
Range | 23.55 | The difference between the highest and the lowest BMI
IQR | 4.67 | The range of the middle 50% of BMI data around the median
SD | 3.47 | About 95% of BMI data fall between 16.01 and 29.89 (mean ± 2 SD)
29. Our dataset
Label | Variable | Type of variable
Sex | Sex | Nominal
BMI | Body mass index | Continuous
Edu_level | Educational level | Ordinal
No_pill | Number of pills | Discrete
How do we describe sex, educational level,
and number of pills?
30. Description of non-continuous data
• Frequency and percentage are useful in describing non-continuous variables
• Modes can also be used to show the most common category
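A frequency table with percentages can be sketched with `collections.Counter`. The `edu_level` observations are hypothetical, and `frequency_table` is a helper name invented for this example.

```python
from collections import Counter


def frequency_table(observations):
    """Map each category to (frequency, percentage), most common first."""
    counts = Counter(observations)
    total = len(observations)
    return {
        category: (n, round(100 * n / total, 1))
        for category, n in counts.most_common()
    }


# Hypothetical educational levels for 8 participants
edu_level = [
    "Secondary", "Tertiary", "Primary", "Secondary",
    "Tertiary", "Secondary", "Secondary", "Primary",
]

table = frequency_table(edu_level)
for category, (n, percent) in table.items():
    print(f"{category}: {n} ({percent}%)")
# Secondary: 4 (50.0%)
# Tertiary: 2 (25.0%)
# Primary: 2 (25.0%)

# The mode is the most common category
mode_category = next(iter(table))
print(mode_category)  # Secondary
```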
31. Presenting data with bar chart
• Bar chart
• Shows the distribution of observations in different
categories of a variable where every observation belongs
to one category
• Each category is given its own bar, and the length of the
bar is proportional to the frequency of observations
within that category
• (How is it different from histogram?)
• For stacked bar charts:
• Always show the % contribution of each sub-bar
• Always avoid showing > 3 sub-bars in each population
From National Records of Scotland:
https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/population/population-estimates/mid-year-population-estimates/mid-2021
32. Presenting data with pie chart
• Pie chart
• Shows the distribution of observations in different
categories of a variable where every observation belongs
to one category
• Area of each slice is proportional to the frequency of
observations within that category
• Only useful when there are ≥ 3 categories
• Becomes hard to read if there are > 10 categories
• Please, never use 3D pie charts
• (They are not beautiful at all and sometimes misleading)
33. Categorising continuous data (1)
• We may categorise our continuous variable
according to pre-specified rules
• For better communication
• For decision-making
BMI (kg/m²), four categories:
• Underweight: < 18.5 kg/m²
• Normal: 18.5 to 24.9 kg/m²
• Overweight: 25.0 to 29.9 kg/m²
• Obese: ≥ 30.0 kg/m²
BMI (kg/m²), two categories:
• Not obese: < 30 kg/m²
• Obese: ≥ 30 kg/m²
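Applying pre-specified cut-offs is a simple lookup, sketched below. One assumption is made loudly here: the boundaries are treated as half-open intervals (for example, Normal is 18.5 up to but not including 25.0), so that unrounded values such as 24.95 still fall into a category. The function names are hypothetical.

```python
def bmi_category(bmi):
    """Four-category scheme with half-open intervals (an assumption)."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25.0:
        return "Normal"
    if bmi < 30.0:
        return "Overweight"
    return "Obese"


def is_obese(bmi):
    """Two-category scheme: Obese if BMI is at least 30 kg/m²."""
    return bmi >= 30


print(bmi_category(22.95))  # Normal
print(bmi_category(24.95))  # Normal (no gap between 24.9 and 25.0)
print(bmi_category(30.0))   # Obese
print(is_obese(29.9))       # False
```

Note that 29.9 and 30.0 land in different categories even though they are almost identical, which is the loss of information that comes with categorising.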
34. Categorising continuous data (2)
• Loss of information
• Cut-off values may be arbitrary
• If we must categorise, make sure that we:
• also provide the central tendency and dispersion of the
continuous variable
• clearly state the cut-off values and their justifications
35. Summary (1)
• Continuous variable
• Contains data that lie on a continuum
• Can take any values
• Discrete variable
• Contains data that do not lie on a continuum
• Can only take integers
• Ordinal variable
• Contains data that fall into categories
• There is an intrinsic ordering of the categories
• Nominal variable
• Contains data that fall into categories
• There is no intrinsic ordering of the categories
[Diagram: Variables → Quantitative (Continuous, Discrete), Qualitative (Ordinal, Nominal)]
36. Summary (2)
• Continuous variable
• Central tendency summarised by median and mean
• Dispersion summarised by IQR (and range) and SD
• Visually presented by histogram and box plot
• Non-continuous variable
• Observations summarised by frequency and percentage, and mode
• Visually presented by bar chart and pie chart