Summarising research data
using descriptive statistics
Open Educational Resource
Dr Leonard Ho
ACRC Systematic Reviewer, Usher Institute
Objectives
• To understand different types of variables
• To calculate the central tendency and dispersion of continuous data
• To present data with appropriate diagrams
Useful books
From John Wiley & Sons: https://media.wiley.com/product_data/coverImage300/19/08654287/0865428719.jpg
From Amazon UK: https://m.media-amazon.com/images/I/41B+txZryWL._SX283_BO1,204,203,200_.jpg
From Amazon UK: https://m.media-amazon.com/images/I/61L9H502OYL.jpg
Statistics
• “The science of collecting, summarising, presenting and
interpreting data, and of using them to estimate the magnitude
of associations and test hypotheses”
From John Wiley & Sons: https://media.wiley.com/product_data/coverImage300/19/08654287/0865428719.jpg
Two types of statistics
• Descriptive statistics
• Summarising and describing the behaviour of data in a dataset
• Mean, standard deviation…
• Inferential statistics
• Making predictions and testing hypotheses with the data in a dataset
• Regressions, chi-square tests…
Types of variables
• Variable is a quantity or characteristic that
can be measured or observed
• Quantitative (numeric) variable contains data
that describe a measurable quantity
• Qualitative (categorical) variable contains
data that describe a characteristic
[Diagram: Variables → Quantitative / Qualitative]
Quantitative variable
• Continuous variable
• Contains data that lie on a continuum and can
take any values
• Height, weight...
• Very common in scientific research
• Discrete variable
• Contains data that do not lie on a continuum and
can only take whole numbers (integers)
• Number of strokes per day…
• “Not splittable”
[Diagram: Variables → Quantitative (Continuous, Discrete) / Qualitative]
Qualitative variable
• Ordinal variable
• Contains data that take any categories and there
is an intrinsic ordering of the categories
• Educational level (Primary < Secondary <
Tertiary)…
• Nominal variable
• Contains data that take any categories but there is
no intrinsic ordering of the categories
• Location (Aberdeen, Edinburgh, Glasgow)…
[Diagram: Variables → Quantitative (Continuous, Discrete) / Qualitative (Ordinal, Nominal)]
“Mooing pill”
• We are going to sell a drug that makes people moo like Highland
cattle
• We conducted a randomised controlled trial on 1,500 people in
Scotland
• Our dataset contains the following variables:
• Sex (Male / Female)
• Body mass index (kg/m²)
• Educational level (Primary / Secondary / Tertiary)
• Number of pills necessary to trigger mooing
From Visit Scotland: https://www.visitscotland.com/blog/wp-content/uploads/2019/10/HC-on-coastal-road.jpg
Our dataset
Label | Variable | Type of variable
Sex | Sex | Nominal
BMI | Body mass index | Continuous
Edu_level | Educational level | Ordinal
No_pill | Number of pills | Discrete
How do we describe BMI?
Description of continuous data
• Describe the central tendency
• Average of data
• Describe the dispersion
• Spread of data
Central tendency (Median)
• Median is the midway value of a list of ordered data (ascending or
descending)
• “Midway” refers to the middle number or the average of two middle numbers
• [1, 1, 1, 3, 4, 4, 7, 7, 7]: Median is 4
• [1, 1, 1, 3, 3, 4, 4, 7, 7, 7]: Median is 3.5
• Divides the list of data into upper and lower halves
• Not affected by extreme values
• [–11111, 1, 1, 3, 4, 4, 7, 7, 99999]: Median is still 4
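The three median examples above can be checked with a minimal Python sketch. It assumes the standard-library statistics module; the variable names are illustrative only and are not part of the original slides.

```python
from statistics import median

# The three example lists from the slide
odd_list = [1, 1, 1, 3, 4, 4, 7, 7, 7]
even_list = [1, 1, 1, 3, 3, 4, 4, 7, 7, 7]
with_extremes = [-11111, 1, 1, 3, 4, 4, 7, 7, 99999]

# median() sorts the data internally, so the input order does not matter
median_odd = median(odd_list)            # the single middle value
median_even = median(even_list)          # average of the two middle values
median_extreme = median(with_extremes)   # extreme values do not move the median

print(median_odd, median_even, median_extreme)
```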
Central tendency (Mean)
• Mean is the sum of a list of data divided by the total number of data
• [1, 1, 1, 3, 4, 4, 7, 7, 7]: Mean is 3.89
• Affected by extreme values
• [1, 1, 1, 3, 4, 4, 7, 7, 77777]: Mean is 8645
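A short Python sketch of the same two calculations, assuming the standard-library statistics module (variable names are illustrative). It shows how a single extreme value pulls the mean far away from the bulk of the data.

```python
from statistics import mean

values = [1, 1, 1, 3, 4, 4, 7, 7, 7]
with_extreme = [1, 1, 1, 3, 4, 4, 7, 7, 77777]

mean_values = mean(values)            # 35 / 9
mean_extreme = mean(with_extreme)     # 77805 / 9

print(round(mean_values, 2), mean_extreme)
```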
Central tendency (Mode)
• Mode is the value that occurs most often in a list of data
• [1, 1, 3, 4, 4, 7, 7, 7]: Mode is 7
• A list may have more than one modal value
• More relevant for integers (like discrete variables)
• If we round data, we would lose much information!
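A possible Python sketch of these points, assuming statistics.multimode from the standard library (Python 3.8 or later). The first list is from the slide; the bimodal list and the BMI-like values are made up for illustration and are not from the trial dataset.

```python
from statistics import multimode

# Example from the slide: 7 occurs most often
slide_example = [1, 1, 3, 4, 4, 7, 7, 7]
modes_slide = multimode(slide_example)

# Hypothetical list with two modal values
two_modes = multimode([1, 1, 4, 4, 7])

# Hypothetical continuous measurements: every value is unique,
# so every value is returned as a "mode", which is not informative
measurements = [22.4, 22.6, 23.1, 23.4, 24.8]
modes_original = multimode(measurements)

# Rounding produces a single mode, but the decimals are lost
modes_rounded = multimode([round(x) for x in measurements])

print(modes_slide, two_modes, modes_original, modes_rounded)
```

This is why the mode tends to be more useful for integers or categories than for continuous measurements.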
[Figure: modes calculated from original data vs. modes calculated from rounded data]
Presenting data with histogram
• Histogram
• Shows the distribution of data by plotting
the data in rectangles, “bins” (not bars),
corresponding to categories along the x-axis
• The bins have heights that are
proportional to the frequencies of
observations
• No gaps between bins because the
categories are on a continuum!
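The binning behind a histogram can be sketched in plain Python. The BMI values, the starting point (16) and the bin width (2) below are assumptions chosen for illustration; they are not taken from the trial dataset.

```python
# Hypothetical BMI values (kg/m²)
bmi_values = [17.2, 19.8, 20.5, 21.1, 22.3, 22.9, 23.4, 23.8, 24.6, 26.0, 27.5]

bin_start = 16.0   # lower edge of the first bin (assumed)
bin_width = 2.0    # every bin covers 2 kg/m² (assumed)
n_bins = 6         # bins: [16, 18), [18, 20), ..., [26, 28)

# Count how many observations fall into each bin
counts = [0] * n_bins
for value in bmi_values:
    index = int((value - bin_start) // bin_width)
    counts[index] += 1

# Text version of the histogram: bin edges touch, so there are no gaps
for i, count in enumerate(counts):
    low = bin_start + i * bin_width
    high = low + bin_width
    print(f"{low:.0f}-{high:.0f} | {'#' * count}")
```

Each bin starts exactly where the previous one ends, which is why a histogram has no gaps between its rectangles.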
Central tendencies of BMI
• Central tendencies of BMI (kg/m²) among our 1,500 participants
Median ≈ Mean
Normally distributed (roughly)
Distribution of data
Mean: 22.95
Median: 23.08
From Biology For Life: https://www.biologyforlife.com/uploads/2/2/3/9/22392738/c101b0da6ea1a0dab31f80d9963b0368_orig.png
Central tendencies of BMI
Mean < Median < Mode
Slightly negatively skewed
Description of continuous data
• Describe the central tendency
• Average of data
• Describe the dispersion
• Spread of data
Dispersion (Range)
• Range is the difference between the maximum and the minimum
values in a list of data
• [1, 1, 1, 3, 4, 4, 7, 7, 7]: Range is 6
• Affected by extreme values
• [1, 1, 1, 3, 4, 4, 7, 7, 77777]: Range is 77776
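The range needs only the built-in max() and min() functions. A minimal sketch with the two lists from the slide (variable names are illustrative):

```python
values = [1, 1, 1, 3, 4, 4, 7, 7, 7]
with_extreme = [1, 1, 1, 3, 4, 4, 7, 7, 77777]

# Range = maximum - minimum
range_values = max(values) - min(values)
range_extreme = max(with_extreme) - min(with_extreme)

print(range_values, range_extreme)
```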
Dispersion (Interquartile range)
• Interquartile range (IQR) summarises the
spread of the middle 50% of data in an
ordered list
• Not affected by extreme values
• Difference between the upper (Q3) and the
lower (Q1) quartiles in a list of ordered data
• Q3: Midway between the median (Q2) and the maximum
• Q1: Midway between the minimum and Q2
Example 1: [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
Min = 1, Q1 = 2, Q2 = 4, Q3 = 7, Max = 8
IQR = 7 – 2 = 5
Example 2: [1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9]
Min = 1, Q1 = 2, Q2 = 4.5, Q3 = 7, Max = 9
IQR = 7 – 2 = 5
There are many ways to calculate quartiles!
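One way to reproduce the quartiles in the two examples is to take the medians of the lower and upper halves of the ordered list. The helper quartiles_by_halves below is a hypothetical function written for this sketch, not a library call. For comparison, the sketch also uses statistics.quantiles with the "inclusive" method, which interpolates and gives a different IQR for the first list.

```python
from statistics import median, quantiles


def quartiles_by_halves(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    When the list has an odd length, the middle value is left out of both halves.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


list_a = [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
list_b = [1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9]

q1_a, q3_a = quartiles_by_halves(list_a)
q1_b, q3_b = quartiles_by_halves(list_b)
iqr_a = q3_a - q1_a
iqr_b = q3_b - q1_b

# A different, interpolation-based method can give a different answer
q1_alt, _, q3_alt = quantiles(list_a, n=4, method="inclusive")
iqr_alt = q3_alt - q1_alt

print(iqr_a, iqr_b, iqr_alt)
```

Statistical packages differ in their default quartile method, so it helps to report which one was used.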
Interquartile range of BMI
• IQR of BMI (kg/m²) among our 1,500 participants
Presenting data with box plot (1)
• Box plot (Box and whisker plot)
• Shows at least 5 pieces of summary information
about a list of data:
• Median = Horizontal line in the box
• Upper quartile = Top edge of the box
• Lower quartile = Bottom edge of the box
• Maximum = Top of the upper whisker
• Minimum = Bottom of the lower whisker
From the University of Newcastle: https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/images/Box_and_whiskers_explanation_inkscape(2).png
Presenting data with box plot (2)
• Box plot (Box and whisker plot)
• Always check whether there are outliers
• Observations that are far away from the others
• Common definitions of outliers:
• Lower outlier(s) < Q1 − (1.5*IQR)
• Upper outlier(s) > Q3 + (1.5*IQR)
• Remember to amend the minimum and maximum
values on the plot
• Minimum value becomes the value right above the
cut-off for lower outliers
• Maximum value becomes the value right below the
cut-off for upper outliers
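The outlier rule above can be sketched in Python as follows. The data are hypothetical (one extreme value, 30, is added on purpose), and the quartiles come from statistics.quantiles with the "exclusive" method, which is only one of several possible quartile definitions.

```python
from statistics import quantiles

# Hypothetical data with one extreme value (30)
data = sorted([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 30])

q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Cut-offs for outliers: Q1 - 1.5*IQR and Q3 + 1.5*IQR
lower_cutoff = q1 - 1.5 * iqr
upper_cutoff = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_cutoff or x > upper_cutoff]

# The whiskers end at the most extreme values that are not outliers
inside = [x for x in data if lower_cutoff <= x <= upper_cutoff]
whisker_min = min(inside)
whisker_max = max(inside)

print(outliers, whisker_min, whisker_max)
```

Here the top whisker stops at 7 and the value 30 would be drawn as a separate point.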
Dispersion (Standard deviation) (1)
• Standard deviation (SD) describes the
spread of data around the mean: roughly the
typical distance between the mean and
each observation
• The larger the SD, the more spread out the data
• We use all data to calculate SD
• (Think about how we calculate range and IQR)
• Affected by extreme values
From Cuemath: https://d138zd1ktt9iqe.cloudfront.net/media/seo_landing_files/standard-deviation-formula-1626765976.png
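A step-by-step sketch of the SD calculation in Python. It assumes the sample formula (dividing by n – 1), which is what statistics.stdev uses; the formula shown in the image may instead divide by n, which corresponds to statistics.pstdev. The lists reuse the earlier examples.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

values = [1, 1, 1, 3, 4, 4, 7, 7, 7]
with_extreme = [1, 1, 1, 3, 4, 4, 7, 7, 77777]

# Step 1: distance of every observation from the mean, squared
m = mean(values)
squared_deviations = [(x - m) ** 2 for x in values]

# Step 2: average the squared deviations (n - 1 for a sample), then take the root
sample_sd = sqrt(sum(squared_deviations) / (len(values) - 1))

# The library functions give the same sample SD and the population SD (divide by n)
library_sd = stdev(values)
population_sd = pstdev(values)

# Every observation enters the calculation, so one extreme value inflates the SD
sd_extreme = stdev(with_extreme)

print(round(sample_sd, 2), round(population_sd, 2), round(sd_extreme, 2))
```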
Dispersion (Standard deviation) (2)
Larger SD:
• Flatter distribution
Smaller SD:
• Narrower distribution
Dispersion (Standard deviation) (3)
• The 68–95–99.7 Rule
• About 68% of data fall within ± 1*SD of the mean
• About 95% of data fall within ± 2*SD of the mean
• About 99.7% of data fall within ± 3*SD of the mean
• (Applicable to normal distribution)
Mean = 22.95
Mean + 1*SD = 26.42
Mean + 2*SD = 29.89
Mean – 1*SD = 19.48
Mean – 2*SD = 16.01
Our data may not completely fulfil this rule
because their distribution is slightly skewed
From Biology For Life: https://www.biologyforlife.com/uploads/2/2/3/9/22392738/sd2_orig.png
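The intervals on this slide can be recomputed from the reported mean (22.95) and SD (3.47). The sketch below also uses statistics.NormalDist to show the share of a perfectly normal distribution that lies within 1, 2 and 3 SDs of the mean; this is a theoretical check and does not describe the actual trial data.

```python
from statistics import NormalDist

mean_bmi = 22.95   # reported mean BMI (kg/m²)
sd_bmi = 3.47      # reported SD of BMI (kg/m²)

standard_normal = NormalDist()   # mean 0, SD 1

intervals = {}
shares = {}
for k in (1, 2, 3):
    # Interval: mean ± k*SD
    intervals[k] = (round(mean_bmi - k * sd_bmi, 2), round(mean_bmi + k * sd_bmi, 2))
    # Share of a normal distribution within ± k SDs of its mean, in %
    shares[k] = round(100 * (standard_normal.cdf(k) - standard_normal.cdf(-k)), 1)

for k in (1, 2, 3):
    print(f"± {k}*SD: {intervals[k]} covers about {shares[k]}% if the data are normal")
```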
Dispersion of BMI
• Dispersion of BMI (kg/m²) among our 1,500 participants
Measurement | Dispersion | Interpretation
Range | 23.55 | The difference between the highest and the lowest BMI
IQR | 4.67 | The range of the middle 50% of BMI data around the median
SD | 3.47 | About 95% of BMI data fall between 16.01 and 29.89 (mean ± 2*SD), assuming a roughly normal distribution
Our dataset
Label | Variable | Type of variable
Sex | Sex | Nominal
BMI | Body mass index | Continuous
Edu_level | Educational level | Ordinal
No_pill | Number of pills | Discrete
How do we describe sex, educational level,
and number of pills?
Description of non-continuous data
• Frequency and percentage are useful in describing non-continuous
variables
• Modes can also be used to show the most common category
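Frequencies, percentages and the modal category can be sketched with collections.Counter from the Python standard library. The eight educational levels below are made up for illustration; they are not the trial data.

```python
from collections import Counter

# Hypothetical educational levels of eight participants
edu_levels = [
    "Primary", "Secondary", "Secondary", "Tertiary",
    "Secondary", "Tertiary", "Primary", "Secondary",
]

# Frequency of each category
counts = Counter(edu_levels)

# Percentage of each category
total = len(edu_levels)
percentages = {level: 100 * count / total for level, count in counts.items()}

# Mode: the most common category
modal_level, modal_count = counts.most_common(1)[0]

for level in ("Primary", "Secondary", "Tertiary"):
    print(f"{level}: n = {counts[level]} ({percentages[level]:.1f}%)")
print("Mode:", modal_level)
```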
Presenting data with bar chart
• Bar chart
• Shows the distribution of observations in
different categories of a variable where every
observation belongs to one category
• Each category is given its own bar, and the
length of the bar is proportional to the frequency
of observations within that category
• (How is it different from histogram?)
• For stacked bar charts:
• Always show the % contribution of each sub-bar
• Avoid showing > 3 sub-bars in each population
From National Records of Scotland:
https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/population/population-estimates/mid-year-population-estimates/mid-2021
Presenting data with pie chart
• Pie chart
• Shows the distribution of observations in
different categories of a variable where every
observation belongs to one category
• Area of each slice is proportional to the frequency
of observations within that category
• Only useful when there are ≥ 3 categories
• Becomes hard to read if there are > 10 categories
• Please, never use 3D pie charts
• (They are not beautiful at all and are sometimes misleading)
Categorising continuous data (1)
• We may categorise our continuous variable
according to pre-specified rules
• For better communication
• For decision-making
BMI (kg/m²), four categories:
• Underweight: < 18.5 kg/m²
• Normal: 18.5 to 24.9 kg/m²
• Overweight: 25.0 to 29.9 kg/m²
• Obese: ≥ 30.0 kg/m²
BMI (kg/m²), two categories:
• Not obese: < 30 kg/m²
• Obese: ≥ 30 kg/m²
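The four-category rule can be written as a small Python function. bmi_category is a hypothetical helper for this sketch. It treats each category as running up to, but not including, the next cut-off, so that a value such as 24.95 is still classified; that boundary handling is an assumption, not something stated on the slide.

```python
def bmi_category(bmi):
    """Return the four-level BMI category for a BMI value in kg/m²."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25.0:
        return "Normal"
    if bmi < 30.0:
        return "Overweight"
    return "Obese"


# Hypothetical BMI values, including values at and near the cut-offs
examples = [17.9, 18.5, 24.95, 25.0, 29.9, 30.0]
categories = [bmi_category(bmi) for bmi in examples]

for bmi, category in zip(examples, categories):
    print(bmi, category)
```

Note how 29.9 and 30.0 end up in different categories although they are almost identical: this is the loss of information that categorisation brings.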
Categorising continuous data (2)
• Loss of information
• Cut-off values may be arbitrary
• If we must categorise, make sure that we:
• also provide the central tendency and dispersion
of the continuous variable
• clearly state the cut-off values and their
justifications
Summary (1)
• Continuous variable
• Contains data that lie on a continuum
• Can take any values
• Discrete variable
• Contains data that do not lie on a continuum
• Can only take integers
• Ordinal variable
• Contains data that take any categories
• There is an intrinsic ordering of the categories
• Nominal variable
• Contains data that take any categories
• There is no intrinsic ordering of the categories
[Diagram: Variables → Quantitative (Continuous, Discrete) / Qualitative (Ordinal, Nominal)]
Summary (2)
• Continuous variable
• Central tendency summarised by median and mean
• Dispersion summarised by IQR (and range) and SD
• Visually presented by histogram and box plot
• Non-continuous variable
• Observations summarised by frequency and percentage, and mode
• Visually presented by bar chart and pie chart

OER Descriptive Statistics (University of Edinburgh)

Editor's Notes

  1. After today's session, you will be able to understand different types of variables, calculate the central tendency and dispersion of continuous data, as well as present data with appropriate diagrams.
  2. Here are the three reference books that I used to prepare this session. You may find their e-version on our "DiscoverEd". I am not going to test you on these books, but they are very useful in your learning journey, especially when you are interested in biostatistics and epidemiology.
  3. I wonder if you can tell me the definition of statistics without telling me that it is mathematics or that it is about drawing lots and picking a ball out of a basket of balls. This is one of the definitions from Essential Medical Statistics that I think is worth remembering. Statistics does not only help us summarise, present, and interpret data, but also allows us to estimate the magnitude of associations and test hypotheses with those data. Both elements are relevant to the REBM course, but in this session, we are going to focus on the first element.
  4. With that definition, we can divide statistics into descriptive statistics and inferential statistics. Descriptive statistics aims to summarise and describe the behaviour of data in a given dataset. It may involve calculating the mean, standard deviation, median, interquartile range, etc. Inferential statistics aims to make predictions and test hypotheses with the data in a dataset. Relevant methods include linear and logistic regressions, chi-square tests, etc.
  8. Before moving further, let us imagine we are going to sell a drug that makes people moo like Highland cattle. For those of you who are not familiar with Highland cattle, they are pretty much like poodles the size of a cow. To test the drug, we conducted a randomised controlled trial on 1500 people in Scotland. One day, our research assistant sent us a dataset containing the trial results. We have several variables in hand right now, including sex, BMI, educational level, and number of pills necessary to trigger mooing.
  9. This is the dataset illustrated in SPSS. As you may realise, we have four types of variables here. "Sex" can only take "male" or "female", with no ordering between them, so it is a nominal variable. "BMI" can take any value on a continuum, so it is continuous. "Educational level" can take "primary", "secondary", and "tertiary", and there is an intrinsic ordering between them, so it is ordinal and is presented in numerical format where primary takes 1, secondary takes 2, and tertiary takes 3. Finally, "number of pills" can only take whole numbers, so it is discrete. Here comes the question: how do we describe the results on BMI to the audience?
  10. Let us focus on describing continuous data for now. First, we may describe the central tendency of BMI using the median, mean, and mode. Each of them gives us a kind of average of the data.
  11. The median is simply the midway value of a list of ordered data. The list can be in ascending or descending order, but it is easier for us to read when it is ascending. I avoid using the words "middle value" or "middle number" to define the median because there may be two middle values in a list. In the first example, we have "4" as the middle number, or median, in a list containing nine numbers. In the second example, we have two middle numbers, "3" and "4". In order to get the median, we need to take the average of the two middle values, which is "3.5". The median divides the list of data into an upper half and a lower half. Because it concerns only the midway value, it is not affected by extreme values. In the example, if I change the first value from "1" to "-11111" and the last value from "7" to "99999", the median stays the same.
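The median's resistance to extreme values can be checked with a short Python sketch. The nine-number list below is an assumption, chosen only to match the figures quoted in these notes (a median of 4 and a sum of 35); the slide's actual list is not reproduced here.

```python
from statistics import median

# Hypothetical nine-number list (an assumption, not the slide's own data).
values = [1, 1, 2, 2, 4, 4, 7, 7, 7]

# Swap the smallest and largest values for extreme ones.
extreme = [-11111] + values[1:-1] + [99999]

median_original = median(values)
median_extreme = median(extreme)

# With an even count, median() averages the two middle values (3 and 4 here).
median_even = median([1, 2, 3, 4, 5, 6])

print(median_original, median_extreme, median_even)
```

Only the position of the midway value matters, so replacing the two end values leaves the result unchanged.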
  12. The mean is more straightforward. It is the sum of a list of data divided by the total number of data. We need not place the numbers in order as we do when finding the median. We simply add all the numbers up and divide by the count of the numbers. As you see in the example, we get "35" when we add the nine numbers up. We then get 3.888..., recurring, when we divide 35 by 9. Because we use all of the numbers in the list, the mean is affected by extreme values. For example, if I change the last number from "7" to "77777", the mean will be 8645.
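The same arithmetic can be sketched in Python. The list is again a hypothetical one that merely reproduces the quoted sum of 35 over nine numbers; it is not taken from the slide.

```python
from statistics import mean

# Hypothetical nine-number list whose sum is 35 (an assumption).
values = [1, 1, 2, 2, 4, 4, 7, 7, 7]

total = sum(values)            # sum of the data
mean_original = mean(values)   # total divided by the count of numbers

# Replace the last 7 with an extreme value and recompute.
with_outlier = values[:-1] + [77777]
mean_outlier = mean(with_outlier)

print(total, mean_original, mean_outlier)
```

A single extreme value moves the mean from under 4 to 8645, which is why the mean is described as sensitive to outliers.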
  13. The mode is the value that occurs most often in a list of data. In our example, we have two "1"s, two "4"s, but three "7"s, so the mode of this list is "7". The mode is not very useful in medical research for two reasons. First, we may have more than one modal value. If we take away one "7" in our example list, we will end up with three modal values, which are "1", "4", and "7". Second, it is more relevant for integers, or whole numbers. For example, the mode of our BMI variable is 23.48, but it occurs fewer than 50 times among the 1500 observations. In other words, it does not give us much information about the behaviour of the variable. Of course, you can round the data in order to achieve a more representative mode. Yes, you can, but you will lose much of the information in your dataset. For example, if we round our BMI data, we will get a mode of "24", and the median and mean will change accordingly and become the same.
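Both weaknesses of the mode can be illustrated with a minimal Python sketch. The two lists below are assumptions made up for illustration: the first has two 1s, two 4s and three 7s as described, and the second is a handful of invented BMI-like values, not the trial data.

```python
from statistics import mode, multimode

# Hypothetical list with two 1s, two 4s and three 7s (an assumption).
values = [1, 1, 3, 4, 4, 6, 7, 7, 7]
single_mode = mode(values)

# Remove one 7: three values now tie for most frequent.
tied_modes = multimode(values[:-1])

# Hypothetical continuous measurements: every value is unique,
# so every value is "modal" and the mode says very little.
bmi = [23.48, 23.51, 24.02, 23.97, 24.31, 22.60]
modes_raw = multimode(bmi)

# Rounding to whole numbers gives a clearer mode but discards detail.
mode_rounded = mode(round(v) for v in bmi)

print(single_mode, tied_modes, len(modes_raw), mode_rounded)
```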
  14. When we talk about describing continuous data, we must talk about presenting the data visually to the audience. As you might have learnt in high school, the most popular diagram for illustrating continuous data is the histogram. It shows the distribution of data by presenting them in rectangles, which are called "bins", corresponding to intervals along the x-axis. The heights of the bins are proportional to the frequencies (or counts) of the observations. Okay, please be reminded that "bins" are not "bars" and, strictly speaking, there are no gaps between bins as the intervals lie on a continuum.
  15. Here, we have the histogram of our BMI data for the 1500 participants. You may see in the diagram that the data are approximately normally distributed, which means that the bins lie on the x-axis symmetrically like the shape of a bell. We can therefore fit a normal distribution curve, or a bell-shaped curve, to the histogram. The median of the data is roughly equal to the mean of the data in such a normal distribution.
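How observations are counted into bins can be sketched in a few lines of Python. The ten values and the bin width of 2 are assumptions for illustration only; a plotting library would normally do this step for you.

```python
from collections import Counter

# Hypothetical BMI-like values (an assumption, not the trial data).
bmi = [18.2, 19.7, 21.4, 22.0, 22.9, 23.5, 24.1, 24.8, 26.3, 29.0]
bin_width = 2  # assumed width; each bin covers [lower, lower + bin_width)

# Map every value to the lower edge of its bin, then count per bin.
counts = Counter(int(v // bin_width) * bin_width for v in bmi)

# Text histogram: adjacent intervals touch, so there are no gaps between bins.
for lower in sorted(counts):
    print(f"[{lower}, {lower + bin_width}): " + "#" * counts[lower])
```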
  16. More on the distribution of data. We may first take a look at the diagram on the left. When we have data that tend to cluster on the right-hand side of the x-axis, or when we see the tail of the distribution curve on the left-hand side, we would say the data are "negatively skewed". In this case, the three measures of central tendency are ordered "mean", "median", and "mode", from the smallest to the largest. When we have perfectly normally distributed data, we have the same mean, median, and mode. And finally, when we have data that tend to cluster on the left-hand side of the x-axis, with the tail on the right-hand side, we would say the data are "positively skewed", and the three measures are ordered "mode", "median", and "mean", from the smallest to the largest. We may see that the median is always in the middle.
  17. Let us look back at our BMI data. Actually, the three measures of central tendency are not exactly identical and, from the smallest to the largest, are in the order of "mean", "median", and "mode". Therefore, our data are slightly negatively skewed, even though I forced a normal distribution curve onto the histogram.
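The ordering of the three measures under skew can be checked on two small made-up lists. Both lists are assumptions constructed for illustration: one with a long tail to the left and one with a long tail to the right.

```python
from statistics import mean, median, mode

# Hypothetical negatively skewed data: most values are high, tail on the left.
left_skewed = [1, 5, 6, 7, 7, 8, 8, 8, 9]

# Hypothetical positively skewed data: most values are low, tail on the right.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9]

for name, data in [("negative", left_skewed), ("positive", right_skewed)]:
    print(name, round(mean(data), 2), median(data), mode(data))
```

In both lists the median sits between the mean and the mode, and the mean is pulled furthest towards the tail.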
  19. First, we have "range". Range is the difference between the maximum value and the minimum value in a list of data. As you may see in the example, the largest number in the list is 7 and the smallest is 1; 7 minus 1 is 6, so the range is 6. Very simple. Please be reminded that range is affected by extreme values. For example, if we change the largest number from 7 to 77777 in the list, the range becomes 77776.
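As a minimal Python sketch of the range calculation, using a hypothetical list whose smallest value is 1 and largest is 7 (the list itself is an assumption):

```python
# Hypothetical list with minimum 1 and maximum 7 (an assumption).
values = [1, 1, 2, 2, 4, 4, 7, 7, 7]

# Range = maximum value minus minimum value.
range_original = max(values) - min(values)

# One extreme value is enough to change the range completely.
with_outlier = values[:-1] + [77777]
range_outlier = max(with_outlier) - min(with_outlier)

print(range_original, range_outlier)
```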
  21. Here we have all the information on the calculation of the IQR of our BMI data. Our upper quartile is 25.33 and lower quartile is 20.66. Therefore, the IQR is 4.67.
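The IQR arithmetic can be sketched in Python as below. The quartile values 25.33 and 20.66 are the ones quoted above; the nine-number list is a hypothetical example, and `statistics.quantiles` is only one of several quartile conventions, so its results may differ slightly from those of SPSS.

```python
from statistics import quantiles

# IQR from the quoted BMI quartiles: upper quartile minus lower quartile.
upper_quartile = 25.33
lower_quartile = 20.66
bmi_iqr = round(upper_quartile - lower_quartile, 2)

# Quartiles of a hypothetical list (an assumption), using Python's default method.
values = [1, 1, 2, 2, 4, 4, 7, 7, 7]
q1, q2, q3 = quantiles(values, n=4)
list_iqr = q3 - q1

print(bmi_iqr, q1, q2, q3, list_iqr)
```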
  25. As I said in the last slide, the larger the SD, the more spread out the data, and the smaller the SD, the less spread out the data. The diagram above shows the histogram of data with an SD of 5.15. The one below shows the histogram of data with an SD of 3.47. With more dispersed data, we would have a larger SD and a flatter distribution curve. With less dispersed data, we would have a smaller SD and a narrower distribution curve. Very intuitive.
  26. Okay, you may ask what else standard deviations tell us about the behaviour of data besides their spread. Here comes a more critical concept of SD, which is the 68–95–99.7 rule (often shortened to 68–95–99). It tells us that the area within 1 SD above and below the mean contains about 68% of all our data, the area within 2 SD contains about 95%, and the area within 3 SD contains about 99.7%. However, this rule is only applicable to normally distributed data. Because the distribution of our BMI data is slightly negatively skewed, our data do not follow this rule exactly.
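The percentages in this rule can be reproduced from the standard normal distribution. The sketch below uses `NormalDist` from Python's standard library and assumes perfectly normal data, which real data such as the BMI variable will only approximate.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of a normal distribution lying within k SD of the mean.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, proportion in coverage.items():
    print(f"within {k} SD: {proportion:.2%}")
```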
  27. Let us take a look at how we can describe the dispersion of our BMI data. The range of the data is 23.55. This tells us the difference between the highest BMI value and the lowest BMI value. The IQR is 4.67. This tells us the range of the middle 50% of BMI data around the median. The SD is 3.47. This tells us that roughly 95% of the BMI values fall between 16.01 and 29.89, that is, within 2 SD of the mean.
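The quoted interval can be checked against the SD with a little arithmetic. Note that the mean itself is not stated in this note; the sketch below infers it as the midpoint of the interval, which is an assumption.

```python
# Figures quoted in the note.
sd = 3.47
lower, upper = 16.01, 29.89

# Inferred, not quoted: the mean is taken as the midpoint of the interval.
implied_mean = round((lower + upper) / 2, 2)

# How many SDs the interval extends on each side of that midpoint.
sds_each_side = round((upper - lower) / 2 / sd, 2)

print(implied_mean, sds_each_side)
```

The interval works out as the implied mean plus and minus 2 SD, which is where the figure of about 95% comes from.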
  35. Furthermore, for a continuous variable, the central tendency can be summarised by the median and mean, and the dispersion by the interquartile range and standard deviation. We may present the variable with histograms and box plots. For a non-continuous variable, the observations can be summarised by frequency and percentage, and by the mode. We may present the variable with bar charts and pie charts, but not 3D pie charts.
  36. Again, here are the three reference books that I used to prepare this session. You may find their e-versions on our "DiscoverEd".
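Summarising a non-continuous variable by frequency, percentage and mode can be sketched as follows. The eight educational-level observations are invented for illustration and are not the trial data.

```python
from collections import Counter

# Hypothetical observations of an ordinal variable (an assumption).
education = [
    "primary", "secondary", "tertiary", "secondary",
    "tertiary", "secondary", "primary", "secondary",
]

counts = Counter(education)   # frequency of each category
n = len(education)

# Percentage of observations falling in each category.
percentages = {level: 100 * count / n for level, count in counts.items()}

# The mode is the most frequent category.
mode_level = counts.most_common(1)[0][0]

for level, count in counts.items():
    print(f"{level}: {count} ({percentages[level]:.1f}%)")
print("mode:", mode_level)
```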