This Open Educational Resource (OER) introduces the concepts of quantitative and qualitative statistics, central tendency (mean, median, and mode), and dispersion (standard deviation and interquartile range).
This document is created for PgCAP Digital Education.
OER Descriptive Statistics (University of Edinburgh)
1. Summarising research data
using descriptive statistics
Open Educational Resource
Dr Leonard Ho
ACRC Systematic Reviewer, Usher Institute
2. Objectives
• To understand different types of variables
• To calculate the central tendency and dispersion of continuous data
• To present data with appropriate diagrams
3. Useful books
From John Wiley & Sons: https://media.wiley.com/product_data/coverImage300/19/08654287/0865428719.jpg From Amazon UK: https://m.media-amazon.com/images/I/41B+txZryWL._SX283_BO1,204,203,200_.jpg From Amazon UK: https://m.media-amazon.com/images/I/61L9H502OYL.jpg
4. Statistics
• “The science of collecting, summarising, presenting and
interpreting data, and of using them to estimate the magnitude
of associations and test hypotheses”
From John Wiley & Sons: https://media.wiley.com/product_data/coverImage300/19/08654287/0865428719.jpg
5. Two types of statistics
• Descriptive statistics
• Summarising and describing the behaviour of data in a dataset
• Mean, standard deviation…
• Inferential statistics
• Making predictions and testing hypotheses with the data in a dataset
• Regressions, chi-square tests…
6. Types of variables
• A variable is a quantity or characteristic that
can be measured or observed
• A quantitative (numeric) variable contains data
that describe a measurable quantity
• A qualitative (categorical) variable contains
data that describe a characteristic
Variables: Quantitative, Qualitative
7. Quantitative variable
• Continuous variable
• Contains data that lie on a continuum and can
take any values
• Height, weight...
• Very common in scientific research
• Discrete variable
• Contains data that do not lie on a continuum and
can only take whole numbers (integers)
• Number of strokes per day…
• “Not splittable”
Variables: Quantitative (Continuous, Discrete), Qualitative
8. Qualitative variable
• Ordinal variable
• Contains data that take any categories and there
is an intrinsic ordering of the categories
• Educational level (Primary < Secondary <
Tertiary)…
• Nominal variable
• Contains data that take any categories but there is
no intrinsic ordering of the categories
• Location (Aberdeen, Edinburgh, Glasgow)…
Variables: Quantitative (Continuous, Discrete), Qualitative (Ordinal, Nominal)
9. “Mooing pill”
• We are going to sell a drug that makes people moo like Highland
cattle
• We conducted a randomised controlled trial on 1,500 people in
Scotland
• Our dataset contains the following variables:
• Sex (Male / Female)
• Body mass index (kg/m2)
• Educational level (Primary / Secondary / Tertiary)
• Number of pills necessary to trigger mooing
From Visit Scotland: https://www.visitscotland.com/blog/wp-content/uploads/2019/10/HC-on-coastal-road.jpg
10. Our dataset
Label | Variable | Type of variable
Sex | Sex | Nominal
BMI | Body mass index | Continuous
Edu_level | Educational level | Ordinal
No_pill | Number of pills | Discrete
How do we describe BMI?
11. Description of continuous data
• Describe the central tendency
• Average of data
• Describe the dispersion
• Spread of data
12. Central tendency (Median)
• Median is the midway value of a list of ordered data (ascending or
descending)
• “Midway” refers to the middle number or the average of two middle numbers
• [1, 1, 1, 3, 4, 4, 7, 7, 7]: Median is 4
• [1, 1, 1, 3, 3, 4, 4, 7, 7, 7]: Median is 3.5
• Divides the list of data into upper and lower halves
• Not affected by extreme values
• [–11111, 1, 1, 3, 4, 4, 7, 7, 99999]: Median is still 4
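The three median examples on this slide can be checked with a minimal Python sketch. It uses only the standard library's `statistics.median`; the variable names are ours and chosen for illustration.

```python
from statistics import median

# The three lists shown on the slide
odd_list = [1, 1, 1, 3, 4, 4, 7, 7, 7]            # nine values: one middle number
even_list = [1, 1, 1, 3, 3, 4, 4, 7, 7, 7]        # ten values: two middle numbers
with_extremes = [-11111, 1, 1, 3, 4, 4, 7, 7, 99999]

median_odd = median(odd_list)            # the middle value
median_even = median(even_list)          # average of the two middle values
median_extremes = median(with_extremes)  # unchanged by the extreme values

print(median_odd, median_even, median_extremes)  # 4 3.5 4
```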
13. Central tendency (Mean)
• Mean is the sum of a list of data divided by the total number of data
• [1, 1, 1, 3, 4, 4, 7, 7, 7]: Mean is 3.89
• Affected by extreme values
• [1, 1, 1, 3, 4, 4, 7, 7, 77777]: Mean is 8645
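As a rough check of the two means above, the sketch below computes the mean by hand (sum divided by count) and with `statistics.mean`. The variable names are illustrative only.

```python
from statistics import mean

values = [1, 1, 1, 3, 4, 4, 7, 7, 7]
with_extreme = [1, 1, 1, 3, 4, 4, 7, 7, 77777]

# Mean = sum of the data divided by the number of data points
mean_values = sum(values) / len(values)   # 35 / 9
mean_extreme = mean(with_extreme)         # 77805 / 9

print(round(mean_values, 2))  # 3.89
print(mean_extreme)           # 8645
```

A single extreme value moves the mean from about 3.89 to 8645, while the median of the same list would stay at 4.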
14. Central tendency (Mode)
• Mode is the value that occurs most often in a list of data
• [1, 1, 3, 4, 4, 7, 7, 7]: Mode is 7
• May have > 1 modal value
• More relevant for integers (like discrete variables)
• If we round data, we would lose much information!
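The mode example, and the case of several modal values, can be sketched with `statistics.multimode` (available from Python 3.8). The second list is the slide's list with one "7" taken away, as described in the speaker notes.

```python
from statistics import multimode

values = [1, 1, 3, 4, 4, 7, 7, 7]
modes = multimode(values)            # a single most frequent value

# Removing one "7" leaves 1, 4 and 7 tied with two occurrences each
one_seven_removed = [1, 1, 3, 4, 4, 7, 7]
modes_tied = multimode(one_seven_removed)

print(modes)       # [7]
print(modes_tied)  # [1, 4, 7]
```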
(Figure panels: calculated from original data; calculated from rounded data)
15. Presenting data with histogram
• Histogram
• Shows the distribution of data by plotting
the data in rectangles, “bins” (not bars),
corresponding to categories along the x-
axis
• The bins have heights that are
proportional to the frequencies of
observations
• No gaps between bins because the
categories are on a continuum!
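To make the idea of bins concrete, here is a minimal sketch that counts observations per bin and prints a text histogram. The BMI values, the bin width of 2 kg/m2 and the starting point of 18 kg/m2 are all invented for illustration; they are not taken from the trial dataset.

```python
from collections import Counter

# Hypothetical BMI values (kg/m2), invented for illustration only
bmi = [18.2, 19.7, 20.4, 21.1, 21.9, 22.5, 22.8, 23.1, 23.6, 24.2, 24.9, 26.3, 27.8]

bin_start = 18.0   # assumed lower edge of the first bin
bin_width = 2.0    # assumed bin width

def bin_index(value):
    """Return the index of the bin [start + i*width, start + (i+1)*width) holding value."""
    return int((value - bin_start) // bin_width)

counts = Counter(bin_index(v) for v in bmi)

# Adjacent bins share an edge, so there are no gaps between them
for i in range(max(counts) + 1):
    lower = bin_start + i * bin_width
    print(f"[{lower:.1f}, {lower + bin_width:.1f}) " + "#" * counts.get(i, 0))
```

Each row of "#" plays the role of a bin whose length is proportional to the frequency of observations in that interval.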
16. Central tendencies of BMI
• Central tendencies of BMI (kg/m2) among our 1,500 participants
Median ≈ Mean
Normally distributed (roughly)
17. Distribution of data
Mean: 22.95
Median: 23.08
From Biology For Life: https://www.biologyforlife.com/uploads/2/2/3/9/22392738/c101b0da6ea1a0dab31f80d9963b0368_orig.png
19. Description of continuous data
• Describe the central tendency
• Average of data
• Describe the dispersion
• Spread of data
20. Dispersion (Range)
• Range is the difference between the maximum and the minimum
values in a list of data
• [1, 1, 1, 3, 4, 4, 7, 7, 7]: Range is 6
• Affected by extreme values
• [1, 1, 1, 3, 4, 4, 7, 7, 77777]: Range is 77776
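The two range examples reduce to one subtraction each, as this short sketch shows (variable names are ours):

```python
values = [1, 1, 1, 3, 4, 4, 7, 7, 7]
with_extreme = [1, 1, 1, 3, 4, 4, 7, 7, 77777]

# Range = maximum value minus minimum value
range_values = max(values) - min(values)
range_extreme = max(with_extreme) - min(with_extreme)

print(range_values, range_extreme)  # 6 77776
```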
21. Dispersion (Interquartile range)
• Interquartile range (IQR) summarises the
spread of the middle 50% of data in an
ordered list
• Not affected by extreme data
• Difference between the upper (Q3) and the
lower (Q1) quartiles in a list of ordered data
• Q3: Between the maximum and median (Q2)
• Q1: Between the minimum and Q2
Example 1 (11 values): [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
Min = 1, Q1 = 2, Q2 = 4, Q3 = 7, Max = 8
IQR = 7 – 2 = 5
Example 2 (14 values): [1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9]
Min = 1, Q1 = 2, Q2 = 4.5, Q3 = 7, Max = 9
IQR = 7 – 2 = 5
There are many ways to calculate quartiles!
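The sketch below illustrates why the method matters. The helper `quartiles_by_halves` is our own hypothetical function: it takes Q1 and Q3 as the medians of the lower and upper halves (leaving out the median itself when the count is odd). We are assuming this is the approach behind the slide's figures; it does reproduce them. For comparison, `statistics.quantiles` with `method="inclusive"` gives different quartiles for the same list.

```python
from statistics import median, quantiles

def quartiles_by_halves(data):
    """Q1 and Q3 as medians of the lower and upper halves (assumed method)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle value when n is odd so it belongs to neither half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

first = [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
second = [1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9]

q1, q2, q3 = quartiles_by_halves(first)
print(q1, q2, q3, "IQR =", q3 - q1)             # 2 4 7 IQR = 5
print(quartiles_by_halves(second))              # (2, 4.5, 7)

# A different, equally legitimate convention gives other quartiles
inclusive = quantiles(first, n=4, method="inclusive")
print(inclusive, "IQR =", inclusive[2] - inclusive[0])  # [2.5, 4.0, 6.5] IQR = 4.0
```

Statistical packages differ in their default convention, so it helps to state which one was used when reporting an IQR.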
23. Presenting data with box plot (1)
• Box plot (Box and whisker plot)
• Shows at least 5 pieces of summary information
about a list of data:
• Median = Horizontal line in box
• Upper quartile = Top edge of the box
• Lower quartile = Lower edge of box
• Maximum = Top of whisker
• Minimum = Bottom of whisker
From the University of Newcastle: https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/images/Box_and_whiskers_explanation_inkscape(2).png
(Figure labels: whisker, box)
24. Presenting data with box plot (2)
• Box plot (Box and whisker plot)
• Always check whether there are outliers
• Observations that are far away from the others
• Common definitions of outliers:
• Lower outlier(s) < Q1 − (1.5*IQR)
• Upper outlier(s) > Q3 + (1.5*IQR)
• Remember to amend the minimum and maximum
values on the plot
• Minimum becomes the smallest observation at or
above the cut-off for lower outliers
• Maximum becomes the largest observation at or
below the cut-off for upper outliers
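The outlier rule above can be sketched as follows. The list is hypothetical: it is the 11-value quartile example with its maximum replaced by 30 so that one observation is flagged. Quartiles come from `statistics.quantiles` with its default method, which is only one of several conventions.

```python
from statistics import quantiles

# Hypothetical list: the 11-value quartile example with the maximum changed to 30
values = [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 30]

q1, q2, q3 = quantiles(values, n=4)   # default "exclusive" method
iqr = q3 - q1

lower_cutoff = q1 - 1.5 * iqr
upper_cutoff = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_cutoff or v > upper_cutoff]
non_outliers = [v for v in values if lower_cutoff <= v <= upper_cutoff]

# The whiskers stop at the most extreme observations that are not outliers
whisker_min = min(non_outliers)
whisker_max = max(non_outliers)

print(lower_cutoff, upper_cutoff)   # -5.5 14.5
print(outliers)                     # [30]
print(whisker_min, whisker_max)     # 1 7
```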
25. Dispersion (Standard deviation) (1)
• Standard deviation (SD) describes the
spread of data around the mean: roughly, the
typical distance between the mean and
each observation
• The larger the SD, the more spread out the data
• We use all data to calculate SD
• (Think about how we calculate range and IQR)
• Affected by extreme values
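As a minimal sketch of how every observation enters the SD, the code below computes it step by step for the nine-value list used in the earlier examples and compares the result with `statistics.stdev`. It assumes the sample formula (dividing by n − 1); the slide's formula image is not reproduced here, and the population version (`statistics.pstdev`) divides by n instead.

```python
from math import sqrt
from statistics import mean, stdev

values = [1, 1, 1, 3, 4, 4, 7, 7, 7]

m = mean(values)
# Squared distance of every observation from the mean
squared_deviations = [(x - m) ** 2 for x in values]

# Sample SD: divide by n - 1 (assumed convention), then take the square root
sample_sd = sqrt(sum(squared_deviations) / (len(values) - 1))

print(round(sample_sd, 2))         # 2.62
print(round(stdev(values), 2))     # same value from the standard library

# An extreme value inflates the SD because every observation is used
with_extreme = [1, 1, 1, 3, 4, 4, 7, 7, 77777]
sd_extreme = stdev(with_extreme)
```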
From Cuemath: https://d138zd1ktt9iqe.cloudfront.net/media/seo_landing_files/standard-deviation-formula-1626765976.png
27. Dispersion (Standard deviation) (3)
• The 68–95–99.7 Rule
• About 68% of data fall within ± 1*SD of the mean
• About 95% of data fall within ± 2*SD of the mean
• About 99.7% of data fall within ± 3*SD of the mean
• (Applicable to normal distribution)
Mean = 22.95
Mean + 1*SD = 26.42
Mean + 2*SD = 29.89
Mean – 1*SD = 19.48
Mean – 2*SD = 16.01
Our data may not completely fulfil this rule
because their distribution is slightly skewed
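The intervals above follow directly from the reported mean (22.95) and SD (3.47). The sketch below recomputes them and, using `statistics.NormalDist`, shows the share of a perfectly normal distribution that falls within 1, 2 and 3 SD of the mean. The coverage figures are theoretical values for a normal curve, not results from the trial data.

```python
from statistics import NormalDist

mean_bmi = 22.95   # from the slide
sd_bmi = 3.47      # from the slide

standard_normal = NormalDist()   # mean 0, SD 1
intervals = {}
coverage = {}

for k in (1, 2, 3):
    intervals[k] = (mean_bmi - k * sd_bmi, mean_bmi + k * sd_bmi)
    # Theoretical proportion of a normal distribution within +/- k SD
    coverage[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    low, high = intervals[k]
    print(f"+/- {k} SD: {low:.2f} to {high:.2f}, about {coverage[k]:.1%} if normal")
```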
From Biology For Life: https://www.biologyforlife.com/uploads/2/2/3/9/22392738/sd2_orig.png
28. Dispersion of BMI
• Dispersion of BMI (kg/m2) among our 1,500 participants
Measurement | Value (kg/m2) | Interpretation
Range | 23.55 | The difference between the highest and the lowest BMI
IQR | 4.67 | The range of the middle 50% of BMI data around the median
SD | 3.47 | About 95% of BMI data fall between 16.01 and 29.89 (mean ± 2*SD)
29. Our dataset
Label | Variable | Type of variable
Sex | Sex | Nominal
BMI | Body mass index | Continuous
Edu_level | Educational level | Ordinal
No_pill | Number of pills | Discrete
How do we describe sex, educational level,
and number of pills?
30. Description of non-continuous data
• Frequency and percentage are useful in describing non-continuous
variables
• Modes can also be used to show the most common category
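A frequency table with percentages and the modal category can be sketched with `collections.Counter`. The ten educational-level observations below are invented for illustration and do not come from the trial dataset.

```python
from collections import Counter

# Hypothetical observations of the Edu_level variable (illustration only)
edu_level = [
    "Primary", "Secondary", "Tertiary", "Secondary", "Secondary",
    "Tertiary", "Primary", "Secondary", "Tertiary", "Secondary",
]

counts = Counter(edu_level)
total = len(edu_level)
percentages = {category: 100 * n / total for category, n in counts.items()}

# The mode is the most common category
modal_category = counts.most_common(1)[0][0]

for category in ("Primary", "Secondary", "Tertiary"):
    print(f"{category}: n = {counts[category]} ({percentages[category]:.0f}%)")
print("Mode:", modal_category)
```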
31. Presenting data with bar chart
• Bar chart
• Shows the distribution of observations in
different categories of a variable where every
observation belongs to one category
• Each category is given its own bar, and the
length of the bar is proportional to the frequency
of observations within that category
• (How is it different from histogram?)
• For stacked bar charts:
• Always show the % contribution of each sub-bar
• Always avoid showing > 3 sub-bars in each population
From National Records of Scotland:
https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/population/population-estimates/mid-year-population-estimates/mid-2021
32. Presenting data with pie chart
• Pie chart
• Shows the distribution of observations in
different categories of a variable where every
observation belongs to one category
• Area of each slice proportional to the frequency
of observations within that category
• Only useful when there are ≥ 3 categories
• Becomes hard to read if there are > 10 categories
• Please, never use 3D pie charts
• (They are not beautiful at all and sometimes misleading)
33. Categorising continuous data (1)
• We may categorise our continuous variable
according to pre-specified rules
• For better communication
• For decision-making
BMI (kg/m2), four categories:
• Underweight: < 18.5 kg/m2
• Normal: 18.5 to 24.9 kg/m2
• Overweight: 25.0 to 29.9 kg/m2
• Obese: ≥ 30.0 kg/m2
BMI (kg/m2), two categories:
• Not obese: < 30 kg/m2
• Obese: ≥ 30 kg/m2
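Categorising a continuous variable amounts to applying cut-offs in order. The sketch below is one possible implementation using the cut-offs listed above; the function name `bmi_category` is ours, and we assume half-open intervals (for example, "Normal" is 18.5 up to but not including 25.0) so that every value falls in exactly one category.

```python
def bmi_category(bmi):
    """Map a BMI value (kg/m2) to a category using the cut-offs on the slide."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25.0:
        return "Normal"
    if bmi < 30.0:
        return "Overweight"
    return "Obese"

# The mean BMI reported for the trial and a few boundary values
for value in (18.4, 22.95, 25.0, 30.0):
    print(value, bmi_category(value))
```

Note how 18.4 and 18.5 end up in different categories although they are nearly identical, which is the loss of information that categorisation brings.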
34. Categorising continuous data (2)
• Loss of information
• Cut-off values may be arbitrary
• If we must categorise, make sure that we:
• also provide the central tendency and dispersion
of the continuous variable
• clearly state the cut-off values and their
justifications
35. Summary (1)
• Continuous variable
• Contains data that lie on a continuum
• Can take any values
• Discrete variable
• Contains data that do not lie on a continuum
• Can only take integers
• Ordinal variable
• Contains data that take any categories
• There is an intrinsic ordering of the categories
• Nominal variable
• Contains data that take any categories
• There is no intrinsic ordering of the categories
Variables: Quantitative (Continuous, Discrete), Qualitative (Ordinal, Nominal)
36. Summary (2)
• Continuous variable
• Central tendency summarised by median and mean
• Dispersion summarised by IQR (and range) and SD
• Visually presented by histogram and box plot
• Non-continuous variable
• Observations summarised by frequency and percentage, and mode
• Visually presented by bar chart and pie chart
37. Useful books
From John Wiley & Sons: https://media.wiley.com/product_data/coverImage300/19/08654287/0865428719.jpg From Amazon UK: https://m.media-amazon.com/images/I/41B+txZryWL._SX283_BO1,204,203,200_.jpg From Amazon UK: https://m.media-amazon.com/images/I/61L9H502OYL.jpg
Editor's Notes
After today's session, you will be able to understand different types of variables, calculate the central tendency and dispersion of continuous data, as well as present data with appropriate diagrams.
Here are the three reference books that I used to prepare this session. You may find their e-version on our "DiscoverEd". I am not going to test you on these books, but they are very useful in your learning journey, especially when you are interested in biostatistics and epidemiology.
I wonder if you can tell me the definition of statistics without telling me that it is mathematics or that it is about drawing lots and picking a ball out of a basket of balls. This is one of the definitions from Essential Medical Statistics that I think is worth remembering. Statistics not only helps us summarise, present, and interpret data, but also allows us to estimate the magnitude of associations and test hypotheses with those data. Both elements are relevant to the REBM course, but in this session, we are going to focus on the first element.
With that definition, we can divide statistics into descriptive statistics and inferential statistics. Descriptive statistics aims to summarise and describe the behaviour of data in a given dataset. It may involve the calculation of the mean, standard deviation, median, interquartile range, etc. Inferential statistics aims to make predictions and test hypotheses with the data in a dataset. Relevant methods include linear and logistic regressions, chi-square tests, etc.
Before moving further, let us imagine we are going to sell a drug that makes people moo like Highland cattle. For those of you who are not familiar with Highland cattle, they are pretty much like poodles the size of a cow. To test the drug, we conducted a randomised controlled trial on 1,500 people in Scotland. One day, our research assistant sent us a dataset containing the trial results. We have several variables in hand right now, including sex, BMI, educational level, and number of pills necessary to trigger mooing.
This is the dataset illustrated in SPSS. As you may realise, we have four types of variables here. "Sex" can only take "male/female", so it is a nominal variable. "BMI" can take any value on a continuum, so it is continuous. "Educational level" can take "primary", "secondary", and "tertiary", and there is an intrinsic ordering among them, so it is ordinal and is presented in numerical format where primary takes 1, secondary takes 2, and tertiary takes 3. Finally, "number of pills" can only take whole numbers, so it is discrete. Here comes the question: how do we describe the results on BMI to the audience?
Let us focus on describing continuous data for now. First, we may describe the central tendency of BMI using median, mean, and mode. They all tell us the average of the data.
Median is simply the midway value of a list of ordered data. The list can be in ascending or descending order, but it is easier to read when it is ascending. I avoid using the term "middle value" or "middle number" to define the median because there may be two middle values in a list. In the first example, we have "4" as the middle number, or median, in a list containing nine numbers. In the second example, we have two middle numbers, "3" and "4". To get the median, we take the average of the two middle values, which is "3.5". The median divides the list of data into an upper half and a lower half. Because it concerns only the midway value, it is not affected by extreme values. In the example, if I change the first value from "1" to "-11111" and the last value from "7" to "99999", the median stays the same.
Mean is more straightforward. It is the sum of a list of data divided by the total number of data. We need not place the numbers in order as we do when finding the median. We simply add all the numbers up and divide by the count of the numbers. As you see in the example, we get "35" when we add the nine numbers up, and dividing 35 by 9 gives 3.888..., or 3.89 to two decimal places. Because we use all of the numbers in the list, the mean is affected by extreme values. For example, if I change the last number from "7" to "77777", the mean will be 8645.
Mode is the value that occurs most often in a list of data. In our example, we have two "1"s, two "4"s, but three "7"s, so the mode of this list is "7". Mode is not very useful in medical research for two reasons. First, we may have more than one modal value. If we take away one "7" from our example list, we end up with three modal values: "1", "4", and "7". Second, it is more relevant for integers, or whole numbers. For example, the mode of our BMI variable is 23.48, but it occurs fewer than 50 times in the dataset. In other words, it does not give us much information about the behaviour of the variable. Of course, you can round the data to achieve a more representative mode, but you will lose much of the information in your dataset. For example, if we round our BMI data, we will get a mode of "24", and the median and mean will change accordingly and become the same.
When we talk about describing continuous data, we must talk about presenting the data visually to the audience. As you might have learnt at high school, the most popular diagram for illustrating continuous data is the histogram. It shows the distribution of data by presenting them in rectangles, which are called "bins", corresponding to categories along the x-axis. The heights of the bins are proportional to the frequencies (or counts) of the observations. Please be reminded that "bins" are not "bars" and, strictly speaking, there are no gaps between bins as the categories are on a continuum.
Here, we have the histogram of our BMI data for the 1500 participants. You may see in the diagram that the data are sort of normally distributed, which means that the bins lie on the x-axis symmetrically like the shape of a bell. And we can fit a normal distribution curve, or a bell-shaped curve, for the histogram. The median of the data is roughly equal to the mean of the data in such a normal distribution.
More on the distribution of data. We may first take a look at the diagram on the left. When we have data that tend to cluster on the right-hand side of the x-axis, or when we see the tail of the distribution curve on the left-hand side, we say the data are "negatively skewed". In this case, the three measures of central tendency are ordered "mean", "median", and "mode", from the smallest to the largest. When we have perfectly normally distributed data, the mean, median, and mode are the same. And finally, when we have data that tend to cluster on the left-hand side of the x-axis, with the tail on the right-hand side, we say the data are "positively skewed", and the three measures are ordered "mode", "median", and "mean", from the smallest to the largest. We may see that the median is always in the middle.
Let us look back at our BMI data. Actually, the three measures of central tendency are not exactly identical and, from the smallest to the largest, are the mean (22.95), the median (23.08), and the mode (23.48). Therefore, our data are slightly negatively skewed, even though I have fitted a normal distribution curve to them.
First, we have "range". Range is the difference between the maximum value and the minimum value in a list of data. As you may see in the example, the largest number in the list is 7 and the smallest is 1; 7 minus 1 gives 6, so the range is 6. Very simple. Please be reminded that range is affected by extreme values. For example, if we change the largest number from 7 to 77777 in the list, the range becomes 77776.
Here we have all the information on the calculation of the IQR of our BMI data. Our upper quartile is 25.33 and lower quartile is 20.66. Therefore, the IQR is 4.67.
As I said in the last slide, the larger the SD, the more spread the data, and the smaller the SD, the less spread the data. The diagram above shows the histogram of data with an SD of 5.15. The one below shows the histogram of data with an SD of 3.47. With more dispersed data, we would have a larger SD and a flatter distribution curve for the data. However, with less dispersed data, we would have a smaller SD and a narrower distribution curve for the data. Very intuitive.
Okay, you may ask what else standard deviations tell us about the behaviour of data besides their spread. Here comes a more critical concept of SD, which is the 68–95–99.7 rule. It tells us that the area between 1 SD above and below the mean contains about 68% of all our data, the area between 2 SD above and below the mean contains about 95%, and the area between 3 SD above and below the mean contains about 99.7%. However, this rule is only applicable to normally distributed data. Because the distribution of our BMI data is slightly negatively skewed, our data do not fulfil this rule completely.
Let us take a look at how we can describe the dispersion of our BMI data. The range of the data is 23.55. This tells us the difference between the highest BMI value and the lowest BMI value. The IQR is 4.67. This tells us the range of the middle 50% of BMI data around the median. The SD is 3.47. This tells us that about 95% of the BMI values fall between 16.01 and 29.89, that is, within 2 SD of the mean.
Furthermore, for a continuous variable, the central tendency can be summarised by the median and mean, and the dispersion by the interquartile range and standard deviation. We may present the variable with histograms and box plots. For a non-continuous variable, the observations can be summarised by frequency and percentage, and by the mode. We may present the variable with bar charts and pie charts, but not 3D pie charts.
Again, here are the three reference books that I used to prepare this session. You may find their e-version on our "DiscoverEd".