2. WHAT ARE DESCRIPTIVE STATISTICS?
• Descriptive statistics are methods to summarize data
• They allow us to tell something about the data without showing the full dataset
• In practice, computing them is the first thing we do when we get a dataset
• They help us better understand what we’re dealing with
5. DESCRIPTIVE STATISTICS (A ROUGH FRAMEWORK)
• Qualitative variables
  • Counts
  • Percentages
• Quantitative variables
  • Central tendency
    • Mean
    • Median
    • Mode
  • Spread
    • Variance
    • Std Dev
    • Percentiles
6. COUNTS AND PERCENTAGES
• Most basic way to describe qualitative / categorical variables
• Counts are the number of observations in each category
• Percentages express these as a fraction of total observations
$$\text{Percentage of category } x = \frac{\text{No. of observations that are } x}{\text{Total no. of observations}} \times 100$$
7. COUNTS AND PERCENTAGES: EXAMPLE
• Consider the following dataset
• How can we describe gender? Political affiliation?
ID Name Age Gender Income Political affiliation Voted in the last election?
1 Alfred 67 Male $ 28,500 Liberal Yes
2 Peter 27 Male $ 275,000 Conservative Yes
3 George 18 Male $ 31,000 Liberal No
4 Jannet 38 Female $ 39,000 Liberal Yes
5 Meagan 19 Female $ 52,000 Moderate Yes
6 Ivan 35 Male $ 27,000 Conservative No
7 Jenny 78 Female $ 38,000 Conservative Yes
8 Sam 43 Male $ 33,000 Conservative Yes
9 Emily 39 Female $ 37,000 Moderate No
10 Hellen 57 Female $ 43,000 Liberal Yes
8. COUNTS AND PERCENTAGES: EXAMPLE
• There are 5 males and 5 females in the previous dataset
• The most common way to formally present counts and percentages is to
use tables:
Gender No. of people Percentage
Male 5 50%
Female 5 50%
TOTAL: 10 100%
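The gender table can be reproduced with a short Python sketch. The list `genders` is simply the Gender column of the example dataset typed in by hand, and the variable names are illustrative choices, not part of the original slides.

```python
from collections import Counter

# Gender column of the ten-person example dataset, in ID order.
genders = ["Male", "Male", "Male", "Female", "Female",
           "Male", "Female", "Male", "Female", "Female"]

counts = Counter(genders)     # number of observations in each category
total = sum(counts.values())  # total number of observations

# Percentage of category x = observations that are x / total observations * 100
percentages = {category: 100 * n / total for category, n in counts.items()}

for category, n in counts.items():
    print(f"{category:<8}{n:>3}{percentages[category]:>7.0f}%")
print(f"{'TOTAL:':<8}{total:>3}{100:>7.0f}%")
```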
9. CROSS TABULATIONS
• Often we will need to summarize information from two categorical
variables
• For example, how many males are politically liberal?
• This type of table is called a cross tabulation (cross tab)
• One variable will be in rows, while the other in columns
• Consider the cross tab of political affiliation and gender
Political affiliation   Male   Female
Liberal                 2      2
Moderate                0      2
Conservative            3      1
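A cross tab can be tallied directly from the ten-person dataset. The sketch below is one possible approach using only the standard library; the list `people` holds the (political affiliation, gender) pair for each ID, copied from the example dataset, and the names are illustrative.

```python
from collections import Counter

# (political affiliation, gender) for each of the ten people, in ID order.
people = [
    ("Liberal", "Male"), ("Conservative", "Male"), ("Liberal", "Male"),
    ("Liberal", "Female"), ("Moderate", "Female"), ("Conservative", "Male"),
    ("Conservative", "Female"), ("Conservative", "Male"),
    ("Moderate", "Female"), ("Liberal", "Female"),
]

# Each distinct pair is one cell of the cross tab; absent pairs count as 0.
crosstab = Counter(people)

print(f"{'':<14}{'Male':>6}{'Female':>8}")
for affiliation in ["Liberal", "Moderate", "Conservative"]:
    male = crosstab[(affiliation, "Male")]
    female = crosstab[(affiliation, "Female")]
    print(f"{affiliation:<14}{male:>6}{female:>8}")
```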
12. CROSS TABULATIONS: PERCENTAGES
• Counts in cross tabulations are simple
• Percentages are sometimes not obvious
• What percentage we use depends on what our frame of reference is
• For example, are we asking what percentage of males are liberal leaning?
• In this case we will divide the no. of males who are liberal by total no. of
males
• Or what percentage of liberals are males?
• In this case we will divide the no. of males who are liberal by total no. of
liberal people
• In practice, both are correct and which one we use depends on the context
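The two frames of reference give different answers for the same cell. A minimal sketch, assuming the same hand-typed (political affiliation, gender) pairs from the example dataset; the variable names are illustrative.

```python
# (political affiliation, gender) for each of the ten people, in ID order.
people = [
    ("Liberal", "Male"), ("Conservative", "Male"), ("Liberal", "Male"),
    ("Liberal", "Female"), ("Moderate", "Female"), ("Conservative", "Male"),
    ("Conservative", "Female"), ("Conservative", "Male"),
    ("Moderate", "Female"), ("Liberal", "Female"),
]

liberal_males = sum(1 for a, g in people if a == "Liberal" and g == "Male")
males = sum(1 for _, g in people if g == "Male")
liberals = sum(1 for a, _ in people if a == "Liberal")

# Frame of reference: males. What percentage of males are liberal?
pct_of_males_liberal = 100 * liberal_males / males      # 2 out of 5

# Frame of reference: liberals. What percentage of liberals are male?
pct_of_liberals_male = 100 * liberal_males / liberals   # 2 out of 4

print(f"{pct_of_males_liberal:.0f}% of males are liberal")
print(f"{pct_of_liberals_male:.0f}% of liberals are male")
```

The numerator is the same in both cases; only the denominator changes with the question being asked.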
14. CENTRAL TENDENCY
• Often the most informative way to describe a numerical variable is to describe
where its ‘center’ is
• The most common measures of the center are:
• Mean
• Median
• Mode (less common)
15. MEANS
• A mean is a simple average of all numbers
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
1. Add up all the numbers in the variable
2. Divide by the number of observations in the variable
17. MEAN: EXAMPLE
• Calculate the mean of 10,27,12,9,18,21,92
$$\text{Mean } \bar{x} = \frac{10 + 12 + 9 + 18 + 21 + 27 + 92}{7} = \frac{189}{7} = 27$$
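The same calculation as a short Python sketch (the variable names are illustrative):

```python
values = [10, 27, 12, 9, 18, 21, 92]

total = sum(values)   # step 1: add up all the numbers
n = len(values)       # number of observations
mean = total / n      # step 2: divide by the number of observations

print(total, n, mean)  # 189 7 27.0
```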
18. MEDIAN
• The number in the middle
1. Check if there are an odd or even number of observations
2. Order the numbers from smallest to largest.
3. If the data set contains an odd number of numbers, the one exactly in
the middle is the median.
4. If the data set contains an even number of numbers, take the two
numbers that appear exactly in the middle and average them to find the
median.
20. MEDIAN: EXAMPLES
• Calculate the median of 10,27,12,9,18,21,92
1. There are 7 numbers (odd no.)
2. Order: 9,10,12,18,21,27,92
3. Middle number (median) is 18
• Calculate the median of 21,15,20,14
1. There are 4 numbers (even no.)
2. Order: 14,15,20,21
3. Middle numbers are 15 and 20
4. Take their average to get the median: $\frac{15 + 20}{2} = 17.5$
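The odd and even cases can be written as one small function. This is a sketch that assumes a non-empty list of numbers; the function name `median` is an illustrative choice.

```python
def median(values):
    """Return the middle value of a non-empty list of numbers."""
    ordered = sorted(values)   # order from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the value exactly in the middle.
        return ordered[middle]
    # Even count: average the two values in the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([10, 27, 12, 9, 18, 21, 92]))  # 18
print(median([21, 15, 20, 14]))             # 17.5
```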
21. MODE
• The value that occurs most often
• Calculate the modal value of: 3,3,3,3,3,4,5,6,3,2,1
• Normally used for categorical variables
Category Frequency
A 10
B 21
C 5
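Both mode examples can be checked with a few lines of Python. The dictionary `frequencies` is the Category/Frequency table typed in by hand; the names are illustrative.

```python
from collections import Counter

# Mode of a list of numbers: the value with the highest count.
numbers = [3, 3, 3, 3, 3, 4, 5, 6, 3, 2, 1]
value, times = Counter(numbers).most_common(1)[0]
print(value, times)  # 3 6

# Mode of a categorical variable given as a frequency table.
frequencies = {"A": 10, "B": 21, "C": 5}
modal_category = max(frequencies, key=frequencies.get)
print(modal_category)  # B
```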
22. PICKING BEST MEASURE OF CENTER
• Calculating mean, median or mode is easy
• Picking the right measure is the tricky bit
Mean
Median
Mode
23. WHEN TO USE MEAN OR MEDIAN (OR MODE)
Mean
• The default method
• + Universal and intuitive
• + Mathematically sound (we'll see later)
• - Susceptible to outliers
Median
• Report when data is very skewed or has noticeable outliers. How do we know?
• Incomes are usually reported as median
Mode
• Less common
• Report for categorical variables or when one value dominates
• Usually reported alongside other measures
KEEP CONTEXT IN MIND. BE FLEXIBLE!
24. MEAN, MEDIAN OR MODE
• Let’s go back to our dataset
• What’s the best central tendency measure to report for:
• Income
• Age
• Political affiliation
ID Name Age Gender Income Political affiliation Voted in the last election?
1 Alfred 67 Male $ 28,500 Liberal Yes
2 Peter 27 Male $ 275,000 Conservative Yes
3 George 18 Male $ 31,000 Liberal No
4 Jannet 38 Female $ 39,000 Liberal Yes
5 Meagan 19 Female $ 52,000 Moderate Yes
6 Ivan 35 Male $ 27,000 Conservative No
7 Jenny 78 Female $ 38,000 Conservative Yes
8 Sam 43 Male $ 33,000 Conservative Yes
9 Emily 39 Female $ 37,000 Moderate No
10 Hellen 57 Female $ 43,000 Liberal Yes
25. MEAN, MEDIAN OR MODE
• Let’s go back to our dataset
• What’s the best central tendency measure to report for:
• Income: Median
• Age: Mean or median
• Political affiliation: Mode
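A quick check with Python's standard `statistics` module shows why the median suits income here. The lists are the Income and Political affiliation columns of the example dataset typed in by hand. Peter's $275,000 pulls the mean above every other income in the data, while the median stays near the typical value. Note also that Liberal and Conservative each appear four times, so in this small dataset the mode is a tie.

```python
import statistics

# Income and political affiliation columns, in ID order.
incomes = [28500, 275000, 31000, 39000, 52000,
           27000, 38000, 33000, 37000, 43000]
affiliations = ["Liberal", "Conservative", "Liberal", "Liberal", "Moderate",
                "Conservative", "Conservative", "Conservative", "Moderate",
                "Liberal"]

mean_income = statistics.mean(incomes)      # sensitive to the one outlier
median_income = statistics.median(incomes)  # robust to the outlier
modes = statistics.multimode(affiliations)  # all most-common categories

print(f"Mean income:   ${mean_income:,.0f}")
print(f"Median income: ${median_income:,.0f}")
print("Most common affiliation(s):", modes)
```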
26. EXERCISE: BEST MEASURE OF CENTER
What is the best measure of central tendency for the following:
1. Length of Christopher Nolan movies in minutes
2. U.S. household income
3. Platform with most engagement
27. MEASURES OF SPREAD
• Consider two sets of numbers
Set A : { 1 4 6 7 12 }
Set B: { 5 6 6 6 7 }
• What is the mean of each set?
28. MEASURES OF SPREAD
Set A : { 1 4 6 7 12 }
Set B: { 5 6 6 6 7 }
• If the mean of both the sets is the same, what’s the difference between
the two?
• Set A is obviously more ‘spread out’ than Set B
• Is there some way we can quantify this?
29. MEASURES OF SPREAD
• We can try taking the difference of each number in the set from the mean of
the set
• And sum up these differences
• But positive and negative differences will cancel out
Set A   Mean   Deviation (difference from mean)
1       6      -5
4       6      -2
6       6      0
7       6      1
12      6      6
SUM:           0
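The cancellation is easy to confirm in Python (a small sketch; the variable names are illustrative):

```python
set_a = [1, 4, 6, 7, 12]

mean_a = sum(set_a) / len(set_a)          # 6.0
deviations = [x - mean_a for x in set_a]  # difference of each value from the mean

print(deviations)       # [-5.0, -2.0, 0.0, 1.0, 6.0]
print(sum(deviations))  # 0.0
```

The negative and positive deviations cancel, so their plain sum cannot be used to measure spread.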
30. MEASURES OF SPREAD
• Instead, we can take the ‘squared differences from mean’
• And sum them up
• Divide by the number of observations minus 1*: 66/4 = 16.5
• This is called variance
Set A   Mean   Difference from mean   Squared difference from mean
1       6      -5                     25
4       6      -2                     4
6       6      0                      0
7       6      1                      1
12      6      6                      36
SUM:           0                      66
*In practice we divide by the no. of observations minus 1 to account for the fact that the data is a sample
(more on this later in the course)
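The variance of Set A, step by step, as a Python sketch (variable names are illustrative):

```python
set_a = [1, 4, 6, 7, 12]

n = len(set_a)
mean_a = sum(set_a) / n

# Squaring makes every difference non-negative, so nothing cancels out.
squared_diffs = [(x - mean_a) ** 2 for x in set_a]
sum_of_squares = sum(squared_diffs)

# Divide by n - 1 (not n) because the data is treated as a sample.
variance_a = sum_of_squares / (n - 1)

print(squared_diffs)   # [25.0, 4.0, 0.0, 1.0, 36.0]
print(sum_of_squares)  # 66.0
print(variance_a)      # 16.5
```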
31. MEASURES OF SPREAD
• Do the same for Set B
• Divide by the number of observations minus 1: 2/4 = 0.5
Set B   Mean   Difference from mean   Squared difference from mean
5       6      -1                     1
6       6      0                      0
6       6      0                      0
6       6      0                      0
7       6      1                      1
SUM:           0                      2
32. MEASURES OF SPREAD
Set A : { 1 4 6 7 12 }
Set B: { 5 6 6 6 7 }
• The variance for set A is 16.5 but for Set B is only 0.5
• This means Set A is more spread out than Set B
33. MEASURES OF SPREAD
• Variance is hard to interpret on its own because it is in squared units
• Instead, if we take its square root (√), we get the standard deviation
• Standard deviation is the standard way of quantifying spread
• A high standard deviation means that observations are spread far from the mean
• A low standard deviation indicates that observations are close to the mean
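Taking the square root of the two variances computed earlier (16.5 for Set A, 0.5 for Set B) gives the standard deviations. A minimal sketch:

```python
import math

variance_a = 16.5  # variance of Set A: { 1 4 6 7 12 }
variance_b = 0.5   # variance of Set B: { 5 6 6 6 7 }

std_dev_a = math.sqrt(variance_a)
std_dev_b = math.sqrt(variance_b)

print(f"Set A: {std_dev_a:.2f}")  # Set A: 4.06
print(f"Set B: {std_dev_b:.2f}")  # Set B: 0.71
```

Unlike the variance, the standard deviation is in the same units as the data, so it can be read against the mean of 6 directly.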
34. RECAP OF VARIANCE AND STANDARD DEVIATION
1. Find the mean of the variable: $\bar{x}$
2. Subtract the mean from each value: $x_i - \bar{x}$
3. Square this difference: $(x_i - \bar{x})^2$
4. Sum this squared difference for all values: $\sum_{i=1}^{n} (x_i - \bar{x})^2$
5. Divide by the number of observations minus 1* to get the variance:
$$\text{Variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
6. Take the square root of variance to get the standard deviation:
$$\text{Standard Deviation} = \sqrt{\text{Variance}}$$
*We divide by n-1 instead of n as a standard practice to correct for the fact that the data was
collected as a sample (more on this later)
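The six steps can be collected into two small functions. This is a sketch, not a replacement for a statistics library: the function names are illustrative, and the input is assumed to contain at least two numbers. The result is cross-checked against `statistics.variance` from the Python standard library, which also divides by n - 1.

```python
import math
import statistics


def sample_variance(values):
    """Variance with the n - 1 denominator (needs at least two values)."""
    n = len(values)
    mean = sum(values) / n                             # step 1
    squared_diffs = [(x - mean) ** 2 for x in values]  # steps 2 and 3
    return sum(squared_diffs) / (n - 1)                # steps 4 and 5


def sample_std_dev(values):
    """Standard deviation: the square root of the variance (step 6)."""
    return math.sqrt(sample_variance(values))


set_a = [1, 4, 6, 7, 12]
set_b = [5, 6, 6, 6, 7]

print(sample_variance(set_a), sample_variance(set_b))  # 16.5 0.5

# The standard library agrees with the hand-written version.
print(statistics.variance(set_a) == sample_variance(set_a))  # True
```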