Statistics101: Numerical Measures

Zahid Mian
Part of the Brown-bag Series

 Population
 Sample
 Variable
 Statistic
 Skew

 Mean
 Median
 Mode
 Range
 Percentile
 Variance
 Standard Deviation
 Covariance
 Correlation Coefficient
 Skewness

 Why do we care about Central Tendency?
 What is most valuable to you:
 Average price of home in a neighborhood
 Median price of home …
 Range of prices …
 Mode …
 What does it say about a neighborhood if:
 Average price is $500K
 Median price is $350K
 Range is $750K

 Population mean: $\mu = \frac{\sum X}{N}$
 Sample mean: $\bar{x} = \frac{\sum X}{n}$
 Data: 1,2,3,4,4,5,5,5,5,6
 Mean = 4
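
As a quick check of the arithmetic above, here is a minimal R sketch. The variable names are illustrative choices, not taken from the slides.

```r
# Data from the slide
x <- c(1, 2, 3, 4, 4, 5, 5, 5, 5, 6)

# Mean by hand: sum of the values divided by how many there are
manual_mean <- sum(x) / length(x)

# Built-in equivalent
builtin_mean <- mean(x)
```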

 The “middle” value from sorted list
 $\text{Median} = \left(\frac{n+1}{2}\right)^{\text{th}}$ term (when n is even, average the two middle terms)
 Data: 1,2,3,4,4,5,5,5,5,6 Median: 4.5
 Data: 1,2,3,4,4,5,5,5,5,6,7 Median: 5
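
The two cases above (even and odd number of values) can be sketched in R as follows. The variable names are illustrative assumptions.

```r
# Even number of values: the median is the average of the two middle terms
even_data <- c(1, 2, 3, 4, 4, 5, 5, 5, 5, 6)
median_even <- median(even_data)

# Odd number of values: the median is the single middle term
odd_data <- c(even_data, 7)
median_odd <- median(odd_data)

# The same odd-length result by hand, using the ((n + 1) / 2)th sorted term
sorted <- sort(odd_data)
manual_odd <- sorted[(length(sorted) + 1) / 2]
```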

 The number that occurs the most
 Data: 1,2,3,4,4,5,5,5,5,6,7
 Mode: 5 (appears 4 times)
> table(c(1,2,3,4,4,5,5,5,5,6,7))
1 2 3 4 5 6 7
1 1 1 2 4 1 1

 The value below which n percent of the data falls
 Quiz Scores:
67,72,88,82,80,90,95,60,77,89,99,85,77
 What is the score that cuts off 30% of all
scores? 50%?
quantile(c(67,72,88,82,80,90,95,60,77,89,99,85,77), c(.3, .5))
30% 50%
77 82
30% of all scores were 77 or below; 50% of all scores were 82 or
below

 Graph (box plot) showing data within quartiles, marking: Max Value, Q3 (75%), Median, Q1 (25%), Min Value
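
The five values a box plot marks can be computed with `quantile()`. This sketch assumes the quiz scores from the percentile example; the slides do not say which data the plot was drawn from.

```r
# Quiz scores from the percentile example (assumed here for illustration)
scores <- c(67, 72, 88, 82, 80, 90, 95, 60, 77, 89, 99, 85, 77)

# Min, Q1 (25%), median (50%), Q3 (75%), max
box_values <- quantile(scores, c(0, 0.25, 0.5, 0.75, 1))

# boxplot(scores) would draw the corresponding plot
```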

 How to see the “dispersion” of data
 Sets of quiz scores for different classes:
 Set1: 80, 79, 80, 81, 80, 80, 79, 79
 Set2: 75, 100, 60, 100, 100, 75, 75, 60
 Just by looking at the numbers you should be able to state that Set2 is more dispersed
 Standard deviation measures the dispersion around the mean
 Other measures include range and variance
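
A short R sketch comparing the two sets. Note that `sd()` in R uses the sample (n-1) formula; the variable names are illustrative.

```r
set1 <- c(80, 79, 80, 81, 80, 80, 79, 79)
set2 <- c(75, 100, 60, 100, 100, 75, 75, 60)

# The means are close to each other...
mean1 <- mean(set1)
mean2 <- mean(set2)

# ...but the spread around the mean is very different
sd1 <- sd(set1)
sd2 <- sd(set2)
```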

 Difference between the highest and lowest values in the data
 Data: 1,2,3,4,4,5,5,5,5,6,7
 Range: 6 (7-1)
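
In R, `range()` returns the minimum and maximum rather than their difference, so a sketch of this calculation needs one extra step:

```r
x <- c(1, 2, 3, 4, 4, 5, 5, 5, 5, 6, 7)

# range() returns c(min, max); diff() turns that pair into max - min
value_range <- diff(range(x))
```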

 Measures dispersion around the mean (how far values lie from the mean)
 $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
 Steps to calculate population variance ($\sigma^2$)
 Calculate mean
 For each number in set
▪ Subtract the mean
▪ Square the result (why square it?)
 Get the average of the squared differences

 Data: 67,72,88,82,80,90,95,60,77,89,99,85,77
 Large variance indicates data is spread out; small indicates data is close to mean.
 For Sample (note n-1)
 $s^2 = \frac{\sum (x - \bar{x})^2}{n-1}$
 Mean: 81.62
 (67-81.62)² = 213.74
 (72-81.62)² = 92.54
 …
 (77-81.62)² = 21.34
 Sum of 213.74, 92.54, …, 21.34, divided by n-1 = 12
 $s^2 = 123.09$ (dividing by n = 13 instead gives the population value $\sigma^2 = 113.62$)
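
The steps can be sketched in R for both the population (divide by n) and sample (divide by n-1) formulas. It also shows one reason for squaring: the raw deviations always sum to zero. Variable names are illustrative; `var()` in R uses the n-1 formula.

```r
scores <- c(67, 72, 88, 82, 80, 90, 95, 60, 77, 89, 99, 85, 77)
n <- length(scores)

# Step 1: deviations from the mean (these cancel out when summed)
deviations <- scores - mean(scores)
sum_of_deviations <- sum(deviations)

# Step 2: square each deviation so negatives do not cancel positives
squared <- deviations^2

# Step 3: average the squared deviations
pop_var <- sum(squared) / n         # population formula
samp_var <- sum(squared) / (n - 1)  # sample formula, same as var(scores)
```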

 Simply the square root of variance
 $\sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N}}$
 For the quiz scores, using the sample variance: $s^2 = 123.09$
 $s = \sqrt{123.09} = 11.09$
 𝜎 is useful in determining what is “normal”
 The mean of scores was 81.62, so most scores
are within 1 𝜎 (+/- 11.09). All scores are within
2 𝜎 (+/- 22.18).
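
A minimal R sketch that counts how many quiz scores fall within one and two standard deviations of the mean. It uses `sd()`, which is the sample (n-1) standard deviation and reproduces the 11.09 figure; the variable names are illustrative.

```r
scores <- c(67, 72, 88, 82, 80, 90, 95, 60, 77, 89, 99, 85, 77)

m <- mean(scores)
s <- sd(scores)  # sample standard deviation (n - 1 in the denominator)

# Count scores within 1 and 2 standard deviations of the mean
within_1 <- sum(abs(scores - m) <= s)
within_2 <- sum(abs(scores - m) <= 2 * s)
```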

 Measures how two variables (x,y) are linearly related. A positive value indicates they tend to increase together; a negative value indicates one tends to decrease as the other increases.
 Test Scores (x):
67,72,88,82,80,90,95,60,77,89,99,85,77
 StudyTime (y):
30,45,80,85,75,85,120,30,45,75,85,110,40
 Is there a relationship between Test Scores and Time spent studying?
 𝜎𝑥𝑦 = 269.43
 What if everyone studied for 30 minutes?
 𝜎𝑥𝑦 = 0 (so no linear relation)
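
A sketch of the covariance calculation in R, including the "everyone studied for 30 minutes" case. `cov()` uses the sample (n-1) formula; the variable names are illustrative.

```r
scores <- c(67, 72, 88, 82, 80, 90, 95, 60, 77, 89, 99, 85, 77)
study_time <- c(30, 45, 80, 85, 75, 85, 120, 30, 45, 75, 85, 110, 40)

# By hand: average product of paired deviations (n - 1 denominator)
manual_cov <- sum((scores - mean(scores)) * (study_time - mean(study_time))) /
  (length(scores) - 1)

# Built-in equivalent
builtin_cov <- cov(scores, study_time)

# If study time never varies, its deviations are all zero, so covariance is zero
constant_time <- rep(30, length(scores))
cov_constant <- cov(scores, constant_time)
```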

 A normalized measurement of how two
variables are linearly related.
 Sample: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
 Population: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
 From Previous Example: $r_{xy} = 0.83$ (the closer this value is to 1 or -1, the stronger the linear relationship)
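
The normalization can be sketched in R by dividing the covariance by the product of the two standard deviations and comparing against `cor()`. Variable names are illustrative.

```r
scores <- c(67, 72, 88, 82, 80, 90, 95, 60, 77, 89, 99, 85, 77)
study_time <- c(30, 45, 80, 85, 75, 85, 120, 30, 45, 75, 85, 110, 40)

# Covariance divided by the product of the standard deviations
r_manual <- cov(scores, study_time) / (sd(scores) * sd(study_time))

# Built-in equivalent (Pearson correlation by default)
r_builtin <- cor(scores, study_time)
```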

 Intuitively we would say there is no relation
 But be careful …
 Let’s say I have data for sales of ice vs. temp
> cor(temps, sales)
[1] -0.001413245
Correlation Coefficient is nearly 0, so no
relation, right?

The scatter plot clearly shows a strong relationship between sales and temps. Maybe when it’s too hot people just don’t want to leave the house.
Always visualize data!
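
To illustrate how a strong but non-linear relationship can produce a correlation near zero, here is a sketch with made-up data. The `temps` and `sales` values below are hypothetical and are not the data behind the slide.

```r
# Hypothetical data: sales peak at a mild temperature and drop off on both sides
temps <- 50:100
sales <- 700 - (temps - 75)^2

# The relationship is exact, but it is not linear, so the correlation is ~0
r <- cor(temps, sales)

# plot(temps, sales) would reveal the inverted-U shape immediately
```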

Identical simple statistical properties, so always visualize!

 Measure of “symmetry” of data. Negative value usually indicates mean is less than median (left skewed). Positive value usually indicates mean is larger than median (right skewed).
> skewness(scores)
[1] -0.302365

 Data: 60,62,62,62,65,65,65,75,82,96,99,100,100
> skewness(scores)
[1] 0.4652821
 This means a few students did really well and lifted the overall mean score.
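
`skewness()` is not part of base R; it comes from an add-on package (e1071 and moments both provide one, and the slides do not say which was used). As a rough base-R sketch, the moment-based skewness can be computed directly. The function name `moment_skewness` is a hypothetical helper, and because packages apply different small-sample adjustments, its values may differ slightly from those shown on the slides while keeping the same sign.

```r
# Hypothetical helper: third central moment divided by variance^(3/2)
moment_skewness <- function(x) {
  d <- x - mean(x)
  m2 <- mean(d^2)  # second central moment
  m3 <- mean(d^3)  # third central moment
  m3 / m2^1.5
}

quiz_scores <- c(67, 72, 88, 82, 80, 90, 95, 60, 77, 89, 99, 85, 77)
second_scores <- c(60, 62, 62, 62, 65, 65, 65, 75, 82, 96, 99, 100, 100)

skew_quiz <- moment_skewness(quiz_scores)      # negative: left skewed
skew_second <- moment_skewness(second_scores)  # positive: right skewed
```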

 Mean, Median, Mode exactly at center
 About 99.7% of all data within 3 𝜎 of mean
 Important for making inferences
 Test scores generally follow a normal distribution
 Heights of humans follow a normal distribution
 Need to be careful not to apply normal distribution rules against non-normal data
 https://en.wikipedia.org/wiki/Normality_test
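
The share of a normal distribution lying within k standard deviations of the mean can be checked with `pnorm()`, the normal cumulative distribution function in base R. The helper name `within_sd` is an illustrative assumption. For three standard deviations the share is about 99.7%.

```r
# Hypothetical helper: probability of falling within k standard deviations
within_sd <- function(k) pnorm(k) - pnorm(-k)

p1 <- within_sd(1)  # about 68%
p2 <- within_sd(2)  # about 95%
p3 <- within_sd(3)  # about 99.7%
```

For checking whether data look normal in the first place, base R's `shapiro.test()` is one of the tests covered by the linked article.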

 z-Score indicates how far above or below the mean a given score in the distribution is, in units of standard deviation
 Scenario: On which exam did Scott do better?
 Scott got a 65/100 on Exam1; 𝜇 is 60; 𝜎 is 10
 Scott got a 42/200 on Exam2; 𝜇 is 37; 𝜎 is 5
 First, need to standardize scores (Exam1 is out of 100; Exam2 out of 200)
 This standardization is the z-score
 $z = \frac{\text{raw score} - \text{mean}}{\text{standard deviation}}$ or $z = \frac{X - \mu}{\sigma}$
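
Applying the formula to Scott's two exams, as a minimal R sketch (the function name `z_score` is an illustrative assumption):

```r
# Hypothetical helper implementing z = (X - mu) / sigma
z_score <- function(raw, mu, sigma) (raw - mu) / sigma

z_exam1 <- z_score(65, mu = 60, sigma = 10)
z_exam2 <- z_score(42, mu = 37, sigma = 5)

# The larger z-score marks the better result relative to the other test takers
better_exam <- if (z_exam2 > z_exam1) "Exam2" else "Exam1"
```

Scott was 5 points above the mean on both exams, but on Exam2 that gap is a full standard deviation, so Exam2 is the stronger relative performance.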

 z of -1.5 means student scored 1.5 standard deviations below the mean
 In the case of test scores, positive numbers are good
 Less than 10% scored worse
 Which score marks the 97th percentile?
 What percentage of population scored between score1 and score2 (say 75 and 90)?
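
These questions can be answered with `pnorm()` and `qnorm()` if the scores are assumed to be normally distributed. The sketch below borrows the mean (81.62) and standard deviation (11.09) from the earlier quiz example purely as an assumption; the slide does not state which distribution it refers to.

```r
# Share of a normal population scoring below z = -1.5
p_below <- pnorm(-1.5)

# z-score that marks the 97th percentile
z_97 <- qnorm(0.97)

# Assumed parameters, borrowed from the earlier quiz-score example
mu <- 81.62
sigma <- 11.09

# Raw score at the 97th percentile under those assumptions
score_97 <- mu + z_97 * sigma

# Share of the population scoring between 75 and 90
p_between <- pnorm(90, mean = mu, sd = sigma) - pnorm(75, mean = mu, sd = sigma)
```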

 Measurements of Central Tendency and Variability are critical to study of statistics
 Central Tendency tries to provide information about the “central” value of your set
 Variability tries to provide information about the dispersion of data in your set
 Covariance tries to provide information about how two variables are related
 z-Scores are useful with a normal distribution
