1. 5/3/2023 Summary Statistics 1
Summary Statistics
Last week we used stemplots and histograms to
describe the shape, location, and spread of a
distribution. This week we use numerical summaries of
location and spread.
2. 5/3/2023 Summary Statistics 2
Main Summary Statistics by Type
Central location
Mean
Median
Mode
Spread
Variance and standard deviation
Quartiles and Interquartile Range (IQR)
Shape
Statistical measures of shape (e.g., skewness and
kurtosis) are available but are seldom used in
practice (not covered)
3. 5/3/2023 Summary Statistics 3
Notation
n = sample size
X = variable
xi = value of individual i
Σ = sum all values (capital sigma)
Illustrative example (sample.sav), data:
21 42 5 11 30 50 28 27 24 52
n = 10
X = age
x1= 21, x2= 42, …, x10= 52
Σxi = 21 + 42 + … + 52 = 290
4. 5/3/2023 Summary Statistics 4
Sample Mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Illustrative example: n = 10 (data and intermediate calculations on prior slide)
$$\bar{x} = \frac{1}{10}(290) = 29.0$$
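The sample mean calculation can be checked with a few lines of Python. This is a minimal sketch using the ten ages from the illustrative example; the variable names are chosen here for illustration and do not come from the slides.

```python
# Ages from the illustrative example (sample.sav)
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]

n = len(ages)      # sample size
total = sum(ages)  # sum of all values (capital sigma)
xbar = total / n   # sample mean

print(n, total, xbar)  # prints: 10 290 29.0
```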
5. 5/3/2023 Summary Statistics 5
Population Mean
Same operation as sample mean, but
based on entire population (N =
population size)
Not available in practice, but important
conceptually
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
6. 5/3/2023 Summary Statistics 6
Interpretation of xbar
Sample mean used to predict
an observation drawn at random from a sample
an observation drawn at random from the
population
the population mean
Gravitational center (balance point)
[Figure: number line from 0 to 60 with the balance point at mean = 29]
7. 5/3/2023 Summary Statistics 7
Median – a different kind of average
“Middle value”
Covered last week
Order data
Depth of median is (n+1) / 2
When n is odd → middle value
When n is even → average the two middle values
Illustrative example: n = 10 → median has
depth (10+1) / 2 = 5.5
05 11 21 24 27 28 30 42 50 52
median = average of 27 and 28 = 27.5
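The depth rule can be sketched in Python as follows. The function name median_by_depth is hypothetical, introduced only for this illustration; it orders the data, computes the depth (n+1)/2, and averages the two neighboring values when the depth is not a whole number.

```python
def median_by_depth(values):
    """Median via the depth rule: depth = (n + 1) / 2 in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    depth = (n + 1) / 2
    if n % 2 == 1:
        # Odd n: depth is a whole number; take the value at that position.
        return ordered[int(depth) - 1]
    # Even n: depth ends in .5; average the values on either side of it.
    lower = ordered[int(depth) - 1]
    upper = ordered[int(depth)]
    return (lower + upper) / 2


ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
print(median_by_depth(ages))  # prints: 27.5
```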
8. 5/3/2023 Summary Statistics 8
Median is “robust”
Robust = resistant to skews and outliers
This data set has a mean (xbar) of 1600:
1362 1439 1460 1614 1666 1792 1867
This data set has an outlier (9867) and a mean of 2743:
1362 1439 1460 1614 1666 1792 9867
The median is 1614 in both instances.
The median was not influenced by the outlier.
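The comparison can be reproduced with Python's standard statistics module. This is a small sketch; the list names are chosen here for illustration.

```python
import statistics

original = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_outlier = [1362, 1439, 1460, 1614, 1666, 1792, 9867]

# The mean is pulled upward by the single outlier...
mean_original = statistics.mean(original)
mean_outlier = statistics.mean(with_outlier)

# ...while the median stays at the same middle value.
median_original = statistics.median(original)
median_outlier = statistics.median(with_outlier)

print(mean_original, round(mean_outlier))
print(median_original, median_outlier)
```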
9. 5/3/2023 Summary Statistics 9
Mode
Mode = value with greatest frequency
e.g., {4, 7, 7, 7, 8, 8, 9} has mode = 7
Used only in very large data sets
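One way to find the mode is to tally frequencies and take the most common value. This is a minimal sketch with collections.Counter from the standard library; it reports a single mode and does not handle ties specially.

```python
from collections import Counter

data = [4, 7, 7, 7, 8, 8, 9]

# most_common(1) returns a list with the (value, frequency) pair
# of the most frequent value.
mode_value, count = Counter(data).most_common(1)[0]

print(mode_value, count)  # prints: 7 3
```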
10. 5/3/2023 Summary Statistics 10
Mean, Median, Mode
(A) Symmetrical data: mean = median
(B) positive skew: mean > median [mean gets “pulled” by tail]
(C) negative skew: mean < median
[Figure: three distribution curves marking the relative positions of mean, median, and mode: (A) Symmetrical, (B) Positive skew, (C) Negative skew]
11. 5/3/2023 Summary Statistics 11
Spread = Variability
Variability = the amount by which values spread
above and below the average
Measures of spread
Range and inter-quartile range
Standard deviation and variance (this week)
12. 5/3/2023 Summary Statistics 12
Range = max – min
The range is rarely used in practice b/c it
tends to underestimate population range
and is not robust
13. 5/3/2023 Summary Statistics 13
Standard deviation
Deviation:
$$x_i - \bar{x}$$
Sum of squared deviations:
$$SS = \sum (x_i - \bar{x})^2$$
Sample variance:
$$s^2 = \frac{SS}{n-1}$$
Sample standard deviation:
$$s = \sqrt{s^2}$$
Most common descriptive measure of spread
14. 5/3/2023 Summary Statistics 14
Standard deviation (formula)
$$s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$$
Sample standard deviation s is the usual estimator of the
population standard deviation σ (the sample variance s² is an
unbiased estimator of the population variance σ²).
Population standard deviation is rarely known in practice.
15. 5/3/2023 Summary Statistics 15
New data set (“Metabolic Rates”)
This example is not in your lecture notes
Metabolic rates (cal/day), n = 7
1792 1666 1362 1614 1460 1867 1439
$$\bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = \frac{11{,}200}{7} = 1600$$
17. 5/3/2023 Summary Statistics 17
Standard Deviation Calculation
metabolic.sav – introduced slide 15
Observations (xi)   Deviations (xi - xbar)   Squared deviations (xi - xbar)^2
1792                1792 - 1600 = 192        (192)^2 = 36,864
1666                1666 - 1600 = 66         (66)^2 = 4,356
1362                1362 - 1600 = -238       (-238)^2 = 56,644
1614                1614 - 1600 = 14         (14)^2 = 196
1460                1460 - 1600 = -140       (-140)^2 = 19,600
1867                1867 - 1600 = 267        (267)^2 = 71,289
1439                1439 - 1600 = -161       (-161)^2 = 25,921
SUMS                0*                       SS = 214,870
* Sum of deviations will always equal zero
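The deviations table can be reproduced programmatically. This is a sketch in plain Python; the variable names are illustrative and are not part of the slides.

```python
# Metabolic rates (cal/day), n = 7
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

xbar = sum(rates) / len(rates)          # sample mean
deviations = [x - xbar for x in rates]  # xi - xbar
squared = [d ** 2 for d in deviations]  # (xi - xbar)^2
ss = sum(squared)                       # sum of squared deviations

# Print one row per observation, mirroring the table.
for x, d, sq in zip(rates, deviations, squared):
    print(x, d, sq)
print("sum of deviations:", sum(deviations), "SS:", ss)
```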
18. 5/3/2023 Summary Statistics 18
Standard Deviation Metabolic data
(cont.)
Variance (s^2):
$$s^2 = \frac{SS}{n-1} = \frac{214{,}870}{7-1} = 35{,}811.67 \text{ calories}^2$$
Standard deviation (s):
$$s = \sqrt{s^2} = \sqrt{35{,}811.67} = 189.24 \text{ calories}$$
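The variance and standard deviation can be computed from the definition and cross-checked against Python's statistics module, which also uses the n - 1 divisor for variance and stdev. A minimal sketch, with illustrative variable names:

```python
import math
import statistics

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(rates)
xbar = sum(rates) / n
ss = sum((x - xbar) ** 2 for x in rates)  # sum of squared deviations

variance = ss / (n - 1)    # sample variance, s^2
sd = math.sqrt(variance)   # sample standard deviation, s

print(round(variance, 2), round(sd, 2))  # prints: 35811.67 189.24

# Cross-check with the standard library's sample statistics.
print(statistics.variance(rates), statistics.stdev(rates))
```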
19. 5/3/2023 Summary Statistics 19
General rule for rounding means
and standard deviations
Report the mean to one additional decimal place beyond that of
the data
To achieve accuracy, intermediate calculations should
carry at least one further decimal place
Illustrative example
Suppose data is recorded with one decimal accuracy (i.e.,
xx.x)
Report mean with two decimal accuracy (i.e., xx.xx)
Carry all intermediate calculations with at least three decimal
accuracy (i.e., xx.xxx)
Even more important: Always use common sense and judgment.
20. 5/3/2023 Summary Statistics 20
TI-30XIIS – about $12
In practice, we often use software
or a calculator to check our
standard deviation
21. 5/3/2023 Summary Statistics 21
Interpretation of Standard Deviation
Larger standard deviation → greater variability
s1 = 15 and s2 = 10 → group 1 has more variability
68-95-99.7 rule – Normal data only
68% of data within 1 SD of mean, 95% within 2 SD of
mean, and 99.7% within 3 SD of mean
e.g., if mean = 30 and SD = 10, then 95% of individuals are
in the range 30 ± (2)(10) = 30 ± 20 = (10 to 50)
Chebychev’s rule – All data
at least 75% of data within 2 SD of mean
e.g., mean = 30 and SD = 10, then at least 75% of
individuals in range 30 ± (2)(10) = (10 to 50)
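Both rules use the same interval, mean ± k·SD, and differ only in the proportion they attach to it. The sketch below computes the interval and the Chebychev lower bound. The general form 1 - 1/k² is the standard statement of Chebychev's inequality; the slide states only the k = 2 case. The function names are hypothetical, introduced for this illustration.

```python
def sd_interval(mean, sd, k):
    """Interval mean ± k standard deviations."""
    return (mean - k * sd, mean + k * sd)


def chebychev_lower_bound(k):
    """Minimum proportion of any data set within k SDs of the mean (k > 1)."""
    return 1 - 1 / k ** 2


low, high = sd_interval(30, 10, 2)
print(low, high)                 # prints: 10 50
print(chebychev_lower_bound(2))  # prints: 0.75
```

For Normal data the 68-95-99.7 rule gives about 95% in this interval; for arbitrary data Chebychev guarantees only at least 75%.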
22. 5/3/2023 Summary Statistics 22
Quartiles and IQR
Quartiles divide the ordered data into
four equally-sized groups
Q0 = minimum
Q1 = 25th %ile
Q2 = 50th %ile (Median)
Q3 = 75th %ile
Q4 = maximum
23. 5/3/2023 Summary Statistics 23
Rule for quartiles
Find the median → Q2
Middle of lower half of the data set → Q1
Middle of upper half of the data set → Q3
Bottom half | Top half
05 11 21 24 27 | 28 30 42 50 52
Q1 = 21, Q2 = 27.5, Q3 = 42
IQR = Q3 – Q1 = 42 – 21 = 21
gives spread of middle 50% of the data
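The rule can be sketched in Python as follows. One assumption is made here: when n is odd, the median is excluded from both halves. The slides show only an even-n example, so that choice is not confirmed by the source, and other quartile conventions (including those used by software defaults) can give slightly different values. The function name quartiles_by_halves is hypothetical.

```python
import statistics


def quartiles_by_halves(values):
    """Q1, Q2, Q3 as medians of the lower half, all data, and upper half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n, skip the middle value when forming the upper half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    q1 = statistics.median(lower)
    q2 = statistics.median(ordered)
    q3 = statistics.median(upper)
    return q1, q2, q3


ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
q1, q2, q3 = quartiles_by_halves(ages)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # prints: 21 27.5 42 21
```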
30. 5/3/2023 Summary Statistics 30
Interpretation of boxplots
Location
Position of median
Position of box
Spread
Hinge-spread (box length) = IQR
Whisker-to-whisker spread (range or range minus
the outside values)
Shape
Symmetry of box
Size of whiskers
Outside values (potential outliers)