2. Learning Objectives
By the end of this chapter, the reader will be able to:
• Compute and interpret the
o Mean
o Median
o Mode
o Standard deviation, and
o Variance
7/8/2023 Data summarization 2
3. Key words:
o Measures of central tendency:
Mean ($\bar{x}$), Median ($\tilde{x}$), Mode ($\hat{x}$)
o Measures of dispersion:
Standard deviation, variance, …
4. Introduction
• Compiling and presenting the data in tabular or graphical form does not give complete information about the data collected.
• We need to "summarize" the entire data set in a single figure that gives an overall idea of the data.
• The data set should be meaningfully described using summary measures.
5. Introduction…
• Summary measures describe the data in terms of where the data are concentrated and how much variability exists in the data.
• We use these summary figures to draw conclusions about the reference population from which the sample data have been drawn.
6. Measures of Central Tendency
• Measures of central tendency are sometimes called measures of central location; they are also classed as summary statistics.
• They give a measure of the centre of the data set, i.e. where the observations are concentrated.
7. Cont..
There are several measures of central tendency (MCT). These are:
o Mean
o Median
o Mode
• The mean, median and mode are all valid measures of central tendency, but under different conditions some measures become more appropriate to use than others.
8. I. Arithmetic Mean:
• It is the average of the data.
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
• For a random sample of size 10 of ages,
$$\bar{x} = \frac{42 + 28 + 28 + 61 + 31 + 23 + 50 + 38 + 32 + 37}{10} = \frac{370}{10} = 37$$
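As a minimal sketch in Python (the variable names are illustrative and not part of the slides), the mean of the ten ages can be computed directly from the definition, i.e. the sum of the observations divided by their number:

```python
# Ages from the random sample of size 10 used in the arithmetic mean example.
ages = [42, 28, 28, 61, 31, 23, 50, 38, 32, 37]

# Arithmetic mean: sum of the observations divided by the number of observations.
n = len(ages)
total = sum(ages)
mean_age = total / n

print(f"n = {n}, sum = {total}, mean = {mean_age}")
```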
9. Properties of the Mean
o Uniqueness: For a given set of data there is one and only one mean.
o Simplicity: It is easy to understand and to compute.
o Affected by extreme values, since all values enter into the computation.
• It can be used with both discrete and continuous data, although it is most often used with continuous data.
10. Example:
Assume the values are 115, 110, 119, 117, 121 and 126. The mean = 118.
But assume that the values are 75, 75, 80, 80 and 280. The mean is again 118, a value that is not representative of the set of data as a whole, because it is pulled upward by the single extreme value 280.
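The effect of an extreme value on the mean can be checked with a short sketch. It uses Python's standard `statistics.mean`; recomputing the mean without the value 280 is an extra illustrative step that is not on the slide.

```python
from statistics import mean

typical = [115, 110, 119, 117, 121, 126]
with_extreme = [75, 75, 80, 80, 280]

mean_typical = mean(typical)
mean_with_extreme = mean(with_extreme)

# Illustrative only: the same data with the extreme value 280 left out.
mean_without_extreme = mean([x for x in with_extreme if x != 280])

print("typical values:       ", mean_typical)
print("with extreme value:   ", mean_with_extreme)
print("without extreme value:", mean_without_extreme)
```

Both data sets have the same mean, but in the second set the mean sits far above four of the five values.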
11. Median
o The median is the middle score of a data set that has been arranged in order of magnitude.
o It is the observation that divides the ordered observations into two equal parts, such that half of the data lie before it and the other half after it.
o If n is odd, the median is the middle observation.
o It is the ((n+1)/2)th ordered observation.
12. Cont..
o When n = 11, the median is the 6th ordered observation.
o If n is even, there are two middle observations.
o The median is the mean of these two middle observations.
o It is [(n/2)th + ((n/2) + 1)th]/2, i.e. the average of the (n/2)th and ((n/2) + 1)th ordered observations.
o When n = 12, the median is the value halfway between the 6th and 7th ordered observations.
13. …Median
• For the same random sample, the ordered observations are
23, 28, 28, 31, 32, 37, 38, 42, 50, 61.
• Since n = 10, the median is the 5.5th observation, i.e. the mean of the 5th and 6th ordered observations:
= (32 + 37)/2 = 34.5
• For 23, 28, 28, 31, 32 (n = 5), the median is the (n+1)/2 = (5+1)/2 = 3rd observation = 28.
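The odd and even rules for the median can be sketched as a small Python function. The function name and the four-value even sample are hypothetical and chosen only for illustration; the five-value sample is the one from the slide.

```python
def median(values):
    """Median by ordered position: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th ordered observation (0-based index mid).
        return ordered[mid]
    # Even n: average of the (n/2)th and ((n/2) + 1)th ordered observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [23, 28, 28, 31, 32]   # n = 5, from the slide
even_sample = [23, 28, 31, 32]      # n = 4, hypothetical values for illustration

print(median(odd_sample))
print(median(even_sample))
```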
14. Properties of the Median:
• Uniqueness: For a given set of data there is one and only one median.
• Simplicity: It is easy to calculate.
• It is not as drastically affected by extreme values as the mean is.
15. Mode
• It is the value that occurs most frequently.
• If all values are different, there is no mode.
• Sometimes there is more than one mode.
Sample:
A. 25, 36, 79, 80, 93 (no modal value)
B. 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg)
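A possible way to find the mode, including the "no mode" and "more than one mode" cases, is sketched below with `collections.Counter`. The function name `modes` and the convention of returning an empty list when there is no mode are assumptions made for this example.

```python
from collections import Counter


def modes(values):
    """Return every most-frequent value, or an empty list if all values occur once."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # all values are different: no mode
    return sorted(v for v, c in counts.items() if c == highest)


sample_a = [25, 36, 79, 80, 93]
sample_b = [1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]

print(modes(sample_a))
print(modes(sample_b))
```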
16. Properties of the Mode
• Sometimes, it is not unique.
• It may be used for describing qualitative data.
• The mode is used for categorical data where we wish to know which is the most common category.
17. Example: In this particular data set, the most common form of transport is the bus, so the mode is "bus".
18. Example: On a bar chart or histogram, the mode is represented by the highest bar.
19. Summary of when to use the mean, median and mode: the best measure of central tendency depends on the type of variable.
Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median
22. Introduction
• Knowledge of central tendency alone is not sufficient for a complete understanding of a distribution.
• Two data sets can have the same mean and yet be entirely different. Thus, to describe data, one needs to know the extent of variability.
• Measures of spread tell us how far apart or how close together the data points in a sample are.
23. Introduction
• Measures of variability are measures of spread that tell us how much our data points vary around the average of the sample.
• The variance is a measure of how far the values in our data set differ from the mean.
24. Measures of Spread…
• Measures of spread are:
o Range (R).
o Variance and standard deviation.
o Coefficient of variation (C.V).
o Interquartile range.
• Measures of relative position (quantiles and percentiles)
25. Range (R)
Range = Largest value - Smallest value
Note:
o The range depends on only two values.
o It is highly sensitive to outliers.
o Data: 43, 66, 61, 64, 65, 38, 59, 57, 57, 50.
o Find the range: Range = 66 - 38 = 28
26. Variance
It measures dispersion as the scatter of the values about their mean.
a) Sample variance ($S^2$):
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},$$
where $\bar{x}$ is the sample mean.
Find the sample variance of the ages below, given $\bar{x} = 56$.
Data: 43, 66, 61, 64, 65, 38, 59, 57, 57, 50.
Solution:
$$S^2 = \frac{(43 - 56)^2 + (66 - 56)^2 + \cdots + (50 - 56)^2}{10 - 1} = \frac{810}{9} = 90$$
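The worked variance example can be reproduced step by step. This is a minimal sketch with illustrative variable names; it divides by n - 1, as in the sample variance formula, and also takes the square root to give the sample standard deviation.

```python
import math

ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
n = len(ages)
mean_age = sum(ages) / n

# Sum of squared deviations of each observation from the sample mean.
squared_deviations = sum((x - mean_age) ** 2 for x in ages)

# Sample variance divides by n - 1; the standard deviation is its square root.
sample_variance = squared_deviations / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(f"mean = {mean_age}, sum of squares = {squared_deviations}")
print(f"variance = {sample_variance}, SD = {sample_sd:.2f}")
```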
27. Standard Deviation
• It is the square root of the variance ($\sqrt{\text{Variance}}$).
a) Sample standard deviation (SD)
$= \sqrt{S^2}$
b) Population standard deviation ($\sigma$)
$= \sqrt{\sigma^2}$
28. Measures of Dispersion…
Two or more sets may have the same mean and/or median but still be quite different.
Consider the following two sets of data:
A: 177 193 195 209 226 Mean = 200
B: 192 197 200 202 209 Mean = 200
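A short sketch with Python's standard `statistics` module shows how the two sets differ even though their means agree. Using the sample standard deviation (`stdev`, which divides by n - 1) is a choice made for this example.

```python
from statistics import mean, stdev

set_a = [177, 193, 195, 209, 226]
set_b = [192, 197, 200, 202, 209]

for name, data in (("A", set_a), ("B", set_b)):
    # stdev() is the sample standard deviation (divides by n - 1).
    print(f"Set {name}: mean = {mean(data)}, SD = {stdev(data):.2f}")
```

The means are identical, but set A is spread much more widely around 200 than set B.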
29. Measures of Dispersion…
A measure of dispersion conveys information regarding the amount of variability present in a set of data.
Note:
1. If all the values are the same, there is no dispersion.
2. If the values are not all the same, there is dispersion.
3. If the values are close to each other, the amount of dispersion is small.
4. If the values are widely scattered, the dispersion is greater.
30. Standard deviation
• Caution must be exercised when using the standard deviation as a comparative index of dispersion.
Weights of newborn elephants (Kg)    Weights of newborn mice (Kg)
929    853                           0.72    0.42
878    939                           0.53    0.31
895    972                           0.59    0.38
937    841                           0.79    0.96
801    826                           1.06    0.89
n = 10                               n = 10
$\bar{X}$ = 887.1                    $\bar{X}$ = 0.68
SD = 56.50                           SD = 0.255
• It is incorrect to say that elephants show greater variation in birth weight than mice simply because their standard deviation is higher.
31. The Coefficient of Variation (C.V):
• It is a measure used to compare the dispersion in two sets of data, and it is independent of the unit of measurement.
$$CV = \frac{S}{\bar{X}} \times 100$$
where
S: sample standard deviation.
$\bar{X}$: sample mean.
32. Coefficient of Variation
• The coefficient of variation expresses the standard deviation relative to the mean.
Weights of newborn elephants (Kg)    Weights of newborn mice (Kg)
929    853                           0.72    0.42
878    939                           0.53    0.31
895    972                           0.59    0.38
937    841                           0.79    0.96
801    826                           1.06    0.89
n = 10                               n = 10
$\bar{X}$ = 887.1                    $\bar{X}$ = 0.68
SD = 56.50                           SD = 0.255
CV = 0.0637 (6.37%)                  CV = 0.375 (37.5%)
Note:
Mice show greater relative variation in birth weight.
33. Example:
• Suppose two samples of human males yield the following data. We wish to know which sample is more variable.
                      Sample 1        Sample 2
Age                   25-year-olds    11-year-olds
Mean weight           145 pounds      80 pounds
Standard deviation    10 pounds       10 pounds
34. Cont…
Solution:
C.V (Sample 1) = (10/145)*100 = 6.9%
C.V (Sample 2) = (10/80)*100 = 12.5%
• Hence the weights of the 11-year-olds (sample 2) show more relative variation.
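The CV calculations can be sketched as a small helper that works from summary statistics. The function name is hypothetical; the inputs are the means and standard deviations quoted on the slides, and the result is expressed as a percentage.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: (SD / mean) * 100."""
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return sd / mean * 100


# Two samples of human males (weights in pounds).
cv_sample_1 = coefficient_of_variation(10, 145)   # 25-year-olds
cv_sample_2 = coefficient_of_variation(10, 80)    # 11-year-olds

# Newborn elephants and mice, using the summary values quoted on the slides.
cv_elephants = coefficient_of_variation(56.50, 887.1)
cv_mice = coefficient_of_variation(0.255, 0.68)

print(f"Sample 1: {cv_sample_1:.1f}%   Sample 2: {cv_sample_2:.1f}%")
print(f"Elephants: {cv_elephants:.2f}%   Mice: {cv_mice:.1f}%")
```

Because the CV divides out the mean, the two groups can be compared even though their averages differ by several orders of magnitude.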
35. When to use the coefficient of variation
o When the comparison groups have very different means.
o When different units of measurement are involved,
e.g. group 1 unit is mm
group 2 unit is gm
• The CV is suitable for comparison because it is unit free.
o In such cases, the SD should not be used for comparison.
36. Measures of Relative Position
• Percentiles and Quartiles
o They are measures of spread that are determined by the numerical ordering of the data.
o They are cut points that divide the frequency distribution into equal groups, each containing the same fraction of the total population.
o They are less sensitive to outliers and are not greatly affected by the sample size.
o "Quantile" is the general term for such cut points; a percentile is a quantile expressed as a percentage.
37. Cont..
• The pth percentile is the value Vp such that p percent of the sample points are less than or equal to Vp.
Other frequently used percentiles include the following:
1. Tertiles:
• Two points that divide an ordered sample variable into three categories, each containing a third of the population (e.g., high, medium, low).
38. 2. Quartiles:
• Three points that divide an ordered sample variable into four categories, each containing a fourth of the population.
• The 25th, 50th, and 75th percentiles of a variable are used to categorize it into quartiles.
3. Quintiles:
• Four points that divide an ordered sample variable into five categories, each containing a fifth of the population.
• The 20th, 40th, 60th, and 80th percentiles of a variable are used to categorize it into quintiles.
39. Cont..
4. Deciles:
• Nine points that divide an ordered sample variable into ten categories, each containing a tenth of the population.
• The 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, and 90th percentiles of a variable are used to categorize it into deciles.
40. Cont…
• Suppose a data set is arranged in ascending (or descending) order. The pth percentile is a number such that p% of the observations of the data set fall below it.
• The lower quartile, QL, for a data set is the 25th percentile.
• The mid-quartile, M, for a data set is the 50th percentile (the median).
• The upper quartile, QU, for a data set is the 75th percentile.
• The interquartile range of a data set is QU - QL.
42. Finding quartiles for small data sets:
• Rank the n observations in the data set in ascending order of magnitude.
• Calculate the quantity (n+1)/4 and round to the nearest integer.
o The observation with this rank represents the lower quartile.
o If (n+1)/4 falls halfway between two integers, round up.
• Calculate the quantity 3(n+1)/4 and round to the nearest integer.
o The observation with this rank represents the upper quartile.
o If 3(n+1)/4 falls halfway between two integers, round down.
43. Inter-quartile range (IQR)
o 3rd quartile - 1st quartile (75th - 25th percentile)
o Robust to outliers
o Covers the middle 50% of observations
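The rank-based rule for small data sets and the IQR definition can be sketched as follows. The function names are hypothetical, and the data are the ten values from the range example. Note that library routines such as `statistics.quantiles` interpolate between observations and may return slightly different quartiles than this rounding rule.

```python
import math


def quartile_ranks(n):
    """Ranks of the lower and upper quartiles under the small-sample rounding rule."""
    lower = math.floor((n + 1) / 4 + 0.5)      # nearest integer, halves round up
    upper = math.ceil(3 * (n + 1) / 4 - 0.5)   # nearest integer, halves round down
    return lower, upper


def quartiles(values):
    """Lower and upper quartiles as the observations holding those ranks."""
    ordered = sorted(values)
    lower_rank, upper_rank = quartile_ranks(len(ordered))
    # Ranks are 1-based; Python lists are 0-based.
    return ordered[lower_rank - 1], ordered[upper_rank - 1]


data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
q_lower, q_upper = quartiles(data)
iqr = q_upper - q_lower

print(f"QL = {q_lower}, QU = {q_upper}, IQR = {iqr}")
```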
44. Boxplot
[Figure: boxplot with labels Minimum = x(1), Q1, Median, Q3, Maximum = x(n) and Outliers; IQR = Interquartile Range = Q3 - Q1, which is the range of the middle 50% of the data]
45. Example
• Find the lower quartile, the median, and the upper quartile for a data set of the ages of 189 subjects who participated in a study on smoking cessation (ages range from 30 to 82).
46. Solution
For this data set n = 189.
• Therefore,
o (n+1)/4 = 190/4 = 47.5,
o 3(n+1)/4 = 3*190/4 = 142.5.
o Following the rounding rule, we round 47.5 up to 48 and 142.5 down to 142.
• Hence,
o the lower quartile = 48th observation = ?,
o the median = (n+1)/2 = 95th observation = ?,
o the upper quartile = 142nd observation = ?.
47. Quiz
The incubation period of smallpox in 9 patients was found to be 14, 13, 11, 15, 10, 7, 9, 12 and 10.
Find:
1. Mean, median and mode
2. SD and variance
3. C.V
4. Range and IQR
5. Recommend the best MCT
48. Measures of Shape
• It is necessary to consider the shape of the data, i.e. the manner in which the data are distributed.
• There are two measures of the shape of a data set:
o Skewness
o Kurtosis
49. Skewness
• Skewness is a measure of the lack of symmetry in the distribution of scores.
• Skewness is defined by the formula:
$$a_3 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$
• Interpretation:
o a3 > 0: the distribution is skewed to the right (positively skewed).
o a3 < 0: the distribution is skewed to the left (negatively skewed).
o a3 = 0: the distribution is symmetrical.
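The skewness formula can be sketched directly in Python. The function name is hypothetical, s is taken to be the sample standard deviation (dividing by n - 1), and the test data reuse the values 75, 75, 80, 80, 280 from the mean example.

```python
import math


def sample_skewness(values):
    """a3 = n / ((n-1)(n-2)) * sum(((x_i - mean) / s) ** 3), with s the sample SD."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three observations are needed")
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    cubed_sum = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed_sum


symmetric = [1, 2, 3, 4, 5]
right_skewed = [75, 75, 80, 80, 280]   # one large extreme value

print(sample_skewness(symmetric))      # close to 0 for symmetric data
print(sample_skewness(right_skewed))   # positive for a long right tail
```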
51. Contd…
• The direction of the skewness depends upon the location of the extreme values.
• If the extreme values are the larger observations, the mean will be the measure of location most greatly distorted in the upward direction.
• Since the mean exceeds the median and the mode, such a distribution is said to be positive or right-skewed.
• The tail of its distribution is extended to the right.
52. Cont…
• On the other hand, if the extreme values are the smaller observations, the mean will be the measure of location most greatly reduced.
• Since the mean is exceeded by the median and the mode, such a distribution is said to be negative or left-skewed.
• The tail of its distribution is extended to the left.
54. Is it possible to detect skewness using a histogram?
55. Kurtosis
• Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the bell-shaped distribution (normal distribution).
• The kurtosis of a sample data set is calculated by the formula:
$$a_4 = \left\{ \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 \right\} - \frac{3(n-1)^2}{(n-2)(n-3)}$$
• Because of the subtracted term, this sample formula is centred so that normally distributed data give a value near 0; adding 3 puts it on the scale on which the normal distribution has kurtosis 3.
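The sample kurtosis formula on this slide, with its leading factor n(n+1)/((n-1)(n-2)(n-3)) and its subtracted term 3(n-1)^2/((n-2)(n-3)), can be sketched as below. The function name is hypothetical and s is taken to be the sample standard deviation. As written, the formula is centred so that normal data give a value near 0, so the sketch also prints the result plus 3 for comparison with a reference value of 3.

```python
import math


def sample_excess_kurtosis(values):
    """Bias-corrected sample kurtosis; values near 0 indicate normal-like tails."""
    n = len(values)
    if n < 4:
        raise ValueError("at least four observations are needed")
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    fourth_sum = sum(((x - mean) / s) ** 4 for x in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * fourth_sum - correction


data = [1, 2, 3, 4, 5]   # evenly spread, flat-looking data
excess = sample_excess_kurtosis(data)

print(f"formula value = {excess:.2f}, on the 'normal = 3' scale = {excess + 3:.2f}")
```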
56. Kurtosis:
• a4 > 3: heavier tails and a higher, sharper peak than a normal distribution
• a4 < 3: lighter tails and a lower, flatter peak than a normal distribution
• For a meaningful and comparable measure of a4, the distribution should be symmetrical (hence, again, the need for a normal distribution).
57. Kurtosis…
• When the data are normally distributed, the kurtosis equals 3 and the distribution is said to be mesokurtic.
• When the distribution is more peaked, with heavier tails, than the normal, its kurtosis is greater than 3 and it is said to be leptokurtic.
• When the distribution is flatter, with lighter tails, than the normal, its kurtosis is less than 3 and it is said to be platykurtic.
58. Contd…
• If a4 = 3, mesokurtic: the value for a bell-shaped distribution (Gaussian or normal distribution)
• If a4 > 3, leptokurtic: thin or peaked shape (with "heavy tails")
• If a4 < 3, platykurtic: flat shape (with "light tails")
59. Kurtosis
• Kurtosis measures whether the scores are spread out more or less than they would be in a normal (Gaussian) distribution.
[Figure: curves labelled Mesokurtic (a4 = 3), Leptokurtic (a4 > 3) and Platykurtic (a4 < 3)]