Understanding Data Summarization

3.Data summarization
Measures of central tendency and Dispersion
[MCTD]
7/8/2023 Data summarization 1

Learning Objectives
By the end of this chapter, the reader will be able to
Compute and interpret
Mean
Median
Mode
Standard deviation, &
variance

Key words:
O Measures of central tendency:
Mean (x̄), Median (x̃), Mode (x̂)
O Measures of dispersion:
Standard deviation, variance, …

Introduction
•Compiling and presenting data in tabular or graphical
form does not by itself give a complete picture of the data
collected.
•We need to “summarize” the whole data set in a single figure
that conveys an overall idea of the data.
•The data set should be meaningfully described using
summary measures.

Introduction…
•Summary measures provide description of data in terms
of concentration of data and variability existing in data.
•We use these summary figures to draw certain
conclusions about the reference population from which
the sample data has been drawn.

Measures of Central tendency
•Measures of central tendency are sometimes called
measures of central location and are a class of summary
statistics.
•They indicate the center of the data set, i.e. where
the observations are concentrated.

Cont…
There are several measures of central tendency (MCT). The most common are:
Mean
Median
Mode
•The mean, median and mode are all valid measures of
central tendency, but under different conditions some
measures become more appropriate to
use than others.

I. Arithmetic Mean:
•It is the average of the data.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
•Example: for a random sample of n = 10 ages,
$$\bar{x} = \frac{42 + 28 + 28 + 61 + 31 + 23 + 50 + 38 + 32 + 37}{10} = \frac{370}{10} = 37$$
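
The same calculation can be sketched in a few lines of Python. This is only an illustration; the variable names `ages` and `mean_age` are chosen here and do not come from the slides.

```python
# Ages of the random sample of size 10 from the slide
ages = [42, 28, 28, 61, 31, 23, 50, 38, 32, 37]

# Arithmetic mean: sum of the observations divided by their number
mean_age = sum(ages) / len(ages)
print(mean_age)  # 37.0
```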

Properties of the Mean
oUniqueness: For a given set of data there is one and only one
mean.
oSimplicity: It is easy to understand and to compute.
oAffected by extreme values: since all values enter into the
computation.
It can be used with both discrete and continuous data, although
its use is most often with continuous data.

Example:
Assume the values are 115, 110, 119,117,121 and 126. The
mean = 118.
But assume that the values are 75, 75, 80, 80 and 280. The mean
= 118, a value that is not representative of the set of data as a
whole.
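
A short sketch of how a single extreme value pulls the mean, using the two sets from this example. The helper `mean` and the list names are illustrative assumptions.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

typical = [115, 110, 119, 117, 121, 126]
with_extreme = [75, 75, 80, 80, 280]

print(mean(typical))            # 118.0
print(mean(with_extreme))       # 118.0
# Dropping the extreme value 280 shows where the other values really lie
print(mean(with_extreme[:-1]))  # 77.5
```

Both sets have the same mean, but only in the first set is 118 close to the individual values.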

Median
oThe median is the middle score for a set of data that has been
arranged in order of magnitude.
oIt is the observation that divides the ordered set of observations into two
equal parts, such that half of the data lie before it and the other half
after it.
oIf n is odd, the median is the middle observation.
oIt is the ((n+1)/2)th ordered observation.

oWhen n = 11, the median is the 6th ordered
observation.
oIf n is even, there are two middle observations.
oThe median is the mean of these two middle observations, i.e. of
the (n/2)th and ((n/2) + 1)th ordered
observations.
oWhen n = 12, the median is the value
halfway between the 6th
and 7th ordered observations.

…Median
For the same random sample, the ordered observations are
23, 28, 28, 31, 32, 37, 38, 42, 50, 61.
• Since n = 10, the median lies at position (n+1)/2 = 5.5, i.e. halfway
between the 5th and 6th ordered observations:
median = (32 + 37)/2 = 34.5
• For the five observations 23, 28, 28, 31, 32 (n = 5), the median is the
(5+1)/2 = 3rd ordered
observation = 28
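
The odd/even rule can be sketched as a small Python function. The function name `median` is an assumption made for illustration; `ages` holds the ten ages listed on the arithmetic mean slide.

```python
def median(values):
    """Median by the (n+1)/2 rule: the middle value if n is odd,
    the mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # ((n+1)/2)th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # (n/2)th and ((n/2)+1)th

ages = [42, 28, 28, 61, 31, 23, 50, 38, 32, 37]  # sample from the mean slide
print(median(ages))                  # 34.5
print(median([23, 28, 28, 31, 32]))  # 28
```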

Properties of the Median:
•Uniqueness: For a given set of data there is one and
only one median.
•Simplicity: It is easy to calculate.
•Unlike the mean, it is not greatly affected by extreme
values.

Mode
• It is the value which occurs most frequently.
• If all values are different there is no mode.
• Sometimes there is more than one mode.
Samples:
A. 25, 36, 79, 80, 93 (no modal value)
B. 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value =
3.0 kg)
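
A minimal sketch of finding the mode(s) in Python, covering the "no mode" and "more than one mode" cases above. The function name `modes` and the convention of returning an empty list when every value occurs once are assumptions made here.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values; an empty list means no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # all values are different: no mode
    return sorted(v for v, c in counts.items() if c == highest)

sample_a = [25, 36, 79, 80, 93]
sample_b = [1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]

print(modes(sample_a))         # []
print(modes(sample_b))         # [3.0]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
```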

Properties of the Mode
•Sometimes, it is not unique.
•It may be used for describing qualitative data
•The mode is used for categorical data where we
wish to know which is the most common category.

Example: In a data set on forms of transport (chart not shown), the most
common form of transport is the bus, so “bus” is the mode.

Example: the mode corresponds to the highest bar
in a bar chart or histogram.

Summary of when to use the mean, median and
mode: the best measure of central
tendency depends on the type of
variable.

Type of variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median

Exercises
Calculate
1) Arithmetic Mean
2) Median
3) Mode
Ages of Women in Clinic
23 31 55 43 55 19 17 44 43 37

Measures of Spread/Dispersion

Introduction
•Knowledge of central tendency alone is not sufficient for
complete understanding of distribution.
•Two data sets can have the same mean and yet be
entirely different. Thus, to describe data, one needs to
know the extent of variability.
•Measures of spread tell us how far or how close together
the data points are in a sample.

Introduction…
•Measures of variability are measures of spread that tell
us how varied our data points are from the average of
the sample.
•The variance is a measure that will inform us how
values in our dataset differ from the mean.

Measures of Spread…
•Measures of spread are :
oRange (R).
oVariance and Standard deviation.
oCoefficient of variation (C.V).
oInterquartile Range
• Measures of Relative Position (Quantiles and
Percentiles)

Range (R)
Range = Largest value - Smallest value
Note:
oThe range depends on only two values (the largest and the smallest)
oIt is highly sensitive to outliers
oData: 43, 66, 61, 64, 65, 38, 59, 57, 57, 50.
oFind the range: Range = 66 - 38 = 28

Variance
It measures dispersion as the scatter of the values
about their mean.
a) Sample variance ($s^2$):
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where $\bar{x}$ is the sample mean.
Find the sample variance of the ages below, given $\bar{x} = 56$.
Data: 43, 66, 61, 64, 65, 38, 59, 57, 57, 50.
Solution:
$$s^2 = \frac{(43 - 56)^2 + (66 - 56)^2 + \cdots + (50 - 56)^2}{10 - 1} = \frac{810}{9} = 90$$
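
The worked solution can be checked with a short Python sketch. The function name `sample_variance` is an illustrative choice; the cross-check uses `statistics.variance` from the standard library, which also divides by n - 1.

```python
import math
import statistics

ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

s2 = sample_variance(ages)
print(s2)                          # 90.0
print(round(math.sqrt(s2), 2))     # 9.49 (the sample standard deviation)

# Cross-check against the standard library
assert math.isclose(s2, statistics.variance(ages))
```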

Standard Deviation
•It is the square root of the variance ($\sqrt{\text{Variance}}$)
a) Sample standard deviation (SD)
$s = \sqrt{s^2}$
b) Population standard deviation ($\sigma$)
$\sigma = \sqrt{\sigma^2}$

Measures of Dispersion…
Consider the following two sets of data:
A: 177 193 195 209 226 Mean = 200
B: 192 197 200 202 209 Mean = 200
Two or more sets may have the same mean and/or median but
they may be quite different.
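
A brief sketch comparing the two sets numerically. It uses `statistics.stdev` (the sample standard deviation, with n - 1 in the denominator); the variable names are assumptions made for illustration.

```python
import statistics

set_a = [177, 193, 195, 209, 226]
set_b = [192, 197, 200, 202, 209]

for name, data in (("A", set_a), ("B", set_b)):
    mean = sum(data) / len(data)
    sd = statistics.stdev(data)  # sample standard deviation
    print(f"{name}: mean = {mean:.1f}, SD = {sd:.2f}")
# A: mean = 200.0, SD = 18.44
# B: mean = 200.0, SD = 6.28
```

The means are identical, but set A is roughly three times as spread out as set B.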

Measures of Dispersion…
A measure of dispersion conveys information regarding the amount
of variability present in a set of data.
Note:
1. If all the values are the same, there is no dispersion.
2. If the values are not all the same, there is dispersion.
3. If the values are close to each other, the amount of dispersion is
small.
4. If the values are widely scattered, the dispersion is greater.

Standard deviation
• Caution must be exercised when using the standard deviation as a
comparative index of dispersion

Weights of newborn elephants (kg)    Weights of newborn mice (kg)
929   853                            0.72   0.42
878   939                            0.53   0.31
895   972                            0.59   0.38
937   841                            0.79   0.96
801   826                            1.06   0.89
n = 10                               n = 10
x̄ = 887.1                            x̄ = 0.68
SD = 56.50                           SD = 0.255

• It is incorrect to say that elephants show greater variation in birth
weight than mice merely because their standard deviation is higher

The Coefficient of Variation (C.V):
•It is a measure used to compare the dispersion in two sets
of data, and it is independent of the unit of
measurement.
$$CV = \frac{s}{\bar{x}} \times 100$$
where
$s$: sample standard deviation.
$\bar{x}$: sample mean.
Coefficient of Variation
• The coefficient of variation expresses the standard deviation relative to its
mean

Weights of newborn elephants (kg)    Weights of newborn mice (kg)
929   853                            0.72   0.42
878   939                            0.53   0.31
895   972                            0.59   0.38
937   841                            0.79   0.96
801   826                            1.06   0.89
n = 10                               n = 10
x̄ = 887.1                            x̄ = 0.68
SD = 56.50                           SD = 0.255
CV = 6.37%                           CV = 37.5%

Note:
Mice show greater birth weight variation

Example:
• Suppose two samples of human males yield the following data.
We wish to know which sample is more variable.

                      Sample 1        Sample 2
Age                   25-year-olds    11-year-olds
Mean weight           145 pounds      80 pounds
Standard deviation    10 pounds       10 pounds

Cont…
Solution:
C.V (Sample 1) = (10/145) * 100 = 6.9%
C.V (Sample 2) = (10/80) * 100 = 12.5%
• Hence the weights of the 11-year-olds (Sample 2) are relatively more variable

When to use the coefficient of variation
oWhen the comparison groups have very different means
oWhen different units of measurement are involved,
e.g. group 1 is measured in mm and
group 2 in g
oThe CV is suitable for comparison because it is unit-free
oIn such cases, the SD should not be used for comparison

Measures of Relative Position
Percentiles and Quartiles
They are measures of position that are determined by the numerical
ordering of the data.
They are cut points that divide the frequency distribution into
equal groups, each containing the same fraction of the total
population.
They are less sensitive to outliers and are not greatly affected by
the sample size.
Quantile is the general term for such cut points; percentiles are quantiles expressed in percent.

Cont…
• The pth percentile is the value Vp such that p percent of the
sample points are less than or equal to Vp.
Other frequently used percentiles include the following:
1. Tertiles:
• Two points that divide and order a sample variable into three
categories, each containing a third of the population
(e.g., high, medium, low).

2. Quartiles:
• Three points that divide and order a sample variable into four
categories, each containing a fourth of the population.
• The 25th, 50th, and 75th percentiles of a variable are used to categorize it
into quartiles.
3. Quintiles:
• Four points that divide and order a sample variable into five categories,
each containing a fifth of the population.
• The 20th, 40th, 60th, and 80th percentiles of a variable are used to
categorize it into quintiles.

Cont…
4. Deciles:
•Nine points that divide and order a sample variable into
ten categories, each containing a tenth of the
population.
•The 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, and
90th percentiles of a variable are used to categorize it
into deciles.

Cont…
•Suppose a data set is arranged in ascending (or descending)
order. The pth percentile is a number such that p% of
the observations of the data set fall below it.
•The lower quartile, QL, for a data set is the 25th percentile.
•The middle quartile, M, for a data set is the 50th percentile (the median).
•The upper quartile, QU, for a data set is the 75th percentile.
•The interquartile range of a data set is QU - QL.

Finding quartiles for small data sets:
• Rank the n observations in the data set in ascending order of
magnitude.
• Calculate the quantity (n+1)/4 and round to the nearest integer.
oThe observation with this rank represents the lower quartile.
oIf (n+1)/4 falls halfway between two integers, round up.
• Calculate the quantity 3(n+1)/4 and round to the nearest integer.
oThe observation with this rank represents the upper quartile.
oIf 3(n+1)/4 falls halfway between two integers, round down.
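
The rounding rule above can be sketched in Python as follows. The function names are illustrative assumptions, and the data are the ten values used on the Range slide. Python's built-in `round` is avoided because it rounds halves to the nearest even integer, which is not the rule stated here.

```python
import math

def quartile_ranks(n):
    """Ranks of the lower and upper quartiles for n ordered observations."""
    lower = (n + 1) / 4
    upper = 3 * (n + 1) / 4
    lower_rank = math.floor(lower + 0.5)  # nearest integer, halves round up
    upper_rank = math.ceil(upper - 0.5)   # nearest integer, halves round down
    return lower_rank, upper_rank

def quartiles(values):
    """Lower and upper quartiles of a small data set by the rank rule."""
    ordered = sorted(values)
    lower_rank, upper_rank = quartile_ranks(len(ordered))
    return ordered[lower_rank - 1], ordered[upper_rank - 1]  # ranks are 1-based

data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
q_lower, q_upper = quartiles(data)
print(quartile_ranks(len(data)))            # (3, 8)
print(q_lower, q_upper, q_upper - q_lower)  # 50 64 14
```

The last value printed is the interquartile range, QU - QL.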

Inter-quartile range(IQR)
o3rd quartile – 1st quartile (75th – 25th percentile)
oRobust to outliers
oMiddle 50% of observations

Boxplot
A boxplot displays:
oMinimum = x(1)
oQ1, Median, Q3
oIQR = Interquartile Range, which is the range of the middle 50% of the data
oOutliers
oMaximum = x(n)

Example
•Find the lower quartile, the median, and the upper
quartile for the ages of 189 subjects who
participated in a study on smoking cessation (ages range
from 30 to 82)

Solution
For this data set n = 189.
• Therefore,
o(n+1)/4 = 190/4 = 47.5,
o3(n+1)/4 = 3*190/4 = 142.5.
oFollowing the rounding rule, we round 47.5 up to 48 and 142.5 down to 142.
• Hence,
othe lower quartile = 48th observation =? ,
othe median = (n+1)/2 = 95th observation =? ,
othe upper quartile = 142nd observation =?.

Quiz
The incubation periods of smallpox in 9 patients were
found to be 14, 13, 11, 15, 10, 7, 9, 12 and 10.
Find:
1. Mean, Median & mode
2. SD and Variance
3. C.V
4. Range & IQR
5. Recommend the best MCT

Measures of Shape
•It is necessary to consider the shape of the data, i.e. the
manner in which the data are distributed.
•There are two measures of the shape of a data set:
o Skewness
o Kurtosis

Skewness
Skewness is a measure of the asymmetry in the distribution of scores.
It is defined by the formula:
$$a_3 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$
• a3 > 0: the distribution is skewed to the right (positively skewed)
• a3 < 0: the distribution is skewed to the left (negatively skewed)
• a3 = 0: the distribution is symmetrical
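
A minimal sketch of the skewness formula in Python, assuming s is the sample standard deviation (n - 1 in the denominator), as elsewhere in these slides. The function name `sample_skewness` is an illustrative choice, and the second data set is the one with the extreme value 280 from the mean example.

```python
import statistics

def sample_skewness(values):
    """a3 = n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 observations")
    mean = sum(values) / n
    s = statistics.stdev(values)  # sample SD; must be non-zero
    total = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * total

symmetric = [1, 2, 3, 4, 5]
right_skewed = [75, 75, 80, 80, 280]

print(abs(sample_skewness(symmetric)) < 1e-12)  # True: essentially zero
print(sample_skewness(right_skewed) > 0)        # True: positively skewed
```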

Measure of Skew
Positive Skew
Negative Skew
Normal (skew = 0)

Contd…
•The direction of the skewness depends upon the location
of the extreme values.
•If the extreme values are the larger observations, the
mean will be the measure of location most greatly
distorted toward the upward direction.
•Since the mean exceeds the median and the mode, such
distribution is said to be positive or right-skewed.
•The tail of its distribution is extended to the right.

Cont…
•On the other hand, if the extreme values are the smaller
observations, the mean will be the measure of location
most greatly reduced.
•Since the mean is exceeded by the median and the mode,
such distribution is said to be negative or left-skewed.
•The tail of its distribution is extended to the left.

Skewness…
Mean, Median & Mode

Is it possible to detect skewness using a histogram?

Kurtosis
•Kurtosis characterizes the relative peakedness or flatness of a
distribution compared with the bell-shaped distribution
(normal distribution).
•Kurtosis of a sample data set is calculated by the formula:
$$a_4 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$
•Because of the subtracted term, this formula gives the excess kurtosis, which is
about 0 for a normal distribution; add 3 to the result before comparing it with
the reference value of 3.
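
A minimal sketch of this sample kurtosis formula in Python, assuming s is the sample standard deviation. Note that, with the subtracted correction term, the formula returns excess kurtosis (about 0 for normal data), so 3 is added back here before comparing with the reference value of 3. The function name is an illustrative choice.

```python
import statistics

def sample_excess_kurtosis(values):
    """Sample kurtosis with the subtracted correction term (excess kurtosis)."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 observations")
    mean = sum(values) / n
    s = statistics.stdev(values)  # sample SD; must be non-zero
    total = sum(((x - mean) / s) ** 4 for x in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * total - correction

flat = [1, 2, 3, 4, 5]  # evenly spread values: flatter than a normal curve
excess = sample_excess_kurtosis(flat)
print(round(excess, 4))      # -1.2
print(round(excess + 3, 4))  # 1.8, below 3, so platykurtic
```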

Kurtosis:
• a4 > 3: heavier tails and a higher peak than a normal
distribution
• a4 < 3: lighter tails and a lower peak than a normal
distribution
•For a meaningful and comparable measure of a4, the
distribution should be symmetrical (hence again the need to
have a normal distribution)

Kurtosis…
•When the data are normally distributed, the kurtosis
equals 3 and the distribution is said to be mesokurtic
•When the distribution is more peaked than normal, with heavier tails, its
kurtosis is greater than 3 and it is said to be leptokurtic
•When the distribution is flatter than normal, with lighter tails, its
kurtosis is less than 3 and it is said to be platykurtic

Contd…
•If a4 = 3, mesokurtic: the value for a bell-shaped
distribution (Gaussian or normal distribution)
•If a4 > 3, leptokurtic: peaked shape (with “heavy
tails”)
•If a4 < 3, platykurtic: flat shape (with “light tails”)

Kurtosis
•Kurtosis measures whether the scores are spread out
more or less than they would be in a normal (Gaussian)
distribution
Mesokurtic (a4 = 3)
Leptokurtic (a4 > 3)
Platykurtic (a4 < 3)

Three histograms representing kurtosis

Thank you !
For more
