3. Descriptive statistics.pdf

Descriptive statistics:
Numerical summary
measures
Tufa Kolola
(MPH, Ass’t. Prof.)

Contents
§ Introduction
§ Measures of central tendency
§ Measures of relative standing
§ Shape of distribution
§ Measures of dispersion

Learning
objectives
By the end of this session you will be able
to:
§ Compute and interpret the mean, median, and
mode for a set of data
§ Construct and interpret a box and whiskers plot
§ Compute and interpret the range, variance,
standard deviation, and coefficient of variation for a
set of data
§ Use numerical measures along with graphs,
charts, and tables to describe data

Numerical
summary measures
Numerical summary measures: descriptive
measures that summarize a data set with a
single number
§ Unlike frequency distributions, they indicate the
average (middle) value and the spread of
the values

Summary Measures
Numerical summary measures:
§ Measures of central tendency (Location)
– Mean
– Median
– Mode
– Weighted Mean
§ Measures of Relative Standing
– Percentiles
– Quartiles
§ Measures of dispersion (Variation)
– Range
– Interquartile Range
– Variance
– Standard Deviation
– Coefficient of Variation

Measures of central
tendency(MCT)
§ On the scale of values of a variable, there is a certain
point around which the largest number of items tend to
cluster
§ Since this point is usually near the centre of the distribution,
the tendency of the statistical data to concentrate
at a certain value is called “central tendency”
§ The various methods of determining the point about
which the observations tend to concentrate are called
measures of central tendency (MCT)

Characteristics of
good MCT
1. It should be based on all the observations
2. It should not be affected by the extreme values
3. It should be as close to the minimum & maximum
number of values as possible
4. It should have a definite value
5. It should not be subjected to complicated and
tedious calculations
6. It should be capable of further algebraic treatment
7. It should be stable with regard to sampling

Measures of central
tendency(MCT)
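Center and Location: Mean, Median, Mode, Weighted Mean
§ Weighted mean (sample): $\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i}$
§ Weighted mean (population): $\mu_W = \frac{\sum w_i x_i}{\sum w_i}$
where $w_i$ is the weight attached to observation $x_i$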
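The weighted mean can be sketched in a few lines of Python. This is only an illustration: the function name `weighted_mean` and the exam-score data are hypothetical and do not come from the slides.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical data: three course scores weighted by credit hours.
scores = [80, 90, 70]
credit_hours = [3, 2, 1]
result = weighted_mean(scores, credit_hours)  # (240 + 180 + 70) / 6
print(round(result, 2))
```

With equal weights the function reduces to the ordinary arithmetic mean.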

Arithmetic Mean:
ungrouped data
§ The mean is the average of a data set: the sum of
all the observations divided by the total number of
observations
– Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
– Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$
n = Sample Size
N = Population Size

Arithmetic Mean
§ The most common measure of central tendency
§ Affected by extreme values (outliers)
[Figure: two dot plots on a 0 to 10 scale]
– Values 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
– Values 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
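The effect of a single extreme value on the mean can be checked with a short Python sketch. It uses the two sets of values from the slide; the function name `arithmetic_mean` is chosen here for illustration.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


data = [1, 2, 3, 4, 5]
data_with_outlier = [1, 2, 3, 4, 10]  # same data, largest value replaced by 10

mean_a = arithmetic_mean(data)               # 15 / 5
mean_b = arithmetic_mean(data_with_outlier)  # 20 / 5
print(mean_a, mean_b)
```

Changing one observation from 5 to 10 moves the mean from 3 to 4, which is what "affected by extreme values" means in practice.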

Grouped data
§ In calculating the mean from grouped data, we
assume that all values falling into a particular
class interval are located at the midpoint of the
interval. It is calculated as follows:
Sample mean: $\bar{x} = \frac{\sum m_i f_i}{\sum f_i}$
§ Where:
mi=the midpoint of the ith class interval
fi= the frequency of the ith class interval
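A minimal Python sketch of the grouped-data mean follows. The class midpoints and frequencies below are hypothetical, made up for illustration; they are not the age table referred to in the example.

```python
def grouped_mean(midpoints, frequencies):
    """Return sum(m_i * f_i) / sum(f_i) for grouped data."""
    if len(midpoints) != len(frequencies):
        raise ValueError("midpoints and frequencies must have the same length")
    total = sum(frequencies)
    if total == 0:
        raise ValueError("total frequency must be positive")
    return sum(m * f for m, f in zip(midpoints, frequencies)) / total


# Hypothetical classes 10-19, 20-29, 30-39 (midpoints 14.5, 24.5, 34.5).
midpoints = [14.5, 24.5, 34.5]
frequencies = [2, 5, 3]
result = grouped_mean(midpoints, frequencies)  # (29 + 122.5 + 103.5) / 10
print(result)
```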

Example. Compute the mean age of 169 subjects
from the grouped data

Properties of
Arithmetic Mean
§ For a given set of data there is one and only one
arithmetic mean (uniqueness)
§ Easy to calculate and understand (simple)
§ Influenced by each and every value in a data set
§ Greatly affected by the extreme values
§ Poor measure of location if the underlying
distribution is not normal (or not Gaussian)
§ In the case of grouped data, if any class interval is
open, the arithmetic mean cannot be calculated

Median: Ungrouped
data
§ In an ordered array, the median is the “middle”
number
– If n or N is odd, the median is the middle number
– If n or N is even, the median is the average of the
two middle numbers
[Figure: two dot plots on a 0 to 10 scale, each with Median = 3]
§ The median is the value of the middle term in a
data set that has been ranked in increasing order
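The odd and even cases can be sketched in Python as follows. The two five-value data sets reuse the values from the arithmetic mean slide, on the assumption that the median figure shows the same data; the four-value set is added here only to illustrate the even case.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


median_a = median([1, 2, 3, 4, 5])    # odd n: the middle number
median_b = median([1, 2, 3, 4, 10])   # extreme value does not move the median
median_even = median([1, 2, 3, 4])    # even n: average of 2 and 3
print(median_a, median_b, median_even)
```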

Grouped data
§ In calculating the median from grouped data, we
assume that the values within a class-interval are
evenly distributed through the interval
§ The first step is to locate the class interval in
which the median is located, using the following
procedure
§ Find n/2 and locate the first class interval whose
cumulative frequency is at least n/2
§ Then, use the following formula
$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$
where,
Lm = lower true class boundary of the interval containing the
median
Fc = cumulative frequency of the interval just before (above, in the
table) the median class interval
fm = frequency of the interval containing the median
W = class interval width
n = total number of observations

Example: Compute the median age of 169
subjects from the grouped data

§ n/2 = 169/2 = 84.5
§ The cumulative frequency first reaches 84.5 in the 3rd class interval
§ Lower class boundary = 29.5, upper class boundary = 39.5 (W = 10)
§ Frequency of the class (fm) = 47
§ (n/2 – Fc) = 84.5 – 70 = 14.5
Median = 29.5 + (14.5/47)(10) = 32.59 ≈ 33
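The arithmetic in this example can be reproduced with a small Python sketch. The function and parameter names are chosen here for illustration; the numbers are the ones given in the example (lower boundary 29.5, n = 169, cumulative frequency 70 before the median class, class frequency 47, width 10).

```python
def grouped_median(lower_boundary, n, cum_freq_before, class_freq, width):
    """Median from grouped data: L_m + ((n/2 - F_c) / f_m) * W."""
    if class_freq <= 0:
        raise ValueError("the median class must have a positive frequency")
    return lower_boundary + ((n / 2 - cum_freq_before) / class_freq) * width


median_age = grouped_median(
    lower_boundary=29.5, n=169, cum_freq_before=70, class_freq=47, width=10
)
print(round(median_age, 2))
```

The result is about 32.59, which rounds to 33 years.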


Properties of
Median
§ There is only one median for a given set of data
(uniqueness)
§ The median is easy to calculate
§ Median is a positional average and hence it is
insensitive to very large or very small values
§ Median can be calculated even in the case of open-ended
intervals, provided the sample size is known
§ It is determined mainly by the middle points and
less sensitive to the remaining data points
(weakness)

Mode: Ungrouped
data
§ Value that occurs most often
§ Not affected by extreme values
§ Used for either numerical or categorical data
§ There may be no mode
§ There may be several modes
[Figure: dot plot on a 0 to 14 scale, Mode = 5]
[Figure: dot plot on a 0 to 6 scale, No Mode]
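The possibilities listed above (one mode, several modes, no mode, categorical data) can be sketched in Python. The data sets are hypothetical, and the convention that a data set has no mode when every value occurs exactly once is an assumption made for this sketch.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); empty list if every value occurs once."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []  # all values distinct: treated here as "no mode"
    return sorted(v for v, c in counts.items() if c == top)


one_mode = modes([1, 3, 5, 5, 5, 7, 9])       # a single mode
no_mode = modes([0, 1, 2, 3, 4, 5, 6])        # every value occurs once
two_modes = modes([1, 1, 2, 2, 3])            # several modes
categorical = modes(["A", "B", "A", "O"])     # works for categorical data too
print(one_mode, no_mode, two_modes, categorical)
```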

Mode: Grouped
data
§ To find the mode of grouped data, we usually
refer to the modal class, where the modal class
is the class interval with the highest frequency
§ If a single value for the mode of grouped data
must be specified, it is taken as the mid-point of
the modal class interval

Properties of
Mode
§ It is not affected by extreme values
§ It can be calculated for distributions with open end
classes
§ Often its value is not unique
§ The main drawback of mode is that often it does
not exist

Measures of
Relative Standing
§ Where does one particular measurement stand
in relation to the other measurements in the data
set?
§ Descriptive measures that locate the relative
position of an observation in relation to the other
observations are called measures of relative
standing

Measures of
Relative Standing
Percentiles
§ The pth percentile in a data array is a number such that
p% of the observations of the data set fall below it and
(100 − p)% of the observations fall above it
(where 0 ≤ p ≤ 100)
Quartiles
§ 1st quartile = 25th percentile
§ 2nd quartile = 50th percentile = median
§ 3rd quartile = 75th percentile

Percentiles
§ The pth percentile in an ordered array of n values
is the value in the ith position, where
$i = \frac{p}{100}(n + 1)$
§ Example: The 60th percentile in an ordered array
of 19 values is the value in the 12th position:
$i = \frac{p}{100}(n + 1) = \frac{60}{100}(19 + 1) = 12$
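The position rule can be written as a one-line Python function. The name `percentile_position` is chosen here for illustration; the call reproduces the 60th percentile example for 19 values.

```python
def percentile_position(p, n):
    """1-based position of the pth percentile in an ordered array of n values."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    return p * (n + 1) / 100


position = percentile_position(60, 19)  # 60 * 20 / 100
print(position)
```

When the position is not a whole number, the percentile lies between two ordered values and is found by interpolation, as in the quartile examples.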

Percentiles
§ Commonly used percentiles
– First (lower) decile = 10th percentile
– First (lower) quartile, Q1 = 25th percentile
– Second (middle) quartile, Q2 = 50th percentile
– Third quartile, Q3 = 75th percentile
– Ninth (upper) decile = 90th percentile

Quartiles
§ Quartiles Split Ordered Data into 4 equal
portions
§ Q1 and Q3 are Measures of Non-central Location
§ Q2 = the Median
[Figure: ordered data split into four portions of 25% each by Q1, Q2, and Q3]

Quartiles
§ Each Quartile has position and value
– With the data in an ordered array, the position of Qi
is: $\text{position of } Q_i = \frac{i(n + 1)}{4}$
– The value of Qi is the value associated with that
position in the ordered array
§ Example:
Data in Ordered Array: 11 12 13 16 16 17 18 21 22
Position of $Q_1 = \frac{1(9 + 1)}{4} = 2.5$, so $Q_1 = \frac{12 + 13}{2} = 12.5$

Example
The prices ($) of 18 brands of walking shoes:
40 60 65 65 67 68 68 70 70
70 70 70 70 74 75 75 90 95
✓ Position of Q1 = 1(18 + 1)/4 = 4.75, so Q1 is 3/4 of the way
between the 4th and 5th ordered measurements, or
Q1 = 65 + .75(67 - 65) = 66.5

Example
The prices ($) of 18 brands of walking shoes:
40 60 65 65 67 68 68 70 70
70 70 70 70 74 75 75 90 95
✓ Position of Q3 = 3(18 + 1)/4 = 14.25, so Q3 is 1/4 of the way
between the 14th and 15th ordered measurements, or
Q3 = 74 + .25(75 - 74) = 74.25
✓ And IQR = Q3 – Q1 = 74.25 – 66.5 = 7.75
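The quartile calculations above can be reproduced with a short Python sketch that locates position i(n + 1)/4 and interpolates between the two neighbouring ordered values. The function name `quartile` is chosen here for illustration; other textbooks and software use slightly different position rules, so results may differ from theirs.

```python
import math


def quartile(values, i):
    """i-th quartile (i = 1, 2, 3) using the position rule i(n + 1)/4."""
    if i not in (1, 2, 3):
        raise ValueError("i must be 1, 2, or 3")
    ordered = sorted(values)
    n = len(ordered)
    position = i * (n + 1) / 4            # 1-based, may be fractional
    lower = math.floor(position)
    fraction = position - lower
    lower = min(max(lower, 1), n)         # keep positions inside the array
    upper = min(lower + 1, n)
    low_value, high_value = ordered[lower - 1], ordered[upper - 1]
    return low_value + fraction * (high_value - low_value)


prices = [40, 60, 65, 65, 67, 68, 68, 70, 70,
          70, 70, 70, 70, 74, 75, 75, 90, 95]
q1 = quartile(prices, 1)
q3 = quartile(prices, 3)
iqr = q3 - q1
print(q1, q3, iqr)
```

For the nine-value array 11 12 13 16 16 17 18 21 22, the same function gives Q1 = 12.5.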

Shape of a
Distribution
§ Describes how data is distributed
§ Measures of Shape
- Symmetric or skewed (asymmetric)
– Left-Skewed (longer tail extends to left): Mean < Median < Mode
– Symmetric: Mean = Median = Mode
– Right-Skewed (longer tail extends to right): Mode < Median < Mean

The Five Number
Summary
§ One way to give a nice profile of a data set is the
“five-number summary,” which consists of:
1. The smallest measurement
2. The first quartile, Q1
3. The median, Q2
4. The third quartile, Q3
5. The largest measurement
§ Displayed visually using a box-and-whiskers plot
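As an illustration, the five-number summary of the walking-shoe prices from the quartile example can be computed with Python's standard library. The sketch assumes that `statistics.quantiles` with `method="exclusive"` matches the i(n + 1)/4 position rule used in these slides; the function name `five_number_summary` is chosen here.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    return min(values), q1, q2, q3, max(values)


prices = [40, 60, 65, 65, 67, 68, 68, 70, 70,
          70, 70, 70, 70, 74, 75, 75, 90, 95]
summary = five_number_summary(prices)
print(summary)
```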

The Box-and-
Whisker plot
§ 5-number summary
– Median, Q1, Q3, Xsmallest, Xlargest
§ Box Plot
– Graphical display of data using 5-number
summary
[Figure: box-and-whisker plot on a 4 to 12 scale, marking Minimum, Q1, Median (Q2), Q3, and Maximum]

Distribution Shape &
Box-and-Whisker Plot
[Figure: box-and-whisker plots for left-skewed, symmetric, and right-skewed distributions, each marking Q1, Q2, and Q3]
§ Skewed distributions usually have a long whisker in the
direction of the skewness

Shape of a Distribution
and Quartiles
§ If the distribution is symmetric, then the upper and
lower quartiles should be approximately equally
spaced from the median
§ If the upper quartile is farther from the median than
the lower quartile, then the distribution is positively
skewed
§ If the lower quartile is farther from the median than
the upper quartile, then the distribution is negatively
skewed

Outlier
§ A value located at a distance of more than
1.5(IQR) from the box
✓ Lower fence: Q1 – 1.5 × IQR
✓ Upper fence: Q3 + 1.5 × IQR
§ Measurements beyond the upper or lower fence
are outliers and are marked with *
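The fence rule can be sketched in Python using the walking-shoe prices from the quartile example. The function names are chosen here for illustration, and the sketch assumes that `statistics.quantiles` with `method="exclusive"` matches the quartile rule used in these slides.

```python
from statistics import quantiles


def fences(values):
    """Return (lower fence, upper fence) = (Q1 - 1.5 IQR, Q3 + 1.5 IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Values lying beyond either fence."""
    lower, upper = fences(values)
    return [x for x in values if x < lower or x > upper]


prices = [40, 60, 65, 65, 67, 68, 68, 70, 70,
          70, 70, 70, 70, 74, 75, 75, 90, 95]
lower_fence, upper_fence = fences(prices)
outliers = find_outliers(prices)
print(lower_fence, upper_fence, outliers)
```

With Q1 = 66.5 and Q3 = 74.25, the fences are 54.875 and 85.875, so the prices 40, 90, and 95 are flagged.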

Measures of
Variation
§ Range
§ Interquartile Range
§ Variance
– Population Variance
– Sample Variance
§ Standard Deviation
– Population Standard Deviation
– Sample Standard Deviation
§ Coefficient of Variation

Measures of
Variation
§ Measures that quantify the variation or dispersion
of a set of data from its central location
§ The amount may be small when the values are
close together and large when the values are far
apart from each other
§ If all the values are the same, there is no dispersion
§ How much are the observations spread out
around the mean value?

Measures of
Variation
§ Measures of variation give information on the
spread or variability of the data values
[Figure: distributions with the same center but different variation]

Measures of
Variation
§ The more spread out or dispersed the data, the larger
the measures of variation
§ The more concentrated the data, the smaller the
measures of variation
§ If all observations are equal, measures of variation
= zero
§ All measures of variation are non-negative

Range
§ Simplest measure of variation
§ Difference between the largest and the smallest
observations:
Range = xmaximum – xminimum
Example:
[Figure: dot plot on a 0 to 14 scale]
Range = 14 - 1 = 13

Disadvantages of the
Range
§ Ignores the way in which data are distributed
[Figure: two dot plots on a 7 to 12 scale with different distributions; in both, Range = 12 - 7 = 5]
§ Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
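The sensitivity of the range to a single outlier can be checked with the two data sets from this slide. The function name `value_range` is chosen here for illustration.

```python
def value_range(values):
    """Largest observation minus smallest observation."""
    if not values:
        raise ValueError("range of an empty data set is undefined")
    return max(values) - min(values)


data = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5]
data_with_outlier = data[:-1] + [120]  # last value 5 replaced by 120

range_a = value_range(data)
range_b = value_range(data_with_outlier)
print(range_a, range_b)
```

Only one of the 25 observations changed, yet the range grows from 4 to 119.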

Interquartile
Range
§ We can eliminate some outlier problems by using
the interquartile range
§ Eliminate some high-and low-valued observations
and calculate the range from the remaining values
§ Also known as midspread
– Spread in the middle 50%
§ Interquartile range = 3rd quartile – 1st quartile

Interquartile
Range
Example:
[Figure: box plot in which each of the four segments holds 25% of the data]
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
Interquartile range
= 57 – 30 = 27
§ Not affected by extreme values

Variance
§ Shows variation about the mean
§ Average of squared deviations of values from the
mean
– Sample variance: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$
– Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$

Standard
Deviation
§ Most commonly used measure of variation
§ Shows variation about the mean
§ Has the same units as the original data
- Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$
- Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$
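The sample and population versions of the variance and standard deviation differ only in the divisor (n − 1 versus N). A minimal Python sketch follows; the eight data values are hypothetical and were picked so that the population results come out as round numbers.

```python
import math


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    if len(values) < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

pop_var = population_variance(data)   # 32 / 8
pop_sd = math.sqrt(pop_var)
samp_var = sample_variance(data)      # 32 / 7
samp_sd = math.sqrt(samp_var)
print(pop_var, pop_sd, round(samp_var, 3), round(samp_sd, 3))
```

Python's `statistics` module offers `pvariance`, `pstdev`, `variance`, and `stdev` for the same quantities.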

Variance vs.
Standard Deviation
§ Both measure the average “scatter” about the mean
§ Variance computations produce “squared” units which
makes interpretation more difficult
– For example, a variance expressed in kg² is hard to interpret.
§ Since it is the square root of the Variance, the
Standard Deviation is expressed in the same units as
the original data
§ Therefore, the Standard Deviation is the most
commonly used measure of variation

Comparing Standard
Deviations
[Figure: dot plots of three data sets on an 11 to 21 scale]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = .9258
Data C: Mean = 15.5, s = 4.57

Coefficient of
Variation
§ Measures relative variation
§ Always in percentage (%)
§ Shows variation relative to mean
§ Is used to compare two or more sets of data
measured in different units
– Population: $CV = \left(\frac{\sigma}{\mu}\right) \times 100\%$
– Sample: $CV = \left(\frac{s}{\bar{X}}\right) \times 100\%$
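As a sketch, the sample coefficient of variation can be computed for data sets A, B, and C using the means and standard deviations reported on the "Comparing Standard Deviations" slide. The function name is chosen here for illustration.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


# (mean, s) pairs as reported for data sets A, B, and C.
summary = {"A": (15.5, 3.338), "B": (15.5, 0.9258), "C": (15.5, 4.57)}
cv = {name: coefficient_of_variation(s, mean) for name, (mean, s) in summary.items()}

for name, value in cv.items():
    print(f"Data {name}: CV = {value:.2f}%")
```

Because the three means are equal, the CV ranks the data sets in the same order as the standard deviation: C is the most spread out relative to its mean and B the least.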

Compare the Coefficient of
Variation between data A, data B
and Data C
[Figure: dot plots of three data sets on an 11 to 21 scale]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = .9258
Data C: Mean = 15.5, s = 4.57
§ Which data set is more spread out around the mean?

The Empirical Rule
§ If the data distribution is bell-shaped, then the
interval:
§ $\mu \pm 1\sigma$ contains about 68% of the values in
the population
§ $\mu \pm 2\sigma$ contains about 95% of the values in
the population
§ $\mu \pm 3\sigma$ contains about 99.7% of the values
in the population
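The three percentages can be checked numerically under the assumption that "bell-shaped" means an exactly normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `coverage_within` is chosen here.

```python
from statistics import NormalDist


def coverage_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


coverage = {k: coverage_within(k) for k in (1, 2, 3)}
for k, proportion in coverage.items():
    print(f"mu ± {k} sigma: {proportion:.1%}")
```

Real data are only approximately bell-shaped, so observed proportions will usually differ somewhat from these values.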

Summary
§ Quantitative data are usually described by a
measure of central tendency and a measure of
variation
§ In describing data, it is important to select the
measure of central tendency that most accurately
represents the data
§ To do so, it is important to know whether the data are
symmetrical or skewed