5.DATA SUMMERISATION.ppt

DATA SUMMARISATION
Dr Vincent Yusuph
Ecohas Kibaha 2023

OBJECTIVES
At the end of this session you should be able to:
• Explain data summarisation.
• Explain the characteristics, uses, advantages, and disadvantages of each measure of location.
• Calculate the mode, mean, and median.
• Compute and interpret the variance and the standard deviation.
• Identify the position of the arithmetic mean, median, and mode for both a symmetrical and a skewed distribution.
• Explain the characteristics, uses, advantages, and disadvantages of each measure of dispersion.

Data summarisation
• Measures of Central Location
  - Mean, Median, Mode
• Measures of Variability/spread
  - Range, Standard Deviation, Variance, Coefficient of Variation
• Measures of Relative Standing
  - Percentiles, Quartiles

MEASURES OF CENTRAL TENDENCY/LOCATION
• Often we need to summarise frequency distributions in a few numbers for ease of reporting or comparison.
• Recall: with qualitative data, useful summary statistics include the ratio, proportion, and rate.

Measures of central tendency/location
• The statistical methods used to measure central tendency include the following:
1. Mean
2. Median
3. Mode

MEAN
• Refers to the arithmetic mean.
• It is obtained by adding the individual observations and dividing the sum by the total number of observations.
• Advantages – it is easy to calculate and is the most useful of all the averages.
• Disadvantages – it is influenced by abnormal (extreme) values.
• Example: for the observations 8, 16, 15, 17, 18, 20, 25, the mean is (8 + 16 + 15 + 17 + 18 + 20 + 25)/7 = 119/7 = 17.
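The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `arithmetic_mean` is an assumption chosen for clarity.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by the number of values."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# The seven observations from the slide's example.
observations = [8, 16, 15, 17, 18, 20, 25]
mean = arithmetic_mean(observations)
print(mean)  # 17.0
```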

Characteristics of the Mean
The Arithmetic Mean is the most widely used measure of location and shows the central value of the data.
The major characteristics of the mean are:
• It is calculated by summing the values and dividing by the number of values.
• It requires the interval scale.
• All values are used.
• It is unique.
• The sum of the deviations from the mean is 0.

Population Mean
For ungrouped data, the Population Mean is the sum of all the population values divided by the total number of population values:
$$\mu = \frac{\sum X}{N}$$
where
µ is the population mean
N is the total number of observations.
X is a particular value.
Σ indicates the operation of adding.

Example 1
A Parameter is a measurable characteristic of a population.
The Musenge family owns four cars. The following is the current mileage on each of the four cars:
56,000; 23,000; 42,000; 73,000
Find the mean mileage for the cars.
$$\mu = \frac{\sum X}{N} = \frac{56{,}000 + \dots + 73{,}000}{4} = 48{,}500$$

Example 2
A statistic is a measurable characteristic of a sample.
A sample of five executives received the following bonus last year ($000):
14.0, 15.0, 17.0, 16.0, 15.0
$$\bar{X} = \frac{\sum X}{n} = \frac{14.0 + \dots + 15.0}{5} = \frac{77}{5} = 15.4$$

Statistics is a pattern language
                     Population   Sample
Size                 N            n
Mean                 µ            X̄
Variance             σ²           s²
Standard Deviation   σ            s

Properties of the Arithmetic Mean
Every set of interval-level and ratio-level data has a
mean.
All the values are included in computing the mean.
A set of data has a unique mean.
The mean is affected by unusually large or small
data values.
The arithmetic mean is the only measure of location
where the sum of the deviations of each value from
the mean is zero.
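The zero-sum property of the deviations follows directly from the definition of the mean. A short derivation, written here for a sample of n values with mean X̄ (the index i is introduced only for this sketch; the same steps apply to a population with µ and N):

```latex
\sum_{i=1}^{n} (X_i - \bar{X})
  = \sum_{i=1}^{n} X_i - n\bar{X}
  = \sum_{i=1}^{n} X_i - n \cdot \frac{\sum_{i=1}^{n} X_i}{n}
  = 0
```

For the five bonuses in Example 2 (mean 15.4), the deviations are -1.4, -0.4, 1.6, 0.6, and -0.4, which sum to 0.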

MEDIAN
• When all the observations are arranged in either ascending or descending order, the middle observation is known as the median.
• When the number of observations is even, the average of the two middle values is taken.
• The median is a better indicator of the central value when the data contain extreme values, as it is not affected by them.
• Example: The median of 4, 1, and 7 is 4 because when the numbers are put in order (1, 4, 7), the number 4 is in the middle.

The Median
The Median is the midpoint of the values after they have been ordered from the smallest to the largest.
There are as many values above the median as below it in the data array.
For an even set of values, the median will be the arithmetic average of the two middle numbers and is found at the (n+1)/2 ranked observation.

The median (continued)
The ages for a sample of five BSc.HLS.III students are: 21, 25, 19, 20, 22.
Arranging the data in ascending order gives: 19, 20, 21, 22, 25.
Thus the median is 21.

Example 5
The heights of four basketball players, in inches, are: 76, 73, 80, 75.
Arranging the data in ascending order gives: 73, 75, 76, 80.
The median is found at the (n+1)/2 = (4+1)/2 = 2.5th data point, that is, halfway between the 2nd and 3rd ordered values.
Thus the median is (75 + 76)/2 = 75.5.
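The odd and even cases can be handled by one small routine. The sketch below is an illustration only (the function name `median` is an assumption, not something from the slides); it sorts the data and then either picks the middle value or averages the two middle values.

```python
def median(values):
    """Return the middle value of the ordered data (or the mean of the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle observation exists.
        return ordered[mid]
    # Even count: average the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [21, 25, 19, 20, 22]   # student ages from the earlier example
heights = [76, 73, 80, 75]    # player heights from Example 5
print(median(ages))     # 21
print(median(heights))  # 75.5
```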

Properties of the Median
There is a unique median for each data set.
It is not affected by extremely large or small
values and is therefore a valuable measure of
location when such values occur.
It can be computed for ratio-level, interval-
level, and ordinal-level data.
It can be computed for an open-ended
frequency distribution if the median does not
lie in an open-ended class.

MODE
• The most frequently occurring observation in a data set is called the mode.
• Not often used in medical statistics.
• EXAMPLE
  - Number of decayed teeth in 10 children: 2, 2, 4, 1, 3, 0, 10, 2, 3, 8
  - Mean = 35 / 10 = 3.5
  - Median: the ordered values are 0, 1, 2, 2, 2, 3, 3, 4, 8, 10, so the median is (2 + 3)/2 = 2.5
  - Mode = 2 (occurs 3 times)
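As a cross-check, the three measures for the decayed-teeth data can be computed with Python's standard `statistics` module. This is a sketch for illustration; the variable names are assumptions.

```python
import statistics

# Number of decayed teeth in 10 children (data from the slide).
decayed_teeth = [2, 2, 4, 1, 3, 0, 10, 2, 3, 8]

mean = statistics.mean(decayed_teeth)      # sum of 35 divided by 10 values
median = statistics.median(decayed_teeth)  # average of the 5th and 6th ordered values
mode = statistics.mode(decayed_teeth)      # most frequently occurring value

print(mean, median, mode)  # 3.5 2.5 2
```

Here the mean exceeds the median, which exceeds the mode; the two large counts (8 and 10) pull the mean upward.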

The Relative Positions of the Mean, Median, and Mode
Symmetric distribution: A distribution having the same shape on either side of the center.
Skewed distribution: One whose shapes on either side of the center differ; a nonsymmetrical distribution. Can be positively or negatively skewed, or bimodal.

RELATIONSHIP BETWEEN
MEAN, MEDIAN, MODE

The Relative Positions of the Mean, Median, and Mode: Symmetric Distribution
Zero skewness: Mean = Median = Mode
[Figure: symmetric bell-shaped curve with the mode, median, and mean at the same central point]

The Relative Positions of the Mean, Median, and Mode: Right Skewed Distribution
• Positively skewed: Mean and median are to the right of the mode.
Mean > Median > Mode
[Figure: right-skewed curve; from left to right: mode, median, mean]

The Relative Positions of the Mean, Median, and Mode: Left Skewed Distribution
• Negatively skewed: Mean and median are to the left of the mode.
Mean < Median < Mode
[Figure: left-skewed curve; from left to right: mean, median, mode]

CHOICE OF APPROPRIATE MEASURE
• For symmetric distributions, the mean is preferred to the median or mode:
  - it utilises all values
  - it has convenient mathematical properties
• For asymmetric distributions, the mean is not suitable:
  - the mean is sensitive to extreme values
  - the median is preferred since it is not affected by extreme values
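The sensitivity of the mean to an extreme value can be seen in a small experiment. The data below are hypothetical (invented for this sketch, not taken from the slides): five similar values, then the same values with one extreme observation added.

```python
import statistics

# Hypothetical data: five similar values, then the same data plus one extreme value.
typical = [14, 15, 15, 16, 17]
with_outlier = typical + [90]

mean_before = statistics.mean(typical)
median_before = statistics.median(typical)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# The mean jumps from 15.4 to about 27.8; the median only moves from 15 to 15.5.
print(mean_before, median_before)
print(round(mean_after, 1), median_after)
```

A single extreme value nearly doubles the mean while the median barely changes, which is why the median is the safer summary for skewed data.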

Measures of spread…
• Measures of central location fail to tell the whole story about the distribution; that is, how much are the observations spread out around the mean value?

Measures of Variability…
For example, two sets of class grades are shown. The mean (=50) is the same in each case, but the red class has greater variability than the blue class.
[Figure: distributions of grades for the red class and the blue class]

Measures of Dispersion
Dispersion refers to the spread or variability in the data.
Measures of dispersion include the following: range, mean deviation, variance, and standard deviation.
Range = Largest value – Smallest value

Example 9
The following represents the current year's Return on Equity of the 25 companies in an investor's portfolio.
-8.1 3.2 5.9 8.1 12.3
-5.1 4.1 6.3 9.2 13.3
-3.1 4.6 7.9 9.5 14.0
-1.4 4.8 7.9 9.7 15.0
1.2 5.7 8.0 10.3 22.1
Highest value: 22.1 Lowest value: -8.1
Range = Highest value – lowest value
= 22.1 - (-8.1)
= 30.2

Range…
• Its major advantage is the ease with which it can be computed.
• Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.
• Hence we need a measure of variability that incorporates all the data and not just two observations.
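The shortcoming of the range can be illustrated with two hypothetical data sets (invented for this sketch) that share the same end points but differ in how the values in between are spread. The helper name `value_range` is an assumption; the population standard deviation from Python's `statistics` module stands in for a measure that uses all the data.

```python
import statistics


def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


# Hypothetical data sets with the same end points (0 and 100).
clustered = [0, 49, 50, 50, 51, 100]
spread_out = [0, 10, 30, 70, 90, 100]

print(value_range(clustered), value_range(spread_out))  # 100 100

# A measure that uses every observation tells the two sets apart.
print(statistics.pstdev(clustered) < statistics.pstdev(spread_out))  # True
```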

Variance and Standard Deviation
Variance: the arithmetic mean of the squared deviations from the mean.
Standard deviation: The square root of the variance.

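Population Variance
The major characteristics of the Population Variance are:
• All values are used in the calculation.
• It is sensitive to extreme values because the deviations are squared, although, unlike the range, it does not depend on the two extreme values alone.
• The units are awkward, the square of the original units.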

Variance and standard deviation
Population Variance formula:
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
X is the value of an observation in the population
µ is the arithmetic mean of the population
N is the number of observations in the population
Population Standard Deviation formula:
$$\sigma = \sqrt{\sigma^2}$$

Example 9 continued
In Example 9, the variance and standard deviation are:
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{(-8.1 - 6.62)^2 + (-5.1 - 6.62)^2 + \dots + (22.1 - 6.62)^2}{25} = 42.227$$
$$\sigma = \sqrt{42.227} = 6.498$$
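The Example 9 figures can be reproduced with Python's standard `statistics` module, whose `pvariance` and `pstdev` functions divide by N as in the population formulas. This is a sketch for checking the arithmetic; the variable names are assumptions.

```python
import statistics

# Return on Equity for the 25 companies in Example 9.
roe = [
    -8.1, 3.2, 5.9, 8.1, 12.3,
    -5.1, 4.1, 6.3, 9.2, 13.3,
    -3.1, 4.6, 7.9, 9.5, 14.0,
    -1.4, 4.8, 7.9, 9.7, 15.0,
    1.2, 5.7, 8.0, 10.3, 22.1,
]

mu = statistics.mean(roe)             # population mean
variance = statistics.pvariance(roe)  # divides by N
std_dev = statistics.pstdev(roe)      # square root of the population variance

print(round(mu, 2), round(variance, 3), round(std_dev, 3))  # 6.62 42.227 6.498
```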

Sample variance and standard deviation
Sample variance (s²):
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
Sample standard deviation (s):
$$s = \sqrt{s^2}$$

Example 11
The hourly wages earned by a sample of five students are: $7, $5, $11, $8, $6.
Find the sample variance and standard deviation.
$$\bar{X} = \frac{\sum X}{n} = \frac{37}{5} = 7.40$$
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{(7 - 7.4)^2 + \dots + (6 - 7.4)^2}{5 - 1} = \frac{21.2}{5 - 1} = 5.30$$
$$s = \sqrt{s^2} = \sqrt{5.30} = 2.30$$
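The steps of Example 11 can be written out explicitly in code. The sketch below follows the sample formula with the n - 1 divisor; the function name `sample_variance` is an assumption made for illustration.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


# Hourly wages from Example 11.
wages = [7, 5, 11, 8, 6]
s2 = sample_variance(wages)
s = math.sqrt(s2)
print(round(s2, 2), round(s, 2))  # 5.3 2.3
```

Python's built-in `statistics.variance` and `statistics.stdev` also use the n - 1 divisor and should agree with these values.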

Interpretation and Uses of the Standard Deviation
Empirical Rule: For any symmetrical, bell-shaped distribution:
About 68% of the observations will lie within 1s of the mean
About 95% of the observations will lie within 2s of the mean
Virtually all the observations will be within 3s of the mean

The Empirical Rule…
• Approximately 68% of all observations fall within one standard deviation of the mean.
• Approximately 95% of all observations fall within two standard deviations of the mean.
• Approximately 99.7% of all observations fall within three standard deviations of the mean.

Interpretation and Uses of the Standard Deviation
[Figure: bell-shaped curve showing the relationship between σ and µ; about 68% of observations lie within µ ± 1σ, 95% within µ ± 2σ, and 99.7% within µ ± 3σ]
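The 68%, 95%, and 99.7% figures can be checked against the normal distribution, which is the standard model of a bell-shaped curve. The sketch below assumes an exactly normal distribution (real data will only approximate it) and uses `NormalDist` from Python's `statistics` module; the helper name `share_within` is an assumption.

```python
from statistics import NormalDist

# Assumption: a perfectly normal distribution with mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```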

Interpreting the standard deviation
• The greater the variation in the data, the greater the standard deviation.
• If all the values are the same, the standard deviation is zero.
• For a symmetrical distribution, almost all the data will be contained within three standard deviations of the mean.

Coefficient of Variation…
• The coefficient of variation of a set of observations is the standard deviation of the observations divided by their mean, that is:
  - Population coefficient of variation: $CV = \sigma / \mu$
  - Sample coefficient of variation: $cv = s / \bar{X}$

Coefficient of Variation…
• This coefficient provides a proportionate measure of variation. For example, a standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
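The comparison above can be made concrete by computing the coefficient for both cases. This is a minimal sketch; the function name `coefficient_of_variation` is an assumption, and the result is left as a proportion rather than a percentage.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a proportion of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return std_dev / mean


# Same standard deviation, different means (values from the slide's example).
cv_small_mean = coefficient_of_variation(10, 100)
cv_large_mean = coefficient_of_variation(10, 500)
print(cv_small_mean)  # 0.1
print(cv_large_mean)  # 0.02
```

The same spread of 10 units is 10% of a mean of 100 but only 2% of a mean of 500.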

Measures of Variability…
• If data are symmetric, with no serious outliers, use range and standard deviation.
• If comparing variation across two data sets, use coefficient of variation.
• The measures of variability introduced in this section can be used only for interval data.
