BMCU002: QUANTITATIVE METHODS NOTES
INTRODUCTION TO STATISTICS
Definition:
It is the science of collecting, organizing, presenting, analyzing and interpreting data to assist in
making more effective decisions.
Types of Statistics
(a) Descriptive statistics: tabular, graphical and numerical methods for organizing and
summarizing information about a population or a sample clearly and effectively.
(b) Inferential statistics: methods for drawing conclusions about a statistical population from
sample data and for measuring the reliability of those conclusions.
• A population is the collection of all possible individuals, objects or measurements of
interest.
• A sample is a part, or subset, of the population of interest.
Variables:
A variable is a measurable characteristic that assumes different values among the subjects.
Types of variables
(a) Independent variables: It is a variable that a researcher manipulates in order to determine its
effect or influence on another variable. They predict the amount of variation that occurs in
other variables.
(b) Dependent variables: It is the variable that is measured, predicted or monitored and is
expected to be affected by manipulation of an independent variable. They attempt to indicate
the total influence arising from the effects of the independent variable. It varies as a function
of the independent variable e.g., influence of hours studied on performance in a statistical test,
influence of distance from the supply center on cost of building materials.
The above variables can either be qualitative or quantitative variables: -
i. Qualitative variables: non-numeric variables, i.e., attributes such as gender,
religion, colour and state of birth.
ii. Quantitative variables: numeric variables. They can be either discrete or
continuous.
• Discrete variables: variables that can assume only certain values,
typically whole numbers. They are counted.
• Continuous variables: variables that can assume any value within
a specific range. They are measured, e.g., height, temperature, weight
and radius.

Levels of measurement
There are four levels of measurement; nominal, ordinal, interval and ratio.
(a) Nominal level: observations are classified under a common characteristic, e.g., sex, race,
marital status, employment status, language or religion. This level is useful in sampling.
(b) Ordinal level: items or subjects are not only grouped into categories but also ranked in
some order, e.g., greater than, less than, superior, happier than, poorer, above. This level is
useful in developing a Likert scale.
(c) Interval level: numerals are assigned to each measure and ranked, and the intervals between
numerals are equal. The numerals represent meaningful quantities, but the zero point is
arbitrary, e.g., test scores, temperature in degrees Celsius.
(d) Ratio level: has all the characteristics of the other levels and in addition the zero point is
meaningful. Mathematical operations can be applied to yield meaningful values e.g., height,
weight, distance, age, area etc.
Characteristics of statistical data
 They are aggregate of facts e.g., total sales of a firm for one year.
 They are affected to a marked extent by a multiplicity of causes e.g., volume of wheat
production depends on rainfall, soil fertility, seeds etc
 They are numerically expressed e.g., population of Kenya increased by 4 million during the
year 2004.
 They are estimated according to a reasonable standard of accuracy e.g., 90% accuracy
 They are collected in a systematic manner.
 They are collected for a predetermined purpose
 They should be placed in relation to each other.
Uses and users of statistics
1. Government:
 Monitoring economic and social trends
 Forecasting
 Policy making
2. Individuals
 Leisure activities
 Community work
 Personal finances
 Gambling
3. Academia
 Testing hypothesis
 Developing new theories
 Consultancy services
4. Businesses
 Planning and control
 Quality control especially for the manufacturers
 Forecasting i.e., planning production schedules, advertising expenditures etc.
 Auditing

 Determining production costs e.g., by using regression and correlation, one can determine
the relationship between two variables like costs and methods of production, advertising and
sales etc.
 It gives relevant information for decision-making.
Limitations of statistics
 Deals with aggregate facts and not individual items.
 Deals mainly with quantitative characteristics and not qualitative characteristics like
honesty, efficiency etc.
 The results are only true on an average and under certain conditions.
 Statistics can be misused i.e., wrong interpretation. It requires experience and skill to draw
sensible conclusions from the data.
 Statistics may not provide the best solution under all circumstances.

DESCRIPTIVE STATISTICS
Descriptive statistics is used to summarize data and make sense out of the raw data
collected during the research.
Data collection
Data can be collected from primary and / or secondary sources.
Secondary data consists of information that already exists somewhere having been
collected for another purpose e.g., in government publications, periodicals, journals, books
etc.
Advantages
- It is low in cost
- It is readily available
Disadvantages
- The data needed might not exist
- The existing data might be outdated, inaccurate, incomplete or unreliable
Primary data consists of original information gathered for the specific purpose through
observation, interviews and questionnaires.
Advantages
- It is relevant
- It is accurate
Disadvantages
- It is costly
- It is time consuming
Presentation of data
Presentation of data refers to the classification and tabulation of data. Classification of data
refers to the act of arranging the data in groups or classes according to some resemblance
of the data in each group or class. Tabulation of data is the arrangement of statistical data
in columns and rows.
Frequency distribution
A frequency distribution is a grouping of data into mutually exclusive categories showing
the number of observations in each category.
Steps
 Decide on the number of classes
 Determine the class interval or width
 Set the individual class limits
 Tally the values into the classes

 Count the number of items in each class
A class interval is the difference between the lower limit of the class and the lower limit of
the next class.
A class midpoint / class mark is the middle point between the lower and the upper class
limit.
Graphical representation of a frequency distribution
1. Histogram: It is a graph in which the classes are marked on the horizontal axis and the
class frequencies on the vertical axis. The class frequencies are represented by the
heights of the bars, and the bars are drawn adjacent to each other.
2. Frequency polygons: The class midpoints are connected with a line segment.
3. Cumulative frequency polygons
 Less than cumulative frequency polygons
 More than cumulative frequency polygons
4. Line charts: Show the change in a variable over time
5. Bar chart: Make use of rectangles to present the given data. Can be vertical,
horizontal or component.
6. Pie charts: different segments of a circle represent percentage contribution of various
components to the total.
7. Graphs
8. Pictograms: pictures are used to represent data.
Example
(a) The data below indicates the marks attained by students in a statistical test. Construct
a frequency distribution table with 10 classes
12  8 18  5 15 24 25 25 32 40
40 42 44 46 48 50 50 52 53 55
56 59 60 66 68 72 76 83 95 98
(b) From the above: construct a histogram, frequency polygons and curves, cumulative
frequency curves.
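The tallying steps for part (a) can be sketched in Python. This is only an illustration: the function name `frequency_table` is hypothetical, and the choice of classes 0−10, 10−20, ..., 90−100 (lower limit included, upper limit excluded) is an assumption, since the notes do not fix the class limits.

```python
from collections import Counter

marks = [12, 8, 18, 5, 15, 24, 25, 25, 32, 40,
         40, 42, 44, 46, 48, 50, 50, 52, 53, 55,
         56, 59, 60, 66, 68, 72, 76, 83, 95, 98]

def frequency_table(values, lower, width, n_classes):
    """Return [(class_lower, class_upper, frequency), ...] for classes [lower, upper)."""
    # Index of the class each value falls into (assumed: upper limit excluded)
    counts = Counter((v - lower) // width for v in values)
    return [(lower + k * width, lower + (k + 1) * width, counts.get(k, 0))
            for k in range(n_classes)]

table = frequency_table(marks, lower=0, width=10, n_classes=10)

# Print class, frequency and "less than" cumulative frequency
cumulative = 0
for lo, hi, f in table:
    cumulative += f
    print(f"{lo:>3}-{hi:<3} {f:>3} {cumulative:>4}")
```

The frequency column gives the bar heights of the histogram, and the running total gives the points of the less than cumulative frequency curve.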
MEASURES OF CENTRAL TENDENCY
Central tendency is the tendency of observations to cluster near the central part of the
distribution. Measures of central tendency are the measures of location e.g. mean, mode
and median. They are the most representative value of the distribution.

Qualities of a good average
Should be-
 Rigidly defined
 Based on all values
 Easily understood and calculated
 Least affected by the fluctuations of sampling
 Capable of further algebraic or statistical treatment
 Least affected by extreme values
Types of averages
The following are the most important types of averages
(a) Arithmetic mean or simple average
(b) Median
(c) Mode
(d) Geometric mean
(e) Harmonic mean
THE ARITHMETIC MEAN
It is obtained by summing up the values of all the items of a series and dividing this sum
by the number of items.
Computation of the arithmetic mean
Individual series (direct method):
$\bar{X} = \frac{\sum X}{n}$
where $\bar{X}$ = arithmetic mean, $\sum X$ = sum of the values, n = number of items
Grouped series (direct method):
$\bar{X} = \frac{\sum fX}{n}$
where f = frequencies, n = $\sum f$ = number of items
Properties of the arithmetic mean
 The product of the arithmetic mean and the number of items is equal to the sum of all
given values
 The algebraic sum of the deviations of the various values from the mean is equal to
zero
 The sum of the squares of deviations from arithmetic mean is least.
Advantages of the arithmetic mean
 Can be easily understood
 Takes into account all the items of the series

 It is not necessary to arrange the data before calculating the average
 It is capable of algebraic treatment
 It is a good method of comparison
 It is not indefinite
 It is used frequently.
Disadvantages of the arithmetic mean
• It is affected by extreme values to a great extent
• It may be a figure that does not exist in the series
• It cannot be calculated unless all the items of the series are known
• It cannot be used in the case of qualitative data
THE MEDIAN
The median is the middle value of a series arranged in ascending or descending order. If
there are n observations, the median is the value of the $\left(\frac{n+1}{2}\right)$th item.
Computation of the median in discrete series
 Arrange the items in descending or ascending order with their corresponding
frequencies against them.
 Compute the cumulated frequencies and then locate the middle item.
Computation of the median in Continuous series
The median has to be interpolated in the class interval containing the median using the
formula:-
Median = $L + \frac{\frac{n}{2} - B}{G} \times W$
Where:
L = lower class boundary of the median group
n = total number of values
B = cumulative frequency of the groups before the median group
G = frequency of the median group
W = class width
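The interpolation formula can be sketched in Python as follows. The function name `grouped_median` is hypothetical, equal class widths are assumed, and the data are the marks distribution used in the quartiles example later in these notes.

```python
def grouped_median(lower_bounds, frequencies, width):
    """Median = L + ((n/2 - B) / G) * W for a continuous series."""
    n = sum(frequencies)
    half = n / 2
    cumulative = 0  # B: cumulative frequency before the current class
    for L, G in zip(lower_bounds, frequencies):
        if cumulative + G >= half:  # this class contains the n/2-th value
            return L + (half - cumulative) / G * width
        cumulative += G
    raise ValueError("empty distribution")

lower_bounds = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
frequencies = [11, 18, 25, 28, 30, 33, 22, 15, 12, 10]
print(round(grouped_median(lower_bounds, frequencies, 10), 2))  # 46.67
```

Here n = 204, so n/2 = 102 falls in the class 40−50 (B = 82, G = 30).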
Properties of the Median
 It is a positional average and is influenced by the position of the items in the series
and not by the size of items
 The sum of the absolute values of deviations is least.
Advantages of the Median
• It is easy to calculate
• It is simple and easily understood
• It is less affected by the value of extreme items
• It can be calculated by inspection in some cases
• It is useful in the study of phenomena that are qualitative in nature
Disadvantages of the Median
 It is not a suitable representative of a series in most cases
 It is not suitable for further algebraic treatment
 It is not used frequently like arithmetic mean
 It cannot be determined exactly in the case of continuous series
Quartiles, deciles and percentiles
 Quartiles are the values of the items that divide the series into four equal parts.
 Deciles divide the series into 10 equal parts.
 Percentiles divide the series into 100 equal parts.
The 2nd quartile, 5th decile and 50th percentile are equal to the median.
THE MODE
The mode is the value, which occurs most often in the data. A distribution with one mode
is called unimodal, with two modes bimodal and with many modes, multimodal
distribution. The class mid-point of a modal class is called a crude mode.
Calculation of the mode in a continuous series
Mode = $L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$
Where:
• L is the lower class boundary of the modal group
• $f_{m-1}$ is the frequency of the group before the modal group
• $f_m$ is the frequency of the modal group
• $f_{m+1}$ is the frequency of the group after the modal group
• w is the group width
Properties of the mode
 It represents the most typical value of the distribution and it should coincide with
existing items
 It is not affected by the presence of extremely large or small items
Advantages of the Mode
 It is easy to understand
 Extreme items do not affect its value
 It possesses the merit of simplicity
Disadvantages of the Mode
 It is often not clearly defined
 Exact location is often uncertain
 It is unsuitable for further algebraic treatment

 It does not take into account extreme values.
GEOMETRIC MEAN
Geometric Mean is the nth root of the product of n values, i.e.,
$G.M = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$
For ungrouped data
G.M = Antilog of $\frac{\sum \log x}{n}$
Grouped data
G.M = Antilog of $\frac{\sum f \log x}{n}$
Merits of the Geometric mean
 It takes into account all the items in the data and condenses them into one
representative value.
 It gives more weight to smaller values than to large values.
 It is amenable to algebraic manipulations
Demerits
 It is difficult to use and compute
 It is determinate for positive values and cannot be used for negative values or zero.
HARMONIC MEAN
It is the reciprocal of the arithmetic mean of the reciprocals of a series of observations.
Ungrouped data
H.M = $\frac{n}{\sum \frac{1}{x}}$
Grouped data
H.M = $\frac{n}{\sum \frac{f}{x}}$
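The geometric and harmonic mean formulas can be checked with a short Python sketch. The function names are hypothetical and the values 2, 4, 8 are illustrative only. Natural logarithms are used in place of the log and antilog tables; the result is the same.

```python
import math

def geometric_mean(values, freqs=None):
    """Antilog of (sum of f*log x) / n; all values must be positive."""
    freqs = freqs or [1] * len(values)
    n = sum(freqs)
    return math.exp(sum(f * math.log(x) for x, f in zip(values, freqs)) / n)

def harmonic_mean(values, freqs=None):
    """n divided by the sum of f/x; no value may be zero."""
    freqs = freqs or [1] * len(values)
    n = sum(freqs)
    return n / sum(f / x for x, f in zip(values, freqs))

values = [2, 4, 8]  # illustrative data, not from the notes
print(round(geometric_mean(values), 4))  # 4.0
print(round(harmonic_mean(values), 4))   # 3.4286
```

For positive data that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean (here 3.43 < 4 < 4.67).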
Merits of the Harmonic mean
 It takes into account all the observations in the data
 It gives more weight to smaller items
 It is amenable to algebraic manipulations
 It measures the rates of change
Demerits
 It is difficult to compute when the number of items is large
 It assigns too much weight to smaller items.
Factors to consider in the choice of an average
 The purpose for which the average is being used
 The nature, characteristics and properties of the average
 The nature and characteristics of the data.
MEASURES OF DISPERSION
Definition of dispersion
• It is the degree to which numerical data tend to spread about an average value
• It is the extent of the scatter of items around a measure of central tendency
Significance of measuring dispersion
 To determine the reliability of an average
 To serve as a basis for the control of the variability
 To compare two or more series with regard to their variability
 To facilitate the use of other statistical measures
Properties of a good measure of dispersion
It should be:
• Simple to understand
• Easy to compute
• Rigidly defined
• Based on each and every item in the distribution
• Amenable to further algebraic calculations
• Stable under sampling
• Not unduly affected by extreme values
Measures of dispersion
 Range
 Quartile deviation
 Mean deviation
 Standard deviation

The Range: it is the difference between the smallest value and the largest value of a series
Advantages of the Range
 It is the simplest to understand and compute
 It takes the minimum time to calculate the value of the range
Limitations
 It is not based on each and every value of the distribution
 It is subject to fluctuations of considerable magnitude from sample to sample
 It cannot be computed in case of open-ended distributions
 It does not explain or indicate anything about the character of the distribution within the two
extreme observations.
Uses of the range
 Quality control
 Fluctuations of prices
 Weather forecast
 Finding the difference between two values e.g. wages earned by different employees.
The standard deviation
It is the square root of the arithmetic average of the squares of the deviations measured from the
mean. It measures how much “spread” or “ Variability” is present in the sample. A small standard
deviation means a high degree of uniformity of the observations as well as the homogeneity of a
series and vice versa.
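As a sketch of this definition (the population form, dividing by n), the following Python function handles both ungrouped and grouped data. The name `std_dev` is hypothetical, the ungrouped values are illustrative, and the grouped example reuses the X and f values of the symmetrical distribution in the skewness section of these notes.

```python
import math

def std_dev(values, freqs=None):
    """Square root of the mean of squared deviations from the arithmetic mean."""
    freqs = freqs or [1] * len(values)
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return math.sqrt(sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n)

# Ungrouped data (illustrative values): mean 5, squared deviations sum to 32
print(std_dev([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0

# Grouped data: X = 1..5 with frequencies 5, 9, 12, 9, 5
print(round(std_dev([1, 2, 3, 4, 5], [5, 9, 12, 9, 5]), 4))  # 1.2042
```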
Ways of computing the standard deviation
Direct method
Ungrouped data
$\sigma = \sqrt{\frac{\sum d_x^2}{n}}$ where $\sum d_x^2$ = sum of squares of the deviations from the arithmetic mean
Grouped data
$\sigma = \sqrt{\frac{\sum f d_x^2}{n}}$
Advantages of the standard deviation
 It is rigidly defined and is based on all the observations of the series
 It is applied or used in other statistical techniques like correlation and regression analysis and
sampling theory.
 It is possible to calculate the combined standard deviation of two or more groups.
Disadvantages of the standard deviation
 It cannot be used for comparing the dispersion of two or more series of observations given in
different units.
 It gives more weight to extreme values.

SKEWNESS AND KURTOSIS IN STATISTICS
The average and the measure of dispersion describe a distribution, but they are not sufficient to
describe its shape. For this purpose we use two further concepts, known as Skewness
and Kurtosis. Symmetrical and skewed distributions can be illustrated by their frequency curves.
Skewness
Skewness means lack of symmetry. A distribution is said to be symmetrical when the values are
uniformly distributed around the mean. For example, the following distribution is symmetrical
about its mean 3.
X : 1 2 3 4 5
Frequency (f): 5 9 12 9 5
In a symmetrical distribution the mean, median and mode coincide, that is, mean = median = mode.
Several measures are used to express the direction and extent of skewness of a distribution. The
most important are those given by Pearson. The first is the coefficient of skewness:
$S_k = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$
For a symmetric distribution Sk = 0. If the distribution is negatively skewed then Sk is negative, and
if it is positively skewed then Sk is positive. The range for Sk is from −3 to 3.

The other measure uses the β (read ‘beta’) coefficient, which is given by $\beta_1 = \frac{\mu_3^2}{\mu_2^3}$, where $\mu_2$
and $\mu_3$ are the second and third central moments. The second central moment $\mu_2$ is simply
the variance. The sample estimate of this coefficient is $b_1 = \frac{m_3^2}{m_2^3}$, where m2 and m3 are
the sample central moments given by
$m_2 = \frac{\sum (x - \bar{x})^2}{n}$ and $m_3 = \frac{\sum (x - \bar{x})^3}{n}$
For a symmetrical distribution b1 = 0. Skewness is positive or negative depending upon whether
m3 is positive or negative.
Kurtosis
A measure of the peakedness or convexity of a curve is known as kurtosis.
It is clear from the above figure that all three curves, (1), (2) and (3), are symmetrical about
the mean. Still, they are not of the same type: each has a different peak from the
others. Curve (1) is known as mesokurtic (normal curve), Curve (2) as leptokurtic
(peaked curve) and Curve (3) as platykurtic (flat curve). Kurtosis is measured by
Pearson’s coefficient, β2 (read ‘beta-two’). It is given by $\beta_2 = \frac{\mu_4}{\mu_2^2}$.
The sample estimate of this coefficient is $b_2 = \frac{m_4}{m_2^2}$, where m4 is the fourth central moment
given by $m_4 = \frac{\sum (x - \bar{x})^4}{n}$
The distribution is called normal (mesokurtic) if b2 = 3. When b2 is more than 3 the distribution is said to be
leptokurtic. If b2 is less than 3 the distribution is said to be platykurtic.
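The moment-based coefficients can be illustrated in Python with the symmetrical distribution given earlier in this section (X = 1 to 5, f = 5, 9, 12, 9, 5). This sketch assumes the usual moment definitions b1 = m3²/m2³ and b2 = m4/m2², with moments taken about the mean and divided by n; the function name `central_moment` is hypothetical.

```python
def central_moment(values, freqs, r):
    """r-th central moment of a frequency distribution (divisor n)."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(values, freqs)) / n

x = [1, 2, 3, 4, 5]
f = [5, 9, 12, 9, 5]

m2 = central_moment(x, f, 2)  # variance
m3 = central_moment(x, f, 3)
m4 = central_moment(x, f, 4)

b1 = m3 ** 2 / m2 ** 3  # assumed definition of the skewness coefficient
b2 = m4 / m2 ** 2       # assumed definition of the kurtosis coefficient
print(b1, round(b2, 4))  # 0.0 2.1165
```

Here b1 = 0 confirms the symmetry, and b2 is below 3, so by the rule above this small distribution is platykurtic.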

MEASURES OF CENTRAL TENDENCY
MODE
Meaning
The mode is the value in a distribution that occurs most frequently. It is an actual value,
which has the highest concentration of items in and around it.
Computation of the Mode
1. Ungrouped or Raw Data
For ungrouped data or a series of individual observations, mode is often found by mere inspection.
Example 1:
2, 7, 10, 15, 10, 17, 8, 10, 2
∴ Mode = M0 = 10
In some cases the mode may be absent while in some cases there may be more than one mode.
Example 2:
1) 12, 10, 15, 24, 30 (no mode)
2) 7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10
∴ The modes are 7 and 10
2. Grouped Data
a) Discrete Distribution
For a discrete distribution, find the highest frequency; the corresponding value of X is the mode. A
discrete variable is one whose outcomes are measured in fixed numbers.
b) Continuous Distribution
Find the highest frequency; the class interval with that frequency is called the modal
class. Then apply the following formula:
Mode = M0 = $l_1 + \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \times i$
Where: $l_1$ = the lower limit of the class in which the mode lies
$f_1$ = the frequency of the class in which the mode lies
$f_0$ = the frequency of the class preceding the modal class
$f_2$ = the frequency of the class succeeding the modal class
$i$ = the class interval of the modal class
NOTE: While applying the above formula, we should ensure that the class-intervals are uniform
throughout. If the class-intervals are not uniform, then they should be made uniform on the
assumption that the frequencies are evenly distributed throughout the class.
Example 3:
Let us take the following frequency distribution:
Class Intervals Frequency
30−40 4
40−50 6
50−60 8
60−70 12
70−80 9
80−90 7
90−100 4
Required:
Calculate the mode in respect of this series.
Solution
Mode = M0 = $60 + \frac{12 - 8}{(12 - 8) + (12 - 9)} \times 10$
= $60 + \frac{4}{4 + 3} \times 10$ = 65.7 approx.
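The same calculation can be sketched in Python. The function name `grouped_mode` is hypothetical; it assumes uniform class intervals and a single class with the maximum frequency, and it treats a missing neighbouring class as having frequency 0 (an assumption, not something stated in the notes).

```python
def grouped_mode(lower_limits, frequencies, width):
    """Mode = l1 + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * i for the modal class."""
    # Index of the (first) class with the highest frequency
    k = max(range(len(frequencies)), key=frequencies.__getitem__)
    f1 = frequencies[k]
    f0 = frequencies[k - 1] if k > 0 else 0                      # preceding class
    f2 = frequencies[k + 1] if k + 1 < len(frequencies) else 0   # succeeding class
    return lower_limits[k] + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * width

lower_limits = [30, 40, 50, 60, 70, 80, 90]
frequencies = [4, 6, 8, 12, 9, 7, 4]
print(round(grouped_mode(lower_limits, frequencies, 10), 1))  # 65.7
```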

3. Determination of Modal Class
For a frequency distribution, the modal class corresponds to the maximum frequency. However, the
modal class cannot be identified by inspection in any one (or more) of the
following cases:
i. If the maximum frequency is repeated.
ii. If the maximum frequency occurs at the beginning or at the end of the distribution.
iii. If there are irregularities in the distribution.
In such cases the modal class is determined by the method of grouping.
Steps for Calculation
1. Prepare a grouping table with 6 columns.
2. In column I, write down the given frequencies.
3. Column II is obtained by combining the frequencies two by two.
4. Leave the 1st frequency and combine the remaining frequencies two by two and write in column
III.
5. Column IV is obtained by combining the frequencies three by three.
6. Leave the 1st frequency and combine the remaining frequencies three by three and write in
column V.
7. Leave the 1st and 2nd frequencies and combine the remaining frequencies three by three and
write in column VI.
8. Mark the highest frequency in each column.
9. Form an analysis table to find the modal class.
10. After finding the modal class use the formula to calculate the modal value.
Example 4
Calculate the mode for the following frequency distribution.
Class Interval 0−5 5−10 10−15 15−20 20−25 25−30 30−35 35−40
Frequency 9 12 15 16 17 15 10 13

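Solution
Grouping Table
(Columns 2 and 3 hold sums of two consecutive frequencies; columns 4, 5 and 6 hold sums of three.
A pair sum is entered against the second class of the pair and a triple sum against the middle
class of the triple. A dash marks an empty cell.)
Class Interval   Frequency    2     3     4     5     6
0−5                  9        -     -     -     -     -
5−10                12       21     -    36     -     -
10−15               15        -    27     -    43     -
15−20               16       31     -     -     -    48
20−25               17        -    33    48     -     -
25−30               15       32     -     -    42     -
30−35               10        -    25     -     -    38
35−40               13       23     -     -     -     -
Analysis Table
Columns   0−5   5−10   10−15   15−20   20−25   25−30   30−35   35−40
1          -     -       -       -       1       -       -       -
2          -     -       -       -       1       1       -       -
3          -     -       -       1       1       -       -       -
4          -     -       -       1       1       1       -       -
5          -     1       1       1       -       -       -       -
6          -     -       1       1       1       -       -       -
Total      0     1       2       4       5       2       0       0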
The maximum total corresponds to the class 20−25, and hence it is the modal class.
Mode = M0 = $20 + \frac{17 - 16}{(17 - 16) + (17 - 15)} \times 5$
M0 = $20 + \frac{1}{1 + 2} \times 5$ = 21.67 approx.
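The grouping and analysis tables can be reproduced with a short Python sketch. The function name `grouping_tallies` is hypothetical. For each of the six columns it forms the grouped sums, finds the largest, and gives one tally to every class that belongs to a group with that largest sum.

```python
def grouping_tallies(freqs):
    """Analysis-table totals: one tally per class for each column where it is in a maximal group."""
    tallies = [0] * len(freqs)
    # (group size, number of leading frequencies left out) for columns 1 to 6
    layouts = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
    for size, skip in layouts:
        starts = range(skip, len(freqs) - size + 1, size)
        sums = {s: sum(freqs[s:s + size]) for s in starts}
        best = max(sums.values())
        for s, total in sums.items():
            if total == best:  # ties each receive a tally
                for k in range(s, s + size):
                    tallies[k] += 1
    return tallies

classes = ["0-5", "5-10", "10-15", "15-20", "20-25", "25-30", "30-35", "35-40"]
freqs = [9, 12, 15, 16, 17, 15, 10, 13]

totals = grouping_tallies(freqs)
print(totals)                              # [0, 1, 2, 4, 5, 2, 0, 0]
print(classes[totals.index(max(totals))])  # 20-25
```

If two classes share the highest total, the modal class is not clearly defined and the interpolation formula should not be applied blindly.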
Example 5
The following table gives some frequency data:

Size of Item Frequency Cumulative Frequency
10−20 10 10
20−30 18 28
30−40 25 53
40−50 26 79
50−60 17 96
60−70 4 100
Total 100
Required:
Calculate the mode
Solution
Grouping Table
Class Interval   Frequency    2     3     4     5     6
10−20               10        -     -     -     -     -
20−30               18       28     -    53     -     -
30−40               25        -    43     -    69     -
40−50               26       51     -     -     -    68
50−60               17        -    43    47     -     -
60−70                4       21     -     -     -     -
Analysis Table
Columns   10−20   20−30   30−40   40−50   50−60   60−70
1           -       -       -       1       -       -
2           -       -       1       1       -       -
3           -       1       1       1       1       -
4           1       1       1       -       -       -
5           -       1       1       1       -       -
6           -       -       1       1       1       -
Total       1       3       5       5       2       0

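The analysis table gives the same total (5) for the classes 30−40 and 40−50, so the modal class
is not clearly defined. The mode is therefore estimated from the empirical relationship:
Mode = 3 median − 2 mean
Position of the median = $\frac{n + 1}{2} = \frac{100 + 1}{2}$ = 50.5th item
This lies in the class 30−40.
Median = $l_1 + \frac{l_2 - l_1}{f}(m - c) = 30 + \frac{40 - 30}{25}(50.5 - 28) = 30 + 9 = 39$
where $l_1$ and $l_2$ are the lower and upper limits of the median class, f is its frequency, m is the
position of the median and c is the cumulative frequency of the preceding class.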
Calculation of Arithmetic Mean
Class- Interval Frequency Mid- Points d d'=d/10 fd’
10−20 10 15 −20 −2 −20
20−30 18 25 −10 −1 −18
30−40 25 35 0 0 0
40−50 26 45 10 1 26
50−60 17 55 20 2 34
60−70 4 65 30 3 12
Total 100 34
Assumed mean A = 35
Mean = $A + \frac{\sum fd'}{n} \times i$
Mean = $35 + \frac{34}{100} \times 10$ = 38.4
Mode = 3 median − 2 mean = 3(39) − 2(38.4) = 117 − 76.8 = 40.2
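The whole of Example 5 can be checked with a short Python sketch that follows the same steps: the median from the (n + 1)/2-th item, the mean by step deviation from the assumed mean 35, and the mode from the empirical relationship. Variable names are illustrative only.

```python
lower = [10, 20, 30, 40, 50, 60]   # lower class limits
freq = [10, 18, 25, 26, 17, 4]
width = 10
n = sum(freq)

# Median: locate the (n + 1)/2-th item and interpolate within its class
position = (n + 1) / 2
cumulative = 0  # cumulative frequency of the classes before the current one
for l1, f in zip(lower, freq):
    if cumulative + f >= position:
        median = l1 + width / f * (position - cumulative)
        break
    cumulative += f

# Mean by step deviation: d' = (midpoint - A) / width
A = 35
sum_fd = sum(f * ((l1 + width / 2 - A) / width) for l1, f in zip(lower, freq))
mean = A + sum_fd / n * width

# Empirical relationship
mode = 3 * median - 2 * mean
print(round(median, 1), round(mean, 1), round(mode, 1))  # 39.0 38.4 40.2
```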
Merits of Mode
1. It is easy to calculate and in some cases it can be located by mere inspection.
2. Mode is not at all affected by extreme values.
3. It can be calculated for open-end classes.
4. It is usually an actual value of an important part of the series.
5. In some circumstances it is the best representative of data.

Demerits of Mode
1. It is not based on all observations.
2. It is not capable of further mathematical treatment.
3. Mode is generally ill-defined; in some cases it is not possible to find the mode.
4. As compared with the mean, the mode is affected to a great extent by sampling fluctuations.
5. It is unsuitable in cases where relative importance of items has to be considered.
QUARTILES
Meaning
The quartiles divide the distribution in four parts. There are three quartiles. The second quartile
(Q2) divides the distribution into two halves and therefore is the same as the median. The first
(lower) quartile (Q1) marks off the first one-fourth, the third (upper) quartile (Q3) marks off the
three-fourth. In other words, the three quartiles Q1, Q2 and Q3 are such that 25 percent of the data
fall below Q1, 25 percent fall between Q1 and Q2, 25 percent fall between Q2 and Q3 and 25 percent
fall above Q3.
Computation of Quartiles
1. Raw or Ungrouped Data
First arrange the given data in increasing order, then use the formulae for Q1 and Q3.
Q1 = $\left(\frac{n + 1}{4}\right)$th item
Q3 = $3\left(\frac{n + 1}{4}\right)$th item
Example 1
Compute quartiles for the data given below:
25, 18, 30, 8, 15, 5, 10, 35, 40, 45
Solution
Arranged in increasing order: 5, 8, 10, 15, 18, 25, 30, 35, 40, 45
Q1 = $\left(\frac{n + 1}{4}\right)$th item = $\left(\frac{10 + 1}{4}\right)$th item = (2.75)th item
Q1 = 2nd item + $\frac{3}{4}$(3rd item − 2nd item)
Q1 = $8 + \frac{3}{4}(10 - 8)$ = 9.5
Q3 = $3\left(\frac{n + 1}{4}\right)$th item = 3(2.75)th item = (8.25)th item
Q3 = 8th item + $\frac{1}{4}$(9th item − 8th item)
Q3 = $35 + \frac{1}{4}(40 - 35)$ = 36.25
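A minimal Python sketch of this method follows. The helper names `item_at` and `quartiles` are hypothetical. Note that this (n + 1)/4 rule is the convention used in these notes; library quantile functions often use other conventions and may give slightly different values.

```python
def item_at(sorted_values, position):
    """Value at a 1-based, possibly fractional, position, by linear interpolation."""
    whole = int(position)
    frac = position - whole
    if whole < 1:
        return sorted_values[0]
    if whole >= len(sorted_values):
        return sorted_values[-1]
    lower = sorted_values[whole - 1]
    return lower + frac * (sorted_values[whole] - lower)

def quartiles(values):
    """Q1 and Q3 as the (n + 1)/4-th and 3(n + 1)/4-th items of the sorted data."""
    data = sorted(values)
    n = len(data)
    return item_at(data, (n + 1) / 4), item_at(data, 3 * (n + 1) / 4)

print(quartiles([25, 18, 30, 8, 15, 5, 10, 35, 40, 45]))  # (9.5, 36.25)
```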
2. Discrete Series
Step 1: Find cumulative frequencies.
Step 2: Find $\frac{n+1}{4}$.
Step 3: In the cumulative frequencies, find the value just greater than $\frac{n+1}{4}$; the corresponding
value of x is Q1.
Step 4: Find $3\left(\frac{n+1}{4}\right)$.
Step 5: In the cumulative frequencies, find the value just greater than $3\left(\frac{n+1}{4}\right)$; the
corresponding value of x is Q3.
Example 2
Compute quartiles for the data given below:
X 5 8 12 15 19 24 30
F 4 3 2 4 5 2 4

Solution :
X F CF
5 4 4
8 3 7
12 2 9
15 4 13
19 5 18
24 2 20
30 4 24
Total 24
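Q1 = $\left(\frac{N + 1}{4}\right)$th item = $\frac{24 + 1}{4} = \frac{25}{4}$ = 6.25th item
Q3 = $3\left(\frac{N + 1}{4}\right)$th item = $3\left(\frac{24 + 1}{4}\right) = 3\left(\frac{25}{4}\right)$ = 18.75th item
The cumulative frequency just greater than 6.25 is 7, and the one just greater than 18.75 is 20, so
Q1 = 8; Q3 = 24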
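The five steps for a discrete series can be sketched in Python using the data of this example. The function name `discrete_quantile` is hypothetical; it returns the first x whose cumulative frequency is strictly greater than the required position.

```python
from itertools import accumulate

def discrete_quantile(xs, fs, position):
    """First x whose cumulative frequency is just greater than `position`."""
    for x, cf in zip(xs, accumulate(fs)):
        if cf > position:
            return x
    return xs[-1]

xs = [5, 8, 12, 15, 19, 24, 30]
fs = [4, 3, 2, 4, 5, 2, 4]
n = sum(fs)

q1 = discrete_quantile(xs, fs, (n + 1) / 4)
q3 = discrete_quantile(xs, fs, 3 * (n + 1) / 4)
print(q1, q3)  # 8 24
```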
3. Continuous Series
Step 1: Find cumulative frequencies.
Step 2: Find $\frac{N}{4}$.
Step 3: In the cumulative frequencies, find the value just greater than $\frac{N}{4}$; the corresponding
class interval is called the first quartile class.
Step 4: Find $3\left(\frac{N}{4}\right)$.
Step 5: In the cumulative frequencies, find the value just greater than $3\left(\frac{N}{4}\right)$; the corresponding
class interval is called the third quartile class.
Step 6: Apply the respective formulae.
Q1 = $l_1 + \frac{\frac{N}{4} - m_1}{f_1} \times c_1$
Q3 = $l_3 + \frac{3\left(\frac{N}{4}\right) - m_3}{f_3} \times c_3$
Where: l1 = lower limit of the first quartile class
f1 = frequency of the first quartile class
c1 = width of the first quartile class
m1 = cf preceding the first quartile class
l3 = lower limit of the third quartile class
f3 = frequency of the third quartile class
c3 = width of the third quartile class
m3 = cf preceding the third quartile class
Example 3
The following series relates to the marks secured by students in an examination.
Marks Number of Students
0−10 11
10−20 18
20−30 25
30−40 28
40−50 30
50−60 33
60−70 22
70−80 15
80−90 12
90−100 10

Required:
Find the quartiles.
Solution:
Marks Number of Students Cumulative Frequency
0−10 11 11
10−20 18 29
20−30 25 54
30−40 28 82
40−50 30 112
50−60 33 145
60−70 22 167
70−80 15 182
80−90 12 194
90−100 10 204
Total 204
$\frac{N}{4} = \frac{204}{4} = 51$; $3\left(\frac{N}{4}\right) = 153$
Q1 = $20 + \frac{51 - 29}{25} \times 10$ = 28.8
Q3 = $60 + \frac{153 - 145}{22} \times 10$ = 63.64
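The continuous-series steps can be sketched in Python with the marks data of this example. The function name `grouped_quantile` is hypothetical and equal class widths are assumed; `fraction` is 0.25 for Q1 and 0.75 for Q3.

```python
def grouped_quantile(lower_limits, frequencies, width, fraction):
    """l + ((fraction * N - m) / f) * c for the class containing the quantile."""
    target = fraction * sum(frequencies)
    cumulative = 0  # m: cumulative frequency preceding the current class
    for l, f in zip(lower_limits, frequencies):
        if cumulative + f > target:  # cumulative frequency just greater than target
            return l + (target - cumulative) / f * width
        cumulative += f
    return lower_limits[-1] + width

lower_limits = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
students = [11, 18, 25, 28, 30, 33, 22, 15, 12, 10]

q1 = grouped_quantile(lower_limits, students, 10, 0.25)
q3 = grouped_quantile(lower_limits, students, 10, 0.75)
print(round(q1, 2), round(q3, 2))  # 28.8 63.64
```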
PERCENTILES
The percentile values divide the distribution into 100 parts each containing 1 percent of the cases.
The percentile (Pk) is that value of the variable up to which lie exactly k% of the total number of
observations.
1. Percentile for Raw Data or Ungrouped Data
Relationship :
P25 = Q1 ; P50 = Q2 = Median and P75 = Q3

Example 4
Calculate P15 for the data given below:
5, 24 , 36 , 12 , 20 , 8
Solution:
Arranging the given values in the increasing order.
5, 8, 12, 20, 24, 36
P15 = $\left(\frac{15(n + 1)}{100}\right)$th item
P15 = $\left(\frac{15(6 + 1)}{100}\right)$th item = $\left(\frac{15 \times 7}{100}\right)$th item
P15 = (1.05)th item
P15 = 1st item + 0.05(2nd item − 1st item)
P15 = 5 + 0.05(8 − 5) = 5.15
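The percentile rule for raw data can be sketched in Python as below. The function name `percentile` is hypothetical, and the k(n + 1)/100 position rule is the convention of these notes rather than that of any particular library.

```python
def percentile(values, k):
    """k-th percentile as the k(n + 1)/100-th item of the sorted data."""
    data = sorted(values)
    position = k * (len(data) + 1) / 100
    whole = int(position)
    frac = position - whole
    if whole < 1:
        return data[0]
    if whole >= len(data):
        return data[-1]
    # Interpolate between the whole-th and (whole + 1)-th items
    return data[whole - 1] + frac * (data[whole] - data[whole - 1])

print(round(percentile([5, 24, 36, 12, 20, 8], 15), 2))  # 5.15
```

With k = 25, 50 and 75 the same function gives Q1, the median and Q3, in line with the relationship P25 = Q1, P50 = Q2 and P75 = Q3.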
2. Percentile for Grouped Data
Example 5
Find P53 for the following frequency distribution:
Class Interval 0−5 5−10 10−15 15−20 20−25 25−30 30−35 35−40
Frequency 5 8 12 16 20 10 4 3
Solution :
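Class Interval   Frequency   Cumulative Frequency
0−5                  5              5
5−10                 8             13
10−15               12             25
15−20               16             41
20−25               20             61
25−30               10             71
30−35                4             75
35−40                3             78
Total               78
$\frac{53N}{100} = \frac{53(78)}{100}$ = 41.34. The cumulative frequency just greater than 41.34 is 61, so P53 lies in
the class 20−25, with l = 20, m = 41, f = 20 and c = 5.
P53 = $l + \frac{\frac{53N}{100} - m}{f} \times c$
P53 = $20 + \frac{41.34 - 41}{20} \times 5$ = 20.085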
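The grouped percentile calculation can be sketched in Python using the data of this example. The function name `grouped_percentile` is hypothetical and equal class widths are assumed.

```python
def grouped_percentile(k, lower_limits, frequencies, width):
    """P_k = l + ((k * N / 100 - m) / f) * c for the class containing P_k."""
    target = k * sum(frequencies) / 100
    cumulative = 0  # m: cumulative frequency preceding the current class
    for l, f in zip(lower_limits, frequencies):
        if cumulative + f > target:  # cumulative frequency just greater than target
            return l + (target - cumulative) / f * width
        cumulative += f
    return lower_limits[-1] + width

lower_limits = [0, 5, 10, 15, 20, 25, 30, 35]
frequencies = [5, 8, 12, 16, 20, 10, 4, 3]
print(round(grouped_percentile(53, lower_limits, frequencies, 5), 3))  # 20.085
```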

BPCU004: ADVANCED BUSINESS STATISTICS
MEASURES OF DISPERSION
MEANING
Dispersion (also known as scatter, spread or variation) measures the extent to which the items
vary from some central value.
SIGNIFICANCE OF MEASURING VARIATION
1. Measures of variation point out as to how far an average is representative of the mass.
2. Measures of dispersion determine nature and cause of variation in order to control the
variation itself.
3. Measures of dispersion enable a comparison to be made of two or more series with regard
to their variability.
4. Measures of dispersion are the basis of many powerful analytical tools in statistics such as
correlation analysis, testing of hypothesis, analysis of variance, the statistical quality control
and regression analysis.
Characteristics/Properties of a Good Measure of Dispersion
1. It should be simple to understand.
2. It should be easy to compute.
3. It should be rigidly defined.
4. It should be based on each and every item of the distribution.
5. It should be amenable to further algebraic treatment.
6. It should have sampling stability.
7. Extreme items should not unduly affect it.
ABSOLUTE AND RELATIVE MEASURES OF DISPERSION
There are two kinds of measures of dispersion, namely:
1. Absolute measure of dispersion.
2. Relative measure of dispersion.
Absolute measure of dispersion indicates the amount of variation in a set of values in terms of
units of observations. For example, when rainfalls on different days are available in mm, any
absolute measure of dispersion gives the variation in rainfall in mm. On the other hand relative
measures of dispersion are free from the units of measurement of the observations. They are
pure numbers, used to compare the variation in two or more sets of observations that have
different units of measurement.
Absolute measure           Relative measure
1. Range                   1. Co-efficient of Range
2. Quartile deviation      2. Co-efficient of Quartile deviation
3. Mean deviation          3. Co-efficient of Mean deviation
4. Standard deviation      4. Co-efficient of variation
RANGE AND COEFFICIENT OF RANGE
1. Range
This is the simplest possible measure of dispersion and is defined as the difference between the
largest and smallest values of the variable.
Range = L − S
Where: L = Largest value
S = Smallest value
In individual observations and discrete series, L and S are easily identified. In continuous series,
the following two methods are followed.
Method 1:
L = Upper boundary of the highest class
S = Lower boundary of the lowest class
Method 2:
L = Mid value of the highest class
S = Mid value of the lowest class
2. Co-efficient of Range
Coefficient of Range = $\frac{L - S}{L + S}$
Example 1
Find the value of range and its co-efficient for the following data.
7, 9, 6, 8, 11, 10

Solution:
The largest value is L = 11 and the smallest value is S = 6.
Range = L − S = 11 − 6 = 5
Coefficient of Range = $\frac{L - S}{L + S} = \frac{11 - 6}{11 + 6}$ = 0.2941
Example 2:
Calculate range and its co efficient from the following distribution.
Size : 60−63 63−66 66−69 69−72 72−75
Number : 5 18 42 27 8
Solution:
Range = L − S = 75 − 60 = 15
Coefficient of Range = $\frac{L - S}{L + S} = \frac{75 - 60}{75 + 60}$ = 0.1111
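Both methods for a continuous series can be compared with a short Python sketch using the classes of Example 2. The function name `range_and_coefficient` is hypothetical.

```python
def range_and_coefficient(largest, smallest):
    """Return (L - S, (L - S) / (L + S))."""
    return largest - smallest, (largest - smallest) / (largest + smallest)

classes = [(60, 63), (63, 66), (66, 69), (69, 72), (72, 75)]

# Method 1: upper boundary of the highest class, lower boundary of the lowest class
L1 = max(upper for _, upper in classes)
S1 = min(lower for lower, _ in classes)
rng1, coeff1 = range_and_coefficient(L1, S1)
print(rng1, round(coeff1, 4))  # 15 0.1111

# Method 2: mid values of the highest and lowest classes
mids = [(lower + upper) / 2 for lower, upper in classes]
rng2, coeff2 = range_and_coefficient(max(mids), min(mids))
print(rng2, round(coeff2, 4))  # 12.0 0.0889
```

The two methods give different answers for the same table, so the method used should always be stated.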
Merits
1. It is simple to understand.
2. It is easy to calculate.
3. In certain types of problems like quality control, weather forecasts, share price analysis,
etc., range is most widely used.
Demerits:
1. It is very much affected by the extreme items.
2. It is based on only two extreme observations.
3. It cannot be calculated from open-end class intervals.
4. It is not suitable for mathematical treatment.
5. It is a very rarely used measure.
QUARTILE DEVIATION AND CO-EFFICIENT OF QUARTILE DEVIATION
1. Quartile Deviation (Q.D)
Definition: Quartile Deviation is half of the difference between the third and first quartiles.
Hence, it is also called the Semi-Interquartile Range.
$Q.D. = \dfrac{Q_3 - Q_1}{2}$
Among the quartiles Q1, Q2 and Q3, the difference Q3 − Q1 is called the interquartile range and
(Q3 − Q1)/2 the semi-interquartile range.
2. Co-efficient of Quartile Deviation
$\text{Co-efficient of Q.D.} = \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$
Example 3
Find the Quartile Deviation for the following data:
391, 384, 591, 407, 672, 522, 777, 733, 1490, 2488
Solution:
Arrange the given values in ascending order:
384, 391, 407, 522, 591, 672, 733, 777, 1490, 2488
$\text{Position of } Q_1 = \dfrac{N + 1}{4} = \dfrac{10 + 1}{4} = 2.75\text{th item}$
Q1 = 2nd item + 0.75 (3rd item − 2nd item)
Q1 = 391 + 0.75 (407 − 391) = 403
$\text{Position of } Q_3 = 3\left(\dfrac{N + 1}{4}\right) = 3(2.75) = 8.25\text{th item}$
Q3 = 8th item + 0.25 (9th item − 8th item)
Q3 = 777 + 0.25 (1490 − 777) = 955.25
$Q.D. = \dfrac{Q_3 - Q_1}{2} = \dfrac{955.25 - 403}{2} = 276.125$
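The position rule used in Example 3 can be sketched in Python. This is an illustrative sketch under the assumption that quartiles are located at the k(N + 1)/4-th item with linear interpolation between neighbouring items; the function name `quartile_by_position` is hypothetical.

```python
def quartile_by_position(values, k):
    """k-th quartile (k = 1, 2, 3) using the k(N + 1)/4 position rule."""
    data = sorted(values)
    n = len(data)
    position = k * (n + 1) / 4      # 1-based position, may be fractional
    whole = int(position)           # item number just below the position
    fraction = position - whole
    if whole < 1:                   # position falls before the first item
        return data[0]
    if whole >= n:                  # position falls at or beyond the last item
        return data[-1]
    lower, upper = data[whole - 1], data[whole]
    return lower + fraction * (upper - lower)


observations = [391, 384, 591, 407, 672, 522, 777, 733, 1490, 2488]
q1 = quartile_by_position(observations, 1)
q3 = quartile_by_position(observations, 3)
quartile_deviation = (q3 - q1) / 2
coefficient_qd = (q3 - q1) / (q3 + q1)
print(q1, q3, quartile_deviation, round(coefficient_qd, 4))
```

Other quartile conventions exist (statistical packages often use different interpolation rules), so results from library functions may differ slightly from this hand method.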
Example 4
Weekly wages of labourers are given below. Calculate the Q.D. and the coefficient of Q.D.
Weekly Wage (Kshs.)   100   200   400   500   600
No. of Weeks            5     8    21    12     6
Solution:
Weekly Wage (Kshs.)   No. of Weeks   Cum. No. of Weeks
100                    5              5
200                    8             13
400                   21             34
500                   12             46
600                    6             52
Total                 52
$\text{Position of } Q_1 = \dfrac{N + 1}{4} = \dfrac{52 + 1}{4} = 13.25\text{th item}$
Q1 = 13th item + 0.25 (14th item − 13th item)
Q1 = 200 + 0.25 (400 − 200) = 250
$\text{Position of } Q_3 = 3\left(\dfrac{N + 1}{4}\right) = 3(13.25) = 39.75\text{th item}$
Q3 = 39th item + 0.75 (40th item − 39th item)
The cumulative frequencies show that the 35th to the 46th items are all equal to 500, so the 39th
and 40th items are both 500.
Q3 = 500 + 0.75 (500 − 500) = 500
$Q.D. = \dfrac{Q_3 - Q_1}{2} = \dfrac{500 - 250}{2} = 125$
$\text{Co-efficient of Q.D.} = \dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{500 - 250}{500 + 250} = \dfrac{250}{750} = 0.333$
Example 5
For the data given below, give the quartile deviation and coefficient of quartile deviation.
X 351−500 501−650 651−800 801−950 951−1100
F 48 189 88 47 28

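Solution:
X           True Class Intervals   F     Cumulative Frequency
351−500     350.5−500.5             48     48
501−650     500.5−650.5            189    237
651−800     650.5−800.5             88    325
801−950     800.5−950.5             47    372
951−1100    950.5−1100.5            28    400
Total                              400
$\text{Position of } Q_1 = \dfrac{N}{4} = \dfrac{400}{4} = 100; \quad \text{Position of } Q_3 = 3\left(\dfrac{N}{4}\right) = 3(100) = 300$
$Q_1 = l_1 + \left(\dfrac{\frac{N}{4} - m_1}{f_1}\right) \times c_1$
Here l1 is the lower boundary of the class containing Q1, m1 the cumulative frequency before that
class, f1 the frequency of the class and c1 its width. Q1 lies in the class 500.5−650.5.
$Q_1 = 500.5 + \left(\dfrac{100 - 48}{189}\right) \times 150 = 541.77$
$Q_3 = l_3 + \left(\dfrac{3\left(\frac{N}{4}\right) - m_3}{f_3}\right) \times c_3$
Q3 lies in the class 650.5−800.5.
$Q_3 = 650.5 + \left(\dfrac{300 - 237}{88}\right) \times 150 = 757.89$
$Q.D. = \dfrac{Q_3 - Q_1}{2} = \dfrac{757.89 - 541.77}{2} = 108.06$
$\text{Co-efficient of Q.D.} = \dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{757.89 - 541.77}{757.89 + 541.77} = 0.1663$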
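The grouped-data quartile formula of Example 5 can be sketched as follows. This is a minimal sketch assuming equal class widths and true class boundaries; the function name `grouped_quartile` and its parameter names are hypothetical.

```python
def grouped_quartile(lower_boundaries, class_width, frequencies, k):
    """k-th quartile of a grouped distribution: l + ((kN/4 - m) / f) * c."""
    total = sum(frequencies)
    target = k * total / 4          # k(N/4): cumulative count to be reached
    cumulative_before = 0           # m: cumulative frequency before the class
    for lower, freq in zip(lower_boundaries, frequencies):
        if cumulative_before + freq >= target:
            return lower + (target - cumulative_before) / freq * class_width
        cumulative_before += freq
    raise ValueError("k must be 1, 2 or 3 and frequencies must be positive")


lower_boundaries = [350.5, 500.5, 650.5, 800.5, 950.5]
frequencies = [48, 189, 88, 47, 28]
q1 = grouped_quartile(lower_boundaries, 150, frequencies, 1)
q3 = grouped_quartile(lower_boundaries, 150, frequencies, 3)
print(round(q1, 2), round(q3, 2), round((q3 - q1) / 2, 2))
```

The loop stops at the first class whose cumulative frequency reaches the target position, which is the same class selected by inspection in the worked solution.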
Merits of Quartile Deviation
1. It is simple to understand and easy to calculate.
2. It is not affected by extreme values.
3. It can be calculated for data with open end classes also.
Demerits of Quartile Deviation
1. It is not based on all the items. It is based on two positional values Q1 and Q3 and ignores
the extreme 50% of the items.

2. It is not amenable to further mathematical treatment.
3. It is affected by sampling fluctuations.
MEAN DEVIATION AND COEFFICIENT OF MEAN DEVIATION
1. Mean Deviation
The mean deviation is a measure of dispersion based on all items in a distribution. It is the
arithmetic mean of the absolute deviations of a series computed from any measure of central
tendency, i.e., the mean, median or mode. All the deviations are taken as positive, i.e., signs are
ignored. In practice, because the mean is so widely used, the mean deviation is generally
computed from the mean. Mean deviation is abbreviated M.D.
2. Coefficient of mean deviation:
Mean deviation calculated by any measure of central tendency is an absolute measure. For the
purpose of comparing variation among different series, a relative mean deviation is required.
The relative mean deviation is obtained by dividing the mean deviation by the average used for
calculating mean deviation.
$\text{Co-efficient of Mean Deviation} = \dfrac{\text{Mean Deviation}}{\text{Mean or Median or Mode}}$
If the result is desired as a percentage, the coefficient of mean deviation is multiplied by 100:
$\text{Co-efficient of Mean Deviation} = \dfrac{\text{Mean Deviation}}{\text{Mean or Median or Mode}} \times 100$
COMPUTATION OF MEAN DEVIATION
1. Individual Series
a. Calculate the average (mean, median or mode) of the series.
b. Take the deviations of the items from the average, ignoring signs, and denote these deviations
by |D|.
c. Compute the total of these deviations, i.e., Σ|D|.
d. Divide this total by the number of items.
$M.D. = \dfrac{\sum |D|}{n}$
Example 6
Calculate the mean deviation from the mean and from the median for the following data: 100, 150,
200, 250, 360, 490, 500, 600, 671. Also calculate the coefficients of M.D.
Solution:
$\text{Mean} = \dfrac{\sum X}{N} = \dfrac{3321}{9} = 369$
Now arrange the data in ascending order:
100, 150, 200, 250, 360, 490, 500, 600, 671
$\text{Median} = \text{Value of } \left(\dfrac{n + 1}{2}\right)\text{th item} = \text{Value of } \left(\dfrac{9 + 1}{2}\right)\text{th item} = \text{Value of 5th item} = 360$
X       |D| = |X − Mean|   |D| = |X − Median|
100     269                260
150     219                210
200     169                160
250     119                110
360       9                  0
490     121                130
500     131                140
600     231                240
671     302                311
3321    1570               1561
$\text{M.D. from mean} = \dfrac{\sum |D|}{n} = \dfrac{1570}{9} = 174.44$
$\text{Co-efficient of M.D.} = \dfrac{\text{M.D.}}{\text{Mean}} = \dfrac{174.44}{369} = 0.47$
$\text{M.D. from median} = \dfrac{\sum |D|}{n} = \dfrac{1561}{9} = 173.44$
$\text{Co-efficient of M.D.} = \dfrac{\text{M.D.}}{\text{Median}} = \dfrac{173.44}{360} = 0.48$
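The individual-series procedure can be sketched in a few lines of Python using the standard library. The helper name `mean_deviation` is hypothetical; the data are those of Example 6.

```python
from statistics import mean, median


def mean_deviation(values, center):
    """Average absolute deviation of the values from the chosen center."""
    return sum(abs(x - center) for x in values) / len(values)


data = [100, 150, 200, 250, 360, 490, 500, 600, 671]
md_from_mean = mean_deviation(data, mean(data))
md_from_median = mean_deviation(data, median(data))
coefficient_mean = md_from_mean / mean(data)
coefficient_median = md_from_median / median(data)
print(round(md_from_mean, 2), round(coefficient_mean, 2))
print(round(md_from_median, 2), round(coefficient_median, 2))
```

Passing the center as an argument reflects the point made in the notes that the mean deviation can be taken about any average.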

2. Mean Deviation −Discrete Series
Step 1: Find out an average (mean, median or mode).
Step 2: Find out the deviation of the variable values from the average, ignoring signs and denote
them by |D|
Step 3: Multiply the deviation of each value by its respective frequency and find out the total
Σf | D|
Step 4: Divide Σf | D| by the total frequencies N
Example 7
Compute Mean deviation from mean and median from the following data:
Height in cms    158  159  160  161  162  163  164  165  166
No. of persons    15   20   32   35   33   22   20   10    8
Also compute the coefficient of mean deviation.
Solution:
Height (X)   No. of persons (f)   d = X − A (A = 162)   fd     |D| = |X − Mean|   f|D|
158          15                   −4                    −60    3.51               52.65
159          20                   −3                    −60    2.51               50.20
160          32                   −2                    −64    1.51               48.32
161          35                   −1                    −35    0.51               17.85
162          33                    0                      0    0.49               16.17
163          22                    1                     22    1.49               32.78
164          20                    2                     40    2.49               49.80
165          10                    3                     30    3.49               34.90
166           8                    4                     32    4.49               35.92
Total        195                                        −95                       338.59
$\text{Mean} = A + \dfrac{\sum fd}{N} = 162 + \dfrac{-95}{195} = 161.51$
$\text{M.D.} = \dfrac{\sum f|D|}{N} = \dfrac{338.59}{195} = 1.74$
$\text{Co-efficient of M.D.} = \dfrac{\text{M.D.}}{\text{Mean}} = \dfrac{1.74}{161.51} = 0.0108$
Height (X)   No. of persons (f)   c.f.   |D| = |X − Median|   f|D|
158          15                    15    3                    45
159          20                    35    2                    40
160          32                    67    1                    32
161          35                   102    0                     0
162          33                   135    1                    33
163          22                   157    2                    44
164          20                   177    3                    60
165          10                   187    4                    40
166           8                   195    5                    40
Total        195                                              334
$\text{Median} = \text{Size of } \left(\dfrac{N + 1}{2}\right)\text{th item} = \text{Size of } \left(\dfrac{195 + 1}{2}\right)\text{th item} = \text{Size of 98th item} = 161$
$\text{M.D.} = \dfrac{\sum f|D|}{N} = \dfrac{334}{195} = 1.71$
$\text{Co-efficient of M.D.} = \dfrac{\text{M.D.}}{\text{Median}} = \dfrac{1.71}{161} = 0.0106$
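For a discrete series the same calculation weights each deviation by its frequency. The sketch below reproduces Example 7; the helper names are hypothetical, and the median helper assumes the values are listed in ascending order and simply returns the value at which the cumulative frequency first reaches (N + 1)/2.

```python
def weighted_mean(values, freqs):
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)


def weighted_median(values, freqs):
    """Value of the (N + 1)/2-th item; values must be in ascending order."""
    middle = (sum(freqs) + 1) / 2
    cumulative = 0
    for x, f in zip(values, freqs):
        cumulative += f
        if cumulative >= middle:
            return x


def weighted_mean_deviation(values, freqs, center):
    """Sum of f * |x - center| divided by the total frequency N."""
    return sum(f * abs(x - center) for x, f in zip(values, freqs)) / sum(freqs)


heights = [158, 159, 160, 161, 162, 163, 164, 165, 166]
persons = [15, 20, 32, 35, 33, 22, 20, 10, 8]
mean_height = weighted_mean(heights, persons)
median_height = weighted_median(heights, persons)
md_mean = weighted_mean_deviation(heights, persons, mean_height)
md_median = weighted_mean_deviation(heights, persons, median_height)
print(round(mean_height, 2), round(md_mean, 2), median_height, round(md_median, 2))
```

The script uses the unrounded mean, so intermediate deviations differ slightly from the two-decimal values in the table, but the final M.D. agrees to two decimal places.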
3. Mean Deviation-Continuous Series
The method of calculating the mean deviation in a continuous series is the same as for a discrete
series. In a continuous series we find the mid-points of the various classes and take the deviations
of these mid-points from the selected average. Thus
$\text{M.D.} = \dfrac{\sum f|D|}{N}$
Where: D = m − Average; m = mid-point
Example 8:
Find the mean deviation from the mean and from the median for the following series.
Age in years   No. of persons
0−10           20
10−20          25
20−30          32
30−40          40
40−50          42
50−60          35
60−70          10
70−80           8
Also compute the co-efficient of mean deviation.
Solution:
x       m    f     d = (m − A)/c (A = 35; c = 10)   fd     |D| = |m − Mean|   f|D|
0−10     5   20    −3                               −60    31.5               630.0
10−20   15   25    −2                               −50    21.5               537.5
20−30   25   32    −1                               −32    11.5               368.0
30−40   35   40     0                                 0     1.5                60.0
40−50   45   42     1                                42     8.5               357.0
50−60   55   35     2                                70    18.5               647.5
60−70   65   10     3                                30    28.5               285.0
70−80   75    8     4                                32    38.5               308.0
Total       212                                      32                      3193.0
$\text{Mean} = A + \dfrac{\sum fd}{N} \times c = 35 + \dfrac{32}{212} \times 10 = 36.5$
$\text{M.D.} = \dfrac{\sum f|D|}{N} = \dfrac{3193}{212} = 15.06$
Calculation of Median and M.D. from Median
x       m    f     c.f.   |D| = |m − Md|   f|D|
0−10     5   20     20    32.25            645.00
10−20   15   25     45    22.25            556.25
20−30   25   32     77    12.25            392.00
30−40   35   40    117     2.25             90.00
40−50   45   42    159     7.75            325.50
50−60   55   35    194    17.75            621.25
60−70   65   10    204    27.75            277.50
70−80   75    8    212    37.75            302.00
Total       212                           3209.50
$\text{Position of the median} = \left(\dfrac{N}{2}\right)\text{th item} = \dfrac{212}{2} = 106\text{th item}$
$\text{Median} = l + \dfrac{\frac{N}{2} - m}{f} \times c = 30 + \dfrac{106 - 77}{40} \times 10 = 37.25$
$\text{M.D.} = \dfrac{\sum f|D|}{N} = \dfrac{3209.5}{212} = 15.14$
$\text{Co-efficient of M.D.} = \dfrac{\text{M.D.}}{\text{Median}} = \dfrac{15.14}{37.25} = 0.41$
Merits of M.D.
1. It is simple to understand and easy to compute.
2. It is rigidly defined.
3. It is based on all items of the series.
4. It is not much affected by the fluctuations of sampling.
5. It is less affected by the extreme items.
6. It is flexible, because it can be calculated from any average.
7. It is a better measure for comparison.
Demerits of M.D.
1. It is not a very accurate measure of dispersion.
2. It is not suitable for further mathematical calculation.
3. It is rarely used. It is not as popular as standard deviation.

4. Algebraic positive and negative signs are ignored. It is mathematically unsound and
illogical.
STANDARD DEVIATION AND COEFFICIENT OF VARIATION
1. Definition
It is defined as the positive square-root of the arithmetic mean of the Square of the deviations
of the given observation from their arithmetic mean. It is the square–root of the mean of the
squared deviation from the arithmetic mean. Square of standard deviation is called Variance.
2. Calculation of Standard Deviation-Individual Series
There are two methods of calculating Standard deviation in an individual series.
a) Deviations taken from Actual mean
b) Deviation taken from Assumed mean
(a) Deviation taken from Actual mean
This method is adopted when the mean is a whole number.
Steps:
1. Find the actual mean of the series ($\bar{X}$).
2. Find the deviation of each value from the mean ($x = X - \bar{X}$).
3. Square the deviations and take the total of the squared deviations, $\sum x^2$.
4. Divide the total ($\sum x^2$) by the number of observations to get $\dfrac{\sum x^2}{n}$, and take the
positive square root.
Formula:
$\text{Standard Deviation } (\sigma) = \sqrt{\dfrac{\sum x^2}{n}} \text{ or } \sqrt{\dfrac{\sum (X - \bar{X})^2}{n}}$
(b) Deviations Taken from Assumed Mean
This method is adopted when the arithmetic mean is a fractional value. Taking deviations from a
fractional value would be a difficult and tedious task. To save time and labour, the short-cut
method is applied. In this method, the deviations are taken from an assumed mean.
The formula is:
$\sigma = \sqrt{\dfrac{\sum d^2}{N} - \left(\dfrac{\sum d}{N}\right)^2}$
Where: d stands for the deviations from the assumed mean = (X − A)
Steps:
1. Assume any one of the items in the series as an average (A).
2. Find the deviations from the assumed mean, i.e., X − A, denoted by d, and also the total of
the deviations, Σd.
3. Square the deviations, i.e., d², and add up the squares of the deviations, i.e., Σd².
4. Then substitute the values in the following formula:
$\sigma = \sqrt{\dfrac{\sum d^2}{N} - \left(\dfrac{\sum d}{N}\right)^2}$
Note: We can also use the simplified formula for standard deviation.
$\sigma = \dfrac{1}{N} \sqrt{N \sum d^2 - \left(\sum d\right)^2}$
For the frequency distribution
$\sigma = \dfrac{c}{N} \sqrt{N \sum fd^2 - \left(\sum fd\right)^2}$
Example 9
Calculate the standard deviation from the following data.
14, 22, 9, 15, 20, 17, 12, 11

Solution:
Deviations from actual mean.
Values (X)   (X − X̄)   (X − X̄)²
14           −1          1
22            7         49
9            −6         36
15            0          0
20            5         25
17            2          4
12           −3          9
11           −4         16
120                    140
$\bar{X} = \dfrac{120}{8} = 15$
$\sigma = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n}} = \sqrt{\dfrac{140}{8}} = 4.18$
Example 10
The table below gives the marks obtained by 10 students in statistics. Calculate standard
deviation.
Student No.   1   2   3   4   5   6   7   8   9   10
Marks        43  48  65  57  31  60  37  48  78   59
Solution
Deviations from assumed mean
Student No.   Marks (X)   d = X − A (A = 57)   d²
1             43          −14                  196
2             48           −9                   81
3             65            8                   64
4             57            0                    0
5             31          −26                  676
6             60            3                    9
7             37          −20                  400
8             48           −9                   81
9             78           21                  441
10            59            2                    4
N = 10                    Σd = −44             Σd² = 1952
$\sigma = \sqrt{\dfrac{\sum d^2}{N} - \left(\dfrac{\sum d}{N}\right)^2}$
$\sigma = \sqrt{\dfrac{1952}{10} - \left(\dfrac{-44}{10}\right)^2} = 13.26$
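The assumed-mean shortcut gives the same answer as the direct definition whatever value of A is chosen. The sketch below checks this for the marks of Example 10 against `statistics.pstdev` (the population standard deviation from the Python standard library); the function name `sd_assumed_mean` is hypothetical.

```python
from math import sqrt
from statistics import pstdev


def sd_assumed_mean(values, assumed_mean):
    """Standard deviation via sqrt(sum(d^2)/N - (sum(d)/N)^2), d = X - A."""
    n = len(values)
    deviations = [x - assumed_mean for x in values]
    sum_d = sum(deviations)
    sum_d2 = sum(d * d for d in deviations)
    return sqrt(sum_d2 / n - (sum_d / n) ** 2)


marks = [43, 48, 65, 57, 31, 60, 37, 48, 78, 59]
sd_shortcut = sd_assumed_mean(marks, 57)     # A = 57 as in the worked solution
sd_other_choice = sd_assumed_mean(marks, 50) # any other A gives the same result
sd_direct = pstdev(marks)                    # deviations from the actual mean
print(round(sd_shortcut, 2), round(sd_other_choice, 2), round(sd_direct, 2))
```

The correction term (Σd/N)² removes the effect of choosing A away from the true mean, which is why the choice of A only affects the amount of arithmetic.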
3. Calculation of Standard Deviation for Discrete Series
There are three methods for calculating standard deviation in discrete series:
(a) Actual mean method
(b) Assumed mean method
(c) Step-deviation method
(a) Actual mean method
Steps:
1. Calculate the mean of the series.
2. Find the deviations of the items from the mean, i.e., $d = X - \bar{X}$.
3. Square the deviations (d²) and multiply by the respective frequencies (f) to get fd².
4. Total the products (Σfd²), then apply the formula:
$\sigma = \sqrt{\dfrac{\sum fd^2}{\sum f}}$
If the actual mean is a fraction, the calculation takes a lot of time and labour, so this
method is rarely used in practice.
(b) Assumed Mean Method
Here the deviations are taken not from the actual mean but from an assumed mean. This method is
also used when the given variable values are not at equal intervals.
Steps:
1. Assume any one of the items in the series as an assumed mean, denoted by A.
2. Find the deviations from the assumed mean, i.e., X − A, and denote them by d.
3. Multiply these deviations by the respective frequencies and get Σfd.
4. Square the deviations (d²).
5. Multiply the squared deviations (d²) by the respective frequencies (f) and get Σfd².
6. Substitute the values in the following formula:
$\sigma = \sqrt{\dfrac{\sum fd^2}{\sum f} - \left(\dfrac{\sum fd}{\sum f}\right)^2}$
Where: d = X − A, N = Σf
Example 11:
Calculate Standard deviation from the following data.
X 20 22 25 31 35 40 42 45
f 5 12 15 20 25 14 10 6
Solution :
Deviations from assumed mean
X       f     d = X − A (A = 31)   d²     fd      fd²
20       5    −11                  121    −55     605
22      12     −9                   81   −108     972
25      15     −6                   36    −90     540
31      20      0                    0      0       0
35      25      4                   16    100     400
40      14      9                   81    126    1134
42      10     11                  121    110    1210
45       6     14                  196     84    1176
Total   N = 107                           Σfd = 167   Σfd² = 6037
$\sigma = \sqrt{\dfrac{\sum fd^2}{\sum f} - \left(\dfrac{\sum fd}{\sum f}\right)^2}$
$\sigma = \sqrt{\dfrac{6037}{107} - \left(\dfrac{167}{107}\right)^2} = 7.35$
(c) Step-deviation method:
If the variable values are in equal intervals, then we adopt this method.
Steps:
1. Assume the centre value of the series as the assumed mean A.
2. Find $d' = \dfrac{X - A}{C}$, where C is the interval between successive values.
3. Multiply these deviations d′ by the respective frequencies and get Σfd′.
4. Square the deviations and get d′².
5. Multiply the squared deviations (d′²) by the respective frequencies (f) and obtain the total
Σfd′².
6. Substitute the values in the following formula to get the standard deviation.
$\sigma = \sqrt{\dfrac{\sum fd'^2}{\sum f} - \left(\dfrac{\sum fd'}{\sum f}\right)^2} \times C$
Example 12
Compute Standard deviation from the following data.
Marks 10 20 30 40 50 60
No. of students 8 12 20 10 7 3
Solution:
Marks (X)   No. of students (f)   d′ = (X − 30)/10   d′²   fd′    fd′²
10           8                    −2                 4     −16    32
20          12                    −1                 1     −12    12
30          20                     0                 0       0     0
40          10                     1                 1      10    10
50           7                     2                 4      14    28
60           3                     3                 9       9    27
            N = 60                                         Σfd′ = 5   Σfd′² = 109
$\sigma = \sqrt{\dfrac{\sum fd'^2}{\sum f} - \left(\dfrac{\sum fd'}{\sum f}\right)^2} \times C$
$\sigma = \sqrt{\dfrac{109}{60} - \left(\dfrac{5}{60}\right)^2} \times 10 = 13.45$
4. Calculation of Standard Deviation for Continuous series
In a continuous series the method of calculating the standard deviation is almost the same as in a
discrete series, except that the mid-values of the class intervals must first be found.
The step-deviation method is widely used.
The formula is:
$\sigma = \sqrt{\dfrac{\sum fd'^2}{N} - \left(\dfrac{\sum fd'}{N}\right)^2} \times C$
Where $d' = \dfrac{m - A}{C}$; m = mid-value of the class; C = class interval
Steps:
1. Find the mid-value of each class.
2. Assume the centre value as the assumed mean and denote it by A.
3. Find $d' = \dfrac{m - A}{C}$.
4. Multiply the deviations d′ by the respective frequencies and get Σfd′.
5. Square the deviations and get d′².
6. Multiply the squared deviations (d′²) by the respective frequencies and get Σfd′².
7. Substitute the values in the formula above to get the standard deviation.
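The step-deviation steps above can be sketched as a single Python function. This is an illustrative sketch, not the notes' own implementation: the name `sd_step_deviation` is hypothetical, and it is applied here to the marks data of Example 12, where the variable values play the role of the mid-values.

```python
from math import sqrt


def sd_step_deviation(mid_values, freqs, assumed_mean, width):
    """sqrt(sum(f d'^2)/N - (sum(f d')/N)^2) * C with d' = (m - A) / C."""
    n = sum(freqs)
    steps = [(m - assumed_mean) / width for m in mid_values]
    sum_fd = sum(f * d for f, d in zip(freqs, steps))
    sum_fd2 = sum(f * d * d for f, d in zip(freqs, steps))
    return sqrt(sum_fd2 / n - (sum_fd / n) ** 2) * width


marks = [10, 20, 30, 40, 50, 60]
students = [8, 12, 20, 10, 7, 3]
sd_marks = sd_step_deviation(marks, students, assumed_mean=30, width=10)
print(round(sd_marks, 2))
```

Dividing by C before squaring keeps the intermediate numbers small, and multiplying by C at the end restores the original units.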
Example 13:
The daily temperature recorded in a city in Russia in a year is given below.
Temperature (°C)   No. of days
−40 to −30          10
−30 to −20          18
−20 to −10          30
−10 to 0            42
0 to 10             65
10 to 20           180
20 to 30            20
Required:
Calculate the Standard Deviation.
Solution:
Temperature (X)   Mid-point (m)   No. of days (f)   d′ = (m − (−5))/10   d′²   fd′    fd′²
−40 to −30        −35              10               −3                   9     −30     90
−30 to −20        −25              18               −2                   4     −36     72
−20 to −10        −15              30               −1                   1     −30     30
−10 to 0           −5              42                0                   0       0      0
0 to 10             5              65                1                   1      65     65
10 to 20           15             180                2                   4     360    720
20 to 30           25              20                3                   9      60    180
                                  N = 365                                      Σfd′ = 389   Σfd′² = 1157
$\sigma = \sqrt{\dfrac{\sum fd'^2}{N} - \left(\dfrac{\sum fd'}{N}\right)^2} \times C$
$\sigma = \sqrt{\dfrac{1157}{365} - \left(\dfrac{389}{365}\right)^2} \times 10 = 14.26\ ^{\circ}\text{C}$
Merits of Standard Deviation
1. It is rigidly defined and its value is always definite and based on all the observations and
the actual signs of deviations are used.
2. As it is based on arithmetic mean, it has all the merits of arithmetic mean.
3. It is the most important and widely used measure of dispersion.
4. It is amenable to further algebraic treatment.
5. It is less affected by the fluctuations of sampling and hence stable.
6. It is the basis for measuring the coefficient of correlation and sampling.
Demerits of Standard Deviation
1. It is not easy to understand and it is difficult to calculate.
2. It gives more weight to extreme values because the values are squared up.

3. As it is an absolute measure of variability, it cannot be used for the purpose of comparison.
Coefficient of Variation
The standard deviation is an absolute measure of dispersion. It is expressed in terms of units in
which the original figures are collected and stated. The standard deviation of heights of students
cannot be compared with the standard deviation of weights of students, as both are expressed
in different units, i.e., heights in centimetres and weights in kilograms. Therefore, the standard
deviation must be converted into a relative measure of dispersion for the purpose of
comparison. This relative measure is known as the coefficient of variation.
The coefficient of variation is obtained by dividing the standard deviation by the mean and
multiplying the result by 100. Symbolically,
$\text{Coefficient of Variation (C.V.)} = \dfrac{\sigma}{\bar{X}} \times 100$
If we want to compare the variability of two or more series, we can use the C.V. The series with
the greater C.V. is more variable, less stable, less uniform, less consistent or less
homogeneous. The series with the smaller C.V. is less variable, more stable, more uniform,
more consistent or more homogeneous.
Example 15
In two factories A and B located in the same industrial area, the average weekly wages (in
Kshs.) and the standard deviations are as follows:
Factory   Average   Standard Deviation   No. of workers
A         34.5      5                    476
B         28.5      4.5                  524
Required:
(a) Which factory, A or B, pays out the larger amount as weekly wages?
(b) Which factory, A or B, has greater variability in individual wages?
Solution:
(a) Total wages paid by factory A = 34.5 × 476 = Kshs. 16,422
Total wages paid by factory B = 28.5 × 524 = Kshs. 14,934
Therefore factory A pays out the larger amount as weekly wages.
(b) The C.V. of the distribution of weekly wages in factories A and B are
$\text{C.V. (A)} = \dfrac{\sigma}{\bar{X}} \times 100 = \dfrac{5}{34.5} \times 100 = 14.49\%$
$\text{C.V. (B)} = \dfrac{\sigma}{\bar{X}} \times 100 = \dfrac{4.5}{28.5} \times 100 = 15.79\%$
Factory B has greater variability in individual wages, since the C.V. of factory B is greater than
the C.V. of factory A.
Example 16
Prices of a particular commodity in five years in two cities are given below:
Price in City A Price in City B
20 10
22 20
19 18
23 12
16 15
Which city has more stable prices?
Solution:
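Actual mean method
City A                                  City B
Prices (X)   dx = X − 20   dx²          Prices (Y)   dy = Y − 15   dy²
20            0             0           10           −5            25
22            2             4           20            5            25
19           −1             1           18            3             9
23            3             9           12           −3             9
16           −4            16           15            0             0
ΣX = 100     Σdx = 0       Σdx² = 30    ΣY = 75      Σdy = 0       Σdy² = 68
$\text{City A: } \bar{X} = \dfrac{\sum X}{n} = \dfrac{100}{5} = 20$
$\sigma_A = \sqrt{\dfrac{\sum dx^2}{n}} = \sqrt{\dfrac{30}{5}} = 2.45$
$\text{C.V. (A)} = \dfrac{\sigma_A}{\bar{X}} \times 100 = \dfrac{2.45}{20} \times 100 = 12.25\%$
$\text{City B: } \bar{Y} = \dfrac{\sum Y}{n} = \dfrac{75}{5} = 15$
$\sigma_B = \sqrt{\dfrac{\sum dy^2}{n}} = \sqrt{\dfrac{68}{5}} = 3.69$
$\text{C.V. (B)} = \dfrac{\sigma_B}{\bar{Y}} \times 100 = \dfrac{3.69}{15} \times 100 = 24.6\%$
City A had more stable prices than City B, because the coefficient of variation is less in City A.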
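The comparison in Example 16 can be reproduced with the standard library. This is a minimal sketch; the function name `coefficient_of_variation` is hypothetical, and `pstdev` is used because the notes divide by n rather than n − 1. The percentages differ marginally from the worked solution because the script does not round the standard deviation before dividing.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """C.V. = (population standard deviation / mean) * 100."""
    return pstdev(values) / mean(values) * 100


city_a = [20, 22, 19, 23, 16]
city_b = [10, 20, 18, 12, 15]
cv_a = coefficient_of_variation(city_a)
cv_b = coefficient_of_variation(city_b)
more_stable = "City A" if cv_a < cv_b else "City B"
print(round(cv_a, 2), round(cv_b, 2), more_stable)
```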

LESSON THREE: OVERVIEW OF HYPOTHESIS TESTING
3.0 Introduction
This lesson gives an overview of the concepts in hypothesis testing. It describes the procedure
for testing a hypothesis and differentiates between one-tailed and two-tailed tests and between
Type I and Type II errors. Examples of testing hypotheses about a single population mean, both
when the population variance is known and when it is unknown, are discussed.
3.1 Lesson Objectives
By the end of the lesson, the student should be able to:
• Define the term hypothesis
• Differentiate between one-tailed and two-tailed tests
• Describe the procedure for testing a hypothesis
• Test a hypothesis about the mean when the population variance is known
• Test a hypothesis about the mean when the population variance is unknown
3.2 Definition of Hypothesis Testing
Hypothesis: A statement about a population parameter developed for the purpose of testing.
Hypothesis testing: A procedure based on sample evidence and probability theory to determine
whether the hypothesis is a reasonable statement.
3.3 Procedure for Testing a Hypothesis
The following steps are followed when testing a hypothesis:
1. State the null and alternate hypotheses.
2. Select a level of significance.
3. Identify the test statistic.
4. Formulate a decision rule and identify the rejection region.
5. Compute the value of the test statistic.
6. Make a conclusion.

State the null hypothesis (H0) and alternate hypothesis (HA)
• The null hypothesis is a statement about the value of a population parameter. It should be
stated as “There is no significant difference between ...”. It should always contain
an equal sign.
• The alternate hypothesis is a statement that is accepted if the sample data provide enough
evidence that the null hypothesis is false.
Select a Level of Significance
A level of significance is the probability of rejecting the null hypothesis when it is true. It is
designated by α and lies between 0 and 1.
Types of errors that can be committed
i. Type I error: it is rejecting the null hypothesis, when it is true.
ii. Type II error: It is not rejecting the null hypothesis, when it is false.
Null hypothesis   Do not reject H0    Reject H0
H0 is true        Correct decision    Type I error
H0 is false       Type II error       Correct decision
Identify the Test Statistic
A test statistic is the statistic that will be used to test the hypothesis, e.g. Z, t, F and
χ² (chi-square).
Formulate a decision rule
A decision rule is a statement of the conditions under which the null hypothesis is rejected and
the conditions under which it is not rejected. The region or area of rejection defines the location of
all those values that are so large or so small that the probability of their occurrence under a true
null hypothesis is rather remote.
Compute the value of the test statistic and make a conclusion
The value of the test statistic is determined from the sample information, and is used to determine
whether to reject the null hypothesis or not.

3.4 One-Tailed and Two-Tailed Tests
• A test is one-tailed when the alternate hypothesis states a direction, e.g.
H0: The mean income of women is equal to the mean income of men
HA: The mean income of women is greater than the mean income of men
• A test is two-tailed if no direction is specified in the alternate hypothesis, e.g.
H0: There is no difference between the mean income of women and the mean income
of men
HA: There is a difference between the mean income of women and the mean income of
men
3.5 Testing The Population Mean When the Population Variance is Known
When the population variance is known and the population is normally distributed, the test
statistic for testing a hypothesis about μ is
$Z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
The confidence interval estimator of μ when σ² is known is
$\bar{x} \pm Z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$
Example One
A study by the Coca-Cola Company showed that the typical adult Kenyan consumes 18 gallons of
Coca-Cola each year. According to the same survey, the standard deviation of the number of
gallons consumed is 3.0. A random sample of 64 college students showed they consumed an
average (mean) of 17 gallons of cola last year. At the 0.05 significance level, can we conclude that
there is a significant difference between the mean consumption rate of college students and other
adults?
Solution
1. Stating the null and alternate hypotheses
$H_0: \mu = 18$
$H_A: \mu \neq 18$
2. Level of significance: $\alpha = 0.05$
3. Test statistic
$Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
4. Rejection region
$Z_{\alpha/2} = Z_{0.025} = 1.96$. Reject $H_0$ if $Z_c < -1.96$ or $Z_c > 1.96$.
5. Value of the test statistic
$Z_c = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \dfrac{17 - 18}{3 / \sqrt{64}} = -2.67 < -1.96$
6. Conclusion
Reject H0. Yes, there is a significant difference between the mean consumption rate of college
students and other adults.
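The six steps of this two-tailed Z test can be sketched in Python. This is a hedged illustration: the function name `z_test_two_tailed` is hypothetical, and the critical value is taken from `statistics.NormalDist` (standard library) instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist


def z_test_two_tailed(sample_mean, mu0, sigma, n, alpha):
    """Return (z, critical value, reject?) for H0: mu = mu0 vs HA: mu != mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{alpha/2}
    return z, critical, abs(z) > critical


z_value, z_critical, reject_h0 = z_test_two_tailed(
    sample_mean=17, mu0=18, sigma=3.0, n=64, alpha=0.05
)
print(round(z_value, 2), round(z_critical, 2), reject_h0)
```

For a one-tailed test the whole of α is placed in one tail, so the critical value would be `NormalDist().inv_cdf(1 - alpha)` and only one direction of the inequality is checked.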
Example Two
Past experience indicates that the monthly long distance telephone bill per household in a particular
community is normally distributed, with a mean of Sh. 1012 and a standard deviation of Sh. 327.
After an advertising campaign that encouraged people to make long distance telephone calls more
frequently, a random sample of 57 households revealed that the mean monthly long distance bill
was Sh. 1098. Can we conclude at the 10% significance level that the advertising campaign was
successful?
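Solution
1. Stating the null and alternate hypotheses
$H_0: \mu = 1012$
$H_A: \mu > 1012$
2. Level of significance: $\alpha = 0.10$
3. Test statistic
$Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
4. Rejection region
$Z_{\alpha} = Z_{0.1} = 1.28$. Reject $H_0$ if $Z_c > 1.28$.
5. Value of the test statistic
$Z_c = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \dfrac{1098 - 1012}{327 / \sqrt{57}} = 1.99 > 1.28$
6. Conclusion
Reject H0. Yes, there is sufficient evidence to conclude that the advertising campaign was
successful.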
3.6 Testing the Population Mean when the Population Variance is Unknown
When the population variance is unknown and the population is normally distributed, the test
statistic for testing a hypothesis about μ is
$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$
which has a Student t distribution with n − 1 degrees of freedom.
We now have two different test statistics for testing the population mean. The choice of which one
to use depends on whether or not the population variance is known.
• If the population variance is known, the test statistic is
$Z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
• If the population variance is unknown, the test statistic is
$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}, \quad \text{d.f.} = n - 1$
The confidence interval estimator of μ when σ² is unknown is
$\bar{x} \pm t_{\alpha/2} \dfrac{s}{\sqrt{n}}, \quad \text{d.f.} = n - 1$
Example One
A manufacturer of automobile seats has a production line that produces an average of 100 seats
per day. Because of new government regulations, a new safety device has been installed, which
the manufacturer believes will reduce average daily output. A random sample of 15 days’ output
after the installation of the safety device is shown below:

93, 103, 95, 101, 91, 105, 96, 94, 101, 88, 98, 94, 101, 92, 95
Assuming that the daily output is normally distributed, is there sufficient evidence at the 5%
significance level, to conclude that average daily output has decreased following the installation
of the safety device?
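Solution
1. Stating the null and alternate hypotheses
$H_0: \mu = 100$
$H_A: \mu < 100$
2. Level of significance: $\alpha = 0.05$
3. Test statistic
$t = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$
4. Rejection region
$t_{\alpha,\,n-1} = t_{0.05,\,14} = 1.761$. Reject $H_0$ if $t_c < -1.761$.
5. Value of the test statistic
$\sum X = 1447, \quad \sum X^2 = 139917$
$\bar{X} = \dfrac{\sum X}{n} = \dfrac{1447}{15} = 96.47$
$s = \sqrt{\dfrac{\sum X^2 - \frac{\left(\sum X\right)^2}{n}}{n - 1}} = \sqrt{\dfrac{139917 - \frac{1447^2}{15}}{14}} = 4.85$
$t_c = \dfrac{\bar{X} - \mu}{s / \sqrt{n}} = \dfrac{96.47 - 100}{4.85 / \sqrt{15}} = -2.82 < -1.761$
6. Conclusion
Reject H0. Yes, there is sufficient evidence to conclude that average daily output has decreased
following the installation of the safety device.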
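The t statistic for this example can be computed directly from the sample. This is a minimal sketch: `t_statistic` is a hypothetical helper, `statistics.stdev` gives the sample standard deviation (divisor n − 1), and the critical value 1.761 is copied from the worked solution because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(sample, mu0):
    """t = (sample mean - mu0) / (s / sqrt(n)), with s the sample std dev."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))


daily_output = [93, 103, 95, 101, 91, 105, 96, 94, 101, 88, 98, 94, 101, 92, 95]
t_value = t_statistic(daily_output, 100)
T_CRITICAL = 1.761                  # t(0.05, 14), taken from the t table
reject_h0 = t_value < -T_CRITICAL   # lower-tailed test: HA is mu < 100
print(round(t_value, 2), reject_h0)
```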
Example Two
A courier service advertises that its average delivery time is less than six hours for local deliveries.
A random sample of the amount of time this courier takes to deliver packages to an address across
town produced the following times (rounded to the nearest hour).
7, 3, 4, 6, 10, 5, 6, 4, 3, 8
Is there sufficient evidence to support the courier’s advertisement at the 5% level of significance?
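Solution
1. Stating the null and alternate hypotheses
$H_0: \mu = 6$
$H_A: \mu < 6$
2. Level of significance: $\alpha = 0.05$
3. Test statistic
$t = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$
4. Rejection region
$t_{\alpha,\,n-1} = t_{0.05,\,9} = 1.833$. Reject $H_0$ if $t_c < -1.833$.
5. Value of the test statistic
$\sum X = 56, \quad \sum X^2 = 360$
$\bar{X} = \dfrac{\sum X}{n} = \dfrac{56}{10} = 5.6$
$s = \sqrt{\dfrac{\sum X^2 - \frac{\left(\sum X\right)^2}{n}}{n - 1}} = \sqrt{\dfrac{360 - \frac{56^2}{10}}{9}} = 2.27$
$t_c = \dfrac{\bar{X} - \mu}{s / \sqrt{n}} = \dfrac{5.6 - 6}{2.27 / \sqrt{10}} = -0.56 > -1.833$
6. Conclusion
Do not reject H0. No, there is not sufficient evidence to support the courier's advertisement that
its average delivery time is less than six hours.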
3.7 Chi-Square Test
A chi-squared test is any statistical hypothesis test in which the sampling distribution of the test
statistic is a chi-squared distribution when the null hypothesis is true. Also considered a chi-
squared test is a test in which this is asymptotically true, meaning that the sampling distribution (if
the null hypothesis is true) can be made to approximate a chi-squared distribution as closely as
desired by making the sample size large enough. The chi-square test is used to determine whether
there is a significant difference between the expected frequencies and the observed frequencies in
one or more categories.
3.7.1 Chi-Square Test of a Multinomial Experiment (Goodness-Of-Fit Test)
A multinomial experiment is a generalized version of a binomial experiment that allows for more
than two possible outcomes on each trial of the experiment. The following are the properties of a
multinomial experiment:
• The experiment consists of a fixed number n of trials.
• The outcome of each trial can be classified into exactly one of k categories, called cells.
• The probability $P_i$ that the outcome of a trial will fall into cell i remains constant for each
trial, for i = 1, 2, 3, ..., k. Moreover, $P_1 + P_2 + \dots + P_k = 1$.
• Each trial of the experiment is independent of the other trials.
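Test Statistic
$\chi^2 = \sum_{i=1}^{k} \dfrac{(o_i - e_i)^2}{e_i}$
where $o_i$ and $e_i$ are the observed and expected frequencies of cell i.
Rejection Region
$\chi^2 > \chi^2_{\alpha,\,k-1}$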
Example One
Two companies A and B have recently conducted aggressive advertising campaigns in order to
maintain and possibly increase their respective shares of the market for a particular product. These
two companies enjoy a dominant position in the market. Before advertising campaigns began, the
market share for Company A was 45% while Company B had a market share of 40%. Other
competitors accounted for the remaining market share of 15%. To determine whether these market
shares changed after the advertising campaigns, a marketing analyst solicited the preferences of a
random sample of 200 consumers of this product. Of the 200 consumers, 100 indicated a
preference for Company A's product, 85 preferred Company B's product and the remainder
preferred one or another of the products distributed by other competitors. Conduct a test to
determine, at the 5% level of significance, whether the market shares have changed from the levels
they were at before the advertising campaigns occurred.
Solution
1. Stating the null and alternate hypotheses
H0: P1 = 0.45, P2 = 0.4, P3 = 0.15
HA: At least one of the $P_i$ is not equal to its specified value.
2. Level of significance: $\alpha = 0.05$
3. Test statistic: $\chi^2 = \sum_{i=1}^{k} \dfrac{(o_i - e_i)^2}{e_i}$
4. Rejection region: $\chi^2 > \chi^2_{\alpha,\,k-1} = \chi^2_{0.05,\,2} = 5.99147$
5. Value of the test statistic: assuming that the null hypothesis is correct, we can calculate the
expected number of consumers who prefer A, B and others using the formula $e_i = np_i$.
Company   Observed frequency (o_i)   Expected frequency (e_i)   (o_i − e_i)²   (o_i − e_i)²/e_i
A         100                         90                        100            1.11
B          85                         80                         25            0.31
Others     15                         30                        225            7.50
Total     200                        200                                       8.92
Therefore $\chi^2 = \sum_{i=1}^{k} \dfrac{(o_i - e_i)^2}{e_i} = 8.92 > 5.99147$
6. Conclusion: Reject H0.
There is sufficient evidence at the 5% level of significance to allow us to conclude that the
market shares have changed from the levels they were at before the advertising campaigns
occurred.
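The goodness-of-fit calculation can be sketched in Python. The function name `chi_square_statistic` is hypothetical, and the critical value 5.99147 is copied from the worked solution, since the standard library does not provide chi-square quantiles.

```python
def chi_square_statistic(observed, probabilities):
    """Return (chi-square statistic, expected frequencies) with e_i = n * p_i."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


observed_counts = [100, 85, 15]          # Company A, Company B, others
hypothesised_shares = [0.45, 0.40, 0.15]
chi2_value, expected_counts = chi_square_statistic(observed_counts, hypothesised_shares)
CHI2_CRITICAL = 5.99147                  # chi-square(0.05, 2) from the table
reject_h0 = chi2_value > CHI2_CRITICAL
print([round(e, 1) for e in expected_counts], round(chi2_value, 2), reject_h0)
```

Most of the statistic comes from the "others" cell (7.50 of 8.92), which shows that the change in market share was driven mainly by the smaller competitors losing ground.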
Example Two
To determine whether a single die is balanced, or fair, the die was rolled 600 times. The observed
frequencies with which each of the six sides of the die turned up are recorded in the following
table:
Face                 1     2     3     4     5     6
Observed frequency   114   92    84    101   107   102
Is there sufficient evidence to conclude, at the 5% level of significance, that the die is not fair?
Solution
1. Stating the null and alternate hypotheses
$H_0: p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = \dfrac{1}{6}$
$H_A$: At least one of the $p_i$ is not equal to its specified value.
2. Level of significance: $\alpha = 0.05$
3. Test statistic: $\chi^2 = \sum_{i=1}^{k} \dfrac{(o_i - e_i)^2}{e_i}$
4. Decision rule: $\chi^2_{\alpha,\,k-1} = \chi^2_{0.05,\,5} = 11.0705$. Reject $H_0$ if $\chi^2 > 11.0705$.
5. Value of the test statistic:
Assuming that the null hypothesis is correct, we can calculate the expected frequency of each
face using the formula $e_i = np_i = 600 \times \frac{1}{6} = 100$.
Face    Observed frequency (o_i)   Expected frequency (e_i)   (o_i − e_i)²/e_i
1       114                        100                        1.96
2        92                        100                        0.64
3        84                        100                        2.56
4       101                        100                        0.01
5       107                        100                        0.49
6       102                        100                        0.04
Total   600                        600                        5.70
Therefore $\chi^2 = \sum_{i=1}^{k} \dfrac{(o_i - e_i)^2}{e_i} = 5.7 < 11.0705$
6. Conclusion:
Do not reject H0. There is not sufficient evidence at the 5% level of significance to allow us to
conclude that the die is not fair.
Rule of Five
For the discrete distribution of the test statistic χ² to be adequately approximated by the
continuous chi-square distribution, the conventional rule is to require that the expected frequency
for each cell be at least 5. Where necessary, cells should be combined in order to satisfy this
condition. The choice of cells to be combined should be made in such a way that meaningful
categories result from the combination.

3.7.2 Chi-Square Test of a Contingency Table
A contingency table is a rectangular table in which items from a population are classified according
to two characteristics. The objective is to analyze the relationship between two qualitative
variables, i.e. to investigate whether a dependence relationship exists between the two variables or
whether the variables are statistically independent. The number of degrees of freedom for a
contingency table with r rows and c columns is $d.f. = (r - 1)(c - 1)$.
Example One
A sample of employees at a large chemical plant was asked to indicate a preference for one of
three pension plans. The results are given in the following table: -
              Pension Plan
Job Class     Plan A   Plan B   Plan C
Supervisor    10       13       29
Clerical      19       80       19
Laborer       81       57       22
At the 1% significance level, determine whether there is a relationship between the pension
plan selected and the job classification of employees.
Solution
              Pension Plan
Job Class     Plan A   Plan B   Plan C   Total
Supervisor    10       13       29       52
Clerical      19       80       19       118
Laborer       81       57       22       160
Total         110      150      70       330
We need to conduct a chi-square test of the contingency table to determine whether the classifications
are statistically independent.
$H_0$: The two classifications are independent
$H_A$: The two classifications are dependent

Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
Rejection region: $\chi^2 > \chi^2_{\alpha, (r-1)(c-1)} = \chi^2_{0.01, 4} = 13.2767$
The value of the test statistic
To compute the expected value for each cell, multiply the row total by the column total and divide
by the total number of employees sampled.
Cell i   Observed frequency (o)   Expected frequency (e)   $(o - e)^2 / e$
1        10                       17.33                    3.1003
2        13                       23.64                    4.7889
3        29                       11.03                    29.2766
4        19                       39.33                    10.5087
5        80                       53.64                    12.9539
6        19                       25.03                    1.4527
7        81                       53.33                    14.3564
8        57                       72.73                    3.4021
9        22                       33.94                    4.2005
Total                                                      84.0401
Value of the test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 84.0401$
Conclusion: Reject Ho.
There is enough evidence at the 1% significance level to conclude that the two classifications are
dependent.
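The expected frequencies and the test statistic for a contingency table can be computed directly from the observed counts. The sketch below is a hedged illustration in plain Python; the function name `chi_square_independence` is an assumption, not something the notes define. It uses unrounded expected frequencies, so the result differs from the hand calculation only in the later decimal places.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count = row total * column total / grand total.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Rows: Supervisor, Clerical, Laborer; columns: the three pension plans.
pension = [
    [10, 13, 29],
    [19, 80, 19],
    [81, 57, 22],
]
statistic, dof = chi_square_independence(pension)
print(f"chi-square = {statistic:.2f} with {dof} degrees of freedom")
```

Comparing the printed statistic with the table value 13.2767 (1% level, 4 degrees of freedom) reproduces the decision to reject the null hypothesis.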
Example Two
The Coca Cola Company sells four brands of sodas in East Africa. To help determine if the same
marketing approach used in Kenya can be used in Uganda and Tanzania, one of the firm’s
marketing analysts wants to ascertain if there is an association between the brand of Soda preferred
and the nationality of the consumer. She first classifies the population according to the brand of
soda preferred i.e. Fanta, Sprite, Coke and Krest. Her second classification consists of the three
nationalities; Kenyan, Tanzanian and Ugandan. The marketing analyst then interviews a random
sample of 250 Soda drinkers from the three countries, classifies each according to the two criteria
and records the observed frequency of drinkers falling into each of the cells as shown in the table
below.
              Soda preference
Nationality   Coke   Krest   Sprite   Fanta   Total
Kenyan        72     8       12       23      115
Ugandan       26     10      16       33      85
Tanzanian     7      10      14       19      50
Total         105    28      42       75      250
Based on the above sample data, can we conclude at the 1% level of significance that there is a
relationship between the preference of the soda drinkers and their nationality?
Solution
We need to conduct a chi-square test of the contingency table to determine whether the classifications
are statistically independent.
$H_0$: The two classifications are independent
$H_A$: The two classifications are dependent
Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
Rejection region: $\chi^2 > \chi^2_{\alpha, (r-1)(c-1)} = \chi^2_{0.01, 6} = 16.8119$
The value of the test statistic
To compute the expected values for each cell, multiply the row total by the column total
and divide by the total number of respondents sampled.

Cell i   Observed frequency (o)   Expected frequency (e)   $(o - e)^2 / e$
1        72                       48.30                    11.63
2        26                       35.70                    2.64
3        7                        21.00                    9.33
4        8                        12.88                    1.85
5        10                       9.52                     0.02
6        10                       5.60                     3.46
7        12                       19.32                    2.77
8        16                       14.28                    0.21
9        14                       8.40                     3.73
10       23                       34.50                    3.83
11       33                       25.50                    2.21
12       19                       15.00                    1.07
Value of the test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 42.75$
Conclusion: Reject Ho.
Based on the sample data, we can conclude at the 1% significance level that there is a relationship
between preferences of soda drinkers and their nationality.
3.7.3 Chi-Square Test for Normality
The chi-square goodness of fit test for a normal distribution proceeds in essentially the same way
as the chi-square test for a multinomial population. The multinomial test dealt with a single
population of qualitative data, whereas a normal distribution involves quantitative data. Therefore,
we must begin by subdividing the range of the normal distribution into a set of intervals or
categories in order to obtain qualitative data.
Example One
A battery manufacturer wants to determine whether the lifetimes of his batteries are normally
distributed. Such information would be helpful in establishing the guarantee that should be offered.
The lifetimes of a sample of 200 batteries are measured and the resulting data are grouped into a
frequency distribution as shown in the table below. The mean and the standard deviation of the
sample lifetimes are 164 and 10 respectively.
Life Time in Hours   Number of Batteries
140 up to 150        15
150 up to 160        54
160 up to 170        78
170 up to 180        42
180 up to 190        11
Total                200
Is there evidence at the 5% level of significance that the lifetimes of his batteries are normally
distributed?
Solution
1. Hypotheses:
$H_0$: The data are normally distributed
$H_A$: The data are not normally distributed
2. Level of significance: $\alpha = 0.05$
3. Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$ with d.f. $= k - 3$ (two parameters, the mean and the standard deviation, are estimated from the sample)
4. Decision rule: $\chi^2_{\alpha, k-3} = \chi^2_{0.05, 2} = 5.9915$. If $\chi^2 > 5.9915$, reject $H_0$.
With $\bar{X} = 164$ and $s = 10$, the class boundaries are standardized as follows:
$Z = \frac{140 - 164}{10} = -2.4$, $Z = \frac{150 - 164}{10} = -1.4$, $Z = \frac{160 - 164}{10} = -0.4$
$Z = \frac{170 - 164}{10} = 0.6$, $Z = \frac{180 - 164}{10} = 1.6$, $Z = \frac{190 - 164}{10} = 2.6$

Lifetime        Probability   Observed frequency ($o_i$)   Expected frequency ($e_i$)   $(o_i - e_i)^2 / e_i$
Less than 150   0.0808        15                           16.16                        0.0833
150 up to 160   0.2638        54                           52.76                        0.0291
160 up to 170   0.3811        78                           76.22                        0.0416
170 up to 180   0.2195        42                           43.90                        0.0822
180 or more     0.0548        11                           10.96                        0.0001
Total                         200                          200                          0.2363
Therefore $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 0.2363 < 5.9915$
6. Conclusion:
Do not reject $H_0$. There is not sufficient evidence at the 5% level of significance to allow us
to conclude that the lifetimes of his batteries are not normally distributed.
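The class probabilities in the battery example come from the standard normal table. As a hedged sketch, the same steps can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `normal_expected_frequencies` is illustrative. Because the code uses unrounded probabilities, its statistic agrees with the hand calculation only to about two decimal places.

```python
from statistics import NormalDist


def normal_expected_frequencies(cut_points, mean, sd, n):
    """Expected counts for classes split at cut_points, with open-ended outer classes."""
    dist = NormalDist(mean, sd)
    cumulative = [0.0] + [dist.cdf(c) for c in cut_points] + [1.0]
    return [n * (upper - lower) for lower, upper in zip(cumulative, cumulative[1:])]


# Battery lifetimes: classes "less than 150", 150-160, 160-170, 170-180, "180 or more".
observed = [15, 54, 78, 42, 11]
expected = normal_expected_frequencies([150, 160, 170, 180], mean=164, sd=10, n=200)

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 3  # k - 3: mean and standard deviation estimated from the sample

print([round(e, 2) for e in expected])
print(f"chi-square = {statistic:.4f} with {dof} degrees of freedom")
```

Treating the first and last classes as open-ended makes the probabilities sum to one, which is why the expected frequencies total exactly 200.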
Example Two
The instructors for an introductory accounting course attempt to construct the final examination
so that the grades are normally distributed with a mean of 65.
Grade         Frequency
30 up to 40   4
40 up to 50   17
50 up to 60   29
60 up to 70   49
70 up to 80   33
80 up to 90   18
From the sample of grades appearing in the accompanying frequency distribution table, can
you conclude that they have achieved their objective? (Use $\alpha = 0.05$)

Solution
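1. Hypotheses:
$H_0$: The data are normally distributed
$H_A$: The data are not normally distributed
2. Level of significance: $\alpha = 0.05$
3. Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$ with d.f. $= k - 3$
4. Decision rule: $\chi^2_{\alpha, k-3} = \chi^2_{0.05, 3} = 7.81473$. If $\chi^2 > 7.81473$, reject $H_0$.
The mean and standard deviation are first estimated from the grouped data, using deviations from an assumed mean of 65 ($Dx = x - 65$) scaled by the class width ($Dx' = Dx / 10$):
Grade         x    f     xf     Dx    Dx'   $Dx'^2$   $fDx'$   $fDx'^2$
30 up to 40   35   4     140    -30   -3    9         -12      36
40 up to 50   45   17    765    -20   -2    4         -34      68
50 up to 60   55   29    1595   -10   -1    1         -29      29
60 up to 70   65   49    3185   0     0     0         0        0
70 up to 80   75   33    2475   10    1     1         33       33
80 up to 90   85   18    1530   20    2     4         36       72
Total              150   9690                         -6       238
$\bar{X} = \frac{\sum xf}{n} = \frac{9690}{150} = 64.6$
$s = 10 \times \sqrt{\frac{238}{150} - \left(\frac{-6}{150}\right)^2} = 12.6$
$\bar{X} = 64.6,\ s = 12.6$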

$Z = \frac{40 - 64.6}{12.6} = -1.95$, $Z = \frac{50 - 64.6}{12.6} = -1.16$, $Z = \frac{60 - 64.6}{12.6} = -0.37$
$Z = \frac{70 - 64.6}{12.6} = 0.43$, $Z = \frac{80 - 64.6}{12.6} = 1.22$
Grade          Probability   Observed frequency ($o_i$)   Expected frequency ($e_i$)   $(o_i - e_i)^2 / e_i$
Less than 40   0.0256        4                            3.84                         0.0067
40 up to 50    0.0974        17                           14.61                        0.3910
50 up to 60    0.2290        29                           34.35                        0.8333
60 up to 70    0.3144        49                           47.16                        0.0718
70 up to 80    0.2224        33                           33.36                        0.0039
80 or more     0.1112        18                           16.68                        0.1045
Total                        150                          150                          1.4112
Therefore $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 1.4112 < 7.81473$
6. Conclusion:
Do not reject $H_0$. There is not sufficient evidence at the 5% level of significance to conclude that
the grades are not normally distributed, so the data are consistent with the instructors having
achieved their objective.

LESSON THREE: REGRESSION ANALYSIS
3.0 Introduction
Regression involves developing a mathematical equation that analyses the relationship between
the variable to be forecast (dependent variable) and the variables that the statistician believes
are related to the forecast variable (independent variable). Regression is the estimation of
unknown values or the prediction of one variable from known values of other variables. Simple
linear regression involves a relationship between two variables only. Multiple regression
analyses or considers the relationship between three or more variables.
3.1 Lesson Objectives
By the end of the lesson, the students should be able to:
i. Formulate a simple regression model
ii. Calculate the coefficient of correlation and determination and interpret them
iii. Test hypotheses about the regression coefficients
3.2 Simple Regression
The first step in establishing the relationship between X and Y is to obtain observations on the
two variables and analyze the data using a scatter diagram to indicate whether a positive or
negative relationship exists between X and Y. The relationship can be approximated by a
straight line. Algebraically, the relationship is $Y_t = b_0 + b_1 X_t$.
The above function is deterministic since it gives an exact relationship between X and Y. When
the line is plotted, not all the points will fall on the line because of the following reasons:
• Omission of other explanatory variables from the function
• Random behavior of human beings
• Imperfect specification of the functional form of the model
• Errors of aggregation
• Errors of measurement
To account for the deviations of some points from the straight line, the error term is introduced.
The introduction of the error term makes the function stochastic: $Y_t = b_0 + b_1 X_t + e_t$. To
estimate the values of the coefficients $b_0$ and $b_1$, we need observations on Y, X and the error
term. However, the error term is not observable and therefore we make assumptions about the
error term.

3.3 Assumptions of the Error Term
The following are the assumptions of the error term:
• The error term is a real random variable which has a mean of zero and constant variance
(assumption of homoscedasticity)
• The error term is normally distributed
• The error terms corresponding to different values of X for different periods are not correlated
(assumption of no autocorrelation)
• There is no relationship between the explanatory variables and the error term
• The explanatory variables are measured without error. The error term absorbs the influence of
omitted variables and errors of measurement in the dependent variable.
All the above assumptions are called stochastic assumptions.
Other Assumptions
• The explanatory variables are not perfectly linearly related or correlated (no
multicollinearity)
• The variables are correctly aggregated
• The relation being estimated is identified
• The relationship is correctly specified
The regression equation of Y on X
• It is used to predict the values of Y from the given values of X.
• It is expressed as follows: $Y = b_0 + b_1 X$
• To determine the values of $b_0$ and $b_1$, the following two normal equations are solved
simultaneously:
$\sum Y = n b_0 + b_1 \sum X$
$\sum XY = b_0 \sum X + b_1 \sum X^2$
• Alternatively, the values of $b_0$ and $b_1$ can be obtained using the following formulas:
$b_0 = \bar{Y} - b_1 \bar{X}$
$b_1 = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - \left(\sum X\right)^2}$

3.4 Correlation
Definition: It is the existence of some definite relationship between two or more variables.
Correlation analysis is a statistical tool used to describe the degree to which one variable is
linearly related to another variable.
Types of Correlation
Correlation may be classified in the following ways:-
(a) Positive and negative correlation.
Correlation is said to be positive if two series move in the same direction, otherwise it is
negative (opposite Direction).
(b) Linear and Non-Linear correlation
Correlation is linear if the amount of change in one variable tends to bear a constant ratio to
the amount of change in the other variable otherwise it is non-linear.
(c) Simple, partial and multiple correlation
Simple correlation is where two variables are studied while partial or multiple involves three
or more variables.
3.5 Methods of Calculating Simple Correlation
• Scatter diagram
• Karl Pearson’s coefficient of correlation
• Spearman’s rank correlation coefficient
• Method of least squares
Karl Pearson’s coefficient of correlation (Product moment coefficient of correlation)
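The coefficient of correlation (r) is a measure of the strength of the linear relationship between two
variables.
$r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left[n \sum X^2 - \left(\sum X\right)^2\right]\left[n \sum Y^2 - \left(\sum Y\right)^2\right]}}$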
Interpretation of the coefficient of correlation
1. When r = +1, there is a perfect positive correlation between the variables
2. When r = -1, there is a perfect negative correlation between the variables

3. When r = 0, there is no correlation between the variables
4. The closer r is to +1 or to –1, the stronger the relationship between the variables and the
closer r is to 0, the weaker the relationship.
5. The following table lists the interpretations for various (absolute) values of the correlation coefficient:
Value        Comment
0.8 to 1.0   Very strong
0.6 to 0.8   Strong
0.4 to 0.6   Moderate
0.2 to 0.4   Weak
0.0 to 0.2   Very weak
Method of least squares
$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \times SS_{yy}}}$
Coefficient of determination ($r^2$)
It is the square of the correlation coefficient. It shows the proportion of the total variation in
the dependent variable Y that is explained or accounted for by the variation in the independent
variable X. For example, if $r = 0.9$ then $r^2 = 0.81$, which means that 81% of the variation in the
dependent variable is explained by the independent variable.
Example One
A random sample of eight auto drivers insured with a company and having similar auto
insurance policies was selected. The following table lists their driving experience (in years)
and the monthly auto insurance premium (in Sh.000) paid by them.
Driving experience (years)                   5    2    12   9    15   6    25   16
Monthly auto insurance premium (in Sh.000)   64   87   50   71   44   56   42   60
i. Find the least squares regression line by identifying the appropriate dependent and
independent variable
ii. Interpret the meaning of the constants calculated in part (i).
iii. Compute the coefficient of correlation and coefficient of determination and interpret
them.

Solution:
i. Here x is the driving experience (independent variable) and y is the monthly premium (dependent variable).
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, where $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
$\sum x = 90$, $\sum x^2 = 1396$, $\sum xy = 4739$, $\sum y = 474$, $\sum y^2 = 29642$
$SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 1396 - \frac{90^2}{8} = 383.5$
$SS_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 4739 - \frac{90 \times 474}{8} = -593.5$
$SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 29642 - \frac{474^2}{8} = 1557.5$
$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-593.5}{383.5} = -1.55$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 59.25 - (-1.55 \times 11.25) = 76.69$
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 76.69 - 1.55x$
ii. $\hat{\beta}_1 = -1.55$: the monthly premium is expected to fall by about 1.55 (Sh.000) for each
additional year of driving experience.
$\hat{\beta}_0 = 76.69$: the monthly premium (in Sh.000) that would be paid by a driver without
any years of experience.
iii. $r = \frac{SS_{xy}}{\sqrt{SS_{xx} \times SS_{yy}}} = \frac{-593.5}{\sqrt{383.5 \times 1557.5}} = -0.77$
There is a strong negative relationship between the years of experience and the monthly auto
insurance premiums.
$r^2 = (-0.77)^2 = 0.5929 = 59.29\%$
59.29% of the variation in the premium paid is explained by the driving experience.
Example Two
A company is using a system of payment by results. The union claims that this seriously
discriminates against the workers. There is a fairly steep learning curve which workers follow,
with the apparent outcome that more experienced workers can perform the task in about half
of the time taken by a new employee. You have been asked to find out if there is any basis
for this claim. To do this, you have observed ten workers on the shop floor, timing how long it
takes them to produce an item. It was then possible for you to match these times with the length
of each worker’s experience. The results obtained are shown below:
Months’ experience   2    5    3    8    5    9    12   16   1    6
Time taken           27   26   30   20   22   20   16   15   30   19
Required:
(a) Find the regression line of time taken on month’s experience
(b) Compute the coefficient of correlation and coefficient of determination and interpret them.
Solution:
$Y = b_0 + b_1 X$, where $b_1 = \frac{SS_{xy}}{SS_{xx}}$ and $b_0 = \bar{Y} - b_1 \bar{X}$
$\sum X = 67$, $\sum X^2 = 645$, $\sum XY = 1300$, $\sum Y = 225$, $\sum Y^2 = 5331$
$SS_{xx} = \sum X^2 - \frac{\left(\sum X\right)^2}{n} = 645 - \frac{67^2}{10} = 196.1$
$SS_{xy} = \sum XY - \frac{\sum X \sum Y}{n} = 1300 - \frac{67 \times 225}{10} = -207.5$
$SS_{yy} = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n} = 5331 - \frac{225^2}{10} = 268.5$
$b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-207.5}{196.1} = -1.0581$
$b_0 = \bar{Y} - b_1 \bar{X} = 22.5 - (-1.0581 \times 6.7) = 29.5893$
$Y = 29.5893 - 1.0581X$
Interpretation:
$b_1 = -1.0581$: the time taken is expected to fall by about 1.06 time units for every
additional month of experience.
$b_0 = 29.5893$: the time taken by an employee without any experience.
$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \times SS_{yy}}} = \frac{-207.5}{\sqrt{196.1 \times 268.5}} = -0.9043$
There is a very strong negative correlation between the months of experience and the time
taken.
$r^2 = (-0.9043)^2 = 0.8178 = 81.78\%$
81.78% of the variation in the time taken is explained by the months of experience.
Example Three
Students in the BMS 302 class were polled by a researcher attempting to establish a relationship
between hours of study in the week immediately preceding the end of semester exam and the
marks received on the exam. The researcher gathered the data listed in the accompanying table.
Hours of study   Exam score
25               93
12               57
18               55
26               90
19               82
20               95
23               95
15               80
22               85
8                61
i. Find the least squares regression line by identifying the appropriate dependent and
independent variable.
ii. Interpret the meaning of the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ calculated in part (i).
iii. Compute the coefficient of correlation and coefficient of determination and interpret them.
Solution
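i. Here x is the number of hours of study (independent variable) and y is the exam score (dependent variable).
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, where $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
$\sum x = 188$, $\sum x^2 = 3832$, $\sum xy = 15540$, $\sum y = 793$, $\sum y^2 = 65143$
$SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 3832 - \frac{188^2}{10} = 297.6$
$SS_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 15540 - \frac{188 \times 793}{10} = 631.6$
$SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 65143 - \frac{793^2}{10} = 2258.1$
$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{631.6}{297.6} = 2.122$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 79.3 - (2.122 \times 18.8) = 39.4064$
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 39.4064 + 2.122x$
ii. $\hat{\beta}_1 = 2.122$: the exam score is expected to increase by about 2.12 marks for each
additional hour of study.
$\hat{\beta}_0 = 39.41$: the exam score that would be attained by a student who does
not study at all in the week before the exam.
iii. $r = \frac{SS_{xy}}{\sqrt{SS_{xx} \times SS_{yy}}} = \frac{631.6}{\sqrt{297.6 \times 2258.1}} = 0.77$
There is a strong positive relationship between the exam score and the number of hours studied.
$r^2 = 0.77^2 = 0.5929 = 59.29\%$
59.29% of the variation in the exam score is explained by the number of hours studied.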
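The sums of squares, the regression coefficients and the correlation coefficient in Example Three can be reproduced in a few lines. This is a minimal Python sketch under the formulas of this lesson; the function name `simple_regression` is illustrative and not part of the notes.

```python
from math import sqrt


def simple_regression(x, y):
    """Least squares intercept b0, slope b1 and correlation r from SS_xx, SS_xy, SS_yy."""
    n = len(x)
    ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    b1 = ss_xy / ss_xx
    b0 = sum(y) / n - b1 * sum(x) / n
    r = ss_xy / sqrt(ss_xx * ss_yy)
    return b0, b1, r


# Example Three data: hours of study and exam score.
hours = [25, 12, 18, 26, 19, 20, 23, 15, 22, 8]
score = [93, 57, 55, 90, 82, 95, 95, 80, 85, 61]

b0, b1, r = simple_regression(hours, score)
print(f"y-hat = {b0:.2f} + {b1:.3f}x, r = {r:.2f}, r^2 = {r * r:.4f}")
```

The code keeps full precision throughout, so $r^2$ comes out slightly different from the value obtained by squaring the rounded $r = 0.77$.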
3.6 Spearman’s Rank Correlation
• It is the correlation between the ranks assigned to individuals by two different people.
• It is a non-parametric technique for measuring the strength of the relationship between paired
observations of two variables when the data are in ranked form.
It is denoted by R or $\rho$:
$R = 1 - \frac{6 \sum d_i^2}{N(N^2 - 1)} = 1 - \frac{6 \sum d_i^2}{N^3 - N}$
In rank correlation, there are two types of problems:-
i. Where actual ranks are given
ii. Where actual ranks are not given

Where actual ranks are given
Steps:
• Take the differences of the two ranks, i.e. $(R_1 - R_2)$, and denote these differences by d.
• Square these differences and obtain the total $\sum d^2$
• Use the formula $R = 1 - \frac{6 \sum d^2}{N^3 - N}$
Example
The ranks given by two judges to 10 individuals are given below.
Individual 1 2 3 4 5 6 7 8 9 10
Judge 1(X) 1 2 7 9 8 6 4 3 10 5
Judge 2 (Y) 7 5 8 10 9 4 1 6 3 2
Calculate
(a) The spearman’s rank correlation.
(b) The Coefficient of correlation
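The notes leave this example unsolved. As a hedged illustration of the three steps listed for the case where ranks are given, the sketch below applies the formula to the two judges' ranks; the function name `spearman_from_ranks` is an assumption made for this example.

```python
def spearman_from_ranks(ranks_1, ranks_2):
    """Spearman's R = 1 - 6 * sum(d^2) / (N^3 - N) for two lists of untied ranks."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Ranks given by the two judges to the ten individuals.
judge_1 = [1, 2, 7, 9, 8, 6, 4, 3, 10, 5]
judge_2 = [7, 5, 8, 10, 9, 4, 1, 6, 3, 2]

R = spearman_from_ranks(judge_1, judge_2)
print(f"R = {R:.4f}")
```

Here $\sum d^2 = 128$ and $N = 10$, so the code evaluates $1 - \frac{6 \times 128}{990}$, which indicates only a weak positive agreement between the judges.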
Where ranks are not given
Ranks can be assigned by taking either the highest value as 1 or the lowest value as 1. The same
method should be followed for all the variables.
Example
Calculate the rank correlation coefficient for the following data of marks given to 1st year B.Com
students:
CMS 100   45   47   60   38   50
CAC 100   60   61   58   48   46
Equal Ranks or Tie in Ranks
• Where equal ranks are assigned to some entries, an adjustment in the formula for
calculating the rank coefficient of correlation is made.
• The adjustment consists of adding $\frac{1}{12}(m^3 - m)$ to the value of $\sum d^2$, where m stands
for the number of items whose ranks are common. The adjustment is added once for each group of tied items.

Example
An examination of eight applicants for a clerical post was taken by a firm. From the marks
obtained by the applicants in the accounting and statistics papers, compute the Rank coefficient
of correlation.
Applicant A B C D E F G H
Marks in accounting 15 20 28 12 40 60 20 80
Marks in statistics 40 30 50 30 20 10 30 60
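This example is also left unsolved in the notes. The sketch below shows one way to apply the tie adjustment, assuming that tied values share the average of the ranks they occupy and that rank 1 goes to the lowest mark; the helper names are illustrative.

```python
from collections import Counter


def average_ranks(values):
    """Rank values with 1 = smallest; tied values share the mean of their positions."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m tied values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def spearman_with_ties(x, y):
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    sum_d2 += tie_correction(x) + tie_correction(y)
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Marks of the eight applicants A..H.
accounting = [15, 20, 28, 12, 40, 60, 20, 80]
statistics_marks = [40, 30, 50, 30, 20, 10, 30, 60]

R = spearman_with_ties(accounting, statistics_marks)
print(f"R = {R:.4f}")
```

In this data set the mark 20 occurs twice in accounting (m = 2) and the mark 30 occurs three times in statistics (m = 3), so two correction terms are added to $\sum d^2$.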
3.7 Assessing the Regression Model
3.7.1 Estimating the variance of the error variable
The sample statistic $S_e^2 = \frac{SSE}{n - 2}$ is an unbiased estimator of $\sigma_e^2$. The square root of $S_e^2$ is called
the standard error of estimate, i.e. $S_e = \sqrt{\frac{SSE}{n - 2}}$, where
$SSE = SS_{yy} - \frac{SS_{xy}^2}{SS_{xx}}$
Interpretation of the Standard Error of Estimate
• The smallest value that the standard error of estimate can assume is zero, which occurs
when SSE = 0, i.e. when all the points fall on the regression line.
• If $S_e$ is close to zero, the fit is excellent and the linear model is likely to be a useful and
effective analytical and forecasting tool.
• If $S_e$ is large, the model is a poor one and the statistician should either improve it or
discard it.
• In general, the standard error of estimate cannot be used as an absolute measure of the
model’s utility. Nonetheless, it is useful in comparing models.
3.7.2 Drawing inferences about $\beta_1$
This involves determining whether a linear relationship actually exists between x and y. The
null hypothesis will always state that there is no linear relationship between the variables, i.e.
$H_0: \beta_1 = 0$. Any of the following three alternative hypotheses can be tested:
i. $H_A: \beta_1 \neq 0$ tests whether some linear relationship exists between x and y
ii. $H_A: \beta_1 > 0$ tests whether a positive linear relationship exists between x and y
iii. $H_A: \beta_1 < 0$ tests whether a negative linear relationship exists between x and y
The test statistic is $t = \frac{b_1 - \beta_1}{S_{b_1}}$ where $S_{b_1} = \frac{S_e}{\sqrt{SS_{xx}}}$
Assuming that the error variable is normally distributed, the test statistic follows a Student's t
distribution with $n - 2$ degrees of freedom.
The confidence interval estimator of $\beta_1$ is $b_1 \pm t_{\alpha/2,\, n-2}\, S_{b_1}$
3.7.3 Measuring the strength of the linear relationship
$\beta_1$ is useful in measuring the strength of the linear relationship, particularly when we want to
compare different models to see which one fits the data better.
(a) Coefficient of Correlation
The coefficient of correlation, denoted by $\rho$ (rho), measures the similarity of the changes in the
values of x and y. Its range is $-1 \le \rho \le 1$. Since $\rho$ is a population parameter, its value is
estimated from the data. The sample coefficient of correlation r is defined as follows:
$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \times SS_{yy}}}$
(b) Testing the Coefficient of Correlation
If $\rho = 0$, the values of x and y are uncorrelated and the linear model is not appropriate. We
can determine if x and y are correlated by testing the following hypotheses:
$H_0: \rho = 0$
$H_A: \rho \neq 0$
The test statistic is $t = \frac{r - \rho}{s_r}$ where $s_r = \sqrt{\frac{1 - r^2}{n - 2}}$
The test statistic is Student t distributed with n-2 degrees of freedom if the error variable is
normally distributed.
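The test statistic for the coefficient of correlation is a one-line calculation. The following Python sketch is illustrative only; the function name `correlation_t_statistic` is an assumption, and the inputs r = 0.805 and n = 15 are the values from the house-price example in this lesson.

```python
from math import sqrt


def correlation_t_statistic(r, n, rho=0.0):
    """t = (r - rho) / s_r with s_r = sqrt((1 - r^2) / (n - 2))."""
    s_r = sqrt((1 - r ** 2) / (n - 2))
    return (r - rho) / s_r


# Sample correlation and sample size taken from the house-price example.
t = correlation_t_statistic(r=0.805, n=15)
print(f"t = {t:.2f} with {15 - 2} degrees of freedom")
```

In simple linear regression this statistic equals the t statistic for testing $\beta_1 = 0$, apart from rounding, so the two tests lead to the same conclusion.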
(c) Coefficient of Determination ($r^2$)
This measures the proportion of variability in the dependent variable that is explained by
variability of the independent variable.
$r^2 = \frac{SS_{xy}^2}{SS_{xx} \times SS_{yy}}$
3.7.4 Predicting the particular value of y for a given x (The Prediction Interval)
The prediction interval is given by:
$\hat{y} \pm t_{\alpha/2,\, n-2}\, S_e \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{SS_{xx}}}$
where $x_g$ is the given value of x and $\hat{y} = b_0 + b_1 x_g$
3.7.5 Estimating the expected value of y for a given x (The Confidence Interval)
The confidence interval is given by:
$\hat{y} \pm t_{\alpha/2,\, n-2}\, S_e \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{SS_{xx}}}$
where $x_g$ is the given value of x and $\hat{y} = b_0 + b_1 x_g$
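The two intervals differ only by the extra 1 under the square root. The sketch below makes that explicit with a single hypothetical function, `interval_for_y`. The summary values are the rounded figures from the house-price example in this lesson ($b_0 = 18.34$, $b_1 = 3.88$, $S_e = 13$, $\bar{x} = 18.17$, $SS_{xx} = 268.19$), and the critical value $t_{0.025, 13} = 2.160$ is taken from a Student t table rather than computed.

```python
from math import sqrt


def interval_for_y(x_g, b0, b1, s_e, n, x_bar, ss_xx, t_crit, predict=True):
    """Prediction interval (predict=True) or confidence interval for the mean of y."""
    y_hat = b0 + b1 * x_g
    extra = 1.0 if predict else 0.0  # the additional 1 appears only for a single new value
    half_width = t_crit * s_e * sqrt(extra + 1 / n + (x_g - x_bar) ** 2 / ss_xx)
    return y_hat - half_width, y_hat + half_width


# Rounded summary values from the house-price example; x_g = 20 means 2,000 square feet.
summary = dict(b0=18.34, b1=3.88, s_e=13, n=15, x_bar=18.17, ss_xx=268.19, t_crit=2.160)

prediction = interval_for_y(20, predict=True, **summary)
confidence = interval_for_y(20, predict=False, **summary)

print("95% prediction interval:", tuple(round(v, 2) for v in prediction))
print("95% confidence interval:", tuple(round(v, 2) for v in confidence))
```

The prediction interval is always wider because it covers the variation of one individual value around the line as well as the uncertainty in the estimated mean.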

Example One
A real estate agent would like to predict the selling price of single family homes. After careful
consideration, she concludes that the variable likely to be most closely related to the selling
price is the size of the house. As an experiment, she takes a random sample of 15 recently sold
houses and records the selling price in Sh.000’s and size in 100 ft² of each. The data is shown
in the table below:
House size (100 ft²)     20.0    14.8    20.5   12.5   18.0   14.3   27.5    16.5   24.3    20.2
Selling price (Sh.000)   89.5    79.9    83.1   56.9   66.6   82.5   126.3   79.3   119.9   87.6
House size (100 ft²)     22.0    19.0    12.3   14.0   16.7
Selling price (Sh.000)   112.6   120.8   78.5   74.3   74.8
Required:
(a) Find the sample regression line for the data
(b) Estimate the variance of the error variable and the standard error of estimate.
(c) Can we conclude at the 1% significance level that the size of a house is linearly related
to its selling price?
(d) Estimate the 99% confidence interval estimate of $\beta_1$
(e) Compute the coefficient of correlation and interpret its value
(f) Can we conclude at the 1% significance level that the two variables are correlated?
(g) Compute the coefficient of determination and interpret its value
(h) Predict with 95% confidence the selling price of a house that occupies 2,000 ft².
(i) In a certain part of the city, a developer built several thousand houses whose floor plans
and exteriors differ but whose sizes are all 2,000 ft². To date, they have been rented but
the builder now wants to sell them and wants to know approximately how much money
in total he can expect from the sale of the houses. Help him by estimating a 95%
confidence interval estimate of the mean selling price of the houses.
Solution
(a) Find the least squares regression line
$\hat{y} = b_0 + b_1 x$, where $b_1 = \frac{SS_{xy}}{SS_{xx}}$ and $b_0 = \bar{y} - b_1 \bar{x}$
$\sum X = 272.6$, $\sum Y = 1332.6$, $\sum XY = 25257.97$, $\sum X^2 = 5222.24$, $\sum Y^2 = 124618.42$
$SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 5222.24 - \frac{272.6^2}{15} = 268.189$
$SS_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 25257.97 - \frac{272.6 \times 1332.6}{15} = 1040.186$
$SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 124618.42 - \frac{1332.6^2}{15} = 6230.24$
$b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{1040.186}{268.189} = 3.88$
$b_0 = \bar{y} - b_1 \bar{x} = 88.84 - (3.88 \times 18.17) = 18.34$
$\hat{y} = b_0 + b_1 x = 18.34 + 3.88x$
(b) Estimate the variance of the error variable and the standard error of estimate.
$SSE = SS_{yy} - \frac{SS_{xy}^2}{SS_{xx}} = 6230.24 - \frac{1040.18^2}{268.19} = 2195.88$
$S_e^2 = \frac{SSE}{n - 2} = \frac{2195.88}{15 - 2} \approx 169$
$S_e = \sqrt{169} = 13$
(c) Can we conclude at the 1% significance level that the size of a house is linearly related to
its selling price?
$H_0: \beta_1 = 0$
$H_A: \beta_1 \neq 0$
$\alpha = 0.01$
Test statistic: $t = \frac{b_1 - \beta_1}{S_{b_1}}$
Decision rule
$t_{\alpha/2,\, n-2} = t_{0.005,\, 13} = 3.012$. If $t > 3.012$ or $t < -3.012$, reject $H_0$.
Value of the test statistic
$S_{b_1} = \frac{S_e}{\sqrt{SS_{xx}}} = \frac{13}{\sqrt{268.19}} = 0.794$
$t = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{3.88 - 0}{0.794} = 4.89 > 3.012$
Conclusion: Reject $H_0$. Yes, the data provides sufficient evidence to conclude that the
house size is linearly related to its selling price.
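The chain of calculations from the sums of squares to the test statistic can be sketched in Python as follows. The function name `slope_inference` is illustrative; the inputs are the sums of squares from part (a), and the critical value 3.012 ($t_{0.005, 13}$) is taken from a Student t table rather than computed. The same function also returns the confidence interval for $\beta_1$ asked for in part (d).

```python
from math import sqrt


def slope_inference(ss_xx, ss_yy, ss_xy, n, t_crit):
    """Return b1, S_e, S_b1, the t statistic for H0: beta1 = 0, and the CI for beta1."""
    b1 = ss_xy / ss_xx
    sse = ss_yy - ss_xy ** 2 / ss_xx      # sum of squares for error
    s_e = sqrt(sse / (n - 2))             # standard error of estimate
    s_b1 = s_e / sqrt(ss_xx)              # standard error of the slope
    t = b1 / s_b1
    interval = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
    return b1, s_e, s_b1, t, interval


# Sums of squares from part (a) of the house-price example, n = 15 houses.
b1, s_e, s_b1, t, interval = slope_inference(
    ss_xx=268.189, ss_yy=6230.24, ss_xy=1040.186, n=15, t_crit=3.012
)
print(f"b1 = {b1:.2f}, S_e = {s_e:.2f}, S_b1 = {s_b1:.3f}, t = {t:.2f}")
print("99% interval for beta1:", tuple(round(v, 2) for v in interval))
```

Because the code does not round intermediate results, its output can differ from the hand calculation in the last reported digit.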
(d) Estimate the 99% confidence interval estimate of $\beta_1$
$b_1 \pm t_{\alpha/2,\, n-2}\, S_{b_1} = 3.88 \pm (3.012 \times 0.794) = 3.88 \pm 2.39$
$1.49 \le \beta_1 \le 6.27$
(e) Compute the coefficient of correlation and interpret its value
$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \times SS_{yy}}} = \frac{1040.18}{\sqrt{268.19 \times 6230.24}} = 0.805$
There is a very strong positive correlation between the size of the house and its selling
price.
