Statistics, Topic 5: Measures of Dispersion
Topic 5 Measures of Dispersion

LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. describe the concept of dispersion measures;
2. explain the concept of range as a dispersion measure;
3. categorise a distribution curve as symmetric or non-symmetric;
4. analyse the variance or standard deviation; and
5. sketch a box-and-whisker plot and interpret it.

INTRODUCTION
We have discussed earlier position quantities such as the mean and quartiles, which can be used to summarise distributions. However, these quantities are ordered numbers located on the horizontal axis of the distribution graph. As numbers along a line, they cannot describe quantitatively, for example, the shape of the distribution.

In this topic we will learn about quantities that measure the shape of a distribution. For example, the variance is usually used to measure the dispersion of observations around their mean. The range is used to describe the coverage of a given data set. The coefficient of skewness is used to measure the asymmetry of a distribution curve. The coefficient of kurtosis is used to measure the peakedness of a distribution curve.
Is it important to comprehend quantities like mean and quartiles to prepare you to study this topic? Give your reasons.

5.1 MEASURE OF DISPERSION
The mean of a distribution is termed a location parameter. The locations of any two different distributions can be compared by looking at their respective means. The range tells us about the coverage of a distribution, whilst the variance measures the spread of observations around their mean and hence the shape of the distribution curve. A small variance means the distribution curve is more pointed, and a larger variance indicates that the curve is flatter. Thus, the variance is sometimes called a shape parameter.

Figure 5.1(a) shows two distribution curves with different location centres but possibly the same dispersion measure (they may have the same range of coverage, but different variances). Curve 1 could represent the distribution of mathematics marks of male students from School A, and Curve 2 the distribution of mathematics marks of female students in the same examination from the same school.

Figure 5.1(a)

Figure 5.1(b) shows two distribution curves with the same location centre but possibly different dispersion measures (they may have different ranges of
coverage, as well as different variances). Curve 3 could represent the distribution of physics marks of male students from School A, and Curve 4 the distribution of physics marks of male students in the same examination but from School B.

Figure 5.1(b)

Figure 5.1(c) shows two distribution curves with different location centres but possibly the same dispersion measure (they may have the same range of coverage, and possibly the same variance). However, Curve 5 is skewed to the right and Curve 6 is skewed to the left. Curve 5 could represent the distribution of mathematics marks of students from School A, and Curve 6 the distribution of mathematics marks of students in the same trial examination but from a different school.

Figure 5.1(c)

Looking at Figures 5.1(a), (b) and (c), we see that besides the mean we need other quantities, such as the variance, the range and the coefficient of skewness, in order to describe or summarise a given distribution completely.
A dispersion measure quantifies the degree to which observations are scattered around their mean.

The following are examples of dispersion measures:

(a) Dispersion Measure Around the Mean of a Distribution
    It measures the deviation of observations from their mean. There are two types that can be considered:
    (i) Mean Deviation; and
    (ii) Standard Deviation.
    In this module we will consider only the Standard Deviation. You can refer to any statistics book for the Mean Deviation.

(b) Central Percentage Dispersion Measure
    This measure is related to the median. There are two types that can be considered:
    (i) Central Percentage Range 10 - 90; and
    (ii) Semi Inter Quartile Range.

(c) Distribution Coverage
    This quantity measures the range of the whole distribution, which shows the overall coverage of observations in the data set.

5.2 THE RANGE
The range is defined as the difference between the maximum value and the minimum value of the observations. Thus,

$$\text{Range} = \text{maximum value} - \text{minimum value}$$ (Formula 5.1)
As can be seen from the above formula, the range is easy to calculate. However, it depends only on the two extreme values to measure the overall data coverage. It explains nothing about the variation of observations between the two extreme values.

Example 5.1
Comment on the scatteredness of observations in each of the data sets:

Set 1:  12   6   7   3  15   5  10  18   5
Set 2:   9   3   8   8   9   7   8   9  18

Solution

Figure 5.2

(a) Arrange the observations in ascending order of value, and draw a scatter point plot for each set of data.

    Set 1:   3   5   5   6   7  10  12  15  18
    Set 2:   3   7   8   8   8   9   9   9  18

(b) Both data sets have the same range, which is 18 - 3 = 15.
(c) Observations in Set 1 are scattered almost evenly throughout the range. However, for Set 2, most of the observations are concentrated around the numbers 8 and 9.

(d) We can consider the numbers 3 and 18 as outliers to the main body of data Set 2.

(e) From this exercise we learn that it is not good enough to compare only the overall data coverage using the range. Other dispersion measures have to be considered too.

To conclude, Figure 5.2 shows that two distributions can have the same range but different shapes, a difference which the range cannot explain.

"Range does not explain the density of a data set". What do you understand by this statement? Discuss it with your coursemates.

ACTIVITY 5.1
1. The following are two sets of mathematics marks from an examination:
   Set A: 45, 48, 52, 54, 55, 55, 57, 59, 60, 65
   Set B: 25, 32, 40, 45, 53, 60, 61, 71, 78, 85
   (i) Calculate the mean and the range of both data sets.
   (ii) Comment on the scatteredness of observations in both sets.

2. Below are two sets of physics marks in an examination.
   Set C: 35  62  42  75  26  50  57   8  88  80  18  83
   Set D: 50  42  60  62  57  43  46  56  53  88   8  59
   (i) Calculate the mean and the range of both data sets.
   (ii) Comment on the scatteredness of observations in both sets.
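The comparison made in Example 5.1 can be sketched in a few lines of Python. This is only an illustration using the data of Sets 1 and 2; the helper name `data_range` and the choice of "values 7 to 9" as the region of concentration are assumptions made for this sketch, not part of the module.

```python
def data_range(values):
    """Range as in Formula 5.1: maximum value minus minimum value."""
    return max(values) - min(values)

# Data of Example 5.1
set_1 = [12, 6, 7, 3, 15, 5, 10, 18, 5]
set_2 = [9, 3, 8, 8, 9, 7, 8, 9, 18]

range_1 = data_range(set_1)
range_2 = data_range(set_2)

# Count the observations equal to 7, 8 or 9 (assumed "central" region),
# to show how differently the two sets fill the same range.
near_1 = sum(1 for x in set_1 if x in (7, 8, 9))
near_2 = sum(1 for x in set_2 if x in (7, 8, 9))

print("Range:", range_1, range_2)
print("Observations between 7 and 9:", near_1, near_2)
```

Both sets give the same range, yet the counts differ sharply, which is the point of step (e): the range alone says nothing about how the observations are distributed between the extremes.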
5.3 INTER QUARTILE RANGE
The inter quartile range is the difference between Q3 and Q1. It measures the range of the central 50% (the main body) of the data distribution.

A longer inter quartile range indicates that the observations in the central main body are more scattered. This measure can be used to complement the overall range of the data, since the latter fails to explain the variation of observations between the two extreme values.

Moreover, the inter quartile range does not depend on the two extreme values, so it can be used to measure the dispersion of the main body of the data. It is also recommended as a complement to the overall range when we compare two sets of data.

For example, let us go back to Activity 5.1 (No. 2), where we compare Set C and Set D. Although they have the same overall range (88 - 8 = 80), their observations vary differently: the inter quartile range for Set C is larger than for Set D, which indicates that the main body of Set D is less scattered.

The inter quartile range is given by:

$$\text{IQR} = |Q_3 - Q_1|$$ (Formula 5.2)

where the bars | | denote the absolute value. Some reference books prefer to use the Semi Inter Quartile Range, which is given by:

$$Q = \frac{\text{IQR}}{2} = \frac{|Q_3 - Q_1|}{2}$$ (Formula 5.2(a))
Example 5.2
By using the inter quartile range, compare the spread of data Set C and data Set D in Activity 5.1 (No. 2).

Solution
(a) The number of observations is n = 12 for both data sets. In ascending order:
    Set C: 8, 18, 26, 35, 42, 50, 57, 62, 75, 80, 83, 88
    Set D: 8, 42, 43, 46, 50, 53, 56, 57, 59, 60, 62, 88

(b) Q1 is at the position $\frac{1}{4}(n + 1) = 3.25$, that is, a quarter of the way from the 3rd to the 4th observation.
    Set C: Q1 = 26 + 0.25 (35 - 26) = 28.25;
    Set D: Q1 = 43 + 0.25 (46 - 43) = 43.75.

(c) Q3 is at the position $\frac{3}{4}(n + 1) = 9.75$.
    Set C: Q3 = 75 + 0.75 (80 - 75) = 78.75;
    Set D: Q3 = 59 + 0.75 (60 - 59) = 59.75.

(d) The inter quartile range for each data set is then
    IQR(C) = 78.75 - 28.25 = 50.5; and
    IQR(D) = 59.75 - 43.75 = 16.0.

(e) Since IQR(D) is smaller than IQR(C), data Set D is considered less spread than Set C.

Coefficient of Quartile Variation, VQ
The inter quartile range (IQR) and the semi inter quartile range (Q) are quantities that carry the units of the data. They therefore become meaningless when used to compare two data sets measured in different units, for instance data on age (years) and weight (kg). To avoid this problem, we can use the coefficient of quartile variation, which is dimensionless and is given by:
$$V_Q = \frac{Q}{TTQ} = \frac{|Q_3 - Q_1| / 2}{(Q_3 + Q_1) / 2} = \frac{|Q_3 - Q_1|}{Q_3 + Q_1}$$ (Formula 5.3)

In the above formula, TTQ is the mid-point between Q1 and Q3, and the bars | | denote the absolute value.

ACTIVITY 5.2
Given the following three sets of data:

Set E:
Age (Yrs)             5-14   15-24   25-34   35-44   45-54   55-64   65-74   Σf
Number of Residents   35     90      120     98      130     52      25      550

Set F:
Value of Products (RM) x 100   10-14.99   15-19.99   20-24.99   25-29.99   30-34.99   35-39.99   40-44.99   Σf
Number of Products             2          6          15         22         35         15         5          100

Set G:
Extra Charges (RM)   1.00-1.02   1.03-1.05   1.06-1.08   1.09-1.11   1.12-1.14   1.15-1.17   1.18-1.20   Σf
Number of Shops      3           15          28          30          25          14          5           120

1. Calculate Q1, Q2 and Q3 for each data set.
2. Obtain the inter quartile range (IQR) for each data set.
3. Then compare the spread of the above data sets.
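The hand calculation of Example 5.2 and the dimensionless coefficient of Formula 5.3 can be sketched as follows. This is a minimal illustration for ungrouped data, assuming the (n + 1)/4 position rule with linear interpolation used in Example 5.2 and at least three observations; the function names `quartile` and `quartile_summary` are hypothetical. It does not cover the grouped data of Activity 5.2.

```python
def quartile(sorted_values, k):
    """k-th quartile (k = 1, 2, 3) at position k(n + 1)/4, interpolating
    linearly between neighbouring observations (assumes n >= 3)."""
    n = len(sorted_values)
    position = k * (n + 1) / 4        # 1-based position in the ordered data
    lower = int(position)             # whole part of the position
    fraction = position - lower       # fractional part of the position
    lower_value = sorted_values[lower - 1]
    if fraction == 0:
        return lower_value
    upper_value = sorted_values[lower]
    return lower_value + fraction * (upper_value - lower_value)

def quartile_summary(values):
    """Return Q1, Q3, IQR, semi-IQR and the coefficient of quartile variation."""
    data = sorted(values)
    q1 = quartile(data, 1)
    q3 = quartile(data, 3)
    iqr = abs(q3 - q1)                # Formula 5.2
    semi_iqr = iqr / 2                # Formula 5.2(a)
    v_q = iqr / (q3 + q1)             # Formula 5.3
    return q1, q3, iqr, semi_iqr, v_q

# Physics marks of Sets C and D from Activity 5.1 (No. 2)
set_c = [35, 62, 42, 75, 26, 50, 57, 8, 88, 80, 18, 83]
set_d = [50, 42, 60, 62, 57, 43, 46, 56, 53, 88, 8, 59]

summary_c = quartile_summary(set_c)
summary_d = quartile_summary(set_d)
print("Set C:", summary_c)
print("Set D:", summary_d)
```

Other quartile conventions exist (statistical software often uses different interpolation rules), so results from a library routine may differ slightly from these hand-calculation values.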
5.4 VARIANCE AND STANDARD DEVIATION
Variance is defined as the average of the squared distances of each score (or observation) from the mean. It is used to measure the spread of data.

If we have two distributions, the one with the larger variance is more spread out and hence its frequency curve is flatter. The population variance is denoted by the symbol $\sigma^2$. The variance is never negative. The standard deviation is obtained by taking the square root of the variance. In this module, we will treat the given data as a population.

5.4.1 Standard Deviation and the Variance of Ungrouped Data
Suppose we have n numbers x1, x2, ..., xn, with their mean (given or calculated) $\mu$. Then the standard deviation is given by:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$ (Formula 5.4)

In words, it is the square root of the average of the squared distances of each score (or observation) from the mean. It is taken with the positive sign. The population variance ($\sigma^2$) is the square of the standard deviation.

Table 5.1: Steps of Obtaining Population Standard Deviation

Steps                                                     Symbols Used
(a) Calculate the population mean                         $\mu$
(b) Obtain the deviation of each score from the mean      $(x_i - \mu)$, i = 1, 2, ..., n
(c) Obtain the square of each deviation in step (b)       $(x_i - \mu)^2$, i = 1, 2, ..., n
(d) Obtain the average of the squared deviations          $\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$
(e) Obtain the square root of the average in step (d)     $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}$
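The five steps of Table 5.1 translate almost line by line into code. The sketch below is illustrative only: the function name `population_std` is hypothetical, and the list `scores` is made-up data chosen so that the arithmetic is easy to follow; it does not come from the module.

```python
import math

def population_std(values):
    """Population standard deviation following steps (a) to (e) of Table 5.1."""
    n = len(values)
    mean = sum(values) / n                      # step (a): population mean
    deviations = [x - mean for x in values]     # step (b): deviation of each score
    squared = [d ** 2 for d in deviations]      # step (c): squared deviations
    variance = sum(squared) / n                 # step (d): average squared deviation
    return math.sqrt(variance)                  # step (e): square root

# Hypothetical scores (not from the module): mean 5, squared deviations sum to 32
scores = [2, 4, 4, 4, 5, 5, 7, 9]
sigma = population_std(scores)
print("Standard deviation:", sigma)
print("Variance:", sigma ** 2)
```

Note that the divisor is n, in line with the module's decision to treat the data as a population; a sample standard deviation would divide by n - 1 instead.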
Example 5.3
Obtain the standard deviation of the data set 20, 30, 40, 50, 60.

Solution

Variable (x)       Mean Deviation $(x - \mu)$    Squared Mean Deviation $(x - \mu)^2$
20                 -20                           400
30                 -10                           100
40                 0                             0
50                 10                            100
60                 20                            400
Sum = 200                                        Sum = 1000
$\mu$ = 200 / 5 = 40                             Mean squared deviation = 1000 / 5 = 200

Now, by using Formula 5.4, the standard deviation of the population is $\sigma = \sqrt{200} \approx 14.14$.

ACTIVITY 5.3
1. Obtain the standard deviation of data Sets 1 and 2 in Example 5.1.
2. Obtain the standard deviation of data Sets A, B, C and D in Activity 5.1.

5.4.2 Alternative Formulas to Enhance Hand Calculations
(a) To avoid subtracting $\mu$ from each score, the equivalent Formula 5.5 below can be used to calculate the standard deviation.
$$\sigma = \sqrt{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2}$$ (Formula 5.5)

Example 5.4
Obtain the standard deviation of data Set 2 in Example 5.1.

Solution
For Set 2, $\sum x_i^2 = 817$, $\sum x_i = 79$ and n = 9. From Formula 5.5, the standard deviation is

$$\sigma = \sqrt{\frac{817}{9} - \left(\frac{79}{9}\right)^2} \approx 3.71$$

(You can compare this with the answer obtained in Activity 5.3.)

(b) Sometimes the population mean is not needed and we are only required to find the standard deviation. Formula 5.6 below, which does not involve $\mu$, can be used instead. In this formula, A is an 'assumed' mean, which is an arbitrary number. You can select A either from the given numbers in the set or as any convenient number you like.

$$\sigma = \sqrt{\frac{\sum (x_i - A)^2}{n} - \left(\frac{\sum (x_i - A)}{n}\right)^2}$$ (Formula 5.6)
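That Formulas 5.4, 5.5 and 5.6 agree can be checked numerically. The sketch below does so for data Set 2 of Example 5.1; the function names are hypothetical and the assumed mean A = 8 is an arbitrary choice made for this illustration.

```python
import math

def std_definition(xs):
    """Formula 5.4: square root of the mean squared deviation from the mean."""
    n = len(xs)
    mu = sum(xs) / n
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / n)

def std_shortcut(xs):
    """Formula 5.5: mean of the squares minus the square of the mean."""
    n = len(xs)
    return math.sqrt(sum(x ** 2 for x in xs) / n - (sum(xs) / n) ** 2)

def std_assumed_mean(xs, a):
    """Formula 5.6: same as Formula 5.5, applied to the shifted scores x - A."""
    n = len(xs)
    shifted = [x - a for x in xs]
    return math.sqrt(sum(d ** 2 for d in shifted) / n - (sum(shifted) / n) ** 2)

set_2 = [9, 3, 8, 8, 9, 7, 8, 9, 18]     # data Set 2 of Example 5.1

by_definition = std_definition(set_2)
by_shortcut = std_shortcut(set_2)
by_assumed = std_assumed_mean(set_2, 8)  # A = 8 is an arbitrary assumed mean

print(by_definition, by_shortcut, by_assumed)
```

The three results coincide up to floating-point rounding. Formula 5.6 works for any A because shifting every score by the same constant moves the mean but leaves the spread unchanged.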
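Example 5.5
Obtain the standard deviation of data Set 1 in Example 5.1.

Solution
Let us select the number 10 in the data set as the 'assumed' mean A. Then $\sum (x_i - A) = -9$ and $\sum (x_i - A)^2 = 217$. By using Formula 5.6, the standard deviation is

$$\sigma = \sqrt{\frac{217}{9} - \left(\frac{-9}{9}\right)^2} = 4.807 \approx 4.81$$

For comparison, suppose we choose an arbitrary number A = 5. Then $\sum (x_i - A) = 36$ and $\sum (x_i - A)^2 = 352$, and the standard deviation is given by

$$\sigma = \sqrt{\frac{352}{9} - \left(\frac{36}{9}\right)^2} = 4.807$$

We notice that the two values of the assumed mean A give the same value of the standard deviation.

Standard Deviation and Variance of Grouped Data
For grouped data, the standard deviation can be calculated through the following formula:

$$\sigma = \sqrt{\frac{\sum f_i x_i^2}{n} - \left(\frac{\sum f_i x_i}{n}\right)^2}$$ (Formula 5.7)

where $x_i$ is the class mid-point of the i-th class, whose frequency is $f_i$, and $n = \sum f_i$.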
Example 5.6
Obtain the standard deviation of the weekly book sales given in Table 2.6 presented in Topic 2.

Solution
We need to include new columns for the products f x and f x² as follows:

Class       Class Mid-point (x)   Frequency (f)   f x (f multiplies x)   f x² (f multiplies x²)
34 - 43     38.5                  2               77                     2964.5
44 - 53     48.5                  5               242.5                  11761.25
54 - 63     58.5                  12              702                    41067
64 - 73     68.5                  18              1233                   84460.5
74 - 83     78.5                  10              785                    61622.5
84 - 93     88.5                  2               177                    15664.5
94 - 103    98.5                  1               98.5                   9702.25
Sum                               50              3315                   227242.5

The standard deviation is:

$$\sigma = \sqrt{\frac{\sum f_i x_i^2}{n} - \left(\frac{\sum f_i x_i}{n}\right)^2} = \sqrt{\frac{227242.5}{50} - \left(\frac{3315}{50}\right)^2} = 12.21 \approx 12 \text{ books}$$

The variance is $\sigma^2 = 149.16 \approx 149$ (in squared units, books²).

Do you think obtaining the standard deviation for grouped data is easier than for ungrouped data? Give your reasons.
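The table of Example 5.6 can be reproduced with a short script. This is a sketch under the assumption that each class is represented by its mid-point, as Formula 5.7 requires; the variable names are illustrative, and the class limits and frequencies are those shown in the table above.

```python
import math

# (lower class limit, upper class limit, frequency) from the table in Example 5.6
classes = [
    (34, 43, 2),
    (44, 53, 5),
    (54, 63, 12),
    (64, 73, 18),
    (74, 83, 10),
    (84, 93, 2),
    (94, 103, 1),
]

n = 0            # total frequency, sum of f
sum_fx = 0.0     # sum of f * x
sum_fx2 = 0.0    # sum of f * x^2

for lower, upper, f in classes:
    x = (lower + upper) / 2      # class mid-point
    n += f
    sum_fx += f * x
    sum_fx2 += f * x ** 2

mean = sum_fx / n
variance = sum_fx2 / n - mean ** 2    # square of Formula 5.7
sigma = math.sqrt(variance)           # Formula 5.7

print("n =", n, " sum fx =", sum_fx, " sum fx^2 =", sum_fx2)
print("mean =", mean, " variance =", round(variance, 2), " sigma =", round(sigma, 2))
```

Because every observation in a class is replaced by the class mid-point, the result is an approximation to the standard deviation of the raw sales figures.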
Coefficient of Variation
When we want to compare the dispersion of two data sets with different units, such as data for age and weight, the variance is not appropriate simply because it carries a unit. The coefficient of variation V given below, which is dimensionless, is more appropriate.

$$V = \frac{\text{Standard Deviation}}{\text{Mean}}$$ (Formula 5.8)

The comparison is more meaningful because each standard deviation is expressed relative to its own mean.

ACTIVITY 5.4
Referring to data Sets E, F and G in Activity 5.2:
(a) calculate the standard deviation and coefficient of variation; and
(b) compare their data spread.

5.5 SKEWNESS
In a real situation we may have a distribution which is symmetric, such as in Figure 5.1, case (a), or negatively skewed, such as in Figure 5.1, case (b), or even positively skewed, such as in Figure 5.1, case (c). Sometimes we need to measure the degree of skewness. For that, we will use the coefficient of skewness given in the following section.

Coefficient of Skewness

Pearson's Coefficient of Skewness
For a skewed distribution, the mean tends to lie on the same side of the mode as the longer tail [see Figure 5.1, case (b) and case (c)]. Thus, a measure of the asymmetry is supplied by the difference (Mean - Mode). We have the following dimensionless coefficient of skewness:
Pearson's First Coefficient of Skewness

$$PCS(1) = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} = \frac{\bar{x} - \hat{x}}{s}$$ (Formula 5.9)

where $\bar{x}$ is the mean, $\hat{x}$ is the mode and $s$ is the standard deviation.

Pearson's Second Coefficient of Skewness
If we do not have the value of the mode, then by using Formula 4.4 in Topic 4 we have the following second measure of skewness:

$$PCS(2) = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}} = \frac{3(\bar{x} - \tilde{x})}{s}$$ (Formula 5.10)

where $\tilde{x}$ is the median.

ACTIVITY 5.5
Given the frequency tables of two distributions as follows:

Distribution A:
Weight (Kg)        20-29   30-39   40-49   50-59   60-69   70-79   80-89
No. of Students    4       10      20      30      25      10      1

Distribution B:
Marks              20-29   30-39   40-49   50-59   60-69   70-79   80-89
No. of Students    10      20      30      20      15      4       1

Make a comparison of the above distributions based on the following statistics:
(a) Obtain the mean, mode, median, Q1, Q3, standard deviation and the coefficient of variation.
(b) Obtain Pearson's coefficient of skewness and comment on the values obtained.
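Formulas 5.9 and 5.10 can be tried out on ungrouped data with Python's standard `statistics` module. The sketch below uses data Set 1 of Example 5.1 and makes two assumptions that are not stated in the module: the data have a single mode, and the standard deviation is the population one (`statistics.pstdev`), in keeping with the module's treatment of data as a population. The function name `pearson_skewness` is hypothetical, and the grouped data of Activity 5.5 are not handled here.

```python
import statistics

def pearson_skewness(values):
    """Return Pearson's first and second coefficients of skewness."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    mode = statistics.mode(values)      # assumes a single most frequent value
    sd = statistics.pstdev(values)      # population standard deviation (assumption)
    pcs1 = (mean - mode) / sd           # Formula 5.9
    pcs2 = 3 * (mean - median) / sd     # Formula 5.10
    return pcs1, pcs2

set_1 = [12, 6, 7, 3, 15, 5, 10, 18, 5]   # data Set 1 of Example 5.1
pcs1, pcs2 = pearson_skewness(set_1)

print("PCS(1) =", round(pcs1, 3))
print("PCS(2) =", round(pcs2, 3))
```

For this set the mean (9) exceeds the median (7), which exceeds the mode (5), so both coefficients are positive, indicating a longer tail to the right. The two coefficients need not be equal, since Formula 5.10 rests on an approximate relation between mean, median and mode.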
In this topic, you have studied various measures of dispersion which can be used to describe the shape of a frequency curve. As mentioned earlier, the overall range cannot explain the pattern of observations lying between the minimum and the maximum values. Thus we introduced the inter quartile range (IQR) to measure the dispersion of the data in the middle 50%, or main body. The variance, which is also called the shape parameter, is likewise used to measure dispersion. However, for comparing two sets of data which have different units, the coefficient of variation is used. This coefficient is preferred because it is dimensionless. Finally, Pearson's coefficient of skewness is given to measure the degree of skewness of a non-symmetric distribution.