# QT1 - 03 - Measures of Central Tendency

Class notes used in Quantitative Techniques - I course at Praxis Business School, Calcutta

### QT1 - 03 - Measures of Central Tendency

1. 1. Measures of Central Tendency and Dispersion
2. 2. Contents <ul><li>Summary Statistics </li></ul><ul><li>Measures of Central Tendency </li></ul><ul><ul><li>Mean </li></ul></ul><ul><ul><li>Median </li></ul></ul><ul><ul><li>Mode </li></ul></ul><ul><li>Measures of Dispersion </li></ul><ul><ul><li>Range </li></ul></ul><ul><ul><li>Quartiles </li></ul></ul><ul><ul><li>Standard Deviation </li></ul></ul>
3. 3. Frequency Distribution
4. 4. Relative Frequency Distribution <ul><li>Frequency of each value can be expressed as a fraction or percentage of the total number of observations </li></ul><ul><li>This could help us compare data from samples that are of different sizes </li></ul>
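
As a minimal sketch of the relative frequency idea, the snippet below converts raw counts into fractions so that two samples of different sizes can be compared. The marks and the helper name `relative_frequencies` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def relative_frequencies(values):
    """Return {value: fraction of observations equal to that value}."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in sorted(counts.items())}


# Hypothetical marks for two classes of different sizes.
class_a = [6, 7, 7, 8, 8, 8, 9, 10]
class_b = [7, 8, 8, 9]

rel_a = relative_frequencies(class_a)
rel_b = relative_frequencies(class_b)

# The raw count of 8s is higher in class A (3 vs 2),
# but the relative frequency is higher in class B.
print(rel_a[8], rel_b[8])
```
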
5. 5. Let us compare two sets of data: mid-term and end-term marks for the same students <ul><li>What can we say about the results? </li></ul><ul><ul><li>Did the class in general fare better? </li></ul></ul><ul><ul><li>Did most of the students get similar marks? </li></ul></ul>
6. 6. Summary Statistics <ul><li>Tables and graphs illustrate trends and patterns in the data </li></ul><ul><li>To take hard decisions, we need more exact measures </li></ul><ul><ul><li>Single numbers </li></ul></ul><ul><ul><li>Calculated mathematically </li></ul></ul><ul><li>4 Principal Measures </li></ul><ul><ul><li>Central Tendency </li></ul></ul><ul><ul><li>Dispersion </li></ul></ul><ul><ul><li>Skewness </li></ul></ul><ul><ul><li>Kurtosis </li></ul></ul><ul><li>Values which can be calculated mathematically </li></ul>
7. 7. Measure of Central Tendency <ul><li>4 situations (A to D) where the relative frequency distribution over a certain range is as shown in the figure </li></ul><ul><ul><li>In B, the data clusters towards the left and centres around 15 </li></ul></ul><ul><ul><li>In D, the data clusters towards the right and centres around 25 </li></ul></ul><ul><li>Is it really 15 and 25? </li></ul>
8. 8. Measure of Dispersion <ul><li>4 situations (A to D) where the relative frequency distribution over a certain range is as shown in the figure </li></ul><ul><ul><li>In A, the data centres around 15 and is closely dispersed </li></ul></ul><ul><ul><li>In B, the data also centres around 15 but is more widely dispersed </li></ul></ul><ul><li>What is close and what is wide? </li></ul>
9. 9. Skewness <ul><li>Skewness is a measure of the lack of symmetry in the data. Not only is the data concentrated in one part of the range, but even there it is asymmetrical </li></ul><ul><ul><li>Different from central tendency </li></ul></ul>
10. 10. Kurtosis <ul><li>This is a measure of the peakedness of the data </li></ul><ul><ul><li>Which curve is more peaked than the other? </li></ul></ul><ul><ul><li>Even though they might have the same central tendency and dispersion </li></ul></ul>
11. 11. Measures of Central Tendency <ul><li>Arithmetic Mean </li></ul><ul><ul><li>Mean of Grouped Data </li></ul></ul><ul><li>Weighted Mean </li></ul><ul><li>Geometric Mean </li></ul><ul><li>Median </li></ul><ul><ul><li>Median of Grouped Data </li></ul></ul><ul><li>Mode </li></ul>
12. 12. The Arithmetic Mean <ul><li>Simply the average of all the values </li></ul><ul><ul><li>calculated by adding the values of all the observations </li></ul></ul><ul><ul><li>and then dividing by the number of observations </li></ul></ul><ul><li>$\bar{x} = \dfrac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \dfrac{\sum x}{n}$ </li></ul><ul><li>To calculate this mean, we have to use every piece of data ... </li></ul><ul><ul><li>For a population this is actually impossible </li></ul></ul><ul><ul><li>For a sample, this can be quite difficult if the number of observations in the sample is quite high </li></ul></ul>
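
The definition of the arithmetic mean translates directly into code. The sketch below uses a hypothetical list of marks; the function name `arithmetic_mean` is an illustrative choice, not something taken from the slides.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical marks of five students.
marks = [12, 15, 11, 18, 14]
mean_marks = arithmetic_mean(marks)  # (12 + 15 + 11 + 18 + 14) / 5
print(mean_marks)
```
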
13. 13. Arithmetic Mean of Grouped Data <ul><ul><li>$\bar{x} = \dfrac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \dots + f_k x_k}{n} = \dfrac{\sum (f x)}{n}$ </li></ul></ul><ul><ul><li>where </li></ul></ul><ul><ul><li>$f_i$ = frequency of the $i$th class </li></ul></ul><ul><ul><li>$x_i$ = midpoint of the $i$th class </li></ul></ul><ul><ul><li>$k$ = number of classes </li></ul></ul><ul><ul><li>$n$ = number of observations $= \sum f$ </li></ul></ul>
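
A minimal sketch of the grouped-data mean, assuming each class is described by its lower limit, upper limit and frequency. The class boundaries and frequencies below are hypothetical.

```python
def grouped_mean(classes):
    """classes: list of (lower_limit, upper_limit, frequency) tuples."""
    n = sum(f for _, _, f in classes)  # number of observations
    if n == 0:
        raise ValueError("no observations")
    # Every member of a class is assumed to sit at the class midpoint.
    weighted_total = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return weighted_total / n


# Hypothetical frequency distribution of marks.
classes = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 5)]
mean_grouped = grouped_mean(classes)  # (25 + 120 + 300 + 175) / 30
print(mean_grouped)
```

Because the midpoint stands in for every value in its class, the result is an approximation of the mean of the raw data.
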
14. 14. Arithmetic Mean : Spreadsheet
15. 15. Arithmetic Mean <ul><li>Advantages </li></ul><ul><ul><li>A single number that represents a whole set of data </li></ul></ul><ul><ul><li>Intuitive and simple to understand </li></ul></ul><ul><ul><li>Easy to calculate </li></ul></ul><ul><ul><li>Allows a quick comparison of two sets of data </li></ul></ul><ul><li>Disadvantages </li></ul><ul><ul><li>Affected by extreme values of the data </li></ul></ul><ul><ul><ul><li>Suppose we have one very high number ! </li></ul></ul></ul><ul><ul><li>Could be difficult to calculate if size of set is high </li></ul></ul><ul><ul><ul><li>Largely overcome by computers </li></ul></ul></ul><ul><ul><li>In case of open-ended datasets, we cannot compute the number </li></ul></ul>
16. 16. The weighted mean <ul><li>Allows us to calculate an average that factors in the significance or importance of each data value </li></ul><ul><li>Consider the following </li></ul><ul><ul><li>2 Managers </li></ul></ul><ul><ul><ul><li>Salary of Rs 1 Lakh each </li></ul></ul></ul><ul><ul><li>10 Workers </li></ul></ul><ul><ul><ul><li>Salary of Rs 20K each </li></ul></ul></ul><ul><ul><li>1 Peon </li></ul></ul><ul><ul><ul><li>Salary of Rs 5K </li></ul></ul></ul><ul><li>What is the average salary per employee? </li></ul><ul><li>Is it? </li></ul><ul><ul><li>(1L + 20K + 5K) / 3 </li></ul></ul><ul><li>Or is it? </li></ul><ul><ul><li>[(2 x 1L) + (10 x 20K) + (1 x 5K)] / 13 </li></ul></ul><ul><li>The second calculation is a more accurate reflection of the average salary </li></ul>
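
The salary example can be worked through as follows. The figures are the ones on the slide (1 Lakh = 100,000); the function name `weighted_mean` is an illustrative assumption.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * v for v, w in zip(values, weights)) / total_weight


salaries = [100_000, 20_000, 5_000]  # manager, worker, peon (Rs)
headcount = [2, 10, 1]               # number of employees at each salary

simple = sum(salaries) / len(salaries)         # ignores headcount: 125000 / 3
weighted = weighted_mean(salaries, headcount)  # 405000 / 13

print(round(simple, 2), round(weighted, 2))
```

The unweighted figure treats the three salary levels as equally common, so it overstates what a typical employee earns.
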
17. 17. Weighted Mean <=> Grouped Data <ul><ul><li>$\bar{x} = \dfrac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \dots + f_k x_k}{n} = \dfrac{\sum (f x)}{n}$ </li></ul></ul><ul><ul><li>where </li></ul></ul><ul><ul><li>$f_i$ = frequency of the $i$th class </li></ul></ul><ul><ul><li>$x_i$ = midpoint of the $i$th class </li></ul></ul><ul><ul><li>$k$ = number of classes </li></ul></ul><ul><ul><li>$n$ = number of observations $= \sum f$ </li></ul></ul>The weighted mean is logically and mathematically equivalent to the mean of grouped data: the weight plays the role of the class frequency and the value plays the role of the class midpoint. Since we assume that all members of a class have the same value, the lower class boundary, upper class boundary and midpoint are all the same.
18. 18. Median <ul><li>Measure of Central Tendency that does NOT represent all the values in the dataset. </li></ul><ul><li>It is the value of the most central or middlemost item. </li></ul><ul><li>Half the values are above this value (the “median” value) and the other half are below </li></ul><ul><li>To calculate the median </li></ul><ul><ul><li>Arrange the data in either ascending or descending order </li></ul></ul><ul><ul><li>Choose the value of the data that lies in the middle of the array. </li></ul></ul><ul><li>What happens if you have an even number of data values? </li></ul>
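
The two steps above can be sketched as below. For an even number of values there is no single middle item, and the usual convention (assumed here) is to average the two middle values. The data are hypothetical.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle
    values when the number of observations is even."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median([7, 3, 9, 5, 100])  # sorted: 3, 5, 7, 9, 100
even_case = median([7, 3, 9, 5])      # sorted: 3, 5, 7, 9
print(odd_case, even_case)
```

Note that the extreme value 100 in the first data set does not move the median, whereas it would pull the arithmetic mean upwards.
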
19. 19. Median of Grouped Data <ul><li>Philosophical approach </li></ul><ul><ul><li>Determine the class where the middle data should lie </li></ul></ul><ul><ul><ul><li>Use frequency distribution </li></ul></ul></ul><ul><ul><ul><li>This is the median class </li></ul></ul></ul><ul><ul><li>Median class has </li></ul></ul><ul><ul><ul><li>Lower Boundary </li></ul></ul></ul><ul><ul><ul><li>Upper Boundary </li></ul></ul></ul><ul><ul><li>Interpolate within the median class </li></ul></ul><ul><li>Practical Approach </li></ul><ul><ul><li>Use this formula </li></ul></ul><ul><li>$\text{Median} = \left( \dfrac{(n+1)/2 - (F+1)}{f_m} \right) w + L_m$ </li></ul><ul><li>$n$ = total number of observations </li></ul><ul><li>$F$ = cumulative frequency till the previous class </li></ul><ul><li>$f_m$ = frequency of median class </li></ul><ul><li>$w$ = class width </li></ul><ul><li>$L_m$ = lower limit of median class </li></ul>
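
A minimal sketch of the practical approach, applying the formula from the slide to a hypothetical frequency distribution. Each class is assumed to be given as (lower limit, width, frequency), listed in ascending order.

```python
def grouped_median(classes):
    """classes: list of (lower_limit, width, frequency), ascending.

    Implements ((n + 1) / 2 - (F + 1)) / f_m * w + L_m.
    """
    n = sum(f for _, _, f in classes)
    position = (n + 1) / 2  # rank of the middle observation
    cumulative = 0          # F: cumulative frequency before the current class
    for lower, width, f in classes:
        if cumulative + f >= position:
            # This is the median class; interpolate inside it.
            return (position - (cumulative + 1)) / f * width + lower
        cumulative += f
    raise ValueError("no observations")


# Hypothetical distribution: 0-10, 10-20, 20-30, 30-40.
classes = [(0, 10, 5), (10, 10, 8), (20, 10, 12), (30, 10, 5)]
median_grouped = grouped_median(classes)
# n = 30, position = 15.5, median class 20-30, F = 13, f_m = 12
print(median_grouped)
```
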
21. 21. Median <ul><li>Advantages </li></ul><ul><ul><li>Unaffected by extreme values of the data </li></ul></ul><ul><li>Disadvantages </li></ul><ul><ul><li>Complex to calculate </li></ul></ul><ul><ul><li>Some loss of accuracy when you work with grouped data </li></ul></ul><ul><ul><li>If the data is extremely irregular then not much sense can be made from the median </li></ul></ul>
22. 22. Mode <ul><li>Mode is the value that is seen, or that occurs, most frequently in the data set </li></ul><ul><ul><li>For discrete distributions this is easy to determine </li></ul></ul><ul><ul><ul><li>Marks awarded to students in an exam </li></ul></ul></ul><ul><ul><li>For continuous distributions it is very unlikely that the same value will appear twice </li></ul></ul><ul><ul><ul><li>CO₂ emissions from twenty engines </li></ul></ul></ul><ul><li>Modal Class </li></ul><ul><ul><li>For continuous distributions we define classes and note the class that has the highest frequency </li></ul></ul><ul><ul><li>This is the modal class </li></ul></ul><ul><ul><li>It is quite possible that two or more modal classes might occur in the case of multimodal distributions </li></ul></ul>
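
The sketch below finds the mode(s) of discrete data and the modal class of continuous data. The marks, the emission readings and the class width of 10 are all hypothetical values chosen for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


# Discrete data: marks awarded in an exam.
marks = [6, 7, 7, 8, 8, 8, 9]
bimodal = [1, 1, 2, 2, 3]

# Continuous data: no reading repeats, so group into classes first.
emissions = [120.4, 131.2, 128.9, 135.5, 122.1, 129.7]
width = 10
class_lower_limits = [int(x // width) * width for x in emissions]
modal_class = modes(class_lower_limits)  # lower limit(s) of the modal class(es)

print(modes(marks), modes(bimodal), modal_class)
```
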
23. 23. Measures of Dispersion <ul><li>Why is dispersion important ? </li></ul><ul><ul><li>It helps us understand the significance or reliability of the central tendency. </li></ul></ul><ul><ul><li>Helps us to compare two or more samples </li></ul></ul><ul><li>Range </li></ul><ul><ul><li>Interfractile Range </li></ul></ul><ul><ul><li>Interquartile Range </li></ul></ul><ul><li>Average Deviation Measures </li></ul><ul><ul><li>Population Variance </li></ul></ul><ul><ul><ul><li>Population Standard Deviation </li></ul></ul></ul><ul><ul><li>Standard Score </li></ul></ul><ul><ul><li>Sample Variance </li></ul></ul><ul><ul><ul><li>Sample Standard Deviation </li></ul></ul></ul>
24. 24. Range <ul><li>Range: simply the difference between the highest and lowest values among the observations </li></ul><ul><ul><li>Easy to understand </li></ul></ul><ul><ul><li>Not terribly useful! </li></ul></ul>
25. 25. Fractile / Quartile / Percentile <ul><li>Fractile </li></ul><ul><ul><li>A value at or below which a certain percentage of the observations lies </li></ul></ul><ul><li>Median </li></ul><ul><ul><li>Is the 0.5 fractile because 50% of the data lies below this value </li></ul></ul><ul><li>1st Quartile </li></ul><ul><ul><li>25% of the data lies at or below this value </li></ul></ul><ul><li>3rd Quartile </li></ul><ul><ul><li>75% of the data lies at or below this value </li></ul></ul><ul><li>89th percentile </li></ul><ul><ul><li>89% of the data lies at or below this value </li></ul></ul><ul><li>An n fractile is a value below which a fraction n of the data lies </li></ul><ul><li>and n is a number between 0 and 1 </li></ul><ul><li>1st Quartile <=> ¼ fractile or 25th percentile </li></ul><ul><li>90th Percentile <=> 9/10 fractile </li></ul>
26. 26. Calculation of Fractiles
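27. 27. Interquartile Range <ul><li>Where do the middle “half” of the values lie? Between Q1 (25% of the data lies below this) and Q3 (75% of the data lies below this) </li></ul><ul><li>Interquartile range = Q3 - Q1 </li></ul>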
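
As a rough illustration, the range and interquartile range can be computed with Python's standard `statistics` module. Several conventions exist for locating quartiles; the `"inclusive"` method used here is one of them and is an assumption, so a spreadsheet or textbook may give slightly different quartile values. The data are hypothetical.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Range: highest value minus lowest value.
data_range = max(data) - min(data)

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Interquartile range: the spread of the middle half of the data.
iqr = q3 - q1

print(data_range, q1, q2, q3, iqr)
```
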
28. 28. Variance & Standard Deviation <ul><li>Every population has a variance $\sigma^2$ (sigma squared) defined as the following </li></ul><ul><li>$\sigma^2 = \dfrac{\sum (x - \mu)^2}{N} = \dfrac{\sum x^2}{N} - \mu^2$ </li></ul><ul><li>Where </li></ul><ul><ul><li>$\sigma^2$ is the variance and $\sigma$ is the standard deviation of the population </li></ul></ul><ul><ul><li>$x$ is the item or observation </li></ul></ul><ul><ul><li>$\mu$ is the population mean </li></ul></ul><ul><ul><li>$N$ is the number of observations </li></ul></ul><ul><ul><li>$\sum$ is the symbol of summation </li></ul></ul>
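
The two forms of the population variance formula give the same number, which the following sketch checks on a small hypothetical data set. The function names are illustrative.

```python
import math


def population_variance(values):
    """Definition form: mean of squared deviations from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_variance_shortcut(values):
    """Shortcut form: mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5
var_definition = population_variance(data)
var_shortcut = population_variance_shortcut(data)  # 232 / 8 - 25
sigma = math.sqrt(var_definition)

print(var_definition, var_shortcut, sigma)
```
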
29. 29. Variance & Standard Deviation <ul><li>$\sigma^2 = \dfrac{\sum (x - \mu)^2}{N} = \dfrac{\sum x^2}{N} - \mu^2$ </li></ul>Here we use the formula provided by the spreadsheet
30. 30. Significance of $\sigma$ <ul><li>The standard deviation $\sigma$ enables us to determine with a great deal of accuracy where the values of a frequency distribution lie with respect to the mean $\mu$ </li></ul><ul><li>Chebyshev's Theorem states that for any distribution </li></ul><ul><ul><ul><li>at least 75% of all data lies between $\mu - 2\sigma$ and $\mu + 2\sigma$ </li></ul></ul></ul><ul><ul><ul><li>at least 89% of all data lies between $\mu - 3\sigma$ and $\mu + 3\sigma$ </li></ul></ul></ul><ul><li>For a smooth, symmetrical, bell-shaped distribution </li></ul><ul><ul><ul><li>about 68% of all data lies between $\mu - \sigma$ and $\mu + \sigma$ </li></ul></ul></ul><ul><ul><ul><li>about 95% of all data lies between $\mu - 2\sigma$ and $\mu + 2\sigma$ </li></ul></ul></ul><ul><ul><ul><li>about 99% of all data lies between $\mu - 3\sigma$ and $\mu + 3\sigma$ </li></ul></ul></ul>
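
The 75% and 89% figures come from Chebyshev's bound: at least 1 - 1/k² of the data lies within k standard deviations of the mean, for any k greater than 1. The sketch below evaluates the bound and compares it with the observed fraction in a small hypothetical data set; the function names are illustrative.

```python
import math


def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Observed fraction of values within k population standard deviations."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    inside = sum(1 for x in values if abs(x - mu) <= k * sigma)
    return inside / n


bound_2 = chebyshev_lower_bound(2)  # 1 - 1/4
bound_3 = chebyshev_lower_bound(3)  # 1 - 1/9

data = [2, 4, 4, 4, 5, 5, 7, 9]     # mean 5, standard deviation 2
observed_2 = fraction_within(data, 2)

print(bound_2, round(bound_3, 3), observed_2)
```

The observed fraction can be well above the bound; Chebyshev's Theorem only guarantees a minimum.
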
31. 31. Variance & Standard Deviation for Grouped Data <ul><li>Variance $\sigma^2$ is given by </li></ul><ul><li>$\sigma^2 = \dfrac{\sum f_i (x_i - \mu)^2}{N} = \dfrac{\sum f_i x_i^2}{N} - \mu^2$ </li></ul><ul><li>Where </li></ul><ul><ul><li>$\sigma^2$ is the variance and $\sigma$ is the standard deviation of the population </li></ul></ul><ul><ul><li>$x_i$ is the midpoint of the $i$th class </li></ul></ul><ul><ul><li>$f_i$ is the frequency in the $i$th class </li></ul></ul><ul><ul><li>$\mu$ is the population mean </li></ul></ul><ul><ul><li>$N$ is the number of observations </li></ul></ul><ul><ul><li>$\sum$ is the symbol of summation </li></ul></ul>
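
A minimal sketch of the grouped-data variance, assuming each class is given as (midpoint, frequency). The distribution is hypothetical, and both forms of the formula are computed to show that they agree.

```python
import math


def grouped_variance(classes):
    """classes: list of (midpoint, frequency). Returns (definition, shortcut)."""
    n = sum(f for _, f in classes)           # N: number of observations
    mu = sum(f * x for x, f in classes) / n  # grouped mean
    definition = sum(f * (x - mu) ** 2 for x, f in classes) / n
    shortcut = sum(f * x * x for x, f in classes) / n - mu ** 2
    return definition, shortcut


# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with their midpoints.
classes = [(5, 5), (15, 8), (25, 12), (35, 5)]
var_def, var_short = grouped_variance(classes)
sigma_grouped = math.sqrt(var_def)

print(round(var_def, 4), round(var_short, 4), round(sigma_grouped, 4))
```
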
32. 32. Variance & Standard Deviation for Grouped Data <ul><li>$\sigma^2 = \dfrac{\sum f_i (x_i - \mu)^2}{N} = \dfrac{\sum f_i x_i^2}{N} - \mu^2$ </li></ul>
33. 33. Variance : Population and Sample <ul><li>Population statistics: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N} = \dfrac{\sum x^2}{N} - \mu^2$ </li></ul><ul><li>Sample statistics: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n-1} = \dfrac{\sum x^2}{n-1} - \dfrac{n \bar{x}^2}{n-1}$ </li></ul>Suspiciously similar but not quite ... why is this?
34. 34. Population and Sample <ul><li>Population refers to the totality of all data that is possible </li></ul><ul><ul><li>Usually impossible to get this </li></ul></ul><ul><ul><li>So we will never be able to calculate </li></ul></ul><ul><ul><li>Either the mean $\mu$ </li></ul></ul><ul><ul><li>Or the variance $\sigma^2$ </li></ul></ul><ul><li>Sample refers to the data from the population that we can collect </li></ul><ul><ul><li>Using this data we can calculate </li></ul></ul><ul><ul><li>Sample mean: $\bar{x}$ </li></ul></ul><ul><ul><li>Sample variance: $s^2$ </li></ul></ul><ul><li>Objective of the statistician is to ESTIMATE </li></ul><ul><ul><li>the population parameters $\mu$, $\sigma$ </li></ul></ul><ul><ul><li>from the sample statistics $\bar{x}$, $s$ </li></ul></ul>
35. 35. Population & Sample Formulae are different in a Spreadsheet
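
Python's standard `statistics` module makes the same distinction: `pvariance` divides by N (population) and `variance` divides by n - 1 (sample). The sketch below, on hypothetical data, also checks the sample shortcut formula by hand.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
x_bar = sum(data) / n

population_var = statistics.pvariance(data)  # divides by N
sample_var = statistics.variance(data)       # divides by n - 1

# Shortcut form of the sample variance, computed by hand.
sample_var_shortcut = (
    sum(x * x for x in data) / (n - 1) - n * x_bar ** 2 / (n - 1)
)

print(population_var, round(sample_var, 4), round(sample_var_shortcut, 4))
```

The sample variance is always the larger of the two for the same data, and the gap shrinks as n grows.
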
36. 36. Last two statistics <ul><li>Standard Score </li></ul><ul><li>A measure of how far an individual piece of data is from the mean, in units of the standard deviation </li></ul><ul><li>Standard score $= \dfrac{x - \mu}{\sigma}$ </li></ul><ul><li>Question </li></ul><ul><ul><li>What would be the mean and standard deviation of the standard score? </li></ul></ul><ul><li>Coefficient of Variation of a population </li></ul><ul><li>Coefficient of variation $= \dfrac{\sigma}{\mu} (100)$ </li></ul><ul><li>A relative measure that gives us a feel of how dispersed the data is when compared to the mean </li></ul>
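
The question on the slide can be explored numerically. The sketch below, on a hypothetical data set treated as a full population, computes the standard scores and the coefficient of variation; standard scores always come out with mean 0 and standard deviation 1.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = statistics.fmean(data)      # population mean
sigma = statistics.pstdev(data)  # population standard deviation

# Standard score of each observation: distance from the mean in units of sigma.
z_scores = [(x - mu) / sigma for x in data]

z_mean = statistics.fmean(z_scores)
z_sd = statistics.pstdev(z_scores)

# Coefficient of variation, expressed as a percentage of the mean.
cv = sigma / mu * 100

print(z_scores, z_mean, z_sd, cv)
```
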