
Elementary statistics


  • But kurtosis does not measure anything about the peak. It measures the rare, extreme observations only. Please see https://en.wikipedia.org/wiki/Talk:Kurtosis#Why_kurtosis_should_not_be_interpreted_as_.22peakedness.22

Elementary statistics

  1. 1. Elementary Statistics. Davis Lazarus, Assistant Professor, ISIM, The IIS University
  2. 2. Too few categories. [Histogram: Age of Spring 1998 Stat 250 Students; x-axis: Age (in years); y-axis: Frequency (Count); n = 92 students]
  3. 3. Too many categories. [Histogram: GPAs of Spring 1998 Stat 250 Students; x-axis: GPA; y-axis: Frequency (Count); n = 92 students]
  4. 4. • Scatter plot • Scatter diagram • Scattergram [Figure: scatter plot of Y against X]
  5. 5. Frequency distribution table (x is the class mark):
         Classes    Class boundaries    Tally Marks             Freq.    x
         70 – 78    69.5 – 78.5         /////                    5       74
         61 – 69    60.5 – 69.5         /////                    5       65
         52 – 60    51.5 – 60.5                                  0       56
         43 – 51    42.5 – 51.5         //                       2       47
         34 – 42    33.5 – 42.5         /////-//                 7       38
         25 – 33    24.5 – 33.5         /////-/////-////        14       29
         16 – 24    15.5 – 24.5         /////-/////-/////-//    17       20
  6. 6. A frequency distribution table lists categories of scores along with their corresponding frequencies.
  7. 7. The frequency for a particular category or class is the number of original scores that fall into that class.
  8. 8. The classes or categories refer to the groupings of a frequency table.
  9. 9. • The range is the difference between the highest value and the lowest value. R = highest value – lowest value
  10. 10. The class width is the difference between two consecutive lower class limits or class boundaries.
  11. 11. The class limits are the smallest or the largest numbers that can actually belong to different classes.
  12. 12. • Lower class limits are the smallest numbers that can actually belong to the different classes. • Upper class limits are the largest numbers that can actually belong to the different classes.
  13. 13. • The class boundaries are obtained by increasing the upper class limits and decreasing the lower class limits by the same amount so that there are no gaps between consecutive classes. The amount to be added or subtracted is ½ the difference between the upper limit of one class and the lower limit of the following class. For example, the classes 16 – 24 and 25 – 33 are 1 apart, so ½ is subtracted and added to give the boundaries 15.5 – 24.5 and 24.5 – 33.5.
  14. 14. Essential Question: • How do we construct a frequency distribution table?
  15. 15. Process of Constructing a Frequency Table. • STEP 1: Determine the range. R = Highest Value – Lowest Value
  16. 16. • STEP 2. Determine the tentative number of classes (k): k = 1 + 3.322 log N, where N is the number of observations and log is the base-10 logarithm. • Always round off. • Note: The number of classes should be between 5 and 20. The actual number of classes may be affected by convenience or other subjective factors.
  17. 17. • STEP 3. Find the class width by dividing the range by the number of classes: class width = Range / number of classes, that is, $c = \frac{R}{k}$ (always round off).
  18. 18. • STEP 4. Write the classes or categories starting with the lowest score. Stop when the class already includes the highest score. • Add the class width to the starting point to get the second lower class limit. Add the class width to the second lower class limit to get the third, and so on. List the lower class limits in a vertical column and enter the upper class limits, which can be easily identified at this stage.
  19. 19. • STEP 5. Determine the frequency for each class by referring to the tally columns and present the results in a table.
  20. 20. When constructing frequency tables, the following guidelines should be followed. • The classes must be mutually exclusive. That is, each score must belong to exactly one class. • Include all classes, even if the frequency might be zero.
  21. 21. • All classes should have the same width, although it is sometimes impossible to avoid open-ended intervals such as “65 years or older”. • The number of classes should be between 5 and 20.
  22. 22. Let’s Try!!! • Time magazine collected information on all 464 people who died from gunfire in the Philippines during one week. Here are the ages of 50 men randomly selected from that population. Construct a frequency distribution table.
  23. 23. 19 18 30 40 41 33 73 25 23 25 21 33 65 17 20 76 47 69 20 31 18 24 35 24 17 36 65 70 22 25 65 16 24 29 42 37 26 46 27 63 21 27 23 25 71 37 75 25 27 23
  24. 24. Using the table: • What is the lower class limit of the highest class? Upper class limit of the lowest class? • Find the class mark of the class 43 – 51. • What is the frequency of the class 16 – 24?
  25. 25. Frequency distribution table for the 50 ages (x is the class mark):
         Classes    Class boundaries    Tally Marks             Freq.    x
         70 – 78    69.5 – 78.5         /////                    5       74
         61 – 69    60.5 – 69.5         /////                    5       65
         52 – 60    51.5 – 60.5                                  0       56
         43 – 51    42.5 – 51.5         //                       2       47
         34 – 42    33.5 – 42.5         /////-//                 7       38
         25 – 33    24.5 – 33.5         /////-/////-////        14       29
         16 – 24    15.5 – 24.5         /////-/////-/////-//    17       20
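
Steps 1 to 5 can be sketched in Python for the 50 ages. This is only an illustration under stated assumptions: the function name `frequency_table` is hypothetical, the data are whole numbers, "round off" in Step 2 is read as ordinary rounding, and the class width in Step 3 is rounded up so that the classes are sure to reach the highest score.

```python
import math

# Ages of the 50 men from the "Let's Try" slide.
ages = [
    19, 18, 30, 40, 41, 33, 73, 25, 23, 25,
    21, 33, 65, 17, 20, 76, 47, 69, 20, 31,
    18, 24, 35, 24, 17, 36, 65, 70, 22, 25,
    65, 16, 24, 29, 42, 37, 26, 46, 27, 63,
    21, 27, 23, 25, 71, 37, 75, 25, 27, 23,
]

def frequency_table(data):
    """Return rows (lower, upper, boundaries, frequency, class_mark), lowest class first.

    Assumes integer data, so a class of width w starting at `lower`
    has the limits lower .. lower + w - 1.
    """
    data_range = max(data) - min(data)                # Step 1: range
    k = round(1 + 3.322 * math.log10(len(data)))      # Step 2: tentative number of classes
    width = math.ceil(data_range / k)                 # Step 3: class width (rounded up, an assumption)
    rows = []
    lower = min(data)                                 # Step 4: start with the lowest score
    while True:
        upper = lower + width - 1
        freq = sum(lower <= x <= upper for x in data) # Step 5: frequency of the class
        boundaries = (lower - 0.5, upper + 0.5)
        class_mark = (lower + upper) / 2
        rows.append((lower, upper, boundaries, freq, class_mark))
        if upper >= max(data):                        # stop once the highest score is included
            break
        lower += width
    return rows

table = frequency_table(ages)
for lower, upper, boundaries, freq, mark in reversed(table):
    print(f"{lower} - {upper}  {boundaries[0]} - {boundaries[1]}  {freq:2d}  {mark}")
```

With these choices the range is 60, k is 7 and the class width is 9, which gives the classes 16 – 24 through 70 – 78 used in the table.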
  26. 26. Example 1. The manager of Hudson Auto would like to have a better understanding of the cost of parts used in the engine tune-ups performed in the shop. She examines 50 customer invoices for tune-ups. The costs of parts, rounded off to the nearest dollar, are listed below. 91 78 93 57 75 52 99 80 97 62 71 69 72 89 66 75 79 75 72 76 104 74 62 68 97 105 77 65 80 109 85 97 88 68 83 68 71 69 67 74 62 82 98 101 79 105 79 69 62 73
  27. 27. CUMULATIVE FREQUENCY DISTRIBUTION • The less than cumulative frequency distribution (F<) is constructed by adding the frequencies from the lowest to the highest class interval, while the more than cumulative frequency distribution (F>) is constructed by adding the frequencies from the highest class interval to the lowest class interval.
  28. 28. Tabular Summary: Frequency Distribution of engine tune-ups
         Cost ($)    Frequency    Relative Frequency    Cumulative Frequency    Cumulative Frequency
                                                        (less than)             (more than)
         50-59        2           0.04                   2                      50
         60-69       13           0.26                  15 (= 2 + 13)           48
         70-79       16           0.32                  31                      35
         80-89        7           0.14                  38                      19
         90-99        7           0.14                  45                      12 (= 5 + 7)
         100-109      5           0.10                  50                       5
         Total       50           1.00
         45 tune-ups cost less than $100; 12 tune-ups cost more than $89.
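
The relative and cumulative columns follow mechanically from the class frequencies. A minimal Python sketch (the variable names are illustrative, not from the slides):

```python
from itertools import accumulate

classes = ["50-59", "60-69", "70-79", "80-89", "90-99", "100-109"]
freq = [2, 13, 16, 7, 7, 5]

n = sum(freq)                                        # 50 invoices
relative = [f / n for f in freq]                     # relative frequency of each class
less_than = list(accumulate(freq))                   # add from the lowest class upward
more_than = list(accumulate(reversed(freq)))[::-1]   # add from the highest class downward

for row in zip(classes, freq, relative, less_than, more_than):
    print(row)
```

Adding from the highest class downward gives 5, 12, 19, 35, 48, 50, so the more than count for the class 80-89 works out to 7 + 7 + 5 = 19.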
  29. 29. Graphical Summary: Histogram. [Histogram of the tune-up data; x-axis: Cost ($) by class; y-axis: Frequency] Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.
  30. 30. Ogive. [Figure: less than ogive and more than ogive; x-axis: Tune-up Cost ($); y-axis: Frequency; the median is marked on the cost axis]
  31. 31. Stem-and-Leaf Display (stem | leaves)
          5 | 2 7
          6 | 2 2 2 2 5 6 7 8 8 8 9 9 9
          7 | 1 1 2 2 3 4 4 5 5 5 6 7 8 9 9 9
          8 | 0 0 2 3 5 8 9
          9 | 1 3 7 7 7 8 9
         10 | 1 4 5 5 9
         A single digit is used to define each leaf. Leaf units may be 100, 10, 1, 0.1, and so on. Where the leaf unit is not shown, it is assumed to equal 1. In the above example, the leaf unit was 1.
  32. 32. Leaf Unit = 0.1. Data: 8.6 11.7 9.4 9.1 10.2 11.0 8.8
          8 | 6 8
          9 | 1 4
         10 | 2
         11 | 0 7
         Leaf Unit = 10. Data: 1806 1717 1974 1791 1682 1910 1838
         16 | 8
         17 | 1 9
         18 | 0 3
         19 | 1 7
         The 82 in 1682 is rounded down to 80 and is represented as an 8.
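
A stem-and-leaf display can be built by splitting each value into a leading part (the stem) and a single trailing digit (the leaf). The sketch below assumes whole-number data and a whole-number leaf unit; the function name `stem_and_leaf` is hypothetical. Digits smaller than the leaf unit are dropped, which matches the way 1682 becomes stem 16, leaf 8.

```python
from collections import defaultdict

# Tune-up costs from Example 1 (Hudson Auto).
costs = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

def stem_and_leaf(data, leaf_unit=1):
    """Map each stem to its sorted list of one-digit leaves (integer data only)."""
    stems = defaultdict(list)
    for x in sorted(data):
        scaled = x // leaf_unit                    # drop digits smaller than the leaf unit
        stems[scaled // 10].append(scaled % 10)    # last remaining digit is the leaf
    return dict(stems)

display = stem_and_leaf(costs)
for stem, leaves in display.items():
    print(f"{stem:2d} | {' '.join(map(str, leaves))}")

# Leaf unit of 10, as in the second example on the leaf-unit slide.
display_tens = stem_and_leaf([1806, 1717, 1974, 1791, 1682, 1910, 1838], leaf_unit=10)
```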
  33. 33. Measures of Central Tendency: Arithmetic Mean, Weighted Mean, Geometric Mean, Median, Mode, Partition Values (Quartiles, Deciles and Percentiles). Measures of Dispersion: Range, Mean deviation, Standard deviation, Variance, Co-efficient of variation. Measures of Position: Quartile deviation.
  34. 34. • What is the “location” or “centre” of the data? (measures of location or central tendency) • How do the data vary? (measures of variability or dispersion) Mean: the average obtained by finding the sum of the numbers and dividing by the number of numbers in the sum. Median: When the numbers are listed from highest to lowest or lowest to highest, the median is the number found in the middle. If there is an even number of data values, find the average of the middle two numbers. Mode: The number that occurs most often.
  35. 35. Mean is the most widely used measure of location and shows the central value of the data. Population mean: $\mu = \frac{\sum X_i}{N}$, where $\mu$ is the population mean, $N$ is the population size, $X_i$ is a particular population value, and $\Sigma$ indicates the operation of adding. Sample mean: $\bar{X} = \frac{\sum x_i}{n}$, where $\bar{X}$ is the sample mean, $n$ is the sample size, and $x_i$ is a particular sample value. Properties: • all values are used • unique • sum of the deviations from the mean is 0 • affected by unusually large or small data values
  36. 36. The Median is the midpoint of the values after they have been ordered from the smallest to the largest. It is found at the (n+1)/2 ranked observation; for an even number of values this position falls between the two middle numbers, and the median is their arithmetic average. There are as many values above the median as below it in the data array. Properties: • unique • not affected by extremely large or small values ⇒ good measure of location when such values occur
  37. 37. The Mode is another measure of location and representsthe value of the observation that appears most frequently. Data can have more than one mode. If it has two modes, it is referred to as bimodal, three modes, trimodal, and the like.
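
The three measures of location, and the properties listed for them, can be checked with Python's standard `statistics` module. The sample `scores` is hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import statistics

scores = [2, 3, 3, 5, 7, 10]            # hypothetical sample, not from the slides

mean = statistics.mean(scores)           # 30 / 6
median = statistics.median(scores)       # average of the two middle values, 3 and 5
modes = statistics.multimode(scores)     # list of the most frequent value(s)

# The deviations from the mean always sum to zero.
deviation_sum = sum(x - mean for x in scores)

# One extreme value pulls the mean strongly but barely moves the median.
with_outlier = scores + [100]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

print(mean, median, modes, deviation_sum)
print(mean_with_outlier, median_with_outlier)
```

`multimode` returns a list, so a bimodal or trimodal data set simply yields two or three entries.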
  38. 38. Weighted Mean of a set of numbers $X_1, X_2, \dots, X_n$ with corresponding weights $w_1, w_2, \dots, w_n$: $\bar{X}_w = \frac{w_1 X_1 + w_2 X_2 + \dots + w_n X_n}{w_1 + w_2 + \dots + w_n}$. Geometric Mean of a set of n numbers is defined as the nth root of the product of the n numbers: $GM = \sqrt[n]{(X_1)(X_2)(X_3)\cdots(X_n)}$. GM is used to average percents, indexes, and relatives.
  39. 39. Example 1. The interest rates on three bonds were 5, 21, and 4 percent. The arithmetic mean is (5 + 21 + 4) / 3 = 10.0. The geometric mean is $GM = \sqrt[3]{(5)(21)(4)} = 7.49$. The GM gives a more conservative profit figure because it is not heavily weighted by the rate of 21%.
  40. 40. Example 2. Another use of GM is to determine the percent increase in sales, production or other business or economic series from one time period to another. [Figure: Growth in Sales 1999-2004; x-axis: Year; y-axis: Sales in Millions ($)] $GM = \sqrt[n]{\frac{\text{Value at end of period}}{\text{Value at beginning of period}}} - 1$
  41. 41. Example 3. The total number of females enrolled in American colleges increased from 755,000 in 1992 to 835,000 in 2000. That is, the geometric mean rate of increase is 1.27%: $GM = \sqrt[8]{\frac{835{,}000}{755{,}000}} - 1 = 0.0127$
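
Both uses of the geometric mean in Examples 1 and 3 reduce to taking an nth root. A short Python sketch, with hypothetical function names:

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive numbers."""
    return math.prod(values) ** (1 / len(values))

def average_rate_of_increase(start, end, periods):
    """Geometric mean rate of change per period between two values."""
    return (end / start) ** (1 / periods) - 1

rates = [5, 21, 4]                          # bond interest rates, Example 1
arithmetic = sum(rates) / len(rates)        # 10.0
geometric = geometric_mean(rates)           # about 7.49

# Example 3: enrolment grew from 755,000 (1992) to 835,000 (2000), 8 periods.
growth = average_rate_of_increase(755_000, 835_000, 8)   # about 0.0127

print(arithmetic, round(geometric, 2), round(growth, 4))
```

Compounding 755,000 at the computed rate for 8 years returns 835,000, which is a quick way to check the result.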
  42. 42. Measures of Dispersion: • Range • Mean Deviation • Quartile Deviation • Standard Deviation • Variance • Co-efficient of Variation
  43. 43. Dispersion refers to the spread or variability in the data. [Figure: data values spread around the mean] Range = Largest value – Smallest value
  44. 44. Range Example The following represents the current year’s Return on Equity of the 25 companies in an investor’s portfolio. -8.1 3.2 5.9 8.1 12.3 -5.1 4.1 6.3 9.2 13.3 -3.1 4.6 7.9 9.5 14.0 -1.4 4.8 7.9 9.7 15.0 1.2 5.7 8.0 10.3 22.1 Highest value: 22.1 Lowest value: -8.1 Range = Highest value – lowest value = 22.1-(-8.1) = 30.2
  45. 45. Mean Deviation: the arithmetic mean of the absolute values of the deviations from the arithmetic mean. $MD = \frac{\sum |X - \bar{X}|}{n}$ • All values are used in the calculation. • It is not unduly influenced by large or small values. • The absolute values are difficult to manipulate.
  46. 46. Example 5. The weights of a sample of crates containing books for the bookstore (in pounds) are: 103, 97, 101, 106, 103. $\bar{X} = 102$, so $MD = \frac{\sum |X - \bar{X}|}{n} = \frac{|103 - 102| + \dots + |103 - 102|}{5} = \frac{1 + 5 + 1 + 4 + 1}{5} = 2.4$
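
The mean deviation of Example 5 can be reproduced in a few lines of Python. The function name `mean_deviation` is illustrative.

```python
def mean_deviation(values):
    """Arithmetic mean of the absolute deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

weights = [103, 97, 101, 106, 103]                            # crate weights in pounds
absolute_deviations = [abs(x - 102) for x in weights]         # 1, 5, 1, 4, 1
md = mean_deviation(weights)                                  # 12 / 5

print(absolute_deviations, md)
```

The absolute deviations add up to 12, so the mean deviation is 12 / 5 = 2.4 pounds.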
  47. 47. Standard deviation and Variance. Variance: the arithmetic mean of the squared deviations from the mean. Standard deviation = √(variance). Population variance: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$, where $X$ is the value of an observation in the population, $\mu$ is the arithmetic mean of the population, and $N$ is the number of observations in the population. Population standard deviation: $\sigma = \sqrt{\sigma^2}$
  48. 48. Example 6. For the return-on-equity data of the Range Example (referred to here as Example 4), the variance and standard deviation are: $\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{(-8.1 - 6.62)^2 + (-5.1 - 6.62)^2 + \dots + (22.1 - 6.62)^2}{25} = 42.227$, $\sigma = 6.498$. Sample variance: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$. Sample standard deviation: $s = \sqrt{s^2}$
  49. 49. Example 7. The hourly wages earned by a sample of five students are $7, $5, $11, $8, $6. $\bar{X} = \frac{\sum X}{n} = \frac{37}{5} = 7.40$; $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{(7 - 7.4)^2 + \dots + (6 - 7.4)^2}{5 - 1} = \frac{21.2}{5 - 1} = 5.30$; $s = \sqrt{s^2} = \sqrt{5.30} = 2.30$
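
The difference between the population formulas (divide by N) and the sample formulas (divide by n – 1) can be checked with Python's standard `statistics` module, using the return-on-equity data of the Range Example and the wages of Example 7.

```python
import statistics

# Return on Equity of the 25 companies (treated as a population).
roe = [
    -8.1, 3.2, 5.9, 8.1, 12.3,
    -5.1, 4.1, 6.3, 9.2, 13.3,
    -3.1, 4.6, 7.9, 9.5, 14.0,
    -1.4, 4.8, 7.9, 9.7, 15.0,
    1.2, 5.7, 8.0, 10.3, 22.1,
]
pop_mean = statistics.mean(roe)            # about 6.62
pop_variance = statistics.pvariance(roe)   # divides by N, about 42.227
pop_sd = statistics.pstdev(roe)            # about 6.498

# Hourly wages of a sample of five students.
wages = [7, 5, 11, 8, 6]
sample_variance = statistics.variance(wages)   # divides by n - 1, about 5.30
sample_sd = statistics.stdev(wages)            # about 2.30

print(round(pop_mean, 2), round(pop_variance, 3), round(pop_sd, 3))
print(round(sample_variance, 2), round(sample_sd, 2))
```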
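  50. 50. Example. Data: X = {6, 10, 5, 4, 9, 8}; N = 6
          X      X − X̄    (X − X̄)²
          6      -1        1
         10       3        9
          5      -2        4
          4      -3        9
          9       2        4
          8       1        1
         Total: 42         Total: 28
         Mean: $\bar{X} = \frac{\sum X}{N} = \frac{42}{6} = 7$. Variance: $s^2 = \frac{\sum (X - \bar{X})^2}{N} = \frac{28}{6} = 4.67$. Standard deviation: $s = \sqrt{s^2} = \sqrt{4.67} = 2.16$. Note that this example divides by N, so the six values are being treated as a whole population.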
  51. 51. Empirical Rule: For any symmetrical, bell-shaped distribution: about 68% of the observations will lie within 1s of the mean; about 95% of the observations will lie within 2s of the mean; nearly all the observations will be within 3s of the mean.
  52. 52. Interpretation and Uses of the Standard Deviation. [Figure: bell-shaped curve with 68% of the area between µ − 1σ and µ + 1σ, 95% between µ − 2σ and µ + 2σ, and 99.7% between µ − 3σ and µ + 3σ]
  53. 53. Fractiles. Quartiles: Q1, Q2, Q3 divide ranked data into four equal parts (25% each). Deciles: D1, D2, D3, D4, D5, D6, D7, D8, D9 divide ranked data into ten equal parts (10% each). Percentiles: the 99 percentiles divide ranked data into 100 equal parts.
  54. 54. Relative Standing: Percentiles. percentile of value x = ((number of values < x) / total number of values) * 100 (round the result to the nearest whole number). Suppose that in a class of 25 people we have the following averages (ordered in ascending order): 42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77, 78, 78, 79, 80, 81, 84, 85, 87, 89, 91, 94, 98. If you received a 77, what percentile are you? percentile of 77 = (12/25) * 100 = 48
  55. 55. Relative Standing: Quartiles. Instead of finding the percentile of a single data value as we did on the previous page, it is often useful to group the data into 4, or more, (nearly) equal groups. When grouping the data into four equal groupings, we call these groupings quartiles. Let n = number of items in the data set, k = percent desired (e.g. k = 25), and L = locator: the position of the value separating the first k percent of the data from the rest. L = (k/100) * n
  56. 56. Relative Standing. Let’s separate the 25 class grades into four quartiles. • Step 1 – order the data in ascending order: 42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77, 78, 78, 79, 80, 81, 84, 85, 87, 89, 91, 94, 98. Now find the 3 locators L25, L50, L75, rounding any fraction part up to the next integer: L25 = (25/100) * 25 = 6.25 → 7, so Q1 is the 7th value, 70; L50 = (50/100) * 25 = 12.5 → 13, so Q2 is the 13th value, 77; L75 = (75/100) * 25 = 18.75 → 19, so Q3 is the 19th value, 84.
  57. 57. Relative Standing. Other measures of relative standing include: • Interquartile range (IQR) = Q3 - Q1 • Semi-interquartile range = (Q3 - Q1)/2 • Midquartile = (Q3 + Q1)/2 • 10 – 90 percentile range = P90 - P10. For the data on the previous page we have: IQR = 84 – 70 = 14 and Semi IQR = (84 – 70)/2 = 7 (measures of variation); Midquartile = (84 + 70)/2 = 77 (measure of central tendency).
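
The percentile and locator rules can be sketched in Python for the 25 class grades. The function names are hypothetical. One assumption goes beyond the rule as stated: when the locator is already a whole number, the sketch averages that value with the next one, which is how the Box Diagram slide obtains Q1 = (73 + 74)/2 for its 60 values.

```python
import math

grades = [42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77,
          78, 78, 79, 80, 81, 84, 85, 87, 89, 91, 94, 98]

def percentile_of(value, data):
    """Percent of values strictly below `value`, rounded to a whole number."""
    below = sum(x < value for x in data)
    return round(below / len(data) * 100)

def value_at_percent(k, data):
    """Value separating the first k percent of the data (0 < k < 100) from the rest."""
    ordered = sorted(data)
    locator = k / 100 * len(ordered)                  # L = (k/100) * n
    if locator != int(locator):
        return ordered[math.ceil(locator) - 1]        # round up; positions are 1-based
    i = int(locator)
    return (ordered[i - 1] + ordered[i]) / 2          # whole-number locator: average with next value

q1, q2, q3 = (value_at_percent(k, grades) for k in (25, 50, 75))
iqr = q3 - q1
semi_iqr = iqr / 2
midquartile = (q3 + q1) / 2

print(percentile_of(77, grades), q1, q2, q3, iqr, semi_iqr, midquartile)
```

For these grades the locators 6.25, 12.5 and 18.75 round up to positions 7, 13 and 19, so Q1 = 70, Q2 = 77 and Q3 = 84, and the interquartile range is 84 – 70 = 14.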
  58. 58. Box Diagram. Data (60 values, with the positions of L25, the median, and L75 marked on the slide): 65, 67, 68, 68, 69, 69, 71, 71, 71, 72, 72, 72, 73, 73, 73, 74, 74, 75, 75, 75, 75, 76, 76, 77, 77, 77, 77, 77, 77, 78, 78, 78, 78, 79, 79, 79, 79, 80, 81, 81, 81, 81, 81, 81, 81, 81, 82, 82, 83, 84, 85, 85, 85, 86, 86, 87, 87, 88, 89, 92. Q1 = (73 + 74)/2 = 73.5. To construct a box diagram to illustrate the extent to which the extreme data values lie beyond the interquartile range, draw a line with the low and high value highlighted at the two ends. Mark the gradations between these two extremes, then locate the quartile boundaries Q1, Med., and Q3 on this line. Construct a box about these values. [Figure: box diagram with Q1, M, and Q3 marked on a scale running from 65 to 92]
  59. 59. Percentile of score a = (number of scores less than a / total number of scores) * 100. Relation between the different fractiles: D1 = P10, D2 = P20, D3 = P30, ..., D9 = P90; Q1 = P25, Q2 = P50, Q3 = P75. Interquartile Range: Q3 – Q1
  60. 60. Box plot: a graphical display, based on quartiles, that helps to picture a set of data. Five pieces of data are needed to construct a box plot: Minimum Value, First Quartile (Q1), Median, Third Quartile (Q3), and Maximum Value. The box represents the interquartile range, which contains the middle 50% of values. The whiskers represent the range; they extend from the box to the highest and lowest values, excluding outliers. A line across the box indicates the median.
  61. 61. Example 8. Based on a sample of 20 deliveries, Buddy’s Pizza determined the following information. The minimum delivery time was 13 minutes and the maximum 30 minutes. The first quartile was 15 minutes, the median 18 minutes, and the third quartile 22 minutes. Develop a box plot for the delivery times. [Figure: box plot with Min, Q1, Median, Q3, and Max marked on a scale from 12 to 32 minutes, and a distance of 1.5 times the interquartile range shown on either side of the box]
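
The quantities behind the box plot in Example 8 can be worked out from the five-number summary alone. The sketch below assumes the common convention that a value lying more than 1.5 times the interquartile range beyond the box counts as an outlier; the function name and dictionary keys are illustrative.

```python
def box_plot_summary(minimum, q1, median, q3, maximum):
    """Interquartile range and 1.5 * IQR fences for a five-number summary."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr     # values below this would be flagged as outliers
    upper_fence = q3 + 1.5 * iqr     # values above this would be flagged as outliers
    return {
        "median": median,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "extremes_within_fences": lower_fence <= minimum and maximum <= upper_fence,
    }

# Delivery times in minutes: min 13, Q1 15, median 18, Q3 22, max 30.
summary = box_plot_summary(13, 15, 18, 22, 30)
print(summary)
```

Here the interquartile range is 7 minutes, so the fences sit at 4.5 and 32.5 minutes; both the minimum and the maximum fall inside them, and under this convention the whiskers run all the way to 13 and 30.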
  62. 62. Skewness: measurement of the lack of symmetry of the distribution. Symmetric distribution: a distribution having the same shape on either side of the centre. Skewed distribution: one whose shapes on either side of the centre differ; a nonsymmetrical distribution. A distribution can be positively or negatively skewed, or bimodal.
  63. 63. Relative Positions of the Mean, Median, and Mode in a Symmetric Distribution. [Figure: symmetric curve with the mean, median, and mode marked at the centre]
  64. 64. Relative Positions of the Mean, Median, and Mode in a Right Skewed or Positively Skewed Distribution: Mean > Median > Mode. [Figure: right-skewed curve with the mode, median, and mean marked]
  65. 65. The Relative Positions of the Mean, Median, and Mode in a Left Skewed or Negatively Skewed Distribution: Mean < Median < Mode. [Figure: left-skewed curve with the mean, median, and mode marked]
  66. 66. The coefficient of skewness can range from -3.00 up to 3.00. A value of 0 indicates a symmetric distribution. Example 9. Using the twelve stock prices, we find the mean to be 84.42, standard deviation 7.18, and median 84.5. $sk = \frac{3(\bar{X} - \text{Median})}{s} = -0.035$
  67. 67. Kurtosis • derived from the Greek word κυρτός, kyrtos or kurtos, meaning bulging • often described as a measure of the "peakedness" of the probability distribution of a real-valued random variable, although it mainly reflects the weight of the tails • higher kurtosis means more of the variance is due to infrequent extreme deviations, as opposed to frequent modestly-sized deviations.
  68. 68. A distribution with positive kurtosis is called leptokurtic, or leptokurtotic. In terms of shape, a leptokurtic distribution has a more acute "peak" around the mean (that is, a higher probability than a normally distributed variable of values near the mean) and "fat tails" (that is, a higher probability than a normally distributed variable of extreme values). A distribution with negative kurtosis is called platykurtic, or platykurtotic. In terms of shape, a platykurtic distribution has a smaller "peak" around the mean (that is, a lower probability than a normally distributed variable of values near the mean) and "thin tails" (that is, a lower probability than a normally distributed variable of extreme values).
  69. 69. [Figures: a leptokurtic distribution compared with the normal (mesokurtic) distribution; a platykurtic distribution compared with the normal (mesokurtic) distribution]
  70. 70. Comparing Standard Deviations. Data A: Mean = 15.5, s = 3.338. Data B: Mean = 15.5, s = .9258. Data C: Mean = 15.5, s = 4.57. [Figure: three dot plots on a common scale from 11 to 21]
  71. 71. Co-efficient of variation: $CV = \left(\frac{S}{\bar{X}}\right) \times 100\%$ • Measures relative variation • Always in percentage (%) • Shows variation relative to mean • Is used to compare two or more sets of data measured in different units. When the mean value is near zero, the coefficient of variation is sensitive to change in the standard deviation, limiting its usefulness.
  72. 72. Stock A: Average price last year = $50, Standard deviation = $5. $CV = \left(\frac{S}{\bar{X}}\right) \times 100\% = \left(\frac{\$5}{\$50}\right) \times 100\% = 10\%$. Stock B: Average price last year = $100, Standard deviation = $5. $CV = \left(\frac{S}{\bar{X}}\right) \times 100\% = \left(\frac{\$5}{\$100}\right) \times 100\% = 5\%$
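
The comparison of the two stocks is a one-line calculation. A minimal Python sketch, with a hypothetical function name:

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

cv_a = coefficient_of_variation(5, 50)     # Stock A: $5 on an average price of $50
cv_b = coefficient_of_variation(5, 100)    # Stock B: $5 on an average price of $100

print(f"Stock A: {cv_a:.0f}%  Stock B: {cv_b:.0f}%")
```

Both stocks have the same standard deviation, but relative to its average price Stock A varies twice as much as Stock B.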
