# Descriptive statistics

Published in: Technology
• This chapter will focus on descriptive statistics. Using descriptive statistics and the information from Chapters 3, 4, and 5, we will begin the study of inferential statistics.
• Most important characteristics necessary to describe, explore, and compare data sets. page 34 of text
• page 35 of text
• Discussion of what these rating values represent is found on page 33 (the Chapter Problem for Chapter 2).
• Final result of a frequency table on the Qwerty data given in previous slide. Next few slides will develop definitions and steps for developing the frequency table.
• The concept is usually difficult for some students. Emphasize that boundaries are often used in graphical representations of data. See Section 2-3.
• Students will question where the -0.5 and the 14.5 came from. After establishing the boundaries between the existing classes, explain that one should find the boundary below the first class and above the last class, again referencing the use of boundaries in the development of some histograms (Section 2-3).
• Being able to identify the midpoints of each class will be important when determining the mean and standard deviation of a frequency table.
• Class widths ideally should be the same for every class. However, open-ended first or last classes (e.g., 65 years or older) are sometimes necessary to keep from having a large number of classes with a frequency of 0.
• page 36 of text. #1: Classes should not overlap, so that it is clear which class a value belongs to. #2: A common mistake of students developing frequency tables. #3: Open-ended classes are sometimes necessary. #4: Round up to use fewer decimal places or use relevant numbers. #5: With fewer than 5 classes, the table will be too concise; with more than 20, the table will be too large or unwieldy. #6: One test to make sure all data has been included.
• This is the same as the flow chart on the next slide.
• page 37 of text
• page 38 of text. Emphasis on the relative frequency table will assist in developing the concept of probability distributions: the use of percentages with classes will relate to the use of probabilities with random variables.
• page 55 of text
• In this text, the arithmetic mean will be referred to as the MEAN.
• Σ is pronounced ‘sigma’ (Greek uppercase ‘S’) and indicates ‘summation’ of values.
• The mean is sensitive to every value, so one exceptional value can affect the mean dramatically. The median (a couple of slides later) overcomes this disadvantage.
• The mean is sensitive to every value, so one exceptional value can affect the mean dramatically. The median (next slide) overcomes that disadvantage.
• Examples on page 57 of text
• page 58 of text
• page 58 of text Name the two values if the set is bimodal
• Midrange is seldom used because it is too sensitive to extremes. (See Figure 2-12, page 59)
• Data skewed to the left is said to be ‘negatively skewed’, with the mean and median to the left of the mode. Data skewed to the right is said to be ‘positively skewed’, with the mean and median to the right of the mode.
• Data not ‘lopsided’.
• Data lopsided to left (or slants down to the left - definition of skew is ‘slanting’)
• Data lopsided to the right (or slants down to the right)
• Even though the measures of center are all the same, it is obvious from the dotplots of each group of data that there are some differences in the ‘spread’ (or variation) of the data. page 69 of text
• The range is not a very useful measure of variation, as it uses only two values of the data. Other measures of variation (to follow) will be more useful, as they are computed using every data value.
• Ask students to explain what this definition indicates.
• The definition indicates that one should find the average distance each score is from the mean. The use of n-1 in the denominator is necessary because there are only n-1 independent values - that is, only n-1 values can be assigned any number before the nth value is determined. page 70 of text
• This formula is easier to use if computing the standard deviation ‘by hand’, as it does not rely on the use of the mean. Only three values are needed: n, Σx, and Σx².
• page 71 of text
• σ is the lowercase Greek ‘sigma’. Note the division by N (population size), rather than n-1. Most data is a sample, rather than a population; therefore, this formula is not used very often.
• page 74 of text
• After computing the standard deviation, square the resulting number using a calculator.
• The formulas for variance are the same as for standard deviation without the square root. Remind students that squaring a square root results in the radicand.
• Reminder: range is the highest score minus the lowest score
• page 79 of text
• Some students have difficulty understanding the idea of ‘within one standard deviation of the mean’. Emphasize that this means the interval from one standard deviation below the mean to one standard deviation above the mean.
• These percentages will be verified by the concepts learned in Chapter 5. Emphasize the Empirical Rule is appropriate for data that is in a BELL-SHAPED distribution.
• Emphasize Chebyshev’s Theorem applies to data that is in a distribution of any shape - that is, it is less prescriptive than the Empirical Rule. page 80 of text
• This idea will be revisited throughout the study of Elementary Statistics.
• page 85 of text
• This concept of ‘unusual’ values will be revisited several times during the course, especially in Chapter 7 - Hypothesis Testing. page 86 of text
• These 5 numbers will be used in this text’s method for depicting boxplots.
• page 94 of text
• Outliers can have a dramatic effect on certain descriptive statistics (mean and standard deviation). It is important to be aware of such data values.
• page 96 of text
• Medians and quartiles are not very sensitive to extreme values. Refer to Section 2-6 for finding the Q1 and Q3 values, and Section 2-4 for finding the median.
• It appears that the Qwerty data set has a skewed right distribution. page 97 of text
• page 97 of text
### Descriptive statistics

1. Descriptive Statistics: 2-1 Overview; 2-2 Summarizing Data with Frequency Tables; 2-3 Pictures of Data; 2-4 Measures of Center; 2-5 Measures of Variation; 2-6 Measures of Position; 2-7 Exploratory Data Analysis (EDA)
2. 2-1 Overview. Descriptive statistics summarize or describe the important characteristics of a known set of population data. Inferential statistics use sample data to make inferences (or generalizations) about a population.
3. Important Characteristics of Data. (1) Center: a representative or average value that indicates where the middle of the data set is located. (2) Variation: a measure of the amount that the values vary among themselves. (3) Distribution: the nature or shape of the distribution of data (such as bell-shaped, uniform, or skewed). (4) Outliers: sample values that lie very far away from the vast majority of other sample values. (5) Time: changing characteristics of the data over time.
4. 2-2 Summarizing Data with Frequency Tables. A frequency table lists classes (or categories) of values, along with frequencies (or counts) of the number of values that fall into each class.
5. Qwerty Keyboard Word Ratings (Table 2-1): 2 2 5 1 2 6 3 3 4 2 4 0 5 7 7 5 6 6 8 10 7 2 2 10 5 8 2 5 4 2 6 2 6 1 7 2 7 2 3 8 1 5 2 5 2 14 2 2 6 3 1 7
6. Frequency Table of Qwerty Word Ratings (Table 2-3):

   | Rating | Frequency |
   |--------|-----------|
   | 0-2    | 20        |
   | 3-5    | 14        |
   | 6-8    | 15        |
   | 9-11   | 2         |
   | 12-14  | 1         |

7. Frequency Table Definitions
8. Lower Class Limits are the smallest numbers that can actually belong to the different classes. In Table 2-3 (classes 0-2, 3-5, 6-8, 9-11, 12-14) they are 0, 3, 6, 9, and 12.
9. Upper Class Limits are the largest numbers that can actually belong to the different classes. In Table 2-3 they are 2, 5, 8, 11, and 14.
10. Class Boundaries are the numbers used to separate classes, but without the gaps created by class limits.
11. Class Boundaries (the numbers separating classes) for Table 2-3: -0.5, 2.5, 5.5, 8.5, 11.5, 14.5.
12. Class Boundaries (the numbers separating classes) for Table 2-3: -0.5, 2.5, 5.5, 8.5, 11.5, 14.5.
13. Class Midpoints are the midpoints of the classes.
14. Class Midpoints for Table 2-3: 1 (class 0-2), 4 (class 3-5), 7 (class 6-8), 10 (class 9-11), 13 (class 12-14).
15. Class Width is the difference between two consecutive lower class limits or two consecutive class boundaries.
16. Class Width for Table 2-3: 3 for every class. It is the difference between two consecutive lower class limits or two consecutive class boundaries.
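The definitions on slides 8 to 16 can be checked with a short Python sketch. It assumes the classes of Table 2-3 and equal-width classes; the function names are illustrative and do not come from the slides.

```python
# Class limits (lower, upper) taken from Table 2-3 of the slides.
class_limits = [(0, 2), (3, 5), (6, 8), (9, 11), (12, 14)]


def class_boundaries(limits):
    """Numbers that separate the classes, without the gaps left by class limits."""
    # The gap between one upper limit and the next lower limit (1 here) is
    # split in half; the same half-gap extends below the first class and
    # above the last class.
    half_gap = (limits[1][0] - limits[0][1]) / 2
    return [limits[0][0] - half_gap] + [upper + half_gap for _, upper in limits]


def class_midpoints(limits):
    """Midpoint of each class: (lower limit + upper limit) / 2."""
    return [(lower + upper) / 2 for lower, upper in limits]


def class_width(limits):
    """Difference between two consecutive lower class limits."""
    return limits[1][0] - limits[0][0]


boundaries = class_boundaries(class_limits)
midpoints = class_midpoints(class_limits)
width = class_width(class_limits)

print(boundaries)  # [-0.5, 2.5, 5.5, 8.5, 11.5, 14.5]
print(midpoints)   # [1.0, 4.0, 7.0, 10.0, 13.0]
print(width)       # 3
```

This also shows where the -0.5 and 14.5 come from: the half-gap of 0.5 is applied below the first class and above the last one.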
17. Guidelines for Frequency Tables: (1) Be sure that the classes are mutually exclusive. (2) Include all classes, even if the frequency is zero. (3) Try to use the same width for all classes. (4) Select convenient numbers for class limits. (5) Use between 5 and 20 classes. (6) The sum of the class frequencies must equal the number of original data values.
18. Constructing a Frequency Table: (1) Decide on the number of classes. (2) Determine the class width by dividing the range by the number of classes (range = highest score - lowest score) and rounding up: $\text{class width} \approx \text{round up of } \dfrac{\text{range}}{\text{number of classes}}$. (3) Select for the first lower limit either the lowest score or a convenient value slightly less than the lowest score. (4) Add the class width to the starting point to get the second lower class limit, add the width to the second lower limit to get the third, and so on. (5) List the lower class limits in a vertical column and enter the upper class limits. (6) Represent each score by a tally mark in the appropriate class. Total the tally marks to find the frequency for each class.
19. (Flow chart of the steps for constructing a frequency table; the image is not part of the transcript.)
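The construction steps on slide 18 can be sketched in Python using the Qwerty ratings from Table 2-1. This is an illustration rather than the textbook's procedure verbatim: it assumes integer data (so each upper limit is the next lower limit minus 1) and starts the first class at the lowest score. The function name is hypothetical.

```python
import math

# Qwerty keyboard word ratings from Table 2-1 (52 values).
ratings = [
    2, 2, 5, 1, 2, 6, 3, 3, 4, 2, 4, 0, 5,
    7, 7, 5, 6, 6, 8, 10, 7, 2, 2, 10, 5, 8,
    2, 5, 4, 2, 6, 2, 6, 1, 7, 2, 7, 2, 3,
    8, 1, 5, 2, 5, 2, 14, 2, 2, 6, 3, 1, 7,
]


def build_frequency_table(data, num_classes):
    """Return (class width, [(lower limit, upper limit, frequency), ...])."""
    lowest, highest = min(data), max(data)
    # Step 2: class width = range / number of classes, rounded up.
    width = math.ceil((highest - lowest) / num_classes)
    # Steps 3-4: start at the lowest score and keep adding the class width.
    lower_limits = [lowest + i * width for i in range(num_classes)]
    table = []
    for lower in lower_limits:
        upper = lower + width - 1  # step 5; assumes integer data
        # Step 6: tally the scores that fall in this class.
        frequency = sum(lower <= x <= upper for x in data)
        table.append((lower, upper, frequency))
    # Guideline 6: the class frequencies must add up to the number of values.
    if sum(f for _, _, f in table) != len(data):
        raise ValueError("classes do not cover all data; adjust width or start")
    return width, table


width, table = build_frequency_table(ratings, num_classes=5)
print(width)  # 3, since 14 / 5 = 2.8 rounds up to 3
for lower, upper, frequency in table:
    print(f"{lower:>2} - {upper:<2} {frequency}")
```

With five classes the result reproduces Table 2-3. The final check matters when the range divides evenly by the number of classes, because the highest score would then fall outside the last class.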
20. Relative Frequency Table: $\text{relative frequency} = \dfrac{\text{class frequency}}{\text{sum of all frequencies}}$
21. Relative Frequency Table (total frequency = 52; for example, 20/52 = 38.5%, 14/52 = 26.9%, etc.):

   | Rating | Frequency | Relative Frequency |
   |--------|-----------|--------------------|
   | 0-2    | 20        | 38.5%              |
   | 3-5    | 14        | 26.9%              |
   | 6-8    | 15        | 28.8%              |
   | 9-11   | 2         | 3.8%               |
   | 12-14  | 1         | 1.9%               |

22. Cumulative Frequency Table:

   | Rating | Frequency | Rating (cumulative) | Cumulative Frequency |
   |--------|-----------|---------------------|----------------------|
   | 0-2    | 20        | Less than 3         | 20                   |
   | 3-5    | 14        | Less than 6         | 34                   |
   | 6-8    | 15        | Less than 9         | 49                   |
   | 9-11   | 2         | Less than 12        | 51                   |
   | 12-14  | 1         | Less than 15        | 52                   |

23. Frequency Tables (frequency, relative frequency, and cumulative frequency side by side):

   | Rating | Frequency | Relative Frequency | Rating (cumulative) | Cumulative Frequency |
   |--------|-----------|--------------------|---------------------|----------------------|
   | 0-2    | 20        | 38.5%              | Less than 3         | 20                   |
   | 3-5    | 14        | 26.9%              | Less than 6         | 34                   |
   | 6-8    | 15        | 28.8%              | Less than 9         | 49                   |
   | 9-11   | 2         | 3.8%               | Less than 12        | 51                   |
   | 12-14  | 1         | 1.9%               | Less than 15        | 52                   |

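As a quick check of slides 20 to 23, the relative and cumulative frequencies can be derived from the class frequencies alone. The sketch below assumes the frequencies of Table 2-3; the variable names are illustrative.

```python
from itertools import accumulate

# Classes and frequencies from Table 2-3.
classes = ["0-2", "3-5", "6-8", "9-11", "12-14"]
frequencies = [20, 14, 15, 2, 1]

total = sum(frequencies)
# relative frequency = class frequency / sum of all frequencies (as a percent)
relative_percent = [round(100 * f / total, 1) for f in frequencies]
# cumulative frequency = running total of the class frequencies
cumulative = list(accumulate(frequencies))

for label, f, rel, cum in zip(classes, frequencies, relative_percent, cumulative):
    print(f"{label:>6} {f:>3} {rel:>5.1f}% {cum:>3}")
```

The last cumulative frequency equals the total frequency (52), which is another way to confirm that every value was counted.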
24. Measures of Center: a value at the center or middle of a data set.
25. Definitions. Mean (arithmetic mean), commonly called the average: the number obtained by adding the values and dividing the total by the number of values.
26. Notation: Σ denotes the addition of a set of values; x is the variable usually used to represent the individual data values; n represents the number of data values in a sample; N represents the number of data values in a population.
27. Notation: $\bar{x}$ is pronounced ‘x-bar’ and denotes the mean of a set of sample values: $\bar{x} = \dfrac{\sum x}{n}$
28. Notation: $\bar{x}$ is pronounced ‘x-bar’ and denotes the mean of a set of sample values: $\bar{x} = \dfrac{\sum x}{n}$. µ is pronounced ‘mu’ and denotes the mean of all values in a population: $\mu = \dfrac{\sum x}{N}$. Calculators can calculate the mean of data.
29. Definitions. Median: the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude.
30. Definitions. Median: the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude; often denoted by $\tilde{x}$ (pronounced ‘x-tilde’).
31. Definitions. Median: the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude; often denoted by $\tilde{x}$ (pronounced ‘x-tilde’); is not affected by an extreme value.
32. Median, even number of values: 6.72, 3.46, 3.60, 6.44 in order are 3.46, 3.60, 6.44, 6.72. There is no exact middle; it is shared by two numbers, so the median is (3.60 + 6.44) / 2 = 5.02.
33. Median, odd number of values: 6.72, 3.46, 3.60, 6.44, 26.70 in order are 3.46, 3.60, 6.44, 6.72, 26.70. The exact middle value is the median, 6.44. Even number of values: 3.46, 3.60, 6.44, 6.72 has no exact middle, so the median is (3.60 + 6.44) / 2 = 5.02.
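The two median examples, and the point that an extreme value moves the mean far more than the median, can be reproduced with Python's standard `statistics` module. The variable names are illustrative.

```python
import statistics

values = [6.72, 3.46, 3.60, 6.44]   # even number of values
with_extreme = values + [26.70]     # odd number of values, one extreme value

# statistics.median sorts the data and averages the two middle values
# when the count is even.
median_even = statistics.median(values)
median_odd = statistics.median(with_extreme)

mean_even = statistics.mean(values)
mean_odd = statistics.mean(with_extreme)

print(round(median_even, 2), round(median_odd, 2))  # 5.02 6.44
print(round(mean_even, 3), round(mean_odd, 3))
```

Adding 26.70 shifts the median from 5.02 to 6.44, while the mean rises by more than 4, which illustrates why the median is described as resistant to extreme values.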
34. Definitions. Mode: the score that occurs most frequently; denoted by M. A data set can be bimodal, multimodal, or have no mode. The mode is the only measure of central tendency that can be used with nominal data.
35. Examples: (a) 5 5 5 3 1 5 1 4 3 5: mode is 5. (b) 1 2 2 2 3 4 5 6 6 6 7 9: bimodal, 2 and 6. (c) 1 2 3 6 7 8 9 10: no mode.
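The three mode examples can be checked with a small helper. It follows the slide's convention that a data set in which no value repeats has no mode; other sources and libraries handle that case differently, so this convention is an assumption of the sketch. The function name `modes` is illustrative.

```python
from collections import Counter


def modes(data):
    """Return the sorted list of modes; an empty list means 'no mode'."""
    if not data:
        return []
    counts = Counter(data)
    highest = max(counts.values())
    # Slide convention: if every value occurs only once, there is no mode.
    if highest == 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)


example_a = [5, 5, 5, 3, 1, 5, 1, 4, 3, 5]
example_b = [1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9]
example_c = [1, 2, 3, 6, 7, 8, 9, 10]

print(modes(example_a))  # [5]
print(modes(example_b))  # [2, 6]  (bimodal)
print(modes(example_c))  # []      (no mode)
```

Because the helper only counts occurrences, it works unchanged on nominal data such as a list of color names.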
36. Definitions. Midrange: the value midway between the highest and lowest values in the original data set.
37. Definitions. Midrange: the value midway between the highest and lowest values in the original data set. $\text{midrange} = \dfrac{\text{highest score} + \text{lowest score}}{2}$
38. Definitions. Symmetric: data is symmetric if the left half of its histogram is roughly a mirror image of its right half. Skewed: data is skewed if it is not symmetric and if it extends more to one side than the other.
39. Skewness (figure). Symmetric: mode = mean = median.
40. Skewness (figure). Symmetric: mode = mean = median. Skewed left (negatively): the mean and median lie to the left of the mode.
41. Skewness (figure). Symmetric: mode = mean = median. Skewed left (negatively): the mean and median lie to the left of the mode. Skewed right (positively): the mean and median lie to the right of the mode.
42. Waiting Times of Bank Customers at Different Banks (in minutes):

   | Jefferson Valley Bank | Bank of Providence |
   |-----------------------|--------------------|
   | 6.5                   | 4.2                |
   | 6.6                   | 5.4                |
   | 6.7                   | 5.8                |
   | 6.8                   | 6.2                |
   | 7.1                   | 6.7                |
   | 7.3                   | 7.7                |
   | 7.4                   | 7.7                |
   | 7.7                   | 8.5                |
   | 7.7                   | 9.3                |
   | 7.7                   | 10.0               |

43. Waiting Times of Bank Customers at Different Banks (in minutes), same data as slide 42, with the measures of center:

   | Measure  | Jefferson Valley Bank | Bank of Providence |
   |----------|-----------------------|--------------------|
   | Mean     | 7.15                  | 7.15               |
   | Median   | 7.20                  | 7.20               |
   | Mode     | 7.7                   | 7.7                |
   | Midrange | 7.10                  | 7.10               |

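The four measures of center for the two banks can be recomputed with the standard `statistics` module. This is a minimal sketch; the function name `measures_of_center` is illustrative, and `statistics.mode` is used on the assumption that each data set has a single mode, which holds here.

```python
import statistics

# Waiting times in minutes, from slide 42.
jefferson = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
providence = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]


def measures_of_center(data):
    """Mean, median, mode, and midrange of a data set."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "midrange": (max(data) + min(data)) / 2,
    }


jefferson_center = measures_of_center(jefferson)
providence_center = measures_of_center(providence)

for name, value in jefferson_center.items():
    print(f"{name:<9} {value:.2f} {providence_center[name]:.2f}")
```

Both banks give 7.15, 7.20, 7.7, and 7.10, so the measures of center alone cannot distinguish the two sets of waiting times.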
44. Dotplots of Waiting Times (figure)
45. Measures of Variation
46. Measures of Variation. Range = highest value - lowest value
47. Measures of Variation. Standard Deviation: a measure of variation of the scores about the mean (average deviation from the mean).
48. Sample Standard Deviation Formula: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$. Calculators can compute the sample standard deviation of data.
49. Sample Standard Deviation Shortcut Formula: $s = \sqrt{\dfrac{n\left(\sum x^2\right) - \left(\sum x\right)^2}{n(n - 1)}}$. Calculators can compute the sample standard deviation of data.
50. Mean Absolute Deviation Formula: $\dfrac{\sum |x - \bar{x}|}{n}$
51. Population Standard Deviation: $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$. Calculators can compute the population standard deviation of data.
52. Measures of Variation. Variance: standard deviation squared.
53. Measures of Variation. Variance: standard deviation squared. Notation: $s^2$ and $\sigma^2$ (use the square key on a calculator).
54. Variance. Sample variance: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$. Population variance: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$
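The formulas on slides 48 to 54 translate directly into code. The sketch below implements them by hand and compares them with Python's `statistics.stdev` and `statistics.pstdev`, using the bank waiting times from slide 42. The function names are illustrative.

```python
import math
import statistics

jefferson = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
providence = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]


def sample_sd(data):
    """s = sqrt( sum (x - xbar)^2 / (n - 1) )"""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))


def sample_sd_shortcut(data):
    """s = sqrt( (n * sum(x^2) - (sum x)^2) / (n (n - 1)) ); needs no mean."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))


def population_sd(data):
    """sigma = sqrt( sum (x - mu)^2 / N )"""
    n = len(data)
    mu = sum(data) / n
    return math.sqrt(sum((x - mu) ** 2 for x in data) / n)


def mean_absolute_deviation(data):
    """sum |x - xbar| / n"""
    n = len(data)
    mean = sum(data) / n
    return sum(abs(x - mean) for x in data) / n


print(round(sample_sd(jefferson), 2))   # 0.48
print(round(sample_sd(providence), 2))  # 1.82
# Variance is the standard deviation squared.
print(round(sample_sd(providence) ** 2, 2))
```

Although both banks share the same measures of center, the sample standard deviations (about 0.48 versus 1.82 minutes) capture the difference in spread. The shortcut formula subtracts two large, nearly equal quantities, so with floating-point arithmetic the definitional form is usually the safer choice.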
55. Estimation of Standard Deviation: Range Rule of Thumb. Range ≈ 4s, running from $\bar{x} - 2s$ (minimum usual value) through $\bar{x}$ to $\bar{x} + 2s$ (maximum usual value).
56. Estimation of Standard Deviation: Range Rule of Thumb. Range ≈ 4s, running from $\bar{x} - 2s$ (minimum usual value) through $\bar{x}$ to $\bar{x} + 2s$ (maximum usual value), or $s \approx \dfrac{\text{range}}{4} = \dfrac{\text{highest value} - \text{lowest value}}{4}$
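A short sketch shows how rough the range rule of thumb is in practice. It uses the Bank of Providence waiting times from slide 42; the function names are illustrative.

```python
import statistics

providence = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]


def estimate_sd_from_range(data):
    """Range rule of thumb: s is roughly (highest value - lowest value) / 4."""
    return (max(data) - min(data)) / 4


def usual_value_limits(mean, sd):
    """Minimum and maximum 'usual' values: mean - 2s and mean + 2s."""
    return mean - 2 * sd, mean + 2 * sd


estimate = estimate_sd_from_range(providence)
actual = statistics.stdev(providence)
low, high = usual_value_limits(statistics.mean(providence), actual)

print(round(estimate, 2), round(actual, 2))  # 1.45 1.82
print(round(low, 2), round(high, 2))
```

The estimate (1.45) is in the right neighborhood of the computed value (1.82) but is not a substitute for it; the rule is meant for quick checks and interpretation.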
57. The Empirical Rule (applies to bell-shaped distributions), Figure 2-15: bell-shaped curve centered at $\bar{x}$.
58. The Empirical Rule (applies to bell-shaped distributions), Figure 2-15: 68% of data are within 1 standard deviation of the mean, between $\bar{x} - s$ and $\bar{x} + s$ (34% on each side of the mean).
59. The Empirical Rule (applies to bell-shaped distributions), Figure 2-15: 68% within 1 standard deviation (34% on each side of the mean); 95% within 2 standard deviations, between $\bar{x} - 2s$ and $\bar{x} + 2s$ (13.5% in each band between 1 and 2 standard deviations from the mean).
60. The Empirical Rule (applies to bell-shaped distributions), Figure 2-15: 68% within 1 standard deviation (34% on each side of the mean); 95% within 2 standard deviations (13.5% in each band between 1 and 2 standard deviations); 99.7% of data are within 3 standard deviations of the mean, between $\bar{x} - 3s$ and $\bar{x} + 3s$ (2.4% in each band between 2 and 3 standard deviations, 0.1% in each tail beyond 3).
61. Chebyshev’s Theorem: applies to distributions of any shape. The proportion (or fraction) of any set of data lying within K standard deviations of the mean is always at least $1 - 1/K^2$, where K is any positive number greater than 1. At least 3/4 (75%) of all values lie within 2 standard deviations of the mean. At least 8/9 (89%) of all values lie within 3 standard deviations of the mean.
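Chebyshev's bound can be compared with the actual proportion in a data set that is not bell-shaped, such as the Qwerty ratings from Table 2-1. In this sketch the data are treated as a complete population (so `statistics.pstdev` is used), which is an assumption made for the illustration; the function names are hypothetical.

```python
import statistics

# Qwerty keyboard word ratings from Table 2-1 (52 values).
ratings = [
    2, 2, 5, 1, 2, 6, 3, 3, 4, 2, 4, 0, 5,
    7, 7, 5, 6, 6, 8, 10, 7, 2, 2, 10, 5, 8,
    2, 5, 4, 2, 6, 2, 6, 1, 7, 2, 7, 2, 3,
    8, 1, 5, 2, 5, 2, 14, 2, 2, 6, 3, 1, 7,
]


def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations: 1 - 1/k^2."""
    if k <= 1:
        raise ValueError("the bound is only informative for K greater than 1")
    return 1 - 1 / k ** 2


def proportion_within(data, k):
    """Observed proportion of values within k standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    inside = sum(abs(x - mean) <= k * sd for x in data)
    return inside / len(data)


for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 3), round(proportion_within(ratings, k), 3))
```

The observed proportions sit well above the guaranteed minimums of 75% and 89%, which is typical: the theorem gives a floor that holds for any shape, not an estimate.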
62. Measures of Variation Summary: For typical data sets, it is unusual for a score to differ from the mean by more than 2 or 3 standard deviations.
63. Measures of Position. z score (or standard score): the number of standard deviations that a given value x is above or below the mean.
64. Measures of Position: z score. Sample: $z = \dfrac{x - \bar{x}}{s}$. Population: $z = \dfrac{x - \mu}{\sigma}$
65. Interpreting z Scores (Figure 2-16): a z axis from -3 to 3, with ordinary values in the middle and unusual values in both tails.
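The sample z score formula can be applied to the Qwerty ratings from Table 2-1. The cutoff of 2 standard deviations used below to flag unusual values is an assumption of this sketch, taken from the minimum and maximum usual values in the range rule of thumb; the function names are illustrative.

```python
import statistics

# Qwerty keyboard word ratings from Table 2-1 (52 values).
ratings = [
    2, 2, 5, 1, 2, 6, 3, 3, 4, 2, 4, 0, 5,
    7, 7, 5, 6, 6, 8, 10, 7, 2, 2, 10, 5, 8,
    2, 5, 4, 2, 6, 2, 6, 1, 7, 2, 7, 2, 3,
    8, 1, 5, 2, 5, 2, 14, 2, 2, 6, 3, 1, 7,
]


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above or below the mean."""
    return (x - mean) / sd


def is_unusual(z, cutoff=2.0):
    """Flag values more than `cutoff` standard deviations from the mean."""
    return abs(z) > cutoff


mean = statistics.mean(ratings)
s = statistics.stdev(ratings)  # sample standard deviation

z_for_14 = z_score(14, mean, s)
z_for_10 = z_score(10, mean, s)

print(round(z_for_14, 2), is_unusual(z_for_14))
print(round(z_for_10, 2), is_unusual(z_for_10))
```

Under this cutoff the rating of 14 is flagged as unusual, while a rating of 10 falls just inside the ordinary range.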
66. Measures of Position: Quartiles, Deciles, Percentiles
67. Quartiles: Q1, Q2, Q3 divide ranked scores into four equal parts of 25% each between the minimum and the maximum; Q2 is the median.
68. Deciles: D1, D2, D3, D4, D5, D6, D7, D8, D9 divide ranked data into ten equal parts of 10% each.
69. Percentiles: 99 percentiles, which divide ranked data into 100 equal parts.
70. Quartiles, Deciles, Percentiles: fractiles (quantiles) partition data into approximately equal parts.
71. Exploratory Data Analysis: the process of using statistical tools (such as graphs, measures of center, and measures of variation) to investigate data sets in order to understand their important characteristics.
72. Outliers: a value located very far away from almost all of the other values; an extreme value; can have a dramatic effect on the mean, the standard deviation, and the scale of the histogram, so that the true nature of the distribution is totally obscured.
73. Boxplots (Box-and-Whisker Diagram). Reveals the center of the data, the spread of the data, the distribution of the data, and the presence of outliers. Excellent for comparing two or more data sets.
74. Boxplots: 5-number summary. Minimum; first quartile Q1; median (Q2); third quartile Q3; maximum.
75. Boxplot of Qwerty Word Ratings (figure): axis from 0 to 14; 5-number summary 0, 2, 4, 6, 14.
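The 5-number summary behind the Qwerty boxplot can be recomputed with `statistics.quantiles` (available in Python 3.8 and later). Quartile conventions differ between textbooks and software, so the library's default method is an assumption of this sketch rather than the text's procedure; for this data set it yields the same values as the slide. The function name is illustrative.

```python
import statistics

# Qwerty keyboard word ratings from Table 2-1 (52 values).
ratings = [
    2, 2, 5, 1, 2, 6, 3, 3, 4, 2, 4, 0, 5,
    7, 7, 5, 6, 6, 8, 10, 7, 2, 2, 10, 5, 8,
    2, 5, 4, 2, 6, 2, 6, 1, 7, 2, 7, 2, 3,
    8, 1, 5, 2, 5, 2, 14, 2, 2, 6, 3, 1, 7,
]


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    q1, q2, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, q2, q3, max(data)


summary = five_number_summary(ratings)
print(summary)  # (0, 2.0, 4.0, 6.0, 14)
```

The long stretch from Q3 = 6 to the maximum of 14, compared with the short stretch from the minimum to Q1, is what makes the boxplot look skewed to the right.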
76. Boxplots of bell-shaped, uniform, and skewed distributions (figure).
77. Exploring. Measures of center: mean, median, and mode. Measures of variation: standard deviation and range. Measures of spread and relative location: minimum value, maximum value, and quartiles. Unusual values: outliers. Distribution: histograms, stem-and-leaf plots, and boxplots.