# Basic Stat Notes

Published in: Technology

2. Contents <ul><li>Construct a frequency distribution both manually and with a computer</li><li>Construct and interpret a histogram</li><li>Create and interpret bar charts, pie charts, and stem-and-leaf diagrams</li><li>Present and interpret data in line charts and scatter diagrams</li></ul> 06/08/09
3. Frequency Distributions <ul><li>What is a frequency distribution?</li><li>A frequency distribution is a list or table containing the values of a variable (or a set of ranges within which the data fall) and the corresponding frequencies with which each value occurs (or with which the data fall within each range).</li></ul>
4. Why Use Frequency Distributions? <ul><li>A frequency distribution is a way to summarize data</li><li>The distribution condenses the raw data into a more useful form and allows for a quick visual interpretation of the data</li></ul>
5. Frequency Distribution: Discrete Data <ul><li>Discrete data: possible values are countable</li><li>Example: An advertiser asks 200 customers how many days per week they read the daily newspaper.</li></ul>

| Number of days read | Frequency |
|---|---|
| 0 | 44 |
| 1 | 24 |
| 2 | 18 |
| 3 | 16 |
| 4 | 20 |
| 5 | 22 |
| 6 | 26 |
| 7 | 30 |
| Total | 200 |
6. Relative Frequency <ul><li>Relative frequency: What proportion is in each category?</li><li>22% of the people in the sample report that they read the newspaper 0 days per week</li></ul>

| Number of days read | Frequency | Relative frequency |
|---|---|---|
| 0 | 44 | .22 |
| 1 | 24 | .12 |
| 2 | 18 | .09 |
| 3 | 16 | .08 |
| 4 | 20 | .10 |
| 5 | 22 | .11 |
| 6 | 26 | .13 |
| 7 | 30 | .15 |
| Total | 200 | 1.00 |
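The relative frequencies in this table can be reproduced with a few lines of Python. This is only a sketch: the variable names are illustrative and not part of the slides.

```python
# Frequency table from the newspaper example: days read per week -> count.
frequency = {0: 44, 1: 24, 2: 18, 3: 16, 4: 20, 5: 22, 6: 26, 7: 30}

total = sum(frequency.values())  # sample size (200 customers)

# Relative frequency = count in the category / total count.
relative_frequency = {days: count / total for days, count in frequency.items()}

for days, share in relative_frequency.items():
    print(f"{days} days: frequency {frequency[days]:3d}, relative frequency {share:.2f}")
```

The relative frequencies always sum to 1, which is a quick check on a hand-built table.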
7. Frequency Distribution: Continuous Data <ul><li>Continuous data: may take on any value in some interval</li><li>Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature: 24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27</li><li>Temperature is a continuous variable because it could be measured to any desired degree of precision</li></ul>
8. Grouping Data by Classes <ul><li>Sort the raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58</li><li>Find the range: 58 - 12 = 46</li><li>Select the number of classes: 5 (usually between 5 and 20)</li><li>Compute the class width: 46/5 = 9.2, rounded up to 10</li><li>Determine the class boundaries: 10, 20, 30, 40, 50, 60</li><li>Compute the class midpoints: 15, 25, 35, 45, 55</li><li>Count the observations and assign them to classes</li></ul>
9. Frequency Distribution Example <ul><li>Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58</li></ul>

| Class | Frequency | Relative frequency |
|---|---|---|
| 10 but under 20 | 3 | .15 |
| 20 but under 30 | 6 | .30 |
| 30 but under 40 | 5 | .25 |
| 40 but under 50 | 4 | .20 |
| 50 but under 60 | 2 | .10 |
| Total | 20 | 1.00 |
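The grouping procedure can be sketched in Python as follows. The class width of 10 and the first lower boundary of 10 come from the slides; the helper `class_lower_bound` is a hypothetical name introduced for illustration.

```python
from collections import Counter

# Daily high temperatures from the insulation example.
temperatures = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
                32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

class_width = 10        # chosen in the slides (46 / 5 rounded to 10)
first_lower_bound = 10  # lower boundary of the first class


def class_lower_bound(value):
    """Return the lower boundary of the class 'low but under low + width'."""
    offset = (value - first_lower_bound) // class_width
    return first_lower_bound + offset * class_width


counts = Counter(class_lower_bound(t) for t in temperatures)
n = len(temperatures)

# Each row: (lower boundary, upper boundary, frequency, relative frequency).
table = [(low, low + class_width, counts[low], counts[low] / n)
         for low in sorted(counts)]

for low, high, freq, rel in table:
    print(f"{low} but under {high}: {freq:2d}  {rel:.2f}")
```

Changing `class_width` or `first_lower_bound` shows how sensitive the shape of the distribution is to the choice of classes.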
10. Histograms <ul><li>The classes or intervals are shown on the horizontal axis</li><li>Frequency is measured on the vertical axis</li><li>Bars of the appropriate heights represent the number of observations within each class</li><li>Such a graph is called a histogram</li></ul>
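11. Histogram Example <ul><li>Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58</li><li>[Figure: histogram of the data, with class midpoints on the horizontal axis]</li><li>There are no gaps between the bars, since the data are continuous</li></ul>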
12. Questions for Grouping Data into Classes <ul><li>1. How wide should each interval be? (How many classes should be used?)</li><li>2. How should the endpoints of the intervals be determined?<ul><li>Often answered by trial and error, subject to user judgment</li><li>The goal is to create a distribution that is neither too "jagged" nor too "blocky"</li><li>The goal is to show the pattern of variation in the data appropriately</li></ul></li></ul>
13. How Many Class Intervals? <ul><li>Many (narrow class intervals)<ul><li>may yield a very jagged distribution with gaps from empty classes</li><li>can give a poor indication of how frequency varies across classes</li></ul></li><li>Few (wide class intervals)<ul><li>may compress variation too much and yield a blocky distribution</li><li>can obscure important patterns of variation</li></ul></li><li>[Figure: example histograms; the x-axis labels are upper class endpoints]</li></ul>
14. General Guidelines

| Number of data points | Number of classes |
|---|---|
| under 50 | 5 - 7 |
| 50 - 100 | 6 - 10 |
| 100 - 250 | 7 - 12 |
| over 250 | 10 - 20 |

<ul><li>Class widths can typically be reduced as the number of observations increases</li><li>Distributions with numerous observations are more likely to be smooth and to have gaps filled, since data are plentiful</li></ul>
15. Class Width <ul><li>The class width is the distance between the lowest possible value and the highest possible value for a frequency class</li><li>The minimum class width is $W = \dfrac{\text{Largest Value} - \text{Smallest Value}}{\text{Number of Classes}}$</li></ul>
16. Histograms in Excel <ul><li>Step 1: Select Tools/Data Analysis</li></ul>
17. Histograms in Excel (continued) <ul><li>Step 2: Choose Histogram</li><li>Step 3: Input data and bin ranges, and select Chart Output</li></ul>
18. Stem and Leaf Diagram <ul><li>A simple way to see distribution details in a data set</li><li>Method: Separate the sorted data series into leading digits (the stem) and trailing digits (the leaves)</li></ul>
19. Example <ul><li>Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58</li><li>Here, use the 10's digit for the stem unit:<ul><li>12 is shown as stem 1, leaf 2</li><li>35 is shown as stem 3, leaf 5</li></ul></li></ul>
20. Example <ul><li>Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58</li><li>Completed stem-and-leaf diagram:</li></ul>

| Stem | Leaves |
|---|---|
| 1 | 2 3 7 |
| 2 | 1 4 4 6 7 7 |
| 3 | 0 2 5 7 8 |
| 4 | 1 3 4 6 |
| 5 | 3 8 |
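A stem-and-leaf diagram with the 10's digit as the stem can be built mechanically. The following is a minimal sketch, assuming two-digit non-negative integers; the variable names are illustrative.

```python
from collections import defaultdict

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

# Map each stem (10's digit) to its list of leaves (1's digits).
stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

Each row of the output lists every observation that shares a stem, so the row lengths give the same picture as a histogram with a class width of 10.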
21. Using Other Stem Units <ul><li>Using the 100's digit as the stem, round off the 10's digit to form the leaves:<ul><li>613 becomes stem 6, leaf 1</li><li>776 becomes stem 7, leaf 8</li><li>. . .</li><li>1224 becomes stem 12, leaf 2</li></ul></li></ul>
22. Graphing Categorical Data <ul><li>Categorical data can be graphed with pie charts, bar charts, and Pareto diagrams</li></ul>
23. Bar and Pie Charts <ul><li>Bar charts and pie charts are often used for qualitative (category) data</li><li>The height of the bar or the size of the pie slice shows the frequency or percentage for each category</li></ul>
24. Pie Chart Example: Current Investment Portfolio (variables are qualitative)

| Investment type | Amount (in thousands \$) | Percentage |
|---|---|---|
| Stocks | 46.5 | 42.27 |
| Bonds | 32.0 | 29.09 |
| CD | 15.5 | 14.09 |
| Savings | 16.0 | 14.55 |
| Total | 110 | 100 |

<ul><li>[Figure: pie chart with slices Stocks 42%, Bonds 29%, Savings 15%, CD 14%; percentages are rounded to the nearest percent]</li></ul>
25. Bar Chart Example [Figure: bar chart]
26. Pareto Diagram Example [Figure: Pareto diagram showing the % invested in each category as a bar graph and the cumulative % invested as a line graph]
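The two series plotted in a Pareto diagram can be computed from the investment amounts of the pie chart example. This sketch assumes the usual convention of ordering categories from largest to smallest; the variable names are illustrative.

```python
# Investment amounts in thousands of dollars (from the pie chart example).
investments = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}

total = sum(investments.values())

# Pareto ordering: largest category first.
ordered = sorted(investments.items(), key=lambda item: item[1], reverse=True)

pareto = []       # rows of (category, percent, cumulative percent)
cumulative = 0.0
for category, amount in ordered:
    percent = 100 * amount / total
    cumulative += percent
    pareto.append((category, percent, cumulative))

for category, percent, cumulative_percent in pareto:
    print(f"{category:8s} {percent:6.2f}%  cumulative {cumulative_percent:6.2f}%")
```

The `percent` column gives the bar heights and the cumulative column gives the line.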
27. Bar Chart Example [Figure: bar chart of the newspaper data]

| Number of days read | Frequency |
|---|---|
| 0 | 44 |
| 1 | 24 |
| 2 | 18 |
| 3 | 16 |
| 4 | 20 |
| 5 | 22 |
| 6 | 26 |
| 7 | 30 |
| Total | 200 |
28. Tabulating and Graphing Multivariate Categorical Data <ul><li>Investment in thousands of dollars</li></ul>

| Investment category | Investor A | Investor B | Investor C | Total |
|---|---|---|---|---|
| Stocks | 46.5 | 55 | 27.5 | 129 |
| Bonds | 32.0 | 44 | 19.0 | 95 |
| CD | 15.5 | 20 | 13.5 | 49 |
| Savings | 16.0 | 28 | 7.0 | 51 |
| Total | 110.0 | 147 | 67.0 | 324 |
29. Tabulating and Graphing Multivariate Categorical Data (continued) <ul><li>Side-by-side charts</li></ul>
30. Side-by-Side Chart Example <ul><li>Sales by quarter for three sales territories [Figure: side-by-side bar chart]</li></ul>
31. Line Charts and Scatter Diagrams <ul><li>Line charts show values of one variable vs. time<ul><li>Time is traditionally shown on the horizontal axis</li></ul></li><li>Scatter diagrams show points for bivariate data<ul><li>One variable is measured on the vertical axis and the other variable is measured on the horizontal axis</li></ul></li></ul>
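32. Line Chart Example [Figure: line chart of inflation rate by year]

| Year | Inflation rate |
|---|---|
| 1985 | 3.56 |
| 1986 | 1.86 |
| 1987 | 3.65 |
| 1988 | 4.14 |
| 1989 | 4.82 |
| 1990 | 5.40 |
| 1991 | 4.21 |
| 1992 | 3.01 |
| 1993 | 2.99 |
| 1994 | 2.56 |
| 1995 | 2.83 |
| 1996 | 2.95 |
| 1997 | 2.29 |
| 1998 | 1.56 |
| 1999 | 2.21 |
| 2000 | 3.36 |
| 2001 | 2.85 |
| 2002 | 1.58 |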
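33. Scatter Diagram Example [Figure: scatter diagram of cost per day against volume per day]

| Volume per day | Cost per day |
|---|---|
| 23 | 125 |
| 26 | 140 |
| 29 | 146 |
| 33 | 160 |
| 38 | 167 |
| 42 | 170 |
| 50 | 188 |
| 55 | 195 |
| 60 | 200 |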
34. Types of Relationships <ul><li>Linear relationships [Figure: scatter diagrams]</li></ul>
35. Types of Relationships (continued) <ul><li>Curvilinear relationships [Figure: scatter diagrams]</li></ul>
36. Types of Relationships (continued) <ul><li>No relationship [Figure: scatter diagrams]</li></ul>
37. Chapter Summary <ul><li>Data in raw form are usually not easy to use for decision making. Some type of organization is needed:<ul><li>Table</li><li>Graph</li></ul></li><li>Techniques reviewed in this chapter:<ul><li>Frequency distributions and histograms</li><li>Bar charts and pie charts</li><li>Stem and leaf diagrams</li><li>Line charts and scatter diagrams</li></ul></li></ul>
38. Summarization Measures <ul><li>Summarization measures represent the data with a single number or a few numbers. They help both to describe a data set and to compare data sets. Population measures can be estimated from the summary measures of a sample.</li><li>The measures used to represent the data are:<ul><li>Measures of center and location: mean, median, mode, geometric mean, midrange</li><li>Other measures of location: weighted mean, percentiles, quartiles</li><li>Measures of variation: range, interquartile range, variance and standard deviation, coefficient of variation</li></ul></li></ul>
39. Summary Measures: Describing Data Numerically <ul><li>Center and location: mean, median, mode, weighted mean</li><li>Other measures of location: percentiles, quartiles</li><li>Variation: range, interquartile range, variance, standard deviation, coefficient of variation</li></ul>
40. Overview: Measures of Center and Location <ul><li>Mean</li><li>Median</li><li>Mode</li><li>Weighted mean</li></ul>
41. Mean (Arithmetic Average) <ul><li>The mean is the arithmetic average of the data values</li><li>Sample mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$, where $n$ = sample size</li><li>Population mean: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$, where $N$ = population size</li></ul>
42. Mean (Arithmetic Average) <ul><li>The most common measure of central tendency</li><li>Mean = sum of values divided by the number of values</li><li>Affected by extreme values (outliers)</li><li>[Figure: two dot plots on a 0 to 10 axis, one with Mean = 3 and one with Mean = 4]</li></ul>
43. Median <ul><li>Not affected by extreme values</li><li>In an ordered array, the median is the "middle" number<ul><li>If n or N is odd, the median is the middle number</li><li>If n or N is even, the median is the average of the two middle numbers</li></ul></li><li>[Figure: two dot plots on a 0 to 10 axis, both with Median = 3]</li></ul>
44. Mode <ul><li>A measure of central tendency</li><li>Value that occurs most often</li><li>Not affected by extreme values</li><li>Used for either numerical or categorical data</li><li>There may be no mode</li><li>There may be several modes</li><li>[Figure: a dot plot on a 0 to 14 axis with Mode = 5, and a dot plot on a 0 to 6 axis with no mode]</li></ul>
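45. Weighted Mean <ul><li>Used when values are grouped by frequency or relative importance</li><li>Example: sample of 26 repair projects, days to complete</li></ul>

| Days to complete | Frequency |
|---|---|
| 5 | 4 |
| 6 | 12 |
| 7 | 8 |
| 8 | 2 |

<ul><li>Weighted mean days to complete: $\bar{x}_w = \dfrac{\sum w_i x_i}{\sum w_i} = \dfrac{(4)(5) + (12)(6) + (8)(7) + (2)(8)}{26} = \dfrac{164}{26} \approx 6.31$ days</li></ul>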
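The weighted mean for the repair-project example can be checked with a short Python sketch, using the frequencies as weights. The variable names are illustrative.

```python
# Repair-project example: days to complete and how many projects took that long.
days_to_complete = [5, 6, 7, 8]
frequency = [4, 12, 8, 2]

total_weight = sum(frequency)  # 26 projects in the sample

# Weighted mean = sum(weight * value) / sum(weight).
weighted_sum = sum(d * f for d, f in zip(days_to_complete, frequency))
weighted_mean = weighted_sum / total_weight

print(f"Weighted mean days to complete: {weighted_mean:.2f}")
```

With frequencies as weights, the weighted mean equals the ordinary mean of the 26 individual observations.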
46. Review Example <ul><li>Five houses on a hill by the beach</li><li>House prices: \$2,000,000; \$500,000; \$300,000; \$100,000; \$100,000</li></ul>
47. Summary Statistics <ul><li>House prices: \$2,000,000; \$500,000; \$300,000; \$100,000; \$100,000 (sum \$3,000,000)</li><li>Mean: \$3,000,000/5 = \$600,000</li><li>Median: middle value of ranked data = \$300,000</li><li>Mode: most frequent value = \$100,000</li></ul>
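These three summary statistics can be reproduced with Python's standard `statistics` module. The sketch below only restates the house-price example; the variable names are illustrative.

```python
import statistics

# House prices from the review example, in dollars.
house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(house_prices)      # pulled upward by the $2,000,000 house
median_price = statistics.median(house_prices)  # middle value of the ranked data
mode_price = statistics.mode(house_prices)      # most frequent value

print(f"Mean:   ${mean_price:,.0f}")
print(f"Median: ${median_price:,.0f}")
print(f"Mode:   ${mode_price:,.0f}")
```

The single expensive house raises the mean to twice the median, which illustrates how sensitive the mean is to extreme values.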
48. Which Measure of Location Is the "Best"? <ul><li>The mean is generally used, unless extreme values (outliers) exist</li><li>Then the median is often used, since the median is not sensitive to extreme values<ul><li>Example: Median home prices may be reported for a region because they are less sensitive to outliers</li></ul></li></ul>
49. Shape of a Distribution <ul><li>Describes how data are distributed</li><li>Symmetric or skewed</li></ul>

| Shape | Relationship | Tail |
|---|---|---|
| Left-skewed | Mean < Median < Mode | Longer tail extends to the left |
| Symmetric | Mean = Median = Mode | |
| Right-skewed | Mode < Median < Mean | Longer tail extends to the right |
50. Other Location Measures <ul><li>Percentiles: the $p$th percentile in a data array (where $0 \le p \le 100$)<ul><li>$p$% of the values are less than or equal to this value</li><li>$(100 - p)$% of the values are greater than or equal to this value</li></ul></li><li>Quartiles<ul><li>1st quartile = 25th percentile</li><li>2nd quartile = 50th percentile = median</li><li>3rd quartile = 75th percentile</li></ul></li></ul>
51. Percentiles <ul><li>The $p$th percentile in an ordered array of $n$ values is the value in the $i$th position, where $i = \dfrac{p}{100}(n + 1)$</li><li>Example: The 60th percentile in an ordered array of 19 values is the value in the 12th position: $i = \dfrac{60}{100}(19 + 1) = 12$</li></ul>
52. Quartiles <ul><li>Quartiles split the ranked data into 4 equal groups of 25% each, separated by Q1, Q2, and Q3</li><li>Sample data in ordered array ($n = 9$): 11 12 13 16 16 17 18 21 22</li><li>Example: Find the first quartile. Q1 is the 25th percentile, so find the $\dfrac{25}{100}(9 + 1) = 2.5$ position. Use the value halfway between the 2nd and 3rd values, so Q1 = 12.5</li></ul>
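The position rule $i = \frac{p}{100}(n + 1)$ can be sketched in Python. The function name `percentile` is hypothetical, and the linear interpolation between neighbouring values is an assumption that generalizes the "halfway between" step of the Q1 example to other fractional positions.

```python
def percentile(ordered, p):
    """Value at position i = (p / 100) * (n + 1), counting positions from 1.

    Fractional positions are resolved by linear interpolation between the
    two neighbouring values (an assumption; the slides only show the 0.5 case).
    """
    n = len(ordered)
    position = p * (n + 1) / 100
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part of the position
    if lower < 1:                  # position falls before the first value
        return ordered[0]
    if lower >= n:                 # position falls at or after the last value
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]  # ordered array, n = 9
q1 = percentile(sample, 25)
q2 = percentile(sample, 50)
q3 = percentile(sample, 75)
print(q1, q2, q3)
```

Other textbooks and software packages use slightly different position rules, so quartiles computed elsewhere may differ a little from these.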
53. Box and Whisker Plot <ul><li>A graphical display of data using the 5-number summary: Minimum, Q1, Median, Q3, Maximum</li><li>[Figure: example box and whisker plot, with 25% of the data in each of the four segments]</li></ul>
54. Shape of Box and Whisker Plots <ul><li>The box and central line are centered between the endpoints if the data are symmetric around the median</li><li>A box and whisker plot can be shown in either vertical or horizontal format</li></ul>
55. Distribution Shape and Box and Whisker Plot [Figure: three box and whisker plots, each marked with Q1, Q2, and Q3, for left-skewed, symmetric, and right-skewed distributions]
56. Box-and-Whisker Plot Example <ul><li>Data: 0 2 2 2 3 3 4 5 5 10 27</li><li>5-number summary: Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27 [Figure: box-and-whisker plot of the data]</li><li>The data are very right-skewed, as the plot depicts</li></ul>
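The 5-number summary for this data set can be reproduced with the standard library. The sketch assumes Python 3.8 or later, where `statistics.quantiles` with `method="exclusive"` places the quartiles at positions based on $n + 1$, the same rule as the percentile formula in these notes.

```python
import statistics

data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]

# Quartile cut points using the (n + 1)-based "exclusive" method.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary)
```

The long distance from Q3 to the maximum, compared with the short distance from the minimum to Q1, is what makes the plot look right-skewed.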
57. Measures of Variation <ul><li>Range</li><li>Interquartile range</li><li>Variance: population variance, sample variance</li><li>Standard deviation: population standard deviation, sample standard deviation</li><li>Coefficient of variation</li></ul>
58. Variation <ul><li>Measures of variation give information on the spread or variability of the data values</li><li>[Figure: two distributions with the same center but different variation]</li></ul>
59. Range <ul><li>Difference between the largest and the smallest observations: Range = $x_{\text{maximum}} - x_{\text{minimum}}$</li><li>Example: Range = 14 - 1 = 13 [Figure: dot plot on a 0 to 14 axis]</li></ul>
60. Disadvantages of the Range <ul><li>Ignores the way in which data are distributed [Figure: two different distributions on a 7 to 12 axis, each with Range = 12 - 7 = 5]</li><li>Sensitive to outliers<ul><li>1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5: Range = 5 - 1 = 4</li><li>1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 120: Range = 120 - 1 = 119</li></ul></li></ul>
61. Interquartile Range <ul><li>Some outlier problems can be eliminated by using the interquartile range</li><li>Eliminate some high- and low-valued observations and calculate the range from the remaining values</li><li>Interquartile range = 3rd quartile - 1st quartile</li></ul>
62. Interquartile Range <ul><li>Example: X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70, with 25% of the data in each of the four segments</li><li>Interquartile range = 57 - 30 = 27</li></ul>
63. Variance <ul><li>Average of squared deviations of values from the mean</li><li>Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$</li><li>Population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$</li></ul>
64. Standard Deviation <ul><li>Most commonly used measure of variation</li><li>Shows variation about the mean</li><li>Has the same units as the original data</li><li>Sample standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$</li><li>Population standard deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$</li></ul>
65. Calculation Example: Sample Standard Deviation <ul><li>Sample data ($x_i$): 10 12 14 15 17 18 18 24</li><li>$n = 8$, mean $\bar{x} = 16$</li></ul>
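The calculation for this sample can be carried out step by step in Python. This is a sketch using the sample standard deviation formula with the $n - 1$ divisor; the variable names are illustrative, and `statistics.stdev` serves as a cross-check.

```python
import math
import statistics

sample = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(sample)
mean = sum(sample) / n  # 16.0

# Squared deviation of each value from the sample mean.
squared_deviations = [(x - mean) ** 2 for x in sample]

# Sample variance divides by n - 1, not n.
sample_variance = sum(squared_deviations) / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(f"sum of squared deviations = {sum(squared_deviations):.0f}")
print(f"s^2 = {sample_variance:.4f}, s = {sample_sd:.4f}")

# Cross-check against the standard library implementation.
assert math.isclose(sample_sd, statistics.stdev(sample))
```

The squared deviations sum to 130, so $s = \sqrt{130/7} \approx 4.31$.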
66. Comparing Standard Deviations [Figure: three dot plots on an 11 to 21 axis]

| Data set | Mean | s |
|---|---|---|
| Data A | 15.5 | 3.338 |
| Data B | 15.5 | .9258 |
| Data C | 15.5 | 4.57 |
67. Coefficient of Variation <ul><li>Measures relative variation</li><li>Always expressed as a percentage (%)</li><li>Shows variation relative to the mean</li><li>Is used to compare two or more sets of data measured in different units</li><li>Population: $CV = \left(\dfrac{\sigma}{\mu}\right) \cdot 100\%$</li><li>Sample: $CV = \left(\dfrac{s}{\bar{x}}\right) \cdot 100\%$</li></ul>
68. Comparing Coefficient of Variation <ul><li>Stock A:<ul><li>Average price last year = \$50</li><li>Standard deviation = \$5</li><li>$CV_A = \left(\dfrac{5}{50}\right) \cdot 100\% = 10\%$</li></ul></li><li>Stock B:<ul><li>Average price last year = \$100</li><li>Standard deviation = \$5</li><li>$CV_B = \left(\dfrac{5}{100}\right) \cdot 100\% = 5\%$</li></ul></li><li>Both stocks have the same standard deviation, but stock B is less variable relative to its price</li></ul>
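The comparison of the two stocks can be sketched in a few lines of Python. The helper name `coefficient_of_variation` is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean


# Stock figures from the example: average price and standard deviation in dollars.
cv_a = coefficient_of_variation(std_dev=5, mean=50)
cv_b = coefficient_of_variation(std_dev=5, mean=100)

print(f"Stock A: CV = {cv_a:.0f}%")
print(f"Stock B: CV = {cv_b:.0f}%")
```

Because the result is a unitless percentage, the same function can compare data sets measured in different units.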
69. The Empirical Rule <ul><li>If the data distribution is bell-shaped, then the interval $\mu \pm 1\sigma$ contains about 68% of the values in the population or the sample</li></ul>
70. The Empirical Rule <ul><li>$\mu \pm 2\sigma$ contains about 95% of the values in the population or the sample</li><li>$\mu \pm 3\sigma$ contains about 99.7% of the values in the population or the sample</li></ul>
71. Tchebysheff's Theorem <ul><li>Regardless of how the data are distributed, at least $(1 - 1/k^2)$ of the values will fall within $k$ standard deviations of the mean</li><li>Examples:<ul><li>At least $(1 - 1/1^2) = 0\%$ within $k = 1$ ($\mu \pm 1\sigma$)</li><li>At least $(1 - 1/2^2) = 75\%$ within $k = 2$ ($\mu \pm 2\sigma$)</li><li>At least $(1 - 1/3^2) = 89\%$ within $k = 3$ ($\mu \pm 3\sigma$)</li></ul></li></ul>
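The bound $(1 - 1/k^2)$ is easy to tabulate. The following Python sketch uses a hypothetical helper `chebyshev_lower_bound` and simply evaluates the expression for the three values of $k$ in the examples.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k < 1 the expression is negative, so the bound is the trivial 0.
    return max(0.0, 1 - 1 / k ** 2)


bounds = {k: chebyshev_lower_bound(k) for k in (1, 2, 3)}

for k, bound in bounds.items():
    print(f"k = {k}: at least {bound:.1%} of the values")
```

These are guaranteed minimums for any distribution, so they are lower than the 68%, 95%, and 99.7% figures that the empirical rule gives for bell-shaped data.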
72. Standardized Data Values <ul><li>A standardized data value refers to the number of standard deviations a value is from the mean</li><li>Standardized data values are sometimes referred to as z-scores</li></ul>
73. Standardized Population Values <ul><li>$z = \dfrac{x - \mu}{\sigma}$, where:<ul><li>$x$ = original data value</li><li>$\mu$ = population mean</li><li>$\sigma$ = population standard deviation</li><li>$z$ = standard score (number of standard deviations $x$ is from $\mu$)</li></ul></li></ul>
74. Standardized Sample Values <ul><li>$z = \dfrac{x - \bar{x}}{s}$, where:<ul><li>$x$ = original data value</li><li>$\bar{x}$ = sample mean</li><li>$s$ = sample standard deviation</li><li>$z$ = standard score (number of standard deviations $x$ is from $\bar{x}$)</li></ul></li><li>Remark: The standardized sample values are used for constructing the confidence limits for the population parameters.</li></ul>
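Standardized sample values can be computed as in the sketch below. It reuses the eight-value sample from the standard deviation calculation example purely for illustration; the variable names are assumptions.

```python
import statistics

# Sample reused from the standard deviation example (illustrative choice).
sample = [10, 12, 14, 15, 17, 18, 18, 24]

mean = statistics.mean(sample)  # sample mean
s = statistics.stdev(sample)    # sample standard deviation (n - 1 divisor)

# z = (x - mean) / s: how many standard deviations x lies from the sample mean.
z_scores = [(x - mean) / s for x in sample]

for x, z in zip(sample, z_scores):
    print(f"x = {x:2d}  z = {z:+.3f}")
```

Values below the mean get negative z-scores and values above it get positive ones; the largest value, 24, lies a little under 2 standard deviations above the mean.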