Basic Stat Notes

by roopcool


Basic Stat Notes: Presentation Transcript

• Contents …
• Construct a frequency distribution both manually and with a computer
• Construct and interpret a histogram
• Create and interpret bar charts, pie charts, and stem-and-leaf diagrams
• Present and interpret data in line charts and scatter diagrams
06/08/09
• Frequency Distributions
• What is a Frequency Distribution?
• A frequency distribution is a list or a table …
• containing the values of a variable (or a set of ranges within which the data falls) ...
• and the corresponding frequencies with which each value occurs (or frequencies with which data falls within each range)
• Why Use Frequency Distributions?
• A frequency distribution is a way to summarize data
• The distribution condenses the raw data into a more useful form...
• and allows for a quick visual interpretation of the data
• Frequency Distribution: Discrete Data
• Discrete data: possible values are countable
Example: An advertiser asks 200 customers how many days per week they read the daily newspaper.
Number of days read | Frequency
0 | 44
1 | 24
2 | 18
3 | 16
4 | 20
5 | 22
6 | 26
7 | 30
Total | 200
• Relative Frequency
• Relative Frequency : What proportion is in each category?
22% of the people in the sample report that they read the newspaper 0 days per week
Number of days read | Frequency | Relative Frequency
0 | 44 | .22
1 | 24 | .12
2 | 18 | .09
3 | 16 | .08
4 | 20 | .10
5 | 22 | .11
6 | 26 | .13
7 | 30 | .15
Total | 200 | 1.00
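As a rough illustration, the relative frequencies in the newspaper example can be recomputed with a few lines of Python. The variable names are only illustrative; the counts are the ones given on the slide.

```python
# Frequencies from the newspaper example: days read per week -> number of customers.
frequency = {0: 44, 1: 24, 2: 18, 3: 16, 4: 20, 5: 22, 6: 26, 7: 30}

total = sum(frequency.values())  # number of customers surveyed

# Relative frequency = category frequency / total number of observations.
relative = {days: count / total for days, count in frequency.items()}

for days, share in relative.items():
    print(f"{days} days: {frequency[days]:3d}  {share:.2f}")
```

The relative frequencies always sum to 1, which is a quick check on the table.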
• Frequency Distribution: Continuous Data
• Continuous Data: may take on any value in some interval
• Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature
• 24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
• 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
• (Temperature is a continuous variable because it could
• be measured to any degree of precision desired)
• Grouping Data by Classes
• Sort raw data in ascending order:
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 20)
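• Compute class width: 10 (46/5 = 9.2, then round up to a convenient value)
• Determine class boundaries: 10, 20, 30, 40, 50, 60 (each class runs from one boundary up to, but not including, the next)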
• Compute class midpoints: 15, 25, 35, 45, 55
• Count observations & assign to classes
• Frequency Distribution Example
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class | Frequency | Relative Frequency
10 but under 20 | 3 | .15
20 but under 30 | 6 | .30
30 but under 40 | 5 | .25
40 but under 50 | 4 | .20
50 but under 60 | 2 | .10
Total | 20 | 1.00
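The grouping steps for the temperature data can be sketched in Python as follows. This is a minimal sketch, not the presenter's method: the class width of 10 and the starting boundary of 10 are the choices made on the slides, hard-coded here rather than derived.

```python
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

num_classes = 5
data_range = max(temps) - min(temps)   # 58 - 12 = 46
min_width = data_range / num_classes   # 9.2, the smallest workable width
width = 10                             # rounded up to a convenient value (slide's choice)
lower = 10                             # first class boundary (slide's choice)

# Count how many observations fall into each class "lo but under lo + width".
counts = [0] * num_classes
for t in temps:
    counts[(t - lower) // width] += 1

for i, c in enumerate(counts):
    lo = lower + i * width
    print(f"{lo} but under {lo + width}: {c:2d}  {c / len(temps):.2f}")
```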
• Histograms
• The classes or intervals are shown on the horizontal axis
• frequency is measured on the vertical axis
• Bars of the appropriate heights can be used to represent the number of observations within each class
• Such a graph is called a histogram
• Histogram Example
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
(Figure: histogram of the data with class midpoints on the horizontal axis. No gaps between bars, since the data are continuous.)
• Questions for Grouping Data into Classes
• 1. How wide should each interval be? (How many classes should be used?)
• 2. How should the endpoints of the intervals be determined?
• Often answered by trial and error, subject to user judgment
• The goal is to create a distribution that is neither too "jagged" nor too "blocky"
• Goal is to appropriately show the pattern of variation in the data
• How Many Class Intervals?
• Many (Narrow class intervals )
• may yield a very jagged distribution with gaps from empty classes
• Can give a poor indication of how frequency varies across classes
• Few (Wide class intervals )
• may compress variation too much and yield a blocky distribution
• can obscure important patterns of variation.
(Figure: example histograms with many and with few classes. X axis labels are upper class endpoints.)
• General Guidelines
Number of Data Points | Number of Classes
under 50 | 5 - 7
50 - 100 | 6 - 10
100 - 250 | 7 - 12
over 250 | 10 - 20
• Class widths can typically be reduced as the number of observations increases
• Distributions with numerous observations are more likely to be smooth and have gaps filled since data are plentiful
• Class Width
• The class width is the distance between the lowest possible value and the highest possible value for a frequency class
• The minimum class width is $W = \frac{\text{Largest Value} - \text{Smallest Value}}{\text{Number of Classes}}$
• Histograms in Excel
• 1. Select Tools/Data Analysis
• 2. Choose Histogram
• 3. Input data and bin ranges, then select Chart Output
• Stem and Leaf Diagram
• A simple way to see distribution details in a data set
• METHOD: Separate the sorted data series into leading digits (the stem ) and the trailing digits (the leaves )
• Example:
• Here, use the 10’s digit for the stem unit:
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• 12 is shown as stem 1, leaf 2
• 35 is shown as stem 3, leaf 5
• Example:
• Completed Stem-and-leaf diagram:
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Stem | Leaves
1 | 2 3 7
2 | 1 4 4 6 7 7
3 | 0 2 5 7 8
4 | 1 3 4 6
5 | 3 8
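A stem-and-leaf diagram with the 10's digit as the stem can be built programmatically. The sketch below is one possible approach in Python and assumes two-digit, non-negative integers as in the example.

```python
from collections import defaultdict

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)  # 10's digit is the stem, unit digit is the leaf
    stems[stem].append(leaf)

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```

Each row has one leaf per observation, so the row lengths reproduce the class frequencies 3, 6, 5, 4, 2.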
• Using other stem units
• Using the 100’s digit as the stem:
• Round off the 10’s digit to form the leaves
• 613 would become stem 6, leaf 1
• 776 would become stem 7, leaf 8
• . . .
• 1224 becomes stem 12, leaf 2
• Graphing Categorical Data: Bar Charts, Pie Charts, Pareto Diagram
• Bar and Pie Charts
• Bar charts and Pie charts are often used for qualitative (category) data
• Height of bar or size of pie slice shows the frequency or percentage for each category
• Pie Chart Example: Current Investment Portfolio (Variables are Qualitative)
Investment Type | Amount (in thousands of $) | Percentage
Stocks | 46.5 | 42.27
Bonds | 32.0 | 29.09
CD | 15.5 | 14.09
Savings | 16.0 | 14.55
Total | 110 | 100
Pie chart slices (percentages are rounded to the nearest percent): Stocks 42%, Bonds 29%, Savings 15%, CD 14%
• Bar Chart Example (Figure: bar chart of the investment portfolio)
• Pareto Diagram Example (Figure: bar graph of % invested in each category, with a line graph of cumulative % invested)
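The numbers behind a Pareto diagram for the investment portfolio can be sketched as below. The chart itself did not survive in the transcript, so this only assumes the usual Pareto convention (categories sorted from largest to smallest, plus a running cumulative percentage); the amounts are the ones in the portfolio table.

```python
# Amounts invested, in thousands of dollars (from the portfolio table).
amounts = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}
total = sum(amounts.values())

# Pareto ordering: largest category first.
ordered = sorted(amounts.items(), key=lambda item: item[1], reverse=True)

pareto = []        # rows of (category, % invested, cumulative % invested)
cumulative = 0.0
for name, amount in ordered:
    pct = 100 * amount / total
    cumulative += pct
    pareto.append((name, round(pct, 2), round(cumulative, 2)))

for name, pct, cum in pareto:
    print(f"{name:8s} {pct:6.2f}% {cum:7.2f}%")
```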
• Bar Chart Example (newspaper data)
Number of days read | Frequency
0 | 44
1 | 24
2 | 18
3 | 16
4 | 20
5 | 22
6 | 26
7 | 30
Total | 200
• Tabulating and Graphing Multivariate Categorical Data
• Investment in thousands of dollars
Investment Category | Investor A | Investor B | Investor C | Total
Stocks | 46.5 | 55 | 27.5 | 129
Bonds | 32.0 | 44 | 19.0 | 95
CD | 15.5 | 20 | 13.5 | 49
Savings | 16.0 | 28 | 7.0 | 51
Total | 110.0 | 147 | 67.0 | 324
• Tabulating and Graphing Multivariate Categorical Data (continued)
• Side-by-side charts
• Side-by-Side Chart Example
• Sales by quarter for three sales territories:
• Line Charts and Scatter Diagrams
• Line charts show values of one variable vs. time
• Time is traditionally shown on the horizontal axis
• Scatter Diagrams show points for bivariate data
• One variable is measured on the vertical axis and the other variable is measured on the horizontal axis
• Line Chart Example
Year | Inflation Rate
1985 | 3.56
1986 | 1.86
1987 | 3.65
1988 | 4.14
1989 | 4.82
1990 | 5.40
1991 | 4.21
1992 | 3.01
1993 | 2.99
1994 | 2.56
1995 | 2.83
1996 | 2.95
1997 | 2.29
1998 | 1.56
1999 | 2.21
2000 | 3.36
2001 | 2.85
2002 | 1.58
• Scatter Diagram Example
Volume per day | Cost per day
23 | 125
26 | 140
29 | 146
33 | 160
38 | 167
42 | 170
50 | 188
55 | 195
60 | 200
• Types of Relationships
• Linear Relationships
• Curvilinear Relationships
• No Relationship
• Chapter Summary
• Data in raw form are usually not easy to use for decision making -- Some type of organization is needed:
•  Table  Graph
• Techniques reviewed in this chapter:
• Frequency Distributions and Histograms
• Bar Charts and Pie Charts
• Stem and Leaf Diagrams
• Line Charts and Scatter Diagrams
• Summarization Measures
• Summarization measures represent a data set with a single number or a few numbers. They help describe the data and compare data sets. Population measures can be estimated from the summary measures of a sample.
• The measures used to represent the data are as follows:
• 1. Measures of center and location: mean, median, mode, geometric mean, midrange
• 2. Other measures of location: weighted mean, percentiles, quartiles
• 3. Measures of variation: range, interquartile range, variance, standard deviation, coefficient of variation
• Summary Measures (Describing Data Numerically)
• Center and Location: Mean, Median, Mode, Weighted Mean
• Other Measures of Location: Percentiles, Quartiles
• Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
• Overview: Measures of Center and Location: Mean, Median, Mode, Weighted Mean
• Mean (Arithmetic Average)
• The Mean is the arithmetic average of data values
• Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where n = Sample Size
• Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where N = Population Size
• The most common measure of central tendency
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)
(Figure: two dot plots on a 0 to 10 axis, one with Mean = 3 and one with Mean = 4)
• Median
• Not affected by extreme values
• In an ordered array, the median is the "middle" number
• If n or N is odd, the median is the middle number
• If n or N is even, the median is the average of the two middle numbers
(Figure: two dot plots on a 0 to 10 axis; Median = 3 in both)
• Mode
• A measure of central tendency
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode
• There may be several modes
(Figure: a dot plot on a 0 to 14 axis with Mode = 5, and a dot plot on a 0 to 6 axis with No Mode)
• Weighted Mean
• Used when values are grouped by frequency or relative importance
• Weighted mean: $\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$, where $w_i$ is the weight (here, the frequency) of value $x_i$
Example: Sample of 26 Repair Projects
Days to Complete | Frequency
5 | 4
6 | 12
7 | 8
8 | 2
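The weighted mean for the repair-project example can be worked out as follows. The computed result is not shown in the transcript, so the value below comes from the standard weighted-mean formula applied to the table, with the frequencies acting as weights.

```python
days = [5, 6, 7, 8]    # days to complete
freq = [4, 12, 8, 2]   # number of projects taking that many days (the weights)

# Weighted mean = sum(weight * value) / sum(weights)
weighted_sum = sum(w * x for w, x in zip(freq, days))   # 20 + 72 + 56 + 16
weighted_mean = weighted_sum / sum(freq)

print(f"Weighted mean days to complete: {weighted_mean:.2f}")
```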
• Review Example
• Five houses on a hill by the beach
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000
• Summary Statistics
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000 (Sum: $3,000,000)
• Mean: $3,000,000 / 5 = $600,000
• Median: middle value of ranked data = $300,000
• Mode: most frequent value = $100,000
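The three summary statistics for the house prices can be checked with Python's standard `statistics` module. This is just a verification sketch; the list name is illustrative.

```python
import statistics

prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean = statistics.mean(prices)      # sum of values / number of values
median = statistics.median(prices)  # middle value of the ranked data
mode = statistics.mode(prices)      # most frequent value

print(mean, median, mode)
```

The single $2,000,000 house pulls the mean well above the median, which is the outlier effect discussed on the slides.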
• Which measure of location is the "best"?
• Mean is generally used, unless extreme values (outliers) exist
• Then median is often used, since the median is not sensitive to extreme values.
• Example: Median home prices may be reported for a region, since the median is less sensitive to outliers
• Shape of a Distribution
• Describes how data is distributed
• Symmetric or skewed
• Left-Skewed (longer tail extends to left): Mean < Median < Mode
• Symmetric: Mean = Median = Mode
• Right-Skewed (longer tail extends to right): Mode < Median < Mean
• Other Measures of Location: Percentiles and Quartiles
• The pth percentile in a data array:
• p% are less than or equal to this value
• (100 – p)% are greater than or equal to this value
• (where 0 ≤ p ≤ 100)
• 1st quartile = 25th percentile
• 2nd quartile = 50th percentile = median
• 3rd quartile = 75th percentile
• Percentiles
• The pth percentile in an ordered array of n values is the value in the ith position, where $i = \frac{p}{100}(n+1)$
• Example: The 60th percentile in an ordered array of 19 values is the value in the 12th position: $i = \frac{60}{100}(19+1) = 12$
• Quartiles
• Quartiles split the ranked data into 4 equal groups (25% of the values in each), separated by Q1, Q2, and Q3
Sample Data in Ordered Array (n = 9): 11 12 13 16 16 17 18 21 22
• Example: Find the first quartile
• Q1 = 25th percentile, so find the $\frac{25}{100}(9+1) = 2.5$ position
• Use the value halfway between the 2nd and 3rd values, so Q1 = 12.5
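The position rule used in the percentile and quartile examples can be sketched as a small Python function. The function name `percentile` is hypothetical. The slides only show the halfway case (position 2.5), so linear interpolation for other fractional positions, and clamping positions that fall outside the array, are assumptions of this sketch.

```python
def percentile(sorted_values, p):
    """Value at the p-th percentile using the i = (p/100)(n+1) position rule."""
    n = len(sorted_values)
    position = p * (n + 1) / 100          # 1-based position in the ordered array
    position = min(max(position, 1), n)   # assumption: clamp to the ends of the array
    whole = int(position)                 # position of the value at or below
    fraction = position - whole
    if whole >= n:
        return sorted_values[-1]
    below = sorted_values[whole - 1]
    above = sorted_values[whole]
    # Assumption: interpolate linearly between the two neighbouring values.
    return below + fraction * (above - below)


sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(percentile(sample, 25))   # first quartile of the sample data
```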
• Box and Whisker Plot
• A graphical display of data using the 5-number summary:
Minimum -- Q1 -- Median -- Q3 -- Maximum
(Figure: example box and whisker plot; each of the four segments holds 25% of the data)
• Shape of Box and Whisker Plots
• The box and central line are centered between the endpoints if data is symmetric around the median
• A Box and Whisker plot can be shown in either vertical or horizontal format
• Distribution Shape and Box and Whisker Plot (Figure: box plots with Q1, Q2, and Q3 marked for left-skewed, symmetric, and right-skewed distributions)
• Box-and-Whisker Plot Example
• Five-number summary for a Box-and-Whisker plot of the following data: 0 2 2 2 3 3 4 5 5 10 27
• Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27
• These data are strongly right-skewed, as the plot depicts
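The five-number summary for this example can be reproduced with the standard library. This sketch assumes Python 3.8 or later, where `statistics.quantiles` is available; its `method="exclusive"` option places quartiles at the (p/100)(n+1) position, the same rule used on the percentile slides.

```python
import statistics

data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]

# n=4 asks for the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary)
```

The long gap between Q3 and the maximum, compared with the short gap between the minimum and Q1, is what shows up as right skew in the plot.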
• Measures of Variation: Range, Interquartile Range, Variance (Population Variance, Sample Variance), Standard Deviation (Population Standard Deviation, Sample Standard Deviation), Coefficient of Variation
• Variation
• Measures of variation give information on the spread or variability of the data values.
(Figure: two distributions with the same center but different variation)
• Range
• Difference between the largest and the smallest observations: Range = x maximum – x minimum
• Example: for data plotted on a 0 to 14 axis with smallest value 1 and largest value 14, Range = 14 - 1 = 13
• Disadvantages of the Range
• Ignores the way in which data are distributed: two differently shaped data sets on a 7 to 12 axis both have Range = 12 - 7 = 5
• Sensitive to outliers:
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5: Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120: Range = 120 - 1 = 119
• Interquartile Range
• Can eliminate some outlier problems by using the Interquartile range
• Eliminate some high- and low-valued observations and calculate the range from the remaining values.
• Interquartile range = 3rd quartile – 1st quartile
• Example: X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70, so Interquartile range = 57 – 30 = 27
• Variance
• Average of squared deviations of values from the mean
• Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
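• Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
• Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$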
• Calculation Example: Sample Standard Deviation
• Sample Data ($x_i$): 10 12 14 15 17 18 18 24
• n = 8, Mean = $\bar{x}$ = 16
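The worked calculation itself is missing from the transcript, so the following Python sketch recomputes it from the sample data using the usual sample formula with an n - 1 divisor. The resulting value of s is computed here, not quoted from the slide.

```python
import math

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
mean = sum(data) / n                                  # sample mean
squared_dev = sum((x - mean) ** 2 for x in data)      # sum of squared deviations
sample_variance = squared_dev / (n - 1)               # divide by n - 1 for a sample
s = math.sqrt(sample_variance)                        # sample standard deviation

print(f"mean = {mean}, s^2 = {sample_variance:.4f}, s = {s:.4f}")
```

Dividing by n instead of n - 1 would give the population formulas from the earlier slides.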
• Comparing Standard Deviations
(Figure: three dot plots on an 11 to 21 axis, all with Mean = 15.5)
• Data A: Mean = 15.5, s = 3.338
• Data B: Mean = 15.5, s = .9258
• Data C: Mean = 15.5, s = 4.57
• Coefficient of Variation
• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Is used to compare two or more sets of data measured in different units
• Population: $CV = \frac{\sigma}{\mu} \cdot 100\%$
• Sample: $CV = \frac{s}{\bar{x}} \cdot 100\%$
• Comparing Coefficient of Variation
• Stock A:
• Average price last year = $50
• Standard deviation = $5
• CV = (5 / 50) × 100% = 10%
• Stock B:
• Average price last year = $100
• Standard deviation = $5
• CV = (5 / 100) × 100% = 5%
• Both stocks have the same standard deviation, but stock B is less variable relative to its price
• The Empirical Rule
• If the data distribution is bell-shaped, then the interval:
• μ ± 1σ contains about 68% of the values in the population or the sample
• μ ± 2σ contains about 95% of the values in the population or the sample
• μ ± 3σ contains about 99.7% of the values in the population or the sample
• Tchebysheff’s Theorem
• Regardless of how the data are distributed, at least (1 - 1/k²) of the values will fall within k standard deviations of the mean
• Examples:
• At least (1 - 1/1²) = 0% within k=1 (μ ± 1σ)
• At least (1 - 1/2²) = 75% within k=2 (μ ± 2σ)
• At least (1 - 1/3²) = 89% within k=3 (μ ± 3σ)
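The bound in the theorem is easy to tabulate and to check against data. The sketch below uses a hypothetical helper, `chebyshev_lower_bound`, and tests it on the right-skewed box-plot data from earlier in these notes, treated as a population (so the population standard deviation is used).

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1 - 1 / k ** 2)


data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]   # right-skewed example data
mu = statistics.mean(data)
sigma = statistics.pstdev(data)               # population standard deviation

for k in (1, 2, 3):
    within = sum(abs(x - mu) < k * sigma for x in data) / len(data)
    print(f"k={k}: at least {chebyshev_lower_bound(k):.2%}, observed {within:.2%}")
```

The observed proportions are usually well above the bound, since the theorem has to hold for every possible distribution.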
• Standardized Data Values
• A standardized data value refers to the number of standard deviations a value is from the mean
• Standardized data values are sometimes referred to as z-scores
• Standardized Population Values
• $z = \frac{x - \mu}{\sigma}$
• where:
• x = original data value
• μ = population mean
• σ = population standard deviation
• z = standard score (number of standard deviations x is from μ)
• Standardized Sample Values
• $z = \frac{x - \bar{x}}{s}$
• where:
• x = original data value
• $\bar{x}$ = sample mean
• s = sample standard deviation
• z = standard score (number of standard deviations x is from $\bar{x}$)
• Remark: The standardized sample values are used for constructing the confidence limits for the population parameters.
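As an illustration, sample z-scores can be computed for the data set used in the standard deviation example. This is a sketch using Python's `statistics` module, whose `stdev` function uses the sample (n - 1) formula; the data set is reused here only for convenience.

```python
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]

x_bar = statistics.mean(data)   # sample mean
s = statistics.stdev(data)      # sample standard deviation (n - 1 divisor)

# z = (x - x_bar) / s: number of standard deviations each value is from the mean.
z_scores = [(x - x_bar) / s for x in data]

for x, z in zip(data, z_scores):
    print(f"x = {x:2d}  z = {z:+.3f}")
```

Values below the mean get negative z-scores, values above it get positive ones, and the z-scores of a sample always sum to zero.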