Why Study Statistics Arunesh Chand Mankotia 2004


Transcript

  • 1. Why Study Statistics?
  • 2. Dealing with Uncertainty: Everyday decisions are based on incomplete information.
  • 3. Dealing with Uncertainty: "The price of L&T stock will be higher in six months than it is now." versus "The price of L&T stock is likely to be higher in six months than it is now."
  • 4. Dealing with Uncertainty: "If the union budget deficit is as high as predicted, interest rates will remain high for the rest of the year." versus "If the union budget deficit is as high as predicted, it is probable that interest rates will remain high for the rest of the year."
  • 5. Statistical Thinking: Statistical thinking is a philosophy of learning and action based on the following fundamental principles: all work occurs in a system of interconnected processes; variation exists in all processes; and understanding and reducing variation are the keys to success.
  • 6. Statistical Thinking, Systems and Processes: A system is a number of components that are logically and sometimes physically linked together for some purpose.
  • 7. Statistical Thinking, Systems and Processes: A process is a set of activities operating on a system that transforms inputs to outputs. A business process is a group of logically related tasks and activities that, when performed, uses the resources of the business to provide the definitive results required to achieve the business objectives.
  • 8. Making Decisions: Data, Information, Knowledge
      • Data: specific observations of measured numbers.
      • Information: processed and summarized data yielding facts and ideas.
      • Knowledge: selected and organized information that provides understanding, recommendations, and the basis for decisions.
  • 9. Making Decisions, Descriptive and Inferential Statistics: Descriptive Statistics include graphical and numerical procedures that summarize and process data and are used to transform data into information.
  • 10. Making Decisions, Descriptive and Inferential Statistics: Inferential Statistics provide the bases for predictions, forecasts, and estimates that are used to transform information to knowledge.
  • 11. The Journey to Making Decisions: Begin here: identify the problem. Data → Information (via descriptive statistics, probability, computers) → Knowledge (via experience, theory, literature, inferential statistics, computers) → Decision.
  • 12. Describing Data
  • 13. Summarizing and Describing Data: tables and graphs; numerical measures.
  • 14. Classification of Variables: discrete numerical variable; continuous numerical variable; categorical variable.
  • 15. Classification of Variables, Discrete Numerical Variable: A variable that produces a response that comes from a counting process.
  • 16. Classification of Variables, Continuous Numerical Variable: A variable that produces a response that is the outcome of a measurement process.
  • 17. Classification of Variables, Categorical Variables: Variables that produce responses that belong to groups (sometimes called "classes") or categories.
  • 18. Measurement Levels: Nominal and Ordinal Levels of Measurement refer to data obtained from categorical questions.
      • A nominal scale indicates assignments to groups or classes.
      • Ordinal data indicate rank ordering of items.
  • 19. Frequency Distributions: A frequency distribution is a table used to organize data. The left column (called classes or groups) includes numerical intervals on a variable being studied. The right column is a list of the frequencies, or number of observations, for each class. Intervals are normally of equal size, must cover the range of the sample observations, and must be non-overlapping.
  • 20. Construction of a Frequency Distribution:
      Rule 1: Intervals (classes) must be inclusive and non-overlapping.
      Rule 2: Determine $k$, the number of classes.
      Rule 3: Intervals should be the same width, $w$; the width is determined by the following:
        $w = \text{interval width} = \frac{\text{largest number} - \text{smallest number}}{\text{number of intervals}}$
      Both $k$ and $w$ should be rounded upward, possibly to the next largest integer.
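
The three rules can be sketched in Python. This is a minimal illustration, not a prescribed procedure: the helper name `frequency_distribution` and the sample values are hypothetical, and rounding the width up to the next integer is only one of the choices the slide allows.

```python
import math

def frequency_distribution(data, k):
    """Count observations in k equal-width, non-overlapping classes."""
    low = min(data)
    # Rule 3: w = (largest - smallest) / k, rounded upward to an integer.
    w = math.ceil((max(data) - low) / k) or 1  # fall back to 1 if all values are equal
    # Rule 1: each class is "lower to less than upper", so classes never overlap.
    classes = [(low + i * w, low + (i + 1) * w) for i in range(k)]
    counts = [0] * k
    for x in data:
        # If the range divides evenly, the largest value sits exactly on the
        # last upper limit; it is counted in the last class (an assumption).
        index = min(int((x - low) // w), k - 1)
        counts[index] += 1
    return classes, counts

data = [12, 15, 17, 21, 22, 22, 25, 28, 30, 31]  # hypothetical observations
classes, counts = frequency_distribution(data, k=4)  # Rule 2: k chosen by the analyst
for (lower, upper), count in zip(classes, counts):
    print(f"{lower} to less than {upper}: {count}")
```

Here the range is 19, so the width is 19 / 4 = 4.75, rounded up to 5.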
  • 21. Construction of a Frequency Distribution: Quick Guide to Number of Classes for a Frequency Distribution
      Sample Size      Number of Classes
      Fewer than 50    5 to 6 classes
      50 to 100        6 to 8 classes
      Over 100         8 to 10 classes
  • 22. Example of a Frequency Distribution: A Frequency Distribution for the Suntan Lotion Example
      Weights (in mL)         Number of Bottles
      220 to less than 225     1
      225 to less than 230     4
      230 to less than 235    29
      235 to less than 240    34
      240 to less than 245    26
      245 to less than 250     6
  • 23. Cumulative Frequency Distributions: A cumulative frequency distribution contains the number of observations whose values are less than the upper limit of each interval. It is constructed by adding the frequencies of all frequency distribution intervals up to and including the present interval.
  • 24. Relative Cumulative Frequency Distributions: A relative cumulative frequency distribution converts all cumulative frequencies to cumulative percentages.
  • 25. Example of a Frequency Distribution: A Cumulative Frequency Distribution for the Suntan Lotion Example
      Weights (in mL)    Number of Bottles
      less than 225        1
      less than 230        5
      less than 235       34
      less than 240       68
      less than 245       94
      less than 250      100
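
As a short sketch of how the cumulative and relative cumulative columns follow from the class frequencies, the Python below starts from the suntan lotion frequencies given in the transcript. The variable names are illustrative only.

```python
from itertools import accumulate

# Class upper limits (mL) and frequencies from the suntan lotion example.
upper_limits = [225, 230, 235, 240, 245, 250]
frequencies = [1, 4, 29, 34, 26, 6]

# Cumulative frequency: running total up to and including each class.
cumulative = list(accumulate(frequencies))

# Relative cumulative frequency: cumulative counts as percentages of the total.
total = cumulative[-1]
relative_cumulative = [100 * c / total for c in cumulative]

for limit, c, pct in zip(upper_limits, cumulative, relative_cumulative):
    print(f"less than {limit}: {c} bottles ({pct:.0f}%)")
```

Because the sample holds exactly 100 bottles, the cumulative counts and the cumulative percentages coincide in this example.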
  • 26. Histograms and Ogives: A histogram is a bar graph that consists of vertical bars constructed on a horizontal line that is marked off with intervals for the variable being displayed. The intervals correspond to those in a frequency distribution table. The height of each bar is proportional to the number of observations in that interval.
  • 27. Histograms and Ogives: An ogive, sometimes called a cumulative line graph, is a line that connects points that are the cumulative percentage of observations below the upper limit of each class in a cumulative frequency distribution.
  • 28. Histogram and Ogive for Example 1: [Figure: "Histogram of Weights" with ogive; horizontal axis: Interval Weights (mL); left vertical axis: Frequency; right vertical axis: cumulative percentage]
  • 29. Stem-and-Leaf Display: A stem-and-leaf display is an exploratory data analysis graph that is an alternative to the histogram. Data are grouped according to their leading digits (called the stem) while listing the final digits (called leaves) separately for each member of a class. The leaves are displayed individually in ascending order after each of the stems.
  • 30. Stem-and-Leaf Display (stem unit: 10)
      9     1 | 124678899
      (9)   2 | 122246899
      5     3 | 01234
      2     4 | 02
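
A display like this one can be rebuilt with a few lines of Python. The sketch below is illustrative: the function name `stem_and_leaf` is hypothetical, the observations are read back from the stems and leaves in the transcript, and the code assumes non-negative integers.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Group values by leading digits (stem) and list final digits (leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):  # sorting keeps the leaves in ascending order
        stems[v // stem_unit].append(v % stem_unit)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(stems.items())}

# Observations read back from the display in the transcript.
values = [11, 12, 14, 16, 17, 18, 18, 19, 19,
          21, 22, 22, 22, 24, 26, 28, 29, 29,
          30, 31, 32, 33, 34, 40, 42]

display = stem_and_leaf(values)
for stem, leaves in display.items():
    print(f"{stem} | {leaves}")
```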
  • 31. Tables, Bar and Pie Charts: Frequency and Relative Frequency Distribution for Top Company Employers Example
      Industry          Number of Employees    Relative Frequency
      Tourism           85,287                 0.35
      Retail            49,424                 0.20
      Health Care       39,588                 0.16
      Restaurants       16,050                 0.06
      Communications    11,750                 0.05
      Technology        11,144                 0.05
      Space             11,418                 0.05
      Other             21,336                 0.08
  • 32. Tables, Bar and Pie Charts: Bar Chart for Top Company Employers Example. [Figure: bar chart, "1999 Top Company Employers in Central Florida"; horizontal axis: Industry Category; bar heights are the relative frequencies from the table]
  • 33. Tables, Bar and Pie Charts: Pie Chart for Top Company Employers Example. [Figure: pie chart, "1999 Top Company Employers in Central Florida"; Tourism 35%, Retail 20%, Health Care 16%, Others 29%]
  • 34. Pareto Diagrams: A Pareto diagram is a bar chart that displays the frequency of defect causes. The bar at the left indicates the most frequent cause and bars to the right indicate causes in decreasing frequency. A Pareto diagram is used to separate the "vital few" from the "trivial many."
  • 35. Line Charts: A line chart, also called a time plot, is a series of data plotted at various time intervals. Measuring time along the horizontal axis and the numerical quantity of interest along the vertical axis yields a point on the graph for each observation. Joining points adjacent in time by straight lines produces a time plot.
  • 36. Line Charts: [Figure: line chart, "Growth Trends in Internet Use by Age, 1997 to 1999"; vertical axis: Millions of Adults; horizontal axis: April 1997 to July 1999; series: Age 18 to 29, Age 30 to 49, Age 50+]
  • 37. Parameters and Statistics: A statistic is a descriptive measure computed from a sample of data. A parameter is a descriptive measure computed from an entire population of data.
  • 38. Measures of Central Tendency - Arithmetic Mean: The arithmetic mean of a set of data is the sum of the data values divided by the number of observations.
  • 39. Sample Mean: If the data set is from a sample, then the sample mean, $\bar{X}$, is: $\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
  • 40. Population Mean: If the data set is from a population, then the population mean, $\mu$, is: $\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$
  • 41. Measures of Central Tendency - Median: An ordered array is an arrangement of data in either ascending or descending order. Once the data are arranged in ascending order, the median is the value such that 50% of the observations are smaller and 50% of the observations are larger. If the sample size $n$ is an odd number, the median, $X_m$, is the middle observation. If the sample size $n$ is an even number, the median, $X_m$, is the average of the two middle observations. The median will be located in the $0.50(n+1)$th ordered position.
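
The position rule can be written out directly. The Python below is a small sketch with a hypothetical function name and made-up data; it locates the $0.50(n+1)$th ordered position and averages the two neighbours when that position falls between observations.

```python
def median_by_position(data):
    """Median located at the 0.50(n + 1)th position of the ordered array."""
    if not data:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(data)
    n = len(ordered)
    position = 0.50 * (n + 1)            # 1-based ordered position
    lower = ordered[int(position) - 1]   # observation at or just below the position
    if position == int(position):        # n odd: the position is a whole number
        return lower
    upper = ordered[int(position)]       # n even: average the two middle values
    return (lower + upper) / 2

print(median_by_position([7, 3, 9, 1, 5]))      # odd n: position 3
print(median_by_position([7, 3, 9, 1, 5, 11]))  # even n: position 3.5
```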
  • 42. Measures of Central Tendency - Mode: The mode, if one exists, is the most frequently occurring observation in the sample or population.
  • 43. Shape of the Distribution: The shape of the distribution is said to be symmetric if the observations are balanced, or evenly distributed, about the mean. In a symmetric distribution the mean and median are equal.
  • 44. Shape of the Distribution: A distribution is skewed if the observations are not symmetrically distributed above and below the mean. A positively skewed (or skewed to the right) distribution has a tail that extends to the right in the direction of positive values. A negatively skewed (or skewed to the left) distribution has a tail that extends to the left in the direction of negative values.
  • 45. Shapes of the Distribution: [Figure: three histograms with Frequency on the vertical axis and values 1 to 9 on the horizontal axis; panels: Symmetric Distribution, Positively Skewed Distribution, Negatively Skewed Distribution]
  • 46. Measures of Central Tendency - Geometric Mean: The Geometric Mean is the $n$th root of the product of $n$ numbers: $\bar{X}_g = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} = (x_1 \cdot x_2 \cdots x_n)^{1/n}$. The Geometric Mean is used to obtain mean growth over several periods given compounded growth from each period.
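
To illustrate the compounded-growth use, here is a brief Python sketch. The growth factors are hypothetical (growth of 10%, 20%, and -5% over three periods), and the function name is illustrative.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive numbers."""
    n = len(values)
    return math.prod(values) ** (1 / n)

# Hypothetical growth factors for three periods: +10%, +20%, -5%.
growth_factors = [1.10, 1.20, 0.95]

mean_factor = geometric_mean(growth_factors)
mean_growth_rate = mean_factor - 1  # mean growth per period

print(f"mean growth per period: {mean_growth_rate:.2%}")
```

Compounding the mean factor over the three periods reproduces the total growth, which the arithmetic mean of the three rates would not do exactly.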
  • 47. Measures of Variability - The Range: The range of a set of data is the difference between the largest and smallest observations.
  • 48. Measures of Variability - Sample Variance: The sample variance, $s^2$, is the sum of the squared differences between each observation and the sample mean divided by the sample size minus 1: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$
  • 49. Measures of Variability - Short-cut Formulas for Sample Variance: Short-cut formulas for the sample variance are: $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$ or $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{X}^2}{n - 1}$
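
A quick numerical check shows that the definition and the first short-cut formula agree. The Python below is a sketch with hypothetical data and function names.

```python
def sample_variance(data):
    """Definition: sum of squared deviations from the mean, over n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def sample_variance_shortcut(data):
    """Short-cut: (sum of x^2 - (sum of x)^2 / n) over n - 1."""
    n = len(data)
    total = sum(data)
    total_of_squares = sum(x * x for x in data)
    return (total_of_squares - total ** 2 / n) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
print(sample_variance(data), sample_variance_shortcut(data))
```

For this sample the mean is 5, the squared deviations sum to 32, and both functions give 32 / 7. With very large values the short-cut form can lose precision in floating-point arithmetic, so the definitional form is usually the safer choice in code.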
  • 50. Measures of Variability - Population Variance: The population variance, $\sigma^2$, is the sum of the squared differences between each observation and the population mean divided by the population size, $N$: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
  • 51. Measures of Variability - Sample Standard Deviation: The sample standard deviation, $s$, is the positive square root of the sample variance, and is defined as: $s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}$
  • 52. Measures of Variability - Population Standard Deviation: The population standard deviation, $\sigma$, is $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
  • 53. The Empirical Rule (the 68%, 95%, or almost all rule): For a set of data with a mound-shaped histogram, the Empirical Rule is:
      • approximately 68% of the observations are contained within a distance of one standard deviation around the mean, $\mu \pm 1\sigma$;
      • approximately 95% of the observations are contained within a distance of two standard deviations around the mean, $\mu \pm 2\sigma$;
      • almost all of the observations are contained within a distance of three standard deviations around the mean, $\mu \pm 3\sigma$.
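
The rule can be checked on simulated mound-shaped data. The Python sketch below is an illustration under stated assumptions: the data are 10,000 pseudo-random draws from a normal distribution with an arbitrary mean of 100 and standard deviation of 15, and `share_within` is a hypothetical helper.

```python
import random
import statistics

def share_within(data, k):
    """Fraction of observations within k standard deviations of the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)  # population standard deviation
    inside = sum(1 for x in data if abs(x - mu) <= k * sigma)
    return inside / len(data)

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.gauss(100, 15) for _ in range(10_000)]  # simulated, not real data

shares = {k: share_within(data, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The printed shares should land close to 68%, 95%, and nearly 100%; for strongly skewed data the rule need not hold.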
  • 54. Coefficient of Variation: The Coefficient of Variation, CV, is a measure of relative dispersion that expresses the standard deviation as a percentage of the mean (provided the mean is positive). The sample coefficient of variation is $CV = \frac{s}{\bar{X}} \times 100$ if $\bar{X} > 0$. The population coefficient of variation is $CV = \frac{\sigma}{\mu} \times 100$ if $\mu > 0$.
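
As a small sketch of the sample version, the Python below computes the CV with the standard library. The function name and the two data sets are hypothetical; they are chosen so that both have the same standard deviation but different means.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample CV: standard deviation as a percentage of the mean."""
    mean = statistics.mean(sample)
    if mean <= 0:
        raise ValueError("the CV is defined here only for a positive mean")
    return statistics.stdev(sample) / mean * 100  # stdev uses n - 1

low_priced = [10, 20, 30]      # hypothetical values, mean 20
high_priced = [110, 120, 130]  # same spread, mean 120

print(coefficient_of_variation(low_priced))   # larger relative dispersion
print(coefficient_of_variation(high_priced))  # smaller relative dispersion
```

Both samples have a standard deviation of 10, yet the first varies far more relative to its mean, which is the comparison the CV is meant to support.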
  • 55. Percentiles and Quartiles: Data must first be in ascending order. Percentiles separate large ordered data sets into 100ths. The $P$th percentile is a number such that $P$ percent of all the observations are at or below that number. Quartiles are descriptive measures that separate large ordered data sets into four quarters.
  • 56. Percentiles and Quartiles: The first quartile, $Q_1$, is another name for the 25th percentile. The first quartile divides the ordered data such that 25% of the observations are at or below this value. $Q_1$ is located in the $0.25(n+1)$th position when the data are in ascending order. That is, $Q_1$ is the value in the $\frac{n+1}{4}$th ordered position.
  • 57. Percentiles and Quartiles: The third quartile, $Q_3$, is another name for the 75th percentile. The third quartile divides the ordered data such that 75% of the observations are at or below this value. $Q_3$ is located in the $0.75(n+1)$th position when the data are in ascending order. That is, $Q_3$ is the value in the $\frac{3(n+1)}{4}$th ordered position.
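
The position formulas for the quartiles can be sketched in Python as follows. The slides do not say what to do when a position is not a whole number; linear interpolation between the two neighbouring observations is assumed here, and the function names and data are hypothetical.

```python
def value_at_position(ordered, position):
    """Value at a 1-based ordered position.

    Fractional positions are resolved by linear interpolation between the
    neighbouring observations (an assumption, not stated in the slides).
    """
    lower_index = int(position)
    if lower_index < 1:
        return ordered[0]
    if lower_index >= len(ordered):
        return ordered[-1]
    fraction = position - lower_index
    lower = ordered[lower_index - 1]
    upper = ordered[lower_index]
    return lower + fraction * (upper - lower)

def quartiles(data):
    """Q1 at position 0.25(n + 1) and Q3 at position 0.75(n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, 0.25 * (n + 1))
    q3 = value_at_position(ordered, 0.75 * (n + 1))
    return q1, q3

print(quartiles([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]))  # positions 3 and 9
print(quartiles([2, 4, 4, 4, 5, 5, 7, 9]))                    # positions 2.25 and 6.75
```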
  • 58. Interquartile Range: The Interquartile Range (IQR) measures the spread in the middle 50% of the data; that is, the difference between the observations at the 25th and the 75th percentiles: $IQR = Q_3 - Q_1$
  • 59. Five-Number Summary: The Five-Number Summary refers to the five descriptive measures: minimum, first quartile, median, third quartile, and the maximum. $X_{\text{minimum}} < Q_1 < \text{Median} < Q_3 < X_{\text{maximum}}$
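
The five measures and the IQR can be computed with Python's standard library. This is a sketch on hypothetical data: `statistics.quantiles` with its default "exclusive" method places quartiles using (n + 1)-based positions and interpolates linearly between observations when a position is fractional.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # n=4 asks for the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]  # hypothetical sample
summary = five_number_summary(data)
iqr = summary[3] - summary[1]  # IQR = Q3 - Q1

print(summary)
print(f"IQR = {iqr}")
```

These five values are exactly what a box-and-whisker plot draws: the box spans Q1 to Q3, a line marks the median, and the whiskers reach to the minimum and maximum.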
  • 60. Box-and-Whisker Plots: A Box-and-Whisker Plot is a graphical procedure that uses the Five-Number Summary. A Box-and-Whisker Plot consists of:
      • an inner box that shows the numbers which span the range from $Q_1$ to $Q_3$;
      • a line drawn through the box at the median.
    The "whiskers" are lines drawn from $Q_1$ to the minimum value, and from $Q_3$ to the maximum value.
  • 61. Box-and-Whisker Plots (Excel): [Figure: box-and-whisker plot produced in Excel]
  • 62. Grouped Data Mean: For a population of $N$ observations the mean is $\mu = \frac{\sum_{i=1}^{K} f_i m_i}{N}$. For a sample of $n$ observations, the mean is $\bar{X} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$, where the data set contains observation values $m_1, m_2, \ldots, m_K$ occurring with frequencies $f_1, f_2, \ldots, f_K$ respectively.
  • 63. Grouped Data Variance: For a population of $N$ observations the variance is $\sigma^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \mu)^2}{N} = \frac{\sum_{i=1}^{K} f_i m_i^2}{N} - \mu^2$. For a sample of $n$ observations, the variance is $s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{K} f_i m_i^2 - n\bar{X}^2}{n - 1}$, where the data set contains observation values $m_1, m_2, \ldots, m_K$ occurring with frequencies $f_1, f_2, \ldots, f_K$ respectively.
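
The grouped-data formulas can be applied to the suntan lotion frequency table from earlier in the transcript. In the Python sketch below, each class is represented by its midpoint; that choice of $m_i$ is an assumption made for illustration, since the slides only describe $m_i$ as observation values.

```python
# Class midpoints (mL) for 220-225, 225-230, ..., 245-250 (assumed as m_i)
# and the class frequencies f_i from the suntan lotion example.
midpoints = [222.5, 227.5, 232.5, 237.5, 242.5, 247.5]
frequencies = [1, 4, 29, 34, 26, 6]

n = sum(frequencies)  # sample size, 100 bottles

# Grouped sample mean: sum of f_i * m_i, divided by n.
mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Grouped sample variance: sum of f_i * (m_i - mean)^2, divided by n - 1.
variance = sum(f * (m - mean) ** 2 for f, m in zip(frequencies, midpoints)) / (n - 1)

# Equivalent short-cut form: (sum of f_i * m_i^2 - n * mean^2) / (n - 1).
shortcut = (sum(f * m * m for f, m in zip(frequencies, midpoints)) - n * mean ** 2) / (n - 1)

print(f"mean = {mean:.1f} mL, variance = {variance:.2f}, s = {variance ** 0.5:.2f} mL")
```

The two variance expressions agree up to floating-point rounding, and both are approximations to the variance of the raw weights because each bottle is treated as if it sat at its class midpoint.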