Session 3&4.pptx

  1. 1. Descriptive Statistics
  2. 2. Descriptive Statistics • Tabular, graphical, or numerical summaries of data.

     | Age | |
     |---|---|
     | Mean | 42.57 |
     | Median | 40 |
     | Mode | 40 |
     | Standard Deviation | 10.63 |
     | Sample Variance | 113.01 |
     | Range | 44 |
     | Minimum | 21 |
     | Maximum | 65 |

     | | Frequency |
     |---|---|
     | Female | 12 |
     | Male | 18 |
     | Grand Total | 30 |

     [Figure: Bar Chart for Opinions. x-axis: Opinion (1 to 5); y-axis: Frequency (0 to 9)]
  3. 3. Summarizing Data for Categorical Variables • Let us focus on Tabular and Graphical summaries first. We will deal with numerical summaries later. • Tabular: • Frequency distribution • Relative frequency distribution • Percent frequency distribution • Graphical: • Bar chart • Pie chart
  4. 4. Frequency Distribution • A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several non-overlapping categories or classes.

     | Opinion | Frequency |
     |---|---|
     | Strongly disagree | 8 |
     | Disagree | 4 |
     | Neutral | 6 |
     | Agree | 7 |
     | Strongly agree | 5 |
     | Grand Total | 30 |
  5. 5. Relative Frequency Distribution • Relative frequency of a class $= \dfrac{\text{Frequency of the class}}{\text{Total number of observations}}$ • Percent frequency of a class $= \dfrac{\text{Frequency of the class}}{\text{Total number of observations}} \times 100\%$
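  6. 6. Relative and percent frequency distribution of opinions

     | Opinion | Frequency | Relative frequency | Percent frequency |
     |---|---|---|---|
     | Strongly disagree | 8 | 0.27 | 27% |
     | Disagree | 4 | 0.13 | 13% |
     | Neutral | 6 | 0.20 | 20% |
     | Agree | 7 | 0.23 | 23% |
     | Strongly agree | 5 | 0.17 | 17% |
     | Grand Total | 30 | 1.00 | 100% |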
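
The relative and percent frequencies in the opinion table can be reproduced with a few lines of Python. This is a minimal sketch: the counts are copied from the table, while the function name `relative_frequencies` is only an illustrative choice.

```python
# Frequencies copied from the opinion table above.
frequencies = {
    "Strongly disagree": 8,
    "Disagree": 4,
    "Neutral": 6,
    "Agree": 7,
    "Strongly agree": 5,
}


def relative_frequencies(freqs):
    """Return {category: frequency / total number of observations}."""
    n = sum(freqs.values())
    return {category: count / n for category, count in freqs.items()}


total = sum(frequencies.values())
rel = relative_frequencies(frequencies)

# Print frequency, relative frequency, and percent frequency per category.
for category, count in frequencies.items():
    print(f"{category:<18} {count:>3}  {rel[category]:.2f}  {rel[category] * 100:.0f}%")
print(f"{'Grand Total':<18} {total:>3}  {sum(rel.values()):.2f}  100%")
```

The relative frequencies always sum to 1, which is a quick check that no category was missed.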
  7. 7. Bar Chart [Figure: Number of people in each age category. Categories: Elderly, Middle-aged, Young; axes: Age category and Frequency (0 to 25)]
  8. 8. Pie Chart [Figure: Age distribution of people. Slices: Elderly, Middle-aged, Young]
  9. 9. Summarizing Data for Quantitative Variables • Let us focus on Tabular and Graphical summaries first. We will deal with numerical summaries later. • Tabular: • Frequency distribution • Relative frequency distribution • Percent frequency distribution • Graphical: • Histogram
  10. 10. Frequency Distribution • We need to bin/bucket the quantitative variable of interest. • Three Steps: 1. Determine the number of nonoverlapping classes. 2. Determine the width of each class. 3. Determine the class limits. • Choosing the number of classes is tricky! It is done by trial and error. • Five to twenty classes are preferred. (Not too few, not too many, just enough to informatively show the variation in the frequencies.)
  11. 11. Frequency Distribution • Approximate class width $= \dfrac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}}$
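
As a rough illustration of the class-width formula, the sketch below computes the width and the class limits. The inputs 31000, 81400, and 12 classes are read off the class table on the next slide; treating them as the data's actual minimum, maximum, and class count is an assumption, and `class_limits` is a hypothetical helper name.

```python
def class_limits(smallest, largest, num_classes):
    """Return the approximate class width and the list of class edges."""
    width = (largest - smallest) / num_classes
    edges = [smallest + i * width for i in range(num_classes + 1)]
    return width, edges


# Assumed inputs, taken from the first and last class limits in the deck's table.
width, edges = class_limits(31000, 81400, 12)
print(f"class width = {width:.0f}")

# Print each class as an interval; the first class is closed on both sides.
for i, (low, high) in enumerate(zip(edges[:-1], edges[1:])):
    left = "[" if i == 0 else "("
    print(f"{left}{low:.0f}, {high:.0f}]")
```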
  12. 12. Frequency Distribution

      | Class | Frequency |
      |---|---|
      | [31000, 35200] | 1 |
      | (35200, 39400] | 3 |
      | (39400, 43600] | 2 |
      | (43600, 47800] | 7 |
      | (47800, 52000] | 3 |
      | (52000, 56200] | 4 |
      | (56200, 60400] | 3 |
      | (60400, 64600] | 4 |
      | (64600, 68800] | 1 |
      | (68800, 73000] | 0 |
      | (73000, 77200] | 0 |
      | (77200, 81400] | 2 |
  13. 13. Relative/Percent Frequency

      | Class | Frequency | Relative frequency | Percent frequency (%) |
      |---|---|---|---|
      | [31000, 35200] | 1 | 0.033 | 3.33 |
      | (35200, 39400] | 3 | 0.100 | 10.00 |
      | (39400, 43600] | 2 | 0.067 | 6.67 |
      | (43600, 47800] | 7 | 0.233 | 23.33 |
      | (47800, 52000] | 3 | 0.100 | 10.00 |
      | (52000, 56200] | 4 | 0.133 | 13.33 |
      | (56200, 60400] | 3 | 0.100 | 10.00 |
      | (60400, 64600] | 4 | 0.133 | 13.33 |
      | (64600, 68800] | 1 | 0.033 | 3.33 |
      | (68800, 73000] | 0 | 0.000 | 0.00 |
      | (73000, 77200] | 0 | 0.000 | 0.00 |
      | (77200, 81400] | 2 | 0.067 | 6.67 |
  14. 14. Histogram
  15. 15. Skewness • To which side is the tail of the distribution longer or more drawn out? • Positive/Right skew • Negative/Left skew • Zero skewness means symmetric distribution.
  16. 16. Skewness
  17. 17. Summarizing Data for Two Categorical Variables • Tabular • Crosstabulation • Graphical • Side-by-side bar chart • Stacked bar chart
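  18. 18. Crosstabulation

      | | Strongly agree | Agree | Neutral | Disagree | Strongly disagree | Grand Total |
      |---|---|---|---|---|---|---|
      | Elderly | 0 | 0 | 0 | 0 | 3 | 3 |
      | Middle-aged | 4 | 6 | 5 | 2 | 4 | 21 |
      | Young | 1 | 1 | 1 | 2 | 1 | 6 |
      | Grand Total | 5 | 7 | 6 | 4 | 8 | 30 |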
  19. 19. Side-by-side Bar Chart [Figure: Opinions vs. age categories. x-axis: Opinions (Strongly agree, Agree, Neutral, Disagree, Strongly disagree); y-axis: Frequency (0 to 7); legend: Elderly, Middle-aged, Young]
  20. 20. Stacked Bar Chart [Figure: Opinions vs. age categories. x-axis: Opinions (Strongly agree, Agree, Neutral, Disagree, Strongly disagree); y-axis: Percentage (0% to 100%); legend: Elderly, Middle-aged, Young]
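
A 100% stacked bar chart of this kind presumably plots, for each opinion, the share contributed by each age category. The sketch below derives those shares from the crosstabulation counts on slide 18; the function name `column_percentages` is illustrative, and reading the chart as column percentages is an assumption.

```python
# Counts copied from the crosstabulation (rows: age category, columns: opinion).
opinions = ["Strongly agree", "Agree", "Neutral", "Disagree", "Strongly disagree"]
crosstab = {
    "Elderly":     [0, 0, 0, 0, 3],
    "Middle-aged": [4, 6, 5, 2, 4],
    "Young":       [1, 1, 1, 2, 1],
}


def column_percentages(table):
    """Convert counts to percentages of each column (opinion) total."""
    n_cols = len(next(iter(table.values())))
    col_totals = [sum(row[j] for row in table.values()) for j in range(n_cols)]
    return {
        label: [100 * row[j] / col_totals[j] for j in range(n_cols)]
        for label, row in table.items()
    }


shares = column_percentages(crosstab)
for j, opinion in enumerate(opinions):
    parts = ", ".join(f"{age}: {shares[age][j]:.1f}%" for age in crosstab)
    print(f"{opinion:<18} {parts}")
```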
  21. 21. Scatterplot: Visualizing the Relationship Between Two Quantitative Variables [Figure: Salary vs. Age. x-axis: Age (years), 0 to 70; y-axis: Salary, $0 to $90,000]
  22. 22. Creating Effective Graphical Displays • Give the display a clear and concise title. • Keep the display simple. • Clearly label each axis and provide the units of measure. • If colors are used, make sure they are distinct. • If multiple colors or line types are used, provide a legend.
  23. 23. Statistical Inference (Recap) [Diagram: a sample is drawn from the population; the sample statistic is used to infer the population parameter] • Population parameter, e.g., population average income $\mu$ • Sample statistic, e.g., sample average income $\bar{x}$ • A sample statistic is a point estimator of the corresponding population parameter.
  24. 24. Descriptive Statistics: Numerical Measures • Measures of location: • Measures of central location: (A single number which indicates a typical value of the data.) • Sample mean • Sample median • Sample mode • Sample percentiles • Sample quartiles • Measures of variability: (A single number which indicates the variability in the data.) • Sample range • Sample IQR • Sample variance • Sample standard deviation • Measures of distribution shape: (A single number which lets us know the shape of the distribution of the data.) • Skewness • Kurtosis
  25. 25. Some Common Notation • Let $x$ represent a variable of interest. • Let $n$ be the number of observations in the sample. This is the sample size. • Let $x_i$ be the $i$-th observation. • Let $N$ be the number of observations in the population. This is the size of the population.
  26. 26. Measures of Location • Measures of central location: (A single number which indicates a typical value of the data.) • Sample mean • Sample median • Sample mode • Sample percentiles • Sample quartiles
  27. 27. Sample Mean • Sample mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$ • Population mean: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$
  28. 28. Sample Median • The median of a data set is the value in the middle when the data items are arranged in ascending order. • The median divides the dataset into two parts, each with approximately 50% of observations. • Arrange the data in ascending order (smallest value to largest value). • For an odd number of observations, the median is the middle value. • For an even number of observations, the median is the average of the two middle values.
  29. 29. Sample Mode • The mode of a data set is the value that occurs with greatest frequency.
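
The three measures of central location above can be sketched as follows. The `ages` list is hypothetical data made up for illustration (it is not the deck's dataset), and the hand-written `median` mirrors the odd/even rule from the Sample Median slide.

```python
import statistics

# Hypothetical sample, for illustration only.
ages = [52, 40, 21, 65, 40, 35]


def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


sample_mean = sum(ages) / len(ages)
sample_median = median(ages)
sample_mode = statistics.mode(ages)  # value occurring with the greatest frequency

print(f"mean = {sample_mean:.2f}, median = {sample_median}, mode = {sample_mode}")
```

The standard library's `statistics.mean` and `statistics.median` give the same results; the manual versions are shown only to make the definitions explicit.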
  30. 30. Sample Percentile • The $p$-th percentile is a value such that at least $p$ percent of the observations are less than or equal to this value and at least $(100 - p)$ percent of the observations are greater than or equal to this value.
  31. 31. Sample Percentile • Arrange the data in ascending order. • Location of the $p$-th percentile: $L_p = \dfrac{p}{100}(n + 1)$
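
The location formula can be turned into a small function. The slide does not say what to do when $L_p$ is not a whole number; the sketch below assumes linear interpolation between the two neighbouring ordered values, which is a common textbook convention. The data and the name `percentile` are hypothetical.

```python
def percentile(values, p):
    """p-th percentile using the location L_p = (p / 100) * (n + 1).

    Assumption: a fractional location is resolved by linear interpolation
    between the two neighbouring ordered observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    location = p / 100 * (n + 1)   # 1-based position in the ordered data
    lower = int(location)          # integer part of the location
    fraction = location - lower    # fractional part of the location
    if lower < 1:                  # location falls before the first observation
        return ordered[0]
    if lower >= n:                 # location falls at or after the last observation
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


# Hypothetical sample, for illustration only.
data = [21, 25, 30, 38, 40, 42, 55, 65]
p25 = percentile(data, 25)   # L_25 = 2.25
p50 = percentile(data, 50)   # L_50 = 4.5
p75 = percentile(data, 75)   # L_75 = 6.75
print(p25, p50, p75)
```

Other software uses different location rules, so percentiles computed elsewhere may differ slightly for small samples.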
  32. 32. Sample Quartiles • The quartiles divide the dataset into four parts, each with approximately 25% of observations. • First Quartile $Q_1$ = 25th Percentile • Second Quartile $Q_2$ = 50th Percentile • Third Quartile $Q_3$ = 75th Percentile
  33. 33. Measures of Variability • Measures of variability: (A single number which indicates the variability in the data.) • Sample range • Sample IQR • Sample variance • Sample standard deviation
  34. 34. Sample Range • Sample range = Largest value − Smallest value
  35. 35. Sample Interquartile Range (IQR) • $IQR = Q_3 - Q_1$
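  36. 36. Box Plot [Figure: box plot. The box spans $Q_1$ to $Q_3$ with the median marked; the whiskers extend to the minimum value greater than the lower inner fence and the maximum value less than the upper inner fence. Inner fences: $Q_1 - 1.5 \times IQR$ and $Q_3 + 1.5 \times IQR$. Outer fences: $Q_1 - 3 \times IQR$ and $Q_3 + 3 \times IQR$. A minor outlier and a major outlier are marked.]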
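
The fences of the box plot can be computed directly from the quartiles. In the sketch below the data are hypothetical, and the labelling rule (a minor outlier lies between the inner and outer fences, a major outlier lies beyond the outer fence) is the usual convention, assumed here because the slide only shows the labels. `statistics.quantiles` with its default `"exclusive"` method uses the $(n + 1)$ location rule with interpolation.

```python
import statistics

# Hypothetical sample, for illustration only.
data = [21, 25, 30, 38, 40, 42, 55, 65, 120]

q1, q2, q3 = statistics.quantiles(data, n=4)  # default method="exclusive"
iqr = q3 - q1

inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr


def classify(value):
    """Label a value relative to the inner and outer fences (assumed convention)."""
    if value < outer_low or value > outer_high:
        return "major outlier"
    if value < inner_low or value > inner_high:
        return "minor outlier"
    return "not an outlier"


print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"inner fences = ({inner_low}, {inner_high}), outer fences = ({outer_low}, {outer_high})")
for value in data:
    label = classify(value)
    if label != "not an outlier":
        print(f"{value}: {label}")
```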
  37. 37. Sample Variance • Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ • Population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
  38. 38. Sample Standard Deviation • Sample standard deviation: $s = \sqrt{s^2}$ • Population standard deviation: $\sigma = \sqrt{\sigma^2}$
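
To make the difference between the $n - 1$ and $N$ denominators concrete, the sketch below computes both variances and standard deviations by hand and prints the standard library's results alongside. The data are hypothetical and chosen so the arithmetic is easy to follow.

```python
import math
import statistics

# Hypothetical data, for illustration only. The mean is 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n
sum_sq_dev = sum((x - mean) ** 2 for x in values)  # sum of squared deviations

sample_variance = sum_sq_dev / (n - 1)   # treats the data as a sample
population_variance = sum_sq_dev / n     # treats the data as the whole population

sample_sd = math.sqrt(sample_variance)
population_sd = math.sqrt(population_variance)

print(f"s^2 = {sample_variance:.4f}, s = {sample_sd:.4f}")
print(f"sigma^2 = {population_variance:.4f}, sigma = {population_sd:.4f}")

# Cross-check against the standard library.
print(statistics.variance(values), statistics.pvariance(values))
```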
  39. 39. Chebyshev’s Theorem • At least $\left(1 - \dfrac{1}{z^2}\right)$ of the data values must be within $z$ standard deviations of the mean, where $z$ is any value greater than 1.
  40. 40. Suppose that you are interested in analyzing the amount of time spent by users browsing through Swiggy before they come to a decision about what to order. You know that the average time spent browsing is 6.9 minutes. Suppose that the standard deviation is 1.2 minutes. • What can you say about the percentage of users who spend between 4.5 minutes and 9.3 minutes browsing Swiggy? • What can you say about the percentage of users who spend between 5.4 minutes and 9.3 minutes browsing Swiggy?
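
One way to approach the two questions with Chebyshev's theorem is sketched below. For the first interval both endpoints are 2 standard deviations from the mean. The second interval is not symmetric, so the sketch falls back on the largest symmetric interval that fits inside it (5.4 to 8.4 minutes, i.e. $z = 1.25$); that gives a valid but not necessarily tight lower bound. The function names are illustrative.

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of data within z standard deviations of the mean (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem is informative only for z > 1")
    return 1 - 1 / z ** 2


def bound_for_interval(mean, sd, low, high):
    """Lower bound for the share of data in [low, high].

    Uses the largest interval symmetric about the mean that fits inside
    [low, high], so the bound is conservative for asymmetric intervals.
    """
    z = min(mean - low, high - mean) / sd
    return chebyshev_lower_bound(z)


mean_minutes, sd_minutes = 6.9, 1.2  # values given in the exercise

first = bound_for_interval(mean_minutes, sd_minutes, 4.5, 9.3)   # z = 2
second = bound_for_interval(mean_minutes, sd_minutes, 5.4, 9.3)  # z = 1.25

print(f"Between 4.5 and 9.3 minutes: at least {first:.0%} of users")
print(f"Between 5.4 and 9.3 minutes: at least {second:.0%} of users")
```

Under these assumptions the bounds come out at roughly 75% and 36% respectively.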
  41. 41. Measures of Association Between Two Variables • Covariance • Correlation
  42. 42. Covariance • Covariance is a descriptive measure of the strength of linear association between two variables. • Sample covariance: $s_{xy} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ • Population covariance: $\sigma_{xy} = \dfrac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$ • A positive value indicates a positive relationship. • A negative value indicates a negative relationship. • Sensitive to the units of measurement of the variables!
  43. 43. Correlation • The correlation coefficient is a dimensionless measure of the strength of linear association between two variables. • Sample correlation coefficient: $r_{xy} = \dfrac{s_{xy}}{s_x s_y}$ • Population correlation coefficient: $\rho_{xy} = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$ • Always lies in the interval $[-1, 1]$. • Values close to 0 indicate a weak linear relationship. • Values close to 1 indicate a strong positive linear relationship. • Values close to -1 indicate a strong negative linear relationship.
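
The sample covariance and correlation formulas can be checked on a small example. The paired data below are hypothetical; the rescaling step illustrates the point that covariance depends on the units of measurement while the correlation coefficient does not.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance s_xy with the n - 1 denominator."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Sample correlation r_xy = s_xy / (s_x * s_y)."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # s_xx is the sample variance of x
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical paired observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_covariance(x, y)
r_xy = sample_correlation(x, y)
print(f"s_xy = {s_xy:.2f}, r_xy = {r_xy:.4f}")

# Changing the units of x (e.g. multiplying by 100) rescales the covariance
# but leaves the correlation coefficient unchanged.
x_rescaled = [100 * value for value in x]
s_xy_rescaled = sample_covariance(x_rescaled, y)
r_xy_rescaled = sample_correlation(x_rescaled, y)
print(f"rescaled: s_xy = {s_xy_rescaled:.2f}, r_xy = {r_xy_rescaled:.4f}")
```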
