Intro to quant_analysis_students

Published in: Technology, Art & Photos

  1. 1. Week 11: Basic Descriptive Quantitative Data Analysis: Tables, Graphs, & Summary Statistics
  2. 2. Objectives • Learn about basic descriptive quantitative analysis • How to perform these tasks in Excel • Starting point for 502B • Excel knowledge and quantitative skills are highly desired by employers • EC stream
  3. 3. Introduction • Without data, it is anyone’s opinion • Why use tables, graphs, summary stats? “At their best, tables, graphs, and statistics are instruments for reasoning about complex quantitative information.” • Why learn how to design them appropriately? “At their worst, tables, graphs and summary statistics are instruments of evil used for deceiving a naive viewer.” • Does your mindset match my dataset! • http://www.ted.com/talks/hans_rosling_at_state.html
  4. 4. Quantitative Research Process
  5. 5. Introduction
  6. 6. Presenting the Data
  7. 7. Frequency Distribution • A convenient way of summarizing a lot of tabular data • What is a Frequency Distribution? • A frequency distribution is a list or a table … • containing class groupings (categories or ranges within which the data fall) ... • and the corresponding frequencies with which data fall within each class or category • For nominal/ordinal data
  8. 8. Introduction
  9. 9. Table 1: Univariate Frequencies of Percentage of Sales Reported to Tax Authorities
        % of Sales Reported | Frequency | Percent (%)
        100%                | 3307      | 40.56
        90-99%              | 1096      | 13.44
        80-89%              | 916       | 11.24
        70-79%              | 703       | 8.62
        60-69%              | 501       | 6.14
        50-59%              | 694       | 8.51
        <50%                | 936       | 11.48
        Total               | 8153      | 100
        Source: 1999 World Bank World Business Environment Survey (WBES), excludes missing observations. http://www.enterprisesurveys.org/
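The Percent column in Table 1 is each class frequency divided by the total. A minimal Python sketch of that calculation follows; the frequencies are copied from the table, while the function and variable names are illustrative assumptions rather than anything from the course materials (the slides themselves do this in Excel).

```python
# Frequencies copied from Table 1 (WBES 1999, missing observations excluded).
frequencies = {
    "100%": 3307, "90-99%": 1096, "80-89%": 916, "70-79%": 703,
    "60-69%": 501, "50-59%": 694, "<50%": 936,
}


def relative_frequencies(counts):
    """Return each class's share of the total, in percent."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


percents = relative_frequencies(frequencies)

# Print a small frequency table: class, count, percent.
for label, n in frequencies.items():
    print(f"{label:>7}  {n:>5}  {percents[label]:6.2f}")
print(f"{'Total':>7}  {sum(frequencies.values()):>5}  {sum(percents.values()):6.2f}")
```

The relative frequencies sum to 100 by construction, which is a quick check that no class was dropped.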
  10. 10. Contingency/Pivot/Cross Table • May also want to produce a table with more categories • Cross table or Contingency table or Pivot table • Suitable if you have two nominal/ordinal variables • Simple extension to a univariate table • Considers relationship between two variables • Row variable (Dependent) • Column variable (Independent)
  11. 11. Table 2: Percentage of Sales Reported to Tax Authorities by Region
        % Reported | Africa | Transition Europe | Asia  | Latin America | OECD | Former Soviet Countries | Total
        100%       | 490    | 554               | 416   | 794           | 446  | 607                     | 3,307
        90-99%     | 266    | 196               | 142   | 119           | 145  | 228                     | 1,096
        80-89%     | 158    | 152               | 117   | 192           | 73   | 224                     | 916
        70-79%     | 162    | 117               | 103   | 153           | 43   | 125                     | 703
        60-69%     | 140    | 69                | 70    | 115           | 22   | 85                      | 501
        50-59%     | 140    | 105               | 141   | 118           | 16   | 174                     | 694
        <50%       | 100    | 106               | 283   | 296           | 25   | 126                     | 936
        Total      | 1,456  | 1,299             | 1,272 | 1,787         | 770  | 1,569                   | 8,153
        Source: 1999 World Bank World Business Environment Survey (WBES). * Excludes missing observations
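Raw counts in a cross table are hard to compare across regions of different sizes, so a common next step is to convert each column to percentages. The sketch below does this for Table 2; the counts come from the table, and the variable names are illustrative assumptions.

```python
# Counts copied from Table 2 (rows: % of sales reported; columns: region).
regions = ["Africa", "Transition Europe", "Asia", "Latin America",
           "OECD", "Former Soviet Countries"]
counts = {
    "100%":   [490, 554, 416, 794, 446, 607],
    "90-99%": [266, 196, 142, 119, 145, 228],
    "80-89%": [158, 152, 117, 192, 73, 224],
    "70-79%": [162, 117, 103, 153, 43, 125],
    "60-69%": [140, 69, 70, 115, 22, 85],
    "50-59%": [140, 105, 141, 118, 16, 174],
    "<50%":   [100, 106, 283, 296, 25, 126],
}

# Column totals: number of observations in each region.
column_totals = [sum(column) for column in zip(*counts.values())]

# Column percentages: within each region, the share of observations in each row.
column_percents = {
    label: [100 * n / total for n, total in zip(row, column_totals)]
    for label, row in counts.items()
}

for region, share in zip(regions, column_percents["100%"]):
    print(f"{region:<24} {share:5.1f}% of observations report 100% of sales")
```

Percentages are taken within columns here because the slide treats the column variable (region) as independent and the row variable as dependent.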
  12. 12. Features of a Table • Title that accurately summarizes the data • Simple, indicates major variables, and time frame (if applicable) • Source: data set or origin of table • Explanatory footnotes • Easy to read & separated from text • Properly formatted for style (see APA Rules) • Necessary to advance analysis • See Module 7 for APA Table Checklist • Reproduced from APA manual
  13. 13. Presenting the Data
  14. 14. Bar Graph • Often used to describe categorical data • Ordinal/Nominal • Draws attention to the frequency of each category
  15. 15. Table 1 (repeated from slide 9): Univariate Frequencies of Percentage of Sales Reported to Tax Authorities. Source: 1999 World Bank World Business Environment Survey (WBES), excludes missing observations. http://www.enterprisesurveys.org/
  16. 16. Bar Graph. Figure 1: Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
  17. 17. Relative Frequency Polygon
  18. 18. Pie Graph • Emphasizes the proportion of each category • Something that may be good for our tax evasion data • Circle represents the total • Segments represent the shares of the total • Segment size is proportional to frequency
  19. 19. Pie Graph. Figure 1: Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
  20. 20. Pie Graph. Figure 1: Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
  21. 21. Pie Graph. Figure 1: Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
  22. 22. Charts in Excel I
  23. 23. Table 2 (repeated from slide 11): Percentage of Sales Reported to Tax Authorities by Region
  24. 24. Bar Graph. Figure 1: Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
  25. 25. Segmented Bar Chart. Figure 1: Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
  26. 26. Pie Graph. Figure 2: Percentage of sales reported to tax authority by region. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
  27. 27. Vertical Bar Chart
  28. 28. Charts in Excel II
  29. 29. Time Series Graph • Time series are often used in social sciences • Data collected at various time periods: daily, weekly, monthly, quarterly, annually, etc. • Examples include GDP, Unemployment, University Tuition • Plot series of interest over time • Let’s look at a graph of the unemployment rate by gender and age
  30. 30. Line Graph
  31. 31. Histogram • Used for continuous data • Frequency distribution for continuous data • Summary graph showing a count of the data points falling in various ranges • Rough approximation of the distribution of the data • A histogram is a way to summarize data • The distribution condenses the raw data into a more useful form... • and allows for a quick visual interpretation of the data
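Counting how many data points fall in each range is the calculation behind a histogram. A minimal sketch follows, assuming equal-width bins; the `ages` data, the bin boundaries, and the function name are hypothetical and chosen only for illustration.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values falling in n_bins equal-width ranges beginning at start."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the covered range are skipped
            counts[index] += 1
    return counts


# Hypothetical continuous variable (ages in years), not data from the slides.
ages = [21, 23, 25, 31, 34, 35, 38, 42, 47, 58]

# Bins: [20, 30), [30, 40), [40, 50), [50, 60)
counts = histogram_counts(ages, start=20, width=10, n_bins=4)

# A rough text histogram: one '#' per observation in the bin.
for i, count in enumerate(counts):
    low = 20 + 10 * i
    print(f"{low}-{low + 9}: {'#' * count}")
```

Changing `width` changes the shape of the picture, which is why a histogram is only a rough approximation of the underlying distribution.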
  32. 32. Histogram
  33. 33. Scatter Graphs • Graph the relationship between two continuous variables
  34. 34. Scatter Graph
  35. 35. Principles of Graphical Excellence • Well-designed presentation of interesting data • Substance & design • Simplicity of design, complexity of data • Proportion and balance • Clear, precise, efficient • Know what you are trying to show (have a story) • Make sure your graph shows it • Well formatted, professional • Choose a format that reflects your data and the story • Informative and legible axes • Fully labelled & legible • Gets across main point(s) in the shortest time with the least ink in the smallest space • Adds information not otherwise available to the reader • But supplemented with text describing the figure • Tells the truth about the data • Limits complexity and confusion • Avoid chartjunk

  36. 36. Examples of Chartjunk [Figure: two example charts, one of quarterly values (1st Qtr to 4th Qtr) and one with the categories West, North, Northeast, Southwest, Mexico, Europe, Japan, East, South, International]
  37. 37. Examples of Chartjunk [Figure: quarterly chart (1st Qtr to 4th Qtr) annotated with its faults] • Gridlines! • Vibration • Pointless Fake 3-D Effects • Filled “Floor” • Clip Art • In or out? • Filled “Walls” • Borders and Fills Galore • Unintentional Heavy or Double Lines • Filled Labels • Serif Font with Thin & Thick Lines
  38. 38. Displaying Data: “Mistakes” • Graphs are also instruments of evil used for deceiving a naive viewer. • Non-zero origin • Omitting data that refutes your “evidence” • Limiting scope of data
  39. 39. What is Wrong with this Graph? Provincial Personal Income Taxes: single individual with $45,000 in income claiming basic personal tax credits
  40. 40. The Real Story
  41. 41. Exaggerates a change in data. Source: Statistics Canada, CANSIM II, V31215364
  42. 42. Dr. Kendall
  43. 43. Worst Recession Since the Depression (?)
  44. 44. Presenting the Data
  45. 45. Describing Data Numerically • Central Tendency: Simple Arithmetic Mean, Median, Mode • Variation: Range, Variance, Standard Deviation • Association: Covariance, Correlation • Shape of the Distribution
  46. 46. Mode • A measure of central tendency • Value that occurs most often • Not affected by extreme values • Used for either numerical or categorical data • There may be no mode or several modes • What are the modes for the displayed data? [Figure: two dot plots, one on a 0 to 14 axis and one on a 0 to 6 axis]
  47. 47. Mode • A measure of central tendency • Value that occurs most often • Not affected by extreme values • Used for either numerical or categorical data • There may be no mode • There may be several modes [Figure: dot plot on a 0 to 14 axis, Mode = 9; dot plot on a 0 to 6 axis, No Mode]
  48. 48. Mode • There may be several modes [Figure: dot plot on a 0 to 14 axis, Mode = 5 & 9]
  49. 49. Mode • Caution: Mode may not be representative of the data • {0.1, 0.1, 5000, 4900, 4500, 5200,…}
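The three cases on the mode slides (several modes, an unrepresentative mode, and no mode) can be checked with Python's standard `statistics.multimode`. This is a sketch only: `sample` and `no_repeat` are hypothetical data, and `caution` holds just the six values visible on the slide, whose list continues beyond them.

```python
from statistics import multimode

# Hypothetical sample in which 5 and 9 each occur three times.
sample = [0, 1, 1, 3, 5, 5, 5, 9, 9, 9, 12]
modes = multimode(sample)

# Only the six values visible on the slide; the slide's list continues ("…").
caution = [0.1, 0.1, 5000, 4900, 4500, 5200]
caution_modes = multimode(caution)

# When every value occurs exactly once, multimode returns all of them,
# which corresponds to the slides' "no mode" case.
no_repeat = [1, 2, 3, 4]
all_tied = multimode(no_repeat)

print("Modes of sample:", modes)
print("Mode of caution data:", caution_modes)
print("All values tied:", all_tied)
```

In the caution data the mode is 0.1 even though most values are in the thousands, which is the sense in which the mode may not represent the data.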
  50. 50. Median • In an ordered list, the median is the “middle” number (50% above, 50% below) [Figure: two dot plots on 0 to 10 axes]
  51. 51. Mean • The “balancing point” (centre of gravity) of the data • E.g. the data “balances” at 5 [Figure: points on a 1 to 9 axis with deviations of -2, -1, and +3 from the mean]
  52. 52. Arithmetic Mean • The arithmetic mean (mean) is the most common measure of central tendency • Calculated by summing the observed values and dividing by the number of observations • For a sample of size n: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$, where the $x_i$ are the observed values and $n$ is the number of observations
  53. 53. Arithmetic Mean • The most common measure of central tendency • Mean = sum of values divided by the number of values • Affected by extreme values (outliers) • What is the mean for these examples? [Figure: two dot plots on 0 to 10 axes]
  54. 54. Arithmetic Mean • The most common measure of central tendency • Mean = sum of values divided by the number of values • Affected by extreme values (outliers) • Mean = 3: $\frac{1+2+3+4+5}{5} = \frac{15}{5} = 3$ • Mean = 4: $\frac{1+2+3+4+10}{5} = \frac{20}{5} = 4$ [Figure: two dot plots on 0 to 10 axes]
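The two data sets on this slide differ only in their largest value, which makes them a convenient check of how the mean and the median react to an outlier. The sketch below uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

# The two five-value examples from the slide.
without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]

for name, data in [("without outlier", without_outlier),
                   ("with outlier", with_outlier)]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")
```

Replacing 5 with 10 moves the mean from 3 to 4 while the median stays at 3, which illustrates why the mean is described as affected by extreme values.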
  55. 55. Measures of Central Tendency: Overview • Mean: arithmetic average, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ • Median: midpoint of ranked values (50% below, 50% above) • Mode: most frequently observed value
  56. 56. The “Shape of a Distribution” • Use information on mean, median, and mode to “visualize” the data • A data distribution is said to be symmetric if its shape is the same on both sides of the median • Symmetry implies that median = arithmetic mean • If a distribution is unimodal and symmetric, then median = mean = mode
  57. 57. The “Shape of a Distribution” [Figure: bar chart of # of obs. (0 to 9) by value (1 to 7); the median splits the data 50%/50%; labelled UNIMODAL] • Symmetric: Median = Mean • Symmetric & Unimodal: Median = Mean = Mode
  58. 58. The “Shape of a Distribution” [Figure: bar chart of # of obs. (0 to 9) by value (1 to 7); the median splits the data 50%/50%; two peaks labelled BIMODAL] • Symmetric: Median = Mean • Symmetric & Bimodal: Median = Mean ≠ Mode
  59. 59. The “Shape of a Distribution” [Figure: bar chart of # of obs. (0 to 8) by value (1 to 7); the median splits the data 50%/50%; labelled MODE?] • Symmetric: Median = Mean • Symmetric & no mode: Median = Mean (Uniform)
  60. 60. The “Shape of a Distribution” • An asymmetric distribution is said to be skewed 1. Negatively if Mean < Median < Mode 2. Positively if Mean > Median > Mode • Hence, by comparing our measures of central tendency, we can start to visualize the shape and characteristics of the data
  61. 61. The “Shape of a Distribution” [Figure: bar chart (0 to 12) over values 1 to 8; the median splits the data 50%/50%] • MODE = 2, MEDIAN = 3, MEAN = 3.2 • MODE < MEDIAN < MEAN = POSITIVELY SKEWED DISTRIBUTION
  62. 62. Example: Positively skewed variable • The Distribution of After-Tax Income: shows the distribution of income across all Canadian households
  63. 63. Example: Positively skewed variable • The mode income is the most common income and was in the range from $15,000 to $19,999. • The median income is the level of income that separates the population into two groups of equal size and was $39,700. • The mean income is the average income and was $48,400.
  64. 64. Example: Positively skewed variable • A distribution in which the mean exceeds the median and the median exceeds the mode is positively skewed, which means it has a long tail of high values. • The distribution of income in Canada is positively skewed. • The median is more likely to be reported than the mean, since the long tail distorts the average
  65. 65. Example: Positively skewed variable • Volunteer hours • Charitable contributions • # of cigarette packs smoked (excluding 0) • Collective bargaining agreement duration (in years) • # of beers consumed on a Saturday night • Duration of low income (in years) • Number of children
  66. 66. The “Shape of a Distribution” [Figure: bar chart (0 to 12) over values 0 to 7; the median splits the data 50%/50%] • MODE = 6, MEDIAN = 5, MEAN = 4.7 • MEAN < MEDIAN < MODE = NEGATIVELY SKEWED DISTRIBUTION
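The skewness rule from slide 60 compares only three numbers, so it can be written as a small function. The sketch below applies it to the summary values shown on slides 61 and 66; the function name and the wording of the "inconclusive" case are assumptions, and the rule itself is a rule of thumb rather than a formal test of skewness.

```python
def classify_skew(mean, median, mode):
    """Apply the slides' rule of thumb to three summary statistics."""
    if mean > median > mode:
        return "positively skewed"
    if mean < median < mode:
        return "negatively skewed"
    # Equal or differently ordered values: the rule gives no verdict.
    return "inconclusive by this rule"


# Summary values from slide 61 (mode = 2, median = 3, mean = 3.2).
print(classify_skew(mean=3.2, median=3, mode=2))

# Summary values from slide 66 (mode = 6, median = 5, mean = 4.7).
print(classify_skew(mean=4.7, median=5, mode=6))
```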
  67. 67. Examples • University grades • Age • Years in school • Etc.
  68. 68. Describing Data Numerically • Central Tendency: Simple Arithmetic Mean, Median, Mode • Variation: Range, Variance, Standard Deviation • Association: Covariance, Correlation • Shape of the Distribution
  69. 69. Measures of Dispersion/Variability • Variation: Range, Variance, Standard Deviation • Measures of variation give information on the spread or variability of the data values. [Figure: Same center, different variation]
  70. 70. Range • Simplest measure of variation • Difference between the largest and the smallest observations: Range = Xlargest – Xsmallest • Example: [Figure: dot plot on a 0 to 14 axis]
  71. 71. Range • Simplest measure of variation • Difference between the largest and the smallest observations: Range = Xlargest – Xsmallest • Example: Range = 14 - 1 = 13 [Figure: dot plot on a 0 to 14 axis]
  72. 72. The Range • Problem • Ignores all but two data points • These values may be “outliers” (i.e. not representative)
  73. 73. Disadvantages of the Range • Ignores the way in which data are distributed: [Figure: two differently distributed dot plots on 7 to 12 axes], both with Range = 12 - 7 = 5 • Sensitive to outliers: 1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5 has Range = 5 - 1 = 4, while 1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120 has Range = 120 - 1 = 119
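The outlier example on this slide is easy to reproduce: the two lists share 24 values and differ only in the last one. A minimal sketch, with illustrative names (`data_range`, `shared`):

```python
def data_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


# The 24 values common to both lists on the slide: eleven 1s, eight 2s, four 3s, one 4.
shared = [1] * 11 + [2] * 8 + [3] * 4 + [4]

typical = shared + [5]         # last value is 5
with_outlier = shared + [120]  # last value is 120

print("Range without outlier:", data_range(typical))
print("Range with outlier:", data_range(with_outlier))
```

Changing one observation out of 25 moves the range from 4 to 119, because the range depends only on the two most extreme values.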
  74. 74. The Variance • A single summary measure of dispersion would be more helpful • Takes account of all data values
  75. 75. The Variance 1. Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ 2. Standard Deviation: $s = \sqrt{s^2}$, i.e. standard deviation = $\sqrt{\text{variance}}$
  76. 76. Measuring variation [Figure: a distribution with a small standard deviation next to one with a large standard deviation]
  77. 77. Comparing Standard Deviations [Figure: three dot plots on 11 to 21 axes] • Data A: Mean = 15.5, s = 3.338 • Data B: Mean = 15.5, s = 0.926 • Data C: Mean = 15.5, s = 4.570
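The sample variance and standard deviation can be computed directly from their definitions (squared deviations from the mean, divided by n - 1). In the sketch below, `data_a` and `data_b` are illustrative values chosen so that the results agree with the mean and s reported for Data A and Data B; the slide's actual data points are not recoverable from this transcript, so these lists are an assumption.

```python
import math
from statistics import stdev


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Standard deviation = square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# Illustrative data (assumed, not taken from the slide), both with mean 15.5.
data_a = [11, 12, 13, 16, 16, 17, 18, 21]  # widely spread
data_b = [14, 15, 15, 15, 16, 16, 16, 17]  # tightly clustered

for name, data in [("A", data_a), ("B", data_b)]:
    print(f"Data {name}: mean = {sum(data) / len(data)}, s = {sample_std(data):.3f}")

# Cross-check against the standard library's sample standard deviation.
assert math.isclose(sample_std(data_a), stdev(data_a))
```

Both lists have the same centre, yet their standard deviations differ by a factor of more than three, which is the point of comparing spread separately from central tendency.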
  78. 78. Describing Data Numerically • Central Tendency: Simple Arithmetic Mean, Median, Mode • Variation: Range, Variance, Standard Deviation • Association: Covariance, Correlation • Shape of the Distribution
  79. 79. The Sample Covariance • The covariance measures the strength of the linear relationship between two variables • The sample covariance: $\mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$ • Only concerned with the strength of the relationship • No causal effect is implied
  80. 80. Interpreting Covariance • Covariance between two variables: • Cov(x,y) > 0: x and y tend to move in the same direction • Cov(x,y) < 0: x and y tend to move in opposite directions • Cov(x,y) = 0: x and y have no linear relationship (this alone does not establish that they are independent)
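The three sign cases can be illustrated with a direct implementation of the sample covariance. All data below are hypothetical and were chosen to make the arithmetic exact. The last case is worth noting: a covariance of zero rules out a linear relationship only, and the example pairs x with its own square, which depends on x completely.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data, not from the slides.
x = [-2, -1, 0, 1, 2]
same_direction = [2 * v + 1 for v in x]  # rises when x rises
opposite_direction = [-v for v in x]     # falls when x rises
curved = [v ** 2 for v in x]             # determined by x, but not linearly

print("Cov > 0:", sample_covariance(x, same_direction))
print("Cov < 0:", sample_covariance(x, opposite_direction))
print("Cov = 0:", sample_covariance(x, curved))
```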
  81. 81. Coefficient of Correlation • Measures the relative strength of the linear relationship between two variables • Sample correlation coefficient: $r = \frac{\mathrm{Cov}(x, y)}{s_x s_y}$
  82. 82. Features of Correlation Coefficient, r • Unit free • Ranges between –1 and 1 • The closer to –1, the stronger the negative linear relationship • The closer to 1, the stronger the positive linear relationship • The closer to 0, the weaker the linear relationship
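Dividing the covariance by both standard deviations is what makes r unit free. The sketch below checks this by measuring the same hypothetical variable in hours and in minutes; the data and all names (`hours`, `scores`, `correlation`) are assumptions made for illustration.

```python
import math


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(values):
    """Sample standard deviation: the variance is the covariance of a variable with itself."""
    return math.sqrt(sample_covariance(values, values))


def correlation(xs, ys):
    """r = Cov(x, y) / (s_x * s_y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Hypothetical data, not from the slides.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
minutes = [60 * h for h in hours]  # same variable in different units

r = correlation(hours, scores)
r_rescaled = correlation(minutes, scores)
r_perfect_negative = correlation(hours, [-h for h in hours])

print(f"r (hours)   = {r:.4f}")
print(f"r (minutes) = {r_rescaled:.4f}")
print(f"r for a perfectly decreasing line = {r_perfect_negative:.1f}")
```

The covariance itself is 60 times larger when x is measured in minutes, but r is unchanged, which is why r rather than the covariance is used to compare the strength of relationships.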
  83. 83. Interpreting the Correlation Coefficient, r
  84. 84. Scatter Plots of Data with Various Correlation Coefficients [Figure: panels of Y against X with r = -1 (Cov < 0), r = -.6 (Cov < 0), r = 0 (Cov = 0), r = +.3, r = +1, and r = 0]
  85. 85. 502B
  86. 86. Fun with Graphs • Does your mindset match my dataset! • http://www.ted.com/talks/hans_rosling_at_state.html
  87. 87. Looking ahead • SRs to client (cc) and Turnitin on Wednesday by noon • No class next week • Work on 598 critiques • 598 Critiques due in class & Turnitin Nov. 30 • Comments on your SRs will be ready Nov. 30 • Final SRs (if required) due Dec. 8 @11:55PM PST • Note carefully the requirements • Moodle site will be inaccessible sometime in December • Final Grades reported via usource once approved by the Director
