# Intro to quant_analysis_students

### Intro to quant_analysis_students

1. 1. Week 11: Basic Descriptive Quantitative Data Analysis: Tables, Graphs, & Summary Statistics
2. 2. Objectives
    - Learn about basic descriptive quantitative analysis
    - How to perform these tasks in Excel
    - Starting point for 502B
    - Excel knowledge and quantitative skills are highly desired by employers
    - EC stream
3. 3. Introduction
    - Without data, it is anyone’s opinion
    - Why use tables, graphs, summary stats? “At their best, tables, graphs, and statistics are instruments for reasoning about complex quantitative information.”
    - Why learn how to design them appropriately? “At their worst, tables, graphs and summary statistics are instruments of evil used for deceiving a naive viewer.”
    - Does your mindset match my dataset!
    - http://www.ted.com/talks/hans_rosling_at_state.html
4. 4. Quantitative Research Process
5. 5. Introduction
6. 6. Presenting the Data
7. 7. Frequency Distribution
    - A convenient way of summarizing a lot of tabular data
    - What is a frequency distribution?
    - A frequency distribution is a list or a table containing class groupings (categories or ranges within which the data fall) and the corresponding frequencies with which data fall within each class or category
    - For nominal/ordinal data
8. 8. Introduction
9. 9. Table 1: Univariate Frequencies of Percentage of Sales Reported to Tax Authorities

    | % of Sales Reported | Frequency | Percent (%) |
    |---|---|---|
    | 100% | 3307 | 40.56 |
    | 90-99% | 1096 | 13.44 |
    | 80-89% | 916 | 11.24 |
    | 70-79% | 703 | 8.62 |
    | 60-69% | 501 | 6.14 |
    | 50-59% | 694 | 8.51 |
    | <50% | 936 | 11.48 |
    | Total | 8153 | 100 |

    Source: 1999 World Bank World Business Environment Survey (WBES), excludes missing observations. http://www.enterprisesurveys.org/
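
As a rough illustration of how the Percent column in Table 1 follows from the Frequency column, the sketch below recomputes each class's relative frequency. The counts are copied from the slide; the variable names are only illustrative, and the course itself does this work in Excel rather than Python.

```python
# Frequencies copied from Table 1 (1999 WBES, missing observations excluded).
frequencies = {
    "100%": 3307,
    "90-99%": 1096,
    "80-89%": 916,
    "70-79%": 703,
    "60-69%": 501,
    "50-59%": 694,
    "<50%": 936,
}

# The total number of firms is the sum of all class frequencies.
total = sum(frequencies.values())

# Relative frequency of each class, as a percentage rounded to two decimals.
percents = {
    label: round(100 * count / total, 2)
    for label, count in frequencies.items()
}

for label, count in frequencies.items():
    print(f"{label:>7} {count:>6} {percents[label]:>6.2f}")
print(f"{'Total':>7} {total:>6}")
```
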
10. 10. Contingency/Pivot/Cross Table
    - May also want to produce a table with more categories
    - Cross table or contingency table or pivot table
    - Suitable if you have two nominal/ordinal variables
    - Simple extension to a univariate table
    - Considers relationship between two variables
    - Row variable (dependent)
    - Column variable (independent)
11. 11. Table 2: Percentage of Sales Reported to Tax Authorities by Region

    | % of Sales Reported | Africa | Transition Europe | Asia | Latin America | OECD | Former Soviet Countries | Total |
    |---|---|---|---|---|---|---|---|
    | 100% | 490 | 554 | 416 | 794 | 446 | 607 | 3,307 |
    | 90-99% | 266 | 196 | 142 | 119 | 145 | 228 | 1,096 |
    | 80-89% | 158 | 152 | 117 | 192 | 73 | 224 | 916 |
    | 70-79% | 162 | 117 | 103 | 153 | 43 | 125 | 703 |
    | 60-69% | 140 | 69 | 70 | 115 | 22 | 85 | 501 |
    | 50-59% | 140 | 105 | 141 | 118 | 16 | 174 | 694 |
    | <50% | 100 | 106 | 283 | 296 | 25 | 126 | 936 |
    | Total | 1,456 | 1,299 | 1,272 | 1,787 | 770 | 1,569 | 8,153 |

    Source: 1999 World Bank World Business Environment Survey (WBES). Excludes missing observations.
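
One way to read a contingency table like Table 2 is to convert counts into column percentages, so that regions of different sizes can be compared. The sketch below does this for the "100%" row only. The counts are copied from the slide, the region labels assume the flattened two-line header reads "Transition Europe", "Latin America", and "Former Soviet Countries", and all variable names are illustrative.

```python
# Counts copied from Table 2; each list follows the order of `regions`.
regions = [
    "Africa", "Transition Europe", "Asia",
    "Latin America", "OECD", "Former Soviet Countries",
]
counts = {
    "100%":   [490, 554, 416, 794, 446, 607],
    "90-99%": [266, 196, 142, 119, 145, 228],
    "80-89%": [158, 152, 117, 192, 73, 224],
    "70-79%": [162, 117, 103, 153, 43, 125],
    "60-69%": [140, 69, 70, 115, 22, 85],
    "50-59%": [140, 105, 141, 118, 16, 174],
    "<50%":   [100, 106, 283, 296, 25, 126],
}

# Column totals: number of firms surveyed in each region.
column_totals = [
    sum(row[j] for row in counts.values())
    for j in range(len(regions))
]

# Column percentage: share of each region's firms reporting 100% of sales.
full_share = {
    region: round(100 * counts["100%"][j] / column_totals[j], 1)
    for j, region in enumerate(regions)
}

for region, share in full_share.items():
    print(f"{region:<24} {share:5.1f}%")
```
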
12. 12. Features of a Table
    - Title that accurately summarizes the data
    - Simple, indicates major variables, and time frame (if applicable)
    - Source: data set or origin of table
    - Explanatory footnotes
    - Easy to read & separated from text
    - Properly formatted for style (see APA Rules)
    - Necessary to advance analysis
    - See Module 7 for APA Table Checklist
    - Reproduced from APA manual
13. 13. Presenting the Data
14. 14. Bar Graph
    - Often used to describe categorical data
    - Ordinal/nominal
    - Draws attention to the frequency of each category
15. 15. Table 1: Univariate Frequencies of Percentage of Sales Reported to Tax Authorities (repeated from slide 9)
16. 16. Bar Graph. Figure 1: Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
17. 17. Relative Frequency Polygon
18. 18. Pie Graph
    - Emphasizes the proportion of each category
    - Something that may be good for our tax evasion data
    - Circle represents the total
    - Segments represent the shares of the total
    - Segment size is proportional to frequency
19. 19. Pie Graph. Figure 1: Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
20. 20. Pie Graph. Figure 1: Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
21. 21. Pie Graph. Figure 1: Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
22. 22. Charts in Excel I
23. 23. Table 2: Percentage of Sales Reported to Tax Authorities by Region (repeated from slide 11)
24. 24. Bar Graph. Figure 1: Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
25. 25. Segmented Bar Chart. Figure 1: Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
26. 26. Pie Graph. Figure 2: Percentage of sales reported to tax authority by region. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
27. 27. Vertical Bar Chart
28. 28. Charts in Excel II
29. 29. Time Series Graph
    - Time series are often used in social sciences
    - Data collected at various time periods: daily, weekly, monthly, quarterly, annually, etc.
    - Examples include GDP, unemployment, university tuition
    - Plot series of interest over time
    - Let’s look at a graph of the unemployment rate by gender and age
30. 30. Line Graph
31. 31. Histogram
    - Used for continuous data
    - Frequency distribution for continuous data
    - Summary graph showing a count of the data points falling in various ranges
    - Rough approximation of the distribution of the data
    - A histogram is a way to summarize data
    - The distribution condenses the raw data into a more useful form and allows for a quick visual interpretation of the data
32. 32. Histogram
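
The counting step behind a histogram can be sketched in a few lines. This is a minimal illustration, not the Excel procedure used in the course: the data values, the class width of 2, and the helper name `bin_counts` are all made up for the example.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each half-open class [lower, upper)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        # Values outside the covered range are simply not counted.
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical continuous measurements.
values = [1.2, 2.5, 2.7, 3.1, 3.3, 3.8, 4.0, 4.4, 5.9, 7.5]
counts = bin_counts(values, start=0, width=2, n_bins=4)

# Text histogram: one star per observation in each class.
for i, c in enumerate(counts):
    lower, upper = i * 2, (i + 1) * 2
    print(f"[{lower}, {upper}): {'*' * c}")
```

Changing `width` changes how coarse the picture is, which is why a histogram is only a rough approximation of the underlying distribution.
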
33. 33. Scatter Graphs
    - Graph the relationship between two continuous variables
34. 34. Scatter Graph
35. 35. Principles of Graphical Excellence
    - Well-designed presentation of interesting data
    - Substance & design
    - Simplicity of design, complexity of data
    - Proportion and balance
    - Clear, precise, efficient
    - Know what you are trying to show (have a story) and make sure your graph shows it
    - Well formatted, professional
    - Choose a format that reflects your data and the story
    - Informative and legible axes
    - Fully labelled & legible
    - Gets across main point(s) in the shortest time with the least ink in the smallest space
    - Adds information not otherwise available to the reader
    - But supplemented with text describing the figure
    - Tells the truth about the data
    - Limits complexity and confusion
    - Avoid chartjunk
36. 36. Examples of Chartjunk. Figures: a quarterly chart (1st Qtr to 4th Qtr) and a chart of regional series (West, North, Northeast, Southwest, Mexico, Europe, Japan, East, South, International)
37. 37. Examples of Chartjunk. Figure: annotated quarterly chart (1st Qtr to 4th Qtr)
    - Gridlines!
    - Vibration
    - Pointless fake 3-D effects
    - Filled “floor”
    - Clip art
    - In or out?
    - Filled “walls”
    - Borders and fills galore
    - Unintentional heavy or double lines
    - Filled labels
    - Serif font with thin & thick lines
38. 38. Displaying Data: “Mistakes”
    - Graphs are also instruments of evil used for deceiving a naive viewer.
    - Non-zero origin
    - Omitting data that refutes your “evidence”
    - Limiting scope of data
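
To see why a non-zero origin can mislead, it may help to compare how tall two bars look with and without a truncated axis. The sketch below uses made-up values (48 and 50) and a hypothetical helper, `visual_ratio`; it is only an arithmetic illustration of the point on the slide.

```python
def visual_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar heights for values a and b
    when the vertical axis starts at axis_start."""
    return (b - axis_start) / (a - axis_start)


# Hypothetical values: the second bar is about 4% larger than the first.
honest = visual_ratio(48, 50)                    # axis starts at zero
truncated = visual_ratio(48, 50, axis_start=45)  # axis starts at 45

print(f"Axis from 0:  second bar looks {honest:.2f}x as tall")
print(f"Axis from 45: second bar looks {truncated:.2f}x as tall")
```

The underlying data are identical in both cases; only the starting point of the axis changes how large the difference appears.
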
39. 39. What is Wrong with this Graph? Provincial Personal Income Taxes: single individual with \$45,000 in income claiming basic personal tax credits
40. 40. The Real Story
41. 41. Exaggerates a change in data. Source: Statistics Canada, CANSIM II, V31215364
42. 42. Dr. Kendall
43. 43. Worst Recession Since the Depression (?)
44. 44. Presenting the Data
45. 45. Describing Data Numerically
    - Central tendency: simple arithmetic mean, median, mode
    - Variation: range, variance, standard deviation
    - Association: covariance, correlation
    - Shape of the distribution
46. 46. Mode
    - A measure of central tendency
    - Value that occurs most often
    - Not affected by extreme values
    - Used for either numerical or categorical data
    - There may be no mode or several modes
    - What are the modes for the displayed data? (Two dot plots, on axes from 0 to 14 and from 0 to 6)
47. 47. Mode
    - A measure of central tendency
    - Value that occurs most often
    - Not affected by extreme values
    - Used for either numerical or categorical data
    - There may be no mode
    - There may be several modes
    - Dot plot on an axis from 0 to 14: mode = 9
    - Dot plot on an axis from 0 to 6: no mode
48. 48. Mode
    - There may be several modes
    - Dot plot on an axis from 0 to 14: modes = 5 and 9
49. 49. Mode
    - Caution: the mode may not be representative of the data
    - {0.1, 0.1, 5000, 4900, 4500, 5200,…}
50. 50. Median
    - In an ordered list, the median is the “middle” number (50% above, 50% below)
51. 51. Mean
    - The “balancing point” (centre of gravity) of the data
    - E.g., the data “balance” at 5: the deviations -2, -1, and +3 sum to zero
52. 52. Arithmetic Mean
    - The arithmetic mean (mean) is the most common measure of central tendency
    - Calculated by summing the observed values and dividing by the number of observations
    - For a sample of size $n$: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$, where the $x_i$ are the observed values and $n$ is the number of observations
53. 53. Arithmetic Mean
    - The most common measure of central tendency
    - Mean = sum of values divided by the number of values
    - Affected by extreme values (outliers)
    - What is the mean for these examples? (Two dot plots on axes from 0 to 10)
54. 54. Arithmetic Mean
    - The most common measure of central tendency
    - Mean = sum of values divided by the number of values
    - Affected by extreme values (outliers)
    - First example: $\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$, so mean = 3
    - Second example: $\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$, so mean = 4
55. 55. Measures of Central Tendency: Overview
    - Mean: arithmetic average, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
    - Median: midpoint of ranked values (50% below, 50% above)
    - Mode: most frequently observed value
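
The three measures of central tendency can be checked with Python's standard `statistics` module. The two five-value lists below are the examples from the arithmetic mean slides; the lists passed to `multimode` are made-up values chosen to show several modes and no repeated value.

```python
import statistics

# The two examples from the arithmetic mean slides.
a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 4, 10]   # same data with one extreme value

mean_a = statistics.mean(a)
mean_b = statistics.mean(b)

# The median ignores how extreme the largest value is.
median_a = statistics.median(a)
median_b = statistics.median(b)

# multimode returns every most-frequent value, so ties are visible.
several_modes = statistics.multimode([1, 5, 5, 7, 9, 9, 12])

# When no value repeats, every value is returned: effectively "no mode".
no_repeats = statistics.multimode([0, 1, 2, 3])

print(mean_a, mean_b)        # the outlier pulls the mean up
print(median_a, median_b)    # the median does not move
print(several_modes, no_repeats)
```
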
56. 56. The “Shape of a Distribution”
    - Use information on mean, median, and mode to “visualize” the data
    - A data distribution is said to be symmetric if its shape is the same on both sides of the median
    - Symmetry implies that median = arithmetic mean
    - If a distribution is unimodal and symmetric, then median = mean = mode
57. 57. The “Shape of a Distribution”. Figure (number of observations by value, 50% on each side of the median): symmetric and unimodal, so median = mean = mode
58. 58. The “Shape of a Distribution”. Figure (number of observations by value, 50% on each side of the median): symmetric and bimodal, so median = mean ≠ mode
59. 59. The “Shape of a Distribution”. Figure (number of observations by value, 50% on each side of the median): symmetric with no mode (uniform), so median = mean
60. 60. The “Shape of a Distribution”
    - An asymmetric distribution is said to be skewed
        1. Negatively if mean < median < mode
        2. Positively if mean > median > mode
    - Hence, by comparing our measures of central tendency, we can start to visualize the shape and characteristics of the data
61. 61. The “Shape of a Distribution”. Figure: mode = 2, median = 3, mean = 3.2; mode < median < mean, so the distribution is positively skewed
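
The ordering rule for skewness can be turned into a small check. This is a sketch under stated assumptions: the helper `skew_direction` and both data sets are hypothetical, and the rule is the rule of thumb from the slides, not a formal skewness statistic.

```python
import statistics


def skew_direction(data):
    """Classify skew by comparing mode, median, and mean (rule of thumb)."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    mode = statistics.mode(data)
    if mode < median < mean:
        return "positive"
    if mean < median < mode:
        return "negative"
    return "undetermined"


# Hypothetical data with a long tail of high values.
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 6, 9]

# Mirror image of the same data: long tail of low values.
left_tailed = [10 - x for x in right_tailed]

print(skew_direction(right_tailed))          # mode 2 < median 3 < mean 3.7
print(skew_direction(left_tailed))           # mean 6.3 < median 7 < mode 8
print(skew_direction([1, 2, 3, 3, 3, 4, 5])) # symmetric: all three equal 3
```
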
62. 62. Example: Positively skewed variable
    - The distribution of after-tax income
    - Shows the distribution of income across all Canadian households
63. 63. Example: Positively skewed variable
    - The mode income is the most common income and was in the range from \$15,000 to \$19,999.
    - The median income is the level of income that separates the population into two groups of equal size and was \$39,700.
    - The mean income is the average income and was \$48,400.
64. 64. Example: Positively skewed variable
    - A distribution in which the mean exceeds the median and the median exceeds the mode is positively skewed, which means it has a long tail of high values.
    - The distribution of income in Canada is positively skewed.
    - The median is usually reported rather than the mean, since the long tail distorts the average.
65. 65. Example: Positively skewed variable
    - Volunteer hours
    - Charitable contributions
    - Number of cigarette packs smoked (excluding 0)
    - Collective bargaining agreement duration (in years)
    - Number of beers consumed on a Saturday night
    - Duration of low income (in years)
    - Number of children
66. 66. The “Shape of a Distribution”. Figure: mode = 6, median = 5, mean = 4.7; mean < median < mode, so the distribution is negatively skewed
67. 67. Examples
    - University grades
    - Age
    - Years in school
    - Etc.
68. 68. Describing Data Numerically
    - Central tendency: simple arithmetic mean, median, mode
    - Variation: range, variance, standard deviation
    - Association: covariance, correlation
    - Shape of the distribution
69. 69. Measures of Dispersion/Variability
    - Variation: range, variance, standard deviation
    - Measures of variation give information on the spread or variability of the data values.
    - Figure: same center, different variation
70. 70. Range
    - Simplest measure of variation
    - Difference between the largest and the smallest observations: Range = $X_{\text{largest}} - X_{\text{smallest}}$
    - Example: dot plot on an axis from 0 to 14
71. 71. Range
    - Simplest measure of variation
    - Difference between the largest and the smallest observations: Range = $X_{\text{largest}} - X_{\text{smallest}}$
    - Example (dot plot on an axis from 0 to 14): Range = 14 - 1 = 13
72. 72. The Range
    - Problem: ignores all but two data points
    - These values may be “outliers” (i.e., not representative)
73. 73. Disadvantages of the Range
    - Ignores the way in which data are distributed: two differently shaped dot plots on an axis from 7 to 12 both have Range = 12 - 7 = 5
    - Sensitive to outliers:
        - 1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5: Range = 5 - 1 = 4
        - 1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120: Range = 120 - 1 = 119
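
The outlier example on the slide is easy to reproduce. The two lists below are the slide's data sets; the helper name `value_range` is just an illustrative choice.

```python
def value_range(data):
    """Range = largest observation minus smallest observation."""
    return max(data) - min(data)


# The two data sets from the slide: identical except for the last value.
typical = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(value_range(typical))       # 5 - 1
print(value_range(with_outlier))  # 120 - 1
```

Changing one observation out of 25 moves the range from 4 to 119, which is why the range alone says little about how the bulk of the data are spread.
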
74. 74. The Variance
    - A single summary measure of dispersion would be more helpful
    - Takes account of all data values
75. 75. The Variance
    1. Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
    2. Standard deviation: $s = \sqrt{s^2}$, the square root of the variance
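
The sample variance and standard deviation can be computed step by step and checked against Python's `statistics` module. The eight data values are hypothetical, chosen only so that the mean is 15.5; they are not taken from the slides.

```python
import math
import statistics

# Hypothetical sample (not from the slides); its mean is 15.5.
data = [11, 12, 13, 16, 16, 17, 18, 21]

n = len(data)
mean = sum(data) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(f"mean = {mean}, s^2 = {variance:.3f}, s = {std_dev:.3f}")

# The standard library uses the same n - 1 definition for stdev().
print(f"statistics.stdev = {statistics.stdev(data):.3f}")
```

Dividing by `n - 1` rather than `n` is what makes this the sample (not population) variance, matching the formula on the slide.
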
76. 76. Measuring variation. Figure: small standard deviation versus large standard deviation
77. 77. Comparing Standard Deviations (dot plots on an axis from 11 to 21)
    - Data A: mean = 15.5, s = 3.338
    - Data B: mean = 15.5, s = 0.926
    - Data C: mean = 15.5, s = 4.570
78. 78. Describing Data Numerically
    - Central tendency: simple arithmetic mean, median, mode
    - Variation: range, variance, standard deviation
    - Association: covariance, correlation
    - Shape of the distribution
79. 79. The Sample Covariance
    - The covariance measures the strength of the linear relationship between two variables
    - The sample covariance: $\mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
    - Only concerned with the strength of the relationship
    - No causal effect is implied
80. 80. Interpreting Covariance
    - Covariance between two variables:
    - Cov(x,y) > 0: x and y tend to move in the same direction
    - Cov(x,y) < 0: x and y tend to move in opposite directions
    - Cov(x,y) = 0: x and y have no linear relationship (zero covariance alone does not establish independence)
81. 81. Coefficient of Correlation
    - Measures the relative strength of the linear relationship between two variables
    - Sample correlation coefficient: $r = \frac{\mathrm{Cov}(x, y)}{s_x s_y}$
82. 82. Features of Correlation Coefficient, r
    - Unit free
    - Ranges between -1 and 1
    - The closer to -1, the stronger the negative linear relationship
    - The closer to 1, the stronger the positive linear relationship
    - The closer to 0, the weaker the linear relationship
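
The sample covariance and correlation coefficient can be sketched directly from their definitions. The helper names and all data below are hypothetical. The last example is included because a covariance of zero only rules out a linear relationship: there `y` is exactly `x` squared, yet the covariance is zero.

```python
import math


def sample_cov(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def sample_corr(x, y):
    """Correlation r = Cov(x, y) / (s_x * s_y); note s_x^2 = Cov(x, x)."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))


x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # moves with x
y_down = [10, 8, 6, 4, 2]   # moves against x

print(sample_cov(x, y_up), sample_corr(x, y_up))      # positive, r = +1
print(sample_cov(x, y_down), sample_corr(x, y_down))  # negative, r = -1

# A perfect but non-linear relationship: y = x squared.
x_c = [-2, -1, 0, 1, 2]
y_c = [v * v for v in x_c]
print(sample_cov(x_c, y_c))  # zero covariance despite full dependence
```

Because `r` divides the covariance by both standard deviations, it is unit free, whereas the covariance itself changes if either variable is rescaled.
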
83. 83. Interpreting the Correlation Coefficient, r
84. 84. Scatter Plots of Data with Various Correlation Coefficients. Panels (Y against X): r = -1 (Cov < 0), r = -.6 (Cov < 0), r = 0 (Cov = 0), r = +.3, r = +1, and r = 0
85. 85. 502B
86. 86. Fun with Graphs
    - Does your mindset match my dataset!
    - http://www.ted.com/talks/hans_rosling_at_state.html
87. 87. Looking ahead
    - SRs to client (cc) and Turnitin on Wednesday by noon
    - No class next week
    - Work on 598 critiques
    - 598 critiques due in class & Turnitin Nov. 30
    - Comments on your SRs will be ready Nov. 30
    - Final SRs (if required) due Dec. 8 @11:55PM PST
    - Note carefully the requirements
    - Moodle site will be inaccessible sometime in December
    - Final grades reported via usource once approved by the Director

Jun. 9, 2018
