Intro to quant_analysis_students

Published in: Technology, Art & Photos
  • Graph makes the frequencies pop more
  • Or what could have been a bar chart can be made into a line graph by connecting the midpoints
  • Remember our cross table?
    Can we present this graphically?
  • Note that the legend is on the right because there is no room on the left-hand side
  • Or we can display this as a stacked bar where the proportion of each region in each category is displayed.
    Called a segmented bar chart
  • Mancession Video 4 minutes
    Unemployment Rates sheet
    ExcelTutorial5_timeseriesgraph
  • The main defence of the lying graph is that at least it was approximately correct; we were just trying to show the general direction of change or magnitude.
  • So yes, taxes are low in BC, but not as low as shown in the original graph
    Non-zero origins are a great way to lie
    Very popular in government
  • Remember this time series graph? Look at what happens if we change the scale on the Y axis.
    Boy, that really changes your impression of the data and the underlying trend. The drop from 1992 to 1997 was 7%. Does this graph understate or overstate a 7% change over this period?
  • Dr. Kendall used his diagram to argue that we are drinking too much, when really there are simply more people drinking because of population growth
  • 9
    No mode
  • If the mean=median and there is no mode, your distribution looks something like this
  • Not as frequently occurring in economic data, so I actually do not have many examples
  • What does the standard deviation tell us? It tells us how far from the mean the data points tend to be. A bigger number tells us that the observations lie further from the mean than they do when the standard deviation is small. It tells us HOW representative of the data the mean is.
  • Since the standard deviation can be thought of measuring how far the data values lie from the mean, we take the mean and move one standard deviation in either direction.  The mean for this example was about 15.5
    For the first distribution we have 15.5+3.338= 18.838 and 15.5-3.338=12.162
     
    Assuming this is how much restaurant patrons spend, what this means is that most of the patrons probably spend between $12.16 and $18.84.
    In the second example, we have 15.5+0.926=16.43 and 15.5-0.926=14.57 which as you can see shows less spread in the data.
    In the third example we have 15.5+4.57=20.07 and 15.5-4.57=10.93 which is the most spread.
    Excel 4 minutes
    Food Expenditures 2
    ExcelTutorial9_Dispersion.mp4
  • Measures of Relationships Between Variables
    More often than not, we are interested in describing the relationship between variables
    On Oct. 28 we learned about scatter plots as a graphical way to describe a relationship between two variables.
    We also learned about cross tabs, aka contingency tables, for nominal/ordinal variables
    Let’s look a little more closely at measures of relationship for ratio-level data
  • Excel Food&Income 2 sheet
  • Intro to quant_analysis_students

    1. Week 11: Basic Descriptive Quantitative Data Analysis: Tables, Graphs, & Summary Statistics
    2. Objectives
        • Learn about basic descriptive quantitative analysis
        • How to perform these tasks in Excel
        • Starting point for 502B
        • Excel knowledge and quantitative skills are highly desired by employers
        • EC stream
    3. Introduction
        • Without data, it is anyone’s opinion
        • Why use tables, graphs, summary stats? “At their best, tables, graphs, and statistics are instruments for reasoning about complex quantitative information.”
        • Why learn how to design them appropriately? “At their worst, tables, graphs and summary statistics are instruments of evil used for deceiving a naive viewer.”
        • Does your mindset match my dataset!
        • http://www.ted.com/talks/hans_rosling_at_state.html
    4. Quantitative Research Process
    5. Introduction
    6. Presenting the Data
    7. Frequency Distribution
        • A convenient way of summarizing a lot of tabular data
        • What is a Frequency Distribution?
        • A frequency distribution is a list or a table containing class groupings (categories or ranges within which the data fall) and the corresponding frequencies with which data fall within each class or category
        • For nominal/ordinal data
    8. Introduction
    9. Table 1 Univariate Frequencies of Percentage of Sales Reported to Tax Authorities

        % of Sales Reported   Frequency   Percent (%)
        100%                       3307         40.56
        90-99%                     1096         13.44
        80-89%                      916         11.24
        70-79%                      703          8.62
        60-69%                      501          6.14
        50-59%                      694          8.51
        <50%                        936         11.48
        Total                      8153           100

        Source: 1999 World Bank World Business Environment Survey (WBES), excludes missing observations. http://www.enterprisesurveys.org/
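As a rough check on Table 1, the Percent column can be recomputed from the Frequency column. The sketch below is only an illustration in Python (the course itself works in Excel); the counts are copied from the table, and the variable names are assumptions made for this example.

```python
# Frequencies copied from Table 1 (1999 WBES); names are illustrative only.
categories = ["100%", "90-99%", "80-89%", "70-79%", "60-69%", "50-59%", "<50%"]
frequencies = [3307, 1096, 916, 703, 501, 694, 936]

# Relative frequency = class frequency / total number of observations.
total = sum(frequencies)
percents = [round(100 * f / total, 2) for f in frequencies]

for category, frequency, percent in zip(categories, frequencies, percents):
    print(f"{category:>7}  {frequency:>5}  {percent:>6.2f}")
print(f"{'Total':>7}  {total:>5}")
```

Because each percentage is rounded separately, the rounded values need not add up to exactly 100.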
    10. Contingency/Pivot/Cross Table
        • May also want to produce a table with more categories
        • Cross table or Contingency table or Pivot table
        • Suitable if you have two nominal/ordinal variables
        • Simple extension to a univariate table
        • Considers relationship between two variables
        • Row variable (Dependent)
        • Column variable (Independent)
    11. Table 2 Percentage of Sales Reported to Tax Authorities by Region

        % Reported   Africa   Transition    Asia     Latin   OECD   Former Soviet   Total
                                  Europe           America              Countries
        100%            490          554     416       794    446             607   3,307
        90-99%          266          196     142       119    145             228   1,096
        80-89%          158          152     117       192     73             224     916
        70-79%          162          117     103       153     43             125     703
        60-69%          140           69      70       115     22              85     501
        50-59%          140          105     141       118     16             174     694
        <50%            100          106     283       296     25             126     936
        Total         1,456        1,299   1,272     1,787    770           1,569   8,153

        Source: 1999 World Bank World Business Environment Survey (WBES)
        * Excludes missing observations
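One way to read a cross table is to convert each column to percentages, which is also what a segmented bar chart displays. The following Python sketch does this for the counts in Table 2; it is an illustration only, and the dictionary layout and names are assumptions rather than anything from the course files.

```python
# Counts copied from Table 2; rows are reporting categories, columns are regions.
regions = ["Africa", "Transition Europe", "Asia", "Latin America",
           "OECD", "Former Soviet Countries"]
counts = {
    "100%":   [490, 554, 416, 794, 446, 607],
    "90-99%": [266, 196, 142, 119, 145, 228],
    "80-89%": [158, 152, 117, 192, 73, 224],
    "70-79%": [162, 117, 103, 153, 43, 125],
    "60-69%": [140, 69, 70, 115, 22, 85],
    "50-59%": [140, 105, 141, 118, 16, 174],
    "<50%":   [100, 106, 283, 296, 25, 126],
}

# Column totals: number of firms surveyed in each region.
col_totals = [sum(column) for column in zip(*counts.values())]

# Share of each reporting category within its region (column percentages).
col_percents = {
    label: [round(100 * c / t, 1) for c, t in zip(row, col_totals)]
    for label, row in counts.items()
}

for region, share in zip(regions, col_percents["100%"]):
    print(f"{region}: {share}% of firms report 100% of sales")
```

Column percentages make regions of different sample sizes comparable, which raw counts do not.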
    12. Features of a Table
        • Title that accurately summarizes the data
        • Simple, indicates major variables, and time frame (if applicable)
        • Source: data set or origin of table
        • Explanatory footnotes
        • Easy to read & separated from text
        • Properly formatted for style (see APA Rules)
        • Necessary to advance analysis
        • See Module 7 for APA Table Checklist
        • Reproduced from APA manual
    13. Presenting the Data
    14. Bar Graph
        • Often used to describe categorical data
        • Ordinal/Nominal
        • Draws attention to the frequency of each category
    15. Table 1 Univariate Frequencies of Percentage of Sales Reported to Tax Authorities (same table as slide 9). Source: 1999 World Bank World Business Environment Survey (WBES), excludes missing observations. http://www.enterprisesurveys.org/
    16. Bar Graph. Figure 1 Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
    17. Relative Frequency Polygon
    18. Pie Graph
        • Emphasizes the proportion of each category
        • Something that may be good for our tax evasion data
        • Circle represents the total
        • Segments represent the shares of the total
        • Segment size is proportional to frequency
    19. Pie Graph. Figure 1 Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
    20. Pie Graph. Figure 1 Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
    21. Pie Graph. Figure 1 Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
    22. Charts in Excel I
    23. Table 2 Percentage of Sales Reported to Tax Authorities by Region (same table as slide 11)
    24. Bar Graph. Figure 1 Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
    25. Segmented Bar Chart. Figure 1 Percentage of sales reported to tax authority. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
    26. Pie Graph. Figure 2 Percentage of sales reported to tax authority by region. Source: 1999 World Bank World Business Environment Survey (WBES). Note. Excludes missing observations. n = 8314
    27. Vertical Bar Chart
    28. Charts in Excel II
    29. Time Series Graph
        • Time series are often used in social sciences
        • Data collected at various time periods: daily, weekly, monthly, quarterly, annually, etc.
        • Examples include GDP, Unemployment, University Tuition
        • Plot series of interest over time
        • Let’s look at a graph of the unemployment rate by gender and age
    30. Line Graph
    31. Histogram
        • Used for continuous data
        • Frequency Distribution for continuous data
        • Summary graph showing count of the data points falling in various ranges
        • Rough approximation of the distribution of the data
        • A histogram is a way to summarize data
        • The distribution condenses the raw data into a more useful form and allows for a quick visual interpretation of the data
    32. Histogram
    33. Scatter Graphs
        • Graph the relationship between two continuous variables
    34. Scatter Graph
    35. Principles of Graphical Excellence
        • Well-designed presentation of interesting data
        • Substance & design
        • Simplicity of design, complexity of data
        • Proportion and Balance
        • Clear, precise, efficient
        • Know what you are trying to show (have a story)
        • Make sure your graph shows it
        • Well formatted, professional
        • Choose format that reflects your data and the story
        • Informative and legible axis
        • Fully labelled & legible
        • Gets across main point(s) in the shortest time with the least ink in the smallest space
        • Adds information not otherwise available to the reader
        • But supplemented with text describing the figure
        • Tells the truth about the data
        • Limits complexity and confusion
        • Avoid Chart Junk
    36. Examples of Chartjunk [Figure: sample charts by quarter (1st Qtr to 4th Qtr) and by region (West, North, Northeast, Southwest, Mexico, Europe, Japan, East, South, International)]
    37. Examples of Chartjunk [Figure: annotated quarterly chart]
        • Gridlines!
        • Vibration
        • Pointless Fake 3-D Effects
        • Filled “Floor”
        • Clip Art
        • In or out?
        • Filled “Walls”
        • Borders and Fills Galore
        • Unintentional Heavy or Double Lines
        • Filled Labels
        • Serif Font with Thin & Thick Lines
    38. Displaying Data: “Mistakes”
        • Graphs are also instruments of evil used for deceiving a naive viewer.
        • Non-zero origin
        • Omitting data that refutes your “evidence”
        • Limiting scope of data
    39. What is Wrong with this Graph? Provincial Personal Income Taxes: Single Individual with $45,000 in income claiming basic personal tax credits
    40. The Real Story
    41. Exaggerates a change in data. Source: Statistics Canada, CANSIM II, V31215364
    42. Dr. Kendall
    43. Worst Recession Since the Depression (?)
    44. Presenting the Data
    45. Describing Data Numerically
        • Central Tendency: Simple Arithmetic Mean, Median, Mode
        • Variation: Range, Variance, Standard Deviation
        • Association: Covariance, Correlation
        • Shape of the Distribution
    46. Mode
        • A measure of central tendency
        • Value that occurs most often
        • Not affected by extreme values
        • Used for either numerical or categorical data
        • There may be no mode or several modes
        • What are the modes for the displayed data?
        [Figure: two dot plots, one on an axis from 0 to 14 and one on an axis from 0 to 6]
    47. Mode
        • A measure of central tendency
        • Value that occurs most often
        • Not affected by extreme values
        • Used for either numerical or categorical data
        • There may be no mode
        • There may be several modes
        [Figure: the same two dot plots; the first has Mode = 9, the second has No Mode]
    48. Mode
        • There may be several modes
        [Figure: dot plot on an axis from 0 to 14 with Mode = 5 & 9]
    49. Mode
        • Caution: Mode may not be representative of the data
        • {0.1, 0.1, 5000, 4900, 4500, 5200,…}
    50. Median
        • In an ordered list, the median is the “middle” number (50% above, 50% below)
        [Figure: two dot plots on axes from 0 to 10]
    51. Mean
        • The “balancing point” (centre of gravity) of the data
        • E.g. The data “balances” at 5
        [Figure: number line from 1 to 9 with deviations -2, -1, and +3 around the mean]
    52. Arithmetic Mean
        • The arithmetic mean (mean) is the most common measure of central tendency
        • Calculated by summing the observed values and dividing by the number of observations
        • For a sample of size n: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$, where the $x_i$ are the observed values and $n$ is the number of observations
    53. Arithmetic Mean
        • The most common measure of central tendency
        • Mean = sum of values divided by the number of values
        • Affected by extreme values (outliers)
        • What is the mean for these examples?
        [Figure: two dot plots on axes from 0 to 10]
    54. Arithmetic Mean
        • The most common measure of central tendency
        • Mean = sum of values divided by the number of values
        • Affected by extreme values (outliers)
        • First example: $\frac{1+2+3+4+5}{5} = \frac{15}{5} = 3$, so Mean = 3
        • Second example: $\frac{1+2+3+4+10}{5} = \frac{20}{5} = 4$, so Mean = 4
    55. Measures of Central Tendency: Overview
        • Mean: arithmetic average, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
        • Median: midpoint of ranked values (50% below, 50% above)
        • Mode: most frequently observed value
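The three measures of central tendency can be checked quickly outside Excel. This is a minimal Python sketch using the standard library `statistics` module. The first two lists are the examples from slide 54; the `scores` list is a made-up illustration and is not taken from the slides.

```python
from statistics import mean, median, multimode

# The two examples from slide 54: one extreme value moves the mean but not the median.
example_1 = [1, 2, 3, 4, 5]
example_2 = [1, 2, 3, 4, 10]

print(mean(example_1), median(example_1))  # mean 3, median 3
print(mean(example_2), median(example_2))  # mean 4, median 3

# Hypothetical data with a single most frequent value.
scores = [5, 5, 9, 9, 9, 12]
print(multimode(scores))  # list of the most frequent value(s)

# When every value occurs once, multimode returns all of them,
# which corresponds to "no mode" on the slides.
print(multimode(example_1))
```

`multimode` requires Python 3.8 or later; it returns a list so that several modes can be reported.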
    56. The “Shape of a Distribution”
        • Use information on mean, median, and mode to “visualize” the data
        • A data distribution is said to be symmetric if its shape is the same on both sides of the median
        • Symmetry implies that median = arithmetic mean
        • If a distribution is unimodal and symmetric then median = mean = mode
    57. The “Shape of a Distribution” [Figure: histogram of # of obs. by value, with the median splitting the data 50%/50%] Symmetric & unimodal: Median = Mean = Mode
    58. The “Shape of a Distribution” [Figure: histogram of # of obs. by value, with the median splitting the data 50%/50%] Symmetric & bimodal: Median = Mean ≠ Mode
    59. The “Shape of a Distribution” [Figure: histogram of # of obs. by value, with the median splitting the data 50%/50%] Symmetric & no mode: Median = Mean (Uniform)
    60. The “Shape of a Distribution”
        • An asymmetric distribution is said to be skewed
            1. Negatively if Mean < Median < Mode
            2. Positively if Mean > Median > Mode
        • Hence, by comparing our measures of central tendency, we can start to visualize the shape and characteristics of the data
    61. The “Shape of a Distribution” [Figure: histogram with Mode = 2, Median = 3, Mean = 3.2] Mode < Median < Mean = positively skewed distribution
    62. Example: Positively skewed variable
        • The Distribution of After-Tax Income
        • Shows the distribution of income across all Canadian households
    63. Example: Positively skewed variable
        • The mode income is the most common income and was in the range from $15,000 to $19,999.
        • The median income is the level of income that separates the population into two groups of equal size and was $39,700.
        • The mean income is the average income and was $48,400.
    64. Example: Positively skewed variable
        • A distribution in which the mean exceeds the median and the median exceeds the mode is positively skewed, which means it has a long tail of high values.
        • The distribution of income in Canada is positively skewed.
        • Most likely to report median rather than mean since long tail distorts average
    65. Example: Positively skewed variable
        • Volunteer hours
        • Charitable contributions
        • # of Cigarette packs smoked (excluding 0)
        • Collective bargaining agreement duration (in years)
        • # of beers consumed on a Saturday night
        • Duration of low income (in years)
        • Number of children
    66. The “Shape of a Distribution” [Figure: histogram with Mode = 6, Median = 5, Mean = 4.7] Mean < Median < Mode = negatively skewed distribution
    67. Examples
        • University Grades
        • Age
        • Years in school
        • Etc.
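The ordering rule from slide 60 can be written as a small function. The sketch below is a heuristic illustration in Python, not a formal test of skewness. The function name is made up for this example, and the modal income of $17,500 is an assumption (the midpoint of the $15,000 to $19,999 range on slide 63).

```python
def skew_direction(mean, median, mode):
    """Classify skew using the mean/median/mode ordering rule from slide 60."""
    if mean > median > mode:
        return "positively skewed"
    if mean < median < mode:
        return "negatively skewed"
    if mean == median == mode:
        return "symmetric and unimodal"
    return "no clear ordering"

# Values read from slides 61 and 66.
print(skew_direction(mean=3.2, median=3, mode=2))
print(skew_direction(mean=4.7, median=5, mode=6))

# After-tax income from slide 63; the mode is the assumed midpoint of its range.
print(skew_direction(mean=48_400, median=39_700, mode=17_500))
```

Real data often do not satisfy either strict ordering, which is why the function has a fallback result.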
    68. Describing Data Numerically
        • Central Tendency: Simple Arithmetic Mean, Median, Mode
        • Variation: Range, Variance, Standard Deviation
        • Association: Covariance, Correlation
        • Shape of the Distribution
    69. Measures of Dispersion/Variability
        • Variation: Range, Variance, Standard Deviation
        • Measures of variation give information on the spread or variability of the data values.
        [Figure: Same center, different variation]
    70. Range
        • Simplest measure of variation
        • Difference between the largest and the smallest observations: Range = Xlargest - Xsmallest
        [Figure: example dot plot on an axis from 0 to 14]
    71. Range
        • Simplest measure of variation
        • Difference between the largest and the smallest observations: Range = Xlargest - Xsmallest
        • Example: Range = 14 - 1 = 13
        [Figure: example dot plot on an axis from 0 to 14]
    72. The Range
        • Problem
        • Ignores all but two data points
        • These values may be “outliers” (i.e. not representative)
    73. Disadvantages of the Range
        • Ignores the way in which data are distributed
          [Figure: two differently shaped dot plots on an axis from 7 to 12, each with Range = 12 - 7 = 5]
        • Sensitive to outliers
          1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5: Range = 5 - 1 = 4
          1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120: Range = 120 - 1 = 119
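The outlier example on slide 73 is easy to reproduce. This short Python sketch rebuilds the two lists from the slide and computes their ranges; the helper name `value_range` is just an illustrative choice.

```python
# The two lists from slide 73 differ only in their last value.
common = [1] * 11 + [2] * 8 + [3] * 4 + [4]
without_outlier = common + [5]
with_outlier = common + [120]

def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

print(value_range(without_outlier))  # 5 - 1
print(value_range(with_outlier))     # 120 - 1
```

Changing one observation out of 25 moves the range from 4 to 119, which is the sensitivity the slide warns about.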
    74. The Variance
        • A single summary measure of dispersion would be more helpful
        • Takes account of all data values
    75. The Variance
        1. Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
        2. Standard Deviation: $s = \sqrt{s^2}$ (standard deviation = square root of the variance)
    76. Measuring variation [Figure: a distribution with a small standard deviation and one with a large standard deviation]
    77. Comparing Standard Deviations [Figure: three dot plots on an axis from 11 to 21]
        • Data A: Mean = 15.5, s = 3.338
        • Data B: Mean = 15.5, s = 0.926
        • Data C: Mean = 15.5, s = 4.570
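The sample variance and standard deviation can be computed directly from their definitions. In the Python sketch below, the two data sets are assumptions: the dot plots on slide 77 are not recoverable from this transcript, so the values were chosen only to reproduce the reported mean of 15.5 and the standard deviations 3.338 and 0.926.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical values chosen to match the summary statistics on slide 77.
data_a = [11, 12, 13, 16, 16, 17, 18, 21]
data_b = [14, 15, 15, 15, 16, 16, 16, 17]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

for name, data in (("A", data_a), ("B", data_b)):
    s = sqrt(sample_variance(data))
    # Mean plus or minus one standard deviation, as in the speaker notes.
    low, high = mean(data) - s, mean(data) + s
    print(f"Data {name}: mean={mean(data)}, s={s:.3f}, "
          f"one-sd interval=({low:.2f}, {high:.2f})")

# The standard library gives the same sample standard deviation.
print(round(stdev(data_a), 3), round(stdev(data_b), 3))
```

Both data sets share the same mean, so the difference in the intervals comes entirely from the spread.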
    78. Describing Data Numerically
        • Central Tendency: Simple Arithmetic Mean, Median, Mode
        • Variation: Range, Variance, Standard Deviation
        • Association: Covariance, Correlation
        • Shape of the Distribution
    79. The Sample Covariance
        • The covariance measures the strength of the linear relationship between two variables
        • The sample covariance: $\mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
        • Only concerned with the strength of the relationship
        • No causal effect is implied
    80. Interpreting Covariance
        • Cov(x,y) > 0: x and y tend to move in the same direction
        • Cov(x,y) < 0: x and y tend to move in opposite directions
        • Cov(x,y) = 0: x and y have no linear relationship (this alone does not establish that they are independent)
    81. Coefficient of Correlation
        • Measures the relative strength of the linear relationship between two variables
        • Sample correlation coefficient: $r = \frac{\mathrm{Cov}(x, y)}{s_X s_Y}$
    82. Features of Correlation Coefficient, r
        • Unit free
        • Ranges between -1 and 1
        • The closer to -1, the stronger the negative linear relationship
        • The closer to 1, the stronger the positive linear relationship
        • The closer to 0, the weaker the linear relationship
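The covariance and correlation formulas can be sketched in a few lines of Python. All of the data below are made up for illustration, and the function names are assumptions. The last example is included because a covariance of zero only rules out a linear relationship: there y is completely determined by x, yet the covariance is zero.

```python
from math import sqrt
from statistics import mean

def sample_covariance(x, y):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

def sample_std(values):
    """Sample standard deviation: square root of the covariance of a variable with itself."""
    return sqrt(sample_covariance(values, values))

def correlation(x, y):
    """r = Cov(x, y) / (s_x * s_y); unit free, between -1 and 1."""
    return sample_covariance(x, y) / (sample_std(x) * sample_std(y))

# Hypothetical data: perfect positive and perfect negative linear relationships.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = [10, 8, 6, 4, 2]
print(sample_covariance(x, y_up), round(correlation(x, y_up), 3))
print(sample_covariance(x, y_down), round(correlation(x, y_down), 3))

# Hypothetical data: y = x squared, a non-linear relationship with zero covariance.
u = [-2, -1, 0, 1, 2]
u_squared = [value ** 2 for value in u]
print(sample_covariance(u, u_squared))
```

Dividing by the two standard deviations is what removes the units, so r can be compared across variable pairs in a way the covariance cannot.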
    83. Interpreting the Correlation Coefficient, r
    84. Scatter Plots of Data with Various Correlation Coefficients [Figure: six scatter plots of Y against X: r = -1 (Cov<0); r = -.6 (Cov<0); r = 0 (Cov=0); r = +.3; r = +1; r = 0]
    85. 502B
    86. Fun with Graphs
        • Does your mindset match my dataset!
        • http://www.ted.com/talks/hans_rosling_at_state.html
    87. Looking ahead
        • SRs to client (cc) and Turnitin on Wednesday by noon
        • No class next week
        • Work on 598 critiques
        • 598 Critiques due in class & Turnitin Nov. 30
        • Comments on your SRs will be ready Nov. 30
        • Final SRs (if required) due Dec. 8 @11:55PM PST
        • Note carefully the requirements
        • Moodle site will be inaccessible sometime in December
        • Final Grades reported via usource once approved by the Director
