Published in: Technology

  1. 1. Teaching Basic Modeling Skills Using Real Data. Steven Gordon, Senior Director of Education and Client Services [email_address]
  2. 2. Goals of the Session
       • Overview of statistical methods that can be used to build a model
       • Sources of real data to test hypotheses about statistical relationships
       • Example exercise(s) and techniques for downloading and extracting data, testing statistical relationships, and building a model based on the results
  3. 3. Measuring the Strength of a Relationship
       • Correlation
           • Measures the strength and direction of the linear relationship between two variables
           • Ranges from -1.0 to 1.0
           • Zero means no linear relationship
       • http://argyll.epsb.ca/jreed/math9/strand4/scatterPlot.htm
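The correlation coefficient described on this slide can be computed by hand in a few lines. The following is a minimal sketch in plain Python; the function name `pearson_r` and all of the sample values are illustrative assumptions, not data from the session.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical illustration data (not from the slides).
x = [1, 2, 3, 4, 5]
perfect_positive = pearson_r(x, [2, 4, 6, 8, 10])  # y rises exactly with x
perfect_negative = pearson_r(x, [10, 8, 6, 4, 2])  # y falls exactly with x
# y = x**2 on a symmetric range: clearly related, but not linearly.
curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(perfect_positive, perfect_negative, curved)
```

The third case is a reminder that a coefficient of zero rules out a linear relationship only; the two variables there are still perfectly related through a curve.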
  4. 4. Linear Regression
       • Assumes a cause-and-effect relationship between one dependent variable and one or more independent variables
       • Finds the linear equation that best fits the data
       • Y = aX + b
           • where:
           • Y = the dependent variable
           • X = the independent variable
           • a = a coefficient equivalent to the slope of the line
           • b = the Y intercept of the line (the place where it crosses the Y axis)
       • Y = a_1 X_1 + a_2 X_2 + a_3 X_3 + b for multiple causes (one coefficient per independent variable)
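For the single-variable case Y = aX + b, the best-fit slope and intercept have a closed form. This is a minimal sketch assuming ordinary least squares; `fit_line` and the sample points are hypothetical names and values chosen for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a*X + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept: the line passes through the point of means
    return a, b


# Hypothetical points generated exactly from Y = 2X + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)
```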
  5. 6.
       • Use the vertical offsets between each observed Y and the Y that the line predicts for the same X
       • Square the differences so all numbers are positive
       • The sum of the squared offsets (deviations) measures the goodness of fit; the best-fit line minimizes it
       • R^2, the coefficient of determination (0 to 1.0), is the share of the variance in Y that the line explains
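The sum of squared offsets and R^2 can be computed directly for any candidate line. The sketch below compares two hand-picked lines on made-up data; `goodness_of_fit` and every number in it are assumptions for illustration. Note that R^2 is only guaranteed to fall between 0 and 1 for the actual least-squares line; an arbitrary candidate line can score below zero.

```python
def goodness_of_fit(xs, ys, a, b):
    """Return (sum of squared vertical offsets, R^2) for the line Y = a*X + b."""
    predicted = [a * x + b for x in xs]
    mean_y = sum(ys) / len(ys)
    # Squared vertical offsets between observed and predicted Y.
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    # Total variation of Y around its own mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return ss_res, 1 - ss_res / ss_tot


# Hypothetical observations scattered around Y = 2X.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

good_sse, good_r2 = goodness_of_fit(xs, ys, a=2, b=0)  # close to the data
poor_sse, poor_r2 = goodness_of_fit(xs, ys, a=1, b=2)  # too shallow
print(good_sse, good_r2)
print(poor_sse, poor_r2)
```

The line with the smaller sum of squared offsets is the better fit, which is the criterion a regression tool applies when it searches for the coefficients.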
  6. 7. Using Excel for Regression Analysis
       • Open the file Regress_Data.xls
           • The file contains a dataset of 40 observations of Income, Education, and Age
           • What relationships would we hypothesize? Which is the dependent variable? Which are the independent variables?
       • Activate the Analysis ToolPak
           • Click Tools in Excel's main menu; if Data Analysis appears as a submenu choice, you are good to go
           • Otherwise, select Tools > Add-Ins
           • Mark the box next to Analysis ToolPak and click OK
  7. 8. Do a Correlation Analysis
       • Click Tools > Data Analysis
       • Choose Correlation from the pull-down menu
       • Click in the Input Range box
           • Select all three columns, including the labels, by clicking and dragging across the entire dataset
       • Check the Labels box to indicate that the first row contains labels
       • Click New Worksheet and name it correlate
       • What are the results?
  8. 9. Do a Regression
       • Click Tools > Data Analysis
       • Select Regression from the pull-down menu
       • Put the cursor in the Input Y Range field
           • Select the range for the dependent variable, including its label
       • Put the cursor in the Input X Range field
           • Select the other two columns of numbers, including their labels
       • Mark the check boxes
           • Labels
           • Confidence Level
       • Click the radio button for New Worksheet and give it a name
       • Click OK
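For readers who want to see what a regression with two independent variables computes, here is a minimal sketch that solves the least-squares normal equations in plain Python. It is not a description of how Excel implements its Regression tool; the function names and the (age, education) rows are hypothetical, and the Y values are generated from a known equation so the recovered coefficients can be checked.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix so row operations apply to both sides at once.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution, last unknown first.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_multiple_regression(predictor_rows, ys):
    """Return [intercept, coef_1, ..., coef_k] that minimize the squared error."""
    # Prepend a constant 1 to each row so the first coefficient is the intercept.
    design = [[1.0] + list(row) for row in predictor_rows]
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical (age, education) rows; Y is generated exactly from
# 5 + 2*age + 3*education so the fit should recover those numbers.
rows = [(20, 12), (30, 16), (40, 12), (50, 18), (60, 14)]
ys = [5 + 2 * age + 3 * edu for age, edu in rows]
intercept, age_coef, edu_coef = fit_multiple_regression(rows, ys)
print(intercept, age_coef, edu_coef)
```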
  9. 10. Interpreting Regression Outputs in Excel
       • 37.6 percent of the variance is explained after adjusting for degrees of freedom (the Adjusted R Square value)
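The adjustment for degrees of freedom is conventionally written as follows, where n is the number of observations and k is the number of independent variables. This is the standard textbook formula, offered here as an assumption about what the slide's figure reflects.

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}
```

If the 40 observations and two independent variables described earlier apply, the factor is 39/37, and an adjusted value of 0.376 would correspond to an unadjusted R^2 of roughly 0.41.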
  10. 11. More Interpretation
       • Significance of the equation: the probability of a false positive is less than 0.004%
  11. 12. Coefficients and Their Significance
       • Equation: Income = -17954.1 + 440.71*Age + 1542.41*Education
       • Significance of individual coefficients
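Once the coefficients are known, the equation can be used to estimate income for a given person. The sketch below plugs the slide's coefficients into a small function; `predict_income` is a hypothetical name, and treating Age and Education as measured in years is an assumption, since the slides do not state the units.

```python
# Coefficients copied from the fitted equation on the slide.
INTERCEPT = -17954.1
AGE_COEF = 440.71
EDUCATION_COEF = 1542.41


def predict_income(age, education):
    """Estimated income from the fitted equation (units assumed, see lead-in)."""
    return INTERCEPT + AGE_COEF * age + EDUCATION_COEF * education


# Hypothetical person: age 40 with 16 years of education.
estimate = predict_income(40, 16)

# Holding age fixed, one more unit of education shifts the estimate
# by exactly the Education coefficient.
difference = predict_income(40, 17) - predict_income(40, 16)

print(round(estimate, 2), round(difference, 2))
```

Each coefficient reads as the change in estimated income for a one-unit change in that variable while the other is held constant.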
  12. 13. Fitting Non-Linear Data
       • Same principle: measure and minimize the deviations
       • The choice of curve to fit is not automatic
           • It is an individual judgment, with the possibility of error
           • The real relationship may not be fully represented by the experimental data
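One common way to fit a curve is to transform the data so that a straight-line fit applies. The sketch below assumes an exponential form y = c * exp(k * x) and fits it by regressing ln(y) on x, then compares the squared deviations against a plain straight line. The function names and data are hypothetical, and the log transform minimizes deviations in log space rather than in the original units, so it is one possible choice and not the only one.

```python
import math


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_exponential(xs, ys):
    """Fit y = c * exp(k * x) by regressing ln(y) on x. Requires every y > 0."""
    log_ys = [math.log(y) for y in ys]
    k, log_c = fit_line(xs, log_ys)
    return math.exp(log_c), k


def sum_squared_deviations(ys, predicted):
    return sum((y - p) ** 2 for y, p in zip(ys, predicted))


# Hypothetical data generated exactly from y = 3 * exp(0.5 * x).
xs = [0, 1, 2, 3, 4]
ys = [3.0 * math.exp(0.5 * x) for x in xs]

c, k = fit_exponential(xs, ys)
slope, intercept = fit_line(xs, ys)

sse_exponential = sum_squared_deviations(ys, [c * math.exp(k * x) for x in xs])
sse_straight = sum_squared_deviations(ys, [slope * x + intercept for x in xs])
print(c, k, sse_exponential, sse_straight)
```

The same deviation measure ranks both candidates, but nothing in the procedure picks the curve family; that remains the analyst's choice.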
  13. 14. Potential Errors
       • Choosing the wrong function for non-linear data
       • Experiments that did not measure all possible circumstances
           • Behavior may change in areas outside the sample data
           • E.g., physical limitations of the system lead to failure
       • Variables may be highly correlated and show a strong relationship in regression even though one is not the cause of the other
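The risk of predicting outside the sampled range can be shown with a toy system. In this sketch the "true" behavior is invented for illustration: the output tracks the input up to a limit of 5 and then stays flat. A line fitted only to samples below the limit looks perfect inside the range and is badly wrong beyond it. All names and values are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def true_response(load):
    """Hypothetical system: output follows the load up to a limit of 5."""
    return min(float(load), 5.0)


# The experiment only sampled loads from 0 to 5, where the system is linear.
sample_loads = [0, 1, 2, 3, 4, 5]
sample_outputs = [true_response(x) for x in sample_loads]
slope, intercept = fit_line(sample_loads, sample_outputs)

# Inside the sampled range the model matches the system.
inside_error = (slope * 3 + intercept) - true_response(3)
# Outside the range the model keeps climbing while the system has saturated.
extrapolation_error = (slope * 10 + intercept) - true_response(10)
print(inside_error, extrapolation_error)
```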
  14. 15. Exercise Using Real Data