Lesson 4 Correlational Analysis


  • Correlation and Cause
    • Just because two variables are correlated does not mean that one of them is the cause of the other. It could be the case, but it does not necessarily follow.
    • There is a strong positive correlation between the number of cigarettes that one smokes a day and one's chances of contracting lung cancer (measured as the number of cases of lung cancer per hundred people who smoke a given number of cigarettes). The percentage of heavy smokers who contract lung cancer is higher than the percentage of light smokers who develop the disease, and both figures are higher than the percentage of non-smokers who get lung cancer. In this case, the cigarettes are definitely causing the cancer.
    • There is a strong negative correlation between the total number of skiing holidays that people book for any month of the year and the total amount of ice cream that supermarkets sell for that month. This means that the more skiing holidays are booked, the less ice cream is sold. Is there a cause here? Are people spending so much money on ice cream that they can't afford skiing holidays? Is the fact that the ice cream is so cold putting people off skiing? Clearly not! The simple fact is that most people tend to book their skiing holidays in the winter, and they tend to buy ice cream in the summer.
    • Although a correlation between two variables doesn't mean that one of them causes the other, it can suggest a way of finding out what the true cause might be. There may be some underlying variable that is causing both of them. For instance, if a survey found a correlation between the time that people spend watching television and the amount of crime that people commit, it could be because unemployed people tend to sit around watching television, and unemployed people are more likely to commit crime. If that were the case, then unemployment would be the true cause!
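The skiing and ice cream example in the note above can be illustrated with a small sketch. The monthly figures below are invented purely for illustration (they are not from the slides), and `pearson_r` is a hypothetical helper name; the point is only that two series driven by the same season can show a strong negative correlation without either one causing the other.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical monthly figures, January to December (made up for this sketch).
# Ski bookings peak in winter, ice cream sales peak in summer.
ski_bookings = [90, 80, 60, 30, 10, 5, 2, 3, 8, 25, 55, 85]
ice_cream_sales = [10, 12, 20, 35, 55, 75, 90, 88, 60, 30, 15, 11]

r = pearson_r(ski_bookings, ice_cream_sales)
print(f"r = {r:.2f}")  # strongly negative, yet neither variable causes the other
```

Both series here depend on the season, which plays the role of the underlying variable mentioned in the note.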
  • More explanation: http://www.ncsu.edu/labwrite/res/gt/gt-reg-home.html
  • Online Calculator: http://chemmac1.usc.edu/bruno/java/linreg.html

Transcript

  • 1. IBS Statistics Year 1 Dr. Ning DING
  • 2. Table of content
    • Review
    • Learning Goals
    • Chapter 12: Simple Regression and Correlation
    • Exercises
  • 3. Review Chapter 3: Describing Data Find the interquartile range: 1460 1471 1637 1721 1758 1787 1940 2038 2047 2054 2097 2205 2287 2311 2406. Interquartile Range = Q3 - Q1 = 2205 - 1721 = 484
  • 4. Correction of EXCEL Exercise 5: L = (8+1) * 25% = 2.25, so Q1 = 133.5. L = (8+1) * 75% = 6.75, so Q3 = 274.5. Interquartile Range = 274.5 - 133.5 = 141
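The location formula L = (n+1) * p used in the two review slides can be sketched in a few lines. This is a minimal sketch, assuming linear interpolation between neighbouring values when L is fractional; the function names are hypothetical, and it assumes at least three observations so that both locations fall inside the data.

```python
def value_at_location(sorted_values, location):
    """Value at a 1-based location; interpolate when the location is fractional."""
    lower_index = int(location)            # whole part of the location
    fraction = location - lower_index      # fractional part of the location
    lower = sorted_values[lower_index - 1]
    if fraction == 0:
        return lower
    upper = sorted_values[lower_index]
    return lower + fraction * (upper - lower)


def quartiles(values):
    """Return (Q1, Q3) using the location formula L = (n + 1) * p."""
    data = sorted(values)
    n = len(data)
    q1 = value_at_location(data, (n + 1) * 0.25)
    q3 = value_at_location(data, (n + 1) * 0.75)
    return q1, q3


# The 15 values from the review slide.
data = [1460, 1471, 1637, 1721, 1758, 1787, 1940, 2038,
        2047, 2054, 2097, 2205, 2287, 2311, 2406]
q1, q3 = quartiles(data)
print(q1, q3, q3 - q1)  # 1721 2205 484
```

With n = 15 the locations are whole numbers (4 and 12), so no interpolation is needed; with n = 8, as in the EXCEL exercise, the locations 2.25 and 6.75 fall between observations.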
  • 5. Boxplot Data: 1 2 2 4 5 7 8 9 12. Median = 5. Lower half: 1 2 2 4, so Q1 = 2. Upper half: 7 8 9 12, so Q3 = 8.5. Related measures: interquartile range, deciles (1st D, 9th D), percentiles. How to interpret? http://cnx.org/content/m11192/latest/
  • 6. Boxplot Range shown: €20 to €2000. Q1 = €250, Median = €350, Mean = €450, Q3 = €850. The distribution is skewed to the right because the mean is larger than the median. http://cnx.org/content/m11192/latest/
  • 7. Data set 1: 0.8 1.0 1.0 1.2 1.2 1.3 1.5 1.7 2.0 2.0 2.1 2.2 4.0. Mean > Median: positively skewed. Data set 2: 2.0 3.2 3.6 3.7 4.0 4.2 4.2 4.5 4.5 4.6 4.8 5.0 5.0. Mean < Median: negatively skewed. http://qudata.com/online/statcalc/
  • 8. Zero skewness: mode = median = mean. This means that the data is symmetrically distributed.
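The mean-versus-median rule from the skewness slides can be checked with Python's standard `statistics` module. This is a minimal sketch: `skew_direction` is a hypothetical helper, and splitting the slide's numbers into two sets of 13 values each is an assumption about how the original slide was laid out.

```python
from statistics import mean, median


def skew_direction(values):
    """Classify skewness by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "positively skewed"
    if m < med:
        return "negatively skewed"
    return "zero skewness"


# Assumed split of the slide's numbers into two data sets.
set_1 = [0.8, 1.0, 1.0, 1.2, 1.2, 1.3, 1.5, 1.7, 2.0, 2.0, 2.1, 2.2, 4.0]
set_2 = [2.0, 3.2, 3.6, 3.7, 4.0, 4.2, 4.2, 4.5, 4.5, 4.6, 4.8, 5.0, 5.0]

print(skew_direction(set_1))  # mean is about 1.69, median is 1.5
print(skew_direction(set_2))  # mean is 4.1, median is 4.2
```

Comparing mean and median is only a rule of thumb: equal values are consistent with a symmetric distribution but do not prove it.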
  • 9. Learning Goals
    • Chapter 12:
      • Learn how many business decisions depend on knowing the specific relationship between two or more variables
      • Use scatter diagrams to visualize the relationship between two variables
      • Use regression analysis to estimate the relationship between two variables
      • Use the least-squares estimating equation to predict future values of the dependent variable
      • Learn how correlation analysis describes the degree to which two variables are linearly related to each other
      • Understand the coefficient of determination as a measure of the strength of the relationship between two variables
      • Learn limitations of regression and correlation analyses and caveats about their use.
  • 10. 1. Introduction Chapter 12: Sim Reg & Corr Regression and Correlation Analyses:
      • How to determine both the nature and the strength of a relationship between variables.
  • 11. 1. Introduction Chapter 12: Sim Reg & Corr Scatter Diagram: Positive correlation
  • 12. 1. Introduction Chapter 12: Sim Reg & Corr Scatter Diagram: Negative correlation
  • 13. 1. Introduction Chapter 12: Sim Reg & Corr Scatter Diagram: No correlation
  • 14. 2. Types of Relationships Chapter 12: Sim Reg & Corr Variables:
      • Independent variables: known
      • Dependent variables: to predict
    Independent Variable Dependent Variable
  • 15. 2. Types of Relationships Chapter 12: Sim Reg & Corr
    • Correlation and cause and effect?
    • The relationships found by regression are relationships of association, not necessarily of cause and effect.
    Independent Variable Dependent Variable
  • 16. 2. Estimation Using the Regression Line Chapter 12: Sim Reg & Corr
    • Scatter Diagrams:
    • Patterns indicating that the variables are related
    • If related, we can describe the relationship
    Strong & Positive correlation Strong & Negative correlation Weak & Positive correlation Weak & Negative correlation No correlation
  • 17. Chapter 12: Sim Reg & Corr Scatter Diagrams: 2. Estimation Using the Regression Line
  • 18. 2. Estimation Using the Regression Line Chapter 12: Sim Reg & Corr
    • Simple Linear Regression:
    • The dependent variable Y is determined by the independent variable X
    Ŷ = a + b X, where X is the independent variable and Ŷ is the estimated value of the dependent variable Y
  • 19. 2. Estimation Using the Regression Line Chapter 12: Sim Reg & Corr
    • Simple Linear Regression:
    • The dependent variable Y is determined by the independent variable X
    Ŷ = a + b X
  • 20. 2. Estimation Using the Regression Line Chapter 12: Sim Reg & Corr Slope of the Best-Fitting Regression Line: b = (ΣXY - n X̄ Ȳ) / (ΣX² - n X̄²). Y-intercept: a = Ȳ - b X̄. Regression line: Ŷ = a + b X
  • 21. 2. Estimation Using the Regression Line Chapter 12: Sim Reg & Corr What is the relationship between the age of a truck (X) and the annual repair expense (Y)? With X̄ = 3, Ȳ = 6 and slope b = 0.75: a = Ȳ - b X̄ = 6 - 0.75 * 3 = 3.75, so Ŷ = 3.75 + 0.75 X. If the city has a truck that is 4 years old, Ŷ = 3.75 + 0.75 * 4 = 6.75, so the director could use the equation to predict $675 annually in repairs (the expense is evidently measured in hundreds of dollars).
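The truck calculation can be reproduced directly from the summary values on the slide (X̄ = 3, Ȳ = 6, b = 0.75). This is a minimal sketch; the function names are hypothetical, and the conversion to dollars assumes the expense is recorded in hundreds of dollars, which is what the slide's $675 figure implies.

```python
def intercept_from_means(x_mean, y_mean, slope):
    """Y-intercept of the regression line: a = Y-bar - b * X-bar."""
    return y_mean - slope * x_mean


def predict(intercept, slope, x):
    """Estimated value of Y for a given X: Y-hat = a + b * X."""
    return intercept + slope * x


slope = 0.75
a = intercept_from_means(x_mean=3, y_mean=6, slope=slope)
y_hat = predict(a, slope, x=4)

print(a)            # 3.75
print(y_hat)        # 6.75
print(y_hat * 100)  # 675.0, assuming Y is in hundreds of dollars
```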
  • 22. Exercise Chapter 12: Sim Reg & Corr
    • Example:
    • To find the simple/linear regression of Personal Income ( X ) and Auto Sales ( Y )
    Step 1: Count the number of values. N = 5.
    Step 2: Find XY and X² for each pair of values (see the table below). X = 64, what about Y?
  • 23. Exercise Chapter 12: Sim Reg & Corr
    Step 3: Find ΣX, ΣY, ΣXY, ΣX². ΣX = 311 (mean = 62.2), ΣY = 18.6 (mean = 3.72), ΣXY = 1159.7, ΣX² = 19359.
    Step 4: Substitute in the slope formula given above. Slope (b) = (1159.7 - 5*62.2*3.72) / (19359 - 5*62.2*62.2) = 0.19
  • 24. Exercise Chapter 12: Sim Reg & Corr
    Step 5: Substitute in the intercept formula given above. Intercept (a) = Ȳ - b X̄ = 3.72 - 0.19 * 62.2 = -8.098
    Step 6: Substitute these values in the regression equation formula. Regression Equation: Ŷ = a + bX = -8.098 + 0.19 X. To find the approximate Y value for X = 64, substitute that value in the equation: Ŷ = -8.098 + 0.19 (64) = -8.098 + 12.16 = 4.06
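Steps 3 to 6 of the exercise can be sketched as a single function that works from the sums on the slide. The function name is hypothetical. Note that the slides round the slope to 0.19 before computing the intercept, so the unrounded results below differ slightly from -8.098 and 4.06.

```python
def least_squares_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Slope and intercept of the least-squares line from summary sums."""
    x_mean = sum_x / n
    y_mean = sum_y / n
    slope = (sum_xy - n * x_mean * y_mean) / (sum_x2 - n * x_mean ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Sums from the Personal Income (X) and Auto Sales (Y) exercise.
b, a = least_squares_from_sums(n=5, sum_x=311, sum_y=18.6,
                               sum_xy=1159.7, sum_x2=19359)
y_hat = a + b * 64

print(round(b, 2))      # slope, which the slides round to 0.19
print(round(a, 3))      # intercept without intermediate rounding
print(round(y_hat, 2))  # estimate for X = 64
```

Carrying the unrounded slope through gives an intercept near -7.96 rather than -8.098, while the estimate for X = 64 still rounds to 4.06.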
  • 25. 2. Estimation Using the Regression Line Chapter 12: Sim Reg & Corr Least Squares Method: Measure the goodness of fit of a line by the sum of the squares of the errors, and choose the line that minimizes it. The error e_i = Y_i - Ŷ_i is the residual of observation i.
  • 26. 2. Estimation Using the Regression Line Chapter 12: Sim Reg & Corr Least Squares Method:
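The least squares idea can be sketched by comparing the sum of squared errors of the fitted line with that of another candidate line. The four data points below are an assumption: they are not given in the slides and were chosen only because they reproduce the truck example's values (X̄ = 3, Ȳ = 6, b = 0.75, a = 3.75). Function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line Y-hat = a + b * X."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - n * x_mean * y_mean) / (sum_x2 - n * x_mean ** 2)
    a = y_mean - b * x_mean
    return a, b


def sum_squared_errors(xs, ys, a, b):
    """Sum of squared residuals e_i = Y_i - (a + b * X_i)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Assumed data: truck ages (years) and repair expenses (hundreds of dollars).
xs = [5, 3, 3, 1]
ys = [7, 7, 6, 4]

a, b = fit_line(xs, ys)
best = sum_squared_errors(xs, ys, a, b)
other = sum_squared_errors(xs, ys, 4.0, 0.75)  # same slope, shifted intercept

print(a, b)         # 3.75 0.75
print(best, other)  # 1.5 1.75
```

Any other line through these points, such as the shifted one above, gives a larger sum of squared errors than the fitted line.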
  • 27. 2. Estimation Using the Regression Line Chapter 12: Sim Reg & Corr Example:
  • 28. 2. Estimation Using the Regression Line Chapter 12: Sim Reg & Corr Example Solution:
  • 29. 3. Correlation Analysis Chapter 12: Sim Reg & Corr Correlation Analysis: Describes the degree to which one variable is linearly related to another. Coefficient of Determination (r²): Measures the extent, or strength, of the association that exists between two variables. Coefficient of Correlation (r): The square root of the coefficient of determination, taking the same sign as the slope b.
  • 30. 3. Correlation Analysis Chapter 12: Sim Reg & Corr Coefficient of Determination: Measure the extent, or strength, of the association that exists between two variables.
    • 0 ≤ r² ≤ 1.
    • The larger r², the stronger the linear relationship.
    • The closer r² is to 1, the more confident we are in our prediction.
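One common way to compute the coefficient of determination is r² = 1 - SSE/SST, where SSE is the sum of squared residuals around the regression line and SST is the sum of squared deviations of Y from its mean. The sketch below uses that form; the formula on the original slide is not preserved in the transcript, and the data points are assumed values chosen to match the truck example (a = 3.75, b = 0.75).

```python
from math import sqrt

# Assumed data consistent with the truck example (not given in the slides).
ages = [5, 3, 3, 1]       # X: truck age in years
expenses = [7, 7, 6, 4]   # Y: repair expense in hundreds of dollars
a, b = 3.75, 0.75         # least-squares line for these points

y_mean = sum(expenses) / len(expenses)
predictions = [a + b * x for x in ages]

# Unexplained variation (around the line) and total variation (around the mean).
sse = sum((y - p) ** 2 for y, p in zip(expenses, predictions))
sst = sum((y - y_mean) ** 2 for y in expenses)

r_squared = 1 - sse / sst
# r carries the sign of the slope b.
r = sqrt(r_squared) if b >= 0 else -sqrt(r_squared)

print(r_squared)    # 0.75
print(round(r, 3))  # 0.866
```

Here r² = 0.75 falls between 0 and 1 as the slide requires, and r is positive because the slope is positive.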
  • 31. 3. Correlation Analysis Chapter 12: Sim Reg & Corr Coefficient of Correlation:
  • 32. 3. Correlation Analysis Chapter 12: Sim Reg & Corr Coefficient of Determination:
  • 33. 3. Correlation Analysis Chapter 12: Sim Reg & Corr Example Solution:
  • 34. 3. Correlation Analysis Chapter 12: Sim Reg & Corr Example Solution:
  • 35. Review Chapter 3: Describing Data
    • Which value of r indicates a stronger correlation than 0.40? A. -0.30 B. -0.50 C. +0.38 D. 0
    • If all the points on a scatter diagram lie on a straight line, what is the standard error of estimate? A. -1 B. +1 C. 0 D. Infinity
  • 36. Review Chapter 3: Describing Data
    • In the least squares equation Ŷ = 10 + 20 X, the value of 20 indicates: A. the Y-intercept. B. for each unit increase in X, Y increases by 20. C. for each unit increase in Y, X increases by 20. D. none of these.
  • 37. Exercise Chapter 3: Describing Data A sales manager for an advertising agency believes there is a relationship between the number of contacts and the amount of the sales. To verify this belief, the following data was collected: What is the Y-intercept of the linear equation? A. -12.201 B. 2.1946 C. -2.1946 D. 12.201
  • 38. Exercise Chapter 12: Sim Reg & Corr Ŷ = -1.8182 + 0.1329X Sample Exam P.4
  • 39. Exercise Chapter 12: Sim Reg & Corr Sample Exam P.4
  • 40. Exercise Chapter 12: Sim Reg & Corr Sample Exam P.4 Ŷ = -1.8182 + 0.1329X
  • 41. Summary Chapter 1: What is Statistics?
    • Chapter 3:
      • Calculate the arithmetic mean, weighted mean, median, mode, and geometric mean
      • Explain the characteristics, uses, advantages, and disadvantages of each measure of location
      • Identify the position of the mean, median, and mode for both symmetric and skewed distributions
      • Compute and interpret the range, mean deviation, variance, and standard deviation
      • Understand the characteristics, uses, advantages, and disadvantages of each measure of dispersion
      • Understand Chebyshev’s theorem and the Empirical Rule as they relate to a set of observations