Chapter 10: Correlation and Regression
Correlation & Regression Correlation is a statistical method used to determine if a relationship between variables exists.  Regression is the statistical method used to describe the nature of the relationship between variables - that is, positive or negative, linear or nonlinear.
Independent and Dependent Variable There are two types of variables in a regression analysis:  The independent variable is the variable in regression that can be controlled or manipulated.  The dependent variable is the variable that cannot be controlled or manipulated; its values depend on the independent variable.
Scatter Plot A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable, x, and the dependent variable, y.  The scatter plot is a visual way to describe the nature of the relationship between the independent and dependent variables.
Analyzing a Scatter Plot After the plot is drawn, it should be analyzed to determine which type of relationship, if any, exists.  A positive relationship exists when both variables increase or decrease at the same time. Ex. ______________________________________ A negative relationship exists when one variable decreases while the other increases. Ex. ______________________________________
Example:  Construct a scatter plot of the data and discuss any trends that you see.
Yield of Wheat vs. Rainfall
Yield of Wheat (bushels per acre):  54.4  71.3  44.5  41.6  80.6  52.2  28.7  62.5
Rainfall (inches):                  13.1  15.9  10.3   8.8  18.6  11.3   7.2  12.9
Example: Prepare a scatter plot of the data and discuss any trends.  Description:  Ice cream consumption was measured over 30 four-week periods from March 18, 1951 to July 11, 1953. The purpose of the study was to determine if ice cream consumption depends on the variables price, income, or temperature. The variables Lag-temp and Year have been added to the original data.  Link to Data
Correlation Coefficient The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables. The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is ρ (the Greek letter rho).
Correlation Coefficient (cont’d.) The range of the correlation coefficient is from −1 to +1. If there is a strong positive linear relationship between the variables, the value of r will be close to +1. If there is a strong negative linear relationship between the variables, the value of r will be close to −1.
Correlation Coefficient (cont’d.) When there is no linear relationship between the variables or only a weak relationship, the value of r will be close to 0.
[Figure: number line for r running from −1 (strong negative linear relationship) through 0 (no linear relationship) to +1 (strong positive linear relationship).]
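Formula for the Correlation Coefficient r
$$ r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}} $$
where n is the number of data pairs.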
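The computational formula for r can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the function name pearson_r and the two small data sets are made up for the example, and the sketch does not handle the case where all x values (or all y values) are identical, which would make the denominator zero.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient from the computational formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: points that lie exactly on a rising line and a falling line.
r_up = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
r_down = pearson_r([1, 2, 3, 4], [9, 7, 5, 3])
print(r_up, r_down)
```

Points lying exactly on a rising line give r = 1, and points lying exactly on a falling line give r = −1, which matches the stated range of the coefficient.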
Possible Relationships Between Variables There is a direct cause-and-effect relationship between the variables: that is, x causes y. There is a reverse cause-and-effect relationship between the variables: that is, y causes x. The relationship between the variables may be caused by a third variable: that is, y may appear to cause x but in reality z causes x.
Possible Relationships Between Variables There may be a  complexity of interrelationships among many variables ; that is,  x  may cause  y  but  w ,  t , and  z  fit into the picture as well. The  relationship may be coincidental : although a researcher may find a relationship between  x  and  y , common sense may prove otherwise.
Interpretation of Relationships When the null hypothesis is rejected, the researcher must consider all possibilities and select the appropriate relationship between the variables as determined by the study.  Remember, correlation does not necessarily imply causation.
Example A medical researcher wishes to determine how the dosage (in milligrams) of a drug affects the heart rate of the patient. The data for seven patients are given here. Draw the scatter plot for the variables and compute the correlation coefficient.
Heart Rate, y:    82    80    88    92    93    90    95
Drug Dosage, x:   0.50  0.40  0.35  0.30  0.25  0.20  0.125
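As a check on hand or calculator work, the correlation coefficient for the dosage data can be computed with a short Python sketch. It assumes the dosage and heart rate values are paired in the order listed in the example; the variable names are chosen for illustration only.

```python
from math import sqrt

# Data from the example, paired in the order listed (an assumption).
dosage = [0.50, 0.40, 0.35, 0.30, 0.25, 0.20, 0.125]   # x, milligrams
heart_rate = [82, 80, 88, 92, 93, 90, 95]               # y

n = len(dosage)
sum_x = sum(dosage)
sum_y = sum(heart_rate)
sum_xy = sum(x * y for x, y in zip(dosage, heart_rate))
sum_x2 = sum(x * x for x in dosage)
sum_y2 = sum(y * y for y in heart_rate)

# Computational formula for the sample correlation coefficient r.
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 3))
```

Under that pairing, r comes out to about −0.862, a fairly strong negative linear relationship: higher dosages go with lower heart rates in this sample.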
Example A researcher wishes to determine whether there is a relationship between the age (in years) of grocery store cash registers and monthly maintenance cost. The data follow. Draw the scatter plot for the variables and compute the correlation coefficient.
Cost, y:   87  90  83  65  70  90  75
Age, x:     4   6   2   1   3   4   2
Steps in Regression Analysis First: Collect the data. Second: Construct a scatter plot to see if there is any linear relationship between the variables. Third: Compute the value of the correlation coefficient.
Steps in Regression Analysis Fourth: If the value of the correlation coefficient is significant, then determine the equation of the regression line which is the data’s line of best fit. Note: Determining the regression line when r is not significant is meaningless.
The purpose of the regression line is to enable the researcher to see the trend and make predictions on the basis of the data. The line of best fit is the line that minimizes the sum of the squared residuals.
The closer the points fit the regression line, the higher the absolute value of r and the closer it will be to -1 or 1. When all points fall directly on the line, r will equal 1 or -1, and this indicates a perfect linear relationship between the variables.
The values y - y’ are called residuals. The residual is the difference between the actual (observed) value y and the predicted value y’. The sum of the residuals is always zero. The regression line determined by the formulas is the line that best fits the points of the observed data. This line is also called the least-squares line because the sum of the ____________ of the ___________ computed using the regression line is the ___________ __________ ________
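The two properties of residuals stated above can be checked numerically. The sketch below uses the drug dosage and heart rate data from the earlier example (paired in the order listed, which is an assumption) and the standard least-squares formulas for the slope and intercept; the helper name fit_line and the comparison line are illustrative choices.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of (observed y - predicted y) squared for a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

dosage = [0.50, 0.40, 0.35, 0.30, 0.25, 0.20, 0.125]
heart_rate = [82, 80, 88, 92, 93, 90, 95]

slope, intercept = fit_line(dosage, heart_rate)

# Residual = observed y minus predicted y.
residuals = [y - (slope * x + intercept) for x, y in zip(dosage, heart_rate)]
residual_sum = sum(residuals)   # zero, up to floating-point rounding

# Any other line gives a larger sum of squared residuals.
sse_best = sum_squared_residuals(dosage, heart_rate, slope, intercept)
sse_other = sum_squared_residuals(dosage, heart_rate, slope + 1.0, intercept)
print(residual_sum, sse_best, sse_other)
```

The residual sum prints as a value extremely close to zero rather than exactly zero because of floating-point rounding, and the altered line always produces the larger sum of squared residuals.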
Recall from algebra that the equation of a line can be given by y = mx + b, where m is the slope of the line and b is the y-intercept. In statistics, the equation of the regression line is y’ = ax + b, where a is the slope, b is the y-intercept, and y’ represents the predicted value of y for a given value of x.
Example: Drug dosage vs. Heart Rate Data Revisited  Find the regression equation for the drug dosage and heart rate data. Use the equation of the regression line to predict the heart rate of a patient given a dosage of 0.27 milligrams.
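One way to check the answer to this example is the Python sketch below. It assumes the data are paired in the order listed in the original example and uses the standard least-squares formulas, with a as the slope and b as the y-intercept to match the form y’ = ax + b; the variable names are illustrative.

```python
dosage = [0.50, 0.40, 0.35, 0.30, 0.25, 0.20, 0.125]   # x, milligrams
heart_rate = [82, 80, 88, 92, 93, 90, 95]               # y

n = len(dosage)
sum_x = sum(dosage)
sum_y = sum(heart_rate)
sum_xy = sum(x * y for x, y in zip(dosage, heart_rate))
sum_x2 = sum(x * x for x in dosage)

# Least-squares slope (a) and y-intercept (b) for y' = a*x + b.
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - a * sum_x) / n

# Predicted heart rate for a dosage of 0.27 milligrams.
predicted = a * 0.27 + b
print(round(a, 2), round(b, 2), round(predicted, 1))
```

Under these assumptions the line is roughly y’ = −38.62x + 100.29, and the predicted heart rate at 0.27 milligrams is about 89.9. The dosage 0.27 lies inside the range of the observed x values, so the prediction is an interpolation.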
Example: Age of Register vs. Repair Cost Data Revisited Find the regression equation for the age of the cash register and repair cost data. Use the equation of the regression line to predict the repair cost of a register that is 4 years old.
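The same calculation can be sketched for the cash register data. The pairing of ages and costs follows the order in which the values are listed in the original example, which is an assumption; a different pairing would change the results. The helper name fit_line is illustrative.

```python
def fit_line(xs, ys):
    """Return (slope a, intercept b) of the least-squares line y' = a*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b

age = [4, 6, 2, 1, 3, 4, 2]            # x, years
cost = [87, 90, 83, 65, 70, 90, 75]    # y, monthly maintenance cost

a, b = fit_line(age, cost)
predicted_cost = a * 4 + b             # register that is 4 years old
print(round(a, 2), round(b, 2), round(predicted_cost, 2))
```

With this pairing the line is approximately y’ = 4.69x + 65.27, giving a predicted monthly cost of about 84.02 for a 4-year-old register. Note that the two observed registers aged 4 years cost 87 and 90; the regression line gives the fitted trend, not the observed values.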
The Coefficient of Determination The coefficient of determination is a measure of the variation of the dependent variable that is explained by the regression line and the independent variable. The symbol for the coefficient of determination is r². _________ = _______________
The coefficient of nondetermination is the measure of the unexplained variation. It is found by subtracting the coefficient of determination from 1.  Coefficient of Nondetermination = __________
Example Determine the percentage of explained variation and unexplained variation in the heart rate data.
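A possible way to verify this example is to compute the explained and total variation directly and take their ratio, which for a least-squares line equals r². The sketch assumes the heart rate data are paired with dosages in the order listed in the original example; all variable names are illustrative.

```python
dosage = [0.50, 0.40, 0.35, 0.30, 0.25, 0.20, 0.125]   # x, milligrams
heart_rate = [82, 80, 88, 92, 93, 90, 95]               # y

n = len(dosage)
mean_x = sum(dosage) / n
mean_y = sum(heart_rate) / n

# Least-squares line y' = a*x + b, written in terms of deviations from the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, heart_rate))
s_xx = sum((x - mean_x) ** 2 for x in dosage)
a = s_xy / s_xx
b = mean_y - a * mean_x

predicted = [a * x + b for x in dosage]

# Total variation of y, and the part explained by the regression line.
total_variation = sum((y - mean_y) ** 2 for y in heart_rate)
explained_variation = sum((p - mean_y) ** 2 for p in predicted)

r_squared = explained_variation / total_variation   # coefficient of determination
unexplained = 1 - r_squared                          # coefficient of nondetermination
print(round(100 * r_squared, 1), round(100 * unexplained, 1))
```

Under that pairing, about 74.3% of the variation in heart rate is explained by the regression line on dosage and about 25.7% is unexplained.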
Example Determine the coefficient of determination for the cash register data and explain what it means.
 
