William Christensen, Ph.D.
IMPORTANT Note
 Before you can use the Excel
functions for regression (this
section) or ANOVA (section 8)
you must be sure you’ve set
up Excel to access these
functions. To do that, while in
an Excel spreadsheet, click
“Tools” then “Add-Ins”. Then
make sure that “Analysis
ToolPak” and “Analysis
ToolPak – VBA” are checked
(see picture).
Correlation
Correlation
exists between two variables
when they are related to each
other in some way
 Example: when one increases the other
also increases, OR when one increases the
other decreases
Assumptions
1. The sample of paired data (x,y) is a
random sample.
2. The pairs of (x,y) data are normally
distributed.
Definition
Scatterplot (or scatter diagram)
is a graph in which the paired
(x,y) sample data are plotted
with a horizontal x axis and a
vertical y axis. Each individual
(x,y) pair is plotted as a single
point.
Scatter Diagram of Paired Data
Restaurant Bill (x) and Tip (y)
Positive Linear Correlation
[Three sample scatter plots: (a) Positive, (b) Strong positive, (c) Perfect positive]
Sample Scatter Plots showing various degrees of “positive”
correlation. That is, when x increases y also increases.
Negative Linear Correlation
[Three sample scatter plots: (d) Negative, (e) Strong negative, (f) Perfect negative]
Sample Scatter Plots showing various degrees of
“negative” correlation. When x increases y decreases
No Linear Correlation
[Two sample scatter plots: (g) No Correlation, (h) Nonlinear Correlation]
Sample Scatter Plots showing NO linear correlation. The first plot (g)
has no correlation of any kind. The other plot (h) shows a clear
correlation between x and y, but the correlation is NOT LINEAR.
Nonlinear correlation can be studied, but it is beyond this class. We
will only study LINEAR CORRELATION.
Definition
•Linear correlation can be measured. The
strength of the relationship between an “x”
variable and a “y” variable is measured by the
“linear correlation coefficient”
•The letter r represents the linear correlation
coefficient for paired x,y sample data
•The Greek letter ρ (rho) represents the linear
correlation coefficient for paired x,y population
data
The Linear Correlation Coefficient r
•The values of the linear correlation coefficient are
ALWAYS between -1 and +1
•If r = +1 there is a perfect positive linear correlation
between x and y
•If r = -1 there is a perfect negative linear correlation
between x and y
•If r = 0 there is absolutely no linear correlation
between x and y
Properties of the
Linear Correlation Coefficient r
1. -1 ≤ r ≤ 1 (r is ALWAYS between -1 and 1)
2. The value of r does not change if all values of
either variable are converted to a different scale.
E.g., you can change from meters to feet and it
will not change the value of r.
3. You can interchange x and y and the value of r
will not change. I.e., it does not matter which
comes first.
4. Only linear relationships are measured by r.
The Linear Correlation Coefficient r
•We will use Excel to calculate the linear
correlation coefficient, but just in case you want
to see the formula, here it is:
r = [ nΣxy − (Σx)(Σy) ] / ( √[nΣx² − (Σx)²] · √[nΣy² − (Σy)²] )
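Excel does this calculation for us, but for anyone curious, the formula above can be sketched directly in Python (a minimal illustration, not something you need for this class):

```python
import math

def linear_corr(xs, ys):
    """Pearson r via the textbook formula:
    r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))"""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    return num / den

# Perfectly linear data should give r = 1
print(linear_corr([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```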
The Linear Correlation Coefficient r
•The Excel function for finding the linear correlation
coefficient is =CORREL(array1,array2)
•Where array1 is the range of x values and array2 is
the range of y values (or vice versa)
• Hint: always arrange the x and y
values in side-by-side columns. Since
the data is “paired” there is a specific y
that goes with each x. Make sure
these pairs are side-by-side. As long
as the pairs are side-by-side, the
sequencing of pairs makes no
difference. See the next slide for an
example.
Correlation (Example)
Interpreting the Linear Correlation
Coefficient – 2 ways
1. If the absolute value of r exceeds the value in
Table A – 6 (see next slide), conclude that there
is a significant linear correlation.
 Otherwise, there is not sufficient evidence to
support the conclusion of significant linear
correlation.
2. Or, use the hypothesis testing method
described in this section to confirm whether the
correlation / relationship between x and y is
significant
TABLE A-6
Critical Values of the Pearson Correlation Coefficient r

  n    α = .05   α = .01
  4     .950      .999
  5     .878      .959
  6     .811      .917
  7     .754      .875
  8     .707      .834
  9     .666      .798
 10     .632      .765
 11     .602      .735
 12     .576      .708
 13     .553      .684
 14     .532      .661
 15     .514      .641
 16     .497      .623
 17     .482      .606
 18     .468      .590
 19     .456      .575
 20     .444      .561
 25     .396      .505
 30     .361      .463
 35     .335      .430
 40     .312      .402
 45     .294      .378
 50     .279      .361
 60     .254      .330
 70     .236      .305
 80     .220      .286
 90     .207      .269
100     .196      .256
 Note on rounding r: Round to three decimal places
 Example: For our Manatee problem, we calculated r = 0.922 (as
shown below)
 Using an alpha=0.05, and with n=10 we see from Table A-6 that an
r with an absolute value greater than 0.632 is significant.
 Since our calculated r = 0.922, we can say that the relationship
between the number of boats and the number of Manatee deaths is
statistically significant.
Using Table A-6
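The table-lookup method can also be sketched in code. The dictionary below copies a few of the α = .05 entries from Table A-6; the Manatee numbers (r = 0.922, n = 10) come from the slide:

```python
# Partial copy of Table A-6 critical values for alpha = 0.05
CRITICAL_R_05 = {4: 0.950, 5: 0.878, 6: 0.811, 7: 0.754,
                 8: 0.707, 9: 0.666, 10: 0.632}

def is_significant(r, n, table=CRITICAL_R_05):
    """Significant linear correlation if |r| exceeds the table value."""
    return abs(r) > table[n]

# Manatee example: r = 0.922 > 0.632, so the correlation is significant
print(is_significant(0.922, 10))  # True
```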
Common Errors Involving Correlation
1. Causation: It is wrong to conclude that correlation implies
causality
 Just because x and y are “related” does NOT MEAN that x
“causes” y or vice versa. Determining causality is a separate
issue and beyond this class (this is a very common mistake)
2. Averages: Averages suppress individual variation and may
inflate the correlation coefficient
 Make sure you are using data for individual events, NOT
averaged or grouped data
3. Linearity: There may be some relationship between x and y
even when there is no significant linear correlation
 Nonlinear relationships can be studied, but NOT in this class
Formal Hypothesis Test
 To determine whether there is a
significant linear correlation
between two variables
 Let
 H0: ρ = 0 (no significant linear correlation) and
 H1: ρ ≠ 0 (significant linear correlation)
Remember: ρ is the symbol for correlation between the
populations of x’s and y’s and r is the symbol for correlation
between a sample of x’s and y’s
Test Statistic is t
(follows format of earlier sections)
Test statistic:
Critical values:
1. Use =NORMSINV for large samples
2. For small samples (n<30) use =TINV(prob,df)
with probability = α (Excel knows it is a two-tailed
test, so input the entire alpha, NOT α/2), and (df)
degrees of freedom = n – 2 (YES, that’s df=n-2
in this case, NOT df=n-1 as in previous
sections)
t = r / √[ (1 − r²) / (n − 2) ]
Manatee Example (Hypothesis Test)
t = r / √[ (1 − r²) / (n − 2) ]
• Given α = 0.05 and n=10
• Test Statistic:
• Critical Values: -2.306 and +2.306
Manatee Example
Conclusion: The Test Statistic (6.735) falls
beyond the critical values (±2.306), so we reject H0
and accept the alternate hypothesis (H1) that there
is a significant relationship between x and y
(number of boats and manatee deaths)
[t-distribution plot centered at t=0 with critical values at -2.306 and +2.306;
the Test Statistic = 6.735 falls well inside the right rejection region]
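The slide's test statistic can be reproduced from r = 0.922 and n = 10; a minimal sketch:

```python
import math

def t_statistic(r, n):
    """t = r / sqrt((1 - r^2) / (n - 2))"""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

t = t_statistic(0.922, 10)
print(round(t, 3))     # matches the slide's 6.735
print(abs(t) > 2.306)  # True -> beyond the critical value, reject H0
```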
Regression
Regression
Definition
 The “Regression” process seeks to find a
line (and the equation that describes that line)
that “fits” through the paired data so that the
sum of the vertical distances (squared)
between the actual data points and the line is
minimized.
 The Regression Equation takes the form:
ŷ = b0 + b1x
Regression Line Plotted on Scatter Plot
The Regression Line is the line of “best fit”
through the data points. “Best fit” means the
sum of the squared vertical distances between
each data point and the regression line is
minimized.
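The least-squares line has a well-known closed-form solution for the slope and intercept. Excel computes this for us, so the sketch below is purely illustrative:

```python
def least_squares(xs, ys):
    """Slope b1 and intercept b0 minimizing the sum of squared
    vertical distances between the data points and the line."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar
    return b0, b1

# Data lying exactly on the line y = 1 + 2x
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```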
The Regression Equation
x is the independent variable (predictor
variable)
y-hat is the dependent variable
(response variable)
ŷ = b0 + b1x
b0 is the y-
intercept or the
value at which
the regression
line crosses the
vertical axis
b1 is the slope of
the regression line
or the amount of
change in y for
every 1 unit change
in x
y-hat is the
“dependent” or
“response” variable
because it depends
on, or responds to
the value of x
x is the “independent”
or “predictor” variable
because it acts
independently to
predict the value of y-
hat
Assumptions
1. We are investigating only linear relationships
2. For each x value, y is a random variable
having a normal (bell-shaped) distribution. All
of these y distributions have the same
variance
 Results are not seriously affected if departures from
normal distributions and equal variances are not too
extreme
Rounding Rule for Regression
 Round the y-intercept (b0) and the
slope (b1) to three significant digits
 If calculating manually, do not round
intermediate values
1. Use one of the methods we already learned to check for
a significant linear correlation (r) between x and y
2. If there is NOT a significant linear correlation between x
and y, then we CANNOT USE REGRESSION to predict y
and our best estimate of y is simply y-bar (the mean of
y). In this case forget about using regression because
it fails us
3. However, IF THERE IS a significant linear correlation
between x and y, the best predicted y-value is found by
putting the x-value into the regression equation and
calculating y
Steps in Doing Regression
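Steps 2 and 3 above amount to a simple decision rule, sketched below with illustrative numbers (b0, b1, and y-bar here are made up, not from the text):

```python
def best_predicted_y(x, r_is_significant, b0, b1, ybar):
    """Use the regression equation only when the linear correlation
    is significant; otherwise the best estimate of y is simply y-bar."""
    if r_is_significant:
        return b0 + b1 * x
    return ybar

# Illustrative values
print(best_predicted_y(2.0, True, 1.0, 0.5, ybar=4.0))   # 2.0 (regression)
print(best_predicted_y(2.0, False, 1.0, 0.5, ybar=4.0))  # 4.0 (fall back to y-bar)
```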
Regression using Excel
Click on:
 Tools
 Data Analysis
 Regression
To run the media file on the next slide,
click or double-click on the picture of the
Excel worksheet
See the following examples of how to do
Regression using Excel
Regression using Excel
This stuff is beyond
this class (don’t
worry about these).
See me if you have
questions
Here you can
specify where
you want the
results of the
Regression
analysis to
appear
Check ONLY if you
include in the y and
x ranges the
“headings” or cells
that label y and x. I
prefer to do that.
Input the range of cells
that contains the y
variables
Input the range of cells that
contains the x variables
Doesn’t matter if you
check, but I prefer to leave
unchecked
Leave
unchecked
Regression Example: What is the best predicted size of a
household that discards 0.50 lb of plastic? Data was collected by some poor
schmucks going around to different houses and going through the garbage. The
data is listed in columns (always list data in columns not in rows when using
Excel). x represents the lbs. of plastic thrown out and y represents the number of
people in that household. Notice that the y variable (number in household) is
what we are trying to predict and the x variable (lbs. of plastic tossed) is what we
use to try to predict y.
Step 1: Determine whether or not there is a significant relationship between x and y.
If there is NOT a significant relationship then we CANNOT use Regression. If there
IS a significant relationship between x and y then we can proceed to use Excel to do
a Regression Analysis and find the Regression Equation (note: assume an alpha of
0.05 unless specifically told otherwise)
• Given α = 0.05 and n=8 (df=n-2=8-2=6)
• Test Statistic:
• Critical Values: -2.447 and +2.447
• Conclusion: The Test Statistic is greater than the
Critical Value so we conclude that there IS a significant
linear relationship between the variables and we can now
proceed with our Regression Analysis
Regression Example (Hypothesis Test)
t = r / √[ (1 − r²) / (n − 2) ]
Regression Example: What is the best predicted size of a
household that discards 0.50 lb of plastic?
By specifying an
Output Range of $A$11
we are telling Excel to
use cell A11 as the
upper-left corner of the
Regression Analysis
output (see next slide)
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?
Before you get too excited about this output, let’s cross-off the info that we
are not going to discuss or learn about in class. I’m only trying to give you an
elementary exposure to Regression. The following slides will show you the
things you need to understand and each of those items will be explained.
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?
Info to help you understand Regression Info that you MUST know for test
Multiple R has no importance by itself, it is the
square of this value “R square” that is important
“R Square” is also known as the “Coefficient of
Determination” and represents the amount (percent)
of the variance in y that is explained by x. You can
sort-of say that this value shows how accurate or
effective our regression equation is. In this case, our
regression equation is 71% accurate or effective in
predicting y, given some value of x.
These coefficients form the Regression Equation. In
this case y=0.549+1.480(x). Thus, the answer to our
question is y=0.549+1.480(0.50)=1.289 people (or 1
rounded to the nearest whole person in a household
that tosses 0.50 lbs. of plastic)
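The arithmetic in this callout can be checked directly:

```python
# Coefficients from the Excel regression output on this slide
b0, b1 = 0.549, 1.480
y_hat = b0 + b1 * 0.50
print(round(y_hat, 3))  # 1.289 -> about 1 person, rounded to a whole person
```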
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?
Info to help you understand Regression Info that you MUST know for test
Here we simply see that Excel determines the
appropriate degrees of freedom – nothing for us to
do here
“Adjusted R Square” is the same as “R Square”
EXCEPT that it is adjusted for sample size. When the
sample size (n) is small, this value will be quite a bit
below the unadjusted R Square value. As the sample
size increases, the difference between R Square and
Adjusted R Square becomes negligible. Adjusted R
Square is a more accurate representation of the
accuracy or effectiveness of our regression equation.
Standard Error is an estimate of how far off our y
predictions will be (on average) using the regression
equation.
Observations
is simply n
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?
Info to help you understand Regression Info that you MUST know for test
You may remember the F test from the section on
hypothesis testing. Excel does an F test to see if the
regression equation is any better than simply using the
average y as the predictor of y. A large F value generally
suggests that the regression equation is effective, but we
rely on the “Significance F” to tell for sure
Significance F shows the alpha level that separates
the regression equation between being statistically
significant and NOT. Thus, if this value is LESS
THAN our alpha value (usually 0.05) then the
regression equation is statistically significant (i.e., a
low chance of error in predicting y). Conversely, if
this value is greater than our alpha then the
regression equation is not statistically significant
(i.e., a relatively high chance of error in predicting
y). In this case, since Significance F is much less
than 0.05 we conclude that the regression equation
is statistically significant.
SS requires some
explanation – see the
next couple of slides
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?
[Diagram: for a sample point, the Total deviation (y – y-bar) splits into the
Explained deviation (y-hat – y-bar, from the Regression) and the Unexplained
deviation (y – y-hat, the Residual). The plotted regression line is
y-hat = 3 + 2x, with y-bar = 9.]
SS (Sum of Squares) NOT REQUIRED FOR
YOU TO KNOW
SS Regression (Explained Deviation) – This is
calculated by first finding the vertical distance
between the regression line and the mean y value
(y-bar) at each data point. Next, those individual
differences are squared and then all added together
(summed) to form the SS (sum of squares)
Regression. In practical terms, in order for the
regression equation to be valid, we should see a
relatively high value here as compared to SS
Residual.
SS Residual (Unexplained Deviation) – This is
calculated by first finding the vertical distance
between each data point and the regression line.
Since we hope our regression line explains the
relationship between x and y, any difference
between the line and a data point is “unexplained”.
We hope this value is relatively small compared to
SS Regression.
SS Total – This is simply SS Regression + SS
Residual. It can also be calculated by finding the
vertical distance between each data point and y-bar,
squaring each value, and adding them all up.
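The identity SS Total = SS Regression + SS Residual (which holds for a least-squares line) can be checked numerically; a sketch with made-up data:

```python
def ss_decomposition(xs, ys):
    """SS Regression, SS Residual, and SS Total for a least-squares line."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar
    yhat = [b0 + b1 * x for x in xs]
    ss_reg = sum((yh - ybar) ** 2 for yh in yhat)          # explained
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))  # unexplained
    ss_tot = sum((y - ybar) ** 2 for y in ys)               # total
    return ss_reg, ss_res, ss_tot

ss_reg, ss_res, ss_tot = ss_decomposition([1, 2, 3, 4], [2, 3, 5, 6])
print(abs(ss_tot - (ss_reg + ss_res)) < 1e-9)  # True
```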
Multiple Regression
Multiple Regression
 Multiple Regression is a process for defining a
linear relationship between a dependent variable
y and two or more independent variables (x1, x2,
x3 . . . , xk)
 The linear equation for a regression problem in
which we have multiple x variables is as follows,
where b0 is the y-intercept and all the other b’s
are coefficients associated with their respective
x values:
ŷ = b0 + b1x1 + b2x2 + … + bkxk
Multiple Regression Guidelines
 More x variables is NOT necessarily better
 Remember that R Square is a measure of how
effective our regression equation is. Therefore, if
adding an x variable does not appreciably
increase the R Square value, then DON’T add it
 Use those x variables (the fewest possible) that
give you the biggest R Square (or Adjusted R
Square) value. We want efficiency so a few
variables that provide a big R Square is best
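Adjusted R Square penalizes extra variables using the standard adjustment, 1 − (1 − R²)(n − 1)/(n − k − 1), where k is the number of x variables. A sketch (the R² and n values below are illustrative, not from the text):

```python
def adjusted_r_square(r2, n, k):
    """Standard Adjusted R^2: penalizes extra x variables and
    small sample sizes. k = number of x variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Adding a variable that barely raises R^2 can LOWER Adjusted R^2:
one_var = adjusted_r_square(0.90, 12, 1)
two_var = adjusted_r_square(0.905, 12, 2)
print(round(one_var, 4), round(two_var, 4))  # second value is smaller
```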
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Step 1: Construct a “Correlation Matrix” to see which x variables have the strongest
linear relationships with the y variable (weight). Use the Excel function Tools, Data
Analysis, Correlation to construct a correlation matrix. An Excel file containing this
Bear data and Correlation Matrix are on the class website (mrbear.xls).
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Step 1: Construct a “Correlation Matrix” to see which x variables have the strongest
linear relationships with the y variable (weight)
• Ideally, we want to pick those few x variables that have strong correlations (close
to -1 or +1) with the y variable, BUT we also want the x variables to NOT be highly
correlated with each other
• The addition of an x variable that is strongly correlated with any x variable(s)
already in a multiple regression equation WILL NOT do much to increase the R
Squared or Adjusted R Square value
• On the other hand, adding an x variable that is strongly correlated with the y
variable, but NOT with any x variables already in the regression equation WILL
increase our R Squared value substantially
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Choosing which x variables to include in a multiple regression problem is often a
subjective decision. I suggest you try whichever variables you think will work best.
For this example, I am going to choose Neck size as my first x variable to predict y
(weight) since it has an awesome correlation of 0.971 with Weight. To pick a second
x variable I want one that is highly correlated with Weight, but NOT with Neck.
Unfortunately, it looks like all the other x variables ARE highly correlated with Neck.
In fact, if this were a real problem we would probably stick with simple regression
and just use Neck to predict Weight. But, since we have to do a multiple regression
problem I think I’ll pick Age as my second x variable. Age is not that highly
correlated with Weight (0.814), but it is also the least highly correlated with Neck
(0.906). Why don’t you try the problem with other x variables and see if you can
beat my Adjusted R Square value.
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
 For Multiple Regression we use the same
Tools, Data Analysis, Regression function in
Excel that we used for simple Regression.
 The “Y Range” is still simply the range of
cells that contains the y-variable
The “X Range” is the range of cells that
contains ALL the x-variables we want to
include in the regression model. MAKE
SURE the x-variables are in adjacent
columns (you cannot skip columns). Notice
how I now have Neck and Age right next to
each other.
 I like to include the column headings or
labels, so I checked the “labels” box. When
you do this the output also includes the
labels, making it a lot easier to interpret.
Everything else is the same as when we
did simple Regression
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Again, before we analyze the output, let’s cross-off the info that we are not
going to discuss or learn about in class. The following slides will explain
those things you need to understand.
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Just a note: I ran another regression with
Neck size as the only x variable. You can
see the Adjusted R Square is 0.9330 as
compared to the Adjusted R Square of
0.9536 that we got when we included Age
in addition to Neck. I don’t think I would
consider the slight increase in Adjusted R
Square from including Age as really being
worth the trouble of including another
variable. There is no set guideline as to
how much R Square should increase to
justify adding another variable, but going
from an already high R Squared to a
slightly higher value does not seem
worthwhile in my opinion
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Info to help you understand Regression Info that you MUST know for test
Multiple R – same story as before
Almost the same story as in simple Regression - “R Square” or
the “Coefficient of Determination” represents the amount
(percent) of variance in y that is explained by the x’s. In this
case our regression equation is 97% accurate or effective in
predicting y (Weight). Adjusted R Square simply adjusts the R
Square value for the number of x variables and the sample size.
Lots of variables and/or small sample sizes reduce the
Adjusted R Square. A few good variables and a large sample
size bring Adjusted R Square very close to R Square.
These coefficients form the Regression Equation. In
this case y=-307.817+26.388(Neck)-1.527(Age). Thus,
if we found a bear with a Neck of 25” and an Age of
48 months we could predict its Weight as y=-
307.817+26.388(25)-1.527(48)=278.6 lbs.
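The prediction in this callout can be reproduced directly:

```python
def predict_weight(neck, age):
    """Regression equation from the Excel output on this slide."""
    return -307.817 + 26.388 * neck - 1.527 * age

# Bear with a 25" Neck and an Age of 48 months
print(round(predict_weight(25, 48), 1))  # 278.6 lbs
```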
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Info to help you understand Regression Info that you MUST know for test
Same general comments as simple
Regression. Note that MS is SS/df and the F
statistic is calculated as MS Regression / MS
Residual (just in case you are curious)
See next slide for
explanation of these
important values
Like simple Regression, if the
Significance F is less than alpha then
our regression equation is statistically
significant
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Info to help you understand Regression Info that you MUST know for test
Excel calculates a t test statistic for each of the x variables (ignore the t Stat for Intercept).
The next column (P-value) is the important one because it provides the test result of this
test statistic as compared to the critical value. The critical values are not shown, but we
don’t need to see them because the P-value effectively tells us whether this test statistic is
inside or outside the critical value (see P-value for more explanation)
The P-value tells us whether the x variable is statistically significant in the regression equation. If
the P-value is less than alpha (usually 0.05) then that x variable is a significant contributor in the
regression equation. If the P-value is greater than alpha then that x variable does not contribute
significantly to the regression equation. In this case, Neck has a P-value much less than 0.05, so
we see that Neck size contributes significantly to our regression equation to predict a Bear’s
weight. On the other hand, Age has a P-value greater than 0.05, so we see that adding the Age x
variable into the equation was not a good idea since Age does not significantly help us predict a
Bear’s weight in our regression equation.
Modeling
Modeling and Exploring Data
 The next few slides show some of the various linear and
nonlinear relationships that might be encountered
 Linear relationships are the only “easy” ones to study,
but other types of relationships obviously exist
 These other types of relationships are beyond the scope
of this course – just wanted to show you what some of
them might look like
 Also, the following slide shows a little Excel movie on
how to use Excel to help fit certain types of mathematical
models to data – just enough info for you to be
dangerous
Using Excel to Model Data
William Christensen, Ph.D.

correlation and r3433333333333333333333333333333333333333333333333egratio111n.ppt

  • 1.
  • 2.
    IMPORTANT Note  Beforeyou can use the Excel functions for regression (this section) or ANOVA (section 8) you must be sure you’ve set up Excel to access these functions. To do that, while in an Excel spreadsheet, click “Tools” then “Add-Ins”. Then make sure that “Analysis TookPak” and “Analysis ToolPak –VBA” are checked (see picture).
  • 3.
  • 4.
    Correlation exists between twovariables when they are related to each other in some way  Example: when one increases the other also increases, OR when one increases the other decreases
  • 5.
    Assumptions 1. The sampleof paired data (x,y) is a random sample. 2. The pairs of (x,y) data are normally distributed.
  • 6.
    Definition Scatterplot (or scatterdiagram) is a graph in which the paired (x,y) sample data are plotted with a horizontal x axis and a vertical y axis. Each individual (x,y) pair is plotted as a single point.
  • 7.
    Scatter Diagram ofPaired Data Restaurant Bill (x) and Tip (y)
  • 8.
    Positive Linear Correlation xx y y y x Sample Scatter Plots showing various degrees of “positive” correlation. That is, when x increases y also increases (a) Positive (b) Strong positive (c) Perfect positive
  • 9.
    Negative Linear Correlation xx y y y x (d) Negative (e) Strong negative (f) Perfect negative Sample Scatter Plots showing various degrees of “negative” correlation. When x increases y decreases
  • 10.
    No Linear Correlation xx y y (g) No Correlation (h) Nonlinear Correlation Sample Scatter Plots showing NO linear correlation. The first plot (g) has no correlation of any kind. The other plot (h) shows a clear correlation between x and y, but the correlation is NOT LINEAR. Nonlinear correlation can be studied, but it is beyond this class. We will only study LINEAR CORRELATION.
  • 11.
    Definition •Linear correlation canbe measured. The strength of the relationship between an “x” variable and a “y” variable is measured by the “linear correlation coefficient” •The letter r represents the linear correlation coefficient for paired x,y sample data •The Greek letter  (rho) represents the linear correlation coefficient for paired x,y population data
  • 12.
    The Linear CorrelationCoefficient r •The values of the linear correlation coefficient are ALWAYS between -1 and +1 •If r = +1 there is a perfect positive linear correlation between x and y •If r = -1 there is a perfect negative linear correlation between x and y •If r = 0 there is absolutely no linear correlation between x and y
  • 13.
    Properties of the LinearCorrelation Coefficient r 1. -1  r  1 (r is ALWAYS between -1 and 1) 2. The value of r does not change if all values of either variable are converted to a different scale. E.g., you can change from meters to feet and it will not change the value of r. 3. You can interchange x and y and the value of r will not change. I.e., it does not matter which comes first. 4. Only linear relationships are measured by r.
  • 14.
    The Linear CorrelationCoefficient r •We will use Excel to calculate the linear correlation coefficient, but just in case you want to see the formula, here it is:            2 2 2 2 n xy x y r n x x n y y           
  • 15.
    The Linear CorrelationCoefficient r •The Excel function for finding the linear correlation coefficient is =CORREL(array1,array2) •Where array1 is the range of x values and array2 is the range of y values (or visa versa) • Hint: always arrange the x and y values in side-by-side columns. Since the data is “paired” there is a specific y that goes with each x. Make sure these pairs are side-by-side. As long as the pairs are side-by-side, the sequencing of pairs makes no difference. See the next slide for an example.
  • 16.
  • 17.
    Interpreting the LinearCorrelation Coefficient – 2 ways 1. If the absolute value of r exceeds the value in Table A – 6 (see next slide), conclude that there is a significant linear correlation.  Otherwise, there is not sufficient evidence to support the conclusion of significant linear correlation. 2. Or, use the hypothesis testing method described in this section to confirm whether the correlation / relationship between x and y is significant
  • 18.
    TABLE A-6 Critical Valuesof the Pearson Correlation Coefficient r 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 25 30 35 40 45 50 60 70 80 90 100 n .999 .959 .917 .875 .834 .798 .765 .735 .708 .684 .661 .641 .623 .606 .590 .575 .561 .505 .463 .430 .402 .378 .361 .330 .305 .286 .269 .256 .950 .878 .811 .754 .707 .666 .632 .602 .576 .553 .532 .514 .497 .482 .468 .456 .444 .396 .361 .335 .312 .294 .279 .254 .236 .220 .207 .196  = .05  = .01
  • 19.
     Note onrounding r: Round to three decimal places  Example: For our Manatee problem, we calculated r = 0.922 (as shown below)  Using an alpha=0.05, and with n=10 we see from Table A-6 that an r with an absolute value greater than 0.632 is significant.  Since our calculated r = 0.922, we can say that the relationship between the number of boats and the number of Manatee deaths is statistically significant. Using Table A-6
  • 20.
    Common Errors InvolvingCorrelation 1. Causation: It is wrong to conclude that correlation implies causality  Just because x and y are “related” does NOT MEAN that x “causes” y or visa versa. Determining causality is a separate issue and beyond this class (this is a very common mistake) 2. Averages: Averages suppress individual variation and may inflate the correlation coefficient  Make sure you are using data for individual events, NOT averaged or grouped data 3. Linearity: There may be some relationship between x and y even when there is no significant linear correlation  Nonlinear relationships can be studied, but NOT in this class
  • 21.
    Formal Hypothesis Test To determine whether there is a significant linear correlation between two variables  Let  H0:  = (no significant linear correlation) and  H1:   (significant linear correlation) Remember:  is the symbol for correlation between the populations of x’s and y’s and r is the symbol for correlation between a sample of x’s and y’s
  • 22.
    Test Statistic ist (follows format of earlier sections) Test statistic: Critical values: 1. Use =NORMSINV for large samples 2. For small samples (n<30) use =TINV(prob,df) with probability = α (Excel knows it is a two-tailed test, so input the entire alpha, NOT α/2), and (df) degrees of freedom = n – 2 (YES, that’s df=n-2 in this case, NOT df=n-1 as in previous sections) 2 1 2 r t r n   
  • 23.
    Manatee Example (HypothesisTest) 2 1 2 r t r n    • Given α = 0.05 and n=10 • Test Statistic: • Critical Values: -2.306 and +2.306
  • 24.
    Manatee Example Conclusion: Acceptthe alternate hypothesis (H1) that there is a significant relationship between x and y (number of boats and manatee deaths) t=0 2.306 -2.306 Test Statistic = 6.735
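The arithmetic behind the manatee test statistic can be checked directly; a small sketch (Python standing in for a calculator), using the r and n from the example:

```python
import math

# Test statistic for the significance of r (formula from the slides):
#   t = r / sqrt((1 - r^2) / (n - 2))
r, n = 0.922, 10   # values from the manatee example
t = r / math.sqrt((1 - r**2) / (n - 2))
# t comes out to about 6.735, well outside the critical values of
# -2.306 and +2.306, matching the conclusion above
```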
Regression Definition
• The "Regression" process seeks to find a line (and the equation that describes that line) that "fits" through the paired data so that the sum of the squared vertical distances between the actual data points and the line is minimized.
• The Regression Equation takes the form: ŷ = b0 + b1x
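The least-squares criterion just described determines b0 and b1 uniquely, and they have a closed form. A minimal sketch (Python; the helper name fit_line is my own, not from the slides):

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit: returns (b0, b1) for y-hat = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope: sum of cross-deviations divided by sum of squared x-deviations
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x   # intercept makes the line pass through the means
    return b0, b1
```

Excel's Regression tool (shown later) computes exactly these two coefficients for the simple one-x case.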
Regression Line Plotted on Scatter Plot: The Regression Line is the line of "best fit" through the data points. "Best fit" means the sum of the squared vertical distances between each data point and the regression line is minimized.
The Regression Equation: ŷ = b0 + b1x
• x is the "independent" or "predictor" variable because it acts independently to predict the value of ŷ.
• ŷ (y-hat) is the "dependent" or "response" variable because it depends on, or responds to, the value of x.
• b0 is the y-intercept, the value at which the regression line crosses the vertical axis.
• b1 is the slope of the regression line, the amount of change in y for every 1-unit change in x.
Assumptions
1. We are investigating only linear relationships.
2. For each x value, y is a random variable having a normal (bell-shaped) distribution, and all of these y distributions have the same variance. Results are not seriously affected if departures from normal distributions and equal variances are not too extreme.
Rounding Rule for Regression
• Round the y-intercept (b0) and the slope (b1) to three significant digits.
• If calculating manually, do not round intermediate values.
Steps in Doing Regression
1. Use one of the methods we already learned to check for a significant linear correlation (r) between x and y.
2. If there is NOT a significant linear correlation between x and y, then we CANNOT USE REGRESSION to predict y, and our best estimate of y is simply y-bar (the mean of y). In this case forget about using regression because it fails us.
3. However, IF THERE IS a significant linear correlation between x and y, the best predicted y-value is found by putting the x-value into the regression equation and calculating ŷ.
Regression using Excel: Click on Tools → Data Analysis → Regression. See the following examples of how to do Regression using Excel.
Regression using Excel (the Regression dialog):
• Input Y Range: the range of cells that contains the y variable.
• Input X Range: the range of cells that contains the x variable.
• Labels: check ONLY if your y and x ranges include the "headings" or cells that label y and x. I prefer to do that.
• Constant is Zero: leave unchecked.
• Confidence Level: doesn't matter if you check it, but I prefer to leave it unchecked.
• Output Range: here you can specify where you want the results of the Regression analysis to appear.
• The remaining options are beyond this class (don't worry about these). See me if you have questions.
Regression Example: What is the best predicted size of a household that discards 0.50 lb of plastic?
The data was collected by some poor schmucks going around to different houses and going through the garbage. The data is listed in columns (always list data in columns, not rows, when using Excel). x represents the lbs. of plastic thrown out and y represents the number of people in that household. Notice that the y variable (number in household) is what we are trying to predict, and the x variable (lbs. of plastic tossed) is what we use to try to predict y.
Step 1: Determine whether or not there is a significant relationship between x and y. If there is NOT a significant relationship, then we CANNOT use Regression. If there IS a significant relationship between x and y, then we can proceed to use Excel to do a Regression Analysis and find the Regression Equation. (Note: assume an alpha of 0.05 unless specifically told otherwise.)
Regression Example (Hypothesis Test)
• Given α = 0.05 and n = 8 (df = n − 2 = 8 − 2 = 6)
• Test Statistic: t = r / √((1 − r²) / (n − 2))
• Critical Values: −2.447 and +2.447
• Conclusion: The Test Statistic is greater than the Critical Value, so we conclude that there IS a significant linear relationship between the variables, and we can now proceed with our Regression Analysis.
Regression Example: What is the best predicted size of a household that discards 0.50 lb of plastic? By specifying an Output Range of $A$11 we are telling Excel to use cell A11 as the upper-left corner of the Regression Analysis output (see next slide).
Regression Example: Answer this question using Regression: What is the best predicted size of a household that discards 0.50 lb of plastic? Before you get too excited about this output, let's cross off the info that we are not going to discuss or learn about in class. I'm only trying to give you an elementary exposure to Regression. The following slides will show you the things you need to understand, and each of those items will be explained.
Regression Example: Answer this question using Regression: What is the best predicted size of a household that discards 0.50 lb of plastic?
• Multiple R has no importance by itself; it is the square of this value, "R Square," that is important.
• "R Square" is also known as the "Coefficient of Determination" and represents the amount (percent) of the variance in y that is explained by x. You can sort of say that this value shows how accurate or effective our regression equation is. In this case, our regression equation is 71% accurate or effective in predicting y, given some value of x.
• These coefficients form the Regression Equation. In this case ŷ = 0.549 + 1.480x. Thus, the answer to our question is ŷ = 0.549 + 1.480(0.50) = 1.289 people (or 1, rounded to the nearest whole person, in a household that tosses 0.50 lbs. of plastic).
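The prediction on that slide is just one substitution into the regression equation; checking the arithmetic:

```python
# Coefficients read off the Excel output above: y-hat = 0.549 + 1.480x
b0, b1 = 0.549, 1.480
y_hat = b0 + b1 * 0.50   # household that discards 0.50 lb of plastic
# y_hat = 1.289, i.e. about 1 person after rounding
```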
Regression Example: Answer this question using Regression: What is the best predicted size of a household that discards 0.50 lb of plastic?
• Here we simply see that Excel determines the appropriate degrees of freedom; nothing for us to do here.
• "Adjusted R Square" is the same as "R Square" EXCEPT that it is adjusted for sample size. When the sample size (n) is small, this value will be quite a bit below the unadjusted R Square value. As the sample size increases, the difference between R Square and Adjusted R Square becomes negligible. Adjusted R Square is a more accurate representation of the accuracy or effectiveness of our regression equation.
• Standard Error is an estimate of how far off our y predictions will be (on average) using the regression equation.
• Observations is simply n.
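The slides describe the sample-size adjustment but never state a formula. The standard textbook formula (an assumption on my part, not taken from the slides) behaves exactly as described, pulling the value down when n is small:

```python
def adjusted_r2(r2, n, k=1):
    """Textbook Adjusted R Square (assumed formula, not from the slides):
    penalizes small samples and, later, extra predictor variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With the slide's R Square of 0.71 and only n = 8 observations (k = 1
# predictor), the adjustment noticeably lowers the value
adj = adjusted_r2(0.71, 8)
```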
Regression Example: Answer this question using Regression: What is the best predicted size of a household that discards 0.50 lb of plastic?
• You may remember the F test from the section on hypothesis testing. Excel does an F test to see if the regression equation is any better than simply using the average y as the predictor of y. A large F value generally suggests that the regression equation is effective, but we rely on the "Significance F" to tell for sure.
• Significance F shows the alpha level that separates the regression equation from being statistically significant or NOT. Thus, if this value is LESS THAN our alpha value (usually 0.05), then the regression equation is statistically significant (i.e., a low chance of error in predicting y). Conversely, if this value is greater than our alpha, then the regression equation is not statistically significant (i.e., a relatively high chance of error in predicting y). In this case, since Significance F is much less than 0.05, we conclude that the regression equation is statistically significant.
• SS requires some explanation; see the next couple of slides.
Regression Example: Answer this question using Regression: What is the best predicted size of a household that discards 0.50 lb of plastic?
[Illustration: a scatter plot with regression line ŷ = 3 + 2x and mean line ȳ = 9, marking at one data point the explained deviation (ŷ − ȳ), the unexplained deviation (y − ŷ), and the total deviation (y − ȳ).]
SS (Sum of Squares): NOT REQUIRED FOR YOU TO KNOW
• SS Regression (Explained Deviation): calculated by first finding the vertical distance between the regression line and the mean y value (ȳ) at each data point. Next, those individual differences are squared and then all added together (summed) to form the SS (sum of squares) Regression. In practical terms, in order for the regression equation to be valid, we should see a relatively high value here compared to SS Residual.
• SS Residual (Unexplained Deviation): calculated by first finding the vertical distance between each data point and the regression line. Since we hope our regression line explains the relationship between x and y, any difference between the line and a data point is "unexplained." We hope this value is relatively small compared to SS Regression.
• SS Total: simply SS Regression + SS Residual. It can also be calculated by finding the vertical distance between each data point and ȳ, squaring each value, and adding them all up.
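The three sums of squares described above can be computed straight from the data and a least-squares fit; a sketch (Python, with a hypothetical function name):

```python
def sum_of_squares(xs, ys):
    """Returns (SS Regression, SS Residual, SS Total) for the
    least-squares line through the paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * x for x in xs]
    ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained
    ss_tot = sum((y - y_bar) ** 2 for y in ys)               # total
    return ss_reg, ss_res, ss_tot
```

For any least-squares line, SS Regression + SS Residual = SS Total, and R Square equals SS Regression / SS Total.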
Multiple Regression
• Multiple Regression is a process for defining a linear relationship between a dependent variable y and two or more independent variables (x1, x2, x3, . . . , xk).
• The linear equation for a regression problem in which we have multiple x variables is as follows, where b0 is the y-intercept and all the other b's are coefficients associated with their respective x values:
ŷ = b0 + b1x1 + b2x2 + . . . + bkxk
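Outside Excel, the same multiple-regression coefficients come from a least-squares solve. A sketch using NumPy with made-up data (not the bear data; I build y from known coefficients so the fit has something to recover):

```python
import numpy as np

# Fit y-hat = b0 + b1*x1 + b2*x2 by least squares (hypothetical data)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 1.0 + 2.0 * x1 + 0.5 * x2                     # built from known coefficients
X = np.column_stack([np.ones_like(x1), x1, x2])   # leading column of 1s gives b0
b, *_ = np.linalg.lstsq(X, y, rcond=None)
# b recovers [b0, b1, b2] = [1.0, 2.0, 0.5] (up to floating-point error)
```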
Multiple Regression Guidelines
• More x variables is NOT necessarily better. Remember that R Square is a measure of how effective our regression equation is. Therefore, if adding an x variable does not appreciably increase the R Square value, then DON'T add it.
• Use those x variables (the fewest possible) that give you the biggest R Square (or Adjusted R Square) value. We want efficiency, so a few variables that provide a big R Square is best.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
Step 1: Construct a "Correlation Matrix" to see which x variables have the strongest linear relationships with the y variable (Weight). Use Tools → Data Analysis → Correlation in Excel to construct a correlation matrix. An Excel file containing this Bear data and Correlation Matrix is on the class website (mrbear.xls).
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
Step 1: Construct a "Correlation Matrix" to see which x variables have the strongest linear relationships with the y variable (Weight).
• Ideally, we want to pick those few x variables that have strong correlations (close to −1 or +1) with the y variable, BUT we also want the x variables to NOT be highly correlated with each other.
• The addition of an x variable that is strongly correlated with any x variable(s) already in a multiple regression equation WILL NOT do much to increase the R Square or Adjusted R Square value.
• On the other hand, adding an x variable that is strongly correlated with the y variable, but NOT with any x variables already in the regression equation, WILL increase our R Square value substantially.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
Choosing which x variables to include in a multiple regression problem is often a subjective decision. I suggest you try whichever variables you think will work best. For this example, I am going to choose Neck size as my first x variable to predict y (Weight), since it has an awesome correlation of 0.971 with Weight. To pick a second x variable I want one that is highly correlated with Weight, but NOT with Neck. Unfortunately, it looks like all the other x variables ARE highly correlated with Neck. In fact, if this were a real problem we would probably stick with simple regression and just use Neck to predict Weight. But, since we have to do a multiple regression problem, I think I'll pick Age as my second x variable. Age is not that highly correlated with Weight (0.814), but it is also the least highly correlated with Neck (0.906). Why don't you try the problem with other x variables and see if you can beat my Adjusted R Square value?
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
• For Multiple Regression we use the same Tools → Data Analysis → Regression function in Excel that we used for simple Regression.
• The "Y Range" is still simply the range of cells that contains the y variable. The "X Range" is the range of cells that contains ALL the x variables we want to include in the regression model. MAKE SURE the x variables are in adjacent columns (you cannot skip columns). Notice how I now have Neck and Age right next to each other.
• I like to include the column headings or labels, so I checked the "Labels" box. When you do this the output also includes the labels, making it a lot easier to interpret. Everything else is the same as when we did simple Regression.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears. Again, before we analyze the output, let's cross off the info that we are not going to discuss or learn about in class. The following slides will explain the things you need to understand.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
Just a note: I ran another regression with Neck size as the only x variable. You can see the Adjusted R Square is 0.9330, as compared to the Adjusted R Square of 0.9536 that we got when we included Age in addition to Neck. I don't think I would consider the slight increase in Adjusted R Square from including Age as really being worth the trouble of including another variable. There is no set guideline as to how much R Square should increase to justify adding another variable, but going from an already high R Square to a slightly higher value does not seem worthwhile, in my opinion.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
• Multiple R: same story as before.
• Almost the same story as in simple Regression: "R Square" or the "Coefficient of Determination" represents the amount (percent) of variance in y that is explained by the x's. In this case our regression equation is 97% accurate or effective in predicting y (Weight). Adjusted R Square simply adjusts the R Square value for the number of x variables and the sample size. Lots of variables and/or small sample sizes reduce the Adjusted R Square; a few good variables and a large sample size bring Adjusted R Square very close to R Square.
• These coefficients form the Regression Equation. In this case ŷ = −307.817 + 26.388(Neck) − 1.527(Age). Thus, if we found a bear with a Neck of 25" and an Age of 48 months, we could predict its Weight as ŷ = −307.817 + 26.388(25) − 1.527(48) = 278.6 lbs.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
• Same general comments as simple Regression. Note that MS is SS/df and the F statistic is calculated as MS Regression / MS Residual (just in case you are curious).
• Like simple Regression, if the Significance F is less than alpha, then our regression equation is statistically significant.
• See the next slide for an explanation of the other important values.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
• Excel calculates a t test statistic for each of the x variables (ignore the t Stat for Intercept). The next column (P-value) is the important one because it provides the result of comparing this test statistic to the critical value. The critical values are not shown, but we don't need to see them because the P-value effectively tells us whether the test statistic is inside or outside the critical value.
• The P-value tells us whether the x variable is statistically significant in the regression equation. If the P-value is less than alpha (usually 0.05), then that x variable is a significant contributor to the regression equation. If the P-value is greater than alpha, then that x variable does not contribute significantly to the regression equation. In this case, Neck has a P-value much less than 0.05, so we see that Neck size contributes significantly to our regression equation to predict a Bear's weight. On the other hand, Age has a P-value greater than 0.05, so we see that adding the Age x variable into the equation was not a good idea, since Age does not significantly help us predict a Bear's weight in our regression equation.
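The P-value rule above amounts to a simple filter on the coefficient table. A sketch with hypothetical P-values (the slides give no exact numbers, only that Neck's P-value is far below 0.05 and Age's is above it):

```python
# Hypothetical P-values for illustration only, NOT the actual Excel output
alpha = 0.05
p_values = {"Intercept": 0.01, "Neck": 0.0001, "Age": 0.12}
keep = [name for name, p in p_values.items()
        if name != "Intercept" and p < alpha]   # ignore the Intercept's t/P
# keep lists only the x variables that contribute significantly
```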
Modeling and Exploring Data
• The next few slides show some of the various linear and nonlinear relationships that might be encountered.
• Linear relationships are the only "easy" ones to study, but other types of relationships obviously exist.
• These other types of relationships are beyond the scope of this course; I just wanted to show you what some of them might look like.
• Also, the following slide shows a little Excel movie on how to use Excel to help fit certain types of mathematical models to data: just enough info for you to be dangerous.
Using Excel to Model Data