William Christensen, Ph.D.
IMPORTANT Note
 Before you can use the Excel
functions for regression (this
section) or ANOVA (section 8)
you must be sure you’ve set
up Excel to access these
functions. To do that, while in
an Excel spreadsheet, click
“Tools” then “Add-Ins”. Then
make sure that “Analysis
ToolPak” and “Analysis
ToolPak – VBA” are checked
(see picture).
Correlation
Correlation
exists between two variables
when they are related to each
other in some way
 Example: when one increases the other
also increases, OR when one increases the
other decreases
Assumptions
1. The sample of paired data (x,y) is a
random sample.
2. The pairs of (x,y) data are normally
distributed.
Definition
Scatterplot (or scatter diagram)
is a graph in which the paired
(x,y) sample data are plotted
with a horizontal x axis and a
vertical y axis. Each individual
(x,y) pair is plotted as a single
point.
Scatter Diagram of Paired Data
Restaurant Bill (x) and Tip (y)
Positive Linear Correlation
[Three sample scatter plots: (a) Positive, (b) Strong positive, (c) Perfect positive]
Sample Scatter Plots showing various degrees of “positive”
correlation. That is, when x increases y also increases.
Negative Linear Correlation
[Three sample scatter plots: (d) Negative, (e) Strong negative, (f) Perfect negative]
Sample Scatter Plots showing various degrees of
“negative” correlation. When x increases y decreases
No Linear Correlation
[Two sample scatter plots: (g) No Correlation, (h) Nonlinear Correlation]
Sample Scatter Plots showing NO linear correlation. The first plot (g)
has no correlation of any kind. The other plot (h) shows a clear
correlation between x and y, but the correlation is NOT LINEAR.
Nonlinear correlation can be studied, but it is beyond this class. We
will only study LINEAR CORRELATION.
Definition
•Linear correlation can be measured. The
strength of the relationship between an “x”
variable and a “y” variable is measured by the
“linear correlation coefficient”
•The letter r represents the linear correlation
coefficient for paired x,y sample data
•The Greek letter ρ (rho) represents the linear
correlation coefficient for paired x,y population
data
The Linear Correlation Coefficient r
•The values of the linear correlation coefficient are
ALWAYS between -1 and +1
•If r = +1 there is a perfect positive linear correlation
between x and y
•If r = -1 there is a perfect negative linear correlation
between x and y
•If r = 0 there is absolutely no linear correlation
between x and y
Properties of the
Linear Correlation Coefficient r
1. -1 ≤ r ≤ 1 (r is ALWAYS between -1 and 1)
2. The value of r does not change if all values of
either variable are converted to a different scale.
E.g., you can change from meters to feet and it
will not change the value of r.
3. You can interchange x and y and the value of r
will not change. I.e., it does not matter which
comes first.
4. Only linear relationships are measured by r.
The Linear Correlation Coefficient r
•We will use Excel to calculate the linear
correlation coefficient, but just in case you want
to see the formula, here it is:
r = [ nΣxy − (Σx)(Σy) ] / ( √[nΣx² − (Σx)²] · √[nΣy² − (Σy)²] )
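Excel does this calculation for us, but for anyone curious, the formula above can be sketched directly in Python (a minimal illustration, not something you need for this class):

```python
import math

def linear_corr(xs, ys):
    """Pearson r via the textbook formula:
    r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))"""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    return num / den

# Perfectly linear data should give r = 1
print(linear_corr([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```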
The Linear Correlation Coefficient r
•The Excel function for finding the linear correlation
coefficient is =CORREL(array1,array2)
•Where array1 is the range of x values and array2 is
the range of y values (or vice versa)
• Hint: always arrange the x and y
values in side-by-side columns. Since
the data is “paired” there is a specific y
that goes with each x. Make sure
these pairs are side-by-side. As long
as the pairs are side-by-side, the
sequencing of pairs makes no
difference. See the next slide for an
example.
Correlation (Example)
Interpreting the Linear Correlation
Coefficient – 2 ways
1. If the absolute value of r exceeds the value in
Table A – 6 (see next slide), conclude that there
is a significant linear correlation.
 Otherwise, there is not sufficient evidence to
support the conclusion of significant linear
correlation.
2. Or, use the hypothesis testing method
described in this section to confirm whether the
correlation / relationship between x and y is
significant
TABLE A-6
Critical Values of the Pearson Correlation Coefficient r

  n    α = .05   α = .01
  4     .950      .999
  5     .878      .959
  6     .811      .917
  7     .754      .875
  8     .707      .834
  9     .666      .798
 10     .632      .765
 11     .602      .735
 12     .576      .708
 13     .553      .684
 14     .532      .661
 15     .514      .641
 16     .497      .623
 17     .482      .606
 18     .468      .590
 19     .456      .575
 20     .444      .561
 25     .396      .505
 30     .361      .463
 35     .335      .430
 40     .312      .402
 45     .294      .378
 50     .279      .361
 60     .254      .330
 70     .236      .305
 80     .220      .286
 90     .207      .269
100     .196      .256
 Note on rounding r: Round to three decimal places
 Example: For our Manatee problem, we calculated r = 0.922 (as
shown below)
 Using an alpha=0.05, and with n=10 we see from Table A-6 that an
r with an absolute value greater than 0.632 is significant.
 Since our calculated r = 0.922, we can say that the relationship
between the number of boats and the number of Manatee deaths is
statistically significant.
Using Table A-6
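The table-lookup method can also be sketched in code. The dictionary below copies a few of the α = .05 entries from Table A-6; the Manatee numbers (r = 0.922, n = 10) come from the slide:

```python
# Partial copy of Table A-6 critical values for alpha = 0.05
CRITICAL_R_05 = {4: 0.950, 5: 0.878, 6: 0.811, 7: 0.754,
                 8: 0.707, 9: 0.666, 10: 0.632}

def is_significant(r, n, table=CRITICAL_R_05):
    """Significant linear correlation if |r| exceeds the table value."""
    return abs(r) > table[n]

# Manatee example: r = 0.922 > 0.632, so the correlation is significant
print(is_significant(0.922, 10))  # True
```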
Common Errors Involving Correlation
1. Causation: It is wrong to conclude that correlation implies
causality
 Just because x and y are “related” does NOT MEAN that x
“causes” y or vice versa. Determining causality is a separate
issue and beyond this class (this is a very common mistake)
2. Averages: Averages suppress individual variation and may
inflate the correlation coefficient
 Make sure you are using data for individual events, NOT
averaged or grouped data
3. Linearity: There may be some relationship between x and y
even when there is no significant linear correlation
 Nonlinear relationships can be studied, but NOT in this class
Formal Hypothesis Test
 To determine whether there is a
significant linear correlation
between two variables
 Let
 H0: ρ = 0 (no significant linear correlation) and
 H1: ρ ≠ 0 (significant linear correlation)
Remember: ρ is the symbol for correlation between the
populations of x’s and y’s and r is the symbol for correlation
between a sample of x’s and y’s
Test Statistic is t
(follows format of earlier sections)
Test statistic:
Critical values:
1. Use =NORMSINV for large samples
2. For small samples (n<30) use =TINV(prob,df)
with probability = α (Excel knows it is a two-tailed
test, so input the entire alpha, NOT α/2), and (df)
degrees of freedom = n – 2 (YES, that’s df=n-2
in this case, NOT df=n-1 as in previous
sections)
t = r / √[ (1 − r²) / (n − 2) ]
Manatee Example (Hypothesis Test)
t = r / √[ (1 − r²) / (n − 2) ]
• Given α = 0.05 and n=10
• Test Statistic:
• Critical Values: -2.306 and +2.306
Manatee Example
Conclusion: The Test Statistic (6.735) falls
beyond the critical values (±2.306), so we reject H0
and accept the alternate hypothesis (H1) that there
is a significant relationship between x and y
(number of boats and manatee deaths)
[t-distribution plot centered at t=0 with critical values at -2.306 and +2.306;
the Test Statistic = 6.735 falls well inside the right rejection region]
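The slide's test statistic can be reproduced from r = 0.922 and n = 10; a minimal sketch:

```python
import math

def t_statistic(r, n):
    """t = r / sqrt((1 - r^2) / (n - 2))"""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

t = t_statistic(0.922, 10)
print(round(t, 3))     # matches the slide's 6.735
print(abs(t) > 2.306)  # True -> beyond the critical value, reject H0
```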
Regression
Regression
Definition
 The “Regression” process seeks to find a
line (and the equation that describes that line)
that “fits” through the paired data so that the
sum of the vertical distances (squared)
between the actual data points and the line is
minimized.
 The Regression Equation takes the form:
ŷ = b0 + b1x
Regression Line Plotted on Scatter Plot
The Regression Line is the line of “best fit”
through the data points. “Best fit” means the
sum of the squared vertical distances between
each data point and the regression line is
minimized.
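The least-squares line has a well-known closed-form solution for the slope and intercept. Excel computes this for us, so the sketch below is purely illustrative:

```python
def least_squares(xs, ys):
    """Slope b1 and intercept b0 minimizing the sum of squared
    vertical distances between the data points and the line."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar
    return b0, b1

# Data lying exactly on the line y = 1 + 2x
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```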
The Regression Equation
x is the independent variable (predictor
variable)
y-hat is the dependent variable
(response variable)
ŷ = b0 + b1x
b0 is the y-
intercept or the
value at which
the regression
line crosses the
vertical axis
b1 is the slope of
the regression line
or the amount of
change in y for
every 1 unit change
in x
y-hat is the
“dependent” or
“response” variable
because it depends
on, or responds to
the value of x
x is the “independent”
or “predictor” variable
because it acts
independently to
predict the value of y-
hat
Assumptions
1. We are investigating only linear relationships
2. For each x value, y is a random variable
having a normal (bell-shaped) distribution. All
of these y distributions have the same
variance
 Results are not seriously affected if departures from
normal distributions and equal variances are not too
extreme
Rounding Rule for Regression
 Round the y-intercept (b0) and the
slope (b1) to three significant digits
 If calculating manually, do not round
intermediate values
1. Use one of the methods we already learned to check for
a significant linear correlation (r) between x and y
2. If there is NOT a significant linear correlation between x
and y, then we CANNOT USE REGRESSION to predict y
and our best estimate of y is simply y-bar (the mean of
y). In this case forget about using regression because
it fails us
3. However, IF THERE IS a significant linear correlation
between x and y, the best predicted y-value is found by
putting the x-value into the regression equation and
calculating y
Steps in Doing Regression
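Steps 2 and 3 above amount to a simple decision rule, sketched below with illustrative numbers (b0, b1, and y-bar here are made up, not from the text):

```python
def best_predicted_y(x, r_is_significant, b0, b1, ybar):
    """Use the regression equation only when the linear correlation
    is significant; otherwise the best estimate of y is simply y-bar."""
    if r_is_significant:
        return b0 + b1 * x
    return ybar

# Illustrative values
print(best_predicted_y(2.0, True, 1.0, 0.5, ybar=4.0))   # 2.0 (regression)
print(best_predicted_y(2.0, False, 1.0, 0.5, ybar=4.0))  # 4.0 (fall back to y-bar)
```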
Regression using Excel
Click on:
 Tools
 Data Analysis
 Regression
To run the media file on the next slide,
click or double-click on the picture of the
Excel worksheet
See the following examples of how to do
Regression using Excel
Regression using Excel
This stuff is beyond
this class (don’t
worry about these).
See me if you have
questions
Here you can
specify where
you want the
results of the
Regression
analysis to
appear
Check ONLY if you
include in the y and
x ranges the
“headings” or cells
that label y and x. I
prefer to do that.
Input the range of cells
that contains the y
variables
Input the range of cells that
contains the x variables
Doesn’t matter if you
check, but I prefer to leave
unchecked
Leave
unchecked
Regression Example: What is the best predicted size of a
household that discards 0.50 lb of plastic? Data was collected by some poor
schmucks going around to different houses and going through the garbage. The
data is listed in columns (always list data in columns not in rows when using
Excel). x represents the lbs. of plastic thrown out and y represents the number of
people in that household. Notice that the y variable (number in household) is
what we are trying to predict and the x variable (lbs. of plastic tossed) is what we
use to try to predict y.
Step 1: Determine whether or not there is a significant relationship between x and y.
If there is NOT a significant relationship then we CANNOT use Regression. If there
IS a significant relationship between x and y then we can proceed to use Excel to do
a Regression Analysis and find the Regression Equation (note: assume an alpha of
0.05 unless specifically told otherwise)
• Given α = 0.05 and n=8 (df=n-2=8-2=6)
• Test Statistic:
• Critical Values: -2.447 and +2.447
• Conclusion: The Test Statistic is greater than the
Critical Value so we conclude that there IS a significant
linear relationship between the variables and we can now
proceed with our Regression Analysis
Regression Example (Hypothesis Test)
t = r / √[ (1 − r²) / (n − 2) ]
Regression Example: What is the best predicted size of a
household that discards 0.50 lb of plastic?
By specifying an
Output Range of $A$11
we are telling Excel to
use cell A11 as the
upper-left corner of the
Regression Analysis
output (see next slide)
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?
Before you get too excited about this output, let’s cross-off the info that we
are not going to discuss or learn about in class. I’m only trying to give you an
elementary exposure to Regression. The following slides will show you the
things you need to understand and each of those items will be explained.
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?
Info to help you understand Regression Info that you MUST know for test
Multiple R has no importance by itself, it is the
square of this value “R square” that is important
“R Square” is also known as the “Coefficient of
Determination” and represents the amount (percent)
of the variance in y that is explained by x. You can
sort-of say that this value shows how accurate or
effective our regression equation is. In this case, our
regression equation is 71% accurate or effective in
predicting y, given some value of x.
These coefficients form the Regression Equation. In
this case y=0.549+1.480(x). Thus, the answer to our
question is y=0.549+1.480(0.50)=1.289 people (or 1
rounded to the nearest whole person in a household
that tosses 0.50 lbs. of plastic)
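The arithmetic in this callout can be checked directly:

```python
# Coefficients from the Excel regression output on this slide
b0, b1 = 0.549, 1.480
y_hat = b0 + b1 * 0.50
print(round(y_hat, 3))  # 1.289 -> about 1 person, rounded to a whole person
```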
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?
Info to help you understand Regression Info that you MUST know for test
Here we simply see that Excel determines the
appropriate degrees of freedom – nothing for us to
do here
“Adjusted R Square” is the same as “R Square”
EXCEPT that it is adjusted for sample size. When the
sample size (n) is small, this value will be quite a bit
below the unadjusted R Square value. As the sample
size increases, the difference between R Square and
Adjusted R Square becomes negligible. Adjusted R
Square is a more accurate representation of the
accuracy or effectiveness of our regression equation.
Standard Error is an estimate of how far off our y
predictions will be (on average) using the regression
equation.
Observations
is simply n
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?
Info to help you understand Regression Info that you MUST know for test
You may remember the F test from the section on
hypothesis testing. Excel does an F test to see if the
regression equation is any better than simply using the
average y as the predictor of y. A large F value generally
suggests that the regression equation is effective, but we
rely on the “Significance F” to tell for sure
Significance F shows the alpha level that separates
the regression equation between being statistically
significant and NOT. Thus, if this value is LESS
THAN our alpha value (usually 0.05) then the
regression equation is statistically significant (i.e., a
low chance of error in predicting y). Conversely, if
this value is greater than our alpha then the
regression equation is not statistically significant
(i.e., a relatively high chance of error in predicting
y). In this case, since Significance F is much less
than 0.05 we conclude that the regression equation
is statistically significant.
SS requires some
explanation – see the
next couple of slides
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?
[Diagram: for a sample point, the Total deviation (y – y-bar) splits into the
Explained deviation (y-hat – y-bar, from the Regression) and the Unexplained
deviation (y – y-hat, the Residual). The plotted regression line is
y-hat = 3 + 2x, with y-bar = 9.]
SS (Sum of Squares) NOT REQUIRED FOR
YOU TO KNOW
SS Regression (Explained Deviation) – This is
calculated by first finding the vertical distance
between the regression line and the mean y value
(y-bar) at each data point. Next, those individual
differences are squared and then all added together
(summed) to form the SS (sum of squares)
Regression. In practical terms, in order for the
regression equation to be valid, we should see a
relatively high value here as compared to SS
Residual.
SS Residual (Unexplained Deviation) – This is
calculated by first finding the vertical distance
between each data point and the regression line.
Since we hope our regression line explains the
relationship between x and y, any difference
between the line and a data point is “unexplained”.
We hope this value is relatively small compared to
SS Regression.
SS Total – This is simply SS Regression + SS
Residual. It can also be calculated by finding the
vertical distance between each data point and y-bar,
squaring each value, and adding them all up.
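The identity SS Total = SS Regression + SS Residual (which holds for a least-squares line) can be checked numerically; a sketch with made-up data:

```python
def ss_decomposition(xs, ys):
    """SS Regression, SS Residual, and SS Total for a least-squares line."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar
    yhat = [b0 + b1 * x for x in xs]
    ss_reg = sum((yh - ybar) ** 2 for yh in yhat)          # explained
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))  # unexplained
    ss_tot = sum((y - ybar) ** 2 for y in ys)               # total
    return ss_reg, ss_res, ss_tot

ss_reg, ss_res, ss_tot = ss_decomposition([1, 2, 3, 4], [2, 3, 5, 6])
print(abs(ss_tot - (ss_reg + ss_res)) < 1e-9)  # True
```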
Multiple Regression
Multiple Regression
 Multiple Regression is a process for defining a
linear relationship between a dependent variable
y and two or more independent variables (x1, x2,
x3 . . . , xk)
 The linear equation for a regression problem in
which we have multiple x variables is as follows,
where b0 is the y-intercept and all the other b’s
are coefficients associated with their respective
x values:
ŷ = b0 + b1x1 + b2x2 + … + bkxk
Multiple Regression Guidelines
 More x variables is NOT necessarily better
 Remember that R Square is a measure of how
effective our regression equation is. Therefore, if
adding an x variable does not appreciably
increase the R Square value, then DON’T add it
 Use those x variables (the fewest possible) that
give you the biggest R Square (or Adjusted R
Square) value. We want efficiency so a few
variables that provide a big R Square is best
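Adjusted R Square penalizes extra variables using the standard adjustment, 1 − (1 − R²)(n − 1)/(n − k − 1), where k is the number of x variables. A sketch (the R² and n values below are illustrative, not from the text):

```python
def adjusted_r_square(r2, n, k):
    """Standard Adjusted R^2: penalizes extra x variables and
    small sample sizes. k = number of x variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Adding a variable that barely raises R^2 can LOWER Adjusted R^2:
one_var = adjusted_r_square(0.90, 12, 1)
two_var = adjusted_r_square(0.905, 12, 2)
print(round(one_var, 4), round(two_var, 4))  # second value is smaller
```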
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Step 1: Construct a “Correlation Matrix” to see which x variables have the strongest
linear relationships with the y variable (weight). Use the Excel function Tools, Data
Analysis, Correlation to construct a correlation matrix. An Excel file containing this
Bear data and Correlation Matrix are on the class website (mrbear.xls).
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Step 1: Construct a “Correlation Matrix” to see which x variables have the strongest
linear relationships with the y variable (weight)
• Ideally, we want to pick those few x variables that have strong correlations (close
to -1 or +1) with the y variable, BUT we also want the x variables to NOT be highly
correlated with each other
• The addition of an x variable that is strongly correlated with any x variable(s)
already in a multiple regression equation WILL NOT do much to increase the R
Squared or Adjusted R Square value
• On the other hand, adding an x variable that is strongly correlated with the y
variable, but NOT with any x variables already in the regression equation WILL
increase our R Squared value substantially
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Choosing which x variables to include in a multiple regression problem is often a
subjective decision. I suggest you try whichever variables you think will work best.
For this example, I am going to choose Neck size as my first x variable to predict y
(weight) since it has an awesome correlation of 0.971 with Weight. To pick a second
x variable I want one that is highly correlated with Weight, but NOT with Neck.
Unfortunately, it looks like all the other x variables ARE highly correlated with Neck.
In fact, if this were a real problem we would probably stick with simple regression
and just use Neck to predict Weight. But, since we have to do a multiple regression
problem I think I’ll pick Age as my second x variable. Age is not that highly
correlated with Weight (0.814), but it is also the least highly correlated with Neck
(0.906). Why don’t you try the problem with other x variables and see if you can
beat my Adjusted R Square value.
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
 For Multiple Regression we use the same
Tools, Data Analysis, Regression function in
Excel that we used for simple Regression.
 The “Y Range” is still simply the range of
cells that contains the y-variable
The “X Range” is the range of cells that
contains ALL the x-variables we want to
include in the regression model. MAKE
SURE the x-variables are in adjacent
columns (you cannot skip columns). Notice
how I now have Neck and Age right next to
each other.
 I like to include the column headings or
labels, so I checked the “labels” box. When
you do this the output also includes the
labels, making it a lot easier to interpret.
Everything else is the same as when we
did simple Regression
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Again, before we analyze the output, let’s cross-off the info that we are not
going to discuss or learn about in class. The following slides will explain
those things you need to understand.
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Just a note: I ran another regression with
Neck size as the only x variable. You can
see the Adjusted R Square is 0.9330 as
compared to the Adjusted R Square of
0.9536 that we got when we included Age
in addition to Neck. I don’t think I would
consider the slight increase in Adjusted R
Square from including Age as really being
worth the trouble of including another
variable. There is no set guideline as to
how much R Square should increase to
justify adding another variable, but going
from an already high R Squared to a
slightly higher value does not seem
worthwhile in my opinion
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Info to help you understand Regression Info that you MUST know for test
Multiple R – same story as before
Almost the same story as in simple Regression - “R Square” or
the “Coefficient of Determination” represents the amount
(percent) of variance in y that is explained by the x’s. In this
case our regression equation is 97% accurate or effective in
predicting y (Weight). Adjusted R Square simply adjusts the R
Square value for the number of x variables and the sample size.
Lots of variables and/or small sample sizes reduce the
Adjusted R Square. A few good variables and a large sample
size bring Adjusted R Square very close to R Square.
These coefficients form the Regression Equation. In
this case y=-307.817+26.388(Neck)-1.527(Age). Thus,
if we found a bear with a Neck of 25” and an Age of
48 months we could predict its Weight as y=-
307.817+26.388(25)-1.527(48)=278.6 lbs.
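The prediction in this callout can be reproduced directly:

```python
def predict_weight(neck, age):
    """Regression equation from the Excel output on this slide."""
    return -307.817 + 26.388 * neck - 1.527 * age

# Bear with a 25" Neck and an Age of 48 months
print(round(predict_weight(25, 48), 1))  # 278.6 lbs
```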
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Info to help you understand Regression Info that you MUST know for test
Same general comments as simple
Regression. Note that MS is SS/df and the F
statistic is calculated as MS Regression / MS
Residual (just in case you are curious)
See next slide for
explanation of these
important values
Like simple Regression, if the
Significance F is less than alpha then
our regression equation is statistically
significant
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Info to help you understand Regression Info that you MUST know for test
Excel calculates a t test statistic for each of the x variables (ignore the t Stat for Intercept).
The next column (P-value) is the important one because it provides the test result of this
test statistic as compared to the critical value. The critical values are not shown, but we
don’t need to see them because the P-value effectively tells us whether this test statistic is
inside or outside the critical value (see P-value for more explanation)
The P-value tells us whether the x variable is statistically significant in the regression equation. If
the P-value is less than alpha (usually 0.05) then that x variable is a significant contributor in the
regression equation. If the P-value is greater than alpha then that x variable does not contribute
significantly to the regression equation. In this case, Neck has a P-value much less than 0.05, so
we see that Neck size contributes significantly to our regression equation to predict a Bear’s
weight. On the other hand, Age has a P-value greater than 0.05, so we see that adding the Age x
variable into the equation was not a good idea since Age does not significantly help us predict a
Bear’s weight in our regression equation.
Modeling
Modeling and Exploring Data
 The next few slides show some of the various linear and
nonlinear relationships that might be encountered
 Linear relationships are the only “easy” ones to study,
but other types of relationships obviously exist
 These other types of relationships are beyond the scope
of this course – just wanted to show you what some of
them might look like
 Also, the following slide shows a little Excel movie on
how to use Excel to help fit certain types of mathematical
models to data – just enough info for you to be
dangerous
Using Excel to Model Data
William Christensen, Ph.D.

correlation and r3433333333333333333333333333333333333333333333333egratio111n.ppt

  • 1.
  • 2.
    IMPORTANT Note  Beforeyou can use the Excel functions for regression (this section) or ANOVA (section 8) you must be sure you’ve set up Excel to access these functions. To do that, while in an Excel spreadsheet, click “Tools” then “Add-Ins”. Then make sure that “Analysis TookPak” and “Analysis ToolPak –VBA” are checked (see picture).
  • 3.
  • 4.
    Correlation exists between twovariables when they are related to each other in some way  Example: when one increases the other also increases, OR when one increases the other decreases
  • 5.
    Assumptions 1. The sampleof paired data (x,y) is a random sample. 2. The pairs of (x,y) data are normally distributed.
  • 6.
    Definition Scatterplot (or scatterdiagram) is a graph in which the paired (x,y) sample data are plotted with a horizontal x axis and a vertical y axis. Each individual (x,y) pair is plotted as a single point.
  • 7.
    Scatter Diagram ofPaired Data Restaurant Bill (x) and Tip (y)
  • 8.
    Positive Linear Correlation xx y y y x Sample Scatter Plots showing various degrees of “positive” correlation. That is, when x increases y also increases (a) Positive (b) Strong positive (c) Perfect positive
  • 9.
    Negative Linear Correlation xx y y y x (d) Negative (e) Strong negative (f) Perfect negative Sample Scatter Plots showing various degrees of “negative” correlation. When x increases y decreases
  • 10.
    No Linear Correlation xx y y (g) No Correlation (h) Nonlinear Correlation Sample Scatter Plots showing NO linear correlation. The first plot (g) has no correlation of any kind. The other plot (h) shows a clear correlation between x and y, but the correlation is NOT LINEAR. Nonlinear correlation can be studied, but it is beyond this class. We will only study LINEAR CORRELATION.
  • 11.
    Definition •Linear correlation canbe measured. The strength of the relationship between an “x” variable and a “y” variable is measured by the “linear correlation coefficient” •The letter r represents the linear correlation coefficient for paired x,y sample data •The Greek letter  (rho) represents the linear correlation coefficient for paired x,y population data
  • 12.
    The Linear CorrelationCoefficient r •The values of the linear correlation coefficient are ALWAYS between -1 and +1 •If r = +1 there is a perfect positive linear correlation between x and y •If r = -1 there is a perfect negative linear correlation between x and y •If r = 0 there is absolutely no linear correlation between x and y
  • 13.
    Properties of the LinearCorrelation Coefficient r 1. -1  r  1 (r is ALWAYS between -1 and 1) 2. The value of r does not change if all values of either variable are converted to a different scale. E.g., you can change from meters to feet and it will not change the value of r. 3. You can interchange x and y and the value of r will not change. I.e., it does not matter which comes first. 4. Only linear relationships are measured by r.
  • 14.
    The Linear CorrelationCoefficient r •We will use Excel to calculate the linear correlation coefficient, but just in case you want to see the formula, here it is:            2 2 2 2 n xy x y r n x x n y y           
  • 15.
    The Linear CorrelationCoefficient r •The Excel function for finding the linear correlation coefficient is =CORREL(array1,array2) •Where array1 is the range of x values and array2 is the range of y values (or visa versa) • Hint: always arrange the x and y values in side-by-side columns. Since the data is “paired” there is a specific y that goes with each x. Make sure these pairs are side-by-side. As long as the pairs are side-by-side, the sequencing of pairs makes no difference. See the next slide for an example.
  • 16.
  • 17.
    Interpreting the LinearCorrelation Coefficient – 2 ways 1. If the absolute value of r exceeds the value in Table A – 6 (see next slide), conclude that there is a significant linear correlation.  Otherwise, there is not sufficient evidence to support the conclusion of significant linear correlation. 2. Or, use the hypothesis testing method described in this section to confirm whether the correlation / relationship between x and y is significant
  • 18.
    TABLE A-6 Critical Valuesof the Pearson Correlation Coefficient r 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 25 30 35 40 45 50 60 70 80 90 100 n .999 .959 .917 .875 .834 .798 .765 .735 .708 .684 .661 .641 .623 .606 .590 .575 .561 .505 .463 .430 .402 .378 .361 .330 .305 .286 .269 .256 .950 .878 .811 .754 .707 .666 .632 .602 .576 .553 .532 .514 .497 .482 .468 .456 .444 .396 .361 .335 .312 .294 .279 .254 .236 .220 .207 .196  = .05  = .01
  • 19.
     Note onrounding r: Round to three decimal places  Example: For our Manatee problem, we calculated r = 0.922 (as shown below)  Using an alpha=0.05, and with n=10 we see from Table A-6 that an r with an absolute value greater than 0.632 is significant.  Since our calculated r = 0.922, we can say that the relationship between the number of boats and the number of Manatee deaths is statistically significant. Using Table A-6
  • 20.
    Common Errors InvolvingCorrelation 1. Causation: It is wrong to conclude that correlation implies causality  Just because x and y are “related” does NOT MEAN that x “causes” y or visa versa. Determining causality is a separate issue and beyond this class (this is a very common mistake) 2. Averages: Averages suppress individual variation and may inflate the correlation coefficient  Make sure you are using data for individual events, NOT averaged or grouped data 3. Linearity: There may be some relationship between x and y even when there is no significant linear correlation  Nonlinear relationships can be studied, but NOT in this class
  • 21.
    Formal Hypothesis Test To determine whether there is a significant linear correlation between two variables  Let  H0:  = (no significant linear correlation) and  H1:   (significant linear correlation) Remember:  is the symbol for correlation between the populations of x’s and y’s and r is the symbol for correlation between a sample of x’s and y’s
  • 22.
    Test Statistic ist (follows format of earlier sections) Test statistic: Critical values: 1. Use =NORMSINV for large samples 2. For small samples (n<30) use =TINV(prob,df) with probability = α (Excel knows it is a two-tailed test, so input the entire alpha, NOT α/2), and (df) degrees of freedom = n – 2 (YES, that’s df=n-2 in this case, NOT df=n-1 as in previous sections) 2 1 2 r t r n   
  • 23.
    Manatee Example (HypothesisTest) 2 1 2 r t r n    • Given α = 0.05 and n=10 • Test Statistic: • Critical Values: -2.306 and +2.306
  • 24.
    Manatee Example Conclusion: Acceptthe alternate hypothesis (H1) that there is a significant relationship between x and y (number of boats and manatee deaths) t=0 2.306 -2.306 Test Statistic = 6.735
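The arithmetic behind the manatee test statistic can be checked directly; a small sketch (Python standing in for a calculator), using the r and n from the example:

```python
import math

# Test statistic for the significance of r (formula from the slides):
#   t = r / sqrt((1 - r^2) / (n - 2))
r, n = 0.922, 10   # values from the manatee example
t = r / math.sqrt((1 - r**2) / (n - 2))
# t comes out to about 6.735, well outside the critical values of
# -2.306 and +2.306, matching the conclusion above
```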
Regression Definition
• The "Regression" process seeks to find a line (and the equation that describes that line) that "fits" through the paired data so that the sum of the squared vertical distances between the actual data points and the line is minimized.
• The Regression Equation takes the form: ŷ = b0 + b1x
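The least-squares criterion just described determines b0 and b1 uniquely, and they have a closed form. A minimal sketch (Python; the helper name fit_line is my own, not from the slides):

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit: returns (b0, b1) for y-hat = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope: sum of cross-deviations divided by sum of squared x-deviations
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x   # intercept makes the line pass through the means
    return b0, b1
```

Excel's Regression tool (shown later) computes exactly these two coefficients for the simple one-x case.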
Regression Line Plotted on Scatter Plot: The Regression Line is the line of "best fit" through the data points. "Best fit" means the sum of the squared vertical distances between each data point and the regression line is minimized.
The Regression Equation: ŷ = b0 + b1x
• x is the "independent" or "predictor" variable because it acts independently to predict the value of ŷ.
• ŷ (y-hat) is the "dependent" or "response" variable because it depends on, or responds to, the value of x.
• b0 is the y-intercept, the value at which the regression line crosses the vertical axis.
• b1 is the slope of the regression line, the amount of change in y for every 1-unit change in x.
Assumptions
1. We are investigating only linear relationships.
2. For each x value, y is a random variable having a normal (bell-shaped) distribution, and all of these y distributions have the same variance. Results are not seriously affected if departures from normal distributions and equal variances are not too extreme.
Rounding Rule for Regression
• Round the y-intercept (b0) and the slope (b1) to three significant digits.
• If calculating manually, do not round intermediate values.
Steps in Doing Regression
1. Use one of the methods we already learned to check for a significant linear correlation (r) between x and y.
2. If there is NOT a significant linear correlation between x and y, then we CANNOT USE REGRESSION to predict y, and our best estimate of y is simply y-bar (the mean of y). In this case forget about using regression because it fails us.
3. However, IF THERE IS a significant linear correlation between x and y, the best predicted y-value is found by putting the x-value into the regression equation and calculating ŷ.
Regression using Excel: Click on Tools → Data Analysis → Regression. See the following examples of how to do Regression using Excel.
Regression using Excel (the Regression dialog):
• Input Y Range: the range of cells that contains the y variable.
• Input X Range: the range of cells that contains the x variable.
• Labels: check ONLY if your y and x ranges include the "headings" or cells that label y and x. I prefer to do that.
• Constant is Zero: leave unchecked.
• Confidence Level: doesn't matter if you check it, but I prefer to leave it unchecked.
• Output Range: here you can specify where you want the results of the Regression analysis to appear.
• The remaining options are beyond this class (don't worry about these). See me if you have questions.
Regression Example: What is the best predicted size of a household that discards 0.50 lb of plastic?
The data was collected by some poor schmucks going around to different houses and going through the garbage. The data is listed in columns (always list data in columns, not rows, when using Excel). x represents the lbs. of plastic thrown out and y represents the number of people in that household. Notice that the y variable (number in household) is what we are trying to predict, and the x variable (lbs. of plastic tossed) is what we use to try to predict y.
Step 1: Determine whether or not there is a significant relationship between x and y. If there is NOT a significant relationship, then we CANNOT use Regression. If there IS a significant relationship between x and y, then we can proceed to use Excel to do a Regression Analysis and find the Regression Equation. (Note: assume an alpha of 0.05 unless specifically told otherwise.)
Regression Example (Hypothesis Test)
• Given α = 0.05 and n = 8 (df = n − 2 = 8 − 2 = 6)
• Test Statistic: t = r / √((1 − r²) / (n − 2))
• Critical Values: −2.447 and +2.447
• Conclusion: The Test Statistic is greater than the Critical Value, so we conclude that there IS a significant linear relationship between the variables, and we can now proceed with our Regression Analysis.
Regression Example: What is the best predicted size of a household that discards 0.50 lb of plastic? By specifying an Output Range of $A$11 we are telling Excel to use cell A11 as the upper-left corner of the Regression Analysis output (see next slide).
Regression Example: Answer this question using Regression: What is the best predicted size of a household that discards 0.50 lb of plastic? Before you get too excited about this output, let's cross off the info that we are not going to discuss or learn about in class. I'm only trying to give you an elementary exposure to Regression. The following slides will show you the things you need to understand, and each of those items will be explained.
Regression Example: Answer this question using Regression: What is the best predicted size of a household that discards 0.50 lb of plastic?
• Multiple R has no importance by itself; it is the square of this value, "R Square," that is important.
• "R Square" is also known as the "Coefficient of Determination" and represents the amount (percent) of the variance in y that is explained by x. You can sort of say that this value shows how accurate or effective our regression equation is. In this case, our regression equation is 71% accurate or effective in predicting y, given some value of x.
• These coefficients form the Regression Equation. In this case ŷ = 0.549 + 1.480x. Thus, the answer to our question is ŷ = 0.549 + 1.480(0.50) = 1.289 people (or 1, rounded to the nearest whole person, in a household that tosses 0.50 lbs. of plastic).
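The prediction on that slide is just one substitution into the regression equation; checking the arithmetic:

```python
# Coefficients read off the Excel output above: y-hat = 0.549 + 1.480x
b0, b1 = 0.549, 1.480
y_hat = b0 + b1 * 0.50   # household that discards 0.50 lb of plastic
# y_hat = 1.289, i.e. about 1 person after rounding
```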
Regression Example: Answer this question using Regression: What is the best predicted size of a household that discards 0.50 lb of plastic?
• Here we simply see that Excel determines the appropriate degrees of freedom; nothing for us to do here.
• "Adjusted R Square" is the same as "R Square" EXCEPT that it is adjusted for sample size. When the sample size (n) is small, this value will be quite a bit below the unadjusted R Square value. As the sample size increases, the difference between R Square and Adjusted R Square becomes negligible. Adjusted R Square is a more accurate representation of the accuracy or effectiveness of our regression equation.
• Standard Error is an estimate of how far off our y predictions will be (on average) using the regression equation.
• Observations is simply n.
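The slides describe the sample-size adjustment but never state a formula. The standard textbook formula (an assumption on my part, not taken from the slides) behaves exactly as described, pulling the value down when n is small:

```python
def adjusted_r2(r2, n, k=1):
    """Textbook Adjusted R Square (assumed formula, not from the slides):
    penalizes small samples and, later, extra predictor variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With the slide's R Square of 0.71 and only n = 8 observations (k = 1
# predictor), the adjustment noticeably lowers the value
adj = adjusted_r2(0.71, 8)
```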
Regression Example: Answer this question using Regression: What is the best predicted size of a household that discards 0.50 lb of plastic?
• You may remember the F test from the section on hypothesis testing. Excel does an F test to see if the regression equation is any better than simply using the average y as the predictor of y. A large F value generally suggests that the regression equation is effective, but we rely on the "Significance F" to tell for sure.
• Significance F shows the alpha level that separates the regression equation from being statistically significant or NOT. Thus, if this value is LESS THAN our alpha value (usually 0.05), then the regression equation is statistically significant (i.e., a low chance of error in predicting y). Conversely, if this value is greater than our alpha, then the regression equation is not statistically significant (i.e., a relatively high chance of error in predicting y). In this case, since Significance F is much less than 0.05, we conclude that the regression equation is statistically significant.
• SS requires some explanation; see the next couple of slides.
Regression Example: Answer this question using Regression: What is the best predicted size of a household that discards 0.50 lb of plastic?
[Illustration: a scatter plot with regression line ŷ = 3 + 2x and mean line ȳ = 9, marking at one data point the explained deviation (ŷ − ȳ), the unexplained deviation (y − ŷ), and the total deviation (y − ȳ).]
SS (Sum of Squares): NOT REQUIRED FOR YOU TO KNOW
• SS Regression (Explained Deviation): calculated by first finding the vertical distance between the regression line and the mean y value (ȳ) at each data point. Next, those individual differences are squared and then all added together (summed) to form the SS (sum of squares) Regression. In practical terms, in order for the regression equation to be valid, we should see a relatively high value here compared to SS Residual.
• SS Residual (Unexplained Deviation): calculated by first finding the vertical distance between each data point and the regression line. Since we hope our regression line explains the relationship between x and y, any difference between the line and a data point is "unexplained." We hope this value is relatively small compared to SS Regression.
• SS Total: simply SS Regression + SS Residual. It can also be calculated by finding the vertical distance between each data point and ȳ, squaring each value, and adding them all up.
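The three sums of squares described above can be computed straight from the data and a least-squares fit; a sketch (Python, with a hypothetical function name):

```python
def sum_of_squares(xs, ys):
    """Returns (SS Regression, SS Residual, SS Total) for the
    least-squares line through the paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * x for x in xs]
    ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained
    ss_tot = sum((y - y_bar) ** 2 for y in ys)               # total
    return ss_reg, ss_res, ss_tot
```

For any least-squares line, SS Regression + SS Residual = SS Total, and R Square equals SS Regression / SS Total.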
Multiple Regression
• Multiple Regression is a process for defining a linear relationship between a dependent variable y and two or more independent variables (x1, x2, x3, . . . , xk).
• The linear equation for a regression problem in which we have multiple x variables is as follows, where b0 is the y-intercept and all the other b's are coefficients associated with their respective x values:
ŷ = b0 + b1x1 + b2x2 + . . . + bkxk
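Outside Excel, the same multiple-regression coefficients come from a least-squares solve. A sketch using NumPy with made-up data (not the bear data; I build y from known coefficients so the fit has something to recover):

```python
import numpy as np

# Fit y-hat = b0 + b1*x1 + b2*x2 by least squares (hypothetical data)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 1.0 + 2.0 * x1 + 0.5 * x2                     # built from known coefficients
X = np.column_stack([np.ones_like(x1), x1, x2])   # leading column of 1s gives b0
b, *_ = np.linalg.lstsq(X, y, rcond=None)
# b recovers [b0, b1, b2] = [1.0, 2.0, 0.5] (up to floating-point error)
```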
Multiple Regression Guidelines
• More x variables is NOT necessarily better. Remember that R Square is a measure of how effective our regression equation is. Therefore, if adding an x variable does not appreciably increase the R Square value, then DON'T add it.
• Use those x variables (the fewest possible) that give you the biggest R Square (or Adjusted R Square) value. We want efficiency, so a few variables that provide a big R Square is best.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
Step 1: Construct a "Correlation Matrix" to see which x variables have the strongest linear relationships with the y variable (Weight). Use Tools → Data Analysis → Correlation in Excel to construct a correlation matrix. An Excel file containing this Bear data and Correlation Matrix is on the class website (mrbear.xls).
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
Step 1: Construct a "Correlation Matrix" to see which x variables have the strongest linear relationships with the y variable (Weight).
• Ideally, we want to pick those few x variables that have strong correlations (close to −1 or +1) with the y variable, BUT we also want the x variables to NOT be highly correlated with each other.
• The addition of an x variable that is strongly correlated with any x variable(s) already in a multiple regression equation WILL NOT do much to increase the R Square or Adjusted R Square value.
• On the other hand, adding an x variable that is strongly correlated with the y variable, but NOT with any x variables already in the regression equation, WILL increase our R Square value substantially.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
Choosing which x variables to include in a multiple regression problem is often a subjective decision. I suggest you try whichever variables you think will work best. For this example, I am going to choose Neck size as my first x variable to predict y (Weight), since it has an awesome correlation of 0.971 with Weight. To pick a second x variable I want one that is highly correlated with Weight, but NOT with Neck. Unfortunately, it looks like all the other x variables ARE highly correlated with Neck. In fact, if this were a real problem we would probably stick with simple regression and just use Neck to predict Weight. But, since we have to do a multiple regression problem, I think I'll pick Age as my second x variable. Age is not that highly correlated with Weight (0.814), but it is also the least highly correlated with Neck (0.906). Why don't you try the problem with other x variables and see if you can beat my Adjusted R Square value?
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
• For Multiple Regression we use the same Tools → Data Analysis → Regression function in Excel that we used for simple Regression.
• The "Y Range" is still simply the range of cells that contains the y variable. The "X Range" is the range of cells that contains ALL the x variables we want to include in the regression model. MAKE SURE the x variables are in adjacent columns (you cannot skip columns). Notice how I now have Neck and Age right next to each other.
• I like to include the column headings or labels, so I checked the "Labels" box. When you do this the output also includes the labels, making it a lot easier to interpret. Everything else is the same as when we did simple Regression.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears. Again, before we analyze the output, let's cross off the info that we are not going to discuss or learn about in class. The following slides will explain the things you need to understand.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
Just a note: I ran another regression with Neck size as the only x variable. You can see the Adjusted R Square is 0.9330, as compared to the Adjusted R Square of 0.9536 that we got when we included Age in addition to Neck. I don't think I would consider the slight increase in Adjusted R Square from including Age as really being worth the trouble of including another variable. There is no set guideline as to how much R Square should increase to justify adding another variable, but going from an already high R Square to a slightly higher value does not seem worthwhile, in my opinion.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
• Multiple R: same story as before.
• Almost the same story as in simple Regression: "R Square" or the "Coefficient of Determination" represents the amount (percent) of variance in y that is explained by the x's. In this case our regression equation is 97% accurate or effective in predicting y (Weight). Adjusted R Square simply adjusts the R Square value for the number of x variables and the sample size. Lots of variables and/or small sample sizes reduce the Adjusted R Square; a few good variables and a large sample size bring Adjusted R Square very close to R Square.
• These coefficients form the Regression Equation. In this case ŷ = −307.817 + 26.388(Neck) − 1.527(Age). Thus, if we found a bear with a Neck of 25" and an Age of 48 months, we could predict its Weight as ŷ = −307.817 + 26.388(25) − 1.527(48) = 278.6 lbs.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
• Same general comments as simple Regression. Note that MS is SS/df and the F statistic is calculated as MS Regression / MS Residual (just in case you are curious).
• Like simple Regression, if the Significance F is less than alpha, then our regression equation is statistically significant.
• See the next slide for an explanation of the other important values.
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears.
• Excel calculates a t test statistic for each of the x variables (ignore the t Stat for Intercept). The next column (P-value) is the important one because it provides the result of comparing this test statistic to the critical value. The critical values are not shown, but we don't need to see them because the P-value effectively tells us whether the test statistic is inside or outside the critical value.
• The P-value tells us whether the x variable is statistically significant in the regression equation. If the P-value is less than alpha (usually 0.05), then that x variable is a significant contributor to the regression equation. If the P-value is greater than alpha, then that x variable does not contribute significantly to the regression equation. In this case, Neck has a P-value much less than 0.05, so we see that Neck size contributes significantly to our regression equation to predict a Bear's weight. On the other hand, Age has a P-value greater than 0.05, so we see that adding the Age x variable into the equation was not a good idea, since Age does not significantly help us predict a Bear's weight in our regression equation.
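The P-value rule above amounts to a simple filter on the coefficient table. A sketch with hypothetical P-values (the slides give no exact numbers, only that Neck's P-value is far below 0.05 and Age's is above it):

```python
# Hypothetical P-values for illustration only, NOT the actual Excel output
alpha = 0.05
p_values = {"Intercept": 0.01, "Neck": 0.0001, "Age": 0.12}
keep = [name for name, p in p_values.items()
        if name != "Intercept" and p < alpha]   # ignore the Intercept's t/P
# keep lists only the x variables that contribute significantly
```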
Modeling and Exploring Data
• The next few slides show some of the various linear and nonlinear relationships that might be encountered.
• Linear relationships are the only "easy" ones to study, but other types of relationships obviously exist.
• These other types of relationships are beyond the scope of this course; I just wanted to show you what some of them might look like.
• Also, the following slide shows a little Excel movie on how to use Excel to help fit certain types of mathematical models to data: just enough info for you to be dangerous.
Using Excel to Model Data