2. REGRESSION / CORRELATION
Object:
To measure the degree of association between
variables and/or to predict the value of one
variable from the knowledge of the values of
(an)other variable(s).
Relationships:
(1) Functional
(2) Statistical
3. Functional Relationship:
Y=f(X),
an exact relationship-- no “error”.
e.g., Y = -25 + .10X
where Y = $ savings and X = $ spent at B&N during the year
(after joining the Barnes & Noble book club)
4. Statistical Relationship:
(true only “on the average”)
[Figure: Y = PRODUCTION vs. X = LABOR HOURS: a linear relationship]
[Figure: Y = PHYSICAL ABILITY vs. X = AGE: a non-linear relationship (upside-down U-shape)]
5. Consider the following data, which represent the
sales of a product (adjusted for trend) over the
last 8 sales periods:
Y = sales (millions)
116  109  117  112  122  113  108  115
(the last 8 sales periods; Ȳ = 114 is the average of the 8 sales amounts)
What would (should) one predict for the next sales
period? Probably, one would be hard pressed, in this
case, to justify choosing other than Ȳ = 114. How good
will this prediction be?
7. But-- we can get an idea by looking at how well we would
have done, had we been using this 114 all along:
Prediction error / residual = (Y - Ȳ)

  Y     Ȳ    (Y-Ȳ)   (Y-Ȳ)²
 116   114     2        4
 109   114    -5       25
 117   114     3        9
 112   114    -2        4
 122   114     8       64
 113   114    -1        1
 108   114    -6       36
 115   114     1        1
             ----    -----
               0      144   = TSS (Total Sum of Squares)

So TSS = Σ (Yj - Ȳ)² = 144, summed over j = 1, …, n.
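The table's arithmetic is easy to verify with a few lines of Python (a sketch added for illustration, not part of the original slides):

```python
# Verify the slide's arithmetic: the mean of the 8 sales values
# and the Total Sum of Squares around that mean.
Y = [116, 109, 117, 112, 122, 113, 108, 115]

y_bar = sum(Y) / len(Y)                     # mean of the 8 sales amounts
TSS = sum((y - y_bar) ** 2 for y in Y)      # total sum of squares
print(y_bar, TSS)                           # 114.0 144.0
```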
8. Two ways to look at the “TSS”:
1) A measure of the “mis-prediction”
(prediction error) using Ȳ as predictor.
2) A measure of the “Total Variability in the
System” (the amount by which the 8 data
values aren’t all the same).
When the TSS is larger/when the data vary
more, you have more reason to investigate
what is driving that variability.
9. Consider using X, advertising, to “help” predict Y:
  Y    X
 116   2
 109   1
 117   3
 112   1
 122   4
 113   2
 108   1
 115   2

Ȳ = 114     X̄ = 2

[Scatter Diagram: Y (105 to 125) plotted against X (0 to 4); the points follow an upward linear trend]
10. Consider a Linear or Straight Line Statistical
relationship between the two variables, and then
consider finding the “best fitting line” to the data.
Call this line:
Yc = a+bX
Yc = “Computed Y” or “Predicted Y”
Y is called the Dependent Variable
X is called the Independent Variable
11. What do we mean by “best fitting”?
Answer:
The “Least Squares” line, i.e., the line which
minimizes the sum of the squares of the
vertical distances between the “dots”, Y, and the “line”, Yc.
Hence, the MATH problem is to minimize

Σ (Yj - Ycj)², summed over j = 1, …, n.

[Figure: a data point Y₁ = 7 and its fitted value Yc₁ = 5 at X₁; the residual is the vertical gap between them]
12. To find this Least Squares line, we theoretically
need calculus.
However, as a practical matter, every text gives
the answer, and, more importantly, we will get
the result using Excel, or SPSS, or other
software - NOT “BY HAND.”
(There is an arithmetic formula for “b” and “a”
in terms of the sum of the X’s, the sum of the
Y’s, the sum of the X•Y’s, etc., but with software
available, we never use it.)
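Although the slides (rightly) say to let software do this, the arithmetic formula for b and a alluded to above is short; here is a sketch in Python, using the advertising data from this example:

```python
# Least-squares slope and intercept from the standard formulas:
#   b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²,   a = ȳ - b·x̄
X = [2, 1, 3, 1, 4, 2, 1, 2]                  # advertising
Y = [116, 109, 117, 112, 122, 113, 108, 115]  # sales

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
    sum((x - x_bar) ** 2 for x in X)
a = y_bar - b * x_bar
print(f"Yc = {a} + {b}X")                   # Yc = 106.0 + 4.0X
```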
23. So, using X in the best way, we have a prediction line
of Yc=106+4X. How good are the predictions we’ll get
using this line? Suppose we had been using it:
(At X = 2, for example, Yc = 106 + 4(2) = 114.)

  Y    X    Yc    (Y-Ȳ)   (Y-Ȳ)²   (Y-Yc)   (Y-Yc)²
 116   2   114      2        4        2         4
 109   1   110     -5       25       -1         1
 117   3   118      3        9       -1         1
 112   1   110     -2        4        2         4
 122   4   122      8       64        0         0
 113   2   114     -1        1       -1         1
 108   1   110     -6       36       -2         4
 115   2   114      1        1        1         1
                  ----    -----    -----     -----
                    0      144        0        16
                           TSS                SSE
24. So, SSE = Σ(Y-Yc)² = 16.
SSE = Sum of Squares “due to error”
That is, we use X in the best way possible, and
still do not get perfect prediction. The amount
of “mis-prediction” still remaining, measured
by sum of squares, is 16. This must be due to
factors other than advertising (X). (Perhaps:
size of sales force, number of retail outlets,
strategy of competition, interest rates, etc.)
25. We call all these other factors “ERROR”. That
is, “error” is the collective name of all variables
(factors) not used in making the prediction.
SSE is also called “SUM OF SQUARED
RESIDUALS” or “RESIDUAL SUM OF
SQUARES”.
26. We have TSS = 144 and SSE = 16.
TSS - SSE = 128
What happened to the other 128? We call this
“SSA”: (“SSR” in text)
SSA = TSS - SSE = 128
SSA = Sum of squares “due to X” or “Attributed
to X”.
27. So, TSS = SSA + SSE

Total Variability = Variability Attributed to X + Variability due to ERROR
28. We have

r² = SSA / TSS = 128 / 144 = .89

r² is called the “Coefficient of Determination”,
and is interpreted as the “proportion of
variability in Y explained by X” or “... explained
by the relationship between Y and X expressed
in the regression line”.
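The decomposition TSS = SSA + SSE and the resulting r² can be checked numerically (a sketch assuming the fitted line Yc = 106 + 4X from the slides):

```python
X = [2, 1, 3, 1, 4, 2, 1, 2]
Y = [116, 109, 117, 112, 122, 113, 108, 115]

y_bar = sum(Y) / len(Y)
Yc = [106 + 4 * x for x in X]                     # predictions from the LS line
TSS = sum((y - y_bar) ** 2 for y in Y)            # 144: total variability
SSE = sum((y - yc) ** 2 for y, yc in zip(Y, Yc))  # 16: due to error
SSA = TSS - SSE                                   # 128: attributed to X
r_squared = SSA / TSS                             # ≈ .89
```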
29. Of course, 1 - r² = SSE / TSS = 16 / 144 = .11

is interpreted as the proportion of variability
in Y unexplained by X (and still present).

0 ≤ r² ≤ 1,   where r² = SSA / TSS.

Define r = SQRT(r²).
r = correlation (coefficient).
Here r = SQRT(.89) = .943
30. But, r can be + or - !!
SQRT(.89) = +.943 or -.943.
It takes on the sign of b in Yc = a+bX.
-1 ≤ r ≤ 1
A value of r near 1 or -1 is suggestive of a
strong linear relationship between Y and X.
A value of r near 0 is suggestive of no
linear relationship between Y and X.
31. Note that the sign of r indicates the
direction of the relationship (if any). A “+”
indicates that Y and X move in the same
direction; a “-” indicates that they move in
opposite directions. Some people refer to
a positive r as a “positive relationship” and
a negative r as an “inverse relationship”.
32. [Figure: six scatter plots of Y vs. X, illustrating r = +1, r = -1, r = +.8, r = -.65, and two patterns with r = 0]
33. Note that a high r² does not necessarily mean
CAUSE/EFFECT.
Frequently we have “spurious correlations” – two
variables which are highly related in terms of r²,
but only because they are both “driven” by a third
variable.
“Classic” example:
Number of TEACHERS vs. Number of quarts of LIQUOR SOLD
(both driven by the size of the population)
36. THE MODEL
In order to get a measure of prediction
error (e.g., confidence intervals,
hypothesis testing), we must make
some assumptions about the
distribution of points scattered about
the regression line. These
assumptions are usually couched in
what is called a “statistical model.”
37. We specify

µY•X = A + BX

where µY•X is the mean or average
value of Y for a given X. We have a
(true) slope of B and (true) intercept of
A; A and B are parameters, the exact
values of which we’ll never know.
38. This says that if we set X = 1 (for example)
and sample an infinite number of Y’s (hence
finding µY•1) and then set X = 2 and find µY•2, X
= 3 and find µY•3, etc., all the µY•X fall exactly
on a straight line.
[Figure: the (TRUE) average Y, µY•X, plotted against X as a straight line]
39. But, we never find µY•X. For a given X, we
observe a value of Y which differs from µY•X in
the same way that when we observe any random
variable value, it does not equal “µ,” but is some
point governed by some probability law.
[Figure: the density f(Y) of Y for a given X, centered at µY•X]
40. The way we write this in a formal way
is:
Y = µY•X + ε = A + BX + ε

where ε is the difference between an
individual Y and the mean Y, all given a
specific X.
ε is basically the impact of having a nonzero σ.
41. Example:
Suppose that Y = weight, X = height,
and µY•X=70” = 160 lbs.
Then a person 70” tall with weight of 168 pounds has
a “personal ε” of 8 lbs. If his/her weight were 158
lbs., his/her personal ε would be -2 pounds.
Of course, since ε = Y - µY•X, and we don’t know µY•X,
we don’t really know anybody’s personal ε.
42. We find the LS line,
Yc = a + bx
a → estimates → A
b → estimates → B
Yc → estimates → µY•X , and Y itself.
43. We usually make the following assumptions,
which are called
“the standard assumptions.”
1) NORMALITY
2) HOMOSCEDASTICITY
3) INDEPENDENCE
44. Assumption 1:
Given a value of X, the probability distribution
of Y is normal.
(e.g., with Y = weight and X = height, for any
given height (say 70”) the Y’s are normal
around µY•X=70” (say, 160 lbs.))
[Figure: a normal curve of Y values centered at 160]
45. Assumption 2:
The standard deviation of ε, σε (which we
don’t know), usually called σY•X, is
constant for all values of X. The
characteristic of having σY•X constant is
referred to as “Homoscedasticity.”
48. Combining assumptions 1 & 2, we have
the Y’s being normally distributed with µY•X
as mean (and correspondingly, average
error of 0) and constant standard dev. σY•X.
Of course, as you know, neither µY•X nor
σY•X is known.
µY•X is estimated by Yc = a + bX
σY•X is estimated by “Sy•x”
Sy•x is called the “Standard Error of Estimate,”

Sy•x = SQRT( SSE / (n-2) )
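In this example the standard error of estimate can be computed directly (a sketch; the slides take it from the software output):

```python
import math

SSE, n = 16.0, 8
s_yx = math.sqrt(SSE / (n - 2))   # standard error of estimate
print(round(s_yx, 2))             # 1.63
```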
49. The SSE in the numerator makes intuitive
sense, in that SSE is the variability due to error.
(instead of [n-1], the denominator of S in
most previous applications) is really a
degrees of freedom number. The df = n
minus a degree of freedom for each
parameter estimated from the data. Here,
there are 2 such parameters, A and B
(estimated by a and b, respectively).
50. Later, when we have a model of
Y = A + B1X1 + B2X2 + ε,
the df will be [n-3].
We usually get Sy•x from the Computer output.
Here, Sy•x = 1.63 (See output on next page).
53. Assumption 3:
The Y values are independent of one another. (This
is often a problem when the data form a time series).
In the real world these assumptions may never be
exactly true, but are often close enough to true to
make our statistical analysis (which follows) valid.
Investigation has shown that moderate departures
from assumptions 1 and 2 do not appreciably affect
results (i.e., assumptions 1 and 2 are “Robust”). In
terms of large departures –– there are ways to
recognize them and do the appropriate (but more
complex) analysis.
57. Of greater interest (usually) is a confidence interval for
the prediction of the next “period.” This is done by:

Yc ± t1-α • Sy•x     [(n-2) df]

Recall: Yc = 106 + 4X.
This formula is an excellent approximation when n is
“large,” (virtually always in MK) and the value of X
at which we are predicting isn’t dramatically far
from the center [X̄] of our data.
For 95% confidence and X = 3, we have:
118 ± 2.447(1.63) or 118 ± 3.99
(where 2.447 = TINV(.05, 6))
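The critical value 2.447 is Excel’s TINV(.05, 6); the same interval can be sketched in Python (using scipy.stats, which is my choice of tool, not the course’s):

```python
import math
from scipy import stats

n = 8
s_yx = math.sqrt(16.0 / (n - 2))            # standard error of estimate
yc = 106 + 4 * 3                            # prediction at X = 3
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)   # two-tailed 95%, 6 df, ≈ 2.447
half = t_crit * s_yx                        # ≈ 3.99
print(yc - half, yc + half)
```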
59. Hypothesis Testing
To test:
H0: B=0
H1: B≠0
(in the model Y = A + BX + ε; note that B=0 is
the same as saying X & Y are NOT RELATED)
we compute
tcalc = (b - BH0) / sb = b / sb
and accept H0 if |tcalc| < t1-α with (n-2) df,
reject H0 if |tcalc| > t1-α with (n-2) df.
60. In our problem, tcalc = 6.93 (see output on
next page).
If α=.05, we have t.95 (6 df) = 2.447, and we reject H0.
(All we really need to do is to examine the p-value.)
We’ll refer to this as the “t-test.”
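The tcalc = 6.93 on the output can be reproduced by hand (a sketch; the formula for sb below is the standard simple-regression one, which the slides leave to the software):

```python
import math

X = [2, 1, 3, 1, 4, 2, 1, 2]
x_bar = sum(X) / len(X)
Sxx = sum((x - x_bar) ** 2 for x in X)      # Σ(x - x̄)² = 8
s_yx = math.sqrt(16.0 / (len(X) - 2))       # standard error of estimate
s_b = s_yx / math.sqrt(Sxx)                 # standard error of b
t_calc = (4.0 - 0.0) / s_b                  # b = 4, B under H0 is 0
print(round(t_calc, 2))                     # 6.93
```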
63. To test
H0: all B’s = 0
H1: not all B’s = 0,
we have a different procedure.
Here, where µY•X = A + BX,
there’s only one B, and thus the H’s
above are the same as the previous
H0: B=0
H1: B≠0
64. However, for the future, where
µY•X = A + B1X1 + B2X2, and “all B’s=0”
means B1 = B2 = 0, and there is a difference
between “B=0” and “all B’s=0,” we introduce:
H0: all B’s = 0
H1: not all B’s = 0
65. To test the above, we determine Fcalc.
We get Fcalc from the output!!! Yeah!!!!
66. And we accept H0 if Fcalc < F1-α with (1, n-2) df,
reject H0 if Fcalc > F1-α with (1, n-2) df,
where F1-α is the appropriate value
from the F table.
More easily: examine the p-value of the F-test (next page).
[Figure: F distribution with α = 0.05 in the upper tail; critical value F.95(1, 6) = 5.99]
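For this example, Fcalc and the 5.99 critical value can be checked (a sketch using scipy.stats; note that with a single X, Fcalc = tcalc²):

```python
from scipy import stats

SSA, SSE, n = 128.0, 16.0, 8
F_calc = SSA / (SSE / (n - 2))        # ratio of explained to error mean square
F_crit = stats.f.ppf(0.95, 1, n - 2)  # ≈ 5.99
print(F_calc, F_calc > F_crit)        # 48.0 True, so reject H0
```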
69. MULTIPLE REGRESSION
When there is more than one
independent variable (X), we call our
regression analysis by the term
“Multiple Regression.” With a single
independent variable, we call it “Simple
Regression.”
70. µY•X = A + B1X1 + B2X2 + • • • + Bk-1Xk-1
Y = µY•X + ε
Least Squares hyperplane (“line”):
Yc = a + b1X1 + b2X2 + • • • + bk-1Xk-1
NOTE:
k-1 = Number of X’s
k = Number of parameters
71. Example:
Y  = Job Performance
X1 = Score on (entrance) Test 1
X2 = Score on Test 2
X3 = Score on Test 3
or
Y  = Sales
X1 = Advertising
X2 = Number of sales people
X3 = Number of competitors
We assume that Computer software gives
us all (or nearly all) the numerical results.
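A minimal sketch of fitting such a multiple regression; the numbers below are invented for illustration only (the slides assume the software provides all of this):

```python
import numpy as np

# Hypothetical data (made up, not from the slides):
Y  = np.array([10., 12., 16., 19., 20., 25.])   # sales
X1 = np.array([ 1.,  2.,  3.,  4.,  5.,  6.])   # advertising
X2 = np.array([ 4.,  3.,  5.,  6.,  5.,  7.])   # number of sales people

# Least-squares hyperplane Yc = a + b1*X1 + b2*X2 via the design matrix:
A = np.column_stack([np.ones_like(X1), X1, X2])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = coef
Yc = A @ coef                                   # fitted values
SSE = float(np.sum((Y - Yc) ** 2))
TSS = float(np.sum((Y - Y.mean()) ** 2))
```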
72. Typically, we wish to perform two types of
Hypothesis Tests:
First: F – test
(Y = A+B1X1 + ••• + Bk-1 Xk-1+ ε)
H0 : B1 = B2 = B3 = . . . = Bk-1 = 0
H1 : not all B’s = 0
73. H0 : B1 = B2 = B3 = . . . = Bk-1 = 0
H1 : not all B’s = 0
In “English”:
H0: The X’s
collectively do not
help us predict Y.
H1: At least one of
the darn X’s helps us
predict Y!
We call this, reasonably so, a
“TEST OF THE OVERALL MODEL”
74. If we accept H0 that the X’s collectively do
not help us predict Y, we probably
discontinue formal statistical analysis.
However, if we reject H0 (i.e., the “F is
significant”), then we are likely to want a
series of t-tests:
H0: B1 = 0     H0: B2 = 0     •••     H0: Bk-1 = 0
H1: B1 ≠ 0     H1: B2 ≠ 0     •••     H1: Bk-1 ≠ 0
75. These are called “Tests for individual X’s.”
The test is answering: (using B1 as an example)
H0 : Variable X1 is NOT helping us predict
Y, above and beyond the other
variables in the model.
H1 : X1 IS INDEED helping us predict Y,
above and beyond the other variables
in the model.
76. So, note:
We’re answering whether a variable gives us
INCREMENTAL value.
Sometimes a result looks “strange”. With
Y = weight, X1 = height, X2 = pant length:
F-test: SIGNIFICANT
t1: NOT SIGNIFICANT
t2: NOT SIGNIFICANT
77. If I know a person’s X1, height, do I get
additional predictive value about Y, weight,
from knowing X2, pant length?
No - hence, we accept H0: B2 = 0
(t2 not sign.)
78. If I know X2, pant length, do I get
additional predictive value about Y
from knowing height?
(Also) No - hence we accept
Ho: B1= 0
(t1 not sign.)
79. When the X’s themselves are highly
interrelated (the fact that leads to
the strange looking - but not really
strange result), we call this
MULTICOLLINEARITY.
80. Another “look” at this issue:
Y vs. X1:     R² = .5
Y vs. X2:     R² = .4
Y vs. X1, X2: R² = ?
Ans: between .5 and .9
(In some unusual, “strange” cases,
R² may exceed .9)
If X1 and X2 are not overlapping in the
information provided, R² = .9; if X2 tells
us a total subset of what X1 tells us,
R² = .5.
81. If you have
Y vs. X1:     R² = .70
Y vs. X2:     R² = .72
Y vs. X1, X2: R² = .73,
1) The F test is significant because the X’s together
tell us (an estimate of) 73% of what’s going on with Y.
2) t1 (likely) not sign., because the gain of .01 (.73 - .72
[with only X2]) is judged by the t-test as too easily
due to the “luck of the draw”. (Actually, it depends
on the sample size.)
3) t2, similarly.
88. To test: H0: B1 = B2 = B3 = 0 vs.
H1: not all B’s = 0, at α = .05:
From the output, Fcalc = 47.598.
Since p-value = .000000001528 < .05,
we reject H0.
90. For X1 and X3 we reject H0; for X2 we accept H0.
Conclusion in Practical Terms?
X1 (Test 1) and X3 (Test 3) each gives us
incremental predictive value about
PERFORMANCE, Y.
X2 (Test 2) is either irrelevant or redundant.
91. An added benefit of the analysis was to indicate
how the tests should be weighted: The best fit
occurs if the tests are weighted
(1.02, .137, .87)
(assuming we retain Test 2).
This is equivalent to weights of
(1.02/2.027, .137/2.027, .87/2.027)
or (.50, .07, .43).
The present weights were (1/3, 1/3, 1/3).
92. “PROBLEM IN NOTES”
Consider the following model: Y = A+B1•X1+B2•X2+B3•X3+ε
Y = Sales Revenue (in units of $100,000)
X1= Expenditure on TV advertising (in units of $10,000)
X2= Expenditure on Web advertising (in units of $10,000)
X3= Expenditure on Newspaper advertising (in units of $10,000)
Refer to the computer output following the questions.
1. What is the least squares line (hyperplane)?
2. What revenue do I expect (in dollars) with no advertising
in any of the three media?
3. If $10,000 more were allocated to advertising, which
medium should receive it to generate the most additional
revenue?
93. 4) What percent of the variability in revenue is due to factors
other than the expenditures in the three advertising media?
5) If management decided to spend the same amount of money
on each of the three types of media, how much total money
would have to be spent to generate an expected revenue of
$40,000,000?
6) Test H0: B1 = B2 = B3 = 0 vs. H1: not all B’s = 0, at α = .05.
What is your conclusion in practical terms?
7) For each variable, test H0: B = 0 vs. H1: B ≠ 0, at α = .05.
What are your conclusions in practical terms?
96. We Get Yc = a + b1X1+ b2X2
For any given X1, income, we predict Y as
follows:
Male:
Yc = a + b1X1 +b2(1) = a + b1X1 + b2
Female:
Yc = a + b1X1 +b2(0) = a + b1X1 + 0
How is b2 to be interpreted?
97. Ans: The (estimated) amount spent by a
Male, above that which would be spent by a
Female, given the same X1 value (income).
(Of course, if b2 is negative, it says that we
estimate that Females spend more than
Males, at equal incomes.)
If we had defined
X2 = 1 for F’s
X2 = 0 for M’s ,
then b2 would reverse sign, and have the
opposite meaning.
98. Remember that a variable is a “dummy” variable
because of definition and interpretation. The
computer treats a variable whose values are 0 and 1,
just like any other variable.
Our data are, perhaps,

  Y    X1   X2
 20    50    1
 18    40    1
 33    65    0
 24    49    0
 21    62    1
  •     •    •
  •     •    •
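To underline the point that the computer treats the 0/1 column like any other, here is a sketch fitting the five rows shown (treating them, for illustration only, as a complete toy dataset):

```python
import numpy as np

Y  = np.array([20., 18., 33., 24., 21.])   # amount spent
X1 = np.array([50., 40., 65., 49., 62.])   # income
X2 = np.array([ 1.,  1.,  0.,  0.,  1.])   # dummy: 1 = Male, 0 = Female

# The dummy column enters the design matrix just like any other column:
A = np.column_stack([np.ones_like(X1), X1, X2])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = coef

income = 50.0
male_pred   = a + b1 * income + b2         # X2 = 1
female_pred = a + b1 * income              # X2 = 0
# At equal incomes, the two predictions differ by exactly b2.
```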
99. Note that we have 2 categories, (M, F), but only one dummy variable.
This is necessary; in the general situation of C categories, we use (C-1)
dummy variables.
This is because of computational issues involved in matrix inversion.
100. Example
Yc = a + b1X1 + b2X2 + b3X3 + b4X4 + b5X5
Y  = Water Usage
X1 = Temp.
X2 = Amount Produced
X3 = # People Employed

           X4   X5
Plant 1     1    0
Plant 2     0    1
Plant 3     0    0
101. Let a + b1X1 + b2X2 + b3X3 = G
Then we predict: (for a given X1, X2 , X3)
FOR PLANT 1:
G + b4(1) + b5(0) = G + b4
FOR PLANT 2:
G + b4(0) + b5(1) = G + b5
FOR PLANT 3:
G + b4(0) + b5(0) = G
How do we interpret b4? b5?
105. Step 3: Internal: Y/X2, X4, X1 → R² = .77
                Y/X2, X4, X3 → R² = .73
      External: Y/X2, X4, X1 → R² = .77
NOTE: If at any stage, the best variable
to enter is not significant by the t-test,
the ALGORITHM STOPS (and does not
bring that variable in!!!). You select a
p-value (pin), and if the p-value of the
entering variable > pin (i.e., the variable
is not significant), the variable does not
enter and the algorithm stops.
106. Also - there’s a step 3b (and 4b, 5b, etc.)
Step 3b) Now that we’ve entered our third
variable, the software goes back
and re-examines previously
entered variables to see if any
should be DELETED (specify a
“p to go out”, pout, so that if the
p-value of a variable in our model
> pout, the variable is deleted).
Algorithm continues until it stops!!!!
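The forward half of the algorithm (enter the best candidate while its p-value < pin, stop otherwise) can be sketched as follows; this is my own compact illustration using scipy for the partial-F p-values, and it omits the step-3b deletion (pout) pass:

```python
import numpy as np
from scipy import stats

def fit_ols(A, y):
    """Least-squares fit; returns coefficients and residual sum of squares."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return coef, float(resid @ resid)

def forward_stepwise(X, y, p_in=0.05):
    """Greedy forward selection: at each step, bring in the candidate
    column whose partial F test has the smallest p-value, and stop as
    soon as the best entering variable is not significant at p_in."""
    n, m = X.shape
    ones = np.ones((n, 1))
    chosen = []
    _, sse_cur = fit_ols(ones, y)           # start from the mean-only model
    while True:
        best = None
        for j in range(m):
            if j in chosen:
                continue
            A = np.hstack([ones, X[:, chosen + [j]]])
            _, sse_new = fit_ols(A, y)
            df = n - (len(chosen) + 2)      # n minus number of parameters
            if df <= 0 or sse_new <= 0:
                continue
            F = (sse_cur - sse_new) / (sse_new / df)
            p = stats.f.sf(F, 1, df)        # p-value of the entering variable
            if best is None or p < best[1]:
                best = (j, p, sse_new)
        if best is None or best[1] > p_in:
            break                           # entering variable not significant
        chosen.append(best[0])
        sse_cur = best[2]
    return chosen

# Demo on the advertising data from the earlier slides, plus a useless
# constant column (collinear with the intercept, so it can never help):
X_demo = np.column_stack([
    np.array([2., 1., 3., 1., 4., 2., 1., 2.]),   # advertising
    np.full(8, 5.0),                              # no-information column
])
y_demo = np.array([116., 109., 117., 112., 122., 113., 108., 115.])
entered = forward_stepwise(X_demo, y_demo)        # only advertising enters
```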
112. Detailed Summary of Stepwise Analysis

Step   Entering Variable     LS Line                                    R²
1      UNDERGRAD GPA (X1)    Yc = .85 + .73X1                           .609
2      QUANT GMAT (X2)       Yc = .585 + .53X1 + .00165X2               .833
3      COLLEGE SEL. (X4)     Yc = 1.197 + .309X1 + .00163X2 + .284X4    .915
STOP!
If we bring in Verbal GMAT, R² = .919
113. PRACTICE PROBLEM
Y = COMPANY ABC’s SALES ($millions)
X1 = OVERALL INDUSTRY SALES ($billions)
X2 = COMPANY ABC’s ADVERTISING ($millions)
X3 = SPECIAL PROMOTION BY CHIEF COMPETITOR: 0 = YES, 1 = NO
A STEPWISE REGRESSION WAS RUN WITH THESE RESULTS:
STEP 1: VARIABLE ENTERING: X1, Yc = 205+16•X1, R² = .48
STEP 2: VARIABLE ENTERING: X2, Yc = 183+11•X1+10•X2, R² = .64
STEP 3: VARIABLE ENTERING: X3, Yc = 180+10•X1+8•X2+65•X3, R² = .68
A) If ABC’s advertising is to be the same next year as this year (i.e., X2 held constant), and we do
not know (in advance) the value of X3, what would we predict to be the increase in ABC’s
sales if overall industry sales (X1) increase by $1 billion?
a) 10     b) 11     c) 16
B) Based on the given information, we can conclude that the R² between Y and X2 (the exact
value of which we cannot determine from the given information) is between:
a) .16 and .48     b) .16 and .64     c) .48 and .64     d) none of these
C) Answer part B) if the regression results above were NOT part of a stepwise procedure, but
simply a set of multiple regression results.
Editor's Notes
Regression is used more than any other technique - it’s the mode.
Measure the relationship between variables (like how often you go to Wendy’s vs. how many children you have), sometimes (not always) to predict the value of one variable based on the values of other variables.
Functional Relationships: an algebraic relationship
Statistical Relationships:
Functional Relationship: Algebraic relationship (an exact relationship)
Meaning there is no error
Two people that spend the same X will save the same Y. It is an exact relationship.
Statistical relationship: a relationship that is true only on the average
It is not exact.
Example: Labor hours vs. production. There is an upward tendency, so you make a trend line. It is not an EXACT relationship. Production won’t be EXACTLY what is revealed by the trend line. This relationship is a linear relationship, a straight line.
If it is not a straight line, it is non-linear: age vs. physical ability, or time vs. knowledge of a product, follows the upside-down U-shape curve.
Instead of basing it on Ȳ, base it on advertising (X).
When advertising is at its highest, Y is at its highest (117 & 122). When Y is at its lowest, so is X. There is a link.
Sometimes it’s better to use this method of proof/demonstration: the higher the advertising, the higher the sales.
Scatter Diagram When you graph X and Y data points, it is called a Scatter diagram. The scatter diagram has a trend (linear trend)
So its good to find the best fitting line. (next slide)
Yc is the best fitting line. b is the change in Y if you increase X by 1 (the change in Y per unit change in X).
In this context, Y is called the dependent variable (the output variable). X is called the independent variable (the input variable).
Best fitting is defined as the Least Squares line. Line which minimizes the sum of the squared differences (prediction errors). Minimize the sum of the prediction errors squared.
To find Least Squares line, use excel or SPSS. Never do a regression by hand.
Dependent (Y)
Independent (X)
SSE: we use X in the best way we can, but we still don’t get perfect prediction. The amount of misprediction still there is the SSE. It is due to factors or variables OTHER than advertising.
Error: factors or variables affecting prediction that aren’t what you’re measuring.
Error is the collective name of all variables not used in making the prediction - all the other variables.
What happened to the 128? We reduced the prediction error dramatically - by 128 - by using X to help us predict. We call the 128 SSA.
SSA: Sums of Squares attributed to X.
We gained that 128 by using X to help us predict.
The total variability in sales = variability due to X + variability due to ERROR.
Why are sales not always the same? Because advertising (X) is not always the same, and because of the variability of errors.