2. REGRESSION / CORRELATION
Object:
To measure the degree of association between
variables and/or to predict the value of one
variable from the knowledge of the values of
(an)other variable(s).
Relationships:
(1) Functional
(2) Statistical
3. Functional Relationship:
Y=f(X),
an exact relationship-- no “error”.
e.g., Y = -25 + .10X
where Y = $ savings and X = $ spent at B&N during the year
(after joining the Barnes & Noble book club)
4. Statistical Relationship:
(true only “on the average”)
[Figure: Y = PRODUCTION vs. X = LABOR HOURS: a linear relationship]
[Figure: Y = PHYSICAL ABILITY vs. X = AGE: a non-linear relationship (upside-down U-shape)]
5. Consider the following data, which represent the
sales of a product (adjusted for trend) over the
last 8 sales periods:
Y = sales (millions)
116  109  117  112  122  113  108  115
(the last 8 sales periods; Ȳ = 114 is the average of the 8 sales amounts)
What would (should) one predict for the next sales
period? Probably, one would be hard pressed, in this
case, to justify choosing other than Ȳ = 114. How good
will this prediction be?
7. But-- we can get an idea by looking at how well we would
have done, had we been using this 114 all along:
Prediction error / residual = (Y - Ȳ)

  Y     Ȳ    (Y-Ȳ)   (Y-Ȳ)²
 116   114     2        4
 109   114    -5       25
 117   114     3        9
 112   114    -2        4
 122   114     8       64
 113   114    -1        1
 108   114    -6       36
 115   114     1        1
             ----    -----
               0      144   = TSS (Total Sum of Squares)

So TSS = Σ (Yj - Ȳ)² = 144, summed over j = 1, …, n.
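The table's arithmetic is easy to verify with a few lines of Python (a sketch added for illustration, not part of the original slides):

```python
# Verify the slide's arithmetic: the mean of the 8 sales values
# and the Total Sum of Squares around that mean.
Y = [116, 109, 117, 112, 122, 113, 108, 115]

y_bar = sum(Y) / len(Y)                     # mean of the 8 sales amounts
TSS = sum((y - y_bar) ** 2 for y in Y)      # total sum of squares
print(y_bar, TSS)                           # 114.0 144.0
```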
8. Two ways to look at the “TSS”:
1) A measure of the “mis-prediction”
(prediction error) using Ȳ as predictor.
2) A measure of the “Total Variability in the
System” (the amount by which the 8 data
values aren’t all the same).
When the TSS is larger/when the data vary
more, you have more reason to investigate
what is driving that variability.
9. Consider using X, advertising, to “help” predict Y:
  Y    X
 116   2
 109   1
 117   3
 112   1
 122   4
 113   2
 108   1
 115   2

Ȳ = 114     X̄ = 2

[Scatter Diagram: Y (105 to 125) plotted against X (0 to 4); the points follow an upward linear trend]
10. Consider a Linear or Straight Line Statistical
relationship between the two variables, and then
consider finding the “best fitting line” to the data.
Call this line:
Yc = a+bX
Yc = “Computed Y” or “Predicted Y”
Y is called the Dependent Variable
X is called the Independent Variable
11. What do we mean by “best fitting”?
Answer:
The “Least Squares” line, i.e., the line which
minimizes the sum of the squares of the
vertical distances between the “dots”, Y, and the “line”, Yc.
Hence, the MATH problem is to minimize

Σ (Yj - Ycj)², summed over j = 1, …, n.

[Figure: a data point Y₁ = 7 and its fitted value Yc₁ = 5 at X₁; the residual is the vertical gap between them]
12. To find this Least Squares line, we theoretically
need calculus.
However, as a practical matter, every text gives
the answer, and, more importantly, we will get
the result using Excel, or SPSS, or other
software - NOT “BY HAND.”
(There is an arithmetic formula for “b” and “a”
in terms of the sum of the X’s, the sum of the
Y’s, the sum of the X•Y’s, etc., but with software
available, we never use it.)
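Although the slides (rightly) say to let software do this, the arithmetic formula for b and a alluded to above is short; here is a sketch in Python, using the advertising data from this example:

```python
# Least-squares slope and intercept from the standard formulas:
#   b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²,   a = ȳ - b·x̄
X = [2, 1, 3, 1, 4, 2, 1, 2]                  # advertising
Y = [116, 109, 117, 112, 122, 113, 108, 115]  # sales

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
    sum((x - x_bar) ** 2 for x in X)
a = y_bar - b * x_bar
print(f"Yc = {a} + {b}X")                   # Yc = 106.0 + 4.0X
```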
23. So, using X in the best way, we have a prediction line
of Yc=106+4X. How good are the predictions we’ll get
using this line? Suppose we had been using it:
(At X = 2, for example, Yc = 106 + 4(2) = 114.)

  Y    X    Yc    (Y-Ȳ)   (Y-Ȳ)²   (Y-Yc)   (Y-Yc)²
 116   2   114      2        4        2         4
 109   1   110     -5       25       -1         1
 117   3   118      3        9       -1         1
 112   1   110     -2        4        2         4
 122   4   122      8       64        0         0
 113   2   114     -1        1       -1         1
 108   1   110     -6       36       -2         4
 115   2   114      1        1        1         1
                  ----    -----    -----     -----
                    0      144        0        16
                           TSS                SSE
24. So, SSE = Σ(Y-Yc)² = 16.
SSE = Sum of Squares “due to error”
That is, we use X in the best way possible, and
still do not get perfect prediction. The amount
of “mis-prediction” still remaining, measured
by sum of squares, is 16. This must be due to
factors other than advertising (X). (Perhaps:
size of sales force, number of retail outlets,
strategy of competition, interest rates, etc.)
25. We call all these other factors “ERROR”. That
is, “error” is the collective name of all variables
(factors) not used in making the prediction.
SSE is also called “SUM OF SQUARED
RESIDUALS” or “RESIDUAL SUM OF
SQUARES”.
26. We have TSS = 144 and SSE = 16.
TSS - SSE = 128
What happened to the other 128? We call this
“SSA”: (“SSR” in text)
SSA = TSS - SSE = 128
SSA = Sum of squares “due to X” or “Attributed
to X”.
27. So, TSS = SSA + SSE

Total Variability = Variability Attributed to X + Variability due to ERROR
28. We have

r² = SSA / TSS = 128 / 144 = .89

r² is called the “Coefficient of Determination”,
and is interpreted as the “proportion of
variability in Y explained by X” or “... explained
by the relationship between Y and X expressed
in the regression line”.
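The decomposition TSS = SSA + SSE and the resulting r² can be checked numerically (a sketch assuming the fitted line Yc = 106 + 4X from the slides):

```python
X = [2, 1, 3, 1, 4, 2, 1, 2]
Y = [116, 109, 117, 112, 122, 113, 108, 115]

y_bar = sum(Y) / len(Y)
Yc = [106 + 4 * x for x in X]                     # predictions from the LS line
TSS = sum((y - y_bar) ** 2 for y in Y)            # 144: total variability
SSE = sum((y - yc) ** 2 for y, yc in zip(Y, Yc))  # 16: due to error
SSA = TSS - SSE                                   # 128: attributed to X
r_squared = SSA / TSS                             # ≈ .89
```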
29. Of course, 1 - r² = SSE / TSS = 16 / 144 = .11

is interpreted as the proportion of variability
in Y unexplained by X (and still present).

0 ≤ r² ≤ 1,   where r² = SSA / TSS.

Define r = SQRT(r²).
r = correlation (coefficient).
Here r = SQRT(.89) = .943
30. But, r can be + or - !!
SQRT(.89) = +.943 or -.943.
It takes on the sign of b in Yc = a+bX.
-1 ≤ r ≤ 1
A value of r near 1 or -1 is suggestive of a
strong linear relationship between Y and X.
A value of r near 0 is suggestive of no
linear relationship between Y and X.
31. Note that the sign of r indicates the
direction of the relationship (if any). A “+”
indicates that Y and X move in the same
direction; a “-” indicates that they move in
opposite directions. Some people refer to
a positive r as a “positive relationship” and
a negative r as an “inverse relationship”.
32. [Figure: six scatter plots of Y vs. X, illustrating r = +1, r = -1, r = +.8, r = -.65, and two patterns with r = 0]
33. Note that a high r² does not necessarily mean
CAUSE/EFFECT.
Frequently we have “spurious correlations” – two
variables which are highly related in terms of r²,
but only because they are both “driven” by a third
variable.
“Classic” example:
Number of TEACHERS vs. Number of quarts of LIQUOR SOLD
(both driven by the size of the population)
36. THE MODEL
In order to get a measure of prediction
error (e.g., confidence intervals,
hypothesis testing), we must make
some assumptions about the
distribution of points scattered about
the regression line. These
assumptions are usually couched in
what is called a “statistical model.”
37. We specify

µY•X = A + BX

where µY•X is the mean or average
value of Y for a given X. We have a
(true) slope of B and (true) intercept of
A; A and B are parameters, the exact
values of which we’ll never know.
38. This says that if we set X = 1 (for example)
and sample an infinite number of Y’s (hence
finding µY•1) and then set X = 2 and find µY•2, X
= 3 and find µY•3, etc., all the µY•X fall exactly
on a straight line.
[Figure: the (TRUE) average Y, µY•X, plotted against X as a straight line]
39. But, we never find µY•X. For a given X, we
observe a value of Y which differs from µY•X in
the same way that when we observe any random
variable value, it does not equal “µ,” but is some
point governed by some probability law.
[Figure: the density f(Y) of Y for a given X, centered at µY•X]
40. The way we write this in a formal way
is:
Y = µY•X + ε = A + BX + ε

where ε is the difference between an
individual Y and the mean Y, all given a
specific X.
ε is basically the impact of having a nonzero σ.
41. Example:
Suppose that Y = weight, X = height,
and µY•X=70” = 160 lbs.
Then a person 70” tall with weight of 168 pounds has
a “personal ε” of 8 lbs. If his/her weight were 158
lbs., his/her personal ε would be -2 pounds.
Of course, since ε = Y - µY•X, and we don’t know µY•X,
we don’t really know anybody’s personal ε.
42. We find the LS line,
Yc = a + bx
a → estimates → A
b → estimates → B
Yc → estimates → µY•X , and Y itself.
43. We usually make the following assumptions,
which are called
“the standard assumptions.”
1) NORMALITY
2) HOMOSCEDASTICITY
3) INDEPENDENCE
44. Assumption 1:
Given a value of X, the probability distribution
of Y is normal.
(e.g., with Y = weight and X = height, for any
given height (say 70”) the Y’s are normal
around µY•X=70” (say, 160 lbs.))
[Figure: a normal curve of Y values centered at 160]
45. Assumption 2:
The standard deviation of ε, σε (which we
don’t know), usually called σY•X, is
constant for all values of X. The
characteristic of having σY•X constant is
referred to as “Homoscedasticity.”
48. Combining assumptions 1 & 2, we have
the Y’s being normally distributed with µY•X
as mean (and correspondingly, average
error of 0) and constant standard dev. σY•X.
Of course, as you know, neither µY•X nor
σY•X is known.
µY•X is estimated by Yc = a + bX
σY•X is estimated by “Sy•x”
Sy•x is called the “Standard Error of Estimate,”

Sy•x = SQRT( SSE / (n-2) )
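In this example the standard error of estimate can be computed directly (a sketch; the slides take it from the software output):

```python
import math

SSE, n = 16.0, 8
s_yx = math.sqrt(SSE / (n - 2))   # standard error of estimate
print(round(s_yx, 2))             # 1.63
```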
49. The SSE in the numerator makes intuitive
sense, in that SSE is the variability due to error.
(instead of [n-1], the denominator of S in
most previous applications) is really a
degrees of freedom number. The df = n
minus a degree of freedom for each
parameter estimated from the data. Here,
there are 2 such parameters, A and B
(estimated by a and b, respectively).
50. Later, when we have a model of
Y = A + B1X1 + B2X2 + ε,
the df will be [n-3].
We usually get Sy•x from the Computer output.
Here, Sy•x = 1.63 (See output on next page).
53. Assumption 3:
The Y values are independent of one another. (This
is often a problem when the data form a time series).
In the real world these assumptions may never be
exactly true, but are often close enough to true to
make our statistical analysis (which follows) valid.
Investigation has shown that moderate departures
from assumptions 1 and 2 do not appreciably affect
results (i.e., assumptions 1 and 2 are “Robust”). In
terms of large departures –– there are ways to
recognize them and do the appropriate (but more
complex) analysis.
57. Of greater interest (usually) is a confidence interval for
the prediction of the next “period.” This is done by:

Yc ± t1-α • Sy•x     [(n-2) df]

Recall: Yc = 106 + 4X.
This formula is an excellent approximation when n is
“large,” (virtually always in MK) and the value of X
at which we are predicting isn’t dramatically far
from the center [X̄] of our data.
For 95% confidence and X = 3, we have:
118 ± 2.447(1.63) or 118 ± 3.99
(where 2.447 = TINV(.05, 6))
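The critical value 2.447 is Excel’s TINV(.05, 6); the same interval can be sketched in Python (using scipy.stats, which is my choice of tool, not the course’s):

```python
import math
from scipy import stats

n = 8
s_yx = math.sqrt(16.0 / (n - 2))            # standard error of estimate
yc = 106 + 4 * 3                            # prediction at X = 3
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)   # two-tailed 95%, 6 df, ≈ 2.447
half = t_crit * s_yx                        # ≈ 3.99
print(yc - half, yc + half)
```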
59. Hypothesis Testing
To test:
H0: B=0
H1: B≠0
(in the model Y = A + BX + ε; note that B=0 is
the same as saying X & Y are NOT RELATED)
we compute
tcalc = (b - BH0) / sb = b / sb
and accept H0 if |tcalc| < t1-α with (n-2) df,
reject H0 if |tcalc| > t1-α with (n-2) df.
60. In our problem, tcalc = 6.93 (see output on
next page).
If α=.05, we have t.95 (6 df) = 2.447, and we reject H0.
(All we really need to do is to examine the p-value.)
We’ll refer to this as the “t-test.”
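The tcalc = 6.93 on the output can be reproduced by hand (a sketch; the formula for sb below is the standard simple-regression one, which the slides leave to the software):

```python
import math

X = [2, 1, 3, 1, 4, 2, 1, 2]
x_bar = sum(X) / len(X)
Sxx = sum((x - x_bar) ** 2 for x in X)      # Σ(x - x̄)² = 8
s_yx = math.sqrt(16.0 / (len(X) - 2))       # standard error of estimate
s_b = s_yx / math.sqrt(Sxx)                 # standard error of b
t_calc = (4.0 - 0.0) / s_b                  # b = 4, B under H0 is 0
print(round(t_calc, 2))                     # 6.93
```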
63. To test
H0: all B’s = 0
H1: not all B’s = 0,
we have a different procedure.
Here, where µY•X = A + BX,
there’s only one B, and thus the H’s
above are the same as the previous
H0: B=0
H1: B≠0
64. However, for the future, where
µY•X = A + B1X1 + B2X2, and “all B’s=0”
means B1 = B2 = 0, and there is a difference
between “B=0” and “all B’s=0,” we introduce:
H0: all B’s = 0
H1: not all B’s = 0
65. To test the above, we determine Fcalc.
We get Fcalc from the output!!! Yeah!!!!
66. And we accept H0 if Fcalc < F1-α with (1, n-2) df,
reject H0 if Fcalc > F1-α with (1, n-2) df,
where F1-α is the appropriate value
from the F table.
More easily: examine the p-value of the F-test (next page).
[Figure: F distribution with α = 0.05 in the upper tail; critical value F.95(1, 6) = 5.99]
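For this example, Fcalc and the 5.99 critical value can be checked (a sketch using scipy.stats; note that with a single X, Fcalc = tcalc²):

```python
from scipy import stats

SSA, SSE, n = 128.0, 16.0, 8
F_calc = SSA / (SSE / (n - 2))        # ratio of explained to error mean square
F_crit = stats.f.ppf(0.95, 1, n - 2)  # ≈ 5.99
print(F_calc, F_calc > F_crit)        # 48.0 True, so reject H0
```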
69. MULTIPLE REGRESSION
When there is more than one
independent variable (X), we call our
regression analysis by the term
“Multiple Regression.” With a single
independent variable, we call it “Simple
Regression.”
70. µY•X = A + B1X1 + B2X2 + • • • + Bk-1Xk-1
Y = µY•X + ε
Least Squares hyperplane (“line”):
Yc = a + b1X1 + b2X2 + • • • + bk-1Xk-1
NOTE:
k-1 = Number of X’s
k = Number of parameters
71. Example:
Y  = Job Performance
X1 = Score on (entrance) Test 1
X2 = Score on Test 2
X3 = Score on Test 3
or
Y  = Sales
X1 = Advertising
X2 = Number of sales people
X3 = Number of competitors
We assume that Computer software gives
us all (or nearly all) the numerical results.
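A minimal sketch of fitting such a multiple regression; the numbers below are invented for illustration only (the slides assume the software provides all of this):

```python
import numpy as np

# Hypothetical data (made up, not from the slides):
Y  = np.array([10., 12., 16., 19., 20., 25.])   # sales
X1 = np.array([ 1.,  2.,  3.,  4.,  5.,  6.])   # advertising
X2 = np.array([ 4.,  3.,  5.,  6.,  5.,  7.])   # number of sales people

# Least-squares hyperplane Yc = a + b1*X1 + b2*X2 via the design matrix:
A = np.column_stack([np.ones_like(X1), X1, X2])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = coef
Yc = A @ coef                                   # fitted values
SSE = float(np.sum((Y - Yc) ** 2))
TSS = float(np.sum((Y - Y.mean()) ** 2))
```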
72. Typically, we wish to perform two types of
Hypothesis Tests:
First: F – test
(Y = A+B1X1 + ••• + Bk-1 Xk-1+ ε)
H0 : B1 = B2 = B3 = . . . = Bk-1 = 0
H1 : not all B’s = 0
73. H0 : B1 = B2 = B3 = . . . = Bk-1 = 0
H1 : not all B’s = 0
In “English”:
H0: The X’s
collectively do not
help us predict Y.
H1: At least one of
the darn X’s helps us
predict Y!
We call this, reasonably so, a
“TEST OF THE OVERALL MODEL”
74. If we accept H0 that the X’s collectively do
not help us predict Y, we probably
discontinue formal statistical analysis.
However, if we reject H0 (i.e., the “F is
significant”), then we are likely to want a
series of t-tests:
H0: B1 = 0     H0: B2 = 0     •••     H0: Bk-1 = 0
H1: B1 ≠ 0     H1: B2 ≠ 0     •••     H1: Bk-1 ≠ 0
75. These are called “Tests for individual X’s.”
The test is answering: (using B1 as an example)
H0 : Variable X1 is NOT helping us predict
Y, above and beyond the other
variables in the model.
H1 : X1 IS INDEED helping us predict Y,
above and beyond the other variables
in the model.
76. So, note:
We’re answering whether a variable gives us
INCREMENTAL value.
Sometimes a result looks “strange”. With
Y = weight, X1 = height, X2 = pant length:
F-test: SIGNIFICANT
t1: NOT SIGNIFICANT
t2: NOT SIGNIFICANT
77. If I know a person’s X1, height, do I get
additional predictive value about Y, weight,
from knowing X2, pant length?
No - hence, we accept H0: B2 = 0
(t2 not sign.)
78. If I know X2, pant length, do I get
additional predictive value about Y
from knowing height?
(Also) No - hence we accept
Ho: B1= 0
(t1 not sign.)
79. When the X’s themselves are highly
interrelated (the fact that leads to
the strange looking - but not really
strange result), we call this
MULTICOLLINEARITY.
80. Another “look” at this issue:
Y vs. X1:     R² = .5
Y vs. X2:     R² = .4
Y vs. X1, X2: R² = ?
Ans: between .5 and .9
(In some unusual, “strange” cases,
R² may exceed .9)
If X1 and X2 are not overlapping in the
information provided, R² = .9; if X2 tells
us a total subset of what X1 tells us,
R² = .5.
81. If you have
Y vs. X1:     R² = .70
Y vs. X2:     R² = .72
Y vs. X1, X2: R² = .73,
1) The F test is significant because the X’s together
tell us (an estimate of) 73% of what’s going on with Y.
2) t1 (likely) not sign., because the gain of .01 (.73 - .72
[with only X2]) is judged by the t-test as too easily
due to the “luck of the draw”. (Actually, it depends
on the sample size.)
3) t2, similarly.
88. To test: H0: B1 = B2 = B3 = 0 vs.
H1: not all B’s = 0, at α = .05:
From the output, Fcalc = 47.598.
Since p-value = .000000001528 < .05,
we reject H0.
90. For X1 and X3 we reject H0; for X2 we accept H0.
Conclusion in Practical Terms?
X1 (Test 1) and X3 (Test 3) each gives us
incremental predictive value about
PERFORMANCE, Y.
X2 (Test 2) is either irrelevant or redundant.
91. An added benefit of the analysis was to indicate
how the tests should be weighted: The best fit
occurs if the tests are weighted
(1.02, .137, .87)
(assuming we retain Test 2).
This is equivalent to weights of
(1.02/2.027, .137/2.027, .87/2.027)
or (.50, .07, .43).
The present weights were (1/3, 1/3, 1/3).
92. “PROBLEM IN NOTES”
Consider the following model: Y = A+B1•X1+B2•X2+B3•X3+ε
Y = Sales Revenue (in units of $100,000)
X1= Expenditure on TV advertising (in units of $10,000)
X2= Expenditure on Web advertising (in units of $10,000)
X3= Expenditure on Newspaper advertising (in units of $10,000)
Refer to the computer output following the questions.
1. What is the least squares line (hyperplane)?
2. What revenue do I expect (in dollars) with no advertising
in any of the three media?
3. If $10,000 more were allocated to advertising, which
medium should receive it to generate the most additional
revenue?
93. 4) What percent of the variability in revenue is due to factors
other than the expenditures in the three advertising media?
5) If management decided to spend the same amount of money
on each of the three types of media, how much total money
would have to be spent to generate an expected revenue of
$40,000,000?
6) Test H0: B1 = B2 = B3 = 0 vs. H1: not all B’s = 0, at α = .05.
What is your conclusion in practical terms?
7) For each variable, test H0: B = 0 vs. H1: B ≠ 0, at α = .05.
What are your conclusions in practical terms?
96. We Get Yc = a + b1X1+ b2X2
For any given X1, income, we predict Y as
follows:
Male:
Yc = a + b1X1 +b2(1) = a + b1X1 + b2
Female:
Yc = a + b1X1 +b2(0) = a + b1X1 + 0
How is b2 to be interpreted?
97. Ans: The (estimated) amount spent by a
Male, above that which would be spent by a
Female, given the same X1 value (income).
(Of course, if b2 is negative, it says that we
estimate that Females spend more than
Males, at equal incomes.)
If we had defined
X2 = 1 for F’s
X2 = 0 for M’s ,
then b2 would reverse sign, and have the
opposite meaning.
98. Remember that a variable is a “dummy” variable
because of definition and interpretation. The
computer treats a variable whose values are 0 and 1,
just like any other variable.
Our data are, perhaps,

  Y    X1   X2
 20    50    1
 18    40    1
 33    65    0
 24    49    0
 21    62    1
  •     •    •
  •     •    •
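To underline the point that the computer treats the 0/1 column like any other, here is a sketch fitting the five rows shown (treating them, for illustration only, as a complete toy dataset):

```python
import numpy as np

Y  = np.array([20., 18., 33., 24., 21.])   # amount spent
X1 = np.array([50., 40., 65., 49., 62.])   # income
X2 = np.array([ 1.,  1.,  0.,  0.,  1.])   # dummy: 1 = Male, 0 = Female

# The dummy column enters the design matrix just like any other column:
A = np.column_stack([np.ones_like(X1), X1, X2])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = coef

income = 50.0
male_pred   = a + b1 * income + b2         # X2 = 1
female_pred = a + b1 * income              # X2 = 0
# At equal incomes, the two predictions differ by exactly b2.
```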
99. Note that we have 2 categories, (M, F), but only one dummy variable.
This is necessary; in the general situation of C categories, we use (C-1)
dummy variables.
This is because of computational issues involved in matrix inversion.
100. Example
Yc = a + b1X1 + b2X2 + b3X3 + b4X4 + b5X5
Y  = Water Usage
X1 = Temp.
X2 = Amount Produced
X3 = # People Employed

           X4   X5
Plant 1     1    0
Plant 2     0    1
Plant 3     0    0
101. Let a + b1X1 + b2X2 + b3X3 = G
Then we predict: (for a given X1, X2 , X3)
FOR PLANT 1:
G + b4(1) + b5(0) = G + b4
FOR PLANT 2:
G + b4(0) + b5(1) = G + b5
FOR PLANT 3:
G + b4(0) + b5(0) = G
How do we interpret b4? b5?
105. Step 3: Internal: Y/X2, X4, X1 → R² = .77
                Y/X2, X4, X3 → R² = .73
      External: Y/X2, X4, X1 → R² = .77
NOTE: If at any stage, the best variable
to enter is not significant by the t-test,
the ALGORITHM STOPS (and does not
bring that variable in!!!). You select a
p-value (pin), and if the p-value of the
entering variable > pin (i.e., the variable
is not significant), the variable does not
enter and the algorithm stops.
106. Also - there’s a step 3b (and 4b, 5b, etc.)
Step 3b) Now that we’ve entered our third
variable, the software goes back
and re-examines previously
entered variables to see if any
should be DELETED (specify a
“p to go out”, pout, so that if the
p-value of a variable in our model
> pout, the variable is deleted).
Algorithm continues until it stops!!!!
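The forward half of the algorithm (enter the best candidate while its p-value < pin, stop otherwise) can be sketched as follows; this is my own compact illustration using scipy for the partial-F p-values, and it omits the step-3b deletion (pout) pass:

```python
import numpy as np
from scipy import stats

def fit_ols(A, y):
    """Least-squares fit; returns coefficients and residual sum of squares."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return coef, float(resid @ resid)

def forward_stepwise(X, y, p_in=0.05):
    """Greedy forward selection: at each step, bring in the candidate
    column whose partial F test has the smallest p-value, and stop as
    soon as the best entering variable is not significant at p_in."""
    n, m = X.shape
    ones = np.ones((n, 1))
    chosen = []
    _, sse_cur = fit_ols(ones, y)           # start from the mean-only model
    while True:
        best = None
        for j in range(m):
            if j in chosen:
                continue
            A = np.hstack([ones, X[:, chosen + [j]]])
            _, sse_new = fit_ols(A, y)
            df = n - (len(chosen) + 2)      # n minus number of parameters
            if df <= 0 or sse_new <= 0:
                continue
            F = (sse_cur - sse_new) / (sse_new / df)
            p = stats.f.sf(F, 1, df)        # p-value of the entering variable
            if best is None or p < best[1]:
                best = (j, p, sse_new)
        if best is None or best[1] > p_in:
            break                           # entering variable not significant
        chosen.append(best[0])
        sse_cur = best[2]
    return chosen

# Demo on the advertising data from the earlier slides, plus a useless
# constant column (collinear with the intercept, so it can never help):
X_demo = np.column_stack([
    np.array([2., 1., 3., 1., 4., 2., 1., 2.]),   # advertising
    np.full(8, 5.0),                              # no-information column
])
y_demo = np.array([116., 109., 117., 112., 122., 113., 108., 115.])
entered = forward_stepwise(X_demo, y_demo)        # only advertising enters
```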
112. Detailed Summary of Stepwise Analysis

Step   Entering Variable     LS Line                                    R²
1      UNDERGRAD GPA (X1)    Yc = .85 + .73X1                           .609
2      QUANT GMAT (X2)       Yc = .585 + .53X1 + .00165X2               .833
3      COLLEGE SEL. (X4)     Yc = 1.197 + .309X1 + .00163X2 + .284X4    .915
STOP!
If we bring in Verbal GMAT, R² = .919
113. PRACTICE PROBLEM
Y = COMPANY ABC’s SALES ($millions)
X1 = OVERALL INDUSTRY SALES ($billions)
X2 = COMPANY ABC’s ADVERTISING ($millions)
X3 = SPECIAL PROMOTION BY CHIEF COMPETITOR: 0 = YES, 1 = NO
A STEPWISE REGRESSION WAS RUN WITH THESE RESULTS:
STEP 1: VARIABLE ENTERING: X1, Yc = 205+16•X1, R² = .48
STEP 2: VARIABLE ENTERING: X2, Yc = 183+11•X1+10•X2, R² = .64
STEP 3: VARIABLE ENTERING: X3, Yc = 180+10•X1+8•X2+65•X3, R² = .68
A) If ABC’s advertising is to be the same next year as this year (i.e., X2 held constant), and we do
not know (in advance) the value of X3, what would we predict to be the increase in ABC’s
sales if overall industry sales (X1) increase by $1 billion?
a) 10     b) 11     c) 16
B) Based on the given information, we can conclude that the R² between Y and X2 (the exact
value of which we cannot determine from the given information) is between:
a) .16 and .48     b) .16 and .64     c) .48 and .64     d) none of these
C) Answer part B) if the regression results above were NOT part of a stepwise procedure, but
simply a set of multiple regression results.
Editor's Notes
Regression is used more than any other technique - it’s the mode.
Measure the relationship between variables (like how often you go to Wendy’s vs. how many children you have), sometimes (not always) to predict the value of one variable based on the values of other variables.
Functional Relationships: an algebraic relationship
Statistical Relationships:
Functional Relationship: Algebraic relationship (an exact relationship)
Meaning there is no error
Two people that spend the same X will save the same Y. It is an exact relationship.
Statistical relationship: a relationship that is true only on the average
It is not exact.
Example: Labor hours vs. production. There is an upward tendency, so you make a trend line. It is not an EXACT relationship. Production won’t be EXACTLY what is revealed by the trend line. This relationship is a linear relationship, a straight line.
If it is not a straight line, it is non-linear: age vs. physical ability, or time vs. knowledge of a product, follows the upside-down U-shape curve.
Instead of basing it on Ȳ, base it on advertising (X).
When advertising is at its highest, Y is at its highest (117 & 122). When Y is at its lowest, so is X. There is a link.
Sometimes it’s better to use this method of proof/demonstration: the higher the advertising, the higher the sales.
Scatter Diagram When you graph X and Y data points, it is called a Scatter diagram. The scatter diagram has a trend (linear trend)
So its good to find the best fitting line. (next slide)
Yc is the best fitting line. b is the change in Y if you increase X by 1 (the change in Y per unit change in X).
In this context, Y is called the dependent variable (the output variable). X is called the independent variable (the input variable).
Best fitting is defined as the Least Squares line. Line which minimizes the sum of the squared differences (prediction errors). Minimize the sum of the prediction errors squared.
To find Least Squares line, use excel or SPSS. Never do a regression by hand.
Dependent (Y)
Independent (X)
SSE: we use X in the best way we can, but we still don’t get perfect prediction. The amount of misprediction still there is the SSE. It is due to factors or variables OTHER than advertising.
Error: factors or variables affecting prediction that aren’t what you’re measuring.
Error is the collective name of all variables not used in making the prediction - all the other variables.
What happened to the 128? We reduced the prediction error dramatically - by 128 - by using X to help us predict. We call the 128 SSA.
SSA: Sums of Squares attributed to X.
We gained that 128 by using X to help us predict.
The total variability in sales = variability due to X + variability due to ERROR.
Why are sales not always the same? Because advertising (X) is not always the same, and because of the variability of errors.