2. Questions
• What are predictors and criteria?
• Write an equation for the linear regression. Describe each term.
• How do changes in the slope and intercept affect (move) the regression line?
• What does it mean to test the significance of the regression sum of squares? R-square?
• What is R-square?
• What does it mean to choose a regression line to satisfy the loss function of least squares?
• How do we find the slope and intercept for the regression line with a single independent variable? (Either formula for the slope is acceptable.)
• Why does testing for the regression sum of squares turn out to have the same result as testing for R-square?
3. Basic Ideas
• Jargon
– IV = X = Predictor (pl. predictors)
– DV = Y = Criterion (pl. criteria)
– Regression of Y on X, e.g., GPA on SAT
• Linear Model = relations between IV and DV represented by straight line.
• A score on Y has 2 parts – (1) linear function of X and (2) error.
Yi = α + βXi + εi (population values)
4. Basic Ideas (2)
• Sample value: Yi = a + bXi + ei
• Intercept – value of Y where the line crosses the Y axis (where X = 0)
• Slope – change in Y if X changes 1 unit. Rise over run.
• If error is removed, we have a predicted value for each person at X (the line): Y′ = a + bX
Suppose on average houses are worth about $75.00 a
square foot. Then the equation relating price to size
would be Y’=0+75X. The predicted price for a 2000
square foot house would be $150,000.
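The house-price arithmetic as a quick sketch (the $75-per-square-foot figure is the slide's hypothetical value, not market data):

```python
# Y' = a + bX with a = 0 (intercept) and b = 75 (dollars per square foot),
# the hypothetical values from the example above.
def predicted_price(sqft, a=0.0, b=75.0):
    """Predicted house price from size, using the line Y' = a + bX."""
    return a + b * sqft

print(predicted_price(2000))  # 150000.0
```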
5. Linear Transformation
• 1 to 1 mapping of variables via line
• Permissible operations are addition and multiplication (interval data)
[Figure: "Changing the Y Intercept" – lines Y = 5 + 2X, Y = 10 + 2X, and Y = 15 + 2X plotted for X from 0 to 10. Adding a constant shifts the line up without changing its tilt.]

[Figure: "Changing the Slope" – lines Y = 5 + .5X, Y = 5 + X, and Y = 5 + 2X plotted for X from 0 to 10. Multiplying by a constant changes the line's tilt.]
6. Linear Transformation (2)
• Centigrade to Fahrenheit
• Note 1 to 1 map
• Intercept?
• Slope?
[Figure: Degrees C (0 to 120) vs. Degrees F (0 to 240), a straight line passing through (0 °C, 32 °F) and (100 °C, 212 °F).]
Intercept is 32. When X (Cent) is 0, Y (Fahr) is 32.
Slope is 1.8. When Cent goes from 0 to 100 (run), Fahr goes from 32 to 212 (rise), and 212 − 32 = 180. Then 180/100 = 1.8, rise over run, is the slope. Y = 32 + 1.8X, i.e., F = 32 + 1.8C.
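The Centigrade-to-Fahrenheit line above can be checked directly; this is just the slide's equation F = 32 + 1.8C written as a function:

```python
def c_to_f(c):
    # Intercept 32: F when C = 0. Slope 1.8: rise/run = (212 - 32) / (100 - 0).
    return 32 + 1.8 * c

print(c_to_f(0), c_to_f(100))  # 32.0 212.0
```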
7. Review
• What are predictors and criteria?
• Write an equation for the linear regression with 1 IV. Describe each term.
• How do changes in the slope and intercept affect (move) the regression line?
8. Regression of Weight on Height

Ht  Wt
61  105
62  120
63  120
65  160
65  120
68  145
69  175
70  160
72  185
75  210

N = 10; M(Ht) = 67, SD(Ht) = 4.57; M(Wt) = 150, SD(Wt) = 33.99
[Figure: scatterplot of Weight (60 to 240) against Height in Inches (60 to 76), titled "Regression of Weight on Height," with the fitted line Y = −316.86 + 6.97X and the rise and run marked.]
Correlation (r) = .94.
Regression equation: Y’=-316.86+6.97X
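A quick sketch recomputing the slide's values from the table (sample SDs with N − 1, matching the slide's convention):

```python
from statistics import mean, stdev

ht = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
wt = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

mx, my = mean(ht), mean(wt)            # 67, 150
sxy = sum((x - mx) * (y - my) for x, y in zip(ht, wt))
sxx = sum((x - mx) ** 2 for x in ht)
syy = sum((y - my) ** 2 for y in wt)

r = sxy / (sxx * syy) ** 0.5           # correlation, about .94
b = r * stdev(wt) / stdev(ht)          # slope = r * (SDy / SDx), about 6.97
a = my - b * mx                        # intercept, about -316.86
print(round(r, 2), round(b, 2), round(a, 2))
```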
9. Illustration of the Linear
Model. This concept is vital!
[Figure: two panels titled "Regression of Weight on Height" (Height 62 to 72, Weight 100 to 200). The point (65, 120) is marked along with the mean of X, the mean of Y, the deviation from each mean, and the split of the Y deviation into a linear part (Y′) and an error part (e).]

Yi = α + βXi + εi
Yi = a + bXi + ei

Consider Y as a deviation from the mean. Part of that deviation can be associated with X (the linear part) and part cannot (the error).

ei = Yi − Yi′
10. Predicted Values & Residuals
N Ht Wt Y' Resid
1 61 105 108.19 -3.19
2 62 120 115.16 4.84
3 63 120 122.13 -2.13
4 65 160 136.06 23.94
5 65 120 136.06 -16.06
6 68 145 156.97 -11.97
7 69 175 163.94 11.06
8 70 160 170.91 -10.91
9 72 185 184.84 0.16
10 75 210 205.75 4.25
M 67 150 150.00 0.00
SD 4.57 33.99 31.85 11.89
V 20.89 1155.56 1014.37 141.32
[Figure: the same Weight-on-Height plot as slide 9, marking the point (65, 120) and its linear part (Y′) and error part (e).]
Numbers for the linear part and error. Note the mean of Y′ equals the mean of Y (150) and the residuals average 0. Note the variance of Y is V(Y′) + V(res).
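The pattern in the table can be verified numerically; a sketch using the height/weight data (sample variances with N − 1):

```python
from statistics import mean, variance

ht = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
wt = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

mx, my = mean(ht), mean(wt)
b = sum((x - mx) * (y - my) for x, y in zip(ht, wt)) / sum((x - mx) ** 2 for x in ht)
a = my - b * mx

pred = [a + b * x for x in ht]                    # Y' for each person
resid = [y - p for y, p in zip(wt, pred)]         # Y - Y'

assert abs(mean(pred) - my) < 1e-9                # mean of Y' = mean of Y (150)
assert abs(mean(resid)) < 1e-9                    # residuals average to zero
# V(Y) = V(Y') + V(res): the variance of Y splits into the two parts
assert abs(variance(wt) - (variance(pred) + variance(resid))) < 1e-6
print(round(variance(pred), 2), round(variance(resid), 2))
```

The printed variances agree with the table's 1014.37 and 141.32 within rounding of the slide's coefficients.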
11. Finding the Regression Line
Need to know the correlation, SDs and means of X and Y.
The correlation is the slope when both X and Y are
expressed as z scores. To translate to raw scores, just bring
back original SDs for both.
rXY = Σ zX zY / N

b = rXY (SDY / SDX)   (rise over run)

To find the intercept, use: a = MY − b·MX

Suppose r = .50, SDX = .5, MX = 10, SDY = 2, MY = 5.

Slope: b = .50(2/.5) = 2.  Intercept: a = 5 − 2(10) = −15.  Equation: Y′ = −15 + 2X.
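The same computation as a sketch, plugging in the slide's numbers:

```python
# Given summary statistics from the example above.
r, sd_x, m_x, sd_y, m_y = 0.50, 0.5, 10, 2, 5

b = r * sd_y / sd_x        # slope: .50 * (2 / .5) = 2.0
a = m_y - b * m_x          # intercept: 5 - 2*10 = -15.0
print(f"Y' = {a} + {b}X")  # Y' = -15.0 + 2.0X
```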
12. Line of Least Squares
[Figure: the Weight-on-Height scatterplot again, with the point (65, 120), the means, the deviations, and the linear (Y′) and error (e) parts marked.]
We have some points. Assume a linear relation is reasonable, so the 2 variables can be represented by a line. Where should the line go?
Place the line so errors (residuals) are small. The line we calculate has a sum of errors = 0. It has a sum of squared errors that is as small as possible; the line provides the smallest sum of squared errors, or least squares.
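One way to see "least squares" concretely: fit the line to the height/weight data, then check that nudging the slope or intercept in any direction only increases the sum of squared errors. A sketch:

```python
ht = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
wt = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

def sse(a, b):
    """Sum of squared errors for the line Y' = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(ht, wt))

# least-squares coefficients from the usual formulas
mx, my = sum(ht) / len(ht), sum(wt) / len(wt)
b = sum((x - mx) * (y - my) for x, y in zip(ht, wt)) / sum((x - mx) ** 2 for x in ht)
a = my - b * mx

best = sse(a, b)
# the residuals sum to zero, and any other line does worse
assert abs(sum(y - (a + b * x) for x, y in zip(ht, wt))) < 1e-9
assert best < sse(a + 1, b) and best < sse(a - 1, b)
assert best < sse(a, b + 0.1) and best < sse(a, b - 0.1)
print(round(best, 2))  # the smallest possible sum of squared errors
```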
14. Review
• What does it mean to choose a regression line
to satisfy the loss function of least squares?
• What are predicted values and residuals?
Suppose r = .25, SDX = 1, MX = 10, SDY = 2, MY = 5.
What is the regression equation (line)?
15. Partitioning the Sum of
Squares
Definitions:
Y = a + bX + e        Y′ = a + bX
Y = Y′ + e            e = Y − Y′

(Y − Ȳ) = (Y′ − Ȳ) + (Y − Y′);   Y − Ȳ = y, the deviation from the mean

Σ(Y − Ȳ)² = Σ[(Y′ − Ȳ) + (Y − Y′)]²   Sum of squares

Σy² = Σ(Y′ − Ȳ)² + Σ(Y − Y′)²   (cross products drop out)

Sum of squared deviations from the mean = sum of squares due to regression + sum of squared residuals.

Analog: SStot = SSB + SSW
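The "cross products drop out" step can be confirmed numerically with the height/weight data: Σ(Y′ − Ȳ)(Y − Y′) is zero (to floating-point precision), so the total SS splits cleanly. A sketch:

```python
ht = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
wt = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

mx, my = sum(ht) / len(ht), sum(wt) / len(wt)
b = sum((x - mx) * (y - my) for x, y in zip(ht, wt)) / sum((x - mx) ** 2 for x in ht)
a = my - b * mx
pred = [a + b * x for x in ht]

ss_total = sum((y - my) ** 2 for y in wt)
ss_reg = sum((p - my) ** 2 for p in pred)
ss_res = sum((y - p) ** 2 for y, p in zip(wt, pred))
cross = sum((p - my) * (y - p) for y, p in zip(wt, pred))

assert abs(cross) < 1e-6                          # cross-product term vanishes
assert abs(ss_total - (ss_reg + ss_res)) < 1e-6   # SSy = SSreg + SSres
print(round(ss_total, 2), round(ss_reg, 2), round(ss_res, 2))
```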
16. Partitioning SS (2)
SSY = SSReg + SSRes. Total SS is regression SS plus residual SS. Can also get proportions of each. Can get variance by dividing SS by N if you want. Proportion of total SS due to regression = proportion of total variance due to regression = R² (R-square).

SSY/SSY = SSReg/SSY + SSRes/SSY

1 = R² + (1 − R²)
18. Partitioning SS (4)
          Total     Regress   Residual
SS        10400     9129.31   1271.91
Variance  1155.56   1014.37   141.32

10400/10400 = 9129.31/10400 + 1271.91/10400  ⇒  1 = .88 + .12   (proportion of SS)

1155.56/1155.56 = 1014.37/1155.56 + 141.32/1155.56  ⇒  1 = .88 + .12   (proportion of variance)

R² = .88

Note Y′ is a linear function of X, so rY′X = 1 and rYY′ = .94 = rXY.

r²YY′ = R² = .88;  rYE = .35 (r²YE = .12);  rY′E = 0
19. Significance Testing
Testing for the SS due to regression = testing for the variance due to regression = testing the significance of R². All are the same.

H0: R²(population) = 0

F = (SSreg/dfreg) / (SSres/dfres) = (SSreg/k) / (SSres/(N − k − 1))

k = number of IVs (here it's 1) and N is the sample size (# people). F is distributed with k and (N − k − 1) df.

F = (9129.31/1) / (1271.91/(10 − 1 − 1)) = 57.42

Equivalent test using R-square instead of SS:

F = (R²/k) / ((1 − R²)/(N − k − 1)) = (.88/1) / ((1 − .88)/(10 − 1 − 1)) = 58.67

Results will be the same within rounding error.
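Both F computations as a sketch; the small difference between 57.42 and 58.67 comes from using the rounded R² = .88:

```python
def f_from_ss(ss_reg, ss_res, n, k):
    """F = (SSreg / k) / (SSres / (N - k - 1))."""
    return (ss_reg / k) / (ss_res / (n - k - 1))

def f_from_r2(r2, n, k):
    """Equivalent form: F = (R^2 / k) / ((1 - R^2) / (N - k - 1))."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

print(round(f_from_ss(9129.31, 1271.91, n=10, k=1), 2))  # 57.42
print(round(f_from_r2(0.88, n=10, k=1), 2))              # 58.67
```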
20. Review
• What does it mean to test the
significance of the regression sum of
squares? R-square?
• What is R-square?
• Why does testing for the regression sum of
squares turn out to have the same result as
testing for R-square?