Bivariate analysis
The Multiple Regression Model

                    Idea: Examine the linear relationship between
                1 dependent (Y) & 2 or more independent variables (Xi)


Multiple Regression Model with k Independent Variables:

    Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

    (β0 = Y-intercept, β1…βk = population slopes, εi = random error)
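As an illustration only (not from the slides), a minimal Python sketch that fits such a model by least squares on simulated data with k = 2 independent variables:

```python
import numpy as np

# Minimal sketch: estimate b0, b1, ..., bk by ordinary least squares.
# The data are simulated purely for illustration.
rng = np.random.default_rng(0)
n, k = 28, 2
X = rng.normal(size=(n, k))                      # two independent variables
beta_true = np.array([5.0, 1.5, -2.0])           # [β0, β1, β2]
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=1.0, size=n)  # add random error εi

X_design = np.column_stack([np.ones(n), X])      # column of 1s for the intercept
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("estimated coefficients:", b)              # close to [5.0, 1.5, -2.0]
```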
Assumptions of Regression
Use the acronym LINE:
• Linearity
   – The underlying relationship between X and Y is linear

• Independence of Errors
   – Error values are statistically independent

• Normality of Error
   – Error values (ε) are normally distributed for any given value of X

• Equal Variance (Homoscedasticity)
   – The probability distribution of the errors has constant variance
Regression Statistics
Multiple R           0.998368
R Square             0.996739        r² = SSR/SST = 0.996739   (SSR = 11704.1, SST = 11740)
Adjusted R Square    0.995808
Standard Error       1.350151        99.674% of the variation in Y is explained
Observations               28        by the independent variables

ANOVA
                df        SS        MS          F   Significance F
Regression       6  11701.72  1950.286   1069.876         5.54E-25
Residual        21  38.28108  1.822908
Total           27     11740
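The summary statistics can be recomputed from the ANOVA table. A minimal Python check (numbers copied from the table above; n = 28 observations, k = 6 independent variables):

```python
# Recompute the summary statistics from the ANOVA table above.
n, k = 28, 6
ssr, sse, sst = 11701.72, 38.28108, 11740.0

r2 = ssr / sst                                  # R Square
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # Adjusted R Square
std_err = (sse / (n - k - 1)) ** 0.5            # Standard Error = sqrt(MSE)

print(round(r2, 6), round(r2_adj, 6), round(std_err, 6))
# ≈ 0.996739, 0.995808, 1.350151
```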
Adjusted r2
• r2 never decreases when a new X variable is
  added to the model
  – This can be a disadvantage when comparing models
• What is the net effect of adding a new variable?
  – We lose a degree of freedom when a new X variable
    is added
  – Did the new X variable add enough explanatory
    power to offset the loss of one degree of freedom?
Adjusted r2
• Shows the proportion of variation in Y explained
  by all X variables adjusted for the number of X
  variables used
                        r²adj = 1 − (1 − r²) · (n − 1) / (n − k − 1)
   (where n = sample size, k = number of independent variables)

  – Penalizes excessive use of unimportant independent
    variables
  – Smaller than r2
  – Useful for comparing models
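A small illustrative sketch (simulated data, not from the slides) of the trade-off described above: adding a pure-noise X variable never lowers r², but it can lower adjusted r².

```python
import numpy as np

def r2_and_adj(X, y):
    """Return (r², adjusted r²) for an OLS fit with intercept."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return r2, 1 - (1 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(1)
n = 28
x1 = rng.normal(size=n)
y = 3 + 2 * x1 + rng.normal(size=n)
noise = rng.normal(size=n)                          # unrelated to y

print(r2_and_adj(x1.reshape(-1, 1), y))             # baseline model
print(r2_and_adj(np.column_stack([x1, noise]), y))  # r² creeps up, adjusted r² may drop
```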
Error and coefficients relationship
  • b1 = Covar(Y,X) / VarP(X)

  VarP     419.28571  1103.4439   115902.4  1630165.82  36245060.6  706538.59  195.9184
  Covar               662.14286     6862.5  25621.4286  120976.786   16061.643  257.1429
  b1                  0.6000694   0.059209  0.01571707  0.00333775   0.0227329    1.3125

  (The first VarP entry is the variance of Y; each remaining column gives the simple
   slope b1 = Covar / VarP for one X variable.)
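A minimal sketch of this slope-from-covariance relation, on simulated data and using population (ddof = 0) moments as in the table above:

```python
import numpy as np

# b1 = Cov(Y, X) / VarP(X), using population (ddof=0) moments.
rng = np.random.default_rng(2)
x = rng.normal(size=28)
y = 4 + 0.6 * x + rng.normal(scale=0.5, size=28)

cov_yx = np.cov(y, x, ddof=0)[0, 1]   # population covariance of Y and X
var_x = np.var(x)                     # population variance (ddof=0 by default)
b1 = cov_yx / var_x

# Matches the slope from a one-variable least-squares fit:
b1_check = np.polyfit(x, y, 1)[0]
print(b1, b1_check)
```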
Is the Model Significant?
• F Test for Overall Significance of the Model
• Shows if there is a linear relationship between all of the
  X variables considered together and Y
• Use F-test statistic
• Hypotheses:
     H0: β1 = β2 = … = βk = 0 (no linear relationship)
     H1: at least one βi ≠ 0 (at least one independent
                                  variable affects Y)
F Test for Overall Significance
• Test statistic:
          F = MSR / MSE = (SSR / k) / (SSE / (n − k − 1))

  where F has k numerator degrees of freedom
  and (n – k – 1) denominator degrees of freedom
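Using the ANOVA values above, the F statistic and its p-value can be checked directly (scipy assumed available):

```python
from scipy import stats

# Overall F test using the ANOVA table values above.
n, k = 28, 6
ssr, sse = 11701.72, 38.28108

msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse                                   # ≈ 1069.876
p_value = stats.f.sf(f_stat, k, n - k - 1)           # ≈ 5.5e-25, so reject H0
print(f_stat, p_value)
```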
Case discussion
Multiple Regression Assumptions
Errors (residuals) from the regression model:




                ei = Yi – Ŷi

   Assumptions:
   • The errors are normally distributed
   • Errors have a constant variance
   • The model errors are independent
Error terms and coefficient estimates
• Once we think of the Error term as a random
  variable, it becomes clear that the estimates
  of b1, b2, … (as distinguished from their true
  values) will also be random variables, because
  the estimates obtained by minimizing SSE (least
  squares) will depend upon the particular value of e
  drawn by nature for each individual in the
  data set.
Statistical Inference and Goodness of
                    fit
• The parameter estimates are themselves random
  variables, dependent upon the random variables e.
• Thus, each estimate can be thought of as a draw
  from some underlying probability distribution, the
  nature of that distribution as yet unspecified.
• If we assume that the error terms e are all drawn
  from the same normal distribution, it is possible to
  show that the parameter estimates have a normal
  distribution as well.
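A small simulation (illustrative only) of this point: each re-draw of the errors ε produces a different estimate b1, and across many draws the estimates are roughly normal around the true β1.

```python
import numpy as np

# Each re-draw of the errors gives a different estimate b1;
# across many draws the estimates cluster normally around β1.
rng = np.random.default_rng(3)
n = 28
x = rng.normal(size=n)
beta0, beta1 = 2.0, 0.5

estimates = []
for _ in range(5000):
    e = rng.normal(scale=1.0, size=n)            # nature draws new errors
    y = beta0 + beta1 * x + e
    estimates.append(np.polyfit(x, y, 1)[0])     # least-squares slope for this draw

print(np.mean(estimates), np.std(estimates))     # centered near 0.5
```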
T Statistic and P value
• t = (b1 − hypothesized β1) / Sb1
  (the estimate minus its hypothesized value, divided by the standard error of b1)

  Discussion: can you set up the hypothesis that the true
  slope equals the estimated b1 and run the t test on it?
Are Individual Variables Significant?

• Use t tests of individual variable slopes
• Shows if there is a linear relationship between the
  variable Xj and Y
• Hypotheses:
   – H0: βj = 0 (no linear relationship)
   – H1: βj ≠ 0 (linear relationship does exist
                     between Xj and Y)
Are Individual Variables Significant?

H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist
              between xj and y)

Test Statistic:

              t = (bj − 0) / Sbj        (df = n – k – 1)
            Coefficients  Standard Error    t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept       -59.0661        11.28404  -5.23448   3.45E-05    -82.5325   -35.5996     -82.5325     -35.5996
OFF             -0.00696         0.04619  -0.15068   0.881663    -0.10302   0.089097     -0.10302     0.089097
BAR             0.041988        0.005271  7.966651   8.81E-08    0.031028   0.052949     0.031028     0.052949
YNG             0.002716        0.000999  2.717326   0.012904    0.000637   0.004794     0.000637     0.004794
VEH              0.00147        0.000265  5.540878   1.69E-05    0.000918   0.002021     0.000918     0.002021
INV             -0.00274        0.001336  -2.05135   0.052914    -0.00552   3.78E-05     -0.00552     3.78E-05
SPD              -0.2682        0.068418  -3.92009   0.000786    -0.41049   -0.12592     -0.41049     -0.12592

                               (t statistics have n – (k+1) degrees of freedom)
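As a quick check on the BAR row of the table above (values as printed; scipy assumed available):

```python
from scipy import stats

# t statistic and two-sided p-value for the BAR coefficient.
n, k = 28, 6
b_bar, se_bar = 0.041988, 0.005271

t_stat = (b_bar - 0) / se_bar                        # ≈ 7.97
p_value = 2 * stats.t.sf(abs(t_stat), df=n - k - 1)  # ≈ 8.8e-08
print(t_stat, p_value)
```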
Confidence Interval Estimate
                  for the Slope
• Confidence interval for the population slope βj

•     bj ± t(n−k−1) · Sbj      where t has (n – k – 1) d.f.

    Example: Form a 95% confidence interval for the effect of
    changes in Bars on fatal accidents:
           0.041988 ± (2.079614)(0.005271)
           So the interval is (0.031028, 0.052949)
    (This interval does not contain zero, so Bars has a significant
                         effect on Accidents)
            Coefficients  Standard Error    t Stat    P-value   Lower 95%  Upper 95%
Intercept       -59.0661        11.28404  -5.23448   3.45E-05    -82.5325   -35.5996
OFF             -0.00696         0.04619  -0.15068   0.881663    -0.10302   0.089097
BAR             0.041988        0.005271  7.966651   8.81E-08    0.031028   0.052949
YNG             0.002716        0.000999  2.717326   0.012904    0.000637   0.004794
VEH              0.00147        0.000265  5.540878   1.69E-05    0.000918   0.002021
INV             -0.00274        0.001336  -2.05135   0.052914    -0.00552   3.78E-05
SPD              -0.2682        0.068418  -3.92009   0.000786    -0.41049   -0.12592
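The 95% interval for BAR can be reproduced directly from these estimates (scipy assumed available):

```python
from scipy import stats

# 95% confidence interval for the BAR slope, using the values from the table above.
n, k = 28, 6
b_bar, se_bar = 0.041988, 0.005271

t_crit = stats.t.ppf(0.975, df=n - k - 1)        # ≈ 2.0796
lower = b_bar - t_crit * se_bar
upper = b_bar + t_crit * se_bar
print(lower, upper)                              # ≈ (0.031, 0.053)
```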
Using Dummy Variables

• A dummy variable is a categorical explanatory
  variable with two levels:
  – yes or no, on or off, male or female
  – coded as 0 or 1
• Regression intercepts are different if the
  variable is significant
• Assumes equal slopes for other variables
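A minimal sketch (simulated data) of a 0/1 dummy shifting the intercept while the slope on the other variable is shared:

```python
import numpy as np

# Dummy variable sketch: same slope for x, different intercept by group.
rng = np.random.default_rng(4)
n = 50
x = rng.uniform(0, 10, size=n)
d = rng.integers(0, 2, size=n)                    # dummy: 0 or 1
y = 1.0 + 2.0 * x + 3.0 * d + rng.normal(size=n)  # group d=1 sits 3 units higher

X = np.column_stack([np.ones(n), x, d])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)   # ≈ [1, 2, 3]; intercept for d=1 is b[0] + b[2], slope on x is shared
```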
Interaction Between
           Independent Variables
• Hypothesizes interaction between pairs of X
  variables
  – Response to one X variable may vary at different
    levels of another X variable


• Contains a cross-product term:

       Ŷ = b0 + b1X1 + b2X2 + b3X3
         = b0 + b1X1 + b2X2 + b3(X1X2)
Effect of Interaction
• Given:
               Y = β0 + β1X1 + β2X2 + β3X1X2 + ε

• Without interaction term, effect of X1 on Y is
  measured by β1
• With interaction term, effect of X1 on Y is
  measured by β1 + β3 X2
• Effect changes as X2 changes
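A short numeric illustration of this marginal effect, using the b1 = 2 and b3 = 4 values from the example on the next slide:

```python
# Marginal effect of X1 when the model includes an X1*X2 interaction term.
b1, b3 = 2.0, 4.0

def effect_of_x1(x2):
    """Change in predicted Y per unit change in X1, at a given value of X2."""
    return b1 + b3 * x2

print(effect_of_x1(0))   # 2.0 -> slope of X1 when X2 = 0
print(effect_of_x1(1))   # 6.0 -> slope of X1 when X2 = 1
```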
Interaction Example
Suppose X2 is a dummy variable and the estimated regression equation is

       Ŷ = 1 + 2X1 + 3X2 + 4X1X2

  X2 = 1:   Ŷ = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
  X2 = 0:   Ŷ = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1

  [Plot: the two lines over X1 from 0 to 1.5; the X2 = 1 line is higher and steeper]

       Slopes are different if the effect of X1 on Y depends on the X2 value
Residual Analysis
                        ei = Yi − Ŷi
• The residual for observation i, ei, is the difference between
  its observed and predicted value
• Check the assumptions of regression by examining the
  residuals
   – Examine for linearity assumption
   – Evaluate independence assumption
   – Evaluate normal distribution assumption
   – Examine for constant variance for all levels of X (homoscedasticity)

• Graphical Analysis of Residuals
   – Can plot residuals vs. X
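A brief sketch of the residuals-vs-X plot on simulated data (matplotlib assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt

# Compute residuals ei = Yi - Ŷi and plot them against X to inspect the assumptions.
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=60)
y = 3 + 1.5 * x + rng.normal(size=60)

b1, b0 = np.polyfit(x, y, 1)          # least-squares fit
residuals = y - (b0 + b1 * x)

plt.scatter(x, residuals)
plt.axhline(0, color="gray")
plt.xlabel("X")
plt.ylabel("residuals")
plt.show()                            # look for patterns, funnels, or curvature
```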
Residual Analysis for Independence

  [Residuals vs. X plots: a systematic pattern in the residuals suggests the errors
   are not independent; a patternless scatter is consistent with independence]
Residual Analysis for Equal Variance

  [Y vs. x and residuals vs. x plots: a fan or funnel shape in the residuals indicates
   non-constant variance; an even band indicates constant variance]
Linear vs. Nonlinear Fit

  [Y vs. X and residuals vs. X plots: a linear fit to curved data leaves a clear pattern
   in the residuals; a nonlinear fit gives random residuals]
Quadratic Regression Model
                  Yi = β0 + β1X1i + β2X1i² + εi

      Quadratic models may be considered when the scatter diagram takes on one of
      the following shapes:

      [Four curve shapes, distinguished by the signs of the coefficients:
       β1 < 0, β2 > 0    β1 > 0, β2 > 0    β1 < 0, β2 < 0    β1 > 0, β2 < 0]

                         β1 = the coefficient of the linear term
                         β2 = the coefficient of the squared term
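A minimal sketch (simulated data) of fitting the quadratic model by adding the squared term as a second regressor:

```python
import numpy as np

# Quadratic regression: include X1 and X1² as regressors.
rng = np.random.default_rng(6)
n = 40
x1 = rng.uniform(-3, 3, size=n)
y = 1 - 2 * x1 + 0.8 * x1**2 + rng.normal(scale=0.5, size=n)   # β1 < 0, β2 > 0

X = np.column_stack([np.ones(n), x1, x1**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)   # ≈ [1, -2, 0.8] -> [b0, b1 (linear term), b2 (squared term)]
```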
