regression analysis
A very valuable tool for today's manager.
Regression analysis is used to:
- Understand the relationship between variables.
- Predict the value of one variable based on another variable.
A regression model has:
- a dependent, or response, variable (Y axis)
- an independent, or predictor, variable (X axis)

regression analysis
Triple A Construction Company renovates old homes in Albany. The company has found that its dollar volume of renovation work depends on the Albany area payroll.

Local Payroll ($100,000,000s)   Triple A Sales ($100,000s)
3                               6
4                               8
6                               9
4                               5
2                               4.5
5                               9.5

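regression analysis model
Regression: Understand & Predict
1. Create a scatter plot.
2. Perform regression analysis.

The underlying model is \( Y = \beta_0 + \beta_1 X + \epsilon \), where:
- \( Y \) = dependent variable (response)
- \( X \) = independent variable (predictor)
- \( \beta_0 \) = intercept (value of Y when X = 0)
- \( \beta_1 \) = slope
- \( \epsilon \) = some random error that cannot be predicted
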
regression analysis model
Sample data are used to estimate the true values for the intercept and slope:
\[ \hat{Y} = b_0 + b_1 X \]
where \( \hat{Y} \) = predicted value of Y.
The difference between the actual value of Y and the predicted value (using sample data) is known as the error:
Error = (actual value) - (predicted value)
\[ e = Y - \hat{Y} \]

regression analysis model
Calculating the required parameters:

Sales (Y)   Payroll (X)   \( (X - \bar{X})^2 \)   \( (X - \bar{X})(Y - \bar{Y}) \)
6           3             1                       1
8           4             0                       0
9           6             4                       4
5           4             0                       0
4.5         2             4                       5
9.5         5             1                       2.5
Sum: 42     24            10                      12.5

\( \bar{Y} = 42/6 = 7 \), \( \bar{X} = 24/6 = 4 \)
\[ b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{12.5}{10} = 1.25 \]
\[ b_0 = \bar{Y} - b_1 \bar{X} = 7 - (1.25)(4) = 2 \]
So, \( \hat{Y} = 2 + 1.25X \).

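The slope and intercept calculation on this slide can be reproduced in a few lines. This is a minimal sketch in Python, which the slides themselves do not use; the function name `fit_line` is an illustrative choice rather than anything from the deck.

```python
# Triple A Construction data from the slides.
payroll = [3, 4, 6, 4, 2, 5]      # X, local payroll ($100,000,000s)
sales = [6, 8, 9, 5, 4.5, 9.5]    # Y, Triple A sales ($100,000s)


def fit_line(x, y):
    """Return (b0, b1) for the least-squares line Y-hat = b0 + b1 * X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared X deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(payroll, sales)
print(f"Y-hat = {b0:.2f} + {b1:.2f} X")
```

The two intermediate sums correspond to the column totals 12.5 and 10 in the table.
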
Measuring the Fit of the Linear Regression Model

To understand how well X predicts Y, we evaluate:
- Variability in the Y variable:
  SSR -> regression variability that is explained by the relationship between X and Y
  + SSE -> unexplained variability, due to factors other than the regression
  = SST -> total variability about the mean
- Correlation coefficient: r, the strength of the relationship between the Y and X variables
- Coefficient of determination: R Sq, the proportion of explained variation
- Standard error: standard deviation of the error around the regression line
- Test for linearity: significance of the regression model, i.e. the linear regression model
- Residual analysis: validation of the model

Variability
[Figure: scatter plot of Triple A sales against local payroll ($100,000,000s) with the regression line y = 1.25x + 2 (R² = 0.6944), marking SST, SSE, and SSR (explained variability) relative to the mean of Y.]

Variability
Errors (deviations) may be positive or negative. Summing the errors would be misleading, thus we square the terms prior to summing.
- Sum of Squares Total (SST) measures the total variability in Y:
  \[ SST = \sum (Y - \bar{Y})^2 \]
- Sum of the Squared Error (SSE) is less than the SST because the regression line reduced the variability:
  \[ SSE = \sum e^2 = \sum (Y - \hat{Y})^2 \]
- Sum of Squares due to Regression (SSR) indicates how much of the total variability is explained by the regression model:
  \[ SSR = \sum (\hat{Y} - \bar{Y})^2 \]
For Triple A Construction: SST = 22.5, SSE = 6.875, SSR = 15.625.
Note: SST = SSR + SSE (explained variability + unexplained variability).

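The three sums of squares can be checked directly from the Triple A data. The sketch below assumes the fitted line Y-hat = 2 + 1.25X from the earlier slide; the variable names are illustrative.

```python
payroll = [3, 4, 6, 4, 2, 5]      # X
sales = [6, 8, 9, 5, 4.5, 9.5]    # Y
b0, b1 = 2.0, 1.25                # fitted line from the slides

y_bar = sum(sales) / len(sales)
predicted = [b0 + b1 * x for x in payroll]

# Total, unexplained, and explained variability.
sst = sum((y - y_bar) ** 2 for y in sales)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)

print(f"SST = {sst}, SSE = {sse}, SSR = {ssr}")
print(f"SSR + SSE = {ssr + sse}")
```

The identity SST = SSR + SSE holds when the line is the least-squares fit; with other coefficients the two sides generally differ.
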
Coefficient of Determination
The coefficient of determination (\( r^2 \)) is the proportion of the variability in Y that is explained by the regression equation.
\[ r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \]
SST, SSR, and SSE by themselves provide little direct interpretation; \( r^2 \) measures the usefulness of the regression.
For Triple A Construction:
\[ r^2 = \frac{15.625}{22.5} = 0.6944 \]
About 69% of the variability in sales is explained by the regression based on payroll.
Note: \( 0 \le r^2 \le 1 \)

Correlation Coefficient
The correlation coefficient (r) measures the strength of the linear relationship.
\[ r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left[ n \sum X^2 - (\sum X)^2 \right] \left[ n \sum Y^2 - (\sum Y)^2 \right]}} \]
It is shown as Multiple R in the Excel output.
For Triple A Construction, r = 0.8333.
Note: \( -1 \le r \le 1 \)
[Figure: possible scatter diagrams for values of r.]

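As a check on the correlation coefficient, the computational formula can be evaluated from the raw sums of the Triple A data. This is a sketch; `correlation` is an illustrative helper name.

```python
import math

payroll = [3, 4, 6, 4, 2, 5]      # X
sales = [6, 8, 9, 5, 4.5, 9.5]    # Y


def correlation(x, y):
    """Pearson r from raw sums: n, sum X, sum Y, sum XY, sum X^2, sum Y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation(payroll, sales)
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```

For simple linear regression, squaring r gives the coefficient of determination, which is why 0.8333 squared matches 0.6944.
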
Standard error
The mean squared error (MSE) is the estimate of the error variance of the regression equation.
\[ s^2 = MSE = \frac{SSE}{n - k - 1} \]
where
n = number of observations in the sample
k = number of independent variables
The standard error s is the square root of this variance estimate. Just like a standard deviation (which is measured around the mean), it measures the variation of Y around the regression line, i.e. the standard deviation of the error around the regression line. It has the same units as Y.
For Triple A Construction, \( s^2 = 6.875 / 4 = 1.7188 \) and \( s = 1.31 \), i.e. a typical prediction error of roughly ±1.31 x $100,000 in sales.

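The MSE and the standard error follow directly from SSE, n, and k. The sketch below assumes the Triple A value SSE = 6.875 from the variability slide.

```python
import math

sse = 6.875   # sum of squared errors for Triple A Construction
n = 6         # observations in the sample
k = 1         # independent variables (payroll only)

mse = sse / (n - k - 1)         # estimate of the error variance, s^2
standard_error = math.sqrt(mse)  # s, in the same units as Y ($100,000s)

print(f"MSE = {mse:.4f}, s = {standard_error:.2f}")
```
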
Test for linearity
An F-test is used to statistically test the null hypothesis that there is no linear relationship between the X and Y variables (i.e. \( \beta_1 = 0 \)). If the significance level for the F test is low, we reject \( H_0 \) and conclude there is a linear relationship.
\[ F = \frac{MSR}{MSE}, \quad \text{where } MSR = \frac{SSR}{k} \]
The p value is the observed significance level; alpha is the chosen level of significance, i.e. 1 - confidence level. If p < alpha, reject the null hypothesis that there is no linear relationship between X and Y.
For Triple A Construction:
\[ MSR = \frac{15.625}{1} = 15.625, \quad F = \frac{15.625}{1.7188} = 9.0909 \]
The significance level for F = 9.0909 is 0.0394, indicating we reject \( H_0 \) and conclude a linear relationship exists between sales and payroll.

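The F statistic and its significance level can be reproduced as follows. This sketch makes one loud assumption: the p-value helper `p_value_f_1_4` is hypothetical and only valid for (1, 4) degrees of freedom, where F equals the square of a Student's t with 4 degrees of freedom and the t distribution has a closed-form CDF. For other degrees of freedom a statistics package would normally be used instead.

```python
import math

ssr, sse = 15.625, 6.875   # Triple A Construction
n, k = 6, 1

msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse


def p_value_f_1_4(f):
    """P(F > f) for (1, 4) degrees of freedom only (special-case formula)."""
    t = math.sqrt(f)                 # F(1, 4) is the square of t(4)
    u = 1 + t * t / 4
    # Closed-form CDF of Student's t with 4 degrees of freedom.
    cdf = 0.5 + 0.375 * (t / math.sqrt(u)) * (1 - t * t / (12 * u))
    return 2 * (1 - cdf)             # two tails of t = upper tail of F


p_value = p_value_f_1_4(f_stat)
alpha = 0.05
print(f"F = {f_stat:.4f}, p = {p_value:.4f}, reject H0: {p_value < alpha}")
```
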
Computer Software for Regression
In Excel, use Tools / Data Analysis. This is an 'add-in' option.

Computer Software for Regression
[Figure: annotated Excel regression output]
- Multiple R is the correlation coefficient.
- The adjusted R Sq takes into account the number of independent variables in the model.
- Standard error comes from the estimate of the error variance. Just like a standard deviation (which is measured around the mean), it measures the variation of Y around the regression line, i.e. the standard deviation of the error around the regression line. It has the same units as Y, so here it means roughly ±1.3 x $100,000 of sales error in prediction.
- p value < alpha (0.05 or 0.1) means the linear relationship between X and Y is statistically significant.

Residual Analysis: to verify regression assumptions are correct

Assumptions of the Regression Model
We make certain assumptions about the errors in a regression model which allow for statistical testing.
Assumptions:
- Errors are independent.
- Errors are normally distributed.
- Errors have a mean of zero.
- Errors have a constant variance.
A plot of the errors (real value minus predicted value of Y), also called residuals in Excel, may highlight problems with the model.
PITFALLS: Prediction beyond the range of X values in the sample can be misleading, including interpretation of the intercept (X = 0). A linear regression model may not be the best model, even in the presence of a significant F test.

Constant variance
Assumption: errors have constant variance.
Plot the residuals against the X values. The pattern should be random.
[Figure: residual plot against X for Triple A Construction.]
[Figure: residual plot against X showing non-constant variation in the error, a violation.]

Normal distribution
A histogram of the residuals should look like a bell curve.
Triple A Construction: it is not possible to see the bell curve with just 6 observations. More samples are needed.

zero mean
Triple A Construction: errors have zero mean.
[Figure: residuals plotted against X, scattered around 0.]

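The residuals behind these plots are easy to compute. The sketch below assumes the fitted Triple A line Y-hat = 2 + 1.25X from the earlier slide and lists the residuals in X order, which is how a residual plot against X would show them.

```python
payroll = [3, 4, 6, 4, 2, 5]      # X
sales = [6, 8, 9, 5, 4.5, 9.5]    # Y
b0, b1 = 2.0, 1.25                # fitted line from the slides

# Residual = actual value minus predicted value.
residuals = [y - (b0 + b1 * x) for x, y in zip(payroll, sales)]
mean_residual = sum(residuals) / len(residuals)

for x, e in sorted(zip(payroll, residuals)):
    print(f"X = {x}: residual = {e:+.2f}")
print(f"mean residual = {mean_residual:.4f}")
```

A least-squares line with an intercept always produces residuals that sum to zero, so the zero-mean check mainly guards against a mis-specified or hand-altered line.
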
independent errors
If the samples are collected over a period of time and not at the same time, plot the residuals against time to see if any pattern (autocorrelation) exists.
If there is substantial autocorrelation, the validity of the regression model becomes doubtful.
Autocorrelation can also be checked using the Durbin–Watson statistic.
Example: The manager of a package delivery store wants to predict weekly sales based on the number of customers making purchases for a period of 100 days. The data is collected over a period of time, so check for an autocorrelation (pattern) effect.
[Figure: residuals plotted against time showing a cyclical pattern, a violation.]

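The Durbin–Watson statistic mentioned above is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below uses made-up residual series (the slides do not give the delivery store data) purely to show how the statistic behaves: values near 2 suggest little first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residual series, not taken from the slides.
trending = [-2, -1, 0, 1, 2]       # neighbours are similar: positive autocorrelation
alternating = [1, -1, 1, -1]       # neighbours flip sign: negative autocorrelation

print(durbin_watson(trending))
print(durbin_watson(alternating))
```
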
multiple regression
Multiple regression models are similar to simple linear regression models except they include more than one X variable.
\[ \hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k \]
where \( X_1, \dots, X_k \) are the independent variables and \( b_1, \dots, b_k \) are their slopes.
Wilson Realty wants to develop a model to determine the suggested listing price for a house based on size and age.

Price    Sq. Feet   Age   Condition
35000    1926       30    Good
47000    2069       40    Excellent
49900    1720       30    Excellent
55000    1396       15    Good
58900    1706       32    Mint
60000    1847       38    Mint
67000    1950       27    Mint
70000    2323       30    Excellent
78500    2285       26    Mint
79000    3752       35    Good
87500    2300       18    Good
93000    2525       17    Good
95000    3800       40    Excellent
97000    1740       12    Mint

multiple regression
Wilson Realty has found a linear relationship between price and size and age. The coefficient for size indicates each additional square foot increases the value by $21.91, while each additional year in age decreases the value by $1449.34.
\[ \hat{Y} = 60815.45 + 21.91(\text{size}) - 1449.34(\text{age}) \]
For a 1900 square foot house that is 10 years old, the following prediction can be made:
\[ \hat{Y} = 60815.45 + 21.91(1900) - 1449.34(10) = 87951 \]
that is, about $87,951.
From the regression output:
- 67% of the variation in sales price is explained by size and age.
- \( H_0 \): no linear relationship, is rejected.
- \( H_0: \beta_1 = 0 \) is rejected.
- \( H_0: \beta_2 = 0 \) is rejected.

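The slides obtain these coefficients from Excel. As a rough illustration of what the software does, the sketch below fits the same model to the Wilson Realty table by solving the normal equations in plain Python. The helper names `solve_linear_system` and `fit_ols` are illustrative, and a numerical library would normally replace the hand-written solver.

```python
# Wilson Realty data from the slides: (price, square feet, age in years).
houses = [
    (35000, 1926, 30), (47000, 2069, 40), (49900, 1720, 30),
    (55000, 1396, 15), (58900, 1706, 32), (60000, 1847, 38),
    (67000, 1950, 27), (70000, 2323, 30), (78500, 2285, 26),
    (79000, 3752, 35), (87500, 2300, 18), (93000, 2525, 17),
    (95000, 3800, 40), (97000, 1740, 12),
]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(predictors, y):
    """Return [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in predictors]  # leading 1 = intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


prices = [h[0] for h in houses]
size_age = [(h[1], h[2]) for h in houses]
b0, b1, b2 = fit_ols(size_age, prices)

predicted = b0 + b1 * 1900 + b2 * 10  # 1900 sq ft, 10 years old
print(f"Y-hat = {b0:.2f} {b1:+.2f}(size) {b2:+.2f}(age)")
print(f"Predicted price: ${predicted:,.0f}")
```

Small differences from the rounded coefficients on the slide are expected, since the slide reports two decimal places.
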
dummy variables
Binary (or dummy) variables are special variables that are created for qualitative data.
- A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.
- The number of dummy variables must equal one less than the number of categories of the qualitative variable.
Return to Wilson Realty, and let's evaluate how to use property condition in the regression model. There are three categories: Mint, Excellent, and Good.
\( X_3 \) = 1 if the house is in excellent condition
    = 0 otherwise
\( X_4 \) = 1 if the house is in mint condition
    = 0 otherwise
Note: If both \( X_3 \) and \( X_4 \) = 0, then the house is in good condition.

dummy variables
As more variables are added to the model, the \( r^2 \) usually increases.
\[ \hat{Y} = 48329.23 + 28.21(\text{size}) - 1981.41(\text{age}) + 23684.62(\text{if mint}) + 16581.32(\text{if excellent}) \]

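To make the encoding concrete, the sketch below turns a condition label into the two dummy variables and applies the fitted equation from this slide. The function names are illustrative, and the example house (1900 square feet, 10 years old) is borrowed from the earlier prediction rather than from this slide.

```python
def condition_dummies(condition):
    """Return (x3, x4): x3 = 1 for excellent, x4 = 1 for mint, both 0 for good."""
    label = condition.strip().lower()
    if label not in {"good", "excellent", "mint"}:
        raise ValueError(f"unknown condition: {condition!r}")
    return (1 if label == "excellent" else 0, 1 if label == "mint" else 0)


def predict_price(size_sqft, age_years, condition):
    """Apply the slide's fitted equation with condition dummies."""
    x3, x4 = condition_dummies(condition)
    return (48329.23 + 28.21 * size_sqft - 1981.41 * age_years
            + 16581.32 * x3 + 23684.62 * x4)


for label in ("Good", "Excellent", "Mint"):
    print(label, condition_dummies(label), round(predict_price(1900, 10, label), 2))
```

Good is the baseline category, so the mint and excellent coefficients are price differences relative to an otherwise identical house in good condition.
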
adjusted r-Square
The best model is a statistically significant model with a high \( r^2 \) and few variables.
- As more variables are added to the model, the \( r^2 \) usually increases.
- The adjusted \( r^2 \) takes into account the number of independent variables in the model.
Note: When variables are added to the model, the value of \( r^2 \) can never decrease; however, the adjusted \( r^2 \) may decrease.

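The slide does not state the adjustment formula. The sketch below assumes the standard definition, adjusted r² = 1 - (1 - r²)(n - 1)/(n - k - 1), and applies it to the Triple A values (SSE = 6.875, SST = 22.5, n = 6, k = 1). The second call uses a hypothetical r² of 0.70 for a two-variable model to show how a small gain in r² can still lower the adjusted value.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted r^2 for n observations and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


r_squared = 1 - 6.875 / 22.5          # Triple A Construction, 0.6944
one_variable = adjusted_r_squared(r_squared, n=6, k=1)

# Hypothetical: a second variable lifts r^2 only slightly, to 0.70.
two_variables = adjusted_r_squared(0.70, n=6, k=2)

print(f"adjusted r^2 with one variable:  {one_variable:.4f}")
print(f"adjusted r^2 with two variables: {two_variables:.4f}")
```
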
multicollinearity
Collinearity or multicollinearity exists when an independent variable is correlated with another independent variable. Duplication of information occurs.
- Collinearity and multicollinearity create problems in the coefficients.
- The overall model prediction is still good; however, individual interpretation of the variables is questionable.
When multicollinearity exists, the overall F test is still valid, but the hypothesis tests related to the individual coefficients are not. A variable may appear to be significant when it is insignificant, or a variable may appear to be insignificant when it is significant.

non-linear regression
Engineers at Colonel Motors want to use regression analysis to improve fuel efficiency. They are studying the impact of weight on miles per gallon (MPG).
Linear regression model: MPG = 47.8 - 8.2 (weight)
F significance = .0003
\( r^2 \) = .7446

non-linear regression
We should not try to interpret the coefficients of the variables due to the correlation between \( X_1 \) (weight) and \( X_2 \) (weight squared). Normally we would interpret the coefficient for \( X_1 \) as the change in Y that results from a 1-unit change in \( X_1 \), while holding all other variables constant. Obviously, holding one variable constant while changing the other is impossible in this example, since \( X_2 = X_1^2 \): if \( X_1 \) changes, then \( X_2 \) must change also. This is an example of a problem that exists when multicollinearity is present.

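The correlation between weight and weight squared is easy to see numerically. The slides do not list the Colonel Motors data, so the weights below are hypothetical values chosen only for illustration, and `pearson_r` is an illustrative helper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical car weights (thousands of pounds), not from the slides.
weights = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
weights_squared = [w * w for w in weights]

r = pearson_r(weights, weights_squared)
print(f"correlation between weight and weight squared: {r:.3f}")
```

A correlation this close to 1 between two predictors is the duplication of information that makes the individual coefficients hard to interpret, even when the overall model predicts well.
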