The relationship between the explanatory variable X and the response variable Y is not always accurately reflected in the coefficient of X; it depends on which other X's are included or not included in the equation.
This is especially true when there is a linear relationship between two or more explanatory variables, in which case we have multicollinearity.
By definition multicollinearity is the presence of a fairly strong linear relationship between two or more explanatory variables, and it can make estimation difficult.
The coefficient of Right indicates the right foot's effect on Height in addition to the effect of the left foot. This additional effect is probably minimal. That is, after the effect of Left on Height has already been taken into account, the extra information provided by Right is probably minimal. But it goes the other way also: the extra effect of Left, in addition to that provided by Right, is probably minimal.
The multiple R and the corresponding R² are about what we would expect, given the correlations between Height and either Right or Left.
In particular, the multiple R is close to the correlation between Height and either Right or Left. Also, the s_e value is quite good. It implies that predictions of height from this regression equation will typically be off by only about 2 inches.
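The point that Right adds little beyond Left (and vice versa) can be illustrated with a small simulation. This is a minimal sketch with made-up foot-length data, not the book's data set; it shows that when two predictors are nearly identical, adding the second one barely changes R².

```python
import numpy as np

# Hypothetical data illustrating multicollinearity: Left and Right foot
# lengths are nearly identical, so Right adds little beyond Left.
rng = np.random.default_rng(0)
n = 100
left = rng.normal(11.0, 0.8, n)                  # left foot length (inches)
right = left + rng.normal(0.0, 0.05, n)          # right foot tracks left foot
height = 30 + 3.5 * left + rng.normal(0, 2, n)   # height driven by foot size

def r_squared(X, y):
    """R^2 from an OLS fit of y on X (with an intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_left_only = r_squared(left.reshape(-1, 1), height)
r2_both = r_squared(np.column_stack([left, right]), height)
print(r2_left_only, r2_both)  # nearly identical: Right adds almost nothing
```

The multiple R² with both feet is never lower than with one, but the improvement is negligible, which is exactly the symptom described above.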
The coefficient of Gender implies that an average male customer spent about $130 less than the average female customer. Similarly, an average customer living close to stores with this type of merchandise spent about $288 less than those customers living far from stores.
The coefficient of Salary implies that, on average, about 1.5 cents of every salary dollar was spent on HyTex merchandise.
Interpretation of Final Regression Equation -- continued
The coefficient of Children implies that $158 less was spent for every extra child living at home.
The Customer97 and Spent97 terms are somewhat more difficult to interpret.
First, both of these terms are 0 for customers who didn't purchase from HyTex in 1997.
For those who did, the terms become -724 + 0.47Spent97.
The coefficient 0.47 implies that each extra dollar spent in 1997 can be expected to contribute an extra 47 cents in 1998.
The median spender in 1997 spent about $900. Substituting this for Spent97 gives -724 + 0.47(900) = -301.
Therefore, this “median” spender from 1997 can be expected to spend about $301 less in 1998 than the 1997 nonspender.
The coefficient of Catalog implies that each extra catalog can be expected to generate about $43 in extra spending.
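The 1997-purchaser terms above can be sketched as a small function. The coefficients (-724 and 0.47) come from the text; the intercept and the other terms of the equation are omitted, so this computes only the 1997 adjustment relative to a nonspender.

```python
# Sketch of the 1997-purchaser terms from the final HyTex equation.
def adjustment_1997(spent97):
    """Extra expected 1998 spending (vs. a 1997 nonspender) for a 1997 purchaser."""
    return -724 + 0.47 * spent97

median_adj = adjustment_1997(900)  # the median 1997 spender spent about $900
print(median_adj)                  # about -301: $301 less than a nonspender
```

Each extra 1997 dollar raises the adjustment by 47 cents, matching the interpretation of the 0.47 coefficient.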
When we validate this final regression equation with the 750 customers, using the procedure from Section 11.7, we find R² and s_e values of 75.7% and $485.
These aren’t bad. They show little deterioration from the values based on the original 250 customers.
We haven’t tried all possibilities yet: we haven’t tried nonlinear or interaction variables, nor have we looked at different coding schemes; we haven’t checked for nonconstant error variance or looked at the potential effects of outliers.
First, we will regress Salary on the Female dummy, YrsExper, and the interaction between Female and YrsExper, labeled Fem_YrsExper. This will be the reduced equation.
Then we’ll see whether the JobGrade dummies Job_2 to Job_6 add anything significant to the reduced equation. If so, we will then see whether the interactions between the Female dummy and the JobGrade dummies, labeled Fem_Job2 to Fem_Job6, add anything significant to what we already have.
First, note that we created all of the dummies and interaction variables with StatPro’s Data Utilities procedures.
Also, note that we have used three sets of dummies, for gender, job grade, and education level.
When we use these in a regression equation, the dummy for one category of each should always be excluded; it is the reference category. The reference categories we have used are “male”, job grade 1 and education level 1.
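The reference-category idea can be sketched with pandas. The column names and data below are hypothetical; the point is that one category per variable is dropped (here "male" and job grade 1, as in the text) and becomes the baseline that the remaining dummies are measured against.

```python
import pandas as pd

# Hypothetical customer rows; only the coding scheme matters here.
df = pd.DataFrame({
    "Gender":   ["male", "female", "female", "male"],
    "JobGrade": [1, 2, 3, 1],
})

# Order Gender so that "male" is first and therefore becomes the
# dropped reference category under drop_first=True.
df["Gender"] = pd.Categorical(df["Gender"], categories=["male", "female"])

dummies = pd.get_dummies(df, columns=["Gender", "JobGrade"], drop_first=True)
print(list(dummies.columns))  # ['Gender_female', 'JobGrade_2', 'JobGrade_3']
```

No dummy exists for "male" or for job grade 1: their effects are absorbed into the intercept, and each remaining dummy's coefficient measures the difference from that reference category.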
The degrees of freedom in cell C28 is the same as the value in cell C12, the degrees of freedom for SSE .
Then we calculate the F -ratio in cell C29 with the formula =((Reduced!D12-D12)/C27)/E12, where Reduced!D12 refers to SSE for the reduced equation from the Reduced sheet.
Finally, we calculate the corresponding p -value in cell C30 with the formula =FDIST(C29,C27,C28) . It is practically 0, so there is no doubt that the job grade dummies add significantly to the explanatory power of the equation.
Third, producing all of these outputs and doing the partial F tests is a lot of work. Therefore, we included a “Block” option in StatPro to make life easier. To run the analysis in this example, use the StatPro/Regression analysis/Block menu item. After selecting Salary as the response variable, we see this dialog box.
We want four blocks of explanatory variables, and we want a given block to enter only if it passes the partial F test at the 5% level. In later dialog boxes we specify the explanatory variables. Once we have specified all of this, the regression calculations are done in stages. The output appears on the next two slides, spanning two figures. Note that the output for Block 4 has been left off because it did not pass the F test at the 5% level.
Finally, we have concentrated on the partial F test and statistical significance in this example. We don’t want you to lose sight, however, of the bigger picture. Once we have decided on a “final” regression equation we need to analyze its implications for the problem at hand.
In this case the bank is interested in possible salary discrimination against females, so we should interpret this final equation in these terms. Our point is simply that you shouldn’t get so caught up in the details of statistical significance that you lose sight of the original purpose of the analysis!
So a change from -1.021 to -0.721 indicates less discrimination against females now than before. In other words, this unusual female employee accounts for a good bit of the discrimination argument -- although a strong argument still exists even without her.
Consider the following three male employees at Fifth National:
Employee 5 makes $29,000, is in job grade 1, and has 3 years of experience at the bank.
Employee 156 makes $45,000, is in job grade 4, and has 6 years of experience at the bank.
Employee 198 makes $60,000, is in job grade 6, and has 12 years of experience at the bank.
Using a regression equation for Salary that includes the explanatory variables Female, YrsExper, Fem_YrsExper, and the job grade dummies Job_2 to Job_6, check that the predicted salaries for these three employees are close to their actual salaries.
To see what would happen if these employees were females, we need to adjust the values of the explanatory variables Female and Fem_YrsExper.
For each employee in rows 227-229, the value of Female becomes 1 and the value of Fem_YrsExper becomes the same as YrsExper. Copying the formula in A222 down to these rows gives the predicted salaries for the females.
One way to compare females to males is to enter the formula =(A227-B222)/$B$216 in cell B227 and copy it down.
This is the number of standard errors the predicted female salary is above (if positive) or below (if negative) the actual male salary.
As we discussed earlier with this data set, females with only a few years of experience actually tend to make more than males. But the opposite occurs for employees with many years of experience.
For example, male employee 198 is earning just about what the regression equation predicts he should earn. But if he were female, we would predict a salary about $4500 lower, almost a full standard error below his actual salary.
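The comparison for employee 198 is just the salary gap divided by the standard error of estimate, as in the spreadsheet formula above. In this sketch, the $60,000 actual salary and roughly $4,500 gap come from the text, while the s_e value of $4,800 is an assumption chosen so that the gap is "almost a full standard error".

```python
actual_male = 60000.0        # employee 198's actual salary (from the text)
predicted_female = 55500.0   # predicted salary if female: about $4,500 lower
s_e = 4800.0                 # assumed standard error of estimate (hypothetical)

# Number of standard errors the what-if prediction falls below the actual salary,
# mirroring the spreadsheet formula =(A227-B222)/$B$216.
z = (predicted_female - actual_male) / s_e
print(round(z, 2))  # about -0.94: almost a full standard error lower
```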
To obtain the predicted sales for these regions, we use Excel’s TREND function by highlighting the range H9:H13, typing the formula =TREND(SalesOld,PromoteOld,PromoteNew) and pressing Ctrl-Shift-Enter.
This substitutes the new values of the explanatory variable (in the third argument) into the regression equation based on the data from the first two arguments.
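The same substitution can be sketched in Python: fit the regression on the old data, then evaluate the fitted equation at the new values of the explanatory variable. The data below are invented stand-ins for the SalesOld, PromoteOld, and PromoteNew ranges.

```python
import numpy as np

# Python analogue of Excel's TREND(SalesOld, PromoteOld, PromoteNew):
# fit a simple regression on the old data, then substitute new x values.
promote_old = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # hypothetical old data
sales_old   = np.array([20.0, 28.0, 35.0, 45.0, 52.0])
promote_new = np.array([4.0, 6.0, 8.0])              # new regions' promotion levels

slope, intercept = np.polyfit(promote_old, sales_old, deg=1)
sales_pred = intercept + slope * promote_new         # what TREND would return
print(sales_pred)
```

Like TREND, this returns one predicted value per new explanatory value, all from the single fitted equation.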
The more exact standard error of prediction depends on the value of Promote: it is s_e*sqrt(1 + 1/n + (x - xbar)^2/sum((x_i - xbar)^2)), where the sum of squared deviations equals (n-1) times the sample variance. To calculate it we enter the formula =$H$6*SQRT(1+1/50+(G9-AVERAGE(PromoteOld))^2/(49*STDEV(PromoteOld)^2)) in cell I9 and copy it down through cell I13.
We then calculate the lower and upper limits of 95% prediction intervals in columns J and K. These use the t-multiple in cell I3, obtained with the formula =TINV(0.05,73).
We have gone through these rather tedious calculations to make several points.
First, the approximate standard errors of prediction based on s_e are usually quite accurate. This is fortunate because the exact standard errors are difficult to calculate and are not always given in statistical software packages.
Second, a simple rule of thumb for calculating individual 95% prediction intervals is to go out an amount 2s_e on either side of the predicted value. Again, this is not exactly correct, but as the calculations in this example indicate, it works quite well.
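The rule of thumb works because the exact standard error of prediction barely exceeds s_e unless x is far from its mean. This sketch uses made-up explanatory data and an assumed s_e of 2.0 to compare the exact formula against the plain s_e used by the 2s_e rule.

```python
import numpy as np

# Exact standard error of prediction:
#   s_pred = s_e * sqrt(1 + 1/n + (x_new - xbar)^2 / sum((x_i - xbar)^2))
x = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 4.0, 6.0, 8.0, 10.0, 12.0])  # made-up data
n = len(x)
s_e = 2.0                                # assumed standard error of estimate
sxx = ((x - x.mean()) ** 2).sum()        # sum of squared deviations of x

def s_pred(x_new):
    return s_e * np.sqrt(1 + 1/n + (x_new - x.mean()) ** 2 / sxx)

print(round(s_pred(x.mean()), 3))        # at the mean: only s_e * sqrt(1 + 1/n)
print(round(s_pred(x.max() + 2.0), 3))   # well outside the data: somewhat larger
```

Near the mean of x the correction factor is close to 1, so ±2s_e is a very good approximation; only for predictions far from the observed data does the exact interval widen noticeably.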
Finally, we see from the wide prediction intervals how much uncertainty remains.
The reason is the relatively large standard error of estimate, s_e.
Contrary to what you may believe, this is not a sample size problem.
The whole problem is that Promote is not highly correlated with Sales. The only way to decrease s e and get more accurate predictions is to find other explanatory variables that are more closely related to Sales.