BUS 305: SOLUTIONS TO
PRACTICE PROBLEMS EXAM 2
1) B
2) B
3) No, fan pattern (heteroscedasticity)
4) No, nonlinear relationship between X and Y
5) The black line is the regression line because it gets closest to
the sample points (it minimizes the error between the points and the
line). The red line has a larger error; that is, a larger total
squared vertical distance from the points to the line.
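The idea in answer 5 can be checked numerically. The following is a minimal Python sketch, not part of the original solutions: it uses the production volume and cost data from Question 7, and the second line (intercept 150, slope -1) is a made-up competitor chosen only for illustration.

```python
# Data from Question 7 (production volume, cost).
x = [20, 25, 28, 30, 35, 40, 45]
y = [137, 122, 123, 120, 106, 109, 97]


def sse(intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Least-squares slope and intercept from the usual formulas.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

sse_regression = sse(b0, b1)
sse_other = sse(150.0, -1.0)  # hypothetical competing line, for comparison only

print(f"SSE of regression line: {sse_regression:.3f}")
print(f"SSE of other line:      {sse_other:.3f}")
```

Any other straight line you try in place of the hypothetical one should give an SSE at least as large as the regression line's.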
6) It is reasonable to suppose that costs depend on production
volume (producing units directly results in costs), so regression
is more appropriate for this data: regression is appropriate
when a cause-and-effect relationship is assumed.
7) C
8) a) r = 0.8;
b) T = 1.31;
c) p = 0.117
d) There is no evidence of a significant correlation between X
and Y in the population because we did not reject the null of
H0: ρ = 0.
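For part (a), r is the square root of r-squared, with the same sign as the slope. A small Python sketch of that step (the variable names are illustrative, not from the original):

```python
import math

r_squared = 0.64
t_stat = 1.31  # slope t-statistic given in the question; its sign is the slope's sign

# r has the magnitude sqrt(r-squared) and the sign of the slope.
r = math.copysign(math.sqrt(r_squared), t_stat)
print(round(r, 4))
```

In simple regression the test of the slope and the test of the correlation are the same test, which is why the t-statistic and p-value carry over unchanged in parts (b) and (c).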
9) Note: the following are not complete answers to Question 9;
they are just enough for you to know whether your short answer
addressed the correct things.
a) β1 = population slope, b1 = sample slope. On the exam, you
would also want to address what you know (or don’t know)
about each of these and how each is found.
b) An outlier can “drag” the regression line toward it. On the
exam, also think about how this would affect the quality of your
regression model and the predictions.
10) Yes, there appears to be a straight-line relationship between
the variables. Linear regression appears to be appropriate. The
regression output is the first SUMMARY OUTPUT below.
11) a) T = -0.09, p = 0.929; do not reject H0 and conclude there is
no evidence of a relationship.
b) R-squared = 0.002 = 0.2%. No, because the value is very close to
zero.
c) Correlation = r = -0.0421. No, there is not a strong
relationship between these variables. The correlation is nearly
0.
d) Regression line is Y^ = 1.26 – 0.035X.
Y^ = 1.26 – 0.035(100) = 1.26 – 3.5 = -2.24. No, this does
not make sense because you cannot have a negative number of
near misses. It is not wise to predict with this model. The
R-squared value is extremely low (essentially 0%), which means
that flights explain essentially none of the variation in near
misses in this data. Therefore, predicting misses from flights is
meaningless.
e) b1 = -0.035. As the number of flights increases by 1, we expect
the number of near misses to go down by 0.035. Or, put another
way, as flights increase by 1000, we expect the number of near
misses to go down by 35. No, this does not make sense. We
would assume that as flights increase, so would near misses.
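The arithmetic in parts (d) and (e) can be reproduced with a short Python sketch. The coefficients are the ones reported above; the function name is just for illustration.

```python
# Fitted line from part (d): Y^ = 1.26 - 0.035 X
b0 = 1.26
b1 = -0.035


def predict_near_misses(num_flights):
    """Predicted near misses for a given number of flights."""
    return b0 + b1 * num_flights


pred_100 = predict_near_misses(100)   # part (d): prediction at X = 100
change_per_1000 = b1 * 1000           # part (e): slope scaled to 1000 flights

print(f"Prediction at 100 flights: {pred_100:.2f}")
print(f"Change per 1000 flights:   {change_per_1000:.0f}")
```

The negative prediction is the warning sign: the line is being used where it has no explanatory power.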
12) a. Multiple regression is a direct extension of simple
regression, except that now we have more than one independent
(X) variable.
b. Note: the following is not a complete answer; it is
just enough for you to know whether your short answer
addressed the correct things: Multicollinearity is when the
independent variables are highly correlated with one another.
On the exam, also indicate how this affects the model, how one
can identify if it is present, and what can be done to correct it.
c. Dummy variables are used to incorporate categorical
variables into a regression model. A dummy variable is added
that is “1” if the person/item has the characteristic and “0” if it
does not.
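As an illustration of part (c), here is a minimal Python sketch of building a dummy variable. The records and the "transmission" characteristic are hypothetical and do not come from the exam data.

```python
# Hypothetical records with one categorical characteristic.
cars = [
    {"model": "Car A", "transmission": "manual"},
    {"model": "Car B", "transmission": "automatic"},
    {"model": "Car C", "transmission": "manual"},
]

# Dummy variable: 1 if the item has the characteristic, 0 if it does not.
for car in cars:
    car["is_manual"] = 1 if car["transmission"] == "manual" else 0

dummy_column = [car["is_manual"] for car in cars]
print(dummy_column)
```

The 0/1 column can then be used as an X variable in the regression like any other numeric column.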
13) B
14) See the second SUMMARY OUTPUT below (82 observations).
15) a) Since the p-value associated with the F-statistic is
very small (note: 2.45E-10 means to move the decimal point 10
places to the LEFT, i.e. 0.000000000245), we would reject the
null that says that none of the independent variables (Orig_Price
and MSRP) has an effect on price. Therefore, we conclude that at
least one of these X variables does have an effect on or
relationship with price.
b) Orig_Price does affect Price, since p = 1.031E-09 =
0.000000001031 < 0.01; reject H0: β = 0.
MSRP does NOT, since p = 0.475 > 0.10; do not reject H0: β = 0.
c) Regression equation: Y^ = -7.62 + 1.01X1 – 0.08X2;
prediction: 65.18
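d) MSRP: -0.08 (sale price is expected to go down by 0.08 for each
1-unit increase in MSRP, holding Orig_Price fixed). Orig_Price: 1.01
(sale price is expected to go up by 1.01 for each 1-unit increase
in Orig_Price, holding MSRP fixed).
e) R-squared = 0.866. This is a good model because R-squared
is close to 1 (100%); thus I would feel pretty confident that my
predictions would be fairly accurate in this case.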
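The scientific notation in part (a) and the prediction in part (c) can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative.

```python
# Part (a): Excel's 2.45E-10 written out as a decimal.
p_value_f = float("2.45E-10")
p_value_text = f"{p_value_f:.12f}"
print(p_value_text)

# Part (c): Y^ = -7.62 + 1.01*X1 - 0.08*X2, with X1 = Orig_Price, X2 = MSRP.
b0, b_orig_price, b_msrp = -7.62, 1.01, -0.08
predicted_price = b0 + b_orig_price * 80 + b_msrp * 100
print(f"{predicted_price:.2f}")
```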
16) Model 1: The first model run states that MPG is a linear
function of: EngineSize, CabSpace, HorsePower, TopSpeed, and
Weight. When that model is run, we find:
· R-square = 0.873
· Adjusted r-square = 0.865
· Significant variables: Horsepower, TopSpeed, Weight
· Insignificant variables: EngineSize, CabSpace
Because we have two insignificant variables, take them
out.
Model 2: This model states that MPG is a linear function
of HorsePower, TopSpeed, and Weight. We find that:
· R-square = 0.873
· Adjusted r-square = 0.868
· Significant variables: Horsepower, TopSpeed, Weight
· Insignificant variables: none
Taking out EngineSize and CabSpace did not change the
R-squared value at all, and adjusted R-squared rose slightly (0.865
to 0.868). Apparently, EngineSize and CabSpace did not explain
any additional variation in MPG, so removing them results in a
better model (simpler with no loss of explanatory power). Since
all of the independent variables left are significant, we take this
as the best model (removing any more would decrease R-squared).
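The adjusted R-squared values for the two models follow from the standard formula, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - k - 1), where n is the number of observations and k the number of X variables. A short Python sketch, assuming n = 82 cars as stated in Question 16 and the rounded R-squared of 0.873 reported above:

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


n_cars = 82  # from Question 16

adj_model_1 = adjusted_r_squared(0.873, n_cars, k=5)  # five X variables
adj_model_2 = adjusted_r_squared(0.873, n_cars, k=3)  # three X variables

print(f"Model 1 adjusted R-squared: {adj_model_1:.3f}")
print(f"Model 2 adjusted R-squared: {adj_model_2:.3f}")
```

With R-squared unchanged, dropping two variables raises adjusted R-squared, which is why Model 2 is preferred.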
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.9583
R Square            0.9183
Adjusted R Square   0.9020
Standard Error      4.1442
Observations        7
ANOVA
             df   SS         MS        F        Significance F
Regression   1    965.556    965.556   56.221   0.00067
Residual     5    85.872     17.174
Total        6    1051.429
             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    162.7007       6.3854           25.4802   0.0000    146.2865    179.1148
ProdVolume   -1.4570        0.1943           -7.4980   0.0007    -1.9565     -0.9575
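The main numbers in this output can be reproduced by hand. The sketch below is plain Python and assumes that the data behind the output are the seven ProdVolume/Cost pairs listed in Question 7; that assumption is not stated in the original, but the results agree with the table.

```python
import math

# Assumed data: the seven (ProdVolume, Cost) pairs from Question 7.
x = [20, 25, 28, 30, 35, 40, 45]
y = [137, 122, 123, 120, 106, 109, 97]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Coefficients
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# ANOVA sums of squares
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ssr = sst - sse

r_squared = ssr / sst
mse = sse / (n - 2)            # residual df = n - 2
f_stat = ssr / mse             # regression df = 1, so MSR = SSR
std_error = math.sqrt(mse)     # "Standard Error" in Regression Statistics
se_b1 = std_error / math.sqrt(sxx)
t_b1 = b1 / se_b1

print(f"Intercept {b0:.4f}, slope {b1:.4f}")
print(f"R Square {r_squared:.4f}, Standard Error {std_error:.4f}")
print(f"F {f_stat:.3f}, t Stat for slope {t_b1:.4f}")
```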
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.9346
R Square            0.8734
Adjusted R Square   0.8651
Standard Error      3.6750
Observations        82
ANOVA
             df   SS            MS         F         Significance F
Regression   5    7081.047344   1416.209   104.862   1.19E-32
Residual     76   1026.415217   13.50546
Total        81   8107.462561
             Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
4) If the scatterplot below depicted a set of bivariate data with
independent variable X and dependent variable Y, would a
regression model be appropriate for this data? Why or why not?
5) Which of the following would represent the regression line
for this data set? Why? Explain what characteristic of the line
makes it the regression line.
6) Suppose your company is interested in discovering if there is
a relationship or correlation between production volume (in
number of units) and costs (in $). Which would be more
appropriate for this data – to run a correlation analysis or to run
a regression analysis? Explain.
7) Suppose your company is interested in discovering if there is
a relationship between production volume (in number of units)
and costs (in $).
ProdVolume   Cost
20           137
25           122
28           123
30           120
35           106
40           109
45           97
Which of the following is the most appropriate statistical
analysis to run?
A. ANOVA
B. Multiple linear regression
C. Simple linear regression
D. T-test for the mean of a single population
8) Suppose you run a regression for variables X and Y, and find
that r² = 0.64, that the t-statistic for the hypothesis test H0: β1 =
0 is 1.31, and that the p-value for that test is 0.117. Then:
a) r = ______________
b) t-statistic for the hypothesis test H0: ρ = 0 equals (give a
number):_____________
c) p-value for the hypothesis test H0: ρ = 0 equals (give a
number): _________________
d) What do you conclude about the existence of a
significant correlation between X and Y in the population?
Explain.
9) Provide about one or two sentences to answer each question.
a) In a simple regression model, what is the difference
between β1 and b1?
b) Why are outliers problematic in a multiple regression
model?
10) Given the following data and scatterplot, determine if a
simple linear regression model is appropriate for this data. If so,
generate the regression output using StatCrunch or Excel. If not,
explain why linear regression is not appropriate.
ProdVolume   Cost
40           109
45           97
11) When answering questions (a) and (b) below, refer to the
following StatCrunch output from a regression model that
asserts that the number of near misses per year (Y) of
commercial airliners is a linear function of the number of
flights per year (X).
(a) Test for a linear relationship between near_misses and
num_flights by reading the appropriate values from the output
above. Be sure to indicate a test statistic, a p-value, and a
conclusion as to whether or not there is a relationship.
(b) What percentage of the variation in the number of near
misses is explained by the number of flights? Do you think this
is a good regression model?
(c) What is the correlation between misses and flights? Is
there a strong relationship between these variables? Explain.
(d) Write the regression line and then use it to calculate the
predicted number of near misses if the number of flights is 100.
Does this prediction make sense? Explain. Is it wise to make
predictions with this model? Why or why not? (Refer to a part
of the output to back up your conclusions.)
(e) Interpret the value of b1, the sample slope. Does this value
appear to make sense? Explain.
Multiple Regression Problems
12) Provide one or two sentences to answer each of these
questions.
a. Briefly explain the difference between multiple and simple
regression.
b. What is multicollinearity in a multiple regression model, and
why is it problematic?
c. How do you incorporate qualitative/categorical variables into
a regression model? Be specific about what kind of variable is
added to the model and what values that variable can be.
13) Suppose you want to try to estimate the miles per gallon of
various car types by using their engine size (number of
cylinders), cab space, horsepower, top speed and weight.
Which of the following is the most appropriate statistical
analysis to run?
A. ANOVA
B. Multiple linear regression
C. Simple linear regression
D. T-test for the mean of a single population
14) Given the following data set, generate the multiple
regression output for the model that states that MPG of a car is
a linear function of EngineSize, CabSpace, Horsepower,
TopSpeed, and Weight . Use StatCrunch or Excel. (See Excel
file, PracticeExam2data.xlsx to copy the entire data set.)
MAKE/Model           EngineSize   CabSpace   HorsePower   TopSpeed   Weight   MPG
GM/GeoMetroXF1       4            89         49           96
...
                                                                              16.7
Rolls-RoyceVarious   8            107        236          130        55       13.2
15) Use the following Excel output from a multiple regression
model to answer questions (a) - (e). The model asserts that the
sale price of an item is a function of both the original price, and
the manufacturer’s suggested retail price (MSRP).
a) What do the F-statistic and its p-value tell you
about the overall significance of the model in terms of the
effects of Orig_Price and MSRP on the price of an item?
b) Which, if any, of the independent variables appear to
affect the sale price (Y)? Indicate any numbers from the table
you used to arrive at this conclusion.
c) State the regression equation and use it to predict the
value of Y (sale price) corresponding to Original Price = 80 and
MSRP = 100.
d) How much can you expect the sale price (Y) to
increase as the MSRP increases by 1 unit? As Orig_Price
increases by one unit?
e) How good/effective is this model? Are you
comfortable using this regression equation to predict prices?
Why or why not?
16) Consider the data in the file PracticeExam2data.xls. This
data shows 82 cars and measures several characteristics of
each. Use this data to develop the BEST/most efficient multiple
regression model for predicting how many miles per gallon
(MPG) that vehicles get (you may have to run more than