Problem set 3 - Statistics and Econometrics - Msc Business Analytics - Imperial College London

Problem set 3
Jonathan Zimmermann
31 October 2015
Exercise 1
Suppose you collect data from a survey on wages, education, experience, and gender. In
addition, you ask for information about marijuana usage. The original question is: “On how
many separate occasions last month did you smoke marijuana?”
a) Write an equation that would allow you to estimate the effects of marijuana usage on wage, while
controlling for other factors. You should be able to make statements such as, “Smoking marijuana five
more times per month is estimated to change wage by x%.”
>>
To interpret the coefficients as percentage changes, we need to build a log-linear model. The regression equation
would look like this:
log(wage) = β0 + β1marijuana_usage + β2education + β3experience + δ1gender + u
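The percentage statement asked for in the question follows from the log-linear form. As a short worked step (a sketch, where Δ denotes a change in monthly usage and all other regressors are held fixed):

```latex
\Delta \log(\text{wage}) = \beta_1 \, \Delta \text{marijuana\_usage}
\qquad\Rightarrow\qquad
\%\Delta \text{wage} \approx 100 \, \beta_1 \, \Delta \text{marijuana\_usage}

% Five more occasions per month:
\%\Delta \text{wage} \approx 100 \cdot 5 \, \beta_1 = 500 \, \beta_1,
\qquad
\text{exact: } 100 \left[ \exp(5 \beta_1) - 1 \right]
```

The approximation is close when 5β1 is small; for larger effects the exact expression is the safer one to report.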
b) Write a model that would allow you to test whether drug usage has different effects on wages for men
and women. How would you test that there are no differences in the effects of drug usage for men and
women?
>>
We would need to add an interaction variable between the gender and the marijuana variables. The new
regression equation would look like this:
log(wage) = β0 + β1marijuana_usage + β2education + β3experience + δ1gender + δ2gender ∗ marijuana_usage + u
To test whether there are differences in the effects of drug usage for men and women, we could test the
following hypothesis with a t-test:
H0 : δ2 = 0
H1 : δ2 ≠ 0
To perform the t-test, we would first calculate the t-statistic from the OLS estimate of δ2 (written δ̂2) and its standard error:
t = (δ̂2 − 0) / se(δ̂2)
We would then look for the critical value based on the (1 − α/2) percentile of the t distribution with n − k − 1 = n − 6
degrees of freedom (k = 5 regressors). If the absolute value of the t-statistic is greater than the critical value, we would
reject H0.
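The test of the interaction coefficient can be sketched in R as follows. The data are simulated purely for illustration: the data frame `survey`, its column names and every number used to generate it are assumptions, not survey results.

```r
# Hypothetical simulated survey data; names and numbers are illustrative only.
set.seed(1)
n <- 500
survey <- data.frame(
  marijuana_usage = rpois(n, 2),              # occasions last month
  education       = round(rnorm(n, 13, 2)),   # years of schooling
  experience      = round(runif(n, 0, 30)),   # years of experience
  gender          = rbinom(n, 1, 0.5)         # 0/1 dummy
)
# Made-up data-generating process for log(wage)
survey$lwage <- 1.5 - 0.02 * survey$marijuana_usage + 0.08 * survey$education +
  0.01 * survey$experience - 0.15 * survey$gender -
  0.03 * survey$gender * survey$marijuana_usage + rnorm(n, sd = 0.4)

# Model from b): main effects plus the gender x usage interaction
fit <- lm(lwage ~ marijuana_usage + education + experience + gender +
            gender:marijuana_usage, data = survey)

coefs <- summary(fit)$coefficients
ix    <- grep(":", rownames(coefs), fixed = TRUE)  # row of the interaction term

# t = (estimate - 0) / standard error
t_stat <- (coefs[ix, "Estimate"] - 0) / coefs[ix, "Std. Error"]
df     <- df.residual(fit)            # n - k - 1 = 500 - 5 - 1
crit   <- qt(1 - 0.05 / 2, df)        # two-sided critical value at alpha = 5%
reject_h0 <- abs(t_stat) > crit
```

The same t-statistic and its two-sided p-value also appear directly in the `summary(fit)` coefficient table, so the manual computation is only there to make the formula explicit.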
c) Suppose you think it is better to measure marijuana usage by putting people into one of four categories:
nonuser, light user (1 to 5 times per month), moderate user (6 to 10 times per month), and heavy
user (more than 10 times per month). Now, write a model that allows you to estimate the effects of
marijuana usage on wage.
>>
Incorporating this change into the model in a), we would have:
log(wage) = β0 + β2education + β3experience + δ1gender + δ2light_user + δ3moderate_user + δ4heavy_user + u
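Nonusers are the omitted base group, so δ2, δ3 and δ4 measure the approximate proportional wage difference of light, moderate and heavy users relative to nonusers, holding the other factors fixed. The coefficients can be estimated by running the regression as usual.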
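As a minimal sketch of how the four categories could be built from the original count variable, the snippet below bins a few made-up usage counts with `cut()`. The vector `usage` and the label names are illustrative assumptions.

```r
# Hypothetical monthly usage counts (illustrative values only)
usage <- c(0, 1, 5, 6, 10, 11, 25)

# Bin into the four categories; with the default right = TRUE the intervals are
# (-Inf, 0], (0, 5], (5, 10] and (10, Inf]
category <- cut(usage,
                breaks = c(-Inf, 0, 5, 10, Inf),
                labels = c("nonuser", "light_user", "moderate_user", "heavy_user"))

# model.matrix() shows the dummy columns lm() would create from the factor;
# the first level ("nonuser") is dropped and acts as the base group
dummies <- model.matrix(~ category)
colnames(dummies)
```

Passing `category` directly to `lm()` would produce the same three dummies, which correspond to δ2, δ3 and δ4 in the model.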
d) Using the model in part (c), explain in detail how to test the null hypothesis that marijuana usage has
no effect on wage.
>>
We would need to test the following hypothesis (i.e. we want to test whether δ2, δ3 and δ4 are
jointly significant), using an F-test:
H0 : δ2 = 0 AND δ3 = 0 AND δ4 = 0
H1 : H0 is false
Let’s call the model in c) the “unrestricted model”. The “restricted model” would then be:
log(wage) = β0 + β2education + β3experience + δ1gender + u
We then calculate the F-statistic, using the following formula:
F = [(SSRrestricted − SSRunrestricted)/q] / [SSRunrestricted/(n − k − 1)]
where q = number of restrictions = 3 (because we test three parameters) and k = number of variables in the
unrestricted model = 6.
We would then reject H0 if the F-statistic is higher than the critical value of the F distribution
with d1 = q and d2 = n − k − 1 degrees of freedom at the chosen significance level.
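The procedure can be sketched in R on simulated data. Everything about the data frame `d` is a hypothetical illustration: the sample size, the way usage is drawn and the coefficients are assumptions, and usage is given no true effect here.

```r
# Hypothetical simulated data; variable names follow the model in c)
set.seed(2)
n <- 400
d <- data.frame(
  education  = round(rnorm(n, 13, 2)),
  experience = round(runif(n, 0, 30)),
  gender     = rbinom(n, 1, 0.5),
  usage      = sample(0:15, n, replace = TRUE)   # occasions last month
)
d$light_user    <- as.numeric(d$usage >= 1 & d$usage <= 5)
d$moderate_user <- as.numeric(d$usage >= 6 & d$usage <= 10)
d$heavy_user    <- as.numeric(d$usage > 10)
# Made-up log wage with no true usage effect
d$lwage <- 1.5 + 0.08 * d$education + 0.01 * d$experience - 0.1 * d$gender +
  rnorm(n, sd = 0.4)

unrestricted <- lm(lwage ~ education + experience + gender +
                     light_user + moderate_user + heavy_user, data = d)
restricted   <- lm(lwage ~ education + experience + gender, data = d)

# F-statistic from the sums of squared residuals
ssr_ur <- sum(resid(unrestricted)^2)
ssr_r  <- sum(resid(restricted)^2)
q <- 3   # number of restrictions
k <- 6   # regressors in the unrestricted model
f_stat  <- ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))
p_value <- pf(f_stat, q, n - k - 1, lower.tail = FALSE)

# anova() carries out the same comparison of nested models
anova(restricted, unrestricted)
```

H0 is rejected at level α when `p_value < α`, which is equivalent to comparing `f_stat` with `qf(1 - α, q, n - k - 1)`.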
e) What are some potential problems with drawing causal inference using the survey data that you
collected?
>>
The survey data might have multiple problems that would make it unrepresentative of the population. Two
of the biggest issues are self-selection and social desirability bias. In this study, we could expect
individuals to deliberately (or unconsciously) report less than their actual marijuana
consumption, for fear of looking like an addict (social desirability). Other issues might be linked to
the way the data were collected. For example, if the survey was conducted in a particular area or at
a particular time of day, the respondents might not be a truly random sample of the population; this
would be the case if the survey were conducted by phone during the day, at times when the active
population is at work (which would result in an overrepresentation of unemployed people, housewives, retired
people, etc.). There are of course many other response biases that could make the data inaccurate, such as
acquiescence bias. Finally, even with accurate data, marijuana usage is not randomly assigned: it may be
correlated with unobserved factors in u, and wages may themselves affect usage, so β1 need not measure a
causal effect.
Exercise 2
** Use the data in nbasal.RData for this exercise. **
a) Estimate a linear regression model relating points per game to experience in the league and position
(guard, forward, or center). Include experience in quadratic form and use centers as the base group.
Report the results (including SRF, the sample size, and R-squared).
>>
load("nbasal.RData")
The regression model is:
points = β0 + β1exper + β2expersq + δ1guard + δ2forward + u
The SRF is:
a = lm(points~exper+expersq+guard+forward,data)
a
##
## Call:
## lm(formula = points ~ exper + expersq + guard + forward, data = data)
##
## Coefficients:
## (Intercept) exper expersq guard forward
## 4.76076 1.28067 -0.07184 2.31469 1.54457
summary(a)
##
## Call:
## lm(formula = points ~ exper + expersq + guard + forward, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.220 -4.268 -1.003 3.444 22.265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.76076 1.17862 4.039 7.03e-05 ***
## exper 1.28067 0.32853 3.898 0.000123 ***
## expersq -0.07184 0.02407 -2.985 0.003106 **
## guard 2.31469 1.00036 2.314 0.021444 *
## forward 1.54457 1.00226 1.541 0.124492
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.668 on 264 degrees of freedom
## Multiple R-squared: 0.09098, Adjusted R-squared: 0.07721
## F-statistic: 6.606 on 4 and 264 DF, p-value: 4.426e-05
Regression results:
points = 4.76076 + 1.28067 exper − 0.07184 expersq + 2.31469 guard + 1.54457 forward
        (1.17862)  (0.32853)       (0.02407)         (1.00036)       (1.00226)
(standard errors in parentheses below each coefficient)
The sample size is:
summary(a)$df[2]+length(coef(a)) # = Degrees of freedom + number of coefficients. nrow(data) would have
## [1] 269
The r-squared is:
summary(a)$r.squared
## [1] 0.09097856
b) Holding experience fixed, does a guard score more than a center? How much more? Is the difference
statistically significant?
>>
Yes, a guard seems to score more than a center. When we control for experience and experience squared, a guard
scores on average 2.31469 (δ1) more points per game than a center.
If we want to know whether it has a statistically significant positive effect, we can test the following hypothesis:
H0 : δ1 = 0
H1 : δ1 > 0
The one-sided p-value of δ1 is 0.010722 (two-sided p-value divided by two), so the difference is statistically significant at
the 5% level and falls just short of significance at the 1% level.
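The one-sided p-value can be reproduced from the numbers in the `summary()` output of part a) alone. This is a sketch that uses the rounded estimate and standard error, so the last digits may differ slightly from R's internal values.

```r
# Values copied from the summary() table in part a)
estimate <- 2.31469   # coefficient on guard
std_err  <- 1.00036   # its standard error
df       <- 264       # residual degrees of freedom

t_stat      <- estimate / std_err
p_one_sided <- pt(t_stat, df, lower.tail = FALSE)  # P(T > t) under H0

# One-sided critical values for comparison
crit_5pct <- qt(0.95, df)
crit_1pct <- qt(0.99, df)
c(t = t_stat, p = p_one_sided, crit_5pct = crit_5pct, crit_1pct = crit_1pct)
```

Halving the two-sided p-value is only valid here because the estimate has the sign predicted by H1 (δ1 > 0).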
c) Now, add marital status to the equation. Holding position and experience fixed, are married players
more productive (based on points per game)?
>>
The new regression model is:
points = β0 + β1exper + β2expersq + δ1guard + δ2forward + δ3marr + u
The SRF is:
a = lm(points~exper+expersq+guard+forward+marr,data)
a
##
## Call:
## lm(formula = points ~ exper + expersq + guard + forward + marr,
## data = data)
##
## Coefficients:
## (Intercept) exper expersq guard forward
## 4.70294 1.23326 -0.07037 2.28632 1.54091
## marr
## 0.58427
summary(a)
##
## Call:
## lm(formula = points ~ exper + expersq + guard + forward + marr,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.874 -4.227 -1.251 3.631 22.412
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.70294 1.18174 3.980 8.93e-05 ***
## exper 1.23326 0.33421 3.690 0.000273 ***
## expersq -0.07037 0.02416 -2.913 0.003892 **
## guard 2.28632 1.00172 2.282 0.023265 *
## forward 1.54091 1.00298 1.536 0.125660
## marr 0.58427 0.74040 0.789 0.430751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistic: 5.401 on 5 and 263 DF, p-value: 9.526e-05
Regression results:
points = 4.70294 + 1.23326 exper − 0.07037 expersq + 2.28632 guard + 1.54091 forward + 0.58427 marr
        (1.18174)  (0.33421)       (0.02416)         (1.00172)       (1.00298)         (0.74040)
(standard errors in parentheses below each coefficient)
The sample size is still:
## [1] 269
The r-squared is:
## [1] 0.09312579
In the sample, married players seem to be more productive than non-married players. When we control for experience,
experience squared and position, a married player scores on average 0.58427 (δ3) more points per game. However, the
difference might not be statistically significant.
If we want to know whether it has a statistically significant positive effect, we need to test the following
hypothesis:
H0 : δ3 = 0
H1 : δ3 > 0
The one-sided p-value of δ3 is 0.2153757 (two-sided p-value divided by two), which is well above the conventional
10%, 5% and 1% significance levels. We therefore cannot reject H0: the estimated difference is not statistically
significant.
d) Add interactions of marital status with both experience variables. In this expanded model, is there
strong evidence that marital status affects points per game?
>>
points = β0 + β1exper + β2expersq + δ1guard + δ2forward + δ3marr + δ4marr∗exper + δ5marr∗expersq + u
The SRF is:
a = lm(points~exper+expersq+guard+forward+marr+marr*exper+marr*expersq,data)
a
##
## Call:
## lm(formula = points ~ exper + expersq + guard + forward + marr +
## marr * exper + marr * expersq, data = data)
##
## Coefficients:
## (Intercept) exper expersq guard forward
## 5.81615 0.70255 -0.02950 2.25079 1.62915
## marr exper:marr expersq:marr
## -2.53750 1.27965 -0.09359
summary(a)
##
## Call:
## lm(formula = points ~ exper + expersq + guard + forward + marr +
## marr * exper + marr * expersq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.239 -4.328 -1.067 3.742 22.197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.81615 1.34878 4.312 2.29e-05 ***
## exper 0.70255 0.43405 1.619 0.1067
## expersq -0.02950 0.03267 -0.903 0.3674
## guard 2.25079 1.00002 2.251 0.0252 *
## forward 1.62915 1.00199 1.626 0.1052
## marr -2.53750 2.03822 -1.245 0.2143
## exper:marr 1.27965 0.68229 1.876 0.0618 .
## expersq:marr -0.09359 0.04887 -1.915 0.0566 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistic: 4.413 on 7 and 261 DF, p-value: 0.0001188
Regression results:
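points = 5.81615 + 0.70255 exper − 0.02950 expersq + 2.25079 guard + 1.62915 forward
        (1.34878)  (0.43405)       (0.03267)         (1.00002)       (1.00199)
         − 2.53750 marr + 1.27965 exper∗marr − 0.09359 expersq∗marr
          (2.03822)      (0.68229)            (0.04887)
(standard errors in parentheses below each coefficient)
The sample size is still:
## [1] 269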
The r-squared is:
## [1] 0.1058214
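This time, we want to perform a two-sided test (because we are interested in whether there is an effect in
either direction) on three different coefficients at the same time. This is therefore a joint hypothesis test:
we want to know whether, together, all the coefficients that involve marital status have an effect on points:
H0 : δ3 = 0 AND δ4 = 0 AND δ5 = 0
H1 : H0 is false
The individual t-tests cannot answer this question on their own: the two-sided p-value of δ3 alone is 0.2142624, and
the two interaction terms are each significant only at the 10% level (p-values of 0.0618 and 0.0566). The joint
hypothesis requires an F-test, with the model in a) (which contains no marital status terms) as the restricted
model. Using the R-squared values of the two models, F = [(0.1058214 − 0.09097856)/3] / [(1 − 0.1058214)/261] ≈ 1.44,
which is below the 10% critical value of the F distribution with (3, 261) degrees of freedom (about 2.1).
So no, we cannot say there is strong evidence that marital status affects
points per game.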
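One way to test the joint hypothesis H0: δ3 = δ4 = δ5 = 0 without refitting anything is the R-squared form of the F-statistic, with the model in a) as the restricted model. The sketch below uses only the R-squared values and degrees of freedom printed in this document.

```r
# R-squared values reported in parts a) and d)
r2_restricted   <- 0.09097856  # model in a): no marital status terms
r2_unrestricted <- 0.1058214   # model in d): marr, exper:marr, expersq:marr added
q   <- 3                       # number of restrictions
df2 <- 261                     # residual degrees of freedom of the model in d)

f_stat  <- ((r2_unrestricted - r2_restricted) / q) /
  ((1 - r2_unrestricted) / df2)
p_value <- pf(f_stat, q, df2, lower.tail = FALSE)

# Critical values of the F(3, 261) distribution for comparison
crit_10pct <- qf(0.90, q, df2)
crit_5pct  <- qf(0.95, q, df2)
c(F = f_stat, p = p_value, crit_10pct = crit_10pct, crit_5pct = crit_5pct)
```

With the data loaded, fitting both models and calling `anova()` on them should give the same statistic up to rounding of the R-squared values.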
e) Estimate the model from part (c) but use assists per game as the dependent variable. Are there any
notable differences from part (c)? Discuss.
>>
assists = β0 + β1exper + β2expersq + δ1guard + δ2forward + δ3marr + u
The SRF is:
a = lm(assists~exper+expersq+guard+forward+marr,data)
a
##
## Call:
## lm(formula = assists ~ exper + expersq + guard + forward + marr,
## data = data)
##
## Coefficients:
## (Intercept) exper expersq guard forward
## -0.22581 0.44360 -0.02673 2.49167 0.44747
## marr
## 0.32190
summary(a)
##
## Call:
## lm(formula = assists ~ exper + expersq + guard + forward + marr,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3127 -1.0780 -0.3157 0.6788 8.2488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.225809 0.354904 -0.636 0.52516
## exper 0.443603 0.100372 4.420 1.45e-05 ***
## expersq -0.026726 0.007256 -3.683 0.00028 ***
## guard 2.491672 0.300842 8.282 6.19e-15 ***
## forward 0.447471 0.301220 1.486 0.13860
## marr 0.321899 0.222359 1.448 0.14891
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistic: 28.31 on 5 and 263 DF, p-value: < 2.2e-16
Regression results:
assists = −0.225809 + 0.443603 exper − 0.026726 expersq + 2.491672 guard + 0.447471 forward + 0.321899 marr
          (0.354904) (0.100372)       (0.007256)         (0.300842)       (0.301220)         (0.222359)
(standard errors in parentheses below each coefficient)
The sample size is still:
## [1] 269
The r-squared is:
## [1] 0.3498759
As we can see, there are some differences compared to c), but nothing major. Except for the intercept,
which changed sign, the direction of all the effects is the same. The intercept, which was highly statistically
significant in c), is no longer statistically significant, and the variable “guard” is now much more significant
than in c). The model also fits assists far better: the R-squared rises from 0.093 in c) to 0.350. All the
coefficients changed in magnitude, sometimes substantially. Most of these differences
in magnitude are explained by the different scales of “assists” and “points”:
mean(data$assists)
## [1] 2.408922
mean(data$points)
## [1] 10.21041
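To see how far the scale argument goes, one rough check (a sketch, not part of the original analysis) is to express each dummy coefficient as a fraction of the sample mean of its dependent variable, using the estimates printed in c) and e):

```r
# Sample means and coefficients copied from the output above
mean_points  <- 10.21041
mean_assists <- 2.408922
coef_points  <- c(guard = 2.28632,  forward = 1.54091,  marr = 0.58427)
coef_assists <- c(guard = 2.491672, forward = 0.447471, marr = 0.321899)

# Coefficient relative to the mean of the dependent variable
rel_points  <- coef_points / mean_points
rel_assists <- coef_assists / mean_assists
round(rbind(points = rel_points, assists = rel_assists), 3)
```

On this rough measure the forward effect is of similar relative size in both models, while the guard effect is far larger relative to the mean for assists than for points, so scale alone does not account for every difference.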