The Multiple Regression Model: Inference
ECON 371: Econometrics
October 20, 2021
Outline

Recap
Sampling Distributions of the OLS Estimators
Testing Hypotheses About a Single Population Parameter
  Testing Against One-Sided Alternatives
  Testing Against Two-Sided Alternatives
  Testing Other Hypotheses about the βj
  Computing p-Values for t Tests
  Language of Hypothesis Testing
  Practical versus Statistical Significance
Confidence Intervals
Testing Single Linear Restrictions
Testing Multiple Exclusion Restrictions
  Relationship Between F and t Statistics
  R-Squared Form of the F Statistic
  The F Statistic for Overall Significance of a Regression
So far, what do we know how to do with the population model below?

    y = β0 + β1x1 + ... + βkxk + u (1)

- Mechanics of OLS for a given sample. (We only need MLR.2 insofar as it introduces the data, and MLR.3, no perfect collinearity, so that the OLS estimates exist.) Interpretation of the OLS regression line (ceteris paribus effects), the R2 goodness-of-fit measure, and some functional forms (the natural logarithm). (A small refresher sketch appears below.)
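As a refresher on the OLS mechanics, here is a minimal Python sketch, not from the slides: the data are simulated and the parameter values are made up purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    u = rng.normal(scale=2.0, size=n)
    y = 1.0 + 0.5 * x1 - 0.3 * x2 + u            # population model: beta0=1, beta1=.5, beta2=-.3

    X = np.column_stack([np.ones(n), x1, x2])    # design matrix with an intercept column
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None) # OLS estimates

    resid = y - X @ bhat
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)   # R-squared
    print(bhat, r2)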
- MLR.1: y = β0 + β1x1 + β2x2 + ... + βkxk + u
- MLR.2: random sampling
- MLR.3: the xij vary across i and there are no exact linear relationships among the xij (no perfect collinearity)
- MLR.4: E(u|x1, ..., xk) = E(u) = 0 (exogenous explanatory variables)
- MLR.5: Var(u|x1, ..., xk) = Var(u) = σ2 (homoskedasticity)
- Unbiasedness of OLS under MLR.1 to MLR.4. Obtain the bias (or at least its direction) when MLR.4 fails due to an omitted variable.
- Obtain the variances, Var(β̂j), under MLR.1 to MLR.5.
- The Gauss-Markov assumptions also imply OLS is the best linear unbiased estimator (BLUE).
We now want to test hypotheses about the βj . This means we
hypothesize that a population parameter is a certain value, then
use the data to determine whether the hypothesis is likely false.
EXAMPLE: (Motivated by ATTEND.DTA)
final = β0 + β1missed + β2priGPA + β3ACT + u (2)
where ACT is the achievement test score.
The null hypothesis, that missing lectures has no effect on final
exam performance (after accounting for prior MSU GPA and ACT
score), is
H0 : β1 = 0 (3)
To test hypotheses about the βj using exact (or “finite sample”)
testing procedures, we need to know more than just the mean and
variance of the OLS estimators.
Under MLR.1 to MLR.4:

    E(β̂j) = βj (4)

Under MLR.1 to MLR.5:

    Var(β̂j) = σ2 / [SSTj(1 − R2j)] (5)

And σ̂2 = SSR/(n − k − 1) is an unbiased estimator of σ2. (See the sketch below.)
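To make formula (5) concrete, here is a minimal sketch, with made-up data, that computes se(β̂1) from σ̂2/[SST1(1 − R21)] and checks it against the usual matrix formula; numpy is assumed available and is not part of the course materials.

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 200, 2
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)          # correlated regressors, so R2_1 > 0
    y = 1 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x1, x2])
    bhat = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_hat = np.sum((y - X @ bhat) ** 2) / (n - k - 1)      # SSR/(n-k-1)

    # R2_1 and SST_1 from regressing x1 on the other regressors
    Z = np.column_stack([np.ones(n), x2])
    resid1 = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
    SST_1 = np.sum((x1 - x1.mean()) ** 2)
    R2_1 = 1 - resid1 @ resid1 / SST_1
    se_b1 = np.sqrt(sigma2_hat / (SST_1 * (1 - R2_1)))          # formula (5)

    se_matrix = np.sqrt(sigma2_hat * np.linalg.inv(X.T @ X)[1, 1])
    print(se_b1, se_matrix)                                     # the two agree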
- But hypothesis testing relies on the entire sampling distributions of the β̂j. Even under MLR.1 through MLR.5, the sampling distributions can be virtually anything.

    β̂j = βj + Σ_{i=1}^{n} wij ui (6)

  where the wij are functions of {(xi1, ..., xik) : i = 1, ..., n}.
- Conditional on {(xi1, ..., xik) : i = 1, ..., n}, β̂j inherits its distribution from that of {ui : i = 1, ..., n}, which is a random sample from the population distribution of u.
Assumption MLR.6 (Normality)
The population error u is independent of (x1, ..., xk) and is
normally distributed with mean zero and variance σ2:
    u ∼ Normal(0, σ2) (7)

- MLR.4: E(u|x1, ..., xk) = E(u) = 0
- MLR.5: Var(u|x1, ..., xk) = Var(u) = σ2
- Now MLR.6 imposes full independence (not just in the mean and variance).
- But the important part of MLR.6 is that we have now made a very specific distributional assumption for u: the familiar bell-shaped curve.
- Ultimately, like Assumption MLR.5, Assumption MLR.6 is maintained for convenience, and it can easily fail. Fortunately, we will later see that, for approximate inference in large samples, we can drop MLR.6. For now we keep it.
- It is very difficult to perform exact statistical inference without normality.
- Assumptions MLR.1 to MLR.6 are called the classical linear model (CLM) assumptions (for cross-sectional regression).
- For any practical purpose, think of

    CLM = GM + normality assumption (8)

- An important fact about independent normal random variables: any linear combination is also normally distributed.
- Because the ui are independent and identically distributed (i.i.d.) as Normal(0, σ2),

    β̂j = βj + Σ_{i=1}^{n} wij ui ∼ Normal[βj, Var(β̂j)] (9)

  where

    Var(β̂j) = σ2 / [SSTj(1 − R2j)] (10)

  (A small simulation illustrating this result appears below.)
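The following minimal simulation, with made-up parameter values and numpy assumed available, draws many samples from a model satisfying the CLM assumptions and checks that the OLS slope estimates are centered at βj with the variance given in (10).

    import numpy as np

    rng = np.random.default_rng(2)
    n, reps, sigma = 50, 5000, 1.5
    beta = np.array([1.0, 0.5])                 # beta0, beta1
    x = rng.normal(size=n)                      # regressors held fixed across replications
    X = np.column_stack([np.ones(n), x])

    b1_draws = np.empty(reps)
    for r in range(reps):
        u = rng.normal(scale=sigma, size=n)     # MLR.6: normal errors
        y = X @ beta + u
        b1_draws[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]

    SST1 = np.sum((x - x.mean()) ** 2)          # with a single regressor, R2_1 = 0
    print(b1_draws.mean(), beta[1])             # approximately equal
    print(b1_draws.var(), sigma ** 2 / SST1)    # approximately equal: formula (10)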
THEOREM (Normal Sampling Distributions)
Under the CLM Assumptions (and conditional on the sample
outcomes of the explanatory variables),
    β̂j ∼ Normal[βj, Var(β̂j)] (11)

and so

    (β̂j − βj)/sd(β̂j) ∼ Normal(0, 1) (12)

The second result follows from a feature of the normal distribution: if W ∼ Normal, then so is a + bW for constants a and b.
The standardized random variable

    (β̂j − βj)/sd(β̂j) (13)

always has zero mean and variance one. Under MLR.6, it is also normally distributed.
Testing Hypotheses About a Single Population Parameter
- We cannot directly use the result

    (β̂j − βj)/sd(β̂j) ∼ Normal(0, 1) (14)

  to test hypotheses about βj: sd(β̂j) depends on σ = sd(u), which is unknown.
- But we have σ̂ as an estimator of σ. Using σ̂ in place of σ gives us the standard error, se(β̂j).
THEOREM (t Distribution for Standardized Estimators)
Under the CLM Assumptions,
    (β̂j − βj)/se(β̂j) ∼ tn−k−1 = tdf (15)

We will not prove this.

- It is the replacement of σ (a constant) with σ̂ (an estimator that varies across samples, just like β̂j) that takes us from the standard normal to the t distribution.
- The t distribution also has a bell shape, but is more spread out than the Normal(0, 1):

    E(tdf) = 0 if df > 1 (16)
    Var(tdf) = df/(df − 2) > 1 if df > 2 (17)

  We will never have really small df in this class.
- When df = 10, Var(tdf) = 1.25, which is 25% larger than the Normal(0, 1) variance.
- When df = 120, Var(tdf) ≈ 1.017.
- As df → ∞,

    tdf → Normal(0, 1) (18)

- The difference is practically small for df > 120.
- We use the result on the t distribution to test the null hypothesis that xj has no (partial) effect on y:

    H0 : βj = 0 (19)

- For example,

    lwage = β0 + β1educ + β2exper + β3tenure + u (20)
    H0 : β2 = 0 (21)

- In words: once we control for education and time on the current job (tenure), total workforce experience has no impact on lwage = ln(wage).
- To test H0 : βj = 0, we use the t statistic (or t ratio),

    tβ̂j = (β̂j − βj)/se(β̂j) = β̂j/se(β̂j) (since βj = 0 under H0) (22)

- This is the estimated coefficient divided by our estimate of its sampling standard deviation. In virtually all cases β̂j is not exactly equal to zero. When we use tβ̂j, we are measuring how far β̂j is from zero relative to its standard error.
- Because se(β̂j) > 0, tβ̂j always has the same sign as β̂j. To use tβ̂j to test H0 : βj = 0, we need to specify an alternative.
- Some like to define tβ̂j as the absolute value, so it is always positive. This makes it cumbersome to test against one-sided alternatives.
Testing Against One-Sided Alternatives

- First consider the alternative

    H1 : βj > 0 (23)

  which means the null is effectively

    H0 : βj ≤ 0 (24)

  Using a positive one-sided alternative, if we reject βj = 0 then we reject any βj < 0, too. We often just state H0 : βj = 0 and act as if we do not care about negative values.
- If the estimated coefficient β̂j is negative, it provides no evidence against H0 in favor of H1 : βj > 0.
- If β̂j is positive, the question is: how big does tβ̂j = β̂j/se(β̂j) have to be before we conclude H0 is “unlikely”?
Traditional approach to hypothesis testing:
1. Choose a null hypothesis: H0 : βj = 0 (or H0 : βj ≤ 0).
2. Choose an alternative hypothesis: H1 : βj > 0.
3. Choose a significance level (or simply level, or size) for the test, that is, the probability of rejecting the null hypothesis when it is in fact true (a Type I error). Suppose we use 5%, so the probability of committing a Type I error is 0.05.
4. Choose a critical value, c > 0, so that the rejection rule

    tβ̂j > c

leads to a 5% level test.
- The key is that, under the null hypothesis,

    tβ̂j ∼ tn−k−1 = tdf (25)

  and this is what we use to obtain the critical value, c.
- Suppose df = 28 and we use a 5% test. The critical value is c = 1.701, which can be found in Table G.2 (page 745 in 6e).
- We are conducting a one-tailed test (and it is the one-tailed entries that should be used in the table). See Figure 4.2 on page 124.
- So, with df = 28, the rejection rule for H0 : βj = 0 against H1 : βj > 0, at the 5% level, is

    tβ̂j > 1.701

  We need a t statistic greater than 1.701 to conclude there is enough evidence against H0.
- If tβ̂j ≤ 1.701, we fail to reject H0 against H1 at the 5% significance level.
- Suppose df = 28, but we want to carry out the test at a different significance level (often the 10% level or the 1% level). Then (see the sketch below):

    c.10 = 1.313
    c.05 = 1.701
    c.01 = 2.467
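These one-tailed critical values can be reproduced with any statistics package; here is a minimal Python sketch for df = 28 (scipy is assumed available and is not part of the course materials).

    from scipy import stats

    df = 28
    for alpha in (0.10, 0.05, 0.01):
        # one-tailed critical value: the (1 - alpha) quantile of t_df
        c = stats.t.ppf(1 - alpha, df)
        print(alpha, round(c, 3))   # 1.313, 1.701, 2.467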
- If we want to reduce the probability of a Type I error, we must increase the critical value (so we reject the null less often).
- If we reject at, say, the 1% level, then we must also reject at any larger level.
- If we fail to reject at, say, the 10% level (so that tβ̂j ≤ 1.313), then we will fail to reject at any smaller level.
- With large sample sizes (certainly when df > 120) we can use critical values from the standard normal distribution. These are the df = ∞ entries in Table G.2:

    c.10 = 1.282
    c.05 = 1.645
    c.01 = 2.326

  which we can round to 1.28, 1.65, and 2.33, respectively. The value 1.65 is especially common for a one-tailed test.
EXAMPLE: Factors Affecting lwage (WAGE2.DTA)
In applications, it is helpful to label parameters with variable names when stating hypotheses, e.g., βeduc, βIQ, and βexper. Then

    H0 : βexper = 0 (26)

says that workforce experience has no effect on the wage once education and IQ have been accounted for.
    lwage-hat = 5.198 + .057 educ + .0058 IQ + .0195 exper
               (.007)  (.007)      (.0010)    (.0032)

    n = 935, R2 = .162

- The quantities in parentheses are still standard errors, not t statistics!
- It is easiest to read the t statistic off the Stata output, when available:

    texper = 6.02

  which is well above the one-sided critical value at the 1% level, 2.326. In fact, the 0.5% critical value is about 2.58. (See the sketch below.)
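As a check, the t statistic can be recomputed from the reported coefficient and standard error; the small difference from 6.02 comes from rounding in the displayed numbers. A minimal sketch, scipy assumed:

    from scipy import stats

    b_exper, se_exper = 0.0195, 0.0032
    t_exper = b_exper / se_exper                 # about 6.09 from the rounded numbers
    df = 935 - 3 - 1
    p_one_sided = stats.t.sf(t_exper, df)        # P(T > t), essentially zero
    print(round(t_exper, 2), p_one_sided)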
- The bottom line is that H0 : βexper = 0 can be rejected against H1 : βexper > 0 at very small significance levels. A t of 6.02 is very large.
- The estimated effect of exper, that is, its economic importance, is apparent. Another year of experience, holding educ and IQ fixed, is estimated to be worth about 1.9%.
- The t statistics for educ and IQ are also very large; there is no need to even look up critical values.
EXAMPLE: Does ACT score help predict college GPA? Data in GPA1.DTA: n = 141 MSU students from the mid-1990s. All variables are self-reported.
Consider controlling for high school GPA:

    colGPA = β0 + β1hsGPA + β2ACT + u (27)
    H0 : β2 = 0 (28)

From the Stata output, β̂2 = β̂ACT = .0094 and tACT = .87. Even at the 10% level (c = 1.28), we cannot reject H0 against H1 : βACT > 0.
- Because we fail to reject H0 : βACT = 0, we say that “β̂ACT is statistically insignificant at the 10% level against the one-sided alternative.”
- It is also very important to see that the estimated effect of ACT is small, too. Three more points on the ACT (slightly more than one standard deviation) predicts a colGPA difference of only .0094(3) ≈ .028, not even three one-hundredths of a grade point.
- By contrast, β̂hsGPA = .453 is large in a practical sense (each point on hsGPA is associated with about .45 points on colGPA) and thsGPA = 4.73 is very large.
- No critical values in Table G.2 with df = 141 − 3 = 138 are even close to 4. So “β̂hsGPA is statistically significant” at very small significance levels.
- Notice what happens if we do not control for hsGPA. The simple regression estimate is .0271 with tACT = 2.49. The magnitude is still pretty modest, but we would conclude it is statistically significant at the 1% level against the one-sided alternative, using the standard normal critical value 2.326.
- It is not clear why ACT has such a small, statistically insignificant effect. The sample size is small and the scores were self-reported. The survey was done in a couple of economics courses, so it is not a random sample of all MSU students.
For the negative one-sided alternative,

    H1 : βj < 0, (29)

we use a symmetric rule, but the rejection rule is now

    tβ̂j < −c (30)

where c is chosen in the same way as in the positive case.
With df = 18 and a 5% test, the critical value is c = 1.734, so the rejection rule is

    tβ̂j < −1.734 (31)

Now we must see a sufficiently negative value of the t statistic to reject H0 : βj = 0 in favor of H1 : βj < 0.
EXAMPLE: Does missing lectures affect final exam performance?

    final = β0 + β1missed + β2priGPA + β3ACT + u (32)
    H0 : β1 = 0, H1 : β1 < 0 (33)

- We get β̂1 = −0.079 and tβ̂1 = −2.25. The 5% critical value is −1.65 and the 1% critical value is about −2.33, so we reject H0 in favor of H1 at the 5% level but not at the 1% level. (See the sketch below.)
- The effect is not huge: 10 missed lectures, out of 32, lowers the final exam score by about 0.8 points, so not even one point.
- If we do not control for ACT score, the effect of missed goes away. It turns out that missed and ACT are positively correlated: those with higher ACT scores miss more classes, on average.
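Since df is large here, the normal approximation gives a quick one-sided p-value for t = −2.25; a minimal sketch, scipy assumed:

    from scipy import stats

    t_stat = -2.25
    p_one_sided = stats.norm.cdf(t_stat)   # P(Z < -2.25), about 0.012
    print(round(p_one_sided, 3))           # reject at 5%, but not at 1%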
Reminder about Testing

- Our hypotheses involve the unknown population values, βj. If in a given set of data we obtain, say, β̂j = 2.75, we do not write the null hypothesis as

    H0 : 2.75 = 0

  (which is obviously false).
- Nor do we write

    H0 : β̂j = 0

  which is also false except in the very rare case that our estimate is exactly zero.
- We do not test hypotheses about the estimates! We know what they are. We hypothesize about the unknown βj.
Testing Against Two-Sided Alternatives

- Sometimes we do not know ahead of time whether a variable definitely has a positive effect or a negative effect. Even in the example

    final = β0 + β1missed + β2priGPA + β3ACT + u

  it is conceivable that missing class helps final exam performance. (The extra time is used for studying, say.)
- Generally, the null and alternative are

    H0 : βj = 0
    H1 : βj ≠ 0

- Testing against the two-sided alternative is usually the default. It prevents us from looking at the regression results and then deciding on the alternative. Also, it is harder to reject H0 against the two-sided alternative, so it requires more evidence that xj actually affects y.
- Now we reject H0 if β̂j is sufficiently large in magnitude, either positive or negative. We again use the t statistic tβ̂j = β̂j/se(β̂j), but now the rejection rule is

    |tβ̂j| > c (34)

- This results in a two-tailed test, and it is the two-tailed critical values we pull from Table G.2.
- For example, if we use a 5% level test and df = 25, the two-tailed cv is 2.06. The two-tailed cv is, in this case, the 97.5th percentile of the t25 distribution. (Compare the one-tailed cv, about 1.71, the 95th percentile of the t25 distribution.) See the sketch below.
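The two-tailed and one-tailed critical values quoted here are just t quantiles; a minimal sketch, scipy assumed:

    from scipy import stats

    df = 25
    two_tailed_cv = stats.t.ppf(0.975, df)   # about 2.06
    one_tailed_cv = stats.t.ppf(0.95, df)    # about 1.71
    print(round(two_tailed_cv, 2), round(one_tailed_cv, 2))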
EXAMPLE: Factors affecting math pass rates. (MATHPNL.DTA)

- Run a multiple regression of math4 on lunch, ptr, avgsal, and enrol.
- A priori, we might expect lunch to have a negative effect (it is essentially a school-level poverty rate), ptr to have a negative effect, and avgsal to have a positive effect. But we can still test against a two-sided alternative to avoid specifying the alternative ahead of time.
- The expected sign of enrol is ambiguous.
- With 2,200 observations, we can use the standard normal critical values: for a 10% test it is 1.65, for a 5% test, 1.96, and for a 1% test, 2.58.
- For the variables lunch, ptr, and avgsal, all coefficients have the anticipated signs, and the absolute values of the t statistics are above 4. So we easily reject H0 : βj = 0 against H1 : βj ≠ 0.
- enrol is a different situation: tenrol = 0.57 < 1.65, so we fail to reject H0 at even the 10% significance level.
- Functional form can make a difference. The math pass rates are capped at 100, so a diminishing effect of avgsal and enrol seems appropriate; these variables also have lots of variation. So use the logarithms instead.
- Of course, all the estimates change, but it is those on lavgsal and lenrol that are now much different. Before, we were measuring a dollar effect. Now, holding the other variables fixed,

    ∆math4-hat = (22.83/100) %∆avgsal = .2283 (%∆avgsal) (35)

  So if, say, %∆avgsal = 10 (teacher salaries are 10 percent higher), math4 is estimated to increase by about 2.3 points.
- Also,

    ∆math4-hat = (1.39/100) %∆enrol = .0139 (%∆enrol) (36)

  so a 10% increase in enrollment is associated with a .14 point increase in math4.
- Notice how lenrol = ln(enrol) is statistically significant at the 5% level: tlenrol = 3.81 > 1.96.
- Reminder: when we report the results of, say, the second regression, it looks like

    math4-hat = −151.32 − .325 lunch − 1.346 ptr + 22.83 lavgsal + 1.39 lenrol (37)
               (25.11)   (.019)       (.128)      (2.50)          (0.36)

    n = 2199, R2 = .2337

  so that standard errors are below the coefficients.
- When we reject H0 : βj = 0 against H1 : βj ≠ 0, we often say that β̂j is statistically different from zero and usually mention a significance level. For example, if we can reject at the 1% level, we say that. If we can reject at the 10% level but not the 5%, we say that.
- As in the one-sided case, we also say β̂j is “statistically significant” when we can reject H0 : βj = 0.
Testing Other Hypotheses about the βj

- Testing the null H0 : βj = 0 is by far the most common. That is why Stata and other regression packages automatically report the t statistic for this hypothesis.
- It is critical to remember that

    tβ̂j = β̂j/se(β̂j) (40)

  is only for H0 : βj = 0.
- What if we want to test a different null value? For example, in a constant-elasticity consumption function,

    ln(cons) = β0 + β1 ln(inc) + β2famsize + β3pareduc + u (41)

  we might want to test

    H0 : β1 = 1 (42)

  which means an income elasticity equal to one. (We can be pretty sure that β1 > 0.)
- More generally, suppose the null is

    H0 : βj = aj (43)

  where we specify the value aj (usually zero but, in the consumption example, aj = 1).
- It is easy to extend the t statistic:

    t = (β̂j − aj)/se(β̂j) (44)

- This t statistic just measures how far our estimate, β̂j, is from the hypothesized value, aj, relative to se(β̂j).
- A useful general expression for t testing:

    t = (estimate − hypothesized value) / standard error of the estimate (45)

- The alternative can be one-sided or two-sided.
- We choose critical values exactly the same way as before. (See the sketch below.)
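A minimal sketch of this general t test in Python; the estimate, null value, standard error, and df are made up for illustration, and scipy is assumed available.

    from scipy import stats

    bhat, a, se, df = 0.86, 1.0, 0.11, 95     # hypothetical estimate, null value aj, se, df
    t = (bhat - a) / se                       # general t statistic, equation (45)
    p_two_sided = 2 * stats.t.sf(abs(t), df)  # two-sided p-value
    print(round(t, 2), round(p_two_sided, 3))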
- The language needs to be suitably modified. If, for example,

    H0 : βj = 1
    H1 : βj ≠ 1

  is rejected at the 5% level, we say “β̂j is statistically different from one at the 5% level.” Otherwise, β̂j is “not statistically different from one.” If the alternative is H1 : βj > 1, then “β̂j is statistically greater than one at the 5% level.”
EXAMPLE: Relationship between crime and enrollment on college
campuses. (CAMPUS.DTA)
A simple regression model:
ln(crime) = β0 + β1 ln(enroll) + u (46)
H0 : β1 = 1 (47)
H1 : β1 > 1 (48)
We get β̂1 = 1.27, and so a 1% increase in enrollment is estimated
to increase crime by 1.27% (so more than 1%). Is this estimate
statistically greater than one?
- We cannot pull this t statistic off the usual Stata output. We can compute it by hand (rounding the estimate and standard error):

    t = (1.270 − 1)/.110 ≈ 2.45

  (Note how this is much smaller than the t for H0 : β1 = 0 reported by Stata.)
- We have df = 97 − 2 = 95, so we use the df = 120 entry in Table G.2. The 1% cv for a one-sided alternative is about 2.36, so we reject at the 1% significance level. (See the sketch below.)
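The same calculation, plus the implied one-sided p-value, in a minimal Python sketch; the inputs are the rounded numbers from the slide, and scipy is assumed available.

    from scipy import stats

    b1, se1, df = 1.270, 0.110, 95
    t = (b1 - 1) / se1                      # about 2.45
    p_one_sided = stats.t.sf(t, df)         # about 0.008, so reject at the 1% level
    print(round(t, 2), round(p_one_sided, 3))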
- Alternatively, we can let Stata do the work using the lincom (“linear combination”) command. Here the null is stated equivalently as

    H0 : β1 − 1 = 0

- The t = 2.46 is the more accurate calculation of the t statistic.
- The lincom lenroll - 1 command is Stata’s way of saying “test whether βlenroll − 1 equals zero.”
Computing p-Values for t Tests

- The traditional approach to testing, where we choose a significance level ahead of time, can be cumbersome.
- Plus, it can conceal information. For example, suppose that, for testing against a two-sided alternative, a t statistic is just below the 5% cv. I could simply say that “I fail to reject H0 : βj = 0 against the two-sided alternative at the 5% level.” But there is nothing sacred about 5%. Might I reject at, say, 6%?
- Rather than having to specify a level ahead of time, or discuss different traditional significance levels (10%, 5%, 1%), it is better to answer the following question: given the observed value of the t statistic, what is the smallest significance level at which I can reject H0?
- The smallest level at which the null can be rejected is known as the p-value of the test. It is a single number that automatically allows us to carry out the test at any level.
- One way to think about the p-value: it uses the observed statistic as the critical value, and then finds the significance level of the test using that critical value.
- It is most common to report p-values for two-sided alternatives. This is what Stata does. The t tables are not detailed enough to compute p-values by hand.
- For t testing against a two-sided alternative,

    p-value = P(|T| > |t|) (49)

  where t is the value of the t statistic and T is a random variable with the tdf distribution.
- The p-value is a probability, so it is between zero and one.
- Perhaps the best way to think about a p-value: it is the probability of observing a statistic as extreme as we did if the null hypothesis is true.
- So smaller p-values provide more evidence against the null. For example, if p-value = .50, then there is a 50% chance of observing a t as large as we did (in absolute value). This is not enough evidence against H0.
- If p-value = .001, then the chance of seeing a t statistic as extreme as we did is .1%. We can conclude either that we got a very rare sample (which is not helpful) or that the null hypothesis is very likely false.
- From

    p-value = P(|T| > |t|) (50)

  we see that as |t| increases the p-value decreases. Large absolute t statistics are associated with small p-values.
- Suppose df = 40 and, from our data, we obtain t = 1.85 or t = −1.85. Then (see the sketch below)

    p-value = P(|T| > 1.85) = 2 P(T > 1.85) = 2(.0359) = .0718

  where T ∼ t40.
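This two-sided p-value is exactly what software computes; a minimal sketch, scipy assumed:

    from scipy import stats

    t_stat, df = 1.85, 40
    p_two_sided = 2 * stats.t.sf(abs(t_stat), df)   # 2 * P(T > 1.85)
    print(round(p_two_sided, 4))                    # about 0.0718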
- Given the p-value, we can carry out a test at any significance level. If α is the chosen level, then

    Reject H0 if p-value < α

- In the previous example we obtained p-value = 0.0718. This means that we reject H0 at the 10% level but not the 5% level. We reject at 8% but not (quite) at 7%.
- Knowing p-value = .0718 is clearly much better than just saying “I fail to reject at the 5% level.”
Computing p-Values for One-Sided Alternatives

- Stata and other packages report the two-sided p-value. How can we get a one-sided p-value?
- With a caveat, the answer is simple:

    one-sided p-value = (two-sided p-value)/2 (51)

- We only want the area in one tail, not two; the two-sided p-value gives us the area in both tails.
- This is the correct calculation when it is interesting to do the calculation. The caveat is simple: if the estimated coefficient is not in the direction of the alternative, the one-sided p-value is above 0.50, and so it is not an interesting calculation. (A sign-aware version is sketched below.)
- In Stata, the two-sided p-values for H0 : βj = 0 are given in the column labeled P > |t|.
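A minimal sketch of the halving rule, including the caveat about the sign of the estimate; the estimate, standard error, and df are made up, and scipy is assumed available.

    from scipy import stats

    bhat, se, df = 0.021, 0.009, 150          # hypothetical estimate, se, and df
    t = bhat / se
    p_two_sided = 2 * stats.t.sf(abs(t), df)

    # one-sided p-value for H1: beta_j > 0 -- halve only when the estimate is in the
    # direction of the alternative; otherwise the tail area is 1 - p/2 (above 0.5)
    p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
    print(round(p_two_sided, 4), round(p_one_sided, 4))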
EXAMPLE: Factors Affecting NBA Salaries (NBASAL.DTA)

- Disregarding the intercept (or “constant”), none of the variables is statistically significant at the 1% level against a two-sided alternative. The closest is points, with p-value = .017. (The one-sided p-value is .017/2 = .0085 < .01, so points is significant at the 1% level against the positive one-sided alternative.)
- mingame is statistically significant at the 5% level because p-value = .022 < .05.
- rebounds is statistically significant at the 10% level (against a two-sided alternative) because p-value = .077 < .10, but not at the 5% level. The one-sided p-value is .077/2 = .0385.
- Both games and assists have very small t statistics, which lead to p-values close to one (for example, for assists, p-value = .991). These variables are statistically insignificant.
- In some applications, p-values equal to zero up to three decimal places are not uncommon. We do not have to worry about statistical significance in such cases.
Language of Hypothesis Testing

- If we do not reject H0 (against any alternative), it is better to say “we fail to reject H0” as opposed to “we accept H0,” which is somewhat common.
- The reason is that many null hypotheses cannot be rejected in any application. For example, if I have β̂j = .75 and se(β̂j) = .25, I do not say that I “accept H0 : βj = 1.” I fail to reject because the t statistic is (.75 − 1)/.25 = −1. But the t statistic for H0 : βj = .5 is (.75 − .5)/.25 = 1, so I cannot reject H0 : βj = .5, either.
- Clearly βj = .5 and βj = 1 cannot both be true. There is a single, unknown value in the population. So I should not “accept” either.
- The outcomes of the t tests tell us the data cannot reject either hypothesis. Nor can the data reject H0 : βj = .6, and so on. The data do reject H0 : βj = 0 (t = 3) at a pretty small significance level (if we have a reasonable df).
Practical versus Statistical Significance

- t testing is purely about statistical significance. It does not directly speak to the issue of whether a variable has a practically, or economically, large effect.
- Practical (economic) significance depends on the size (and sign) of β̂j.
- Statistical significance depends on tβ̂j.
- It is possible to estimate practically large effects but have the estimates be so imprecise that they are statistically insignificant. This is especially an issue with small data sets (but not only small data sets).
- Even more importantly, it is possible to get estimates that are statistically significant, often with very small p-values, but are not practically large. This can happen with very large data sets.
- EXAMPLE: Suppose that, using a large cross-sectional data set for teenagers across the U.S., we estimate the elasticity of alcohol demand with respect to price to be −0.013 with se = 0.002. Then the t statistic is −6.5, and we need look no further to conclude the elasticity is statistically different from zero. But the estimate means that, say, a 10% increase in the price of alcohol reduces demand by an estimated 0.13%. This is a small effect.
- The bottom line: do not just fixate on t statistics! Interpreting the β̂j is just as important.
Confidence Intervals
I Rather than just testing hypotheses about parameters, it is
also useful to construct confidence intervals (also known as
interval estimates).
I Loosely, the CI is supposed to give a “likely” range of values
for the corresponding population parameter.
I We will only consider CIs of the form
β̂j ± c · se(β̂j ) (52)
where c > 0 is chosen based on the confidence level.
I We will use a 95% confidence level, in which case c comes
from the 97.5th percentile of the tdf distribution. In other words,
c is the 5% critical value against a two-sided alternative.
I Stata automatically reports a 95% CI for each parameter,
based on the t distribution using the appropriate df.
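For example, a minimal sketch (y, x1, and x2 are placeholder variable
names, not from any dataset used in these notes):

    regress y x1 x2                // default output includes a 95% CI for each coefficient
    regress y x1 x2, level(99)     // report 99% CIs instead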
I Notice, in the NBA salaries example, how the three estimates
that are not statistically different from zero at the 5% level –
games, rebounds, and assists – all have 95% CIs that include
zero. For example, the 95% CI for βrebounds is
[−0.0045, 0.0859]
I By contrast, the 95% CI for βpoints is
[0.0067, 0.0661]
which excludes zero.
I A simple rule-of-thumb is useful for constructing a CI given
the estimate and its standard error. For, say, df ≥ 60, an
approximate 95% CI is
β̂j ± 2se(β̂j ) or [β̂j − 2se(β̂j ), β̂j + 2se(β̂j )]
That is, subtract and add twice the standard error to the
estimate. (In the case of the standard normal, the 2 becomes
1.96.)
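A quick sketch of the rule of thumb in Stata (the estimate and standard
error below are made-up numbers used only for illustration):

    scalar b  = .042                                     // hypothetical coefficient estimate
    scalar se = .017                                     // hypothetical standard error
    display "approximate 95% CI: [" b - 2*se ", " b + 2*se "]"
    display "exact c with df = 60: " invttail(60, .025)  // about 2.00, close to the rule of thumb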
I Properly interpreting a CI is a bit tricky. One often sees
statements such as “there is a 95% chance that βpoints is in
the interval [.0067, .0661].” This is incorrect. βpoints is some
fixed value, and it either is or is not in the interval.
I The correct way to interpret a CI is to remember that the
endpoints, β̂j − c · se(β̂j ) and β̂j + c · se(β̂j ), change with
each sample (or at least can change). That is, the endpoints
are random outcomes that depend on the data we draw.
I What a 95% CI means is that for 95% of the random samples
that we draw from the population, the interval we compute
using the rule β̂j ± c · se(β̂j ) will include the value βj . But for
a particular sample we do not know whether βj is in the
interval.
I This is similar to the idea that unbiasedness of β̂j does not
mean that β̂j = βj . Most of the time β̂j is not βj .
Unbiasedness means E(β̂j ) = βj .
CIs and Hypothesis Testing
I If we have constructed a 95% CI for, say, βj , we can test any
null value against a two-sided alternative, at the 5% level. So
H0 : βj = aj
H1 : βj ≠ aj
I If aj is in the 95% CI, then we fail to reject H0 at the 5% level.
I If aj is not in the 95% CI then we reject H0 in favor of H1 at
the 5% level.
Using WAGE2.dta,

  lwage-hat = 5.109 + .055 educ + .0054 IQ            (53)
              (.126)  (.008)      (.0010)
            + .0222 exper + .0126 meduc               (54)
              (.0034)       (.0050)

  n = 857, R2 = .1804                                 (55)
I The 95% CI for βIQ is about [.0034, .0074]. So we can reject
H0 : βIQ = 0 against the two-sided alternative at the 5% level.
We cannot reject H0 : βIQ = .007 (although it is close).
I We can reject a return to schooling of 3.0% as too low, and
8.0% as too high.
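As a check, the βIQ interval can be reproduced from the reported
coefficient and standard error; a sketch, assuming WAGE2.dta is in the
working directory:

    use WAGE2, clear
    regress lwage educ IQ exper meduc
    scalar cval = invttail(e(df_r), .025)     // df = 852, so cval is about 1.96
    display "95% CI for beta_IQ: [" _b[IQ] - cval*_se[IQ] ", " _b[IQ] + cval*_se[IQ] "]"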
I Just as with hypothesis testing, these CIs are only as good as
the underlying assumptions. If we have omitted key variables,
the β̂j are biased. If the error variance is not constant, the
standard errors are improperly computed.
I With df = 852, we will see later that normality is not very
important. But normality is needed for these CIs to be exact
95% CIs.
Testing Single Linear Restrictions
I So far, we have discussed testing hypotheses that involve only
one parameter, βj . But some hypotheses involve many
parameters.
I EXAMPLE: Are the Returns to a Year of Junior College the
same as for a Four-Year University? Suppose we have a
sample of high school graduates.
lwage = β0 + β1jc + β2univ + β3exper + u
H0 : β1 = β2
H1 : β1 < β2
I We could use a two-sided alternative, too.
I We can also write
H0 : β1 − β2 = 0
I Remember the general way to construct a t statistic:
t = (estimate − hypothesized value) / standard error    (56)
I Given the OLS estimates β̂1 and β̂2,
t = (β̂1 − β̂2) / se(β̂1 − β̂2)    (57)
I Problem: The OLS output gives us β̂1 and β̂2 and their
standard errors, but that is not enough to obtain se(β̂1 − β̂2).
I Recall a fact about variances:
Var(β̂1 − β̂2) = Var(β̂1) + Var(β̂2) − 2Cov(β̂1, β̂2) (58)
I The standard error is an estimate of the square root:
se(β̂1 − β̂2) = {[se(β̂1)]^2 + [se(β̂2)]^2 − 2s12}^(1/2)    (59)
where s12 is an estimate of Cov(β̂1, β̂2). This is the piece we
are missing.
I We could have Stata report s12 (cumbersome). There is also
a trick of rewriting the model (see text, Section 4.4).
I It is easiest to use a command for testing linear functions of
the coefficients. In Stata, it is lincom. After running the
regression we could type,
lincom jc − univ
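A sketch of the full sequence (assuming a dataset containing lwage, jc,
univ, and exper is already in memory; no dataset name is given in these
notes):

    regress lwage jc univ exper
    lincom jc - univ        // estimate, standard error, t statistic, and CI for beta1 - beta2

For the one-sided alternative H1 : β1 < β2, compare the reported t
statistic with the appropriate left-tail critical value (or halve the
two-sided p-value when the estimate has the expected sign).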
Testing Multiple Exclusion Restrictions
I The t test allows us to test a single hypothesis, whether it
involves one or more than one parameter.
I But sometimes we want to test more than one hypothesis,
which then involves multiple parameters.
I Generally, it is not valid to look at individual t statistics. We
need a statistic designed to test joint hypotheses.
EXAMPLE: Major League Baseball Salaries (MLBSAL.DTA)
lsalary = β0 + β1years + β2gamesyr + β3bavg
+β4hrunsyr + β5rbisyr + u
I H0 : Once we control for experience (years) and amount
played (gamesyr), actual performance has no effect on salary.
H0 : β3 = 0, β4 = 0, β5 = 0 (60)
I An example of exclusion restrictions: the three variables,
bavg, hrunsyr, and rbisyr can be excluded from the equation.
I To test H0, we need a joint (multiple) hypothesis test.
I A t statistic can be used for a single exclusion restriction; it
does not take a stand on the values of the other parameters.
I In this class, we only consider the alternative
H1 : H0 is not true (61)
I So, H1 means at least one of β3, β4, and β5 is different from
zero.
I One-sided alternatives (where, say, each β is positive) are hard
to deal with for multiple restrictions. So, our alternative is
two-sided.
I In the Stata output that follows, something curious occurs:
none of the three performance variables is statistically
significant, even though the coefficient estimates are all
positive.
I Closest is rbisyr: trbisyr = 1.50, but this does not strongly
reject H0. (One-sided p-value = 0.067)
I The estimated effect of bavg is not really large: 10 more
points on lifetime average means salary is estimated to be
1.0% higher. Another RBI (without hitting a home run) is
worth about 1.1%.
I Note: We can use this equation to answer questions such as:
How much is a solo HR worth? That would be adding the two
coefficients, so about .025, or 2.5%.
I Question: On the basis of the three insignificant t statistics,
should we conclude that none of bavg, hrunsyr, and rbisyr
affects baseball player salaries?
I No. This would be a mistake.
I A hint at the problem is severe multicollinearity between
hrunsyr and rbisyr: the correlation is about .89. In fact, one
cannot hit a home run without getting at least one RBI.
Generally, home run hitters also tend to produce lots of RBIs.
I Because the individual coefficients are imprecisely estimated,
we need a joint test.
I The original model, containing all variables, is the
unrestricted model:
lsalary = β0 + β1years + β2gamesyr + β3bavg
+β4hrunsyr + β5rbisyr + u
I When we impose H0 : β3 = 0, β4 = 0, β5 = 0, we get the
restricted model:
lsalary = β0 + β1years + β2gamesyr + u
I We want to see how the fit deteriorates as we remove the
three variables. We use, initially, the sum of squared residuals
from the two regressions.
I Let SSRur denote the SSR from the unrestricted model. Note
that, in this example, dfur = 353 − 6 = 347. Let SSRr be the
SSR from the restricted model, so dfr = 353 − 3 = 350.
I It is an algebraic fact that the SSR must increase (or, at least,
not fall) when explanatory variables are dropped. So
SSRr ≥ SSRur (62)
I The test statistic we use essentially asks: does the SSR
increase proportionately by enough to conclude the
restrictions under H0 are false?
I In the general model
y = β0 + β1x1 + ... + βkxk + u (63)
we want to test that the last q variables can be excluded:
H0 : βk−q+1 = 0, ..., βk = 0 (64)
I We get SSRur from estimating the full model.
I The restricted model we estimate to get SSRr drops the last q
variables (q exclusion restrictions):
y = β0 + β1x1 + ... + βk−qxk−q + u (65)
I The F statistic uses a degrees of freedom adjustment. In
general, we have
F = [(SSRr − SSRur)/(dfr − dfur)] / [SSRur/dfur]
  = [(SSRr − SSRur)/q] / [SSRur/(n − k − 1)]    (66)
where q is the number of exclusion restrictions imposed under
the null (q = 3 in the MLB case) and k is the number of
explanatory variables in the unrestricted model (k = 5 in the
MLB case).
q = numerator df = dfr − dfur (67)
n − k − 1 = denominator df = dfur (68)
I The denominator of the F statistic, SSRur /dfur , is the
unbiased estimator of σ2 from the unrestricted model.
I Note that F ≥ 0, and F > 0 virtually always holds.
I As a computational device, sometimes the formula
F = [(SSRr − SSRur)/SSRur] · [(n − k − 1)/q]    (69)
is useful. These days, we let a computer do it.
I Using classical testing, the rejection rule is of the form
F > c (70)
where c is an appropriately chosen critical value.
I We obtain c using the (hard to show) fact that, under H0 (the
q exclusion restrictions)
F ∼ Fq,n−k−1 (71)
I That is, an F distribution with (q, n − k − 1) degrees of
freedom.
I Tables G.3a, G.3b, and G.3c contain critical values for 10%,
5%, and 1% significance levels.
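Instead of the tables, critical values and p-values can be computed
directly in Stata; a sketch with q = 3 and n − k − 1 = 347 (the F value
in the last line is a made-up number used only for illustration):

    display invFtail(3, 347, .10)      // 10% critical value for F(3, 347)
    display invFtail(3, 347, .05)      // 5% critical value
    display invFtail(3, 347, .01)      // 1% critical value
    display Ftail(3, 347, 2.80)        // p-value for a hypothetical F statistic of 2.80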
I In the MLB example with n = 353, k = 5, and q = 3, we have
numerator df = 3, denominator df = 347. Because the
denominator df is above 120, we use the “∞” entry. The 10%
cv is 2.08, the 5% cv is 2.60, and the 1% cv is 3.78.
I As with t testing, it is better to compute a p-value. These are
reported by Stata after every test command.
I The F statistic for excluding bavg, hrunsyr, and rbisyr from
the model is 9.55. This is well above the 1% critical value, so
we reject the null at the 1% level. In fact, to four decimal
places, the p-value is zero.
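A sketch of the joint test in Stata, assuming MLBSAL.DTA is in the
working directory:

    use MLBSAL, clear
    regress lsalary years gamesyr bavg hrunsyr rbisyr
    test bavg hrunsyr rbisyr                  // F test of H0: beta3 = beta4 = beta5 = 0
    display "F = " r(F) "   p-value = " r(p)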
The “by hand” calculation works, too, but we must estimate the
restricted model:
I So SSRr = 198.31, SSRur = 183.19.
F = [(198.31 − 183.19)/183.19] · (347/3) ≈ 9.55    (72)
rounded to two decimal places, which is the same value
computed by Stata.
I We say that bavg, hrunsyr, and rbisyr are jointly statistically
significant (or just jointly significant) – in this case, at any
small significance level we want.
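The same calculation can be scripted from the stored sums of squared
residuals; a sketch:

    quietly regress lsalary years gamesyr bavg hrunsyr rbisyr   // unrestricted model
    scalar ssr_ur = e(rss)
    scalar dfden  = e(df_r)                                      // 347
    quietly regress lsalary years gamesyr                        // restricted model
    scalar ssr_r  = e(rss)
    scalar Fstat  = ((ssr_r - ssr_ur)/3) / (ssr_ur/dfden)
    display "F = " Fstat "   p-value = " Ftail(3, dfden, Fstat)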
I The F statistic does not allow us to tell which of the
population coefficients are different from zero. And the t
statistics do not help much.
I We have, in effect, set up the problem ourselves by including
the two highly correlated variables, hrunsyr, and rbisyr.
I Not much is lost by dropping one of the two variables. Notice
that doing so does not change the fact that bavg is
statistically insignificant and its coefficient remains small.
Relationship Between F and t Statistics
I In the F test setting, nothing rules out q = 1. So, we can use
the F statistic to test, say,
H0 : βj = 0
H1 : βj ≠ 0
(Remember, H1 is that H0 does not hold.)
I Of course, we can use a t statistic, too. Does that mean there
are now two ways to test a single exclusion restriction?
I No. It can be shown that, when q = 1,
F = t^2    (73)
This makes sense because the square of a t_(n−k−1) random
variable has the F_(1,n−k−1) distribution.
I The t test is more transparent and allows one-sided
alternatives. The F test adds nothing and takes a little away.
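This is easy to verify for any single coefficient: test it with the test
command and square the corresponding t statistic. A sketch reusing the
junior-college example (data assumed to be in memory):

    regress lwage jc univ exper
    test exper                                           // F statistic for H0: beta3 = 0
    display "t squared = " (_b[exper]/_se[exper])^2 "   F from test = " r(F)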
I We can abuse F tests, too. Suppose
y = β0 + β1x1 + β2x2 + β3x3 + u (74)
where the t statistic for β̂2 is statistically significant (at the
5% level) but that for β̂3 is insignificant.
I Logically, this means we should reject H0 : β2 = 0, β3 = 0.
But if we apply the F test for joint significance of x2 and x3,
it need not reject.
I Generally, if we use an F test to jointly test a significant
variable along with a bunch of insignificant variables, it has
less power than the single t test.
R-Squared Form of the F Statistic
I When doing your own research, compute F statistics using a
built-in command. But it is sometimes useful to be able to
compute an F statistic using standard output.
I The R-squared is reported most of the time (unlike the SSR).
It turns out that F tests for exclusion restrictions can be
computed entirely from the R-squareds for the restricted and
unrestricted models.
I The key is that, because the same dependent variable is used
in the two regressions,
SSRr = (1 − R^2_r)·SST    (75)
SSRur = (1 − R^2_ur)·SST    (76)
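To see where the R-squared form of the statistic comes from, substitute
(75) and (76) into the SSR form (66): the common factor SST cancels,
and using dfr − dfur = q and dfur = n − k − 1,

F = {[(1 − R^2_r) − (1 − R^2_ur)]/q} / [(1 − R^2_ur)/(n − k − 1)]
  = [(R^2_ur − R^2_r)/q] / [(1 − R^2_ur)/(n − k − 1)],

which is exactly the form given next.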
I Simple algebra shows F can also be written as
F = [(R^2_ur − R^2_r)/q] / [(1 − R^2_ur)/(n − k − 1)]    (77)
I Notice how R^2_ur comes first in the numerator. We know
R^2_ur ≥ R^2_r, so this ensures F ≥ 0.
I EXAMPLE: Do factors other than price affect the demand
for “ecologically friendly” apples? (APPLE.DTA) These are
survey data, and the eco-friendly apple was fictitious.
I ecolbs cannot be normally distributed. More than 37% of the
values are piled up at zero. We will return to these situations
later. For now, we just use the F statistic like always.
I Using the R-squared from the unrestricted and restricted
models, we get F = .646. Using the test command gives the
same thing, but of course we get the p-value, which is very
large. There is no evidence that variables other than price
affect ecolbs. Note that we need to use the test command
after estimating the unrestricted model.
I In Stata, qui is an abbreviation of quietly: it suppresses the
on-screen output of the command it prefixes.
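A sketch of the R-squared form of the F statistic for this example
(APPLE.DTA is named in these notes, but the exact regressor list is
abbreviated, so pvar1, pvar2, and z1-z4 below are placeholder names for
the price and non-price variables):

    use APPLE, clear
    quietly regress ecolbs pvar1 pvar2 z1 z2 z3 z4      // unrestricted model (placeholders)
    scalar r2_ur = e(r2)
    scalar dfden = e(df_r)
    quietly regress ecolbs pvar1 pvar2                  // restricted model: price variables only
    scalar r2_r  = e(r2)
    scalar Fstat = ((r2_ur - r2_r)/4) / ((1 - r2_ur)/dfden)   // q = 4 placeholders dropped
    display "F = " Fstat "   p-value = " Ftail(4, dfden, Fstat)

Running test z1 z2 z3 z4 after the unrestricted regression produces the
same F and p-value.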
The F Statistic for Overall Significance of a Regression
I The F statistic in the upper right-hand corner of the Stata
output tests a very special null hypothesis. In the model
y = β0 + β1x1 + β2x2 + ... + βkxk + u
the null is that all slope coefficients are zero:
H0 : β1 = 0, β2 = 0, ..., βk = 0 (78)
I This means that none of the xj helps explain y.
I If we cannot reject this null, we have found no factors that
explain y.
I Some economic theories imply that a variable cannot be
predicted using certain sets of variables. For example, firm
stock returns over a given period should not be predictable
based on information at the beginning of that period. This is
one use of the overall F statistic.
I Do not use this statistic for testing a subset of variables.
I For this test, R^2_r = 0 (no explanatory variables under H0),
and R^2_ur = R^2 from the regression. So
F = [R^2/k] / [(1 − R^2)/(n − k − 1)]
  = [R^2/(1 − R^2)] · [(n − k − 1)/k]    (79)
so F is directly related to the R-squared. As R^2 increases, so
does F.
I A small R^2 can still lead to a significant F, maybe a very
significant one. If df = n − k − 1 is large – because of large n
– F can be large even with a “small” R^2.
I Increasing k decreases F.
I Even with R2 = .0402, the p-value for the joint F is very
small, .0002. So the variables ecoprc, ..., educ are jointly
statistically significant.
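After any regress command the overall F statistic and its ingredients
are stored in e(), so the relationship in (79) is easy to check; a
sketch with placeholder variable names:

    quietly regress y x1 x2 x3                  // placeholder names
    display "overall F reported by Stata:  " e(F)
    display "computed from the R-squared:  " (e(r2)/e(df_m)) / ((1 - e(r2))/e(df_r))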
ECON 371: Econometrics The Multiple Regression Model: Inference

More Related Content

Similar to Chapter4_Multi_Reg_Estim.pdf.pdf

Chapter 7Continuous Random VariablesCopyright ©201
Chapter 7Continuous Random VariablesCopyright ©201Chapter 7Continuous Random VariablesCopyright ©201
Chapter 7Continuous Random VariablesCopyright ©201JinElias52
 
A comparative analysis of predictve data mining techniques3
A comparative analysis of predictve data mining techniques3A comparative analysis of predictve data mining techniques3
A comparative analysis of predictve data mining techniques3Mintu246
 
Factor analysis
Factor analysis Factor analysis
Factor analysis Mintu246
 
Advanced Econometrics L5-6.pptx
Advanced Econometrics L5-6.pptxAdvanced Econometrics L5-6.pptx
Advanced Econometrics L5-6.pptxakashayosha
 
A measure to evaluate latent variable model fit by sensitivity analysis
A measure to evaluate latent variable model fit by sensitivity analysisA measure to evaluate latent variable model fit by sensitivity analysis
A measure to evaluate latent variable model fit by sensitivity analysisDaniel Oberski
 
Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help HelpWithAssignment.com
 
Econometrics of panel data - a presentation
Econometrics of panel data - a presentationEconometrics of panel data - a presentation
Econometrics of panel data - a presentationdrbojutjub
 
A walk in the black forest - during which I explain the fundamental problem o...
A walk in the black forest - during which I explain the fundamental problem o...A walk in the black forest - during which I explain the fundamental problem o...
A walk in the black forest - during which I explain the fundamental problem o...Richard Gill
 
2.3 the simple regression model
2.3 the simple regression model2.3 the simple regression model
2.3 the simple regression modelRegmi Milan
 
MEM and SEM in the GME framework: Modelling Perception and Satisfaction - Car...
MEM and SEM in the GME framework: Modelling Perception and Satisfaction - Car...MEM and SEM in the GME framework: Modelling Perception and Satisfaction - Car...
MEM and SEM in the GME framework: Modelling Perception and Satisfaction - Car...SYRTO Project
 
goodness of fit in ordinal regression.pdf
goodness of fit in ordinal regression.pdfgoodness of fit in ordinal regression.pdf
goodness of fit in ordinal regression.pdfHenokBuno
 
ORDINARY LEAST SQUARES REGRESSION OF ORDERED CATEGORICAL DATA- IN
ORDINARY LEAST SQUARES REGRESSION OF ORDERED CATEGORICAL DATA- INORDINARY LEAST SQUARES REGRESSION OF ORDERED CATEGORICAL DATA- IN
ORDINARY LEAST SQUARES REGRESSION OF ORDERED CATEGORICAL DATA- INBeth Larrabee
 
Chapter18 econometrics-sure models
Chapter18 econometrics-sure modelsChapter18 econometrics-sure models
Chapter18 econometrics-sure modelsMilton Keynes Keynes
 
Thesis Defense
Thesis DefenseThesis Defense
Thesis DefenseAaron Lu
 

Similar to Chapter4_Multi_Reg_Estim.pdf.pdf (20)

Climate Extremes Workshop - Assessing models for estimation and methods for u...
Climate Extremes Workshop - Assessing models for estimation and methods for u...Climate Extremes Workshop - Assessing models for estimation and methods for u...
Climate Extremes Workshop - Assessing models for estimation and methods for u...
 
Chapter 7Continuous Random VariablesCopyright ©201
Chapter 7Continuous Random VariablesCopyright ©201Chapter 7Continuous Random VariablesCopyright ©201
Chapter 7Continuous Random VariablesCopyright ©201
 
A comparative analysis of predictve data mining techniques3
A comparative analysis of predictve data mining techniques3A comparative analysis of predictve data mining techniques3
A comparative analysis of predictve data mining techniques3
 
Factor analysis
Factor analysis Factor analysis
Factor analysis
 
Advanced Econometrics L5-6.pptx
Advanced Econometrics L5-6.pptxAdvanced Econometrics L5-6.pptx
Advanced Econometrics L5-6.pptx
 
A measure to evaluate latent variable model fit by sensitivity analysis
A measure to evaluate latent variable model fit by sensitivity analysisA measure to evaluate latent variable model fit by sensitivity analysis
A measure to evaluate latent variable model fit by sensitivity analysis
 
introduction CDA.pptx
introduction CDA.pptxintroduction CDA.pptx
introduction CDA.pptx
 
Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help
 
Econometrics of panel data - a presentation
Econometrics of panel data - a presentationEconometrics of panel data - a presentation
Econometrics of panel data - a presentation
 
A walk in the black forest - during which I explain the fundamental problem o...
A walk in the black forest - during which I explain the fundamental problem o...A walk in the black forest - during which I explain the fundamental problem o...
A walk in the black forest - during which I explain the fundamental problem o...
 
Geert van Kollenburg-masterthesis
Geert van Kollenburg-masterthesisGeert van Kollenburg-masterthesis
Geert van Kollenburg-masterthesis
 
2.3 the simple regression model
2.3 the simple regression model2.3 the simple regression model
2.3 the simple regression model
 
MEM and SEM in the GME framework: Modelling Perception and Satisfaction - Car...
MEM and SEM in the GME framework: Modelling Perception and Satisfaction - Car...MEM and SEM in the GME framework: Modelling Perception and Satisfaction - Car...
MEM and SEM in the GME framework: Modelling Perception and Satisfaction - Car...
 
goodness of fit in ordinal regression.pdf
goodness of fit in ordinal regression.pdfgoodness of fit in ordinal regression.pdf
goodness of fit in ordinal regression.pdf
 
Ols by hiron
Ols by hironOls by hiron
Ols by hiron
 
ORDINARY LEAST SQUARES REGRESSION OF ORDERED CATEGORICAL DATA- IN
ORDINARY LEAST SQUARES REGRESSION OF ORDERED CATEGORICAL DATA- INORDINARY LEAST SQUARES REGRESSION OF ORDERED CATEGORICAL DATA- IN
ORDINARY LEAST SQUARES REGRESSION OF ORDERED CATEGORICAL DATA- IN
 
Chapter18 econometrics-sure models
Chapter18 econometrics-sure modelsChapter18 econometrics-sure models
Chapter18 econometrics-sure models
 
Thesis Defense
Thesis DefenseThesis Defense
Thesis Defense
 
Unit3
Unit3Unit3
Unit3
 
Matlab:Regression
Matlab:RegressionMatlab:Regression
Matlab:Regression
 

More from ROBERTOENRIQUEGARCAA1 (20)

Memory Lecture Psychology Introduction part 1
Memory Lecture Psychology Introduction part 1Memory Lecture Psychology Introduction part 1
Memory Lecture Psychology Introduction part 1
 
Sherlock.pdf
Sherlock.pdfSherlock.pdf
Sherlock.pdf
 
Cognicion Social clase
Cognicion Social claseCognicion Social clase
Cognicion Social clase
 
surveys non experimental
surveys non experimentalsurveys non experimental
surveys non experimental
 
experimental research
experimental researchexperimental research
experimental research
 
non experimental
non experimentalnon experimental
non experimental
 
quasi experimental research
quasi experimental researchquasi experimental research
quasi experimental research
 
variables cont
variables contvariables cont
variables cont
 
sampling experimental
sampling experimentalsampling experimental
sampling experimental
 
experimental designs
experimental designsexperimental designs
experimental designs
 
experimental control
experimental controlexperimental control
experimental control
 
validity reliability
validity reliabilityvalidity reliability
validity reliability
 
Experiment basics
Experiment basicsExperiment basics
Experiment basics
 
methods
methodsmethods
methods
 
Week 11.pptx
Week 11.pptxWeek 11.pptx
Week 11.pptx
 
treatment effect DID.pdf
treatment effect DID.pdftreatment effect DID.pdf
treatment effect DID.pdf
 
DAG.pdf
DAG.pdfDAG.pdf
DAG.pdf
 
2022_Fried_Workshop_theory_measurement.pptx
2022_Fried_Workshop_theory_measurement.pptx2022_Fried_Workshop_theory_measurement.pptx
2022_Fried_Workshop_theory_measurement.pptx
 
pdf (8).pdf
pdf (8).pdfpdf (8).pdf
pdf (8).pdf
 
pdf (9).pdf
pdf (9).pdfpdf (9).pdf
pdf (9).pdf
 

Recently uploaded

The Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdfThe Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdfGale Pooley
 
Malad Call Girl in Services 9892124323 | ₹,4500 With Room Free Delivery
Malad Call Girl in Services  9892124323 | ₹,4500 With Room Free DeliveryMalad Call Girl in Services  9892124323 | ₹,4500 With Room Free Delivery
Malad Call Girl in Services 9892124323 | ₹,4500 With Room Free DeliveryPooja Nehwal
 
Bladex Earnings Call Presentation 1Q2024
Bladex Earnings Call Presentation 1Q2024Bladex Earnings Call Presentation 1Q2024
Bladex Earnings Call Presentation 1Q2024Bladex
 
How Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of ReportingHow Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of ReportingAggregage
 
The Economic History of the U.S. Lecture 19.pdf
The Economic History of the U.S. Lecture 19.pdfThe Economic History of the U.S. Lecture 19.pdf
The Economic History of the U.S. Lecture 19.pdfGale Pooley
 
20240417-Calibre-April-2024-Investor-Presentation.pdf
20240417-Calibre-April-2024-Investor-Presentation.pdf20240417-Calibre-April-2024-Investor-Presentation.pdf
20240417-Calibre-April-2024-Investor-Presentation.pdfAdnet Communications
 
VIP Kolkata Call Girl Serampore 👉 8250192130 Available With Room
VIP Kolkata Call Girl Serampore 👉 8250192130  Available With RoomVIP Kolkata Call Girl Serampore 👉 8250192130  Available With Room
VIP Kolkata Call Girl Serampore 👉 8250192130 Available With Roomdivyansh0kumar0
 
(DIYA) Bhumkar Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(DIYA) Bhumkar Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(DIYA) Bhumkar Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(DIYA) Bhumkar Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
High Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
High Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsHigh Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
High Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptxOAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptxhiddenlevers
 
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance CompanyInterimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance CompanyTyöeläkeyhtiö Elo
 
VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...
VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...
VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...Call Girls in Nagpur High Profile
 
Instant Issue Debit Cards - High School Spirit
Instant Issue Debit Cards - High School SpiritInstant Issue Debit Cards - High School Spirit
Instant Issue Debit Cards - High School Spiritegoetzinger
 
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...shivangimorya083
 
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptxFinTech Belgium
 
The Economic History of the U.S. Lecture 18.pdf
The Economic History of the U.S. Lecture 18.pdfThe Economic History of the U.S. Lecture 18.pdf
The Economic History of the U.S. Lecture 18.pdfGale Pooley
 
VIP Call Girls Thane Sia 8617697112 Independent Escort Service Thane
VIP Call Girls Thane Sia 8617697112 Independent Escort Service ThaneVIP Call Girls Thane Sia 8617697112 Independent Escort Service Thane
VIP Call Girls Thane Sia 8617697112 Independent Escort Service ThaneCall girls in Ahmedabad High profile
 
03_Emmanuel Ndiaye_Degroof Petercam.pptx
03_Emmanuel Ndiaye_Degroof Petercam.pptx03_Emmanuel Ndiaye_Degroof Petercam.pptx
03_Emmanuel Ndiaye_Degroof Petercam.pptxFinTech Belgium
 

Recently uploaded (20)

The Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdfThe Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdf
 
Malad Call Girl in Services 9892124323 | ₹,4500 With Room Free Delivery
Malad Call Girl in Services  9892124323 | ₹,4500 With Room Free DeliveryMalad Call Girl in Services  9892124323 | ₹,4500 With Room Free Delivery
Malad Call Girl in Services 9892124323 | ₹,4500 With Room Free Delivery
 
Bladex Earnings Call Presentation 1Q2024
Bladex Earnings Call Presentation 1Q2024Bladex Earnings Call Presentation 1Q2024
Bladex Earnings Call Presentation 1Q2024
 
How Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of ReportingHow Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of Reporting
 
The Economic History of the U.S. Lecture 19.pdf
The Economic History of the U.S. Lecture 19.pdfThe Economic History of the U.S. Lecture 19.pdf
The Economic History of the U.S. Lecture 19.pdf
 
20240417-Calibre-April-2024-Investor-Presentation.pdf
20240417-Calibre-April-2024-Investor-Presentation.pdf20240417-Calibre-April-2024-Investor-Presentation.pdf
20240417-Calibre-April-2024-Investor-Presentation.pdf
 
VIP Kolkata Call Girl Serampore 👉 8250192130 Available With Room
VIP Kolkata Call Girl Serampore 👉 8250192130  Available With RoomVIP Kolkata Call Girl Serampore 👉 8250192130  Available With Room
VIP Kolkata Call Girl Serampore 👉 8250192130 Available With Room
 
(DIYA) Bhumkar Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(DIYA) Bhumkar Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(DIYA) Bhumkar Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(DIYA) Bhumkar Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
High Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
High Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsHigh Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
High Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
Commercial Bank Economic Capsule - April 2024
Commercial Bank Economic Capsule - April 2024Commercial Bank Economic Capsule - April 2024
Commercial Bank Economic Capsule - April 2024
 
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptxOAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
 
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
 
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance CompanyInterimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
 
VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...
VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...
VVIP Pune Call Girls Katraj (7001035870) Pune Escorts Nearby with Complete Sa...
 
Instant Issue Debit Cards - High School Spirit
Instant Issue Debit Cards - High School SpiritInstant Issue Debit Cards - High School Spirit
Instant Issue Debit Cards - High School Spirit
 
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
 
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
 
The Economic History of the U.S. Lecture 18.pdf
The Economic History of the U.S. Lecture 18.pdfThe Economic History of the U.S. Lecture 18.pdf
The Economic History of the U.S. Lecture 18.pdf
 
VIP Call Girls Thane Sia 8617697112 Independent Escort Service Thane
VIP Call Girls Thane Sia 8617697112 Independent Escort Service ThaneVIP Call Girls Thane Sia 8617697112 Independent Escort Service Thane
VIP Call Girls Thane Sia 8617697112 Independent Escort Service Thane
 
03_Emmanuel Ndiaye_Degroof Petercam.pptx
03_Emmanuel Ndiaye_Degroof Petercam.pptx03_Emmanuel Ndiaye_Degroof Petercam.pptx
03_Emmanuel Ndiaye_Degroof Petercam.pptx
 

Chapter4_Multi_Reg_Estim.pdf.pdf

  • 1. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions The Multiple Regression Model: Inference ECON 371: Econometrics October 20, 2021 ECON 371: Econometrics The Multiple Regression Model: Inference
  • 2. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Relationship Between F and t Statistics R-Squared Form of the F Statistic The F Statistic for Overall Significance of a Regression ECON 371: Econometrics The Multiple Regression Model: Inference
  • 3. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions So far, what do we know how to do with the population model y = β0 + β1x1 + ... + βkxk + u (1) I Mechanics of OLS for a given sample. (We only need MLR.2 insofar as it introduces the data, and MLR.3 – no perfect collinearity – so that the OLS estimates exist.) Interpretation of OLS regression line – ceteris paribus effects – R2 goodness-of-fit measure. Some functional form (natural logarithm). ECON 371: Econometrics The Multiple Regression Model: Inference
  • 4. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I MLR.1: y = β0 + β1x1 + β2x2 + ... + βkxk + u I MLR.2: random sampling I MLR.3: The xij vary across i and there are no exact linear relationships among the xij (no perfect collinearity) I MLR.4: E(u|x1, ..., xk) = E(u) = 0 (exogenous explanatory variables) I MLR.5: Var(u|x1, ..., xk) = Var(u) = σ2 (homoskedasticity) ECON 371: Econometrics The Multiple Regression Model: Inference
  • 5. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I Unbiasedness of OLS under MLR.1 to MRL.4. Obtain bias (or at least the direction) when MLR.4 fails due to an omitted variable. I Obtain the variances, Var(β̂j ), under MLR.1 to MLR.5. I The Gauss-Markov Assumptions also imply OLS is the best linear unbiased estimator (BLUE). ECON 371: Econometrics The Multiple Regression Model: Inference
  • 6. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions We now want to test hypotheses about the βj . This means we hypothesize that a population parameter is a certain value, then use the data to determine whether the hypothesis is likely false. EXAMPLE: (Motivated by ATTEND.DTA) final = β0 + β1missed + β2priGPA + β3ACT + u (2) where ACT is the achievement test score. The null hypothesis, that missing lecture has no effect on final exam performance (after accounting for prior MSU GPA and ACT score), is H0 : β1 = 0 (3) ECON 371: Econometrics The Multiple Regression Model: Inference
  • 7. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions To test hypotheses about the βj using exact (or “finite sample”) testing procedures, we need to know more than just the mean and variance of the OLS estimators. MLR.1 to MLR.4: E(β̂j ) = βj (4) MLR.1 to MLR.5: Var(β̂j ) = σ2 SSTj (1 − R2 j ) (5) And, σ̂2 = SSR n−k−1 is an unbiased estimator of σ2. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 8. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I But hypothesis testing relies on the entire sampling distributions of the β̂j . Even under MLR.1 through MLR.5, the sample distributions can be virtually anything. β̂j = βj + n X i=1 wij ui (6) where the wij are functions of {(xi1, ..., xik) : i = 1, ..., n}. I Conditional on {(xi1, ..., xik) : i = 1, ..., n}, β̂j inherits its distribution from that of {ui : i = 1, .., n}, which is a random sample from the population distribution of u. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 9. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Assumption MLR.6 (Normality) The population error u is independent of (x1, ..., xk) and is normally distributed with mean zero and variance σ2: u ∼ Normal(0, σ2 ) (7) I MLR.4: E(u|x1, ..., xk) = E(u) = 0 I MLR.5: Var(u|x1, ..., xk) = Var(u) = σ2 I Now MLR.6 imposes full independence (not just in the mean and variance). ECON 371: Econometrics The Multiple Regression Model: Inference
  • 10. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I But the important part of MLR.6 is that we have now made a very specific distributional assumption for u: the familiar bell-shaped curve. I Ultimately, like Assumption MLR.5, Assumption MLR.6 is maintained for convenience, and it can easily fail. Fortunately, we will later see that, for approximate inference in large samples, we can drop MLR.6. For now we keep it. I Very difficult to perform exact statistical inference without normality. I Assumptions MLR.1 to MLR.6 are called the classical linear model (CLM) assumptions (for cross-sectional regression). ECON 371: Econometrics The Multiple Regression Model: Inference
  • 11. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I For any practical purpose, think of CLM = GM + normality assumption (8) I An important fact about independent normal random variables: any linear combination is also normally distributed. I Because the ui are independent and identically distributed (i.i.d.) as Normal(0, σ2), β̂j = βj + n X i=1 wij ui ∼ Normal[βj , Var(β̂j )] (9) where Var(β̂j ) = σ2 SSTj (1 − R2 j ) (10) ECON 371: Econometrics The Multiple Regression Model: Inference
  • 12. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions THEOREM (Normal Sampling Distributions) Under the CLM Assumptions (and conditional on the sample outcomes of the explanatory variables), β̂j ∼ Normal[βj , Var(β̂j )] (11) and so β̂j − βj sd(β̂j ) ∼ Normal(0, 1) (12) The second result follows from a feature of the normal distribution: if W ∼ Normal then so is a + bW for constants a and b. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 13. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions The standardized random variable β̂j − βj sd(β̂j ) (13) always has zero mean and variance one. Under MLR.6, it is also normally distributed. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 14. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance Testing Hypotheses About a Single Population Parameter I We cannot directly use the result β̂j − βj sd(β̂j ) ∼ Normal(0, 1) (14) to test hypotheses about βj : sd(β̂j ) depends on σ = sd(u), which is unknown. I But we have σ̂ as an estimator of σ. Using this in place of σ gives us the standard error, se(β̂j ). ECON 371: Econometrics The Multiple Regression Model: Inference
  • 15. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance THEOREM (t Distribution for Standardized Estimators) Under the CLM Assumptions, β̂j − βj se(β̂j ) ∼ tn−k−1 = tdf (15) We will not prove this. I It is the replacement of σ (a constant) with σ̂(an estimator that varies across samples, just like β̂j ), that takes us from the standard normal to the t distribution. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 16. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I The t distribution also has a bell shape, but is more spread out than the Normal(0, 1). E(tdf ) = 0 if df > 1 (16) Var(tdf ) = df df − 2 > 1 if df > 2 (17) We will never have really small df in this class. I When df = 10, Var(tdf ) = 1.25, which is 25% larger than the Normal(0, 1) variance. I When df = 120, Var(tdf ) ≈ 1.017. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 17. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I As df → ∞, tdf → Normal(0, 1) (18) I The difference is practically small for df > 120. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 18. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I We use the result on the t distribution to test the null hypothesis that xj has no (partial) effect on y: H0 : βj = 0 (19) I For example, lwage = β0 + β1educ + β2exper + β3tenure + u (20) H0 : β2 = 0 (21) I In words: Once we control for education and time on the current job (tenure), total workforce experience has no impact on lwage = ln(wage). ECON 371: Econometrics The Multiple Regression Model: Inference
• 19. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I To test H0 : βj = 0, we use the t statistic (or t ratio), tβ̂j = (β̂j − 0)/se(β̂j ) = β̂j /se(β̂j ) (22) I This is the estimated coefficient divided by our estimate of its sampling standard deviation. In virtually all cases β̂j is not exactly equal to zero. When we use tβ̂j , we are measuring how far β̂j is from zero relative to its standard error. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 20. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I Because se(β̂j ) > 0, tβ̂j always has the same sign as β̂j . To use tβ̂j to test H0 : βj = 0, we need to have an alternative. I Some like to define tβ̂j as the absolute value, so it is always positive. This makes it cumbersome to test against one-sided alternatives. ECON 371: Econometrics The Multiple Regression Model: Inference
• 21. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance Testing Against One-Sided Alternatives I First consider the alternative H1 : βj > 0 (23) which means the null is effectively H0 : βj ≤ 0 (24) Using a positive one-sided alternative, if we reject βj = 0 then we reject any βj < 0, too. We often just state H0 : βj = 0 and act as if we do not care about negative values. ECON 371: Econometrics The Multiple Regression Model: Inference
• 22. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I If the estimated coefficient β̂j is negative, it provides no evidence against H0 in favor of H1 : βj > 0. I If β̂j is positive, the question is: how big does tβ̂j = β̂j /se(β̂j ) have to be before we conclude H0 is “unlikely”? ECON 371: Econometrics The Multiple Regression Model: Inference
  • 23. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance Traditional approach to hypothesis testing: 1. Choose a null hypothesis: H0 : βj = 0 (or H0 : βj ≤ 0) 2. Choose an alternative hypothesis: H1 : βj > 0 3. Choose a significance level (or simply level, or size) for the test. That is, the probability of rejecting the null hypothesis when it is in fact true (Type I Error). Suppose we use 5%, so the probability of committing a Type I error is 0.05. 4. Choose a critical value, c > 0, so that the rejection rule tβ̂j > c leads to a 5% level test. ECON 371: Econometrics The Multiple Regression Model: Inference
• 24. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I The key is that, under the null hypothesis, tβ̂j ∼ tn−k−1 = tdf (25) and this is what we use to obtain the critical value, c. I Suppose df = 28 and we use a 5% test. The critical value is c = 1.701, as can be obtained from Table G.2 (page 745 in 6e). I We are conducting a one-tailed test (and it is these entries that should be used in the table). See Figure 4.2 on page 124. ECON 371: Econometrics The Multiple Regression Model: Inference
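These one-tailed critical values can also be computed directly rather than read from Table G.2; a minimal Stata sketch (invttail(df, p) returns the t value with upper-tail area p):
  display invttail(28, 0.05)   // 5% one-tailed critical value with df = 28, about 1.701
  display invttail(28, 0.01)   // 1% one-tailed critical value, about 2.467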
  • 25. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I So, with df = 28, the rejection rule for H0 : βj = 0 against H1 : βj > 0, at the 5% level, is tβ̂j > 1.701 We need a t statistic greater than 1.701 to conclude there is enough evidence against H0. I If tβ̂j ≤ 1.701, we fail to reject H0 against H1 at the 5% significance level. I Suppose df = 28, but we want to carry out the test at a different significance level (often 10% level or the 1% level). c.10 = 1.313 c.05 = 1.701 c.01 = 2.467 ECON 371: Econometrics The Multiple Regression Model: Inference
  • 26. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I If we want to reduce the probability of Type I error, we must increase the critical value (so we reject the null less often). I If we reject at, say, the 1% level, then we must also reject at any larger level. I If we fail to reject at, say, the 10% level – so that tβ̂j ≤ 1.313 – then we will fail to reject at any smaller level. ECON 371: Econometrics The Multiple Regression Model: Inference
• 27. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I With large sample sizes – certainly when df > 120 – we can use critical values from the standard normal distribution. These are the df = ∞ entries in Table G.2. c.10 = 1.282 c.05 = 1.645 c.01 = 2.326 which we can round to 1.28, 1.65, and 2.33, respectively. The value 1.65 is especially common for a one-tailed test. ECON 371: Econometrics The Multiple Regression Model: Inference
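The same large-sample values come from the standard normal quantile function; a small Stata sketch:
  display invnormal(0.90)   // about 1.28
  display invnormal(0.95)   // about 1.65
  display invnormal(0.99)   // about 2.33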
• 28. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance EXAMPLE: Factors Affecting lwage (WAGE2.DTA) In applications, it is helpful to label parameters with variable names to state hypotheses, e.g., βeduc, βIQ, and βexper . Then H0 : βexper = 0 (26) is that workforce experience has no effect on wages once education and IQ have been accounted for. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 29. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance lwage = 5.198 (.007) + .057 (.007) educ + .0058 (.0010) IQ + .0195 (.0032) exper n = 935, R2 = .162 I The quantities in parentheses are still standard errors, not t statistics! I Easiest to read the t statistic off the Stata output, when available: texper = 6.02 which is well above the one-sided critical value at the 1% level, 2.326. In fact, the 0.5% critical value is about 2.58. ECON 371: Econometrics The Multiple Regression Model: Inference
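For reference, the estimates above come from a regression along these lines (a sketch, assuming WAGE2.DTA is in the working directory):
  use WAGE2, clear
  regress lwage educ IQ exper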
  • 30. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I The bottom line is that H0 : βexper = 0 can be rejected against H1 : βexper > 0 at very small significance levels. A t of 6.02 is very large. I The estimated effect of exper – that is, its economic importance – is apparent. Another year of experience, holding educ and IQ fixed, is estimated to be worth about 1.9%. I The t statistics for educ and IQ are also very large; there is no need to even look up critical values. ECON 371: Econometrics The Multiple Regression Model: Inference
• 31. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance EXAMPLE: Does ACT score help predict college GPA? Data in GPA1.DTA. n = 141 MSU students from the mid-1990s. All variables are self-reported. Consider controlling for high school GPA: colGPA = β0 + β1hsGPA + β2ACT + u (27) H0 : β2 = 0 (28) From the Stata output, β̂2 = β̂ACT = .0094 and tACT = .87. Even at the 10% level (c = 1.28), we cannot reject H0 against H1 : βACT > 0. ECON 371: Econometrics The Multiple Regression Model: Inference
• 32. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I Because we fail to reject H0 : βACT = 0, we say that “β̂ACT is statistically insignificant at the 10% level against the one-sided alternative.” I It is also important to see that the estimated effect of ACT is small. Three more points (slightly more than one standard deviation) predicts a colGPA that is only .0094(3) ≈ .028 points higher – not even three one-hundredths of a grade point. ECON 371: Econometrics The Multiple Regression Model: Inference
• 33. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I By contrast, β̂hsGPA = .453 is large in a practical sense – each point on hsGPA is associated with about .45 points on colGPA – and thsGPA = 4.73 is very large. I No critical values in Table G.2 with df = 141 − 3 = 138 are even close to 4. So “β̂hsGPA is statistically significant” at very small significance levels. I Notice what happens if we do not control for hsGPA. The simple regression estimate is .0271 with tACT = 2.49. The magnitude is still pretty modest, but we would conclude it is statistically greater than zero at the 1% significance level, since 2.49 exceeds the one-sided 1% critical value (about 2.36 with df = 120). ECON 371: Econometrics The Multiple Regression Model: Inference
  • 34. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I Not clear why ACT has such a small, statistically insignificant effect. The sample size is small and the scores were self-reported. The survey was done in a couple of economics courses, so it is not a random sample of all MSU students. ECON 371: Econometrics The Multiple Regression Model: Inference
• 35. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance For the negative one-sided alternative, H1 : βj < 0, (29) we use a symmetric rule. But the rejection rule is tβ̂j < −c (30) where c is chosen in the same way as in the positive case. With df = 18 and a 5% test, the critical value is c = 1.734, so the rejection rule is tβ̂j < −1.734 (31) Now we must see a sufficiently negative value for the t statistic to reject H0 : βj = 0 in favor of H1 : βj < 0. ECON 371: Econometrics The Multiple Regression Model: Inference
• 36. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance EXAMPLE: Does missing lectures affect final exam performance? final = β0 + β1missed + β2priGPA + β3ACT + u (32) H0 : β1 = 0, H1 : β1 < 0 (33) I We get β̂1 = −0.079, tβ̂1 = −2.25. The 5% cv is −1.65 and the 1% cv is about −2.33. So we reject H0 in favor of H1 at the 5% level but not at the 1% level. I The effect is not huge: 10 missed lectures, out of 32, lowers the final exam score by about 0.8 points – so not even one point. I If we do not control for ACT score, the effect of missed goes away. It turns out that missed and ACT are positively correlated: those with higher ACT scores miss more classes, on average. ECON 371: Econometrics The Multiple Regression Model: Inference
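A sketch of the corresponding Stata commands (assuming ATTEND.DTA is in the working directory and the variables are named as in the model above):
  use ATTEND, clear
  regress final missed priGPA ACT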
  • 37. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance Reminder about Testing I Our hypotheses involve the unknown population values, βj . If in a given set of data we obtain, say, β̂j = 2.75, we do not write the null hypothesis as H0 : 2.75 = 0 (which is obviously false). I Nor do we write H0 : β̂j = 0 which is also false except in the very rare case that our estimate is exactly zero. I We do not test hypotheses about the estimates! We know what they are. We hypothesize about the unknown βj . ECON 371: Econometrics The Multiple Regression Model: Inference
  • 38. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance Testing Against Two-Sided Alternatives I Sometimes we do not know ahead of time whether a variable definitely has a positive effect or a negative effect. Even in the example final = β0 + β1missed + β2priGPA + β3ACT + u it is conceivable that missing class helps final exam performance. (The extra time is used for studying, say.) ECON 371: Econometrics The Multiple Regression Model: Inference
• 39. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I Generally, the null and alternative are H0 : βj = 0 H1 : βj ≠ 0 I Testing against the two-sided alternative is usually the default. It prevents us from looking at the regression results and then deciding on the alternative. Also, it is harder to reject H0 against the two-sided alternative, so it requires more evidence that xj actually affects y. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 40. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I Now we reject H0 if β̂j is sufficiently large in magnitude, either positive or negative. We again use the t statistic tβ̂j = β̂j /se(β̂j ), but now the rejection rule is |tβ̂j | > c (34) I This results in a two-tailed test, and those are the critical values we pull from Table G.2. I For example, if we use a 5% level test and df = 25, the two-tailed cv is 2.06. The two-tailed cv is, in this case, the 97.5 percentile in the t25 distribution. (Compare the one-tailed cv, about 1.71, the 95th percentile in the t25 distribution). ECON 371: Econometrics The Multiple Regression Model: Inference
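Two-tailed critical values put half the significance level in each tail; a small Stata sketch for the df = 25 case:
  display invttail(25, 0.025)   // 5% two-tailed critical value, about 2.06
  display invttail(25, 0.05)    // 5% one-tailed critical value, about 1.71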
  • 41. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance EXAMPLE: Factors affecting math pass rates. (mathpnl.DTA) I Run a multiple regression of math4 on lunch, ptr, avgsal, enrol. I A priori, we might expect lunch to have a negative effect (it is essentially a school-level poverty rate), ptr to have a negative effect, and avgsal to have a positive effect. But we can still test against a two-sided alternative to avoid specifying the alternative ahead of time. I enrol is clearly ambiguous. ECON 371: Econometrics The Multiple Regression Model: Inference
• 42. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I With 2200 observations, we can use the standard normal critical values. For a 10% test it is 1.65, for a 5%, 1.96, and for 1%, cv = 2.58. I For the variables lunch, ptr, and avgsal, all coefficients have the anticipated signs, and the absolute values of the t statistics are above 4. So we easily reject H0 : βj = 0 against H1 : βj ≠ 0. I enrol is a different situation. tenrol = 0.57 < 1.65, so we fail to reject H0 at even the 10% significance level. I Functional form can make a difference. The math pass rates are capped at 100, so a diminishing effect in avgsal and enrol seems appropriate; these variables also have lots of variation. So we use logarithms instead, as in the commands below. ECON 371: Econometrics The Multiple Regression Model: Inference
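A minimal Stata sketch of the log respecification (assuming the mathpnl.DTA variables are named as above; skip the gen lines if the logged variables already exist in the data set):
  gen lavgsal = ln(avgsal)
  gen lenrol = ln(enrol)
  regress math4 lunch ptr lavgsal lenrol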
• 43. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I Of course, all estimates change, but it is those on lavgsal and lenrol that are now much different. Before, we were measuring a dollar effect. But now, holding the other variables fixed, ∆math4 = (22.83/100)%∆avgsal = .2283(%∆avgsal) (35) So if, say, %∆avgsal = 10 – teacher salaries are 10 percent higher – math4 is estimated to increase by about 2.3 points. ECON 371: Econometrics The Multiple Regression Model: Inference
• 44. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I Also, ∆math4 = (1.39/100)%∆enrol = .0139(%∆enrol) (36) so a 10% increase in enrollment is associated with a .14 point increase in math4. I Notice how lenrol = ln(enrol) is statistically significant at the 5% level: tlenrol = 3.81 > 1.96. ECON 371: Econometrics The Multiple Regression Model: Inference
• 45. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I Reminder: When we report the results of, say, the second regression, it looks like math4 = −151.32 (25.11) − .325 (.019) lunch − 1.346 (.128) ptr (37) + 22.83 (2.50) lavgsal + 1.39 (0.36) lenrol (38) n = 2199, R2 = .2337 (39) so that standard errors are reported in parentheses below the coefficients. ECON 371: Econometrics The Multiple Regression Model: Inference
• 46. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I When we reject H0 : βj = 0 against H1 : βj ≠ 0, we often say that β̂j is statistically different from zero and usually mention a significance level. For example, if we can reject at the 1% level, we say that. If we can reject at the 10% level but not the 5%, we say that. I As in the one-sided case, we also say β̂j is “statistically significant” when we can reject H0 : βj = 0. ECON 371: Econometrics The Multiple Regression Model: Inference
• 47. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance Testing Other Hypotheses about the βj I Testing the null H0 : βj = 0 is by far the most common. That is why Stata and other regression packages automatically report the t statistic for this hypothesis. I It is critical to remember that tβ̂j = β̂j /se(β̂j ) (40) is only for H0 : βj = 0. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 48. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I What if we want to test a different null value? For example, in a constant-elasticity consumption function, ln(cons) = β0 + β1 ln(inc) + β2famsize + β3pareduc + u (41) we might want to test H0 : β1 = 1 (42) which means an income elasticity equal to one. (We can be pretty sure that β1 > 0.) ECON 371: Econometrics The Multiple Regression Model: Inference
• 49. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I More generally, suppose the null is H0 : βj = aj (43) where we specify the value aj (usually zero, but, in the consumption example, aj = 1). I It is easy to extend the t statistic: t = (β̂j − aj )/se(β̂j ) (44) I This t statistic just measures how far our estimate, β̂j , is from the hypothesized value, aj , relative to se(β̂j ). ECON 371: Econometrics The Multiple Regression Model: Inference
• 50. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I A useful expression for general t testing: t = (estimate − hypothesized value)/standard error(estimate) (45) I The alternative can be one-sided or two-sided. I We choose critical values exactly the same way as before. ECON 371: Econometrics The Multiple Regression Model: Inference
• 51. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I The language needs to be suitably modified. If, for example, H0 : βj = 1 H1 : βj ≠ 1 is rejected at the 5% level, we say “β̂j is statistically different from one at the 5% level.” Otherwise, β̂j is “not statistically different from one.” If the alternative is H1 : βj > 1, then “β̂j is statistically greater than one at the 5% level.” ECON 371: Econometrics The Multiple Regression Model: Inference
  • 52. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance EXAMPLE: Relationship between crime and enrollment on college campuses. (CAMPUS.DTA) A simple regression model: ln(crime) = β0 + β1 ln(enroll) + u (46) H0 : β1 = 1 (47) H1 : β1 > 1 (48) We get β̂1 = 1.27, and so a 1% increase in enrollment is estimated to increase crime by 1.27% (so more than 1%). Is this estimate statistically greater than one? ECON 371: Econometrics The Multiple Regression Model: Inference
• 53. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I We cannot pull the t statistic off the usual Stata output. We can compute it by hand (rounding the estimate and standard error): t = (1.270 − 1)/.110 ≈ 2.45 (Note how this is much smaller than the t for H0 : β1 = 0, reported by Stata.) I We have df = 97 − 2 = 95, so we use the df = 120 entry in Table G.2. The 1% cv for a one-sided alternative is about 2.36, so we reject at the 1% significance level. ECON 371: Econometrics The Multiple Regression Model: Inference
• 54. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I Alternatively, we can let Stata do the work using the lincom (“linear combination”) command, as illustrated below. Here the null is stated equivalently as H0 : β1 − 1 = 0 I The t = 2.46 is the more accurate calculation of the t statistic. I The lincom lenroll - 1 command is Stata’s way of saying “test whether βlenroll − 1 equals zero.” ECON 371: Econometrics The Multiple Regression Model: Inference
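A sketch of the full sequence (assuming CAMPUS.DTA is in the working directory and the logged variables are named lcrime and lenroll):
  use CAMPUS, clear
  regress lcrime lenroll
  lincom lenroll - 1    // tests H0: beta_lenroll - 1 = 0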
• 55. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance Computing p-Values for t Tests I The traditional approach to testing, where we choose a significance level ahead of time, can be cumbersome. I Plus, it can conceal information. For example, suppose that, for testing against a two-sided alternative, a t statistic is just below the 5% cv. I could simply say that “I fail to reject H0 : βj = 0 against the two-sided alternative at the 5% level.” But there is nothing sacred about 5%. Might I reject at, say, 6%? I Rather than have to specify a level ahead of time, or discuss different traditional significance levels (10%, 5%, 1%), it is better to answer the following question: Given the observed value of the t statistic, what is the smallest significance level at which I can reject H0? ECON 371: Econometrics The Multiple Regression Model: Inference
• 56. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I The smallest level at which the null can be rejected is known as the p-value of a test. It is a single number that automatically allows us to carry out the test at any level. I One way to think about the p-value is that it uses the observed statistic as the critical value, and then finds the significance level of the test using that critical value. I It is most common to report p-values for two-sided alternatives. This is what Stata does. The t tables are not detailed enough to compute exact p-values by hand. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 57. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I For t testing against a two-sided alternative, p-value = P(|T| > |t|) (49) where t is the value of the t statistic and T is a random variable with the tdf distribution. I The p-value is a probability, so it is between zero and one. I Perhaps the best way to think about p-values: it is the probability of observing a statistic as extreme as we did if the null hypothesis is true. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 58. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I So smaller p-values provide more evidence against the null. For example, if p-value = .50, then there is a 50% chance of observing a t as large as we did (in absolute value). This is not enough evidence against H0. I If p-value = .001, then the chance of seeing a t statistic as extreme as we did is .1%. We can conclude that we got a very rare sample – which is not helpful – or that the null hypothesis is very likely false. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 59. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I From p-value = P(|T| > |t|) (50) we see that as |t| increases the p-value decreases. Large absolute t statistics are associated with small p-values. I Suppose df = 40 and, from our data, we obtain t = 1.85 or t = −1.85. Then p-value = P(|T| > 1.85) = 2P(T > 1.85) = 2(.0359) = .0718 where T ∼ t40. ECON 371: Econometrics The Multiple Regression Model: Inference
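The same p-value can be computed directly; a one-line Stata sketch (ttail(df, t) returns the upper-tail probability P(T > t)):
  display 2*ttail(40, 1.85)   // two-sided p-value, about .0718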
• 60. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I Given the p-value, we can carry out a test at any significance level. If α is the chosen level, then Reject H0 if p-value < α I In the previous example we obtained p-value = 0.0718. This means that we reject H0 at the 10% level but not the 5% level. We reject at the 8% level but not (quite) at the 7% level. I Knowing p-value = .0718 is clearly much better than just saying “I fail to reject at the 5% level.” ECON 371: Econometrics The Multiple Regression Model: Inference
• 61. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance Computing p-Values for One-Sided Alternatives I Stata and other packages report the two-sided p-value. How can we get a one-sided p-value? I With a caveat, the answer is simple: one-sided p-value = (two-sided p-value)/2 (51) I We only want the area in one tail, not two tails. The two-sided p-value gives us the area in both tails. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 62. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I This is the correct calculation when it is interesting to do the calculation. The caveat is simple: if the estimated coefficient is not in the direction of the alternative, the one-sided p-value is above 0.50, and so it is not an interesting calculation. I In Stata, the two-sided p-values for H0 : βj = 0 are given in the column labeled P > |t|. ECON 371: Econometrics The Multiple Regression Model: Inference
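Continuing the df = 40, t = 1.85 example, the one-sided p-value is just the single upper-tail area; a one-line Stata sketch:
  display ttail(40, 1.85)   // one-sided p-value, about .0359 (half the two-sided value)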
• 63. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance EXAMPLE: Factors Affecting NBA Salaries (NBASAL.DTA) I Disregarding the intercept (or “constant”), none of the variables is statistically significant at the 1% level against a two-sided alternative. The closest is points, with p-value = .017. (The one-sided p-value is .017/2 = .0085 < .01, so it is significant at the 1% level against the positive one-sided alternative.) I mingame is statistically significant at the 5% level because p-value = .022 < .05. I rebounds is statistically significant at the 10% level (against a two-sided alternative) because p-value = .077 < .10, but not at the 5% level. But the one-sided p-value is .077/2 = .0385. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 64. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I Both games and assists have very small t statistics, which lead to p-values close to one (for example, for assists, p-value = .991). These variables are statistically insignificant. I In some applications, p-values equal to zero up to three decimal places are not uncommon. We do not have to worry about statistical significance in such cases. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 65. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance Language of Hypothesis Testing I If we do not reject H0 (against any alternative), it is better to say “we fail to reject H0” as opposed to “we accept H0,” which is somewhat common. I The reason is that many null hypotheses cannot be rejected in any application. For example, if I have β̂j = .75 and se(β̂j ) = .25, I do not say that I “accept H0 : βj = 1.” I fail to reject because the t statistic is (.75 − 1)/.25 = −1. But the t statistic for H0 : βj = .5 is (.75 − .5)/.25 = 1, so I cannot reject H0 : βj = .5, either. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 66. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I Clearly βj = .5 and βj = 1 cannot both be true. There is a single, unknown value in the population. So I should not “accept” either. I The outcomes of the t tests tell us the data cannot reject either hypothesis. Nor can the data reject H0 : βj = .6, and so on. The data does reject H0 : βj = 0 (t = 3) at a pretty small significance level (if we have a reasonable df .) ECON 371: Econometrics The Multiple Regression Model: Inference
  • 67. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance Practical versus Statistical Significance I t testing is purely about statistical significance. It does not directly speak to the issue of whether a variable has a practically, or economically, large effect. I Practical (Economic) Significance depends on the size (and sign) of β̂j . I Statistical Significance depends on tβ̂j . ECON 371: Econometrics The Multiple Regression Model: Inference
  • 68. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I It is possible to estimate practically large effects but have the estimates so imprecise that they are statistically insignificant. This is especially an issue with small data sets (but not only small datasets). I Even more importantly, it is possible to get estimates that are statistically significant – often with very small p-values – but are not practically large. This can happen with very large data sets. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 69. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Against One-Sided Alternatives Testing Against Two-Sided Alternatives Testing Other Hypotheses about the βj Computing p-Values for t Tests Language of Hypothesis Testing Practical versus Statistical Significance I EXAMPLE: Suppose that, using a large cross section data set for teenagers across the U.S., we estimate the elasticity of alcohol demand with respect to price to be −0.013 with se = 0.002. Then the t statistic is −6.5, and we need look no further to conclude the elasticity is statistically different from zero. But the estimate means that, say, a 10% increase in the price of alcohol reduces demand by an estimated 0.13%. This is a small effect. I The bottom line: do not just fixate on t statistics! Interpreting the β̂j is just as important. ECON 371: Econometrics The Multiple Regression Model: Inference
• 70. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Confidence Intervals I Rather than just testing hypotheses about parameters, it is also useful to construct confidence intervals (also known as interval estimates). I Loosely, the CI is supposed to give a “likely” range of values for the corresponding population parameter. I We will only consider CIs of the form β̂j ± c · se(β̂j ) (52) where c > 0 is chosen based on the confidence level. ECON 371: Econometrics The Multiple Regression Model: Inference
• 71. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I We will use a 95% confidence level, in which case c comes from the 97.5th percentile in the tdf distribution. In other words, c is the 5% critical value against a two-sided alternative. I Stata automatically reports a 95% CI for each parameter, based on the t distribution using the appropriate df . ECON 371: Econometrics The Multiple Regression Model: Inference
  • 72. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I Notice, in the NBA salaries example, how the three estimates that are not statistically different from zero at the 5% level – games, rebounds, and assists – all have 95% CIs that include zero. For example, the 95% CI for βrebounds is [−0.0045, 0.0859] I By contrast, the 95% CI for βpoints is [0.0067, 0.0661] which excludes zero. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 73. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I A simple rule-of-thumb is useful for constructing a CI given the estimate and its standard error. For, say, df ≥ 60, an approximate 95% CI is β̂j ± 2se(β̂j ) or [β̂j − 2se(β̂j ), β̂j + 2se(β̂j )] That is, subtract and add twice the standard error to the estimate. (In the case of the standard normal, the 2 becomes 1.96.) ECON 371: Econometrics The Multiple Regression Model: Inference
  • 74. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I Properly interpreting a CI is a bit tricky. One often sees statements such as “there is a 95% chance that βpoints is in the interval [.0067, .0661].” This is incorrect. βpoints is some fixed value, and it either is or is not in the interval. I The correct way to interpret a CI is to remember that the endpoints, β̂j − c · se(β̂j ) and β̂j + c · se(β̂j ), change with each sample (or at least can change). That is, the endpoints are random outcomes that depend on the data we draw. ECON 371: Econometrics The Multiple Regression Model: Inference
• 75. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I What a 95% CI means is that for 95% of the random samples that we draw from the population, the interval we compute using the rule β̂j ± c · se(β̂j ) will include the value βj . But for a particular sample we do not know whether βj is in the interval. I This is similar to the idea that unbiasedness of β̂j does not mean that β̂j = βj . Most of the time β̂j is not βj . Unbiasedness means E(β̂j ) = βj . ECON 371: Econometrics The Multiple Regression Model: Inference
• 76. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions CIs and Hypothesis Testing I If we have constructed a 95% CI for, say, βj , we can test any null value against a two-sided alternative, at the 5% level. So H0 : βj = aj H1 : βj ≠ aj I If aj is in the 95% CI, then we fail to reject H0 at the 5% level. I If aj is not in the 95% CI then we reject H0 in favor of H1 at the 5% level. ECON 371: Econometrics The Multiple Regression Model: Inference
• 77. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Using WAGE2.dta, lwage = 5.109 (.126) + .055 (.008) educ + .0054 (.0010) IQ (53) + .0222 (.0034) exper + .0126 (.0050) meduc (54) n = 857, R2 = .1804 (55) I The 95% CI for βIQ is about [.0034, .0074]. So we can reject H0 : βIQ = 0 against the two-sided alternative at the 5% level. We cannot reject H0 : βIQ = .007 (although it is close). I We can reject a return to schooling of 3.0% as being too low, but we can also reject 8.0% as being too high. ECON 371: Econometrics The Multiple Regression Model: Inference
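The CI endpoints that Stata reports can be reproduced by hand; a sketch for βIQ above (here df = 857 − 5 = 852):
  display .0054 - invttail(852, 0.025)*.0010   // lower endpoint, about .0034
  display .0054 + invttail(852, 0.025)*.0010   // upper endpoint, about .0074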
  • 78. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I Just as with hypothesis testing, these CIs are only as good as the underlying assumptions. If we have omitted key variables, the β̂j are biased. If the error variance is not constant, the standard errors are improperly computed. I With df = 852, we will see later that normality is not very important. But normality is needed for these CIs to be exact 95% CIs. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 79. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Testing Single Linear Restrictions I So far, we have discussed testing hypotheses that involve only one parameter, βj . But some hypotheses involve many parameters. I EXAMPLE: Are the Returns to a Year of Junior College the same as for a Four-Year University? Suppose we have a sample of high school graduates. lwage = β0 + β1jc + β2univ + β3exper + u H0 : β1 = β2 H1 : β1 < β2 ECON 371: Econometrics The Multiple Regression Model: Inference
• 80. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I We could use a two-sided alternative, too. I We can also write H0 : β1 − β2 = 0 I Remember the general way to construct a t statistic: t = (estimate − hypothesized value)/standard error (56) ECON 371: Econometrics The Multiple Regression Model: Inference
• 81. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I Given the OLS estimates β̂1 and β̂2, t = (β̂1 − β̂2)/se(β̂1 − β̂2) (57) I Problem: The OLS output gives us β̂1 and β̂2 and their standard errors, but that is not enough to obtain se(β̂1 − β̂2). I Recall a fact about variances: Var(β̂1 − β̂2) = Var(β̂1) + Var(β̂2) − 2Cov(β̂1, β̂2) (58) ECON 371: Econometrics The Multiple Regression Model: Inference
• 82. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions I The standard error is an estimate of the square root: se(β̂1 − β̂2) = {[se(β̂1)]^2 + [se(β̂2)]^2 − 2s12}^(1/2) (59) where s12 is an estimate of Cov(β̂1, β̂2). This is the piece we are missing. I We could have Stata report s12 (cumbersome). There is also a trick of rewriting the model (see text, Section 4.4). I It is easiest to use a command for testing linear functions of the coefficients. In Stata, it is lincom. After running the regression we could type: lincom jc - univ ECON 371: Econometrics The Multiple Regression Model: Inference
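A sketch of the full Stata sequence (the data file name is an assumption here – this two-year/four-year college example is often distributed as TWOYEAR.DTA – and the regressors follow the model above):
  use TWOYEAR, clear
  regress lwage jc univ exper
  lincom jc - univ    // reports the estimate, se, and t statistic for beta_1 - beta_2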
• 83. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Relationship Between F and t Statistics R-Squared Form of the F Statistic The F Statistic for Overall Significance of a Regression I The t test allows us to test a single hypothesis, whether it involves one or more than one parameter. I But, sometimes, we want to test more than one hypothesis at once, which then involves multiple parameters. I Generally, it is not valid to look at individual t statistics. We need a statistic designed to test joint hypotheses. ECON 371: Econometrics The Multiple Regression Model: Inference
  • 84. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Relationship Between F and t Statistics R-Squared Form of the F Statistic The F Statistic for Overall Significance of a Regression EXAMPLE: Major League Baseball Salaries (MLBSAL.DTA) lsalary = β0 + β1years + β2gamesyr + β3bavg +β4hrunsyr + β5rbisyr + u I H0 : Once we control for experience (years) and amount played (gamesyr), actual performance has no effect on salary. H0 : β3 = 0, β4 = 0, β5 = 0 (60) I An example of exclusion restrictions: the three variables, bavg, hrunsyr, and rbisyr can be excluded from the equation. ECON 371: Econometrics The Multiple Regression Model: Inference
• 85. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Relationship Between F and t Statistics R-Squared Form of the F Statistic The F Statistic for Overall Significance of a Regression I To test H0, we need a joint (multiple) hypotheses test. I A t statistic can be used for a single exclusion restriction; it does not take a stand on the values of the other parameters. I In this class, we only consider the alternative H1 : H0 is not true (61) I So, H1 means at least one of β3, β4, and β5 is different from zero. ECON 371: Econometrics The Multiple Regression Model: Inference
• 86. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Relationship Between F and t Statistics R-Squared Form of the F Statistic The F Statistic for Overall Significance of a Regression I One-sided alternatives (where, say, each β is positive) are hard to deal with for multiple restrictions. So, our alternative is two-sided. I In the Stata output that follows, something curious occurs: none of the three performance variables is statistically significant, even though the coefficient estimates are all positive. I Closest is rbisyr: trbisyr = 1.50, but this does not strongly reject H0. (One-sided p-value = 0.067.) ECON 371: Econometrics The Multiple Regression Model: Inference
  • 87. Recap Sampling Distributions of the OLS Estimators Testing Hypotheses About a Single Population Parameter Confidence Intervals Testing Single Linear Restrictions Testing Multiple Exclusion Restrictions Relationship Between F and t Statistics R-Squared Form of the F Statistic The F Statistic for Overall Significance of a Regression I The estimated effect of bavg is not really large: 10 more points on lifetime average means salary is estimated to be 1.0% higher. Another RBI (without hitting a home run) is worth about 1.1%. I Note: We can use this equation to answer questions such as: How much is a solo HR worth? That would be adding the two coefficients, so about .025, or 2.5%. ECON 371: Econometrics The Multiple Regression Model: Inference
• 88.
I Question: On the basis of the three insignificant t statistics, should we conclude that none of bavg, hrunsyr, and rbisyr affects baseball player salaries?
I No. That would be a mistake.
I A hint at the problem is severe multicollinearity between hrunsyr and rbisyr: the correlation is about .89. In fact, one cannot hit a home run without getting at least one RBI, and home run hitters generally tend to produce lots of RBIs.
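A quick check of this collinearity, assuming the MLB data are still in memory:
* pairwise correlation between the two performance measures
correlate hrunsyr rbisyr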
• 89.
I Because the individual coefficients are imprecisely estimated, we need a joint test.
I The original model, containing all the variables, is the unrestricted model:
lsalary = β0 + β1years + β2gamesyr + β3bavg + β4hrunsyr + β5rbisyr + u
I When we impose H0 : β3 = 0, β4 = 0, β5 = 0, we get the restricted model:
lsalary = β0 + β1years + β2gamesyr + u
• 90.
I We want to see how the fit deteriorates as we remove the three variables. Initially, we use the sum of squared residuals (SSR) from the two regressions.
I Let SSRur denote the SSR from the unrestricted model; in this example, dfur = 353 − 6 = 347. Let SSRr be the SSR from the restricted model, so dfr = 353 − 3 = 350.
I It is an algebraic fact that the SSR must increase (or at least not fall) when explanatory variables are dropped, so
SSRr ≥ SSRur (62)
• 91.
I The test statistic we use essentially asks: does the SSR increase proportionately by enough to conclude that the restrictions under H0 are false?
I In the general model
y = β0 + β1x1 + ... + βkxk + u (63)
we want to test whether the last q variables can be excluded:
H0 : βk−q+1 = 0, ..., βk = 0 (64)
I We get SSRur from estimating the full model.
• 92.
I The restricted model we estimate to get SSRr drops the last q variables (q exclusion restrictions):
y = β0 + β1x1 + ... + βk−qxk−q + u (65)
I The F statistic uses a degrees-of-freedom adjustment. In general,
F = [(SSRr − SSRur)/(dfr − dfur)] / [SSRur/dfur] = [(SSRr − SSRur)/q] / [SSRur/(n − k − 1)] (66)
where q is the number of exclusion restrictions imposed under the null (q = 3 in the MLB case) and k is the number of explanatory variables in the unrestricted model (k = 5 in the MLB case, so n − k − 1 = 347).
• 93.
q = numerator df = dfr − dfur (67)
n − k − 1 = denominator df = dfur (68)
I The denominator of the F statistic, SSRur/dfur, is the unbiased estimator of σ2 from the unrestricted model.
I Note that F ≥ 0, and F > 0 virtually always holds.
I As a computational device, the formula
F = [(SSRr − SSRur)/SSRur] · [(n − k − 1)/q] (69)
is sometimes useful. These days, we let a computer do it.
• 94.
I Using classical testing, the rejection rule is of the form
F > c (70)
where c is an appropriately chosen critical value.
I We obtain c using the (hard to show) fact that, under H0 (the q exclusion restrictions),
F ∼ Fq,n−k−1 (71)
that is, an F distribution with (q, n − k − 1) degrees of freedom.
I Tables G.3a, G.3b, and G.3c contain critical values for the 10%, 5%, and 1% significance levels.
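Rather than the tables, exact critical values can be computed with Stata's invFtail() function. A minimal sketch for the (3, 347) case used below:
* exact critical values for an F distribution with (3, 347) degrees of freedom
display invFtail(3, 347, .10)
display invFtail(3, 347, .05)
display invFtail(3, 347, .01)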
• 95.
I In the MLB example with n = 353, k = 5, and q = 3, we have numerator df = 3 and denominator df = 347. Because the denominator df is above 120, we use the "∞" entry of the table: the 10% critical value is 2.08, the 5% critical value is 2.60, and the 1% critical value is 3.78.
I As with t testing, it is better to compute a p-value. These are reported by Stata after every test command (see the sketch below).
I The F statistic for excluding bavg, hrunsyr, and rbisyr from the model is 9.55. This is well above the 1% critical value, so we reject the null at the 1% level. In fact, to four decimal places, the p-value is zero.
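A sketch of the joint test in Stata, run after the unrestricted regression above. The test command reports the F statistic and its p-value; Ftail() recovers the p-value directly from the statistic:
* joint test of the three exclusion restrictions
test bavg hrunsyr rbisyr
* upper-tail p-value for F = 9.55 with (3, 347) degrees of freedom
display Ftail(3, 347, 9.55)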
• 96.
The "by hand" calculation works too, but we must also estimate the restricted model:
I With SSRr = 198.31 and SSRur = 183.19,
F = [(198.31 − 183.19)/183.19] · (347/3) ≈ 9.55 (72)
rounded to two decimal places, which is the same value computed by Stata.
I We say that bavg, hrunsyr, and rbisyr are jointly statistically significant (or just jointly significant) – in this case, at any small significance level we want.
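The same by-hand calculation can be done from Stata's stored results, so the SSRs never have to be retyped. A minimal sketch:
* unrestricted model: save its SSR and residual degrees of freedom
quietly regress lsalary years gamesyr bavg hrunsyr rbisyr
scalar ssr_ur = e(rss)
scalar df_ur = e(df_r)
* restricted model: save its SSR
quietly regress lsalary years gamesyr
scalar ssr_r = e(rss)
* F statistic with q = 3 exclusion restrictions
display ((ssr_r - ssr_ur)/3) / (ssr_ur/df_ur)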
• 97.
I The F statistic does not allow us to tell which of the population coefficients are different from zero, and the t statistics do not help much.
I We have, in effect, set up the problem ourselves by including the two highly correlated variables, hrunsyr and rbisyr.
I Not much is lost by dropping one of the two. Notice that doing so does not change the fact that bavg is statistically insignificant and that its coefficient remains small.
• 98. Relationship Between F and t Statistics
I In the F test setting, nothing rules out q = 1. So we can use the F statistic to test, say,
H0 : βj = 0
H1 : βj ≠ 0
(Remember, H1 is that H0 does not hold.)
I Of course, we can use a t statistic, too. Does that mean there are now two ways to test a single exclusion restriction?
• 99.
I No. It can be shown that, when q = 1,
F = t2 (73)
This makes sense because the square of a tn−k−1 random variable has exactly the F1,n−k−1 distribution.
I The t test is more transparent and allows one-sided alternatives. The F test adds nothing here and takes a little away.
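A quick sketch that verifies F = t2 for a single restriction, using the MLB regression:
* F test of the single restriction that the coefficient on bavg is zero
quietly regress lsalary years gamesyr bavg hrunsyr rbisyr
test bavg
* the squared t statistic for bavg reproduces the same number
display (_b[bavg]/_se[bavg])^2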
• 100.
I F tests can be abused, too. Suppose
y = β0 + β1x1 + β2x2 + β3x3 + u (74)
where the t statistic for β̂2 is statistically significant (at the 5% level) but that for β̂3 is insignificant.
I Logically, this means we should reject H0 : β2 = 0, β3 = 0. But if we apply the F test for joint significance of x2 and x3, it need not reject.
I Generally, if we use an F test to test a significant variable jointly with a group of insignificant variables, it has less power than the single t test.
• 101. R-Squared Form of the F Statistic
I When doing your own research, compute F statistics using a built-in command. But it is sometimes useful to be able to compute an F statistic from standard output.
I The R-squared is reported most of the time (unlike the SSR). It turns out that F tests for exclusion restrictions can be computed entirely from the R-squareds of the restricted and unrestricted models.
I The key is that, because the same dependent variable is used in the two regressions,
SSRr = (1 − R2r)SST (75)
SSRur = (1 − R2ur)SST (76)
• 102.
I Simple algebra shows that F can also be written as
F = [(R2ur − R2r)/q] / [(1 − R2ur)/(n − k − 1)] (77)
I Notice how R2ur comes first in the numerator. We know R2ur ≥ R2r, so this ensures F ≥ 0.
I EXAMPLE: Do factors other than price affect the demand for "ecologically friendly" apples? (APPLE.DTA). These are survey data, and the eco-friendly apple was fictitious.
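A sketch of the R-squared form of the statistic for the earlier MLB example, again using Stata's stored results:
* R-squareds from the unrestricted and restricted MLB regressions
quietly regress lsalary years gamesyr bavg hrunsyr rbisyr
scalar r2_ur = e(r2)
quietly regress lsalary years gamesyr
scalar r2_r = e(r2)
* F statistic with q = 3 and n - k - 1 = 347
display ((r2_ur - r2_r)/3) / ((1 - r2_ur)/347)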
• 103.
I ecolbs cannot be normally distributed: more than 37% of its values are piled up at zero. We will return to such situations later. For now, we just use the F statistic as always.
I Using the R-squareds from the unrestricted and restricted models, we get F = .646. Using the test command gives the same thing, but of course we also get the p-value, which is very large. There is no evidence that variables other than price affect ecolbs. Note that we need to run the test command after estimating the unrestricted model (see the sketch below).
I In Stata, qui means "quietly": the command runs, but its output is not shown on the screen.
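A hedged sketch of the apple-demand test. The names ecolbs, ecoprc, and educ appear on the slides; the remaining regressor names below are assumptions about what APPLE.DTA contains, not taken from the slides:
* unrestricted demand equation, estimated quietly (regressor names partly assumed)
use APPLE.DTA, clear
qui regress ecolbs ecoprc regprc faminc hhsize educ age
* joint test of all the non-price variables
test faminc hhsize educ age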
• 104. The F Statistic for Overall Significance of a Regression
I The F statistic in the upper right-hand corner of the Stata output tests a very special null hypothesis. In the model
y = β0 + β1x1 + β2x2 + ... + βkxk + u
the null is that all slope coefficients are zero:
H0 : β1 = 0, β2 = 0, ..., βk = 0 (78)
I This means that none of the xj helps explain y.
I If we cannot reject this null, we have found no factors that explain y.
• 105.
I Some economic theories imply that a variable cannot be predicted using certain sets of variables. For example, firm stock returns over a given period should not be predictable from information available at the beginning of that period. This is one use of the overall F statistic.
I Do not use this statistic to test a subset of the variables.
• 106.
I For this test, R2r = 0 (there are no explanatory variables under H0) and R2ur = R2 from the regression, so
F = [R2/k] / [(1 − R2)/(n − k − 1)] = [R2/(1 − R2)] · [(n − k − 1)/k] (79)
and F is directly related to the R-squared: as R2 increases, so does F.
I A small R2 can lead F to be significant. Maybe very significant. If the df = n − k − 1 is large – because of large n – F can be large even with a "small" R2.
I Increasing k decreases F.
I Even with R2 = .0402, the p-value for the joint F is very small, .0002. So the variables ecoprc, ..., educ are jointly statistically significant.
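As a sketch, the overall F statistic can be recomputed from any regression's stored R-squared and degrees of freedom; for example, after the MLB regression:
quietly regress lsalary years gamesyr bavg hrunsyr rbisyr
* overall F from the R-squared form: e(df_m) = k, e(df_r) = n - k - 1
display (e(r2)/e(df_m)) / ((1 - e(r2))/e(df_r))
* this should match the F statistic reported in the header of the regress output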