Multiple Regression Analysis: Statistical Inference: I
Introductory Econometrics: A Modern Approach, 5e
Haoming Liu
National University of Singapore
August 21, 2022
1. Sampling Distributions of the OLS Estimators
2. Testing Hypotheses About a Single Population Parameter
3. Confidence Intervals
4. Testing Single Linear Restrictions
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 1 / 104
Recap
So far, what do we know how to do with the population model
y = β0 + β1x1 + ... + βkxk + u?
1 Mechanics of OLS for a given sample. We only need MLR.2 insofar as
it introduces the data, and MLR.3 (no perfect collinearity) so that the
OLS estimates exist. Interpretation of OLS regression line – ceteris
paribus effects – R2 goodness-of-fit measure. Some functional form
(natural logarithm).
Recap: MLRs
1 : y = β0 + β1x1 + β2x2 + ... + βkxk + u
2 : random sampling from the population
3 : no perfect collinearity in the sample
4 : E(u|x1, ..., xk) = E(u) = 0 (exogenous explanatory variables)
5 : Var(u|x1, ..., xk) = Var(u) = σ2 (homoskedasticity)
Recap
Unbiasedness of OLS under MLR.1 to MLR.4. Obtain the bias (or at
least its direction) when MLR.4 fails due to an omitted variable.
Obtain the variances, Var(β̂j ), under MLR.1 to MLR.5.
The Gauss-Markov Assumptions also imply OLS is the best linear
unbiased estimator (BLUE) (conditional on the values of the
explanatory variables).
Sampling Distributions of the OLS Estimators
We now want to test hypotheses about the βj . This means we hypothesize
that a population parameter is a certain value, then use the data to
determine whether the hypothesis is likely to be false.
EXAMPLE: (Motivated by ATTEND.DTA)
final = β0 + β1missed + β2priGPA + β3ACT + u
where ACT is the achievement test score. The null hypothesis, that
missing lectures has no effect on final exam performance (after accounting
for prior MSU GPA and ACT score), is
H0 : β1 = 0
Sampling Distributions of the OLS Estimators
To test hypotheses about the βj using exact (or “finite sample”) testing
procedures, we need to know more than just the mean and variance of the
OLS estimators.
MLR.1 to MLR.4: We can compute the expected value as
E(β̂j ) = βj
MLR.1 to MLR.5: We know the variance is
Var(β̂j) = σ²/[SSTj(1 − R²j)]
And σ̂² = SSR/(n − k − 1) is an unbiased estimator of σ²
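A small numeric sketch of these two formulas in Python (the slides use Stata; all input numbers below are made up for illustration):

```python
# Sketch: the unbiased error-variance estimator and the OLS slope variance,
# using hypothetical regression summaries (SSR, n, k, SST_j, R_j^2).

def sigma2_hat(ssr, n, k):
    """Unbiased estimator of the error variance: SSR / (n - k - 1)."""
    return ssr / (n - k - 1)

def var_beta_hat(sigma2, sst_j, r2_j):
    """Var(beta_hat_j) = sigma^2 / (SST_j * (1 - R_j^2))."""
    return sigma2 / (sst_j * (1 - r2_j))

s2 = sigma2_hat(ssr=50.0, n=104, k=3)        # 50 / 100 = 0.5
v = var_beta_hat(s2, sst_j=200.0, r2_j=0.5)  # 0.5 / 100 = 0.005
```

Note how a higher R²j (more collinearity with the other regressors) mechanically inflates Var(β̂j).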
Sampling Distributions of the OLS Estimators
But hypothesis testing relies on the entire sampling distributions of
the β̂j. Even under MLR.1 through MLR.5, the sampling distributions
can be virtually anything.
Write
β̂j = βj + Σᵢ₌₁ⁿ wij ui,
where the wij are functions of {(xi1, ..., xik) : i = 1, ..., n}.
Conditional on {(xi1, ..., xik) : i = 1, ..., n}, β̂j inherits its distribution
from that of {ui : i = 1, ..., n}, which is a random sample from the
population distribution of u.
Assumption MLR.6 (Normality)
Normality
The population error u is independent of (x1, ..., xk) and is normally
distributed with mean zero and variance σ²:
u ∼ Normal(0, σ²)
MLR.4: E(u|x1, ..., xk) = E(u) = 0
MLR.5: Var(u|x1, ..., xk) = Var(u) = σ2
Now MLR.6 imposes full independence between u and (x1, x2, ..., xk)
(not just mean and variance independence), which is where the label
of the xj as “independent variables” originated.
The important part of MLR.6 is that we have now made a very
specific distributional assumption for u: the familiar bell-shaped curve.
Assumption MLR.6 (Normality)
Normality is by far the most common assumption, but the usual
arguments about why normality is a good assumption are not always
operative.
Usually, the argument starts with the claim that u is the sum of many
independent factors, say u = a1 + a2 + ... + am for “large” m, and then
we can apply the central limit theorem. But what if the factors have
very different distributions, or are multiplicative rather than additive?
Assumptions MLR.1-6
Ultimately, like Assumption MLR.5, Assumption MLR.6 is maintained
for convenience. Fortunately, we will later see that, for approximate
inference in large samples, we can drop MLR.6. For now we keep it.
It is very difficult to perform exact statistical inference without
Assumption MLR.6.
Assumptions MLR.1 to MLR.6 are called the classical linear model
(CLM) assumptions (for cross-sectional regression).
Normality
For practical purposes, think of
CLM = Gauss-Markov + normality
An important fact about independent normal random variables: any
linear combination is also normally distributed. Because the ui are
independent and identically distributed (iid) as Normal(0, σ2),
β̂j = βj + Σᵢ₌₁ⁿ wij ui ∼ Normal[βj, Var(β̂j)]
where we already know the formula for Var(β̂j):
Var(β̂j) = σ²/[SSTj(1 − R²j)]
THEOREM (Normal Sampling Distributions)
Under the CLM Assumptions (and conditional on the sample outcomes of
the explanatory variables),
β̂j ∼ Normal[βj , Var(β̂j )]
and so
(β̂j − βj)/sd(β̂j) ∼ Normal(0, 1)
The second result follows from a feature of the normal distribution: if
W ∼ Normal then a + bW ∼ Normal for constants a and b.
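A small simulation (a sketch, not from the slides) illustrates the theorem: with normal errors and fixed regressors, the OLS slope in a simple regression is centered at the true β1 across repeated samples. The model y = 1 + 2x + u below is hypothetical.

```python
import random
import statistics

random.seed(1)
beta0, beta1, n, reps = 1.0, 2.0, 30, 500
x = [i / n for i in range(n)]           # fixed regressors: we condition on x
xbar = statistics.mean(x)
sxx = sum((xi - xbar) ** 2 for xi in x)

slopes = []
for _ in range(reps):
    # Draw a fresh sample of normal errors, then compute the OLS slope.
    y = [beta0 + beta1 * xi + random.gauss(0, 1) for xi in x]
    ybar = statistics.mean(y)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    slopes.append(b1)

# Across repeated samples, the slope estimates center on the true beta1 = 2.
avg = statistics.mean(slopes)
```

A histogram of `slopes` would trace out the bell-shaped Normal[β1, Var(β̂1)] sampling distribution the theorem describes.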
Normality
The standardized random variable
(β̂j − βj)/sd(β̂j)
always has zero mean and variance one. Under MLR.6, it is also
normally distributed.
Notice that the standard normal distribution holds even when we do
not condition on {(xi1, xi2, ..., xik) : i = 1, ..., n}.
Testing Hypotheses About a Single Population Parameter
We cannot directly use the result
(β̂j − βj)/sd(β̂j) ∼ Normal(0, 1)
to test hypotheses about βj: sd(β̂j) depends on σ = sd(u), which is
unknown.
But we have σ̂ as an estimator of σ. Using this in place of σ gives us
the standard error, se(β̂j ).
THEOREM (t Distribution for Standardized Estimators)
Under the CLM Assumptions,
(β̂j − βj)/se(β̂j) ∼ tn−k−1 = tdf
We will not prove this as the argument is somewhat involved.
It is replacing σ (an unknown constant) with σ̂ (an estimator that
varies across samples) that takes us from the standard normal to the
t distribution.
Distribution for Standardized Estimators
The t distribution also has a bell shape, but is more spread out than
the Normal(0, 1).
E(tdf) = 0 if df > 1
Var(tdf) = df/(df − 2) > 1 if df > 2
We will never have very small df in this class.
When df = 10, Var(tdf ) = 1.25, which is 25% larger than the
Normal(0, 1) variance.
When df = 120, Var(tdf ) ≈ 1.017 – only 1.7% larger than the
standard normal.
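The variance figures quoted above follow directly from Var(tdf) = df/(df − 2), a one-line check:

```python
def t_var(df):
    """Variance of the t distribution with df degrees of freedom (valid for df > 2)."""
    return df / (df - 2)

v10 = t_var(10)    # 1.25: 25% larger than the Normal(0,1) variance of 1
v120 = t_var(120)  # about 1.017: only 1.7% larger than the standard normal
```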
Distribution for Standardized Estimators
As df → ∞,
tdf → Normal(0, 1)
The difference is practically small for df > 120.
The next graph plots a standard normal pdf against a t6 pdf.
Testing
We use the result on the t distribution to test the null hypothesis that
xj has no partial effect on y:
H0 : βj = 0
lwage = β0 + β1educ + β2exper + β3tenure + u
H0 : β2 = 0
In words: Once we control for education and time on the current job
(tenure), total workforce experience has no effect on
lwage = log(wage).
Testing
To test H0 : βj = 0, we use the t statistic (or t ratio),
tβ̂j = β̂j/se(β̂j)
This is the estimated coefficient divided by our estimate of β̂j’s
sampling standard deviation. In virtually all cases β̂j is not exactly
equal to zero. When we use tβ̂j, we are measuring how far β̂j is from
zero relative to its standard error.
Testing
Because se(β̂j) > 0, tβ̂j always has the same sign as β̂j. To use tβ̂j to
test H0 : βj = 0, we need to have an alternative.
Some like to define tβ̂j as the absolute value, so it is always positive.
This makes it cumbersome to test against one-sided alternatives.
Testing Against One-Sided Alternatives
First consider the alternative
H1 : βj > 0
which means the null is effectively
H0 : βj ≤ 0
Using a positive one-sided alternative, if we reject βj = 0 then we
reject any βj < 0, too. We often just state H0 : βj = 0 and act as if
we do not care about negative values.
Testing Against One-Sided Alternatives
If the estimated coefficient β̂j is negative, it provides no evidence
against H0 in favor of H1 : βj > 0.
If β̂j is positive, the question is: How big does tβ̂j = β̂j/se(β̂j) have
to be before we conclude H0 is “unlikely”?
Traditional approach to hypothesis testing:
Testing Against One-Sided Alternatives
1. Choose a null hypothesis: H0 : βj = 0 (or H0 : βj ≤ 0)
2. Choose an alternative hypothesis: H1 : βj > 0
3. Choose a significance level (or simply level, or size) for the test:
the probability of rejecting the null hypothesis when it is in fact
true (a Type I error). Suppose we use 5%, so the probability of
committing a Type I error is .05.
4. Choose a critical value, c > 0, so that the rejection rule
tβ̂j > c
leads to a 5% level test.
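The four steps can be sketched as a small decision rule (the t values fed to it below are hypothetical; c = 1.701 is the df = 28, 5% one-sided entry from Table G.2):

```python
def reject_one_sided_positive(t_stat, c):
    """Reject H0: beta_j = 0 in favor of H1: beta_j > 0 iff t > c."""
    return t_stat > c

c_05 = 1.701  # Table G.2 critical value for df = 28 at the 5% level

decision_big = reject_one_sided_positive(2.1, c_05)    # reject H0
decision_small = reject_one_sided_positive(1.5, c_05)  # fail to reject H0
```

A negative t statistic never rejects against this alternative, no matter how large in magnitude.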
Testing Against One-Sided Alternatives
The key is that, under the null hypothesis,
tβ̂j ∼ tn−k−1 = tdf
and this is what we use to obtain the critical value, c.
Suppose df = 28 and we use a 5% test. The critical value is
c = 1.701, as can be obtained from Table G.2 (page 833 in 5e).
This is a one-tailed test, and it is the one-tailed entries of the
table that should be used.
Testing Against One-Sided Alternatives
So, with df = 28, the rejection rule for H0 : βj = 0 against
H1 : βj > 0, at the 5% level, is
tβ̂j > 1.701
We need a t statistic greater than 1.701 to conclude there is enough
evidence against H0.
If tβ̂j ≤ 1.701, we fail to reject H0 against H1 at the 5% significance
level.
Suppose df = 28, but we want to carry out the test at a different
significance level (often the 10% level or the 1% level).
c.10 = 1.313
c.05 = 1.701
c.01 = 2.467
Testing Against One-Sided Alternatives
If we want to reduce the probability of Type I error, we must increase
the critical value (so we reject the null less often).
If we reject at, say, the 1% level, then we must also reject at any
larger level.
If we fail to reject at, say, the 10% level – so that tβ̂j ≤ 1.313 – then
we will fail to reject at any smaller level.
Testing Against One-Sided Alternatives
With large sample sizes – certainly when df > 120 – we can use critical
values from the standard normal distribution. These are the df = ∞
entries in Table G.2.
c.10 = 1.282
c.05 = 1.645
c.01 = 2.326
which we can round to 1.28, 1.65, and 2.33, respectively. The value
1.65 is especially common for a one-tailed test.
EXAMPLE: Factors Affecting lwage (WAGE2.DTA)
In applications, it is helpful to label parameters with variable names
to state hypotheses. So βeduc, βIQ, and βexper , for example. Then
H0 : βexper = 0
is that workforce experience has no effect on wage once education
and IQ have been accounted for.
EXAMPLE: Factors Affecting lwage (WAGE2.DTA)
lwage^ = −.229 + .107 educ + .0080 IQ + .0435 exper
         (.230)  (.012)      (.0016)    (.0084)
n = 759, R² = .217
The quantities in parentheses are still standard errors, not t statistics!
Easiest to read the t statistic off the Stata output, when available:
texper = 5.17,
which is well above the one-sided critical value at the 1% level, 2.33.
In fact, the .5% critical value is about 2.58.
EXAMPLE: Factors Affecting lwage (WAGE2.DTA)
The bottom line is that H0 : βexper = 0 can be rejected against
H1 : βexper > 0 at very small significance levels. A t of 5.17 is very
large.
The estimated effect of exper – that is, its economic importance – is
apparent. Another year of experience, holding educ and IQ fixed, is
estimated to be worth about 4.4%.
The t statistics for educ and IQ are also very large; there is no need
to even look up critical values.
. reg lwage educ IQ exper
Source | SS df MS Number of obs = 759
-------------+------------------------------ F( 3, 755) = 69.78
Model | 57.0352742 3 19.0117581 Prob > F = 0.0000
Residual | 205.71337 755 .27246804 R-squared = 0.2171
-------------+------------------------------ Adj R-squared = 0.2140
Total | 262.748644 758 .346634095 Root MSE = .52198
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .1069849 .0116513 9.18 0.000 .084112 .1298578
IQ | .0080269 .0015893 5.05 0.000 .0049068 .0111469
exper | .0435405 .0084242 5.17 0.000 .0270028 .0600783
_cons | -.228922 .2299876 -1.00 0.320 -.6804132 .2225692
------------------------------------------------------------------------------
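Each t statistic in the Stata output above is just the coefficient divided by its standard error; for exper, in Python:

```python
# Reproducing the t statistic for exper from the WAGE2 regression output.
coef_exper, se_exper = 0.0435405, 0.0084242
t_exper = coef_exper / se_exper   # Stata reports 5.17
```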
EXAMPLE: Does ACT score help predict college GPA?
The data set GPA1.DTA contains n = 141 MSU students from the mid-1990s.
All variables are self-reported.
Consider controlling for high school GPA:
colGPA = β0 + β1hsGPA + β2ACT + u
H0 : β2 = 0
From the Stata output, β̂2 = β̂ACT = .0094 and tACT = .87. Even at
the 10% level (c = 1.28), we cannot reject H0 against H1 : βACT > 0.
Does ACT score help predict college GPA?
Because we fail to reject H0 : βACT = 0, we say that “β̂ACT is
statistically insignificant at the 10% level against a one-sided
alternative.”
It is also important to see that the estimated effect of ACT is
small. Three more points (slightly more than one standard
deviation) predicts a colGPA only .0094(3) ≈ .028 higher – not even
three one-hundredths of a grade point.
Does ACT score help predict college GPA?
By contrast, β̂hsGPA = .453 is large in a practical sense – each point
on hsGPA is associated with about .45 points on colGPA – and
thsGPA = 4.73 is very large.
No critical values in Table G.2 with df = 141 − 3 = 138 are even
close to 4. So “β̂hsGPA is statistically significant” at very small
significance levels.
Notice what happens if we do not control for hsGPA. The simple
regression estimate is .0271 with tACT = 2.49. The magnitude is still
pretty modest, but we would conclude it is statistically different from
zero at the 1% significance level using the one-sided standard normal
critical value, 2.33.
Does ACT score help predict college GPA?
Not clear why ACT has such a small, statistically insignificant effect.
The sample size is small and the scores were self-reported. The survey
was done in a couple of economics courses, so it is not a random
sample of all MSU students.
. des colGPA hsGPA ACT
storage display value
variable name type format label variable label
-----------------------------------------------------------------------------
colGPA float %9.0g MSU GPA
hsGPA float %9.0g high school GPA
ACT byte %9.0g ’achievement’ score
. sum colGPA hsGPA ACT
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
colGPA | 141 3.056738 .3723103 2.2 4
hsGPA | 141 3.402128 .3199259 2.4 4
ACT | 141 24.15603 2.844252 16 33
. reg colGPA hsGPA ACT
Source | SS df MS Number of obs = 141
-------------+------------------------------ F( 2, 138) = 14.78
Model | 3.42365506 2 1.71182753 Prob > F = 0.0000
Residual | 15.9824444 138 .115814814 R-squared = 0.1764
-------------+------------------------------ Adj R-squared = 0.1645
Total | 19.4060994 140 .138614996 Root MSE = .34032
------------------------------------------------------------------------------
colGPA | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
hsGPA | .4534559 .0958129 4.73 0.000 .2640047 .6429071
ACT | .009426 .0107772 0.87 0.383 -.0118838 .0307358
_cons | 1.286328 .3408221 3.77 0.000 .612419 1.960237
------------------------------------------------------------------------------
. reg colGPA ACT
Source | SS df MS Number of obs = 141
-------------+------------------------------ F( 1, 139) = 6.21
Model | .829558811 1 .829558811 Prob > F = 0.0139
Residual | 18.5765406 139 .133644177 R-squared = 0.0427
-------------+------------------------------ Adj R-squared = 0.0359
Total | 19.4060994 140 .138614996 Root MSE = .36557
------------------------------------------------------------------------------
colGPA | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ACT | .027064 .0108628 2.49 0.014 .0055862 .0485417
_cons | 2.402979 .2642027 9.10 0.000 1.880604 2.925355
------------------------------------------------------------------------------
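The quantities discussed on the preceding slides can be reproduced from the first regression's output above (a quick check in Python):

```python
# t statistic for ACT in the multiple regression, and the predicted
# colGPA change from a 3-point (about one standard deviation) ACT increase.
coef_act, se_act = 0.009426, 0.0107772
t_act = coef_act / se_act       # about 0.87, as Stata reports
effect_3pts = coef_act * 3      # about .028 grade points: practically tiny
```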
For the negative one-sided alternative,
H1 : βj < 0,
we use a symmetric rule: the rejection rule is
tβ̂j < −c
where c > 0 is chosen in the same way as in the positive case.
With df = 18 and a 5% test, the critical value is c = 1.734, so the
rejection rule is
tβ̂j < −1.734
Now we must see a significantly negative value for the t statistic to
reject H0 : βj = 0 in favor of H1 : βj < 0.
EXAMPLE: Does missing lectures affect final exam
performance?
final = β0 + β1missed + β2priGPA + β3ACT + u
H0 : β1 = 0, H1 : β1 < 0
We get β̂1 = −.079 and tβ̂1 = −2.25. The 5% cv is −1.65 and the 1% cv
is −2.33. So we reject H0 in favor of H1 at the 5% level but not at
the 1% level.
The effect is not huge: 10 missed lectures, out of 32, lowers final
exam score by about .8 points – so not even one point.
. reg final missed priGPA ACT
Source | SS df MS Number of obs = 680
-------------+------------------------------ F( 3, 676) = 56.79
Model | 3032.09408 3 1010.69803 Prob > F = 0.0000
Residual | 12029.853 676 17.7956405 R-squared = 0.2013
-------------+------------------------------ Adj R-squared = 0.1978
Total | 15061.9471 679 22.1825435 Root MSE = 4.2185
------------------------------------------------------------------------------
final | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
missed | -.0793386 .0352349 -2.25 0.025 -.1485216 -.0101556
priGPA | 1.915294 .372614 5.14 0.000 1.183674 2.646914
ACT | .4010639 .0532268 7.54 0.000 .2965542 .5055736
_cons | 12.37304 1.171961 10.56 0.000 10.07192 14.67416
------------------------------------------------------------------------------
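The one-sided test of H0 : β1 = 0 against H1 : β1 < 0 can be reproduced from the output above (using the exact large-sample standard normal critical values 1.645 and 2.326):

```python
# t statistic for missed, and the one-sided negative rejection rule t < -c.
coef_missed, se_missed = -0.0793386, 0.0352349
t_missed = coef_missed / se_missed   # about -2.25

reject_5pct = t_missed < -1.645      # True: reject at the 5% level
reject_1pct = t_missed < -2.326      # False: cannot reject at the 1% level
```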
If we do not control for ACT score, the effect of missed goes away. It turns out that missed and ACT are
positively correlated: those with higher ACT scores miss more classes, on average.
. reg final missed priGPA
Source | SS df MS Number of obs = 680
-------------+------------------------------ F( 2, 677) = 52.48
Model | 2021.72415 2 1010.86207 Prob > F = 0.0000
Residual | 13040.2229 677 19.2617768 R-squared = 0.1342
-------------+------------------------------ Adj R-squared = 0.1317
Total | 15061.9471 679 22.1825435 Root MSE = 4.3888
------------------------------------------------------------------------------
final | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
missed | .0172012 .0341483 0.50 0.615 -.0498481 .0842504
priGPA | 3.237554 .3419779 9.47 0.000 2.56609 3.909019
_cons | 17.41567 1.000942 17.40 0.000 15.45035 19.381
------------------------------------------------------------------------------
Reminder about Testing
Our hypotheses involve the unknown population values, βj. If in our
sample we obtain, say, β̂j = 2.75, we do not write the null
hypothesis as
H0 : 2.75 = 0
(which is obviously false).
Nor do we write
H0 : β̂j = 0
(which is also false except in the very rare case that our estimate is
exactly zero).
Testing Against Two-Sided Alternatives
We do not test hypotheses about the estimate! We know what it is
once we collect the sample. We hypothesize about the unknown
population value, βj .
Sometimes we do not know ahead of time whether a variable definitely
has a positive effect or a negative effect. Even in the example
final = β0 + β1missed + β2priGPA + β3ACT + u
it is conceivable that missing class helps final exam performance.
(The extra time is used for studying, say.)
Generally, the null and alternative are
H0 : βj = 0
H1 : βj ̸= 0
Testing against the two-sided alternative is usually the default. It
prevents us from looking at the regression results and then deciding
on the alternative. Also, it is harder to reject H0 against the two-sided
alternative, so it requires more evidence that xj actually affects y.
Two-Sided Alternatives
Now we reject if β̂j is sufficiently large in magnitude, either positive or
negative. We again use the t statistic tβ̂j = β̂j/se(β̂j), but now the
rejection rule is
|tβ̂j| > c
This results in a two-tailed test, and those are the critical values we
pull from Table G.2.
For example, if we use a 5% level test and df = 25, the two-tailed cv
is 2.06. The two-tailed cv is, in this case, the 97.5 percentile in the
t25 distribution. (Compare the one-tailed cv, about 1.71, the 95th
percentile in the t25 distribution).
EXAMPLE: Factors affecting math pass rates.
(MEAP98.DTA)
Run a multiple regression of math4 on lunch, str, avgsal, enrol.
A priori, we might expect lunch to have a negative effect (it is
essentially a school-level poverty rate), str to have a negative effect,
and avgsal to have a positive effect. The expected sign on enrol is
ambiguous. But we can still test against two-sided alternatives to avoid
specifying the alternative ahead of time.
With 923 observations, we can use the standard normal critical values: for
a 10% test, 1.65; for 5%, 1.96; and for 1%, 2.58.
. des math4 lunch str avgsal enrol
storage display value
variable name type format label variable label
------------------------------------------------------------------------------
math4 byte %9.0g pass rate, 4th grade math test
lunch float %9.0g % students eligible free lunch
str float %9.0g student-teacher ratio
avgsal float %9.0g average teacher salary
enrol int %9.0g school enrollment
. sum math4 lunch str avgsal enrol
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
math4 | 923 60.54713 19.71111 3 100
lunch | 923 37.34231 26.21696 0 98.78
str | 923 23.50704 3.755936 7.6 41.1
avgsal | 923 47557.53 8577.373 13976 81045
enrol | 923 403.5655 162.6491 18 1176
. reg math4 lunch str avgsal enrol
Source | SS df MS Number of obs = 923
-------------+------------------------------ F( 4, 918) = 68.82
Model | 82641.3258 4 20660.3315 Prob > F = 0.0000
Residual | 275581.374 918 300.197575 R-squared = 0.2307
-------------+------------------------------ Adj R-squared = 0.2273
Total | 358222.7 922 388.527874 Root MSE = 17.326
------------------------------------------------------------------------------
math4 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lunch | -.2911477 .0237168 -12.28 0.000 -.3376931 -.2446023
str | -.8354922 .1776196 -4.70 0.000 -1.18408 -.4869046
avgsal | .0003744 .000079 4.74 0.000 .0002194 .0005294
enrol | .0050858 .0036523 1.39 0.164 -.002082 .0122537
_cons | 71.20066 4.302933 16.55 0.000 62.75593 79.64539
------------------------------------------------------------------------------
The variables lunch, str, and avgsal all have coefficients with the anticipated signs, and the absolute
values of their t statistics are above 4. So we easily reject H0 : βj = 0 against H1 : βj ̸= 0.
enrol is a different situation: tenrol = 1.39 < 1.65, so we fail to reject H0 even at the 10% significance
level.
Functional form can make a difference. The math pass rates are capped at 100, so a diminishing effect in
avgsal and enrol seems appropriate; these variables have lots of variation. So use logarithms instead.
. reg math4 lunch str lavgsal lenrol
Source | SS df MS Number of obs = 923
-------------+------------------------------ F( 4, 918) = 71.09
Model | 84715.9491 4 21178.9873 Prob > F = 0.0000
Residual | 273506.751 918 297.937637 R-squared = 0.2365
-------------+------------------------------ Adj R-squared = 0.2332
Total | 358222.7 922 388.527874 Root MSE = 17.261
------------------------------------------------------------------------------
math4 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lunch | -.2886697 .0235046 -12.28 0.000 -.3347986 -.2425408
str | -.9549563 .1824296 -5.23 0.000 -1.312984 -.5969288
lavgsal | 18.13305 3.605116 5.03 0.000 11.05782 25.20827
lenrol | 2.622179 1.256434 2.09 0.037 .1563616 5.087996
_cons | -116.6793 37.28153 -3.13 0.002 -189.8462 -43.51239
------------------------------------------------------------------------------
Of course, all estimates change, but it is those on lavgsal and
lenrol that are now much different. Before, we were measuring a dollar
effect. But now, holding the other variables fixed,
Δmath4^ = (18.13/100)%Δavgsal = .1813(%Δavgsal)
So if, say, %Δavgsal = 10 – teacher salaries are 10 percent higher –
math4 is estimated to increase by about 1.8 points.
Also,
Δmath4^ = (2.62/100)%Δenrol = .0262(%Δenrol)
so a 10% increase in enrollment is associated with a .26 point increase in
math4.
Notice that lenrol = log(enrol) is statistically significant at the 5% level:
tlenrol = 2.09 > 1.96.
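These level-change calculations are a one-liner (coefficients taken from the log-model output above):

```python
# Predicted change in math4 (in points) from a given percentage change in a
# logged regressor: delta_math4 = (coef / 100) * pct_change.
def level_change(coef, pct_change):
    return (coef / 100) * pct_change

salary_effect = level_change(18.13, 10)  # ~1.8 points for 10% higher salaries
enrol_effect = level_change(2.62, 10)    # ~0.26 points for 10% higher enrollment
```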
Reminder: When we report the results of, say, the second regression,
it looks like
math4^ = −116.68 − .289 lunch − .955 str + 18.13 lavgsal + 2.62 lenrol
         (37.28)   (.024)       (.182)     (3.61)          (1.26)
n = 923, R² = .237
so that standard errors are below coefficients.
When we reject H0 : βj = 0 against H1 : βj ̸= 0, we often say that β̂j is
statistically different from zero, and we usually mention a significance level. For
example, if we can reject at the 1% level, we say that. If we can reject at the
10% level but not the 5%, we say that.
As in the one-sided case, we also say β̂j is “statistically significant” when we
can reject H0 : βj = 0.
Testing Other Hypotheses about the βj
Testing the null H0 : βj = 0 is by far the most common. That is why
Stata and other regression packages automatically report the t
statistic for this hypothesis.
It is critical to remember that
tβ̂j = β̂j/se(β̂j)
is only for H0 : βj = 0.
What if we want to test a different null value? For example, in a
constant-elasticity consumption function,
log(cons) = β0 + β1 log(inc) + β2famsize + β3pareduc + u
we might want to test
H0 : β1 = 1
which means an income elasticity equal to one. (We can be pretty
sure that β1 > 0.)
More generally, suppose the null is
H0 : βj = aj
where we specify the value aj (usually zero, but, in the consumption
example, aj = 1).
It is easy to extend the t statistic:
t = (β̂j − aj)/se(β̂j)
This t statistic just measures how far our estimate, β̂j, is from the
hypothesized value, aj, relative to se(β̂j).
A useful expression for general t testing:
t = (estimate − hypothesized value)/standard error
The alternative can be one-sided or two-sided.
We choose critical values in exactly the same way as before.
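The general t statistic is the same one-line computation regardless of the null value (the numbers below are hypothetical):

```python
def t_statistic(estimate, hypothesized, se):
    """t = (estimate - hypothesized value) / standard error."""
    return (estimate - hypothesized) / se

# H0: beta_j = 0 is the special case hypothesized = 0.
t_default = t_statistic(0.5, 0.0, 0.25)  # 2.0
# Testing, say, H0: beta_j = 1 (a unit elasticity) uses the same estimate
# and standard error but a different center.
t_unit = t_statistic(0.5, 1.0, 0.25)     # -2.0
```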
The language needs to be suitably modified. If, for example,
H0 : βj = 1
H1 : βj ̸= 1
is rejected at the 5% level, we say “β̂j is statistically different from
one at the 5% level.” Otherwise, β̂j is “not statistically different from
one.” If the alternative is H1 : βj > 1, then “β̂j is statistically greater
than one at the 5% level.”
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 62 / 104
EXAMPLE: Crime and enrollment on college campuses
(CAMPUS.DTA)
A simple regression model:
log(crime) = β0 + β1 log(enroll) + u
H0 : β1 = 1
H1 : β1 > 1
We get β̂1 = 1.27, and so a 1% increase in enrollment is estimated to
increase crime by 1.27% (so more than 1%). Is this estimate statistically
greater than one?
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 63 / 104
Crime and enrollment on college campuses
(CAMPUS.DTA)
We cannot pull the t statistic off of the usual Stata output. We can
compute it by hand (rounding the estimate and standard error):
t = (1.270 − 1)/.110 ≈ 2.45
(Note how this is much smaller than the t for H0 : β1 = 0, reported
by Stata.)
We have df = 97 − 2 = 95, so we use the df = 120 entry in Table
G.2. The 1% cv for a one-sided alternative is about 2.36, so we reject
at the 1% significance level.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 64 / 104
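The hand calculation on this slide can be checked directly. A sketch in Python with scipy (the slides use Stata; this is just a cross-check with the rounded numbers from the slide):

```python
# Campus crime example: H0: beta1 = 1 vs H1: beta1 > 1, df = 97 - 2 = 95.
from scipy import stats

b1, se, df = 1.270, 0.110, 95     # rounded estimate and standard error
t = (b1 - 1) / se                 # about 2.45
p_one_sided = stats.t.sf(t, df)   # exact one-sided p-value, below .01
```

With the exact df = 95 (rather than the df = 120 table entry), the one-sided p-value is below 1%, agreeing with the rejection at the 1% level.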
. reg lcrime lenroll
Source | SS df MS Number of obs = 97
-------------+------------------------------ F( 1, 95) = 133.79
Model | 107.083654 1 107.083654 Prob > F = 0.0000
Residual | 76.0358244 95 .800377098 R-squared = 0.5848
-------------+------------------------------ Adj R-squared = 0.5804
Total | 183.119479 96 1.90749457 Root MSE = .89464
------------------------------------------------------------------------------
lcrime | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lenroll | 1.26976 .109776 11.57 0.000 1.051827 1.487693
_cons | -6.63137 1.03354 -6.42 0.000 -8.683206 -4.579533
------------------------------------------------------------------------------
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 65 / 104
Alternatively, we can let Stata do the work using the lincom (“linear combination”) command. Here the
null is stated equivalently as
H0 : β1 − 1 = 0
. lincom lenroll - 1
( 1) lenroll = 1
------------------------------------------------------------------------------
lcrime | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | .2697603 .109776 2.46 0.016 .0518273 .4876932
------------------------------------------------------------------------------
The t = 2.46 is the more accurate calculation of the t statistic.
The lincom lenroll - 1 command is Stata’s way of saying “test whether βlenroll − 1 equals zero.”
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 66 / 104
Computing p-Values for t Tests
The traditional approach to testing, where we choose a significance
level ahead of time, can be cumbersome.
Plus, it can conceal information. For example, suppose that, for
testing against a two-sided alternative, a t statistic is just below the
5% cv. I could simply say that “I fail to reject H0 : βj = 0 against the
two-sided alternative at the 5% level.” But there is nothing sacred
about 5%. Might I reject at, say, 6%?
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 67 / 104
Computing p-Values for t Tests
Rather than have to specify a level ahead of time, or discuss different
traditional significance levels (10%, 5%, 1%), it is better to answer
the following question: Given the observed value of the t statistic,
what is the smallest significance level at which I can reject H0?
The smallest level at which the null can be rejected is known as the
p-value of a test. It is a single number that automatically allows us to
carry out the test at any level.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 68 / 104
One way to think about the p-value is that it uses the observed
statistic as the critical value, and then finds the significance level of
the test using that critical value.
It is most common to report p-values for two-sided alternatives. This
is what Stata does. The t tables are not detailed enough.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 69 / 104
For t testing against a two-sided alternative,
p-value = P(|T| > |t|)
where t is the value of the t statistic and T is a random variable with
the tdf distribution.
The p-value is a probability, so it is between zero and one.
Perhaps the best way to think about p-values: it is the probability of
observing a statistic as extreme as we did if the null hypothesis is true.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 70 / 104
So smaller p-values provide more evidence against the null. For
example, if p-value = .50, then there is a 50% chance of observing a
t as large as we did (in absolute value). This is not enough evidence
against H0.
If p-value = .001, then the chance of seeing a t statistic as extreme
as we did is .1%. We can conclude that we got a very rare sample –
which is not helpful – or that the null hypothesis is very likely false.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 71 / 104
From
p-value = P(|T| > |t|)
we see that as |t| increases the p-value decreases. Large absolute t
statistics are associated with small p-values.
Suppose df = 40 and, from our data, we obtain t = 1.85 or
t = −1.85. Then
p-value = P(|T| > 1.85) = 2P(T > 1.85) = 2(.0359) = .0718
where T ∼ t40. Finding the actual numbers requires using Stata.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 72 / 104
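The df = 40, t = 1.85 calculation above can be reproduced in one line. A sketch in Python with scipy (an assumption; the slides compute this in Stata):

```python
# Two-sided p-value for t = 1.85 with df = 40, as on the slide:
# p = P(|T| > 1.85) = 2 * P(T > 1.85) for T ~ t_40.
from scipy import stats

t, df = 1.85, 40
p = 2 * stats.t.sf(abs(t), df)    # approximately .0718
```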
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 73 / 104
Given the p-value, we can carry out a test at any significance level. If α is
the chosen level, then
Reject H0 if p-value < α
For example, in the previous example we obtained p-value = .0718.
This means that we reject H0 at the 10% level but not the 5% level.
We reject at 8% but not (quite) at 7%.
Knowing p-value = .0718 is clearly much better than just saying “I
fail to reject at the 5% level.”
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 74 / 104
Computing p-Values for One-Sided Alternatives
Stata and other packages report the two-sided p-value. How can we
get a one-sided p-value?
With a caveat, the answer is simple:
one-sided p-value = (two-sided p-value)/2
We only want the area in one tail, not two tails. The two-sided
p-value gives us the area in both tails.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 75 / 104
This is the correct calculation when it is interesting to do the
calculation. The caveat is simple: if the estimated coefficient is not in
the direction of the alternative, the one-sided p-value is above .50,
and so it is not an interesting calculation.
In Stata, the two-sided p-values for H0 : βj = 0 are given in the
column labeled P>|t|.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 76 / 104
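The halving rule and its caveat can be captured in a small helper. A hedged sketch in Python (the function name and structure are my own, not from the slides):

```python
# Converting a two-sided p-value to a one-sided one, respecting the caveat:
# if the estimate points AWAY from the alternative, the one-sided p-value
# exceeds .5 and the calculation is uninteresting.
from scipy import stats

def one_sided_p(t, df, positive_alternative=True):
    two_sided = 2 * stats.t.sf(abs(t), df)
    same_direction = (t > 0) == positive_alternative
    return two_sided / 2 if same_direction else 1 - two_sided / 2
```

For the points coefficient below (t = 2.41, df = 263), this gives roughly .017/2 = .0085 against the positive alternative.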
EXAMPLE: Factors Affecting NBA Salaries
(NBASAL.DTA)
. des wage games mingame points rebounds assists
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------
wage float %9.0g annual salary, thousands $
games byte %9.0g average games per year
mingame float %9.0g minutes per game
points float %9.0g points per game
rebounds float %9.0g rebounds per game
assists float %9.0g assists per game
. sum wage games mingame points rebounds assists
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
wage | 269 1423.828 999.7741 150 5740
games | 269 65.72491 18.85111 3 82
mingame | 269 23.97925 9.731177 2.888889 43.08537
points | 269 10.21041 5.900667 1.2 29.8
rebounds | 269 4.401115 2.892573 .5 17.3
-------------+--------------------------------------------------------
assists | 269 2.408922 2.092986 0 12.6
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 77 / 104
Factors Affecting NBA Salaries (NBASAL.DTA)
Use lwage = log(wage) to get constant percentage effects.
. reg lwage games mingame points rebounds assists
Source | SS df MS Number of obs = 269
-------------+------------------------------ F( 5, 263) = 40.27
Model | 90.2698185 5 18.0539637 Prob > F = 0.0000
Residual | 117.918945 263 .448361006 R-squared = 0.4336
-------------+------------------------------ Adj R-squared = 0.4228
Total | 208.188763 268 .776823743 Root MSE = .6696
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
games | .0004132 .002682 0.15 0.878 -.0048679 .0056942
mingame | .0302278 .0130868 2.31 0.022 .0044597 .055996
points | .0363734 .0150945 2.41 0.017 .0066519 .0660949
rebounds | .0406795 .0229455 1.77 0.077 -.0045007 .0858597
assists | .0003665 .0314393 0.01 0.991 -.0615382 .0622712
_cons | 5.648996 .1559075 36.23 0.000 5.34201 5.955982
------------------------------------------------------------------------------
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 78 / 104
Forgetting the intercept (or “constant”), none of the variables is
statistically significant at the 1% level against a two-sided alternative.
The closest is points, with p-value = .017. (The one-sided p-value is
.017/2 = .0085 < .01, so it is significant at the 1% level against the
positive one-sided alternative.)
mingame is statistically significant at the 5% level because p-value
= .022 < .05.
rebounds is statistically significant at the 10% level (against a
two-sided alternative) because p-value = .077 < .10, but not at the
5% level. But the one-sided p-value is .077/2 = .0385 < .05, so it is
significant at the 5% level against a one-sided alternative.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 79 / 104
Both games and assists have very small t statistics, which lead to
p-values close to one (for example, for assists, p-value = .991).
These variables are statistically insignificant.
In some applications, p-values equal to zero up to three decimal
places are not uncommon. We do not have to worry about statistical
significance in such cases.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 80 / 104
Using WAGE2.DTA:
. reg lwage educ IQ exper motheduc
Source | SS df MS Number of obs = 759
-------------+------------------------------ F( 4, 754) = 54.26
Model | 58.7293322 4 14.682333 Prob > F = 0.0000
Residual | 204.019312 754 .270582642 R-squared = 0.2235
-------------+------------------------------ Adj R-squared = 0.2194
Total | 262.748644 758 .346634095 Root MSE = .52018
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .1006798 .0118813 8.47 0.000 .0773555 .124004
IQ | .00735 .0016068 4.57 0.000 .0041957 .0105043
exper | .0449386 .0084136 5.34 0.000 .0284217 .0614555
motheduc | .0239265 .0095623 2.50 0.013 .0051545 .0426985
_cons | -.3837064 .2373921 -1.62 0.106 -.8497344 .0823215
------------------------------------------------------------------------------
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 81 / 104
Language of Hypothesis Testing
If we do not reject H0 (against any alternative), it is better to say “we
fail to reject H0” as opposed to “we accept H0,” which is somewhat
common.
The reason is that many null hypotheses cannot be rejected in any
application. For example, if I have β̂j = .75 and se(β̂j ) = .25, I do
not say that I “accept H0 : βj = 1.”
I fail to reject because the t statistic is (.75 − 1)/.25 = −1.
But the t statistic for H0 : βj = .5 is (.75 − .5)/.25 = 1, so I cannot
reject H0 : βj = .5, either.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 82 / 104
Clearly βj = .5 and βj = 1 cannot both be true. There is a single,
unknown value in the population. So I should not “accept” either.
The outcomes of the t tests tell us the data cannot reject either
hypothesis. Nor can the data reject H0 : βj = .6, and so on. The data
does reject H0 : βj = 0 (t = 3) at a pretty small significance level (if
we have a reasonable df .)
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 83 / 104
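The point that the data cannot reject many nulls at once is easy to see numerically. A small sketch (mine, not from the slides) using the β̂j = .75, se = .25 example:

```python
# With b_hat = .75 and se = .25, several different nulls all survive a
# t test at conventional levels -- which is why we "fail to reject"
# rather than "accept" any one of them.
b_hat, se = 0.75, 0.25
for a in (1.0, 0.6, 0.5, 0.0):
    t = (b_hat - a) / se
    print(f"H0: beta = {a}: t = {t:.2f}")
# |t| is at most 1 for a = 1, .6, .5; only a = 0 gives a large t (t = 3).
```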
Practical versus Statistical Significance
t testing is purely about statistical significance. It does not directly
speak to the issue of whether a variable has a practically, or
economically, large effect.
Practical (Economic) Significance depends on the size (and sign)
of β̂j .
Statistical Significance depends on tβ̂j
.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 84 / 104
It is possible to estimate practically large effects but have the estimates
so imprecise that they are statistically insignificant. This is especially
an issue with small data sets (but not only small data sets).
Even more importantly, it is possible to get estimates that are
statistically significant – often with very small p-values – but are not
practically large. This can happen with very large data sets.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 85 / 104
EXAMPLE
Suppose that, using a large cross section data set for teenagers across
the U.S.,
we estimate the elasticity of alcohol demand with respect to price to
be −.013 with se = .002.
Then the t statistic is −6.5, and we need look no further to conclude
the elasticity is statistically different from zero. But the estimate
means that, say, a 10% increase in the price of alcohol reduces
demand by an estimated .13%. This is a small effect.
The bottom line: do not just fixate on t statistics! Interpreting the β̂j
is just as important.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 86 / 104
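The arithmetic behind the alcohol-demand example is worth making explicit. A two-line sketch (illustrative, using the numbers from the slide):

```python
# Statistically very significant, practically small: the elasticity is
# precisely estimated (t = -6.5) but tiny in economic terms.
elasticity, se = -0.013, 0.002
t = elasticity / se                 # -6.5
effect_of_10pct = 10 * elasticity   # a 10% price rise changes demand by -.13%
```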
Confidence Intervals
Rather than just testing hypotheses about parameters it is also useful
to construct confidence intervals (also known as interval estimates).
Loosely, the CI is supposed to give a “likely” range of values for the
corresponding population parameter.
We will only consider CIs of the form
β̂j ± c · se(β̂j )
where c > 0 is chosen based on the confidence level.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 87 / 104
We will use a 95% confidence level, in which case c comes from the
97.5th percentile of the tdf distribution. In other words, c is the 5%
critical value against a two-sided alternative.
Stata automatically reports a 95% CI for each parameter, based on
the t distribution using the appropriate df .
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 88 / 104
. reg lwage games mingame points rebounds assists
Source | SS df MS Number of obs = 269
-------------+------------------------------ F( 5, 263) = 40.27
Model | 90.2698185 5 18.0539637 Prob > F = 0.0000
Residual | 117.918945 263 .448361006 R-squared = 0.4336
-------------+------------------------------ Adj R-squared = 0.4228
Total | 208.188763 268 .776823743 Root MSE = .6696
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
games | .0004132 .002682 0.15 0.878 -.0048679 .0056942
mingame | .0302278 .0130868 2.31 0.022 .0044597 .055996
points | .0363734 .0150945 2.41 0.017 .0066519 .0660949
rebounds | .0406795 .0229455 1.77 0.077 -.0045007 .0858597
assists | .0003665 .0314393 0.01 0.991 -.0615382 .0622712
_cons | 5.648996 .1559075 36.23 0.000 5.34201 5.955982
------------------------------------------------------------------------------
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 89 / 104
Notice how the three estimates that are not statistically different from
zero at the 5% level – games, rebounds, and assists – all have 95%
CIs that include zero. For example, the 95% CI for βrebounds is
[−.0045, .0859]
By contrast, the 95% CI for βpoints is
[.0067, .0661]
which excludes zero.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 90 / 104
A simple rule-of-thumb is useful for constructing a CI given the
estimate and its standard error. For, say, df ≥ 60, an approximate
95% CI is
β̂j ± 2se(β̂j ) or [β̂j − 2se(β̂j ), β̂j + 2se(β̂j )]
That is, subtract and add twice the standard error to the estimate.
(In the case of the standard normal, the 2 becomes 1.96.)
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 91 / 104
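The rule β̂j ± c · se(β̂j ) can be checked against the Stata output above. A sketch in Python with scipy (the numbers are the points coefficient from the NBA regression):

```python
# 95% CI for beta_points: c is the 97.5th percentile of t with df = 263,
# which is close to the rule-of-thumb value of 2.
from scipy import stats

b, se, df = 0.0363734, 0.0150945, 263
c = stats.t.ppf(0.975, df)          # about 1.97
lo, hi = b - c * se, b + c * se     # reproduces Stata's [.0066519, .0660949]
```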
Properly interpeting a CI is a bit tricky. One often sees statements
such as “there is a 95% chance that βpoints is in the interval
[.0067, .0661].” This is incorrect. βpoints is some fixed value, and it
either is or is not in the interval.
The correct way to interpret a CI is to remember that the endpoints,
β̂j − c · se(β̂j ) and β̂j + c · se(β̂j ), change with each sample (or at
least can change). That is, the endpoints are random outcomes that
depend on the data we draw.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 92 / 104
What a 95% CI means is that for 95% of the random samples that we
draw from the population, the interval we compute using the rule
β̂j ± c · se(β̂j ) will include the value βj . But for a particular sample
we do not know whether βj is in the interval.
This is similar to the idea that unbiasedness of β̂j does not means
that β̂j = βj . Most of the time β̂j is not βj . Unbiasedness means
E(β̂j ) = βj .
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 93 / 104
CIs and Hypothesis Testing
If we have constructed a 95% CI for, say, βj , we can test any null
value against a two-sided alternative, at the 5% level. So
H0 : βj = aj
H1 : βj ̸= aj
1. If aj is in the 95% CI, then we fail to reject H0 at the 5% level.
2. If aj is not in the 95% CI then we reject H0 in favor of H1 at the
5% level.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 94 / 104
Note that, measured as percents,
significance level = 100 − confidence level
. reg lwage educ IQ exper motheduc
Source | SS df MS Number of obs = 759
-------------+------------------------------ F( 4, 754) = 54.26
Model | 58.7293322 4 14.682333 Prob > F = 0.0000
Residual | 204.019312 754 .270582642 R-squared = 0.2235
-------------+------------------------------ Adj R-squared = 0.2194
Total | 262.748644 758 .346634095 Root MSE = .52018
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .1006798 .0118813 8.47 0.000 .0773555 .124004
IQ | .00735 .0016068 4.57 0.000 .0041957 .0105043
exper | .0449386 .0084136 5.34 0.000 .0284217 .0614555
motheduc | .0239265 .0095623 2.50 0.013 .0051545 .0426985
_cons | -.3837064 .2373921 -1.62 0.106 -.8497344 .0823215
------------------------------------------------------------------------------
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 95 / 104
The 95% CI for βIQ is about [.0042, .0105]. So we can reject
H0 : βIQ = 0 against the two-sided alternative at the 5% level. We
cannot reject H0 : βIQ = .01 (although it is close).
We can reject a return to schooling of 7.5% as too low, and a return
of 12.5% as too high.
Just as with hypothesis testing, these CIs are only as good as the
underlying assumptions. If we have omitted key variables, the β̂j are
biased. If the error variance is not constant, the standard errors are
improperly computed.
With df = 754, we will see later that normality is not very important.
But normality is needed for these CIs to be exact 95% CIs.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 96 / 104
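The CI-based tests for βIQ can be verified directly. A sketch in Python (an illustration of the duality, using the coefficient and standard error from the output above):

```python
# CI-testing duality for beta_IQ (df = 754): a null value a_j is rejected
# at the 5% level exactly when it falls outside the 95% CI.
from scipy import stats

b, se, df = 0.00735, 0.0016068, 754
c = stats.t.ppf(0.975, df)
lo, hi = b - c * se, b + c * se           # about [.0042, .0105]
reject_zero = not (lo <= 0.0 <= hi)       # 0 is outside: reject H0: beta = 0
reject_point01 = not (lo <= 0.01 <= hi)   # .01 is (barely) inside: fail to reject
```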
Testing Single Linear Restrictions
So far, we have discussed testing hypotheses that involve only one
parameter, βj . But some hypotheses involve many parameters.
EXAMPLE: Are the Returns to a Year of Junior College the Same as
for a Four-Year University? (COLLEGE.DTA). Sample of high school
graduates.
lwage = β0 + β1jc + β2univ + β3exper + u
H0 : β1 = β2
H1 : β1 < β2
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 97 / 104
We could use a two-sided alternative, too.
We can also write
H0 : β1 − β2 = 0
Remember the general way to construct a t statistic:
t =
(estimate − hypothesized value)
standard error
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 98 / 104
Given the OLS estimates β̂1 and β̂2,
t = (β̂1 − β̂2)/se(β̂1 − β̂2)
Problem: The OLS output gives us β̂1 and β̂2 and their standard
errors, but that is not enough to obtain se(β̂1 − β̂2).
Recall a fact about variances:
Var(β̂1 − β̂2) = Var(β̂1) + Var(β̂2) − 2Cov(β̂1, β̂2)
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 99 / 104
The standard error is an estimate of the square root:
se(β̂1 − β̂2) = {[se(β̂1)]2 + [se(β̂2)]2 − 2s12}1/2
where s12 is an estimate of Cov(β̂1, β̂2). This is the piece we are
missing.
Stata will report s12 if we ask, but calculating se(β̂1 − β̂2) is
cumbersome. There is also a trick of rewriting the model (see text,
Section 4.4).
These days, it is easiest to use a command for testing linear functions
of the coefficients. In Stata, it is lincom.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 100 / 104
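The variance formula for a difference can be illustrated with the jc/univ numbers that appear below. A sketch in Python; note that s12 is backed out from the reported standard errors here (an assumption for illustration; in practice Stata computes it from the covariance matrix):

```python
# se(b1 - b2) = sqrt(se1^2 + se2^2 - 2*s12), using the college example.
import math

se1, se2 = 0.0202773, 0.0068935   # se(b_jc), se(b_univ) from the regression
se_diff = 0.0206407               # se(b_jc - b_univ), as lincom reports it
# implied covariance estimate (backed out so the pieces are consistent):
s12 = (se1**2 + se2**2 - se_diff**2) / 2
# forward direction: rebuild se_diff from its three pieces
se_diff_check = math.sqrt(se1**2 + se2**2 - 2 * s12)
t = (0.0661471 - 0.0836956) / se_diff   # about -0.85, matching Stata
```

The missing piece s12 is exactly why the usual regression output alone is not enough; lincom handles it automatically.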
. des lwage jc univ exper
storage display value
variable name type format label variable label
-----------------------------------------------------------------------------
lwage float %9.0g log(wage)
jc float %9.0g total 2-year credits
univ float %9.0g total 4-year credits
exper float %8.0g work experience, years
. sum lwage jc univ exper
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
lwage | 750 2.233674 .4906276 .6931472 3.901973
jc | 750 .3449006 .7731012 0 3.833333
univ | 750 1.817076 2.276202 0 7.5
exper | 750 10.26722 2.713302 .25 13.83333
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 101 / 104
. reg lwage jc univ exper
Source | SS df MS Number of obs = 750
-------------+------------------------------ F( 3, 746) = 86.25
Model | 46.4300797 3 15.4766932 Prob > F = 0.0000
Residual | 133.86575 746 .179444705 R-squared = 0.2575
-------------+------------------------------ Adj R-squared = 0.2545
Total | 180.295829 749 .240715393 Root MSE = .42361
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
jc | .0661471 .0202773 3.26 0.001 .0263398 .1059544
univ | .0836956 .0068935 12.14 0.000 .0701626 .0972287
exper | .0653706 .0057692 11.33 0.000 .0540448 .0766964
_cons | 1.387603 .0636388 21.80 0.000 1.262671 1.512536
------------------------------------------------------------------------------
Note that β̂jc − β̂univ = .0661 − .0837 = −.0176, so the estimated return to univ is about 1.8% higher.
But is the difference statistically significant?
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 102 / 104
. reg lwage jc univ exper
Source | SS df MS Number of obs = 750
-------------+------------------------------ F( 3, 746) = 86.25
Model | 46.4300797 3 15.4766932 Prob > F = 0.0000
Residual | 133.86575 746 .179444705 R-squared = 0.2575
-------------+------------------------------ Adj R-squared = 0.2545
Total | 180.295829 749 .240715393 Root MSE = .42361
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
jc | .0661471 .0202773 3.26 0.001 .0263398 .1059544
univ | .0836956 .0068935 12.14 0.000 .0701626 .0972287
exper | .0653706 .0057692 11.33 0.000 .0540448 .0766964
_cons | 1.387603 .0636388 21.80 0.000 1.262671 1.512536
------------------------------------------------------------------------------
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 103 / 104
. lincom jc - univ
( 1) jc - univ = 0
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | -.0175485 .0206407 -0.85 0.395 -.0580694 .0229723
------------------------------------------------------------------------------
The two-sided p-value is .395, which means the one-sided p-value is .1975. Even against a one-sided
alternative, we cannot reject H0 : βjc = βuniv at even the 20% level.
Note how much more variation there is in univ compared with jc.
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 104 / 104
Of course, nothing changes (except the sign of the estimate) if we use βuniv − βjc:
. lincom univ - jc
( 1) - jc + univ = 0
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | .0175485 .0206407 0.85 0.395 -.0229723 .0580694
------------------------------------------------------------------------------
Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 105 / 104
Stock Market Brief Deck for 4/24/24 .pdfMichael Silva
 
Unveiling the Top Chartered Accountants in India and Their Staggering Net Worth
Unveiling the Top Chartered Accountants in India and Their Staggering Net WorthUnveiling the Top Chartered Accountants in India and Their Staggering Net Worth
Unveiling the Top Chartered Accountants in India and Their Staggering Net WorthShaheen Kumar
 
Classical Theory of Macroeconomics by Adam Smith
Classical Theory of Macroeconomics by Adam SmithClassical Theory of Macroeconomics by Adam Smith
Classical Theory of Macroeconomics by Adam SmithAdamYassin2
 
(办理原版一样)QUT毕业证昆士兰科技大学毕业证学位证留信学历认证成绩单补办
(办理原版一样)QUT毕业证昆士兰科技大学毕业证学位证留信学历认证成绩单补办(办理原版一样)QUT毕业证昆士兰科技大学毕业证学位证留信学历认证成绩单补办
(办理原版一样)QUT毕业证昆士兰科技大学毕业证学位证留信学历认证成绩单补办fqiuho152
 
fca-bsps-decision-letter-redacted (1).pdf
fca-bsps-decision-letter-redacted (1).pdffca-bsps-decision-letter-redacted (1).pdf
fca-bsps-decision-letter-redacted (1).pdfHenry Tapper
 
Vp Girls near me Delhi Call Now or WhatsApp
Vp Girls near me Delhi Call Now or WhatsAppVp Girls near me Delhi Call Now or WhatsApp
Vp Girls near me Delhi Call Now or WhatsAppmiss dipika
 
How Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of ReportingHow Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of ReportingAggregage
 
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...shivangimorya083
 
BPPG response - Options for Defined Benefit schemes - 19Apr24.pdf
BPPG response - Options for Defined Benefit schemes - 19Apr24.pdfBPPG response - Options for Defined Benefit schemes - 19Apr24.pdf
BPPG response - Options for Defined Benefit schemes - 19Apr24.pdfHenry Tapper
 
magnetic-pensions-a-new-blueprint-for-the-dc-landscape.pdf
magnetic-pensions-a-new-blueprint-for-the-dc-landscape.pdfmagnetic-pensions-a-new-blueprint-for-the-dc-landscape.pdf
magnetic-pensions-a-new-blueprint-for-the-dc-landscape.pdfHenry Tapper
 
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Monthly Market Risk Update: April 2024 [SlideShare]
Monthly Market Risk Update: April 2024 [SlideShare]Monthly Market Risk Update: April 2024 [SlideShare]
Monthly Market Risk Update: April 2024 [SlideShare]Commonwealth
 
(办理学位证)加拿大萨省大学毕业证成绩单原版一比一
(办理学位证)加拿大萨省大学毕业证成绩单原版一比一(办理学位证)加拿大萨省大学毕业证成绩单原版一比一
(办理学位证)加拿大萨省大学毕业证成绩单原版一比一S SDS
 
Instant Issue Debit Cards - High School Spirit
Instant Issue Debit Cards - High School SpiritInstant Issue Debit Cards - High School Spirit
Instant Issue Debit Cards - High School Spiritegoetzinger
 
SBP-Market-Operations and market managment
SBP-Market-Operations and market managmentSBP-Market-Operations and market managment
SBP-Market-Operations and market managmentfactical
 
call girls in Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in  Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in  Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Recently uploaded (20)

VIP High Class Call Girls Saharanpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Saharanpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Saharanpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Saharanpur Anushka 8250192130 Independent Escort Se...
 
Mulki Call Girls 7001305949 WhatsApp Number 24x7 Best Services
Mulki Call Girls 7001305949 WhatsApp Number 24x7 Best ServicesMulki Call Girls 7001305949 WhatsApp Number 24x7 Best Services
Mulki Call Girls 7001305949 WhatsApp Number 24x7 Best Services
 
Log your LOA pain with Pension Lab's brilliant campaign
Log your LOA pain with Pension Lab's brilliant campaignLog your LOA pain with Pension Lab's brilliant campaign
Log your LOA pain with Pension Lab's brilliant campaign
 
Stock Market Brief Deck for 4/24/24 .pdf
Stock Market Brief Deck for 4/24/24 .pdfStock Market Brief Deck for 4/24/24 .pdf
Stock Market Brief Deck for 4/24/24 .pdf
 
Unveiling the Top Chartered Accountants in India and Their Staggering Net Worth
Unveiling the Top Chartered Accountants in India and Their Staggering Net WorthUnveiling the Top Chartered Accountants in India and Their Staggering Net Worth
Unveiling the Top Chartered Accountants in India and Their Staggering Net Worth
 
Classical Theory of Macroeconomics by Adam Smith
Classical Theory of Macroeconomics by Adam SmithClassical Theory of Macroeconomics by Adam Smith
Classical Theory of Macroeconomics by Adam Smith
 
(办理原版一样)QUT毕业证昆士兰科技大学毕业证学位证留信学历认证成绩单补办
(办理原版一样)QUT毕业证昆士兰科技大学毕业证学位证留信学历认证成绩单补办(办理原版一样)QUT毕业证昆士兰科技大学毕业证学位证留信学历认证成绩单补办
(办理原版一样)QUT毕业证昆士兰科技大学毕业证学位证留信学历认证成绩单补办
 
fca-bsps-decision-letter-redacted (1).pdf
fca-bsps-decision-letter-redacted (1).pdffca-bsps-decision-letter-redacted (1).pdf
fca-bsps-decision-letter-redacted (1).pdf
 
Vp Girls near me Delhi Call Now or WhatsApp
Vp Girls near me Delhi Call Now or WhatsAppVp Girls near me Delhi Call Now or WhatsApp
Vp Girls near me Delhi Call Now or WhatsApp
 
How Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of ReportingHow Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of Reporting
 
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
 
BPPG response - Options for Defined Benefit schemes - 19Apr24.pdf
BPPG response - Options for Defined Benefit schemes - 19Apr24.pdfBPPG response - Options for Defined Benefit schemes - 19Apr24.pdf
BPPG response - Options for Defined Benefit schemes - 19Apr24.pdf
 
magnetic-pensions-a-new-blueprint-for-the-dc-landscape.pdf
magnetic-pensions-a-new-blueprint-for-the-dc-landscape.pdfmagnetic-pensions-a-new-blueprint-for-the-dc-landscape.pdf
magnetic-pensions-a-new-blueprint-for-the-dc-landscape.pdf
 
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Maya Call 7001035870 Meet With Nagpur Escorts
 
Monthly Market Risk Update: April 2024 [SlideShare]
Monthly Market Risk Update: April 2024 [SlideShare]Monthly Market Risk Update: April 2024 [SlideShare]
Monthly Market Risk Update: April 2024 [SlideShare]
 
🔝+919953056974 🔝young Delhi Escort service Pusa Road
🔝+919953056974 🔝young Delhi Escort service Pusa Road🔝+919953056974 🔝young Delhi Escort service Pusa Road
🔝+919953056974 🔝young Delhi Escort service Pusa Road
 
(办理学位证)加拿大萨省大学毕业证成绩单原版一比一
(办理学位证)加拿大萨省大学毕业证成绩单原版一比一(办理学位证)加拿大萨省大学毕业证成绩单原版一比一
(办理学位证)加拿大萨省大学毕业证成绩单原版一比一
 
Instant Issue Debit Cards - High School Spirit
Instant Issue Debit Cards - High School SpiritInstant Issue Debit Cards - High School Spirit
Instant Issue Debit Cards - High School Spirit
 
SBP-Market-Operations and market managment
SBP-Market-Operations and market managmentSBP-Market-Operations and market managment
SBP-Market-Operations and market managment
 
call girls in Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in  Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in  Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 

Chapter6.pdf.pdf

EXAMPLE (motivated by ATTEND.DTA):

final = β0 + β1 missed + β2 priGPA + β3 ACT + u

where ACT is the achievement test score. The null hypothesis, that missing lecture has no effect on final exam performance (after accounting for prior MSU GPA and ACT score), is

H0: β1 = 0
Sampling Distributions of the OLS Estimators

To test hypotheses about the βj using exact (or "finite sample") testing procedures, we need to know more than just the mean and variance of the OLS estimators.

Under MLR.1 to MLR.4, we can compute the expected value: E(β̂j) = βj.

Under MLR.1 to MLR.5, we know the variance:

Var(β̂j) = σ² / [SSTj (1 − Rj²)]

And σ̂² = SSR/(n − k − 1) is an unbiased estimator of σ².
Sampling Distributions of the OLS Estimators

But hypothesis testing relies on the entire sampling distributions of the β̂j. Even under MLR.1 through MLR.5, the sampling distributions can be virtually anything. Write

β̂j = βj + Σ_{i=1}^{n} wij ui,

where the wij are functions of {(xi1, ..., xik): i = 1, ..., n}. Conditional on {(xi1, ..., xik): i = 1, ..., n}, β̂j inherits its distribution from that of {ui: i = 1, ..., n}, which is a random sample from the population distribution of u.
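The decomposition above can be checked numerically in the simple-regression case (k = 1), where wi = (xi − x̄)/SSTx. A minimal sketch with simulated data (not from the slides):

```python
import random

# Illustrative sketch (simulated data): for simple regression
# y = b0 + b1*x + u, the OLS slope satisfies the exact identity
#   b1_hat = b1 + sum_i w_i * u_i,   with  w_i = (x_i - xbar) / SST_x.
random.seed(42)
n, b0, b1 = 50, 1.0, 2.0
x = [random.uniform(0, 10) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
y = [b0 + b1 * xi + ui for xi, ui in zip(x, u)]

xbar = sum(x) / n
sst_x = sum((xi - xbar) ** 2 for xi in x)
b1_hat = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sst_x

w = [(xi - xbar) / sst_x for xi in x]
decomposed = b1 + sum(wi * ui for wi, ui in zip(w, u))

assert abs(b1_hat - decomposed) < 1e-10  # identity holds to machine precision
```

The identity is exact in every sample, which is why, conditional on the x's, the distribution of β̂j is fully determined by the distribution of the errors.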
Assumption MLR.6 (Normality)

The population error u is independent of (x1, ..., xk) and is normally distributed with mean zero and variance σ²:

u ∼ Normal(0, σ²)

Recall MLR.4: E(u|x1, ..., xk) = E(u) = 0, and MLR.5: Var(u|x1, ..., xk) = Var(u) = σ². Now MLR.6 imposes full independence between u and (x1, x2, ..., xk) (not just mean and variance independence), which is where the label of the xj as "independent variables" originated.
The important part of MLR.6 is that we have now made a very specific distributional assumption for u: the familiar bell-shaped normal curve.
Assumption MLR.6 (Normality)

Normality is by far the most common assumption, but the usual arguments about why normality is a good assumption are not always operative. Usually, the argument starts with the claim that u is the sum of many independent factors, say u = a1 + a2 + ... + am for "large" m, and then we can apply the central limit theorem. But what if the factors have very different distributions, or are multiplicative rather than additive?
Assumptions MLR.1–6

Ultimately, like Assumption MLR.5, Assumption MLR.6 is maintained for convenience. Fortunately, we will later see that, for approximate inference in large samples, we can drop MLR.6. For now we keep it: it is very difficult to perform exact statistical inference without Assumption MLR.6. Assumptions MLR.1 to MLR.6 are called the classical linear model (CLM) assumptions (for cross-sectional regression).
Normality

For practical purposes, think of CLM = Gauss-Markov + normality.

An important fact about independent normal random variables: any linear combination is also normally distributed. Because the ui are independent and identically distributed (iid) as Normal(0, σ²),

β̂j = βj + Σ_{i=1}^{n} wij ui ∼ Normal[βj, Var(β̂j)]

where we already know the formula for Var(β̂j):

Var(β̂j) = σ² / [SSTj (1 − Rj²)]
THEOREM (Normal Sampling Distributions)

Under the CLM Assumptions (and conditional on the sample outcomes of the explanatory variables),

β̂j ∼ Normal[βj, Var(β̂j)]

and so

(β̂j − βj) / sd(β̂j) ∼ Normal(0, 1)

The second result follows from a feature of the normal distribution: if W ∼ Normal then a + bW ∼ Normal for constants a and b.
Normality

The standardized random variable

(β̂j − βj) / sd(β̂j)

always has zero mean and variance one. Under MLR.6, it is also normally distributed. Notice that the standard normal distribution holds even when we do not condition on {(xi1, xi2, ..., xik): i = 1, ..., n}.
Testing Hypotheses About a Single Population Parameter

We cannot directly use the result

(β̂j − βj) / sd(β̂j) ∼ Normal(0, 1)

to test hypotheses about βj: sd(β̂j) depends on σ = sd(u), which is unknown. But we have σ̂ as an estimator of σ. Using this in place of σ gives us the standard error, se(β̂j).
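In the simple-regression case the substitution is easy to see concretely: se(β̂1) = σ̂/√SSTx with σ̂² = SSR/(n − k − 1). A hedged sketch on simulated data (the data and numbers are illustrative, not from the slides):

```python
import math
import random

# Illustrative sketch (simulated simple regression, k = 1): estimate sigma
# from the OLS residuals, then form the standard error of the slope.
random.seed(1)
n = 40
x = [random.uniform(0, 5) for _ in range(n)]
y = [3.0 + 0.5 * xi + random.gauss(0, 2) for xi in x]

xbar, ybar = sum(x) / n, sum(y) / n
sst_x = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sst_x
b0 = ybar - b1 * xbar

ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
sigma_hat = math.sqrt(ssr / (n - 2))      # df = n - k - 1 with k = 1
se_b1 = sigma_hat / math.sqrt(sst_x)      # standard error of the slope

assert se_b1 > 0
```

Because σ̂ varies from sample to sample, the standardized ratio with se(β̂j) in the denominator is no longer standard normal, which is the point of the next theorem.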
THEOREM (t Distribution for Standardized Estimators)

Under the CLM Assumptions,

(β̂j − βj) / se(β̂j) ∼ t_{n−k−1} = t_{df}

We will not prove this, as the argument is somewhat involved. It is replacing σ (an unknown constant) with σ̂ (an estimator that varies across samples) that takes us from the standard normal to the t distribution.
Distribution for Standardized Estimators

The t distribution also has a bell shape, but is more spread out than the Normal(0, 1):

E(t_df) = 0 if df > 1
Var(t_df) = df/(df − 2) > 1 if df > 2

We will never have very small df in this class. When df = 10, Var(t_df) = 1.25, which is 25% larger than the Normal(0, 1) variance. When df = 120, Var(t_df) ≈ 1.017 – only 1.7% larger than the standard normal.
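The two numerical claims follow directly from the variance formula, as a quick check shows:

```python
# The variance formula Var(t_df) = df / (df - 2) from the slide, for df > 2.
def t_var(df):
    """Variance of a t distribution with df > 2 degrees of freedom."""
    return df / (df - 2)

assert t_var(10) == 1.25              # 25% larger than the N(0,1) variance
assert round(t_var(120), 3) == 1.017  # only about 1.7% larger
```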
Distribution for Standardized Estimators

As df → ∞, t_df → Normal(0, 1). The difference is practically small for df > 120. The next graph plots a standard normal pdf against a t6 pdf.
Testing

We use the result on the t distribution to test the null hypothesis that xj has no partial effect on y: H0: βj = 0.

lwage = β0 + β1 educ + β2 exper + β3 tenure + u

H0: β2 = 0

In words: once we control for education and time on the current job (tenure), total workforce experience has no effect on lwage = log(wage).
Testing

To test H0: βj = 0, we use the t statistic (or t ratio),

t_β̂j = β̂j / se(β̂j)

This is the estimated coefficient divided by our estimate of β̂j's sampling standard deviation. In virtually all cases β̂j is not exactly equal to zero. When we use t_β̂j, we are measuring how far β̂j is from zero relative to its standard error.
Testing

Because se(β̂j) > 0, t_β̂j always has the same sign as β̂j. To use t_β̂j to test H0: βj = 0, we need to have an alternative. Some like to define t_β̂j as the absolute value, so it is always positive; this makes it cumbersome to test against one-sided alternatives.
Testing Against One-Sided Alternatives

First consider the alternative

H1: βj > 0

which means the null is effectively

H0: βj ≤ 0

Using a positive one-sided alternative, if we reject βj = 0 then we reject any βj < 0, too. We often just state H0: βj = 0 and act like we do not care about negative values.
Testing Against One-Sided Alternatives

If the estimated coefficient β̂j is negative, it provides no evidence against H0 in favor of H1: βj > 0. If β̂j is positive, the question is: how big does t_β̂j = β̂j/se(β̂j) have to be before we conclude H0 is "unlikely"?

The traditional approach to hypothesis testing:
Testing Against One-Sided Alternatives

1. Choose a null hypothesis: H0: βj = 0 (or H0: βj ≤ 0).
2. Choose an alternative hypothesis: H1: βj > 0.
3. Choose a significance level (or simply level, or size) for the test: the probability of rejecting the null hypothesis when it is in fact true (a Type I error). Suppose we use 5%, so the probability of committing a Type I error is .05.
4. Choose a critical value, c > 0, so that the rejection rule t_β̂j > c leads to a 5% level test.
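The four steps reduce to a one-line decision rule. A small sketch (the helper name and the example coefficients are hypothetical, not from the slides):

```python
# Hypothetical helper: the one-sided rejection rule from steps 1-4.
# Reject H0: beta_j = 0 in favor of H1: beta_j > 0 when t exceeds c.
def reject_one_sided_positive(beta_hat, se, c):
    t_stat = beta_hat / se
    return t_stat > c

# With df = 28 and a 5% level, c = 1.701 (Table G.2).
assert reject_one_sided_positive(0.50, 0.20, 1.701)       # t = 2.5 > 1.701
assert not reject_one_sided_positive(0.20, 0.20, 1.701)   # t = 1.0, fail to reject
```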
Testing Against One-Sided Alternatives

The key is that, under the null hypothesis,

t_β̂j ∼ t_{n−k−1} = t_{df}

and this is what we use to obtain the critical value, c. Suppose df = 28 and we use a 5% test. The critical value is c = 1.701, as can be gotten from Table G.2 (page 833 in 5e). The following picture shows that we are conducting a one-tailed test (and it is these entries that should be used in the table).
Testing Against One-Sided Alternatives

So, with df = 28, the rejection rule for H0: βj = 0 against H1: βj > 0, at the 5% level, is

t_β̂j > 1.701

We need a t statistic greater than 1.701 to conclude there is enough evidence against H0. If t_β̂j ≤ 1.701, we fail to reject H0 against H1 at the 5% significance level.

Suppose df = 28, but we want to carry out the test at a different significance level (often the 10% level or the 1% level):

c.10 = 1.313, c.05 = 1.701, c.01 = 2.467
Testing Against One-Sided Alternatives

If we want to reduce the probability of a Type I error, we must increase the critical value (so we reject the null less often). If we reject at, say, the 1% level, then we must also reject at any larger level. If we fail to reject at, say, the 10% level – so that t_β̂j ≤ 1.313 – then we will fail to reject at any smaller level.
Testing Against One-Sided Alternatives

With large sample sizes – certainly when df > 120 – we can use critical values from the standard normal distribution. These are the df = ∞ entries in Table G.2:

c.10 = 1.282, c.05 = 1.645, c.01 = 2.326

which we can round to 1.28, 1.65, and 2.33, respectively. The value 1.65 is especially common for a one-tailed test.
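The df = ∞ entries are just standard normal quantiles, so they can be reproduced without the table. A quick sketch using the Python standard library:

```python
from statistics import NormalDist

# The large-sample (df = infinity) one-sided critical values are quantiles
# of the standard normal distribution.
z = NormalDist()  # Normal(0, 1)

c10 = z.inv_cdf(0.90)   # 10% one-sided critical value
c05 = z.inv_cdf(0.95)   # 5%
c01 = z.inv_cdf(0.99)   # 1%

assert round(c10, 3) == 1.282
assert round(c05, 3) == 1.645
assert round(c01, 3) == 2.326
```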
EXAMPLE: Factors Affecting lwage (WAGE2.DTA)

In applications, it is helpful to label parameters with variable names to state hypotheses: βeduc, βIQ, and βexper, for example. Then H0: βexper = 0 says that workforce experience has no effect on wage once education and IQ have been accounted for.
EXAMPLE: Factors Affecting lwage (WAGE2.DTA)

lwage = −.229 + .107 educ + .0080 IQ + .0435 exper
        (.230)  (.012)      (.0016)    (.0084)

n = 759, R² = .217

The quantities in parentheses are still standard errors, not t statistics! It is easiest to read the t statistic off the Stata output, when available: texper = 5.17, which is well above the one-sided critical value at the 1% level, 2.33. In fact, the .5% critical value is about 2.58.
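The reported t statistic and the economic magnitude both follow from the coefficient and standard error. A sketch using the numbers from the Stata output:

```python
# Reproduce the exper t statistic from its coefficient and standard error
# (numbers taken from the Stata output for this regression).
b_exper, se_exper = 0.0435405, 0.0084242
t_exper = b_exper / se_exper
assert round(t_exper, 2) == 5.17   # matches the Stata output

# In a log-wage equation, 100*beta approximates the percentage effect of
# one more year of experience.
pct_effect = 100 * b_exper
assert round(pct_effect, 1) == 4.4   # about 4.4% per year
```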
EXAMPLE: Factors Affecting lwage (WAGE2.DTA)

The bottom line is that H0: βexper = 0 can be rejected against H1: βexper > 0 at very small significance levels. A t of 5.17 is very large. The estimated effect of exper – that is, its economic importance – is apparent: another year of experience, holding educ and IQ fixed, is estimated to be worth about 4.4%. The t statistics for educ and IQ are also very large; there is no need to even look up critical values.
. reg lwage educ IQ exper

      Source |       SS       df       MS              Number of obs =     759
-------------+------------------------------           F(  3,   755) =   69.78
       Model |  57.0352742     3  19.0117581           Prob > F      =  0.0000
    Residual |   205.71337   755   .27246804           R-squared     =  0.2171
-------------+------------------------------           Adj R-squared =  0.2140
       Total |  262.748644   758  .346634095           Root MSE      =  .52198

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1069849   .0116513     9.18   0.000      .084112   .1298578
          IQ |   .0080269   .0015893     5.05   0.000     .0049068   .0111469
       exper |   .0435405   .0084242     5.17   0.000     .0270028   .0600783
       _cons |   -.228922   .2299876    -1.00   0.320    -.6804132   .2225692
------------------------------------------------------------------------------
EXAMPLE: Does ACT score help predict college GPA?

GPA1.DTA contains n = 141 MSU students from the mid-1990s. All variables are self-reported. Consider controlling for high school GPA:

colGPA = β0 + β1 hsGPA + β2 ACT + u

H0: β2 = 0

From the Stata output, β̂2 = β̂ACT = .0094 and tACT = .87. Even at the 10% level (c = 1.28), we cannot reject H0 against H1: βACT > 0.
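The failure to reject is immediate once the t statistic is compared with the 10% critical value. A sketch using the numbers from the Stata output:

```python
# The ACT t statistic from the reported coefficient and standard error
# (numbers taken from the Stata output for this regression).
b_act, se_act = 0.009426, 0.0107772
t_act = b_act / se_act
assert round(t_act, 2) == 0.87   # matches the Stata output

c10 = 1.28                       # 10% one-sided large-sample critical value
assert not (t_act > c10)         # fail to reject H0: beta_ACT = 0
```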
Does ACT score help predict college GPA?

Because we fail to reject H0: βACT = 0, we say that "β̂ACT is statistically insignificant at the 10% level against a one-sided alternative."

It is also very important to see that the estimated effect of ACT is small. Three more points (slightly more than one standard deviation) only predicts a colGPA that is .0094(3) ≈ .028 higher – not even three one-hundredths of a grade point.
Does ACT score help predict college GPA?

By contrast, β̂hsGPA = .453 is large in a practical sense – each point on hsGPA is associated with about .45 points on colGPA – and thsGPA = 4.73 is very large. No critical values in Table G.2 with df = 141 − 3 = 138 are even close to 4. So "β̂hsGPA is statistically significant" at very small significance levels.

Notice what happens if we do not control for hsGPA. The simple regression estimate is .0271 with tACT = 2.49. The magnitude is still pretty modest, but we would conclude it is statistically different from zero at the 1% significance level using the standard normal critical value, 2.33.
Does ACT score help predict college GPA?

It is not clear why ACT has such a small, statistically insignificant effect. The sample size is small and the scores were self-reported. The survey was done in a couple of economics courses, so it is not a random sample of all MSU students.
. des colGPA hsGPA ACT

              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------
colGPA          float  %9.0g                  MSU GPA
hsGPA           float  %9.0g                  high school GPA
ACT             byte   %9.0g                  'achievement' score

. sum colGPA hsGPA ACT

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      colGPA |       141    3.056738    .3723103        2.2          4
       hsGPA |       141    3.402128    .3199259        2.4          4
         ACT |       141    24.15603    2.844252         16         33
. reg colGPA hsGPA ACT

      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  2,   138) =   14.78
       Model |  3.42365506     2  1.71182753           Prob > F      =  0.0000
    Residual |  15.9824444   138  .115814814           R-squared     =  0.1764
-------------+------------------------------           Adj R-squared =  0.1645
       Total |  19.4060994   140  .138614996           Root MSE      =  .34032

------------------------------------------------------------------------------
      colGPA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hsGPA |   .4534559   .0958129     4.73   0.000     .2640047   .6429071
         ACT |    .009426   .0107772     0.87   0.383    -.0118838   .0307358
       _cons |   1.286328   .3408221     3.77   0.000      .612419   1.960237
------------------------------------------------------------------------------

. reg colGPA ACT

      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  1,   139) =    6.21
       Model |  .829558811     1  .829558811           Prob > F      =  0.0139
    Residual |  18.5765406   139  .133644177           R-squared     =  0.0427
-------------+------------------------------           Adj R-squared =  0.0359
       Total |  19.4060994   140  .138614996           Root MSE      =  .36557

------------------------------------------------------------------------------
      colGPA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ACT |    .027064   .0108628     2.49   0.014     .0055862   .0485417
       _cons |   2.402979   .2642027     9.10   0.000     1.880604   2.925355
------------------------------------------------------------------------------
For the negative one-sided alternative, H1: βj < 0, we use a symmetric rule, but the rejection rule is

t_β̂j < −c

where c > 0 is chosen in the same way as in the positive case. With df = 18 and a 5% test, the critical value is c = 1.734, so the rejection rule is

t_β̂j < −1.734
Now we must see a significantly negative value for the t statistic to reject H0: βj = 0 in favor of H1: βj < 0.
EXAMPLE: Does missing lectures affect final exam performance?

final = β0 + β1 missed + β2 priGPA + β3 ACT + u

H0: β1 = 0, H1: β1 < 0

We get β̂1 = −.079, t_β̂1 = −2.25. The 5% cv is −1.65 and the 1% cv is −2.33. So we reject H0 in favor of H1 at the 5% level but not at the 1% level. The effect is not huge: 10 missed lectures, out of 32, lowers the final exam score by about .8 points – not even one point.
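The negative one-sided decision can be traced through with the reported numbers. A sketch using the coefficient and standard error from the Stata output:

```python
# One-sided test against H1: beta_1 < 0 for missed (numbers taken from
# the Stata output for this regression).
b_missed, se_missed = -0.0793386, 0.0352349
t_missed = b_missed / se_missed
assert round(t_missed, 2) == -2.25   # matches the Stata output

# Negative one-sided alternative: reject H0 when t < -c.
assert t_missed < -1.65              # reject at the 5% level
assert not (t_missed < -2.33)        # but not at the 1% level
```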
. reg final missed priGPA ACT

      Source |       SS       df       MS              Number of obs =     680
-------------+------------------------------           F(  3,   676) =   56.79
       Model |  3032.09408     3  1010.69803           Prob > F      =  0.0000
    Residual |   12029.853   676  17.7956405           R-squared     =  0.2013
-------------+------------------------------           Adj R-squared =  0.1978
       Total |  15061.9471   679  22.1825435           Root MSE      =  4.2185

------------------------------------------------------------------------------
       final |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      missed |  -.0793386   .0352349    -2.25   0.025    -.1485216  -.0101556
      priGPA |   1.915294    .372614     5.14   0.000     1.183674   2.646914
         ACT |   .4010639   .0532268     7.54   0.000     .2965542   .5055736
       _cons |   12.37304   1.171961    10.56   0.000     10.07192   14.67416
------------------------------------------------------------------------------
If we do not control for ACT score, the effect of missed goes away. It turns out that missed and ACT are positively correlated: those with higher ACT scores miss more classes, on average.

. reg final missed priGPA

      Source |       SS       df       MS              Number of obs =     680
-------------+------------------------------           F(  2,   677) =   52.48
       Model |  2021.72415     2  1010.86207           Prob > F      =  0.0000
    Residual |  13040.2229   677  19.2617768           R-squared     =  0.1342
-------------+------------------------------           Adj R-squared =  0.1317
       Total |  15061.9471   679  22.1825435           Root MSE      =  4.3888

------------------------------------------------------------------------------
       final |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      missed |   .0172012   .0341483     0.50   0.615    -.0498481    .0842504
      priGPA |   3.237554   .3419779     9.47   0.000      2.56609    3.909019
       _cons |   17.41567   1.000942    17.40   0.000     15.45035      19.381
------------------------------------------------------------------------------
Reminder about Testing

Our hypotheses involve the unknown population values, βj. If in our data set we obtain, say, β̂j = 2.75, we do not write the null hypothesis as H0 : 2.75 = 0 (which is obviously false). Nor do we write H0 : β̂j = 0 (which is also false except in the very rare case that our estimate is exactly zero).
Testing Against Two-Sided Alternatives

We do not test hypotheses about the estimate! We know what it is once we collect the sample. We hypothesize about the unknown population value, βj.

Sometimes we do not know ahead of time whether a variable definitely has a positive effect or a negative effect. Even in the example

final = β0 + β1missed + β2priGPA + β3ACT + u

it is conceivable that missing class helps final exam performance. (The extra time is used for studying, say.)
Generally, the null and alternative are

H0 : βj = 0
H1 : βj ̸= 0

Testing against the two-sided alternative is usually the default. It prevents us from looking at the regression results and then deciding on the alternative. Also, it is harder to reject H0 against the two-sided alternative, so it requires more evidence that xj actually affects y.
Two-Sided Alternatives

Now we reject if β̂j is sufficiently large in magnitude, either positive or negative. We again use the t statistic tβ̂j = β̂j /se(β̂j ), but now the rejection rule is

|tβ̂j | > c

This results in a two-tailed test, and those are the critical values we pull from Table G.2. For example, if we use a 5% level test and df = 25, the two-tailed cv is 2.06. The two-tailed cv is, in this case, the 97.5th percentile in the t25 distribution. (Compare the one-tailed cv, about 1.71, the 95th percentile in the t25 distribution.)
Two-Sided Alternatives

[Figure: rejection regions for the two-tailed t test]
EXAMPLE: Factors affecting math pass rates (MEAP98.DTA)

Run a multiple regression of math4 on lunch, str, avgsal, enrol. A priori, we might expect lunch to have a negative effect (it is essentially a school-level poverty rate), str to have a negative effect, and avgsal to have a positive effect. But we can still test against a two-sided alternative to avoid specifying the alternative ahead of time. enrol is clearly ambiguous.
With 923 observations, we can use the standard normal critical values. For a 10% test it is 1.65, for 5%, 1.96, and for 1%, cv = 2.58.

. des math4 lunch str avgsal enrol

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
math4           byte   %9.0g                  pass rate, 4th grade math test
lunch           float  %9.0g                  % students eligible free lunch
str             float  %9.0g                  student-teacher ratio
avgsal          float  %9.0g                  average teacher salary
enrol           int    %9.0g                  school enrollment

. sum math4 lunch str avgsal enrol

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       math4 |       923    60.54713    19.71111          3        100
       lunch |       923    37.34231    26.21696          0      98.78
         str |       923    23.50704    3.755936        7.6       41.1
      avgsal |       923    47557.53    8577.373      13976      81045
       enrol |       923    403.5655    162.6491         18       1176
. reg math4 lunch str avgsal enrol

      Source |       SS       df       MS              Number of obs =     923
-------------+------------------------------           F(  4,   918) =   68.82
       Model |  82641.3258     4  20660.3315           Prob > F      =  0.0000
    Residual |  275581.374   918  300.197575           R-squared     =  0.2307
-------------+------------------------------           Adj R-squared =  0.2273
       Total |    358222.7   922  388.527874           Root MSE      =  17.326

------------------------------------------------------------------------------
       math4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lunch |  -.2911477   .0237168   -12.28   0.000    -.3376931   -.2446023
         str |  -.8354922   .1776196    -4.70   0.000     -1.18408   -.4869046
      avgsal |   .0003744    .000079     4.74   0.000     .0002194    .0005294
       enrol |   .0050858   .0036523     1.39   0.164     -.002082    .0122537
       _cons |   71.20066   4.302933    16.55   0.000     62.75593    79.64539
------------------------------------------------------------------------------

The variables lunch, str, and avgsal all have coefficients with the anticipated signs, and the absolute values of the t statistics are above 4. So we easily reject H0 : βj = 0 against H1 : βj ̸= 0. enrol is a different situation: tenrol = 1.39 < 1.65, so we fail to reject H0 at even the 10% significance level.
Functional form can make a difference. The math pass rates are capped at 100, so a diminishing effect in avgsal and enrol seems appropriate; these variables have lots of variation. So use the logarithm instead.

. reg math4 lunch str lavgsal lenrol

      Source |       SS       df       MS              Number of obs =     923
-------------+------------------------------           F(  4,   918) =   71.09
       Model |  84715.9491     4  21178.9873           Prob > F      =  0.0000
    Residual |  273506.751   918  297.937637           R-squared     =  0.2365
-------------+------------------------------           Adj R-squared =  0.2332
       Total |    358222.7   922  388.527874           Root MSE      =  17.261

------------------------------------------------------------------------------
       math4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lunch |  -.2886697   .0235046   -12.28   0.000    -.3347986   -.2425408
         str |  -.9549563   .1824296    -5.23   0.000    -1.312984   -.5969288
     lavgsal |   18.13305   3.605116     5.03   0.000     11.05782    25.20827
      lenrol |   2.622179   1.256434     2.09   0.037     .1563616    5.087996
       _cons |  -116.6793   37.28153    -3.13   0.002    -189.8462   -43.51239
------------------------------------------------------------------------------
Of course, all estimates change, but it is those on lavgsal and lenrol that are now much different. Before, we were measuring a dollar effect. But now, holding the other variables fixed,

∆math4 = (18.13/100)(%∆avgsal) = .1813(%∆avgsal)

So if, say, %∆avgsal = 10 – teacher salaries are 10 percent higher – math4 is estimated to increase by about 1.8 points.
Also,

∆math4 = (2.62/100)(%∆enrol) = .0262(%∆enrol)

so a 10% increase in enrollment is associated with a .26 point increase in math4. Notice how lenrol = log(enrol) is statistically significant at the 5% level: tlenrol = 2.09 > 1.96.
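The two level-log calculations above are simple enough to verify directly; a small Python sketch using the coefficients reported in the regression:

```python
# Level-log model: delta(math4) = (coef / 100) * (% change in x).
def level_log_effect(coef, pct_change):
    return (coef / 100) * pct_change

print(round(level_log_effect(18.13, 10), 2))  # 1.81 points for 10% higher avgsal
print(round(level_log_effect(2.62, 10), 3))   # 0.262 points for 10% higher enrol
```

These match the "about 1.8 points" and ".26 point" figures quoted on the slides.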
Reminder: When we report the results of, say, the second regression, it looks like

math4 = −116.68 − .289 lunch − .955 str + 18.13 lavgsal + 2.62 lenrol
        (37.28)   (.024)       (.182)     (3.61)          (1.26)

n = 923, R2 = .237

so that standard errors are below coefficients.
When we reject H0 : βj = 0 against H1 : βj ̸= 0, we often say that β̂j is statistically different from zero and usually mention a significance level. For example, if we can reject at the 1% level, we say that. If we can reject at the 10% level but not the 5%, we say that. As in the one-sided case, we also say β̂j is "statistically significant" when we can reject H0 : βj = 0.
Testing Other Hypotheses about the βj

Testing the null H0 : βj = 0 is by far the most common. That is why Stata and other regression packages automatically report the t statistic for this hypothesis. It is critical to remember that

tβ̂j = β̂j / se(β̂j )

is only for H0 : βj = 0.
What if we want to test a different null value? For example, in a constant-elasticity consumption function,

log(cons) = β0 + β1 log(inc) + β2famsize + β3pareduc + u

we might want to test

H0 : β1 = 1

which means an income elasticity equal to one. (We can be pretty sure that β1 > 0.)
More generally, suppose the null is

H0 : βj = aj

where we specify the value aj (usually zero, but, in the consumption example, aj = 1). It is easy to extend the t statistic:

t = (β̂j − aj ) / se(β̂j )

This t statistic just measures how far our estimate, β̂j, is from the hypothesized value, aj, relative to se(β̂j ).
A useful general expression for general t testing:

t = (estimate − hypothesized value) / standard error

The alternative can be one-sided or two-sided. We choose critical values in exactly the same way as before.
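The general recipe fits in a one-line helper; a minimal sketch (the numbers in the example calls are hypothetical, chosen only for illustration):

```python
# t = (estimate - hypothesized value) / standard error
def t_stat(estimate, hypothesized, se):
    return (estimate - hypothesized) / se

# Hypothetical numbers: beta_hat = 0.8, se = 0.1, testing H0: beta_j = 1.
print(round(t_stat(0.8, 1.0, 0.1), 2))   # -2.0

# The default H0: beta_j = 0 is just the special case hypothesized = 0.
print(round(t_stat(0.8, 0.0, 0.1), 2))   # 8.0
```

Note how the same estimate can be far from one null value and close to another; the critical values are then chosen exactly as before.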
The language needs to be suitably modified. If, for example,

H0 : βj = 1
H1 : βj ̸= 1

is rejected at the 5% level, we say "β̂j is statistically different from one at the 5% level." Otherwise, β̂j is "not statistically different from one." If the alternative is H1 : βj > 1, then "β̂j is statistically greater than one at the 5% level."
EXAMPLE: Crime and enrollment on college campuses (CAMPUS.DTA)

A simple regression model:

log(crime) = β0 + β1 log(enroll) + u

H0 : β1 = 1
H1 : β1 > 1

We get β̂1 = 1.27, and so a 1% increase in enrollment is estimated to increase crime by 1.27% (so more than 1%). Is this estimate statistically greater than one?
Crime and enrollment on college campuses (CAMPUS.DTA)

We cannot pull the t statistic off of the usual Stata output. We can compute it by hand (rounding the estimate and standard error):

t = (1.270 − 1)/.110 ≈ 2.45

(Note how this is much smaller than the t for H0 : β1 = 0, reported by Stata.) We have df = 97 − 2 = 95, so we use the df = 120 entry in Table G.2. The 1% cv for a one-sided alternative is about 2.36, so we reject at the 1% significance level.
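The by-hand calculation above, expressed in Python:

```python
# t statistic for H0: beta1 = 1 in the campus crime regression,
# using the rounded estimate and standard error from the slide.
t = (1.270 - 1) / 0.110
print(round(t, 2))           # 2.45

# One-sided 1% critical value (df = 120 row of Table G.2): about 2.36.
print(round(t, 2) > 2.36)    # True: reject H0 at the 1% level
```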
. reg lcrime lenroll

      Source |       SS       df       MS              Number of obs =      97
-------------+------------------------------           F(  1,    95) =  133.79
       Model |  107.083654     1  107.083654           Prob > F      =  0.0000
    Residual |  76.0358244    95  .800377098           R-squared     =  0.5848
-------------+------------------------------           Adj R-squared =  0.5804
       Total |  183.119479    96  1.90749457           Root MSE      =  .89464

------------------------------------------------------------------------------
      lcrime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lenroll |    1.26976    .109776    11.57   0.000     1.051827    1.487693
       _cons |   -6.63137    1.03354    -6.42   0.000    -8.683206   -4.579533
------------------------------------------------------------------------------
Alternatively, we can let Stata do the work using the lincom ("linear combination") command. Here the null is stated equivalently as H0 : β1 − 1 = 0.

. lincom lenroll - 1

 ( 1)  lenroll = 1

------------------------------------------------------------------------------
      lcrime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .2697603    .109776     2.46   0.016     .0518273    .4876932
------------------------------------------------------------------------------

The t = 2.46 is the more accurate calculation of the t statistic. The lincom lenroll - 1 command is Stata's way of saying "test whether βlenroll − 1 equals zero."
Computing p-Values for t Tests

The traditional approach to testing, where we choose a significance level ahead of time, can be cumbersome. Plus, it can conceal information. For example, suppose that, for testing against a two-sided alternative, a t statistic is just below the 5% cv. I could simply say that "I fail to reject H0 : βj = 0 against the two-sided alternative at the 5% level." But there is nothing sacred about 5%. Might I reject at, say, 6%?
Computing p-Values for t Tests

Rather than have to specify a level ahead of time, or discuss different traditional significance levels (10%, 5%, 1%), it is better to answer the following question: Given the observed value of the t statistic, what is the smallest significance level at which I can reject H0?

The smallest level at which the null can be rejected is known as the p-value of a test. It is a single number that automatically allows us to carry out the test at any level.
One way to think about the p-value is that it uses the observed statistic as the critical value, and then finds the significance level of the test using that critical value. It is most common to report p-values for two-sided alternatives. This is what Stata does. The t tables are not detailed enough.
For t testing against a two-sided alternative,

p-value = P(|T| > |t|)

where t is the value of the t statistic and T is a random variable with the tdf distribution. The p-value is a probability, so it is between zero and one. Perhaps the best way to think about p-values: the p-value is the probability of observing a statistic as extreme as we did if the null hypothesis is true.
So smaller p-values provide more evidence against the null. For example, if p-value = .50, then there is a 50% chance of observing a t as large as we did (in absolute value). This is not enough evidence against H0. If p-value = .001, then the chance of seeing a t statistic as extreme as we did is .1%. We can conclude either that we got a very rare sample – which is not helpful – or that the null hypothesis is very likely false.
From

p-value = P(|T| > |t|)

we see that as |t| increases the p-value decreases. Large absolute t statistics are associated with small p-values.

Suppose df = 40 and, from our data, we obtain t = 1.85 or t = −1.85. Then

p-value = P(|T| > 1.85) = 2P(T > 1.85) = 2(.0359) = .0718

where T ~ t40. Finding the actual numbers requires using Stata.
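Stata is not the only way to get the tail probability. As a rough cross-check, the t40 tail can be computed by numerically integrating the t density; a sketch using only the Python standard library (the grid size and upper cutoff are arbitrary choices of mine, fine enough for table-level accuracy):

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, upper=60.0, n=20000):
    """p-value = P(|T| > |t|): twice the tail area, via the trapezoid rule."""
    a = abs(t)
    h = (upper - a) / n
    s = 0.5 * (t_pdf(a, df) + t_pdf(upper, df))
    for i in range(1, n):
        s += t_pdf(a + i * h, df)
    return 2 * s * h

print(round(two_sided_p(1.85, 40), 3))   # 0.072 (the slide reports .0718)
```

In practice one would use Stata's ttail() or scipy.stats.t; the point is only that the p-value is nothing more than a tail area of the tdf density.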
Given the p-value, we can carry out a test at any significance level. If α is the chosen level, then

Reject H0 if p-value < α

For example, in the previous example we obtained p-value = .0718. This means that we reject H0 at the 10% level but not at the 5% level. We reject at 8% but not (quite) at 7%. Knowing p-value = .0718 is clearly much better than just saying "I fail to reject at the 5% level."
Computing p-Values for One-Sided Alternatives

Stata and other packages report the two-sided p-value. How can we get a one-sided p-value? With a caveat, the answer is simple:

one-sided p-value = (two-sided p-value)/2

We only want the area in one tail, not two tails. The two-sided p-value gives us the area in both tails.
This is the correct calculation when it is interesting to do the calculation. The caveat is simple: if the estimated coefficient is not in the direction of the alternative, the one-sided p-value is above .50, and so it is not an interesting calculation. In Stata, the two-sided p-values for H0 : βj = 0 are given in the column labeled P>|t|.
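Folding the caveat into code, a small helper (the function name and arguments are mine, not Stata's):

```python
# Convert a reported two-sided p-value into a one-sided p-value.
# If the estimate points in the direction of the alternative, halve it;
# otherwise the one-sided p-value is above .50.
def one_sided_p(two_sided, estimate, alternative_is_positive):
    points_toward_alt = (estimate > 0) == alternative_is_positive
    return two_sided / 2 if points_toward_alt else 1 - two_sided / 2

print(one_sided_p(0.017, 0.036, True))              # 0.0085: estimate agrees with H1
print(round(one_sided_p(0.017, 0.036, False), 4))   # 0.9915: estimate opposes H1
```

The first call reproduces the .017/2 = .0085 calculation used for points in the NBA example that follows.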
EXAMPLE: Factors Affecting NBA Salaries (NBASAL.DTA)

. des wage games mingame points rebounds assists

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
wage            float  %9.0g                  annual salary, thousands $
games           byte   %9.0g                  average games per year
mingame         float  %9.0g                  minutes per game
points          float  %9.0g                  points per game
rebounds        float  %9.0g                  rebounds per game
assists         float  %9.0g                  assists per game

. sum wage games mingame points rebounds assists

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        wage |       269    1423.828    999.7741        150       5740
       games |       269    65.72491    18.85111          3         82
     mingame |       269    23.97925    9.731177   2.888889   43.08537
      points |       269    10.21041    5.900667        1.2       29.8
    rebounds |       269    4.401115    2.892573         .5       17.3
     assists |       269    2.408922    2.092986          0       12.6
Factors Affecting NBA Salaries (NBASAL.DTA)

Use lwage = log(wage) to get constant percentage effects.

. reg lwage games mingame points rebounds assists

      Source |       SS       df       MS              Number of obs =     269
-------------+------------------------------           F(  5,   263) =   40.27
       Model |  90.2698185     5  18.0539637           Prob > F      =  0.0000
    Residual |  117.918945   263  .448361006           R-squared     =  0.4336
-------------+------------------------------           Adj R-squared =  0.4228
       Total |  208.188763   268  .776823743           Root MSE      =   .6696

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       games |   .0004132    .002682     0.15   0.878    -.0048679    .0056942
     mingame |   .0302278   .0130868     2.31   0.022     .0044597     .055996
      points |   .0363734   .0150945     2.41   0.017     .0066519    .0660949
    rebounds |   .0406795   .0229455     1.77   0.077    -.0045007    .0858597
     assists |   .0003665   .0314393     0.01   0.991    -.0615382    .0622712
       _cons |   5.648996   .1559075    36.23   0.000      5.34201    5.955982
------------------------------------------------------------------------------
Forgetting the intercept (or "constant"), none of the variables is statistically significant at the 1% level against a two-sided alternative. The closest is points, with p-value = .017. (The one-sided p-value is .017/2 = .0085 < .01, so it is significant at the 1% level against the positive one-sided alternative.)

mingame is statistically significant at the 5% level because p-value = .022 < .05.

rebounds is statistically significant at the 10% level (against a two-sided alternative) because p-value = .077 < .10, but not at the 5% level. But the one-sided p-value is .077/2 = .0385.
Both games and assists have very small t statistics, which lead to p-values close to one (for example, for assists, p-value = .991). These variables are statistically insignificant.

In some applications, p-values equal to zero up to three decimal places are not uncommon. We do not have to worry about statistical significance in such cases.
Using WAGE2.DTA:

. reg lwage educ IQ exper motheduc

      Source |       SS       df       MS              Number of obs =     759
-------------+------------------------------           F(  4,   754) =   54.26
       Model |  58.7293322     4   14.682333           Prob > F      =  0.0000
    Residual |  204.019312   754  .270582642           R-squared     =  0.2235
-------------+------------------------------           Adj R-squared =  0.2194
       Total |  262.748644   758  .346634095           Root MSE      =  .52018

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1006798   .0118813     8.47   0.000     .0773555     .124004
          IQ |     .00735   .0016068     4.57   0.000     .0041957    .0105043
       exper |   .0449386   .0084136     5.34   0.000     .0284217    .0614555
    motheduc |   .0239265   .0095623     2.50   0.013     .0051545    .0426985
       _cons |  -.3837064   .2373921    -1.62   0.106    -.8497344    .0823215
------------------------------------------------------------------------------
Language of Hypothesis Testing

If we do not reject H0 (against any alternative), it is better to say "we fail to reject H0" as opposed to "we accept H0," which is somewhat common. The reason is that many null hypotheses cannot be rejected in any application. For example, if I have β̂j = .75 and se(β̂j ) = .25, I do not say that I "accept H0 : βj = 1." I fail to reject because the t statistic is (.75 − 1)/.25 = −1. But the t statistic for H0 : βj = .5 is (.75 − .5)/.25 = 1, so I cannot reject H0 : βj = .5, either.
Clearly βj = .5 and βj = 1 cannot both be true. There is a single, unknown value in the population. So I should not "accept" either. The outcomes of the t tests tell us the data cannot reject either hypothesis. Nor can the data reject H0 : βj = .6, and so on. The data does reject H0 : βj = 0 (t = 3) at a pretty small significance level (if we have a reasonable df ).
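The arithmetic behind "fail to reject several nulls at once" is easy to verify; a sketch with the numbers from the slide:

```python
# beta_hat = .75, se = .25: t statistics for several null values.
beta_hat, se = 0.75, 0.25

for a in (1.0, 0.5, 0.0):
    t = (beta_hat - a) / se
    print(a, round(t, 2))
# H0: beta_j = 1   -> t = -1.0 (fail to reject)
# H0: beta_j = 0.5 -> t =  1.0 (fail to reject)
# H0: beta_j = 0   -> t =  3.0 (rejected at small significance levels)
```

All three nulls are tested with the same estimate; only the hypothesized value changes, which is exactly why "accept" is the wrong word.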
Practical versus Statistical Significance

t testing is purely about statistical significance. It does not directly speak to the issue of whether a variable has a practically, or economically, large effect.

Practical (Economic) Significance depends on the size (and sign) of β̂j.

Statistical Significance depends on tβ̂j.
It is possible to estimate practically large effects but have the estimates so imprecise that they are statistically insignificant. This is especially an issue with small data sets (but not only small data sets).

Even more importantly, it is possible to get estimates that are statistically significant – often with very small p-values – but are not practically large. This can happen with very large data sets.
EXAMPLE: Suppose that, using a large cross section data set for teenagers across the U.S., we estimate the elasticity of alcohol demand with respect to price to be −.013 with se = .002. Then the t statistic is −6.5, and we need look no further to conclude the elasticity is statistically different from zero. But the estimate means that, say, a 10% increase in the price of alcohol reduces demand by an estimated .13%. This is a small effect.

The bottom line: do not just fixate on t statistics! Interpreting the β̂j is just as important.
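The two numbers in the alcohol-demand example, checked in Python:

```python
# Estimated price elasticity of alcohol demand and its standard error.
elasticity, se = -0.013, 0.002

# Statistical significance: a very large t statistic.
t = elasticity / se
print(round(t, 1))          # -6.5

# Practical significance: the effect of a 10% price increase on demand.
print(round(elasticity * 10, 2))   # -0.13 (percent): tiny in economic terms
```

The contrast between the two printed numbers is the whole point of the slide.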
Confidence Intervals

Rather than just testing hypotheses about parameters, it is also useful to construct confidence intervals (also known as interval estimates). Loosely, the CI is supposed to give a "likely" range of values for the corresponding population parameter. We will only consider CIs of the form

β̂j ± c · se(β̂j )

where c > 0 is chosen based on the confidence level.
We will use a 95% confidence level, in which case c comes from the 97.5th percentile in the tdf distribution. In other words, c is the 5% critical value against a two-sided alternative. Stata automatically reports a 95% CI for each parameter, based on the t distribution using the appropriate df.
. reg lwage games mingame points rebounds assists

      Source |       SS       df       MS              Number of obs =     269
-------------+------------------------------           F(  5,   263) =   40.27
       Model |  90.2698185     5  18.0539637           Prob > F      =  0.0000
    Residual |  117.918945   263  .448361006           R-squared     =  0.4336
-------------+------------------------------           Adj R-squared =  0.4228
       Total |  208.188763   268  .776823743           Root MSE      =   .6696

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       games |   .0004132    .002682     0.15   0.878    -.0048679    .0056942
     mingame |   .0302278   .0130868     2.31   0.022     .0044597     .055996
      points |   .0363734   .0150945     2.41   0.017     .0066519    .0660949
    rebounds |   .0406795   .0229455     1.77   0.077    -.0045007    .0858597
     assists |   .0003665   .0314393     0.01   0.991    -.0615382    .0622712
       _cons |   5.648996   .1559075    36.23   0.000      5.34201    5.955982
------------------------------------------------------------------------------
Notice how the three estimates that are not statistically different from zero at the 5% level – games, rebounds, and assists – all have 95% CIs that include zero. For example, the 95% CI for βrebounds is

[−.0045, .0859]

By contrast, the 95% CI for βpoints is

[.0067, .0661]

which excludes zero.
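These intervals can be reproduced from the coefficient and standard-error columns; a sketch using c = 1.969, which I take as an approximation to the 97.5th percentile of the t distribution with 263 df (close to what Stata uses):

```python
# 95% CI: estimate +/- c * se, with c ~ 97.5th percentile of t_263 (assumed 1.969).
c = 1.969

def ci95(beta_hat, se):
    return (beta_hat - c * se, beta_hat + c * se)

lo, hi = ci95(0.0406795, 0.0229455)   # rebounds
print(round(lo, 4), round(hi, 4))     # -0.0045 0.0859: interval includes zero

lo, hi = ci95(0.0363734, 0.0150945)   # points
print(round(lo, 4), round(hi, 4))     # 0.0067 0.0661: interval excludes zero
```

Both match the intervals in the Stata output to four decimal places.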
A simple rule-of-thumb is useful for constructing a CI given the estimate and its standard error. For, say, df ≥ 60, an approximate 95% CI is

β̂j ± 2se(β̂j ), that is, [β̂j − 2se(β̂j ), β̂j + 2se(β̂j )]

That is, subtract and add twice the standard error to the estimate. (In the case of the standard normal, the 2 becomes 1.96.)
Properly interpreting a CI is a bit tricky. One often sees statements such as "there is a 95% chance that βpoints is in the interval [.0067, .0661]." This is incorrect. βpoints is some fixed value, and it either is or is not in the interval.

The correct way to interpret a CI is to remember that the endpoints, β̂j − c · se(β̂j ) and β̂j + c · se(β̂j ), change with each sample (or at least can change). That is, the endpoints are random outcomes that depend on the data we draw.
What a 95% CI means is that for 95% of the random samples that we draw from the population, the interval we compute using the rule β̂j ± c · se(β̂j ) will include the value βj. But for a particular sample we do not know whether βj is in the interval.

This is similar to the idea that unbiasedness of β̂j does not mean that β̂j = βj. Most of the time β̂j is not βj. Unbiasedness means E(β̂j ) = βj.
CIs and Hypothesis Testing

If we have constructed a 95% CI for, say, βj, we can test any null value against a two-sided alternative, at the 5% level. So

H0 : βj = aj
H1 : βj ̸= aj

1. If aj is in the 95% CI, then we fail to reject H0 at the 5% level.
2. If aj is not in the 95% CI, then we reject H0 in favor of H1 at the 5% level.
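The duality between CIs and two-sided tests fits in two lines; a sketch using the (rounded) 95% CI for the IQ coefficient from the WAGE2 regression discussed in this section:

```python
# Two-sided 5%-level test via the 95% CI: reject H0: beta_j = a
# exactly when a lies outside the interval.
def reject_at_5pct(ci_low, ci_high, a):
    return not (ci_low <= a <= ci_high)

ci_iq = (0.0042, 0.0105)   # 95% CI for beta_IQ, rounded from the Stata output

print(reject_at_5pct(*ci_iq, 0.0))    # True: reject H0: beta_IQ = 0
print(reject_at_5pct(*ci_iq, 0.01))   # False: cannot reject H0: beta_IQ = .01
```

One interval thus answers infinitely many two-sided tests at the 5% level.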
Note that, measured as percents,

significance level = 100 − confidence level

. reg lwage educ IQ exper motheduc

      Source |       SS       df       MS              Number of obs =     759
-------------+------------------------------           F(  4,   754) =   54.26
       Model |  58.7293322     4   14.682333           Prob > F      =  0.0000
    Residual |  204.019312   754  .270582642           R-squared     =  0.2235
-------------+------------------------------           Adj R-squared =  0.2194
       Total |  262.748644   758  .346634095           Root MSE      =  .52018

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1006798   .0118813     8.47   0.000     .0773555     .124004
          IQ |     .00735   .0016068     4.57   0.000     .0041957    .0105043
       exper |   .0449386   .0084136     5.34   0.000     .0284217    .0614555
    motheduc |   .0239265   .0095623     2.50   0.013     .0051545    .0426985
       _cons |  -.3837064   .2373921    -1.62   0.106    -.8497344    .0823215
------------------------------------------------------------------------------
The 95% CI for βIQ is about [.0042, .0105]. So we can reject H0 : βIQ = 0 against the two-sided alternative at the 5% level. We cannot reject H0 : βIQ = .01 (although it is close). We can reject a return to schooling of 7.5% as being too low, and we can reject 12.5% as too high.

Just as with hypothesis testing, these CIs are only as good as the underlying assumptions. If we have omitted key variables, the β̂j are biased. If the error variance is not constant, the standard errors are improperly computed. With df = 754, we will see later that normality is not very important. But normality is needed for these CIs to be exact 95% CIs.
Testing Single Linear Restrictions

So far, we have discussed testing hypotheses that involve only one parameter, βj. But some hypotheses involve more than one parameter.

EXAMPLE: Are the Returns to a Year of Junior College the Same as for a Four-Year University? (COLLEGE.DTA). Sample of high school graduates.

lwage = β0 + β1jc + β2univ + β3exper + u

H0 : β1 = β2
H1 : β1 < β2
We could use a two-sided alternative, too. We can also write

H0 : β1 − β2 = 0

Remember the general way to construct a t statistic:

t = (estimate − hypothesized value) / standard error
• 99. Given the OLS estimates β̂1 and β̂2, t = (β̂1 − β̂2)/se(β̂1 − β̂2). Problem: The OLS output gives us β̂1 and β̂2 and their standard errors, but that is not enough to obtain se(β̂1 − β̂2). Recall a fact about variances: Var(β̂1 − β̂2) = Var(β̂1) + Var(β̂2) − 2Cov(β̂1, β̂2) Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 99 / 104
• 100. The standard error is an estimate of the square root: se(β̂1 − β̂2) = {[se(β̂1)]² + [se(β̂2)]² − 2s12}^(1/2), where s12 is an estimate of Cov(β̂1, β̂2). This is the piece we are missing. Stata will report s12 if we ask, but calculating se(β̂1 − β̂2) by hand is cumbersome. There is also a trick of rewriting the model (see text, Section 4.4). These days, it is easiest to use a command for testing linear functions of the coefficients. In Stata, it is lincom. Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 100 / 104
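Given s12, the computation is mechanical. A sketch using the jc and univ estimates from the COLLEGE.DTA regression that follows; the value of s12 used here is not printed by the regression output, so it is backed out to be consistent with the standard error that lincom reports below:

```python
import math

# Coefficients and standard errors for jc and univ (COLLEGE.DTA regression)
b1, b2 = 0.0661471, 0.0836956
se1, se2 = 0.0202773, 0.0068935

# s12 estimates Cov(b1_hat, b2_hat); this value is an assumption backed out
# from lincom's reported se of .0206407, since reg alone does not show it
s12 = 1.633e-05

se_diff = math.sqrt(se1**2 + se2**2 - 2 * s12)
t = (b1 - b2) / se_diff
print(round(se_diff, 4), round(t, 2))   # roughly .0206 and -0.85
```

Note that ignoring s12 (treating the estimates as uncorrelated) would give the wrong standard error here, since the covariance is positive; this is exactly why lincom is the convenient route.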
• 101.

. des lwage jc univ exper

              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------
lwage           float   %9.0g                 log(wage)
jc              float   %9.0g                 total 2-year credits
univ            float   %9.0g                 total 4-year credits
exper           float   %8.0g                 work experience, years

. sum lwage jc univ exper

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       lwage |       750    2.233674    .4906276   .6931472   3.901973
          jc |       750    .3449006    .7731012          0   3.833333
        univ |       750    1.817076    2.276202          0        7.5
       exper |       750    10.26722    2.713302        .25   13.83333

Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 101 / 104
• 102.

. reg lwage jc univ exper

      Source |       SS       df       MS              Number of obs =     750
-------------+------------------------------           F(  3,   746) =   86.25
       Model |  46.4300797     3  15.4766932           Prob > F      =  0.0000
    Residual |   133.86575   746  .179444705           R-squared     =  0.2575
-------------+------------------------------           Adj R-squared =  0.2545
       Total |  180.295829   749  .240715393           Root MSE      =  .42361

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          jc |   .0661471   .0202773     3.26   0.001     .0263398    .1059544
        univ |   .0836956   .0068935    12.14   0.000     .0701626    .0972287
       exper |   .0653706   .0057692    11.33   0.000     .0540448    .0766964
       _cons |   1.387603   .0636388    21.80   0.000     1.262671    1.512536
------------------------------------------------------------------------------

Note that β̂jc − β̂univ = .0661 − .0837 = −.0176, so the estimated return to a year at univ is about 1.8% higher. But is the difference statistically significant?

Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 102 / 104
• 104.

. lincom jc - univ

 ( 1)  jc - univ = 0

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.0175485   .0206407    -0.85   0.395    -.0580694    .0229723
------------------------------------------------------------------------------

The two-sided p-value is .395, which means the one-sided p-value is .1975. Even against a one-sided alternative, we cannot reject H0 : βjc = βuniv at even the 20% level. Note how much more variation there is in univ compared with jc.

Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 104 / 104
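Halving the two-sided p-value to get the one-sided one relies on the symmetry of the t distribution and on the estimate having the sign H1 predicts (here β̂jc − β̂univ < 0, matching H1 : β1 < β2). A sketch confirming the numbers, with the t statistic from lincom and df = 746:

```python
from scipy.stats import t

# t statistic and df from the lincom output (df = 750 - 4)
tstat = -0.0175485 / 0.0206407
df = 746

p_two = 2 * t.sf(abs(tstat), df)   # two-sided p-value, close to Stata's .395
p_one = p_two / 2                  # one-sided, since the estimate has H1's sign
print(round(p_two, 3), round(p_one, 4))
```

If the estimate had come out with the opposite sign from H1, the one-sided p-value would instead be 1 − p_two/2, and H0 could never be rejected in favor of that alternative.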
• 105. Of course, nothing changes (except the sign of the estimate) if we use βuniv − βjc:

. lincom univ - jc

 ( 1)  - jc + univ = 0

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .0175485   .0206407     0.85   0.395    -.0229723    .0580694
------------------------------------------------------------------------------

Liu, H (NUS) Multiple Regression Analysis: Statistical Inference: I August 21, 2022 105 / 104