The Logit Model:
Estimation, Testing and Interpretation
Herman J. Bierens
October 25, 2008

1 Introduction to maximum likelihood estimation

1.1 The likelihood function

Consider a random sample Y_1, ..., Y_n from the Bernoulli distribution:

Pr[Y_j = 1] = p_0, \quad Pr[Y_j = 0] = 1 - p_0,

where p_0 is unknown. For example, toss a coin n times which you suspect is unfair: p_0 ≠ 0.5, and for each toss j assign Y_j = 1 if the outcome is heads and Y_j = 0 if the outcome is tails. The question is how to estimate p_0 and how to test the null hypothesis that the coin is fair: p_0 = 0.5.

The probability function involved can be written as

f(y|p_0) = Pr[Y_j = y] = p_0^y (1 - p_0)^{1-y} = \begin{cases} p_0 & \text{if } y = 1, \\ 1 - p_0 & \text{if } y = 0. \end{cases}

Next, let y_1, ..., y_n be a given sequence of zeros and ones. Thus, each y_j is either 0 or 1. The joint probability function of the random sample Y_1, Y_2, ..., Y_n is defined as

f_n(y_1, ..., y_n|p_0) = Pr[Y_1 = y_1 and Y_2 = y_2 and ... and Y_n = y_n].

Because the random variables Y_1, Y_2, ..., Y_n are independent, we can write

Pr[Y_1 = y_1 and Y_2 = y_2 and ... and Y_n = y_n]
  = Pr[Y_1 = y_1] \times Pr[Y_2 = y_2] \times ... \times Pr[Y_n = y_n]
  = f(y_1|p_0) \times f(y_2|p_0) \times ... \times f(y_n|p_0)
  = \prod_{j=1}^{n} f(y_j|p_0),

hence

f_n(y_1, ..., y_n|p_0) = \prod_{j=1}^{n} p_0^{y_j} (1 - p_0)^{1-y_j}
  = \left( \prod_{j=1}^{n} p_0^{y_j} \right) \left( \prod_{j=1}^{n} (1 - p_0)^{1-y_j} \right)
  = p_0^{\sum_{j=1}^{n} y_j} (1 - p_0)^{n - \sum_{j=1}^{n} y_j}.

Replacing the given non-random sequence y_1, ..., y_n by the random sample Y_1, Y_2, ..., Y_n and the unknown probability p_0 by a variable p in the interval (0, 1) yields the likelihood function

L_n(p) = f_n(Y_1, ..., Y_n|p) = p^{\sum_{j=1}^{n} Y_j} (1 - p)^{n - \sum_{j=1}^{n} Y_j}.

For the case p = p_0 the likelihood function can be interpreted as the joint probability that we draw this particular sample Y_1, ..., Y_n.
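
As a concrete illustration (not part of the original note), the likelihood can be coded in a few lines of Python; NumPy is assumed, and the function name is ours:

    import numpy as np

    def likelihood(p, y):
        """L_n(p) = p^(sum of y_j) * (1 - p)^(n - sum of y_j), for a 0-1 array y."""
        s = y.sum()
        return p ** s * (1.0 - p) ** (len(y) - s)

    y = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])  # ten tosses, seven heads
    print(likelihood(0.5, y))                      # 0.5^10, about 0.00098
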

1.2 Maximum likelihood estimation

The idea of maximum likelihood (ML) estimation is now to choose p such that L_n(p) is maximal. In other words, choose p such that the probability of drawing this particular sample Y_1, ..., Y_n is maximal.
Note that maximizing L_n(p) is equivalent to maximizing \ln(L_n(p)), i.e.,

\ln(L_n(p)) = \left( \sum_{j=1}^{n} Y_j \right) \ln(p) + \left( n - \sum_{j=1}^{n} Y_j \right) \ln(1 - p)
  = n \left( \bar{Y} \ln(p) + (1 - \bar{Y}) \ln(1 - p) \right),

where

\bar{Y} = \frac{1}{n} \sum_{j=1}^{n} Y_j

is the sample mean. Therefore, the ML estimator \hat{p} in this case can be obtained from the first-order condition for a maximum of \ln(L_n(p)) in p = \hat{p}:

0 = \frac{d \ln(L_n(\hat{p}))}{d\hat{p}} = n \left( \bar{Y} \, \frac{d \ln(\hat{p})}{d\hat{p}} + (1 - \bar{Y}) \, \frac{d \ln(1 - \hat{p})}{d\hat{p}} \right)
  = n \left( \bar{Y} \, \frac{d \ln(\hat{p})}{d\hat{p}} + (1 - \bar{Y}) \, \frac{d \ln(1 - \hat{p})}{d(1 - \hat{p})} \times \frac{d(1 - \hat{p})}{d\hat{p}} \right)
  = n \left( \bar{Y} \, \frac{1}{\hat{p}} + (1 - \bar{Y}) \, \frac{1}{1 - \hat{p}} \times (-1) \right)
  = n \left( \frac{\bar{Y}}{\hat{p}} - \frac{1 - \bar{Y}}{1 - \hat{p}} \right)
  = n \left( \frac{\bar{Y}(1 - \hat{p}) - \hat{p}(1 - \bar{Y})}{\hat{p}(1 - \hat{p})} \right)
  = n \, \frac{\bar{Y} - \hat{p}}{\hat{p}(1 - \hat{p})},
where we have used the fact that d\ln(x)/dx = 1/x. Thus, in this case the ML estimator \hat{p} of p_0 is the sample mean:

\hat{p} = \bar{Y}.

Note that this is an unbiased estimator: E(\hat{p}) = \frac{1}{n} \sum_{j=1}^{n} E(Y_j) = p_0.
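
The first-order condition can be checked numerically: maximizing \ln(L_n(p)) over a fine grid of p values returns the sample mean. A minimal sketch, assuming NumPy and an illustrative sample:

    import numpy as np

    y = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])   # ten tosses, seven heads
    grid = np.linspace(0.001, 0.999, 999)
    loglik = y.sum() * np.log(grid) + (len(y) - y.sum()) * np.log(1.0 - grid)
    print(grid[np.argmax(loglik)], y.mean())        # both 0.7
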

1.3 Large sample statistical inference

It can be shown (but this requires advanced probability theory) that if the sample size n is large then \sqrt{n}(\hat{p} - p_0) is approximately normally distributed, i.e.,

\sqrt{n}(\hat{p} - p_0) = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} (Y_j - p_0) \sim N[0, \sigma_0^2],

where

\sigma_0^2 = var(Y_j) = E[(Y_j - p_0)^2] = (1 - p_0)^2 p_0 + (-p_0)^2 (1 - p_0) = p_0(1 - p_0).

Thus, for large sample size n,

\frac{\sqrt{n}(\hat{p} - p_0)}{\sqrt{p_0(1 - p_0)}} \sim N[0, 1].   (1)

This result can be used to test hypotheses about p_0. In particular, under the null hypothesis that the coin is fair, p_0 = 0.5, we have

2\sqrt{n}(\hat{p} - 0.5) = \frac{\sqrt{n}(\hat{p} - 0.5)}{\sqrt{0.5 \times 0.5}} \sim N[0, 1].

Therefore, 2\sqrt{n}(\hat{p} - 0.5) can be used as the test statistic of the standard normal test of the null hypothesis p_0 = 1/2, as follows. Recall that for a standard normal random variable U, Pr[|U| > 1.96] = 0.05. Thus, under the null hypothesis p_0 = 1/2 one would expect that

Pr[|2\sqrt{n}(\hat{p} - 0.5)| > 1.96] = 0.05,
Pr[|2\sqrt{n}(\hat{p} - 0.5)| \leq 1.96] = 0.95.

If |2\sqrt{n}(\hat{p} - 0.5)| > 1.96 then we reject the null hypothesis p_0 = 1/2 at the 5% significance level, because this is not what one would expect if the null hypothesis is true, and if |2\sqrt{n}(\hat{p} - 0.5)| \leq 1.96 then we accept this null hypothesis, as this result is then in accordance with the null hypothesis p_0 = 1/2.
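
As an illustration (a sketch with an invented sample; the helper name is ours, and NumPy is assumed), the test can be coded as:

    import numpy as np

    def coin_test(y):
        """Standard normal test of H0: p0 = 0.5; returns the statistic and the 5% decision."""
        z = 2.0 * np.sqrt(len(y)) * (np.mean(y) - 0.5)
        return z, abs(z) > 1.96

    y = np.concatenate([np.ones(65), np.zeros(35)])  # 100 tosses, 65 heads
    z, reject = coin_test(y)
    print(z, reject)                                 # 3.0, True: reject fairness at 5%
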
The result (1) can also be used to endow the unknown probability p_0 with a confidence interval, for example the 95% confidence interval, as follows. The result (1) implies

Pr\left[ \left| \frac{\sqrt{n}(\hat{p} - p_0)}{\sqrt{p_0(1 - p_0)}} \right| \leq 1.96 \right] = 0.95,

which, after some straightforward calculations, can be shown to be equivalent to

Pr[\underline{p}_n \leq p_0 \leq \overline{p}_n] = 0.95,

where

\underline{p}_n = \frac{n\hat{p} + (1.96)^2/2 - 1.96\sqrt{n\hat{p}(1 - \hat{p}) + (1.96)^2/4}}{n + (1.96)^2},
\overline{p}_n = \frac{n\hat{p} + (1.96)^2/2 + 1.96\sqrt{n\hat{p}(1 - \hat{p}) + (1.96)^2/4}}{n + (1.96)^2}.

The interval [\underline{p}_n, \overline{p}_n] is now the 95% confidence interval for p_0.

1.4 An application: election polls

Consider a presidential election with two candidates, candidate A and candidate B, and let p_0 be the fraction of likely voters who favor candidate A just before the election is held. To predict the outcome of the election, a polling agency draws a random sample of size n = 3000, for example, from the population of likely voters.¹ Suppose that 1800 of the respondents express a preference for candidate A. Thus, the fraction of respondents favoring candidate A is \hat{p} = 0.6. Substituting n = 3000 and \hat{p} = 0.6 in the formulas for \underline{p}_n and \overline{p}_n yields

\underline{p}_n = 0.58, \quad \overline{p}_n = 0.62.

Thus, the 95% confidence interval of 100 × p_0 is [58, 62]. The polling results are therefore stated as: 60% of the likely voters will vote for candidate A, with a margin of error of ±2 points.

¹ How to draw such a sample is beyond the scope of this lecture note.
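
These bounds are easy to reproduce; a minimal Python sketch, assuming NumPy, with the polling numbers above:

    import numpy as np

    def conf_bounds(p_hat, n, z=1.96):
        """Lower and upper 95% confidence bounds for p0 implied by result (1)."""
        half = z * np.sqrt(n * p_hat * (1.0 - p_hat) + z ** 2 / 4.0)
        center = n * p_hat + z ** 2 / 2.0
        return (center - half) / (n + z ** 2), (center + half) / (n + z ** 2)

    lo, hi = conf_bounds(0.6, 3000)
    print(round(lo, 2), round(hi, 2))   # 0.58 0.62
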

2 Motivation for maximum likelihood estimation

A more formal motivation for ML estimation is based on the fact that for 0 < x < 1 and x > 1,

\ln(x) < x - 1.

This is illustrated in the following figure:

[Figure: the graphs of \ln(x) and x - 1, illustrating that \ln(x) \leq x - 1.]

The inequality \ln(x) < x - 1 is strict for x ≠ 1, and \ln(1) = 0. Consequently, taking x = f(Y_j|p)/f(Y_j|p_0), we have the inequality

\ln\left( \frac{f(Y_j|p)}{f(Y_j|p_0)} \right) \leq \frac{f(Y_j|p)}{f(Y_j|p_0)} - 1.

Taking expectations, it follows that

E\left[ \ln\left( \frac{f(Y_j|p)}{f(Y_j|p_0)} \right) \right] \leq E\left[ \frac{f(Y_j|p)}{f(Y_j|p_0)} \right] - 1
  = \frac{f(1|p)}{f(1|p_0)} Pr[Y_j = 1] + \frac{f(0|p)}{f(0|p_0)} Pr[Y_j = 0] - 1
  = \frac{p}{p_0} p_0 + \frac{1 - p}{1 - p_0} (1 - p_0) - 1
  = p + 1 - p - 1 = 0,   (2)

hence

E[\ln(f(Y_j|p))] - E[\ln(f(Y_j|p_0))] = E\left[ \ln\left( \frac{f(Y_j|p)}{f(Y_j|p_0)} \right) \right] \leq 0,

and therefore,

E[\ln(L_n(p))] \leq E[\ln(L_n(p_0))].   (3)

Thus, E[\ln(L_n(p))] is maximal for p = p_0, and it can be shown that this maximum is unique.
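
This can be verified numerically: in the Bernoulli case, E[\ln(f(Y_j|p))] = p_0 \ln(p) + (1 - p_0) \ln(1 - p), and a grid search locates its maximum at p = p_0. A sketch, assuming NumPy and an illustrative p_0:

    import numpy as np

    p0 = 0.3                                      # illustrative true probability
    p = np.linspace(0.001, 0.999, 999)
    expected_loglik = p0 * np.log(p) + (1.0 - p0) * np.log(1.0 - p)
    print(p[np.argmax(expected_loglik)])          # 0.3: the maximum sits at p0
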
3 Maximum likelihood estimation of the Logit model

3.1 The Logit model with one explanatory variable

Next, let (Y_1, X_1), ..., (Y_n, X_n) be a random sample from the conditional Logit distribution:

Pr[Y_j = 1|X_j] = \frac{1}{1 + \exp(-\alpha_0 - \beta_0 X_j)},
Pr[Y_j = 0|X_j] = 1 - Pr[Y_j = 1|X_j] = \frac{\exp(-\alpha_0 - \beta_0 X_j)}{1 + \exp(-\alpha_0 - \beta_0 X_j)},   (4)

where the X_j's are the explanatory variables and \alpha_0 and \beta_0 are unknown parameters to be estimated. This model is called a Logit model, because

Pr[Y_j = 1|X_j] = F(\alpha_0 + \beta_0 X_j),   (5)

where

F(x) = \frac{1}{1 + \exp(-x)}   (6)

is the distribution function of the logistic (Logit) distribution.

The conditional probability function involved is

f(y|X_j, \alpha_0, \beta_0) = Pr[Y_j = y|X_j] = F(\alpha_0 + \beta_0 X_j)^y (1 - F(\alpha_0 + \beta_0 X_j))^{1-y}
  = \begin{cases} F(\alpha_0 + \beta_0 X_j) & \text{if } y = 1, \\ 1 - F(\alpha_0 + \beta_0 X_j) & \text{if } y = 0. \end{cases}

Now the conditional log-likelihood function is

\ln(L_n(\alpha, \beta)) = \sum_{j=1}^{n} \ln(f(Y_j|X_j, \alpha, \beta))
  = \sum_{j=1}^{n} Y_j \ln(F(\alpha + \beta X_j)) + \sum_{j=1}^{n} (1 - Y_j) \ln(1 - F(\alpha + \beta X_j))
  = -\sum_{j=1}^{n} (1 - Y_j)(\alpha + \beta X_j) - \sum_{j=1}^{n} \ln(1 + \exp(-\alpha - \beta X_j)).   (7)

Similar to (3) we have

E[\ln(L_n(\alpha, \beta))|X_1, ..., X_n] \leq E[\ln(L_n(\alpha_0, \beta_0))|X_1, ..., X_n].

Again, this result motivates estimating \alpha_0 and \beta_0 by maximizing \ln(L_n(\alpha, \beta)) with respect to \alpha and \beta:

\ln(L_n(\hat{\alpha}, \hat{\beta})) = \max_{\alpha, \beta} \ln(L_n(\alpha, \beta)).

However, there is no longer an explicit solution for \hat{\alpha} and \hat{\beta}. These ML estimators have to be computed numerically. Your econometrics software will do that for you.
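
To make this concrete, here is a minimal Python sketch (assuming NumPy and SciPy; all names and the true values \alpha_0 = -0.5, \beta_0 = 1 are illustrative, not part of the note) that codes the log-likelihood (7) and maximizes it numerically on simulated data:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    prob = 1.0 / (1.0 + np.exp(0.5 - x))        # Pr[Y=1|X] with alpha0 = -0.5, beta0 = 1
    y = (rng.uniform(size=500) < prob).astype(float)

    def neg_loglik(theta):
        """Minus the log-likelihood (7); np.log1p(u) computes ln(1 + u)."""
        alpha, beta = theta
        z = alpha + beta * x
        return np.sum((1.0 - y) * z) + np.sum(np.log1p(np.exp(-z)))

    res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
    print(res.x)                                 # ML estimates, close to (-0.5, 1.0)
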

3.2 Pseudo t-values

It can be shown that if the sample size n is large then

\sqrt{n}(\hat{\alpha} - \alpha_0) \sim N(0, \sigma_\alpha^2), \quad \sqrt{n}(\hat{\beta} - \beta_0) \sim N(0, \sigma_\beta^2).

Given consistent estimators \hat{\sigma}_\alpha^2 and \hat{\sigma}_\beta^2 of the unknown variances \sigma_\alpha^2 and \sigma_\beta^2, respectively (which are computed by your econometrics software), we then have

\frac{\sqrt{n}(\hat{\alpha} - \alpha_0)}{\hat{\sigma}_\alpha} \sim N(0, 1), \quad \frac{\sqrt{n}(\hat{\beta} - \beta_0)}{\hat{\sigma}_\beta} \sim N(0, 1).

These results can be used to test whether the coefficients \alpha_0 and \beta_0 are zero or not. In particular the null hypothesis \beta_0 = 0 is of interest, because this hypothesis implies that the conditional probability Pr[Y_j = 1|X_j] does not depend on X_j. Under the null hypothesis \beta_0 = 0 we have

\hat{t}_\beta = \frac{\sqrt{n}\hat{\beta}}{\hat{\sigma}_\beta} \sim N(0, 1).

Recall that the 5% critical value of the two-sided standard normal test is 1.96. Thus, for example, the null hypothesis \beta_0 = 0 is rejected at the 5% significance level in favor of the alternative hypothesis \beta_0 ≠ 0 if |\hat{t}_\beta| > 1.96, and accepted if |\hat{t}_\beta| \leq 1.96.
The statistic \hat{t}_\beta is called the pseudo t-value of \hat{\beta} because it is used in the same way as the t-value in linear regression, and \hat{\sigma}_\beta is called the standard error of \hat{\beta}. Your econometric software will report the ML estimators together with their corresponding pseudo t-values and/or standard errors.
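
For example, the statsmodels package in Python reports all of this (a sketch on simulated data, not the software the note refers to; note that statsmodels reports the standard error of \hat{\beta} itself, i.e. \hat{\sigma}_\beta/\sqrt{n}, so its t-value \hat{\beta}/se coincides with the pseudo t-value \sqrt{n}\hat{\beta}/\hat{\sigma}_\beta above):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(0.5 - x))).astype(float)

    res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
    print(res.params)    # ML estimates (alpha_hat, beta_hat)
    print(res.bse)       # standard errors
    print(res.tvalues)   # pseudo t-values: |t| > 1.96 rejects beta0 = 0 at the 5% level
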

3.3 The general Logit model

The general Logit model takes the form

Pr[Y_j = 1|X_{1j}, ..., X_{kj}] = \frac{1}{1 + \exp(-\beta_1^0 X_{1j} - ... - \beta_k^0 X_{kj})} = \frac{1}{1 + \exp\left(-\sum_{i=1}^{k} \beta_i^0 X_{ij}\right)},   (8)

where one of the X_{ij} equals 1 for the constant term (for example, let X_{kj} = 1), and the \beta_i^0's are the true parameter values. This model can be estimated by ML in the same way as before. Thus, the log-likelihood function is

\ln(L_n(\beta_1, ..., \beta_k)) = -\sum_{j=1}^{n} (1 - Y_j) \sum_{i=1}^{k} \beta_i X_{ij} - \sum_{j=1}^{n} \ln\left(1 + \exp\left(-\sum_{i=1}^{k} \beta_i X_{ij}\right)\right),   (9)

and the ML estimators \hat{\beta}_1, ..., \hat{\beta}_k are obtained by maximizing \ln(L_n(\beta_1, ..., \beta_k)):

\ln(L_n(\hat{\beta}_1, ..., \hat{\beta}_k)) = \max_{\beta_1, ..., \beta_k} \ln(L_n(\beta_1, ..., \beta_k)).

Again, it can be shown that if n is large then for i = 1, ..., k,

\sqrt{n}(\hat{\beta}_i - \beta_i^0) \sim N[0, \sigma_i^2].

Given consistent estimators \hat{\sigma}_i^2 of the variances \sigma_i^2, it follows then that

\frac{\sqrt{n}(\hat{\beta}_i - \beta_i^0)}{\hat{\sigma}_i} \sim N[0, 1]

for i = 1, ..., k. Your econometrics software will report the ML estimators \hat{\beta}_i together with their corresponding pseudo t-values \hat{t}_i = \sqrt{n}\hat{\beta}_i/\hat{\sigma}_i and/or standard errors \hat{\sigma}_i.
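
In matrix form, with X the n × k matrix of regressors (last column the constant), the general log-likelihood (9) is a two-liner; a sketch assuming NumPy:

    import numpy as np

    def general_logit_loglik(beta, y, X):
        """ln(L_n(beta_1, ..., beta_k)) as in (9); X has shape (n, k)."""
        z = X @ beta
        return -np.sum((1.0 - y) * z) - np.sum(np.log1p(np.exp(-z)))
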

3.4 Testing joint significance

Now suppose you want to test the joint null hypothesis

H_0: \beta_1^0 = 0, \beta_2^0 = 0, ..., \beta_m^0 = 0,   (10)

where m < k.
There are two ways to do that. One way is akin to the F test in linear regression: re-estimate the Logit model under the null hypothesis,

\ln(L_n(0, ..., 0, \tilde{\beta}_{m+1}, ..., \tilde{\beta}_k)) = \max_{\beta_{m+1}, ..., \beta_k} \ln(L_n(0, ..., 0, \beta_{m+1}, ..., \beta_k)),

and compare the log-likelihoods.² It can be shown that under the null hypothesis (10) and for large samples,

LR_m = -2 \ln\left( \frac{L_n(0, ..., 0, \tilde{\beta}_{m+1}, ..., \tilde{\beta}_k)}{L_n(\hat{\beta}_1, ..., \hat{\beta}_k)} \right) \sim \chi_m^2,

where the degrees of freedom m corresponds to the number of restrictions imposed under the null hypothesis. This is the so-called likelihood ratio test, which is conducted right-sided. For example, choose the 5% significance level and look up in the table of the \chi_m^2 distribution the critical value c such that for a \chi_m^2 distributed random variable Z_m, Pr[Z_m > c] = 0.05. Then the null hypothesis (10) is rejected at the 5% significance level if LR_m > c and accepted if LR_m \leq c.
An alternative test of the null hypothesis (10) is the Wald test, which is conducted in the same way as for linear regression models.³ Under the null hypothesis (10) the Wald test statistic also has a \chi_m^2 distribution.

² Your econometric software will report the log-likelihood function value.
³ In EasyReg International the Wald test can be conducted simply by point-and-click.
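
A sketch of the LR test in Python with SciPy; the two log-likelihood values are illustrative placeholders for what the restricted and unrestricted fits would report:

    from scipy.stats import chi2

    loglik_unrestricted = -612.3   # ln(L_n(beta_hat_1, ..., beta_hat_k)), illustrative
    loglik_restricted = -620.1     # ln(L_n(0, ..., 0, beta_tilde_{m+1}, ..., beta_tilde_k))
    m = 3                          # number of restrictions under H0

    LR = -2.0 * (loglik_restricted - loglik_unrestricted)
    c = chi2.ppf(0.95, df=m)       # 5% critical value of the chi-squared(m) distribution
    print(LR, c, LR > c)           # 15.6, 7.81..., True: reject (10) at the 5% level
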

4 Interpretation of the coefficients of the Logit model

4.1 Marginal effects

Consider the Logit model (5). If \beta_0 > 0 then Pr[Y_j = 1|X_j] = F(\alpha_0 + \beta_0 X_j) is an increasing function of X_j:

\frac{d Pr[Y_j = 1|X_j]}{d X_j} = \beta_0 F'(\alpha_0 + \beta_0 X_j),

where F' is the derivative of (6):

F'(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \frac{1 + \exp(-x)}{(1 + \exp(-x))^2} - \frac{1}{(1 + \exp(-x))^2}
  = \frac{1}{1 + \exp(-x)} - \frac{1}{(1 + \exp(-x))^2} = F(x) - F(x)^2
  = F(x)(1 - F(x)).

Therefore, the marginal effect of X_j on Pr[Y_j = 1|X_j] depends on X_j:

\frac{d Pr[Y_j = 1|X_j]}{d X_j} = \beta_0 F(\alpha_0 + \beta_0 X_j)(1 - F(\alpha_0 + \beta_0 X_j)),

which renders the interpretation of \beta_0 difficult.
However, the coefficient \beta_0 can be interpreted in terms of relative changes in odds.
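
A sketch of how the marginal effect varies with X_j, with illustrative parameter values (assuming NumPy):

    import numpy as np

    def F(x):
        """Logistic distribution function (6)."""
        return 1.0 / (1.0 + np.exp(-x))

    alpha0, beta0 = -0.5, 1.0                            # illustrative values
    for x in (-2.0, 0.0, 2.0):
        print(x, beta0 * F(alpha0 + beta0 * x) * (1.0 - F(alpha0 + beta0 * x)))
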

4.2 Odds and odds ratios

The odds is the probability that something is true divided by the probability that it is not true. Thus, in the Logit case (4),

Odds(X_j) = \frac{Pr[Y_j = 1|X_j]}{Pr[Y_j = 0|X_j]} = \frac{F(\alpha_0 + \beta_0 X_j)}{1 - F(\alpha_0 + \beta_0 X_j)} = \exp(\alpha_0 + \beta_0 X_j).   (11)

The odds ratio is the ratio of two odds for different values of X_j, say X_j = x and X_j = x + \Delta x:

\frac{Odds(x + \Delta x)}{Odds(x)} = \frac{\exp(\alpha_0 + \beta_0 x + \beta_0 \Delta x)}{\exp(\alpha_0 + \beta_0 x)} = \exp(\beta_0 \Delta x),

where \Delta x is a small change in x. Then

\lim_{\Delta x \to 0} \frac{1}{\Delta x}\left( \frac{Odds(x + \Delta x) - Odds(x)}{Odds(x)} \right) = \lim_{\Delta x \to 0} \frac{\exp(\beta_0 \Delta x) - 1}{\Delta x} = \beta_0 \lim_{\Delta x \to 0} \frac{\exp(\beta_0 \Delta x) - 1}{\beta_0 \Delta x} = \beta_0 \times \left. \frac{d \exp(u)}{d u} \right|_{u=0} = \beta_0 \exp(0) = \beta_0.

Thus, \beta_0 may be interpreted as the relative change in the odds due to a small change \Delta x in X_j:

\frac{Odds(x + \Delta x) - Odds(x)}{Odds(x)} = \frac{Odds(x + \Delta x)}{Odds(x)} - 1 \approx \beta_0 \Delta x.   (12)

If X_j is a binary variable itself, X_j = 0 or X_j = 1, then the only reasonable choices for x + \Delta x and x are 1 and 0, respectively, so that then

\frac{Odds(1)}{Odds(0)} - 1 = \frac{Odds(1) - Odds(0)}{Odds(0)} = \exp(\beta_0) - 1.

Only if \beta_0 is small may we then use the approximation \exp(\beta_0) - 1 \approx \beta_0. If not, one has to interpret \beta_0 in terms of the log of the odds ratio involved:

\ln\left( \frac{Odds(1)}{Odds(0)} \right) = \beta_0.
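
The quality of the approximation (12), and why it breaks down for a unit change when \beta_0 is not small, is easy to check numerically (illustrative \beta_0, assuming NumPy):

    import numpy as np

    beta0 = 0.8                                  # illustrative coefficient
    for dx in (0.01, 0.1, 1.0):
        exact = np.exp(beta0 * dx) - 1.0         # exact relative change in the odds
        print(dx, exact, beta0 * dx)             # approximation (12): close only for small dx
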

The interpretation of the coefficients \beta_i^0, i = 1, ..., k-1, in the general Logit model (8) is similar to the case (12):

\frac{Odds(X_{1j}, ..., X_{i-1,j}, X_{ij} + \Delta X_{ij}, X_{i+1,j}, ..., X_{kj})}{Odds(X_{1j}, ..., X_{i-1,j}, X_{ij}, X_{i+1,j}, ..., X_{kj})} - 1 \approx \beta_i^0 \Delta X_{ij}

if \Delta X_{ij} is small. For example, \beta_i^0 may be interpreted as the percentage change in Odds(X_{1j}, ..., X_{kj}) due to a small change 100 \times \Delta X_{ij} = 1, i.e., \Delta X_{ij} = 0.01, in X_{ij}.

