Chapter 3:
Likelihood function and inference

The likelihood
Information and curvature
Sufficiency and ancillarity
Maximum likelihood estimation
Non-regular models
EM algorithm
The likelihood

Given a [usually parametric] family of distributions
$$F \in \{F_\theta,\ \theta \in \Theta\}$$
with densities $f_\theta$ [wrt a fixed measure $\nu$], the density of the iid sample $x_1, \ldots, x_n$ is
$$\prod_{i=1}^n f_\theta(x_i)$$

Note: in the special case where $\nu$ is a counting measure,
$$\prod_{i=1}^n f_\theta(x_i)$$
is the probability of observing the sample $x_1, \ldots, x_n$ among all possible realisations of $X_1, \ldots, X_n$
The likelihood

Definition (likelihood function)
The likelihood function associated with a sample $x_1, \ldots, x_n$ is the function
$$L : \Theta \to \mathbb R_+\,, \qquad \theta \mapsto \prod_{i=1}^n f_\theta(x_i)$$
same formula as the density but a different space of variation
Example: density function versus likelihood function

Take the case of a Poisson density [against the counting measure]
$$f(x; \lambda) = \frac{\lambda^x}{x!}\, e^{-\lambda}\, \mathbb I_{\mathbb N}(x)$$
which varies in $\mathbb N$ as a function of $x$, versus
$$L(\lambda; x) = \frac{\lambda^x}{x!}\, e^{-\lambda}$$
which varies in $\mathbb R_+$ as a function of $\lambda$
[plots: the density for $\lambda = 3$ fixed, the likelihood for $x = 3$ fixed]
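The contrast can be checked directly in R; in this quick sketch the fixed values $\lambda = 3$ and $x = 3$ and both grids are arbitrary choices, and the same dpois() formula is evaluated first as a density in $x$, then as a likelihood in $\lambda$.

    # same Poisson formula, viewed as a density in x and as a likelihood in lambda
    lambda0 <- 3
    x0      <- 3
    xs      <- 0:10                      # density: x varies, lambda fixed at 3
    dens    <- dpois(xs, lambda = lambda0)
    lambdas <- seq(0.01, 10, by = 0.01)  # likelihood: lambda varies, x fixed at 3
    lik     <- dpois(x0, lambda = lambdas)
    par(mfrow = c(1, 2))
    plot(xs, dens, type = "h", xlab = "x", ylab = "f(x; 3)")
    plot(lambdas, lik, type = "l", xlab = "lambda", ylab = "L(lambda; 3)")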
Example: density function versus likelihood function

Take the case of a Normal $\mathcal N(0, \theta)$ density [against the Lebesgue measure]
$$f(x; \theta) = \frac{1}{\sqrt{2\pi\theta}}\, e^{-x^2/2\theta}\, \mathbb I_{\mathbb R}(x)$$
which varies in $\mathbb R$ as a function of $x$, versus
$$L(\theta; x) = \frac{1}{\sqrt{2\pi\theta}}\, e^{-x^2/2\theta}$$
which varies in $\mathbb R_+$ as a function of $\theta$
[plots: the density for $\theta = 2$ fixed, the likelihood for $x = 2$ fixed]
Example: density function versus likelihood function

Take the case of a Normal $\mathcal N(0, 1/\theta)$ density [against the Lebesgue measure]
$$f(x; \theta) = \frac{\sqrt{\theta}}{\sqrt{2\pi}}\, e^{-\theta x^2/2}\, \mathbb I_{\mathbb R}(x)$$
which varies in $\mathbb R$ as a function of $x$, versus
$$L(\theta; x) = \frac{\sqrt{\theta}}{\sqrt{2\pi}}\, e^{-\theta x^2/2}$$
which varies in $\mathbb R_+$ as a function of $\theta$
[plots: the density for $\theta = 1/2$ fixed, the likelihood for $x = 1/2$ fixed]
Example: Hardy-Weinberg equilibrium

Population genetics:
genotypes of biallelic genes AA, Aa, and aa
sample frequencies $n_{AA}$, $n_{Aa}$ and $n_{aa}$
multinomial model $\mathcal M(n; p_{AA}, p_{Aa}, p_{aa})$
related to the population proportion of A alleles, $p_A$:
$$p_{AA} = p_A^2\,,\quad p_{Aa} = 2p_A(1-p_A)\,,\quad p_{aa} = (1-p_A)^2$$
likelihood
$$L(p_A \mid n_{AA}, n_{Aa}, n_{aa}) \propto p_A^{2n_{AA}}\,[2p_A(1-p_A)]^{n_{Aa}}\,(1-p_A)^{2n_{aa}}$$
[Boos & Stefanski, 2013]
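A short R sketch of this likelihood with invented genotype counts; the closed-form maximiser $(2n_{AA}+n_{Aa})/2n$, the usual allele-frequency estimate, serves as a check on the numerical optimisation.

    # Hardy-Weinberg log-likelihood in pA, with invented counts
    nAA <- 25; nAa <- 50; naa <- 25
    loglik <- function(pA)
      2 * nAA * log(pA) + nAa * log(2 * pA * (1 - pA)) + 2 * naa * log(1 - pA)
    opt <- optimize(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)
    opt$maximum                                 # numerical maximiser
    (2 * nAA + nAa) / (2 * (nAA + nAa + naa))   # closed-form MLE of pA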
Mixed distributions and their likelihood

Special case when a random variable X may take specific values $a_1, \ldots, a_k$ and a continuum of values in $\mathcal A$

Example: rainfall at a given spot on a given day may be zero with positive probability $p_0$ [it did not rain!] or an arbitrary number between 0 and 100 [capacity of the measurement container] or 100 with positive probability $p_{100}$ [container full]

Example: Tobit model where $y^\star \sim \mathcal N(X^{\mathrm T}\beta, \sigma^2)$ but $y = y^\star\, \mathbb I\{y^\star \geq 0\}$ is observed

Density of X against the composition of two measures, counting and Lebesgue:
$$f_X(a) = \begin{cases} \mathbb P_\theta(X = a) & \text{if } a \in \{a_1, \ldots, a_k\}\\ f(a\mid\theta) & \text{otherwise}\end{cases}$$
Results in the likelihood
$$L(\theta\mid x_1, \ldots, x_n) = \prod_{j=1}^k \mathbb P_\theta(X = a_j)^{n_j} \times \prod_{x_i \notin \{a_1,\ldots,a_k\}} f(x_i\mid\theta)$$
where $n_j$ = # observations equal to $a_j$
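As an illustration, a rough R sketch of such a mixed likelihood for an intercept-only version of the Tobit model above; the simulated data, true values, and log-sigma parameterisation are arbitrary conveniences.

    # mixed likelihood: point mass at 0 (censored part) + Lebesgue density on (0, Inf)
    set.seed(1)
    ystar <- rnorm(200, mean = 0.5, sd = 2)
    y     <- ifelse(ystar >= 0, ystar, 0)            # zeros occur with positive probability
    logtobit <- function(theta) {                    # theta = (mu, log sigma)
      mu <- theta[1]; sigma <- exp(theta[2])
      sum(ifelse(y == 0,
                 pnorm(0, mean = mu, sd = sigma, log.p = TRUE),  # P(y* < 0)
                 dnorm(y, mean = mu, sd = sigma, log = TRUE)))   # density part
    }
    optim(c(0, 0), logtobit, control = list(fnscale = -1))$par   # MLE of (mu, log sigma)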
Enters Fisher, Ronald Fisher!

Fisher's intuition in the 1920s:
the likelihood function contains the relevant information about the parameter $\theta$
the higher the likelihood, the more likely the parameter
the curvature of the likelihood determines the precision of the estimation
Concentration of the likelihood mode around the true parameter

[plots: likelihood functions for $x_1, \ldots, x_n \sim \mathcal P(3)$ as n increases, n = 40, ..., 240]
[plots: likelihood functions for $x_1, \ldots, x_n \sim \mathcal N(0,1)$ as n increases]
[plots: likelihood functions for $x_1, \ldots, x_n \sim \mathcal N(0,1)$ as the sample varies]
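The Poisson plots can be reproduced along the following lines in R (seed, sample sizes, and the $\lambda$ grid are arbitrary); each curve is the likelihood rescaled to a maximum of 1.

    # concentration of the Poisson likelihood around the true value lambda = 3
    set.seed(42)
    lambda <- seq(1, 6, by = 0.01)
    plot(NULL, xlim = range(lambda), ylim = c(0, 1),
         xlab = "lambda", ylab = "scaled likelihood")
    for (n in c(40, 80, 160, 240)) {
      x   <- rpois(n, lambda = 3)
      llk <- sapply(lambda, function(l) sum(dpois(x, l, log = TRUE)))
      lines(lambda, exp(llk - max(llk)))   # rescaled so that the maximum is 1
    }
    abline(v = 3, lty = 2)                 # true parameter value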
Why concentration takes place

Consider
$$x_1, \ldots, x_n \overset{\text{iid}}{\sim} F$$
Then
$$\log \prod_{i=1}^n f(x_i\mid\theta) = \sum_{i=1}^n \log f(x_i\mid\theta)$$
and by the LLN
$$\frac{1}{n} \sum_{i=1}^n \log f(x_i\mid\theta) \longrightarrow \int_{\mathcal X} \log f(x\mid\theta)\, \mathrm dF(x)$$

Lemma
Maximising the likelihood is asymptotically equivalent to minimising the Kullback-Leibler divergence
$$\int_{\mathcal X} \log \big\{ f(x) \big/ f(x\mid\theta) \big\}\, \mathrm dF(x)$$
i.e., to selecting the member of the family closest to the true distribution
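A quick Monte Carlo check of the lemma in R for the Poisson family (sample size and seed arbitrary): the maximiser of the average log-likelihood sits next to the minimiser of the closed-form divergence $\mathrm{KL}(\mathcal P(3), \mathcal P(\lambda)) = 3\log(3/\lambda) - 3 + \lambda$.

    # average log-likelihood versus Kullback-Leibler divergence, Poisson case
    set.seed(7)
    x      <- rpois(500, lambda = 3)
    lambda <- seq(1, 6, by = 0.01)
    avg_ll <- sapply(lambda, function(l) mean(dpois(x, l, log = TRUE)))
    kl     <- 3 * log(3 / lambda) - 3 + lambda
    lambda[which.max(avg_ll)]   # maximiser of the average log-likelihood
    lambda[which.min(kl)]       # minimiser of the divergence (exactly 3)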
Score function

The score function is defined by
$$\nabla \log L(\theta\mid x) = \Big( \frac{\partial}{\partial\theta_1} L(\theta\mid x), \ldots, \frac{\partial}{\partial\theta_p} L(\theta\mid x) \Big) \Big/ L(\theta\mid x)$$
Gradient (slope) of the likelihood function at the point $\theta$

Lemma
When $X \sim F_\theta$,
$$\mathbb E_\theta[\nabla \log L(\theta\mid X)] = 0$$
Reason:
$$\int_{\mathcal X} \nabla \log L(\theta\mid x)\, \mathrm dF_\theta(x) = \int_{\mathcal X} \nabla L(\theta\mid x)\, \mathrm dx = \nabla \int_{\mathcal X} \mathrm dF_\theta(x) = 0$$
Connected with the concentration theorem: the gradient is null on average at the true value of the parameter

Warning: not defined for non-differentiable likelihoods, e.g. when the support depends on $\theta$
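The lemma is easy to check by simulation; this R sketch uses the Poisson score $\partial/\partial\lambda\, \log f(x\mid\lambda) = x/\lambda - 1$, with the true value $\lambda = 3$ and the Monte Carlo size chosen arbitrarily.

    # E_theta[ score ] = 0 at the true parameter value, not elsewhere
    set.seed(123)
    lambda <- 3
    x <- rpois(1e5, lambda)
    mean(x / lambda - 1)   # approximately 0: score evaluated at the true lambda
    mean(x / 2.5 - 1)      # not 0: score evaluated at a wrong value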
Fisher's information matrix

Another notion attributed to Fisher [more likely due to Edgeworth]

Information: covariance matrix of the score vector
$$I(\theta) = \mathbb E_\theta\big[ \nabla \log f(X\mid\theta)\, \{\nabla \log f(X\mid\theta)\}^{\mathrm T} \big]$$
Often called Fisher information

Measures the curvature of the likelihood surface, which translates as information brought by the data

Sometimes denoted $I_X$ to stress the dependence on the distribution of $X$
Fisher's information matrix

Also expressible via the second derivative of the log-likelihood:

Lemma
If $L(\theta\mid x)$ is twice differentiable [as a function of $\theta$]
$$I(\theta) = -\mathbb E_\theta\big[ \nabla^{\mathrm T}\nabla \log f(X\mid\theta) \big]$$
Hence
$$I_{ij}(\theta) = -\mathbb E_\theta\Big[ \frac{\partial^2}{\partial\theta_i\,\partial\theta_j} \log f(X\mid\theta) \Big]$$
Illustrations

Binomial $\mathcal B(n, p)$ distribution
$$f(x\mid p) = \binom{n}{x}\, p^x (1-p)^{n-x}$$
$$\frac{\partial}{\partial p} \log f(x\mid p) = \frac{x}{p} - \frac{n-x}{1-p}\,,\qquad \frac{\partial^2}{\partial p^2} \log f(x\mid p) = -\frac{x}{p^2} - \frac{n-x}{(1-p)^2}$$
Hence
$$I(p) = \frac{np}{p^2} + \frac{n-np}{(1-p)^2} = \frac{n}{p(1-p)}$$
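Since the information is the variance of the score, the closed form can be checked by simulation in R ($n$, $p$, and the Monte Carlo size below are arbitrary choices).

    # Monte Carlo variance of the binomial score versus n / {p(1-p)}
    set.seed(1)
    n <- 20; p <- 0.3
    x <- rbinom(1e5, size = n, prob = p)
    var(x / p - (n - x) / (1 - p))   # empirical variance of the score
    n / (p * (1 - p))                # analytic Fisher information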
Illustrations

Multinomial $\mathcal M(n; p_1, \ldots, p_k)$ distribution [with $p_k = 1 - p_1 - \cdots - p_{k-1}$, the free parameter being $(p_1, \ldots, p_{k-1})$]
$$f(x\mid p) = \binom{n}{x_1 \cdots x_k}\, p_1^{x_1} \cdots p_k^{x_k}$$
$$\frac{\partial}{\partial p_i} \log f(x\mid p) = \frac{x_i}{p_i} - \frac{x_k}{p_k}$$
$$\frac{\partial^2}{\partial p_i\,\partial p_j} \log f(x\mid p) = -\frac{x_k}{p_k^2} \quad (i\neq j)\,,\qquad \frac{\partial^2}{\partial p_i^2} \log f(x\mid p) = -\frac{x_i}{p_i^2} - \frac{x_k}{p_k^2}$$
Hence
$$I(p) = n\begin{pmatrix} 1/p_1 + 1/p_k & \cdots & 1/p_k \\ \vdots & \ddots & \vdots \\ 1/p_k & \cdots & 1/p_{k-1} + 1/p_k \end{pmatrix}$$
and
$$I(p)^{-1} = \frac{1}{n}\begin{pmatrix} p_1(1-p_1) & -p_1p_2 & \cdots & -p_1p_{k-1} \\ -p_1p_2 & p_2(1-p_2) & \cdots & -p_2p_{k-1} \\ \vdots & & \ddots & \vdots \\ -p_1p_{k-1} & -p_2p_{k-1} & \cdots & p_{k-1}(1-p_{k-1}) \end{pmatrix}$$
Illustrations

Normal $\mathcal N(\mu, \sigma^2)$ distribution, $\theta = (\mu, \sigma)$
$$f(x\mid\theta) = \frac{1}{\sqrt{2\pi}}\,\frac{1}{\sigma}\, \exp\big\{ -(x-\mu)^2/2\sigma^2 \big\}$$
$$\frac{\partial}{\partial\mu} \log f(x\mid\theta) = \frac{x-\mu}{\sigma^2}\,,\qquad \frac{\partial}{\partial\sigma} \log f(x\mid\theta) = -\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3}$$
$$\frac{\partial^2}{\partial\mu^2} \log f(x\mid\theta) = -\frac{1}{\sigma^2}\,,\quad \frac{\partial^2}{\partial\mu\,\partial\sigma} \log f(x\mid\theta) = -2\,\frac{x-\mu}{\sigma^3}\,,\quad \frac{\partial^2}{\partial\sigma^2} \log f(x\mid\theta) = \frac{1}{\sigma^2} - 3\,\frac{(x-\mu)^2}{\sigma^4}$$
Hence
$$I(\theta) = \frac{1}{\sigma^2}\begin{pmatrix} 1 & 0 \\ 0 & 2\end{pmatrix}$$
Properties

Additive features translating as accumulation of information:
if $X$ and $Y$ are independent, $I_X(\theta) + I_Y(\theta) = I_{(X,Y)}(\theta)$
$I_{X_1,\ldots,X_n}(\theta) = n\, I_{X_1}(\theta)$
if $X = T(Y)$ and $Y = S(X)$, $I_X(\theta) = I_Y(\theta)$
if $X = T(Y)$, $I_X(\theta) \leq I_Y(\theta)$

If $\eta = \Psi(\theta)$ is a bijective transform, change of parameterisation:
$$I(\theta) = \Big(\frac{\partial\eta}{\partial\theta}\Big)^{\mathrm T}\, I(\eta)\, \Big(\frac{\partial\eta}{\partial\theta}\Big)$$
"In information geometry, this is seen as a change of coordinates on a Riemannian manifold, and the intrinsic properties of curvature are unchanged under different parametrization. In general, the Fisher information matrix provides a Riemannian metric (more precisely, the Fisher-Rao metric)." [Wikipedia]
Approximations

Back to the Kullback–Leibler divergence
$$D(\theta_0, \theta) = \int_{\mathcal X} f(x\mid\theta_0) \log\big\{ f(x\mid\theta_0)\big/f(x\mid\theta) \big\}\, \mathrm dx$$
Using a second-order Taylor expansion
$$\log f(x\mid\theta) = \log f(x\mid\theta_0) + (\theta-\theta_0)^{\mathrm T}\, \nabla\log f(x\mid\theta_0) + \frac{1}{2}\,(\theta-\theta_0)^{\mathrm T}\, \nabla\nabla^{\mathrm T}\log f(x\mid\theta_0)\,(\theta-\theta_0) + o(\|\theta-\theta_0\|^2)$$
leads to the approximation of the divergence
$$D(\theta_0, \theta) \approx \frac{1}{2}\,(\theta-\theta_0)^{\mathrm T}\, I(\theta_0)\,(\theta-\theta_0)$$
[Exercise: show this is exact in the normal case]
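As a numerical hint for the exercise, an R sketch for the $\mathcal N(\theta, 1)$ location family, where $I(\theta_0) = 1$: the KL integral, computed by quadrature, coincides with $\frac{1}{2}(\theta - \theta_0)^2$ (the grid of $\theta$ values is arbitrary).

    # KL divergence between N(0, 1) and N(theta, 1) versus its quadratic form
    theta0 <- 0
    kl <- function(theta)
      integrate(function(x)
        dnorm(x, theta0) * (dnorm(x, theta0, log = TRUE) - dnorm(x, theta, log = TRUE)),
        lower = -Inf, upper = Inf)$value
    sapply(c(0.5, 1, 2), kl)   # numerical divergence
    0.5 * c(0.5, 1, 2)^2       # (1/2) (theta - theta0)^2 I(theta0), identical values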
Approximations

Central limit law of the score vector
$$\frac{1}{\sqrt n}\, \nabla\log L(\theta\mid X_1, \ldots, X_n) \approx \mathcal N_p(0, I_{X_1}(\theta))$$
Notation: $I_1(\theta)$ stands for $I_{X_1}(\theta)$ and indicates the information associated with a single observation
Sufficiency

What if a transform of the sample
$$S(X_1, \ldots, X_n)$$
contains all the information, i.e.
$$I_{(X_1,\ldots,X_n)}(\theta) = I_{S(X_1,\ldots,X_n)}(\theta)$$
uniformly in $\theta$?

In this case $S(\cdot)$ is called a sufficient statistic [because it is sufficient to know the value of $S(x_1, \ldots, x_n)$ to get complete information]

A statistic is an arbitrary transform of the data $X_1, \ldots, X_n$
Sufficiency (bis)

Alternative definition:
If $(X_1, \ldots, X_n) \sim f(x_1, \ldots, x_n\mid\theta)$ and if $T = S(X_1, \ldots, X_n)$ is such that the distribution of $(X_1, \ldots, X_n)$ conditional on $T$ does not depend on $\theta$, then $S(\cdot)$ is a sufficient statistic

Factorisation theorem
$S(\cdot)$ is a sufficient statistic if and only if
$$f(x_1, \ldots, x_n\mid\theta) = g(S(x_1, \ldots, x_n)\mid\theta)\times h(x_1, \ldots, x_n)$$
another notion due to Fisher
Illustrations

Uniform $\mathcal U(0,\theta)$ distribution
$$L(\theta\mid x_1, \ldots, x_n) = \theta^{-n} \prod_{i=1}^n \mathbb I_{(0,\theta)}(x_i) = \theta^{-n}\, \mathbb I_{\theta \geq \max_i x_i}$$
Hence
$$S(X_1, \ldots, X_n) = \max_i X_i = X_{(n)}$$
is sufficient
Illustrations

Bernoulli $\mathcal B(p)$ distribution
$$L(p\mid x_1, \ldots, x_n) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = \{p/(1-p)\}^{\sum_i x_i}\, (1-p)^n$$
Hence
$$S(X_1, \ldots, X_n) = \bar X_n$$
is sufficient
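The alternative, conditional-distribution definition of sufficiency given above can be illustrated in R for this Bernoulli example: conditional on the sum, the sample pattern is uniformly distributed whatever $p$ (the values of $n$, $s$, $p$, and the simulation size are arbitrary).

    # conditional on S = sum(x), the Bernoulli sample pattern does not depend on p
    set.seed(2)
    cond_pattern <- function(p, n = 3, s = 2, N = 2e5) {
      x <- matrix(rbinom(N * n, 1, p), ncol = n)
      x <- x[rowSums(x) == s, , drop = FALSE]             # keep draws with S = s
      table(apply(x, 1, paste, collapse = "")) / nrow(x)  # conditional frequencies
    }
    cond_pattern(0.2)   # close to uniform over the patterns 011, 101, 110
    cond_pattern(0.8)   # same conditional distribution for a very different p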
Illustrations

Normal $\mathcal N(\mu, \sigma^2)$ distribution
$$L(\mu, \sigma\mid x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\{ -(x_i-\mu)^2/2\sigma^2 \}$$
$$= \frac{1}{\{2\pi\sigma^2\}^{n/2}} \exp\Big\{ -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \bar x_n + \bar x_n - \mu)^2 \Big\}$$
$$= \frac{1}{\{2\pi\sigma^2\}^{n/2}} \exp\Big\{ -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \bar x_n)^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (\bar x_n - \mu)^2 \Big\}$$
Hence
$$S(X_1, \ldots, X_n) = \Big( \bar X_n,\ \sum_{i=1}^n (X_i - \bar X_n)^2 \Big)$$
is sufficient
Sufficiency and exponential families

Both previous examples belong to exponential families
$$f(x\mid\theta) = h(x)\, \exp\big\{ T(\theta)^{\mathrm T} S(x) - \tau(\theta) \big\}$$
Generic property of exponential families:
$$f(x_1, \ldots, x_n\mid\theta) = \prod_{i=1}^n h(x_i)\, \exp\Big\{ T(\theta)^{\mathrm T} \sum_{i=1}^n S(x_i) - n\,\tau(\theta) \Big\}$$

Lemma
For an exponential family with summary statistic $S(\cdot)$, the statistic
$$S(X_1, \ldots, X_n) = \sum_{i=1}^n S(X_i)$$
is sufficient
Sufficiency as a rare feature

Nice property reducing the data to a low-dimensional transform but...
How frequent is it within the collection of probability distributions?
Very rare, as essentially restricted to exponential families
[Pitman-Koopman-Darmois theorem]
with the exception of parameter-dependent families like $\mathcal U(0,\theta)$
Pitman-Koopman-Darmois characterisation

If $X_1, \ldots, X_n$ are iid random variables from a density $f(\cdot\mid\theta)$ whose support does not depend on $\theta$ and verifying the property that there exists an integer $n_0$ such that, for $n \geq n_0$, there is a sufficient statistic $S(X_1, \ldots, X_n)$ with fixed [in n] dimension, then $f(\cdot\mid\theta)$ belongs to an exponential family
[Factorisation theorem]

Note: Darmois published this result in 1935 [in French] and Koopman and Pitman in 1936 [in English] but Darmois is generally omitted from the theorem... Fisher proved it for one-dimensional sufficient statistics in 1934
Minimal sufficiency

Multiplicity of sufficient statistics, e.g., $S'(x) = (S(x), U(x))$ remains sufficient when $S(\cdot)$ is sufficient
Search for the most concentrated summary:

Minimal sufficiency
A sufficient statistic $S(\cdot)$ is minimal sufficient if it is a function of any other sufficient statistic

Lemma
For a minimal exponential family representation
$$f(x\mid\theta) = h(x)\, \exp\big\{ T(\theta)^{\mathrm T} S(x) - \tau(\theta) \big\}$$
$S(X_1) + \ldots + S(X_n)$ is minimal sufficient
Ancillarity

Opposite of sufficiency:

Ancillarity
When $X_1, \ldots, X_n$ are iid random variables from a density $f(\cdot\mid\theta)$, a statistic $A(\cdot)$ is ancillary if $A(X_1, \ldots, X_n)$ has a distribution that does not depend on $\theta$

Useless?! Not necessarily, as conditioning upon $A(X_1, \ldots, X_n)$ leads to more precision and efficiency:
use of $F(x_1, \ldots, x_n\mid A(x_1, \ldots, x_n))$ instead of $F(x_1, \ldots, x_n)$
Notion of maximal ancillary statistic
Illustrations

1. If $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathcal U(0,\theta)$, $A(X_1, \ldots, X_n) = (X_1, \ldots, X_n)\big/X_{(n)}$ is ancillary
2. If $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathcal N(\mu,\sigma^2)$,
$$A(X_1, \ldots, X_n) = \frac{(X_1 - \bar X_n, \ldots, X_n - \bar X_n)}{\big\{\sum_{i=1}^n (X_i - \bar X_n)^2\big\}^{1/2}}$$
is ancillary
3. If $X_1, \ldots, X_n \overset{\text{iid}}{\sim} f(x\mid\theta)$, $\mathrm{rank}(X_1, \ldots, X_n)$ is ancillary

    > x = rnorm(10)
    > rank(x)
     [1]  7  4  1  5  2  6  8  9 10  3

[see, e.g., rank tests]
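In the same spirit as the rank example, the first illustration can be checked by simulation; this R sketch compares the distribution of one coordinate of the ancillary statistic, $X_{(1)}/X_{(n)}$, under two very different values of $\theta$ (all tuning numbers are arbitrary).

    # an ancillary statistic for U(0, theta): its distribution is free of theta
    set.seed(3)
    ratio <- function(theta, n = 10, N = 1e4)
      replicate(N, { x <- runif(n, 0, theta); min(x) / max(x) })
    summary(ratio(1))     # same summaries...
    summary(ratio(100))   # ...whatever the scale theta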
Point estimation, estimators and estimates

When given a parametric family $f(\cdot\mid\theta)$ and a sample supposedly drawn from this family
$$(X_1, \ldots, X_n) \overset{\text{iid}}{\sim} f(x\mid\theta)$$
an estimator of $\theta$ is a statistic $T(X_1, \ldots, X_n)$ or $\hat\theta_n$ providing a [reasonable] substitute for the unknown value $\theta$
an estimate of $\theta$ is the value of the estimator for a given [realised] sample, $T(x_1, \ldots, x_n)$

Example: for a Normal $\mathcal N(\mu, \sigma^2)$ sample $X_1, \ldots, X_n$,
$$T(X_1, \ldots, X_n) = \hat\mu_n = \frac{1}{n}\sum_{i=1}^n X_i$$
is an estimator of $\mu$ and $\hat\mu_n = 2.014$ is an estimate
Maximum likelihood principle

Given the concentration property of the likelihood function, a reasonable choice of estimator is its mode:

MLE
A maximum likelihood estimator (MLE) $\hat\theta_n$ satisfies
$$L(\hat\theta_n\mid X_1, \ldots, X_n) \geq L(\theta\mid X_1, \ldots, X_n) \quad\text{for all } \theta\in\Theta$$
Under regularity of $L(\theta\mid X_1, \ldots, X_n)$, the MLE is also a solution of the likelihood equations
$$\nabla \log L(\hat\theta_n\mid X_1, \ldots, X_n) = 0$$
Warning: $\hat\theta_n$ is not the most likely value of $\theta$ but makes the observation $(x_1, \ldots, x_n)$ most likely...
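When the likelihood equations have no closed-form solution, the mode is found numerically; a minimal R sketch for a Gamma sample (shape and rate values, sample size, and the log-parameterisation are arbitrary choices, the latter keeping the search inside the positive quadrant).

    # numerical MLE via optim() for a Gamma(shape, rate) sample
    set.seed(11)
    x <- rgamma(200, shape = 2, rate = 1)
    negloglik <- function(theta)                  # theta = (log shape, log rate)
      -sum(dgamma(x, shape = exp(theta[1]), rate = exp(theta[2]), log = TRUE))
    fit <- optim(c(0, 0), negloglik)
    exp(fit$par)                                  # MLE of (shape, rate)

By the invariance property on the next slide, exponentiating the maximiser of the log-parameterised problem does return the MLE of (shape, rate) itself.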
Maximum likelihood invariance

Principle independent of the parameterisation:
If $\xi = h(\theta)$ is a one-to-one transform of $\theta$, then
$$\hat\xi_n^{\mathrm{MLE}} = h(\hat\theta_n^{\mathrm{MLE}})$$
[estimator of transform = transform of estimator]
By extension, if $\xi = h(\theta)$ is any transform of $\theta$, then
$$\hat\xi_n^{\mathrm{MLE}} = h(\hat\theta_n^{\mathrm{MLE}})$$
Unicity of the maximum likelihood estimate

Depending on the regularity of $L(\theta\mid x_1, \ldots, x_n)$, there may be
1. an a.s. unique MLE $\hat\theta_n^{\mathrm{MLE}}$ [case of $x_1, \ldots, x_n \sim \mathcal N(\mu, 1)$]
2. several or an infinity of MLEs [or of solutions to the likelihood equations] [case of $x_1, \ldots, x_n \sim \mathcal N(\mu_1 + \mu_2, 1)$, and mixtures of normals]
3. no MLE at all [case of $x_1, \ldots, x_n \sim \mathcal N(\mu_i, \sigma^{-2})$]
Unicity of the maximum likelihood estimate

Consequence of standard differential calculus results on $\ell(\theta) = L(\theta\mid x_1, \ldots, x_n)$:

Lemma
If $\Theta$ is connected and open, and if $\ell(\theta)$ is twice differentiable with
$$\lim_{\theta\to\partial\Theta} \ell(\theta) < +\infty$$
and if $H(\theta) = -\nabla\nabla^{\mathrm T}\ell(\theta)$ is positive definite at all solutions of the likelihood equations, then $\ell(\theta)$ has a unique global maximum

Limited appeal because it excludes local maxima
Unicity of the MLE for exponential families

Lemma
If $f(\cdot\mid\theta)$ is a minimal exponential family
$$f(x\mid\theta) = h(x)\, \exp\big\{ T(\theta)^{\mathrm T} S(x) - \tau(\theta) \big\}$$
with $T(\cdot)$ one-to-one and twice differentiable over $\Theta$, if $\Theta$ is open, and if there is at least one solution to the likelihood equations, then it is the unique MLE

The likelihood equation is equivalent to $S(x) = \mathbb E_\theta[S(X)]$
Illustrations

Uniform $\mathcal U(0,\theta)$ likelihood
$$L(\theta\mid x_1, \ldots, x_n) = \theta^{-n}\, \mathbb I_{\theta \geq \max_i x_i}$$
not differentiable at $\theta = X_{(n)}$ but
$$\hat\theta_n^{\mathrm{MLE}} = X_{(n)}$$
Illustrations

Bernoulli $\mathcal B(p)$ likelihood
$$L(p\mid x_1, \ldots, x_n) = \{p/(1-p)\}^{\sum_i x_i}\, (1-p)^n$$
differentiable over $(0,1)$ and
$$\hat p_n^{\mathrm{MLE}} = \bar X_n$$
Illustrations

Normal $\mathcal N(\mu, \sigma^2)$ likelihood
$$L(\mu, \sigma\mid x_1, \ldots, x_n) \propto \sigma^{-n} \exp\Big\{ -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \bar x_n)^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (\bar x_n - \mu)^2 \Big\}$$
differentiable with
$$\big(\hat\mu_n^{\mathrm{MLE}},\ (\hat\sigma_n^2)^{\mathrm{MLE}}\big) = \Big( \bar X_n,\ \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2 \Big)$$
The fundamental theorem of Statistics

Fundamental theorem
Under appropriate conditions, if $(X_1, \ldots, X_n) \overset{\text{iid}}{\sim} f(x\mid\theta)$ and if $\hat\theta_n$ is a solution of $\nabla\log f(X_1, \ldots, X_n\mid\theta) = 0$, then
$$\sqrt n\,\{\hat\theta_n - \theta\} \overset{\mathcal L}{\longrightarrow} \mathcal N_p(0, I(\theta)^{-1})$$
Equivalent of the CLT for estimation purposes
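A simulation check of the theorem in R for the Bernoulli MLE $\hat p_n = \bar X_n$, where $I(p) = 1/\{p(1-p)\}$ for one observation ($n$, $p$, and the replication number are arbitrary).

    # sqrt(n) (phat - p) should have variance close to I(p)^{-1} = p(1-p)
    set.seed(4)
    n <- 500; p <- 0.3
    phat <- replicate(1e4, mean(rbinom(n, 1, p)))
    var(sqrt(n) * (phat - p))   # empirical variance of the rescaled MLE
    p * (1 - p)                 # inverse Fisher information per observation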
Assumptions

$\theta$ is identifiable