1. FACULTY OF ENGINEERING AND ARCHITECTURE
Mathematical Techniques in Engineering Science
Module Statistics
Lecture 7+8
Estimation of parameters:
Fisher estimation
Bayesian estimation
Stijn De Vuyst
30 November 2016
2. Statistics Lecture 7+8
Fisher estimation
Likelihood function
Score function
Fisher information
MSE: bias and variance
Unbiased estimators: Cramer-Rao Lower Bound
Biased estimators
Sufficient statistics
Rao-Blackwellisation
Maximum-likelihood estimator
The EM algorithm
Example: censored data
Bayesian estimation
3. Estimation of parameters: two approaches
population X with parameter θ −→ sample x −→ estimate ˆθ
Classical framework
Developed in the 1920s and 1930s by Ronald Fisher, Karl Pearson, Jerzy Neyman, . . .
Later also C.R. Rao, H. Cramér, Egon Pearson, D. Blackwell, . . .
θ is unknown, but deterministic
θ ∈ S, the parameter space
Bayesian framework
18th century concepts by Thomas Bayes and Pierre-Simon Laplace
Huge following after 1950s due to availability of computer-intensive methods
θ is an unknown realisation
of a random variable Θ
Θ ∈ S
4. Classical setting: Fisher estimation
X: population, system, process, . . . with parameter θ
(θ is a scalar here, but could also be a vector θ in some parameter space S)
X: data, observations, sample, . . . −→ estimate ˆθ
The sample
n independent members taken from the population
(n is the sample size)
X = (X1, X2, . . . , Xn) before observation
x = (x1, x2, . . . , xn) after observation
X ∈ Ω
Ω = R^n for real populations, Ω = {0, 1}^n for Bernoulli populations, . . .
The ‘model’: likelihood function
p(x; θ) = Prob[observe X = x if true parameter is θ]
p(x; θ) is called the likelihood function, ln p(x; θ) the log-likelihood
−→ p(x; θ) can be either a density (X continuous) or a mass function (X discrete)
5. Example: likelihood function for a Bernoulli population
Assume a Bernoulli population: X ∼ Bern(θ),
i.e. X = 1 with probability θ and X = 0 otherwise
The observed sample (n = 6) is x = (0, 0, 1, 0, 1, 0)
Likelihood p(x; θ) = ∏_{i=1}^{6} p(x_i; θ) = (1 − θ)^4 θ^2 , θ ∈ S = [0, 1]
[Figure: likelihood p(x; θ) = (1 − θ)^4 θ^2 plotted over θ ∈ [0, 1], with its maximum at ˆθ_ML = 1/3]
Maximum-likelihood estimate for parameter θ:
ˆθ_ML = arg max_θ p(x; θ) = (count of 1s in the data)/n = c/n = 2/6 = 1/3
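As a quick illustration, the likelihood above can be evaluated numerically; a minimal Python sketch (the grid resolution is an arbitrary choice, not from the slides):

import numpy as np

x = np.array([0, 0, 1, 0, 1, 0])     # the observed Bernoulli sample (n = 6)
theta = np.linspace(0.0, 1.0, 1001)  # grid over the parameter space S = [0, 1]

c, n = x.sum(), x.size               # c = count of 1s in the data
likelihood = theta**c * (1 - theta)**(n - c)   # p(x; theta) = theta^c (1-theta)^(n-c)

print(theta[np.argmax(likelihood)])  # ~0.333, matching the closed form c/n = 1/3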
6. Score function
The score function of the model:
S(θ, x) = ∂/∂θ ln p(x; θ) = (∂/∂θ p(x; θ)) / p(x; θ)
S(θ, x) indicates the relative change in likelihood,
i.e. the sensitivity of the log-likelihood to its parameter θ
Expected value and variance of the score
If X is not yet observed, the score S(θ, X) at θ is a random variable
What is its mean and variance?
The expected score is 0:
E[S(θ, X)] = ∫_Ω (∂/∂θ ln p(x; θ)) p(x; θ) dx = ∫_Ω ((∂/∂θ p(x; θ)) / p(x; θ)) p(x; θ) dx
=(REG) ∂/∂θ ∫_Ω p(x; θ) dx = ∂/∂θ 1 = 0
The variance of the score is called the Fisher information J(θ):
Var[S(θ, X)] = E[S^2(θ, X)] = E[(∂/∂θ ln p(X; θ))^2] ≜ J(θ)
7. Fisher Information
J(θ) is the variance of the score function S(θ, X),
averaged over all possible samples X in Ω
J(θ) is a metric for how much you can expect to learn from the sample X
about parameter θ
Property
J(θ) = E[(∂/∂θ ln p(X; θ))^2] = −E[∂^2/∂θ^2 ln p(X; θ)]
Proof:
The first equality is the definition of the Fisher information.
The second follows from E[S(θ, X)] = 0, ∀θ, which means that also:
0 = ∂/∂θ E[S(θ, X)] = ∂/∂θ ∫_Ω (∂/∂θ ln p) p dx
=(REG) ∫_Ω [(∂^2/∂θ^2 ln p) p + (∂/∂θ ln p)(∂/∂θ p)] dx
= ∫_Ω (∂^2/∂θ^2 ln p) p dx + ∫_Ω (∂/∂θ ln p)^2 p dx
= E[∂^2/∂θ^2 ln p] + E[(∂/∂θ ln p)^2] QED
(!) Note we assume sufficient ‘regularity’ (REG) of the likelihood function
p(x; θ), so that differentiation over θ and integration over x can be switched
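The identity can be verified numerically; a small sketch, assuming a single Bernoulli(θ) observation (for which J(θ) = 1/(θ(1 − θ))), where both expectations reduce to finite sums over x ∈ {0, 1}:

import numpy as np

theta = 0.3                                  # an arbitrary parameter value
xs = np.array([0.0, 1.0])                    # support of a Bernoulli observation
p = theta**xs * (1 - theta)**(1 - xs)        # p(x; theta)

score = xs/theta - (1 - xs)/(1 - theta)      # d/dtheta ln p(x; theta)
d2 = -xs/theta**2 - (1 - xs)/(1 - theta)**2  # d^2/dtheta^2 ln p(x; theta)

print(np.sum(score**2 * p))                  # E[score^2]           -> 4.7619
print(-np.sum(d2 * p))                       # -E[d^2 ln p]         -> 4.7619
print(1/(theta*(1 - theta)))                 # closed form J(theta) -> 4.7619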
8. Estimators for a parameter θ
Definition
An estimator ˆθ is a statistic Ω → S : x → ˆθ(x) (not depending on any unknown parameters!)
giving values that are hopefully ‘close’ to the true θ
! after observation, ˆθ(x) is a deterministic number
before observation, ˆθ(X) is a random variable
→ ˆθ is a shorthand notation for either, depending on the context
MEAN −→ bias
E[ˆθ − θ] = E[ˆθ(X)] − θ is the bias
if bias = 0 for all θ ∈ S −→ the estimator is ‘unbiased’
otherwise −→ the estimator is biased
9. Estimators for a parameter θ
VARIANCE −→ Mean Square Error
The variance of estimator ˆθ is the expected square deviation from E[ˆθ]:
Var[ˆθ] = E[(ˆθ(X) − E[ˆθ(X)])^2]
= E[((ˆθ − θ) − (E[ˆθ] − θ))^2]
= E[(ˆθ − θ)^2] − 2(E[ˆθ] − θ) E[ˆθ − θ] + (E[ˆθ] − θ)^2
= E[(ˆθ − θ)^2] (the MSE) − (E[ˆθ] − θ)^2 (the squared bias)
The Mean Square Error (MSE) is the expected square deviation from the true θ.
=⇒ MSE(ˆθ) = bias^2 + Var[ˆθ]
Minimum Variance and Unbiased estimator (MVU)
ˆθ is unbiased and has lower variance than any other unbiased estimator, for all θ ∈ S
−→ estimator is ‘MVU’
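The decomposition MSE = bias^2 + Var[ˆθ] above is easy to check by simulation; a sketch, where the shrunken sample mean and all numerical values are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 200_000          # true mean, sample size, replications

samples = rng.normal(theta, 1.0, size=(reps, n))
theta_hat = 0.9 * samples.mean(axis=1)     # a deliberately biased estimator

bias = theta_hat.mean() - theta            # ~ -0.2
var = theta_hat.var()                      # ~ 0.9^2 / n = 0.081
mse = np.mean((theta_hat - theta)**2)
print(mse, bias**2 + var)                  # the two agree up to Monte-Carlo noise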
10. Estimators for a parameter θ
Often, the asymptotic distribution of an estimator is of interest
−→ behaviour of ˆθ(X) when sample size n becomes very large?
An estimator ˆθ_n = ˆθ(X_1, . . . , X_n) of θ is consistent if and only if
ˆθ_n converges to θ (‘in probability’) for n → ∞, ∀θ ∈ S, i.e.
lim_{n→∞} Prob[|ˆθ_n − θ| > ε] = 0 , ∀ε > 0 , or plim_{n→∞} ˆθ_n = θ , ∀θ ∈ S
Consistency vs. bias, examples:
ˆθ_n = ¯X −→ unbiased and consistent
ˆθ_n = (X_1 + X_2 + X_3)/3 , n ≥ 3 −→ unbiased but not consistent
ˆθ_n = −1/n + (1/n) Σ_{i=1}^n X_i −→ biased but consistent
ˆθ_n = a ≠ θ −→ biased and not consistent
[Figure: sampling distributions of the four estimators around θ, for n = 1, 2, 3, 5, 10, 50]
11. Unbiased estimators: Cramer-Rao Lower Bound (CRLB)
There may be many plausible estimators ˆθ for θ. Which is the ‘best’?
Several criteria for a suitable estimator are possible,
but suppose we aim for an MVU estimator (unbiased and minimal MSE)
Lower bound for the MSE of unbiased estimators
Given the model p(x; θ), there is a lower bound on the MSE that any unbiased
estimator ˆθ can possibly achieve:
MSE(ˆθ(X)) ≥ 1/J(θ) −→ ‘Cramer-Rao Lower Bound’ (CRLB)
if ˆθ reaches this bound, MSE(ˆθ(X)) = 1/J(θ) −→ the estimator is ‘efficient’
the CRLB is the inverse of the Fisher information:
having a lot of information in the sample about the true θ (high J(θ)) allows
for estimators with very low variance
efficient ⇒ MVU, but MVU ⇏ efficient,
because the CRLB cannot always be reached by MVU estimators
12. Cramer-Rao Lower Bound (CRLB): proof
The Cauchy-Schwarz inequality, best known in Euclidean vector spaces R^n:
u = (u_1, . . . , u_n) ∈ R^n is an n-dimensional vector
||u|| = √(u_1^2 + . . . + u_n^2) is the Euclidean length of u
inner (dot) product: u · v = ||u|| ||v|| cos α = u_1 v_1 + . . . + u_n v_n , with cos α ∈ [−1, 1]
Cauchy-Schwarz: (u · v)^2 ≤ ||u||^2 ||v||^2 , equality iff u = kv
or: (Σ_i u_i v_i)^2 ≤ (Σ_i u_i^2)(Σ_i v_i^2) , equality iff u_i = k v_i , ∀i
If n → ∞, R^n becomes a Hilbert space or ‘function space’:
(∫ u(x)v(x) dx)^2 ≤ ∫ u(x)^2 dx · ∫ v(x)^2 dx , equality iff u(x) = k v(x), ∀x
14. Cramer-Rao Lower Bound (CRLB): proof
Take u(x) = [ˆθ(x) − θ] √p(x; θ) and v(x) = S(θ, x) √p(x; θ), so that
∫ u(x)v(x) dx = E[(ˆθ(X) − θ) S(θ, X)] = 1 for an unbiased ˆθ
(since E[ˆθ S] =(REG) ∂/∂θ E[ˆθ] = 1 and E[θ S] = θ · 0 = 0).
So due to the Cauchy-Schwarz inequality in Hilbert space:
1 = (∫ u(x)v(x) dx)^2 ≤ ∫ u(x)^2 dx · ∫ v(x)^2 dx = MSE(ˆθ) · J(θ)
which proves the theorem: MSE(ˆθ) = Var[ˆθ] ≥ 1/J(θ) QED
Efficient form
The bound becomes an equality if (and only if) u(x) = k v(x), i.e. iff
S(θ, x) = k(θ) [ˆθ(x) − θ] (‘efficient form’)
If the score function can be written as k(θ)[ˆθ − θ] for all θ ∈ S
−→ the estimator ˆθ is ‘efficient’
15. Example: estimate the variance of a normal population
Assume a zero-mean normal population: X ∼ N(0, σ^2).
How to estimate σ^2 (= θ) given only the data x = (x_1, . . . , x_n)?
Likelihood p(x; θ) = ∏_{i=1}^n (1/√(2πθ)) exp(−x_i^2 / 2θ)
Log-likelihood ln p(x; θ) = −n ln √(2π) − (n/2) ln θ − (1/2) Σ_{i=1}^n x_i^2 / θ
Score S(θ, x) = ∂/∂θ ln p(x; θ) = −n/(2θ) + (1/2) Σ_{i=1}^n x_i^2 / θ^2 = [n/(2θ^2)] [(1/n) Σ_{i=1}^n x_i^2 − θ]
with k(θ) = n/(2θ^2) and ˆθ(x) = (1/n) Σ_{i=1}^n x_i^2
The score function can be written in efficient form!
So ˆθ(x) = (1/n) Σ_{i=1}^n x_i^2 is an unbiased and efficient estimator for θ = σ^2
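A simulation sketch (with assumed values for σ^2 and n) confirming both claims; for this model the CRLB is 1/J(θ) = 2θ^2/n, since J(θ) = k(θ) = n/(2θ^2) for an efficient estimator:

import numpy as np

rng = np.random.default_rng(1)
sigma2, n, reps = 4.0, 20, 100_000         # true variance, sample size (assumed)

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
theta_hat = np.mean(x**2, axis=1)          # the estimator (1/n) sum x_i^2

print(theta_hat.mean())                    # ~4.0 : unbiased
print(theta_hat.var())                     # ~1.6 : equals the CRLB below
print(2*sigma2**2/n)                       # CRLB 1/J(theta) = 2 theta^2 / n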
16. Example: estimate intensity of a Poisson process
A Poisson process with intensity λ is a point process so that the times between
‘events’ are indep. and exponentially distributed with mean τ = 1/λ.
[Figure: timeline of length t with N = n events; inter-event times ∼ Expon(λ)]
The number of events N in an interval of length t is Poiss(λt)
Likelihood p(n; λ) = Prob[n events in interval of length t] = e^{−λt} (λt)^n / n!
Log-likelihood ln p(n; λ) = −λt + n ln(λt) − ln n!
Score S(λ, n) = ∂/∂λ ln p(n; λ) = −t + n/λ = (t/λ) [n/t − λ] , with k(λ) = t/λ and ˆλ(n) = n/t
This is efficient form, so ˆλ(n) = n/t is an unbiased efficient estimator for λ !
! However, the inverse 1/ˆλ = t/n is not an efficient estimator for τ:
p(n; τ) = e^{−t/τ} (t/τ)^n / n! , so that the score is S(τ, n) = −n/τ + t/τ^2 = (n/τ^2) [t/n − τ]
The factor n/τ^2 in front depends on the data n, so the score cannot be written in efficient form k(τ)[ˆτ(n) − τ],
and no unbiased efficient estimator for τ exists!
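A simulation sketch (assumed λ and t) checking that ˆλ = N/t is unbiased and attains the CRLB, which here is 1/J(λ) = λ/t:

import numpy as np

rng = np.random.default_rng(2)
lam, t, reps = 2.5, 10.0, 100_000          # intensity and window length (assumed)

n = rng.poisson(lam*t, size=reps)          # N ~ Poiss(lambda * t)
lam_hat = n / t                            # the efficient estimator

print(lam_hat.mean())                      # ~2.5  : unbiased
print(lam_hat.var(), lam/t)                # ~0.25 : attains the CRLB lambda/t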
17. Biased estimators
Should we always try to find unbiased estimators? No!
They may not exist
e.g. no unbiased estimator for 1/p from a Bern(p) population exists
They may be unreasonable
e.g. the MVU estimate of p from X ∼ Geom(p) is ˆp(X) = 1_{X=1}
this estimate is always 0 or 1
They may have extremely large variance (= MSE)
So unbiased estimators do not always minimise the MSE:
MSE(ˆθ) = bias^2 + Var[ˆθ]
−→ Sometimes it is better to sacrifice unbiasedness for lower variance
Minimising the MSE
We require:
the concept of sufficient statistics
the Rao-Blackwell theorem
18. Sufficient statistics
Recall: a statistic T(x) is any function of the sample data
not depending on unknown parameters
could also be vector-valued: T(x) : Ω → R^m , with m < n typically
A statistic T(x) is sufficient with respect to the model p(x; θ) if
p(x|T(x); θ) = p(x|T(x)), ∀x
i.e. if the distribution of X given that T(X) = t, is independent of θ
−→ “All you can learn about θ from the data X,
you can also learn from the statistic T(X)”
If X is a book in which θ is a character, then a summary T(X) is sufficient
if it gives all information about θ that is also in the book
Sufficiency can be checked using the Neyman-Fisher criterion
19. Sufficient statistics
Neyman-Fisher factorisation criterion
A statistic T(x) is sufficient with respect to the model p(x; θ)
⇔ p(x; θ) = g(x) · h(T(x), θ) , ∀x ∈ Ω
where g(x) is independent of θ, and h(T(x), θ) depends on x only through T(x)
Proof: (assuming X is discrete)
First note that if t = T(x), then “T(X) = t, X = x” and “X = x” are the same event!
−→ p(x; θ) = p(x, t; θ)
⇒ : p(x; θ) = p(x, t; θ) = p(x|t; θ) · p(t; θ) =(sufficiency) p(x|t) · p(t; θ) ,
so take g(x) = p(x|t) and h(t, θ) = p(t; θ)
⇐ : p(x|t; θ) = p(x, t; θ) / p(t; θ) = p(x, t; θ) / Σ_{x′} p(x′, t; θ) 1_{T(x′)=t}
= g(x) h(t, θ) / Σ_{x′} g(x′) h(t, θ) 1_{T(x′)=t} = g(x) / Σ_{x′} g(x′) 1_{T(x′)=t} ,
independent of θ −→ sufficiency QED
20. Example: Sample mean for Bernoulli population
Assume again a Bernoulli population: X ∼ Bern(θ),
i.e. p(x; θ) = θ^x (1 − θ)^{1−x} for x ∈ {0, 1}
Sample size n
Take as statistic the sample mean T(X) = ¯X = (1/n) Σ_{i=1}^n X_i = C/n , with C the count of 1s in the sample
p(x; θ) = ∏_{i=1}^n p(x_i; θ) = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i} = θ^{Σ x_i} (1 − θ)^{n − Σ x_i} = θ^{nT(x)} (1 − θ)^{n−nT(x)} · 1
with h(T(x), θ) = θ^{nT(x)} (1 − θ)^{n−nT(x)} and g(x) = 1
Neyman-Fisher checks out, so the sample mean is a sufficient statistic for θ
−→ T(x) is also efficient, since S(θ, x) = [n/(θ(1 − θ))] [T(x) − θ]
21. Rao-Blackwellisation of an estimator
Rao-Blackwell Theorem
For the model p(x; θ), let ˆθ(x) be an estimator for θ so that Var[ˆθ] exists.
If T(x) is a sufficient statistic, then for the new estimator ˆθ*(t) = E[ˆθ(X) | T(X) = t]:
1) the new estimator ˆθ* is a statistic, i.e. does not depend on θ
2) if ˆθ is unbiased, then ˆθ* is also unbiased
3) MSE(ˆθ*) ≤ MSE(ˆθ) −→ so the new estimator may be ‘better’!
4) MSE(ˆθ*) = MSE(ˆθ) iff ˆθ(x) depends on x only through T(x)
Process of improving existing estimators is called ‘Rao-Blackwellisation’
The process is idempotent: repeating it will give no further improvement
The proof is essentially based on the law of total expectation:
Let f(t) = E[ · | T = t], then E[ · ] = E_T[f(T)] = E_T[E_X[ · | T]]
(inner expectation over all X for which T(X) is fixed, outer expectation over all T)
22. Rao-Blackwellisation of an estimator
Proof:
1) ˆθ* is a statistic because of the sufficiency of T(x):
−→ ˆθ*(t) = Σ_x ˆθ(x) p(x|t; θ) is independent of θ, since p(x|t; θ) = p(x|t)
2) θ =(unbiasedness) E[ˆθ] = E_T[E_X[ˆθ(X)|T]] = E_T[ˆθ*(T)] = E[ˆθ*]
3) Since both estimators are unbiased, their MSE equals their variance, so
MSE(ˆθ) − MSE(ˆθ*) = Var[ˆθ] − Var[ˆθ*] = E[ˆθ^2] − (E[ˆθ])^2 − E[ˆθ*^2] + (E[ˆθ*])^2
= E[ˆθ^2(X)] − E[ˆθ*^2(T)] (the squared means both equal θ^2 and cancel)
= E_T[ E_X[ˆθ^2(X)|T] − ˆθ*^2(T) ] = E_T[ E_X[ˆθ^2(X)|T] − (E_X[ˆθ(X)|T])^2 ] = E_T[ Var[ˆθ(X)|T] ] ≥ 0
4) The inequality is an equality iff Var[ˆθ(X)|T = t] = 0, ∀t
−→ given T(X) = t, ˆθ is fixed,
so ˆθ(x) only depends on x through T(x)
23. Example: estimate maximum of uniform distribution
Observe X_1, . . . , X_n ∼ Unif(0, a); how to estimate the upper bound a?
[Figure: sample points x_2, x_4, x_1, x_3 in (0, a), with ¯x and max(x) = t marked]
Original (naive) estimator: since E[X_i] = a/2, one could propose
ˆa(x) = 2¯x = (2/n) Σ_{i=1}^n x_i −→ E[ˆa] = a , MSE(ˆa) = a^2/(3n) (exercise)
T(x) = max(x) is sufficient for a since Neyman-Fisher checks out:
p(x; a) = ∏_{i=1}^n (1/a) 1_{0 ≤ x_i ≤ a} = (1/a^n) 1_{T(x) ≤ a} · ∏_{i=1}^n 1_{0 ≤ x_i}
Rao-Blackwell new estimator: (suppose n > 1)
ˆa*(t) = E[ˆa(X) | T(X) = t] = E[ (2/n)(Σ_{i=1}^{n−1} X_i + t) | T(X) = t ] = 2t/n + (n−1) t/n = [(n+1)/n] t = [(n+1)/n] max(x)
(given max(X) = t, the other n − 1 observations are iid Unif(0, t), each with mean t/2)
−→ E[ˆa*] = a , MSE(ˆa*) = a^2 / (n(n+2)) (exercise)
We find that indeed, MSE(ˆa*) < MSE(ˆa) , ∀n > 1
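A simulation sketch (assumed a and n) comparing both estimators against the stated MSEs:

import numpy as np

rng = np.random.default_rng(3)
a, n, reps = 5.0, 8, 200_000               # true upper bound, sample size (assumed)

x = rng.uniform(0, a, size=(reps, n))
a_naive = 2 * x.mean(axis=1)               # a_hat  = 2 x_bar
a_rb = (n + 1)/n * x.max(axis=1)           # a_hat* = (n+1)/n max(x)

print(np.mean((a_naive - a)**2), a**2/(3*n))      # ~1.04 vs a^2/(3n)
print(np.mean((a_rb - a)**2), a**2/(n*(n + 2)))   # ~0.31 vs a^2/(n(n+2))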
24. The Maximum-likelihood Estimator
For a model p(x; θ), the maximum-likelihood estimator ˆθ_ML (MLE) for θ is the
value of θ for which the model produces the highest probability of observing the sample X = x:
ˆθ_ML(x) = arg max_{θ ∈ S} p(x; θ)
[Figure: likelihood p(x; θ) as a function of θ, with its maximum at ˆθ_ML]
Finding ˆθ_ML is a maximisation problem:
∂/∂θ p(x; θ) = 0 ⇒ ∂/∂θ ln p(x; θ) = 0 ⇒ S(θ, x) = 0 (the score function)
−→ so it involves finding zeroes of the score function
usually requires numerical (search) algorithms
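In practice, one often hands the negative log-likelihood to a generic optimiser; a sketch for the Bernoulli sample of the earlier example (scipy's bounded scalar minimiser is just one possible choice):

import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0, 0, 1, 0, 1, 0])            # sample from the Bernoulli example

def neg_log_lik(theta):
    # -ln p(x; theta); minimising this maximises the likelihood
    c, n = x.sum(), x.size
    return -(c*np.log(theta) + (n - c)*np.log(1 - theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)                                # ~1/3, the closed-form MLE c/n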
25. The Maximum-likelihood Estimator
Properties
Any unbiased efficient estimator ˆθ is also the MLE:
the score has efficient form S(θ, x) = k(θ) [ˆθ(x) − θ], so S(ˆθ, x) = 0 −→ ˆθ is the MLE
The converse is not true: not all MLEs are efficient
Under some regularity conditions however, for increasing sample size n → ∞, the MLE
is consistent: plim_{n→∞} ˆθ_{ML,n} = θ
is asymptotically efficient: lim_{n→∞} Var[ˆθ_{ML,n}] / (1/(nJ(θ))) = 1
is asymptotically normal: ˆθ_{ML,n} −→ N(θ, 1/(nJ(θ))) as n → ∞
(here J(θ) denotes the Fisher information of a single observation)
26. EM algorithm (Expectation/Maximisation) for finding MLE
Observed data vs. complete data
The log-likelihood ln p(x; θ) may be a complicated function of θ, so that
finding arg max_θ ln p(x; θ) −→ is difficult
But in the case where the observed data x is only part of the underlying complete data (x, y),
with x observed and y hidden,
often the complete-data log-likelihood problem
finding arg max_θ ln p(x, y; θ) −→ is easy
EM-algorithm
Numerical search algorithm: ˆθ_0 →(E, M)→ ˆθ_1 →(E, M)→ ˆθ_2 →(E, M)→ . . . −→ ˆθ_ML
Guaranteed to converge to a local maximum of the likelihood
27. EM algorithm
p(x, y; θ) = p(x; θ) p(y|x; θ) −→ ln p(x; θ) = ln p(x, y; θ) − ln p(y|x; θ)
where ln p(x; θ) is the observed LL (max is difficult), ln p(x, y; θ) the complete LL (max is easy),
and ln p(y|x; θ) involves the hidden data, conditional on x
EM approaches the argmax of the observed LL by iteratively maximising the complete LL:
E-step (expectation)
So we need to maximise ln p(x, y; θ) . . . but how, if y is unknown!?
Trick 1: Replace the complete LL by its expected value:
L_x(θ) = E[ln p(x, Y; θ)] = ∫ ln p(x, y; θ) p(y|x; θ) dy
Trick 2: Use the current estimate ˆθ_k of θ to fix the distribution of the hidden data
−→ Replace p(y|x; θ) by p(y|x; ˆθ_k) and calculate
L_x(θ|ˆθ_k) = ∫ ln p(x, y; θ) p(y|x; ˆθ_k) dy
M-step (maximisation)
The next estimate of θ is: ˆθ_{k+1} ← arg max_θ L_x(θ|ˆθ_k)
28. EM algorithm
It can be shown that, for the observed LL:
ln p(x; ˆθ_{k+1}) ≥ ln p(x; ˆθ_k)
So if the likelihood has a local maximum, the EM-algorithm will converge to it
In fact, the EM-algorithm is especially useful when the parameter to be estimated
is a vector θ = (θ_1, . . . , θ_h), so that the ‘search space’ S is very large.
29. Example: censored data
An electricity company has a power line to a part of the city with fluctuating daily demand. It
is known/assumed that the demand W of one day, measured in MWh, is N(µ, 1). That is,
the variance is known (σ = 1 MWh) but the mean is not.
To estimate the mean daily power demand µ = E[W], the company asks n = 5 employees to
measure the power, on 5 different days and each with a different power meter. Unfortunately,
the meters have a limited range ri, i = 1, . . . , n. If Wi > ri, the meter fails (×) and does not
give a reading.
employee (i)   meter range r_i (MWh)   measurement x_i (MWh)
1              7                       ×
2              5                       4.2
3              8                       ×
4              6                       4.7
5              10                      6.9
−→ We try to find the MLE for µ. Note ¯x = (4.2 + 4.7 + 6.9)/3 ≈ 5.27
30. Example: censored data
Direct maximisation of the observed LL
Suppose the first m ≤ n measurements succeeded, x = (x_1, . . . , x_m) (observed),
and the rest failed, Y = (Y_{m+1}, . . . , Y_n) (hidden) −→ Y_i > r_i , m < i ≤ n
p(x; µ) = ∏_{i=1}^m φ(x_i − µ) · ∏_{i=m+1}^n [1 − Φ(r_i − µ)]
ℓ_obs(µ) = ln p(x; µ) = −(m/2) ln(2π) − Σ_{i=1}^m (1/2)(x_i − µ)^2 + Σ_{i=m+1}^n ln[1 − Φ(r_i − µ)]
ˆµ_ML satisfies ℓ′_obs(µ) = 0, or:
m(µ − ¯x) = Σ_{i=m+1}^n φ(r_i − µ) / [1 − Φ(r_i − µ)]
a transcendental equation, difficult to solve; can only be done numerically
−→ so let us use the EM algorithm instead!
[Figure: observed LL ℓ_obs(µ) = ln p(x; µ) plotted against µ; the maximum can be found using numerical techniques; ¯x is marked on the µ-axis]
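As a sketch of such a numerical technique, the observed LL can be maximised directly with a bounded scalar optimiser (the search bounds are an assumption):

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

x = np.array([4.2, 4.7, 6.9])              # successful measurements (m = 3)
r = np.array([7.0, 8.0])                   # ranges of the failed meters

def neg_obs_ll(mu):
    # -ln p(x; mu) for the censored-normal model with sigma = 1
    return -(norm.logpdf(x - mu).sum() + np.log(1 - norm.cdf(r - mu)).sum())

res = minimize_scalar(neg_obs_ll, bounds=(0.0, 20.0), method="bounded")
print(res.x)                                # ML estimate of mu, above x_bar = 5.27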
31. Example: censored data
E-step
The complete LL is
ln p(x, Y; µ) = −(n/2) ln(2π) − (1/2) Σ_{i=1}^m (x_i − µ)^2 − (1/2) Σ_{i=m+1}^n (Y_i − µ)^2
1: Replace the LL by its expected value:
E[ln p(x, Y; µ)] = −(1/2) Σ_{i=1}^m (x_i − µ)^2 − (1/2) Σ_{i=m+1}^n E[(Y_i − µ)^2] + c (some constant indep. of µ)
E[(Y_i − µ)^2] = ∫_{r_i}^∞ (y − µ)^2 p(y|x; µ) dy , with p(y_i; µ) = φ(y_i − µ) / [1 − Φ(r_i − µ)] for y_i > r_i
2: . . . and use the current estimate ˆµ_k for the distribution of the hidden data:
E_{ˆµ_k}[(Y_i − µ)^2] = ∫_{r_i}^∞ (y − µ)^2 p(y; ˆµ_k) dy = ∫_{r_i}^∞ (y^2 − 2yµ + µ^2) p(y; ˆµ_k) dy
= −2µ ∫_{r_i}^∞ y p(y; ˆµ_k) dy + µ^2 ∫_{r_i}^∞ p(y; ˆµ_k) dy + c
where ∫_{r_i}^∞ y p(y; ˆµ_k) dy = E_{ˆµ_k}[Y_i] = E_{ˆµ_k}[W | W > r_i] = ˆµ_k + φ(r_i − ˆµ_k)/[1 − Φ(r_i − ˆµ_k)]
and ∫_{r_i}^∞ p(y; ˆµ_k) dy = 1
= −2µ [ ˆµ_k + φ(r_i − ˆµ_k)/(1 − Φ(r_i − ˆµ_k)) ] + µ^2 + c
32. Example: censored data
M-step
L_x(µ|ˆµ_k) = −(1/2) Σ_{i=1}^m (x_i − µ)^2 − (1/2) Σ_{i=m+1}^n (−2µ [ˆµ_k + φ(r_i − ˆµ_k)/(1 − Φ(r_i − ˆµ_k))] + µ^2) + c
L′_x(µ|ˆµ_k) = 0 ⇔ m¯x − nµ + (n − m)ˆµ_k + Σ_{i=m+1}^n φ(r_i − ˆµ_k)/(1 − Φ(r_i − ˆµ_k)) = 0
So we update:
ˆµ_{k+1} ← (m/n) ¯x + [(n − m)/n] ˆµ_k + (1/n) Σ_{i=m+1}^n φ(r_i − ˆµ_k)/(1 − Φ(r_i − ˆµ_k))
[Figure: observed LL ℓ_obs(µ) = ln p(x; µ) vs. µ, with the iterates ˆµ_0, ˆµ_1, ˆµ_2 marked]
started with ˆµ_0 = ¯x
convergence is very fast: only 2 or 3 iterations required here
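The update rule translates directly into code; a sketch of the iteration for this example, with σ = 1 known:

import numpy as np
from scipy.stats import norm

x = np.array([4.2, 4.7, 6.9])              # observed measurements (m = 3)
r = np.array([7.0, 8.0])                   # ranges of the censored meters
m, n = x.size, x.size + r.size
xbar = x.mean()

mu = xbar                                  # start with mu_0 = x_bar
for k in range(5):
    hazard = norm.pdf(r - mu) / (1 - norm.cdf(r - mu))   # phi/(1 - Phi) terms
    mu = (m/n)*xbar + ((n - m)/n)*mu + hazard.sum()/n    # the EM update above
    print(k + 1, mu)                       # converges after only a few steps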
33. Example: censored data
What if σ is also unknown!?
No problem, the EM-algorithm can be used to approximate θ = (µ, σ^2):
ˆµ_{k+1} ← (m/n) ¯x + [(n − m)/n] ˆµ_k + (1/n) Σ_{i=m+1}^n ˆσ_k φ((r_i − ˆµ_k)/ˆσ_k) / [1 − Φ((r_i − ˆµ_k)/ˆσ_k)]
ˆσ^2_{k+1} ← (1/n) Σ_{i=1}^m x_i^2 + [(n − m)/n] (ˆµ_k^2 + ˆσ_k^2) + (1/n) Σ_{i=m+1}^n ˆσ_k (ˆµ_k + r_i) φ((r_i − ˆµ_k)/ˆσ_k) / [1 − Φ((r_i − ˆµ_k)/ˆσ_k)] − ˆµ_{k+1}^2
(the last term −ˆµ_{k+1}^2 converts the expected second moment into a variance)
[Figure: EM iterates (ˆµ_k, ˆσ_k) in the (µ, σ)-plane converging to (ˆµ_ML, ˆσ_ML); the observed LL at the optimum is −5.91]
started with ˆµ_0 = ¯x, ˆσ_0^2 = 1
convergence is again very fast: only 6 or 7 iterations required here
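A sketch of the two-parameter iteration on the same data, using the truncated-normal moments E[W | W > r] and E[W^2 | W > r] for the E-step (the fixed iteration count is an arbitrary choice):

import numpy as np
from scipy.stats import norm

x = np.array([4.2, 4.7, 6.9])              # observed measurements (m = 3)
r = np.array([7.0, 8.0])                   # ranges of the censored meters
m, n = x.size, x.size + r.size

mu, s2 = x.mean(), 1.0                     # start with mu_0 = x_bar, sigma_0^2 = 1
for k in range(15):
    s = np.sqrt(s2)
    lam = norm.pdf((r - mu)/s) / (1 - norm.cdf((r - mu)/s))  # hazard terms
    ey = mu + s*lam                        # E[Y_i]   = E[W | W > r_i]
    ey2 = mu**2 + s2 + s*(mu + r)*lam      # E[Y_i^2] = E[W^2 | W > r_i]
    mu = (x.sum() + ey.sum())/n            # same updates as on the slide
    s2 = (np.sum(x**2) + ey2.sum())/n - mu**2
    print(k + 1, mu, np.sqrt(s2))          # converges in a handful of iterations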