FACULTY OF ENGINEERING AND ARCHITECTURE
Mathematical Techniques in Engineering Science
Module Statistics
Lecture 7+8
Estimation of parameters:
Fisher estimation
Bayesian estimation
Stijn De Vuyst
30 November 2016
STAT 7+8 Statistics Lecture 7+8 1
Statistics Lecture 7+8
Fisher estimation
Likelihood function
Score function
Fisher information
MSE: bias and variance
Unbiased estimators: Cramer-Rao Lower Bound
Biased estimators
Sufficient statistics
Rao-Blackwellisation
Maximum-likelihood estimator
The EM algorithm
Example: censored data
Bayesian estimation
STAT 7+8 Statistics Lecture 7+8 2
Estimation of parameters: two approaches
population X
parameter θ
sample x
estimate ˆθ
Classical framework
Developed in the 1920s and 1930s by Ronald Fisher,
Karl Pearson, Jerzy Neyman, . . .
Later also C.R. Rao, H. Cramér,
Egon Pearson, D. Blackwell, . . .
θ is unknown, but deterministic
θ ∈ S, the parameter space
Bayesian framework
18th century concepts by Thomas Bayes and Pierre-Simon Laplace
Huge following after 1950s due to availability of computer-intensive methods
θ is an unknown realisation
of a random variable Θ
Θ ∈ S
STAT 7+8 Statistics Lecture 7+8 3
Classical setting: Fisher estimation
X: population,
system, process, . . .
parameter θ
θ is a scalar here,
but could also be
a vector θ in some
parameter space S
X: data,
observations,
sample, . . .
estimate ˆθ
The sample
n independent members taken from the population
(n is the sample size)
X = (X1, X2, . . . , Xn) before observation
x = (x1, x2, . . . , xn) after observation
$\mathbf{X} \in \Omega$
$\Omega = \mathbb{R}^n$ for real-valued populations, $\Omega = \{0,1\}^n$ for Bernoulli populations, . . .
The ‘model’: likelihood function
p(x; θ) = Prob[observe X = x if true parameter is θ]
p(x; θ) is called the likelihood function, ln p(x; θ) the log-likelihood
−→ can be either a density (X continuous) or a mass function (X discrete)
STAT 7+8 Fisher Likelihood 4
Example: likelihood function for a Bernoulli population
Assume a Bernoulli population: X ∼ Bern(θ),
i.e. X = 1 with probability θ and X = 0 otherwise
The observed sample (n = 6) is x = (0, 0, 1, 0, 1, 0)
Likelihood $p(\mathbf{x};\theta) = \prod_{i=1}^{6} p(x_i\mid\theta) = (1-\theta)^4\,\theta^2$, with $\theta \in S = [0,1]$
[Figure: the likelihood $(1-\theta)^4\theta^2$ plotted as a function of $\theta\in[0,1]$, with its maximum at $\hat\theta_{\mathrm{ML}} = 1/3$]
Maximum-likelihood estimate for parameter θ
$$\hat\theta_{\mathrm{ML}} = \arg\max_\theta p(\mathbf{x};\theta) = \frac{\text{count of 1s in the data}}{n} = \frac{c}{n} = \frac{2}{6} = \frac{1}{3}$$
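Not part of the slides: a minimal numerical check of this example (assuming NumPy is available), evaluating the Bernoulli likelihood on a grid of θ values and picking the maximiser.

```python
import numpy as np

# Sample from the slide: n = 6 Bernoulli observations containing two 1s
x = np.array([0, 0, 1, 0, 1, 0])

def likelihood(theta, x):
    # p(x; theta) = prod_i theta^{x_i} (1 - theta)^{1 - x_i}
    return np.prod(theta**x * (1.0 - theta)**(1 - x))

grid = np.linspace(0.0, 1.0, 10001)                  # candidate theta values
values = np.array([likelihood(t, x) for t in grid])

print(grid[np.argmax(values)])   # ~0.3333, i.e. the closed form c/n = 2/6
print(values.max())              # ~0.0219, the peak of (1-theta)^4 theta^2
```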
STAT 7+8 Fisher Likelihood 5
Score function
The score function of the model
$$S(\theta,\mathbf{x}) = \frac{\partial}{\partial\theta}\ln p(\mathbf{x};\theta) = \frac{\frac{\partial}{\partial\theta}p(\mathbf{x};\theta)}{p(\mathbf{x};\theta)}$$
S(θ, x) indicates the relative change in likelihood, i.e. the sensitivity of the log-likelihood to its parameter θ
Expected value and variance of the score
If X is not yet observed, the score S(θ, X) at θ is a random variable
What is its mean and variance?
The expected score is 0
$$E[S(\theta,\mathbf{X})] = \int_\Omega \frac{\partial}{\partial\theta}\ln p(\mathbf{x};\theta)\; p(\mathbf{x};\theta)\,d\mathbf{x} = \int_\Omega \frac{\frac{\partial}{\partial\theta}p(\mathbf{x};\theta)}{p(\mathbf{x};\theta)}\; p(\mathbf{x};\theta)\,d\mathbf{x} \overset{\text{REG}}{=} \frac{\partial}{\partial\theta}\int_\Omega p(\mathbf{x};\theta)\,d\mathbf{x} = \frac{\partial}{\partial\theta}\,1 = 0$$
The variance of the score is called the Fisher information J(θ)
$$\mathrm{Var}[S(\theta,\mathbf{X})] = E[S^2(\theta,\mathbf{X})] = E\!\left[\left(\frac{\partial}{\partial\theta}\ln p(\mathbf{X};\theta)\right)^2\right] \triangleq J(\theta)$$
STAT 7+8 Fisher Score 6
Fisher Information
J(θ) is the variance of the score function S(θ, X),
averaged over all possible samples X in Ω
J(θ) is a metric for how much you can expect to learn from the sample X
about parameter θ
Property
$$J(\theta) = E\!\left[\left(\frac{\partial}{\partial\theta}\ln p(\mathbf{X};\theta)\right)^2\right] = -E\!\left[\frac{\partial^2}{\partial\theta^2}\ln p(\mathbf{X};\theta)\right]$$
Proof:
The first equality is due to the definition of Fisher information.
The second follows from E[S(θ, X)] = 0, ∀θ, which means that also:
$$0 = \frac{\partial}{\partial\theta}E[S(\theta,\mathbf{X})] = \frac{\partial}{\partial\theta}\int_\Omega \left(\frac{\partial}{\partial\theta}\ln p\right) p\,d\mathbf{x} \overset{\text{REG}}{=} \int_\Omega \left[\left(\frac{\partial^2}{\partial\theta^2}\ln p\right) p + \left(\frac{\partial}{\partial\theta}\ln p\right)\frac{\partial}{\partial\theta}p\right] d\mathbf{x}$$
$$= \int_\Omega \left(\frac{\partial^2}{\partial\theta^2}\ln p\right) p\,d\mathbf{x} + \int_\Omega \left(\frac{\partial}{\partial\theta}\ln p\right)^2 p\,d\mathbf{x} = E\!\left[\frac{\partial^2}{\partial\theta^2}\ln p\right] + E\!\left[\left(\frac{\partial}{\partial\theta}\ln p\right)^2\right] \qquad \text{QED}$$
(!) Note we assume sufficient ‘regularity’ (REG) of the likelihood function
p(x; θ), so that differentiation over θ and integration over x can be switched
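A quick numerical illustration (not in the original slides; assuming NumPy): for a single Bernoulli observation the two expressions for the Fisher information coincide and equal $1/(\theta(1-\theta))$.

```python
import numpy as np

theta = 0.3                         # an arbitrary parameter value

def score(x, theta):                # d/dtheta ln p(x; theta) for Bern(theta)
    return x / theta - (1 - x) / (1 - theta)

def d2_logp(x, theta):              # d^2/dtheta^2 ln p(x; theta)
    return -x / theta**2 - (1 - x) / (1 - theta)**2

xs = np.array([0, 1])
px = np.array([1 - theta, theta])   # exact distribution of a single observation

J_score   = np.sum(px * score(xs, theta)**2)     # E[S^2]
J_hessian = -np.sum(px * d2_logp(xs, theta))     # -E[d^2 ln p / dtheta^2]
print(J_score, J_hessian, 1 / (theta * (1 - theta)))   # all three agree
```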
STAT 7+8 Fisher Information 7
Estimators for a parameter θ
Definition
An estimator ˆθ is a statistic Ω → S : x → ˆθ(x) (not depending on any unknown parameters!)
giving values that are hopefully ‘close’ to the true θ
! after observation, ˆθ(x) is a deterministic number
before observation, ˆθ(X) is a random variable
→ ˆθ is a shorthand notation for either, depending on the context
MEAN→ bias
E[ˆθ − θ] = E[ˆθ(X)] − θ is the bias
if bias = 0 for all θ ∈ S −→ estimator is ‘unbiased’
if estimator is not unbiased −→ estimator is biased
STAT 7+8 Fisher MSE: bias and variance 8
Estimators for a parameter θ
VARIANCE→ Mean Square Error
The variance of estimator ˆθ is the expected square deviation from E[ˆθ]:
$$\mathrm{Var}[\hat\theta] = E\Big[\big(\hat\theta(\mathbf{X}) - E[\hat\theta(\mathbf{X})]\big)^2\Big] = E\Big[\big((\hat\theta-\theta) - (E[\hat\theta]-\theta)\big)^2\Big] = E[(\hat\theta-\theta)^2] - 2\big(E[\hat\theta]-\theta\big)E[\hat\theta-\theta] + \big(E[\hat\theta]-\theta\big)^2$$
$$= \underbrace{E[(\hat\theta-\theta)^2]}_{\text{MSE}} - \underbrace{\big(E[\hat\theta]-\theta\big)^2}_{\text{bias}^2}$$
The Mean Square Error (MSE) is the expected square deviation from the true θ.
$$\Longrightarrow\ \mathrm{MSE}(\hat\theta) = \text{bias}^2 + \mathrm{Var}[\hat\theta]$$
Minimum Variance and Unbiased estimator (MVU)
ˆθ is unbiased and has lower variance than any other unbiased estimator, for all θ ∈ S
−→ estimator is ‘MVU’
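A small Monte Carlo sketch of this decomposition (not from the slides; assuming NumPy, and using a hypothetical shrunk estimator $(c+1)/(n+2)$ next to the unbiased $c/n$ for a Bernoulli sample):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 10, 200_000
c = rng.binomial(n, theta, size=reps)        # number of 1s in each simulated sample

for name, est in (("c/n", c / n), ("(c+1)/(n+2)", (c + 1) / (n + 2))):
    bias = est.mean() - theta
    var = est.var()
    mse = np.mean((est - theta)**2)
    print(f"{name:12s} bias={bias:+.4f} var={var:.5f} bias^2+var={bias**2+var:.5f} mse={mse:.5f}")
# bias^2 + var reproduces the MSE; the biased estimator here even has the lower MSE
```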
STAT 7+8 Fisher MSE: bias and variance 9
Estimators for a parameter θ
Often, the asymptotic distribution of an estimator is of interest
−→ behaviour of ˆθ(X) when sample size n becomes very large?
An estimator ˆθn = ˆθ(X1, . . . , Xn) of θ is consistent if and only if
ˆθn converges to θ (‘in probability’) for n → ∞, ∀θ ∈ S, i.e.
$$\lim_{n\to\infty}\mathrm{Prob}\big[|\hat\theta_n - \theta| > \varepsilon\big] = 0\quad \forall\varepsilon > 0, \qquad\text{or}\qquad \operatorname*{plim}_{n\to\infty}\hat\theta_n = \theta\quad \forall\theta\in S$$
Consistency vs. bias, examples:
$\hat\theta_n = \bar X$ : unbiased and consistent
$\hat\theta_n = \dfrac{X_1 + X_2 + X_3}{3}$ (for $n \geq 3$) : unbiased but not consistent
$\hat\theta_n = -\dfrac{1}{n} + \dfrac{1}{n}\sum_{i=1}^{n} X_i$ : biased but consistent
$\hat\theta_n = a \neq \theta$ : biased and not consistent
[Figure: sampling distributions of $\hat\theta_n$ for n = 1, 2, 3, 5, 10, 50 in each of the four cases]
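A simulation sketch of the third case (biased but consistent), not part of the slides; it assumes a Normal(θ, 1) population purely for illustration and requires NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, reps = 2.0, 20_000               # hypothetical true mean

# Estimator from the slide: -1/n + (1/n) sum_i X_i  (bias -1/n, vanishing with n)
for n in (1, 2, 3, 5, 10, 50, 500):
    x = rng.normal(theta, 1.0, size=(reps, n))
    est = -1.0 / n + x.mean(axis=1)
    p_off = np.mean(np.abs(est - theta) > 0.1)
    print(f"n={n:4d}  E[est]={est.mean():.3f}  Prob(|est-theta|>0.1)={p_off:.3f}")
# the bias and the tail probability both go to 0 as n grows: biased but consistent
```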
STAT 7+8 Fisher MSE: bias and variance 10
Unbiased estimators: Cramer-Rao Lower Bound (CRLB)
There may be many plausible estimators ˆθ for θ.
Which is the ‘best’?
Several criteria for a suitable estimator are possible,
but suppose we aim for an MVU estimator (unbiased and minimal MSE)
Lower bound for the MSE of unbiased estimators
Given the model p(x; θ), there is a lower bound on the MSE that any unbiased
estimator ˆθ can possibly achieve:
$$\mathrm{MSE}(\hat\theta(\mathbf{X})) \;\geq\; \frac{1}{J(\theta)} \qquad \longrightarrow\ \text{‘Cramer-Rao Lower Bound’ (CRLB)}$$
if ˆθ reaches this bound, MSE(ˆθ(X)) = 1/J(θ) −→ estimator is ‘efficient’
the CRLB is the inverse of the Fisher information
having a lot of information in the sample about the true θ (high J(θ)) allows
for estimators with very low variance
efficient ⇒ MVU, but MVU ⇏ efficient,
because the CRLB cannot always be reached by MVU estimators
STAT 7+8 Fisher CRLB 11
Cramer-Rao Lower Bound (CRLB): proof
The Cauchy-Schwarz inequality, best known in Euclidean vector spaces $\mathbb{R}^n$
$\mathbf{u} = (u_1,\dots,u_n)\in\mathbb{R}^n$ is an n-dimensional vector
$\|\mathbf{u}\| = \sqrt{u_1^2 + \dots + u_n^2}$ is the Euclidean length of $\mathbf{u}$
inner (dot) product: $\mathbf{u}\cdot\mathbf{v} = \|\mathbf{u}\|\,\|\mathbf{v}\|\underbrace{\cos\alpha}_{\in[-1,1]} = u_1v_1 + \dots + u_nv_n$
Cauchy-Schwarz: $(\mathbf{u}\cdot\mathbf{v})^2 \leq \|\mathbf{u}\|^2\,\|\mathbf{v}\|^2$, equality iff $\mathbf{u} = k\mathbf{v}$
or: $\Big(\sum_i u_iv_i\Big)^2 \leq \Big(\sum_i u_i^2\Big)\Big(\sum_i v_i^2\Big)$, equality iff $u_i = kv_i\ \forall i$
If $n\to\infty$, $\mathbb{R}^n$ becomes a Hilbert space or ‘function space’:
$\Big(\int u(x)v(x)\,dx\Big)^2 \leq \int u(x)^2\,dx \int v(x)^2\,dx$, equality iff $u(x) = kv(x)\ \forall x$
STAT 7+8 Fisher CRLB 12
Cramer-Rao Lower Bound (CRLB): proof
ˆθ(x) is an unbiased estimator for θ, so E[ˆθ(X) − θ] = 0
$$\Rightarrow\ 0 = \frac{\partial}{\partial\theta}E[\hat\theta(\mathbf{X})-\theta] = \frac{\partial}{\partial\theta}\int \big(\hat\theta(\mathbf{x})-\theta\big)\,p(\mathbf{x};\theta)\,d\mathbf{x} \overset{\text{REG}}{=} \int \frac{\partial}{\partial\theta}\Big[\big(\hat\theta(\mathbf{x})-\theta\big)\,p(\mathbf{x};\theta)\Big]\,d\mathbf{x}$$
$$= \underbrace{\int (0-1)\,p(\mathbf{x};\theta)\,d\mathbf{x}}_{-1} + \int \big(\hat\theta(\mathbf{x})-\theta\big)\underbrace{\frac{\partial}{\partial\theta}p(\mathbf{x};\theta)}_{p(\mathbf{x};\theta)\,S(\theta,\mathbf{x})}\,d\mathbf{x}$$
$$\Rightarrow\ 1 = \int_\Omega \underbrace{\big(\hat\theta(\mathbf{x})-\theta\big)\sqrt{p(\mathbf{x};\theta)}}_{u(\mathbf{x})}\;\underbrace{\sqrt{p(\mathbf{x};\theta)}\,S(\theta,\mathbf{x})}_{v(\mathbf{x})}\,d\mathbf{x} = \int_\Omega u(\mathbf{x})\,v(\mathbf{x})\,d\mathbf{x}$$
In particular, for these two functions:
$$\int u(\mathbf{x})^2\,d\mathbf{x} = \int \big(\hat\theta(\mathbf{x})-\theta\big)^2 p(\mathbf{x};\theta)\,d\mathbf{x} = E[(\hat\theta-\theta)^2] = \mathrm{MSE}(\hat\theta)$$
$$\int v(\mathbf{x})^2\,d\mathbf{x} = \int S^2(\theta,\mathbf{x})\,p(\mathbf{x};\theta)\,d\mathbf{x} = E[S^2(\theta,\mathbf{X})] = J(\theta)$$
STAT 7+8 Fisher CRLB 13
Cramer-Rao Lower Bound (CRLB): proof
So due to the Cauchy-Schwarz inequality in Hilbert space:
$$1 = \Big(\int u(\mathbf{x})v(\mathbf{x})\,d\mathbf{x}\Big)^2 \;\leq\; \underbrace{\int u(\mathbf{x})^2\,d\mathbf{x}}_{\mathrm{MSE}(\hat\theta)}\cdot\underbrace{\int v(\mathbf{x})^2\,d\mathbf{x}}_{J(\theta)}$$
which proves the theorem: $\mathrm{MSE}(\hat\theta) = \mathrm{Var}[\hat\theta] \geq \dfrac{1}{J(\theta)}$ QED
Efficient form
The bound becomes a strict equality if (and only if) u(x) = kv(x), i.e. iff
S(θ, x) = k(θ)[ˆθ(x) − θ] ‘efficient form’
If the score function can be written as k(θ)[ˆθ − θ] for all θ ∈ S
−→ estimator ˆθ is ‘efficient’
STAT 7+8 Fisher CRLB 14
Example: estimate the variance of a normal population
Assume a zero-mean normal population: $X \sim N(0,\sigma^2)$
How to estimate $\sigma^2$ ($= \theta$) given only the data $\mathbf{x} = (x_1,\dots,x_n)$?
Likelihood $p(\mathbf{x};\theta) = \prod_{i=1}^{n}\dfrac{1}{\sqrt{2\pi\theta}}\exp\Big(\!-\dfrac{x_i^2}{2\theta}\Big)$
Log-likelihood $\ln p(\mathbf{x};\theta) = -n\ln\sqrt{2\pi} - \dfrac{n}{2}\ln\theta - \dfrac{1}{2}\sum_{i=1}^{n}\dfrac{x_i^2}{\theta}$
Score $S(\theta,\mathbf{x}) = \dfrac{\partial}{\partial\theta}\ln p(\mathbf{x};\theta) = -\dfrac{n}{2\theta} + \dfrac{1}{2}\sum_{i=1}^{n}\dfrac{x_i^2}{\theta^2} = \underbrace{\dfrac{n}{2\theta^2}}_{k(\theta)}\Bigg(\underbrace{\dfrac{1}{n}\sum_{i=1}^{n}x_i^2}_{\hat\theta(\mathbf{x})} - \theta\Bigg)$
The score function can be written in efficient form!
so $\hat\theta(\mathbf{x}) = \dfrac{1}{n}\sum_{i=1}^{n}x_i^2$ is an unbiased and efficient estimator for $\theta = \sigma^2$
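A simulation check (not from the slides; assuming NumPy, with arbitrary values σ² = 4 and n = 20) that the variance of this efficient estimator matches the CRLB, which here is $1/J(\theta) = 2\theta^2/n = 2\sigma^4/n$:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, n, reps = 4.0, 20, 100_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
theta_hat = np.mean(x**2, axis=1)              # (1/n) sum_i x_i^2

print("mean of estimates    :", theta_hat.mean())     # ~ sigma2 (unbiased)
print("variance of estimates:", theta_hat.var())      # ~ CRLB
print("CRLB = 2 sigma^4 / n :", 2 * sigma2**2 / n)
```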
STAT 7+8 Fisher CRLB 15
Example: estimate intensity of a Poisson process
A Poisson process with intensity λ is a point process so that the times between
‘events’ are indep. and exponentially distributed with mean τ = 1/λ.
[Figure: a Poisson process on an interval of length $t$ with $N = n$ events; the inter-event times are $\sim$ Expon($\lambda$)]
The number of events $N$ in an interval of length $t$ is Poisson($\lambda t$)
Likelihood $p(n;\lambda) = \mathrm{Prob}[n\text{ events in interval of length }t] = e^{-\lambda t}\dfrac{(\lambda t)^n}{n!}$
Log-likelihood $\ln p(n;\lambda) = -\lambda t + n\ln\lambda t - \ln n!$
Score $S(\lambda,n) = \dfrac{\partial}{\partial\lambda}\ln p(n;\lambda) = -t + \dfrac{n}{\lambda} = \underbrace{\dfrac{t}{\lambda}}_{k(\lambda)}\Bigg(\underbrace{\dfrac{n}{t}}_{\hat\lambda(n)} - \lambda\Bigg)$
This is the efficient form, so $\hat\lambda(n) = \dfrac{n}{t}$ is an unbiased efficient estimator for λ!
However, the inverse $1/\hat\lambda = t/n$ is not an efficient estimator for τ:
$p(n;\tau) = e^{-t/\tau}\dfrac{(t/\tau)^n}{n!}$, so that the score is $S(\tau,n) = -\dfrac{n}{\tau} + \dfrac{t}{\tau^2}$
This is impossible to write in efficient form, so no unbiased efficient estimator for τ exists!
STAT 7+8 Fisher CRLB 16
Biased estimators
Should we always try to find unbiased estimators? No!
They may not exist
e.g. no unbiased estimator for 1/p from a Bern(p) population exists
They may be unreasonable
e.g. the MVU estimate of $p$ from $X \sim$ Geom($p$) is $\hat p(X) = \mathbb{1}_{\{X=1\}}$
this estimate is always 0 or 1
They may have extremely large variance (= MSE)
So unbiased estimators do not always minimize the MSE:
$\mathrm{MSE}(\hat\theta) = \text{bias}^2 + \mathrm{Var}[\hat\theta]$
−→ Sometimes it is better to sacrifice unbiasedness for lower variance
Minimising the MSE
We require:
the concept of sufficient statistics
the Rao-Blackwell theorem
STAT 7+8 Fisher Biased estimators 17
Sufficient statistics
Recall: a statistic T(x) is any function of the sample data
not depending on unknown parameters
could also be vector-valued: $T(\mathbf{x}): \Omega \to \mathbb{R}^m$, with $m < n$ typically
A statistic T(x) is sufficient with respect to the model p(x; θ) if
p(x|T(x); θ) = p(x|T(x)), ∀x
i.e. if the distribution of X given that T(X) = t, is independent of θ
−→ “All you can learn about θ from the data X,
you can also learn from the statistic T(X)”
If X is a book in which θ is a character, then a summary T(X) is sufficient
if it gives all information about θ that is also in the book
Sufficiency can be checked using the Neyman-Fisher criterion
STAT 7+8 Fisher Biased estimators Sufficient statistics 18
Sufficient statistics
Neyman-Fisher factorisation criterion
A statistic $T(\mathbf{x})$ is sufficient with respect to the model $p(\mathbf{x};\theta)$
$$\Leftrightarrow\quad p(\mathbf{x};\theta) = \underbrace{g(\mathbf{x})}_{\text{independent of }\theta}\cdot\underbrace{h\big(T(\mathbf{x}),\theta\big)}_{\text{depends on }\mathbf{x}\text{ only through }T(\mathbf{x})}\qquad \forall\mathbf{x}\in\Omega$$
Proof: (assuming X is discrete)
First note that if $t = T(\mathbf{x})$ then “$T(\mathbf{X}) = t,\ \mathbf{X}=\mathbf{x}$” and “$\mathbf{X}=\mathbf{x}$” are the same event!
$\longrightarrow\ p(\mathbf{x};\theta) = p(\mathbf{x},t;\theta)$
($\Rightarrow$)  $p(\mathbf{x};\theta) = p(\mathbf{x},t;\theta) = p(\mathbf{x}\mid t;\theta)\cdot p(t;\theta) \overset{\text{sufficiency}}{=} \underbrace{p(\mathbf{x}\mid t)}_{g(\mathbf{x})}\cdot\underbrace{p(t;\theta)}_{h(t,\theta)}$
($\Leftarrow$)  $p(\mathbf{x}\mid t;\theta) = \dfrac{p(\mathbf{x},t;\theta)}{p(t;\theta)} = \dfrac{p(\mathbf{x},t;\theta)}{\sum_{\mathbf{x}'}p(\mathbf{x}',t;\theta)\,\mathbb{1}_{\{T(\mathbf{x}')=t\}}} = \dfrac{g(\mathbf{x})\,h(t,\theta)}{\sum_{\mathbf{x}'}g(\mathbf{x}')\,h(t,\theta)\,\mathbb{1}_{\{T(\mathbf{x}')=t\}}}$,
which is independent of θ, $= p(\mathbf{x}\mid t)$ −→ sufficiency QED
STAT 7+8 Fisher Biased estimators Sufficient statistics 19
Example: Sample mean for Bernoulli population
Assume again a Bernoulli population: X ∼ Bern(θ),
i.e. $p(x;\theta) = \theta^x(1-\theta)^{1-x}$ for $x\in\{0,1\}$
Sample size n
Take as statistic the sample mean $T(\mathbf{X}) = \bar X = \dfrac{1}{n}\sum_{i=1}^{n}X_i = \dfrac{C}{n}$, with $C$ the count of 1s in the sample
$$p(\mathbf{x};\theta) = \prod_{i=1}^{n}p(x_i;\theta) = \prod_{i=1}^{n}\theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i} = \underbrace{\theta^{nT(\mathbf{x})}(1-\theta)^{n-nT(\mathbf{x})}}_{h(T(\mathbf{x}),\theta)}\cdot\underbrace{1}_{g(\mathbf{x})}$$
Neyman-Fisher checks out, so the sample mean is a sufficient statistic for θ
−→ $T(\mathbf{x})$ is also efficient, since $S(\theta,\mathbf{x}) = \dfrac{n}{\theta(1-\theta)}\big(T(\mathbf{x})-\theta\big)$
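Sufficiency can also be checked by brute force for a small sample (a sketch, not part of the slides; assuming NumPy): the conditional distribution of the sample given $T = \sum_i x_i$ comes out the same whatever θ is.

```python
import numpy as np
from itertools import product

n = 4

def p_x(x, theta):
    # Bernoulli sample likelihood: theta^c (1-theta)^(n-c), with c the count of 1s
    c = sum(x)
    return theta**c * (1 - theta)**(n - c)

def conditional(t, theta):
    # distribution of X restricted to the samples with T(x) = sum(x) = t
    xs = [x for x in product([0, 1], repeat=n) if sum(x) == t]
    probs = np.array([p_x(x, theta) for x in xs])
    return probs / probs.sum()

print(conditional(2, 0.2))   # uniform over the 6 sequences with two 1s
print(conditional(2, 0.7))   # identical: the conditional does not depend on theta
```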
STAT 7+8 Fisher Biased estimators Sufficient statistics 20
Rao-Blackwellisation of an estimator
Rao-Blackwell Theorem
For model p(x; θ), let ˆθ(x) be an estimator for θ so that Var[ˆθ] exists.
If T(x) is a sufficient statistic, then for the new estimator
$$\hat\theta^*(t) = E[\hat\theta(\mathbf{X})\mid T(\mathbf{X})=t],$$
1) the new estimator $\hat\theta^*$ is a statistic, i.e. does not depend on θ
2) if $\hat\theta$ is unbiased, then $\hat\theta^*$ is also unbiased
3) $\mathrm{MSE}(\hat\theta^*) \leq \mathrm{MSE}(\hat\theta)$ −→ so the new estimator may be ‘better’!
4) $\mathrm{MSE}(\hat\theta^*) = \mathrm{MSE}(\hat\theta)$ iff $\hat\theta(\mathbf{x})$ depends on $\mathbf{x}$ only through $T(\mathbf{x})$
Process of improving existing estimators is called ‘Rao-Blackwellisation’
The process is idempotent: repeating it will give no further improvement
The proof is essentially based on the law of total expectation:
Let $f(t) = E[\,\cdot\mid T=t\,]$, then $E[\,\cdot\,] = E_T[f(T)] = E_T\big[E_{\mathbf{X}}[\,\cdot\mid T\,]\big]$
inner expectation over all $\mathbf{X}$ for which $T(\mathbf{X})$ is fixed, outer expectation over all $T$
STAT 7+8 Fisher Biased estimators Rao-Blackwell 21
Rao-Blackwellisation of an estimator
Proof:
1) $\hat\theta^*$ is a statistic because of the sufficiency of $T(\mathbf{x})$:
$\longrightarrow\ \hat\theta^*(t) = \sum_{\mathbf{x}}\hat\theta(\mathbf{x})\,p(\mathbf{x}\mid t;\theta)$ is independent of θ
2) $\theta \overset{\text{unbiased}}{=} E[\hat\theta] = E_T\big[E_{\mathbf{X}}[\hat\theta(\mathbf{X})\mid T]\big] = E_T[\hat\theta^*(T)] = E[\hat\theta^*]$
3) Since both estimators are unbiased, their MSE equals their variance, so
$$\mathrm{MSE}(\hat\theta)-\mathrm{MSE}(\hat\theta^*) = \mathrm{Var}[\hat\theta]-\mathrm{Var}[\hat\theta^*] = E[\hat\theta^2] - \underbrace{\big(E[\hat\theta]\big)^2}_{\theta^2} - E[\hat\theta^{*2}] + \underbrace{\big(E[\hat\theta^*]\big)^2}_{\theta^2}$$
$$= E[\hat\theta^2(\mathbf{X})] - E[\hat\theta^{*2}(T)] = E_T\Big[E_{\mathbf{X}}[\hat\theta^2(\mathbf{X})\mid T] - \hat\theta^{*2}(T)\Big] = E_T\Big[E_{\mathbf{X}}[\hat\theta^2(\mathbf{X})\mid T] - \big(E_{\mathbf{X}}[\hat\theta(\mathbf{X})\mid T]\big)^2\Big] = E_T\Big[\underbrace{\mathrm{Var}[\hat\theta(\mathbf{X})\mid T]}_{\geq 0}\Big] \geq 0$$
4) Equality holds iff $\mathrm{Var}[\hat\theta(\mathbf{X})\mid T=t] = 0\ \forall t$
−→ given $T(\mathbf{X}) = t$, $\hat\theta$ is fixed, so $\hat\theta(\mathbf{x})$ only depends on $\mathbf{x}$ through $T(\mathbf{x})$
STAT 7+8 Fisher Biased estimators Rao-Blackwell 22
Example: estimate maximum of uniform distribution
Observe $X_1,\dots,X_n \sim$ Unif(0, a): how to estimate the upper bound $a$?
[Figure: the sample points, $\bar x$ and $\max(\mathbf{x}) = t$ on the interval $[0,a]$]
Original (naive) estimator: since $E[X_i] = \dfrac{a}{2}$, one could propose
$\hat a(\mathbf{x}) = 2\bar x = \dfrac{2}{n}\sum_{i=1}^{n}x_i \ \longrightarrow\ E[\hat a] = a,\quad \mathrm{MSE}(\hat a) = \dfrac{a^2}{3n}$ (exercise)
$T(\mathbf{x}) = \max(\mathbf{x})$ is sufficient for $a$ since Neyman-Fisher checks out:
$$p(\mathbf{x};a) = \prod_{i=1}^{n}\dfrac{1}{a}\,\mathbb{1}_{\{0\leq x_i\leq a\}} = \underbrace{\dfrac{1}{a^n}\,\mathbb{1}_{\{T(\mathbf{x})\leq a\}}}_{h(T(\mathbf{x}),a)}\cdot\underbrace{\prod_{i=1}^{n}\mathbb{1}_{\{0\leq x_i\}}}_{g(\mathbf{x})}$$
Rao-Blackwell new estimator: (suppose n > 1)
$$\hat a^*(t) = E[\hat a(\mathbf{X})\mid T(\mathbf{X})=t] = E\Big[\dfrac{2}{n}\Big(\sum_{i=1}^{n-1}X_i + t\Big)\ \Big|\ T(\mathbf{X})=t\Big] = \dfrac{2t}{n} + (n-1)\dfrac{t}{n} = \dfrac{n+1}{n}\,t = \dfrac{n+1}{n}\max(\mathbf{x})$$
$\longrightarrow\ E[\hat a^*] = a,\quad \mathrm{MSE}(\hat a^*) = \dfrac{a^2}{n(n+2)}$ (exercise)
We find that indeed, $\mathrm{MSE}(\hat a^*) < \mathrm{MSE}(\hat a)\quad \forall n > 1$
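A Monte Carlo comparison of the two estimators (a sketch, not in the slides; assuming NumPy, with arbitrary choices a = 10 and n = 5):

```python
import numpy as np

rng = np.random.default_rng(3)
a, n, reps = 10.0, 5, 200_000

x = rng.uniform(0.0, a, size=(reps, n))
a_naive = 2.0 * x.mean(axis=1)             # 2 * sample mean
a_rb    = (n + 1) / n * x.max(axis=1)      # Rao-Blackwellised estimator

for name, est, mse_th in (("2*mean", a_naive, a**2 / (3 * n)),
                          ("(n+1)/n*max", a_rb, a**2 / (n * (n + 2)))):
    print(f"{name:12s} mean={est.mean():.3f}  MSE={np.mean((est - a)**2):.3f}  theory={mse_th:.3f}")
```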
STAT 7+8 Fisher Biased estimators Rao-Blackwell 23
The Maximum-likelihood Estimator
For a model p(x; θ), the maximum-likelihood estimator ˆθML (MLE) for θ is the
value of θ for which the model
produces the highest probability of
observing sample X = x,
$$\hat\theta_{\mathrm{ML}}(\mathbf{x}) = \arg\max_{\theta\in S}\ p(\mathbf{x};\theta)$$
[Figure: likelihood $p(\mathbf{x};\theta)$ as a function of θ, maximised at $\hat\theta_{\mathrm{ML}}$]
Finding ˆθML is a maximisation problem:
$$\frac{\partial}{\partial\theta}p(\mathbf{x};\theta) = 0\ \Rightarrow\ \underbrace{\frac{\partial}{\partial\theta}\ln p(\mathbf{x};\theta)}_{\text{score function}} = 0\ \Rightarrow\ S(\theta,\mathbf{x}) = 0$$
−→ so involves finding zeroes of the score function
usually requires numerical (search) algorithms
STAT 7+8 Fisher MLE 24
The Maximum-likelihood Estimator
Properties
Any unbiased efficient estimator $\hat\theta$ is also the MLE
the score has efficient form $S(\theta,\mathbf{x}) = k(\theta)\big(\hat\theta(\mathbf{x})-\theta\big)$, so
$S(\hat\theta,\mathbf{x}) = 0$ −→ $\hat\theta$ is the MLE
The converse is not true: not all MLEs are efficient
Under some regularity conditions however, for increasing sample size
n → ∞, the MLE
is consistent: $\operatorname*{plim}_{n\to\infty}\hat\theta_{\mathrm{ML},n} = \theta$
is asymptotically efficient: $\lim_{n\to\infty}\dfrac{\mathrm{Var}[\hat\theta_{\mathrm{ML},n}]}{1/(nJ(\theta))} = 1$
is asymptotically normal: $\hat\theta_{\mathrm{ML},n} \longrightarrow N\Big(\theta,\ \dfrac{1}{nJ(\theta)}\Big)$ as $n\to\infty$
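A quick simulation of these asymptotics (not in the slides; assuming NumPy) for a Bernoulli population, where the MLE is the sample mean and the per-observation Fisher information is $J(\theta) = 1/(\theta(1-\theta))$:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, reps = 0.3, 100_000
J = 1.0 / (theta * (1 - theta))      # per-observation Fisher information

for n in (5, 50, 500):
    mle = rng.binomial(n, theta, size=reps) / n    # MLE = count / n = sample mean
    print(f"n={n:4d}  Var[mle]={mle.var():.6f}  1/(n J)={1/(n*J):.6f}")
# the two columns agree ever more closely: the MLE is asymptotically efficient
```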
STAT 7+8 Fisher MLE 25
EM algorithm (Expectation/Maximisation) for finding MLE
Observed data vs. complete data
The log-likelihood $\ln p(\mathbf{x};\theta)$ may be a complicated function of θ, so that
Find $\arg\max_\theta \ln p(\mathbf{x};\theta)$ −→ is difficult
But in the case where the observed data $\mathbf{x}$ is only part of the
underlying complete data $(\underbrace{\mathbf{x}}_{\text{observed}},\ \underbrace{\mathbf{y}}_{\text{hidden}})$,
often the complete-data log-likelihood problem
Find $\arg\max_\theta \ln p(\mathbf{x},\mathbf{y};\theta)$ −→ is easy
EM-algorithm
Numerical search algorithm: $\hat\theta_0 \xrightarrow{E,\,M} \hat\theta_1 \xrightarrow{E,\,M} \hat\theta_2 \xrightarrow{E,\,M} \dots \longrightarrow \hat\theta_{\mathrm{ML}}$
Guaranteed to converge to a local maximum of the likelihood
STAT 7+8 Fisher EM algorithm 26
EM algorithm
$$p(\mathbf{x},\mathbf{y};\theta) = p(\mathbf{x};\theta)\,p(\mathbf{y}\mid\mathbf{x};\theta)\ \longrightarrow\ \underbrace{\ln p(\mathbf{x};\theta)}_{\substack{\text{observed LL}\\\text{max is difficult}}} = \underbrace{\ln p(\mathbf{x},\mathbf{y};\theta)}_{\substack{\text{complete LL}\\\text{max is easy}}} - \underbrace{\ln p(\mathbf{y}\mid\mathbf{x};\theta)}_{\substack{\text{hidden,}\\\text{conditional on }\mathbf{x}}}$$
EM approaches the argmax of the observed LL by iteratively maximising the complete LL:
E-step (expectation)
So we need to maximise $\ln p(\mathbf{x},\mathbf{y};\theta)$ . . . but how if $\mathbf{y}$ is unknown!?
Trick 1: Replace the complete LL by its expected value:
$L_{\mathbf{x}}(\theta) = E[\ln p(\mathbf{x},\mathbf{Y};\theta)] = \int \ln p(\mathbf{x},\mathbf{y};\theta)\; p(\mathbf{y}\mid\mathbf{x};\theta)\,d\mathbf{y}$
Trick 2: Use the current estimate $\hat\theta_k$ of θ to fix the distribution of the hidden data
−→ Replace $p(\mathbf{y}\mid\mathbf{x};\theta)$ by $p(\mathbf{y}\mid\mathbf{x};\hat\theta_k)$ and calculate
$L_{\mathbf{x}}(\theta\mid\hat\theta_k) = \int \ln p(\mathbf{x},\mathbf{y};\theta)\; p(\mathbf{y}\mid\mathbf{x};\hat\theta_k)\,d\mathbf{y}$
M-step (maximisation)
Next estimate of θ is: $\hat\theta_{k+1} \leftarrow \arg\max_\theta L_{\mathbf{x}}(\theta\mid\hat\theta_k)$
STAT 7+8 Fisher EM algorithm 27
EM algorithm
It can be shown that, for the observed LL: $\ln p(\mathbf{x};\hat\theta_{k+1}) \geq \ln p(\mathbf{x};\hat\theta_k)$
So if the likelihood has a
local maximum, the EM-algorithm
will converge to it
In fact, the EM-algorithm is especially useful when the parameter to be
estimated is a vector
θ = (θ1, . . . , θh)
so that the ‘search space’ S is very large.
STAT 7+8 Fisher EM algorithm 28
Example: censored data
An electricity company has a power line to a part of the city with fluctuating daily demand. It
is known/assumed that the demand W of one day, measured in MWh, is N(µ, 1) . That is,
the variance is known (σ = 1 MWh) but the mean is not.
To estimate the mean daily power demand µ = E[W], the company asks n = 5 employees to
measure the power, on 5 different days and each with a different power meter. Unfortunately,
the meters have a limited range ri, i = 1, . . . , n. If Wi > ri, the meter fails (×) and does not
give a reading.
employee (i)    meter range r_i (MWh)    measurement x_i (MWh)
1               7                        ×
2               5                        4.2
3               8                        ×
4               6                        4.7
5               10                       6.9
−→ We try to find the MLE for µ; for reference, $\bar x = \tfrac{1}{3}(4.2 + 4.7 + 6.9) \approx 5.27$
STAT 7+8 Fisher EM algorithm Example: censored data 29
Example: censored data
Direct maximisation of observed LL
Suppose the first $m \leq n$ measurements succeeded, $\mathbf{x} = (x_1,\dots,x_m)$ (observed),
and the rest failed, $\mathbf{Y} = (Y_{m+1},\dots,Y_n)$ (hidden) −→ $Y_i > r_i$, $m < i \leq n$
$$p(\mathbf{x};\mu) = \prod_{i=1}^{m}\varphi(x_i-\mu)\ \prod_{i=m+1}^{n}\big[1-\Phi(r_i-\mu)\big]$$
$$\ell_{\mathrm{obs}}(\mu) = \ln p(\mathbf{x};\mu) = -\frac{m}{2}\ln(2\pi) - \sum_{i=1}^{m}\frac{1}{2}(x_i-\mu)^2 + \sum_{i=m+1}^{n}\ln\big[1-\Phi(r_i-\mu)\big]$$
$\hat\mu_{\mathrm{ML}}$ satisfies $\ell_{\mathrm{obs}}'(\mu) = 0$, or:
$$m(\mu - \bar x) = \sum_{i=m+1}^{n}\frac{\varphi(r_i-\mu)}{1-\Phi(r_i-\mu)}$$
a transcendental equation, difficult to solve; it can only be done numerically
−→ so let us use the EM algorithm instead!
[Figure: the observed log-likelihood $\ell_{\mathrm{obs}}(\mu) = \ln p(\mathbf{x};\mu)$ as a function of µ; its maximum lies to the right of $\bar x$ and can be found using numerical techniques]
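For this small example the stationarity condition can also be handed to a root finder directly (a sketch, not part of the slides; assuming NumPy and SciPy are available):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

x = np.array([4.2, 4.7, 6.9])       # observed (uncensored) measurements
r = np.array([7.0, 8.0])            # ranges of the meters that failed (W_i > r_i)
m, xbar = len(x), x.mean()

def score_obs(mu):
    # derivative of the observed log-likelihood (sigma = 1):
    # m (xbar - mu) + sum_i phi(r_i - mu) / (1 - Phi(r_i - mu))
    return m * (xbar - mu) + np.sum(norm.pdf(r - mu) / norm.sf(r - mu))

mu_ml = brentq(score_obs, xbar, xbar + 5.0)   # the score changes sign on this bracket
print(mu_ml)                                   # the MLE lies above xbar due to the censoring
```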
STAT 7+8 Fisher EM algorithm Example: censored data 30
Example: censored data
E-step
Complete LL is $\ln p(\mathbf{x},\mathbf{Y};\mu) = -\dfrac{n}{2}\ln(2\pi) - \dfrac{1}{2}\sum_{i=1}^{m}(x_i-\mu)^2 - \dfrac{1}{2}\sum_{i=m+1}^{n}(Y_i-\mu)^2$
1: Replace the LL by its expected value:
$E[\ln p(\mathbf{x},\mathbf{Y};\mu)] = -\dfrac{1}{2}\sum_{i=1}^{m}(x_i-\mu)^2 - \dfrac{1}{2}\sum_{i=m+1}^{n}E[(Y_i-\mu)^2] + c$, with $c$ some constant independent of µ
$E[(Y_i-\mu)^2] = \displaystyle\int_{r_i}^{\infty}(y-\mu)^2\,\underbrace{p(y\mid\mathbf{x};\mu)}_{p(y;\mu)}\,dy$ with $p(y_i;\mu) = \dfrac{\varphi(y_i-\mu)}{1-\Phi(r_i-\mu)}$
2: . . . and use the current estimate $\hat\mu_k$ for the distribution of the hidden data:
$$E_{\hat\mu_k}[(Y_i-\mu)^2] = \int_{r_i}^{\infty}(y-\mu)^2\,p(y;\hat\mu_k)\,dy = \int_{r_i}^{\infty}\big(y^2 - 2y\mu + \mu^2\big)\,p(y;\hat\mu_k)\,dy$$
$$= -2\mu\underbrace{\int_{r_i}^{\infty}y\,p(y;\hat\mu_k)\,dy}_{E_{\hat\mu_k}[Y]\,=\,E_{\hat\mu_k}[W\mid W>r_i]} + \mu^2\underbrace{\int_{r_i}^{\infty}p(y;\hat\mu_k)\,dy}_{1} + c = -2\mu\left(\hat\mu_k + \frac{\varphi(r_i-\hat\mu_k)}{1-\Phi(r_i-\hat\mu_k)}\right) + \mu^2 + c$$
STAT 7+8 Fisher EM algorithm Example: censored data 31
Example: censored data
M-step
$$L_{\mathbf{x}}(\mu\mid\hat\mu_k) = -\frac{1}{2}\sum_{i=1}^{m}(x_i-\mu)^2 - \frac{1}{2}\sum_{i=m+1}^{n}\left[-2\mu\left(\hat\mu_k + \frac{\varphi(r_i-\hat\mu_k)}{1-\Phi(r_i-\hat\mu_k)}\right) + \mu^2\right] + c$$
$$L_{\mathbf{x}}'(\mu\mid\hat\mu_k) = 0\ \Leftrightarrow\ m\bar x - n\mu + (n-m)\hat\mu_k + \sum_{i=m+1}^{n}\frac{\varphi(r_i-\hat\mu_k)}{1-\Phi(r_i-\hat\mu_k)} = 0$$
So we update: $\hat\mu_{k+1} \leftarrow \dfrac{m}{n}\bar x + \dfrac{n-m}{n}\hat\mu_k + \dfrac{1}{n}\sum_{i=m+1}^{n}\dfrac{\varphi(r_i-\hat\mu_k)}{1-\Phi(r_i-\hat\mu_k)}$
[Figure: the observed log-likelihood $\ell_{\mathrm{obs}}(\mu)$ with the EM iterates $\hat\mu_0, \hat\mu_1, \hat\mu_2$ marked; started with $\hat\mu_0 = \bar x$, convergence is very fast, only 2 or 3 iterations required here]
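The update above is straightforward to run on the example data (a sketch, not part of the slides; assuming NumPy and SciPy, with σ = 1 as in the example):

```python
import numpy as np
from scipy.stats import norm

x = np.array([4.2, 4.7, 6.9])      # observed measurements
r = np.array([7.0, 8.0])           # ranges of the censored meters
n, m = len(x) + len(r), len(x)
xbar = x.mean()

mu = xbar                          # start the iteration at the observed mean
for k in range(8):
    hazard = np.sum(norm.pdf(r - mu) / norm.sf(r - mu))   # sum of phi/(1-Phi) terms
    mu = (m / n) * xbar + ((n - m) / n) * mu + hazard / n
    print(k, round(mu, 4))
# the iterates increase from xbar and settle on the MLE within a few steps
```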
STAT 7+8 Fisher EM algorithm Example: censored data 32
Example: censored data
What if σ is also unknown!?
no problem, the EM-algorithm can be used to approximate $\theta = (\mu,\sigma^2)$:
$$\hat\mu_{k+1} \leftarrow \frac{m}{n}\bar x + \frac{n-m}{n}\hat\mu_k + \frac{1}{n}\sum_{i=m+1}^{n}\frac{\hat\sigma_k\,\varphi\big((r_i-\hat\mu_k)/\hat\sigma_k\big)}{1-\Phi\big((r_i-\hat\mu_k)/\hat\sigma_k\big)}$$
$$\hat\sigma^2_{k+1} \leftarrow \frac{1}{n}\sum_{i=1}^{m}x_i^2 + \frac{n-m}{n}\big(\hat\mu_k^2+\hat\sigma_k^2\big) + \frac{1}{n}\sum_{i=m+1}^{n}\frac{\hat\sigma_k(\hat\mu_k+r_i)\,\varphi\big((r_i-\hat\mu_k)/\hat\sigma_k\big)}{1-\Phi\big((r_i-\hat\mu_k)/\hat\sigma_k\big)} \;-\; \hat\mu_{k+1}^2$$
[Figure: the EM iterates $(\hat\mu_k,\hat\sigma_k)$ converging to $(\hat\mu_{\mathrm{ML}},\hat\sigma_{\mathrm{ML}})$; the observed LL at the optimum is −5.91; started with $\hat\mu_0 = \bar x$, $\hat\sigma_0^2 = 1$; convergence is again very fast, only 6 or 7 iterations required here]
STAT 7+8 Fisher EM algorithm Example: censored data 33
STAT 7+8 Bayes 34
STAT 7+8 Bayes 35
STAT 7+8 Bayes 36
STAT 7+8 Bayes 37
STAT 7+8 Bayes 38