FACULTY OF ENGINEERING AND ARCHITECTURE
Mathematical Techniques in Engineering Science
Module Statistics
Lecture 7+8
Estimation of parameters:
Fisher estimation
Bayesian estimation
Stijn De Vuyst
30 November 2016
STAT 7+8 Statistics Lecture 7+8 1
Statistics Lecture 7+8
Fisher estimation
Likelihood function
Score function
Fisher information
MSE: bias and variance
Unbiased estimators: Cramer-Rao Lower Bound
Biased estimators
Sufficient statistics
Rao-Blackwellisation
Maximum-likelihood estimator
The EM algorithm
Example: censored data
Bayesian estimation
STAT 7+8 Statistics Lecture 7+8 2
Estimation of parameters: two approaches
population X
parameter θ
sample x
estimate ˆθ
Classical framework
Developed in the 1920s and 1930s by Ronald Fisher,
Karl Pearson, Jerzy Neyman, . . .
Later also C.R. Rao, H. Cramér,
Egon Pearson, D. Blackwell, . . .
θ is unknown, but deterministic
θ ∈ S, the parameter space
Bayesian framework
18th century concepts by Thomas Bayes and Pierre-Simon Laplace
Huge following after 1950s due to availability of computer-intensive methods
θ is an unknown realisation
of a random variable Θ
Θ ∈ S
STAT 7+8 Statistics Lecture 7+8 3
Classical setting: Fisher estimation
X: population,
system, process, . . .
parameter θ
θ is a scalar here,
but could also be
a vector θ in some
parameter space S
X: data,
observations,
sample, . . .
estimate ˆθ
The sample
n independent members taken from the population
(n is the sample size)
X = (X1, X2, . . . , Xn) before observation
x = (x1, x2, . . . , xn) after observation
$\mathbf{X} \in \Omega$
$\Omega = \mathbb{R}^n$ for real-valued populations, $\Omega = \{0,1\}^n$ for Bernoulli populations, . . .
The ‘model’: likelihood function
p(x; θ) = Prob[observe X = x if true parameter is θ]
p(x; θ) is called the likelihood function, ln p(x; θ) the log-likelihood
−→ can be either a density (X continuous) or a mass function (X discrete)
STAT 7+8 Fisher Likelihood 4
Example: likelihood function for a Bernoulli population
Assume a Bernoulli population: X ∼ Bern(θ),
i.e. X = 1 with probability θ and X = 0 otherwise
The observed sample (n = 6) is x = (0, 0, 1, 0, 1, 0)
Likelihood $p(\mathbf{x};\theta) = \prod_{i=1}^{6} p(x_i\mid\theta) = (1-\theta)^4\,\theta^2$, with $\theta \in S = [0,1]$
[Figure: the likelihood $(1-\theta)^4\theta^2$ plotted as a function of $\theta\in[0,1]$, with its maximum at $\hat\theta_{\mathrm{ML}} = 1/3$]
Maximum-likelihood estimate for parameter θ
$$\hat\theta_{\mathrm{ML}} = \arg\max_\theta p(\mathbf{x};\theta) = \frac{\text{count of 1s in the data}}{n} = \frac{c}{n} = \frac{2}{6} = \frac{1}{3}$$
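Not part of the slides: a minimal numerical check of this example (assuming NumPy is available), evaluating the Bernoulli likelihood on a grid of θ values and picking the maximiser.

```python
import numpy as np

# Sample from the slide: n = 6 Bernoulli observations containing two 1s
x = np.array([0, 0, 1, 0, 1, 0])

def likelihood(theta, x):
    # p(x; theta) = prod_i theta^{x_i} (1 - theta)^{1 - x_i}
    return np.prod(theta**x * (1.0 - theta)**(1 - x))

grid = np.linspace(0.0, 1.0, 10001)                  # candidate theta values
values = np.array([likelihood(t, x) for t in grid])

print(grid[np.argmax(values)])   # ~0.3333, i.e. the closed form c/n = 2/6
print(values.max())              # ~0.0219, the peak of (1-theta)^4 theta^2
```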
STAT 7+8 Fisher Likelihood 5
Score function
The score function of the model
$$S(\theta,\mathbf{x}) = \frac{\partial}{\partial\theta}\ln p(\mathbf{x};\theta) = \frac{\frac{\partial}{\partial\theta}p(\mathbf{x};\theta)}{p(\mathbf{x};\theta)}$$
S(θ, x) indicates the relative change in likelihood, i.e. the sensitivity of the log-likelihood to its parameter θ
Expected value and variance of the score
If X is not yet observed, the score S(θ, X) at θ is a random variable
What is its mean and variance?
The expected score is 0
$$E[S(\theta,\mathbf{X})] = \int_\Omega \frac{\partial}{\partial\theta}\ln p(\mathbf{x};\theta)\; p(\mathbf{x};\theta)\,d\mathbf{x} = \int_\Omega \frac{\frac{\partial}{\partial\theta}p(\mathbf{x};\theta)}{p(\mathbf{x};\theta)}\; p(\mathbf{x};\theta)\,d\mathbf{x} \overset{\text{REG}}{=} \frac{\partial}{\partial\theta}\int_\Omega p(\mathbf{x};\theta)\,d\mathbf{x} = \frac{\partial}{\partial\theta}\,1 = 0$$
The variance of the score is called the Fisher information J(θ)
$$\mathrm{Var}[S(\theta,\mathbf{X})] = E[S^2(\theta,\mathbf{X})] = E\!\left[\left(\frac{\partial}{\partial\theta}\ln p(\mathbf{X};\theta)\right)^2\right] \triangleq J(\theta)$$
STAT 7+8 Fisher Score 6
Fisher Information
J(θ) is the variance of the score function S(θ, X),
averaged over all possible samples X in Ω
J(θ) is a metric for how much you can expect to learn from the sample X
about parameter θ
Property
$$J(\theta) = E\!\left[\left(\frac{\partial}{\partial\theta}\ln p(\mathbf{X};\theta)\right)^2\right] = -E\!\left[\frac{\partial^2}{\partial\theta^2}\ln p(\mathbf{X};\theta)\right]$$
Proof:
The first equality is due to the definition of Fisher information.
The second follows from E[S(θ, X)] = 0, ∀θ, which means that also:
$$0 = \frac{\partial}{\partial\theta}E[S(\theta,\mathbf{X})] = \frac{\partial}{\partial\theta}\int_\Omega \left(\frac{\partial}{\partial\theta}\ln p\right) p\,d\mathbf{x} \overset{\text{REG}}{=} \int_\Omega \left[\left(\frac{\partial^2}{\partial\theta^2}\ln p\right) p + \left(\frac{\partial}{\partial\theta}\ln p\right)\frac{\partial}{\partial\theta}p\right] d\mathbf{x}$$
$$= \int_\Omega \left(\frac{\partial^2}{\partial\theta^2}\ln p\right) p\,d\mathbf{x} + \int_\Omega \left(\frac{\partial}{\partial\theta}\ln p\right)^2 p\,d\mathbf{x} = E\!\left[\frac{\partial^2}{\partial\theta^2}\ln p\right] + E\!\left[\left(\frac{\partial}{\partial\theta}\ln p\right)^2\right] \qquad \text{QED}$$
(!) Note we assume sufficient ‘regularity’ (REG) of the likelihood function
p(x; θ), so that differentiation over θ and integration over x can be switched
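A quick numerical illustration (not in the original slides; assuming NumPy): for a single Bernoulli observation the two expressions for the Fisher information coincide and equal $1/(\theta(1-\theta))$.

```python
import numpy as np

theta = 0.3                         # an arbitrary parameter value

def score(x, theta):                # d/dtheta ln p(x; theta) for Bern(theta)
    return x / theta - (1 - x) / (1 - theta)

def d2_logp(x, theta):              # d^2/dtheta^2 ln p(x; theta)
    return -x / theta**2 - (1 - x) / (1 - theta)**2

xs = np.array([0, 1])
px = np.array([1 - theta, theta])   # exact distribution of a single observation

J_score   = np.sum(px * score(xs, theta)**2)     # E[S^2]
J_hessian = -np.sum(px * d2_logp(xs, theta))     # -E[d^2 ln p / dtheta^2]
print(J_score, J_hessian, 1 / (theta * (1 - theta)))   # all three agree
```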
STAT 7+8 Fisher Information 7
Estimators for a parameter θ
Definition
An estimator ˆθ is a statistic Ω → S : x → ˆθ(x) (not depending on any unknown parameters!)
giving values that are hopefully ‘close’ to the true θ
! after observation, ˆθ(x) is a deterministic number
before observation, ˆθ(X) is a random variable
→ ˆθ is a shorthand notation for either, depending on the context
MEAN→ bias
E[ˆθ − θ] = E[ˆθ(X)] − θ is the bias
if bias = 0 for all θ ∈ S −→ estimator is ‘unbiased’
if estimator is not unbiased −→ estimator is biased
STAT 7+8 Fisher MSE: bias and variance 8
Estimators for a parameter θ
VARIANCE→ Mean Square Error
The variance of estimator ˆθ is the expected square deviation from E[ˆθ]:
$$\mathrm{Var}[\hat\theta] = E\Big[\big(\hat\theta(\mathbf{X}) - E[\hat\theta(\mathbf{X})]\big)^2\Big] = E\Big[\big((\hat\theta-\theta) - (E[\hat\theta]-\theta)\big)^2\Big] = E[(\hat\theta-\theta)^2] - 2\big(E[\hat\theta]-\theta\big)E[\hat\theta-\theta] + \big(E[\hat\theta]-\theta\big)^2$$
$$= \underbrace{E[(\hat\theta-\theta)^2]}_{\text{MSE}} - \underbrace{\big(E[\hat\theta]-\theta\big)^2}_{\text{bias}^2}$$
The Mean Square Error (MSE) is the expected square deviation from the true θ.
$$\Longrightarrow\ \mathrm{MSE}(\hat\theta) = \text{bias}^2 + \mathrm{Var}[\hat\theta]$$
Minimum Variance and Unbiased estimator (MVU)
ˆθ is unbiased and has lower variance than any other unbiased estimator, for all θ ∈ S
−→ estimator is ‘MVU’
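A small Monte Carlo sketch of this decomposition (not from the slides; assuming NumPy, and using a hypothetical shrunk estimator $(c+1)/(n+2)$ next to the unbiased $c/n$ for a Bernoulli sample):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 10, 200_000
c = rng.binomial(n, theta, size=reps)        # number of 1s in each simulated sample

for name, est in (("c/n", c / n), ("(c+1)/(n+2)", (c + 1) / (n + 2))):
    bias = est.mean() - theta
    var = est.var()
    mse = np.mean((est - theta)**2)
    print(f"{name:12s} bias={bias:+.4f} var={var:.5f} bias^2+var={bias**2+var:.5f} mse={mse:.5f}")
# bias^2 + var reproduces the MSE; the biased estimator here even has the lower MSE
```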
STAT 7+8 Fisher MSE: bias and variance 9
Estimators for a parameter θ
Often, the asymptotic distribution of an estimator is of interest
−→ behaviour of ˆθ(X) when sample size n becomes very large?
An estimator ˆθn = ˆθ(X1, . . . , Xn) of θ is consistent if and only if
ˆθn converges to θ (‘in probability’) for n → ∞, ∀θ ∈ S, i.e.
$$\lim_{n\to\infty}\mathrm{Prob}\big[|\hat\theta_n - \theta| > \varepsilon\big] = 0\quad \forall\varepsilon > 0, \qquad\text{or}\qquad \operatorname*{plim}_{n\to\infty}\hat\theta_n = \theta\quad \forall\theta\in S$$
Consistency vs. bias, examples:
$\hat\theta_n = \bar X$ : unbiased and consistent
$\hat\theta_n = \dfrac{X_1 + X_2 + X_3}{3}$ (for $n \geq 3$) : unbiased but not consistent
$\hat\theta_n = -\dfrac{1}{n} + \dfrac{1}{n}\sum_{i=1}^{n} X_i$ : biased but consistent
$\hat\theta_n = a \neq \theta$ : biased and not consistent
[Figure: sampling distributions of $\hat\theta_n$ for n = 1, 2, 3, 5, 10, 50 in each of the four cases]
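A simulation sketch of the third case (biased but consistent), not part of the slides; it assumes a Normal(θ, 1) population purely for illustration and requires NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, reps = 2.0, 20_000               # hypothetical true mean

# Estimator from the slide: -1/n + (1/n) sum_i X_i  (bias -1/n, vanishing with n)
for n in (1, 2, 3, 5, 10, 50, 500):
    x = rng.normal(theta, 1.0, size=(reps, n))
    est = -1.0 / n + x.mean(axis=1)
    p_off = np.mean(np.abs(est - theta) > 0.1)
    print(f"n={n:4d}  E[est]={est.mean():.3f}  Prob(|est-theta|>0.1)={p_off:.3f}")
# the bias and the tail probability both go to 0 as n grows: biased but consistent
```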
STAT 7+8 Fisher MSE: bias and variance 10
Unbiased estimators: Cramer-Rao Lower Bound (CRLB)
There may be many plausible estimators ˆθ for θ.
Which is the ‘best’?
Several criteria for a suitable estimator are possible,
but suppose we aim for an MVU estimator (unbiased and minimal MSE)
Lower bound for the MSE of unbiased estimators
Given the model p(x; θ), there is a lower bound on the MSE that any unbiased
estimator ˆθ can possibly achieve:
$$\mathrm{MSE}(\hat\theta(\mathbf{X})) \;\geq\; \frac{1}{J(\theta)} \qquad \longrightarrow\ \text{‘Cramer-Rao Lower Bound’ (CRLB)}$$
if ˆθ reaches this bound, MSE(ˆθ(X)) = 1/J(θ) −→ estimator is ‘efficient’
the CRLB is the inverse of the Fisher information
having a lot of information in the sample about the true θ (high J(θ)) allows
for estimators with very low variance
efficient ⇒ MVU, but MVU ⇏ efficient,
because the CRLB cannot always be reached by MVU estimators
STAT 7+8 Fisher CRLB 11
Cramer-Rao Lower Bound (CRLB): proof
The Cauchy-Schwarz inequality, best known in Euclidean vector spaces $\mathbb{R}^n$
$\mathbf{u} = (u_1,\dots,u_n)\in\mathbb{R}^n$ is an n-dimensional vector
$\|\mathbf{u}\| = \sqrt{u_1^2 + \dots + u_n^2}$ is the Euclidean length of $\mathbf{u}$
inner (dot) product: $\mathbf{u}\cdot\mathbf{v} = \|\mathbf{u}\|\,\|\mathbf{v}\|\underbrace{\cos\alpha}_{\in[-1,1]} = u_1v_1 + \dots + u_nv_n$
Cauchy-Schwarz: $(\mathbf{u}\cdot\mathbf{v})^2 \leq \|\mathbf{u}\|^2\,\|\mathbf{v}\|^2$, equality iff $\mathbf{u} = k\mathbf{v}$
or: $\Big(\sum_i u_iv_i\Big)^2 \leq \Big(\sum_i u_i^2\Big)\Big(\sum_i v_i^2\Big)$, equality iff $u_i = kv_i\ \forall i$
If $n\to\infty$, $\mathbb{R}^n$ becomes a Hilbert space or ‘function space’:
$\Big(\int u(x)v(x)\,dx\Big)^2 \leq \int u(x)^2\,dx \int v(x)^2\,dx$, equality iff $u(x) = kv(x)\ \forall x$
STAT 7+8 Fisher CRLB 12
Cramer-Rao Lower Bound (CRLB): proof
ˆθ(x) is an unbiased estimator for θ, so E[ˆθ(X) − θ] = 0
$$\Rightarrow\ 0 = \frac{\partial}{\partial\theta}E[\hat\theta(\mathbf{X})-\theta] = \frac{\partial}{\partial\theta}\int \big(\hat\theta(\mathbf{x})-\theta\big)\,p(\mathbf{x};\theta)\,d\mathbf{x} \overset{\text{REG}}{=} \int \frac{\partial}{\partial\theta}\Big[\big(\hat\theta(\mathbf{x})-\theta\big)\,p(\mathbf{x};\theta)\Big]\,d\mathbf{x}$$
$$= \underbrace{\int (0-1)\,p(\mathbf{x};\theta)\,d\mathbf{x}}_{-1} + \int \big(\hat\theta(\mathbf{x})-\theta\big)\underbrace{\frac{\partial}{\partial\theta}p(\mathbf{x};\theta)}_{p(\mathbf{x};\theta)\,S(\theta,\mathbf{x})}\,d\mathbf{x}$$
$$\Rightarrow\ 1 = \int_\Omega \underbrace{\big(\hat\theta(\mathbf{x})-\theta\big)\sqrt{p(\mathbf{x};\theta)}}_{u(\mathbf{x})}\;\underbrace{\sqrt{p(\mathbf{x};\theta)}\,S(\theta,\mathbf{x})}_{v(\mathbf{x})}\,d\mathbf{x} = \int_\Omega u(\mathbf{x})\,v(\mathbf{x})\,d\mathbf{x}$$
In particular, for these two functions:
$$\int u(\mathbf{x})^2\,d\mathbf{x} = \int \big(\hat\theta(\mathbf{x})-\theta\big)^2 p(\mathbf{x};\theta)\,d\mathbf{x} = E[(\hat\theta-\theta)^2] = \mathrm{MSE}(\hat\theta)$$
$$\int v(\mathbf{x})^2\,d\mathbf{x} = \int S^2(\theta,\mathbf{x})\,p(\mathbf{x};\theta)\,d\mathbf{x} = E[S^2(\theta,\mathbf{X})] = J(\theta)$$
STAT 7+8 Fisher CRLB 13
Cramer-Rao Lower Bound (CRLB): proof
So due to the Cauchy-Schwarz inequality in Hilbert space:
$$1 = \Big(\int u(\mathbf{x})v(\mathbf{x})\,d\mathbf{x}\Big)^2 \;\leq\; \underbrace{\int u(\mathbf{x})^2\,d\mathbf{x}}_{\mathrm{MSE}(\hat\theta)}\cdot\underbrace{\int v(\mathbf{x})^2\,d\mathbf{x}}_{J(\theta)}$$
which proves the theorem: $\mathrm{MSE}(\hat\theta) = \mathrm{Var}[\hat\theta] \geq \dfrac{1}{J(\theta)}$ QED
Efficient form
The bound becomes a strict equality if (and only if) u(x) = kv(x), i.e. iff
S(θ, x) = k(θ)[ˆθ(x) − θ] ‘efficient form’
If the score function can be written as k(θ)[ˆθ − θ] for all θ ∈ S
−→ estimator ˆθ is ‘efficient’
STAT 7+8 Fisher CRLB 14
Example: estimate the variance of a normal population
Assume a zero-mean normal population: $X \sim N(0,\sigma^2)$
How to estimate $\sigma^2$ ($= \theta$) given only the data $\mathbf{x} = (x_1,\dots,x_n)$?
Likelihood $p(\mathbf{x};\theta) = \prod_{i=1}^{n}\dfrac{1}{\sqrt{2\pi\theta}}\exp\Big(\!-\dfrac{x_i^2}{2\theta}\Big)$
Log-likelihood $\ln p(\mathbf{x};\theta) = -n\ln\sqrt{2\pi} - \dfrac{n}{2}\ln\theta - \dfrac{1}{2}\sum_{i=1}^{n}\dfrac{x_i^2}{\theta}$
Score $S(\theta,\mathbf{x}) = \dfrac{\partial}{\partial\theta}\ln p(\mathbf{x};\theta) = -\dfrac{n}{2\theta} + \dfrac{1}{2}\sum_{i=1}^{n}\dfrac{x_i^2}{\theta^2} = \underbrace{\dfrac{n}{2\theta^2}}_{k(\theta)}\Bigg(\underbrace{\dfrac{1}{n}\sum_{i=1}^{n}x_i^2}_{\hat\theta(\mathbf{x})} - \theta\Bigg)$
The score function can be written in efficient form!
so $\hat\theta(\mathbf{x}) = \dfrac{1}{n}\sum_{i=1}^{n}x_i^2$ is an unbiased and efficient estimator for $\theta = \sigma^2$
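A simulation check (not from the slides; assuming NumPy, with arbitrary values σ² = 4 and n = 20) that the variance of this efficient estimator matches the CRLB, which here is $1/J(\theta) = 2\theta^2/n = 2\sigma^4/n$:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, n, reps = 4.0, 20, 100_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
theta_hat = np.mean(x**2, axis=1)              # (1/n) sum_i x_i^2

print("mean of estimates    :", theta_hat.mean())     # ~ sigma2 (unbiased)
print("variance of estimates:", theta_hat.var())      # ~ CRLB
print("CRLB = 2 sigma^4 / n :", 2 * sigma2**2 / n)
```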
STAT 7+8 Fisher CRLB 15
Example: estimate intensity of a Poisson process
A Poisson process with intensity λ is a point process so that the times between
‘events’ are indep. and exponentially distributed with mean τ = 1/λ.
[Figure: a Poisson process on an interval of length $t$ with $N = n$ events; the inter-event times are $\sim$ Expon($\lambda$)]
The number of events $N$ in an interval of length $t$ is Poisson($\lambda t$)
Likelihood $p(n;\lambda) = \mathrm{Prob}[n\text{ events in interval of length }t] = e^{-\lambda t}\dfrac{(\lambda t)^n}{n!}$
Log-likelihood $\ln p(n;\lambda) = -\lambda t + n\ln\lambda t - \ln n!$
Score $S(\lambda,n) = \dfrac{\partial}{\partial\lambda}\ln p(n;\lambda) = -t + \dfrac{n}{\lambda} = \underbrace{\dfrac{t}{\lambda}}_{k(\lambda)}\Bigg(\underbrace{\dfrac{n}{t}}_{\hat\lambda(n)} - \lambda\Bigg)$
This is the efficient form, so $\hat\lambda(n) = \dfrac{n}{t}$ is an unbiased efficient estimator for λ!
However, the inverse $1/\hat\lambda = t/n$ is not an efficient estimator for τ:
$p(n;\tau) = e^{-t/\tau}\dfrac{(t/\tau)^n}{n!}$, so that the score is $S(\tau,n) = -\dfrac{n}{\tau} + \dfrac{t}{\tau^2}$
This is impossible to write in efficient form, so no unbiased efficient estimator for τ exists!
STAT 7+8 Fisher CRLB 16
Biased estimators
Should we always try to find unbiased estimators? No!
They may not exist
e.g. no unbiased estimator for 1/p from a Bern(p) population exists
They may be unreasonable
e.g. the MVU estimate of $p$ from $X \sim$ Geom($p$) is $\hat p(X) = \mathbb{1}_{\{X=1\}}$
this estimate is always 0 or 1
They may have extremely large variance (= MSE)
So unbiased estimators do not always minimize the MSE:
$\mathrm{MSE}(\hat\theta) = \text{bias}^2 + \mathrm{Var}[\hat\theta]$
−→ Sometimes it is better to sacrifice unbiasedness for lower variance
Minimising the MSE
We require:
the concept of sufficient statistics
the Rao-Blackwell theorem
STAT 7+8 Fisher Biased estimators 17
Sufficient statistics
Recall: a statistic T(x) is any function of the sample data
not depending on unknown parameters
could also be vector-valued: $T(\mathbf{x}): \Omega \to \mathbb{R}^m$, with $m < n$ typically
A statistic T(x) is sufficient with respect to the model p(x; θ) if
p(x|T(x); θ) = p(x|T(x)), ∀x
i.e. if the distribution of X given that T(X) = t, is independent of θ
−→ “All you can learn about θ from the data X,
you can also learn from the statistic T(X)”
If X is a book in which θ is a character, then a summary T(X) is sufficient
if it gives all information about θ that is also in the book
Sufficiency can be checked using the Neyman-Fisher criterion
STAT 7+8 Fisher Biased estimators Sufficient statistics 18
Sufficient statistics
Neyman-Fisher factorisation criterion
A statistic $T(\mathbf{x})$ is sufficient with respect to the model $p(\mathbf{x};\theta)$
$$\Leftrightarrow\quad p(\mathbf{x};\theta) = \underbrace{g(\mathbf{x})}_{\text{independent of }\theta}\cdot\underbrace{h\big(T(\mathbf{x}),\theta\big)}_{\text{depends on }\mathbf{x}\text{ only through }T(\mathbf{x})}\qquad \forall\mathbf{x}\in\Omega$$
Proof: (assuming X is discrete)
First note that if $t = T(\mathbf{x})$ then “$T(\mathbf{X}) = t,\ \mathbf{X}=\mathbf{x}$” and “$\mathbf{X}=\mathbf{x}$” are the same event!
$\longrightarrow\ p(\mathbf{x};\theta) = p(\mathbf{x},t;\theta)$
($\Rightarrow$)  $p(\mathbf{x};\theta) = p(\mathbf{x},t;\theta) = p(\mathbf{x}\mid t;\theta)\cdot p(t;\theta) \overset{\text{sufficiency}}{=} \underbrace{p(\mathbf{x}\mid t)}_{g(\mathbf{x})}\cdot\underbrace{p(t;\theta)}_{h(t,\theta)}$
($\Leftarrow$)  $p(\mathbf{x}\mid t;\theta) = \dfrac{p(\mathbf{x},t;\theta)}{p(t;\theta)} = \dfrac{p(\mathbf{x},t;\theta)}{\sum_{\mathbf{x}'}p(\mathbf{x}',t;\theta)\,\mathbb{1}_{\{T(\mathbf{x}')=t\}}} = \dfrac{g(\mathbf{x})\,h(t,\theta)}{\sum_{\mathbf{x}'}g(\mathbf{x}')\,h(t,\theta)\,\mathbb{1}_{\{T(\mathbf{x}')=t\}}}$,
which is independent of θ, $= p(\mathbf{x}\mid t)$ −→ sufficiency QED
STAT 7+8 Fisher Biased estimators Sufficient statistics 19
Example: Sample mean for Bernoulli population
Assume again a Bernoulli population: X ∼ Bern(θ),
i.e. $p(x;\theta) = \theta^x(1-\theta)^{1-x}$ for $x\in\{0,1\}$
Sample size n
Take as statistic the sample mean $T(\mathbf{X}) = \bar X = \dfrac{1}{n}\sum_{i=1}^{n}X_i = \dfrac{C}{n}$, with $C$ the count of 1s in the sample
$$p(\mathbf{x};\theta) = \prod_{i=1}^{n}p(x_i;\theta) = \prod_{i=1}^{n}\theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i} = \underbrace{\theta^{nT(\mathbf{x})}(1-\theta)^{n-nT(\mathbf{x})}}_{h(T(\mathbf{x}),\theta)}\cdot\underbrace{1}_{g(\mathbf{x})}$$
Neyman-Fisher checks out, so the sample mean is a sufficient statistic for θ
−→ $T(\mathbf{x})$ is also efficient, since $S(\theta,\mathbf{x}) = \dfrac{n}{\theta(1-\theta)}\big(T(\mathbf{x})-\theta\big)$
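Sufficiency can also be checked by brute force for a small sample (a sketch, not part of the slides; assuming NumPy): the conditional distribution of the sample given $T = \sum_i x_i$ comes out the same whatever θ is.

```python
import numpy as np
from itertools import product

n = 4

def p_x(x, theta):
    # Bernoulli sample likelihood: theta^c (1-theta)^(n-c), with c the count of 1s
    c = sum(x)
    return theta**c * (1 - theta)**(n - c)

def conditional(t, theta):
    # distribution of X restricted to the samples with T(x) = sum(x) = t
    xs = [x for x in product([0, 1], repeat=n) if sum(x) == t]
    probs = np.array([p_x(x, theta) for x in xs])
    return probs / probs.sum()

print(conditional(2, 0.2))   # uniform over the 6 sequences with two 1s
print(conditional(2, 0.7))   # identical: the conditional does not depend on theta
```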
STAT 7+8 Fisher Biased estimators Sufficient statistics 20
Rao-Blackwellisation of an estimator
Rao-Blackwell Theorem
For model p(x; θ), let ˆθ(x) be an estimator for θ so that Var[ˆθ] exists.
If T(x) is a sufficient statistic, then for the new estimator
$$\hat\theta^*(t) = E[\hat\theta(\mathbf{X})\mid T(\mathbf{X})=t],$$
1) the new estimator $\hat\theta^*$ is a statistic, i.e. does not depend on θ
2) if $\hat\theta$ is unbiased, then $\hat\theta^*$ is also unbiased
3) $\mathrm{MSE}(\hat\theta^*) \leq \mathrm{MSE}(\hat\theta)$ −→ so the new estimator may be ‘better’!
4) $\mathrm{MSE}(\hat\theta^*) = \mathrm{MSE}(\hat\theta)$ iff $\hat\theta(\mathbf{x})$ depends on $\mathbf{x}$ only through $T(\mathbf{x})$
Process of improving existing estimators is called ‘Rao-Blackwellisation’
The process is idempotent: repeating it will give no further improvement
The proof is essentially based on the law of total expectation:
Let $f(t) = E[\,\cdot\mid T=t\,]$, then $E[\,\cdot\,] = E_T[f(T)] = E_T\big[E_{\mathbf{X}}[\,\cdot\mid T\,]\big]$
inner expectation over all $\mathbf{X}$ for which $T(\mathbf{X})$ is fixed, outer expectation over all $T$
STAT 7+8 Fisher Biased estimators Rao-Blackwell 21
Rao-Blackwellisation of an estimator
Proof:
1) $\hat\theta^*$ is a statistic because of the sufficiency of $T(\mathbf{x})$:
$\longrightarrow\ \hat\theta^*(t) = \sum_{\mathbf{x}}\hat\theta(\mathbf{x})\,p(\mathbf{x}\mid t;\theta)$ is independent of θ
2) $\theta \overset{\text{unbiased}}{=} E[\hat\theta] = E_T\big[E_{\mathbf{X}}[\hat\theta(\mathbf{X})\mid T]\big] = E_T[\hat\theta^*(T)] = E[\hat\theta^*]$
3) Since both estimators are unbiased, their MSE equals their variance, so
$$\mathrm{MSE}(\hat\theta)-\mathrm{MSE}(\hat\theta^*) = \mathrm{Var}[\hat\theta]-\mathrm{Var}[\hat\theta^*] = E[\hat\theta^2] - \underbrace{\big(E[\hat\theta]\big)^2}_{\theta^2} - E[\hat\theta^{*2}] + \underbrace{\big(E[\hat\theta^*]\big)^2}_{\theta^2}$$
$$= E[\hat\theta^2(\mathbf{X})] - E[\hat\theta^{*2}(T)] = E_T\Big[E_{\mathbf{X}}[\hat\theta^2(\mathbf{X})\mid T] - \hat\theta^{*2}(T)\Big] = E_T\Big[E_{\mathbf{X}}[\hat\theta^2(\mathbf{X})\mid T] - \big(E_{\mathbf{X}}[\hat\theta(\mathbf{X})\mid T]\big)^2\Big] = E_T\Big[\underbrace{\mathrm{Var}[\hat\theta(\mathbf{X})\mid T]}_{\geq 0}\Big] \geq 0$$
4) Equality holds iff $\mathrm{Var}[\hat\theta(\mathbf{X})\mid T=t] = 0\ \forall t$
−→ given $T(\mathbf{X}) = t$, $\hat\theta$ is fixed, so $\hat\theta(\mathbf{x})$ only depends on $\mathbf{x}$ through $T(\mathbf{x})$
STAT 7+8 Fisher Biased estimators Rao-Blackwell 22
Example: estimate maximum of uniform distribution
Observe $X_1,\dots,X_n \sim$ Unif(0, a): how to estimate the upper bound $a$?
[Figure: the sample points, $\bar x$ and $\max(\mathbf{x}) = t$ on the interval $[0,a]$]
Original (naive) estimator: since $E[X_i] = \dfrac{a}{2}$, one could propose
$\hat a(\mathbf{x}) = 2\bar x = \dfrac{2}{n}\sum_{i=1}^{n}x_i \ \longrightarrow\ E[\hat a] = a,\quad \mathrm{MSE}(\hat a) = \dfrac{a^2}{3n}$ (exercise)
$T(\mathbf{x}) = \max(\mathbf{x})$ is sufficient for $a$ since Neyman-Fisher checks out:
$$p(\mathbf{x};a) = \prod_{i=1}^{n}\dfrac{1}{a}\,\mathbb{1}_{\{0\leq x_i\leq a\}} = \underbrace{\dfrac{1}{a^n}\,\mathbb{1}_{\{T(\mathbf{x})\leq a\}}}_{h(T(\mathbf{x}),a)}\cdot\underbrace{\prod_{i=1}^{n}\mathbb{1}_{\{0\leq x_i\}}}_{g(\mathbf{x})}$$
Rao-Blackwell new estimator: (suppose n > 1)
$$\hat a^*(t) = E[\hat a(\mathbf{X})\mid T(\mathbf{X})=t] = E\Big[\dfrac{2}{n}\Big(\sum_{i=1}^{n-1}X_i + t\Big)\ \Big|\ T(\mathbf{X})=t\Big] = \dfrac{2t}{n} + (n-1)\dfrac{t}{n} = \dfrac{n+1}{n}\,t = \dfrac{n+1}{n}\max(\mathbf{x})$$
$\longrightarrow\ E[\hat a^*] = a,\quad \mathrm{MSE}(\hat a^*) = \dfrac{a^2}{n(n+2)}$ (exercise)
We find that indeed, $\mathrm{MSE}(\hat a^*) < \mathrm{MSE}(\hat a)\quad \forall n > 1$
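A Monte Carlo comparison of the two estimators (a sketch, not in the slides; assuming NumPy, with arbitrary choices a = 10 and n = 5):

```python
import numpy as np

rng = np.random.default_rng(3)
a, n, reps = 10.0, 5, 200_000

x = rng.uniform(0.0, a, size=(reps, n))
a_naive = 2.0 * x.mean(axis=1)             # 2 * sample mean
a_rb    = (n + 1) / n * x.max(axis=1)      # Rao-Blackwellised estimator

for name, est, mse_th in (("2*mean", a_naive, a**2 / (3 * n)),
                          ("(n+1)/n*max", a_rb, a**2 / (n * (n + 2)))):
    print(f"{name:12s} mean={est.mean():.3f}  MSE={np.mean((est - a)**2):.3f}  theory={mse_th:.3f}")
```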
STAT 7+8 Fisher Biased estimators Rao-Blackwell 23
The Maximum-likelihood Estimator
For a model p(x; θ), the maximum-likelihood estimator ˆθML (MLE) for θ is the
value of θ for which the model
produces the highest probability of
observing sample X = x,
$$\hat\theta_{\mathrm{ML}}(\mathbf{x}) = \arg\max_{\theta\in S}\ p(\mathbf{x};\theta)$$
[Figure: likelihood $p(\mathbf{x};\theta)$ as a function of θ, maximised at $\hat\theta_{\mathrm{ML}}$]
Finding ˆθML is a maximisation problem:
$$\frac{\partial}{\partial\theta}p(\mathbf{x};\theta) = 0\ \Rightarrow\ \underbrace{\frac{\partial}{\partial\theta}\ln p(\mathbf{x};\theta)}_{\text{score function}} = 0\ \Rightarrow\ S(\theta,\mathbf{x}) = 0$$
−→ so involves finding zeroes of the score function
usually requires numerical (search) algorithms
STAT 7+8 Fisher MLE 24
The Maximum-likelihood Estimator
Properties
Any unbiased efficient estimator $\hat\theta$ is also the MLE
the score has efficient form $S(\theta,\mathbf{x}) = k(\theta)\big(\hat\theta(\mathbf{x})-\theta\big)$, so
$S(\hat\theta,\mathbf{x}) = 0$ −→ $\hat\theta$ is the MLE
The converse is not true: not all MLEs are efficient
Under some regularity conditions however, for increasing sample size
n → ∞, the MLE
is consistent: $\operatorname*{plim}_{n\to\infty}\hat\theta_{\mathrm{ML},n} = \theta$
is asymptotically efficient: $\lim_{n\to\infty}\dfrac{\mathrm{Var}[\hat\theta_{\mathrm{ML},n}]}{1/(nJ(\theta))} = 1$
is asymptotically normal: $\hat\theta_{\mathrm{ML},n} \longrightarrow N\Big(\theta,\ \dfrac{1}{nJ(\theta)}\Big)$ as $n\to\infty$
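A quick simulation of these asymptotics (not in the slides; assuming NumPy) for a Bernoulli population, where the MLE is the sample mean and the per-observation Fisher information is $J(\theta) = 1/(\theta(1-\theta))$:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, reps = 0.3, 100_000
J = 1.0 / (theta * (1 - theta))      # per-observation Fisher information

for n in (5, 50, 500):
    mle = rng.binomial(n, theta, size=reps) / n    # MLE = count / n = sample mean
    print(f"n={n:4d}  Var[mle]={mle.var():.6f}  1/(n J)={1/(n*J):.6f}")
# the two columns agree ever more closely: the MLE is asymptotically efficient
```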
STAT 7+8 Fisher MLE 25
EM algorithm (Expectation/Maximisation) for finding MLE
Observed data vs. complete data
The log-likelihood $\ln p(\mathbf{x};\theta)$ may be a complicated function of θ, so that
Find $\arg\max_\theta \ln p(\mathbf{x};\theta)$ −→ is difficult
But in the case where the observed data $\mathbf{x}$ is only part of the
underlying complete data $(\underbrace{\mathbf{x}}_{\text{observed}},\ \underbrace{\mathbf{y}}_{\text{hidden}})$,
often the complete-data log-likelihood problem
Find $\arg\max_\theta \ln p(\mathbf{x},\mathbf{y};\theta)$ −→ is easy
EM-algorithm
Numerical search algorithm: $\hat\theta_0 \xrightarrow{E,\,M} \hat\theta_1 \xrightarrow{E,\,M} \hat\theta_2 \xrightarrow{E,\,M} \dots \longrightarrow \hat\theta_{\mathrm{ML}}$
Guaranteed to converge to a local maximum of the likelihood
STAT 7+8 Fisher EM algorithm 26
EM algorithm
$$p(\mathbf{x},\mathbf{y};\theta) = p(\mathbf{x};\theta)\,p(\mathbf{y}\mid\mathbf{x};\theta)\ \longrightarrow\ \underbrace{\ln p(\mathbf{x};\theta)}_{\substack{\text{observed LL}\\\text{max is difficult}}} = \underbrace{\ln p(\mathbf{x},\mathbf{y};\theta)}_{\substack{\text{complete LL}\\\text{max is easy}}} - \underbrace{\ln p(\mathbf{y}\mid\mathbf{x};\theta)}_{\substack{\text{hidden,}\\\text{conditional on }\mathbf{x}}}$$
EM approaches the argmax of the observed LL by iteratively maximising the complete LL:
E-step (expectation)
So we need to maximise $\ln p(\mathbf{x},\mathbf{y};\theta)$ . . . but how if $\mathbf{y}$ is unknown!?
Trick 1: Replace the complete LL by its expected value:
$L_{\mathbf{x}}(\theta) = E[\ln p(\mathbf{x},\mathbf{Y};\theta)] = \int \ln p(\mathbf{x},\mathbf{y};\theta)\; p(\mathbf{y}\mid\mathbf{x};\theta)\,d\mathbf{y}$
Trick 2: Use the current estimate $\hat\theta_k$ of θ to fix the distribution of the hidden data
−→ Replace $p(\mathbf{y}\mid\mathbf{x};\theta)$ by $p(\mathbf{y}\mid\mathbf{x};\hat\theta_k)$ and calculate
$L_{\mathbf{x}}(\theta\mid\hat\theta_k) = \int \ln p(\mathbf{x},\mathbf{y};\theta)\; p(\mathbf{y}\mid\mathbf{x};\hat\theta_k)\,d\mathbf{y}$
M-step (maximisation)
Next estimate of θ is: $\hat\theta_{k+1} \leftarrow \arg\max_\theta L_{\mathbf{x}}(\theta\mid\hat\theta_k)$
STAT 7+8 Fisher EM algorithm 27
EM algorithm
It can be shown that, for the observed LL: $\ln p(\mathbf{x};\hat\theta_{k+1}) \geq \ln p(\mathbf{x};\hat\theta_k)$
So if the likelihood has a
local maximum, the EM-algorithm
will converge to it
In fact, the EM-algorithm is especially useful when the parameter to be
estimated is a vector
θ = (θ1, . . . , θh)
so that the ‘search space’ S is very large.
STAT 7+8 Fisher EM algorithm 28
Example: censored data
An electricity company has a power line to a part of the city with fluctuating daily demand. It
is known/assumed that the demand W of one day, measured in MWh, is N(µ, 1) . That is,
the variance is known (σ = 1 MWh) but the mean is not.
To estimate the mean daily power demand µ = E[W], the company asks n = 5 employees to
measure the power, on 5 different days and each with a different power meter. Unfortunately,
the meters have a limited range ri, i = 1, . . . , n. If Wi > ri, the meter fails (×) and does not
give a reading.
employee (i)    meter range r_i (MWh)    measurement x_i (MWh)
1               7                        ×
2               5                        4.2
3               8                        ×
4               6                        4.7
5               10                       6.9
−→ We try to find the MLE for µ; for reference, $\bar x = \tfrac{1}{3}(4.2 + 4.7 + 6.9) \approx 5.27$
STAT 7+8 Fisher EM algorithm Example: censored data 29
Example: censored data
Direct maximisation of observed LL
Suppose the first $m \leq n$ measurements succeeded, $\mathbf{x} = (x_1,\dots,x_m)$ (observed),
and the rest failed, $\mathbf{Y} = (Y_{m+1},\dots,Y_n)$ (hidden) −→ $Y_i > r_i$, $m < i \leq n$
$$p(\mathbf{x};\mu) = \prod_{i=1}^{m}\varphi(x_i-\mu)\ \prod_{i=m+1}^{n}\big[1-\Phi(r_i-\mu)\big]$$
$$\ell_{\mathrm{obs}}(\mu) = \ln p(\mathbf{x};\mu) = -\frac{m}{2}\ln(2\pi) - \sum_{i=1}^{m}\frac{1}{2}(x_i-\mu)^2 + \sum_{i=m+1}^{n}\ln\big[1-\Phi(r_i-\mu)\big]$$
$\hat\mu_{\mathrm{ML}}$ satisfies $\ell_{\mathrm{obs}}'(\mu) = 0$, or:
$$m(\mu - \bar x) = \sum_{i=m+1}^{n}\frac{\varphi(r_i-\mu)}{1-\Phi(r_i-\mu)}$$
a transcendental equation, difficult to solve; it can only be done numerically
−→ so let us use the EM algorithm instead!
[Figure: the observed log-likelihood $\ell_{\mathrm{obs}}(\mu) = \ln p(\mathbf{x};\mu)$ as a function of µ; its maximum lies to the right of $\bar x$ and can be found using numerical techniques]
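For this small example the stationarity condition can also be handed to a root finder directly (a sketch, not part of the slides; assuming NumPy and SciPy are available):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

x = np.array([4.2, 4.7, 6.9])       # observed (uncensored) measurements
r = np.array([7.0, 8.0])            # ranges of the meters that failed (W_i > r_i)
m, xbar = len(x), x.mean()

def score_obs(mu):
    # derivative of the observed log-likelihood (sigma = 1):
    # m (xbar - mu) + sum_i phi(r_i - mu) / (1 - Phi(r_i - mu))
    return m * (xbar - mu) + np.sum(norm.pdf(r - mu) / norm.sf(r - mu))

mu_ml = brentq(score_obs, xbar, xbar + 5.0)   # the score changes sign on this bracket
print(mu_ml)                                   # the MLE lies above xbar due to the censoring
```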
STAT 7+8 Fisher EM algorithm Example: censored data 30
Example: censored data
E-step
Complete LL is $\ln p(\mathbf{x},\mathbf{Y};\mu) = -\dfrac{n}{2}\ln(2\pi) - \dfrac{1}{2}\sum_{i=1}^{m}(x_i-\mu)^2 - \dfrac{1}{2}\sum_{i=m+1}^{n}(Y_i-\mu)^2$
1: Replace the LL by its expected value:
$E[\ln p(\mathbf{x},\mathbf{Y};\mu)] = -\dfrac{1}{2}\sum_{i=1}^{m}(x_i-\mu)^2 - \dfrac{1}{2}\sum_{i=m+1}^{n}E[(Y_i-\mu)^2] + c$, with $c$ some constant independent of µ
$E[(Y_i-\mu)^2] = \displaystyle\int_{r_i}^{\infty}(y-\mu)^2\,\underbrace{p(y\mid\mathbf{x};\mu)}_{p(y;\mu)}\,dy$ with $p(y_i;\mu) = \dfrac{\varphi(y_i-\mu)}{1-\Phi(r_i-\mu)}$
2: . . . and use the current estimate $\hat\mu_k$ for the distribution of the hidden data:
$$E_{\hat\mu_k}[(Y_i-\mu)^2] = \int_{r_i}^{\infty}(y-\mu)^2\,p(y;\hat\mu_k)\,dy = \int_{r_i}^{\infty}\big(y^2 - 2y\mu + \mu^2\big)\,p(y;\hat\mu_k)\,dy$$
$$= -2\mu\underbrace{\int_{r_i}^{\infty}y\,p(y;\hat\mu_k)\,dy}_{E_{\hat\mu_k}[Y]\,=\,E_{\hat\mu_k}[W\mid W>r_i]} + \mu^2\underbrace{\int_{r_i}^{\infty}p(y;\hat\mu_k)\,dy}_{1} + c = -2\mu\left(\hat\mu_k + \frac{\varphi(r_i-\hat\mu_k)}{1-\Phi(r_i-\hat\mu_k)}\right) + \mu^2 + c$$
STAT 7+8 Fisher EM algorithm Example: censored data 31
Example: censored data
M-step
$$L_{\mathbf{x}}(\mu\mid\hat\mu_k) = -\frac{1}{2}\sum_{i=1}^{m}(x_i-\mu)^2 - \frac{1}{2}\sum_{i=m+1}^{n}\left[-2\mu\left(\hat\mu_k + \frac{\varphi(r_i-\hat\mu_k)}{1-\Phi(r_i-\hat\mu_k)}\right) + \mu^2\right] + c$$
$$L_{\mathbf{x}}'(\mu\mid\hat\mu_k) = 0\ \Leftrightarrow\ m\bar x - n\mu + (n-m)\hat\mu_k + \sum_{i=m+1}^{n}\frac{\varphi(r_i-\hat\mu_k)}{1-\Phi(r_i-\hat\mu_k)} = 0$$
So we update: $\hat\mu_{k+1} \leftarrow \dfrac{m}{n}\bar x + \dfrac{n-m}{n}\hat\mu_k + \dfrac{1}{n}\sum_{i=m+1}^{n}\dfrac{\varphi(r_i-\hat\mu_k)}{1-\Phi(r_i-\hat\mu_k)}$
[Figure: the observed log-likelihood $\ell_{\mathrm{obs}}(\mu)$ with the EM iterates $\hat\mu_0, \hat\mu_1, \hat\mu_2$ marked; started with $\hat\mu_0 = \bar x$, convergence is very fast, only 2 or 3 iterations required here]
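The update above is straightforward to run on the example data (a sketch, not part of the slides; assuming NumPy and SciPy, with σ = 1 as in the example):

```python
import numpy as np
from scipy.stats import norm

x = np.array([4.2, 4.7, 6.9])      # observed measurements
r = np.array([7.0, 8.0])           # ranges of the censored meters
n, m = len(x) + len(r), len(x)
xbar = x.mean()

mu = xbar                          # start the iteration at the observed mean
for k in range(8):
    hazard = np.sum(norm.pdf(r - mu) / norm.sf(r - mu))   # sum of phi/(1-Phi) terms
    mu = (m / n) * xbar + ((n - m) / n) * mu + hazard / n
    print(k, round(mu, 4))
# the iterates increase from xbar and settle on the MLE within a few steps
```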
STAT 7+8 Fisher EM algorithm Example: censored data 32
Example: censored data
What if σ is also unknown!?
no problem, the EM-algorithm can be used to approximate $\theta = (\mu,\sigma^2)$:
$$\hat\mu_{k+1} \leftarrow \frac{m}{n}\bar x + \frac{n-m}{n}\hat\mu_k + \frac{1}{n}\sum_{i=m+1}^{n}\frac{\hat\sigma_k\,\varphi\big((r_i-\hat\mu_k)/\hat\sigma_k\big)}{1-\Phi\big((r_i-\hat\mu_k)/\hat\sigma_k\big)}$$
$$\hat\sigma^2_{k+1} \leftarrow \frac{1}{n}\sum_{i=1}^{m}x_i^2 + \frac{n-m}{n}\big(\hat\mu_k^2+\hat\sigma_k^2\big) + \frac{1}{n}\sum_{i=m+1}^{n}\frac{\hat\sigma_k(\hat\mu_k+r_i)\,\varphi\big((r_i-\hat\mu_k)/\hat\sigma_k\big)}{1-\Phi\big((r_i-\hat\mu_k)/\hat\sigma_k\big)} \;-\; \hat\mu_{k+1}^2$$
[Figure: the EM iterates $(\hat\mu_k,\hat\sigma_k)$ converging to $(\hat\mu_{\mathrm{ML}},\hat\sigma_{\mathrm{ML}})$; the observed LL at the optimum is −5.91; started with $\hat\mu_0 = \bar x$, $\hat\sigma_0^2 = 1$; convergence is again very fast, only 6 or 7 iterations required here]
STAT 7+8 Fisher EM algorithm Example: censored data 33
STAT 7+8 Bayes 34
STAT 7+8 Bayes 35
STAT 7+8 Bayes 36
STAT 7+8 Bayes 37
STAT 7+8 Bayes 38