Statistical Decision Theory
All of Statistics, Chapter 13
Sangwoo Mo
KAIST Algorithmic Intelligence Lab.
August 31, 2017
Sangwoo Mo (KAIST ALIN Lab.) AoS Chap 13. August 31, 2017
Table of Contents
1 Decision Theory: How to choose a ‘good’ estimator?
2 Computing Bayes Estimators
3 Computing Minimax Estimators
4 Maximum Likelihood Estimators: A good approximator
5 Admissibility: What are the ‘best’ estimators?
Statistical Decision Theory

We learned several point estimators (e.g. sample mean, MLE, MoM).
How do we choose among them? ⇒ Decision theory!
First, we will define loss and risk to evaluate estimators.

Definition
Loss: L(θ, θ̂) : Θ × Θ → R measures the discrepancy between θ and θ̂.

Example
L(θ, θ̂) = (θ − θ̂)² (squared error loss),
L(θ, θ̂) = |θ − θ̂| (absolute error loss),
L(θ, θ̂) = 0 if θ̂ = θ and 1 if θ̂ ≠ θ (zero-one loss),
L(θ, θ̂) = ∫ log( f(x; θ) / f(x; θ̂) ) f(x; θ) dx (Kullback–Leibler loss).

Definition
Risk: R(θ, θ̂) = E_θ[L(θ, θ̂)] = ∫ L(θ, θ̂(x)) f(x; θ) dx.

We want to choose the estimator with the smallest risk.
But how can we compare risk functions?
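As an illustration (not from the text), the squared-error risk can be computed in closed form for simple estimators. The sketch below compares the risk R(p, p̂) = variance + bias² of the sample mean and of a hypothetical shrinkage estimator for Bernoulli data; both estimator choices and the sample size are assumptions made for illustration.

```python
# Exact squared-error risk R(p, p_hat) = variance + bias^2 for two
# estimators of p from n i.i.d. Bernoulli(p) samples (illustrative choices):
#   p_hat1 = X_bar              (sample mean, unbiased)
#   p_hat2 = (S + 1) / (n + 2)  (shrinkage toward 1/2, the Beta(1,1) posterior mean)

def risk_mean(p, n):
    # Unbiased, so the risk is just the variance p(1-p)/n.
    return p * (1 - p) / n

def risk_shrink(p, n):
    # S ~ Binomial(n, p); p_hat2 = (S + 1) / (n + 2).
    var = n * p * (1 - p) / (n + 2) ** 2
    bias = (n * p + 1) / (n + 2) - p
    return var + bias ** 2

n = 20
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p={p:.1f}  R(mean)={risk_mean(p, n):.5f}  R(shrink)={risk_shrink(p, n):.5f}")
```

Neither risk function dominates the other: the shrinkage estimator wins near p = 1/2, the sample mean wins near the boundary. This crossing is exactly why comparing risk functions directly is not enough.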
Measure for Risk Comparison

How can we compare risk functions?
⇒ We need a one-number summary of the risk function!

Definition
Bayes risk: r(π, θ̂) = ∫ R(θ, θ̂) π(θ) dθ, where π(θ) is a prior for θ.
Maximum risk: R̄(θ̂) = sup_θ R(θ, θ̂).

Definition
Bayes estimator: θ̂ s.t. r(π, θ̂) = inf_θ̃ r(π, θ̃).
Minimax estimator: θ̂ s.t. sup_θ R(θ, θ̂) = inf_θ̃ sup_θ R(θ, θ̃).

Note that the minimax estimator is the most conservative choice.
With a well-chosen prior, the Bayes estimator generally performs better.
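To make the two summaries concrete, the sketch below (an illustration, not from the text) computes the Bayes risk under a uniform prior and the maximum risk for two assumed Bernoulli estimators: the sample mean and a shrinkage estimator (S + 1)/(n + 2). The integral is approximated by a simple Riemann sum.

```python
# One-number summaries for two Bernoulli(p) estimators with n = 20 samples:
# Bayes risk under a uniform prior (Riemann sum over p) and maximum risk
# (maximum over a grid of p values).

def risk_mean(p, n):
    return p * (1 - p) / n  # unbiased: risk = variance

def risk_shrink(p, n):
    var = n * p * (1 - p) / (n + 2) ** 2
    bias = (n * p + 1) / (n + 2) - p
    return var + bias ** 2

n = 20
grid = [i / 1000 for i in range(1001)]
summaries = {}
for name, risk in [("mean", risk_mean), ("shrink", risk_shrink)]:
    bayes = sum(risk(p, n) for p in grid) / len(grid)  # uniform-prior Bayes risk
    maximum = max(risk(p, n) for p in grid)            # maximum risk
    summaries[name] = (bayes, maximum)
    print(f"{name}: Bayes risk ~ {bayes:.5f}, max risk ~ {maximum:.5f}")
```

For this particular n, the shrinkage estimator happens to win on both summaries; which summary to optimize is exactly the choice between the Bayes and minimax viewpoints.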
Computing Bayes Estimators

For given data x, we can compute the Bayes estimator via the posterior.

Definition
Posterior risk: r(θ̂|x) = ∫ L(θ, θ̂(x)) f(θ|x) dθ,
where f(θ|x) = f(x|θ)π(θ)/m(x) is the posterior density
and m(x) = ∫ f(x|θ)π(θ) dθ is the marginal distribution of X.

Theorem
The Bayes risk r(π, θ̂) satisfies
r(π, θ̂) = ∫ r(θ̂|x) m(x) dx.
Let θ̂(x) be the value that minimizes r(θ̂|x). Then θ̂ is the Bayes estimator.
Computing Bayes Estimators

Now we can find explicit formulas for Bayes estimators under some specific
loss functions.

Theorem
If the loss is squared error / absolute error / zero-one loss,
then the Bayes estimator is the mean / median / mode of the posterior f(θ|x).

Proof.
We prove the theorem only for squared error loss.
The Bayes estimator θ̂(x) minimizes r(θ̂|x) = ∫ (θ − θ̂(x))² f(θ|x) dθ.
Setting the derivative of r(θ̂|x) w.r.t. θ̂(x) to zero gives
θ̂(x) = ∫ θ f(θ|x) dθ, the posterior mean.
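The theorem can be seen numerically in the conjugate Bernoulli/Beta setting. The sketch below (an illustration with assumed prior and data) computes the posterior mean, median, and mode of a Beta posterior, i.e. the Bayes estimators under squared error, absolute error, and zero-one loss respectively; the median is found by scanning a Riemann-sum approximation of the CDF.

```python
import math

# Bernoulli data with a conjugate Beta(a0, b0) prior (illustrative numbers):
# after observing s successes in n trials, the posterior is Beta(a0 + s, b0 + n - s).

a0, b0 = 2.0, 2.0          # assumed prior
n, s = 10, 7               # assumed data: 7 successes in 10 trials
a, b = a0 + s, b0 + n - s  # posterior Beta(9, 5)

mean = a / (a + b)            # Bayes estimator under squared error loss
mode = (a - 1) / (a + b - 2)  # Bayes estimator under zero-one loss (needs a, b > 1)

# Bayes estimator under absolute error loss: the posterior median,
# located by accumulating the posterior density until the CDF reaches 1/2.
norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # Beta function B(a, b)
def pdf(p):
    return p ** (a - 1) * (1 - p) ** (b - 1) / norm

step = 1e-5
cdf, median, p = 0.0, None, step
while p < 1:
    cdf += pdf(p) * step  # Riemann sum for the CDF
    if cdf >= 0.5:
        median = p
        break
    p += step

print(f"posterior mean={mean:.4f}, median={median:.4f}, mode={mode:.4f}")
```

Since this posterior is skewed (a > b), the three estimators differ, with the median lying between the mean and the mode.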
Computing Minimax Estimators

Computing minimax estimators is complex in general.
Here, we will only mention a few key results.

Theorem
Let θ̂_π be the Bayes estimator for some prior π.
Suppose that R(θ, θ̂_π) ≤ r(π, θ̂_π) for all θ.
Then θ̂_π is a minimax estimator and π is the least favorable prior
(i.e. for any prior π′, r(π, θ̂_π) ≥ r(π′, θ̂_π′)).

Proof.
Suppose θ̂_π is not a minimax estimator.
Then there is θ̂₀ s.t. sup_θ R(θ, θ̂₀) < sup_θ R(θ, θ̂_π).
Hence, r(π, θ̂₀) ≤ sup_θ R(θ, θ̂₀) < sup_θ R(θ, θ̂_π) ≤ r(π, θ̂_π),
contradicting the fact that θ̂_π is the Bayes estimator for π.

Corollary
Let θ̂_π be the Bayes estimator for some prior π.
Suppose that R(θ, θ̂_π) = c for some constant c.
Then θ̂_π is a minimax estimator.
Computing Minimax Estimators

Example
Let X₁, ..., Xₙ ~ Bernoulli(p). Assume squared error loss.
Assume a Beta(α, β) prior. Then the posterior mean is p̂ = (α + Σᵢ Xᵢ) / (α + β + n).
Now let α = β = √(n/4) = √n/2. Then R(p, p̂) = n / (4(n + √n)²), a constant.
Hence, by the previous theorem, p̂ is a minimax estimator.

Example
Consider again Bernoulli, but with loss L(p, p̂) = (p − p̂)² / (p(1 − p)).
Assume a uniform prior. Then the Bayes estimator is p̂ = (1/n) Σᵢ Xᵢ = X̄.
Then R(p, p̂) = 1/n, a constant. Hence, p̂ is a minimax estimator.
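The constant-risk claim in the first example can be checked directly. The sketch below (an illustration, not from the text) evaluates variance + bias² of p̂ = (S + √n/2)/(n + √n), where S = Σᵢ Xᵢ, at several values of p and compares each with n / (4(n + √n)²).

```python
import math

# Exact squared-error risk of p_hat = (S + sqrt(n)/2) / (n + sqrt(n)),
# where S ~ Binomial(n, p). Risk = variance + bias^2; the algebra says
# it equals n / (4 (n + sqrt(n))^2) for every p.

def risk(p, n):
    rn = math.sqrt(n)
    var = n * p * (1 - p) / (n + rn) ** 2
    bias = (n * p + rn / 2) / (n + rn) - p
    return var + bias ** 2

n = 25
target = n / (4 * (n + math.sqrt(n)) ** 2)
for p in [0.0, 0.2, 0.5, 0.8, 1.0]:
    print(f"p={p:.1f}: risk={risk(p, n):.6f} (target {target:.6f})")
```

The risk is flat in p, so the corollary (constant-risk Bayes estimator ⇒ minimax) applies.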
Computing Minimax Estimators

For the normal distribution, we can obtain a nice result.

Theorem
Let X₁, ..., Xₙ ~ N(θ, 1) and Θ = R. Let θ̂ = X̄.
Then θ̂ is minimax w.r.t. any well-behaved loss function
(one whose level sets are convex and symmetric about the origin).
Moreover, it is the only estimator with this property.

If the parameter space is bounded, the theorem above does not apply.

Example
Suppose that X ~ N(θ, 1) and Θ = [−m, m], where 0 < m < 1.
Assume squared error loss. Assume the prior ½(δ₋ₘ + δₘ),
where δ denotes a point mass. Then θ̂(X) = m tanh(mX) is the minimax estimator.
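As a sanity check (an illustration, not from the text), the risk R(θ, θ̂) = ∫ (m tanh(mx) − θ)² φ(x − θ) dx of the bounded-parameter estimator can be integrated numerically. Since both θ̂ and θ lie in [−m, m] with m < 1, its risk should stay well below the constant risk 1 of the unrestricted estimator θ̂(X) = X.

```python
import math

# Risk of theta_hat(X) = m * tanh(m * X) for X ~ N(theta, 1), theta in [-m, m],
# computed by a simple Riemann sum over x.

def phi(z):
    # Standard normal density.
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def risk(theta, m, lo=-10.0, hi=10.0, step=1e-3):
    total, x = 0.0, lo
    while x < hi:
        total += (m * math.tanh(m * x) - theta) ** 2 * phi(x - theta) * step
        x += step
    return total

m = 0.5
risks = [risk(theta, m) for theta in [-m, -m / 2, 0.0, m / 2, m]]
print([round(r, 4) for r in risks])
print("max risk:", round(max(risks), 4), "(vs. 1.0 for theta_hat(X) = X)")
```

The two-point prior equalizes the risk only at the endpoints ±m, so the risk need not be flat over all of [−m, m]; the point is that its maximum beats that of X.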
MLE approximates Bayes/minimax estimator

Still, it is challenging to compute Bayes and minimax estimators.
Surprisingly, it can be shown that for large samples, the MLE is
approximately Bayes and minimax for parametric models.

Idea Sketch (MLE is approximately Bayes)
For large n, the effect of the prior is negligible. Moreover,
√n(θ̂_Bayes − θ) → N(0, 1/I(θ)),
the same limiting distribution as the MLE.
Thus, the Bayes estimator is approximately the MLE.

Idea Sketch (MLE is approximately minimax)
Assume squared error loss. Let θ̂ be the MLE. Then
R(θ, θ̂) = V_θ(θ̂) + bias² ≈ V_θ(θ̂) ≈ 1/(nI(θ)),
since typically the squared bias is O(n⁻²) while the variance is O(n⁻¹).
For large n, any other estimator θ̃ satisfies R(θ, θ̃) ≥ 1/(nI(θ)) ≈ R(θ, θ̂)
approximately (essentially the Cramér–Rao bound).
Thus, the MLE is approximately minimax.
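The approximation R(θ, θ̂) ≈ 1/(nI(θ)) is easy to check by simulation in the Bernoulli model, where the MLE is X̄ and the Fisher information is I(p) = 1/(p(1 − p)), so 1/(nI(p)) = p(1 − p)/n. The sketch below (illustrative sample size and p) estimates the MLE's mean squared error by Monte Carlo.

```python
import random

# Monte Carlo check that the MLE's squared-error risk matches 1/(n I(theta)).
# For Bernoulli(p): MLE = X_bar, I(p) = 1/(p(1-p)), so risk ~ p(1-p)/n.

random.seed(0)
n, p, trials = 50, 0.3, 20000

mse = 0.0
for _ in range(trials):
    s = sum(1 for _ in range(n) if random.random() < p)  # S ~ Binomial(n, p)
    mse += (s / n - p) ** 2
mse /= trials

theory = p * (1 - p) / n  # 1 / (n I(p))
print(f"Monte Carlo MSE = {mse:.6f}, 1/(n I(p)) = {theory:.6f}")
```

Here X̄ is exactly unbiased, so the agreement is exact up to Monte Carlo error; in general models the O(n⁻²) squared-bias term makes it only an approximation.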
Admissibility

As we have seen, we cannot pick a single best estimator by comparing risk functions.
However, we can decide which estimators are not the best.

Definition
An estimator θ̂ is inadmissible if there exists an estimator θ̂′ s.t.
R(θ, θ̂′) ≤ R(θ, θ̂) for all θ and
R(θ, θ̂′) < R(θ, θ̂) for at least one θ.
An estimator θ̂ is admissible if it is not inadmissible.

Definition
An estimator θ̂ is strongly inadmissible if there exist an estimator θ̂′
and ε > 0 s.t. R(θ, θ̂′) < R(θ, θ̂) − ε for all θ.
Admissibility

Note that admissibility only rules out bad estimators.
Admissible estimators are not necessarily good, and can even be bad.

Example
Let X ~ N(θ, 1). Assume squared error loss. Let θ̂(X) = 3.
Then θ̂(X) is admissible, even though it is clearly bad.

Proof.
Suppose not. Then there exists θ̂′ s.t. R(3, θ̂′) ≤ R(3, θ̂) = 0.
Hence, 0 = R(3, θ̂′) = ∫ (θ̂′(x) − 3)² f(x; 3) dx, so θ̂′(X) = 3 (a.s.),
and θ̂′ has the same risk function as θ̂, a contradiction.
Admissibility of Bayes Estimators

Under regularity conditions, Bayes estimators are admissible.

Theorem (unique case)
If a Bayes estimator is unique, then it is admissible.

Theorem (discrete case)
If Θ is a discrete set, then all Bayes estimators are admissible.

Theorem (continuous case)
If Θ is a continuous set, and if R(θ, θ̂) is continuous in θ for every θ̂,
then all Bayes estimators are admissible.
Admissibility of Minimax Estimators

Neither minimaxity nor admissibility implies the other in general.
However, there are some facts linking them.

Theorem (admissibility ⇒ minimaxity)
Suppose that θ̂ has constant risk and is admissible.
Then it is a minimax estimator.

Proof.
Suppose not. Then there exists θ̂′ s.t., for all θ,
R(θ, θ̂′) ≤ sup_θ R(θ, θ̂′) < sup_θ R(θ, θ̂) = R(θ, θ̂),
where the last equality uses the constant risk.
This implies that θ̂ is inadmissible, a contradiction.

Theorem (minimaxity ⇒ admissibility)
If θ̂ is minimax, then it is not strongly inadmissible.