Statistical Decision Theory
All of Statistics, Chapter 13
Sangwoo Mo
KAIST Algorithmic Intelligence Lab.
August 31, 2017
Sangwoo Mo (KAIST ALIN Lab.) AoS Chap 13. August 31, 2017
Table of Contents
1 Decision Theory: How to choose a ‘good’ estimator?
2 Computing Bayes Estimators
3 Computing Minimax Estimators
4 Maximum Likelihood Estimators: A good approximator
5 Admissibility: What are the ‘best’ estimators?
Statistical Decision Theory

We learned several point estimators (e.g. sample mean, MLE, MoM).
How do we choose among them? ⇒ Decision theory!
First, we will define loss and risk to evaluate estimators.

Definition
Loss: L(θ, θ̂) : Θ × Θ → R measures the discrepancy between θ and θ̂.

Example
L(θ, θ̂) = (θ − θ̂)² (squared error loss),
L(θ, θ̂) = |θ − θ̂| (absolute error loss),
L(θ, θ̂) = 0 if θ̂ = θ and 1 if θ̂ ≠ θ (zero-one loss),
L(θ, θ̂) = ∫ log( f(x; θ) / f(x; θ̂) ) f(x; θ) dx (Kullback–Leibler loss).

Definition
Risk: R(θ, θ̂) = E_θ[L(θ, θ̂)] = ∫ L(θ, θ̂(x)) f(x; θ) dx.

We want to choose the estimator with the smallest risk.
But how can we compare risk functions?
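As an illustration (not from the text), the squared-error risk can be computed in closed form for simple estimators. The sketch below compares the risk R(p, p̂) = variance + bias² of the sample mean and of a hypothetical shrinkage estimator for Bernoulli data; both estimator choices and the sample size are assumptions made for illustration.

```python
# Exact squared-error risk R(p, p_hat) = variance + bias^2 for two
# estimators of p from n i.i.d. Bernoulli(p) samples (illustrative choices):
#   p_hat1 = X_bar              (sample mean, unbiased)
#   p_hat2 = (S + 1) / (n + 2)  (shrinkage toward 1/2, the Beta(1,1) posterior mean)

def risk_mean(p, n):
    # Unbiased, so the risk is just the variance p(1-p)/n.
    return p * (1 - p) / n

def risk_shrink(p, n):
    # S ~ Binomial(n, p); p_hat2 = (S + 1) / (n + 2).
    var = n * p * (1 - p) / (n + 2) ** 2
    bias = (n * p + 1) / (n + 2) - p
    return var + bias ** 2

n = 20
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p={p:.1f}  R(mean)={risk_mean(p, n):.5f}  R(shrink)={risk_shrink(p, n):.5f}")
```

Neither risk function dominates the other: the shrinkage estimator wins near p = 1/2, the sample mean wins near the boundary. This crossing is exactly why comparing risk functions directly is not enough.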
Measure for Risk Comparison

How can we compare risk functions?
⇒ We need a one-number summary of the risk function!

Definition
Bayes risk: r(π, θ̂) = ∫ R(θ, θ̂) π(θ) dθ, where π(θ) is a prior for θ.
Maximum risk: R̄(θ̂) = sup_θ R(θ, θ̂).

Definition
Bayes estimator: θ̂ s.t. r(π, θ̂) = inf_θ̃ r(π, θ̃).
Minimax estimator: θ̂ s.t. sup_θ R(θ, θ̂) = inf_θ̃ sup_θ R(θ, θ̃).

Note that the minimax estimator is the most conservative choice.
With a well-chosen prior, the Bayes estimator generally performs better.
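To make the two summaries concrete, the sketch below (an illustration, not from the text) computes the Bayes risk under a uniform prior and the maximum risk for two assumed Bernoulli estimators: the sample mean and a shrinkage estimator (S + 1)/(n + 2). The integral is approximated by a simple Riemann sum.

```python
# One-number summaries for two Bernoulli(p) estimators with n = 20 samples:
# Bayes risk under a uniform prior (Riemann sum over p) and maximum risk
# (maximum over a grid of p values).

def risk_mean(p, n):
    return p * (1 - p) / n  # unbiased: risk = variance

def risk_shrink(p, n):
    var = n * p * (1 - p) / (n + 2) ** 2
    bias = (n * p + 1) / (n + 2) - p
    return var + bias ** 2

n = 20
grid = [i / 1000 for i in range(1001)]
summaries = {}
for name, risk in [("mean", risk_mean), ("shrink", risk_shrink)]:
    bayes = sum(risk(p, n) for p in grid) / len(grid)  # uniform-prior Bayes risk
    maximum = max(risk(p, n) for p in grid)            # maximum risk
    summaries[name] = (bayes, maximum)
    print(f"{name}: Bayes risk ~ {bayes:.5f}, max risk ~ {maximum:.5f}")
```

For this particular n, the shrinkage estimator happens to win on both summaries; which summary to optimize is exactly the choice between the Bayes and minimax viewpoints.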
Computing Bayes Estimators

For given data x, we can compute the Bayes estimator via the posterior.

Definition
Posterior risk: r(θ̂|x) = ∫ L(θ, θ̂(x)) f(θ|x) dθ,
where f(θ|x) = f(x|θ)π(θ)/m(x) is the posterior density
and m(x) = ∫ f(x|θ)π(θ) dθ is the marginal distribution of X.

Theorem
The Bayes risk r(π, θ̂) satisfies
r(π, θ̂) = ∫ r(θ̂|x) m(x) dx.
Let θ̂(x) be the value that minimizes r(θ̂|x). Then θ̂ is the Bayes estimator.
Computing Bayes Estimators

Now we can find explicit formulas for Bayes estimators under some specific
loss functions.

Theorem
If the loss is squared error / absolute error / zero-one loss,
then the Bayes estimator is the mean / median / mode of the posterior f(θ|x).

Proof.
We prove the theorem only for squared error loss.
The Bayes estimator θ̂(x) minimizes r(θ̂|x) = ∫ (θ − θ̂(x))² f(θ|x) dθ.
Setting the derivative of r(θ̂|x) w.r.t. θ̂(x) to zero gives
θ̂(x) = ∫ θ f(θ|x) dθ, the posterior mean.
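The theorem can be seen numerically in the conjugate Bernoulli/Beta setting. The sketch below (an illustration with assumed prior and data) computes the posterior mean, median, and mode of a Beta posterior, i.e. the Bayes estimators under squared error, absolute error, and zero-one loss respectively; the median is found by scanning a Riemann-sum approximation of the CDF.

```python
import math

# Bernoulli data with a conjugate Beta(a0, b0) prior (illustrative numbers):
# after observing s successes in n trials, the posterior is Beta(a0 + s, b0 + n - s).

a0, b0 = 2.0, 2.0          # assumed prior
n, s = 10, 7               # assumed data: 7 successes in 10 trials
a, b = a0 + s, b0 + n - s  # posterior Beta(9, 5)

mean = a / (a + b)            # Bayes estimator under squared error loss
mode = (a - 1) / (a + b - 2)  # Bayes estimator under zero-one loss (needs a, b > 1)

# Bayes estimator under absolute error loss: the posterior median,
# located by accumulating the posterior density until the CDF reaches 1/2.
norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # Beta function B(a, b)
def pdf(p):
    return p ** (a - 1) * (1 - p) ** (b - 1) / norm

step = 1e-5
cdf, median, p = 0.0, None, step
while p < 1:
    cdf += pdf(p) * step  # Riemann sum for the CDF
    if cdf >= 0.5:
        median = p
        break
    p += step

print(f"posterior mean={mean:.4f}, median={median:.4f}, mode={mode:.4f}")
```

Since this posterior is skewed (a > b), the three estimators differ, with the median lying between the mean and the mode.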
Computing Minimax Estimators

Computing minimax estimators is complex in general.
Here, we will only mention a few key results.

Theorem
Let θ̂_π be the Bayes estimator for some prior π.
Suppose that R(θ, θ̂_π) ≤ r(π, θ̂_π) for all θ.
Then θ̂_π is a minimax estimator and π is the least favorable prior
(i.e. for any prior π′, r(π, θ̂_π) ≥ r(π′, θ̂_π′)).

Proof.
Suppose θ̂_π is not a minimax estimator.
Then there is θ̂₀ s.t. sup_θ R(θ, θ̂₀) < sup_θ R(θ, θ̂_π).
Hence, r(π, θ̂₀) ≤ sup_θ R(θ, θ̂₀) < sup_θ R(θ, θ̂_π) ≤ r(π, θ̂_π),
contradicting the fact that θ̂_π is the Bayes estimator for π.

Corollary
Let θ̂_π be the Bayes estimator for some prior π.
Suppose that R(θ, θ̂_π) = c for some constant c.
Then θ̂_π is a minimax estimator.
Computing Minimax Estimators

Example
Let X₁, ..., Xₙ ~ Bernoulli(p). Assume squared error loss.
Assume a Beta(α, β) prior. Then the posterior mean is p̂ = (α + Σᵢ Xᵢ) / (α + β + n).
Now let α = β = √(n/4) = √n/2. Then R(p, p̂) = n / (4(n + √n)²), a constant.
Hence, by the previous theorem, p̂ is a minimax estimator.

Example
Consider again Bernoulli, but with loss L(p, p̂) = (p − p̂)² / (p(1 − p)).
Assume a uniform prior. Then the Bayes estimator is p̂ = (1/n) Σᵢ Xᵢ = X̄.
Then R(p, p̂) = 1/n, a constant. Hence, p̂ is a minimax estimator.
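The constant-risk claim in the first example can be checked directly. The sketch below (an illustration, not from the text) evaluates variance + bias² of p̂ = (S + √n/2)/(n + √n), where S = Σᵢ Xᵢ, at several values of p and compares each with n / (4(n + √n)²).

```python
import math

# Exact squared-error risk of p_hat = (S + sqrt(n)/2) / (n + sqrt(n)),
# where S ~ Binomial(n, p). Risk = variance + bias^2; the algebra says
# it equals n / (4 (n + sqrt(n))^2) for every p.

def risk(p, n):
    rn = math.sqrt(n)
    var = n * p * (1 - p) / (n + rn) ** 2
    bias = (n * p + rn / 2) / (n + rn) - p
    return var + bias ** 2

n = 25
target = n / (4 * (n + math.sqrt(n)) ** 2)
for p in [0.0, 0.2, 0.5, 0.8, 1.0]:
    print(f"p={p:.1f}: risk={risk(p, n):.6f} (target {target:.6f})")
```

The risk is flat in p, so the corollary (constant-risk Bayes estimator ⇒ minimax) applies.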
Computing Minimax Estimators

For the normal distribution, we can obtain a nice result.

Theorem
Let X₁, ..., Xₙ ~ N(θ, 1) and Θ = R. Let θ̂ = X̄.
Then θ̂ is minimax w.r.t. any well-behaved loss function
(one whose level sets are convex and symmetric about the origin).
Moreover, it is the only estimator with this property.

If the parameter space is bounded, the theorem above does not apply.

Example
Suppose that X ~ N(θ, 1) and Θ = [−m, m], where 0 < m < 1.
Assume squared error loss. Assume the prior ½(δ₋ₘ + δₘ),
where δ denotes a point mass. Then θ̂(X) = m tanh(mX) is the minimax estimator.
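As a sanity check (an illustration, not from the text), the risk R(θ, θ̂) = ∫ (m tanh(mx) − θ)² φ(x − θ) dx of the bounded-parameter estimator can be integrated numerically. Since both θ̂ and θ lie in [−m, m] with m < 1, its risk should stay well below the constant risk 1 of the unrestricted estimator θ̂(X) = X.

```python
import math

# Risk of theta_hat(X) = m * tanh(m * X) for X ~ N(theta, 1), theta in [-m, m],
# computed by a simple Riemann sum over x.

def phi(z):
    # Standard normal density.
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def risk(theta, m, lo=-10.0, hi=10.0, step=1e-3):
    total, x = 0.0, lo
    while x < hi:
        total += (m * math.tanh(m * x) - theta) ** 2 * phi(x - theta) * step
        x += step
    return total

m = 0.5
risks = [risk(theta, m) for theta in [-m, -m / 2, 0.0, m / 2, m]]
print([round(r, 4) for r in risks])
print("max risk:", round(max(risks), 4), "(vs. 1.0 for theta_hat(X) = X)")
```

The two-point prior equalizes the risk only at the endpoints ±m, so the risk need not be flat over all of [−m, m]; the point is that its maximum beats that of X.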
MLE approximates Bayes/minimax estimator

Still, it is challenging to compute Bayes and minimax estimators.
Surprisingly, it can be shown that for large samples, the MLE is
approximately Bayes and minimax for parametric models.

Idea Sketch (MLE is approximately Bayes)
For large n, the effect of the prior is negligible. Moreover,
√n(θ̂_Bayes − θ) → N(0, 1/I(θ)),
the same limiting distribution as the MLE.
Thus, the Bayes estimator is approximately the MLE.

Idea Sketch (MLE is approximately minimax)
Assume squared error loss. Let θ̂ be the MLE. Then
R(θ, θ̂) = V_θ(θ̂) + bias² ≈ V_θ(θ̂) ≈ 1/(nI(θ)),
since typically the squared bias is O(n⁻²) while the variance is O(n⁻¹).
For large n, any other estimator θ̃ satisfies R(θ, θ̃) ≥ 1/(nI(θ)) ≈ R(θ, θ̂)
approximately (essentially the Cramér–Rao bound).
Thus, the MLE is approximately minimax.
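The approximation R(θ, θ̂) ≈ 1/(nI(θ)) is easy to check by simulation in the Bernoulli model, where the MLE is X̄ and the Fisher information is I(p) = 1/(p(1 − p)), so 1/(nI(p)) = p(1 − p)/n. The sketch below (illustrative sample size and p) estimates the MLE's mean squared error by Monte Carlo.

```python
import random

# Monte Carlo check that the MLE's squared-error risk matches 1/(n I(theta)).
# For Bernoulli(p): MLE = X_bar, I(p) = 1/(p(1-p)), so risk ~ p(1-p)/n.

random.seed(0)
n, p, trials = 50, 0.3, 20000

mse = 0.0
for _ in range(trials):
    s = sum(1 for _ in range(n) if random.random() < p)  # S ~ Binomial(n, p)
    mse += (s / n - p) ** 2
mse /= trials

theory = p * (1 - p) / n  # 1 / (n I(p))
print(f"Monte Carlo MSE = {mse:.6f}, 1/(n I(p)) = {theory:.6f}")
```

Here X̄ is exactly unbiased, so the agreement is exact up to Monte Carlo error; in general models the O(n⁻²) squared-bias term makes it only an approximation.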
Admissibility

As we have seen, we cannot pick a single best estimator by comparing risk functions.
However, we can decide which estimators are not the best.

Definition
An estimator θ̂ is inadmissible if there exists an estimator θ̂′ s.t.
R(θ, θ̂′) ≤ R(θ, θ̂) for all θ and
R(θ, θ̂′) < R(θ, θ̂) for at least one θ.
An estimator θ̂ is admissible if it is not inadmissible.

Definition
An estimator θ̂ is strongly inadmissible if there exist an estimator θ̂′
and ε > 0 s.t. R(θ, θ̂′) < R(θ, θ̂) − ε for all θ.
Admissibility

Note that admissibility only rules out bad estimators.
Admissible estimators are not necessarily good, and can even be bad.

Example
Let X ~ N(θ, 1). Assume squared error loss. Let θ̂(X) = 3.
Then θ̂(X) is admissible, even though it is clearly bad.

Proof.
Suppose not. Then there exists θ̂′ s.t. R(3, θ̂′) ≤ R(3, θ̂) = 0.
Hence, 0 = R(3, θ̂′) = ∫ (θ̂′(x) − 3)² f(x; 3) dx, so θ̂′(X) = 3 (a.s.),
and θ̂′ has the same risk function as θ̂, a contradiction.
Admissibility of Bayes Estimators

Under regularity conditions, Bayes estimators are admissible.

Theorem (unique case)
If a Bayes estimator is unique, then it is admissible.

Theorem (discrete case)
If Θ is a discrete set, then all Bayes estimators are admissible.

Theorem (continuous case)
If Θ is a continuous set, and if R(θ, θ̂) is continuous in θ for every θ̂,
then all Bayes estimators are admissible.
Admissibility of Minimax Estimators

Neither minimaxity nor admissibility implies the other in general.
However, there are some facts linking them.

Theorem (admissibility ⇒ minimaxity)
Suppose that θ̂ has constant risk and is admissible.
Then it is a minimax estimator.

Proof.
Suppose not. Then there exists θ̂′ s.t., for all θ,
R(θ, θ̂′) ≤ sup_θ R(θ, θ̂′) < sup_θ R(θ, θ̂) = R(θ, θ̂),
where the last equality uses the constant risk.
This implies that θ̂ is inadmissible, a contradiction.

Theorem (minimaxity ⇒ admissibility)
If θ̂ is minimax, then it is not strongly inadmissible.