6.867 Homework 1
MIT, Fall 2005
Due: Wednesday, September 28, in class. 10% off per day late.
Don’t spend time doing hairy integrals by hand. If that’s the answer to a question, you can
just leave it un-integrated.
1 A Tale of Two Estimators
In this problem, we’re going to explore the bias-variance trade-off in a very simple setting. We
have a set of unidimensional data, X1, . . . , Xn, drawn from the positive reals. We will consider two
different models for its distribution:
• Model 1: The data are drawn from a uniform distribution on the interval [0, b]. This model
has a single real parameter b > 0.
• Model 2: The data are drawn from a uniform distribution on the interval [a, b]. This model
has two positive real parameters, a and b, such that 0 < a < b.
We are interested in comparing estimates of the mean of the distribution, derived from each of
these two models.
Question 0: What’s the mean of each of the distributions?
1.1 The Right Stuff
Let’s start by considering the situation in which the data were, in fact, drawn from an instance of
the model under consideration: either a uniform distribution on [0, b∗] (for model 1) or a uniform
distribution on [a∗, b∗] (for model 2). If you are, like Commander Toad¹, “brave and bold, bold and
brave,” then avert your eyes from the rest of this section and compute the bias, variance, and mean
squared error of ML estimator of the mean of each model, on data drawn from this distribution, as
a function of n. Otherwise, let’s take it step by step.
As we saw in recitation, the ML estimator for b is ˆb = max_i X_i. By a similar argument, the ML
estimator for a is ˆa = min_i X_i. To understand the properties of these estimators, we have to start
by deriving their PDFs. The minimum and maximum of a data set are also known as its first and
nth order statistics, and are sometimes written X_(1) and X_(n).
¹ Commander Toad in Space, Jane Yolen, Putnam, 1996.
1.1.1 Using Model 1
In model 1, we just need to consider the distribution of ˆb. Generally speaking, the pdf of the
maximum of a set of data drawn from pdf f, with cdf F, is
fˆb(x) = n F(x)^(n−1) f(x) .
The idea is that, if x is the maximum, then n − 1 of the other data values will have to be less than
x, and the probability of that is F(x)n−1, and then one value will have to equal x, the probability
of which is f(x). We multiply by n because there are n different ways to choose the data value
that could be the maximum.
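As a sanity check (an optional aside, not part of the assignment), the formula can be probed numerically: for data drawn uniformly from [0, b] we have F(x) = x/b, so the CDF of the maximum should be F(x)^n = (x/b)^n. A minimal simulation sketch, using only the standard library:

```python
import random

# Check P(max_i X_i <= x) = F(x)^n for Uniform[0, b] data by simulation.
def empirical_max_cdf(n, b, x, trials=20000, seed=0):
    """Estimate P(max of n Uniform[0, b] draws <= x)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        m = max(rng.uniform(0, b) for _ in range(n))
        if m <= x:
            hits += 1
    return hits / trials

n, b, x = 5, 2.0, 1.5
theoretical = (x / b) ** n  # F(x)^n
print(abs(empirical_max_cdf(n, b, x) - theoretical) < 0.02)
```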
Question 1: What is fˆb in the particular case where the data are drawn uniformly from 0 to
b?
Question 2: Write an expression for the expected value of ˆµ = ˆb/2 (the ML estimate of the
mean), as an integral.
In fact, there’s a nice closed form expression, which you can use in the following questions:
E[ˆµ] = (b∗/2) · n/(n + 1) .
Question 3: What is the squared bias of ˆµ? Is this estimator unbiased? Is it asymptotically
unbiased? (Reminder: bias²(ˆθ) = (E_D[ˆθ] − θ)².)
Question 4: Write an expression for the variance of ˆµ, as an integral.
The closed form for the variance is
V[ˆµ] = (b∗²/4) · n/((n + 1)²(n + 2)) .
Question 5: What is the mean squared error of ˆµ? (Reminder: MSE(ˆθ) = bias²(ˆθ) + var(ˆθ).)
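These closed forms are easy to spot-check by simulation. The sketch below (an optional aside, assuming ˆµ = ˆb/2 = max_i X_i / 2) compares the empirical mean and variance of ˆµ against the formulas above:

```python
import random

# Empirically check E[mu_hat] = (b*/2) n/(n+1) and
# V[mu_hat] = (b*^2/4) n/((n+1)^2 (n+2)) for mu_hat = max_i X_i / 2.
def simulate_mu_hat(n, b_star, trials=50000, seed=1):
    rng = random.Random(seed)
    vals = [max(rng.uniform(0, b_star) for _ in range(n)) / 2
            for _ in range(trials)]
    mean = sum(vals) / trials
    var = sum((v - mean) ** 2 for v in vals) / trials
    return mean, var

n, b_star = 4, 1.0
mean, var = simulate_mu_hat(n, b_star)
print(abs(mean - (b_star / 2) * n / (n + 1)) < 0.005)
print(abs(var - (b_star ** 2 / 4) * n / ((n + 1) ** 2 * (n + 2))) < 0.005)
```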
1.1.2 Another estimator for Model 1
We might consider something other than the MLE for Model 1. Consider the estimator
ˆµ = x_(n)(n + 1)/(2n) .
Question 6: Write an expression for the expected value of this version of ˆµ as an integral.
Then solve the integral.
Question 7: What is the squared bias of this estimator for ˆµ? Is this estimator unbiased? Is
it asymptotically unbiased?
Question 8: Write an expression for the variance of ˆµ as an integral.
The closed form for the variance is
V[ˆµ] = b∗²/(4n(n + 2)) .
Question 9: What is the mean squared error of this version of ˆµ?
Question 10: What are the relative advantages of the estimator from the previous section and
this one?
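One way to ground Question 10 is to tabulate the two MSEs directly from the closed forms above. A small sketch (the function names are illustrative, not prescribed):

```python
# Compare the MSEs of the two Model 1 estimators on Uniform[0, b*] data:
# the MLE-based mu1 = x_(n)/2 and the rescaled mu2 = x_(n)(n+1)/(2n).
def mse_mle(n, b_star):
    bias2 = (b_star / (2 * (n + 1))) ** 2              # from E[mu] = (b*/2) n/(n+1)
    var = (b_star ** 2 / 4) * n / ((n + 1) ** 2 * (n + 2))
    return bias2 + var

def mse_unbiased(n, b_star):
    return b_star ** 2 / (4 * n * (n + 2))             # bias is zero

for n in (1, 5, 50):
    print(n, mse_mle(n, 1.0), mse_unbiased(n, 1.0))
```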
1.1.3 Using Model 2
Now, let’s do the same thing, but for the MLE for model 2. We have to start by thinking about
the joint distribution of the MLEs ˆa and ˆb. Generally speaking, the joint pdf of the minimum and the
maximum of a set of data drawn from pdf f, with cdf F, is
fˆa,ˆb(x, y) = n(n − 1)(F(y) − F(x))^(n−2) f(x)f(y) .
Question 11: Explain in words why this makes sense.
Question 12: What is fˆa,ˆb in the particular case where the data are drawn uniformly from a
to b?
Question 13: Write an expression for the expected value of ˆµ in terms of an integral.
Here’s what it should integrate to:
E[ˆµ] = (a∗ + b∗)/2 .
Question 14: What is the squared bias of ˆµ? Is this estimator unbiased? Is it asymptotically
unbiased?
Question 15: Write an expression for the variance of ˆµ in terms of an integral.
The closed form for the variance is
V[ˆµ] = (b∗ − a∗)²/(2(n + 1)(n + 2)) .
Question 16: What is the mean squared error of ˆµ?
1.1.4 Comparing Models
What if we have data that are actually drawn uniformly from the interval [0, 1]? Both models
seem like reasonable choices.
Question 17: Show plots that compare the bias, variance, and mse of each of the estimators
we’ve considered on that data, as a function of n. (Use the formulas above; don’t do it by actually
generating data). Write a paragraph in English explaining your results. What estimator would you
use?
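The numbers behind these plots can be generated straight from the closed forms; a sketch follows (feed the lists into matplotlib or any plotting tool; the function names are illustrative, not prescribed):

```python
# Bias^2, variance, and MSE of each estimator on Uniform[0, 1] data
# (a* = 0, b* = 1), computed from the closed forms above.
def model1_mle(n, b=1.0):
    bias2 = (b / (2 * (n + 1))) ** 2
    var = (b ** 2 / 4) * n / ((n + 1) ** 2 * (n + 2))
    return bias2, var, bias2 + var

def model1_unbiased(n, b=1.0):
    var = b ** 2 / (4 * n * (n + 2))
    return 0.0, var, var

def model2_mle(n, a=0.0, b=1.0):
    var = (b - a) ** 2 / (2 * (n + 1) * (n + 2))
    return 0.0, var, var        # unbiased: E[mu] = (a* + b*)/2

ns = range(1, 51)
curves = {"m1_mle": [model1_mle(n)[2] for n in ns],
          "m1_unb": [model1_unbiased(n)[2] for n in ns],
          "m2_mle": [model2_mle(n)[2] for n in ns]}
print({k: v[-1] for k, v in curves.items()})  # MSEs at n = 50
```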
Now, what if we have data that are actually drawn uniformly from the interval [.1, 1]? It seems
like model 2 is the only reasonable choice. But is it?
We already know the bias, variance, and mse for model 2 in this case. But what about the
MLE and unbiased estimators for model 1? Let’s characterize the general behavior when we use
the estimator ˆµ = x(n)(n + 1)/(2n) on data drawn from an interval [a∗, b∗].
Question 18: Write an expression for the expected value of ˆµ in terms of an integral.
The closed form expression is
E[ˆµ] = (a∗ + b∗n)/(2n) .
Question 19: Explain in English why this answer makes sense.
Question 20: What is the squared bias of this ˆµ? Explain in English why your answer makes
sense. Consider how it behaves as a∗ increases, and how it behaves as n increases.
Question 21: Write an expression for the variance of this ˆµ in terms of an integral.
The closed form for the variance is
V[ˆµ] = (b∗ − a∗)²/(4n(n + 2)) .
To save you some tedious algebra, we’ll tell you that the mean squared error of this ˆµ is
(apologies for the ugliness; let us know if you find a beautiful rewrite)
(b∗²n − 2a∗b∗n + a∗²(2 − 2n + n³)) / (4n²(n + 2)) .
Question 22: Show plots that compare the bias, variance, and mse of this estimator with the
regular model 2 estimator on data drawn from [0.1, 1], as a function of n. Are there circumstances
in which it would be better to use this estimator? If so, what are they and why? If not, why not?
Question 23: Show plots of mse of both estimators, as a function of n on data drawn from
[.01, 1] and on data drawn from [.2, 1]. How do things change? Explain why this makes sense.
2 Bayesian Estimation
2.1 Dirichlet Prior
Exercise borrowed from Stat180 at UCLA.
The Dirichlet distribution is a multivariate version of the Beta distribution. When we have a
coin with two outcomes, we really only need a single parameter θ to model the probability of heads.
But now let’s consider a “thick” coin that has three possible outcomes: heads, tails, and edge. Now
we need two parameters: θh is the probability of heads, θt is the probability of tails, and then the
probability of an edge is 1 − θh − θt.
The random variables V, W ∈ [0, 1] with V + W ≤ 1 have a Dirichlet distribution
with parameters α1, α2, α3 if their joint density is
f(v, w) = v^(α1−1) w^(α2−1) (1 − v − w)^(α3−1) · Γ(α1 + α2 + α3)/(Γ(α1)Γ(α2)Γ(α3)) .
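As an optional numerical aside, a Dirichlet can be sampled with only the standard library by normalizing independent Gamma draws; the sketch below checks the marginal mean E[θh] = α1/(α1 + α2 + α3), a fact you can use to sanity-check your answer to Question 24:

```python
import random

# Sample Dirichlet(a1, a2, a3) via normalized Gamma draws and check
# that the empirical mean of the first coordinate is a1 / (a1 + a2 + a3).
def sample_dirichlet(alphas, rng):
    gs = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(gs)
    return [g / s for g in gs]

rng = random.Random(2)
a1, a2, a3 = 2.0, 3.0, 5.0
draws = [sample_dirichlet([a1, a2, a3], rng)[0] for _ in range(40000)]
emp_mean = sum(draws) / len(draws)
print(abs(emp_mean - a1 / (a1 + a2 + a3)) < 0.005)
```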
This is a direct generalization of the Beta distribution.
Question 24: If (θh, θt) have a Dirichlet distribution as above, what is the marginal distribution
of θh?
Question 25: Suppose you are playing with a thick coin, and get results X1 . . . Xn, resulting
in H heads and T tails out of n throws. Given θh and θt the random variables H and T have a
multinomial distribution:
Pr(h, t) = [n!/(h! t! (n − h − t)!)] θh^h θt^t (1 − θh − θt)^(n−h−t) .
Assume a uniform prior on the simplex of possible values of θh and θt. What is the posterior
distribution for θh and θt?
Question 26: In this same setting, what is the predictive distribution for getting another head?
That is, what’s Pr(Xn+1 = heads|X1 . . . Xn)?
Question 27: Now assume a Dirichlet prior for θh and θt with parameters α1, α2, α3. What is
the posterior in this case?
Question 28: What is the predictive distribution?
Question 29: If you assume a squared-error loss, what is your estimate of θh and θt?
Question 30: What happens as n → ∞?
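For a concrete feel for these questions, here is a sketch of the posterior-mean computation under a Dirichlet(α1, α2, α3) prior; the uniform prior of Question 25 is the special case where all three αs equal 1. This conjugate-update form is standard, but you should derive it yourself before using it:

```python
# Posterior means of (theta_h, theta_t, theta_edge) for a thick coin with
# h heads and t tails out of n throws, under a Dirichlet(a1, a2, a3) prior.
# These are also the squared-error-loss estimates and the predictive
# probabilities for the next toss.
def posterior_mean(h, t, n, alphas=(1.0, 1.0, 1.0)):
    a1, a2, a3 = alphas
    e = n - h - t                     # number of edge outcomes
    total = n + a1 + a2 + a3
    return ((h + a1) / total, (t + a2) / total, (e + a3) / total)

print(posterior_mean(6, 3, 10))       # 6 heads, 3 tails, 1 edge; uniform prior
```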
3 Gaussian distributions
Question 31: (DeGroot) Consider a normal distribution for which the value of the mean µ is
unknown and the variance is 4, and suppose that the prior distribution of µ is a normal distribution
whose variance is 9. How large a random sample must be taken from the given normal distribution
in order to be able to specify an interval having a length of 1 unit such that the probability that µ
lies in this interval is at least 0.95?
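A numeric sketch of the computation this question calls for, assuming the standard normal-normal update (posterior precision = prior precision + n/data variance) and z ≈ 1.96 for a 95% central interval; derive the update yourself before trusting it:

```python
import math

# Smallest n such that the posterior for mu has a 95% central interval of
# length at most `length`, given data variance 4 and prior variance 9.
def required_n(data_var=4.0, prior_var=9.0, length=1.0, z=1.96):
    target_sd = length / (2 * z)          # half-length = z * posterior sd
    target_prec = 1.0 / target_sd ** 2    # required posterior precision
    n = (target_prec - 1.0 / prior_var) * data_var
    return math.ceil(n)

print(required_n())
```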
4 Presidential debate
(Gelman) On September 25, 1988, the evening of a presidential campaign debate, ABC News
conducted a survey of registered voters in the United States; 639 persons were polled before the
debate, and 639 different persons were polled after. The results are displayed below:
Survey Bush Dukakis No opinion/other Total
pre-debate 294 307 38 639
post-debate 288 332 19 639
Assume the surveys are independent simple random samples from the population of registered
voters. We will ignore the “no opinion” cases and model the data with two different binomial
distributions. For j = 1, 2, let αj be the proportion of voters who preferred Bush, out of those who
had a preference for either Bush or Dukakis at the time of survey j.
Question 32: What is the posterior probability that there was a shift toward Bush?
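One way to attack this numerically: with independent uniform priors on each αj, the posteriors are Beta(Bush_j + 1, Dukakis_j + 1), and Pr(α2 > α1) can be estimated by Monte Carlo with only the standard library. A sketch, not a substitute for the derivation:

```python
import random

# Monte Carlo estimate of the posterior probability of a shift toward Bush,
# Pr(alpha_2 > alpha_1), under independent uniform priors.
def prob_shift(trials=20000, seed=3):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a1 = rng.betavariate(294 + 1, 307 + 1)   # pre-debate survey
        a2 = rng.betavariate(288 + 1, 332 + 1)   # post-debate survey
        if a2 > a1:
            hits += 1
    return hits / trials

print(prob_shift())
```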
5 Decision theory
Question 33: (DeGroot) Suppose that a person is going to sell Fizzy Cola at a football game and
must decide in advance how much to order. Suppose that he makes a gain of m cents on each quart
that he sells at the game but suffers a loss of c cents on each quart that he has ordered but does not
sell. Assume that the demand for Fizzy Cola at the game, as measured in quarts, is a continuous
random variable X with PDF f and CDF F. Show that his expected profit will be maximized if he
orders an amount α such that F(α) = m/(m + c).
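The claim can be checked numerically for a concrete demand distribution. The sketch below uses Uniform[0, 100] demand (an illustrative choice, not given in the problem) and brute-force maximizes the expected profit; the peak should sit where F(α) = m/(m + c):

```python
# Expected profit for order quantity a, demand X ~ Uniform[0, 100]:
# profit = m * min(X, a) - c * (a - min(X, a)), integrated numerically.
def expected_profit(a, m, c, hi=100.0, steps=2000):
    total = 0.0
    dx = hi / steps
    for i in range(steps):
        x = (i + 0.5) * dx            # midpoint rule over the demand density
        sold = min(x, a)
        total += (m * sold - c * (a - sold)) * (dx / hi)
    return total

m, c = 3.0, 2.0
best = max((expected_profit(a, m, c), a) for a in range(0, 101))[1]
print(best)  # predicted optimum: 100 * m/(m + c) = 60
```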
(Duda, Hart, Stork) Suppose we use a randomized decision rule, which chooses action ai with
probability Pr(ai|x).
Question 34: Show that the resulting risk is given by
R = ∫ [ Σ_{i=1}^{m} R(ai|x) Pr(ai|x) ] Pr(x) dx .
Question 35: What choice of Pr(ai|x) minimizes R in general?
In many problems, one has the option either to assign the pattern to one of c classes, or to
reject it as being unrecognizable. If the cost for rejects is not too high, rejection may be a desirable
action. Let the cost for a correct answer be 0, for an incorrect answer be λs, and for a reject be λr.
Question 36: Under what conditions on Pr(ci|x), λs and λr should we reject?
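The shape of one standard answer (Chow's rejection rule) can be illustrated in code: choose the class with the largest posterior unless the conditional risk of classifying, λs(1 − max_i Pr(ci|x)), exceeds the reject cost λr. A sketch with hypothetical numbers; the derivation is yours to do:

```python
# Minimum-risk decision with a reject option: costs are 0 for a correct
# answer, lam_s for an error, lam_r for rejecting.
def decide(posteriors, lam_s, lam_r):
    """Return the index of the chosen class, or 'reject'."""
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    # Conditional risk of classifying as `best` vs. risk of rejecting.
    if lam_s * (1 - posteriors[best]) > lam_r:
        return "reject"
    return best

print(decide([0.45, 0.40, 0.15], lam_s=1.0, lam_r=0.2))  # top posterior too low
print(decide([0.90, 0.05, 0.05], lam_s=1.0, lam_r=0.2))  # confident: classify
```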