- 1. Sep 6th, 2001
Copyright © 2001, 2004, Andrew W. Moore
Learning with
Maximum Likelihood
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Note to other teachers and users of
these slides. Andrew would be delighted
if you found this source material useful in
giving your own lectures. Feel free to use
these slides verbatim, or to modify them
to fit your own needs. PowerPoint
originals are available. If you make use
of a significant portion of these slides in
your own lecture, please include this
message, or the following link to the
source repository of Andrew’s tutorials:
http://www.cs.cmu.edu/~awm/tutorials .
Comments and corrections gratefully
received.
- 2. Maximum Likelihood: Slide 2
Maximum Likelihood learning of
Gaussians for Data Mining
• Why we should care
• Learning Univariate Gaussians
• Learning Multivariate Gaussians
• What’s a biased estimator?
• Bayesian Learning of Gaussians
- 3. Maximum Likelihood: Slide 3
Why we should care
• Maximum Likelihood Estimation is a very
very very very fundamental part of data
analysis.
• “MLE for Gaussians” is training wheels for
our future techniques
• Learning Gaussians is more useful than you
might guess…
- 4. Maximum Likelihood: Slide 4
Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ (you do know σ²)
MLE: For which μ is x1, x2, … xR most likely?
MAP: Which μ maximizes p(μ | x1, x2, … xR, σ²)?
- 5. Maximum Likelihood: Slide 5
Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ (you do know σ²)
MLE: For which μ is x1, x2, … xR most likely?
MAP: Which μ maximizes p(μ | x1, x2, … xR, σ²)?
Sneer
- 6. Maximum Likelihood: Slide 6
Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ (you do know σ²)
MLE: For which μ is x1, x2, … xR most likely?
MAP: Which μ maximizes p(μ | x1, x2, … xR, σ²)?
Sneer
Despite this, we'll spend 95% of our time on MLE. Why? Wait and see…
- 7. Maximum Likelihood: Slide 7
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ (you do know σ²)
• MLE: For which μ is x1, x2, … xR most likely?

μ_mle = argmax_μ p(x1, x2, …, xR | μ, σ²)
- 8. Maximum Likelihood: Slide 8
Algebra Euphoria

μ_mle = argmax_μ p(x1, x2, …, xR | μ, σ²)
= (by i.i.d.)
= (monotonicity of log)
= (plug in formula for Gaussian)
= (after simplification)
- 9. Maximum Likelihood: Slide 9
Algebra Euphoria

μ_mle = argmax_μ p(x1, x2, …, xR | μ, σ²)
= argmax_μ ∏_{i=1}^R p(x_i | μ, σ²)   (by i.i.d.)
= argmax_μ ∑_{i=1}^R log p(x_i | μ, σ²)   (monotonicity of log)
= argmax_μ ∑_{i=1}^R −(x_i − μ)² / (2σ²)   (plug in formula for Gaussian)
= argmin_μ ∑_{i=1}^R (x_i − μ)²   (after simplification)
- 10. Maximum Likelihood: Slide 10
Intermission: A General Scalar MLE strategy
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Set ∂LL/∂θ = 0 for a maximum, creating an equation in terms of θ
4. Solve it*
5. Check that you've found a maximum rather than a minimum or saddle-point, and be careful if θ is constrained
*This is a perfect example of something that works perfectly in all textbook examples and usually involves surprising pain if you need it for something new.
- 11. Maximum Likelihood: Slide 11
The MLE μ

μ_mle = argmax_μ p(x1, x2, …, xR | μ, σ²)
= argmin_μ ∑_{i=1}^R (x_i − μ)²
= the μ s.t. 0 = ∂LL/∂μ = (what?)
- 12. Maximum Likelihood: Slide 12
The MLE μ

μ_mle = argmax_μ p(x1, x2, …, xR | μ, σ²)
= argmin_μ ∑_{i=1}^R (x_i − μ)²
= the μ s.t. 0 = ∂/∂μ ∑_{i=1}^R (x_i − μ)² = −2 ∑_{i=1}^R (x_i − μ)

Thus  μ = (1/R) ∑_{i=1}^R x_i
- 13. Maximum Likelihood: Slide 13
Lawks-a-lawdy!

μ_mle = (1/R) ∑_{i=1}^R x_i

• The best estimate of the mean of a distribution is the mean of the sample!
At first sight: This kind of pedantic, algebra-filled and ultimately unsurprising fact is exactly the reason people throw down their "Statistics" book and pick up their "Agent Based Evolutionary Data Mining Using The Neuro-Fuzz Transform" book.
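The closed-form result above can be checked numerically. A minimal sketch (the sample, seed, and parameter values below are illustrative assumptions, not from the slides) confirms that the Gaussian log-likelihood with σ² known peaks at the sample mean:

```python
import math
import random

random.seed(0)

# Hypothetical sample: 1000 draws from N(5, sigma^2) with sigma known.
sigma = 2.0
xs = [random.gauss(5.0, sigma) for _ in range(1000)]

def log_likelihood(mu):
    # log p(x_1..x_R | mu, sigma^2), summed over the i.i.d. sample
    return sum(-math.log(sigma * math.sqrt(2 * math.pi))
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

mu_mle = sum(xs) / len(xs)   # the closed-form answer: the sample mean

# The likelihood at the sample mean beats any nearby perturbation.
assert all(log_likelihood(mu_mle) > log_likelihood(mu_mle + d)
           for d in (-0.5, -0.01, 0.01, 0.5))
```

Because the log-likelihood is a downward parabola in μ, any perturbation away from the sample mean strictly lowers it.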
- 14. Maximum Likelihood: Slide 14
A General MLE strategy
Suppose θ = (θ1, θ2, …, θn)ᵀ is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus:

∂LL/∂θ = (∂LL/∂θ1, ∂LL/∂θ2, …, ∂LL/∂θn)ᵀ
- 15. Maximum Likelihood: Slide 15
A General MLE strategy
Suppose θ = (θ1, θ2, …, θn)ᵀ is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Solve the set of simultaneous equations

∂LL/∂θ1 = 0,  ∂LL/∂θ2 = 0,  …,  ∂LL/∂θn = 0
- 16. Maximum Likelihood: Slide 16
A General MLE strategy
Suppose θ = (θ1, θ2, …, θn)ᵀ is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Solve the set of simultaneous equations

∂LL/∂θ1 = 0,  ∂LL/∂θ2 = 0,  …,  ∂LL/∂θn = 0

4. Check that you're at a maximum
- 17. Maximum Likelihood: Slide 17
A General MLE strategy
Suppose θ = (θ1, θ2, …, θn)ᵀ is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Solve the set of simultaneous equations

∂LL/∂θ1 = 0,  ∂LL/∂θ2 = 0,  …,  ∂LL/∂θn = 0

4. Check that you're at a maximum
If you can't solve them, what should you do?
- 18. Maximum Likelihood: Slide 18
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ or σ²
• MLE: For which θ = (μ, σ²) is x1, x2, … xR most likely?

log p(x1, x2, …, xR | μ, σ²) = −R(log σ + ½ log 2π) − (1/(2σ²)) ∑_{i=1}^R (x_i − μ)²

∂LL/∂μ = (1/σ²) ∑_{i=1}^R (x_i − μ)

∂LL/∂σ² = −R/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^R (x_i − μ)²
- 19. Maximum Likelihood: Slide 19
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ or σ²
• MLE: For which θ = (μ, σ²) is x1, x2, … xR most likely?

log p(x1, x2, …, xR | μ, σ²) = −R(log σ + ½ log 2π) − (1/(2σ²)) ∑_{i=1}^R (x_i − μ)²

0 = ∂LL/∂μ = (1/σ²) ∑_{i=1}^R (x_i − μ)

0 = ∂LL/∂σ² = −R/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^R (x_i − μ)²
- 20. Maximum Likelihood: Slide 20
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ or σ²
• MLE: For which θ = (μ, σ²) is x1, x2, … xR most likely?

log p(x1, x2, …, xR | μ, σ²) = −R(log σ + ½ log 2π) − (1/(2σ²)) ∑_{i=1}^R (x_i − μ)²

0 = (1/σ²) ∑_{i=1}^R (x_i − μ)  ⇒  μ = (1/R) ∑_{i=1}^R x_i

0 = −R/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^R (x_i − μ)²  ⇒  σ² = what?
- 21. Maximum Likelihood: Slide 21
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ or σ²
• MLE: For which θ = (μ, σ²) is x1, x2, … xR most likely?

μ_mle = (1/R) ∑_{i=1}^R x_i

σ²_mle = (1/R) ∑_{i=1}^R (x_i − μ_mle)²
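The pair of closed-form estimates above translates directly into code. A short sketch (the sample below, with true μ = 10 and σ² = 9, is a hypothetical illustration, not data from the slides):

```python
import random

random.seed(1)

# Hypothetical sample with true mu = 10 and sigma^2 = 9.
xs = [random.gauss(10.0, 3.0) for _ in range(10000)]
R = len(xs)

mu_mle = sum(xs) / R                                # (1/R) sum_i x_i
var_mle = sum((x - mu_mle) ** 2 for x in xs) / R    # (1/R) sum_i (x_i - mu_mle)^2

# With R = 10000 both estimates land close to the true parameters.
assert abs(mu_mle - 10.0) < 0.2
assert abs(var_mle - 9.0) < 0.5
```

Note that var_mle uses the MLE divisor R, not R − 1; the distinction is exactly the bias issue the next slides take up.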
- 22. Maximum Likelihood: Slide 22
Unbiased Estimators
• An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter.
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then

E[μ_mle] = E[(1/R) ∑_{i=1}^R x_i] = μ

μ_mle is unbiased
- 23. Maximum Likelihood: Slide 23
Biased Estimators
• An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter.
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then

E[σ²_mle] = E[(1/R) ∑_{i=1}^R (x_i − μ_mle)²] = E[(1/R) ∑_{i=1}^R (x_i − (1/R) ∑_{j=1}^R x_j)²] ≠ σ²

σ²_mle is biased
- 24. Maximum Likelihood: Slide 24
MLE Variance Bias
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then

E[σ²_mle] = E[(1/R) ∑_{i=1}^R (x_i − (1/R) ∑_{j=1}^R x_j)²] = (1 − 1/R) σ² ≠ σ²

Intuition check: consider the case of R = 1.
Why should our guts expect that σ²_mle would be an underestimate of true σ²?
How could you prove that?
- 25. Maximum Likelihood: Slide 25
Unbiased estimate of Variance
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then

E[σ²_mle] = E[(1/R) ∑_{i=1}^R (x_i − (1/R) ∑_{j=1}^R x_j)²] = (1 − 1/R) σ²

So define  σ²_unbiased = σ²_mle / (1 − 1/R)

So  E[σ²_unbiased] = σ²
- 26. Maximum Likelihood: Slide 26
Unbiased estimate of Variance
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then

E[σ²_mle] = E[(1/R) ∑_{i=1}^R (x_i − (1/R) ∑_{j=1}^R x_j)²] = (1 − 1/R) σ²

So define  σ²_unbiased = σ²_mle / (1 − 1/R) = (1/(R−1)) ∑_{i=1}^R (x_i − μ_mle)²

So  E[σ²_unbiased] = σ²
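One way to answer the earlier "How could you prove that?" without algebra is simulation. A sketch (true σ² = 4 and the small sample size R = 5 are illustrative assumptions) that averages both variance estimators over many repeated samples:

```python
import random

random.seed(2)
true_var = 4.0   # sigma^2, assumed for this sketch
R = 5            # a small sample size makes the bias visible

# Average sigma^2_mle and sigma^2_unbiased over many repeated samples.
n_trials = 20000
sum_mle, sum_unb = 0.0, 0.0
for _ in range(n_trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(R)]
    mu = sum(xs) / R
    ss = sum((x - mu) ** 2 for x in xs)
    sum_mle += ss / R          # biased MLE: divide by R
    sum_unb += ss / (R - 1)    # unbiased version: divide by R - 1
mean_mle = sum_mle / n_trials
mean_unb = sum_unb / n_trials

# E[sigma^2_mle] should land near (1 - 1/R) sigma^2 = 3.2,
# while E[sigma^2_unbiased] should land near sigma^2 = 4.0.
assert abs(mean_mle - (1 - 1 / R) * true_var) < 0.1
assert abs(mean_unb - true_var) < 0.1
```

The simulation reproduces the (1 − 1/R) shrinkage factor derived on the previous slides.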
- 27. Maximum Likelihood: Slide 27
Unbiaseditude discussion
• Which is best?

σ²_unbiased = (1/(R−1)) ∑_{i=1}^R (x_i − μ_mle)²

σ²_mle = (1/R) ∑_{i=1}^R (x_i − μ_mle)²

Answer:
• It depends on the task
• And it doesn't make much difference once R → large
- 28. Maximum Likelihood: Slide 28
Don't get too excited about being unbiased
• Assume x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• Suppose we had these estimators for the mean

μ_suboptimal = (1/(R + 7)) ∑_{i=1}^R x_i

μ_crap = x1

Are either of these unbiased?
Will either of them asymptote to the correct value as R gets large?
Which is more useful?
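The three questions above can be explored by simulation. A sketch (true μ = 6, the trial counts, and the seed are illustrative assumptions) comparing the two estimators at small and large R:

```python
import random

random.seed(3)
true_mu = 6.0

def estimates(R, n_trials=5000):
    """Average each estimator over repeated samples of size R."""
    tot_sub, tot_crap = 0.0, 0.0
    for _ in range(n_trials):
        xs = [random.gauss(true_mu, 1.0) for _ in range(R)]
        tot_sub += sum(xs) / (R + 7)   # mu_suboptimal: E = R/(R+7) * mu
        tot_crap += xs[0]              # mu_crap: unbiased, but never improves
    return tot_sub / n_trials, tot_crap / n_trials

sub_small, crap_small = estimates(R=3)
sub_big, crap_big = estimates(R=1000, n_trials=500)

# mu_crap is unbiased at every R; mu_suboptimal is biased for small R
# but asymptotes to the truth as R grows (R/(R+7) -> 1).
assert abs(crap_small - true_mu) < 0.1          # unbiased even at R = 3
assert abs(sub_small - true_mu * 3 / 10) < 0.1  # E = (3/10) * 6 = 1.8
assert abs(sub_big - true_mu) < 0.1             # nearly unbiased at large R
```

So unbiasedness alone doesn't make μ_crap useful: its variance never shrinks, while the biased μ_suboptimal converges.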
- 29. Maximum Likelihood: Slide 29
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MLE: For which θ = (μ, Σ) is x1, x2, … xR most likely?

μ_mle = (1/R) ∑_{k=1}^R x_k

Σ_mle = (1/R) ∑_{k=1}^R (x_k − μ_mle)(x_k − μ_mle)ᵀ
- 30. Maximum Likelihood: Slide 30
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MLE: For which θ = (μ, Σ) is x1, x2, … xR most likely?

μ_mle = (1/R) ∑_{k=1}^R x_k

Σ_mle = (1/R) ∑_{k=1}^R (x_k − μ_mle)(x_k − μ_mle)ᵀ

(μ_mle)_i = (1/R) ∑_{k=1}^R x_ki

Where 1 ≤ i ≤ m, x_ki is the value of the ith component of x_k (the ith attribute of the kth record), and (μ_mle)_i is the ith component of μ_mle.
- 31. Maximum Likelihood: Slide 31
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MLE: For which θ = (μ, Σ) is x1, x2, … xR most likely?

μ_mle = (1/R) ∑_{k=1}^R x_k

Σ_mle = (1/R) ∑_{k=1}^R (x_k − μ_mle)(x_k − μ_mle)ᵀ

(Σ_mle)_ij = (1/R) ∑_{k=1}^R (x_ki − (μ_mle)_i)(x_kj − (μ_mle)_j)

Where 1 ≤ i ≤ m, 1 ≤ j ≤ m, x_ki is the value of the ith component of x_k (the ith attribute of the kth record), and (Σ_mle)_ij is the (i,j)th component of Σ_mle.
- 32. Maximum Likelihood: Slide 32
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MLE: For which θ = (μ, Σ) is x1, x2, … xR most likely?

μ_mle = (1/R) ∑_{k=1}^R x_k

Σ_mle = (1/R) ∑_{k=1}^R (x_k − μ_mle)(x_k − μ_mle)ᵀ

Q: How would you prove this?
A: Just plug through the MLE recipe.
Note how Σ_mle is forced to be symmetric non-negative definite.
Note the unbiased case:

Σ_unbiased = (1/(R−1)) ∑_{k=1}^R (x_k − μ_mle)(x_k − μ_mle)ᵀ

How many datapoints would you need before the Gaussian has a chance of being non-degenerate?
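The multivariate formulas can be implemented with nothing more than nested loops. A pure-Python sketch for m = 2 (the data below, drawn from a hypothetical diagonal-covariance Gaussian, are illustrative assumptions):

```python
import random

random.seed(4)

# Hypothetical 2-D data: independent components with means (1, -2)
# and variances (4, 1), so the true covariance is diag(4, 1).
R = 20000
data = [[random.gauss(1.0, 2.0), random.gauss(-2.0, 1.0)] for _ in range(R)]
m = 2

# mu_mle = (1/R) sum_k x_k, computed component-wise
mu = [sum(x[i] for x in data) / R for i in range(m)]

# (Sigma_mle)_ij = (1/R) sum_k (x_ki - mu_i)(x_kj - mu_j)
Sigma = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in data) / R
          for j in range(m)] for i in range(m)]

assert abs(mu[0] - 1.0) < 0.1 and abs(mu[1] + 2.0) < 0.1
assert abs(Sigma[0][0] - 4.0) < 0.2 and abs(Sigma[1][1] - 1.0) < 0.1
assert abs(Sigma[0][1]) < 0.1          # true cross-covariance is 0
assert Sigma[0][1] == Sigma[1][0]      # Sigma_mle is symmetric by construction
```

The last assertion illustrates the slide's remark: the outer-product form makes Σ_mle symmetric automatically.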
- 33. Maximum Likelihood: Slide 33
Confidence intervals
We need to talk
We need to discuss how accurate we expect μ_mle and Σ_mle to be as a function of R.
And we need to consider how to estimate these accuracies from data…
• Analytically *
• Non-parametrically (using randomization and bootstrapping) *
But we won't. Not yet.
*Will be discussed in future Andrew lectures…just before we need this technology.
- 34. Maximum Likelihood: Slide 34
Structural error
Actually, we need to talk about something else too…
What if we do all this analysis when the true distribution is in fact not Gaussian?
How can we tell? *
How can we survive? *
*Will be discussed in future Andrew lectures…just before we need this technology.
- 35. Maximum Likelihood: Slide 35
Gaussian MLE in action
Using R=392 cars from the
“MPG” UCI dataset supplied
by Ross Quinlan
- 36. Maximum Likelihood: Slide 36
Data-starved Gaussian MLE
Using three subsets of MPG.
Each subset has 6
randomly-chosen cars.
- 38. Maximum Likelihood: Slide 38
Multivariate MLE
Covariance matrices are not exciting to look at
- 39. Maximum Likelihood: Slide 39
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
- 40. Maximum Likelihood: Slide 40
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
Step 1a: Put a prior on Σ:  Σ ~ IW(ν0, (ν0 − m − 1) Σ0)
This thing is called the Inverse-Wishart distribution.
A PDF over SPD matrices!
- 41. Maximum Likelihood: Slide 41
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
Step 1a: Put a prior on Σ:  Σ ~ IW(ν0, (ν0 − m − 1) Σ0)
This thing is called the Inverse-Wishart distribution.
A PDF over SPD matrices!
ν0 small: "I am not sure about my guess of Σ0"
ν0 large: "I'm pretty sure about my guess of Σ0"
Σ0: (Roughly) my best guess of Σ.  E[Σ] = Σ0
- 42. Maximum Likelihood: Slide 42
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
Step 1a: Put a prior on Σ:  Σ ~ IW(ν0, (ν0 − m − 1) Σ0)
Step 1b: Put a prior on μ | Σ:  μ | Σ ~ N(μ0, Σ / κ0)
Together, "Σ" and "μ | Σ" define a joint distribution on (μ, Σ)
- 43. Maximum Likelihood: Slide 43
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
Step 1a: Put a prior on Σ:  Σ ~ IW(ν0, (ν0 − m − 1) Σ0)
Step 1b: Put a prior on μ | Σ:  μ | Σ ~ N(μ0, Σ / κ0)
Together, "Σ" and "μ | Σ" define a joint distribution on (μ, Σ)
μ0: My best guess of μ.  E[μ] = μ0
κ0 small: "I am not sure about my guess of μ0"
κ0 large: "I'm pretty sure about my guess of μ0"
Notice how we are forced to express our ignorance of μ proportionally to Σ.
- 44. Maximum Likelihood: Slide 44
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
Step 1a: Put a prior on Σ:  Σ ~ IW(ν0, (ν0 − m − 1) Σ0)
Step 1b: Put a prior on μ | Σ:  μ | Σ ~ N(μ0, Σ / κ0)
Why do we use this form of prior?
- 45. Maximum Likelihood: Slide 45
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
Step 1a: Put a prior on Σ:  Σ ~ IW(ν0, (ν0 − m − 1) Σ0)
Step 1b: Put a prior on μ | Σ:  μ | Σ ~ N(μ0, Σ / κ0)
Why do we use this form of prior?
Actually, we don't have to.
But it is computationally and algebraically convenient…
…it's a conjugate prior.
- 46. Maximum Likelihood: Slide 46
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Prior:  Σ ~ IW(ν0, (ν0 − m − 1) Σ0),  μ | Σ ~ N(μ0, Σ / κ0)
Step 2:

x̄ = (1/R) ∑_{k=1}^R x_k
μ_R = (κ0 μ0 + R x̄) / (κ0 + R)
κ_R = κ0 + R
ν_R = ν0 + R
Σ_R = [ (ν0 − m − 1) Σ0 + ∑_{k=1}^R (x_k − x̄)(x_k − x̄)ᵀ + (x̄ − μ0)(x̄ − μ0)ᵀ / (1/κ0 + 1/R) ] / (ν_R − m − 1)

Step 3: Posterior:  Σ ~ IW(ν_R, (ν_R − m − 1) Σ_R),  μ | Σ ~ N(μ_R, Σ / κ_R)
Result: μ_map = μ_R,  E[Σ | x1, x2, … xR] = Σ_R
- 47. Maximum Likelihood: Slide 47
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Prior:  Σ ~ IW(ν0, (ν0 − m − 1) Σ0),  μ | Σ ~ N(μ0, Σ / κ0)
Step 2:

x̄ = (1/R) ∑_{k=1}^R x_k
μ_R = (κ0 μ0 + R x̄) / (κ0 + R)
κ_R = κ0 + R
ν_R = ν0 + R
Σ_R = [ (ν0 − m − 1) Σ0 + ∑_{k=1}^R (x_k − x̄)(x_k − x̄)ᵀ + (x̄ − μ0)(x̄ − μ0)ᵀ / (1/κ0 + 1/R) ] / (ν_R − m − 1)

Step 3: Posterior:  Σ ~ IW(ν_R, (ν_R − m − 1) Σ_R),  μ | Σ ~ N(μ_R, Σ / κ_R)
Result: μ_map = μ_R,  E[Σ | x1, x2, … xR] = Σ_R
• Look carefully at what these formulae are doing. It's all very sensible.
• Conjugate priors mean the prior form and posterior form are the same, characterized by "sufficient statistics" of the data.
• The marginal distribution on μ is a Student-t.
• One point of view: it's pretty academic if R > 30.
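Step 2's updates are just arithmetic, so they are easy to sketch. A one-dimensional (m = 1) illustration, where the hyperparameters ν0, Σ0, κ0, μ0 and the data are illustrative assumptions, not values from the slides:

```python
import random

random.seed(5)

# One-dimensional (m = 1) sketch of the conjugate update in Step 2.
# All hyperparameter values below are illustrative guesses.
m = 1
nu0, Sigma0 = 5.0, 2.0     # prior on the variance: E[Sigma] = Sigma0
kappa0, mu0 = 2.0, 0.0     # prior on the mean given the variance

xs = [random.gauss(3.0, 1.0) for _ in range(200)]  # hypothetical data
R = len(xs)

xbar = sum(xs) / R
mu_R = (kappa0 * mu0 + R * xbar) / (kappa0 + R)
kappa_R = kappa0 + R
nu_R = nu0 + R
Sigma_R = ((nu0 - m - 1) * Sigma0
           + sum((x - xbar) ** 2 for x in xs)
           + (xbar - mu0) ** 2 / (1 / kappa0 + 1 / R)) / (nu_R - m - 1)

# mu_map is pulled slightly from xbar toward the prior mean mu0,
# and with R = 200 the data dominate the prior.
assert abs(mu_R - xbar) < abs(xbar - mu0)
assert abs(mu_R - 3.0) < 0.3
assert 0.5 < Sigma_R < 2.0
```

This shows the "very sensible" behavior the slide mentions: μ_R is a precision-weighted blend of prior guess and sample mean, and Σ_R pools the prior scatter with the data scatter.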
- 48. Maximum Likelihood: Slide 48
Where we're at
[Diagram: Inputs → Classifier (predict category); Inputs → Density Estimator (probability); Inputs → Regressor (predict real no.). Methods so far: Dec Tree, Joint BC, Naïve BC, Joint DE, Naïve DE (categorical inputs only); Gauss DE (real-valued inputs only); some methods handle mixed real/categorical inputs.]
- 49. Maximum Likelihood: Slide 49
What you should know
• The Recipe for MLE
• Why do we sometimes prefer MLE to MAP?
• Understand MLE estimation of Gaussian parameters
• Understand "biased estimator" versus "unbiased estimator"
• Appreciate the outline behind Bayesian estimation of Gaussian parameters
- 50. Maximum Likelihood: Slide 50
Useful exercise
• We'd already done some MLE in this class without even telling you!
• Suppose categorical arity-n inputs x1, x2, … xR ~ (i.i.d.) from a multinomial M(p1, p2, …, pn), where P(x_k = j | p) = p_j
• What is the MLE p = (p1, p2, …, pn)?
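If you want to check your answer to the exercise numerically, the standard result (empirical class frequencies) can be verified against a simulated sample. The arity, true probabilities, and sample size below are illustrative assumptions:

```python
from collections import Counter
import random

random.seed(6)

# Hypothetical categorical sample: arity n = 3, true p = (0.5, 0.3, 0.2).
true_p = [0.5, 0.3, 0.2]
R = 10000
xs = random.choices(range(3), weights=true_p, k=R)

# The multinomial MLE: p_j = (# records with x = j) / R
counts = Counter(xs)
p_mle = [counts[j] / R for j in range(3)]

assert abs(sum(p_mle) - 1.0) < 1e-9                       # a valid distribution
assert all(abs(p_mle[j] - true_p[j]) < 0.03 for j in range(3))
```

Deriving this with the vector MLE recipe requires handling the constraint ∑_j p_j = 1 (e.g. with a Lagrange multiplier), which is exactly the "be careful if θ is constrained" caveat from the scalar recipe.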