Sep 6th, 2001
Copyright © 2001, 2004, Andrew W. Moore
Learning with
Maximum Likelihood
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Note to other teachers and users of
these slides. Andrew would be delighted
if you found this source material useful in
giving your own lectures. Feel free to use
these slides verbatim, or to modify them
to fit your own needs. PowerPoint
originals are available. If you make use
of a significant portion of these slides in
your own lecture, please include this
message, or the following link to the
source repository of Andrew’s tutorials:
http://www.cs.cmu.edu/~awm/tutorials .
Comments and corrections gratefully
received.
Maximum Likelihood: Slide 2
Copyright © 2001, 2004, Andrew W. Moore
Maximum Likelihood learning of
Gaussians for Data Mining
• Why we should care
• Learning Univariate Gaussians
• Learning Multivariate Gaussians
• What’s a biased estimator?
• Bayesian Learning of Gaussians
Maximum Likelihood: Slide 3
Copyright © 2001, 2004, Andrew W. Moore
Why we should care
• Maximum Likelihood Estimation is a very
very very very fundamental part of data
analysis.
• “MLE for Gaussians” is training wheels for
our future techniques
• Learning Gaussians is more useful than you
might guess…
Maximum Likelihood: Slide 4
Copyright © 2001, 2004, Andrew W. Moore
Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ
(you do know σ²)
MLE: For which μ is x1, x2, … xR most likely?
MAP: Which μ maximizes p(μ | x1, x2, … xR, σ²)?
Maximum Likelihood: Slide 5
Copyright © 2001, 2004, Andrew W. Moore
Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ
(you do know σ²)
MLE: For which μ is x1, x2, … xR most likely?
MAP: Which μ maximizes p(μ | x1, x2, … xR, σ²)?
Sneer
Maximum Likelihood: Slide 6
Copyright © 2001, 2004, Andrew W. Moore
Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ
(you do know σ²)
MLE: For which μ is x1, x2, … xR most likely?
MAP: Which μ maximizes p(μ | x1, x2, … xR, σ²)?
Sneer
Despite this, we’ll spend 95% of our time on MLE. Why? Wait and see…
Maximum Likelihood: Slide 7
Copyright © 2001, 2004, Andrew W. Moore
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ (you do know σ²)
• MLE: For which μ is x1, x2, … xR most likely?
$$\mu^{mle} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$
Maximum Likelihood: Slide 8
Copyright © 2001, 2004, Andrew W. Moore
Algebra Euphoria
$$\mu^{mle} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$
= (by i.i.d.)
= (monotonicity of log)
= (plug in formula for Gaussian)
= (after simplification)
Maximum Likelihood: Slide 9
Copyright © 2001, 2004, Andrew W. Moore
Algebra Euphoria
$$\mu^{mle} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$
$$= \arg\max_{\mu} \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2) \qquad \text{(by i.i.d.)}$$
$$= \arg\max_{\mu} \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2) \qquad \text{(monotonicity of log)}$$
$$= \arg\max_{\mu} \left(-\frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i-\mu)^2\right) \qquad \text{(plug in formula for Gaussian)}$$
$$= \arg\min_{\mu} \sum_{i=1}^{R}(x_i-\mu)^2 \qquad \text{(after simplification)}$$
Maximum Likelihood: Slide 10
Copyright © 2001, 2004, Andrew W. Moore
Intermission: A General Scalar
MLE strategy
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Set ∂LL/∂θ = 0 for a maximum, creating an equation in terms of θ
4. Solve it*
5. Check that you've found a maximum rather than a minimum or saddle-point, and be careful if θ is constrained
*This is a perfect example of something that works perfectly in
all textbook examples and usually involves surprising pain if
you need it for something new.
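A worked instance of the recipe on a non-Gaussian example (my example, not from the slides): suppose x1, x2, … xR ~ (i.i.d.) Exponential(λ). Then
$$LL(\lambda) = \log\prod_{i=1}^{R}\lambda e^{-\lambda x_i} = R\log\lambda - \lambda\sum_{i=1}^{R}x_i, \qquad \frac{\partial LL}{\partial\lambda} = \frac{R}{\lambda} - \sum_{i=1}^{R}x_i = 0 \;\Rightarrow\; \lambda^{mle} = \frac{R}{\sum_{i=1}^{R}x_i}$$
Step 5 checks out too: ∂²LL/∂λ² = −R/λ² < 0, so this really is a maximum, and the constraint λ > 0 holds whenever the data are positive.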
Maximum Likelihood: Slide 11
Copyright © 2001, 2004, Andrew W. Moore
The MLE μ
$$\mu^{mle} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = \arg\min_{\mu}\sum_{i=1}^{R}(x_i-\mu)^2$$
$$= \text{the } \mu \text{ s.t. } 0 = \frac{\partial LL}{\partial\mu} = \text{(what?)}$$
Maximum Likelihood: Slide 12
Copyright © 2001, 2004, Andrew W. Moore
The MLE μ
$$\mu^{mle} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = \arg\min_{\mu}\sum_{i=1}^{R}(x_i-\mu)^2$$
$$= \text{the } \mu \text{ s.t. } 0 = \frac{\partial}{\partial\mu}\sum_{i=1}^{R}(x_i-\mu)^2 = -2\sum_{i=1}^{R}(x_i-\mu)$$
Thus
$$\mu = \frac{1}{R}\sum_{i=1}^{R}x_i$$
Maximum Likelihood: Slide 13
Copyright © 2001, 2004, Andrew W. Moore
Lawks-a-lawdy!
$$\mu^{mle} = \frac{1}{R}\sum_{i=1}^{R}x_i$$
• The best estimate of the mean of a
distribution is the mean of the sample!
At first sight:
This kind of pedantic, algebra-filled and
ultimately unsurprising fact is exactly the
reason people throw down their
“Statistics” book and pick up their “Agent
Based Evolutionary Data Mining Using
The Neuro-Fuzz Transform” book.
Maximum Likelihood: Slide 14
Copyright © 2001, 2004, Andrew W. Moore
A General MLE strategy
Suppose θ = (θ1, θ2, …, θn)ᵀ is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
$$\frac{\partial LL}{\partial \boldsymbol{\theta}} = \begin{pmatrix} \partial LL/\partial\theta_1 \\ \partial LL/\partial\theta_2 \\ \vdots \\ \partial LL/\partial\theta_n \end{pmatrix}$$
Maximum Likelihood: Slide 15
Copyright © 2001, 2004, Andrew W. Moore
A General MLE strategy
Suppose θ = (θ1, θ2, …, θn)ᵀ is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Solve the set of simultaneous equations
$$\frac{\partial LL}{\partial\theta_1} = 0 \qquad \frac{\partial LL}{\partial\theta_2} = 0 \qquad \cdots \qquad \frac{\partial LL}{\partial\theta_n} = 0$$
Maximum Likelihood: Slide 16
Copyright © 2001, 2004, Andrew W. Moore
A General MLE strategy
Suppose θ = (θ1, θ2, …, θn)ᵀ is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Solve the set of simultaneous equations
$$\frac{\partial LL}{\partial\theta_1} = 0 \qquad \frac{\partial LL}{\partial\theta_2} = 0 \qquad \cdots \qquad \frac{\partial LL}{\partial\theta_n} = 0$$
4. Check that you're at a maximum
Maximum Likelihood: Slide 17
Copyright © 2001, 2004, Andrew W. Moore
A General MLE strategy
Suppose θ = (θ1, θ2, …, θn)ᵀ is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Solve the set of simultaneous equations
$$\frac{\partial LL}{\partial\theta_1} = 0 \qquad \frac{\partial LL}{\partial\theta_2} = 0 \qquad \cdots \qquad \frac{\partial LL}{\partial\theta_n} = 0$$
4. Check that you're at a maximum
If you can't solve them, what should you do?
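One practical answer: hand LL to a numerical optimizer. Here is a minimal sketch of that fallback (mine, not from the slides), assuming numpy and scipy are available, fitting a univariate Gaussian by direct maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, x):
    # theta = (mu, log_sigma); optimizing log_sigma keeps sigma > 0
    # without needing a constrained solver.
    mu, log_sigma = theta
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1_000)

# Nelder-Mead needs no derivatives, which is handy when
# solving dLL/dtheta = 0 analytically is painful.
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]),
                  args=(x,), method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # should land near 3.0 and 2.0
```

Parameterizing by log σ instead of σ is a standard trick: the optimizer can roam freely without ever proposing a negative variance.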
Maximum Likelihood: Slide 18
Copyright © 2001, 2004, Andrew W. Moore
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ or σ²
• MLE: For which θ = (μ, σ²) is x1, x2, … xR most likely?
$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log\sqrt{2\pi} + \tfrac{1}{2}\log\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i-\mu)^2$$
$$\frac{\partial LL}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i-\mu)$$
$$\frac{\partial LL}{\partial\sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i-\mu)^2$$
Maximum Likelihood: Slide 19
Copyright © 2001, 2004, Andrew W. Moore
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ or σ²
• MLE: For which θ = (μ, σ²) is x1, x2, … xR most likely?
$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log\sqrt{2\pi} + \tfrac{1}{2}\log\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i-\mu)^2$$
$$0 = \frac{\partial LL}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i-\mu)$$
$$0 = \frac{\partial LL}{\partial\sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i-\mu)^2$$
Maximum Likelihood: Slide 20
Copyright © 2001, 2004, Andrew W. Moore
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ or σ²
• MLE: For which θ = (μ, σ²) is x1, x2, … xR most likely?
$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log\sqrt{2\pi} + \tfrac{1}{2}\log\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i-\mu)^2$$
$$0 = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i-\mu) \;\Rightarrow\; \mu = \frac{1}{R}\sum_{i=1}^{R}x_i$$
$$0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i-\mu)^2 \;\Rightarrow\; \sigma^2 = \text{what?}$$
Maximum Likelihood: Slide 21
Copyright © 2001, 2004, Andrew W. Moore
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ or σ²
• MLE: For which θ = (μ, σ²) is x1, x2, … xR most likely?
$$\mu^{mle} = \frac{1}{R}\sum_{i=1}^{R}x_i \qquad\qquad \sigma^2_{mle} = \frac{1}{R}\sum_{i=1}^{R}(x_i-\mu^{mle})^2$$
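In code, those two formulae are one line each. A minimal sanity-check sketch (my own, assuming numpy):

```python
import numpy as np

def gaussian_mle(x):
    """MLE for a univariate Gaussian: the sample mean and the
    1/R (biased) variance from the formulae above."""
    mu_mle = np.mean(x)
    sigma2_mle = np.mean((x - mu_mle) ** 2)  # same as np.var(x, ddof=0)
    return mu_mle, sigma2_mle

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=1.5, size=500)
print(gaussian_mle(x))  # roughly (5.0, 2.25)
```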
Maximum Likelihood: Slide 22
Copyright © 2001, 2004, Andrew W. Moore
Unbiased Estimators
• An estimator of a parameter is unbiased if the
expected value of the estimate is the same as the
true value of the parameters.
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then
$$E[\mu^{mle}] = E\left[\frac{1}{R}\sum_{i=1}^{R}x_i\right] = \mu$$
μmle is unbiased
Maximum Likelihood: Slide 23
Copyright © 2001, 2004, Andrew W. Moore
Biased Estimators
• An estimator of a parameter is biased if the
expected value of the estimate is different from
the true value of the parameters.
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then
$$E[\sigma^2_{mle}] = E\left[\frac{1}{R}\sum_{i=1}^{R}(x_i-\mu^{mle})^2\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i-\frac{1}{R}\sum_{j=1}^{R}x_j\right)^2\right] \neq \sigma^2$$
σ²mle is biased
Maximum Likelihood: Slide 24
Copyright © 2001, 2004, Andrew W. Moore
MLE Variance Bias
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then
$$E[\sigma^2_{mle}] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i-\frac{1}{R}\sum_{j=1}^{R}x_j\right)^2\right] = \left(1-\frac{1}{R}\right)\sigma^2$$
Intuition check: consider the case of R = 1.
Why should our guts expect that σ²mle would be an underestimate of true σ²?
How could you prove that?
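One route (a standard expansion, not spelled out on the slide): write each deviation as (xi − μ) − (μmle − μ); the cross terms collapse, leaving
$$E\left[\frac{1}{R}\sum_{i=1}^{R}(x_i-\mu^{mle})^2\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}(x_i-\mu)^2\right] - E\left[(\mu^{mle}-\mu)^2\right] = \sigma^2 - \frac{\sigma^2}{R} = \left(1-\frac{1}{R}\right)\sigma^2$$
which also confirms the R = 1 intuition: with one datapoint the estimate is exactly 0.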
Maximum Likelihood: Slide 25
Copyright © 2001, 2004, Andrew W. Moore
Unbiased estimate of Variance
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then
$$E[\sigma^2_{mle}] = \left(1-\frac{1}{R}\right)\sigma^2$$
So define
$$\sigma^2_{unbiased} = \frac{\sigma^2_{mle}}{1-\frac{1}{R}}$$
So $E[\sigma^2_{unbiased}] = \sigma^2$
Maximum Likelihood: Slide 26
Copyright © 2001, 2004, Andrew W. Moore
Unbiased estimate of Variance
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then
$$E[\sigma^2_{mle}] = \left(1-\frac{1}{R}\right)\sigma^2$$
So define
$$\sigma^2_{unbiased} = \frac{\sigma^2_{mle}}{1-\frac{1}{R}} = \frac{1}{R-1}\sum_{i=1}^{R}(x_i-\mu^{mle})^2$$
So $E[\sigma^2_{unbiased}] = \sigma^2$
Maximum Likelihood: Slide 27
Copyright © 2001, 2004, Andrew W. Moore
Unbiaseditude discussion
• Which is best?
$$\sigma^2_{mle} = \frac{1}{R}\sum_{i=1}^{R}(x_i-\mu^{mle})^2 \qquad\qquad \sigma^2_{unbiased} = \frac{1}{R-1}\sum_{i=1}^{R}(x_i-\mu^{mle})^2$$
Answer:
• It depends on the task
• And it doesn't make much difference once R gets large
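A quick Monte-Carlo check (my own sketch, not from the slides) shows both the bias at small R and its disappearance at large R:

```python
import numpy as np

rng = np.random.default_rng(2)
R, trials, true_var = 5, 100_000, 4.0   # small R makes the bias obvious

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, R))
var_mle = samples.var(axis=1, ddof=0)        # divides by R
var_unbiased = samples.var(axis=1, ddof=1)   # divides by R - 1

print(var_mle.mean())       # close to (1 - 1/R) * 4.0 = 3.2: biased low
print(var_unbiased.mean())  # close to 4.0: unbiased
```

Rerun with R = 1000 and the two averages become nearly indistinguishable.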
Maximum Likelihood: Slide 28
Copyright © 2001, 2004, Andrew W. Moore
Don’t get too excited about being
unbiased
• Assume x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• Suppose we had these estimators for the mean:
$$\mu^{suboptimal} = \frac{1}{R+7}\sum_{i=1}^{R}x_i \qquad\qquad \mu^{crap} = x_1$$
Are either of these unbiased?
Will either of them asymptote to the correct value as R gets large?
Which is more useful?
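A simulation (my own sketch, using the two estimators as reconstructed above) suggests the answers: μcrap is unbiased but never improves, while μsuboptimal is biased yet converges.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, trials = 10.0, 50_000

for R in (10, 1000):
    x = rng.normal(mu_true, 2.0, size=(trials, R))
    mu_suboptimal = x.sum(axis=1) / (R + 7)  # biased (scaled by R/(R+7)), but consistent
    mu_crap = x[:, 0]                        # unbiased, but ignores R - 1 datapoints
    print(R, mu_suboptimal.mean(), mu_crap.mean(),
          mu_suboptimal.std(), mu_crap.std())

# mu_suboptimal's average approaches 10 and its spread shrinks as R grows;
# mu_crap stays centred on 10 but is just as noisy at R = 1000 as at R = 10.
```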
Maximum Likelihood: Slide 29
Copyright © 2001, 2004, Andrew W. Moore
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MLE: For which θ = (μ, Σ) is x1, x2, … xR most likely?
$$\boldsymbol{\mu}^{mle} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad\qquad \boldsymbol{\Sigma}^{mle} = \frac{1}{R}\sum_{k=1}^{R}(\mathbf{x}_k-\boldsymbol{\mu}^{mle})(\mathbf{x}_k-\boldsymbol{\mu}^{mle})^T$$
Maximum Likelihood: Slide 30
Copyright © 2001, 2004, Andrew W. Moore
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MLE: For which θ = (μ, Σ) is x1, x2, … xR most likely?
$$\boldsymbol{\mu}^{mle} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad\qquad \boldsymbol{\Sigma}^{mle} = \frac{1}{R}\sum_{k=1}^{R}(\mathbf{x}_k-\boldsymbol{\mu}^{mle})(\mathbf{x}_k-\boldsymbol{\mu}^{mle})^T$$
Componentwise:
$$\mu^{mle}_i = \frac{1}{R}\sum_{k=1}^{R}x_{ki}$$
where 1 ≤ i ≤ m, xki is the value of the ith component of xk (the ith attribute of the kth record), and μimle is the ith component of μmle.
Maximum Likelihood: Slide 31
Copyright © 2001, 2004, Andrew W. Moore
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MLE: For which θ = (μ, Σ) is x1, x2, … xR most likely?
$$\boldsymbol{\mu}^{mle} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad\qquad \boldsymbol{\Sigma}^{mle} = \frac{1}{R}\sum_{k=1}^{R}(\mathbf{x}_k-\boldsymbol{\mu}^{mle})(\mathbf{x}_k-\boldsymbol{\mu}^{mle})^T$$
Componentwise:
$$\sigma^{mle}_{ij} = \frac{1}{R}\sum_{k=1}^{R}(x_{ki}-\mu^{mle}_i)(x_{kj}-\mu^{mle}_j)$$
where 1 ≤ i ≤ m, 1 ≤ j ≤ m, xki is the value of the ith component of xk (the ith attribute of the kth record), and σijmle is the (i,j)th component of Σmle.
Maximum Likelihood: Slide 32
Copyright © 2001, 2004, Andrew W. Moore
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MLE: For which θ = (μ, Σ) is x1, x2, … xR most likely?
$$\boldsymbol{\mu}^{mle} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad\qquad \boldsymbol{\Sigma}^{mle} = \frac{1}{R}\sum_{k=1}^{R}(\mathbf{x}_k-\boldsymbol{\mu}^{mle})(\mathbf{x}_k-\boldsymbol{\mu}^{mle})^T$$
Q: How would you prove this?
A: Just plug through the MLE recipe.
Note how Σmle is forced to be symmetric non-negative definite.
Note the unbiased case:
$$\boldsymbol{\Sigma}^{unbiased} = \frac{\boldsymbol{\Sigma}^{mle}}{1-\frac{1}{R}} = \frac{1}{R-1}\sum_{k=1}^{R}(\mathbf{x}_k-\boldsymbol{\mu}^{mle})(\mathbf{x}_k-\boldsymbol{\mu}^{mle})^T$$
How many datapoints would you need before the Gaussian has a chance of being non-degenerate?
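These formulae translate directly into a few lines of numpy. A minimal sketch (mine, not from the slides); np.cov computes the same quantities:

```python
import numpy as np

def gaussian_mle_mv(X):
    """MLE for an m-dimensional Gaussian from an (R, m) data matrix X."""
    R = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu
    sigma_mle = centered.T @ centered / R             # the 1/R MLE
    sigma_unbiased = centered.T @ centered / (R - 1)  # the 1/(R-1) version
    return mu, sigma_mle, sigma_unbiased

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 3.0], [[2.0, 0.8], [0.8, 1.0]], size=400)
mu, S_mle, S_unb = gaussian_mle_mv(X)
# Cross-checks: np.cov(X, rowvar=False) matches S_unb;
# np.cov(X, rowvar=False, bias=True) matches S_mle.
```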
Maximum Likelihood: Slide 33
Copyright © 2001, 2004, Andrew W. Moore
Confidence intervals
We need to talk
We need to discuss how accurate we expect μmle and Σmle to be as a function of R
And we need to consider how to estimate these accuracies from
data…
•Analytically *
•Non-parametrically (using randomization and bootstrapping) *
But we won’t. Not yet.
*Will be discussed in future Andrew lectures…just before
we need this technology.
Maximum Likelihood: Slide 34
Copyright © 2001, 2004, Andrew W. Moore
Structural error
Actually, we need to talk about something else too…
What if we do all this analysis when the true distribution is in fact
not Gaussian?
How can we tell? *
How can we survive? *
*Will be discussed in future Andrew lectures…just before
we need this technology.
Maximum Likelihood: Slide 35
Copyright © 2001, 2004, Andrew W. Moore
Gaussian MLE in action
Using R=392 cars from the
“MPG” UCI dataset supplied
by Ross Quinlan
Maximum Likelihood: Slide 36
Copyright © 2001, 2004, Andrew W. Moore
Data-starved Gaussian MLE
Using three subsets of MPG.
Each subset has 6
randomly-chosen cars.
Maximum Likelihood: Slide 37
Copyright © 2001, 2004, Andrew W. Moore
Bivariate MLE in action
Maximum Likelihood: Slide 38
Copyright © 2001, 2004, Andrew W. Moore
Multivariate MLE
Covariance matrices are not exciting to look at
Maximum Likelihood: Slide 39
Copyright © 2001, 2004, Andrew W. Moore
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
Maximum Likelihood: Slide 40
Copyright © 2001, 2004, Andrew W. Moore
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
Step 1a: Put a prior on Σ: (n0 − m − 1)Σ ~ IW(n0, (n0 − m − 1)Σ0)
This thing is called the Inverse-Wishart distribution. A PDF over SPD matrices!
Maximum Likelihood: Slide 41
Copyright © 2001, 2004, Andrew W. Moore
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
Step 1a: Put a prior on Σ: (n0 − m − 1)Σ ~ IW(n0, (n0 − m − 1)Σ0)
This thing is called the Inverse-Wishart distribution. A PDF over SPD matrices!
n0 small: "I am not sure about my guess of Σ0"
n0 large: "I'm pretty sure about my guess of Σ0"
Σ0: (roughly) my best guess of Σ, since E[Σ] = Σ0
Maximum Likelihood: Slide 42
Copyright © 2001, 2004, Andrew W. Moore
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
Step 1a: Put a prior on Σ: (n0 − m − 1)Σ ~ IW(n0, (n0 − m − 1)Σ0)
Step 1b: Put a prior on μ | Σ: μ | Σ ~ N(μ0, Σ / k0)
Together, "Σ" and "μ | Σ" define a joint distribution on (μ, Σ)
Maximum Likelihood: Slide 43
Copyright © 2001, 2004, Andrew W. Moore
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
Step 1a: Put a prior on Σ: (n0 − m − 1)Σ ~ IW(n0, (n0 − m − 1)Σ0)
Step 1b: Put a prior on μ | Σ: μ | Σ ~ N(μ0, Σ / k0)
μ0: my best guess of μ, since E[μ] = μ0
k0 small: "I am not sure about my guess of μ0"
k0 large: "I'm pretty sure about my guess of μ0"
Together, "Σ" and "μ | Σ" define a joint distribution on (μ, Σ)
Notice how we are forced to express our ignorance of μ proportionally to Σ
Maximum Likelihood: Slide 44
Copyright © 2001, 2004, Andrew W. Moore
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
Step 1a: Put a prior on Σ: (n0 − m − 1)Σ ~ IW(n0, (n0 − m − 1)Σ0)
Step 1b: Put a prior on μ | Σ: μ | Σ ~ N(μ0, Σ / k0)
Why do we use this form of prior?
Maximum Likelihood: Slide 45
Copyright © 2001, 2004, Andrew W. Moore
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (μ, Σ)
Step 1a: Put a prior on Σ: (n0 − m − 1)Σ ~ IW(n0, (n0 − m − 1)Σ0)
Step 1b: Put a prior on μ | Σ: μ | Σ ~ N(μ0, Σ / k0)
Why do we use this form of prior?
Actually, we don't have to.
But it is computationally and algebraically convenient…
…it's a conjugate prior.
Maximum Likelihood: Slide 46
Copyright © 2001, 2004, Andrew W. Moore
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Prior: (n0 − m − 1)Σ ~ IW(n0, (n0 − m − 1)Σ0), μ | Σ ~ N(μ0, Σ / k0)
Step 2:
$$\bar{\mathbf{x}} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad \boldsymbol{\mu}_R = \frac{k_0\boldsymbol{\mu}_0 + R\bar{\mathbf{x}}}{k_0 + R} \qquad k_R = k_0 + R \qquad n_R = n_0 + R$$
$$\boldsymbol{\Sigma}_R = \frac{(n_0 - m - 1)\boldsymbol{\Sigma}_0 + \sum_{k=1}^{R}(\mathbf{x}_k-\bar{\mathbf{x}})(\mathbf{x}_k-\bar{\mathbf{x}})^T + \dfrac{(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)^T}{1/k_0 + 1/R}}{n_R + m - 1}$$
Step 3: Posterior: (nR + m − 1)Σ ~ IW(nR, (nR + m − 1)ΣR), μ | Σ ~ N(μR, Σ / kR)
Result: μmap = μR, E[Σ | x1, x2, … xR] = ΣR
Maximum Likelihood: Slide 47
Copyright © 2001, 2004, Andrew W. Moore
Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?
Step 1: Prior: (n0 − m − 1)Σ ~ IW(n0, (n0 − m − 1)Σ0), μ | Σ ~ N(μ0, Σ / k0)
Step 2:
$$\bar{\mathbf{x}} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad \boldsymbol{\mu}_R = \frac{k_0\boldsymbol{\mu}_0 + R\bar{\mathbf{x}}}{k_0 + R} \qquad k_R = k_0 + R \qquad n_R = n_0 + R$$
$$\boldsymbol{\Sigma}_R = \frac{(n_0 - m - 1)\boldsymbol{\Sigma}_0 + \sum_{k=1}^{R}(\mathbf{x}_k-\bar{\mathbf{x}})(\mathbf{x}_k-\bar{\mathbf{x}})^T + \dfrac{(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)^T}{1/k_0 + 1/R}}{n_R + m - 1}$$
Step 3: Posterior: (nR + m − 1)Σ ~ IW(nR, (nR + m − 1)ΣR), μ | Σ ~ N(μR, Σ / kR)
Result: μmap = μR, E[Σ | x1, x2, … xR] = ΣR
• Look carefully at what these formulae are doing. It's all very sensible.
• Conjugate priors mean the prior form and the posterior form are the same, characterized by "sufficient statistics" of the data.
• The marginal distribution on μ is a Student-t.
• One point of view: it's pretty academic if R > 30.
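To make the update concrete, here is Step 2 in numpy (my own code, following the formulae as reconstructed above, so treat it as illustrative rather than definitive):

```python
import numpy as np

def niw_map_update(X, mu0, Sigma0, k0, n0):
    """Conjugate MAP update for an m-dimensional Gaussian (Step 2 above)."""
    R, m = X.shape
    xbar = X.mean(axis=0)
    mu_R = (k0 * mu0 + R * xbar) / (k0 + R)  # prior mean pulled toward the data mean
    k_R, n_R = k0 + R, n0 + R
    centered = X - xbar
    scatter = centered.T @ centered          # sum_k (x_k - xbar)(x_k - xbar)^T
    shrink = np.outer(xbar - mu0, xbar - mu0) / (1.0 / k0 + 1.0 / R)
    Sigma_R = ((n0 - m - 1) * Sigma0 + scatter + shrink) / (n_R + m - 1)
    return mu_R, Sigma_R, k_R, n_R           # mu_map = mu_R, E[Sigma | data] = Sigma_R

rng = np.random.default_rng(6)
X = rng.multivariate_normal([1.0, -1.0], np.eye(2), size=50)
mu_map, Sigma_R, k_R, n_R = niw_map_update(X, mu0=np.zeros(2),
                                           Sigma0=np.eye(2), k0=1.0, n0=5.0)
```

Notice the behavior as R grows: mu_R approaches the sample mean and the prior terms get swamped by the data, which is the "pretty academic if R > 30" point.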
Maximum Likelihood: Slide 48
Copyright © 2001, 2004, Andrew W. Moore
Where we’re at
[Roadmap grid from the original slide: three column heads, Inputs → Classifier (predict category), Inputs → Density Estimator (probability), Inputs → Regressor (predict real no.), crossed with input types (categorical inputs only; mixed real / cat okay; real-valued inputs only), locating the methods so far: Dec Tree, Joint BC, Naïve BC, Joint DE, Naïve DE, Gauss DE.]
Maximum Likelihood: Slide 49
Copyright © 2001, 2004, Andrew W. Moore
What you should know
• The Recipe for MLE
• Why do we sometimes prefer MLE to MAP?
• Understand MLE estimation of Gaussian
parameters
• Understand “biased estimator” versus
“unbiased estimator”
• Appreciate the outline behind Bayesian
estimation of Gaussian parameters
Maximum Likelihood: Slide 50
Copyright © 2001, 2004, Andrew W. Moore
Useful exercise
• We’d already done some MLE in this class
without even telling you!
• Suppose categorical arity-n inputs x1, x2, …
xR~(i.i.d.) from a multinomial
M(p1, p2, … pn)
where
P(xk=j|p)=pj
• What is the MLE p=(p1, p2, … pn)?
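The standard answer (derivable with the recipe plus a Lagrange multiplier for the constraint that the pj sum to 1) is the empirical frequency: pjmle = #{k : xk = j} / R. A minimal sketch (my own, assuming numpy):

```python
import numpy as np

def multinomial_mle(x, n):
    """MLE for multinomial parameters from categorical draws x in {0,...,n-1}:
    just the empirical frequency of each category."""
    counts = np.bincount(x, minlength=n)
    return counts / len(x)

rng = np.random.default_rng(5)
x = rng.choice(3, size=1_000, p=[0.5, 0.3, 0.2])
print(multinomial_mle(x, 3))  # roughly [0.5, 0.3, 0.2]
```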