Learning with Maximum Likelihood
Andrew W. Moore, Professor
School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm   awm@cs.cmu.edu   412-268-7599

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © 2001, 2004, Andrew W. Moore. Sep 6th, 2001

Maximum Likelihood learning of Gaussians for Data Mining
• Why we should care
• Learning Univariate Gaussians
• Learning Multivariate Gaussians
• What's a biased estimator?
• Bayesian Learning of Gaussians

Why we should care
• Maximum Likelihood Estimation is a very very very very fundamental part of data analysis.
• "MLE for Gaussians" is training wheels for our future techniques.
• Learning Gaussians is more useful than you might guess…

Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don't know µ (you do know σ²)
MLE: For which µ is x1, x2, … xR most likely?
MAP: Which µ maximizes p(µ | x1, x2, … xR, σ²)?

Learning Gaussians from Data (continued)
The MAP question on the previous slide earns a sneer.
Despite this, we'll spend 95% of our time on MLE. Why? Wait and see…

MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don't know µ (you do know σ²)
• MLE: For which µ is x1, x2, … xR most likely?

\mu^{\text{mle}} = \arg\max_{\mu} \; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)

Algebra Euphoria
Starting from the definition above, each "=" is justified and then filled in on the next slide:
= (by i.i.d.)  …
= (monotonicity of log)  …
= (plug in formula for Gaussian)  …
= (after simplification)  …

Algebra Euphoria (filled in)
\mu^{\text{mle}} = \arg\max_{\mu} \; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)
  = \arg\max_{\mu} \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2)    (by i.i.d.)
  = \arg\max_{\mu} \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2)    (monotonicity of log)
  = \arg\max_{\mu} \sum_{i=1}^{R} \left( \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu)^2}{2\sigma^2} \right)    (plug in formula for Gaussian)
  = \arg\min_{\mu} \sum_{i=1}^{R} (x_i - \mu)^2    (after simplification)

Intermission: A General Scalar MLE strategy
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Set ∂LL/∂θ = 0 for a maximum, creating an equation in terms of θ
4. Solve it*
5. Check that you've found a maximum rather than a minimum or saddle-point, and be careful if θ is constrained
*This is a perfect example of something that works perfectly in all textbook examples and usually involves surprising pain if you need it for something new.

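To see the recipe in code, here is a minimal sketch (my addition, not from the slides) that maximizes the log-likelihood for µ numerically, assuming numpy/scipy and synthetic data. With σ² known, the numeric optimum lands on the sample mean, matching the algebra above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Synthetic stand-in data (hypothetical); sigma is assumed known, as on the slides.
rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(loc=5.0, scale=sigma, size=100)

def neg_log_likelihood(mu):
    # -LL = -sum_i log N(x_i | mu, sigma^2)
    return len(x) * np.log(np.sqrt(2.0 * np.pi) * sigma) + np.sum((x - mu) ** 2) / (2.0 * sigma ** 2)

res = minimize_scalar(neg_log_likelihood)   # step 4 of the recipe, done numerically
print(res.x, x.mean())                      # the optimiser recovers the sample mean
```
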
The MLE µ
\mu^{\text{mle}} = \arg\max_{\mu} \; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)
              = \arg\min_{\mu} \sum_{i=1}^{R} (x_i - \mu)^2
              = \text{the } \mu \text{ such that } 0 = \frac{\partial LL}{\partial \mu} = \text{(what?)}

The MLE µ (filled in)
0 = \frac{\partial LL}{\partial \mu} = \frac{\partial}{\partial \mu} \sum_{i=1}^{R} (x_i - \mu)^2 = -\sum_{i=1}^{R} 2 (x_i - \mu)

Thus  \mu = \frac{1}{R} \sum_{i=1}^{R} x_i

Lawks-a-lawdy!
\mu^{\text{mle}} = \frac{1}{R} \sum_{i=1}^{R} x_i
• The best estimate of the mean of a distribution is the mean of the sample!
At first sight: This kind of pedantic, algebra-filled and ultimately unsurprising fact is exactly the reason people throw down their "Statistics" book and pick up their "Agent Based Evolutionary Data Mining Using The Neuro-Fuzz Transform" book.

A General MLE strategy
Suppose θ = (θ1, θ2, …, θn)^T is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus:
\frac{\partial LL}{\partial \boldsymbol{\theta}} = \begin{pmatrix} \partial LL / \partial \theta_1 \\ \partial LL / \partial \theta_2 \\ \vdots \\ \partial LL / \partial \theta_n \end{pmatrix}

A General MLE strategy (continued)
3. Solve the set of simultaneous equations
\frac{\partial LL}{\partial \theta_1} = 0, \quad \frac{\partial LL}{\partial \theta_2} = 0, \quad \ldots, \quad \frac{\partial LL}{\partial \theta_n} = 0
4. Check that you're at a maximum

A General MLE strategy (continued)
If you can't solve the simultaneous equations, what should you do?

MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don't know µ or σ²
• MLE: For which θ = (µ, σ²) is x1, x2, … xR most likely?

\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -\frac{R}{2}\left(\log 2\pi + \log \sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i - \mu)^2

\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu)

\frac{\partial LL}{\partial \sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2

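As a sanity check on the two partial derivatives above, here is a small sketch (my addition, assuming numpy/scipy and synthetic data) comparing the analytic gradient with a numerical one at an arbitrary (µ, σ²).

```python
import numpy as np
from scipy.optimize import approx_fprime

rng = np.random.default_rng(1)
x = rng.normal(3.0, 1.5, size=200)
R = len(x)

def log_likelihood(theta):
    mu, var = theta  # theta = (mu, sigma^2)
    return -0.5 * R * (np.log(2 * np.pi) + np.log(var)) - np.sum((x - mu) ** 2) / (2 * var)

def analytic_grad(theta):
    mu, var = theta
    d_mu = np.sum(x - mu) / var
    d_var = -R / (2 * var) + np.sum((x - mu) ** 2) / (2 * var ** 2)
    return np.array([d_mu, d_var])

theta0 = np.array([2.0, 4.0])
print(analytic_grad(theta0))
print(approx_fprime(theta0, log_likelihood, 1e-6))  # should agree to several decimals
```
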
MLE for univariate Gaussian (continued)
Set both partial derivatives to zero:

0 = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu) \;\Rightarrow\; \mu = \frac{1}{R}\sum_{i=1}^{R} x_i

0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2 \;\Rightarrow\; \text{what?}

MLE for univariate Gaussian (the answer)
• MLE: For which θ = (µ, σ²) is x1, x2, … xR most likely?

\mu^{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R} x_i
\sigma^2_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2

Unbiased Estimators
• An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter.
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then
E[\mu^{\text{mle}}] = E\left[\frac{1}{R}\sum_{i=1}^{R} x_i\right] = \mu
so µ^mle is unbiased.

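A minimal numpy sketch of the two closed-form estimates (my addition; the data here are a synthetic stand-in, not the MPG attributes used later in the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(20.0, 6.0, size=392)   # hypothetical sample
R = len(x)

mu_mle = x.sum() / R                       # (1/R) * sum_i x_i
var_mle = ((x - mu_mle) ** 2).sum() / R    # (1/R) * sum_i (x_i - mu_mle)^2

# np.var uses the same 1/R ("ddof=0") convention by default.
print(mu_mle, var_mle, np.var(x))
```
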
Biased Estimators
• An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter.
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then
E[\sigma^2_{\text{mle}}] = E\left[\frac{1}{R}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right] \neq \sigma^2
so σ²_mle is biased.

MLE Variance Bias
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then
E[\sigma^2_{\text{mle}}] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \neq \sigma^2
Intuition check: consider the case of R = 1. Why should our guts expect that σ²_mle would be an underestimate of the true σ²? How could you prove that?

Unbiased estimate of Variance
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then
E[\sigma^2_{\text{mle}}] = \left(1 - \frac{1}{R}\right)\sigma^2 \neq \sigma^2
So define
\sigma^2_{\text{unbiased}} = \frac{\sigma^2_{\text{mle}}}{1 - \frac{1}{R}} = \frac{1}{R-1}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2
so that E[\sigma^2_{\text{unbiased}}] = \sigma^2.

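The (1 - 1/R) factor is easy to see empirically. A quick simulation sketch (my addition, assuming numpy; the true parameters and trial count are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, var_true, R, trials = 0.0, 4.0, 5, 200_000

samples = rng.normal(mu_true, np.sqrt(var_true), size=(trials, R))
var_mle = samples.var(axis=1, ddof=0)        # divides by R
var_unbiased = samples.var(axis=1, ddof=1)   # divides by R - 1

print(var_mle.mean(), (1 - 1 / R) * var_true)   # both close to 3.2
print(var_unbiased.mean(), var_true)            # both close to 4.0
```
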
Unbiaseditude discussion
• Which is best?
\sigma^2_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2
\sigma^2_{\text{unbiased}} = \frac{1}{R-1}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2
Answer:
• It depends on the task
• And it doesn't make much difference once R gets large

Don't get too excited about being unbiased
• Assume x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• Suppose we had these estimators for the mean:
\mu^{\text{suboptimal}} = \frac{1}{R+7}\sum_{i=1}^{R} x_i
\mu^{\text{crap}} = x_1
Are either of these unbiased? Will either of them asymptote to the correct value as R gets large? Which is more useful?

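One way to explore the questions on this slide is by simulation. A sketch (my addition, assuming numpy; the estimator names follow the slide):

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, sigma, trials = 10.0, 3.0, 100_000

for R in (10, 1000):
    x = rng.normal(mu_true, sigma, size=(trials, R))
    mu_mle = x.mean(axis=1)
    mu_suboptimal = x.sum(axis=1) / (R + 7)   # biased, but the bias vanishes as R grows
    mu_crap = x[:, 0]                         # unbiased, but never improves with more data
    for name, est in [("mle", mu_mle), ("suboptimal", mu_suboptimal), ("crap", mu_crap)]:
        bias = est.mean() - mu_true
        mse = ((est - mu_true) ** 2).mean()
        print(R, name, bias, mse)
```
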
MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MLE: For which θ = (µ, Σ) is x1, x2, … xR most likely?

\boldsymbol{\mu}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} \mathbf{x}_k
\Sigma^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} (\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})^T

Written componentwise, for 1 ≤ i ≤ m, where x_ki is the value of the ith component of x_k (the ith attribute of the kth record) and µ_i^mle is the ith component of µ^mle:
\mu_i^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} x_{ki}

MLE for m-dimensional Gaussian (componentwise covariance)
For 1 ≤ i ≤ m, 1 ≤ j ≤ m, where σ_ij^mle is the (i, j)th component of Σ^mle:
\sigma_{ij}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} (x_{ki} - \mu_i^{\text{mle}})(x_{kj} - \mu_j^{\text{mle}})

Q: How would you prove this?
A: Just plug through the MLE recipe.
Note how Σ^mle is forced to be symmetric non-negative definite.
How many datapoints would you need before the Gaussian has a chance of being non-degenerate?
Note the unbiased case:
\Sigma^{\text{unbiased}} = \frac{\Sigma^{\text{mle}}}{1 - \frac{1}{R}} = \frac{1}{R-1}\sum_{k=1}^{R} (\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})^T

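A short numpy sketch of the multivariate formulas (my addition), using hypothetical 2-D data standing in for two attributes of the MPG dataset:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal(mean=[20.0, 3000.0],
                            cov=[[36.0, -2000.0], [-2000.0, 500000.0]],
                            size=392)
R, m = X.shape

mu_mle = X.mean(axis=0)                 # (1/R) * sum_k x_k
centered = X - mu_mle
Sigma_mle = centered.T @ centered / R   # (1/R) * sum_k (x_k - mu)(x_k - mu)^T
Sigma_unbiased = centered.T @ centered / (R - 1)

# np.cov expects variables in rows; bias=True gives the 1/R (MLE) normalisation.
print(np.allclose(Sigma_mle, np.cov(X.T, bias=True)))
```
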
Confidence intervals
We need to talk.
We need to discuss how accurate we expect µ^mle and Σ^mle to be as a function of R.
And we need to consider how to estimate these accuracies from data…
• Analytically *
• Non-parametrically (using randomization and bootstrapping) *
But we won't. Not yet.
*Will be discussed in future Andrew lectures…just before we need this technology.

Structural error
Actually, we need to talk about something else too.
What if we do all this analysis when the true distribution is in fact not Gaussian?
How can we tell? *
How can we survive? *
*Will be discussed in future Andrew lectures…just before we need this technology.

Gaussian MLE in action
Using R = 392 cars from the "MPG" UCI dataset supplied by Ross Quinlan.

Data-starved Gaussian MLE
Using three subsets of MPG. Each subset has 6 randomly-chosen cars.

Bivariate MLE in action

Multivariate MLE
Covariance matrices are not exciting to look at.

Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?
Step 1: Put a prior on (µ, Σ)
Step 1a: Put a prior on Σ:
(\nu_0 - m - 1)\,\Sigma \sim IW(\nu_0, (\nu_0 - m - 1)\,\Sigma_0)
This thing is called the Inverse-Wishart distribution. A PDF over SPD matrices!

Being Bayesian: MAP estimates for Gaussians
Interpreting the prior on Σ:
• Σ0: (roughly) my best guess of Σ, with E[Σ] = Σ0
• ν0 small: "I am not sure about my guess of Σ0"
• ν0 large: "I'm pretty sure about my guess of Σ0"

Step 1b: Put a prior on µ | Σ:
\mu \mid \Sigma \sim N(\mu_0, \Sigma / \kappa_0)
Together, "Σ" and "µ | Σ" define a joint distribution on (µ, Σ).

Being Bayesian: MAP estimates for Gaussians
Interpreting the prior on µ | Σ:
• µ0: my best guess of µ, with E[µ] = µ0
• κ0 small: "I am not sure about my guess of µ0"
• κ0 large: "I'm pretty sure about my guess of µ0"
Notice how we are forced to express our ignorance of µ proportionally to Σ.

Why do we use this form of prior?

Why do we use this form of prior?
Actually, we don't have to. But it is computationally and algebraically convenient…
…it's a conjugate prior.

Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Step 1: Prior:
(\nu_0 - m - 1)\,\Sigma \sim IW(\nu_0, (\nu_0 - m - 1)\,\Sigma_0), \qquad \mu \mid \Sigma \sim N(\mu_0, \Sigma / \kappa_0)

Step 2:
\bar{\mathbf{x}} = \frac{1}{R}\sum_{k=1}^{R} \mathbf{x}_k
\mu_R = \frac{\kappa_0 \mu_0 + R\,\bar{\mathbf{x}}}{\kappa_0 + R}, \qquad \kappa_R = \kappa_0 + R, \qquad \nu_R = \nu_0 + R
(\nu_R + m - 1)\,\Sigma_R = (\nu_0 + m - 1)\,\Sigma_0 + \sum_{k=1}^{R} (\mathbf{x}_k - \bar{\mathbf{x}})(\mathbf{x}_k - \bar{\mathbf{x}})^T + \frac{(\bar{\mathbf{x}} - \mu_0)(\bar{\mathbf{x}} - \mu_0)^T}{1/\kappa_0 + 1/R}

Step 3: Posterior:
(\nu_R + m - 1)\,\Sigma \sim IW(\nu_R, (\nu_R + m - 1)\,\Sigma_R), \qquad \mu \mid \Sigma \sim N(\mu_R, \Sigma / \kappa_R)

Result: µ^map = µ_R and E[Σ | x1, x2, … xR] = Σ_R.

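The Step 2 updates translate directly into code. A minimal sketch (my addition; the function name, the prior hyperparameter values, and the data are made up for illustration, assuming numpy):

```python
import numpy as np

def niw_map_update(X, mu0, kappa0, nu0, Sigma0):
    """Posterior hyperparameters for the conjugate prior in Step 1, per the Step 2 formulas."""
    R, m = X.shape
    xbar = X.mean(axis=0)
    kappaR = kappa0 + R
    nuR = nu0 + R
    muR = (kappa0 * mu0 + R * xbar) / kappaR
    centered = X - xbar
    scatter = centered.T @ centered
    shift = np.outer(xbar - mu0, xbar - mu0) / (1.0 / kappa0 + 1.0 / R)
    SigmaR = ((nu0 + m - 1) * Sigma0 + scatter + shift) / (nuR + m - 1)
    return muR, kappaR, nuR, SigmaR

# Hypothetical usage with a weak prior centred on zero mean and unit covariance.
rng = np.random.default_rng(6)
X = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.3], [0.3, 1.0]], size=50)
muR, kappaR, nuR, SigmaR = niw_map_update(X, mu0=np.zeros(2), kappa0=1.0, nu0=4.0, Sigma0=np.eye(2))
print(muR)      # mu_map
print(SigmaR)   # E[Sigma | x1, ..., xR]
```
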
Being Bayesian: MAP estimates for Gaussians (commentary)
• Look carefully at what these formulae are doing. It's all very sensible.
• Conjugate priors mean prior form and posterior form are the same and are characterized by "sufficient statistics" of the data.
• The marginal distribution on µ is a student-t.
• One point of view: it's pretty academic if R > 30.

Where we're at
                          Categorical inputs only   Real-valued inputs only   Mixed Real / Cat okay
Inputs -> Classifier      Joint BC, Naïve BC                                  Dec Tree
  (predict category)
Inputs -> Density Est.    Joint DE, Naïve DE        Gauss DE
  (probability)
Inputs -> Regressor
  (predict real no.)

What you should know
• The Recipe for MLE
• Why do we sometimes prefer MLE to MAP?
• Understand MLE estimation of Gaussian parameters
• Understand "biased estimator" versus "unbiased estimator"
• Appreciate the outline behind Bayesian estimation of Gaussian parameters

Useful exercise
• We'd already done some MLE in this class without even telling you!
• Suppose categorical arity-n inputs x1, x2, … xR ~ (i.i.d.) from a multinomial M(p1, p2, … pn) where P(xk = j | p) = pj
• What is the MLE p = (p1, p2, … pn)?

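For reference, plugging this likelihood into the MLE recipe gives the empirical frequencies, p_j = (number of records with xk = j) / R. A tiny numpy sketch (my addition) that simply computes those frequencies on hypothetical arity-4 data:

```python
import numpy as np

rng = np.random.default_rng(7)
p_true = np.array([0.1, 0.2, 0.3, 0.4])
x = rng.choice(len(p_true), size=1000, p=p_true)   # each x_k is a category in {0, 1, 2, 3}

counts = np.bincount(x, minlength=len(p_true))
p_mle = counts / counts.sum()
print(p_mle)   # close to p_true for large R
```
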
