Gaussians


  1. Slide 1 (Gaussians): Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received. Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University, www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599. Copyright © Andrew W. Moore.
     Slide 2 (Gaussians in Data Mining): • Why we should care • The entropy of a PDF • Univariate Gaussians • Multivariate Gaussians • Bayes Rule and Gaussians • Maximum Likelihood and MAP using Gaussians
  2. Slide 3 (Why we should care): • Gaussians are as natural as Orange Juice and Sunshine • We need them to understand Bayes Optimal Classifiers • We need them to understand regression • We need them to understand neural nets • We need them to understand mixture models • … (You get the idea)
     Slide 4 (The "box" distribution): p(x) = 1/w if |x| ≤ w/2, and p(x) = 0 if |x| > w/2. The density is a flat box of height 1/w on the interval [−w/2, w/2].
  3. Slide 5 (The "box" distribution, continued): with p(x) = 1/w for |x| ≤ w/2 and 0 otherwise, E[X] = 0 and Var[X] = w²/12.
     Slide 6 (Entropy of a PDF): Entropy of X = H[X] = −∫ p(x) log p(x) dx, integrating over all x, where log is the natural log (ln or log_e). The larger the entropy of a distribution, the harder it is to predict, the harder it is to compress it, and the less spiky the distribution.
  4. Slide 7 (Entropy of the box distribution): H[X] = −∫ p(x) log p(x) dx = −∫ from −w/2 to w/2 of (1/w) log(1/w) dx = −(1/w) log(1/w) · w = log w.
     Slide 8 (Unit variance box distribution): p(x) = 1/w if |x| ≤ w/2 and 0 otherwise, with E[X] = 0 and Var[X] = w²/12. If w = 2√3 then Var[X] = 1 and H[X] = 1.242 (the box then runs from −√3 to √3).
  5. Slide 9 (The Hat distribution): p(x) = (w − |x|)/w² if |x| ≤ w, and p(x) = 0 if |x| > w. E[X] = 0 and Var[X] = w²/6. The density is a triangle of height 1/w peaking at 0 and falling to 0 at ±w.
     Slide 10 (Unit variance hat distribution): if w = √6 then Var[X] = 1 and H[X] = 1.396.
  6. Slide 11 (The "2 spikes" distribution): using Dirac deltas, p(x) = (δ(x = −1) + δ(x = 1))/2, i.e. half a spike at −1 and half a spike at +1. E[X] = 0, Var[X] = 1, and H[X] = −∫ p(x) log p(x) dx = −∞.
     Slide 12 (Entropies of unit-variance distributions):
     Distribution   Entropy
     Box            1.242
     Hat            1.396
     2 spikes       −∞
     ???            1.4189  (the largest possible entropy of any unit-variance distribution)
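     The entropy values in the table above can be checked numerically. Below is a minimal sketch (not from the slides) that integrates −p log p for the unit-variance box, hat, and Gaussian densities with scipy; the widths 2√3 and √6 come from Slides 8 and 10.

```python
import numpy as np
from scipy.integrate import quad

def entropy(pdf, lo, hi):
    """Differential entropy H[X] = -integral of p(x) ln p(x) dx over [lo, hi]."""
    return -quad(lambda x: pdf(x) * np.log(pdf(x)) if pdf(x) > 0 else 0.0, lo, hi)[0]

w_box = 2 * np.sqrt(3)                                    # unit-variance box width
box = lambda x: 1.0 / w_box if abs(x) <= w_box / 2 else 0.0
w_hat = np.sqrt(6)                                        # unit-variance hat half-width
hat = lambda x: (w_hat - abs(x)) / w_hat**2 if abs(x) <= w_hat else 0.0
gauss = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

print("box     :", entropy(box, -w_box / 2, w_box / 2))   # ~1.2425 = ln(2*sqrt(3))
print("hat     :", entropy(hat, -w_hat, w_hat))           # ~1.3959
print("gaussian:", entropy(gauss, -20, 20))               # ~1.4189 = 0.5*ln(2*pi*e)
```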
  7. Slide 13 (Unit variance Gaussian): p(x) = (1/√(2π)) exp(−x²/2), with E[X] = 0, Var[X] = 1, and H[X] = −∫ p(x) log p(x) dx = 1.4189.
     Slide 14 (General Gaussian): p(x) = (1/(√(2π) σ)) exp(−(x − μ)²/(2σ²)), with E[X] = μ and Var[X] = σ². The figure shows the case μ = 100, σ = 15.
  8. Slide 15 (General Gaussian): p(x) = (1/(√(2π) σ)) exp(−(x − μ)²/(2σ²)), also known as the normal distribution or bell-shaped curve, with E[X] = μ and Var[X] = σ². Shorthand: we say X ~ N(μ, σ²) to mean "X is distributed as a Gaussian with parameters μ and σ²". In the figure of Slide 14, X ~ N(100, 15²).
     Slide 16 (The Error Function): assume X ~ N(0, 1). Define ERF(x) = P(X < x) = the cumulative distribution of X = ∫ from −∞ to x of p(z) dz = ∫ from −∞ to x of (1/√(2π)) exp(−z²/2) dz.
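     Note that the ERF defined on Slide 16 is the standard normal CDF, which relates to the erf function found in most math libraries by Φ(x) = (1 + erf(x/√2))/2. A small sketch of that relationship (the helper name is mine, not the slides'):

```python
import math

def ERF(x):
    """Slide 16's ERF: the cumulative distribution of a standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(ERF(0.0))   # 0.5
print(ERF(2.0))   # ~0.977, the value used in the IQ example later on
```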
  9. Slide 17 (Using the Error Function): assume X ~ N(μ, σ²). Then P(X < x | μ, σ²) = ERF((x − μ)/σ).
     Slide 18 (The Central Limit Theorem): • If (X1, X2, …, Xn) are i.i.d. continuous random variables • Then define z = f(x1, x2, …, xn) = (1/n) Σ xi • As n → ∞, p(z) → Gaussian with mean E[Xi] and variance Var[Xi]. Somewhat of a justification for assuming Gaussian noise is common.
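     A quick simulation sketch of the Central Limit Theorem statement above: averages of n i.i.d. uniform draws look increasingly Gaussian, with the variance of the average shrinking like Var[Xi]/n. The uniform(−1, 1) choice and the sample sizes are arbitrary illustration, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (1, 2, 10, 100):
    # z = (1/n) * sum of n i.i.d. uniform(-1, 1) draws, repeated 100,000 times.
    z = rng.uniform(-1, 1, size=(100_000, n)).mean(axis=1)
    # For uniform(-1, 1): E[X_i] = 0 and Var[X_i] = 1/3, so Var[z] = 1/(3n).
    print(f"n={n:4d}  mean={z.mean():+.4f}  var={z.var():.4f}  1/(3n)={1/(3*n):.4f}")
```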
 10. Slide 19 (Other amazing facts about Gaussians): • Wouldn't you like to know? • We will not examine them until we need to.
     Slide 20 (Bivariate Gaussians): write the random variable X = (X, Y)ᵀ. Then define X ~ N(μ, Σ) to mean p(x) = (1/(2π ||Σ||^½)) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)), where the Gaussian's parameters are μ = (μx; μy) and Σ = [σx² σxy; σxy σy²], and where we insist that Σ is symmetric non-negative definite.
 11. Slide 21 (Bivariate Gaussians, continued): with the same definition as Slide 20, it turns out that E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition.)* *This note rates 7.4 on the pedanticness scale.
     Slide 22 (Evaluating p(x), Step 1): to evaluate p(x) = (1/(2π ||Σ||^½)) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)), begin with the vector x.
 12. Slide 23 (Evaluating p(x), Step 2): define δ = x − μ.
     Slide 24 (Evaluating p(x), Step 3): count the number of contours crossed of the ellipsoids formed by Σ⁻¹, with contours defined by √(δᵀ Σ⁻¹ δ) = constant. D = this count = √(δᵀ Σ⁻¹ δ) = the Mahalanobis distance between x and μ.
 13. Slide 25 (Evaluating p(x), Step 4): define w = exp(−D²/2). A point x close to μ in squared Mahalanobis distance gets a large weight; a point far away gets a tiny weight.
     Slide 26 (Evaluating p(x), Step 5): multiply w by 1/(2π ||Σ||^½) to ensure ∫ p(x) dx = 1.
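     A minimal sketch of Steps 1 to 5 for a bivariate Gaussian; the particular μ, Σ, and query point below are made-up illustrative values, and the scipy call at the end is only a cross-check.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([1.0, -0.5])                                # Step 1: the query point

delta = x - mu                                           # Step 2: delta = x - mu
D2 = delta @ np.linalg.inv(Sigma) @ delta                # Step 3: squared Mahalanobis distance
w = np.exp(-0.5 * D2)                                    # Step 4: unnormalized weight
p = w / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))      # Step 5: normalize

print(p, multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # the two agree
```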
 14. Slide 27 (Example): observe the mean, the principal axes, the implication of the off-diagonal covariance term, and the zone of maximum gradient of p(x). Common convention: show the contour corresponding to 2 standard deviations from the mean.
     Slide 28 (Example): another contour plot.
 15. Slide 29 (Example): in this example, x and y are almost independent.
     Slide 30 (Example): in this example, x and "x+y" are clearly not independent.
 16. Slide 31 (Example): in this example, x and "20x+y" are clearly not independent.
     Slide 32 (Multivariate Gaussians): write the random variable X = (X1, X2, …, Xm)ᵀ. Then define X ~ N(μ, Σ) to mean p(x) = (1/((2π)^(m/2) ||Σ||^½)) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)), where the Gaussian's parameters are μ = (μ1; …; μm) and Σ = [σ1² σ12 … σ1m; σ12 σ2² … σ2m; … ; σ1m σ2m … σm²], and where we insist that Σ is symmetric non-negative definite. Again, E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition.)
 17. Slide 33 (General Gaussians): μ = (μ1; …; μm) and a full covariance matrix Σ (off-diagonal entries σij may be non-zero), giving elliptical contours of arbitrary orientation in the (x1, x2) plane.
     Slide 34 (Axis-Aligned Gaussians): Σ = diag(σ1², σ2², …, σm²), a diagonal covariance matrix, so Xi ⊥ Xj for i ≠ j and the contour ellipses are aligned with the coordinate axes.
 18. Slide 35 (Spherical Gaussians): Σ = σ²I, i.e. the same variance σ² in every direction and zero covariance, so Xi ⊥ Xj for i ≠ j and the contours are circles (spheres in higher dimensions).
     Slide 36 (Degenerate Gaussians): ||Σ|| = 0, i.e. the covariance matrix is singular and all the probability mass lies in a lower-dimensional subspace. ("What's so wrong with clipping one's toenails in public?")
 19. Slide 37 (Where are we now?): • We've seen the formulae for Gaussians • We have an intuition of how they behave • We have some experience of "reading" a Gaussian's covariance matrix • Coming next: some useful tricks with Gaussians
     Slide 38 (Subsets of variables): write X = (X1, X2, …, Xm)ᵀ as the stacked vector X = (U; V), where U = (X1, …, X_m(u))ᵀ and V = (X_m(u)+1, …, Xm)ᵀ. This will be our standard notation for breaking an m-dimensional distribution into subsets of variables.
 20. Slide 39 (Gaussian marginals are Gaussian; marginalize (U; V) down to U): with X = (U; V) as on Slide 38,
     IF (U; V) ~ N( (μu; μv), [Σuu Σuv; Σuvᵀ Σvv] )
     THEN U is also distributed as a Gaussian: U ~ N(μu, Σuu).
     Slide 40 (Gaussian marginals, note): this fact is not immediately obvious, but it is obvious once we know the marginal is a Gaussian (why?).
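     A small sampling sketch of the marginalization fact on Slide 39, with a made-up 3-dimensional joint in which U is the first two coordinates and V the third: the empirical mean and covariance of the U columns should match the (μu, Σuu) block read straight off the joint.

```python
import numpy as np

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

mu_u, Sigma_uu = mu[:2], Sigma[:2, :2]                 # the marginal of U is just this block

rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
print(mu_u, samples[:, :2].mean(axis=0))               # empirical mean of U is close to mu_u
print(Sigma_uu, np.cov(samples[:, :2].T), sep="\n")    # empirical cov of U is close to Sigma_uu
```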
 21. Slide 41 (Gaussian marginals, proof sketch): how would you prove this? p(u) = ∫ p(u, v) dv, integrating over v … (snore...).
     Slide 42 (Linear transforms remain Gaussian; multiply X by a matrix A): assume X is an m-dimensional Gaussian r.v., X ~ N(μ, Σ). Define Y to be a p-dimensional r.v. Y = AX, where A is a p × m matrix (note p ≤ m). Then Y ~ N(Aμ, A Σ Aᵀ). Note: the "subset" (marginalization) result is a special case of this result.
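     A sketch of Slide 42 with made-up numbers: the mean and covariance predicted by Y ~ N(Aμ, AΣAᵀ) are compared against a sampling estimate.

```python
import numpy as np

mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.4],
                  [0.0, 0.4, 1.0]])
A = np.array([[1.0, 1.0,  0.0],     # a 2 x 3 matrix, so Y = A X is 2-dimensional
              [0.0, 2.0, -1.0]])

print("theory mean:", A @ mu)
print("theory cov :", A @ Sigma @ A.T, sep="\n")

rng = np.random.default_rng(2)
Y = rng.multivariate_normal(mu, Sigma, size=200_000) @ A.T
print("sample mean:", Y.mean(axis=0))
print("sample cov :", np.cov(Y.T), sep="\n")
```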
 22. Slide 43 (Adding samples of 2 independent Gaussians is Gaussian): if X ~ N(μx, Σx) and Y ~ N(μy, Σy) and X ⊥ Y, then X + Y ~ N(μx + μy, Σx + Σy). Why doesn't this hold if X and Y are dependent? Which of the statements below is true? (a) If X and Y are dependent, then X+Y is Gaussian but possibly with some other covariance. (b) If X and Y are dependent, then X+Y might be non-Gaussian.
     Slide 44 (Conditional of a Gaussian is Gaussian; conditionalize (U; V) to get U | V):
     IF (U; V) ~ N( (μu; μv), [Σuu Σuv; Σuvᵀ Σvv] )
     THEN U | V ~ N(μ_u|v, Σ_u|v), where
     μ_u|v = μu + Σuv Σvv⁻¹ (V − μv)
     Σ_u|v = Σuu − Σuv Σvv⁻¹ Σuvᵀ
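     A sketch of Slide 44's conditioning formulas on a made-up 3-dimensional Gaussian, with U = (X1, X2) and V = (X3); the variable names below are mine, not the slides'.

```python
import numpy as np

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.6, 0.8],
                  [0.6, 1.5, 0.4],
                  [0.8, 0.4, 1.0]])

mu_u, mu_v = mu[:2], mu[2:]
S_uu, S_uv, S_vv = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]

v = np.array([2.5])                                       # the observed value of V
mu_cond = mu_u + S_uv @ np.linalg.inv(S_vv) @ (v - mu_v)  # conditional mean
S_cond = S_uu - S_uv @ np.linalg.inv(S_vv) @ S_uv.T       # conditional covariance

print("E[U | V=2.5]   =", mu_cond)
print("Cov[U | V=2.5] =", S_cond, sep="\n")
```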
 23. Slide 45 (Conditioning, a concrete example): suppose (w; y) ~ N( (2977; 76), [849² −967; −967 3.68²] ). Then, applying Slide 44, w | y ~ N(μ_w|y, Σ_w|y) where
     μ_w|y = 2977 − 967(y − 76)/3.68²
     Σ_w|y = 849² − 967²/3.68² = 808².
     Slide 46 (The same example, plotted): the densities P(w), P(w | m = 76) and P(w | m = 82).
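     The numbers on Slide 45 can be reproduced with the scalar version of the conditioning formulas; a short sketch:

```python
import math

mu_w, mu_y = 2977.0, 76.0
var_w, cov_wy, var_y = 849.0**2, -967.0, 3.68**2

def w_given_y(y):
    """Conditional mean and variance of w given y, via Slide 44's formulas."""
    mean = mu_w + cov_wy / var_y * (y - mu_y)
    var = var_w - cov_wy**2 / var_y
    return mean, var

print(w_given_y(76.0))   # mean 2977, variance about 808^2
print(w_given_y(82.0))   # the conditional mean drops; the conditional variance is unchanged
```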
 24. Slide 47 (The same example, annotated): • when the given value of v is μv, the conditional mean of u | v is μu • the conditional mean is a linear function of v • the conditional variance can only be equal to or smaller than the marginal variance • the conditional variance is independent of the given value of v. (Densities shown: P(w), P(w | m = 76), P(w | m = 82).)
     Slide 48 (Gaussians and the chain rule; combine U | V with V to get the joint (U; V)): let A be a constant matrix.
     IF U | V ~ N(AV, Σ_u|v) and V ~ N(μv, Σvv)
     THEN (U; V) ~ N(μ, Σ), with
     μ = (Aμv; μv) and Σ = [A Σvv Aᵀ + Σ_u|v   A Σvv; (A Σvv)ᵀ   Σvv].
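     A tiny sketch of Slide 48's chain rule with 1-dimensional U and V and made-up numbers: the joint mean and covariance of (U; V) are assembled from the conditional U | V ~ N(AV, Σ_u|v) and the marginal V ~ N(μv, Σvv).

```python
import numpy as np

A = np.array([[2.0]])           # U | V has mean A V
S_cond = np.array([[0.5]])      # Cov[U | V]
mu_v = np.array([1.0])
S_vv = np.array([[3.0]])

mu_joint = np.concatenate([A @ mu_v, mu_v])
S_joint = np.block([[A @ S_vv @ A.T + S_cond, A @ S_vv],
                    [(A @ S_vv).T,            S_vv]])
print(mu_joint)    # [2. 1.]
print(S_joint)     # [[12.5  6. ]
                   #  [ 6.   3. ]]
```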
 25. Slide 49 (Available Gaussian tools):
     • Marginalize (Slide 39): IF (U; V) ~ N( (μu; μv), [Σuu Σuv; Σuvᵀ Σvv] ) THEN U ~ N(μu, Σuu).
     • Matrix multiply (Slide 42): IF X ~ N(μ, Σ) AND Y = AX THEN Y ~ N(Aμ, A Σ Aᵀ).
     • Add independent Gaussians (Slide 43): if X ~ N(μx, Σx) and Y ~ N(μy, Σy) and X ⊥ Y then X + Y ~ N(μx + μy, Σx + Σy).
     • Conditionalize (Slide 44): with the same joint, U | V ~ N(μ_u|v, Σ_u|v), where μ_u|v = μu + Σuv Σvv⁻¹ (V − μv) and Σ_u|v = Σuu − Σuv Σvv⁻¹ Σuvᵀ.
     • Chain rule (Slide 48): IF U | V ~ N(AV, Σ_u|v) and V ~ N(μv, Σvv) THEN (U; V) ~ N(μ, Σ) with μ = (Aμv; μv) and Σ = [A Σvv Aᵀ + Σ_u|v   A Σvv; (A Σvv)ᵀ   Σvv].
     Slide 50 (Assume...): • You are an intellectual snob • You have a child
 26. Slide 51 (Intellectual snobs with children): • ...are obsessed with IQ • In the world as a whole, IQs are drawn from a Gaussian N(100, 15²)
     Slide 52 (IQ tests): • If you take an IQ test you'll get a score that, on average (over many tests), will be your IQ • But because of noise on any one test the score will often be a few points lower or higher than your true IQ: SCORE | IQ ~ N(IQ, 10²)
 27. Slide 53 (Assume...): • You drag your kid off to get tested • She gets a score of 130 • "Yippee" you screech, and start deciding how to casually refer to her membership of the top 2% of IQs in your Christmas newsletter. P(X < 130 | μ = 100, σ² = 15²) = P(X < 2 | μ = 0, σ² = 1) = ERF(2) = 0.977.
     Slide 54 (Assume..., continued): you are thinking: "Well, sure the test isn't accurate, so she might have an IQ of 120 or she might have an IQ of 140, but the most likely IQ given the evidence 'score = 130' is, of course, 130." Can we trust this reasoning?
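     A one-line check of Slide 53's arithmetic, reusing the ERF helper sketched after Slide 16:

```python
import math

def ERF(x):
    """Standard normal CDF, as defined on Slide 16."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(ERF((130 - 100) / 15))   # = ERF(2) ~ 0.977, i.e. roughly the top 2%
```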
 28. Slide 55 (Maximum Likelihood IQ): • IQ ~ N(100, 15²) • S | IQ ~ N(IQ, 10²) • S = 130. IQ_mle = argmax over iq of p(s = 130 | iq). The MLE is the value of the hidden parameter that makes the observed data most likely. In this case IQ_mle = 130.
     Slide 56 (BUT....): with the same setup, IQ_mle = 130, but this is not the same as "the most likely value of the parameter given the observed data".
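     A sketch of Slide 55's maximization: viewed as a function of iq, the likelihood p(s = 130 | iq) is a Gaussian bump centred at 130, so a grid search recovers the MLE of 130 (the grid bounds are arbitrary).

```python
import numpy as np
from scipy.stats import norm

iq_grid = np.linspace(50, 200, 3001)                  # candidate values of the hidden IQ
likelihood = norm.pdf(130, loc=iq_grid, scale=10)     # p(s = 130 | iq), test noise sd = 10
print(iq_grid[np.argmax(likelihood)])                 # -> 130.0, the MLE
```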
 29. Slide 57 (What we really want): • IQ ~ N(100, 15²) • S | IQ ~ N(IQ, 10²) • S = 130 • Question: what is IQ | (S = 130)? This is called the posterior distribution of IQ.
     Slide 58 (Which tool or tools?): same question; choose from the tools of Slide 49: marginalize, matrix multiply, add independent Gaussians, conditionalize, chain rule.
 30. Slide 59 (Plan): • IQ ~ N(100, 15²) • S | IQ ~ N(IQ, 10²) • S = 130 • Question: what is IQ | (S = 130)? Plan: apply the chain rule to S | IQ and IQ to get the joint (S; IQ), swap to (IQ; S), then conditionalize to get IQ | S.
     Slide 60 (Working...): apply the chain-rule result (IF U | V ~ N(AV, Σ_u|v) and V ~ N(μv, Σvv) THEN (U; V) is jointly Gaussian, with the mean and covariance given on Slide 48) and then the conditioning result (μ_u|v = μu + Σuv Σvv⁻¹ (V − μv), Σ_u|v = Σuu − Σuv Σvv⁻¹ Σuvᵀ) to IQ ~ N(100, 15²), S | IQ ~ N(IQ, 10²), S = 130.
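     A sketch of how the working on Slides 59 and 60 comes out numerically, using the chain rule (with A = 1) and then conditioning on S = 130; the setup is the slides', but the exact numbers below are my own calculation.

```python
import math

mu_iq, var_iq = 100.0, 15.0**2      # prior: IQ ~ N(100, 15^2)
var_noise = 10.0**2                 # test noise: S | IQ ~ N(IQ, 10^2)
s = 130.0

# Chain rule (Slide 48) with A = 1: (S, IQ) is jointly Gaussian with
#   E[S] = 100, Var[S] = var_iq + var_noise, Cov[S, IQ] = var_iq.
var_s = var_iq + var_noise
cov_s_iq = var_iq

# Conditionalize (Slide 44) on S = s to get the posterior of IQ.
post_mean = mu_iq + cov_s_iq / var_s * (s - mu_iq)
post_var = var_iq - cov_s_iq**2 / var_s
print(post_mean, math.sqrt(post_var))   # ~120.8 and ~8.3: the MAP IQ is about 121, not 130
```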
 31. Slide 61 (Your pride and joy's posterior IQ): • If you did the working, you now have p(IQ | S = 130) • If you have to give the most likely IQ given the score, you should give IQ_map = argmax over iq of p(iq | s = 130), where MAP means "Maximum A-posteriori".
     Slide 62 (What you should know): • The Gaussian PDF formula, off by heart • Understand the workings of the formula for a Gaussian • Be able to understand the Gaussian tools described so far • Have a rough idea of how you could prove them • Be happy with how you could use them
