2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

  1. 1. Entropy and Information Gain. Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University. www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599. Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received. Copyright © 2001, 2003, Andrew W. Moore.
  2. 2. Bits. You are watching a set of independent random samples of X. You see that X has four possible values: P(X=A) = 1/4, P(X=B) = 1/4, P(X=C) = 1/4, P(X=D) = 1/4. So you might see: BAACBADCDADDDA… You transmit data over a binary serial link. You can encode each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11): 0100001001001110110011111100…
  3. 3. Fewer Bits. Someone tells you that the probabilities are not equal: P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8. It's possible to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How?
  4. 4. Fewer Bits. Someone tells you that the probabilities are not equal: P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8. It's possible to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How? A = 0, B = 10, C = 110, D = 111. (This is just one of several ways.)
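To see why this code achieves 1.75 bits, here is a minimal check (my addition, not part of the slides): the expected codeword length under the given probabilities equals the entropy of the distribution, so no prefix code can do better on average.

```r
# Symbol probabilities from the slide and codeword lengths for A=0, B=10, C=110, D=111
p   <- c(A = 1/2, B = 1/4, C = 1/8, D = 1/8)
len <- c(A = 1,   B = 2,   C = 3,   D = 3)

sum(p * len)        # expected code length: 1.75 bits per symbol
-sum(p * log2(p))   # entropy H(X): also 1.75 bits, so this code is optimal
```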
  5. 5. Fewer Bits. Suppose there are three equally likely values: P(X=A) = 1/3, P(X=B) = 1/3, P(X=C) = 1/3. Here's a naïve coding, costing 2 bits per symbol: A = 00, B = 01, C = 10. Can you think of a coding that would need only 1.6 bits per symbol on average? In theory, it can in fact be done with 1.58496 bits per symbol.
  6. 6. General Case. Suppose X can have one of m values V_1, V_2, …, V_m with P(X=V_1) = p_1, P(X=V_2) = p_2, …, P(X=V_m) = p_m. What's the smallest possible number of bits, on average per symbol, needed to transmit a stream of symbols drawn from X's distribution? It's

      H(X) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_m \log_2 p_m = -\sum_{j=1}^{m} p_j \log_2 p_j

   H(X) is the entropy of X (Shannon, 1948). "High entropy" means X is from a uniform (boring) distribution; "low entropy" means X is from a varied (peaks and valleys) distribution.
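As a quick illustration (a sketch of mine, not from the slides), the formula above can be wrapped in a small helper and applied to the earlier examples:

```r
# Entropy in bits; zero-probability values are dropped, using the convention 0*log(0) = 0
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

entropy(rep(1/4, 4))            # four equally likely values: 2 bits
entropy(c(1/2, 1/4, 1/8, 1/8))  # the skewed distribution: 1.75 bits
entropy(rep(1/3, 3))            # three equally likely values: ~1.58496 bits
```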
  7. 7. General Case (continued). Same slide, annotated: a histogram of the frequency distribution of values of a high-entropy X would be flat, while a histogram of the frequency distribution of values of a low-entropy X would have many lows and one or two highs.
  8. 8. General Case (continued). …and so values sampled from the high-entropy (flat-histogram) distribution would be all over the place, while values sampled from the low-entropy (peaked) distribution would be more predictable.
  9. 9. Entropy in a nutshell. [Figure: two scatter plots labelled "Low Entropy" and "High Entropy".]
  10. 10. Entropy in a nutshell. Low entropy: the values (locations of soup) are sampled almost entirely from within the soup bowl. High entropy: the values (locations of soup) are unpredictable, sampled almost uniformly throughout our dining room.
  11. 11. Entropy of a PDF. The entropy of a continuous X with density p(x) is

      H[X] = -\int_{-\infty}^{\infty} p(x) \log p(x)\, dx

   using the natural log (ln or log_e). The larger the entropy of a distribution, the harder it is to predict, the harder it is to compress it, and the less spiky the distribution.
  12. 12. The "box" distribution: p(x) = 1/w if |x| < w/2, and 0 if |x| ≥ w/2.

      H[X] = -\int_{-\infty}^{\infty} p(x) \log p(x)\,dx = -\int_{-w/2}^{w/2} \frac{1}{w} \log\frac{1}{w}\,dx = \int_{-w/2}^{w/2} \frac{1}{w} \log w\,dx = \log w

  13. 13. Unit variance box distribution. For the box distribution, E[X] = 0 and Var[X] = w^2/12. If w = 2\sqrt{3} then Var[X] = 1 and H[X] = \log(2\sqrt{3}) \approx 1.242.
  14. 14. The hat distribution: p(x) = (w - |x|)/w^2 if |x| < w, and 0 if |x| ≥ w. E[X] = 0 and Var[X] = w^2/6.
  15. 15. Unit variance hat distribution. If w = \sqrt{6} then Var[X] = 1 and H[X] \approx 1.396.
  16. 16. The "2 spikes" distribution: p(x) = \frac{\delta(x+1) + \delta(x-1)}{2} (Dirac deltas at -1 and +1). E[X] = 0, Var[X] = 1, and

      H[X] = -\int_{-\infty}^{\infty} p(x) \log p(x)\,dx = -\infty
  17. 17. Entropies of unit-variance distributions:

      Distribution   Entropy
      Box            1.242
      Hat            1.396
      2 spikes       -infinity
      ???            1.4189  (the largest possible entropy of any unit-variance distribution)
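The table can be checked numerically. The sketch below (mine, not from the slides) integrates -p(x) log p(x) with base R's integrate() for the unit-variance box, hat and Gaussian densities, using natural logs as in the definition of H[X] for PDFs.

```r
box      <- function(x) ifelse(abs(x) < sqrt(3), 1 / (2 * sqrt(3)), 0)       # w = 2*sqrt(3)
hat      <- function(x) ifelse(abs(x) < sqrt(6), (sqrt(6) - abs(x)) / 6, 0)  # w = sqrt(6)
gaussian <- function(x) dnorm(x)

diff_entropy <- function(p, lo, hi) {
  integrand <- function(x) ifelse(p(x) > 0, -p(x) * log(p(x)), 0)
  integrate(integrand, lo, hi)$value
}

diff_entropy(box,      -sqrt(3), sqrt(3))   # ~1.242
diff_entropy(hat,      -sqrt(6), sqrt(6))   # ~1.396
diff_entropy(gaussian, -Inf,     Inf)       # ~1.4189 = 0.5 * log(2 * pi * e)
```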
  18. 18. Unit variance Gaussian: p(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right), with E[X] = 0, Var[X] = 1, and

      H[X] = -\int_{-\infty}^{\infty} p(x) \log p(x)\,dx \approx 1.4189
  19. 19. Specific Conditional Entropy H(Y|X=v). Suppose I'm trying to predict output Y and I have input X. X = College Major, Y = Likes "Gladiator". Let's assume this data reflects the true probabilities:

      X        Y
      Math     Yes
      History  No
      CS       Yes
      Math     No
      Math     No
      CS       Yes
      History  No
      Math     Yes

   E.g., from this data we estimate P(LikeG = Yes) = 0.5, P(Major = Math & LikeG = No) = 0.25, P(Major = Math) = 0.5, P(LikeG = Yes | Major = History) = 0. Note: H(X) = 1.5 and H(Y) = 1.
  20. 20. Specific Conditional Entropy H(Y|X=v). Definition: H(Y|X=v) = the entropy of Y among only those records in which X has value v.
  21. 21. Specific Conditional Entropy H(Y|X=v). Example, for the table above: H(Y|X=Math) = 1, H(Y|X=History) = 0, H(Y|X=CS) = 0.
  22. 22. Conditional Entropy H(Y|X). Definition: H(Y|X) = the average specific conditional entropy of Y = if you choose a record at random, what will be the conditional entropy of Y, conditioned on that row's value of X = the expected number of bits needed to transmit Y if both sides will know the value of X = \sum_j P(X=v_j)\, H(Y | X=v_j).
  23. 23. Conditional Entropy, example. H(Y|X) = \sum_j P(X=v_j)\, H(Y|X=v_j):

      v_j      P(X=v_j)   H(Y | X=v_j)
      Math     0.5        1
      History  0.25       0
      CS       0.25       0

      H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5
  24. 24. Information Gain. Definition: IG(Y|X) = I must transmit Y; how many bits on average would it save me if both ends of the line knew X? IG(Y|X) = H(Y) - H(Y|X). Example: H(Y) = 1 and H(Y|X) = 0.5, thus IG(Y|X) = 1 - 0.5 = 0.5.
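A sketch (mine, not from the slides) that reproduces H(Y), H(Y|X) and IG(Y|X) from the eight-record table above:

```r
major <- c("Math", "History", "CS",  "Math", "Math", "CS",  "History", "Math")
likes <- c("Yes",  "No",      "Yes", "No",   "No",   "Yes", "No",      "Yes")

entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

h_y <- entropy(table(likes) / length(likes))          # H(Y) = 1

h_y_given_x <- sum(sapply(unique(major), function(v) {
  sel <- major == v
  mean(sel) * entropy(table(likes[sel]) / sum(sel))   # P(X=v) * H(Y|X=v)
}))                                                   # H(Y|X) = 0.5

h_y - h_y_given_x                                     # IG(Y|X) = 0.5
```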
  25. 25. Relative Entropy: Kullback-Leibler distance.

      D(p, q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}
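A minimal implementation of D(p, q) (my sketch, not from the slides), assuming p and q are probability vectors over the same support with q(x) > 0 wherever p(x) > 0:

```r
kl_divergence <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log2(p[keep] / q[keep]))
}

p <- c(1/2, 1/4, 1/8, 1/8)   # true distribution
q <- rep(1/4, 4)             # distribution a code was designed for
kl_divergence(p, q)          # 0.25 bits: the average extra cost of coding p with q's code
```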
  26. 26. Mutual Information. A quantity that measures the mutual dependence of two random variables:

      I(X,Y) = \sum_y \sum_x p(x,y) \log_2 \frac{p(x,y)}{p(x)\,q(y)}                    (discrete)

      I(X,Y) = \int_Y \int_X p(x,y) \log_2 \frac{p(x,y)}{p(x)\,q(y)}\, dx\, dy          (continuous)

      I(X,Y|C) = \sum\sum p(x,y|c) \log_2 \frac{p(x,y|c)}{p(x|c)\,q(y|c)}               (conditional)

  27. 27. Mutual Information. I(X,Y) = H(Y) - H(Y|X):

      I(X,Y) = \sum_x \sum_y p(x,y) \log_2 \frac{p(y|x)}{q(y)}
             = -\sum_x \sum_y p(x,y) \log_2 p(y) + \sum_x \sum_y p(x,y) \log_2 p(y|x)
             = -\sum_y q(y) \log_2 q(y) + \sum_x p(x) \sum_y p(y|x) \log_2 p(y|x)
  28. 28. Mutual information identities:
      • I(X,Y) = H(Y) - H(Y|X)
      • I(X,Y) = H(X) - H(X|Y)
      • I(X,Y) = H(X) + H(Y) - H(X,Y)
      • I(X,Y) = I(Y,X)
      • I(X,X) = H(X)
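These identities are easy to verify numerically. The sketch below (mine, not from the slides) uses a small, arbitrarily chosen joint distribution p(x, y):

```r
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

joint <- matrix(c(1/4, 1/4,
                  0,   1/2), nrow = 2, byrow = TRUE)  # rows index x, columns index y
px <- rowSums(joint)
py <- colSums(joint)

# I(X,Y) straight from the definition
keep <- joint > 0
mi <- sum(joint[keep] * log2(joint[keep] / outer(px, py)[keep]))

mi                                            # ~0.3113
entropy(py) - (entropy(joint) - entropy(px))  # H(Y) - H(Y|X): same value
entropy(px) + entropy(py) - entropy(joint)    # H(X) + H(Y) - H(X,Y): same value
```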
  29. 29. Information Gain Example. [Figure only.]
  30. 30. Another example. [Figure only.]
  31. 31. Relative Information Gain. Definition: RIG(Y|X) = I must transmit Y; what fraction of the bits on average would it save me if both ends of the line knew X? RIG(Y|X) = [H(Y) - H(Y|X)] / H(Y). Example: H(Y|X) = 0.5 and H(Y) = 1, thus RIG(Y|X) = (1 - 0.5)/1 = 0.5.
  32. 32. What is Information Gain used for? Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find:
      • IG(LongLife | HairColor) = 0.01
      • IG(LongLife | Smoker) = 0.2
      • IG(LongLife | Gender) = 0.25
      • IG(LongLife | LastDigitOfSSN) = 0.00001
   IG tells you how interesting a 2-d contingency table is going to be.
  33. 33. Cross Entropy. Let X be a random variable with known distribution p(x) and estimated distribution q(x). The cross entropy measures the difference between the two distributions and is defined by

      H_C(X) = E[-\log q(X)] = H(X) + KL(p, q)

   where H(X) is the entropy of X with respect to the distribution p and KL is the Kullback-Leibler distance between p and q. If p and q are discrete this reduces to

      H_C(X) = -\sum_x p(x) \log_2 q(x)

   and for continuous p and q one uses the analogous integral form.
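A small numerical check (mine, not from the slides) of the decomposition H_C = H(p) + KL(p, q) in the discrete case:

```r
p <- c(1/2, 1/4, 1/8, 1/8)   # known (true) distribution
q <- rep(1/4, 4)             # estimated distribution

cross_entropy <- -sum(p * log2(q))     # 2 bits
h_p           <- -sum(p * log2(p))     # 1.75 bits
kl            <-  sum(p * log2(p / q)) # 0.25 bits

all.equal(cross_entropy, h_p + kl)     # TRUE
```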
  34. 34. Bivariate Gaussians. Write the random vector X = (X, Y)^T. Then define X ~ N(\mu, \Sigma) to mean

      p(\mathbf{x}) = \frac{1}{2\pi \|\Sigma\|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu)\right)

   where the Gaussian's parameters are

      \mu = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}

   and where we insist that \Sigma is symmetric non-negative definite.
  35. 35. Bivariate Gaussians (continued). It turns out that E[X] = \mu and Cov[X] = \Sigma. (Note that this is a resulting property of Gaussians, not a definition.)
  36. 36. Evaluating p(x): Step 1. Begin with vector x.
  37. 37. Evaluating p(x): Step 2. Define \delta = x - \mu.
  38. 38. Evaluating p(x): Step 3. Count the number of contours crossed of the ellipsoids formed by \Sigma^{-1} (contours defined by \sqrt{\delta^T \Sigma^{-1} \delta} = constant). D = this count = \sqrt{\delta^T \Sigma^{-1} \delta} = the Mahalanobis distance between x and \mu.
  39. 39. Evaluating p(x): Step 4. Define w = \exp(-D^2/2). An x close to \mu in squared Mahalanobis distance gets a large weight; far away gets a tiny weight.
  40. 40. Evaluating p(x): Step 5. Multiply w by \frac{1}{2\pi \|\Sigma\|^{1/2}} to ensure \int p(\mathbf{x})\, d\mathbf{x} = 1.
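The five steps translate directly into code. Below is a sketch (mine, not from the slides) for a bivariate case, with an example covariance matrix of my own choosing:

```r
mu    <- c(0, 0)
Sigma <- matrix(c(1.0, 0.5,
                  0.5, 2.0), nrow = 2)   # symmetric, positive definite (an assumed example)
x     <- c(1, 1)                         # step 1: the query point

delta <- x - mu                                     # step 2
D2    <- drop(t(delta) %*% solve(Sigma) %*% delta)  # step 3: squared Mahalanobis distance
w     <- exp(-D2 / 2)                               # step 4: unnormalised weight
p_x   <- w / (2 * pi * sqrt(det(Sigma)))            # step 5: normalise so p integrates to 1
p_x
```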
  41. 41. Bivariate Normal NB(0,0,1,1,0). [Figure: perspective plot of the density.]
      persp(x,y,a,theta=30,phi=10,zlab="f(x,y)",box=FALSE,col=4)
  42. 42. Bivariate Normal NB(0,0,1,1,0). [Figure: filled contour plot of the density over x, y in [-3, 3], density values from 0.00 to 0.20.]
      filled.contour(x,y,a,nlevels=4,col=2:5)
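The x, y and a objects passed to persp() and filled.contour() are not shown on the slides. A plausible reconstruction (an assumption on my part) for the NB(0,0,1,1,0) case, where zero correlation lets the density factorise into two standard normals:

```r
x <- seq(-3, 3, length.out = 50)
y <- seq(-3, 3, length.out = 50)
a <- outer(x, y, function(x, y) dnorm(x) * dnorm(y))  # f(x,y) for the standard bivariate normal

persp(x, y, a, theta = 30, phi = 10, zlab = "f(x,y)", box = FALSE, col = 4)
filled.contour(x, y, a, nlevels = 4, col = 2:5)
```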
  43. 43. Multivariate Gaussians. Write the random vector X = (X_1, X_2, …, X_m)^T. Then define X ~ N(\mu, \Sigma) to mean

      p(\mathbf{x}) = \frac{1}{(2\pi)^{m/2} \|\Sigma\|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu)\right)

   where the Gaussian's parameters are

      \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}

   and where we insist that \Sigma is symmetric non-negative definite. Again, E[X] = \mu and Cov[X] = \Sigma. (Note that this is a resulting property of Gaussians, not a definition.)
  44. 44. General Gaussians: arbitrary \mu and a full covariance matrix \Sigma, as on the previous slide. [Figure: contours in the (x_1, x_2) plane.]
  45. 45. Axis-Aligned Gaussians: \Sigma is diagonal, \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, …, \sigma_m^2), so X_i \perp X_j for i \neq j. [Figure: contours in the (x_1, x_2) plane.]
  46. 46. Spherical Gaussians: \Sigma = \sigma^2 I, so X_i \perp X_j for i \neq j and all variances are equal. [Figure: contours in the (x_1, x_2) plane.]
  47. 47. Subsets of variables. Write X = (X_1, …, X_m)^T as X = (U, V)^T, where U = (X_1, …, X_{m(u)})^T and V = (X_{m(u)+1}, …, X_m)^T. This will be our standard notation for breaking an m-dimensional distribution into subsets of variables.
  48. 48. Gaussian Marginals are Gaussian. (Marginalize: (U, V) -> U.) With X = (U, V)^T written as above, IF

      \begin{pmatrix} U \\ V \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_u \\ \mu_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix} \right)

   THEN U is also distributed as a Gaussian: U \sim N(\mu_u, \Sigma_{uu}).
  49. 49. Gaussian Marginals are Gaussian (continued). This fact is not immediately obvious, but it is obvious once we know it's a Gaussian (why?).
  50. 50. Gaussian Marginals are Gaussian (continued). How would you prove this? p(\mathbf{u}) = \int_{\mathbf{v}} p(\mathbf{u}, \mathbf{v})\, d\mathbf{v} (snore…).
  51. 51. Linear Transforms remain Gaussian. (Multiply by matrix A: X -> AX.) Assume X is an m-dimensional Gaussian r.v., X ~ N(\mu, \Sigma). Define Y to be a p-dimensional r.v. as Y = AX, where A is a p x m matrix (note p ≤ m). Then Y ~ N(A\mu, A\Sigma A^T).
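A quick sampling check of this rule (my sketch, not from the slides), with mu, Sigma and A chosen arbitrarily:

```r
set.seed(1)
mu    <- c(1, 2)
Sigma <- matrix(c(2.0, 0.6,
                  0.6, 1.0), nrow = 2)
A     <- matrix(c(1,  1,
                  1, -1), nrow = 2, byrow = TRUE)

# Draw 100,000 samples of X ~ N(mu, Sigma) via the Cholesky factor, then transform
L <- chol(Sigma)                                        # upper triangular, t(L) %*% L = Sigma
X <- t(mu + t(L) %*% matrix(rnorm(2 * 1e5), nrow = 2))  # 100000 x 2 matrix of samples
Y <- X %*% t(A)                                         # each row is A %*% x

colMeans(Y)           # close to A %*% mu = (3, -1)
cov(Y)                # close to A %*% Sigma %*% t(A)
A %*% Sigma %*% t(A)  # the predicted covariance: [[4.2, 1.0], [1.0, 1.8]]
```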
  52. 52. Adding samples of 2 independent Gaussians is Gaussian. If X ~ N(\mu_x, \Sigma_x) and Y ~ N(\mu_y, \Sigma_y) and X \perp Y, then X + Y ~ N(\mu_x + \mu_y, \Sigma_x + \Sigma_y). Why doesn't this hold if X and Y are dependent? Which of the below statements is true?
      • If X and Y are dependent, then X+Y is Gaussian but possibly with some other covariance
      • If X and Y are dependent, then X+Y might be non-Gaussian
  53. 53. Conditional of a Gaussian is Gaussian. (Conditionalize: (U, V) -> U | V.) IF

      \begin{pmatrix} U \\ V \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_u \\ \mu_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix} \right)

   THEN U | V \sim N(\mu_{u|v}, \Sigma_{u|v}) where

      \mu_{u|v} = \mu_u + \Sigma_{uv}^T \Sigma_{vv}^{-1} (V - \mu_v)
      \Sigma_{u|v} = \Sigma_{uu} - \Sigma_{uv}^T \Sigma_{vv}^{-1} \Sigma_{uv}
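A scalar sketch (mine, not from the slides) of these formulas, plugging in the numbers used in the example on the next slide:

```r
mu_u <- 2977;  mu_v <- 76
S_uu <- 849^2; S_vv <- 3.68^2; S_uv <- 967

cond_mean <- function(v) mu_u + S_uv / S_vv * (v - mu_v)  # mu_{u|v}: linear in v
cond_var  <- S_uu - S_uv^2 / S_vv                         # Sigma_{u|v}: does not depend on v

cond_mean(76)    # 2977: conditioning on the mean of v leaves the mean of u unchanged
cond_mean(82)    # ~3405
sqrt(cond_var)   # ~807, roughly the 808 quoted on the slide; smaller than the marginal sd of 849
```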
  54. 54. Example of conditioning. IF

      \begin{pmatrix} w \\ y \end{pmatrix} \sim N\left( \begin{pmatrix} 2977 \\ 76 \end{pmatrix}, \begin{pmatrix} 849^2 & 967 \\ 967 & 3.68^2 \end{pmatrix} \right)

   THEN w | y \sim N(\mu_{w|y}, \Sigma_{w|y}) where

      \mu_{w|y} = 2977 + \frac{967\,(y - 76)}{3.68^2}, \qquad \Sigma_{w|y} = 849^2 - \frac{967^2}{3.68^2} \approx 808^2

  55. 55. Same example, with the resulting densities sketched: P(w), P(w|m=76) and P(w|m=82). [Figure only.]
  56. 56. Same example, annotated. Note: when the given value of v is \mu_v, the conditional mean of u is \mu_u. Note: the conditional mean is a linear function of v. Note: the conditional variance can only be equal to or smaller than the marginal variance. Note: the conditional variance is independent of the given value of v.
  57. 57. Gaussians and the chain rule. (Chain rule: U|V and V -> (U, V).) Let A be a constant matrix. IF U | V \sim N(AV, \Sigma_{u|v}) and V \sim N(\mu_v, \Sigma_{vv}), THEN

      \begin{pmatrix} U \\ V \end{pmatrix} \sim N(\mu, \Sigma), \quad \mu = \begin{pmatrix} A\mu_v \\ \mu_v \end{pmatrix}, \quad \Sigma = \begin{pmatrix} A\Sigma_{vv}A^T + \Sigma_{u|v} & A\Sigma_{vv} \\ (A\Sigma_{vv})^T & \Sigma_{vv} \end{pmatrix}
  58. 58. Available Gaussian tools.
      • Marginalize ((U,V) -> U): if (U, V)^T ~ N((\mu_u, \mu_v)^T, [\Sigma_{uu}, \Sigma_{uv}; \Sigma_{uv}^T, \Sigma_{vv}]) then U ~ N(\mu_u, \Sigma_{uu}).
      • Multiply by matrix A (X -> AX): if X ~ N(\mu, \Sigma) and Y = AX then Y ~ N(A\mu, A\Sigma A^T).
      • Add: if X ~ N(\mu_x, \Sigma_x), Y ~ N(\mu_y, \Sigma_y) and X \perp Y, then X + Y ~ N(\mu_x + \mu_y, \Sigma_x + \Sigma_y).
      • Conditionalize ((U,V) -> U|V): if (U, V)^T is jointly Gaussian as above, then U | V ~ N(\mu_{u|v}, \Sigma_{u|v}) with \mu_{u|v} = \mu_u + \Sigma_{uv}^T \Sigma_{vv}^{-1}(V - \mu_v) and \Sigma_{u|v} = \Sigma_{uu} - \Sigma_{uv}^T \Sigma_{vv}^{-1} \Sigma_{uv}.
      • Chain rule (U|V, V -> (U,V)): if U | V ~ N(AV, \Sigma_{u|v}) and V ~ N(\mu_v, \Sigma_{vv}), then (U, V)^T ~ N(\mu, \Sigma) with \mu = (A\mu_v, \mu_v)^T and \Sigma = [A\Sigma_{vv}A^T + \Sigma_{u|v}, A\Sigma_{vv}; (A\Sigma_{vv})^T, \Sigma_{vv}].
  59. 59. Assume… You are an intellectual snob. You have a child.
  60. 60. Intellectual snobs with children… are obsessed with IQ. In the world as a whole, IQs are drawn from a Gaussian N(100, 15^2).
  61. 61. IQ tests. If you take an IQ test you'll get a score that, on average (over many tests), will be your IQ. But because of noise on any one test, the score will often be a few points lower or higher than your true IQ: SCORE | IQ ~ N(IQ, 10^2).
  62. 62. Assume… You drag your kid off to get tested. She gets a score of 130. "Yippee," you screech, and start deciding how to casually refer to her membership of the top 2% of IQs in your Christmas newsletter. P(X < 130 | \mu = 100, \sigma^2 = 15^2) = P(X < 2 | \mu = 0, \sigma^2 = 1) = \Phi(2) \approx 0.977.
  63. 63. Assume… (continued). You are thinking: well, sure, the test isn't accurate, so she might have an IQ of 120 or she might have an IQ of 140, but the most likely IQ given the evidence "score = 130" is, of course, 130. Can we trust this reasoning?
  64. 64. What we really want: IQ ~ N(100, 15^2); S | IQ ~ N(IQ, 10^2); S = 130. Question: what is IQ | (S=130)? This is called the posterior distribution of IQ.
  65. 65. Which tool or tools? IQ ~ N(100, 15^2); S | IQ ~ N(IQ, 10^2); S = 130. Question: what is IQ | (S=130)? The candidates from the toolbox: marginalize, multiply by a matrix, add, conditionalize, chain rule.
  66. 66. Plan. IQ ~ N(100, 15^2); S | IQ ~ N(IQ, 10^2); S = 130. Question: what is IQ | (S=130)? Use the chain rule on S | IQ and IQ to get the joint (S, IQ), swap to (IQ, S), then conditionalize to get IQ | S.
  67. 67. Working… IQ ~ N(100, 15^2); S | IQ ~ N(IQ, 10^2); S = 130. Question: what is IQ | (S=130)? The two tools needed:

      Conditionalize: if (U, V)^T ~ N((\mu_u, \mu_v)^T, [\Sigma_{uu}, \Sigma_{uv}; \Sigma_{uv}^T, \Sigma_{vv}]) then \mu_{u|v} = \mu_u + \Sigma_{uv}^T \Sigma_{vv}^{-1}(V - \mu_v).

      Chain rule: if U | V ~ N(AV, \Sigma_{u|v}) and V ~ N(\mu_v, \Sigma_{vv}), then (U, V)^T ~ N(\mu, \Sigma) with \Sigma = [A\Sigma_{vv}A^T + \Sigma_{u|v}, A\Sigma_{vv}; (A\Sigma_{vv})^T, \Sigma_{vv}].
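Carrying the plan through numerically (a sketch of mine; the slides stop at the formulas): apply the chain rule with A = 1 to get the joint of (S, IQ), then condition IQ on S = 130.

```r
mu_iq  <- 100; var_iq <- 15^2    # prior: IQ ~ N(100, 15^2)
var_s  <- 10^2                   # test noise: S | IQ ~ N(IQ, 10^2), i.e. A = 1
s_obs  <- 130

# Chain rule: joint covariance entries for (S, IQ)
var_s_marg <- var_iq + var_s     # A * Sigma_vv * A' + Sigma_u|v = 325
cov_s_iq   <- var_iq             # A * Sigma_vv = 225

# Conditionalize: IQ | S = 130
post_mean <- mu_iq + cov_s_iq / var_s_marg * (s_obs - mu_iq)  # ~120.8, not 130
post_sd   <- sqrt(var_iq - cov_s_iq^2 / var_s_marg)           # ~8.3, down from 15

post_mean
post_sd
```

So the snob's reasoning on slide 63 cannot be trusted: the posterior mean is pulled back toward 100 because the prior is informative relative to the test noise.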
