08 Machine Learning Maximum Aposteriori

1. Machine Learning for Data Mining Maximum A Posteriori (MAP) Andres Mendez-Vazquez July 22, 2015 1 / 66

2. Outline 1 Introduction A ﬁrst solution Example Properties of the MAP 2 Example of Application of MAP and EM Example Linear Regression The Gaussian Noise A Hierarchical-Bayes View of the Laplacian Prior Sparse Regression via EM Jeﬀrey’s Prior 2 / 66

3. Introduction We go back to the Bayesian Rule p (Θ|X) = p (X|Θ) p (Θ) p (X) (1) We now seek that value for Θ, called ΘMAP It allows to maximize the posterior p (Θ|X) 3 / 66

4. Introduction We go back to the Bayesian Rule p (Θ|X) = p (X|Θ) p (Θ) p (X) (1) We now seek that value for Θ, called ΘMAP It allows to maximize the posterior p (Θ|X) 3 / 66

6. Development of the solution We look to maximize ΘMAP ΘMAP = argmax Θ p (Θ) = argmax Θ p (X|Θ) p (Θ) P (X) ≈ argmax Θ p (X|Θ) p (Θ) = argmax Θ xi∈X p (xi|Θ) p (Θ) P (X) can be removed because it has no functional relation with Θ. 5 / 66

11. We can make this easier Use logarithms ΘMAP = argmax Θ   xi∈X log p (xi|Θ) + log p (Θ)   (2) 6 / 66

13. What Does the MAP Estimate Get Something Notable The MAP estimate allows us to inject into the estimation calculation our prior beliefs regarding the parameters values in Θ. For example Let’s conduct N independent trials of the following Bernoulli experiment with q parameter: We will ask each individual we run into in the hallway whether they will vote PRI or PAN in the next presidential election. With probability p to vote PRI Where the values of xi is either PRI or PAN. 8 / 66

17. First the ML estimate Samples X = xi = PAN PRI i = 1, ..., N (3) The log likelihood function log P (X|p) = N i=1 log p (xi|q) = i log p (xi = PRI|q) + ... i log p (xi = PAN|1 − q) =nPRI log q + (N − nPRI ) log (1 − q) Where nPRI are the numbers of individuals who are planning to vote PRI this fall 9 / 66

22. We use our classic tricks By setting L = log P (X|q) (4) We have that ∂L ∂q = 0 (5) Thus nPRI q − (N − nPRI ) (1 − q) = 0 (6) 10 / 66

25. Final Solution of ML We get qPRI = nPRI N (7) Thus If we say that N = 20 and if 12 are going to vote PRI, we get qPRI = 0.6. 11 / 66

26. Final Solution of ML We get qPRI = nPRI N (7) Thus If we say that N = 20 and if 12 are going to vote PRI, we get qPRI = 0.6. 11 / 66

27. Building the MAP estimate Obviously we need a prior belief distribution We have the following constraints: The prior for q must be zero outside the [0,1] interval. Within the [0,1] interval, we are free to specify our beliefs in any way we wish. In most cases, we would want to choose a distribution for the prior beliefs that peaks somewhere in the [0, 1] interval. We assume the following The state of Colima has traditionally voted PRI in presidential elections. However, on account of the prevailing economic conditions, the voters are more likely to vote PAN in the election in question. 12 / 66

32. What prior distribution can we use? We could use a Beta distribution being parametrized by two values α and β P (q) = 1 B (α, β) qα−1 (1 − q)β−1 . (8) Where We have B (α, β) = Γ(α)Γ(β) Γ(α+β) is the beta function where Γ is the generalization of the notion of factorial in the case of the real numbers. Properties When both the α, β > 0 then the beta distribution has its mode (Maximum value) at α − 1 α + β − 2 . (9) 13 / 66

35. We then do the following We do the following We can choose α = β so the beta prior peaks at 0.5. As a further expression of our belief We make the following choice α = β = 5. Why? Look at the variance of the beta distribution αβ (α + β) (α + β + 1) . (10) 14 / 66

38. Thus, we have the following nice properties We have a variance with α = β = 5 Var (q) ≈ 0.025 Thus, the standard deviation sd ≈ 0.16 which is a nice dispersion at the peak point!!! 15 / 66

39. Thus, we have the following nice properties We have a variance with α = β = 5 Var (q) ≈ 0.025 Thus, the standard deviation sd ≈ 0.16 which is a nice dispersion at the peak point!!! 15 / 66

40. Now, our MAP estimate for p... We have then pMAP = argmax Θ   xi∈X log p (xi|q) + log p (q)   (11) Plugging back the ML pMAP = argmax Θ [nPRI log q + (N − nPRI ) log (1 − q) + log p (q)] (12) Where log P (p) = log 1 B (α, β) qα−1 (1 − q)β−1 (13) 16 / 66

43. The log of P (p) We have that log P (q) = (α − 1) log q + (β − 1) log (1 − q) − log B (α, β) (14) Now taking the derivative with respect to p, we get nPRI q − (N − nPRI ) (1 − q) − β − 1 1 − q + α − 1 q = 0 (15) Thus qMAP = nPRI + α − 1 N + α + β − 2 (16) 17 / 66

46. Now With N = 20 with nPRI = 12 and α = β = 5 qMAP = 0.571 18 / 66

48. Properties First MAP estimation “pulls” the estimate toward the prior. Second The more focused our prior belief, the larger the pull toward the prior. Example If α = β=equal to large value It will make the MAP estimate to move closer to the prior. 20 / 66

51. Properties Third In the expression we derived for qMAP, the parameters α and β play a “smoothing” role vis-a-vis the measurement nPRI . Fourth Since we referred to q as the parameter to be estimated, we can refer to α and β as the hyper-parameters in the estimation calculations. 21 / 66

52. Properties Third In the expression we derived for qMAP, the parameters α and β play a “smoothing” role vis-a-vis the measurement nPRI . Fourth Since we referred to q as the parameter to be estimated, we can refer to α and β as the hyper-parameters in the estimation calculations. 21 / 66

53. Beyond simple derivation In the previous technique We took an logarithm of the likelihoot × the prior to obtain a function that can be derived in order to obtain each of the parameters to be estimated. What if we cannot derive? For example when we have something like |θi|. We can try the following EM + MAP to be able to estimate the sought parameters. 22 / 66

57. Example This application comes from “Adaptive Sparseness for Supervised Learning” by Mário A.T. Figueiredo In IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 25, NO. 9, SEPTEMBER 2003 24 / 66

58. Example This application comes from “Adaptive Sparseness for Supervised Learning” by Mário A.T. Figueiredo In IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 25, NO. 9, SEPTEMBER 2003 24 / 66

60. Introduction: Linear Regression with Gaussian Prior We consider regression functions that are linear with respect to the parameter vector β f (x, β) = k i=1 βih (x) = βT h (x) Where h (x) = [h1 (x) , ..., hk (x)]T is a vector of k ﬁxed function of the input, often called features. 26 / 66

61. Introduction: Linear Regression with Gaussian Prior We consider regression functions that are linear with respect to the parameter vector β f (x, β) = k i=1 βih (x) = βT h (x) Where h (x) = [h1 (x) , ..., hk (x)]T is a vector of k ﬁxed function of the input, often called features. 26 / 66

62. Actually, it can be... Linear Regression Linear regression, in which h (x) = [1, x1, ..., xd]T i; in this case, k = d + 1. Non-Linear Regression Here, you have a ﬁxed basis function where h (x) = [φ1 (x) , φ2 (x) , ..., φ1 (x)]T with φ1 (x) = 1. Kernel Regression Here h (x) = [1, K (x, x1) , K (x, x2) , ..., K (x, xn)]T where K (x, xi) is some kernel function. 27 / 66

66. Gaussian Noise We assume that the training set is contaminated by additive white Gaussian Noise yi = f (xi, β) + ωi (17) for i = 1, ..., N where [ω1, ..., ωN ] is a set of independent zero-mean Gaussian samples with variance σ2 Thus, for [y1, ..., yN ], we have the following likelihood p (y|β) = N Hβ, σ2 I (18) The posterior p (β|y) is still Gaussian and the mode is given by β = σ2 I + HT H −1 HT y (19) Remark: The Ridge regression. 29 / 66

69. Regression with a Laplacian Prior In order to favor sparse estimate p (β|α) = k i=1 α 2 exp {−α |βi|} = α 2 k exp {−α β 1} (20) Thus, the MAP estimate of β look like β = argmin β Hβ − y 2 2 + 2σ2 α β 1 (21) 30 / 66

70. Regression with a Laplacian Prior In order to favor sparse estimate p (β|α) = k i=1 α 2 exp {−α |βi|} = α 2 k exp {−α β 1} (20) Thus, the MAP estimate of β look like β = argmin β Hβ − y 2 2 + 2σ2 α β 1 (21) 30 / 66

71. Remark This criterion is know as the LASSO This norm l1 induces sparsity in the weight terms. How? For example, [1, 0]T 2 = [1/ √ 2, 1/ √ 2]T 2 = 1. In the other case, [1, 0]T 1 = 1 < [1/ √ 2, 1/ √ 2]T 2 = √ 2. 31 / 66

74. An example What if H is a orthogonal matrix In this case HT H = I Thus β =argmin β Hβ − y 2 2 + 2σ2 α β 1 =argmin β (Hβ − y)T (Hβ − y) + 2σ2 α k i=1 |βi| =argmin β βT HT Hβ − 2βT HT y + yT y + 2σ2 α k i=1 |βi| =argmin β βT β − 2βT HT y + yT y + 2σ2 α k i=1 |βi| 32 / 66

79. We can solve this last part as follow We can group for each βi β2 i − 2βi HT y i + 2σ2 α |βi| + y2 i (22) If we can minimize each group we will be able to get the solution βi = argmin βi β2 i − 2βi HT y i + 2σ2 α |βi| (23) We have two cases βi > 0 βi < 0 33 / 66

83. If βi > 0 We then derive with respect to βi ∂ β2 i − 2βi HT y i + 2σ2αiβi ∂βi = 2βi − 2 HT y i + 2σ2 α We have then βi = HT y i − σ2 α (24) 34 / 66

84. If βi > 0 We then derive with respect to βi ∂ β2 i − 2βi HT y i + 2σ2αiβi ∂βi = 2βi − 2 HT y i + 2σ2 α We have then βi = HT y i − σ2 α (24) 34 / 66

85. If βi < 0 We then derive with respect to βi ∂ β2 i − 2βi HT y i − 2σ2αiβi ∂βi = 2βi − 2 HT y i − 2σ2 α We have then βi = HT y i + σ2 α (25) 35 / 66

86. If βi < 0 We then derive with respect to βi ∂ β2 i − 2βi HT y i − 2σ2αiβi ∂βi = 2βi − 2 HT y i − 2σ2 α We have then βi = HT y i + σ2 α (25) 35 / 66

87. The value of HT y i Ww have that We have that: if βi > 0 then HT y i > σ2α if βi < 0 then HT y i < −σ2α 36 / 66

88. We can put all this together A compact Version βi = sgn HT y i HT y i − σ2 α + (26) With (a)+ = a if a ≥ 0 0 if a < 0 and “sgn” is the sign function. This rule is know as the The soft threshold!!! 37 / 66

91. Example We have the doted · · · line as the soft threshold a 3 2 1 0 -1 -2 -3 -3 -2 -1 0 1 2 3 38 / 66

93. Now Given that each βi has a zero-mean Gaussian prior p (βi|τi) = N (βi|0, τi) (27) Where τi has the following exponential hyperprior p (τi|γ) = γ 2 exp − γ 2 τi for τi ≥ 0 (28) This is a though property - so we take it by heart p (βi|γ) = ˆ ∞ 0 p (βi|τi) p (τi|γ) dτi = √ γ 2 exp {− √ γ |βi|} (29) 40 / 66

96. Example The double exponential 41 / 66

97. This is equivalent to the use of the L1-norm for regularization L1(Left) is better for sparsity promotion than L2(Right) 42 / 66

99. The EM trick How do we do this? This is done by regarding τ = [τ1, ..., τk] as the hidden/missing data Then, if we could observe τ, complete log-posterior log p (β, σ2 |y, τ) can be easily calculated p β, σ2 |y, τ ∝ p y|β, σ2 p (β|τ) p σ2 (30) Where p y|β, σ2 ∼ N β, σ2 p (β|0, τ) ∼ N (0, τ) 44 / 66

102. What about p (σ2 )? We select p σ2 as a constant However We can adopt a conjugate inverse Gamma prior for σ2, but for large number of samples the prior on the estimate of σ2is very small. In the constant case We can use the MAP idea, however we have hidden parameters so we resort to the EM 45 / 66

105. E-step Computes the expected value of the complete log-posterior Q β, σ2 |β(t), σ2 (t) = ˆ log p β, σ2 |y, τ p τ|β(t), σ2 (t), y dτ (31) 46 / 66

106. M-step Updates the parameter estimates by maximizing the Q-function β(t+1), σ2 (t+1) = argmax β,σ2 Q β, σ2 |β(t), σ2 (t) (32) 47 / 66

107. Remark First The EM algorithm converges to a local maximum of the a posteriori probability density function p β, σ2 |y ∝ p y|β, σ2 p (β|γ) (33) Without using the marginal prior p (β|γ) which is not Gaussian Instead we use a conditional Gaussian prior p (β|γ) 48 / 66

108. Remark First The EM algorithm converges to a local maximum of the a posteriori probability density function p β, σ2 |y ∝ p y|β, σ2 p (β|γ) (33) Without using the marginal prior p (β|γ) which is not Gaussian Instead we use a conditional Gaussian prior p (β|γ) 48 / 66

109. The Final Model We have p y|β, σ2 = N Hβ, σ2 I p σ2 ∝ ”constant” p (β|τ) = k i=1 N (βi|0, τi) = N β|0, (Υ (τ))−1 p (τ|γ) = γ 2 k k i=1 exp − γ 2 τi With Υ (τ) = diag τ−1 1 , ..., τ−1 k is the diagonal matrix with the inverse variances of all the βi’s. 49 / 66

117. Thus Second Did you notice that the term βT Υ (τ) β is linear with respect to Υ (τ) and the other terms do not depend on τ? Thus, the E-step is reduced to the computation of Υ (τ) V(t) = E Υ (τ) |y, β(t), σ2 (t) = diag E τ−1 1 |y, β(t), σ2 (t) , ..., E τ−1 k |y, β(t), σ2 (t) 51 / 66

124. Here is the interesting part We have the following probability density p τi|y, βi,(t), σ2 i,(t) = N βi,(t)|0, τi γ 2 exp −γ 2 τi ´ ∞ 0 N βi,(t)|0, τi γ 2 exp −γ 2 τi dτi (36) Then E τ−1 1 |y, βi,(t), σ2 (t) = ´ ∞ 0 1 τi N βi,(t)|0, τi γ 2 exp −γ 2 τi dτi ´ ∞ 0 N βi,(t)|0, τi γ 2 exp −γ 2 τi dτi (37) Now I leave to you to prove that (It can come in the test) 53 / 66

127. Here is the interesting part Thus E τ−1 1 |y, βi,(t), σ2 (t) = γ βi,(t) (38) Finally V(t) = γdiag β1,(t) −1 , ..., βk,(t) −1 (39) 54 / 66

128. Here is the interesting part Thus E τ−1 1 |y, βi,(t), σ2 (t) = γ βi,(t) (38) Finally V(t) = γdiag β1,(t) −1 , ..., βk,(t) −1 (39) 54 / 66

129. The Final Q function Something Notable Q β, σ2 |β(t), σ2 (t) = ˆ log p β, σ2 |y, τ p τ|β(t), σ2 (t), y dτ = ˆ −n log σ2 − y − Hβ 2 2 σ2 − βT Υ (τ) β p τ|β(t), σ2 (t), y dτ = −n log σ2 − y − Hβ 2 2 σ2 − βT ˆ Υ (τ) p τ|β(t), σ2 (t), y dτ β = −n log σ2 − y − Hβ 2 2 σ2 − βT V(t)β 55 / 66

133. Finally, the M-step First σ2 (t+1) = argmax σ2 −n log σ2 − y − Hβ 2 2 σ2 = y − Hβ 2 2 n Second β(t+1) = argmax β − y − Hβ 2 2 σ2 − βT V(t)β = σ2 (t+1)V(t) + HT H −1 HT y This also I leave to you It can come in the test. 56 / 66

139. However We need to deal in some way with the γ term It controls the degree of spareness!!! We can do assuming a Jeﬀrey’s Prior J. Berger, Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag, 1980. We use instead p (τ) ∝ 1 τ (40) 58 / 66

142. Properties of the Jeﬀrey’s Prior Important This prior expresses ignorance with respect to scale and is parameter free Why scale invariant Imagine, we change the scale of τ by τ = Kτ where K is a constant expressing that change Thus, we have that p τ = 1 τ = 1 kτ ∝ 1 τ (41) 59 / 66

145. However Something Notable This prior is known as an improper prior. In addition This prior does not leads to a Laplacian prior on β. Nevertheless This prior induces sparseness and good performance for the β. 60 / 66

148. Introducing this prior into the equations Matrix V(t) is now V(t) = diag β1,(t) −2 , ..., βk,(t) −2 (42) Quite interesting!!! We do not have the free γ parameter. 61 / 66

149. Introducing this prior into the equations Matrix V(t) is now V(t) = diag β1,(t) −2 , ..., βk,(t) −2 (42) Quite interesting!!! We do not have the free γ parameter. 61 / 66

150. Here, we can see the new threshold With dotted line as the soft threshold, dashed the hard threshold and in blue solid the Jeﬀrey’s prior a 3 2 1 0 -1 -2 -3 -3 -2 -1 0 1 2 3 62 / 66

151. Observations The new rule is between The soft threshold rule. The hard threshold rule. Something Notable With large values of HT y i the new rule approaches the hard threshold. Once HT y i gets smaller The estimate becomes progressively smaller approaching the behavior of the soft rule. 63 / 66

154. Finally, an implementation detail Since several elements of β will go to zero V(t) = diag β1,(t) −2 , ..., βk,(t) −2 will have several elements going to large numbers Something Notable if we deﬁne U(t) = diag β1,(t) , ..., βk,(t) . Then, we have that V(t) = U−1 (t) U−1 (t) (43) 64 / 66

157. Thus We have that β(t+1) = σ2 (t+1)V(t) + HT H −1 HT y = σ2 (t+1)U−1 (t) U−1 (t) + HT H −1 HT y = σ2 (t+1)U−1 (t) IU−1 (t) + IHT HI −1 HT y = σ2 (t+1)U−1 (t) IU−1 (t) + U−1 (t) U(t)HT HU(t)U−1 (t) −1 HT y = U(t) σ2 (t+1)I + U(t)HT HU(t) −1 U(t)HT y 65 / 66

162. Advantages!!! Quite Important We avoid the inversion of the elements of β(t). We can avoid getting the inverse matrix We simply solve the corresponding linear system whose dimension is only the number of nonzero elements in U(t). Why? Remember you want to maximize Q β, σ2|β(t), σ2 (t) = −n log σ2 − y−Hβ 2 2 σ2 − βT V(t)β 66 / 66

08 Machine Learning Maximum Aposteriori

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (6)

Similar to 08 Machine Learning Maximum Aposteriori

Similar to 08 Machine Learning Maximum Aposteriori (20)

More from Andres Mendez-Vazquez

More from Andres Mendez-Vazquez (20)

Recently uploaded

Recently uploaded (20)

08 Machine Learning Maximum Aposteriori