Beginning with a review of Bayes' theorem and the chain rule, these slides explain MAP (maximum a posteriori) estimation.
Within the MAP estimation framework we can describe many well-known models: naive Bayes, regularized ridge regression, logistic regression, the log-linear model, and the Gaussian process.
MAP estimation is a powerful framework for understanding the above models from a Bayesian point of view, and it opens the possibility of extending them to semi-supervised models.
MAP Estimate and its Periphery
kzky
2011/4/24
Outline
1 Bayes Theorem
   Two Views of Bayes Theorem
   Chain Rule
2 MAP Estimate
   Introduction
   Ridge Regression
   Logistic Regression
   Log Linear Model
   Loss Function
   Gaussian Process
3 Summary
   MAP Estimation Summary
   Further and Other Topics
4 Bibliography
   Bibliography
Two Views of Bayes Theorem

Bayes' theorem, beginning with the joint distribution $p(x, y) = p(x)\,p(y \mid x)$:

$$p(x \mid y) = \frac{p(x, y)}{p(y)} = \frac{p(x)\,p(y \mid x)}{p(y)} = \frac{p(x)\,p(y \mid x)}{\sum_x p(x)\,p(y \mid x)}$$
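To make this concrete, here is a minimal sketch (with made-up numbers, not from the slides) that computes a posterior over a discrete x by normalizing prior × likelihood:

```python
import numpy as np

# Hypothetical discrete example: prior p(x) over two states,
# likelihood p(y|x) for one observed y.
p_x = np.array([0.3, 0.7])          # prior p(x)
p_y_given_x = np.array([0.9, 0.2])  # likelihood p(y|x) for the observed y

# Bayes' theorem: p(x|y) = p(x) p(y|x) / sum_x p(x) p(y|x)
joint = p_x * p_y_given_x            # joint p(x, y) = p(x) p(y|x)
p_y = joint.sum()                    # marginal p(y)
posterior = joint / p_y              # posterior p(x|y)

print(posterior)  # -> [0.6585... 0.3414...]
```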
Setting for MAP Estimate

Setting in the usual supervised learning, with assumptions:

$$D = \{(x_i, y_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P(x, y)$$

where $x \in \mathbb{R}^d$, $y \in \{1, -1\}$, and $P$ is the unknown joint distribution.

With Bayes' theorem and the assumption that $x$ does not depend on $\theta$:

$$p(\theta \mid x, y) = \frac{p(\theta)\,p(x, y \mid \theta)}{p(x, y)} = \frac{p(\theta)\,p(y \mid x, \theta)\,p(x \mid \theta)}{p(y \mid x)\,p(x)} = \frac{p(\theta)\,p(y \mid x, \theta)}{p(y \mid x)}$$

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{marginal likelihood}}$$
MAP Estimate

Maximum a posteriori estimate:
maximize p(θ|D) with respect to θ
basically, take the log: we can maximize the log posterior with respect to θ without loss of generality, because log is monotonic.
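A quick toy sketch (an arbitrary unnormalized density, purely illustrative) confirming that the maximizer of p and of log p coincide:

```python
import numpy as np

# Toy unnormalized posterior over a grid of theta values.
theta = np.linspace(-3, 3, 601)
p = np.exp(-(theta - 1.0) ** 2)  # proportional to a Gaussian centered at 1

# Because log is strictly increasing, argmax p = argmax log p.
assert np.argmax(p) == np.argmax(np.log(p))
print(theta[np.argmax(p)])  # -> 1.0
```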
Formulation on MAP Estimate

$$\begin{aligned}
\max_\theta \log p(\theta \mid D)
&= \max_\theta \Big( \log\big(p(\theta)\,p(D \mid \theta)\big) - \log p(D) \Big) \\
&= \max_\theta \Big( \log p(\theta) + \log \prod_i p(x_i, y_i \mid \theta) \Big) \\
&= \max_\theta \Big( \log p(\theta) + \sum_i \log p(y_i \mid x_i, \theta) + \sum_i \log p(x_i \mid \theta) \Big) \\
&= \max_\theta \Big( \log p(\theta) + \sum_i \log p(y_i \mid x_i, \theta) \Big)
\end{aligned}$$

where $\log p(D)$ is dropped because it does not depend on $\theta$, and $\sum_i \log p(x_i \mid \theta)$ is dropped by the assumption that $x$ does not depend on $\theta$.
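A minimal numeric sketch of the same decomposition (a hypothetical Beta-Bernoulli coin-flip setup, not from the slides): the MAP estimate maximizes log prior + log likelihood, with the constant log p(D) ignored:

```python
import numpy as np

# Hypothetical example: Bernoulli likelihood with a Beta(2, 2) prior.
a, b = 2.0, 2.0
flips = np.array([1, 1, 0, 1, 1, 0, 1])  # made-up data D

theta = np.linspace(1e-6, 1 - 1e-6, 10001)
log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
log_lik = flips.sum() * np.log(theta) + (len(flips) - flips.sum()) * np.log(1 - theta)

# MAP estimate: argmax of log prior + log likelihood (log p(D) is constant).
theta_map = theta[np.argmax(log_prior + log_lik)]

# Closed form for Beta-Bernoulli MAP: (a - 1 + k) / (a + b - 2 + n)
k, n = flips.sum(), len(flips)
print(theta_map, (a - 1 + k) / (a + b - 2 + n))  # both approx. 0.667
```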
Formulation only on the Regularization Term

Assumption on the prior:

$$w \equiv \theta \sim \mathcal{N}\!\left(0, \frac{I}{2\lambda}\right)$$

$$p(w) = \frac{1}{(2\pi)^{d/2}\left|\frac{I}{2\lambda}\right|^{1/2}} \exp\!\left(-\frac{1}{2}\, w^T \left(\frac{I}{2\lambda}\right)^{-1} w\right) = \frac{1}{(2\pi)^{d/2}\left|\frac{I}{2\lambda}\right|^{1/2}} \exp\!\left(-\lambda \|w\|_2^2\right)$$

$$\max_\theta \log p(\theta \mid D) = \max_w \Big( -\lambda \|w\|_2^2 + \sum_i \log p(y_i \mid x_i, w) \Big)$$
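A small sketch (assuming SciPy, with arbitrary d, λ, and w) checking that the negative log of this Gaussian prior is λ‖w‖² plus a constant:

```python
import numpy as np
from scipy.stats import multivariate_normal

d, lam = 3, 0.5
w = np.array([0.4, -1.2, 2.0])

# Prior N(0, I / (2*lambda)) as on the slide.
prior = multivariate_normal(mean=np.zeros(d), cov=np.eye(d) / (2 * lam))

# -log p(w) should equal lambda * ||w||^2 up to an additive constant.
const = -prior.logpdf(np.zeros(d))  # the normalization term
print(-prior.logpdf(w) - const, lam * w @ w)  # both the same value (2.8)
```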
Logistic Regression

Assumption on p(y|x, w):

$$p(y \mid x, w) = \frac{1}{1 + \exp(-y f(x))}$$

The MAP estimate becomes:

$$\max_w \Big( -\lambda \|w\|_2^2 + \sum_i \log \frac{1}{1 + \exp(-y_i f(x_i))} \Big) = \min_w \Big( \lambda \|w\|_2^2 + \sum_i \log\big(1 + \exp(-y_i f(x_i))\big) \Big)$$
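A minimal sketch of this objective (assuming f(x) = wᵀx, synthetic data, and SciPy's L-BFGS; not the slides' implementation):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # synthetic inputs
y = np.sign(X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=100))
lam = 0.1

def objective(w):
    # min_w  lambda * ||w||^2 + sum_i log(1 + exp(-y_i * w^T x_i))
    margins = y * (X @ w)
    return lam * w @ w + np.sum(np.logaddexp(0.0, -margins))

w_map = minimize(objective, x0=np.zeros(2), method="L-BFGS-B").x
print(w_map)  # roughly proportional to the true weights [1.5, -2.0]
```

The np.logaddexp(0, -m) form computes log(1 + exp(-m)) without overflow for large negative margins.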
Log Linear Model

Assumption on p(y|x, w):

$$p(y \mid x, w) = \frac{1}{Z_{x,w}} \exp\!\left(w^T \phi(x, y)\right)$$

where $Z_{x,w}$ normalizes $\exp(w^T \phi(x, y))$ with respect to $y$.

The MAP estimate becomes:

$$\max_w \Big( -\lambda \|w\|_2^2 + \sum_i \big( w^T \phi(x_i, y_i) - \ln Z_{x_i,w} \big) \Big)$$
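A toy sketch (hypothetical feature map and a small label set, purely illustrative) that evaluates this objective, computing Z_{x,w} by explicit summation over y:

```python
import numpy as np

labels = [0, 1, 2]  # small discrete label set
lam = 0.1

def phi(x, y):
    # Hypothetical joint feature map: x copied into the block for label y.
    f = np.zeros(len(labels) * len(x))
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def log_p(y, x, w):
    # log p(y|x,w) = w^T phi(x,y) - log Z_{x,w}, Z summed over all labels.
    scores = np.array([w @ phi(x, yy) for yy in labels])
    return scores[y] - np.logaddexp.reduce(scores)

def objective(w, data):
    # MAP objective: -lambda ||w||^2 + sum_i log p(y_i | x_i, w)
    return -lam * w @ w + sum(log_p(y, x, w) for x, y in data)

data = [(np.array([1.0, 0.5]), 2), (np.array([-0.3, 1.1]), 0)]
print(objective(np.zeros(6), data))  # = -2 * log(3) at w = 0
```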
Loss Function

Figures: loss functions (figure not reproduced in this transcript).
Main Points of Gaussian Process

Differences from the previous discussion:
do not take the log
use no distribution other than the Gaussian

Concept of GP:

$$x \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\bar{x}, \Sigma_x), \qquad y \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\bar{y}, \Sigma_y)$$

and the product p(x)p(y) is also Gaussian.
Formulation of GP

Beginning with Bayes' theorem:

$$p(\theta \mid x, y) \propto p(y \mid x, w)\,p(w)$$

and completing the square into the form $(w - \bar{w})^T \Sigma^{-1} (w - \bar{w})$:

$$p(\theta \mid D) \propto \exp\!\left(-\frac{1}{2\sigma^2}(y - X^T w)^T(y - X^T w)\right)\exp\!\left(-\frac{1}{2}\, w^T \Sigma_w^{-1} w\right)$$

$$\propto \exp\!\left(-\frac{1}{2}(w - \bar{w})^T\left(\frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}\right)(w - \bar{w})\right)$$

where

$$\bar{w} = \frac{1}{\sigma^2}\left(\frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}\right)^{-1} X y$$

and $X \in \mathbb{R}^{d \times n}$ holds the training inputs as columns (the convention of Rasmussen & Williams).
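A small sketch (synthetic data, assuming the X ∈ R^{d×n} convention above) computing the posterior mean w̄ from the formula:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma2 = 2, 50, 0.01
Sigma_w = np.eye(d)  # prior covariance of w

X = rng.normal(size=(d, n))  # training inputs as columns, X in R^{d x n}
w_true = np.array([0.8, -0.5])
y = X.T @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

# Posterior mean: w_bar = (1/sigma^2) (X X^T / sigma^2 + Sigma_w^{-1})^{-1} X y
A = X @ X.T / sigma2 + np.linalg.inv(Sigma_w)
w_bar = np.linalg.solve(A, X @ y) / sigma2
print(w_bar)  # close to w_true = [0.8, -0.5]
```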
Notice

1 Expandable to kernelization:
   1 map x onto a feature space (i.e., a high-dimensional space): x → φ(x)
   2 inner products of feature vectors appear (i.e., φᵀφ)
2 Solvable analytically:
   1 similarity calculation between all training samples x and the test sample x_new
   2 Gram matrix calculation, then only a matrix inverse to compute
3 Easy to implement (e.g., using a library to obtain an inverse matrix)

$$f(x_{\mathrm{new}}) = \sum_i \alpha_i\, k(x_{\mathrm{new}}, x_i), \qquad \text{where } \alpha = (K + \sigma^2 I)^{-1} y$$
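A minimal sketch of that prediction rule (assuming an RBF kernel and synthetic 1-D data; not the slides' code):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # RBF kernel k(a, b) = exp(-gamma * ||a - b||^2), pairwise over rows.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(30, 1))  # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
sigma2 = 0.01

# alpha = (K + sigma^2 I)^{-1} y, with K the Gram matrix on the training set
K = rbf(X, X)
alpha = np.linalg.solve(K + sigma2 * np.eye(len(X)), y)

# f(x_new) = sum_i alpha_i k(x_new, x_i)
x_new = np.array([[0.5]])
f_new = rbf(x_new, X) @ alpha
print(f_new)  # close to sin(0.5), roughly 0.48
```

Note the only expensive step is the linear solve against K + σ²I, matching the "solvable analytically" point above.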
MAP Estimation Summary

Advantages of MAP estimation:
able to find the global optimum if we choose a convex loss function
easy to understand, and it offers another interpretation of the SVM
some models (e.g., GP) are solvable analytically
expandability:
1 we can change p(y|x, θ) into various distributions
2 easy to convert a supervised model into a semi-supervised (SSL) one using the p(x|θ) term
3 modifiable for sequential labeling (e.g., log-linear model to Conditional Random Field)

*GP for ML is freely downloadable from
http://www.gaussianprocess.org/gpml/chapters/
Further and Other Topics

Relationships:
1 Bayes estimation: finds the posterior as a function of θ (but with no guarantee of a global solution)
2 maximum (log) likelihood (e.g., EM for GMM and HMM)
3 naive Bayes: p(θ) ∼ Dirichlet and p(y|x, θ) ∼ multinomial

Semi-supervised extensions of the MAP estimate case:
1 entropy regularization for logistic regression (NIPS 2005)
2 null category noise model for the Gaussian process (NIPS 2005)
Bibliography

1 S. Akaho, "Kernel Multivariate Analysis", Iwanami, 2009
2 H. Takamura and M. Okumura, "Introduction to Machine Learning for Natural Language Processing", Corona, 2010
3 X. Zhu, "Introduction to Semi-Supervised Learning", Morgan & Claypool Publishers, 2009
4 X. Zhu, "Semi-Supervised Learning Literature Survey", 2008
5 C. Rasmussen and C. Williams, "Gaussian Processes for Machine Learning", MIT Press, 2006