Beginning with a review of Bayes' theorem and the chain rule, these slides explain MAP (maximum a posteriori) estimation.
Within the MAP estimation framework we can describe many well-known models: naive Bayes, regularized ridge regression, logistic regression, the log-linear model, and the Gaussian process.
MAP estimation is a powerful framework for understanding the above models from a Bayesian point of view, and it opens the possibility of extending them to semi-supervised models.
MAP Estimate and its Periphery
kzky
2011/4/24
Outline
1 Bayes Theorem
   Two Views of Bayes Theorem
   Chain Rule
2 MAP Estimate
   Introduction
   Ridge Regression
   Logistic Regression
   Log Linear Model
   Loss Function
   Gaussian Process
3 Summary
   MAP Estimation Summary
   Further and Other Topics
4 Bibliography
   Bibliography
Two Views of Bayes Theorem

Bayes' theorem, beginning with the joint distribution $p(x, y) = p(x)\,p(y \mid x)$:

$$p(x \mid y) = \frac{p(x, y)}{p(y)} = \frac{p(x)\,p(y \mid x)}{p(y)} = \frac{p(x)\,p(y \mid x)}{\sum_x p(x)\,p(y \mid x)}$$
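To make this concrete, here is a minimal sketch (with made-up numbers, not from the slides) that computes a posterior over a discrete x by normalizing prior × likelihood:

```python
import numpy as np

# Hypothetical discrete example: prior p(x) over two states,
# likelihood p(y|x) for one observed y.
p_x = np.array([0.3, 0.7])          # prior p(x)
p_y_given_x = np.array([0.9, 0.2])  # likelihood p(y|x) for the observed y

# Bayes' theorem: p(x|y) = p(x) p(y|x) / sum_x p(x) p(y|x)
joint = p_x * p_y_given_x            # joint p(x, y) = p(x) p(y|x)
p_y = joint.sum()                    # marginal p(y)
posterior = joint / p_y              # posterior p(x|y)

print(posterior)  # -> [0.6585... 0.3414...]
```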
Setting for MAP Estimate

Setting in the usual supervised learning, with assumptions:

$$D = \{(x_i, y_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P(x, y)$$

where $x \in \mathbb{R}^d$, $y \in \{1, -1\}$, and $P$ is the unknown joint distribution.

With Bayes' theorem and the assumption that $x$ does not depend on $\theta$:

$$p(\theta \mid x, y) = \frac{p(\theta)\,p(x, y \mid \theta)}{p(x, y)} = \frac{p(\theta)\,p(y \mid x, \theta)\,p(x \mid \theta)}{p(y \mid x)\,p(x)} = \frac{p(\theta)\,p(y \mid x, \theta)}{p(y \mid x)}$$

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{marginal likelihood}}$$
MAP Estimate

Maximum a posteriori estimate:
maximize p(θ|D) with respect to θ
basically, take the log: we can maximize the log posterior with respect to θ without loss of generality, because log is monotonic.
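A quick toy sketch (an arbitrary unnormalized density, purely illustrative) confirming that the maximizer of p and of log p coincide:

```python
import numpy as np

# Toy unnormalized posterior over a grid of theta values.
theta = np.linspace(-3, 3, 601)
p = np.exp(-(theta - 1.0) ** 2)  # proportional to a Gaussian centered at 1

# Because log is strictly increasing, argmax p = argmax log p.
assert np.argmax(p) == np.argmax(np.log(p))
print(theta[np.argmax(p)])  # -> 1.0
```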
Formulation on MAP Estimate

$$\begin{aligned}
\max_\theta \log p(\theta \mid D)
&= \max_\theta \Big( \log\big(p(\theta)\,p(D \mid \theta)\big) - \log p(D) \Big) \\
&= \max_\theta \Big( \log p(\theta) + \log \prod_i p(x_i, y_i \mid \theta) \Big) \\
&= \max_\theta \Big( \log p(\theta) + \sum_i \log p(y_i \mid x_i, \theta) + \sum_i \log p(x_i \mid \theta) \Big) \\
&= \max_\theta \Big( \log p(\theta) + \sum_i \log p(y_i \mid x_i, \theta) \Big)
\end{aligned}$$

where $\log p(D)$ is dropped because it does not depend on $\theta$, and $\sum_i \log p(x_i \mid \theta)$ is dropped by the assumption that $x$ does not depend on $\theta$.
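A minimal numeric sketch of the same decomposition (a hypothetical Beta-Bernoulli coin-flip setup, not from the slides): the MAP estimate maximizes log prior + log likelihood, with the constant log p(D) ignored:

```python
import numpy as np

# Hypothetical example: Bernoulli likelihood with a Beta(2, 2) prior.
a, b = 2.0, 2.0
flips = np.array([1, 1, 0, 1, 1, 0, 1])  # made-up data D

theta = np.linspace(1e-6, 1 - 1e-6, 10001)
log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
log_lik = flips.sum() * np.log(theta) + (len(flips) - flips.sum()) * np.log(1 - theta)

# MAP estimate: argmax of log prior + log likelihood (log p(D) is constant).
theta_map = theta[np.argmax(log_prior + log_lik)]

# Closed form for Beta-Bernoulli MAP: (a - 1 + k) / (a + b - 2 + n)
k, n = flips.sum(), len(flips)
print(theta_map, (a - 1 + k) / (a + b - 2 + n))  # both approx. 0.667
```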
Formulation only on the Regularization Term

Assumption on the prior:

$$w \equiv \theta \sim \mathcal{N}\!\left(0, \frac{I}{2\lambda}\right)$$

$$p(w) = \frac{1}{(2\pi)^{d/2}\left|\frac{I}{2\lambda}\right|^{1/2}} \exp\!\left(-\frac{1}{2}\, w^T \left(\frac{I}{2\lambda}\right)^{-1} w\right) = \frac{1}{(2\pi)^{d/2}\left|\frac{I}{2\lambda}\right|^{1/2}} \exp\!\left(-\lambda \|w\|_2^2\right)$$

$$\max_\theta \log p(\theta \mid D) = \max_w \Big( -\lambda \|w\|_2^2 + \sum_i \log p(y_i \mid x_i, w) \Big)$$
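A small sketch (assuming SciPy, with arbitrary d, λ, and w) checking that the negative log of this Gaussian prior is λ‖w‖² plus a constant:

```python
import numpy as np
from scipy.stats import multivariate_normal

d, lam = 3, 0.5
w = np.array([0.4, -1.2, 2.0])

# Prior N(0, I / (2*lambda)) as on the slide.
prior = multivariate_normal(mean=np.zeros(d), cov=np.eye(d) / (2 * lam))

# -log p(w) should equal lambda * ||w||^2 up to an additive constant.
const = -prior.logpdf(np.zeros(d))  # the normalization term
print(-prior.logpdf(w) - const, lam * w @ w)  # both the same value (2.8)
```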
Logistic Regression

Assumption on p(y|x, w):

$$p(y \mid x, w) = \frac{1}{1 + \exp(-y f(x))}$$

The MAP estimate becomes:

$$\max_w \Big( -\lambda \|w\|_2^2 + \sum_i \log \frac{1}{1 + \exp(-y_i f(x_i))} \Big) = \min_w \Big( \lambda \|w\|_2^2 + \sum_i \log\big(1 + \exp(-y_i f(x_i))\big) \Big)$$
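A minimal sketch of this objective (assuming f(x) = wᵀx, synthetic data, and SciPy's L-BFGS; not the slides' implementation):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # synthetic inputs
y = np.sign(X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=100))
lam = 0.1

def objective(w):
    # min_w  lambda * ||w||^2 + sum_i log(1 + exp(-y_i * w^T x_i))
    margins = y * (X @ w)
    return lam * w @ w + np.sum(np.logaddexp(0.0, -margins))

w_map = minimize(objective, x0=np.zeros(2), method="L-BFGS-B").x
print(w_map)  # roughly proportional to the true weights [1.5, -2.0]
```

The np.logaddexp(0, -m) form computes log(1 + exp(-m)) without overflow for large negative margins.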
Log Linear Model

Assumption on p(y|x, w):

$$p(y \mid x, w) = \frac{1}{Z_{x,w}} \exp\!\left(w^T \phi(x, y)\right)$$

where $Z_{x,w}$ normalizes $\exp(w^T \phi(x, y))$ with respect to $y$.

The MAP estimate becomes:

$$\max_w \Big( -\lambda \|w\|_2^2 + \sum_i \big( w^T \phi(x_i, y_i) - \ln Z_{x_i,w} \big) \Big)$$
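A toy sketch (hypothetical feature map and a small label set, purely illustrative) that evaluates this objective, computing Z_{x,w} by explicit summation over y:

```python
import numpy as np

labels = [0, 1, 2]  # small discrete label set
lam = 0.1

def phi(x, y):
    # Hypothetical joint feature map: x copied into the block for label y.
    f = np.zeros(len(labels) * len(x))
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def log_p(y, x, w):
    # log p(y|x,w) = w^T phi(x,y) - log Z_{x,w}, Z summed over all labels.
    scores = np.array([w @ phi(x, yy) for yy in labels])
    return scores[y] - np.logaddexp.reduce(scores)

def objective(w, data):
    # MAP objective: -lambda ||w||^2 + sum_i log p(y_i | x_i, w)
    return -lam * w @ w + sum(log_p(y, x, w) for x, y in data)

data = [(np.array([1.0, 0.5]), 2), (np.array([-0.3, 1.1]), 0)]
print(objective(np.zeros(6), data))  # = -2 * log(3) at w = 0
```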
Loss Function

Figures: loss functions (figure not reproduced in this transcript).
Main Points of Gaussian Process

Differences from the previous discussion:
do not take the log
use no distribution other than the Gaussian

Concept of GP:

$$x \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\bar{x}, \Sigma_x), \qquad y \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\bar{y}, \Sigma_y)$$

and the product p(x)p(y) is also Gaussian.
Formulation of GP

Beginning with Bayes' theorem:

$$p(\theta \mid x, y) \propto p(y \mid x, w)\,p(w)$$

and completing the square into the form $(w - \bar{w})^T \Sigma^{-1} (w - \bar{w})$:

$$p(\theta \mid D) \propto \exp\!\left(-\frac{1}{2\sigma^2}(y - X^T w)^T(y - X^T w)\right)\exp\!\left(-\frac{1}{2}\, w^T \Sigma_w^{-1} w\right)$$

$$\propto \exp\!\left(-\frac{1}{2}(w - \bar{w})^T\left(\frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}\right)(w - \bar{w})\right)$$

where

$$\bar{w} = \frac{1}{\sigma^2}\left(\frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}\right)^{-1} X y$$

and $X \in \mathbb{R}^{d \times n}$ holds the training inputs as columns (the convention of Rasmussen & Williams).
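A small sketch (synthetic data, assuming the X ∈ R^{d×n} convention above) computing the posterior mean w̄ from the formula:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma2 = 2, 50, 0.01
Sigma_w = np.eye(d)  # prior covariance of w

X = rng.normal(size=(d, n))  # training inputs as columns, X in R^{d x n}
w_true = np.array([0.8, -0.5])
y = X.T @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

# Posterior mean: w_bar = (1/sigma^2) (X X^T / sigma^2 + Sigma_w^{-1})^{-1} X y
A = X @ X.T / sigma2 + np.linalg.inv(Sigma_w)
w_bar = np.linalg.solve(A, X @ y) / sigma2
print(w_bar)  # close to w_true = [0.8, -0.5]
```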
Notice

1 Expandable to kernelization:
   1 map x onto a feature space (i.e., a high-dimensional space): x → φ(x)
   2 inner products of feature vectors appear (i.e., φᵀφ)
2 Solvable analytically:
   1 similarity calculation between all training samples x and the test sample x_new
   2 Gram matrix calculation, then only a matrix inverse to compute
3 Easy to implement (e.g., using a library to obtain an inverse matrix)

$$f(x_{\mathrm{new}}) = \sum_i \alpha_i\, k(x_{\mathrm{new}}, x_i), \qquad \text{where } \alpha = (K + \sigma^2 I)^{-1} y$$
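A minimal sketch of that prediction rule (assuming an RBF kernel and synthetic 1-D data; not the slides' code):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # RBF kernel k(a, b) = exp(-gamma * ||a - b||^2), pairwise over rows.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(30, 1))  # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
sigma2 = 0.01

# alpha = (K + sigma^2 I)^{-1} y, with K the Gram matrix on the training set
K = rbf(X, X)
alpha = np.linalg.solve(K + sigma2 * np.eye(len(X)), y)

# f(x_new) = sum_i alpha_i k(x_new, x_i)
x_new = np.array([[0.5]])
f_new = rbf(x_new, X) @ alpha
print(f_new)  # close to sin(0.5), roughly 0.48
```

Note the only expensive step is the linear solve against K + σ²I, matching the "solvable analytically" point above.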
MAP Estimation Summary

Advantages of MAP estimation:
able to find the global optimum if we choose a convex loss function
easy to understand, and it offers another interpretation of the SVM
some models (e.g., GP) are solvable analytically
expandability:
1 we can change p(y|x, θ) into various distributions
2 easy to convert a supervised model into a semi-supervised (SSL) one using the p(x|θ) term
3 modifiable for sequential labeling (e.g., log-linear model to Conditional Random Field)

*GP for ML is freely downloadable from
http://www.gaussianprocess.org/gpml/chapters/
Further and Other Topics

Relationships:
1 Bayes estimation: finds the posterior as a function of θ (but with no guarantee of a global solution)
2 maximum (log) likelihood (e.g., EM for GMM and HMM)
3 naive Bayes: p(θ) ∼ Dirichlet and p(y|x, θ) ∼ multinomial

Semi-supervised extensions of the MAP estimate case:
1 entropy regularization for logistic regression (NIPS 2005)
2 null category noise model for the Gaussian process (NIPS 2005)
Bibliography

1 S. Akaho, "Kernel Multivariate Analysis", Iwanami, 2009
2 H. Takamura and M. Okumura, "Introduction to Machine Learning for Natural Language Processing", Corona, 2010
3 X. Zhu, "Introduction to Semi-Supervised Learning", Morgan & Claypool Publishers, 2009
4 X. Zhu, "Semi-Supervised Learning Literature Survey", 2008
5 C. Rasmussen and C. Williams, "Gaussian Processes for Machine Learning", MIT Press, 2006