8803-09-lec16.pdf
1. Introduction Variational Inference Mixture of Gaussians Exponential Family Expectation Propagation Summary
Approximate Inference
Henrik I. Christensen
Robotics & Intelligent Machines @ GT
Georgia Institute of Technology,
Atlanta, GA 30332-0280
hic@cc.gatech.edu
Henrik I. Christensen (RIM@GT) Approximate Inference 1 / 36
Outline
1 Introduction
2 Variational Inference
3 Variational Mixture of Gaussians
4 Exponential Family
5 Expectation Propagation
6 Summary
Introduction
We are often required to estimate a (conditional) posterior of the form
p(Z|X)
The solution might be intractable:
1 There might not be a closed-form solution
2 The integration over X or a parameter space θ might be computationally challenging
3 The set of possible outcomes might be large/exponential
Two strategies
1 Deterministic Approximation Methods
2 Stochastic Sampling (Monte Carlo Techniques)
Today we will talk about deterministic techniques
Variational Inference
In general we have a Bayesian model as seen earlier, i.e.
ln p(X) = ln p(X, Z) − ln p(Z|X)
We can rewrite this as
ln p(X) = L(q) + KL(q||p)
where
L(q) = ∫ q(Z) ln{ p(X, Z) / q(Z) } dZ
KL(q||p) = − ∫ q(Z) ln{ p(Z|X) / q(Z) } dZ
So L(q) is a lower bound on ln p(X) built from the joint distribution, and KL(q||p) ≥ 0 is the Kullback-Leibler divergence between q(Z) and the posterior p(Z|X).
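As a quick sanity check (not from the slides), the decomposition can be verified numerically for a discrete latent variable; the joint table below is made up purely for illustration:

```python
import numpy as np

# Toy joint p(x, Z) over a discrete latent Z with 3 states,
# for one fixed observation x (values are illustrative).
p_joint = np.array([0.2, 0.1, 0.3])        # p(x, Z=k)
p_x = p_joint.sum()                        # p(x)
p_post = p_joint / p_x                     # p(Z|x)

q = np.array([0.5, 0.25, 0.25])            # any valid q(Z)

L = np.sum(q * np.log(p_joint / q))        # lower bound L(q)
KL = -np.sum(q * np.log(p_post / q))       # KL(q || p(Z|x)) >= 0

assert np.isclose(L + KL, np.log(p_x))     # decomposition holds exactly
```

The identity holds for any choice of q, which is why maximizing L(q) is equivalent to minimizing the KL term.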
Factorized Distributions
Assume for now that we can factorize Z into disjoint groups so that
q(Z) = ∏_{i=1}^{M} q_i(Z_i)
In physics a similar model has been adopted, termed mean field theory
We can then optimize L(q) through a component-wise optimization
L(q) = ∫ ∏_i q_i { ln p(X, Z) − ∑_j ln q_j } dZ
     = ∫ q_j ln p̃(X, Z_j) dZ_j − ∫ q_j ln q_j dZ_j + const
where
ln p̃(X, Z_j) = E_{i≠j}[ln p(X, Z)] + c = ∫ ln p(X, Z) ∏_{i≠j} q_i dZ_i + c
Factorized distributions
The optimal solution is now
ln q*_j(Z_j) = E_{i≠j}[ln p(X, Z)] + const
i.e. each factor is set to the expected log joint under all the other factors, which maximizes L(q) with respect to q_j
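A minimal sketch of this update (illustrative, not from the slides; the joint table `log_p` is made up): coordinate ascent over two discrete factors, where each step applies ln q*_j = E_{i≠j}[ln p] + const and can only increase L(q):

```python
import numpy as np

rng = np.random.default_rng(0)
# Unnormalized joint p(X, Z1, Z2) for one fixed X: a 3x4 table over (z1, z2).
log_p = np.log(rng.random((3, 4)) + 0.1)

q1 = np.full(3, 1 / 3)                   # initial factors q1(z1), q2(z2)
q2 = np.full(4, 1 / 4)

def elbo(q1, q2, log_p):
    q = np.outer(q1, q2)                 # factorized q(z1, z2)
    return np.sum(q * (log_p - np.log(q)))

bounds = []
for _ in range(20):
    # ln q1*(z1) = E_{q2}[ln p(X, Z)] + const, then normalize
    log_q1 = log_p @ q2
    q1 = np.exp(log_q1 - log_q1.max()); q1 /= q1.sum()
    # ln q2*(z2) = E_{q1}[ln p(X, Z)] + const, then normalize
    log_q2 = q1 @ log_p
    q2 = np.exp(log_q2 - log_q2.max()); q2 /= q2.sum()
    bounds.append(elbo(q1, q2, log_p))

# Each coordinate update can only increase the bound L(q).
assert all(b2 >= b1 - 1e-9 for b1, b2 in zip(bounds, bounds[1:]))
```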
Variational Mixture of Gaussians
We encounter mixtures of Gaussians all the time
Examples are multi-wall modelling, ambiguous localization, ...
We have:
a set of observed data X,
a set of latent variables, Z that describe the mixture
Mixture of Gaussians - Modelling
We can model the mixture assignments as
p(Z|π) = ∏_{n=1}^{N} ∏_{k=1}^{K} π_k^{z_nk}
We can also derive the observed conditional
p(X|Z, µ, Λ) = ∏_{n=1}^{N} ∏_{k=1}^{K} N(x_n|µ_k, Λ_k^{-1})^{z_nk}
We will for now assume that the mixture weights follow a Dirichlet prior
p(π) = Dir(π|α_0) = C(α_0) ∏_{k=1}^{K} π_k^{α_0 − 1}
Mixture of Gaussians - Modelling
The component parameters can be modelled as a Gaussian-Wishart
p(µ, Λ) = p(µ|Λ)p(Λ) = ∏_{k=1}^{K} N(µ_k|m_0, (β_0 Λ_k)^{-1}) W(Λ_k|W_0, ν_0)
i.e. a total model with the graphical structure: observations x_n and latent assignments z_n in a plate over N, governed by the parameters π, µ, and Λ
Mixtures of Gaussians - Variational
The full joint model can be decomposed as
p(X, Z, π, µ, Λ) = p(X|Z, µ, Λ) p(Z|π) p(π) p(µ|Λ) p(Λ)
Only X is observed
We can now consider the selection of a variational distribution
q(Z, π, µ, Λ) = q(Z) q(π, µ, Λ)
this is clearly an assumption of independence.
We can use the general result of component-wise optimization
ln q*(Z) = E_{π,µ,Λ}[ln p(X, Z, π, µ, Λ)] + const
Decomposition gives us
ln q*(Z) = E_π[ln p(Z|π)] + E_{µ,Λ}[ln p(X|Z, µ, Λ)] + const
         = ∑_{n=1}^{N} ∑_{k=1}^{K} z_nk ln ρ_nk + const
Mixtures of Gaussians - Variational
We can further derive
ln ρ_nk = E[ln π_k] + (1/2) E[ln |Λ_k|] − (D/2) ln 2π − (1/2) E_{µ_k,Λ_k}[(x_n − µ_k)^T Λ_k (x_n − µ_k)] + c
Taking the exponential we have
q*(Z) ∝ ∏_{k=1}^{K} ∏_{n=1}^{N} ρ_nk^{z_nk}
Using normalization we arrive at
q*(Z) = ∏_{k=1}^{K} ∏_{n=1}^{N} r_nk^{z_nk}
where
r_nk = ρ_nk / ∑_j ρ_nj
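In practice ρ_nk is computed in log space, so the normalization above is a row-wise softmax; a small sketch (function name and numbers are mine, for illustration):

```python
import numpy as np

def responsibilities(log_rho):
    """r_nk = rho_nk / sum_j rho_nj, computed stably in log space."""
    log_rho = log_rho - log_rho.max(axis=1, keepdims=True)  # avoid overflow
    rho = np.exp(log_rho)
    return rho / rho.sum(axis=1, keepdims=True)

log_rho = np.array([[-1.0, -2.0, -3.0],
                    [-0.5, -0.5, -5.0]])
r = responsibilities(log_rho)
assert np.allclose(r.sum(axis=1), 1.0)   # each row sums to one
```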
Mixtures of Gaussians - Variational
Just as we saw for EM we can define
N_k = ∑_{n=1}^{N} r_nk
x̄_k = (1/N_k) ∑_{n=1}^{N} r_nk x_n
S_k = (1/N_k) ∑_{n=1}^{N} r_nk (x_n − x̄_k)(x_n − x̄_k)^T
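These statistics can be computed directly from the responsibility matrix; a sketch, with made-up data just for the check:

```python
import numpy as np

def mixture_statistics(X, R):
    """N_k, xbar_k, S_k from data X (N x D) and responsibilities R (N x K)."""
    Nk = R.sum(axis=0)                       # (K,) effective counts
    xbar = (R.T @ X) / Nk[:, None]           # (K, D) weighted means
    K, D = len(Nk), X.shape[1]
    S = np.zeros((K, D, D))
    for k in range(K):
        diff = X - xbar[k]                   # (N, D) centered data
        S[k] = (R[:, k, None] * diff).T @ diff / Nk[k]
    return Nk, xbar, S

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
R = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # hard assignments
Nk, xbar, S = mixture_statistics(X, R)
assert np.allclose(Nk, [2.0, 1.0])
assert np.allclose(xbar[0], [1.0, 0.0])
```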
Mixtures of Gaussians - Parameters/Mixture
Let's now consider q(π, µ, Λ) to arrive at
ln q*(π, µ, Λ) = ln p(π) + ∑_{k=1}^{K} ln p(µ_k, Λ_k) + E_Z[ln p(Z|π)] + ∑_{k=1}^{K} ∑_{n=1}^{N} E[z_nk] ln N(x_n|µ_k, Λ_k^{-1}) + c
We can partition the problem into
q(π, µ, Λ) = q(π) ∏_{k=1}^{K} q(µ_k, Λ_k)
We can derive
ln q*(π) = (α_0 − 1) ∑_{k=1}^{K} ln π_k + ∑_{k=1}^{K} ∑_{n=1}^{N} r_nk ln π_k + c
so that
q*(π) = Dir(π|α)   where   α_k = α_0 + N_k
Mixtures of Gaussians - Parameters/Mixture
We can then derive
q*(µ_k, Λ_k) = N(µ_k|m_k, (β_k Λ_k)^{-1}) W(Λ_k|W_k, ν_k)
where
β_k = β_0 + N_k
m_k = (1/β_k)(β_0 m_0 + N_k x̄_k)
W_k^{-1} = W_0^{-1} + N_k S_k + (β_0 N_k / (β_0 + N_k)) (x̄_k − m_0)(x̄_k − m_0)^T
ν_k = ν_0 + N_k + 1
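A sketch of these four updates for a single component k (function name and the example values are mine, purely illustrative):

```python
import numpy as np

def gauss_wishart_update(Nk, xbar_k, S_k, beta0, m0, W0_inv, nu0):
    """Gaussian-Wishart hyperparameter updates for one mixture component."""
    beta_k = beta0 + Nk
    m_k = (beta0 * m0 + Nk * xbar_k) / beta_k
    d = xbar_k - m0
    W_k_inv = W0_inv + Nk * S_k + (beta0 * Nk / (beta0 + Nk)) * np.outer(d, d)
    nu_k = nu0 + Nk + 1                  # as stated on the slide
    return beta_k, m_k, W_k_inv, nu_k

# Made-up sufficient statistics and prior for a 2-D component:
beta_k, m_k, W_k_inv, nu_k = gauss_wishart_update(
    Nk=4.0, xbar_k=np.array([2.0, 0.0]), S_k=0.5 * np.eye(2),
    beta0=1.0, m0=np.zeros(2), W0_inv=np.eye(2), nu0=2.0)
assert beta_k == 5.0 and np.allclose(m_k, [1.6, 0.0])
```

Note how m_k interpolates between the prior mean m_0 and the weighted sample mean x̄_k, with weights β_0 and N_k.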
Mixtures of Gaussians - Parameters
We can now evaluate the required expectations
E_{µ_k,Λ_k}[(x_n − µ_k)^T Λ_k (x_n − µ_k)] = D β_k^{-1} + ν_k (x_n − m_k)^T W_k (x_n − m_k)
ln Λ̃_k = E[ln |Λ_k|] = ∑_{i=1}^{D} ψ((ν_k + 1 − i)/2) + D ln 2 + ln |W_k|
ln π̃_k = E[ln π_k] = ψ(α_k) − ψ(α̂),   with α̂ = ∑_k α_k
Here ψ(·) is defined as d/da ln Γ(a), also known as the digamma function. The last two results are given by the Gaussian-Wishart distribution.
Mixtures of Gaussians - Parameters
We can finally find the responsibilities
r_nk ∝ π_k |Λ_k|^{1/2} exp{ −(1/2)(x_n − µ_k)^T Λ_k (x_n − µ_k) }
evaluated using the expectations from the previous slide
The optimization is stepwise
1 Estimate µ, Λ and then rnk
2 Estimate π and Z
3 Check for convergence - return to 1 if not converged
Mixture of Gaussians - Example
[Figure: variational mixture-of-Gaussians fit after 0, 15, 60, and 120 iterations]
MoG - Variational Lower Bound
We can evaluate the lower bound to monitor the fit
L = E[ln p(X|Z, µ, Λ)] + E[ln p(Z|π)] + E[ln p(π)] + E[ln p(µ, Λ)] − E[ln q(Z)] − E[ln q(π)] − E[ln q(µ, Λ)]
E[ln p(X|Z, µ, Λ)] = (1/2) ∑_k N_k { ln Λ̃_k − D β_k^{-1} − ν_k Tr(S_k W_k) − ν_k (x̄_k − m_k)^T W_k (x̄_k − m_k) − D ln 2π }
E[ln p(Z|π)] = ∑_n ∑_k r_nk ln π̃_k
E[ln p(π)] = ln C(α_0) + (α_0 − 1) ∑_k ln π̃_k
... (remaining terms in the book)
Exponential Family Distribution
Recall from the 3rd lecture:
Exponential family
p(x|η) = h(x) g(η) exp{ η^T u(x) }
where η represents the "natural parameters"
g(η) is the normalization "factor"
u(x) is some general function of the data
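As a concrete instance (my example, not from the slides): the Bernoulli distribution fits this form with u(x) = x, h(x) = 1, η = ln(µ/(1−µ)) and g(η) = 1/(1 + e^η):

```python
import numpy as np

def bernoulli_exp_family(x, eta):
    """p(x|eta) = h(x) g(eta) exp(eta * u(x)) with u(x) = x, h(x) = 1,
    g(eta) = 1 / (1 + exp(eta)), eta = ln(mu / (1 - mu))."""
    g = 1.0 / (1.0 + np.exp(eta))        # normalization factor g(eta)
    return g * np.exp(eta * x)

mu = 0.7
eta = np.log(mu / (1 - mu))              # natural parameter (log-odds)
assert np.isclose(bernoulli_exp_family(1, eta), mu)
assert np.isclose(bernoulli_exp_family(0, eta), 1 - mu)
```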
Exponential Family Distribution
The joint distribution for observed and latent variables is then
p(X, Z|η) = ∏_{n=1}^{N} h(x_n, z_n) g(η) exp{ η^T u(x_n, z_n) }
The conjugate prior for η is then
p(η|ν_0, χ_0) = f(ν_0, χ_0) g(η)^{ν_0} exp{ ν_0 η^T χ_0 }
where ν_0 is the prior number of observations and χ_0 the prior sufficient statistics (moments)
Exponential Family Distribution - Variational
As before we can compute
ln q*(Z) = E_η[ln p(X, Z|η)] + const
         = ∑_n { ln h(x_n, z_n) + E[η^T] u(x_n, z_n) } + const
i.e. a sum of independent terms
Taking the exponential on both sides we have
q*(z_n) = h(x_n, z_n) g(E[η]) exp{ E[η^T] u(x_n, z_n) }
Exponential Family Distribution - Variational
Similarly the natural parameters can be optimized by
ln q*(η) = ln p(η|ν_0, χ_0) + E_Z[ln p(X, Z|η)] + const
which expands to
ln q*(η) = ν_0 ln g(η) + ν_0 η^T χ_0 + ∑_n { ln g(η) + η^T E_{z_n}[u(x_n, z_n)] } + const
Using the trick of exponentials on both sides we have
q*(η) = f(ν_N, χ_N) g(η)^{ν_N} exp{ η^T χ_N }
where
ν_N = ν_0 + N,   χ_N = ν_0 χ_0 + ∑_n E_{z_n}[u(x_n, z_n)]
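For a fully observed Bernoulli model (my illustrative special case: with no latent variables, E_{z_n}[u(x_n, z_n)] reduces to x_n) the update is plain pseudo-count accumulation:

```python
import numpy as np

# Fully observed Bernoulli: u(x_n) = x_n, so the conjugate update is just
# nu_N = nu_0 + N and chi_N = nu_0 * chi_0 + sum_n x_n.
x = np.array([1, 0, 1, 1, 0, 1])          # N = 6 observations, 4 successes
nu0, chi0 = 2.0, 0.5                      # prior: 2 pseudo-observations, mean 0.5
nu_N = nu0 + len(x)                       # 8.0
chi_N = nu0 * chi0 + x.sum()              # 1.0 + 4 = 5.0
posterior_mean = chi_N / nu_N             # 0.625, shrunk toward the prior mean
assert nu_N == 8.0 and np.isclose(posterior_mean, 0.625)
```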
Exponential Family Distribution - Variational
As expected the solution is iterative
q*(z_n) and q*(η) are coupled.
In the E step compute the sufficient statistics E[u(x_n, z_n)] and use them to update q*(η)
In the M step use the resulting E[η^T] to update q*(z_n)
Expectation Propagation
Fundamentally we are trying to match distributions to the data and match up the natural parameters, i.e. find the "best" family of distributions and at the same time fit its parameters.
In the end we are trying to minimize the Kullback-Leibler (KL) divergence with respect to q(z)
Consider for a minute KL(p||q) where p(z) is fixed and q(z) is a member of the exponential family
q(z) = h(z) g(η) exp{ η^T u(z) }
Expectation Propagation - Optimization
The Kullback-Leibler divergence is then
KL(p||q) = − ln g(η) − η^T E_{p(z)}[u(z)] + const
The extremum is then given by
−∇ ln g(η) = E_{p(z)}[u(z)]
i.e. the best estimate is to match q(z) to p(z) by setting the "natural parameters" so that the sufficient statistics agree (moment matching).
E.g. q(z) = N(z|µ, Σ) with µ, Σ set to the mean and covariance of p(z)
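A numerical illustration (the mixture parameters are made up): minimizing KL(p||q) for a Gaussian q over a fixed 1-D mixture p reduces to matching the first two moments of p:

```python
import numpy as np

# Fixed p(z): a two-component 1-D Gaussian mixture; q(z) = N(mu, sigma^2).
# Minimizing KL(p||q) over (mu, sigma) just matches E_p[z] and Var_p[z].
z = np.linspace(-20, 20, 200001)
dz = z[1] - z[0]

def normal(z, m, s):
    return np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

p = 0.3 * normal(z, -2.0, 1.0) + 0.7 * normal(z, 3.0, 2.0)

mu = np.sum(z * p) * dz                    # E_p[z] = 0.3*(-2) + 0.7*3 = 1.5
var = np.sum((z - mu) ** 2 * p) * dz       # matched second central moment

assert np.isclose(mu, 1.5, atol=1e-6)
# Mixture variance: sum_k w_k (s_k^2 + m_k^2) - mu^2 = 8.35
assert np.isclose(var, 0.3 * (1 + 4) + 0.7 * (4 + 9) - 1.5 ** 2, atol=1e-5)
```

Note the matched Gaussian covers both modes of p, the characteristic behavior of minimizing KL(p||q) rather than KL(q||p).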
Expectation Propagation - Modelling
Consider a model with factorized probabilities
p(D, θ) = ∏_i f_i(θ)
where f_i(θ) = p(x_i|θ), and there might be a prior f_0(θ) = p(θ).
The posterior is then
p(θ|D) = (1/p(D)) ∏_i f_i(θ)
The model evidence is given by
p(D) = ∫ ∏_i f_i(θ) dθ
Expectation Propagation - Computing
The approximation is then
q(θ) = (1/Z) ∏_i f̃_i(θ)
q(θ) is factorized so that each term f̃_i(θ) can be optimized in turn
Through optimization factor-by-factor it is possible to generate an estimate - take one factor out, optimize it, and put it back
Expectation Propagation - Algorithm
Initialize the factor approximations f̃_i(θ)
Initialize the posterior estimate q(θ) ∝ ∏_i f̃_i(θ)
Iterate:
1 Choose a factor f̃_j(θ) to refine
2 Remove f̃_j(θ) from the posterior to form the cavity q^{\j}(θ) = q(θ) / f̃_j(θ)
3 Evaluate the new posterior / sufficient statistics (moment matching)
4 Update the factor f̃_j(θ)
5 Evaluate the approximation
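The loop above can be sketched for a degenerate but instructive case (my construction, not from the slides): every true factor is already Gaussian, so the moment-matching step is exact and a single pass recovers the exact posterior. Working in natural parameters (precision, precision×mean) makes the cavity step a subtraction:

```python
import numpy as np

# EP sketch where each factor f_i(theta) = N(obs_i | theta, 1) is Gaussian,
# so matching the moments of the tilted distribution is exact.
obs = np.array([1.0, 2.0, 0.5])
# Site approximations as (precision, precision*mean); site 0 is the prior
# N(theta | 0, 100), the likelihood sites start flat.
f_site = [(0.01, 0.0)] + [(0.0, 0.0)] * len(obs)

def q_params(sites):
    """Posterior q is the product of all sites: natural parameters add."""
    return sum(s[0] for s in sites), sum(s[1] for s in sites)

for j, x in enumerate(obs, start=1):
    lam_q, h_q = q_params(f_site)
    cav = (lam_q - f_site[j][0], h_q - f_site[j][1])   # remove site j (cavity)
    # Tilted = cavity * true factor; both Gaussian here, so match exactly:
    tilted = (cav[0] + 1.0, cav[1] + x)                # add precision 1, mean x
    f_site[j] = (tilted[0] - cav[0], tilted[1] - cav[1])  # refined site

lam, h = q_params(f_site)
# Exact posterior: precision 0.01 + 3, mean sum(obs) / 3.01
assert np.isclose(lam, 3.01) and np.isclose(h / lam, 3.5 / 3.01)
```

With non-Gaussian factors (as in the clutter example that follows) the tilted moments must be computed by integration, and several passes are needed.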
Expectation Propagation - Example
[Figure: observations x generated around θ together with background clutter]
p(x|θ) = (1 − w) N(x|θ, I) + w N(x|0, aI)
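A 1-D version of this clutter likelihood is easy to evaluate (the values of w and a are my choices; the slide's I and aI become scalar variances):

```python
import numpy as np

def clutter_density(x, theta, w=0.2, a=10.0):
    """1-D version of p(x|theta) = (1-w) N(x|theta, 1) + w N(x|0, a)."""
    def normal(x, m, var):
        return np.exp(-0.5 * (x - m) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return (1 - w) * normal(x, theta, 1.0) + w * normal(x, 0.0, a)

x = np.linspace(-30, 30, 60001)
p = clutter_density(x, theta=2.0)
# Sanity check: the mixture is a valid density.
assert np.isclose(p.sum() * (x[1] - x[0]), 1.0, atol=1e-6)
```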
Expectation Propagation - Example
[Figure: the resulting approximation of the posterior over θ, shown in two panels]
Summary
Computation of the complete model is often a challenge
Two ways to approximate computations
Deterministic Approximations
Sampling Based Methods
Many tricks for approximation
Factorization is typically a first strategy
Iterative optimization of factors
Next time we will talk about sampling based methods