This document summarizes key concepts from Chapter 10 of Bishop's PRML on approximate inference with variational methods. It introduces variational inference as a deterministic alternative to sampling-based methods such as importance sampling for approximating intractable posterior distributions. Variational inference recasts inference as optimization: the true posterior is approximated by a simpler distribution from an assumed family, chosen by maximizing a lower bound on the marginal likelihood. Mean-field variational inference further assumes a factorized form for the variational distribution.
Approximate Inference (Chapter 10, PRML Reading)
1. C.M. Bishop’s PRML
Chapter 10: Approximate Inference
Tran Quoc Hoan
@k09hthaduonght.wordpress.com/
13 December 2015, PRML Reading, Hasegawa lab., Tokyo
The University of Tokyo
3. Outline
Variational Inference 3
10.1 Variational Inference
10.2 Variational Mixture of Gaussians
10.3 Variational Linear Regression
10.4 Exponential Family Distributions
10.5 Local Variational Methods
10.6 Variational Logistic Regression
10.7 Expectation Propagation
Part I: Probabilistic modeling and the variational principle. Part II: Designing the variational algorithms.
4. Progress…
Variational Inference 4
(Outline recap: 10.1 Variational Inference, 10.2 Variational Mixture of Gaussians, 10.3 Variational Linear Regression, 10.4 Exponential Family Distributions, 10.5 Local Variational Methods, 10.6 Variational Logistic Regression, 10.7 Expectation Propagation)
5. Probabilistic Inference
10.1 Variational Inference 5
Statistical inference: any mechanism by which we deduce the probabilities in our model based on data.
In probabilistic models, we need to reason about the probabilities of events.
Inference links the observed data with our statistical assumptions and allows us to ask questions of our data: prediction, visualization, model selection.
6. Modeling and Inference
10.1 Variational Inference 6
Bayes' rule appears in many inferential problems:
p(z|x) = p(x|z) p(z) / ∫ p(x, z) dz
i.e. posterior = likelihood × prior / marginal likelihood (model evidence), where x is the observed data and z the hidden variable.
Probabilistic modeling will involve:
• Deciding on a priori beliefs.
• Positing an explanation of how the observed data are generated, i.e. providing a probabilistic description.
7. Modeling and Inference
10.1 Variational Inference 7
Most inference problems will be one of: marginalization, expectation, prediction.
They all involve the posterior
p(z|x) = p(x|z) p(z) / ∫ p(x, z) dz
(likelihood × prior / marginal likelihood, the model evidence), which typically has a complex form for which the expectations are not tractable.
8. Importance Sampling
10.1 Variational Inference 8
Basic idea: transform the integral into an expectation over a simple, known distribution q(z) (the proposal).
(Figure: curves p(z), f(z) and proposal q(z) plotted against z.)
Conditions:
• q(z) > 0 wherever f(z) p(z) ≠ 0
• q(z) is easy to sample from
E[f] = ∫ f(z) p(z) dz = ∫ f(z) (p(z) / q(z)) q(z) dz
Drawing samples z^(s) ~ q(z) gives the Monte Carlo estimate
E[f] ≈ (1/S) Σ_s w^(s) f(z^(s)),   with importance weights   w^(s) = p(z^(s)) / q(z^(s))
(Notice: the conditioning on x is suppressed in these formulas.)
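As a concrete illustration of the estimator above, here is a minimal NumPy sketch of importance sampling; the target p(z), the test function f(z) and the proposal q(z) are hypothetical choices made for this example, not anything from the slides.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Target p(z): a bimodal mixture of Gaussians; test function f(z) = z**2.
p = lambda z: 0.5 * stats.norm.pdf(z, -2, 0.7) + 0.5 * stats.norm.pdf(z, 3, 1.0)
f = lambda z: z ** 2

# Proposal q(z): a broad Gaussian that covers the support of p(z).
q = stats.norm(0.0, 4.0)

S = 100_000
z = q.rvs(size=S, random_state=rng)          # z^(s) ~ q(z)
w = p(z) / q.pdf(z)                          # importance weights w^(s) = p/q
estimate = np.mean(w * f(z))                 # (1/S) sum_s w^(s) f(z^(s))

print("importance-sampling estimate of E[f]:", estimate)
```

With a well-matched proposal the weights have low variance; in high dimensions they typically degenerate, which motivates the deterministic alternative introduced on the next slides.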
9. Importance Sampling
10.1 Variational Inference 9
(Figure: curves p(z), f(z) and proposal q(z) plotted against z.)
Properties:
• Unbiased estimate of the expectation
• Does not provide independent samples from the posterior distribution
• Many draws from the proposal are needed, especially in high dimensions
E[f] ≈ (1/S) Σ_s w^(s) f(z^(s)),   w^(s) = p(z^(s)) / q(z^(s)),   z^(s) ~ q(z)
Stochastic approximation → Chapter 11.
10. Importance Sampling
10.1 Variational Inference 10
(Figure: curves p(z), f(z) and proposal q(z) plotted against z.)
Take inspiration from importance sampling, but instead:
• Obtain a deterministic algorithm
• Scale up to high-dimensional and large-data problems
• Allow easy convergence assessment
→ Variational Inference.
11. What is a Variational Method?
10.1 Variational Inference 11
Variational Principle
General family of methods for approximating complicated
densities by a simpler class of densities
(Figure: an approximation class of densities containing the fitted approximation, alongside the true posterior; the variational parameters are fitted to bring the two close.)
Deterministic approximation procedures with bounds on probabilities of interest.
12. Variational Calculus
10.1 Variational Inference 12
Functions:
• Take variables as input; the output is a value
• Full and partial derivatives df/dx
• Ex.: maximize the likelihood p(x|µ) w.r.t. the parameter µ
Functionals:
• Take functions as input; the output is a value
• Functional derivatives δF/δf(x)
• Ex.: maximize the entropy H[p(x)] w.r.t. the distribution p(x)
Both types of derivatives are exploited in variational inference.
The variational method derives from the calculus of variations.
13. From IS to Variational Inference
10.1 Variational Inference 13
Start from the marginal likelihood (the integral) and introduce an importance weight:
ln p(X) = ln ∫ p(X|Z) p(Z) dZ = ln ∫ p(X|Z) (p(Z) / q(Z)) q(Z) dZ
Jensen's inequality,  ln ∫ p(x) g(x) dx ≥ ∫ p(x) ln g(x) dx,  then gives
ln p(X) ≥ ∫ q(Z) ln( p(X|Z) p(Z) / q(Z) ) dZ
        = ∫ q(Z) ln p(X|Z) dZ − ∫ q(Z) ln( q(Z) / p(Z) ) dZ
        = E_q(Z)[ln p(X|Z)] − KL[q(Z) || p(Z)]
This is the variational (evidence) lower bound.
14. Variational Lower Bound
10.1 Variational Inference 14
F(X, q) = E_q(Z)[ln p(X|Z)] − KL[q(Z) || p(Z)]
          (reconstruction)     (penalty)
Interpreting the bound:
• Penalty: ensures the explanation of the data, q(Z), does not deviate too far from your prior beliefs p(Z).
• Reconstruction cost: the expected log-likelihood measures how well samples from q(Z) explain the data X.
• Approximate posterior distribution q(Z): the best match to the true posterior p(Z|X), one of the unknown inferential quantities of interest to us.
15. How tight is the variational lower bound (ELBO)?
10.1 Variational Inference 15
Some comments on q:
• Variational parameters: the parameters of q(Z) (e.g. if q is Gaussian, the mean and variance).
• Integration is switched to optimization: optimize q(Z) directly (presenter's note: strictly it is q(Z|X)) so as to minimize the gap
ln p(X) − F(X, q) = ∫ q(Z) ln p(X) dZ − F(X, q)
                  = ∫ q(Z) ln p(X) dZ − ∫ q(Z) ln p(X|Z) dZ + ∫ q(Z) ln( q(Z) / p(Z) ) dZ
                  = ∫ q(Z) ln( p(X) q(Z) / ( p(X|Z) p(Z) ) ) dZ = ∫ q(Z) ln( q(Z) / p(Z|X) ) dZ
                  = KL[q(Z) || p(Z|X)]
So maximizing the bound is equivalent to minimizing KL[q(Z) || p(Z|X)].
16. From the book
10.1 Variational Inference 16
ln p(X) = L(q) + KL(q ∥ p)   (10.2)
L(q) = ∫ q(Z) ln{ p(X, Z) / q(Z) } dZ   (10.3)
KL(q ∥ p) = − ∫ q(Z) ln{ p(Z|X) / q(Z) } dZ   (10.4)
The maximum of L(q) occurs when q(Z) = p(Z|X), i.e. when the KL term vanishes; we approximate this maximum by a variational method. Note that L(q) is the bound F(X, q) from the previous slides.
Remaining questions:
• What exactly is q(Z)?
• How do we find the variational parameters?
• How do we compute the gradients?
• How do we optimize the model parameters?
17. Free-form and Fixed-form
10.1 Variational Inference 17
Free-form variational methods solve for the exact optimal distribution by setting the functional derivative to zero:
δL(q) / δq(Z) = 0   subject to   ∫ q(Z) dZ = 1
The optimal solution is the true posterior distribution, q(Z) ∝ p(Z) p(X|Z, θ), but solving for the normalization is the original intractable problem.
Fixed-form variational methods specify an explicit form of the q-distribution,
q_λ(Z) = f(Z; λ)   with variational parameter λ,
ideally a rich class of distributions.
18. 10.1.1 Factorized distributions (I)
10.1 Variational Inference 18
• Mean-field methods assume that the distribution is factorized.
Restricted class of approximations: every dimension (or subset of dimensions) of the posterior is treated as independent.
• Let Z be partitioned into disjoint groups Z_i (i = 1, ..., M):
q(Z) = ∏_{i=1}^{M} q_i(Z_i)
with no restriction on the functional form of the individual factors q_i(Z_i).
19. Factorized distributions (II)
10.1 Variational Inference 19
L(q) = ∫ q(Z) ln{ p(X, Z) / q(Z) } dZ
     = ∫ q(Z) { ln p(X, Z) − ln q(Z) } dZ
     = ∫ ( ∏_i q_i(Z_i) ) { ln p(X, Z) − Σ_i ln q_i(Z_i) } dZ
     = ∫ ( ∏_i q_i ) ln p(X, Z) dZ − ∫ ( ∏_i q_i ) ( Σ_i ln q_i ) dZ
20. Factorized distributions (III)
10.1 Variational Inference 20
• Consider the dependence on a single factor q_j(Z_j), keeping the others fixed:
L(q) = ∫ ( ∏_i q_i ) ln p(X, Z) dZ − ∫ ( ∏_i q_i ) ( Σ_i ln q_i ) dZ
     = ∫ q_j { ∫ ln p(X, Z) ∏_{i≠j} q_i dZ_i } dZ_j − ∫ ( ∏_i q_i ) { ln q_j + Σ_{i≠j} ln q_i } dZ
     = ∫ q_j { ∫ ln p(X, Z) ∏_{i≠j} q_i dZ_i } dZ_j − ∫ ( ∏_i q_i ) ln q_j dZ
       − ∫ ( ∏_{i≠j} q_i ) ( Σ_{i≠j} ln q_i ) ( ∫ q_j dZ_j ) dZ_{i≠j}
     = ∫ q_j ( ln p̃(X, Z_j) + const ) dZ_j − ∫ q_j ln q_j dZ_j − ∫ ( ∏_{i≠j} q_i ) ( Σ_{i≠j} ln q_i ) dZ_{i≠j}
using ∫ q_j dZ_j = 1; the last term does not depend on q_j.
21. Factorized distributions (IV)
10.1 Variational Inference 21
L(q) = ∫ q_j ln p̃(X, Z_j) dZ_j − ∫ q_j ln q_j dZ_j + const   (10.6)
which is the negative KL divergence between q_j(Z_j) and p̃(X, Z_j), where
ln p̃(X, Z_j) = E_{i≠j}[ln p(X, Z)] + const   (10.7)
E_{i≠j}[ln p(X, Z)] = ∫ ln p(X, Z) ∏_{i≠j} q_i dZ_i   (10.8)
• Maximize L(q) with respect to q_j(Z_j) while keeping {q_{i≠j}} fixed.
• This is the same as minimizing the KL divergence between q_j(Z_j) and p̃(X, Z_j).
22. Optimal Solution
10.1 Variational Inference 22
The optimal factor satisfies
ln q*_j(Z_j) = E_{i≠j}[ln p(X, Z)] + const   (10.9)
or equivalently
q*_j(Z_j) = exp( E_{i≠j}[ln p(X, Z)] ) / ∫ exp( E_{i≠j}[ln p(X, Z)] ) dZ_j
Coordinate-ascent procedure (today's memo):
(1) Initialize all q_j appropriately.
(2) Repeat until convergence:
    • for each q_i: fix all q_j with j ≠ i, find the optimal q_i, and update q_i.
The next slides give detailed examples.
23. 10.1.2. Properties of factorized approximations
10.1 Variational Inference 23
Approximate a correlated Gaussian distribution with a factorized Gaussian.
Consider
p(z) = N(z | µ, Λ^{-1}),   µ = (µ_1, µ_2)^T,   Λ = ( Λ_11  Λ_12 ; Λ_21  Λ_22 ),   z = (z_1, z_2)
and approximate it using q(z) = q_1(z_1) q_2(z_2).
Optimal solution from (10.9):
ln q*_1(z_1) = E_{z_2}[ln p(z)] + const
            = E_{z_2}[ −(1/2)(z_1 − µ_1)^2 Λ_11 − (z_1 − µ_1) Λ_12 (z_2 − µ_2) ] + const
(keeping only the terms that involve z_1).
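For reference, completing the square in the expression above gives q*_1(z_1) = N(z_1 | m_1, Λ_11^{-1}) with m_1 = µ_1 − Λ_11^{-1} Λ_12 (E[z_2] − µ_2), and symmetrically for q*_2. The small NumPy sketch below iterates these two coupled mean updates; the particular µ and Λ are invented for illustration.

```python
import numpy as np

# Correlated Gaussian p(z) = N(z | mu, inv(Lambda)); illustrative numbers.
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 1.2],
                [1.2, 2.0]])   # precision matrix

# Mean-field updates: m1 = mu1 - Lam11^{-1} Lam12 (E[z2] - mu2), and symmetrically.
m = np.zeros(2)                # initial means of q1(z1), q2(z2)
for _ in range(20):
    m[0] = mu[0] - Lam[0, 1] * (m[1] - mu[1]) / Lam[0, 0]
    m[1] = mu[1] - Lam[1, 0] * (m[0] - mu[0]) / Lam[1, 1]

print("factorized means:", m)                         # converge to the true mean mu
print("factorized precisions:", Lam[0, 0], Lam[1, 1])
# Note: the factorized variances 1/Lam11 and 1/Lam22 underestimate the true
# marginal variances (diagonal of inv(Lam)), as discussed on the next slides.
```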
26. 10.1 Variational Inference 26
10.1.2. Properties of factorized approximations
Fig 10.2: The green contours correspond to 1, 2 and 3 standard deviations of a correlated Gaussian distribution p(z) over two variables z_1 and z_2. The red contours represent the corresponding levels of an approximating distribution q(z) over the same variables, given by the product of two independent univariate Gaussians. (Left: minimize KL(q||p); right: minimize KL(p||q).)
• Minimizing KL(q||p): the mean is captured correctly, but the variance is underestimated in the orthogonal direction.
• Considering the reverse KL divergence,
KL(p||q) = − ∫ p(Z) [ Σ_{i=1}^{M} ln q_i(Z_i) ] dZ + const   (10.17)
the optimal solution for each factor q_j(Z_j) is the corresponding marginal distribution of p(Z).
27. 10.1.2. Properties of factorized approximations (continued)
10.1 Variational Inference 27
Comparison of the two directions of the KL divergence:
KL(q||p) = − ∫ q(Z) ln{ p(Z) / q(Z) } dZ        (minimized in variational inference)
KL(p||q) = − ∫ p(Z) [ Σ_{i=1}^{M} ln q_i(Z_i) ] dZ + const        (the reverse KL divergence)
• For KL(q||p): where p(Z) is near zero, q(Z) also tends to be close to zero (zero-forcing); where q(Z) is near zero, the value of p(Z) is not important.
• KL(p||q) is minimized by distributions q(Z) that are nonzero in the regions where p(Z) is nonzero (zero-avoiding).
28. More about divergence
10.1 Variational Inference 28
Fig 10.3: Blue contours = bimodal distribution p(Z); red contours = the single Gaussian distribution q(Z) that best approximates p(Z). (a) minimizing KL(p||q); (b), (c) minimizing KL(q||p), which locks onto one of the two modes.
• KL(p||q) and KL(q||p) belong to the alpha family of divergences D_α(p||q), where D_α(p||q) → KL(q||p) as α → −1 and D_α(p||q) → KL(p||q) as α → +1.
• If α ≤ −1, the approximation is zero-forcing and will underestimate the support of p(x).
• If α ≥ 1, it is zero-avoiding and will overestimate the support of p(x).
• If α = 0, the divergence is symmetric and is related to the Hellinger distance.
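The mode-seeking versus mass-covering behaviour in Fig 10.3 can be reproduced numerically. The sketch below fits a single Gaussian q to a hypothetical one-dimensional bimodal p by minimizing each direction of the KL divergence on a grid; the example distributions and optimizer settings are my own choices, not from the slides.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

# Grid and a bimodal target p(z): mixture of two well-separated Gaussians.
z = np.linspace(-10, 10, 4001)
dz = z[1] - z[0]
p = 0.5 * stats.norm.pdf(z, -3, 1.0) + 0.5 * stats.norm.pdf(z, 3, 1.0)

def kl(a, b):
    """Grid approximation of KL(a || b) for densities tabulated on z."""
    eps = 1e-300
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dz

def q_pdf(params):
    m, log_s = params
    return stats.norm.pdf(z, m, np.exp(log_s))

# Mode-seeking: minimize KL(q || p) -> q locks onto one mode.
res_qp = minimize(lambda th: kl(q_pdf(th), p), x0=[2.0, 0.0], method="Nelder-Mead")
# Mass-covering: minimize KL(p || q) -> q spreads over both modes (moment matching).
res_pq = minimize(lambda th: kl(p, q_pdf(th)), x0=[2.0, 0.0], method="Nelder-Mead")

print("argmin KL(q||p): mean=%.2f std=%.2f" % (res_qp.x[0], np.exp(res_qp.x[1])))
print("argmin KL(p||q): mean=%.2f std=%.2f" % (res_pq.x[0], np.exp(res_pq.x[1])))
```

Minimizing KL(q||p) ends up on whichever mode the initialization favours, while minimizing KL(p||q) recovers the broad moment-matched Gaussian, mirroring panels (b)/(c) versus (a).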
29. 10.1.3 The univariate Gaussian (I)
10.1 Variational Inference 29
• Goal: infer the posterior distribution over the mean µ and precision τ given a data set D = {x_1, ..., x_N}.
• Likelihood function: Gaussian, Eq. (10.21).
• Prior: conjugate Gaussian-gamma prior over (µ, τ), Eqs. (10.22)-(10.23).
• Approximate the posterior with the factorized form q(µ, τ) = q_µ(µ) q_τ(τ), Eq. (10.24).
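For completeness, here is a small NumPy sketch of the coordinate updates that this factorization leads to. The closed-form updates are the standard results for this conjugate Gaussian-gamma model (cf. PRML Section 10.1.3), written down from memory rather than taken from the slides, and the data and hyperparameters are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=200)   # synthetic data
N, xbar = len(x), x.mean()

# Hyperparameters of the Gaussian-gamma prior (illustrative values).
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = 1.0                                    # initial guess for E[tau]
for _ in range(50):
    # q(mu) = N(mu | mu_N, 1/lam_N)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # q(tau) = Gam(tau | a_N, b_N)
    a_N = a0 + (N + 1) / 2.0
    E_sq = np.sum((x - mu_N) ** 2) + N / lam_N         # E_mu[sum_n (x_n - mu)^2]
    E_prior = (mu_N - mu0) ** 2 + 1.0 / lam_N          # E_mu[(mu - mu0)^2]
    b_N = b0 + 0.5 * (E_sq + lam0 * E_prior)
    E_tau = a_N / b_N

print("E[mu]  =", mu_N)                                # close to the sample mean
print("E[tau] =", E_tau, " (compare 1/var:", 1.0 / x.var(), ")")
```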
34. 10.1.4 Model Comparison
10.1 Variational Inference 34
• Let the prior probabilities over a set of candidate models be p(m).
• Goal: determine the posterior p(m|X), where X is the observed data.
• Approximate the joint posterior over latent variables and model index with q(Z, m) = q(Z|m) q(m).
• The decomposition (10.34) then reads
ln p(X) = L − Σ_m Σ_Z q(Z|m) q(m) ln{ p(Z, m|X) / ( q(Z|m) q(m) ) }
where the lower bound is
L = Σ_m Σ_Z q(Z|m) q(m) ln{ p(Z, X, m) / ( q(Z|m) q(m) ) }   (10.35)
• Maximizing L with respect to q(m) gives q(m) ∝ p(m) exp(L_m), where
L_m = Σ_Z q(Z|m) ln{ p(Z, X|m) / q(Z|m) }
• Maximizing L with respect to q(Z|m), the solutions for different m are coupled due to the conditioning on m.
• In practice, optimize each q(Z|m) individually and subsequently find q(m).
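A tiny sketch of the resulting model-selection rule: given the per-model bounds L_m, the approximate model posterior q(m) ∝ p(m) exp(L_m) is just a softmax over log-prior plus bound. The numbers below are invented for illustration.

```python
import numpy as np

# Hypothetical per-model evidence lower bounds L_m and model priors p(m).
L_m = np.array([-1520.3, -1498.7, -1501.2])       # e.g. models m = 1, 2, 3
log_prior = np.log(np.array([1/3, 1/3, 1/3]))

# q(m) ∝ p(m) exp(L_m); compute it stably in log space.
log_q = log_prior + L_m
log_q -= np.max(log_q)                            # avoid underflow before exponentiating
q_m = np.exp(log_q) / np.sum(np.exp(log_q))

print("approximate model posterior q(m):", q_m)   # mass concentrates on the best bound
```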
35. Progress…
Variational Inference 35
(Outline recap: 10.1 Variational Inference, 10.2 Variational Mixture of Gaussians, 10.3 Variational Linear Regression, 10.4 Exponential Family Distributions, 10.5 Local Variational Methods, 10.6 Variational Logistic Regression, 10.7 Expectation Propagation)
36. 10.2 Variational Mixture of Gaussians
10.2 Variational Mixture of Gaussians 36
• Goal: apply variational inference to the Gaussian mixture model.
• Problem formulation:
Observed data X = {x_1, ..., x_N}; hidden variables Z = {z_1, ..., z_N}, where each z_n = {z_nk} is a 1-of-K binary vector associated with observation x_n.
• Conditional distribution of Z given the mixing coefficients π:
p(Z|π) = ∏_{n=1}^{N} ∏_{k=1}^{K} π_k^{z_nk}   (10.37)
• Conditional distribution of the observed data vectors, given the latent variables and the component parameters:
p(X|Z, µ, Λ) = ∏_{n=1}^{N} ∏_{k=1}^{K} N(x_n | µ_k, Λ_k^{-1})^{z_nk}   (10.38)
37. 10.2 Variational Mixture of Gaussians
10.2 Variational Mixture of Gaussians 37
• Conjugate prior distributions:
• Dirichlet distribution over the mixing coefficients π:
p(π) = Dir(π|α_0) = C(α_0) ∏_{k=1}^{K} π_k^{α_0 − 1}   (10.39)
• Independent Gaussian-Wishart prior governing the mean and precision of each Gaussian component (10.40), with m_0 = 0 chosen by symmetry.
Fig 10.5: Directed acyclic graph representing the Bayesian mixture of Gaussians model.
38. 10.2.1. Variational Distribution (I)
10.2 Variational Mixture of Gaussians 38
• Joint distribution:
p(X, Z, π, µ, Λ) = p(X|Z, µ, Λ) p(Z|π) p(π) p(µ|Λ) p(Λ)
• Approximate the posterior with the factorization
q(Z, π, µ, Λ) = q(Z) q(π, µ, Λ)   (10.42)
• Optimal solution (from formula (10.9)):
ln q*(Z) = E_{π,µ,Λ}[ln p(X, Z, π, µ, Λ)] + const
         = E_π[ln p(Z|π)] + E_{µ,Λ}[ln p(X|Z, µ, Λ)] + const
         = E_π[ ln ∏_n ∏_k π_k^{z_nk} ] + E_{µ,Λ}[ ln ∏_n ∏_k N(x_n | µ_k, Λ_k^{-1})^{z_nk} ] + const
         = Σ_n Σ_k z_nk E_π[ln π_k]
           + Σ_n Σ_k z_nk E_{µ,Λ}[ (1/2) ln|Λ_k| − (D/2) ln(2π) − (1/2)(x_n − µ_k)^T Λ_k (x_n − µ_k) ] + const
   (10.43), (10.44)
where D is the dimensionality of the data variable x.
39. 10.2.1 Variational Distribution (II)
10.2 Variational Mixture of Gaussians 39
• Optimal solution for q(Z):
ln q*(Z) = Σ_n Σ_k z_nk ln ρ_nk + const   (10.45)
where ln ρ_nk collects the terms from (10.44) that multiply z_nk (10.46). Then
q*(Z) ∝ ∏_n ∏_k ρ_nk^{z_nk}   (10.47)
and, once normalized,
q*(Z) = ∏_n ∏_k r_nk^{z_nk}   with   r_nk = ρ_nk / Σ_j ρ_nj   (10.48), (10.49)
The quantities r_nk = E[z_nk] can also be seen as responsibilities, as in the case of EM.
41. 10.2.1 Variational Distribution (IV)
10.2 Variational Mixture of Gaussians 41
ln q*(π, µ, Λ) = Σ_n Σ_k E_Z[z_nk] ln N(x_n | µ_k, Λ_k^{-1}) + E_Z[ln p(Z|π)]
                + ln p(π) + Σ_k ln p(µ_k, Λ_k) + const
The right-hand side decomposes into a term involving only π plus terms involving only (µ_k, Λ_k), so the posterior factorizes further:
q(π, µ, Λ) = q(π) ∏_{k=1}^{K} q(µ_k, Λ_k)   (10.54)
q*(π, µ, Λ) = q*(π) ∏_{k=1}^{K} q*(µ_k, Λ_k)   (10.55)
From (10.54) and (10.55) we have
ln q*(π) = E_Z[ln p(Z|π)] + ln p(π) + const
ln q*(µ_k, Λ_k) = ln p(µ_k, Λ_k) + Σ_n E_Z[z_nk] ln N(x_n | µ_k, Λ_k^{-1}) + const
42. 10.2.1 Variational Distribution (VI)
10.2 Variational Mixture of Gaussians 42
ln q*(π) = E_Z[ln p(Z|π)] + ln p(π) + const
         = E_Z[ ln ∏_n ∏_k π_k^{z_nk} ] + ln( C(α_0) ∏_k π_k^{α_0 − 1} ) + const
         = Σ_n Σ_k E_Z[z_nk] ln π_k + (α_0 − 1) Σ_k ln π_k + const
         = Σ_k (N_k + α_0 − 1) ln π_k + const
         = ln( ∏_k π_k^{N_k + α_0 − 1} ) + const
so q*(π) is recognized as a Dirichlet distribution:
q*(π) = Dir(π|α),   α_k = N_k + α_0   (10.56)-(10.58)
43. 10.2.1 Variational Distribution (VII)
10.2 Variational Mixture of Gaussians 43
ln q*(µ_k, Λ_k) = ln p(µ_k, Λ_k) + Σ_{n=1}^{N} E_Z[z_nk] ln N(x_n | µ_k, Λ_k^{-1}) + const   (10.59)
This yields a Gaussian-Wishart distribution for q*(µ_k, Λ_k) (Exercise 10.13), with updated parameters given by (10.60)-(10.63).
These updates are analogous to the M-step of the EM algorithm.
44. 10.2.1 Variational Distribution (VIII)
10.2 Variational Mixture of Gaussians 44
• Optimizing the variational posterior for the Gaussian mixture:
(1) Initialize the responsibilities r_nk.
(2) Update N_k, x̄_k, S_k by (10.51)-(10.53).
(3) [M step] Fix the responsibilities and use them to recompute the variational distribution over the parameters:
    • use (10.57) to find q*(π)
    • use (10.59) to find q*(µ_k, Λ_k), k = 1, ..., K
(4) [E step] Use the current distribution over the parameters, via (10.64)-(10.66) and (10.46)-(10.49), to re-evaluate the responsibilities.
(5) Return to (2) until convergence.
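Below is a compact NumPy/SciPy sketch of this loop for a toy 2-D data set. The specific update formulas ((10.51)-(10.66)) are written from memory of PRML, and the data, priors and component count are invented, so treat this as an illustrative sketch rather than the slide author's code.

```python
import numpy as np
from scipy.special import digamma
from numpy.linalg import inv, slogdet

rng = np.random.default_rng(0)
# Toy data: two 2-D Gaussian blobs.
X = np.vstack([rng.normal([-2, 0], 0.5, size=(150, 2)),
               rng.normal([2, 1], 0.5, size=(150, 2))])
N, D = X.shape
K = 4                                            # deliberately too many components

# Priors (10.39)-(10.40): Dirichlet and Gaussian-Wishart.
alpha0, beta0, nu0 = 1e-3, 1.0, float(D)
m0, W0 = np.zeros(D), np.eye(D)

r = rng.dirichlet(np.ones(K), size=N)            # (1) initialize responsibilities

for it in range(100):
    # (2) Statistics (10.51)-(10.53).
    Nk = r.sum(axis=0) + 1e-10
    xbar = (r.T @ X) / Nk[:, None]
    S = np.zeros((K, D, D))
    for k in range(K):
        d = X - xbar[k]
        S[k] = (r[:, k, None] * d).T @ d / Nk[k]

    # (3) M step: update q(pi) and q(mu_k, Lambda_k), Eqs. (10.58)-(10.63).
    alpha = alpha0 + Nk
    beta = beta0 + Nk
    nu = nu0 + Nk
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    W = np.zeros((K, D, D))
    for k in range(K):
        diff = (xbar[k] - m0)[:, None]
        Winv = inv(W0) + Nk[k] * S[k] + (beta0 * Nk[k] / (beta0 + Nk[k])) * (diff @ diff.T)
        W[k] = inv(Winv)

    # (4) E step: responsibilities from (10.46), (10.49) using (10.64)-(10.66).
    ln_pi = digamma(alpha) - digamma(alpha.sum())                         # E[ln pi_k]
    ln_Lam = np.array([digamma(0.5 * (nu[k] + 1 - np.arange(1, D + 1))).sum()
                       + D * np.log(2) + slogdet(W[k])[1] for k in range(K)])
    ln_rho = np.zeros((N, K))
    for k in range(K):
        d = X - m[k]
        maha = nu[k] * np.einsum("ni,ij,nj->n", d, W[k], d) + D / beta[k]
        ln_rho[:, k] = ln_pi[k] + 0.5 * ln_Lam[k] - 0.5 * D * np.log(2 * np.pi) - 0.5 * maha
    ln_rho -= ln_rho.max(axis=1, keepdims=True)
    r = np.exp(ln_rho)
    r /= r.sum(axis=1, keepdims=True)

# Expected mixing coefficients: superfluous components are driven towards zero.
print("E[pi_k] =", np.round(alpha / alpha.sum(), 3))
```

With a small α_0, the expected mixing coefficients of unneeded components shrink towards zero, which is exactly the pruning behaviour described on the next slide.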
45. 10.2.1 Variational Distribution (IX)
10.2 Variational Mixture of Gaussians 45
Figure 10.6: Variational Bayesian mixture of K = 6 Gaussians applied to the Old Faithful data set, shown after increasing numbers of iterations. The ellipses denote the one-standard-deviation density contours for each of the components, and the density of red ink inside each ellipse corresponds to the mean value of the mixing coefficient for each component.
The mixing coefficients of superfluous components tend towards zero, so those components effectively disappear.
46. Compare EM with Variational Bayes
10.2 Variational Mixture of Gaussians 46
• Similar computational complexity.
• As the number of data points N → ∞, the Bayesian treatment converges to the maximum-likelihood EM solution.
• Advantages of variational Bayes:
A. The singularities that arise in maximum likelihood are absent in the Bayesian treatment; they are removed by the introduction of the prior.
B. No over-fitting, so the bound can be used for determining the number of components.
47. 10.2.2 Variational lower bound
10.2 Variational Mixture of Gaussians 47
• At each step of the iterative re-estimation procedure the value of this bound should not decrease.
• Useful for testing convergence.
• Also useful for checking the correctness of both the mathematical expressions and the implementation.
• For the variational mixture of Gaussians, the lower bound can be written out in closed form using the expectations computed above.
48. 10.2.3 Predictive density
10.2 Variational Mixture of Gaussians 48
• Predictive density p(x̂|X) for a new value x̂ with corresponding latent variable ẑ.
• It depends on the posterior distribution of the parameters, which is intractable.
• Substituting the variational approximation q(π) q(µ, Λ) for the true parameter posterior gives an approximate predictive density (10.78).
49. 10.2.4 Number of components
10.2 Variational Mixture of Gaussians 49
• For a given mixture model of K components, each parameter setting is a member of a family of K! equivalent settings.
Figure 10.7: Plot of the variational lower bound L versus the number K of components in the Gaussian mixture model, for the Old Faithful data, showing a distinct peak at K = 2 components. For each value of K, the model is trained from 100 different random starts, and the results are shown as '+' symbols plotted with small random horizontal perturbations so that they can be distinguished. Note that some solutions find suboptimal local maxima, but that this happens infrequently.
• Alternatively, start with a relatively large value of K and let components with insufficient contribution be pruned out: their mixing coefficients are driven to zero.
50. 10.2.5 Induced factorizations
10.2 Variational Mixture of Gaussians 50
Induced factorizations arise from an interaction between the factorization assumption made in the variational posterior and the conditional independence properties of the true posterior.
• For example, let A, B, C be disjoint groups of latent variables.
• Factorization assumption: q(A, B, C) = q(A, B) q(C).
• The optimal solution is ln q*(A, B) = E_C[ln p(A, B|X, C)] + const.
• We then need to determine whether q*(A, B) further factorizes into q(A) q(B). This happens if and only if A is conditionally independent of B given X and C.
• This can also be determined directly from the directed graphical model (using d-separation).
51. Progress…
Variational Inference 51
(Outline recap: 10.1 Variational Inference, 10.2 Variational Mixture of Gaussians, 10.3 Variational Linear Regression, 10.4 Exponential Family Distributions, 10.5 Local Variational Methods, 10.6 Variational Logistic Regression, 10.7 Expectation Propagation)
52. 10.3 Variational Linear Regression
10.3 Variational Linear Regression 52
• Return to the Bayesian linear regression model (Section 3.3), with the input x omitted from the notation; t denotes the training target values and t̂ a predicted value.
• In the evidence framework, the integration over the hyperparameters α, β was approximated by point estimates obtained by maximizing the log marginal likelihood,
(α̂, β̂) = argmax_{α,β} ln p(t|α, β)
and prediction then uses p(t̂|t) ≈ p(t̂|t, α̂, β̂); with flat hyperpriors this amounts to maximum likelihood estimation of α and β.
53. 10.3 Variational Linear Regression
10.3 Variational Linear Regression 53
• A fully Bayesian approach would integrate over the hyperparameters as well as over the parameters (this section). Suppose the noise precision parameter β is known.
• The joint distribution of all the variables (10.90):
p(t, w, α) = p(t|w) p(w|α) p(α)
with likelihood p(t|w) (10.87), where φ_n = φ(x_n), and priors (10.88), (10.89):
p(w|α) = (1 / (2π)^{M/2}) (1 / |Σ|^{1/2}) exp{ −(1/2) w^T Σ^{-1} w },   Σ = α^{-1} I
p(α) = Gam(α|a_0, b_0) ∝ α^{a_0 − 1} exp(−b_0 α)
54. 10.3.1 Variational Distribution
10.3 Variational Linear Regression 54
• Goal: find an approximation to the posterior distribution p(w, α|t).
• Factorized approximation:
q(w, α) = q(w) q(α)   (10.91)
• Optimal solution, from (10.9), ln q*_j(Z_j) = E_{i≠j}[ln p(X, Z)] + const:
ln q*(α) = E_w[ln( p(t|w) p(w|α) p(α) )] + const
         = ln p(α) + E_w[ln p(w|α)] + const
         = (a_0 − 1) ln α − b_0 α + (M/2) ln α − (α/2) E_w[w^T w] + const   (10.92)
so
q*(α) = Gam(α|a_N, b_N)   (10.93)
with a_N = a_0 + M/2 (10.94) and b_N = b_0 + (1/2) E_w[w^T w] (10.95), where M is the number of fitted parameters w_i (the dimensionality of the feature vector).
55. 10.3.1 Variational Distribution
10.3 Variational Linear Regression 55
• Similarly, the optimal solution for q(w) (10.96)-(10.98) is the exponential of a quadratic form in w, so
q*(w) = N(w|m_N, S_N)   (10.99)
with m_N and S_N given on the next slide.
56. 10.3.1 Variational Distribution
10.3 Variational Linear Regression 56
• Optimal solutions:
q*(α) = Gam(α|a_N, b_N)   (10.93), with a_N and b_N given by (10.94), (10.95)
q*(w) = N(w|m_N, S_N)   (10.99), with m_N = β S_N Φ^T t and S_N = ( E[α] I + β Φ^T Φ )^{-1}   (10.100), (10.101)
• Required moments:
E[α] = a_N / b_N   (10.102),   E[w w^T] = m_N m_N^T + S_N   (10.103)
57. More about variational linear regression
10.3 Variational Linear Regression 57
• Predictive distribution over t, given a new input x: evaluated using the Gaussian variational posterior (10.105).
• Lower bound L(q): given in (10.107).
58. More about variational linear regression
10.3 Variational Linear Regression 58
Lower bound versus order M of a polynomial model.
Figure 10.9: The lower bound versus the order M of the polynomial model, in which a set of 10 data points is generated from a polynomial with M = 3 sampled over (−5, 5) with additive Gaussian noise of variance 0.09. The value of the bound gives the log probability of the model and shows a peak at M = 3, corresponding to the true model.
59. Progress…
Variational Inference 59
(Outline recap: 10.1 Variational Inference, 10.2 Variational Mixture of Gaussians, 10.3 Variational Linear Regression, 10.4 Exponential Family Distributions, 10.5 Local Variational Methods, 10.6 Variational Logistic Regression, 10.7 Expectation Propagation)
Part I: Probabilistic modeling and the variational principle. Now: designing the variational algorithms.
60. Progress…
Variational Inference 60
(Outline recap: 10.1 Variational Inference, 10.2 Variational Mixture of Gaussians, 10.3 Variational Linear Regression, 10.4 Exponential Family Distributions, 10.5 Local Variational Methods, 10.6 Variational Logistic Regression, 10.7 Expectation Propagation)
61. 10.4 Exponential Family Distributions
10.4 Exponential Family Distributions 61
• For many of the models in this book, the complete-data likelihood is a member of the exponential family.
• In general, this will not be the case for the marginal likelihood of the observed data alone, e.g. in a mixture of Gaussians.
62. 10.4 Exponential Family Distributions
10.4 Exponential Family Distributions 62
• Observed data X = {x_1, ..., x_N}; latent variables Z = {z_1, ..., z_N}.
• Suppose that the joint distribution is a member of the exponential family:
p(X, Z|η) = ∏_{n=1}^{N} h(x_n, z_n) g(η) exp{ η^T u(x_n, z_n) }   (10.113)
with the conjugate prior for η
p(η|ν_0, χ_0) = f(ν_0, χ_0) g(η)^{ν_0} exp{ ν_0 η^T χ_0 }   (10.114)
(ν_0 can be interpreted as a prior number of observations, all having the value χ_0 for the u vector).
• Variational distribution: q(Z, η) = q(Z) q(η).
63. 10.4 Exponential Family Distributions
10.4 Exponential Family Distributions 63
• Optimal solution, from (10.9), ln q*_j(Z_j) = E_{i≠j}[ln p(X, Z)] + const:
ln q*(Z) = E_η[ln p(X, Z, η)] + const = E_η[ln p(X, Z|η)] + const
         = Σ_{n=1}^{N} { ln h(x_n, z_n) + E_η[η]^T u(x_n, z_n) } + const
This is a sum of independent terms, one per observation, so a factorization over n is induced:
q*(Z) = ∏_n q*(z_n)   (10.115)
where
q*(z_n) = h(x_n, z_n) g( E_η[η] ) exp{ E_η[η]^T u(x_n, z_n) }   (10.116)
65. Variational message passing
10.4 Exponential Family Distributions 65
• The joint distribution corresponding to a directed graph is
p(x) = ∏_i p(x_i | pa_i)
where pa_i is the parent set of node i, and x_i denotes the variable(s), latent or observed, associated with node i.
• Variational approximation: q(x) = ∏_i q_i(x_i).
• The optimal solution for each factor depends only on the nodes in its Markov blanket, so the update of the factors in the variational posterior distribution represents a local calculation on the graph.
66. Variational message passing
10.4 Exponential Family Distributions 66
• If all the conditional distributions have a conjugate-exponential structure, then the variational updates take the form of local messages passed between neighbouring nodes.
• The distribution associated with a particular node can be updated once that node has received messages from all of its parents and all of its children.
• This in turn requires that those children have already received messages from their co-parents.
67. Progress…
Variational Inference 67
(Outline recap: 10.1 Variational Inference, 10.2 Variational Mixture of Gaussians, 10.3 Variational Linear Regression, 10.4 Exponential Family Distributions, 10.5 Local Variational Methods, 10.6 Variational Logistic Regression, 10.7 Expectation Propagation)
68. 10.5 Local Variational Methods
10.5 Local Variational Methods 68
• Global methods: approximation to the full posterior distribution.
• Local methods: approximation to individual variables or groups of variables.
• Idea: replace the likelihood with a simpler form, a (lower) bound, that makes the required expectations easy to compute.
69. Convex duality
10.5 Local Variational Methods 69
(Figure: a convex function f(x) with a linear lower bound ηx; the line is moved vertically until it becomes tangent, giving the best bound for that slope; shown for both a convex and a concave f(x).)
Convex duality:
For convex f(x):   g(η) = max_x { ηx − f(x) }   and   f(x) = max_η { ηx − g(η) }   (10.130), (10.131)
For concave f(x):  g(η) = min_x { ηx − f(x) }   and   f(x) = min_η { ηx − g(η) }   (10.132), (10.133)
70. 10.5 Local Variational Methods
10.5 Local Variational Methods 70
Original problem (logistic sigmoid):
p(y = 1|x) = 1 / (1 + exp(−x)) = σ(x)
f(x) = ln σ(x) is a concave function, so consider
g(η) = min_x { ηx − f(x) } = −η ln η − (1 − η) ln(1 − η)   (10.135)
Then
ln σ(x) ≤ ηx − g(η)   (10.136)
which gives the upper bound
σ(x) ≤ exp( ηx − g(η) )   (10.137)
71. 10.5 Local Variational Methods
10.5 Local Variational Methods 71
• Gaussian lower bound (Jaakkola and Jordan, 2000).
f(x) = −ln( e^{x/2} + e^{−x/2} ) is a convex function of the variable x², so consider the conjugate with respect to x²:
g(η) = max_{x²} { η x² − f(√(x²)) }   (10.139)
The stationarity condition is
0 = η − (d/dx²) f(x) = η + (1/(4x)) tanh(x/2)
Denoting the stationary point by ξ and defining
λ(ξ) = (1/(4ξ)) tanh(ξ/2) = (1/(2ξ)) [ σ(ξ) − 1/2 ]
(so that the optimal η equals −λ(ξ)), we get
g(−λ(ξ)) = −λ(ξ) ξ² − f(ξ) = −λ(ξ) ξ² + ln( e^{ξ/2} + e^{−ξ/2} )
The bound on f(x):
f(x) ≥ −λ(ξ) x² − g(−λ(ξ)) = −λ(ξ) x² + λ(ξ) ξ² − ln( e^{ξ/2} + e^{−ξ/2} )
The bound on the logistic sigmoid:
σ(x) ≥ σ(ξ) exp{ (x − ξ)/2 − λ(ξ)(x² − ξ²) }   (10.144)
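A quick numerical sanity check of this bound (my own illustration, not from the deck): evaluate the right-hand side of (10.144) over a grid and confirm that it never exceeds σ(x) and touches it at x = ±ξ.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
lam = lambda xi: (sigmoid(xi) - 0.5) / (2.0 * xi)      # lambda(xi), assuming xi != 0

def jj_lower_bound(x, xi):
    """Jaakkola-Jordan Gaussian lower bound on sigma(x), Eq. (10.144)."""
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x**2 - xi**2))

x = np.linspace(-8, 8, 2001)
xi = 2.5
gap = sigmoid(x) - jj_lower_bound(x, xi)

print("minimum gap (should be >= 0):", gap.min())
print("gap at x = +/-xi (should be ~0):",
      sigmoid(xi) - jj_lower_bound(xi, xi),
      sigmoid(-xi) - jj_lower_bound(-xi, xi))
```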
72. How the bounds can be used
10.5 Local Variational Methods 72
• Evaluate I = ∫ σ(a) p(a) da (intractable).
• Apply the local variational bound σ(a) ≥ f(a, ξ), where ξ is an additional (variational) parameter whose best value depends on a.
• This gives the variational bound
I ≥ ∫ f(a, ξ) p(a) da = F(ξ)
• Finding the best compromise ξ* means maximizing F(ξ).
73. In Reviews…
10.5 Local Variational Methods 73
Original problem:
p(y = 1|x) = 1 / (1 + exp(−x)) = σ(x)
Local bound:
σ(x) ≥ σ(ξ) exp{ (x − ξ)/2 − λ(ξ)(x² − ξ²) },   where ξ is an additional (variational) parameter.
The bound contains only linear and quadratic terms in x, so expectations, especially against a Gaussian, are easy to compute.
74. Progress…
Variational Inference 74
(Outline recap: 10.1 Variational Inference, 10.2 Variational Mixture of Gaussians, 10.3 Variational Linear Regression, 10.4 Exponential Family Distributions, 10.5 Local Variational Methods, 10.6 Variational Logistic Regression, 10.7 Expectation Propagation)
75. 10.6 Variational Logistic Regression
10.6 Variational Logistic Regression 75
Return to the Bayesian logistic regression model (Section 4.5).
• The posterior distribution is
p(w|t) ∝ p(w) p(t|w)
with prior p(w) = N(w|w_0, S_0) and likelihood
p(t|w) = ∏_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1 − t_n},   where y_n = σ(w^T φ_n)
• Laplace approach: maximize the posterior to obtain w_MAP and then form the Gaussian approximation q(w) = N(w|w_MAP, S_N).
76. 10.6.1 Variational posterior distribution
10.6 Variational Logistic Regression 76
A practical example of the local variational method.
• Recap of the variational framework: maximize a lower bound on the marginal likelihood.
• For the Bayesian logistic regression model, the marginal likelihood is
p(t) = ∫ p(t|w) p(w) dw   (10.147)
where the conditional distribution for t is given by (10.148).
77. 10.6.1 Variational posterior distribution
10.6 Variational Logistic Regression 77
• Apply the lower bound on the logistic sigmoid function (10.149), where λ(ξ) is defined as before.
• We can therefore write a corresponding bound on each likelihood factor (10.151).
• Bound on the joint distribution of t and w:
p(t, w) = p(t|w) p(w) ≥ h(w, ξ) p(w)   (10.152)
where h(w, ξ) is defined in (10.153), and each training-set observation (φ_n, t_n) has its own variational parameter ξ_n.
78. 10.6.1 Variational posterior distribution
10.6 Variational Logistic Regression 78
• Taking logs gives a lower bound on the log of the joint distribution of t and w (10.154).
• Hypothesis for the prior p(w): Gaussian with parameters m_0 and S_0, considered fixed.
• The right-hand side of (10.154) then becomes a function of w (10.155).
79. 10.6.1 Variational posterior distribution
10.6 Variational Logistic Regression 79
• Quantity of interest: the exact posterior distribution, which requires normalizing the left-hand side of (10.152) and is usually intractable.
• Work instead with the right-hand side of (10.155): a quadratic function of w that is a lower bound on p(w, t).
• This yields a Gaussian variational posterior of the form
q(w) = N(w|m_N, S_N)   (10.156)
where
m_N = S_N ( S_0^{-1} m_0 + Σ_n (t_n − 1/2) φ_n )   (10.157)
S_N^{-1} = S_0^{-1} + 2 Σ_n λ(ξ_n) φ_n φ_n^T   (10.158)
80. 10.6.1 Optimizing the variational parameters
10.6 Variational Logistic Regression 80
• Determine the variational parameters ξ = {ξ_n} by maximizing the lower bound on the marginal likelihood obtained by substituting (10.152) back into the marginal likelihood:
ln p(t) ≥ ln ∫ h(w, ξ) p(w) dw = L(ξ)   (10.159)
Two approaches:
• (1) View w as a latent variable and use the EM algorithm.
• (2) Compute L(ξ) and maximize it directly, using the fact that p(w) is Gaussian and ln h(w, ξ) is a quadratic function of w.
Both lead to the same re-estimation equations (10.163), (10.164):
(ξ_n^{new})² = φ_n^T ( S_N + m_N m_N^T ) φ_n
81. 10.6.1 Optimizing the variational parameters
10.6 Variational Logistic Regression 81
• Invoke the EM algorithm:
1. Initialize the variational parameters ξ^{old}.
2. E step: use ξ^{old} to calculate the posterior distribution over w, i.e. q(w) = N(w|m_N, S_N) from (10.156)-(10.158).
3. M step: maximize the expected complete-data log likelihood
Q(ξ, ξ^{old}) = E_{q(w)}[ ln h(w, ξ) p(w) ]   (10.160)
Solving the stationarity condition (10.162) gives the re-estimation equation (10.163):
(ξ_n^{new})² = φ_n^T ( S_N + m_N m_N^T ) φ_n
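The E/M alternation above amounts to a few lines of linear algebra. The sketch below implements it for synthetic data; the update formulas follow (10.156)-(10.163) as recalled from PRML, and everything else (data, prior, feature map) is my own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
lam = lambda xi: (sigmoid(xi) - 0.5) / (2.0 * xi)        # lambda(xi_n), xi_n > 0

# Synthetic 2-class data with a bias feature: Phi is N x M.
N, M = 200, 3
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
w_true = np.array([-0.5, 2.0, -1.0])
t = (rng.random(N) < sigmoid(Phi @ w_true)).astype(float)

# Gaussian prior p(w) = N(w | m0, S0).
m0, S0_inv = np.zeros(M), np.eye(M)

xi = np.ones(N)                                          # initialize xi^old
for _ in range(50):
    # E step: q(w) = N(w | m_N, S_N), Eqs. (10.156)-(10.158).
    S_N_inv = S0_inv + 2.0 * (Phi.T * lam(xi)) @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = S_N @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
    # M step: re-estimate xi_n, Eq. (10.163).
    E_wwT = S_N + np.outer(m_N, m_N)
    xi = np.sqrt(np.einsum("nm,mk,nk->n", Phi, E_wwT, Phi))

print("posterior mean m_N:", np.round(m_N, 2), " (true w:", w_true, ")")
```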
83. 10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 83
• Allow the hyperparameters to be inferred from the data set.
• Consider a simple Gaussian prior of the form p(w|α)   (10.165).
• Consider a conjugate hyperprior given by a gamma distribution, p(α) = Gam(α|a_0, b_0)   (10.166).
• The marginal likelihood is p(t) = ∫∫ p(w, α, t) dw dα   (10.167), where the joint distribution is
p(w, α, t) = p(t|w) p(w|α) p(α)   (10.168)
84. 10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 84
Combine the global and local approaches:
• (1) Global approach: introduce a variational distribution q(w, α) and apply the standard decomposition ln p(t) = L(q) + KL(q||p)   (10.169).
• (2) The lower bound L(q) is still intractable, so apply the local approach as before to obtain a further lower bound on L(q), and hence on ln p(t).
• (3) Assume that q factorizes as q(w, α) = q(w) q(α)   (10.172).
85. 10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 85
• From the bound (10.153) and the prior (10.165), ln q(w) is a quadratic function of w, so q(w) is Gaussian (10.174), with mean and covariance given by (10.175) and (10.176).
86. 10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 86
• Similarly for q(α): from the prior terms (10.165) and (10.166),
p(α) = Gam(α|a_0, b_0) ∝ α^{a_0 − 1} exp(−b_0 α)
we obtain
q(α) = Gam(α|a_N, b_N) = (1 / Γ(a_N)) b_N^{a_N} α^{a_N − 1} e^{−b_N α}   (10.177)
with a_N and b_N given by (10.178) and (10.179).
87. 10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 87
• The variational parameters ξ are obtained by maximizing the lower bound (10.180)-(10.183).
• The re-estimation equations are as before, obtained from
Q(ξ, ξ^{old}) = E_{q(w)}[ ln h(w, ξ) p(w) ]   (10.160)
with the expectation now taken under the current q(w).
88. Progress…
Variational Inference 88
(Outline recap: 10.1 Variational Inference, 10.2 Variational Mixture of Gaussians, 10.3 Variational Linear Regression, 10.4 Exponential Family Distributions, 10.5 Local Variational Methods, 10.6 Variational Logistic Regression, 10.7 Expectation Propagation)
89. Expectation Propagation (Minka, 2001)
10.7 Expectation Propagation 89
• An alternative form of deterministic approximate inference, based on the reverse KL divergence KL(p||q) (instead of KL(q||p)), where p is the complex distribution:
KL(q||p) = ∫ q(z) ln( q(z) / p(z) ) dz        KL(p||q) = ∫ p(z) ln( p(z) / q(z) ) dz
• Consider a fixed distribution p(z) and a member of the exponential family
q(z) = h(z) g(η) exp{ η^T u(z) }   (10.184)
• The Kullback-Leibler divergence as a function of η:
KL(p||q) = −ln g(η) − η^T E_{p(z)}[u(z)] + const   (10.185)
Setting its gradient with respect to η to zero gives (10.186).
90. Expectation Propagation
10.7 Expectation Propagation 90
• For a member of the exponential family, q(z) = h(z) g(η) exp{ η^T u(z) } (10.184), normalization gives
∫ h(z) g(η) exp{ η^T u(z) } dz = 1
Taking the gradient of both sides with respect to η:
∇g(η) ∫ h(z) exp{ η^T u(z) } dz + g(η) ∫ h(z) exp{ η^T u(z) } u(z) dz = 0
−(1/g(η)) ∇g(η) = g(η) ∫ h(z) exp{ η^T u(z) } u(z) dz = E_{q(z)}[u(z)]
−∇ ln g(η) = E_{q(z)}[u(z)]   (10.187)
Combining with (10.186), we obtain E_{q(z)}[u(z)] = E_{p(z)}[u(z)]: moment matching, i.e. setting the expected sufficient statistics (for a Gaussian, the mean and covariance) of q(z) equal to those of p(z).
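A small numerical check of the moment-matching result (illustrative, with a made-up target p): among Gaussians, the one whose mean and variance equal those of p minimizes KL(p||q), and perturbing either moment can only increase the divergence.

```python
import numpy as np
from scipy import stats

# Target p(z): a skewed mixture, tabulated on a grid.
z = np.linspace(-12, 12, 6001)
dz = z[1] - z[0]
p = 0.7 * stats.norm.pdf(z, -1, 0.8) + 0.3 * stats.norm.pdf(z, 4, 1.5)

mean_p = np.sum(z * p) * dz                      # E_p[z]
var_p = np.sum((z - mean_p) ** 2 * p) * dz       # Var_p[z]

def kl_pq(m, s):
    q = stats.norm.pdf(z, m, s)
    return np.sum(p * (np.log(p + 1e-300) - np.log(q + 1e-300))) * dz

print("KL(p||q) at moment-matched q :", kl_pq(mean_p, np.sqrt(var_p)))
print("KL(p||q) with mean shifted   :", kl_pq(mean_p + 0.5, np.sqrt(var_p)))
print("KL(p||q) with variance scaled:", kl_pq(mean_p, 1.3 * np.sqrt(var_p)))
```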
91. Expectation Propagation
10.7 Expectation Propagation 91
• Assume that the joint distribution of the data D and the hidden variables and parameters θ is a product of factors:
p(D, θ) = ∏_i f_i(θ)   (10.188)
• Quantities of interest: the posterior distribution
p(θ|D) = (1 / p(D)) ∏_i f_i(θ)   (10.189)
and the model evidence
p(D) = ∫ ∏_i f_i(θ) dθ   (10.190)
92. Expectation Propagation
10.7 Expectation Propagation 92
• Expectation propagation is based on an approximation to the posterior distribution that is also given by a product of factors, q(θ) ∝ ∏_i f̃_i(θ), where each approximating factor f̃_i(θ) comes from the exponential family.
• Ideally, we would like to determine the f̃_i by minimizing the KL divergence between the true posterior and the approximation, but this is intractable.
• Minimizing the KL divergence between each pair of factors f_i and f̃_i independently is tractable, but the resulting product is usually a poor approximation.
• EP: optimize each factor in turn, using the current values of all the remaining factors (this works well for logistic-type models but can fail for mixture-type models due to multi-modality).
96. The clutter problem
10.7 Expectation Propagation 96
• The observations follow a mixture of Gaussians of the form (10.209), where w is the proportion of the background (clutter) component and is assumed to be known. The prior over θ is taken to be Gaussian (10.210).
• The joint distribution of the N observations and θ is given by (10.211).
• To apply EP, first identify the factors (10.212), then choose the approximating exponential family.
• The factor approximations then take the form of exponentials of quadratic functions, i.e. scaled Gaussians in θ (10.213).
98. The clutter problem
10.7 Expectation Propagation 98
• Evaluate the approximation to the model evidence, given by (10.223), where the normalizing quantities are defined in (10.224).
99. Expectation Propagation on graphs
10.7 Expectation Propagation 99
• In graphical models, the factors are typically functions of only a subset of the variables. If the approximating distribution is fully factorized, EP reduces to loopy belief propagation.
• We seek an approximation q(x) that has the same factorization as the true distribution (10.225), (10.226).
100. Expectation Propagation on graphs
10.7 Expectation Propagation 100
• Restrict attention to approximations in which the factors themselves factorize with respect to the individual variables, as in (10.227).
101. Expectation Propagation on graphs
10.7 Expectation Propagation 101
• Suppose all the factors have been initialized and we choose to refine a particular factor.
• Minimizing the reverse KL divergence KL(p||q) when q factorizes leads to an optimal solution whose factors are the corresponding marginals of p.
107. Expectation propagation
10.7 Expectation Propagation 107
• Sum-product belief propagation arises as a special case of EP when a fully factorized approximating distribution is used.
• EP can be seen as a way to generalize this: group factors and update them together, using a partially connected graph.
• An open question remains: how to choose the best grouping and disconnection?
• Summary: EP and variational message passing correspond to the optimization of two different KL divergences.
• Minka (2005) gives a more general point of view using the family of alpha-divergences, which includes KL and reverse KL but also other divergences such as the Hellinger distance and the chi-squared distance.
• He shows that by choosing to optimize one or another of these divergences, one can derive a broad range of message-passing algorithms, including variational message passing, loopy BP, EP, tree-reweighted BP, fractional BP and power EP.
108. Summary
Variational Inference 108
10.1 Variational Inference, 10.2 Variational Mixture of Gaussians, 10.3 Variational Linear Regression, 10.4 Exponential Family Distributions, 10.5 Local Variational Methods, 10.6 Variational Logistic Regression, 10.7 Expectation Propagation
Part I: Probabilistic modeling and the variational principle. Part II: Designing the variational algorithms.