Variational Inference
Presenter: Shuai Zhang, CSE, UNSW
Content
1. Brief Introduction
2. Core Idea of VI; Optimization
3. Example: Bayesian Mixture of Gaussians
What is Variational Inference?
Variational Bayesian methods are a family of techniques for
approximating intractable integrals arising in Bayesian inference
and machine learning [Wiki].
It is widely used to approximate posterior densities for Bayesian
models as an alternative to Markov chain Monte Carlo (MCMC) sampling;
it tends to be faster and easier to scale to large data.
It has been applied to problems such as document analysis,
computational neuroscience and computer vision.
Core Idea
Consider a general problem of Bayesian inference. Let the latent
variables in our problem be z and the observed data be x.
Inference in a Bayesian model amounts to conditioning on the data
and computing the posterior p(z|x).
Approximate Inference
The inference problem is to compute the conditional density given by the equation below.
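Written out (with z the latent variables and x the data, as above), the conditional and the evidence are:

```latex
p(z \mid x) = \frac{p(z, x)}{p(x)}, \qquad p(x) = \int p(z, x)\, dz .
```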
The denominator is the marginal distribution of the data (the
evidence), obtained by marginalizing all the latent variables out of
the joint distribution p(x,z).
For many models, this evidence integral is unavailable in closed
form or requires exponential time to compute. Yet the evidence is
exactly what we need to compute the conditional from the joint; this
is why inference in such models is hard.
MCMC
In MCMC, we first construct an ergodic Markov chain on z whose
stationary distribution is the posterior p(z|x).
Then, we sample from the chain to collect samples from the
stationary distribution.
Finally, we approximate the posterior with an empirical estimate
constructed from the collected samples.
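The slides describe the procedure only at a high level; as a minimal sketch of the idea (not anything prescribed here), a random-walk Metropolis sampler needs only the log joint log p(x,z), since the intractable evidence cancels in the acceptance ratio. The function name, Gaussian proposal, and step size below are illustrative choices.

```python
import numpy as np

def metropolis_hastings(log_joint, z0, n_samples=5000, step=0.5, rng=None):
    """Random-walk Metropolis targeting p(z|x), known only through log p(x,z)."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z0, dtype=float)
    log_p = log_joint(z)
    samples = []
    for _ in range(n_samples):
        proposal = z + step * rng.standard_normal(z.shape)  # symmetric proposal
        log_p_prop = log_joint(proposal)
        # Accept with probability min(1, p(proposal, x) / p(z, x)).
        if np.log(rng.uniform()) < log_p_prop - log_p:
            z, log_p = proposal, log_p_prop
        samples.append(z.copy())
    # The empirical distribution of the samples approximates the posterior.
    return np.asarray(samples)
```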
VI vs. MCMC
MCMC:
• More computationally intensive
• Guarantees producing asymptotically exact samples from the target distribution
• Slower
• Best for precise inference

VI:
• Less computationally intensive
• No such guarantees
• Faster, especially for large data sets and complex distributions
• Useful for exploring many scenarios quickly or for large data sets
Core Idea
Rather than sampling, the main idea behind variational inference is
to use optimization.
We restrict ourselves to a family of approximate distributions D over
the latent variables, and then try to find the member of that family
that minimizes the Kullback-Leibler (KL) divergence to the exact
posterior. Inference thus reduces to solving an optimization problem.
The goal is to approximate p(z|x) with the resulting q(z); we
optimize q(z) to minimize the KL divergence.
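In symbols, with D the chosen family of approximate distributions:

```latex
q^{*}(z) = \operatorname*{arg\,min}_{q(z) \in \mathcal{D}} \ \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)
```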
Core Idea
This objective is not directly computable, because the KL divergence
itself depends on the log evidence log p(x), which is exactly the
quantity we cannot compute.
Because we cannot compute the KL, we optimize an alternative
objective that equals the negative KL divergence up to an added
constant: the evidence lower bound (ELBO), familiar from our
discussion of EM.
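Expanding the KL divergence makes the problem visible: it contains the log evidence.

```latex
\mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big)
  = \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z \mid x)]
  = \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z, x)] + \log p(x)
```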
Core Idea
Thus, we have the objective function:
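In symbols (all expectations are with respect to q(z)):

```latex
\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(z, x)] - \mathbb{E}_q[\log q(z)]
               = \log p(x) - \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big)
```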
Maximizing the ELBO is equivalent to minimizing the KL
divergence.
Intuition: we can rewrite the ELBO as the expected log likelihood of
the data minus the KL divergence between q(z) and the prior p(z).
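Expanding log p(z,x) = log p(z) + log p(x|z) in the ELBO gives:

```latex
\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(x \mid z)] - \mathrm{KL}\big(q(z)\,\|\,p(z)\big)
```

The first term rewards variational densities that explain the observed data; the second penalizes densities that move far from the prior.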
Mean field approximation
Now that we have specified the variational objective (the ELBO), we
need to specify the variational family of distributions from which we
pick the approximate variational distribution.
A common choice is the mean-field variational family, in which the
latent variables are mutually independent and each is governed by its
own factor in the variational distribution.
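For latent variables z = (z_1, ..., z_m), the mean-field family consists of densities that factorize as:

```latex
q(z) = \prod_{j=1}^{m} q_j(z_j)
```

Each factor q_j has its own variational parameters; the factorization buys tractability at the cost of ignoring correlations between the latent variables.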
Coordinate ascent mean-field VI
Having specified our objective function and the variational family
from which to pick the approximation, we now turn to the optimization
itself.
Coordinate ascent variational inference (CAVI) maximizes the ELBO by
iteratively optimizing each variational factor of the mean-field
distribution while holding the others fixed. It climbs to a local
optimum of the ELBO; it does not, however, guarantee finding the
global optimum.
Coordinate ascent mean-field VI
Given that we fix the values of all the other variational factors
q_l(z_l) (l ≠ j), the optimal q_j(z_j) is proportional to the
exponentiated expected log of the complete conditional. This is in
turn proportional to the exponentiated expected log of the joint,
because the complete conditional differs from the joint only by a
factor that does not depend on z_j, and the expectation is taken with
respect to the remaining (independent) mean-field factors.
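Writing E_{-j} for the expectation with respect to the product of the other factors q_l(z_l), l ≠ j:

```latex
q_j^{*}(z_j) \;\propto\; \exp\!\big\{\mathbb{E}_{-j}\big[\log p(z_j \mid z_{-j}, x)\big]\big\}
           \;\propto\; \exp\!\big\{\mathbb{E}_{-j}\big[\log p(z_j, z_{-j}, x)\big]\big\}
```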
Coordinate ascent mean-field VI
Below, we rewrite the first term using iterated expectation; in the
second term, we retain only the piece that depends on q_j, since the
other factors are held fixed.
In the final equation, the right-hand side is equal (up to an
additive constant) to the negative KL divergence between q_j and the
normalized density proportional to exp(A), where
A(z_j) = E_{-j}[log p(z_j, z_{-j}, x)]. Maximizing this expression is
therefore the same as minimizing that KL divergence.
This occurs when q_j(z_j) ∝ exp(A(z_j)).
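Spelled out, viewing the ELBO as a function of q_j alone (terms not involving q_j are absorbed into the constant), and writing \tilde{p}_j simply as a name for the normalized version of exp(A):

```latex
\mathrm{ELBO}(q_j)
  = \mathbb{E}_j\big[\mathbb{E}_{-j}[\log p(z_j, z_{-j}, x)]\big]
    - \mathbb{E}_j[\log q_j(z_j)] + \mathrm{const}
  = -\,\mathrm{KL}\big(q_j(z_j)\,\big\|\,\tilde{p}_j(z_j)\big) + \mathrm{const},
\qquad \tilde{p}_j(z_j) \propto \exp\{A(z_j)\}
```

Maximizing the ELBO in q_j therefore sets q_j = \tilde{p}_j, which recovers the CAVI update above.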
Coordinate ascent mean-field VI
Bayesian Mixture of Gaussians
The full hierarchical model and the corresponding joint distribution
are given below.
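Assuming the model is the K-component mixture of unit-variance univariate Gaussians used in reference [4], with one-hot assignment vectors c_i, the hierarchical model and joint density are:

```latex
\mu_k \sim \mathcal{N}(0, \sigma^2), \quad k = 1, \dots, K
\qquad
c_i \sim \mathrm{Categorical}(1/K, \dots, 1/K), \quad i = 1, \dots, n
\qquad
x_i \mid c_i, \mu \sim \mathcal{N}(c_i^{\top}\mu,\, 1)

p(\mu, c, x) = p(\mu) \prod_{i=1}^{n} p(c_i)\, p(x_i \mid c_i, \mu)
```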
Bayesian Mixture of Gaussians
The mean-field variational family contains approximate posterior
densities of the form
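Under the same assumptions as above (Gaussian factors for the component means, categorical factors for the cluster assignments):

```latex
q(\mu, c) = \prod_{k=1}^{K} q(\mu_k;\, m_k, s_k^{2}) \prod_{i=1}^{n} q(c_i;\, \varphi_i)
```

Each q(\mu_k; m_k, s_k^2) is Gaussian with mean m_k and variance s_k^2, and each q(c_i; \varphi_i) is categorical with assignment probabilities \varphi_i.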
Bayesian Mixture of Gaussians
We first derive the ELBO as a function of the variational factors,
writing it explicitly in terms of the variational parameters.
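Substituting the mean-field family into the ELBO and using the joint density above gives the following decomposition (following reference [4]; each expectation is taken with respect to the indicated variational factors):

```latex
\mathrm{ELBO}(m, s^{2}, \varphi)
 = \sum_{k=1}^{K} \mathbb{E}\big[\log p(\mu_k);\, m_k, s_k^{2}\big]
 + \sum_{i=1}^{n} \Big( \mathbb{E}\big[\log p(c_i);\, \varphi_i\big]
   + \mathbb{E}\big[\log p(x_i \mid c_i, \mu);\, \varphi_i, m, s^{2}\big] \Big)
 - \sum_{i=1}^{n} \mathbb{E}\big[\log q(c_i; \varphi_i)\big]
 - \sum_{k=1}^{K} \mathbb{E}\big[\log q(\mu_k; m_k, s_k^{2})\big]
```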
Bayesian Mixture of Gaussians
Next, we derive the CAVI update for the variational factors.
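Assuming the unit-variance mixture model above, the CAVI updates have closed forms (derived in reference [4]): the assignment probabilities satisfy \varphi_{ik} \propto \exp\{ m_k x_i - (s_k^2 + m_k^2)/2 \}, and each mean factor is updated with m_k = \sum_i \varphi_{ik} x_i / (1/\sigma^2 + \sum_i \varphi_{ik}) and s_k^2 = 1 / (1/\sigma^2 + \sum_i \varphi_{ik}). A minimal NumPy sketch of the resulting algorithm follows; the initialization, iteration count, and toy data are illustrative choices.

```python
import numpy as np

def cavi_gmm(x, K, sigma2=10.0, n_iters=100, rng=None):
    """CAVI for a Bayesian mixture of K unit-variance Gaussians.

    Priors: mu_k ~ N(0, sigma2), c_i ~ Categorical(1/K, ..., 1/K),
            x_i | c_i, mu ~ N(mu_{c_i}, 1).
    Variational family: q(mu_k) = N(m_k, s2_k), q(c_i) = Categorical(phi_i).
    """
    rng = np.random.default_rng() if rng is None else rng
    m = rng.standard_normal(K)          # variational means
    s2 = np.ones(K)                     # variational variances
    phi = rng.dirichlet(np.ones(K), size=x.shape[0])  # n x K responsibilities
    for _ in range(n_iters):
        # Assignment update: phi_ik ∝ exp{ m_k x_i - (s2_k + m_k^2) / 2 }.
        logits = np.outer(x, m) - 0.5 * (s2 + m**2)
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)
        # Mean-factor update: posterior precision is 1/sigma2 + sum_i phi_ik.
        nk = phi.sum(axis=0)
        s2 = 1.0 / (1.0 / sigma2 + nk)
        m = s2 * (phi.T @ x)
    return m, s2, phi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: three well-separated clusters.
    x = np.concatenate([rng.normal(mu, 1.0, 100) for mu in (-5.0, 0.0, 5.0)])
    m, s2, phi = cavi_gmm(x, K=3, rng=rng)
    print("approximate component means:", np.sort(m))
```

Each sweep updates all the assignment factors given the current Gaussian factors, then all the Gaussian factors given the new assignments; in practice one also tracks the ELBO and stops when it stops improving.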
References
1. https://am207.github.io/2017/wiki/VI.html
2. https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
3. https://www.cs.cmu.edu/~epxing/Class/10708-17/notes-17/10708-scribe-lecture13.pdf
4. https://arxiv.org/pdf/1601.00670.pdf
Week Report
• Last week
• Metric Factorization model
• Learning Group
• This week
• Submit the ICDE paper
