This document discusses variational inference with Rényi divergence. It summarizes variational autoencoders (VAEs), which are deep generative models that parametrize a variational approximation with a recognition network. VAEs define a generative model as a hierarchical latent variable model and approximate the intractable true posterior using variational inference. The document explores using Rényi divergence as an alternative to the evidence lower bound objective of VAEs, as it may provide tighter variational bounds.
Sequential quasi-Monte Carlo (SQMC) is a quasi-Monte Carlo (QMC) version of sequential Monte Carlo (or particle filtering), a popular class of Monte Carlo techniques used to carry out inference in state space models. In this talk I will first review the SQMC methodology as well as some theoretical results. Although SQMC converges faster than the usual Monte Carlo error rate, its performance deteriorates quickly as the dimension of the hidden variable increases. However, I will show with an example that SQMC may perform well for some "high" dimensional problems. I will conclude this talk with some open problems and potential applications of SQMC in complicated settings.
* ML in HEP
* classification and regression
* knn classification and regression
* ROC curve
* optimal bayesian classifier
* Fisher's QDA
* intro to Logistic Regression
A gentle introduction to 2 classification techniques, as presented by Kriti Puniyani to the NYC Predictive Analytics group (April 14, 2011). To download the file please go here: http://www.meetup.com/NYC-Predictive-Analytics/files/
* Logistic regression, logistic loss (log loss)
* stochastic optimization
* adding new features, generalized linear model
* Kernel trick, intro to SVM
* Overfitting
* Decision trees for classification and regression
* Building trees greedily: Gini index, entropy
* Trees fighting with overfitting: pre-stopping and post-pruning
* Feature importances
This talk introduces a new way to compact a (possibly non-uniform) probability distribution “F” into a set of representative points, called support points. These point sets can have important uses both for small-data problems, such as experimental design and uncertainty quantification in engineering applications, and for big-data problems, such as the optimal reduction of large datasets in Bayesian computation. We first present support points as the minimizer of a powerful goodness-of-fit test called the energy distance, and discuss why such point sets are appealing to use for simulation and integration. An extension of this point set, called projected support points, is then introduced for high-dimensional integration under non-uniform “F”. We show that support points (and their variants) can provide good solutions to the aforementioned small-data and big-data problems. This talk concludes with some new ideas and ongoing work on experimental design, potential theory and robust optimization.
A review of methods for combining machine learning models, from Bayesian model averaging and committees to boosting. In particular, a statistical analysis of boosting is presented.
A generalized class of normalized distance functions called Q-Metrics is described in this presentation. The Q-Metrics approach relies on a unique functional, using a single bounded parameter (Lambda), which characterizes the conventional distance functions in a normalized per-unit metric space. In addition to this coverage property, a distinguishing and extremely attractive characteristic of the Q-Metric function is its low computational complexity. Q-Metrics satisfy the standard metric axioms. Novel networks for classification and regression tasks are defined and constructed using Q-Metrics. These new networks are shown to outperform conventional feed forward back propagation networks with the same size when tested on real data sets.
Paper reading slides: "A review of unsupervised feature learning and deep learning for time-s... — Kaoru Nasuno
Paper reading material for the deeplearning.jp study group on April 16, 2015.
A summary of "A review of unsupervised feature learning and deep learning for time-series modeling" (a survey of unsupervised representation learning and deep learning for time-series modeling).
An introduction to the theory behind Bayesian Deep Learning, a topic of much recent interest, and its recent applications. The talk briefly explains the theory of Bayesian inference and then introduces the theory and applications of Yarin Gal's Monte Carlo Dropout.
Variational inference is a technique for estimating Bayesian models that provides similar precision to MCMC at a greater speed, and is one of the main areas of current research in Bayesian computation. In this introductory talk, we take a look at the theory behind the variational approach and some of the most common methods (e.g. mean field, stochastic, black box). The focus of this talk is the intuition behind variational inference, rather than the mathematical details of the methods. At the end of this talk, you will have a basic grasp of variational Bayes and its limitations.
A Probabilistic Algorithm for Computation of Polynomial Greatest Common with ...mathsjournal
In earlier work, Knuth presented an algorithm, called the subresultant algorithm, to reduce coefficient growth in the Euclidean algorithm for polynomials. However, the output polynomials may have a small factor which can be removed. Later, Brown of Bell Telephone Laboratories reformulated the subresultant algorithm by adding a variant called τ and gave a way to compute it. Nevertheless, that method fails to determine every τ correctly.
In this paper, we give a probabilistic algorithm that determines the variant τ correctly in most cases by adding a few steps instead of computing t(x), given f(x), g(x) ∈ ℤ[x], where t(x) satisfies s(x)f(x) + t(x)g(x) = r(x), with t(x), s(x) ∈ ℤ[x].
Topic of presentation: Variational autoencoders for speech processing
The main points of the presentation: Variational autoencoders (VAEs) have become one of the most popular unsupervised learning techniques for modelling complex data distributions, such as images and audio. In this talk I'll begin with a general introduction to VAEs and then review a recent technique called VQ-VAE, which is capable of learning a rudimentary phoneme-level language model from raw audio without any supervision.
http://dataconf.com.ua/speaker-page/dmytro-bielievtsov.php
https://www.youtube.com/watch?v=euYSAL-aKMI&list=PL5_LBM8-5sLjbRFUtXaUpg84gtJtyc4Pu&t=0s&index=9
International Journal of Engineering Research and DevelopmentIJERD Editor
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
To make Reinforcement Learning Algorithms work in the real-world, one has to get around (what Sutton calls) the "deadly triad": the combination of bootstrapping, function approximation and off-policy evaluation. The first step here is to understand Value Function Vector Space/Geometry and then make one's way into Gradient TD Algorithms (a big breakthrough to overcome the "deadly triad").
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
2. Paper information
¤ Published on arXiv, 2016
¤ Authors: Yingzhen Li and Richard E. Turner
¤ University of Cambridge
¤ Li also presented "Stochastic Expectation Propagation" at NIPS
¤ Proposes variational inference with the Rényi divergence
¤ Generalises the VAE and importance weighted AE [Burda et al., 2015] objectives
¤ Further details in the Appendix
4. Related work
¤ SVI [Hoffman et al., 2013]
¤ SEP [Li et al., 2015]
¤ Black-box variational inference [Ranganath et al., 2014]
¤ Black-box alpha (BB-α) [Hernandez-Lobato et al., 2015]
¤ Importance weighted AE (IWAE) [Burda et al., 2015]
¤ VAE (ICLR 2016)
5. Background
¤ Approximate the intractable true posterior with a tractable q by minimising the KL divergence
principle literature [Grünwald, 2007].

2.2 Variational Inference

Next we review the variational inference algorithm [Jordan et al., 1999, Beal, 2003] from an optimisation perspective, using posterior approximation as a running example. Consider observing a dataset of N i.i.d. samples D = {x_n}_{n=1}^N from a probabilistic model p(x|θ) parametrised by a random variable θ that is drawn from a prior p_0(θ). Bayesian inference involves computing the posterior distribution of the parameters given the data,

p(θ|D) = p(θ, D) / p(D) = p_0(θ) ∏_{n=1}^N p(x_n|θ) / p(D),

where p(D) = ∫ p_0(θ) ∏_{n=1}^N p(x_n|θ) dθ is often called the marginal likelihood or model evidence. For many powerful models, including Bayesian neural networks, the true posterior is typically intractable. Variational inference introduces an approximation q(θ) to the true posterior, which is obtained by minimising the KL divergence in some tractable distribution family Q:

q(θ) = argmin_{q ∈ Q} KL[q(θ) || p(θ|D)].   (8)

However the KL divergence in (8) is also intractable, mainly because of the difficult term p(D). Variational inference sidesteps this difficulty by considering an equivalent optimisation problem:

q(θ) = argmax_{q ∈ Q} L_VI(q; D),   (9)

where the variational lower-bound or evidence lower-bound (ELBO) L_VI(q; D) is defined by

L_VI(q; D) = log p(D) − KL[q(θ) || p(θ|D)]   (10)
           = E_q[ log (p(θ, D) / q(θ)) ].
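As a concrete illustration, the ELBO can be estimated by simple Monte Carlo. The sketch below uses a toy conjugate model of our own choosing (scalar θ, standard-normal prior, unit-variance Gaussian likelihood), not anything from the paper; the names `elbo`, `log_prior`, and `log_lik` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (our own choice): prior p0(theta) = N(0, 1),
# likelihood p(x | theta) = N(theta, 1), observing 20 samples.
data = rng.normal(2.0, 1.0, size=20)

def log_prior(theta):
    return -0.5 * (theta**2 + np.log(2 * np.pi))

def log_lik(theta):
    return np.sum(-0.5 * ((data - theta) ** 2 + np.log(2 * np.pi)))

def elbo(mu, sigma, n_samples=5000):
    """Monte Carlo estimate of E_q[log p(theta, D) - log q(theta)]."""
    thetas = rng.normal(mu, sigma, size=n_samples)
    log_q = -0.5 * (((thetas - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma**2))
    log_joint = np.array([log_prior(t) + log_lik(t) for t in thetas])
    return np.mean(log_joint - log_q)

# The exact posterior here is Gaussian, so a Gaussian q can match it and
# the ELBO is maximised (equal to log p(D)) at the true posterior parameters.
n = len(data)
post_mu = np.sum(data) / (n + 1)
post_sigma = np.sqrt(1.0 / (n + 1))
print(elbo(post_mu, post_sigma))  # close to log p(D)
print(elbo(0.0, 1.0))             # a mismatched q gives a lower ELBO
```

Because the gap between the two values is exactly the KL term in (10), this also shows why maximising the ELBO drives q towards the true posterior.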
6. VAE
¤ Variational auto-encoder [Kingma et al., 2014]
¤ Hierarchical latent variable model with hidden units h
Variational Auto-encoder with Rényi Divergence

The variational auto-encoder (VAE) [Kingma and Welling, 2014, Rezende et al., 2014] is a recently proposed (deep) generative model that parametrizes the variational approximation with a recognition network. The generative model is specified as a hierarchical latent variable model:

p(x) = Σ_{h^(1) ... h^(L)} p(h^(L)) p(h^(L−1)|h^(L)) ··· p(x|h^(1)).   (14)

Here we drop the parameters θ but keep in mind that they will be learned using approximate maximum likelihood. However for these models the exact computation of log p(x) requires marginalisation of all hidden variables and is thus often intractable. Variational expectation-maximisation (EM) methods come to the rescue by approximating

log p(x) ≈ L_VI(q; x) = E_{q(h|x)}[ log (p(x, h) / q(h|x)) ],   (15)

where h collects all the hidden variables h^(1), ..., h^(L) and the approximate posterior q(h|x) is defined as

q(h|x) = q(h^(1)|x) q(h^(2)|h^(1)) ··· q(h^(L)|h^(L−1)).   (16)

In variational EM, optimisation for q and p are alternated to guarantee convergence. However the core idea of VAE is to jointly optimise p and q, which instead has no guarantee of increasing the MLE objective function in each iteration. Indeed jointly the method is biased [Turner and Sahani, 2011]. This explores the possibility that alternative surrogate functions might return estimates that are tighter bounds. So the VR bound is considered in this context:

L_α(q; x) = (1 / (1−α)) log E_{q(h|x)}[ (p(x, h) / q(h|x))^(1−α) ].   (17)
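The VR bound in Eqn. 17 can be estimated by Monte Carlo with a numerically stable log-mean-exp. The toy one-latent-variable model below (h ~ N(0,1), x|h ~ N(h,1), deliberately crude q) is our own illustrative choice, not from the paper; `vr_bound` and `log_w` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (our own choice): joint p(x, h) with h ~ N(0, 1),
# x | h ~ N(h, 1), and a mismatched proposal q(h | x) = N(0, 1).
x = 1.5

def log_w(h):
    # log p(x, h) - log q(h | x) for the toy model above
    log_p = -0.5 * h**2 - 0.5 * (x - h) ** 2 - np.log(2 * np.pi)
    log_q = -0.5 * h**2 - 0.5 * np.log(2 * np.pi)
    return log_p - log_q

def vr_bound(alpha, n=200_000):
    """MC estimate of L_alpha = 1/(1-alpha) log E_q[(p/q)^(1-alpha)]."""
    h = rng.normal(0.0, 1.0, size=n)
    lw = (1 - alpha) * log_w(h)
    m = lw.max()  # numerically stable log-mean-exp
    return (m + np.log(np.mean(np.exp(lw - m)))) / (1 - alpha)

# alpha -> 1 recovers the ELBO; alpha -> 0 recovers log p(x).
elbo = np.mean(log_w(rng.normal(0.0, 1.0, size=200_000)))
print(vr_bound(0.5), elbo)  # smaller alpha gives a tighter (larger) bound
```

This reproduces numerically the key property motivating the VR bound: L_α is non-increasing in α, so α < 1 yields a bound at least as tight as the ELBO.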
p(x|θ) = Σ_{h^1, ..., h^L} p(h^L|θ) p(h^(L−1)|h^L, θ) ··· p(x|h^1, θ).   (1)

Here, θ is a vector of parameters of the variational autoencoder, and h = {h^1, ..., h^L} denotes the stochastic hidden units, or latent variables. The dependence on θ is often suppressed for clarity. For convenience, we define h^0 = x. Each of the terms p(h^ℓ|h^(ℓ+1)) may denote a complicated nonlinear relationship, for instance one computed by a multilayer neural network. However, it is assumed that sampling and probability evaluation are tractable for each p(h^ℓ|h^(ℓ+1)). Note that L denotes the number of stochastic hidden layers; the deterministic layers are not shown explicitly here. We assume the recognition model q(h|x) is defined in terms of an analogous factorization:

q(h|x) = q(h^1|x) q(h^2|h^1) ··· q(h^L|h^(L−1)),   (2)

where sampling and probability evaluation are tractable for each of the terms in the product.

In this work, we assume the same families of conditional probability distributions as Kingma and Welling (2014). In particular, the prior p(h^L) is fixed to be a zero-mean, unit-variance Gaussian. In general, each of the conditional distributions p(h^ℓ|h^(ℓ+1)) and q(h^ℓ|h^(ℓ−1)) is a Gaussian with diagonal covariance, where the mean and covariance parameters are computed by a deterministic feed-forward neural network. For real-valued observations, p(x|h^1) is also defined to be such a Gaussian; for binary observations, it is defined to be a Bernoulli distribution whose mean parameters are computed by a neural network.

The VAE is trained to maximize a variational lower bound on the log-likelihood, as derived from Jensen's Inequality:

log p(x) = log E_{q(h|x)}[ p(x, h) / q(h|x) ] ≥ E_{q(h|x)}[ log (p(x, h) / q(h|x)) ] = L(x).   (3)

Since L(x) = log p(x) − D_KL(q(h|x) || p(h|x)), the training procedure is forced to trade off the
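The Jensen-inequality bound in Eqn. 3 can be checked numerically. The sketch below uses a toy model of our own choosing where log p(x) is available in closed form (h ~ N(0,1), x|h ~ N(h,1), so p(x) = N(x; 0, 2)) and a deliberately crude q(h|x) = N(0,1).

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy model (our own choice): p(h) = N(0, 1), p(x|h) = N(h, 1),
# recognition model q(h|x) = N(0, 1).
x = 1.0
h = rng.normal(size=500_000)
# log w = log p(x, h) - log q(h|x); the prior and q terms cancel here
log_w = -0.5 * (x - h) ** 2 - 0.5 * np.log(2 * np.pi)

elbo = np.mean(log_w)                               # L(x) = E_q[log w]
m = log_w.max()                                     # stable log-mean-exp
log_px_mc = m + np.log(np.mean(np.exp(log_w - m)))  # log E_q[w] = log p(x)
log_px_exact = -0.5 * np.log(2 * np.pi * 2) - x**2 / 4  # p(x) = N(x; 0, 2)
print(elbo, log_px_mc, log_px_exact)  # elbo <= log p(x)
```

The gap between `log_px_mc` and `elbo` is exactly D_KL(q(h|x) || p(h|x)), which is what the training procedure trades off against the likelihood.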
7. VAE
¤ The reparameterization trick
Under review as a conference paper at ICLR 2016

Kingma and Welling (2014) proposed a reparameterization of the recognition distribution in terms of auxiliary variables with fixed distributions, such that the samples from the recognition model are deterministic functions of the inputs and auxiliary variables. While they presented the reparameterization trick for a variety of distributions, for convenience we discuss the special case of Gaussians, which is the case used in this work. (The general reparameterization trick can be used with our method as well.)

The recognition distribution q(h^ℓ|h^(ℓ−1), θ) always takes the form of a Gaussian whose mean and covariance are computed from the states of the hidden units at the previous layer and the model parameters. This can be alternatively expressed by first sampling an auxiliary variable ε^ℓ ∼ N(0, I), and then applying the deterministic mapping

h^ℓ(ε^ℓ, h^(ℓ−1), θ) = Σ(h^(ℓ−1), θ)^(1/2) ε^ℓ + µ(h^(ℓ−1), θ).   (4)

The joint recognition distribution q(h|x, θ) over all latent variables can be expressed in terms of a deterministic mapping h(ε, x, θ), with ε = (ε^1, ..., ε^L), by applying Eqn. 4 for each layer in sequence. Since the distribution of ε does not depend on θ, we can reformulate the gradient of the bound L(x) from Eqn. 3 by pushing the gradient operator inside the expectation:

∇_θ log E_{h∼q(h|x,θ)}[ p(x, h|θ) / q(h|x, θ) ] = ∇_θ E_{ε^1,...,ε^L ∼ N(0,I)}[ log (p(x, h(ε, x, θ)|θ) / q(h(ε, x, θ)|x, θ)) ]   (5)
 = E_{ε^1,...,ε^L ∼ N(0,I)}[ ∇_θ log (p(x, h(ε, x, θ)|θ) / q(h(ε, x, θ)|x, θ)) ].   (6)

Assuming the mapping h is represented as a deterministic feed-forward neural network, for a fixed ε, the gradient inside the expectation can be computed using standard backpropagation. In practice, one approximates the expectation in Eqn. 6 by generating k samples of ε and applying the Monte Carlo estimator

(1/k) Σ_{i=1}^k ∇_θ log w(x, h(ε_i, x, θ), θ)   (7)

with w(x, h, θ) = p(x, h|θ)/q(h|x, θ). This is an unbiased estimate of ∇_θ L(x). We note that the VAE update and the basic REINFORCE-like update are both unbiased estimators of the same gradient, but the VAE update tends to have lower variance in practice because it makes use of the
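The reparameterized Monte Carlo estimator (7) can be checked by hand on a model small enough to differentiate analytically. This sketch is a toy example of our own choosing (one latent variable, Gaussian p and q), for which ∇_µ L(µ) has a closed form; `grad_mu_log_w` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy example (our own choice): p(h) = N(0, 1), p(x|h) = N(h, 1),
# recognition model q(h|x) = N(mu, sigma^2), reparameterized as
# h = mu + sigma * eps with eps ~ N(0, 1) (Eqn. 4).
x, mu, sigma = 1.5, 0.3, 0.8

def grad_mu_log_w(eps):
    h = mu + sigma * eps  # Eqn. 4: deterministic mapping
    # d/dmu [log p(h) + log p(x|h) - log q(h|x)]; for fixed eps the
    # log q term is constant in mu, leaving -h + (x - h)
    return -h + (x - h)

k = 100_000
eps = rng.normal(size=k)
estimate = np.mean(grad_mu_log_w(eps))  # Monte Carlo estimator (7)
exact = x - 2 * mu                      # closed-form gradient for this model
print(estimate, exact)                  # the two should agree closely
```

This is exactly the mechanism the passage describes: the gradient passes through the deterministic mapping rather than through the sampling distribution, which is what gives the estimator its low variance relative to REINFORCE-like updates.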
8. VAE
¤ The SGVB estimators for the VAE
¤ The KL term of the VAE objective can often be computed analytically
In this section we introduce a practical estimator of the lower bound and its derivatives w.r.t. the parameters. We assume an approximate posterior in the form q_φ(z|x), but please note that the technique can be applied to the case q_φ(z), i.e. where we do not condition on x, as well. The fully variational Bayesian method for inferring a posterior over the parameters is given in the appendix.

Under certain mild conditions outlined in section 2.4, for a chosen approximate posterior q_φ(z|x) we can reparameterize the random variable z̃ ∼ q_φ(z|x) using a differentiable transformation g_φ(ε, x) of an (auxiliary) noise variable ε:

$$\tilde{z} = g_\phi(\epsilon, x) \quad \text{with} \quad \epsilon \sim p(\epsilon) \qquad (4)$$

See section 2.4 for general strategies for choosing such an appropriate distribution p(ε) and function g_φ(ε, x). We can now form Monte Carlo estimates of expectations of some function f(z) w.r.t. q_φ(z|x) as follows:

$$\mathbb{E}_{q_\phi(z|x^{(i)})}[f(z)] = \mathbb{E}_{p(\epsilon)}\big[f(g_\phi(\epsilon, x^{(i)}))\big] \simeq \frac{1}{L}\sum_{l=1}^{L} f(g_\phi(\epsilon^{(l)}, x^{(i)})) \quad \text{where} \quad \epsilon^{(l)} \sim p(\epsilon) \qquad (5)$$

We apply this technique to the variational lower bound (eq. (2)), yielding our generic Stochastic Gradient Variational Bayes (SGVB) estimator L̃^A(θ, φ; x^(i)) ≃ L(θ, φ; x^(i)):

$$\widetilde{\mathcal{L}}^{A}(\theta, \phi; x^{(i)}) = \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)}|x^{(i)})$$
$$\text{where} \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}) \quad \text{and} \quad \epsilon^{(l)} \sim p(\epsilon) \qquad (6)$$
  g ← ∇_{θ,φ} L̃^M(θ, φ; X^M, ε)   (gradients of minibatch estimator (8))
  θ, φ ← update parameters using gradients g (e.g. SGD or Adagrad [DHS10])
until convergence of parameters (θ, φ)
return θ, φ
Often, the KL-divergence D_KL(q_φ(z|x^(i)) || p_θ(z)) of eq. (3) can be integrated analytically (see appendix B), such that only the expected reconstruction error E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z)] requires estimation by sampling. The KL-divergence term can then be interpreted as regularizing φ, encouraging the approximate posterior to be close to the prior p_θ(z). This yields a second version of the SGVB estimator L̃^B(θ, φ; x^(i)) ≃ L(θ, φ; x^(i)), corresponding to eq. (3), which typically has less variance than the generic estimator:

$$\widetilde{\mathcal{L}}^{B}(\theta, \phi; x^{(i)}) = -D_{KL}\big(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)\big) + \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(x^{(i)}|z^{(i,l)})$$
$$\text{where} \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}) \quad \text{and} \quad \epsilon^{(l)} \sim p(\epsilon) \qquad (7)$$
Given multiple datapoints from a dataset X with N datapoints, we can construct an estimator of the marginal likelihood lower bound of the full dataset, based on minibatches:

$$\mathcal{L}(\theta, \phi; X) \simeq \widetilde{\mathcal{L}}^{M}(\theta, \phi; X^{M}) = \frac{N}{M}\sum_{i=1}^{M} \widetilde{\mathcal{L}}(\theta, \phi; x^{(i)}) \qquad (8)$$

where the minibatch X^M = {x^(i)}_{i=1}^M is a randomly drawn sample of M datapoints from the full dataset X with N datapoints. In our experiments we found that the number of samples L per datapoint can be set to 1 as long as the minibatch size M was large enough, e.g. M = 100. Derivatives ∇_{θ,φ} L̃(θ; X^M) can be taken, and the resulting gradients can be used in conjunction with stochastic optimization methods such as SGD or Adagrad [DHS10]. See algorithm 1 for a basic approach to compute the stochastic gradients.
A connection with auto-encoders becomes clear when looking at the objective function given at eq. (7). The first term (the KL divergence of the approximate posterior from the prior) acts as a regularizer, while the second term is an expected negative reconstruction error. The function g_φ(·) is chosen such that it maps a datapoint x^(i) and a random noise vector ε^(l) to a sample from the approximate posterior for that datapoint.
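The estimator L̃^B can be sketched on a toy model (prior N(0, 1), decoder p(x|z) = N(z, 1), Gaussian q; the model and all names below are illustrative assumptions, not the paper's architecture), using the analytic Gaussian KL term:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgvb_b(x, mu, sigma, L=1000):
    """Estimator L^B: analytic -KL term plus a sampled reconstruction term."""
    # Analytic KL[N(mu, sigma^2) || N(0, 1)]
    kl = 0.5 * (mu**2 + sigma**2 - 1.0 - np.log(sigma**2))
    eps = rng.standard_normal(L)              # eps^(l) ~ p(eps) = N(0, 1)
    z = mu + sigma * eps                      # z^(l) = g(eps^(l), x)
    recon = -0.5 * (np.log(2 * np.pi) + (x - z) ** 2)  # log p(x|z) for decoder N(z, 1)
    return -kl + recon.mean()
```

With the optimal Gaussian posterior (here N(x/2, 1/2)) the estimate converges to log p(x) = log N(x; 0, 2); only the reconstruction term carries sampling noise, which is why this variant typically has lower variance than the generic estimator.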
9. Importance Weighted AE (IWAE)
¤ Same architecture as the VAE: a generative network plus a recognition network
¤ Trained to maximize a k-sample importance weighted lower bound on log p(x)
¤ k = 1 recovers the VAE objective
¤ The bound is non-decreasing in k and approaches log p(x)
The posterior distribution must be approximately factorial and predictable with a feed-forward neural network. The VAE criterion may be too strict; a recognition network which places only a small fraction (say, 20%) of its samples in the region of high posterior probability may still be sufficient for performing accurate inference. If we lower our standards in this way, this may give us additional flexibility to train a generative network whose posterior distributions do not fit the VAE assumptions. This is the motivation behind our proposed algorithm, the Importance Weighted Autoencoder (IWAE).

The IWAE uses the same architecture as the VAE, with both a generative network and a recognition network. The difference is that it is trained to maximize a different lower bound on log p(x). In particular, we use the following lower bound, corresponding to the k-sample importance weighted estimate of the log-likelihood:

$$\mathcal{L}_k(x) = \mathbb{E}_{h_1,\ldots,h_k \sim q(h|x)}\left[\log \frac{1}{k}\sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i|x)}\right].$$

Here h_1, ..., h_k are sampled independently from the recognition model. The term inside the sum corresponds to the unnormalized importance weights for the joint distribution, which we will denote w_i = p(x, h_i)/q(h_i|x).

This is a lower bound on the marginal log-likelihood, as follows from Jensen's Inequality and the fact that the average importance weights are an unbiased estimator of p(x):

$$\mathcal{L}_k = \mathbb{E}\left[\log \frac{1}{k}\sum_{i} w_i\right] \le \log \mathbb{E}\left[\frac{1}{k}\sum_{i} w_i\right] = \log p(x).$$

(Published as a conference paper at ICLR 2016.)

Theorem 1. For all k, the lower bounds satisfy
$$\log p(x) \ge \mathcal{L}_{k+1} \ge \mathcal{L}_k.$$
Moreover, if p(h, x)/q(h|x) is bounded, then L_k approaches log p(x) as k goes to infinity.

Proof: see Appendix A.
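The bound L_k and its monotone behaviour in k can be checked empirically. The sketch below uses a toy model (prior N(0, 1), likelihood N(h, 1), with a deliberately loose recognition model q equal to the prior); the model and names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(z, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

def iwae_bound(x, k, n_outer=2000):
    """Estimate L_k by averaging log (1/k) sum_i w_i over n_outer draws."""
    h = rng.standard_normal((n_outer, k))           # h_i ~ q(h|x) = N(0, 1)
    # log w_i = log p(h) + log p(x|h) - log q(h|x); prior and q cancel here
    log_w = log_normal(x, h, 1.0)
    # log (1/k) sum_i w_i, computed stably with log-sum-exp
    m = log_w.max(axis=1, keepdims=True)
    log_avg_w = m[:, 0] + np.log(np.exp(log_w - m).mean(axis=1))
    return log_avg_w.mean()
```

For this model log p(x) = log N(x; 0, 2) is available in closed form, so one can observe L_1 < L_50 < log p(x) numerically, in line with Theorem 1.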
10. Rényi α-divergence
¤ A family of divergences indexed by a parameter α
¤ Defined for α > 0, α ≠ 1
¤ α → 1 recovers the KL divergence
¤ α = 0, +∞ and α = 1/2 (Hellinger) are defined by continuity
Rényi's α-divergence is defined for two distributions p and q on a random variable θ ∈ Θ:

$$D_\alpha[p\|q] = \frac{1}{\alpha - 1}\log \int p(\theta)^{\alpha} q(\theta)^{1-\alpha}\, d\theta. \qquad (1)$$

For α > 1 the definition is valid when it is finite, and for discrete random variables the integration is replaced by summation. When α → 1 it recovers the Kullback-Leibler (KL) divergence that plays a crucial role in machine learning and information theory:

$$D_1[p\|q] = \lim_{\alpha \to 1} D_\alpha[p\|q] = \int p(\theta) \log \frac{p(\theta)}{q(\theta)}\, d\theta = \mathrm{KL}[p\|q].$$

Similar to α = 1, for the values α = 0, +∞ the Rényi divergence is defined by continuity in α:

$$D_0[p\|q] = -\log \int_{p(\theta) > 0} q(\theta)\, d\theta, \qquad D_{+\infty}[p\|q] = \log \max_{\theta \in \Theta} \frac{p(\theta)}{q(\theta)}.$$

Another special case is α = 1/2, where the corresponding Rényi divergence is a function of the squared Hellinger distance Hel²[p||q] = (1/2) ∫ (√p(θ) − √q(θ))² dθ:

$$D_{\frac{1}{2}}[p\|q] = -2\log\big(1 - \mathrm{Hel}^2[p\|q]\big).$$

In [van Erven and Harremoës, 2014] the definition (1) is also extended to negative α values, in which case it is non-positive and is thus no longer a valid divergence measure. The proposed method below also exploits these negative settings.
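These definitions can be checked numerically on discrete distributions (the integral becomes a sum). The snippet below is an illustrative sketch, not code from the paper:

```python
import numpy as np

def renyi(p, q, alpha):
    """Renyi alpha-divergence D_alpha[p||q] for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if alpha == 1.0:  # defined by continuity: the KL divergence
        return float(np.sum(p * np.log(p / q)))
    return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
```

One can verify that D_α is non-decreasing in α and that D_{1/2} satisfies the Hellinger identity above exactly.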
11. The variational Rényi bound
¤ Standard VI maximises L_VI, i.e. minimises KL[q(θ)||p(θ|D)]
¤ Can variational free-energy approaches be generalised to the Rényi case?
¤ Minimise Rényi's α-divergence from the exact posterior
¤ For α ≠ 1 the objective can be rewritten as an expectation under q
$$\mathcal{L}_{VI}(q; \mathcal{D}) = \log p(\mathcal{D}) - \mathrm{KL}[q(\theta)\|p(\theta|\mathcal{D})] = \mathbb{E}_q\left[\log \frac{p(\theta, \mathcal{D})}{q(\theta)}\right].$$

3.1 The Variational Rényi Bound

Recall from Section 2.1 that the family of Rényi divergences includes the KL divergence. Can traditional variational free-energy approaches be generalised to the Rényi case? Consider approximating the exact posterior p(θ|D) by minimizing Rényi's α-divergence for some selected α > 0:

$$q(\theta) = \arg\min_{q \in \mathcal{Q}} D_\alpha[q(\theta)\|p(\theta|\mathcal{D})].$$

It is easy to verify the alternative optimization problem

$$q(\theta) = \arg\max_{q \in \mathcal{Q}} \; \log p(\mathcal{D}) - D_\alpha[q(\theta)\|p(\theta|\mathcal{D})].$$

When α ≠ 1, the objective can be rewritten as

$$\log p(\mathcal{D}) - \frac{1}{\alpha - 1}\log \int q(\theta)^{\alpha} p(\theta|\mathcal{D})^{1-\alpha}\, d\theta
= \log p(\mathcal{D}) - \frac{1}{\alpha - 1}\log \mathbb{E}_q\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)\, p(\mathcal{D})}\right)^{1-\alpha}\right]
= \frac{1}{1 - \alpha}\log \mathbb{E}_q\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)}\right)^{1-\alpha}\right] := \mathcal{L}_\alpha(q; \mathcal{D}).$$

We name this new objective the variational Rényi (VR) bound.
12. Estimating the VR bound
¤ The VR bound places an expectation inside a logarithm
¤ Approximate it with K Monte Carlo samples from q
¤ The resulting estimate is biased, but the bias can be characterised
4. The VR Bound Optimisation Framework

Variational free-energy methods sidestep intractabilities in a class of intractable models, and recent work has developed approximations based on Monte Carlo to expand the set of models that can be handled. The VR bound can be deployed on the same model class as Monte Carlo variational methods. This section develops a scalable optimisation method for the VR bound by extending the recent advances of traditional VI; black-box methods are discussed to enable its application to arbitrary finite α settings.

4.1 Monte Carlo Estimation of the VR Bound

We propose a simple Monte Carlo method that uses finite samples θ_k ∼ q(θ), k = 1, ..., K to approximate L_α ≈ L̂_{α,K}:

$$\hat{\mathcal{L}}_{\alpha,K}(q; \mathcal{D}) = \frac{1}{1-\alpha}\log \frac{1}{K}\sum_{k=1}^{K}\left[\left(\frac{p(\theta_k, \mathcal{D})}{q(\theta_k)}\right)^{1-\alpha}\right].$$

Unlike traditional VI, here the Monte Carlo estimate is biased, since the expectation over q(θ) sits inside the logarithm. However we can bound the bias by the following theorems proved in the supplementary.

Theorem 2. E_{{θ_k}_{k=1}^K}[L̂_{α,K}(q; D)] as a function of α and K is: 1) non-decreasing in K for fixed α ≤ 1, with limit L_α as K → +∞ if |p/q| is bounded; 2) continuous and non-increasing in α for fixed K (on the set where |L_α| < +∞).
Figure 2: (a) An illustration of the bounding properties of sampling approximations to the VR bounds. Here α₂ < 0 < α₁ < 1 and 1 < K₁ < K₂ < +∞. (b) The bias of the sampling estimate of the (negative) alpha-divergence. In this example p, q are 2-D Gaussian distributions with identity covariance matrix, where the only difference is μ_p = [0, 0] and μ_q = [1, 1]. Best viewed in colour.

Corollary 1. For K < +∞, there exists α_K < 0 such that for all α ≤ α_K, E_{{θ_k}_{k=1}^K}[L̂_{α,K}(q; D)] ≥ log p(D). Furthermore α_K is non-decreasing in K, with lim_{K→1} α_K = −∞ and lim_{K→+∞} α_K = 0.

To better understand the above theorems we plot in Figure 2(a) an illustration of the bounding properties. By definition, the exact VR bound is a lower bound or an upper bound of the log-likelihood log p(D) when α > 0 or α < 0, respectively (red lines). However for α ≤ 1 the sampling approximation L̂_{α,K} in expectation under-estimates the exact VR bound L_α (blue dashed lines), and the approximation quality can be improved by using more samples (the blue dashed arrow). Thus for finite samples, negative alpha values (α₂ < 0) can be used to improve the accuracy of the approximation (see the red arrow between the two blue dashed lines visualising L̂_{α₁,K₁} and L̂_{α₂,K₁}, respectively).

We empirically evaluate the theoretical results in Figure 2(b), by computing the exact and Monte Carlo approximated divergences.
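The biased estimator L̂_{α,K} and its non-decreasing behaviour in K (for α ≤ 1) can be sketched on a toy conjugate model (prior N(0, 1), likelihood N(x; θ, 1), q equal to the prior); the model and names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(z, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

def vr_hat(x, alpha, K, n_rep=4000):
    """Average the biased K-sample VR estimate over n_rep replications."""
    theta = rng.standard_normal((n_rep, K))       # theta_k ~ q(theta) = N(0, 1)
    # log p(theta_k, x) - log q(theta_k); prior and q cancel in this toy model
    log_ratio = log_normal(x, theta, 1.0)
    t = (1.0 - alpha) * log_ratio
    m = t.max(axis=1, keepdims=True)              # stable log-mean-exp over K samples
    lme = m[:, 0] + np.log(np.exp(t - m).mean(axis=1))
    return float((lme / (1.0 - alpha)).mean())
```

Increasing K moves the estimate upwards towards the exact bound, which for α ∈ (0, 1) still lower-bounds log p(x) = log N(x; 0, 2).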
13. VR bound: exact vs. approximated
¤ Exact bounds: lower bound for α > 0, upper bound for α < 0
¤ For α ≤ 1 the sampling approximation under-estimates the exact bound in expectation
¤ Increasing K improves the approximation (Figure 2(a))
14. VR bound and IWAE
¤ The sampling approximation L̂_{α,K} with α = 0 recovers the IWAE bound
¤ For α ≤ 1 the approximation remains, in expectation, a lower bound on log p(D)
15. VR-max
¤ Apply the reparameterization trick to the VR bound
¤ Monte Carlo gradients use normalised importance weights
¤ α = 1 recovers the VAE (VI) gradient
¤ α → −∞: only the sample with the largest importance weight matters
¤ This maximum-weight special case is called VR-max
We write g_φ = g_φ(ε) to reduce the clutter of notation. Now we apply the reparameterization trick to the VR bound:

$$\mathcal{L}_\alpha(q_\phi; \mathcal{D}) = \frac{1}{1-\alpha}\log \mathbb{E}_{\epsilon}\left[\left(\frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right)^{1-\alpha}\right]. \qquad (19)$$

Then the gradient of the VR bound w.r.t. φ is

$$\nabla_\phi \mathcal{L}_\alpha(q_\phi; \mathcal{D}) = \mathbb{E}_{\epsilon}\left[w_\alpha(\epsilon; \phi, \mathcal{D})\, \nabla_\phi \log \frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right], \qquad (20)$$

where w_α(ε; φ, D) ∝ (p(g_φ, D)/q(g_φ))^{1−α} denotes the normalised importance weight. For finite samples ε_k ∼ p(ε), k = 1, ..., K the gradient is approximated by

$$\nabla_\phi \hat{\mathcal{L}}_{\alpha,K}(q_\phi; \mathcal{D}) = \frac{1}{K}\sum_{k=1}^{K} \hat{w}_{\alpha,k}\, \nabla_\phi \log \frac{p(g_\phi(\epsilon_k), \mathcal{D})}{q(g_\phi(\epsilon_k))}, \qquad (21)$$

with ŵ_{α,k} short-hand for ŵ_α(ε_k; φ, D), the normalised importance weight with finite samples. One can show that it recovers the stochastic gradients of L_VI by setting α = 1 in (21):

$$\nabla_\phi \mathcal{L}_{VI}(q_\phi; \mathcal{D}) \approx \frac{1}{K}\sum_{k=1}^{K} \nabla_\phi \log \frac{p(g_\phi(\epsilon_k), \mathcal{D})}{q(g_\phi(\epsilon_k))}, \qquad (22)$$

which means the resulting algorithm unifies the computation for all finite α settings.
To speed up learning, [Burda et al., 2015] suggested back-propagating only one sample ε_j, chosen with probability proportional to its importance weight; extending this idea to the VR bound yields Algorithm 1.

Algorithm 1: one gradient step for VR-α / VR-max
1: sample ε_1, ..., ε_K ∼ p(ε)
2: for all k = 1, ..., K and n ∈ S (the current minibatch), compute the unnormalised weights
   log ŵ(ε_k; x_n) = log p(g_φ(ε_k), x_n) − log q(g_φ(ε_k)|x_n)
3: choose the sample ε_{j_n} to back-propagate:
   if |α| < ∞: j_n ∼ p_k where p_k ∝ ŵ(ε_k; x_n)^{1−α}
   if α = −∞: j_n = arg max_k log ŵ(ε_k; x_n)
4: return the gradients (1/|S|) Σ_{n∈S} ∇_φ log ŵ(ε_{j_n}; x_n) to the optimiser
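Step 3 of Algorithm 1 can be sketched as follows, operating on precomputed log-weights (the function name is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_sample(log_w, alpha):
    """Pick the index j_n to back-propagate, as in step 3 of Algorithm 1."""
    log_w = np.asarray(log_w, float)
    if np.isneginf(alpha):                 # VR-max: the largest importance weight
        return int(np.argmax(log_w))
    # |alpha| finite: sample with probability p_k proportional to w_k^(1 - alpha),
    # computed in log space for numerical stability
    t = (1.0 - alpha) * log_w
    p = np.exp(t - t.max())
    p /= p.sum()
    return int(rng.choice(len(log_w), p=p))
```

Setting α = 1 makes the selection uniform, matching the VI/VAE gradient in expectation, while α = 0 samples in proportion to the raw importance weights, matching the IWAE-style update.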
16. Large-scale learning
¤ The appendix of [Li et al., 2015] discussed stochastic EP for this kind of objective
¤ We propose a minibatch (size M) approximation that applies directly to the VR bound
¤ Increasing M reduces the bias of the approximation
¤ Connections to black-box alpha (BB-α)
Note that VR-max does not implement the minimum description length principle (MDL), since MDL approximates the true posterior by minimising the exact VR bound L_{−∞} that upper-bounds the exact log-likelihood function.

4.3 Stochastic Approximation for Large-scale Learning

So far we discussed the VR bounds computed on the whole dataset D. However for large datasets full batch learning will be very inefficient. In the appendix of [Li et al., 2015] the authors discussed stochastic EP as a way of approximating the VR bound optimisation for large-scale learning. Here we propose another stochastic approximation method to enable minibatch training, which directly applies to the VR bound.

Using the notation f_n(θ) = p(x_n|θ) and defining the "average likelihood" f̄_D(θ) = [∏_{n=1}^N f_n(θ)]^{1/N}, the joint distribution can be rewritten as p(θ, D) = p₀(θ) f̄_D(θ)^N. Now we sample M datapoints S = {x_{n₁}, ..., x_{n_M}} ∼ D and define the corresponding "subset average likelihood" f̄_S(θ) = [∏_{m=1}^M f_{n_m}(θ)]^{1/M}. When M = 1 we also write f̄_S(θ) = f_n(θ) for S = {x_n}. Then we approximate the VR bound (13) by replacing f̄_D(θ) with f̄_S(θ):

$$\tilde{\mathcal{L}}_\alpha(q; S) = \frac{1}{1-\alpha}\log \int q(\theta)^{\alpha}\left[p_0(\theta)\bar{f}_S(\theta)^{N}\right]^{1-\alpha} d\theta = \frac{1}{1-\alpha}\log \mathbb{E}_q\left[\left(\frac{p_0(\theta)\bar{f}_S(\theta)^{N}}{q(\theta)}\right)^{1-\alpha}\right]. \qquad (23)$$

This returns a stochastic estimate of the evidence lower-bound when α → 1. For other α ≠ 1 settings, increasing the size of the minibatch M = |S| reduces the bias of the approximation. This is guaranteed by the following theorem proved in the supplementary.

Theorem 3. If the approximate distribution q(θ) is Gaussian N(μ, Σ), and the likelihood functions have an exponential family form p(x|θ) = exp[⟨θ, Φ(x)⟩ − A(θ)], then for α ≤ 1 the stochastic approximation is bounded (the explicit bound is given in the supplementary).
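The subset average likelihood construction in (23) can be sketched on a toy conjugate model (prior N(0, 1), f_n(θ) = N(x_n; θ, 1), q equal to the prior); the model and all names below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(z, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

def vr_minibatch(X, S_idx, alpha, K=5000):
    """Monte Carlo estimate of the minibatch VR approximation (23)."""
    N = len(X)
    theta = rng.standard_normal(K)                 # theta ~ q = p0 = N(0, 1)
    # log of the subset average likelihood fbar_S(theta)
    log_fbar_S = np.mean([log_normal(X[n], theta, 1.0) for n in S_idx], axis=0)
    # log p0 - log q cancels since q is the prior, leaving N * log fbar_S
    t = (1.0 - alpha) * N * log_fbar_S
    m = t.max()                                    # stable log-mean-exp
    return float((m + np.log(np.mean(np.exp(t - m)))) / (1.0 - alpha))
```

With S the full dataset this reduces to the ordinary Monte Carlo estimate of the VR bound; smaller minibatches trade extra bias for cheaper updates.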
19. Experiments
¤ Three datasets: Frey Face, Caltech 101 Silhouettes, MNIST
¤ VAE: α = 1; IWAE: α = 0; VR-max: α = −∞
¤ Test log-likelihood estimated with α = 0, K = 5000
¤ VR-max is almost indistinguishable from IWAE
¤ VR-max is faster: 25hr29min vs. 61hr16min for IWAE on MNIST
The method was implemented upon the publicly available code¹. Note that the original implementation back-propagates all the samples to compute gradients, while VR-max only back-propagates the sample with the largest importance weight.

Three datasets are considered: Frey Face, Caltech 101 Silhouettes and MNIST. The experiments were repeated multiple times on the relatively small Frey Face dataset (hence the reported standard errors), while the other two were run once. The model consists of L = 1 or 2 stochastic layers with deterministic layers in between; the network architecture is detailed in the supplementary. For MNIST we used settings from [Burda et al., 2015], including the learning rate schedule and number of epochs; for the other two datasets the hyper-parameters follow the VI setting. We reproduced the experiments for IWAE since the results included in [Burda et al., 2015] mismatch those produced by the public code.

Test log-likelihoods are reported in Table 1, computed as log p(x) ≈ L̂_{α,K}(q; x) with α = 0.0 and K = 5000. We also present some samples from the VR-max trained models in Figure 3; they are almost indistinguishable from IWAE's on all three datasets. VR-max takes considerably less time to run compared to IWAE with a full backward pass: on a Tesla C2075 GPU, when trained on MNIST, VR-max and IWAE took 25hr29min and 61hr16min, respectively. We also implemented the single backward pass version of IWAE, whose best test log-likelihood result is -85.02, which is slightly worse. This supports the arguments in Section 4.1 that negative α can be preferable when computation resources are limited.

The α value corresponding to the tightest VR bound becomes more negative as the mismatch between q and the true posterior increases. This is the case when a factorised distribution q is fitted to approximate the typically multimodal exact posterior.
(a) Frey Face (b) Caltech 101 Silhouettes (c) MNIST
Figure 3: Sampled images from the VR-max trained auto-encoders.
Dataset                  L    K     VAE        IWAE       VR-max
Frey Face                1    5     1322.96    1380.30    1377.40
  (± std. err.)                     ±10.03     ±4.60      ±4.59
Caltech 101 Silhouettes  1    5     -119.69    -117.89    -118.01
                              50    -119.61    -117.21    -117.10
MNIST                    1    5     -86.47     -85.41     -85.42
                              50    -86.35     -84.80     -84.81
                         2    5     -85.01     -83.92     -84.04
                              50    -84.78     -83.12     -83.44

Table 1: Average test log-likelihood. Results for VAE on MNIST are collected from [Burda et al., 2015]. IWAE results are reproduced using the publicly available implementation.