This document summarizes a presentation on variational autoencoders (VAEs) at ICLR 2016. It covers five VAE-related papers presented at the conference: Importance Weighted Autoencoders, The Variational Fair Autoencoder, Generating Images from Captions with Attention, Variational Gaussian Process, and Variationally Auto-Encoded Deep Gaussian Processes. It also provides background on variational inference and VAEs, explaining how VAEs use neural networks to model probability distributions and maximize a lower bound on the log-likelihood.
7. Background: Variational Inference
- Learning a generative model = modeling the distribution p(x) from data
  → obtained by maximizing the likelihood p(x).
- When a latent variable z is also modeled, as in p(x) = ∫ p(x, z) dz, the likelihood cannot be maximized directly.
- Instead, we maximize a lower bound that bounds the log-likelihood from below.
- Consider a distribution q(z|x) that approximates p(z|x).
- The log-likelihood then decomposes as
  log p(x) = L(x) + KL(q(z|x) || p(z|x))
  where L(x) is the lower bound and the KL term, which is always nonnegative, is the gap between the approximate and true posterior distributions.
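Written out, this decomposition is the standard evidence lower bound (ELBO) identity; a one-line derivation in the slide's notation:

```latex
\log p(x)
= \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x,z)}{q(z|x)}\right]
+ \mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big)
= \mathcal{L}(x) + \mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big),
\qquad
\mathcal{L}(x) \equiv \mathbb{E}_{q(z|x)}\big[\log p(x,z) - \log q(z|x)\big].
```

Because the KL term is nonnegative, L(x) ≤ log p(x), so raising L(x) raises the log-likelihood while tightening the gap to the true posterior.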
452 9. MIXTURE MODELS AND EM

Figure 9.12: Illustration of the E step of the EM algorithm. The q distribution is set equal to the posterior distribution for the current parameter values θ^old, causing the lower bound to move up to the same value as the log likelihood function, with the KL divergence vanishing. [Figure labels: ln p(X|θ^old), L(q, θ^old), KL(q||p) = 0]

shown in Figure 9.13. If we substitute q(Z) = p(Z|X, θ^old) into (9.71), we see that, after the E step, the lower bound takes the form

L(q, θ) = Σ_Z p(Z|X, θ^old) ln p(X, Z|θ) − Σ_Z p(Z|X, θ^old) ln p(Z|X, θ^old)
        = Q(θ, θ^old) + const        (9.74)

where the constant is simply the negative entropy of the q distribution and is therefore independent of θ. Thus in the M step, the quantity that is being maximized is the expectation of the complete-data log likelihood, as we saw earlier in the case of mixtures of Gaussians. Note that the variable θ over which we are optimizing appears only inside the logarithm. If the joint distribution p(Z, X|θ) comprises a member of the exponential family, or a product of such members, then we see that the logarithm will cancel the exponential and lead to an M step that will be typically much simpler than the maximization of the corresponding incomplete-data log likelihood function p(X|θ).

The operation of the EM algorithm can also be viewed in the space of parameters, as illustrated schematically in Figure 9.14.

Figure 9.13: Illustration of the M step of the EM algorithm. The distribution q(Z) is held fixed and the lower bound L(q, θ) is maximized with respect to the parameter vector θ to give a revised value θ^new. Because the KL divergence is nonnegative, this causes the log likelihood ln p(X|θ) to increase by at least as much as the lower bound does. [Figure labels: ln p(X|θ^new), L(q, θ^new), KL(q||p)]
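The two steps in the excerpt can be stated compactly in the book's own notation:

```latex
\text{E step:}\quad q(Z) \leftarrow p(Z \mid X, \theta^{\mathrm{old}})
\qquad\qquad
\text{M step:}\quad \theta^{\mathrm{new}} = \arg\max_{\theta}\, \mathcal{Q}(\theta, \theta^{\mathrm{old}})
= \arg\max_{\theta} \sum_{Z} p(Z \mid X, \theta^{\mathrm{old}}) \ln p(X, Z \mid \theta).
```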
9. Modeling the VAE
- The distributions are modeled by neural networks.

[Figure: inference model (encoder network) with a sampling step]

z^(l) = µ + σ ⊙ ε^(l),  ε^(l) ∼ N(0, I).
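The sampling step above is the reparameterization trick: noise ε is drawn outside the network, so gradients flow through µ and σ. A minimal numpy sketch (the function name and shapes are illustrative, not from the deck):

```python
import numpy as np

def reparameterize(mu, log_var, n_samples=1, rng=None):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).

    Returns an array of shape (n_samples,) + mu.shape.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.exp(0.5 * log_var)                      # log sigma^2 -> sigma
    eps = rng.standard_normal((n_samples,) + np.shape(mu))
    return mu + sigma * eps

# As sigma -> 0 the samples collapse onto the mean.
z = reparameterize(np.array([1.0, -2.0]), np.array([-100.0, -100.0]), n_samples=3)
```

Parameterizing the variance as log σ² (as VAE implementations commonly do) keeps σ positive without constraints.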
where z^(l) = µ + σ ⊙ ε^(l), ε^(l) ∼ N(0, I).
The most significant difference from the estimator of the VAE's lower bound (Eq. (3)) is that there are two negative reconstruction terms in Eq. (8), one corresponding to each modality. As in the VAE, we call qφ(z|x, w) the encoder and both pθx(x|z^(l)) and pθw(w|z^(l)) the decoders.

We can parameterize the encoder and decoder distributions as deep neural networks. Fig. 2 draws the model, which is the same as Fig. 1 but represented by deep neural networks.

Considering the encoder qφ(z|x, w) as a Gaussian distribution, we can estimate the mean and variance of the distribution by neural networks as follows:

y(x) = MLP_φx(x)
y(w) = MLP_φw(w)
µ_φ = Linear(y(x), y(w))
log σ²_φ = Tanh(Linear(y(x), y(w))),        (9)

where MLP_φx and MLP_φw denote deep neural networks corresponding to each modality, and Linear and Tanh denote a linear layer and a tanh layer. Linear(a, b) means that the network has multiple input layers, corresponding to a and b.

Because each modality has a different feature representation, we should build a different network for each decoder, pθx(x|z) and pθw(w|z). The type of distribution and the network architecture depend on the representation of each modality, e.g., a Gaussian distribution when the representation of the modality is continuous, or a Bernoulli distribution when it is a binary value, 0 or 1. In the case that pθw(w|z) is a Bernoulli distribution B(w|µ_θw), the parameter µ_θw can be estimated as follows:

y(z) = MLP_θw(z)
µ_θ = Linear(y(z))        (10)

In the case that the decoder is a Gaussian distribution, its parameters can be estimated in the same way as Eq. (9), except that the input of the Linear network is single.
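A minimal numpy sketch of the two-branch encoder of Eq. (9). All layer sizes, weights, and the one-hidden-layer stand-in for "MLP" are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W, b):
    # A single tanh layer stands in for "MLP" in Eq. (9).
    return np.tanh(x @ W + b)

# Illustrative dimensions: x is 5-d, w is 3-d, hidden size 4, latent size 2.
Wx, bx = rng.standard_normal((5, 4)), np.zeros(4)
Ww, bw = rng.standard_normal((3, 4)), np.zeros(4)
# Linear(a, b): one linear layer with two input branches, as described above.
Wmu_x, Wmu_w, bmu = rng.standard_normal((4, 2)), rng.standard_normal((4, 2)), np.zeros(2)
Wlv_x, Wlv_w, blv = rng.standard_normal((4, 2)), rng.standard_normal((4, 2)), np.zeros(2)

def encode(x, w):
    yx = mlp(x, Wx, bx)                                 # y(x) = MLP_phi_x(x)
    yw = mlp(w, Ww, bw)                                 # y(w) = MLP_phi_w(w)
    mu = yx @ Wmu_x + yw @ Wmu_w + bmu                  # mu_phi = Linear(y(x), y(w))
    log_var = np.tanh(yx @ Wlv_x + yw @ Wlv_w + blv)    # log sigma^2 = Tanh(Linear(...))
    return mu, log_var

mu, log_var = encode(rng.standard_normal(5), rng.standard_normal(3))
```

The Tanh on the log-variance head bounds log σ² to (−1, 1), which keeps the predicted variance in a stable range.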
Figure 2: The network architecture of MVAE. This represents the same model as Fig. 1.
Hence,

L(x, w) = −D_KL(qφ(z|x, w) || p(z)) + E_qφ(z|x,w)[log pθx(x|z)] + E_qφ(z|x,w)[log pθw(w|z)].        (7)

By the SGVB algorithm, the estimator of the lower bound is as follows:

L̂(x, w) = −D_KL(qφ(z|x, w) || p(z)) + (1/L) Σ_{l=1}^{L} [ log pθx(x|z^(l)) + log pθw(w|z^(l)) ],        (8)

where z^(l) = µ + σ ⊙ ε^(l), ε^(l) ∼ N(0, I).
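The estimator of Eq. (8) is a closed-form Gaussian KL plus a plain Monte Carlo average over L reparameterized samples. A sketch with toy stand-in decoders (the function names and densities are illustrative):

```python
import numpy as np

def kl_gaussian_std_normal(mu, log_var):
    # D_KL( N(mu, diag(sigma^2)) || N(0, I) ), closed form.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def lower_bound_estimate(mu, log_var, log_px_given_z, log_pw_given_z, L=10, rng=None):
    """Estimate Eq. (8): -KL + (1/L) * sum_l [log p(x|z_l) + log p(w|z_l)]."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.exp(0.5 * log_var)
    total = 0.0
    for _ in range(L):
        z = mu + sigma * rng.standard_normal(np.shape(mu))  # reparameterized sample
        total += log_px_given_z(z) + log_pw_given_z(z)      # two reconstruction terms
    return -kl_gaussian_std_normal(mu, log_var) + total / L

# Toy decoders whose log-likelihoods are simple functions of z.
est = lower_bound_estimate(
    mu=np.zeros(2), log_var=np.zeros(2),
    log_px_given_z=lambda z: -0.5 * np.sum(z**2),
    log_pw_given_z=lambda z: -0.5 * np.sum(z**2),
    L=100, rng=np.random.default_rng(0),
)
```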
...settings, and found that their proposed model can extract better representations than single-modality settings.

Srivastava & Salakhutdinov (2012) applied deep restricted Boltzmann machines (RBM), one of the earliest deep generative models, to multimodal learning settings. As in Ngiam et al. (2011), they joined the latent variables of multiple networks and tried to extract high-level features from multimodal inputs: images and texts. In their experiments, they showed that their model outperformed Ngiam et al. (2011). This suggests that deep generative models may extract better representations than discriminative ones.

2.2 VARIATIONAL AUTOENCODERS

Variational autoencoders (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) are recently proposed deep generative models. Given observation variables x and corresponding latent variables z, we consider their generative processes as follows:

z ∼ p(z); x ∼ pθ(x|z)

where θ denotes the model parameters of p. In variational inference, we consider qφ(z|x), where φ is the model parameter of q, in order to approximate the posterior distribution pθ(z|x). The goal is to maximize the lower bound on the marginal log-likelihood.
[Figure: generative model z ∼ p(z), x ∼ p(x|z), with inference model q(z|x)]
12. IWAE: Experimental Results
- Test likelihood is confirmed to improve.

Under review as a conference paper at ICLR 2016

                        MNIST                        OMNIGLOT
                   VAE          IWAE            VAE           IWAE
# stoch.      NLL  active   NLL  active    NLL    active  NLL    active
layers    k        units         units            units          units
1         1  86.76   19    86.76   19     108.11   28    108.11   28
          5  86.47   20    85.54   22     107.62   28    106.12   34
         50  86.35   20    84.78   25     107.80   28    104.67   41
2         1  85.33  16+5   85.33  16+5    107.58  28+4   107.56  30+5
          5  85.01  17+5   83.89  21+5    106.31  30+5   104.79  38+6
         50  84.78  17+5   82.90  26+7    106.30  30+5   103.38  44+7

Table 1: Results on density estimation and the number of active latent dimensions. For models with two latent layers, "k1+k2" denotes k1 active units in the first layer and k2 in the second layer. The generative performance of IWAEs improved with increasing k, while that of VAEs benefitted only slightly. Two-layer models achieved better generative performance than one-layer models.

The log-likelihood results are reported in Table 1. Our VAE results are comparable to those previously reported in the literature. We observe that training a VAE with k > 1 helped only slightly. By contrast, using multiple samples improved the IWAE results considerably on both datasets.
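The k in Table 1 is the number of importance samples in the IWAE bound (Burda et al.): instead of averaging log-weights as the VAE does, IWAE takes the log of the averaged importance weights. A sketch with toy stand-in densities:

```python
import numpy as np

def log_mean_exp(a):
    # Numerically stable log( (1/k) * sum_i exp(a_i) ).
    m = np.max(a)
    return m + np.log(np.mean(np.exp(a - m)))

def iwae_bound(log_joint, log_q, z_samples):
    """One-sample estimate of L_k = E[ log (1/k) sum_i w_i ],
    with log importance weights log w_i = log p(x, z_i) - log q(z_i | x)."""
    log_w = np.array([log_joint(z) - log_q(z) for z in z_samples])
    return log_mean_exp(log_w)

rng = np.random.default_rng(0)
# Toy model: p(x, z) and q(z|x) proportional to the same density in z,
# so every importance weight is 1 and the bound is exactly 0.
log_joint = lambda z: -0.5 * np.sum(z**2)
log_q = lambda z: -0.5 * np.sum(z**2)
bound = iwae_bound(log_joint, log_q, rng.standard_normal((5, 2)))
```

For k = 1 this reduces to the ordinary VAE bound, which is why the k = 1 rows of Table 1 are identical for VAE and IWAE.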
16. Variational Fair Autoencoder
- The Variational Fair Autoencoder [Louizos+ 15; ICLR 2016]
- To make x and s (the sensitive variable; the label in the previous slide) independent, the following maximum mean discrepancy (MMD) is minimized.
- This removes the difference between the latent variables for s = 0 and s = 1.
- The MMD term is added to the VAE lower bound.
- The MMD is usually turned into a kernel computation. However, computing a high-dimensional Gram matrix with SGD is expensive, so the feature map is instead obtained in the following form.
ℓ_MMD is defined as (Gretton et al., 2006):

ℓ_MMD(X, X′) = (1/N0²) Σ_{n=1}^{N0} Σ_{m=1}^{N0} k(x_n, x_m) + (1/N1²) Σ_{n=1}^{N1} Σ_{m=1}^{N1} k(x′_n, x′_m) − (2/(N0·N1)) Σ_{n=1}^{N0} Σ_{m=1}^{N1} k(x_n, x′_m).        (7)

Asymptotically, for a universal kernel such as the Gaussian kernel k(x, x′) = e^{−‖x−x′‖²}, ℓ_MMD(X, X′) is 0 if and only if P0 = P1. Equivalently, minimizing MMD can be viewed as matching all of the moments of P0 and P1. Therefore, we can use it as an extra "regularizer" and force the model to try to match the moments between the marginal posterior distributions of our latent variables, i.e., q(z1|s = 0) and q(z1|s = 1) (in the case of binary nuisance information s1). By adding the MMD penalty into the lower bound of our aforementioned VAE architecture we obtain our proposed model, the "Variational Fair Autoencoder" (VFAE):

F_VFAE(φ, θ; x_n, x_m, s_n, s_m, y_n) = F_VAE(φ, θ; x_n, x_m, s_n, s_m, y_n) − ℓ_MMD(Z1_{s=0}, Z1_{s=1})        (8)

where:

ℓ_MMD(Z1_{s=0}, Z1_{s=1}) = ‖ E_p̃(x|s=0)[E_q(z1|x,s=0)[ψ(z1)]] − E_p̃(x|s=1)[E_q(z1|x,s=1)[ψ(z1)]] ‖²        (9)

2.4 FAST MMD VIA RANDOM FOURIER FEATURES

A naive implementation of MMD in minibatch stochastic gradient descent would require computing the M × M Gram matrix for each minibatch during training, where M is the minibatch size. Instead, we can use random kitchen sinks (Rahimi & Recht, 2009) to compute a feature expansion such that computing the estimator (6) approximates the full MMD (7). To compute this, we draw a random K × D matrix W, where K is the dimensionality of x, D is the number of random features, and each entry of W is drawn from a standard isotropic Gaussian. The feature expansion is then given as:

ψ_W(x) = √(2/D) · cos( √2 · xW + b ),        (10)

where b is a D-dimensional uniform random vector with entries in [0, 2π]. Zhao & Meng (2015) have successfully applied the idea of using random kitchen sinks to approximate MMD. This estimator is fairly accurate, and is typically much faster than the full MMD penalty. We use D = 500 in our experiments.
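A numpy sketch of both computations above: the exact Gram-matrix estimator of Eq. (7) and the random-feature approximation of Eq. (10). Sample sizes, dimensions, and the kernel bandwidth handling are illustrative assumptions:

```python
import numpy as np

def mmd_gaussian(X0, X1, gamma=1.0):
    """Biased MMD^2 estimator of Eq. (7) with k(x, x') = exp(-gamma * ||x - x'||^2)."""
    def gram(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)
    return gram(X0, X0).mean() + gram(X1, X1).mean() - 2 * gram(X0, X1).mean()

def random_feature_map(X, W, b, D):
    # psi_W(x) = sqrt(2/D) * cos(sqrt(2) * xW + b), as in Eq. (10).
    return np.sqrt(2.0 / D) * np.cos(np.sqrt(2.0) * X @ W + b)

def mmd_random_features(X0, X1, D=500, rng=None):
    """Approximate MMD^2 as || mean(psi(X0)) - mean(psi(X1)) ||^2,
    avoiding the M x M Gram matrix of the exact estimator."""
    rng = np.random.default_rng() if rng is None else rng
    K = X0.shape[1]
    W = rng.standard_normal((K, D))
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    diff = random_feature_map(X0, W, b, D).mean(0) - random_feature_map(X1, W, b, D).mean(0)
    return diff @ diff

rng = np.random.default_rng(0)
same = mmd_random_features(rng.standard_normal((200, 3)), rng.standard_normal((200, 3)), rng=rng)
shifted = mmd_random_features(rng.standard_normal((200, 3)), 3.0 + rng.standard_normal((200, 3)), rng=rng)
```

The random-feature penalty is linear in the minibatch size, which is what makes it usable inside SGD as the slide notes.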
17. Experiments: Evaluating Fairness
- Verify whether the information about s has been removed from z.
- Evaluated by the accuracy of classifying s from z.

Figure 3: Fair classification results on (a) the Adult dataset, (b) the German dataset, and (c) the Health dataset. Columns correspond to each evaluation scenario (in order): Random/RF/LR accuracy on s, Discrimination/Discrimination prob. against s, and Random/Model accuracy.
18. Experiments: Domain Adaptation
- Domain adaptation between different domains.
- Semi-supervised setting (no labels in the target domain).
- The Amazon reviews dataset.
- y is the sentiment (positive or negative).
- Results: 9 out of the 12 tasks outperformed prior work ([Ganin+ 15]).
2 LEARNING INVARIANT REPRESENTATIONS

[Figure 1: Unsupervised model, with variables x, z, s. Figure 2: Semi-supervised model, with variables x, z1, z2, s, y.]

2.1 UNSUPERVISED MODEL

Factoring out undesired variations from the data can be easily formulated as a general probabilistic model which admits two distinct (independent) "sources": an observed variable s, which denotes the variations that we want to remove, and a continuous latent variable z which models all the remaining information. This generative process can be formally defined as:

z ∼ p(z); x ∼ pθ(x|z, s)

where pθ(x|z, s) is an appropriate probability distribution for the data we are modelling. With this formulation we explicitly encode a notion of 'invariance' in our model, since the latent representation is independent of the nuisance variable s.
As far as domain adaptation is concerned, we compared against a recent neural network based state-of-the-art method for domain adaptation, Domain Adversarial Neural Network (DANN) (Ganin et al., 2015). As we can observe in Table 1, our accuracy on the labels y is higher on 9 out of the 12 domain adaptation tasks, whereas on the remaining 3 it is quite similar to the DANN architecture.

Table 1: Results on the Amazon reviews dataset. The DANN column is taken directly from Ganin et al. (2015) (the column that uses the original representation as input).

                         Accuracy on S    Accuracy on Y
Source - Target           RF      LR      VFAE    DANN
books - dvd              0.535   0.564   0.799   0.784
books - electronics      0.541   0.562   0.792   0.733
books - kitchen          0.537   0.583   0.816   0.779
dvd - books              0.537   0.563   0.755   0.723
dvd - electronics        0.538   0.566   0.786   0.754
dvd - kitchen            0.543   0.589   0.822   0.783
electronics - books      0.562   0.590   0.727   0.713
electronics - dvd        0.556   0.586   0.765   0.738
electronics - kitchen    0.536   0.570   0.850   0.854
kitchen - books          0.560   0.593   0.720   0.709
kitchen - dvd            0.561   0.599   0.733   0.740
kitchen - electronics    0.533   0.565   0.838   0.843
19. Applications of the CVAE
- A conditional VAE can generate images conditioned on labels or other information.
- It can also generate data that does not appear in the training samples.
- Conditioning on digit labels [Kingma+ 2014; NIPS 2014]

(a) Handwriting styles for MNIST obtained by fixing the class label and varying the 2D latent variable z. (b) MNIST analogies. (c) SVHN analogies.

Figure 1: (a) Visualisation of handwriting styles learned by the model with 2D z-space. (b, c) Analogical reasoning with generative semi-supervised models using a high-dimensional z-space. The leftmost columns show images from the test set. The other columns show analogical fantasies of x by the generative model, where the latent variable z of each row is set to the value inferred from the test-set image on the left.
20. Conditional alignDRAW
- Generating Images from Captions with Attention [Mansimov+ 16; ICLR 2016]
- A model that conditions DRAW on captions via a bidirectional RNN.
- DRAW [Gregor+ 14]:
  - Brings RNNs into the VAE framework.
  - The image is overwritten (refined) step by step at each time step.
  - Attention is modeled by looking at the difference from the previous step.
[Excerpt from "DRAW: A Recurrent Neural Network For Image Generation" (Gregor et al., 2015). Figure 2 contrasts a conventional variational auto-encoder (feed-forward encoder and decoder with a single latent z producing P(x|z)) with the DRAW network, in which an encoder RNN and decoder RNN operate over time steps: at each step t the model reads from the input x, samples z_t ∼ Q(z_t|x, z_{1:t−1}), writes to the canvas c_t, and finally produces P(x|z_{1:T}) from the accumulated canvas c_T. Figure 7 shows MNIST generation sequences for DRAW without attention; the network first generates a very blurry image and then refines it over time.]
21. Conditional alignDRAW
- Overall architecture of conditional alignDRAW.
- A model that conditions DRAW on captions via a bidirectional RNN.
- The conditioning uses a weighted sum of the bidirectional RNN outputs.

Published as a conference paper at ICLR 2016

Figure 2: AlignDRAW model for generating images by learning an alignment between the input captions and generating canvas. The caption is encoded using the Bidirectional RNN (left). The generative RNN takes a latent sequence z_{1:T} sampled from the prior along with the dynamic caption representation s_{1:T} to generate the canvas matrix c_T, which is then used to generate the final image x (right). The inference RNN is used to compute the approximate posterior Q over the latent sequence.

3.2 IMAGE MODEL: THE CONDITIONAL DRAW NETWORK

To generate an image x conditioned on the caption information y, we extended the DRAW network (Gregor et al., 2015) to include the caption representation h^lang at each step, as shown in Fig. 2. The conditional DRAW network is a stochastic recurrent neural network that consists of a sequence of latent variables Z_t ∈ R^D, t = 1, ..., T, where the output is accumulated over all T time-steps. For simplicity of notation, the images x ∈ R^{h×w} are assumed to have size h-by-w and only one color channel.
3.3 LEARNING
The model is trained to maximize a variational lower bound L on the marginal likelihood of the
correct image x given the input caption y:
L =
X
Z
Q(Z | x, y) log P(x | y, Z) DKL (Q(Z | x, y) k P(Z | y)) log P(x | y). (9)
Similar to the DRAW model, the inference recurrent network produces an approximate posterior Q(Z_{1:T} | x, y) via a read operator, which reads a patch from an input image x using two arrays of 1D Gaussian filters (the inverse of write from Section 3.2) at each time-step t. Specifically,

\hat{x}_t = x - \sigma(c_{t-1}),  (10)
r_t = read(x_t, \hat{x}_t, h^{gen}_{t-1}),  (11)
h^{infer}_t = LSTM^{infer}( h^{infer}_{t-1}, [r_t, h^{gen}_{t-1}] ),  (12)
Q(Z_t | x, y, Z_{1:t-1}) = N( \mu(h^{infer}_t), \sigma(h^{infer}_t) ),  (13)

where \hat{x} is the error image and h^{infer}_0 is initialized to the learned bias b. Note that the inference LSTM^{infer} takes as its input both the output of the read operator r_t ∈ R^{p×p}, which depends on the original input image x, and the previous state of the generative decoder h^{gen}_{t-1}, which depends on the latent sample history z_{1:t-1} and dynamic sentence representation s_{t-1} (see Eq. 3). Hence, the approximate posterior Q will depend on the input image x, the corresponding caption y, and the latent history Z_{1:t-1}, except for the first step Q(Z_1 | x), which depends only on x.
The terms in the variational lower bound of Eq. 9 can be rearranged using the law of total expectation. Therefore, the variational bound L is calculated as follows:

L = E_{Q(Z_{1:T} | y, x)} \left[ \log p(x | y, Z_{1:T}) - \sum_{t=2}^{T} D_{KL}( Q(Z_t | Z_{1:t-1}, y, x) \| P(Z_t | Z_{1:t-1}, y) ) \right] - D_{KL}( Q(Z_1 | x) \| P(Z_1) ).  (14)
¹ We also experimented with a conditional Gaussian observation model, but it worked worse compared to the Bernoulli model.
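As a concrete illustration of how the bound in Eq. 14 is computed, here is a minimal NumPy sketch. It assumes diagonal Gaussian Q and P at every step and a standard-normal prior on Z_1; the function names and the scalar stand-in for the reconstruction term are illustrative, not from the paper's code:

```python
import numpy as np

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, diag(sig_q^2)) || N(mu_p, diag(sig_p^2)) ), summed over dims."""
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sig_p ** 2)
                  - 0.5)

def draw_style_bound(recon_term, mu_q, sig_q, mu_p, sig_p):
    """Bound of Eq. 14: reconstruction term minus per-time-step KLs.
    mu_q, sig_q, mu_p, sig_p have shape (T, D); the t = 1 prior is N(0, I),
    and recon_term stands in for log p(x | y, Z_{1:T})."""
    kl = kl_diag_gauss(mu_q[0], sig_q[0],
                       np.zeros_like(mu_q[0]), np.ones_like(sig_q[0]))
    for t in range(1, mu_q.shape[0]):
        kl += kl_diag_gauss(mu_q[t], sig_q[t], mu_p[t], sig_p[t])
    return recon_term - kl
```

When the posterior matches the prior at every step, all KL terms vanish and the bound equals the reconstruction term, which matches the structure of Eq. 14.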
22. Experiment: MNIST with captions
¤ Trained on MNIST images with captions
¤ The captions specify where the MNIST digits are placed
¤ Left: configurations present in the training data; right: configurations not in it.
¤ Even multiple digits are generated appropriately.
Published as a conference paper at ICLR 2016
Figure 6: Examples of generating 60 × 60 MNIST images corresponding to respective captions. The captions on the left column were part of the training set. The digits described in the captions on the right column were hidden during training for the respective configurations.
APPENDIX A: MNIST WITH CAPTIONS
As an additional experiment, we trained our model on the MNIST dataset with artificial captions. Either one or two digits from the MNIST training dataset were placed on a 60 × 60 blank image. One digit was placed in one of the four (top-left, top-right, bottom-left or bottom-right) corners of the image. Two digits were either placed horizontally or vertically in non-overlapping fashion. The corresponding artificial captions specified the identity of each digit along with their relative positions, e.g. “The digit three is at the top of the digit one”, or “The digit seven is at the bottom left of the image”.
The generated images together with the attention alignments are displayed in Figure 6.
23. Experiment: MS COCO dataset
¤ Changing only part of the caption (the underlined part)
¤ Generating from captions that do not exist in the data
A yellow school bus
parked in a parking lot.
A red school bus parked
in a parking lot.
A green school bus
parked in a parking lot.
A blue school bus parked
in a parking lot.
The decadent chocolate
desert is on the table.
A bowl of bananas is on
the table.
A vintage photo of a cat. A vintage photo of a dog.
Figure 3: Top: Examples of changing the color while keeping the caption fixed. Bottom: Examples of changing the object while keeping the caption fixed. The shown images are the probabilities σ(c_T). Best viewed in colour.
The expectation can be approximated by L Monte Carlo samples \tilde{z}_{1:T} from Q(Z_{1:T} | y, x):

L \approx \frac{1}{L} \sum_{l=1}^{L} \left[ \log p(x | y, \tilde{z}^l_{1:T}) - \sum_{t=2}^{T} D_{KL}( Q(Z_t | \tilde{z}^l_{1:t-1}, y, x) \| P(Z_t | \tilde{z}^l_{1:t-1}, y) ) \right] - D_{KL}( Q(Z_1 | x) \| P(Z_1) ).  (15)
The model can be trained using stochastic gradient descent. In all of our experiments, we used
only a single sample from Q(Z1:T | y, x) for parameter learning. Training details, hyperparameter
settings, and the overall model architecture are specified in Appendix B. The code is available at
https://github.com/emansim/text2image.
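Using only a single sample from Q per step, as described above, relies on the Monte Carlo estimator in Eq. 15 being unbiased: each per-step estimate is noisy, but the noise averages out over many gradient steps. A toy numerical check, with an arbitrary illustrative Gaussian q and integrand (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# q = N(mu, sigma^2) and integrand f are illustrative stand-ins for
# Q(Z_{1:T} | y, x) and the bracketed term of Eq. 15.
mu, sigma = 1.5, 0.7
f = lambda z: z ** 2             # E_q[f(z)] = mu^2 + sigma^2 in closed form

# Each "training step" uses ONE sample; averaging the per-step estimates
# over many steps recovers the expectation (the estimator is unbiased).
estimates = np.array([f(rng.normal(mu, sigma)) for _ in range(200_000)])
print(estimates.mean(), mu ** 2 + sigma ** 2)   # the two agree closely
```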
3.4 GENERATING IMAGES FROM CAPTIONS
During the image generation step, we discard the inference network and instead sample from the prior distribution. Due to the blurriness of samples generated by the DRAW model, we perform an additional post-processing step where we use an adversarial network trained on residuals of a Laplacian pyramid to sharpen the generated images.
A stop sign is flying in
blue skies.
A herd of elephants flying in the blue skies.
A toilet seat sits open in
the grass field.
A person skiing on sand
clad vast desert.
Figure 1: Examples of generated images based on captions that describe novel scene compositions that are
highly unlikely to occur in real life. The captions describe a common object doing unusual things or set in a
strange location.
25. Gaussian processes
¤ What is a Gaussian process?
¤ A probability distribution over functions:
¤ for a dataset of D-dimensional input vectors, the joint distribution of the corresponding vector of function outputs is always Gaussian.
¤ It is completely described by its mean vector and covariance matrix.
3 Gaussian Processes
We briefly review the predictive equations and marginal likelihood for Gaussian processes (GPs), and the associated computational requirements, following the notational conventions in Wilson et al. (2015). See, for example, Rasmussen and Williams (2006) for a comprehensive discussion of GPs.

We assume a dataset D of n input (predictor) vectors X = {x_1, …, x_n}, each of dimension D, which index an n × 1 vector of targets y = (y(x_1), …, y(x_n))^⊤. If f(x) ∼ GP(μ, k_γ), then any collection of function values f has a joint Gaussian distribution,

f = f(X) = [f(x_1), …, f(x_n)]^⊤ ∼ N(μ, K_{X,X}),  (1)

with a mean vector, μ_i = μ(x_i), and covariance matrix, (K_{X,X})_{ij} = k_γ(x_i, x_j), determined from the mean function and covariance kernel of the Gaussian process. The kernel, k_γ, is parametrized by γ. Assuming additive Gaussian noise, y(x) | f(x) ∼ N(y(x); f(x), σ²), the predictive distribution of the GP evaluated at the n_* test points indexed by X_* is given by

f_* | X_*, X, y, γ, σ² ∼ N( E[f_*], cov(f_*) ),  (2)
E[f_*] = μ_{X_*} + K_{X_*,X} [K_{X,X} + σ² I]^{-1} y,
cov(f_*) = K_{X_*,X_*} − K_{X_*,X} [K_{X,X} + σ² I]^{-1} K_{X,X_*}.

K_{X_*,X}, for example, is an n_* × n matrix of covariances between the GP evaluated at X_* and X. μ_{X_*} is the n_* × 1 mean vector, and K_{X,X} is the n × n covariance matrix evaluated at the training inputs X. All covariance (kernel) matrices implicitly depend on the kernel hyperparameters γ.
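The predictive equations (2) translate directly into a few lines of linear algebra. A minimal sketch with a squared-exponential kernel and a zero mean function; the hyperparameter values `ell` and `sigma2` are illustrative:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_predict(X, y, Xs, sigma2=0.1, ell=1.0):
    """GP predictive mean and covariance of Eq. (2), zero mean function.
    X: (n,) training inputs, y: (n,) targets, Xs: (n*,) test inputs."""
    K = rbf(X, X, ell) + sigma2 * np.eye(len(X))   # K_{X,X} + sigma^2 I
    Ks = rbf(Xs, X, ell)                           # K_{X*,X}: n* x n
    Kss = rbf(Xs, Xs, ell)                         # K_{X*,X*}
    alpha = np.linalg.solve(K, y)
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov
```

With near-zero noise the posterior mean interpolates the training targets, matching the noise-free conditioning illustrated in the GP figures.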
We show that the proposed model outperforms state-of-the-art stand-alone deep learning architectures and Gaussian processes with advanced kernel learning procedures on a wide range of datasets, demonstrating its practical significance. We achieve scalability while retaining non-parametric model structure by leveraging the very recent KISS-GP approach (Wilson and Nickisch, 2015) and extensions in Wilson et al. (2015) for efficiently representing kernel functions, to produce scalable deep kernels.
Figure 15.2 (Murphy, 2012, "GPs for regression"): Left: some functions sampled from a GP prior with SE kernel. Right: some samples from a GP posterior, after conditioning on 5 noise-free observations. The shaded area represents E[f(x)] ± 2 std(f(x)).
26. Deep Gaussian processes
¤ To represent more complex samples, stack layers via process composition [Lawrence & Moore, 07]
➡ deep Gaussian process (deep GP)
¤ Consider the following multi-layer graphical model,
¤ where Y is the data and the X are latent variables.
Figure 1: A deep Gaussian process with two hidden layers: X_3 → (f_3 ∼ GP) → X_2 → (f_2 ∼ GP) → X_1 → (f_1 ∼ GP) → Y.
2 DEEP GAUSSIAN PROCESSES
Gaussian processes provide flexible, non-parametric, probabilistic approaches to function estimation. However, their tractability comes at a price: they can only represent a restricted class of functions. Indeed, even though sophisticated definitions and combinations of covariance functions can lead to powerful models (Durrande et al., 2011; Gönen & Alpaydin, 2011; Hensman et al., 2013; Duvenaud et al., 2013; Wilson & Adams, 2013), the assumption of a joint normal distribution over instantiations of the latent function remains; this limits the applicability of the models. One line of recent research to address this limitation focused on function composition (Snelson et al., 2004; Calandra et al., 2014). Inspired by deep neural networks, a deep Gaussian process instead employs process composition (Lawrence & Moore, 2007; Damianou et al., 2011; Lázaro-Gredilla, 2012; Damianou & Lawrence, 2013; Hensman & Lawrence, 2014).

A deep GP is a deep directed graphical model that consists of multiple layers of latent variables and employs Gaussian processes to govern the mapping between consecutive layers (Lawrence & Moore, 2007; Damianou, 2015). Observed outputs are placed in the down-most layer and observed inputs (if any) are placed in the upper-most layer, as illustrated in Figure 1. More formally, consider a set of data Y ∈ R^{N×D} with N datapoints and D dimensions. A deep GP then defines L layers of latent variables, {X_l}_{l=1}^{L}, X_l ∈ R^{N×Q_l}, through the following nested noise model definition:

Y = f_1(X_1) + ε_1,  ε_1 ∼ N(0, σ_1² I)  (1)
X_{l−1} = f_l(X_l) + ε_l,  ε_l ∼ N(0, σ_l² I),  l = 2 … L  (2)

where the functions f_l are drawn from Gaussian processes with covariance functions k_l, i.e. f_l(x) ∼ GP(0, k_l(x, x′)). In the unsupervised case, the top hidden layer is assigned a unit Gaussian as a fairly uninformative prior which also provides soft regularization, i.e. X_L ∼ N(0, I). In the supervised learning scenario, the inputs of the top hidden layer are observed and govern its hidden outputs.

The expressive power of a deep GP is significantly greater than that of a standard GP, because the successive warping of latent variables through the hierarchy allows for modeling non-stationarities and sophisticated, non-parametric functional “features” (see Figure 2). Similarly to how a GP is the limit of an infinitely wide neural network, a deep GP is the limit where the parametric function composition of a deep neural network turns into a process composition.
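The nested noise model of Eqs. (1)-(2) can be simulated directly: draw the top layer from N(0, I), then repeatedly evaluate a GP sample at the current layer's values and add noise. A rough sketch for 1-D layers; the kernel, layer count, and noise level are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel on 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def sample_deep_gp(n=50, layers=3, noise=1e-3):
    """One draw from a deep GP prior: X_L ~ N(0, I), then
    X_{l-1} = f_l(X_l) + eps_l with f_l ~ GP(0, k_l), as in Eqs. (1)-(2).
    f_l(X_l) is realized by sampling from its joint Gaussian via Cholesky."""
    x = rng.normal(size=n)                     # top layer X_L ~ N(0, I)
    for _ in range(layers):
        K = rbf(x, x) + noise * np.eye(n)      # joint Gaussian over f(X_l)
        f = np.linalg.cholesky(K) @ rng.normal(size=n)
        x = f + np.sqrt(noise) * rng.normal(size=n)
    return x
```

Each pass through the loop warps the previous layer's sample through a fresh GP, which is exactly the successive warping that produces non-stationary samples.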
28. VAE-DGP
¤ A framework for variational inference in DGPs has been proposed [Damianou & Lawrence 13], but it could only be trained on small datasets,
¤ because of covariance-matrix inverses and the huge number of parameters.
¤ Treat DGP inference as the recognition model (encoder) of a VAE:
¤ this adds constraints, reduces the number of parameters, and speeds up inference,
¤ and suppresses overfitting better than conventional DGPs.
➡ VAE-DGP
Variationally Auto-Encoded Deep Gaussian Processes [Dai+ 15; ICLR 2016]
Figure 3: A deep Gaussian process with three hidden layers and back-constraints: the generative chain X_3 → (f_3 ∼ GP) → X_2 → (f_2 ∼ GP) → X_1 → (f_1 ∼ GP) → Y, with recognition (back-constraint) mappings {g_1(y^{(n)})}_{n=1}^N, {g_2(μ_1^{(n)})}_{n=1}^N, {g_3(μ_2^{(n)})}_{n=1}^N.
29. Experiment: imputing missing data
¤ Imputation on test data
¤ In each example, the rightmost image is the original.
Figure 5: (a) Samples generated from VAE-DGP trained on the combination of Frey faces and Yale faces (Frey-Yale). (b) Imputation from the test set of Frey-Yale. (c) Imputation from the test set of SVHN. The gray color indicates the missing area. The 1st column shows the input images, the 2nd column shows the imputed images and the 3rd column shows the original full images.

K_{F1F1}, K_{U1U1} are the covariance matrices of F_1 and U_1 respectively, and K_{F1U1} is the cross-covariance matrix between F_1 and U_1; the statistics ψ_1 = Tr(⟨K_{F1F1}⟩_{q(X_1)}), Φ_1 = ⟨K_{F1U1}⟩_{q(X_1)} and Ψ_1 = ⟨K_{F1U1}^⊤ K_{F1U1}⟩_{q(X_1)} are expectations under q(X_1), with Λ_1 built from K_{U1U1} and Ψ_1. This enables data-parallelism.
5.1 UNSUPERVISED LEARNING

Table 1: Log-likelihood for the MNIST test data with different models. The baselines are DBN and Stacked CAE (Bengio et al., 2013), Deep GSN (Bengio et al., 2014), Adversarial nets (Goodfellow et al., 2014) and GMMN+AE (Li et al., 2015).

Model              MNIST
DBN                138 ± 2
Stacked CAE        121 ± 1.6
Deep GSN           214 ± 1.1
Adversarial nets   225 ± 2
GMMN+AE            282 ± 2
VAE-DGP (5)        301.67
VAE-DGP (10-50)    674.86
VAE-DGP (5-20-50)  723.65

Figure 6: Samples of imputation on the test sets. The gray color indicates the missing area. The 1st column shows the input images, the 2nd column shows the imputed images and the 3rd column shows the original full images.

We first apply our model to the combination of Frey faces and Yale faces (Frey-Yale). The Frey faces dataset contains 1956 20 × 28 frames taken from a video clip. The Yale faces dataset contains 2414 images, which are resized to 20 × 28. We take the last 200 frames from the Frey faces and 300 images randomly from the Yale faces as the test set and use the rest for training. The intensity of the original gray-scale images is normalized to [0, 1]. The applied VAE-DGP has two hidden layers (a 2D top hidden layer and a 20D middle hidden layer). The exponentiated quadratic kernel is used for all the layers with 100 inducing points. All the MLPs in the recognition model have two hidden layers with widths (500-300). As a generative model, we can draw samples from the learned model by sampling first from the prior distribution of the top hidden layer (a 2D unit Gaussian distribution in this case) and then layer-wise downwards. The generated images are shown in Figure 5a.

To evaluate the ability of our model to learn the data distribution, we train the VAE-DGP on MNIST (LeCun et al., 1998). We use the whole training set for learning, which consists of 60,000 28 × 28 images. The intensity of the original gray-scale images is normalized to [0, 1]. We train our model with three different model settings (one, two and three hidden layers).
30. Experiment: quantitative evaluation
¤ Log-likelihood (MNIST)
¤ Supervised learning (regression)
¤ Datasets:
¤ The Abalone dataset
¤ The Creep dataset
Table 2: MSE obtained from VAE-DGP, standard GP and linear regression for the Abalone and Creep benchmarks.

Model      Abalone          Creep
VAE-DGP    825.31 ± 64.35   575.39 ± 29.10
GP         888.96 ± 78.22   602.11 ± 29.59
Lin. Reg.  917.31 ± 53.76   1865.76 ± 23.36
31. Mean-field approximation in variational inference
¤ In VAEs we have taken the approximate distribution to be q(z|x),
¤ with q(z|x) represented by a neural network.
¤ In general, the approximate distribution is obtained by a mean-field approximation.
¤ Richer approximate distributions can also be considered:
¤ treat the parameters λ as random variables and place a prior on them (a hierarchical variational model).

log p(x) = L(x) + KL( q(z|x) ‖ p(z|x) )
Variational Models
• We want to compute the posterior p(z|x) (z: latent variables, x: data)
• Variational inference seeks to minimize KL( q(z; λ) ‖ p(z|x) ) for a family q(z; λ)
• Equivalently, it maximizes the evidence lower bound (ELBO):
  log p(x) ≥ E_{q(z;λ)}[ log p(x|z) ] − KL( q(z; λ) ‖ p(z) )
• (Common) Mean-field distribution: q(z; λ) = ∏_i q(z_i; λ_i)
• Hierarchical variational models:
• (Newer) Interpret the family as a variational model for the posterior latent variables z (introducing new latent variables) [1]
[1] Lawrence, N. (2000). Variational Inference in Probabilistic Models. PhD thesis.
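The identity log p(x) = L(x) + KL(q‖p) can be checked numerically on a conjugate toy model where everything is closed form: z ∼ N(0, 1), x|z ∼ N(z, 1), so p(x) = N(0, 2) and the bound is tight exactly when q is the true posterior N(x/2, 1/2). The model and the numbers below are illustrative, not from the presentation:

```python
import numpy as np

def elbo(x, m, s):
    """ELBO for the toy model z ~ N(0,1), x|z ~ N(z,1) with q(z) = N(m, s^2):
    E_q[log p(x|z)] - KL(q(z) || p(z)), both terms in closed form."""
    expected_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    kl = 0.5 * (s ** 2 + m ** 2 - 1.0) - np.log(s)
    return expected_loglik - kl

x = 1.3
log_px = -0.5 * np.log(4 * np.pi) - x ** 2 / 4         # exact: p(x) = N(0, 2)
print(elbo(x, 0.0, 1.0) <= log_px)                     # any q gives a lower bound
print(np.isclose(elbo(x, x / 2, np.sqrt(0.5)), log_px))  # tight at the true posterior
```

The gap between log p(x) and the ELBO is exactly KL(q(z|x)‖p(z|x)), which is why the bound closes when q equals the posterior.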
33. Likelihood of the variational Gaussian process
¤ From the generative process on the previous page, the marginal likelihood of the latent variables z is given below.
¤ For the approximate distribution modeled this way, whatever the true posterior p(z|x) is, there exist parameters achieving KL(q ‖ p(z|x)) = 0 (universal approximation).
¤ In other words, this is a far more flexible model than any previous approach.
variational distribution. (This idea appears in a different context in Blei & Lafferty (2006).) The VGP specifies the following generative process for posterior latent variables z:

1. Draw latent input ξ ∈ R^c: ξ ∼ N(0, I).
2. Draw non-linear mapping f: R^c → R^d conditioned on D: f ∼ ∏_{i=1}^d GP(0, K_{ξξ}) | D.
3. Draw approximate posterior samples z ∈ supp(p): z = (z_1, …, z_d) ∼ ∏_{i=1}^d q(f_i(ξ)).

Figure 1 displays a graphical model for the VGP. Marginalizing over all non-linear mappings and latent inputs, the VGP is

q_{VGP}(z; θ, D) = ∫∫ [ ∏_{i=1}^d q(z_i | f_i(ξ)) ] [ ∏_{i=1}^d GP(f_i; 0, K_{ξξ}) | D ] N(ξ; 0, I) df dξ,  (4)

which is parameterized by kernel hyperparameters θ and variational data.

As a variational model, the VGP forms an infinite ensemble of mean-field distributions. A mean-field distribution is specified conditional on a fixed function f(·) and input ξ; the d outputs f_i(ξ) = λ_i are the mean-field's parameters. The VGP is a form of a hierarchical variational model (Ranganath et al., 2015); it places a continuous Bayesian nonparametric prior over mean-field parameters.

Note that the VGP evaluates the d draws from a GP at the same latent input ξ, which induces correlation between their outputs, the mean-field parameters. In turn, this induces correlation between latent variables of the variational model, correlations that are not captured in classical mean-field. Finally, the complex non-linear mappings drawn from the GP make the VGP a flexible model for complex discrete and continuous posteriors.

We emphasize that the VGP needs variational data because, unlike typical GP regression, there is no observed data available to learn a distribution over non-linear mappings.
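The three-step generative process above can be sketched numerically for d = 1: condition a GP on a few variational data points D, draw ξ, draw f(ξ) from the GP conditional, then draw z. The kernel, the variational data values, and the unit-variance Gaussian choice for q are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf(a, b):
    """Squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

# Hypothetical variational data D = {(s_j, t_j)}: inputs s, outputs t.
s = np.array([-1.0, 0.0, 1.0])
t = np.array([-2.0, 0.5, 2.0])

def sample_vgp(n_samples=5, jitter=1e-6):
    """One pass of the VGP generative process (steps 1-3) per sample:
    xi ~ N(0, I); f(xi) from the GP conditional given D; z ~ q(. | f(xi)),
    with q taken to be a unit-variance Gaussian centred at f(xi)."""
    K = rbf(s, s) + jitter * np.eye(len(s))
    zs = []
    for _ in range(n_samples):
        xi = rng.normal(size=1)                       # step 1: latent input
        k = rbf(xi, s)                                # cross-covariances
        mean = (k @ np.linalg.solve(K, t))[0]         # step 2: GP | D
        var = 1.0 + jitter - (k @ np.linalg.solve(K, k.T))[0, 0]
        f_xi = mean + np.sqrt(max(var, 0.0)) * rng.normal()
        zs.append(rng.normal(f_xi, 1.0))              # step 3: z ~ q(. | f(xi))
    return np.array(zs)
```

Because every coordinate of z would share the same ξ in the d > 1 case, the GP outputs (and hence the latent variables) become correlated, which is the property the text emphasizes.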
(Shown again:)
log p(x) = L(x) + KL( q(z|x) ‖ p(z|x) )
Figure 2: Sequence of domain mappings during inference, from variational latent variable space R to posterior latent variable space Q to data space P. We perform variational inference in the posterior space and auxiliary inference in the variational space.

The VGP addresses the task of posterior inference by learning f*: conditional on variational data with parameters to learn, the distribution of the GP learns to concentrate around this optimal mapping during inference. This perspective provides intuition behind the following result.

Theorem 1 (Universal approximation). Let q(z; θ, D) denote the variational Gaussian process. For any posterior distribution p(z | x) with a finite number of latent variables and continuous quantile function (inverse CDF), there exist a set of parameters (θ, D) such that

KL( q(z; θ, D) ‖ p(z | x) ) = 0.

See Appendix B for a proof. Theorem 1 states that any posterior distribution with strictly positive density can be represented by a VGP. Thus the VGP is a flexible model for learning posterior distributions.
34. Lower bound
¤ Training maximizes the following lower bound.
¤ Intuitively, it works as follows:
¤ the variational model generates z from x;
¤ the auxiliary model generates the mapping and latent input from x and z.
3 BLACK BOX INFERENCE
3.1 VARIATIONAL OBJECTIVE
We derive an algorithm for performing black box inference over a wide class of generative models. The original ELBO (Eq. 1) is analytically intractable due to the log density log q_{VGP}(z) (Eq. 4). We derive a tractable variational objective inspired by auto-encoders.

Specifically, a tractable lower bound to the model evidence log p(x) can be derived by subtracting an expected KL divergence term from the ELBO:

log p(x) ≥ E_{q_{VGP}}[ log p(x | z) ] − KL( q_{VGP}(z) ‖ p(z) ) − E_{q_{VGP}}[ KL( q(ξ, f | z) ‖ r(ξ, f | z) ) ],

where r(ξ, f | z) is an auxiliary model. Such an objective has been considered independently by Salimans et al. (2015) and Ranganath et al. (2015). Variational inference is performed in the posterior latent variable space, minimizing KL(q ‖ p) to learn the variational model; for this to occur, auxiliary inference is performed in the variational latent variable space, minimizing KL(q ‖ r) to learn an auxiliary model. See Figure 2.

Unlike previous approaches, we rewrite this variational objective to connect to auto-encoders:

L̃ = E_{q_{VGP}}[ log p(x | z) ] − E_{q_{VGP}}[ KL( q(z | f(ξ)) ‖ p(z) ) + KL( q(ξ, f) ‖ r(ξ, f | z) ) ],  (5)

where the KL divergences are now taken over tractable distributions (see Appendix C). In auto-encoder parlance, we maximize the expected negative reconstruction error, regularized by an expected divergence between the variational model and the original model's prior, and an expected divergence between the auxiliary model and the variational model's prior. This is simply a nested instantiation of the variational auto-encoder bound (Kingma & Welling, 2014): a KL divergence between the inference model and a prior is taken as regularizers on both the posterior and variational spaces. This interpretation justifies the previously proposed bound for variational models; as we shall see, it also enables lower variance gradients during stochastic optimization.
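The chain of inequalities behind this objective (each KL term can only loosen the bound: log p(x) ≥ ELBO ≥ ELBO − E_q[KL(q‖r)]) can be verified on a tiny discrete model where every quantity is an exact sum; all probability tables below are arbitrary illustrative numbers:

```python
import numpy as np

# Discrete toy model: z in {0, 1}, auxiliary variable a in {0, 1}.
p_z = np.array([0.3, 0.7])               # prior p(z)
p_x_given_z = np.array([0.2, 0.9])       # likelihood p(x | z) at the observed x
log_px = np.log(p_z @ p_x_given_z)       # exact evidence

q_za = np.array([[0.1, 0.3],             # joint variational model q(z, a)
                 [0.4, 0.2]])
q_z = q_za.sum(axis=1)
q_a_given_z = q_za / q_z[:, None]
r_a_given_z = np.array([[0.5, 0.5],      # auxiliary model r(a | z)
                        [0.8, 0.2]])

elbo = q_z @ (np.log(p_x_given_z) - np.log(q_z / p_z))
aux_kl = q_z @ (q_a_given_z * np.log(q_a_given_z / r_a_given_z)).sum(axis=1)
aux_bound = elbo - aux_kl                # the tractable auxiliary objective
print(aux_bound <= elbo <= log_px)       # each KL only loosens the bound
```

Maximizing the auxiliary bound therefore simultaneously fits q to the posterior and r to the variational model, exactly the two inference problems described above.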
(Slide annotation: in Eq. 5, the first term is the reconstruction error, the KL terms are the regularizers, and r is the auxiliary model.)
Under review as a conference paper at ICLR 2016
3.2 AUTO-ENCODING VARIATIONAL MODELS
Inference networks provide a flexible parameterization of approximating distributions, as used in Helmholtz machines (Hinton & Zemel, 1994), deep Boltzmann machines (Salakhutdinov & Larochelle, 2010), and variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014). It replaces local variational parameters with global parameters coming from a neural network. Specifically, for latent variables z_n which correspond to a data point x_n, an inference network specifies a neural network which takes x_n as input and its local variational parameters λ_n as output. This amortizes inference by only defining a set of global parameters.

To auto-encode the VGP we specify inference networks to parameterize both the variational and auxiliary models. Unique from other auto-encoder approaches, we let the auxiliary model take both the observed data point x_n and variational data point z_n as input:

x_n ↦ q(z_n | x_n; θ_n),  x_n, z_n ↦ r(ξ_n, f_n | x_n, z_n; φ_n),

where q has local variational parameters given by the variational data D_n, and r is specified as a fully factorized Gaussian with local variational parameters φ_n = (μ_n ∈ R^{c+d}, σ²_n ∈ R^{c+d}). Note that by letting r's inference network take both x_n and z_n as input, we avoid the restrictive explicit specification of r(ξ, f | z). This idea was first suggested but not implemented in Salimans et al. (2015).
35. Experiment: log-likelihood
¤ First to break into the unprecedented 70s (negative log-likelihood below 80)
¤ The best model uses DRAW for the generative part and a VGP as the approximate distribution
Table 1: Negative predictive log-likelihood for binarized MNIST. Previous best results are [1] (Burda et al., 2016), [2] (Salimans et al., 2015), [3] (Rezende & Mohamed, 2015), [4] (Raiko et al., 2014), [5] (Murray & Salakhutdinov, 2009), [6] (Gregor et al., 2014), [7] (Gregor et al., 2015).

Model                                −log p(x)   ≤
DLGM + VAE [1]                       86.76
DLGM + HVI (8 leapfrog steps) [2]    85.51       88.30
DLGM + NF (k = 80) [3]               85.10
EoNADE-5 2hl (128 orderings) [4]     84.68
DBN 2hl [5]                          84.55
DARN 1hl [6]                         84.13
Convolutional VAE + HVI [2]          81.94       83.49
DLGM 2hl + IWAE (k = 50) [1]         82.90
DRAW [7]                             80.97
DLGM 1hl + VGP                       84.79
DLGM 2hl + VGP                       81.32
DRAW + VGP                           79.88