This document discusses techniques for training deep variational autoencoders and probabilistic ladder networks. It proposes three advances: 1) Using an inference model similar to ladder networks with multiple stochastic layers, 2) Adding a warm-up period to keep units active early in training, and 3) Using batch normalization. These advances allow training models with up to five stochastic layers and achieve state-of-the-art log-likelihood results on benchmark datasets. The document explains variational autoencoders, probabilistic ladder networks, and how the proposed techniques parameterize the generative and inference models.
6.
Figure 2. Flow of information in the inference and generative models of a) probabilistic ladder network and b) VAE. The probabilistic ladder network allows direct integration (+ in figure, see Eq. (21)) of bottom-up and top-down information in the inference model. In the VAE the top-down information is incorporated indirectly through the conditional priors in the generative model.

The generative model p_θ is specified as follows:

p_θ(x|z_1) = N(x | µ_θ(z_1), σ²_θ(z_1))   or   (1)
p_θ(x|z_1) = B(x | µ_θ(z_1))   (2)

for continuous-valued (Gaussian N) or binary-valued (Bernoulli B) data, respectively. The latent variables z are split into L layers z_i, i = 1 … L:

p_θ(z_i|z_{i+1}) = N(z_i | µ_{θ,i}(z_{i+1}), σ²_{θ,i}(z_{i+1}))   (3)
p_θ(z_L) = N(z_L | 0, I).   (4)

The hierarchical specification allows the lower layers of the latent variables to be highly correlated but still maintain the computational efficiency of fully factorized models.

Each layer in the inference model q_φ(z|x) is specified using a fully factorized Gaussian distribution:

q_φ(z_1|x) = N(z_1 | µ_{φ,1}(x), σ²_{φ,1}(x))   (5)
q_φ(z_i|z_{i−1}) = N(z_i | µ_{φ,i}(z_{i−1}), σ²_{φ,i}(z_{i−1}))   (6)

for i = 2 … L.
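The layered specification of Eqs. (1)–(4) amounts to ancestral sampling from the top layer downwards. A minimal sketch is below; the layer sizes, the random affine maps standing in for trained networks, and the Gaussian data likelihood are illustrative assumptions, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(v):
    # log(1 + exp(.)) keeps variances positive, as in Eq. (9)
    return np.log1p(np.exp(v))

def make_layer(in_dim, out_dim):
    """Random affine maps standing in for trained mu and sigma^2 networks."""
    W_mu = rng.normal(0.0, 0.1, (out_dim, in_dim))
    W_s = rng.normal(0.0, 0.1, (out_dim, in_dim))
    return (lambda z: W_mu @ z), (lambda z: softplus(W_s @ z))

# Hypothetical sizes: L = 3 stochastic layers of width 4, data dimension 8.
dims = [8, 4, 4, 4]  # [x, z1, z2, z3]
layers = [make_layer(dims[i + 1], dims[i]) for i in range(3)]  # p(x|z1), p(z1|z2), p(z2|z3)

# Ancestral sampling: z_L ~ N(0, I) (Eq. 4), then z_i ~ N(mu_i(z_{i+1}), sigma2_i(z_{i+1})) (Eq. 3)
z = rng.standard_normal(dims[-1])
for mu, sigma2 in reversed(layers[1:]):
    z = mu(z) + np.sqrt(sigma2(z)) * rng.standard_normal(mu(z).shape)

mu_x, s2_x = layers[0]
x = mu_x(z) + np.sqrt(s2_x(z)) * rng.standard_normal(dims[0])  # Eq. (1), Gaussian data
print(x.shape)
```

Because each conditional is a diagonal Gaussian given the layer above, sampling stays cheap while the marginal over the lower layers can be highly correlated.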
We show improved generative performance, measured in terms of log-likelihood, when compared to equally or more complicated methods for creating flexible variational distributions such as the Variational Gaussian Processes (Tran et al., 2015), Normalizing Flows (Rezende & Mohamed, 2015) or Importance Weighted Autoencoders (Burda et al., 2015). We find that batch normalization and warm-up always increase the generative performance, suggesting that these methods are broadly useful. We also show that the probabilistic ladder network performs as good or better than strong VAEs, making it an interesting model for further studies. Secondly, we study the learned latent representations. We find that the methods proposed here are necessary for learning rich latent representations utilizing several layers of latent variables. A qualitative assessment of the latent representations further indicates that the multi-layered DGMs capture high-level structure in the datasets, which is likely to be useful for semi-supervised learning.

In summary our contributions are:
• A new parametrization of the VAE inspired by the ladder network, performing as well or better than the current best models.
• A novel warm-up period in training, increasing both the generative performance across several different datasets and the number of active stochastic latent units.
• A demonstration that these techniques are essential for training models with several layers of active stochastic latent variables.
7.
¤ N(x | µ, σ²)
¤ B(x | µ)
VAEs train a generative model for data x using auxiliary latent variables z together with an inference model q_φ(z|x), by optimizing a variational lower bound to the likelihood. The inference model is also called the recognition model or encoder, and the generative model the decoder.
q_φ(z_1|x) = N(z_1 | µ_{φ,1}(x), σ²_{φ,1}(x))   (5)
q_φ(z_i|z_{i−1}) = N(z_i | µ_{φ,i}(z_{i−1}), σ²_{φ,i}(z_{i−1}))   (6)

for i = 2 … L.

Functions µ(·) and σ²(·) in both the generative and the inference models are implemented as:

d(y) = MLP(y)   (7)
µ(y) = Linear(d(y))   (8)
σ²(y) = Softplus(Linear(d(y))),   (9)

where MLP is a two-layered multilayer perceptron network, Linear is a single linear layer, and Softplus applies the log(1 + exp(·)) nonlinearity to each component of its argument vector. In our notation, each MLP(·) or Linear(·) gives a new mapping with its own parameters, so the deterministic variable d is used to mark that the MLP part is shared between µ and σ², whereas the last Linear layer is not shared.
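The shared-trunk structure of Eqs. (7)–(9) can be sketched as follows. The hidden width, the ReLU activation, and the random initialization are assumptions made for illustration; only the wiring (shared MLP trunk, unshared Linear heads, Softplus on the variance head) follows the text above.

```python
import numpy as np

rng = np.random.default_rng(1)

def linear(in_dim, out_dim):
    W = rng.normal(0.0, 0.1, (out_dim, in_dim))
    b = np.zeros(out_dim)
    return lambda y: W @ y + b

def mlp(in_dim, hidden, out_dim):
    """Two-layered MLP of Eq. (7); the ReLU activation is an assumption."""
    l1, l2 = linear(in_dim, hidden), linear(hidden, out_dim)
    return lambda y: np.maximum(l2(np.maximum(l1(y), 0.0)), 0.0)

def make_mu_sigma2(in_dim, hidden, out_dim):
    d = mlp(in_dim, hidden, out_dim)                   # shared trunk d(y), Eq. (7)
    head_mu = linear(out_dim, out_dim)                 # unshared Linear head, Eq. (8)
    head_s = linear(out_dim, out_dim)                  # unshared Linear head, Eq. (9)
    mu = lambda y: head_mu(d(y))
    sigma2 = lambda y: np.log1p(np.exp(head_s(d(y))))  # Softplus keeps sigma^2 > 0
    return mu, sigma2

mu, sigma2 = make_mu_sigma2(in_dim=8, hidden=16, out_dim=4)
y = rng.standard_normal(8)
print(mu(y).shape, bool((sigma2(y) > 0).all()))
```

Sharing d between the two heads halves the parameter count of the mapping while still letting µ and σ² differ through their final Linear layers.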
Abstract

Variational autoencoders are a powerful framework for unsupervised learning. However, previous work has been restricted to shallow models with one or two layers of fully factorized stochastic latent variables, limiting the flexibility of the latent representation. We propose three advances in training algorithms of variational autoencoders, for the first time allowing to train deep models of up to five stochastic layers: (1) using a structure similar to the Ladder network as the inference model, (2) a warm-up period to support stochastic units staying active in early training, and (3) use of batch normalization. Using these improvements we show state-of-the-art log-likelihood results for generative modeling on several benchmark datasets.
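The warm-up idea can be sketched as a weight on the KL term of the variational bound that is annealed from 0 to 1, so the model first learns to reconstruct before the prior starts pulling latent units toward inactivity. The linear schedule and the 200-epoch length below are illustrative assumptions, not the paper's exact settings.

```python
def warmup_beta(epoch, n_warmup_epochs=200):
    """Linear KL warm-up weight; the linear form and epoch count are assumptions."""
    return min(1.0, epoch / n_warmup_epochs)

def lower_bound(recon_loglik, kl_term, epoch):
    # During warm-up the KL term is down-weighted, so units are not pruned early on.
    return recon_loglik - warmup_beta(epoch) * kl_term

print(warmup_beta(0), warmup_beta(100), warmup_beta(500))
```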
1. Introduction

The recently introduced variational autoencoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014) provides a framework for deep generative models (DGM). DGMs have later been shown to be a powerful framework for semi-supervised learning (Kingma et al., 2014; Maaløe et al., 2016).
Figure 1. Inference (or encoder/recognition) and generative (or decoder) models: a) VAE inference model, b) probabilistic ladder inference model, and c) generative model. Latent variables are sampled from the approximate posterior, with means and variances parameterized by neural networks. The models have multiple layers of latent variables, each conditioned on the layer above, giving a highly flexible latent distribution. We study two model parameterizations: the first extends the VAE to multiple layers of latent variables; the second is parameterized in such a way that it can be regarded as a probabilistic variational variant of the ladder network.

arXiv:1602.02282v1 [stat.ML]
12. VAE extensions
¤ CVAE [Kingma+ 2014]
¤ Variational Fair Auto Encoder [Louizos+ 15]
¤ Encourages the latent code q(z|X) to be insensitive to a sensitive attribute X via an MMD penalty
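The MMD penalty idea can be sketched as below: compare the latent samples of the two sensitive groups with a kernel two-sample statistic and penalize the difference. The RBF kernel, its bandwidth, and the toy data are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def mmd2_rbf(Z0, Z1, gamma=1.0):
    """Biased estimate of squared MMD between latent samples of two sensitive groups."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return gram(Z0, Z0).mean() + gram(Z1, Z1).mean() - 2.0 * gram(Z0, Z1).mean()

rng = np.random.default_rng(2)
same = mmd2_rbf(rng.normal(0, 1, (100, 4)), rng.normal(0, 1, (100, 4)))
shifted = mmd2_rbf(rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4)))
print(bool(same < shifted))  # a group-dependent latent code incurs a larger penalty
```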
The inference network q_φ(z|x) (3) is used during training of the model using both the labelled and unlabelled data sets. This approximate posterior is then used as a feature extractor for the labelled data set, and the features used for training the classifier.

3.1.2 Generative Semi-supervised Model Objective

For this model, we have two cases to consider. In the first case, the label corresponding to a data point is observed and the variational bound is a simple extension of equation (5):

log p_θ(x, y) ≥ E_{q_φ(z|x,y)}[log p_θ(x|y, z) + log p_θ(y) + log p(z) − log q_φ(z|x, y)] = −L(x, y).   (6)

For the case where the label is missing, it is treated as a latent variable over which we perform posterior inference, and the resulting bound for handling data points with an unobserved label y is:

log p_θ(x) ≥ E_{q_φ(y,z|x)}[log p_θ(x|y, z) + log p_θ(y) + log p(z) − log q_φ(y, z|x)]
= Σ_y q_φ(y|x)(−L(x, y)) + H(q_φ(y|x)) = −U(x).   (7)

The bound on the marginal likelihood for the entire dataset is now:

J = Σ_{(x,y)∼p̃_l} L(x, y) + Σ_{x∼p̃_u} U(x).   (8)

The distribution q_φ(y|x) (4) for the missing labels has the form of a discriminative classifier, and we can use this knowledge to construct the best classifier possible as our inference model. This distribution is also used at test time for predictions of any unseen data.

In the objective function (8), the label predictive distribution q_φ(y|x) contributes only to the second term relating to the unlabelled data, which is an undesirable property if we wish to use this distribution as a classifier. Ideally, all model and variational parameters should learn in all cases. To remedy this, a classification loss on the labelled data is added to the objective, so that q_φ(y|x) also learns directly from labelled examples.
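Because y is discrete, the expectation in Eq. (7) is an explicit sum over classes. The sketch below evaluates U(x) with hypothetical classifier probabilities q(y|x) and hypothetical per-class losses L(x, y); the numbers are made up for illustration.

```python
import numpy as np

def U(q_y, L_vals):
    """U(x) from Eq. (7): the negative of sum_y q(y|x)(-L(x,y)) + H(q(y|x))."""
    entropy = -(q_y * np.log(q_y)).sum()          # H(q(y|x))
    return -((q_y * -L_vals).sum() + entropy)

q_y = np.array([0.7, 0.2, 0.1])        # hypothetical q(y|x) over 3 classes
L_vals = np.array([90.0, 95.0, 97.0])  # hypothetical labelled-case losses L(x, y)
u = U(q_y, L_vals)
print(u)  # the quantity summed over unlabelled points in Eq. (8)
```

The entropy term rewards keeping q(y|x) uncertain when no single label explains the data point well.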
13. Auxiliary Deep Generative Models
¤ Auxiliary Deep Generative Models [Maaløe+ 16]
¤ Extends the variational distribution with auxiliary variables [Agakov+ 2004]
¤ Achieves state-of-the-art semi-supervised results
2. Auxiliary deep generative models

Kingma (2013); Rezende et al. (2014) have coupled the approach of variational inference with deep learning, giving rise to powerful probabilistic models constructed by an inference neural network q(z|x) and a generative neural network p(x|z). This approach can be perceived as a probabilistic equivalent to the deep auto-encoder, in which q(z|x) acts as the encoder and p(x|z) the decoder. However, the difference is that these models ensure efficient inference over various continuous distributions in the latent space z and complex input datasets x, where the posterior distribution p(x|z) is intractable. Furthermore, the gradients of the variational upper bound are easily defined by backpropagation through the network(s). To keep the computational requirements low, the variational distribution q(z|x) is usually chosen to be a diagonal Gaussian, limiting the expressive power of the inference model.

In this paper we propose a variational auxiliary variable approach (Agakov and Barber, 2004) to improve the variational distribution: the generative model is extended with variables a to p(x, z, a) such that the original model is invariant to marginalization over a: p(x, z, a) = p(a|x, z)p(x, z). In the variational distribution, on the other hand, a is used such that the marginal q(z|x) = ∫ q(z|a, x)q(a|x) da is a general non-Gaussian distribution. This hierarchical specification allows the latent variables to be correlated through a, while maintaining the computational efficiency of fully factorized models (cf. Fig. 1).

Figure 1. Probabilistic graphical model of the ADGM for semi-supervised learning: (a) generative model P over y, z, a, x; (b) inference model Q. The incoming joint connections to each variable are deep neural networks with parameters θ and φ.

2.2. Auxiliary variables

We propose to extend the variational distribution with auxiliary variables a: q(a, z|x) = q(z|a, x)q(a|x), such that the marginal distribution q(z|x) can fit more complicated posteriors p(z|x). In order to have an unchanged generative model, p(x|z), it is required that the joint p(x, z, a) gives back the original p(x, z) under marginalization over a, thus p(x, z, a) = p(a|x, z)p(x, z). Auxiliary variables are used in the EM algorithm and Gibbs sampling, and have previously been considered for variational learning by Agakov and Barber (2004). Recently, Ranganath et al. (2015) proposed the closely related idea of making the parameters of the variational distribution themselves stochastic.
(x|z). This approach can be perceived as
valent to the deep auto-encoder, in which
e encoder and p(x|z) the decoder. How-
ce is that these models ensures efficient
arious continuous distributions in the la-
complex input datasets x, where the pos-
n p(x|z) is intractable. Furthermore, the
variational upper bound are easily defined
on through the network(s). To keep the
quirements low the variational distribution
chosen to be a diagonal Gaussian, limiting
wer of the inference model.
e propose a variational auxiliary vari-
Agakov and Barber, 2004) to improve
istribution: The generative model is ex-
bles a to p(x, z, a) such that the original
t to marginalization over a: p(x, z, a) =
In the variational distribution, on the
s used such that marginal q(z|x) =
)da is a general non-Gaussian distribution.
specification allows the latent variables to
ough a, while maintaining the computa-
(a) Generative model P. (b) Inference model
Figure 1. Probabilistic graphical model of the ADGM
supervised learning. The incoming joint connections to e
able are deep neural networks with parameters ✓ and .
2.2. Auxiliary variables
We propose to extend the variational distribution w
iliary variables a: q(a, z|x) = q(z|a, x)q(a|x) s
the marginal distribution q(z|x) can fit more com
posteriors p(z|x). In order to have an unchang
erative model, p(x|z), it is required that the joi
p(x, z, a) gives back the original p(x, z) under m
ization over a, thus p(x, z, a) = p(a|x, z)p(x, z
iliary variables are used in the EM algorithm an
sampling and has previously been considered fo
tional learning by Agakov and Barber (2004). R
Ranganath et al. (2015) has proposed to make the
q(z|x) acts as the encoder and p(x|z) the decoder
ever, the difference is that these models ensures e
inference over various continuous distributions in
tent space z and complex input datasets x, where t
terior distribution p(x|z) is intractable. Furtherm
gradients of the variational upper bound are easily
by backpropagation through the network(s). To k
computational requirements low the variational dist
q(z|x) is usually chosen to be a diagonal Gaussian,
the expressive power of the inference model.
In this paper we propose a variational auxiliar
able approach (Agakov and Barber, 2004) to i
the variational distribution: The generative mode
tended with variables a to p(x, z, a) such that the
model is invariant to marginalization over a: p(x,
p(a|x, z)p(x, z). In the variational distribution,
other hand, a is used such that marginal q(zR
q(z|a, x)p(a|x)da is a general non-Gaussian distr
This hierarchical specification allows the latent vari
be correlated through a, while maintaining the co
tional efficiency of fully factorized models (cf. Fig
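The effect of the auxiliary construction can be sketched numerically. Below is a minimal 1-D illustration (not the ADGM itself; the tanh mean function and all constants are hypothetical) in which both q(a|x) and q(z|a, x) are Gaussian, yet the marginal q(z|x) = ∫ q(z|a, x)q(a|x) da is bimodal and far from any diagonal Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

# Both conditionals are Gaussian, but the marginal over z is not:
# a ~ q(a|x) = N(0, 1), z ~ q(z|a, x) = N(2 tanh(3a), 0.3^2).
a = rng.normal(0.0, 1.0, size=100_000)
z = rng.normal(2.0 * np.tanh(3.0 * a), 0.3)

# The marginal is bimodal with modes near z = -2 and z = +2, so the
# density near z = 0 is far below the density at the modes.
hist, _ = np.histogram(z, bins=np.linspace(-3, 3, 61), density=True)
mid, peak = hist[30], hist.max()
```

Because tanh saturates for most draws of a, almost all of the z-mass lands near ±2, which a single diagonal Gaussian q(z|x) could not represent.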
15. Importance Weighted AE
¤ Importance Weighted Autoencoder [Burda+ 15]
¤ Averages k importance weights inside the log; the bound tightens as k grows
¤ Rényi divergence interpretation [Li+ 16]

log p(x) ≥ E_{z⁽¹⁾,…,z⁽ᵏ⁾ ∼ q_φ(z|x)} [ log (1/k) Σ_{i=1}^{k} p_θ(x, z⁽ⁱ⁾) / q_φ(z⁽ⁱ⁾|x) ] ≥ L(θ, φ; x)
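A Monte Carlo sketch of the bound on a toy conjugate model (all distributions below are hypothetical choices, made so that log p(x) is known in closed form) shows the k-sample bound tightening as k grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy conjugate model so log p(x) is known exactly:
#   p(z) = N(0, 1), p(x|z) = N(z, 1)  =>  p(x) = N(0, 2).
x = 1.5
log_px_true = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0

def log_norm(v, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mu) ** 2 / (2 * var)

# Deliberately mismatched proposal; the true posterior is N(x/2, 1/2).
q_mu, q_var = x / 2, 0.9

def iwae_bound(k, n_rep=20_000):
    z = rng.normal(q_mu, np.sqrt(q_var), size=(n_rep, k))
    log_w = log_norm(z, 0.0, 1.0) + log_norm(x, z, 1.0) - log_norm(z, q_mu, q_var)
    # log (1/k) sum_i w_i per repetition, averaged over repetitions
    m = log_w.max(axis=1, keepdims=True)
    log_mean_w = m.squeeze(1) + np.log(np.exp(log_w - m).mean(axis=1))
    return log_mean_w.mean()

l1, l5, l50 = iwae_bound(1), iwae_bound(5), iwae_bound(50)
# l1 is the ordinary ELBO; l50 is much closer to log p(x) from below.
```

The log-sum-exp trick (subtracting the per-row max m) keeps the weight averaging numerically stable.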
Recall from Section 2.1 that the family of Rényi divergences includes the KL divergence. Perhaps can variational free-energy approaches be generalised to the Rényi case? Consider approximating the true posterior p(θ|D) by minimizing Rényi's α-divergence for some selected α ≥ 0:

q(θ) = argmin_{q∈Q} D_α[q(θ)||p(θ|D)].

Now we verify the alternative optimization problem

q(θ) = argmax_{q∈Q} { log p(D) − D_α[q(θ)||p(θ|D)] }.

When α ≠ 1, the objective can be rewritten as

log p(D) − (1/(α−1)) log ∫ q(θ)^α p(θ|D)^{1−α} dθ
 = log p(D) − (1/(α−1)) log E_q[ ( p(θ, D) / (q(θ) p(D)) )^{1−α} ]
 = (1/(1−α)) log E_q[ ( p(θ, D) / q(θ) )^{1−α} ] := L_α(q; D).

We name this new objective the variational Rényi bound (VR). Importantly the following theorem is a direct result of Proposition 1.

Theorem 1. The objective L_α(q; D) is continuous and non-increasing on α ∈ [0, 1] ∪ {α : |L_α| < +∞}. Especially for all 0 < α < 1,
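The VR bound can be estimated by Monte Carlo on the same kind of toy conjugate model (the Gaussians below are hypothetical, chosen so that log p(x) is exact). The estimate illustrates Theorem 1's monotonicity in α, and that α = 0 recovers the marginal likelihood when q covers the posterior:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: p(z) = N(0, 1), p(x|z) = N(z, 1), so p(x) = N(0, 2).
x = 1.5
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0

def log_norm(v, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mu) ** 2 / (2 * var)

q_mu, q_var = x / 2, 0.9
z = rng.normal(q_mu, np.sqrt(q_var), size=500_000)
log_w = log_norm(z, 0, 1) + log_norm(x, z, 1) - log_norm(z, q_mu, q_var)

def vr_bound(alpha):
    """MC estimate of L_alpha = 1/(1-alpha) log E_q[(p(x,z)/q(z))^(1-alpha)]."""
    s = (1 - alpha) * log_w
    m = s.max()
    return (m + np.log(np.exp(s - m).mean())) / (1 - alpha)

elbo = log_w.mean()                 # the alpha -> 1 limit (standard ELBO)
l0, l_half = vr_bound(0.0), vr_bound(0.5)
# L_0 >= L_0.5 >= ELBO, and L_0 equals log p(x) exactly.
```

Monotonicity here follows from the monotonicity of power means of the importance weights, which also holds for the empirical estimate at fixed samples.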
16. ¤ Normalizing flows [Rezende+ 15]
¤ Variational Gaussian Process [Tran+ 15]
Variational Inference with Normalizing Flows

…and involve matrix inverses that can be numerically unstable. We therefore require normalizing flows that allow for low-cost computation of the determinant, or where the Jacobian is not needed at all.

4.1. Invertible Linear-time Transformations

We consider a family of transformations of the form:

f(z) = z + u h(wᵀz + b),  (10)

where λ = {w ∈ ℝᴰ, u ∈ ℝᴰ, b ∈ ℝ} are free parameters and h(·) is a smooth element-wise non-linearity, with derivative h′(·). For this mapping we can compute the logdet-Jacobian term in O(D) time (using the matrix determinant lemma):

ψ(z) = h′(wᵀz + b) w  (11)
|det ∂f/∂z| = |det(I + u ψ(z)ᵀ)| = |1 + uᵀψ(z)|.  (12)

From (7) we conclude that the density q_K(z) obtained by transforming an arbitrary initial density q_0(z) through the sequence of maps f_k of the form (10) is implicitly given by:

z_K = f_K ∘ f_{K−1} ∘ … ∘ f_1(z)
ln q_K(z_K) = ln q_0(z) − Σ_{k=1}^{K} ln |1 + u_kᵀ ψ_k(z_k)|.  (13)

The flow defined by the transformation (13) modifies the initial density q_0 by applying a series of contractions and expansions in the direction perpendicular to the hyperplane wᵀz + b = 0, hence we refer to these maps as planar flows. As an alternative, we can consider a family of transformations that modify an initial density q_0 around a reference point z_0. The transformation family is:

f(z) = z + β h(α, r)(z − z_0),  (14)
Figure 1. Effect of normalizing flow on two distributions (uniform and unit Gaussian initial densities q_0, transformed by planar and radial flows with K = 1, 2, 10).

Figure 2. Inference and generative models. Left: inference network maps the observations to the parameters of the flow; Right: generative model which receives the posterior samples from the inference network during training time. Round containers represent layers of stochastic variables whereas square containers represent deterministic layers.
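A numpy sketch of a single planar flow in D = 1 (the parameters w, u, b below are arbitrary illustrative values satisfying the invertibility condition uᵀw > −1), checking the implied density from Eq. (13) against a histogram of transformed samples:

```python
import numpy as np

rng = np.random.default_rng(3)

# Planar flow f(z) = z + u*h(w*z + b) with h = tanh (Eq. 10), in D = 1.
# Here u*w = 1.6 > -1, so f is monotone and invertible.
w, u, b = 2.0, 0.8, 0.5

def f(z):
    return z + u * np.tanh(w * z + b)

def log_qK(z0):
    """ln q_K at z_K = f(z0) for z0 ~ q0 = N(0, 1), via Eqs. (11)-(13)."""
    log_q0 = -0.5 * np.log(2 * np.pi) - 0.5 * z0**2
    psi = (1.0 - np.tanh(w * z0 + b) ** 2) * w      # psi(z) = h'(w z + b) w
    return log_q0 - np.log(np.abs(1.0 + u * psi))   # - ln|1 + u psi(z)|

def f_inv(c):
    # invert the monotone map f by bisection
    lo, hi = -10.0, 10.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < c else (lo, mid)
    return 0.5 * (lo + hi)

# Histogram of transformed samples vs. the density implied by Eq. (13).
zK = f(rng.normal(size=400_000))
hist, edges = np.histogram(zK, bins=100, range=(-4, 4), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
q_model = np.array([np.exp(log_qK(f_inv(c))) for c in centers])
err = np.max(np.abs(hist - q_model)[hist > 0.05])
```

The histogram and the change-of-variables density agree closely, confirming that the O(D) logdet term is all that is needed to track the transformed density.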
4.2. Flow-Based Free Energy Bound

If we parameterize the approximate posterior distribution with a flow of length K, q_φ(z|x) := q_K(z_K), the free energy (3) can be written as an expectation over the initial distribution q_0(z):

F(x) = E_{q_φ(z|x)}[ log q_φ(z|x) − log p(x, z) ]
     = E_{q_0(z_0)}[ ln q_K(z_K) − log p(x, z_K) ]
     = E_{q_0(z_0)}[ ln q_0(z_0) ] − E_{q_0(z_0)}[ log p(x, z_K) ] − E_{q_0(z_0)}[ Σ_{k=1}^{K} ln |1 + u_kᵀψ_k(z_k)| ].
distribution:

q(z′) = q(z) |det ∂f⁻¹/∂z′| = q(z) |det ∂f/∂z|⁻¹,  (5)

where the last equality can be seen by applying the chain rule (inverse function theorem) and is a property of Jacobians of invertible functions. We can construct arbitrarily complex densities by composing several simple maps and successively applying (5). The density q_K(z) obtained by successively transforming a random variable z_0 with distribution q_0 through a chain of K transformations f_k is:

z_K = f_K ∘ … ∘ f_2 ∘ f_1(z_0)  (6)
ln q_K(z_K) = ln q_0(z_0) − Σ_{k=1}^{K} ln |det ∂f_k/∂z_k|,  (7)

where equation (6) will be used throughout the paper as a shorthand for the composition f_K(f_{K−1}(…f_1(x))). The path traversed by the random variables z_k = f_k(z_{k−1}) with initial distribution q_0(z_0) is called the flow and the path formed by the successive distributions q_k is a normalizing flow. A property of such transformations, often referred to as the law of the unconscious statistician (LOTUS), is that expectations w.r.t. the transformed density q_K can be computed without explicitly knowing q_K. Any expectation E_{q_K}[h(z)] can be written as an expectation under q_0 as:

E_{q_K}[h(z)] = E_{q_0}[ h(f_K ∘ f_{K−1} ∘ … ∘ f_1(z_0)) ],  (8)

which does not require computation of the logdet-Jacobian terms when h(z) does not depend on q_K.

We can understand the effect of invertible flows as a sequence of expansions or contractions on the initial density. For an expansion, the map z′ = f(z) pulls the points z away from a region in ℝᵈ, reducing the density in that region while increasing the density outside the region.
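Equations (5)–(7) can be verified exactly with a chain of invertible affine maps (the particular coefficients below are arbitrary), since an affine flow applied to a Gaussian stays Gaussian:

```python
import numpy as np

rng = np.random.default_rng(4)

# Chain of K invertible affine maps f_k(z) = a_k z + c_k applied to z0 ~ N(0, 1).
# Eq. (7): ln q_K(z_K) = ln q0(z0) - sum_k ln|det df_k/dz_k|; for affine maps
# the composed density is again Gaussian, so the formula has a closed-form check.
params = [(2.0, 1.0), (0.5, -3.0), (1.5, 0.25)]

z0 = rng.normal(size=10)
z, log_q = z0.copy(), -0.5 * np.log(2 * np.pi) - 0.5 * z0**2
for a, c in params:
    z = a * z + c
    log_q -= np.log(abs(a))          # ln|det df_k/dz| = ln|a_k| in 1-D

# Analytic result: z_K ~ N(mu, s^2) with mu, s accumulated through the chain.
mu, s = 0.0, 1.0
for a, c in params:
    mu, s = a * mu + c, abs(a) * s
log_q_exact = -0.5 * np.log(2 * np.pi * s**2) - (z - mu) ** 2 / (2 * s**2)
```

The running sum of logdet terms reproduces the exact Gaussian log-density of the composed map, which is the content of Eq. (7).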
One can also consider infinitesimal flows, in which the initial density q_0(z) evolves in continuous time t according to a partial differential equation. One example is the Langevin flow, described by the Langevin stochastic differential equation

dz(t) = F(z(t), t) dt + G(z(t), t) dξ(t),

where dξ(t) is a Wiener process with E[ξ_i(t)] = 0 and E[ξ_i(t)ξ_j(t′)] = δ_{ij}δ(t − t′), and D = GGᵀ is the diffusion matrix. Transforming an initial random variable z_0 with density q_0 through the Langevin flow gives a family of densities q_t(z) whose evolution is described by the Fokker-Planck (Kolmogorov) equation for q_t(z). In machine learning one typically takes F(z, t) = −∇_z L(z), where L(z) is an unnormalized log-density. Importantly, the stationary distribution of this flow is given by the density proportional to e^{L(z)}: that is, if we start from q_0(z) and evolve its samples with the Langevin flow, the resulting points are distributed according to e^{L(z)}. This has been explored for sampling by, e.g., Welling and Teh (2011). Hamiltonian flow, as described in Hamiltonian Monte Carlo, operates on an augmented phase space z̃ = (z, ω) with Hamiltonian H(z, ω), and is also widely used in machine learning.
Under review as a conference paper at ICLR 2016

(a) Variational model. (b) Generative model.
Figure 1: (a) Graphical model of the variational Gaussian process. The VGP generates samples of latent variables z by evaluating random non-linear mappings of latent inputs ξ, and then drawing mean-field samples parameterized by the mapping. These latent variables aim to follow the posterior distribution for a generative model (b), conditioned on data x.
20. ¤ Ladder network [Valpola, 14][Rasmus+ 15]
¤ The probabilistic ladder network uses a ladder-network-style inference model
How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

In this paper we propose (1) an inference model similar to the ladder network, (2) a warm-up period to keep latent units active early in training, and (3) batch normalization, for optimizing DGMs, and train models with up to five layers of stochastic latent variables. These models, consisting of hierarchies of latent variables with highly expressive conditional distributions, maintain the computational efficiency of fully factorized models. Measured in terms of test log-likelihood, these models perform equally to or better than more expressive variational distributions such as Variational Gaussian Processes (Tran et al.) and normalizing flows (Rezende & Mohamed).

Figure 2. Flow of information in the inference and generative models of a) the probabilistic ladder network and b) the VAE. Panel labels: deterministic bottom-up pathway, stochastic top-down pathway, "likelihood", "prior", "posterior", "copy"; bottom-up pathway in the inference model; top-down pathway through KL-divergences in the generative model; direct flow of information vs. indirect top-down information through the prior. The probabilistic ladder network allows direct integration (+ in figure, see Eq. (21)) of bottom-up and top-down information in the inference model, whereas the VAE only receives top-down information indirectly, through the prior.
21. Top-down
¤ bottom-up
¤ top-down

L(θ, φ; x) = E_{z∼q_φ(z|x)} [ log ( p(z_L) ∏_{i=2}^{L} p(z_{i−1}|z_i) · p(x|z_1) ) / ( q(z_1|x) ∏_{i=2}^{L} q(z_i|z_{i−1}) ) ],

with the samples drawn bottom-up, z_1 ∼ q(z_1|x) and z_i ∼ q(z_i|z_{i−1}), while the generative densities p(z_L), p(z_{i−1}|z_i) and p(x|z_1) are evaluated top-down.
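As a sanity check of the bound above, here is a Monte Carlo sketch with L = 2 and all conditionals chosen Gaussian (a hypothetical model, picked so that the exact marginal p(x) = N(0, 3) is available): sampling bottom-up through q and averaging the log ratio stays below log p(x).

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical 2-layer hierarchy: p(z2) = N(0,1), p(z1|z2) = N(z2,1),
# p(x|z1) = N(z1,1)  =>  exact marginal p(x) = N(0, 3).
def log_norm(v, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mu) ** 2 / (2 * var)

x = 0.8
log_px = log_norm(x, 0.0, 3.0)

n = 200_000
z1 = rng.normal(x / 2, 1.0, n)           # z1 ~ q(z1|x)   (bottom-up)
z2 = rng.normal(z1 / 2, 1.0, n)          # z2 ~ q(z2|z1)
log_ratio = (log_norm(z2, 0, 1) + log_norm(z1, z2, 1) + log_norm(x, z1, 1)
             - log_norm(z1, x / 2, 1) - log_norm(z2, z1 / 2, 1))
elbo = log_ratio.mean()
# elbo <= log p(x): the gap is the KL between q and the true posterior.
```

The q factors here are deliberately crude, so the bound is visibly loose; better inference models close the gap.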
22.

The variational principle provides a tractable lower bound on the log likelihood which can be used as a training criterion L:

log p(x) ≥ E_{q_φ(z|x)} [ log p_θ(x, z) / q_φ(z|x) ] = L(θ, φ; x)  (10)
 = −KL(q_φ(z|x) || p_θ(z)) + E_{q_φ(z|x)} [ log p_θ(x|z) ],  (11)

where KL is the Kullback-Leibler divergence. A tighter bound on the likelihood may be obtained at the expense of a K-fold increase of samples by using the importance weighted bound (Burda et al., 2015):

log p(x) ≥ E_{q_φ(z⁽¹⁾|x) ⋯ q_φ(z⁽ᴷ⁾|x)} [ log (1/K) Σ_{k=1}^{K} p_θ(x, z⁽ᵏ⁾) / q_φ(z⁽ᵏ⁾|x) ]  (12)

The generative and variational parameters, θ and φ, are jointly trained by optimizing Eq. (11) using stochastic gradient ascent, where we use the reparametrization trick for backpropagation through the Gaussian latent variables (Kingma & Welling, 2013; Rezende et al., 2014). The expectations are approximated using Monte Carlo sampling from the corresponding q distribution.

We use the ladder network (Valpola, 2014; Rasmus et al., 2015) as the inference model of a VAE, as shown in Figure 1. The generative model is the same as before. The inference is constructed to first make a deterministic upward pass:

d_1 = MLP(x)  (13)
μ_{d,i} = Linear(d_i), i = 1…L  (14)
σ²_{d,i} = Softplus(Linear(d_i)), i = 1…L  (15)
d_i = MLP(μ_{d,i−1}), i = 2…L  (16)

followed by a stochastic downward pass:

q_φ(z_L|x) = N(μ_{d,L}, σ²_{d,L})  (17)
t_i = MLP(z_{i+1}), i = 1…L−1  (18)
μ_{t,i} = Linear(t_i)  (19)
σ²_{t,i} = Softplus(Linear(t_i))  (20)
q_θ(z_i|z_{i+1}, x) = N( (μ_{t,i} σ_{t,i}⁻² + μ_{d,i} σ_{d,i}⁻²) / (σ_{t,i}⁻² + σ_{d,i}⁻²), 1 / (σ_{t,i}⁻² + σ_{d,i}⁻²) ).  (21)
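Equations (13)–(21) can be sketched in a few lines of numpy. The layer sizes, tanh MLPs and random weight matrices below are stand-ins for trained networks, so this only illustrates the shape of the computation, in particular the precision-weighted combination of bottom-up and top-down information in Eq. (21):

```python
import numpy as np

rng = np.random.default_rng(5)
softplus = lambda a: np.log1p(np.exp(a))

# Minimal sketch with L = 3 stochastic layers; all weights are random stand-ins.
sizes = [784, 64, 32, 16]                # x, z1, z2, z3 (hypothetical sizes)
W_mlp = [rng.normal(0, 0.05, (sizes[i], sizes[i + 1])) for i in range(3)]
W_mu = [rng.normal(0, 0.05, (s, s)) for s in sizes[1:]]
W_var = [rng.normal(0, 0.05, (s, s)) for s in sizes[1:]]

x = rng.random(784)
# Deterministic upward pass (Eqs. 13-16): d_i, mu_d_i, var_d_i.
mu_d, var_d = [], []
d = np.tanh(x @ W_mlp[0])                           # d_1 = MLP(x)
for i in range(3):
    mu_d.append(d @ W_mu[i])                        # mu_d,i = Linear(d_i)
    var_d.append(softplus(d @ W_var[i]))            # sigma2_d,i = Softplus(...)
    if i < 2:
        d = np.tanh(mu_d[i] @ W_mlp[i + 1])         # d_{i+1} = MLP(mu_d,i)

# Stochastic downward pass (Eqs. 17-21), starting from the top layer.
z = rng.normal(mu_d[2], np.sqrt(var_d[2]))          # z_L ~ q(z_L|x)
for i in (1, 0):
    t = np.tanh(z @ rng.normal(0, 0.05, (sizes[i + 2], sizes[i + 1])))
    mu_t = t @ W_mu[i]
    var_t = softplus(t @ W_var[i])
    prec = 1 / var_t + 1 / var_d[i]                 # precision-weighted merge
    mu = (mu_t / var_t + mu_d[i] / var_d[i]) / prec # Eq. (21)
    z = rng.normal(mu, np.sqrt(1 / prec))
```

The merged variance 1/prec is never larger than either the bottom-up or top-down variance, which is the sense in which Eq. (21) directly integrates the two information pathways.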
PRML
24.
¤ [MacKay 01]

L(θ, φ; x) = −KL(q_φ(z|x) ∥ p_θ(z)) + E_{q_φ(z|x)}[ log p_θ(x|z) ]
phenomenon. The probabilistic ladder network provides a framework with the wanted interaction, while keeping complications manageable. A further extension could be to make the inference in k steps over an iterative inference procedure (Raiko et al., 2014).

2.2. Warm-up from deterministic to variational autoencoder

The variational training criterion in Eq. (11) contains the reconstruction term p_θ(x|z) and the variational regularization term. The variational regularization term causes some of the latent units to become inactive during training (MacKay, 2001) because the approximate posterior for unit k, q(z_{i,k}|…), is regularized towards its own prior p(z_{i,k}|…), a phenomenon also recognized in the VAE setting (Burda et al., 2015). This can be seen as a virtue of automatic relevance determination, but also as a problem when many units are pruned away early in training before they have learned a useful representation. We observed that such units remain inactive for the rest of the training, presumably trapped in a local minimum or saddle point at KL(q_{i,k}|p_{i,k}) ≈ 0, with the optimization algorithm unable to re-activate them.
We propose to alleviate the problem by initializing train-
tr
fr
v
th
o
3
T
M
C
ch
3
al
M
q
6
b
re
phenomenon. The
a framework with
complications man
to make the inferen
procedure (Raiko et
2.2. Warm-up from
autoencoder
The variational trai
reconstruction term
ization term. The
some of the latent
ing (MacKay, 2001
unit k, q(zi,k| . . . )
p(zi,k| . . . ), a pheno
ting (Burda et al.,
of automatic releva
lem when many un
before they learned
that such units rem
presumably trapped
KL(qi,k|pi,k) ⇡ 0,
to re-activate them.
We propose to alle
ing using the recon
training a standard
25.
¤ → Warm-up
¤ Batch Normalization [Ioffe+ 15]
We propose to alleviate the problem by initializing training using the reconstruction error only (corresponding to training a standard deterministic auto-encoder), and then gradually introducing the variational regularization term:

L(θ, φ; x)_T = −β KL(q_φ(z|x) || p_θ(z)) + E_{q_φ(z|x)} [ log p_θ(x|z) ],  (22)

where β is increased linearly from 0 to 1 during the first N_t epochs of training. We denote this scheme warm-up (abbreviated WU in tables and graphs) because the objective goes from having a delta-function solution (corresponding to zero temperature) and then moves towards the fully stochastic variational objective. A similar idea has previously been considered in Raiko et al. (2007, Section 6.2), however there it was used for Bayesian models trained with a coordinate descent algorithm.
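The warm-up schedule itself is a one-liner. The sketch below (epoch counts and loss values are made up) shows the objective moving from pure reconstruction at epoch 0 to the full bound after N_t epochs:

```python
# Warm-up (Eq. 22): scale the KL term by beta, increased linearly from
# 0 to 1 over the first Nt epochs, so training starts as a deterministic
# auto-encoder and ends with the full variational objective.
def beta(epoch, Nt=200):
    return min(1.0, epoch / Nt)

def warmup_objective(recon_term, kl_term, epoch, Nt=200):
    # L_T = -beta * KL(q||p) + E_q[log p(x|z)]
    return recon_term - beta(epoch, Nt) * kl_term

# Illustrative values: reconstruction term -90, KL term 5.
obj_start = warmup_objective(-90.0, 5.0, 0)      # pure reconstruction
obj_mid = warmup_objective(-90.0, 5.0, 100)      # half the KL penalty
obj_full = warmup_objective(-90.0, 5.0, 200)     # full ELBO from here on
```

From epoch N_t onward the objective is exactly Eq. (11), so warm-up changes only the early trajectory of optimization, not the final criterion.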
The sizes of the latent layers were 64, 32, 16, 8 and 4, going from z_1 to z_5, with all mappings parameterized using MLPs; the MLPs between x and z_1 and between subsequent layers used hidden sizes of 64 and 32 for all connections. Models with fewer layers, for the probabilistic ladder networks as well, were created by removing latent variables from the top of the hierarchy; we sometimes refer to the models by their depth, e.g. the four layer model. All models were trained with the Adam (Kingma & Ba, 2014) optimizer, and the reported test log-likelihoods were estimated using Eq. (12) with 5000 importance weighted samples, as in Burda et al. (2015). The models were implemented in Theano (Bastien et al., 2012) using the Parmesan framework.

For MNIST we used a sigmoid output layer to predict the mean of a Bernoulli observation model and leaky rectifiers (max(x, 0.1x)) as nonlinearities. The models were trained for 2000 epochs on the complete training set with N_t = 200. Similar to Burda et al. (2015) we resample the binarized training images using a Bernoulli distribution after each epoch.
27. MNIST
¤ VAE 2
¤ Batch normalization & warm-up
¤ probabilistic ladder network
How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Figure 3. MNIST test-set log-likelihood values for VAEs and the probabilistic ladder networks with different number of latent layers, batch normalization (BN) and warm-up (WU).

Figure 4. MNIST train (full lines) and test L(x) during training, for VAE and VAE+BN. The test set performance was estimated using 5000 importance weighted samples, providing a tighter bound than the training bound and explaining the better performance here.
28. ¤ Fine-tuning with Monte Carlo (MC) and importance weighted (IW) samples
¤ MC and IW samples
¤ permutation invariant MNIST:
¤ −82.90 [Burda+ 15]
¤ −81.90 [Tran+ 15]
Table 1. Fine-tuned test log-likelihood values for 5 layered VAE and probabilistic ladder networks trained on MNIST. ANN. LR: annealed learning rate, MC: Monte Carlo samples to approximate E_q(·)[·], IW: importance weighted samples.

FINETUNING    | NONE   | ANN. LR, MC=1, IW=1 | ANN. LR, MC=10, IW=1 | ANN. LR, MC=1, IW=10 | ANN. LR, MC=10, IW=10
VAE           | −82.14 | −81.97              | −81.84               | −81.41               | −81.30
PROB. LADDER  | −81.87 | −81.54              | −81.46               | −81.35               | −81.20
training in deep neural networks by normalizing the outputs from each layer. We show that batch normalization (abbreviated BN in tables and graphs), applied to all layers except the output layers, is essential for learning deep hierarchies of latent variables for L > 2.
29.
¤ Active units: KL > 0.01
¤ VAE
¤ Batch normalization (BN)
¤ Warm-up (WU)
¤ Probabilistic ladder network
Table 2. Number of active latent units in five layer VAE and probabilistic ladder networks trained on MNIST. A unit was defined as active if KL(q_{i,k}||p_{i,k}) > 0.01.

        | VAE | VAE+BN | VAE+BN+WU | PROB. LADDER+BN+WU
LAYER 1 | 20  | 20     | 34        | 46
LAYER 2 | 1   | 9      | 18        | 22
LAYER 3 | 0   | 3      | 6         | 8
LAYER 4 | 0   | 3      | 2         | 3
LAYER 5 | 0   | 2      | 1         | 2
TOTAL   | 21  | 37     | 61        | 81
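The criterion in Table 2 is easy to apply given per-unit posterior statistics. The sketch below uses the closed-form KL between diagonal Gaussians and hypothetical statistics in which half of 64 units have collapsed onto a N(0, 1) prior:

```python
import numpy as np

rng = np.random.default_rng(6)

# Per-unit KL between q = N(mu, s2) and prior p = N(0, 1):
# KL = 0.5 * (mu^2 + s2 - ln s2 - 1); a unit is "active" if KL > 0.01.
def unit_kl(mu, s2):
    return 0.5 * (mu**2 + s2 - np.log(s2) - 1.0)

# Hypothetical posterior statistics: first 32 units collapsed onto the prior
# (mu = 0, s2 = 1, hence KL = 0), the rest clearly data-dependent.
mu = np.concatenate([np.zeros(32), rng.normal(0, 1, 32)])
s2 = np.concatenate([np.ones(32), np.full(32, 0.3)])
kl = unit_kl(mu, s2)
n_active = int((kl > 0.01).sum())
```

Collapsed units have exactly zero KL, so the 0.01 threshold cleanly separates them from units that encode information about x.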
importance weighted samples to 10 to reduce the variance in the approximation of the expectations in Eq. (10) and improve the inference model, respectively.

Models trained on the OMNIGLOT dataset, consisting of 28x28 binary images, were trained similarly to above except that the number of training epochs was 1500. Models trained on the NORB dataset, consisting of 32x32 gray-scale images with color-coding rescaled to [0, 1], used a Gaussian observation model with mean and variance parameterized by a linear and a softplus output layer. These models were similar to the models above except that hyperbolic tangent was used as nonlinearity, the learning rate was 0.002, N_t = 1000 and the number of training epochs was 4000.
30. OMNIGLOT, NORB
¤ BN, WU
¤ NORB ladder
¤ tanh
Table 3. Test set log-likelihood values for models trained on the OMNIGLOT and NORB datasets. The leftmost column shows dataset and the number of latent variables in each model.

             | VAE     | VAE+BN  | VAE+BN+WU | PROB. LADDER+BN+WU
OMNIGLOT
64           | −114.45 | −108.79 | −104.63   | —
64-32        | −112.60 | −106.86 | −102.03   | −102.12
64-32-16     | −112.13 | −107.09 | −101.60   | −101.26
64-32-16-8   | −112.49 | −107.66 | −101.68   | −101.27
64-32-16-8-4 | −112.10 | −107.94 | −101.86   | −101.59
NORB
64           | 2630.8  | 3263.7  | 3481.5    | —
64-32        | 2830.8  | 3140.1  | 3532.9    | 3522.7
64-32-16     | 2757.5  | 3247.3  | 3346.7    | 3458.7
64-32-16-8   | 2832.0  | 3302.3  | 3393.6    | 3499.4
64-32-16-8-4 | 3064.1  | 3258.7  | 3393.6    | 3430.3
tic layers. The performance of the vanilla VAE model did not improve with more than two layers of stochastic latent variables. Contrary to this, models trained with batch normalization and warm-up consistently increase the model performance for additional layers of stochastic latent variables. As expected the improvement in performance is decreasing for each additional layer, but we emphasize that the improvements are consistent even for the addition of
31.
¤ VAE
¤ KL
¤ KL ≈ 0
¤ BN, WU, Ladder
To study this effect we calculated the KL-divergence between q(z_{i,k}|z_{i−1,k}) and p(z_i|z_{i+1}) for each stochastic latent variable k during training, as seen in the figure. The term is zero if the inference model is independent of the data, i.e. q(z_{i,k}|z_{i−1,k}) = q(z_{i,k}), and hence collapsed
33.
¤ KL
¤ VAE: 2
¤ Ladder, BN, WU
structured high level latent representations that are likely useful for semi-supervised learning.

The hierarchical latent variable models used here allow highly flexible distributions of the lower layers conditioned on the layers above. We measure the divergence between these conditional distributions and the restrictive mean field approximation by calculating the KL-divergence between q(z_i|z_{i−1}) and a standard normal distribution for several models trained on MNIST, see Figure 6 a). As expected the lower layers have highly non (standard) Gaussian distributions when conditioned on the layers above. Interestingly the probabilistic ladder network seems to have more active intermediate layers than the VAE with batch normalization and warm-up. Again this might be explained by the deterministic upward pass easing the flow of information to the intermediate and upper layers. We further note that the KL-divergence is approximately zero in the vanilla VAE model above the second layer, confirming the inactivity of these layers. Figure 6 b) shows generative samples from the probabilistic ladder network created by injecting
Bouchard, …, and …, Yoshua. …. arXiv preprint ….
Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. arXiv:1509.00519, 2015.
Dayan, Peter, Hinton, Geoffrey E., Neal, Radford M., and Zemel, Richard S. The Helmholtz machine. Neural Computation, 7(5):889–904, 1995.
Dieleman, Sander, Schlüter, Jan, Raffel, Colin, Olson, Eben, Sønderby, Søren Kaae, …, van den Oord, Aaron, and others. Lasagne: First release., August 2015. doi: 10.5281/zenodo.27878.
Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
Kingma, Diederik P. ….