Variational Inference
• The predictive distribution
• Log evidence lower bound
• Objective:
• Variational Prediction:
• Objective:
• A variational distribution q(w) that explains the data well while still being close to the prior
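The equations referenced above are presumably the standard forms (following Gal, 2016); a sketch, assuming training data X, Y, weights w, prior p(w), and likelihood p(y | x, w):
p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, w)\, p(w \mid X, Y)\, dw
\log p(Y \mid X) \ge \mathcal{L}_{VI}(q) = \int q(w) \log p(Y \mid X, w)\, dw - \mathrm{KL}\big(q(w) \,\|\, p(w)\big)
q(y^* \mid x^*) = \int p(y^* \mid x^*, w)\, q(w)\, dw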
• The mini-batch objective with reparameterization:
• Simple MC estimate (a single sample):
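A sketch of these two equations, roughly in the notation of Gal (2016), assuming a mini-batch S of size M drawn from N data points and a reparameterisation w = g(θ, ε):
\hat{\mathcal{L}}_{VI}(\theta) = -\frac{N}{M} \sum_{i \in S} \int q_\theta(w) \log p(y_i \mid x_i, w)\, dw + \mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big), \qquad w = g(\theta, \epsilon), \ \epsilon \sim p(\epsilon)
\hat{\mathcal{L}}_{VI}(\theta) \approx -\frac{N}{M} \sum_{i \in S} \log p\big(y_i \mid x_i, g(\theta, \epsilon_i)\big) + \mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big), \qquad \epsilon_i \sim p(\epsilon) \ \text{(one MC sample per point)}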
Dropout
Procedure
• Training stage: a unit is present with probability p
• Testing stage: the unit is always present and its weights are multiplied by p, so the expected output matches the training-time output
Intuition
• Training a neural network with dropout can be seen as training a collection of 2^K thinned networks with extensive weight sharing
• At test time, a single neural network approximates averaging the outputs of all these thinned networks
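A minimal NumPy sketch of the procedure above (the layer shape, weights, and keep probability are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
p = 0.8                                  # probability that a unit is present
W = rng.normal(size=(5, 3))              # toy layer weights
x = rng.normal(size=5)

# Training stage: each unit is kept with probability p
mask = rng.binomial(1, p, size=x.shape)
train_out = (x * mask) @ W

# Testing stage: every unit is present and the weights are scaled by p,
# so the expectation of train_out over masks equals test_out
test_out = (x * p) @ W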
Dropout for one-hidden-layer Neural Networks
• Dropping out individual units is equivalent to multiplying the global weight matrices by binary vectors that zero out entire rows (see the sketch after this list):
• Application to regression
• Inputs x and outputs y
• g: activation function; W_1, W_2: weight matrices
• b_l: binary dropout variables
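One common way to write the resulting forward pass, a simplified sketch in the spirit of Gal (2016) that omits the bias and scaling constants:
\hat{y} = g\big(x\, \mathrm{diag}(b_1)\, W_1\big)\, \mathrm{diag}(b_2)\, W_2, \qquad b_{1,q} \sim \mathrm{Bernoulli}(p_1), \quad b_{2,k} \sim \mathrm{Bernoulli}(p_2)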
• We can rewrite the loss as a negative log-likelihood:
• The weight parameters with their presence probabilities:
• The new loss:
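A sketch of the correspondence, assuming the standard squared loss with L2 regularisation and a Gaussian likelihood with precision τ (the constants follow Gal, 2016 only up to scaling):
E = \frac{1}{2N} \sum_{n=1}^{N} \| y_n - \hat{y}_n \|^2 + \lambda_1 \|W_1\|^2 + \lambda_2 \|W_2\|^2
-\log p(y_n \mid x_n, w) = \frac{\tau}{2} \| y_n - \hat{y}_n \|^2 + \text{const}, \qquad p(y_n \mid x_n, w) = \mathcal{N}\big(y_n;\, \hat{y}_n,\, \tau^{-1} I_D\big)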
A single-layer neural network example
• Setup
• Idea: introduce W_1 and W_2 as the approximating model
• Q: input dimension; K: number of hidden units; D: output dimension
• Goal: learn W_1 ∈ R^{Q×K} and W_2 ∈ R^{K×D} to map X ∈ R^{N×Q} to Y ∈ R^{N×D}
Variational Inference in the Approximate Model
• To mimic dropout, q(W_1) is factorised over the input dimension; each factor is a two-component Gaussian mixture distribution (sketched below)
• Same for q(W_2)
• Optimise over the variational parameters, in particular the component means
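Following Gal (2016), the mixtures presumably take roughly the following form, with a small fixed standard deviation σ and the component means as the main variational parameters:
q(W_1) = \prod_{q=1}^{Q} q(\mathbf{w}_q), \qquad q(\mathbf{w}_q) = p_1\, \mathcal{N}(\mathbf{m}_q,\, \sigma^2 I_K) + (1 - p_1)\, \mathcal{N}(\mathbf{0},\, \sigma^2 I_K)
q(W_2) = \prod_{k=1}^{K} q(\mathbf{w}_k), \qquad q(\mathbf{w}_k) = p_2\, \mathcal{N}(\mathbf{m}_k,\, \sigma^2 I_D) + (1 - p_2)\, \mathcal{N}(\mathbf{0},\, \sigma^2 I_D)
The optimisation is mainly over the means M_1 = [\mathbf{m}_q] and M_2 = [\mathbf{m}_k].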
Evaluating the Log Evidence Lower Bound for Regression
• Log evidence lower bound
• Approximation of the expected log-likelihood term
• For large enough K we can approximate the KL divergence term (Gal, 2016; a rough form is sketched after this list)
• Similarly for the KL term of the second layer, KL(q(W_2) ∥ p(W_2))
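Roughly, for a standard-normal prior on each row, the approximation in Gal (2016) reduces to a weight-decay-like term in the component means (a sketch; C collects terms that do not depend on the means):
\mathrm{KL}\big(q(W_1) \,\|\, p(W_1)\big) \approx \sum_{q=1}^{Q} \frac{p_1}{2}\, \mathbf{m}_q^\top \mathbf{m}_q + C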
KL condition
• More specifically, if we define the prior p(ω) such that the following holds (a rough form is sketched after this list):
• we obtain the following:
• The two optimisation procedures are then identical, so dropout can be interpreted as Bayesian variational inference (Gal, 2016)
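A rough sketch of the condition, following Gal (2016); the exact constants depend on the prior and on the model precision τ. The prior is chosen so that the derivative of the KL term matches the derivative of the L2 regulariser used with dropout:
\frac{\partial}{\partial \theta}\, \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big) = \frac{\partial}{\partial \theta}\, N\tau \big(\lambda_1 \|M_1\|^2 + \lambda_2 \|M_2\|^2\big)
Under this condition the dropout objective and the negative log evidence lower bound have the same gradients up to a constant factor, so minimising one minimises the other.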
MC dropout and epistemic uncertainty
• Predictive distribution:
where w is the set of random weight variables for a model with L layers, f is the model's stochastic output, and q_θ*(w) is an optimum of:
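Presumably these are the standard MC-dropout forms (Gal, 2016): the predictive distribution is
q(y^* \mid x^*) = \int p(y^* \mid x^*, w)\, q_{\theta^*}(w)\, dw
and q_θ*(w) is an optimum of the log evidence lower bound \mathcal{L}_{VI}(\theta) given earlier, equivalently a minimiser of \mathrm{KL}\big(q_\theta(w) \,\|\, p(w \mid X, Y)\big).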
For regression this epistemic uncertainty is captured by the predictive mean and
variance, which can be approximated as:
For classification this can be approximated using Monte Carlo integration as
follows:
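The estimators are presumably those of Gal (2016), with ŵ_t ~ q_θ*(w) denoting T sampled dropout masks and τ the model precision (a sketch):
\mathbb{E}[y^*] \approx \frac{1}{T} \sum_{t=1}^{T} f^{\hat{w}_t}(x^*)
\mathrm{Var}[y^*] \approx \tau^{-1} I_D + \frac{1}{T} \sum_{t=1}^{T} f^{\hat{w}_t}(x^*)^\top f^{\hat{w}_t}(x^*) - \mathbb{E}[y^*]^\top \mathbb{E}[y^*]
p(y^* = c \mid x^*, X, Y) \approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{softmax}\big(f^{\hat{w}_t}(x^*)\big)_c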
• Generally, dropout is used only at training time; for MC dropout we keep it active at test time as well, averaging stochastic forward passes to obtain uncertainty estimates (see the sketch below)
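A minimal NumPy sketch of test-time MC dropout for a toy one-hidden-layer regression network (the weights and shapes are illustrative; a real model would use weights trained with dropout and weight decay):

import numpy as np

def mc_dropout_predict(x, W1, W2, p=0.5, T=100, rng=None):
    """Average T stochastic forward passes with dropout kept ON at test time."""
    rng = np.random.default_rng() if rng is None else rng
    samples = []
    for _ in range(T):
        h = np.maximum(x @ W1, 0.0)                # hidden layer (ReLU)
        mask = rng.binomial(1, p, size=h.shape)    # Bernoulli keep-mask
        samples.append((h * mask) @ W2)            # stochastic output f^{w_t}(x)
    samples = np.stack(samples)                    # shape (T, D)
    # predictive mean and the epistemic part of the predictive variance;
    # the full variance would add tau^{-1} for observation noise
    return samples.mean(axis=0), samples.var(axis=0)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 16)), rng.normal(size=(16, 1))
mean, var = mc_dropout_predict(np.array([0.1, -0.2, 0.3]), W1, W2, rng=rng)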
• Dropout VI can severely underestimate model uncertainty (Gal, 2016, Section 3.3.2), a property many VI methods share (dropout with alpha-divergences, Li, 2017)
• The ELBO is a lower bound on the log evidence: can we also use an upper bound when estimating q(w)?
VI with upper bound
• Minimising the exclusive KL underestimates the posterior because of its zero-forcing behaviour
[Figure: zero-forcing behaviour of the exclusive KL vs. the mass-covering / zero-avoiding behaviour of the inclusive KL]
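The two behaviours correspond to the two directions of the KL divergence (standard facts, cf. Minka, 2005):
\mathrm{KL}(q \,\|\, p) = \int q(w) \log \frac{q(w)}{p(w \mid D)}\, dw \quad \text{(exclusive KL: zero-forcing, tends to underestimate)}
\mathrm{KL}(p \,\|\, q) = \int p(w \mid D) \log \frac{p(w \mid D)}{q(w)}\, dw \quad \text{(inclusive KL: mass-covering / zero-avoiding)}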
CUBO (Dieng et al., 2017)
• Chi-square divergence
• Relating the log evidence to the Chi-square divergence gives an upper bound, the CUBO (sketched below):
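Up to notation, the bound from Dieng et al. (2017) can be sketched as follows for the χ² (n = 2) case:
\chi^2\big(p(w \mid D) \,\|\, q(w)\big) = \mathbb{E}_{q}\Big[\Big(\tfrac{p(w \mid D)}{q(w)}\Big)^2\Big] - 1
\mathrm{CUBO}_2(q) = \tfrac{1}{2} \log \mathbb{E}_{q}\Big[\Big(\tfrac{p(D, w)}{q(w)}\Big)^2\Big] = \log p(D) + \tfrac{1}{2} \log\Big(1 + \chi^2\big(p(w \mid D) \,\|\, q(w)\big)\Big) \ \ge\ \log p(D)
Minimising the CUBO over q therefore minimises the χ² divergence while upper-bounding the log evidence.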
With an accompanying upper bound, one can perform what Dieng et al. call maximum entropy model selection, in which each model's evidence value is chosen to be the one that maximizes the entropy of the resulting distribution over models.
EUBO (Ji & Shen, 2019)
• Inclusive KL divergence
• By Gibbs' inequality (sketched below):
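Up to notation, the bound in Ji & Shen (2019) follows from Gibbs' inequality applied to the inclusive KL (a sketch):
\mathrm{KL}\big(p(\theta \mid D) \,\|\, q(\theta; \lambda)\big) \ge 0 \;\Longrightarrow\; \log p(D) \le \mathbb{E}_{p(\theta \mid D)}\Big[\log \frac{p(D, \theta)}{q(\theta; \lambda)}\Big] =: \mathrm{EUBO}(\lambda)
Since EUBO(λ) and the inclusive KL differ only by the constant log p(D), minimising the EUBO over λ is equivalent to minimising KL(p(θ|D) ∥ q(θ; λ)).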
Theorem 2:
• SGD for the EUBO (a sketch of the estimator follows below):
• The weight w(θ) involves the unknown posterior, so the joint distribution p(D|θ)p(θ) is used instead and the weights are normalised to cancel the unknown constant p(D)
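A sketch of the resulting self-normalised importance-sampling gradient, assuming samples θ_s ~ q(θ; λ) and using the fact (noted below) that only q depends on λ:
\nabla_\lambda\, \mathrm{EUBO}(\lambda) = -\,\mathbb{E}_{p(\theta \mid D)}\big[\nabla_\lambda \log q(\theta; \lambda)\big] \approx -\sum_{s=1}^{S} \tilde{w}_s\, \nabla_\lambda \log q(\theta_s; \lambda)
\tilde{w}_s = \frac{w_s}{\sum_{s'} w_{s'}}, \qquad w_s = \frac{p(D \mid \theta_s)\, p(\theta_s)}{q(\theta_s; \lambda)}
Normalising the weights cancels the unknown constant p(D).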
• Assume that the posterior and the joint distribution do not depend on the variational parameter λ:
References
[1] Gal, Yarin and Ghahramani, Zoubin. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
[2] Gal, Yarin. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
[3] Minka, Tom. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
[4] Li, Yingzhen and Turner, Richard E. Rényi divergence variational inference. In NIPS, 2016.
[5] Ji, C. and Shen, H. Stochastic variational inference via upper bound. arXiv preprint arXiv:1912.00650, 2019.
[6] Dieng, Adji B. Variational inference via χ-upper bound minimization. In NIPS, 2017.
Appendix
• Marginal and Conditional Gaussians
• Given a marginal Gaussian distribution for x and a conditional
Gaussian distribution for y given x in the form
• The marginal distribution of y and the conditional distribution
of x given y are given by
• where
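The missing equations are presumably the standard Gaussian identities (e.g. Bishop, PRML, Section 2.3.3):
p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad p(y \mid x) = \mathcal{N}(y \mid A x + b, L^{-1})
p(y) = \mathcal{N}\big(y \mid A\mu + b,\; L^{-1} + A \Lambda^{-1} A^\top\big)
p(x \mid y) = \mathcal{N}\big(x \mid \Sigma\{A^\top L (y - b) + \Lambda \mu\},\; \Sigma\big), \qquad \Sigma = (\Lambda + A^\top L A)^{-1}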