Variational Inference
• The predictive distribution
• Log evidence lower bound
• Objective:
• Variational Prediction:
• Objective:
• A variational distribution q(w) that explains the data well while still being close to the prior
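The equations referenced above are presumably the standard forms (following Gal, 2016); a sketch, assuming training data X, Y, weights w, prior p(w), and likelihood p(y | x, w):
p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, w)\, p(w \mid X, Y)\, dw
\log p(Y \mid X) \ge \mathcal{L}_{VI}(q) = \int q(w) \log p(Y \mid X, w)\, dw - \mathrm{KL}\big(q(w) \,\|\, p(w)\big)
q(y^* \mid x^*) = \int p(y^* \mid x^*, w)\, q(w)\, dw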
• The mini-batch objective with reparameterization:
• Simple MC estimate (a single sample):
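A sketch of these two equations, roughly in the notation of Gal (2016), assuming a mini-batch S of size M drawn from N data points and a reparameterisation w = g(θ, ε):
\hat{\mathcal{L}}_{VI}(\theta) = -\frac{N}{M} \sum_{i \in S} \int q_\theta(w) \log p(y_i \mid x_i, w)\, dw + \mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big), \qquad w = g(\theta, \epsilon), \ \epsilon \sim p(\epsilon)
\hat{\mathcal{L}}_{VI}(\theta) \approx -\frac{N}{M} \sum_{i \in S} \log p\big(y_i \mid x_i, g(\theta, \epsilon_i)\big) + \mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big), \qquad \epsilon_i \sim p(\epsilon) \ \text{(one MC sample per point)}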
Dropout
Procedure
• Training stage: a unit is present with probability p
• Testing stage: the unit is always present and its weights are multiplied by p, so the expected output matches the training-time output
Intuition
• Training a neural network with dropout can be seen as training a collection of 2^K thinned networks with extensive weight sharing
• At test time, a single neural network approximates averaging the outputs of all these thinned networks
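A minimal NumPy sketch of the procedure above (the layer shape, weights, and keep probability are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
p = 0.8                                  # probability that a unit is present
W = rng.normal(size=(5, 3))              # toy layer weights
x = rng.normal(size=5)

# Training stage: each unit is kept with probability p
mask = rng.binomial(1, p, size=x.shape)
train_out = (x * mask) @ W

# Testing stage: every unit is present and the weights are scaled by p,
# so the expectation of train_out over masks equals test_out
test_out = (x * p) @ W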
Dropout for one-hidden-layer Neural Networks
• Dropping out individual units is equivalent to multiplying the global weight matrices by binary vectors that zero out entire rows (see the sketch after this list):
• Application to regression
• Inputs x and outputs y
• g: activation function; W_1, W_2: weight matrices
• b_l: binary dropout variables
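One common way to write the resulting forward pass, a simplified sketch in the spirit of Gal (2016) that omits the bias and scaling constants:
\hat{y} = g\big(x\, \mathrm{diag}(b_1)\, W_1\big)\, \mathrm{diag}(b_2)\, W_2, \qquad b_{1,q} \sim \mathrm{Bernoulli}(p_1), \quad b_{2,k} \sim \mathrm{Bernoulli}(p_2)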
• We can rewrite the loss as a negative log-likelihood:
• The weight parameters with their presence probabilities:
• The new loss:
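A sketch of the correspondence, assuming the standard squared loss with L2 regularisation and a Gaussian likelihood with precision τ (the constants follow Gal, 2016 only up to scaling):
E = \frac{1}{2N} \sum_{n=1}^{N} \| y_n - \hat{y}_n \|^2 + \lambda_1 \|W_1\|^2 + \lambda_2 \|W_2\|^2
-\log p(y_n \mid x_n, w) = \frac{\tau}{2} \| y_n - \hat{y}_n \|^2 + \text{const}, \qquad p(y_n \mid x_n, w) = \mathcal{N}\big(y_n;\, \hat{y}_n,\, \tau^{-1} I_D\big)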
A single-layer neural network example
• Setup
• Idea: introduce W_1 and W_2 as the approximating model
• Q: input dimension; K: number of hidden units; D: output dimension
• Goal: learn W_1 ∈ R^{Q×K} and W_2 ∈ R^{K×D} to map X ∈ R^{N×Q} to Y ∈ R^{N×D}
Variational Inference in the Approximate Model
• To mimic dropout, q(W_1) is factorised over the input dimension; each factor is a two-component Gaussian mixture distribution (sketched below)
• Same for q(W_2)
• Optimise over the variational parameters, in particular the component means
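Following Gal (2016), the mixtures presumably take roughly the following form, with a small fixed standard deviation σ and the component means as the main variational parameters:
q(W_1) = \prod_{q=1}^{Q} q(\mathbf{w}_q), \qquad q(\mathbf{w}_q) = p_1\, \mathcal{N}(\mathbf{m}_q,\, \sigma^2 I_K) + (1 - p_1)\, \mathcal{N}(\mathbf{0},\, \sigma^2 I_K)
q(W_2) = \prod_{k=1}^{K} q(\mathbf{w}_k), \qquad q(\mathbf{w}_k) = p_2\, \mathcal{N}(\mathbf{m}_k,\, \sigma^2 I_D) + (1 - p_2)\, \mathcal{N}(\mathbf{0},\, \sigma^2 I_D)
The optimisation is mainly over the means M_1 = [\mathbf{m}_q] and M_2 = [\mathbf{m}_k].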
Evaluating the Log Evidence Lower Bound for Regression
• Log evidence lower bound
• Approximation of the expected log-likelihood term
• For large enough K we can approximate the KL divergence term (Gal, 2016; a rough form is sketched after this list)
• Similarly for the KL term of the second layer, KL(q(W_2) ∥ p(W_2))
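Roughly, for a standard-normal prior on each row, the approximation in Gal (2016) reduces to a weight-decay-like term in the component means (a sketch; C collects terms that do not depend on the means):
\mathrm{KL}\big(q(W_1) \,\|\, p(W_1)\big) \approx \sum_{q=1}^{Q} \frac{p_1}{2}\, \mathbf{m}_q^\top \mathbf{m}_q + C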
KL condition
• More specifically, if we define the prior p(ω) such that the following holds (a rough form is sketched after this list):
• we obtain the following:
• The two optimisation procedures are then identical, so dropout can be interpreted as Bayesian variational inference (Gal, 2016)
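A rough sketch of the condition, following Gal (2016); the exact constants depend on the prior and on the model precision τ. The prior is chosen so that the derivative of the KL term matches the derivative of the L2 regulariser used with dropout:
\frac{\partial}{\partial \theta}\, \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big) = \frac{\partial}{\partial \theta}\, N\tau \big(\lambda_1 \|M_1\|^2 + \lambda_2 \|M_2\|^2\big)
Under this condition the dropout objective and the negative log evidence lower bound have the same gradients up to a constant factor, so minimising one minimises the other.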
MC dropout and epistemic uncertainty
• Predictive distribution:
where w is the set of random weight variables for a model with L layers, f is the model's stochastic output, and q_θ*(w) is an optimum of:
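Presumably these are the standard MC-dropout forms (Gal, 2016): the predictive distribution is
q(y^* \mid x^*) = \int p(y^* \mid x^*, w)\, q_{\theta^*}(w)\, dw
and q_θ*(w) is an optimum of the log evidence lower bound \mathcal{L}_{VI}(\theta) given earlier, equivalently a minimiser of \mathrm{KL}\big(q_\theta(w) \,\|\, p(w \mid X, Y)\big).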
For regression this epistemic uncertainty is captured by the predictive mean and
variance, which can be approximated as:
For classification this can be approximated using Monte Carlo integration as
follows:
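The estimators are presumably those of Gal (2016), with ŵ_t ~ q_θ*(w) denoting T sampled dropout masks and τ the model precision (a sketch):
\mathbb{E}[y^*] \approx \frac{1}{T} \sum_{t=1}^{T} f^{\hat{w}_t}(x^*)
\mathrm{Var}[y^*] \approx \tau^{-1} I_D + \frac{1}{T} \sum_{t=1}^{T} f^{\hat{w}_t}(x^*)^\top f^{\hat{w}_t}(x^*) - \mathbb{E}[y^*]^\top \mathbb{E}[y^*]
p(y^* = c \mid x^*, X, Y) \approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{softmax}\big(f^{\hat{w}_t}(x^*)\big)_c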
• Generally, dropout is used only at training time; for MC dropout we keep it active at test time as well, averaging stochastic forward passes to obtain uncertainty estimates (see the sketch below)
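A minimal NumPy sketch of test-time MC dropout for a toy one-hidden-layer regression network (the weights and shapes are illustrative; a real model would use weights trained with dropout and weight decay):

import numpy as np

def mc_dropout_predict(x, W1, W2, p=0.5, T=100, rng=None):
    """Average T stochastic forward passes with dropout kept ON at test time."""
    rng = np.random.default_rng() if rng is None else rng
    samples = []
    for _ in range(T):
        h = np.maximum(x @ W1, 0.0)                # hidden layer (ReLU)
        mask = rng.binomial(1, p, size=h.shape)    # Bernoulli keep-mask
        samples.append((h * mask) @ W2)            # stochastic output f^{w_t}(x)
    samples = np.stack(samples)                    # shape (T, D)
    # predictive mean and the epistemic part of the predictive variance;
    # the full variance would add tau^{-1} for observation noise
    return samples.mean(axis=0), samples.var(axis=0)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 16)), rng.normal(size=(16, 1))
mean, var = mc_dropout_predict(np.array([0.1, -0.2, 0.3]), W1, W2, rng=rng)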
• Dropout VI can severely underestimate model uncertainty (Gal, 2016, Section 3.3.2), a property many VI methods share (dropout with alpha-divergences, Li, 2017)
• The ELBO is a lower bound on the log evidence: can we also use an upper bound when estimating q(w)?
VI with upper bound
• Minimising the exclusive KL underestimates the posterior because of its zero-forcing behaviour
[Figure: zero-forcing behaviour of the exclusive KL vs. the mass-covering / zero-avoiding behaviour of the inclusive KL]
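The two behaviours correspond to the two directions of the KL divergence (standard facts, cf. Minka, 2005):
\mathrm{KL}(q \,\|\, p) = \int q(w) \log \frac{q(w)}{p(w \mid D)}\, dw \quad \text{(exclusive KL: zero-forcing, tends to underestimate)}
\mathrm{KL}(p \,\|\, q) = \int p(w \mid D) \log \frac{p(w \mid D)}{q(w)}\, dw \quad \text{(inclusive KL: mass-covering / zero-avoiding)}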
CUBO (Dieng et al., 2017)
• Chi-square divergence
• Relating the log evidence to the Chi-square divergence gives an upper bound, the CUBO (sketched below):
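Up to notation, the bound from Dieng et al. (2017) can be sketched as follows for the χ² (n = 2) case:
\chi^2\big(p(w \mid D) \,\|\, q(w)\big) = \mathbb{E}_{q}\Big[\Big(\tfrac{p(w \mid D)}{q(w)}\Big)^2\Big] - 1
\mathrm{CUBO}_2(q) = \tfrac{1}{2} \log \mathbb{E}_{q}\Big[\Big(\tfrac{p(D, w)}{q(w)}\Big)^2\Big] = \log p(D) + \tfrac{1}{2} \log\Big(1 + \chi^2\big(p(w \mid D) \,\|\, q(w)\big)\Big) \ \ge\ \log p(D)
Minimising the CUBO over q therefore minimises the χ² divergence while upper-bounding the log evidence.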
With an accompanying upper bound, one can perform what Dieng et al. call maximum entropy model selection, in which each model's evidence value is chosen to be the one that maximizes the entropy of the resulting distribution over models.
EUBO (Ji & Shen, 2019)
• Inclusive KL divergence
• By Gibbs' inequality (sketched below):
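Up to notation, the bound in Ji & Shen (2019) follows from Gibbs' inequality applied to the inclusive KL (a sketch):
\mathrm{KL}\big(p(\theta \mid D) \,\|\, q(\theta; \lambda)\big) \ge 0 \;\Longrightarrow\; \log p(D) \le \mathbb{E}_{p(\theta \mid D)}\Big[\log \frac{p(D, \theta)}{q(\theta; \lambda)}\Big] =: \mathrm{EUBO}(\lambda)
Since EUBO(λ) and the inclusive KL differ only by the constant log p(D), minimising the EUBO over λ is equivalent to minimising KL(p(θ|D) ∥ q(θ; λ)).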
Theorem 2:
• SGD for the EUBO (a sketch of the estimator follows below):
• The weight w(θ) involves the unknown posterior, so the joint distribution p(D|θ)p(θ) is used instead and the weights are normalised to cancel the unknown constant p(D)
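A sketch of the resulting self-normalised importance-sampling gradient, assuming samples θ_s ~ q(θ; λ) and using the fact (noted below) that only q depends on λ:
\nabla_\lambda\, \mathrm{EUBO}(\lambda) = -\,\mathbb{E}_{p(\theta \mid D)}\big[\nabla_\lambda \log q(\theta; \lambda)\big] \approx -\sum_{s=1}^{S} \tilde{w}_s\, \nabla_\lambda \log q(\theta_s; \lambda)
\tilde{w}_s = \frac{w_s}{\sum_{s'} w_{s'}}, \qquad w_s = \frac{p(D \mid \theta_s)\, p(\theta_s)}{q(\theta_s; \lambda)}
Normalising the weights cancels the unknown constant p(D).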
• Assume that the posterior and the joint distribution do not depend on the variational parameter λ:
References
[1] Gal, Yarin and Ghahramani, Zoubin. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
[2] Gal, Yarin. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
[3] Minka, Tom. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
[4] Li, Yingzhen and Turner, Richard E. Rényi divergence variational inference. In NIPS, 2016.
[5] Ji, C. and Shen, H. Stochastic variational inference via upper bound. arXiv preprint arXiv:1912.00650, 2019.
[6] Dieng, Adji B. Variational inference via χ-upper bound minimization. In NIPS, 2017.
Appendix
• Marginal and Conditional Gaussians
• Given a marginal Gaussian distribution for x and a conditional
Gaussian distribution for y given x in the form
• The marginal distribution of y and the conditional distribution
of x given y are given by
• where
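The missing equations are presumably the standard Gaussian identities (e.g. Bishop, PRML, Section 2.3.3):
p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad p(y \mid x) = \mathcal{N}(y \mid A x + b, L^{-1})
p(y) = \mathcal{N}\big(y \mid A\mu + b,\; L^{-1} + A \Lambda^{-1} A^\top\big)
p(x \mid y) = \mathcal{N}\big(x \mid \Sigma\{A^\top L (y - b) + \Lambda \mu\},\; \Sigma\big), \qquad \Sigma = (\Lambda + A^\top L A)^{-1}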