Bayesian Neural Network
Natan Katz
Natan.katz@gmail.com
Agenda
• Short Introduction to Bayesian Inference
• Variational Inference
• Bayesian Neural Network
• Numerical Methods
• MNIST Example
Bayesian Inference
Bayesian Inference
The inputs:
Evidence – a sample of observations (numbers, categories, vectors, images)
Hypothesis – an assumption about the probabilistic structure that creates the sample
Objective:
We wish to learn the optimal parameters of this distribution.
• This probability is called the Posterior.
• We wish to find the optimal parameters for P(H|E)
• Remark: in many books this is called MAP (Maximum A Posteriori) estimation
Let’s Formulate
Z- R.V. that represents the hypothesis
X- R.V. that represents the evidence
Bayes formula:
P(Z|X) = P(Z, X) / P(X)
Let’s Formulate (Cont.)
P_r(Z) – Prior (the parameters' distribution according to our belief)
P_l(X|Z) – Likelihood (how likely is the sample given the parameters)
P(Z|X) = P_r(Z) P_l(X|Z) / P(X)
Bayesian inference is therefore about working with the RHS terms.
In some cases, computing the denominator is intractable or extremely difficult.
Example -GMM
We have K Gaussians with known variance σ
Draw μ_k ~ N(0, τ) from the prior (τ is positive)
For each sample j = 1…n:
  z_j ~ Cat(1/K, 1/K, …, 1/K)
  x_j ~ N(μ_{z_j}, σ)
p(x_{1…n}) = ∫_{μ_{1:K}} ∏_{l=1…K} P(μ_l) ∏_{j=1…n} Σ_{z_j} p(z_j) P(x_j | μ_{z_j}) dμ_{1:K}
=> pretty nasty (expanding the product of sums gives K^n terms)
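A minimal NumPy sketch of this generative process (the values of K, n, τ, σ are illustrative, and τ, σ are treated as standard deviations here). Sampling forward is easy; it is the marginal p(x_{1…n}) above that is hard:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, tau, sigma = 3, 200, 5.0, 1.0          # illustrative values

mu = rng.normal(loc=0.0, scale=tau, size=K)  # mu_k ~ N(0, tau), tau treated as a std here
z = rng.integers(low=0, high=K, size=n)      # z_j ~ Cat(1/K, ..., 1/K)
x = rng.normal(loc=mu[z], scale=sigma)       # x_j ~ N(mu_{z_j}, sigma)
```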
Some Good news
P(Z|X) = P_r(Z) P_l(X|Z) / P(X)
• We wish to learn Z
• There is no Z in the denominator
=> P(Z|X) ∝ P_l(X|Z) P_r(Z)
Solutions
Until 1999
Mostly numerical sampling:
• Metropolis Hastings
• RBM
Variational
Inference
“AN INTRODUCTION TO VARIATIONAL METHODS FOR GRAPHICAL MODELS”
VI – Algorithm Overview
• Rather than a numerical sampling method, we use an analytical one:
1. We define a distribution family Q(Z) (a bias-variance trade-off)
2. We minimize the KL divergence: min KL(Q(Z) || P(Z|X))
log P(X) = E_Q[log P(X, Z)] − E_Q[log Q(Z)] + KL(Q(Z) || P(Z|X))
ELBO – Evidence Lower Bound
• Maximizing the ELBO => minimizing the KL
ELBO = E_Q[log P(X, Z)] − E_Q[log Q(Z)] = ∫ Q(Z) log( P(X, Z) / Q(Z) ) dZ = J(Q) – Euler-Lagrange
MFT (Mean Field Theory)
Scientific “approval”
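A minimal sketch of ELBO maximization with the reparameterization trick on a toy 1-D model (the data, the prior, the form of Q, and the optimizer settings are illustrative assumptions, not the talk's example):

```python
import torch

torch.manual_seed(0)
x = torch.randn(50) + 2.0                      # toy data with unknown mean z

# Variational family Q(z) = N(m, s^2); we learn m and log(s)
m = torch.zeros(1, requires_grad=True)
log_s = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([m, log_s], lr=0.05)

def log_joint(z):
    # log P(x, z) = log prior + log likelihood
    log_prior = torch.distributions.Normal(0.0, 10.0).log_prob(z).sum()
    log_lik = torch.distributions.Normal(z, 1.0).log_prob(x).sum()
    return log_prior + log_lik

for step in range(2000):
    opt.zero_grad()
    s = log_s.exp()
    z = m + s * torch.randn(1)                 # reparameterization trick
    elbo = log_joint(z) - torch.distributions.Normal(m, s).log_prob(z).sum()
    (-elbo).backward()                         # maximizing the ELBO = minimizing -ELBO
    opt.step()
```

After training, m and exp(log_s) approximate the posterior mean and standard deviation of z.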
What Deep Learning
doesn’t do
A DL Scenario
• We train a CNN to classify images (men versus women)
• Our accuracy can be amazing (98-99%)
Pretty cool
Let’s get Cruel
• We show the model an image of a basketball
• The model outputs “man” or “woman”
Why is that?
Mathematical Observation
We trained a function F such that
F : {space of images}->{“man”,”woman”}
Statistical Observation
A basketball image is outside our training data
Anecdotes
Image (Uri Itay)
• Researchers trained a network to classify tanks and trees.
After using 200 images (100 of each kind, 50 for training and 50 for test), the test accuracy was 100%.
When they took it to the Pentagon it began to miss. The reason: all the tank images were taken on cloudy days, whereas the tree images were taken on sunny days.
Text
• In many text problems, rather than finding the latent phenomena, networks use specific words as their anchor.
A plausible corollary
When we train a DL model:
• We hardly ever know what the model has learned
• Models cannot “report” their uncertainty
Is it crucial?
• Consider an AI-based engine that decides whether a tumor is malignant or benign
• Drug treatment upon medical record
• Actions that are taken by an autonomous vehicle
• High frequency trading
What can we do?
• DL models are trained to find the optimal weights
• What if, rather than training the weights point-wise, we train distributions over the weights?
The Inference
• For each data pair (x, y) we produce a mean and a variance
• This variance reflects the model's uncertainty
• DL approach – apply dropout at inference time (a sketch follows below)
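A hedged sketch of the dropout-at-inference idea (often called MC Dropout); the architecture, dropout rate, and number of samples below are illustrative:

```python
import torch
import torch.nn as nn

# Hypothetical small classifier; the Dropout layer is what we exploit at inference
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(128, 10))

def mc_dropout_predict(model, x, n_samples=100):
    model.train()                              # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.var(dim=0) # per-class predictive mean and variance

mean, var = mc_dropout_predict(model, torch.randn(1, 784))
```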
Uncertainty Types
Epistemic Uncertainty :
Episteme= Knowledge
Uncertainty that we could in theory know but don't:
• Model structure issues
• Absence of data
We can also call this “reducible” uncertainty
Uncertainty Types
Aleatoric Uncertainty :
Aleator = Dice Player
Uncertainty that we cannot resolve:
• The stochasticity of a die
• Noisy labels
We can also call this “irreducible” uncertainty
Bayesian Neural Network
BNN-Training
• We have a neural network
• We place a prior distribution P over the weights W
• For data D = {(X, Y)}
To measure uncertainty, we use the posterior distribution P(W|D)
DL Vs. BNN
DL
1. Training uses a loss related to the prediction probability P(Y|X, W)
2. The weights W are trained point-wise with MLE
Bayesian NN
1. Training uses a loss related to the posterior probability P(W|X, Y)
2. We train a distribution over the weights
BNN-Inference
Inference
We assume prior knowledge of the weights' distribution π
As in any NN we get an input x' and aim to predict y':
P(y'|x') = ∫ P(y'|x', w) P(w|D) dw
This can be rewritten as:
P(y'|x') = E_{P(w|D)}[ P(y'|x', w) ]
where D = {(X, Y)}
Measuring Uncertainty
• At inference, given a data point x*:
• Sample weights w_1, …, w_n from the posterior
• Calculate the statistics of the outputs:
E[f(x*, w)] ≈ (1/n) Σ_{i=1…n} f(x*, w_i)
V[f(x*, w)] = E[f(x*, w)²] − E[f(x*, w)]²
where W is the random variable of which the w_i are samples
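A small sketch of these estimators, assuming we already hold a list of posterior weight samples and a prediction function f(x, w) (both hypothetical placeholders):

```python
import numpy as np

def predictive_stats(f, x_star, weight_samples):
    """Monte Carlo estimates of E[f(x*, w)] and V[f(x*, w)] over posterior weight samples."""
    outputs = np.array([f(x_star, w) for w in weight_samples])
    mean = outputs.mean(axis=0)                     # (1/n) sum_i f(x*, w_i)
    var = (outputs ** 2).mean(axis=0) - mean ** 2   # E[f^2] - E[f]^2
    return mean, var
```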
Common tools to obtain Posterior Dist.
1. Variational Inference
2. MCMC –Sampling (Metropolis –Hastings, Gibbs)
3. HMC
4. SGLD
Metropolis Hastings
• MCMC sampling algorithm
• The main idea is that we pick samples based on comparisons of pdf values:
at each step we propose a new sample around the previous one and accept or reject it according to the density ratio
• Unbiased, huge variance, and very slow (it iterates over the entire data)
• Great History
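A minimal random-walk Metropolis-Hastings sketch (the target, step size, and chain length are illustrative):

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_steps=10_000, step=0.5, seed=0):
    """Propose around the previous sample, accept or reject by a ratio of (unnormalized) densities."""
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + step * rng.normal()
        # Accept with probability min(1, p(proposal) / p(x)); done in log space for stability
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings(lambda x: -0.5 * x ** 2, x0=0.0)   # unnormalized standard normal
```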
What is Hamiltonian?
• A physical operator that measures energy of a dynamical system
Two sets of coordinates
q -State coordinates
p- Momentum
H(p, q) = U(q) + K(p)
U(q) = −log[ π(q) L(q|D) ]      K(p) = p² / (2m)
U – potential energy, K – kinetic energy
Hamilton's equations:  dH/dp = dq/dt ,  dH/dq = −dp/dt
Hamiltonian Monte Carlo
• Hamiltonians define a deterministic vector field (with trajectories…)
• If we define a distribution that depends on the Hamiltonian, we can use this property for sampling
P(x, y) ∝ e^{−H(x, y)}
Hybrid - MC
• We have the “state space” x
• We can add “momentum” and use Hamiltonian mechanism
Leapfrog Algorithm
We set a time interval δ. For each step i:
1. P_i(t + δ/2) = P_i(t) − (δ/2) dU/dq(t)
2. Q_i(t + δ) = Q_i(t) + δ dK/dp(t + δ/2)
3. P_i(t + δ) = P_i(t + δ/2) − (δ/2) dU/dq(t + δ)
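A direct transcription of the three leapfrog steps above (grad_U and grad_K are placeholders for the gradients of the potential and kinetic energies):

```python
import numpy as np

def leapfrog(q, p, grad_U, grad_K, delta, n_steps):
    """Half-step in momentum, full step in position, half-step in momentum."""
    q, p = np.copy(q), np.copy(p)
    for _ in range(n_steps):
        p -= 0.5 * delta * grad_U(q)   # P(t + d/2) = P(t) - (d/2) dU/dq(t)
        q += delta * grad_K(p)         # Q(t + d)   = Q(t) + d dK/dp(t + d/2)
        p -= 0.5 * delta * grad_U(q)   # P(t + d)   = P(t + d/2) - (d/2) dU/dq(t + d)
    return q, p
```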
HMC
Algorithm (Neal 1995, 2012, Duane 1987)
1. Draw x_0 from our prior; draw p_0 from a standard normal distribution
2. Perform L steps of leapfrog
3. Accept or reject x_t with a Metropolis-Hastings step
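A sketch of the full loop, reusing the leapfrog function above; U is the negative log posterior, K(p) = p²/2 (so dK/dp = p), and all tuning values are illustrative:

```python
import numpy as np

def hmc(U, grad_U, q0, n_samples=1_000, delta=0.1, L=20, seed=0):
    """Draw a momentum, run L leapfrog steps, then accept or reject with a Metropolis-Hastings test."""
    rng = np.random.default_rng(seed)
    q, samples = np.atleast_1d(np.asarray(q0, dtype=float)), []
    for _ in range(n_samples):
        p = rng.normal(size=q.shape)                              # p0 ~ standard normal
        q_new, p_new = leapfrog(q, p, grad_U, lambda mom: mom, delta, L)
        h_old = U(q) + 0.5 * np.sum(p ** 2)                       # H = U(q) + K(p)
        h_new = U(q_new) + 0.5 * np.sum(p_new ** 2)
        if np.log(rng.uniform()) < h_old - h_new:                 # accept with prob min(1, e^{-dH})
            q = q_new
        samples.append(q)
    return np.array(samples)
```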
HMC –Pros & Cons
Pros
• It takes points from a wider domain, so we can describe the distribution better
• It may take points with lower density
• Faster than plain MCMC
Cons
• It may suffer from low energy barrier
• No minibatch –Not nice
• It has to calculate gradients for the entire data!!! Bad
What do we need then?
• A tool that allows sub-sampling
• Fewer Gradients
• Keen knowledge about extrema and how to escape them
Stochastic Gradient
Langevin Dynamics
(SGLD)
Langevin Equation
The Langevin equation describes the motion of a pollen grain in water:
F − γ v_t + ξ_t = 0,    ξ_t ~ N(0, t)
ξ_t is a Brownian force – the collisions with the water molecules
F – external forces
This equation has an equilibrium solution, which is our posterior distribution
Langevin Equation
Let's use the following: F = ∇E,  v_t = dX/dt
The equation in its discrete form becomes:
x_{t+1} = x_t + (dt/γ) ∇E + (dt/γ) ξ_t
(looks familiar, doesn't it?)
Langevin Equation
Some more rewriting:
x_{t+1} = x_t + ε_t ∇E + ξ_t        ξ_t – stochastic term
Consider this term. Are we in a better situation?
Robbins & Monro (Stoch. Approx. 1951)
• Let F be a function and θ a number
• There exists a unique solution x* such that F(x*) = θ
• F is unknown
• Y is a measurable r.v. with E[Y(x)] = F(x)
Robbins & Monro (cont.)
The following algorithm converges to x*:
X_{N+1} = X_N + α_N (Y_N − θ)
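A toy sketch of the iteration (the target function F, the noise level, and θ are illustrative; the algorithm only sees noisy evaluations Y(x)):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0

def Y(x):
    # The algorithm never sees F(x) = 3 - 2x directly, only this noisy measurement of it
    return 3.0 - 2.0 * x + rng.normal(scale=0.5)

x = 0.0
for n in range(1, 10_000):
    alpha = 1.0 / n                        # step sizes obeying the Robbins-Monro conditions
    x = x + alpha * (Y(x) - theta)         # X_{N+1} = X_N + alpha_N (Y_N - theta)
print(x)                                   # converges to x* = 1.5, where F(x*) = theta
```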
Back to Langevin
x_{t+1} = x_t + ε_t ∇E + ξ_t
∇E_mb = ∇E + ξ_t  (the minibatch gradient absorbs the noise)
x_{t+1} = x_t + ε_t ∇E_mb
Δθ_t = ε_t ( ∇log p(θ_t) + (N/n) Σ_{i=1…n} ∇log p(x_i | θ_t) )
We are almost there
• This equation converges to an optimal point estimate (MAP).
• We want samples from the posterior, i.e. a solution of an SDE.
• Let's add a stochastic term:
Δθ_t = ε_t ( ∇log p(θ_t) + (N/n) Σ_{i=1…n} ∇log p(x_i | θ_t) ) + η_t,    η_t ~ N(0, σ)
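A sketch of one SGLD update following the slide's rule; grad_log_prior and grad_log_lik are hypothetical callables for ∇log p(θ) and ∇log p(x|θ), and N is the full dataset size:

```python
import numpy as np

def sgld_step(theta, minibatch, grad_log_prior, grad_log_lik, eps_t, sigma_t, N, rng):
    """One step: minibatch estimate of the posterior gradient plus injected Gaussian noise."""
    n = len(minibatch)
    grad = grad_log_prior(theta) + (N / n) * sum(grad_log_lik(x_i, theta) for x_i in minibatch)
    eta = rng.normal(scale=np.sqrt(sigma_t), size=np.shape(theta))   # eta_t ~ N(0, sigma_t)
    return theta + eps_t * grad + eta
```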
Variance Analysis
ε_t – follows the R&M rules
How big is σ?
Bigger than ε_t · V(∇)
As t → ∞ the equation must become the Langevin equation.
The variance of η must therefore be bigger than ε_t · V(∇).
We do the following:
Finally, Example
https://towardsdatascience.com/making-your-neural-network-say-i-dont-know-bayesian-nns-using-pyro-and-pytorch-b1c24e6ab8cd
Problem’s Framework
• MNIST CNN model
• MNIST SOTA ~99.8%
The Experiment
• Training a BNN using VI (a small number of epochs)
• Set a regular decision rule – take the max score over the digits
=> Accuracy ~88%
Allowing the Network to refuse
• For each image:
• Sample 100 networks
• We obtain 100 outputs per image
• We have 10 digits each with 100 scores
• If the median of these 100 scores is > 0.2 we accept that digit (a sketch of this rule follows below)
(Indeed, we can accept more than one result)
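A hedged sketch of this decision rule; sample_model() is a hypothetical helper that returns one network with weights drawn from the learned posterior (e.g. via the Pyro guide in the linked post):

```python
import torch

def predict_or_refuse(sample_model, x, n_samples=100, threshold=0.2):
    """Sample n networks, keep the digits whose median score exceeds the threshold."""
    scores = torch.stack([torch.softmax(sample_model()(x), dim=-1) for _ in range(n_samples)])
    medians = scores.median(dim=0).values.squeeze()        # one median score per digit
    accepted = torch.where(medians > threshold)[0].tolist()
    return accepted if accepted else "I don't know"        # empty list => the network refuses
```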
Random Image
Summary
• Accuracy: 96%
• Refusal rate: 12.5%
• 95% of the random images were refused
Thanks!!
My process
• https://wjmaddox.github.io/assets/BNN_tutorial_CILVR.pdf
• https://arxiv.org/pdf/2007.06823.pdf
• https://towardsdatascience.com/what-uncertainties-tell-you-in-bayesian-neural-networks-6fbd5f85648e
• https://medium.com/@uriitai/augmentation-and-groups-theory-795c287fec3f
• https://github.com/paraschopra/bayesian-neural-network-mnist/blob/master/bnn.ipynb
• https://towardsdatascience.com/making-your-neural-network-say-i-dont-know-bayesian-nns-using-pyro-and-pytorch-b1c24e6ab8cd
• http://www.stats.ox.ac.uk/~teh/research/compstats/WelTeh2011a.pdf
• https://arxiv.org/pdf/1206.1901.pdf
• http://cgl.elte.hu/~racz/Stoch-diff-eq.pdf
• https://arxiv.org/ftp/arxiv/papers/1103/1103.1184.pdf
• https://henripal.github.io/blog/langevin
• https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
