SGLD – Berlin ML Group
1. SGLD
Stochastic Gradient Langevin Dynamics
Natan.katz@gmail.com
natank@checkpoint.com
2. Natan Katz
• Algorithm researcher since my military service
• M.Sc. in Applied Math – Weizmann Institute
• Spent one year at Goethe University with Prof. Kloeden
• Official author at Towards Data Science (TDS)
• More than 10 patents
• Currently at Checkpoint
• Fanatic fan of the Celtics; of SGE – so-so
Coordinates
• Natan.katz@gmail.com
• Natank@checkpoint.com
• https://www.linkedin.com/in/natan-katz-2936425/
5. A DL Scenario
• We train a CNN to identify images (men versus women)
• Our accuracy can be amazing (98-99%)
6. Let’s get Cruel
• We feed the model an image of a basketball
• The model outputs “man” or “woman”
7. Why is that?
Mathematical Observation
We trained a function F such that
F : {space of images} → {“man”, “woman”}
Statistical Observation
The basketball image lies outside our training data
8. Anecdotes
Vision (Uri Itai)
• Researchers trained a network to classify tanks and trees.
• For 200 images, the test accuracy was 100% .
• In a Pentagon test the model suffered a significant performance decay.
The reason: all the tank images had been taken on cloudy days, whereas the tree images were taken on sunny ones.
Text
• In text problems, models often focus on specific words rather than extracting the latent phenomena
9. Why Can’t we handle this ?
DL Training’s Vices
In practice:
• An NN can't say “I don’t know”
• An NN can't provide a protocol explaining its decision
• Models cannot “report” their uncertainties
• Pointwise training may lead to overfitting
10. Is it crucial ?
• Medical – an engine that decides whether a tumor is malignant or benign
• Autonomous vehicles – actions are taken based on a DL model's output
• High-frequency trading
A potential solution – run inference with dropout kept active (MC dropout)
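The dropout idea can be sketched with a toy network (NumPy only; the weights, layer sizes, and drop rate are illustrative assumptions, not the model from the slides): keep dropout active at inference time and run many stochastic forward passes, so the spread of the outputs serves as an uncertainty signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network; weights are arbitrary placeholders.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 2))

def forward(x, p_drop=0.5):
    """One stochastic forward pass. Keeping dropout ON at
    inference time is the essence of MC dropout."""
    h = np.maximum(x @ W1, 0.0)              # ReLU
    mask = rng.random(h.shape) > p_drop      # dropout stays active
    h = h * mask / (1.0 - p_drop)            # inverted dropout scaling
    logits = h @ W2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True) # softmax probabilities

def mc_dropout_predict(x, T=100):
    """Run T stochastic passes; mean = prediction, std = uncertainty."""
    samples = np.stack([forward(x) for _ in range(T)])  # (T, n, 2)
    return samples.mean(axis=0), samples.std(axis=0)

x = rng.normal(size=(1, 4))
mean, std = mc_dropout_predict(x)
```

A large per-class standard deviation flags an input the model is unsure about, which is exactly what a plain deterministic forward pass cannot provide.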
13. Epistemic Uncertainty
Episteme = knowledge
Uncertainty that can be reduced by:
• Improving the model's structure
• Adding data
16. • Evidence – The observed data
• Hypothesis – The latent variables
• P(Z|X) – Posterior dist.
17. Important Terms
P(θ|X) = P_r(θ) · P_l(X|θ) / P(X)  – the posterior distribution (P_r: prior, P_l: likelihood)
• We wish:
P(θ|X) ∝ P_l(X|θ) · P_r(θ)
(the posterior is proportional to the product of the prior and the likelihood)
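As a toy illustration of this proportionality (a hypothetical coin-bias example, not from the slides), a discrete grid over θ makes the normalization by P(X) explicit:

```python
import numpy as np

# Discrete grid over the coin bias θ; uniform prior.
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)

# Evidence X: 7 heads in 10 tosses -> likelihood ∝ θ^7 (1-θ)^3.
likelihood = theta**7 * (1 - theta)**3

# P(θ|X) ∝ P_l(X|θ) P_r(θ); the sum plays the role of P(X).
unnorm = likelihood * prior
posterior = unnorm / unnorm.sum()
```

With a flat prior the posterior peaks at the maximum-likelihood value θ = 0.7; a non-uniform prior would pull that peak toward the prior's mass.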
18. Statistical Inference Comparison
FREQUENTIST
• No prior knowledge
• Parameters are unknown but fixed – no probability attached
• Only the data is a r.v.
• Training: MLE – maximize P(data|Θ)
Neural nets are frequentist entities.
(are they?)
BAYESIAN – THE ART OF BELIEF
• Previous trials are used as prior knowledge
• Observed data is integrated into the parameters' distribution
• Parameters have a probability distribution that we learn
• Training: MAP – maximize P(Θ|data)
Need a Bayesian analogue
25. Uncertainty Estimators
We decompose the prediction's variance:
• The first term is the epistemic uncertainty – the variance of the means
• The second term is the aleatoric uncertainty – the mean of the variances
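A minimal sketch of this decomposition via the law of total variance, assuming simulated per-network predictive means and variances for a single regression target (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated predictive samples: T networks drawn from the posterior,
# each emitting a mean and a variance for one regression target.
T = 1000
means = rng.normal(loc=2.0, scale=0.5, size=T)  # per-network predictive mean
variances = np.full(T, 0.3**2)                  # per-network predictive variance

epistemic = means.var()       # variance of the means
aleatoric = variances.mean()  # mean of the variances
total = epistemic + aleatoric # law of total variance
```

Here the epistemic term would shrink if the posterior over weights concentrated (more data, better model), while the aleatoric term is irreducible observation noise.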
28. Experiment's Settings
• Vanilla trial – accuracy ~88%
BNN Training
• We use VI with a small number of epochs
For each image in the test set:
• Sample 100 networks
• We obtain 100 outputs per image
• We have 10 digits, each with 100 scores
• If the median of these 100 scores > 0.2, we accept the digit
(Indeed, we can accept more than a single result)
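The decision rule above can be sketched as follows (the score arrays are synthetic stand-ins for the 100 sampled-network outputs, not the experiment's real data):

```python
import numpy as np

rng = np.random.default_rng(2)

def decide(scores_per_digit, threshold=0.2):
    """scores_per_digit: shape (10, 100) — 100 sampled-network scores
    for each of the 10 digits. Accept every digit whose median score
    exceeds the threshold; refuse to answer if none qualifies."""
    medians = np.median(scores_per_digit, axis=1)
    accepted = np.flatnonzero(medians > threshold)
    return accepted.tolist() if accepted.size else "refuse"

# Confident case: digit 3 scores high across the sampled networks.
confident = np.full((10, 100), 0.02)
confident[3] = 0.8

# Out-of-distribution case: all scores low and diffuse.
ood = rng.uniform(0.0, 0.15, size=(10, 100))
```

Using the median across sampled networks makes the rule robust to a few outlier draws, and the "refuse" branch is exactly the "I don't know" answer a plain NN cannot give.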
29. Results
On 10,000 images:
• The network refused to decide on 1,250 images
• On the remaining 8,750 images the accuracy was 96%
• On random data the network refused to answer on 95% of the images
33. Random Process
A function X on the pair (ω, t) where:
• ω – an outcome of a draw
• t – a time index
• If ω is fixed, X is a continuous function of t (a sample path)
• If t is fixed, X is a random variable
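A quick simulation makes the two views concrete (a random-walk process as an illustrative stand-in): each row is one draw ω followed through time, and each column is one time t viewed across draws.

```python
import numpy as np

rng = np.random.default_rng(3)

# X(ω, t): simulate 5 draws (ω) of a random walk over 100 time steps (t).
n_draws, n_steps, dt = 5, 100, 0.01
increments = rng.normal(0.0, np.sqrt(dt), size=(n_draws, n_steps))
X = increments.cumsum(axis=1)

path = X[0]         # ω fixed: a function of t (one sample path)
slice_t = X[:, 50]  # t fixed: a random variable across the draws
```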
34. Robbins & Monro (Stoch. Approx. 1951)
• An unknown function F and a number θ satisfy:
F(x*) = θ
• Y – a measurable r.v. s.t. E[Y(x)] = F(x)
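A sketch of the Robbins–Monro iteration for this setup, with an assumed linear F and Gaussian observation noise (the step sizes a_n = 1/n satisfy the usual conditions Σ a_n = ∞, Σ a_n² < ∞):

```python
import numpy as np

rng = np.random.default_rng(4)

# Target: find x* with F(x*) = theta, observing only noisy Y(x).
theta = 4.0
F = lambda x: 2.0 * x                       # unknown to the algorithm
Y = lambda x: F(x) + rng.normal(0.0, 0.5)   # E[Y(x)] = F(x)

x = 0.0
for n in range(1, 5001):
    a_n = 1.0 / n                  # decaying step size
    x = x - a_n * (Y(x) - theta)   # Robbins–Monro update
# x converges (a.s.) to the root x* = 2 of F(x) = theta
```

The decaying steps average out the observation noise while still moving far enough to reach the root; this stochastic-approximation template is exactly what SGD, and later SGLD, build on.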
35. How do we Obtain Posterior ?
1. Variational Inference
2. MCMC –Sampling (Metropolis –Hastings, Gibbs)
3. HMC
4. SGLD
37. MH- Properties
• Unbiased, huge variance, and very slow (iterates over the entire data)
• Great history
HMC (Duane 1987, Neal 1995)
• Faster than MH
• It reaches low-density domains
• No minibatches – need to calculate many gradients and perform many accept-reject steps
39. Objectives
Numeric as DL, stochastic as Bayes
• We want a tool that:
• Allows mini-batches
• Is good at finding distributions for sampling weights
• Reduces overfitting
40. Physics
The Langevin equation describes the motion of a pollen grain in water:
ξ_t ~ N(0, t) – the Brownian force, modeling collisions with water molecules
This equation is an SDE: its solution is a random process
41. Overdamped Equation & ML (Welling & Teh 2011)
F(x) − γv_t + ξ_t = 0,  ξ_t ~ N(0, t)  (Brownian dynamics, molecular collisions)
F(x) = ∇E(x),  v_t = dX/dt
Discretization:
x_{t+1} = x_t + dt·(∇E(x_t) + ξ_t)
x_{t+1} = x_t + ϵ_t·∇E(x_t) + ε_t,  where ε_t is the stochastic term
42. SGD – Let's Batch!! (Welling & Teh 2011)
Denote ∇E_mb + u_t = ∇E,  u_t ~ N(0, V), V bounded
x_{t+1} = x_t + ϵ_t·(∇E_mb + u_t) + ε_t
• If we ignore the stochastic term ε_t:
P(θ|X) ∝ e^{−E(x)}
Robbins & Monro → MAP solution
43. The Langevin Term ε_t
• We wish to avoid collapsing to the MAP solution, since we want to explore (we are Bayesians)
• We can tweak the variances for this purpose!
ε_t ~ N(0, σ)
What is σ?
We need it to create a bigger variance than the SGD noise term.
Since the SGD noise term's variance goes as the learning rate squared, we can take σ equal to the learning rate itself: for small step sizes the injected Langevin noise then dominates the minibatch noise.
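Putting the pieces together, here is a minimal SGLD sketch on a toy Gaussian-mean problem (the model, prior, and hyperparameters are illustrative assumptions, not the talk's experiment): minibatch gradient steps on the log posterior plus injected N(0, ϵ) Langevin noise.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy problem: sample the posterior over the mean mu of N(mu, 1) data,
# with a N(0, 10) prior, using minibatch gradients.
N = 1000
data = rng.normal(1.5, 1.0, size=N)

def grad_log_post(mu, batch):
    """Minibatch estimate of the gradient of the log posterior."""
    grad_prior = -mu / 10.0                           # from the N(0, 10) prior
    grad_lik = (N / len(batch)) * np.sum(batch - mu)  # rescaled minibatch term
    return grad_prior + grad_lik

mu, eps, samples = 0.0, 1e-3, []
for t in range(5000):
    batch = rng.choice(data, size=32)
    # SGLD step: gradient ascent plus injected N(0, eps) noise, whose
    # variance (eps) dominates the O(eps^2) minibatch-noise term.
    mu = mu + 0.5 * eps * grad_log_post(mu, batch) + rng.normal(0.0, np.sqrt(eps))
    if t >= 1000:                # discard burn-in, keep posterior samples
        samples.append(mu)

posterior_mean = np.mean(samples)
```

Unlike plain SGD, the chain does not collapse to a point: the retained samples spread around the posterior mean, and their variance is the uncertainty estimate a BNN needs. In practice Welling & Teh also decay ϵ_t over time to trade off discretization error against mixing.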