SGLD – Berlin ML Group
1. SGLD
Stochastic Gradient Langevin Dynamics
Natan.katz@gmail.com
natank@checkpoint.com
2. Natan Katz
• Algorithm researcher since my military service
• M.Sc. in Applied Math – Weizmann Institute
• Spent one year at Goethe University with Prof. Kloeden
• Official author at Towards Data Science (TDS)
• More than 10 patents
• Currently at Checkpoint
• Fanatic fan of the Celtics; of SGE – so-so
Coordinates
• Natan.katz@gmail.com
• Natank@checkpoint.com
• https://www.linkedin.com/in/natan-katz-2936425/
5. A DL Scenario
• We train a CNN to identify images (men versus women)
• Our accuracy can be amazing (98-99%)
6. Let’s get Cruel
• We feed the model an image of a basketball
• The model outputs “man” or “woman”
7. Why is that?
Mathematical Observation
We trained a function F such that
F : {space of images} → {“man”, “woman”}
Statistical Observation
The basketball image lies outside our training data
8. Anecdotes
Vision (Uri Itai)
• Researchers trained a network to classify tanks and trees.
• For 200 images, the test accuracy was 100% .
• In a Pentagon test the model suffered a significant performance decay.
The reason: all the tank images had been taken on cloudy days, whereas the tree images were taken on sunny ones.
Text
• In text problems, models often focus on specific words rather than extracting the latent phenomena
9. Why Can’t we handle this ?
DL Training’s Vices
In practice:
• An NN can't say “I don’t know”
• An NN can't provide a protocol explaining its decision
• Models cannot “report” their uncertainties
• Pointwise training may lead to overfitting
10. Is it crucial ?
• Medical – an engine that decides whether a tumor is malignant or benign
• Autonomous vehicles – actions are taken based on a DL model's output
• High-frequency trading
A potential solution – run inference with dropout kept active (MC dropout)
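The dropout idea can be sketched with a toy network (NumPy only; the weights, layer sizes, and drop rate are illustrative assumptions, not the model from the slides): keep dropout active at inference time and run many stochastic forward passes, so the spread of the outputs serves as an uncertainty signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network; weights are arbitrary placeholders.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 2))

def forward(x, p_drop=0.5):
    """One stochastic forward pass. Keeping dropout ON at
    inference time is the essence of MC dropout."""
    h = np.maximum(x @ W1, 0.0)              # ReLU
    mask = rng.random(h.shape) > p_drop      # dropout stays active
    h = h * mask / (1.0 - p_drop)            # inverted dropout scaling
    logits = h @ W2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True) # softmax probabilities

def mc_dropout_predict(x, T=100):
    """Run T stochastic passes; mean = prediction, std = uncertainty."""
    samples = np.stack([forward(x) for _ in range(T)])  # (T, n, 2)
    return samples.mean(axis=0), samples.std(axis=0)

x = rng.normal(size=(1, 4))
mean, std = mc_dropout_predict(x)
```

A large per-class standard deviation flags an input the model is unsure about, which is exactly what a plain deterministic forward pass cannot provide.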
13. Epistemic Uncertainty
Episteme = knowledge
Uncertainty that can be reduced by:
• Improving the model's structure
• Adding data
16. • Evidence – The observed data
• Hypothesis – The latent variables
• P(Z|X) – Posterior dist.
17. Important Terms
P(θ|X) = P_r(θ) · P_l(X|θ) / P(X)  – the posterior distribution (P_r: prior, P_l: likelihood)
• We wish:
P(θ|X) ∝ P_l(X|θ) · P_r(θ)
(the posterior is proportional to the product of the prior and the likelihood)
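As a toy illustration of this proportionality (a hypothetical coin-bias example, not from the slides), a discrete grid over θ makes the normalization by P(X) explicit:

```python
import numpy as np

# Discrete grid over the coin bias θ; uniform prior.
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)

# Evidence X: 7 heads in 10 tosses -> likelihood ∝ θ^7 (1-θ)^3.
likelihood = theta**7 * (1 - theta)**3

# P(θ|X) ∝ P_l(X|θ) P_r(θ); the sum plays the role of P(X).
unnorm = likelihood * prior
posterior = unnorm / unnorm.sum()
```

With a flat prior the posterior peaks at the maximum-likelihood value θ = 0.7; a non-uniform prior would pull that peak toward the prior's mass.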
18. Statistical Inference Comparison
FREQUENTIST
• No prior knowledge
• Parameters are unknown but fixed – no probability attached
• Only the data is a r.v.
• Training: MLE – maximize P(data|Θ)
Neural nets are frequentist entities.
(are they?)
BAYESIAN – THE ART OF BELIEF
• Previous trials are used as prior knowledge
• Observed data is integrated into the parameters' distribution
• Parameters have a probability distribution that we learn
• Training: MAP – maximize P(Θ|data)
Need a Bayesian analogue
25. Uncertainty Estimators
We decompose the prediction's variance:
• The first term is the epistemic uncertainty – the variance of the means
• The second term is the aleatoric uncertainty – the mean of the variances
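A minimal sketch of this decomposition via the law of total variance, assuming simulated per-network predictive means and variances for a single regression target (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated predictive samples: T networks drawn from the posterior,
# each emitting a mean and a variance for one regression target.
T = 1000
means = rng.normal(loc=2.0, scale=0.5, size=T)  # per-network predictive mean
variances = np.full(T, 0.3**2)                  # per-network predictive variance

epistemic = means.var()       # variance of the means
aleatoric = variances.mean()  # mean of the variances
total = epistemic + aleatoric # law of total variance
```

Here the epistemic term would shrink if the posterior over weights concentrated (more data, better model), while the aleatoric term is irreducible observation noise.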
28. Experiment's Settings
• Vanilla trial – accuracy ~88%
BNN Training
• We use VI with a small number of epochs
For each image in the test set:
• Sample 100 networks
• We obtain 100 outputs per image
• We have 10 digits, each with 100 scores
• If the median of these 100 scores > 0.2, we accept the digit
(Indeed, we can accept more than a single result)
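The decision rule above can be sketched as follows (the score arrays are synthetic stand-ins for the 100 sampled-network outputs, not the experiment's real data):

```python
import numpy as np

rng = np.random.default_rng(2)

def decide(scores_per_digit, threshold=0.2):
    """scores_per_digit: shape (10, 100) — 100 sampled-network scores
    for each of the 10 digits. Accept every digit whose median score
    exceeds the threshold; refuse to answer if none qualifies."""
    medians = np.median(scores_per_digit, axis=1)
    accepted = np.flatnonzero(medians > threshold)
    return accepted.tolist() if accepted.size else "refuse"

# Confident case: digit 3 scores high across the sampled networks.
confident = np.full((10, 100), 0.02)
confident[3] = 0.8

# Out-of-distribution case: all scores low and diffuse.
ood = rng.uniform(0.0, 0.15, size=(10, 100))
```

Using the median across sampled networks makes the rule robust to a few outlier draws, and the "refuse" branch is exactly the "I don't know" answer a plain NN cannot give.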
29. Results
On 10,000 images:
• The network refused to decide on 1,250 images
• On the remaining 8,750 images the accuracy was 96%
• On random data the network refused to answer on 95% of the images
33. Random Process
A function X on the pair (ω, t) where:
• ω – an outcome of a draw
• t – a time index
• If ω is fixed, X is a continuous function of t (a sample path)
• If t is fixed, X is a random variable
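A quick simulation makes the two views concrete (a random-walk process as an illustrative stand-in): each row is one draw ω followed through time, and each column is one time t viewed across draws.

```python
import numpy as np

rng = np.random.default_rng(3)

# X(ω, t): simulate 5 draws (ω) of a random walk over 100 time steps (t).
n_draws, n_steps, dt = 5, 100, 0.01
increments = rng.normal(0.0, np.sqrt(dt), size=(n_draws, n_steps))
X = increments.cumsum(axis=1)

path = X[0]         # ω fixed: a function of t (one sample path)
slice_t = X[:, 50]  # t fixed: a random variable across the draws
```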
34. Robbins & Monro (Stoch. Approx. 1951)
• An unknown function F and a number θ satisfy:
F(x*) = θ
• Y – a measurable r.v. s.t. E[Y(x)] = F(x)
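A sketch of the Robbins–Monro iteration for this setup, with an assumed linear F and Gaussian observation noise (the step sizes a_n = 1/n satisfy the usual conditions Σ a_n = ∞, Σ a_n² < ∞):

```python
import numpy as np

rng = np.random.default_rng(4)

# Target: find x* with F(x*) = theta, observing only noisy Y(x).
theta = 4.0
F = lambda x: 2.0 * x                       # unknown to the algorithm
Y = lambda x: F(x) + rng.normal(0.0, 0.5)   # E[Y(x)] = F(x)

x = 0.0
for n in range(1, 5001):
    a_n = 1.0 / n                  # decaying step size
    x = x - a_n * (Y(x) - theta)   # Robbins–Monro update
# x converges (a.s.) to the root x* = 2 of F(x) = theta
```

The decaying steps average out the observation noise while still moving far enough to reach the root; this stochastic-approximation template is exactly what SGD, and later SGLD, build on.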
35. How do we Obtain Posterior ?
1. Variational Inference
2. MCMC –Sampling (Metropolis –Hastings, Gibbs)
3. HMC
4. SGLD
37. MH- Properties
• Unbiased, huge variance, and very slow (iterates over the entire data)
• Great history
HMC (Duane 1987, Neal 1995)
• Faster than MH
• It reaches low-density domains
• No minibatches – need to calculate many gradients and perform many accept-reject steps
39. Objectives
Numeric as DL, stochastic as Bayes
• We want a tool that:
• Allows mini-batches
• Is good at finding distributions for sampling weights
• Reduces overfitting
40. Physics
The Langevin equation describes the motion of a pollen grain in water:
ξ_t ~ N(0, t) – the Brownian force, modeling collisions with water molecules
This equation is an SDE: its solution is a random process
41. Overdamped Equation & ML (Welling & Teh 2011)
F(x) − γv_t + ξ_t = 0,  ξ_t ~ N(0, t)  (Brownian dynamics, molecular collisions)
F(x) = ∇E(x),  v_t = dX/dt
Discretization:
x_{t+1} = x_t + dt·(∇E(x_t) + ξ_t)
x_{t+1} = x_t + ϵ_t·∇E(x_t) + ε_t,  where ε_t is the stochastic term
42. SGD – Let's Batch!! (Welling & Teh 2011)
Denote ∇E_mb + u_t = ∇E,  u_t ~ N(0, V), V bounded
x_{t+1} = x_t + ϵ_t·(∇E_mb + u_t) + ε_t
• If we ignore the stochastic term ε_t:
P(θ|X) ∝ e^{−E(x)}
Robbins & Monro → MAP solution
43. The Langevin Term ε_t
• We wish to avoid collapsing to the MAP solution, since we want to explore (we are Bayesians)
• We can tweak the variances for this purpose!
ε_t ~ N(0, σ)
What is σ?
We need it to create a bigger variance than the SGD noise term.
Since the SGD noise term's variance goes as the learning rate squared, we can take σ equal to the learning rate itself: for small step sizes the injected Langevin noise then dominates the minibatch noise.
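Putting the pieces together, here is a minimal SGLD sketch on a toy Gaussian-mean problem (the model, prior, and hyperparameters are illustrative assumptions, not the talk's experiment): minibatch gradient steps on the log posterior plus injected N(0, ϵ) Langevin noise.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy problem: sample the posterior over the mean mu of N(mu, 1) data,
# with a N(0, 10) prior, using minibatch gradients.
N = 1000
data = rng.normal(1.5, 1.0, size=N)

def grad_log_post(mu, batch):
    """Minibatch estimate of the gradient of the log posterior."""
    grad_prior = -mu / 10.0                           # from the N(0, 10) prior
    grad_lik = (N / len(batch)) * np.sum(batch - mu)  # rescaled minibatch term
    return grad_prior + grad_lik

mu, eps, samples = 0.0, 1e-3, []
for t in range(5000):
    batch = rng.choice(data, size=32)
    # SGLD step: gradient ascent plus injected N(0, eps) noise, whose
    # variance (eps) dominates the O(eps^2) minibatch-noise term.
    mu = mu + 0.5 * eps * grad_log_post(mu, batch) + rng.normal(0.0, np.sqrt(eps))
    if t >= 1000:                # discard burn-in, keep posterior samples
        samples.append(mu)

posterior_mean = np.mean(samples)
```

Unlike plain SGD, the chain does not collapse to a point: the retained samples spread around the posterior mean, and their variance is the uncertainty estimate a BNN needs. In practice Welling & Teh also decay ϵ_t over time to trade off discretization error against mixing.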