3. Overfitting and Uncertainty
Overfitting
◦ Plain feedforward neural networks are prone to overfitting.
Uncertainty
◦ NNs are often incapable of correctly assessing the uncertainty in the training data.
→ overly confident decisions
(a) Softmax input as a function of data x: f(x). (b) Softmax output as a function of data x: σ(f(x)).
Figure 1: A sketch of softmax input and output for an idealised binary classification problem. Training data is given between the dashed grey lines. The function point estimate is shown with a solid line; function uncertainty is shown with a shaded area. Marked with a dashed red line is a point x* far from the training data. Ignoring function uncertainty, point x* is classified as class 1 with probability 1. [Gal et al., 2015]
◦ Without uncertainty: class 1 with probability 1
◦ With uncertainty: better reflects classification uncertainty
… have made use of NNs for Q-value function approximation. These are functions that estimate the quality of different actions an agent can take. Epsilon-greedy search is often used, where the agent …
5. Contribution
They proposed Bayes by Backprop.
◦ A simple approximate learning algorithm similar to backpropagation.
◦ All weights are represented by probability distributions over possible values.
It achieves good results in several domains.
◦ Classification
◦ Regression
◦ Bandit problem
Weight Uncertainty in Neural Networks
[Figure: two network diagrams with input X, hidden units H1–H3, and output Y; left with fixed scalar weights (0.5, 0.1, 0.7, 1.3, …), right with a distribution on each weight]
Figure 1. Left: each weight has a fixed value, as provided by classical backpropagation. Right: each weight is assigned a distribution, as provided by Bayes by Backprop.
This is related to recent methods in deep, generative modelling (Kingma and Welling, 2014; Rezende et al., 2014; Gregor et al., 2014), where variational inference has been applied to stochastic hidden units of an autoencoder. Whilst the number of stochastic hidden units might be in the order of thousands, the number of weights in a neural network is easily two orders of magnitude larger, making the optimisation problem much larger scale. Uncertainty in the hidden units allows the expression of uncertainty about a particular observation.

For classification, the parameters of the categorical distribution are passed through the exponential function then re-normalised; for regression, 𝒴 is ℝ and P(y|x, w) is a Gaussian distribution – this corresponds to a squared loss. Inputs x are mapped onto the parameters of a distribution on 𝒴 by several successive layers of linear transformation (given by w) interleaved with element-wise non-linear transforms.

The weights can be learnt by maximum likelihood estimation (MLE): given a set of training examples D = (xᵢ, yᵢ)ᵢ, the MLE weights w^MLE are given by:

w^MLE = arg max_w log P(D|w) = arg max_w Σᵢ log P(yᵢ|xᵢ, w)

This is typically achieved by gradient descent (e.g. backpropagation), where we assume that log P(D|w) is differentiable in w. Regularisation can be introduced by placing a prior upon the weights w and finding the maximum a posteriori (MAP) weights.
6. Related Works
Variational approximation
◦ [Graves 2011]
→ the gradients of this can be made unbiased and this method can be used with
non-Gaussian priors.
Uncertainty in the hidden units
◦ [Kingma and Welling, 2014] [Rezende et al., 2014] [Gregor et al., 2014]
◦ Variational autoencoder
→ the number of weights in a neural network is easily two orders of magnitude
larger
Contextual bandit problems using Thompson sampling
◦ [Thompson, 1933] [Chapelle and Li, 2011] [Agrawal and Goyal, 2012] [May et al.,
2012]
→ Weights with greater uncertainty introduce more variability into the decisions
made by the network, leading naturally to exploration.
7. Point Estimates of NN
Neural network: p(y|x, w)
◦ Input: x ∈ ℝᵖ
◦ Output: y ∈ 𝒴
◦ The set of parameters: w
◦ Cross-entropy loss (categorical distribution), squared loss (Gaussian distribution)
Learning D = (xᵢ, yᵢ)ᵢ
◦ MLE:
w^MLE = arg max_w log p(D|w) = arg max_w Σᵢ log p(yᵢ|xᵢ, w)
◦ MAP:
w^MAP = arg max_w log p(w|D) = arg max_w [log p(D|w) + log p(w)]
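The MLE/MAP distinction above can be made concrete in a toy model. The sketch below is my own illustration, not from the paper: a single scalar weight w with unit-variance Gaussian likelihood yᵢ ∼ N(w, 1), where the MLE is the sample mean and a Gaussian prior N(0, τ²) turns the MAP estimate into a shrunk, L2-regularised mean.

```python
import numpy as np

# Toy model: y_i ~ N(w, 1) with a single scalar parameter w.
# MLE maximises sum_i log p(y_i | w)  -> the sample mean.
# MAP adds log p(w) with prior w ~ N(0, tau2) -> shrinkage toward 0,
# equivalent to L2 regularisation (weight decay) on w.

def w_mle(y):
    return y.mean()

def w_map(y, tau2):
    n = len(y)
    # Closed-form maximiser of -0.5*sum((y_i - w)^2) - w^2/(2*tau2)
    return y.sum() / (n + 1.0 / tau2)

y = np.array([1.0, 2.0, 3.0])
print(w_mle(y))            # 2.0
print(w_map(y, tau2=1.0))  # 6/(3+1) = 1.5
```

With a Gaussian prior, log p(w) contributes the familiar −w²/(2τ²) penalty, which is exactly weight decay.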
9. Variational Learning
The variational posterior distribution on the weights: q(w|θ)
◦ parameters: θ
The true (Bayesian) posterior distribution given the training data: p(w|D)
Find the parameters θ that minimize the KL divergence:

θ* = arg min_θ ℱ(D, θ)

where ℱ(D, θ) = KL[q(w|θ) || p(w|D)]
= KL[q(w|θ) || p(w)] − 𝔼_{q(w|θ)}[log p(D|w)]   (up to the additive constant log p(D))

◦ KL[q(w|θ) || p(w)]: prior-dependent part (complexity cost)
◦ −𝔼_{q(w|θ)}[log p(D|w)]: data-dependent part (likelihood cost)
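For intuition about the complexity cost: when both q(w|θ) and p(w) are diagonal Gaussians, the KL term has a closed form (the paper's scale-mixture prior does not, and is handled by Monte Carlo instead). A minimal sketch of that closed form, with parameters given as standard deviations:

```python
import numpy as np

# Complexity cost: closed-form KL[q(w|theta) || p(w)] summed over weights,
# for q = N(mu_q, sig_q^2) and p = N(mu_p, sig_p^2), both diagonal.
def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2)
                  - 0.5)

mu, sig = np.zeros(3), np.ones(3)
print(kl_diag_gauss(mu, sig, mu, sig))        # 0.0 when q == p
print(kl_diag_gauss(mu + 1.0, sig, mu, sig))  # 0.5 per weight -> 1.5
```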
12. Gaussian variational posterior
1. Sample ε ∼ N(0, I)
2. Let w = μ + log(1 + exp(ρ)) ∘ ε
3. Let θ = (μ, ρ)
4. Let f(w, θ) = log q(w|θ) − log[p(w) p(D|w)]
5. Calculate the gradient with respect to the mean:
Δ_μ = ∂f(w, θ)/∂w + ∂f(w, θ)/∂μ
6. Calculate the gradient with respect to the standard deviation parameter ρ:
Δ_ρ = (∂f(w, θ)/∂w) · ε / (1 + exp(−ρ)) + ∂f(w, θ)/∂ρ
7. Update the variational parameters:
μ ← μ − αΔ_μ
ρ ← ρ − αΔ_ρ
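The seven steps above can be run end-to-end on a toy problem. The sketch below uses an assumed setup of my own: a single scalar weight w, prior p(w) = N(0, 1), and unit-variance Gaussian likelihood yᵢ ∼ N(w, 1), so the partial derivatives of f can be written out analytically.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=50)            # toy data with mean near 2

softplus = lambda r: np.log1p(np.exp(r))     # sigma = log(1 + exp(rho))
sigmoid = lambda r: 1.0 / (1.0 + np.exp(-r))

mu, rho, alpha = 0.0, -1.0, 0.01
for step in range(2000):
    eps = rng.normal()                       # step 1: sample epsilon
    sig = softplus(rho)
    w = mu + sig * eps                       # step 2: reparameterised weight
    # f(w, theta) = log q(w|theta) - log p(w) - log p(D|w); analytic partials:
    df_dw = (-(w - mu) / sig**2              #   d log q / dw
             + w                             #   -d log p(w)/dw for prior N(0, 1)
             + np.sum(w - y))                #   -d log p(D|w)/dw, unit-var likelihood
    df_dmu = (w - mu) / sig**2               #   d log q / dmu
    df_drho = (-1/sig + (w - mu)**2 / sig**3) * sigmoid(rho)  # via dsig/drho
    delta_mu = df_dw + df_dmu                # step 5
    delta_rho = df_dw * eps * sigmoid(rho) + df_drho          # step 6
    mu -= alpha * delta_mu                   # step 7
    rho -= alpha * delta_rho

print(round(mu, 2))  # settles near the data mean (posterior mean of w)
```

Note that the two (w − μ)/σ² score terms cancel inside Δ_μ, which is why the mean update stays stable even when σ is small.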
13. Scale mixture prior
They proposed using a scale mixture of two Gaussian densities as the prior
◦ Combined with a diagonal Gaussian posterior
◦ Two degrees of freedom per weight only increases the number of parameters to optimise by a factor of two.

p(w) = Πⱼ [π N(wⱼ|0, σ₁²) + (1 − π) N(wⱼ|0, σ₂²)]
where σ₁ > σ₂ and σ₂ ≪ 1

Empirically they found optimising the parameters of the prior p(w) not to be useful: it yields worse results.
◦ It can be easier to change the prior parameters than the posterior parameters.
◦ The prior parameters try to capture the empirical distribution of the weights at the beginning of learning.
→ pick a fixed-form prior and don't adjust its hyperparameters
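A sketch of the log density of this prior, stabilised with logaddexp; the function and parameter names are mine, and σ₁, σ₂ are given as standard deviations:

```python
import numpy as np

# log N(w | 0, sig^2), elementwise
def log_gauss(w, sig):
    return -0.5 * np.log(2 * np.pi) - np.log(sig) - 0.5 * (w / sig) ** 2

# log p(w) = sum_j log[ pi*N(w_j|0, sig1^2) + (1-pi)*N(w_j|0, sig2^2) ]
def log_mixture_prior(w, pi=0.5, sig1=1.0, sig2=1e-3):
    comp1 = np.log(pi) + log_gauss(w, sig1)
    comp2 = np.log(1 - pi) + log_gauss(w, sig2)
    return np.sum(np.logaddexp(comp1, comp2))  # stable log-sum of the two components

w = np.array([0.3])
print(log_mixture_prior(w))  # heavy mass near 0 from the narrow component
```

With sig1 == sig2 the mixture collapses to a single Gaussian, which makes a convenient sanity check.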
14. Minibatches and KL re-weighting
The minibatch cost for minibatch i = 1, 2, …, M:

ℱᵢ^π(Dᵢ, θ) = πᵢ KL[q(w|θ) || p(w)] − 𝔼_{q(w|θ)}[log p(Dᵢ|w)]

where πᵢ ∈ [0, 1] and Σᵢ₌₁ᴹ πᵢ = 1
◦ Then 𝔼_M[Σᵢ₌₁ᴹ ℱᵢ^π(Dᵢ, θ)] = ℱ(D, θ)

πᵢ = 2^(M−i) / (2^M − 1) works well:
◦ the first few minibatches are heavily influenced by the complexity cost
◦ the later minibatches are largely influenced by the data
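The weights πᵢ = 2^(M−i)/(2^M − 1) form a geometric schedule that sums to one, front-loading the complexity cost onto the earliest minibatches; a quick sketch:

```python
import numpy as np

M = 10                                   # minibatches per epoch
i = np.arange(1, M + 1)
pi = 2.0 ** (M - i) / (2.0 ** M - 1)     # pi_i = 2^(M-i) / (2^M - 1)

print(pi[:3])     # first minibatches carry most of the KL weight
print(pi.sum())   # sums to 1 (up to float rounding)
```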
22. Classification on MNIST
Examining the level of redundancy in a network trained by Bayes by Backprop.
◦ Replace the weights with the lowest signal-to-noise ratio (|μᵢ|/σᵢ) with a constant zero.
Figure 4. Density and CDF of the signal-to-noise ratio over all weights in the network. The red line denotes the 75% cut-off.

In Table 2, we examine the effect of replacing the variational posterior on some of the weights with a constant zero, so as to determine the level of redundancy in the network found by Bayes by Backprop. We took a Bayes by Backprop trained network with two layers of 1200 units² and ordered the weights by their signal-to-noise ratio (|μᵢ|/σᵢ). We removed the weights with the lowest signal-to-noise ratio. As can be seen in Table 2, even when 95% of the weights are removed the network still performs well, with a significant drop in performance once 98% of the weights have been removed.

In Figure 4 we examined the distribution of the signal-to-noise ratio relative to the cut-off used in Table 2. The lower plot shows the cumulative distribution of the signal-to-noise ratio, whilst the top plot shows the density. From the density plot we see there are two modalities of signal-to-noise ratios, and from the CDF we see that the 75% cut-off separates these two peaks. These two peaks coincide with a drop in performance in Table 2 from 1.24% to 1.29%, suggesting that the signal-to-noise heuristic is in fact related to the test performance.

² We used a network from the end of training rather than picking a network with a low validation cost found during training, hence the disparity with results in Table 1. The lowest test error …

A network 20 times smaller allowed only a small proportion of weights (at most 11%) to be removed whilst retaining good test performance. The scale mixture prior used by Bayes by Backprop encourages a broad spread of the weights; many of these weights can be removed without impacting performance significantly.
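The pruning procedure can be sketched as follows. The μ and σ arrays below are random stand-ins for a trained network's posterior parameters; the procedure itself (rank by |μ|/σ, zero the lowest-SNR weights) matches the text.

```python
import numpy as np

# Stand-in posterior parameters for a trained Bayes by Backprop network.
rng = np.random.default_rng(1)
mu = rng.normal(0.0, 0.1, size=10_000)
sigma = np.log1p(np.exp(rng.normal(-3.0, 0.5, size=10_000)))  # softplus(rho)

# Rank weights by signal-to-noise ratio and zero the lowest 75%
# (the red cut-off line in Figure 4).
snr = np.abs(mu) / sigma
cutoff = np.quantile(snr, 0.75)
mu_pruned = np.where(snr >= cutoff, mu, 0.0)

print((mu_pruned == 0).mean())  # about 0.75 of the weights zeroed
```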
5.2. Regression curves
We generated training data from the curve:

y = x + 0.3 sin(2π(x + ε)) + 0.3 sin(4π(x + ε)) + ε

where ε ∼ N(0, 0.02). Figure 5 shows two examples of fitting a neural network to these data, minimising a conditional Gaussian loss. Note that in the regions of the input space where there are no data, the ordinary neural network reduces the variance to zero and chooses to fit a particular function, even though there are many possible extrapolations of the training data. For Bayes by Backprop, model averaging affects predictions: where the data are absent, the confidence intervals diverge, reflecting the many possible extrapolations. In this case Bayes by Backprop prefers to be uncertain where there are no nearby data, as opposed to a standard neural network which can be overly confident.

5.3. Bandits on Mushroom Task
We take the UCI Mushrooms data set …
23. Regression curves
Training data:

y = x + 0.3 sin(2π(x + ε)) + 0.3 sin(4π(x + ε)) + ε

where ε ∼ N(0, 0.02)
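A sketch of generating this training data. The input range and the reading of N(0, 0.02) as standard deviation 0.02 are my assumptions; the source only gives the curve.

```python
import numpy as np

# y = x + 0.3 sin(2*pi*(x + eps)) + 0.3 sin(4*pi*(x + eps)) + eps
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 0.5, size=100)          # assumed training input range
eps = rng.normal(0.0, 0.02, size=100)        # assumed: 0.02 is the std dev
y = (x
     + 0.3 * np.sin(2 * np.pi * (x + eps))
     + 0.3 * np.sin(4 * np.pi * (x + eps))
     + eps)
```

Fitting and then querying a model outside this x range is what exposes the difference in extrapolation behaviour shown in Figure 5.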
Figure 5. Regression of noisy data with interquartile ranges. Black crosses are training samples. Red lines are median predictions. Blue/purple region is the interquartile range. Left: Bayes by Backprop neural network. Right: standard neural network.
… for action selection. We kept the last reward, context and action tuples in a buffer, and trained the networks using randomly drawn minibatches of size 64 for 64 training steps (64 × 64 = 4096) per interaction with the mushroom bandit. A common heuristic for trading off exploration and exploitation is to follow an ε-greedy policy: with probability ε propose a uniformly random action, otherwise pick the best action according to the neural network. Figure 6 compares a Bayes by Backprop agent with ε-greedy agents, for values of ε of 0% (pure greedy), 1% and 5%. An ε of 5% appears to over-explore, whereas a purely greedy agent does poorly at the beginning, greedily electing to eat nothing, but eventually recovers once it has seen enough data. …
24. Bandits on Mushroom Task
UCI Mushrooms dataset
◦ Each mushroom has a set of features and is labelled as edible or poisonous.
◦ An edible mushroom : a reward of 5
◦ A poisonous mushroom : a reward of −35 or 5 (with probability 0.5)
◦ Not to eat a mushroom : a reward of 0
Model:
◦ Input (context and action) → 100 ReLU units → 100 ReLU units → expected reward
Comparison approach :
◦ 𝜀-greedy policy : with probability 𝜀 propose a uniformly random action,
otherwise pick the best action according to the neural network.
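The ε-greedy baseline is straightforward to sketch. The reward array stands in for the network's predicted expected reward per action, and the function name is mine:

```python
import numpy as np

# epsilon-greedy action selection over a network's predicted rewards:
# with probability epsilon explore uniformly, otherwise exploit the best action.
def epsilon_greedy(predicted_rewards, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(len(predicted_rewards)))  # explore
    return int(np.argmax(predicted_rewards))              # exploit

rng = np.random.default_rng(0)
# e.g. predicted rewards for [eat nothing, edible, poisonous]
print(epsilon_greedy(np.array([0.0, 5.0, -35.0]), epsilon=0.0, rng=rng))  # 1
```

The Bayes by Backprop agent needs no such ε: sampling weights from q(w|θ) before each decision (Thompson sampling) injects exploration in proportion to the weight uncertainty.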
25. Bandits on Mushroom Task
Comparison of cumulative regret of various agents.
◦ Regret : the difference between the reward achievable by an oracle and the
reward received by an agent.
◦ Oracle : an edible mushroom→5, a poisonous mushroom→0
◦ Lower is better.
[Figure 6: cumulative regret (log scale) vs. step for the agents 5% Greedy, 1% Greedy, Greedy, and Bayes by Backprop]
Figure 6. Comparison of cumulative regret of various agents on the mushroom bandit task. The Bayes by Backprop agent explores from the beginning and quickly converges to an almost perfect rate of reward.

6. Discussion
We introduced a new algorithm for learning neural networks with uncertainty on the weights, called Bayes by Backprop. On the bandit task it does not over-explore and quickly converges.