(研究会輪読) Weight Uncertainty in Neural Networks

Weight Uncertainty in Neural Networks
04/11/2015
MASAHIRO SUZUKI

Paper Information
Title : Weight Uncertainty in Neural Networks (ICML 2015)
Authors : Charles Blundell, Jullen Cornebise, KorayKavukcuoglu, Daan
Wierstra
◦ Google DeepMind
They proposed Bayes by Backprop
Motivation
◦ I’d like to know how to treat the model “uncertainty’’ in deep learning
approach.
◦ I like Bayesianapproach.

Overfitting and Uncertainty
Overfitting
◦ Plain feedforward neural networks are prone to overfitting.
Uncertainty
◦ NN often incapable of correctly assessing the uncertainty in the training data.
→Overly confident decisions
(a) Softmax input as a function of data x: f(x) (b) Softmax output as a function of data x: (f(x))
Figure 1: A sketch of softmax input and output for an idealised binary classiﬁcation problem.
Training data is given between the dashed grey lines. Function point estimate is shown with a solid
line. Function uncertainty is shown with a shaded area. Marked with a dashed red line is a point
x⇤
far from the training data. Ignoring function uncertainty, point x⇤
is classiﬁed as class 1 with
probability 1.
have made use of NNs for Q-value function approximation. These are functions that estimate the
quality of different actions an agent can make. Epsilon greedy search is often used where the agent
Training data
Function
point
estimate
Without uncertainty:
class 1 with probability 1
With uncertainty:
better reflects classification uncertainty
[Gal et. al 2015]
Function
uncertainty

How to prevent overfitting
Various regularization schemes have been proposed.
◦ Early stopping
◦ Weight decay
◦ Dropout
This paper addresses this problem by using variational Bayesian learning to
introduce uncertainty in the weights of the network.
↓
Bayes by Backprop

Contribution
They proposed Bayes by Backprop
◦ A simple approximate learning algorithm similar to backpropagation.
◦ All weight are represented by probability distributions over possible values.
It achieves good results in several domains.
◦ Classification
◦ Regression
◦ Bandit problem
H1 H2 H3 1
X 1
Y
0.5 0.1 0.7 1.3
1.40.3
1.2
0.10.1 0.2
H1 H2 H3 1
X 1
Y
Figure 1. Left: each weight has a ﬁxed value, as provided by clas-
sical backpropagation. Right: each weight is assigned a distribu-
tion, as provided by Bayes by Backprop.
is related to recent methods in deep, generative modelling
(Kingma and Welling, 2014; Rezende et al., 2014; Gregor
et al., 2014), where variational inference has been applied
to stochastic hidden units of an autoencoder. Whilst the
number of stochastic hidden units might be in the order of
thousands, the number of weights in a neural network is
easily two orders of magnitude larger, making the optimisa-
tion problem much larger scale. Uncertainty in the hidden
units allows the expression of uncertainty about a particular
the parameters of the categorical distribution
through the exponential function then re-norm
regression Y is R and P(y|x, w) is a Gaussian
– this corresponds to a squared loss.
Inputs x are mapped onto the parameters of
tion on Y by several successive layers of linear
tion (given by w) interleaved with element-wise
transforms.
The weights can be learnt by maximum likelih
tion (MLE): given a set of training examples D
the MLE weights wMLE
are given by:
wMLE
= arg max
w
log P(D|w)
= arg max
w
X
i
log P(yi|xi, w
This is typically achieved by gradient descent
propagation), where we assume that log P(D|w
entiable in w.
Regularisation can be introduced by placing a
the weights w and ﬁnding the maximum a
MAP
Classical backpropagation Bayes by Backprop

Related Works
Variational approximation
◦ [Graves 2011]
→ the gradients of this can be made unbiased and this method can be used with
non-Gaussian priors.
Uncertainty in the hidden units
◦ [Kingma and Welling, 2014] [Rezende et al., 2014] [Gregor et al., 2014]
◦ Variational autoencoder
→ the number of weights in a neural network is easily two orders of magnitude
larger
Contextual bandit problems using Thompson sampling
◦ [Thompson, 1933] [Chapelle and Li, 2011] [Agrawal and Goyal, 2012] [May et al.,
2012]
→ Weights with greater uncertainty introduce more variability into the decisions
made by the network, leading naturally to exploration.

Point Estimates of NN
Neural network : 𝑝(𝑦|𝑥, 𝑤)
◦ Input : 𝑥 ∈ ℝ+
◦ Output : 𝑦 ∈ 𝒴
◦ The set of parameters : 𝑤
◦ Cross-entropy (categorical distiribution), squared loss (Gaussian distribution)
Learning 𝐷 = (𝑥/, 𝑦/)
◦ MLE:
𝑤012
= arg max
8
log 𝑝 𝐷 𝑤 = arg max
8
; log 𝑝(𝑦/ |𝑥/, 𝑤)
/
◦ MAP:
𝑤0<=
= arg max
8
log 𝑝 𝑤 𝐷 = arg max
8
log 𝑝(𝐷|𝑤) + log 𝑝(𝑤)

Being Bayesian
The predictive distribution : 𝑝(𝑦?|𝑥?)
◦ an unknown label : 𝑦?
◦ a test data item : 𝑥?
𝑝 𝑦? 𝑥? = @ 𝑝 𝑦? 𝑥?, 𝑤 𝑝 𝑤 𝐷 𝑑𝑤 = 𝔼+(8|C)[𝑝 𝑦? 𝑥?, 𝑤 ]
Taking expectation = an ensemble of an uncountably infinite number of NN
↓
Intractable

Variational Learning
The Basyan posterior distribution on the weight : 𝑞(𝑤|𝜃)
◦ parameters : 𝜃
The posterior distribution given the training data : 𝑝 𝑤 𝐷
Find the parameters 𝜃 that minimizes the KL divergence :
𝜃∗ = arg min
L
ℱ 𝐷, 𝜃
where ℱ 𝐷, 𝜃 = 𝐾𝐿 𝑞 𝑤 𝜃 𝑝 𝑤 𝐷
= 𝐾𝐿 𝑞 𝑤 𝜃 𝑝(𝑤) − 𝔼 Q 𝑤 𝜃 [log 𝑝(𝐷|𝑤)]
data-dependent part
(likelihood cost)
prior-dependent part
(complexity cost)

Unbiased Monte Carlo gradients
Proposition 1. Let 𝜀 be a random variable having a probability density
given by 𝑞(𝜀) and let 𝑤 = 𝑡(𝜃, 𝜀) where 𝑡(𝜃, 𝜀) is a deterministic
function. Suppose further that the marginal probability density of 𝑤,
𝑞(𝑤|𝜃), is such that 𝑞(𝜀)𝑑𝜀 = 𝑞(𝑤|𝜃)𝑑𝑤. Then for a function 𝑓 with
derivatives in 𝑤:
U
UL
𝔼Q(8|L) 𝑓 𝑤, 𝜃 = 𝔼Q(V)
UW(8,L)
UL
U8
UL
+
UW(8,L)
UL
A generalization of the Gaussian reparameterizationtrick

Bayes by backprop
We approximate the expected lower bound as
ℱ 𝐷, 𝜃 ≈ ; log 𝑞 𝑤(/) 𝜃 − log 𝑝(𝑤(/)) − log 𝑝(𝐷|𝑤 / )
/
where 𝑤(/)~𝑞 𝑤(/) 𝜃 (Monte Carlo)
◦ Every term of this approximate cost depends upon the particular weights
drawn from the variational posterior.

Gaussian variational posterior
1. Sample 𝜀 ∼ 𝑁(0, 𝐼)
2. Let 𝑤 = 𝜇 + log (1 + exp ( 𝜌)) ∘ 𝜀
3. Let 𝜃 = (𝜇, 𝜌)
4. Let 𝑓(𝑤, 𝜃) = log 𝑞 𝑤 𝜃 − log 𝑝 𝑤 𝑝(𝐷|𝑤)
5. Calculate the gradient with respect to the mean
∆e =
𝜕𝑓(𝑤, 𝜃)
𝜕𝑤
+
𝜕𝑓(𝑤, 𝜃)
𝜕𝜇
6. Calculate the gradient with respect to the standard deviation parameter 𝜌
∆g =
𝜕𝑓(𝑤, 𝜃)
𝜕𝑤
𝜀
1 + exp(−𝜌)
+
𝜕𝑓(𝑤, 𝜃)
𝜕𝜌
7. Update the variational parameters:
𝜇 ← 𝜇 − 𝛼∆e
𝜌 ← 𝜌 − 𝛼∆g

Scalar mixture prior
They proposed using scale mixture of two Gaussian densities as the prior
◦ Combined with a diagonal Gaussian posterior
◦ Two degrees of freedom per weight only increases the number of parameters to
optimize by a factor of two.
𝑝(𝑤) = ∏ 𝜋𝑁(𝑤l |0, 𝜎n) + (1 − 𝜋)𝑁(𝑤l|0, 𝜎o) l
where 𝜎n > 𝜎o, 𝜎o ≪ 1
Empirically they found optimizing the parameters of a prior 𝑝(𝑤) to not be
useful, and yield worse results.
◦ It can be easier to change the prior parameters than the posterior parameters.
◦ The prior parameters try to capture the empirical distribution of the weights at the
beginning of learning.
→ pick a fixed-form prior and don’t adjust its hyperparamater

Minibatches and KL re-weighting
The minibatch cost for minibatch i = 1,2,...,M:
𝐹/
u
(𝐷/, 𝜃) = 𝜋/ 𝐾𝐿[𝑞(𝑤|𝜃) || 𝑃(𝑤)] − 𝔼Q 𝑤 𝜃 [log 𝑃 𝐷/ 𝑤 ]
where π ∈ [0,1]0
and ∑ 𝜋/ = 10
/xn
◦ Then 𝔼0 ∑ 𝐹/
u
𝐷/ , 𝜃0
/xn = 𝐹(𝐷, 𝜃)
𝜋 =
oyz{
oy|n
works well
◦ the first few minibatches are heavily influenced by the complexity cost
◦ the later minibatches are largely influenced by the data

Contextual Bandits
Contextual Bandits
◦ Simple reinforcement learning problems
Example : Clinical Decision Making
Another Example: Clinical Decision Making
Repeatedly:
1. A patient comes to a doctor with
symptoms, medical history, test results
2. The doctor chooses a treatment
3. The patient responds to it
The doctor wants a policy for choosing
targeted treatments for individual patients.
Context 𝑥
Actions 𝑎 ∈ [0, . . , 𝐾]
Rewards
𝑟~𝑝(𝑟|𝑥, 𝑎, 𝑤)
Online training the agent’s model
𝑝 𝑟 𝑥, 𝑎, 𝑤

Thompson Sampling
Thompson sampling
◦ Popular means of picking an action that trades-off between exploitation and
exploration.
◦ Necessitates a Bayesian treatment of the model parameters
1. Sample a new set of parameters for the model.
2. Pick the action with the highest expected reward according to the
sampled parameters.
3. Update the model. Go to 1.

Thompson Sampling for NN
Thompson sampling is easily adapted to neural networks using the
variational pasterior.
1. Sample weights from the variational posterior: 𝑤~𝑞(𝑤|𝜃).
2. Receive the context 𝑥.
3. Pick the action 𝑎 that minimises 𝔼+(€|•,‚,8)[𝑟]
4. Receive reward 𝑟.
5. Update variationalparameters 𝜃. Go to 1.

Experiment
1. Classification on MNIST
2. Regression curves
3. Bandits on Mushroom Task

Classification on MNIST
Model :
◦ 28×28→(ReLU)→(400,800,1200) →(ReLU)→(400,800,1200) → (Softmax)→ 10
Table 1. Classiﬁcation Error Rates on MNIST. ? indicates result
used an ensemble of 5 networks.
Method
#Units/Layer
#Weights
Test
Error
SGD, no regularisation (Simard et al., 2003) 800 1.3m 1.6%
SGD, dropout (Hinton et al., 2012) ⇡ 1.3%
SGD, dropconnect (Wan et al., 2013) 800 1.3m 1.2%?
SGD 400 500k 1.83%
800 1.3m 1.84%
1200 2.4m 1.88%
SGD, dropout 400 500k 1.51%
800 1.3m 1.33%
1200 2.4m 1.36%
Bayes by Backprop, Gaussian 400 500k 1.82%
800 1.3m 1.99%
1200 2.4m 2.04%
Bayes by Backprop, Scale mixture 400 500k 1.36%
800 1.3m 1.34%
1200 2.4m 1.32%
known that variational methods under-estimate uncertainty
(Minka, 2001; 2005; Bishop, 2006) which could lead to
0.8
1.2
1.6
2.0
0 100 200 300 400 500 6
Epochs
Testerror(%)
Figure 2. Test error on MNIST as tr
0
5
10
15
−0.2 −0.1 0.0 0.1 0.2
Weight
Density
Figure 3. Histogram of the trained weight
for Dropout, plain SGD, and samples from
used an ensemble of
5 networks

Test error on MNIST as training progress
◦ Bayes by Backprop and dropout converge at similar rates.

Density estimates of the weights
◦ Bayes by Backprop : sampled from the variational posterior
◦ Dropout : used at test time
◦ Bayes by Backprop uses the greatest range of weights

The level of redundancy in a Bayes by Backprop.
◦ Replace the weights with a constant zero (from the lowest signal-to-noise
ratio ).
Figure 4. Density and CDF of the Signal-to-Noise ratio over all
weights in the network. The red line denotes the 75% cut-off.
In Table 2, we examine the effect of replacing the vari-
ational posterior on some of the weights with a constant
zero, so as to determine the level of redundancy in the
network found by Bayes by Backprop. We took a Bayes
by Backprop trained network with two layers of 1200
units2
and ordered the weights by their signal-to-noise ra-
tio (|µi|/ i). We removed the weights with the lowest sig-
nal to noise ratio. As can be seen in Table 2, even when
95% of the weights are removed the network still performs
well, with a significant drop in performance once 98% of
the weights have been removed.
In Figure 4 we examined the distribution of the signal-to-
noise relative to the cut-off in the network uses in Table 2.
The lower plot shows the cumulative distribution of signal-
to-noise ratio, whilst the top plot shows the density. From
the density plot we see there are two modalities of signal-
to-noise ratios, and from the CDF we see that the 75%
cut-off separates these two peaks. These two peaks coin-
cide with a drop in performance in Table 2 from 1.24%
to 1.29%, suggesting that the signal-to-noise heuristic is in
fact related to the test performance.
2
We used a network from the end of training rather than pick-
ing a network with a low validation cost found during training,
hence the disparity with results in Table 1. The lowest test error
a network 20 times smaller and did
proportion of weights (at most 11%)
ing good test performance. The sca
by Bayes by Backprop encourages
weights. Many of these weights can b
without impacting performance signi
5.2. Regression curves
We generated training data from the
y = x + 0.3 sin(2⇡(x + ✏)) + 0.3
where ✏ ⇠ N(0, 0.02). Figure 5 sh
fitting a neural network to these data
tional Gaussian loss. Note that in th
space where there are no data, the or
reduces the variance to zero and ch
lar function, even though there are m
olations of the training data. On the
averaging affects predictions: where
confidence intervals diverge, reflect
possible extrapolations. In this cas
prefers to be uncertain where there a
opposed to a standard neural networ
confident.
5.3. Bandits on Mushroom Task
We take the UCI Mushrooms data set

Regression curves
Training data :
𝑦 = 𝑥 + 0.3sin 2𝜋 𝑥 + 𝜀 + 0.3sin 4𝜋 𝑥 + 𝜀 + 𝜀
where 𝜀~𝑁(0,0.02) Weight Uncertainty in Neural Networks
x
x
x
x
x
x
x
xx
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x
x
x
x
x
xxxx
x
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
xx
x
x
x
x x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x x
x
x
x
x
x
x x
x
x
xxx x
xx
x
x
x
xx
x
x
x
x
x
x
x
x
x x
x
x
x
x
x
x
xxx
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x
xxx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x x
x
x
x
x
x
x
xx
x
x
x
x
x
x
xx
x
xx xx
x
xxx
x
x
x
x
x
x
x
x
x
x
xx
x
x x
x
x
x
x
x
x
x
x
xxx
x
x
x
x
x
x
x xx
x
x
x
x
x x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
xx
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
xx
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xxx
x
x
x
x
x
x
x
x
xx
x
x x
x
xx
x
x
x
x
x
x
x
x
x
xx
xx
x
x
x
xx
x
xx
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
−0.4
0.0
0.4
0.8
1.2
0.0 0.4 0.8 1.2
x
x
x
x
x
x
x
xx
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x
x
x
x
x
xxxx
x
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
xx
x
x
x
x x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x x
x
x
x
x
x
x x
x
x
xxx x
xx
x
x
x
xx
x
x
x
x
x
x
x
x
x x
x
x
x
x
x
x
xxx
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x
xxx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x x
x
x
x
x
x
x
xx
x
x
x
x
x
x
xx
x
xx xx
x
xxx
x
x
x
x
x
x
x
x
x
x
xx
x
x x
x
x
x
x
x
x
x
x
xxx
x
x
x
x
x
x
x xx
x
x
x
x
x x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
xx
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
x
x
xx
x
x
x
x
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xxx
x
x
x
x
x
x
x
x
xx
x
x x
x
xx
x
x
x
x
x
x
x
x
x
xx
xx
x
x
x
xx
x
xx
xx
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
−0.4
0.0
0.4
0.8
1.2
0.0 0.4 0.8 1.2
Figure 5. Regression of noisy data with interquatile ranges. Black
crosses are training samples. Red lines are median predictions.
Blue/purple region is interquartile range. Left: Bayes by Back-
prop neural network, Right: standard neural network.
for action selection. We kept t
and action tuples in a buffer,
ing randomly drawn minibatch
steps (64 ⇥ 64 = 4096) per int
bandit. A common heuristic fo
exploitation is to follow an "
bility " propose a uniformly ra
the best action according to th
Figure 6 compares a Bayes by
"-greedy agents, for values of
and 5%. An " of 5% appears
purely greedy agent does poo
ily electing to eat nothing, but
it has seen enough data. It se
approximation updates allow
Bayes by Backprop Standard NN

Bandits on Mushroom Task
UCI Mushrooms dataset
◦ Each mushroom has a set of features and is labelled as edible or poisonous.
◦ An edible mushroom : a reward of 5
◦ A poisonous mushroom : a reward of −35 or 5 (with probability 0.5)
◦ Not to eat a mushroom : a reward of 0
Model :
◦ Input (context and action)→(ReLU)→100→(ReLU)→100→ the expected reward
Comparison approach :
◦ 𝜀-greedy policy : with probability 𝜀 propose a uniformly random action,
otherwise pick the best action according to the neural network.

Bandits on Mushroom Task
Comparison of cumulative regret of various agents.
◦ Regret : the difference between the reward achievable by an oracle and the
reward received by an agent.
◦ Oracle : an edible mushroom→5, a poisonous mushroom→0
◦ Lower is better.
x
x
x
xx x
x
x
xx
xx
x
xx
x
x
x
x
x x
x
x
x
x
x
x
x
xx xx
xxx
x
x
x
x
x x
xx
xx x
x
x
xx xx
xx xx
x
xx
xx
x
x
xx
x
x
x
x
x
xx
xx
x
x
xx
x
x
x
x
x
x
x
x
xx
x
x
x
x
x
x
x
−0.4
0.0
0.0 0.4 0.8 1.2
x
x
x
xx x
x
x
xx
xx
x
xx
x
x
x
x
x x
x
x
x
x
x
x
x
xx xx
xxx
x
x
x
x
x x
xx
xx x
x
x
xx xx
xx xx
x
xx
xx
x
x
xx
x
x
x
x
x
xx
xx
x
x
xx
x
x
x
x
x
x
x
x
xx
x
x
x
x
x
x
x
−0.4
0.0
0.0 0.4 0.8 1.2
Figure 5. Regression of noisy data with interquatile ranges. Black
crosses are training samples. Red lines are median predictions.
Blue/purple region is interquartile range. Left: Bayes by Back-
prop neural network, Right: standard neural network.
1000
10000
0 10000 20000 30000 40000 50000
Step
CumulativeRegret
5% Greedy
1% Greedy
Greedy
Bayes by Backprop
Figure 6. Comparison of cumulative regret of various agents on
bandit. A com
exploitation is
bility " propos
the best action
Figure 6 comp
"-greedy agen
and 5%. An
purely greedy
ily electing to
it has seen en
approximation
as for the ﬁrst
approximately
eat mushroom
from the begi
and quickly co
most perfect r
6. Discussio
We introduced
works with u
Backprop. It
over-explore
quickly converges

Conclusion
They introduced a new algorithm for learning neural networks with
uncertainty on the weights called Bayes by Backprop.
The algorithm achieves good results in several domains.
◦ Classifying MNIST digits : Performance from Bayes by Backprop is
comparable to that of dropout.
◦ Non-linear regression problem : Bayes by Backprop allows the network to
make more reasonable predictions about unseen data.
◦ Contextual bandits : Bayes by Backprop can automatically learn how to
trade-off exploration and exploitation.

(研究会輪読) Weight Uncertainty in Neural Networks

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to (研究会輪読) Weight Uncertainty in Neural Networks

Similar to (研究会輪読) Weight Uncertainty in Neural Networks (20)

More from Masahiro Suzuki

More from Masahiro Suzuki (6)

Recently uploaded

Recently uploaded (20)

(研究会輪読) Weight Uncertainty in Neural Networks