Bayesian Neural Network
2017/01/27
¤ Bayesian neural network
¤ NIPS
http://bayesiandeeplearning.org
¤ Bayesian neural network
¤ Stan PyMC3 Edward
¤ \(D\) \(M\) \(w\)
¤ \(p(w \mid M)\)
¤
¤ \(p(D \mid w, M)\)
¤
¤
¤ w
¤ \(p(w \mid D, M)\)
¤
¤ \(p(D \mid M)\)
¤
¤
¤
¤
¤ D
¤
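¤ \(D = \{x_i\} = X\): \[ p(D \mid w) = \prod_{i=1}^{N} p(x_i \mid w) = p(X \mid w) \]
¤ \(D = \{(x_i, y_i)\} = (X, y)\): \[ p(D \mid w) = \prod_{i=1}^{N} p(y_i \mid x_i, w) = p(y \mid X, w) \]
\[ p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)} = \frac{p(D \mid w)\, p(w)}{\int p(D \mid w)\, p(w)\, dw} \]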
MAP
¤
¤ MAP
¤ MAP
¤
¤ MAP
\[ w_{\mathrm{MLE}} = \arg\max_{w} \log p(D \mid w) = \arg\max_{w} \sum_{i} \log p(x_i \mid w) \]
\[ w_{\mathrm{MAP}} = \arg\max_{w} \log p(w \mid D) = \arg\max_{w} \left[ \log p(D \mid w) + \log p(w) \right] \]
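¤ \(x^*\)
¤ \(w\)
¤
¤ \(x^*\)
\(y^*\)
\[ p(x^* \mid D) = \int p(x^* \mid w)\, p(w \mid D)\, dw = \mathbb{E}_{p(w \mid D)}\left[ p(x^* \mid w) \right] \]
\[ p(y^* \mid x^*, D) = \int p(y^* \mid x^*, w)\, p(w \mid D)\, dw = \mathbb{E}_{p(w \mid D)}\left[ p(y^* \mid x^*, w) \right] \]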
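\[ p(x^* \mid D) = \int p(x^* \mid w)\, p(w \mid D)\, dw = \mathbb{E}_{p(w \mid D)}\left[ p(x^* \mid w) \right] \]
\[ p(M \mid D) = \frac{p(D \mid M)\, p(M)}{p(D)} \]
\[ p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)} = \frac{p(D \mid w)\, p(w)}{\int p(D \mid w)\, p(w)\, dw} \]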
¤
¤
¤ \(w\)
¤
\[ p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)} = \frac{p(D \mid w)\, p(w)}{\int p(D \mid w)\, p(w)\, dw} \]
¤ 2
1. MCMC
2.
1. MCMC
¤ \(p(w \mid D)\)
¤
¤
2.
¤ \(p(w \mid \theta)\) \(q(w)\)
¤ 2
¤
¤
¤ \(w\) \(p(w)\)
¤ \(w\)
¤
Weight Uncertainty in Neural Networks
[Figure: two copies of a small network with input X, hidden units H1, H2, H3, and output Y.]
Figure 1. Left: each weight has a fixed value, as provided by classical backpropagation. Right: each weight is assigned a distribution, as provided by Bayes by Backprop.
is related to recent methods in deep, generative modelling
(Kingma and Welling, 2014; Rezende et al., 2014; Gregor
et al., 2014), where variational inference has been applied
to stochastic hidden units of an autoencoder. Whilst the
number of stochastic hidden units might be in the order of
the parameters of the categorical dis
through the exponential function then
regression Y is R and P(y|x, w) is a G
– this corresponds to a squared loss.
Inputs x are mapped onto the param
tion on Y by several successive layers
tion (given by w) interleaved with elem
transforms.
The weights can be learnt by maximum
tion (MLE): given a set of training exam
the MLE weights wMLE
are given by:
wMLE
= arg max
w
log P(D|w
= arg max
w
i
log P(
This is typically achieved by gradient
NN Bayesian NN
\[ p(y^* \mid D, x^*, \theta) = \int p(y^* \mid x^*, w)\, p(w \mid D, \theta)\, dw \]
[Blundell+ 2015]
¤
¤
¤ “The Importance of Knowing What We Don't Know (by Yarin Gal)”
WHY SHOULD WE CARE?
Calibrated model and prediction uncertainty: getting
systems that know when they don’t know.
Automatic model complexity control and structure learning
(Bayesian Occam’s Razor)
Figure from Yarin Gal’s thesis “Uncertainty in Deep Learning” (2016)
Zoubin Ghahramani
http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
¤ stochastic neural network
¤ VAE
¤ =
¤
¤
https://jmhl.org/research/
¤
¤
¤
¤
¤ 1
¤
¤
http://evelinag.com/blog/2014/09-15-introducing-ariadne/#.WI8X7LaLTEY
¤
¤ P
Q
¤
¤ α
¤
¤
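3.1.5 Multiple outputs
So far, we have considered the case of a single target variable t. In some applications, we may wish to predict K > 1 target variables, which we denote collectively by the target vector \(\mathbf{t}\). This could be done by introducing a different set of basis functions for each component of \(\mathbf{t}\), leading to multiple, independent regression problems. However, a more interesting, and more common, approach is to use the same set of basis functions to model all of the components of the target vector so that
\[ \mathbf{y}(\mathbf{x}, \mathbf{w}) = \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) \tag{3.31} \]
where \(\mathbf{y}\) is a K-dimensional column vector, \(\mathbf{W}\) is an M × K matrix of parameters, and \(\boldsymbol{\phi}(\mathbf{x})\) is an M-dimensional column vector with elements \(\phi_j(\mathbf{x})\), with \(\phi_0(\mathbf{x}) = 1\) as before. Suppose we take the conditional distribution of the target vector to be an isotropic Gaussian of the form
\[ p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}\left(\mathbf{t} \mid \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}),\ \beta^{-1} \mathbf{I}\right). \tag{3.32} \]
If we have a set of observations \(\mathbf{t}_1, \dots, \mathbf{t}_N\), we can combine these into a matrix \(\mathbf{T}\) of size N × K such that the nth row is given by \(\mathbf{t}_n^{\mathrm{T}}\). Similarly, we can combine the input vectors \(\mathbf{x}_1, \dots, \mathbf{x}_N\) into a matrix \(\mathbf{X}\). The log likelihood function is then given by
\[ \ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}\left(\mathbf{t}_n \mid \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n),\ \beta^{-1} \mathbf{I}\right) = \frac{NK}{2} \ln\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2} \sum_{n=1}^{N} \left\| \mathbf{t}_n - \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\|^2. \tag{3.33} \]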
3.3. Bayesian Linear Regression
Next we compute the posterior distribution, which is proportional to the product of the likelihood function and the prior. Due to the choice of a conjugate Gaussian prior distribution, the posterior will also be Gaussian. We can evaluate this distribution by the usual procedure of completing the square in the exponential, and then finding the normalization coefficient using the standard result for a normalized Gaussian. However, we have already done the necessary work in deriving the general result (2.116), which allows us to write down the posterior distribution directly in the form
\[ p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N) \tag{3.49} \]
where
\[ \mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \right) \tag{3.50} \]
\[ \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}. \tag{3.51} \]
Note that because the posterior distribution is Gaussian, its mode coincides with its mean. Thus the maximum posterior weight vector is simply given by \(\mathbf{w}_{\mathrm{MAP}} = \mathbf{m}_N\). If we consider an infinitely broad prior \(\mathbf{S}_0 = \alpha^{-1} \mathbf{I}\) with α → 0, the mean \(\mathbf{m}_N\) of the posterior distribution reduces to the maximum likelihood value \(\mathbf{w}_{\mathrm{ML}}\) given by (3.15). Similarly, if N = 0, then the posterior distribution reverts to the prior. Furthermore, if data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the subsequent data point, such that the new posterior distribution is again given by (3.49).
For the remainder of this chapter, we shall consider a particular form of Gaussian prior in order to simplify the treatment. Specifically, we consider a zero-mean isotropic Gaussian governed by a single precision parameter α so that
\[ p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}) \tag{3.52} \]
and the corresponding posterior distribution over \(\mathbf{w}\) is then given by (3.49) with
\[ \mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \tag{3.53} \]
\[ \mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}. \tag{3.54} \]
The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior and, as a function of \(\mathbf{w}\), takes the form
\[ \ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\}^2 - \frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w} + \text{const}. \tag{3.55} \]
Maximization of this posterior distribution with respect to \(\mathbf{w}\) is therefore equivalent …
PRML
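The posterior formulas (3.53) and (3.54) can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the slides or from PRML: the function name `blr_posterior`, the cubic polynomial basis, the synthetic sinusoidal data, and the values of `alpha` and `beta` are all assumptions chosen for the example.

```python
import numpy as np

def blr_posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) under the zero-mean isotropic prior (3.52)."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # (3.54)
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t                       # (3.53)
    return m_N, S_N

# Hypothetical toy data: noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=20)

# Illustrative design matrix with polynomial basis 1, x, x^2, x^3.
Phi = np.vander(x, 4, increasing=True)

m_N, S_N = blr_posterior(Phi, t, alpha=2.0, beta=25.0)

# With a very broad prior (alpha -> 0) the posterior mean should approach
# the maximum likelihood (least squares) solution.
m_broad, _ = blr_posterior(Phi, t, alpha=1e-10, beta=25.0)
w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]
```

Because the posterior is Gaussian, `m_N` is also the MAP estimate, so the gradient of (3.55) evaluated at `m_N` should vanish up to rounding.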
¤
¤
equal to the mean, although this will no longer hold if q ≠ 2.
3.3.2 Predictive distribution
In practice, we are not usually interested in the value of \(\mathbf{w}\) itself but rather in making predictions of t for new values of \(\mathbf{x}\). This requires that we evaluate the predictive distribution defined by
\[ p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} \tag{3.57} \]
in which \(\mathbf{t}\) is the vector of target values from the training set, and we have omitted the corresponding input vectors from the right-hand side of the conditioning statements to simplify the notation. The conditional distribution \(p(t \mid \mathbf{x}, \mathbf{w}, \beta)\) of the target variable is given by (3.8), and the posterior weight distribution is given by (3.49). We see that (3.57) involves the convolution of two Gaussian distributions, and so making use of the result (2.115) from Section 8.1.4, we see that the predictive distribution takes the form
\[ p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}\left(t \mid \mathbf{m}_N^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}),\ \sigma_N^2(\mathbf{x})\right) \tag{3.58} \]
where the variance \(\sigma_N^2(\mathbf{x})\) of the predictive distribution is given by
\[ \sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}). \tag{3.59} \]
The first term in (3.59) represents the noise on the data whereas the second term reflects the uncertainty associated with the parameters \(\mathbf{w}\). Because the noise process and the distribution of \(\mathbf{w}\) are independent Gaussians, their variances are additive. Note that, as additional data points are observed, the posterior distribution becomes narrower. As a consequence it can be shown (Qazaz et al., 1997) that \(\sigma_{N+1}^2(\mathbf{x}) \le \sigma_N^2(\mathbf{x})\). In the limit N → ∞, the second term in (3.59) goes to zero, and the variance of the predictive distribution arises solely from the additive noise governed by the parameter β.
As an illustration of the predictive distribution for Bayesian linear regression models, let us return to the synthetic sinusoidal data set of Section 1.1. In Figure 3.8, …
[Figure 3.8: four panels plotting t against x, with x from 0 to 1 and t from −1 to 1.]
Figure 3.8 Examples of the predictive distribution (3.58) for a model consisting of 9 Gaussian basis functions of the form (3.4) using the synthetic sinusoidal data set of Section 1.1. See the text for a detailed discussion.
PRML
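The predictive mean and variance in (3.58) and (3.59) can be sketched as follows. This is an illustrative example under stated assumptions: the helper names `gaussian_basis` and `predictive`, the nine basis-function centres, the width 0.1, and the values of `alpha` and `beta` are placeholders, not values taken from the slides.

```python
import numpy as np

def gaussian_basis(x, centres, width):
    """Design matrix of Gaussian basis functions (centres and width are assumed)."""
    return np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)

def predictive(phi_new, Phi, t, alpha, beta):
    """Predictive mean and variance at new inputs, following (3.58)-(3.59)."""
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    mean = phi_new @ m_N
    # 1/beta is the data noise; the quadratic form is the weight uncertainty.
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", phi_new, S_N, phi_new)
    return mean, var

rng = np.random.default_rng(0)
centres = np.linspace(0.0, 1.0, 9)
alpha, beta = 2.0, 25.0

# Hypothetical sinusoidal data; the first 4 points are a subset of all 25.
x_all = rng.uniform(0.0, 1.0, size=25)
t_all = np.sin(2 * np.pi * x_all) + rng.normal(0.0, 0.2, size=25)

x_new = np.linspace(0.0, 1.0, 50)
phi_new = gaussian_basis(x_new, centres, 0.1)

mean_4, var_4 = predictive(
    phi_new, gaussian_basis(x_all[:4], centres, 0.1), t_all[:4], alpha, beta)
mean_25, var_25 = predictive(
    phi_new, gaussian_basis(x_all, centres, 0.1), t_all, alpha, beta)
```

Comparing `var_4` with `var_25` shows the behaviour described in the text: adding observations never increases the predictive variance, and it never drops below the noise floor `1 / beta`.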
¤
¤
¤
¤
¤ 2
¤
¤
¤
¤
1.
2. MAP \(w_{\mathrm{MAP}}\)
3. 2 R
4.
5.7. Bayesian Neural Networks
chapters and so we can exploit the results obtained there. We can then make use of the evidence framework to provide point estimates for the hyperparameters and to compare alternative models (for example, networks having different numbers of hidden units). To start with, we shall discuss the regression case and then later consider the modifications needed for solving classification tasks.
5.7.1 Posterior parameter distribution
Consider the problem of predicting a single continuous target variable t from a vector \(\mathbf{x}\) of inputs (the extension to multiple targets is straightforward). We shall suppose that the conditional distribution \(p(t \mid \mathbf{x})\) is Gaussian, with an \(\mathbf{x}\)-dependent mean given by the output of a neural network model \(y(\mathbf{x}, \mathbf{w})\), and with precision (inverse variance) β
\[ p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\left(t \mid y(\mathbf{x}, \mathbf{w}),\ \beta^{-1}\right). \tag{5.161} \]
Similarly, we shall choose a prior distribution over the weights \(\mathbf{w}\) that is Gaussian of the form
\[ p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}). \tag{5.162} \]
For an i.i.d. data set of N observations \(\mathbf{x}_1, \dots, \mathbf{x}_N\), with a corresponding set of target values \(\mathcal{D} = \{t_1, \dots, t_N\}\), the likelihood function is given by
\[ p(\mathcal{D} \mid \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid y(\mathbf{x}_n, \mathbf{w}),\ \beta^{-1}\right) \tag{5.163} \]
and so the resulting posterior distribution is then
\[ p(\mathbf{w} \mid \mathcal{D}, \alpha, \beta) \propto p(\mathbf{w} \mid \alpha)\, p(\mathcal{D} \mid \mathbf{w}, \beta) \tag{5.164} \]
which, as a consequence of the nonlinear dependence of \(y(\mathbf{x}, \mathbf{w})\) on \(\mathbf{w}\), will be non-Gaussian.
We can find a Gaussian approximation to the posterior distribution by using the Laplace approximation. To do this, we must first find a (local) maximum of the …
… form
\[ \ln p(\mathbf{w} \mid \mathcal{D}) = -\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w} - \frac{\beta}{2} \sum_{n=1}^{N} \left\{ y(\mathbf{x}_n, \mathbf{w}) - t_n \right\}^2 + \text{const} \tag{5.165} \]
which corresponds to a regularized sum-of-squares error function. Assuming for the moment that α and β are fixed, we can find a maximum of the posterior, which we denote \(\mathbf{w}_{\mathrm{MAP}}\), by standard nonlinear optimization algorithms such as conjugate gradients, using error backpropagation to evaluate the required derivatives.
Having found a mode \(\mathbf{w}_{\mathrm{MAP}}\), we can then build a local Gaussian approximation by evaluating the matrix of second derivatives of the negative log posterior distribution. From (5.165), this is given by
\[ \mathbf{A} = -\nabla\nabla \ln p(\mathbf{w} \mid \mathcal{D}, \alpha, \beta) = \alpha \mathbf{I} + \beta \mathbf{H} \tag{5.166} \]
where \(\mathbf{H}\) is the Hessian matrix comprising the second derivatives of the sum-of-squares error function with respect to the components of \(\mathbf{w}\). Algorithms for computing and approximating the Hessian were discussed in Section 5.4. The corresponding Gaussian approximation to the posterior is then given from (4.134) by
\[ q(\mathbf{w} \mid \mathcal{D}) = \mathcal{N}\left(\mathbf{w} \mid \mathbf{w}_{\mathrm{MAP}},\ \mathbf{A}^{-1}\right). \tag{5.167} \]
Similarly, the predictive distribution is obtained by marginalizing with respect …
PRML
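The Laplace recipe above (find a mode, then take the second derivatives of the negative log posterior there) can be sketched with finite differences. Everything here is an assumption for illustration: the four-parameter `tiny_net`, the synthetic data, the values of `alpha` and `beta`, and the use of plain gradient descent with numerical gradients in place of conjugate gradients and backpropagation. The point reached by the loop is only an approximate mode.

```python
import numpy as np

def neg_log_posterior(w, model, x, t, alpha, beta):
    """Negative of (5.165) with the constant dropped."""
    resid = model(x, w) - t
    return 0.5 * alpha * w @ w + 0.5 * beta * resid @ resid

def numerical_gradient(f, w, h=1e-5):
    """Central-difference gradient of a scalar function f at w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

def numerical_hessian(f, w, h=1e-3):
    """Central-difference Hessian of a scalar function f at w."""
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = h
            ej[j] = h
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * h * h)
    return H

def laplace_precision(w_map, model, x, t, alpha, beta):
    """A = -grad grad ln p(w|D) at w_map, as in (5.166), symmetrised."""
    f = lambda w: neg_log_posterior(w, model, x, t, alpha, beta)
    A = numerical_hessian(f, w_map)
    return 0.5 * (A + A.T)

def tiny_net(x, w):
    """Hypothetical one-hidden-unit network: w[2] * tanh(w[0] * x + w[1]) + w[3]."""
    return w[2] * np.tanh(w[0] * x + w[1]) + w[3]

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 30)
t = np.tanh(2.0 * x) + rng.normal(0.0, 0.1, size=x.size)
alpha, beta = 1.0, 100.0

# Crude mode search by gradient descent (a stand-in for a proper optimiser).
f = lambda w: neg_log_posterior(w, tiny_net, x, t, alpha, beta)
w = np.array([1.0, 0.0, 1.0, 0.0])
for _ in range(5000):
    w = w - 1e-4 * numerical_gradient(f, w)
w_map = w

# Precision matrix of the Gaussian approximation q(w|D) = N(w | w_map, A^{-1}).
A = laplace_precision(w_map, tiny_net, x, t, alpha, beta)
```

For a model that is linear in w, the same function reproduces αI + βΦᵀΦ, which ties this approximation back to the exact Bayesian linear regression posterior (3.54).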
¤
¤
¤
https://www.r-bloggers.com/easy-laplace-approximation-of-bayesian-models-in-r/
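¤ \(q(w \mid \theta)\)
¤ \(\mathrm{KL}[q(w \mid \theta) \,\|\, p(w \mid D)]\) \(\theta\)
¤ KL
¤ \(\theta\) ELBO \(\theta\)
→
\[ \mathrm{KL}[q(w \mid \theta) \,\|\, p(w \mid D)] = -\int q(w \mid \theta) \log \frac{p(D \mid w)\, p(w)}{q(w \mid \theta)}\, dw + \log p(D) = -\mathcal{L}(D; \theta) + \log p(D) \]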
ELBO
ELBO
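¤ ELBO \(\mathcal{L}(D; \theta)\)
1.
¤ \(p(w \mid D)\)
MC
¤
2. ELBO
¤ MC
¤ MC \[ \int q(w \mid \theta) \log \frac{p(D \mid w)\, p(w)}{q(w \mid \theta)}\, dw = \mathbb{E}_{q(w \mid \theta)}\left[ \log \frac{p(D \mid w)\, p(w)}{q(w \mid \theta)} \right] \]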
3.
¤ EM
¤ ELBO
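The Monte Carlo estimate of the ELBO listed above can be sketched on a model where the answer is known. The conjugate model used here (prior w ~ N(0, 1), likelihood x_i ~ N(w, 1)) and the names `log_normal` and `elbo_mc` are assumptions chosen so that the exact posterior and log p(D) are available for comparison; they do not come from the slides.

```python
import numpy as np

def log_normal(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def elbo_mc(x, q_mean, q_var, num_samples, rng):
    """MC estimate of E_q[log p(D|w) p(w) / q(w)] with q = N(q_mean, q_var)."""
    w = q_mean + np.sqrt(q_var) * rng.standard_normal(num_samples)  # w ~ q
    log_lik = log_normal(x[None, :], w[:, None], 1.0).sum(axis=1)   # log p(D|w)
    log_prior = log_normal(w, 0.0, 1.0)                             # log p(w)
    log_q = log_normal(w, q_mean, q_var)                            # log q(w)
    return np.mean(log_lik + log_prior - log_q)

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=10)
n = x.size

# Exact posterior of the assumed conjugate model.
post_mean, post_var = x.sum() / (n + 1), 1.0 / (n + 1)

# log p(D) via Bayes' rule evaluated at w = 0:
# log p(D) = log p(D|w) + log p(w) - log p(w|D).
log_evidence = (log_normal(x, 0.0, 1.0).sum()
                + log_normal(0.0, 0.0, 1.0)
                - log_normal(0.0, post_mean, post_var))

elbo_exact_q = elbo_mc(x, post_mean, post_var, 1000, rng)   # q equals the posterior
elbo_poor_q = elbo_mc(x, 0.0, 1.0, 100000, rng)             # q equals the prior
```

When q is the exact posterior the KL term is zero and the ELBO matches log p(D); any other q gives a strictly smaller value, which is the gap the variational parameters are tuned to close.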
¤ Gal
¤ Denker, Schwartz, Wittner, Solla, Howard, Jackel, Hopfield (1987)
¤ Denker and LeCun (1991)
¤ MacKay (1992)
¤ Hinton and van Camp (1993)
¤
¤ Neal (1995)
¤ Barber and Bishop (1998)
¤ Graves (2011)
¤ Blundell, Cornebise, Kavukcuoglu, and Wierstra (2015)
¤ Hernandez-Lobato and Adam (2015)
¤ Practical Variational Inference for Neural Networks [Graves 2011]
¤
¤
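¤ \(L^{E}\) \(q(w \mid \theta)\) \(\theta = \{\mu, \sigma\}\)
¤
\[ -\mathcal{L}(D; \theta) = -\int q(w \mid \theta) \log \frac{p(D \mid w)\, p(w)}{q(w \mid \theta)}\, dw = -\mathbb{E}_{q(w \mid \theta)}[\log p(D \mid w)] + \mathrm{KL}[q(w \mid \theta) \,\|\, p(w)] \]
\[ \frac{\partial L^{E}}{\partial \mu} \approx -\frac{1}{S} \sum_{k=1}^{S} \frac{\partial \log p(D \mid w^{(k)})}{\partial w} \]
\[ \frac{\partial L^{E}}{\partial \sigma^2} \approx \frac{1}{2S} \sum_{k=1}^{S} \left( \frac{\partial \log p(D \mid w^{(k)})}{\partial w} \right)^{2} \]
Here \(L^{E} = -\mathbb{E}_{q(w \mid \theta)}[\log p(D \mid w)]\) and \(w^{(1)}, \dots, w^{(S)}\) are samples from \(q(w \mid \theta)\).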
¤ Weight Uncertainty in Neural Networks [Blundell+ 2015]
¤
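¤ \(q(w \mid \theta)\)
¤
¤ http://www.slideshare.net/masa_s/weight-uncertainty-in-neural-networks
Bayes by backprop
\[ \frac{\partial}{\partial \theta} \mathcal{L}(D; \theta) = \frac{\partial}{\partial \theta} \mathbb{E}_{q(w \mid \theta)}\left[ f(w, \theta) \right], \qquad f(w, \theta) = \log \frac{p(D \mid w)\, p(w)}{q(w \mid \theta)} \]
\[ = \mathbb{E}_{q(\epsilon)}\left[ \frac{\partial f(w, \theta)}{\partial w} \frac{\partial w}{\partial \theta} + \frac{\partial f(w, \theta)}{\partial \theta} \right] \]
Reparameterization trick
\[ w = \mu + \mathrm{diag}(\sigma) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \]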
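The reparameterized gradient can be checked numerically on a toy objective. This sketch assumes an integrand f(w) = Σ w² that has no direct dependence on θ, so the ∂f/∂θ term is zero and only the ∂f/∂w · ∂w/∂θ term remains; the function name `reparam_gradients` and the values of `mu` and `sigma` are placeholders. For this f, E[f] = Σ(μ² + σ²), so the exact gradients are 2μ and 2σ.

```python
import numpy as np

def reparam_gradients(df_dw, mu, sigma, num_samples, rng):
    """Estimate d/dmu and d/dsigma of E_q[f(w)] using w = mu + sigma * eps."""
    eps = rng.standard_normal((num_samples,) + mu.shape)   # eps ~ N(0, I)
    w = mu + sigma * eps                                   # elementwise product
    g = df_dw(w)                                           # df/dw at each sample
    grad_mu = g.mean(axis=0)                               # dw/dmu = 1
    grad_sigma = (g * eps).mean(axis=0)                    # dw/dsigma = eps
    return grad_mu, grad_sigma

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
sigma = np.array([0.5, 1.5])

# Toy integrand f(w) = sum(w^2), whose derivative with respect to w is 2w.
grad_mu, grad_sigma = reparam_gradients(lambda w: 2.0 * w, mu, sigma, 200000, rng)
```

Because the noise ε is sampled independently of θ, the expectation and the derivative can be exchanged, which is what lets ordinary backpropagation supply `df_dw` in a real network.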
Dropout
¤ Dropout as a Bayesian Approximation [Gal+ 2015]
¤ O # N = ∏ O(nC|oC)
¤ \(L^{E} = -\mathbb{E}_{q(w \mid \theta)}[\log p(D \mid w)]\)
¤ oC 0
¤ 0
¤ dropout
¤ drop-connect multiplicative Gaussian noise
sults are summarised here
n uncertainty estimates for
el with L layers and a loss
max loss or the Euclidean
Wi the NN’s weight ma-
1, and by bi the bias vec-
ayer i = 1, ..., L. We de-
corresponding to input xi
the input and output sets
on a regularisation term is
egularisation weighted by
n a minimisation objective
λ
L
i=1
||Wi||2
2 + ||bi||2
2 .
(1)
variables for every input
in each layer (apart from
le takes value 1 with prob-
ropped (i.e. its value is set
orresponding binary vari-
me values in the backward
to the parameters.
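\[ p(y \mid x, \omega) = \mathcal{N}\left( y;\ \hat{y}(x, \omega),\ \tau^{-1} I_D \right) \]
\[ \hat{y}\left(x, \omega = \{W_1, \dots, W_L\}\right) = \sqrt{\frac{1}{K_L}}\, W_L\, \sigma\left( \dots \sqrt{\frac{1}{K_1}}\, W_2\, \sigma\left( W_1 x + m_1 \right) \dots \right) \]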
The posterior distribution \(p(\omega \mid X, Y)\) in eq. (2) is intractable. We use \(q(\omega)\), a distribution over matrices whose columns are randomly set to zero, to approximate the intractable posterior. We define \(q(\omega)\) as:
\[ W_i = M_i \cdot \mathrm{diag}\left( [z_{i,j}]_{j=1}^{K_i} \right) \]
\[ z_{i,j} \sim \mathrm{Bernoulli}(p_i) \quad \text{for } i = 1, \dots, L,\ j = 1, \dots, K_{i-1} \]
given some probabilities \(p_i\) and matrices \(M_i\) as variational parameters. The binary variable \(z_{i,j} = 0\) corresponds then to unit j in layer i − 1 being dropped out as an input to layer i. The variational distribution \(q(\omega)\) is highly multimodal, inducing strong joint correlations over the rows of the matrices \(W_i\) (which correspond to the frequencies in the sparse spectrum GP approximation).
We minimise the KL divergence between the approximate posterior \(q(\omega)\) above and the posterior of the full deep GP, \(p(\omega \mid X, Y)\). This KL is our minimisation objective
\[ -\int q(\omega) \log p(Y \mid X, \omega)\, d\omega + \mathrm{KL}\left( q(\omega) \,\|\, p(\omega) \right). \tag{3} \]
Dropout
¤ dropout
¤ dropout
¤
¤ MC dropout
¤ http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
where \(\omega = \{W_i\}_{i=1}^{L}\) is our set of random variables for a model with L layers.
We will perform moment-matching and estimate the first two moments of the predictive distribution empirically. More specifically, we sample T sets of vectors of realisations from the Bernoulli distribution \(\{z_1^t, \dots, z_L^t\}_{t=1}^{T}\) with \(z_i^t = [z_{i,j}^t]_{j=1}^{K_i}\), giving \(\{W_1^t, \dots, W_L^t\}_{t=1}^{T}\). We estimate
\[ \mathbb{E}_{q(y^* \mid x^*)}(y^*) \approx \frac{1}{T} \sum_{t=1}^{T} \hat{y}^*\left(x^*, W_1^t, \dots, W_L^t\right) \tag{6} \]
following proposition C in the appendix. We refer to this Monte Carlo estimate as MC dropout. In practice this is equivalent to performing T stochastic forward passes through the network and averaging the results.
This result has been presented in the literature before as model averaging. We have given a new derivation for this result which allows us to derive mathematically grounded uncertainty estimates as well. Srivastava et al. (2014, section 7.5) have reasoned empirically that MC dropout can be approximated by averaging the weights of the network (multiplying each \(W_i\) by \(p_i\) at test time, referred to as standard dropout).
log p(y∗
|x∗
, X, Y) ≈
with a log-sum-exp o
passes through the ne
Our predictive distr
highly multi-modal,
give a glimpse into i
proximating variation
matrix column is bi-
tribution over each la
3.2 in the appendix).
Note that the dropo
To estimate the predi
we simply collect the
through the model.
used with existing N
thermore, the forward
sulting in constant run
dropout.
5. Experiments
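We estimate the second raw moment in the same way:
\[ \mathbb{E}_{q(y^* \mid x^*)}\left( (y^*)^{\mathrm{T}} (y^*) \right) \approx \tau^{-1} I_D + \frac{1}{T} \sum_{t=1}^{T} \hat{y}^*\left(x^*, W_1^t, \dots, W_L^t\right)^{\mathrm{T}} \hat{y}^*\left(x^*, W_1^t, \dots, W_L^t\right) \]
following proposition D in the appendix. To obtain the model’s predictive variance we have:
\[ \mathrm{Var}_{q(y^* \mid x^*)}\left( y^* \right) \approx \tau^{-1} I_D \ \dots \]
² In the appendix (section 4.1) we extend this derivation to classification. E(·) is defined as softmax loss and τ is set to 1.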
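The T stochastic forward passes described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the network weights are random placeholders (no training is performed), `keep_prob` and `tau` are arbitrary, dropout masks are applied to hidden activations only, and the output is scalar so that the second raw moment reduces to τ⁻¹ plus the mean of the squared outputs. The variance is then obtained from the usual identity (second raw moment minus squared mean).

```python
import numpy as np

def stochastic_forward(x, weights, biases, keep_prob, rng):
    """One forward pass with a fresh Bernoulli mask on each hidden layer."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        if i > 0:
            # Zeroing unit j of the previous layer removes it as an input here.
            mask = rng.binomial(1, keep_prob, size=h.shape)
            h = h * mask
        h = h @ W + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)   # ReLU non-linearity
    return h

def mc_dropout_predict(x, weights, biases, keep_prob, tau, T, rng):
    """Predictive mean (eq. 6) and variance from T stochastic passes."""
    samples = np.stack([
        stochastic_forward(x, weights, biases, keep_prob, rng) for _ in range(T)
    ])
    mean = samples.mean(axis=0)
    second_moment = 1.0 / tau + (samples ** 2).mean(axis=0)   # scalar-output case
    var = second_moment - mean ** 2
    return mean, var

rng = np.random.default_rng(0)
# Placeholder weights for a 1-50-1 network; a real use would load trained ones.
weights = [rng.normal(0.0, 1.0, size=(1, 50)), rng.normal(0.0, 0.2, size=(50, 1))]
biases = [np.zeros(50), np.zeros(1)]

x_star = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
mean, var = mc_dropout_predict(x_star, weights, biases,
                               keep_prob=0.9, tau=10.0, T=200, rng=rng)
```

The only change relative to an ordinary test-time pass is that dropout stays switched on and the pass is repeated, which is why the estimate can be used with an existing dropout network.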
proximating variational distributio
matrix column is bi-modal, and
tribution over each layer’s weight
3.2 in the appendix).
Note that the dropout NN mod
To estimate the predictive mean a
we simply collect the results of s
through the model. As a result,
used with existing NN models tra
thermore, the forward passes can
sulting in constant running time id
dropout.
5. Experiments
We next perform an extensive ass
of the uncertainty estimates obta
and convnets on the tasks of regr
We compare the uncertainty obtai
architectures and non-linearities,
olation, and show that model unc
classification tasks using MNIST
as an example. We then show th
tainty we can obtain a considerabl
tive log-likelihood and RMSE co
of-the-art methods. We finish wi
¤ MC dropout
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
(a) Standard dropout with weight averaging (b) Gaussian process with SE covariance function
(c) MC dropout with ReLU non-linearities (d) MC dropout with TanH non-linearities
Figure 2. Predictive mean and uncertainties on the Mauna Loa CO2 concentrations dataset, for various models. In red is the
observed function (left of the dashed blue line); in blue is the predictive mean plus/minus two standard deviations (8 for fig. 2d).
Different shades of blue represent half a standard deviation. Marked with a dashed red line is a point far away from the data: standard
dropout confidently predicts an insensible value for the point; the other models predict insensible values as well but with the additional
information that the models are uncertain about their predictions.
model’s uncertainty in a Bayesian pipeline. We give a
quantitative assessment of the model’s performance in the
setting of reinforcement learning on a task similar to that
used in deep reinforcement learning (Mnih et al., 2015).
Using the results from the previous section, we begin by
qualitatively evaluating the dropout NN uncertainty on two
comparison. Fig. 2c shows the results of the same network
as in fig. 2a, but with MC dropout used to evaluate the pre-
dictive mean and uncertainty for the training and test sets.
Lastly, fig. 2d shows the same using the TanH network with
5 layers (plotted with 8 times the standard deviation for vi-
sualisation purposes). The shades of blue represent model
uncertainty: each colour gradient represents half a standard
Dropout
¤ Variational dropout and the local reparameterization trick
[Kingma+ 2015]
¤
¤ 0
¤ 0 local reparameterization trick
¤ Dropout
¤
¤ http://www.slideshare.net/masa_s/dl-hacks-variational-dropout-and-the-local-reparameterization-trick
2.2 Variance of the SGVB estimator
The theory of stochastic approximation tells us that stochastic gradient ascent using (3) will asymptotically converge to a local optimum for an appropriately declining step size and sufficient weight updates [18], but in practice the performance of stochastic gradient ascent crucially depends on the variance of the gradients. If this variance is too large, stochastic gradient descent will fail to make much progress in any reasonable amount of time. Our objective function consists of an expected log likelihood term that we approximate using Monte Carlo, and a KL divergence term \(D_{KL}(q_\phi(w) \,\|\, p(w))\) that we assume can be calculated analytically and otherwise be approximated with Monte Carlo with similar reparameterization.
Assume that we draw minibatches of datapoints with replacement; see appendix F for a similar analysis for minibatches without replacement. Using \(L_i\) as shorthand for \(\log p(y^i \mid x^i, w = f(\epsilon^i, \phi))\), the contribution to the likelihood for the i-th datapoint in the minibatch, the Monte Carlo estimator (3) may be rewritten as \(L_D^{\mathrm{SGVB}}(\phi) = \frac{N}{M} \sum_{i=1}^{M} L_i\), whose variance is given by
\[ \mathrm{Var}\left[ L_D^{\mathrm{SGVB}}(\phi) \right] = \frac{N^2}{M^2} \left( \sum_{i=1}^{M} \mathrm{Var}[L_i] + 2 \sum_{i=1}^{M} \sum_{j=i+1}^{M} \mathrm{Cov}[L_i, L_j] \right) \tag{4} \]
\[ = N^2 \left( \frac{1}{M} \mathrm{Var}[L_i] + \frac{M-1}{M} \mathrm{Cov}[L_i, L_j] \right), \tag{5} \]
where the variances and covariances are w.r.t. both the data distribution and ϵ distribution, i.e. \(\mathrm{Var}[L_i] = \mathrm{Var}_{\epsilon, x^i, y^i}\left[ \log p(y^i \mid x^i, w = f(\epsilon, \phi)) \right]\), with \(x^i, y^i\) drawn from the empirical distribution defined by the training set. As can be seen from (5), the total contribution to the variance by \(\mathrm{Var}[L_i]\) is inversely proportional to the minibatch size M. However, the total contribution by the covariances does not decrease with M. In practice, this means that the variance of \(L_D^{\mathrm{SGVB}}(\phi)\) can be dominated by the covariances for even moderately large M.
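Equation (5) can be checked with a small simulation. The setup below is a toy assumption, not the paper's experiment: each \(L_i\) is modelled as an independent term plus a term shared by the whole minibatch (standing in for a single noise draw reused across datapoints), so that Var[L_i] = `var_indep + var_shared` and Cov[L_i, L_j] = `var_shared`. The function names and all numeric values are illustrative.

```python
import numpy as np

def sgvb_variance(N, M, var_L, cov_L):
    """Variance of (N/M) * sum_i L_i according to equation (5)."""
    return N ** 2 * (var_L / M + (M - 1) / M * cov_L)

def simulate_variance(N, M, var_indep, var_shared, reps, rng):
    """Empirical variance of the estimator under the toy shared-noise model."""
    a = rng.normal(0.0, np.sqrt(var_indep), size=(reps, M))   # independent part
    b = rng.normal(0.0, np.sqrt(var_shared), size=(reps, 1))  # shared per minibatch
    estimates = N / M * (a + b).sum(axis=1)
    return estimates.var()

rng = np.random.default_rng(0)
N, M = 100, 10
var_indep, var_shared = 1.0, 0.1

predicted = sgvb_variance(N, M, var_indep + var_shared, var_shared)
empirical = simulate_variance(N, M, var_indep, var_shared, 200000, rng)

# With a very large minibatch only the covariance term survives.
large_batch = sgvb_variance(N, 10 ** 6, var_indep + var_shared, var_shared)
```

The value of `large_batch` stays close to N² times the covariance however large M becomes, which is the floor that an estimator with zero covariance between datapoints would remove.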
2.3 Local Reparameterization Trick
We therefore propose an alternative estimator for which we have \(\mathrm{Cov}[L_i, L_j] = 0\), so that the variance of our stochastic gradients scales as 1/M. We then make this new estimator computationally efficient by not sampling ϵ directly, but only sampling the intermediate variables f(ϵ) through which
SGVB
Automatic Differentiation
Variational Inference
¤
¤
¤
¤
→ Automatic differentiation variational inference ADVI
Automatic Differentiation
Variational Inference ADVI
¤ Automatic Differentiation Variational Inference [Kucukelbir+ 2016]
¤ Stan PyMC3 Edward
¤ ADVI
1. \(\theta\) \(p(x, \theta)\) \(p(x, \zeta)\)
\(\mathrm{KL}[q(\zeta) \,\|\, p(\zeta, x)]\)
2. \(q\) MC
3. \(q\)
Automatic transformation
¤ \(\theta\)
¤ \(p(\theta)\) support
¤
¤ \(\theta\) \(\zeta\)
¤ \(\zeta\)
¤ \(\theta \to \zeta\) \(T\) \(\zeta\)
Automatic transformation
¤
¤ \(T = \log(\theta)\)
¤
¤ \(\zeta\)
of ζ; it has the representation
$$ p(x, \zeta) = p\big(x, T^{-1}(\zeta)\big)\, \big|\det J_{T^{-1}}(\zeta)\big|, $$
where $p(x, \theta = T^{-1}(\zeta))$ is the joint density in the original latent variable space, and $J_{T^{-1}}(\zeta)$ is the Jacobian of the inverse of T. Transformations of continuous probability densities require a Jacobian; it accounts for how the transformation warps unit volumes and ensures that the transformed density integrates to one (Olive, 2014). (See Appendix A.)
Consider again our running Weibull-Poisson example from Section 2.1. The latent variable θ lives in $\mathbb{R}_{>0}$. The logarithm ζ = T(θ) = log(θ) transforms $\mathbb{R}_{>0}$ to the real line $\mathbb{R}$. Its Jacobian adjustment is the derivative of the inverse of the logarithm, $|\det J_{T^{-1}}(\zeta)| = \exp(\zeta)$. The transformed density is
$$ p(x, \zeta) = \mathrm{Poisson}(x \mid \exp(\zeta)) \times \mathrm{Weibull}(\exp(\zeta);\, 1.5, 1) \times \exp(\zeta). $$
Figure 1 depicts this transformation.
As we describe in the introduction, we implement our algorithm in Stan (Stan Development Team, 2015). Stan maintains a library of transformations and their corresponding Jacobians.⁴ With Stan, we can automatically transform the joint density of any differentiable probability model to one with real-valued latent variables. (See Figure 2.)
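The role of the Jacobian term can be checked numerically. The following is a minimal sketch, not Stan's implementation: it assumes that Weibull(·; 1.5, 1) means shape 1.5 and scale 1, uses a single hypothetical observation `x_obs = 2`, and all function names are made up for illustration. It integrates the joint density over θ and the transformed density over ζ; with the `+ zeta` term in place the two integrals should agree.

```python
import math

def log_poisson(x, rate):
    """Log pmf of Poisson(x | rate)."""
    return x * math.log(rate) - rate - math.lgamma(x + 1)

def log_weibull(theta, shape=1.5, scale=1.0):
    """Log density of Weibull(theta; shape, scale) for theta > 0 (assumed parameterization)."""
    z = theta / scale
    return math.log(shape / scale) + (shape - 1) * math.log(z) - z ** shape

def log_joint_theta(x, theta):
    """log p(x, theta) in the original latent variable space (theta > 0)."""
    return log_poisson(x, theta) + log_weibull(theta)

def log_joint_zeta(x, zeta):
    """log p(x, zeta) with zeta = log(theta); the trailing + zeta is log|det J| = log exp(zeta)."""
    theta = math.exp(zeta)
    return log_joint_theta(x, theta) + zeta

def trapezoid(f, lo, hi, n=20000):
    """Plain trapezoidal rule on [lo, hi] with n sub-intervals."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

x_obs = 2  # hypothetical single observation
# Integrate out the latent variable in both coordinate systems.
mass_theta = trapezoid(lambda t: math.exp(log_joint_theta(x_obs, t)), 1e-9, 30.0)
mass_zeta = trapezoid(lambda z: math.exp(log_joint_zeta(x_obs, z)), -25.0, 5.0)
print(mass_theta, mass_zeta)
```

Dropping the `+ zeta` term in `log_joint_zeta` makes the two numbers disagree, which is the volume-warping effect the Jacobian corrects.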
[Figure 1 panels: (a) Latent variable space, density plotted against θ; (b) Real coordinate space, plotted against ζ. T maps (a) to (b) and T⁻¹ maps back. Legend: prior, posterior, approximation.]
Figure 1: Transforming the latent variable to real coordinate space. The purple line is the posterior. The green line is the approximation. (a) The latent variable space is $\mathbb{R}_{>0}$. (a→b) T transforms the latent variable space to $\mathbb{R}$. (b) The variational approximation is a Gaussian in real coordinate space.
2.4 Variational Approximations in Real Coordinate Space
After the transformation, the latent variables ζ have support in the real coordinate space $\mathbb{R}^K$. We have a choice of variational approximations in this space. Here, we consider Gaussian distributions; these implicitly induce non-Gaussian variational distributions in the original latent variable space.
Mean-field Gaussian. One option is to posit a factorized (mean-field) Gaussian variational approximation
$$ q(\zeta; \phi) = \mathcal{N}\big(\zeta; \mu, \mathrm{diag}(\sigma^2)\big) = \prod_{k=1}^{K} \mathcal{N}\big(\zeta_k; \mu_k, \sigma_k^2\big), $$
where the vector $\phi = (\mu_1, \ldots, \mu_K, \sigma_1^2, \ldots, \sigma_K^2)$ concatenates the mean and variance of each Gaussian factor. Since the variance parameters must always be positive, the variational parameters live in the set $\Phi = \{\mathbb{R}^K, \mathbb{R}^K_{>0}\}$. Re-parameterizing the mean-field Gaussian removes this constraint. Consider the logarithm of the standard deviations, ω = log(σ), applied element-wise. The support of ω is now the real coordinate space and σ is always positive. The mean-field Gaussian becomes $q(\zeta; \phi) = \mathcal{N}\big(\zeta; \mu, \mathrm{diag}(\exp(\omega)^2)\big)$, where the vector $\phi = (\mu_1, \ldots, \mu_K, \omega_1, \ldots, \omega_K)$ concatenates the mean and logarithm of the standard deviation of each factor. Now, the variational parameters are unconstrained in $\mathbb{R}^{2K}$.
Full-rank Gaussian. Another option is to posit a full-rank Gaussian variational approximation
$$ q(\zeta; \phi) = \mathcal{N}(\zeta; \mu, \Sigma), $$
where the vector φ = (µ, Σ) concatenates the mean vector µ and covariance matrix Σ. To ensure that Σ always remains positive semidefinite, we re-parameterize the covariance matrix using a Cholesky factorization, $\Sigma = LL^\top$. We use the non-unique definition of the Cholesky factorization where the diagonal elements of L need not be positively constrained (Pinheiro and Bates, 1996). Therefore L lives in the unconstrained space of lower-triangular matrices with K(K + 1)/2 real-valued entries. The full-rank Gaussian becomes $q(\zeta; \phi) = \mathcal{N}(\zeta; \mu, LL^\top)$, where the variational parameters φ = (µ, L) are unconstrained in $\mathbb{R}^{K + K(K+1)/2}$.
The full-rank Gaussian generalizes the mean-field Gaussian approximation. The off-diagonal terms in the covariance matrix Σ capture posterior correlations across latent random variables.⁵ This leads to a more accurate posterior approximation than the mean-field Gaussian; however, it comes at a computational cost. Various low-rank approximations to the covariance matrix reduce this cost, yet limit its a…
⁴ Stan provides various transformations for upper and lower bounds, simplex and ordered vectors, and structured matrices such as covariance matrices and Cholesky factors.
[Figure 2: Specifying a simple nonconjugate probability model in Stan. Recoverable fragment of the listing: x[n] ~ poisson(theta);]
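The two parameterizations can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names are hypothetical, and matrices are stored as nested lists to keep the example dependency-free. It shows how a draw is produced from unconstrained parameters in each case, how Σ is recovered from L, and how many free parameters each family has.

```python
import math
import random

def sample_mean_field(mu, omega, rng):
    """Draw zeta ~ N(mu, diag(exp(omega)^2)) as mu + exp(omega) * eta, eta ~ N(0, I)."""
    return [m + math.exp(w) * rng.gauss(0.0, 1.0) for m, w in zip(mu, omega)]

def sample_full_rank(mu, L, rng):
    """Draw zeta ~ N(mu, L L^T) as mu + L eta, with L lower-triangular."""
    K = len(mu)
    eta = [rng.gauss(0.0, 1.0) for _ in range(K)]
    return [mu[i] + sum(L[i][j] * eta[j] for j in range(i + 1)) for i in range(K)]

def covariance_from_cholesky(L):
    """Sigma = L L^T; positive semidefinite for any real lower-triangular L."""
    K = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(min(i, j) + 1)) for j in range(K)]
            for i in range(K)]

def n_params_mean_field(K):
    """K means plus K log standard deviations."""
    return 2 * K

def n_params_full_rank(K):
    """K means plus the K(K+1)/2 entries of the lower triangle of L."""
    return K + K * (K + 1) // 2

rng = random.Random(0)
L = [[1.0, 0.0], [0.5, -2.0]]  # a negative diagonal entry is allowed here
print(sample_mean_field([0.0, 1.0], [-1.0, 0.5], rng))
print(sample_full_rank([0.0, 1.0], L, rng))
print(covariance_from_cholesky(L))
print(n_params_mean_field(2), n_params_full_rank(2))
```

Because any real values of ω and L yield a valid Gaussian, a gradient step never has to be projected back onto a constraint set.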
¤ ELBO
¤ Reparameterization trick
2.5 The Variational Problem in Real Coordinate Space
Here is the story so far. We began with a differentiable probability model p(x,θ). We transformed the latent variables into ζ, which live in the real coordinate space. We defined variational approximations in the transformed space. Now, we consider the variational optimization problem.
Write the variational objective function, the ELBO, in real coordinate space as
$$ \mathcal{L}(\phi) = \mathbb{E}_{q(\zeta;\phi)}\Big[ \log p\big(x, T^{-1}(\zeta)\big) + \log \big|\det J_{T^{-1}}(\zeta)\big| \Big] + \mathbb{H}\big[q(\zeta; \phi)\big]. \tag{5} $$
The inverse of the transformation $T^{-1}$ appears in the joint model, along with the determinant of the Jacobian adjustment. The ELBO is a function of the variational parameters φ and the entropy $\mathbb{H}$, both of which depend on the variational approximation. (Derivation in Appendix B.)
Now, we can freely optimize the ELBO in the real coordinate space without worrying about the support matching constraint. The optimization problem from Equation (3) becomes
$$ \phi^* = \arg\max_{\phi}\, \mathcal{L}(\phi) \tag{6} $$
where the parameter vector φ lives in some appropriately dimensioned real coordinate space. This is an unconstrained optimization problem that we can solve using gradient ascent. Traditionally, this would require manual computation of gradients. Instead, we develop a stochastic gradient ascent algorithm that uses automatic differentiation to compute gradients and MC integration to approximate expectations.
We cannot directly use automatic differentiation on the ELBO. This is because the ELBO involves an unknown expectation. However, we can automatically differentiate the functions inside the expectation. (The model p and transformation T are both easy to represent as computer functions (Baydin et al., 2015).) To apply automatic differentiation, we want to push the gradient operation inside the expectation. To this end, we employ one final transformation: elliptical standardization⁶ (Härdle and Simar, 2012).
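To make the "unknown expectation" concrete, here is a hedged sketch of a Monte Carlo estimate of the ELBO in real coordinate space for the one-dimensional Weibull-Poisson example with a Gaussian approximation. The observations, the shape/scale reading of Weibull(·; 1.5, 1), and all function names are assumptions made for illustration.

```python
import math
import random

def log_joint_zeta(xs, zeta):
    """log p(x, zeta) for the Weibull-Poisson example, including the log Jacobian term zeta."""
    theta = math.exp(zeta)
    log_lik = sum(x * zeta - theta - math.lgamma(x + 1) for x in xs)  # Poisson(x | theta)
    log_prior = math.log(1.5) + 0.5 * zeta - theta ** 1.5             # Weibull(theta; 1.5, 1), assumed shape/scale
    return log_lik + log_prior + zeta

def gaussian_entropy(omega):
    """Entropy of a one-dimensional Gaussian with standard deviation exp(omega)."""
    return omega + 0.5 * math.log(2.0 * math.pi * math.e)

def elbo_estimate(xs, mu, omega, n_samples, rng):
    """Average the log joint over draws from q(zeta; mu, omega), then add the analytic entropy."""
    total = 0.0
    for _ in range(n_samples):
        zeta = rng.gauss(mu, math.exp(omega))
        total += log_joint_zeta(xs, zeta)
    return total / n_samples + gaussian_entropy(omega)

xs = [2, 3, 1]  # hypothetical observations
rng = random.Random(0)
print(elbo_estimate(xs, 0.5, math.log(0.3), 2000, rng))   # approximation near the posterior mass
print(elbo_estimate(xs, -3.0, math.log(0.3), 2000, rng))  # approximation far from it
```

The estimate itself is easy to compute; the difficulty the text describes is differentiating it with respect to µ and ω, because the samples themselves depend on those parameters.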
Elliptical standardization. Consider a transformation $S_\phi$ that absorbs the variational parameters φ; this converts the Gaussian variational approximation into a standard Gaussian. In the mean-field case, the standardization is $\eta = S_\phi(\zeta) = \mathrm{diag}(\exp(\omega))^{-1}(\zeta - \mu)$. In the full-rank Gaussian, the standardization is $\eta = S_\phi(\zeta) = L^{-1}(\zeta - \mu)$.
In both cases, the standardization encapsulates the variational parameters; in return it gives a fixed variational density
$$ q(\eta) = \mathcal{N}(\eta; 0, I) = \prod_{k=1}^{K} \mathcal{N}(\eta_k; 0, 1), $$
as shown in Figure 3.
The standardization transforms the variational problem from Equation (5) into
$$ \phi^* = \arg\max_{\phi}\, \mathbb{E}_{\mathcal{N}(\eta; 0, I)}\Big[ \log p\big(x, T^{-1}(S_\phi^{-1}(\eta))\big) + \log \big|\det J_{T^{-1}}\big(S_\phi^{-1}(\eta)\big)\big| \Big] + \mathbb{H}\big[q(\zeta; \phi)\big]. $$
The expectation is now in terms of a standard Gaussian density. The Jacobian of elliptical standardization evaluates to one, because the Gaussian distribution is a member of the location-scale family: standardizing a Gaussian gives another Gaussian distribution. (See Appendix A.)
We do not need to transform the entropy term as it does not depend on the model or the transformation; we have a simple analytic form for the entropy of a Gaussian and its gradient. We implement these once and reuse for all models.
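Once the expectation is over a fixed standard Gaussian, the gradient can be moved inside it and estimated by sampling η. The sketch below illustrates this for the one-dimensional Weibull-Poisson example. It is an assumption-laden toy rather than the ADVI implementation: the gradient of the log joint is derived by hand as a stand-in for automatic differentiation, the step size is a fixed constant rather than an adaptive sequence, and the data and names are hypothetical.

```python
import math
import random

def grad_log_joint_zeta(xs, zeta):
    """d/d zeta of log p(x, zeta); hand-derived stand-in for automatic differentiation."""
    theta = math.exp(zeta)
    # likelihood: sum(x) - N * theta; Weibull(1.5, 1) prior: 0.5 - 1.5 * theta^1.5; Jacobian: 1
    return sum(xs) - len(xs) * theta + 0.5 - 1.5 * theta ** 1.5 + 1.0

def advi_gradients(xs, mu, omega, n_samples, rng):
    """Monte Carlo gradients of the ELBO with respect to mu and omega via standardization."""
    g_mu, g_omega = 0.0, 0.0
    for _ in range(n_samples):
        eta = rng.gauss(0.0, 1.0)           # eta ~ N(0, 1), independent of the parameters
        zeta = mu + math.exp(omega) * eta    # zeta = S_phi^{-1}(eta)
        g = grad_log_joint_zeta(xs, zeta)
        g_mu += g                            # chain rule: d zeta / d mu = 1
        g_omega += g * eta * math.exp(omega) # chain rule: d zeta / d omega = eta * exp(omega)
    # The + 1.0 is the gradient of the Gaussian entropy with respect to omega.
    return g_mu / n_samples, g_omega / n_samples + 1.0

def fit(xs, n_steps=5000, step_size=0.005, n_samples=1, seed=0):
    """Plain stochastic gradient ascent on (mu, omega) with a constant step size."""
    rng = random.Random(seed)
    mu, omega = 0.0, -1.0
    for _ in range(n_steps):
        g_mu, g_omega = advi_gradients(xs, mu, omega, n_samples, rng)
        mu += step_size * g_mu
        omega += step_size * g_omega
    return mu, omega

mu_hat, omega_hat = fit([2, 3, 1])  # hypothetical observations
print(mu_hat, math.exp(omega_hat))
```

Mapping back through θ = exp(ζ) turns the fitted Gaussian on ζ into a log-normal approximation on θ, which is the implicitly induced non-Gaussian distribution mentioned in Section 2.4.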
¤ Black-box variational inference [Ranganath+ 2014]
¤
¤ ADVI likelihood ratio trick
¤ reparameterization trick
3.2 Variance of the Stochastic Gradients
ADVI uses Monte Carlo integration to approximate gradients of the ELBO, and then uses these gradients
in a stochastic optimization algorithm (Section 2). The speed of ADVI hinges on the variance of the
gradient estimates. When a stochastic optimization algorithm suffers from high-variance gradients, it
must repeatedly recover from poor parameter estimates.
ADVI is not the only way to compute Monte Carlo approximations of the gradient of the ELBO. Black
box variational inference (BBVI) takes a different approach (Ranganath et al., 2014). The BBVI gradient
estimator uses the gradient of the variational approximation and avoids using the gradient of the model.
For example, the following BBVI estimator
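$$ \nabla_\mu^{\mathrm{BBVI}} \mathcal{L} = \mathbb{E}_{q(\zeta;\phi)}\Big[ \nabla_\mu \log q(\zeta; \phi) \Big\{ \log p\big(x, T^{-1}(\zeta)\big) + \log \big|\det J_{T^{-1}}(\zeta)\big| - \log q(\zeta; \phi) \Big\} \Big] $$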
and the ADVI gradient estimator in Equation (7) both lead to unbiased estimates of the exact gradient.
While BBVI is more general—it does not require the gradient of the model and thus applies to more
settings—its gradients can suffer from high variance.
Figure 8 empirically compares the variance of both estimators for two models. Figure 8a shows the variance of both gradient estimators for a simple univariate model, where the posterior is a Gamma(10,10). We estimate the variance using ten thousand re-calculations of the gradient $\nabla_\phi \mathcal{L}$, across an increasing number of MC samples M. The ADVI gradient has lower variance; in practice, a single sample suffices. (See the experiments in Section 4.)
Figure 8b shows the same calculation for a 100-dimensional nonlinear regression model with likelihood $\mathcal{N}(y \mid \tanh(x^\top \beta), I)$ and a Gaussian prior on the regression coefficients β. Because this is a multivariate example, we also show the BBVI gradient with a variance reduction scheme using control variates described in Ranganath et al. (2014). In both cases, the ADVI gradients are statistically more efficient.
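The difference between the two estimators can be reproduced on a toy problem. The sketch below is not the experiment from Figure 8: it assumes the one-dimensional Weibull-Poisson model with hypothetical observations, a fixed Gaussian approximation with µ = 0 and σ = 0.5, and single-sample estimates of the gradient with respect to µ. Both estimators target the same gradient; only their spread differs.

```python
import math
import random

XS = [2, 3, 1]  # hypothetical observations

def log_joint_zeta(zeta):
    """log p(x, zeta) including the log Jacobian term; Weibull(1.5, 1) is an assumed reading."""
    theta = math.exp(zeta)
    log_lik = sum(x * zeta - theta - math.lgamma(x + 1) for x in XS)
    return log_lik + math.log(1.5) + 0.5 * zeta - theta ** 1.5 + zeta

def grad_log_joint_zeta(zeta):
    """Hand-derived derivative of log_joint_zeta with respect to zeta."""
    theta = math.exp(zeta)
    return sum(XS) - len(XS) * theta + 0.5 - 1.5 * theta ** 1.5 + 1.0

def log_q(zeta, mu, omega):
    """Log density of N(zeta; mu, exp(omega)^2)."""
    return -0.5 * math.log(2.0 * math.pi) - omega - 0.5 * ((zeta - mu) / math.exp(omega)) ** 2

def reparam_sample(mu, omega, rng):
    """One draw of the reparameterization (ADVI-style) estimate of d ELBO / d mu."""
    eta = rng.gauss(0.0, 1.0)
    return grad_log_joint_zeta(mu + math.exp(omega) * eta)

def score_function_sample(mu, omega, rng):
    """One draw of the score-function (BBVI-style) estimate of d ELBO / d mu."""
    sigma = math.exp(omega)
    zeta = rng.gauss(mu, sigma)
    score = (zeta - mu) / sigma ** 2  # gradient of log q with respect to mu
    return score * (log_joint_zeta(zeta) - log_q(zeta, mu, omega))

def mean_and_variance(values):
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)

def compare(mu=0.0, omega=math.log(0.5), n=20000, seed=0):
    rng = random.Random(seed)
    reparam = [reparam_sample(mu, omega, rng) for _ in range(n)]
    score = [score_function_sample(mu, omega, rng) for _ in range(n)]
    return mean_and_variance(reparam), mean_and_variance(score)

(mean_r, var_r), (mean_s, var_s) = compare()
print("reparameterization:", mean_r, var_r)
print("score function:    ", mean_s, var_s)
```

The score-function estimator never calls `grad_log_joint_zeta`, which is why it applies to more settings; the price in this toy is a visibly larger variance for the same number of samples.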
[Figure 8 panels: (a) Univariate Model; (b) Multivariate Nonlinear Regression Model. Each plots variance against the number of MC samples on logarithmic axes. Legend: ADVI, BBVI, BBVI with control variate.]
Figure 8: Comparison of gradient estimator variances. The ADVI gradient estimator exhibits lower
variance than the BBVI estimator. Moreover, it does not require control variate variance reduction, which
is not available in univariate situations.
¤ Stan
¤
¤ MCMC NUTS[Hoffman+ 2014] HMC
¤ Stan python R
¤ ADVI
¤ PyMC3
¤ Python MCMC
¤ Theano GPU
¤ ADVI
¤ Edward
¤
¤ criticism
¤ Python Tensorflow Keras
¤ Stan PyMC3 35x [Tran+ 2016]
¤ Tars
¤ https://github.com/masa-su/Tars
¤ Edward Tran PyMC3 Wiecki star
¤
¤
¤ Edward PyMC3
¤
¤
¤ Q Tars
¤ A

More Related Content

What's hot

基礎からのベイズ統計学 輪読会資料 第4章 メトロポリス・ヘイスティングス法
基礎からのベイズ統計学 輪読会資料 第4章 メトロポリス・ヘイスティングス法基礎からのベイズ統計学 輪読会資料 第4章 メトロポリス・ヘイスティングス法
基礎からのベイズ統計学 輪読会資料 第4章 メトロポリス・ヘイスティングス法Ken'ichi Matsui
 
ベルヌーイ分布からベータ分布までを関係づける
ベルヌーイ分布からベータ分布までを関係づけるベルヌーイ分布からベータ分布までを関係づける
ベルヌーイ分布からベータ分布までを関係づけるitoyan110
 
負の二項分布について
負の二項分布について負の二項分布について
負の二項分布についてHiroshi Shimizu
 
Oracle property and_hdm_pkg_rigorouslasso
Oracle property and_hdm_pkg_rigorouslassoOracle property and_hdm_pkg_rigorouslasso
Oracle property and_hdm_pkg_rigorouslassoSatoshi Kato
 
深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)Masahiro Suzuki
 
PRML輪読#8
PRML輪読#8PRML輪読#8
PRML輪読#8matsuolab
 
今さら聞けないカーネル法とサポートベクターマシン
今さら聞けないカーネル法とサポートベクターマシン今さら聞けないカーネル法とサポートベクターマシン
今さら聞けないカーネル法とサポートベクターマシンShinya Shimizu
 
クラシックな機械学習の入門  9. モデル推定
クラシックな機械学習の入門  9. モデル推定クラシックな機械学習の入門  9. モデル推定
クラシックな機械学習の入門  9. モデル推定Hiroshi Nakagawa
 
PRML 上 2.3.6 ~ 2.5.2
PRML 上 2.3.6 ~ 2.5.2PRML 上 2.3.6 ~ 2.5.2
PRML 上 2.3.6 ~ 2.5.2禎晃 山崎
 
PRML輪読#2
PRML輪読#2PRML輪読#2
PRML輪読#2matsuolab
 
パターン認識と機械学習(PRML)第2章 確率分布 2.3 ガウス分布
パターン認識と機械学習(PRML)第2章 確率分布 2.3 ガウス分布パターン認識と機械学習(PRML)第2章 確率分布 2.3 ガウス分布
パターン認識と機械学習(PRML)第2章 確率分布 2.3 ガウス分布Nagayoshi Yamashita
 
最適輸送の計算アルゴリズムの研究動向
最適輸送の計算アルゴリズムの研究動向最適輸送の計算アルゴリズムの研究動向
最適輸送の計算アルゴリズムの研究動向ohken
 
PRML第6章「カーネル法」
PRML第6章「カーネル法」PRML第6章「カーネル法」
PRML第6章「カーネル法」Keisuke Sugawara
 
機械学習による統計的実験計画(ベイズ最適化を中心に)
機械学習による統計的実験計画(ベイズ最適化を中心に)機械学習による統計的実験計画(ベイズ最適化を中心に)
機械学習による統計的実験計画(ベイズ最適化を中心に)Kota Matsui
 
5分で分かる自己組織化マップ
5分で分かる自己組織化マップ5分で分かる自己組織化マップ
5分で分かる自己組織化マップDaisuke Takai
 
最適化超入門
最適化超入門最適化超入門
最適化超入門Takami Sato
 
ベイズ深層学習5章 ニューラルネットワークのベイズ推論 Bayesian deep learning
ベイズ深層学習5章 ニューラルネットワークのベイズ推論 Bayesian deep learningベイズ深層学習5章 ニューラルネットワークのベイズ推論 Bayesian deep learning
ベイズ深層学習5章 ニューラルネットワークのベイズ推論 Bayesian deep learningssuserca2822
 

What's hot (20)

基礎からのベイズ統計学 輪読会資料 第4章 メトロポリス・ヘイスティングス法
基礎からのベイズ統計学 輪読会資料 第4章 メトロポリス・ヘイスティングス法基礎からのベイズ統計学 輪読会資料 第4章 メトロポリス・ヘイスティングス法
基礎からのベイズ統計学 輪読会資料 第4章 メトロポリス・ヘイスティングス法
 
ベルヌーイ分布からベータ分布までを関係づける
ベルヌーイ分布からベータ分布までを関係づけるベルヌーイ分布からベータ分布までを関係づける
ベルヌーイ分布からベータ分布までを関係づける
 
負の二項分布について
負の二項分布について負の二項分布について
負の二項分布について
 
Oracle property and_hdm_pkg_rigorouslasso
Oracle property and_hdm_pkg_rigorouslassoOracle property and_hdm_pkg_rigorouslasso
Oracle property and_hdm_pkg_rigorouslasso
 
深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)
 
PRML輪読#8
PRML輪読#8PRML輪読#8
PRML輪読#8
 
今さら聞けないカーネル法とサポートベクターマシン
今さら聞けないカーネル法とサポートベクターマシン今さら聞けないカーネル法とサポートベクターマシン
今さら聞けないカーネル法とサポートベクターマシン
 
Rによるベイジアンネットワーク入門
Rによるベイジアンネットワーク入門Rによるベイジアンネットワーク入門
Rによるベイジアンネットワーク入門
 
クラシックな機械学習の入門  9. モデル推定
クラシックな機械学習の入門  9. モデル推定クラシックな機械学習の入門  9. モデル推定
クラシックな機械学習の入門  9. モデル推定
 
PRML 上 2.3.6 ~ 2.5.2
PRML 上 2.3.6 ~ 2.5.2PRML 上 2.3.6 ~ 2.5.2
PRML 上 2.3.6 ~ 2.5.2
 
PRML輪読#2
PRML輪読#2PRML輪読#2
PRML輪読#2
 
Prml 2.3
Prml 2.3Prml 2.3
Prml 2.3
 
線形計画法入門
線形計画法入門線形計画法入門
線形計画法入門
 
パターン認識と機械学習(PRML)第2章 確率分布 2.3 ガウス分布
パターン認識と機械学習(PRML)第2章 確率分布 2.3 ガウス分布パターン認識と機械学習(PRML)第2章 確率分布 2.3 ガウス分布
パターン認識と機械学習(PRML)第2章 確率分布 2.3 ガウス分布
 
最適輸送の計算アルゴリズムの研究動向
最適輸送の計算アルゴリズムの研究動向最適輸送の計算アルゴリズムの研究動向
最適輸送の計算アルゴリズムの研究動向
 
PRML第6章「カーネル法」
PRML第6章「カーネル法」PRML第6章「カーネル法」
PRML第6章「カーネル法」
 
機械学習による統計的実験計画(ベイズ最適化を中心に)
機械学習による統計的実験計画(ベイズ最適化を中心に)機械学習による統計的実験計画(ベイズ最適化を中心に)
機械学習による統計的実験計画(ベイズ最適化を中心に)
 
5分で分かる自己組織化マップ
5分で分かる自己組織化マップ5分で分かる自己組織化マップ
5分で分かる自己組織化マップ
 
最適化超入門
最適化超入門最適化超入門
最適化超入門
 
ベイズ深層学習5章 ニューラルネットワークのベイズ推論 Bayesian deep learning
ベイズ深層学習5章 ニューラルネットワークのベイズ推論 Bayesian deep learningベイズ深層学習5章 ニューラルネットワークのベイズ推論 Bayesian deep learning
ベイズ深層学習5章 ニューラルネットワークのベイズ推論 Bayesian deep learning
 

Viewers also liked

確率的プログラミングライブラリEdward
確率的プログラミングライブラリEdward確率的プログラミングライブラリEdward
確率的プログラミングライブラリEdwardYuta Kashino
 
データ解析のための統計モデリング入門10章前半
データ解析のための統計モデリング入門10章前半データ解析のための統計モデリング入門10章前半
データ解析のための統計モデリング入門10章前半Shinya Akiba
 
研究者の研究履歴による学術の動向の把握とその予測 (第11回データマイニング+WEB@東京)
研究者の研究履歴による学術の動向の把握とその予測 (第11回データマイニング+WEB@東京)研究者の研究履歴による学術の動向の把握とその予測 (第11回データマイニング+WEB@東京)
研究者の研究履歴による学術の動向の把握とその予測 (第11回データマイニング+WEB@東京)Nagayoshi Yamashita
 
広告プラットフォーム立ち上げ百鬼夜行
広告プラットフォーム立ち上げ百鬼夜行広告プラットフォーム立ち上げ百鬼夜行
広告プラットフォーム立ち上げ百鬼夜行Takahiro Ogoshi
 
Jap2017 ss65 優しいベイズ統計への導入法
Jap2017 ss65 優しいベイズ統計への導入法Jap2017 ss65 優しいベイズ統計への導入法
Jap2017 ss65 優しいベイズ統計への導入法考司 小杉
 
Lyric Jumper:アーティストごとの歌詞トピックの傾向に基づく歌詞探索サービス
Lyric Jumper:アーティストごとの歌詞トピックの傾向に基づく歌詞探索サービスLyric Jumper:アーティストごとの歌詞トピックの傾向に基づく歌詞探索サービス
Lyric Jumper:アーティストごとの歌詞トピックの傾向に基づく歌詞探索サービスKosetsu Tsukuda
 
もしその単語がなかったら
もしその単語がなかったらもしその単語がなかったら
もしその単語がなかったらHiroshi Nakagawa
 
PoisoningAttackSVM (ICMLreading2012)
PoisoningAttackSVM (ICMLreading2012)PoisoningAttackSVM (ICMLreading2012)
PoisoningAttackSVM (ICMLreading2012)Hidekazu Oiwa
 
[DL輪読会]Learning by Association - A versatile semi-supervised training method ...
[DL輪読会]Learning by Association - A versatile semi-supervised training method ...[DL輪読会]Learning by Association - A versatile semi-supervised training method ...
[DL輪読会]Learning by Association - A versatile semi-supervised training method ...Deep Learning JP
 
多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)
多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)
多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)STAIR Lab, Chiba Institute of Technology
 
20171024NL研報告スライド
20171024NL研報告スライド20171024NL研報告スライド
20171024NL研報告スライドMasatoshi TSUCHIYA
 
[DLHacks LT] PytorchのDataLoader -torchtextのソースコードを読んでみた-
[DLHacks LT] PytorchのDataLoader -torchtextのソースコードを読んでみた-[DLHacks LT] PytorchのDataLoader -torchtextのソースコードを読んでみた-
[DLHacks LT] PytorchのDataLoader -torchtextのソースコードを読んでみた-Deep Learning JP
 
深層学習の判断根拠を理解するための 研究とその意義 @PRMU 2017熊本
深層学習の判断根拠を理解するための 研究とその意義 @PRMU 2017熊本深層学習の判断根拠を理解するための 研究とその意義 @PRMU 2017熊本
深層学習の判断根拠を理解するための 研究とその意義 @PRMU 2017熊本Takahiro Kubo
 
日本音響学会2017秋 ビギナーズセミナー "深層学習を深く学習するための基礎"
日本音響学会2017秋 ビギナーズセミナー "深層学習を深く学習するための基礎"日本音響学会2017秋 ビギナーズセミナー "深層学習を深く学習するための基礎"
日本音響学会2017秋 ビギナーズセミナー "深層学習を深く学習するための基礎"Shinnosuke Takamichi
 
マッチングサービスにおけるKPIの話
マッチングサービスにおけるKPIの話マッチングサービスにおけるKPIの話
マッチングサービスにおけるKPIの話cyberagent
 
アドテクスタジオのデータ分析基盤について
アドテクスタジオのデータ分析基盤についてアドテクスタジオのデータ分析基盤について
アドテクスタジオのデータ分析基盤についてkazuhiro ito
 
VentureCafe_第2回:SIerでのキャリアパスを考える_ござ先輩発表資料 V1.0
VentureCafe_第2回:SIerでのキャリアパスを考える_ござ先輩発表資料 V1.0VentureCafe_第2回:SIerでのキャリアパスを考える_ござ先輩発表資料 V1.0
VentureCafe_第2回:SIerでのキャリアパスを考える_ござ先輩発表資料 V1.0Michitaka Yumoto
 
StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~
StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~
StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~nocchi_airport
 
SwiftでRiemann球面を扱う
SwiftでRiemann球面を扱うSwiftでRiemann球面を扱う
SwiftでRiemann球面を扱うhayato iida
 

Viewers also liked (20)

確率的プログラミングライブラリEdward
確率的プログラミングライブラリEdward確率的プログラミングライブラリEdward
確率的プログラミングライブラリEdward
 
データ解析のための統計モデリング入門10章前半
データ解析のための統計モデリング入門10章前半データ解析のための統計モデリング入門10章前半
データ解析のための統計モデリング入門10章前半
 
研究者の研究履歴による学術の動向の把握とその予測 (第11回データマイニング+WEB@東京)
研究者の研究履歴による学術の動向の把握とその予測 (第11回データマイニング+WEB@東京)研究者の研究履歴による学術の動向の把握とその予測 (第11回データマイニング+WEB@東京)
研究者の研究履歴による学術の動向の把握とその予測 (第11回データマイニング+WEB@東京)
 
広告プラットフォーム立ち上げ百鬼夜行
広告プラットフォーム立ち上げ百鬼夜行広告プラットフォーム立ち上げ百鬼夜行
広告プラットフォーム立ち上げ百鬼夜行
 
Jap2017 ss65 優しいベイズ統計への導入法
Jap2017 ss65 優しいベイズ統計への導入法Jap2017 ss65 優しいベイズ統計への導入法
Jap2017 ss65 優しいベイズ統計への導入法
 
Lyric Jumper:アーティストごとの歌詞トピックの傾向に基づく歌詞探索サービス
Lyric Jumper:アーティストごとの歌詞トピックの傾向に基づく歌詞探索サービスLyric Jumper:アーティストごとの歌詞トピックの傾向に基づく歌詞探索サービス
Lyric Jumper:アーティストごとの歌詞トピックの傾向に基づく歌詞探索サービス
 
もしその単語がなかったら
もしその単語がなかったらもしその単語がなかったら
もしその単語がなかったら
 
PoisoningAttackSVM (ICMLreading2012)
PoisoningAttackSVM (ICMLreading2012)PoisoningAttackSVM (ICMLreading2012)
PoisoningAttackSVM (ICMLreading2012)
 
[DL輪読会]Learning by Association - A versatile semi-supervised training method ...
[DL輪読会]Learning by Association - A versatile semi-supervised training method ...[DL輪読会]Learning by Association - A versatile semi-supervised training method ...
[DL輪読会]Learning by Association - A versatile semi-supervised training method ...
 
多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)
多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)
多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)
 
20171024NL研報告スライド
20171024NL研報告スライド20171024NL研報告スライド
20171024NL研報告スライド
 
[DLHacks LT] PytorchのDataLoader -torchtextのソースコードを読んでみた-
[DLHacks LT] PytorchのDataLoader -torchtextのソースコードを読んでみた-[DLHacks LT] PytorchのDataLoader -torchtextのソースコードを読んでみた-
[DLHacks LT] PytorchのDataLoader -torchtextのソースコードを読んでみた-
 
深層学習の判断根拠を理解するための 研究とその意義 @PRMU 2017熊本
深層学習の判断根拠を理解するための 研究とその意義 @PRMU 2017熊本深層学習の判断根拠を理解するための 研究とその意義 @PRMU 2017熊本
深層学習の判断根拠を理解するための 研究とその意義 @PRMU 2017熊本
 
日本音響学会2017秋 ビギナーズセミナー "深層学習を深く学習するための基礎"
日本音響学会2017秋 ビギナーズセミナー "深層学習を深く学習するための基礎"日本音響学会2017秋 ビギナーズセミナー "深層学習を深く学習するための基礎"
日本音響学会2017秋 ビギナーズセミナー "深層学習を深く学習するための基礎"
 
マッチングサービスにおけるKPIの話
マッチングサービスにおけるKPIの話マッチングサービスにおけるKPIの話
マッチングサービスにおけるKPIの話
 
アドテクスタジオのデータ分析基盤について
アドテクスタジオのデータ分析基盤についてアドテクスタジオのデータ分析基盤について
アドテクスタジオのデータ分析基盤について
 
Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017)
Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017)Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017)
Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017)
 
VentureCafe_第2回:SIerでのキャリアパスを考える_ござ先輩発表資料 V1.0
VentureCafe_第2回:SIerでのキャリアパスを考える_ござ先輩発表資料 V1.0VentureCafe_第2回:SIerでのキャリアパスを考える_ござ先輩発表資料 V1.0
VentureCafe_第2回:SIerでのキャリアパスを考える_ござ先輩発表資料 V1.0
 
StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~
StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~
StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~
 
SwiftでRiemann球面を扱う
SwiftでRiemann球面を扱うSwiftでRiemann球面を扱う
SwiftでRiemann球面を扱う
 

Similar to (DL hacks輪読)Bayesian Neural Network

Machine learning (10)
Machine learning (10)Machine learning (10)
Machine learning (10)NYversity
 
Fixed point theorem in fuzzy metric space with e.a property
Fixed point theorem in fuzzy metric space with e.a propertyFixed point theorem in fuzzy metric space with e.a property
Fixed point theorem in fuzzy metric space with e.a propertyAlexander Decker
 
Machine learning (2)
Machine learning (2)Machine learning (2)
Machine learning (2)NYversity
 
Proofs nearest rank
Proofs nearest rankProofs nearest rank
Proofs nearest rankfithisux
 
Fixed points theorem on a pair of random generalized non linear contractions
Fixed points theorem on a pair of random generalized non linear contractionsFixed points theorem on a pair of random generalized non linear contractions
Fixed points theorem on a pair of random generalized non linear contractionsAlexander Decker
 
Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceChristian Robert
 
Series_Solution_Methods_and_Special_Func.pdf
Series_Solution_Methods_and_Special_Func.pdfSeries_Solution_Methods_and_Special_Func.pdf
Series_Solution_Methods_and_Special_Func.pdfmohamedtawfik358886
 
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural NetworksMasahiro Suzuki
 
X01 Supervised learning problem linear regression one feature theorie
X01 Supervised learning problem linear regression one feature theorieX01 Supervised learning problem linear regression one feature theorie
X01 Supervised learning problem linear regression one feature theorieMarco Moldenhauer
 
Introduction to Evidential Neural Networks
Introduction to Evidential Neural NetworksIntroduction to Evidential Neural Networks
Introduction to Evidential Neural NetworksFederico Cerutti
 
Multivriada ppt ms
Multivriada   ppt msMultivriada   ppt ms
Multivriada ppt msFaeco Bot
 
International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI) International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI) inventionjournals
 

Similar to (DL hacks輪読)Bayesian Neural Network (20)

Lecture 3 - Linear Regression
Lecture 3 - Linear RegressionLecture 3 - Linear Regression
Lecture 3 - Linear Regression
 
Machine learning (10)
Machine learning (10)Machine learning (10)
Machine learning (10)
 
Fixed point theorem in fuzzy metric space with e.a property
Fixed point theorem in fuzzy metric space with e.a propertyFixed point theorem in fuzzy metric space with e.a property
Fixed point theorem in fuzzy metric space with e.a property
 
Machine learning (2)
Machine learning (2)Machine learning (2)
Machine learning (2)
 
Proofs nearest rank
Proofs nearest rankProofs nearest rank
Proofs nearest rank
 
Matching
MatchingMatching
Matching
 
Fixed points theorem on a pair of random generalized non linear contractions
Fixed points theorem on a pair of random generalized non linear contractionsFixed points theorem on a pair of random generalized non linear contractions
Fixed points theorem on a pair of random generalized non linear contractions
 
Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de France
 
lec8.ppt
lec8.pptlec8.ppt
lec8.ppt
 
506
506506
506
 
Series_Solution_Methods_and_Special_Func.pdf
Series_Solution_Methods_and_Special_Func.pdfSeries_Solution_Methods_and_Special_Func.pdf
Series_Solution_Methods_and_Special_Func.pdf
 
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
 
New test123
New test123New test123
New test123
 
Multivariate Methods Assignment Help
Multivariate Methods Assignment HelpMultivariate Methods Assignment Help
Multivariate Methods Assignment Help
 
X01 Supervised learning problem linear regression one feature theorie
X01 Supervised learning problem linear regression one feature theorieX01 Supervised learning problem linear regression one feature theorie
X01 Supervised learning problem linear regression one feature theorie
 
Introduction to Evidential Neural Networks
Introduction to Evidential Neural NetworksIntroduction to Evidential Neural Networks
Introduction to Evidential Neural Networks
 
Calculas
CalculasCalculas
Calculas
 
Parallel algorithm in linear algebra
Parallel algorithm in linear algebraParallel algorithm in linear algebra
Parallel algorithm in linear algebra
 
Multivriada ppt ms
Multivriada   ppt msMultivriada   ppt ms
Multivriada ppt ms
 
International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI) International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI)
 

More from Masahiro Suzuki

確率的推論と行動選択
確率的推論と行動選択確率的推論と行動選択
確率的推論と行動選択Masahiro Suzuki
 
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについてMasahiro Suzuki
 
深層生成モデルと世界モデル
深層生成モデルと世界モデル深層生成モデルと世界モデル
深層生成モデルと世界モデルMasahiro Suzuki
 
「世界モデル」と関連研究について
「世界モデル」と関連研究について「世界モデル」と関連研究について
「世界モデル」と関連研究についてMasahiro Suzuki
 
GAN(と強化学習との関係)
GAN(と強化学習との関係)GAN(と強化学習との関係)
GAN(と強化学習との関係)Masahiro Suzuki
 
深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習Masahiro Suzuki
 
(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot LearningMasahiro Suzuki
 
深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習Masahiro Suzuki
 
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...Masahiro Suzuki
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi DivergenceMasahiro Suzuki
 
(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman FiltersMasahiro Suzuki
 
(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural NetworksMasahiro Suzuki
 
(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel LearningMasahiro Suzuki
 
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...Masahiro Suzuki
 
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task LearningMasahiro Suzuki
 
(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target PropagationMasahiro Suzuki
 
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization TrickMasahiro Suzuki
 
(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?Masahiro Suzuki
 

More from Masahiro Suzuki (18)

確率的推論と行動選択
(DL Hacks reading group) Bayesian Neural Network

  • 2. ¤ Bayesian neural network ¤ NIPS http://bayesiandeeplearning.org ¤ Bayesian neural network ¤ Stan, PyMC3, Edward
  • 4.
  • 5. ¤ $D$, $m$, $\theta$ ¤ $p(\theta \mid m)$ (prior) ¤ $p(D \mid \theta, m)$ (likelihood) ¤ w ¤ $p(\theta \mid D, m)$ (posterior) ¤ $p(D \mid m)$ (evidence)
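  • 6. ¤ $D$ ¤ $D = \{x\} = X$: $p(D \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta) = p(X \mid \theta)$ ¤ $D = \{x, y\} = (X, y)$: $p(D \mid \theta) = \prod_{i=1}^{N} p(y_i \mid x_i, \theta) = p(y \mid X, \theta)$
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$$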
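  • 7. MAP
$$\theta_{\mathrm{MLE}} = \arg\max_{\theta} \log p(D \mid \theta) = \arg\max_{\theta} \sum_{i} \log p(x_i \mid \theta)$$
$$\theta_{\mathrm{MAP}} = \arg\max_{\theta} \log p(\theta \mid D) = \arg\max_{\theta} \left[ \log p(D \mid \theta) + \log p(\theta) \right]$$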
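The difference between the two estimates on this slide can be seen in a small worked case. The sketch below is only an illustration under assumed settings that are not in the slides: a Gaussian likelihood with known noise variance, a zero-mean Gaussian prior on the mean, and four made-up observations. The function names are hypothetical.

```python
import math

def log_posterior(theta, xs, noise_var=1.0, prior_var=1.0):
    """Unnormalised log p(theta | D) = log p(D | theta) + log p(theta) + const."""
    log_lik = -0.5 * sum((x - theta) ** 2 for x in xs) / noise_var
    log_prior = -0.5 * theta ** 2 / prior_var
    return log_lik + log_prior

def mle_mean(xs):
    """Maximiser of log p(D | theta): the sample mean."""
    return sum(xs) / len(xs)

def map_mean(xs, noise_var=1.0, prior_var=1.0):
    """Maximiser of log p(D | theta) + log p(theta) for the Gaussian model above."""
    n = len(xs)
    return (sum(xs) / noise_var) / (n / noise_var + 1.0 / prior_var)

xs = [1.8, 2.2, 2.0, 2.4]          # made-up observations
theta_mle = mle_mean(xs)           # 2.1
theta_map = map_mean(xs)           # pulled toward the prior mean 0
```

With these numbers the MAP estimate is smaller in magnitude than the MLE, because the prior term log p(θ) acts as a regulariser.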
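  • 8. ¤ $x^{*}$ ¤ $\theta$ ¤ $x^{*}$, $y^{*}$
$$p(x^{*} \mid D) = \int p(x^{*} \mid \theta)\, p(\theta \mid D)\, d\theta = \mathbb{E}_{p(\theta \mid D)}\left[ p(x^{*} \mid \theta) \right]$$
$$p(y^{*} \mid x^{*}, D) = \int p(y^{*} \mid x^{*}, \theta)\, p(\theta \mid D)\, d\theta = \mathbb{E}_{p(\theta \mid D)}\left[ p(y^{*} \mid x^{*}, \theta) \right]$$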
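The predictive distribution is an expectation over the posterior, so it can be approximated by averaging over posterior samples. The sketch below assumes a Beta-Bernoulli model, chosen only because its posterior and predictive are known in closed form; the model, the counts and the function names are illustrative and do not come from the slides.

```python
import random

def predictive_mc(k, n, a=1.0, b=1.0, num_samples=20000, seed=0):
    """Monte Carlo estimate of p(x*=1 | D) = E_{p(theta|D)}[p(x*=1 | theta)].

    D is k successes in n Bernoulli trials; the prior is Beta(a, b),
    so the posterior is Beta(a + k, b + n - k).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.betavariate(a + k, b + n - k)  # theta ~ p(theta | D)
        total += theta                             # p(x*=1 | theta) = theta
    return total / num_samples

def predictive_exact(k, n, a=1.0, b=1.0):
    """Closed-form value of the same integral (posterior mean of theta)."""
    return (a + k) / (a + b + n)

p_mc = predictive_mc(k=7, n=10)
p_exact = predictive_exact(k=7, n=10)
```

For a neural network no closed form exists, which is why the later slides turn to approximate posteriors.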
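  • 9.
$$p(x^{*} \mid D) = \int p(x^{*} \mid \theta)\, p(\theta \mid D)\, d\theta = \mathbb{E}_{p(\theta \mid D)}\left[ p(x^{*} \mid \theta) \right]$$
$$p(m \mid D) = \frac{p(D \mid m)\, p(m)}{p(D)}$$
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$$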
  • 10. ¤ $\theta$
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$$
  • 11. 1. MCMC ¤ $p(\theta \mid D)$ 2. ¤ $p(\theta \mid \phi)$, $q(\theta)$
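As a rough illustration of the MCMC route listed on this slide, the sketch below draws samples from p(θ|D) using only the unnormalised product p(D|θ)p(θ), so the intractable denominator p(D) is never computed. The random-walk Metropolis sampler, the Gaussian toy model, the data and all function names are assumptions made for the example; the slides do not specify a particular sampler.

```python
import math
import random

def log_unnormalized_posterior(theta, xs):
    """log p(D | theta) + log p(theta) for x_i ~ N(theta, 1), theta ~ N(0, 1)."""
    log_prior = -0.5 * theta * theta
    log_lik = -0.5 * sum((x - theta) ** 2 for x in xs)
    return log_prior + log_lik

def metropolis(xs, num_steps=20000, burn_in=2000, step=0.5, seed=0):
    """Random-walk Metropolis sampler targeting p(theta | D)."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_unnormalized_posterior(theta, xs)
    samples = []
    for t in range(num_steps):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_unnormalized_posterior(proposal, xs)
        # Accept with probability min(1, p(proposal | D) / p(theta | D)).
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < candidate - current:
            theta, current = proposal, candidate
        if t >= burn_in:
            samples.append(theta)
    return samples

xs = [1.8, 2.2, 2.0, 2.4]                      # made-up observations
samples = metropolis(xs)
posterior_mean_mc = sum(samples) / len(samples)
posterior_mean_exact = sum(xs) / (len(xs) + 1)  # known for this conjugate model
```

The conjugate model is used only so that the sampler can be checked against the exact posterior mean.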
  • 12.
  • 13. ¤ $\theta$, $p(\theta)$ ¤ $\theta$

[Figure 1 of "Weight Uncertainty in Neural Networks". Left: each weight has a fixed value, as provided by classical backpropagation. Right: each weight is assigned a distribution, as provided by Bayes by Backprop.]

"... is related to recent methods in deep, generative modelling (Kingma and Welling, 2014; Rezende et al., 2014; Gregor et al., 2014), where variational inference has been applied to stochastic hidden units of an autoencoder. Whilst the number of stochastic hidden units might be in the order of ..."

"the parameters of the categorical dis… through the exponential function then… regression Y is R and P(y|x, w) is a G… – this corresponds to a squared loss. Inputs x are mapped onto the param… tion on Y by several successive layers… tion (given by w) interleaved with elem… transforms. The weights can be learnt by maximum… tion (MLE): given a set of training exam… the MLE weights $w^{\mathrm{MLE}}$ are given by: $w^{\mathrm{MLE}} = \arg\max_{w} \log P(D \mid w$… $= \arg\max_{w} \sum_{i} \log P($… This is typically achieved by gradient…"

NN, Bayesian NN
$$p(y^{*} \mid D, x^{*}, \phi) = \int p(y^{*} \mid x^{*}, \theta)\, p(\theta \mid D, \phi)\, d\theta$$
[Blundell+ 2015]
  • 14. ¤ “The Importance of Knowing What We Don't Know (by Yarin Gal)” ¤ WHY SHOULD WE CARE? Calibrated model and prediction uncertainty: getting systems that know when they don’t know. Automatic model complexity control and structure learning (Bayesian Occam’s Razor). Figure from Yarin Gal’s thesis “Uncertainty in Deep Learning” (2016). Zoubin Ghahramani. http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
  • 15. ¤ stochastic neural network ¤ VAE ¤ https://jmhl.org/research/
  • 17.
  • 18. ¤ P Q ¤ α

PRML, 3.1.5 Multiple outputs

So far, we have considered the case of a single target variable $t$. In some applications, we may wish to predict $K > 1$ target variables, which we denote collectively by the target vector $\mathbf{t}$. This could be done by introducing a different set of basis functions for each component of $\mathbf{t}$, leading to multiple, independent regression problems. However, a more interesting, and more common, approach is to use the same set of basis functions to model all of the components of the target vector so that
$$\mathbf{y}(\mathbf{x}, \mathbf{w}) = \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) \qquad (3.31)$$
where $\mathbf{y}$ is a $K$-dimensional column vector, $\mathbf{W}$ is an $M \times K$ matrix of parameters, and $\boldsymbol{\phi}(\mathbf{x})$ is an $M$-dimensional column vector with elements $\phi_j(\mathbf{x})$, with $\phi_0(\mathbf{x}) = 1$ as before. Suppose we take the conditional distribution of the target vector to be an isotropic Gaussian of the form
$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}(\mathbf{t} \mid \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}), \beta^{-1} \mathbf{I}). \qquad (3.32)$$
If we have a set of observations $\mathbf{t}_1, \ldots, \mathbf{t}_N$, we can combine these into a matrix $\mathbf{T}$ of size $N \times K$ such that the $n$th row is given by $\mathbf{t}_n^{\mathrm{T}}$. Similarly, we can combine the input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ into a matrix $\mathbf{X}$. The log likelihood function is then given by
$$\ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{t}_n \mid \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1} \mathbf{I}) = \frac{NK}{2} \ln\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2} \sum_{n=1}^{N} \left\| \mathbf{t}_n - \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\|^2. \qquad (3.33)$$

PRML, 3.3 Bayesian Linear Regression (p. 153)

Next we compute the posterior distribution, which is proportional to the product of the likelihood function and the prior. Due to the choice of a conjugate Gaussian prior distribution, the posterior will also be Gaussian. We can evaluate this distribution by the usual procedure of completing the square in the exponential, and then finding the normalization coefficient using the standard result for a normalized Gaussian. However, we have already done the necessary work in deriving the general result (2.116), which allows us to write down the posterior distribution directly in the form
$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N) \qquad (3.49)$$
where
$$\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \right) \qquad (3.50)$$
$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}. \qquad (3.51)$$
Note that because the posterior distribution is Gaussian, its mode coincides with its mean. Thus the maximum posterior weight vector is simply given by $\mathbf{w}_{\mathrm{MAP}} = \mathbf{m}_N$. If we consider an infinitely broad prior $\mathbf{S}_0 = \alpha^{-1} \mathbf{I}$ with $\alpha \to 0$, the mean $\mathbf{m}_N$ of the posterior distribution reduces to the maximum likelihood value $\mathbf{w}_{\mathrm{ML}}$ given by (3.15). Similarly, if $N = 0$, then the posterior distribution reverts to the prior. Furthermore, if data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the subsequent data point, such that the new posterior distribution is again given by (3.49).

For the remainder of this chapter, we shall consider a particular form of Gaussian prior in order to simplify the treatment. Specifically, we consider a zero-mean isotropic Gaussian governed by a single precision parameter $\alpha$ so that
$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}) \qquad (3.52)$$
and the corresponding posterior distribution over $\mathbf{w}$ is then given by (3.49) with
$$\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \qquad (3.53)$$
$$\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}. \qquad (3.54)$$
The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior and, as a function of $\mathbf{w}$, takes the form
$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\}^2 - \frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w} + \mathrm{const}. \qquad (3.55)$$
Maximization of this posterior distribution with respect to $\mathbf{w}$ is therefore equiva-

PRML
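The posterior of equations (3.53) and (3.54) can be evaluated directly for a small model. The sketch below assumes two basis functions, φ(x) = (1, x), so that the 2 × 2 matrix inverse can be written out by hand; the basis, the data and the function name are illustrative choices, not part of the slides or of PRML.

```python
def blr_posterior(xs, ts, alpha, beta):
    """Posterior mean m_N and covariance S_N for basis phi(x) = (1, x).

    S_N^{-1} = alpha * I + beta * Phi^T Phi    (3.54)
    m_N      = beta * S_N Phi^T t              (3.53)
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    st = sum(ts)
    sxt = sum(x * t for x, t in zip(xs, ts))
    # Entries of the symmetric 2x2 precision matrix S_N^{-1}.
    a11 = alpha + beta * n
    a12 = beta * sx
    a22 = alpha + beta * sxx
    det = a11 * a22 - a12 * a12
    s_n = [[a22 / det, -a12 / det], [-a12 / det, a11 / det]]
    # beta * Phi^T t
    b1, b2 = beta * st, beta * sxt
    m_n = [s_n[0][0] * b1 + s_n[0][1] * b2,
           s_n[1][0] * b1 + s_n[1][1] * b2]
    return m_n, s_n

xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]          # generated from t = 1 + 2x without noise
m_broad, _ = blr_posterior(xs, ts, alpha=1e-6, beta=25.0)   # nearly flat prior
m_empty, s_empty = blr_posterior([], [], alpha=2.0, beta=25.0)  # N = 0
```

The two calls mirror the limiting cases described in the excerpt: a very broad prior gives a mean close to the least-squares fit, and with no data the posterior is the prior N(0, α⁻¹I).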
  • 19. PRML, 3.3.2 Predictive distribution

... equal to the mean, although this will no longer hold if $q \neq 2$.

In practice, we are not usually interested in the value of $\mathbf{w}$ itself but rather in making predictions of $t$ for new values of $\mathbf{x}$. This requires that we evaluate the predictive distribution defined by
$$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} \qquad (3.57)$$
in which $\mathbf{t}$ is the vector of target values from the training set, and we have omitted the corresponding input vectors from the right-hand side of the conditioning statements to simplify the notation. The conditional distribution $p(t \mid \mathbf{x}, \mathbf{w}, \beta)$ of the target variable is given by (3.8), and the posterior weight distribution is given by (3.49). We see that (3.57) involves the convolution of two Gaussian distributions, and so making use of the result (2.115) from Section 8.1.4, we see that the predictive distribution takes the form
$$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x})) \qquad (3.58)$$
where the variance $\sigma_N^2(\mathbf{x})$ of the predictive distribution is given by
$$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}). \qquad (3.59)$$
The first term in (3.59) represents the noise on the data whereas the second term reflects the uncertainty associated with the parameters $\mathbf{w}$. Because the noise process and the distribution of $\mathbf{w}$ are independent Gaussians, their variances are additive. Note that, as additional data points are observed, the posterior distribution becomes narrower. As a consequence it can be shown (Qazaz et al., 1997) that $\sigma_{N+1}^2(\mathbf{x}) \leq \sigma_N^2(\mathbf{x})$. In the limit $N \to \infty$, the second term in (3.59) goes to zero, and the variance of the predictive distribution arises solely from the additive noise governed by the parameter $\beta$.

As an illustration of the predictive distribution for Bayesian linear regression models, let us return to the synthetic sinusoidal data set of Section 1.1. In Figure 3.8, ...

[Figure 3.8 (PRML, p. 157), four panels with axes x and t: Examples of the predictive distribution (3.58) for a model consisting of 9 Gaussian basis functions of the form (3.4) using the synthetic sinusoidal data set of Section 1.1. See the text for a detailed discussion.]

PRML
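The two terms of the predictive variance (3.59) can be checked numerically. The sketch below assumes the simplest possible model, a single basis function φ(x) = x, so that S_N is a scalar; the data values and the function name are made up for the illustration.

```python
def blr_predictive(x, xs, ts, alpha, beta):
    """Predictive mean and variance for the scalar model t = w * x + noise.

    S_N        = 1 / (alpha + beta * sum x_n^2)      scalar form of (3.54)
    m_N        = beta * S_N * sum x_n t_n            scalar form of (3.53)
    mean       = m_N * x                             (3.58)
    sigma_N^2  = 1 / beta + x * S_N * x              (3.59)
    """
    s_n = 1.0 / (alpha + beta * sum(v * v for v in xs))
    m_n = beta * s_n * sum(v * t for v, t in zip(xs, ts))
    return m_n * x, 1.0 / beta + x * s_n * x

alpha, beta = 2.0, 25.0
xs = [0.5, 1.0, 1.5, 2.5]
ts = [1.1, 1.9, 3.2, 4.8]

_, var_prior = blr_predictive(2.0, [], [], alpha, beta)          # N = 0
_, var_three = blr_predictive(2.0, xs[:3], ts[:3], alpha, beta)  # N = 3
mean_four, var_four = blr_predictive(2.0, xs, ts, alpha, beta)   # N = 4
```

Each added observation increases the sum of squared inputs, so the parameter-uncertainty term shrinks while the noise floor 1/β stays fixed, which is the behaviour the excerpt describes.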
  • 21. 1. 2. MAP $\theta_{\mathrm{MAP}}$ 3. 2 R 4.

PRML, 5.7 Bayesian Neural Networks

... chapters and so we can exploit the results obtained there. We can then make use of the evidence framework to provide point estimates for the hyperparameters and to compare alternative models (for example, networks having different numbers of hidden units). To start with, we shall discuss the regression case and then later consider the modifications needed for solving classification tasks.

5.7.1 Posterior parameter distribution

Consider the problem of predicting a single continuous target variable $t$ from a vector $\mathbf{x}$ of inputs (the extension to multiple targets is straightforward). We shall suppose that the conditional distribution $p(t \mid \mathbf{x})$ is Gaussian, with an $\mathbf{x}$-dependent mean given by the output of a neural network model $y(\mathbf{x}, \mathbf{w})$, and with precision (inverse variance) $\beta$
$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}). \qquad (5.161)$$
Similarly, we shall choose a prior distribution over the weights $\mathbf{w}$ that is Gaussian of the form
$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}). \qquad (5.162)$$
For an i.i.d. data set of $N$ observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$, with a corresponding set of target values $D = \{t_1, \ldots, t_N\}$, the likelihood function is given by
$$p(D \mid \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1}) \qquad (5.163)$$
and so the resulting posterior distribution is then
$$p(\mathbf{w} \mid D, \alpha, \beta) \propto p(\mathbf{w} \mid \alpha)\, p(D \mid \mathbf{w}, \beta) \qquad (5.164)$$
which, as a consequence of the nonlinear dependence of $y(\mathbf{x}, \mathbf{w})$ on $\mathbf{w}$, will be non-Gaussian.

We can find a Gaussian approximation to the posterior distribution by using the Laplace approximation. To do this, we must first find a (local) maximum of the ... form
$$\ln p(\mathbf{w} \mid D) = -\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w} - \frac{\beta}{2} \sum_{n=1}^{N} \left\{ y(\mathbf{x}_n, \mathbf{w}) - t_n \right\}^2 + \mathrm{const} \qquad (5.165)$$
which corresponds to a regularized sum-of-squares error function. Assuming for the moment that $\alpha$ and $\beta$ are fixed, we can find a maximum of the posterior, which we denote $\mathbf{w}_{\mathrm{MAP}}$, by standard nonlinear optimization algorithms such as conjugate gradients, using error backpropagation to evaluate the required derivatives.

Having found a mode $\mathbf{w}_{\mathrm{MAP}}$, we can then build a local Gaussian approximation by evaluating the matrix of second derivatives of the negative log posterior distribution. From (5.165), this is given by
$$\mathbf{A} = -\nabla\nabla \ln p(\mathbf{w} \mid D, \alpha, \beta) = \alpha \mathbf{I} + \beta \mathbf{H} \qquad (5.166)$$
where $\mathbf{H}$ is the Hessian matrix comprising the second derivatives of the sum-of-squares error function with respect to the components of $\mathbf{w}$. Algorithms for computing and approximating the Hessian were discussed in Section 5.4. The corresponding Gaussian approximation to the posterior is then given from (4.134) by
$$q(\mathbf{w} \mid D) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}). \qquad (5.167)$$
Similarly, the predictive distribution is obtained by marginalizing with respect ...

PRML
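The Laplace procedure in the excerpt (find a mode, take the second derivative of the negative log posterior, read off a Gaussian) can be traced on a one-weight model. The sketch below assumes a hypothetical "network" y(x, w) = tanh(w x) with a single scalar weight, plain gradient descent instead of conjugate gradients, and a finite-difference second derivative; the data, step size and function names are illustrative choices.

```python
import math

def grad_neg_log_posterior(w, xs, ts, alpha, beta):
    """Derivative of -ln p(w | D) from (5.165) for y(x, w) = tanh(w * x)."""
    g = alpha * w
    for x, t in zip(xs, ts):
        y = math.tanh(w * x)
        g += beta * (y - t) * x * (1.0 - y * y)   # chain rule through tanh
    return g

def laplace_approximation(xs, ts, alpha, beta, lr=0.005, steps=5000, h=1e-5):
    """Return (w_MAP, variance) of the Gaussian q(w | D) = N(w | w_MAP, 1/A)."""
    w = 0.1
    for _ in range(steps):                        # gradient descent to a mode
        w -= lr * grad_neg_log_posterior(w, xs, ts, alpha, beta)
    # A = second derivative of the negative log posterior at the mode,
    # here by a central finite difference of the gradient (scalar form of 5.166).
    a = (grad_neg_log_posterior(w + h, xs, ts, alpha, beta)
         - grad_neg_log_posterior(w - h, xs, ts, alpha, beta)) / (2.0 * h)
    return w, 1.0 / a

xs = [-2.0, -1.0, 0.5, 1.0, 2.0]
ts = [math.tanh(0.8 * x) for x in xs]             # noise-free targets, true w = 0.8
w_map, q_var = laplace_approximation(xs, ts, alpha=0.1, beta=10.0)
```

Because the targets are generated with w = 0.8 and the prior pulls toward zero, the mode lands slightly below 0.8; the variance 1/A is the scalar counterpart of A⁻¹ in (5.167).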
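  • 23. ¤ $q(\theta \mid \phi)$ ¤ $\mathrm{KL}[q(\theta \mid \phi)\,\|\,p(\theta \mid D)]$, $\phi$ ¤ KL ¤ $\phi$, ELBO, $\phi$
$$\mathrm{KL}[q(\theta \mid \phi)\,\|\,p(\theta \mid D)] = -\int q(\theta \mid \phi) \log \frac{p(D \mid \theta)\, p(\theta)}{q(\theta \mid \phi)}\, d\theta + \log p(D) = -\mathcal{L}(D; \phi) + \log p(D)$$
where $\mathcal{L}(D; \phi)$ is the ELBO.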
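The identity on this slide, log p(D) = ELBO + KL, can be verified exactly when θ takes only a few values. The sketch below assumes a discrete parameter with three possible values and made-up prior and likelihood numbers; everything in it is illustrative.

```python
import math

prior = [0.5, 0.3, 0.2]        # p(theta) over three hypothetical parameter values
lik = [0.10, 0.40, 0.70]       # p(D | theta) for each of those values

evidence = sum(p * l for p, l in zip(prior, lik))            # p(D)
posterior = [p * l / evidence for p, l in zip(prior, lik)]   # p(theta | D)

def elbo(q):
    """L(D; q) = sum_theta q(theta) log [ p(D | theta) p(theta) / q(theta) ]."""
    return sum(qi * math.log(p * l / qi)
               for qi, p, l in zip(q, prior, lik) if qi > 0.0)

def kl(q, p):
    """KL[q || p] for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

q = [0.2, 0.3, 0.5]            # an arbitrary approximating distribution
gap = math.log(evidence) - elbo(q)
```

Since log p(D) does not depend on q, raising the ELBO is the same as lowering the KL divergence to the posterior, and the gap closes when q equals the posterior.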
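  • 24. ELBO ¤ ELBO $\mathcal{L}(D; \phi)$ 1. ¤ $p(\theta \mid D)$, MC 2. ELBO ¤ MC
$$\int q(\theta \mid \phi) \log \frac{p(D \mid \theta)\, p(\theta)}{q(\theta \mid \phi)}\, d\theta = \mathbb{E}_{q}\left[ \log \frac{p(D \mid \theta)\, p(\theta)}{q(\theta \mid \phi)} \right]$$
3. ¤ EM ¤ ELBO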
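The Monte Carlo form of the ELBO on this slide replaces the integral by an average over samples from q. The sketch below assumes a one-dimensional conjugate model (prior θ ~ N(0, 1), likelihood x ~ N(θ, 1), a single observation) so that log p(D) is known and the estimate can be compared against it; the model, the observation and the function names are assumptions for illustration.

```python
import math
import random

def log_normal(x, mean, var):
    """Log density of N(x | mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def elbo_mc(x, q_mean, q_var, num_samples=5000, seed=0):
    """Average of log [p(x | theta) p(theta) / q(theta)] over theta ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.gauss(q_mean, math.sqrt(q_var))
        total += (log_normal(x, theta, 1.0)         # log p(D | theta)
                  + log_normal(theta, 0.0, 1.0)     # log p(theta)
                  - log_normal(theta, q_mean, q_var))  # log q(theta | phi)
    return total / num_samples

x_obs = 1.2
log_evidence = log_normal(x_obs, 0.0, 2.0)             # exact log p(D)
elbo_at_posterior = elbo_mc(x_obs, x_obs / 2.0, 0.5)   # q = exact posterior
elbo_elsewhere = elbo_mc(x_obs, -1.0, 0.5)             # a poor choice of q
```

When q is the exact posterior the integrand is constant and the estimate equals log p(D); for the shifted q it is lower by roughly the KL divergence between the two Gaussians.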
  • 25. ¤ Gal ¤ Denker, Schwartz, Wittner, Solla, Howard, Jackel, Hopfield (1987) ¤ Denker and LeCun (1991) ¤ MacKay (1992) ¤ Hinton and van Camp (1993) ¤ Neal (1995) ¤ Barber and Bishop (1998) ¤ Graves (2011) ¤ Blundell, Cornebise, Kavukcuoglu, and Wierstra (2015) ¤ Hernández-Lobato and Adams (2015)
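  • 26. ¤ Practical Variational Inference for Neural Networks [Graves 2011] ¤ $L^{E}$, $q(\theta \mid \phi)$, $\phi = \{\mu, \sigma\}$
$$-\mathcal{L}(D; \phi) = -\int q(\theta \mid \phi) \log \frac{p(D \mid \theta)\, p(\theta)}{q(\theta \mid \phi)}\, d\theta = -\mathbb{E}_{q(\theta \mid \phi)}\left[ \log p(D \mid \theta) \right] + \mathrm{KL}[q(\theta \mid \phi)\,\|\,p(\theta)]$$
$$\frac{\partial L^{E}}{\partial \mu} \approx -\frac{1}{N} \sum_{i}^{N} \frac{\partial \log p(D \mid \theta)}{\partial \theta}, \qquad \frac{\partial L^{E}}{\partial \sigma^{2}} \approx \frac{1}{2N} \sum_{i}^{N} \left( \frac{\partial \log p(D \mid \theta)}{\partial \theta} \right)^{2}$$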
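  • 27. ¤ Weight Uncertainty in Neural Networks [Blundell+ 2015] ¤ $q(\theta \mid \phi)$ ¤ http://www.slideshare.net/masa_s/weight-uncertainty-in-neural-networks ¤ Bayes by Backprop
$$\frac{\partial}{\partial \phi} \mathcal{L}(D; \phi) = \frac{\partial}{\partial \phi} \mathbb{E}_{q(\theta \mid \phi)}\left[ f(\theta, \phi) \right] = \mathbb{E}_{q(\epsilon)}\left[ \frac{\partial f(\theta, \phi)}{\partial \theta} \frac{\partial \theta}{\partial \phi} + \frac{\partial f(\theta, \phi)}{\partial \phi} \right], \qquad f(\theta, \phi) = \log \frac{p(D \mid \theta)\, p(\theta)}{q(\theta \mid \phi)}$$
¤ Reparameterization trick: $\theta = \mu + \mathrm{diag}(\sigma) \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$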
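The reparameterization on this slide moves the randomness into ε, so the gradient with respect to μ and σ can be taken inside the expectation. The sketch below assumes a scalar θ and a test function f(θ) = θ² that does not depend on φ directly (so the second term of the gradient vanishes); this choice is made only because E[f] = μ² + σ² gives gradients 2μ and 2σ to compare against. The function name is hypothetical.

```python
import random

def reparam_gradients(mu, sigma, df_dtheta, num_samples=20000, seed=0):
    """Monte Carlo gradients of E_q[f(theta)] with theta = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)          # eps ~ N(0, 1)
        theta = mu + sigma * eps           # reparameterized sample from q
        g = df_dtheta(theta)
        grad_mu += g                       # d theta / d mu = 1
        grad_sigma += g * eps              # d theta / d sigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

mu, sigma = 1.5, 0.5
grad_mu, grad_sigma = reparam_gradients(mu, sigma, lambda theta: 2.0 * theta)
```

In Bayes by Backprop the same pattern is applied with f equal to the log-ratio on the slide and with the derivative of f obtained by backpropagation rather than by hand.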
  • 28. Dropout ¤ Dropout as a Bayesian Approximation [Gal+ 2015] ¤ $q(\theta \mid \phi) = \prod_{i} q(W_i \mid M_i)$ ¤ $L^{E} = -\mathbb{E}_{q(\theta \mid \phi)}\left[ \log p(D \mid \theta) \right]$ ¤ $M_i$, 0 ¤ 0 ¤ dropout ¤ drop-connect, multiplicative Gaussian noise

"…sults are summarised here …n uncertainty estimates for …el with L layers and a loss …max loss or the Euclidean …$W_i$ the NN’s weight ma- …1, and by $b_i$ the bias vec- …ayer $i = 1, ..., L$. We de- …corresponding to input $x_i$ …the input and output sets …on a regularisation term is …egularisation weighted by …n a minimisation objective
$$\ldots + \lambda \sum_{i=1}^{L} \left( \|W_i\|_2^2 + \|b_i\|_2^2 \right). \qquad (1)$$
…variables for every input …in each layer (apart from …le takes value 1 with prob- …ropped (i.e. its value is set …orresponding binary vari- …me values in the backward …to the parameters."

$$p(y \mid x, \omega) = \mathcal{N}\left( y;\ \hat{y}(x, \omega),\ \tau^{-1} I_D \right)$$
$$\hat{y}\left( x, \omega = \{W_1, ..., W_L\} \right) = \sqrt{\frac{1}{K_L}}\, W_L\, \sigma\left( \ldots \sqrt{\frac{1}{K_1}}\, W_2\, \sigma\left( W_1 x + m_1 \right) \ldots \right)$$
"The posterior distribution $p(\omega \mid X, Y)$ in eq. (2) is intractable. We use $q(\omega)$, a distribution over matrices whose columns are randomly set to zero, to approximate the intractable posterior. We define $q(\omega)$ as:
$$W_i = M_i \cdot \mathrm{diag}\left( [z_{i,j}]_{j=1}^{K_i} \right), \qquad z_{i,j} \sim \mathrm{Bernoulli}(p_i) \ \text{for } i = 1, ..., L,\ j = 1, ..., K_{i-1}$$
given some probabilities $p_i$ and matrices $M_i$ as variational parameters. The binary variable $z_{i,j} = 0$ corresponds then to unit $j$ in layer $i - 1$ being dropped out as an input to layer $i$. The variational distribution $q(\omega)$ is highly multi-modal, inducing strong joint correlations over the rows of the matrices $W_i$ (which correspond to the frequencies in the sparse spectrum GP approximation). We minimise the KL divergence between the approximate posterior $q(\omega)$ above and the posterior of the full deep GP, $p(\omega \mid X, Y)$. This KL is our minimisation objective
$$-\int q(\omega) \log p(Y \mid X, \omega)\, d\omega + \mathrm{KL}(q(\omega)\,\|\,p(\omega)). \qquad (3)$$"
  • 29. Dropout
    ¤ MC dropout
    ¤ http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
    Excerpt from [Gal+ 2015]:
    ... where $\omega = \{W_i\}_{i=1}^{L}$ is our set of random variables for a model with $L$ layers. We will perform moment-matching and estimate the first two moments of the predictive distribution empirically. More specifically, we sample $T$ sets of vectors of realisations from the Bernoulli distribution $\{z_1^t, \ldots, z_L^t\}_{t=1}^{T}$ with $z_i^t = [z_{i,j}^t]_{j=1}^{K_i}$, giving $\{W_1^t, \ldots, W_L^t\}_{t=1}^{T}$. We estimate
    $$\mathbb{E}_{q(y^*|x^*)}(y^*) \approx \frac{1}{T}\sum_{t=1}^{T} \hat{y}^*(x^*, W_1^t, \ldots, W_L^t) \quad (6)$$
    following proposition C in the appendix. We refer to this Monte Carlo estimate as MC dropout. In practice this is equivalent to performing $T$ stochastic forward passes through the network and averaging the results.
    This result has been presented in the literature before as model averaging. We have given a new derivation for this result which allows us to derive mathematically grounded uncertainty estimates as well. Srivastava et al. (2014, section 7.5) have reasoned empirically that MC dropout can be approximated by averaging the weights of the network (multiplying each $W_i$ by $p_i$ at test time, referred to as standard dropout).
    We estimate the second raw moment in the same way:
    $$\mathbb{E}_{q(y^*|x^*)}\big((y^*)^T (y^*)\big) \approx \tau^{-1} I_D + \frac{1}{T}\sum_{t=1}^{T} \hat{y}^*(x^*, W_1^t, \ldots, W_L^t)^T\, \hat{y}^*(x^*, W_1^t, \ldots, W_L^t)$$
    following proposition D in the appendix. To obtain the model's predictive variance we have:
    $$\mathrm{Var}_{q(y^*|x^*)}(y^*) \approx \tau^{-1} I_D + \cdots$$
    Footnote 2: In the appendix (section 4.1) we extend this derivation to classification. $E(\cdot)$ is defined as softmax loss and $\tau$ is set to 1.
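The MC dropout estimate described in this excerpt (T stochastic forward passes, then empirical moments) can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions that are not from the slide: a small fully connected ReLU network stored as lists of `weights` and `biases`, a single keep probability `p_keep` for all layers, and a per-output variance computed as `1/tau` plus the sample second moment minus the squared sample mean. The function name `mc_dropout_predict` is hypothetical.

```python
import numpy as np

def mc_dropout_predict(x, weights, biases, p_keep, tau, T=100, seed=0):
    """Predictive mean and per-output variance from T stochastic forward passes.

    x: array of shape (batch, features). The variance expression is an
    assumption of this sketch (1/tau + second moment - mean squared).
    """
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(T):
        h = x
        for i, (W, b) in enumerate(zip(weights, biases)):
            # z_j = 0 drops unit j of the previous layer as an input to this layer
            z = rng.binomial(1, p_keep, size=h.shape[-1])
            h = (h * z) @ W + b
            if i < len(weights) - 1:
                h = np.maximum(h, 0.0)  # ReLU on hidden layers only
        preds.append(h)
    preds = np.stack(preds)                      # shape (T, batch, outputs)
    mean = preds.mean(axis=0)                    # Monte Carlo estimate of E[y*]
    second_moment = (preds ** 2).mean(axis=0)    # Monte Carlo estimate of E[y*^2]
    var = 1.0 / tau + second_moment - mean ** 2
    return mean, var
```

With `p_keep=1.0` every pass is identical, so the variance collapses to the observation noise `1/tau`, which is a quick sanity check on the estimator.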
  • 30. ¤ MC dropout
    [Figure 2 from "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning". Panels: (a) Standard dropout with weight averaging; (b) Gaussian process with SE covariance function; (c) MC dropout with ReLU non-linearities; (d) MC dropout with TanH non-linearities.]
    Figure 2. Predictive mean and uncertainties on the Mauna Loa CO2 concentrations dataset, for various models. In red is the observed function (left of the dashed blue line); in blue is the predictive mean plus/minus two standard deviations (8 for fig. 2d). Different shades of blue represent half a standard deviation. Marked with a dashed red line is a point far away from the data: standard dropout confidently predicts an insensible value for the point; the other models predict insensible values as well but with the additional information that the models are uncertain about their predictions.
    Fig. 2c shows the results of the same network as in fig. 2a, but with MC dropout used to evaluate the predictive mean and uncertainty for the training and test sets. Lastly, fig. 2d shows the same using the TanH network with 5 layers (plotted with 8 times the standard deviation for visualisation purposes).
  • 31. Dropout
    ¤ Variational dropout and the local reparameterization trick [Kingma+ 2015]
    ¤ $\mathrm{Cov}[L_i, L_j] = 0$: local reparameterization trick
    ¤ Dropout
    ¤ http://www.slideshare.net/masa_s/dl-hacks-variational-dropout-and-the-local-reparameterization-trick
    Excerpt from [Kingma+ 2015]:
    2.2 Variance of the SGVB estimator
    The theory of stochastic approximation tells us that stochastic gradient ascent using (3) will asymptotically converge to a local optimum for an appropriately declining step size and sufficient weight updates [18], but in practice the performance of stochastic gradient ascent crucially depends on the variance of the gradients. If this variance is too large, stochastic gradient descent will fail to make much progress in any reasonable amount of time. Our objective function consists of an expected log likelihood term that we approximate using Monte Carlo, and a KL divergence term $D_{KL}(q_\phi(w)\,\|\,p(w))$ that we assume can be calculated analytically and otherwise be approximated with Monte Carlo with similar reparameterization.
    Assume that we draw minibatches of datapoints with replacement; see appendix F for a similar analysis for minibatches without replacement. Using $L_i$ as shorthand for $\log p(y^i|x^i, w = f(\epsilon^i, \phi))$, the contribution to the likelihood for the $i$-th datapoint in the minibatch, the Monte Carlo estimator (3) may be rewritten as $L_D^{SGVB}(\phi) = \frac{N}{M}\sum_{i=1}^{M} L_i$, whose variance is given by
    $$\mathrm{Var}\big[L_D^{SGVB}(\phi)\big] = \frac{N^2}{M^2}\Big(\sum_{i=1}^{M}\mathrm{Var}[L_i] + 2\sum_{i=1}^{M}\sum_{j=i+1}^{M}\mathrm{Cov}[L_i, L_j]\Big) \quad (4)$$
    $$= N^2\Big(\frac{1}{M}\mathrm{Var}[L_i] + \frac{M-1}{M}\mathrm{Cov}[L_i, L_j]\Big), \quad (5)$$
    where the variances and covariances are w.r.t. both the data distribution and $\epsilon$ distribution, i.e. $\mathrm{Var}[L_i] = \mathrm{Var}_{\epsilon, x^i, y^i}\big[\log p(y^i|x^i, w = f(\epsilon, \phi))\big]$, with $x^i, y^i$ drawn from the empirical distribution defined by the training set. As can be seen from (5), the total contribution to the variance by $\mathrm{Var}[L_i]$ is inversely proportional to the minibatch size $M$. However, the total contribution by the covariances does not decrease with $M$. In practice, this means that the variance of $L_D^{SGVB}(\phi)$ can be dominated by the covariances for even moderately large $M$.
    2.3 Local Reparameterization Trick
    We therefore propose an alternative estimator for which we have $\mathrm{Cov}[L_i, L_j] = 0$, so that the variance of our stochastic gradients scales as $1/M$. We then make this new estimator computationally efficient by not sampling $\epsilon$ directly, but only sampling the intermediate variables $f(\epsilon)$ through which ...
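To make the contrast in Section 2.3 concrete, here is a hedged NumPy sketch for one linear layer. It assumes a fully factorized Gaussian posterior over the weight matrix with mean `M` and standard deviation `S` (an assumption of this sketch; the excerpt above is cut off before it states the form of the posterior). The "global" version samples one weight noise matrix shared by the whole minibatch; the "local" version samples the pre-activations directly with independent noise per datapoint. Both function names are hypothetical.

```python
import numpy as np

def sample_preacts_global(A, M, S, rng):
    """Sample weights once per minibatch, then compute pre-activations B = A W."""
    eps = rng.standard_normal(M.shape)   # one noise matrix shared by all rows of A
    return A @ (M + S * eps)

def sample_preacts_local(A, M, S, rng):
    """Sample pre-activations directly; noise is independent per datapoint."""
    mean = A @ M                          # E[B] for factorized Gaussian weights
    var = (A ** 2) @ (S ** 2)             # Var[B], element-wise
    zeta = rng.standard_normal(mean.shape)
    return mean + np.sqrt(var) * zeta
```

Both samplers give the same marginal distribution for each pre-activation, but only the global one induces correlation between datapoints in the same minibatch, which is the covariance term in equation (5).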
  • 33. ¤ → Automatic differentiation variational inference (ADVI)
  • 34. Automatic Differentiation Variational Inference (ADVI)
    ¤ Automatic Differentiation Variational Inference [Kucukelbir+ 2016]
    ¤ Stan, PyMC3, Edward
    ¤ ADVI
      1. $\theta$: $p(x, \theta) \to p(x, \zeta)$, $\mathrm{KL}[q(\zeta)\,\|\,p(\zeta|x)]$
      2. $q$, MC
      3. $q$
  • 35. Automatic transformation
    ¤ support of $p(\theta)$
    ¤ $T: \theta \to \zeta$, $\zeta = T(\theta)$
    [Figure 1. Panels: (a) Latent variable space (axes: $\theta$, Density); (b) Real coordinate space (axis: $\zeta$); arrows $T$ and $T^{-1}$ between the panels.]
    Figure 1: Transforming the latent variable to real coordinate space. The purple line is the posterior. The green line is the approximation. (a) The latent variable space is $\mathbb{R}_{>0}$. (a→b) $T$ transforms the latent variable space to $\mathbb{R}$. (b) The variational approximation is a Gaussian in real coordinate space.
    Excerpt from [Kucukelbir+ 2016]:
    ... identify the transformed variables as $\zeta = T(\theta)$. The transformed joint density $p(x, \zeta)$ is a function of $\zeta$; it has the representation
    $$p(x, \zeta) = p\big(x, T^{-1}(\zeta)\big)\, \big|\det J_{T^{-1}}(\zeta)\big|,$$
    where $p(x, \theta = T^{-1}(\zeta))$ is the joint density in the original latent variable space, and $J_{T^{-1}}(\zeta)$ is the Jacobian of the inverse of $T$. Transformations of continuous probability densities require a Jacobian; it accounts for how the transformation warps unit volumes and ensures that the transformed density integrates to one (Olive, 2014). (See Appendix A.)
  • 36. Automatic transformation
    ¤ $\zeta = T(\theta) = \log(\theta)$
    Excerpt from [Kucukelbir+ 2016]:
    The transformed joint density $p(x, \zeta)$ is a function of $\zeta$; it has the representation
    $$p(x, \zeta) = p\big(x, T^{-1}(\zeta)\big)\, \big|\det J_{T^{-1}}(\zeta)\big|,$$
    where $p(x, \theta = T^{-1}(\zeta))$ is the joint density in the original latent variable space, and $J_{T^{-1}}(\zeta)$ is the Jacobian of the inverse of $T$. Transformations of continuous probability densities require a Jacobian; it accounts for how the transformation warps unit volumes and ensures that the transformed density integrates to one (Olive, 2014). (See Appendix A.)
    Consider again our running Weibull-Poisson example from Section 2.1. The latent variable $\theta$ lives in $\mathbb{R}_{>0}$. The logarithm $\zeta = T(\theta) = \log(\theta)$ transforms $\mathbb{R}_{>0}$ to the real line $\mathbb{R}$. Its Jacobian adjustment is the derivative of the inverse of the logarithm, $|\det J_{T^{-1}}(\zeta)| = \exp(\zeta)$. The transformed density is
    $$p(x, \zeta) = \mathrm{Poisson}(x \mid \exp(\zeta)) \times \mathrm{Weibull}(\exp(\zeta);\ 1.5, 1) \times \exp(\zeta).$$
    Figure 1 depicts this transformation.
    As we describe in the introduction, we implement our algorithm in Stan (Stan Development Team, 2015). Stan maintains a library of transformations and their corresponding Jacobians (footnote 4). With Stan, we can automatically transform the joint density of any differentiable probability model to one with real-valued latent variables. (See Figure 2.)
    [Figure 1. Panels: (a) Latent variable space (axes: $\theta$, Density); (b) Real coordinate space (axis: $\zeta$); legend: Prior, Posterior, Approximation; arrows $T$ and $T^{-1}$ between the panels.]
    Figure 1: Transforming the latent variable to real coordinate space. The purple line is the posterior. The green line is the approximation. (a) The latent variable space is $\mathbb{R}_{>0}$. (a→b) $T$ transforms the latent variable space to $\mathbb{R}$. (b) The variational approximation is a Gaussian in real coordinate space.
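The Weibull-Poisson example above can be checked numerically with a short sketch using only the standard library. It assumes a single observation `x` and reads `Weibull(θ; 1.5, 1)` as shape 1.5 and scale 1 (the parameter order is an assumption of this sketch). The function names `log_joint_theta` and `log_joint_zeta` are hypothetical.

```python
import math

def log_joint_theta(x, theta, shape=1.5, scale=1.0):
    """log p(x, theta) = log Poisson(x | theta) + log Weibull(theta; shape, scale)."""
    log_pois = x * math.log(theta) - theta - math.lgamma(x + 1)
    log_weib = (math.log(shape / scale)
                + (shape - 1.0) * math.log(theta / scale)
                - (theta / scale) ** shape)
    return log_pois + log_weib

def log_joint_zeta(x, zeta):
    """log p(x, zeta) with zeta = log(theta); adds the log-Jacobian term."""
    theta = math.exp(zeta)                # theta = T^{-1}(zeta)
    return log_joint_theta(x, theta) + zeta   # log|det J_{T^{-1}}(zeta)| = zeta
```

Because the Jacobian term is included, integrating `exp(log_joint_zeta)` over the real line gives the same evidence as integrating `exp(log_joint_theta)` over the positive reals. Dropping the `+ zeta` term would break that equality.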
  • 37. ¤ $L$
    Excerpt from [Kucukelbir+ 2016]:
    2.4 Variational Approximations in Real Coordinate Space
    After the transformation, the latent variables $\zeta$ have support in the real coordinate space $\mathbb{R}^K$. We have a choice of variational approximations in this space. Here, we consider Gaussian distributions; these implicitly induce non-Gaussian variational distributions in the original latent variable space.
    Mean-field Gaussian. One option is to posit a factorized (mean-field) Gaussian variational approximation
    $$q(\zeta; \phi) = \mathcal{N}\big(\zeta;\ \mu, \mathrm{diag}(\sigma^2)\big) = \prod_{k=1}^{K} \mathcal{N}\big(\zeta_k;\ \mu_k, \sigma_k^2\big),$$
    where the vector $\phi = (\mu_1, \cdots, \mu_K, \sigma_1^2, \cdots, \sigma_K^2)$ concatenates the mean and variance of each Gaussian factor. Since the variance parameters must always be positive, the variational parameters live in the set $\Phi = \{\mathbb{R}^K, \mathbb{R}^K_{>0}\}$. Re-parameterizing the mean-field Gaussian removes this constraint. Consider the logarithm of the standard deviations, $\omega = \log(\sigma)$, applied element-wise. The support of $\omega$ is now the real coordinate space and $\sigma$ is always positive. The mean-field Gaussian becomes $q(\zeta; \phi) = \mathcal{N}\big(\zeta;\ \mu, \mathrm{diag}(\exp(\omega)^2)\big)$, where the vector $\phi = (\mu_1, \cdots, \mu_K, \omega_1, \cdots, \omega_K)$ concatenates the mean and logarithm of the standard deviation of each factor. Now, the variational parameters are unconstrained in $\mathbb{R}^{2K}$.
    Full-rank Gaussian. Another option is to posit a full-rank Gaussian variational approximation
    $$q(\zeta; \phi) = \mathcal{N}(\zeta;\ \mu, \Sigma),$$
    where the vector $\phi = (\mu, \Sigma)$ concatenates the mean vector $\mu$ and covariance matrix $\Sigma$. To ensure that $\Sigma$ always remains positive semidefinite, we re-parameterize the covariance matrix using a Cholesky factorization, $\Sigma = LL^\top$. We use the non-unique definition of the Cholesky factorization where the diagonal elements of $L$ need not be positively constrained (Pinheiro and Bates, 1996). Therefore $L$ lives in the unconstrained space of lower-triangular matrices with $K(K+1)/2$ real-valued entries. The full-rank Gaussian becomes $q(\zeta; \phi) = \mathcal{N}(\zeta;\ \mu, LL^\top)$, where the variational parameters $\phi = (\mu, L)$ are unconstrained in $\mathbb{R}^{K + K(K+1)/2}$.
    The full-rank Gaussian generalizes the mean-field Gaussian approximation. The off-diagonal terms in the covariance matrix $\Sigma$ capture posterior correlations across latent random variables. This leads to a more accurate posterior approximation than the mean-field Gaussian; however, it comes at a computational cost. Various low-rank approximations to the covariance matrix reduce this cost, yet limit its ...
    Footnote 4: Stan provides various transformations for upper and lower bounds, simplex and ordered vectors, and structured matrices such as covariance matrices and Cholesky factors.
    [Figure 2: Specifying a simple nonconjugate probability model in Stan. The code is cropped in the slide; the surviving line is: x[n] ~ poisson(theta);]
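The two unconstrained parameterizations above can be sketched as samplers in NumPy. This is a minimal illustration, not Stan's implementation: `omega` holds log standard deviations, and `L_flat` is a hypothetical flat vector holding the $K(K+1)/2$ lower-triangular entries of $L$ in row-major order, with no sign constraint on the diagonal.

```python
import numpy as np

def sample_mean_field(mu, omega, n, rng):
    """Draw n samples from N(mu, diag(exp(omega)^2)); omega = log(sigma)."""
    eta = rng.standard_normal((n, mu.size))
    return mu + np.exp(omega) * eta          # exp keeps sigma positive

def sample_full_rank(mu, L_flat, n, rng):
    """Draw n samples from N(mu, L L^T) with L lower-triangular, unconstrained."""
    K = mu.size
    L = np.zeros((K, K))
    L[np.tril_indices(K)] = L_flat           # fill the K(K+1)/2 free entries
    eta = rng.standard_normal((n, K))
    return mu + eta @ L.T                    # each row is mu + L @ eta_row
```

Any real-valued `omega` or `L_flat` yields a valid Gaussian, which is what lets a generic gradient-based optimizer update the parameters without projection steps.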
  • 38. ¤ ELBO
    ¤ Reparameterization trick ($\zeta$, $\mu$)
    ¤ ELBO
    Excerpt from [Kucukelbir+ 2016]:
    2.5 The Variational Problem in Real Coordinate Space
    Here is the story so far. We began with a differentiable probability model $p(x, \theta)$. We transformed the latent variables into $\zeta$, which live in the real coordinate space. We defined variational approximations in the transformed space. Now, we consider the variational optimization problem.
    Write the variational objective function, the ELBO, in real coordinate space as
    $$\mathcal{L}(\phi) = \mathbb{E}_{q(\zeta;\phi)}\Big[\log p\big(x, T^{-1}(\zeta)\big) + \log\big|\det J_{T^{-1}}(\zeta)\big|\Big] + \mathbb{H}\big[q(\zeta; \phi)\big]. \quad (5)$$
    The inverse of the transformation $T^{-1}$ appears in the joint model, along with the determinant of the Jacobian adjustment. The ELBO is a function of the variational parameters $\phi$ and the entropy $\mathbb{H}$, both of which depend on the variational approximation. (Derivation in Appendix B.)
    Now, we can freely optimize the ELBO in the real coordinate space without worrying about the support matching constraint. The optimization problem from Equation (3) becomes
    $$\phi^* = \arg\max_{\phi} \mathcal{L}(\phi) \quad (6)$$
    where the parameter vector $\phi$ lives in some appropriately dimensioned real coordinate space. This is an unconstrained optimization problem that we can solve using gradient ascent. Traditionally, this would require manual computation of gradients. Instead, we develop a stochastic gradient ascent algorithm that uses automatic differentiation to compute gradients and MC integration to approximate expectations.
    We cannot directly use automatic differentiation on the ELBO. This is because the ELBO involves an unknown expectation. However, we can automatically differentiate the functions inside the expectation. (The model $p$ and transformation $T$ are both easy to represent as computer functions (Baydin et al., 2015).) To apply automatic differentiation, we want to push the gradient operation inside the expectation. To this end, we employ one final transformation: elliptical standardization (Härdle and Simar, 2012).
    Elliptical standardization. Consider a transformation $S_\phi$ that absorbs the variational parameters $\phi$; this converts the Gaussian variational approximation into a standard Gaussian. In the mean-field case, the standardization is $\eta = S_\phi(\zeta) = \mathrm{diag}(\exp(\omega))^{-1}(\zeta - \mu)$. In the full-rank Gaussian, the standardization is $\eta = S_\phi(\zeta) = L^{-1}(\zeta - \mu)$.
    In both cases, the standardization encapsulates the variational parameters; in return it gives a fixed variational density
    $$q(\eta) = \mathcal{N}(\eta;\ 0, I) = \prod_{k=1}^{K} \mathcal{N}(\eta_k;\ 0, 1),$$
    as shown in Figure 3. The standardization transforms the variational problem from Equation (5) into
    $$\phi^* = \arg\max_{\phi}\ \mathbb{E}_{\mathcal{N}(\eta; 0, I)}\Big[\log p\big(x, T^{-1}(S_\phi^{-1}(\eta))\big) + \log\big|\det J_{T^{-1}}\big(S_\phi^{-1}(\eta)\big)\big|\Big] + \mathbb{H}\big[q(\zeta; \phi)\big].$$
    The expectation is now in terms of a standard Gaussian density. The Jacobian of elliptical standardization evaluates to one, because the Gaussian distribution is a member of the location-scale family: standardizing a Gaussian gives another Gaussian distribution. (See Appendix A.) We do not need to transform the entropy term as it does not depend on the model or the transformation; we have a simple analytic form for the entropy of a Gaussian and its gradient. We implement these once and reuse for all models.
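The standardized objective can be estimated with a few lines of NumPy. The sketch below covers only the mean-field case and only the ELBO value, not its gradient. It assumes the caller supplies `log_joint_zeta`, a hypothetical function returning $\log p(x, T^{-1}(\zeta)) + \log|\det J_{T^{-1}}(\zeta)|$ for one vector $\zeta$, and it uses the closed-form Gaussian entropy $\sum_k \omega_k + \tfrac{K}{2}(1 + \log 2\pi)$.

```python
import numpy as np

def elbo_estimate(log_joint_zeta, mu, omega, n_samples=1000, seed=0):
    """Monte Carlo ELBO for a mean-field Gaussian q(zeta; mu, exp(omega)).

    log_joint_zeta is assumed to already include the log-Jacobian term.
    """
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal((n_samples, mu.size))   # eta ~ N(0, I)
    zeta = mu + np.exp(omega) * eta                    # zeta = S_phi^{-1}(eta)
    expected = np.mean([log_joint_zeta(z) for z in zeta])
    # Entropy of a diagonal Gaussian, available in closed form
    entropy = np.sum(omega) + 0.5 * mu.size * (1.0 + np.log(2.0 * np.pi))
    return expected + entropy
```

When the target is itself a standard Gaussian and $q$ matches it exactly ($\mu = 0$, $\omega = 0$), the KL divergence is zero and the estimate should be close to the log evidence, which is 0 for a normalized density.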
  • 39. ¤ Black-box variational inference [Ranganath+ 2014]
    ¤ Differs from ADVI: likelihood ratio trick (BBVI) vs. reparameterization trick (ADVI)
    Excerpt from [Kucukelbir+ 2016]:
    3.2 Variance of the Stochastic Gradients
    ADVI uses Monte Carlo integration to approximate gradients of the ELBO, and then uses these gradients in a stochastic optimization algorithm (Section 2). The speed of ADVI hinges on the variance of the gradient estimates. When a stochastic optimization algorithm suffers from high-variance gradients, it must repeatedly recover from poor parameter estimates.
    ADVI is not the only way to compute Monte Carlo approximations of the gradient of the ELBO. Black box variational inference (BBVI) takes a different approach (Ranganath et al., 2014). The BBVI gradient estimator uses the gradient of the variational approximation and avoids using the gradient of the model. For example, the following BBVI estimator
    $$\nabla_\mu^{\mathrm{BBVI}} \mathcal{L} = \mathbb{E}_{q(\zeta;\phi)}\Big[\nabla_\mu \log q(\zeta; \phi)\,\big\{\log p\big(x, T^{-1}(\zeta)\big) + \log\big|\det J_{T^{-1}}(\zeta)\big| - \log q(\zeta; \phi)\big\}\Big]$$
    and the ADVI gradient estimator in Equation (7) both lead to unbiased estimates of the exact gradient. While BBVI is more general (it does not require the gradient of the model and thus applies to more settings), its gradients can suffer from high variance.
    Figure 8 empirically compares the variance of both estimators for two models. Figure 8a shows the variance of both gradient estimators for a simple univariate model, where the posterior is a Gamma(10, 10). We estimate the variance using ten thousand re-calculations of the gradient $\nabla_\phi \mathcal{L}$, across an increasing number of MC samples $M$. The ADVI gradient has lower variance; in practice, a single sample suffices. (See the experiments in Section 4.)
    Figure 8b shows the same calculation for a 100-dimensional nonlinear regression model with likelihood $\mathcal{N}(y \mid \tanh(x^\top \beta), I)$ and a Gaussian prior on the regression coefficients $\beta$. Because this is a multivariate example, we also show the BBVI gradient with a variance reduction scheme using control variates described in Ranganath et al. (2014). In both cases, the ADVI gradients are statistically more efficient.
    [Figure 8. Panels: (a) Univariate Model; (b) Multivariate Nonlinear Regression Model. Axes: Number of MC samples (log scale) vs. Variance (log scale). Legend: ADVI, BBVI, BBVI with control variate.]
    Figure 8: Comparison of gradient estimator variances. The ADVI gradient estimator exhibits lower variance than the BBVI estimator. Moreover, it does not require control variate variance reduction, which is not available in univariate situations.
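The variance gap between the two estimators can be reproduced on a toy problem. The sketch below is not the experiment from Figure 8: it assumes a one-dimensional unnormalized log-joint $\log p(\zeta) = -\tfrac{1}{2}(\zeta - 3)^2$ (a hypothetical target chosen so the exact gradient is known) and a Gaussian $q(\zeta; \mu, \sigma)$, and it returns per-sample gradient estimates with respect to $\mu$ for both the score-function (BBVI-style) and the reparameterized (ADVI-style) estimator.

```python
import numpy as np

def gradient_samples(mu, sigma, n_samples=10000, seed=0):
    """Per-sample estimates of d ELBO / d mu for a toy 1-D model."""
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal(n_samples)
    zeta = mu + sigma * eta
    log_p = -0.5 * (zeta - 3.0) ** 2                          # toy log-joint
    log_q = -0.5 * eta ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)
    score = (zeta - mu) / sigma ** 2                          # d/dmu log q(zeta)
    bbvi = score * (log_p - log_q)                            # score-function estimator
    advi = -(zeta - 3.0)                                      # d log_p / d zeta, with d zeta / d mu = 1
    return bbvi, advi
```

For this toy model the exact gradient is $-(\mu - 3)$. Both estimators average to that value, while the per-sample variance of the reparameterized estimator is $\sigma^2$ and that of the score-function estimator is considerably larger.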
  • 40.
  • 42. ¤ Stan
      ¤ MCMC: NUTS [Hoffman+ 2014], HMC
      ¤ Stan, Python, R
      ¤ ADVI
    ¤ PyMC3
      ¤ Python, MCMC
      ¤ Theano, GPU
      ¤ ADVI
    ¤ Edward
      ¤ criticism
      ¤ Python, TensorFlow, Keras
      ¤ 35x vs. Stan, PyMC3 [Tran+ 2016]
  • 43. ¤ Tars
    ¤ https://github.com/masa-su/Tars
    ¤ Edward (Tran), PyMC3 (Wiecki): star
    ¤ Edward, PyMC3
    ¤ Q: Tars
    ¤ A: