1. Learning stochastic neural networks
with Chainer
Sep. 21, 2016 | PyCon JP @ Waseda University
The University of Tokyo, Preferred Networks, Inc.
Seiya Tokui
@beam2d
2. Self introduction
• Seiya Tokui
• @beam2d (Twitter/GitHub)
• Researcher at Preferred Networks, Inc.
• Lead developer of Chainer (a framework for neural nets)
• Ph.D. student at the University of Tokyo (since Apr. 2016)
• Supervisor: Lect. Issei Sato
• Topics: deep generative models
Today I will talk as a student (i.e. an academic researcher and
a user of Chainer).
3. Topics of this talk: how to compute
gradients through stochastic units
First 20 min.
• Stochastic unit
• Learning methods for stochastic neural nets
Second 20 min.
• How to implement it with Chainer
• Experimental results
Take-home message: you can train stochastic NNs without
modifying the backprop procedure in most frameworks
(including Chainer)
4. Caution!!!
This talk DOES NOT introduce
• basic maths
• backprop algorithm
• how to install Chainer (see the official documentation!)
• basic concept and usage of Chainer (ditto!)
I could not avoid using some math to explain the work, so just
take this talk as an example of how a researcher writes scripts
in Python.
6. Neural net: directed acyclic graph of
linear-nonlinear operations
(Figure: each node applies a linear transform followed by a nonlinearity)
All operations are deterministic and differentiable
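For example, a single layer computes, in the usual notation,
$h = f(Wx + b)$,
where $f$ is a nonlinearity such as sigmoid, tanh, or ReLU.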
7. Stochastic unit: a neuron with sampling
Case 1: (diagonal) Gaussian
This unit defines a random variable, and forward propagation draws a sample from it.
(Figure: a linear-nonlinear computation followed by a sampling step)
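Concretely, with illustrative weight names (not from the slide), a diagonal Gaussian unit computes
$\mu = W_\mu h + b_\mu, \quad \ln \sigma^2 = W_\sigma h + b_\sigma, \quad z \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma^2)).$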
8. Stochastic unit: a neuron with sampling
Case 2: Bernoulli (a binary unit taking the value 1 with probability $\mu$)
(Figure: a linear-nonlinear computation with a sigmoid nonlinearity producing $\mu$, followed by a sampling step)
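Concretely, again with illustrative weight names,
$a = Wh + b, \quad \mu = \mathrm{sigmoid}(a), \quad z \sim \mathrm{Bernoulli}(\mu), \quad \text{i.e. } P(z = 1) = \mu.$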
9. Applications of stochastic units
Stochastic feed-forward networks
• Non-deterministic prediction
• Used for multi-valued prediction tasks
• E.g. inpainting the lower part of given images
Learning generative models
• The loss function is often written as a computational graph including stochastic units
• E.g. the variational autoencoder (VAE)
10. Gradient estimation of stochastic NNs is difficult!
A stochastic NN is NOT deterministic
-> we have to optimize the expectation of the loss over the stochasticity
• All possible realizations of the stochastic units should be considered (with their losses weighted by their probabilities)
• Enumerating all such realizations is infeasible!
• We cannot enumerate all samples from a Gaussian
• Even with Bernoulli units, enumerating all $2^H$ configurations of $H$ binary units takes exponential time
-> we need approximation
We want to optimize $\mathbb{E}_{z \sim p_\theta(z \mid x)}[L(z)]$.
11. General trick: likelihood-ratio method
• Do forward prop with sampling
• Decrease the probability of chosen values if the loss is high
• It is difficult to decide whether the loss this time is high or low…
-> decrease the probability by an amount proportional to the loss
• Using the log-derivative gives an unbiased gradient estimate:
$\nabla_\theta\, \mathbb{E}_{z \sim p_\theta}[L(z)] = \mathbb{E}_{z \sim p_\theta}\bigl[L(z)\, \nabla_\theta \log p_\theta(z)\bigr]$
($z$: sampled from $p_\theta$; $L(z)$: sampled loss; $\nabla_\theta \log p_\theta(z)$: log derivative)
Not straightforward to implement in NN frameworks (I'll show how later)
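As a concrete illustration (not from the slides), here is a minimal NumPy sketch comparing the exact gradient with the likelihood-ratio estimate for a single Bernoulli unit; the toy loss and all names are made up for this example.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(z):
    # Toy loss that depends on the binary sample z in {0, 1}
    return (z - 0.8) ** 2

a = 0.3                 # pre-activation: the parameter we differentiate with respect to
mu = sigmoid(a)         # P(z = 1)

# Exact gradient of E[L] = mu * L(1) + (1 - mu) * L(0) with respect to a
exact = (loss(1.0) - loss(0.0)) * mu * (1 - mu)

# Likelihood-ratio estimate: average of L(z) * d/da log p(z | a),
# where d/da log p(z | a) = z - mu for a Bernoulli unit with logit a
rng = np.random.RandomState(0)
z = (rng.rand(100000) < mu).astype(np.float64)
lr_estimate = np.mean(loss(z) * (z - mu))

print(exact, lr_estimate)   # the two numbers should roughly agree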
12. Technique: LR with baseline
The LR method results in high variance
• The gradient estimate is accurate only after averaging many samples
(because the log-derivative is not related to the loss function)
We can reduce the variance by shifting the loss value by a constant: using $L(z) - b$ instead of $L(z)$
• It does not change the relative goodness of each sample, and the estimate stays unbiased (see the derivation below)
• The shift $b$ is called a baseline
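Why the baseline keeps the estimate unbiased (a short derivation, not shown on the slide): since $b$ is a constant and the expected log-derivative is zero,
$\mathbb{E}_{z \sim p_\theta}\bigl[\nabla_\theta \log p_\theta(z)\bigr] = \int p_\theta(z)\, \nabla_\theta \log p_\theta(z)\, dz = \nabla_\theta \int p_\theta(z)\, dz = \nabla_\theta 1 = 0,$
we have $\mathbb{E}\bigl[(L(z) - b)\, \nabla_\theta \log p_\theta(z)\bigr] = \mathbb{E}\bigl[L(z)\, \nabla_\theta \log p_\theta(z)\bigr]$.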
13. Modern trick: reparameterization trick
Write the sampling procedure as a differentiable computation
• Given noise, the computation is deterministic and differentiable
• Easy to implement in NN frameworks (as easy as dropout)
• The variance is low!!
$z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ is the noise
14. Summary of learning stochastic NNs
For Gaussian units, we can use the reparameterization trick
• It has low variance, so we can train such networks efficiently
For Bernoulli units, we have to use likelihood-ratio methods
• They have high variance, which is problematic
• To capture the discrete nature of data representations, it is better to use discrete units, so we need a fast algorithm for learning discrete units
16. Task 1: variational autoencoder (VAE)
An autoencoder whose hidden layer is a diagonal Gaussian, trained with the reparameterization trick
(Figure: encoder -> sampled hidden layer z -> decoder; the loss is the reconstruction loss plus the KL loss, which acts as regularization)
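In symbols, the loss is a single-sample estimate of the negative variational bound, matching recon_loss + kl_loss in the code on the next slide:
$\text{loss} = -\log p(x \mid z) + \mathrm{KL}\bigl(q(z \mid x)\,\|\,p(z)\bigr), \quad z = \mu + \sigma \odot \epsilon,\ \epsilon \sim \mathcal{N}(0, I).$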
17.
import chainer
import chainer.functions as F

class VAE(chainer.Chain):
    def __init__(self, encoder, decoder):
        super().__init__(encoder=encoder, decoder=decoder)

    def __call__(self, x):
        mu, ln_var = self.encoder(x)
        # You can also write: z = F.gaussian(mu, ln_var)
        sigma = F.exp(ln_var / 2)
        # Reparameterization trick: standard normal noise (not uniform), scaled and shifted
        eps = self.xp.random.randn(*mu.data.shape).astype(mu.data.dtype)
        z = mu + sigma * eps
        # The decoder is assumed to output the mean and log-variance of p(x | z)
        x_mu, x_ln_var = self.decoder(z)
        recon_loss = F.gaussian_nll(x, x_mu, x_ln_var)
        kl_loss = F.gaussian_kl_divergence(mu, ln_var)
        return recon_loss + kl_loss
18.
(The same code as on the previous slide, with the sigma, eps, and z lines highlighted as the stochastic part.)
Just returning the stochastic loss.
Backprop through the sampled loss estimates the gradient.
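To make "backprop from the sampled loss" concrete, here is a minimal, hypothetical training step (not from the slides), assuming encoder and decoder chains and a float32 mini-batch x are prepared elsewhere:

import chainer

model = VAE(encoder, decoder)
optimizer = chainer.optimizers.Adam()
optimizer.setup(model)
# update() calls model(x), runs backprop from the returned (stochastic) loss,
# and applies the Adam update to all parameters.
optimizer.update(model, x)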
19. Task 2: variational learning of
sigmoid belief network (SBN)
A hierarchical autoencoder with Bernoulli units (2 hidden layers case)
(Figure: the two hidden layers z1 and z2 are sampled during the forward pass, and the sampled loss is computed from those samples)
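The deck does not show SBNBase itself, so here is a minimal sketch of what its forward pass might look like, just to make the tuples (a1, mu1, z1) and (a2, mu2, z2) used on the next slides concrete. The encoder links q1 and q2 are hypothetical names; p1, p2, and prior are the ones used later.

import chainer
import chainer.functions as F

class SBNBase(chainer.Chain):
    # __init__ (registering q1, q2, p1, p2 and the prior parameter) is omitted here.
    def forward(self, x):
        a1 = self.q1(x)              # pre-sigmoid activations of the first hidden layer
        mu1 = F.sigmoid(a1)          # P(z1 = 1 | x)
        # Sampling is not differentiable: z1 is a plain array, so no gradient flows through it
        z1 = (self.xp.random.rand(*mu1.data.shape) < mu1.data).astype(mu1.data.dtype)
        a2 = self.q2(z1)
        mu2 = F.sigmoid(a2)
        z2 = (self.xp.random.rand(*mu2.data.shape) < mu2.data).astype(mu2.data.dtype)
        return (a1, mu1, z1), (a2, mu2, z2)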
This code computes the following loss value (a single sample of the negative variational bound):
$\text{loss} = -\log p(x, z_1, z_2) + \log q(z_1, z_2 \mid x)$
bernoulli_nll(x, y) computes the negative log-likelihood of a Bernoulli variable whose logits are y:
$\mathrm{bernoulli\_nll}(x, y) = \sum_i \bigl[\mathrm{softplus}(y_i) - x_i y_i\bigr] = -\sum_i \log \mathrm{Bernoulli}\bigl(x_i \mid \mathrm{sigmoid}(y_i)\bigr)$
class SBNLR(SBNBase):
    def expected_loss(self, x, forward_result):
        (a1, mu1, z1), (a2, mu2, z2) = forward_result
        # -log p(x, z1, z2) = -log p(z2) - log p(z1 | z2) - log p(x | z1)
        neg_log_p = (bernoulli_nll(z2, self.prior) +
                     bernoulli_nll(z1, self.p2(z2)) +
                     bernoulli_nll(x, self.p1(z1)))
        # -log q(z1, z2 | x) = -log q(z1 | x) - log q(z2 | z1)
        neg_log_q = (bernoulli_nll(z1, mu1) +
                     bernoulli_nll(z2, mu2))
        return F.sum(neg_log_p - neg_log_q)

def bernoulli_nll(x, y):
    # y: logits; softplus(y) - x * y = -log Bernoulli(x | sigmoid(y)), summed over units
    return F.sum(F.softplus(y) - x * y, axis=1)
24. How can we compute gradients through sampling?
Recall the likelihood-ratio estimate:
$\nabla_\theta\, \mathbb{E}_{z \sim q_\theta}[L(z)] = \mathbb{E}_{z \sim q_\theta}\bigl[L(z)\, \nabla_\theta \log q_\theta(z)\bigr]$ (plus $\mathbb{E}[\nabla_\theta L(z)]$ when $L$ itself depends on $\theta$, as it does here)
We can trick the gradient-based optimizer by passing it a fake loss value whose gradient is the LR estimate.
25. How can we compute gradients through sampling? (cont.)
def __call__(self, x):
    forward_result = self.forward(x)
    loss = self.expected_loss(x, forward_result)
    (a1, mu1, z1), (a2, mu2, z2) = forward_result
    # bernoulli_nll(z, a) is -log q(z | a); negate it to get the log-probability of the
    # sampled configuration, and multiply by the detached loss value (loss.data) so that
    # no gradient flows through the loss factor itself.
    fake1 = -loss.data * bernoulli_nll(z1, a1)
    fake2 = -loss.data * bernoulli_nll(z2, a2)
    fake = F.sum(fake1) + F.sum(fake2)
    return loss + fake
Fake loss: the gradient of fake is the LR term, so the optimizer simply runs backprop from the returned value.
26. Other notes on experiments (1)
Plain LR does not learn well; it always needs a baseline.
• There are many baseline techniques, including
• a moving average of the loss value (a sketch follows below)
• predicting the loss value from the input
• optimal constant baseline estimation
It is better to use momentum SGD with an adaptive learning rate (= Adam)
• Momentum effectively reduces the gradient noise
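For example, the moving-average baseline mentioned above could be added to the fake-loss code from slide 25 roughly as follows; self.baseline and the decay constant 0.9 are illustrative assumptions, not taken from the slides.

def __call__(self, x):
    forward_result = self.forward(x)
    loss = self.expected_loss(x, forward_result)
    (a1, mu1, z1), (a2, mu2, z2) = forward_result
    # Shift the loss by the current baseline, then update the baseline as an
    # exponential moving average of past loss values (self.baseline starts at 0.0).
    shifted = loss.data - self.baseline
    self.baseline = 0.9 * self.baseline + 0.1 * float(loss.data)
    fake1 = -shifted * bernoulli_nll(z1, a1)
    fake2 = -shifted * bernoulli_nll(z2, a2)
    return loss + F.sum(fake1) + F.sum(fake2)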
27. Other notes on experiments (2)
Use Trainer!
• The snapshot extension makes resume/suspend easy, which is crucial for long experiments
• Adding a custom extension is super easy (a minimal example is sketched below): I wrote
• an extension that keeps the model with the best validation score so far (for early stopping)
• an extension that reports the variance of the estimated gradients
• an extension that plots the learning curve at regular intervals
Use the report function!
• It makes it easy to collect statistics of any values computed as by-products of the forward computation
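As an illustration (not the extensions actually used in the talk), a custom extension can be a plain function decorated with make_extension; the body below is a placeholder.

from chainer.training import make_extension

@make_extension(trigger=(1, 'epoch'))
def report_gradient_variance(trainer):
    # Access the model through the main optimizer and compute/report any statistic,
    # e.g. the variance of recently estimated gradients.
    model = trainer.updater.get_optimizer('main').target
    ...

# Assuming `trainer` is a chainer.training.Trainer built elsewhere:
# trainer.extend(report_gradient_variance)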
28. Example of report function
def expected_loss(self, x, forward_result):
    (a1, mu1, z1), (a2, mu2, z2) = forward_result
    neg_log_p = (bernoulli_nll(z2, self.prior) +
                 bernoulli_nll(z1, self.p2(z2)) +
                 bernoulli_nll(x, self.p1(z1)))
    neg_log_q = (bernoulli_nll(z1, mu1) +
                 bernoulli_nll(z2, mu2))
    chainer.report({'nll_p': neg_log_p,
                    'nll_q': neg_log_q}, self)
    return F.sum(neg_log_p - neg_log_q)
The LogReport extension will log the averages of these reported values for each interval. The values are also reported during validation.
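For reference, a hedged sketch of how the reported values can be consumed, assuming a standard Trainer setup and that scalar summaries are reported; values reported by the model appear under the 'main/' prefix.

from chainer.training import extensions

trainer.extend(extensions.LogReport())
trainer.extend(extensions.PrintReport(['epoch', 'main/nll_p', 'main/nll_q']))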
29. My research
My current research is on low-variance gradient estimators for stochastic NNs with Bernoulli units
• They need extra computation, which is embarrassingly parallelizable
• Theoretically guaranteed to have lower variance than LR (even vs. LR with the optimal input-dependent baseline)
• Empirically shown to learn faster
30. Summary
• Stochastic units introduce stochasticity to neural networks
(and their computational graphs)
• The reparameterization trick and likelihood-ratio methods are often used for learning them
• The reparameterization trick can be implemented with Chainer as a simple feed-forward network with an additional noise input
• Likelihood-ratio methods can be implemented with Chainer using a fake loss