Learning stochastic neural networks
with Chainer
Sep. 21, 2016 | PyCon JP @ Waseda University
The University of Tokyo, Preferred Networks, Inc.
Seiya Tokui
@beam2d
Self introduction
• Seiya Tokui
• @beam2d (Twitter/GitHub)
• Researcher at Preferred Networks, Inc.
• Lead developer of Chainer (a framework for neural nets)
• Ph.D student at the University of Tokyo (since Apr. 2016)
• Supervisor: Lect. Issei Sato
• Topics: deep generative models
Today I will talk as a student (i.e. an academic researcher and
a user of Chainer).
2
Topics of this talk: how to compute
gradients through stochastic units
First 20 min.
• Stochastic unit
• Learning methods for stochastic neural nets
Second 20 min.
• How to implement it with Chainer
• Experimental results
Take-home message: you can train stochastic NNs without
modifying the backprop procedure in most frameworks
(including Chainer)
3
Caution!!!
This talk DOES NOT introduce
• basic maths
• backprop algorithm
• how to install Chainer (see the official documentation!)
• basic concept and usage of Chainer (ditto!)
I could not avoid using some math to explain the work, so just
take this talk as an example of how a researcher writes scripts
in Python.
4
Stochastic units
and their learning methods
Neural net: directed acyclic graph of
linear-nonlinear operations
Linear Nonlinear
6
All operations are deterministic and differentiable
Stochastic unit: a neuron with sampling
Case 1: (diagonal) Gaussian
This unit defines a random variable, and forward-prop performs the
sampling.
7
Linear-Nonlinear Sampling
Stochastic unit: a neuron with sampling
8
Linear-Nonlinear Sampling
Case 2: Bernoulli (binary unit, taking the value 1 with probability μ,
where μ is the sigmoid of the pre-activation)
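The unit definitions were shown as figures on these slides; in standard notation, and assuming the pre-activation is a linear function of the previous layer h, the two cases are:
Case 1 (diagonal Gaussian): \mu = W_\mu h + b_\mu, \ \log\sigma^2 = W_\sigma h + b_\sigma, \ z \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))
Case 2 (Bernoulli): a = W h + b, \ \mu = \mathrm{sigmoid}(a), \ z \sim \mathrm{Bernoulli}(\mu)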
Applications of stochastic units
Stochastic feed-forward networks
• Non-deterministic prediction
• Used for multi-valued predictions
• E.g. inpainting the lower part of given images
Learning generative models
• Loss function is often written as a computational graph
including stochastic units
• E.g. variational autoencoder (VAE)
9
Gradient estimation of stochastic NN is
difficult!
Stochastic NN is NOT deterministic
-> we have to optimize the expectation over the stochasticity
• All possible realizations of the stochastic units should be
considered (with losses weighted by their probabilities)
• Enumerating all such realizations is infeasible!
• We cannot enumerate all samples from a Gaussian
• Even with Bernoulli units, enumeration takes time exponential in the number of units
-> we need approximation
10
We want to optimize the expected loss E_{z \sim p_\theta}[L(z)]!
General trick: likelihood-ratio method
• Do forward prop with sampling
• Decrease the probability of chosen values if the loss is high
• it is difficult to decide whether this particular loss is high or low…
-> decrease the probability by an amount proportional to the loss
• Using the log-derivative results in an unbiased gradient estimate
Not straightforward to implement in NN frameworks
(I’ll show later)
11
LR estimator: \nabla_\theta E_{z \sim p_\theta}[L(z)] = E_{z \sim p_\theta}[L(z) \nabla_\theta \log p_\theta(z)]
(sampled loss × log derivative, with z sampled from p_\theta)
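For reference, the estimate is unbiased because \nabla_\theta p_\theta(z) = p_\theta(z) \nabla_\theta \log p_\theta(z), so
\nabla_\theta E_{z \sim p_\theta}[L(z)] = \sum_z L(z) \nabla_\theta p_\theta(z) = E_{z \sim p_\theta}[L(z) \nabla_\theta \log p_\theta(z)].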
Technique: LR with baseline
The LR method has high variance
• The gradient estimate is accurate only after averaging over many samples
(because the log-derivative itself is not related to the loss function)
We can reduce the variance by shifting the loss value by a
constant: using L - b instead of L
• It does not change the relative goodness of each sample
• The shift b is called the baseline
12
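The baseline keeps the estimate unbiased because the expected log-derivative is zero:
E_{z \sim p_\theta}[\nabla_\theta \log p_\theta(z)] = \sum_z \nabla_\theta p_\theta(z) = \nabla_\theta 1 = 0,
so E[(L(z) - b) \nabla_\theta \log p_\theta(z)] = E[L(z) \nabla_\theta \log p_\theta(z)].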
Modern trick: reparameterization trick
Write the sampling procedure as a differentiable computation
• Given noise, the computation is deterministic and differentiable
• Easy to implement on NN frameworks (as easy as dropout)
• The variance is low!!
13
z = \mu + \sigma \odot \epsilon, where \epsilon \sim N(0, I) is the noise
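A minimal Chainer-style sketch of this computation (not from the slides; mu and
ln_var are assumed to come from the preceding linear layers):

import numpy as np
import chainer.functions as F
from chainer import Variable

mu = Variable(np.zeros((4, 8), dtype=np.float32))      # mean produced by the network
ln_var = Variable(np.zeros((4, 8), dtype=np.float32))  # log variance produced by the network
eps = np.random.randn(4, 8).astype(np.float32)         # noise drawn outside the graph
z = mu + F.exp(ln_var / 2) * eps                       # deterministic given eps, so differentiable
# built-in equivalent: z = F.gaussian(mu, ln_var)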
Summary of learning stochastic NNs
For Gaussian units, we can use the reparameterization trick
• It has low variance, so we can train them efficiently
For Bernoulli units, we have to use likelihood-ratio methods
• They have high variance, which is problematic
• To capture the discrete nature of a data representation, it
is better to use discrete units, so we need a fast
algorithm for learning discrete units
14
Implementing stochastic NNs
with Chainer
Task 1: variational autoencoder (VAE)
An autoencoder whose hidden layer is a diagonal Gaussian,
trained with the reparameterization trick
16
(Diagram: encoder → Gaussian latent z → decoder; the objective is the
reconstruction loss plus the KL loss, which acts as regularization)
17
import chainer
import chainer.functions as F

class VAE(chainer.Chain):
    def __init__(self, encoder, decoder):
        super().__init__(encoder=encoder, decoder=decoder)

    def __call__(self, x):
        mu, ln_var = self.encoder(x)
        # stochastic part: reparameterized sampling of z
        # You can also write: z = F.gaussian(mu, ln_var)
        sigma = F.exp(ln_var / 2)
        eps = self.xp.random.randn(*mu.data.shape).astype('float32')  # standard normal noise
        z = mu + sigma * eps
        # the decoder is assumed here to output the mean and log variance of x
        x_mu, x_ln_var = self.decoder(z)
        recon_loss = F.gaussian_nll(x, x_mu, x_ln_var)
        kl_loss = F.gaussian_kl_divergence(mu, ln_var)
        return recon_loss + kl_loss
18
Just returning the stochastic loss.
Backprop through the sampled loss estimates the gradient.
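A hypothetical training loop for this chain (not from the slides; make_encoder,
make_decoder, and train_iter are placeholder names):

import chainer
from chainer import optimizers

model = VAE(make_encoder(), make_decoder())  # assumed constructors for the sub-networks
optimizer = optimizers.Adam()
optimizer.setup(model)

for batch in train_iter:                     # assumed iterator over float32 minibatches
    x = chainer.Variable(batch)
    model.zerograds()                        # clear accumulated gradients
    loss = model(x)                          # sampled loss (reconstruction + KL)
    loss.backward()                          # backprop through the reparameterized sample
    optimizer.update()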
Task 2: variational learning of
sigmoid belief network (SBN)
Hierarchical autoencoder with Bernoulli units
19
(Diagram: the case with 2 hidden layers; the sampled loss is
-\log p(x, z_1, z_2) + \log q(z_1, z_2 | x))
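The factorization (reconstructed from the model definitions in the code below) is
p(x, z_1, z_2) = p(z_2) p(z_1 | z_2) p(x | z_1) for the generative model and
q(z_1, z_2 | x) = q(z_1 | x) q(z_2 | z_1) for the encoder, so a single forward pass
draws (z_1, z_2) from q(· | x) and evaluates the sampled loss above.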
20
Parameter and forward-prop definitions
import chainer.links as L

class SBNBase(chainer.Chain):
    def __init__(self, n_x, n_z1, n_z2):
        super().__init__(
            q1=L.Linear(n_x, n_z1),   # q(z_1|x)
            q2=L.Linear(n_z1, n_z2),  # q(z_2|z_1)
            p1=L.Linear(n_z1, n_x),   # p(x|z_1)
            p2=L.Linear(n_z2, n_z1),  # p(z_1|z_2)
        )
        # initialize parameters: the prior p(z_2) is a learned logit vector starting at 0
        self.add_param('prior', (1, n_z2))
        self.prior.data.fill(0)

    def bernoulli(self, mu):  # sampling from Bernoulli(mu)
        noise = self.xp.random.rand(*mu.data.shape)
        return (noise < mu.data).astype('float32')

    def forward(self, x):
        # sample z_1 and z_2 through the encoder
        a1 = self.q1(x)        # 1st layer
        mu1 = F.sigmoid(a1)
        z1 = self.bernoulli(mu1)
        a2 = self.q2(z1)       # 2nd layer
        mu2 = F.sigmoid(a2)
        z2 = self.bernoulli(mu2)
        return (a1, mu1, z1), (a2, mu2, z2)
This code computes the following sampled loss value:
L = -\log p(x, z_1, z_2) + \log q(z_1, z_2 | x)
bernoulli_nll(z, a) computes the negative log-likelihood of z under Bernoulli(sigmoid(a)):
\sum_j [\mathrm{softplus}(a_j) - z_j a_j]
23
class SBNLR(SBNBase):
    def expected_loss(self, x, forward_result):
        (a1, mu1, z1), (a2, mu2, z2) = forward_result
        # -log p(x, z_1, z_2): prior, top-down layer, and observation terms
        neg_log_p = (bernoulli_nll(z2, self.prior) +
                     bernoulli_nll(z1, self.p2(z2)) +
                     bernoulli_nll(x, self.p1(z1)))
        # -log q(z_1, z_2 | x): bernoulli_nll expects the pre-sigmoid activations a1, a2
        neg_log_q = (bernoulli_nll(z1, a1) +
                     bernoulli_nll(z2, a2))
        return F.sum(neg_log_p - neg_log_q)

def bernoulli_nll(x, y):
    # -log Bernoulli(x | sigmoid(y)), where y is the pre-sigmoid activation (logits)
    return F.sum(F.softplus(y) - x * y, axis=1)
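The helper relies on the identity (with y being the pre-sigmoid activation):
\mathrm{softplus}(y) - x y = -[x \log \mathrm{sigmoid}(y) + (1 - x) \log(1 - \mathrm{sigmoid}(y))] = -\log \mathrm{Bernoulli}(x | \mathrm{sigmoid}(y)).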
How can we compute the gradient through sampling?
Recall the likelihood-ratio estimate: \nabla E[L(z)] = E[L(z) \nabla \log q(z)] for z sampled from q.
We can fool the gradient-based optimizer by passing a fake
loss value whose gradient is the LR estimate
25
def __call__(self, x):
    forward_result = self.forward(x)
    loss = self.expected_loss(x, forward_result)
    (a1, mu1, z1), (a2, mu2, z2) = forward_result
    # fake loss: loss.data is a detached array, so it acts as a constant weight
    # and only the log-probability terms of the encoder receive gradient
    fake1 = loss.data * bernoulli_nll(z1, a1)
    fake2 = loss.data * bernoulli_nll(z2, a2)
    fake = F.sum(fake1) + F.sum(fake2)
    return loss + fake

The optimizer runs backprop from this returned value (real loss + fake loss).
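In general, the mechanism is that a surrogate term of the form
(detached sampled loss) × (differentiable log-probability of the sample)
backpropagates into the LR estimate:
\nabla_\phi [L(z)_{\text{detached}} \cdot \log q_\phi(z)] = L(z) \nabla_\phi \log q_\phi(z),
a single-sample estimate of \nabla_\phi E_{z \sim q_\phi}[L(z)].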
Other note on experiments (1)
Plain LR does not learn well; it always needs a baseline.
• There are many baseline techniques, including
• a moving average of the loss value (a minimal sketch follows this slide)
• predicting the loss value from the input
• estimating the optimal constant baseline
It is better to use momentum SGD with an adaptive learning rate
• i.e. Adam
• Momentum effectively reduces the gradient noise
26
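A minimal sketch of the moving-average baseline (not from the slides; the decay value
and function names are assumptions):

baseline = 0.0   # running estimate of the typical loss value
decay = 0.99     # assumed smoothing factor

def lr_surrogate(loss_value, log_q):
    # loss_value: detached scalar loss; log_q: differentiable log-probability of the sample
    global baseline
    centered = loss_value - baseline                         # variance reduction via the baseline
    baseline = decay * baseline + (1 - decay) * loss_value   # update the moving average
    return centered * log_q                                  # gradient = (L - b) * grad(log q)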
Other note on experiments (2)
Use Trainer! (a minimal setup is sketched after this slide)
• The snapshot extension makes it easy to suspend and resume training,
which is crucial for handling long experiments
• Adding a custom extension is super-easy: I wrote
• an extension to hold the model of the current best validation score
(for early stopping)
• an extension to report variance of estimated gradients
• an extension to plot the learning curve at regular intervals
Use the report function!
• It makes it easy to collect statistics of any values that are
computed as by-products of the forward computation
27
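A hypothetical minimal Trainer setup (not from the slides; model, optimizer, and
train_iter are placeholder names):

from chainer import training
from chainer.training import extensions

updater = training.StandardUpdater(train_iter, optimizer)
trainer = training.Trainer(updater, (100, 'epoch'), out='result')

trainer.extend(extensions.snapshot(), trigger=(1, 'epoch'))  # enables suspend/resume
trainer.extend(extensions.LogReport())                       # aggregates reported values

@training.make_extension(trigger=(1, 'epoch'))
def check_grad_stats(trainer):
    # custom extension: inspect the current model or gradients here (placeholder)
    pass

trainer.extend(check_grad_stats)
trainer.run()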
Example of report function
28
def expected_loss(self, x, forward_result):
    (a1, mu1, z1), (a2, mu2, z2) = forward_result
    neg_log_p = (bernoulli_nll(z2, self.prior) +
                 bernoulli_nll(z1, self.p2(z2)) +
                 bernoulli_nll(x, self.p1(z1)))
    neg_log_q = (bernoulli_nll(z1, a1) +
                 bernoulli_nll(z2, a2))
    # report statistics computed as by-products of the forward computation
    chainer.report({'nll_p': neg_log_p,
                    'nll_q': neg_log_q}, self)
    return F.sum(neg_log_p - neg_log_q)
The LogReport extension logs the average of these reported values for
each interval. The values are also reported during validation.
My research
My current research is on low-variance gradient estimation for
stochastic NNs with Bernoulli units
• It needs extra computation, which is embarrassingly
parallelizable
• It is theoretically guaranteed to have lower variance than LR
(even vs. LR with the optimal input-dependent baseline)
• It is empirically shown to learn faster
29
Summary
• Stochastic units introduce stochasticity to neural networks
(and their computational graphs)
• The reparameterization trick and likelihood-ratio methods are
often used for learning them
• The reparameterization trick can be implemented with Chainer
as a simple feed-forward network with additional noise
• Likelihood-ratio methods can be implemented with Chainer
using a fake loss
30