2. Outline
• Introduction
• Stochastic Variational Inference
– Variational Inference 101
– Stochastic Variational Inference
– Deep Generative Models with SVB
• MCMC with mini-batches
– MCMC 101
– MCMC using noisy gradients
– MCMC using noisy Metropolis-Hastings
– Theoretical results
• Conclusion
3. Big Data (mine is bigger than yours)
The Square Kilometre Array (SKA) is projected to produce 1 exabyte of data per day by 2024…
(interested in doing approximate inference on this data? Talk to me)
7. Little data inside Big data
Not every data-case carries information about every model component
New user with no ratings
(cold start problem)
8. Big Models!
1943: First NN (N ≈ 10)
1988: NetTalk (N ≈ 20K)
2009: Hinton's Deep Belief Net (N ≈ 10M)
2013: Google/Y! (N ≈ 10B)
Models grow faster than the useful information in the data.
9. Two Ingredients for Big Data Bayes
Any big data posterior inference algorithm should:
1. easily run on a distributed architecture.
2. only use a small mini-batch of the data at every iteration.
10. Bayesian Posterior Inference
Variational Inference (search within a variational family Q) vs. Sampling (from the space of all probability distributions):
Variational Inference:
• Deterministic
• Biased
• Local minima
• Easy to assess convergence
Sampling:
• Stochastic (sample error)
• Unbiased
• Hard to mix between modes
• Hard to assess convergence
11. Variational Bayes
Coordinate descent on Q.
Hinton & van Camp (1993); Neal & Hinton (1999); Saul & Jordan (1996); Saul, Jaakkola & Jordan (1996); Attias (1999, 2000); Wiegerinck (2000); Ghahramani & Beal (2000, 2001)
[Figure: approximation of P by a member of the variational family Q (from Bishop, Pattern Recognition and Machine Learning)]
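As a sketch of what "coordinate descent on Q" amounts to: for a fully factorized (mean-field) Q, each coordinate update has a standard closed form (the notation below is assumed, not taken from the slide):

```latex
% Mean-field VB: with Q(\theta) = \prod_i q_i(\theta_i), minimizing
% KL(Q \,\|\, P(\theta \mid X)) one factor at a time gives the update
\log q_j^*(\theta_j) \;=\; \mathbb{E}_{q_{-j}}\!\left[\log P(X, \theta)\right] + \mathrm{const}
```

Cycling this update over the factors monotonically decreases the KL divergence, which is why it can get stuck in local minima but is easy to monitor for convergence.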
12. Stochastic VB
Hoffman, Blei & Bach, 2010
Stochastic natural gradient descent on Q:
• P and Q in the exponential family.
• Q factorized.
• At every iteration: subsample n << N data-cases,
• solve the local factors analytically,
• update the global parameter using stochastic natural gradient descent.
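The loop above can be sketched on a toy conjugate model (a hypothetical 1-D Gaussian-mean example; the model, variable names, and step-size schedule are my assumptions, not from the talk). One data-case is subsampled per iteration, its analytic intermediate estimate is rescaled by N, and the global natural parameters follow a stochastic natural-gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (hypothetical example): x_i ~ N(mu, 1), prior mu ~ N(0, 1).
# q(mu) = N(m, v) is tracked via its natural parameters lam = (m/v, -1/(2v)).
N = 10_000
data = rng.standard_normal(N) + 2.0          # true mu = 2

lam = np.array([0.0, -0.5])                  # start q at the prior N(0, 1)
for t, i in enumerate(rng.integers(0, N, size=1000)):
    # Analytic intermediate estimate from one data-case, rescaled by N:
    # prior natural params + N * sufficient statistics (x_i, -1/2).
    lam_hat = np.array([N * data[i], -0.5 - N * 0.5])
    rho = (t + 10.0) ** -0.6                 # Robbins-Monro step size
    lam = (1 - rho) * lam + rho * lam_hat    # stochastic natural-gradient step

v = -1.0 / (2.0 * lam[1])                    # recover variance of q
m = lam[0] * v                               # recover mean of q
```

After a few hundred steps q(mu) closes in on the exact conjugate posterior N(N x̄/(N+1), 1/(N+1)), while each iteration touched a single data-case.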
14. Reparameterization Trick
Kingma 2013, Bengio 2013, Kingma & W. 2014
Other solutions to the same "large variance problem":
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression [T. Salimans and A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Estimating or Propagating Gradients Through Stochastic Neurons [Y. Bengio, 2013]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]
Talk Monday June 23, 15:20, in Track F (Deep Learning II): "Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets"
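A minimal sketch of the trick for a Gaussian (function names are mine): instead of sampling z ~ N(μ, σ²) directly, write z = μ + σ·ε with ε ~ N(0, 1), so a Monte Carlo estimate of E[f(z)] becomes a deterministic, differentiable function of (μ, σ) and gradients can flow through it with low variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_f(mu, sigma, f, n_samples=100_000):
    """Reparameterized Monte Carlo estimate of E[f(z)], z ~ N(mu, sigma^2)."""
    eps = rng.standard_normal(n_samples)   # noise is independent of the parameters
    z = mu + sigma * eps                   # deterministic transform of eps
    return f(z).mean()                     # differentiable in mu and sigma

# Sanity check: E[z^2] = mu^2 + sigma^2 = 5 for z ~ N(1, 4).
est = expected_f(1.0, 2.0, lambda z: z**2)
```

In an autodiff framework the same transform lets the gradient of the ELBO pass through the sampling step, which is the "low variance" alternative to score-function (REINFORCE-style) estimators listed above.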
15. Auto-Encoding Variational Bayes
Kingma & W., 2013; Rezende et al., 2014
Both P(X|Z) and Q(Z|X) are general models (e.g. deep neural nets).
[Diagram: latent Z and observed X, with recognition model Q(Z|X) and generative model P(X|Z)P(Z)]
Compare: the Helmholtz machine and the wake/sleep algorithm (Dayan, Hinton, Neal & Zemel, 1995).
19. Semi-supervised Model
Kingma, Rezende, Mohamed, Wierstra, W., 2014
P(X,Z,Y) = P(X|Z,Y)P(Y)P(Z), with recognition model Q(Y,Z|X) = Q(Z|Y,X)Q(Y|X).
Analogies: fix Z, vary Y, and sample X|Z,Y.
[Diagram: latent Z and label Y jointly generating observed X]
20. REFERENCES SVB:
- Practical Variational Inference for Neural Networks [Alex Graves, 2011]
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression. Bayesian Analysis [T. Salimans and A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Stochastic Structured Mean Field Variational Inference [Matthew Hoffman, 2013]
- Doubly Stochastic Variational Bayes for non-Conjugate Inference [M.K. Titsias and M. Lázaro-Gredilla, 2014]
REFERENCES STOCHASTIC BACKPROP AND DEEP GENERATIVE MODELS:
- Fast Gradient-Based Inference with Continuous Latent Variable Models in Auxiliary Form [D.P. Kingma, 2013]
- Estimating or Propagating Gradients Through Stochastic Neurons [Y. Bengio, 2013]
- Auto-Encoding Variational Bayes [D.P. Kingma and M. W., 2013]
- Semi-supervised Learning with Deep Generative Models [D.P. Kingma, D.J. Rezende, S. Mohamed and M. W., 2014]
- Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets [D.P. Kingma and M. W., 2014]
- Deep Generative Stochastic Networks Trainable by Backprop [Y. Bengio, E. Laufer, G. Alain and J. Yosinski, 2014]
- Stochastic Back-propagation and Approximate Inference in Deep Generative Models [D.J. Rezende, S. Mohamed and D. Wierstra, 2014]
- Deep AutoRegressive Networks [K. Gregor, A. Mnih and D. Wierstra, 2014]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]
Lots of action at ICML 2014!
21. Sampling 101 – Why MCMC?
Generating independent samples:
• Sample from a proposal g and suppress samples with low p(θ|X),
• e.g. (a) rejection sampling, (b) importance sampling.
• Does not scale to high dimensions.
Markov chain Monte Carlo:
• Make steps by perturbing the previous sample.
• The probability of visiting a state is equal to P(θ|X).
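The independent-sample approach can be sketched as rejection sampling (a toy 1-D example with an illustrative target and proposal of my choosing): draw from g and accept with probability p(θ)/(M·g(θ)), which suppresses proposals where the target density is low.

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_sample(n, M=5.0):
    """Rejection sampling: target p = N(0, 1), proposal g = Uniform(-5, 5).

    Requires M * g(theta) >= p(theta) everywhere; here M * 0.1 = 0.5 > 0.399.
    """
    samples = []
    while len(samples) < n:
        theta = rng.uniform(-5.0, 5.0)
        p = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)  # target density
        g = 0.1                                           # proposal density on [-5, 5]
        if rng.random() < p / (M * g):                    # suppress low-density draws
            samples.append(theta)
    return np.array(samples)

samples = rejection_sample(5_000)
```

The acceptance rate here is 1/M = 20%; in high dimensions the required M (and hence the rejection rate) grows exponentially, which is the scaling failure noted above.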
22. Sampling 101 – What is MCMC?
Burn-in (throw away samples from S0); autocorrelation time τ.
[Figure: trace plots of the last position coordinate over 1000 iterations. Random-walk Metropolis mixes slowly (high τ); Hamiltonian Monte Carlo mixes quickly (low τ).]
23. Sampling 101 – Metropolis-Hastings
Transition kernel T(θt+1|θt): propose, then accept/reject.
The accept/reject test asks: 1) Is the new state more probable? 2) Is it easy to come back to the current state?
For Bayesian posterior inference: 1) burn-in is unnecessarily slow, and 2) the cost of each accept/reject test is too high.
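A minimal random-walk Metropolis sketch (toy 1-D standard-normal target; names and settings are illustrative): with a symmetric proposal, the accept/reject test reduces to comparing the target density at the proposed and current states, and evaluating that density is exactly the step that costs O(N) for a Bayesian posterior.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_p(theta):
    """Unnormalized log density of the target, here N(0, 1).

    For a posterior p(theta|X) this evaluation touches all N data-cases.
    """
    return -0.5 * theta**2

def metropolis(n_steps=20_000, step=1.0):
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal()       # symmetric proposal
        # MH test for a symmetric proposal: accept w.p. min(1, p(prop)/p(theta))
        if np.log(rng.random()) < log_p(prop) - log_p(theta):
            theta = prop                                  # accept
        samples.append(theta)                             # rejected -> repeat state
    return np.array(samples)

samples = metropolis()
```

Discarding an initial burn-in segment, the chain's histogram matches the target; the two pain points above correspond to the slow random-walk exploration and the per-step density evaluations.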
24. Approximate MCMC
[Figure: samples on a bias–variance plane. Decreasing ϵ moves from low variance / high bias (fast) toward high variance / low bias (slow).]
25. Minimizing Risk
Risk = Bias² + Variance
[Figure: Bias², Variance, and Risk as a function of ϵ, for a fixed computational time.]
Given finite sampling time, ϵ = 0 is not the optimal setting.
26. Designing fast MCMC samplers
Each iteration (propose, then accept/reject) costs O(N).
Method 1: develop an approximate accept/reject test that uses only a fraction of the data.
Method 2: develop a proposal with acceptance probability ≈ 1 and avoid the expensive accept/reject test.
27. Stochastic Gradient Langevin Dynamics
W. & Teh, 2011
Langevin dynamics: θt+1 is then accepted/rejected using a Metropolis-Hastings test.
SGLD: use a mini-batch gradient in place of the full-data gradient, and avoid the expensive Metropolis-Hastings test by keeping ε small.
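The SGLD step can be sketched on a toy problem (a hypothetical 1-D Gaussian-mean posterior; the model and settings are my assumptions): each update adds ε/2 times a rescaled mini-batch estimate of the log-posterior gradient plus Gaussian noise of variance ε, and the MH test is skipped because ε is small.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy posterior: x_i ~ N(theta, 1) with prior theta ~ N(0, 10^2).
N = 1_000
data = rng.standard_normal(N) + 3.0            # true theta = 3

def sgld(n_steps=5_000, batch=50, eps=1e-3):
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        idx = rng.integers(0, N, size=batch)   # mini-batch of the data
        grad_prior = -theta / 100.0                          # d/dtheta log N(theta; 0, 100)
        grad_lik = (N / batch) * np.sum(data[idx] - theta)   # rescaled mini-batch gradient
        theta += 0.5 * eps * (grad_prior + grad_lik) \
                 + np.sqrt(eps) * rng.standard_normal()      # injected Langevin noise
        samples.append(theta)                  # no MH test: eps is kept small
    return np.array(samples)

samples = sgld()
```

Early on the gradient term dominates and the update behaves like stochastic gradient ascent (burn-in); near the mode the injected noise dominates and the iterates sample approximately from the posterior, with a bias controlled by ε.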
30. The SGLD Knob
Decrease ϵ over time: burn-in using SGA → biased sampling → exact sampling.
[Figure: same bias–variance plane as before; decreasing ϵ trades low variance / high bias (fast) for high variance / low bias (slow).]