Bayesian Posterior Inference
in the
Big Data Arena
Max Welling
Anoop Korattikara
Outline
• Introduction
• Stochastic Variational Inference
– Variational Inference 101
– Stochastic Variational Inference
– Deep Generative Models with SVB
• MCMC with mini-batches
– MCMC 101
– MCMC using noisy gradients
– MCMC using noisy Metropolis-Hastings
– Theoretical results
• Conclusion
Big Data (mine is bigger than yours)
The Square Kilometer Array (SKA) will produce 1 exabyte per day by 2024…
(interested in doing approximate inference on this data? talk to me)
Introduction
Why do we need posterior inference if the datasets are BIG?
p >> N
Big data may mean large p, small N:
• Gene expression data
• fMRI data
Planning
Planning against uncertainty needs probabilities
Little data inside Big data
Not every data-case carries information about every model component
New user with no ratings
(cold start problem)
Big Models!
1943: First NN (+/- N=10)
1988: NetTalk (+/- N=20K)
2009: Hinton’s Deep Belief Net (+/- N=10M)
2013: Google/Y! (+/- N=10B)
Models grow faster than useful information in data
Two Ingredients for Big Data Bayes
Any big data posterior inference algorithm should:
1. easily run on a distributed architecture.
2. only use a small mini-batch of the data at every iteration.
Bayesian Posterior Inference
Variational Inference (search within a variational family Q):
• Deterministic
• Biased
• Local minima
• Easy to assess convergence
Sampling (ranges over all probability distributions):
• Stochastic (sample error)
• Unbiased
• Hard to mix between modes
• Hard to assess convergence
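Concretely, the two approaches solve different problems (these are the equations from the speaker notes at the end of this deck): variational inference optimizes over Q, while sampling approximates posterior expectations by an empirical average.

$q^* = \arg\min_{q \in Q} \mathrm{KL}\left[\, q(\theta) \,\|\, p(\theta \mid X) \,\right]$

$\mathbb{E}_{p(\theta \mid X)}\left[ f(\theta) \right] \approx \frac{1}{T} \sum_{t=1}^{T} f(\theta_t), \qquad \theta_t \sim p(\theta \mid X)$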
Variational Bayes
Hinton & van Camp (1993); Neal & Hinton (1999); Saul & Jordan (1996); Saul, Jaakkola & Jordan (1996); Attias (1999, 2000); Wiegerinck (2000); Ghahramani & Beal (2000, 2001)
Coordinate descent on Q
[Figure: Q approximating P, from Bishop, Pattern Recognition and Machine Learning]
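The coordinate update referred to here is the standard mean-field identity (a textbook fact, not spelled out on the slide): with Q factorized, each factor is set to the exponentiated expected log-joint under the remaining factors, and cycling through the factors is coordinate descent on the KL objective.

$\log q_j^*(\theta_j) = \mathbb{E}_{q_{-j}}\left[ \log p(X, \theta) \right] + \text{const}, \qquad q(\theta) = \prod_j q_j(\theta_j)$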
Stochastic VB (Hoffman, Blei & Bach, 2010)
Stochastic natural gradient descent on Q:
• P and Q in the exponential family.
• Q factorized over global and local variables.
• At every iteration, subsample n << N data-cases:
– solve the local problem analytically;
– update the global parameters using stochastic natural gradient descent (a sketch follows below).
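A minimal sketch of that loop for a conjugate exponential-family model, in the spirit of Hoffman et al.; the prior parameter alpha and the callable expected_suff_stats (the analytically solved local step) are hypothetical placeholders, not from the slides.

import numpy as np

def svi(X, alpha, expected_suff_stats, n=100, n_steps=1000, tau=1.0, kappa=0.9, seed=0):
    """Stochastic variational inference sketch (conjugate exponential family).

    X: (N, d) data; alpha: natural parameters of the prior;
    expected_suff_stats(lam, batch): summed expected sufficient statistics of
    the mini-batch under the analytically optimized local variational factors.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    lam = alpha.copy()                      # global variational parameters
    for t in range(n_steps):
        batch = X[rng.choice(N, n, replace=False)]
        # Intermediate estimate: scale the mini-batch statistics up by N/n.
        lam_hat = alpha + (N / n) * expected_suff_stats(lam, batch)
        rho = (t + tau) ** (-kappa)         # Robbins-Monro step size
        # A natural-gradient step of size rho is a convex combination:
        lam = (1.0 - rho) * lam + rho * lam_hat
    return lam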
General SVB
Subsample X (ignoring latent variables Z) and estimate the gradient of the variational objective by sampling from Q; this naive sample-based estimate has very high variance.
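To see where that variance comes from, here is a sketch of the naive score-function estimator on a toy problem; the objective f(z) = z², with q = N(mu, 1), is an illustrative assumption, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
mu = 2.0
f = lambda z: z**2      # toy objective: grad_mu E_q[z^2] = 2*mu = 4 exactly

# Score-function estimator: grad_mu E_q[f(z)] = E_q[ f(z) * grad_mu log q(z) ],
# and for q = N(mu, 1) the score is grad_mu log q(z) = (z - mu).
z = rng.normal(mu, 1.0, size=(10000, 100))     # 10000 estimates, 100 samples each
grads = (f(z) * (z - mu)).mean(axis=1)
print(grads.mean(), grads.std())               # unbiased (~4), but a wide spread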
Reparameterization Trick
Kingma 2013, Bengio 2013, Kingma & W. 2014
Other solutions to the same “large variance problem”:
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression [T. Salimans and A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Estimating or Propagating Gradients Through Stochastic Neurons [Y. Bengio, 2013]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]
Talk Monday June 23, 15:20, in Track F (Deep Learning II): “Efficient Gradient Based Inference through Transformations between Bayes Nets and Neural Nets”
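For contrast with the score-function sketch above, the same toy gradient with the reparameterization trick: writing z = mu + eps with eps ~ N(0, 1) lets the gradient pass through f directly, and the spread of the estimate shrinks substantially. The toy setup is again an illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)
mu = 2.0

# Reparameterize z = mu + eps, eps ~ N(0, 1), so for f(z) = z^2:
# grad_mu E_q[f(z)] = E[ f'(mu + eps) ] = E[ 2 * (mu + eps) ].
eps = rng.normal(0.0, 1.0, size=(10000, 100))
grads = (2.0 * (mu + eps)).mean(axis=1)
print(grads.mean(), grads.std())   # same expectation (~4), much smaller spread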
Auto-Encoding Variational Bayes (Kingma & W., 2013; Rezende et al., 2014)
Both P(X|Z) and Q(Z|X) are general models (e.g. deep neural nets).
[Diagram: latent Z, observed X; recognition model Q(Z|X), generative model P(X|Z)P(Z)]
Compare the Helmholtz machine and the wake/sleep algorithm (Dayan, Hinton, Neal, Zemel, 1995).
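A minimal AEVB sketch in PyTorch: a Gaussian recognition model Q(Z|X), a Bernoulli decoder P(X|Z), and the reparameterized negative ELBO as the loss. Layer sizes, nonlinearities and optimizer settings are illustrative assumptions, not the authors' configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=2):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec1 = nn.Linear(z_dim, h_dim)
        self.dec2 = nn.Linear(h_dim, x_dim)

    def forward(self, x):
        h = torch.tanh(self.enc(x))                    # Q(Z|X): Gaussian
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        logits = self.dec2(torch.tanh(self.dec1(z)))   # P(X|Z): Bernoulli logits
        return logits, mu, logvar

def neg_elbo(logits, mu, logvar, x):
    # Reconstruction term: -E_Q[log P(X|Z)], single-sample estimate
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
    # KL(Q(Z|X) || P(Z)) in closed form for Gaussian Q, standard-normal prior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(128, 784).round()       # stand-in mini-batch of binarized data
opt.zero_grad()
loss = neg_elbo(*model(x), x)
loss.backward()
opt.step()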
The VB Landscape
• SVB: Stochastic Variational Bayes
• AEVB: Auto-Encoding Variational Bayes
• SSVB: Structured Stochastic Variational Bayes
• FSSVB: Fully Structured Stochastic Variational Bayes (ICML 2015)
Variational Auto-Encoder
(with 2 latent variables)
Face Model
Semi-supervised Model (Kingma, Rezende, Mohamed, Wierstra, W., 2014)
[Diagram: latent Z and label Y jointly generating observed X]
Q(Y,Z|X) = Q(Z|Y,X) Q(Y|X)
P(X,Z,Y) = P(X|Z,Y) P(Y) P(Z)
Analogies: fix Z, vary Y, sample X|Z,Y
REFERENCES SVB:
- Practical Variational Inference for Neural Networks [A. Graves, 2011]
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression, Bayesian Analysis [T. Salimans and A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Stochastic Structured Mean Field Variational Inference [M. Hoffman, 2013]
- Doubly Stochastic Variational Bayes for non-Conjugate Inference [M.K. Titsias and M. Lázaro-Gredilla, 2014]
REFERENCES STOCHASTIC BACKPROP AND DEEP GENERATIVE MODELS:
- Fast Gradient-Based Inference with Continuous Latent Variable Models in Auxiliary Form [D.P. Kingma, 2013]
- Estimating or Propagating Gradients Through Stochastic Neurons [Y. Bengio, 2013]
- Auto-Encoding Variational Bayes [D.P. Kingma and M. W., 2013]
- Semi-supervised Learning with Deep Generative Models [D.P. Kingma, D.J. Rezende, S. Mohamed, M. W., 2014]
- Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets [D.P. Kingma and M. W., 2014]
- Deep Generative Stochastic Networks Trainable by Backprop [Y. Bengio, E. Laufer, G. Alain, J. Yosinski, 2014]
- Stochastic Back-propagation and Approximate Inference in Deep Generative Models [D.J. Rezende, S. Mohamed and D. Wierstra, 2014]
- Deep AutoRegressive Networks [K. Gregor, A. Mnih and D. Wierstra, 2014]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]
References: Lots of action at ICML 2014!
Sampling 101 – Why MCMC?
Generating independent samples: sample from a proposal g and suppress samples with low p(θ|X), e.g. (a) rejection sampling, (b) importance sampling.
– Does not scale to high dimensions.
Markov Chain Monte Carlo:
• Make steps by perturbing the previous sample.
• The probability of visiting a state equals the posterior P(θ|X).
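A sketch of rejection sampling for a 1-D target, to make "suppress samples with low p(θ|X)" concrete; the unnormalized target and the envelope constant M are illustrative assumptions. It also hints at why this fails in high dimensions: the acceptance rate collapses.

import numpy as np

rng = np.random.default_rng(0)
p_tilde = lambda th: np.exp(-0.5 * th**2) * (1 + np.sin(3 * th)**2)  # toy target
M = 2.0 * np.sqrt(2.0 * np.pi)    # envelope: p_tilde <= M * g for g = N(0, 1)

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        th = rng.normal()                              # propose from g = N(0, 1)
        g = np.exp(-0.5 * th**2) / np.sqrt(2.0 * np.pi)
        # Accept with probability p_tilde(th) / (M * g(th)) <= 1; in high
        # dimensions this probability becomes vanishingly small.
        if rng.uniform() < p_tilde(th) / (M * g):
            samples.append(th)
    return np.array(samples)

print(rejection_sample(5))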
Sampling 101 – What is MCMC?
Burn-in (throw away), then samples from S0.
Autocorrelation time τ.
[Trace plots: last position coordinate vs. iteration (0–1000) for Random-walk Metropolis (high τ) and Hamiltonian Monte Carlo (low τ)]
Sampling 101 – Metropolis-Hastings
Transition kernel T(θ_{t+1}|θ_t): propose, then accept/reject test.
The test asks two questions: Is the new state more probable? Is it easy to come back to the current state?
For Bayesian posterior inference:
1) Burn-in is unnecessarily slow.
2) The autocorrelation time τ is too high.
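A sketch of random-walk Metropolis-Hastings; log_post is a hypothetical callable returning the unnormalized log posterior. The acceptance ratio encodes exactly the two questions above (with a symmetric proposal, the "come back" term cancels).

import numpy as np

def metropolis_hastings(log_post, theta0, n_steps=10000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_post(theta0)
    chain = []
    for _ in range(n_steps):
        prop = theta + step * rng.normal(size=theta.shape)  # symmetric proposal
        lp_prop = log_post(prop)          # for a posterior this is O(N) per step
        # log acceptance ratio; the T(theta|prop)/T(prop|theta) term is 1 here
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain.append(theta)
    return np.array(chain)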
Approximate MCMC
[Scatter plots of posterior samples: a fast sampler has low variance but high bias; a slow sampler has high variance but low bias. Decreasing ϵ moves the sampler from the fast, high-bias regime toward the slow, low-bias regime.]
Minimizing Risk
Risk = Bias² + Variance
[Plot: Bias², Variance and Risk vs. ϵ for a fixed computational time budget]
Given finite sampling time, ϵ=0 is not the optimal setting.
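A hedged way to make the argument explicit (the O(ϵ) bias rate is taken from the speaker notes; the variance model is an assumption): within a fixed time budget, the number of effective samples T(ϵ) shrinks as ϵ decreases, so

$\mathrm{Risk}(\epsilon) = \mathrm{Bias}(\epsilon)^2 + \mathrm{Var}(\epsilon), \qquad \mathrm{Bias}(\epsilon) = O(\epsilon), \qquad \mathrm{Var}(\epsilon) \propto \frac{1}{T(\epsilon)}$

and the variance term blows up at ϵ = 0, placing the risk minimizer at some ϵ* > 0.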
Designing fast MCMC samplers
Propose, then accept/reject: the exact test costs O(N) per iteration.
Method 1: develop an approximate accept/reject test that uses only a fraction of the data.
Method 2: develop a proposal with acceptance probability ≈ 1 and avoid the expensive accept/reject test.
Stochastic Gradient Langevin Dynamics (W. & Teh, 2011)
Langevin dynamics: a full-data gradient step plus injected Gaussian noise; θ_{t+1} is then accepted/rejected using a Metropolis-Hastings test.
SGLD: replace the full-data gradient with a mini-batch estimate, and avoid the expensive Metropolis-Hastings test by keeping ε small (see the sketch below and Editor's Note 5).
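A sketch of the SGLD update, matching the equations in Editor's Note 5; grad_log_prior and grad_log_lik are hypothetical model-specific callables, and the polynomially decaying step-size schedule is illustrative.

import numpy as np

def sgld(X, theta, grad_log_prior, grad_log_lik, n=100, n_steps=10000,
         eps0=1e-3, gamma=0.55, seed=0):
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    chain = []
    for t in range(n_steps):
        eps = eps0 * (t + 1) ** (-gamma)   # decreasing step size drives bias to 0
        batch = X[rng.choice(N, n, replace=False)]
        # Stochastic gradient of the log posterior: mini-batch term scaled by N/n
        g = grad_log_prior(theta) + (N / n) * grad_log_lik(batch, theta).sum(axis=0)
        # Langevin step with injected noise of variance eps; the MH test is
        # skipped because eps is kept small.
        theta = theta + 0.5 * eps * g + rng.normal(0.0, np.sqrt(eps), size=theta.shape)
        chain.append(theta)
    return np.array(chain)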
SGLD & Optimization
The SGLD Knob
Decrease ϵ over time: burn-in using SGA (stochastic gradient ascent), then biased sampling, then exact sampling.
Demo: SGLD


Editor's Notes

  1. Properties – Variational Inference is inherently biased, MCMC is unbiased given infinite sampling time, etc. Main LaTeX equations: $q^* = \min_{q \in Q} \mathrm{KL}\left[\, q(\theta) \,\|\, p(\theta \mid X) \,\right]$ and $\mathbb{E}_{p(\theta \mid X)}\left[ f(\theta) \right] \approx \frac{1}{T} \sum_{t=1}^{T} f(\theta_t) \ \text{where } \theta_t \sim p(\theta \mid X)$
  2. Is there too much information on this slide? LaTeX: Given target distribution $S_0$, design transitions s.t. $p_t(\theta_t) \to S_0$ as $t \to \infty$
  3. $S_0(\theta) \propto p(\theta) \prod_{i=1}^N p(x_i \mid \theta)$
  4. $\text{Use samples from } \mathcal{S}_\epsilon \text{ (instead of } \mathcal{S}_0\text{) to compute } \langle f \rangle_{\mathcal{S}_0}$
  5. $\theta' \leftarrow \theta_t + \frac{\epsilon}{2} \nabla_\theta \log \mathcal{S}_0(\theta_t) + \eta \quad \text{where } \eta \sim \mathcal{N}(0, \epsilon)$; $\theta_{t+1} \leftarrow \theta_t + \frac{\epsilon}{2} \nabla_\theta \log \mathcal{S}_0(\theta_t) + \mathcal{N}(0, \epsilon)$; $\text{Bias} = \langle f \rangle_{\mathcal{S}_0} - \langle f \rangle_{\mathcal{S}_\epsilon} = O(\epsilon)$; $\text{Bayesian posterior: } \nabla_\theta \log \mathcal{S}_0(\theta_t) = \frac{N}{n} \sum_{i=1}^n \nabla \log l(x_i \mid \theta_t) + \nabla \log \rho(\theta_t) \qquad O(n)$