4-5. SGFS
• Stochastic Gradient Langevin Dynamics: samples from the correct posterior at low ϵ (low bias).
• Markov chain for an approximate posterior: samples from an approximate posterior at any ϵ (bias ranges from low to high with ϵ).
[Figure: equations and a low-bias/high-bias axis, not recoverable from the transcript]
vrijdag 4 juli 14 (Friday, 4 July 2014)
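The Langevin update contrasted above can be sketched concretely. This is a minimal SGLD step (Welling & Teh, 2011) on a toy Gaussian-mean problem; the data set, step size, and batch size are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (made up for illustration): infer the mean theta of
# N(theta, 1) data under a N(0, 1) prior.
N = 10_000
data = rng.normal(1.5, 1.0, size=N)

def sgld_step(theta, eps, batch_size=100):
    """One SGLD update: a Langevin step with a mini-batch gradient estimate."""
    batch = data[rng.integers(0, N, size=batch_size)]
    grad_log_prior = -theta                                   # from the N(0, 1) prior
    grad_log_lik = (N / batch_size) * np.sum(batch - theta)   # rescaled mini-batch gradient
    noise = rng.normal(0.0, np.sqrt(eps))                     # injected N(0, eps) noise
    return theta + 0.5 * eps * (grad_log_prior + grad_log_lik) + noise

theta, eps = 0.0, 1e-5
samples = []
for _ in range(5000):
    theta = sgld_step(theta, eps)
    samples.append(theta)

# After burn-in, the samples concentrate near the exact posterior
# mean N * x_bar / (N + 1), close to the data mean of ~1.5.
post_mean = float(np.mean(samples[1000:]))
```

At a fixed, non-vanishing ϵ the chain is slightly biased (overdispersed), which is exactly the trade-off the slide describes.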
8. The SGFS Knob
Burn-in using large steps, then sampling; decrease ϵ over time for exact sampling.
The step size ϵ is the knob: low variance (fast) comes with high bias, high variance (slow) with low bias.
[Figure: scatter plots of samples illustrating the bias-variance trade-off]
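Turning the knob over time is usually done with a polynomial decay; Welling & Teh (2011) use a schedule of the form ϵ_t = a (b + t)^(−γ). A small sketch (the constants are arbitrary examples, not values from the slides):

```python
# Step-size annealing for SGLD/SGFS: decreasing eps over time moves the
# sampler from fast-but-biased toward exact sampling.
def step_size(t, a=1e-2, b=10.0, gamma=0.55):
    # gamma in (0.5, 1] gives sum(eps_t) = infinity while sum(eps_t**2)
    # stays finite -- the usual stochastic-approximation conditions.
    return a * (b + t) ** (-gamma)

# Large steps early (burn-in), small steps late (low bias, slow mixing):
schedule = [step_size(t) for t in (0, 100, 10_000)]
```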
13-19. Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) – Patterson & Teh, 2013
Consider the Euclidean space of parameters θ = (σ, µ) of a normal distribution:
• One pair of parameters at Euclidean distance 1 can have very different densities p(x|θ).
• Another pair at Euclidean distance 10 can have almost identical densities p(x|θ).
Remedy: measure distances with a position-specific metric G(θ), where G(θ) is positive semi-definite.
The resulting update combines a natural-gradient step, a correction for the change in curvature, and noise aligned with the metric.
[Figures and update equations omitted in the transcript]
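A preconditioned Langevin step can be sketched as follows. Note this omits the curvature-correction term involving derivatives of G(θ) that full SGRLD requires, so the sketch is only exact for a constant metric; the toy target and all constants are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of a preconditioned ("Riemannian") Langevin step. Full SGRLD also
# adds a correction term involving derivatives of G(theta); it is omitted
# here, so this is exact only when G is constant.
def riemannian_langevin_step(theta, grad_log_post, G, eps):
    G_inv = np.linalg.inv(G)
    drift = 0.5 * eps * G_inv @ grad_log_post(theta)   # natural-gradient drift
    # Injected noise is aligned with the metric: covariance eps * G^{-1}.
    noise = rng.multivariate_normal(np.zeros(theta.shape[0]), eps * G_inv)
    return theta + drift + noise

# Toy target (made up): a correlated 2-D Gaussian, with G set to its
# precision matrix so steps adapt to the local geometry.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 0.5]])
Prec = np.linalg.inv(Sigma)
grad_log_post = lambda th: -Prec @ (th - mu)

theta = np.zeros(2)
samples = []
for _ in range(20_000):
    theta = riemannian_langevin_step(theta, grad_log_post, Prec, eps=0.05)
    samples.append(theta)
post_mean = np.mean(samples[2000:], axis=0)   # near mu after burn-in
```

With G equal to the precision matrix, the drift reduces to a well-conditioned pull toward the mode regardless of how anisotropic the Euclidean geometry is.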
21-24. Stochastic Gradient Hamiltonian Monte Carlo (SGHMC)
T. Chen, E. B. Fox, C. Guestrin (2014)
An (over-)simplified explanation of Hamiltonian Monte Carlo (HMC):
• Langevin update = one informative gradient step of size ϵ + one random step of size ϵ, which yields random-walk-type movement and bad mixing.
• HMC allows multiple gradient steps per noise step.
• HMC can make distant proposals with high acceptance probability.
• Naively using stochastic gradients in HMC does not work well.
• The authors use a correction term to cancel the effect of noise in the gradients.
Talk tomorrow afternoon in Track C (Monte Carlo).
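The SGHMC update can be sketched as follows: several gradient steps per momentum resampling, plus a friction term Cv and injected noise with variance 2ϵ(C − B̂). Here the gradient-noise estimate B̂ is set to 0 for simplicity, and the toy target and constants are made up:

```python
import math
import numpy as np

rng = np.random.default_rng(2)

# Sketch of SGHMC: multiple gradient steps per momentum resampling, with
# friction C*v and injected noise of variance 2*eps*C (Bhat taken as 0).
def sghmc_trajectory(theta, stoch_grad, eps=0.05, C=1.0, n_steps=30):
    v = rng.normal()                               # resample momentum
    for _ in range(n_steps):
        theta = theta + eps * v                    # position step
        v = (v + eps * stoch_grad(theta)           # noisy gradient step
               - eps * C * v                       # friction
               + rng.normal(0.0, math.sqrt(2 * eps * C)))
    return theta

# Toy target (made up): a standard normal, with gradient noise mimicking
# the effect of mini-batching.
stoch_grad = lambda th: -th + rng.normal(0.0, 0.5)

theta, samples = 3.0, []
for _ in range(3000):
    theta = sghmc_trajectory(theta, stoch_grad)
    samples.append(float(theta))
```

Without the friction term, the accumulated gradient noise would blow the trajectory up; with it, the chain keeps the distant-proposal benefit of HMC while tolerating stochastic gradients.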
32. Distributed SGLD
Ahn, Shahbaba, Welling (2014)
[Figure: the N data points are distributed in shards across machines]
Adaptive load balancing: longer trajectories from faster machines.
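The adaptive load-balancing idea can be illustrated with a tiny sketch: faster workers are assigned proportionally longer chain trajectories before the chain moves to another worker. Worker names, speeds, and the base length below are made up:

```python
# Illustrative sketch of D-SGLD-style load balancing: trajectory length
# on each worker scales with its relative speed, so no machine idles
# while waiting for a slow one.
worker_speed = {"w1": 1.0, "w2": 2.5, "w3": 0.5}    # relative updates/sec (made up)
base_len = 100

def trajectory_lengths(speeds, base):
    mean_speed = sum(speeds.values()) / len(speeds)
    return {w: max(1, round(base * s / mean_speed)) for w, s in speeds.items()}

lengths = trajectory_lengths(worker_speed, base_len)   # w2 gets the longest run
```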
34. D-SGLD Results
Wikipedia dataset: 4.6M articles, 811M tokens, vocabulary size 7,702.
PubMed dataset: 8.2M articles, 730M tokens, vocabulary size 39,987.
Model: Latent Dirichlet Allocation.
Talk tomorrow afternoon in Track C (Monte Carlo).
35-36. A Recap
Use an efficient proposal so that the Metropolis-Hastings test can be avoided:
• SGLD – Langevin dynamics with stochastic gradients
• SGFS – preconditioning matrix based on Fisher information at the mode
• SGRLD – position-specific preconditioning matrix based on Riemannian geometry
• SGHMC – avoids random walks by taking multiple gradient steps
• DSGLD – distributed version of the above algorithms
Alternatively, approximate the Metropolis-Hastings test using less data.
37. Why approximate the MH test?
(if gradient-based methods seem to work so well)
• Gradient-based proposals are not always available:
– Parameter spaces of different dimensionality
– Distributions on constrained manifolds
– Discrete variables
• Large gradients may catapult the sampler into low-density regions
47-53. Approach 1: Using Confidence Intervals
Korattikara, Chen, Welling (2014)
Collect more data until the test is conclusive; c is chosen as in a t-test for µ = µ0 vs. µ ≠ µ0.
Talk tomorrow afternoon in Track C (Monte Carlo).
• Singh, Wick, McCallum (2012) – inference in large-scale factor graphs
• DuBois, Korattikara, Welling, Smyth (2014) – approximate slice sampling
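A sketch of this sequential test on synthetic log-likelihood ratios: the MH accept/reject decision reduces to testing whether the mean of per-datum terms l_i exceeds a threshold µ0, and data is collected only while the test is inconclusive. The batch size, tolerance, and the normal approximation to the t quantile are my simplifications, not details from the slides:

```python
import math
import numpy as np

rng = np.random.default_rng(3)

def approx_mh_decision(l_all, mu0, eps_tol=0.05, batch=50):
    """Sequentially grow the subsample until a t-style test is confident
    about the sign of mean(l) - mu0, then decide accept/reject."""
    N = len(l_all)
    perm = rng.permutation(N)
    n = 0
    while n < N:
        n = min(n + batch, N)
        l = l_all[perm[:n]]
        mean, sd = float(l.mean()), float(l.std(ddof=1))
        # Standard error with a finite-population correction, since we
        # sample without replacement from a finite data set.
        se = sd / math.sqrt(n) * math.sqrt(1.0 - (n - 1) / (N - 1))
        if se == 0.0:
            break
        t = abs(mean - mu0) / se
        # Two-sided p-value under a normal approximation to the t quantile.
        p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
        if p < eps_tol:          # confident about the sign of mean - mu0
            break
    return bool(l_all[perm[:n]].mean() > mu0), n

# Synthetic ratios whose mean is clearly above the threshold, so the
# test should stop after looking at only a small fraction of the data.
l_all = rng.normal(0.5, 1.0, size=10_000)
accept, n_used = approx_mh_decision(l_all, mu0=0.0)
```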
54. Independent Component Analysis
Mixture of 4 audio sources – 1.95 million data points, 16 dimensions.
Test function is the Amari distance to the true unmixing matrix.
56-59. Approach 2: Using Concentration Inequalities
Bardenet, Doucet, Holmes (2014)
Collect more data until a concentration bound is conclusive.
• Complementary to the previous method.
• More robust, as it does not use any CLT assumptions.
• Uses more data per test if the CLT assumptions do hold.
Talk tomorrow afternoon in Track C (Monte Carlo).
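The same sequential structure with a concentration bound in place of the t-test can be sketched as follows. A plain Hoeffding bound with a known range C is used here for brevity; the paper itself uses sharper empirical Bernstein bounds, and the synthetic data and constants are made up:

```python
import math
import numpy as np

rng = np.random.default_rng(4)

def concentration_mh_decision(l_all, mu0, C, delta=0.05, batch=100):
    """Grow the subsample until a Hoeffding confidence interval around the
    empirical mean excludes the threshold mu0, then decide."""
    N = len(l_all)
    perm = rng.permutation(N)
    n = 0
    while n < N:
        n = min(n + batch, N)
        mean = float(l_all[perm[:n]].mean())
        # Hoeffding: P(|empirical mean - true mean| > c_n) <= delta,
        # for terms with range C -- no CLT assumption needed.
        c_n = C * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
        if abs(mean - mu0) > c_n:      # the bound excludes the threshold
            break
    return bool(l_all[perm[:n]].mean() > mu0), n

# Synthetic bounded log-likelihood ratios in [0.2, 1.2], so range C = 1:
l_all = rng.uniform(0.2, 1.2, size=10_000)
accept, n_used = concentration_mh_decision(l_all, mu0=0.0, C=1.0)
```

Because Hoeffding intervals shrink more slowly than CLT-based ones, this version typically needs more data per test when the CLT does hold, matching the bullet above.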
60. Summary
Use an efficient proposal so that the Metropolis-Hastings test can be avoided:
• SGLD – Langevin dynamics with stochastic gradients
• SGFS – preconditioning matrix based on Fisher information at the mode
• SGRLD – position-specific preconditioning based on Riemannian geometry
• SGHMC – avoids random walks by taking multiple gradient steps
• DSGLD – distributed version of the above algorithms
Approximate the Metropolis-Hastings test using less data:
• Confidence intervals – based on confidence levels using CLT assumptions.
• Concentration bounds – more robust, as they do not use CLT assumptions, but use more data than the above if the CLT assumptions do hold.
61-62. Analysis: SGLD
I. Sato and H. Nakagawa (2014)
Langevin dynamics:
• The Langevin update is a discrete-time approximation of a stochastic differential equation (SDE).
• The stationary distribution of this SDE is S0(θ).
• Discretization introduces O(ϵ) errors that are corrected using an MH test.
Stochastic Gradient Langevin Dynamics:
• The stationary distribution of the SDE that SGLD represents can also be shown to be S0(θ).
• Time-discretized SGLD converges weakly to the SGLD SDE, i.e. for any continuously differentiable function f of polynomial growth: [equation omitted in the transcript]
Talk Monday afternoon in Track C (Monte Carlo & Approximate Inference).
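The equations missing from the transcript are plausibly the following standard forms (reconstructed from the surrounding text, not verbatim from the slides):

```latex
% Langevin SDE whose stationary distribution is the posterior S_0(\theta):
d\theta_t \;=\; \tfrac{1}{2}\,\nabla_\theta \log S_0(\theta_t)\,dt \;+\; dW_t

% Weak convergence of time-discretized SGLD (step size \epsilon) to the SDE:
% for any continuously differentiable f of polynomial growth,
\lim_{\epsilon \to 0}\; \mathbb{E}\!\left[f\!\left(\theta^{\mathrm{SGLD}}_t\right)\right]
\;=\; \mathbb{E}\!\left[f(\theta_t)\right]
```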
63. Analysis: Approximate MH
Assume uniform ergodicity; control the error in the transition kernel.
If the probability of making a wrong decision is controlled, then:
• the error in the acceptance probability is bounded, and
• the error in the transition probability is bounded,
where the error is measured in total variation. [Equations omitted in the transcript]
64. Analysis: Approximate MH – Error in the Stationary Distribution
If the error in the transition probability is bounded, and uniform ergodicity holds, then the error in the stationary distribution is bounded as well. [Equations omitted in the transcript]
For more details:
1. P. Alquier, N. Friel, R. Everitt, A. Boland (2014)
2. R. Bardenet, A. Doucet, C. Holmes (2014)
3. A. Korattikara, Y. Chen, M. Welling (2014)
4. N. S. Pillai, A. Smith (2014)
65. References – MCMC
Approximate MCMC algorithms using mini-batch gradients:
• Stochastic Gradient Langevin Dynamics – M. Welling and Y. W. Teh (ICML 2011)
• Stochastic Gradient Fisher Scoring – S. Ahn, A. Korattikara, M. Welling (ICML 2012)
• Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex – S. Patterson and Y. W. Teh (NIPS 2013)
• Stochastic Gradient Hamiltonian Monte Carlo – T. Chen, E. B. Fox, C. Guestrin (ICML 2014)
• Distributed Stochastic Gradient MCMC – S. Ahn, B. Shahbaba, M. Welling (ICML 2014)
Approximate MCMC algorithms using mini-batch Metropolis-Hastings:
• Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget – A. Korattikara, Y. Chen, M. Welling (ICML 2014)
• Towards Scaling up Markov Chain Monte Carlo: An Adaptive Subsampling Approach – R. Bardenet, A. Doucet, C. Holmes (ICML 2014)
• Approximate Slice Sampling for Bayesian Posterior Inference – C. DuBois, A. Korattikara, M. Welling, P. Smyth (AISTATS 2014)
Theory:
• Approximation Analysis of Stochastic Gradient Langevin Dynamics using Fokker-Planck Equation and Ito Process – I. Sato and H. Nakagawa (ICML 2014)
• Noisy Monte Carlo: Convergence of Markov Chains with Approximate Transition Kernels – P. Alquier, N. Friel, R. Everitt, A. Boland (arXiv 2014)
• Ergodicity of Approximate MCMC Chains with Applications to Large Data Sets – N. S. Pillai, A. Smith (arXiv 2014)
Asymptotically unbiased MCMC algorithms using mini-batches:
• Asymptotically Exact, Embarrassingly Parallel MCMC – W. Neiswanger, C. Wang, E. Xing (arXiv 2013)
• Firefly Monte Carlo: Exact MCMC with Subsets of Data – D. Maclaurin, R. P. Adams (arXiv 2014)
• Accelerating MCMC via Parallel Predictive Prefetching – E. Angelino, E. Kohler, A. Waterland, M. Seltzer, R. P. Adams (arXiv 2014)
66. Conclusions & Future Directions
• Bayesian inference is not superfluous in the context of big data.
• Two requirements:
– Stochastic / mini-batch based updates
– Distributed implementation
• Two fruitful approaches:
– Stochastic Variational Bayes
– Mini-batch MCMC
• Future VB:
– Very flexible variational posteriors, very small remaining bias
– Black-box inference engine, à la Infer.NET, BUGS
• Future MCMC:
– Better theory
– Better use of powerful (stochastic) optimization methods
67. [Diagram: "Stochastic, Fully Structured, Distributed Variational Bayes" (driving bias to 0) converging with "Stochastic Approximation MCMC" (driving variance to 0)]
68. Acknowledgements & Collaborators
• Yee Whye Teh
• Sungjin Ahn
• Babak Shahbaba
• Yutian Chen
• Durk Kingma
• Taco Cohen
• Alex Ihler
• Chris DuBois
• Padhraic Smyth
• Dan Gillen