QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, Stochastic Gradient MCMC for Independent & Correlated Data - Yian Ma, Dec 11, 2017
1. Stochastic Gradient MCMC for Independent and Correlated Data
Yi-An (Yian) Ma
University of California, Berkeley
With: Tianqi Chen, Emily Fox, Nick Foti, Felix Ye
4. Classical Approach: MCMC via Jump Processes
"Standard" method = Metropolis-Hastings (MH):
– Propose θ′ from a kernel depending on the past value θ
– Accept or reject
MH is an example of a jump process.
It often explores the posterior inefficiently.
5. Continuous-Dynamics-Based Samplers (aka Grad-MCMC)
Use (stochastic) dynamics on the energy landscape to simulate distant proposals.
6. Example: Hamiltonian Monte Carlo (HMC)
Start from the target posterior of θ, whose negative log gives the potential energy; add an auxiliary "momentum" variable r, whose kinetic energy completes the Hamiltonian (total energy H).
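In standard notation (the specific symbols here are a conventional rendering, not taken verbatim from the slide), the Hamiltonian and its dynamics are:

$$ \pi(\theta) \propto e^{-U(\theta)}, \qquad H(\theta, r) = \underbrace{U(\theta)}_{\text{potential}} + \underbrace{\tfrac{1}{2} r^\top r}_{\text{kinetic}}, \qquad d\theta = r\,dt, \quad dr = -\nabla U(\theta)\,dt. $$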
10. Is there a general recipe for construction?
[Diagram: within the set of all continuous Markov processes sits the subset exploring the "correct" dynamics, i.e., processes with stationary distribution p^s(θ) = π(θ); known members include Langevin dynamics (LD), HMC, Riemann manifold LD, and Riemann manifold HMC.]
11. A Recipe for Continuous Dynamics MCMC
Assume target distribution $p^s(z) \propto e^{-H(z)}$, where $z = (\theta, r)$ collects the parameters θ and auxiliary variables, and H is the target total energy. Define the SDE

$$ dz = -\big[D(z) + Q(z)\big]\,\nabla H(z)\,dt + \Gamma(z)\,dt + \sqrt{2 D(z)}\;dW(t), \qquad \Gamma_i(z) = \sum_j \frac{\partial}{\partial z_j}\big[D_{ij}(z) + Q_{ij}(z)\big], $$

where $W(t)$ is a d-dim Wiener process, $D(z)$ is positive semidefinite (PSD), and $Q(z)$ is skew-symmetric. Then $e^{-H(z)}$ is invariant.
Ma, Chen, Fox, NIPS 2015.
13. Recipe is Complete
All continuous Markov processes with p^s(z) = π(z) are captured: every such process corresponds to an SDE defined by some D(z), Q(z).
All existing samplers can be written in the framework:
– HMC
– Riemann HMC
– LD
– Riemann LD
Any valid sampler has a D and Q in our framework.
Ma, Chen, Fox, NIPS 2015.
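As a concrete instance (standard, and consistent with the NIPS 2015 framework), HMC fits the recipe with $z = (\theta, r)$ and $H(z) = U(\theta) + \tfrac12 r^\top r$:

$$ D(z) = 0, \qquad Q(z) = \begin{pmatrix} 0 & -I \\ I & 0 \end{pmatrix}, $$

which recovers $d\theta = r\,dt$, $dr = -\nabla U(\theta)\,dt$; Langevin dynamics instead takes $D = I$, $Q = 0$.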
17. Irreversible Jump: Naïve Revision
Naïve irreversible MH: make the chain irreversible by using a forward proposal f that differs from the reverse proposal g.
This leads to the wrong stationary distribution.
18. Correcting the Algorithm
Introduce an auxiliary direction variable z_p ~ U{+1, −1}:
– z_p = +1: run the forward proposal
– z_p = −1: run the reverse (adjoint) proposal
Jump to the adjoint process upon rejection (a control-flow sketch follows).
I-Jump Algorithm: Ma, Fox, Chen, Wu, arXiv 2016.
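A minimal sketch of the control flow described above, assuming hypothetical user-supplied propose_f, propose_g, and accept_prob functions (these names and the exact acceptance rule are illustrative placeholders, not from the paper):

```python
import random

def i_jump_step(z, zp, propose_f, propose_g, accept_prob):
    """One step of the irreversible-jump correction sketched on the slide.

    zp in {+1, -1} selects between the forward proposal f and the
    reverse/adjoint proposal g; upon rejection we flip zp, i.e. jump
    to the adjoint process.
    """
    proposal = propose_f(z) if zp == 1 else propose_g(z)
    if random.random() < accept_prob(z, proposal, zp):
        return proposal, zp   # accept: keep the current direction
    return z, -zp             # reject: jump to the adjoint process
```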
19. Irreversible MALA Algorithm
Continuous dynamical process (e.g., an irreversible SDE): use one step of the dynamics as the forward proposal f(z|y).
Adjoint process (Q → −Q): use as the reverse proposal g(z|y).
Both can be computed for one step of the dynamics.
I-MALA Algorithm: Ma, Fox, Chen, Wu, arXiv 2016.
20. A Practical Algorithm
Consider an ε-discretization of the dynamics, run for m steps.
Each step must compute the gradient over the full data set!
– Computations costly for large data
– Cannot handle streaming data
22. Scaling up: Stochastic Gradients
Compute a noisy gradient based on a minibatch $\mathcal{S}$ consisting of $n$ i.i.d. observations out of $N$:

$$ \nabla \tilde U(\theta) = -\frac{N}{n} \sum_{i \in \mathcal{S}} \nabla \log p(x_i \mid \theta) - \nabla \log p(\theta) $$

– For minibatches sampled uniformly at random from the data, the noisy gradient is unbiased for the true gradient: $\mathbb{E}[\nabla \tilde U(\theta)] = \nabla U(\theta)$.
Only requires examining n data points, and we assume (appealing to the CLT):

$$ \nabla \tilde U(\theta) \approx \nabla U(\theta) + \mathcal{N}\big(0, V(\theta)\big) $$
23. Scalable Version of Algorithm
Original update rule (full gradient):

$$ z_{t+1} \leftarrow z_t - \epsilon_t\Big[\big(D(z_t)+Q(z_t)\big)\nabla H(z_t) + \Gamma(z_t)\Big] + \mathcal{N}\big(0,\; 2\epsilon_t D(z_t)\big) $$

Modified stochastic gradient update rule: replace $\nabla H$ with the noisy $\nabla \tilde H$ and subtract an estimate $\hat B_t$ of the variance of the SG noise:

$$ z_{t+1} \leftarrow z_t - \epsilon_t\Big[\big(D(z_t)+Q(z_t)\big)\nabla \tilde H(z_t) + \Gamma(z_t)\Big] + \mathcal{N}\big(0,\; \epsilon_t(2 D(z_t) - \epsilon_t \hat B_t)\big) $$

As $\epsilon_t \to 0$, the SG noise decreases and the bias $\to 0$. Use a small, finite $\epsilon$ in practice (allowing some bias). A concrete special case is sketched below.
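As the simplest instance of the modified update rule (D = I, Q = 0, no variance correction), here is a minimal SGLD sketch; the grad_log_prior and grad_log_lik functions are assumed placeholders for the model at hand:

```python
import numpy as np

def sgld_step(theta, eps, data, n, grad_log_prior, grad_log_lik, rng):
    """One SGLD update: theta <- theta - eps * noisy grad U + N(0, 2 eps I).

    The noisy gradient uses a minibatch of n points, rescaled by N/n
    so that it is unbiased for the full-data gradient.
    """
    N = len(data)
    batch = data[rng.choice(N, size=n, replace=False)]
    grad_U = -(grad_log_prior(theta)
               + (N / n) * sum(grad_log_lik(theta, x) for x in batch))
    noise = rng.normal(0.0, np.sqrt(2.0 * eps), size=theta.shape)
    return theta - eps * grad_U + noise
```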
24. Example D and Q of Past Algorithms
– SGLD: D(z) = I, Q(z) = 0
– SGRLD: D(z) = G(θ)⁻¹, Q(z) = 0
– SGHMC: D(z) = ( 0 0 ; 0 C ), Q(z) = ( 0 −I ; I 0 )
– SGNHT: D(z), Q(z) couple the momentum to an additional thermostat variable
Most of the (D, Q) space was not previously explored.
25. Use existing D, Q building blocks to define new samplers.
• SGRHMC's existence was previously only speculated
– A naïve(ish) approach has the wrong stationary distribution
• Take the D and Q of SGHMC and make them state dependent through a state-dependent positive-definite matrix G(θ) → SGRiemannHMC (see the rendering below)
Ma, Chen, Fox, NIPS 2015.
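Consistent with "take SGHMC's D and Q and make them state dependent", one arrives at the following (my rendering of the construction, with G(θ) the state-dependent positive-definite metric):

$$ H(\theta, r) = U(\theta) + \tfrac12 r^\top r, \qquad D(z) = \begin{pmatrix} 0 & 0 \\ 0 & G(\theta)^{-1} \end{pmatrix}, \qquad Q(z) = \begin{pmatrix} 0 & -G(\theta)^{-1/2} \\ G(\theta)^{-1/2} & 0 \end{pmatrix}. $$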
26. Streaming Wikipedia Analysis
Applied SGRHMC (using the Fisher information metric) to online LDA.
– Latent Dirichlet allocation (LDA) = mixed-membership document model
Scraped Wikipedia entries in a streaming manner; each entry was analyzed on the fly.
[Plot: perplexity vs. iteration for SGLD, SGHMC, SGRLD, and SGRHMC; step sizes selected via grid search.]
Ma, Chen, Fox, NIPS 2015.
28. Scaling up: Stochastic Gradients
As above: compute the noisy gradient $\nabla \tilde U(\theta)$ from a minibatch of $n$ i.i.d. observations sampled uniformly at random from the data, examine only $n$ data points, and assume (appealing to the CLT) $\nabla \tilde U(\theta) \approx \nabla U(\theta) + \mathcal{N}(0, V(\theta))$.
Welling, Teh, ICML 2011. Chen, Fox, Guestrin, ICML 2014.
29. Scaling up: Stochastic Gradients
The construction above relies on the observations being i.i.d.; uniform subsampling is then OK. What about non-i.i.d. data, e.g., time series with temporally correlated observations? How do we subsample then?
35. Batch Learning for HMMs
Given local beliefs, update the global parameter.
Issue: the cost is O(K²T) per global update (the message passing behind this is sketched below)!
Costly when using uninformed initializations or when observations are redundant.
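To see where the O(K²T) comes from, here is a hedged sketch of the forward message passing that the local-belief computation typically uses (K states, T time steps; this generic HMM parameterization is an assumption, not the paper's exact implementation):

```python
import numpy as np

def forward_messages(pi0, A, likelihoods):
    """Forward pass of an HMM with K states and T observations.

    pi0: (K,) initial distribution; A: (K, K) transition matrix;
    likelihoods: (T, K) per-step observation likelihoods.
    Each step multiplies a K-vector by a K x K matrix, so the
    total cost is O(K^2 T) -- the per-update cost quoted above.
    """
    T, K = likelihoods.shape
    alpha = np.empty((T, K))
    alpha[0] = pi0 * likelihoods[0]
    alpha[0] /= alpha[0].sum()                          # normalize for stability
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * likelihoods[t]  # O(K^2) per step
        alpha[t] /= alpha[t].sum()
    return alpha
```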
37. Why is this not so straightforward?
SG-MCMC assumes a continuous parameter space.
– Typical HMM MCMC algorithms iterate on (sample) a latent discrete-valued state sequence.
Need to prove that the correct stationary distribution is maintained in the presence of:
– Incomplete observations in subsequences
– Mutually correlated subsequences per minibatch
Ma, Foti, Fox, ICML 2017
47. How much buffering is sufficient?
[Diagram: each subsequence of length L_s is padded by buffers of length B on either side.]
Set the buffer length B by estimating the Lyapunov exponent of the underlying random dynamical system.
Ye, Ma, Qian, arXiv 2017
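One way to read this (my gloss, not stated verbatim on the slide): if the buffered messages forget their initialization geometrically at a rate given by the Lyapunov exponent L < 0 of the random dynamical system, then an error tolerance δ suggests

$$ \text{error}(B) \lesssim e^{-|L| B} \quad\Longrightarrow\quad B \approx \frac{\log(1/\delta)}{|L|}. $$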
48. Minibatch = Set of Subsequences
Subsequences in a minibatch are correlated! This reduces efficiency.
56. Reversibility: Continuous Dynamics
When the skew-symmetric Q(z) is non-zero, the process is irreversible:
– i.e., the time-reversed process is statistically distinguishable from the forward process
– We saw greater efficiency for such irreversible processes (e.g., HMC, Riemann HMC, SGHMC, SGRHMC, …)
Hwang, Hwang-Ma and Sheu (1993, 2005); Rey-Bellet and Spiliopoulos (2014)
60. Irreversible Jump: Naïve Revision (recap)
Naïvely using a forward proposal f that differs from the reverse proposal g leads to the wrong stationary distribution.
61. What can we do?
For the continuous dynamic part: SDE(f(z)), where the drift f defines the diffusion, but f is hard to specify directly!
62. What can we do?
For the continuous dynamic part: SDE(D(z), Q(z)), with PSD D and skew-symmetric Q, where Q defines the irreversibility.
For the jump part: MJP(W(z|x)), where the transition kernel W is hard to specify! MH is one choice, but it is reversible.
63. What can we do?
For the continuous dynamic part: SDE(D(z), Q(z)), with PSD D and skew-symmetric Q; Q defines the irreversibility.
For the jump part: MJP(S(x, z), A(x, z)), with a symmetric kernel S and an antisymmetric kernel A; A defines the irreversibility. "Only" need to specify S and A.
Ma, Fox, Chen, Wu, arXiv 2016.
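Spelling out the labels above (these are just the symmetry definitions implied by the slide; the exact stationarity condition on S and A is left to the paper):

$$ W(z \mid x) = S(x, z) + A(x, z), \qquad S(x, z) = S(z, x), \qquad A(x, z) = -A(z, x). $$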
70. Moving to Higher Dimensions
In 1D, there is only one choice of direction for z_p. In higher dimensions:
1. Let z_p be uniformly distributed on the unit ball.
2. Flip the sign of z_p upon rejection.
3. After multiple rejections, resample z_p.
(A sketch of this direction-update rule follows.)
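A minimal sketch of steps 1-3, assuming a generic accept/reject loop around it and a hypothetical max_rejections threshold (the slide does not specify how many rejections trigger a resample); drawing a uniform point in the unit ball via a normalized Gaussian times a radial factor is a standard trick:

```python
import numpy as np

def sample_unit_ball(d, rng):
    """Uniform sample from the d-dimensional unit ball."""
    g = rng.normal(size=d)
    direction = g / np.linalg.norm(g)   # uniform on the unit sphere
    radius = rng.uniform() ** (1.0 / d) # radial density for the ball
    return radius * direction

def update_direction(zp, rejected, num_rejections, max_rejections, rng):
    """Slide's rule: flip z_p on rejection; resample after many rejections."""
    if not rejected:
        return zp, 0
    if num_rejections + 1 >= max_rejections:
        return sample_unit_ball(zp.shape[0], rng), 0
    return -zp, num_rejections + 1
```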
71. Combining Approach 1 & 2
Approach 1: a jump process with π invariant.
Approach 2: continuous dynamics with π invariant.
Combined: the irreversible MALA algorithm.
72. Irreversible MALA Algorithm
Continuous dynamical process (e.g., an irreversible SDE): use as f(z|y).
Adjoint process (Q → −Q): use as g(z|y). Both can be computed for one step of the dynamics.
Metropolis Adjusted Langevin Algorithm (MALA): Xifara, Sherlock, Livingstone, Byrne and Girolami (2014); Zig-Zag: Bierkens and Roberts (2016).