QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, Stochastic Gradient MCMC for Independent & Correlated Data - Yian Ma, Dec 11, 2017
1. Stochastic Gradient MCMC for Independent and Correlated Data
Yi-An (Yian) Ma
University of California, Berkeley
With: Tianqi Chen, Emily Fox, Nick Foti, Felix Ye
4. Classical Approach: MCMC via Jump Processes
"Standard" method = Metropolis-Hastings (MH):
– Propose θ′ from a kernel depending on the past value θ
– Accept or reject
MH is an example of a jump process.
It often explores the posterior inefficiently.
5. Continuous-Dynamics-Based Samplers (aka Grad-MCMC)
Use (stochastic) dynamics on the energy landscape to simulate distant proposals.
6. Example: Hamiltonian Monte Carlo (HMC)
Start from the target posterior of θ, whose negative log gives the potential energy; add an auxiliary "momentum" variable r, whose kinetic energy completes the Hamiltonian (total energy H).
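In standard notation (the specific symbols here are a conventional rendering, not taken verbatim from the slide), the Hamiltonian and its dynamics are:

$$ \pi(\theta) \propto e^{-U(\theta)}, \qquad H(\theta, r) = \underbrace{U(\theta)}_{\text{potential}} + \underbrace{\tfrac{1}{2} r^\top r}_{\text{kinetic}}, \qquad d\theta = r\,dt, \quad dr = -\nabla U(\theta)\,dt. $$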
10. Is there a general recipe for construction?
[Diagram: within the set of all continuous Markov processes sits the subset exploring the "correct" dynamics, i.e., processes with stationary distribution p^s(θ) = π(θ); known members include Langevin dynamics (LD), HMC, Riemann manifold LD, and Riemann manifold HMC.]
11. A Recipe for Continuous Dynamics MCMC
Assume target distribution $p^s(z) \propto e^{-H(z)}$, where $z = (\theta, r)$ collects the parameters θ and auxiliary variables, and H is the target total energy. Define the SDE

$$ dz = -\big[D(z) + Q(z)\big]\,\nabla H(z)\,dt + \Gamma(z)\,dt + \sqrt{2 D(z)}\;dW(t), \qquad \Gamma_i(z) = \sum_j \frac{\partial}{\partial z_j}\big[D_{ij}(z) + Q_{ij}(z)\big], $$

where $W(t)$ is a d-dim Wiener process, $D(z)$ is positive semidefinite (PSD), and $Q(z)$ is skew-symmetric. Then $e^{-H(z)}$ is invariant.
Ma, Chen, Fox, NIPS 2015.
13. Recipe is Complete
All continuous Markov processes with p^s(z) = π(z) are captured: every such process corresponds to an SDE defined by some D(z), Q(z).
All existing samplers can be written in the framework:
– HMC
– Riemann HMC
– LD
– Riemann LD
Any valid sampler has a D and Q in our framework.
Ma, Chen, Fox, NIPS 2015.
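As a concrete instance (standard, and consistent with the NIPS 2015 framework), HMC fits the recipe with $z = (\theta, r)$ and $H(z) = U(\theta) + \tfrac12 r^\top r$:

$$ D(z) = 0, \qquad Q(z) = \begin{pmatrix} 0 & -I \\ I & 0 \end{pmatrix}, $$

which recovers $d\theta = r\,dt$, $dr = -\nabla U(\theta)\,dt$; Langevin dynamics instead takes $D = I$, $Q = 0$.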
17. Irreversible Jump: Naïve Revision
Naïve irreversible MH: make the chain irreversible by using a forward proposal f that differs from the reverse proposal g.
This leads to the wrong stationary distribution.
18. Correcting the Algorithm
Introduce an auxiliary direction variable z_p ~ U{+1, −1}:
– z_p = +1: run the forward proposal
– z_p = −1: run the reverse (adjoint) proposal
Jump to the adjoint process upon rejection (a control-flow sketch follows).
I-Jump Algorithm: Ma, Fox, Chen, Wu, arXiv 2016.
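A minimal sketch of the control flow described above, assuming hypothetical user-supplied propose_f, propose_g, and accept_prob functions (these names and the exact acceptance rule are illustrative placeholders, not from the paper):

```python
import random

def i_jump_step(z, zp, propose_f, propose_g, accept_prob):
    """One step of the irreversible-jump correction sketched on the slide.

    zp in {+1, -1} selects between the forward proposal f and the
    reverse/adjoint proposal g; upon rejection we flip zp, i.e. jump
    to the adjoint process.
    """
    proposal = propose_f(z) if zp == 1 else propose_g(z)
    if random.random() < accept_prob(z, proposal, zp):
        return proposal, zp   # accept: keep the current direction
    return z, -zp             # reject: jump to the adjoint process
```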
19. Irreversible MALA Algorithm
Continuous dynamical process (e.g., an irreversible SDE): use one step of the dynamics as the forward proposal f(z|y).
Adjoint process (Q → −Q): use as the reverse proposal g(z|y).
Both can be computed for one step of the dynamics.
I-MALA Algorithm: Ma, Fox, Chen, Wu, arXiv 2016.
20. A Practical Algorithm
Consider an ε-discretization of the dynamics, run for m steps.
Each step must compute the gradient over the full data set!
– Computations costly for large data
– Cannot handle streaming data
22. Scaling up: Stochastic Gradients
Compute a noisy gradient based on a minibatch $\mathcal{S}$ consisting of $n$ i.i.d. observations out of $N$:

$$ \nabla \tilde U(\theta) = -\frac{N}{n} \sum_{i \in \mathcal{S}} \nabla \log p(x_i \mid \theta) - \nabla \log p(\theta) $$

– For minibatches sampled uniformly at random from the data, the noisy gradient is unbiased for the true gradient: $\mathbb{E}[\nabla \tilde U(\theta)] = \nabla U(\theta)$.
Only requires examining n data points, and we assume (appealing to the CLT):

$$ \nabla \tilde U(\theta) \approx \nabla U(\theta) + \mathcal{N}\big(0, V(\theta)\big) $$
23. Scalable Version of Algorithm
Original update rule (full gradient):

$$ z_{t+1} \leftarrow z_t - \epsilon_t\Big[\big(D(z_t)+Q(z_t)\big)\nabla H(z_t) + \Gamma(z_t)\Big] + \mathcal{N}\big(0,\; 2\epsilon_t D(z_t)\big) $$

Modified stochastic gradient update rule: replace $\nabla H$ with the noisy $\nabla \tilde H$ and subtract an estimate $\hat B_t$ of the variance of the SG noise:

$$ z_{t+1} \leftarrow z_t - \epsilon_t\Big[\big(D(z_t)+Q(z_t)\big)\nabla \tilde H(z_t) + \Gamma(z_t)\Big] + \mathcal{N}\big(0,\; \epsilon_t(2 D(z_t) - \epsilon_t \hat B_t)\big) $$

As $\epsilon_t \to 0$, the SG noise decreases and the bias $\to 0$. Use a small, finite $\epsilon$ in practice (allowing some bias). A concrete special case is sketched below.
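As the simplest instance of the modified update rule (D = I, Q = 0, no variance correction), here is a minimal SGLD sketch; the grad_log_prior and grad_log_lik functions are assumed placeholders for the model at hand:

```python
import numpy as np

def sgld_step(theta, eps, data, n, grad_log_prior, grad_log_lik, rng):
    """One SGLD update: theta <- theta - eps * noisy grad U + N(0, 2 eps I).

    The noisy gradient uses a minibatch of n points, rescaled by N/n
    so that it is unbiased for the full-data gradient.
    """
    N = len(data)
    batch = data[rng.choice(N, size=n, replace=False)]
    grad_U = -(grad_log_prior(theta)
               + (N / n) * sum(grad_log_lik(theta, x) for x in batch))
    noise = rng.normal(0.0, np.sqrt(2.0 * eps), size=theta.shape)
    return theta - eps * grad_U + noise
```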
24. Example D and Q of Past Algorithms
– SGLD: D(z) = I, Q(z) = 0
– SGRLD: D(z) = G(θ)⁻¹, Q(z) = 0
– SGHMC: D(z) = ( 0 0 ; 0 C ), Q(z) = ( 0 −I ; I 0 )
– SGNHT: D(z), Q(z) couple the momentum to an additional thermostat variable
Most of the (D, Q) space was not previously explored.
25. Use existing D, Q building blocks to define new samplers.
• SGRHMC's existence was previously only speculated
– A naïve(ish) approach has the wrong stationary distribution
• Take the D and Q of SGHMC and make them state dependent through a state-dependent positive-definite matrix G(θ) → SGRiemannHMC (see the rendering below)
Ma, Chen, Fox, NIPS 2015.
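Consistent with "take SGHMC's D and Q and make them state dependent", one arrives at the following (my rendering of the construction, with G(θ) the state-dependent positive-definite metric):

$$ H(\theta, r) = U(\theta) + \tfrac12 r^\top r, \qquad D(z) = \begin{pmatrix} 0 & 0 \\ 0 & G(\theta)^{-1} \end{pmatrix}, \qquad Q(z) = \begin{pmatrix} 0 & -G(\theta)^{-1/2} \\ G(\theta)^{-1/2} & 0 \end{pmatrix}. $$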
26. Streaming Wikipedia Analysis
Applied SGRHMC (using the Fisher information metric) to online LDA.
– Latent Dirichlet allocation (LDA) = mixed-membership document model
Scraped Wikipedia entries in a streaming manner; each entry was analyzed on the fly.
[Plot: perplexity vs. iteration for SGLD, SGHMC, SGRLD, and SGRHMC; step sizes selected via grid search.]
Ma, Chen, Fox, NIPS 2015.
28. Scaling up: Stochastic Gradients
As above: compute the noisy gradient $\nabla \tilde U(\theta)$ from a minibatch of $n$ i.i.d. observations sampled uniformly at random from the data, examine only $n$ data points, and assume (appealing to the CLT) $\nabla \tilde U(\theta) \approx \nabla U(\theta) + \mathcal{N}(0, V(\theta))$.
Welling, Teh, ICML 2011. Chen, Fox, Guestrin, ICML 2014.
29. Scaling up: Stochastic Gradients
The construction above relies on the observations being i.i.d.; uniform subsampling is then OK. What about non-i.i.d. data, e.g., time series with temporally correlated observations? How do we subsample then?
35. Batch Learning for HMMs
Given local beliefs, update the global parameter.
Issue: the cost is O(K²T) per global update (the message passing behind this is sketched below)!
Costly when using uninformed initializations or when observations are redundant.
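To see where the O(K²T) comes from, here is a hedged sketch of the forward message passing that the local-belief computation typically uses (K states, T time steps; this generic HMM parameterization is an assumption, not the paper's exact implementation):

```python
import numpy as np

def forward_messages(pi0, A, likelihoods):
    """Forward pass of an HMM with K states and T observations.

    pi0: (K,) initial distribution; A: (K, K) transition matrix;
    likelihoods: (T, K) per-step observation likelihoods.
    Each step multiplies a K-vector by a K x K matrix, so the
    total cost is O(K^2 T) -- the per-update cost quoted above.
    """
    T, K = likelihoods.shape
    alpha = np.empty((T, K))
    alpha[0] = pi0 * likelihoods[0]
    alpha[0] /= alpha[0].sum()                          # normalize for stability
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * likelihoods[t]  # O(K^2) per step
        alpha[t] /= alpha[t].sum()
    return alpha
```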
37. Why is this not so straightforward?
SG-MCMC assumes a continuous parameter space.
– Typical HMM MCMC algorithms iterate on (sample) a latent discrete-valued state sequence.
Need to prove that the correct stationary distribution is maintained in the presence of:
– Incomplete observations in subsequences
– Mutually correlated subsequences per minibatch
Ma, Foti, Fox, ICML 2017
47. How much buffering is sufficient?
[Diagram: each subsequence of length L_s is padded by buffers of length B on either side.]
Set the buffer length B by estimating the Lyapunov exponent of the underlying random dynamical system.
Ye, Ma, Qian, arXiv 2017
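One way to read this (my gloss, not stated verbatim on the slide): if the buffered messages forget their initialization geometrically at a rate given by the Lyapunov exponent L < 0 of the random dynamical system, then an error tolerance δ suggests

$$ \text{error}(B) \lesssim e^{-|L| B} \quad\Longrightarrow\quad B \approx \frac{\log(1/\delta)}{|L|}. $$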
48. Minibatch = Set of Subsequences
Subsequences in a minibatch are correlated! This reduces efficiency.
56. Reversibility: Continuous Dynamics
When the skew-symmetric Q(z) is non-zero, the process is irreversible:
– i.e., the time-reversed process is statistically distinguishable from the forward process
– We saw greater efficiency for such irreversible processes (e.g., HMC, Riemann HMC, SGHMC, SGRHMC, …)
Hwang, Hwang-Ma and Sheu (1993, 2005); Rey-Bellet and Spiliopoulos (2014)
60. Irreversible Jump: Naïve Revision (recap)
Naïvely using a forward proposal f that differs from the reverse proposal g leads to the wrong stationary distribution.
61. What can we do?
For the continuous dynamic part: SDE(f(z)), where the drift f defines the diffusion, but f is hard to specify directly!
62. What can we do?
For the continuous dynamic part: SDE(D(z), Q(z)), with PSD D and skew-symmetric Q, where Q defines the irreversibility.
For the jump part: MJP(W(z|x)), where the transition kernel W is hard to specify! MH is one choice, but it is reversible.
63. What can we do?
For the continuous dynamic part: SDE(D(z), Q(z)), with PSD D and skew-symmetric Q; Q defines the irreversibility.
For the jump part: MJP(S(x, z), A(x, z)), with a symmetric kernel S and an antisymmetric kernel A; A defines the irreversibility. "Only" need to specify S and A.
Ma, Fox, Chen, Wu, arXiv 2016.
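Spelling out the labels above (these are just the symmetry definitions implied by the slide; the exact stationarity condition on S and A is left to the paper):

$$ W(z \mid x) = S(x, z) + A(x, z), \qquad S(x, z) = S(z, x), \qquad A(x, z) = -A(z, x). $$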
70. Moving to Higher Dimensions
In 1D, there is only one choice of direction for z_p. In higher dimensions:
1. Let z_p be uniformly distributed on the unit ball.
2. Flip the sign of z_p upon rejection.
3. After multiple rejections, resample z_p.
(A sketch of this direction-update rule follows.)
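A minimal sketch of steps 1-3, assuming a generic accept/reject loop around it and a hypothetical max_rejections threshold (the slide does not specify how many rejections trigger a resample); drawing a uniform point in the unit ball via a normalized Gaussian times a radial factor is a standard trick:

```python
import numpy as np

def sample_unit_ball(d, rng):
    """Uniform sample from the d-dimensional unit ball."""
    g = rng.normal(size=d)
    direction = g / np.linalg.norm(g)   # uniform on the unit sphere
    radius = rng.uniform() ** (1.0 / d) # radial density for the ball
    return radius * direction

def update_direction(zp, rejected, num_rejections, max_rejections, rng):
    """Slide's rule: flip z_p on rejection; resample after many rejections."""
    if not rejected:
        return zp, 0
    if num_rejections + 1 >= max_rejections:
        return sample_unit_ball(zp.shape[0], rng), 0
    return -zp, num_rejections + 1
```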
71. Combining Approach 1 & 2
Approach 1: a jump process with π invariant.
Approach 2: continuous dynamics with π invariant.
Combined: the irreversible MALA algorithm.
72. Irreversible MALA Algorithm
Continuous dynamical process (e.g., an irreversible SDE): use as f(z|y).
Adjoint process (Q → −Q): use as g(z|y). Both can be computed for one step of the dynamics.
Metropolis Adjusted Langevin Algorithm (MALA): Xifara, Sherlock, Livingstone, Byrne and Girolami (2014); Zig-Zag: Bierkens and Roberts (2016).