7. This Time: Monte-Carlo Methods
Sampling: Want to compute properties of a "difficult" distribution as cheaply as possible:
• Take samples from a dataset
• Sample points in a model parameter space
• Sample latent (hidden) variables
The distribution is "difficult":
• Sum or product of many components
• Not closed form
• The distribution may itself be parametric, and has to be trained to approach reality.
8. This Time: Monte-Carlo Methods
Approximate Integrals:
Expected value of a function: $E[f(x)] = \int f(x)\, p(x)\, dx$
Or for discrete x: $E[f(x)] = \sum_{i=1}^{n} f(x_i)\, p(x_i)$
In both cases, we can approximate the expected value with a sample of n values $x_1, \ldots, x_n$ from the distribution p(), giving:
$$\hat{e} = \frac{1}{n} \sum_{i=1}^{n} f(x_i)$$
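As a concrete illustration, here is a minimal sketch of this estimator (our own example, not from the slides; numpy assumed, and the choices of f and p are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(f, sampler, n):
    """Monte-Carlo estimate of E[f(x)] from n samples x_i ~ p."""
    xs = sampler(n)
    return np.mean(f(xs))

# Example: E[x^2] under a standard normal is exactly 1.
est = mc_expectation(lambda x: x**2, rng.standard_normal, n=100_000)
print(est)  # close to 1.0; the error shrinks like 1/sqrt(n)
```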
9. This Time: Monte-Carlo Methods
Writing $y_i = f(x_i)$: since we choose the $x_i$ at random, the $y_i$ are IID (independent and identically distributed).
Their sum
$$\hat{e} = \frac{1}{n} \sum_{i=1}^{n} f(x_i)$$
therefore approaches a normal distribution by the Central Limit Theorem. The variance of this limit is $\mathrm{VAR}[f]/n$ and so goes to zero as n increases. The relative error $\mathrm{std}[\hat{e}]/E[\hat{e}]$ also approaches zero.
10. This Time: Monte-Carlo Methods
Non-Hand-Waving Central Limit Theorem:
Sample mean:
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$
Sample standard deviation:
$$s = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}$$
Then the t-statistic $t = \bar{y}/s$ approaches a standard normal distribution in the following sense:
11. This Time: Monte-Carlo Methods
The cumulative distribution of t satisfies:
$$\sup_x \left| \Pr(t < x) - \Phi(x) \right| \;\le\; \frac{6.4\, \mu_3 + 2\, \mu_1}{\sqrt{n}}$$
where $\Phi(x)$ is the cumulative distribution of the standard Normal distribution (mean 0, variance 1), and $\mu_1$ and $\mu_3$ are the first and third absolute moments of $y$, i.e. the expected first and third powers of $|y - E[y]|$.
12. This Time: Monte-Carlo Methods
The cumulative distribution of t satisfies:
$$\sup_x \left| \Pr(t < x) - \Phi(x) \right| \;\le\; \frac{6.4\, \mu_3 + 2\, \mu_1}{\sqrt{n}}$$
This bound falls off quite slowly with n, so the distribution can have large mass "in the tail", and you can't use the asymptotic CLT to derive a confidence interval for the mean.
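Taking the bound at face value, a quick worked example (ours, not from the slides): for $y = \pm 1$ with equal probability, $|y - E[y]| = 1$ always, so both absolute moments are 1 and the bound is $8.4/\sqrt{n}$; guaranteeing the CDF is within 0.01 of normal needs over 700,000 samples.

```python
import math

# Rademacher example: y = +/-1, so |y - E[y]| = 1 always and the
# first and third absolute moments are both 1.
mu1, mu3 = 1.0, 1.0
bound = lambda n: (6.4 * mu3 + 2 * mu1) / math.sqrt(n)

for n in (100, 10_000, 1_000_000):
    print(n, bound(n))  # 0.84, 0.084, 0.0084 -- slow 1/sqrt(n) decay
```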
13. Importance Sampling
Sampling from the true probability distribution p(x) can be inefficient. The value we want is:
$$\hat{e} = \frac{1}{n} \sum_{i=1,\; x_i \sim p(x)}^{n} f(x_i)$$
which has the same mean as:
$$\hat{e}_q = \frac{1}{n} \sum_{i=1,\; x_i \sim q(x)}^{n} \frac{f(x_i)\, p(x_i)}{q(x_i)}$$
This is always an unbiased estimator, i.e. $E[\hat{e}_q] = E[\hat{e}] = e$, but the variance depends strongly on the choice of sampling distribution $q$.
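A minimal numpy sketch of this estimator (ours; the target p, proposal q, and f below are illustrative choices, not from the slides):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def importance_estimate(f, p_pdf, q_pdf, q_sampler, n):
    """Unbiased IS estimate of E_p[f(x)] using samples x_i ~ q."""
    xs = q_sampler(n)
    weights = p_pdf(xs) / q_pdf(xs)
    return np.mean(f(xs) * weights)

# Example: tail probability E_p[1{x > 3}] under p = N(0,1),
# using a proposal q = N(3,1) centered on the rare region.
est = importance_estimate(
    f=lambda x: (x > 3.0).astype(float),
    p_pdf=norm(0, 1).pdf,
    q_pdf=norm(3, 1).pdf,
    q_sampler=lambda n: rng.normal(3.0, 1.0, n),
    n=100_000,
)
print(est)  # close to 1 - Phi(3), about 0.00135
```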
14. Importance Sampling
The variance is:
$$\mathrm{VAR}[\hat{e}_q] = \frac{1}{n} \left( \sum_{x} \left( \frac{f(x)\, p(x)}{q(x)} \right)^{\!2} q(x) \;-\; e^2 \right)$$
The optimal (minimum variance) $q$ is:
$$q^*(x) = \frac{p(x)}{Z} \left| f(x) - e \right|$$
where Z normalizes the $q^*(x_i)$ to sum to 1 across possible values of $x_i$,
i.e. sampling more often at extremal values of $f(x)$ improves accuracy in estimating $E[f(x)]$.
15. Importance Sampling
The weakness of this method is that we need to compute Z. That can be expensive or impossible.
An alternative is to estimate Z, which we can do with:
$$\hat{e}_{\mathrm{BIS}} = \frac{\sum_{i=1}^{n} \frac{f(x_i)\, p(x_i)}{q(x_i)}}{\sum_{i=1}^{n} \frac{p(x_i)}{q(x_i)}}$$
with $x_i \sim q(x)$. This estimator is biased for finite n, but asymptotically unbiased as $n \to \infty$.
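A sketch of this self-normalized estimator (ours); note that p may be unnormalized here, which is the point:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def self_normalized_is(f, p_unnorm, q_pdf, q_sampler, n):
    """Self-normalized IS: works with an UNNORMALIZED target p.
    The normalizer Z cancels in the ratio. Biased for finite n."""
    xs = q_sampler(n)
    w = p_unnorm(xs) / q_pdf(xs)          # unnormalized importance weights
    return np.sum(w * f(xs)) / np.sum(w)  # ratio estimator; Z cancels

# Example: E[x^2] under p(x) proportional to exp(-x^2/2), Z unknown to the code.
est = self_normalized_is(
    f=lambda x: x**2,
    p_unnorm=lambda x: np.exp(-x**2 / 2),
    q_pdf=norm(0, 2).pdf,
    q_sampler=lambda n: rng.normal(0.0, 2.0, n),
    n=100_000,
)
print(est)  # close to 1.0
```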
16. Markov Chain Monte Carlo
One of the most powerful methods for sampling from a difficult distribution is Markov Chain Monte Carlo (MCMC).
The idea is to generate a sequence of samples $x^{(1)}, \ldots, x^{(n)}$ from the distribution $p(x)$ using a Markov chain, i.e. where each sample $x^{(i)}$ depends (only) on the previous sample $x^{(i-1)}$.
The samples are not independent of each other, but the distribution of each sample approaches $p(x)$ as $i \to \infty$, i.e. they still give unbiased estimates for expected values.
MCMC methods are typically fast, and avoid calculation or approximation of Z.
17. Energy Representation
We can take negative logs as we do for likelihoods in machine learning, yielding an Energy-Based Model:
$$p(x) = \exp(-E(x))$$
Note that if the energy is finite, $p(x) > 0$.
A probability distribution in this form is called a Boltzmann distribution.
18. Gibbs Sampling
Gibbs sampling is an efficient approach to generating samples from the true distribution $p(x)$ using only the unnormalized distribution $\tilde{p}(x)$.
Taking each $x_i$ in turn, sample $x_i$ from $p(x_i \mid x_{-i})$, where $x_{-i} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$.
We need $p(x_i \mid x_{-i})$ to be normalized, but e.g. if $x_i$ has a small discrete range, we can compute the unnormalized probabilities for each value and normalize (c.f. softmax).
This is often easy because the energy is local, i.e. derived from edge potentials in a graphical model.
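A minimal sketch of a Gibbs sweep for discrete variables under these assumptions (ours; `energy` stands in for any user-supplied unnormalized energy, and for a graphical model it would only need the local potentials touching x_i):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_update(i, x, energy, support):
    """Resample coordinate i from p(x_i | x_-i), where
    p(x) is proportional to exp(-energy(x))."""
    logits = np.array([-energy(x[:i] + (v,) + x[i+1:]) for v in support])
    probs = np.exp(logits - logits.max())   # softmax-style normalization
    probs /= probs.sum()
    return x[:i] + (rng.choice(support, p=probs),) + x[i+1:]

def gibbs_sweep(x, energy, support):
    """One full sweep: update each coordinate of the tuple x in turn."""
    for i in range(len(x)):
        x = gibbs_update(i, x, energy, support)
    return x
```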
19. Gibbs Sampling
Recall that energy for graphical models depends on local
potentials:
[Figure: a graphical model with nodes A, B, C, D, E and potential functions ψ_ABD, ψ_DE, ψ_CE. We can efficiently sample D given A, B, E.]
20. Metropolis Hastings
Gibbs sampling assumes that we can not only evaluate $p(x_i \mid x_{-i})$, but can also efficiently sample from $p(\cdot \mid x_{-i})$.
This is only possible when the local distribution $p(\cdot \mid x_{-i})$ is closed-form. Generally it isn't.
When we don't have a closed-form local distribution, we can still use a kind of rejection sampling to form a Markov chain.
For this we use a proposal distribution $q$ to choose the next $x^{(j)} \sim q(\cdot \mid x^{(j-1)})$, and then accept-reject using the true distribution $p$.
21. Metropolis Hastings
The Metropolis-Hastings acceptance test accepts a proposed $x^{(j)}$ with probability $p_a$, where:
$$p_a\!\left(x^{(j-1)}, x^{(j)}\right) = \min\left(1,\; \frac{p(x^{(j)})\, q(x^{(j-1)} \mid x^{(j)})}{p(x^{(j-1)})\, q(x^{(j)} \mid x^{(j-1)})}\right)$$
This follows by considering transitions between the two states $x^{(j)}$ and $x^{(j-1)}$. We want detailed balance:
$$\frac{p(x^{(j)} \mid x^{(j-1)})}{p(x^{(j-1)} \mid x^{(j)})} = \frac{p(x^{(j)})}{p(x^{(j-1)})}$$
[Figure: the two states x^(j-1) and x^(j) with stationary probabilities p(x^(j-1)) and p(x^(j)), connected by the transition probabilities p(x^(j)|x^(j-1)) and p(x^(j-1)|x^(j)).]
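A minimal random-walk Metropolis-Hastings sketch (ours): with a symmetric Gaussian proposal the q ratio cancels, leaving only the p ratio; the double-well target is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_p, x0, step, n_samples):
    """Random-walk MH: symmetric Gaussian proposal, so the q-ratio
    cancels and only log p(x') - log p(x) is needed."""
    x = x0
    samples = []
    for _ in range(n_samples):
        x_prop = x + step * rng.standard_normal()
        log_ratio = log_p(x_prop) - log_p(x)   # log of p(x')/p(x); Z cancels
        if np.log(rng.uniform()) < log_ratio:  # accept w.p. min(1, ratio)
            x = x_prop
        samples.append(x)
    return np.array(samples)

# Example: sample an unnormalized double-well density exp(-(x^2 - 1)^2).
chain = metropolis_hastings(lambda x: -(x**2 - 1.0)**2,
                            x0=0.0, step=0.5, n_samples=50_000)
```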
23. Metropolis Hastings
Metropolis-Hastings is very powerful and widely used.
It reduces the problem of sampling from a difficult distribution $p(x)$ to making proposals $q(x^{(j)} \mid x^{(j-1)})$ and evaluating ratios $p(x^{(j)})/p(x^{(j-1)})$.
The proposals can be trivial, e.g. a random walk (choose $x^{(j)}$ from a normal distribution centered at $x^{(j-1)}$), or sophisticated.
Only efficiency, not correctness, is affected by the proposal.
Because only ratios $p(x^{(j)})/p(x^{(j-1)})$ are needed, we don't have to deal with normalizing sums Z.
24. Metropolis Hastings
M-H can be used for many applications, but has a serious limitation when $p(x)$ depends on many observations, e.g. when $x = \theta$ are the parameters of a model.
Namely, model parameters depend on all data points, and so does the ratio $p(\theta^{(j)})/p(\theta^{(j-1)})$.
This forces batch updates (one per pass over the dataset), which makes this kind of sampling very slow.
Until recently there was no fast way to perform M-H tests on minibatches of data, which would allow SGD-style updates.
25. Fast Minibatch Metropolis Hastings
The classical MH test has an acceptance probability which is asymmetric and non-smooth:
$$\Pr(\text{accept}) = \min\left(1,\; p_j/p_{j-1}\right) = \min\left(1,\; \exp(u)\right), \qquad u = \log(p_j/p_{j-1})$$
where $p_j$ is shorthand for $p(x^{(j)})$.
[Plot: Pr(accept) versus u; the curve rises as exp(u) for u < 0 and is flat at 1 for u ≥ 0, with a kink at u = 0.]
Chu et al., "An Efficient Minibatch Acceptance Test for Metropolis-Hastings", 2016.
26. Fast Minibatch Metropolis Hastings
The classical MH test has an acceptance probability which is asymmetric and non-smooth:
$$\Pr(\text{accept}) = \min\left(1,\; \exp(u)\right), \qquad u = \log(p_j/p_{j-1})$$
An alternative smooth, symmetric acceptance function is the logistic function (Barker test):
$$\Pr(\text{accept}) = \frac{1}{1 + \exp(-u)}$$
[Plot: the kinked min(1, exp(u)) curve alongside the smooth logistic curve, which passes through 0.5 at u = 0 and asymptotes to 1.]
27. Fast Minibatch Metropolis Hastings
The Barker test also satisfies detailed balance:
$$\frac{L(u)}{L(-u)} = \exp(u) = p_j/p_{j-1}$$
where $L(u) = 1/(1 + \exp(-u))$ is the logistic function.
[Plot: the logistic curve with acceptance probabilities L(u) and L(-u) marked at u and -u; the states x^(j-1) and x^(j), with probabilities p_{j-1} and p_j, exchange mass at rates L(u) and L(-u).]
28. Minibatch Metropolis Hastings
Testing against the smooth acceptance curve can be done using a random variable X whose CDF is the acceptance curve:
Accept if $X < u$, where $u = \log(p_j/p_{j-1})$; with $X$ logistic, $\Pr(X < u) = 1/(1 + \exp(-u))$, which is exactly the Barker acceptance probability.
[Plot: the logistic density of X, with the acceptance region X < u shaded.]
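A one-function sketch of this CDF trick (ours): drawing X from a standard logistic and accepting when X < u reproduces the Barker acceptance probability exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def barker_accept(log_p_new, log_p_old):
    """Barker test: accept with probability 1/(1 + exp(-u)),
    implemented by comparing u to a logistic draw X (CDF trick)."""
    u = log_p_new - log_p_old   # u = log(p_j / p_{j-1})
    x = rng.logistic()          # X ~ Logistic(0, 1)
    return x < u                # Pr(X < u) = 1/(1 + exp(-u))
```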
29. Minibatch Metropolis Hastings
Testing against the smooth acceptance curve can be done using a random variable X whose CDF is the acceptance curve.
This allows us to use the minibatch-induced variance in likelihood estimates to provide the variation for the MH test.
Accept if $X < u$, i.e. $u - X > 0$, where $u = \log(p_j/p_{j-1})$.
[Plot: the logistic variable X written as the sum of a normal variable $U_n$ and a correction variable $X_c$, so the test variable $u - X$ decomposes accordingly.]
30. Minibatch Metropolis Hastings
Testing against the smooth acceptance curve can be done using a random variable X whose CDF is the acceptance curve.
This allows us to use the minibatch-induced variance in likelihood estimates to provide the variation for the MH test.
Accept if $u - X > 0$.
[Plot: the same decomposition, with the normal variable $U_n$ (minibatch noise) marked as "what we have".]
31. Minibatch Metropolis Hastings
Testing against the smooth acceptance curve can be done using a random variable X whose CDF is the acceptance curve.
This allows us to use the minibatch-induced variance in likelihood estimates to provide the variation for the MH test.
Accept if $u + X > 0$.
[Plot: the logistic variable X ("what we want") decomposed as the normal minibatch noise $U_n$ ("what we have") plus the correction variable $X_c$.]
32. Minibatch Metropolis Hastings
This allows us to use the minibatch-induced variance in likelihood estimates to provide the variation for the MH test.
The minibatch estimate of u is the exact value of u across the dataset plus (approximately normal) minibatch noise $U_n$.
The variable X is the sum of $U_n$ and $X_c$; its distribution is the convolution of these two. We can compute the distribution of $X_c$ by deconvolution.
33. Minibatch Metropolis Hastings
The test itself is simple:
• Propose x^(j) using q(x^(j) | x^(j-1)) from a minibatch of data.
• Compute u = log(p_j/p_{j-1}) = log(p(x^(j))/p(x^(j-1))) from another minibatch.
• Sample X_c from the correction distribution.
• Accept if u - X_c > 0 (a sketch of this loop follows below).
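A sketch of this loop under stated assumptions: `sample_correction()` is a hypothetical stand-in for drawing X_c from the deconvolution-derived correction distribution, and `log_p_minibatch()` for the noisy minibatch log-probability estimate; neither is a real library call.

```python
def minibatch_mh_step(x, propose, log_p_minibatch, sample_correction):
    """One minibatch MH step (Barker-style test).
    propose(x)          -> proposed state x' (uses one minibatch)
    log_p_minibatch(x)  -> noisy estimate of log p(x) (another minibatch)
    sample_correction() -> hypothetical draw of X_c from the correction
                           distribution, obtained offline by deconvolving
                           the logistic against the normal minibatch noise."""
    x_prop = propose(x)
    u = log_p_minibatch(x_prop) - log_p_minibatch(x)  # noisy u = log(p_j/p_{j-1})
    x_c = sample_correction()
    return x_prop if u - x_c > 0 else x
```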
34. Minibatch Metropolis Hastings
• This approach has the same complexity as standard SGD, with a small constant overhead.
• Minibatch sizes are fixed; the acceptance probability is reasonably high (around 0.5 typically).
• The variance in the log likelihood needs to be small (< 1).
• But this corresponds to "efficient" proposal distributions.
35. Hamiltonian Monte-Carlo
• Typical MCMC proposals are random walks. They lead to very slow exploration of parameter space.
• Adding gradient bias seems like a natural step, but this destroys detailed balance.
36. Hamiltonian Monte-Carlo
• Typical MCMC proposals are random walks. They lead to very slow exploration of parameter space.
• Adding gradient bias seems like a natural step, but this destroys detailed balance.
• We can restore detailed balance by introducing "fictitious" state in the form of momentum variables.
• Now the state space is x (called "position") and p (called "momentum"). We augment the energy function as well.
• We recover the distribution of x, p(x), by "forgetting" the momentum coordinates.
37. Hamiltonian Dynamics
Let x be position and p be momentum, and let H be an energy function. The system evolves following:
$$\frac{dx_i}{dt} = \frac{\partial H}{\partial p_i} \qquad \text{and} \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial x_i}$$
And typically:
$$H(x, p) = V(x) + \frac{1}{2} \sum_{i=1}^{n} \frac{p_i^2}{m_i}$$
where V(x) is the potential energy (the negative log probability of x), the sum is the (fictitious) kinetic energy, and H is the total energy.
38. Hamiltonian Dynamics
Typically: let x be position and p be momentum:
$$H(x, p) = V(x) + \frac{1}{2} \sum_{i=1}^{n} \frac{p_i^2}{m_i}$$
So:
$$\frac{dx_i}{dt} = \frac{p_i}{m_i} \qquad \text{and} \qquad \frac{dp_i}{dt} = -\frac{\partial V}{\partial x_i}$$
Discretizing in time (unit step):
$$p^{(j)} = p^{(j-1)} - \nabla V(x^{(j-1)})$$
$$x^{(j)} = x^{(j-1)} + m^{-1} p^{(j-1)}$$
Like SGD with no momentum decay.
39. Hamiltonian Monte-Carlo
Each iteration of HMC proceeds in two steps:
1. Sample momentum p from a zero-mean normal distribution.
2. Do L times:
• Propose a new state (x, p) using Hamiltonian dynamics.
• Accept using an M-H test on the energy difference ΔH.
Note that momentum is only preserved through the steps in the inner loop over L. It is necessary to occasionally "spill" momentum so the particle doesn't have too much energy after descending. A minimal sketch of one iteration follows below.
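A minimal sketch of one HMC iteration (ours), assuming unit masses, a standard leapfrog discretization, and a single M-H test on ΔH after the L inner steps; eps and L are tuning choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def hmc_step(x, V, grad_V, eps=0.1, L=20):
    """One HMC iteration: fresh momentum, L leapfrog steps,
    then an M-H test on the energy difference dH."""
    p = rng.standard_normal(x.shape)          # 1. sample momentum
    x_new, p_new = x.copy(), p.copy()
    p_new -= 0.5 * eps * grad_V(x_new)        # 2. leapfrog integration
    for step in range(L):
        x_new += eps * p_new                  # full position step
        if step < L - 1:
            p_new -= eps * grad_V(x_new)      # full momentum step
    p_new -= 0.5 * eps * grad_V(x_new)        # final half momentum step
    dH = (V(x_new) + 0.5 * p_new @ p_new) - (V(x) + 0.5 * p @ p)
    # Accept with probability min(1, exp(-dH)).
    return x_new if np.log(rng.uniform()) < -dH else x
```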
45. Langevin Dynamics
i.e. Langevin dynamics agrees, up to step quantization error, with the standard equations of SGD with momentum.
Remarkably, Langevin dynamics has a robust solution for the stationary distribution of x:
$$p(x) = \exp\left(-\frac{V(x)}{T}\right)$$
The model allows us to control T even with significant noise in the SGD gradients.
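A minimal first-order Langevin update sketch in the style of stochastic gradient Langevin dynamics (ours, not the slides' exact scheme): the injected Gaussian noise is scaled so that, for small steps, the stationary distribution of x is proportional to exp(-V(x)/T); the step size and temperature are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_step(x, grad_V, eps=1e-3, T=1.0):
    """First-order Langevin update: a gradient step plus injected
    Gaussian noise scaled so that x converges to exp(-V(x)/T)."""
    noise = np.sqrt(2.0 * eps * T) * rng.standard_normal(x.shape)
    return x - eps * grad_V(x) + noise

# Usage: iterate to draw (correlated) samples from exp(-V(x)/T),
# e.g.  for _ in range(n): x = langevin_step(x, grad_V)
```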
47. Langevin Dynamics
Quick summary:
• Almost the same cost as standard SGD with momentum.
• Using fast minibatch M-H testing and optimal step sizes, cost increases about 2x.
• One sample generated per SGD/MH step. (HMC generates one sample per L MH tests.)
• Full posterior sampling at SGD rates.
48. Applications of HMC and Langevin
• Better characterization of the stationary distribution. Control of temperature for annealing, mixing.
• Better quantized updates: leapfrog method.
• Better understanding of coordinate scaling (ADAGRAD, RMSprop, natural gradient).
• Alternatives to variational smoothing.
55. Coordinate Scaling Again
CLAIM: such a cube matches well with the working range of typical DNN parameters.
[Figure: a cube in weight space with axes W_1 and W_2.]
56. Coordinate Scaling Revisited
• For a variety of reasons, the ADAGRAD/RMSprop scaling is poor from an MCMC perspective.
• The optimal diagonal scaling, by several arguments, is:
$$\tilde{g}_i = \frac{g_i}{E[g_i^2]}$$
i.e. without the square root in the denominator that ADAGRAD and RMSprop use (see the sketch after this slide).
• This update is invariant to coordinate transformations, and minimizes bias and variance.
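A small sketch contrasting the two diagonal scalings (ours); the exponential moving average used to track E[g_i^2] is a common-practice assumption, not specified in the slides.

```python
import numpy as np

def scaled_updates(g, v, beta=0.99, eps=1e-8):
    """Contrast RMSprop-style scaling with the MCMC-optimal one.
    v is a running estimate of E[g_i^2] (exponential moving average)."""
    v = beta * v + (1 - beta) * g**2
    rmsprop_step = g / (np.sqrt(v) + eps)  # divides by sqrt(E[g^2])
    optimal_step = g / (v + eps)           # divides by E[g^2] itself
    return rmsprop_step, optimal_step, v
```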
57. Coordinate Scaling Revisited
• For a variety of reasons, the ADAGRAD/RMSprop scaling is poor from an MCMC perspective.
• The optimal diagonal scaling, by several arguments, is:
$$\tilde{g}_i = \frac{g_i}{E[g_i^2]}$$
i.e. without the square root in the denominator that ADAGRAD and RMSprop use.
• In practice, it may be prone to "exploding" when $E[g_i^2]$ is small. But when the "working range" is taken into account, $E[g_i^2]$ should never be too small.
58. Optimal Coordinate Scaling Again
Model the working range with a regularizing norm on x as part of V(x). Then $g_i^2$ will never be too small.
[Figure: the working-range cube in weight space with axes W_1 and W_2, annotated with the optimal g scaling.]
59. Coordinate Scaling Revisited
Optimal coordinate scaling for HMC and Langevin:
Girolami et al., 2014, "Riemann Manifold Langevin and Hamiltonian Monte Carlo".
Practical performance of gradient scaling methods:
Yann Ollivier, 2015, "Riemannian metrics for neural networks I: Feedforward networks".
60. Takeaways
• Importance Sampling is used to estimate means with fewer samples.
• Monte-Carlo sampling requires either:
• Closed-form local distributions (Gibbs sampling), or
• A proposal distribution and acceptance test (M-H).
• M-H tests keep samples in the known distribution. Proposals only affect speed.
• There are alternatives to the Metropolis-Hastings test (the Barker test) which work fast on minibatches of data. Performance is similar to vanilla SGD.
61. Takeaways
• Hamiltonian Monte Carlo (HMC) adds a momentum term to allow non-reversible (gradient) proposals.
• Langevin dynamics adds a damping term to standard HMC. It can take larger steps with fewer M-H tests.
• Applications of MCMC methods:
• Better training parameter settings.
• Controlled annealing and tempering.
• Efficient full posterior inference.