CS294-129: Designing, Visualizing and
Understanding Deep Neural Networks
John Canny
Fall 2016
Lecture 20: Monte-Carlo Methods
Last Time: Variational Auto-Encoders
Dependency: ΞΈ-parametrized decoder: $p_\theta(x|z)$
Ο†-parametrized encoder: $q_\varphi(z|x)$
Both deep networks, learned concurrently.
Maximize the lower bound on log p(x) jointly w.r.t. ΞΈ and Ο†.
Recall: Applications of Probabilistic Models
β€’ Sampling
(synthesis)
β€’ Autoencoding
(smile vectors):
Real Images Synthetic Images
Recall: Applications of Probabilistic Models
β€’ Missing value imputation
β€’ Density Estimation
β€’ Denoising
This Time: Monte-Carlo Methods
Las Vegas: Fixed outcomes, random time
Monte-Carlo: Random outcomes
Outline: Monte-Carlo Methods
β€’ Importance Sampling
β€’ Markov Chain Monte-Carlo and Metropolis-Hastings
β€’ Hamiltonian Monte-Carlo
β€’ Langevin Dynamics
This Time: Monte-Carlo Methods
Sampling: Want to compute properties of a β€œdifficult”
distribution as cheaply as possible:
β€’ Take samples from a dataset
β€’ Sample points in a model parameter space
β€’ Sample latent (hidden) variables
The distribution is β€œdifficult”:
β€’ Sum or product of many components
β€’ Not closed form
β€’ The distribution may itself be parametric, and has to be
trained to approach reality.
This Time: Monte-Carlo Methods
Approximate Integrals:
Expected value of a function: $E[f(x)] = \int p(x)\, f(x)\, dx$

Or, for discrete x: $E[f(x)] = \sum_{i=1}^{N} p(x_i)\, f(x_i)$

In both cases, we can approximate the expected value with a sample of n values $x_1, \ldots, x_n$ from the distribution p(Β·), giving:

$$s = \frac{1}{n} \sum_{j=1}^{n} f(x_j)$$
This Time: Monte-Carlo Methods
Writing $y_j = f(x_j)$: since we choose the $x_j$ at random, the $y_j$ are IID (independent and identically distributed). Their sample mean

$$s = \frac{1}{n} \sum_{j=1}^{n} f(x_j)$$

therefore approaches a normal distribution by the Central Limit Theorem. The variance of this limit is $\mathrm{VAR}[Y]/n$, and so goes to zero as n increases. The relative error $\mathrm{std}[Y]/(\sqrt{n}\, E[Y])$ also approaches zero.
This Time: Monte-Carlo Methods
Non Hand-Waving Central Limit Theorem:
Sample mean:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} y_i$$
Sample standard deviation:
$$s = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \mu)^2}$$
Then the t-statistic $t = \sqrt{n}\,\mu / s$ (for centered $y_i$, i.e. after subtracting $E[Y]$) approaches a standard normal distribution in the following sense:
This Time: Monte-Carlo Methods
The cumulative distribution of t satisfies:
$$\sup_x \left| \Pr(t < x) - \Phi(x) \right| \;\le\; \frac{6.4\, E|Y|^3 + 2\, E|Y|}{\sqrt{n}}$$
where $\Phi(x)$ is the cumulative distribution of the standard Normal distribution (mean 0, variance 1).
$E|Y|$ and $E|Y|^3$ are the first and third absolute moments of Y, i.e. the expected first and third powers of $|Y - E[Y]|$.
This Time: Monte-Carlo Methods
The cumulative distribution of t satisfies:
$$\sup_x \left| \Pr(t < x) - \Phi(x) \right| \;\le\; \frac{6.4\, E|Y|^3 + 2\, E|Y|}{\sqrt{n}}$$
This bound falls off quite slowly with n, and the distribution can have large mass β€œin the tail”. So you can't rely on the asymptotic CLT to derive a confidence interval for the mean.
Importance Sampling
Sampling from the true probability distribution p(x) can be
inefficient. The value we want is:
$$s = \frac{1}{n} \sum_{j=1,\; x_j \sim p(x)}^{n} f(x_j)$$

which has the same mean as:

$$s_q = \frac{1}{n} \sum_{j=1,\; x_j \sim q(x)}^{n} \frac{p(x_j)\, f(x_j)}{q(x_j)}$$

This is always an unbiased estimator, i.e. $E[s_q] = E[s] = \mu$, but the variance depends strongly on the choice of sampling distribution q(Β·).
Importance Sampling
The variance is:
VAR π‘Œ =
1
𝑛
𝑗=1, π‘₯𝑗~π‘ž(π‘₯)
𝑛
𝑝 π‘₯𝑗
π‘ž π‘₯𝑗
2
𝑓 π‘₯𝑗 βˆ’ πœ‡
2
The optimal (minimum variance) π‘ž is:
π‘žβˆ— π‘₯ =
𝑝 π‘₯
𝑍
𝑓 π‘₯ βˆ’ πœ‡
where Z normalizes the π‘žβˆ— π‘₯𝑖 to sum to 1 across possible
values of π‘₯𝑖.
i.e. q should sample more often where $|f(x) - \mu|$ is large, which improves accuracy in estimating $E[f(x)]$.
Importance Sampling
The weakness of this method is that we need to compute Z.
That can be expensive or impossible.
An alternative is to estimate Z, which we can do with:
$$s_{BIS} = \frac{\sum_{j=1}^{n} \frac{p(x_j)}{q(x_j)} f(x_j)}{\sum_{j=1}^{n} \frac{p(x_j)}{q(x_j)}}$$
with π‘₯𝑗~π‘ž π‘₯ . This estimator is biased for finite n, but
asymptotically unbiased as 𝑛 β†’ ∞.
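A sketch of the self-normalized estimator $s_{BIS}$ in the situation it is designed for (my own toy choice of target): p is known only up to the normalizer Z, so the weights are divided by their own sum and Z cancels.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_normalized_is(f, p_unnorm, q_pdf, q_sampler, n):
    """s_BIS = sum_j w_j f(x_j) / sum_j w_j, with w_j = p~(x_j)/q(x_j) and x_j ~ q."""
    x = q_sampler(n)
    w = p_unnorm(x) / q_pdf(x)           # unnormalized weights; Z cancels in the ratio
    return np.sum(w * f(x)) / np.sum(w)

# Toy target: p~(x) = exp(-(x-1)^2 / 2), i.e. N(1,1) without its normalizer.
p_unnorm = lambda x: np.exp(-0.5 * (x - 1.0) ** 2)
q_pdf = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)   # proposal q = N(0,1)
q_sampler = lambda n: rng.standard_normal(n)

print(self_normalized_is(lambda x: x, p_unnorm, q_pdf, q_sampler, 100_000))  # ~1.0 = E_p[x]
```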
Markov Chain Monte Carlo
One of the most powerful methods for sampling from a
difficult distribution is Markov Chain Monte Carlo (MCMC).
The idea is to generate a sequence of samples $x^{(1)}, \ldots, x^{(n)}$ from the distribution p(x) using a Markov chain, i.e. where each sample $x^{(j)}$ depends (only) on the previous sample $x^{(j-1)}$.
The samples are not independent of each other, but the distribution of each sample β†’ p(x) as n β†’ ∞, i.e. they still give asymptotically unbiased estimates of expected values.
MCMC methods are typically fast, and avoid calculation or
approximation of Z.
Energy Representation
We can take negative logs as we do for likelihoods in
machine learning, yielding an Energy Based Model:
𝑝 π‘₯ = exp βˆ’πΈ π‘₯
Note that if the energy is finite, 𝑝 π‘₯ > 0.
A probability distribution in this form is called a
Boltzmann distribution.
Gibbs Sampling
Gibbs sampling is an efficient approach to generating samples from the true distribution p(x) using only the unnormalized distribution $\tilde{p}(x)$.
Taking each $x_i$ in turn, sample $x_i$ from $p(x_i \mid x_{-i})$, where $x_{-i} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$.
We need $p(x_i \mid x_{-i})$ to be normalized, but e.g. if $x_i$ has a small discrete range, we can compute the unnormalized probabilities for each value and normalize (c.f. softmax).
This is often easy because the energy is local, i.e. derived
from edge potentials in a graphical model.
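A minimal sketch of one Gibbs sweep for a vector of small discrete variables, assuming only an unnormalized p̃(x) is available; the toy Ising-like energy below is my own example, not from the lecture. Each conditional is obtained by enumerating the values of $x_i$ and normalizing, exactly as in the softmax comparison above.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(x, values, p_unnorm):
    """One Gibbs sweep: resample each x_i from p(x_i | x_-i) using unnormalized p~."""
    x = x.copy()
    for i in range(len(x)):
        probs = np.empty(len(values))
        for k, v in enumerate(values):
            x[i] = v
            probs[k] = p_unnorm(x)           # unnormalized p(x_i = v, x_-i)
        probs /= probs.sum()                 # normalize over the small discrete range
        x[i] = values[rng.choice(len(values), p=probs)]
    return x

# Toy Ising-like chain of +/-1 spins: neighbouring spins prefer to agree.
p_unnorm = lambda x: np.exp(1.0 * np.sum(x[:-1] * x[1:]))
values = np.array([-1, 1])

x = rng.choice(values, size=8)
for _ in range(100):
    x = gibbs_sweep(x, values, p_unnorm)
print(x)
```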
Gibbs Sampling
Recall that energy for graphical models depends on local
potentials:
(Figure: a graphical model with nodes A, B, C, D, E and potential functions $\varphi_{ABD}$, $\varphi_{DE}$, $\varphi_{CE}$.)
We can efficiently sample D given A, B, E.
Metropolis Hastings
Gibbs sampling assumes that we can not only evaluate $p(x_i \mid x_{-i})$, but also efficiently sample from $p(\cdot \mid x_{-i})$.
This is only possible when the local distribution $p(\cdot \mid x_{-i})$ has a closed form. Generally it doesn't.
When we don't have a closed-form local distribution, we can still use a kind of rejection sampling to form a Markov chain. For this we use a proposal distribution q to choose the next $x^{(j)} \sim q(\cdot \mid x^{(j-1)})$, and then accept/reject using the true distribution p.
Metropolis Hastings
The Metropolis-Hastings acceptance test accepts a proposed $x^{(j)}$ with probability

$$p\!\left(x^{(j-1)}, x^{(j)}\right) = \min\!\left(1,\; \frac{p(x^{(j)})\, q(x^{(j-1)} \mid x^{(j)})}{p(x^{(j-1)})\, q(x^{(j)} \mid x^{(j-1)})}\right)$$

This follows by considering transitions between the two states $x^{(j)}$ and $x^{(j-1)}$. We want detailed balance:

$$\frac{p(x^{(j)} \mid x^{(j-1)})}{p(x^{(j-1)} \mid x^{(j)})} = \frac{p(x^{(j)})}{p(x^{(j-1)})}$$
(Diagram: states $x^{(j-1)}$ and $x^{(j)}$ with probabilities $p(x^{(j-1)})$, $p(x^{(j)})$ and transition probabilities $p(x^{(j)} \mid x^{(j-1)})$, $p(x^{(j-1)} \mid x^{(j)})$.)
Metropolis Hastings
Detailed balance: each transition probability factors into proposal Γ— acceptance, so

$$\frac{p(x^{(j)} \mid x^{(j-1)})}{p(x^{(j-1)} \mid x^{(j)})} = \frac{q(x^{(j)} \mid x^{(j-1)})\, p(x^{(j-1)}, x^{(j)})}{q(x^{(j-1)} \mid x^{(j)})\, p(x^{(j)}, x^{(j-1)})}$$

Substituting the acceptance ratio

$$\frac{p(x^{(j-1)}, x^{(j)})}{p(x^{(j)}, x^{(j-1)})} = \frac{p(x^{(j)})\, q(x^{(j-1)} \mid x^{(j)})}{p(x^{(j-1)})\, q(x^{(j)} \mid x^{(j-1)})}$$

into the RHS, the q's cancel, and we find:

$$\frac{p(x^{(j)} \mid x^{(j-1)})}{p(x^{(j-1)} \mid x^{(j)})} = \frac{p(x^{(j)})}{p(x^{(j-1)})}$$
(Diagram: transition probability = proposal Γ— acceptance between states $x^{(j-1)}$ and $x^{(j)}$.)
Metropolis Hastings
Metropolis-Hastings is very powerful and widely-used.
It reduces the problem of sampling from a difficult distribution p(x) to making proposals $q(x^{(j)} \mid x^{(j-1)})$ and evaluating ratios $p(x^{(j)}) / p(x^{(j-1)})$.
The proposals can be trivial, e.g. a random walk (choose $x^{(j)}$ from a normal distribution centered at $x^{(j-1)}$), or sophisticated. Only efficiency, not correctness, is affected by the proposal.
Because only ratios $p(x^{(j)}) / p(x^{(j-1)})$ are needed, we don't have to deal with normalizing sums Z.
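A sketch of random-walk Metropolis-Hastings (my own toy target, a 1-D mixture of Gaussians known only up to a constant). Because the Gaussian proposal is symmetric, the q ratio cancels and the test only needs log p̃(x') βˆ’ log p̃(x).

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_p_unnorm, x0, n_samples, step=0.5):
    """Random-walk MH: propose x' ~ N(x, step^2), accept with prob min(1, p(x')/p(x))."""
    x = x0
    samples = []
    for _ in range(n_samples):
        x_prop = x + step * rng.standard_normal(x.shape)
        log_ratio = log_p_unnorm(x_prop) - log_p_unnorm(x)   # symmetric q cancels
        if np.log(rng.uniform()) < log_ratio:
            x = x_prop
        samples.append(x)
    return np.array(samples)

# Toy target: unnormalized 1-D mixture of two Gaussians centered at +/-2.
def log_p_unnorm(x):
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2).sum()

samples = metropolis_hastings(log_p_unnorm, np.zeros(1), 20_000)
print(samples.mean(), samples.std())   # roughly 0 and sqrt(5) ~ 2.2 for this mixture
```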
Metropolis Hastings
M-H can be used for many applications, but has a serious
limitation when 𝑝 π‘₯ depends on many observations, e.g.
when π‘₯ = πœƒ are the parameters of a model.
Namely model parameters depend on all data points, and so
does the ratio:
𝑝 πœƒ(𝑗)
/𝑝 πœƒ(π‘—βˆ’1)
This forces batch updates (one per pass over the dataset)
which makes this kind of sampling very slow.
Until recently there was no fast way to perform M-H tests on
minibatches of data, which would allow SGD-style updates.
Fast Minibatch Metropolis Hastings
The classical MH test has an acceptance probability which is asymmetric and non-smooth in $u = \log(p_j / p_{j-1})$:
$$\Pr(\text{accept}) = \min(1,\; p_j / p_{j-1}) = \min(1,\; e^u)$$
where $p_j$ is shorthand for $p(x^{(j)})$.
Chu et al., β€œAn Efficient Minibatch Acceptance Test for Metropolis-Hastings,” 2016
Fast Minibatch Metropolis Hastings
The classical MH test has an acceptance probability which is
asymmetric and non-smooth:
An alternative smooth, symmetric distribution is the logistic
function (Barker test):
$$\Pr(\text{accept}) = \frac{1}{1 + \exp(-u)}, \qquad u = \log(p_j / p_{j-1})$$
(Figure: the classical min(1, eα΅˜) acceptance curve next to the smooth logistic curve, which passes through 0.5 at u = 0.)
Fast Minibatch Metropolis Hastings
The Barker test also satisfies detailed balance:
$$\frac{L(u)}{L(-u)} = \exp(u) = \frac{p_j}{p_{j-1}}, \qquad \text{where } L(u) = \frac{1}{1 + \exp(-u)} \text{ is the logistic function.}$$
(Figure: the forward move $x^{(j-1)} \to x^{(j)}$ is accepted with probability $L(u)$, the reverse move with probability $L(-u)$.)
Minibatch Metropolis Hastings
Testing against the smooth distribution can be done using a
random variable x whose CDF is the acceptance curve:
(Figure: the density of a standard logistic variable X; accept if X < u, since $\Pr(X < u) = 1/(1 + \exp(-u))$.)
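A small numerical check of the statement above (my own sketch): accepting with probability 1/(1+exp(βˆ’u)) is the same as drawing a standard logistic variable X and accepting when X < u.

```python
import numpy as np

rng = np.random.default_rng(0)

def barker_accept(u):
    """Barker test: accept if X < u, where X ~ Logistic(0, 1)."""
    x = rng.logistic(loc=0.0, scale=1.0)
    return x < u

# The empirical acceptance rate matches the logistic CDF 1/(1+exp(-u)).
for u in [-2.0, 0.0, 1.0]:
    rate = np.mean([barker_accept(u) for _ in range(100_000)])
    print(u, rate, 1.0 / (1.0 + np.exp(-u)))
```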
Minibatch Metropolis Hastings
Testing against the smooth distribution can be done using a
random variable X whose CDF is the acceptance curve:
This allows us to use the minibatch-induced variance in
likelihood estimates to provide the variation for the MH test.
(Figure: accept if X < u, i.e. u βˆ’ X > 0, where the logistic variable X is decomposed as a normal component (the minibatch noise) plus a correction variable $X_c$.)
Minibatch Metropolis Hastings
Testing against the smooth distribution can be done using a
random variable X whose CDF is the acceptance curve:
This allows us to use the minibatch-induced variance in
likelihood estimates to provide the variation for the MH test.
(Figure: what we have: u estimated with approximately normal minibatch noise.)
Minibatch Metropolis Hastings
Testing against the smooth distribution can be done using a
random variable X whose CDF is the acceptance curve:
This allows us to use the minibatch-induced variance in
likelihood estimates to provide the variation for the MH test.
(Figure: what we want: accept when u + X > 0 for a logistic X; what we have: u plus normal minibatch noise. The gap is bridged by the correction variable $X_c$.)
Minibatch Metropolis Hastings
This allows us to use the minibatch-induced variance in
likelihood estimates to provide the variation for the MH test.
(Figure: u βˆ’ X_norm βˆ’ X_c = u βˆ’ X, where X_norm is the normal minibatch noise, X is logistic, and u is the exact value computed across the whole dataset.)
The logistic variable X is the sum of X_norm and $X_c$, so its distribution is the convolution of these two. We can therefore compute the distribution of $X_c$ by deconvolution.
Minibatch Metropolis Hastings
The test itself is simple:
β€’ Propose x(j) using q(x(j)| x(j-1)) from a minibatch of data.
β€’ Compute u = log(pj/pj-1) = log(p(x(j))/p(x(j-1))) from another
minibatch.
β€’ Sample xc from the correction distribution.
β€’ Accept if u - xc > 0.
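A structural sketch of this test (a paraphrase of the bullets above, not the paper's code). The helpers propose, minibatch_log_ratio, and sample_correction are hypothetical placeholders: in particular, sample_correction stands for drawing $X_c$ from the deconvolution-derived correction distribution, which is assumed to be precomputed.

```python
def minibatch_mh_step(theta, propose, minibatch_log_ratio, sample_correction):
    """One minibatch MH step, following the bullets above (structural sketch only).

    propose(theta)              -> theta'   (the proposal q, from one minibatch)
    minibatch_log_ratio(t0, t1) -> noisy estimate u of log p(theta')/p(theta),
                                   computed on another minibatch (approximately
                                   normal noise, with variance kept small, < 1)
    sample_correction()         -> draw X_c from the precomputed correction
                                   distribution (placeholder)
    """
    theta_prop = propose(theta)
    u = minibatch_log_ratio(theta, theta_prop)   # noisy log-probability ratio
    x_c = sample_correction()                    # symmetric correction variable
    if u - x_c > 0:                              # accept if u - X_c > 0, as listed above
        return theta_prop
    return theta
```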
Minibatch Metropolis Hastings
β€’ This approach has the same complexity as standard
SGD, with a small constant overhead.
β€’ Minibatch sizes are fixed, acceptance probability is
reasonably high (around 0.5 typically).
β€’ The variance in the log likelihood needs to be small (< 1).
β€’ But this corresponds to β€œefficient” proposal distributions.
Hamiltonian Monte-Carlo
β€’ Typical MCMC proposals are random walks. They lead to
very slow exploration of parameter space.
β€’ Adding gradient bias seems like a natural step, but this
destroys detailed balance.
Hamiltonian Monte-Carlo
β€’ Typical MCMC proposals are random walks. They lead to
very slow exploration of parameter space.
β€’ Adding gradient bias seems like a natural step, but this
destroys detailed balance.
β€’ We can restore detailed balance by introducing β€œfictitious”
state in the form of momentum variables.
β€’ Now the state space is x (called β€œposition”) and p (called
momentum). We augment the energy function as well.
β€’ We recover the distribution of x, p(x), by marginalizing out (β€œforgetting”) the momentum coordinates.
Hamiltonian Dynamics
Let x be position and p be momentum, H is an energy
function. The system evolves following:
𝑑π‘₯𝑖
𝑑𝑑
=
πœ•π»
πœ•π‘π‘–
and
𝑑𝑝𝑖
𝑑𝑑
= βˆ’
πœ•π»
πœ•π‘₯𝑖
And typically:
𝐻 π‘₯, 𝑝 = 𝑉 π‘₯ + 1/2
𝑖=1
𝑛
𝑝𝑖
2
π‘šπ‘–
Potential energy
(log probability(x))
Kinetic energy
(fictitious)
Total energy
Hamiltonian Dynamics
Typically: let x be position and p be momentum:
𝐻 π‘₯, 𝑝 = 𝑉 π‘₯ + 1/2
𝑖=1
𝑛
𝑝𝑖
2
π‘šπ‘–
So
𝑑π‘₯𝑖
𝑑𝑑
=
𝑝𝑖
π‘šπ‘–
and
𝑑𝑝𝑖
𝑑𝑑
= βˆ’
πœ•π‘‰
πœ•π‘₯𝑖
Discretizing in time:
𝑝(𝑗) = 𝑝(π‘—βˆ’1) βˆ’ 𝛻𝑉 π‘₯(π‘—βˆ’1)
π‘₯(𝑗) = π‘₯(π‘—βˆ’1) + π‘šβˆ’1𝑝(π‘—βˆ’1)
Like SGD with no momentum decay.
Hamiltonian Monte-Carlo
Each iteration of HMC proceeds in two steps:
1. Sample momentum p from a zero-mean normal
distribution.
2. Do L times:
β€’ Propose a new state (x,p) using Hamiltonian dynamics.
β€’ Accept using an M-H test on the energy difference Ξ”H.
β€’ Note that momentum is only preserved through the steps
in the inner loop over L.
β€’ Necessary to occasionally β€œspill” momentum so the
particle doesn’t have too much energy after descending.
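A compact numpy sketch of this loop for a toy 2-D Gaussian target (my own example). It uses leapfrog steps, which the lecture mentions later as the better discretization, and one common variant of the loop: a single M-H test on Ξ”H after the L inner steps rather than a test per sub-step.

```python
import numpy as np

rng = np.random.default_rng(0)

def hmc_step(x, grad_V, V, L=20, eps=0.1):
    """One HMC iteration: fresh momentum, L leapfrog steps, M-H test on dH."""
    p = rng.standard_normal(x.shape)              # step 1: sample momentum (unit mass)
    x_new, p_new = x.copy(), p.copy()
    H_old = V(x) + 0.5 * np.dot(p, p)
    for _ in range(L):                            # step 2: simulate Hamiltonian dynamics
        p_new -= 0.5 * eps * grad_V(x_new)        # leapfrog: half-kick, drift, half-kick
        x_new += eps * p_new
        p_new -= 0.5 * eps * grad_V(x_new)
    H_new = V(x_new) + 0.5 * np.dot(p_new, p_new)
    if np.log(rng.uniform()) < H_old - H_new:     # M-H test on the energy difference
        return x_new
    return x

# Toy target: V(x) = 0.5 * ||x||^2, i.e. a standard normal in 2-D.
V = lambda x: 0.5 * np.dot(x, x)
grad_V = lambda x: x

x = np.zeros(2)
samples = []
for _ in range(5000):
    x = hmc_step(x, grad_V, V)
    samples.append(x)
print(np.cov(np.array(samples).T))   # should be close to the 2x2 identity
```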
Hamiltonian Monte-Carlo Animation
See https://www.youtube.com/watch?v=Vv3f0QNWvWQ
Langevin Dynamics
HMC:
β€’ Have to periodically β€œreset” momentum.
β€’ Need M-H tests on every sub-iteration.
Can we simulate a probabilistic system more directly?
Langevin dynamics:
𝑀π‘₯ = βˆ’π›»π‘‰ βˆ’ 𝛾π‘₯ + 2𝛾𝑇𝒩(0,1)
mass x
acceleration
energy
gradient
viscous
damping
Noise (at
Temperature T)
Chen et al. β€œStochastic Gradient Hamiltonian Monte-Carlo” 2014
Langevin Dynamics
Langevin dynamics:
$$M\ddot{x} = -\nabla V - \gamma \dot{x} + \sqrt{2\gamma T}\, \mathcal{N}(0,1)$$
Divide by M:
$$\ddot{x} = -\frac{1}{M}\nabla V - \frac{\gamma}{M}\dot{x} + \frac{\sqrt{2\gamma T}}{M}\mathcal{N}(0,1)$$
Discretize (unit time step): $\ddot{x}^{(j)} \approx v^{(j)} - v^{(j-1)}$, $\dot{x}^{(j)} \approx v^{(j-1)}$, $x^{(j)} = x^{(j-1)} + v^{(j)}$. This gives:
$$v^{(j)} = \frac{M - \gamma}{M} v^{(j-1)} - \frac{1}{M}\nabla V + \frac{\sqrt{2\gamma T}}{M}\mathcal{N}(0,1)$$
$$x^{(j)} = x^{(j-1)} + v^{(j)}$$
Langevin Dynamics
These equations are just SGD with momentum:
$$v^{(j)} = \frac{M - \gamma}{M} v^{(j-1)} - \frac{1}{M}\nabla V + \frac{\sqrt{2\gamma T}}{M}\mathcal{N}(0,1)$$
$$x^{(j)} = x^{(j-1)} + v^{(j)}$$
Can be written:
$$v^{(j)} = \alpha\, v^{(j-1)} - l\, \nabla V + \epsilon\, \mathcal{N}(0,1)$$
$$x^{(j)} = x^{(j-1)} + v^{(j)}$$
where
β€’ Ξ± < 1 is the β€œmomentum”,
β€’ l is the learning rate,
β€’ Ξ΅ is the scale of the minibatch noise in the gradient βˆ‡V.
Langevin Dynamics
These equations are just SGD with momentum:
$$v^{(j)} = \frac{M - \gamma}{M} v^{(j-1)} - \frac{1}{M}\nabla V + \frac{\sqrt{2\gamma T}}{M}\mathcal{N}(0,1)$$
$$x^{(j)} = x^{(j-1)} + v^{(j)}$$
Can be written:
$$v^{(j)} = \alpha\, v^{(j-1)} - l\, \nabla V + \epsilon\, \mathcal{N}(0,1)$$
$$x^{(j)} = x^{(j-1)} + v^{(j)}$$
We can solve for (M, Ξ³, T) given (Ξ±, l, Ξ΅), i.e. we can translate the standard SGD parameters into physical ones for the simulated system.
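A small sketch of this translation and of the corresponding update (my own code, assuming the unit-step discretization above): it solves Ξ± = (Mβˆ’Ξ³)/M, l = 1/M, Ξ΅ = √(2Ξ³T)/M for (M, Ξ³, T), and implements the noisy momentum update, with the explicit Gaussian term standing in for minibatch gradient noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_to_physical(alpha, lr, eps):
    """Translate SGD-with-momentum parameters (alpha, l, eps) into (M, gamma, T),
    using alpha = (M - gamma)/M, l = 1/M, eps = sqrt(2*gamma*T)/M."""
    M = 1.0 / lr
    gamma = M * (1.0 - alpha)
    T = eps ** 2 / (2.0 * lr * (1.0 - alpha))
    return M, gamma, T

def langevin_sgd_step(x, v, grad_V, alpha=0.9, lr=0.01, eps=0.05):
    """v <- alpha*v - l*grad V(x) + eps*N(0,1);  x <- x + v."""
    v = alpha * v - lr * grad_V(x) + eps * rng.standard_normal(x.shape)
    return x + v, v

print(sgd_to_physical(alpha=0.9, lr=0.01, eps=0.05))   # -> (M, gamma, T)

# Toy potential V(x) = 0.5*||x||^2: the iterates then sample roughly exp(-V(x)/T).
grad_V = lambda x: x
x, v = np.zeros(2), np.zeros(2)
for _ in range(10_000):
    x, v = langevin_sgd_step(x, v, grad_V)
```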
Langevin Dynamics
i.e. Langevin dynamics agrees, up to step quantization
error, with the standard equations of SGD with momentum.
Remarkably, Langevin dynamics has a robust solution for
the stationary distribution of x:
𝑝 π‘₯ = exp βˆ’
𝑉 π‘₯
𝑇
The model allows us to control T even with significant noise
in the SGD gradients.
Langevin Dynamics
Notes:
β€’ So far we assumed the noise 𝒩 was generated by minibatch variance only. But we can add noise explicitly to better control the dynamics.
β€’ We only used the simplest possible quantization. Better quantization, e.g. leapfrog steps, can be used. This is similar to using Nesterov momentum steps.
β€’ M-H tests are still needed for optimal size steps (analysis
missing from Chen et al. paper).
Langevin Dynamics
Quick summary:
β€’ Almost same cost as standard SGD with momentum.
β€’ Using fast minibatch M-H testing and optimal step sizes,
cost increases about 2x.
β€’ One sample generated per SGD/MH step. (HMC
generates one sample per L MH tests).
β€’ Full posterior sampling at SGD rates.
Applications of HMC and Langevin
β€’ Better characterization of the stationary distribution.
Control of temperature for annealing, mixing.
β€’ Better quantized updates: Leapfrog method.
β€’ Better understanding of coordinate scaling (ADAGRAD,
RMSprop, Natural gradient).
β€’ Alternatives to variational smoothing.
Coordinate Scaling Again
(Figure: the original W and the true negative-gradient direction, in the (W_1, W_2) plane.)
Coordinate Scaling Again
ADAGRAD, RMSprop normalize gradient coordinates, so gradients lie in a cube.
Coordinate Scaling Again
But this leads to higher variance of the likelihood in the high-gradient direction.
Coordinate Scaling Again
ADAGRAD and RMSprop are also not invariant to scaling coordinates.
Coordinate Scaling Again
ADAGRAD's implicit β€œworking cube”, defined in the analysis as $D_\infty$.
Coordinate Scaling Again
CLAIM: such a cube matches well with the working range of typical DNN parameters.
Coordinate Scaling Revisited
β€’ For a variety of reasons, the ADAGRAD/RMSprop scaling is poor from an MCMC perspective.
β€’ The optimal diagonal scaling, by several arguments, is:
$$\hat{g}_i = \frac{g_i}{E[g_i^2]}$$
β€’ i.e. there is no square root in the denominator, unlike ADAGRAD and RMSprop.
β€’ This update is invariant to coordinate transformations, and minimizes bias and variance.
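A sketch contrasting the RMSprop-style step with the no-square-root scaling argued for above, using an exponential moving average as the running estimate of $E[g_i^2]$; the decay constant and the small floor added for numerical safety are my own choices, not from the lecture.

```python
import numpy as np

def scaled_step(g, msq, lr=0.01, rho=0.99, floor=1e-8, use_sqrt=False):
    """Update the running estimate of E[g_i^2] and return a preconditioned step.

    use_sqrt=True  -> RMSprop-style:        g_i / sqrt(E[g_i^2])
    use_sqrt=False -> the scaling above:    g_i / E[g_i^2]   (no square root)
    """
    msq = rho * msq + (1.0 - rho) * g ** 2          # running estimate of E[g_i^2]
    denom = np.sqrt(msq) if use_sqrt else msq
    step = -lr * g / (denom + floor)                # floor guards against division by ~0
    return step, msq

# Example: one step on a toy gradient.
g = np.array([0.1, 1.0, 10.0])
msq = np.ones_like(g)
step, msq = scaled_step(g, msq)
print(step)
```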
Coordinate Scaling Revisited
β€’ For a variety of reasons, the ADAGRAD/RMSprop scaling is poor from an MCMC perspective.
β€’ The optimal diagonal scaling, by several arguments, is:
$$\hat{g}_i = \frac{g_i}{E[g_i^2]}$$
β€’ i.e. there is no square root in the denominator, unlike ADAGRAD and RMSprop.
β€’ In practice, it may be prone to β€œexploding” when $E[g_i^2]$ is small. But when the β€œworking range” is taken into account, $E[g_i^2]$ should never be too small.
Optimal Coordinate Scaling Again
Model the working range with a regularizing norm on x as part of V(x). Then $E[g_i^2]$ will never be too small.
(Figure: the optimal g scaling in the (W_1, W_2) plane.)
Coordinate Scaling Revisited
Optimal coordinate scaling for HMC and Langevin:
Girolami et al. 2014 β€œRiemann Manifold Langevin and Hamiltonian
Monte Carlo”
Practical performance of gradient scaling methods:
Yann Ollivier, 2015 β€œRiemannian metrics for neural networks I:
Feedforward networks”
Takeaways
β€’ Importance Sampling is used to
estimate means with fewer samples.
β€’ Monte-Carlo sampling requires either:
β€’ Closed form local distributions (Gibbs sampling)
β€’ A proposal distribution and acceptance test (M-H).
β€’ M-H tests keep samples in the target distribution. Proposals only affect speed.
β€’ There are alternatives to the Metropolis-Hastings test (e.g. the Barker test) which work efficiently on minibatches of data. Performance is similar to vanilla SGD.
Takeaways
β€’ Hamiltonian Monte Carlo (HMC)
adds a momentum term to allow
non-reversible (gradient) proposals.
β€’ Langevin dynamics adds a damping term to standard HMC.
Can take larger steps with fewer M-H tests.
β€’ Applications of MCMC methods:
β€’ Better training parameter setting.
β€’ Controlled annealing and tempering.
β€’ Efficient full posterior inference.