Senior Thesis in Mathematics
Sampling from High Dimensional
Distributions
Author:
Bill DeRose
Advisor:
Dr. Gabe Chandler
Submitted to Pomona College in Partial Fulfillment
of the Degree of Bachelor of Arts
April 3, 2015
Contents
1 Introduction
1.1 Introduction
1.2 Related works
2 Random Variable Generation
2.1 Random Variable Generation
2.1.1 The Inverse Transform
2.1.2 Acceptance-Rejection Sampling
3 Markov Chains
3.1 Markov Chains
3.2 Properties of Markov Chains
3.2.1 Irreducibility
3.2.2 Aperiodicity
3.2.3 Stationarity
3.2.4 Ergodicity
4 Markov Chain Monte Carlo
4.1 Monte Carlo Methods
4.2 Metropolis Hastings
4.2.1 Bias-Variance Trade-off
4.2.2 Approximate Metropolis-Hastings
4.3 Slice Sampling
4.3.1 Auxiliary Variable MCMC
4.3.2 Uniform Ergodicity of the Slice Sampler
4.4 Gibbs Sampling
4.5 Hamiltonian Monte Carlo
4.5.1 Hamiltonian Dynamics
4.5.2 HMC
4.6 Summary
5 Conclusion
Chapter 1
Introduction
1.1 Introduction
The challenges we face in computational statistics are due to the incredible
advance of technology in the past 100 years. In a world where human suffering
is the daily reality of so many, we should be so lucky to wrestle with the problems
of algorithm design and implementation. Despite the advances in statistics over
the past century, random number generation remains an active field of research.
From machine learning and artificial intelligence, to the simulation of protein
formation, the ability to draw from probability distributions has wide ranging
applications.
But what exactly is a random number, and what is randomness? More
importantly, how can an algorithm take a finite number of deterministic steps
to produce something random? Often, humans delude themselves into seeing
randomness where there is none – they detect a signal in the noise.
Figure 1.1: Which is random?
The image on the left of Figure 1.1 depicts genuine randomness. The points
on the right are too evenly spaced for it to be truly random. In actuality, each
point on the left represents the location of a star in our galaxy while the points
on the right represent the location of glowworms on the ceiling of a cave in
New Zealand. The glowworms spread themselves out to reduce competition for
food amongst themselves. The seemingly uniform distribution is the result of a
non-random force.
So how do we go about generating images like those on the left? We begin
with a little cheat and assume the existence of a random number generator that
allows us to sample U ∼ Uniform([0, 1]). Though we will not discuss methods
for drawing uniformly from the unit interval, their importance to us cannot be overstated.
In practice, exact inference is often either impossible (e.g. provably non-
integrable functions) or intractable (e.g. high dimensional integration) and we
must turn to approximations. This work explores Monte Carlo methods as one
approach to numerical approximation.
Example 1.1 (Numeric Integration) We wish to evaluate an integral $Q = \int_a^b f(x)\, dx$. From calculus, we know
\[ f_{avg} = \frac{Q}{b - a} \Rightarrow Q = (b - a) f_{avg}. \]
By the LLN, we can choose $X_1, \ldots, X_n$ uniformly in [a, b] to approximate
\[ f_{avg} \approx \frac{1}{n} \sum_{i=1}^{n} f(X_i) \Rightarrow Q \approx \frac{b - a}{n} \sum_{i=1}^{n} f(X_i). \]
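As a quick sketch, this approximation takes only a few lines of R; the integrand and the interval below are arbitrary choices made for illustration.
# Monte Carlo estimate of Q = integral of f over [a, b], following Example 1.1.
f <- function(x) exp(-x ^ 2)    # an arbitrary integrand chosen for illustration
a <- 0; b <- 2; n <- 100000
x <- runif(n, a, b)             # X_1, ..., X_n drawn uniformly on [a, b]
Q.hat <- (b - a) * mean(f(x))   # Q is approximately (b - a) times the average of f(X_i)
Q.hat                           # compare with integrate(f, a, b)$value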
1.2 Related works
Many of the algorithms we cover are versions of the Metropolis algorithm which
first appeared in [9] and was eventually generalized by Hastings in [6]. Though
the naming of the algorithm has been contended (Metropolis merely oversaw
the research), we refer to the algorithm as Metropolis-Hastings for historical
reasons. Regardless of naming conventions, we are indebted to Arianna W.
Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller for
their work on the original paper outlining the Metropolis algorithm.
The term “Monte Carlo” was coined because some of the first applications
were to card games, like those in the Monte Carlo Casino in Monaco. A Monte
Carlo algorithm is simply an algorithm whose output is random. Such Monte
Carlo simulations were important to the development of the Manhattan project
(after all, Metropolis worked at Los Alamos during WWII) and remain an im-
portant tool of modern statistical physics.
Though the Gibbs sampler was introduced by brothers Stuart and Donald
Geman in 1984 [5], these sorts of numerical sampling techniques did not enter the mainstream until the early 1990s, arguably because of the advent of the personal computer and wider access to computational power. The Gibbs
sampler appeared nearly two decades before Neal’s slice sampler [11], though
we cover them in reverse chronological order because the latter provides a nice
motivation for the former. We draw heavily from [10], [13], [12] in proving the
uniform ergodicity of the 2D slice sampler. Neal’s work appears again in the
Hamiltonian Monte Carlo [2] which uses gradient information to explore state
space more efficiently.
The contemporary results presented are mostly due to the proliferation of
big data which, in the Bayesian setting, necessitates the ability to sample from a
posterior distribution with billions of data points. Sequential hypothesis testing
allows us to reduce some of the overhead required for Metropolis-Hastings [7].
As a general introduction to each of these sampling techniques, [1] has proven
invaluable.
Chapter 2
Random Variable
Generation
2.1 Random Variable Generation
Assuming we may draw U ∼ Unif([0, 1]), what other distributions can we gen-
erate from? It turns out we can draw X ∼ Bernoulli(p) by letting X ← 1(U <
p). If $X_1, \ldots, X_n \overset{i.i.d.}{\sim} \mathrm{Bern}(p)$, then $Y = \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, p)$. However, a
more general approach called the inverse transform allows us to draw from any
1D density whose closed form cdf we may write down.
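A minimal R sketch of this construction, relying on nothing beyond runif:
# Bernoulli(p) and Binomial(n, p) draws built only from Unif([0, 1]) variates.
p <- 0.3; n <- 20
x <- as.integer(runif(n) < p)   # X_i = 1(U_i < p), i.i.d. Bernoulli(p)
y <- sum(x)                     # Y = sum of the X_i, a Binomial(n, p) draw
y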
2.1.1 The Inverse Transform
Definition 2.1 Suppose X is a random variable with probability distribution
function (pdf) fX. We denote by FX the cumulative distribution function (cdf)
where
FX(a) = Pr(X ≤ a).
Cumulative distribution functions are nonnegative, increasing, right-continuous functions with $\lim_{a \to -\infty} F_X(a) = 0$ and $\lim_{a \to \infty} F_X(a) = 1$.
Definition 2.2 For an increasing function F on R, the pseudoinverse of F, denoted $F^{-1}_p$, is the function such that
\[ F^{-1}_p(u) = \inf\{a : F(a) \geq u\}. \]
If F is strictly increasing then $F^{-1}_p \equiv F^{-1}$.
With these definitions in hand, we have the tools to generate random variates
from any distribution with a computable generalized inverse.
Lemma 2.3 Let F(a) be a cdf and U ∼ Unif([0, 1]). If $X = F^{-1}_p(U)$ then X has cdf F(a).
Proof Let F, U, and X be as given in the lemma. Then
\[ \Pr(X \leq a) = \Pr(F^{-1}_p(U) \leq a) = \Pr(U \leq F(a)) = F(a), \]
where the second equality follows from the fact that F is increasing.
We now see the importance of being able to draw uniformly from the unit
interval. In fact if U is not truly uniform on [0, 1], then the inverse transform
method fails to sample from the correct distribution. However, to use the inverse
transform we must explicitly write down the cumulative distribution function
and efficiently compute its generalized inverse. As we will see in Example 2.6,
this is not always possible.
Example 2.4 We wish to draw X ∼ Exp(λ) using the inverse transform method. The cdf of an exponential random variable is given by
\[ F_X(a) = 1 - \exp(-\lambda a). \]
Solving for the inverse yields
\begin{align*}
U &= 1 - \exp(-\lambda a) \\
\log(1 - U) &= -\lambda a \\
-\lambda^{-1} \log(1 - U) &= a
\end{align*}
So that if U ∼ Unif([0, 1]) then $-\lambda^{-1} \log(1 - U) = X \sim \mathrm{Exp}(\lambda)$.
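A short R sketch of this transform (the rate λ = 2 is an arbitrary choice) can be checked against R's built-in rexp:
# Exponential draws via the inverse transform.
lambda <- 2
u <- runif(10000)
x <- -log(1 - u) / lambda    # the pseudoinverse of the exponential cdf applied to U
c(mean(x), 1 / lambda)       # the sample mean should be close to 1 / lambda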
Example 2.5 Recall that the pdf of a Cauchy random variable, X, is
\[ f_X(s) = \frac{1}{\pi(1 + s^2)}. \]
Given U ∼ Unif([0, 1]), we find a transformation Y = r(U) such that Y has a Cauchy distribution. We begin by finding the cdf of X:
\begin{align*}
F_X(a) &= \int_{-\infty}^{a} \frac{1}{\pi(1 + s^2)}\, ds \\
&= \frac{1}{\pi} \left[ \tan^{-1}(s) \right]_{-\infty}^{a} \\
&= \frac{1}{\pi} \left( \tan^{-1}(a) - \lim_{n \to -\infty} \tan^{-1}(n) \right) \\
&= \frac{1}{\pi} \left( \tan^{-1}(a) + \frac{\pi}{2} \right) \\
&= \frac{\tan^{-1}(a)}{\pi} + \frac{1}{2}
\end{align*}
To compute the desired transformation, we have:
\begin{align*}
U &= \frac{\tan^{-1}(a)}{\pi} + \frac{1}{2} \\
\pi\left(U - \tfrac{1}{2}\right) &= \tan^{-1}(a) \\
\tan\left(\pi\left(U - \tfrac{1}{2}\right)\right) &= a
\end{align*}
So that $Y = r(U) = \tan\left(\pi\left(U - \tfrac{1}{2}\right)\right) \sim \mathrm{Cauchy}$.
cauchies <- tan(pi * (runif(10000) - 0.5))
hist(cauchies[abs(cauchies) <= 500], prob = TRUE,
breaks = 2000, xlim = c(-20, 20), ylim = c(0, 0.35),
main = "Cauchy r.v. using ITF", xlab ="X")
lines(seq(-20, 20, 0.2), dcauchy(seq(-20, 20, 0.2)),
col = "blue")
Example 2.6 Even in 1D, there exist densities whose cdf we cannot write
down. For example, the cumulative distribution function of the standard normal
distribution cannot be expressed in a closed form:
\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp(-z^2/2)\, dz \]
Clearly, we must develop other methods that do not rely as strongly on nice
analytic properties of our target distribution.
2.1.2 Acceptance-Rejection Sampling
Much of this section stems from the idea that if fX is the target distribution, we may write
\[ f_X(x) = \int_0^{f_X(x)} 1\, du. \]
Here, fX appears as the marginal density of the joint distribution
(X, U) ∼ Unif({(x, u) : 0 < u < fX(x)}).
Introducing the auxiliary variable U allows us to sample from our target
distribution by drawing uniformly from the area under the curve of fX and
ignoring the auxiliary coordinate.
Theorem 2.7 (The Fundamental Theorem of Simulation) Simulating X ∼
fX is equivalent to simulating
(X, U) ∼ Unif({(x, u) : 0 < u < fX(x)}).
Actually sampling from the joint distribution of (X, U) introduces difficulty,
though, because sampling X ∼ fX and U|X ∼ Unif([0, fX(X)]) defeats the
purpose of introducing the auxiliary variable. If we could sample X ∼ fX in
the first place, we would already be done.
The solution is to generate pairs (X, U) from a superset and accept them if
they satisfy the constraint. For instance, suppose the 1D density fX is bounded
by m and the support of fX, denoted supp fX, is [c, d]. Sampling pairs
(X, U) ∼ Unif({(x, u) : 0 ≤ u ≤ fX(x)})
is equivalent to simulating X ∼ Unif([c, d]), U|X ∼ Unif([0, m]), and accepting
the pair if 0 < U < fX(X). It is easily shown that this does indeed sample from
the desired distribution:
\begin{align*}
\Pr(X \leq a) &= \Pr(X \leq a \mid U \leq f_X(X)) \\
&= \frac{\Pr(X \leq a,\ U \leq f_X(X))}{\Pr(U \leq f_X(X))} \\
&= \frac{\int_c^a \int_0^{f_X(z)} \frac{1}{d - c} \cdot \frac{1}{m}\, du\, dz}{\int_c^d \int_0^{f_X(z)} \frac{1}{d - c} \cdot \frac{1}{m}\, du\, dz} \\
&= \frac{\int_c^a \int_0^{f_X(z)} du\, dz}{\int_c^d \int_0^{f_X(z)} du\, dz} \\
&= \frac{\int_c^a f_X(z)\, dz}{\int_c^d f_X(z)\, dz} \\
&= \int_c^a f_X(z)\, dz \\
&= F_X(a)
\end{align*}
This computation was made easier by the fact that both fX and supp fX
were bounded. In situations where this is not the case, we can no longer use a
rectangle as the superset from which we draw candidates. Instead, we use some
other probability distribution g(x) that may be readily sampled from. Such a
distribution is called a proposal distribution and must satisfy
M · g(x) ≥ fX(x), M ≥ 1, ∀x ∈ supp fX.
We formalize this notion in the following theorem.
Theorem 2.8 (The Acceptance-Rejection Theorem) Let g be a probabil-
ity distribution that satisfies
M · g(x) ≥ fX(x)
for some M ≥ 1 and for all x ∈ supp fX. Then, to simulate X ∼ fX, it is
sufficient to simulate
Y ∼ g and U|Y ∼ Unif([0, M · g(Y )])
and let X ← Y if U ≤ fX(Y ).
Proof Sampling Y ∼ g, U|Y ∼ Unif([0, M · g(Y )]), and letting X ← Y if
U ≤ fX(Y ) generates X ∼ fX:
\begin{align*}
\Pr(X \in A) &= \Pr(Y \in A \mid U \leq f_X(Y)) \\
&= \frac{\int_A \int_0^{f_X(z)} g(z)\, \frac{1}{M g(z)}\, du\, dz}{\int_{\mathrm{supp}\, f_X} \int_0^{f_X(z)} g(z)\, \frac{1}{M g(z)}\, du\, dz} \\
&= \frac{\int_A f_X(z)\, dz}{\int_{\mathrm{supp}\, f_X} f_X(z)\, dz} \\
&= \int_A f_X(z)\, dz
\end{align*}
The proposals used in acceptance-rejection sampling come from g(Y) and are accepted with probability $\frac{f_X(Y)}{M \cdot g(Y)}$, so the probability we accept any given proposal is then
\[ \Pr(\text{accept}) = \int g(y)\, \frac{f_X(y)}{M \cdot g(y)}\, dy = \frac{1}{M} \int f_X(y)\, dy = \frac{1}{M}. \]
The larger M is, the more points we must reject before accepting a proposal. For efficiency's sake, we want $M = \sup_x \frac{f_X(x)}{g(x)}$ to ensure the highest possible acceptance rate. This leads directly to the Acceptance-Rejection algorithm, which is a realization of Theorem 2.8:
Algorithm 1 AR Sampling
1: procedure Acceptance-Rejection
2: Draw Y ∼ g, U ∼ Unif([0, M · g(Y )])
3: Let X ← Y if U ≤ fX(Y ), else return to 2.
4: end procedure
Example 2.9 Given Y ∼ Cauchy we use acceptance-rejection to generate X ∼ Exp(1/2). To use AR with a proposal distribution g(x), we must ensure $M \cdot g(x) \geq f_X(x) \Rightarrow M \geq \frac{f_X(x)}{g(x)}$ for all x ∈ supp fX. Ideally, M is as close to 1 as possible:
\[ M \geq \sup_{x \geq 0} \frac{f_X(x)}{g(x)} \approx 3.629 \]
We confine our maximization to the positive reals because the target distribution only has support on the positive reals. The maximum is attained at $x = 2 + \sqrt{3}$. Using M = 3.629 yields
Draw.AR <- function() {
repeat {
proposal <- rcauchy(1)
u <- runif(1, 0, 3.629 * dcauchy(proposal))
if (u <= dexp(proposal, rate = 1 / 2)) {
return(proposal)
}
}
}
x <- seq(0, 15, 0.1)
hist(replicate(10000, Draw.AR()), breaks = 100, prob = TRUE,
xlab = "X", main = "Exp(1/2) using AR")
lines(x, dexp(x, rate = 1 / 2), col = "blue")
Chapter 3
Markov Chains
3.1 Markov Chains
Definition 3.1 A sequence of random variables X1, . . . , Xn, denoted (Xn), is
a Markov chain if
Pr(Xn+1|Xn, Xn−1, . . . , X1) = Pr(Xn+1|Xn) (3.1)
Example 3.2 A random walk is a Markov chain that satisfies
\[ X_{n+1} = X_n + \epsilon_n \]
where $\epsilon_n$ is generated independently of the current state. If the distribution of $\epsilon_n$ is symmetric about 0, we call this a symmetric random walk. In section 4.2 we will see how random walks are used in MCMC algorithms.
Every Markov chain has an initial distribution, π0, and a transition kernel K.
The state space, denoted X, is the set of possible values Xi may take on at each
step in the Markov chain.
Definition 3.3 A transition kernel is a function K defined on X × B(X) such
that
• ∀x ∈ X, K(x, ·) is a probability measure;
• ∀A ∈ B(X), K(·, A) is measurable.
where B denotes the σ-algebra defined on the set X.
When the state space is discrete, the transition kernel is a matrix K where
Kij = Pr(Xn+1 = Xj|Xn = Xi).
In the continuous case, the transition kernel denotes a conditional density where
\[ \Pr(X_{n+1} \in A \mid X_n = x) = \int_A K(x, x')\, dx'. \]
A Markov chain is said to be time homogeneous if $K(X_{n+1} \mid X_n)$ is independent of n.
We restrict our study almost entirely to time homogeneous Markov chains.
An example of a time heterogeneous Markov chain is the simulated annealing
algorithm, whose transition kernel changes with the “temperature” of the sys-
tem. Time heterogeneity is a key property of simulated annealing because it
allows us to explore the entire state space when the temperature is high, but
restricts our moves when the temperature is low. The algorithm is inspired by
annealing in metallurgy where the process is used to temper or harden metals
and glass by heating them to a high temperature and gradually cooling them,
allowing the material to reach a low-energy crystalline state [14].
Given a transition matrix for a discrete Markov chain and an initial distri-
bution π0, the distribution of X1 is obtained by matrix multiplication
π1 = π0K.
Similarly, $X_n \sim \pi_n = \pi_0 K^n$. Notice that once the initial state is specified, the
behavior of the chain is entirely dependent on K.
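As a small illustration, the following R sketch propagates an initial distribution through a made-up two-state transition matrix; the entries of K below are arbitrary.
# Distribution of X_n for a toy two-state chain: pi_n = pi_0 K^n.
K <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)   # an arbitrary transition matrix
pi0 <- c(1, 0)                                     # start in state 1 with certainty
pin <- pi0
for (i in 1:50) pin <- pin %*% K                   # repeated multiplication gives pi_50
pin                                                # approaches the stationary distribution (2/3, 1/3)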
Definition 3.4 Consider A ∈ B(X). The first n for which the chain enters the
set A is denoted by
τA = inf{n ≥ 1 : Xn ∈ A}
and is called the stopping time at A. By convention, τA = ∞ if Xn ∉ A for every n. Associated with the set A, we also define
\[ \eta_A = \sum_{n=1}^{\infty} \mathbf{1}(X_n \in A), \]
the number of times the chain enters A.
Example 3.5 In a zero-sum coin tossing game, the payoff to player b is +1 if
a heads appears and −1 if a tails appears. Similarly, the payoff to player c is +1
if a tails appears and −1 if a heads appears. Let Xn be the sum of the gains of
player b after n rounds of the game. The infinite dimensional transition matrix,
K, has zeros on the diagonal since player b must either lose or gain a point on
each round. Furthermore, K has upper and lower sub-diagonals equal to 1/2
because we are flipping a fair coin. Assuming that player b begins with B dollars and player c begins with C dollars,
\[ \tau_1 = \inf\{n : X_n \leq -B\} \quad \text{and} \quad \tau_2 = \inf\{n : X_n \geq C\} \]
represent, respectively, the ruins of players b and c. The probability of bankruptcy for player b is then Pr(τ1 < τ2).
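A brief simulation sketch of this game (with B and C chosen arbitrarily) estimates the ruin probability for player b; for a fair coin the answer is C/(B + C).
# Estimate Pr(player b is ruined first) by simulating the coin-tossing chain.
Ruin.Game <- function(B, C) {
  x <- 0
  repeat {
    x <- x + sample(c(-1, 1), 1)   # one round: player b gains or loses 1 with probability 1/2
    if (x <= -B) return(TRUE)      # tau_1 reached first: player b is ruined
    if (x >= C) return(FALSE)      # tau_2 reached first: player c is ruined
  }
}
mean(replicate(10000, Ruin.Game(B = 5, C = 10)))   # should be near C / (B + C) = 2/3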
3.2 Properties of Markov Chains
3.2.1 Irreducibility
Irreducibility is an important property of Markov chains which guarantees that
regardless of the current state of the chain, it is possible to reach any other state
in a finite number of transitions. In the discrete case, irreducibility also tells
us the transition matrix cannot be broken down into smaller matrices (i.e. the
transition graph is connected).
Definition 3.6 Given a measure φ, a Markov chain with transition kernel K(·)
is φ-irreducible if for every A ∈ B(X) such that φ(A) > 0, Pr(τA < ∞) > 0
regardless of the initial state.
Irreducibility together with aperiodicity, a property introduced in the following
subsection, allow us to make strong analytic arguments about the convergence
of Markov chains.
3.2.2 Aperiodicity
We define the period of a state x ∈ X to be
\[ d(x) = \gcd\{m \geq 1 : K^m(x, x) > 0\}. \]
If d(x) ≥ 2, we say x is periodic with period d(x). A state is aperiodic if it has period 1. An irreducible chain is aperiodic if each state has period 1.
Example 3.7 A Markov chain with period n is given by the block matrix
\[ P = \begin{pmatrix} 0 & P_1 & 0 & \cdots & 0 \\ 0 & 0 & P_2 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & P_{n-1} \\ P_n & 0 & 0 & \cdots & 0 \end{pmatrix} \]
where each Pi is a stochastic matrix and P is irreducible.
3.2.3 Stationarity
Definition 3.8 A Markov chain (Xn) has stationary distribution π if Xn ∼
π ⇒ Xn+1 ∼ π.
For MCMC methods to be of any use to us, we must be able to reason about
the asymptotic behavior of Markov chains. The distribution of Xn as n → ∞
is called the limiting distribution. Ideally, we would like some guarantee that,
regardless of initial conditions, the limiting distribution of a Markov chain is
also its stationary distribution.
The general approach with MCMC algorithms is to initialize and run a
Markov chain for a sufficient number of steps to draw samples approximately
from the desired stationary distribution. It is common to ignore some number of samples at the beginning (the burn-in), and then consider only every nth sample (thinning, for approximate independence) when computing an expectation.
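In R terms, with a vector of autocorrelated draws standing in for MCMC output, these two conventions look like the following; the burn-in length and thinning interval are arbitrary choices for the sketch.
# Discard an initial burn-in and keep every 10th draw before averaging.
chain <- as.numeric(arima.sim(model = list(ar = 0.9), n = 20000))  # stand-in for correlated MCMC output
kept <- chain[-(1:5000)]                  # drop the burn-in
kept <- kept[seq(1, length(kept), 10)]    # thin: keep every 10th sample
mean(kept)                                # use the retained draws for expectations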
3.2.4 Ergodicity
When exactly do we know when the limiting distribution of a Markov chain is
the stationary distribution? The Ergodic Theorem tells us just this.
Theorem 3.9 (The Ergodic Theorem) Let (Xn) be a Markov chain with
stationary distribution π. If the chain is φ-irreducible and aperiodic, then for
all measurable sets A, limn→∞ Pr(Xn ∈ A) = π(A).
That is, the limiting distribution of irreducible, aperiodic Markov chains is always the stationary distribution. An even stronger guarantee of
convergence exists, but to get there we must introduce more terminology.
Definition 3.10 The Markov chain (Xn) has an atom α ∈ B(X) if there exists
an associated non-zero measure µ such that
K(x, A) = µ(A), ∀x ∈ α, ∀A ∈ B(X).
The definition of a small set follows naturally and will be used in our definition of one of the strongest forms of convergence, uniform ergodicity.
Definition 3.11 A set C is small if there exists an m > 0 and a nonzero measure µm such that
\[ K^m(x, A) \geq \mu_m(A) \]
for all x ∈ C and for all A ∈ B(X).
Definition 3.12 The Markov chain (Xn) is uniformly ergodic if
\[ \lim_{n \to \infty} \sup_{x \in \mathcal{X}} \| P^n(x, \cdot) - \pi \|_{TV} = 0, \]
where $\| \cdot \|_{TV}$ denotes the total variation norm.
In showing uniform ergodicity, we will make use of the following theorem.
Theorem 3.13 (Doeblin’s Condition) The following are equivalent:
(a) (Xn) is uniformly ergodic;
(b) there exist R < ∞ and r > 1 such that
\[ \| P^n(x, \cdot) - \pi \|_{TV} < R r^{-n}, \quad \forall x \in \mathcal{X}; \]
(c) (Xn) is aperiodic and X is a small set;
(d) (Xn) is aperiodic and there exist a small set C and a real κ > 1 such that
\[ \sup_{x \in \mathcal{X}} \mathbb{E}_x[\kappa^{\tau_C}] < \infty. \]
If the whole space X is small, there exists a probability distribution φ on X, constants ε, δ > 0, and n such that, if φ(A) > ε, then
\[ \inf_{x \in \mathcal{X}} K^n(x, A) > \delta, \quad \forall A \in \mathcal{B}(\mathcal{X}). \]
We see here the relation between analytic limits and uniform ergodicity, giving
us a feel for just how strong the guarantee of convergence is. Now that we have
covered enough of the basic vocabulary of Markov chains we may begin our
survey of MCMC sampling algorithms.
Chapter 4
Markov Chain Monte Carlo
4.1 Monte Carlo Methods
Although the sampling techniques discussed in Chapter 2 work well, they are
not flawless. The inverse transform method fails beyond 1-dimension and even
then it requires us to write down the closed form cdf of the target distribution.
Acceptance-rejection can be used in any dimension we like, but as dimensionality
increases it becomes more difficult to find good proposal distributions. We
turn now to Markov Chain Monte Carlo (MCMC) simulations because they
ameliorate many of these issues.
Monte Carlo simulations allow us to approximate the probability of certain
outcomes by running a large number of trials to obtain an empirical distribu-
tion of possible events. Markov Chain Monte Carlo simulations use Markov
chains whose stationary distribution is the target distribution we wish to sam-
ple from. The oldest MCMC algorithm, and the one we choose to cover first, is
the Metropolis-Hastings algorithm.
4.2 Metropolis Hastings
At this point, we may readily sample from most distributions covered in an
introductory probability course. However, when faced with the task of drawing
from a non-standard distribution, we will need more powerful tools at our dis-
posal. For instance, in Bayesian statistics we would often like to sample from
the posterior distribution of a parameter to compute its expected value.
At a high level, Metropolis-Hastings samples from a target distribution fX
by drawing from a proposal distribution g (“easy” to sample) and accepting if
it looks like it came from fX (“hard” to sample). At step T in the algorithm,
in which the current state is XT, we draw a candidate/proposal $X^* \sim g(X \mid X_T)$ and let $X_{T+1} = X^*$ with probability
\[ A(X^*, X_T) = \min\left(1, \frac{f_X(X^*)\, g(X_T \mid X^*)}{f_X(X_T)\, g(X^* \mid X_T)}\right). \]
Otherwise, let $X_{T+1} = X_T$.
We notice two things about the acceptance probability. First, the Metropolis-
Hastings algorithm only requires we know fX up to a normalizing constant.
Second, if g is symmetric the acceptance probability becomes
\[ A(X^*, X_T) = \min\left(1, \frac{f_X(X^*)}{f_X(X_T)}\right), \]
which implies we always accept a candidate that is more probable, and accept candidates randomly otherwise. The acceptance probability combines concepts from steepest ascent and random walk algorithms which help prevent getting stuck in local maxima. Following Algorithm 2 ensures the stationary distribution of the Markov chain is fX.
Algorithm 2 MH Sampling
1: procedure Metropolis-Hastings Input: Current state: XT ∼ fX
2: Draw X∗ ∼ g(X|XT), U ∼ Unif([0, 1])
3: Compute acceptance probability Pa = A(X∗, XT)
4: If U < Pa set XT+1 ← X∗, otherwise set XT+1 ← XT
5: end procedure
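A minimal R realization of Algorithm 2 for the symmetric (random-walk) case is sketched below; the Gaussian proposal, its standard deviation, and the bimodal test density are our own illustrative choices.
# Random-walk Metropolis: with a symmetric proposal, A = min(1, f(x*) / f(x)).
RW.Metropolis <- function(f, x0, nsample, prop.sd = 1) {
  x <- numeric(nsample)
  x[1] <- x0
  for (t in 2:nsample) {
    x.star <- rnorm(1, mean = x[t - 1], sd = prop.sd)   # proposal centered on the current state
    a <- min(1, f(x.star) / f(x[t - 1]))                # acceptance probability
    x[t] <- if (runif(1) < a) x.star else x[t - 1]      # accept or stay put
  }
  return(x)
}
# Example: a bimodal mixture like the one in Example 4.1.
bimodal <- function(x) 0.5 * dnorm(x, -2, 1) + 0.5 * dnorm(x, 2, 1)
draws <- RW.Metropolis(bimodal, x0 = 0, nsample = 2000, prop.sd = 1)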
Make no mistake, Metropolis-Hastings is no free lunch. The proposal distri-
bution must be chosen carefully and presents difficulties in higher dimensions
where our intuition and imagination fail us. This is especially the case when
using a non-symmetric proposal distribution. For this reason, we restrict our
study of Metropolis-Hastings solely to the symmetric, random walk case. A
common (symmetric) proposal distribution is a Gaussian centered on the cur-
rent state. It is also typical for the proposal distribution’s variance to be chosen
to be on the same order of magnitude as the smallest variance of the target
distribution.
Figure 4.1: Contours of a bivariate normal target distribution (red) and sym-
metric proposal distribution with standard deviation ρ (blue).
Consider Figure 4.1, where the 2D target distribution exhibits a strong cor-
relation between components. To achieve a high acceptance ratio, the stan-
dard deviation of the proposal distribution must be kept on the same order of
magnitude as σmin. Otherwise, our proposals will be from all over the space
and we would rarely accept any move. The random walk behavior also means that to explore the length of the distribution, which is roughly σmax/σmin proposal steps across, it takes on the order of $(\sigma_{max}/\sigma_{min})^2$ steps, because the distance traveled by a random walk in n steps is proportional to $\sqrt{n}$. If our target distribution is pinched in one dimension and elongated in another, the Metropolis-Hastings algorithm offers poor convergence properties.
Example 4.1 Suppose we wish to sample from the 2-dimensional mixture of
normals whose contours are shown in Figure 4.2 (bottom), alongside its 1-
dimensional analogue (top). Figure 4.3 shows that in 2-dimensions, the first
coordinate of the points sampled using a standard Metropolis algorithm appear
to mix well early, but clearly display difficulty jumping between modes. Figure
4.4a and Figure 4.4b suggest that in 5-,10-, and higher dimensions the problem
is only exacerbated.
Figure 4.2: The one-dimensional analogue (top) of the 2D target distribution (bottom) (µ1 = −2, µ2 = 2, σ1 = σ2 = 1).
Figure 4.3: Mixing of the first coordinate, X1, from a 2D Metropolis sample.
Figure 4.4: First coordinate of points sampled from a Metropolis random walk in 5D (a) and 10D (b).
4.2.1 Bias-Variance Trade-off
Implicit in our handling of MCMC lies the desire for unbiased draws from some stationary distribution, π. In many practical applications, it is too computationally intensive to draw enough samples to estimate a parameter, $\hat{\theta}$, or the expectation of a function, E[f(X)], with sufficiently low variance. If we allow for some bias in our draws from the stationary distribution, the task of simulation is made easier.
The mean square error in our estimate is a measure of both bias and variance, $MSE = B^2 + V$. When drawing from a posterior density over billions of data points, unbiased Markov chains incur significant computational costs. As a result, the variance of these approximations is high because we can only collect small samples in a fixed amount of time.
Alternatively, we can simulate from a slightly biased stationary distribution $\pi_\epsilon$, where $\epsilon$ is a parameter that controls the bias we allow in our simulation [7]. As $\epsilon$ increases it becomes easier to simulate draws from $\pi_\epsilon$. Given infinite time we should let $\epsilon = 0$ and run the chain to draw infinite samples. However, when given limited or finite wall-clock time it may be advantageous to tolerate some bias in return for lowering variance by either collecting larger samples or mixing better.
4.2.2 Approximate Metropolis-Hastings
As we alluded to earlier, in Bayesian inference it is often the case that we wish
to find the expectation of a parameter θ with respect to a posterior distribution,
f(θ). Given a dataset of N observations $X_N = \{x_1, \ldots, x_N\}$, which we model with a distribution f(x|θ) and prior ρ(θ), we want to sample from the posterior density
\[ f(\theta) \propto \rho(\theta) \prod_{i=1}^{N} f(x_i \mid \theta) \]
to estimate $\hat{\theta}$. If our data is minimally sufficient and if XN contains billions of points, then evaluating f(·) at least once in the Metropolis-Hastings acceptance ratio is a costly O(N) operation for a single bit of information.
By reformulating step 4 of Algorithm 2 as a statistical test of significance, we can reduce some of the overhead incurred by unbiased MCMC. In standard Metropolis-Hastings we accept the proposal θ∗ if U < Pa, otherwise we stay where we are. This condition is equivalent to checking if
\begin{align*}
U &< \frac{f(\theta^*)\, g(\theta_T \mid \theta^*)}{f(\theta_T)\, g(\theta^* \mid \theta_T)} \\
U\, \frac{g(\theta^* \mid \theta_T)\, \rho(\theta_T)}{g(\theta_T \mid \theta^*)\, \rho(\theta^*)} &< \frac{\prod_{i=1}^{N} f(x_i \mid \theta^*)}{\prod_{i=1}^{N} f(x_i \mid \theta_T)} \\
\frac{1}{N} \log\left( U\, \frac{g(\theta^* \mid \theta_T)\, \rho(\theta_T)}{g(\theta_T \mid \theta^*)\, \rho(\theta^*)} \right) &< \frac{1}{N} \sum_{i=1}^{N} l_i, \quad \text{where } l_i = \log f(x_i \mid \theta^*) - \log f(x_i \mid \theta_T) \\
\mu_0 &< \mu,
\end{align*}
where in the last step we substitute µ0 for the left-hand side and µ for the right-hand side for notational convenience.
The costly computation that may have previously required the evaluation
of a posterior density over billions of points is equivalent to testing whether
the mean of a finite population {l1, . . . , lN } is greater than some constant µ0
that does not depend on the data. This makes it easy to frame the check as a
sequential hypothesis test: randomly draw a mini-batch of size n < N without
replacement from XN and compute its mean, $\bar{l}$. If the difference between $\bar{l}$ and µ0 is significantly larger than the standard deviation of $\bar{l}$ and if µ0 < $\bar{l}$, then θ∗ is accepted, otherwise we stay put. If significance is not achieved, we add more observations to the mini-batch and re-run until significance is achieved. Significance will eventually be achieved and the sequential hypothesis test will terminate because when n = N the standard deviation of $\bar{l}$ is 0, since $\bar{l}$ is then the population mean, µ.
Formally, we can test the hypotheses
\[ H_0 : \mu_0 \leq \mu \quad \text{vs} \quad H_1 : \mu_0 > \mu, \]
where the sample mean, $\bar{l}$, and the sample standard deviation, $s_l$, are given as
\[ \bar{l} = \frac{1}{n} \sum_{i=1}^{n} l_i, \qquad s_l^2 = \frac{n\left( \overline{l^2} - (\bar{l})^2 \right)}{n - 1}, \]
the standard deviation of $\bar{l}$ is estimated to be
\[ s = \frac{s_l}{\sqrt{n}} \sqrt{1 - \frac{n - 1}{N - 1}}, \]
and the test statistic is
\[ t = \frac{\bar{l} - \mu_0}{s}. \]
For large enough n, we claim t follows a standard Student-t distribution with n − 1 degrees of freedom when µ = µ0. To determine if the difference between µ0 and µ is significant, we compute the p-value as $p = 1 - \phi_{n-1}(|t|)$, where $\phi_{n-1}(\cdot)$ is the cdf of the Student-t distribution with n − 1 degrees of freedom. If p is less than the α level of our test, the difference is significant and we can confidently decide whether µ0 < µ (accept the proposal) or µ0 > µ (keep the current state). The pseudocode below, as well as a more detailed proof of the distribution of t, may be found in [7].
We are often able to make confident decisions considering only n < N data
points in the posterior. Though we introduce bias in the form of the α level
of the test, we make up for this by drawing more samples from the stationary
distribution. For error bounds on the estimates produced, a description of
optimal sequential test design, and illustrative examples, see [7]. In the following
section we cover the slice sampling algorithm, which may be conceptualized
as a higher dimensional analogue to the inverse transform. Interestingly, an
approximate slice sampler also exists [4].
Algorithm 3 Approximate MH Test
procedure Approx. MH Input: θT, θ∗, ε, µ0, XN, m Output: accept
    Initialize the running means of li and li² to 0
    n ← 0, done ← false
    Draw U ∼ Unif([0, 1])
    while not done do
        Draw mini-batch X of size min(m, N − n) w/o replacement from XN and set XN ← XN \ X
        Update the running means using X and set n ← n + |X|
        Compute δ ← 1 − φn−1((l̄ − µ0)/s)
        if δ < ε then
            accept ← true if µ0 < l̄ and false otherwise
            done ← true
        end if
    end while
end procedure
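The R sketch below mirrors the logic of Algorithm 3 under some simplifying assumptions: loglik(x, theta) is a hypothetical user-supplied function returning the vector of per-observation log-densities log f(xi | θ), µ0 has been computed beforehand from U, the proposal, and the prior, and each mini-batch contains at least two observations.
# Sequential (approximate) MH test: decide whether mu0 < mu using mini-batches of the data.
Approx.MH.Test <- function(theta.t, theta.star, eps, mu0, data, m) {
  N <- length(data)
  idx <- sample(N)                        # random order gives mini-batches w/o replacement
  l <- numeric(0)
  n <- 0
  repeat {
    batch <- data[idx[(n + 1):min(n + m, N)]]
    l <- c(l, loglik(batch, theta.star) - loglik(batch, theta.t))  # per-observation l_i
    n <- length(l)
    lbar <- mean(l)
    if (n == N) return(mu0 < lbar)        # whole data set used: no uncertainty remains
    s <- sd(l) / sqrt(n) * sqrt(1 - (n - 1) / (N - 1))  # estimated std. dev. of lbar
    delta <- 1 - pt(abs(lbar - mu0) / s, df = n - 1)    # tail probability of the t statistic
    if (delta < eps) return(mu0 < lbar)   # significant: accept if mu0 < lbar, else stay
  }
}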
4.3 Slice Sampling
Unlike Metropolis-Hastings, the slice sampler does not require the selection of
a proposal distribution nor does it require any convexity properties, as some
adaptive acceptance-rejection methods do. In practice, however, slice sampling is not entirely free of hyperparameter selection.
In the univariate case, the slice sampler transitions from a point (X, U) under the curve of fX to another point (X', U') under the curve of fX in such a way that the distribution of (X, U) converges to the uniform distribution over the area under the curve of fX, which is its stationary distribution [8].
The pseudocode in Algorithm 4 outlines the 2D case.
Algorithm 4 2D Slice Sampler
1: procedure Slice sample Input: XT ∈ supp fX
2: Draw U ∼ Unif([0, 1])
3: Draw XT+1 ∼ Unif({x : fX(x) ≥ U · fX(XT)})
4: end procedure
Many important details are left out but a full implementation may be found in Figure 4.5. The
problem of drawing from the exact level sets of the distribution in step 3 can be
intractable when fX is complex enough. We have adapted Neal’s slice sampling
algorithm from [11] and naively expand out from XT using an arbitrarily chosen
step size until a suitable interval is found. If we were able to sample perfectly
from the slice under the curve, there would be no rejected samples. The idea of
learning or predicting these level sets is intriguing, and to my knowledge, has
not been attempted.
Slice.Sample <- function(x0, f, nsample, step = 1) {
  x <- x0
  for (i in 2:nsample) {
    # Draw the auxiliary height uniformly under the curve at the current point.
    u <- runif(1, 0, f(x[i - 1]))
    # Step out from the current point until both endpoints lie below the slice.
    lower <- x[i - 1] - step
    upper <- x[i - 1] + step
    while (u < f(lower)) {
      lower <- lower - step
    }
    while (u < f(upper)) {
      upper <- upper + step
    }
    # Sample uniformly from the interval, shrinking it toward the current
    # point whenever a proposal falls outside the slice.
    repeat {
      x.proposal <- runif(1, lower, upper)
      if (u < f(x.proposal)) {
        x[i] <- x.proposal
        break
      } else if (x.proposal < x[i - 1]) {
        lower <- x.proposal
      } else {
        upper <- x.proposal
      }
    }
  }
  return(x)
}
Figure 4.5: Naive implementation of the slice sampler
Example 4.2 We use the slice sampler to draw from a tri-modal mixture of
normals defined in the target function below. The issue of finding correct level
sets becomes apparent, as we might not expand our interval out far enough to
jump modes.
target <- function(x) {
return(0.25 * dnorm(x, -2, 0.3) +
0.50 * dnorm(x, 0, 0.3) +
0.25 * dnorm(x, 2, 0.3))
}
hist(Slice.Sample(1, target, 10000, 1),
breaks = 100, prob = TRUE, ylim = c(0, 0.7),
main = "Trimodal Mixture of Normal", xlab = "X")
x <- seq(-10, 10, length = 1000)
lines(x, target(x), col = "blue")
Figure 4.6: The result of slice sampling a trimodal normal distribution.
4.3.1 Auxiliary Variable MCMC
The slice sampler introduces an auxiliary variable, an approach we revisit with
the Hamiltonian Monte Carlo, that is marginalized out to produce the desired
distribution. Using the Fundamental Theorem of Simulation, we are able to
draw samples from fX by drawing samples uniformly under the curve of fX.
Let Q be the area under the curve of fX, so that the joint density of $(X, U) \sim \mathrm{Unif}(\{(x, u) : 0 < u < f_X(x)\})$ is
\[ f_{(X,U)}(x, u) = \frac{1}{Q}\, \mathbf{1}(0 \leq u \leq f_X(x)). \]
This implies the marginal distribution of X is
\[ \int f_{(X,U)}(x, u)\, du = \frac{1}{Q} \int_0^{f_X(x)} du = \frac{f_X(x)}{Q}. \]
As Algorithm 4 suggests, we alternate between sampling X and U. To see that the general slice sampler preserves the uniform distribution over the area under the curve of fX, note that if $X_T \sim f_X$ and $U_{T+1} \sim \mathrm{Unif}([0, f_X(X_T)])$ then
\[ (X_T, U_{T+1}) \sim f_X(X_T)\, \frac{\mathbf{1}(0 \leq U_{T+1} \leq f_X(X_T))}{f_X(X_T)} \propto \mathbf{1}(0 \leq U_{T+1} \leq f_X(X_T)). \]
If $X_{T+1} \sim \mathrm{Unif}(A_{T+1}) = \mathrm{Unif}(\{x : 0 \leq U_{T+1} \leq f_X(x)\})$ then
\[ (X_T, U_{T+1}, X_{T+1}) \sim f_X(X_T)\, \frac{\mathbf{1}(0 \leq U_{T+1} \leq f_X(X_T))}{f_X(X_T)}\, \frac{\mathbf{1}(0 \leq U_{T+1} \leq f_X(X_{T+1}))}{\mu(A_{T+1})}, \]
where $\mu(A_{T+1})$ denotes the Lebesgue measure of the set. Marginalizing out $X_T$ gives
\begin{align*}
f(U_{T+1}, X_{T+1}) &\propto \int \mathbf{1}(0 \leq U_{T+1} \leq f_X(x))\, \frac{\mathbf{1}(0 \leq U_{T+1} \leq f_X(X_{T+1}))}{\mu(A_{T+1})}\, dx \\
&= \frac{\mathbf{1}(0 \leq U_{T+1} \leq f_X(X_{T+1}))}{\mu(A_{T+1})} \int \mathbf{1}(0 \leq U_{T+1} \leq f_X(x))\, dx \\
&= \mathbf{1}(0 \leq U_{T+1} \leq f_X(X_{T+1})),
\end{align*}
so that if we begin with $X_T \sim f_X$ then the updates that generate $X_{T+1}$ and $U_{T+1}$ preserve the uniform distribution under the curve of fX.
4.3.2 Uniform Ergodicity of the Slice Sampler
We now discuss the convergence properties of the slice sampler in the simple
2D case. In the ensuing calculations we denote by µ(ω) the Lebesgue measure
of the set
Aω = {x : 0 ≤ ω ≤ fX(x)}.
To gain insight into how the slice sampler behaves asymptotically, we look to
the cdf of the transition kernel. More specifically, we look at the probability
that fX(XT +1) ≤ η given that we are currently at XT and fX(XT ) = ν.
\[ \Pr\left( f_X(X_{T+1}) \leq \eta \mid f_X(X_T) = \nu \right) = \int \int \frac{\mathbf{1}(0 \leq \omega \leq \nu)}{\nu}\, \frac{\mathbf{1}(\omega \leq f_X(x) \leq \eta)}{\mu(\omega)}\, d\omega\, dx, \]
where we first draw ω uniformly on [0, ν] and then draw XT+1 uniformly on Aω. Simplifying further gives
\begin{align*}
\Pr\left( f_X(X_{T+1}) \leq \eta \mid f_X(X_T) = \nu \right) &= \frac{1}{\nu} \int \frac{\mathbf{1}(0 \leq \omega \leq \nu)}{\mu(\omega)} \int \mathbf{1}(\omega \leq f_X(x) \leq \eta)\, dx\, d\omega \\
&= \frac{1}{\nu} \int \mathbf{1}(0 \leq \omega \leq \nu) \cdot \frac{\mu(\omega) - \mu(\eta)}{\mu(\omega)}\, d\omega \\
&= \frac{1}{\nu} \int_0^{\min(\eta, \nu)} \frac{\mu(\omega) - \mu(\eta)}{\mu(\omega)}\, d\omega \\
&= \frac{1}{\nu} \int_0^{\nu} \max\left( 1 - \frac{\mu(\eta)}{\mu(\omega)},\ 0 \right) d\omega,
\end{align*}
which tells us the convergence properties of the slice sampler are totally dependent on the measure, µ. We now turn to the main result, which we owe to Mira and Tierney [10] who, under boundedness conditions, established the following lemma.
Lemma 4.3 If fX and supp fX are bounded, the 2D slice sampler is uniformly
ergodic.
Proof Without loss of generality, assume that fX is bounded by 1 and that
supp fX = [0, 1]. To prove uniform ergodicity, we will show that supp fX is a
small set so that we may invoke Doeblin’s condition. Let
\[ \xi(\nu) = \Pr\left( f_X(X_{T+1}) \leq \eta \mid f_X(X_T) = \nu \right). \]
Notice that ω > η implies µ(η) ≥ µ(ω), so the integrand $\max(1 - \mu(\eta)/\mu(\omega), 0)$ vanishes for ω > η. Further, when ν ≥ η,
\[ \xi(\nu) = \frac{1}{\nu} \int_0^{\eta} \left( 1 - \frac{\mu(\eta)}{\mu(\omega)} \right) d\omega \]
is decreasing in ν since ν only appears in the denominator outside of the integral.
When ν ≤ η we recognize
\[ \xi(\nu) = \frac{1}{\nu} \int_0^{\nu} \left( 1 - \frac{\mu(\eta)}{\mu(\omega)} \right) d\omega \]
as the expected value of the function $1 - \mu(\eta)/\mu(\omega)$ where ω ∼ Unif([0, ν]). The larger ω is, the smaller µ(ω) is, so this function is decreasing in ω; its average over ω ∼ Unif([0, ν]) is therefore decreasing in ν.
Therefore ξ(ν) is decreasing in ν for all η. Intuitively, it would not make
sense if ξ(ν) were increasing in ν because it would imply our Markov chain is
not spending enough time in the modes. If ξ(ν) were increasing in ν then the
larger ν the more likely we are to end up below some threshold (away from the
mode). For the proof to be complete, we must establish bounds on the cdf of
the transition kernel. The minimum occurs when ν = 1:
\[ \lim_{\nu \to 1} \xi(\nu) = \int_0^{\eta} \left( 1 - \frac{\mu(\eta)}{\mu(\omega)} \right) d\omega, \]
which is bounded above by $\int_0^{\eta} 1\, d\omega = \eta$ and below by 0. The maximum is given by L'Hopital's rule:
\[ \lim_{\nu \to 0} \xi(\nu) = \lim_{\nu \to 0} \frac{\int_0^{\nu} \left( 1 - \frac{\mu(\eta)}{\mu(\omega)} \right) d\omega}{\nu} = \lim_{\nu \to 0} \left( 1 - \frac{\mu(\eta)}{\mu(\nu)} \right) = 1 - \mu(\eta). \]
1 − µ(η) is bounded above by 1 and below by 0 because the support is [0, 1].
Once we have found nondegenerate upper and lower bounds on the cdf of
the transition kernel, it is not difficult to derive Doeblin’s condition. The entire
support of fX is thus a small set and uniform ergodicity follows.
This proof serves to remind us that rigorous results are not easy to come
by in MCMC. We must work hard to ensure the methods we employ do indeed
sample from the desired target distribution. We have thus introduced the slice
sampler, given a rudimentary implementation of it, and discussed its conver-
gence properties in the simple 2D case. Next, we cover the Gibbs sampler which
extends the slice sampler’s idea of alternately sampling variables conditioned on
one another.
4.4 Gibbs Sampling
In this section, we consider sampling from the multivariate distribution f(x) =
f(X1, . . . , Xn). Each step of the Gibbs sampling algorithm replaces a single
value, say Xi, by sampling from the distribution conditioned on everything
but Xi, namely fXi(Xi | x−i). That is, we replace Xi with a value drawn from fXi(Xi | x−i), where Xi denotes the ith component of the vector x and x−i denotes the vector x without the ith component. The deterministic scan Gibbs sampler is expressed rather nicely in Algorithm 5.
Each Gibbs step loops through x and replaces each component with a sam-
ple drawn from the correct conditional distribution using the most up-to-date
values. In the context of Metropolis-Hastings, x−i remains unchanged when we draw Xi, so the proposal distribution is $f_{X_i}(X^*_i \mid x_{-i})$. We also have that $x^*_{-i} = x_{-i}$ and $f_x(x) = f_{X_i}(X_i \mid x_{-i})\, f_{x_{-i}}(x_{-i})$, so the Metropolis-Hastings acceptance probability is
\[ A(x^*, x) = \frac{f_{X_i}(X^*_i \mid x^*_{-i})\, f_{x_{-i}}(x^*_{-i})\, f_{X_i}(X_i \mid x^*_{-i})}{f_{X_i}(X_i \mid x_{-i})\, f_{x_{-i}}(x_{-i})\, f_{X_i}(X^*_i \mid x_{-i})} = 1. \]
Algorithm 5 Gibbs Sampling
1: procedure Gibbs Step Input: x = (X1, . . . , Xn) Output: x∗
2: Draw X∗1 ∼ fX1(X1 | X2, . . . , Xn)
3: Draw X∗2 ∼ fX2(X2 | X∗1, X3, . . . , Xn)
4: ...
5: Draw X∗n ∼ fXn(Xn | X∗1, X∗2, . . . , X∗n−1)
6: return x∗ ← (X∗1, . . . , X∗n)
7: end procedure
Thus, if, when dealing with high dimensional distributions, we have access to the conditional distributions (which is often the case in Bayesian networks), the Gibbs sampler never rejects a proposal.
Example 4.4 Say we wish to draw points (X, Y) whose conditional distributions are X | Y ∼ Exp(Y) and Y | X ∼ Exp(X). Below, we implement a deterministic scan Gibbs sampler that draws from this bounded 2D exponential distribution. We bound/truncate the points we draw for graphical simplicity.
Exp.Bounded <- function(rate, B) {
repeat{
x <- rexp(1, rate)
if (x <= B) {
return(x)
}
}
}
Gibbs.Sampler <- function(M, B) {
mat <- matrix(ncol=2, nrow = M)
x <- 1; y <- 1
mat[1, ] <- c(x, y)
for (i in 2:M) {
x <- Exp.Bounded(y, B)
y <- Exp.Bounded(x, B)
mat[i,] <- c(x, y)
}
return(mat)
}
mat <- Gibbs.Sampler(1000, 10)
layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))
plot(mat, main="Joint Distribution", xlab=expression("X"[1]),
ylab=expression("X"[2]), ylim = c(0, 10), xlim = c(0, 10))
hist(mat[ , 1], main=expression("Marginal dist. of X"[1]),
xlab=expression("X"[1]), prob = TRUE, breaks = 30)
hist(mat[ , 2], main=expression("Marginal dist. of X"[2]),
xlab=expression("X"[2]), prob = TRUE, breaks = 30)
Example 4.5 Here we use a random scan Gibbs sampler to approximate the
probability that a point drawn uniformly from the unit hypersphere in 6 dimen-
sions is at least a distance of 0.9 from the origin. Our algorithm begins at the
origin and then randomly chooses a coordinate to replace. Given (X1, . . . , Xn) in the n-dimensional unit hypersphere we choose a random coordinate to update (WLOG, say X1) and sample it uniformly such that
\begin{align*}
\|x\| &\leq 1 \\
X_1^2 + \ldots + X_n^2 &\leq 1 \\
X_1^2 &\leq 1 - (X_2^2 + \ldots + X_n^2) \\
|X_1| &\leq \sqrt{1 - (X_2^2 + \ldots + X_n^2)}
\end{align*}
But the square root is always nonnegative, so we must also flip a fair coin to determine the sign. More explicitly,
\[ X_i \mid x_{-i} \sim \mathrm{Unif}\left( -\sqrt{1 - \textstyle\sum_{j \neq i} X_j^2},\ \sqrt{1 - \textstyle\sum_{j \neq i} X_j^2} \right). \]
Euclidean.Norm <- function(x) {
return(sqrt(sum(x ^ 2)))
}
Gibbs.Hypersphere.Conditional <- function(x) {
if (runif(1) <= 0.5) {
return(-1 * runif(1, min = 0, max = sqrt(1 - sum(x ^ 2))))
}
return(runif(1, min = 0, max = sqrt(1 - sum(x ^ 2))))
}
Random.Scan.Gibbs.Hypersphere <- function(x = rep(0, 6)) {
idx <- sample(1:6, 1)
x[idx] <- Gibbs.Hypersphere.Conditional(x[-idx])
return(x)
}
Hypersphere.MC <- function(steps = 100, f.sample) {
x <- rep(0, 6) # start at origin
for (i in 1:(0.1 * steps)) {
x <- f.sample(x)
}
data <- matrix(0, ncol = length(x), nrow = steps)
for (i in 1:steps) {
x <- f.sample(x)
data[i, ] <- x
}
return(data)
}
draws <- replicate(10,
Hypersphere.MC(steps = 5000,
Random.Scan.Gibbs.Hypersphere))
counts <- apply(draws, MARGIN = 3, FUN = apply, 1, Euclidean.Norm)
p <- mean(counts >= 0.9)
s <- sd(counts >= 0.9) / sqrt(length(counts))
We find the probability that a uniform point drawn from the unit hypersphere in
6 dimensions is at least 0.9 from the origin is 0.469 ± 0.002.
4.5 Hamiltonian Monte Carlo
Originally introduced in 1987 as the Hybrid Monte Carlo [3], what we refer
to as the Hamiltonian Monte Carlo (HMC) combines Hamiltonian dynamics
and the Metropolis algorithm to propose large changes in state (e.g. jumping
from mode to mode in a single iteration) while maintaining a high acceptance
probability. HMC interprets x as a position and introduces an auxiliary variable
to simulate Hamiltonian mechanics on phase space. But first, we introduce the
basic vocabulary of Hamiltonian dynamics.
4.5.1 Hamiltonian Dynamics
Hamiltonian dynamics is a reformulation of classical Newtonian mechanics in
which a particle is described by a position vector x and a momentum vector
p. We associate with our position and momentum a total energy H(x, p) =
U(x) + K(p) called the Hamiltonian of our system. H(x, p) is the sum of the
potential energy associated with x and the kinetic energy associated with p.
We often take the kinetic energy to be
\[ K(p) = \frac{1}{2} \| p \|_2^2, \]
which corresponds to simulating Hamiltonian dynamics on a Euclidean manifold. Exploring the effects of alternate kinetic energies is beyond the scope of this text, however one can imagine simulating the dynamics on a Riemannian manifold instead. The choice of potential energy, we will see, depends on the target distribution we wish to sample from.
Given a position and momentum, the system evolves according to Hamilton's equations:
\[ \frac{dp}{dt} = -\frac{\partial H}{\partial x} \quad \text{and} \quad \frac{dx}{dt} = \frac{\partial H}{\partial p}. \]
Hamiltonian dynamics conserves energy, so a particle whose movement is governed by these equations travels along level sets of constant energy in the joint, or phase, space. Although H remains invariant, the values of x and
p change over time. By simulating the dynamics of a system over a finite time
period, we are able to make large changes to x and avoid random walk behavior.
Example 4.6 (A One-Dimensional Example) Consider the simple case in which the Hamiltonian of our system is defined as follows:
\[ H(x, p) = U(x) + K(p), \qquad U(x) = \frac{x^2}{2}, \qquad K(p) = \frac{p^2}{2}. \]
The resulting dynamics evolve according to the equations
\[ \frac{dp}{dt} = -x, \qquad \frac{dx}{dt} = p. \]
The solutions to these equations have the following form, for some constants r and a:
\[ x(t) = r \cos(a + t), \qquad p(t) = -r \sin(a + t), \]
which correspond to a rotation by t radians clockwise around the origin in the (x, p) plane.
4.5.2 HMC
If we consider the joint distribution over states (x, p) with total energy H(x, p),
i.e.
P(x, p) ∝ exp(−H(x, p)),
we realize that simply starting at some point (x0, p0) and running the dynamics
does not sample ergodically from P. To see this, notice this only explores level sets of constant energy: all states outside the set {(x, p) : H(x, p) = H(x0, p0)} are unreachable. To construct an ergodic Markov chain, we need to perturb the
value of H while keeping P invariant. Conceptually, we want to jump between
level sets of constant energy to explore the space. Adding a Gibbs step where
we draw p ∼ P(p|x) accomplishes just this. Our job is made even simpler by
the independence of x and p, which follows from the factorization of P as
P(x, p) ∝ exp(−U(x)) exp(−K(p)).
Marginalizing out x yields $P(p) \propto \exp(-K(p)) = \exp(-\|p\|_2^2 / 2)$, which we recognise as the (unnormalized) pdf of a standard normal random variable. Applying the same thinking to x, we see that taking U(x) = − log(fX(x)) implies x ∼ fX, giving x the desired marginal distribution.
An algorithm begins to emerge: starting at some point (x, p) in phase space, simulate Hamiltonian dynamics for a finite number of steps, and end in a new state (x∗, p∗). The proposal is accepted with probability
\[ \min\left(1, \exp\left(H(x, p) - H(x^*, p^*)\right)\right). \]
By the conservation of energy, we should always accept such proposals. Sometimes, errors in our numeric simulation of the dynamics prevent this from happening. In our experiments we used Radford Neal's code that appears in Chapter 5 of [2] and is available online at http://www.cs.utoronto.ca/~radford/ham-mcmc-simple.
Algorithm 6 Hamiltonian Monte Carlo Sampler
1: procedure HMC Input: x ∼ fX
2: Draw p ∼ Norm(0, 1), U ∼ Unif([0, 1])
3: Simulate Hamiltonian dynamics to get (x∗, p∗) ∼ P
4: Compute acceptance probability Pa = min(1, exp(H(x, p) − H(x∗, p∗)))
5: if U < Pa then return x∗
6: else return x
7: end if
8: end procedure
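For completeness, here is a compact sketch of one HMC transition in R using the leapfrog integrator; it is not Neal's code, and the step size eps, number of steps L, and gradient function grad.U are illustrative choices supplied by the user.
# One HMC transition with a leapfrog simulation of the dynamics.
# U(x) = -log f_X(x) is the potential energy and grad.U its gradient.
HMC.Step <- function(x, U, grad.U, eps = 0.1, L = 20) {
  p <- rnorm(length(x))                              # resample momentum: p ~ N(0, I)
  x.new <- x
  p.new <- p - eps / 2 * grad.U(x)                   # initial half step for momentum
  for (l in 1:L) {
    x.new <- x.new + eps * p.new                     # full step for position
    if (l < L) p.new <- p.new - eps * grad.U(x.new)  # full step for momentum
  }
  p.new <- p.new - eps / 2 * grad.U(x.new)           # final half step for momentum
  H.old <- U(x) + sum(p ^ 2) / 2
  H.new <- U(x.new) + sum(p.new ^ 2) / 2
  if (runif(1) < exp(H.old - H.new)) x.new else x    # Metropolis accept/reject
}
Repeatedly applying HMC.Step with U(x) = −log(fX(x)) produces a chain whose position coordinate has fX as its marginal distribution.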
Example 4.7 Suppose we wish to sample from the 1D bimodal distribution from Example 4.1. Although we have touted the performance of HMC in high dimensions, we restrict ourselves to a 1D density so that the joint phase space may be visualized, as below. We begin somewhere in the space (Figure 4.7a), simulate Hamiltonian dynamics for some number of steps, and accept the proposal state (Figure 4.7b). We must be cautious in our choice of the number of simulation steps, L, and the stepsize, ε, as it is not difficult to imagine an instance where the simulated particle returns to its starting position after a finite number of iterations. Choosing the stepsize, ε, at random before simulating the particle's path can prevent this type of behavior.
Figure 4.7: Contours of H(x, p) in phase space, with the chain's location before (a) and after (b) simulating Hamiltonian dynamics.
4.6 Summary
Thus ends our exploration of Markov Chain Monte Carlo sampling methods.
As you may have noticed, algorithms that sample from high dimensional dis-
tributions are seldom written once and used forever. Instead, they require an
attention to detail and a tested dedication to writing correct code. Even once
the practitioner has chosen an algorithm most applicable to their setting, it
may require days or weeks of tuning and testing hyperparameter combinations
to achieve the desired convergence. However, the four approaches to MCMC
presented in this work (random walk, Metropolis-Hastings, auxiliary variables,
Gibbs sampling) comprise the vast majority of the practitioner’s toolbox.
Chapter 5
Conclusion
We began with the question of how to generate randomness and have concen-
trated largely on algorithms that do just that: spit out randomness. This merely
scratches the surface of the work being done on Monte Carlo methods. We can
now answer real questions faced by statisticians, economists, mathematicians, and nuclear physicists. We can theorize models based on our beliefs, collect data, and determine, through simulation, whether our observations are in line with our predictions or if they can be considered “extreme”, “weird”, or “outlying”. Prerequisite to all of this is the ability to sample uniformly from the unit interval. We are reminded of the power of the Fundamental Theorem of Simulation and how, ultimately, all of our problems are reduced to sampling uniformly.
Bibliography
[1] Christopher Bishop. Pattern Recognition and Machine Learning. Springer,
New York, 2006.
[2] Steve Brooks. Handbook of Markov Chain Monte Carlo. CRC Press/Taylor
& Francis, Boca Raton, 2011.
[3] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan
Roweth. Hybrid Monte Carlo. Physics letters B, 195(2):216–222, 1987.
[4] Christopher DuBois, Anoop Korattikara, Max Welling, and Padhraic
Smyth. Approximate Slice Sampling for Bayesian Posterior Inference. In
Artificial Intelligence and Statistics, 2014.
[5] Stuart Geman and Donald Geman. Stochastic Relaxation, Gibbs Distri-
butions, and the Bayesian Restoration of Images. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, (6):721–741, 1984.
[6] W Keith Hastings. Monte Carlo Sampling Methods Using Markov Chains
and their Applications. Biometrika, 57(1):97–109, 1970.
[7] Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in
MCMC Land: Cutting the Metropolis-Hastings Budget. arXiv preprint
arXiv:1304.5299, 2013.
[8] David J. C. MacKay. Information Theory, Inference and Learning Algo-
rithms. Cambridge University Press, 2003.
[9] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Au-
gusta H Teller, and Edward Teller. Equation of State Calculations by Fast
Computing Machines. The journal of chemical physics, 21(6):1087–1092,
1953.
[10] Antonietta Mira and Luke Tierney. Efficiency and Convergence Properties
of Slice Samplers. Scandinavian Journal of Statistics, 29(1):1–12, 2002.
[11] Radford M Neal. Slice Sampling. Annals of statistics, pages 705–741, 2003.
[12] Christian Robert. Monte Carlo Statistical Methods. Springer, New York,
2004.
[13] Gareth O. Roberts and Jeffrey S. Rosenthal. On Convergence Rates of
Gibbs Samplers for Uniform Distributions. The Annals of Applied Proba-
bility, 8(4):pp. 1291–1302, 1998.
[14] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Ap-
proach (3rd Edition). Prentice Hall, 2009.

thesis_final_draft

  • 1.
    Senior Thesis inMathematics Sampling from High Dimensional Distributions Author: Bill DeRose Advisor: Dr. Gabe Chandler Submitted to Pomona College in Partial Fulfillment of the Degree of Bachelor of Arts April 3, 2015
  • 2.
    Contents 1 Introduction 1.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Random Variable Generation 2.1 Random Variable Generation . . . . . . . . . . . . . . . . . . . . 2.1.1 The Inverse Transform . . . . . . . . . . . . . . . . . . . . 2.1.2 Acceptance-Rejection Sampling . . . . . . . . . . . . . . . 3 Markov Chains 3.1 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Properties of Markov Chains . . . . . . . . . . . . . . . . . . . . 3.2.1 Irreducibility . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.2 Aperiodicity . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.3 Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.4 Ergodicity . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Markov Chain Monte Carlo 4.1 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Metropolis Hastings . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 Bias-Variance Trade-off . . . . . . . . . . . . . . . . . . . 4.2.2 Approximate Metropolis-Hastings . . . . . . . . . . . . . 4.3 Slice Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.1 Auxiliary Variable MCMC . . . . . . . . . . . . . . . . . . 4.3.2 Uniform Ergodicity of the Slice Sampler . . . . . . . . . . 4.4 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Hamiltonian Monte Carlo . . . . . . . . . . . . . . . . . . . . . . 4.5.1 Hamiltonian Dynamics . . . . . . . . . . . . . . . . . . . . 4.5.2 HMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Conclusion
  • 3.
    Chapter 1 Introduction 1.1 Introduction Thechallenges we face in computational statistics are due to the incredible advance of technology in the past 100 years. In a world where human suffering is the daily reality of so many, we should be so lucky to wrestle with the problems of algorithm design and implementation. Despite the advances in statistics over the past century, random number generation remains an active field of research. From machine learning and artificial intelligence, to the simulation of protein formation, the ability to draw from probability distributions has wide ranging applications. But what exactly is a random number, and what is randomness? More importantly, how can an algorithm take a finite number of deterministic steps to produce something random? Often, humans delude themselves into seeing randomness where there is none – they detect a signal in the noise. Figure 1.1: Which is random?
  • 4.
    The image onthe left of Figure 1.1 depicts genuine randomness. The points on the right are too evenly spaced for it to be truly random. In actuality, each point on the left represents the location of a star in our galaxy while the points on the right represent the location of glowworms on the ceiling of a cave in New Zealand. The glowworms spread themselves out to reduce competition for food amongst themselves. The seemingly uniform distribution is the result of a non-random force. So how do we go about generating images like those on the left? We begin with a little cheat and assume the existence of a random number generator that allows us to sample U ∼ Uniform([0, 1]). Though we will not discuss methods for drawing uniformly from the unit interval, their importance to us cannot be understated. In practice, exact inference is often either impossible (e.g. provably non- integrable functions) or intractable (e.g. high dimensional integration) and we must turn to approximations. This work explores Monte Carlo methods as one approach to numerical approximation. Example 1.1 (Numeric Integration) We wish to evaluate an integral Q = b a f(x) dx. From calculus, we know favg = Q b − a ⇒ Q = (b − a)favg. By the LLN, we can choose X1, . . . , Xn uniformly in [a, b] to approximate favg ≈ 1 n n i=1 f(Xi) ⇒ Q ≈ b − a n n i=1 f(Xi). 1.2 Related works Many of the algorithms we cover are versions of the Metropolis algorithm which first appeared in [9] and was eventually generalized by Hastings in [6]. Though the naming of the algorithm has been contended (Metropolis merely oversaw the research), we refer to the algorithm as Metropolis-Hastings for historical reasons. Regardless of naming conventions, we are indebted to Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller for their work on the original paper outlining the Metropolis algorithm. The term “Monte Carlo” was coined because some of the first applications were to card games, like those in the Monte Carlo Casino in Monaco. A Monte Carlo algorithm is simply an algorithm whose output is random. Such Monte Carlo simulations were important to the development of the Manhattan project (after all, Metropolis worked at Los Alamos during WWII) and remain an im- portant tool of modern statistical physics. Though the Gibbs sampler was introduced by brothers Stuart and Donald Geman in 1984 [5] these sorts of numerical sampling techniques did not en- ter the mainstream until the early 1990’s, arguable because of the advent of the personal computer and wider access to computational power. The Gibbs sampler appeared nearly two decades before Neal’s slice sampler [11], though
we cover them in reverse chronological order because the latter provides a nice motivation for the former. We draw heavily from [10], [13], [12] in proving the uniform ergodicity of the 2D slice sampler. Neal's work appears again in Hamiltonian Monte Carlo [2], which uses gradient information to explore the state space more efficiently. The contemporary results presented are mostly due to the proliferation of big data which, in the Bayesian setting, necessitates the ability to sample from a posterior distribution conditioned on billions of data points. Sequential hypothesis testing allows us to reduce some of the overhead required for Metropolis-Hastings [7]. As a general introduction to each of these sampling techniques, [1] has proven invaluable.
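Returning to Example 1.1, the following is a minimal R sketch of the Monte Carlo estimate of an integral. The integrand, interval, and sample size are arbitrary choices made only for illustration.

# Monte Carlo estimate of Q = integral of f over [a, b] (Example 1.1).
# The integrand f, the interval, and n are illustrative choices.
f <- function(x) exp(-x^2)      # integrand
a <- 0; b <- 2                  # interval of integration
n <- 100000                     # number of uniform draws
x <- runif(n, a, b)             # X_1, ..., X_n ~ Unif([a, b])
Q.hat <- (b - a) * mean(f(x))   # (b - a) times the average of f(X_i)
Q.hat
# For comparison, R's deterministic quadrature:
integrate(f, a, b)$value

With this many draws the estimate typically agrees with the quadrature value to two or three decimal places, and the error shrinks at the usual Monte Carlo rate of $O(1/\sqrt{n})$.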
Chapter 2

Random Variable Generation

2.1 Random Variable Generation

Assuming we may draw $U \sim \mathrm{Unif}([0,1])$, what other distributions can we generate from? It turns out we can draw $X \sim \mathrm{Bernoulli}(p)$ by letting $X \leftarrow \mathbf{1}(U < p)$. If $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(p)$ then $\sum_{i=1}^n X_i = Y \sim \mathrm{Bin}(n, p)$. However, a more general approach called the inverse transform allows us to draw from any 1D density whose closed form cdf we may write down.

2.1.1 The Inverse Transform

Definition 2.1 Suppose $X$ is a random variable with probability density function (pdf) $f_X$. We denote by $F_X$ the cumulative distribution function (cdf), where $F_X(a) = \Pr(X \le a)$. Cumulative distribution functions are nonnegative, increasing, right-continuous functions with $\lim_{a \to -\infty} F_X(a) = 0$ and $\lim_{a \to \infty} F_X(a) = 1$.

Definition 2.2 For an increasing function $F$ on $\mathbb{R}$, the pseudoinverse of $F$, denoted $F_p^{-1}$, is the function
\[ F_p^{-1}(u) = \inf\{a : F(a) \ge u\}. \]
If $F$ is strictly increasing then $F_p^{-1} \equiv F^{-1}$.

With these definitions in hand, we have the tools to generate random variates from any distribution with a computable generalized inverse.

Lemma 2.3 Let $F(a)$ be a cdf and $U \sim \mathrm{Unif}([0,1])$. If $X = F_p^{-1}(U)$ then $X$ has cdf $F(a)$.
Proof Let $F$, $U$, and $X$ be as given in the lemma. Then
\[ \Pr(X \le a) = \Pr(F_p^{-1}(U) \le a) = \Pr(U \le F(a)) = F(a), \]
where the second equality follows from the fact that $F$ is increasing.

We now see the importance of being able to draw uniformly from the unit interval. In fact, if $U$ is not truly uniform on $[0,1]$, then the inverse transform method fails to sample from the correct distribution. However, to use the inverse transform we must explicitly write down the cumulative distribution function and efficiently compute its generalized inverse. As we will see in Example 2.6, this is not always possible.

Example 2.4 We wish to draw $X \sim \mathrm{Exp}(\lambda)$ using the inverse transform method. The cdf of an exponential random variable is given by $F_X(a) = 1 - \exp(-\lambda a)$. Solving for the inverse yields
\begin{align*}
U &= 1 - \exp(-\lambda a) \\
\log(1 - U) &= -\lambda a \\
-\lambda^{-1}\log(1 - U) &= a,
\end{align*}
so that if $U \sim \mathrm{Unif}([0,1])$ then $-\lambda^{-1}\log(1 - U) = X \sim \mathrm{Exp}(\lambda)$. (A short R sketch of this transform appears after Example 2.5.)

Example 2.5 Recall that the pdf of a Cauchy random variable, $X$, is
\[ f_X(s) = \frac{1}{\pi(1 + s^2)}. \]
Given $U \sim \mathrm{Unif}([0,1])$, we find a transformation $Y = r(U)$ such that $Y$ has a Cauchy distribution. We begin by finding the cdf of $X$:
\begin{align*}
F_X(a) &= \int_{-\infty}^{a} \frac{1}{\pi(1 + s^2)}\,ds
        = \frac{1}{\pi}\Big[\tan^{-1}(s)\Big]_{-\infty}^{a} \\
       &= \frac{1}{\pi}\Big(\tan^{-1}(a) - \lim_{n \to -\infty}\tan^{-1}(n)\Big)
        = \frac{1}{\pi}\Big(\tan^{-1}(a) + \frac{\pi}{2}\Big)
        = \frac{\tan^{-1}(a)}{\pi} + \frac{1}{2}.
\end{align*}
To compute the desired transformation, we have:
\begin{align*}
U &= \frac{\tan^{-1}(a)}{\pi} + \frac{1}{2} \\
\pi\Big(U - \frac{1}{2}\Big) &= \tan^{-1}(a) \\
\tan\Big(\pi\Big(U - \frac{1}{2}\Big)\Big) &= a,
\end{align*}
so that $Y = r(U) = \tan\big(\pi(U - \tfrac{1}{2})\big) \sim \mathrm{Cauchy}$.

cauchies <- tan(pi * (runif(10000) - 0.5))
hist(cauchies[abs(cauchies) <= 500], prob = TRUE, breaks = 2000,
     xlim = c(-20, 20), ylim = c(0, 0.35),
     main = "Cauchy r.v. using ITF", xlab = "X")
lines(seq(-20, 20, 0.2), dcauchy(seq(-20, 20, 0.2)), col = "blue")

[Figure: "Cauchy r.v. using ITF", a histogram of the simulated draws (X vs. Density) with the Cauchy density overlaid in blue.]
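As promised in Example 2.4, here is an analogous R sketch of the inverse transform for the exponential distribution; the rate λ = 2 and the sample size are arbitrary illustrative choices.

lambda <- 2
u <- runif(10000)
exps <- -log(1 - u) / lambda    # inverse transform: F^{-1}(u) = -log(1 - u) / lambda
hist(exps, prob = TRUE, breaks = 100,
     main = "Exp(2) via the inverse transform", xlab = "X")
x <- seq(0, 4, 0.01)
lines(x, dexp(x, rate = lambda), col = "blue")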
Example 2.6 Even in 1D, there exist densities whose cdf we cannot write down. For example, the cumulative distribution function of the standard normal distribution cannot be expressed in closed form:
\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp(-z^2/2)\,dz. \]
Clearly, we must develop other methods that do not rely as strongly on nice analytic properties of our target distribution.

2.1.2 Acceptance-Rejection Sampling

Much of this section stems from the idea that if $f_X$ is the target distribution, we may write
\[ f_X(x) = \int_0^{f_X(x)} 1\,du. \]
Here, $f_X$ appears as the marginal density of the joint distribution $(X, U) \sim \mathrm{Unif}(\{(x,u) : 0 < u < f_X(x)\})$. Introducing the auxiliary variable $U$ allows us to sample from our target distribution by drawing uniformly from the area under the curve of $f_X$ and ignoring the auxiliary coordinate.

Theorem 2.7 (The Fundamental Theorem of Simulation) Simulating $X \sim f_X$ is equivalent to simulating $(X, U) \sim \mathrm{Unif}(\{(x,u) : 0 < u < f_X(x)\})$.

Actually sampling from the joint distribution of $(X, U)$ introduces difficulty, though, because sampling $X \sim f_X$ and $U|X \sim \mathrm{Unif}([0, f_X(X)])$ defeats the purpose of introducing the auxiliary variable. If we could sample $X \sim f_X$ in the first place, we would already be done. The solution is to generate pairs $(X, U)$ from a superset and accept them if they satisfy the constraint. For instance, suppose the 1D density $f_X$ is bounded by $m$ and the support of $f_X$, denoted $\mathrm{supp}\, f_X$, is $[c, d]$. Sampling pairs $(X, U) \sim \mathrm{Unif}(\{(x,u) : 0 \le u \le f_X(x)\})$ is equivalent to simulating $X \sim \mathrm{Unif}([c,d])$, $U|X \sim \mathrm{Unif}([0, m])$, and accepting the pair if $0 < U < f_X(X)$. It is easily shown that this does indeed sample from
the desired distribution:
\begin{align*}
\Pr(X \le a) &= \Pr(X \le a \mid U \le f_X(X)) = \frac{\Pr(X \le a,\ U \le f_X(X))}{\Pr(U \le f_X(X))} \\
&= \frac{\int_c^a \int_0^{f_X(z)} \frac{1}{d-c}\cdot\frac{1}{m}\,du\,dz}{\int_c^d \int_0^{f_X(z)} \frac{1}{d-c}\cdot\frac{1}{m}\,du\,dz}
 = \frac{\int_c^a \int_0^{f_X(z)} du\,dz}{\int_c^d \int_0^{f_X(z)} du\,dz} \\
&= \frac{\int_c^a f_X(z)\,dz}{\int_c^d f_X(z)\,dz} = \int_c^a f_X(z)\,dz = F_X(a).
\end{align*}
This computation was made easier by the fact that both $f_X$ and $\mathrm{supp}\, f_X$ were bounded. In situations where this is not the case, we can no longer use a rectangle as the superset from which we draw candidates. Instead, we use some other probability distribution $g(x)$ that may be readily sampled from. Such a distribution is called a proposal distribution and must satisfy
\[ M \cdot g(x) \ge f_X(x), \qquad M \ge 1, \qquad \forall x \in \mathrm{supp}\, f_X. \]
We formalize this notion in the following theorem.

Theorem 2.8 (The Acceptance-Rejection Theorem) Let $g$ be a probability distribution that satisfies $M \cdot g(x) \ge f_X(x)$ for some $M \ge 1$ and for all $x \in \mathrm{supp}\, f_X$. Then, to simulate $X \sim f_X$, it is sufficient to simulate $Y \sim g$ and $U|Y \sim \mathrm{Unif}([0, M \cdot g(Y)])$ and let $X \leftarrow Y$ if $U \le f_X(Y)$.

Proof Sampling $Y \sim g$, $U|Y \sim \mathrm{Unif}([0, M \cdot g(Y)])$, and letting $X \leftarrow Y$ if
$U \le f_X(Y)$ generates $X \sim f_X$:
\begin{align*}
\Pr(X \in A) &= \Pr(Y \in A \mid U \le f_X(Y)) \\
&= \frac{\int_A \int_0^{f_X(z)} g(z)\,\frac{1}{M g(z)}\,du\,dz}{\int_{\mathrm{supp}\, f_X} \int_0^{f_X(z)} g(z)\,\frac{1}{M g(z)}\,du\,dz}
 = \frac{\int_A f_X(z)\,dz}{\int_{\mathrm{supp}\, f_X} f_X(z)\,dz} = \int_A f_X(z)\,dz.
\end{align*}

The proposals used in acceptance-rejection sampling come from $g(Y)$ and are accepted with probability
\[ \frac{f_X(Y)}{M \cdot g(Y)}, \]
so the probability we accept any given proposal is then
\[ \Pr(\text{accept}) = \int g(y)\,\frac{f_X(y)}{M \cdot g(y)}\,dy = \frac{1}{M}\int f_X(y)\,dy = \frac{1}{M}. \]
The larger $M$ is, the more points we must reject before accepting a proposal. For efficiency's sake, we want
\[ M = \sup_x \frac{f_X(x)}{g(x)} \]
to ensure the highest possible acceptance rate. This leads directly to the Acceptance-Rejection algorithm, which is a realization of Theorem 2.8:

Algorithm 1 AR Sampling
1: procedure Acceptance-Rejection
2:     Draw Y ∼ g, U ∼ Unif([0, M · g(Y)])
3:     Let X ← Y if U ≤ fX(Y), else return to 2.
4: end procedure

Example 2.9 Given $Y \sim \mathrm{Cauchy}$ we use acceptance-rejection to generate $X \sim \mathrm{Exp}(1/2)$. To use AR with a proposal distribution $g(x)$, we must ensure
\[ M \cdot g(x) \ge f_X(x) \;\Rightarrow\; M \ge \frac{f_X(x)}{g(x)} \quad \text{for all } x \in \mathrm{supp}\, f_X. \]
Ideally, $M$ is as close to 1 as possible:
\[ M \ge \sup_{x \ge 0} \frac{f_X(x)}{g(x)} \approx 3.629. \]
We confine our maximization to the positive reals because the target distribution only has support on the positive reals. The maximum is attained at $x = 2 + \sqrt{3}$. Using $M = 3.629$ yields
Draw.AR <- function() {
  repeat {
    proposal <- rcauchy(1)
    u <- runif(1, 0, 3.629 * dcauchy(proposal))
    if (u <= dexp(proposal, rate = 1 / 2)) {
      return(proposal)
    }
  }
}

x <- seq(0, 15, 0.1)
hist(replicate(10000, Draw.AR()), breaks = 100, prob = TRUE,
     xlab = "X", main = "Exp(1/2) using AR")
lines(x, dexp(x, rate = 1 / 2), col = "blue")

[Figure: "Exp(1/2) using AR", a histogram of the accepted draws (X vs. Density) with the Exp(1/2) density overlaid in blue.]
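As a quick sanity check on the acceptance probability derived above, the overall acceptance rate of this sampler should be close to $1/M \approx 1/3.629 \approx 0.276$. The sketch below uses a hypothetical variant of Draw.AR that also counts proposals; it is not part of the original example.

Draw.AR.Counted <- function() {
  # Same acceptance-rejection loop as Draw.AR, but also records how many
  # proposals were needed before one was accepted.
  attempts <- 0
  repeat {
    attempts <- attempts + 1
    proposal <- rcauchy(1)
    u <- runif(1, 0, 3.629 * dcauchy(proposal))
    if (u <= dexp(proposal, rate = 1 / 2)) {
      return(c(proposal, attempts))
    }
  }
}

draws <- replicate(10000, Draw.AR.Counted())
mean(draws[2, ])       # mean number of proposals per accepted draw, approx. M
1 / mean(draws[2, ])   # empirical acceptance rate, approx. 1/M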
Chapter 3

Markov Chains

3.1 Markov Chains

Definition 3.1 A sequence of random variables $X_1, \ldots, X_n$, denoted $(X_n)$, is a Markov chain if
\[ \Pr(X_{n+1} \mid X_n, X_{n-1}, \ldots, X_1) = \Pr(X_{n+1} \mid X_n). \tag{3.1} \]

Example 3.2 A random walk is a Markov chain that satisfies $X_{n+1} = X_n + \epsilon_n$, where $\epsilon_n$ is generated independently of the current state. If the distribution of $\epsilon_n$ is symmetric about 0, we call this a symmetric random walk. In Section 4.2 we will see how random walks are used in MCMC algorithms.

Every Markov chain has an initial distribution, $\pi_0$, and a transition kernel $K$. The state space, denoted $\mathcal{X}$, is the set of possible values $X_i$ may take on at each step in the Markov chain.

Definition 3.3 A transition kernel is a function $K$ defined on $\mathcal{X} \times \mathcal{B}(\mathcal{X})$ such that
• $\forall x \in \mathcal{X}$, $K(x, \cdot)$ is a probability measure;
• $\forall A \in \mathcal{B}(\mathcal{X})$, $K(\cdot, A)$ is measurable;
where $\mathcal{B}(\mathcal{X})$ denotes the σ-algebra defined on the set $\mathcal{X}$.

When the state space is discrete, the transition kernel is a matrix $K$ where $K_{ij} = \Pr(X_{n+1} = x_j \mid X_n = x_i)$. In the continuous case, the transition kernel denotes a conditional density where
\[ \Pr(X_{n+1} \in A \mid X_n = x) = \int_A K(x, x')\,dx'. \]
A Markov chain is said to be time homogeneous if $K(X_{n+1} \mid X_n)$ is independent of $n$.
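To make the discrete case concrete, the following is a minimal R sketch that simulates a chain from a given transition matrix. The two-state matrix, the starting state, and the chain length are arbitrary choices made for illustration.

# A small 2-state transition matrix: K[i, j] = Pr(X_{n+1} = j | X_n = i).
K <- matrix(c(0.9, 0.1,
              0.5, 0.5), nrow = 2, byrow = TRUE)

Simulate.Chain <- function(K, x0, steps) {
  x <- numeric(steps)
  x[1] <- x0
  for (n in 2:steps) {
    # Move according to the row of K indexed by the current state.
    x[n] <- sample(1:ncol(K), 1, prob = K[x[n - 1], ])
  }
  return(x)
}

chain <- Simulate.Chain(K, x0 = 1, steps = 10000)
table(chain) / length(chain)   # empirical occupation frequencies

For this particular matrix the stationary distribution works out to (5/6, 1/6), and the empirical frequencies settle near it as the chain grows.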
We restrict our study almost entirely to time homogeneous Markov chains. An example of a time heterogeneous Markov chain is the simulated annealing algorithm, whose transition kernel changes with the "temperature" of the system. Time heterogeneity is a key property of simulated annealing because it allows us to explore the entire state space when the temperature is high, but restricts our moves when the temperature is low. The algorithm is inspired by annealing in metallurgy, where the process is used to temper or harden metals and glass by heating them to a high temperature and gradually cooling them, allowing the material to reach a low-energy crystalline state [14].

Given a transition matrix for a discrete Markov chain and an initial distribution $\pi_0$, the distribution of $X_1$ is obtained by matrix multiplication, $\pi_1 = \pi_0 K$. Similarly, $X_n \sim \pi_n = \pi_0 K^n$. Notice that once the initial distribution is specified, the behavior of the chain is entirely dependent on $K$.

Definition 3.4 Consider $A \in \mathcal{B}(\mathcal{X})$. The first $n$ for which the chain enters the set $A$ is denoted by
\[ \tau_A = \inf\{n \ge 1 : X_n \in A\} \]
and is called the stopping time at $A$. By convention, $\tau_A = \infty$ if $X_n \notin A$ for every $n$. Associated with the set $A$, we also define
\[ \eta_A = \sum_{n=1}^{\infty} \mathbf{1}(X_n \in A), \]
the number of times the chain enters $A$.

Example 3.5 In a zero-sum coin tossing game, the payoff to player b is +1 if a heads appears and −1 if a tails appears. Similarly, the payoff to player c is +1 if a tails appears and −1 if a heads appears. Let $X_n$ be the sum of the gains of player b after $n$ rounds of the game. The infinite dimensional transition matrix, $K$, has zeros on the diagonal since player b must either lose or gain a point on each round. Furthermore, $K$ has upper and lower sub-diagonals equal to 1/2 because we are flipping a fair coin. Assuming that player b begins with $B$ dollars and player c begins with $C$ dollars,
\[ \tau_1 = \inf\{n : X_n \le -B\} \quad\text{and}\quad \tau_2 = \inf\{n : X_n \ge C\} \]
represent, respectively, the ruin of players b and c. The probability of bankruptcy for player b is then $\Pr(\tau_1 < \tau_2)$.

3.2 Properties of Markov Chains

3.2.1 Irreducibility

Irreducibility is an important property of Markov chains which guarantees that regardless of the current state of the chain, it is possible to reach any other state
in a finite number of transitions. In the discrete case, irreducibility also tells us the transition matrix cannot be broken down into smaller matrices (i.e. the transition graph is connected).

Definition 3.6 Given a measure $\varphi$, a Markov chain with transition kernel $K(\cdot)$ is $\varphi$-irreducible if for every $A \in \mathcal{B}(\mathcal{X})$ such that $\varphi(A) > 0$, $\Pr(\tau_A < \infty) > 0$ regardless of the initial state.

Irreducibility, together with aperiodicity, a property introduced in the following subsection, allows us to make strong analytic arguments about the convergence of Markov chains.

3.2.2 Aperiodicity

We define the period of a state $x \in \mathcal{X}$ to be
\[ d(x) = \gcd\{m \ge 1 : K^m(x, x) > 0\}. \]
If $d(x) \ge 2$, we say $x$ is periodic with period $d(x)$. A state is aperiodic if it has period 1. An irreducible chain is aperiodic if each state has period 1.

Example 3.7 A Markov chain with period $n$ is given by the block matrix
\[ P = \begin{pmatrix}
0 & P_1 & 0 & \cdots & 0 \\
0 & 0 & P_2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & P_{n-1} \\
P_n & 0 & 0 & \cdots & 0
\end{pmatrix}, \]
where each $P_i$ is a stochastic matrix and $P$ is irreducible.

3.2.3 Stationarity

Definition 3.8 A Markov chain $(X_n)$ has stationary distribution $\pi$ if $X_n \sim \pi \Rightarrow X_{n+1} \sim \pi$.

For MCMC methods to be of any use to us, we must be able to reason about the asymptotic behavior of Markov chains. The distribution of $X_n$ as $n \to \infty$ is called the limiting distribution. Ideally, we would like some guarantee that, regardless of initial conditions, the limiting distribution of a Markov chain is also its stationary distribution.

The general approach with MCMC algorithms is to initialize and run a Markov chain for a sufficient number of steps to draw samples approximately from the desired stationary distribution. It is common to ignore some number of samples at the beginning (burn-in), and then consider only every nth sample (thinning, for approximate independence) when computing an expectation.
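As a small numerical illustration of Definition 3.8 and of the limiting behavior just described, the sketch below applies $\pi_n = \pi_0 K^n$ to the same illustrative two-state matrix used earlier and checks that the resulting distribution is stationary.

# Same illustrative 2-state matrix as in the earlier sketch.
K <- matrix(c(0.9, 0.1,
              0.5, 0.5), nrow = 2, byrow = TRUE)

# pi_n = pi_0 K^n: repeatedly multiply an initial distribution by K.
pi.n <- c(1, 0)                 # start with all mass on state 1
for (n in 1:50) pi.n <- pi.n %*% K
pi.n                            # approx. (5/6, 1/6)

# Check stationarity directly: pi K = pi (up to numerical error).
pi.stat <- c(5, 1) / 6
pi.stat %*% K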
3.2.4 Ergodicity

When exactly do we know that the limiting distribution of a Markov chain is the stationary distribution? The Ergodic Theorem tells us just this.

Theorem 3.9 (The Ergodic Theorem) Let $(X_n)$ be a Markov chain with stationary distribution $\pi$. If the chain is $\varphi$-irreducible and aperiodic, then for all measurable sets $A$,
\[ \lim_{n \to \infty} \Pr(X_n \in A) = \pi(A). \]
Which is to say that the limiting distribution of irreducible, aperiodic Markov chains is always the stationary distribution. An even stronger guarantee of convergence exists, but to get there we must introduce more terminology.

Definition 3.10 The Markov chain $(X_n)$ has an atom $\alpha \in \mathcal{B}(\mathcal{X})$ if there exists an associated non-zero measure $\mu$ such that
\[ K(x, A) = \mu(A), \qquad \forall x \in \alpha,\ \forall A \in \mathcal{B}(\mathcal{X}). \]
The definition of a small set follows naturally and will be used in our definition of one of the strongest forms of convergence, uniform ergodicity.

Definition 3.11 A set $C$ is small if there exist an $m > 0$ and a nonzero measure $\mu_m$ such that
\[ K^m(x, A) \ge \mu_m(A) \qquad \text{for all } x \in C \text{ and for all } A \in \mathcal{B}(\mathcal{X}). \]

Definition 3.12 The Markov chain $(X_n)$ is uniformly ergodic if
\[ \lim_{n \to \infty} \sup_{x \in \mathcal{X}} \| P^n(x, \cdot) - \pi \|_{TV} = 0, \]
where $\|\cdot\|_{TV}$ denotes the total variation norm.

In showing uniform ergodicity, we will make use of the following theorem.

Theorem 3.13 (Doeblin's Condition) The following are equivalent:
(a) $(X_n)$ is uniformly ergodic;
(b) there exist $R < \infty$ and $r > 1$ such that $\| P^n(x, \cdot) - \pi \|_{TV} < R\,r^{-n}$ for all $x \in \mathcal{X}$;
(c) $(X_n)$ is aperiodic and $\mathcal{X}$ is a small set;
(d) $(X_n)$ is aperiodic and there exist a small set $C$ and a real $\kappa > 1$ such that
\[ \sup_{x \in \mathcal{X}} \mathbb{E}_x[\kappa^{\tau_C}] < \infty. \]
If the whole space $\mathcal{X}$ is small, there exist a probability distribution $\varphi$ on $\mathcal{X}$, constants $\epsilon, \delta > 0$, and an $n$ such that, if $\varphi(A) > \epsilon$, then
\[ \inf_{x \in \mathcal{X}} K^n(x, A) > \delta, \qquad \forall A \in \mathcal{B}(\mathcal{X}). \]
We see here the relation between analytic limits and uniform ergodicity, giving us a feel for just how strong the guarantee of convergence is. Now that we have covered enough of the basic vocabulary of Markov chains, we may begin our survey of MCMC sampling algorithms.
Chapter 4

Markov Chain Monte Carlo

4.1 Monte Carlo Methods

Although the sampling techniques discussed in Chapter 2 work well, they are not flawless. The inverse transform method fails beyond one dimension, and even then it requires us to write down the closed form cdf of the target distribution. Acceptance-rejection can be used in any dimension we like, but as dimensionality increases it becomes more difficult to find good proposal distributions. We turn now to Markov Chain Monte Carlo (MCMC) simulations because they ameliorate many of these issues.

Monte Carlo simulations allow us to approximate the probability of certain outcomes by running a large number of trials to obtain an empirical distribution of possible events. Markov Chain Monte Carlo simulations use Markov chains whose stationary distribution is the target distribution we wish to sample from. The oldest MCMC algorithm, and the one we choose to cover first, is the Metropolis-Hastings algorithm.

4.2 Metropolis Hastings

At this point, we may readily sample from most distributions covered in an introductory probability course. However, when faced with the task of drawing from a non-standard distribution, we will need more powerful tools at our disposal. For instance, in Bayesian statistics we would often like to sample from the posterior distribution of a parameter to compute its expected value.

At a high level, Metropolis-Hastings samples from a target distribution $f_X$ by drawing from a proposal distribution $g$ ("easy" to sample) and accepting if it looks like it came from $f_X$ ("hard" to sample). At step $T$ in the algorithm, in which the current state is $X_T$, we draw a candidate/proposal $X^* \sim g(X \mid X_T)$ and let $X_{T+1} = X^*$ with probability
\[ A(X^*, X_T) = \min\left(1,\ \frac{f_X(X^*)\, g(X_T \mid X^*)}{f_X(X_T)\, g(X^* \mid X_T)}\right). \]
Otherwise, let $X_{T+1} = X_T$. We notice two things about the acceptance probability. First, the Metropolis-Hastings algorithm only requires that we know $f_X$ up to a normalizing constant. Second, if $g$ is symmetric the acceptance probability becomes
\[ A(X^*, X_T) = \min\left(1,\ \frac{f_X(X^*)}{f_X(X_T)}\right), \]
which implies we always accept a candidate that is more probable, and accept less probable candidates randomly otherwise. The acceptance probability combines concepts from steepest ascent and random walk algorithms, which helps prevent getting stuck in local maxima. Following Algorithm 2 ensures the stationary distribution of the Markov chain is $f_X$.

Algorithm 2 MH Sampling
1: procedure Metropolis-Hastings
   Input: Current state: XT ∼ fX
2:     Draw X∗ ∼ g(X|XT), U ∼ Unif([0, 1])
3:     Compute acceptance probability Pa = A(X∗, XT)
4:     If U < Pa set XT+1 ← X∗, otherwise set XT+1 ← XT
5: end procedure

Make no mistake, Metropolis-Hastings is no free lunch. The proposal distribution must be chosen carefully and presents difficulties in higher dimensions, where our intuition and imagination fail us. This is especially the case when using a non-symmetric proposal distribution. For this reason, we restrict our study of Metropolis-Hastings solely to the symmetric, random walk case. A common (symmetric) proposal distribution is a Gaussian centered on the current state. It is also typical for the proposal distribution's variance to be chosen to be on the same order of magnitude as the smallest variance of the target distribution.

Figure 4.1: Contours of a bivariate normal target distribution with principal standard deviations σmax and σmin (red) and a symmetric proposal distribution with standard deviation ρ (blue).

Consider Figure 4.1, where the 2D target distribution exhibits a strong correlation between components. To achieve a high acceptance ratio, the
standard deviation of the proposal distribution must be kept on the same order of magnitude as $\sigma_{\min}$. Otherwise, our proposals will be from all over the space and we would rarely accept any move. The random walk behavior also means that to explore the length of the distribution, a distance on the order of $\sigma_{\max}$, with steps of size roughly $\sigma_{\min}$, takes on the order of $(\sigma_{\max}/\sigma_{\min})^2$ steps, because a random walk only travels a distance proportional to $\sqrt{n}$ in $n$ steps. If our target distribution is pinched in one dimension and elongated in another, the Metropolis-Hastings algorithm offers poor convergence properties.

Example 4.1 Suppose we wish to sample from the 2-dimensional mixture of normals whose contours are shown in Figure 4.2 (bottom), alongside its 1-dimensional analogue (top). Figure 4.3 shows that in 2 dimensions, the first coordinate of the points sampled using a standard Metropolis algorithm appears to mix well early, but clearly displays difficulty jumping between modes. Figures 4.4a and 4.4b suggest that in 5, 10, and higher dimensions the problem is only exacerbated.

Figure 4.2: The one-dimensional analogue (top, "1D Normal Mixture") of the 2D target distribution (bottom, "2D Normal Mixture"), with µ1 = −2, µ2 = 2, σ1 = σ2 = 1.
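For reference, the following is a minimal R sketch of a random-walk Metropolis sampler of the kind used to produce the traces in Figures 4.3 and 4.4, here targeting the 1D mixture of Figure 4.2 (top). The Gaussian proposal and its standard deviation are illustrative choices, not the exact settings used for those figures.

# Unnormalized 1D target: equal mixture of N(-2, 1) and N(2, 1) (Example 4.1).
target.1d <- function(x) 0.5 * dnorm(x, -2, 1) + 0.5 * dnorm(x, 2, 1)

RW.Metropolis <- function(f, x0, nsample, sd.prop = 1) {
  x <- numeric(nsample)
  x[1] <- x0
  for (t in 2:nsample) {
    proposal <- rnorm(1, mean = x[t - 1], sd = sd.prop)  # symmetric proposal
    # With a symmetric g, the acceptance probability reduces to f(x*) / f(x_T).
    if (runif(1) < f(proposal) / f(x[t - 1])) {
      x[t] <- proposal
    } else {
      x[t] <- x[t - 1]
    }
  }
  return(x)
}

chain <- RW.Metropolis(target.1d, x0 = 0, nsample = 2000)
plot(chain, type = "l", xlab = "Index", ylab = "Position")

Shrinking sd.prop makes the chain accept more often but move more slowly between the modes at ±2, which is exactly the trade-off discussed above.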
Figure 4.3: Mixing of the first coordinate, X1, from a 2D Metropolis sample ("Random-walk Metropolis (2D)", index vs. first position coordinate).

Figure 4.4: First coordinate of points sampled from the Metropolis random walk in 5D (a, "Random-walk Metropolis (5D)") and 10D (b, "Random-walk Metropolis (10D)").

4.2.1 Bias-Variance Trade-off

Implicit in our handling of MCMC lies the desire for unbiased draws from some stationary distribution, $\pi$. In many practical applications, it is too
computationally intensive to draw enough samples to estimate a parameter, $\hat\theta$, or the expectation of a function, $\mathbb{E}[f(X)]$, with sufficiently low variance. If we allow for some bias in our draws from the stationary distribution, the task of simulation is made easier. The mean square error in our estimate is a measure of both bias and variance, $\mathrm{MSE} = B^2 + V$.

When drawing from a posterior density over billions of data points, unbiased Markov chains incur significant computational costs. As a result, the variance of these approximations is high because we can only collect small samples in a fixed amount of time. Alternatively, we can simulate from a slightly biased stationary distribution $\pi_\epsilon$, where $\epsilon$ is a parameter that controls the bias we allow in our simulation [7]. As $\epsilon$ increases it becomes easier to simulate draws from $\pi_\epsilon$. Given infinite time we should let $\epsilon = 0$ and run the chain to draw infinitely many samples. However, when given limited or finite wall-clock time it may be advantageous to tolerate some bias in return for lowering variance by either collecting larger samples or mixing better.

4.2.2 Approximate Metropolis-Hastings

As we alluded to earlier, in Bayesian inference it is often the case that we wish to find the expectation of a parameter $\theta$ with respect to a posterior distribution, $f(\theta)$. Given a dataset of $N$ observations $X_N = \{x_1, \ldots, x_N\}$, which we model with a distribution $f(x \mid \theta)$ and prior $\rho(\theta)$, we want to sample from the posterior density
\[ f(\theta) \propto \rho(\theta) \prod_{i=1}^{N} f(x_i \mid \theta) \]
to estimate $\hat\theta$. If our data is minimally sufficient and if $X_N$ contains billions of points, then evaluating $f(\cdot)$ even once in the Metropolis-Hastings acceptance ratio is a costly $O(N)$ operation for a single bit of information. By reformulating step 4 of Algorithm 2 as a statistical test of significance, we can reduce some of the overhead incurred by unbiased MCMC.

In standard Metropolis-Hastings we accept the proposal $\theta^*$ if $U < P_a$; otherwise we stay where we are. This condition is equivalent to checking whether
\begin{align*}
U &< \frac{f(\theta^*)\, g(\theta_T \mid \theta^*)}{f(\theta_T)\, g(\theta^* \mid \theta_T)} \\
U\, \frac{g(\theta^* \mid \theta_T)\, \rho(\theta_T)}{g(\theta_T \mid \theta^*)\, \rho(\theta^*)} &< \frac{\prod_{i=1}^N f(x_i \mid \theta^*)}{\prod_{i=1}^N f(x_i \mid \theta_T)} \\
\frac{1}{N} \log\left[ U\, \frac{g(\theta^* \mid \theta_T)\, \rho(\theta_T)}{g(\theta_T \mid \theta^*)\, \rho(\theta^*)} \right] &< \frac{1}{N} \sum_{i=1}^N l_i, \qquad \text{where } l_i = \log f(x_i \mid \theta^*) - \log f(x_i \mid \theta_T) \\
\mu_0 &< \mu,
\end{align*}
where in the last step we substitute $\mu_0$ for the left-hand side and $\mu$ for the right-hand side for notational convenience.
The costly computation that may have previously required the evaluation of a posterior density over billions of points is equivalent to testing whether the mean of a finite population $\{l_1, \ldots, l_N\}$ is greater than some constant $\mu_0$ that does not depend on the data. This makes it easy to frame the check as a sequential hypothesis test: randomly draw a mini-batch of size $n < N$ without replacement from $X_N$ and compute its mean, $\bar{l}$. If the difference between $\bar{l}$ and $\mu_0$ is significantly larger than the standard deviation of $\bar{l}$ and if $\mu_0 < \bar{l}$, then $\theta^*$ is accepted; otherwise we stay put. If significance is not achieved, we add more observations to the mini-batch and re-run until significance is achieved. Significance will eventually be achieved and the sequential hypothesis test will terminate, because when $n = N$ the standard deviation of $\bar{l}$ is 0, since $\bar{l}$ is then the population mean, $\mu$.

Formally, we can test the hypotheses
\[ H_0 : \mu = \mu_0 \quad \text{vs} \quad H_1 : \mu \ne \mu_0, \]
where the sample mean, $\bar{l}$, and the sample standard deviation, $s_l$, are given as
\[ \bar{l} = \frac{1}{n} \sum_{i=1}^{n} l_i, \qquad s_l^2 = \frac{n\big(\overline{l^2} - (\bar{l})^2\big)}{n - 1}, \]
the standard deviation of $\bar{l}$ is estimated to be
\[ s = \frac{s_l}{\sqrt{n}} \sqrt{1 - \frac{n-1}{N-1}}, \]
and the test statistic is
\[ t = \frac{\bar{l} - \mu_0}{s}. \]
For large enough $n$, we claim $t$ follows a standard Student-t distribution with $n - 1$ degrees of freedom when $\mu = \mu_0$. To determine if the difference between $\mu_0$ and $\mu$ is significant, we compute the p-value as $p = 1 - \phi_{n-1}(|t|)$, where $\phi_{n-1}(\cdot)$ is the cdf of the Student-t distribution with $n - 1$ degrees of freedom. If $p$ is less than the $\alpha$ level of our test, then we can reject $H_0$ and conclude $\mu_0 \ne \mu$. The pseudocode below, as well as a more detailed proof of the distribution of $t$, may be found in [7].

We are often able to make confident decisions considering only $n < N$ data points in the posterior. Though we introduce bias in the form of the $\alpha$ level of the test, we make up for this by drawing more samples from the stationary distribution. For error bounds on the estimates produced, a description of optimal sequential test design, and illustrative examples, see [7]. In the following section we cover the slice sampling algorithm, which may be conceptualized as a higher dimensional analogue to the inverse transform. Interestingly, an approximate slice sampler also exists [4].
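To complement the pseudocode in Algorithm 3 below, here is a minimal R sketch of a single round of the sequential test. The function name, argument list, and default α level are illustrative assumptions rather than the implementation from [7].

# One round of the test from the text: given the first n of the l_i values,
# decide whether the mini-batch mean differs significantly from mu0.
Batch.Test <- function(l, N, mu0, alpha = 0.05) {
  n <- length(l)
  l.bar <- mean(l)
  s.l <- sd(l)
  s <- (s.l / sqrt(n)) * sqrt(1 - (n - 1) / (N - 1))  # finite population correction
  t.stat <- (l.bar - mu0) / s
  p <- 1 - pt(abs(t.stat), df = n - 1)
  if (p < alpha) {
    return(list(done = TRUE, accept = mu0 < l.bar))
  }
  return(list(done = FALSE, accept = NA))  # grow the mini-batch and re-test
}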
Algorithm 3 Approximate MH Test
procedure Approx. MH
   Input: θT, θ∗, ε, µ0, XN, m
   Output: accept
   Initialize the estimated means ¯l ← 0 and ¯l2 ← 0
   n ← 0, done ← false
   Draw U ∼ Unif([0, 1])
   while not done do
      Draw a mini-batch X of size min(m, N − n) without replacement from XN and set XN ← XN \ X
      Update ¯l and ¯l2 using X, and set n ← n + |X|
      Compute δ ← 1 − φn−1(|(¯l − µ0)/s|)
      if δ < ε then
         accept ← true if µ0 < ¯l and false otherwise
         done ← true
      end if
   end while
end procedure

4.3 Slice Sampling

Unlike Metropolis-Hastings, the slice sampler does not require the selection of a proposal distribution, nor does it require any convexity properties, as some adaptive acceptance-rejection methods do. In practice, however, slice sampling is not entirely free of hyperparameter selection. In the univariate case, the slice sampler transitions from a point $(X, U)$ under the curve of $f_X$ to another point $(X', U')$ under the curve of $f_X$ in such a way that the stationary distribution of $(X, U)$ converges to a uniform distribution over the area under the curve of $f_X$ [8].

The pseudocode in Algorithm 4 outlines the 2D case. Many important details are left out, but a full implementation may be found in Figure 4.5.

Algorithm 4 2D Slice Sampler
1: procedure Slice sample
   Input: XT ∈ supp fX
2:     Draw U ∼ Unif([0, 1])
3:     Draw XT+1 ∼ Unif({x : fX(x) ≥ U · fX(XT)})
4: end procedure

The problem of drawing from the exact level sets of the distribution in step 3 can be intractable when $f_X$ is complex enough. We have adapted Neal's slice sampling algorithm from [11] and naively expand out from $X_T$ using an arbitrarily chosen step size until a suitable interval is found. If we were able to sample perfectly from the slice under the curve, there would be no rejected samples. The idea of learning or predicting these level sets is intriguing and, to my knowledge, has not been attempted.
Slice.Sample <- function(x0, f, nsample, step = 1) {
  x <- x0
  for (i in 2:nsample) {
    # Draw the auxiliary height uniformly under the curve at the current point.
    u <- runif(1, 0, f(x[i - 1]))
    # Step out from the current point until the interval brackets the slice.
    lower <- x[i - 1] - 1
    upper <- x[i - 1] + 1
    while (u < f(lower)) {
      lower <- lower - step
    }
    while (u < f(upper)) {
      upper <- upper + step
    }
    # Propose uniformly from the interval until a point on the slice is found.
    repeat {
      x.proposal <- runif(1, lower, upper)
      if (u < f(x.proposal)) {
        x[i] <- x.proposal
        break
      } else if (x.proposal < lower) {   # never reached: proposals lie in [lower, upper]
        lower <- x.proposal
      } else if (x.proposal > upper) {   # never reached: proposals lie in [lower, upper]
        upper <- x.proposal
      }
    }
  }
  return(x)
}

Figure 4.5: Naive implementation of the slice sampler

Example 4.2 We use the slice sampler to draw from a tri-modal mixture of normals defined in the target function below. The issue of finding correct level sets becomes apparent, as we might not expand our interval out far enough to jump modes.
target <- function(x) {
  return(0.25 * dnorm(x, -2, 0.3) +
         0.50 * dnorm(x,  0, 0.3) +
         0.25 * dnorm(x,  2, 0.3))
}

hist(Slice.Sample(1, target, 10000, 1), breaks = 100, prob = TRUE,
     ylim = c(0, 0.7), main = "Trimodal Mixture of Normal", xlab = "X")
x <- seq(-10, 10, length = 1000)
lines(x, target(x), col = "blue")

Figure 4.6: The result of slice sampling a trimodal normal distribution.

4.3.1 Auxiliary Variable MCMC

The slice sampler introduces an auxiliary variable, an approach we revisit with Hamiltonian Monte Carlo, that is marginalized out to produce the desired distribution. Using the Fundamental Theorem of Simulation, we are able to draw samples from $f_X$ by drawing samples uniformly under the curve of $f_X$. Let $Q$ be the area under the curve of $f_X$, so that a pair $(X, U) \sim \mathrm{Unif}(\{(x, u) : 0 < u < f_X(x)\})$ has density
\[ f_{(X,U)}(x, u) = \frac{1}{Q}\, \mathbf{1}(0 \le u \le f_X(x)). \]
This implies the marginal distribution of $X$ is
\[ \int f_{(X,U)}(x, u)\,du = \frac{1}{Q}\int_0^{f_X(x)} du = \frac{f_X(x)}{Q}. \]
As Algorithm 4 suggests, we alternate between sampling $X$ and $U$. To see that the general slice sampler preserves the uniform distribution over the area under the curve of $f_X$, note that if $X_T \sim f_X$ and $U_{T+1} \sim \mathrm{Unif}([0, f_X(X_T)])$ then
\[ (X_T, U_{T+1}) \sim f_X(X_T)\,\frac{\mathbf{1}(0 \le U_{T+1} \le f_X(X_T))}{f_X(X_T)} \propto \mathbf{1}(0 \le U_{T+1} \le f_X(X_T)). \]
If $X_{T+1} \sim \mathrm{Unif}(A_{T+1}) = \mathrm{Unif}(\{x : 0 \le U_{T+1} \le f_X(x)\})$ then
\[ (X_T, U_{T+1}, X_{T+1}) \sim f_X(X_T)\,\frac{\mathbf{1}(0 \le U_{T+1} \le f_X(X_T))}{f_X(X_T)}\,\frac{\mathbf{1}(0 \le U_{T+1} \le f_X(X_{T+1}))}{\mu(A_{T+1})}, \]
where $\mu(A_{T+1})$ denotes the Lebesgue measure of the set. Marginalizing out $X_T$ gives
\begin{align*}
f(U_{T+1}, X_{T+1}) &\propto \int \mathbf{1}(0 \le U_{T+1} \le f_X(x))\,\frac{\mathbf{1}(0 \le U_{T+1} \le f_X(X_{T+1}))}{\mu(A_{T+1})}\,dx \\
&= \frac{\mathbf{1}(0 \le U_{T+1} \le f_X(X_{T+1}))}{\mu(A_{T+1})} \int \mathbf{1}(0 \le U_{T+1} \le f_X(x))\,dx \\
&= \mathbf{1}(0 \le U_{T+1} \le f_X(X_{T+1})),
\end{align*}
so that if we begin with $X_T \sim f_X$ then the updates that generate $U_{T+1}$ and $X_{T+1}$ preserve the uniform distribution under the curve of $f_X$.

4.3.2 Uniform Ergodicity of the Slice Sampler

We now discuss the convergence properties of the slice sampler in the simple 2D case. In the ensuing calculations we denote by $\mu(\omega)$ the Lebesgue measure of the set $A_\omega = \{x : 0 \le \omega \le f_X(x)\}$. To gain insight into how the slice sampler behaves asymptotically, we look to the cdf of the transition kernel. More specifically, we look at the probability that $f_X(X_{T+1}) \le \eta$ given that we are currently at $X_T$ with $f_X(X_T) = \nu$:
\[ \Pr\big(f_X(X_{T+1}) \le \eta \mid f_X(X_T) = \nu\big) = \int\!\!\int \frac{\mathbf{1}(0 \le \omega \le \nu)}{\nu}\,\frac{\mathbf{1}(\omega \le f_X(x) \le \eta)}{\mu(\omega)}\,d\omega\,dx,
\]
where we first draw $\omega$ uniformly on $[0, \nu]$ and then draw $X_{T+1}$ uniformly on $A_\omega$. Simplifying further gives
\begin{align*}
\Pr\big(f_X(X_{T+1}) \le \eta \mid f_X(X_T) = \nu\big)
&= \frac{1}{\nu}\int \mathbf{1}(0 \le \omega \le \nu)\,\frac{\int \mathbf{1}(\omega \le f_X(x) \le \eta)\,dx}{\mu(\omega)}\,d\omega \\
&= \frac{1}{\nu}\int \mathbf{1}(0 \le \omega \le \nu)\cdot\frac{\mu(\omega) - \mu(\eta)}{\mu(\omega)}\,d\omega \\
&= \frac{1}{\nu}\int_0^{\min(\eta,\nu)} \frac{\mu(\omega) - \mu(\eta)}{\mu(\omega)}\,d\omega
 = \frac{1}{\nu}\int_0^{\nu} \max\Big(1 - \frac{\mu(\eta)}{\mu(\omega)},\, 0\Big)\,d\omega,
\end{align*}
which tells us the convergence properties of the slice sampler are totally dependent on the measure $\mu$. Now for the main result, which we owe to Tierney and Mira [10] who, under boundedness conditions, established the following lemma.

Lemma 4.3 If $f_X$ and $\mathrm{supp}\, f_X$ are bounded, the 2D slice sampler is uniformly ergodic.

Proof Without loss of generality, assume that $f_X$ is bounded by 1 and that $\mathrm{supp}\, f_X = [0, 1]$. To prove uniform ergodicity, we will show that $\mathrm{supp}\, f_X$ is a small set so that we may invoke Doeblin's condition. Let
\[ \xi(\nu) = \Pr\big(f_X(X_{T+1}) \le \eta \mid f_X(X_T) = \nu\big). \]
Notice that $\omega > \eta$ implies $\mu(\eta) \ge \mu(\omega)$, so the integrand $\max(1 - \mu(\eta)/\mu(\omega), 0)$ vanishes there. Further, when $\nu \ge \eta$,
\[ \xi(\nu) = \frac{1}{\nu}\int_0^{\eta} \Big(1 - \frac{\mu(\eta)}{\mu(\omega)}\Big)\,d\omega \]
is decreasing in $\nu$, since $\nu$ appears only in the denominator outside of the integral. When $\nu \le \eta$ we recognize
\[ \xi(\nu) = \frac{1}{\nu}\int_0^{\nu} \Big(1 - \frac{\mu(\eta)}{\mu(\omega)}\Big)\,d\omega \]
as the expected value of the function $1 - \mu(\eta)/\mu(\omega)$ where $\omega \sim \mathrm{Unif}([0, \nu])$. The larger $\omega$ is, the smaller $\mu(\omega)$ is, so the integrand is decreasing in $\omega$; its average over $[0, \nu]$ is therefore decreasing in $\nu$. Therefore $\xi(\nu)$ is decreasing in $\nu$ for all $\eta$. Intuitively, it would not make sense if $\xi(\nu)$ were increasing in $\nu$, because it would imply our Markov chain is not spending enough time in the modes. If $\xi(\nu)$ were increasing in $\nu$, then the larger $\nu$ the more likely we are to end up below some threshold (away from the
mode). For the proof to be complete, we must establish bounds on the cdf of the transition kernel. The minimum occurs when $\nu = 1$:
\[ \lim_{\nu \to 1} \xi(\nu) = \int_0^{\eta} \Big(1 - \frac{\mu(\eta)}{\mu(\omega)}\Big)\,d\omega, \]
which is bounded above by $\int_0^{\eta} 1\,d\omega = \eta$ and below by 0. The maximum is given by L'Hopital's rule:
\[ \lim_{\nu \to 0} \xi(\nu) = \lim_{\nu \to 0} \frac{\int_0^{\nu} \big(1 - \frac{\mu(\eta)}{\mu(\omega)}\big)\,d\omega}{\nu} = \lim_{\nu \to 0} \Big(1 - \frac{\mu(\eta)}{\mu(\nu)}\Big) = 1 - \mu(\eta). \]
$1 - \mu(\eta)$ is bounded above by 1 and below by 0 because the support is $[0, 1]$. Once we have found nondegenerate upper and lower bounds on the cdf of the transition kernel, it is not difficult to derive Doeblin's condition. The entire support of $f_X$ is thus a small set and uniform ergodicity follows.

This proof serves to remind us that rigorous results are not easy to come by in MCMC. We must work hard to ensure the methods we employ do indeed sample from the desired target distribution. We have thus introduced the slice sampler, given a rudimentary implementation of it, and discussed its convergence properties in the simple 2D case. Next, we cover the Gibbs sampler, which extends the slice sampler's idea of alternately sampling variables conditioned on one another.

4.4 Gibbs Sampling

In this section, we consider sampling from the multivariate distribution $f(\mathbf{x}) = f(X_1, \ldots, X_n)$. Each step of the Gibbs sampling algorithm replaces a single value, say $X_i$, with a value drawn from the distribution conditioned on everything but $X_i$, namely $f_{X_i}(X_i \mid \mathbf{x}_{-i})$, where $X_i$ denotes the $i$th component of the vector $\mathbf{x}$ and $\mathbf{x}_{-i}$ denotes the vector $\mathbf{x}$ without the $i$th component. The deterministic scan Gibbs sampler is expressed rather nicely in Algorithm 5. Each Gibbs step loops through $\mathbf{x}$ and replaces each component with a sample drawn from the correct conditional distribution using the most up-to-date values.

In the context of Metropolis-Hastings, $\mathbf{x}_{-i}$ remains unchanged when we draw $X_i$, so the proposal distribution is $f_{X_i}(X_i^* \mid \mathbf{x}_{-i})$. We also have that $\mathbf{x}^*_{-i} = \mathbf{x}_{-i}$ and $f(\mathbf{x}) = f_{X_i}(X_i \mid \mathbf{x}_{-i})\, f_{\mathbf{x}_{-i}}(\mathbf{x}_{-i})$, so the Metropolis-Hastings acceptance probability is
\[ A(\mathbf{x}^*, \mathbf{x}) = \frac{f_{X_i}(X_i^* \mid \mathbf{x}^*_{-i})\, f_{\mathbf{x}_{-i}}(\mathbf{x}^*_{-i})\, f_{X_i}(X_i \mid \mathbf{x}^*_{-i})}{f_{X_i}(X_i \mid \mathbf{x}_{-i})\, f_{\mathbf{x}_{-i}}(\mathbf{x}_{-i})\, f_{X_i}(X_i^* \mid \mathbf{x}_{-i})} = 1. \]
Algorithm 5 Gibbs Sampling
1: procedure Gibbs Step
   Input: x = (X1, . . . , Xn)
   Output: x∗
2:     Draw X∗1 ∼ fX1(X1 | X2, . . . , Xn)
3:     Draw X∗2 ∼ fX2(X2 | X∗1, X3, . . . , Xn)
4:     ...
5:     Draw X∗n ∼ fXn(Xn | X∗1, X∗2, . . . , X∗n−1)
6:     return x∗ ← (X∗1, . . . , X∗n)
7: end procedure

Thus, when dealing with high dimensional distributions, if we have access to the conditional distributions (which is often the case in Bayesian networks), the Gibbs sampler never rejects a proposal.

Example 4.4 Say we wish to draw points $(X, Y)$ where $X, Y \sim \mathrm{Exp}(\lambda)$. Below, we implement a deterministic scan Gibbs sampler that draws from a bounded 2D exponential distribution. We bound/truncate the points we draw for graphical simplicity.

Exp.Bounded <- function(rate, B) {
  repeat {
    x <- rexp(1, rate)
    if (x <= B) {
      return(x)
    }
  }
}

Gibbs.Sampler <- function(M, B) {
  mat <- matrix(ncol = 2, nrow = M)
  x <- 1; y <- 1
  mat[1, ] <- c(x, y)
  for (i in 2:M) {
    x <- Exp.Bounded(y, B)   # draw X | Y = y
    y <- Exp.Bounded(x, B)   # draw Y | X = x
    mat[i, ] <- c(x, y)
  }
  return(mat)
}

mat <- Gibbs.Sampler(1000, 10)
layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))
plot(mat, main = "Joint Distribution", xlab = expression("X"[1]),
     ylab = expression("X"[2]), ylim = c(0, 10), xlim = c(0, 10))
hist(mat[, 1], main = expression("Marginal dist. of X"[1]),
     xlab = expression("X"[1]), prob = TRUE, breaks = 30)
hist(mat[, 2], main = expression("Marginal dist. of X"[2]),
     xlab = expression("X"[2]), prob = TRUE, breaks = 30)

[Figure: output of the Gibbs sampler, showing the joint distribution of (X1, X2) and the marginal histograms "Marginal dist. of X1" and "Marginal dist. of X2".]

Example 4.5 Here we use a random scan Gibbs sampler to approximate the probability that a point drawn uniformly from the unit hypersphere in 6 dimensions is at least a distance of 0.9 from the origin. Our algorithm begins at the origin and then randomly chooses a coordinate to replace. Given $(X_1, \ldots, X_n)$ in the $n$-dimensional unit hypersphere, we choose a random coordinate to update (WLOG, say $X_1$) and sample it uniformly subject to
\begin{align*}
\|\mathbf{x}\| &\le 1 \\
X_1^2 + \ldots + X_n^2 &\le 1 \\
X_1^2 &\le 1 - (X_2^2 + \ldots + X_n^2) \\
|X_1| &\le \sqrt{1 - (X_2^2 + \ldots + X_n^2)}.
\end{align*}
But square roots are always positive, so we must also flip a fair coin to determine the sign. More explicitly,
\[ X_i \mid \mathbf{x}_{-i} \sim \mathrm{Unif}\Big(-\sqrt{1 - \textstyle\sum_{j \ne i} X_j^2},\ \sqrt{1 - \textstyle\sum_{j \ne i} X_j^2}\Big). \]

Euclidean.Norm <- function(x) {
  return(sqrt(sum(x ^ 2)))
}

Gibbs.Hypersphere.Conditional <- function(x) {
  # Draw the magnitude uniformly on [0, sqrt(1 - sum(x^2))], then flip a fair
  # coin for the sign; together this is uniform on the allowed interval.
  if (runif(1) <= 0.5) {
    return(-1 * runif(1, min = 0, max = sqrt(1 - sum(x ^ 2))))
  }
  return(runif(1, min = 0, max = sqrt(1 - sum(x ^ 2))))
}

Random.Scan.Gibbs.Hypersphere <- function(x = rep(0, 6)) {
  idx <- sample(1:6, 1)                              # pick a coordinate at random
  x[idx] <- Gibbs.Hypersphere.Conditional(x[-idx])   # resample it given the others
  return(x)
}

Hypersphere.MC <- function(steps = 100, f.sample) {
  x <- rep(0, 6)                 # start at the origin
  for (i in 1:(0.1 * steps)) {   # burn-in: discard the first 10% of updates
    x <- f.sample(x)
  }
  data <- matrix(0, ncol = length(x), nrow = steps)
  for (i in 1:steps) {
    x <- f.sample(x)
    data[i, ] <- x
  }
  return(data)
}

draws <- replicate(10, Hypersphere.MC(steps = 5000, Random.Scan.Gibbs.Hypersphere))
counts <- apply(draws, MARGIN = 3, FUN = apply, 1, Euclidean.Norm)
p <- mean(counts >= 0.9)
s <- sd(counts >= 0.9) / sqrt(length(counts))

We find the probability that a uniform point drawn from the unit hypersphere in 6 dimensions is at least 0.9 from the origin is 0.469 ± 0.002.
4.5 Hamiltonian Monte Carlo

Originally introduced in 1987 as Hybrid Monte Carlo [3], what we refer to as Hamiltonian Monte Carlo (HMC) combines Hamiltonian dynamics and the Metropolis algorithm to propose large changes in state (e.g. jumping from mode to mode in a single iteration) while maintaining a high acceptance probability. HMC interprets $\mathbf{x}$ as a position and introduces an auxiliary variable to simulate Hamiltonian mechanics on phase space. But first, we introduce the basic vocabulary of Hamiltonian dynamics.

4.5.1 Hamiltonian Dynamics

Hamiltonian dynamics is a reformulation of classical Newtonian mechanics in which a particle is described by a position vector $\mathbf{x}$ and a momentum vector $\mathbf{p}$. We associate with our position and momentum a total energy
\[ H(\mathbf{x}, \mathbf{p}) = U(\mathbf{x}) + K(\mathbf{p}), \]
called the Hamiltonian of our system. $H(\mathbf{x}, \mathbf{p})$ is the sum of the potential energy associated with $\mathbf{x}$ and the kinetic energy associated with $\mathbf{p}$. We often take the kinetic energy to be
\[ K(\mathbf{p}) = \frac{1}{2}\,\|\mathbf{p}\|_2^2, \]
which corresponds to simulating Hamiltonian dynamics on a Euclidean manifold. Exploring the effects of alternate kinetic energies is beyond the scope of this text; however, one can imagine simulating the dynamics on a Riemannian manifold instead. The choice of potential energy, we will see, depends on the target distribution we wish to sample from. Given a position and momentum, the system evolves according to Hamilton's equations:
\[ \frac{d\mathbf{p}}{dt} = -\frac{\partial H}{\partial \mathbf{x}} \quad\text{and}\quad \frac{d\mathbf{x}}{dt} = \frac{\partial H}{\partial \mathbf{p}}. \]
Energy is conserved under these dynamics, so a particle whose movement is governed by Hamiltonian dynamics travels along level sets of constant energy in the joint, or phase, space. Although $H$ remains invariant, the values of $\mathbf{x}$ and $\mathbf{p}$ change over time. By simulating the dynamics of a system over a finite time period, we are able to make large changes to $\mathbf{x}$ and avoid random walk behavior.

Example 4.6 (A One-Dimensional Example) Consider the simple case in which the Hamiltonian of our system is defined as follows:
\[ H(x, p) = U(x) + K(p), \qquad U(x) = \frac{x^2}{2}, \qquad K(p) = \frac{p^2}{2}. \]
The resulting dynamics evolve according to the equations
\[ \frac{dp}{dt} = -x, \qquad \frac{dx}{dt} = p. \]
The solutions to these equations have the following form, for some constants $r$ and $a$:
\[ x(t) = r\cos(a + t), \qquad p(t) = -r\sin(a + t), \]
which corresponds to a rotation by $t$ radians clockwise around the origin in the $(x, p)$ plane.

4.5.2 HMC

If we consider the joint distribution over states $(\mathbf{x}, \mathbf{p})$ with total energy $H(\mathbf{x}, \mathbf{p})$, i.e.
\[ P(\mathbf{x}, \mathbf{p}) \propto \exp(-H(\mathbf{x}, \mathbf{p})), \]
we realize that simply starting at some point $(\mathbf{x}_0, \mathbf{p}_0)$ and running the dynamics does not sample ergodically from $P$. To see this, notice that the dynamics only explore level sets of constant energy. All states in the set $\{(\mathbf{x}, \mathbf{p}) : H(\mathbf{x}, \mathbf{p}) \ne H(\mathbf{x}_0, \mathbf{p}_0)\}$ are unreachable. To construct an ergodic Markov chain, we need to perturb the value of $H$ while keeping $P$ invariant. Conceptually, we want to jump between level sets of constant energy to explore the space. Adding a Gibbs step where we draw $\mathbf{p} \sim P(\mathbf{p} \mid \mathbf{x})$ accomplishes just this.

Our job is made even simpler by the independence of $\mathbf{x}$ and $\mathbf{p}$, which follows from the factorization of $P$ as
\[ P(\mathbf{x}, \mathbf{p}) \propto \exp(-U(\mathbf{x}))\,\exp(-K(\mathbf{p})). \]
Marginalizing out $\mathbf{x}$ yields $P(\mathbf{p}) \propto \exp(-K(\mathbf{p}))$, which implies $\mathbf{p} \sim \exp(-\|\mathbf{p}\|_2^2/2)$, which we recognize as the (unnormalized) pdf of a standard normal random variable. Applying the same thinking to $\mathbf{x}$, we see that $U(\mathbf{x}) = -\log(f_X(\mathbf{x}))$ implies $\mathbf{x} \sim f_X$, giving $\mathbf{x}$ the desired marginal distribution.

An algorithm begins to emerge: starting at some point $(\mathbf{x}, \mathbf{p})$ in phase space, simulate Hamiltonian dynamics for a finite number of steps, and end in a new state $(\mathbf{x}^*, \mathbf{p}^*)$. The proposal is accepted with probability
\[ \min\big(1,\ \exp(H(\mathbf{x}, \mathbf{p}) - H(\mathbf{x}^*, \mathbf{p}^*))\big). \]
By the conservation of energy, we should always accept such proposals. Sometimes, errors in our numeric simulation of the dynamics prevent this from happening. In our experiments we used Radford Neal's code that appears in Chapter 5 of [2] and is available online at http://www.cs.utoronto.ca/~radford/ham-mcmc-simple.

Algorithm 6 Hamiltonian Monte Carlo Sampler
1: procedure HMC
   Input: x ∼ fX
2:     Draw p ∼ Norm(0, 1), U ∼ Unif([0, 1])
3:     Simulate Hamiltonian dynamics to get (x∗, p∗) ∼ P
4:     Compute acceptance probability Pa = min(1, exp(H(x, p) − H(x∗, p∗)))
5:     if U < Pa then return x∗
6:     else return x
7:     end if
8: end procedure
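To make step 3 of Algorithm 6 concrete, here is a minimal R sketch of an HMC update that simulates the dynamics with the standard leapfrog integrator, targeting the 1D mixture from Example 4.1. The step size, the number of leapfrog steps, and the helper names are illustrative assumptions; Neal's reference implementation cited above handles these choices more carefully.

# Unnormalized 1D target from Example 4.1 and the gradient of U(x) = -log f(x).
f.target <- function(x) 0.5 * dnorm(x, -2, 1) + 0.5 * dnorm(x, 2, 1)
grad.U <- function(x) {
  # dU/dx = -f'(x) / f(x) for the two-component normal mixture above.
  d.f <- -0.5 * (x + 2) * dnorm(x, -2, 1) - 0.5 * (x - 2) * dnorm(x, 2, 1)
  return(-d.f / f.target(x))
}
U <- function(x) -log(f.target(x))

HMC.Step <- function(x, eps = 0.2, L = 20) {
  p <- rnorm(1)                       # resample the momentum (the Gibbs step)
  x.new <- x; p.new <- p
  # Leapfrog integration of Hamilton's equations.
  p.new <- p.new - 0.5 * eps * grad.U(x.new)
  for (l in 1:L) {
    x.new <- x.new + eps * p.new
    if (l < L) p.new <- p.new - eps * grad.U(x.new)
  }
  p.new <- p.new - 0.5 * eps * grad.U(x.new)
  # Metropolis correction for numerical error in the simulated dynamics.
  H.old <- U(x) + 0.5 * p^2
  H.new <- U(x.new) + 0.5 * p.new^2
  if (runif(1) < exp(H.old - H.new)) return(x.new)
  return(x)
}

chain <- numeric(2000)
chain[1] <- 0
for (t in 2:2000) chain[t] <- HMC.Step(chain[t - 1])

Because the gradient pulls each trajectory toward regions of high probability, the chain tends to move between the modes at ±2 far more readily than the random-walk sampler sketched earlier, at the cost of L gradient evaluations per proposal.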
Example 4.7 Suppose we wish to sample from the 1D bimodal distribution from Example 4.1. Although we have touted the performance of HMC in high dimensions, we restrict ourselves to a 1D density so that the joint phase space may be visualized, as below. We begin somewhere in the space (Figure 4.7a), simulate Hamiltonian dynamics for some number of steps, and accept the proposal state (Figure 4.7b). We must be cautious in our choice of $L$ and $\epsilon$, as it is not difficult to imagine an instance where the simulated particle returns to its starting position after a finite number of iterations. Choosing the stepsize, $\epsilon$, at random before simulating the particle's path can prevent this type of behavior.

Figure 4.7: Contours of the joint phase space (position X vs. momentum P), showing the starting state (a) and the accepted proposal state (b).

4.6 Summary

Thus ends our exploration of Markov Chain Monte Carlo sampling methods. As you may have noticed, algorithms that sample from high dimensional distributions are seldom written once and used forever. Instead, they require attention to detail and a tested dedication to writing correct code. Even once the practitioner has chosen the algorithm most applicable to their setting, it may require days or weeks of tuning and testing hyperparameter combinations to achieve the desired convergence. However, the four approaches to MCMC presented in this work (random walk Metropolis-Hastings, auxiliary variables, Gibbs sampling, and Hamiltonian Monte Carlo) comprise the vast majority of the practitioner's toolbox.
Chapter 5

Conclusion

We began with the question of how to generate randomness and have concentrated largely on algorithms that do just that: spit out randomness. This merely scratches the surface of the work being done on Monte Carlo methods. We can now answer real questions faced by statisticians, economists, mathematicians, and nuclear physicists. We can theorize models based on our beliefs, collect data, and determine, through simulation, whether our observations are in line with our predictions or if they can be considered "extreme", "weird", or "outlying". Prerequisite to all of this is the ability to sample uniformly from the unit interval. We are reminded of the power of the Fundamental Theorem of Simulation and how, ultimately, all of our problems are reduced to sampling uniformly.
Bibliography

[1] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

[2] Steve Brooks. Handbook of Markov Chain Monte Carlo. CRC Press/Taylor & Francis, Boca Raton, 2011.

[3] Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

[4] Christopher DuBois, Anoop Korattikara, Max Welling, and Padhraic Smyth. Approximate Slice Sampling for Bayesian Posterior Inference. In Artificial Intelligence and Statistics, 2014.

[5] Stuart Geman and Donald Geman. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (6):721–741, 1984.

[6] W. Keith Hastings. Monte Carlo Sampling Methods Using Markov Chains and their Applications. Biometrika, 57(1):97–109, 1970.

[7] Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget. arXiv preprint arXiv:1304.5299, 2013.

[8] David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

[9] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[10] Antonietta Mira and Luke Tierney. Efficiency and Convergence Properties of Slice Samplers. Scandinavian Journal of Statistics, 29(1):1–12, 2002.

[11] Radford M. Neal. Slice Sampling. Annals of Statistics, pages 705–741, 2003.

[12] Christian Robert. Monte Carlo Statistical Methods. Springer, New York, 2004.
[13] Gareth O. Roberts and Jeffrey S. Rosenthal. On Convergence Rates of Gibbs Samplers for Uniform Distributions. The Annals of Applied Probability, 8(4):1291–1302, 1998.

[14] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach (3rd Edition). Prentice Hall, 2009.