Senior Thesis in Mathematics
Sampling from High Dimensional
Distributions
Author:
Bill DeRose
Advisor:
Dr. Gabe Chandler
Submitted to Pomona College in Partial Fulfillment
of the Degree of Bachelor of Arts
April 3, 2015
Contents
1 Introduction
1.1 Introduction
1.2 Related works
2 Random Variable Generation
2.1 Random Variable Generation
2.1.1 The Inverse Transform
2.1.2 Acceptance-Rejection Sampling
3 Markov Chains
3.1 Markov Chains
3.2 Properties of Markov Chains
3.2.1 Irreducibility
3.2.2 Aperiodicity
3.2.3 Stationarity
3.2.4 Ergodicity
4 Markov Chain Monte Carlo
4.1 Monte Carlo Methods
4.2 Metropolis Hastings
4.2.1 Bias-Variance Trade-off
4.2.2 Approximate Metropolis-Hastings
4.3 Slice Sampling
4.3.1 Auxiliary Variable MCMC
4.3.2 Uniform Ergodicity of the Slice Sampler
4.4 Gibbs Sampling
4.5 Hamiltonian Monte Carlo
4.5.1 Hamiltonian Dynamics
4.5.2 HMC
4.6 Summary
5 Conclusion
Chapter 1
Introduction
1.1 Introduction
The challenges we face in computational statistics are due to the incredible
advance of technology in the past 100 years. In a world where human suffering
is the daily reality of so many, we should be so lucky to wrestle with the problems
of algorithm design and implementation. Despite the advances in statistics over
the past century, random number generation remains an active field of research.
From machine learning and artificial intelligence, to the simulation of protein
formation, the ability to draw from probability distributions has wide ranging
applications.
But what exactly is a random number, and what is randomness? More
importantly, how can an algorithm take a finite number of deterministic steps
to produce something random? Often, humans delude themselves into seeing
randomness where there is none – they detect a signal in the noise.
Figure 1.1: Which is random?
The image on the left of Figure 1.1 depicts genuine randomness. The points
on the right are too evenly spaced for it to be truly random. In actuality, each
point on the left represents the location of a star in our galaxy while the points
on the right represent the location of glowworms on the ceiling of a cave in
New Zealand. The glowworms spread themselves out to reduce competition for
food amongst themselves. The seemingly uniform distribution is the result of a
non-random force.
So how do we go about generating images like those on the left? We begin
with a little cheat and assume the existence of a random number generator that
allows us to sample U ∼ Uniform([0, 1]). Though we will not discuss methods
for drawing uniformly from the unit interval, their importance to us cannot be overstated.
In practice, exact inference is often either impossible (e.g. provably non-
integrable functions) or intractable (e.g. high dimensional integration) and we
must turn to approximations. This work explores Monte Carlo methods as one
approach to numerical approximation.
Example 1.1 (Numeric Integration) We wish to evaluate an integral $Q = \int_a^b f(x)\, dx$. From calculus, we know
\[ f_{avg} = \frac{Q}{b - a} \Rightarrow Q = (b - a) f_{avg}. \]
By the LLN, we can choose $X_1, \ldots, X_n$ uniformly in [a, b] to approximate
\[ f_{avg} \approx \frac{1}{n} \sum_{i=1}^{n} f(X_i) \Rightarrow Q \approx \frac{b - a}{n} \sum_{i=1}^{n} f(X_i). \]
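As a quick sketch, this approximation takes only a few lines of R; the integrand and the interval below are arbitrary choices made for illustration.
# Monte Carlo estimate of Q = integral of f over [a, b], following Example 1.1.
f <- function(x) exp(-x ^ 2)    # an arbitrary integrand chosen for illustration
a <- 0; b <- 2; n <- 100000
x <- runif(n, a, b)             # X_1, ..., X_n drawn uniformly on [a, b]
Q.hat <- (b - a) * mean(f(x))   # Q is approximately (b - a) times the average of f(X_i)
Q.hat                           # compare with integrate(f, a, b)$value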
1.2 Related works
Many of the algorithms we cover are versions of the Metropolis algorithm which
first appeared in [9] and was eventually generalized by Hastings in [6]. Though
the naming of the algorithm has been contended (Metropolis merely oversaw
the research), we refer to the algorithm as Metropolis-Hastings for historical
reasons. Regardless of naming conventions, we are indebted to Arianna W.
Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller for
their work on the original paper outlining the Metropolis algorithm.
The term “Monte Carlo” was coined because some of the first applications
were to card games, like those in the Monte Carlo Casino in Monaco. A Monte
Carlo algorithm is simply an algorithm whose output is random. Such Monte
Carlo simulations were important to the development of the Manhattan project
(after all, Metropolis worked at Los Alamos during WWII) and remain an im-
portant tool of modern statistical physics.
Though the Gibbs sampler was introduced by brothers Stuart and Donald
Geman in 1984 [5], these sorts of numerical sampling techniques did not enter the mainstream until the early 1990s, arguably because of the advent of the personal computer and wider access to computational power. The Gibbs
sampler appeared nearly two decades before Neal’s slice sampler [11], though
we cover them in reverse chronological order because the latter provides a nice
motivation for the former. We draw heavily from [10], [13], [12] in proving the
uniform ergodicity of the 2D slice sampler. Neal’s work appears again in the
Hamiltonian Monte Carlo [2] which uses gradient information to explore state
space more efficiently.
The contemporary results presented are mostly due to the proliferation of
big data which, in the Bayesian setting, necessitates the ability to sample from a
posterior distribution with billions of data points. Sequential hypothesis testing
allows us to reduce some of the overhead required for Metropolis-Hastings [7].
As a general introduction to each of these sampling techniques, [1] has proven
invaluable.
Chapter 2
Random Variable
Generation
2.1 Random Variable Generation
Assuming we may draw U ∼ Unif([0, 1]), what other distributions can we gen-
erate from? It turns out we can draw X ∼ Bernoulli(p) by letting X ← 1(U <
p). If $X_1, \ldots, X_n \overset{i.i.d.}{\sim} \mathrm{Bern}(p)$, then $Y = \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, p)$. However, a
more general approach called the inverse transform allows us to draw from any
1D density whose closed form cdf we may write down.
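A minimal R sketch of this construction, relying on nothing beyond runif:
# Bernoulli(p) and Binomial(n, p) draws built only from Unif([0, 1]) variates.
p <- 0.3; n <- 20
x <- as.integer(runif(n) < p)   # X_i = 1(U_i < p), i.i.d. Bernoulli(p)
y <- sum(x)                     # Y = sum of the X_i, a Binomial(n, p) draw
y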
2.1.1 The Inverse Transform
Definition 2.1 Suppose X is a random variable with probability distribution
function (pdf) fX. We denote by FX the cumulative distribution function (cdf)
where
FX(a) = Pr(X ≤ a).
Cumulative distribution functions are nonnegative, increasing, right-continuous functions with $\lim_{a \to -\infty} F_X(a) = 0$ and $\lim_{a \to \infty} F_X(a) = 1$.
Definition 2.2 For an increasing function F on R, the pseudoinverse of F, denoted $F^{-1}_p$, is the function such that
\[ F^{-1}_p(u) = \inf\{a : F(a) \geq u\}. \]
If F is strictly increasing then $F^{-1}_p \equiv F^{-1}$.
With these definitions in hand, we have the tools to generate random variates
from any distribution with a computable generalized inverse.
Lemma 2.3 Let F(a) be a cdf and U ∼ Unif([0, 1]). If $X = F^{-1}_p(U)$ then X has cdf F(a).
Proof Let F, U, and X be as given in the lemma. Then
\[ \Pr(X \leq a) = \Pr(F^{-1}_p(U) \leq a) = \Pr(U \leq F(a)) = F(a), \]
where the second equality follows from the fact that F is increasing.
We now see the importance of being able to draw uniformly from the unit
interval. In fact if U is not truly uniform on [0, 1], then the inverse transform
method fails to sample from the correct distribution. However, to use the inverse
transform we must explicitly write down the cumulative distribution function
and efficiently compute its generalized inverse. As we will see in Example 2.6,
this is not always possible.
Example 2.4 We wish to draw X ∼ Exp(λ) using the inverse transform method. The cdf of an exponential random variable is given by
\[ F_X(a) = 1 - \exp(-\lambda a). \]
Solving for the inverse yields
\begin{align*}
U &= 1 - \exp(-\lambda a) \\
\log(1 - U) &= -\lambda a \\
-\lambda^{-1} \log(1 - U) &= a
\end{align*}
So that if U ∼ Unif([0, 1]) then $-\lambda^{-1} \log(1 - U) = X \sim \mathrm{Exp}(\lambda)$.
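A short R sketch of this transform (the rate λ = 2 is an arbitrary choice) can be checked against R's built-in rexp:
# Exponential draws via the inverse transform.
lambda <- 2
u <- runif(10000)
x <- -log(1 - u) / lambda    # the pseudoinverse of the exponential cdf applied to U
c(mean(x), 1 / lambda)       # the sample mean should be close to 1 / lambda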
Example 2.5 Recall that the pdf of a Cauchy random variable, X, is
\[ f_X(s) = \frac{1}{\pi(1 + s^2)}. \]
Given U ∼ Unif([0, 1]), we find a transformation Y = r(U) such that Y has a Cauchy distribution. We begin by finding the cdf of X:
\begin{align*}
F_X(a) &= \int_{-\infty}^{a} \frac{1}{\pi(1 + s^2)}\, ds \\
&= \frac{1}{\pi} \left[ \tan^{-1}(s) \right]_{-\infty}^{a} \\
&= \frac{1}{\pi} \left( \tan^{-1}(a) - \lim_{n \to -\infty} \tan^{-1}(n) \right) \\
&= \frac{1}{\pi} \left( \tan^{-1}(a) + \frac{\pi}{2} \right) \\
&= \frac{\tan^{-1}(a)}{\pi} + \frac{1}{2}
\end{align*}
To compute the desired transformation, we have:
\begin{align*}
U &= \frac{\tan^{-1}(a)}{\pi} + \frac{1}{2} \\
\pi\left(U - \tfrac{1}{2}\right) &= \tan^{-1}(a) \\
\tan\left(\pi\left(U - \tfrac{1}{2}\right)\right) &= a
\end{align*}
So that $Y = r(U) = \tan\left(\pi\left(U - \tfrac{1}{2}\right)\right) \sim \mathrm{Cauchy}$.
cauchies <- tan(pi * (runif(10000) - 0.5))
hist(cauchies[abs(cauchies) <= 500], prob = TRUE,
breaks = 2000, xlim = c(-20, 20), ylim = c(0, 0.35),
main = "Cauchy r.v. using ITF", xlab ="X")
lines(seq(-20, 20, 0.2), dcauchy(seq(-20, 20, 0.2)),
col = "blue")
Example 2.6 Even in 1D, there exist densities whose cdf we cannot write
down. For example, the cumulative distribution function of the standard normal
distribution cannot be expressed in a closed form:
\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp(-z^2/2)\, dz \]
Clearly, we must develop other methods that do not rely as strongly on nice
analytic properties of our target distribution.
2.1.2 Acceptance-Rejection Sampling
Much of this section stems from the idea that if fX is the target distribution, we may write
\[ f_X(x) = \int_0^{f_X(x)} 1\, du. \]
Here, fX appears as the marginal density of the joint distribution
(X, U) ∼ Unif({(x, u) : 0 < u < fX(x)}).
Introducing the auxiliary variable U allows us to sample from our target
distribution by drawing uniformly from the area under the curve of fX and
ignoring the auxiliary coordinate.
Theorem 2.7 (The Fundamental Theorem of Simulation) Simulating X ∼
fX is equivalent to simulating
(X, U) ∼ Unif({(x, u) : 0 < u < fX(x)}).
Actually sampling from the joint distribution of (X, U) introduces difficulty,
though, because sampling X ∼ fX and U|X ∼ Unif([0, fX(X)]) defeats the
purpose of introducing the auxiliary variable. If we could sample X ∼ fX in
the first place, we would already be done.
The solution is to generate pairs (X, U) from a superset and accept them if
they satisfy the constraint. For instance, suppose the 1D density fX is bounded
by m and the support of fX, denoted supp fX, is [c, d]. Sampling pairs
(X, U) ∼ Unif({(x, u) : 0 ≤ u ≤ fX(x)})
is equivalent to simulating X ∼ Unif([c, d]), U|X ∼ Unif([0, m]), and accepting
the pair if 0 < U < fX(X). It is easily shown that this does indeed sample from
the desired distribution:
\begin{align*}
\Pr(X \leq a) &= \Pr(X \leq a \mid U \leq f_X(X)) \\
&= \frac{\Pr(X \leq a,\ U \leq f_X(X))}{\Pr(U \leq f_X(X))} \\
&= \frac{\int_c^a \int_0^{f_X(z)} \frac{1}{d - c} \cdot \frac{1}{m}\, du\, dz}{\int_c^d \int_0^{f_X(z)} \frac{1}{d - c} \cdot \frac{1}{m}\, du\, dz} \\
&= \frac{\int_c^a \int_0^{f_X(z)} du\, dz}{\int_c^d \int_0^{f_X(z)} du\, dz} \\
&= \frac{\int_c^a f_X(z)\, dz}{\int_c^d f_X(z)\, dz} \\
&= \int_c^a f_X(z)\, dz \\
&= F_X(a)
\end{align*}
This computation was made easier by the fact that both fX and supp fX
were bounded. In situations where this is not the case, we can no longer use a
rectangle as the superset from which we draw candidates. Instead, we use some
other probability distribution g(x) that may be readily sampled from. Such a
distribution is called a proposal distribution and must satisfy
M · g(x) ≥ fX(x), M ≥ 1, ∀x ∈ supp fX.
We formalize this notion in the following theorem.
Theorem 2.8 (The Acceptance-Rejection Theorem) Let g be a probabil-
ity distribution that satisfies
M · g(x) ≥ fX(x)
for some M ≥ 1 and for all x ∈ supp fX. Then, to simulate X ∼ fX, it is
sufficient to simulate
Y ∼ g and U|Y ∼ Unif([0, M · g(Y )])
and let X ← Y if U ≤ fX(Y ).
Proof Sampling Y ∼ g, U|Y ∼ Unif([0, M · g(Y )]), and letting X ← Y if
U ≤ fX(Y ) generates X ∼ fX:
\begin{align*}
\Pr(X \in A) &= \Pr(Y \in A \mid U \leq f_X(Y)) \\
&= \frac{\int_A \int_0^{f_X(z)} g(z)\, \frac{1}{M g(z)}\, du\, dz}{\int_{\mathrm{supp}\, f_X} \int_0^{f_X(z)} g(z)\, \frac{1}{M g(z)}\, du\, dz} \\
&= \frac{\int_A f_X(z)\, dz}{\int_{\mathrm{supp}\, f_X} f_X(z)\, dz} \\
&= \int_A f_X(z)\, dz
\end{align*}
The proposals used in acceptance-rejection sampling come from g(Y) and are accepted with probability $\frac{f_X(Y)}{M \cdot g(Y)}$, so the probability we accept any given proposal is then
\[ \Pr(\text{accept}) = \int g(y)\, \frac{f_X(y)}{M \cdot g(y)}\, dy = \frac{1}{M} \int f_X(y)\, dy = \frac{1}{M}. \]
The larger M is, the more points we must reject before accepting a proposal. For efficiency's sake, we want $M = \sup_x \frac{f_X(x)}{g(x)}$ to ensure the highest possible acceptance rate. This leads directly to the Acceptance-Rejection algorithm, which is a realization of Theorem 2.8:
Algorithm 1 AR Sampling
1: procedure Acceptance-Rejection
2: Draw Y ∼ g, U ∼ Unif([0, M · g(Y )])
3: Let X ← Y if U ≤ fX(Y ), else return to 2.
4: end procedure
Example 2.9 Given Y ∼ Cauchy we use acceptance-rejection to generate X ∼ Exp(1/2). To use AR with a proposal distribution g(x), we must ensure $M \cdot g(x) \geq f_X(x) \Rightarrow M \geq \frac{f_X(x)}{g(x)}$ for all x ∈ supp fX. Ideally, M is as close to 1 as possible:
\[ M \geq \sup_{x \geq 0} \frac{f_X(x)}{g(x)} \approx 3.629 \]
We confine our maximization to the positive reals because the target distribution only has support on the positive reals. The maximum is attained at $x = 2 + \sqrt{3}$. Using M = 3.629 yields
Draw.AR <- function() {
repeat {
proposal <- rcauchy(1)
u <- runif(1, 0, 3.629 * dcauchy(proposal))
if (u <= dexp(proposal, rate = 1 / 2)) {
return(proposal)
}
}
}
x <- seq(0, 15, 0.1)
hist(replicate(10000, Draw.AR()), breaks = 100, prob = TRUE,
xlab = "X", main = "Exp(1/2) using AR")
lines(x, dexp(x, rate = 1 / 2), col = "blue")
Chapter 3
Markov Chains
3.1 Markov Chains
Definition 3.1 A sequence of random variables X1, . . . , Xn, denoted (Xn), is
a Markov chain if
Pr(Xn+1|Xn, Xn−1, . . . , X1) = Pr(Xn+1|Xn) (3.1)
Example 3.2 A random walk is a Markov chain that satisfies
\[ X_{n+1} = X_n + \epsilon_n \]
where $\epsilon_n$ is generated independently of the current state. If the distribution of $\epsilon_n$ is symmetric about 0, we call this a symmetric random walk. In section 4.2 we will see how random walks are used in MCMC algorithms.
Every Markov chain has an initial distribution, π0, and a transition kernel K.
The state space, denoted X, is the set of possible values Xi may take on at each
step in the Markov chain.
Definition 3.3 A transition kernel is a function K defined on X × B(X) such
that
• ∀x ∈ X, K(x, ·) is a probability measure;
• ∀A ∈ B(X), K(·, A) is measurable.
where B denotes the σ-algebra defined on the set X.
When the state space is discrete, the transition kernel is a matrix K where
Kij = Pr(Xn+1 = Xj|Xn = Xi).
In the continuous case, the transition kernel denotes a conditional density where
\[ \Pr(X_{n+1} \in A \mid X_n = x) = \int_A K(x, x')\, dx'. \]
A Markov chain is said to be time homogeneous if $K(X_{n+1} \mid X_n)$ is independent of n.
We restrict our study almost entirely to time homogeneous Markov chains.
An example of a time heterogeneous Markov chain is the simulated annealing
algorithm, whose transition kernel changes with the “temperature” of the sys-
tem. Time heterogeneity is a key property of simulated annealing because it
allows us to explore the entire state space when the temperature is high, but
restricts our moves when the temperature is low. The algorithm is inspired by
annealing in metallurgy where the process is used to temper or harden metals
and glass by heating them to a high temperature and gradually cooling them,
allowing the material to reach a low-energy crystalline state [14].
Given a transition matrix for a discrete Markov chain and an initial distri-
bution π0, the distribution of X1 is obtained by matrix multiplication
π1 = π0K.
Similarly, $X_n \sim \pi_n = \pi_0 K^n$. Notice that once the initial state is specified, the
behavior of the chain is entirely dependent on K.
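As a small illustration, the following R sketch propagates an initial distribution through a made-up two-state transition matrix; the entries of K below are arbitrary.
# Distribution of X_n for a toy two-state chain: pi_n = pi_0 K^n.
K <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)   # an arbitrary transition matrix
pi0 <- c(1, 0)                                     # start in state 1 with certainty
pin <- pi0
for (i in 1:50) pin <- pin %*% K                   # repeated multiplication gives pi_50
pin                                                # approaches the stationary distribution (2/3, 1/3)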
Definition 3.4 Consider A ∈ B(X). The first n for which the chain enters the
set A is denoted by
τA = inf{n ≥ 1 : Xn ∈ A}
and is called the stopping time at A. By convention, τA = ∞ if Xn ∉ A for every n. Associated with the set A, we also define
\[ \eta_A = \sum_{n=1}^{\infty} \mathbf{1}(X_n \in A), \]
the number of times the chain enters A.
Example 3.5 In a zero-sum coin tossing game, the payoff to player b is +1 if
a heads appears and −1 if a tails appears. Similarly, the payoff to player c is +1
if a tails appears and −1 if a heads appears. Let Xn be the sum of the gains of
player b after n rounds of the game. The infinite dimensional transition matrix,
K, has zeros on the diagonal since player b must either lose or gain a point on
each round. Furthermore, K has upper and lower sub-diagonals equal to 1/2
because we are flipping a fair coin. Assuming that player b begins with B dollars and player c begins with C dollars,
\[ \tau_1 = \inf\{n : X_n \leq -B\} \quad \text{and} \quad \tau_2 = \inf\{n : X_n \geq C\} \]
represent, respectively, the ruins of players b and c. The probability of bankruptcy for player b is then Pr(τ1 < τ2).
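A brief simulation sketch of this game (with B and C chosen arbitrarily) estimates the ruin probability for player b; for a fair coin the answer is C/(B + C).
# Estimate Pr(player b is ruined first) by simulating the coin-tossing chain.
Ruin.Game <- function(B, C) {
  x <- 0
  repeat {
    x <- x + sample(c(-1, 1), 1)   # one round: player b gains or loses 1 with probability 1/2
    if (x <= -B) return(TRUE)      # tau_1 reached first: player b is ruined
    if (x >= C) return(FALSE)      # tau_2 reached first: player c is ruined
  }
}
mean(replicate(10000, Ruin.Game(B = 5, C = 10)))   # should be near C / (B + C) = 2/3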
3.2 Properties of Markov Chains
3.2.1 Irreducibility
Irreducibility is an important property of Markov chains which guarantees that
regardless of the current state of the chain, it is possible to reach any other state
in a finite number of transitions. In the discrete case, irreducibility also tells
us the transition matrix cannot be broken down into smaller matrices (i.e. the
transition graph is connected).
Definition 3.6 Given a measure φ, a Markov chain with transition kernel K(·)
is φ-irreducible if for every A ∈ B(X) such that φ(A) > 0, Pr(τA < ∞) > 0
regardless of the initial state.
Irreducibility together with aperiodicity, a property introduced in the following
subsection, allow us to make strong analytic arguments about the convergence
of Markov chains.
3.2.2 Aperiodicity
We define the period of a state x ∈ X to be
\[ d(x) = \gcd\{m \geq 1 : K^m(x, x) > 0\}. \]
If d(x) ≥ 2, we say x is periodic with period d(x). A state is aperiodic if it has period 1. An irreducible chain is aperiodic if each state has period 1.
Example 3.7 A Markov chain with period n is given by the block matrix
\[ P = \begin{pmatrix} 0 & P_1 & 0 & \cdots & 0 \\ 0 & 0 & P_2 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & P_{n-1} \\ P_n & 0 & 0 & \cdots & 0 \end{pmatrix} \]
where each Pi is a stochastic matrix and P is irreducible.
3.2.3 Stationarity
Definition 3.8 A Markov chain (Xn) has stationary distribution π if Xn ∼
π ⇒ Xn+1 ∼ π.
For MCMC methods to be of any use to us, we must be able to reason about
the asymptotic behavior of Markov chains. The distribution of Xn as n → ∞
is called the limiting distribution. Ideally, we would like some guarantee that,
regardless of initial conditions, the limiting distribution of a Markov chain is
also its stationary distribution.
The general approach with MCMC algorithms is to initialize and run a
Markov chain for a sufficient number of steps to draw samples approximately
from the desired stationary distribution. It is common to ignore some number of samples at the beginning (the burn-in), and then consider only every nth sample (thinning, for approximate independence) when computing an expectation.
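In R terms, with a vector of autocorrelated draws standing in for MCMC output, these two conventions look like the following; the burn-in length and thinning interval are arbitrary choices for the sketch.
# Discard an initial burn-in and keep every 10th draw before averaging.
chain <- as.numeric(arima.sim(model = list(ar = 0.9), n = 20000))  # stand-in for correlated MCMC output
kept <- chain[-(1:5000)]                  # drop the burn-in
kept <- kept[seq(1, length(kept), 10)]    # thin: keep every 10th sample
mean(kept)                                # use the retained draws for expectations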
3.2.4 Ergodicity
When exactly do we know when the limiting distribution of a Markov chain is
the stationary distribution? The Ergodic Theorem tells us just this.
Theorem 3.9 (The Ergodic Theorem) Let (Xn) be a Markov chain with
stationary distribution π. If the chain is φ-irreducible and aperiodic, then for
all measurable sets A, limn→∞ Pr(Xn ∈ A) = π(A).
That is, the limiting distribution of irreducible, aperiodic Markov chains is always the stationary distribution. An even stronger guarantee of
convergence exists, but to get there we must introduce more terminology.
Definition 3.10 The Markov chain (Xn) has an atom α ∈ B(X) if there exists
an associated non-zero measure µ such that
K(x, A) = µ(A), ∀x ∈ α, ∀A ∈ B(X).
The definition of a small set follows naturally and will be used in our definition of one of the strongest forms of convergence, uniform ergodicity.
Definition 3.11 A set C is small if there exists an m > 0 and a nonzero measure µm such that
\[ K^m(x, A) \geq \mu_m(A) \]
for all x ∈ C and for all A ∈ B(X).
Definition 3.12 The Markov chain (Xn) is uniformly ergodic if
\[ \lim_{n \to \infty} \sup_{x \in \mathcal{X}} \| P^n(x, \cdot) - \pi \|_{TV} = 0, \]
where $\| \cdot \|_{TV}$ denotes the total variation norm.
In showing uniform ergodicity, we will make use of the following theorem.
Theorem 3.13 (Doeblin’s Condition) The following are equivalent:
(a) (Xn) is uniformly ergodic;
(b) there exist R < ∞ and r > 1 such that
\[ \| P^n(x, \cdot) - \pi \|_{TV} < R r^{-n}, \quad \forall x \in \mathcal{X}; \]
(c) (Xn) is aperiodic and X is a small set;
(d) (Xn) is aperiodic and there exist a small set C and a real κ > 1 such that
\[ \sup_{x \in \mathcal{X}} \mathbb{E}_x[\kappa^{\tau_C}] < \infty. \]
If the whole space X is small, there exists a probability distribution φ on X, constants ε, δ > 0, and n such that, if φ(A) > ε, then
\[ \inf_{x \in \mathcal{X}} K^n(x, A) > \delta, \quad \forall A \in \mathcal{B}(\mathcal{X}). \]
We see here the relation between analytic limits and uniform ergodicity, giving
us a feel for just how strong the guarantee of convergence is. Now that we have
covered enough of the basic vocabulary of Markov chains we may begin our
survey of MCMC sampling algorithms.
Chapter 4
Markov Chain Monte Carlo
4.1 Monte Carlo Methods
Although the sampling techniques discussed in Chapter 2 work well, they are
not flawless. The inverse transform method fails beyond 1-dimension and even
then it requires us to write down the closed form cdf of the target distribution.
Acceptance-rejection can be used in any dimension we like, but as dimensionality
increases it becomes more difficult to find good proposal distributions. We
turn now to Markov Chain Monte Carlo (MCMC) simulations because they
ameliorate many of these issues.
Monte Carlo simulations allow us to approximate the probability of certain
outcomes by running a large number of trials to obtain an empirical distribu-
tion of possible events. Markov Chain Monte Carlo simulations use Markov
chains whose stationary distribution is the target distribution we wish to sam-
ple from. The oldest MCMC algorithm, and the one we choose to cover first, is
the Metropolis-Hastings algorithm.
4.2 Metropolis Hastings
At this point, we may readily sample from most distributions covered in an
introductory probability course. However, when faced with the task of drawing
from a non-standard distribution, we will need more powerful tools at our dis-
posal. For instance, in Bayesian statistics we would often like to sample from
the posterior distribution of a parameter to compute its expected value.
At a high level, Metropolis-Hastings samples from a target distribution fX
by drawing from a proposal distribution g (“easy” to sample) and accepting if
it looks like it came from fX (“hard” to sample). At step T in the algorithm,
in which the current state is XT, we draw a candidate/proposal $X^* \sim g(X \mid X_T)$ and let $X_{T+1} = X^*$ with probability
\[ A(X^*, X_T) = \min\left(1, \frac{f_X(X^*)\, g(X_T \mid X^*)}{f_X(X_T)\, g(X^* \mid X_T)}\right). \]
Otherwise, let $X_{T+1} = X_T$.
We notice two things about the acceptance probability. First, the Metropolis-
Hastings algorithm only requires we know fX up to a normalizing constant.
Second, if g is symmetric the acceptance probability becomes
\[ A(X^*, X_T) = \min\left(1, \frac{f_X(X^*)}{f_X(X_T)}\right), \]
which implies we always accept a candidate that is more probable, and accept candidates randomly otherwise. The acceptance probability combines concepts from steepest ascent and random walk algorithms which help prevent getting stuck in local maxima. Following Algorithm 2 ensures the stationary distribution of the Markov chain is fX.
Algorithm 2 MH Sampling
1: procedure Metropolis-Hastings Input: Current state: XT ∼ fX
2: Draw X∗ ∼ g(X|XT), U ∼ Unif([0, 1])
3: Compute acceptance probability Pa = A(X∗, XT)
4: If U < Pa set XT+1 ← X∗, otherwise set XT+1 ← XT
5: end procedure
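A minimal R realization of Algorithm 2 for the symmetric (random-walk) case is sketched below; the Gaussian proposal, its standard deviation, and the bimodal test density are our own illustrative choices.
# Random-walk Metropolis: with a symmetric proposal, A = min(1, f(x*) / f(x)).
RW.Metropolis <- function(f, x0, nsample, prop.sd = 1) {
  x <- numeric(nsample)
  x[1] <- x0
  for (t in 2:nsample) {
    x.star <- rnorm(1, mean = x[t - 1], sd = prop.sd)   # proposal centered on the current state
    a <- min(1, f(x.star) / f(x[t - 1]))                # acceptance probability
    x[t] <- if (runif(1) < a) x.star else x[t - 1]      # accept or stay put
  }
  return(x)
}
# Example: a bimodal mixture like the one in Example 4.1.
bimodal <- function(x) 0.5 * dnorm(x, -2, 1) + 0.5 * dnorm(x, 2, 1)
draws <- RW.Metropolis(bimodal, x0 = 0, nsample = 2000, prop.sd = 1)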
Make no mistake, Metropolis-Hastings is no free lunch. The proposal distri-
bution must be chosen carefully and presents difficulties in higher dimensions
where our intuition and imagination fail us. This is especially the case when
using a non-symmetric proposal distribution. For this reason, we restrict our
study of Metropolis-Hastings solely to the symmetric, random walk case. A
common (symmetric) proposal distribution is a Gaussian centered on the cur-
rent state. It is also typical for the proposal distribution’s variance to be chosen
to be on the same order of magnitude as the smallest variance of the target
distribution.
Figure 4.1: Contours of a bivariate normal target distribution (red) and sym-
metric proposal distribution with standard deviation ρ (blue).
Consider Figure 4.1, where the 2D target distribution exhibits a strong cor-
relation between components. To achieve a high acceptance ratio, the stan-
dard deviation of the proposal distribution must be kept on the same order of
magnitude as σmin. Otherwise, our proposals will be from all over the space
and we would rarely accept any move. The random walk behavior also means that to explore the length of the distribution, which is roughly σmax/σmin proposal steps across, it takes on the order of $(\sigma_{max}/\sigma_{min})^2$ steps, because the distance traveled by a random walk in n steps is proportional to $\sqrt{n}$. If our target distribution is pinched in one dimension and elongated in another, the Metropolis-Hastings algorithm offers poor convergence properties.
Example 4.1 Suppose we wish to sample from the 2-dimensional mixture of
normals whose contours are shown in Figure 4.2 (bottom), alongside its 1-
dimensional analogue (top). Figure 4.3 shows that in 2-dimensions, the first
coordinate of the points sampled using a standard Metropolis algorithm appear
to mix well early, but clearly display difficulty jumping between modes. Figure
4.4a and Figure 4.4b suggest that in 5-,10-, and higher dimensions the problem
is only exacerbated.
Figure 4.2: The one-dimensional analogue (top) of the 2D target distribution (bottom) (µ1 = −2, µ2 = 2, σ1 = σ2 = 1).
Figure 4.3: Mixing of the first coordinate, X1, from a 2D Metropolis sample.
Figure 4.4: First coordinate of points sampled from a Metropolis random walk in 5D (a) and 10D (b).
4.2.1 Bias-Variance Trade-off
Implicit in our handling of MCMC lies the desire for unbiased draws from some stationary distribution, π. In many practical applications, it is too computationally intensive to draw enough samples to estimate a parameter, $\hat{\theta}$, or the expectation of a function, E[f(X)], with sufficiently low variance. If we allow for some bias in our draws from the stationary distribution, the task of simulation is made easier.
The mean square error in our estimate is a measure of both bias and variance, $MSE = B^2 + V$. When drawing from a posterior density over billions of data points, unbiased Markov chains incur significant computational costs. As a result, the variance of these approximations is high because we can only collect small samples in a fixed amount of time.
Alternatively, we can simulate from a slightly biased stationary distribution $\pi_\epsilon$, where $\epsilon$ is a parameter that controls the bias we allow in our simulation [7]. As $\epsilon$ increases it becomes easier to simulate draws from $\pi_\epsilon$. Given infinite time we should let $\epsilon = 0$ and run the chain to draw infinite samples. However, when given limited or finite wall-clock time it may be advantageous to tolerate some bias in return for lowering variance by either collecting larger samples or mixing better.
4.2.2 Approximate Metropolis-Hastings
As we alluded to earlier, in Bayesian inference it is often the case that we wish
to find the expectation of a parameter θ with respect to a posterior distribution,
f(θ). Given a dataset of N observations $X_N = \{x_1, \ldots, x_N\}$, which we model with a distribution f(x|θ) and prior ρ(θ), we want to sample from the posterior density
\[ f(\theta) \propto \rho(\theta) \prod_{i=1}^{N} f(x_i \mid \theta) \]
to estimate $\hat{\theta}$. If our data is minimally sufficient and if XN contains billions of points, then evaluating f(·) at least once in the Metropolis-Hastings acceptance ratio is a costly O(N) operation for a single bit of information.
By reformulating step 4 of Algorithm 2 as a statistical test of significance, we can reduce some of the overhead incurred by unbiased MCMC. In standard Metropolis-Hastings we accept the proposal θ∗ if U < Pa, otherwise we stay where we are. This condition is equivalent to checking if
\begin{align*}
U &< \frac{f(\theta^*)\, g(\theta_T \mid \theta^*)}{f(\theta_T)\, g(\theta^* \mid \theta_T)} \\
U\, \frac{g(\theta^* \mid \theta_T)\, \rho(\theta_T)}{g(\theta_T \mid \theta^*)\, \rho(\theta^*)} &< \frac{\prod_{i=1}^{N} f(x_i \mid \theta^*)}{\prod_{i=1}^{N} f(x_i \mid \theta_T)} \\
\frac{1}{N} \log\left( U\, \frac{g(\theta^* \mid \theta_T)\, \rho(\theta_T)}{g(\theta_T \mid \theta^*)\, \rho(\theta^*)} \right) &< \frac{1}{N} \sum_{i=1}^{N} l_i, \quad \text{where } l_i = \log f(x_i \mid \theta^*) - \log f(x_i \mid \theta_T) \\
\mu_0 &< \mu,
\end{align*}
where in the last step we substitute µ0 for the left-hand side and µ for the right-hand side for notational convenience.
The costly computation that may have previously required the evaluation
of a posterior density over billions of points is equivalent to testing whether
the mean of a finite population {l1, . . . , lN } is greater than some constant µ0
that does not depend on the data. This makes it easy to frame the check as a
sequential hypothesis test: randomly draw a mini-batch of size n < N without
replacement from XN and compute its mean, $\bar{l}$. If the difference between $\bar{l}$ and µ0 is significantly larger than the standard deviation of $\bar{l}$ and if µ0 < $\bar{l}$, then θ∗ is accepted, otherwise we stay put. If significance is not achieved, we add more observations to the mini-batch and re-run until significance is achieved. Significance will eventually be achieved and the sequential hypothesis test will terminate because when n = N the standard deviation of $\bar{l}$ is 0, since $\bar{l}$ is then the population mean, µ.
Formally, we can test the hypotheses
\[ H_0 : \mu_0 \leq \mu \quad \text{vs} \quad H_1 : \mu_0 > \mu, \]
where the sample mean, $\bar{l}$, and the sample standard deviation, $s_l$, are given as
\[ \bar{l} = \frac{1}{n} \sum_{i=1}^{n} l_i, \qquad s_l^2 = \frac{n\left( \overline{l^2} - (\bar{l})^2 \right)}{n - 1}, \]
the standard deviation of $\bar{l}$ is estimated to be
\[ s = \frac{s_l}{\sqrt{n}} \sqrt{1 - \frac{n - 1}{N - 1}}, \]
and the test statistic is
\[ t = \frac{\bar{l} - \mu_0}{s}. \]
For large enough n, we claim t follows a standard Student-t distribution with n − 1 degrees of freedom when µ = µ0. To determine if the difference between µ0 and µ is significant, we compute the p-value as $p = 1 - \phi_{n-1}(|t|)$, where $\phi_{n-1}(\cdot)$ is the cdf of the Student-t distribution with n − 1 degrees of freedom. If p is less than the α level of our test, the difference is significant and we can confidently decide whether µ0 < µ (accept the proposal) or µ0 > µ (keep the current state). The pseudocode below, as well as a more detailed proof of the distribution of t, may be found in [7].
We are often able to make confident decisions considering only n < N data
points in the posterior. Though we introduce bias in the form of the α level
of the test, we make up for this by drawing more samples from the stationary
distribution. For error bounds on the estimates produced, a description of
optimal sequential test design, and illustrative examples, see [7]. In the following
section we cover the slice sampling algorithm, which may be conceptualized
as a higher dimensional analogue to the inverse transform. Interestingly, an
approximate slice sampler also exists [4].
Algorithm 3 Approximate MH Test
procedure Approx. MH Input: θT, θ∗, ε, µ0, XN, m Output: accept
    Initialize the running means of li and li² to 0
    n ← 0, done ← false
    Draw U ∼ Unif([0, 1])
    while not done do
        Draw mini-batch X of size min(m, N − n) w/o replacement from XN and set XN ← XN \ X
        Update the running means using X and set n ← n + |X|
        Compute δ ← 1 − φn−1((l̄ − µ0)/s)
        if δ < ε then
            accept ← true if µ0 < l̄ and false otherwise
            done ← true
        end if
    end while
end procedure
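The R sketch below mirrors the logic of Algorithm 3 under some simplifying assumptions: loglik(x, theta) is a hypothetical user-supplied function returning the vector of per-observation log-densities log f(xi | θ), µ0 has been computed beforehand from U, the proposal, and the prior, and each mini-batch contains at least two observations.
# Sequential (approximate) MH test: decide whether mu0 < mu using mini-batches of the data.
Approx.MH.Test <- function(theta.t, theta.star, eps, mu0, data, m) {
  N <- length(data)
  idx <- sample(N)                        # random order gives mini-batches w/o replacement
  l <- numeric(0)
  n <- 0
  repeat {
    batch <- data[idx[(n + 1):min(n + m, N)]]
    l <- c(l, loglik(batch, theta.star) - loglik(batch, theta.t))  # per-observation l_i
    n <- length(l)
    lbar <- mean(l)
    if (n == N) return(mu0 < lbar)        # whole data set used: no uncertainty remains
    s <- sd(l) / sqrt(n) * sqrt(1 - (n - 1) / (N - 1))  # estimated std. dev. of lbar
    delta <- 1 - pt(abs(lbar - mu0) / s, df = n - 1)    # tail probability of the t statistic
    if (delta < eps) return(mu0 < lbar)   # significant: accept if mu0 < lbar, else stay
  }
}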
4.3 Slice Sampling
Unlike Metropolis-Hastings, the slice sampler does not require the selection of
a proposal distribution nor does it require any convexity properties, as some
adaptive acceptance-rejection methods do. In practice, however, slice sampling is not entirely free of hyperparameter selection.
In the univariate case, the slice sampler transitions from a point (X, U) under the curve of fX to another point (X', U') under the curve of fX in such a way that the distribution of (X, U) converges to the uniform distribution over the area under the curve of fX, which is its stationary distribution [8].
The pseudocode in Algorithm 4 outlines the 2D case.
Algorithm 4 2D Slice Sampler
1: procedure Slice sample Input: XT ∈ supp fX
2: Draw U ∼ Unif([0, 1])
3: Draw XT+1 ∼ Unif({x : fX(x) ≥ U · fX(XT)})
4: end procedure
Many important details are left out but a full implementation may be found in Figure 4.5. The
problem of drawing from the exact level sets of the distribution in step 3 can be
intractable when fX is complex enough. We have adapted Neal’s slice sampling
algorithm from [11] and naively expand out from XT using an arbitrarily chosen
step size until a suitable interval is found. If we were able to sample perfectly
from the slice under the curve, there would be no rejected samples. The idea of
learning or predicting these level sets is intriguing, and to my knowledge, has
not been attempted.
Slice.Sample <- function(x0, f, nsample, step = 1) {
  x <- x0
  for (i in 2:nsample) {
    # Draw the auxiliary height uniformly under the curve at the current point.
    u <- runif(1, 0, f(x[i - 1]))
    # Step out from the current point until both endpoints lie below the slice.
    lower <- x[i - 1] - step
    upper <- x[i - 1] + step
    while (u < f(lower)) {
      lower <- lower - step
    }
    while (u < f(upper)) {
      upper <- upper + step
    }
    # Sample uniformly from the interval, shrinking it toward the current
    # point whenever a proposal falls outside the slice.
    repeat {
      x.proposal <- runif(1, lower, upper)
      if (u < f(x.proposal)) {
        x[i] <- x.proposal
        break
      } else if (x.proposal < x[i - 1]) {
        lower <- x.proposal
      } else {
        upper <- x.proposal
      }
    }
  }
  return(x)
}
Figure 4.5: Naive implementation of the slice sampler
Example 4.2 We use the slice sampler to draw from a tri-modal mixture of
normals defined in the target function below. The issue of finding correct level
sets becomes apparent, as we might not expand our interval out far enough to
jump modes.
target <- function(x) {
return(0.25 * dnorm(x, -2, 0.3) +
0.50 * dnorm(x, 0, 0.3) +
0.25 * dnorm(x, 2, 0.3))
}
hist(Slice.Sample(1, target, 10000, 1),
breaks = 100, prob = TRUE, ylim = c(0, 0.7),
main = "Trimodal Mixture of Normal", xlab = "X")
x <- seq(-10, 10, length = 1000)
lines(x, target(x), col = "blue")
Figure 4.6: The result of slice sampling a trimodal normal distribution.
4.3.1 Auxiliary Variable MCMC
The slice sampler introduces an auxiliary variable, an approach we revisit with
the Hamiltonian Monte Carlo, that is marginalized out to produce the desired
distribution. Using the Fundamental Theorem of Simulation, we are able to
draw samples from fX by drawing samples uniformly under the curve of fX.
Let Q be the area under the curve of fX, so that the joint density of $(X, U) \sim \mathrm{Unif}(\{(x, u) : 0 < u < f_X(x)\})$ is
\[ f_{(X,U)}(x, u) = \frac{1}{Q}\, \mathbf{1}(0 \leq u \leq f_X(x)). \]
This implies the marginal distribution of X is
\[ \int f_{(X,U)}(x, u)\, du = \frac{1}{Q} \int_0^{f_X(x)} du = \frac{f_X(x)}{Q}. \]
As Algorithm 4 suggests, we alternate between sampling X and U. To see that the general slice sampler preserves the uniform distribution over the area under the curve of fX, note that if $X_T \sim f_X$ and $U_{T+1} \sim \mathrm{Unif}([0, f_X(X_T)])$ then
\[ (X_T, U_{T+1}) \sim f_X(X_T)\, \frac{\mathbf{1}(0 \leq U_{T+1} \leq f_X(X_T))}{f_X(X_T)} \propto \mathbf{1}(0 \leq U_{T+1} \leq f_X(X_T)). \]
If $X_{T+1} \sim \mathrm{Unif}(A_{T+1}) = \mathrm{Unif}(\{x : 0 \leq U_{T+1} \leq f_X(x)\})$ then
\[ (X_T, U_{T+1}, X_{T+1}) \sim f_X(X_T)\, \frac{\mathbf{1}(0 \leq U_{T+1} \leq f_X(X_T))}{f_X(X_T)}\, \frac{\mathbf{1}(0 \leq U_{T+1} \leq f_X(X_{T+1}))}{\mu(A_{T+1})}, \]
where $\mu(A_{T+1})$ denotes the Lebesgue measure of the set. Marginalizing out $X_T$ gives
\begin{align*}
f(U_{T+1}, X_{T+1}) &\propto \int \mathbf{1}(0 \leq U_{T+1} \leq f_X(x))\, \frac{\mathbf{1}(0 \leq U_{T+1} \leq f_X(X_{T+1}))}{\mu(A_{T+1})}\, dx \\
&= \frac{\mathbf{1}(0 \leq U_{T+1} \leq f_X(X_{T+1}))}{\mu(A_{T+1})} \int \mathbf{1}(0 \leq U_{T+1} \leq f_X(x))\, dx \\
&= \mathbf{1}(0 \leq U_{T+1} \leq f_X(X_{T+1})),
\end{align*}
so that if we begin with $X_T \sim f_X$ then the updates that generate $X_{T+1}$ and $U_{T+1}$ preserve the uniform distribution under the curve of fX.
4.3.2 Uniform Ergodicity of the Slice Sampler
We now discuss the convergence properties of the slice sampler in the simple
2D case. In the ensuing calculations we denote by µ(ω) the Lebesgue measure
of the set
Aω = {x : 0 ≤ ω ≤ fX(x)}.
To gain insight into how the slice sampler behaves asymptotically, we look to
the cdf of the transition kernel. More specifically, we look at the probability
that fX(XT +1) ≤ η given that we are currently at XT and fX(XT ) = ν.
\[ \Pr\left( f_X(X_{T+1}) \leq \eta \mid f_X(X_T) = \nu \right) = \int \int \frac{\mathbf{1}(0 \leq \omega \leq \nu)}{\nu}\, \frac{\mathbf{1}(\omega \leq f_X(x) \leq \eta)}{\mu(\omega)}\, d\omega\, dx, \]
where we first draw ω uniformly on [0, ν] and then draw XT+1 uniformly on Aω. Simplifying further gives
\begin{align*}
\Pr\left( f_X(X_{T+1}) \leq \eta \mid f_X(X_T) = \nu \right) &= \frac{1}{\nu} \int \frac{\mathbf{1}(0 \leq \omega \leq \nu)}{\mu(\omega)} \int \mathbf{1}(\omega \leq f_X(x) \leq \eta)\, dx\, d\omega \\
&= \frac{1}{\nu} \int \mathbf{1}(0 \leq \omega \leq \nu) \cdot \frac{\mu(\omega) - \mu(\eta)}{\mu(\omega)}\, d\omega \\
&= \frac{1}{\nu} \int_0^{\min(\eta, \nu)} \frac{\mu(\omega) - \mu(\eta)}{\mu(\omega)}\, d\omega \\
&= \frac{1}{\nu} \int_0^{\nu} \max\left( 1 - \frac{\mu(\eta)}{\mu(\omega)},\ 0 \right) d\omega,
\end{align*}
which tells us the convergence properties of the slice sampler are totally dependent on the measure, µ. We now turn to the main result, which we owe to Mira and Tierney [10] who, under boundedness conditions, established the following lemma.
Lemma 4.3 If fX and supp fX are bounded, the 2D slice sampler is uniformly
ergodic.
Proof Without loss of generality, assume that fX is bounded by 1 and that
supp fX = [0, 1]. To prove uniform ergodicity, we will show that supp fX is a
small set so that we may invoke Doeblin’s condition. Let
\[ \xi(\nu) = \Pr\left( f_X(X_{T+1}) \leq \eta \mid f_X(X_T) = \nu \right). \]
Notice that ω > η implies µ(η) ≥ µ(ω), so the integrand $\max(1 - \mu(\eta)/\mu(\omega), 0)$ vanishes for ω > η. Further, when ν ≥ η,
\[ \xi(\nu) = \frac{1}{\nu} \int_0^{\eta} \left( 1 - \frac{\mu(\eta)}{\mu(\omega)} \right) d\omega \]
is decreasing in ν since ν only appears in the denominator outside of the integral.
When ν ≤ η we recognize
\[ \xi(\nu) = \frac{1}{\nu} \int_0^{\nu} \left( 1 - \frac{\mu(\eta)}{\mu(\omega)} \right) d\omega \]
as the expected value of the function $1 - \mu(\eta)/\mu(\omega)$ where ω ∼ Unif([0, ν]). The larger ω is, the smaller µ(ω) is, so this function is decreasing in ω; its average over ω ∼ Unif([0, ν]) is therefore decreasing in ν.
Therefore ξ(ν) is decreasing in ν for all η. Intuitively, it would not make
sense if ξ(ν) were increasing in ν because it would imply our Markov chain is
not spending enough time in the modes. If ξ(ν) were increasing in ν then the
larger ν the more likely we are to end up below some threshold (away from the
mode). For the proof to be complete, we must establish bounds on the cdf of
the transition kernel. The minimum occurs when ν = 1:
\[ \lim_{\nu \to 1} \xi(\nu) = \int_0^{\eta} \left( 1 - \frac{\mu(\eta)}{\mu(\omega)} \right) d\omega, \]
which is bounded above by $\int_0^{\eta} 1\, d\omega = \eta$ and below by 0. The maximum is given by L'Hopital's rule:
\[ \lim_{\nu \to 0} \xi(\nu) = \lim_{\nu \to 0} \frac{\int_0^{\nu} \left( 1 - \frac{\mu(\eta)}{\mu(\omega)} \right) d\omega}{\nu} = \lim_{\nu \to 0} \left( 1 - \frac{\mu(\eta)}{\mu(\nu)} \right) = 1 - \mu(\eta). \]
1 − µ(η) is bounded above by 1 and below by 0 because the support is [0, 1].
Once we have found nondegenerate upper and lower bounds on the cdf of
the transition kernel, it is not difficult to derive Doeblin’s condition. The entire
support of fX is thus a small set and uniform ergodicity follows.
This proof serves to remind us that rigorous results are not easy to come
by in MCMC. We must work hard to ensure the methods we employ do indeed
sample from the desired target distribution. We have thus introduced the slice
sampler, given a rudimentary implementation of it, and discussed its conver-
gence properties in the simple 2D case. Next, we cover the Gibbs sampler which
extends the slice sampler’s idea of alternately sampling variables conditioned on
one another.
4.4 Gibbs Sampling
In this section, we consider sampling from the multivariate distribution f(x) =
f(X1, . . . , Xn). Each step of the Gibbs sampling algorithm replaces a single
value, say Xi, by sampling from the distribution conditioned on everything
but Xi, namely fXi(Xi | x−i). That is, we replace Xi with a value drawn from fXi(Xi | x−i), where Xi denotes the ith component of the vector x and x−i denotes the vector x without the ith component. The deterministic scan Gibbs sampler is expressed rather nicely in Algorithm 5.
Each Gibbs step loops through x and replaces each component with a sam-
ple drawn from the correct conditional distribution using the most up-to-date
values. In the context of Metropolis-Hastings, x−i remains unchanged when we draw Xi, so the proposal distribution is $f_{X_i}(X^*_i \mid x_{-i})$. We also have that $x^*_{-i} = x_{-i}$ and $f_x(x) = f_{X_i}(X_i \mid x_{-i})\, f_{x_{-i}}(x_{-i})$, so the Metropolis-Hastings acceptance probability is
\[ A(x^*, x) = \frac{f_{X_i}(X^*_i \mid x^*_{-i})\, f_{x_{-i}}(x^*_{-i})\, f_{X_i}(X_i \mid x^*_{-i})}{f_{X_i}(X_i \mid x_{-i})\, f_{x_{-i}}(x_{-i})\, f_{X_i}(X^*_i \mid x_{-i})} = 1. \]
Algorithm 5 Gibbs Sampling
1: procedure Gibbs Step Input: x = (X1, . . . , Xn) Output: x∗
2: Draw X∗1 ∼ fX1(X1 | X2, . . . , Xn)
3: Draw X∗2 ∼ fX2(X2 | X∗1, X3, . . . , Xn)
4: ...
5: Draw X∗n ∼ fXn(Xn | X∗1, X∗2, . . . , X∗n−1)
6: return x∗ ← (X∗1, . . . , X∗n)
7: end procedure
Thus, if, when dealing with high dimensional distributions, we have access to the conditional distributions (which is often the case in Bayesian networks), the Gibbs sampler never rejects a proposal.
Example 4.4 Say we wish to draw points (X, Y) whose conditional distributions are X | Y ∼ Exp(Y) and Y | X ∼ Exp(X). Below, we implement a deterministic scan Gibbs sampler that draws from this bounded 2D exponential distribution. We bound/truncate the points we draw for graphical simplicity.
Exp.Bounded <- function(rate, B) {
repeat{
x <- rexp(1, rate)
if (x <= B) {
return(x)
}
}
}
Gibbs.Sampler <- function(M, B) {
mat <- matrix(ncol=2, nrow = M)
x <- 1; y <- 1
mat[1, ] <- c(x, y)
for (i in 2:M) {
x <- Exp.Bounded(y, B)
y <- Exp.Bounded(x, B)
mat[i,] <- c(x, y)
}
return(mat)
}
mat <- Gibbs.Sampler(1000, 10)
layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))
plot(mat, main="Joint Distribution", xlab=expression("X"[1]),
ylab=expression("X"[2]), ylim = c(0, 10), xlim = c(0, 10))
hist(mat[ , 1], main=expression("Marginal dist. of X"[1]),
xlab=expression("X"[1]), prob = TRUE, breaks = 30)
hist(mat[ , 2], main=expression("Marginal dist. of X"[2]),
xlab=expression("X"[2]), prob = TRUE, breaks = 30)
Example 4.5 Here we use a random scan Gibbs sampler to approximate the
probability that a point drawn uniformly from the unit hypersphere in 6 dimen-
sions is at least a distance of 0.9 from the origin. Our algorithm begins at the
origin and then randomly chooses a coordinate to replace. Given (X1, . . . , Xn) in the n-dimensional unit hypersphere we choose a random coordinate to update (WLOG, say X1) and sample it uniformly such that
\begin{align*}
\|x\| &\leq 1 \\
X_1^2 + \ldots + X_n^2 &\leq 1 \\
X_1^2 &\leq 1 - (X_2^2 + \ldots + X_n^2) \\
|X_1| &\leq \sqrt{1 - (X_2^2 + \ldots + X_n^2)}
\end{align*}
But the square root is always nonnegative, so we must also flip a fair coin to determine the sign. More explicitly,
\[ X_i \mid x_{-i} \sim \mathrm{Unif}\left( -\sqrt{1 - \textstyle\sum_{j \neq i} X_j^2},\ \sqrt{1 - \textstyle\sum_{j \neq i} X_j^2} \right). \]
Euclidean.Norm <- function(x) {
return(sqrt(sum(x ^ 2)))
}
Gibbs.Hypersphere.Conditional <- function(x) {
if (runif(1) <= 0.5) {
return(-1 * runif(1, min = 0, max = sqrt(1 - sum(x ^ 2))))
}
return(runif(1, min = 0, max = sqrt(1 - sum(x ^ 2))))
}
Random.Scan.Gibbs.Hypersphere <- function(x = rep(0, 6)) {
idx <- sample(1:6, 1)
x[idx] <- Gibbs.Hypersphere.Conditional(x[-idx])
return(x)
}
Hypersphere.MC <- function(steps = 100, f.sample) {
x <- rep(0, 6) # start at origin
for (i in 1:(0.1 * steps)) {
x <- f.sample(x)
}
data <- matrix(0, ncol = length(x), nrow = steps)
for (i in 1:steps) {
x <- f.sample(x)
data[i, ] <- x
}
return(data)
}
draws <- replicate(10,
Hypersphere.MC(steps = 5000,
Random.Scan.Gibbs.Hypersphere))
counts <- apply(draws, MARGIN = 3, FUN = apply, 1, Euclidean.Norm)
p <- mean(counts >= 0.9)
s <- sd(counts >= 0.9) / sqrt(length(counts))
We find the probability that a uniform point drawn from the unit hypersphere in
6 dimensions is at least 0.9 from the origin is 0.469 ± 0.002.
4.5 Hamiltonian Monte Carlo
Originally introduced in 1987 as the Hybrid Monte Carlo [3], what we refer
to as the Hamiltonian Monte Carlo (HMC) combines Hamiltonian dynamics
and the Metropolis algorithm to propose large changes in state (e.g. jumping
from mode to mode in a single iteration) while maintaining a high acceptance
probability. HMC interprets x as a position and introduces an auxiliary variable
to simulate Hamiltonian mechanics on phase space. But first, we introduce the
basic vocabulary of Hamiltonian dynamics.
4.5.1 Hamiltonian Dynamics
Hamiltonian dynamics is a reformulation of classical Newtonian mechanics in
which a particle is described by a position vector x and a momentum vector
p. We associate with our position and momentum a total energy H(x, p) =
U(x) + K(p) called the Hamiltonian of our system. H(x, p) is the sum of the
potential energy associated with x and the kinetic energy associated with p.
We often take the kinetic energy to be
\[ K(p) = \frac{1}{2} \| p \|_2^2, \]
which corresponds to simulating Hamiltonian dynamics on a Euclidean manifold. Exploring the effects of alternate kinetic energies is beyond the scope of this text, however one can imagine simulating the dynamics on a Riemannian manifold instead. The choice of potential energy, we will see, depends on the target distribution we wish to sample from.
Given a position and momentum, the system evolves according to Hamilton's equations:
\[ \frac{dp}{dt} = -\frac{\partial H}{\partial x} \quad \text{and} \quad \frac{dx}{dt} = \frac{\partial H}{\partial p}. \]
Hamiltonian dynamics conserves energy, so a particle whose movement is governed by these equations travels along level sets of constant energy in the joint, or phase, space. Although H remains invariant, the values of x and
p change over time. By simulating the dynamics of a system over a finite time
period, we are able to make large changes to x and avoid random walk behavior.
Example 4.6 (A One-Dimensional Example) Consider the simple case in which the Hamiltonian of our system is defined as follows:
\[ H(x, p) = U(x) + K(p), \qquad U(x) = \frac{x^2}{2}, \qquad K(p) = \frac{p^2}{2}. \]
The resulting dynamics evolve according to the equations
\[ \frac{dp}{dt} = -x, \qquad \frac{dx}{dt} = p. \]
The solutions to these equations have the following form, for some constants r and a:
\[ x(t) = r \cos(a + t), \qquad p(t) = -r \sin(a + t), \]
which correspond to a rotation by t radians clockwise around the origin in the (x, p) plane.
4.5.2 HMC
If we consider the joint distribution over states (x, p) with total energy H(x, p),
i.e.
P(x, p) ∝ exp(−H(x, p)),
we realize that simply starting at some point (x0, p0) and running the dynamics
does not sample ergodically from P. To see this, notice this only explores level sets of constant energy: all states outside the set {(x, p) : H(x, p) = H(x0, p0)} are unreachable. To construct an ergodic Markov chain, we need to perturb the
value of H while keeping P invariant. Conceptually, we want to jump between
level sets of constant energy to explore the space. Adding a Gibbs step where
we draw p ∼ P(p|x) accomplishes just this. Our job is made even simpler by
the independence of x and p, which follows from the factorization of P as
P(x, p) ∝ exp(−U(x)) exp(−K(p)).
Marginalizing out x yields $P(p) \propto \exp(-K(p)) = \exp(-\|p\|_2^2 / 2)$, which we recognise as the (unnormalized) pdf of a standard normal random variable. Applying the same thinking to x, we see that taking U(x) = − log(fX(x)) implies x ∼ fX, giving x the desired marginal distribution.
An algorithm begins to emerge: starting at some point (x, p) in phase space, simulate Hamiltonian dynamics for a finite number of steps, and end in a new state (x∗, p∗). The proposal is accepted with probability
\[ \min\left(1, \exp\left(H(x, p) - H(x^*, p^*)\right)\right). \]
By the conservation of energy, we should always accept such proposals. Sometimes, errors in our numeric simulation of the dynamics prevent this from happening. In our experiments we used Radford Neal's code that appears in Chapter 5 of [2] and is available online at http://www.cs.utoronto.ca/~radford/ham-mcmc-simple.
Algorithm 6 Hamiltonian Monte Carlo Sampler
1: procedure HMC Input: x ∼ fX
2: Draw p ∼ Norm(0, 1), U ∼ Unif([0, 1])
3: Simulate Hamiltonian dynamics to get (x∗, p∗) ∼ P
4: Compute acceptance probability Pa = min(1, exp(H(x, p) − H(x∗, p∗)))
5: if U < Pa then return x∗
6: else return x
7: end if
8: end procedure
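For completeness, here is a compact sketch of one HMC transition in R using the leapfrog integrator; it is not Neal's code, and the step size eps, number of steps L, and gradient function grad.U are illustrative choices supplied by the user.
# One HMC transition with a leapfrog simulation of the dynamics.
# U(x) = -log f_X(x) is the potential energy and grad.U its gradient.
HMC.Step <- function(x, U, grad.U, eps = 0.1, L = 20) {
  p <- rnorm(length(x))                              # resample momentum: p ~ N(0, I)
  x.new <- x
  p.new <- p - eps / 2 * grad.U(x)                   # initial half step for momentum
  for (l in 1:L) {
    x.new <- x.new + eps * p.new                     # full step for position
    if (l < L) p.new <- p.new - eps * grad.U(x.new)  # full step for momentum
  }
  p.new <- p.new - eps / 2 * grad.U(x.new)           # final half step for momentum
  H.old <- U(x) + sum(p ^ 2) / 2
  H.new <- U(x.new) + sum(p.new ^ 2) / 2
  if (runif(1) < exp(H.old - H.new)) x.new else x    # Metropolis accept/reject
}
Repeatedly applying HMC.Step with U(x) = −log(fX(x)) produces a chain whose position coordinate has fX as its marginal distribution.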
Example 4.7 Suppose we wish to sample from the 1D bimodal distribution from Example 4.1. Although we have touted the performance of HMC in high dimensions, we restrict ourselves to a 1D density so that the joint phase space may be visualized, as below. We begin somewhere in the space (Figure 4.7a), simulate Hamiltonian dynamics for some number of steps, and accept the proposal state (Figure 4.7b). We must be cautious in our choice of the number of simulation steps, L, and the stepsize, ε, as it is not difficult to imagine an instance where the simulated particle returns to its starting position after a finite number of iterations. Choosing the stepsize, ε, at random before simulating the particle's path can prevent this type of behavior.
Figure 4.7: Contours of H(x, p) in phase space, with the chain's location before (a) and after (b) simulating Hamiltonian dynamics.
4.6 Summary
Thus ends our exploration of Markov Chain Monte Carlo sampling methods.
As you may have noticed, algorithms that sample from high dimensional dis-
tributions are seldom written once and used forever. Instead, they require an
attention to detail and a tested dedication to writing correct code. Even once
the practitioner has chosen an algorithm most applicable to their setting, it
may require days or weeks of tuning and testing hyperparameter combinations
to achieve the desired convergence. However, the four approaches to MCMC
presented in this work (random walk, Metropolis-Hastings, auxiliary variables,
Gibbs sampling) comprise the vast majority of the practitioner’s toolbox.
Chapter 5
Conclusion
We began with the question of how to generate randomness and have concen-
trated largely on algorithms that do just that: spit out randomness. This merely
scratches the surface of the work being done on Monte Carlo methods. We can
now answer real questions faced by statisticians, economists, mathematicians, and nuclear physicists. We can theorize models based on our beliefs, collect data, and determine, through simulation, whether our observations are in line with our predictions or if they can be considered “extreme”, “weird”, or “outlying”. Prerequisite to all of this is the ability to sample uniformly from the unit interval. We are reminded of the power of the Fundamental Theorem of Simulation and how, ultimately, all of our problems are reduced to sampling uniformly.
Bibliography
[1] Christopher Bishop. Pattern Recognition and Machine Learning. Springer,
New York, 2006.
[2] Steve Brooks. Handbook of Markov Chain Monte Carlo. CRC Press/Taylor
& Francis, Boca Raton, 2011.
[3] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan
Roweth. Hybrid Monte Carlo. Physics letters B, 195(2):216–222, 1987.
[4] Christopher DuBois, Anoop Korattikara, Max Welling, and Padhraic
Smyth. Approximate Slice Sampling for Bayesian Posterior Inference. In
Artificial Intelligence and Statistics, 2014.
[5] Stuart Geman and Donald Geman. Stochastic Relaxation, Gibbs Distri-
butions, and the Bayesian Restoration of Images. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, (6):721–741, 1984.
[6] W Keith Hastings. Monte Carlo Sampling Methods Using Markov Chains
and their Applications. Biometrika, 57(1):97–109, 1970.
[7] Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in
MCMC Land: Cutting the Metropolis-Hastings Budget. arXiv preprint
arXiv:1304.5299, 2013.
[8] David J. C. MacKay. Information Theory, Inference and Learning Algo-
rithms. Cambridge University Press, 2003.
[9] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Au-
gusta H Teller, and Edward Teller. Equation of State Calculations by Fast
Computing Machines. The journal of chemical physics, 21(6):1087–1092,
1953.
[10] Antonietta Mira and Luke Tierney. Efficiency and Convergence Properties
of Slice Samplers. Scandinavian Journal of Statistics, 29(1):1–12, 2002.
[11] Radford M Neal. Slice Sampling. Annals of statistics, pages 705–741, 2003.
[12] Christian Robert. Monte Carlo Statistical Methods. Springer, New York,
2004.
[13] Gareth O. Roberts and Jeffrey S. Rosenthal. On Convergence Rates of
Gibbs Samplers for Uniform Distributions. The Annals of Applied Proba-
bility, 8(4):pp. 1291–1302, 1998.
[14] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Ap-
proach (3rd Edition). Prentice Hall, 2009.

thesis_final_draft

  • 1.
    Senior Thesis inMathematics Sampling from High Dimensional Distributions Author: Bill DeRose Advisor: Dr. Gabe Chandler Submitted to Pomona College in Partial Fulfillment of the Degree of Bachelor of Arts April 3, 2015
  • 2.
    Contents 1 Introduction 1.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Random Variable Generation 2.1 Random Variable Generation . . . . . . . . . . . . . . . . . . . . 2.1.1 The Inverse Transform . . . . . . . . . . . . . . . . . . . . 2.1.2 Acceptance-Rejection Sampling . . . . . . . . . . . . . . . 3 Markov Chains 3.1 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Properties of Markov Chains . . . . . . . . . . . . . . . . . . . . 3.2.1 Irreducibility . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.2 Aperiodicity . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.3 Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.4 Ergodicity . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Markov Chain Monte Carlo 4.1 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Metropolis Hastings . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 Bias-Variance Trade-off . . . . . . . . . . . . . . . . . . . 4.2.2 Approximate Metropolis-Hastings . . . . . . . . . . . . . 4.3 Slice Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.1 Auxiliary Variable MCMC . . . . . . . . . . . . . . . . . . 4.3.2 Uniform Ergodicity of the Slice Sampler . . . . . . . . . . 4.4 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Hamiltonian Monte Carlo . . . . . . . . . . . . . . . . . . . . . . 4.5.1 Hamiltonian Dynamics . . . . . . . . . . . . . . . . . . . . 4.5.2 HMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Conclusion
  • 3.
    Chapter 1 Introduction 1.1 Introduction Thechallenges we face in computational statistics are due to the incredible advance of technology in the past 100 years. In a world where human suffering is the daily reality of so many, we should be so lucky to wrestle with the problems of algorithm design and implementation. Despite the advances in statistics over the past century, random number generation remains an active field of research. From machine learning and artificial intelligence, to the simulation of protein formation, the ability to draw from probability distributions has wide ranging applications. But what exactly is a random number, and what is randomness? More importantly, how can an algorithm take a finite number of deterministic steps to produce something random? Often, humans delude themselves into seeing randomness where there is none – they detect a signal in the noise. Figure 1.1: Which is random?
  • 4.
    The image onthe left of Figure 1.1 depicts genuine randomness. The points on the right are too evenly spaced for it to be truly random. In actuality, each point on the left represents the location of a star in our galaxy while the points on the right represent the location of glowworms on the ceiling of a cave in New Zealand. The glowworms spread themselves out to reduce competition for food amongst themselves. The seemingly uniform distribution is the result of a non-random force. So how do we go about generating images like those on the left? We begin with a little cheat and assume the existence of a random number generator that allows us to sample U ∼ Uniform([0, 1]). Though we will not discuss methods for drawing uniformly from the unit interval, their importance to us cannot be understated. In practice, exact inference is often either impossible (e.g. provably non- integrable functions) or intractable (e.g. high dimensional integration) and we must turn to approximations. This work explores Monte Carlo methods as one approach to numerical approximation. Example 1.1 (Numeric Integration) We wish to evaluate an integral Q = b a f(x) dx. From calculus, we know favg = Q b − a ⇒ Q = (b − a)favg. By the LLN, we can choose X1, . . . , Xn uniformly in [a, b] to approximate favg ≈ 1 n n i=1 f(Xi) ⇒ Q ≈ b − a n n i=1 f(Xi). 1.2 Related works Many of the algorithms we cover are versions of the Metropolis algorithm which first appeared in [9] and was eventually generalized by Hastings in [6]. Though the naming of the algorithm has been contended (Metropolis merely oversaw the research), we refer to the algorithm as Metropolis-Hastings for historical reasons. Regardless of naming conventions, we are indebted to Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller for their work on the original paper outlining the Metropolis algorithm. The term “Monte Carlo” was coined because some of the first applications were to card games, like those in the Monte Carlo Casino in Monaco. A Monte Carlo algorithm is simply an algorithm whose output is random. Such Monte Carlo simulations were important to the development of the Manhattan project (after all, Metropolis worked at Los Alamos during WWII) and remain an im- portant tool of modern statistical physics. Though the Gibbs sampler was introduced by brothers Stuart and Donald Geman in 1984 [5] these sorts of numerical sampling techniques did not en- ter the mainstream until the early 1990’s, arguable because of the advent of the personal computer and wider access to computational power. The Gibbs sampler appeared nearly two decades before Neal’s slice sampler [11], though
we cover them in reverse chronological order because the latter provides a nice motivation for the former. We draw heavily from [10], [13], [12] in proving the uniform ergodicity of the 2D slice sampler. Neal's work appears again in Hamiltonian Monte Carlo [2], which uses gradient information to explore the state space more efficiently. The contemporary results presented are mostly due to the proliferation of big data which, in the Bayesian setting, necessitates the ability to sample from a posterior distribution conditioned on billions of data points. Sequential hypothesis testing allows us to reduce some of the overhead required for Metropolis-Hastings [7]. As a general introduction to each of these sampling techniques, [1] has proven invaluable.
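Returning to Example 1.1, the following is a minimal R sketch of the Monte Carlo estimate of an integral. The integrand, interval, and sample size are arbitrary choices made only for illustration.

# Monte Carlo estimate of Q = integral of f over [a, b] (Example 1.1).
# The integrand f, the interval, and n are illustrative choices.
f <- function(x) exp(-x^2)      # integrand
a <- 0; b <- 2                  # interval of integration
n <- 100000                     # number of uniform draws
x <- runif(n, a, b)             # X_1, ..., X_n ~ Unif([a, b])
Q.hat <- (b - a) * mean(f(x))   # (b - a) times the average of f(X_i)
Q.hat
# For comparison, R's deterministic quadrature:
integrate(f, a, b)$value

With this many draws the estimate typically agrees with the quadrature value to two or three decimal places, and the error shrinks at the usual Monte Carlo rate of $O(1/\sqrt{n})$.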
Chapter 2

Random Variable Generation

2.1 Random Variable Generation

Assuming we may draw $U \sim \mathrm{Unif}([0,1])$, what other distributions can we generate from? It turns out we can draw $X \sim \mathrm{Bernoulli}(p)$ by letting $X \leftarrow \mathbf{1}(U < p)$. If $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(p)$ then $\sum_{i=1}^n X_i = Y \sim \mathrm{Bin}(n, p)$. However, a more general approach called the inverse transform allows us to draw from any 1D density whose closed form cdf we may write down.

2.1.1 The Inverse Transform

Definition 2.1 Suppose $X$ is a random variable with probability density function (pdf) $f_X$. We denote by $F_X$ the cumulative distribution function (cdf), where $F_X(a) = \Pr(X \le a)$. Cumulative distribution functions are nonnegative, increasing, right-continuous functions with $\lim_{a \to -\infty} F_X(a) = 0$ and $\lim_{a \to \infty} F_X(a) = 1$.

Definition 2.2 For an increasing function $F$ on $\mathbb{R}$, the pseudoinverse of $F$, denoted $F_p^{-1}$, is the function
\[ F_p^{-1}(u) = \inf\{a : F(a) \ge u\}. \]
If $F$ is strictly increasing then $F_p^{-1} \equiv F^{-1}$.

With these definitions in hand, we have the tools to generate random variates from any distribution with a computable generalized inverse.

Lemma 2.3 Let $F(a)$ be a cdf and $U \sim \mathrm{Unif}([0,1])$. If $X = F_p^{-1}(U)$ then $X$ has cdf $F(a)$.
Proof Let $F$, $U$, and $X$ be as given in the lemma. Then
\[ \Pr(X \le a) = \Pr(F_p^{-1}(U) \le a) = \Pr(U \le F(a)) = F(a), \]
where the second equality follows from the fact that $F$ is increasing.

We now see the importance of being able to draw uniformly from the unit interval. In fact, if $U$ is not truly uniform on $[0,1]$, then the inverse transform method fails to sample from the correct distribution. However, to use the inverse transform we must explicitly write down the cumulative distribution function and efficiently compute its generalized inverse. As we will see in Example 2.6, this is not always possible.

Example 2.4 We wish to draw $X \sim \mathrm{Exp}(\lambda)$ using the inverse transform method. The cdf of an exponential random variable is given by $F_X(a) = 1 - \exp(-\lambda a)$. Solving for the inverse yields
\begin{align*}
U &= 1 - \exp(-\lambda a) \\
\log(1 - U) &= -\lambda a \\
-\lambda^{-1}\log(1 - U) &= a,
\end{align*}
so that if $U \sim \mathrm{Unif}([0,1])$ then $-\lambda^{-1}\log(1 - U) = X \sim \mathrm{Exp}(\lambda)$. (A short R sketch of this transform appears after Example 2.5.)

Example 2.5 Recall that the pdf of a Cauchy random variable, $X$, is
\[ f_X(s) = \frac{1}{\pi(1 + s^2)}. \]
Given $U \sim \mathrm{Unif}([0,1])$, we find a transformation $Y = r(U)$ such that $Y$ has a Cauchy distribution. We begin by finding the cdf of $X$:
\begin{align*}
F_X(a) &= \int_{-\infty}^{a} \frac{1}{\pi(1 + s^2)}\,ds
        = \frac{1}{\pi}\Big[\tan^{-1}(s)\Big]_{-\infty}^{a} \\
       &= \frac{1}{\pi}\Big(\tan^{-1}(a) - \lim_{n \to -\infty}\tan^{-1}(n)\Big)
        = \frac{1}{\pi}\Big(\tan^{-1}(a) + \frac{\pi}{2}\Big)
        = \frac{\tan^{-1}(a)}{\pi} + \frac{1}{2}.
\end{align*}
To compute the desired transformation, we have:
\begin{align*}
U &= \frac{\tan^{-1}(a)}{\pi} + \frac{1}{2} \\
\pi\Big(U - \frac{1}{2}\Big) &= \tan^{-1}(a) \\
\tan\Big(\pi\Big(U - \frac{1}{2}\Big)\Big) &= a,
\end{align*}
so that $Y = r(U) = \tan\big(\pi(U - \tfrac{1}{2})\big) \sim \mathrm{Cauchy}$.

cauchies <- tan(pi * (runif(10000) - 0.5))
hist(cauchies[abs(cauchies) <= 500], prob = TRUE, breaks = 2000,
     xlim = c(-20, 20), ylim = c(0, 0.35),
     main = "Cauchy r.v. using ITF", xlab = "X")
lines(seq(-20, 20, 0.2), dcauchy(seq(-20, 20, 0.2)), col = "blue")

[Figure: "Cauchy r.v. using ITF", a histogram of the simulated draws (X vs. Density) with the Cauchy density overlaid in blue.]
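As promised in Example 2.4, here is an analogous R sketch of the inverse transform for the exponential distribution; the rate λ = 2 and the sample size are arbitrary illustrative choices.

lambda <- 2
u <- runif(10000)
exps <- -log(1 - u) / lambda    # inverse transform: F^{-1}(u) = -log(1 - u) / lambda
hist(exps, prob = TRUE, breaks = 100,
     main = "Exp(2) via the inverse transform", xlab = "X")
x <- seq(0, 4, 0.01)
lines(x, dexp(x, rate = lambda), col = "blue")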
Example 2.6 Even in 1D, there exist densities whose cdf we cannot write down. For example, the cumulative distribution function of the standard normal distribution cannot be expressed in closed form:
\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp(-z^2/2)\,dz. \]
Clearly, we must develop other methods that do not rely as strongly on nice analytic properties of our target distribution.

2.1.2 Acceptance-Rejection Sampling

Much of this section stems from the idea that if $f_X$ is the target distribution, we may write
\[ f_X(x) = \int_0^{f_X(x)} 1\,du. \]
Here, $f_X$ appears as the marginal density of the joint distribution $(X, U) \sim \mathrm{Unif}(\{(x,u) : 0 < u < f_X(x)\})$. Introducing the auxiliary variable $U$ allows us to sample from our target distribution by drawing uniformly from the area under the curve of $f_X$ and ignoring the auxiliary coordinate.

Theorem 2.7 (The Fundamental Theorem of Simulation) Simulating $X \sim f_X$ is equivalent to simulating $(X, U) \sim \mathrm{Unif}(\{(x,u) : 0 < u < f_X(x)\})$.

Actually sampling from the joint distribution of $(X, U)$ introduces difficulty, though, because sampling $X \sim f_X$ and $U|X \sim \mathrm{Unif}([0, f_X(X)])$ defeats the purpose of introducing the auxiliary variable. If we could sample $X \sim f_X$ in the first place, we would already be done. The solution is to generate pairs $(X, U)$ from a superset and accept them if they satisfy the constraint. For instance, suppose the 1D density $f_X$ is bounded by $m$ and the support of $f_X$, denoted $\mathrm{supp}\, f_X$, is $[c, d]$. Sampling pairs $(X, U) \sim \mathrm{Unif}(\{(x,u) : 0 \le u \le f_X(x)\})$ is equivalent to simulating $X \sim \mathrm{Unif}([c,d])$, $U|X \sim \mathrm{Unif}([0, m])$, and accepting the pair if $0 < U < f_X(X)$. It is easily shown that this does indeed sample from
the desired distribution:
\begin{align*}
\Pr(X \le a) &= \Pr(X \le a \mid U \le f_X(X)) = \frac{\Pr(X \le a,\ U \le f_X(X))}{\Pr(U \le f_X(X))} \\
&= \frac{\int_c^a \int_0^{f_X(z)} \frac{1}{d-c}\cdot\frac{1}{m}\,du\,dz}{\int_c^d \int_0^{f_X(z)} \frac{1}{d-c}\cdot\frac{1}{m}\,du\,dz}
 = \frac{\int_c^a \int_0^{f_X(z)} du\,dz}{\int_c^d \int_0^{f_X(z)} du\,dz} \\
&= \frac{\int_c^a f_X(z)\,dz}{\int_c^d f_X(z)\,dz} = \int_c^a f_X(z)\,dz = F_X(a).
\end{align*}
This computation was made easier by the fact that both $f_X$ and $\mathrm{supp}\, f_X$ were bounded. In situations where this is not the case, we can no longer use a rectangle as the superset from which we draw candidates. Instead, we use some other probability distribution $g(x)$ that may be readily sampled from. Such a distribution is called a proposal distribution and must satisfy
\[ M \cdot g(x) \ge f_X(x), \qquad M \ge 1, \qquad \forall x \in \mathrm{supp}\, f_X. \]
We formalize this notion in the following theorem.

Theorem 2.8 (The Acceptance-Rejection Theorem) Let $g$ be a probability distribution that satisfies $M \cdot g(x) \ge f_X(x)$ for some $M \ge 1$ and for all $x \in \mathrm{supp}\, f_X$. Then, to simulate $X \sim f_X$, it is sufficient to simulate $Y \sim g$ and $U|Y \sim \mathrm{Unif}([0, M \cdot g(Y)])$ and let $X \leftarrow Y$ if $U \le f_X(Y)$.

Proof Sampling $Y \sim g$, $U|Y \sim \mathrm{Unif}([0, M \cdot g(Y)])$, and letting $X \leftarrow Y$ if
$U \le f_X(Y)$ generates $X \sim f_X$:
\begin{align*}
\Pr(X \in A) &= \Pr(Y \in A \mid U \le f_X(Y)) \\
&= \frac{\int_A \int_0^{f_X(z)} g(z)\,\frac{1}{M g(z)}\,du\,dz}{\int_{\mathrm{supp}\, f_X} \int_0^{f_X(z)} g(z)\,\frac{1}{M g(z)}\,du\,dz}
 = \frac{\int_A f_X(z)\,dz}{\int_{\mathrm{supp}\, f_X} f_X(z)\,dz} = \int_A f_X(z)\,dz.
\end{align*}

The proposals used in acceptance-rejection sampling come from $g(Y)$ and are accepted with probability
\[ \frac{f_X(Y)}{M \cdot g(Y)}, \]
so the probability we accept any given proposal is then
\[ \Pr(\text{accept}) = \int g(y)\,\frac{f_X(y)}{M \cdot g(y)}\,dy = \frac{1}{M}\int f_X(y)\,dy = \frac{1}{M}. \]
The larger $M$ is, the more points we must reject before accepting a proposal. For efficiency's sake, we want
\[ M = \sup_x \frac{f_X(x)}{g(x)} \]
to ensure the highest possible acceptance rate. This leads directly to the Acceptance-Rejection algorithm, which is a realization of Theorem 2.8:

Algorithm 1 AR Sampling
1: procedure Acceptance-Rejection
2:     Draw Y ∼ g, U ∼ Unif([0, M · g(Y)])
3:     Let X ← Y if U ≤ fX(Y), else return to 2.
4: end procedure

Example 2.9 Given $Y \sim \mathrm{Cauchy}$ we use acceptance-rejection to generate $X \sim \mathrm{Exp}(1/2)$. To use AR with a proposal distribution $g(x)$, we must ensure
\[ M \cdot g(x) \ge f_X(x) \;\Rightarrow\; M \ge \frac{f_X(x)}{g(x)} \quad \text{for all } x \in \mathrm{supp}\, f_X. \]
Ideally, $M$ is as close to 1 as possible:
\[ M \ge \sup_{x \ge 0} \frac{f_X(x)}{g(x)} \approx 3.629. \]
We confine our maximization to the positive reals because the target distribution only has support on the positive reals. The maximum is attained at $x = 2 + \sqrt{3}$. Using $M = 3.629$ yields
Draw.AR <- function() {
  repeat {
    proposal <- rcauchy(1)
    u <- runif(1, 0, 3.629 * dcauchy(proposal))
    if (u <= dexp(proposal, rate = 1 / 2)) {
      return(proposal)
    }
  }
}

x <- seq(0, 15, 0.1)
hist(replicate(10000, Draw.AR()), breaks = 100, prob = TRUE,
     xlab = "X", main = "Exp(1/2) using AR")
lines(x, dexp(x, rate = 1 / 2), col = "blue")

[Figure: "Exp(1/2) using AR", a histogram of the accepted draws (X vs. Density) with the Exp(1/2) density overlaid in blue.]
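As a quick sanity check on the acceptance probability derived above, the overall acceptance rate of this sampler should be close to $1/M \approx 1/3.629 \approx 0.276$. The sketch below uses a hypothetical variant of Draw.AR that also counts proposals; it is not part of the original example.

Draw.AR.Counted <- function() {
  # Same acceptance-rejection loop as Draw.AR, but also records how many
  # proposals were needed before one was accepted.
  attempts <- 0
  repeat {
    attempts <- attempts + 1
    proposal <- rcauchy(1)
    u <- runif(1, 0, 3.629 * dcauchy(proposal))
    if (u <= dexp(proposal, rate = 1 / 2)) {
      return(c(proposal, attempts))
    }
  }
}

draws <- replicate(10000, Draw.AR.Counted())
mean(draws[2, ])       # mean number of proposals per accepted draw, approx. M
1 / mean(draws[2, ])   # empirical acceptance rate, approx. 1/M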
Chapter 3

Markov Chains

3.1 Markov Chains

Definition 3.1 A sequence of random variables $X_1, \ldots, X_n$, denoted $(X_n)$, is a Markov chain if
\[ \Pr(X_{n+1} \mid X_n, X_{n-1}, \ldots, X_1) = \Pr(X_{n+1} \mid X_n). \tag{3.1} \]

Example 3.2 A random walk is a Markov chain that satisfies $X_{n+1} = X_n + \epsilon_n$, where $\epsilon_n$ is generated independently of the current state. If the distribution of $\epsilon_n$ is symmetric about 0, we call this a symmetric random walk. In Section 4.2 we will see how random walks are used in MCMC algorithms.

Every Markov chain has an initial distribution, $\pi_0$, and a transition kernel $K$. The state space, denoted $\mathcal{X}$, is the set of possible values $X_i$ may take on at each step in the Markov chain.

Definition 3.3 A transition kernel is a function $K$ defined on $\mathcal{X} \times \mathcal{B}(\mathcal{X})$ such that
• $\forall x \in \mathcal{X}$, $K(x, \cdot)$ is a probability measure;
• $\forall A \in \mathcal{B}(\mathcal{X})$, $K(\cdot, A)$ is measurable;
where $\mathcal{B}(\mathcal{X})$ denotes the σ-algebra defined on the set $\mathcal{X}$.

When the state space is discrete, the transition kernel is a matrix $K$ where $K_{ij} = \Pr(X_{n+1} = x_j \mid X_n = x_i)$. In the continuous case, the transition kernel denotes a conditional density where
\[ \Pr(X_{n+1} \in A \mid X_n = x) = \int_A K(x, x')\,dx'. \]
A Markov chain is said to be time homogeneous if $K(X_{n+1} \mid X_n)$ is independent of $n$.
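To make the discrete case concrete, the following is a minimal R sketch that simulates a chain from a given transition matrix. The two-state matrix, the starting state, and the chain length are arbitrary choices made for illustration.

# A small 2-state transition matrix: K[i, j] = Pr(X_{n+1} = j | X_n = i).
K <- matrix(c(0.9, 0.1,
              0.5, 0.5), nrow = 2, byrow = TRUE)

Simulate.Chain <- function(K, x0, steps) {
  x <- numeric(steps)
  x[1] <- x0
  for (n in 2:steps) {
    # Move according to the row of K indexed by the current state.
    x[n] <- sample(1:ncol(K), 1, prob = K[x[n - 1], ])
  }
  return(x)
}

chain <- Simulate.Chain(K, x0 = 1, steps = 10000)
table(chain) / length(chain)   # empirical occupation frequencies

For this particular matrix the stationary distribution works out to (5/6, 1/6), and the empirical frequencies settle near it as the chain grows.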
We restrict our study almost entirely to time homogeneous Markov chains. An example of a time heterogeneous Markov chain is the simulated annealing algorithm, whose transition kernel changes with the "temperature" of the system. Time heterogeneity is a key property of simulated annealing because it allows us to explore the entire state space when the temperature is high, but restricts our moves when the temperature is low. The algorithm is inspired by annealing in metallurgy, where the process is used to temper or harden metals and glass by heating them to a high temperature and gradually cooling them, allowing the material to reach a low-energy crystalline state [14].

Given a transition matrix for a discrete Markov chain and an initial distribution $\pi_0$, the distribution of $X_1$ is obtained by matrix multiplication, $\pi_1 = \pi_0 K$. Similarly, $X_n \sim \pi_n = \pi_0 K^n$. Notice that once the initial distribution is specified, the behavior of the chain is entirely dependent on $K$.

Definition 3.4 Consider $A \in \mathcal{B}(\mathcal{X})$. The first $n$ for which the chain enters the set $A$ is denoted by
\[ \tau_A = \inf\{n \ge 1 : X_n \in A\} \]
and is called the stopping time at $A$. By convention, $\tau_A = \infty$ if $X_n \notin A$ for every $n$. Associated with the set $A$, we also define
\[ \eta_A = \sum_{n=1}^{\infty} \mathbf{1}(X_n \in A), \]
the number of times the chain enters $A$.

Example 3.5 In a zero-sum coin tossing game, the payoff to player b is +1 if a heads appears and −1 if a tails appears. Similarly, the payoff to player c is +1 if a tails appears and −1 if a heads appears. Let $X_n$ be the sum of the gains of player b after $n$ rounds of the game. The infinite dimensional transition matrix, $K$, has zeros on the diagonal since player b must either lose or gain a point on each round. Furthermore, $K$ has upper and lower sub-diagonals equal to 1/2 because we are flipping a fair coin. Assuming that player b begins with $B$ dollars and player c begins with $C$ dollars,
\[ \tau_1 = \inf\{n : X_n \le -B\} \quad\text{and}\quad \tau_2 = \inf\{n : X_n \ge C\} \]
represent, respectively, the ruin of players b and c. The probability of bankruptcy for player b is then $\Pr(\tau_1 < \tau_2)$.

3.2 Properties of Markov Chains

3.2.1 Irreducibility

Irreducibility is an important property of Markov chains which guarantees that regardless of the current state of the chain, it is possible to reach any other state
in a finite number of transitions. In the discrete case, irreducibility also tells us the transition matrix cannot be broken down into smaller matrices (i.e. the transition graph is connected).

Definition 3.6 Given a measure $\varphi$, a Markov chain with transition kernel $K(\cdot)$ is $\varphi$-irreducible if for every $A \in \mathcal{B}(\mathcal{X})$ such that $\varphi(A) > 0$, $\Pr(\tau_A < \infty) > 0$ regardless of the initial state.

Irreducibility, together with aperiodicity, a property introduced in the following subsection, allows us to make strong analytic arguments about the convergence of Markov chains.

3.2.2 Aperiodicity

We define the period of a state $x \in \mathcal{X}$ to be
\[ d(x) = \gcd\{m \ge 1 : K^m(x, x) > 0\}. \]
If $d(x) \ge 2$, we say $x$ is periodic with period $d(x)$. A state is aperiodic if it has period 1. An irreducible chain is aperiodic if each state has period 1.

Example 3.7 A Markov chain with period $n$ is given by the block matrix
\[ P = \begin{pmatrix}
0 & P_1 & 0 & \cdots & 0 \\
0 & 0 & P_2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & P_{n-1} \\
P_n & 0 & 0 & \cdots & 0
\end{pmatrix}, \]
where each $P_i$ is a stochastic matrix and $P$ is irreducible.

3.2.3 Stationarity

Definition 3.8 A Markov chain $(X_n)$ has stationary distribution $\pi$ if $X_n \sim \pi \Rightarrow X_{n+1} \sim \pi$.

For MCMC methods to be of any use to us, we must be able to reason about the asymptotic behavior of Markov chains. The distribution of $X_n$ as $n \to \infty$ is called the limiting distribution. Ideally, we would like some guarantee that, regardless of initial conditions, the limiting distribution of a Markov chain is also its stationary distribution.

The general approach with MCMC algorithms is to initialize and run a Markov chain for a sufficient number of steps to draw samples approximately from the desired stationary distribution. It is common to ignore some number of samples at the beginning (burn-in), and then consider only every nth sample (thinning, for approximate independence) when computing an expectation.
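As a small numerical illustration of Definition 3.8 and of the limiting behavior just described, the sketch below applies $\pi_n = \pi_0 K^n$ to the same illustrative two-state matrix used earlier and checks that the resulting distribution is stationary.

# Same illustrative 2-state matrix as in the earlier sketch.
K <- matrix(c(0.9, 0.1,
              0.5, 0.5), nrow = 2, byrow = TRUE)

# pi_n = pi_0 K^n: repeatedly multiply an initial distribution by K.
pi.n <- c(1, 0)                 # start with all mass on state 1
for (n in 1:50) pi.n <- pi.n %*% K
pi.n                            # approx. (5/6, 1/6)

# Check stationarity directly: pi K = pi (up to numerical error).
pi.stat <- c(5, 1) / 6
pi.stat %*% K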
3.2.4 Ergodicity

When exactly do we know that the limiting distribution of a Markov chain is the stationary distribution? The Ergodic Theorem tells us just this.

Theorem 3.9 (The Ergodic Theorem) Let $(X_n)$ be a Markov chain with stationary distribution $\pi$. If the chain is $\varphi$-irreducible and aperiodic, then for all measurable sets $A$,
\[ \lim_{n \to \infty} \Pr(X_n \in A) = \pi(A). \]
Which is to say that the limiting distribution of irreducible, aperiodic Markov chains is always the stationary distribution. An even stronger guarantee of convergence exists, but to get there we must introduce more terminology.

Definition 3.10 The Markov chain $(X_n)$ has an atom $\alpha \in \mathcal{B}(\mathcal{X})$ if there exists an associated non-zero measure $\mu$ such that
\[ K(x, A) = \mu(A), \qquad \forall x \in \alpha,\ \forall A \in \mathcal{B}(\mathcal{X}). \]
The definition of a small set follows naturally and will be used in our definition of one of the strongest forms of convergence, uniform ergodicity.

Definition 3.11 A set $C$ is small if there exist an $m > 0$ and a nonzero measure $\mu_m$ such that
\[ K^m(x, A) \ge \mu_m(A) \qquad \text{for all } x \in C \text{ and for all } A \in \mathcal{B}(\mathcal{X}). \]

Definition 3.12 The Markov chain $(X_n)$ is uniformly ergodic if
\[ \lim_{n \to \infty} \sup_{x \in \mathcal{X}} \| P^n(x, \cdot) - \pi \|_{TV} = 0, \]
where $\|\cdot\|_{TV}$ denotes the total variation norm.

In showing uniform ergodicity, we will make use of the following theorem.

Theorem 3.13 (Doeblin's Condition) The following are equivalent:
(a) $(X_n)$ is uniformly ergodic;
(b) there exist $R < \infty$ and $r > 1$ such that $\| P^n(x, \cdot) - \pi \|_{TV} < R\,r^{-n}$ for all $x \in \mathcal{X}$;
(c) $(X_n)$ is aperiodic and $\mathcal{X}$ is a small set;
(d) $(X_n)$ is aperiodic and there exist a small set $C$ and a real $\kappa > 1$ such that
\[ \sup_{x \in \mathcal{X}} \mathbb{E}_x[\kappa^{\tau_C}] < \infty. \]
If the whole space $\mathcal{X}$ is small, there exist a probability distribution $\varphi$ on $\mathcal{X}$, constants $\epsilon, \delta > 0$, and an $n$ such that, if $\varphi(A) > \epsilon$, then
\[ \inf_{x \in \mathcal{X}} K^n(x, A) > \delta, \qquad \forall A \in \mathcal{B}(\mathcal{X}). \]
We see here the relation between analytic limits and uniform ergodicity, giving us a feel for just how strong the guarantee of convergence is. Now that we have covered enough of the basic vocabulary of Markov chains, we may begin our survey of MCMC sampling algorithms.
Chapter 4

Markov Chain Monte Carlo

4.1 Monte Carlo Methods

Although the sampling techniques discussed in Chapter 2 work well, they are not flawless. The inverse transform method fails beyond one dimension, and even then it requires us to write down the closed form cdf of the target distribution. Acceptance-rejection can be used in any dimension we like, but as dimensionality increases it becomes more difficult to find good proposal distributions. We turn now to Markov Chain Monte Carlo (MCMC) simulations because they ameliorate many of these issues.

Monte Carlo simulations allow us to approximate the probability of certain outcomes by running a large number of trials to obtain an empirical distribution of possible events. Markov Chain Monte Carlo simulations use Markov chains whose stationary distribution is the target distribution we wish to sample from. The oldest MCMC algorithm, and the one we choose to cover first, is the Metropolis-Hastings algorithm.

4.2 Metropolis Hastings

At this point, we may readily sample from most distributions covered in an introductory probability course. However, when faced with the task of drawing from a non-standard distribution, we will need more powerful tools at our disposal. For instance, in Bayesian statistics we would often like to sample from the posterior distribution of a parameter to compute its expected value.

At a high level, Metropolis-Hastings samples from a target distribution $f_X$ by drawing from a proposal distribution $g$ ("easy" to sample) and accepting if it looks like it came from $f_X$ ("hard" to sample). At step $T$ in the algorithm, in which the current state is $X_T$, we draw a candidate/proposal $X^* \sim g(X \mid X_T)$ and let $X_{T+1} = X^*$ with probability
\[ A(X^*, X_T) = \min\left(1,\ \frac{f_X(X^*)\, g(X_T \mid X^*)}{f_X(X_T)\, g(X^* \mid X_T)}\right). \]
Otherwise, let $X_{T+1} = X_T$. We notice two things about the acceptance probability. First, the Metropolis-Hastings algorithm only requires that we know $f_X$ up to a normalizing constant. Second, if $g$ is symmetric the acceptance probability becomes
\[ A(X^*, X_T) = \min\left(1,\ \frac{f_X(X^*)}{f_X(X_T)}\right), \]
which implies we always accept a candidate that is more probable, and accept less probable candidates randomly otherwise. The acceptance probability combines concepts from steepest ascent and random walk algorithms, which helps prevent getting stuck in local maxima. Following Algorithm 2 ensures the stationary distribution of the Markov chain is $f_X$.

Algorithm 2 MH Sampling
1: procedure Metropolis-Hastings
   Input: Current state: XT ∼ fX
2:     Draw X∗ ∼ g(X|XT), U ∼ Unif([0, 1])
3:     Compute acceptance probability Pa = A(X∗, XT)
4:     If U < Pa set XT+1 ← X∗, otherwise set XT+1 ← XT
5: end procedure

Make no mistake, Metropolis-Hastings is no free lunch. The proposal distribution must be chosen carefully and presents difficulties in higher dimensions, where our intuition and imagination fail us. This is especially the case when using a non-symmetric proposal distribution. For this reason, we restrict our study of Metropolis-Hastings solely to the symmetric, random walk case. A common (symmetric) proposal distribution is a Gaussian centered on the current state. It is also typical for the proposal distribution's variance to be chosen to be on the same order of magnitude as the smallest variance of the target distribution.

Figure 4.1: Contours of a bivariate normal target distribution with principal standard deviations σmax and σmin (red) and a symmetric proposal distribution with standard deviation ρ (blue).

Consider Figure 4.1, where the 2D target distribution exhibits a strong correlation between components. To achieve a high acceptance ratio, the
standard deviation of the proposal distribution must be kept on the same order of magnitude as $\sigma_{\min}$. Otherwise, our proposals will be from all over the space and we would rarely accept any move. The random walk behavior also means that to explore the length of the distribution, a distance on the order of $\sigma_{\max}$, with steps of size roughly $\sigma_{\min}$, takes on the order of $(\sigma_{\max}/\sigma_{\min})^2$ steps, because a random walk only travels a distance proportional to $\sqrt{n}$ in $n$ steps. If our target distribution is pinched in one dimension and elongated in another, the Metropolis-Hastings algorithm offers poor convergence properties.

Example 4.1 Suppose we wish to sample from the 2-dimensional mixture of normals whose contours are shown in Figure 4.2 (bottom), alongside its 1-dimensional analogue (top). Figure 4.3 shows that in 2 dimensions, the first coordinate of the points sampled using a standard Metropolis algorithm appears to mix well early, but clearly displays difficulty jumping between modes. Figures 4.4a and 4.4b suggest that in 5, 10, and higher dimensions the problem is only exacerbated.

Figure 4.2: The one-dimensional analogue (top, "1D Normal Mixture") of the 2D target distribution (bottom, "2D Normal Mixture"), with µ1 = −2, µ2 = 2, σ1 = σ2 = 1.
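For reference, the following is a minimal R sketch of a random-walk Metropolis sampler of the kind used to produce the traces in Figures 4.3 and 4.4, here targeting the 1D mixture of Figure 4.2 (top). The Gaussian proposal and its standard deviation are illustrative choices, not the exact settings used for those figures.

# Unnormalized 1D target: equal mixture of N(-2, 1) and N(2, 1) (Example 4.1).
target.1d <- function(x) 0.5 * dnorm(x, -2, 1) + 0.5 * dnorm(x, 2, 1)

RW.Metropolis <- function(f, x0, nsample, sd.prop = 1) {
  x <- numeric(nsample)
  x[1] <- x0
  for (t in 2:nsample) {
    proposal <- rnorm(1, mean = x[t - 1], sd = sd.prop)  # symmetric proposal
    # With a symmetric g, the acceptance probability reduces to f(x*) / f(x_T).
    if (runif(1) < f(proposal) / f(x[t - 1])) {
      x[t] <- proposal
    } else {
      x[t] <- x[t - 1]
    }
  }
  return(x)
}

chain <- RW.Metropolis(target.1d, x0 = 0, nsample = 2000)
plot(chain, type = "l", xlab = "Index", ylab = "Position")

Shrinking sd.prop makes the chain accept more often but move more slowly between the modes at ±2, which is exactly the trade-off discussed above.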
Figure 4.3: Mixing of the first coordinate, X1, from a 2D Metropolis sample ("Random-walk Metropolis (2D)", index vs. first position coordinate).

Figure 4.4: First coordinate of points sampled from the Metropolis random walk in 5D (a, "Random-walk Metropolis (5D)") and 10D (b, "Random-walk Metropolis (10D)").

4.2.1 Bias-Variance Trade-off

Implicit in our handling of MCMC lies the desire for unbiased draws from some stationary distribution, $\pi$. In many practical applications, it is too
computationally intensive to draw enough samples to estimate a parameter, $\hat\theta$, or the expectation of a function, $\mathbb{E}[f(X)]$, with sufficiently low variance. If we allow for some bias in our draws from the stationary distribution, the task of simulation is made easier. The mean square error in our estimate is a measure of both bias and variance, $\mathrm{MSE} = B^2 + V$.

When drawing from a posterior density over billions of data points, unbiased Markov chains incur significant computational costs. As a result, the variance of these approximations is high because we can only collect small samples in a fixed amount of time. Alternatively, we can simulate from a slightly biased stationary distribution $\pi_\epsilon$, where $\epsilon$ is a parameter that controls the bias we allow in our simulation [7]. As $\epsilon$ increases it becomes easier to simulate draws from $\pi_\epsilon$. Given infinite time we should let $\epsilon = 0$ and run the chain to draw infinitely many samples. However, when given limited or finite wall-clock time it may be advantageous to tolerate some bias in return for lowering variance by either collecting larger samples or mixing better.

4.2.2 Approximate Metropolis-Hastings

As we alluded to earlier, in Bayesian inference it is often the case that we wish to find the expectation of a parameter $\theta$ with respect to a posterior distribution, $f(\theta)$. Given a dataset of $N$ observations $X_N = \{x_1, \ldots, x_N\}$, which we model with a distribution $f(x \mid \theta)$ and prior $\rho(\theta)$, we want to sample from the posterior density
\[ f(\theta) \propto \rho(\theta) \prod_{i=1}^{N} f(x_i \mid \theta) \]
to estimate $\hat\theta$. If our data is minimally sufficient and if $X_N$ contains billions of points, then evaluating $f(\cdot)$ even once in the Metropolis-Hastings acceptance ratio is a costly $O(N)$ operation for a single bit of information. By reformulating step 4 of Algorithm 2 as a statistical test of significance, we can reduce some of the overhead incurred by unbiased MCMC.

In standard Metropolis-Hastings we accept the proposal $\theta^*$ if $U < P_a$; otherwise we stay where we are. This condition is equivalent to checking whether
\begin{align*}
U &< \frac{f(\theta^*)\, g(\theta_T \mid \theta^*)}{f(\theta_T)\, g(\theta^* \mid \theta_T)} \\
U\, \frac{g(\theta^* \mid \theta_T)\, \rho(\theta_T)}{g(\theta_T \mid \theta^*)\, \rho(\theta^*)} &< \frac{\prod_{i=1}^N f(x_i \mid \theta^*)}{\prod_{i=1}^N f(x_i \mid \theta_T)} \\
\frac{1}{N} \log\left[ U\, \frac{g(\theta^* \mid \theta_T)\, \rho(\theta_T)}{g(\theta_T \mid \theta^*)\, \rho(\theta^*)} \right] &< \frac{1}{N} \sum_{i=1}^N l_i, \qquad \text{where } l_i = \log f(x_i \mid \theta^*) - \log f(x_i \mid \theta_T) \\
\mu_0 &< \mu,
\end{align*}
where in the last step we substitute $\mu_0$ for the left-hand side and $\mu$ for the right-hand side for notational convenience.
The costly computation that may have previously required the evaluation of a posterior density over billions of points is equivalent to testing whether the mean of a finite population $\{l_1, \ldots, l_N\}$ is greater than some constant $\mu_0$ that does not depend on the data. This makes it easy to frame the check as a sequential hypothesis test: randomly draw a mini-batch of size $n < N$ without replacement from $X_N$ and compute its mean, $\bar{l}$. If the difference between $\bar{l}$ and $\mu_0$ is significantly larger than the standard deviation of $\bar{l}$ and if $\mu_0 < \bar{l}$, then $\theta^*$ is accepted; otherwise we stay put. If significance is not achieved, we add more observations to the mini-batch and re-run until significance is achieved. Significance will eventually be achieved and the sequential hypothesis test will terminate, because when $n = N$ the standard deviation of $\bar{l}$ is 0, since $\bar{l}$ is then the population mean, $\mu$.

Formally, we can test the hypotheses
\[ H_0 : \mu = \mu_0 \quad \text{vs} \quad H_1 : \mu \ne \mu_0, \]
where the sample mean, $\bar{l}$, and the sample standard deviation, $s_l$, are given as
\[ \bar{l} = \frac{1}{n} \sum_{i=1}^{n} l_i, \qquad s_l^2 = \frac{n\big(\overline{l^2} - (\bar{l})^2\big)}{n - 1}, \]
the standard deviation of $\bar{l}$ is estimated to be
\[ s = \frac{s_l}{\sqrt{n}} \sqrt{1 - \frac{n-1}{N-1}}, \]
and the test statistic is
\[ t = \frac{\bar{l} - \mu_0}{s}. \]
For large enough $n$, we claim $t$ follows a standard Student-t distribution with $n - 1$ degrees of freedom when $\mu = \mu_0$. To determine if the difference between $\mu_0$ and $\mu$ is significant, we compute the p-value as $p = 1 - \phi_{n-1}(|t|)$, where $\phi_{n-1}(\cdot)$ is the cdf of the Student-t distribution with $n - 1$ degrees of freedom. If $p$ is less than the $\alpha$ level of our test, then we can reject $H_0$ and conclude $\mu_0 \ne \mu$. The pseudocode below, as well as a more detailed proof of the distribution of $t$, may be found in [7].

We are often able to make confident decisions considering only $n < N$ data points in the posterior. Though we introduce bias in the form of the $\alpha$ level of the test, we make up for this by drawing more samples from the stationary distribution. For error bounds on the estimates produced, a description of optimal sequential test design, and illustrative examples, see [7]. In the following section we cover the slice sampling algorithm, which may be conceptualized as a higher dimensional analogue to the inverse transform. Interestingly, an approximate slice sampler also exists [4].
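To complement the pseudocode in Algorithm 3 below, here is a minimal R sketch of a single round of the sequential test. The function name, argument list, and default α level are illustrative assumptions rather than the implementation from [7].

# One round of the test from the text: given the first n of the l_i values,
# decide whether the mini-batch mean differs significantly from mu0.
Batch.Test <- function(l, N, mu0, alpha = 0.05) {
  n <- length(l)
  l.bar <- mean(l)
  s.l <- sd(l)
  s <- (s.l / sqrt(n)) * sqrt(1 - (n - 1) / (N - 1))  # finite population correction
  t.stat <- (l.bar - mu0) / s
  p <- 1 - pt(abs(t.stat), df = n - 1)
  if (p < alpha) {
    return(list(done = TRUE, accept = mu0 < l.bar))
  }
  return(list(done = FALSE, accept = NA))  # grow the mini-batch and re-test
}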
Algorithm 3 Approximate MH Test
procedure Approx. MH
   Input: θT, θ∗, ε, µ0, XN, m
   Output: accept
   Initialize the estimated means ¯l ← 0 and ¯l2 ← 0
   n ← 0, done ← false
   Draw U ∼ Unif([0, 1])
   while not done do
      Draw a mini-batch X of size min(m, N − n) without replacement from XN and set XN ← XN \ X
      Update ¯l and ¯l2 using X, and set n ← n + |X|
      Compute δ ← 1 − φn−1(|(¯l − µ0)/s|)
      if δ < ε then
         accept ← true if µ0 < ¯l and false otherwise
         done ← true
      end if
   end while
end procedure

4.3 Slice Sampling

Unlike Metropolis-Hastings, the slice sampler does not require the selection of a proposal distribution, nor does it require any convexity properties, as some adaptive acceptance-rejection methods do. In practice, however, slice sampling is not entirely free of hyperparameter selection. In the univariate case, the slice sampler transitions from a point $(X, U)$ under the curve of $f_X$ to another point $(X', U')$ under the curve of $f_X$ in such a way that the stationary distribution of $(X, U)$ converges to a uniform distribution over the area under the curve of $f_X$ [8].

The pseudocode in Algorithm 4 outlines the 2D case. Many important details are left out, but a full implementation may be found in Figure 4.5.

Algorithm 4 2D Slice Sampler
1: procedure Slice sample
   Input: XT ∈ supp fX
2:     Draw U ∼ Unif([0, 1])
3:     Draw XT+1 ∼ Unif({x : fX(x) ≥ U · fX(XT)})
4: end procedure

The problem of drawing from the exact level sets of the distribution in step 3 can be intractable when $f_X$ is complex enough. We have adapted Neal's slice sampling algorithm from [11] and naively expand out from $X_T$ using an arbitrarily chosen step size until a suitable interval is found. If we were able to sample perfectly from the slice under the curve, there would be no rejected samples. The idea of learning or predicting these level sets is intriguing and, to my knowledge, has not been attempted.
Slice.Sample <- function(x0, f, nsample, step = 1) {
  x <- x0
  for (i in 2:nsample) {
    # Draw the auxiliary height uniformly under the curve at the current point.
    u <- runif(1, 0, f(x[i - 1]))
    # Step out from the current point until the interval brackets the slice.
    lower <- x[i - 1] - 1
    upper <- x[i - 1] + 1
    while (u < f(lower)) {
      lower <- lower - step
    }
    while (u < f(upper)) {
      upper <- upper + step
    }
    # Propose uniformly from the interval until a point on the slice is found.
    repeat {
      x.proposal <- runif(1, lower, upper)
      if (u < f(x.proposal)) {
        x[i] <- x.proposal
        break
      } else if (x.proposal < lower) {   # never reached: proposals lie in [lower, upper]
        lower <- x.proposal
      } else if (x.proposal > upper) {   # never reached: proposals lie in [lower, upper]
        upper <- x.proposal
      }
    }
  }
  return(x)
}

Figure 4.5: Naive implementation of the slice sampler

Example 4.2 We use the slice sampler to draw from a tri-modal mixture of normals defined in the target function below. The issue of finding correct level sets becomes apparent, as we might not expand our interval out far enough to jump modes.
target <- function(x) {
  return(0.25 * dnorm(x, -2, 0.3) +
         0.50 * dnorm(x,  0, 0.3) +
         0.25 * dnorm(x,  2, 0.3))
}

hist(Slice.Sample(1, target, 10000, 1), breaks = 100, prob = TRUE,
     ylim = c(0, 0.7), main = "Trimodal Mixture of Normal", xlab = "X")
x <- seq(-10, 10, length = 1000)
lines(x, target(x), col = "blue")

Figure 4.6: The result of slice sampling a trimodal normal distribution.

4.3.1 Auxiliary Variable MCMC

The slice sampler introduces an auxiliary variable, an approach we revisit with Hamiltonian Monte Carlo, that is marginalized out to produce the desired distribution. Using the Fundamental Theorem of Simulation, we are able to draw samples from $f_X$ by drawing samples uniformly under the curve of $f_X$. Let $Q$ be the area under the curve of $f_X$, so that a pair $(X, U) \sim \mathrm{Unif}(\{(x, u) : 0 < u < f_X(x)\})$ has density
\[ f_{(X,U)}(x, u) = \frac{1}{Q}\, \mathbf{1}(0 \le u \le f_X(x)). \]
This implies the marginal distribution of $X$ is
\[ \int f_{(X,U)}(x, u)\,du = \frac{1}{Q}\int_0^{f_X(x)} du = \frac{f_X(x)}{Q}. \]
As Algorithm 4 suggests, we alternate between sampling $X$ and $U$. To see that the general slice sampler preserves the uniform distribution over the area under the curve of $f_X$, note that if $X_T \sim f_X$ and $U_{T+1} \sim \mathrm{Unif}([0, f_X(X_T)])$ then
\[ (X_T, U_{T+1}) \sim f_X(X_T)\,\frac{\mathbf{1}(0 \le U_{T+1} \le f_X(X_T))}{f_X(X_T)} \propto \mathbf{1}(0 \le U_{T+1} \le f_X(X_T)). \]
If $X_{T+1} \sim \mathrm{Unif}(A_{T+1}) = \mathrm{Unif}(\{x : 0 \le U_{T+1} \le f_X(x)\})$ then
\[ (X_T, U_{T+1}, X_{T+1}) \sim f_X(X_T)\,\frac{\mathbf{1}(0 \le U_{T+1} \le f_X(X_T))}{f_X(X_T)}\,\frac{\mathbf{1}(0 \le U_{T+1} \le f_X(X_{T+1}))}{\mu(A_{T+1})}, \]
where $\mu(A_{T+1})$ denotes the Lebesgue measure of the set. Marginalizing out $X_T$ gives
\begin{align*}
f(U_{T+1}, X_{T+1}) &\propto \int \mathbf{1}(0 \le U_{T+1} \le f_X(x))\,\frac{\mathbf{1}(0 \le U_{T+1} \le f_X(X_{T+1}))}{\mu(A_{T+1})}\,dx \\
&= \frac{\mathbf{1}(0 \le U_{T+1} \le f_X(X_{T+1}))}{\mu(A_{T+1})} \int \mathbf{1}(0 \le U_{T+1} \le f_X(x))\,dx \\
&= \mathbf{1}(0 \le U_{T+1} \le f_X(X_{T+1})),
\end{align*}
so that if we begin with $X_T \sim f_X$ then the updates that generate $U_{T+1}$ and $X_{T+1}$ preserve the uniform distribution under the curve of $f_X$.

4.3.2 Uniform Ergodicity of the Slice Sampler

We now discuss the convergence properties of the slice sampler in the simple 2D case. In the ensuing calculations we denote by $\mu(\omega)$ the Lebesgue measure of the set $A_\omega = \{x : 0 \le \omega \le f_X(x)\}$. To gain insight into how the slice sampler behaves asymptotically, we look to the cdf of the transition kernel. More specifically, we look at the probability that $f_X(X_{T+1}) \le \eta$ given that we are currently at $X_T$ with $f_X(X_T) = \nu$:
\[ \Pr\big(f_X(X_{T+1}) \le \eta \mid f_X(X_T) = \nu\big) = \int\!\!\int \frac{\mathbf{1}(0 \le \omega \le \nu)}{\nu}\,\frac{\mathbf{1}(\omega \le f_X(x) \le \eta)}{\mu(\omega)}\,d\omega\,dx,
\]
where we first draw $\omega$ uniformly on $[0, \nu]$ and then draw $X_{T+1}$ uniformly on $A_\omega$. Simplifying further gives
\begin{align*}
\Pr\big(f_X(X_{T+1}) \le \eta \mid f_X(X_T) = \nu\big)
&= \frac{1}{\nu}\int \mathbf{1}(0 \le \omega \le \nu)\,\frac{\int \mathbf{1}(\omega \le f_X(x) \le \eta)\,dx}{\mu(\omega)}\,d\omega \\
&= \frac{1}{\nu}\int \mathbf{1}(0 \le \omega \le \nu)\cdot\frac{\mu(\omega) - \mu(\eta)}{\mu(\omega)}\,d\omega \\
&= \frac{1}{\nu}\int_0^{\min(\eta,\nu)} \frac{\mu(\omega) - \mu(\eta)}{\mu(\omega)}\,d\omega
 = \frac{1}{\nu}\int_0^{\nu} \max\Big(1 - \frac{\mu(\eta)}{\mu(\omega)},\, 0\Big)\,d\omega,
\end{align*}
which tells us the convergence properties of the slice sampler are totally dependent on the measure $\mu$. Now for the main result, which we owe to Tierney and Mira [10] who, under boundedness conditions, established the following lemma.

Lemma 4.3 If $f_X$ and $\mathrm{supp}\, f_X$ are bounded, the 2D slice sampler is uniformly ergodic.

Proof Without loss of generality, assume that $f_X$ is bounded by 1 and that $\mathrm{supp}\, f_X = [0, 1]$. To prove uniform ergodicity, we will show that $\mathrm{supp}\, f_X$ is a small set so that we may invoke Doeblin's condition. Let
\[ \xi(\nu) = \Pr\big(f_X(X_{T+1}) \le \eta \mid f_X(X_T) = \nu\big). \]
Notice that $\omega > \eta$ implies $\mu(\eta) \ge \mu(\omega)$, so the integrand $\max(1 - \mu(\eta)/\mu(\omega), 0)$ vanishes there. Further, when $\nu \ge \eta$,
\[ \xi(\nu) = \frac{1}{\nu}\int_0^{\eta} \Big(1 - \frac{\mu(\eta)}{\mu(\omega)}\Big)\,d\omega \]
is decreasing in $\nu$, since $\nu$ appears only in the denominator outside of the integral. When $\nu \le \eta$ we recognize
\[ \xi(\nu) = \frac{1}{\nu}\int_0^{\nu} \Big(1 - \frac{\mu(\eta)}{\mu(\omega)}\Big)\,d\omega \]
as the expected value of the function $1 - \mu(\eta)/\mu(\omega)$ where $\omega \sim \mathrm{Unif}([0, \nu])$. The larger $\omega$ is, the smaller $\mu(\omega)$ is, so the integrand is decreasing in $\omega$; its average over $[0, \nu]$ is therefore decreasing in $\nu$. Therefore $\xi(\nu)$ is decreasing in $\nu$ for all $\eta$. Intuitively, it would not make sense if $\xi(\nu)$ were increasing in $\nu$, because it would imply our Markov chain is not spending enough time in the modes. If $\xi(\nu)$ were increasing in $\nu$, then the larger $\nu$ the more likely we are to end up below some threshold (away from the
mode). For the proof to be complete, we must establish bounds on the cdf of the transition kernel. The minimum occurs when $\nu = 1$:
\[ \lim_{\nu \to 1} \xi(\nu) = \int_0^{\eta} \Big(1 - \frac{\mu(\eta)}{\mu(\omega)}\Big)\,d\omega, \]
which is bounded above by $\int_0^{\eta} 1\,d\omega = \eta$ and below by 0. The maximum is given by L'Hopital's rule:
\[ \lim_{\nu \to 0} \xi(\nu) = \lim_{\nu \to 0} \frac{\int_0^{\nu} \big(1 - \frac{\mu(\eta)}{\mu(\omega)}\big)\,d\omega}{\nu} = \lim_{\nu \to 0} \Big(1 - \frac{\mu(\eta)}{\mu(\nu)}\Big) = 1 - \mu(\eta). \]
$1 - \mu(\eta)$ is bounded above by 1 and below by 0 because the support is $[0, 1]$. Once we have found nondegenerate upper and lower bounds on the cdf of the transition kernel, it is not difficult to derive Doeblin's condition. The entire support of $f_X$ is thus a small set and uniform ergodicity follows.

This proof serves to remind us that rigorous results are not easy to come by in MCMC. We must work hard to ensure the methods we employ do indeed sample from the desired target distribution. We have thus introduced the slice sampler, given a rudimentary implementation of it, and discussed its convergence properties in the simple 2D case. Next, we cover the Gibbs sampler, which extends the slice sampler's idea of alternately sampling variables conditioned on one another.

4.4 Gibbs Sampling

In this section, we consider sampling from the multivariate distribution $f(\mathbf{x}) = f(X_1, \ldots, X_n)$. Each step of the Gibbs sampling algorithm replaces a single value, say $X_i$, with a value drawn from the distribution conditioned on everything but $X_i$, namely $f_{X_i}(X_i \mid \mathbf{x}_{-i})$, where $X_i$ denotes the $i$th component of the vector $\mathbf{x}$ and $\mathbf{x}_{-i}$ denotes the vector $\mathbf{x}$ without the $i$th component. The deterministic scan Gibbs sampler is expressed rather nicely in Algorithm 5. Each Gibbs step loops through $\mathbf{x}$ and replaces each component with a sample drawn from the correct conditional distribution using the most up-to-date values.

In the context of Metropolis-Hastings, $\mathbf{x}_{-i}$ remains unchanged when we draw $X_i$, so the proposal distribution is $f_{X_i}(X_i^* \mid \mathbf{x}_{-i})$. We also have that $\mathbf{x}^*_{-i} = \mathbf{x}_{-i}$ and $f(\mathbf{x}) = f_{X_i}(X_i \mid \mathbf{x}_{-i})\, f_{\mathbf{x}_{-i}}(\mathbf{x}_{-i})$, so the Metropolis-Hastings acceptance probability is
\[ A(\mathbf{x}^*, \mathbf{x}) = \frac{f_{X_i}(X_i^* \mid \mathbf{x}^*_{-i})\, f_{\mathbf{x}_{-i}}(\mathbf{x}^*_{-i})\, f_{X_i}(X_i \mid \mathbf{x}^*_{-i})}{f_{X_i}(X_i \mid \mathbf{x}_{-i})\, f_{\mathbf{x}_{-i}}(\mathbf{x}_{-i})\, f_{X_i}(X_i^* \mid \mathbf{x}_{-i})} = 1. \]
Algorithm 5 Gibbs Sampling
1: procedure Gibbs Step
   Input: x = (X1, . . . , Xn)
   Output: x∗
2:     Draw X∗1 ∼ fX1(X1 | X2, . . . , Xn)
3:     Draw X∗2 ∼ fX2(X2 | X∗1, X3, . . . , Xn)
4:     ...
5:     Draw X∗n ∼ fXn(Xn | X∗1, X∗2, . . . , X∗n−1)
6:     return x∗ ← (X∗1, . . . , X∗n)
7: end procedure

Thus, when dealing with high dimensional distributions, if we have access to the conditional distributions (which is often the case in Bayesian networks), the Gibbs sampler never rejects a proposal.

Example 4.4 Say we wish to draw points $(X, Y)$ where $X, Y \sim \mathrm{Exp}(\lambda)$. Below, we implement a deterministic scan Gibbs sampler that draws from a bounded 2D exponential distribution. We bound/truncate the points we draw for graphical simplicity.

Exp.Bounded <- function(rate, B) {
  repeat {
    x <- rexp(1, rate)
    if (x <= B) {
      return(x)
    }
  }
}

Gibbs.Sampler <- function(M, B) {
  mat <- matrix(ncol = 2, nrow = M)
  x <- 1; y <- 1
  mat[1, ] <- c(x, y)
  for (i in 2:M) {
    x <- Exp.Bounded(y, B)   # draw X | Y = y
    y <- Exp.Bounded(x, B)   # draw Y | X = x
    mat[i, ] <- c(x, y)
  }
  return(mat)
}

mat <- Gibbs.Sampler(1000, 10)
layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))
plot(mat, main = "Joint Distribution", xlab = expression("X"[1]),
     ylab = expression("X"[2]), ylim = c(0, 10), xlim = c(0, 10))
hist(mat[, 1], main = expression("Marginal dist. of X"[1]),
     xlab = expression("X"[1]), prob = TRUE, breaks = 30)
hist(mat[, 2], main = expression("Marginal dist. of X"[2]),
     xlab = expression("X"[2]), prob = TRUE, breaks = 30)

[Figure: output of the Gibbs sampler, showing the joint distribution of (X1, X2) and the marginal histograms "Marginal dist. of X1" and "Marginal dist. of X2".]

Example 4.5 Here we use a random scan Gibbs sampler to approximate the probability that a point drawn uniformly from the unit hypersphere in 6 dimensions is at least a distance of 0.9 from the origin. Our algorithm begins at the origin and then randomly chooses a coordinate to replace. Given $(X_1, \ldots, X_n)$ in the $n$-dimensional unit hypersphere, we choose a random coordinate to update (WLOG, say $X_1$) and sample it uniformly subject to
\begin{align*}
\|\mathbf{x}\| &\le 1 \\
X_1^2 + \ldots + X_n^2 &\le 1 \\
X_1^2 &\le 1 - (X_2^2 + \ldots + X_n^2) \\
|X_1| &\le \sqrt{1 - (X_2^2 + \ldots + X_n^2)}.
\end{align*}
But square roots are always positive, so we must also flip a fair coin to determine the sign. More explicitly,
\[ X_i \mid \mathbf{x}_{-i} \sim \mathrm{Unif}\Big(-\sqrt{1 - \textstyle\sum_{j \ne i} X_j^2},\ \sqrt{1 - \textstyle\sum_{j \ne i} X_j^2}\Big). \]

Euclidean.Norm <- function(x) {
  return(sqrt(sum(x ^ 2)))
}

Gibbs.Hypersphere.Conditional <- function(x) {
  # Draw the magnitude uniformly on [0, sqrt(1 - sum(x^2))], then flip a fair
  # coin for the sign; together this is uniform on the allowed interval.
  if (runif(1) <= 0.5) {
    return(-1 * runif(1, min = 0, max = sqrt(1 - sum(x ^ 2))))
  }
  return(runif(1, min = 0, max = sqrt(1 - sum(x ^ 2))))
}

Random.Scan.Gibbs.Hypersphere <- function(x = rep(0, 6)) {
  idx <- sample(1:6, 1)                              # pick a coordinate at random
  x[idx] <- Gibbs.Hypersphere.Conditional(x[-idx])   # resample it given the others
  return(x)
}

Hypersphere.MC <- function(steps = 100, f.sample) {
  x <- rep(0, 6)                 # start at the origin
  for (i in 1:(0.1 * steps)) {   # burn-in: discard the first 10% of updates
    x <- f.sample(x)
  }
  data <- matrix(0, ncol = length(x), nrow = steps)
  for (i in 1:steps) {
    x <- f.sample(x)
    data[i, ] <- x
  }
  return(data)
}

draws <- replicate(10, Hypersphere.MC(steps = 5000, Random.Scan.Gibbs.Hypersphere))
counts <- apply(draws, MARGIN = 3, FUN = apply, 1, Euclidean.Norm)
p <- mean(counts >= 0.9)
s <- sd(counts >= 0.9) / sqrt(length(counts))

We find the probability that a uniform point drawn from the unit hypersphere in 6 dimensions is at least 0.9 from the origin is 0.469 ± 0.002.
4.5 Hamiltonian Monte Carlo

Originally introduced in 1987 as Hybrid Monte Carlo [3], what we refer to as Hamiltonian Monte Carlo (HMC) combines Hamiltonian dynamics and the Metropolis algorithm to propose large changes in state (e.g. jumping from mode to mode in a single iteration) while maintaining a high acceptance probability. HMC interprets $\mathbf{x}$ as a position and introduces an auxiliary variable to simulate Hamiltonian mechanics on phase space. But first, we introduce the basic vocabulary of Hamiltonian dynamics.

4.5.1 Hamiltonian Dynamics

Hamiltonian dynamics is a reformulation of classical Newtonian mechanics in which a particle is described by a position vector $\mathbf{x}$ and a momentum vector $\mathbf{p}$. We associate with our position and momentum a total energy
\[ H(\mathbf{x}, \mathbf{p}) = U(\mathbf{x}) + K(\mathbf{p}), \]
called the Hamiltonian of our system. $H(\mathbf{x}, \mathbf{p})$ is the sum of the potential energy associated with $\mathbf{x}$ and the kinetic energy associated with $\mathbf{p}$. We often take the kinetic energy to be
\[ K(\mathbf{p}) = \frac{1}{2}\,\|\mathbf{p}\|_2^2, \]
which corresponds to simulating Hamiltonian dynamics on a Euclidean manifold. Exploring the effects of alternate kinetic energies is beyond the scope of this text; however, one can imagine simulating the dynamics on a Riemannian manifold instead. The choice of potential energy, we will see, depends on the target distribution we wish to sample from. Given a position and momentum, the system evolves according to Hamilton's equations:
\[ \frac{d\mathbf{p}}{dt} = -\frac{\partial H}{\partial \mathbf{x}} \quad\text{and}\quad \frac{d\mathbf{x}}{dt} = \frac{\partial H}{\partial \mathbf{p}}. \]
Energy is conserved under these dynamics, so a particle whose movement is governed by Hamiltonian dynamics travels along level sets of constant energy in the joint, or phase, space. Although $H$ remains invariant, the values of $\mathbf{x}$ and $\mathbf{p}$ change over time. By simulating the dynamics of a system over a finite time period, we are able to make large changes to $\mathbf{x}$ and avoid random walk behavior.

Example 4.6 (A One-Dimensional Example) Consider the simple case in which the Hamiltonian of our system is defined as follows:
\[ H(x, p) = U(x) + K(p), \qquad U(x) = \frac{x^2}{2}, \qquad K(p) = \frac{p^2}{2}. \]
The resulting dynamics evolve according to the equations
\[ \frac{dp}{dt} = -x, \qquad \frac{dx}{dt} = p. \]
The solutions to these equations have the following form, for some constants $r$ and $a$:
\[ x(t) = r\cos(a + t), \qquad p(t) = -r\sin(a + t), \]
which corresponds to a rotation by $t$ radians clockwise around the origin in the $(x, p)$ plane.

4.5.2 HMC

If we consider the joint distribution over states $(\mathbf{x}, \mathbf{p})$ with total energy $H(\mathbf{x}, \mathbf{p})$, i.e.
\[ P(\mathbf{x}, \mathbf{p}) \propto \exp(-H(\mathbf{x}, \mathbf{p})), \]
we realize that simply starting at some point $(\mathbf{x}_0, \mathbf{p}_0)$ and running the dynamics does not sample ergodically from $P$. To see this, notice that the dynamics only explore level sets of constant energy. All states in the set $\{(\mathbf{x}, \mathbf{p}) : H(\mathbf{x}, \mathbf{p}) \ne H(\mathbf{x}_0, \mathbf{p}_0)\}$ are unreachable. To construct an ergodic Markov chain, we need to perturb the value of $H$ while keeping $P$ invariant. Conceptually, we want to jump between level sets of constant energy to explore the space. Adding a Gibbs step where we draw $\mathbf{p} \sim P(\mathbf{p} \mid \mathbf{x})$ accomplishes just this.

Our job is made even simpler by the independence of $\mathbf{x}$ and $\mathbf{p}$, which follows from the factorization of $P$ as
\[ P(\mathbf{x}, \mathbf{p}) \propto \exp(-U(\mathbf{x}))\,\exp(-K(\mathbf{p})). \]
Marginalizing out $\mathbf{x}$ yields $P(\mathbf{p}) \propto \exp(-K(\mathbf{p}))$, which implies $\mathbf{p} \sim \exp(-\|\mathbf{p}\|_2^2/2)$, which we recognize as the (unnormalized) pdf of a standard normal random variable. Applying the same thinking to $\mathbf{x}$, we see that $U(\mathbf{x}) = -\log(f_X(\mathbf{x}))$ implies $\mathbf{x} \sim f_X$, giving $\mathbf{x}$ the desired marginal distribution.

An algorithm begins to emerge: starting at some point $(\mathbf{x}, \mathbf{p})$ in phase space, simulate Hamiltonian dynamics for a finite number of steps, and end in a new state $(\mathbf{x}^*, \mathbf{p}^*)$. The proposal is accepted with probability
\[ \min\big(1,\ \exp(H(\mathbf{x}, \mathbf{p}) - H(\mathbf{x}^*, \mathbf{p}^*))\big). \]
By the conservation of energy, we should always accept such proposals. Sometimes, errors in our numeric simulation of the dynamics prevent this from happening. In our experiments we used Radford Neal's code that appears in Chapter 5 of [2] and is available online at http://www.cs.utoronto.ca/~radford/ham-mcmc-simple.

Algorithm 6 Hamiltonian Monte Carlo Sampler
1: procedure HMC
   Input: x ∼ fX
2:     Draw p ∼ Norm(0, 1), U ∼ Unif([0, 1])
3:     Simulate Hamiltonian dynamics to get (x∗, p∗) ∼ P
4:     Compute acceptance probability Pa = min(1, exp(H(x, p) − H(x∗, p∗)))
5:     if U < Pa then return x∗
6:     else return x
7:     end if
8: end procedure
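To make step 3 of Algorithm 6 concrete, here is a minimal R sketch of an HMC update that simulates the dynamics with the standard leapfrog integrator, targeting the 1D mixture from Example 4.1. The step size, the number of leapfrog steps, and the helper names are illustrative assumptions; Neal's reference implementation cited above handles these choices more carefully.

# Unnormalized 1D target from Example 4.1 and the gradient of U(x) = -log f(x).
f.target <- function(x) 0.5 * dnorm(x, -2, 1) + 0.5 * dnorm(x, 2, 1)
grad.U <- function(x) {
  # dU/dx = -f'(x) / f(x) for the two-component normal mixture above.
  d.f <- -0.5 * (x + 2) * dnorm(x, -2, 1) - 0.5 * (x - 2) * dnorm(x, 2, 1)
  return(-d.f / f.target(x))
}
U <- function(x) -log(f.target(x))

HMC.Step <- function(x, eps = 0.2, L = 20) {
  p <- rnorm(1)                       # resample the momentum (the Gibbs step)
  x.new <- x; p.new <- p
  # Leapfrog integration of Hamilton's equations.
  p.new <- p.new - 0.5 * eps * grad.U(x.new)
  for (l in 1:L) {
    x.new <- x.new + eps * p.new
    if (l < L) p.new <- p.new - eps * grad.U(x.new)
  }
  p.new <- p.new - 0.5 * eps * grad.U(x.new)
  # Metropolis correction for numerical error in the simulated dynamics.
  H.old <- U(x) + 0.5 * p^2
  H.new <- U(x.new) + 0.5 * p.new^2
  if (runif(1) < exp(H.old - H.new)) return(x.new)
  return(x)
}

chain <- numeric(2000)
chain[1] <- 0
for (t in 2:2000) chain[t] <- HMC.Step(chain[t - 1])

Because the gradient pulls each trajectory toward regions of high probability, the chain tends to move between the modes at ±2 far more readily than the random-walk sampler sketched earlier, at the cost of L gradient evaluations per proposal.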
Example 4.7 Suppose we wish to sample from the 1D bimodal distribution from Example 4.1. Although we have touted the performance of HMC in high dimensions, we restrict ourselves to a 1D density so that the joint phase space may be visualized, as below. We begin somewhere in the space (Figure 4.7a), simulate Hamiltonian dynamics for some number of steps, and accept the proposal state (Figure 4.7b). We must be cautious in our choice of $L$ and $\epsilon$, as it is not difficult to imagine an instance where the simulated particle returns to its starting position after a finite number of iterations. Choosing the stepsize, $\epsilon$, at random before simulating the particle's path can prevent this type of behavior.

Figure 4.7: Contours of the joint phase space (position X vs. momentum P), showing the starting state (a) and the accepted proposal state (b).

4.6 Summary

Thus ends our exploration of Markov Chain Monte Carlo sampling methods. As you may have noticed, algorithms that sample from high dimensional distributions are seldom written once and used forever. Instead, they require attention to detail and a tested dedication to writing correct code. Even once the practitioner has chosen the algorithm most applicable to their setting, it may require days or weeks of tuning and testing hyperparameter combinations to achieve the desired convergence. However, the four approaches to MCMC presented in this work (random walk Metropolis-Hastings, auxiliary variables, Gibbs sampling, and Hamiltonian Monte Carlo) comprise the vast majority of the practitioner's toolbox.
Chapter 5

Conclusion

We began with the question of how to generate randomness and have concentrated largely on algorithms that do just that: spit out randomness. This merely scratches the surface of the work being done on Monte Carlo methods. We can now answer real questions faced by statisticians, economists, mathematicians, and nuclear physicists. We can theorize models based on our beliefs, collect data, and determine, through simulation, whether our observations are in line with our predictions or if they can be considered "extreme", "weird", or "outlying". Prerequisite to all of this is the ability to sample uniformly from the unit interval. We are reminded of the power of the Fundamental Theorem of Simulation and how, ultimately, all of our problems are reduced to sampling uniformly.
Bibliography

[1] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

[2] Steve Brooks. Handbook of Markov Chain Monte Carlo. CRC Press/Taylor & Francis, Boca Raton, 2011.

[3] Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

[4] Christopher DuBois, Anoop Korattikara, Max Welling, and Padhraic Smyth. Approximate Slice Sampling for Bayesian Posterior Inference. In Artificial Intelligence and Statistics, 2014.

[5] Stuart Geman and Donald Geman. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (6):721–741, 1984.

[6] W. Keith Hastings. Monte Carlo Sampling Methods Using Markov Chains and their Applications. Biometrika, 57(1):97–109, 1970.

[7] Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget. arXiv preprint arXiv:1304.5299, 2013.

[8] David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

[9] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[10] Antonietta Mira and Luke Tierney. Efficiency and Convergence Properties of Slice Samplers. Scandinavian Journal of Statistics, 29(1):1–12, 2002.

[11] Radford M. Neal. Slice Sampling. Annals of Statistics, pages 705–741, 2003.

[12] Christian Robert. Monte Carlo Statistical Methods. Springer, New York, 2004.
[13] Gareth O. Roberts and Jeffrey S. Rosenthal. On Convergence Rates of Gibbs Samplers for Uniform Distributions. The Annals of Applied Probability, 8(4):1291–1302, 1998.

[14] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach (3rd Edition). Prentice Hall, 2009.