2. What do we want to achieve?
In the previous lecture, we looked at various techniques for approximating a posterior
distribution. These methods were either
- deterministic, or
- based on generating samples of i.i.d. random numbers.
However, as will have become apparent in the examples we looked at, these techniques
suffer from at least one of the following drawbacks:
- they do not generalise well to multi-dimensional problems,
- they are computationally wasteful.
In this lecture we will introduce a much more computationally efficient technique based
on iterative sampling called Markov chain Monte Carlo (MCMC), which comes at the
cost of having to generate dependent samples.
3. Markov Chains
Before we look at MCMC algorithms, we need to define what a Markov chain is, and
what properties a Markov chain should have in the context of posterior simulation.
- A Markov chain is a stochastic process where the distribution of the present state
  (at time t), θ^(t), depends only on the immediately preceding state, that is,
      p(θ^(t) | θ^(1), . . . , θ^(t−1)) = p(θ^(t) | θ^(t−1)).
- While there are many possible characteristics we can use to categorise a Markov
  chain, for posterior simulation we are most interested in the chain having the
  following properties:
  - all possible θ ∈ Θ can eventually be reached by the Markov chain,
  - the chain is aperiodic and not transient.
- While the conditions above imply that a unique stationary distribution exists,
  we still need to show that the stationary distribution is p(θ|y).
4. Metropolis algorithm
Like rejection sampling, the Metropolis algorithm is an example of an accept-reject rule.
The steps of the algorithm are:
- Define a proposal (or jumping) distribution, J(θ_a|θ_b), that is symmetric, that is,
  J(·|·) satisfies
      J(θ_a|θ_b) = J(θ_b|θ_a).
  In addition, J(·|·) must be able to eventually reach all possible θ ∈ Θ to ensure a
  stationary distribution exists.
- Draw a starting point θ^(0) from a starting distribution p^(0)(θ) such that
  p(θ^(0)|y) > 0.
- Then we iterate,
5. Metropolis algorithm
- For t = 1, 2, . . .,
  - Sample θ* from the proposal distribution J(θ* | θ^(t−1)).
  - Calculate
        r = p(θ*|y) / p(θ^(t−1)|y)
          = [p(θ*, y)/p(y)] / [p(θ^(t−1), y)/p(y)]
          = [p(y|θ*) p(θ*)] / [p(y|θ^(t−1)) p(θ^(t−1))].
  - Set
        θ^(t) = θ*        if u ≤ min(r, 1), where u ∼ U(0, 1),
        θ^(t) = θ^(t−1)   otherwise.
- Some notes:
  - If a jump θ* is rejected, that is θ^(t) = θ^(t−1), it is still counted as an iteration.
  - The transition distribution is a mixture distribution.
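The steps above can be sketched in Python (the lecture examples will use R). The target here is an assumed toy posterior, a standard normal, with a symmetric normal jumping distribution; every name and value in the code is an illustration, not part of the lecture's own examples.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(theta):
    # Unnormalised log p(theta | y): a standard normal target, assumed for illustration.
    return -0.5 * theta**2

theta = 5.0  # starting point theta^(0) with p(theta | y) > 0
draws = []
for t in range(20_000):
    proposal = theta + rng.normal(0.0, 1.0)       # symmetric: J(a|b) = J(b|a)
    log_r = log_target(proposal) - log_target(theta)
    if np.log(rng.uniform()) <= min(log_r, 0.0):  # u <= min(r, 1) on the log scale
        theta = proposal
    # a rejected jump keeps theta^(t) = theta^(t-1) but still counts as an iteration
    draws.append(theta)

draws = np.array(draws[5_000:])   # discard burn-in
print(draws.mean(), draws.std())  # both should be near the target's 0 and 1
```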
6. Metropolis-Hastings algorithm
As might be guessed from the name, the Metropolis-Hastings algorithm is an extension
of the Metropolis algorithm. Metropolis-Hastings is therefore also an accept-reject
algorithm.
- Like the Metropolis algorithm, we first need to define a proposal (or jumping)
  distribution, J(θ_a|θ_b). Also like the Metropolis algorithm, J(·|·) must be chosen in
  such a way as to ensure a stationary distribution exists.
- However, there is now no requirement that the jumping distribution be symmetric.
  This also means the Metropolis algorithm is a special case of the
  Metropolis-Hastings algorithm.
- Draw a starting point θ^(0) from a starting distribution p^(0)(θ) such that
  p(θ^(0)|y) > 0.
- Then we iterate,
7. Metropolis-Hastings algorithm
- For t = 1, 2, . . .,
  - Sample θ* from the proposal distribution J(θ* | θ^(t−1)).
  - Calculate
        r = [p(θ*|y) / J(θ* | θ^(t−1))] / [p(θ^(t−1)|y) / J(θ^(t−1) | θ*)]
          = [p(θ*, y) / (p(y) J(θ* | θ^(t−1)))] / [p(θ^(t−1), y) / (p(y) J(θ^(t−1) | θ*))]
          = [p(y|θ*) p(θ*) / J(θ* | θ^(t−1))] / [p(y|θ^(t−1)) p(θ^(t−1)) / J(θ^(t−1) | θ*)].
  - Set
        θ^(t) = θ*        if u ≤ min(r, 1), where u ∼ U(0, 1),
        θ^(t) = θ^(t−1)   otherwise.
- Some notes:
  - If a jump θ* is rejected, that is θ^(t) = θ^(t−1), it is still counted as an iteration.
  - The transition distribution is again a mixture distribution.
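As a sketch of the asymmetric case (in Python; the lecture examples will use R), suppose the posterior were an Exponential(1) distribution on θ > 0 and we jump with a multiplicative log-normal random walk, which is not symmetric; both choices are assumptions for illustration. For this particular jumping rule, the correction J(θ^(t−1)|θ*) / J(θ*|θ^(t−1)) reduces to θ*/θ^(t−1).

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(theta):
    # Unnormalised log p(theta | y): Exponential(1) on theta > 0, assumed for illustration.
    return -theta

theta = 1.0
draws = []
for t in range(40_000):
    proposal = theta * np.exp(rng.normal(0.0, 0.8))  # log-normal multiplicative jump
    # log r = log p(theta*|y) - log p(theta|y) + log J(theta|theta*) - log J(theta*|theta);
    # for this jumping rule the two J terms reduce to log(theta*/theta).
    log_r = (log_target(proposal) - log_target(theta)
             + np.log(proposal) - np.log(theta))
    if np.log(rng.uniform()) <= min(log_r, 0.0):
        theta = proposal
    draws.append(theta)

draws = np.array(draws[10_000:])  # discard burn-in
print(draws.mean())               # an Exponential(1) target has mean 1
```

Dropping the J terms here would bias the chain: without the correction, the multiplicative walk over-visits large values of θ.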
8. Motivation to consider a Gibbs sampler
- In most real-life problems, we will not be attempting to make inference on a single
  parameter only, as we have done in most examples to date. Rather, we will want
  to make inference on a set of parameters, θ = (θ_1, . . . , θ_K).
- In this case, it may be possible to integrate out the parameters that are not of
  immediate interest. We did this when estimating the parameters of the normal
  distribution in Assignment 1 and Lab 2. Usually, however, integrating out
  parameters is not a straightforward task.
- More commonly, we simply cannot analytically determine the posterior. What we
  can easily obtain is the joint distribution of parameters and data,
      p(θ, y) = p(θ_1, . . . , θ_K, y) = p(y | θ_1, . . . , θ_K) p(θ_1, . . . , θ_K).
9. Conditional posteriors
- Normally, we would try to directly determine p(θ|y) from p(θ, y). However, the
  rules of probability mean there is nothing to stop us from instead finding
      p(θ_1 | θ_2, . . . , θ_K, y)
          ⋮
      p(θ_K | θ_1, . . . , θ_{K−1}, y),
  the set of conditional posteriors. Note the "conditional" here refers to the fact that
  we are finding the posterior of θ_i conditional on all other parameters as well as
  the data.
- The appeal of conditional posteriors:
  - A sequence of draws from low-dimensional spaces may be less computationally
    intensive than a single draw from a high-dimensional space.
  - Even if p(θ|y) is unknown analytically, p(θ_i | θ_{−i}, y) may still be known.
10. The Gibbs sampler
- In the Gibbs sampler, it is assumed that it is possible to directly sample from the
  set of conditional posteriors p(θ_i | θ_{−i}, y), 1 ≤ i ≤ K.
- First, we define a starting point θ^(0) = (θ_1^(0), . . . , θ_K^(0)) from some starting
  distribution.
- Then we iterate. However, unlike in the Metropolis(-Hastings) algorithm, we need
  to iterate over the components of θ within each iteration t.
  - For t = 1, 2, . . .,
    - For i = 1, . . . , K, draw θ_i^(t) from the conditional posterior
          p(θ_i^(t) | θ*_{−i}, y),
      where θ*_{−i} = (θ_1^(t), . . . , θ_{i−1}^(t), θ_{i+1}^(t−1), . . . , θ_K^(t−1)).
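The sweep above can be sketched in Python (the lecture examples will use R). The toy "posterior" is an assumed bivariate normal with correlation ρ, chosen because both conditional posteriors are then known normals we can draw from directly.

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8                      # assumed correlation of the toy bivariate normal target
cond_sd = np.sqrt(1 - rho**2)  # conditional s.d. of each component

theta1, theta2 = 5.0, -5.0     # arbitrary starting point theta^(0)
draws = []
for t in range(20_000):
    # Within iteration t, sweep over the components, always conditioning
    # on the most recently drawn value of the other component.
    theta1 = rng.normal(rho * theta2, cond_sd)  # draw from p(theta1 | theta2, y)
    theta2 = rng.normal(rho * theta1, cond_sd)  # draw from p(theta2 | theta1, y)
    draws.append((theta1, theta2))

draws = np.array(draws[5_000:])  # discard burn-in
print(draws.mean(axis=0), np.corrcoef(draws.T)[0, 1])  # near (0, 0) and rho
```

Note that every draw is accepted; unlike Metropolis(-Hastings), there is no accept-reject step, a point we return to below.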
11. How do we know that we will converge to p(θ|y)?
- While we have introduced various MCMC algorithms, we have not shown that any
  of them will converge to p(θ|y). To prove convergence, consider the joint
  distribution of two successive draws θ_a, θ_b from a Metropolis-Hastings algorithm.
  To help, assume p(θ_b|y)/J(θ_b|θ_a) ≥ p(θ_a|y)/J(θ_a|θ_b).
- First assume θ^(t−1) = θ_a, θ^(t) = θ_b. Then the joint distribution is
      p(θ^(t−1) = θ_a, θ^(t) = θ_b) = p(θ^(t−1) = θ_a) p(θ^(t) = θ_b | θ^(t−1) = θ_a)
                                    = p(θ^(t−1) = θ_a) J(θ_b | θ_a),
  since r ≥ 1 in this case, so θ_b ∼ J(θ_b|θ_a) is accepted with probability 1.
- Now change the order of events.
12. How do we know that we will converge to p(θ|y)?
- The joint distribution of θ^(t) = θ_a, θ^(t−1) = θ_b is
      p(θ^(t) = θ_a, θ^(t−1) = θ_b) = p(θ^(t−1) = θ_b) p(θ^(t) = θ_a | θ^(t−1) = θ_b)
        = p(θ^(t−1) = θ_b) J(θ_a | θ_b) [p(θ_a|y)/J(θ_a|θ_b)] / [p(θ_b|y)/J(θ_b|θ_a)]
        = p(θ_a|y) J(θ_b | θ_a) p(θ^(t−1) = θ_b) / p(θ_b|y),
  since flipping the order of events is equivalent to flipping the density ratio, so r
  must now be ≤ 1.
13. How do we know that we will converge to p(θ|y)?
- From our choice of J(·|·), we know a unique stationary distribution exists. Now
  assume θ^(t−1) is drawn from the posterior. This means that
      p(θ^(t−1) = θ_a, θ^(t) = θ_b) = p(θ^(t) = θ_a, θ^(t−1) = θ_b),
  and
      p(θ^(t) = θ_a) = ∫ p(θ^(t) = θ_a, θ^(t−1) = θ_b) dθ_b
                     = p(θ_a|y) ∫ J(θ_b | θ_a) [p(θ_b|y) / p(θ_b|y)] dθ_b
                     = p(θ_a|y) ∫ J(θ_b | θ_a) dθ_b = p(θ_a|y),
  meaning we can conclude θ^(t) is also drawn from the posterior, and thus p(θ|y) is
  the stationary distribution.
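The argument can also be checked numerically. The sketch below (in Python; the three-point target and the jumping distribution are both made up) runs Metropolis-Hastings on a small discrete state space and compares the chain's long-run state frequencies with the target, which should agree if the target really is the stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(4)

target = np.array([0.2, 0.5, 0.3])   # p(theta | y) on states {0, 1, 2}, made up
J = np.array([[0.1, 0.6, 0.3],       # J[a, b] = J(b | a): an asymmetric jumping rule
              [0.4, 0.2, 0.4],
              [0.3, 0.5, 0.2]])

state = 0
counts = np.zeros(3)
for t in range(200_000):
    prop = rng.choice(3, p=J[state])
    # r = [p(theta*|y)/J(theta*|theta)] / [p(theta|y)/J(theta|theta*)]
    r = (target[prop] / J[state, prop]) / (target[state] / J[prop, state])
    if rng.uniform() <= min(r, 1.0):
        state = prop
    counts[state] += 1

freq = counts / counts.sum()
print(freq)  # should be close to target
```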
14. But what about the convergence of the Gibbs sampler?
- We have shown the stationary distribution of the Metropolis-Hastings algorithm is
  the posterior, p(θ|y).
- Implicitly, we have shown this is also true for the Metropolis algorithm, as the
  Metropolis algorithm is just a special case of the Metropolis-Hastings algorithm.
- But what about the Gibbs sampler?
- It turns out the Gibbs sampler is another special case of the Metropolis-Hastings
  algorithm. If we view iteration t(i) as the single iteration t′ = tK + i − K, rather
  than an iteration within an iteration, with candidate
      θ* = (θ_1^(t), . . . , θ_{i−1}^(t), θ_i*, θ_{i+1}^(t−1), . . . , θ_K^(t−1))
  and current draw
      θ^(t′−1) = (θ_1^(t), . . . , θ_{i−1}^(t), θ_i^(t−1), θ_{i+1}^(t−1), . . . , θ_K^(t−1)),
  then p(θ_i* | ·) = J(θ* | θ^(t′−1)) is a valid, and usually non-symmetric, jumping
  distribution.
15. The Gibbs sampler is a special case of Metropolis-Hastings
- Moreover, the Gibbs sampler is an example of the Metropolis-Hastings algorithm
  where all moves will be accepted, as shown below:
      r = [p(θ*|y) / J(θ* | θ^(t′−1))] / [p(θ^(t′−1)|y) / J(θ^(t′−1) | θ*)]
        = [p(θ_1^(t), . . . , θ_{i−1}^(t), θ_i*, θ_{i+1}^(t−1), . . . , θ_K^(t−1) | y)
           / p(θ_i* | θ_1^(t), . . . , θ_{i−1}^(t), θ_{i+1}^(t−1), . . . , θ_K^(t−1), y)]
        ÷ [p(θ_1^(t), . . . , θ_{i−1}^(t), θ_i^(t−1), θ_{i+1}^(t−1), . . . , θ_K^(t−1) | y)
           / p(θ_i^(t−1) | θ_1^(t), . . . , θ_{i−1}^(t), θ_{i+1}^(t−1), . . . , θ_K^(t−1), y)]
        = p(θ_1^(t), . . . , θ_{i−1}^(t), θ_{i+1}^(t−1), . . . , θ_K^(t−1) | y)
          / p(θ_1^(t), . . . , θ_{i−1}^(t), θ_{i+1}^(t−1), . . . , θ_K^(t−1) | y)
        = 1.
16. Comments
- As we are using computationally intensive techniques, what might we want to do?
  Minimise computational cost wherever possible.
  - For example, in previous lectures we have encountered the sufficiency principle.
    You may therefore want to compress the data down to just the sufficient statistics.
  - Addition and subtraction are cheaper operations for a computer than multiplication
    and division. Rejection sampling, importance sampling, and the Metropolis(-Hastings)
    algorithms all require density ratios. If we use (natural) log-densities instead, these
    ratios become differences, which are quicker to compute. We can then exponentiate
    when needed.
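A small Python illustration of why the log scale matters (the log-density values are made up): for realistic data sizes the densities themselves underflow to zero in floating point, while their logs remain perfectly usable.

```python
import numpy as np

# Made-up log-densities of the size seen with large data sets.
log_p_current = -1500.2   # log p(theta^(t-1) | y)
log_p_proposal = -1498.7  # log p(theta* | y)

# Computing the ratio directly underflows: both densities are 0.0 in floating point.
with np.errstate(invalid="ignore"):
    naive = np.exp(log_p_proposal) / np.exp(log_p_current)  # 0.0 / 0.0 -> nan

# On the log scale the ratio is a simple difference, and stays well defined.
log_r = log_p_proposal - log_p_current
accept = np.exp(min(log_r, 0.0))  # min(r, 1), computed via min(log r, 0)
print(naive, accept)
```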
17. Some more comments
- Note there is nothing to stop us mixing techniques for a particular problem.
  For example, you may be faced with a situation where the parameters split as
  θ = (θ_C, θ_NC), such that p(θ_i | θ_{−i}, y) can be directly sampled from only if
  θ_i ⊆ θ_C. However, you can combine a Gibbs sampler with Metropolis(-Hastings)
  steps, such that
      θ_i^(t) ∼ p(θ_i^(t) | θ*_{−i}, y)                                    if θ_i ⊆ θ_C,
      θ_i^(t) = θ_i*       if u ≤ min(r, 1), where u ∼ U(0, 1), θ_i* ∼ g(θ_i* | ·),
      θ_i^(t) = θ_i^(t−1)  otherwise,                                      if θ_i ⊄ θ_C,
  with g(θ_i* | ·) being a jumping rule and r defined as in the Metropolis(-Hastings)
  algorithm.
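A sketch of this hybrid in Python (the lecture examples will use R), on an assumed bivariate normal target with correlation ρ: θ1 plays the role of θ_C and is drawn directly from its conditional posterior, while θ2 plays the role of θ_NC and is updated with a Metropolis step, pretending its conditional were unavailable in closed form. The target, ρ, and the jumping rule g are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
rho = 0.8                      # assumed correlation of the toy bivariate normal target
cond_sd = np.sqrt(1 - rho**2)

def log_cond2(theta2, theta1):
    # log p(theta2 | theta1, y) up to a constant, used only in the Metropolis step
    return -0.5 * ((theta2 - rho * theta1) / cond_sd) ** 2

theta1, theta2 = 0.0, 0.0
draws = []
for t in range(30_000):
    # Gibbs block: theta1 is in theta_C, so draw directly from p(theta1 | theta2, y).
    theta1 = rng.normal(rho * theta2, cond_sd)
    # Metropolis block: theta2 is in theta_NC, so jump with a symmetric rule g.
    prop = theta2 + rng.normal(0.0, 1.0)
    log_r = log_cond2(prop, theta1) - log_cond2(theta2, theta1)
    if np.log(rng.uniform()) <= min(log_r, 0.0):
        theta2 = prop
    draws.append((theta1, theta2))

draws = np.array(draws[5_000:])  # discard burn-in
print(draws.mean(axis=0), np.corrcoef(draws.T)[0, 1])  # near (0, 0) and rho
```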
18. Conclusion
- In the next lecture, we will look at examples of the Metropolis,
  Metropolis-Hastings and Gibbs sampling algorithms in R.
- The code used in these examples will be put up before the lecture,
  so you can run the code during the lecture.