MAST90125: Bayesian Statistical Learning
Semester 2 2021
Lecture 9: Posterior simulation
John Holmes
What do we want to achieve

In the previous lecture, we looked at various techniques for approximating a posterior
distribution. These methods were either
- deterministic, or
- based on generating samples of i.i.d. random numbers.
However, as will have become apparent from the examples we looked at, these techniques
suffer from at least one of two drawbacks:
- they do not generalise well to multi-dimensional problems,
- they are computationally wasteful.
In this lecture we introduce a much more computationally efficient technique based on
iterative sampling, called Markov chain Monte Carlo (MCMC), which comes at the cost of
having to generate dependent samples.
Markov Chains

Before we look at MCMC algorithms, we need to define what a Markov chain is, and
what properties a Markov chain should have in the context of posterior simulation.
- A Markov chain is a stochastic process in which the distribution of the present state
  (at time t), θ(t), depends only on the immediately preceding state, that is,
  p(θ(t) | θ(1), . . . , θ(t−1)) = p(θ(t) | θ(t−1)).
- While there are many possible characteristics we can use to categorise a Markov
  chain, for posterior simulation we are most interested in the chain having the
  following properties:
  - all possible θ ∈ Θ can eventually be reached by the chain;
  - the chain is aperiodic and not transient.
- While the conditions above imply a unique stationary distribution exists, we still
  need to show that the stationary distribution is p(θ|y). (A small numerical
  illustration of a chain settling into its stationary distribution follows below.)
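
As a small aside, the following R snippet is not part of the original slides; the two-state
transition matrix is an arbitrary, hypothetical choice, used only to illustrate how a chain
that can reach every state, is aperiodic and is not transient converges to a unique
stationary distribution regardless of where it starts.

```r
## Two-state Markov chain with an arbitrary transition matrix P.
## P[i, j] = Pr(next state = j | current state = i).
P <- matrix(c(0.9, 0.1,
              0.3, 0.7), nrow = 2, byrow = TRUE)

p0 <- c(1, 0)                     # start with probability 1 in state 1
pt <- p0
for (t in 1:50) pt <- pt %*% P    # evolve the marginal distribution of the state
print(pt)                         # approximately (0.75, 0.25)

## The stationary distribution solves pi = pi %*% P; here it is (0.75, 0.25),
## and the iterates above converge to it whatever starting distribution is used.
```
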
Metropolis algorithm

Like rejection sampling, the Metropolis algorithm is an example of an accept-reject rule.
The steps of the algorithm are:
- Define a proposal (or jumping) distribution, J(θa|θb), that is symmetric, that is,
  J(·|·) satisfies
  J(θa|θb) = J(θb|θa).
  In addition, J(·|·) must be able to eventually reach all possible θ ∈ Θ to ensure a
  stationary distribution exists.
- Draw a starting point θ(0) from a starting distribution p(0)(θ) such that
  p(θ(0)|y) > 0.
- Then we iterate as follows.
Metropolis algorithm

- For t = 1, 2, . . .:
  - Sample θ∗ from the proposal distribution J(θ∗|θ(t−1)).
  - Calculate
    r = p(θ∗|y) / p(θ(t−1)|y)
      = [p(θ∗, y)/p(y)] / [p(θ(t−1), y)/p(y)]
      = p(y|θ∗)p(θ∗) / [p(y|θ(t−1))p(θ(t−1))].
  - Set
    θ(t) = θ∗        if u ≤ min(r, 1), where u ∼ U(0, 1),
    θ(t) = θ(t−1)    otherwise.
- Some notes:
  - If a jump θ∗ is rejected, that is θ(t) = θ(t−1), it is still counted as an iteration.
  - The transition distribution is a mixture distribution.
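
To make the recipe above concrete, here is a minimal R sketch of the Metropolis
algorithm. It is not from the lecture materials: the Bernoulli model, the flat Beta(1, 1)
prior and the proposal standard deviation of 0.1 are hypothetical choices made purely for
illustration.

```r
## Target: p(theta | y) for y_i ~ Bernoulli(theta) with a Beta(1, 1) prior, so the
## unnormalised posterior is theta^s * (1 - theta)^(n - s).
set.seed(1)
y <- rbinom(50, size = 1, prob = 0.3)   # simulated (hypothetical) data
s <- sum(y); n <- length(y)

post_unnorm <- function(theta) {        # p(y | theta) p(theta), unnormalised
  if (theta <= 0 || theta >= 1) return(0)
  theta^s * (1 - theta)^(n - s)
}

T_iter <- 5000
theta <- numeric(T_iter)
theta[1] <- 0.5                         # starting point with p(theta | y) > 0
for (t in 2:T_iter) {
  theta_star <- rnorm(1, mean = theta[t - 1], sd = 0.1)   # symmetric proposal
  r <- post_unnorm(theta_star) / post_unnorm(theta[t - 1])
  theta[t] <- if (runif(1) <= min(r, 1)) theta_star else theta[t - 1]
}
mean(theta[-(1:1000)])   # posterior mean estimate after discarding a burn-in
```

Note that rejected proposals still advance t, exactly as in the algorithm above, and the
kept value is a copy of the previous state.
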
Metropolis-Hastings algorithm

As might be guessed from the name, the Metropolis-Hastings algorithm is an extension
of the Metropolis algorithm. Therefore Metropolis-Hastings is also an example of an
accept-reject algorithm.
- Like the Metropolis algorithm, we first need to define a proposal (or jumping)
  distribution, J(θa|θb). Also like the Metropolis algorithm, J(·|·) must be chosen in
  such a way as to ensure a stationary distribution exists.
- However, there is now no requirement that the jumping distribution be symmetric.
  This also means the Metropolis algorithm is a special case of the
  Metropolis-Hastings algorithm.
- Draw a starting point θ(0) from a starting distribution p(0)(θ) such that
  p(θ(0)|y) > 0.
- Then we iterate as follows.
Metropolis-Hastings algorithm

- For t = 1, 2, . . .:
  - Sample θ∗ from the proposal distribution J(θ∗|θ(t−1)).
  - Calculate
    r = [p(θ∗|y)/J(θ∗|θ(t−1))] / [p(θ(t−1)|y)/J(θ(t−1)|θ∗)]
      = [p(θ∗, y)/(p(y)J(θ∗|θ(t−1)))] / [p(θ(t−1), y)/(p(y)J(θ(t−1)|θ∗))]
      = [p(y|θ∗)p(θ∗)/J(θ∗|θ(t−1))] / [p(y|θ(t−1))p(θ(t−1))/J(θ(t−1)|θ∗)].
  - Set
    θ(t) = θ∗        if u ≤ min(r, 1), where u ∼ U(0, 1),
    θ(t) = θ(t−1)    otherwise.
- Some notes:
  - If a jump θ∗ is rejected, that is θ(t) = θ(t−1), it is still counted as an iteration.
  - The transition distribution is again a mixture distribution.
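
Below is a minimal R sketch of Metropolis-Hastings with a genuinely asymmetric proposal,
so the Hastings correction matters. It is not from the lecture materials: the Poisson
model, the Gamma(2, 1) prior and the log-normal random-walk proposal are hypothetical
choices.

```r
## Target: p(lambda | y) for y_i ~ Poisson(lambda), prior lambda ~ Gamma(2, 1).
## Proposal: lambda* ~ Lognormal(log(lambda^(t-1)), 0.3), which is not symmetric,
## so the proposal densities J(. | .) appear explicitly in r.
set.seed(2)
y <- rpois(40, lambda = 4)              # simulated (hypothetical) data

post_unnorm <- function(lam) {          # p(y | lambda) p(lambda), unnormalised
  if (lam <= 0) return(0)
  prod(dpois(y, lam)) * dgamma(lam, shape = 2, rate = 1)
}

T_iter <- 5000
lam <- numeric(T_iter)
lam[1] <- mean(y)
for (t in 2:T_iter) {
  lam_star <- rlnorm(1, meanlog = log(lam[t - 1]), sdlog = 0.3)
  r <- (post_unnorm(lam_star)   / dlnorm(lam_star,   meanlog = log(lam[t - 1]), sdlog = 0.3)) /
       (post_unnorm(lam[t - 1]) / dlnorm(lam[t - 1], meanlog = log(lam_star),   sdlog = 0.3))
  lam[t] <- if (runif(1) <= min(r, 1)) lam_star else lam[t - 1]
}
mean(lam[-(1:1000)])   # compare with conjugate posterior mean (2 + sum(y)) / (1 + length(y))
```

Because the log-normal proposal is not symmetric, dropping the two J(·|·) terms here
would give a sampler whose stationary distribution is no longer the posterior.
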
Motivation to consider a Gibbs sampler

- In most real-life problems, we will not be attempting to make inference on only a
  single parameter, as we have done in most examples to date. Rather, we will want
  to make inference on a set of parameters, θ = (θ1, . . . , θK).
- In this case, it may be possible to integrate out the parameters that are not of
  immediate interest. We did this when estimating the parameters of the normal
  distribution in Assignment 1 and Lab 2. Usually, however, integrating out
  parameters is not a straightforward task.
- More commonly, we simply cannot determine the posterior analytically. What we
  can easily obtain is the joint distribution of parameters and data,
  p(θ, y) = p(θ1, . . . , θK, y) = p(y|θ1, . . . , θK) p(θ1, . . . , θK).
Conditional posteriors

- Normally, we would try to determine p(θ|y) directly from p(θ, y). However, the
  rules of probability mean there is nothing to stop us from instead finding
  p(θ1 | θ2, . . . , θK, y)
  ⋮
  p(θK | θ1, . . . , θK−1, y),
  the set of conditional posteriors. Note that 'conditional' here refers to the fact that
  we are finding the posterior of θi conditional on all other parameters as well as the data.
- The appeal of conditional posteriors is that
  - a sequence of draws from low-dimensional spaces may be less computationally
    intensive than a single draw from a high-dimensional space;
  - even if p(θ|y) is unknown analytically, p(θi|θ−i, y) may still be known.
The Gibbs sampler

- In the Gibbs sampler, it is assumed that it is possible to sample directly from the set
  of conditional posteriors p(θi|θ−i, y), 1 ≤ i ≤ K.
- First, we define a starting point θ(0) = (θ1(0), . . . , θK(0)) from some starting
  distribution.
- Then we iterate. However, unlike in the Metropolis(-Hastings) algorithm, we need
  to iterate over the components of θ within each iteration t (see the sketch below):
- For t = 1, 2, . . .:
  - For i = 1, . . . , K, draw θi(t) from the conditional posterior
    p(θi(t) | θ∗−i, y),
    where θ∗−i = (θ1(t), . . . , θi−1(t), θi+1(t−1), . . . , θK(t−1)).
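
The following is a minimal R sketch of a Gibbs sampler, again not taken from the lecture
materials: the normal model with a N(m0, v0) prior on the mean and an Inverse-Gamma(a0, b0)
prior on the variance is a hypothetical choice for which both full conditionals happen to
be standard distributions.

```r
## Model: y_i ~ N(mu, sigma2), mu ~ N(m0, v0), sigma2 ~ Inverse-Gamma(a0, b0).
## Then p(mu | sigma2, y) is normal and p(sigma2 | mu, y) is inverse-gamma,
## so each component can be drawn directly within every iteration t.
set.seed(3)
y <- rnorm(30, mean = 2, sd = 1.5)      # simulated (hypothetical) data
n <- length(y); ybar <- mean(y)
m0 <- 0; v0 <- 100; a0 <- 2; b0 <- 2    # hypothetical prior hyperparameters

T_iter <- 5000
mu <- sigma2 <- numeric(T_iter)
mu[1] <- ybar; sigma2[1] <- var(y)      # starting point
for (t in 2:T_iter) {
  ## draw mu given the current sigma2: a normal full conditional
  v_n   <- 1 / (1 / v0 + n / sigma2[t - 1])
  m_n   <- v_n * (m0 / v0 + n * ybar / sigma2[t - 1])
  mu[t] <- rnorm(1, mean = m_n, sd = sqrt(v_n))
  ## draw sigma2 given the freshly drawn mu: an inverse-gamma full conditional
  sigma2[t] <- 1 / rgamma(1, shape = a0 + n / 2,
                             rate  = b0 + sum((y - mu[t])^2) / 2)
}
c(mean(mu[-(1:1000)]), mean(sigma2[-(1:1000)]))   # posterior mean estimates
```

As with the Metropolis samplers, discarding an initial burn-in and averaging the remaining
draws approximates the posterior means.
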
How do we know that we will converge to p(θ|y)?

- While we have introduced various MCMC algorithms, we have not shown that any
  of them converge to p(θ|y). To prove convergence, consider the joint distribution of
  two successive draws θa, θb from a Metropolis-Hastings algorithm. To help,
  assume p(θb|y)/J(θb|θa) ≥ p(θa|y)/J(θa|θb).
- First assume θ(t−1) = θa, θ(t) = θb. Then the joint distribution is
  p(θ(t−1) = θa, θ(t) = θb) = p(θ(t−1) = θa) p(θ(t) = θb | θ(t−1) = θa)
                            = p(θ(t−1) = θa) J(θb|θa),
  since r ≥ 1, so θb ∼ J(θb|θa) is accepted with probability 1.
- Now change the order of events.
How do we know that we will converge to p(θ|y)?

- The joint distribution of θ(t) = θa, θ(t−1) = θb is
  p(θ(t) = θa, θ(t−1) = θb) = p(θ(t−1) = θb) p(θ(t) = θa | θ(t−1) = θb)
                            = p(θ(t−1) = θb) J(θa|θb) × [p(θa|y)/J(θa|θb)] / [p(θb|y)/J(θb|θa)]
                            = p(θa|y) J(θb|θa) × p(θ(t−1) = θb) / p(θb|y),
  since flipping the event order is equivalent to flipping the density ratio, so r must
  now be ≤ 1, and the proposed move is accepted with probability r.
How do we know that we will converge to p(θ|y)?

- From our choice of J(·|·), we know a unique stationary distribution exists. Now
  assume θ(t−1) is drawn from the posterior. This means that
  p(θ(t−1) = θa, θ(t) = θb) = p(θ(t) = θa, θ(t−1) = θb),
  and
  p(θ(t) = θa) = ∫ p(θ(t) = θa, θ(t−1) = θb) dθb
               = p(θa|y) ∫ J(θb|θa) [p(θb|y)/p(θb|y)] dθb
               = p(θa|y) ∫ J(θb|θa) dθb = p(θa|y),
  meaning we can conclude that θ(t) is also drawn from the posterior, and thus p(θ|y)
  is the stationary distribution.
But what about the convergence of the Gibbs sampler?

- We have shown that the stationary distribution of the Metropolis-Hastings
  algorithm is the posterior, p(θ|y).
- Implicitly, we have shown this is also true for the Metropolis algorithm, as the
  Metropolis algorithm is just a special case of the Metropolis-Hastings algorithm.
- But what about the Gibbs sampler?
- It turns out the Gibbs sampler is another special case of the Metropolis-Hastings
  algorithm. If we view step i of iteration t as its own iteration, t′ = tK + i − K,
  rather than as an iteration within an iteration, with candidate
  θ∗ = (θ1(t), . . . , θi−1(t), θ∗i, θi+1(t−1), . . . , θK(t−1))
  and current draw
  θ(t′−1) = (θ1(t), . . . , θi−1(t), θi(t−1), θi+1(t−1), . . . , θK(t−1)),
  then p(θ∗i | ·) = J(θ∗|θ(t′−1)) is a valid and usually non-symmetric jumping
  distribution.
The Gibbs sampler is a special case of Metropolis-Hastings

- Moreover, the Gibbs sampler is an example of the Metropolis-Hastings algorithm
  where all moves will be accepted, as shown below:

  r = [p(θ∗|y)/J(θ∗|θ(t′−1))] / [p(θ(t′−1)|y)/J(θ(t′−1)|θ∗)]

    = [p(θ1(t), . . . , θi−1(t), θ∗i, θi+1(t−1), . . . , θK(t−1) | y) / p(θ∗i | θ1(t), . . . , θi−1(t), θi+1(t−1), . . . , θK(t−1), y)]
      / [p(θ1(t), . . . , θi−1(t), θi(t−1), θi+1(t−1), . . . , θK(t−1) | y) / p(θi(t−1) | θ1(t), . . . , θi−1(t), θi+1(t−1), . . . , θK(t−1), y)]

    = p(θ1(t), . . . , θi−1(t), θi+1(t−1), . . . , θK(t−1) | y) / p(θ1(t), . . . , θi−1(t), θi+1(t−1), . . . , θK(t−1) | y)

    = 1
Comments

- As we are using computationally intensive techniques, what might we want to do?
  Minimise computational cost wherever possible.
- For example, in previous lectures we have encountered the sufficiency principle.
  Therefore you may want to compress the data down to just the sufficient statistics.
- Addition/subtraction are cheaper operations for a computer than multiplication/
  division. Rejection sampling, importance sampling, and Metropolis(-Hastings)
  algorithms all require density ratios. If we work with (natural) log-densities instead,
  these ratios become differences, which are quicker to compute; we exponentiate only
  when needed. A small example of carrying out the accept/reject step on the log scale
  is given below.
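
As a small illustration of the last point, here is a sketch of the Metropolis accept/reject
step carried out entirely on the log scale. It is not code from the course: the Bernoulli
log-likelihood, the flat prior and the helper function metropolis_step are hypothetical
choices.

```r
## Work with log p(y | theta) + log p(theta) and compare log(u) with log(r),
## so a product of many small densities becomes a sum of logs.
log_post <- function(theta, y) {
  if (theta <= 0 || theta >= 1) return(-Inf)            # outside the support
  sum(dbinom(y, size = 1, prob = theta, log = TRUE))    # log-likelihood as a sum
}                                                       # flat prior adds only a constant

metropolis_step <- function(theta_curr, y, sd_prop = 0.1) {
  theta_star <- rnorm(1, theta_curr, sd_prop)           # symmetric proposal
  log_r <- log_post(theta_star, y) - log_post(theta_curr, y)  # ratio becomes a difference
  if (log(runif(1)) <= min(log_r, 0)) theta_star else theta_curr
}

## One update from theta = 0.5, given Bernoulli data y:
## theta_new <- metropolis_step(0.5, y)
```

Comparing log(u) with min(log r, 0) is exactly equivalent to comparing u with min(r, 1),
but it avoids underflow when the likelihood involves many observations.
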
Some more comments

- Note there is nothing to stop us mixing techniques for a particular problem.
  For example, you may be faced with a situation where the parameters split as
  θ = (θC, θNC), such that p(θi|θ−i, y) can be sampled from directly only if θi ⊆ θC.
  However, you can combine a Gibbs sampler with Metropolis(-Hastings), such that
  - if θi ⊆ θC:  θi(t) ∼ p(θi(t) | θ∗−i, y);
  - if θi ⊈ θC:  θi(t) = θ∗i if u ≤ min(r, 1), where u ∼ U(0, 1) and θ∗i ∼ g(θ∗i | ·),
    and θi(t) = θi(t−1) otherwise;
  with g(θ∗i | ·) being a jumping rule and r defined as in the Metropolis(-Hastings)
  algorithm. A minimal R sketch of such a mixed scheme is given below.
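
The following is a minimal sketch of such a mixed scheme, often called
Metropolis-within-Gibbs. It is not from the lecture materials: the normal likelihood, the
N(0, 100) prior on the mean and the half-Cauchy prior on the standard deviation are
hypothetical choices made so that one conditional is standard (Gibbs step) while the
other is not (Metropolis step).

```r
## Model: y_i ~ N(mu, sigma^2), mu ~ N(0, 100), sigma ~ half-Cauchy(0, 1).
## p(mu | sigma, y) is normal, so mu gets a Gibbs draw; the conditional of sigma
## is non-standard, so sigma gets a symmetric random-walk Metropolis update.
set.seed(4)
y <- rnorm(30, mean = 1, sd = 2)        # simulated (hypothetical) data
n <- length(y); ybar <- mean(y)

log_cond_sigma <- function(sigma, mu) { # log p(sigma | mu, y) up to a constant
  if (sigma <= 0) return(-Inf)
  -n * log(sigma) - sum((y - mu)^2) / (2 * sigma^2) - log(1 + sigma^2)
}

T_iter <- 5000
mu <- sigma <- numeric(T_iter)
mu[1] <- ybar; sigma[1] <- sd(y)
for (t in 2:T_iter) {
  ## Gibbs step for mu: normal full conditional under the N(0, 100) prior
  v_n   <- 1 / (1 / 100 + n / sigma[t - 1]^2)
  mu[t] <- rnorm(1, mean = v_n * n * ybar / sigma[t - 1]^2, sd = sqrt(v_n))
  ## Metropolis step for sigma with a symmetric proposal g
  sigma_star <- rnorm(1, sigma[t - 1], 0.2)
  log_r <- log_cond_sigma(sigma_star, mu[t]) - log_cond_sigma(sigma[t - 1], mu[t])
  sigma[t] <- if (log(runif(1)) <= min(log_r, 0)) sigma_star else sigma[t - 1]
}
c(mean(mu[-(1:1000)]), mean(sigma[-(1:1000)]))   # posterior mean estimates
```
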
Conclusion

- In the next lecture, we will look at examples of the Metropolis,
  Metropolis-Hastings and Gibbs sampling algorithms in R.
- The code used in these examples will be put up before the lecture,
  so you can run the code during the lecture.