MAST90125: Bayesian Statistical Learning
Semester 2 2021
Lecture 9: Posterior simulation
John Holmes
What do we want to achieve?
In the previous lecture, we looked at various techniques for approximating a posterior
distribution. These methods were either
▶ deterministic, or
▶ based on generating samples of i.i.d. random numbers.
However, as became apparent in the examples we looked at, these techniques suffer from
at least one of the following drawbacks:
▶ they do not generalise well to multi-dimensional problems,
▶ they are computationally wasteful.
In this lecture we will introduce a much more computationally efficient technique based
on iterative sampling, called Markov Chain Monte Carlo (MCMC), which comes at the
cost of having to generate dependent samples.
Markov Chains
Before we look at MCMC algorithms, we need to define what a Markov chain is, and
what properties a Markov chain should have in the context of posterior simulation.
▶ A Markov chain is a stochastic process in which the distribution of the present state
(at time t), θ(t), depends only on the immediately preceding state, that is,
p(θ(t) | θ(1), . . . , θ(t−1)) = p(θ(t) | θ(t−1)).
▶ While there are many possible characteristics we can use to categorise a Markov
chain, for posterior simulation we are most interested in the Markov chain having
the following properties:
  ▶ all possible θ ∈ Θ can eventually be reached by the chain (irreducibility);
  ▶ the chain is aperiodic and not transient.
▶ While the conditions above imply a unique stationary distribution exists, we still
need to show that the stationary distribution is p(θ|y). (A small simulation
illustrating a stationary distribution is sketched below.)
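As a concrete illustration (not from the slides), here is a minimal R sketch that simulates a two-state Markov chain with an assumed transition matrix P; the long-run proportion of time the chain spends in each state matches the stationary distribution obtained from the left eigenvector of P.

set.seed(1)
P <- matrix(c(0.9, 0.1,
              0.3, 0.7), nrow = 2, byrow = TRUE)  # P[i, j] = Pr(state j at time t | state i at time t-1)
n_iter <- 10000
state  <- numeric(n_iter)
state[1] <- 1
for (t in 2:n_iter) {
  state[t] <- sample(1:2, size = 1, prob = P[state[t - 1], ])  # depends only on the previous state
}
table(state) / n_iter                     # empirical occupancy, approximately (0.75, 0.25)
pi_stat <- Re(eigen(t(P))$vectors[, 1])   # left eigenvector of P with eigenvalue 1
pi_stat / sum(pi_stat)                    # analytical stationary distribution (0.75, 0.25)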
Metropolis algorithm
Like rejection sampling, the Metropolis algorithm is an example of an accept-reject rule.
The steps of the algorithm are:
▶ Define a proposal (or jumping) distribution, J(θa|θb), that is symmetric; that is,
J(·|·) satisfies
J(θa|θb) = J(θb|θa).
In addition, J(·|·) must be able to eventually reach all possible θ ∈ Θ to ensure a
stationary distribution exists.
▶ Draw a starting point θ(0) from a starting distribution p(0)(θ) such that
p(θ(0)|y) > 0.
▶ Then we iterate:
Metropolis algorithm
▶ For t = 1, 2, . . .:
  ▶ Sample θ∗ from the proposal distribution J(θ∗|θ(t−1)).
  ▶ Calculate
    r = p(θ∗|y) / p(θ(t−1)|y)
      = [p(θ∗, y)/p(y)] / [p(θ(t−1), y)/p(y)]
      = p(y|θ∗)p(θ∗) / [p(y|θ(t−1))p(θ(t−1))].
  ▶ Set
    θ(t) = θ∗ if u ≤ min(r, 1), where u ∼ U(0, 1);
    θ(t) = θ(t−1) otherwise.
▶ Some notes:
  ▶ If a jump θ∗ is rejected, that is θ(t) = θ(t−1), it is still counted as an iteration.
  ▶ The transition distribution is a mixture distribution. (A minimal R sketch of the
    algorithm follows below.)
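The following is a minimal R sketch of the algorithm above, my own illustration rather than course code. It assumes a hypothetical unnormalised posterior kernel post_kernel() with a Beta(3, 5) shape and a symmetric normal jumping rule; any p(y|θ)p(θ) known up to a constant could be substituted.

set.seed(1)
post_kernel <- function(theta) {            # p(y | theta) p(theta) up to a constant (assumed Beta(3, 5) shape)
  if (theta <= 0 || theta >= 1) return(0)   # zero outside the support
  theta^2 * (1 - theta)^4
}
n_iter <- 5000
theta  <- numeric(n_iter)
theta[1] <- 0.5                             # starting point with p(theta | y) > 0
for (t in 2:n_iter) {
  theta_star <- rnorm(1, mean = theta[t - 1], sd = 0.2)      # symmetric jumping distribution J
  r <- post_kernel(theta_star) / post_kernel(theta[t - 1])   # p(y) cancels in the ratio
  if (runif(1) <= min(r, 1)) {
    theta[t] <- theta_star                  # accept the jump
  } else {
    theta[t] <- theta[t - 1]                # reject: the repeated value still counts as an iteration
  }
}
mean(theta[-(1:1000)])                      # approx. 3/8, the Beta(3, 5) mean, after discarding burn-in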
Metropolis-Hastings algorithm
As might be guessed from the name, the Metropolis-Hastings algorithm is an extension
of the Metropolis algorithm. Therefore Metropolis-Hastings is also an accept-reject
algorithm.
▶ Like the Metropolis algorithm, we first need to define a proposal (or jumping)
distribution, J(θa|θb). Also like the Metropolis algorithm, J(·|·) must be chosen in
such a way as to ensure a stationary distribution exists.
▶ However, there is now no requirement that the jumping distribution be symmetric.
This also means the Metropolis algorithm is a special case of the
Metropolis-Hastings algorithm.
▶ Draw a starting point θ(0) from a starting distribution p(0)(θ) such that
p(θ(0)|y) > 0.
▶ Then we iterate:
Metropolis-Hastings algorithm
▶ For t = 1, 2, . . .:
  ▶ Sample θ∗ from the proposal distribution J(θ∗|θ(t−1)).
  ▶ Calculate
    r = [p(θ∗|y)/J(θ∗|θ(t−1))] / [p(θ(t−1)|y)/J(θ(t−1)|θ∗)]
      = [p(θ∗, y)/(p(y)J(θ∗|θ(t−1)))] / [p(θ(t−1), y)/(p(y)J(θ(t−1)|θ∗))]
      = [p(y|θ∗)p(θ∗)/J(θ∗|θ(t−1))] / [p(y|θ(t−1))p(θ(t−1))/J(θ(t−1)|θ∗)].
  ▶ Set
    θ(t) = θ∗ if u ≤ min(r, 1), where u ∼ U(0, 1);
    θ(t) = θ(t−1) otherwise.
▶ Some notes:
  ▶ If a jump θ∗ is rejected, that is θ(t) = θ(t−1), it is still counted as an iteration.
  ▶ The transition distribution is again a mixture distribution. (An R sketch of a single
    Metropolis-Hastings step with an asymmetric proposal follows below.)
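Below is a sketch (my own, under assumptions) of a single Metropolis-Hastings step for a positive-valued parameter, using an asymmetric log-normal jumping distribution so that the J terms no longer cancel in r. The unnormalised kernel post_kernel() is a placeholder with a Gamma(3, 2) shape.

set.seed(2)
post_kernel <- function(theta) {                    # assumed Gamma(shape 3, rate 2) kernel, up to a constant
  if (theta <= 0) return(0)
  theta^2 * exp(-2 * theta)
}
mh_step <- function(theta_curr, sd_log = 0.5) {
  theta_star  <- rlnorm(1, meanlog = log(theta_curr), sdlog = sd_log)            # asymmetric proposal
  J_star_curr <- dlnorm(theta_star, meanlog = log(theta_curr), sdlog = sd_log)   # J(theta* | theta^(t-1))
  J_curr_star <- dlnorm(theta_curr, meanlog = log(theta_star), sdlog = sd_log)   # J(theta^(t-1) | theta*)
  r <- (post_kernel(theta_star) / J_star_curr) / (post_kernel(theta_curr) / J_curr_star)
  if (runif(1) <= min(r, 1)) theta_star else theta_curr
}
theta <- numeric(5000)
theta[1] <- 1
for (t in 2:5000) theta[t] <- mh_step(theta[t - 1])
mean(theta[-(1:1000)])                              # approx. 1.5, the Gamma(3, 2) mean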
Motivation to consider a Gibbs sampler
▶ In most real-life problems, we will not be attempting to make inference on a single
parameter only, as we have done in most examples to date. Rather, we will want
to make inference on a set of parameters, θ = (θ1, . . . , θK).
▶ In this case, it may be possible to integrate out the parameters that are not of
immediate interest. We did this when estimating the parameters of the normal
distribution in Assignment 1 and Lab 2. However, integrating out parameters is
usually not a straightforward task.
▶ More commonly, we simply cannot analytically determine the posterior. What we
can easily obtain is the joint distribution of parameters and data,
p(θ, y) = p(θ1, . . . , θK, y) = p(y|θ1, . . . , θK)p(θ1, . . . , θK).
Conditional posteriors
▶ Normally, we would try to directly determine p(θ|y) from p(θ, y). However, the
rules of probability mean there is nothing to stop us from instead finding
p(θ1 | θ2, . . . , θK, y), . . . , p(θK | θ1, . . . , θK−1, y),
the set of conditional posteriors. Note that 'conditional' here refers to the fact that
we are finding the posterior of θi conditional on all other parameters as well as the data.
▶ The appeal of conditional posteriors is that:
  ▶ a sequence of draws from low-dimensional spaces may be less computationally
    intensive than a single draw from a high-dimensional space;
  ▶ even if p(θ|y) is unknown analytically, p(θi|θ−i, y) could be known.
The Gibbs sampler
▶ In the Gibbs sampler, it is assumed that it is possible to directly sample from the set
of conditional posteriors p(θi|θ−i, y), 1 ≤ i ≤ K.
▶ First, we define a starting point θ(0) = (θ1^(0), . . . , θK^(0)) from some starting
distribution.
▶ Then we iterate. However, unlike in the Metropolis(-Hastings) algorithm, we need
to iterate over the components of θ within each iteration t. (A minimal R sketch
of the sampler is given below.)
▶ For t = 1, 2, . . .:
  ▶ For i = 1, . . . , K, draw θi^(t) from the conditional posterior
    p(θi^(t) | θ∗−i, y),
    where θ∗−i = (θ1^(t), . . . , θi−1^(t), θi+1^(t−1), . . . , θK^(t−1)).
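As a minimal illustration (an assumed example, not from the lecture), the R sketch below runs a Gibbs sampler for a bivariate standard normal posterior with correlation rho, where both conditional posteriors are known normal distributions.

set.seed(3)
rho    <- 0.8
n_iter <- 5000
theta  <- matrix(0, nrow = n_iter, ncol = 2)     # columns hold theta_1 and theta_2
for (t in 2:n_iter) {
  # theta_1 | theta_2, y ~ N(rho * theta_2, variance 1 - rho^2), using the most recent theta_2
  theta[t, 1] <- rnorm(1, mean = rho * theta[t - 1, 2], sd = sqrt(1 - rho^2))
  # theta_2 | theta_1, y ~ N(rho * theta_1, variance 1 - rho^2), using the freshly drawn theta_1
  theta[t, 2] <- rnorm(1, mean = rho * theta[t, 1], sd = sqrt(1 - rho^2))
}
cor(theta[-(1:1000), 1], theta[-(1:1000), 2])    # approx. rho = 0.8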
How do we know that we will converge to p(θ|y)?
▶ While we have introduced various MCMC algorithms, we have not shown that any
will converge to p(θ|y). To prove convergence, consider the joint distribution of
two successive draws θa, θb from a Metropolis-Hastings algorithm. To help,
assume p(θb|y)/J(θb|θa) ≥ p(θa|y)/J(θa|θb).
▶ First assume θ(t−1) = θa, θ(t) = θb. Then the joint distribution is
p(θ(t−1) = θa, θ(t) = θb) = p(θ(t−1) = θa) p(θ(t) = θb | θ(t−1) = θa)
                          = p(θ(t−1) = θa) J(θb|θa),
since r ≥ 1, so the proposal θb ∼ J(θb|θa) is accepted with probability 1.
▶ Now change the order of events.
How do we know that we will converge to p(θ|y)?
▶ The joint distribution of θ(t) = θa, θ(t−1) = θb is
p(θ(t) = θa, θ(t−1) = θb) = p(θ(t−1) = θb) p(θ(t) = θa | θ(t−1) = θb)
                          = p(θ(t−1) = θb) J(θa|θb) × [p(θa|y)/J(θa|θb)] / [p(θb|y)/J(θb|θa)]
                          = p(θa|y) J(θb|θa) × p(θ(t−1) = θb) / p(θb|y),
since flipping the order of events is equivalent to flipping the density ratio, so r must
now be ≤ 1 and the proposal is accepted with probability r.
How do we know that we will converge to p(θ|y)?
▶ From our choice of J(·|·), we know a unique stationary distribution exists. Now
assume θ(t−1) is drawn from the posterior, so that p(θ(t−1) = θb) = p(θb|y). This
means that
p(θ(t−1) = θa, θ(t) = θb) = p(θ(t) = θa, θ(t−1) = θb),
and
p(θ(t) = θa) = ∫ p(θ(t) = θa, θ(t−1) = θb) dθb
             = p(θa|y) ∫ J(θb|θa) [p(θb|y)/p(θb|y)] dθb
             = p(θa|y) ∫ J(θb|θa) dθb = p(θa|y),
meaning we can conclude θ(t) is also drawn from the posterior and thus p(θ|y) is
the stationary distribution. (A small numerical check of this stationarity argument
on a discrete parameter space is sketched below.)
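To make the argument concrete, here is a small numerical check (my own illustration) on a discrete parameter space with four values: build the Metropolis-Hastings transition matrix for an assumed target p(θ|y) and a uniform proposal, and verify that the target is left unchanged by one transition.

p_target <- c(0.1, 0.2, 0.3, 0.4)        # assumed discrete posterior p(theta | y)
J <- matrix(1 / 4, 4, 4)                 # J[a, b] = J(theta_b | theta_a): uniform proposal over the four states
P <- matrix(0, 4, 4)                     # Metropolis-Hastings transition matrix
for (a in 1:4) {
  for (b in 1:4) {
    if (b != a) {
      r <- (p_target[b] / J[a, b]) / (p_target[a] / J[b, a])
      P[a, b] <- J[a, b] * min(r, 1)     # propose b, accept with probability min(r, 1)
    }
  }
  P[a, a] <- 1 - sum(P[a, -a])           # rejected proposals keep the chain at theta_a
}
as.vector(p_target %*% P)                # returns (0.1, 0.2, 0.3, 0.4): p(theta | y) is stationary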
But what about the convergence of the Gibbs sampler?
▶ We have shown the stationary distribution of the Metropolis-Hastings algorithm is
the posterior, p(θ|y).
▶ Implicitly, we have shown this is also true for the Metropolis algorithm, as the
Metropolis algorithm is just a special case of the Metropolis-Hastings algorithm.
▶ But what about the Gibbs sampler?
▶ It turns out the Gibbs sampler is another special case of the Metropolis-Hastings
algorithm. If we view the update of component i within iteration t as iteration
t′ = tK + i − K of a single chain, rather than as an iteration within an iteration,
with candidate
θ∗ = (θ1^(t), . . . , θi−1^(t), θi∗, θi+1^(t−1), . . . , θK^(t−1))
and current draw
θ(t′−1) = (θ1^(t), . . . , θi−1^(t), θi^(t−1), θi+1^(t−1), . . . , θK^(t−1)),
then J(θ∗|θ(t′−1)) = p(θi∗ | θ∗−i, y), the conditional posterior, is a valid and usually
non-symmetric jumping distribution.
The Gibbs sampler is a special case of Metropolis-Hastings
▶ Moreover, the Gibbs sampler is an example of the Metropolis-Hastings algorithm
in which every move is accepted, as shown below:
r = [p(θ∗|y)/J(θ∗|θ(t′−1))] / [p(θ(t′−1)|y)/J(θ(t′−1)|θ∗)]
  = [p(θ1^(t), . . . , θi−1^(t), θi∗, θi+1^(t−1), . . . , θK^(t−1) | y) / p(θi∗ | θ1^(t), . . . , θi−1^(t), θi+1^(t−1), . . . , θK^(t−1), y)]
    / [p(θ1^(t), . . . , θi−1^(t), θi^(t−1), θi+1^(t−1), . . . , θK^(t−1) | y) / p(θi^(t−1) | θ1^(t), . . . , θi−1^(t), θi+1^(t−1), . . . , θK^(t−1), y)]
  = p(θ1^(t), . . . , θi−1^(t), θi+1^(t−1), . . . , θK^(t−1) | y) / p(θ1^(t), . . . , θi−1^(t), θi+1^(t−1), . . . , θK^(t−1) | y)
  = 1
Comments
▶ As we are using computationally intensive techniques, we want to minimise
computational cost wherever possible.
▶ For example, in previous lectures we have encountered the sufficiency principle.
You may therefore want to compress the data down to just the sufficient statistics.
▶ Addition and subtraction are cheaper operations for a computer than multiplication
and division. Rejection sampling, importance sampling, and the Metropolis(-Hastings)
algorithms all require density ratios. If we use (natural) log-densities instead, these
ratios become differences, which are quicker to compute and less prone to numerical
underflow; we only exponentiate when needed. (A short R illustration follows below.)
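As a short illustration of the log-density point (my own sketch, with an assumed normal model and log_post() as a hypothetical unnormalised log posterior), the Metropolis acceptance test u ≤ min(r, 1) can be carried out entirely on the log scale.

set.seed(5)
log_post <- function(theta, y) {
  sum(dnorm(y, mean = theta, sd = 1, log = TRUE)) +   # log p(y | theta)
    dnorm(theta, mean = 0, sd = 10, log = TRUE)       # + log p(theta)
}
y          <- rnorm(200, mean = 2)                    # assumed data
theta_curr <- 1.8
theta_star <- rnorm(1, mean = theta_curr, sd = 0.1)   # symmetric jump
log_r  <- log_post(theta_star, y) - log_post(theta_curr, y)  # the density ratio becomes a difference
accept <- log(runif(1)) <= min(log_r, 0)              # equivalent to u <= min(r, 1)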
Some more comments
▶ Note there is nothing to stop us mixing techniques for a particular problem.
For example, you may be faced with a situation where the parameters split as
θ = (θC, θNC) such that p(θi|θ−i, y) can be directly sampled from only if θi ⊆ θC.
In that case, you can combine a Gibbs sampler with Metropolis(-Hastings), such that
θi^(t) ∼ p(θi | θ∗−i, y)   if θi ⊆ θC,
and, if θi ⊈ θC,
θi^(t) = θi∗ if u ≤ min(r, 1), where u ∼ U(0, 1) and θi∗ ∼ g(θi∗|·);
θi^(t) = θi^(t−1) otherwise,
with g(θi∗|·) being a jumping rule and r defined as in the Metropolis(-Hastings)
algorithm. (A sketch of such a Metropolis-within-Gibbs sampler in R is given below.)
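The sketch below (my own, under assumptions) shows such a hybrid sampler for a normal model y_i ∼ N(μ, σ²): with a flat prior on μ, the conditional p(μ|σ, y) is normal and can be Gibbs-sampled, while σ is given a half-Cauchy prior and updated with a Metropolis step using a normal jumping rule g.

set.seed(4)
y <- rnorm(100, mean = 3, sd = 2)                     # assumed data
log_cond_sigma <- function(sigma, mu, y) {            # log p(sigma | mu, y), up to a constant
  if (sigma <= 0) return(-Inf)
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE)) + dcauchy(sigma, location = 0, scale = 5, log = TRUE)
}
n_iter <- 3000
draws  <- matrix(NA, n_iter, 2, dimnames = list(NULL, c("mu", "sigma")))
mu <- 0; sigma <- 1
for (t in 1:n_iter) {
  mu <- rnorm(1, mean = mean(y), sd = sigma / sqrt(length(y)))   # Gibbs step: mu | sigma, y (flat prior on mu)
  sigma_star <- rnorm(1, mean = sigma, sd = 0.2)                 # Metropolis step for sigma, jumping rule g
  log_r <- log_cond_sigma(sigma_star, mu, y) - log_cond_sigma(sigma, mu, y)
  if (log(runif(1)) <= min(log_r, 0)) sigma <- sigma_star        # otherwise sigma is repeated
  draws[t, ] <- c(mu, sigma)
}
colMeans(draws[-(1:500), ])                           # approx. (3, 2)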
Conclusion
▶ In the next lecture, we will look at examples of the Metropolis,
Metropolis-Hastings and Gibbs sampling algorithms in R.
▶ The code used in these examples will be put up before the lecture,
so you can run the code during the lecture.
