Introduction to advanced Monte Carlo methods
Transcript

  • 1. An introduction to advanced (?) MCMC methods. Christian P. Robert, Université Paris-Dauphine and CREST-INSEE, http://www.ceremade.dauphine.fr/~xian. Royal Statistical Society, October 13, 2010
  • 2. Outline: 1. Motivating example; 2. The Metropolis-Hastings Algorithm
  • 3. Motivating example: Latent structures make life harder! Even simple models may lead to computational complications, as in latent variable models f(x|θ) = ∫ f⋆(x, x⋆|θ) dx⋆
  • 4. (cont.) If (x, x⋆) observed, fine!
  • 5. (cont.) If only x observed, trouble!
  • 6. Example (Mixture models): Models of mixtures of distributions: X ∼ fj with probability pj, for j = 1, 2, . . . , k, with overall density X ∼ p1 f1(x) + · · · + pk fk(x).
  • 7. (cont.) For a sample of independent random variables (X1, . . . , Xn), sample density ∏_{i=1}^{n} {p1 f1(xi) + · · · + pk fk(xi)}.
  • 8. (cont.) Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.
  • 9. Motivating example: [Figure: log-likelihood surface of the mixture 0.3 N(µ1, 1) + 0.7 N(µ2, 1) over the (µ1, µ2) plane]
  • 10. A typology of Bayes computational problems: (i) use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
  • 11. (cont.) (ii) use of a complex sampling model with an intractable likelihood, as for instance in missing data and graphical models;
  • 12. (cont.) (iii) use of a huge dataset;
  • 13. (cont.) (iv) use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);
  • 14. (cont.) (v) use of a complex inferential procedure, as for instance Bayes factors B01^π(x) = [P(θ ∈ Θ0 | x) / P(θ ∈ Θ1 | x)] / [π(θ ∈ Θ0) / π(θ ∈ Θ1)].
  • 15. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm The Metropolis-Hastings Algorithm 1 Motivating example 2 The Metropolis-Hastings Algorithm Monte Carlo Methods based on Markov Chains The Metropolis–Hastings algorithm A collection of Metropolis-Hastings algorithms Extensions Convergence assessment
  • 16. Monte Carlo Methods based on Markov Chains: Running Monte Carlo via Markov Chains. Fact: It is not necessary to use a sample from the distribution f to approximate the integral I = ∫ h(x) f(x) dx.
  • 17. (cont.) We can obtain X1, . . . , Xn ∼ f (approx) without directly simulating from f, using an ergodic Markov chain with stationary distribution f
  • 18. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm Monte Carlo Methods based on Markov Chains Running Monte Carlo via Markov Chains (2) Idea For an arbitrary starting value x(0) , an ergodic chain (X (t) ) is generated using a transition kernel with stationary distribution f
  • 19. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm Monte Carlo Methods based on Markov Chains Running Monte Carlo via Markov Chains (2) Idea For an arbitrary starting value x(0) , an ergodic chain (X (t) ) is generated using a transition kernel with stationary distribution f Ensures the convergence in distribution of (X (t) ) to a random variable from f . For a “large enough” T0 , X (T0 ) can be considered as distributed from f Produces a dependent sample X (T0 ) , X (T0 +1) , . . ., which is generated from f , sufficient for most approximation purposes.
  • 20. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm The Metropolis–Hastings algorithm The Metropolis–Hastings algorithm Problem: How can one build a Markov chain with a given stationary distribution?
  • 21. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm The Metropolis–Hastings algorithm The Metropolis–Hastings algorithm Problem: How can one build a Markov chain with a given stationary distribution? MH basics Algorithm that converges to the objective (target) density f using an arbitrary transition kernel density q(x, y) called instrumental (or proposal) distribution
  • 22. The MH algorithm. Algorithm (Metropolis–Hastings): Given x(t), 1. Generate Yt ∼ q(x(t), y). 2. Take X(t+1) = Yt with prob. ρ(x(t), Yt), and X(t+1) = x(t) with prob. 1 − ρ(x(t), Yt), where ρ(x, y) = min{ [f(y) q(y, x)] / [f(x) q(x, y)], 1 }.
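The two steps of the algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not part of the deck: the standard-normal target and the Gaussian proposal are hypothetical choices, and everything works on log-densities to avoid underflow.

```python
import numpy as np

def metropolis_hastings(logf, q_sample, q_logpdf, x0, n_iter, seed=0):
    """Generic Metropolis-Hastings: accept y with prob min{1, f(y)q(y,x)/[f(x)q(x,y)]}."""
    rng = np.random.default_rng(seed)
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = q_sample(x, rng)
        # log of the acceptance ratio rho(x, y)
        log_rho = logf(y) + q_logpdf(y, x) - logf(x) - q_logpdf(x, y)
        if np.log(rng.uniform()) < log_rho:   # accept with prob min{1, rho}
            x = y
        chain[t] = x
    return chain

# Illustration: standard-normal target with a N(x, 1) proposal (hypothetical choices)
chain = metropolis_hastings(
    logf=lambda x: -0.5 * x**2,                 # log f up to a constant
    q_sample=lambda x, rng: x + rng.normal(),   # draw y ~ q(x, .)
    q_logpdf=lambda x, y: -0.5 * (y - x)**2,    # log q(x, y) up to a constant
    x0=0.0, n_iter=20000)
```

Note that only ratios of f and of q enter the acceptance probability, which is exactly the "independent of normalizing constants" feature stated on the next slide.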
  • 23. Features: Independent of normalizing constants for both f and q(x, ·) (i.e., those constants independent of x). Never move to values with f(y) = 0. The chain (x(t))t may take the same value several times in a row, even though f is a density wrt Lebesgue measure. The sequence (yt)t is usually not a Markov chain
  • 24. (cont.) Satisfies the detailed balance condition f(x) K(x, y) = f(y) K(y, x) [diagram: transition probabilities P(θ → θ′) and P(θ′ → θ) between states θ and θ′] [Green, 1995]
  • 25. Convergence properties: 1. The M-H Markov chain is reversible, with invariant/stationary density f.
  • 26. (cont.) 2. As f is a probability measure, the chain is positive recurrent
  • 27. (cont.) 3. If Pr( f(Yt) q(Yt, X(t)) / [f(X(t)) q(X(t), Yt)] ≥ 1 ) < 1, (1) i.e., if the event {X(t+1) = X(t)} occurs with positive probability, then the chain is aperiodic
  • 28. Convergence properties (2): 4. If q(x, y) > 0 for every (x, y), (2) the chain is irreducible
  • 29. (cont.) 5. For M-H, f-irreducibility implies Harris recurrence
  • 30. (cont.) 6. Thus, under conditions (1) and (2): (i) for h with Ef|h(X)| < ∞, lim_{T→∞} (1/T) ∑_{t=1}^{T} h(X(t)) = ∫ h(x) f(x) dx a.e. f; (ii) lim_{n→∞} ‖ ∫ K^n(x, ·) µ(dx) − f ‖_TV = 0 for every initial distribution µ, where K^n(x, ·) denotes the kernel for n transitions.
  • 31. A collection of Metropolis-Hastings algorithms: The Independent Case. The instrumental distribution q(x, ·) is independent of x and is denoted g
  • 32. (cont.) Algorithm (Independent Metropolis-Hastings): Given x(t), 1. Generate Yt ∼ g(y). 2. Take X(t+1) = Yt with prob. min{ [f(Yt) g(x(t))] / [f(x(t)) g(Yt)], 1 }, and X(t+1) = x(t) otherwise.
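The independence sampler can be sketched by tracking the importance log-weight f(x)/g(x) of the current state. The N(0, 1) target and the wider N(0, 2) proposal below are an illustrative pair chosen so that f/g is bounded:

```python
import numpy as np

def independent_mh(logf, logg, g_sample, x0, n_iter, seed=1):
    """Independence sampler: accept with prob min{1, [f(y)/g(y)] / [f(x)/g(x)]}."""
    rng = np.random.default_rng(seed)
    x, lw = x0, logf(x0) - logg(x0)       # importance log-weight of current state
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = g_sample(rng)                 # proposal does not depend on x
        lwy = logf(y) - logg(y)
        if np.log(rng.uniform()) < lwy - lw:
            x, lw = y, lwy
        chain[t] = x
    return chain

# N(0, 1) target with a N(0, 2) proposal: f(x)/g(x) ∝ exp(-x^2/4) is bounded
chain = independent_mh(
    logf=lambda x: -0.5 * x**2,
    logg=lambda x: -0.25 * x**2,
    g_sample=lambda rng: rng.normal(scale=np.sqrt(2.0)),
    x0=0.0, n_iter=20000)
```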
  • 33. Properties: The resulting sample is not iid
  • 34. (cont.) but there exist strong convergence properties. Theorem (Ergodicity): The algorithm produces a uniformly ergodic chain if there exists a constant M such that f(x) ≤ M g(x), x ∈ supp f. In this case, ‖K^n(x, ·) − f‖_TV ≤ (1 − 1/M)^n. [Mengersen & Tweedie, 1996]
  • 35. Example (Noisy AR(1)): Hidden Markov chain from a regular AR(1) model, xt+1 = ϕ xt + εt+1, εt ∼ N(0, τ²), and observables yt | xt ∼ N(xt², σ²)
  • 36. (cont.) The distribution of xt given xt−1, xt+1 and yt is proportional to exp{ −[(xt − ϕ xt−1)² + (xt+1 − ϕ xt)²] / (2τ²) − (yt − xt²)² / (2σ²) }.
  • 37. Example (Noisy AR(1) too): Use for proposal the N(µt, ωt²) distribution, with µt = ϕ (xt−1 + xt+1) / (1 + ϕ²) and ωt² = τ² / (1 + ϕ²).
  • 38. (cont.) The ratio π(x)/qind(x) ∝ exp{ −(yt − xt²)² / (2σ²) } is bounded
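A one-site update for this example can be sketched as an independence sampler: the Gaussian proposal absorbs the AR(1) part of the full conditional, so the bounded ratio above is all that enters the acceptance step. The values of ϕ, τ, σ and of the conditioning variables below are purely illustrative:

```python
import numpy as np

# Illustrative parameter values and conditioning variables
phi, tau, sigma = 0.8, 1.0, 1.0
x_prev, x_next, y_t = 0.5, 0.7, 0.4

mu_t = phi * (x_prev + x_next) / (1 + phi**2)   # proposal mean
omega2 = tau**2 / (1 + phi**2)                  # proposal variance

def log_weight(x):
    # bounded ratio pi(x)/q_ind(x) up to constants: exp{-(y_t - x^2)^2 / (2 sigma^2)}
    return -0.5 * (y_t - x**2) ** 2 / sigma**2

rng = np.random.default_rng(2)
x, lw = mu_t, log_weight(mu_t)
draws = np.empty(10000)
for t in range(10000):
    prop = mu_t + np.sqrt(omega2) * rng.normal()  # independent Gaussian proposal
    lwp = log_weight(prop)
    if np.log(rng.uniform()) < lwp - lw:          # independent-MH acceptance
        x, lw = prop, lwp
    draws[t] = x
```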
  • 39. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm A collection of Metropolis-Hastings algorithms (top) Last 500 realisations of the chain {Xk }k out of 10, 000 iterations; (bottom) histogram of the chain, compared with the target distribution.
  • 40. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm A collection of Metropolis-Hastings algorithms Random walk Metropolis–Hastings Instead, use a local perturbation as proposal Yt = X (t) + εt , where εt ∼ g, independent of X (t) . The instrumental density is now of the form g(y − x) and the Markov chain is a random walk if g is symmetric g(x) = g(−x)
  • 41. Algorithm (Random walk Metropolis): Given x(t), 1. Generate Yt ∼ g(y − x(t)). 2. Take X(t+1) = Yt with prob. min{ 1, f(Yt)/f(x(t)) }, and X(t+1) = x(t) otherwise.
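Because the proposal is symmetric, the instrumental density cancels from the acceptance ratio, which makes the random-walk version particularly short to code. A minimal sketch, with a standard-normal target and the classical one-dimensional scale 2.4 as illustrative choices:

```python
import numpy as np

def rw_metropolis(logf, scale, x0, n_iter, seed=3):
    """Random-walk Metropolis: symmetric proposal, so q cancels from the ratio."""
    rng = np.random.default_rng(seed)
    x, lfx = x0, logf(x0)
    chain, accepts = np.empty(n_iter), 0
    for t in range(n_iter):
        y = x + scale * rng.normal()              # Yt = x(t) + eps_t
        lfy = logf(y)
        if np.log(rng.uniform()) < lfy - lfx:     # min{1, f(y)/f(x)}
            x, lfx, accepts = y, lfy, accepts + 1
        chain[t] = x
    return chain, accepts / n_iter

chain, rate = rw_metropolis(lambda x: -0.5 * x**2, 2.4, 0.0, 20000)
```

Returning the empirical acceptance rate alongside the chain is convenient for the scale-calibration discussion later in the deck.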
  • 42. Probit illustration: Likelihood and posterior given by π(β|y, X) ∝ ℓ(β|y, X) ∝ ∏_{i=1}^{n} Φ(xiᵀβ)^{yi} (1 − Φ(xiᵀβ))^{ni − yi} under the flat prior
  • 43. (cont.) A random walk proposal works well for a small number of predictors. Use the maximum likelihood estimate β̂ as starting value and the asymptotic (Fisher) covariance matrix of the MLE, Σ̂, as scale
  • 44. MCMC algorithm: Probit random-walk Metropolis-Hastings. Initialization: Set β(0) = β̂ and compute Σ̂. Iteration t: 1. Generate β̃ ∼ Nk+1(β(t−1), τ Σ̂). 2. Compute ρ(β(t−1), β̃) = min{ 1, π(β̃|y) / π(β(t−1)|y) }. 3. With probability ρ(β(t−1), β̃) set β(t) = β̃; otherwise set β(t) = β(t−1).
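A toy version of this sampler is sketched below. Everything data-related is hypothetical (one simulated predictor, no intercept, true coefficient 1.5), the chain starts at 0 instead of the MLE, and a fixed proposal scale stands in for τΣ̂; Φ is built from `math.erf` to keep the sketch dependency-free:

```python
import numpy as np
from math import erf

def Phi(z):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + erf(z / 2**0.5))

Phi_v = np.vectorize(Phi)

# Hypothetical simulated data: one predictor, no intercept, true beta = 1.5
rng = np.random.default_rng(4)
X = rng.normal(size=200)
y = (rng.uniform(size=200) < Phi_v(1.5 * X)).astype(int)

def log_post(b):
    # flat prior: log-posterior = probit log-likelihood (clipped for stability)
    p = np.clip(Phi_v(b * X), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

b, lp = 0.0, log_post(0.0)          # starting at 0 rather than the MLE
chain = np.empty(5000)
for t in range(5000):
    prop = b + 0.2 * rng.normal()   # fixed scale standing in for tau * Sigma-hat
    lpp = log_post(prop)
    if np.log(rng.uniform()) < lpp - lp:
        b, lp = prop, lpp
    chain[t] = b
```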
  • 45. R bank benchmark: Probit modelling with no intercept over the four measurements. Three different scales τ = 1, 0.1, 10: best mixing behavior is associated with τ = 1. Average of the parameters over 9,000 MCMC iterations gives plug-in estimate p̂i = Φ(−1.2193 xi1 + 0.9540 xi2 + 0.9795 xi3 + 1.1481 xi4). [Figure: traces, histograms and autocorrelations of the four parameters for the three scales]
  • 46. Example (Mixture models): π(θ|x) ∝ ∏_{j=1}^{n} { ∑_{ℓ=1}^{k} pℓ f(xj|µℓ, σℓ) } π(θ)
  • 47. (cont.) Metropolis-Hastings proposal: θ(t+1) = θ(t) + ωε(t) if u(t) < ρ(t), and θ(t+1) = θ(t) otherwise, where ρ(t) = [π(θ(t) + ωε(t)|x) / π(θ(t)|x)] ∧ 1, and ω scaled for good acceptance rate
  • 48–52. [Figures: Random walk MCMC output for .7 N(µ1, 1) + .3 N(µ2, 1) and scale 1, at iterations 1, 10, 100, 500 and 1000, in the (µ1, µ2) plane]
  • 53–58. [Figures: Random walk MCMC output for .7 N(µ1, 1) + .3 N(µ2, 1) and scale √.1, at iterations 10, 100, 500, 1000, 10,000 and 5000, in the (µ1, µ2) plane]
  • 59. Convergence properties: Uniform ergodicity prohibited by random walk structure
  • 60. (cont.) At best, geometric ergodicity. Theorem (Sufficient ergodicity): For a symmetric density f, log-concave in the tails, and a positive and symmetric density g, the chain (X(t)) is geometrically ergodic. [Mengersen & Tweedie, 1996] (no tail effect)
  • 61. Example (Comparison of tail effects): Random-walk Metropolis–Hastings algorithms based on a N(0, 1) instrumental for the generation of (a) a N(0, 1) distribution and (b) a distribution with density ψ(x) ∝ (1 + |x|)⁻³. [Figure: 90% confidence envelopes of the means, derived from 500 parallel independent chains]
  • 62. Extensions: There are many other families of MH algorithms: Adaptive Rejection Metropolis Sampling, Reversible Jump, Langevin algorithms, to name just a few...
  • 63. Langevin Algorithms: Proposal based on the Langevin diffusion Lt, defined by the stochastic differential equation dLt = dBt + (1/2) ∇ log f(Lt) dt, where Bt is the standard Brownian motion. Theorem: The Langevin diffusion is the only non-explosive diffusion which is reversible with respect to f.
  • 64–68. Discretization: Because continuous time cannot be simulated, consider the discretised sequence x(t+1) = x(t) + (σ²/2) ∇ log f(x(t)) + σ εt, εt ∼ Np(0, Ip), where σ² corresponds to the discretisation step. [Figures: histograms of the discretised chain against the density f(x) = exp(−x⁴), for σ² = .1, .01, .001, .0001]
  • 69. Discretization (cont.): Unfortunately, the discretized chain may be transient, for instance when lim_{x→±∞} σ² ∇ log f(x) |x|⁻¹ > 1. [Figure: transient path for f(x) = exp(−x⁴) when σ² = .2]
  • 70. MH correction: Accept the new value Yt with probability [ f(Yt) exp{ −‖x(t) − Yt − (σ²/2) ∇ log f(Yt)‖² / (2σ²) } ] / [ f(x(t)) exp{ −‖Yt − x(t) − (σ²/2) ∇ log f(x(t))‖² / (2σ²) } ] ∧ 1. Choice of the scaling factor σ: should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of x are uncorrelated). [Roberts & Rosenthal, 1998]
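Putting the discretised Langevin proposal and the MH correction together gives the usual MALA sampler. A one-dimensional sketch on the deck's example f(x) = exp(−x⁴), with σ² = .1 as an illustrative step size:

```python
import numpy as np

def mala(logf, grad_logf, sigma2, x0, n_iter, seed=6):
    """Metropolis-adjusted Langevin: drifted Gaussian proposal + MH correction."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(sigma2)
    x, lfx = x0, logf(x0)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        mean_x = x + 0.5 * sigma2 * grad_logf(x)   # proposal mean from x
        y = mean_x + sigma * rng.normal()
        lfy = logf(y)
        mean_y = y + 0.5 * sigma2 * grad_logf(y)   # reverse-proposal mean from y
        # log q(y, x) - log q(x, y): ratio of the two Gaussian exponentials
        lq = ((y - mean_x) ** 2 - (x - mean_y) ** 2) / (2 * sigma2)
        if np.log(rng.uniform()) < lfy - lfx + lq:
            x, lfx = y, lfy
        chain[t] = x
    return chain

# f(x) = exp(-x^4): logf = -x^4, grad logf = -4 x^3
chain = mala(lambda x: -x**4, lambda x: -4 * x**3, 0.1, 0.0, 20000)
```

The correction rejects the large drifted moves that would make the plain discretised chain transient, so the chain remains valid even for poorly chosen σ².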
  • 71. Optimizing the Acceptance Rate: Problem of choosing the transition kernel from a practical point of view. Most common alternatives: (a) a fully automated algorithm like ARMS; (b) an instrumental density g which approximates f, such that f/g is bounded for uniform ergodicity to apply; (c) a random walk. In both cases (b) and (c), the choice of g is critical.
  • 72. Case of the random walk: Different approach to acceptance rates. A high acceptance rate does not indicate that the algorithm is moving correctly, since it indicates that the random walk is moving too slowly on the surface of f.
  • 73. (cont.) If x(t) and yt are close, i.e. f(x(t)) ≃ f(yt), then yt is accepted with probability min{ f(yt)/f(x(t)), 1 } ≃ 1. For multimodal densities with well separated modes, the negative effect of limited moves on the surface of f clearly shows.
  • 74. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm Extensions Case of the random walk (2) If the average acceptance rate is low, the successive values of f (yt ) tend to be small compared with f (x(t) ), which means that the random walk moves quickly on the surface of f since it often reaches the “borders” of the support of f
  • 75. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm Extensions Rule of thumb In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, at an average acceptance rate of 25%. [Gelman,Gilks and Roberts, 1995]
  • 76. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm Extensions Rule of thumb In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, at an average acceptance rate of 25%. [Gelman,Gilks and Roberts, 1995] This rule is to be taken with a pinch of salt!
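The trade-off behind this rule of thumb is easy to observe numerically: too small a scale accepts almost everything while barely moving, too large a scale rejects almost everything. A small sketch on a N(0, 1) target, scanning three illustrative scales:

```python
import numpy as np

def accept_rate(scale, n_iter=20000, seed=7):
    # acceptance rate of random-walk Metropolis on a N(0, 1) target
    rng = np.random.default_rng(seed)
    x, acc = 0.0, 0
    for _ in range(n_iter):
        y = x + scale * rng.normal()
        if np.log(rng.uniform()) < 0.5 * (x**2 - y**2):  # log f(y) - log f(x)
            x, acc = y, acc + 1
    return acc / n_iter

rates = {s: accept_rate(s) for s in (0.1, 1.0, 10.0)}
```

The acceptance rate decreases monotonically in the scale; neither extreme mixes well, which is why an intermediate target rate is aimed for.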
  • 77. Example (Noisy AR(1) continued): For a Gaussian random walk with scale ω small enough, the random walk never jumps to the other mode. But if the scale ω is sufficiently large, the Markov chain explores both modes and gives a satisfactory approximation of the target distribution.
  • 78. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm Extensions Markov chain based on a random walk with scale ω = .1
  • 79. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm Extensions Markov chain based on a random walk with scale ω = .5
  • 80. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm Extensions Where do we stand? MCMC in a nutshell:
  • 81. Where do we stand? MCMC in a nutshell: Running a sequence Xt+1 = Ψ(Xt, Yt) provides approximation to target density f when detailed balance condition holds: f(x) K(x, y) = f(y) K(y, x)
  • 82. (cont.) Easiest implementation of the principle is random walk Metropolis-Hastings: Yt = X(t) + εt
  • 83. (cont.) Practical convergence requires sufficient energy from the proposal, which is calibrated by trial and error.
  • 84. An introduction to advanced (?) MCMC methods The Metropolis-Hastings Algorithm Convergence assessment Convergence diagnostics How many iterations?
  • 85. Convergence diagnostics (cont.): Rule #1: There is no absolute number of simulations, i.e. 1,000 is neither large nor small. Rule #2: It takes [much] longer to check for convergence than for the chain itself to converge. Rule #3: MCMC is a "what-you-get-is-what-you-see" algorithm: it fails to tell about unexplored parts of the space. Rule #4: When in doubt, run MCMC chains in parallel and check for consistency.
  • 86. (cont.) Many "quick-&-dirty" solutions in the literature, but not necessarily 100% trustworthy.
  • 87. Example (Bimodal target): f(x) = [exp(−x²/2)/√(2π)] · [4(x − .3)² + .01] / [4(1 + (.3)²) + .01], and use of random walk Metropolis–Hastings algorithm with variance .04. Evaluation of the missing mass by ∑_{t=1}^{T−1} [θ(t+1) − θ(t)] f(θ(t)). [Figure: density of the target]
  • 88. [Figure: sequence (in blue) and mass evaluation (in brown) over 2000 iterations] [Philippe & Robert, 2001]
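The missing-mass diagnostic of the example can be sketched as a Riemann sum over the ordered sample (the ordering is my reading of the estimator: with sorted θ's, the sum approximates ∫f over the visited range, so it approaches 1 only when the chain has covered the support):

```python
import numpy as np

def f(x):
    # the bimodal target of the example (normalised)
    return (np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
            * (4 * (x - 0.3) ** 2 + 0.01) / (4 * (1 + 0.3**2) + 0.01))

rng = np.random.default_rng(8)
x, chain = 0.0, np.empty(20000)
for t in range(20000):
    y = x + 0.2 * rng.normal()          # random walk with variance .04
    if rng.uniform() < f(y) / f(x):
        x = y
    chain[t] = x

# Riemann-sum mass estimate: close to 1 when both modes have been visited,
# visibly below 1 when part of the support has been missed
s = np.sort(chain)
mass = float(np.sum(np.diff(s) * f(s[:-1])))
```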
  • 89. Effective sample size: How many iid simulations from π are equivalent to N simulations from the MCMC algorithm?
  • 90. (cont.) Based on the estimated k-th order auto-correlation, ρk = cov(x(t), x(t+k)), effective sample size N^ess = n ( 1 + 2 ∑_{k=1}^{T0} ρ̂k )^{−1/2}. Only partial indicator that fails to signal chains stuck in one mode of the target
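A sketch of an ESS estimator is below. Note it implements the commonly used variant n / (1 + 2 ∑ ρ̂k) with exponent −1 (the deck's formula carries a −1/2 exponent instead), truncating the autocorrelation sum at the first negative estimate; both the AR(1) comparison chain and the truncation rule are illustrative choices:

```python
import numpy as np

def ess(chain, max_lag=200):
    # n / (1 + 2 * sum of autocorrelations), truncated at first negative estimate
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    var = x.var()
    rho_sum = 0.0
    for k in range(1, max_lag + 1):
        rho = (x[:-k] * x[k:]).mean() / var   # lag-k autocorrelation estimate
        if rho < 0:
            break
        rho_sum += rho
    return x.size / (1 + 2 * rho_sum)

rng = np.random.default_rng(9)
iid = rng.normal(size=5000)           # independent draws: ESS close to n
ar = np.empty(5000)                   # AR(1) with phi = .9: strong autocorrelation
ar[0] = 0.0
for t in range(1, 5000):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
```

As the slide warns, a chain stuck in a single mode can exhibit low autocorrelation and hence a flattering ESS, so this remains a partial indicator.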
  • 91. Tempering: Facilitate exploration of π by flattening the target: simulate from πα(x) ∝ π(x)^α for α > 0 small enough
  • 92. (cont.) Determine where the modal regions of π are (possibly with parallel versions using different α's). Recycle simulations from π(x)^α into simulations from π by importance sampling. Simple modification of the Metropolis–Hastings algorithm, with new acceptance [ π(θ′|x) / π(θ|x) ]^α · [ q(θ|θ′) / q(θ′|θ) ] ∧ 1
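The flatten-then-recycle scheme can be sketched as follows, on a hypothetical bimodal target with modes at ±3 (not one of the deck's examples): a random walk on π^α crosses the flattened barrier easily, and importance weights π^(1−α) turn its draws back into estimates under π:

```python
import numpy as np

def rw_chain(logf, scale, x0, n_iter, rng):
    # plain random-walk Metropolis on log-target logf
    x, lfx = x0, logf(x0)
    out = np.empty(n_iter)
    for t in range(n_iter):
        y = x + scale * rng.normal()
        lfy = logf(y)
        if np.log(rng.uniform()) < lfy - lfx:
            x, lfx = y, lfy
        out[t] = x
    return out

# Hypothetical bimodal target pi(x) ∝ exp(-(x-3)^2/2) + exp(-(x+3)^2/2)
def log_pi(x):
    return np.logaddexp(-0.5 * (x - 3) ** 2, -0.5 * (x + 3) ** 2)

alpha = 0.2                                     # flattening exponent
rng = np.random.default_rng(10)
flat = rw_chain(lambda x: alpha * log_pi(x), 1.0, -3.0, 20000, rng)

# recycle the pi^alpha draws into estimates under pi by importance sampling
w = np.exp((1 - alpha) * log_pi(flat))
w /= w.sum()
mean_est = float((w * flat).sum())              # pi is symmetric: true mean is 0
```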
  • 93. Tempering with the mean mixture: [Figures: tempered target surfaces π^α for α = 1, 0.5, 0.2, in the (µ1, µ2) plane]