Introduction to advanced Monte Carlo methods


An introduction to advanced (?) MCMC methods
Christian P. Robert
Université Paris-Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian
Royal Statistical Society, October 13, 2010
Outline
1 Motivating example
2 The Metropolis-Hastings Algorithm
Motivating example: Latent structures make life harder!
Even simple models may lead to computational complications, as in latent variable models
  f(x|θ) = ∫ f⋆(x, x⋆|θ) dx⋆
If (x, x⋆) is observed, fine! If only x is observed, trouble!
Example (Mixture models)
Models of mixtures of distributions: X ∼ fj with probability pj, for j = 1, 2, …, k, with overall density
  X ∼ p1 f1(x) + ⋯ + pk fk(x).
For a sample of independent random variables (X1, …, Xn), the sample density is
  ∏_{i=1}^{n} { p1 f1(xi) + ⋯ + pk fk(xi) }.
Expanding this product involves kⁿ elementary terms: prohibitive to compute in large samples.
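Although the kⁿ-term expansion is intractable, the mixture likelihood itself is cheap to evaluate pointwise, since each factor is just a k-term sum. A minimal sketch (the function name `mixture_loglik` is illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def mixture_loglik(x, weights, means, sds):
    """Log-likelihood of a k-component Gaussian mixture, computed
    directly from the sum form -- no k^n expansion is needed."""
    # shape (n, k): weighted density of each point under each component
    comp = norm.pdf(x[:, None], loc=means, scale=sds) * weights
    return np.log(comp.sum(axis=1)).sum()

rng = np.random.default_rng(0)
x = rng.normal(size=100)
# two-component mixture .3 N(0,1) + .7 N(1,1), as in the slides' example
ll = mixture_loglik(x, np.array([0.3, 0.7]), np.array([0.0, 1.0]),
                    np.array([1.0, 1.0]))
```

The cost is O(nk) per evaluation, which is what makes MCMC on mixture posteriors feasible at all.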
[Figure: log-likelihood surface of 0.3 N(µ1, 1) + 0.7 N(µ2, 1) over (µ1, µ2)]
A typology of Bayes computational problems
(i) use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
(ii) use of a complex sampling model with an intractable likelihood, as for instance in missing data and graphical models;
(iii) use of a huge dataset;
(iv) use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);
(v) use of a complex inferential procedure, as for instance Bayes factors
  B01(x) = [ P(θ ∈ Θ0 | x) / P(θ ∈ Θ1 | x) ] ÷ [ π(θ ∈ Θ0) / π(θ ∈ Θ1) ].
The Metropolis-Hastings Algorithm
1 Motivating example
2 The Metropolis-Hastings Algorithm
  Monte Carlo Methods based on Markov Chains
  The Metropolis–Hastings algorithm
  A collection of Metropolis-Hastings algorithms
  Extensions
  Convergence assessment
Running Monte Carlo via Markov Chains
Fact: it is not necessary to use a sample from the distribution f to approximate the integral
  I = ∫ h(x) f(x) dx.
We can obtain X1, …, Xn ∼ f (approximately) without directly simulating from f, using an ergodic Markov chain with stationary distribution f.
Running Monte Carlo via Markov Chains (2)
Idea: for an arbitrary starting value x(0), an ergodic chain (X(t)) is generated using a transition kernel with stationary distribution f.
This ensures the convergence in distribution of (X(t)) to a random variable from f: for a "large enough" T0, X(T0) can be considered as distributed from f.
It produces a dependent sample X(T0), X(T0+1), …, which is generated from f, sufficient for most approximation purposes.
The Metropolis–Hastings algorithm
Problem: how can one build a Markov chain with a given stationary distribution?
MH basics: an algorithm that converges to the objective (target) density f using an arbitrary transition kernel density q(x, y), called the instrumental (or proposal) distribution.
Algorithm (Metropolis–Hastings)
Given x(t),
1. Generate Yt ∼ q(x(t), y).
2. Take
  X(t+1) = Yt with probability ρ(x(t), Yt),
  X(t+1) = x(t) with probability 1 − ρ(x(t), Yt),
where
  ρ(x, y) = min{ [f(y) q(y, x)] / [f(x) q(x, y)], 1 }.
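The two steps above can be sketched in a few lines. This is a generic sketch, not the slides' own code; the function names (`metropolis_hastings`, `q_sample`, `q_logpdf`) are illustrative, and the acceptance is computed on the log scale for numerical stability:

```python
import numpy as np

def metropolis_hastings(logf, q_sample, q_logpdf, x0, n_iter, rng):
    """Generic Metropolis-Hastings sampler (sketch).
    logf(x):            log target density, up to a constant
    q_sample(x, rng):   draw y ~ q(x, .)
    q_logpdf(x, y):     log q(x, y)"""
    chain, x = [x0], x0
    for _ in range(n_iter):
        y = q_sample(x, rng)
        # log of the ratio f(y) q(y, x) / (f(x) q(x, y))
        log_rho = logf(y) + q_logpdf(y, x) - logf(x) - q_logpdf(x, y)
        if np.log(rng.uniform()) < log_rho:
            x = y
        chain.append(x)
    return np.array(chain)

rng = np.random.default_rng(42)
# example: standard normal target, Gaussian random walk proposal
chain = metropolis_hastings(
    logf=lambda x: -0.5 * x**2,
    q_sample=lambda x, rng: x + rng.normal(),
    q_logpdf=lambda x, y: -0.5 * (y - x)**2,
    x0=0.0, n_iter=5000, rng=rng)
```

Note that `logf` needs only to be known up to a constant, matching the first "Features" point on the next slide.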
Features
• Independent of normalizing constants for both f and q(x, ·) (i.e., of those constants independent of x)
• Never moves to values with f(y) = 0
• The chain (x(t))t may take the same value several times in a row, even though f is a density w.r.t. Lebesgue measure
• The sequence (yt)t is usually not a Markov chain
• Satisfies the detailed balance condition
    f(x) K(x, y) = f(y) K(y, x)
  [Green, 1995]
Convergence properties
1. The M-H Markov chain is reversible, with invariant/stationary density f.
2. As f is a probability measure, the chain is positive recurrent.
3. If
    Pr[ f(Yt) q(Yt, X(t)) / ( f(X(t)) q(X(t), Yt) ) ≥ 1 ] < 1,    (1)
   i.e., if the event {X(t+1) = X(t)} occurs with positive probability, then the chain is aperiodic.
Convergence properties (2)
4. If
    q(x, y) > 0 for every (x, y),    (2)
   the chain is irreducible.
5. For M-H, f-irreducibility implies Harris recurrence.
6. Thus, under conditions (1) and (2):
   (i) for h with E_f |h(X)| < ∞,
     lim_{T→∞} (1/T) Σ_{t=1}^{T} h(X(t)) = ∫ h(x) f(x) dx   a.e. f;
   (ii) and
     lim_{n→∞} ‖ ∫ Kⁿ(x, ·) µ(dx) − f ‖_TV = 0
   for every initial distribution µ, where Kⁿ(x, ·) denotes the kernel for n transitions.
The Independent Case
The instrumental distribution q(x, ·) is independent of x and is denoted g.
Algorithm (Independent Metropolis-Hastings)
Given x(t),
1. Generate Yt ∼ g(y).
2. Take
  X(t+1) = Yt with probability min{ [f(Yt) g(x(t))] / [f(x(t)) g(Yt)], 1 },
  X(t+1) = x(t) otherwise.
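A sketch of the independent case (illustrative names; not the slides' own code). The proposal is a Student t(5), whose tails dominate the standard normal target so that f/g stays bounded, anticipating the uniform ergodicity condition of the next slide:

```python
import numpy as np
from scipy.stats import t as student_t

def independent_mh(logf, g, x0, n_iter, rng):
    """Independent Metropolis-Hastings (sketch): the proposal g
    does not depend on the current state."""
    chain, x = [x0], x0
    for _ in range(n_iter):
        y = g.rvs(random_state=rng)
        # log of the ratio f(y) g(x) / (f(x) g(y))
        log_rho = logf(y) + g.logpdf(x) - logf(x) - g.logpdf(y)
        if np.log(rng.uniform()) < log_rho:
            x = y
        chain.append(x)
    return np.array(chain)

rng = np.random.default_rng(1)
# target N(0,1); heavier-tailed t(5) proposal keeps f/g bounded
chain = independent_mh(lambda x: -0.5 * x**2, student_t(df=5),
                       0.0, 5000, rng)
```

Swapping in a proposal with lighter tails than the target (e.g. a narrow normal) would break the boundedness condition and can produce long sticky stretches in the chain.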
Properties
The resulting sample is not i.i.d., but there exist strong convergence properties:
Theorem (Ergodicity)
The algorithm produces a uniformly ergodic chain if there exists a constant M such that
  f(x) ≤ M g(x),   x ∈ supp f.
In this case,
  ‖Kⁿ(x, ·) − f‖_TV ≤ (1 − 1/M)ⁿ.
[Mengersen & Tweedie, 1996]
Example (Noisy AR(1))
Hidden Markov chain from a regular AR(1) model,
  x_{t+1} = ϕ x_t + ε_{t+1},   ε_t ∼ N(0, τ²),
and observables
  y_t | x_t ∼ N(x_t², σ²).
The distribution of x_t given x_{t−1}, x_{t+1} and y_t is proportional to
  exp{ −[(x_t − ϕ x_{t−1})² + (x_{t+1} − ϕ x_t)²] / 2τ² − (y_t − x_t²)² / 2σ² }.
Example (Noisy AR(1) too)
Use for proposal the N(µt, ωt²) distribution, with
  µt = ϕ (x_{t−1} + x_{t+1}) / (1 + ϕ²)   and   ωt² = τ² / (1 + ϕ²).
The ratio
  π(x) / q_ind(x) = exp{ −(y_t − x_t²)² / 2σ² }
is bounded.
[Figure: (top) last 500 realisations of the chain {Xk} out of 10,000 iterations; (bottom) histogram of the chain, compared with the target distribution]
Random walk Metropolis–Hastings
Instead, use a local perturbation as proposal:
  Yt = X(t) + εt,   εt ∼ g, independent of X(t).
The instrumental density is now of the form g(y − x) and the Markov chain is a random walk if g is symmetric: g(x) = g(−x).
Algorithm (Random walk Metropolis)
Given x(t),
1. Generate Yt ∼ g(y − x(t)).
2. Take
  X(t+1) = Yt with probability min{ 1, f(Yt)/f(x(t)) },
  X(t+1) = x(t) otherwise.
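Because g is symmetric, the proposal densities cancel and the acceptance ratio reduces to f(Yt)/f(x(t)). A sketch (illustrative names, not the slides' code), also tracking the acceptance rate that later slides use for calibration:

```python
import numpy as np

def rw_metropolis(logf, x0, scale, n_iter, rng):
    """Random walk Metropolis (sketch): symmetric Gaussian perturbation,
    so the acceptance ratio reduces to f(y)/f(x)."""
    x, chain, accepted = x0, [x0], 0
    for _ in range(n_iter):
        y = x + scale * rng.normal()
        if np.log(rng.uniform()) < logf(y) - logf(x):
            x, accepted = y, accepted + 1
        chain.append(x)
    return np.array(chain), accepted / n_iter

rng = np.random.default_rng(7)
# N(0,1) target; scale 2.4 is a classic choice for a 1-d Gaussian target
chain, acc = rw_metropolis(lambda x: -0.5 * x**2, 0.0, 2.4, 20000, rng)
```

Varying `scale` here reproduces the trade-off illustrated in the mixture plots below: too small and the chain crawls, too large and most proposals are rejected.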
Probit illustration
Likelihood and posterior given by
  π(β|y, X) ∝ ℓ(β|y, X) ∝ ∏_{i=1}^{n} Φ(x_i^T β)^{y_i} (1 − Φ(x_i^T β))^{n_i − y_i}
under the flat prior.
A random walk proposal works well for a small number of predictors. Use the maximum likelihood estimate β̂ as starting value and the asymptotic (Fisher) covariance matrix of the MLE, Σ̂, as scale.
MCMC algorithm: probit random-walk Metropolis-Hastings
Initialization: set β(0) = β̂ and compute Σ̂.
Iteration t:
1. Generate β̃ ∼ N_{k+1}(β(t−1), τ Σ̂).
2. Compute
  ρ(β(t−1), β̃) = min{ 1, π(β̃|y) / π(β(t−1)|y) }.
3. With probability ρ(β(t−1), β̃) set β(t) = β̃; otherwise set β(t) = β(t−1).
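A sketch of this sampler on simulated data. For simplicity the proposal here uses a fixed diagonal scale instead of the slides' τΣ̂ (and starts from zero instead of β̂); all names and data are illustrative:

```python
import numpy as np
from scipy.stats import norm

def probit_logpost(beta, y, X):
    """Flat-prior probit log-posterior (sketch), one trial per case."""
    p = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)  # numerical guard
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(3)
n, k = 200, 2
X = rng.normal(size=(n, k))
beta_true = np.array([1.0, -0.5])                      # simulated truth
y = (rng.uniform(size=n) < norm.cdf(X @ beta_true)).astype(float)

# random walk MH on beta, fixed scale in place of tau * Sigma-hat
beta, chain = np.zeros(k), []
for _ in range(4000):
    prop = beta + 0.2 * rng.normal(size=k)
    if np.log(rng.uniform()) < (probit_logpost(prop, y, X)
                                - probit_logpost(beta, y, X)):
        beta = prop
    chain.append(beta)
chain = np.array(chain)
```

With the MLE start and Fisher-scaled proposal of the slides, the burn-in would be much shorter; the fixed-scale version above simply makes the sketch self-contained.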
R bank benchmark
Probit modelling with no intercept over the four measurements. Three different scales τ = 1, 0.1, 10 were tried: the best mixing behavior is associated with τ = 1. Averaging the parameters over 9,000 MCMC iterations gives the plug-in estimate
  p̂_i = Φ(−1.2193 x_{i1} + 0.9540 x_{i2} + 0.9795 x_{i3} + 1.1481 x_{i4}).
[Figure: trace plots, histograms and autocorrelations of the four parameters for the three scales]
Example (Mixture models)
  π(θ|x) ∝ ∏_{j=1}^{n} { Σ_{ℓ=1}^{k} p_ℓ f(x_j|µ_ℓ, σ_ℓ) } π(θ)
Metropolis-Hastings proposal:
  θ(t+1) = θ(t) + ωε(t) if u(t) < ρ(t),   θ(t+1) = θ(t) otherwise,
where
  ρ(t) = π(θ(t) + ωε(t)|x) / π(θ(t)|x) ∧ 1
and ω is scaled for a good acceptance rate.
[Figures: random walk MCMC output for .7N(µ1, 1) + .3N(µ2, 1) over (µ1, µ2), with scale 1 at iterations 1, 10, 100, 500 and 1000, and with scale √.1 at iterations 10, 100, 500, 1000, 5000 and 10,000]
Convergence properties
Uniform ergodicity is prohibited by the random walk structure. At best, geometric ergodicity:
Theorem (Sufficient ergodicity)
For a symmetric density f, log-concave in the tails, and a positive and symmetric density g, the chain (X(t)) is geometrically ergodic — no tail effect.
[Mengersen & Tweedie, 1996]
Example (Comparison of tail effects)
Random-walk Metropolis–Hastings algorithms based on a N(0, 1) instrumental for the generation of (a) a N(0, 1) distribution and (b) a distribution with density ψ(x) ∝ (1 + |x|)⁻³.
[Figure: 90% confidence envelopes of the means, derived from 500 parallel independent chains]
Extensions
There are many other families of MH algorithms:
• Adaptive Rejection Metropolis Sampling
• Reversible Jump
• Langevin algorithms
to name just a few...
Langevin Algorithms
Proposal based on the Langevin diffusion Lt, defined by the stochastic differential equation
  dLt = dBt + (1/2) ∇ log f(Lt) dt,
where Bt is the standard Brownian motion.
Theorem
The Langevin diffusion is the only non-explosive diffusion which is reversible with respect to f.
Discretization
Because continuous time cannot be simulated, consider the discretised sequence
  x(t+1) = x(t) + (σ²/2) ∇ log f(x(t)) + σ εt,   εt ∼ N_p(0, I_p),
where σ² corresponds to the discretisation step.
[Figures: histograms of the discretised chain against the target density f(x) = exp(−x⁴), for σ² = .1, .01, .001 and .0001]
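The discretised (unadjusted) recursion above is a one-liner per step. A sketch for the slides' example f(x) = exp(−x⁴), whose gradient of the log-density is ∇log f(x) = −4x³ (the function name `ula` is illustrative):

```python
import numpy as np

def ula(grad_log_f, x0, sigma2, n_iter, rng):
    """Unadjusted discretised Langevin chain (sketch):
    x' = x + (sigma^2 / 2) * grad log f(x) + sigma * eps."""
    sigma = np.sqrt(sigma2)
    x, chain = x0, [x0]
    for _ in range(n_iter):
        x = x + 0.5 * sigma2 * grad_log_f(x) + sigma * rng.normal()
        chain.append(x)
    return np.array(chain)

rng = np.random.default_rng(5)
# f(x) = exp(-x^4)  =>  grad log f(x) = -4 x^3
chain = ula(lambda x: -4 * x**3, 0.0, 0.1, 20000, rng)
```

With σ² = .1 the chain stays stable here, but as the next slide warns, larger steps on this quartic target can make the discretised chain transient: the drift overshoots once |x| is large.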
Discretization (2)
Unfortunately, the discretized chain may be transient, for instance when
  lim_{x→±∞} σ² |∇ log f(x)| |x|⁻¹ > 1.
Example: f(x) = exp(−x⁴) when σ² = .2.
MH correction
Accept the new value Yt with probability
  [ f(Yt) exp{ −‖x(t) − Yt − (σ²/2) ∇log f(Yt)‖² / 2σ² } ]
  / [ f(x(t)) exp{ −‖Yt − x(t) − (σ²/2) ∇log f(x(t))‖² / 2σ² } ]  ∧ 1.
Choice of the scaling factor σ: it should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of x are uncorrelated).
[Roberts & Rosenthal, 1998]
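Adding this accept/reject step to the discretised chain gives the Metropolis-adjusted Langevin algorithm. A sketch (illustrative names), run on the quartic target with the σ² = .2 step that made the unadjusted chain risky; the correction restores f as the exact stationary distribution:

```python
import numpy as np

def mala(log_f, grad_log_f, x0, sigma2, n_iter, rng):
    """Metropolis-adjusted Langevin (sketch): Langevin proposal
    corrected by the MH acceptance step above."""
    sigma = np.sqrt(sigma2)
    def mean(x):                # mean of the Gaussian proposal from x
        return x + 0.5 * sigma2 * grad_log_f(x)
    def q_logpdf(x, y):         # log q(x, y), up to a common constant
        return -0.5 * (y - mean(x))**2 / sigma2
    x, chain, acc = x0, [x0], 0
    for _ in range(n_iter):
        y = mean(x) + sigma * rng.normal()
        log_rho = log_f(y) + q_logpdf(y, x) - log_f(x) - q_logpdf(x, y)
        if np.log(rng.uniform()) < log_rho:
            x, acc = y, acc + 1
        chain.append(x)
    return np.array(chain), acc / n_iter

rng = np.random.default_rng(9)
# target f(x) = exp(-x^4), log f = -x^4, grad log f = -4 x^3
chain, acc = mala(lambda x: -x**4, lambda x: -4 * x**3, 0.0, 0.2, 20000, rng)
```

Note the proposal is not symmetric (its mean is drift-shifted), so the q terms do not cancel as they did for the plain random walk.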
Optimizing the Acceptance Rate
Problem: choice of the transition kernel from a practical point of view. Most common alternatives:
(a) a fully automated algorithm like ARMS;
(b) an instrumental density g which approximates f, such that f/g is bounded, for uniform ergodicity to apply;
(c) a random walk.
In both cases (b) and (c), the choice of g is critical.
Case of the random walk
A different approach to acceptance rates: a high acceptance rate does not indicate that the algorithm is moving correctly, since it indicates that the random walk is moving too slowly on the surface of f.
If x(t) and yt are close, i.e. f(x(t)) ≃ f(yt), then yt is accepted with probability
  min{ f(yt)/f(x(t)), 1 } ≃ 1.
For multimodal densities with well separated modes, the negative effect of limited moves on the surface of f clearly shows.
Case of the random walk (2)
If the average acceptance rate is low, the successive values of f(yt) tend to be small compared with f(x(t)), which means that the random walk moves quickly on the surface of f, since it often reaches the "borders" of the support of f.
Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%; in large dimensions, at an average acceptance rate of 25%.
[Gelman, Gilks and Roberts, 1995]
This rule is to be taken with a pinch of salt!
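In practice one often calibrates the random walk scale with short pilot runs until the acceptance rate lands near the chosen target. A crude sketch of such a calibration loop (not from the slides; names and the multiplicative update rule are illustrative):

```python
import numpy as np

def tune_scale(logf, x0, n_pilot, target_rate, rng):
    """Pilot-run calibration of the random walk scale (sketch):
    grow the scale when acceptance is too high, shrink it when too low."""
    scale = 1.0
    for _ in range(20):                        # a few pilot rounds
        x, acc = x0, 0
        for _ in range(n_pilot):
            y = x + scale * rng.normal()
            if np.log(rng.uniform()) < logf(y) - logf(x):
                x, acc = y, acc + 1
        rate = acc / n_pilot
        scale *= np.exp(rate - target_rate)    # multiplicative update
    return scale, rate

rng = np.random.default_rng(11)
# 1-d N(0,1) target, aiming at the small-dimension 50% rule of thumb
scale, rate = tune_scale(lambda x: -0.5 * x**2, 0.0, 500, 0.5, rng)
```

Such tuning must be done before the final run (or with care, via proper adaptive MCMC): naively changing the scale during sampling breaks the Markov property.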
Example (Noisy AR(1) continued)
For a Gaussian random walk with scale ω small enough, the random walk never jumps to the other mode. But if the scale ω is sufficiently large, the Markov chain explores both modes and gives a satisfactory approximation of the target distribution.
[Figures: Markov chains based on random walks with scales ω = .1 and ω = .5]
Where do we stand?
MCMC in a nutshell:
• Running a sequence X_{t+1} = Ψ(X_t, Y_t) provides an approximation to the target density f when the detailed balance condition holds:
    f(x) K(x, y) = f(y) K(y, x)
• The easiest implementation of the principle is random walk Metropolis-Hastings:
    Yt = X(t) + εt
• Practical convergence requires sufficient energy from the proposal, which is calibrated by trial and error.
Convergence diagnostics
How many iterations?
Rule #1: there is no absolute number of simulations, i.e. 1,000 is neither large nor small.
Rule #2: it takes [much] longer to check for convergence than for the chain itself to converge.
Rule #3: MCMC is a "what-you-get-is-what-you-see" algorithm: it fails to tell about unexplored parts of the space.
Rule #4: when in doubt, run MCMC chains in parallel and check for consistency.
Many "quick-&-dirty" solutions exist in the literature, but they are not necessarily 100% trustworthy.
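Rule #4 can be made quantitative with a between/within-chain variance ratio in the style of Gelman and Rubin (not covered in the slides; the function name and the simplified formula below are a sketch). Values near 1 suggest the parallel chains agree:

```python
import numpy as np

def gelman_rubin(chains):
    """Simplified Gelman-Rubin ratio (sketch) for m parallel chains
    of equal length n; values near 1 indicate consistency."""
    chains = np.asarray(chains)                   # shape (m, n)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()         # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)       # between-chain variance
    var_hat = (n - 1) / n * W + B / n             # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(2)

def rw_chain(x0, n, rng):
    """Random walk Metropolis chain targeting N(0,1)."""
    x, out = x0, []
    for _ in range(n):
        y = x + rng.normal()
        if np.log(rng.uniform()) < 0.5 * (x**2 - y**2):
            x = y
        out.append(x)
    return out

# three chains from dispersed starting points, burn-in discarded
chains = [rw_chain(x0, 2000, rng) for x0 in (-3.0, 0.0, 3.0)]
rhat = gelman_rubin([c[500:] for c in chains])
```

Consistently large values of this ratio flag that at least one chain has not yet reached the common stationary regime, though, per Rule #3, agreement does not prove the whole space was explored.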
Example (Bimodal target)
Density
  f(x) = exp(−x²/2) [4(x − .3)² + .01] / { [4(1 + (.3)²) + .01] √(2π) }
and use of a random walk Metropolis–Hastings algorithm with variance .04.
Evaluation of the missing mass by
  Σ_{t=1}^{T−1} [θ(t+1) − θ(t)] f(θ(t)).
[Figure: sequence (in blue) and mass evaluation (in brown)]
[Philippe & Robert, 2001]
Effective sample size
How many i.i.d. simulations from π are equivalent to N simulations from the MCMC algorithm?
Based on the estimated k-th order auto-correlation, ρk = cov(x(t), x(t+k)), the effective sample size is
  N^ess = n ( 1 + 2 Σ_{k=1}^{T0} ρ̂k )^{−1/2}.
This is only a partial indicator, which fails to signal chains stuck in one mode of the target.
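A commonly used variant of this quantity is n / (1 + 2 Σ ρ̂k), with the sum truncated at the first negative autocorrelation estimate. A sketch of that variant (illustrative function name; not the slides' exact formula), comparing nearly independent draws against a strongly autocorrelated AR(1) sequence:

```python
import numpy as np

def effective_sample_size(x, max_lag=100):
    """ESS sketch: n / (1 + 2 * sum of autocorrelations), with the sum
    truncated at the first negative autocorrelation estimate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    var = xc @ xc / n
    rho_sum = 0.0
    for k in range(1, max_lag + 1):
        rho = (xc[:-k] @ xc[k:]) / (n * var)   # lag-k autocorrelation
        if rho < 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

rng = np.random.default_rng(4)
iid = rng.normal(size=5000)            # independent draws: ESS near n
ar = np.zeros(5000)                    # AR(1), coefficient .9: ESS << n
for t in range(1, 5000):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
ess_iid = effective_sample_size(iid)
ess_ar = effective_sample_size(ar)
```

As the slide warns, a chain trapped in one mode can show low autocorrelation and hence a flattering ESS, so this diagnostic should be read alongside the parallel-chain rules above.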
Tempering
Facilitate exploration of π by flattening the target: simulate from
  πα(x) ∝ π(x)^α
for α > 0 small enough.
• Determine where the modal regions of π are (possibly with parallel versions using different α's)
• Recycle simulations from π(x)^α into simulations from π by importance sampling
• Simple modification of the Metropolis–Hastings algorithm, with new acceptance
    [ π(θ′|x) / π(θ|x) ]^α · q(θ|θ′) / q(θ′|θ)  ∧ 1
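For a symmetric random walk proposal, the q terms cancel and the tempered acceptance exponent is simply multiplied by α. A sketch on an artificial well-separated bimodal target (illustrative names; the target is not the slides' mixture): the flattened chain crosses between modes easily, while the untempered chain can stay stuck near its start:

```python
import numpy as np

def tempered_rw(logpi, alpha, x0, scale, n_iter, rng):
    """Random walk MH on the flattened target pi(x)^alpha (sketch):
    the log acceptance ratio is scaled by alpha."""
    x, chain = x0, [x0]
    for _ in range(n_iter):
        y = x + scale * rng.normal()
        if np.log(rng.uniform()) < alpha * (logpi(y) - logpi(x)):
            x = y
        chain.append(x)
    return np.array(chain)

rng = np.random.default_rng(8)
# bimodal target: equal mixture of N(-4, 1) and N(4, 1), up to a constant
logpi = lambda x: np.logaddexp(-0.5 * (x + 4)**2, -0.5 * (x - 4)**2)
hot = tempered_rw(logpi, 0.1, -4.0, 2.0, 20000, rng)   # flattened: crosses
cold = tempered_rw(logpi, 1.0, -4.0, 2.0, 20000, rng)  # original target
```

The hot draws target π^α, not π, so, as the slide notes, they must be reweighted by importance sampling (weights ∝ π^{1−α}) before being used as draws from π.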
Tempering with the mean mixture
[Figures: simulated samples over the (µ1, µ2) surface for α = 1, 0.5 and 0.2]