# Monte Carlo Simulations, Sampling and Markov Chain Monte Carlo

This is an invited talk given at the Basque Center for Applied Mathematics (BCAM) in Spain in 2010.

Xin-She Yang

Estimating π

How to estimate π using only a ruler and some match sticks?
Buffon's Needle Problem

Buffon's needle problem (1733). Probability of crossing a line:

$$p = \frac{2}{\pi} \cdot \frac{L}{d},$$

where $L$ = length of the needles, and $d$ = spacing of the lines.
Probability of Crossing a Line

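Since $p \approx n/N \approx 2L/(\pi d)$, we have

$$\pi \approx \frac{2N}{n} \cdot \frac{L}{d}.$$

Lazzarini (1901): $L = 5d/6$, $N = 3408$, $n = 1808$, so

$$\pi \approx \frac{2 \times 3408}{1808} \cdot \frac{5}{6} \approx 3.1415929.$$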

Too accurate?! Is this right? What happens when n = 1809?

Errors $\sim 1/\sqrt{N} \sim 2\%$.
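
The question about n = 1809 can be checked directly. The following is a minimal sketch, not part of the original talk: the function names `pi_from_counts` and `simulate_buffon` and the choice of 200,000 drops are assumptions made for illustration. It evaluates the estimate for Lazzarini's counts, for one extra crossing, and for a simulated experiment.

```python
import math
import random


def pi_from_counts(n_drops, n_crossings, length_over_spacing):
    """Buffon estimate: pi ~ (2 N / n) * (L / d)."""
    return 2.0 * n_drops / n_crossings * length_over_spacing


def simulate_buffon(n_drops, length=5.0, spacing=6.0, seed=0):
    """Drop needles at random and count how many cross a line (needs L <= d)."""
    rng = random.Random(seed)
    crossings = 0
    for _ in range(n_drops):
        # Distance from the needle centre to the nearest line.
        centre = rng.uniform(0.0, spacing / 2.0)
        # Acute angle between the needle and the lines.
        # Sampling the angle uses math.pi, so this only illustrates the statistics.
        angle = rng.uniform(0.0, math.pi / 2.0)
        if centre <= (length / 2.0) * math.sin(angle):
            crossings += 1
    return crossings


lazzarini = pi_from_counts(3408, 1808, 5.0 / 6.0)
one_more = pi_from_counts(3408, 1809, 5.0 / 6.0)
n_drops = 200_000
simulated = pi_from_counts(n_drops, simulate_buffon(n_drops), 5.0 / 6.0)
print(lazzarini, one_more, simulated)
```

A single extra crossing shifts the estimate from about 3.14159 to about 3.13986, so agreement to seven digits depends on stopping at exactly the right count.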
Monte Carlo Methods

Everyone has used Monte Carlo methods in some way ...

Measure temperatures, choose a product, ...
Taste soup, wine ...
Monte Carlo Integration

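$$I = \int_{\Omega} f \, dV = V \cdot \frac{1}{N} \sum_{i=1}^{N} f_i + O(\epsilon),$$

$$\epsilon \sim \sqrt{\frac{\frac{1}{N} \sum_{i=1}^{N} f_i^2 - \mu^2}{N}} \sim O(1/\sqrt{N}).$$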
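
As a rough illustration of the estimator and its error term, here is a sketch under stated assumptions: the domain is taken to be the unit hypercube (so V = 1), μ is taken to be the sample mean of the f_i, and the integrand `sum_of_squares` and the helper `mc_integrate` are made up for this example.

```python
import math
import random


def mc_integrate(f, dim, n_samples, seed=0):
    """Estimate the integral of f over the unit hypercube [0, 1]^dim (volume V = 1)."""
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_samples):
        point = [rng.random() for _ in range(dim)]
        value = f(point)
        total += value
        total_sq += value * value
    volume = 1.0
    mean = total / n_samples                      # mu = (1/N) sum f_i
    variance = total_sq / n_samples - mean ** 2   # (1/N) sum f_i^2 - mu^2
    estimate = volume * mean
    error = volume * math.sqrt(max(variance, 0.0) / n_samples)
    return estimate, error


def sum_of_squares(point):
    return sum(x * x for x in point)


# The exact value of the integral of x^2 + y^2 over the unit square is 2/3.
estimate, error = mc_integrate(sum_of_squares, dim=2, n_samples=100_000)
print(estimate, error)
```

Quadrupling `n_samples` should roughly halve the reported error, in line with the 1/√N behaviour.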
Importance and Quality of the Samples

Higher dimensions – even more challenging!
$$I = \int\!\!\int \cdots \int f(u, v, \ldots, w) \, du \, dv \cdots dw.$$

Errors $\sim 1/\sqrt{N}$

Higher dimensional integrals

How to distribute these sampling points?

Regular grids: $E \sim O(N^{-2/d})$ in $d \ge 4$ dimensions (not enough!)

Strategies: importance sampling, Latin hypercube, ...

Any other ways?
Quasi-Monte Carlo Methods

In essence, the idea is to distribute (consecutive) sampling points as far apart as possible, using quasi-random or low-discrepancy numbers (not pseudo-random): Halton, Sobol, van der Corput, ...

For example, the van der Corput sequence expresses an integer $n$ in a prime base $b$:

$$n = \sum_{j=0}^{m} a_j(n) \, b^j, \qquad a_j \in \{0, 1, 2, \ldots, b-1\}.$$

Then, it is reversed or reflected:

$$\phi_b(n) = \sum_{j=0}^{m} a_j(n) \, \frac{1}{b^{j+1}}.$$

For example, in base 2: $0, 1, 2, \ldots, 15 \Longrightarrow 0, \frac{1}{2}, \frac{1}{4}, \frac{3}{4}, \frac{1}{8}, \ldots, \frac{15}{16}$.

Errors ∼ O(1/N)
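
The digit reversal above can be sketched in a few lines. The function name `van_der_corput` is an assumption for illustration. It peels off the base-b digits of n from lowest to highest and accumulates them as a fraction.

```python
def van_der_corput(n, base=2):
    """Radical inverse of the non-negative integer n in the given base."""
    value = 0.0
    denominator = 1.0
    while n > 0:
        n, digit = divmod(n, base)   # peel off the lowest digit a_j
        denominator *= base          # b^(j+1)
        value += digit / denominator
    return value


sequence = [van_der_corput(n) for n in range(16)]
print(sequence[:5], sequence[15])
```

Each new point falls into the largest gap left by the earlier ones, which is what keeps the discrepancy low.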
Pseudorandom numbers – by deterministic sequences

Uniform distributions:

$$d_i = (a \, d_{i-1} + c) \bmod m.$$

Classic IBM generator:

$$a = 65539, \quad c = 0, \quad m = 2^{31} \quad \text{(strong correlation!)}$$

In fact, correlation coefficient is 1!

Better choice (old Matlab):

$$a = 7^5 = 16807, \quad c = 0, \quad m = 2^{31} - 1 = 2{,}147{,}483{,}647.$$

If scaled by $m$, all numbers are in $[1/m, (m-1)/m]$.

New Matlab: $[\epsilon, 1 - \epsilon]$, $\epsilon = 2^{-53} \approx 1.1 \times 10^{-16}$.

IEEE: 64-bit system = 53 bits for a signed fraction in base 2 and 11 bits for a signed exponent.
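
A minimal sketch of the recurrence with both parameter sets follows. The helper names `lcg` and `take` and the seed of 1 are assumptions for illustration. For the classic IBM parameters it also checks a known linear relation between three consecutive outputs, one concrete way of seeing the strong correlation.

```python
def lcg(seed, a, c, m):
    """Yield the sequence d_i = (a * d_{i-1} + c) mod m, starting after the seed."""
    d = seed
    while True:
        d = (a * d + c) % m
        yield d


def take(generator, count):
    return [next(generator) for _ in range(count)]


M_OLD_MATLAB = 2**31 - 1
old_matlab = take(lcg(1, 7**5, 0, M_OLD_MATLAB), 1000)
uniforms = [d / M_OLD_MATLAB for d in old_matlab]   # scaled by m

# Classic IBM parameters: every value is fixed by the previous two,
# d_{i+2} = (6 d_{i+1} - 9 d_i) mod 2^31.
ibm = take(lcg(1, 65539, 0, 2**31), 1000)
ibm_is_linear = all(
    ibm[i + 2] == (6 * ibm[i + 1] - 9 * ibm[i]) % 2**31 for i in range(len(ibm) - 2)
)
print(old_matlab[:3], min(uniforms), max(uniforms), ibm_is_linear)
```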
Other Distributions

Inverse transform method, rejection method, Mersenne twister, ..., Markov chain Monte Carlo.
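Standard normal distribution:

$$p(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2},$$

CDF:

$$\Phi(v) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{v} e^{-u^2/2} \, du = \frac{1}{2} \left[ 1 + \operatorname{erf}\!\left( \frac{v}{\sqrt{2}} \right) \right],$$

$$v = \Phi^{-1}(u) = \sqrt{2} \, \operatorname{erf}^{-1}(2u - 1),$$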
Transform method: Limitations

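$$v = \Phi^{-1}(u) = \sqrt{2} \, \operatorname{erf}^{-1}(2u - 1),$$

$$\operatorname{erf}^{-1}(x) = \frac{\sqrt{\pi}}{2} \left[ x + \frac{\pi x^3}{12} + \frac{7 \pi^2 x^5}{480} + \frac{127 \pi^3 x^7}{40320} + \cdots \right].$$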

Not so easy to calculate!

Sometimes, the inverse may not be possible.
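
To see why the series is awkward in practice, the sketch below truncates it after the four terms shown and compares the result with the inverse CDF from Python's `statistics.NormalDist`. The function names `erfinv_series` and `normal_from_uniform` and the test points 0.6 and 0.999 are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def erfinv_series(x):
    """First four terms of the series for erf^{-1}(x); accurate only for small |x|."""
    pi = math.pi
    return (math.sqrt(pi) / 2.0) * (
        x
        + pi * x**3 / 12.0
        + 7.0 * pi**2 * x**5 / 480.0
        + 127.0 * pi**3 * x**7 / 40320.0
    )


def normal_from_uniform(u):
    """Inverse transform: v = sqrt(2) * erfinv(2u - 1), using the truncated series."""
    return math.sqrt(2.0) * erfinv_series(2.0 * u - 1.0)


reference = NormalDist()  # standard normal; inv_cdf is the library's inverse CDF
error_centre = abs(normal_from_uniform(0.6) - reference.inv_cdf(0.6))
error_tail = abs(normal_from_uniform(0.999) - reference.inv_cdf(0.999))
print(error_centre, error_tail)
```

Near the centre the truncated series is very accurate, but in the tail it is off by more than one standard deviation, so many more terms (or a different approximation) would be needed there.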
Multivariate Distributions

Bivariate normal distributions:

$$p(v_1, v_2) = \frac{1}{2\pi} e^{-(v_1^2 + v_2^2)/2}.$$

Box-Müller method: from $u_1, u_2 \sim$ uniform distributions,

$$v_1 = \sqrt{-2 \ln u_1} \, \cos(2\pi u_2), \qquad v_2 = \sqrt{-2 \ln u_1} \, \sin(2\pi u_2).$$
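
A short sketch of the Box-Müller transform, with a check that the output has roughly zero mean and unit variance. The function name `box_muller`, the sample size and the seed are assumptions for illustration.

```python
import math
import random


def box_muller(rng):
    """Turn two uniform draws into two independent standard normal draws."""
    u1 = 1.0 - rng.random()   # in (0, 1], so log(u1) is finite
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    return radius * math.cos(2.0 * math.pi * u2), radius * math.sin(2.0 * math.pi * u2)


rng = random.Random(0)
samples = [v for _ in range(50_000) for v in box_muller(rng)]
mean = sum(samples) / len(samples)
variance = sum((v - mean) ** 2 for v in samples) / len(samples)
print(mean, variance)
```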

Problems

Difficult to calculate the inverse in most cases (sometimes, even impossible!).

Other methods (e.g., rejection method) are inefficient.

So – the Markov chain Monte Carlo (MCMC) way!
Random Walk down the Markov Chains

Random walk – A drunkard's walk:

$$u_{t+1} = \mu + u_t + w_t,$$

where $w_t$ is a random variable, and $\mu$ is the drift.

For example, $w_t \sim N(0, \sigma^2)$ (Gaussian).
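
The recurrence can be simulated directly. This is a minimal sketch with Gaussian steps. The function name `random_walk` and the parameter values are assumptions for illustration.

```python
import random


def random_walk(n_steps, mu=0.0, sigma=1.0, u0=0.0, seed=0):
    """Return [u_0, ..., u_n] with u_{t+1} = mu + u_t + w_t, w_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    path = [u0]
    for _ in range(n_steps):
        path.append(mu + path[-1] + rng.gauss(0.0, sigma))
    return path


path = random_walk(1000, mu=0.1, sigma=1.0)
drift_only = random_walk(10, mu=0.5, sigma=0.0)   # no noise: pure drift
print(path[-1], drift_only[-1])
```

With `sigma=0.0` the walk reduces to the drift alone, which makes the role of μ easy to see.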
Markov Chains

Markov chain: the next state only depends on the current state and the transition probability.

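$$P(i, j) \equiv P(V_{t+1} = S_j \mid V_0 = S_p, \ldots, V_t = S_i) = P(V_{t+1} = S_j \mid V_t = S_i),$$

$$\Longrightarrow \; P_{ij} \, \pi_i^* = P_{ji} \, \pi_j^* \quad \text{(detailed balance, for a reversible chain)},$$

where $\pi^*$ = stationary probability distribution.

Examples: Brownian motion

$$u_{i+1} = \mu + u_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2).$$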
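
As a small numerical illustration of a stationary distribution and the balance condition, the sketch below iterates a two-state chain until the distribution stops changing. The transition matrix `P` is a hypothetical example, not taken from the talk.

```python
def step(distribution, transition):
    """One step of the chain: new_j = sum_i distribution_i * P_ij."""
    n = len(transition)
    return [sum(distribution[i] * transition[i][j] for i in range(n)) for j in range(n)]


# Hypothetical two-state chain; row i holds P(i, j) and sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi_star = [1.0, 0.0]          # any starting distribution works for this chain
for _ in range(200):
    pi_star = step(pi_star, P)

# Probability flow between the two states under the stationary distribution.
flow_01 = pi_star[0] * P[0][1]
flow_10 = pi_star[1] * P[1][0]
print(pi_star, flow_01, flow_10)
```

For this chain the iteration settles at (5/6, 1/6), and the two flows agree, which is the balance condition written above.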
Markov Chains

Monopoly (board games)

A Famous \$Billion Markov Chain – PageRank

Google PageRank Algorithm (by Page et al., 1997)

Billions of web pages: pages = states, link probability ∼ 1/t where t ≈ the expectation of the number of clicks.
Googling as a Markov Chain
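$$\text{Rank}_i^{(t+1)} = \frac{1-\alpha}{N} + \alpha \sum_{p_j \in \Omega(p_i)} \frac{\text{Rank}_j^{(t)}}{B(p_j)},$$

where $N$ = number of pages, $\Omega(p_i)$ is the set of pages linking to page $p_i$, $B(p_j)$ is the link bounds (number of outbound links) of page $p_j$, and $\alpha$ = a ranking factor ($\approx 0.85$). Initially, $\text{Rank}_i^{(t=0)} = 1/N$.

Let $R = (\text{Rank}_1, \ldots, \text{Rank}_N)^T$, and $L(p_i, p_j) = 0$ if no links $\Longrightarrow$

$$R = \frac{1}{N} \begin{pmatrix} 1-\alpha \\ \vdots \\ 1-\alpha \end{pmatrix} + \alpha \begin{pmatrix} L(p_1, p_1) & \cdots & L(p_1, p_j) & \cdots & L(p_1, p_N) \\ \vdots & & \vdots & & \vdots \\ L(p_i, p_1) & \cdots & L(p_i, p_j) & \cdots & L(p_i, p_N) \\ \vdots & & \vdots & & \vdots \\ L(p_N, p_1) & \cdots & L(p_N, p_j) & \cdots & L(p_N, p_N) \end{pmatrix} R,$$

where $\sum_{i=1}^{N} L(p_i, p_j) = 1$. Google matrix (stochastic, sparse).

$\Longrightarrow$ a stationary probability distribution $R$ (updated monthly).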
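
The iteration can be sketched on a toy web of three pages. This is an illustrative sketch only: the graph `toy_links`, the function name `pagerank` and the stopping tolerance are assumptions, and every page is assumed to have at least one outbound link.

```python
def pagerank(links, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate Rank_i <- (1 - alpha)/N + alpha * sum_j Rank_j / B(p_j).

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}              # Rank^(t=0) = 1/N
    for _ in range(max_iter):
        new = {p: (1.0 - alpha) / n for p in pages}
        for p_j, outbound in links.items():
            share = alpha * rank[p_j] / len(outbound)   # Rank_j / B(p_j)
            for p_i in outbound:
                new[p_i] += share
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:
            break
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
cycle = pagerank({"A": ["B"], "B": ["C"], "C": ["A"]})
print(ranks, cycle)
```

Page C collects links from both other pages and ends up ranked highest, while the symmetric cycle gives every page the same rank of 1/3.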
Markov Chain Monte Carlo

Landmarks: Monte Carlo method (1930s, 1945, from 1950s) e.g., Metropolis Algorithm (1953), Metropolis-Hastings (1970).

Markov Chain Monte Carlo (MCMC) methods – A class of methods.

Really took off in the 1990s; now applied to a wide range of areas: physics, Bayesian statistics, climate change, machine learning, finance, economics, medicine, biology, materials and engineering ...
Metropolis-Hastings

The Metropolis-Hastings algorithm:

1. Begin with any initial $\theta_0$ at time $t \leftarrow 0$ such that $p(\theta_0) > 0$.

2. Generate a candidate sample $\theta^* \sim q(\theta_t, \cdot)$ from a proposal distribution.

3. Evaluate the acceptance probability $\alpha(\theta_t, \theta^*)$ given by

$$\alpha = \min\left[ \frac{p(\theta^*) \, q(\theta^*, \theta_t)}{p(\theta_t) \, q(\theta_t, \theta^*)}, \; 1 \right].$$

4. Generate a uniformly distributed random number $u \sim \text{Unif}[0, 1]$, and accept $\theta^*$ if $\alpha \ge u$. That is, if $\alpha \ge u$ then $\theta_{t+1} \leftarrow \theta^*$, else $\theta_{t+1} \leftarrow \theta_t$.

5. Increase the counter or time $t \leftarrow t + 1$, and go to step 2.

Mixture distribution: A distribution with known mean and variance.

$$f(x \mid \mu, \sigma^2) = \sum_{i=1}^{K} \alpha_i \, p_i(x \mid \mu_i, \sigma_i^2), \qquad \sum_{i=1}^{K} \alpha_i = 1.$$

E.g., α1 = α2 = 1/2, µ1 = 2, µ2 = −2 and σ1 = σ2 = 1.
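
The five steps can be sketched for this two-component mixture. The sketch assumes a Gaussian random-walk proposal, which is symmetric, so the q terms in the acceptance ratio cancel. The proposal width `step`, the chain length, the seed and the function names are all assumptions for illustration, not the settings used in the talk.

```python
import math
import random


def mixture_pdf(x):
    """Target p(x): equal mixture of N(2, 1) and N(-2, 1)."""
    def normal_pdf(x, mu, sigma):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return 0.5 * normal_pdf(x, 2.0, 1.0) + 0.5 * normal_pdf(x, -2.0, 1.0)


def metropolis_hastings(target, n_samples, theta0=0.0, step=2.5, seed=0):
    rng = random.Random(seed)
    theta = theta0                                  # step 1: p(theta0) > 0
    chain = []
    for _ in range(n_samples):
        candidate = theta + rng.gauss(0.0, step)    # step 2: theta* ~ q(theta_t, .)
        # Step 3: the Gaussian proposal is symmetric, so the q terms cancel.
        alpha = min(target(candidate) / target(theta), 1.0)
        if alpha >= rng.random():                   # step 4: accept if alpha >= u
            theta = candidate
        chain.append(theta)                         # step 5: advance t
    return chain


chain = metropolis_hastings(mixture_pdf, 50_000)
mean = sum(chain) / len(chain)
variance = sum((x - mean) ** 2 for x in chain) / len(chain)
print(mean, variance)  # the mixture itself has mean 0 and variance 5
```

A much smaller `step` would leave the chain stuck in one mode for long stretches, which is the mixing problem raised in the slides that follow.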
When to Stop the Chain

As the MCMC runs, convergence may be reached

When does a chain converge? When to stop the chain ... ?

Are the samples correlated?
A Long Single Chain or Multiple Short Chains?

When will a Markov chain converge in practice? If it has converged, what does it mean?

Is a very long chain really good enough (from a statistical point of view)?

How long is long enough?

Are multiple chains better?

How to improve the sampling efficiency and/or mixing properties?
Simulated Tempering

Simulated annealing: temperature T from high to low.

Simulated tempering: raise T to a higher value, then reduce it to a low value.

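$$\pi_\tau = \pi(x)^{1/\tau}, \qquad \pi_\tau \to 1 \ \text{ as } \ \tau \to \infty.$$

The basic idea is to reduce from a very high $\tau$ to $\tau_0 = 1$.

[Figure: tempering flattens a distribution $\pi \ge 0$ into $\pi_\tau = \pi(x)^{1/\tau}$.]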

Use flattened (near uniform) distributions as proposals/candidates to produce high quality samplings.
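
The flattening effect can be illustrated numerically. The sketch below reuses the two-component mixture with means ±2 and unit variances as an assumed example target, and compares the density near a mode with the density in the valley between the modes for several values of τ. The function names and the chosen values of τ are assumptions for illustration.

```python
import math


def mixture_pdf(x):
    """pi(x): equal mixture of N(2, 1) and N(-2, 1), used here as the example target."""
    norm = 1.0 / math.sqrt(2.0 * math.pi)
    return 0.5 * norm * (math.exp(-0.5 * (x - 2.0) ** 2) + math.exp(-0.5 * (x + 2.0) ** 2))


def tempered(x, tau):
    """Unnormalised tempered density pi_tau(x) = pi(x)^(1/tau)."""
    return mixture_pdf(x) ** (1.0 / tau)


def peak_to_valley(tau):
    """Ratio of the density near a mode (x = 2) to the density in the valley (x = 0)."""
    return tempered(2.0, tau) / tempered(0.0, tau)


ratios = {tau: peak_to_valley(tau) for tau in (1.0, 2.0, 10.0, 100.0)}
print(ratios)
```

As τ grows the ratio approaches 1, so the barrier between the modes disappears and a chain can cross it easily.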
Sampling: Forward or Backward? Which Way?

Is this the only way?

No! – Coupling from the Past & Metaheuristics

If we go backward along the chain, any advantages? If so, how?

Is there a universally efficient sampling tool for drawing samples in general?

No! – No-free-lunch theorem (Wolpert & Macready, 1997)

The aim of the research is to find the best algorithm(s) for a given/specific problem/distribution.

Also metaheuristics (very promising).

References

Gamerman D., Markov Chain Monte Carlo, Chapman & Hall/CRC, (1997).
Corcoran J. and Tweedie R., Perfect sampling ... Jour. Stat. Plan. Infer., 104, 297 (2002).
Cox M., Forbes A. B., Harris P. M., Smith I., Classification and solution of regression ..., NPL SSfM Report, (2004).
Propp J. & Wilson D., Exact sampling ..., Random Stru. Alg., 9, 223 (1996).
Yang X. S., Nature-Inspired Metaheuristic Algorithms, Luniver Press, (2008).
Yang X. S., Introduction to Computational Mathematics, World Scientific, (2008).
Yang X. S., Engineering Optimization: An Introduction with Metaheuristic Applications, Wiley, (2010).

Acknowledgement:
EPSRC, SSfM, NPL, CUED, and London Maths Society.

Thank you!