Markov Chain Monte Carlo
Theory and worked examples




                             Dario Digiuni,
                             Academic Year 2007/2008
Markov Chain Monte Carlo
• Class of sampling algorithms

• High sampling efficiency

• Sample from a distribution with unknown normalization constant

• Often the only way to solve problems in time polynomial in the
  number of dimensions
       e.g. evaluating the volume of a convex body
MCMC: applications
   • Statistical Mechanics
     ▫ Metropolis-Hastings



   • Optimization
     ▫ Simulated annealing




   • Bayesian Inference
     ▫ Metropolis-Hastings
     ▫ Gibbs sampling
The Monte Carlo principle
• Sample a set of N independent and identically-distributed variables
  from the target p.d.f.

• Approximation of the target p.d.f. with the empirical distribution of
  the samples (in formulas below)

       … then approximation of the integrals!
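In the usual notation (the slide's formulas were images; this is the standard textbook form): given i.i.d. samples $x^{(i)} \sim p(x)$,

$$\hat p_N(x) \;=\; \frac{1}{N} \sum_{i=1}^{N} \delta_{x^{(i)}}(x), \qquad \int f(x)\,p(x)\,dx \;\approx\; \frac{1}{N} \sum_{i=1}^{N} f\big(x^{(i)}\big).$$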
Rejection Sampling
  1. It needs finding the bound M, with M·q(x) ≥ p(x) everywhere!
  2. Low acceptance rate: on average only one proposal in M is kept
     (a sketch follows)
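For contrast, a minimal rejection-sampling sketch (my illustration, not from the slides): an unnormalized target on [0, 1] with a uniform proposal q and a valid bound M. Every iteration costs two uniform draws, and the fraction kept shrinks as M grows.

    #include <cmath>
    #include <cstdio>
    #include <random>

    // Unnormalized target on [0,1]: p(x) = exp(-10 (x - 0.5)^2), maximum 1 at x = 0.5.
    double p_tilde(double x) { return std::exp(-10.0 * (x - 0.5) * (x - 0.5)); }

    int main() {
        std::mt19937 gen(42);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        const double M = 1.0;              // bound: p_tilde(x) <= M * q(x), q = Uniform(0,1)
        int accepted = 0, trials = 0;
        while (accepted < 10000) {
            double x = u(gen);             // 1. draw from the proposal q
            ++trials;
            if (u(gen) * M <= p_tilde(x))  // 2. keep with probability p_tilde(x) / (M q(x))
                ++accepted;                //    accepted draws are exact samples from p
        }
        std::printf("acceptance rate = %.3f\n", double(accepted) / trials);
    }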
Idea
• I can use the previously sampled value to find the following one

• Exploration of the configuration space by means of Markov chains:

       def.: Markov process = a stochastic process in which the next state
             depends only on the current one,
             p(x^(i) | x^(i-1), …, x^(1)) = T(x^(i) | x^(i-1))

       def.: Markov chain = the sequence of states x^(1), x^(2), … generated
             by such a process
Invariant distribution
 • Stability conditions:

   1. Irreducibility = from every state there is a nonzero probability of
      visiting any other state
   2. Aperiodicity = the chain does not get trapped in deterministic cycles

 • Sufficient condition:
   1. Detailed balance principle (written out below)

 MCMC algorithms are aperiodic, irreducible Markov chains having
  the target pdf as the invariant distribution
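Written out (the slide's formula was an image; this is the standard statement), detailed balance with respect to the target reads

$$p(x)\,T(x' \mid x) \;=\; p(x')\,T(x \mid x'),$$

and summing both sides over x shows why it is sufficient for invariance:

$$\sum_x p(x)\,T(x' \mid x) \;=\; p(x') \sum_x T(x \mid x') \;=\; p(x').$$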
Example
• What is the probability of finding the lift at the ground floor in a
  three-floor building?

  ▫ 3-state Markov chain

  ▫ Lift = random walker

  ▫ Transition matrix T

  ▫ Looking for the invariant distribution
      … burn-in …
Example - 2
• I can apply the matrix T repeatedly to any initial state: since T does not
  change with time, this is a homogeneous Markov chain. After the burn-in,
  the chain settles on the invariant distribution: ~50% is the probability of
  finding the lift at the ground floor (a numerical sketch follows below).

• Google’s PageRank:

  ▫ Websites are the states, T is defined by the hyperlinks among
    them and the user is the random walker:

      The webpages are ranked following the invariant distribution!
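A minimal sketch of the burn-in: repeatedly apply a 3 x 3 transition matrix to an initial distribution over the floors. The matrix below is an illustrative assumption (the slide's matrix was an image); with this choice the ground-floor probability converges to about 0.54, in the spirit of the slide's ~50%.

    #include <cstdio>

    int main() {
        // Illustrative transition matrix (an assumption, not the slide's):
        // T[i][j] = probability that the lift moves from floor i to floor j.
        const double T[3][3] = {{0.6, 0.3, 0.1},
                                {0.5, 0.3, 0.2},
                                {0.4, 0.4, 0.2}};
        double pi[3] = {0.0, 0.0, 1.0};           // start: lift at the top floor

        for (int step = 0; step < 50; ++step) {   // burn-in iterations
            double next[3] = {0.0, 0.0, 0.0};
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j)
                    next[j] += pi[i] * T[i][j];   // pi <- pi T (row vector times matrix)
            for (int j = 0; j < 3; ++j) pi[j] = next[j];
        }
        std::printf("invariant distribution: %.3f %.3f %.3f\n", pi[0], pi[1], pi[2]);
    }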
Metropolis-Hastings
• Given the target distribution p(x), known up to normalization:

  1.   Choose an initial value x^(0)

  2.   Sample a candidate x' from a proposal distribution q(x' | x^(i))
       (together with the acceptance rule, q is equivalent to T)

  3.   Accept the new value with probability
       A(x^(i), x') = min(1, [p(x') q(x^(i) | x')] / [p(x^(i)) q(x' | x^(i))])

  4.   Return to 2

                  The ratio is independent of the normalization!
                  The two q factors are equal in the Metropolis algorithm
                  (symmetric proposal) and cancel.
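A minimal Metropolis sketch (symmetric Gaussian proposal, so the two q factors cancel), sampling a deliberately unnormalized bimodal target; my illustration, not code from the talk.

    #include <cmath>
    #include <cstdio>
    #include <random>

    // Unnormalized bimodal target: two Gaussian bumps at -2 and +2.
    double p_tilde(double x) {
        return std::exp(-0.5 * (x - 2.0) * (x - 2.0)) +
               std::exp(-0.5 * (x + 2.0) * (x + 2.0));
    }

    int main() {
        std::mt19937 gen(1);
        std::normal_distribution<double> step(0.0, 1.0);   // symmetric proposal q
        std::uniform_real_distribution<double> u(0.0, 1.0);

        double x = 0.0, sum = 0.0;
        const int N = 100000;
        for (int i = 0; i < N; ++i) {
            double xp = x + step(gen);            // 2. sample a candidate
            double a = p_tilde(xp) / p_tilde(x);  // 3. acceptance ratio (normalization cancels)
            if (u(gen) < a) x = xp;               //    accept; otherwise keep the old value
            sum += x;
        }
        std::printf("sample mean = %.3f (target mean is 0)\n", sum / N);
    }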
M.-H. – Pros and Cons
• Very general sampling method:

  ▫ I can sample from an unnormalized distribution

  ▫ It does not require an upper bound M for the function, unlike
    rejection sampling

• Good performance depends on the choice of the proposal distribution

  ▫ well-mixing condition: the chain must explore the whole support
    of the target in a reasonable time
M.-H. - Example
• In Statistical Mechanics it is important to evaluate the partition
  function, e.g. for the Ising model. That means summing over every
  possible spin state: in a 10 x 10 x 10 spin cube I would have to sum
  over 2^1000 possible states = UNFEASIBLE

 MCMC APPROACH (sketched in code below):

 1. Evaluate the system’s energy

 2. Pick a spin at random and flip it:

     1. If the energy decreases, this is the new spin configuration

     2. If the energy increases, this is the new spin configuration with
        probability exp(-ΔE / k_B T)
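A sketch of this update for a 10 x 10 x 10 Ising cube (illustrative parameters: coupling J = 1, periodic boundaries, units with k_B = 1):

    #include <cmath>
    #include <cstdio>
    #include <random>

    const int L = 10;
    int s[L][L][L];                        // spin field, values +1 / -1

    // Sum of the six nearest neighbours, with periodic boundaries.
    int neighbours(int x, int y, int z) {
        return s[(x + 1) % L][y][z] + s[(x + L - 1) % L][y][z]
             + s[x][(y + 1) % L][z] + s[x][(y + L - 1) % L][z]
             + s[x][y][(z + 1) % L] + s[x][y][(z + L - 1) % L];
    }

    int main() {
        std::mt19937 gen(7);
        std::uniform_int_distribution<int> site(0, L - 1);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        const double T = 4.0;              // temperature, in units of J / k_B

        for (int x = 0; x < L; ++x)        // start from the all-up configuration
            for (int y = 0; y < L; ++y)
                for (int z = 0; z < L; ++z) s[x][y][z] = 1;

        for (long n = 0; n < 1000000; ++n) {
            int x = site(gen), y = site(gen), z = site(gen);     // pick a spin at random
            double dE = 2.0 * s[x][y][z] * neighbours(x, y, z);  // energy change of the flip
            if (dE <= 0.0 || u(gen) < std::exp(-dE / T))         // Metropolis rule
                s[x][y][z] = -s[x][y][z];                        // flip accepted
        }

        long m = 0;
        for (int x = 0; x < L; ++x)
            for (int y = 0; y < L; ++y)
                for (int z = 0; z < L; ++z) m += s[x][y][z];
        std::printf("magnetization per spin = %.3f\n", double(m) / (L * L * L));
    }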
Simulated Annealing
• It allows one to find the global maximum of a generic pdf

  ▫ No comparison between the values of local maxima required
  ▫ Application to the maximum-likelihood method

• It is a non-homogeneous Markov chain whose invariant distribution
  keeps changing as follows: at step i the target is p_i(x) ∝ p(x)^(1/T_i),
  with the temperature T_i lowered towards 0
Simulated Annealing: example
  • Let us apply the algorithm to a simple, 1-dimensional case (sketch below)

  • The optimal cooling scheme is logarithmic, T_i ∝ 1 / ln i: faster
    schedules can freeze the walker in a local maximum
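A one-dimensional sketch under the logarithmic schedule (constants and the test function are my illustrations): the walker starts next to the smaller peak and still ends near the global one.

    #include <cmath>
    #include <cstdio>
    #include <random>

    // Function to maximize: two peaks, the global maximum is at x = 3.
    double f(double x) {
        return std::exp(-(x + 2.0) * (x + 2.0)) +
               2.0 * std::exp(-(x - 3.0) * (x - 3.0));
    }

    int main() {
        std::mt19937 gen(3);
        std::normal_distribution<double> step(0.0, 0.5);
        std::uniform_real_distribution<double> u(0.0, 1.0);

        double x = -2.0;                                 // start at the LOCAL maximum
        for (int i = 0; i < 200000; ++i) {
            double T = 1.0 / std::log(i + 2.0);          // logarithmic cooling scheme
            double xp = x + step(gen);
            double a = std::pow(f(xp) / f(x), 1.0 / T);  // Metropolis step on f^(1/T)
            if (u(gen) < a) x = xp;
        }
        std::printf("found maximum near x = %.3f (true global maximum at 3)\n", x);
    }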
Simulated Annealing: Pros and Cons
• The global maximum is univocally determined
  ▫ Even if the walker starts next to a local (non-global!) maximum, it
    converges to the true global maximum

• It requires a good tuning of the parameters (initial temperature,
  cooling rate, proposal width)
Gibbs Sampler
• Optimal method to marginalize multidimensional distributions

• Let us assume we have an n-dimensional vector x = (x_1, …, x_n) and
  that we know all the full conditional distributions of the pdf,
  p(x_j | x_1, …, x_(j-1), x_(j+1), …, x_n)

• We take the full conditionals themselves as the proposal distribution:
Gibbs Sampler - 2
• Then the M.-H. acceptance probability is identically 1 (the calculation
  is written out below): every proposed move is accepted

                    → very efficient method!
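The one-line calculation behind this (standard, reconstructed here): proposing a new j-th coordinate from its full conditional, $q(x' \mid x) = p(x'_j \mid x_{-j})$ with $x'_{-j} = x_{-j}$, and factorizing $p(x) = p(x_j \mid x_{-j})\,p(x_{-j})$, the M.-H. acceptance probability becomes

$$A(x, x') = \min\!\left(1,\; \frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)}\right) = \min\!\left(1,\; \frac{p(x'_j \mid x_{-j})\,p(x_{-j})\,p(x_j \mid x_{-j})}{p(x_j \mid x_{-j})\,p(x_{-j})\,p(x'_j \mid x_{-j})}\right) = 1.$$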
Gibbs Sampler – practically
1.       Initialize x^(0) = (x_1^(0), …, x_n^(0)); at each step, fix n-1
         coordinates and sample the remaining one from the resulting
         conditional pdf

2.       for (i=0 ; i < N; i++)

     •     Sample x_1^(i+1) from p(x_1 | x_2^(i), …, x_n^(i))

     •     Sample x_2^(i+1) from p(x_2 | x_1^(i+1), x_3^(i), …, x_n^(i))

     •     …

     •     Sample x_n^(i+1) from p(x_n | x_1^(i+1), …, x_(n-1)^(i+1))
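A minimal concrete sketch (my example, not from the slides): Gibbs sampling of a bivariate Gaussian with correlation rho, where both full conditionals are Gaussians known in closed form.

    #include <cmath>
    #include <cstdio>
    #include <random>

    int main() {
        std::mt19937 gen(11);
        std::normal_distribution<double> gauss(0.0, 1.0);

        const double rho = 0.8;                        // correlation of the target
        const double sd = std::sqrt(1.0 - rho * rho);  // conditional standard deviation
        double x1 = 0.0, x2 = 0.0, sum12 = 0.0;
        const int N = 100000;

        for (int i = 0; i < N; ++i) {
            x1 = rho * x2 + sd * gauss(gen);  // sample x1 from p(x1 | x2) = N(rho x2, 1 - rho^2)
            x2 = rho * x1 + sd * gauss(gen);  // sample x2 from p(x2 | x1) = N(rho x1, 1 - rho^2)
            sum12 += x1 * x2;
        }
        std::printf("estimated correlation = %.3f (true value %.1f)\n", sum12 / N, rho);
    }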
Gibbs Sampler – example




• Let us pretend we cannot determine the normalization
  constant…




  … but we can make a comparison with the true marginalized
    pdf…
Gibbs Sampler – results
• Comparison between Gibbs sampling and M.-H. sampling from the
  true marginalized pdf

• Good χ² agreement
A complex MCMC application
 A radioactive source decays with rate λ1 and a detector records
   only every k1-th event; then at the moment tc the decay rate
 changes to λ2 and only one event out of k2 is recorded.

 λ1, k1, tc, λ2 and k2 are all unknown.

         We wish to find them.
Preparation
• The waiting time for the k-th event in a Poissonian process with
  rate λ is distributed according to the Erlang density

      p(t) = λ^k t^(k-1) e^(-λt) / (k-1)!

• I can sample a big amount of events from this pdf, changing the
  parameters from λ1 and k1 to λ2 and k2 at time tc (sketch below)

• I evaluate the likelihood of the recorded waiting times as the
  product of these densities, one factor per event:
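A sketch of the event generator (parameter values are illustrative, and the event straddling tc is handled naively): the waiting time for the k-th event is simulated as a sum of k exponential waiting times.

    #include <cstdio>
    #include <random>
    #include <vector>

    // Waiting time for the k-th Poisson event = sum of k Exp(lambda) waiting times.
    double erlang(std::mt19937& gen, double lambda, int k) {
        std::exponential_distribution<double> e(lambda);
        double t = 0.0;
        for (int i = 0; i < k; ++i) t += e(gen);
        return t;
    }

    int main() {
        std::mt19937 gen(5);
        const double lambda1 = 1.0, lambda2 = 2.0, tc = 300.0;  // "true" parameters
        const int k1 = 1, k2 = 1;

        std::vector<double> events;                    // recorded event times
        double t = 0.0;
        while (t < 600.0) {
            bool before = t < tc;                      // which regime are we in?
            t += erlang(gen, before ? lambda1 : lambda2, before ? k1 : k2);
            events.push_back(t);
        }
        std::printf("generated %zu recorded events\n", events.size());
    }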
Idea
• I take the likelihood as the invariant distribution (stored as a
  log-likelihood for numerical stability)!
  ▫ Which are the Markov chain states? Points in parameter space with
    their corresponding log-likelihood value:

        struct State {
            // Parameter space: the quantities to be inferred.
            double lambda1, lambda2;   // decay rates before and after tc
            double tc;                 // moment of the rate change
            int k1, k2;                // recording factors
            // Corresponding log-likelihood value.
            double plog;

            State(double la1, double la2, double t, int kk1, int kk2) :
                lambda1(la1), lambda2(la2), tc(t), k1(kk1), k2(kk2) {}

            State() {}
        };
Practically
• I have to find an appropriate proposal distribution to move among
  the states
  ▫ Attention: when varying λi and ki I have to prevent the acceptance
    rate from being too low… but also too high!

• The acceptance ratio a is evaluated as the ratio between the
  final-state and initial-state likelihood values.

• Make an initial guess for λi, ki and tc

• Let the chain evolve for a burn-in time and then record the results
  (one update step is sketched below).
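One possible update step on the State struct of the previous slide, working with log-likelihoods so that the ratio a becomes a difference. This is my sketch: the step sizes are illustrative, and logLikelihood() stands for the likelihood evaluation described in the Preparation slide (assumed, not shown in the talk).

    #include <algorithm>
    #include <cmath>
    #include <random>

    double logLikelihood(const State& s);   // assumed: evaluates the data log-likelihood

    State propose(const State& s, std::mt19937& gen) {
        std::normal_distribution<double> g(0.0, 0.1);
        std::uniform_int_distribution<int> dk(-1, 1);
        State n = s;
        n.lambda1 = std::abs(s.lambda1 + g(gen));  // small Gaussian moves for the rates
        n.lambda2 = std::abs(s.lambda2 + g(gen));
        n.tc      = s.tc + 100.0 * g(gen);         // wider move for the change point
        n.k1      = std::max(1, s.k1 + dk(gen));   // integer steps for k1, k2
        n.k2      = std::max(1, s.k2 + dk(gen));
        return n;
    }

    void mhStep(State& s, std::mt19937& gen) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        State n = propose(s, gen);
        n.plog = logLikelihood(n);
        if (std::log(u(gen)) < n.plog - s.plog)    // a = L(new) / L(old), compared in logs
            s = n;                                 // accept; otherwise keep the old state
    }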
Results
  • Even if the initial guess is quite far from the real values, the
    random walker converges.

    guess:          λ1 = 5    λ2 = 5    k1 = 3    k2 = 2

    real:           λ1 = 1    λ2 = 2    k1 = 1    k2 = 1
Results - 2
  • Estimate of the uncertainty

    [figure: scatter of the sampled values in the (λ1, λ2) plane]
Results - 3
    • All the parameters can be determined quickly
      guess:        tc = 150          real:    tc = 300
References
• C. Andrieu, N. de Freitas, A. Doucet and M.I. Jordan, Machine Learning 50
  (2003), 5-43.

• G. Casella and E.I. George, The American Statistician 46, 3 (1992), 167-174.

• W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P. Flannery, Numerical
  Recipes, Third Edition, Cambridge University Press, 2007.

• M. Loreti, Teoria degli errori e fondamenti di statistica, Decibel-Zanichelli
  (1998).

• B. Walsh, Markov Chain Monte Carlo and Gibbs Sampling, Lecture Notes
  for EEB 581.
